# Natural Language Processing in Python

Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

**NLP Basics**
- [1. Tokenization](#1.-Tokenization)
    * [1.1. Part-of-Speech Tagging (POS)](#1.1.-Part-of-Speech-Tagging-(POS))
    * [1.2. Dependencies](#1.2.-Dependencies)
    * [1.3. Named Entities](#1.3.-Named-Entities)
    * [1.4. Noun Chunks](#1.4.-Noun-Chunks)
    * [1.5. Vocabulary and Matching](#1.5.-Vocabulary-and-Matching)
    * [1.6. Additional Token Attributes](#1.6.-Additional-Token-Attributes)
        + [1.6.1. Stemming/Lemmatization](#1.6.1.-Stemming/Lemmatization)
        + [1.6.2. Stop Words](#1.6.2.-Stop-Words)
        + [1.6.3. Spans](#1.6.3.-Spans)
        + [1.6.4. Sentences](#1.6.4.-Sentences)
    * [1.7. Visualization](#1.7.-Visualization)
        + [1.7.1. Visualizing POS](#1.7.1.-Visualizing-POS)
        + [1.7.2. Visualizing NER](#1.7.2.-Visualizing-NER)
- [2. Text Classification](#2.-Text-Classification)
    * [2.1. Bag of Words](#2.1.-Bag-of-Words)
    * [2.2. Term Frequency - Inverse Document Frequency (TF-IDF)](#2.2.-Term-Frequency---Inverse-Document-Frequency-(TF-IDF))
- [3. Semantics and Sentiment Analysis](#3.-Semantics-and-Sentiment-Analysis)
    * [3.1. Semantics and Word Vectors](#3.1.-Semantics-and-Word-Vectors)
        + [3.1.1 Vector Norms](#3.1.1-Vector-Norms)
        + [3.1.2 Vector Arithmetic](#3.1.2-Vector-Arithmetic)
    * [3.2. Sentiment Analysis](#3.2.-Sentiment-Analysis)
- [4. Topic Modeling](#4.-Topic-Modeling)
    * [4.1. Latent Dirichlet Allocation (LDA)](#4.1.-Latent-Dirichlet-Allocation-(LDA))
    * [4.2. Non-negative Matrix Factorization](#4.2.-Non-negative-Matrix-Factorization)

In [1]:
# Importing of libraries

import spacy

nlp = spacy.load('en_core_web_sm')
# Other available models include: 'en_core_web_md' and 'en_core_web_lg'

# 1. Tokenization

- The process of breaking up the original text into component pieces (tokens)

In [2]:
doc = nlp(u"Tesla isn't looking into startups anymore.")

for token in doc:
    print(f'{token.text:{15}} {token.pos_:{15}} {token.dep_:{15}}')

Tesla           PROPN           nsubj          
is              AUX             aux            
n't             PART            neg            
looking         VERB            ROOT           
into            ADP             prep           
startups        NOUN            pobj           
anymore         ADV             advmod         
.               PUNCT           punct          


## 1.1. Part-of-Speech Tagging (POS)

- POS tagging, or grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context
    * For a full list of POS Tags visit: https://spacy.io/api/annotation#pos-tagging
- To view the **coarse** POS tag use `token.pos_`
- To view the **fine-grained** tag use `token.tag_`
- To view the description of either type of tag use `spacy.explain(tag)`

<div class="alert alert-success">Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent that lives in **doc.vocab**.</div>

In [3]:
for token in doc:
    print(f'{token.text:{15}} {token.pos_:{15}} {token.tag_:{15}} {spacy.explain(token.tag_)}')

Tesla           PROPN           NNP             noun, proper singular
is              AUX             VBZ             verb, 3rd person singular present
n't             PART            RB              adverb
looking         VERB            VBG             verb, gerund or present participle
into            ADP             IN              conjunction, subordinating or preposition
startups        NOUN            NNS             noun, plural
anymore         ADV             RB              adverb
.               PUNCT           .               punctuation mark, sentence closer


In [4]:
# Counting POS Tags

POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{96: 1, 87: 1, 94: 1, 100: 1, 85: 1, 92: 1, 86: 1, 97: 1}

## 1.2. Dependencies

- A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads
    * For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing

<div class="alert alert-success">Note that `token.dep` returns an integer hash value; by adding the underscore we get the text equivalent that lives in **doc.vocab**.</div>

In [5]:
for token in doc:
    print(f'{token.text:{15}} {token.dep_:{15}}')

Tesla           nsubj          
is              aux            
n't             neg            
looking         ROOT           
into            prep           
startups        pobj           
anymore         advmod         
.               punct          


In [6]:
# Count the different dependencies:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')

400. advmod: 1
405. aux : 1
425. neg : 1
429. nsubj: 1
439. pobj: 1
443. prep: 1
445. punct: 1
8206900633647566924. ROOT: 1


## 1.3. Named Entities

- Named entity recognition (NER) is the task of identifying and categorizing key information (entities) in text
    * An entity can be any word or series of words that consistently refers to the same thing

In [7]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [8]:
doc1 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc1)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [9]:
# Entity annotations

for ent in doc1.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Washington, DC 4 7 12 26 GPE
next May 7 9 27 35 DATE
the Washington Monument 11 14 43 66 ORG


## 1.4. Noun Chunks

- `Doc.noun_chunks` are *base noun phrases*: token spans that include the noun and words describing the noun.
- Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.
- Where `Doc.ents` rely on the **ner** pipeline component, `Doc.noun_chunks` are provided by the **parser**.
    * For more on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks

In [10]:
doc2 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc2.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)

Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward


## 1.5. Vocabulary and Matching

### 1.5.1. Rule-based Matching

spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [11]:
from spacy.matcher import Matcher

# Setting up the matcher
matcher = Matcher(nlp.vocab)

# Example document - linking 'Solar Power', 'Solar-power' and 'solarpower' to be SolarPower
doc3 = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

# Generating the patterns to be matched
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

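# Adding the patterns to the matcher (spaCy v2 signature; in spaCy v3 the patterns
# are passed as one list: matcher.add('SolarPower', [pattern1, pattern2, pattern3]))
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)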

# Applying the matcher and showing the matches
found_matches = matcher(doc3)

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id] # Look up the string name of the match ID
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


### 1.5.2. PhraseMatcher

An alternative - and often more efficient - method is to match on terminology lists. With `PhraseMatcher`, each phrase in the list is converted to a Doc object, and those Doc objects are added to the matcher as patterns instead of token-attribute dictionaries.

- Methodology is similar to Rule-based Matching
    * However, each phrase in the list must first be converted to a Doc object via the function `nlp()`
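
A minimal sketch of the phrase-matching workflow described above. It makes several assumptions that are not in the original notes: it uses `spacy.blank("en")` (a tokenizer-only pipeline) so that no model download is needed, the phrase list and sample sentence are hypothetical, matching is done on the lowercase form via `attr="LOWER"`, and `matcher.add(key, patterns)` is called with the list-style signature of spaCy v3.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline (tokenizer only) is enough for phrase matching
nlp = spacy.blank("en")

# Match on the lowercase form of each token so that 'Solar Power' also matches
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical terminology list
phrase_list = ["solar power", "solarpower"]

# Convert each phrase to a Doc object; make_doc() only runs the tokenizer
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]

# spaCy v3 signature: the patterns are passed as a single list
matcher.add("SolarPower", phrase_patterns)

doc = nlp.make_doc(
    "The Solar Power industry continues to grow as demand for solarpower increases."
)

# Each match is a (match_id, start, end) tuple, as with the rule-based Matcher
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```

Because the patterns are already tokenized Doc objects, this approach tends to scale better than writing one token pattern per term when the terminology list is long.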

## 1.6. Additional Token Attributes

### 1.6.1. Stemming/Lemmatization

- **Stemming:**
    * Crude method for cataloging related words; essentially chops off letters from the end until the stem is reached
- **Lemmatization:**
    * Looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words
        + Lemmatization is therefore usually more accurate than stemming (e.g. it maps "ran" to "run", which suffix stripping cannot do)
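
To make the contrast concrete, here is a toy suffix-stripping stemmer. It is only an illustration of the "chop letters off the end" idea; the function name `crude_stem` and its suffix list are hypothetical, and this is not the Porter or Snowball algorithm.

```python
def crude_stem(word, suffixes=("ing", "ly", "ed", "er", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["running", "runner", "runs", "ran", "easily"]:
    print(f"{word:{10}} {crude_stem(word)}")
```

The stems for "running" and "runner" come out as "runn", and "ran" is left untouched, whereas the lemmatizer output in the next cell maps "running", "run" and "ran" to the same lemma.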

In [12]:
# Lemmatization

doc4 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc4:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.lemma:{20}} {token.lemma_:{15}}')

I          PRON         561228191312463089 -PRON-         
am         AUX        10382539506755952630 be             
a          DET        11901859001352538922 a              
runner     NOUN       12640964157389618806 runner         
running    VERB       12767647472892411841 run            
in         ADP         3002984154512732771 in             
a          DET        11901859001352538922 a              
race       NOUN        8048469955494714898 race           
because    SCONJ      16950148841647037698 because        
I          PRON         561228191312463089 -PRON-         
love       VERB        3702023516439754181 love           
to         PART        3791531372978436496 to             
run        VERB       12767647472892411841 run            
since      SCONJ      10066841407251338481 since          
I          PRON         561228191312463089 -PRON-         
ran        VERB       12767647472892411841 run            
today      NOUN       11042482332948150395 today        

### 1.6.2. Stop Words

- Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these **stop words**, and they can be filtered from the text to be processed. 
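
As a rough sketch of what "filtered from the text" means, the snippet below removes stop words from a token list in plain Python. The stop word set and the sentence are hypothetical; spaCy's own list is much longer.

```python
# Hypothetical, deliberately tiny stop word list
stop_words = {"a", "the", "is", "of", "and"}

tokens = "the history of a city is the history of its people".split()

# Keep only the tokens that are not stop words
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)
```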

In [13]:
# Checking if a word is a stop word

print(nlp.vocab['myself'].is_stop)

print(nlp.vocab['mystery'].is_stop)

True
False


In [14]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

len(nlp.Defaults.stop_words)

327

In [15]:
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('btw')

# Remove the stop_word tag from the lexeme
nlp.vocab['btw'].is_stop = False

len(nlp.Defaults.stop_words)

326

### 1.6.3. Spans

- A **span** is a slice of the Doc object in the form `Doc[start:stop]`

In [16]:
doc5 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

life_quote = doc5[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


### 1.6.4. Sentences

- The `Doc.sents` attribute segments the document into sentences

In [17]:
doc6 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc6.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


## 1.7. Visualization

- spaCy includes a built-in visualizer called **displaCy**, which renders the dependency links between tokens as well as named entities
- Outside Jupyter (for example in another Python IDE or in a script), you can have spaCy serve the HTML separately.
    * Instead of `displacy.render()`, use `displacy.serve()`

In [18]:
# Import the displaCy library
from spacy import displacy

### 1.7.1. Visualizing POS

In [19]:
# Creating a sample document
doc7 = nlp(u"The quick brown fox jumped over the lazy dog's back.")

displacy.render(doc7, style='dep', jupyter=True, options={'distance': 110})

In [20]:
for token in doc7:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')

The        DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        PROPN   nsubj   nominal subject
jumped     VERB    ROOT    None
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    poss    possession modifier
's         PART    case    case marking
back       NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation


### 1.7.2. Visualizing NER

In [21]:
doc8 = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc8, style='ent', jupyter=True)

In [22]:
# To customize colors, effects and view specific entities
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc8, style='ent', jupyter=True, options=options)

# 2. Text Classification

## 2.1. Bag of Words

- Bag-of-words is a simplifying representation used in natural language processing and information retrieval (IR)
- A text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity
    * To make the analysis more robust, we could remove the `Stop Words` and use `Word Stems`
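
A minimal sketch of the bag-of-words idea using only the standard library. The two documents are hypothetical and tokenization is a plain whitespace split, which is an assumption made for brevity.

```python
from collections import Counter

# Hypothetical mini-corpus
docs = ["the cat sat on the mat", "the dog chased the cat"]

# One bag per document: word -> number of occurrences (order is discarded)
bags = [Counter(doc.lower().split()) for doc in docs]

# Shared vocabulary across the corpus, sorted for a stable column order
vocabulary = sorted(set().union(*bags))

# Each document becomes a count vector over the vocabulary
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

scikit-learn's `CountVectorizer`, used later in these notes, builds the same kind of document-term count matrix, with its own tokenization rules.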

## 2.2. Term Frequency - Inverse Document Frequency (TF-IDF)

- A numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining and user modeling.
    * Term Frequency:
        + Calculated by dividing the number of occurrences of a word by the total number of words in the document
    * Inverse Document Frequency:
        + A measure of how much information the word provides (whether it is common or rare across all documents)
        + Total number of documents divided by the number of documents that contain the word (e.g. if 4 out of 5 documents contain the word, the ratio is $\frac{5}{4}$); in practice the logarithm of this ratio is usually taken
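
The two quantities above can be worked through by hand on a tiny corpus. The sketch below uses hypothetical documents and assumes the common log-scaled form `idf = log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed variant plus normalization, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, tokenized by whitespace
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf(word, tokens):
    """Occurrences of the word divided by the total number of words in the document."""
    return tokens.count(word) / len(tokens)


def idf(word):
    """Log of (total documents / documents containing the word)."""
    return math.log(n_docs / doc_freq[word])


def tfidf(word, tokens):
    return tf(word, tokens) * idf(word)


# Score every distinct word of the first document
scores = {word: tfidf(word, tokenized[0]) for word in set(tokenized[0])}
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:{6}} {score:.3f}")
```

Although "the" occurs twice in the first document, it also appears in two of the three documents, so it scores lower than "cat" and "mat", which occur once but only in that document.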

In [23]:
# Importing of libraries

import pandas as pd

df = pd.read_csv('Sample_datasets/smsspamcollection.tsv', sep='\t')

In [24]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)

predictions = text_clf.predict(X_test)

In [26]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print(accuracy_score(y_test,predictions))
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

0.989668297988037
[[1586    7]
 [  12  234]]
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



# 3. Semantics and Sentiment Analysis

## 3.1. Semantics and Word Vectors

- Word Vectors:
    * Mathematical description of individual words such that words that appear in similar contexts in the language will have similar values
        + In this way we can mathematically derive **context**
        + In spaCy's `en_core_web_md` and `en_core_web_lg` models, word vectors are stored as 300-item arrays
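
"Similar values" is usually measured with cosine similarity, which is also spaCy's default for `.similarity()`. The sketch below computes it in plain Python on hypothetical 3-dimensional vectors; real spaCy vectors have 300 dimensions and the numbers here are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from any model
cat = [0.9, 0.8, 0.1]
pet = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, pet))  # close to 1: the vectors point the same way
print(cosine_similarity(cat, car))  # much lower: the vectors point in different directions
```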

In [27]:
nlp = spacy.load('en_core_web_md')

In [28]:
# Create a three-token Doc object:
tokens = nlp(u'lion cat pet')

# Comparing similarity of tokens
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.5265437
lion pet 0.39923772
cat lion 0.5265437
cat cat 1.0
cat pet 0.7505456
pet lion 0.39923772
pet cat 0.7505456
pet pet 1.0


### 3.1.1 Vector Norms

- It's sometimes helpful to aggregate 300 dimensions into a [Euclidean (L2) norm](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm), computed as the square root of the sum of the squared vector components.
- This is accessible as the `.vector_norm` token attribute. Other helpful attributes include `.has_vector` and `.is_oov` or *out of vocabulary*.
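
As a small worked example of the L2 norm, here it is computed by hand in plain Python. The function name `l2_norm` and the sample vectors are hypothetical.

```python
import math


def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vector))


print(l2_norm([3.0, 4.0]))       # 3-4-5 triangle
print(l2_norm([0.0, 0.0, 0.0]))  # an all-zero vector has norm 0
```

An all-zero vector has a norm of 0, which matches the `0.0` reported for the out-of-vocabulary token in the next cell.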

In [29]:
tokens1 = nlp(u'dog cat nargle')

for token in tokens1:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True


### 3.1.2 Vector Arithmetic

- We can actually calculate new vectors by adding & subtracting related vectors. 
- A famous example suggests
<pre>"king" - "man" + "woman" = "queen"</pre>

In [30]:
from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Now we find the closest vectors in the vocabulary to the result of "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []

for word in nlp.vocab:
    # Ignore words without vectors, mixed-case words and non-alphabetic words:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])

['king', 'woman', 'she', 'lion', 'who', 'when', 'dare', 'cat', 'was', 'not']


## 3.2. Sentiment Analysis

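- The goal is to find commonalities between documents, with the understanding that similarly *combined* vectors should correspond to similar sentiments.
- The example below uses NLTK's VADER analyzer, a lexicon- and rule-based model; its `compound` score ranges from -1 (most negative) to +1 (most positive).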

In [31]:
import nltk
# nltk.download('vader_lexicon')

df1 = pd.read_csv('Sample_datasets/amazonreviews.tsv', sep='\t')
df1.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [32]:
# Cleaning dataset

df1.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df1.itertuples():  # each row is (index, label, review)
    if isinstance(rv, str) and rv.isspace():  # review contains only whitespace
        blanks.append(i)     # add matching index numbers to the list

df1.drop(blanks, inplace=True)

In [33]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

df1['scores'] = df1['review'].apply(lambda review: sid.polarity_scores(review))
df1['compound']  = df1['scores'].apply(lambda score_dict: score_dict['compound'])
df1['comp_score'] = df1['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
df1.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos


In [34]:
print(accuracy_score(df1['label'],df1['comp_score']))
print(classification_report(df1['label'],df1['comp_score']))
print(confusion_matrix(df1['label'],df1['comp_score']))

0.7091
              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000

[[2623 2474]
 [ 435 4468]]


# 4. Topic Modeling

- Allows efficient analysis of large volumes of text by clustering documents into topics
    * Useful when the data is unlabeled, since supervised learning techniques cannot be applied without labels

## 4.1. Latent Dirichlet Allocation (LDA)

- Assumptions:
    * Documents with similar topics use similar groups of words
        + Documents are probability distributions over latent topics
    * Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus
        + Topics are probability distributions over words
- Pipeline:
    * Decide on the number of topics present in the corpus
    * Use LDA to assign each document to a topic
    * Find the most common words associated with each topic
    * The user interprets these topics

In [35]:
npr = pd.read_csv('Sample_datasets/npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [36]:
# Preprocessing of data

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
# max_df=0.95 discards words that appear in more than 95% of the documents
# min_df=2 keeps a word only if it appears in at least 2 documents

dtm = cv.fit_transform(npr['Article'])

In [37]:
# Apply the LDA model

from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [38]:
# Showing the top words per topic

feature_names = cv.get_feature_names()  # on scikit-learn >= 1.0 use cv.get_feature_names_out()

for index,topic in enumerate(LDA.components_):
    # Each row of LDA.components_ holds the word weights for one topic
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]]) # argsort() returns indices ordered from lowest to highest weight
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

In [39]:
# Attaching discovered topic labels to original articles

topic_results = LDA.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)
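
The two indexing tricks used above (`argsort()[-15:]` for the top words and `argmax(axis=1)` for the assigned topic) can be sketched in plain Python. The vocabulary and both matrices below are hypothetical stand-ins for `LDA.components_` and `LDA.transform(dtm)`.

```python
# Hypothetical vocabulary and topic-word weights (rows = topics)
vocabulary = ["health", "care", "trump", "vote", "music"]
topic_word = [
    [5.0, 4.0, 0.5, 0.2, 0.1],  # topic 0
    [0.3, 0.2, 6.0, 5.5, 0.1],  # topic 1
]


def top_words(weights, n):
    """Return the n highest-weighted words, lowest first (like argsort()[-n:])."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    return [vocabulary[i] for i in order[-n:]]


# Hypothetical document-topic matrix (rows = documents)
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

# For each document, pick the index of the largest topic weight (like argmax(axis=1))
assigned = [max(range(len(row)), key=lambda t: row[t]) for row in doc_topic]

for index, weights in enumerate(topic_word):
    print(index, top_words(weights, 2))
print(assigned)
```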

## 4.2. Non-negative Matrix Factorization

- An unsupervised algorithm that simultaneously performs dimensionality reduction and clustering
    * Can be used in conjunction with TF-IDF to model topics across documents
- Pipeline:
    * Generate a non-negative data matrix **(A)** by TF-IDF
    * Decide on the number of basis vectors **(k)** (the number of topics)
    * Factorize matrix A into two non-negative matrices **W** and **H** such that A ≈ WH
        + Iterate between two multiplicative update rules until convergence
- Process:
    * Construct a vector space model for the documents (after stop word filtering), resulting in a term-document matrix A
    * Apply TF-IDF term weight normalization to A
    * Normalize the TF-IDF vectors to unit length
    * Initialize the factors using NNDSVD on A
    * Apply Projected Gradient NMF to A
        + The resulting coefficient matrix holds the weights of each document relative to each topic
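
The multiplicative update rules mentioned in the pipeline can be sketched in plain Python. This is a minimal illustration of the Lee and Seung updates on a tiny hypothetical matrix with random initialization, under the assumption that rows of A are documents and columns are terms. It is not what scikit-learn's `NMF` runs internally (the output below shows `solver='cd'`, i.e. coordinate descent), and the helper names are hypothetical.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Frobenius norm of A - WH, i.e. the reconstruction error."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize A (n x m) into W (n x k) and H (k x m), all entries non-negative."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    # Assumption: strictly positive random initialization instead of NNDSVD
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Hypothetical non-negative matrix: 4 documents x 4 terms, two obvious groups
A = [
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
]

W0, H0 = nmf_multiplicative(A, k=2, n_iter=0)   # the random starting point
W, H = nmf_multiplicative(A, k=2, n_iter=200)   # after the updates

print("error before:", frobenius_error(A, W0, H0))
print("error after: ", frobenius_error(A, W, H))
```

In this layout H plays the role of `nmf_model.components_` (topics by terms) and W corresponds to the output of `nmf_model.transform(dtm)` (documents by topics).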

In [40]:
npr = pd.read_csv('Sample_datasets/npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = tfidf.fit_transform(npr['Article']) # Document term matrix

In [42]:
from sklearn.decomposition import NMF

nmf_model = NMF(n_components=7,random_state=42)
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [43]:
feature_names = tfidf.get_feature_names()  # on scikit-learn >= 1.0 use tfidf.get_feature_names_out()

for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'al

In [44]:
topic_results = nmf_model.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
