# Python NLP Library Implementation

### Content:
  *  1 NLTK
  *  2 Word Feature Vector Space
  *  3 TF-IDF representation
  *  4 Latent Dirichlet Allocation (LDA)

In [62]:
sentence1 = "At eight on Thursday morning Arthur felt very good good, but not perfect."
sentence2 = "Thursday night is good!"
sentence3 = "Although Arthur is is feeling good at eight feel"
article = "At eight on Thursday morning Arthur felt very good good, but not prefect. Thursday night is good! Although Arthur is is feeling good at eight feel."

## 1. NLTK

In [2]:
import nltk

### 1.1 Sentence tokenizer

In [3]:
print (type(nltk.sent_tokenize(article)))
print (nltk.sent_tokenize(article))

<class 'list'>
['At eight on Thursday morning Arthur felt very good good, but not prefect.', 'Thursday night is good!', 'Although Arthur is is feeling good at eight feel.']


### 1.2 Normalization

Unify all words in lower case.

In [6]:
sentence1 = sentence1.lower()
print (sentence1)

at eight on thursday morning arthur felt very good good, but not perfect.


### 1.3 Word tokenizer

In [14]:
words = nltk.word_tokenize(sentence1)
print (sentence1)
print (type(words))
print (words)

at eight on thursday morning arthur felt very good good, but not perfect.
<class 'list'>
['at', 'eight', 'on', 'thursday', 'morning', 'arthur', 'felt', 'very', 'good', 'good', ',', 'but', 'not', 'perfect', '.']


In NLTK, the word tokenizer can also be implemented using a regular expression:

In [15]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
words2 = tokenizer.tokenize(sentence1)
print (words2)

['at', 'eight', 'on', 'thursday', 'morning', 'arthur', 'felt', 'very', 'good', 'good', 'but', 'not', 'perfect']


Compared with `nltk.word_tokenize`, `RegexpTokenizer` with the pattern `\w+` has the advantage that it drops punctuation, because it only matches runs of word characters.
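The effect of the `\w+` pattern can be tried without NLTK. The following is a minimal sketch that assumes only the standard-library `re` module; the helper name `regex_tokenize` is hypothetical and is meant to mirror the tokenizer above on this sentence, not to replace it:

```python
import re

def regex_tokenize(text):
    """Return every maximal run of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)

sentence = "at eight on thursday morning arthur felt very good good, but not perfect."
tokens = regex_tokenize(sentence)
print(tokens)

# Punctuation inside a word also acts as a separator.
print(regex_tokenize("don't"))
```

One consequence of this pattern is that contractions are split at the apostrophe, which may or may not be what a given application wants.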

### 1.4 Remove Punctuations

In [12]:
import string
punctuations = set(string.punctuation)

In [13]:
print (punctuations)

{')', '%', '|', '?', '-', '^', '>', '~', '*', '[', '#', '$', ']', '=', '.', "'", '{', '&', '_', ',', '/', '}', '`', '(', '\\', '+', ';', '"', '<', '!', '@', ':'}


In [16]:
print ([x for x in words if x not in punctuations])

['at', 'eight', 'on', 'thursday', 'morning', 'arthur', 'felt', 'very', 'good', 'good', 'but', 'not', 'perfect']


### 1.5 Remove Stopwords

In [17]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [18]:
print (stop_words)

{'himself', 'isn', 't', 've', 'he', 'with', 'will', 'yourselves', 'about', 'll', 'o', 'did', 'because', 'it', 'of', 'they', 'what', 'those', 'be', 'hasn', 'while', 'and', 'down', 'these', 'which', 'myself', 'at', 'were', 'does', 'where', 'haven', 'shouldn', 'weren', 'has', 'an', 'on', 'doing', 'as', 'was', 'only', 'won', 'through', 'themselves', 'but', 'before', 'if', 'here', 'y', 'its', 'them', 'out', 'to', 'now', 'most', 'hadn', 're', 'under', 'how', 'yourself', 'not', 'between', 'by', 'mustn', 'own', 'such', 'our', 'there', 'off', 'some', 'until', 'then', 'that', 'needn', 'in', 'itself', 's', 'i', 'after', 'over', 'is', 'me', 'don', 'just', 'whom', 'during', 'having', 'aren', 'very', 'should', 'you', 'have', 'am', 'this', 'his', 'wouldn', 'can', 'no', 'again', 'all', 'up', 'ain', 'ma', 'too', 'didn', 'any', 'my', 'ourselves', 'from', 'so', 'm', 'few', 'a', 'above', 'into', 'when', 'couldn', 'yours', 'who', 'further', 'once', 'the', 'why', 'theirs', 'we', 'do', 'against', 'mightn', '

In [19]:
print ([x for x in words if x not in stop_words])

['eight', 'thursday', 'morning', 'arthur', 'felt', 'good', 'good', ',', 'perfect', '.']


### 1.6 Stemming

Stemmers strip suffixes so that inflected forms of a word (verb tenses, plural nouns) collapse onto a common stem. For bag-of-words features, tense and number usually add little information.

#### 1.6.1 Porter stemmer

In [20]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [21]:
print ([ps.stem(x) for x in words])

['at', 'eight', 'on', 'thursday', 'morn', 'arthur', 'felt', 'veri', 'good', 'good', ',', 'but', 'not', 'perfect', '.']


Note that the Porter stemmer is **unable** to map 'felt' to the stem 'feel': it strips suffixes by rule and does not handle irregular forms, so it doesn't work perfectly.

In [22]:
ps.stem('running'), ps.stem('ran'), ps.stem('feeling'), ps.stem('felt')

('run', 'ran', 'feel', 'felt')

#### 1.6.2 Snowball stemmer

In [23]:
from nltk.stem.snowball import SnowballStemmer

In [24]:
ss = SnowballStemmer("english")

In [25]:
ss.stem('running'), ss.stem('ran'), ss.stem('feeling'), ss.stem('felt')

('run', 'ran', 'feel', 'felt')
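Both stemmers leave 'ran' and 'felt' untouched because they strip suffixes by rule rather than looking words up in a dictionary. A deliberately simplified sketch shows why irregular forms slip through. The function `toy_stem` and its suffix list are hypothetical and far cruder than the Porter or Snowball rules:

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy stand-in, NOT the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["feeling", "felt", "ran", "nights"]:
    print(w, "->", toy_stem(w))
```

Since 'felt' and 'ran' carry no recognizable suffix, no rule fires. Handling such irregular forms generally requires a dictionary-based lemmatizer rather than a stemmer.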

### 1.7 All Together

In [26]:
print (set([ps.stem(x.lower()) for x in words if x not in punctuations if x not in stop_words]))

{'eight', 'good', 'arthur', 'thursday', 'felt', 'perfect', 'morn'}


Now the word space is smaller than the one obtained by tokenizing the sentence directly: 7 distinct stems instead of 15 tokens.

## 2. Word Feature Vector Space

In [27]:
print (article)

At eight on Thursday morning Arthur felt very good good, but not prefect. Thursday night is good! Although Arthur is is feeling good at eight feel.


###  2.1 Feature space

In [28]:
all_corpus_words = []
for sentence in nltk.sent_tokenize(article.lower()):
    print (sentence)
    words = [ps.stem(x) for x in nltk.word_tokenize(sentence) 
                 if x not in punctuations if x not in stop_words]
    all_corpus_words += words
    print (words)
    
all_corpus_words = set(all_corpus_words)

at eight on thursday morning arthur felt very good good, but not prefect.
['eight', 'thursday', 'morn', 'arthur', 'felt', 'good', 'good', 'prefect']
thursday night is good!
['thursday', 'night', 'good']
although arthur is is feeling good at eight feel.
['although', 'arthur', 'feel', 'good', 'eight', 'feel']


### 2.2 Bag-of-words model

In [29]:
print (all_corpus_words)

{'prefect', 'night', 'eight', 'feel', 'good', 'arthur', 'thursday', 'felt', 'although', 'morn'}


There are 10 words in the set **`all_corpus_words`** built from the **article**, so our vector space is 10-dimensional.

In [32]:
def get_features(review):
    features = {}
    # Lowercase first so capitalized stop words are caught, then stem so the
    # tokens match the stemmed vocabulary in all_corpus_words (e.g. 'morn').
    review_words = set([ps.stem(x.lower()) for x in nltk.word_tokenize(str(review))
                     if x.lower() not in stop_words if x not in punctuations])
    for word in all_corpus_words:
        features[word] = (word in review_words)
    return features

### 2.3 Basic representation (0/1) for feature vectors

In [33]:
sent = 1
for sentence in nltk.sent_tokenize(article):
    print (sent, get_features(sentence))
    sent += 1

1 {'prefect': True, 'night': False, 'feel': False, 'felt': True, 'thursday': True, 'eight': True, 'arthur': True, 'although': False, 'good': True, 'morn': True}
2 {'prefect': False, 'night': True, 'feel': False, 'felt': False, 'thursday': True, 'eight': False, 'arthur': False, 'although': False, 'good': True, 'morn': False}
3 {'prefect': False, 'night': False, 'feel': True, 'felt': False, 'thursday': False, 'eight': True, 'arthur': True, 'although': True, 'good': True, 'morn': False}
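To use such feature dictionaries as vectors, the words need a fixed column order. The following is a minimal sketch, assuming alphabetical order as the convention; the helper `to_binary_vector` is hypothetical, and the feature dictionary is copied from the output for sentence 2 above:

```python
def to_binary_vector(features, vocabulary):
    """Map a {word: bool} feature dict to a 0/1 list in a fixed column order."""
    return [int(features[word]) for word in vocabulary]

# Sorting fixes the column order, since a set carries no canonical order.
vocabulary = sorted(['prefect', 'night', 'eight', 'feel', 'good', 'arthur',
                     'thursday', 'felt', 'although', 'morn'])

# Features of sentence 2 ("Thursday night is good!").
features_2 = {'prefect': False, 'night': True, 'feel': False, 'felt': False,
              'thursday': True, 'eight': False, 'arthur': False,
              'although': False, 'good': True, 'morn': False}

vector_2 = to_binary_vector(features_2, vocabulary)
print(vocabulary)
print(vector_2)
```

Each sentence then becomes one row of a 0/1 matrix with ten columns, one per word in the vocabulary.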


## 3. Sklearn: TF-IDF representation

Here we use sklearn to compute the TF-IDF weight of each word feature, rather than simply recording whether the word is present or absent. In sklearn (with the default `smooth_idf=True`), the tf-idf(w,d) of word **w** in document **d** is 

### $$\textrm{tf-idf(w,d)} = \textrm{tf(w,d)}\times (1+ \textrm{idf(w)})= \textrm{tf(w,d)}\times \Big(1+\log{\Big( \frac{n_d+1}{\textrm{df(w)}+1} \Big) }\Big)$$

**tf(w,d)** is the term frequency of word **w** in document **d**. **df(w)** is the document frequency, the number of documents in which the word **w** appears. **$n_d$** is the total number of documents in the corpus. The logarithm is the natural logarithm.

### 3.1 TF representation

TF means `term-frequency`: the number of times a word appears in a given document.

In [34]:
import numpy as np
doc = np.array([sentence1, sentence2, sentence3])

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(lowercase=False)
bag = count.fit_transform(doc)
print (count.vocabulary_)

{'eight': 6, 'at': 4, 'arthur': 3, 'thursday': 17, 'very': 18, 'felt': 9, 'feel': 7, 'on': 15, 'Thursday': 2, 'night': 13, 'not': 14, 'feeling': 8, 'good': 10, 'morning': 12, 'Arthur': 1, 'but': 5, 'perfect': 16, 'is': 11, 'Although': 0}


In [37]:
print(bag)

  (0, 16)	1
  (0, 14)	1
  (0, 5)	1
  (0, 10)	2
  (0, 18)	1
  (0, 9)	1
  (0, 3)	1
  (0, 12)	1
  (0, 17)	1
  (0, 15)	1
  (0, 6)	1
  (0, 4)	1
  (1, 11)	1
  (1, 13)	1
  (1, 2)	1
  (1, 10)	1
  (2, 7)	1
  (2, 8)	1
  (2, 1)	1
  (2, 0)	1
  (2, 11)	2
  (2, 10)	1
  (2, 6)	1
  (2, 4)	1


`bag` is a sparse matrix: an entry `(i,j)` gives the TF of the `j-th` vocabulary word in the `i-th` sentence. For example, `(0,10)=2` means that the word **good** appeared twice in **sentence1**, and `(1,10)=1` shows it appeared once in **sentence2**.
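The sparse printout can be read as a list of `(row, column) -> count` entries, with every missing entry equal to zero. As a minimal sketch, the entries of sentence2 from the output above can be expanded into a full row by hand (the helper `dense_row` is hypothetical):

```python
# (row, column) -> count entries for sentence2, copied from the printed output.
entries = {(1, 11): 1, (1, 13): 1, (1, 2): 1, (1, 10): 1}
n_columns = 19  # number of entries in count.vocabulary_ above

def dense_row(entries, row, n_columns):
    """Expand the sparse entries of one row into a full list of counts."""
    return [entries.get((row, col), 0) for col in range(n_columns)]

row = dense_row(entries, 1, n_columns)
print(row)
```

In practice `bag.toarray()` performs this expansion for the whole matrix at once.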

We see that `CountVectorizer` only tokenizes. It does not remove `stopwords` such as **very**, **not**, **but**, **on**, **at**, **is**; with `lowercase=False` it distinguishes upper and lower case (**thursday**/**Thursday**, **Arthur**/**arthur**); and it keeps the verb forms **feeling**/**feel** apart. Because of these repeats, the vocabulary has 19 entries. We can turn on `lowercase`:

In [38]:
# By default, CountVectorizer uses lowercase=True
count = CountVectorizer()
bag = count.fit_transform(doc)
print (count.vocabulary_)

{'eight': 4, 'at': 2, 'arthur': 1, 'thursday': 15, 'very': 16, 'felt': 7, 'feel': 5, 'on': 13, 'night': 11, 'not': 12, 'feeling': 6, 'good': 8, 'morning': 10, 'but': 3, 'perfect': 14, 'is': 9, 'although': 0}


Now the distinction between **thursday**/**Thursday** and **Arthur**/**arthur** is eliminated, which reduces the dimensionality of the vector space by two (from 19 to 17).

### 3.2 Feature index

`CountVectorizer` can also remove stop words through its `stop_words` parameter:

In [49]:
count = CountVectorizer(stop_words='english')
bag = count.fit_transform(doc)
print (count.vocabulary_)

{'night': 6, 'feel': 1, 'feeling': 2, 'arthur': 0, 'good': 4, 'felt': 3, 'morning': 5, 'thursday': 8, 'perfect': 7}


Now we can compare this vocabulary with the `all_corpus_words` from NLTK:

In [40]:
print (all_corpus_words)

{'prefect', 'night', 'eight', 'feel', 'good', 'arthur', 'thursday', 'felt', 'although', 'morn'}


Note that sklearn's `CountVectorizer` with `stop_words='english'` also removed the words **although** and **eight**, which are in sklearn's stop list but not in NLTK's. We did not apply a stemmer here, so **feeling** is still present. `CountVectorizer` now maps the corpus to a 9-dimensional vector space. In contrast, `all_corpus_words` was built with a stemmer, so it contains no **feeling**.

In [41]:
print (doc)

[ 'at eight on thursday morning arthur felt very good good, but not perfect.'
 'Thursday night is good!'
 'Although Arthur is is feeling good at eight feel']


In [42]:
print (bag.toarray())

[[1 0 0 1 2 1 0 1 1]
 [0 0 0 0 1 0 1 0 1]
 [1 1 1 0 1 0 0 0 0]]


In **sentence2**, we have 'Thursday night is good!'. In the vector space, we have 'thursday': 8, 'night': 6, 'good': 4, so `[0 0 0 0 1 0 1 0 1]`.

### 3.3 n-gram representation (n>1)

The above method is called unigram, i.e. **each** word is a token. In practice, some words belong **together** and carry meaning as a unit, like 'bus stop'. In the following, we use `ngram_range=(1,2)`, which keeps the unigrams and adds bigrams (two-word sequences):

In [43]:
twograms = CountVectorizer(ngram_range=(1,2), stop_words ='english')

In [44]:
bag = twograms.fit_transform(doc)
print (twograms.vocabulary_)

{'arthur felt': 2, 'feel': 3, 'thursday night': 19, 'good perfect': 11, 'feeling good': 5, 'perfect': 16, 'arthur feeling': 1, 'good good': 10, 'felt good': 7, 'morning': 12, 'night good': 15, 'thursday morning': 18, 'night': 14, 'good feel': 9, 'feeling': 4, 'good': 8, 'arthur': 0, 'thursday': 17, 'felt': 6, 'morning arthur': 13}


In [45]:
print (bag.toarray())

[[1 0 1 0 0 0 1 1 2 0 1 1 1 1 0 0 1 1 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1]
 [1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0]]
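How the bigram features arise can be sketched in a few lines of plain Python. This is an illustration under the assumption that tokens are already lowercased and stop words removed; the helper `ngrams` is hypothetical and is not sklearn's internal implementation:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# sentence2 after lowercasing and stop-word removal
tokens = ["thursday", "night", "good"]

# ngram_range=(1,2) corresponds to unigrams plus bigrams.
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

These five features match the five non-zero columns of the second row above. The vocabulary also contains 'good perfect', which suggests that the bigrams are formed after stop words are dropped, so words that were not adjacent in the original sentence can end up paired.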


### 3.4 TF-IDF representation

Next we consider IDF, the inverse document frequency. TF is computed on each **individual** document (here, each sentence), but we should also take into account how often the word appears across the whole corpus: a word that occurs in many documents is less informative. There are two ways: (1) `CountVectorizer()` + `TfidfTransformer()` and (2) `TfidfVectorizer()`.

#### 3.4.1 TfidfTransformer( )

In [50]:
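from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
# Recompute the unigram counts of section 3.2; `bag` last held the bigram counts of section 3.3.
bag = count.transform(doc)
tfidf = TfidfTransformer()  # idf weighting + L2 normalization; stop words were already removed by `count`
np.set_printoptions(precision=2)
print(tfidf.fit_transform(bag).toarray())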

[[ 0.32  0.    0.    0.42  0.5   0.42  0.    0.42  0.32]
 [ 0.    0.    0.    0.    0.43  0.    0.72  0.    0.55]
 [ 0.44  0.58  0.58  0.    0.35  0.    0.    0.    0.  ]]


#### 3.4.2 Using TfidfVectorizer( ) = CountVectorizer( ) + TfidfTransformer( )

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,lowercase=True,preprocessor=None, stop_words ='english')
print(tfidf.fit_transform(doc).toarray())

[[ 0.32  0.    0.    0.42  0.5   0.42  0.    0.42  0.32]
 [ 0.    0.    0.    0.    0.43  0.    0.72  0.    0.55]
 [ 0.44  0.58  0.58  0.    0.35  0.    0.    0.    0.  ]]


In [51]:
print (count.vocabulary_)
print (bag.toarray())

{'night': 6, 'feel': 1, 'feeling': 2, 'arthur': 0, 'good': 4, 'felt': 3, 'morning': 5, 'thursday': 8, 'perfect': 7}
[[1 0 0 1 2 1 0 1 1]
 [0 0 0 0 1 0 1 0 1]
 [1 1 1 0 1 0 0 0 0]]


Note that the same word, say **good**, has a different TF-IDF value in each document: 0.5, 0.43 and 0.35 in **sentence1**, **sentence2** and **sentence3**. This is because `TfidfVectorizer` and `TfidfTransformer` apply L2 normalization to each document vector as the last step.

As an example, let us go over the whole procedure:
* (1) For **sentence1, sentence2, sentence3** the count vectors are `[1,0,0,1,2,1,0,1,1]`, `[0,0,0,0,1,0,1,0,1]` and `[1,1,1,0,1,0,0,0,0]`. So **tf('good',d)=2, 1, 1** for **d=1,2,3** respectively.
* (2) **'good'** appears in all three documents, so **df('good')=3** and **idf('good')=log((3+1)/(3+1))=0**.
* (3) Let's focus on **sentence2**:
   *    For "good", **tfidf('good',2)=tf('good',2)*(idf('good')+1)=1*1=1**.
   *    For "thursday" and "night", **df('thursday')=2** and **df('night')=1**, so that **idf('thursday')=log((3+1)/(2+1))=0.288** and **idf('night')=log((3+1)/(1+1))=0.693**.
   *    Then **tfidf('thursday',2)=1*1.288=1.288** and **tfidf('night',2)=1*1.693=1.693**. Now we have `[0,0,0,0,1,0,1.693,0,1.288]`.
* (4) L2-normalize the vector. The norm is sqrt(1^2+1.693^2+1.288^2)=2.35, so `[0,0,0,0,1,0,1.693,0,1.288]` eventually becomes `[0,0,0,0,1/2.35,0,1.693/2.35,0,1.288/2.35]=[0,0,0,0,0.425,0,0.720,0,0.548]`.
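The steps above can be checked for all three sentences at once with NumPy. This is a sketch of the smoothed formula given earlier, applied to the count matrix copied from the output above; it is a hand computation for verification, not sklearn's own code:

```python
import numpy as np

# Term counts for the three sentences (rows) over the nine-word vocabulary.
counts = np.array([[1, 0, 0, 1, 2, 1, 0, 1, 1],
                   [0, 0, 0, 0, 1, 0, 1, 0, 1],
                   [1, 1, 1, 0, 1, 0, 0, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)             # document frequency of each word
idf = np.log((n_docs + 1) / (df + 1))     # smoothed idf, natural logarithm
weighted = counts * (1 + idf)             # tf * (1 + idf)
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
tfidf_manual = weighted / norms           # L2-normalize each row

print(np.round(tfidf_manual, 2))
```

The rounded rows agree with the matrices printed by `TfidfTransformer` and `TfidfVectorizer` above.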

In [52]:
doc

array([ 'at eight on thursday morning arthur felt very good good, but not perfect.',
       'Thursday night is good!',
       'Although Arthur is is feeling good at eight feel'], 
      dtype='<U73')

In [53]:
index = [count.vocabulary_['thursday'], count.vocabulary_['good'], count.vocabulary_['perfect'],count.vocabulary_['night']]
print (index)
print (bag.toarray()[0][index])

[8, 4, 7, 6]
[1 2 1 0]


#### 3.4.3 Stemmed `TfidfVectorizer( )`

Next we add stemming on top of stop-word removal in `TfidfVectorizer()` by passing a custom tokenizer:

In [54]:
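class LemmaTokenizer(object):
    """Tokenize with NLTK, drop punctuation and stem each token.

    Despite its name, this class applies the Porter stemmer, not a lemmatizer.
    """
    def __init__(self):
        self.wnl = PorterStemmer()
    def __call__(self, doc):
        return [self.wnl.stem(t) for t in nltk.word_tokenize(doc) if t not in punctuations]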

In [71]:
doc

array([ 'at eight on thursday morning arthur felt very good good, but not perfect.',
       'Thursday night is good!',
       'Although Arthur is is feeling good at eight feel'], 
      dtype='<U73')

In [56]:
stemmedCount = CountVectorizer(stop_words ='english', tokenizer=LemmaTokenizer())
stemmedBag = stemmedCount.fit_transform(doc)
stemmedTfidf = TfidfTransformer()
print (stemmedTfidf.fit_transform(stemmedBag).toarray())

[[ 0.3   0.    0.39  0.46  0.39  0.    0.39  0.3   0.39]
 [ 0.    0.    0.    0.43  0.    0.72  0.    0.55  0.  ]
 [ 0.34  0.9   0.    0.27  0.    0.    0.    0.    0.  ]]


In [57]:
print (stemmedCount.vocabulary_)
print (stemmedBag.toarray())

{'night': 5, 'feel': 1, 'veri': 8, 'good': 3, 'felt': 2, 'arthur': 0, 'thursday': 7, 'perfect': 6, 'morn': 4}
[[1 0 1 2 1 0 1 1 1]
 [0 0 0 1 0 1 0 1 0]
 [1 2 0 1 0 0 0 0 0]]


In [58]:
stemmedTfidf = TfidfVectorizer(strip_accents=None,lowercase=True,preprocessor=None, stop_words ='english', tokenizer=LemmaTokenizer())

In [59]:
print(stemmedTfidf.fit_transform(doc).toarray())

[[ 0.3   0.    0.39  0.46  0.39  0.    0.39  0.3   0.39]
 [ 0.    0.    0.    0.43  0.    0.72  0.    0.55  0.  ]
 [ 0.34  0.9   0.    0.27  0.    0.    0.    0.    0.  ]]


In [60]:
print (count.vocabulary_)
print (bag.toarray())

{'night': 6, 'feel': 1, 'feeling': 2, 'arthur': 0, 'good': 4, 'felt': 3, 'morning': 5, 'thursday': 8, 'perfect': 7}
[[1 0 0 1 2 1 0 1 1]
 [0 0 0 0 1 0 1 0 1]
 [1 1 1 0 1 0 0 0 0]]


### 3.5 The entire corpus in terms of the tfidf representation

`tfidf.fit_transform(input)` expects a list or `np.array` of documents (here, sentences), not one long string for the whole article. So we first use `nltk.sent_tokenize( )` to split the article into sentences:

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,lowercase=True,preprocessor=None, stop_words ='english')
# sent_tokenize already returns a list of sentences, which fit_transform accepts directly.
doc_tfidf = tfidf.fit_transform(nltk.sent_tokenize(article))

In [82]:
print(doc_tfidf.toarray())

[[ 0.32  0.    0.    0.42  0.5   0.42  0.    0.42  0.32]
 [ 0.    0.    0.    0.    0.43  0.    0.72  0.    0.55]
 [ 0.44  0.58  0.58  0.    0.35  0.    0.    0.    0.  ]]


In [80]:
print (doc_tfidf.toarray()[0][index])

[ 0.32  0.5   0.42  0.  ]


## 4. Latent Dirichlet Allocation (LDA) in Sklearn

Next we try to fit an [LDA model](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py) using sklearn. Latent Dirichlet allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set. Let us use another set of example texts:

In [14]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

In [16]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
punctuations = set(string.punctuation)
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
# Lowercase before the stop-word check so that capitalized words such as 'My' are filtered too.
texts = [ps.stem(x.lower()) for x in nltk.word_tokenize(doc_a) if x not in punctuations if x.lower() not in stop_words]

In [17]:
texts, type(texts)

(['brocolli',
  'good',
  'eat',
  'brother',
  'like',
  'eat',
  'good',
  'brocolli',
  'mother'],
 list)

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000,stop_words='english')
features_tf = tf_vectorizer.fit_transform(doc_set)
print (features_tf)

  (0, 5)	1
  (0, 1)	1
  (0, 3)	2
  (0, 0)	2
  (1, 2)	1
  (1, 5)	1
  (1, 1)	1
  (2, 6)	1
  (2, 4)	1
  (2, 2)	1
  (3, 6)	1
  (3, 5)	1
  (3, 1)	1
  (4, 4)	2
  (4, 3)	1
  (4, 0)	1


In [24]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
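The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact, so here is a small sketch of what it does. The weights below are made-up numbers for a single hypothetical topic, not values from a fitted model:

```python
import numpy as np

# Hypothetical word weights for one topic.
topic = np.array([0.2, 1.5, 0.7, 3.1, 0.1])
n_top_words = 3

ascending = topic.argsort()              # indices from smallest to largest weight
top = ascending[:-n_top_words - 1:-1]    # walk backwards, keep n_top_words indices

print(ascending)
print(top)

# Asking for more words than exist simply returns all indices, largest first.
print(topic.argsort()[:-20 - 1:-1])
```

The last line is relevant below: the vectorizer keeps only seven features, so with `n_top_words = 20` every topic lists all seven words in descending order of weight.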

In [26]:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 3
n_top_words = 20
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,  # `n_topics` was renamed to `n_components`
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(features_tf)
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  # replaces get_feature_names() in recent scikit-learn
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0:
driving pressure mother brother health brocolli good
Topic #1:
health good brocolli pressure brother driving mother
Topic #2:
brocolli good brother mother driving pressure health



In [27]:
n_topics = 2
n_top_words = 20
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,  # `n_topics` was renamed to `n_components`
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(features_tf)
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  # replaces get_feature_names() in recent scikit-learn
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0:
mother driving brother pressure brocolli health good
Topic #1:
good brocolli health brother pressure driving mother



In [30]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
n_features = 1000
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(doc_set)
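# Recent scikit-learn split `alpha` into `alpha_W`/`alpha_H`; reusing the value .1 for alpha_W is an assumption,
# and the regularization is scaled differently, so the topics may not match the output shown below.
nmf = NMF(n_components=n_topics, random_state=1, alpha_W=.1, l1_ratio=.5).fit(tfidf)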
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # replaces get_feature_names() in recent scikit-learn
print_top_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model:
Topic #0:
mother brother pressure driving health good brocolli
Topic #1:
health good brocolli pressure mother driving brother

