#### Topic Modeling
- Examples of usage
  - Organizing text documents
  - Recommending articles to users once a topic model provides a representation of each article (e.g. NYT)
  - Similarity queries
- Advantages over word-based vector representations
  - Topics and their word representations are selected algorithmically, not through user-defined keywords

#### LDA (Latent Dirichlet Allocation) 
- Theoretical underpinnings (TBD)
- Parameters of LDA
- How to improve LDA
  - Frequency filter
  - POS tagging
  - Batch-wise LDA
- How to evaluate quality of topic models   
  - Coherence
- Comparison with LSA   
- Feature selection using Topic Modeling   
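
As a placeholder for the theory, here is a minimal sketch of the generative story LDA assumes: each topic is a distribution over words, each document draws its own topic mixture, and every word is produced by first picking a topic and then a word from that topic. The vocabulary, the hyperparameter values (`alpha`, `eta`) and the helper `generate_document` are illustrative assumptions, not part of this notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes, chosen only for illustration
vocab = ["sugar", "health", "driving", "stress", "father", "sister"]
num_topics, doc_len = 2, 8

# Assumed symmetric Dirichlet priors
alpha = np.full(num_topics, 0.5)   # prior over a document's topic mixture
eta = np.full(len(vocab), 0.5)     # prior over a topic's word distribution

# One word distribution per topic: beta[k] ~ Dirichlet(eta)
beta = rng.dirichlet(eta, size=num_topics)

def generate_document():
    """Sample one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                             # topic mixture of this document
    topics = rng.choice(num_topics, size=doc_len, p=theta)   # one topic per word slot
    words = [vocab[rng.choice(len(vocab), p=beta[k])] for k in topics]
    return theta, words

theta, words = generate_document()
print(theta.round(2), words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.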

In [1]:
import numpy as np
import pandas as pd
import gensim
import nltk
import sklearn
#import spacy



In [2]:
#from sklearn.decomposition import TruncatedSVD
#from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

#### 1. Download Data

In [3]:
#fetch_20newsgroups?

In [36]:
#raw_data = fetch_20newsgroups(subset='train', shuffle=True,
#                              random_state=42, remove=('headers', 'footers', 'quotes'))
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."
documents = [doc1, doc2, doc3, doc4, doc5]

In [37]:
#documents = raw_data.data
#target = raw_data.target_names

In [38]:
#target

In [40]:
documents[2]

'Doctors suggest that driving may cause increased stress and blood pressure.'

#### 2. Pre-process data and create a BOW dictionary  
- Replace digits (there are a lot of them) and punctuation with spaces
- Remove short words and single letters
- Lemmatize (nltk)
- Tokenize on whitespace and remove stopwords (nltk)
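
The steps above can be sketched as a single standard-library function. This is only an illustration: `STOP_WORDS` is a hypothetical mini stop list standing in for NLTK's English list, and the `lemmatize` argument defaults to the identity function so the sketch runs without WordNet data.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's English stopwords instead
STOP_WORDS = {"have", "that", "this", "with", "from", "your"}

def preprocess(doc, lemmatize=lambda word: word, min_len=4):
    """Return the cleaned tokens of one document."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", doc)        # digits and punctuation become spaces
    tokens = [w.lower() for w in letters_only.split() if len(w) >= min_len]  # drop short words
    tokens = [lemmatize(w) for w in tokens]               # plug in a real lemmatizer here
    return [w for w in tokens if w not in STOP_WORDS]     # remove stopwords

tokens = preprocess("Sugar is bad to consume. My sister likes to have sugar, but not my father.")
print(tokens)
```

Passing `lemmatize=WordNetLemmatizer().lemmatize` would bring the sketch closer to the cells that follow.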

In [41]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
#import string

In [42]:
# The second positional argument is download_dir, so pass both package names as a list
nltk.download(['stopwords', 'wordnet'])

[nltk_data] Downloading package stopwords to wordnet...
[nltk_data]   Package stopwords is already up-to-date!


True

In [43]:
lemma = WordNetLemmatizer()
#punc = string.punctuation
stop_words = stopwords.words('english')

In [44]:
len(documents)

5

In [45]:
docs_series = pd.Series(documents)

In [46]:
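# Keep letters only; regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
documents_1 = docs_series.str.replace('[^a-zA-Z]', ' ', regex=True)
# Lowercase and drop words of three characters or fewer
documents_2 = documents_1.apply(lambda doc: ' '.join([wrd.lower() for wrd in doc.split() if len(wrd) > 3]))
# Lemmatize (WordNetLemmatizer treats every word as a noun unless a POS tag is given)
documents_3 = documents_2.apply(lambda doc: ' '.join([lemma.lemmatize(word) for word in doc.split()]))
# Remove English stopwords
documents_4 = documents_3.apply(lambda doc: ' '.join([word for word in doc.split() if word not in stop_words]))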

In [47]:
print(documents[3],'\n',documents_1[3], '\n',documents_2[3],'\n',documents_3[3],'\n',documents_4[3])

Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better. 
 Sometimes I feel pressure to perform well at school  but my father never seems to drive my sister to do better  
 sometimes feel pressure perform well school father never seems drive sister better 
 sometimes feel pressure perform well school father never seems drive sister better 
 sometimes feel pressure perform well school father never seems drive sister better


In [15]:
#lemma.lemmatize('corpora')

#### 3. Create a vocabulary

In [48]:
from gensim import corpora

In [49]:
#type(docs_list)
#documents_4[0]

In [50]:
list_docs_tokenized = [doc.split() for doc in documents_4]

In [51]:
#docs_tokenize_flat[0]
#corpora.Dictionary?

In [52]:
dictionary = corpora.Dictionary(list_docs_tokenized)
dtm = [dictionary.doc2bow(doc) for doc in list_docs_tokenized]
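
To make the data structure concrete, here is a plain-Python sketch of what a dictionary and a bag-of-words conversion compute: a token-to-id mapping, and for each document a sorted list of `(token_id, count)` pairs. The helpers `build_vocabulary` and `to_bow` are hypothetical and only mimic the idea; gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)  # unknown tokens are ignored
    return sorted(counts.items())

docs = [["sugar", "consume", "sister", "sugar"], ["father", "sister"]]
token2id = build_vocabulary(docs)
bow = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bow)
```

Word order is discarded in this representation, which is why it is called a bag of words.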

In [53]:
#len(dtm[1])

In [54]:
#### Note: in scikit-learn vectorizers, min_df and max_df affect vocabulary formation; max_features filters after the vocabulary is formed

#### 4. Run an LDA model

In [65]:
LDA = gensim.models.ldamodel.LdaModel
lda_model = LDA(corpus=dtm, num_topics=3, id2word=dictionary,
                passes=50, random_state=20, per_word_topics=True)

In [66]:
#LDA?

In [67]:
lda_model.print_topics()

[(0,
  '0.147*"sugar" + 0.082*"father" + 0.082*"sister" + 0.081*"consume" + 0.081*"like" + 0.020*"health" + 0.020*"good" + 0.020*"expert" + 0.020*"lifestyle" + 0.020*"pressure"'),
 (1,
  '0.075*"driving" + 0.043*"blood" + 0.043*"increased" + 0.043*"cause" + 0.043*"stress" + 0.043*"suggest" + 0.043*"doctor" + 0.043*"around" + 0.043*"spends" + 0.043*"time"'),
 (2,
  '0.060*"sister" + 0.060*"father" + 0.060*"pressure" + 0.060*"school" + 0.060*"drive" + 0.060*"feel" + 0.060*"never" + 0.060*"perform" + 0.060*"better" + 0.060*"seems"')]

#### How each document is composed in terms of topics

In [68]:
for doc in lda_model.get_document_topics(dtm, per_word_topics=True):
    print(doc)

([(0, 0.90106696), (1, 0.04942177), (2, 0.049511217)], [(0, [0]), (1, [0]), (2, [0]), (3, [0]), (4, [0])], [(0, [(0, 0.9994688)]), (1, [(0, 0.98962265)]), (2, [(0, 0.9994688)]), (3, [(0, 0.9896227)]), (4, [(0, 1.9960601)])])
([(0, 0.0408059), (1, 0.91969544), (2, 0.039498642)], [(1, [1, 0]), (3, [1, 0]), (5, [1]), (6, [1]), (7, [1]), (8, [1]), (9, [1]), (10, [1])], [(1, [(0, 0.014845872), (1, 0.9754104)]), (3, [(0, 0.014845868), (1, 0.9754104)]), (5, [(1, 0.9987731)]), (6, [(1, 0.9987731)]), (7, [(1, 0.9994211)]), (8, [(1, 0.9987731)]), (9, [(1, 0.9987731)]), (10, [(1, 0.9987731)])])
([(0, 0.03750237), (1, 0.92416203), (2, 0.03833565)], [(7, [1]), (11, [1]), (12, [1]), (13, [1]), (14, [1]), (15, [1]), (16, [1]), (17, [1])], [(7, [(1, 0.9995278)]), (11, [(1, 0.9990006)]), (12, [(1, 0.9990006)]), (13, [(1, 0.9990006)]), (14, [(1, 0.9990006)]), (15, [(1, 0.9906253)]), (16, [(1, 0.9990006)]), (17, [(1, 0.9990006)])])
([(0, 0.026759326), (1, 0.02638973), (2, 0.94685096)], [(1, [2]), (3, [2]
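
The first element of each tuple above is the document's topic mixture, which is what a similarity query compares. A minimal sketch using cosine similarity on the four mixtures printed above (rounded to three decimals; the helper `cosine_similarity_matrix` is hypothetical):

```python
import numpy as np

# Topic mixtures of the first four documents, rounded from the output above
doc_topics = np.array([
    [0.901, 0.049, 0.050],
    [0.041, 0.920, 0.039],
    [0.038, 0.924, 0.038],
    [0.027, 0.026, 0.947],
])

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

sims = cosine_similarity_matrix(doc_topics)
print(sims.round(3))

# For each document, the index of its most similar other document
np.fill_diagonal(sims, -1.0)
nearest = sims.argmax(axis=1)
print(nearest)
```

The second and third documents are both dominated by topic 1, so they come out as each other's nearest neighbour.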

#### 5. Visualize topics

In [26]:
import pyLDAvis.gensim  # in pyLDAvis >= 3.3 this module is named pyLDAvis.gensim_models

In [27]:
pyLDAvis.enable_notebook()

In [28]:
vis = pyLDAvis.gensim.prepare(lda_model, dtm, dictionary)
vis

KeyboardInterrupt: 


#### 6. Coherence score

In [None]:
from gensim.models import CoherenceModel 

In [None]:
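cm1 = CoherenceModel(model=lda_model, texts=list_docs_tokenized, dictionary=dictionary, coherence='c_v')
cm1.get_coherence()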
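
To illustrate what a coherence score measures, here is a hand-rolled sketch of a UMass-style coherence: for each pair of top words in a topic, it takes the log of how often the two words share a document relative to how often the higher-ranked word appears at all. The function `umass_coherence` and the toy documents are assumptions for illustration. gensim's implementation differs in details such as aggregation, so the numbers are not comparable with `get_coherence()`.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score

# Hypothetical tokenized documents
docs = [
    ["sugar", "father", "sister"],
    ["father", "sister", "driving"],
    ["driving", "stress", "pressure"],
    ["pressure", "father", "sister"],
    ["sugar", "health"],
]

coherent = umass_coherence(["father", "sister"], docs)    # words that always co-occur
incoherent = umass_coherence(["father", "stress"], docs)  # words that never co-occur
print(coherent, incoherent)
```

Topics whose top words tend to appear in the same documents score higher, which is the intuition behind using coherence to compare models.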