<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Topic Modeling
## *Data Science Unit 4 Sprint 1 Assignment 4*

Analyze a corpus of Amazon reviews from Unit 4 Sprint 1 Module 1's lecture using topic modeling: 

- Fit a Gensim LDA topic model on Amazon Reviews
- Select an appropriate number of topics
- Create some dope visualizations of the topics
- Write a few bullets on your findings in markdown at the end
- **Note**: You don't *have* to use generators for this assignment

In [1]:
#Start Here

## Stretch Goals

* Incorporate Named Entity Recognition in your analysis
* Incorporate some custom pre-processing from our previous lessons (like spacy lemmatization)
* Analyze a dataset of interest to you with topic modeling

In [3]:
# How do we do a gridsearch
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer


In [4]:
categories = ['al']

data = fetch_20newsgroups()

In [5]:
# Method 1- Gridsearch on just a classifier
# Fit the vectorizer and prepare the data *before* it goes into the gridsearch
v1 = TfidfVectorizer()
X_train = v1.fit_transform(data['data'])

In [9]:
params1 = {'n_estimators':[10,20],
           'max_depth':[None,7]}

In [10]:
clf = RandomForestClassifier()
gs1 = GridSearchCV(clf, params1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, data['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:   16.9s remaining:    1.8s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   17.3s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [12]:
# Error: could not convert string to float: 'No drama llama was in Portland last week.'
# gs1 was fit on TF-IDF vectors, so raw text has to be vectorized (e.g. with v1) before predicting.
gs1.predict(["No drama llama was in Portland last week."])

ValueError: could not convert string to float: 'No drama llama was in Portland last week.'
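
The error arises because the classifier was fit on TF-IDF vectors rather than on strings, so new text has to pass through the same fitted vectorizer before prediction. A minimal sketch of that idea, using a tiny made-up corpus (`docs` and `labels` are hypothetical stand-ins, not the newsgroups data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in data (hypothetical), not the 20 newsgroups corpus.
docs = ["the llama ate grass", "the stock market fell",
        "llamas live in herds", "investors sold their stock"]
labels = [0, 1, 0, 1]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)  # learn the vocabulary from the training text

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels)

# Raw strings must be transformed with the *same* fitted vectorizer first.
new_X = vect.transform(["No drama llama was in Portland last week."])
pred = clf.predict(new_X)
print(pred.shape)
```

Calling `transform` (not `fit_transform`) on new text keeps the column layout identical to what the classifier saw during training.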

In [14]:
#Gridsearch with both vectorizer and classifier
from sklearn.pipeline import Pipeline

v2= TfidfVectorizer()
clf1 = RandomForestClassifier()
pipe = Pipeline([('vect', v2), ('clf', clf1)])
p2 = {'vect__max_features':[1000,5000],
    'clf__n_estimators': [10,20],
      'clf__max_depth':[None,7]
     }
gs2 = GridSearchCV(pipe, p2, cv=5, verbose=1, n_jobs=-1)
gs2.fit(data['data'], data['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:   35.2s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [15]:
gs2.predict(["No drama llama was in Portland last week."])

array([6])

Advantages of running the grid search over the pipeline:
* We can predict directly on raw text, which makes the workflow easier to reproduce.
* We can tune the vectorizer's parameters alongside the classifier's.
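
A compact sketch of both points, again on a made-up toy corpus. The data, `cv=2`, and the parameter values are assumptions chosen only so the example runs quickly:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical).
docs = ["the llama ate grass", "the stock market fell",
        "llamas live in herds", "investors sold their stock",
        "a llama spat at me", "the market rallied today"]
labels = [0, 1, 0, 1, 0, 1]

pipe = Pipeline([("vect", TfidfVectorizer()),
                 ("clf", RandomForestClassifier(random_state=0))])

# The "<step name>__<parameter>" keys reach into each pipeline step.
params = {"vect__max_features": [5, 20],
          "clf__n_estimators": [5, 10]}

gs = GridSearchCV(pipe, params, cv=2)
gs.fit(docs, labels)  # raw strings go in; the pipeline vectorizes them

print(gs.best_params_)
pred = gs.predict(["No drama llama was in Portland last week."])
```

Because the vectorizer is refit inside every cross-validation fold, its vocabulary is learned only from that fold's training split.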

In [16]:
%pwd

'C:\\Users\\Magic Rob\\DS-Unit-4-Sprint-1-NLP\\module4-topic-modeling'

In [23]:
df = pd.read_csv('./data/imbd_keywords.csv')
df.head()

Unnamed: 0,review,sentiment,keywords
0,One of the other reviewers has mentioned that ...,positive,"['other shows', 'graphic violence', 'prison ex..."
1,A wonderful little production. The filming tec...,positive,"['halliwell', 'michael sheen', 'realism', 'com..."
2,I thought this was a wonderful way to spend ti...,positive,"['spirited young woman', 'devil wears prada', ..."
3,Basically there's a family where a little boy ...,negative,"['playing parents', 'jake', 'parents', 'descen..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"['mr. mattei', 'good luck', 'mattei', 'human r..."


In [24]:
#Estimating LDA with python.
# The keywords are stored as strings, so we need to parse each one into a list.
from ast import literal_eval

df['keywords'] = df['keywords'].apply(literal_eval)
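
As a quick illustration of what `literal_eval` does to a single cell, here is a sketch on one hand-written string (the value of `raw` is an assumed example in the same shape as the CSV column):

```python
from ast import literal_eval

raw = "['other shows', 'graphic violence']"  # a list stored as text
parsed = literal_eval(raw)                   # safely evaluated into a real list

print(type(raw).__name__, type(parsed).__name__)
print(parsed[0])
```

Unlike `eval`, `literal_eval` only accepts Python literals, so it will not execute arbitrary code that happens to be in the file.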

In [25]:
df.head()

Unnamed: 0,review,sentiment,keywords
0,One of the other reviewers has mentioned that ...,positive,"[other shows, graphic violence, prison experie..."
1,A wonderful little production. The filming tec...,positive,"[halliwell, michael sheen, realism, comedy, wi..."
2,I thought this was a wonderful way to spend ti...,positive,"[spirited young woman, devil wears prada, summ..."
3,Basically there's a family where a little boy ...,negative,"[playing parents, jake, parents, descent dialo..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[mr. mattei, good luck, mattei, human relation..."


In [26]:
# gensim is good for topic modeling
import gensim
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

In [39]:
# A dictionary that maps every token to an integer id, similar to CountVectorizer's vocabulary.
# Input: a list of lists of strings (tokens, lemmas, or phrases).
id2word = corpora.Dictionary(df['keywords'])

In [40]:
id2word.token2id['spirited young woman']

112

In [41]:
id2word.doc2bow(['other show', 'graphic violence'])

[(16, 1)]
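
Only one pair comes back even though two phrases went in, presumably because 'other show' is not in the dictionary (the review's keyword was 'other shows') and unknown tokens are skipped. The plain-Python sketch below imitates that behavior; `token2id` and its ids are hypothetical, and `doc2bow_sketch` is a simplified stand-in rather than gensim's implementation:

```python
from collections import Counter

# Hypothetical token -> id mapping standing in for id2word.token2id.
token2id = {"graphic violence": 16, "other shows": 40}

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(doc2bow_sketch(["other show", "graphic violence"], token2id))
```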

In [42]:
len(id2word.keys())

625927

In [43]:
id2word.filter_extremes(no_below=7, no_above=.95)

In [44]:
len(id2word.keys())

25363
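
The drop from 625927 to 25363 entries comes from document-frequency filtering. A simplified plain-Python sketch of the rule follows, assuming `no_below` is a minimum document count and `no_above` a maximum fraction of documents; it ignores gensim's additional `keep_n` cap, and the toy `docs` are made up:

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents and in
    at most a `no_above` fraction of all documents."""
    # Count each token once per document (document frequency).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items()
            if no_below <= df <= max_docs}

docs = [["i", "plot", "gore"],
        ["i", "plot", "acting"],
        ["i", "acting", "llama"],
        ["i", "plot"]]

kept = filter_extremes_sketch(docs, no_below=2, no_above=0.95)
print(sorted(kept))
```

Here "i" is dropped for appearing in every document, while "gore" and "llama" are dropped for appearing in only one.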

In [45]:
corpus = [id2word.doc2bow(text) for text in df['keywords']]

In [48]:
corpus[560]

[(121, 1),
 (173, 1),
 (192, 1),
 (860, 1),
 (1395, 1),
 (1425, 1),
 (1629, 1),
 (1906, 1),
 (2069, 1),
 (5517, 1),
 (5673, 1),
 (6322, 1),
 (6633, 1),
 (6634, 1),
 (6635, 1),
 (6636, 1)]

In [49]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   num_topics=20,
                   passes=50,
                   workers=12)


In [50]:
lda.print_topics()

[(0,
  '0.011*"i" + 0.009*"this film" + 0.007*"first" + 0.006*"the plot" + 0.006*"this movie" + 0.006*"it" + 0.005*"people" + 0.005*"time" + 0.005*"the film" + 0.004*"the world"'),
 (1,
  '0.064*"i" + 0.040*"this movie" + 0.019*"this film" + 0.016*"the acting" + 0.011*"first" + 0.010*"this one" + 0.009*"the plot" + 0.008*"the movie" + 0.008*"people" + 0.008*"the story"'),
 (2,
  '0.023*"i" + 0.017*"this movie" + 0.010*"this film" + 0.008*"people" + 0.006*"the film" + 0.005*"it" + 0.004*"the plot" + 0.004*"the acting" + 0.004*"the movie" + 0.004*"a lot"'),
 (3,
  '0.010*"the film" + 0.010*"it" + 0.008*"the movie" + 0.008*"the story" + 0.007*"american" + 0.007*"love" + 0.007*"this film" + 0.005*"life" + 0.005*"war" + 0.005*"the end"'),
 (4,
  '0.011*"i" + 0.009*"the end" + 0.008*"first" + 0.007*"this movie" + 0.006*"the movie" + 0.005*"it" + 0.005*"this one" + 0.005*"the film" + 0.004*"this film" + 0.004*"second"'),
 (5,
  '0.011*"i" + 0.008*"first" + 0.006*"the film" + 0.006*"the movie"

In [51]:
import re
words = [re.findall(r'"[^"]*"', t[1]) for t in lda.print_topics(20)]

In [52]:
topics = [', '.join(t[0:5]) for t in words]
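
To see what the regular expression pulls out, here is a sketch on a single topic string shortened from the output above. The second pattern, with a capture group, is an optional variant that drops the surrounding quotes:

```python
import re

# One topic string in the format returned by lda.print_topics().
topic = '0.011*"i" + 0.009*"this film" + 0.007*"first" + 0.006*"the plot"'

quoted = re.findall(r'"[^"]*"', topic)   # matches include the quotes
bare = re.findall(r'"([^"]*)"', topic)   # capture group returns only the text

print(", ".join(quoted[:3]))
print(", ".join(bare[:3]))
```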

In [59]:
# Use topic_id rather than id so the built-in id() is not shadowed.
for topic_id, t in enumerate(topics):
    print(f"---- Topic {topic_id} -----")
    print(t, end="\n\n")

---- Topic 0 -----
"i", "this film", "first", "the plot", "this movie"

---- Topic 1 -----
"i", "this movie", "this film", "the acting", "first"

---- Topic 2 -----
"i", "this movie", "this film", "people", "the film"

---- Topic 3 -----
"the film", "it", "the movie", "the story", "american"

---- Topic 4 -----
"i", "the end", "first", "this movie", "the movie"

---- Topic 5 -----
"i", "first", "the film", "the movie", "this movie"

---- Topic 6 -----
"i", "this show", "first", "the show", "people"

---- Topic 7 -----
"i", "the story", "this film", "this movie", "the movie"

---- Topic 8 -----
"i", "this movie", "this film", "american", "you"

---- Topic 9 -----
"i", "this movie", "it", "the film", "this film"

---- Topic 10 -----
"it", "this film", "this movie", "i", "the film"

---- Topic 11 -----
"i", "first", "this movie", "this film", "people"

---- Topic 12 -----
"i", "this movie", "gore", "the story", "people"

---- Topic 13 -----
"the film", "it", "first", "this film", "people"

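# Ways to improve the model:
* Remove stop words, along with filler phrases that show up in nearly every topic: film, this film, the film, movie, the movie, this movie, etc.
* Other steps to improve the model: lemmatize the review text and drop punctuation and pronouns, as the `tokenize` function below does.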
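
The stop-phrase removal suggested above can be sketched in plain Python. `drop_stop_phrases` is a hypothetical helper and `sample` is made-up data; in the notebook it would be applied to each list in `df['keywords']` before building the dictionary:

```python
# Phrases named in the suggestion above; extend the set as needed.
stop_phrases = {"film", "this film", "the film",
                "movie", "the movie", "this movie"}

def drop_stop_phrases(keywords, stop_phrases):
    """Return the keywords that are not in `stop_phrases` (case-insensitive)."""
    return [kw for kw in keywords if kw.lower() not in stop_phrases]

sample = ["this film", "graphic violence", "The Movie", "michael sheen"]
print(drop_stop_phrases(sample, stop_phrases))
```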


In [60]:
import spacy

nlp = spacy.load("en_core_web_lg")



{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [61]:
custom = {'film', 'movie', 'story'}
nlp.Defaults.stop_words |= custom
def tokenize(text):
    """Parse a raw string and return lemmas"""
    doc = nlp(text)
    lemmas = []
    for token in doc:
        # token.pos is an integer id; token.pos_ is the string label such as 'PRON'.
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)
    return lemmas

In [62]:
from tqdm import tqdm

tqdm.pandas()
df['lemmas'] = df['review'].progress_apply(tokenize)  # This is slow (the run below took about 37 minutes). Plan for that in a SC.

100%|████████████████████████████████████| 40436/40436 [36:45<00:00, 18.34it/s]


## Interpret the LDA results

In [63]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [64]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)  # this line will take a long time to run



In [None]:
# t and t35 are not defined in this notebook; idxmax(axis=1) assumes one column per topic.
t['primary_topic'] = t35.idxmax(axis=1)
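
A small sketch of the primary-topic idea, assuming a DataFrame with one row per review and one column per topic. `doc_topics` and its weights are made up; the notebook does not show how its own table was built:

```python
import pandas as pd

# Hypothetical document-topic weights: one row per review, one column per topic.
doc_topics = pd.DataFrame(
    {"topic_0": [0.7, 0.1, 0.2],
     "topic_1": [0.2, 0.8, 0.3],
     "topic_2": [0.1, 0.1, 0.5]}
)

# idxmax(axis=1) returns, for each row, the label of the column with the largest value.
primary = doc_topics.idxmax(axis=1)
doc_topics["primary_topic"] = primary

print(doc_topics["primary_topic"].tolist())
```

Computing `primary` before adding the new column keeps the string labels out of the numeric comparison.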

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.countplot(x='primary_topic', data=t);
plt.xticks(rotation=90)