# Topic Modeling with Latent Dirichlet Allocation

1) Use Gensim to create a trigram model.

2) Use spaCy to lemmatize and apply part-of-speech tags, keeping only nouns, adjectives, verbs, and adverbs.

3) Create an LDA model, which uses word frequencies to determine which words typically appear together and are thus part of the same topic.

4) Print out the words most likely to appear in each topic and try to determine what the topics are.

5) Create an interactive visualization with pyLDAvis.

6) Return topic probabilities for each transcript as vectors, then concatenate the new topic columns to the original dataframe.

In [1]:
import pandas as pd

df = pd.read_pickle('data/stand-up-data-cleaned.pkl')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330 entries, 0 to 329
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            330 non-null    object 
 1   date_posted      330 non-null    object 
 2   link             330 non-null    object 
 3   name             326 non-null    object 
 4   year             313 non-null    float64
 5   transcript       330 non-null    object 
 6   language         330 non-null    object 
 7   runtime          279 non-null    float64
 8   rating           279 non-null    float64
 9   rating_type      330 non-null    int64  
 10  words            330 non-null    object 
 11  word_count       330 non-null    int64  
 12  f_words          330 non-null    int64  
 13  s_words          330 non-null    int64  
 14  diversity        330 non-null    int64  
 15  diversity_ratio  330 non-null    float64
dtypes: float64(4), int64(5), object(7)
memory usage: 41.4+ KB


In [2]:
# Only take English transcripts
df = df[df.language == 'en']
df.language.value_counts()

en    322
Name: language, dtype: int64

### Get bigrams and trigrams from the corpus
The Gensim library is powerful but a bit confusing at times. To get bigrams (common two-word phrases), we first create a Phrases object, then instantiate a Phraser object with the Phrases object as input. The trigram model is built the same way, using the output of the bigram model as its input.
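
To make the idea of a phrase model more concrete, here is a minimal pure-Python sketch of count-based bigram detection. It is not Gensim's implementation: the scoring formula follows the default scorer as described in the Gensim documentation, and `score_bigrams`, `merge_phrases`, the toy documents, and the threshold of 1.5 are all made up for illustration.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score each adjacent word pair as
    (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

def merge_phrases(doc, scores, threshold):
    """Join adjacent pairs whose score exceeds the threshold with '_'."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and scores.get((doc[i], doc[i + 1]), 0) > threshold:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            out.append(doc[i])
            i += 1
    return out

# Toy corpus (hypothetical), already tokenized like df.words
toy_docs = [
    ["i", "love", "show", "business"],
    ["show", "business", "is", "hard"],
    ["show", "business", "is", "fun"],
    ["i", "love", "hard", "work"],
]

scores = score_bigrams(toy_docs, min_count=1)
merged = [merge_phrases(doc, scores, threshold=1.5) for doc in toy_docs]
print(merged[0])
```

Running the same two steps again on `merged` is what would turn frequent bigram-plus-word combinations into trigrams, which mirrors why the trigram model takes the bigram output as input.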

In [3]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser 

# Build bigram and trigram Phrases objects
bigram_phrases = Phrases(df.words[:], min_count=10)
trigram_phrases = Phrases(bigram_phrases[df.words[:]], min_count=5)

# Create Phraser model object for faster processing by passing in the Phrases object (Gensim has a confusing API...)
bigram_model = Phraser(bigram_phrases)
trigram_model = Phraser(trigram_phrases)

# Each element of df.words is one transcript's token list
trigrams = [trigram_model[bigram_model[doc]] for doc in df.words]

### Lemmatize words and filter out unneeded parts of speech

In [4]:
import spacy

# The 'en' shortcut only works in spaCy 2.x; spaCy 3+ needs the full model name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
allowed_postags=['NOUN','ADJ','VERB','ADV']
lemmatized_words = []
for sent in trigrams:
    doc = nlp(" ".join(sent))
    lemmatized_words.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])

In [5]:
print(lemmatized_words[0][:50])

['start', 'time', 'right', 'go', 'bring', 'brother', 'give', 'go', 'hurt', 'real', 'bad', 'man', 'take', 'go', 'rolling', 'stone', 'great', 'stand', 'comic_strip', 'time', 'let', 'bring', 'right', 'brownest', 'working', 'man', 'show_business', 'bugger', 'happen', 'know', 'go', 'sound', 'strange', 'good', 'home', 'mean', 'never', 'still', 'feel', 'home', 'feel', 'thing', 'back', 'home', 'keep', 'comfortable', 'brown', 'man', 'even', 'put']


In [6]:
from gensim.corpora import Dictionary

id2word = Dictionary(lemmatized_words)
id2word.filter_extremes(no_below=10, no_above=0.4)
id2word.compactify()
corpus = [id2word.doc2bow(doc) for doc in lemmatized_words]
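
The `filter_extremes` call above keeps only tokens that appear in at least 10 transcripts and in no more than 40% of them. A rough pure-Python sketch of that rule and of the bag-of-words step follows. This is an illustration under simplifying assumptions, not Gensim's code: `filter_by_doc_frequency` and `to_bow` are hypothetical helpers, the toy documents are invented, and the bag of words here maps tokens to counts rather than integer ids to counts.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more
    than the fraction `no_above` of all documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {t for t, n in doc_freq.items() if no_below <= n <= max_docs}

def to_bow(doc, vocab):
    """Count how often each kept token occurs in one document."""
    return Counter(t for t in doc if t in vocab)

# Toy corpus (hypothetical)
toy_docs = [
    ["go", "home", "cat"],
    ["go", "cat", "cat"],
    ["go", "vote"],
    ["go", "home", "panda"],
    ["go", "vote", "trump"],
]

# "go" is in every document (too common); "panda" and "trump" are too rare
vocab = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.4)
bow = to_bow(toy_docs[1], vocab)
print(sorted(vocab), dict(bow))
```

Dropping very common and very rare tokens matters for LDA because words that appear everywhere carry little topical signal, while words seen in only a handful of transcripts give the model too little evidence to place them.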

### Create LDA model and print out topics

In [7]:
import numpy as np
from gensim.models import LdaMulticore

num_topics = 7

lda_model = LdaMulticore(corpus=corpus, 
                             id2word=id2word, 
                             num_topics=num_topics, 
                             random_state=1,
                             chunksize=30,
                             passes=20,
                             alpha=0.31,
                             eta=0.91,
                             eval_every=1,
                             per_word_topics=True,
                             workers=1)

In [2]:
# import pickle

# # Save LDA model
# pickle.dump(lda_model, open('models/LDA_model.pkl', 'wb'))

# # Load LDA model
# with open('models/LDA_model.pkl','rb') as f:
#     lda_model = pickle.load(f)

In [6]:
lda_model.print_topics(7,num_words=15)

[(0,
  '0.010*"daddy" + 0.008*"motherfucke" + 0.007*"nigger" + 0.005*"nothing" + 0.005*"sing" + 0.004*"rich" + 0.004*"police" + 0.004*"nigga" + 0.004*"fella" + 0.004*"mama" + 0.004*"jail" + 0.004*"club" + 0.003*"titty" + 0.003*"black_people" + 0.003*"talkin"'),
 (1,
  '0.003*"awesome" + 0.003*"cat" + 0.002*"yell" + 0.002*"daughter" + 0.002*"restaurant" + 0.002*"bathroom" + 0.002*"freak" + 0.002*"drunk" + 0.002*"store" + 0.002*"seriously" + 0.002*"horse" + 0.002*"huge" + 0.002*"horrible" + 0.002*"adult" + 0.002*"cute"'),
 (2,
  '0.011*"quite" + 0.008*"mate" + 0.006*"lovely" + 0.006*"mum" + 0.005*"film" + 0.005*"round" + 0.004*"brilliant" + 0.004*"cheer" + 0.004*"accent" + 0.004*"bloke" + 0.004*"shout" + 0.003*"scottish" + 0.003*"shop" + 0.003*"applause" + 0.003*"gig"'),
 (3,
  '0.011*"rape" + 0.005*"tit" + 0.005*"boyfriend" + 0.005*"porn" + 0.004*"pregnant" + 0.004*"cock" + 0.003*"horrible" + 0.003*"opinion" + 0.003*"abortion" + 0.003*"gun" + 0.003*"penis" + 0.003*"upset" + 0.003*"husba

### And now we use the human brain to interpret the results of unsupervised machine learning
By looking at some of the key words we can try to derive a topic:
- Topic 0, Words: "black_people", "ni**er", "police", "jail". This topic could be about issues regarding policing and African Americans. It will be called "police_AA".


- Topic 1, Words: "awesome", "restaurant", "daughter", "store", "bathroom", "yell", "huge". This one is hard to tell but contains mostly neutral terms. Perhaps this is clean humor. It will be called "clean".


- Topic 2, Words: "mate", "lovely", "brilliant", "mum", "bloke". Several of the comics in the dataset are British, and the algorithm has picked up the typical vocabulary of UK speakers as if it were a topic. It will be called "UK".


- Topic 3, Words: "boyfriend", "pregnant", "abortion", "penis", "tit", "husband", "wedding", "kiss". This topic seems to be about relationships and sex, so it will be called "relationships".


- Topic 4, Words: "cat", "panda", "frog", "fish", "sheep", "bird". Clearly, this one will be called "animals".


- Topic 5, Words: "trump", "immigrant", "president", "racist", "vote", "government", "terrorist". All of these words point to politics and government. It will be called "politics".


- Topic 6, Words: "war", "planet", "religion", "language". This topic contains big ideas and concepts and will be called "big_picture".
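
When comparing key words across topics by eye, it can help to turn the printed topic strings into plain word lists first. The sketch below parses the `weight*"word"` format shown in the output above with a regular expression. `parse_topic` is a hypothetical helper, and the sample string is just the first five terms of topic 2 copied from that output.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First five terms of topic 2, copied from the printed topics
topic_2 = '0.011*"quite" + 0.008*"mate" + 0.006*"lovely" + 0.006*"mum" + 0.005*"film"'

parsed = parse_topic(topic_2)
top_words = [word for word, _ in parsed]
print(top_words)
```

If the model object is still in memory, Gensim's `show_topic` method should give the same (word, weight) pairs directly, which avoids parsing strings at all.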

### What is the coherence score? Not great.
At 0.36, it's pretty low. Something in the range of 0.7 - 0.9 would be closer to ideal, but 0.36 is not the worst thing. Many NLP practitioners would agree that coherence scores don't necessarily indicate the quality of an LDA model, so the score should be read alongside the topics themselves.
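
For intuition about what a coherence score rewards, here is a simplified sketch that averages normalized pointwise mutual information (NPMI) over pairs of a topic's top words, counting co-occurrence at the document level. This is deliberately not the `c_v` measure used in this notebook, which adds further steps, so its numbers are not comparable to the 0.36 reported here. `npmi_coherence` and the toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Average document-level NPMI over all pairs of top words.
    Ranges from -1 (never together) to 1 (always together)."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)   # the pair never co-occurs
        elif p_ab == 1:
            scores.append(1.0)    # limit case: both words in every document
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)

# Toy corpus (hypothetical)
toy_docs = [
    ["cat", "panda", "fish"],
    ["cat", "panda"],
    ["vote", "president"],
    ["vote", "president", "trump"],
]

coherent_score = npmi_coherence(["cat", "panda"], toy_docs)
mixed_score = npmi_coherence(["cat", "vote"], toy_docs)
print(coherent_score, mixed_score)
```

A topic whose top words tend to show up in the same transcripts scores high, while a topic that mixes unrelated words scores low, which is the behavior any coherence measure is trying to capture.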

In [10]:
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, 
                                         texts=lemmatized_words, 
                                         dictionary=id2word, 
                                         coherence='c_v')
coherence_model_lda.get_coherence()

0.3647149558837724

### The pyLDAvis library uses Jensen-Shannon distance to create this MDS visualization.
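
As a reminder of what that distance measures, the sketch below computes the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence, using natural logarithms) between two word distributions. It only illustrates the metric: the exact computation inside pyLDAvis is not shown in this notebook, and the three toy "topics" over a four-word vocabulary are invented.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))       # guard against tiny negative rounding

# Toy topic-word distributions over the vocabulary [cat, panda, vote, president]
animals = [0.50, 0.40, 0.05, 0.05]
pets = [0.45, 0.45, 0.05, 0.05]
politics = [0.05, 0.05, 0.50, 0.40]

d_similar = jensen_shannon_distance(animals, pets)
d_different = jensen_shannon_distance(animals, politics)
print(d_similar, d_different)
```

Topics with similar word distributions end up close together, so on the intertopic map they would be expected to appear as nearby or overlapping circles.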

In [11]:
import pyLDAvis.gensim  # in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis

# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
LDAvis_prepared

In [12]:
# Get a list of vectors of topic probabilities, one vector per transcript
topic_vecs = []
for bow in corpus:
    # minimum_probability=0.0 keeps low-probability topics in the result
    doc_topics = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # doc_topics is a list of (topic_id, probability) pairs ordered by topic id
    topic_vecs.append([prob for _, prob in doc_topics])
    
topic_vecs[0]

[0.014837115,
 0.46809384,
 0.0004913487,
 0.21831581,
 0.0004745026,
 0.2464995,
 0.05128792]

In [13]:
# Add topic probabilities into main df. Create a new column for each topic.
topic_columns = ['police_AA', 'clean', 'UK', 'relationships', 'animals', 'politics', 'big_picture']
LDA_probs = pd.DataFrame(data=topic_vecs, columns=topic_columns, index=df.index)
df = pd.concat([df, LDA_probs], axis=1)

df.head()

| | title | date_posted | link | name | year | transcript | language | runtime | rating | rating_type | ... | s_words | diversity | diversity_ratio | police_AA | clean | UK | relationships | animals | politics | big_picture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Russell Peters: Deported | May 10th, 2020 | https://scrapsfromtheloft.com/2020/05/10/russe... | Russell Peters | 2020.0 | NARRATOR: Ladies and gentlemen, it’s start t... | en | 67.0 | 6.1 | 0 | ... | 30 | 1211 | 0.244844 | 0.014837 | 0.468094 | 0.000491 | 0.218316 | 0.000475 | 0.246499 | 0.051288 |
| 1 | Jimmy O. Yang: Good Deal | May 10th, 2020 | https://scrapsfromtheloft.com/2020/05/10/jimmy... | Jimmy O. Yang | 2020.0 | ANNOUNCER: Ladies and gentlemen, welcome to th... | en | NaN | NaN | 0 | ... | 39 | 1238 | 0.300851 | 0.160739 | 0.536182 | 0.000526 | 0.13512 | 0.000513 | 0.144745 | 0.022175 |
| 2 | Jo Koy: Lights Out | May 9th, 2020 | https://scrapsfromtheloft.com/2020/05/09/jo-ko... | Jo Koy | 2012.0 | L.A., are you ready? Live from the Alex Thea... | en | 59.0 | 7.8 | 1 | ... | 35 | 749 | 0.278128 | 0.338921 | 0.52981 | 0.000956 | 0.127428 | 0.000929 | 0.000967 | 0.000989 |
| 3 | Lee Mack: Going Out Live | May 8th, 2020 | https://scrapsfromtheloft.com/2020/05/08/lee-m... | Lee Mack | 2010.0 | This programme contains strong language Over ... | en | 60.0 | 7.2 | 0 | ... | 5 | 1437 | 0.35693 | 0.000455 | 0.069649 | 0.840645 | 0.000443 | 0.000422 | 0.000438 | 0.087947 |
| 4 | Lee Mack: Live | May 7th, 2020 | https://scrapsfromtheloft.com/2020/05/07/lee-m... | Lee Mack | 2007.0 | PRESENTER: Ladies and gentlemen, please welco... | en | 68.0 | 7.7 | 1 | ... | 19 | 1684 | 0.313244 | 0.01236 | 0.151348 | 0.833839 | 0.000795 | 0.000339 | 0.000353 | 0.000966 |
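
With the topic probabilities sitting next to the metadata, one simple way to summarize a transcript is by its most probable topic. Below is a minimal sketch using two of the rows shown above, with values rounded as displayed. `dominant_topic` is a hypothetical helper and not part of the notebook.

```python
topic_columns = ['police_AA', 'clean', 'UK', 'relationships',
                 'animals', 'politics', 'big_picture']

def dominant_topic(topic_vec, labels):
    """Return the label and probability of the most probable topic."""
    best = max(range(len(topic_vec)), key=lambda i: topic_vec[i])
    return labels[best], topic_vec[best]

# Topic probabilities for rows 0 and 3 of the table above
russell_peters_deported = [0.014837, 0.468094, 0.000491, 0.218316,
                           0.000475, 0.246499, 0.051288]
lee_mack_going_out_live = [0.000455, 0.069649, 0.840645, 0.000443,
                           0.000422, 0.000438, 0.087947]

peters_topic = dominant_topic(russell_peters_deported, topic_columns)
mack_topic = dominant_topic(lee_mack_going_out_live, topic_columns)
print(peters_topic, mack_topic)
```

On the full DataFrame, something like `df[topic_columns].idxmax(axis=1)` should produce the same labels for every transcript in one call.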


### Save data into a file for more analysis later on

In [14]:
# Pickle DataFrame to the data directory
df.to_pickle('./data/stand-up-data-w-LDA.pkl')