# Assignment 5

##### Topic Modeling

## Introduction

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

In this notebook, we will walk through **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques. It was designed specifically for text data.

To use a topic modeling technique, you need to provide:

1. a document-term matrix, and
2. the number of topics you would like the algorithm to pick up.
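
As a quick illustration of what a document-term matrix holds, here is a minimal sketch built by hand with the standard library. The two toy documents and all variable names are made up for this example; the notebook itself loads a matrix that was built earlier with `CountVectorizer`.

```python
from collections import Counter

# Hypothetical toy documents (not from the assignment data)
docs = {
    "doc_a": "cats chase mice",
    "doc_b": "dogs chase cats and cats run",
}

# Count how often each term occurs in each document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The sorted vocabulary defines the column order of the matrix
vocab = sorted({term for c in counts.values() for term in c})

# One row per document, one column per term
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row is a document and each column is a term, which is the same layout as the `dtm.pkl` table loaded in the next cell.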


In [1]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm.pkl')
data

Unnamed: 0,05,07,08,10,100,1000,10000,100000,10abox,11,...,ze,zealand,zeppelin,zero,zillion,zombie,zombies,zoning,zoo,éclair
louis,0,0,0,1,1,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
dave,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
ricky,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,1,0
bo,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,1,1,1,1,0,0,0,0,1,...,1,0,0,1,1,1,1,1,0,0
jim,0,0,0,4,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
john,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
hasan,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ali,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
anthony,0,0,0,1,1,0,0,0,0,0,...,0,10,0,0,0,0,0,0,0,0


In [4]:
# LDA goes through every word and its assigned topic, and updates the topic assignments.
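
To make the "update the topic assignments" idea concrete, here is a toy sketch of collapsed Gibbs sampling, one classic way to fit LDA that matches the description above. This is an assumption for illustration only: gensim's `LdaModel` fits the model with online variational Bayes, not with this sampler. The function name, the toy corpus, and the `alpha` and `beta` values are all hypothetical.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start with a random topic for every word occurrence
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (how much the document uses topic k)
                #        * (how much topic k uses this word)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and add the word back to the counts
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word

# Hypothetical corpus: word ids 0-2 and 3-5 never appear in the same document
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
assignments, doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

Every pass revisits each word, takes it out of the counts, and picks a new topic in proportion to how well that topic fits both the document and the word.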

In [5]:
# Import the necessary modules for LDA with gensim
from gensim import matutils, models
import scipy.sparse


In [6]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,louis,dave,ricky,bo,bill,jim,john,hasan,ali,anthony,mike,joe
5,0,0,0,0,1,0,0,0,0,0,0,0
7,0,0,0,0,1,0,0,0,0,0,0,0
8,0,0,0,0,1,0,0,0,0,0,0,0
10,1,1,0,0,1,4,0,0,0,1,1,0
100,1,0,0,0,1,0,1,0,1,1,0,1


In [7]:
# put the term-document matrix into a new gensim format, from df -> sparse matrix -> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
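
The reason for the transpose is that `Sparse2Corpus` treats each column as a document by default. Here is a hand-rolled sketch of that conversion on a tiny matrix, using plain Python lists instead of scipy and gensim. The matrix values and the helper name `columns_to_bow` are hypothetical.

```python
# Hypothetical term-document matrix: rows are terms, columns are documents
tdm_rows = [
    [1, 0, 2],  # term 0
    [0, 0, 1],  # term 1
    [3, 1, 0],  # term 2
]

def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    num_docs = len(matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(matrix) if row[doc] != 0]
        for doc in range(num_docs)
    ]

bow_corpus = columns_to_bow(tdm_rows)
for doc in bow_corpus:
    print(doc)
```

Each document ends up as a short list of (term id, count) pairs, which is the bag-of-words shape that gensim models consume.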

In [8]:
print(sparse_counts)

  (0, 4)	1
  (1, 4)	1
  (2, 4)	1
  (3, 0)	1
  (3, 1)	1
  (3, 4)	1
  (3, 5)	4
  (3, 9)	1
  (3, 10)	1
  (4, 0)	1
  (4, 4)	1
  (4, 6)	1
  (4, 8)	1
  (4, 9)	1
  (4, 11)	1
  (5, 5)	2
  (5, 10)	1
  (6, 11)	4
  (7, 2)	2
  (8, 8)	1
  (9, 1)	1
  (9, 4)	1
  (9, 10)	1
  (9, 11)	2
  (10, 1)	1
  :	:
  (6572, 6)	1
  (6573, 2)	1
  (6573, 11)	1
  (6574, 0)	1
  (6575, 0)	1
  (6575, 2)	1
  (6575, 4)	1
  (6575, 11)	3
  (6576, 2)	1
  (6577, 2)	1
  (6577, 4)	1
  (6578, 8)	1
  (6579, 4)	1
  (6580, 9)	10
  (6581, 10)	2
  (6582, 0)	2
  (6582, 4)	1
  (6582, 10)	1
  (6583, 4)	1
  (6584, 4)	1
  (6584, 8)	1
  (6585, 4)	1
  (6586, 4)	1
  (6587, 2)	1
  (6588, 6)	1


In [9]:
print(corpus)

<gensim.matutils.Sparse2Corpus object at 0x0000021548EE61D0>


In [10]:

# Load the fitted CountVectorizer and close the file when done
with open("cv.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
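
To see what the inversion does, here is a small sketch with a made-up vocabulary in the same shape as `cv.vocabulary_` (term to column index). The terms and indices are hypothetical.

```python
# Hypothetical vocabulary shaped like CountVectorizer.vocabulary_ (term -> column index)
vocabulary = {"music": 2, "lights": 1, "thank": 3, "agree": 0}

# Invert it so that a column index looks up its term
id2word_toy = {index: term for term, index in vocabulary.items()}

print(id2word_toy[2])
```

gensim needs this direction (index to term) so that it can print words instead of column numbers when it reports topics.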

Now that we have the corpus (term-document matrix) and id2word (a dictionary mapping each column index to its term), we need to specify two other parameters: the number of topics and the number of passes.

In [11]:
print(cv)

CountVectorizer(stop_words='english')


In [12]:
id2word

{3094: 'introfade',
 3882: 'music',
 3394: 'let',
 4941: 'roll',
 2854: 'hold',
 3425: 'lights',
 5882: 'thank',
 379: 'appreciate',
 1832: 'don',
 3933: 'necessarily',
 248: 'agree',
 3973: 'nice',
 4375: 'place',
 1947: 'easily',
 3974: 'nicest',
 3734: 'miles',
 1749: 'direction',
 1337: 'compliment',
 895: 'building',
 5224: 'shit',
 6010: 'town',
 5159: 'sentence',
 4050: 'odd',
 1892: 'driving',
 1816: 'doesn',
 1733: 'difference',
 5277: 'sidewalk',
 5633: 'street',
 4268: 'pedestrians',
 4287: 'people',
 3218: 'just',
 3261: 'kind',
 6329: 'walk',
 3728: 'middle',
 4924: 'road',
 3513: 'love',
 6050: 'traveling',
 5128: 'seeing',
 1735: 'different',
 4229: 'parts',
 1445: 'country',
 3452: 'live',
 3964: 'new',
 6569: 'york',
 6244: 'value',
 1820: 'doing',
 4073: 'old',
 3307: 'lady',
 1817: 'dog',
 3427: 'like',
 3944: 'neighborhood',
 6331: 'walking',
 5543: 'stands',
 2275: 'fights',
 2625: 'gravity',
 1601: 'day',
 4738: 'really',
 2595: 'got',
 1248: 'cloudy',
 2156: 'eye

In [13]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.033*"like" + 0.016*"just" + 0.016*"know" + 0.015*"don" + 0.011*"right" + 0.008*"said" + 0.007*"got" + 0.007*"gonna" + 0.006*"think" + 0.006*"fucking"'),
 (1,
  '0.022*"like" + 0.016*"just" + 0.013*"know" + 0.012*"don" + 0.011*"people" + 0.010*"right" + 0.007*"said" + 0.006*"shit" + 0.006*"gonna" + 0.006*"got"')]

In [14]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.013*"like" + 0.013*"right" + 0.010*"know" + 0.010*"don" + 0.009*"just" + 0.008*"said" + 0.008*"ve" + 0.008*"got" + 0.007*"say" + 0.007*"joke"'),
 (1,
  '0.028*"like" + 0.018*"just" + 0.016*"don" + 0.015*"right" + 0.013*"know" + 0.011*"people" + 0.009*"fucking" + 0.009*"gonna" + 0.008*"got" + 0.008*"shit"'),
 (2,
  '0.036*"like" + 0.018*"know" + 0.017*"just" + 0.013*"don" + 0.010*"said" + 0.007*"people" + 0.006*"gonna" + 0.006*"right" + 0.006*"time" + 0.006*"think"')]

In [15]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.017*"right" + 0.017*"like" + 0.013*"don" + 0.012*"just" + 0.011*"fucking" + 0.011*"know" + 0.009*"went" + 0.009*"ve" + 0.008*"said" + 0.007*"people"'),
 (1,
  '0.035*"like" + 0.017*"just" + 0.016*"know" + 0.014*"don" + 0.010*"right" + 0.009*"people" + 0.009*"said" + 0.008*"got" + 0.008*"gonna" + 0.007*"think"'),
 (2,
  '0.001*"like" + 0.001*"rights" + 0.001*"loft" + 0.001*"2021" + 0.001*"reserved" + 0.001*"scraps" + 0.001*"just" + 0.001*"know" + 0.001*"don" + 0.001*"right"'),
 (3,
  '0.030*"like" + 0.020*"just" + 0.017*"know" + 0.014*"don" + 0.007*"shit" + 0.007*"people" + 0.007*"gonna" + 0.006*"life" + 0.006*"cause" + 0.005*"ok"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

## Topic Modeling - Attempt #2 (Nouns Only)

In [16]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
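
The `pos[:2] == 'NN'` test works because the Penn Treebank noun tags all start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`). Here is a sketch of just the filtering step on hand-written tags, so it runs without downloading any NLTK data. The tagged pairs are hypothetical and were not produced by `pos_tag`.

```python
# Hypothetical, hand-written (word, tag) pairs in the shape pos_tag returns
tagged = [
    ("dogs", "NNS"), ("chase", "VBP"), ("the", "DT"),
    ("red", "JJ"), ("ball", "NN"), ("London", "NNP"),
]

# Keep only the words whose tag starts with NN
noun_words = [word for word, tag in tagged if tag.startswith("NN")]
print(" ".join(noun_words))
```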

In [17]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
louis,introfade the music out let’s roll hold there ...
dave,this is dave he tells dirty jokes for a living...
ricky,hello hello how you doing great thank you wow ...
bo,© 2021 scraps from the loft all rights reserved
bill,all right thank you thank you very much thank...
jim,ladies and gentlemen please welcome to the ...
john,armed with boyish charm and a sharp wit the fo...
hasan,© 2021 scraps from the loft all rights reserved
ali,ladies and gentlemen please welcome to the sta...
anthony,thank you thank you thank you san francisco th...


In [1]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Prince\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [24]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns

Unnamed: 0,transcript
louis,music let ’ roll hold lights lights thank i t ...
dave,jokes living stare work profound train thought...
ricky,hello thank fuck thank welcome i m gon tonight...
bo,© scraps rights
bill,thank s thank pleasure georgia area oasis t i ...
jim,ladies gentlemen stage mr jim jefferies thank ...
john,charm wit “ snl ” writer john mulaney marriage...
hasan,© scraps rights
ali,ladies gentlemen stage ali hi thank hello na s...
anthony,thank thank people i ’ em i francisco city wor...


In [None]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
# Convert to a list, since recent scikit-learn releases reject a frozenset here
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index

In [26]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [27]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"day" + 0.009*"joke" + 0.008*"thing" + 0.007*"ve" + 0.006*"way" + 0.006*"years" + 0.005*"things" + 0.005*"dad" + 0.005*"cause" + 0.005*"baby"'),
 (1,
  '0.010*"shit" + 0.010*"gon" + 0.010*"cause" + 0.010*"thing" + 0.009*"guy" + 0.009*"man" + 0.008*"life" + 0.008*"day" + 0.007*"lot" + 0.007*"women"')]

In [28]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.015*"shit" + 0.011*"gon" + 0.010*"life" + 0.009*"thing" + 0.009*"man" + 0.009*"guy" + 0.008*"cause" + 0.008*"lot" + 0.008*"fuck" + 0.007*"dude"'),
 (1,
  '0.011*"day" + 0.010*"thing" + 0.010*"cause" + 0.008*"gon" + 0.007*"way" + 0.007*"ve" + 0.007*"guy" + 0.007*"man" + 0.007*"things" + 0.006*"house"'),
 (2,
  '0.001*"thing" + 0.001*"day" + 0.001*"life" + 0.001*"guy" + 0.000*"cause" + 0.000*"man" + 0.000*"gon" + 0.000*"ve" + 0.000*"don" + 0.000*"years"')]

In [29]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"joke" + 0.009*"ve" + 0.009*"thing" + 0.009*"day" + 0.008*"years" + 0.006*"things" + 0.005*"nuts" + 0.005*"jenner" + 0.005*"god" + 0.005*"don"'),
 (1,
  '0.013*"shit" + 0.009*"guy" + 0.009*"man" + 0.009*"thing" + 0.009*"fuck" + 0.009*"gon" + 0.008*"day" + 0.007*"lot" + 0.006*"dude" + 0.006*"joke"'),
 (2,
  '0.013*"cause" + 0.010*"gon" + 0.010*"life" + 0.009*"thing" + 0.009*"way" + 0.008*"guy" + 0.008*"kind" + 0.007*"man" + 0.007*"house" + 0.007*"kids"'),
 (3,
  '0.014*"day" + 0.011*"women" + 0.010*"thing" + 0.009*"cause" + 0.009*"lot" + 0.008*"shit" + 0.008*"fuck" + 0.007*"gon" + 0.007*"guy" + 0.007*"ve"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [30]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [31]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
louis,music let ’ roll hold lights lights thank much...
dave,dirty jokes living stare most hard work profou...
ricky,hello great thank fuck thank lovely welcome i ...
bo,© scraps loft rights
bill,right thank s thank pleasure greater atlanta g...
jim,ladies gentlemen welcome stage mr jim jefferie...
john,boyish charm sharp wit former “ snl ” writer j...
hasan,© scraps loft rights
ali,ladies gentlemen welcome stage ali wong hi wel...
anthony,thank san francisco thank good people surprise...


In [None]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
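
When `max_df` is a float, `CountVectorizer` ignores terms whose document frequency is strictly higher than that proportion of the documents. Here is a hand-rolled sketch of that rule in plain Python. The helper name `filter_by_max_df` and the toy documents are hypothetical.

```python
def filter_by_max_df(docs, max_df):
    """Keep terms whose document frequency is at most max_df * number of documents."""
    tokenized = [set(doc.split()) for doc in docs]
    limit = max_df * len(docs)
    vocab = set().union(*tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
    return sorted(term for term, df in doc_freq.items() if df <= limit)

# Hypothetical corpus where "thing" appears in every document
toy = [
    "thing joke mom",
    "thing gun class",
    "thing joke wife",
    "thing dude kid",
    "thing dude mom",
]
kept = filter_by_max_df(toy, max_df=0.8)
print(kept)
```

With 5 documents the limit is 4, so "thing" (present in all 5) is dropped while every other term is kept. Words that show up in nearly every transcript carry little information for telling topics apart.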

In [33]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [34]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.006*"shit" + 0.005*"kind" + 0.004*"point" + 0.004*"mom" + 0.004*"hey" + 0.004*"ok" + 0.004*"jenny" + 0.004*"kids" + 0.004*"clinton" + 0.003*"kid"'),
 (1,
  '0.010*"shit" + 0.007*"fuck" + 0.006*"fucking" + 0.005*"joke" + 0.005*"dude" + 0.005*"kid" + 0.005*"kids" + 0.004*"fck" + 0.004*"everybody" + 0.003*"baby"')]

In [35]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.008*"shit" + 0.007*"joke" + 0.006*"baby" + 0.005*"ta" + 0.005*"anthony" + 0.005*"husband" + 0.005*"mom" + 0.004*"ok" + 0.004*"kid" + 0.004*"family"'),
 (1,
  '0.006*"fuck" + 0.006*"shit" + 0.004*"black" + 0.004*"room" + 0.004*"fucking" + 0.004*"hey" + 0.004*"kind" + 0.004*"ahah" + 0.004*"friend" + 0.004*"point"'),
 (2,
  '0.011*"shit" + 0.007*"kids" + 0.006*"dude" + 0.006*"fck" + 0.006*"everybody" + 0.006*"kid" + 0.005*"fucking" + 0.005*"fuck" + 0.004*"kind" + 0.004*"joke"')]

In [None]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

## Identify Topics in Each Document

Of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. Let's pull it down here and run it for more passes to get more fine-tuned topics.

In [47]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.009*"joke" + 0.005*"mom" + 0.005*"parents" + 0.004*"hasan" + 0.004*"jokes" + 0.004*"anthony" + 0.003*"nuts" + 0.003*"dead" + 0.003*"tit" + 0.003*"twitter"'),
 (1,
  '0.005*"mom" + 0.005*"jenny" + 0.005*"clinton" + 0.004*"friend" + 0.004*"parents" + 0.003*"husband" + 0.003*"cow" + 0.003*"ok" + 0.003*"wife" + 0.003*"john"'),
 (2,
  '0.005*"bo" + 0.005*"gun" + 0.005*"guns" + 0.005*"repeat" + 0.004*"um" + 0.004*"ass" + 0.004*"eye" + 0.004*"contact" + 0.003*"son" + 0.003*"class"'),
 (3,
  '0.006*"ahah" + 0.004*"nigga" + 0.004*"gay" + 0.003*"dick" + 0.003*"door" + 0.003*"young" + 0.003*"motherfucker" + 0.003*"stupid" + 0.003*"bitch" + 0.003*"mad"')]

In [48]:
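# Let's take a look at which topic dominates each transcript
corpus_transformed = ldana[corpusna]
# Each document is a list of (topic, probability) pairs; keep the most probable topic
top_topics = [max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed]
list(zip(top_topics, data_dtmna.index))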

[(1, 'ali'),
 (0, 'anthony'),
 (2, 'bill'),
 (2, 'bo'),
 (3, 'dave'),
 (0, 'hasan'),
 (2, 'jim'),
 (3, 'joe'),
 (1, 'john'),
 (0, 'louis'),
 (1, 'mike'),
 (0, 'ricky')]
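
Each transformed document is a list of (topic, probability) pairs, and a document can report more than one pair. Here is a sketch of picking the dominant topic per document on made-up distributions. The document names and probabilities are hypothetical, not output from the model above.

```python
# Hypothetical per-document topic distributions, shaped like gensim's output:
# each document is a list of (topic_id, probability) pairs
toy_doc_topics = {
    "doc_a": [(0, 0.10), (2, 0.90)],
    "doc_b": [(1, 0.99)],
    "doc_c": [(0, 0.45), (1, 0.15), (3, 0.40)],
}

# For every document, keep the topic id with the highest probability
dominant = {
    name: max(pairs, key=lambda pair: pair[1])[0]
    for name, pairs in toy_doc_topics.items()
}
print(dominant)
```

A document like `doc_c` is a reminder that the dominant topic can be a narrow winner, so it is worth looking at the full distribution before labeling a transcript.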