# Topic Modeling

## Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will walk through **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques. It was designed specifically for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
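
To make the first input concrete, here is a minimal sketch of a document-term matrix built by hand. The two toy sentences are made up for illustration; the notebook itself loads a precomputed matrix from a pickle file rather than building one this way.

```python
from collections import Counter

# Toy documents, made up for illustration only
docs = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog chased the cat",
}

# Vocabulary: every distinct word, sorted (one matrix column per term)
vocab = sorted({word for text in docs.values() for word in text.split()})

# Document-term matrix: one row per document, one count per vocabulary term
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[term] for term in vocab]

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row is a document and each column is a term, which is the same layout as the `data` DataFrame loaded in the first cell.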

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even a different model.

## Topic Modeling - Attempt #1 (All Text)

In [1]:
import pandas as pd
import pickle
data = pd.read_pickle('data_dtm_NLP3.pkl')
data

Unnamed: 0,aah,ab,abandon,abandoned,abbas,abducted,abduction,abdul,ability,able,...,ziploced,zippedy,zit,zo,zombie,zombies,zone,zoo,zoologist,zoom
adel_karam,0,0,0,0,2,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
amy_schumer,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
beth_stelling,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
big_jay_oakerson,0,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
chelsea_handler,0,0,0,0,0,0,0,0,1,5,...,0,0,0,0,0,0,0,0,0,0
chris_rock,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,1,0,0
dave_chappelle,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
david_cross,0,0,0,0,0,0,0,0,0,6,...,0,0,0,0,0,0,0,0,0,0
dylan_moran,0,1,0,0,0,0,0,0,0,4,...,0,0,0,0,1,0,1,0,0,0
george_carlin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [2]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse
# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [3]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,adel_karam,amy_schumer,beth_stelling,big_jay_oakerson,chelsea_handler,chris_rock,dave_chappelle,david_cross,dylan_moran,george_carlin,iliza_shlesinger,kevin_hart,kevin_james,louis_c_k,matt_rife,pete_davidson,ricky_gervais,sarah_cooper,tom_segura,trevor_noah
aah,0,1,0,0,0,0,1,0,0,0,0,3,0,0,0,1,0,0,0,0
ab,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
abandoned,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
abbas,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
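
As a rough picture of what this conversion produces, the sketch below turns a tiny term-document matrix into one list of `(term_id, count)` pairs per document. It assumes gensim's default of treating each matrix column as a document, which is why the notebook transposes first. The helper `to_bow_corpus` is hypothetical and is not gensim's implementation.

```python
# Toy term-document matrix: rows are terms, columns are documents
# (values are made up for illustration)
tdm_toy = [
    [2, 0],  # term 0
    [0, 1],  # term 1
    [1, 3],  # term 2
]

def to_bow_corpus(term_doc_rows):
    """Return one list of (term_id, count) pairs per document column, skipping zeros."""
    n_docs = len(term_doc_rows[0])
    bow_corpus = []
    for doc in range(n_docs):
        bow = [(term_id, row[doc])
               for term_id, row in enumerate(term_doc_rows)
               if row[doc] != 0]
        bow_corpus.append(bow)
    return bow_corpus

toy_corpus = to_bow_corpus(tdm_toy)
print(toy_corpus)
```

The term ids in these pairs are row positions, so the id-to-word dictionary built in the next cell has to use the same ordering as the matrix.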

In [5]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [6]:
id2word

{5970: 'netflix',
 1774: 'comedy',
 8407: 'specialrecorded',
 1364: 'casino',
 2757: 'du',
 5143: 'liban',
 763: 'beirut',
 4159: 'hello',
 10149: 'wow',
 3890: 'great',
 3813: 'good',
 3038: 'evening',
 3794: 'god',
 7422: 'rest',
 8364: 'soul',
 995: 'bored',
 1954: 'cool',
 10188: 'yeah',
 5178: 'like',
 10229: 'youve',
 442: 'arrived',
 6251: 'outer',
 8381: 'space',
 9846: 'want',
 7716: 'say',
 9080: 'thank',
 9368: 'traveling',
 9894: 'way',
 7102: 'quite',
 2563: 'distance',
 1273: 'came',
 9176: 'thursday',
 10145: 'wouldnt',
 5041: 'late',
 9662: 'usually',
 2635: 'dont',
 9366: 'travel',
 4491: 'important',
 9177: 'thursdaycan',
 1769: 'come',
 4951: 'kiss',
 5399: 'make',
 5319: 'love',
 261: 'amazing',
 5099: 'lebanese',
 4877: 'just',
 114: 'adore',
 4955: 'kissing',
 4841: 'jordan',
 9938: 'welcome',
 7205: 'real',
 4723: 'issue',
 4843: 'jordanians',
 4976: 'know',
 9208: 'times',
 2457: 'didnt',
 9206: 'time',
 165: 'ago',
 5610: 'met',
 3902: 'greet',
 8555: 'started'

In [9]:
len(id2word)

10250

In [10]:
sparse_counts.shape[0]

10237
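
The two numbers above disagree: `id2word` holds 10250 terms, while the term-document matrix has 10237 rows. This suggests, though the notebook does not confirm it, that `cv.pkl` was fitted on a different vocabulary than the one behind `data_dtm_NLP3.pkl`. If so, some ids would map to the wrong words. One hedged way to rule this out is to build the mapping from the matrix's own column order. The helper names below are hypothetical, and the toy column list stands in for `data.columns`.

```python
def id2word_from_columns(columns):
    """Map each column position to its term, in the matrix's own column order."""
    return dict(enumerate(columns))

def check_alignment(id2word, n_terms):
    """Return True only when the ids are exactly 0 .. n_terms - 1."""
    return sorted(id2word) == list(range(n_terms))

# Toy stand-in for data.columns (terms made up for illustration)
toy_columns = ["aah", "ab", "abandon"]
toy_id2word = id2word_from_columns(toy_columns)

print(toy_id2word)
print(check_alignment(toy_id2word, n_terms=3))  # ids match the matrix
print(check_alignment(toy_id2word, n_terms=4))  # ids do not match the matrix
```

In the notebook, the equivalent call would be `id2word_from_columns(data.columns)`, assuming the DataFrame columns are the terms in matrix order.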

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.

In [11]:
# Fit LDA on the corpus with 2 topics and 10 passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.007*"saddle" + 0.006*"gone" + 0.006*"offputting" + 0.006*"mall" + 0.006*"theyve" + 0.006*"wallbuilder" + 0.006*"shin" + 0.006*"therapist" + 0.005*"gonna" + 0.005*"herring"'),
 (1,
  '0.006*"saddle" + 0.006*"goin" + 0.006*"wallbuilder" + 0.005*"theyve" + 0.005*"herring" + 0.005*"offputting" + 0.005*"gonna" + 0.004*"therapist" + 0.004*"longevity" + 0.004*"cause"')]

In [12]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.007*"theyve" + 0.007*"offputting" + 0.006*"saddle" + 0.006*"herring" + 0.006*"therapist" + 0.005*"gone" + 0.005*"wallbuilder" + 0.005*"goin" + 0.004*"ittybitty" + 0.004*"did"'),
 (1,
  '0.007*"theyve" + 0.007*"wallbuilder" + 0.007*"godspeed" + 0.006*"gonna" + 0.006*"goin" + 0.006*"offputting" + 0.006*"saddle" + 0.006*"fuckin" + 0.005*"herring" + 0.005*"fubu"'),
 (2,
  '0.007*"mall" + 0.007*"saddle" + 0.007*"shin" + 0.006*"gone" + 0.006*"gonna" + 0.005*"goin" + 0.005*"cause" + 0.005*"wallbuilder" + 0.005*"offputting" + 0.005*"savage"')]

In [13]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.006*"offputting" + 0.005*"godspeed" + 0.005*"theyve" + 0.005*"gonna" + 0.005*"herring" + 0.004*"goin" + 0.004*"doing" + 0.004*"sandler" + 0.004*"gone" + 0.004*"cause"'),
 (1,
  '0.007*"theyve" + 0.007*"offputting" + 0.006*"therapist" + 0.006*"wallbuilder" + 0.006*"gone" + 0.006*"goin" + 0.005*"herring" + 0.005*"gonna" + 0.005*"savage" + 0.005*"shin"'),
 (2,
  '0.009*"mall" + 0.007*"saddle" + 0.007*"wallbuilder" + 0.005*"herring" + 0.005*"teeter" + 0.005*"fuckin" + 0.004*"didnt" + 0.004*"shin" + 0.004*"gonna" + 0.004*"did"'),
 (3,
  '0.015*"saddle" + 0.009*"fckin" + 0.007*"mall" + 0.007*"fbi" + 0.005*"gonna" + 0.005*"shin" + 0.005*"gone" + 0.005*"come" + 0.004*"wallbuilder" + 0.004*"got"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

In [14]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
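
The `pos[:2] == 'NN'` test keeps every Penn Treebank noun tag (`NN`, `NNS`, `NNP`, `NNPS`) because they all share the same two-letter prefix. The sketch below shows the same prefix idea on hand-tagged tokens so it runs without NLTK. The tags were written by hand for illustration, and `keep_tags` is a hypothetical helper.

```python
def keep_tags(tagged_tokens, prefixes):
    """Keep words whose POS tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

# Hand-tagged tokens in the (word, tag) shape that nltk.pos_tag returns;
# these tags were written by hand, not produced by NLTK
tagged = [
    ("comedians", "NNS"), ("tell", "VBP"), ("funny", "JJ"),
    ("jokes", "NNS"), ("in", "IN"), ("chicago", "NNP"),
]

print(keep_tags(tagged, ["NN"]))        # nouns only
print(keep_tags(tagged, ["NN", "JJ"]))  # nouns and adjectives
```

Passing a list of prefixes makes it easy to switch parts of speech without writing a new function for each combination.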

In [15]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript,Full_Name
adel_karam,a netflix comedy specialrecorded at the casino...,Adel Karam
amy_schumer,fuck yeah this is such a big night for you but...,Amy Schumer
beth_stelling,beth stellings standup comedy special girl dad...,Beth Stelling
big_jay_oakerson,lets get you going here hey hey hey hey hey he...,Big Jay Oakerson
chelsea_handler,join me in welcoming the author of six number ...,Chris Rock
chris_rock,lets go she said ill do anything you want i sa...,Dave Chappelle
dave_chappelle,the dreamer which was shot in chappelles homet...,Chris Tucker
david_cross,david cross making america great again is a st...,Daniel Tosh
dylan_moran,ladies and gentlemen will you please welcome t...,Dylan Moran
george_carlin,in the indian sergeant was emerging as george ...,George Carlin


In [16]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns

Unnamed: 0,transcript
adel_karam,comedy casino beirut hello wow evening evening...
amy_schumer,fuck yeah night celebrating i highschool crush...
beth_stelling,stellings comedy girl daddy hbo max show varsi...
big_jay_oakerson,lets hey hey hey hey hey lets wow lot bravado ...
chelsea_handler,author number york times books star chelsea ch...
chris_rock,lets anything i bitch paint house death penalt...
dave_chappelle,dreamer chappelles hometown washington dc linc...
david_cross,cross making comedy comedian actor david cross...
dylan_moran,ladies gentlemen stage mr dylan hello thank th...
george_carlin,sergeant george carlins premise warrior troops...


In [18]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said', 
                  'get', 'going', 'want', 'make', 'way', 'good', 'thing', 'need', 
                  'lot', 'really', 'come', 'look', 'use']
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn


Unnamed: 0,aah,ab,abbas,abduction,ability,abo,abortion,abortions,abraham,absense,...,zero,zillion,zip,zit,zombie,zombies,zone,zoo,zoologist,zoom
adel_karam,0,0,1,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
amy_schumer,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
beth_stelling,0,0,0,0,0,0,5,0,0,0,...,0,0,1,0,0,0,0,0,0,0
big_jay_oakerson,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
chelsea_handler,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
chris_rock,0,0,0,0,0,0,7,2,0,0,...,0,0,0,0,0,0,0,1,0,0
dave_chappelle,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
david_cross,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dylan_moran,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
george_carlin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [20]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.009*"man" + 0.008*"hes" + 0.006*"day" + 0.006*"gon" + 0.006*"things" + 0.006*"theyre" + 0.006*"life" + 0.006*"kids" + 0.005*"cause" + 0.005*"years"'),
 (1,
  '0.012*"man" + 0.009*"shes" + 0.008*"life" + 0.008*"guy" + 0.007*"cause" + 0.007*"hes" + 0.007*"shit" + 0.006*"day" + 0.006*"everybody" + 0.006*"fuck"')]

In [21]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.010*"man" + 0.008*"guy" + 0.008*"shes" + 0.008*"shit" + 0.008*"hes" + 0.007*"kids" + 0.006*"gon" + 0.006*"cause" + 0.006*"life" + 0.006*"day"'),
 (1,
  '0.009*"hes" + 0.007*"okay" + 0.007*"theyre" + 0.006*"cause" + 0.006*"day" + 0.005*"god" + 0.005*"gon" + 0.005*"shes" + 0.005*"man" + 0.005*"women"'),
 (2,
  '0.016*"man" + 0.009*"life" + 0.008*"everybody" + 0.008*"day" + 0.007*"hes" + 0.006*"things" + 0.006*"shit" + 0.006*"cause" + 0.006*"world" + 0.006*"gon"')]

In [22]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.009*"hes" + 0.007*"theyre" + 0.007*"god" + 0.006*"day" + 0.006*"cause" + 0.005*"gon" + 0.005*"okay" + 0.005*"kids" + 0.005*"life" + 0.005*"bit"'),
 (1,
  '0.010*"mom" + 0.006*"shes" + 0.006*"hes" + 0.006*"sarah" + 0.005*"cause" + 0.005*"house" + 0.005*"okay" + 0.004*"life" + 0.004*"years" + 0.004*"cooper"'),
 (2,
  '0.015*"man" + 0.009*"gon" + 0.008*"day" + 0.008*"hes" + 0.007*"life" + 0.007*"dick" + 0.007*"shes" + 0.007*"things" + 0.007*"shit" + 0.007*"guy"'),
 (3,
  '0.013*"man" + 0.009*"shit" + 0.008*"life" + 0.008*"hes" + 0.008*"guy" + 0.007*"everybody" + 0.007*"kids" + 0.007*"theyre" + 0.007*"women" + 0.007*"shes"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [23]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [24]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
adel_karam,netflix comedy casino du liban beirut hello wo...
amy_schumer,fuck yeah big night celebrating i highschool c...
beth_stelling,beth stellings comedy special girl daddy hbo m...
big_jay_oakerson,lets hey hey hey hey hey lets wow lot bravado ...
chelsea_handler,author number new york times books star chelse...
chris_rock,lets ill anything i bitch paint house death pe...
dave_chappelle,dreamer chappelles hometown washington dc linc...
david_cross,david cross making great standup comedy specia...
dylan_moran,ladies gentlemen stage mr dylan hey hello than...
george_carlin,indian sergeant george carlins premise indian ...


In [25]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aah,ab,abandoned,abbas,abduction,ability,able,abo,abortion,abortions,...,zero,zillion,zip,zit,zombie,zombies,zone,zoo,zoologist,zoom
adel_karam,0,0,0,2,0,0,0,10,0,0,...,0,0,0,1,0,0,0,0,0,0
amy_schumer,1,0,0,0,0,0,1,0,0,0,...,1,0,0,0,1,0,0,0,0,0
beth_stelling,0,0,0,0,0,0,1,0,5,0,...,0,0,1,0,0,0,0,0,0,0
big_jay_oakerson,0,0,2,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
chelsea_handler,0,0,0,0,0,1,5,0,0,0,...,0,0,0,0,0,0,0,0,0,0
chris_rock,0,0,0,0,0,0,2,0,7,2,...,0,0,0,0,0,0,0,1,0,0
dave_chappelle,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
david_cross,0,0,0,0,0,0,6,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dylan_moran,0,1,0,0,0,0,4,0,0,0,...,0,0,0,0,1,0,0,0,0,0
george_carlin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
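
The `max_df=.8` setting tells `CountVectorizer` to ignore terms that appear in more than 80% of the documents. With 20 transcripts, that means terms found in more than 16 of them. The pure-Python sketch below illustrates the rule on made-up documents; `drop_common_terms` is a hypothetical helper, not scikit-learn's implementation.

```python
def drop_common_terms(docs, max_df=0.8):
    """Return the terms that appear in at most max_df of the documents."""
    n_docs = len(docs)
    doc_freq = {}
    for text in docs:
        # Count each term once per document, however often it repeats
        for term in set(text.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df <= max_df * n_docs)

# Toy documents, made up for illustration
toy_docs = [
    "man life jokes",
    "man house mom",
    "man anthem song",
    "man school class",
    "man life house",
]

kept = drop_common_terms(toy_docs)
print(kept)
```

Here "man" appears in all five toy documents and is dropped, while the rarer terms survive.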


In [26]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [27]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"women" + 0.004*"mom" + 0.004*"fuck" + 0.003*"men" + 0.003*"fat" + 0.003*"ta" + 0.003*"fucking" + 0.003*"somebody" + 0.003*"girls" + 0.003*"fun"'),
 (1,
  '0.006*"dick" + 0.005*"black" + 0.005*"fuck" + 0.004*"house" + 0.004*"fck" + 0.003*"women" + 0.003*"ta" + 0.003*"ass" + 0.003*"men" + 0.003*"school"')]

In [28]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"ta" + 0.004*"women" + 0.004*"fuck" + 0.004*"fck" + 0.004*"men" + 0.003*"fat" + 0.003*"house" + 0.003*"wife" + 0.003*"kid" + 0.003*"stuff"'),
 (1,
  '0.008*"fuck" + 0.007*"mom" + 0.007*"dick" + 0.006*"black" + 0.004*"women" + 0.004*"girls" + 0.004*"fucking" + 0.004*"dude" + 0.003*"ass" + 0.003*"pussy"'),
 (2,
  '0.005*"sarah" + 0.004*"anthem" + 0.004*"nice" + 0.004*"cooper" + 0.004*"morning" + 0.003*"women" + 0.003*"news" + 0.003*"somebody" + 0.003*"sex" + 0.003*"america"')]

In [29]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.007*"fuck" + 0.007*"women" + 0.005*"dick" + 0.004*"black" + 0.004*"men" + 0.004*"somebody" + 0.004*"fucking" + 0.004*"fun" + 0.004*"girls" + 0.003*"mom"'),
 (1,
  '0.008*"anthem" + 0.005*"song" + 0.005*"germany" + 0.005*"america" + 0.005*"french" + 0.004*"bathroom" + 0.004*"girls" + 0.004*"number" + 0.004*"national" + 0.004*"country"'),
 (2,
  '0.005*"sarah" + 0.004*"morning" + 0.004*"nice" + 0.004*"cooper" + 0.003*"ta" + 0.003*"water" + 0.003*"dani" + 0.003*"news" + 0.003*"everythings" + 0.003*"hell"'),
 (3,
  '0.009*"fck" + 0.007*"house" + 0.007*"mom" + 0.006*"fcking" + 0.005*"fat" + 0.005*"ta" + 0.005*"fuck" + 0.004*"ass" + 0.004*"dick" + 0.003*"gay"')]

## Identify Topics in Each Document

Out of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. Let's rerun it with more passes to get more fine-tuned topics.

In [30]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.008*"fck" + 0.006*"fcking" + 0.005*"anthem" + 0.004*"mom" + 0.004*"house" + 0.004*"sex" + 0.004*"somebody" + 0.004*"ass" + 0.004*"ta" + 0.004*"dick"'),
 (1,
  '0.007*"fuck" + 0.007*"black" + 0.005*"ass" + 0.005*"women" + 0.005*"dani" + 0.005*"class" + 0.005*"school" + 0.004*"ngga" + 0.004*"men" + 0.004*"shoes"'),
 (2,
  '0.007*"fat" + 0.005*"ta" + 0.004*"game" + 0.003*"gay" + 0.003*"book" + 0.003*"men" + 0.003*"natural" + 0.003*"earth" + 0.003*"ready" + 0.003*"plane"'),
 (3,
  '0.006*"fuck" + 0.005*"women" + 0.005*"dick" + 0.005*"mom" + 0.004*"fucking" + 0.003*"fun" + 0.003*"somebody" + 0.003*"girls" + 0.003*"nice" + 0.003*"house"')]

Based on the topic keywords above, we can infer a general theme for each of the four topics:

* Topic 0: Sex and Vulgarity
Keywords: "fck", "fcking", "sex", "ass", "dick"
This topic seems to focus on explicit language and sexual content.

* Topic 1: Race and Gender
Keywords: "black", "women", "class", "school", "ngga"
This topic appears to revolve around issues related to race, gender, and societal stereotypes.

* Topic 2: Body Image and Identity
Keywords: "fat", "gay", "natural", "earth"
This topic seems to discuss themes related to body image, sexuality, and perhaps identity.

* Topic 3: Relationships and Social Interactions
Keywords: "mom", "women", "dick", "girls", "house"
This topic might involve discussions about relationships, family dynamics, and social interactions.

These are broad interpretations based on the provided keywords. Depending on the context of the comedian's routine, the actual topics could be more nuanced.

In [32]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))

[(1, 'adel_karam'),
 (0, 'amy_schumer'),
 (0, 'beth_stelling'),
 (3, 'big_jay_oakerson'),
 (3, 'chelsea_handler'),
 (1, 'chris_rock'),
 (3, 'dave_chappelle'),
 (3, 'david_cross'),
 (3, 'dylan_moran'),
 (1, 'george_carlin'),
 (3, 'iliza_shlesinger'),
 (0, 'kevin_hart'),
 (2, 'kevin_james'),
 (3, 'louis_c_k'),
 (3, 'matt_rife'),
 (3, 'pete_davidson'),
 (2, 'ricky_gervais'),
 (3, 'sarah_cooper'),
 (3, 'tom_segura'),
 (0, 'trevor_noah')]
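
The `[(a,b)]` unpacking in the cell above only works when every document comes back with exactly one `(topic_id, probability)` pair. If a document has two or more topics, the unpacking raises a `ValueError`. A slightly more defensive sketch picks the highest-probability topic instead. The toy distributions are made up for illustration, and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Toy per-document topic distributions in the (topic_id, probability) shape;
# the numbers are made up for illustration
toy_transformed = [
    [(1, 0.99)],
    [(0, 0.62), (3, 0.38)],
]
toy_names = ["doc_a", "doc_b"]

assignments = [(dominant_topic(doc), name)
               for doc, name in zip(toy_transformed, toy_names)]
print(assignments)
```

In the notebook, the same helper could be applied to each element of `corpus_transformed`, assuming each element is a list of such pairs.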

Grouping the comedians by their assigned topic number gives the following division:

Topic 0: Sex and Vulgarity
- Amy Schumer
- Beth Stelling
- Kevin Hart
- Trevor Noah

Topic 1: Race and Gender
- Adel Karam
- Chris Rock
- George Carlin

Topic 2: Body Image and Identity
- Kevin James
- Ricky Gervais

Topic 3: Relationships and Social Interactions
- Big Jay Oakerson
- Chelsea Handler
- Dave Chappelle
- David Cross
- Dylan Moran
- Iliza Shlesinger
- Louis C.K.
- Matt Rife
- Pete Davidson
- Sarah Cooper
- Tom Segura

These categorizations are based on the topics inferred from the provided keywords and may not perfectly align with the content or style of each comedian's routine.

### Assignment:
1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics.

## Assignment 1

In [33]:
# Assignment 1: increase to 6 topics and 100 passes
ldana = models.LdaModel(corpus=corpusna, num_topics=6, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.010*"fck" + 0.007*"fcking" + 0.006*"fat" + 0.005*"ta" + 0.005*"house" + 0.005*"wife" + 0.005*"ngga" + 0.004*"fuck" + 0.004*"ass" + 0.004*"jokes"'),
 (1,
  '0.010*"dick" + 0.007*"sarah" + 0.005*"nice" + 0.005*"black" + 0.005*"cooper" + 0.004*"dude" + 0.004*"hi" + 0.004*"girls" + 0.004*"president" + 0.004*"cool"'),
 (2,
  '0.009*"anthem" + 0.006*"dani" + 0.006*"germany" + 0.005*"america" + 0.005*"song" + 0.005*"number" + 0.005*"class" + 0.004*"national" + 0.004*"country" + 0.004*"public"'),
 (3,
  '0.012*"mom" + 0.006*"ta" + 0.005*"house" + 0.005*"fuck" + 0.004*"game" + 0.004*"stalker" + 0.004*"uh" + 0.003*"dad" + 0.003*"belt" + 0.003*"kid"'),
 (4,
  '0.000*"fuck" + 0.000*"mom" + 0.000*"natural" + 0.000*"men" + 0.000*"ass" + 0.000*"village" + 0.000*"ones" + 0.000*"house" + 0.000*"girls" + 0.000*"easy"'),
 (5,
  '0.008*"women" + 0.007*"fuck" + 0.005*"men" + 0.005*"somebody" + 0.004*"girls" + 0.004*"black" + 0.004*"mom" + 0.004*"fucking" + 0.004*"fun" + 0.003*"dog"')]

Based on the topic keywords above, we can infer the following theme for each of the six topics:

Topic 0: Vulgarity and Relationships
- Keywords: "fck", "fcking", "fuck", "ass", "wife", "jokes"
- This topic appears to involve vulgar language and jokes related to relationships.

Topic 1: Social Interactions and Politics
- Keywords: "sarah", "cooper", "president", "girls", "cool"
- This topic seems to discuss social interactions, politics, and possibly gender-related issues.

Topic 2: National Identity and Patriotism
- Keywords: "anthem", "germany", "america", "song", "national", "country"
- This topic may involve discussions about national identity, anthems, and patriotism.

Topic 3: Family and Childhood
- Keywords: "mom", "house", "game", "stalker", "dad", "kid"
- This topic likely covers themes related to family dynamics, childhood experiences, and perhaps darker humor.

Topic 4: Empty Topic
- Keywords: "natural", "men", "village", "ones", "house", "girls", "easy"
- Every keyword weight rounds to 0.000, and no transcript is assigned to this topic in the output below, so the model effectively found only five usable topics.

Topic 5: Gender and Relationships
- Keywords: "women", "fuck", "men", "somebody", "girls", "mom", "fucking", "fun"
- This topic could involve discussions about gender dynamics, relationships, and possibly sexual content.

These are broad interpretations based on the provided keywords. Depending on the context of the comedian's routine, the actual topics could be more nuanced.

In [35]:
# Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))

[(2, 'adel_karam'),
 (5, 'amy_schumer'),
 (5, 'beth_stelling'),
 (1, 'big_jay_oakerson'),
 (5, 'chelsea_handler'),
 (5, 'chris_rock'),
 (0, 'dave_chappelle'),
 (1, 'david_cross'),
 (5, 'dylan_moran'),
 (2, 'george_carlin'),
 (5, 'iliza_shlesinger'),
 (0, 'kevin_hart'),
 (3, 'kevin_james'),
 (5, 'louis_c_k'),
 (5, 'matt_rife'),
 (3, 'pete_davidson'),
 (0, 'ricky_gervais'),
 (1, 'sarah_cooper'),
 (5, 'tom_segura'),
 (2, 'trevor_noah')]
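
A quick tally of the assignments above shows how unevenly the six topics are used. The sketch below copies the topic ids from the output, in the same order, and counts the transcripts per topic.

```python
from collections import Counter

# Topic ids copied from the assignment output above, in the same order
assigned = [2, 5, 5, 1, 5, 5, 0, 1, 5, 2, 5, 0, 3, 5, 5, 3, 0, 1, 5, 2]

counts = Counter(assigned)
num_topics = 6

# Topics that no transcript was assigned to
empty_topics = [t for t in range(num_topics) if counts[t] == 0]

print(sorted(counts.items()))
print(empty_topics)
```

Topic 5 absorbs 9 of the 20 transcripts and topic 4 receives none, which suggests that 6 topics may be more than this corpus supports.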

## Assignment 2
To build a topic model from a different part of speech, change the filter in the text preprocessing step. For example, you can keep only verbs, only adjectives, or only adverbs instead of nouns and adjectives. The code below keeps only verbs:

In [36]:
# Function to filter for verbs in a string of text
def verbs(text):
    '''Given a string of text, tokenize the text and pull out only the verbs.'''
    is_verb = lambda pos: pos[:2] == 'VB'  # matches every verb tag: VB, VBD, VBG, VBN, VBP, VBZ
    tokenized = word_tokenize(text)
    all_verbs = [word for (word, pos) in pos_tag(tokenized) if is_verb(pos)] 
    return ' '.join(all_verbs)

# Apply the verbs function to the transcripts to filter only on verbs
data_verbs = pd.DataFrame(data_clean.transcript.apply(verbs))

# Recreate a document-term matrix using only verbs
cvv = CountVectorizer(stop_words=stop_words)
data_cvv = cvv.fit_transform(data_verbs.transcript)
data_dtmv = pd.DataFrame(data_cvv.toarray(), columns=cvv.get_feature_names_out())
data_dtmv.index = data_verbs.index

# Create the gensim corpus
corpusv = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmv.transpose()))

# Create the vocabulary dictionary
id2wordv = dict((v, k) for k, v in cvv.vocabulary_.items())

In [37]:
# Let's try 4 topics
ldav = models.LdaModel(corpus=corpusv, num_topics=4, id2word=id2wordv, passes=10)
ldav.print_topics()

[(0,
  '0.024*"goes" + 0.016*"did" + 0.014*"say" + 0.014*"fucking" + 0.011*"doing" + 0.011*"tell" + 0.010*"mean" + 0.010*"guys" + 0.009*"gon" + 0.009*"feel"'),
 (1,
  '0.016*"did" + 0.015*"say" + 0.011*"went" + 0.011*"goes" + 0.011*"doing" + 0.010*"didnt" + 0.010*"tell" + 0.009*"came" + 0.008*"theres" + 0.008*"mean"'),
 (2,
  '0.015*"love" + 0.014*"say" + 0.012*"did" + 0.010*"doing" + 0.010*"fucking" + 0.009*"gon" + 0.008*"trying" + 0.008*"tell" + 0.007*"getting" + 0.007*"went"'),
 (3,
  '0.017*"tell" + 0.013*"say" + 0.012*"did" + 0.010*"didnt" + 0.008*"theres" + 0.008*"gon" + 0.008*"doing" + 0.008*"hes" + 0.008*"went" + 0.007*"love"')]

Based on the topic keywords above, we can infer the following theme for each of the four topics:

Topic 0: Commentary and Expression
- Keywords: "goes", "say", "fucking", "doing", "tell", "guys", "mean", "feel"
- This topic appears to involve commentary, expressions, and perhaps the comedian's observations on various subjects.

Topic 1: Narrative and Action
- Keywords: "did", "say", "went", "doing", "didnt", "tell", "came", "theres", "mean"
- This topic might focus on narratives, actions, and descriptions of events or situations.

Topic 2: Emotion and Effort
- Keywords: "love", "say", "did", "doing", "fucking", "gon", "trying", "tell", "getting", "went"
- This topic could involve discussions about emotions, efforts, and the comedian's experiences.

Topic 3: Statements and Reactions
- Keywords: "tell", "say", "did", "didnt", "theres", "gon", "doing", "hes", "went", "love"
- This topic may revolve around statements, reactions, and the comedian's responses to various scenarios.

These are broad interpretations based on the provided keywords. Note that "say", "did", "doing", and "tell" appear in all four topics, so the verb-only topics overlap heavily and the labels above are tentative at best.

In [1]:
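# Let's take a look at which topics each transcript contains
# ldav was trained on the verb corpus, so use corpusv and data_dtmv here (not corpusna / data_dtmna)
# Run the cells above first: in a fresh kernel ldav is undefined and the cell fails with the error shown below
corpus_transformed = ldav[corpusv]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmv.index))

NameError: name 'ldav' is not defined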