# Topic Modeling

## Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
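
To make the first input concrete, here is a minimal sketch of a document-term matrix built with the standard library only. The two toy documents and their names are made up for illustration; the notebook itself loads a matrix that was built earlier with `CountVectorizer`.

```python
from collections import Counter

# Hypothetical toy corpus: document name -> cleaned text
docs = {
    "doc_a": "cats chase mice",
    "doc_b": "dogs chase cats and cats",
}

# Count how often each term occurs in each document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the sorted set of all terms seen in any document
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per term
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row is a document and each column is a term, which is the same layout as the `data` table loaded in the next section.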

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even try a different model.

## Topic Modeling - Attempt #1 (All Text)

In [2]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

Unnamed: 0,aaaaah,aaaahhh,aaah,aah,abandoned,abc,abilities,ability,able,abnormal,...,zone,zones,zoo,zoom,zorro,zuck,zuckerbergs,álvarez,ándale,ñañaras
david,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,0
gabriel,0,0,1,0,0,1,0,0,0,0,...,0,1,0,4,0,0,0,3,1,1
george,0,0,0,0,0,0,0,1,3,0,...,0,0,0,0,0,1,1,0,0,0
jon,0,0,0,0,0,0,0,0,2,0,...,1,0,0,0,0,0,0,0,0,0
kate,0,0,0,1,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
kevin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
leanne,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
lewis,0,0,0,0,0,0,0,0,1,4,...,0,0,0,0,0,0,0,0,0,0
louis,0,1,0,4,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
matt,0,0,0,0,0,0,0,0,3,0,...,1,0,0,0,0,0,0,0,0,0


In [3]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [4]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,david,gabriel,george,jon,kate,kevin,leanne,lewis,louis,matt,pete,roseanne,sammy,shane,stavros
aaaaah,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
aaaahhh,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
aaah,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
aah,0,0,0,0,1,0,0,0,4,0,1,0,0,0,0
abandoned,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [5]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [6]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
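
The inversion in the cell above can be illustrated in isolation. In this sketch the `vocabulary` dict is a hypothetical stand-in for `cv.vocabulary_`, which maps each term to its column index; gensim needs the opposite direction, from index to term.

```python
# Hypothetical stand-in for cv.vocabulary_ (term -> column index)
vocabulary = {"audience": 0, "irish": 1, "jesus": 2}

# Invert the mapping so a term can be looked up by its index
id2word = {index: term for term, index in vocabulary.items()}

print(id2word[1])
```

The dict comprehension is equivalent to the `dict((v, k) for k, v in ...)` form used in the notebook.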

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.

In [25]:
# Fit an LDA model with 2 topics and 50 passes over the corpus
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=50)
lda.print_topics()

[(0,
  '0.008*"fucking" + 0.007*"okay" + 0.005*"life" + 0.005*"fuck" + 0.005*"theres" + 0.004*"didnt" + 0.004*"did" + 0.004*"mean" + 0.004*"cause" + 0.004*"come"'),
 (1,
  '0.008*"audience" + 0.006*"laughing" + 0.005*"did" + 0.004*"thing" + 0.004*"cause" + 0.004*"shes" + 0.004*"okay" + 0.004*"make" + 0.004*"didnt" + 0.004*"tell"')]

In [26]:
# Refit the same model to see how much the topics change between runs
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=50)
lda.print_topics()

[(0,
  '0.005*"okay" + 0.005*"did" + 0.005*"thank" + 0.004*"tell" + 0.004*"ive" + 0.004*"theres" + 0.004*"make" + 0.004*"thing" + 0.004*"look" + 0.004*"cause"'),
 (1,
  '0.008*"audience" + 0.005*"laughing" + 0.005*"okay" + 0.005*"did" + 0.005*"fucking" + 0.004*"didnt" + 0.004*"life" + 0.004*"cause" + 0.004*"shes" + 0.004*"goes"')]

The weights in the LDA topic output (`lda.print_topics()`) change slightly every time you run the code because of inherent randomness in the LDA algorithm. Here's why:

* Random initialization: LDA starts from randomly initialized internal variables, such as the topic assignments for each word. This starting point can influence the final converged model slightly.
* Optimization process: LDA uses an iterative optimization process to find the best topic assignments for words. While the algorithm converges towards a good solution, slight variations in the path it takes can lead to minor weight differences.

These factors cause the weights to fluctuate slightly across runs. However, the overall thematic trends identified by the model should remain broadly consistent.

Here are some ways to potentially reduce the weight variation:

* Set a random seed: LDA libraries usually let you fix a random seed for the initialization (in gensim, the `random_state` argument of `LdaModel`). This gives every run the same starting point and leads to more consistent results (though some randomness might still be present in the optimization).
* Increase the number of passes: More passes over the corpus (the `passes` argument) give the model more opportunity to converge, potentially reducing weight variations. However, this also increases computation time.

In [48]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=100)
lda.print_topics()

[(0,
  '0.017*"audience" + 0.012*"laughing" + 0.006*"irish" + 0.004*"bit" + 0.004*"world" + 0.004*"applauding" + 0.004*"america" + 0.004*"years" + 0.004*"didnt" + 0.004*"did"'),
 (1,
  '0.009*"okay" + 0.008*"fucking" + 0.005*"fuck" + 0.005*"cause" + 0.005*"did" + 0.005*"tell" + 0.005*"shes" + 0.005*"didnt" + 0.004*"life" + 0.004*"goes"'),
 (2,
  '0.006*"did" + 0.005*"theres" + 0.004*"life" + 0.004*"shes" + 0.004*"okay" + 0.004*"ive" + 0.004*"mean" + 0.004*"tell" + 0.004*"doing" + 0.004*"love"')]

In [47]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=100)
lda.print_topics()

[(0,
  '0.006*"did" + 0.006*"guys" + 0.005*"fcking" + 0.005*"shes" + 0.005*"dude" + 0.005*"dad" + 0.004*"guy" + 0.004*"theres" + 0.004*"life" + 0.004*"doing"'),
 (1,
  '0.016*"audience" + 0.010*"laughing" + 0.008*"fucking" + 0.006*"didnt" + 0.005*"did" + 0.004*"shit" + 0.004*"irish" + 0.004*"fuck" + 0.004*"thing" + 0.004*"love"'),
 (2,
  '0.008*"okay" + 0.005*"theres" + 0.005*"cause" + 0.005*"life" + 0.004*"goes" + 0.004*"make" + 0.004*"did" + 0.004*"shes" + 0.004*"fucking" + 0.004*"tell"'),
 (3,
  '0.006*"okay" + 0.005*"tell" + 0.005*"look" + 0.005*"did" + 0.004*"thank" + 0.004*"thing" + 0.004*"cause" + 0.004*"make" + 0.004*"guys" + 0.004*"way"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

In [11]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
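
The `pos[:2] == 'NN'` check works because every Penn Treebank noun tag (`NN`, `NNS`, `NNP`, `NNPS`) starts with `NN`. The sketch below shows the same prefix filter on hand-tagged tokens, so it runs without downloading any NLTK data. The tagged sentence and the helper name `keep_by_prefix` are hypothetical.

```python
# Hypothetical, hand-tagged tokens in the (word, tag) form that pos_tag returns
tagged = [("the", "DT"), ("comedians", "NNS"), ("told", "VBD"),
          ("a", "DT"), ("funny", "JJ"), ("story", "NN"),
          ("about", "IN"), ("dublin", "NNP")]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep words whose Penn Treebank tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_prefix(tagged, ["NN"])
nouns_and_adjectives = keep_by_prefix(tagged, ["NN", "JJ"])

print(nouns_only)
print(nouns_and_adjectives)
```

Passing a different list of prefixes (for example `["VB"]` for verbs) is one way to experiment with other parts of speech.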

In [12]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
david,in this comedy special david nihill humorousl...
gabriel,can you please state your name martin moreno ...
george,george carlin im glad im dead is a controvers...
jon,in an interview conducted by jon stewart georg...
kate,whoa okay yeah good okay dont embarrass yourse...
kevin,kevin james irregardless in kevin james irreg...
leanne,leanne morgan im every woman in im every woma...
lewis,lewis black tragically i need you is a standu...
louis,louis ck at the dolby is louis cks third selfr...
matt,in his second hourlong comedy special matthew ...


In [13]:
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

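Purpose of `punkt`: it provides the pretrained Punkt tokenizer models that `word_tokenize` depends on. The Punkt sentence tokenizer:

* Divides a block of text into a list of sentences.
* Handles cases where periods (.) might not indicate sentence boundaries (e.g., abbreviations such as "Mr.").
* Accounts for sentences that don't begin with capital letters (e.g., dialogue).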

In [14]:
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

The `averaged_perceptron_tagger` is part of the Natural Language Toolkit (NLTK) library in Python. It is the pretrained model behind `pos_tag` and performs Part-of-Speech (POS) tagging, which involves assigning grammatical labels (noun, verb, adjective, etc.) to each word in a text.

In [15]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns


Unnamed: 0,transcript
david,comedy david nihill complexities identity norm...
gabriel,state name martin moreno martinnnnn gabriel ig...
george,george carlin im im dead project debate concer...
jon,interview stewart george carlin talks aspects ...
kate,whoa okay okay dont embarrass expectations i w...
kevin,kevin james kevin james standup comedy kevin j...
leanne,leanne morgan woman im woman delivers performa...
lewis,comedy show signature mix sarcasm outrage impa...
louis,louis ck dolby cks standup comedy pandemic gra...
matt,comedy matthew stephen rife rife performance n...


In [19]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))  # Convert frozenset to list

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn


Unnamed: 0,aaaaah,aaaahhh,aaah,aah,abc,abilities,ability,abolitionist,abomination,abortion,...,zionists,zip,zone,zones,zoo,zoom,zorro,zuck,zuckerbergs,álvarez
david,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,3,0,0,0
gabriel,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,3,0,0,0,3
george,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,1,1,0
jon,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
kate,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kevin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
leanne,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
lewis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
louis,0,1,0,2,0,0,0,0,0,7,...,0,0,0,0,0,0,0,0,0,0
matt,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [20]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [40]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=100)
ldan.print_topics()

[(0,
  '0.016*"audience" + 0.009*"hes" + 0.008*"thing" + 0.008*"life" + 0.007*"way" + 0.007*"day" + 0.007*"shes" + 0.006*"gon" + 0.006*"man" + 0.006*"years"'),
 (1,
  '0.007*"life" + 0.007*"hes" + 0.006*"way" + 0.006*"day" + 0.006*"lot" + 0.006*"cause" + 0.006*"thing" + 0.005*"shes" + 0.005*"gon" + 0.005*"theyre"')]

In [41]:
# Rerun with 2 topics to compare against the previous run
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=100)
ldan.print_topics()

[(0,
  '0.020*"audience" + 0.008*"hes" + 0.008*"thing" + 0.007*"way" + 0.007*"cause" + 0.007*"life" + 0.006*"okay" + 0.006*"years" + 0.006*"gon" + 0.005*"day"'),
 (1,
  '0.009*"life" + 0.008*"hes" + 0.007*"day" + 0.007*"shes" + 0.007*"thing" + 0.007*"lot" + 0.006*"guy" + 0.006*"way" + 0.006*"gon" + 0.006*"theyre"')]

In [42]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=100)
ldan.print_topics()

[(0,
  '0.009*"hes" + 0.009*"cause" + 0.009*"okay" + 0.008*"way" + 0.008*"thing" + 0.007*"shes" + 0.007*"life" + 0.007*"day" + 0.005*"gon" + 0.005*"home"'),
 (1,
  '0.010*"life" + 0.008*"hes" + 0.008*"man" + 0.006*"years" + 0.006*"day" + 0.006*"jesus" + 0.006*"shit" + 0.006*"mom" + 0.006*"lot" + 0.006*"guy"'),
 (2,
  '0.021*"audience" + 0.008*"gon" + 0.008*"thing" + 0.007*"way" + 0.007*"life" + 0.007*"day" + 0.007*"hes" + 0.006*"lot" + 0.006*"shes" + 0.006*"kind"')]

In [32]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=50)
ldan.print_topics()

[(0,
  '0.010*"life" + 0.009*"hes" + 0.009*"okay" + 0.007*"mom" + 0.007*"man" + 0.007*"day" + 0.007*"shes" + 0.006*"way" + 0.006*"guy" + 0.006*"lot"'),
 (1,
  '0.029*"audience" + 0.008*"gon" + 0.008*"way" + 0.008*"life" + 0.008*"cause" + 0.007*"bit" + 0.007*"lot" + 0.007*"day" + 0.007*"thing" + 0.007*"shes"'),
 (2,
  '0.009*"thing" + 0.008*"years" + 0.008*"hes" + 0.006*"way" + 0.006*"things" + 0.006*"gon" + 0.006*"life" + 0.005*"day" + 0.005*"thank" + 0.005*"theyre"'),
 (3,
  '0.008*"life" + 0.008*"hes" + 0.007*"kids" + 0.007*"shes" + 0.007*"day" + 0.006*"thing" + 0.006*"cause" + 0.005*"george" + 0.005*"game" + 0.005*"way"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [34]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    all_nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(all_nouns_adj)

In [35]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
david,comedy special david nihill complexities ident...
gabriel,state name martin moreno martinnnnn gabriel ig...
george,george carlin im glad im dead controversial pr...
jon,interview jon stewart george carlin talks vari...
kate,whoa okay good okay dont embarrass expectation...
kevin,kevin james kevin james standup comedy special...
leanne,leanne morgan woman im woman morgan delivers s...
lewis,lewis black standup comedy show lewis black si...
louis,louis ck dolby louis cks selfreleased standup ...
matt,second hourlong comedy special matthew stephen...


In [37]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aaaaah,aaaahhh,aaah,aah,abc,abilities,ability,able,abnormal,abolitionist,...,zionists,zip,zone,zones,zoo,zoom,zorro,zuck,zuckerbergs,álvarez
david,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,3,0,0,0
gabriel,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,3,0,0,0,3
george,0,0,0,0,0,0,1,3,0,0,...,0,0,0,0,0,0,0,1,1,0
jon,0,0,0,0,0,0,0,2,0,0,...,0,0,1,0,0,0,0,0,0,0
kate,0,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
kevin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
leanne,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
lewis,0,0,0,0,0,0,0,1,4,0,...,0,0,0,0,0,0,0,0,0,0
louis,0,1,0,3,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
matt,0,0,0,0,0,0,0,3,0,0,...,0,0,1,0,0,0,0,0,0,0
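
The `max_df=.8` setting above tells `CountVectorizer` to ignore terms that appear in more than 80% of the documents. The plain-Python sketch below illustrates that rule on a made-up corpus; it is not how scikit-learn implements it, and the toy documents are hypothetical.

```python
# Hypothetical toy corpus of already-tokenized documents
docs = [
    ["okay", "audience", "irish"],
    ["okay", "mom", "baby"],
    ["okay", "audience", "jesus"],
    ["okay", "car", "money"],
    ["okay", "audience", "story"],
]

max_df = 0.8  # same threshold as in the notebook

# Document frequency: the share of documents that contain each term
vocab = sorted({term for doc in docs for term in doc})
doc_freq = {t: sum(t in doc for doc in docs) / len(docs) for t in vocab}

# Keep terms whose document frequency does not exceed max_df
kept = [t for t in vocab if doc_freq[t] <= max_df]

print(doc_freq["okay"], doc_freq["audience"])
print(kept)
```

Here "okay" occurs in every document and is dropped, while "audience" occurs in three of five and is kept. Lowering `max_df` removes more of the shared filler words, at the risk of also removing terms that define a topic.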


In [38]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

* **Structure:** A gensim corpus is not a collection of raw text strings. It is an iterable (a list or a stream) in which each element represents one document.
* **Documents:** Each document is a bag-of-words vector, i.e. a list of `(term_id, count)` tuples covering only the terms that occur in that document. The `id2word` dictionary maps each `term_id` back to its term.
* **Purpose:** The corpus is the training input for gensim models such as `LdaModel`, which learn from the statistical relationships between terms and documents.
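
To show what such a corpus looks like, this sketch converts a tiny term-document table into per-document lists of `(term_id, count)` pairs using plain Python. It only mimics the shape of what `matutils.Sparse2Corpus` produces; the terms and counts are made up.

```python
# Hypothetical term-document counts: rows are terms, columns are documents
terms = ["audience", "irish", "jesus"]
tdm = [
    [3, 0],  # audience
    [1, 0],  # irish
    [0, 2],  # jesus
]

# One bag-of-words vector per document (column): (term_id, count) for nonzero counts
n_docs = len(tdm[0])
bow_corpus = [
    [(term_id, row[doc]) for term_id, row in enumerate(tdm) if row[doc] > 0]
    for doc in range(n_docs)
]

# Mapping from term id back to the term, like id2wordna in the notebook
id2word = dict(enumerate(terms))

for doc in bow_corpus:
    print([(id2word[term_id], count) for term_id, count in doc])
```

Zero counts are left out entirely, which is why the sparse format suits large vocabularies where most terms never occur in a given document.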

In [43]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.004*"jesus" + 0.004*"okay" + 0.004*"cause" + 0.004*"woman" + 0.004*"shit" + 0.003*"baby" + 0.003*"yall" + 0.003*"nice" + 0.003*"george" + 0.003*"fck"'),
 (1,
  '0.016*"audience" + 0.008*"okay" + 0.006*"cause" + 0.005*"irish" + 0.004*"fuck" + 0.004*"mom" + 0.004*"shit" + 0.004*"ill" + 0.003*"um" + 0.003*"year"')]

In [44]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.006*"mom" + 0.005*"shit" + 0.004*"fck" + 0.004*"guys" + 0.004*"cause" + 0.003*"dude" + 0.003*"fcking" + 0.003*"okay" + 0.003*"ill" + 0.003*"nice"'),
 (1,
  '0.026*"audience" + 0.009*"okay" + 0.008*"irish" + 0.007*"cause" + 0.006*"jesus" + 0.005*"um" + 0.004*"ill" + 0.004*"fuck" + 0.004*"red" + 0.004*"moment"'),
 (2,
  '0.006*"okay" + 0.005*"cause" + 0.005*"yall" + 0.005*"baby" + 0.003*"husband" + 0.003*"year" + 0.003*"car" + 0.003*"money" + 0.003*"shit" + 0.003*"weeks"')]

In [45]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.010*"okay" + 0.007*"mom" + 0.006*"cause" + 0.005*"yall" + 0.005*"ill" + 0.004*"um" + 0.004*"ta" + 0.004*"guys" + 0.004*"husband" + 0.004*"baby"'),
 (1,
  '0.007*"george" + 0.006*"shit" + 0.005*"carlin" + 0.004*"red" + 0.004*"fck" + 0.004*"okay" + 0.004*"asshole" + 0.004*"hard" + 0.004*"fcking" + 0.004*"fuck"'),
 (2,
  '0.019*"audience" + 0.007*"cause" + 0.005*"irish" + 0.005*"okay" + 0.004*"jesus" + 0.004*"shit" + 0.004*"story" + 0.004*"country" + 0.003*"nice" + 0.003*"fuck"'),
 (3,
  '0.000*"connections" + 0.000*"connected" + 0.000*"youtube" + 0.000*"hat" + 0.000*"recognize" + 0.000*"complex" + 0.000*"map" + 0.000*"math" + 0.000*"borders" + 0.000*"dinosaurs"')]

## Identify Topics in Each Document

Of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. Let's refit it here (with the same 100 passes) and use it as our final model.

In [46]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.017*"jesus" + 0.008*"cause" + 0.006*"fuck" + 0.006*"confidence" + 0.005*"nice" + 0.005*"okay" + 0.004*"story" + 0.004*"weird" + 0.004*"dead" + 0.004*"homeless"'),
 (1,
  '0.007*"fuck" + 0.006*"okay" + 0.006*"mom" + 0.006*"shit" + 0.005*"fucking" + 0.005*"yall" + 0.004*"dude" + 0.004*"cause" + 0.004*"funny" + 0.004*"friend"'),
 (2,
  '0.011*"okay" + 0.005*"cause" + 0.005*"ill" + 0.004*"um" + 0.004*"woman" + 0.004*"sex" + 0.004*"mom" + 0.003*"car" + 0.003*"shit" + 0.003*"sorry"'),
 (3,
  '0.026*"audience" + 0.008*"irish" + 0.006*"cause" + 0.005*"george" + 0.004*"america" + 0.004*"american" + 0.003*"carlin" + 0.003*"ireland" + 0.003*"moment" + 0.003*"fck"')]

These four topics look pretty decent. Let's settle on these for now.
* Topic 0: religion
* Topic 1: friend
* Topic 2: apology
* Topic 3: politics

In [50]:
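# Let's take a look at which topic dominates each transcript
corpus_transformed = ldana[corpusna]
# Each document is a list of (topic_id, probability) pairs; keep the most probable topic
dominant = [max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed]
list(zip(dominant, data_dtmna.index))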

[(3, 'david'),
 (2, 'gabriel'),
 (3, 'george'),
 (3, 'jon'),
 (2, 'kate'),
 (3, 'kevin'),
 (1, 'leanne'),
 (1, 'lewis'),
 (0, 'louis'),
 (1, 'matt'),
 (1, 'pete'),
 (2, 'roseanne'),
 (1, 'sammy'),
 (1, 'shane'),
 (2, 'stavros')]
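
For each document, `ldana[corpusna]` yields a list of `(topic_id, probability)` pairs, and a document can have more than one pair when several topics clear the model's minimum probability. The sketch below picks the most probable topic in that case. The document names and probabilities are invented for illustration.

```python
# Hypothetical per-document output in the shape that ldana[corpusna] yields
doc_topics = {
    "doc_a": [(3, 0.99)],
    "doc_b": [(1, 0.35), (2, 0.64)],
}

def dominant_topic(pairs):
    """Return the topic id with the highest probability."""
    topic_id, _ = max(pairs, key=lambda pair: pair[1])
    return topic_id

assignments = {name: dominant_topic(pairs) for name, pairs in doc_topics.items()}
print(assignments)
```

Taking the maximum works for single-topic and mixed-topic documents alike, whereas unpacking with a pattern like `[(a, b)]` only works when every document has exactly one pair.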

For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.
* Topic 0 (religion): louis
* Topic 1 (friend): leanne, lewis, matt, pete, sammy, shane
* Topic 2 (apology): gabriel, kate, roseanne, stavros
* Topic 3 (politics): david, george, jon, kevin

## Additional Exercises

1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics.

In [51]:
from sklearn.decomposition import LatentDirichletAllocation

# Define the number of topics
num_topics = 10

# Further modify parameters to improve topics
lda_model = LatentDirichletAllocation(n_components=num_topics,          # Number of topics
                                      max_iter=10,                     # Maximum number of iterations
                                      learning_method='online',        # Learning method
                                      random_state=100,                # Random state for reproducibility
                                      n_jobs=-1)                       # Use all available CPU cores

# Fit the LDA model to the data
lda_output = lda_model.fit_transform(data_cvna)

# Display the top words for each topic
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-num_top_words - 1:-1]]))

# Define the number of top words to display for each topic
num_top_words = 10

# Display the topics
print("Topics found via LDA:")
display_topics(lda_model, cvna.get_feature_names_out(), num_top_words)


Topics found via LDA:
Topic 0:
mom fuck cause stalker okay weird terry fucking moms sorry
Topic 1:
cause mom okay room story uhh parents fucking sorry ill
Topic 2:
audience cause okay jesus shit irish woman ill sorry story
Topic 3:
okay shit cause fuck guys year ill white room fck
Topic 4:
um okay ill cameras kate cause tonight mirror whoa camera
Topic 5:
jesus cause audience fuck confidence story fucking okay auschwitz nice
Topic 6:
okay shit ill cause audience crazy news fck white money
Topic 7:
yall husband baby okay children lord money school bed woman
Topic 8:
audience irish cause ireland america moment laughing david american country
Topic 9:
audience okay cause shit ill mom guys room year funny
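
The slice `topic.argsort()[:-num_top_words - 1:-1]` in `display_topics` is compact but easy to misread. This small NumPy sketch walks through it on made-up weights; the vocabulary and numbers are hypothetical.

```python
import numpy as np

# Hypothetical weights for one topic and the matching vocabulary
feature_names = np.array(["audience", "baby", "irish", "jesus", "money"])
topic = np.array([0.30, 0.05, 0.25, 0.10, 0.02])
num_top_words = 3

# argsort returns indices in ascending order of weight;
# the slice walks backwards from the end and keeps num_top_words indices
top_idx = topic.argsort()[:-num_top_words - 1:-1]
top_words = feature_names[top_idx].tolist()

print(top_words)
```

The result lists the highest-weighted terms first, which matches the order of the words printed for each topic above.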


In [36]:
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

def nouns_adjectives_verbs(text):
    """
    Given a string of text, tokenize the text and pull out nouns, adjectives, and verbs.
    """
    # Tokenize the text
    tokens = word_tokenize(text)
    # Get the part-of-speech tags for each token
    tagged_words = pos_tag(tokens)
    # Define allowed POS tags for nouns, adjectives, and verbs
    allowed_tags = ['NN', 'NNS', 'NNP', 'NNPS',    # Nouns
                    'JJ', 'JJR', 'JJS',           # Adjectives
                    'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']  # Verbs
    # Filter tokens based on allowed POS tags
    filtered_words = [word for word, tag in tagged_words if tag in allowed_tags]
    return filtered_words

# Example usage:
text = "This is an example sentence containing nouns, adjectives, and verbs."
filtered_text = nouns_adjectives_verbs(text)
print(filtered_text)


['is', 'example', 'sentence', 'containing', 'nouns', 'adjectives', 'verbs']


## Summary of the code:

This code defines a function `nouns_adjectives_verbs` that takes a string of text as input and returns a list of nouns, adjectives, and verbs found in the text.

Here's a breakdown of the steps:

1. **Import libraries:**
    * `nltk.pos_tag`: Used for Part-of-Speech (POS) tagging.
    * `nltk.word_tokenize`: Used for tokenizing the text into words.
    * `nltk.corpus.stopwords` (not used in this specific function): Might be intended for future implementation to remove stop words (common words like "the", "a").

2. **Function definition:**
    * `nouns_adjectives_verbs(text)`: This function takes a string `text` as input.

3. **Tokenization:**
    * `tokens = word_tokenize(text)`: Splits the text into a list of individual words.

4. **POS Tagging:**
    * `tagged_words = pos_tag(tokens)`: Assigns POS tags (e.g., noun, verb, adjective) to each word in the `tokens` list. The output is a list of tuples where each tuple contains a word and its corresponding POS tag.

5. **Filtering by POS tags:**
    * `allowed_tags`: Defines a list of POS tags that represent nouns, adjectives, and verbs.
    * `filtered_words = [word for word, tag in tagged_words if tag in allowed_tags]`: This list comprehension iterates through the `tagged_words` list. It includes only those words where the corresponding POS tag (`tag`) is present in the `allowed_tags` list. This effectively filters the words based on their grammatical categories.

6. **Returning the filtered list:**
    * The function returns the `filtered_words` list, which contains only nouns, adjectives, and verbs from the original text.

7. **Example usage:**
    * Demonstrates how to call the function with a sample sentence.
    * Prints the filtered list containing only nouns, adjectives, and verbs from the example sentence. 