# Topic Modeling

## Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

This notebook walks through **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques. It was designed specifically for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
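To make the two inputs concrete, here is a minimal sketch that builds a document-term matrix by hand using only the standard library. The toy documents and the names `docs`, `vocab`, and `dtm` are assumptions made up for illustration; the notebook itself loads a matrix that was built earlier with `CountVectorizer`.

```python
from collections import Counter

# Hypothetical toy corpus: document name -> cleaned text
docs = {
    "doc_a": "science prize science students",
    "doc_b": "economics women economics school",
    "doc_c": "students school research",
}

# Count how often each token occurs in each document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the sorted set of all terms; it fixes the column order
vocab = sorted(set().union(*counts.values()))

# Input (1): one row per document, one column per term
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}

# Input (2): the number of topics is chosen by you, not learned from the data
num_topics = 2

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row sums to the number of tokens in that document, and a term that never occurs in a document simply gets a 0 in that row.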

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even try a different model.

## Topic Modeling - Attempt #1 (All Text)

In [36]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

Unnamed: 0,ab,abilities,ability,able,absolutely,absorbed,absorbing,abstract,academia,academic,...,young,younger,youpaul,youprobablycohentannoudji,youre,yourselfyouve,yoursrobert,yousvante,youve,youwatch
Anne,1,0,0,1,2,0,0,0,1,0,...,4,0,0,0,0,0,0,0,0,0
Claudia,1,0,0,3,1,0,0,4,0,1,...,3,0,0,0,4,0,0,0,1,0
Gerhard,2,0,0,3,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,2,0
Katalin,1,0,0,0,0,0,0,0,0,0,...,5,0,0,0,0,0,0,0,1,0
Leland,1,1,2,6,0,0,0,0,0,0,...,0,0,0,0,6,1,0,0,5,0
Paul,1,0,1,0,0,0,0,0,1,3,...,6,0,1,0,10,0,0,0,5,1
Pierre,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
Richard,1,0,1,11,4,0,0,1,0,0,...,2,0,0,0,11,0,0,0,6,1
Robert,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,1
Svante,1,0,1,1,0,1,1,0,0,0,...,1,1,0,0,3,0,0,1,0,0


In [37]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [38]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,Anne,Claudia,Gerhard,Katalin,Leland,Paul,Pierre,Richard,Robert,Svante
ab,1,1,2,1,1,1,1,1,1,1
abilities,0,0,0,0,1,0,0,0,0,0
ability,0,0,0,0,2,1,0,1,0,1
able,1,3,3,0,6,0,0,11,0,1
absolutely,2,1,0,0,0,0,1,4,1,0


In [39]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
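The matrix is transposed first because, as far as I know, `Sparse2Corpus` treats each column as a document by default. The sketch below imitates that conversion in plain Python so the resulting format is visible: one list per document, holding `(term_id, count)` pairs for the non-zero entries. The matrix and the helper `columns_to_bow` are hypothetical and are not part of gensim.

```python
# Hypothetical term-document matrix: rows are terms, columns are documents
tdm_rows = [
    [1, 0, 2],  # term 0
    [0, 0, 0],  # term 1 (appears in no document)
    [3, 1, 0],  # term 2
]

def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(matrix) if row[doc] != 0]
        for doc in range(n_docs)
    ]

toy_corpus = columns_to_bow(tdm_rows)
for doc_id, bow in enumerate(toy_corpus):
    print(doc_id, bow)
```

Zero counts are dropped, which is why a sparse representation is a natural intermediate step for a matrix that is mostly zeros.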

In [40]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
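The vectorizer stores its vocabulary as term to column index, while the model needs the opposite direction, index to term. A small sketch of that inversion, using a hypothetical three-word `vocabulary` dict in place of `cv.vocabulary_`:

```python
# Hypothetical stand-in for cv.vocabulary_ (term -> column index)
vocabulary = {"ability": 2, "ab": 0, "abilities": 1}

# Invert it so a term can be looked up from its index
toy_id2word = {index: term for term, index in vocabulary.items()}

# Rebuild the column order implied by the indices
terms_in_order = [toy_id2word[i] for i in range(len(toy_id2word))]
print(terms_in_order)
```

The inversion is only safe because every term has a unique index; if two terms shared an index, one of them would silently be lost.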

Now that we have the corpus (built from the term-document matrix) and id2word (a dictionary of location: term), we need to specify two other parameters: the number of topics and the number of passes. Let's start with 2 topics, see if the results make sense, and increase the number from there.

In [41]:
# LDA for num_topics = 2
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.008*"thats" + 0.008*"people" + 0.008*"think" + 0.007*"work" + 0.007*"just" + 0.006*"like" + 0.006*"things" + 0.006*"economics" + 0.006*"know" + 0.006*"dont"'),
 (1,
  '0.011*"think" + 0.011*"people" + 0.007*"really" + 0.007*"just" + 0.007*"time" + 0.006*"like" + 0.006*"nobel" + 0.006*"doing" + 0.005*"thats" + 0.005*"know"')]

In [42]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.012*"think" + 0.012*"people" + 0.008*"just" + 0.008*"time" + 0.007*"really" + 0.007*"like" + 0.007*"doing" + 0.006*"nobel" + 0.006*"say" + 0.006*"roberts"'),
 (1,
  '0.010*"thats" + 0.009*"think" + 0.008*"ertl" + 0.007*"things" + 0.007*"work" + 0.007*"just" + 0.007*"really" + 0.006*"people" + 0.006*"hartwell" + 0.006*"like"'),
 (2,
  '0.016*"economics" + 0.010*"people" + 0.008*"women" + 0.006*"know" + 0.006*"goldin" + 0.006*"work" + 0.005*"dont" + 0.005*"thats" + 0.004*"think" + 0.004*"just"')]

In [43]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.001*"people" + 0.000*"thats" + 0.000*"work" + 0.000*"really" + 0.000*"think" + 0.000*"things" + 0.000*"just" + 0.000*"dont" + 0.000*"know" + 0.000*"time"'),
 (1,
  '0.010*"people" + 0.009*"think" + 0.008*"roberts" + 0.008*"really" + 0.008*"just" + 0.007*"ertl" + 0.006*"economics" + 0.006*"know" + 0.006*"doing" + 0.006*"time"'),
 (2,
  '0.011*"people" + 0.011*"think" + 0.009*"just" + 0.009*"greengard" + 0.008*"really" + 0.007*"work" + 0.007*"thats" + 0.007*"things" + 0.006*"time" + 0.006*"hartwell"'),
 (3,
  '0.012*"think" + 0.010*"like" + 0.009*"important" + 0.009*"people" + 0.008*"nobel" + 0.008*"thats" + 0.007*"different" + 0.006*"things" + 0.006*"say" + 0.006*"science"')]

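These topics aren't looking too great: the same filler words ("think", "people", "just") dominate every topic, and with 4 topics one topic is essentially empty. We've tried modifying our parameters. Let's try modifying our terms list as well.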
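One rough way to put a number on this impression is to count how many top words two topics share. The sketch below parses the two topic strings printed for the 2-topic model above; the overlap measure and the helper `top_words` are my own illustrative choices, not a standard gensim metric.

```python
import re

# Top-word strings copied from the 2-topic output above
topic_0 = ('0.008*"thats" + 0.008*"people" + 0.008*"think" + 0.007*"work" + '
           '0.007*"just" + 0.006*"like" + 0.006*"things" + 0.006*"economics" + '
           '0.006*"know" + 0.006*"dont"')
topic_1 = ('0.011*"think" + 0.011*"people" + 0.007*"really" + 0.007*"just" + '
           '0.007*"time" + 0.006*"like" + 0.006*"nobel" + 0.006*"doing" + '
           '0.005*"thats" + 0.005*"know"')

def top_words(topic_string):
    """Extract the quoted words from a print_topics() style string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

shared = top_words(topic_0) & top_words(topic_1)
print(sorted(shared))
print(len(shared), "of 10 top words are shared")
```

Six of the ten top words appear in both topics, which is a hint that the vocabulary still contains too many generic conversational words.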

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
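In the Penn Treebank tag set, all noun tags start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`) and all adjective tags start with `JJ`, so filtering on the first two characters of the tag is enough. A minimal sketch on a hand-tagged sentence; the `tagged` list and the helper `keep_by_prefix` are hypothetical, and no tagger is run here.

```python
# Hypothetical tagger output: (word, Penn Treebank tag) pairs
tagged = [
    ("the", "DT"), ("young", "JJ"), ("students", "NNS"), ("studied", "VBD"),
    ("chemistry", "NN"), ("in", "IN"), ("stockholm", "NNP"),
]

def keep_by_prefix(pairs, prefixes):
    """Keep words whose tag starts with one of the two-letter prefixes."""
    return [word for word, tag in pairs if tag[:2] in prefixes]

nouns_only = keep_by_prefix(tagged, {"NN"})
nouns_and_adjectives = keep_by_prefix(tagged, {"NN", "JJ"})

print(nouns_only)
print(nouns_and_adjectives)
```

Adding another part of speech is then just a matter of adding its prefix, for example `VB` for verbs.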

In [44]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [45]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,awardee
Anne,the nobel prize in physics agostiniferenc kra...
Claudia,the sveriges riksbank prize in economic scienc...
Gerhard,the nobel prize in chemistry ertlshare thissh...
Katalin,the nobel prize in physiology or medicine kar...
Leland,the nobel prize in physiology or medicine har...
Paul,the nobel prize in physiology or medicine car...
Pierre,the nobel prize in physics agostiniferenc kra...
Richard,the nobel prize in physiology or medicine j r...
Robert,the nobel prize in chemistry j lefkowitzbrian...
Svante,the nobel prize in physiology or medicine pää...


In [47]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/krishuppal/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [48]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.awardee.apply(nouns))
data_nouns

Unnamed: 0,awardee
Anne,prize physics krauszanne thisshare facebook tr...
Claudia,sveriges sciences memory nobel goldinshare thi...
Gerhard,prize chemistry ertlshare thisshare facebook t...
Katalin,prize physiology medicine karikódrew facebook ...
Leland,prize physiology medicine hartwelltim huntsir ...
Paul,prize physiology medicine carlssonpaul greenga...
Pierre,prize physics krauszanne thisshare facebook tr...
Richard,prize physiology medicine j sharpshare thissha...
Robert,prize chemistry j kobilkashare thisshare faceb...
Svante,prize physiology medicine pääboshare thisshare...


In [52]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))  # Convert to list

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.awardee)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())  # get_feature_names_out() replaces the older get_feature_names()
data_dtmn.index = data_nouns.index
data_dtmn



Unnamed: 0,ab,abilities,ability,academia,academic,academics,account,accumulation,accuracy,achievements,...,yeswhats,york,youd,youits,youleland,youll,youprobablycohentannoudji,yourselfyouve,yousvante,youve
Anne,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
Claudia,1,0,0,0,1,1,3,0,0,1,...,0,2,0,0,0,1,0,0,0,0
Gerhard,1,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,1,0,0,0,1
Katalin,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
Leland,1,1,2,0,0,2,0,1,1,1,...,0,0,0,0,2,0,0,1,0,2
Paul,1,0,1,1,0,0,0,0,0,1,...,0,0,1,1,0,0,0,0,0,3
Pierre,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
Richard,1,0,1,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,4
Robert,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
Svante,1,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0


In [53]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [54]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.014*"things" + 0.008*"years" + 0.008*"science" + 0.008*"ertl" + 0.007*"hartwell" + 0.007*"students" + 0.007*"cell" + 0.006*"chemistry" + 0.006*"interview" + 0.006*"biology"'),
 (1,
  '0.009*"things" + 0.009*"research" + 0.009*"way" + 0.008*"greengard" + 0.008*"science" + 0.008*"interview" + 0.008*"economics" + 0.007*"prize" + 0.007*"students" + 0.006*"thing"')]

In [55]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.016*"greengard" + 0.010*"students" + 0.009*"ertl" + 0.008*"years" + 0.008*"work" + 0.008*"research" + 0.008*"chemistry" + 0.007*"prize" + 0.007*"way" + 0.007*"drug"'),
 (1,
  '0.018*"things" + 0.011*"science" + 0.010*"interview" + 0.009*"year" + 0.008*"hartwell" + 0.008*"research" + 0.008*"transcript" + 0.008*"prize" + 0.008*"students" + 0.008*"thing"'),
 (2,
  '0.013*"economics" + 0.009*"way" + 0.008*"things" + 0.008*"women" + 0.008*"years" + 0.008*"roberts" + 0.006*"school" + 0.006*"science" + 0.005*"experiment" + 0.005*"rna"')]

In [56]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.017*"things" + 0.010*"science" + 0.010*"way" + 0.010*"biology" + 0.008*"roberts" + 0.008*"year" + 0.008*"thing" + 0.008*"hartwell" + 0.007*"dna" + 0.007*"cell"'),
 (1,
  '0.017*"research" + 0.016*"students" + 0.014*"group" + 0.012*"bit" + 0.008*"science" + 0.008*"physics" + 0.008*"intuition" + 0.007*"interview" + 0.007*"student" + 0.006*"transcript"'),
 (2,
  '0.022*"ertl" + 0.017*"chemistry" + 0.011*"students" + 0.009*"science" + 0.009*"surface" + 0.008*"course" + 0.008*"physics" + 0.007*"interview" + 0.006*"years" + 0.006*"question"'),
 (3,
  '0.012*"greengard" + 0.012*"economics" + 0.009*"work" + 0.009*"years" + 0.009*"women" + 0.009*"interview" + 0.009*"prize" + 0.008*"things" + 0.008*"students" + 0.007*"school"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [57]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [59]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.awardee.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,awardee
Anne,nobel prize physics agostiniferenc krauszanne ...
Claudia,sveriges economic sciences memory alfred nobel...
Gerhard,nobel prize chemistry ertlshare thisshare face...
Katalin,nobel prize physiology medicine karikódrew fac...
Leland,nobel prize physiology medicine hartwelltim hu...
Paul,nobel prize physiology medicine carlssonpaul g...
Pierre,nobel prize physics agostiniferenc krauszanne ...
Richard,nobel prize physiology medicine j sharpshare t...
Robert,nobel prize chemistry j lefkowitzbrian kobilka...
Svante,nobel prize physiology medicine pääboshare thi...


In [62]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=0.8)
data_cvna = cvna.fit_transform(data_nouns_adj.awardee)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())  # get_feature_names_out() replaces the older get_feature_names()
data_dtmna.index = data_nouns_adj.index
data_dtmna


Unnamed: 0,abilities,ability,able,abstract,academia,academic,academics,accepted,accomplish,accomplished,...,youits,youleland,youll,young,younger,youprobablycohentannoudji,yourselfyouve,yoursrobert,yousvante,youve
Anne,0,0,1,0,1,0,0,0,0,0,...,0,0,0,4,0,0,0,0,0,0
Claudia,0,0,3,4,0,1,1,1,0,0,...,0,0,1,3,0,0,0,0,0,1
Gerhard,0,0,3,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
Katalin,0,0,0,0,0,0,0,0,1,0,...,0,0,0,5,0,0,0,0,0,1
Leland,1,2,6,0,0,0,2,0,0,0,...,0,2,1,0,0,0,1,0,0,4
Paul,0,1,0,0,1,3,0,1,0,1,...,1,0,0,6,0,0,0,0,0,3
Pierre,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Richard,0,1,11,1,0,0,0,0,0,0,...,0,0,1,2,0,0,0,0,0,4
Robert,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
Svante,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,1,0
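The `max_df=0.8` setting used above drops any term that appears in more than 80% of the documents, which is why a term like "ab" (present in all 10 transcripts earlier) no longer shows up as a column. The sketch below reimplements that rule in plain Python on made-up token lists; `filter_by_max_df` is a hypothetical helper, not the scikit-learn implementation.

```python
def filter_by_max_df(docs, max_df):
    """Keep terms whose document frequency is at most max_df * number of documents."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df <= max_df * n_docs)

# Hypothetical tokenized documents; "prize" occurs in every one of them
toy_docs = [
    ["prize", "physics", "laser"],
    ["prize", "economics", "women"],
    ["prize", "chemistry", "surface"],
    ["prize", "physics", "medicine"],
    ["prize", "medicine", "rna"],
]

kept = filter_by_max_df(toy_docs, max_df=0.8)
print(kept)
```

With 5 documents the cutoff is 4, so "prize" (5 documents) is removed while "physics" and "medicine" (2 documents each) survive.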


In [63]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [64]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.007*"greengard" + 0.006*"research" + 0.006*"new" + 0.006*"students" + 0.005*"course" + 0.005*"school" + 0.005*"chemistry" + 0.005*"biology" + 0.004*"ertl" + 0.004*"interested"'),
 (1,
  '0.016*"economics" + 0.009*"students" + 0.008*"women" + 0.006*"research" + 0.006*"group" + 0.006*"goldin" + 0.005*"bit" + 0.005*"college" + 0.004*"school" + 0.004*"diversity"')]

In [65]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.010*"greengard" + 0.007*"biology" + 0.006*"research" + 0.005*"new" + 0.005*"hartwell" + 0.005*"cell" + 0.005*"students" + 0.005*"roberts" + 0.005*"dna" + 0.005*"school"'),
 (1,
  '0.010*"research" + 0.008*"students" + 0.008*"hungary" + 0.007*"scientists" + 0.006*"group" + 0.006*"scientist" + 0.006*"karikó" + 0.006*"goal" + 0.005*"bit" + 0.005*"school"'),
 (2,
  '0.015*"economics" + 0.011*"ertl" + 0.008*"course" + 0.008*"students" + 0.008*"chemistry" + 0.007*"women" + 0.005*"goldin" + 0.005*"interested" + 0.005*"school" + 0.005*"new"')]

In [66]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.007*"biology" + 0.006*"hartwell" + 0.006*"rna" + 0.006*"cell" + 0.006*"experiment" + 0.006*"roberts" + 0.006*"dna" + 0.005*"school" + 0.005*"research" + 0.005*"new"'),
 (1,
  '0.013*"research" + 0.010*"students" + 0.010*"course" + 0.009*"group" + 0.009*"pääbo" + 0.008*"bit" + 0.005*"institute" + 0.005*"humans" + 0.005*"life" + 0.005*"example"'),
 (2,
  '0.021*"economics" + 0.015*"ertl" + 0.010*"students" + 0.010*"chemistry" + 0.010*"women" + 0.007*"goldin" + 0.006*"new" + 0.006*"school" + 0.006*"course" + 0.006*"surface"'),
 (3,
  '0.022*"greengard" + 0.008*"drug" + 0.007*"research" + 0.007*"students" + 0.007*"brain" + 0.006*"department" + 0.006*"pharmacology" + 0.006*"biochemistry" + 0.006*"ive" + 0.005*"companies"')]

## Identify Topics in Each Document

Of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. Let's rerun it here with more passes (80 instead of 10) to get more fine-tuned topics.

In [67]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.011*"lefkowitz" + 0.009*"robert" + 0.005*"week" + 0.005*"enthusiasm" + 0.005*"secret" + 0.005*"lefkowitzshare" + 0.004*"piano" + 0.004*"club" + 0.004*"essay" + 0.004*"lessons"'),
 (1,
  '0.009*"students" + 0.009*"research" + 0.008*"greengard" + 0.007*"new" + 0.006*"chemistry" + 0.006*"biology" + 0.005*"ertl" + 0.005*"interested" + 0.005*"school" + 0.005*"course"'),
 (2,
  '0.019*"economics" + 0.012*"women" + 0.007*"school" + 0.007*"hungary" + 0.007*"goldin" + 0.006*"scientist" + 0.006*"scientists" + 0.006*"college" + 0.006*"goal" + 0.006*"karikó"'),
 (3,
  '0.013*"pääbo" + 0.010*"course" + 0.008*"humans" + 0.007*"institute" + 0.007*"svante" + 0.005*"passion" + 0.005*"example" + 0.005*"today" + 0.005*"unique" + 0.005*"neandertals"')]

1. Topic 0 seems to revolve around personal experiences or achievements related to music, piano lessons, clubs, essays, etc.
2. Topic 1 appears to be focused on academic research, with terms like "students," "research," "chemistry," "biology," etc., suggesting discussions related to scientific studies and academia.
3. Topic 2 likely pertains to economics and social sciences, with terms like "economics," "women," "school," "college," "scientist," etc., indicating discussions related to gender economics, education, and social sciences.
4. Topic 3 seems to be about scientific research and institutes, with terms like "pääbo," "humans," "institute," "svante," "neandertals," etc., suggesting discussions related to genetics, human evolution, and research institutes.

Based on this analysis, topics 1 and 2 seem more coherent and meaningful than topics 0 and 3. Topics 1 and 2 cover academic research and economics/social sciences, which are recognizable themes, while topics 0 and 3 are each dominated by one laureate's name and vocabulary ("lefkowitz" and "pääbo"), so they read more like single transcripts than shared themes.

In [68]:
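# Let's take a look at the dominant topic for each transcript
# (taking the max avoids an unpacking error when a transcript has more than one topic)
corpus_transformed = ldana[corpusna]
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))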

[(1, 'Anne'),
 (2, 'Claudia'),
 (1, 'Gerhard'),
 (2, 'Katalin'),
 (1, 'Leland'),
 (1, 'Paul'),
 (2, 'Pierre'),
 (1, 'Richard'),
 (0, 'Robert'),
 (3, 'Svante')]
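When a document's topic distribution contains more than one `(topic_id, probability)` pair, a common convention is to report the pair with the highest probability and then group documents by that dominant topic. A minimal sketch with invented distributions; `doc_topics` and `dominant_topic` are hypothetical names, and the probabilities are not taken from the model above.

```python
from collections import defaultdict

# Hypothetical per-document topic distributions: lists of (topic_id, probability)
doc_topics = {
    "doc_a": [(1, 0.97), (3, 0.03)],
    "doc_b": [(2, 0.99)],
    "doc_c": [(0, 0.40), (1, 0.60)],
}

def dominant_topic(distribution):
    """Return the topic id with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])[0]

# Group document names under their dominant topic
by_topic = defaultdict(list)
for name, distribution in doc_topics.items():
    by_topic[dominant_topic(distribution)].append(name)

print(dict(by_topic))
```

Keeping only the dominant topic discards information for mixed documents such as `doc_c`, so it is worth looking at the full distribution before drawing conclusions.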

For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.
* Topic 0: lefkowitz, piano, lessons [Robert]
* Topic 1: students, research, chemistry, biology [Anne, Gerhard, Leland, Paul, Richard]
* Topic 2: economics, women, school [Claudia, Katalin, Pierre]
* Topic 3: pääbo, humans, neandertals [Svante]

### Assignment:
1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics.

In [69]:
# Import necessary libraries
from gensim import models

# Further modify the parameters of the LDA model
ldana_modified = models.LdaModel(
    corpus=corpusna,  # Your corpus
    num_topics=6,  # Change the number of topics
    id2word=id2wordna,  # Your id2word mapping
    passes=100,  # Increase the number of passes
    alpha='auto',  # Use automatic alpha estimation
    eta='auto',  # Use automatic eta estimation
    random_state=42  # Set a random state for reproducibility
)

# Print the modified topics
ldana_modified.print_topics()

[(0,
  '0.012*"pierre" + 0.007*"guess" + 0.007*"agostinishare" + 0.005*"experiment" + 0.005*"better" + 0.005*"qualities" + 0.005*"agostini" + 0.005*"literature" + 0.005*"experiments" + 0.005*"aware"'),
 (1,
  '0.014*"hungary" + 0.012*"goal" + 0.012*"karikó" + 0.011*"scientists" + 0.009*"scientist" + 0.007*"school" + 0.006*"rna" + 0.006*"katalin" + 0.006*"money" + 0.006*"vaccine"'),
 (2,
  '0.035*"economics" + 0.018*"women" + 0.012*"goldin" + 0.009*"college" + 0.009*"economic" + 0.008*"students" + 0.008*"dog" + 0.007*"school" + 0.007*"subject" + 0.007*"rights"'),
 (3,
  '0.014*"greengard" + 0.008*"research" + 0.007*"roberts" + 0.006*"new" + 0.006*"experiment" + 0.006*"rna" + 0.005*"nice" + 0.005*"company" + 0.005*"biochemistry" + 0.005*"ive"'),
 (4,
  '0.015*"lefkowitz" + 0.012*"robert" + 0.007*"week" + 0.007*"lefkowitzshare" + 0.007*"enthusiasm" + 0.007*"secret" + 0.005*"piano" + 0.005*"id" + 0.005*"lessons" + 0.005*"club"'),
 (5,
  '0.011*"students" + 0.010*"ertl" + 0.009*"course" + 0

In [71]:
# Import necessary libraries
from gensim import models

# Further modify the parameters of the LDA model
ldana_modified = models.LdaModel(
    corpus=corpusna,  # Your corpus
    num_topics=6,  # Change the number of topics
    id2word=id2wordna,  # Your id2word mapping
    passes=150,  # Increase the number of passes
    alpha='auto',  # Use automatic alpha estimation
    eta='auto',  # Use automatic eta estimation
    random_state=42,  # Set a random state for reproducibility
    chunksize=1000,  # Set the chunk size for processing
    iterations=200,  # Increase the number of iterations
    minimum_probability=0.01,  # Set the minimum probability threshold for a topic
    decay=0.5  # Set the decay parameter for learning
)

# Print the modified topics
ldana_modified.print_topics()

[(0,
  '0.012*"pierre" + 0.007*"agostinishare" + 0.007*"guess" + 0.005*"experiment" + 0.005*"qualities" + 0.005*"aware" + 0.005*"literature" + 0.005*"experiments" + 0.005*"better" + 0.005*"agostini"'),
 (1,
  '0.014*"hungary" + 0.012*"karikó" + 0.012*"goal" + 0.011*"scientists" + 0.009*"scientist" + 0.007*"school" + 0.006*"money" + 0.006*"rna" + 0.006*"katalin" + 0.006*"vaccine"'),
 (2,
  '0.019*"economics" + 0.012*"hartwell" + 0.010*"cell" + 0.010*"women" + 0.009*"biology" + 0.008*"school" + 0.008*"students" + 0.008*"cancer" + 0.008*"college" + 0.007*"goldin"'),
 (3,
  '0.013*"greengard" + 0.008*"research" + 0.007*"roberts" + 0.007*"new" + 0.006*"experiment" + 0.006*"rna" + 0.005*"big" + 0.005*"dna" + 0.005*"nice" + 0.005*"biochemistry"'),
 (4,
  '0.015*"lefkowitz" + 0.012*"robert" + 0.007*"week" + 0.007*"secret" + 0.007*"enthusiasm" + 0.007*"lefkowitzshare" + 0.005*"piano" + 0.005*"id" + 0.005*"essay" + 0.005*"club"'),
 (5,
  '0.017*"ertl" + 0.014*"students" + 0.013*"research" + 0.01

In [72]:
# Let's create a function to pull out nouns, adjectives, and verbs from a string of text
def nouns_adj_verbs(text):
    '''Given a string of text, tokenize the text and pull out only the nouns, adjectives, and verbs.'''
    is_noun_adj_verb = lambda pos: pos[:2] in ['NN', 'JJ', 'VB']  # Including verbs ('VB')
    tokenized = word_tokenize(text)
    nouns_adj_verbs = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj_verb(pos)] 
    return ' '.join(nouns_adj_verbs)

# Apply the nouns_adj_verbs function to the transcripts to filter only on nouns, adjectives, and verbs
data_nouns_adj_verbs = pd.DataFrame(data_clean.awardee.apply(nouns_adj_verbs))

# Create a new document-term matrix using only nouns, adjectives, and verbs
cv_nav = CountVectorizer(stop_words=stop_words, max_df=0.8)
data_cv_nav = cv_nav.fit_transform(data_nouns_adj_verbs.awardee)
data_dtm_nav = pd.DataFrame(data_cv_nav.toarray(), columns=cv_nav.get_feature_names_out())  # get_feature_names_out() replaces the older get_feature_names()
data_dtm_nav.index = data_nouns_adj_verbs.index

# Create the gensim corpus
corpus_nav = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtm_nav.transpose()))

# Create the vocabulary dictionary
id2word_nav = dict((v, k) for k, v in cv_nav.vocabulary_.items())

# Train LDA model with nouns, adjectives, and verbs
ldana_nav = models.LdaModel(corpus=corpus_nav, num_topics=4, id2word=id2word_nav, passes=10)
ldana_nav.print_topics()


[(0,
  '0.013*"greengard" + 0.010*"ertl" + 0.007*"students" + 0.006*"chemistry" + 0.005*"going" + 0.005*"research" + 0.005*"new" + 0.005*"theres" + 0.004*"interested" + 0.004*"drug"'),
 (1,
  '0.014*"roberts" + 0.009*"going" + 0.007*"research" + 0.006*"rna" + 0.005*"experiment" + 0.005*"group" + 0.005*"trying" + 0.005*"new" + 0.005*"id" + 0.005*"course"'),
 (2,
  '0.012*"hartwell" + 0.010*"cell" + 0.009*"biology" + 0.006*"cancer" + 0.005*"going" + 0.005*"laboratory" + 0.004*"theres" + 0.004*"approach" + 0.004*"disease" + 0.004*"working"'),
 (3,
  '0.014*"economics" + 0.008*"women" + 0.006*"want" + 0.005*"goldin" + 0.005*"school" + 0.005*"karikó" + 0.005*"hungary" + 0.004*"went" + 0.004*"going" + 0.004*"started"')]