<a href="https://colab.research.google.com/github/ChrisGarciaDS/ads-tm-topic-modeling/blob/main/GarciaChristoper_TopicModeling_Assignment5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ADS 509 Assignment 5.1: Topic Modeling

This notebook holds Assignment 5.1 for Module 5 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required.

In this assignment you will work with a categorical corpus that accompanies `nltk`. You will build the three types of topic models described in Chapter 8 of _Blueprints for Text Analytics using Python_: NMF, LSA, and LDA. You will compare these models to the true categories.


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link.

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell.

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.*


In [2]:
!pip install pyLDavis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDavis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting numpy>=1.24.2 (from pyLDavis)
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas>=2.0.0 (from pyLDavis)
  Downloading pandas-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting funcy (from pyLDavis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, numpy, pandas, pyLDavis
  Attempting uninstall: numpy
    Fou

In [3]:
# Restart notebook
!pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
!pip install brown



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting brown
  Downloading brown-0.0.1-py3-none-any.whl (2.1 kB)
Installing collected packages: brown
Successfully installed brown-0.0.1


In [82]:
# These libraries may be useful to you

#!pip install pyLDAvis==3.4.1 --user  #You need to restart the Kernel after installation.
# You also need Python >= 3.9.0
from nltk.corpus import brown

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import pyLDAvis.lda_model
import pyLDAvis.gensim_models

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation

from spacy.lang.en.stop_words import STOP_WORDS as stopwords

from collections import Counter, defaultdict

nlp = spacy.load('en_core_web_sm')



In [38]:
# add any additional libraries you need here
import matplotlib.pyplot as plt
#plt.style.use('ggplot')
import seaborn as sns
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [23]:
# This function comes from the BTAP repo.

def display_topics(model, features, no_top_words=5):
    for topic, words in enumerate(model.components_):
        total = words.sum()
        largest = words.argsort()[::-1] # invert sort order
        print("\nTopic %02d" % topic)
        for i in range(0, no_top_words):
            print("  %s (%2.2f)" % (features[largest[i]], abs(words[largest[i]]*100.0/total)))



## Getting to Know the Brown Corpus

Let's spend a bit of time getting to know what's in the Brown corpus, our NLTK example of an "overlapping" corpus.

In [24]:
# categories of articles in Brown corpus
for category in brown.categories():
    print(f"For {category} we have {len(brown.fileids(categories=category))} articles.")

For adventure we have 29 articles.
For belles_lettres we have 75 articles.
For editorial we have 27 articles.
For fiction we have 29 articles.
For government we have 30 articles.
For hobbies we have 36 articles.
For humor we have 9 articles.
For learned we have 80 articles.
For lore we have 48 articles.
For mystery we have 24 articles.
For news we have 44 articles.
For religion we have 17 articles.
For reviews we have 17 articles.
For romance we have 29 articles.
For science_fiction we have 6 articles.




Let's create a dataframe of the articles in the hobbies, editorial, government, news, and romance categories.

In [25]:
categories = ['editorial','government','news','romance','hobbies']

category_list = []
file_ids = []
texts = []

for category in categories :
    for file_id in brown.fileids(categories=category) :

        # build some lists for a dataframe
        category_list.append(category)
        file_ids.append(file_id)

        text = brown.words(fileids=file_id)
        texts.append(" ".join(text))



df = pd.DataFrame()
df['category'] = category_list
df['id'] = file_ids
df['text'] = texts

df.shape



(166, 3)

In [26]:
# Let's add some helpful columns on the df
df['char_len'] = df['text'].apply(len)
df['word_len'] = df['text'].apply(lambda x: len(x.split()))



In [79]:
%matplotlib inline
# This code gave an error
#df.groupby('category').agg({'word_len': 'mean'}).plot.bar(figsize=(10,6))



Now do our TF-IDF and Count vectorizations.

In [40]:
count_text_vectorizer = CountVectorizer(stop_words=list(stopwords), min_df=5, max_df=0.7)
count_text_vectors = count_text_vectorizer.fit_transform(df["text"])
count_text_vectors.shape



(166, 4941)

In [41]:
tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(stopwords), min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['text'])
tfidf_text_vectors.shape



(166, 4941)

Q: What do the two data frames `count_text_vectors` and `tfidf_text_vectors` hold?

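A: Neither object is a pandas dataframe; both are sparse document-term matrices of shape (166, 4941), with one row per article and one column per vocabulary term that survived stop-word removal and the `min_df=5` / `max_df=0.7` filters. `count_text_vectors` holds the raw count of each term in each article. `tfidf_text_vectors` holds the TF-IDF weight of each term in each article, so terms that appear in many articles are down-weighted relative to their raw counts.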
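
As a small illustration of the difference, the sketch below vectorizes a toy corpus both ways. The three sentences and the `toy_*` names are made up for this example and are not part of the assignment; default vectorizer settings are assumed rather than the `min_df` and `max_df` values used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not taken from Brown).
toy_docs = [
    "the team won the game",
    "the senate passed the tax bill",
    "the team lost the game again",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

toy_counts = count_vec.fit_transform(toy_docs)  # raw term counts
toy_tfidf = tfidf_vec.fit_transform(toy_docs)   # TF-IDF weights, L2-normalized per row

# Both are SciPy sparse matrices with one row per document
# and one column per vocabulary term.
print(type(toy_counts).__name__, toy_counts.shape)
print(type(toy_tfidf).__name__, toy_tfidf.shape)

# Same vocabulary and shape, different values for the same cell.
the_col = count_vec.vocabulary_["the"]
print(toy_counts[0, the_col], round(toy_tfidf[0, the_col], 3))
```

The two matrices share a shape because both vectorizers build the same vocabulary; only the cell values differ.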

## Fitting a Non-Negative Matrix Factorization Model

In this section the code to fit a five-topic NMF model has already been written. This code comes directly from the [BTAP repo](https://github.com/blueprints-for-text-analytics-python/blueprints-text), which will help you tremendously in the coming sections.

In [43]:
nmf_text_model = NMF(n_components=5, random_state=314)
W_text_matrix = nmf_text_model.fit_transform(tfidf_text_vectors)
H_text_matrix = nmf_text_model.components_



In [44]:
display_topics(nmf_text_model, tfidf_text_vectorizer.get_feature_names_out())


Topic 00
  mr (0.51)
  president (0.45)
  kennedy (0.43)
  united (0.42)
  khrushchev (0.40)

Topic 01
  said (0.88)
  didn (0.46)
  ll (0.45)
  thought (0.42)
  man (0.37)

Topic 02
  state (0.39)
  development (0.36)
  tax (0.33)
  sales (0.30)
  program (0.25)

Topic 03
  mrs (2.61)
  mr (0.78)
  said (0.63)
  miss (0.52)
  car (0.51)

Topic 04
  game (1.02)
  league (0.74)
  ball (0.72)
  baseball (0.71)
  team (0.66)




Now some work for you to do. Compare the NMF factorization to the original categories from the Brown Corpus.

We are interested in the extent to which our NMF factorization agrees or disagrees with the original categories in the corpus. For each topic in your NMF model, tally the Brown categories and interpret the results.
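
One possible way to do the tally described above is to assign each document to its highest-weighted topic and cross-tabulate that against the category labels. The helper `tally_topics` and the toy inputs below are hypothetical names introduced for this sketch; they are not part of the assignment or the BTAP repo.

```python
import numpy as np
import pandas as pd


def tally_topics(doc_topic_matrix, labels):
    """Cross-tabulate each document's dominant topic against its label."""
    dominant_topic = np.asarray(doc_topic_matrix).argmax(axis=1)
    return pd.crosstab(
        pd.Series(dominant_topic, name="topic"),
        pd.Series(list(labels), name="category"),
    )


# Toy document-topic weights (hypothetical): 4 documents, 2 topics.
toy_W = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.6],
])
toy_labels = ["news", "news", "romance", "news"]

toy_tally = tally_topics(toy_W, toy_labels)
print(toy_tally)
```

In this notebook the same helper could presumably be called as `tally_topics(W_text_matrix, df['category'])`, which gives one row per topic and one column per Brown category.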


In [49]:
# Your code here
topic_to_category = defaultdict(list)

# Assign each document to the topic with the largest weight in its row of W
for index, row in enumerate(W_text_matrix):
  topic = np.where(row == np.amax(row))[0]
  category = df['category'].iloc[index]
  topic_to_category[topic[0]].append(category)

# Tally the Brown categories of the documents assigned to each topic
for topic, topic_categories in sorted(topic_to_category.items()):
  print(f'For topic {topic} we have {len(topic_categories)} documents')
  print(Counter(topic_categories).most_common(5))




Q: How does your five-topic NMF model compare to the original Brown categories?

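A: Judging by the top words, the NMF topics line up reasonably well with the Brown categories, though not one-to-one. Topic 00 (president, kennedy, khrushchev) reads like political coverage from editorial or news, Topic 01 (said, didn, ll, thought) like dialogue-heavy romance, and Topic 02 (state, development, tax, program) like government documents. Topics 03 (mrs, mr, miss) and 04 (game, league, baseball, team) both look like news, split into society and sports reporting. None of the top words points clearly to hobbies, so that category is probably spread across several topics.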

## Fitting an LSA Model

In this section, follow the example from the repository and fit an LSA model (called a "TruncatedSVD" in `sklearn`). Again fit a five-topic model and compare it to the actual categories in the Brown corpus. Use the TF-IDF vectors for your fit, as above.

To be explicit, we are once again interested in the extent to which this LSA factorization agrees or disagrees with the original categories in the corpus. For each topic in your model, tally the Brown categories and interpret the results.


In [51]:
# Your code here
svd_para_model = TruncatedSVD(n_components=5, random_state=42)
w_svd_para_matrix = svd_para_model.fit_transform(tfidf_text_vectors)
h_svd_para_matrix = svd_para_model.components_



In [54]:
svd_para_model.singular_values_



array([3.70145447, 2.25514643, 1.69735202, 1.66198175, 1.55995291])

In [55]:
topic_to_category = defaultdict(list)

# Assign each document to the topic with the largest value in its row
for index, row in enumerate(w_svd_para_matrix):
  topic = np.where(row == np.amax(row))[0]
  category = df['category'].iloc[index]
  topic_to_category[topic[0]].append(category)

# Tally the Brown categories of the documents assigned to each topic
for topic, topic_categories in sorted(topic_to_category.items()):
  print(f'For topic {topic} we have {len(topic_categories)} documents')
  print(Counter(topic_categories).most_common(5))




Q: How does your five-topic LSA model compare to the original Brown categories?

A: <!-- Your answer here -->

In [56]:
# call display_topics on your model
display_topics(svd_para_model, tfidf_text_vectorizer.get_feature_names_out())


Topic 00
  said (0.44)
  mr (0.25)
  mrs (0.22)
  state (0.20)
  man (0.17)

Topic 01
  said (3.89)
  ll (2.73)
  didn (2.63)
  thought (2.20)
  got (1.97)

Topic 02
  mrs (3.14)
  mr (1.73)
  said (1.05)
  kennedy (0.81)
  president (0.77)

Topic 03
  mrs (30.38)
  club (6.70)
  game (6.40)
  jr (5.81)
  dallas (5.50)

Topic 04
  game (4.33)
  league (3.09)
  baseball (3.06)
  ball (2.94)
  team (2.81)




Q: What is your interpretation of the display topics output?

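A: The LSA topics are harder to interpret than the NMF topics. The same words (said, mr, mrs) recur across Topics 00 through 03, so the topics overlap heavily, and only Topic 04 (game, league, baseball, ball, team) is clearly about a single subject. The percentages are also not comparable with the NMF output: SVD components can be negative, so the row total that `display_topics` divides by can be small, which inflates values such as 30.38 for mrs in Topic 03.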
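
The point about negative weights can be checked on a small example. The sketch below fits NMF and TruncatedSVD to a made-up non-negative matrix; `toy_X` and the other `toy_*` names are hypothetical and stand in for the TF-IDF matrix used in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

# Toy non-negative matrix (hypothetical): 4 documents, 3 terms.
toy_X = np.array([
    [3.0, 1.0, 1.0],
    [1.0, 3.0, 1.0],
    [1.0, 1.0, 3.0],
    [2.0, 2.0, 1.0],
])

toy_nmf = NMF(n_components=2, random_state=0, max_iter=1000).fit(toy_X)
toy_svd = TruncatedSVD(n_components=2, random_state=0).fit(toy_X)

# NMF constrains every topic-term weight to be non-negative;
# SVD has no such constraint.
print("NMF min weight:", toy_nmf.components_.min())
print("SVD min weight:", toy_svd.components_.min())

# A row that mixes signs can have a total close to zero, which makes
# weight * 100 / total very large in absolute value.
for row in toy_svd.components_:
    print(row.round(3), "row total:", round(row.sum(), 3))
```

Because of this, the numbers printed by `display_topics` for an SVD model are better read as a ranking of terms within a topic than as percentage shares.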

## Fitting an LDA Model

Finally, fit a five-topic LDA model using the count vectors (`count_text_vectors` from above). Display the results using `pyLDAvis.display` and describe what you learn from that visualization.

In [67]:
# Fit the LDA model on the count vectors
lda = LatentDirichletAllocation(n_components=6, random_state=314)
w_lda_matrix = lda.fit_transform(count_text_vectors)
h_lda_matrix = lda.components_



In [76]:
display_topics(lda, count_text_vectorizer.get_feature_names_out())


Topic 00
  mrs (1.80)
  water (0.71)
  clay (0.67)
  use (0.64)
  shelter (0.59)

Topic 01
  business (0.58)
  state (0.58)
  1960 (0.49)
  development (0.48)
  sales (0.47)

Topic 02
  mr (0.89)
  president (0.77)
  united (0.53)
  american (0.52)
  said (0.50)

Topic 03
  feed (0.61)
  college (0.60)
  university (0.50)
  work (0.42)
  student (0.37)

Topic 04
  state (1.23)
  states (1.00)
  tax (0.73)
  united (0.69)
  government (0.57)

Topic 05
  said (1.69)
  old (0.51)
  little (0.48)
  man (0.47)
  ll (0.44)




Q: What inference do you draw from the displayed topics for your LDA model?

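A: The LDA topics, fit here with six components, are fairly interpretable. Topic 01 (business, state, development, sales) and Topic 04 (state, states, tax, government) look like government text, Topic 02 (mr, president, united, american) like political news or editorial, and Topic 05 (said, old, little, man, ll) like romance narrative. Topics 00 (water, clay, use, shelter) and 03 (feed, college, university, student) mix practical and educational terms that may correspond to hobbies. The percentages show each word's share of its topic and are not a measure of model quality, so larger values alone do not show that LDA outperforms NMF or LSA.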

Q: Repeat the tallying of Brown categories within your topics. How does your five-topic LDA model compare to the original Brown categories?

A: Based on the top words, this model appears to track the original Brown categories more closely than the LSA model did.
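
A tally for LDA can follow the same pattern as for NMF and LSA, since `fit_transform` returns one row of topic proportions per document. The sketch below uses a made-up four-document corpus; the `toy_*` names and labels are hypothetical, and the topic assignments it prints are not results for the Brown corpus.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and labels (hypothetical, not taken from Brown).
toy_docs = [
    "tax state government tax program",
    "state tax sales government",
    "game team ball league game",
    "team baseball game ball",
]
toy_labels = ["government", "government", "news", "news"]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# Each row is a probability distribution over topics.
print(toy_doc_topics.sum(axis=1))

# Assign every document to its most probable topic, then tally labels.
dominant = toy_doc_topics.argmax(axis=1)
for topic in sorted(set(dominant)):
    members = [lab for lab, t in zip(toy_labels, dominant) if t == topic]
    print(f"Topic {topic}: {len(members)} documents", Counter(members).most_common())
```

In this notebook the corresponding inputs would be `w_lda_matrix` and `df['category']`.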

In [84]:
import pyLDAvis.lda_model
lda_display = pyLDAvis.lda_model.prepare(lda, count_text_vectors, count_text_vectorizer, sort_topics=False)

ERROR:concurrent.futures:exception calling callback for <Future at 0x7f5c7f7a81f0 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'pandas.core.indexes.numeric'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/joblib/externals/loky/_base.py", line 26, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.10/dist-packages/joblib/parallel.py", line 385, in __call__
    self.parallel.dispatch_next()
  File "/usr/local/lib/pyth

BrokenProcessPool: ignored

In [85]:
pyLDAvis.display(lda_display)



NameError: ignored

Q: What conclusions do you draw from the visualization above? Please address the principal component scatterplot and the salient terms graph.

A: <!-- Your answer here -->
