# CSCI 5622: Machine Learning
## Fall 2023
### Instructor: Daniel Acuna, Associate Professor, Department of Computer Science, University of Colorado at Boulder

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Luna McBride"
COLLABORATORS = ""

---

# Homework 6 - Topic Modeling (50 pts)

## Question 1: (10 pts) Dataset Acquisition and Preprocessing

**Objective:** In this question, you will acquire a dataset of text and perform preprocessing steps to prepare it for topic modeling using scikit-learn. This preprocessing step is crucial for effective topic modeling using algorithms like NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation).

**Task:**

1. **Dataset Acquisition:** 
   - Download the '20 Newsgroups' dataset, a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups.
   - URL for the dataset: `http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz`
   - Use the `fetch_20newsgroups` function from `sklearn.datasets` to load the dataset. 
   - Focus on a subset of 4 newsgroups for simplicity: `['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']`.

2. **Preprocessing:**
   - Tokenize and extract features from the text data using `TfidfVectorizer` from `sklearn.feature_extraction.text`.
   - Perform the following preprocessing steps:
     - Convert all text to lowercase.
     - Remove stopwords.
     - Use a `max_df` of 0.95 and `min_df` of 2.
     - Extract the top 1000 most frequent words.

Use the test cell to guide you as to which variables to create.
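
The vocabulary pruning controlled by `min_df`, `max_df`, and the 1000-word cap can be sketched in plain Python on a toy corpus. This is a minimal sketch, not scikit-learn's implementation: `prune_vocabulary` and `toy_docs` are hypothetical names, tokenization is a naive whitespace split, and no stopword removal or TF-IDF weighting is performed. It assumes that a float `max_df` is a fraction of documents, an integer `min_df` is a document count, and the cap keeps the terms with the highest total count.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.95, max_features=None):
    """Hypothetical helper: keep terms whose document frequency is in range."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    # Drop terms that are too rare (below min_df documents) or too common
    # (in more than max_df * n_docs documents).
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    # Rank survivors by total count, breaking ties alphabetically, then cap.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

toy_docs = [
    "the rocket reached orbit",
    "the orbit of the moon",
    "the image file format",
    "the image renders slowly",
]
vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.95)
capped_vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.95, max_features=1)
print(vocab)
```

Here "the" appears in every document and is dropped by `max_df`, while words that appear in only one document are dropped by `min_df`.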

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# YOUR CODE HERE
newsgroups_data = fetch_20newsgroups() #Get the training data from the newsgroups

In [3]:
subset = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'] #Get the list of topics from above

newsgroups_actual_data = [] #Holder for the actual data
newsgroups_types = [] #Holder for the target types

#For each piece of data, add it if it is in the expected subset of topics above
for i in range(len(newsgroups_data.data)):
    data = newsgroups_data.data[i] #Get the data
    target = newsgroups_data.target[i] #Get the actual target topic
    types = newsgroups_data.target_names[target] #Get the name of the topic using the target value
    
    #If the type is in the subset of types, add it to the data holder lists
    if types in subset:
        newsgroups_actual_data.append(data) #Add the data to the data holder list
        newsgroups_types.append(types) #Add the topic to the types list
        
#Print out an example to show off the data structure
print("Data: \n", newsgroups_actual_data[0])
print("\n Type: \n", newsgroups_types[0])

Data: 
 From: jgreen@amber (Joe Green)
Subject: Re: Weitek P9000 ?
Organization: Harris Computer Systems Division
Lines: 14
Distribution: world
NNTP-Posting-Host: amber.ssd.csd.harris.com
X-Newsreader: TIN [version 1.1 PL9]

Robert J.C. Kyanko (rob@rjck.UUCP) wrote:
> abraxis@iastate.edu writes in article <abraxis.734340159@class1.iastate.edu>:
> > Anyone know about the Weitek P9000 graphics chip?
> As far as the low-level stuff goes, it looks pretty nice.  It's got this
> quadrilateral fill command that requires just the four points.

Do you have Weitek's address/phone number?  I'd like to get some information
about this chip.

--
Joe Green				Harris Corporation
jgreen@csd.harris.com			Computer Systems Division
"The only thing that really scares me is a person with no sense of humor."
						-- Jonathan Winters


 Type: 
 comp.graphics


In [4]:
max_df = 0.95 #Set the max_df
min_df = 2 #Set the min_df
top_words = 1000 #Set the max number of words

#Initialize the Tfidf vectorizer
tfidf = TfidfVectorizer(lowercase = True, stop_words = "english", max_df = max_df, min_df = min_df, max_features = top_words)
preprocessed_data = tfidf.fit_transform(newsgroups_actual_data) #Get the preprocessed data to throw into the next model

In [5]:
print(tfidf.get_feature_names_out()) #Print out the common words

['00' '000' '01' '04' '10' '100' '11' '12' '128' '13' '14' '15' '16' '17'
 '18' '19' '1993' '1993apr15' '20' '200' '2000' '21' '22' '23' '24' '25'
 '256' '26' '27' '28' '29' '30' '31' '32' '33' '35' '3d' '40' '41' '42'
 '50' '93' '__' '___' '_____' 'able' 'ac' 'accept' 'access' 'according'
 'act' 'action' 'actions' 'activities' 'acts' 'actually' 'ad' 'add'
 'added' 'address' 'advance' 'age' 'agency' 'ago' 'agree' 'air' 'alaska'
 'algorithm' 'allan' 'allen' 'allow' 'allowed' 'alt' 'american' 'ames'
 'amiga' 'analysis' 'ancient' 'andrew' 'animals' 'animation' 'anonymous'
 'answer' 'answers' 'anti' 'anybody' 'apparently' 'apple' 'applications'
 'apply' 'appreciated' 'apr' 'april' 'archive' 'area' 'aren' 'argument'
 'arguments' 'arizona' 'article' 'articles' 'ask' 'asked' 'assume' 'astro'
 'astronomy' 'atheism' 'atheist' 'atheists' 'atmosphere' 'attempt' 'au'
 'aurora' 'australia' 'author' 'authority' 'available' 'away' 'baalke'
 'bad' 'base' 'based' 'basic' 'basically' 'basis' 'bbs' 'beau

In [6]:
# 10 pts
def test_dataset_download():
    global newsgroups_data
    
    assert 'newsgroups_data' in globals(), "Dataset not loaded with the variable name 'newsgroups_data'"
    assert len(newsgroups_data.data) > 0, "Dataset seems to be empty"

def test_preprocessing():
    global tfidf, processed_features
    
    assert 'tfidf' in globals(), "TfidfVectorizer not defined"
    assert hasattr(tfidf, 'fit_transform'), "TfidfVectorizer not properly initialized"
    assert tfidf.get_feature_names_out().shape[0] == 1000, "The number of features extracted does not match 1000"

# Run the tests
test_dataset_download()
test_preprocessing()

## Question 2: Non-negative Matrix Factorization (NMF) and Performance Validation

**Objective:** Apply NMF to the preprocessed text dataset with different numbers of components (topics) and evaluate the performance using a specific metric. This exercise will help you understand the impact of choosing different dimensions (number of topics) in topic modeling.

**Task:**

1. **Implement NMF:**
   - Apply NMF on the preprocessed text data (from Question 1) using scikit-learn's `NMF` class.
   - Experiment with different numbers of components (use 5, 10, 15, 20).
   - Use the 'frobenius' norm as the loss function and a random state of 42 for reproducibility.

2. **Performance Validation:**
   - Evaluate the performance of each NMF model using the Frobenius norm of the matrix difference (i.e., the difference between the original data matrix and the reconstructed matrix from the NMF components and coefficients).
   - Store the Frobenius norm values for each number of components in a list or dictionary for comparison.

Look at the test to determine where you have to save the information.
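
The evaluation metric described above, the Frobenius norm of the difference between the data matrix $X$ and its reconstruction $WH$, can be sketched in plain Python on a tiny example. This is an illustrative sketch only: `frobenius_reconstruction_error` and the toy matrices are hypothetical, and for real data scikit-learn's `NMF` exposes the fitted value as `reconstruction_err_`.

```python
import math

def frobenius_reconstruction_error(X, W, H):
    """Hypothetical helper: sqrt of the sum of squared entries of X - W @ H."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            # Entry (i, j) of the reconstruction W @ H
            approx = sum(W[i][k] * H[k][j] for k in range(len(H)))
            total += (x_ij - approx) ** 2
    return math.sqrt(total)

# Toy rank-1 matrix (made-up values) that factorizes exactly with one component
X = [[1.0, 2.0], [2.0, 4.0]]
W = [[1.0], [2.0]]
H_exact = [[1.0, 2.0]]
H_off = [[1.0, 1.0]]  # a deliberately worse factor

exact_error = frobenius_reconstruction_error(X, W, H_exact)
off_error = frobenius_reconstruction_error(X, W, H_off)
print(exact_error, off_error)
```

A smaller value means the factors reproduce the original matrix more closely.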

In [7]:
from sklearn.decomposition import NMF
import numpy as np

# YOUR CODE HERE
nmf_models = {} #Initialize the nmf model dictionaries
frobenius_norms = {} #Initialize the frobenius norm dictionary

#For each component count (5, 10, 15, 20), train an NMF model
for i in range(5, 25, 5):
    model = NMF(n_components = i, beta_loss = "frobenius", random_state = 42) #Initialize the model
    nmf_models[i] = model #Store the model in the dictionary
    model.fit(preprocessed_data) #Fit the model on the TF-IDF matrix
    
    frobenius_norms[i] = model.reconstruction_err_ #Frobenius norm of the difference between the data and its reconstruction
    

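#Dictionary Component Source: https://stackoverflow.com/questions/268272/getting-key-with-maximum-value-in-dictionary
#Note: max picks the component count with the LARGEST reconstruction error. A lower Frobenius norm means a closer reconstruction, so min(frobenius_norms, key = frobenius_norms.get) would select the best-fitting model instead
optimal_components = max(frobenius_norms, key = frobenius_norms.get) #Select the component count whose Frobenius norm is highest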
print(optimal_components) #Print the optimal components

5
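
As a small illustration of how the dictionary selection behaves, the sketch below applies `min` and `max` with `key = dict.get` to made-up error values. The numbers in `example_norms` are hypothetical and are not results from this notebook. Since a lower reconstruction error means a closer fit, `min` returns the best-reconstructing component count and `max` the worst.

```python
# Hypothetical reconstruction errors keyed by number of components
# (illustrative values only, not output from the models above).
example_norms = {5: 40.0, 10: 38.5, 15: 37.2, 20: 36.1}

best_fit = min(example_norms, key=example_norms.get)   # key with the lowest error
worst_fit = max(example_norms, key=example_norms.get)  # key with the highest error
print(best_fit, worst_fit)
```

The error generally keeps shrinking as components are added, so the lowest error alone tends to favor the largest model; it measures reconstruction quality, not how interpretable the topics are.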


In [8]:
# 15 pts
def test_nmf_models():
    assert 'nmf_models' in globals(), "nmf_models dictionary not defined"
    assert isinstance(nmf_models, dict), "nmf_models should be a dictionary"
    assert all(isinstance(nmf, NMF) for nmf in nmf_models.values()), "All values in nmf_models should be instances of NMF"
    assert all(n_components in nmf_models for n_components in [5, 10, 15, 20]), "NMF models for all specified component numbers should be created"

def test_frobenius_norms():
    assert 'frobenius_norms' in globals(), "frobenius_norms dictionary not defined"
    assert isinstance(frobenius_norms, dict), "frobenius_norms should be a dictionary"
    assert len(frobenius_norms) == 4, "There should be four Frobenius norm values for the four NMF models"
    assert all(isinstance(norm, float) for norm in frobenius_norms.values()), "All values in frobenius_norms should be floats"

def test_optimal_components_selection():
    assert 'optimal_components' in globals(), "Variable 'optimal_components' not defined"
    assert optimal_components in [5, 10, 15, 20], "Optimal components not selected from the predefined list"

# Run the tests
test_nmf_models()
test_frobenius_norms()
test_optimal_components_selection()

## Question 3: Latent Dirichlet Allocation (LDA) and Topic Interpretation

**Objective:** Apply LDA to the preprocessed text dataset from Question 1, extract 5 topics, and interpret what each topic represents. Use `CountVectorizer` instead of TF-IDF.

**Task:**

1. **Implement LDA:**
   - Apply LDA on the preprocessed text data using scikit-learn's `LatentDirichletAllocation` class.
   - Extract exactly 5 topics from the dataset.

2. **Print and Interpret Topics:**
   - For each topic, print the top 10 words based on their importance in the topic.
   - Write a brief interpretation for each topic, discussing what you think the topic represents based on the top words.

Use the test cell to guide you on the variable names.
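
Ranking the top words per topic can be sketched in plain Python: sort the vocabulary indices by each topic's weights and keep the first few. This is a minimal sketch under stated assumptions, not the required solution. `top_words_per_topic`, `toy_terms`, and `toy_components` are hypothetical, and the weights are made up; with a fitted model, the weight rows would come from `components_` and the terms from the vectorizer's `get_feature_names_out()`.

```python
def top_words_per_topic(components, terms, top_n=10):
    """Hypothetical helper: return the top_n highest-weighted terms per topic."""
    result = []
    for weights in components:
        # Indices of the vocabulary sorted by this topic's weight, highest first
        ranked = sorted(range(len(terms)), key=lambda i: weights[i], reverse=True)
        result.append([terms[i] for i in ranked[:top_n]])
    return result

# Made-up vocabulary and topic-word weights (one row per topic)
toy_terms = ["god", "image", "nasa", "orbit", "file"]
toy_components = [
    [0.2, 5.0, 0.1, 0.3, 4.0],
    [0.1, 0.2, 6.0, 3.5, 0.3],
]
toy_topics = top_words_per_topic(toy_components, toy_terms, top_n=2)
print(toy_topics)
```

Sorting indices rather than (weight, word) pairs avoids comparing words when two weights tie.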

In [9]:
# 20 pts
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# YOUR CODE HERE
#Build a CountVectorizer with the same settings as the TF-IDF vectorizer
count_vectorizer = CountVectorizer(lowercase = True, stop_words = "english", max_df = max_df, min_df = min_df, max_features = top_words)
count_vectorizer_fit = count_vectorizer.fit_transform(newsgroups_actual_data) #Fit the vectorizer and build the document-term count matrix
terms = count_vectorizer.get_feature_names_out() #Get the vocabulary words from the vectorizer

lda_model = LatentDirichletAllocation(n_components = 5, random_state = 42) #Initialize the LDA model with 5 topics
fit_model = lda_model.fit(count_vectorizer_fit) #Fit the model on the count matrix

features = fit_model.get_feature_names_out() #Get the generated output names (one per topic); the vocabulary words are in terms

print(fit_model.components_) #Print out the components to show off the structure

# Hypothetical interpretations:
# Topic 1: Likely about space exploration (words like 'nasa', 'space', 'orbit')
# Topic 2: Computer technology (words like 'software', 'graphics', 'image')
# Topic 3: Religion and philosophy (words like 'god', 'morality', 'belief')
# Topic 4: Online communities and discussion (words like 'internet', 'email', 'group')
# Topic 5: Science and research (words like 'data', 'study', 'theory')

[[ 40.13508121  11.79621799  19.2516298  ...   7.86754791   1.22819004
    0.20005377]
 [  0.2014505   36.8608462   15.67361571 ...  17.65254265   0.20083313
    0.20057814]
 [ 35.19403892 127.5287325   30.6544995  ...   5.35969355 119.16600329
   67.19904811]
 [  0.20209784   0.20134425   0.20205368 ...  12.69297176   0.20322138
    0.20005438]
 [  8.26733152   8.61285907   0.21820131 ...  32.42724413   0.20175217
    0.20026559]]
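
The rows of `components_` printed above are unnormalized topic-word weights, so their magnitudes are hard to compare across topics. scikit-learn's documentation notes that dividing each row by its sum gives a topic-word distribution. A minimal plain-Python sketch of that normalization on made-up numbers (`normalize_rows` and `toy_components` are hypothetical):

```python
def normalize_rows(components):
    """Hypothetical helper: scale each row so its entries sum to 1."""
    normalized = []
    for weights in components:
        total = sum(weights)
        normalized.append([w / total for w in weights])
    return normalized

# Made-up topic-word weights (one row per topic, one column per word)
toy_components = [
    [6.0, 3.0, 1.0],
    [0.2, 0.2, 1.6],
]
topic_word_probs = normalize_rows(toy_components)
print(topic_word_probs)
```

After normalization, each row can be read as the share of a topic's weight assigned to each word.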


In [10]:
top_n = 10 #Set the number of top words to select
topics_words = [] #Create the list to hold the list of words

#For each topic's weight vector, extract the words with the highest weights
for topic in fit_model.components_:
    words = list(zip(topic, terms)) #Pair each weight with its term
    words.sort(reverse = True) #Sort by weight, highest first, so the most important words for this topic lead
    top = words[:top_n] #Take the top n (weight, word) pairs for this topic
    topic_top_words = [tup[1] for tup in top] #Keep only the words; named to avoid overwriting top_words from Question 1
    
    print(topic_top_words) #Print the word list for inspection
    topics_words.append(topic_top_words) #Add this topic's words to the list of lists

['edu', 'graphics', 'image', 'file', 'files', 'university', 'software', 'images', 'data', 'ftp']
['people', 'jesus', 'like', 'don', 'just', 'know', 'say', 'good', 'think', 'life']
['space', 'nasa', 'edu', 'gov', 'launch', 'earth', 'moon', 'orbit', 'access', 'article']
['god', 'edu', 'people', 'don', 'think', 'atheists', 'keith', 'does', 'atheism', 'believe']
['edu', 'com', 'writes', 'article', 'posting', 'university', 'nntp', 'host', 'just', 'don']


In [11]:
# 15 pts
def test_lda_model():
    assert 'lda_model' in globals(), "LDA model not defined"
    assert isinstance(lda_model, LatentDirichletAllocation), "lda_model is not an instance of LatentDirichletAllocation"
    assert lda_model.n_components == 5, "LDA model should have exactly 5 topics"

def test_topic_words():
    assert 'topics_words' in globals(), "topics_words not defined"
    assert isinstance(topics_words, list), "topics_words should be a list"
    assert all(isinstance(topic, list) for topic in topics_words), "Each topic in topics_words should be a list"
    assert all(len(topic) == 10 for topic in topics_words), "Each topic should contain exactly 10 words"

# Run the tests
test_lda_model()
test_topic_words()

**Q. 3.2** (5 pts) What does each of the topics represent?

The easy topics:

Topic 1: This is the graphics topic. It is the only topic with software, image, graphics, and file.

Topic 2: This topic and Topic 4 are both religious topics. The overlap makes it hard to tell which newsgroup each one matches, so I am labeling this one the positive religious topic because it contains more positive words (good, life).

Topic 3: This is the space topic. It includes space words like nasa, earth, and moon.

Topic 4: This is another religious topic, but it holds more critical words and possibly questions about religion/atheism. With god as the top keyword, it is hard to tell whether the topic covers questions about atheism or religion more generally, so I am labeling it the critical religious topic.

The hard topic:

This one is harder to interpret because Question 1 kept only 4 newsgroups while the model extracts 5 topics. On top of that, two of those newsgroups (alt.atheism and talk.religion.misc) overlap heavily: even if you personally believe atheism is not a religion, posts about it use enough religious terminology that a model looking at keywords lumps them together. I set the random state to 42 because it splits the religion posts into two more distinct topics. That leaves one topic here (two under most other random states) that does not match any of the expected newsgroups.

Topic 5: This topic is the leftovers, most likely common words from the forum posts themselves (edu, com, writes, article, posting, nntp, host). Under most other random states I tried, two nearly identical topics played this role. A random state of 42 does a better job of separating the two religious topics, which is why I use it here.