# CSCI 5622: Machine Learning
## Fall 2023
### Instructor: Daniel Acuna, Associate Professor, Department of Computer Science, University of Colorado at Boulder

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Luk Letif"
COLLABORATORS = ""

---

# Homework 6 - Topic Modeling (50 pts)

## Question 1: (10 pts) Dataset Acquisition and Preprocessing

**Objective:** In this question, you will acquire a dataset of text and perform preprocessing steps to prepare it for topic modeling using scikit-learn. This preprocessing step is crucial for effective topic modeling using algorithms like NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation).

**Task:**

1. **Dataset Acquisition:** 
   - Download the '20 Newsgroups' dataset, a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups.
   - URL for the dataset: `http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz`
   - Use the `fetch_20newsgroups` function from `sklearn.datasets` to load the dataset. 
   - Focus on a subset of 4 newsgroups for simplicity: `['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']`.

2. **Preprocessing:**
   - Tokenize and extract features from the text data using `TfidfVectorizer` from `sklearn.feature_extraction.text`.
   - Perform the following preprocessing steps:
     - Convert all text to lowercase.
     - Remove stopwords.
     - Use a `max_df` of 0.95 and `min_df` of 2.
     - Extract the top 1000 most frequent words.

Use the test cell to guide you as to which variables to create.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# YOUR CODE HERE
# raise NotImplementedError()
# Focus on a subset of 4 newsgroups for simplicity: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'].
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# Use the fetch_20newsgroups function from sklearn.datasets to load the dataset.
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))
# Perform the preprocessing steps
tfidf = TfidfVectorizer(lowercase=True, stop_words='english', max_df=0.95, min_df=2, max_features=1000)
processed_features = tfidf.fit_transform(newsgroups_data.data)

In [3]:
# 10 pts
def test_dataset_download():
    global newsgroups_data
    
    assert 'newsgroups_data' in globals(), "Dataset not loaded with the variable name 'newsgroups_data'"
    assert len(newsgroups_data.data) > 0, "Dataset seems to be empty"

def test_preprocessing():
    global tfidf, processed_features
    
    assert 'tfidf' in globals(), "TfidfVectorizer not defined"
    assert hasattr(tfidf, 'fit_transform'), "TfidfVectorizer not properly initialized"
    assert tfidf.get_feature_names_out().shape[0] == 1000, "The number of features extracted does not match 1000"

# Run the tests
test_dataset_download()
test_preprocessing()

## Question 2: Non-negative Matrix Factorization (NMF) and Performance Validation

**Objective:** Apply NMF to the preprocessed text dataset with different numbers of components (topics) and evaluate the performance using a specific metric. This exercise will help you understand the impact of choosing different dimensions (number of topics) in topic modeling.

**Task:**

1. **Implement NMF:**
   - Apply NMF on the preprocessed text data (from Question 1) using scikit-learn's `NMF` class.
   - Experiment with different numbers of components (use 5, 10, 15, 20).
   - Use the 'frobenius' norm as the loss function and a random state of 42 for reproducibility.

2. **Performance Validation:**
   - Evaluate the performance of each NMF model using the Frobenius norm of the matrix difference (i.e., the difference between the original data matrix and the reconstructed matrix from the NMF components and coefficients).
   - Store the Frobenius norm values for each number of components in a list or dictionary for comparison.

Look at the test to determine where you have to save the information.
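
For reference, the metric described above is $\|X - WH\|_F = \sqrt{\sum_{i,j} (X_{ij} - (WH)_{ij})^2}$, where $X$ is the data matrix and $W$, $H$ are the NMF factors. The sketch below checks this on a small random matrix standing in for the TF-IDF features (`X_toy`, `toy_nmf`, and the other `toy` names are hypothetical). It also compares the result with the fitted model's `reconstruction_err_` attribute, which, as far as the scikit-learn documentation describes it, reports the same quantity when `beta_loss='frobenius'`.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical small non-negative matrix standing in for the TF-IDF features.
rng = np.random.default_rng(0)
X_toy = rng.random((30, 12))

toy_nmf = NMF(n_components=4, init="nndsvda", beta_loss="frobenius",
              random_state=42, max_iter=1000)
W_toy = toy_nmf.fit_transform(X_toy)
H_toy = toy_nmf.components_

# Frobenius norm of the residual, computed two equivalent ways.
residual = X_toy - W_toy @ H_toy
manual_norm = np.sqrt(np.sum(residual ** 2))
numpy_norm = np.linalg.norm(residual, "fro")

print(manual_norm, numpy_norm, toy_nmf.reconstruction_err_)
```

If the three printed values agree, reading `reconstruction_err_` from each fitted model is an alternative to rebuilding the dense product `W @ H` by hand.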

In [4]:
from sklearn.decomposition import NMF
import numpy as np

# YOUR CODE HERE
# raise NotImplementedError()
# Rebuild the TF-IDF features from Question 1 so this cell runs on its own
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))
tfidf = TfidfVectorizer(lowercase=True, stop_words='english', max_df=0.95, min_df=2, max_features=1000)
processed_features = tfidf.fit_transform(newsgroups_data.data)

# Containers for the fitted models and their Frobenius norms, keyed by number of components
nmf_models = {}
frobenius_norms = {}
# Experiment with different numbers of components (use 5, 10, 15, 20).
n_components_list = [5, 10, 15, 20]

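# Fit one NMF model per candidate number of components
for n_components in n_components_list:
    # random_state=42 for reproducibility; max_iter is raised to 500 because
    # the default of 200 iterations was reached without converging
    nmf = NMF(n_components=n_components, random_state=42, solver='cd', beta_loss='frobenius', max_iter=500)
    W = nmf.fit_transform(processed_features)
    H = nmf.components_
    # Reconstruct the document-term matrix from the two factors
    reconstructed = np.dot(W, H)
    # Frobenius norm of the difference between the original and reconstructed matrices
    # (the sparse TF-IDF matrix is converted to a dense array before subtracting)
    frobenius_norm = np.linalg.norm(processed_features.toarray() - reconstructed, 'fro')

    # Store the model and its Frobenius norm for each number of components for comparison.
    nmf_models[n_components] = nmf
    frobenius_norms[n_components] = frobenius_norm

# Select the number of components with the lowest reconstruction error.
# Note: this error tends to shrink as components are added, so the criterion
# usually favors the largest candidate.
optimal_components = min(frobenius_norms, key=frobenius_norms.get)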


In [5]:
# 15 pts
def test_nmf_models():
    assert 'nmf_models' in globals(), "nmf_models dictionary not defined"
    assert isinstance(nmf_models, dict), "nmf_models should be a dictionary"
    assert all(isinstance(nmf, NMF) for nmf in nmf_models.values()), "All values in nmf_models should be instances of NMF"
    assert all(n_components in nmf_models for n_components in [5, 10, 15, 20]), "NMF models for all specified component numbers should be created"

def test_frobenius_norms():
    assert 'frobenius_norms' in globals(), "frobenius_norms dictionary not defined"
    assert isinstance(frobenius_norms, dict), "frobenius_norms should be a dictionary"
    assert len(frobenius_norms) == 4, "There should be four Frobenius norm values for the four NMF models"
    assert all(isinstance(norm, float) for norm in frobenius_norms.values()), "All values in frobenius_norms should be floats"

def test_optimal_components_selection():
    assert 'optimal_components' in globals(), "Variable 'optimal_components' not defined"
    assert optimal_components in [5, 10, 15, 20], "Optimal components not selected from the predefined list"

# Run the tests
test_nmf_models()
test_frobenius_norms()
test_optimal_components_selection()

## Question 3: Latent Dirichlet Allocation (LDA) and Topic Interpretation

**Objective:** Apply LDA to the preprocessed text dataset from Question 1, extract 5 topics, and interpret what each topic represents. Use `CountVectorizer` instead of TF-IDF.

**Task:**

1. **Implement LDA:**
   - Apply LDA on the preprocessed text data using scikit-learn's `LatentDirichletAllocation` class.
   - Extract exactly 5 topics from the dataset.

2. **Print and Interpret Topics:**
   - For each topic, print the top 10 words based on their importance in the topic.
   - Write a brief interpretation for each topic, discussing what you think the topic represents based on the top words.

Use the test cell to guide you on the variable names.
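
The "top 10 words" step can be sketched on a tiny made-up topic-word matrix before applying it to a fitted model. Everything named `toy_*` and the helper `top_words` are hypothetical illustrations. In scikit-learn, a fitted `LatentDirichletAllocation` exposes a matrix of this shape (topics by vocabulary terms) as `components_`.

```python
import numpy as np

# Hypothetical topic-word weights (2 topics x 5 words), for illustration only.
toy_feature_names = np.array(["god", "image", "nasa", "orbit", "pixel"])
toy_components = np.array([
    [0.1, 4.0, 0.2, 0.3, 3.0],   # topic weighted toward graphics terms
    [0.2, 0.1, 5.0, 2.5, 0.4],   # topic weighted toward space terms
])

def top_words(components, feature_names, n_top):
    """Return the n_top highest-weighted words per topic, strongest first."""
    result = []
    for topic in components:
        # argsort is ascending, so take the last n_top indices and reverse them
        top_idx = topic.argsort()[-n_top:][::-1]
        result.append([str(feature_names[i]) for i in top_idx])
    return result

toy_top = top_words(toy_components, toy_feature_names, n_top=2)
print(toy_top)

# Normalizing each row turns the weights into per-topic word probabilities.
toy_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(toy_probs.sum(axis=1))
```

Ranking by raw weight and ranking by normalized probability give the same order within a topic, since each row is divided by a single positive constant.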

In [6]:
# 20 pts
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# YOUR CODE HERE
# raise NotImplementedError()
# Reload the same four newsgroups as in Question 1 so this cell runs on its own
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))
# Use CountVectorizer instead of TF-IDF.
count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
count_data = count_vectorizer.fit_transform(newsgroups_data.data)

# Apply LDA on the preprocessed text data using scikit-learn's LatentDirichletAllocation class.
lda_model = LatentDirichletAllocation(n_components=5, random_state=0)
lda_model.fit(count_data)
# For each topic, print the top 10 words based on their importance in the topic.
feature_names = count_vectorizer.get_feature_names_out()
topics_words = []
for topic_idx, topic in enumerate(lda_model.components_):
    top_feature_indices = topic.argsort()[-10:][::-1]  # indices of the 10 highest-weighted words, strongest first
    topic_words = [feature_names[i] for i in top_feature_indices]
    topics_words.append(topic_words)
    
# Placeholder interpretations, written before inspecting the fitted topics.
# They are paired with topics by position only, so they need not match the
# printed top words; the interpretations based on the actual output are given
# in the answer to Q. 3.2.
interpretations = [
    "Likely about space exploration (words like 'nasa', 'space', 'orbit')",
    "Computer technology (words like 'software', 'graphics', 'image')",
    "Religion and philosophy (words like 'god', 'morality', 'belief')",
    "Online communities and discussion (words like 'internet', 'email', 'group')",
    "Science and research (words like 'data', 'study', 'theory')"
]
# Print each topic's top words alongside its placeholder interpretation
for i, (topic_words, interpretation) in enumerate(zip(topics_words, interpretations), 1):
    print(f"Top 10 words for Topic {i}: {topic_words}")
    print(f"Hypothetical interpretation of Topic {i}: {interpretation}\n")


Top 10 words for Topic 1: ['jpeg', 'image', 'file', 'gif', 'images', 'color', 'bit', 'format', 'files', 'use']
Hypothetical interpretation of Topic 1: Likely about space exploration (words like 'nasa', 'space', 'orbit')

Top 10 words for Topic 2: ['space', 'nasa', 'earth', 'launch', 'orbit', 'shuttle', 'moon', 'time', 'mission', 'solar']
Hypothetical interpretation of Topic 2: Computer technology (words like 'software', 'graphics', 'image')

Top 10 words for Topic 3: ['edu', 'graphics', 'data', 'image', 'ftp', 'available', 'pub', 'software', 'mail', 'information']
Hypothetical interpretation of Topic 3: Religion and philosophy (words like 'god', 'morality', 'belief')

Top 10 words for Topic 4: ['god', 'jesus', 'people', 'bible', 'know', 'said', 'did', 'like', 'just', 'say']
Hypothetical interpretation of Topic 4: Online communities and discussion (words like 'internet', 'email', 'group')

Top 10 words for Topic 5: ['don', 'people', 'think', 'just', 'does', 'god', 'like', 'say', 'know',

In [7]:
# 15 pts
def test_lda_model():
    assert 'lda_model' in globals(), "LDA model not defined"
    assert isinstance(lda_model, LatentDirichletAllocation), "lda_model is not an instance of LatentDirichletAllocation"
    assert lda_model.n_components == 5, "LDA model should have exactly 5 topics"

def test_topic_words():
    assert 'topics_words' in globals(), "topics_words not defined"
    assert isinstance(topics_words, list), "topics_words should be a list"
    assert all(isinstance(topic, list) for topic in topics_words), "Each topic in topics_words should be a list"
    assert all(len(topic) == 10 for topic in topics_words), "Each topic should contain exactly 10 words"

# Run the tests
test_lda_model()
test_topic_words()

**Q. 3.2** (5 pts) What does each of the topics represent?

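- Topic 1: Computer technology, specifically image file formats (`jpeg`, `gif`, `image`, `color`, `format`).
- Topic 2: Space exploration (`space`, `nasa`, `launch`, `orbit`, `shuttle`).
- Topic 3: Science and research resources shared online, mostly graphics software and data (`graphics`, `data`, `ftp`, `software`, `available`).
- Topic 4: Religion and philosophy, centered on Christianity (`god`, `jesus`, `bible`).
- Topic 5: Religion and philosophy mixed with general online discussion; conversational words (`think`, `say`, `know`) appear alongside `god`, so it overlaps with Topic 4.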