In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw7.ipynb")



# CPSC 330 - Applied Machine Learning 

## Homework 7: Word embeddings and topic modeling 
**Due date: See the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points}

**Please be aware that this homework assignment requires installing several packages in your course environment. You may run into installation problems, which can be frustrating. Remember, however, that resolving such issues is not wasted time: it is an essential skill for anyone aspiring to work in data science or machine learning.**

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024W1/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb. If the pdf or html also fails to render on Gradescope, please split your homework into two files, hw7a.ipynb with Exercise 1 and hw7b.ipynb with Exercises 2 and 3, and submit both files.  
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 1:  Exploring pre-trained word embeddings <a name="1"></a>
<hr>

In lecture 18, we talked about natural language processing (NLP). Using pre-trained word embeddings is very common in NLP. It has been shown that pre-trained word embeddings work well on a variety of text classification tasks. These embeddings are created by training a model like Word2Vec on a huge corpus of text such as a dump of Wikipedia or a dump of the web crawl. 

A number of pre-trained word embeddings are available out there. Some popular ones are: 

- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using the fastText algorithm
    * published by Facebook
    
In this exercise, you will be exploring GloVe Wikipedia pre-trained embeddings. The code below loads the word vectors trained on Wikipedia using an algorithm called GloVe. You'll need the `gensim` package in your cpsc330 conda environment to run the code below. 

```
> conda activate cpsc330
> conda install -c anaconda gensim
```

In [3]:
import gensim
import gensim.downloader

print(list(gensim.downloader.info()["models"].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [4]:
# This will take a while to run when you run it for the first time.
import gensim.downloader as api

glove_wiki_vectors = api.load("glove-wiki-gigaword-100")

In [5]:
len(glove_wiki_vectors)

400000

There are 400,000 word vectors in this pre-trained model. 

Now that we have GloVe Wiki vectors loaded in `glove_wiki_vectors`, let's explore the embeddings. 

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 Word similarity using pre-trained embeddings
rubric={points}

**Your tasks:**

- Come up with a list of 4 words of your choice and find similar words to these words using `glove_wiki_vectors` embeddings.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 2

In [6]:
def similar_words(word):
    res = glove_wiki_vectors.most_similar(word, topn = 5)
    for sim_word, sim in res:
        print(f" {sim_word} ({sim:.3f})")

In [7]:
words = ["ramen","wall","game","tool"]
for word in words:
    print("Words similar to " + word)
    similar_words(word)

Words similar to ramen
 noodle (0.622)
 noodles (0.614)
 soba (0.530)
 sushi (0.515)
 soup (0.498)
Words similar to wall
 street (0.715)
 walls (0.709)
 slide (0.656)
 floor (0.656)
 window (0.654)
Words similar to game
 games (0.864)
 play (0.832)
 season (0.773)
 player (0.758)
 players (0.729)
Words similar to tool
 tools (0.830)
 useful (0.703)
 method (0.693)
 methods (0.689)
 software (0.664)


In [8]:
...

Ellipsis

In [9]:
...

Ellipsis

In [10]:
...

Ellipsis

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Word similarity using pre-trained embeddings
rubric={points}

**Your tasks:**

1. Calculate cosine similarity for the following word pairs (`word_pairs`) using the [`similarity`](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) method of `glove_wiki_vectors`.

In [11]:
word_pairs = [
    ("coast", "shore"),
    ("clothes", "closet"),
    ("old", "new"),
    ("smart", "intelligent"),
    ("dog", "cat"),
    ("tree", "lawyer"),
]

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 2

In [12]:
for w1, w2 in word_pairs:
    score = glove_wiki_vectors.similarity(w1, w2)
    print(f" Similarity Score for {w1} and {w2}: {score}")

 Similarity Score for coast and shore: 0.7000271677970886
 Similarity Score for clothes and closet: 0.5462759733200073
 Similarity Score for old and new: 0.6432487964630127
 Similarity Score for smart and intelligent: 0.7552732229232788
 Similarity Score for dog and cat: 0.8798074722290039
 Similarity Score for tree and lawyer: 0.07671945542097092
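
The scores above are cosine similarities between word vectors. As a minimal sketch of what that measure computes, the snippet below implements it by hand on made-up 3-dimensional vectors. The `toy_vectors` values are hypothetical and are not taken from the GloVe model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "coast": [1.0, 2.0, 0.0],
    "shore": [2.0, 4.0, 0.0],   # same direction as "coast"
    "lawyer": [0.0, 0.0, 3.0],  # orthogonal to "coast"
}

same_direction = cosine_similarity(toy_vectors["coast"], toy_vectors["shore"])
orthogonal = cosine_similarity(toy_vectors["coast"], toy_vectors["lawyer"])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why unrelated pairs such as "tree" and "lawyer" end up near zero.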


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Representation of all words in English
rubric={points}

**Your tasks:**

1. The vocabulary size of Wikipedia embeddings is quite large. The `test_words` list below contains a few new words (called neologisms) and biomedical domain-specific abbreviations. Write code to check whether `glove_wiki_vectors` has a representation for each of these words. 
> If a given word `word` is in the vocabulary, `word in glove_wiki_vectors` will return True. 

In [13]:
test_words = [
    "covididiot",
    "fomo",
    "frenemies",
    "anthropause",
    "photobomb",
    "selfie",
    "pxg",  # Abbreviation for pseudoexfoliative glaucoma
    "pacg",  # Abbreviation for primary angle closure glaucoma
    "cct",  # Abbreviation for central corneal thickness
    "escc",  # Abbreviation for esophageal squamous cell carcinoma
]

<div class="alert alert-warning">

Solution_1_3
    
</div>

_Points:_ 2

In [14]:
for word in test_words:
    res = word in glove_wiki_vectors
    print(f" {word} in GloVe Vocabulary: {res}")

 covididiot in GloVe Vocabulary: False
 fomo in GloVe Vocabulary: False
 frenemies in GloVe Vocabulary: True
 anthropause in GloVe Vocabulary: False
 photobomb in GloVe Vocabulary: False
 selfie in GloVe Vocabulary: False
 pxg in GloVe Vocabulary: False
 pacg in GloVe Vocabulary: False
 cct in GloVe Vocabulary: True
 escc in GloVe Vocabulary: True
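
Out-of-vocabulary words like the ones above usually need to be handled before looking up vectors. The sketch below shows one way to separate known from unknown words with the same `in` test. The helper name `split_by_vocabulary` and the `toy_vocabulary` set are hypothetical stand-ins, not part of gensim.

```python
def split_by_vocabulary(words, vocabulary):
    """Return (known, unknown) lists, preserving the input order."""
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w not in vocabulary]
    return known, unknown


# Hypothetical stand-in for the embedding vocabulary; any object that
# supports the `in` operator would work in its place.
toy_vocabulary = {"frenemies", "cct", "escc"}

known, unknown = split_by_vocabulary(
    ["fomo", "cct", "selfie", "escc"], toy_vocabulary
)
print("known:", known)
print("unknown:", unknown)
```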


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 Stereotypes and biases in embeddings
rubric={points}

Word vectors contain lots of useful information. But they also contain stereotypes and biases of the texts they were trained on. In the lecture, we saw an example of gender bias in Google News word embeddings. Here we are using pre-trained embeddings trained on Wikipedia data. 

**Your tasks:**

1. Explore whether there are any worrisome biases or stereotypes present in these embeddings by trying out at least 4 examples. You can use the following two methods or other methods of your choice to explore this. 
    - the `analogy` function below which gives word analogies (an example shown below)
    - [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=similarity#gensim.models.keyedvectors.KeyedVectors.similarity) or [distance](https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=distance#gensim.models.keyedvectors.KeyedVectors.distances) methods (an example is shown below)

> Note that most of the recent embeddings are de-biased. But you might still observe some biases in them. Also, not all stereotypes present in pre-trained embeddings are necessarily bad. But you should be aware of them when you use them in your models. 

In [15]:
def analogy(word1, word2, word3, model=glove_wiki_vectors):
    """
    Returns analogy word using the given model.

    Parameters
    --------------
    word1 : (str)
        word1 in the analogy relation
    word2 : (str)
        word2 in the analogy relation
    word3 : (str)
        word3 in the analogy relation
    model :
        word embedding model

    Returns
    ---------------
        pd.DataFrame
    """
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])
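
The `analogy` function relies on vector arithmetic: the answer to `word1 : word2 :: word3 : ?` is sought near `word2 - word1 + word3`. The sketch below illustrates that idea on hypothetical 2-dimensional vectors. It roughly mirrors what `most_similar` does with `positive` and `negative` words (combine unit-normalized vectors, rank by cosine similarity, skip the query words), but `toy_analogy` and `toy_vectors` are illustrative assumptions, not gensim code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_analogy(word1, word2, word3, vectors):
    """Rank candidates for word1 : word2 :: word3 : ? by cosine similarity."""
    a, b, c = (normalize(vectors[w]) for w in (word1, word2, word3))
    # Target direction: word2 - word1 + word3
    target = normalize([y - x + z for x, y, z in zip(a, b, c)])
    scores = []
    for word, vec in vectors.items():
        if word in (word1, word2, word3):
            continue  # skip the query words themselves
        score = sum(t * v for t, v in zip(target, normalize(vec)))
        scores.append((word, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical vectors; the two axes loosely stand for (royalty, maleness).
toy_vectors = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

ranking = toy_analogy("man", "king", "woman", toy_vectors)
print(ranking)
```

Because the result depends entirely on the geometry learned from the training text, any stereotype encoded in that text shows up directly in the ranked completions.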

Examples of using analogy to explore biases and stereotypes.  

In [16]:
analogy("man", "doctor", "woman")

man : doctor :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | nurse | 0.773523 |
| 1 | physician | 0.718943 |
| 2 | doctors | 0.682433 |
| 3 | patient | 0.675068 |
| 4 | dentist | 0.672603 |
| 5 | pregnant | 0.664246 |
| 6 | medical | 0.652045 |
| 7 | nursing | 0.645348 |
| 8 | mother | 0.639333 |
| 9 | hospital | 0.63875 |


In [17]:
glove_wiki_vectors.similarity("aboriginal", "success")

0.1428324

In [18]:
glove_wiki_vectors.similarity("white", "success")

0.351824

<div class="alert alert-warning">

Solution_1_4
    
</div>

_Points:_ 4

In [19]:
analogy("white", "rich", "black")

white : rich :: black : ?


| | Analogy word | Score |
|---|---|---|
| 0 | diverse | 0.604348 |
| 1 | especially | 0.601948 |
| 2 | vast | 0.592111 |
| 3 | wealthy | 0.589454 |
| 4 | impoverished | 0.570855 |
| 5 | richer | 0.567632 |
| 6 | particularly | 0.566272 |
| 7 | richest | 0.560885 |
| 8 | wealth | 0.542812 |
| 9 | young | 0.539785 |


In [20]:
analogy("man", "strong", "woman")

man : strong :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | stronger | 0.726966 |
| 1 | weak | 0.657428 |
| 2 | robust | 0.644944 |
| 3 | strongest | 0.634013 |
| 4 | despite | 0.631509 |
| 5 | support | 0.627726 |
| 6 | growing | 0.625404 |
| 7 | concern | 0.605287 |
| 8 | particularly | 0.603681 |
| 9 | reflected | 0.600039 |


In [21]:
analogy("white", "smart", "black")

white : smart :: black : ?


| | Analogy word | Score |
|---|---|---|
| 0 | intelligent | 0.681611 |
| 1 | sophisticated | 0.649662 |
| 2 | kid | 0.614454 |
| 3 | clever | 0.601639 |
| 4 | incredibly | 0.595822 |
| 5 | innovative | 0.593895 |
| 6 | savvy | 0.590568 |
| 7 | really | 0.584688 |
| 8 | kind | 0.577322 |
| 9 | pretty | 0.570496 |


In [22]:
analogy("woman", "cook", "man")

woman : cook :: man : ?


| | Analogy word | Score |
|---|---|---|
| 0 | fry | 0.713323 |
| 1 | bacon | 0.589003 |
| 2 | onions | 0.585679 |
| 3 | graham | 0.580317 |
| 4 | cooked | 0.578382 |
| 5 | stewart | 0.578265 |
| 6 | minutes | 0.577864 |
| 7 | ham | 0.572744 |
| 8 | medium | 0.572086 |
| 9 | stephen | 0.572056 |


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Discussion
rubric={points}

**Your tasks:**
1. Discuss your observations from 1.4. Are there any worrisome biases in these embeddings trained on Wikipedia?   
2. Give an example of how using embeddings with biases could cause harm in the real world.

<div class="alert alert-warning">

Solution_1_5
    
</div>

_Points:_ 4

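Most of the analogies I tried in 1.4 did not surface clearly worrisome associations: the completions for `white : rich :: black`, `white : smart :: black`, and `woman : cook :: man` were mostly near-synonyms or unrelated words. Two results do point to gender stereotypes, however: `man : strong :: woman` returned "weak" as its second-ranked completion, and the provided example `man : doctor :: woman` returned "nurse" as its top completion.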

Using embeddings with biases in the context of HR recruitment systems could definitely cause harm in the real world, by perpetuating biases and leading to unfair hiring practices that are discriminatory in nature. For example, a hiring algorithm could favor resumes with male-associated terms over female-associated terms due to biased embeddings associating men with leadership or technical skills.

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Topic modeling 

The goal of topic modeling is discovering high-level themes in a large collection of texts. 

In this homework, you will explore topics in [the 20 newsgroups text dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) using `scikit-learn`'s `LatentDirichletAllocation` (LDA) model. 

Usually, topic modeling is used for discovering abstract "topics" that occur in a collection of documents when you do not know the actual topics present in the documents. But the 20 newsgroups text dataset is labeled with categories (e.g., sports, hardware, religion), and you will be able to cross-check the topics discovered by your model with these available topics. 

The starter code below loads the train portion of the data and converts it into a pandas DataFrame. For speed, we will only consider documents with the following 8 categories. 

In [23]:
from sklearn.datasets import fetch_20newsgroups

In [24]:
cats = [
    "rec.sport.hockey",
    "rec.sport.baseball",
    "soc.religion.christian",
    "alt.atheism",
    "comp.graphics",
    "comp.windows.x",
    "talk.politics.mideast",
    "talk.politics.guns",
]  # We'll only consider these categories out of 20 categories for speed.

newsgroups_train = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes"), categories=cats
)
X_news_train, y_news_train = newsgroups_train.data, newsgroups_train.target
df = pd.DataFrame(X_news_train, columns=["text"])
df["target"] = y_news_train
df["target_name"] = [
    newsgroups_train.target_names[target] for target in newsgroups_train.target
]
df

| | text | target | target_name |
|---|---|---|---|
| 0 | You know, I was reading 18 U.S.C. 922 and some... | 6 | talk.politics.guns |
| 1 | \n\n\nIt's not a bad question: I don't have an... | 1 | comp.graphics |
| 2 | \nActuallay I don't, but on the other hand I d... | 1 | comp.graphics |
| 3 | The following problem is really bugging me,\na... | 2 | comp.windows.x |
| 4 | \n\n This is the latest from UPI \n\n For... | 7 | talk.politics.mideast |
| ... | ... | ... | ... |
| 4558 | Hi Everyone ::\n\nI am looking for some soft... | 1 | comp.graphics |
| 4559 | Archive-name: x-faq/part3\nLast-modified: 1993... | 2 | comp.windows.x |
| 4560 | \nThat's nice, but it doesn't answer the quest... | 6 | talk.politics.guns |
| 4561 | Hi,\n I just got myself a Gateway 4DX-33V ... | 2 | comp.windows.x |


In [25]:
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.windows.x',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast']

<br><br>

<!-- BEGIN QUESTION -->

### 2.1 Preprocessing using [spaCy](https://spacy.io/)
rubric={points}

Preprocessing is a crucial step before carrying out topic modeling and it markedly affects topic modeling results. In this exercise, you'll prepare the data using [spaCy](https://spacy.io/) for topic modeling. 

**Your tasks:** 

- Write code using [spaCy](https://spacy.io/) to preprocess the `text` column in the given dataframe `df` and save the processed text in a new column called `text_pp` within the same dataframe.

If you do not have spaCy in your course environment, you'll have to [install it](https://spacy.io/usage) and download the pretrained model en_core_web_md. 

`python -m spacy download en_core_web_md`


Note that there is no such thing as "perfect" preprocessing. You'll have to make your own judgments and decisions on which tokens are likely to be more informative for the given task. Some common text preprocessing steps for topic modeling include: 
- getting rid of slashes, new-line characters, or any other non-informative characters
- sentence segmentation and tokenization      
- replacing urls, email addresses, or numbers with generic tokens such as "URL",  "EMAIL", "NUM". 
- getting rid of other fairly unique tokens which are not going to help us in topic modeling  
- excluding stopwords and punctuation 
- lemmatization


> Check out [these available attributes](https://spacy.io/api/token#attributes) for `token` in spaCy which might help you with preprocessing. 

> You can also get rid of words with specific POS tags. [Here](https://universaldependencies.org/u/pos/) is the list of part-of-speech tags used in spaCy. 

> You may have to use regex to clean text before passing it to spaCy. Also, you might have to go back and forth between preprocessing in this exercise and topic modeling in Exercise 2 before finalizing preprocessing steps. 

> Note that preprocessing the corpus might take some time. So here are a couple of suggestions: 1) During the debugging phase, work on a smaller subset of the data. 2) Once you finalize the preprocessing part, you might want to save the preprocessed data in a CSV and work with this CSV so that you don't run the preprocessing part every time you run the notebook. 
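
The caching suggestion above can be sketched as follows, using only the standard library. The function name `load_or_preprocess`, the file name, and the single `text_pp` column layout are assumptions made for illustration; `str.lower` stands in for the real spaCy preprocessing function.

```python
import csv
import os
import tempfile


def load_or_preprocess(texts, preprocess, cache_path):
    """Return preprocessed texts, reusing a CSV cache when it exists."""
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as f:
            return [row["text_pp"] for row in csv.DictReader(f)]
    processed = [preprocess(text) for text in texts]
    with open(cache_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text_pp"])
        writer.writeheader()
        writer.writerows({"text_pp": p} for p in processed)
    return processed


# Demonstration in a temporary directory with a stand-in preprocessing step.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "text_pp.csv")
    docs = ["Hello World", "Foo"]
    first = load_or_preprocess(docs, str.lower, cache_file)   # computes, writes cache
    second = load_or_preprocess(docs, str.lower, cache_file)  # reads cache
    print(first, second)
```

This sketch does not detect changes to the input or to the preprocessing function, so the cache file has to be deleted by hand whenever the preprocessing steps change.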
 


In [26]:
import spacy
nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])

<div class="alert alert-warning">

Solution_2_1
    
</div>

_Points:_ 8

In [27]:
import re
def clean_text(text):
    """
    Cleans the text by removing unwanted characters and replacing patterns.
    """
    # Replace URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "URL", text, flags=re.MULTILINE)
    # Replace email addresses
    text = re.sub(r"\S+@\S+", "EMAIL", text)
    # Replace numbers
    text = re.sub(r"\b\d+\b", "NUM", text)
    # Remove slashes and newline characters
    text = text.replace("\\", " ").replace("/", " ").replace("\n", " ")
    return text

df["cleaned_text"] = df["text"].apply(clean_text)

def preprocess_text(text, nlp):
    """
    Preprocesses text: tokenizes, removes stopwords, punctuation, and lemmatizes tokens.
    """
    doc = nlp(text)
    tokens = []
    for token in doc:
        if (
            not token.is_stop  # Exclude stopwords
            and not token.is_punct  # Exclude punctuation
            and not token.like_num  # Exclude numbers
            and token.is_alpha  # Keep only alphabetic tokens
        ):
            tokens.append(token.lemma_)  # Append lemmatized version of token
    return " ".join(tokens)

df["processed_text"] = df["cleaned_text"].apply(lambda x: preprocess_text(x, nlp))

print(df.head())

                                                text  target  \
0  You know, I was reading 18 U.S.C. 922 and some...       6   
1  \n\n\nIt's not a bad question: I don't have an...       1   
2  \nActuallay I don't, but on the other hand I d...       1   
3  The following problem is really bugging me,\na...       2   
4  \n\n  This is the latest from UPI \n\n     For...       7   

             target_name                                       cleaned_text  \
0     talk.politics.guns  You know, I was reading NUM U.S.C. NUM and som...   
1          comp.graphics     It's not a bad question: I don't have any r...   
2          comp.graphics   Actuallay I don't, but on the other hand I do...   
3         comp.windows.x  The following problem is really bugging me, an...   
4  talk.politics.mideast      This is the latest from UPI        Foreign...   

                                      processed_text  
0  know read NUM NUM sence wonder help NUM NUM pr...  
1  bad question ref list algor

In [28]:
...

Ellipsis

In [29]:
...

Ellipsis

In [30]:
df.iloc[2:6]

| | text | target | target_name | cleaned_text | processed_text |
|---|---|---|---|---|---|
| 2 | \nActuallay I don't, but on the other hand I d... | 1 | comp.graphics | Actuallay I don't, but on the other hand I do... | Actuallay hand support idea have newsgroup asp... |
| 3 | The following problem is really bugging me,\na... | 2 | comp.windows.x | The following problem is really bugging me, an... | follow problem bug appreciate help create wind... |
| 4 | \n\n This is the latest from UPI \n\n For... | 7 | talk.politics.mideast | This is the latest from UPI Foreign... | late UPI Foreign Ministry spokesman Ferhat Ata... |
| 5 | Hi,\n I'd like to subscribe to Leadership Ma... | 5 | soc.religion.christian | Hi, I'd like to subscribe to Leadership Mag... | hi like subscribe Leadership Magazine wonder d... |


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Justification
rubric={points}

**Your tasks:**

- Outline the preprocessing steps you carried out in the previous exercise (bullet point format is fine), providing a brief justification when necessary. 

> You might want to wait to answer this question till you are done with Exercise 2 and you have finalized the preprocessing steps in 2.1. 

<div class="alert alert-warning">

Solution_2_2
    
</div>

_Points:_ 2

- **Replacing URLs, email addresses, and numbers.** Replaced URLs, email addresses, and numeric values with placeholder tokens (`URL`, `EMAIL`, `NUM`). The specific values rarely contribute semantic meaning for topics and may introduce noise; the placeholders preserve the information that such an element was present.
- **Removing non-informative characters.** Removed slashes (`/`), backslashes (`\`), and newline characters (`\n`). These characters are structural artifacts in text and are not useful for topic modeling.
- **Tokenization.** Split text into individual tokens using spaCy. Tokenization is necessary for downstream steps like lemmatization and vectorization.
- **Stopword removal.** Excluded common stopwords (e.g., "the", "is") using spaCy's `is_stop` attribute, and again with sklearn's English stopword list in `CountVectorizer`. Stopwords are high-frequency words that don't contribute meaningfully to topics.
- **Punctuation removal.** Removed punctuation marks (e.g., ".", ",", "!"). Punctuation does not carry semantic meaning in most contexts.
- **Lemmatization.** Reduced words to their base forms using spaCy (e.g., "running" → "run"). This reduces dimensionality by consolidating different forms of the same word (e.g., "run", "running", and "ran").
- **Removing low-information tokens.** Kept only alphabetic tokens (`token.is_alpha`) and dropped number-like tokens. At vectorization, only words of at least three letters and the 5,000 most frequent terms are kept. Random alphanumeric strings, rare symbols, and very short tokens do not contribute to meaningful topics and may introduce noise.
- **Lowercasing.** The spaCy step keeps the original case (e.g., "UPI" in `processed_text`); lowercasing happens at vectorization, where `CountVectorizer` lowercases by default. This avoids treating words with different cases as separate tokens (e.g., "Hockey" vs. "hockey").
- **Part-of-speech (POS) filtering.** Not applied in the code in 2.1. Keeping only nouns, adjectives, and verbs with spaCy's POS tags would be a possible refinement, since these parts of speech typically carry the most meaning for topic modeling.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.3 Build a topic model using sklearn's LatentDirichletAllocation
rubric={points}

**Your tasks:**

1. Build LDA models on the preprocessed data using [sklearn's `LatentDirichletAllocation`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) and random state 42. Experiment with a few values for the number of topics (`n_components`). Pick a reasonable number for the number of topics and briefly justify your choice.

<div class="alert alert-warning">

Solution_2_3
    
</div>

_Points:_ 4

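Among the values tried (5 to 35 in steps of 5), 20 topics gives the lowest perplexity (834.62) and produces interpretable topics that align well with the actual dataset categories, so I picked 20. Note that the perplexity here is computed on the training data, so it is a rough guide rather than a held-out estimate.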

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

vectorizer = CountVectorizer(
    stop_words="english",
    max_features=5000,
    token_pattern=r"\b[a-zA-Z]{3,}\b"
)
X = vectorizer.fit_transform(df["processed_text"])

n_topics_list = [5, 10, 15, 20, 25, 30, 35]
lda_models = {}
perplexity_scores = []

for n_topics in n_topics_list:
    lda = LatentDirichletAllocation(
        n_components=n_topics, 
        random_state=42, 
        learning_method="batch"
    )
    lda.fit(X)
    lda_models[n_topics] = lda
    perplexity = lda.perplexity(X)
    perplexity_scores.append(perplexity)
    print(f"Number of Topics: {n_topics}, Perplexity: {perplexity:.2f}")

optimal_topics = n_topics_list[np.argmin(perplexity_scores)]
print(f"\nOptimal Number of Topics (Lowest Perplexity): {optimal_topics}")




Number of Topics: 5, Perplexity: 929.78
Number of Topics: 10, Perplexity: 909.44
Number of Topics: 15, Perplexity: 851.05
Number of Topics: 20, Perplexity: 834.62
Number of Topics: 25, Perplexity: 839.38
Number of Topics: 30, Perplexity: 854.70
Number of Topics: 35, Perplexity: 861.82

Optimal Number of Topics (Lowest Perplexity): 20
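
Perplexity is the exponential of the negative average log-likelihood per token, so lower values mean the model finds the text less surprising. The sketch below shows that definition on a toy unigram model rather than on LDA (sklearn's `perplexity` uses an approximation based on the model's variational bound). The function `unigram_perplexity` and the token lists are hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, eval_tokens):
    """Perplexity of a maximum-likelihood unigram model on eval_tokens.

    Assumes every evaluation token also occurs in train_tokens.
    """
    counts = Counter(train_tokens)
    total = sum(counts.values())
    log_likelihood = sum(math.log(counts[tok] / total) for tok in eval_tokens)
    return math.exp(-log_likelihood / len(eval_tokens))


# Four equally likely words: the model is as uncertain as a 4-way guess.
uniform_tokens = ["gun", "law", "team", "game"]
uniform = unigram_perplexity(uniform_tokens, uniform_tokens)

# A skewed distribution is easier to predict, so perplexity drops.
skewed_tokens = ["game", "game", "game", "law"]
skewed = unigram_perplexity(skewed_tokens, skewed_tokens)

print(uniform, skewed)
```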


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.4 Exploring word topic association
rubric={points}

**Your tasks:**
1. For the number of topics you picked in the previous exercise, show top 10 words for each of your topics and suggest labels for each of the topics (similar to how we came up with labels "health and nutrition", "fashion", and "machine learning" in the toy example we saw in class). 

> If your topics do not make much sense, you might have to go back to preprocessing in Exercise 2.1, improve it, and train your LDA model again. 

<div class="alert alert-warning">

Solution_2_4
    
</div>

_Points:_ 5

In [32]:
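def print_topics(model, vectorizer, top_n=10):
    """Print the top_n highest-weight words for each topic of a fitted LDA model."""
    words = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        # argsort is ascending, so each list ends with the highest-weight word
        top_words = [words[i] for i in topic.argsort()[-top_n:]]
        print(f"Topic #{topic_idx + 1}: {', '.join(top_words)}")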

print(f"\nTopics for {optimal_topics} Topics:")
print_topics(lda_models[optimal_topics], vectorizer)


Topics for 20 Topics:
Topic #1: know, armenia, want, time, say, armenians, email, window, henrik, right
Topic #2: problem, display, set, run, motif, use, application, server, widget, window
Topic #3: hit, season, win, run, player, good, team, game, year, num
Topic #4: min, chi, export, tor, email, available, det, bos, contrib, num
Topic #5: win, goal, nhl, season, player, hockey, play, team, game, num
Topic #6: criminal, number, problem, kill, know, crime, argument, rate, people, gun
Topic #7: university, history, turks, armenians, greek, jews, turkey, armenian, turkish, num
Topic #8: philadelphia, san, vancouver, scorer, new, play, power, pts, period, num
Topic #9: want, woman, think, time, tell, num, people, know, come, say
Topic #10: know, num, say, people, hell, man, question, think, god, jesus
Topic #11: source, help, mail, need, send, use, know, program, thank, email
Topic #12: control, num, government, people, firearm, right, state, weapon, law, gun
Topic #13: rule, oname, line
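
In the listing above, each topic's words appear in ascending order of weight, so the strongest word comes last. As a small sketch of the alternative, the snippet below picks the top words with the highest weight first; with NumPy the equivalent slice would be `topic.argsort()[::-1][:top_n]`. The vocabulary and weights are made up for illustration.

```python
def top_words(topic_weights, vocabulary, top_n=10):
    """Return the top_n words for one topic, highest weight first."""
    ranked = sorted(
        range(len(topic_weights)),
        key=lambda i: topic_weights[i],
        reverse=True,
    )
    return [vocabulary[i] for i in ranked[:top_n]]


# Hypothetical vocabulary and topic-word weights.
toy_vocabulary = ["gun", "law", "team", "game", "window"]
toy_topic = [0.5, 3.2, 0.1, 0.05, 1.7]

top3 = top_words(toy_topic, toy_vocabulary, top_n=3)
print(top3)
```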

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.5 Exploring document topic association
rubric={points}

**Your tasks:**
1. Show the document topic assignment of the first five documents from `df`. 
2. Comment on the document topic assignment of the model. 

<div class="alert alert-warning">

Solution_2_5
    
</div>

_Points:_ 5

In [33]:
doc_topic_distributions = lda_models[20].transform(X)

df["dominant_topic"] = doc_topic_distributions.argmax(axis=1)

for i in range(5):
    print(f"Document #{i + 1}:")
    print(f"Text: {df['text'].iloc[i][:200]}...")
    print(f"Dominant Topic: Topic #{df['dominant_topic'].iloc[i] + 1}")
    print(f"Topic Probabilities: {doc_topic_distributions[i]}")
    print()


Document #1:
Text: You know, I was reading 18 U.S.C. 922 and something just did not make 
sence and I was wondering if someone could help me out.

Say U.S.C. 922 :

(1) Except as provided in paragraph (2), it shall be u...
Dominant Topic: Topic #12
Topic Probabilities: [0.0009434  0.0009434  0.0009434  0.0009434  0.0009434  0.0009434
 0.0009434  0.12685925 0.0009434  0.16848634 0.0009434  0.52598942
 0.0009434  0.0009434  0.0009434  0.16357065 0.0009434  0.0009434
 0.0009434  0.0009434 ]

Document #2:
Text: 


It's not a bad question: I don't have any refs that list this algorithm
either. But thinking about it a bit, it shouldn't be too hard.

1) Take three of the points and find the plane they define as...
Dominant Topic: Topic #19
Topic Probabilities: [7.24637689e-04 7.24637688e-04 7.24637687e-04 7.24637693e-04
 7.24637693e-04 7.24637687e-04 7.24637687e-04 5.05763339e-02
 7.24637688e-04 7.24637689e-04 7.24637691e-04 7.24637687e-04
 7.24637688e-04 7.24637689e-04 7.24637688e-04 7.24637

The LDA model has successfully assigned topics to the first five documents in the dataset. For most documents, the dominant topics align well with the main themes present in the content, demonstrating the effectiveness of the model in uncovering latent patterns. Below is a detailed discussion of the topic assignments for each document:

Document #1
This document discusses a legal query about U.S. regulations (18 U.S.C. 922) and seeks clarification about its interpretation. The model assigns Topic #12 as the dominant topic, with a strong probability of 52.6%. This topic focuses on legal and governmental terms, such as "control," "government," "law," and "gun," making the assignment accurate. The document’s content is clearly regulatory, fitting well with the theme of legal and governmental discussions. Minor contributions from related topics suggest some overlap with general discourse themes.

Document #2
The second document poses a technical question about an algorithm related to points and planes. The model identifies Topic #19 as the dominant topic with an overwhelming probability of 93.6%. This topic encompasses terms such as "graphics," "image," and "file," closely related to the computational and graphical nature of the discussion. The assignment is precise, as the document clearly belongs to a technical or software-focused domain.

Document #3
This document discusses the organization of newsgroups for graphics programming, critiquing earlier suggestions. The model assigns Topic #16 as the dominant topic, with a probability of 38.6%, while significant contributions also come from Topic #14 (33.9%) and Topic #18 (25.7%). This overlap reflects the document’s multi-faceted content, which includes both general discourse and technical discussions. The assignment to Topic #16, which includes conversational terms like "start" and "think," is logical, though the mixed probabilities indicate some ambiguity due to the document's broad scope.

Document #4
The fourth document describes a programming issue involving the creation of windows, event masks, and child windows. The model confidently assigns Topic #2 as the dominant topic, with a high probability of 94.4%. This topic includes terms such as "window," "display," and "application," aligning closely with the document’s technical focus. The assignment is highly accurate, as the content is specific to programming challenges, which are central to Topic #2.

Document #5
This document reports on Turkey’s decision to close its airspace to Armenian flights and block humanitarian aid. The model assigns Topic #17 as the dominant topic, with a probability of 42.9%, reflecting its focus on terms like "war," "attack," and "Armenian." However, the document also shows notable contributions from Topic #7 (27.4%) and Topic #12 (22.8%), indicating overlap with themes related to historical relations and governmental actions. The dominant assignment to Topic #17 is appropriate given the political and international relations focus of the document, but the overlap highlights the interconnected nature of political and historical discussions.
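
Several of the documents above spread their probability over more than one topic. The sketch below shows one way to report the dominant topic together with every topic above a chosen threshold. The function `summarize_topics`, the 0.2 threshold, and the toy distribution are assumptions for illustration, not values from the fitted model.

```python
def summarize_topics(probabilities, threshold=0.2):
    """Return (dominant_topic, notable_topics) using 1-based topic numbers."""
    ranked = sorted(
        enumerate(probabilities, start=1),
        key=lambda pair: pair[1],
        reverse=True,
    )
    dominant = ranked[0][0]
    notable = [(topic, prob) for topic, prob in ranked if prob >= threshold]
    return dominant, notable


# Hypothetical document-topic distribution over four topics.
toy_distribution = [0.05, 0.40, 0.30, 0.25]

dominant, notable = summarize_topics(toy_distribution)
print(dominant, notable)
```

A document with several notable topics, like the toy one here, is better described as a mixture than by its dominant topic alone.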

<!-- END QUESTION -->

<br><br><br><br>

<!-- BEGIN QUESTION -->

## Exercise 3: Short answer questions 
<hr>

rubric={points}

1. Briefly explain how content-based filtering works in the context of recommender systems. 
2. Discuss at least two negative consequences of recommender systems.
3. What is transfer learning in natural language processing? Briefly explain.     

<div class="alert alert-warning">

Solution_3
    
</div>

_Points:_ 6

1. Content-based filtering is a recommendation approach that uses the characteristics of items and users to make personalized suggestions. It relies on analyzing the content or features of items a user has interacted with and finding other items with similar attributes to recommend. For example, in the context of movie recommendations, if a user has rated several action movies highly, the system will analyze the features of those movies (e.g., genre, director, cast) and recommend other action movies with similar features. The key idea is that recommendations are tailored to the user's own preferences rather than relying on the behavior of other users. Items are typically represented with features such as TF-IDF vectors or embeddings and compared with a measure such as cosine similarity. However, the approach can struggle with the "cold start" problem for new users, who have too little interaction history to build a profile from; new items are less of an issue as long as their features are available.

2. Two negative consequences of recommender systems:
    - **Filter bubbles:** Recommender systems can trap users in "filter bubbles," where they are exposed only to content similar to their past interactions. This limits diversity and serendipity in recommendations, reinforcing pre-existing preferences or biases. For instance, in news platforms, this can lead to users being shown articles that align only with their views, potentially exacerbating polarization.
    - **Amplification of biases:** If the underlying data used to train the recommender system contains biases, the system can amplify these biases. For example, if an e-commerce system primarily recommends products for men because of historical sales patterns, it may unintentionally underrepresent items targeted at women, leading to discriminatory outcomes.

3. Transfer learning in NLP is a technique where a pre-trained model, typically trained on a large and diverse corpus of text, is fine-tuned for a specific downstream task. Instead of training a model from scratch, which is computationally expensive and requires vast amounts of labeled data, transfer learning leverages the knowledge the pre-trained model has acquired (e.g., understanding syntax, semantics, or general language patterns) and adapts it to new tasks such as sentiment analysis, machine translation, or text classification.

For instance, models like BERT or GPT are first pre-trained on a massive dataset (e.g., Wikipedia, books) to learn general language representations. These pre-trained models can then be fine-tuned on a smaller, domain-specific dataset to perform tasks like spam detection or chatbot responses. Transfer learning has revolutionized NLP by significantly improving accuracy and reducing the data requirements for task-specific models.
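
The content-based filtering idea from answer 1 can be sketched in a few lines: build a user profile as the mean feature vector of liked items, then rank unseen items by cosine similarity to that profile. The catalogue, its feature axes, and the function `recommend` are hypothetical and only illustrate the mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recommend(liked, catalogue, top_n=2):
    """Rank unseen items by similarity to the mean feature vector of liked items."""
    dims = len(next(iter(catalogue.values())))
    profile = [sum(catalogue[item][d] for item in liked) / len(liked) for d in range(dims)]
    candidates = [
        (item, cosine(profile, features))
        for item, features in catalogue.items()
        if item not in liked
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in candidates[:top_n]]


# Hypothetical movies with features (action, comedy, romance).
catalogue = {
    "Action A": [1.0, 0.0, 0.0],
    "Action B": [0.9, 0.1, 0.0],
    "Action C": [0.8, 0.0, 0.2],
    "Romcom D": [0.0, 0.6, 0.8],
}

suggestions = recommend(["Action A", "Action B"], catalogue, top_n=2)
print(suggestions)
```

No other user's ratings are involved, which is the key difference from collaborative filtering.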

<!-- END QUESTION -->

<br><br><br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

![](img/eva-well-done.png)