# Homework 2 (Due Thursday Dec 1, 6:29pm PST)

Please submit as a notebook in the format `HW2_FIRSTNAME_LASTNAME_USCID.ipynb` in a group chat to me and the TAs.

Your `USCID` is your 10-digit student ID.

### Part I. Topic Modelling and Analysis (5 pts)

Pick **one** of the dataset options below:
* **Negative McDonalds Yelp reviews**: `datasets/mcdonalds-yelp-negative-reviews.csv`
* **[Top 5000 Udemy courses](https://www.kaggle.com/datasets/90eededa5561eee7f62c0e68ecdad14c2bdb58bc923834067025dee655a6083e?resource=download)** - a Kaggle dataset of the course descriptions of the top 5000 Udemy courses in 2022: `datasets/top5000_udemy.csv`

In your notebook, explore the data and perform topic modelling. You may use any vectorization or text preprocessing techniques we have discussed.

In order to earn full credit, you must:

* Show the **# of topics you tried, and explain why you ultimately decided on the final #**.
* Demonstrate **adequate text preprocessing (there are likely obvious stopwords / fuzzy matching / regex groupings that can be done to improve the final results)** - show what you tried.
* In 2-3 sentences, provide a **business analysis of these topics - what do they reveal as actionable next steps or insights for McDonalds or Udemy?** Please be specific in your recommendations/insights.
    - **Not specific**: *We recommend Amazon look into the quality of their toys, since the reviews show dissatisfaction with the value of their product.*
    - **Specific**: *Amazon should explore more durable batteries/hardware. For example, X% of reviews mention that the toys' batteries were broken or immediately died. This is part of a larger theme of components not being ready to use out of the box, which often leads to disappointment on holiday occasions when children open up their gifts. See the following document snippets as examples:...*

#### 1. Loading Data 

In [46]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
reviews = pd.read_csv("../datasets/mcdonalds-yelp-negative-reviews.csv", encoding='latin-1')
text = reviews["review"].values

In [47]:
reviews.head()

| | _unit_id | city | review |
|---|---|---|---|
| 0 | 679455653 | Atlanta | I'm not a huge mcds lover, but I've been to be... |
| 1 | 679455654 | Atlanta | Terrible customer service. I came in at 9:30pm... |
| 2 | 679455655 | Atlanta | First they "lost" my order, actually they gave... |
| 3 | 679455656 | Atlanta | I see I'm not the only one giving 1 star. Only... |
| 4 | 679455657 | Atlanta | Well, it's McDonald's, so you know what the fo... |


#### 2. Data Cleaning and Text Preprocessing

In [48]:
reviews['reviews_processed'] = reviews['review']

# Remove punctuation
from textacy.preprocessing.remove import punctuation
reviews['reviews_processed'] = reviews['reviews_processed'].apply(punctuation)

# Convert to lowercase 
reviews['reviews_processed'] = reviews['reviews_processed'].map(lambda x: x.lower())

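# Replace common entities/concepts
# NOTE: this step is commented out. If re-enabled, it should run before the
# punctuation removal above, since URLs and emails are no longer recognizable
# once their punctuation has been stripped.
from textacy.preprocessing.replace import urls, hashtags, numbers, emails, emojis, currency_symbols
#reviews['reviews_processed'] = reviews['reviews_processed'].\
#  apply(urls).\
#  apply(hashtags).\
#  apply(numbers).\
#  apply(currency_symbols).\
#  apply(emojis).\
#  apply(emails)

# Remove or normalize undesired text elements
# (these imports and the `quotes` list are not applied in this cell)
from collections import Counter
from textacy.preprocessing.normalize import quotation_marks, bullet_points
quotes = ['"','“','”']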

reviews.head(10)


| | _unit_id | city | review | reviews_processed |
|---|---|---|---|---|
| 0 | 679455653 | Atlanta | I'm not a huge mcds lover, but I've been to be... | i m not a huge mcds lover but i ve been to be... |
| 1 | 679455654 | Atlanta | Terrible customer service. I came in at 9:30pm... | terrible customer service i came in at 9 30pm... |
| 2 | 679455655 | Atlanta | First they "lost" my order, actually they gave... | first they lost my order actually they gave... |
| 3 | 679455656 | Atlanta | I see I'm not the only one giving 1 star. Only... | i see i m not the only one giving 1 star only... |
| 4 | 679455657 | Atlanta | Well, it's McDonald's, so you know what the fo... | well it s mcdonald s so you know what the fo... |
| 5 | 679455658 | Atlanta | This has to be one of the worst and slowest Mc... | this has to be one of the worst and slowest mc... |
| 6 | 679455659 | Atlanta | I'm not crazy about this McDonald's. This is p... | i m not crazy about this mcdonald s this is p... |
| 7 | 679455660 | Atlanta | One Star and I'm beng kind. I blame management... | one star and i m beng kind i blame management... |
| 8 | 679455661 | Atlanta | Never been upset about any fast food drive thr... | never been upset about any fast food drive thr... |
| 9 | 679455662 | Atlanta | This McDonald's has gotten much better. Usuall... | this mcdonald s has gotten much better usuall... |


In [49]:
# Removing stopwords using gensim 
from gensim.parsing.preprocessing import remove_stopwords
reviews['reviews_processed'] = reviews['reviews_processed'].apply(remove_stopwords)
reviews['reviews_processed'].head(10)


0    m huge mcds lover ve better ones far worst ve ...
1    terrible customer service came 9 30pm stood re...
2    lost order actually gave took 20 minutes figur...
3                       m giving 1 star 25 star s need
4    s mcdonald s know food review reflects solely ...
5    worst slowest mcdonald s franchises t figure e...
6    m crazy mcdonald s primarily slow gosh exactly...
7    star m beng kind blame management day free cof...
8    upset fast food drive service till came mcdona...
9    mcdonald s gotten better usually order wrong s...
Name: reviews_processed, dtype: object
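The sample output above still contains several spellings of the brand name (`mcds`, `mcdonald s`) and stray fragments such as `m`, `s`, `t`, and `ve` left behind when apostrophes were replaced by spaces. One possible regex grouping is sketched below. The patterns and the `normalize_review` helper are hypothetical and were chosen by eye from these ten rows, so they would need checking against the full corpus.

```python
import re

# Hypothetical pattern: brand-name variants seen in the sample rows above.
BRAND_PATTERN = re.compile(r"\b(?:mcdonald(?:\s?s)?|mcds?)\b")
# Fragments left over from contractions once apostrophes become spaces
# (e.g. "i'm" -> "i m", "i've" -> "i ve", "can't" -> "can t").
CONTRACTION_FRAGMENTS = re.compile(r"\b(?:m|s|t|ve|re|ll|d)\b")

def normalize_review(text: str) -> str:
    """Collapse brand-name variants into one token and drop contraction fragments."""
    # Run the brand rule first so the "s" in "mcdonald s" is merged, not deleted.
    text = BRAND_PATTERN.sub("mcdonalds", text)
    text = CONTRACTION_FRAGMENTS.sub(" ", text)
    # Squeeze the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_review("m huge mcds lover ve better"))
print(normalize_review("s mcdonald s know food"))
```

In the notebook this could be applied with `reviews['reviews_processed'].apply(normalize_review)` before vectorizing, so that trigrams such as `worst mcdonald ve` and `worst mcdonalds ve` are no longer counted as separate features.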

In [50]:
# Vectorize the corpus 
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(3,3), min_df=3,
                            max_df=0.4, stop_words="english")

X = vectorizer.fit_transform(reviews['reviews_processed'])
terms = vectorizer.get_feature_names_out()
tf_idf = pd.DataFrame(X.toarray(), columns=terms)

print(f"TF-IDF: {tf_idf.shape}")
tf_idf.head(5)

TF-IDF: (1525, 168)


| | 10 minutes food | 10 minutes fries | 10 minutes later | 10 minutes order | 10 piece chicken | 15 minutes drive | 15 minutes later | 20 minutes drive | 20 minutes order | 24 hour drive | ... | window pick food | wish negative stars | worst customer service | worst fast food | worst mcdonald ve | worst mcdonalds planet | worst mcdonalds ve | worst service ve | write review mcdonald | write review mcdonalds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [51]:
# Fit the NMF Model
nmf = NMF(n_components=5)
W = nmf.fit_transform(X)
H = nmf.components_
print(f"Original shape of X is {X.shape}")
print(f"Decomposed W matrix is {W.shape}")
print(f"Decomposed H matrix is {H.shape}")

Original shape of X is (1525, 168)
Decomposed W matrix is (1525, 5)
Decomposed H matrix is (5, 168)
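The cell above fits a single model with 5 topics. To show which topic counts were tried, one option is to refit NMF over a range of values and record scikit-learn's `reconstruction_err_` for each. The sketch below does this on a small made-up corpus (`toy_docs` and the `reconstruction_errors` helper are hypothetical); in the notebook, the `X` matrix from the vectorizer cell would take the place of `toy_X`.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the review text.
toy_docs = [
    "drive thru order wrong again",
    "order wrong missing fries",
    "waited twenty minutes drive thru",
    "slow drive thru long wait",
    "rude cashier terrible customer service",
    "worst customer service rude manager",
    "cold fries stale burger",
    "burger cold fries soggy",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

def reconstruction_errors(X, topic_counts, seed=0):
    """Fit one NMF model per topic count and return {k: reconstruction error}."""
    errors = {}
    for k in topic_counts:
        # A fixed random_state keeps the comparison repeatable between runs.
        model = NMF(n_components=k, init="nndsvda", random_state=seed, max_iter=500)
        model.fit(X)
        errors[k] = model.reconstruction_err_
    return errors

errors = reconstruction_errors(toy_X, topic_counts=[2, 3, 4])
for k, err in errors.items():
    print(f"k={k}: reconstruction error = {err:.3f}")
```

Reconstruction error generally falls as topics are added, so it is best read as a guide for spotting where the improvement flattens out. The final choice should still rest on whether the top tokens for each topic are distinct and interpretable.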




In [52]:
from typing import List
import numpy as np

# Report Results
def get_top_tf_idf_tokens_for_topic(H: np.ndarray, feature_names: List[str], num_top_tokens: int = 5):
  """
  Uses the H matrix (K components x M original features) to print, for each
  topic, the highest-weighted tokens and each token's share of the topic's
  total weight.
  """
  for topic, vector in enumerate(H):
    print(f"TOPIC {topic}\n")
    total = vector.sum()
    # Indices of the largest weights, in descending order
    top_indices = vector.argsort()[::-1][:num_top_tokens]
    token_names = [feature_names[idx] for idx in top_indices]
    strengths = [vector[idx] / total for idx in top_indices]

    for strength, token_name in zip(strengths, token_names):
      print(f"{token_name} ({round(strength * 100, 1)}%)\n")
    print("=" * 50)

get_top_tf_idf_tokens_for_topic(H, tf_idf.columns.tolist(), 3)

TOPIC 0

open 24 hours (60.4%)

drive open 24 (11.7%)

customer service place (3.4%)

TOPIC 1

got order wrong (50.7%)

order wrong time (6.4%)

mcdonald ve ve (4.3%)

TOPIC 2

eat fast food (71.7%)

fast food restaurants (8.1%)

good service quick (6.8%)

TOPIC 3

bacon egg cheese (36.7%)

egg cheese biscuit (30.5%)

free wi fi (11.3%)

TOPIC 4

worst customer service (44.4%)

customer service ve (9.8%)

drive order wrong (5.6%)
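The percentages above are each token's share of a topic's weight in `H`, not the share of reviews that discuss the topic. For the business analysis, the document-topic matrix `W` can be used to estimate how many reviews each topic dominates and to pull example snippets. The sketch below uses a small made-up matrix (`toy_W`, `toy_texts`, and both helpers are hypothetical); in the notebook, `W` and `reviews['review']` would be used instead.

```python
import numpy as np

def topic_shares(W: np.ndarray) -> np.ndarray:
    """Fraction of all documents whose largest weight falls on each topic.

    Documents with an all-zero row (none of the retained features occur in
    them) are not assigned to any topic, so the shares may sum to less than 1.
    """
    has_topic = W.sum(axis=1) > 0
    dominant = W[has_topic].argmax(axis=1)
    counts = np.bincount(dominant, minlength=W.shape[1])
    return counts / len(W)

def top_documents(W: np.ndarray, texts, topic: int, n: int = 3):
    """Return up to n texts with the highest non-zero weight on the given topic."""
    order = np.argsort(W[:, topic])[::-1][:n]
    return [texts[i] for i in order if W[i, topic] > 0]

# Hypothetical 4-document, 2-topic example.
toy_W = np.array([[0.9, 0.0], [0.2, 0.5], [0.0, 0.0], [0.0, 0.7]])
toy_texts = ["review a", "review b", "review c", "review d"]

print(topic_shares(toy_W))
print(top_documents(toy_W, toy_texts, topic=1, n=2))
```

With trigram features and `min_df=3`, only 168 features survive across 1525 reviews, so it is worth checking how many rows of `W` are entirely zero before quoting a figure such as "X% of reviews mention...".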



### Part II. Emotion Classification (5 pts)

Use the `datasets/emotions_dataset.zip` (see the original Dataset source on [Kaggle](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp)) to build a classification model that predicts the emotion of a sentence. If you would like, you may classify only the top 4 emotions, and group all other classes as `Other`.

In order to earn full credit, you must:

* Show the performance of your model with `CountVectorizer`, `TfidfVectorizer`, `word2vec`, and `glove` embeddings.
    - for `word2vec`, make sure not to use the `en_core_web_sm` model (it does not ship with real word vectors)
* Perform text preprocessing (or explain why it was not necessary):
    - stopword removal
    - ngram tokenization
    - stemming/lemmatization
    - fuzzy matching / regex cleaning / etc. (as you deem necessary, but show that you analyzed the text to make your decision)
* Show **AUROC / F1 scores** on the holdout (test + validation) datasets.
* Include a brief discussion (2-3 sentences) of what could improve your model and why.
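
As a starting point for the vectorizer comparison and the metrics listed above, here is a minimal sketch that scores `CountVectorizer` and `TfidfVectorizer` with the same classifier. The sentences, labels, the `evaluate` helper, and the choice of `LogisticRegression` are all assumptions made for illustration; the real train and holdout splits from the emotions dataset would replace the toy lists.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with the dataset's train and holdout splits.
train_texts = [
    "i feel so happy today", "what a wonderful joyful morning",
    "i am delighted with this", "feeling cheerful and happy",
    "i feel so sad and alone", "this is a miserable gloomy day",
    "i am heartbroken and sad", "feeling down and miserable",
    "i am furious about this", "this makes me so angry",
    "i feel irritated and furious", "angry and annoyed right now",
]
train_labels = ["joy"] * 4 + ["sadness"] * 4 + ["anger"] * 4
test_texts = [
    "such a happy joyful day", "i am so cheerful",
    "i feel sad and gloomy", "so miserable and alone",
    "i am angry and annoyed", "this is infuriating i am furious",
]
test_labels = ["joy", "joy", "sadness", "sadness", "anger", "anger"]

def evaluate(vectorizer):
    """Fit vectorizer + classifier on the training split and score the holdout."""
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    probabilities = model.predict_proba(test_texts)
    return {
        "macro_f1": f1_score(test_labels, predicted, average="macro"),
        # One-vs-rest AUROC; `labels` fixes the column order of `probabilities`.
        "auroc_ovr": roc_auc_score(
            test_labels, probabilities, multi_class="ovr", labels=model.classes_
        ),
    }

results = {
    name: evaluate(vectorizer)
    for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]
}
for name, scores in results.items():
    print(name, scores)
```

Macro averaging is used here so that rare emotions count as much as common ones. If the classes are grouped into the top 4 plus `Other`, the same code applies with the relabelled targets.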