---
---
Recitation 11: Machine Learning III

Applied Data Science in Python for Social Scientists

New York University, Abu Dhabi

Dated: 23 Nov 2023
---
---
# Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of applied machine learning
- Learn advanced supervised machine learning models

### Specific Goals
- Learn the basics of clustering
- Learn to apply different models:
    - Logistic Regression
    - Support Vector Machines (SVMs)

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. Please do **not** distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. Please also do not post these problem sets or recitations online or on any public platform.

## Submission
You will submit all your code as a Python Notebook through BrightSpace as **R11_YOUR NETID.ipynb**.

---




# General Instructions
This recitation is worth 50 points and has 2 parts. Complete both parts in the Jupyter (Colab) Notebook attached to this handout.



# The World Before Word Embeddings


Suppose you have to build a classifier for **sentiment analysis**, and you have a movie reviews dataset *(aka a corpus)* in which each review is rated **positive** or **negative**. Which classifier would you use? Well, since there are only two classes, you could use a *logistic regression*, a *decision tree classifier* or an *SVM classifier*.

Sure, but a problem with modeling text is that it is messy, whereas machine learning algorithms prefer **well-defined, fixed-length inputs and outputs**. As you already know, machine learning algorithms cannot work with raw text directly; **the text must be converted into numbers**, specifically **vectors of numbers**. So, **what will be your features?**

Well, recalling PS7, you know that you can split each review into words, create **word embeddings/vectors** for each word, and concatenate those vectors into one long vector per review, and *ta-da!* that long vector is your feature vector. Most recent classification algorithms in fact do something very similar.

But word embeddings are a relatively new concept, while sentiment analysis and text processing are much older. So, word embeddings aside, **how can you extract features from text?**

Well, pause for a second and think: what is the difference between a "positive" and a "negative" review?

A positive review is likely to have more positive words such as *great*, *good*, *amazing*, and so on, whereas a negative review is likely to have more negative words such as *sucks*, *awful*, and so on. This list of words, `["great", "good", "amazing", "sucks", "awful"]`, is the **vocabulary** of words that you are going to track. But are these words enough? How can you create an exhaustive list of words that generalizes across reviews? Think.



## Bag of Words (BoW)

Well, here's the solution. Ultimately, your machine learning model will learn from a training set, right? And your training set consists of a list of reviews. Why not create a list of all the unique words in your training set of reviews, and use that as your *vocabulary*?

Here's an example:

Say your training set only has 3 reviews as follows:

-----
- **Review 1:** this movie is very scary and long
- **Review 2:** this movie is not scary and is slow
- **Review 3:** this movie is spooky and good
-----

In that case your vocabulary will consist of the following **11 words**:  

`["this", "movie", "is", "very", "scary", "and", "long", "not", "slow", "spooky", "good"]`.

Finally, you can create **feature representation** for each review as follows:

|   | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10  | 11  |   |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | this  | movie  | is  | very  | scary  | and  | long  | not  | slow  |  spooky | good  | **Length of review**  |
| Review 1  | 1  | 1  | 1  | 1  | 1 | 1  | 1  | 0  | 0  |  0 | 0  |  7 words  |
| Review 2  | 1  | 1  | 2  | 0  | 1 | 1  | 0  | 1  | 1  | 0  | 0  | 8 words  |
| Review 3  | 1  | 1  | 1  | 0  | 0 | 1  | 0  | 0  | 0  | 1  | 1  | 6 words  |


You can notice that while review 1 has a length of 7 words, review 2 a length of 8 words, and review 3 a length of 6 words, the features for each review have a length of 11, and each feature is the count of the number of times the corresponding word occurs in the review. So if you notice in the table above, for Review 2, the word **"is"** occurs twice, and so the feature value is `2`.

So the feature vectors based on the table for each review would be:

**Feature Vector of Review 1:** `[1 1 1 1 1 1 1 0 0 0 0]`

**Feature Vector of Review 2:** `[1 1 2 0 1 1 0 1 1 0 0]`

**Feature Vector of Review 3:** `[1 1 1 0 0 1 0 0 0 1 1]`
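The table and feature vectors above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the idea: the names `reviews`, `vocabulary`, and `feature_vectors` are chosen here for illustration, and the vocabulary is ordered by first appearance so that it matches the table.

```python
# The three toy reviews from the example above
reviews = [
    "this movie is very scary and long",
    "this movie is not scary and is slow",
    "this movie is spooky and good",
]

# Build the vocabulary: every unique word, in order of first appearance
vocabulary = []
for review in reviews:
    for word in review.split():
        if word not in vocabulary:
            vocabulary.append(word)

# For each review, count how many times each vocabulary word occurs
feature_vectors = [
    [review.split().count(word) for word in vocabulary]
    for review in reviews
]

print(vocabulary)
for vector in feature_vectors:
    print(vector)
```

Every feature vector has the same length as the vocabulary, regardless of how long the review itself is.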

This approach is known as the **Bag of Words (BoW)** approach because what you are essentially doing is treating each review as a bag of words without accounting for the grammar or the order of words.

This means that if, for example, Review 2 (i.e. *this movie is not scary and is slow*) were written as ***this is movie scary and is not slow***, which is grammatically incorrect and orders the words differently from the original review, it wouldn't matter: the feature vector stays the same because the words and their counts are the same, regardless of order.
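A quick way to check this order-invariance is to compare word counts directly. The sketch below uses Python's `collections.Counter` as a stand-in for a bag of words; the second pair of sentences is a made-up example (not from the dataset) showing that two reviews with opposite meanings can also end up in the same bag.

```python
from collections import Counter

original = "this movie is not scary and is slow"
reordered = "this is movie scary and is not slow"

# A Counter keeps only the words and their counts, not their order
same_bag = Counter(original.split()) == Counter(reordered.split())
print(same_bag)

# Hypothetical pair with opposite sentiment but identical word counts
praise = "the movie is good not bad"
complaint = "the movie is bad not good"
same_bag_opposite_meaning = Counter(praise.split()) == Counter(complaint.split())
print(same_bag_opposite_meaning)
```

Both comparisons are true, which is exactly why BoW cannot distinguish sentences that differ only in word order.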

The figure below gives an overview of the BoW approach for a single review.

![bow](https://drive.google.com/uc?id=1YF7mcm5hCTJc1WKb5Rvz2pxvUWDHUVdP)


# Part I: Sentiment Analysis with BoW (35 points)

For this recitation, you are given two csv files, `sentiment_train.csv` and `sentiment_test.csv`, for training and testing purposes. Each csv file contains two columns: `Content`, which holds the text of the review, and `Label`, which holds the sentiment label for the review: `pos` or `neg`. Your task is to extract features from the reviews using the Bag of Words approach, and use those features to create a sentiment classifier which, given a new text, outputs `pos` or `neg` depending on whether the review is positive or negative.

Luckily, `scikit-learn` provides a class, `CountVectorizer`, for extracting these features from text. An example of how it works is given below. Read through the code and its comments to understand what each line is doing.


In [2]:
# Importing the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer

# Creating the CountVectorizer object
"""
Notice the stop_words argument. It tells the vectorizer to ignore English
stop words, as they contribute very little. Examples of stop words are
"the", "is" and "that" -- words that you would see in almost any text and
so won't help you in creating a classifier. By passing this argument you are
effectively doing feature selection by removing unnecessary features. You
can play with this argument; without it, your performance may go down.
"""

vect = CountVectorizer(stop_words='english')

# An example list of text (otherwise called the corpus)
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

"""
Getting counts of each token (word) in the text data (this returns a sparse
matrix).

Notice that `fit_transform` first learns the vocabulary from the corpus (fit)
and then converts the corpus into counts (transform) -- very similar to how
we fit a scaler on the training set when normalizing or standardizing. The
test set should then be converted with `transform` only, so that it uses the
vocabulary of the training set. You can think for a second why that makes
sense. (If we don't transform test based on the train vocabulary, then what
if test has new vocabulary not present in train?)
"""
X = vect.fit_transform(corpus)

"""
Finally, converting the sparse matrix to a numpy array to view it.
You should note that the shape of this array is 5 x 19, as there are 5 pieces
of text in the list and 19 unique words left in our corpus after removing
stop words.
"""

X.toarray()

array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

## A. Using Logistic Regression (25 points)

Using **Logistic Regression**, create a sentiment classifier that uses the Bag of Words approach. Train your classifier on `sentiment_train.csv`, and test it on `sentiment_test.csv`. Use `accuracy_score` from `sklearn.metrics` to evaluate your classifier on the test set.

In [3]:
# Importing libraries you would need

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


In [4]:
input_directory = ""


train = pd.read_csv(input_directory+"sentiment_train.csv")

test = pd.read_csv(input_directory+"sentiment_test.csv")


In [18]:
# Hide warnings
import warnings
warnings.filterwarnings('ignore')

# Feature extraction
vect = CountVectorizer(stop_words='english')
X_train = vect.fit_transform(train['Content'])
y_train = train['Label']

# Model training with grid search
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_estimator_)


{'C': 100}
LogisticRegression(C=100)


In [19]:
# Model evaluation
X_test = vect.transform(test['Content'])
y_test = test['Label']
y_pred = grid.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.855
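An optional variation, not required for the rubric: in the solution above the vectorizer is fitted on the whole training set before cross-validation, so each validation fold's words are already in the vocabulary. Wrapping the vectorizer and the classifier in a scikit-learn `Pipeline` makes the grid search refit the vocabulary inside every fold. The sketch below assumes a tiny made-up dataset (`texts`, `labels`) in place of `sentiment_train.csv`, and sets `max_iter=1000` as one way to reduce convergence warnings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training file
texts = [
    "great movie with amazing acting", "awful plot and terrible acting",
    "good story and great cast", "boring awful and slow",
    "amazing visuals and good music", "terrible pacing and boring script",
    "great fun and good humor", "awful dialogue and terrible ending",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# The vectorizer and the classifier are fitted together on each training fold
pipe = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}
grid_pipe = GridSearchCV(pipe, param_grid, cv=2)
grid_pipe.fit(texts, labels)

print(grid_pipe.best_params_)
prediction = grid_pipe.predict(["a great and amazing film"])
print(prediction)
```

Because the pipeline takes raw text as input, the same fitted object can be called directly on new reviews without a separate `transform` step.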


### Rubric

- +10 points for correctly extracting features from text for train and test set
- +10 points for fitting the logistic regression model using grid search and a reasonable set of parameters
- +5 points for evaluating the model on the test set, with reasonably good performance.

## B. Using SVM (10 points)

Now implement the sentiment classifier using a Support Vector Machine (SVM) classifier with a linear kernel. For the sake of time, you don't have to use grid search here. Just use a `C` value of `0.001`.

In [21]:
# Set the hyperparameter
C = 0.001

# Train the SVM with linear kernel
SVM = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)

# Predict using the trained SVM model
y_pred = SVM.predict(X_test)

# Evaluate the model
print(accuracy_score(y_test, y_pred))

0.86


### Rubric

- +5 points for writing code for training an SVM with linear kernel
- +5 points for evaluating the model on the test set, with reasonably good performance.

# Part II: Problems with Bag of Words (15 points)

In fewer than 10 sentences, describe the problems with representing text as a **Bag of Words** feature representation. Give examples where the BoW approach may not work well. How can some of these issues be fixed? You may find [this](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) and [this](https://medium.com/voice-tech-podcast/nlp-pipeline-101-with-basic-code-example-feature-extraction-ea9894ed8daf) articles useful in answering this question.

The Bag of Words (BoW) model has several limitations:

1. **Loss of Context**: It ignores the order of words, leading to a loss of contextual meaning. For example, "not good" and "good" may be treated similarly.
2. **Synonymy and Polysemy Issues**: BoW cannot handle synonyms and polysemy, affecting the understanding of meanings.
3. **High Dimensionality and Sparsity**: Results in large, sparse matrices which are computationally inefficient.
4. **Neglects Negations and Modifiers**: Words like "not" or "very" are counted as isolated features, so the model cannot tell which word they modify, even though they alter meanings significantly.
5. **Poor Handling of Rare Words**: Rare, yet contextually important words are often overlooked.
6. **No Word Sense Disambiguation**: Cannot discern different meanings of a word in different contexts.

To mitigate these issues, one can use:
1. **Word Embeddings**: Techniques like Word2Vec or GloVe provide vector representations of words that capture semantic relationships and contexts.
2. **Syntax Trees or Dependency Parsing**: These can be used to understand the grammatical structure of sentences, helping in understanding context and relationships between words.

From my knowledge, n-grams and TF-IDF can also be used to improve the BoW model. However, my knowledge of these techniques is limited, so I will not discuss them in detail.

## Rubric

- +10 points for highlighting the problems with BoW approach with reasonable examples
- +5 points for presenting alternative approach(es) to solving some of those issues