# Week 10

Text Processing and Analysis

## Setup

Run the following 2 cells to import all the libraries and helpers needed for this week's exercises.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/text_utils.py
!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/datasets/text/movie_reviews.tar.gz | tar xz

### All Scikit-Learn Now!

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import string

from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from data_utils import MinMaxScaler
from data_utils import classification_error, display_confusion_matrix

from text_utils import get_top_words

## Text Classification

Let's try to predict whether a given review expresses positive or negative feelings towards a movie.

We have a dataset that basically has $2$ features per record: `review` and `sentiment`.

Let's load and look:

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")
reviews_df.head()

### Features

Text is a very different kind of feature...

We do want to turn it into numbers somehow in order to apply some of the methods and models we've been learning about, but how to do that exactly is not entirely obvious.

We can try to extract some numerical information about the review text. Maybe something like the length of the review or the relative amount of punctuation marks or digits can be indicative of its sentiment.

Let's define some helper functions.

In [None]:
def count_characters(st):
  return len("".join(st.split()))

def count_words(st):
  # split() with no argument ignores repeated spaces and other whitespace
  return len(st.split())

def count_punctuation(st):
  return len([c for c in st if c in string.punctuation])

def count_digits(st):
  return len([c for c in st if c in string.digits])

def get_punctuation_pct(st):
  return count_punctuation(st) / count_characters(st)

def get_digit_pct(st):
  return count_digits(st) / count_characters(st)

Now, let's apply some of these to our `DataFrame` to create numerical features that we can eventually use in a classifier.

In [None]:
reviews_df["char_count"] = reviews_df["review"].apply(count_characters)
reviews_df["word_count"] = reviews_df["review"].apply(count_words)
reviews_df["punctuation_pct"] = reviews_df["review"].apply(get_punctuation_pct)
reviews_df["digit_pct"] = reviews_df["review"].apply(get_digit_pct)

reviews_df

Before we model this data, let's look at some of these features and see if we can visually identify the negative and positive reviews on plots.

In [None]:
plt.scatter(reviews_df["word_count"], reviews_df["punctuation_pct"], c=reviews_df["sentiment"])
plt.title("Punctuation % x Word Count")
plt.show()

plt.scatter(reviews_df["digit_pct"], reviews_df["punctuation_pct"], c=reviews_df["sentiment"])
plt.title("Punctuation % x Digit %")
plt.show()

plt.scatter(reviews_df["word_count"], reviews_df["char_count"], c=reviews_df["sentiment"])
plt.title("Character Count x Word Count")
plt.show()

This is not very promising. It doesn't seem like these features contain enough information to help us extract meaning from the reviews.

Let's just confirm this suspicion by creating a classifier.

We'll use a `MinMaxScaler` since some of the features are already in a $[0,1]$ range.

In [None]:
mScaler = MinMaxScaler()

simple_feats_df = reviews_df.drop(columns=["review", "sentiment"])
simple_feats_scaled_df = mScaler.fit_transform(simple_feats_df)

simple_feats_scaled_df["sentiment"] = reviews_df["sentiment"]

simple_feats_scaled_df

Train/Test splitting should've been done before scaling, but this is just a quick experiment.
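For reference, the leak-free order is to split first, compute the minimum and maximum on the training rows only, and reuse those values on the test rows. Below is a minimal sketch of that idea with plain NumPy. The numbers are made up, the names `min_max_fit` and `min_max_apply` are hypothetical, and it does not use the course's `MinMaxScaler` helper.

```python
import numpy as np

# Made-up feature rows: [word_count, punctuation_pct]
train = np.array([[100.0, 0.02], [300.0, 0.06], [200.0, 0.04]])
test = np.array([[400.0, 0.03]])

def min_max_fit(x):
  # Learn the per-column minimum and range from the training rows only
  col_min = x.min(axis=0)
  col_range = x.max(axis=0) - col_min
  # Avoid division by zero for constant columns
  col_range[col_range == 0] = 1.0
  return col_min, col_range

def min_max_apply(x, col_min, col_range):
  # Reuse the training statistics on any other data
  return (x - col_min) / col_range

col_min, col_range = min_max_fit(train)
train_scaled = min_max_apply(train, col_min, col_range)
test_scaled = min_max_apply(test, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

Note that test values can fall outside $[0,1]$ when they exceed the range seen during training, as the word count of the test row does here.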

In [None]:
reviews_train_df, reviews_test_df = train_test_split(simple_feats_scaled_df, test_size=0.2)

reviews_train_df

In [None]:
mClassifier = RandomForestClassifier()

train_feats = reviews_train_df.drop(columns=["sentiment"])
train_labels = reviews_train_df["sentiment"]

mClassifier.fit(train_feats, train_labels)

train_preds = mClassifier.predict(train_feats)

classification_error(train_labels, train_preds)

In [None]:
test_feats = reviews_test_df.drop(columns=["sentiment"])
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(test_feats)

classification_error(test_labels, test_preds)

# 🤔

Our model is just as good as a random guess.

We'll have to use something else.

Back to the drawing board.

### Bag-of-Words (BoW)

This is a way of modeling sentences as a function of their words. We can think of it as a specialized version of One-Hot encoding, where we turn our single-column `review` feature into a series of numbers that represent which words are present in that review. If a word is not present, its column gets a $0$; if the word is present, the column gets an integer equal to the number of times that word was used in the review.

There are some specificities to keep in mind when we encode text this way. We need to figure out what constitutes a _word_ and what kind of words we want to ignore.

Do we consider the words `type`, `types`, `typed` as the same word or $3$ different words ?

Do we consider words like `a`, `the`, `of`, `in`, etc ... in our encoding ? What other kinds of words should be treated differently ?

The first consideration is part of the process of _Tokenization_, or how we turn a sequence (of words) into its constituent components (tokens). There are libraries and pre-trained models that can help us with that task.

To answer part of the second question: it's best to remove common words from text before processing it because they don't add meaning or variance to our data. These words are commonly referred to as _stop words_, or _negative dictionary_, and, again, we can find lists of common _stop words_ for different languages in text-processing libraries and packages.

Our dataset can have other words that show up very frequently, but aren't generally considered _stop words_. For example, a dataset about movie reviews might have the words `movie`, `good`, `director`, etc in every single review. While not typical _stop words_, they should be ignored during encoding because they add no meaningful differentiation to our data.

The same is true for words that are rare and only show up in a small fraction of our sentences/reviews.

This process of encoding text sequences by the count of their words is called Vectorization. This method of encoding keeps track of which words are present in a series of words, and how common they are, but without any significant information about the order of the words or where they occurred in the original text.

That's why models created this way are usually called _Bag-of-Words_ models: they model _what_ words are there, but not _where_ they occurred.
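To make the encoding concrete, here is a minimal sketch of a bag-of-words vectorizer in plain Python. The two sentences are made up, the tokenization is a naive whitespace split, and `bag_of_words` is a hypothetical helper, not part of any library.

```python
from collections import Counter

toy_docs = ["it was a good movie", "it was a bad bad movie"]

# Naive tokenization: lowercase and split on whitespace
toy_tokens = [doc.lower().split() for doc in toy_docs]

# The vocabulary is the sorted set of all words seen in any document
toy_vocab = sorted({word for tokens in toy_tokens for word in tokens})

def bag_of_words(tokens, vocab):
  counts = Counter(tokens)
  # One column per vocabulary word; Counter returns 0 for absent words
  return [counts[word] for word in vocab]

toy_vectors = [bag_of_words(tokens, toy_vocab) for tokens in toy_tokens]

print(toy_vocab)
for vec in toy_vectors:
  print(vec)
```

Each vector has one entry per vocabulary word, and shuffling the words of a sentence would produce exactly the same vector.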

### Vectorize by Count

Let's use the `scikit-learn` class `CountVectorizer` to encode our reviews.

The `min_df` and `max_df` parameters to the class constructor determine the minimum and maximum document frequencies to consider when encoding our data.

With `min_df=5` and `max_df=0.75`, the vectorizer ignores words that show up in fewer than $5$ reviews and words that show up in more than $75\%$ of reviews.
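The filtering rule can be sketched in plain Python. This is not how `CountVectorizer` is implemented, and `filter_by_doc_freq` is a hypothetical helper; it only mirrors the rule described above, where an integer `min_df` is a number of documents and a fractional `max_df` is a proportion of documents.

```python
from collections import Counter

toy_docs = [
  "great movie great cast",
  "boring movie",
  "great soundtrack boring plot movie",
  "movie night",
]

def filter_by_doc_freq(docs, min_df, max_df):
  # Count in how many documents each word appears (not how many times)
  doc_freq = Counter()
  for doc in docs:
    doc_freq.update(set(doc.split()))
  max_docs = max_df * len(docs)
  # Keep words that are neither too rare nor too common
  return sorted(w for w, df in doc_freq.items() if min_df <= df <= max_docs)

kept_words = filter_by_doc_freq(toy_docs, min_df=2, max_df=0.75)
print(kept_words)
```

In this toy example `movie` is dropped for being in every document, and words like `cast` or `night` are dropped for appearing in only one.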

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")

reviews_train_df, reviews_test_df = train_test_split(reviews_df, test_size=0.2)
reviews_train_df

In [None]:
mCV = CountVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=10_000)

reviews_train_vct = mCV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mCV.transform(reviews_test_df["review"])

If we print our encoded features, we should see something like:

# TODO: ADD IMAGE

In [None]:
reviews_train_vct

But we don't.

What !?

### Sparse Matrices

This is why we have to move beyond `DataFrames` for text encoding.

We have thousands of reviews and thousands of possible words in our vocabulary. Encoding this information using a `DataFrame` would be extremely inefficient and wasteful because most of the columns for any given row are most likely $0$. Even if a review used $1\text{,}000$ unique words, with a vocabulary of $10\text{,}000$ words only about $10\%$ of our columns would have non-zero values.
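The object returned by the vectorizer is a SciPy sparse matrix, which stores only the non-zero entries. The sketch below builds a small one by hand from a made-up count matrix to show what that means; it assumes `scipy` is installed, which is the case wherever `scikit-learn` is.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix: 3 "reviews" x 8 "words", mostly zeros
dense_counts = np.array([
  [0, 2, 0, 0, 0, 1, 0, 0],
  [0, 0, 0, 3, 0, 0, 0, 0],
  [1, 0, 0, 0, 0, 0, 0, 1],
])

sparse_counts = csr_matrix(dense_counts)

# Only the non-zero entries are stored
total_cells = dense_counts.shape[0] * dense_counts.shape[1]
density = sparse_counts.nnz / total_cells

print(sparse_counts.shape, sparse_counts.nnz, density)

# Converting back to a dense array is fine for small matrices
print(sparse_counts[0].toarray())
```

Calling `toarray()` on the full training matrix would materialize every zero, so it is best reserved for single rows or small slices.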

The `CountVectorizer` object has functions that give us information about the words it encountered.

`get_feature_names_out()`: returns an array of the words seen in the dataset and encoded.

`inverse_transform()`: can be used to turn a sequence of word counts back into actual words, but without the order information of the original sentence.

And, we can index into our vector with `[]` to get the counts of specific words in that vector.

In [None]:
vocab = mCV.get_feature_names_out()
vocab

In [None]:
mCV.inverse_transform(reviews_train_vct[0])

Get indices of words in a review:

In [None]:
reviews_train_vct[0].nonzero()

Get counts of those words in a review:

In [None]:
reviews_train_vct[reviews_train_vct[0].nonzero()]

We can use these functions to order the words of a review by frequency.

The process is:

- Get a `review` by indexing into our list of encoded reviews
- Count the number of tokens/words in the review
- Use [argsort()](https://numpy.org/doc/2.1/reference/generated/numpy.argsort.html) to get the order of indices that would sort the word counts
  - Use negative counts to get the counts ordered from largest to smallest
- Use the first `word_count` items of this array to index into our vocab and get the actual words of the review
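Before applying these steps to a sparse review vector, here is the same idea on a tiny dense array. The vocabulary and counts are made up, and all of the `toy_` names are only for illustration.

```python
import numpy as np

toy_vocab = np.array(["bad", "good", "movie", "plot", "slow"])
# Made-up counts for one review, aligned with toy_vocab
toy_counts = np.array([0, 3, 1, 0, 2])

# Number of distinct words that actually appear in the review
n_present = np.count_nonzero(toy_counts)

# Negating the counts makes argsort order them from largest to smallest
order = (-toy_counts).argsort()

# Keep only the words with non-zero counts
top_words = toy_vocab[order[:n_present]]
print(top_words)
```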

In [None]:
review = reviews_train_vct[0]

word_count = len(review.nonzero()[0])

sorted_idxs = (-review.toarray()[0]).argsort()

vocab[sorted_idxs[:word_count]]

This seems like a useful enough operation that maybe it should be a function that we can use on any sparse matrix of frequency counts...

The `get_top_words(cnt, vocab, n)` function will return the top `n` words of `cnt`, a count vector or count matrix (list of vectors). Omitting `n` makes the function return all of the words present in the sequences, ordered by frequency.

The returned value is a tuple of words and their counts.

In [None]:
from text_utils import get_top_words

get_top_words(reviews_train_vct[:2], vocab, word_count)

### Classifying by Count

Ok. After that little detour to explore sparse matrices of word counts, we are back to our classification problem.

Can we extract whether a review is positive or negative by looking at the words used ?

Let's train a `RandomForestClassifier` and validate with our test dataset.

In [None]:
mClassifier = RandomForestClassifier()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

classification_error(train_labels, train_preds)

Not bad. Promising.

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

classification_error(test_labels, test_preds)

Ok! This is not bad. After learning about count vectorization and sparse matrices, the code for doing this is actually quite simple.

We could adjust parameters of the classifier or the vectorizer to improve this, but using a `RandomForestClassifier` for this task is quite inefficient. Let's look at a different kind of classifier before we continue exploring vectorization.

### Naive Bayes

Bayesian statistics is a complete and complex field of math and philosophy. At a high level, it's a theory that allows for probabilities (of events, measurements, classifications, etc) to be updated based on the presence of (new) data.

We are going to look at a very slim portion of Bayesian statistics to get a basic understanding of how this theory can be applied within Machine Learning algorithms.

The Naive Bayes methods are a set of supervised learning algorithms based on a version of Bayes' theorem that assumes that all of our features are independent of each other, given the class.

As applied to a classification problem, this theorem has the following form:

$$P\left(y \middle| x_1, x_2, \ldots, x_n\right) = \frac{P\left(y\right)P\left(x_1, x_2, \ldots, x_n \middle| y \right)}{P\left(x_1, x_2, \ldots, x_n\right)}$$

This is an eyeful, but given that $y$ is the class associated with feature measurements $x_1, x_2, \ldots, x_n$, it reads:

The probability that a given set of measurements ($x_1, x_2, \ldots, x_n$) represents an object of class $y$ is equal to the probability of seeing an object of class $y$ in our dataset, multiplied by the probability that an object of class $y$ has measurements $x_1, x_2, \ldots, x_n$, divided by how common that particular set of measurements is.

$P\left(y\right)$ is calculated by measuring how many items of our dataset represent an object of class $y$. If we have $10$ objects that are $y$ in a dataset of $50$ objects, our $P\left(y\right) = \frac{10}{50}$.

Likewise, $P\left(x_1, x_2, \ldots, x_n\right)$ represents how many times this exact combination of measurements showed up in our dataset. If only one row out of $50$ has this combination of input features, then $P\left(x_1, x_2, \ldots, x_n\right) = \frac{1}{50}$.

$P\left(x_1, x_2, \ldots, x_n \middle| y \right)$ is the trickier bit, but it gets simplified by the _naive_ assumption of feature independence and can be split into multiple terms:

$P\left(x_1 \middle| y \right) \cdot P\left(x_2 \middle| y \right) \cdot\ldots\cdot P\left(x_n \middle| y \right)$

These are the probabilities that items of class $y$ have specific values for $x_1, x_2, \ldots x_n$. For example, if in our dataset of $50$ elements, $10$ have class $y$, and $2$ out of those $10$ have a particular value $x_1$ for the first feature, $P\left(x_1 \middle| y \right) = \frac{2}{10}$.

#### Naive Bayes Text Example

Let's pretend we want to calculate the probability that a review with the words `awful`, `bloody`, `guns` and `park` is **negative**.

This is equivalent to calculating:

$$P\left(\text{negative} \middle| \text{awful}, \text{bloody}, \text{guns}, \text{park} \right) = \frac{P\left(\text{negative}\right) P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park} \middle| \text{negative} \right)}{P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park}\right)}$$

$P\left(\text{negative}\right)$ is equal to the proportion of **negative** reviews in the dataset. If half are positive and half are negative, $P\left(\text{negative}\right) = 0.5$.

$P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park}\right)$ is the proportion of reviews in the dataset that contain the words `awful`, `bloody`, `guns` and `park`.

$P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park} \middle| \text{negative} \right)$ can be simplified to $P\left(\text{awful} \middle| \text{negative} \right) \cdot P\left(\text{bloody} \middle| \text{negative} \right) \cdot P\left(\text{guns} \middle| \text{negative} \right) \cdot P\left(\text{park} \middle| \text{negative} \right)$.

$P\left(\text{awful} \middle| \text{negative} \right)$ is the proportion of negative reviews that have the word `awful`, $P\left(\text{bloody} \middle| \text{negative} \right)$ is the proportion of negative reviews with `bloody` in them, and so on.
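The calculation can be sketched in a few lines of plain Python. The eight reviews below are made up, each one reduced to the set of words it contains, and `class_score` is a hypothetical helper. One assumption to flag: since the denominator is the same for every class, this sketch normalizes the two class scores instead of estimating $P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park}\right)$ separately.

```python
from math import prod

# Made-up training data: each review is the set of words it contains
toy_reviews = {
  "negative": [
    {"awful", "bloody", "guns", "park"},
    {"awful", "guns", "boring"},
    {"awful", "bloody", "park"},
    {"bloody", "guns", "park", "slow"},
  ],
  "positive": [
    {"great", "park", "guns"},
    {"bloody", "great", "fun"},
    {"awful", "fun", "park"},
    {"great", "guns", "fun"},
  ],
}

query_words = ["awful", "bloody", "guns", "park"]
n_total = sum(len(reviews) for reviews in toy_reviews.values())

def class_score(label):
  reviews = toy_reviews[label]
  # P(y): proportion of reviews with this label
  prior = len(reviews) / n_total
  # P(word | y): proportion of this label's reviews that contain the word
  likelihoods = [sum(word in r for r in reviews) / len(reviews) for word in query_words]
  # Naive assumption: multiply the per-word probabilities
  return prior * prod(likelihoods)

scores = {label: class_score(label) for label in toy_reviews}

# Normalizing over the classes stands in for dividing by P(words)
p_negative = scores["negative"] / sum(scores.values())
print(scores, round(p_negative, 3))
```

If a query word never appeared in one of the classes, its proportion would be $0$ and wipe out the whole product. Library implementations usually add a small smoothing term to avoid that; this sketch does not.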


### Why ????

It might not be obvious at first, but when used for classification of datasets with sparse feature vectors, the process described above can be extremely efficient because _fitting_ the model means calculating a few probability constants from our training dataset. All of the $P()$ terms on the right hand side of the Bayes equation are basic proportions calculated with addition and division operations.

`Scikit-Learn` has different flavors of Naive Bayes classifiers that make further assumptions about the distributions of the input features and how $P\left(X \middle| y \right)$ can be simplified.

- [Gaussian](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) Naive Bayes assumes the features have Gaussian distributions. This is good for datasets with continuous-valued inputs.
- [Bernoulli](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) Naive Bayes assumes the features are all binary values (One-Hot Encoding).
- [Categorical](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html) Naive Bayes assumes our features are integers that represent categories (Ordinal Encoding).
- [Multinomial](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) Naive Bayes assumes our features are discrete counts, such as the number of times a word occurs.

Given that the feature vectors computed with `CountVectorizer` represent word counts, it makes sense for us to use a Multinomial classifier for this task.

In [None]:
# TODO: repeat classification using the appropriate Bayes model

### N-Grams

Now that we have an efficient classifier for sparse count feature vectors we can finally experiment with n-grams.

In its simplest form, the Bag-of-Words method doesn't take into consideration any information about the order or location of the words in a sequence of words. We can, however, set it up to count pairs (or triplets, or quadruplets, etc) of words instead of single words.

So, instead of breaking up "_it was a good movie_", like this:

|it|was|a|good|movie|
|-|-|-|-|-|

It breaks it up like this:

|it was|was a|a good|good movie|
|-|-|-|-|

These are the 2-grams (or bi-grams) of our sentence, but the concept can be extended to any integer value of $n$ to extract counts for different lengths of n-grams.

While this doesn't help with location information, it does extract some information about word order and common phrases.
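Extracting n-grams amounts to sliding a window of $n$ tokens over the sentence. A minimal sketch, using a naive whitespace split and a hypothetical `extract_ngrams` helper:

```python
def extract_ngrams(sentence, n):
  # Naive whitespace tokenization
  tokens = sentence.split()
  # Slide a window of n tokens across the sentence
  return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = extract_ngrams("it was a good movie", 2)
trigrams = extract_ngrams("it was a good movie", 3)

print(bigrams)
print(trigrams)
```

One difference to keep in mind: when `CountVectorizer` is given a list of stop words, it removes them before building the n-grams, so its bigrams for this sentence would not include pairs like `it was`.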

To extract n-grams during vectorization, we can give `CountVectorizer` a range of values to consider with the parameter `ngram_range`. A value of $(2,2)$ will only extract bigrams, while $(1,2)$ will extract counts for single words and pairs.

In [None]:
mCV = CountVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=50_000, ngram_range=(2, 2))

reviews_train_vct = mCV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mCV.transform(reviews_test_df["review"])

The `CountVectorizer` functions we saw above and our `get_top_words()` function will work the same way. The only difference is that right now our features are counts of pairs of words.

In [None]:
vocab = mCV.get_feature_names_out()
vocab

In [None]:
mCV.inverse_transform(reviews_train_vct[0])

In [None]:
get_top_words(reviews_train_vct[0], vocab)

### Train and Validate

Let's try it out !

In [None]:
mClassifier = MultinomialNB()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

classification_error(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

classification_error(test_labels, test_preds)

### TF-IDF

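So far, every word counts the same, no matter how common it is across the dataset. _Term Frequency - Inverse Document Frequency_ (tf-idf) weighs each count by how rare the term is: a term scores high in a review if it is frequent in that review but shows up in few of the other reviews.

The term frequency of a term $t$ in a document $d$ is its count in $d$ divided by the length of $d$:

$$tf(t, d) = \frac{\text{Count}(t, d)}{|d|}$$

The inverse document frequency of $t$ in a collection of documents $D$ is the log of the total number of documents divided by the number of documents that contain $t$:

$$idf(t, D) = \log\left(\frac{|D|}{\left|\{d \in D : t \in d\}\right|}\right)$$

The tf-idf value of a term in a document is the product of the two, $tf(t, d) \cdot idf(t, D)$.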
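The two formulas above translate almost directly into plain Python. The sketch below uses three made-up, pre-tokenized documents and multiplies $tf$ and $idf$ into a single tf-idf score; the helper names are hypothetical, and `idf` assumes the term appears in at least one document.

```python
import math
from collections import Counter

toy_docs = [
  "good movie good cast".split(),
  "bad movie".split(),
  "good plot".split(),
]

def tf(term, doc):
  # Count of the term in the document, divided by the document length
  return Counter(doc)[term] / len(doc)

def idf(term, docs):
  # Log of (number of documents / number of documents containing the term)
  n_containing = sum(term in doc for doc in docs)
  return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
  return tf(term, doc) * idf(term, docs)

score_good = tf_idf("good", toy_docs[0], toy_docs)
score_cast = tf_idf("cast", toy_docs[0], toy_docs)
print(round(score_good, 3), round(score_cast, 3))
```

Even though `good` appears twice in the first document and `cast` only once, `cast` scores higher because it appears in no other document. The values produced by `TfidfVectorizer` will differ from this sketch, since by default it smooths the idf term and normalizes each vector.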

In [None]:
mTfidV = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=50_000, ngram_range=(1, 1))

reviews_train_vct = mTfidV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mTfidV.transform(reviews_test_df["review"])

And again, the `TfidfVectorizer` has all the functions we saw above in the `CountVectorizer` object, and our `get_top_words()` function will work the same way with our tf-idf vectors. The difference is that now our features are not plain integer counts of words or n-grams, but values of our tf-idf importance metric. The higher the value, the more significant the word (or n-gram) is to that review.

In [None]:
vocab = mTfidV.get_feature_names_out()
vocab

In [None]:
mTfidV.inverse_transform(reviews_train_vct[0])

In [None]:
get_top_words(reviews_train_vct[0], vocab, 10)

### Classification with tf-idf

This stays the same. While the multinomial classifier normally requires integer features, in practice, fractional counts such as the ones computed with a `TfidfVectorizer` also work. We could always turn these into `int` by multiplying them by $100$ and rounding. The main distinction is that these values are non-negative, count-like weights rather than continuous, unbounded `float` values, so it wouldn't make sense to use the Gaussian Bayes classifier.

In [None]:
mClassifier = MultinomialNB()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

classification_error(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

classification_error(test_labels, test_preds)

Not bad.

How does the choice of n-gram range affect classification by tf-idf ?

In [None]:
# TODO: Evaluate the effect of n-grams in the TfidfVectorizer

## Unsupervised Learning

Can we extract other information from the reviews without using their labels ? Let's try clustering them.

In [None]:
mClust = KMeans(n_clusters=8)
reviews_train_km = mClust.fit_predict(reviews_train_vct)

In [None]:
get_top_words(mClust.cluster_centers_, mTfidV.get_feature_names_out(), 8)[0]

Can we do better ?

We're clustering over 40k features, which is a very sparse space.

In [None]:
mNmf = NMF(n_components=8)
reviews_train_nmf = mNmf.fit_transform(reviews_train_vct)

In [None]:
get_top_words(mNmf.components_, mTfidV.get_feature_names_out(), 6)[0]

Let's repeat the classification with another dataset: Amazon product reviews (books).

In [None]:
!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/raw/refs/heads/main/datasets/text/amazon_reviews/books.tar.gz | tar xz

Read the data and drop rows with missing values:

In [None]:
reviews_full_df = pd.read_csv("./data/text/amazon_reviews/books.csv")
reviews_full_df.dropna(inplace=True)
reviews_full_df

Look at the class distribution:

In [None]:
reviews_full_df["rating"].value_counts()

# 🫤

A model can guess $5$ all the time and be correct $60\%$ of the time.
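That baseline is simply the share of the most common class. A quick sketch with made-up ratings that roughly mimic the imbalance (the real counts come from the `value_counts()` call above):

```python
from collections import Counter

# Made-up ratings with roughly the imbalance described above
toy_ratings = [5] * 6 + [4] * 2 + [3, 1]

rating_counts = Counter(toy_ratings)
majority_class, majority_count = rating_counts.most_common(1)[0]

# Accuracy of a model that always predicts the most frequent rating
baseline_accuracy = majority_count / len(toy_ratings)
print(majority_class, baseline_accuracy)
```

Any classifier trained on the unbalanced data has to beat this number to be worth anything, which is why the next step balances the classes.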

In [None]:
min_count = reviews_full_df["rating"].value_counts().min()

# Sample the same number of reviews from each rating so the classes are balanced
rg = reviews_full_df.groupby("rating")
reviews_balanced_df = rg.sample(n=min_count).reset_index(drop=True)

del reviews_full_df

reviews_balanced_df

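Two more prep steps: keep the rating as an integer label, and combine the title and the review text into a single `review` column.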

In [None]:
reviews_df = pd.DataFrame(reviews_balanced_df["rating"].astype(int))
reviews_df["review"] = reviews_balanced_df["title"] + " " + reviews_balanced_df["review_text"]
reviews_df

Ok. Go forth and Classify

In [None]:
# T/T Split
# Vectorize
# Classify
# Evaluate
# Repeat with ngrams