# Week 10

Text Processing and Analysis

## Setup

Run the following 2 cells to import all necessary libraries and helpers for this week's exercises.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/text_utils.py
!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/datasets/text/movie_reviews.tar.gz | tar xz

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import string

from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from data_utils import MinMaxScaler
from data_utils import object_from_json_url, classification_error, display_confusion_matrix

from text_utils import get_top_words

## Text Classification

Let's ____ using a movie review dataset.

Load and look.

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")
reviews_df.head()

### Features

What can we say?

In [None]:
def count_characters(st):
  return len("".join(st.split()))

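def count_words(st):
  # split() with no argument collapses runs of whitespace,
  # so repeated spaces are not counted as empty "words"
  return len(st.split())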

def count_punctuation(st):
  return len([c for c in st if c in string.punctuation])

def count_digits(st):
  return len([c for c in st if c in string.digits])

def get_punctuation_pct(st):
  return count_punctuation(st) / count_characters(st)

def get_digit_pct(st):
  return count_digits(st) / count_characters(st)

Apply to the `DataFrame`.

In [None]:
reviews_df["char_count"] = reviews_df["review"].apply(count_characters)
reviews_df["word_count"] = reviews_df["review"].apply(count_words)
reviews_df["punctuation_pct"] = reviews_df["review"].apply(get_punctuation_pct)
reviews_df["digit_pct"] = reviews_df["review"].apply(get_digit_pct)

reviews_df

Look at some of these features

In [None]:
plt.scatter(reviews_df["word_count"], reviews_df["punctuation_pct"], c=reviews_df["sentiment"])
plt.title("Punctuation % x Word Count")
plt.show()

plt.scatter(reviews_df["digit_pct"], reviews_df["punctuation_pct"], c=reviews_df["sentiment"])
plt.title("Punctuation % x Digit %")
plt.show()

plt.scatter(reviews_df["word_count"], reviews_df["char_count"], c=reviews_df["sentiment"])
plt.title("Character Count x Word Count")
plt.show()

Scale

In [None]:
mScaler = MinMaxScaler()

simple_feats_df = reviews_df.drop(columns=["review", "sentiment"])
simple_feats_scaled_df = mScaler.fit_transform(simple_feats_df)

simple_feats_scaled_df["sentiment"] = reviews_df["sentiment"]

simple_feats_scaled_df

In [None]:
reviews_train_df, reviews_test_df = train_test_split(simple_feats_scaled_df, test_size=0.2)

reviews_train_df

In [None]:
mClassifier = RandomForestClassifier()

train_feats = reviews_train_df.drop(columns=["sentiment"])
train_labels = reviews_train_df["sentiment"]

mClassifier.fit(train_feats, train_labels)

train_preds = mClassifier.predict(train_feats)

classification_error(train_labels, train_preds)

In [None]:
test_feats = reviews_test_df.drop(columns=["sentiment"])
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(test_feats)

classification_error(test_labels, test_preds)

# 🤔

Bag of words

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")

reviews_train_df, reviews_test_df = train_test_split(reviews_df, test_size=0.2)
reviews_train_df

In [None]:
mCV = CountVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=10_000)

reviews_train_vct = mCV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mCV.transform(reviews_test_df["review"])

In [None]:
reviews_train_vct

Working with sparse matrices.

Words counted:

In [None]:
vocab = mCV.get_feature_names_out()
vocab

Get words in a review

In [None]:
mCV.inverse_transform(reviews_train_vct[0])

Get indices of words in a review:

In [None]:
reviews_train_vct[0].nonzero()

Get counts of those words in a review

In [None]:
reviews_train_vct[reviews_train_vct[0].nonzero()]

Get words ordered by frequency:

In [None]:
review = reviews_train_vct[0]

n_words = len(review.nonzero()[0])

sorted_idxs = (-review.toarray()[0]).argsort()

vocab[sorted_idxs[:n_words]]

In [None]:
from text_utils import get_top_words

get_top_words(reviews_train_vct[0], vocab, n_words)

In [None]:
mClassifier = RandomForestClassifier()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

classification_error(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

classification_error(test_labels, test_preds)

Naive Bayes

Some equations

Bernoulli vs Gaussian vs categorical vs multinomial
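
For reference, here is a sketch of the standard multinomial Naive Bayes formulation, which is the variant used with word counts in this notebook. The notation is an assumption introduced here, not taken from the notes: $c$ is a class, $x_i$ is the count of word $w_i$ in the document, $V$ is the vocabulary, $N_{ci}$ is the total count of word $w_i$ across training documents of class $c$, $N_c = \sum_i N_{ci}$, and $\alpha$ is the smoothing term (the `alpha` parameter of `MultinomialNB()`).

$$
P(c \mid x) \;\propto\; P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{x_i}
$$

$$
\hat{P}(w_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha \, |V|}
$$

The predicted class is the $c$ that maximizes the first expression. The "naive" part is the assumption that words are independent of each other given the class.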

In [None]:
# here

Order of words ?

In [None]:
mCV = CountVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=50_000, ngram_range=(2, 2))

reviews_train_vct = mCV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mCV.transform(reviews_test_df["review"])

In [None]:
vocab = mCV.get_feature_names_out()
vocab

In [None]:
mCV.inverse_transform(reviews_train_vct[0])

In [None]:
get_top_words(reviews_train_vct[0], vocab)

In [None]:
mClassifier = MultinomialNB()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

classification_error(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

classification_error(test_labels, test_preds)

Slightly smarter counts

In [None]:
mTfidV = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=50_000, ngram_range=(1, 1))

reviews_train_vct = mTfidV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mTfidV.transform(reviews_test_df["review"])

In [None]:
vocab = mTfidV.get_feature_names_out()
vocab

In [None]:
mTfidV.inverse_transform(reviews_train_vct[0])

In [None]:
get_top_words(reviews_train_vct[0], vocab, 10)

In [None]:
mClassifier = MultinomialNB()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

classification_error(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

classification_error(test_labels, test_preds)

TODO: TFIDF with ngrams ??

Can we extract other info ? Cluster ?

In [None]:
mClust = KMeans(n_clusters=8)
reviews_train_km = mClust.fit_predict(reviews_train_vct)

In [None]:
get_top_words(mClust.cluster_centers_, mTfidV.get_feature_names_out(), 8)[0]

Can we do better ?

We're clustering over 40k features .... very sparse space.

In [None]:
mNmf = NMF(n_components=8)
reviews_train_nmf = mNmf.fit_transform(reviews_train_vct)

In [None]:
get_top_words(mNmf.components_, mTfidV.get_feature_names_out(), 6)[0]

Classification for other dataset.

Amazon products


Text Analysis is a major application field for machine learning algorithms. However, the raw data (a sequence of symbols) cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

Tokenizing: is _____ ....

One way to turn text into numerical features is Bag of Words:

https://letsdatascience.com/bag-of-words/

https://www.kaggle.com/code/samuelcortinhas/nlp3-bag-of-words-and-similarity

https://letsdatascience.com/word-embeddings/

- each individual token occurrence frequency is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.

N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.

`CountVectorizer()` can count n-grams instead of single words via its `ngram_range` parameter, e.g. `ngram_range=(2, 2)` for bigrams only.

In a large text corpus, some words appear with higher frequency (e.g. “the”, “a”, “is” in English) and do not carry meaningful information about the actual contents of a document. If we were to feed the word count data directly to a classifier, those very common terms would shadow the frequencies of rarer yet more informative terms. To re-weight the count features into floating point values suitable for use by a classifier, it is very common to apply the tf-idf transform, as implemented by `TfidfTransformer()`. “tf” stands for term frequency, while “tf-idf” means term frequency times inverse document frequency.

`CountVectorizer() + TfidfTransformer() = TfidfVectorizer()`

Vectorizer parameters:

- Ignore terms that appear in more than 50% of the documents (set by max_df=0.5)

- Ignore terms that are not present in at least 5 documents (set by min_df=5)

Clustering

`TruncatedSVD()`:
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.
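
A minimal sketch of reducing a sparse tf-idf matrix with `TruncatedSVD()`. The corpus, `n_components=2` and `random_state=0` are assumptions for illustration; a real corpus would typically use many more components:

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
svd_docs = [
    "space ship alien planet",
    "alien invasion space battle",
    "romance wedding love story",
    "love story heartbreak romance",
]

svd_tfidf = TfidfVectorizer().fit_transform(svd_docs)
print(issparse(svd_tfidf))  # the input stays sparse, no centering needed

svd = TruncatedSVD(n_components=2, random_state=0)
svd_reduced = svd.fit_transform(svd_tfidf)  # dense array: documents x components

print(svd_tfidf.shape, "->", svd_reduced.shape)
```

The reduced matrix can then be passed to `KMeans()` in place of the full tf-idf matrix.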

`NMF()` (Non-Negative Matrix Factorization):
Find two non-negative matrices, i.e. matrices with all non-negative elements, (W, H) whose product approximates the non-negative matrix X. This factorization can be used for example for dimensionality reduction, source separation or topic extraction.
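
A small sketch of that factorization on a hand-made non-negative matrix. The matrix, `init="nndsvda"`, `max_iter=1000` and `random_state=0` are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical "documents x terms" matrix with two obvious groups of columns
X_toy = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [6.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 5.0],
    [0.0, 0.0, 2.0, 10.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W_toy = toy_nmf.fit_transform(X_toy)  # documents x components ("how much of each topic")
H_toy = toy_nmf.components_           # components x terms ("which terms make up a topic")

# The product W @ H should approximate the original matrix
print(np.round(W_toy @ H_toy, 2))
```

In the topic extraction setting, each row of `H` is a topic and each row of `W` says how strongly a document expresses each topic.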


### Top terms per cluster
Since the `TfidfVectorizer()` vocabulary maps each column back to a term, we can inspect the cluster centers, which give an intuition of the most influential words for each cluster.

However, if documents in the same corpus have very different lengths, or the vocabulary is extremely large, magnitude-based distance metrics (such as the Euclidean distance that `KMeans()` relies on) become less reliable.

Instead, in the NLP domain it is much more common to use Cosine Similarity. This measures the cosine of the angle between any two points (more precisely, between their vectors starting from the origin). The closer the score is to 1, the smaller the angle between the vectors and the more similar the documents are.
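
A short sketch comparing the two measures on raw counts. The three documents are made up for illustration: the second is the first repeated twice, and the third shares no words with the others:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical toy corpus
sim_docs = [
    "great movie great cast",
    "great movie great cast great movie great cast",  # same content, twice as long
    "boring plot weak ending",                        # no words in common
]

sim_counts = CountVectorizer().fit_transform(sim_docs)

cos_sim = cosine_similarity(sim_counts)     # pairwise matrix, 3 x 3
euc_dist = euclidean_distances(sim_counts)  # pairwise matrix, 3 x 3

print(cos_sim.round(2))
print(euc_dist.round(2))
```

The first two documents point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not 0. Documents with no shared words get a cosine similarity of 0.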