# Project 3: Vocabulary Construction

CS 598 Practical Statistical Learning

2023-12-03

UIUC Fall 2023

**Authors**
* Ryan Fogle
    - rsfogle2@illinois.edu
    - UIN: 652628818
* Sean Enright
    - seanre2@illinois.edu
    - UIN: 661791377

## Introduction

In this notebook we describe the process we followed to generate a suitably small, but relevant vocabulary for the task of sentiment analysis and classification with the [IMDB movie review](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) dataset.

Our goal was to identify a set of fewer than 1000 words or phrases that can predict review sentiment from the review text with high accuracy, using AUROC as the scoring metric.
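
For reference, AUROC can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one, with ties counting as one half. The following is a minimal pure-Python sketch of that definition. The function name `auroc` and the toy labels and scores are illustrative assumptions and are not part of our pipeline, which relies on `sklearn`'s `roc_auc` scoring.

```python
def auroc(labels, scores):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = positive review) and classifier scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 3 of 4 pairs are ordered correctly
```

The quadratic loop is only meant to make the definition concrete; it would be too slow for the full dataset.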

## Data Retrieval
Here we retrieve and format the training and test data.

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from datetime import datetime

start_time = datetime.now()

The IMDB movie review data is split into five folds for cross-validation. The following functions are used to read the training and test data for a given split.

The dataset contains columns for `id`, `score`, `sentiment`, and `review`. We are interested in using the `review` text as our input and classifying `sentiment`, which uses a value of $0$ for a poor review and $1$ for a good one. The `id` column indexes each observation.

The `review` text is also stripped of character sequences used for HTML tags.
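
As a minimal sketch of this tag stripping, the snippet below uses the standard library's `re` module with a non-greedy pattern. The helper name `strip_tags` and the sample string are assumptions made for illustration; the notebook itself applies the replacement through pandas' `str.replace`.

```python
import re

# Non-greedy: each match stops at the first closing ">".
TAG_PATTERN = re.compile(r"<.*?>")

def strip_tags(text):
    """Replace every HTML tag with a single space."""
    return TAG_PATTERN.sub(" ", text)

raw = "Great movie.<br /><br />Loved it."
print(strip_tags(raw))
```

Replacing each tag with a space rather than an empty string keeps the words on either side of a tag from being fused into one token.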

In [None]:
dtypes_dict = {"review": "string",
               "sentiment": "Int32"}

def get_data(base_path=Path.cwd()):
    """Retrieve the training and test data, formatted as DataFrames

    Args:
        base_path (optional): Path of folder with training and test data.
                              Defaults to Path.cwd().

    Returns:
        (train_x, train_y, test_x): Training and test dataframes
    """
    path_train  = base_path / "train.tsv"
    path_test   = base_path / "test.tsv"
    
    train = pd.read_csv(path_train, sep="\t", header=0, dtype=dtypes_dict)
    # Replace HTML tags such as <br /> with a space
    train_x = train["review"].str.replace("<.*?>", " ", regex=True)
    train_y = train["sentiment"]

    test = pd.read_csv(path_test, sep="\t", header=0,
                       dtype=dtypes_dict, index_col="id")
    test_x = test["review"].str.replace("<.*?>", " ", regex=True)
    return train_x, train_y, test_x

def get_fold_data(fold):
    """Retrieve the training and test data for a given fold

    Args:
        fold (int): Fold number for training and test data

    Returns:
        (train_x, train_y, test_x, test_y): Training and test dataframes
    """
    fold_path = Path.cwd() / "proj3_data" / f"split_{fold}"
    train_x, train_y, test_x = get_data(base_path=fold_path)
    path_test_y = fold_path / "test_y.tsv"

    test_y = pd.read_csv(path_test_y, sep="\t", header=0, dtype=dtypes_dict)["sentiment"]
    return train_x, train_y, test_x, test_y

To build a more relevant vocabulary, the training and test data for a split are joined. This expands the possible vocabulary beyond what might be found in the training data alone, and benefits from the greater sample size.

We select split #2, as it proved in testing to pose the greatest challenge to our classifier.

In [None]:
fold = 2
train_x, train_y, test_x, test_y = get_fold_data(fold)
full_x = pd.concat((train_x, test_x), axis=0)
full_y = pd.concat((train_y, test_y), axis=0)

## Processing and Prediction Pipeline

An overview of our preprocessing and prediction pipeline:
1) Tokenize review text into a matrix of n-grams
2) Perform a two-sample t-test on each n-gram, comparing positive and negative reviews, to select a subset that is more relevant to either class
3) Convert the selected n-gram counts into TF-IDF
4) Use cross-validation to find the Lasso regularization weight for logistic regression that further reduces the subset of n-grams below a predefined count
5) Refit the data with the Lasso regularization weight found above and retrieve the subset of selected n-grams

We use `sklearn`'s Pipeline API to streamline the preprocessing of the review data and feature selection. Below we will define each step of the pipeline and then show the complete process.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

### Tokenization
To convert the input into a matrix of n-grams, the `sklearn.feature_extraction.text.CountVectorizer` class is used. The following configuration is used for this class:
* Words are converted to lowercase.
* n-grams between 1 and 4 in length are generated from the review text.
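* Document frequency ("df") is restricted to the range of 0.001 to 0.5, expressed as a fraction of reviews, so n-grams that appear in fewer than 0.1% or more than 50% of reviews are discarded.
* The tokenizer pattern matches runs of word characters, plus signs, pipes, and apostrophes that start and end on a word boundary. These tokens are then used to construct n-grams.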

To ignore commonly used words unlikely to carry much information, the below list of stop words is used. We add the word "br" to the set of words suggested by the instructor.
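
The tokenization steps above can be sketched with the standard library alone. The snippet reuses the token pattern from our configuration, but the helper names `tokenize` and `ngrams`, the three-word stop list, and the sample sentence are assumptions for illustration. It is a simplified stand-in for `CountVectorizer`, not a replacement for it.

```python
import re

TOKEN_PATTERN = re.compile(r"\b[\w+\|']+\b")  # same pattern as the vectorizer
toy_stop_words = {"the", "it", "was"}          # hypothetical subset of the stop list

def tokenize(text):
    """Lowercase the text, extract tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in toy_stop_words]

def ngrams(tokens, n_min=1, n_max=4):
    """Join every run of n_min..n_max consecutive tokens with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

tokens = tokenize("The actors' lines didn't work... 7/10")
print(tokens)
print(ngrams(tokens, 1, 2))
```

Because stop words are removed before n-grams are formed, an n-gram can join words that were separated by a stop word in the original text. The word-boundary anchors also drop a trailing apostrophe, so "actors'" becomes "actors" while "didn't" is kept whole.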

In [None]:
stop_words = [
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
    "yours", "their", "they", "his", "her", "she", "he", "a", "an", "and", "is",
    "was", "are", "were", "him", "himself", "has", "have", "it", "its", "the",
    "us", "br"
]

vectorizer = CountVectorizer(
    preprocessor=lambda x: x.lower(), # Convert to lowercase
    stop_words=stop_words,            # Remove stop words
    ngram_range=(1, 4),               # Use 1- to 4-grams
    min_df=0.001,                     # Minimum document frequency
    max_df=0.5,                       # Maximum document frequency
    token_pattern=r"\b[\w+\|']+\b"    # Use word tokenizer
)
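
To make the document-frequency limits concrete, here is a small sketch that keeps only the terms whose share of documents falls inside a given range. The function name `filter_by_document_frequency`, the toy documents, and the thresholds are assumptions chosen so the effect is visible on four documents; our configuration uses 0.001 and 0.5.

```python
def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose share of documents lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_counts = {}
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return sorted(term for term, count in doc_counts.items()
                  if min_df <= count / n_docs <= max_df)

toy_docs = [
    ["movie", "great", "plot"],
    ["movie", "great", "plot"],
    ["movie", "plot", "zombie"],
    ["movie"],
]
# "movie" appears in every document and "zombie" in only one, so both are dropped.
print(filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.75))
```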

### TF-IDF Conversion
The n-gram counts are converted into term-frequency times inverse document-frequency (TF-IDF) to weigh them within the context of the document (review), and amongst other documents.
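
As a rough illustration of the weighting, the sketch below computes TF-IDF for a tiny count matrix using the smoothed formula documented for `TfidfTransformer`'s default settings, $\text{idf} = \ln\frac{1 + n}{1 + \text{df}} + 1$, followed by L2 normalization of each row. The function name `tfidf` and the toy counts are assumptions for illustration.

```python
import math

def tfidf(counts):
    """TF-IDF with smoothed IDF and L2-normalized rows."""
    n_docs, n_terms = len(counts), len(counts[0])
    # Document frequency: number of rows in which each column is non-zero.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard empty rows
        result.append([v / norm for v in weighted])
    return result

# Rows are reviews; columns are two hypothetical n-grams.
toy_counts = [[1, 1],
              [1, 0],
              [1, 0]]
weights = tfidf(toy_counts)
print(weights[0])
```

In the first review both n-grams occur once, but the second n-gram appears in only one review, so it receives the larger weight.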

In [None]:
tfidf_transformer = TfidfTransformer(use_idf=True)

### T-test Subset Selection
To accomplish the first reduction in vocabulary size, we apply a two-sample t-test to each n-gram, comparing its values in negative reviews against its values in positive reviews, and filter by the magnitude of the t-statistic. n-grams with a high magnitude are more strongly associated with either positive or negative reviews, so they are better suited for classification and are selected.

For random variables $X$ and $Y$, each drawn from a different class corresponding to either positive or negative reviews, the t-test statistic is
$$
t = 
\frac{\bar{X} - \bar{Y}}
     {\sqrt{\frac{s_X^2}{m} + \frac{s_Y^2}{n}}}
$$
where $m$ and $n$ are the number of observations in classes $X$ and $Y$, respectively, and $s_X^2$ and $s_Y^2$ are the sample variances.
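
A small worked example may help here. The sketch below evaluates the statistic for a single n-gram with the standard library; the function name `welch_t` and the counts are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math
import statistics

def welch_t(x, y):
    """Two-sample t-statistic with unequal variances."""
    m, n = len(x), len(y)
    s2_x = statistics.variance(x)  # sample variance (divides by m - 1)
    s2_y = statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(s2_x / m + s2_y / n)

# Hypothetical counts of one n-gram in four positive and four negative reviews.
pos_counts = [2, 3, 4, 3]
neg_counts = [0, 1, 0, 1]
print(welch_t(pos_counts, neg_counts))
```

Here the means are 3 and 0.5, the sample variances are 2/3 and 1/3, and the denominator is $\sqrt{1/6 + 1/12} = 0.5$, so the statistic is approximately 5.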

The built-in `numpy` variance function proved to have a high memory demand, so the following variance identity was used to perform the calculation more efficiently:
$$
\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2
$$
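
The identity can be checked numerically with a short sketch. The function name `variance_from_moments` and the data are assumptions for illustration. Note that the identity yields the population variance, which divides by $n$ rather than $n - 1$; with thousands of reviews per class the difference is negligible.

```python
import statistics

def variance_from_moments(x):
    """Var(X) = E[X^2] - (E[X])^2, the population (divide-by-n) variance."""
    n = len(x)
    mean = sum(x) / n
    mean_of_squares = sum(v * v for v in x) / n
    return mean_of_squares - mean ** 2

toy_values = [2, 3, 4, 3]
print(variance_from_moments(toy_values))   # 9.5 - 9.0
print(statistics.pvariance(toy_values))    # same quantity via the standard library
```

Only the column means of $X$ and $X^2$ are needed, which is why this form works well on a sparse matrix.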

The magnitude of the t-statistic was then used to select a predefined number of n-grams.

In order to incorporate this transformation with the Pipeline, we create a class that inherits from `BaseEstimator` and `TransformerMixin` and define its fitting and transformation methods.

In this example, we select the 2000 n-grams with the greatest t-statistic magnitude.
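
The selection step itself amounts to sorting by absolute value and keeping the last entries. A minimal sketch, with a hypothetical helper `top_k_by_magnitude` and made-up t-statistics:

```python
def top_k_by_magnitude(t_stats, k):
    """Return the indices of the k entries with the largest absolute value."""
    # Indices sorted in ascending order of |t|.
    order = sorted(range(len(t_stats)), key=lambda i: abs(t_stats[i]))
    # The largest magnitudes sit at the end of the ordering.
    return order[max(len(order) - k, 0):]

toy_t = [0.5, -4.0, 2.5, -0.1, 3.0]  # hypothetical t-statistics for five n-grams
print(top_k_by_magnitude(toy_t, 2))
```

Strongly negative and strongly positive statistics are treated alike, so n-grams tied to either class are retained.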

In [None]:
class T_TestTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocab_size=2000):
        self.vocab_size = vocab_size
        self.subset_inds = []

    def fit(self, x, y):
        # Use two-sample t-test to determine n-grams more likely
        # to be associated with positive or negative reviews
        mask = y.values.to_numpy() == 1
        pos_x = x[mask]
        neg_x = x[~mask]
        
        m = pos_x.shape[0]
        n = neg_x.shape[0]
        mean_pos = pos_x.mean(axis=0)
        mean_neg = neg_x.mean(axis=0)
        # Var(X) = E[X^2] - (E[X])^2
        var_pos = pos_x.power(2).mean(axis=0) - np.power(mean_pos, 2)
        var_neg = neg_x.power(2).mean(axis=0) - np.power(mean_neg, 2)
        t_stat = (mean_pos - mean_neg) / np.sqrt(var_pos / m + var_neg / n)
        self.subset_inds = np.abs(np.ravel(t_stat)).argsort()[-self.vocab_size:]
        return self

    def transform(self, x, y=None):
        # Select columns corresponding to n-grams likely to be relevant
        return x[:, self.subset_inds]

### Logistic Regression with Lasso Regularization

We use logistic regression with Lasso regularization to select an even smaller subset of relevant features. To find the optimal regularization weight, we perform cross-validated logistic regression via `LogisticRegressionCV` and search a range of logarithmically scaled weights, scored by AUROC.

Cross-validation uses the default splitter, [stratified k-folds](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) with $k=5$ splits. We use the "saga" solver, which supports the Lasso (L1) penalty and is suited to large datasets such as this one.

To ensure repeatability, a random seed is set, and to ensure convergence, we allow up to 1000 iterations.
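
To give a sense of the search grid, the sketch below builds 20 values evenly spaced on a log scale between $10^{-2}$ and $10^{-0.4}$ without NumPy. The helper name `logspace` is an assumption that mirrors `np.logspace`. In `sklearn`, `C` is the inverse of the regularization strength, so smaller values impose a stronger Lasso penalty and leave fewer non-zero coefficients.

```python
def logspace(start_exp, stop_exp, num):
    """Return num values evenly spaced in log10 from 10**start_exp to 10**stop_exp."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]

grid = logspace(-2, -0.4, 20)
print(f"{grid[0]:.4f} ... {grid[-1]:.4f}")  # from 0.01 up to roughly 0.4
```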

In [None]:
Cs = np.logspace(-2, -0.4, 20)
logistic_regression = LogisticRegressionCV(
    n_jobs=-1,
    Cs=Cs,
    solver="saga",
    penalty="l1",
    scoring="roc_auc",
    max_iter=1000,
    random_state=0
)

### Pipeline Execution

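The above transformation and classification steps are defined in a `Pipeline`, which is then used to fit the combined training and test data. Note that the t-test selection is applied to the n-gram counts, and only the selected columns are converted to TF-IDF.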

In [None]:
pipeline_vocab = Pipeline([
    ("vectorizer", vectorizer),
    ("t-score", T_TestTransformer()),
    ("tfidf", tfidf_transformer),
    ("logreg", logistic_regression),
])

pipeline_vocab.fit(full_x, full_y)

### Selecting the Lasso Regularization Parameter
The optimal regularization weight is found by first keeping only the weights that produced a final vocabulary smaller than a predefined size, and then selecting, within this subset, the weight that produces the highest mean cross-validated AUROC.

We have specified a maximum final vocabulary size of 950 n-grams.

In [None]:
max_vocab = 950
vocab_size = np.empty(Cs.shape)
scores = pipeline_vocab["logreg"].scores_[1].mean(axis=0)
for i in range(len(Cs)):
    vocab_size[i] = (pipeline_vocab["logreg"].coefs_paths_[1]
                     .mean(axis=0)[i, :] != 0).sum() - 1
mask = vocab_size < max_vocab
best_c = Cs[mask][np.argmax(scores[mask])]
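
The same selection logic is shown here on a toy example, stripped of the `sklearn` attribute handling. The function name `pick_best_c` and all of the numbers are hypothetical and serve only to illustrate the constraint.

```python
def pick_best_c(cs, vocab_sizes, scores, max_vocab):
    """Among weights whose vocabulary is under max_vocab, return the best-scoring one."""
    eligible = [(score, c) for c, size, score in zip(cs, vocab_sizes, scores)
                if size < max_vocab]
    if not eligible:
        raise ValueError("no regularization weight satisfies the size limit")
    # max() keeps the first entry on ties, like np.argmax.
    return max(eligible, key=lambda pair: pair[0])[1]

# Hypothetical values, for illustration only.
toy_cs = [0.01, 0.05, 0.1, 0.4]
toy_sizes = [120, 610, 940, 1500]
toy_scores = [0.900, 0.940, 0.955, 0.960]
print(pick_best_c(toy_cs, toy_sizes, toy_scores, max_vocab=950))
```

The largest weight scores best but yields too many n-grams, so the best weight under the limit is chosen instead.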

This regularization weight is used to define a new classifier, which replaces the cross-validated classifier in the pipeline. We then refit the pipeline on the combined data.

In [None]:
pipeline_vocab.steps.pop(3)
pipeline_vocab.steps.append(
    ("logreg", LogisticRegression(n_jobs=-1,
                                  C=best_c,
                                  solver="saga",
                                  penalty="l1",
                                  max_iter=1000,
                                  random_state=0)))  # Seed for repeatability

pipeline_vocab.fit(full_x, full_y)

## Final Vocabulary Selection
We then select all of the non-zero logistic regression coefficients to identify the final subset of relevant n-grams. The column indices for the t-test and regression are referenced against the `CountVectorizer` columns to identify a list of n-gram strings for our vocabulary.

With the provided configuration, this gives us a vocabulary of 874 n-grams.

Finally, this vocabulary is written to `myvocab.txt`, one n-gram per line, with the words of each n-gram separated by spaces.

In [None]:
# Select vocabulary
t_score_inds = pipeline_vocab["t-score"].subset_inds
vocab_t = pipeline_vocab["vectorizer"].get_feature_names_out()[t_score_inds]
lasso_inds = (pipeline_vocab["logreg"].coef_ != 0).reshape(-1)
vocab = vocab_t[lasso_inds]
print(f"\nVocab size  (t-test selection): {len(vocab_t)}")
print(f"  Vocab size (logistic w/ lasso): {len(vocab)} words\n")

# Write vocab to file
pd.DataFrame(vocab).to_csv("myvocab.txt", header=False, index=False)
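
The two-stage index lookup can be illustrated on a toy example. The feature names, indices, and coefficients below are hypothetical; they stand in for the `CountVectorizer` columns, the t-test indices, and the fitted Lasso coefficients.

```python
feature_names = ["bad", "great", "the movie", "waste of time", "wonderful"]  # hypothetical
t_test_inds = [3, 0, 4]    # columns kept by the t-test step
coefs = [-1.2, 0.0, 0.8]   # one coefficient per kept column

# Stage 1: map the t-test indices back to n-gram strings.
vocab_t = [feature_names[i] for i in t_test_inds]
# Stage 2: keep only the n-grams whose Lasso coefficient is non-zero.
vocab = [term for term, coef in zip(vocab_t, coefs) if coef != 0]
print(vocab)
```

The coefficient positions refer to the columns that survived the t-test, not to the original vectorizer columns, which is why the lookup has to be done in two stages.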

In [None]:
print('Total Time (s):', (datetime.now() - start_time).total_seconds())