# Intro to Natural Language Processing
Welcome to NLP. Natural language processing (NLP) aims to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
Applications of NLP include sentiment analysis, machine translation, chatbots, speech recognition, text summarization, and information retrieval.
In this notebook, we'll dive into the world of text analysis.
We will explore ways to extract meaning from text and build a model that can differentiate between positive and negative sentiment in movie reviews.
We'll be using a simple technique called Bag of Words,
which represents each text as a numerical vector: every word in the vocabulary gets an index, and the value at that index is how often the word occurs in the text.
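
To make the idea concrete before bringing in any libraries, here is a minimal pure-Python sketch of a bag-of-words encoding. The two toy documents and the helper name `bag_of_words` are made up for illustration; this is not how scikit-learn implements it internally.

```python
from collections import Counter
from typing import List

# Two toy documents, made up for illustration.
docs = ["good movie good cast", "bad movie"]

# Vocabulary: every unique word, sorted so each word gets a stable column index.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc: str) -> List[int]:
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)    # ['bad', 'cast', 'good', 'movie']
print(vectors)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Note that word order is discarded: only the counts survive, which is where the "bag" in the name comes from.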




In [1]:
import numpy as np

# Data
To start off, let's get some data. We are going to use the IMDb dataset,
a large collection of movie reviews from the website IMDb.
It contains 50,000 movie reviews, half of which are labeled as positive and half as negative,
and is often used as a benchmark dataset for natural language processing tasks.
Grab the dataset from [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) if you haven't already.

In [2]:
import pandas as pd

# load dataset into a pandas dataframe
# try using 20 newsgroups instead: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
# switch to pytorch lightning: https://lightning.ai/docs/pytorch/stable/
df = pd.read_csv('IMDB Dataset.csv')

In [3]:
#  Let's see what's in it!
df.head(5)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


# Preprocessing Data
To get usable data, we must transform the raw text into a form suitable for analysis.
We'll use a regular expression to clean out unwanted strings and
the `CountVectorizer` from scikit-learn to transform the collection of texts
into a matrix of token counts, where each row represents a document and
each column represents a unique word in the document collection.
Let's see it in action.
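
Before running the cleaning step on the full dataset, it can help to try the pattern on one short string. The sketch below uses the same regular expression as this notebook, written as a raw string; the sample sentence is made up for illustration.

```python
import re

# Same pattern as the cleaning step: tag-like tokens such as <br />, any character
# that is not a word character, space, or apostrophe, plus digits and underscores.
pattern = re.compile(r"<\w+ /?>|[^\w ']|\d|_")

sample = "Great film!<br /><br />I'd give it 10_10."
cleaned = pattern.sub(" ", sample)

print(repr(cleaned))
print(cleaned.split())  # ['Great', 'film', "I'd", 'give', 'it']
```

Every match is replaced by a space rather than removed, so words on either side of a tag or punctuation mark are not glued together. Apostrophes are kept, which preserves contractions such as "I'd".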

In [4]:
import re
from pprint import pprint

# before
pprint(df.iloc[0]['review'])

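# replace HTML tags, punctuation, digits, and underscores with spaces
regex = re.compile('<\\w+ /?>|[^\\w \']|\\d|_')
# assign the result back; calling replace(..., inplace=True) on a column selection is unreliable
df['review'] = df['review'].replace(regex, ' ', regex=True)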

# after
pprint(df.iloc[0]['review'])

('One of the other reviewers has mentioned that after watching just 1 Oz '
 "episode you'll be hooked. They are right, as this is exactly what happened "
 'with me.<br /><br />The first thing that struck me about Oz was its '
 'brutality and unflinching scenes of violence, which set in right from the '
 'word GO. Trust me, this is not a show for the faint hearted or timid. This '
 'show pulls no punches with regards to drugs, sex or violence. Its is '
 'hardcore, in the classic use of the word.<br /><br />It is called OZ as that '
 'is the nickname given to the Oswald Maximum Security State Penitentary. It '
 'focuses mainly on Emerald City, an experimental section of the prison where '
 'all the cells have glass fronts and face inwards, so privacy is not high on '
 'the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, '
 'Christians, Italians, Irish and more....so scuffles, death stares, dodgy '
 'dealings and shady agreements are never far away.<br /><br />I would

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# stop_words='english' removes common English words like "a" or "the" from the text
vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
review_bow = vectorizer.fit_transform(df['review'])

In [6]:
review_bow

<50000x25453 sparse matrix of type '<class 'numpy.int64'>'
	with 4131900 stored elements in Compressed Sparse Row format>

In [7]:
# see some of the tokens it collected
vectorizer.get_feature_names_out()

array(['aa', 'aaa', 'aag', ..., 'zulu', 'zuniga', 'über'], dtype=object)

In [8]:
review_bow[23001].toarray().squeeze()

array([0, 0, 0, ..., 0, 0, 0])
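
The `CountVectorizer` call above also passes `max_df=.5` and `min_df=10`, which drop words that appear in more than half of the documents or in fewer than 10 documents. The following is a rough pure-Python sketch of that document-frequency filter on a made-up four-document corpus (with `min_df=2` so the toy example keeps something). The helper `filter_vocab` is hypothetical and only illustrates the idea.

```python
from collections import Counter
from typing import List, Set

def filter_vocab(docs: List[str], min_df: int, max_df: float) -> Set[str]:
    """Keep words whose document frequency lies in [min_df, max_df * len(docs)]."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each word is counted at most once per document
        doc_freq.update(set(doc.split()))
    max_count = max_df * len(docs)
    return {word for word, df in doc_freq.items() if min_df <= df <= max_count}

docs = ["movie was great", "movie was awful", "movie had great music", "movie plot"]
print(sorted(filter_vocab(docs, min_df=2, max_df=0.5)))  # ['great', 'was']
```

Here "movie" is dropped because it appears in every document and so says little about sentiment, while words seen only once are dropped as too rare to learn from.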

# Dataset
Now let's make it into a PyTorch `Dataset`.

In [9]:
import torch
from torch.utils.data import Dataset

class IMDBDataset(Dataset):
    regex = re.compile('<\\w+ /?>|[^\\w \']|\\d|_')

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)

        # clean data (assign back rather than calling replace(inplace=True) on a column selection)
        self.df['review'] = self.df['review'].replace(IMDBDataset.regex, ' ', regex=True)

        # fit vectorizer
        self.bows = self.vectorizer.fit_transform(self.df['review'])
        
        # map targets
        self.sentiments = self.df.sentiment.map({
            'negative': 0,
            'positive': 1
        }).values

    def __getitem__(self, index: int):
        X = self.bows[index].toarray().squeeze().astype(np.float32)
        Y = self.sentiments[index].astype(np.float32)

        return X, Y

    def __len__(self):
        return len(self.df)

    @property
    def classes(self):
        return 'negative', 'positive'
    
    @property
    def vocab_size(self):
        return len(self.vectorizer.get_feature_names_out())

In [10]:
dataset = IMDBDataset(df)

In [11]:
dataset[0]

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 1.0)

# Model
Next, let's make a super simple logistic regression model. The model should receive a tensor of term frequencies and output a value between 0 and 1.
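
As a sketch of what this model computes, here is the forward pass written out in plain Python: a weighted sum of the word counts plus a bias, squashed by a sigmoid. The three-word vocabulary and the weight values are hypothetical, chosen only to show the direction of the effect.

```python
import math
from typing import List

def predict(x: List[float], weights: List[float], bias: float) -> float:
    """sigmoid(w . x + b): a score strictly between 0 and 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for a three-word vocabulary: ["awful", "great", "plot"].
weights = [-2.0, 1.5, 0.0]
bias = 0.0

print(predict([0, 2, 1], weights, bias))  # "great" twice: above 0.5
print(predict([1, 0, 1], weights, bias))  # "awful" once: below 0.5
```

Training amounts to finding weights like these automatically, one per vocabulary word.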

In [12]:
from torch import nn
class LogisticRegression(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, X: torch.Tensor):
        return torch.sigmoid(self.linear(X))

In [13]:
model = LogisticRegression(dataset.vocab_size, 1)

# Hyperparameters
Set some hyperparameters for training.

In [14]:
epochs = 3
batch_size = 16
lr = 3e-2
num_folds = 3

In [15]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Cross Validation
Before we get to the training loop, let's understand cross-validation.
Cross-validation is a technique used to evaluate a machine learning
model's performance by splitting the dataset into multiple subsets or folds.
In each round, the model is trained on all folds but one and tested on the held-out fold; this is repeated until every fold has served as the test set once.
The overall performance is then calculated by averaging the performance of each fold.
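
The splitting itself is simple enough to sketch by hand. The hypothetical helper below yields contiguous, unshuffled train/test index lists, which is the same layout an unshuffled k-fold split produces; it is an illustration, not a replacement for a library splitter.

```python
from typing import Iterator, List, Tuple

def k_fold_indices(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

for fold, (train, test) in enumerate(k_fold_indices(n_samples=7, n_splits=3)):
    print(fold, train, test)
# 0 [3, 4, 5, 6] [0, 1, 2]
# 1 [0, 1, 2, 5, 6] [3, 4]
# 2 [0, 1, 2, 3, 4] [5, 6]
```

Every sample lands in exactly one test fold, so each review is used for evaluation once and for training in the remaining rounds.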

In [16]:
from sklearn.model_selection import KFold

# k-fold cross validation
kf = KFold(n_splits=num_folds)

# Metrics
We'll need a way to score the performance of our model.
We'll benchmark our model with `AUROC` from torchmetrics.
The `AUROC` score summarizes the Receiver Operating Characteristic Curve
 into a single number that describes the performance of a model for multiple thresholds at the same time.
 Notably, an `AUROC` score of 1 is a perfect score and an `AUROC` score of 0.5 corresponds to random guessing.
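
One way to build intuition for this number: AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up scores. It compares every pair, so it is only suitable for tiny inputs and is not how a metrics library would compute it.

```python
from typing import List

def auroc(scores: List[float], labels: List[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
print(auroc([0.9, 0.8, 0.2, 0.1], labels))  # 1.0: every positive outranks every negative
print(auroc([0.9, 0.2, 0.8, 0.1], labels))  # 0.75: one of four pairs is ranked wrongly
print(auroc([0.5, 0.5, 0.5, 0.5], labels))  # 0.5: constant scores carry no ranking information
```

Because only the ranking matters, AUROC does not depend on where the decision threshold is placed.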

In [17]:
from torchmetrics import AUROC

metric = AUROC(task='binary').to(device)

# Training
Our training loop performs k-fold cross-validation. For each fold, it re-initializes the logistic regression model, trains it on the binary classification task with the `SGD` optimizer and the `BCELoss` loss function, evaluates it on the held-out fold after every epoch, and reports the average train loss, test loss, and test AUROC.
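
To make the update inside the loop less of a black box, here is a sketch of a single gradient-descent step on binary cross-entropy, written by hand for one example. The helper names and the toy input are assumptions for illustration; the actual loop lets PyTorch compute the gradients and averages the loss over a batch.

```python
import math
from typing import List, Tuple

def forward(x: List[float], w: List[float], b: float) -> float:
    """Logistic regression prediction: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def bce(p: float, y: float) -> float:
    """Binary cross-entropy for one prediction p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def sgd_step(x: List[float], y: float, w: List[float], b: float,
             lr: float) -> Tuple[List[float], float]:
    """One gradient-descent update on a single example."""
    error = forward(x, w, b) - y  # derivative of the loss with respect to w . x + b
    new_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    new_b = b - lr * error
    return new_w, new_b

x, y = [1.0, 2.0], 1.0      # toy word counts for a positive review
w, b = [0.0, 0.0], 0.0      # start from all-zero parameters
loss_before = bce(forward(x, w, b), y)
w, b = sgd_step(x, y, w, b, lr=0.1)
loss_after = bce(forward(x, w, b), y)
print(f"{loss_before:.4f} -> {loss_after:.4f}")
```

The loss on this example drops after the step because the weights of the words that appeared were nudged toward the correct label.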

In [18]:
from typing import List
from torch.utils.data import DataLoader, SubsetRandomSampler
from torch.optim import SGD

def cross_validate(model: nn.Module, dataset: Dataset, *, num_folds=3, epochs=3, lr=1e-3, batch_size=16):

    # keep track of model scores
    scores: List[float] = []

    kf = KFold(n_splits=num_folds)
    metric = AUROC(task='binary').to(device)

    for fold, (train_indices, test_indices) in enumerate(kf.split(dataset)):

        # samplers that draw elements randomly, without replacement, from the given indices
        train_sampler = SubsetRandomSampler(train_indices)
        test_sampler = SubsetRandomSampler(test_indices)

        # data loaders
        train_loader = DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            sampler=train_sampler,
        )
        test_loader = DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            sampler=test_sampler,
        )

        # re-initialize model
        # need to do this, otherwise model will pick up
        # where it left off on last cross-validation fold
        for layer in model.children():
            if hasattr(layer, 'reset_parameters'):
                layer.reset_parameters()

        # loss function
        criterion = nn.BCELoss()

        # optimizer
        optimizer = SGD(model.parameters(), lr=lr)

        # training loop
        for epoch in range(epochs):
            total_train_loss = 0
            total_test_loss = 0

            # training
            print('training ', end='')
            model.train()
            for i, (X, Y) in enumerate(train_loader):
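                # print progress roughly every 10% of batches
                # (max() avoids a modulo-by-zero error when the loader has fewer than 10 batches)
                if i % max(1, len(train_loader) // 10) == 0:
                    print('.', end='')

                # forward pass
                # squeeze only the last dimension so a final batch of size 1 keeps shape (1,)
                outputs: torch.Tensor = model(X.to(device)).squeeze(-1)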

                # calculate loss
                loss: torch.Tensor = criterion(outputs, Y.to(device))

                # zero out accumulated gradients
                optimizer.zero_grad()

                # backpropagation
                loss.backward()
                total_train_loss += loss.item()

                # update weights and biases
                optimizer.step()

            # testing
            print('\ntesting  ', end='')
            model.eval()
            with torch.inference_mode():
                total_score = 0
                for i, (X, Y) in enumerate(test_loader):
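                    # print progress roughly every 10% of batches
                    # (max() avoids a modulo-by-zero error when the loader has fewer than 10 batches)
                    if i % max(1, len(test_loader) // 10) == 0:
                        print('.', end='')

                    # squeeze only the last dimension so a final batch of size 1 keeps shape (1,)
                    outputs: torch.Tensor = model(X.to(device)).squeeze(-1)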
                    loss: torch.Tensor = criterion(outputs, Y.to(device))

                    # update loss and scores
                    total_test_loss += loss.item()
                    total_score += metric(outputs, Y.to(device)).item()
            # record this epoch's average per-batch AUROC (one entry per epoch, per fold)
            scores.append(total_score / len(test_loader))
            print(f'\nFold {fold} | Epoch {epoch} | train loss {total_train_loss / len(train_loader):.2f} | '
                  f'test loss {total_test_loss / len(test_loader):.2f} | '
                  f'test auroc {total_score / len(test_loader):.2f}')
        print(f'\n{"-" * 90}\n')
    return scores

In [19]:
model.to(device)
scores = cross_validate(
    model=model,
    dataset=dataset,
    num_folds=num_folds,
    epochs=epochs,
    batch_size=batch_size,
    lr=lr
)

training ...........
testing  ...........
Fold 0 | Epoch 0 | train loss 0.42 | test loss 0.35 | test auroc 0.93
training ...........
testing  ...........
Fold 0 | Epoch 1 | train loss 0.32 | test loss 0.32 | test auroc 0.94
training ...........
testing  ...........
Fold 0 | Epoch 2 | train loss 0.29 | test loss 0.31 | test auroc 0.95

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 1 | Epoch 0 | train loss 0.42 | test loss 0.36 | test auroc 0.93
training ...........
testing  ...........
Fold 1 | Epoch 1 | train loss 0.32 | test loss 0.33 | test auroc 0.94
training ...........
testing  ...........
Fold 1 | Epoch 2 | train loss 0.29 | test loss 0.32 | test auroc 0.94

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 2 | Epoch 0 | train loss 0.42 | test loss 0.35 | test auroc 0.93
training ...........
testing  

# Confidence Interval
A confidence interval is a range of values that is likely to contain
the true population parameter at a chosen level of confidence,
estimated from a sample of data. Here we compute a 95% interval for the mean test AUROC.
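
For small samples, such an interval is commonly built from Student's t distribution: the sample mean plus or minus a critical value times the standard error of the mean. The sketch below spells out that arithmetic using only the standard library. The scores are hypothetical, and the critical value is passed in by hand (about 4.303 for a two-sided 95% interval with 2 degrees of freedom) rather than looked up from a statistics package.

```python
import math
import statistics
from typing import List, Tuple

def t_confidence_interval(data: List[float], t_crit: float) -> Tuple[float, float]:
    """Return mean -/+ t_crit * standard error of the mean."""
    mean = statistics.mean(data)
    # sample standard deviation (n - 1 in the denominator) divided by sqrt(n)
    sem = statistics.stdev(data) / math.sqrt(len(data))
    return mean - t_crit * sem, mean + t_crit * sem

# Hypothetical scores, not results from this notebook.
low, high = t_confidence_interval([0.93, 0.94, 0.95], t_crit=4.303)
print(f"{low:.3f} to {high:.3f}")
```

With so few samples the critical value is large, so the interval is wide; more scores narrow it both through a smaller critical value and a smaller standard error.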

In [20]:
import numpy as np
from scipy import stats

# confidence interval function
def confidence_interval(data: List[float]):
    sem = stats.sem(data)
    if sem == 0:
        return data[0], data[0]
    return stats.t.interval(confidence=.95, df=len(data)-1, loc=np.mean(data), scale=sem)

In [21]:
confidence_interval(scores)

(0.9369050515870934, 0.9453037602403791)

# Improvement
Now let's try to improve our model. Let's add more non-linearity with a ReLU activation function!

In [22]:
class LogisticRegression(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.relu = nn.ReLU()

    def forward(self, X: torch.Tensor):
        return torch.sigmoid(self.relu(self.linear(X)))

In [23]:
model = LogisticRegression(len(dataset.vectorizer.get_feature_names_out()), 1).to(device)
scores = cross_validate(
    model=model,
    dataset=dataset,
    num_folds=num_folds,
    epochs=epochs,
    batch_size=batch_size,
    lr=lr
)
confidence_interval(scores)

training ...........
testing  ...........
Fold 0 | Epoch 0 | train loss 0.56 | test loss 0.53 | test auroc 0.93
training ...........
testing  ...........
Fold 0 | Epoch 1 | train loss 0.51 | test loss 0.52 | test auroc 0.93
training ...........
testing  ...........
Fold 0 | Epoch 2 | train loss 0.49 | test loss 0.51 | test auroc 0.94

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 1 | Epoch 0 | train loss 0.56 | test loss 0.53 | test auroc 0.93
training ...........
testing  ...........
Fold 1 | Epoch 1 | train loss 0.51 | test loss 0.51 | test auroc 0.93
training ...........
testing  ...........
Fold 1 | Epoch 2 | train loss 0.49 | test loss 0.51 | test auroc 0.93

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 2 | Epoch 0 | train loss 0.56 | test loss 0.53 | test auroc 0.93
training ...........
testing  ....
Fold 2 | Epoch 1 | train loss 0.51 | test loss 0.51 | test auroc 0.93
training ...........
testing  ...........
Fold 2 | Epoch 2 | train loss 0.49 | test loss 0.51 | test auroc 0.93

------------------------------------------------------------------------------------------



(0.9287067517338505, 0.9333876505285355)

Did it improve?
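
One thing worth checking when comparing the losses is what a ReLU placed directly before the sigmoid does to the output range. The sketch below evaluates both functions on a few made-up inputs. It is offered as a likely contributor to the higher loss, not a full diagnosis.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    return max(0.0, z)

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  sigmoid(relu)={sigmoid(relu(z)):.3f}")

# With predictions never below 0.5, a negative review (label 0) costs at least
# -log(1 - 0.5) = log(2), about 0.693, under binary cross-entropy.
print(f"{math.log(2):.3f}")
```

ReLU maps every negative input to 0 and `sigmoid(0)` is 0.5, so the model can no longer output a confident "negative" prediction, whatever the weights are.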

Try increasing the model complexity with more layers.
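
To get a feel for how much "more complexity" a hidden layer adds, the following sketch counts trainable parameters for a single linear layer versus a two-layer network with 32 hidden units. It assumes the 25,453-column vocabulary reported for the count matrix earlier in this notebook; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

vocab_size = 25453   # number of columns in the count matrix shown earlier
hidden_units = 32

single_layer = linear_params(vocab_size, 1)
two_layers = linear_params(vocab_size, hidden_units) + linear_params(hidden_units, 1)
print(single_layer, two_layers)  # 25454 814561
```

That is roughly 32 times as many parameters to fit with the same data and the same three epochs, which is worth keeping in mind when reading the results.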

In [24]:
class LogisticRegression(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_units: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_features, hidden_units),
            nn.ReLU(),
            nn.Linear(hidden_units, out_features),
            nn.ReLU(),
        )

    def forward(self, X: torch.Tensor):
        return torch.sigmoid(self.block(X))

In [25]:
model = LogisticRegression(len(dataset.vectorizer.get_feature_names_out()), 1).to(device)
scores = cross_validate(
    model=model,
    dataset=dataset,
    num_folds=num_folds,
    epochs=epochs,
    batch_size=batch_size,
    lr=lr
)
confidence_interval(scores)

training ...........
testing  ...........
Fold 0 | Epoch 0 | train loss 0.55 | test loss 0.51 | test auroc 0.92
training ...........
testing  ...........
Fold 0 | Epoch 1 | train loss 0.48 | test loss 0.50 | test auroc 0.92
training ...........
testing  ...........
Fold 0 | Epoch 2 | train loss 0.45 | test loss 0.50 | test auroc 0.92

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 1 | Epoch 0 | train loss 0.47 | test loss 0.45 | test auroc 0.94
training ...........
testing  ...........
Fold 1 | Epoch 1 | train loss 0.45 | test loss 0.45 | test auroc 0.94
training ...........
testing  ...........
Fold 1 | Epoch 2 | train loss 0.43 | test loss 0.45 | test auroc 0.92

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 2 | Epoch 0 | train loss 0.45 | test loss 0.42 | test auroc 0.95
training ...........
testing  

(0.9246694048741035, 0.942074215096939)

How was it this time?

Try one more time with regularization, which helps prevent overfitting and improve generalization, using `Dropout`.
`Dropout` randomly zeroes some activations during training and is switched off during evaluation.
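
As a rough illustration of the mechanism, here is a pure-Python sketch of "inverted" dropout: each value is zeroed with probability `p` during training, and the survivors are scaled by `1 / (1 - p)` so the expected sum stays the same. The helper `dropout` and its explicit `rng` argument are assumptions made for this example. In PyTorch, the layer's training or evaluation mode is set through `model.train()` and `model.eval()`.

```python
import random
from typing import List

def dropout(values: List[float], p: float, training: bool, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each value with probability p, rescale the survivors."""
    if not training:
        return list(values)  # evaluation mode: pass everything through unchanged
    keep_scale = 1.0 / (1.0 - p)  # requires p < 1
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)  # seeded so repeated runs give the same mask
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, p=0.5, training=True, rng=rng))
print(dropout(activations, p=0.5, training=False, rng=rng))
```

With `p=0.5`, each surviving activation is doubled during training, and the evaluation call returns the input untouched.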

In [26]:
class LogisticRegression(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_units: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_features, hidden_units),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(hidden_units, out_features),
            nn.ReLU(),
            nn.Dropout(),
        )

    def forward(self, X: torch.Tensor):
        return torch.sigmoid(self.block(X))

In [27]:
model = LogisticRegression(len(dataset.vectorizer.get_feature_names_out()), 1).to(device)
scores = cross_validate(
    model=model,
    dataset=dataset,
    num_folds=num_folds,
    epochs=epochs,
    batch_size=batch_size,
    lr=lr
)
confidence_interval(scores)

training ...........
testing  ...........
Fold 0 | Epoch 0 | train loss 0.63 | test loss 0.55 | test auroc 0.91
training ...........
testing  ...........
Fold 0 | Epoch 1 | train loss 0.60 | test loss 0.53 | test auroc 0.93
training ...........
testing  ...........
Fold 0 | Epoch 2 | train loss 0.59 | test loss 0.52 | test auroc 0.92

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 1 | Epoch 0 | train loss 0.60 | test loss 0.50 | test auroc 0.94
training ...........
testing  ...........
Fold 1 | Epoch 1 | train loss 0.59 | test loss 0.49 | test auroc 0.93
training ...........
testing  ...........
Fold 1 | Epoch 2 | train loss 0.58 | test loss 0.50 | test auroc 0.91

------------------------------------------------------------------------------------------

training ...........
testing  ...........
Fold 2 | Epoch 0 | train loss 0.59 | test loss 0.48 | test auroc 0.94
training ...........
testing  

(0.9208162858122616, 0.9416881319202838)

# Conclusion
As you can see, the more complex models did not outperform the simple logistic regression baseline. Machine learning requires lots of experimenting.
It often requires trying out different models, hyperparameters, and preprocessing techniques to achieve optimal results.
This is only the start of NLP. Throughout the workshop you may find other approaches to this problem.