In [1]:
import pandas as pd
import numpy as np
import os
import torch   
import matplotlib.pyplot as plt
import re
import random
import torch.nn as nn
import torch.nn.functional as F
import csv
import seaborn as sns
import nltk
import torch.optim.lr_scheduler as lr_scheduler

from collections import Counter
from tqdm import tqdm
from IPython.display import display, Math, Latex
from sklearn.utils import resample
from sklearn.metrics import classification_report, f1_score, confusion_matrix
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler
from nltk.stem import PorterStemmer
from torch.utils.data import TensorDataset, Dataset, DataLoader
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.utils import resample
from torch.nn import Dropout
from wordcloud import WordCloud
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# What is Natural Language Processing

Natural Language Processing (NLP) is a branch of AI that focuses on the interaction between computers and human language. It combines computational linguistics and machine learning to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. 

Key applications of NLP:
- Text Classification
- Machine Translation 
- Chatbots and Virtual Assistants 
- Sentiment Analyis 
- Text Summarization
- Named Entity Recognition
- Speech Recognition
- Language Generation 
- Question Answering 

# Representing Text for Computational Understanding

To enable computers to effectively work with human language, it's essential to convert text into a format they can understand and process. We need to represent language in a numerical or symbolic format that a computer can manipulate. This conversion is a fundamental step in NLP and is essential for various downstream tasks like analysis, translation, and generation. 

### Bag of Words

One of the simplest and most common methods for representing text in a way that computers can understand is the Bag of Words (BoW) model. The idea behind bag-of-words is quite simple and can be summarized as follows:
1. Create a vocabulary of unique tokens-for example, words-from the entire set of text documents. 
2. Construct a feature vector from each document that contains the counts of how often each word occurs in the particular document. 

#### Example

In [2]:
count = CountVectorizer()

documents = np.array(['The sky is blue', 'The weather is nice', 'The sky is blue, and the weather is nice', 'and one and one is two'])
bag = count.fit_transform(documents)

print(f'Vocab content: \n {count.vocabulary_}')

df = pd.DataFrame(data=bag.toarray(),columns = count.get_feature_names_out())
print('\n Bag of Words: \n ')
df

Vocab content: 
 {'the': 6, 'sky': 5, 'is': 2, 'blue': 1, 'weather': 8, 'nice': 3, 'and': 0, 'one': 4, 'two': 7}

 Bag of Words: 
 


Unnamed: 0,and,blue,is,nice,one,sky,the,two,weather
0,0,1,1,0,0,1,1,0,0
1,0,0,1,1,0,0,1,0,1
2,1,1,2,1,0,1,2,0,1
3,2,0,1,0,2,0,0,1,0


**Note:** 
1. We can observe that since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them **sparse**.
2. For a BoW model, the word or term order in a sentence or document does not matter. 

### Assessing word relevancy via term frequency-inverse document frequency (tfidf)

We often need to account for frequently occuring words that don't contain useful or discriminatory information. **tfidf** is a useful technique which can be used to downweight these frequently occuring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency: <br />

*tf-idf(t, d)* = *tf(t,d)* x *(idf(t,d) + 1)*

where <br />

$ $idf(t,d)$ = log\frac{1+n_{d}}{1+df(d,t)}$

--------- this needs to be cleaned up ------------

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2',
                         smooth_idf=True)

tfidf_model = tfidf.fit_transform(bag).toarray()

df = pd.DataFrame(data=tfidf_model,columns = count.get_feature_names_out())
print('\n Tfidf: \n ')
df.round(decimals=2)


 Tfidf: 
 


Unnamed: 0,and,blue,is,nice,one,sky,the,two,weather
0,0.0,0.57,0.38,0.0,0.0,0.57,0.46,0.0,0.0
1,0.0,0.0,0.38,0.57,0.0,0.0,0.46,0.0,0.57
2,0.33,0.33,0.43,0.33,0.0,0.33,0.53,0.0,0.33
3,0.57,0.0,0.19,0.0,0.72,0.0,0.0,0.36,0.0


In [1]:
## EXERCISE: TRY WITH YOUR OWN SENTENCES

def dense_vectors(docs):
    bag = count.fit_transform(docs)
    tfidf_model = tfidf.fit_transform(bag).toarray()
    return pd.DataFrame(data=tfidf_model, columns = count.get_feature_names_out())

docs = np.array(['', 
                 '', 
                 '', 
                 ''])

df = dense_vectors(docs)

df.round(decimals=2)

# Text Preprocessing

Text preprocessing in Natural Language Processing (NLP) is a crucial step that involves cleaning and formatting text data before it is used in NLP tasks. The main purpose of text preprocessing is to transform raw text data into a more structured and consistent format that can be easily analyzed and understood by NLP algorithms.

Text preprocessing typically involves several steps, including:

- Tokenization: This involves breaking down text into individual words or tokens, which can then be analyzed and processed separately.
- Stopword removal: This involves removing common words such as "the," "and," and "a" that do not contribute much meaning to the text and can be safely removed without losing important information. (self-note to add: no order anyway)
- Stemming and lemmatization: These techniques involve reducing words to their base or root form, which can help to reduce the dimensionality of the text data and improve the accuracy of NLP algorithms. (self-note: study and studies shouldnt be considered different tokens)
- Noise removal: This involves removing unnecessary characters, symbols, and other forms of noise that can interfere with NLP algorithms.
- Text normalization: This involves converting text to a consistent format, such as converting all text to lowercase or uppercase, removing punctuation, and replacing abbreviations with their full forms.

# Neural Network: Multi-Layer Perceptrons

intro to MLP here
something short since we may not have time

In [5]:
imdb_dataset = pd.read_csv('./data/IMDB Dataset.csv', encoding='utf8')

df = imdb_dataset.sample(10000).reset_index(drop=True)
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df_train, df_valid = train_test_split(df, test_size=0.3)

In [6]:
# function to preprocess our text
def preprocess_text(col, stop_words, stemmer):
    """
    Argument:
    1. col to preprocess
    2. list of stop_words
    3. word stemmer

    steps: tokenize -> lower case ->
            remove stop words -> stem
    """
    col = re.sub(r"[^a-zA-Z0-9 ]", "", col) # only letters and numbers
    tokenize = col.split() # split into tokens
    lower_case = [word.lower() for word in tokenize] # lowercase for consistent format
    remove_stop_words = [word for word in lower_case if word not in stop_words] # remove stop words
    stemmed_words = [stemmer.stem(word) for word in remove_stop_words] # stem each word
    result = " ".join(stemmed_words) # join the text back
    return result

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
kwargs = {'stop_words': stop_words,
           'stemmer': stemmer}
df_train['review'] = df_train['review'].apply(lambda x: preprocess_text(x, **kwargs))
df_valid['review'] = df_valid['review'].apply(lambda x: preprocess_text(x, **kwargs))

In [7]:
X_train = df_train['review']
y_train = np.array(df_train['sentiment'])
X_val = df_valid['review']
y_val = np.array(df_valid['sentiment'])

count_vec = CountVectorizer()
X_train_counts = count_vec.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_val_counts = count_vec.transform(X_val)
X_val_tfidf = tfidf_transformer.transform(X_val_counts)

In [8]:
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
    
    
X_train = torch.from_numpy(np.array(X_train_tfidf.todense())).float()
y_train = torch.from_numpy(y_train)


X_val = torch.from_numpy(np.array(X_val_tfidf.todense())).float()
y_val = torch.from_numpy(y_val)



train_ds = CustomDataset(X_train, y_train)
valid_ds = CustomDataset(X_val, y_val)

In [9]:
class NeuralNet(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.layer1 = nn.Linear(input_size, 1024)
        self.layer2 = nn.Linear(1024, 128)
        self.layer3 = nn.Linear(128, num_classes)
        self.drop = nn.Dropout()
    
    def forward(self, x):
        x = self.layer1(x)
        x = nn.ReLU()(x)

        x = self.drop(x)
        x = self.layer2(x)
        x = nn.ReLU()(x)
        
        x = self.layer3(x)
        x = nn.Softmax(dim=1)(x)

        return x

In [10]:
def train_model(NeuralNet, train_ds, valid_ds, epochs, learning_rate, batch_size, model_kwargs):
    torch.manual_seed(10)
    model = NeuralNet(**model_kwargs).to(Config.device)
    print(model)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)

    train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
    valid_dl = DataLoader(valid_ds, batch_size=32, shuffle=False)

    loss_hist_train = []
    loss_hist_val = []
    acc_hist_train = []
    acc_hist_val = []
    f1_hist_train = []
    f1_hist_val = []


    for epoch in range(epochs):
        running_hist_loss = 0.0
        running_hist_acc = 0.0
        running_hist_f1 = 0.0
        lens = 0.0
        model.train()
        for x_batch, y_batch  in tqdm(train_dl):
            x_batch, y_batch = x_batch.to(Config.device), y_batch.to(Config.device)
            pred = model(x_batch)
            loss = criterion(pred, y_batch)
            running_hist_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            f1_score_calc = f1_score(to_numpy(torch.argmax(pred, dim=1)), to_numpy(y_batch), average='macro')
            running_hist_f1 += f1_score_calc
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            running_hist_acc += is_correct.sum()
            lens += len(x_batch)
        running_hist_acc /= lens
        running_hist_loss /= lens
        running_hist_f1 /= np.ceil(len(train_dl.dataset.X)/batch_size)





        running_hist_loss_test = 0.0
        running_hist_acc_test = 0.0
        running_hist_f1_test = 0.0
        lens = 0.0
        model.eval()
        with torch.no_grad():
            for x, y in tqdm(valid_dl):
                x, y = x.to(Config.device), y.to(Config.device)
                pred = model(x)
                loss = criterion(pred, y)
                running_hist_loss_test += loss.item()
                f1_score_calc = f1_score(to_numpy(torch.argmax(pred, dim=1)), to_numpy(y), average='macro')
                running_hist_f1_test += f1_score_calc
                is_correct = (torch.argmax(pred, dim=1) == y)
                running_hist_acc_test += is_correct.float().sum()
                lens += len(x)

            running_hist_loss_test /= lens
            running_hist_acc_test /= lens
            running_hist_f1_test /= np.ceil(len(valid_dl.dataset.X)/batch_size)
        print(f'Predicted Test: {torch.argmax(pred, dim=1).detach().cpu().numpy()}')
        print(f'Actual Test:    {y.detach().cpu().numpy()}')
        print(f"Epoch {epoch}")
        print("Train Loss: \t{:.4f}".format(running_hist_loss))
        print("Validation Loss: \t{:.4f}".format(running_hist_loss_test))
        print("Train Accuracy: \t{:.3f}".format(running_hist_acc))
        print("Validation Accuracy: \t{:.3f}".format(running_hist_acc_test))
        print("Train F1: \t{:.3f}".format(running_hist_f1))
        print("Validation F1: \t{:.3f}".format(running_hist_f1_test))
        print("------------------------------")

            # save best model based on accuracy
        if (epoch > 1 and running_hist_loss_test < np.min(acc_hist_val)) or epoch == 1:
            torch.save(model.state_dict(), "../models/checkpoint.pt")

        loss_hist_train.append(running_hist_loss)
        loss_hist_val.append(running_hist_loss_test)
        acc_hist_val.append(running_hist_acc_test.detach().cpu().numpy())
        acc_hist_train.append(running_hist_acc.detach().cpu().numpy())
        f1_hist_train.append(running_hist_f1)
        f1_hist_val.append(running_hist_f1_test)

        

        
    # instantiate model and load best weights
    model = NeuralNet(**model_kwargs)
    model.load_state_dict(torch.load("../models/checkpoint.pt"))
    model.eval()

    return model, loss_hist_train, loss_hist_val, acc_hist_train, acc_hist_val, f1_hist_train, f1_hist_val

In [11]:
class Config:
    batch_size = 32
    learning_rate = 0.0001
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    epochs = 5

In [12]:
def to_numpy(x):
    return x.cpu().numpy()

In [13]:
kwargs = {'input_size': train_ds.X.shape[1],
          'num_classes': 2}
model, loss_train, loss_val, acc_train, acc_val, f1_train, f1_val = train_model(NeuralNet, train_ds, valid_ds, Config.epochs, Config.learning_rate, 
                                                                            Config.batch_size, kwargs)

NeuralNet(
  (layer1): Linear(in_features=45504, out_features=1024, bias=True)
  (layer2): Linear(in_features=1024, out_features=128, bias=True)
  (layer3): Linear(in_features=128, out_features=2, bias=True)
  (drop): Dropout(p=0.5, inplace=False)
)


100%|██████████| 219/219 [00:13<00:00, 15.97it/s]
100%|██████████| 94/94 [00:04<00:00, 20.16it/s]


Predicted Test: [0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0]
Actual Test:    [0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0]
Epoch 0
Train Loss: 	0.0204
Validation Loss: 	0.0170
Train Accuracy: 	0.714
Validation Accuracy: 	0.858
Train F1: 	0.653
Validation F1: 	0.856
------------------------------


100%|██████████| 219/219 [00:13<00:00, 15.83it/s]
100%|██████████| 94/94 [00:04<00:00, 20.11it/s]


Predicted Test: [0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0]
Actual Test:    [0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0]
Epoch 1
Train Loss: 	0.0138
Validation Loss: 	0.0140
Train Accuracy: 	0.923
Validation Accuracy: 	0.879
Train F1: 	0.921
Validation F1: 	0.876
------------------------------


100%|██████████| 219/219 [00:11<00:00, 18.31it/s]
100%|██████████| 94/94 [00:05<00:00, 18.08it/s]


Predicted Test: [0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0]
Actual Test:    [0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0]
Epoch 2
Train Loss: 	0.0113
Validation Loss: 	0.0137
Train Accuracy: 	0.970
Validation Accuracy: 	0.877
Train F1: 	0.969
Validation F1: 	0.875
------------------------------


100%|██████████| 219/219 [00:11<00:00, 18.76it/s]
100%|██████████| 94/94 [00:04<00:00, 21.02it/s]


Predicted Test: [0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 1 1]
Actual Test:    [0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0]
Epoch 3
Train Loss: 	0.0105
Validation Loss: 	0.0137
Train Accuracy: 	0.987
Validation Accuracy: 	0.875
Train F1: 	0.986
Validation F1: 	0.872
------------------------------


100%|██████████| 219/219 [00:11<00:00, 19.22it/s]
100%|██████████| 94/94 [00:04<00:00, 23.05it/s]


Predicted Test: [0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 1 1]
Actual Test:    [0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0]
Epoch 4
Train Loss: 	0.0102
Validation Loss: 	0.0137
Train Accuracy: 	0.993
Validation Accuracy: 	0.872
Train F1: 	0.993
Validation F1: 	0.870
------------------------------
