# Improving the model

Today we will make some improvements to yesterday's model.

## TF-IDF
We will use the same data loading strategy, but instead of the very sparse one-hot encoding, we will use TF-IDF for vectorization. We will use sklearn's implementation, which tokenizes the text internally, so we do not need a separate tokenization step.

In [255]:
from sklearn.feature_extraction.text import TfidfVectorizer

Let us first test the vectorizer. We will use some of its optional arguments to look for better results. In particular, we will experiment with removing stop words. Moreover, we will normalize the vectors to avoid problems in the NN's gradient-based learning. We will also restrict the number of dimensions with `max_features` so that the vectors do not grow too large.

In [256]:
vectorizer = TfidfVectorizer(stop_words = 'english', norm = 'l2', max_features = 10000)
text = ['i love you so much', 'he loves her a lot as well']
X = vectorizer.fit_transform(text)

We can check the vectorization.

In [257]:
print(X)
print(vectorizer.get_feature_names_out())

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3 stored elements and shape (2, 3)>
  Coords	Values
  (0, 1)	1.0
  (1, 2)	0.7071067811865476
  (1, 0)	0.7071067811865476
['lot' 'love' 'loves']


Most words were removed for being stop words. We can see that, without their removal, the vocabulary (and, consequently, the dimension of the vectors) becomes much bigger.

In [258]:
vectorizer = TfidfVectorizer(stop_words = None, norm = 'l2', max_features = 10000)
text = ['i love you so much', 'he loves her a lot as well']
X = vectorizer.fit_transform(text)
print(X)
print(vectorizer.get_feature_names_out())

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10 stored elements and shape (2, 10)>
  Coords	Values
  (0, 4)	0.5
  (0, 9)	0.5
  (0, 7)	0.5
  (0, 6)	0.5
  (1, 1)	0.408248290463863
  (1, 5)	0.408248290463863
  (1, 2)	0.408248290463863
  (1, 3)	0.408248290463863
  (1, 0)	0.408248290463863
  (1, 8)	0.408248290463863
['as' 'he' 'her' 'lot' 'love' 'loves' 'much' 'so' 'well' 'you']


It is hard to decide whether to remove stop words based only on these artificial examples, but we will try it.
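The values printed in the first test can be reproduced by hand, which makes it clearer what the vectorizer computes. The sketch below assumes sklearn's default settings (`smooth_idf=True`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`; the variable names are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['i love you so much', 'he loves her a lot as well']

# Raw term counts, with the same stop word removal as in the cell above
counts = CountVectorizer(stop_words='english').fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (sklearn default)

tfidf = counts * idf
# L2 normalization: every row ends up with unit Euclidean norm
tfidf_manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer(stop_words='english', norm='l2').fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
print(tfidf_manual)
```

In the second document the two remaining terms have the same count and the same document frequency, so after L2 normalization each gets 1/sqrt(2), the 0.7071 seen in the output.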

## Load data
The data loader class changes very little, except, of course, for the vectorization part.

In [259]:
import pandas as pd
from torch.utils.data import Dataset
class ReviewDataset(Dataset):
    def __init__(self, review_df, stop_word = True):
        # Load and split data
        self.review_df = review_df

        self.train_df = self.review_df[self.review_df['split'] == 'train']
        self.train_size = len(self.train_df)

        self.val_df = self.review_df[self.review_df['split'] == 'val']
        self.val_size = len(self.val_df)

        self.test_df = self.review_df[self.review_df['split'] == 'test']
        self.test_size = len(self.test_df)

        # We now create the attributes that will be used to interact with each of these datasets
        self._lookup_dict = {
            'train' : (self.train_df, self.train_size),
            'val' : (self.val_df, self.val_size),
            'test' : (self.test_df, self.test_size) 
        }

        # set_split (defined below) selects which split the dataset currently serves.
        # By default, this is train.
        # This matters because the __len__ and __getitem__ methods operate on
        # whichever split is currently selected
        self.set_split('train') 

        # Because we will be using the sklearn vectorization, we will need
        # to fit it to train from this point on
        if stop_word:
            stop_word = 'english'
        else:
            stop_word = None
        self.vectorizer = TfidfVectorizer(stop_words = stop_word, norm = 'l2', max_features = 10000)
        # We fit the vectorizer only in the train set
        self.vectorizer.fit(self.train_df['review'])

    def set_split(self, split):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
    
    
    # This method will be useful for generating an object of the class directly from the csv path
    @classmethod
    def load_dataset_and_make_vectorizer(cls, review_csv, stop_word = True):
        '''
        Loads the csv directly and makes a vectorizer based on it. 
        The vectorizer is naturally constructed only over the train set.
        '''
        review_df = pd.read_csv(review_csv)
        return cls(review_df, stop_word)
    
    # These are the methods the DataLoader relies on in the training loop
    def __len__(self):
        return self._target_size
    
    def __getitem__(self, index):
        '''
        This basically allows one to iterate over the rows of the csv.
        '''
        row = self._target_df.iloc[index]
        # Vectorize the row
        review_vector = self.vectorizer.transform([row['review']]).toarray()
        mapper =  {'negative' : 0, 'positive' : 1}
        rating_index = mapper[row['rating']]
        return {
            'x_data': review_vector,
            'y_data': rating_index,
            'text': row['review']
            }
    
    def get_num_batches(self, batch_size):
        '''
        Returns the number of batches needed for that particular batch size.
        '''
        return len(self)//batch_size

Let us check it.

In [260]:
review = ReviewDataset.load_dataset_and_make_vectorizer("reviews_with_splits_lite.csv")
encoded = review[0]
print(encoded['x_data'].shape)

(1, 10000)


All good! We will again implement the batch generator, but this time, we allow for GPU.

In [271]:
from torch.utils.data import DataLoader

def generate_batches(dataset, batch_size, shuffle = True,
                     drop_last = True, device = "cpu"):
    dataloader = DataLoader(dataset = dataset, batch_size = batch_size,
                            shuffle = shuffle, drop_last = drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            # The raw text is not a tensor, so it is left out of the batch
            if name != 'text':
                out_data_dict[name] = tensor.to(device)
        yield out_data_dict
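Because each item has shape `(1, n_features)`, the batches produced by the `DataLoader` carry an extra axis. The sketch below uses a hypothetical `ToyDataset` with 4 features (a stand-in for `ReviewDataset`, not part of the original code) to show the shapes involved and why `.squeeze()` is applied to the model output in the training loop.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in for ReviewDataset, with 4 features instead of 10000."""
    def __len__(self):
        return 6

    def __getitem__(self, index):
        return {
            'x_data': np.zeros((1, 4)),  # same (1, n_features) layout as .toarray()
            'y_data': index % 2,
            'text': 'review {}'.format(index),
        }

batch = next(iter(DataLoader(ToyDataset(), batch_size=3)))
x_shape = tuple(batch['x_data'].shape)   # batch axis is added in front
y_shape = tuple(batch['y_data'].shape)

layer = nn.Linear(4, 1)
raw = layer(batch['x_data'].float())
raw_shape = tuple(raw.shape)
squeezed_shape = tuple(raw.squeeze().shape)  # matches the shape of the labels

print(x_shape, y_shape, raw_shape, squeezed_shape)
print(type(batch['text']))  # strings are collated into a plain list
```

Note that `.squeeze()` removes every axis of size one, so a batch with a single element would also lose its batch axis.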

## Early stopping

We will implement a class that performs early stopping with a minimum improvement threshold (`min_delta`) and a patience. When training stops early, we would like to return to the best model, so the class can also save it.

In [272]:
import numpy as np
import torch

class EarlyStopping:
    def __init__(self, patience = 5, min_delta = 0, save = None):
        self.patience = patience
        self.min_delta = min_delta
        self.save = save
        self.best_loss = np.inf
        self.patience_counter = 0
        self.flag = False
    
    def __call__(self, val_loss, model = None):
        # If the validation loss improved by more than min_delta,
        # we update the best loss and save the model
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.patience_counter = 0
        
            if (self.save) and (model is not None):
                torch.save(model.state_dict(), self.save)
        
        else:
            self.patience_counter += 1
            if self.patience_counter > self.patience:
                self.flag = True
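To see how the patience counter behaves, the same decision rule can be run on a list of validation losses. This is a minimal sketch: `first_stop_epoch` is a hypothetical helper that mirrors the logic of the class above without the model saving, and the loss values are made up.

```python
def first_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (epoch at which the stop flag is raised, best loss seen)."""
    best_loss = float('inf')
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss - min_delta:
            # Improvement: remember it and reset the counter
            best_loss = val_loss
            patience_counter = 0
        else:
            # No sufficient improvement: one more "bad" epoch
            patience_counter += 1
            if patience_counter > patience:
                return epoch, best_loss
    return None, best_loss

losses = [0.50, 0.40, 0.41, 0.405, 0.42, 0.39]  # made-up validation losses
stop_epoch, best = first_stop_epoch(losses, patience=2, min_delta=0.001)
print(stop_epoch, best)  # → 4 0.4
```

Since the comparison uses `>`, the flag is raised on the third consecutive epoch without improvement when `patience` is 2, that is, after `patience + 1` such epochs. The final loss of 0.39 is never reached.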

        

## Model

We will not change much of the model, except that we now add dropout for regularization (we saw that our original training was overfitting a lot).

In [273]:
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, input_size, hidden_dimension, dropout_prob = 0.5):
        super().__init__() 
        self.fc1 = nn.Linear(input_size, hidden_dimension)
        self.fc2 = nn.Linear(hidden_dimension, 1)
        self.dropout = nn.Dropout(p = dropout_prob)
        self.input_size = input_size
    
    def forward(self, x_in):
        if x_in.shape[-1] != self.input_size:
            raise ValueError(
                "Input dimension {} must be equal to the model's expected dimension {}!".format(
                    x_in.shape[-1], self.input_size))
        intermediate = F.relu(self.fc1(x_in))
        # Add dropout
        intermediate = self.dropout(intermediate)
        y_out = self.fc2(intermediate)
        return y_out
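Dropout behaves differently in training and evaluation mode, which is why the loop below switches between `classifier.train()` and `classifier.eval()`. A minimal sketch on a standalone `nn.Dropout` layer (the input of ones is only for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # elements are zeroed at random, survivors are scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)    # dropout is the identity in evaluation mode

kept_fraction = (train_out != 0).float().mean().item()
print(torch.unique(train_out))  # only the values 0 and 2 appear
print(kept_fraction)            # roughly half of the elements survive
print(torch.equal(eval_out, x))
```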

## Training loop

This time, we will compute some metrics to check the model's performance. For that, we will use sklearn's metrics.

In [274]:
from sklearn.metrics import accuracy_score, f1_score

Notice that, while accuracy can be computed as a running average over batches, F1 is not an average of per-batch values, so we need to compute it over the whole evaluation dataset. This will be done below. Another addition is a learning rate schedule.
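A small made-up example illustrates the point about F1. With two equally sized batches, the mean of the per-batch accuracies matches the accuracy on all samples, but the mean of the per-batch F1 scores does not match the F1 on all samples. The labels and predictions below are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score

# Two hypothetical batches of (labels, predictions)
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 1, 1]),
]

mean_batch_acc = sum(accuracy_score(y, p) for y, p in batches) / len(batches)
mean_batch_f1 = sum(f1_score(y, p) for y, p in batches) / len(batches)

# Metrics over all samples at once
all_labels = [label for y, _ in batches for label in y]
all_preds = [pred for _, p in batches for pred in p]
global_acc = accuracy_score(all_labels, all_preds)
global_f1 = f1_score(all_labels, all_preds)

print(mean_batch_acc, global_acc)  # both 0.75
print(mean_batch_f1, global_f1)    # about 0.762 versus 0.8
```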

In [281]:
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

lr = 1e-3
hidden_dimension = 5
n_epochs = 50
loss_func = nn.BCEWithLogitsLoss()
batch_size = 128
# Allows for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

classifier = MLP(10000, hidden_dimension, 0.5)
classifier.to(device)
optimizer = optim.Adam(classifier.parameters(), lr = lr)
# Multiplies the learning rate by `gamma` every `step_size` epochs
scheduler = StepLR(optimizer, step_size = 5, gamma = 0.1)

early_stopping = EarlyStopping(min_delta = 0.001, save = 'best_mlp_model.pt')

for epoch in range(n_epochs): 
    review.set_split('train')
    batch_generator = generate_batches(review, batch_size = batch_size, device = device)
    running_loss = 0.
    classifier.train()

    for batch_index, batch_dict in enumerate(batch_generator):
        # 1. Zero the gradient
        classifier.zero_grad()

        # 2. Prediction
        y_pred = classifier(x_in = batch_dict['x_data'].float()).squeeze()

        # 3. Compute loss
        loss = loss_func(y_pred, batch_dict['y_data'].float())
        loss_t = loss.item()
        running_loss += (loss_t - running_loss) / (batch_index + 1)


        # 4. Backpropagate
        loss.backward()
        
        # 5. Optimize
        optimizer.step()

    # Evaluation part, we don't want parameters to change
    classifier.eval()
    review.set_split('val')
    
    # Will be used to store predictions, as F1 can only
    # be computed at the end (no running F1)
    all_preds = []
    all_labels = []

    batch_generator = generate_batches(review, batch_size = batch_size, device = device)
    for batch_dict in batch_generator:
        with torch.no_grad(): # Disables gradient tracking
            all_preds.append(classifier(x_in = batch_dict['x_data'].float()).squeeze().cpu())
            all_labels.append(batch_dict['y_data'].cpu())
            
    # Computes metrics
    all_labels, all_preds = torch.cat(all_labels), torch.cat(all_preds)
    val_loss = loss_func(all_preds, all_labels.float())
    # For computing accuracy and F1, we need the variables as binary
    all_preds = (torch.sigmoid(all_preds)>0.5).long()
    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)

    # Updates early stop
    # Pass a plain float so that best_loss is not stored as a tensor
    early_stopping(val_loss.item(), classifier)
    if early_stopping.flag:
        print('Early stopped at epoch:', epoch)
        break

    # Updates scheduler
    scheduler.step()

    print('Epoch: ', epoch)
    print('Training loss', running_loss)
    print('Validation loss', val_loss.item())
    print('Validation accuracy: ', acc)
    print('Validation F1: ', f1)

Epoch:  0
Training loss 0.6407634097766254
Validation loss 0.5343649387359619
Validation accuracy:  0.8700721153846154
Validation F1:  0.8595921548253019
Epoch:  1
Training loss 0.4799858628729589
Validation loss 0.3869459927082062
Validation accuracy:  0.9001201923076924
Validation F1:  0.8976474935336864
Epoch:  2
Training loss 0.39671912789344765
Validation loss 0.3153764605522156
Validation accuracy:  0.9068509615384616
Validation F1:  0.9058208773848584
Epoch:  3
Training loss 0.3508301390931494
Validation loss 0.27585718035697937
Validation accuracy:  0.9106971153846154
Validation F1:  0.9106863805745883
Epoch:  4
Training loss 0.3266943512009638
Validation loss 0.25198763608932495
Validation accuracy:  0.9115384615384615
Validation F1:  0.9120458891013384
Epoch:  5
Training loss 0.3088339102618833
Validation loss 0.24987995624542236
Validation accuracy:  0.9120192307692307
Validation F1:  0.9122091628687935
Epoch:  6
Training loss 0.30722826646239154
Validation loss 0.2476821243
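The learning rate schedule used in the loop above can be inspected in isolation. The sketch below uses a single dummy parameter (an assumption made only to have something to optimize) and records the learning rate at the start of each epoch with `get_last_lr()`.

```python
import torch
from torch.optim.lr_scheduler import StepLR

# Dummy parameter, only there so the optimizer has something to hold
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)

lrs = []
for epoch in range(12):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used during this epoch
    optimizer.step()                        # stands in for a full epoch of updates
    scheduler.step()

for epoch, lr_value in enumerate(lrs):
    print(epoch, lr_value)
```

With `step_size = 5` and `gamma = 0.1`, epochs 0 to 4 use 1e-3, epochs 5 to 9 use 1e-4, and from epoch 10 on the rate is 1e-5.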

We now load the best model so we can evaluate it.

In [282]:
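# map_location places the weights on the current device; weights_only restricts loading to tensors
classifier.load_state_dict(torch.load("best_mlp_model.pt", map_location = device, weights_only = True))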
classifier.eval()


MLP(
  (fc1): Linear(in_features=10000, out_features=5, bias=True)
  (fc2): Linear(in_features=5, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

In [283]:
classifier.eval()
review.set_split('test')

# Will be used to store predictions, as F1 can only
# be computed at the end (no running F1)
all_preds = []
all_labels = []


batch_generator = generate_batches(review, batch_size = batch_size, device = device)
for batch_dict in batch_generator:
    with torch.no_grad(): # Disables gradient tracking
        all_preds.append(classifier(x_in = batch_dict['x_data'].float()).squeeze().cpu())
        all_labels.append(batch_dict['y_data'].cpu())
        
# Computes metrics
all_labels, all_preds = torch.cat(all_labels), torch.cat(all_preds)
test_loss = loss_func(all_preds, all_labels.float())
# For computing accuracy and F1, we need the variables as binary
all_preds = (torch.sigmoid(all_preds)>0.5).long()
acc = accuracy_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds)
print('Test accuracy: ', acc)
print('Test F1: ', f1)

Test accuracy:  0.9143028846153847
Test F1:  0.9139823862950899


We can also apply the model to individual reviews.

In [284]:
def predict_rating(text, classifier, vectorizer, decision_threshold=0.5):
    # Use the vectorizer passed in, which must be the one fitted on the train split
    vectorized_review = torch.tensor(vectorizer.transform([text]).toarray()).to(device)
    classifier.eval()
    with torch.no_grad():  # No gradients are needed for inference
        result = classifier(vectorized_review.float())

    # The model outputs a logit; sigmoid turns it into the probability of the positive class
    probability_value = torch.sigmoid(result).cpu().item()
    index = 1
    if probability_value < decision_threshold:
        index = 0

    mapper = {0: 'NEGATIVE', 1: 'POSITIVE'}
    return mapper[index], probability_value

test_review = "i love this place"

# Pass the dataset's vectorizer, not the toy one fitted on two sentences in the first cells
prediction, probability = predict_rating(test_review, classifier, review.vectorizer, decision_threshold=0.5)
print("The review '{}' is {}".format(test_review, prediction), 'with probability', probability)

The review 'i love this place' is POSITIVE with probability 0.965078592300415
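The sigmoid is applied at prediction time because the model outputs raw logits, while `BCEWithLogitsLoss` applies the sigmoid internally during training. The sketch below, on random made-up logits, checks that this loss agrees with `BCELoss` on the sigmoid outputs, and that thresholding the probability at 0.5 amounts to thresholding the logit at 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)  # made-up model outputs
labels = torch.tensor([0., 1., 1., 0., 1., 0., 0., 1.])

# Same quantity computed in two ways
loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), labels)

# sigmoid(z) > 0.5 exactly when z > 0
preds_from_probs = (torch.sigmoid(logits) > 0.5).long()
preds_from_logits = (logits > 0).long()

print(loss_from_logits.item(), loss_from_probs.item())
print(torch.equal(preds_from_probs, preds_from_logits))
```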
