# Term Project
## Mohammed Furkhan, Shaikh

## Import necessary libraries

In [174]:
import torch
import torchtext
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch.autograd import Variable
from torch.nn import functional as F

In [175]:
import pandas as pd
import numpy as np
import re
import random

## Load the data and Pre-process

In [176]:
data = pd.read_csv("bgg-15m-reviews.csv", usecols=[ "rating", "comment"])[["comment", "rating"]]
### https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [177]:
data.head()

Unnamed: 0,comment,rating
0,,10.0
1,Hands down my favorite new game of BGG CON 200...,10.0
2,I tend to either love or easily tire of co-op ...,10.0
3,,10.0
4,This is an amazing co-op game. I play mostly ...,10.0


In [178]:
df = data[data['comment'].notna()]
### https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan

In [179]:
df.head()

Unnamed: 0,comment,rating
1,Hands down my favorite new game of BGG CON 200...,10.0
2,I tend to either love or easily tire of co-op ...,10.0
4,This is an amazing co-op game. I play mostly ...,10.0
5,Hey! I can finally rate this game I've been pl...,10.0
8,Love it- great fun with my son. 2 plays so far...,10.0


In [180]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame
df = df.assign(rating=df['rating'].round().astype('int64'))


In [181]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame
df = df.assign(comment=df['comment'].str.lower())


In [182]:
df['rating'].unique()

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0], dtype=int64)

In [183]:
### https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
df = df.sample(frac=1).reset_index(drop=True)

In [184]:
df.head(10)

Unnamed: 0,comment,rating
0,not my favorite...too much chance.,4
1,"i have carcassonne and the river , inns cathed...",8
2,die solovariante ist klasse und schwer zu gewi...,8
3,red raven games,7
4,eurogame 2-4. pequeño.mayorias.,6
5,"a very well-written and rich story, disguised ...",9
6,trying to group cards into combinations just i...,5
7,dice throne has reinvigorated the head-to-head...,6
8,the best cooperative game i have ever played. ...,8
9,"2014年4月購入　￥4,924 2014年7月放出　￥5,000",7


In [185]:
pattern = re.compile("[^a-zA-Z ]+")
df["comment"] = df['comment'].map(lambda x: pattern.sub('', x))
df.head(10)

Unnamed: 0,comment,rating
0,not my favoritetoo much chance,4
1,i have carcassonne and the river inns cathedr...,8
2,die solovariante ist klasse und schwer zu gewi...,8
3,red raven games,7
4,eurogame pequeomayorias,6
5,a very wellwritten and rich story disguised as...,9
6,trying to group cards into combinations just i...,5
7,dice throne has reinvigorated the headtohead b...,6
8,the best cooperative game i have ever played ...,8
9,,7
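The output above shows that deleting punctuation outright fuses neighbouring words ("favorite...too" becomes "favoritetoo", "well-written" becomes "wellwritten"). A possible alternative, sketched here under the assumption that such words should stay separate tokens, is to replace each run of non-letters with a single space. The helper name `clean_comment` is hypothetical and is not used elsewhere in this notebook; switching to it would change the outputs shown in the following cells.

```python
import re

# Any run of characters that is not a lowercase ASCII letter
NON_LETTERS = re.compile(r"[^a-z]+")

def clean_comment(text):
    """Lowercase, replace each run of non-letters with one space, and trim the ends."""
    return NON_LETTERS.sub(" ", text.lower()).strip()

print(clean_comment("not my favorite...too much chance."))
print(clean_comment("a very well-written and rich story"))
```

With pandas this could be applied as `df['comment'].map(clean_comment)`. Comments written in non-Latin scripts still reduce to an empty string, so the length filter in the next cell remains useful.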


In [186]:
# drop rows with comment length <= 10
df = df[df['comment'].map(len) > 10]
print(len(df))
df = df.reset_index(drop=True)

2774889


In [187]:
df.head(10)

Unnamed: 0,comment,rating
0,not my favoritetoo much chance,4
1,i have carcassonne and the river inns cathedr...,8
2,die solovariante ist klasse und schwer zu gewi...,8
3,red raven games,7
4,eurogame pequeomayorias,6
5,a very wellwritten and rich story disguised as...,9
6,trying to group cards into combinations just i...,5
7,dice throne has reinvigorated the headtohead b...,6
8,the best cooperative game i have ever played ...,8
9,the first game which used multipurpose cards a...,8


In [188]:
df['comment'].map(len).max()

25083

## Make Train and Test Datasets

In [189]:
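# .loc slices include the end label, so df.loc[:20000] and df.loc[20000:25000] would share row 20000.
# .iloc uses half-open ranges, which keeps the two sets disjoint.
training_df = df.iloc[:20000]
testing_df = df.iloc[20000:25000]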

In [190]:
training_df.head()

Unnamed: 0,comment,rating
0,not my favoritetoo much chance,4
1,i have carcassonne and the river inns cathedr...,8
2,die solovariante ist klasse und schwer zu gewi...,8
3,red raven games,7
4,eurogame pequeomayorias,6


In [191]:
testing_df.head()

Unnamed: 0,comment,rating
20000,i could see how this game was great when it re...,5
20001,my opinion is based on gameplay with two expan...,9
20002,bit complicated when you start but then really...,9
20003,too much luck,4
20004,when i was in college i played in a thoroughly...,6


In [192]:
training_df.to_csv("training.csv", index=False)

In [193]:
testing_df.to_csv("testing.csv", index=False)

## Prepare dataset for Pytorch torchtext

The comments are tokenized and mapped to integer indices so that they can be fed to the neural network.

In [194]:
tokenizer = lambda x: x.split()

In [195]:
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer, lower=True, include_lengths=True, batch_first=True, fix_length=200)
LABEL = torchtext.data.LabelField(dtype=torch.float)

In [196]:
fields = [('comment',TEXT),('rating', LABEL)]

In [197]:
train_data = torchtext.data.TabularDataset("training.csv","csv", fields, skip_header=True)

In [198]:
test_data = torchtext.data.TabularDataset("testing.csv","csv", fields, skip_header=True)

In [199]:
train_data.examples[0].comment, train_data.examples[0].rating

(['not', 'my', 'favoritetoo', 'much', 'chance'], '4')
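To make the tokenize, numericalize, and pad steps concrete, here is a minimal pure-Python sketch of the idea. It is not torchtext's implementation: the helper names `build_vocab` and `numericalize` are hypothetical, and the exact index assignment in torchtext may differ.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, fix_length):
    """Convert tokens to ids, truncate to fix_length, and right-pad shorter inputs."""
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens[:fix_length]]
    ids += [stoi["<pad>"]] * (fix_length - len(ids))
    return ids

comments = ["not my favoritetoo much chance", "too much luck"]
token_lists = [c.split() for c in comments]  # same whitespace tokenizer as above
itos, stoi = build_vocab(token_lists)

# "fun" is not in the vocabulary, so it maps to <unk>; the rest is padding
ids = numericalize("too much fun".split(), stoi, fix_length=5)
print(ids)
```

Unknown words all share one `<unk>` id, which is why fused tokens such as "favoritetoo" are unlikely to find a matching pre-trained vector.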

## Create Word Embedding vectors for embedding layer

In [200]:
TEXT.build_vocab(train_data, vectors=torchtext.vocab.GloVe(name='6B', dim=100))
LABEL.build_vocab(train_data)

In [201]:
word_embeddings = TEXT.vocab.vectors

In [202]:
# split() uses split_ratio=0.7 by default: 70% for training, 30% for validation
train_data, valid_data = train_data.split()

In [203]:
train_iter, valid_iter, test_iter = torchtext.data.BucketIterator.splits((train_data, valid_data, test_data),
                                                               batch_size=32,
                                                               sort_key=lambda x: len(x.comment),
                                                               repeat=False,
                                                               shuffle=True)

In [204]:
vocab_size = len(TEXT.vocab)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Model Design: LSTM Classifier

In [205]:
class ClassifierModel(nn.Module):
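    def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
        """
        output_size : number of target classes (11 here, one per rounded rating 0-10)
        weights : pre-trained embedding matrix of shape (vocab_size, embedding_length)
        """
        super(ClassifierModel, self).__init__()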
        self.batch_size = batch_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.embedding_length = embedding_length

        self.word_embeddings = nn.Embedding(vocab_size, embedding_length)  # Initialize the look-up table.
        self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False) # Assign the pre-trained GloVe word embeddings and freeze them.
        self.lstm = nn.LSTM(embedding_length, hidden_size)
        self.label = nn.Linear(hidden_size, output_size)

    def forward(self, input_sentence, batch_size=None):
        """ 
        final_output.shape = (batch_size, output_size)
        """
        embedded = self.word_embeddings(input_sentence) # shape = (batch_size, seq_len, embedding_length)
        embedded = embedded.permute(1, 0, 2) # shape = (seq_len, batch_size, embedding_length)
        if batch_size is None:
            batch_size = self.batch_size
        # Initial hidden and cell states of the LSTM, created on the same device as the input
        h_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)
        c_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)
        output, (final_hidden_state, final_cell_state) = self.lstm(embedded, (h_0, c_0))
        final_output = self.label(final_hidden_state[-1]) # final_hidden_state.size() = (1, batch_size, hidden_size) & final_output.size() = (batch_size, output_size)

        return final_output


## Handle exploding gradients with gradient clipping

In [206]:
def clip_gradient(model, clip_value):
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)
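The function above clamps every gradient component to `[-clip_value, clip_value]`. PyTorch also ships `torch.nn.utils.clip_grad_value_` for this, and `torch.nn.utils.clip_grad_norm_` for the related technique of rescaling the whole gradient vector. The pure-Python sketch below only illustrates the difference between the two ideas on a toy gradient; the function names and numbers are made up for the example.

```python
import math

def clip_by_value(grads, clip_value):
    """Clamp each component independently (what clip_gradient above does)."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, -4.0, 0.05]            # toy gradient with two large components
by_value = clip_by_value(grads, 0.1)  # large components are cut, small one untouched
by_norm = clip_by_norm(grads, 0.1)    # all components shrink by the same factor

print(by_value)
print(by_norm)
```

Clipping by value can change the direction of the update, while clipping by norm keeps the direction and only shortens the step. Neither technique addresses vanishing gradients.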

## Training and Evaluation functions

In [207]:
def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.to(device)
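    # The optimizer is re-created on every call, so Adam's running moment estimates restart each epoch.
    # learning_rate is the global value defined before the training loop runs.
    optim = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate)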
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        text = batch.comment[0].to(device)  # batch.comment is a (token_ids, lengths) pair
        target = batch.rating.long().to(device)  # class indices expected by cross-entropy
        optim.zero_grad()
        # Pass the actual batch size so the final, smaller batch is processed instead of skipped.
        prediction = model(text, text.size(0))
        loss = loss_fn(prediction, target)
        num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
        acc = 100.0 * num_corrects/len(batch)
        loss.backward()
        clip_gradient(model, 1e-1)
        optim.step()
        steps += 1
        
        if steps % 100 == 0:
            print (f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item(): .2f}%')
        
        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()
        
    return total_epoch_loss/len(train_iter), total_epoch_acc/len(train_iter)

In [208]:
def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.comment[0].to(device)  # batch.comment is a (token_ids, lengths) pair
            target = batch.rating.long().to(device)  # class indices expected by cross-entropy
            # Pass the actual batch size so the final, smaller batch is evaluated instead of skipped.
            prediction = model(text, text.size(0))
            loss = loss_fn(prediction, target)
            num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).sum()
            acc = 100.0 * num_corrects/len(batch)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss/len(val_iter), total_epoch_acc/len(val_iter)

In [209]:
batch_size = 32
output_size = 11
hidden_size = 256
embedding_length = 100
model = ClassifierModel(batch_size, output_size, hidden_size, vocab_size, embedding_length, word_embeddings)

In [210]:
learning_rate = 0.001
loss_fn = F.cross_entropy
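`F.cross_entropy` takes raw logits and applies log-softmax internally. As a reference point for reading the loss values logged during training, the sketch below computes the same quantity for a single example in pure Python. The function name is hypothetical and the logits are made up.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# A model with no information assigns equal logits to all 11 classes
uniform_loss = cross_entropy([0.0] * 11, target=7)
print(f"{uniform_loss:.4f}")  # equals ln(11)

# A confident, correct prediction gives a loss close to zero
confident_loss = cross_entropy([0.0] * 10 + [12.0], target=10)
print(f"{confident_loss:.6f}")
```

A uniform guess over 11 classes costs ln(11), roughly 2.40, so a loss below that value means the model has learned at least the class frequencies.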

In [211]:
for epoch in range(20):
    train_loss, train_acc = train_model(model, train_iter, epoch)
    val_loss, val_acc = eval_model(model, valid_iter)
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:3f}, Val. Acc: {val_acc:.2f}%')

Epoch: 1, Idx: 100, Training Loss: 1.8817, Training Accuracy:  15.62%
Epoch: 1, Idx: 200, Training Loss: 1.9796, Training Accuracy:  21.88%
Epoch: 1, Idx: 300, Training Loss: 1.9345, Training Accuracy:  12.50%
Epoch: 1, Idx: 400, Training Loss: 1.8232, Training Accuracy:  25.00%
Epoch: 01, Train Loss: 1.959, Train Acc: 23.31%, Val. Loss: 1.959021, Val. Acc: 24.14%
Epoch: 2, Idx: 100, Training Loss: 1.7796, Training Accuracy:  34.38%
Epoch: 2, Idx: 200, Training Loss: 1.9129, Training Accuracy:  15.62%
Epoch: 2, Idx: 300, Training Loss: 1.8297, Training Accuracy:  18.75%
Epoch: 2, Idx: 400, Training Loss: 2.0461, Training Accuracy:  28.12%
Epoch: 02, Train Loss: 1.942, Train Acc: 24.58%, Val. Loss: 1.956214, Val. Acc: 24.15%
Epoch: 3, Idx: 100, Training Loss: 1.8465, Training Accuracy:  25.00%
Epoch: 3, Idx: 200, Training Loss: 1.9638, Training Accuracy:  25.00%
Epoch: 3, Idx: 300, Training Loss: 2.0503, Training Accuracy:  25.00%
Epoch: 3, Idx: 400, Training Loss: 1.6950, Training Accu

In [212]:
test_loss, test_acc = eval_model(model, test_iter)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}%')

Test Loss: 1.980, Test Acc: 21.82%


## Save the model

In [213]:
torch.save(model.state_dict(), 'saved_weights.pt')

## Predict with custom input

In [214]:
test_sent = "This game is gets boring over time"
test_sent = TEXT.preprocess(test_sent)
test_sent = [[TEXT.vocab.stoi[x] for x in test_sent]]
test_sent = np.asarray(test_sent)
test_sent = torch.LongTensor(test_sent)
test_tensor = Variable(test_sent)
model.eval()
output = model(test_tensor, 1)
out = F.softmax(output, 1)
out

tensor([[2.9283e-01, 3.7589e-01, 3.1843e-01, 2.8570e-04, 1.2258e-02, 1.4841e-04,
         8.0877e-05, 2.0231e-05, 4.4586e-05, 4.8509e-06, 3.5981e-11]],
       grad_fn=<SoftmaxBackward>)

In [215]:
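# argmax returns a class index into LABEL.vocab, not the rating itself;
# LABEL.vocab.itos[index] gives the corresponding rating string.
print("rating",torch.argmax(out[0]).item())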

rating 1
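Because `LABEL` builds its own vocabulary from the rating strings, the class index produced by `argmax` is a position in that vocabulary rather than the rating value. The sketch below shows the softmax, argmax, and look-up steps in pure Python. The `label_itos` ordering and the logits are hypothetical, chosen only for illustration; in the notebook the real ordering is stored in `LABEL.vocab.itos`.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label vocabulary (index -> rating string), not taken from the real data
label_itos = ["8", "7", "6", "9", "10", "5", "4", "3", "2", "1", "0"]

# Made-up logits for one comment
logits = [0.2, 1.5, 0.1, -2.0, -1.0, -3.0, -3.5, -4.0, -4.0, -5.0, -9.0]
probs = softmax(logits)

pred_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
pred_rating = int(label_itos[pred_index])                   # map the index back to a rating

print("class index:", pred_index, "rating:", pred_rating)
```

In this toy case the class index is 1 but the rating it stands for is 7, which is why the look-up step matters when reporting a prediction.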


## References

In [216]:
### https://pytorch.org/text/stable/data.html
### https://pytorch.org/tutorials/beginner/transformer_tutorial.html
### https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
### https://github.com/prakashpandey9/Text-Classification-Pytorch/blob/master/main.py
### https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
### https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496

## Contribution over the references

1. Data cleaning and preparation
2. Explicitly based on torchtext and a self-preprocessed dataset
3. Different word embedding vectors and parameters
4. Optimized the hyperparameters empirically
5. Classifier based on 11 classes (0 - 10)
6. Deploying on cloud
7. Faster processing using sliced data

## Findings

- Hyperparameters
1. The embedding dimension sets the size of the embedding table (vocab_size × embedding_length) and of the LSTM input weights, so larger embedding vectors increase the number of parameters in the model.
2. The batch size can be 16, 32, 64, and so on. This notebook uses 32.
3. The number of layers in the model can be increased, but more layers do not necessarily give better results.
4. The input length is fixed at 200 tokens (`fix_length=200`) but can be increased. Shorter comments are padded by default, and longer ones are truncated.

- Overfitting
1. The training accuracy and loss stay close to the validation accuracy and loss.
2. The model therefore does not appear to overfit. Only a small slice of the data (about 20,000 training comments) was used because of resource limits.
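The effect of the hyperparameters on model size can be checked with a little arithmetic. The sketch below counts the parameters of the three layers used in this notebook, assuming PyTorch's `nn.LSTM` layout (four gates, each with input weights, recurrent weights, and two bias vectors). The helper names are hypothetical, and the vocabulary size of 20,000 is an assumption made for illustration, since the real `vocab_size` is not printed above.

```python
def embedding_params(vocab_size, embedding_length):
    """One vector of length embedding_length per vocabulary entry."""
    return vocab_size * embedding_length

def lstm_params(input_size, hidden_size):
    """Single-layer LSTM: 4 gates x (input weights + recurrent weights + 2 biases)."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size)

def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features

assumed_vocab_size = 20_000  # assumption for illustration only

print("embedding:", embedding_params(assumed_vocab_size, 100))  # frozen in this notebook
print("lstm:", lstm_params(100, 256))                           # 366592
print("linear:", linear_params(256, 11))                        # 2827
```

Under these assumptions the frozen embedding table dominates the total, while the trainable part is mostly the LSTM. Going from 100 to 300 dimensional embeddings would triple the table and also enlarge the LSTM input weights.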


## Why use embedding vectors?/What does the embeddings do?

Embedding vectors encode the relations between words along several learned features. For example, "king" is related to "queen" in the same way that "man" is related to "woman". Another example is "orange" and "apple": both are fruits, and the embeddings capture this by placing them close together.
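These relations can be illustrated with cosine similarity and vector arithmetic. The three-dimensional vectors below are hand-made toy values, not GloVe vectors; their dimensions loosely stand for "royalty", "gender", and "fruit" so that the effect is easy to see.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 for the same direction, 0 for unrelated (orthogonal) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy embeddings (hypothetical values): [royalty, gender, fruit]
toy = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "apple":  [0.0,  0.0, 1.0],
    "orange": [0.1,  0.0, 0.9],
}

# king - man + woman lands on queen in this toy space
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
print(analogy == toy["queen"])

# The two fruits are far more similar to each other than a fruit is to "king"
print(cosine(toy["apple"], toy["orange"]) > cosine(toy["apple"], toy["king"]))
```

Real pre-trained embeddings have many more dimensions and the features are not hand-labelled, but similarity and analogy queries work on the same principle.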

## What is LSTM?

LSTM (Long Short-Term Memory) is a recurrent neural network architecture used mainly for processing sequential data. Text is sequential by nature, so LSTMs are useful for NLP tasks. An LSTM is also more powerful than a vanilla RNN because its gated cell state helps it retain information over long sequences. There are several LSTM variants that can be experimented with.
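For intuition, here is a minimal sketch of one LSTM step with scalar input and state, following the standard gate equations. It is an illustration only, not how `nn.LSTM` is implemented internally; the parameter names (`w_i`, `u_i`, `b_i`, and so on) and the values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for scalar input x, hidden state h_prev, and cell state c_prev."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate cell value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new candidate
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

# All weights zero, except biases that keep the forget gate open and the input gate closed
params = {f"{kind}_{gate}": 0.0 for kind in ("w", "u", "b") for gate in ("i", "f", "o", "g")}
params["b_f"] = 10.0
params["b_i"] = -10.0

h, c = 0.0, 1.0  # start with something stored in the cell state
for x in [0.3, -1.2, 2.0]:
    h, c = lstm_step(x, h, c, params)

print(f"cell state after 3 steps: {c:.4f}")
```

With the forget gate nearly fully open, the cell state stays close to its initial value of 1 across the steps regardless of the inputs. A vanilla RNN has no such additive memory path, which is one reason it struggles with long sequences.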

## Evaluation

After 5 epochs on the dataset, average training accuracy was around 37% and validation accuracy about 36%. These numbers can likely be improved by tuning the hyperparameters described above and by training for longer.