# Sentiment Analysis Starter Code
Use this code as a template, starting place, or inspiration... whatever helps you get started!

## Imports
This starter code will be using the following packages:
- `torch`
- `torchtext`
- `pandas`
- `numpy`
- `tqdm`
- `matplotlib`

Be sure to install these using either `pip` or `conda`!

In [None]:
# Python Standard Library
import math
import os

from collections import Counter, OrderedDict


# PyTorch Modules
import torch
from torch import nn

from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataset import random_split

from torch.nn.utils.rnn import pad_sequence

from torchtext.vocab import vocab
from torchtext.data.utils import get_tokenizer


# Other Packages
import pandas as pd
import numpy as np
from tqdm import tqdm
from matplotlib import pyplot as plt

## Downloading Data
Visit [https://www.kaggle.com/competitions/osuaiclub-fall2022-sentiment-analysis/data](https://www.kaggle.com/competitions/osuaiclub-fall2022-sentiment-analysis/data) to download the dataset!

## Loading Data
We will use the `pandas` package to load our data. All of the data is stored in a `.csv` file, which `pandas` can read directly into a dataframe.

In [None]:
DATA_DIR = './data/'

In [None]:
if os.path.exists(os.path.join(DATA_DIR, 'train.csv')):
    train_df = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'), index_col='id')
else:
    # Fall back to the hosted copy, using the same index column as the local file
    train_df = pd.read_csv("https://raw.githubusercontent.com/OSU-AIClub/Fall-2022/main/Kaggle%20Competition/data/train.csv", index_col='id')
train_df

## Using Subset of Dataset for Quicker Experimentation
We recommend training on a small subset of the dataset while you are prototyping and trying to get your model to work.

In [None]:
# Calculate the size of the dataset
num_samples = len(train_df.index)

# Define how many samples we want in our smaller dataset
target_num_samples = 1000

# Calculate how many training samples we need to remove
# (never negative, in case the dataset is already smaller than the target)
n_remove = max(num_samples - target_num_samples, 0)

# Randomly choose the n_remove indices we will remove
drop_indices = np.random.choice(train_df.index, n_remove, replace=False)
train_df = train_df.drop(drop_indices)

# Show the remaining dataframe
train_df
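
If you want the same subset every time you rerun the notebook, one option is `DataFrame.sample` with a fixed `random_state`. The sketch below is only an illustration on a made-up `toy_df`; the seed value and the variable names are our own assumptions, not part of the starter code.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical data, just for illustration)
toy_df = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

target_num_samples = 4

# Never ask for more rows than the dataframe actually has
n_keep = min(target_num_samples, len(toy_df))

# random_state fixes the random draw so reruns pick the same rows
subset_df = toy_df.sample(n=n_keep, random_state=0)

print(len(subset_df))
```

Keeping rows with `sample` has the same effect as dropping the complement with `np.random.choice`, so either approach should work for prototyping.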

## Fix Class Imbalance in Dataset
This dataset favors the `1` sentiment, which represents positive sentiment, so there are more positive training samples than negative ones. The counts below show by how much.

In [None]:
train_df['sentiment'].value_counts()

For simplicity, we will address this imbalance with [undersampling](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/): we remove positive sentiment samples at random until their count matches the number of negative sentiment samples. While the class imbalance here is not severe, undersampling can still help slightly. Try removing it and compare the results!

In [None]:
# Define values for positive and negative sentiment
POSITIVE_SENTIMENT = 1
NEGATIVE_SENTIMENT = 0

# Count the number of positive and negative samples
num_pos_samples = train_df['sentiment'].value_counts()[POSITIVE_SENTIMENT] 
num_neg_samples = train_df['sentiment'].value_counts()[NEGATIVE_SENTIMENT]

# Calculate the number of positive samples we need to remove to have 
# the same number as negative samples 
num_pos_remove = max(num_pos_samples - num_neg_samples,0)
num_pos_remove

In [None]:
# Split the Dataset into Dataframes of Positive and Negative Only Samples
pos_df = train_df[train_df['sentiment'] == POSITIVE_SENTIMENT]
neg_df = train_df[train_df['sentiment'] == NEGATIVE_SENTIMENT]
print(pos_df.head(), neg_df.head())

# Randomly choose the positive dataframe indices to remove
pos_drop_indices = np.random.choice(pos_df.index, num_pos_remove, replace=False)

# Drop Selected Samples from the Positive Dataframe to balance out both sentiment values
pos_undersampled = pos_df.drop(pos_drop_indices)
pos_undersampled

In [None]:
# Combine the negative samples and the positive samples into one dataframe
balanced_train_df = pd.concat([neg_df, pos_undersampled])

# Check the counts to make sure the classes are now even
balanced_train_df['sentiment'].value_counts()
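
As a rough alternative, the same undersampling idea can be written in a few lines with `groupby(...).sample(...)` (available in recent versions of `pandas`). This is a sketch on a made-up `toy_df`, assuming a `sentiment` column like the one in our dataset; unlike the cells above, it shrinks every class to the size of the smallest one.

```python
import pandas as pd

# Toy stand-in for train_df with 7 positive and 3 negative samples (hypothetical data)
toy_df = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "sentiment": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

# Size of the smallest class; every class is cut down to this many rows
min_count = int(toy_df["sentiment"].value_counts().min())

# Draw min_count rows at random from each sentiment group
balanced_df = toy_df.groupby("sentiment").sample(n=min_count, random_state=0)

print(balanced_df["sentiment"].value_counts())
```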

In [None]:
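# Make sure the data directory exists before writing (it may not if the data was downloaded)
os.makedirs(DATA_DIR, exist_ok=True)
balanced_train_df.to_csv(os.path.join(DATA_DIR, 'all_sets.csv'))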

TOTAL_SAMPLES = balanced_train_df.shape[0]
TOTAL_SAMPLES

## Data Preprocessing
Now that we have a balanced dataset, we can use techniques like tokenization to make it easier for our model to process and train on. We will only be showing how to apply tokenization, but we encourage you to try other techniques!

We will be using the PyTorch `torchtext` library to achieve this.

### Creating a "Vocabulary"
Next, we need to create a "vocabulary" of all words in the dataset. In NLP, a vocabulary is the mapping of each word to a unique ID. We will represent words in numerical form for the model to be able to interpret them.

By creating this mapping, one can write a sentence with numbers. For instance, if the vocab is as follows:

```python
{
  "i": 0,
  "the": 1,
  "ate": 2,
  "pizza": 3
}
```

We can say "I ate the pizza" by saying `[0, 2, 1, 3]`.

This is an oversimplified explanation of encoding, but the general idea is the same.
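
To make the mapping concrete, here is a minimal sketch of encoding and decoding with the toy vocabulary above. The helper names `encode` and `decode` are made up for this example, and real tokenizers do more than lowercase and split on spaces.

```python
# The toy vocabulary from the example above
vocab_map = {"i": 0, "the": 1, "ate": 2, "pizza": 3}

def encode(sentence):
    """Turn a sentence into a list of word IDs (lowercase, split on spaces)."""
    return [vocab_map[word] for word in sentence.lower().split()]

def decode(ids):
    """Turn a list of word IDs back into a lowercase sentence."""
    id_to_word = {idx: word for word, idx in vocab_map.items()}
    return " ".join(id_to_word[i] for i in ids)

encoded = encode("I ate the pizza")
print(encoded)          # [0, 2, 1, 3]
print(decode(encoded))  # i ate the pizza
```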


In addition to regular words, our vocabulary contains a few special tokens. `<START>` and `<END>` mark the start and end of a sample, respectively. They are inserted at the beginning and end of each sample so the model can tell where each one begins and ends.

`<UNK>` is the token used to represent any word not in our vocabulary. This is most useful when you want to limit the vocabulary size to increase the speed of training or run inference on text never seen before. 

`<PAD>` is the token used to pad shorter inputs to the length of the longest input, to ensure we can have a constant input size for batching.
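
The sketch below shows how the four special tokens could fit together on two short samples. It is plain Python for illustration only: the word list and the helpers `encode` and `pad_batch` are hypothetical, and the notebook itself relies on `torchtext` and `pad_sequence` for this.

```python
# Special tokens come first, so they get IDs 0-3 (an assumption of this sketch)
specials = ["<UNK>", "<START>", "<END>", "<PAD>"]
words = ["i", "the", "ate", "pizza"]
token_to_id = {tok: i for i, tok in enumerate(specials + words)}

def encode(sentence):
    """Map words to IDs, fall back to <UNK>, and wrap in <START>/<END>."""
    unk = token_to_id["<UNK>"]
    ids = [token_to_id.get(word, unk) for word in sentence.lower().split()]
    return [token_to_id["<START>"]] + ids + [token_to_id["<END>"]]

def pad_batch(batch):
    """Right-pad every sequence with <PAD> up to the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    pad = token_to_id["<PAD>"]
    return [seq + [pad] * (max_len - len(seq)) for seq in batch]

# "sushi" is not in the vocabulary, so it becomes <UNK> (ID 0)
batch = pad_batch([encode("I ate the pizza"), encode("I ate sushi")])
for row in batch:
    print(row)
# [1, 4, 6, 5, 7, 2]
# [1, 4, 6, 0, 2, 3]
```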

## Build Data Processing Pipelines
The first step is to load the raw data from the `.csv` file, and then define our Tokenizer and Vocab(ulary). Using these, we can define pipelines which will take any input text and convert it to a list of numeric tokens.

In [None]:
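# Import the factory function under an alias so that assigning to `vocab`
# below does not shadow it if this cell is run again
from torchtext.vocab import vocab as build_vocab

tokenizer = get_tokenizer('basic_english')
counter = Counter()

# Count how often each token appears in the balanced dataset
all_sets_df = pd.read_csv(os.path.join(DATA_DIR, 'all_sets.csv'))
for (_, text, sentiment) in all_sets_df.itertuples(index=False, name=None):
    counter.update(tokenizer(text))

# Order tokens from most to least frequent so that common words get low IDs
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

# Keep tokens seen at least 10 times; the special tokens are placed first
vocab = build_vocab(ordered_dict, min_freq=10, specials=('<UNK>', '<START>', '<END>', '<PAD>'))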

In [None]:
# Define how we convert a token to a number in the vocab
def vocab_token(token):
    if token in vocab:
        return vocab[token]
    else:
        return vocab['<UNK>']

In [None]:
# Define how we process the input text from the CSV file for use in the Torch Dataset
def text_pipeline(x):
    return [vocab_token(token) for token in tokenizer(x)]

# Define how we process the input label from the CSV file for use in the Torch Dataset
def label_pipline(y):
    return int(y)

In [None]:
# Example use of the text_pipeline
text_pipeline('This is an incredibly amazing absolutely perfect example.')

## Create Dataset Object
The next step is to define the PyTorch `Dataset` object we will use to load the data. You can find more information on PyTorch Datasets [here](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

In [None]:
# Define token numbers 
PAD_IDX = vocab['<PAD>']
START_IDX = vocab['<START>']
END_IDX = vocab['<END>']

# Define whether we are loading this data onto the CPU or GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define Dataset Class
class SentimentDataset(Dataset):
    
    # Helper function
    def process_input(self, sample):
        # unpack sample input
        _, _text, _label = sample
        
        # Pass input through their respective pipelines and construct a tensor out of them
        label = torch.tensor([label_pipline(_label)], dtype=torch.int32)
        text = torch.tensor([START_IDX] + text_pipeline(_text) + [END_IDX], dtype=torch.int32)
        
        # Return the tensor input
        return text, label
    
    def __init__(self, csv_file_path):
        # Load the raw data from file
        self.data = list(pd.read_csv(csv_file_path).itertuples(index=False, name=None))
        
        # Process the data and pass through data pipelines; convert to tensors
        self.processed_data = [self.process_input(x) for x in self.data]
        
        # Pad and Separate Text Input Lengths
        self.inpts = pad_sequence([x for x, _ in self.processed_data], batch_first=True, padding_value=PAD_IDX)
        
        # Separate Labels from Text
        self.labels = [y for _, y in self.processed_data]
        
    # Required function for Dataset subclasses
    def __len__(self):
        return len(self.data)
    
    # Called whenever a data sample is generated
    def __getitem__(self, idx):
        return self.inpts[idx].to(device), self.labels[idx].to(device)

In [None]:
# Instantiate the Dataset Object
dataset = SentimentDataset(os.path.join(DATA_DIR, 'all_sets.csv'))

## Generate DataLoader Object
Next, we need to create the `DataLoader` objects, which are used to concatenate samples into iterable batches. Here we define the batch size and the ratio of the training split to the validation split, then generate the training and validation data loaders. We use the validation dataset for an unbiased performance metric; see more information about hold-out datasets [here](https://www.datarobot.com/wiki/training-validation-holdout/).

In [None]:
BATCH_SIZE = 4

train_ratio = 0.85
val_ratio = 0.15

# Calculate the number of samples in the training dataset
train_counts = math.ceil(len(dataset) * train_ratio)

# Calculate the number of samples in the validation dataset
val_counts = len(dataset) - train_counts

print(train_counts, val_counts)

In [None]:
# Randomly Split the dataset into a Training Split and Validation Split
train_ds, val_ds = random_split(dataset, [train_counts, val_counts])

# Create Training and Validation DataLoaders
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
# Validation needs no shuffling, and keeping the last partial batch means every sample is scored
val_dl = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, drop_last=False)
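
To get a feel for the numbers, here is a small sketch of the split arithmetic and of what `drop_last=True` does to a loader: the final partial batch is discarded, so a few samples are skipped each epoch. The function name `split_and_batch_counts` and the dataset size of 1000 are assumptions for this example only.

```python
import math

def split_and_batch_counts(num_samples, train_ratio, batch_size):
    """Return split sizes plus full-batch and dropped-sample counts per split."""
    train_count = math.ceil(num_samples * train_ratio)
    val_count = num_samples - train_count

    # With drop_last=True, only full batches are kept; the remainder is dropped
    train_batches, train_dropped = divmod(train_count, batch_size)
    val_batches, val_dropped = divmod(val_count, batch_size)

    return {
        "train": (train_count, train_batches, train_dropped),
        "val": (val_count, val_batches, val_dropped),
    }

counts = split_and_batch_counts(num_samples=1000, train_ratio=0.85, batch_size=4)
print(counts["train"])  # (850, 212, 2)
print(counts["val"])    # (150, 37, 2)
```

Keep the dropped samples in mind when dividing the number of correct predictions by the full split size: skipped samples make the reported accuracy slightly lower than the true value.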

## Define the Model
Now that we have our data, we can define our model - the final step before we train it. This is the most important part of the code! I recommend trying out different models, changing architectures, etc. to see what works best! This model is not good - it only achieves at best 50% accuracy. However, that is part of the fun! It is on you to see how you can design and train a model to do even better!

In [None]:
class SentimentRNN(nn.Module):
    def __init__(self, n_layers, vocab_size, output_dim, hidden_dim, embedding_dim, drop_prob=0.5):
        super(SentimentRNN,self).__init__()
 
        # Define Fully Connected Layer Hyperparameters
        self.output_dim = output_dim
        self.drop_prob = drop_prob
 
        # Define LSTM and Embedding Hyperparameters
        self.n_layers = n_layers
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
    
        # Define Embedding and LSTM layers
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        
        self.lstm = nn.LSTM(input_size=self.embedding_dim,hidden_size=self.hidden_dim,
                           num_layers=self.n_layers, batch_first=True)
        
        
        # Dropout Layer - Reduces Overfitting
        self.dropout = nn.Dropout(self.drop_prob)
    
        # Linear (Fully Connected) and Sigmoid Activation
        self.fc = nn.Linear(self.hidden_dim, self.output_dim)
        self.sig = nn.Sigmoid()
    
    
    # Required function for PyTorch Models
    #       Defines how inputs are passed layer by layer
    def forward(self,x):
        batch_size = x.size(0)
        
        # Initialize Hidden State
        hidden = self.init_hidden(batch_size)
        
        # embeddings and lstm_out
        embeds = self.embedding(x)  # shape: B x S x Feature   since batch_first = True
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim) 
        
        # dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)

        sig_out = sig_out[:, -1] # keep only the output at the last time step of each sequence
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
        
        
        
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        h0 = torch.zeros((self.n_layers,batch_size,self.hidden_dim)).to(device)
        c0 = torch.zeros((self.n_layers,batch_size,self.hidden_dim)).to(device)
        
        hidden = (h0.data, c0.data)
        return hidden

In [None]:
# Define Model Hyperparameters
n_layers = 2
vocab_size = len(vocab)
embedding_dim = 400
output_dim = 1
hidden_dim = 256

# Instantiate Model
model = SentimentRNN(n_layers, vocab_size, output_dim, hidden_dim, embedding_dim,drop_prob=0.5)

# Move model to GPU
model.to(device)

print(model)
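
One thing you might experiment with: because every sample is padded to the same length, the last time step of a short sample is a `<PAD>` token. The sketch below shows one possible way to pick the output at the last real token instead. It is not part of the starter model; the tensors are made up, and it assumes `<PAD>` has index 3 and only ever appears at the end of a sequence.

```python
import torch

PAD_IDX = 3  # assumed index of '<PAD>' for this sketch

# Three padded token sequences, shape (batch, seq_len)
tokens = torch.tensor([
    [1, 7, 8, 2, 3, 3],
    [1, 9, 2, 3, 3, 3],
    [1, 4, 5, 6, 7, 2],
])

# Stand-in for the per-time-step sigmoid outputs, shape (batch, seq_len)
step_outputs = torch.arange(18, dtype=torch.float32).view(3, 6)

# Number of non-padding tokens in each sequence
lengths = (tokens != PAD_IDX).sum(dim=1)

# Pick, for each row, the output at its last non-padding position
batch_indices = torch.arange(tokens.size(0))
last_valid = step_outputs[batch_indices, lengths - 1]

# For comparison: always taking the final position, padding included
last_position = step_outputs[:, -1]

print(lengths.tolist())        # [4, 3, 6]
print(last_valid.tolist())     # [3.0, 8.0, 17.0]
print(last_position.tolist())  # [5.0, 11.0, 17.0]
```

Whether this actually improves accuracy is something to test for yourself; packing the sequences or pooling over time steps are other directions to explore.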

## Train the Model
The final step is to train the model. I recommend playing around with the training hyperparameters to see what achieves the best result. Note that if you change the model class, you may also need to make small changes to the training loop.

In [None]:
# Define Training Hyperparameters
lr=0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
clip = 5
epochs = 100

# Accuracy Metric
def acc(pred,label):
    pred = torch.round(pred.squeeze())
    return torch.sum(pred == label.squeeze()).item()
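
For intuition, here is a plain-Python sketch of what the loss and the accuracy metric compute. The mean binary cross-entropy is -(1/N) * sum(y*log(p) + (1-y)*log(1-p)), which is what `nn.BCELoss` computes with its default settings. The helper names and the example numbers are made up for illustration.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

def count_correct(probs, labels):
    """Count predictions that match the label after thresholding at 0.5."""
    return sum(int((p > 0.5) == bool(y)) for p, y in zip(probs, labels))

probs = [0.9, 0.2, 0.6, 0.4]   # model outputs after the sigmoid
labels = [1, 0, 0, 1]          # true sentiments

print(round(binary_cross_entropy(probs, labels), 4))  # 0.5403
print(count_correct(probs, labels))                   # 2
```

Confident wrong predictions are punished hardest: the closer `p` gets to the wrong extreme, the faster the loss grows.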

In [None]:
# Use for keeping track of training history
valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0
epoch_tr_loss,epoch_vl_loss = [],[]
epoch_tr_acc,epoch_vl_acc = [],[]

# Load Previously Trained Model if Exists
if os.path.exists('./models/state_dict.pt'):
    print("Loading existing model...")
    # model.load_state_dict(torch.load('./models/state_dict.pt'))
    #       Note that the above line throws an error when you change hyperparameters
    #       For that reason, it is commented out for now.

for epoch in (range(epochs)):
    train_losses = []
    train_acc = 0.0
    model.train()

    for inputs, labels in tqdm(train_dl):
        
        inputs, labels = inputs.to(device), labels.to(device)

        # Clear gradients left over from the previous batch. The model creates a
        # fresh hidden state on every forward pass, so we never backprop through
        # earlier batches.
        model.zero_grad()
        output,h = model(inputs)
        
        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float().squeeze())
        loss.backward()
        train_losses.append(loss.item())
        
        # calculating accuracy
        accuracy = acc(output,labels)
        train_acc += accuracy
        
        #`clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
 

    # Evaluate on Validation Dataset
    val_losses = []
    val_acc = 0.0
    model.eval()
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for inputs, labels in tqdm(val_dl):
            inputs, labels = inputs.to(device), labels.to(device)

            # forward() builds its own hidden state, so only the inputs are passed
            output, _ = model(inputs)
            val_loss = criterion(output.squeeze(), labels.float().squeeze())

            val_losses.append(val_loss.item())
            
            accuracy = acc(output,labels)
            val_acc += accuracy
       
    # Calculate and Save Training Statistics     
    epoch_train_loss = np.mean(train_losses)
    epoch_val_loss = np.mean(val_losses)
    epoch_train_acc = train_acc/len(train_dl.dataset)
    epoch_val_acc = val_acc/len(val_dl.dataset)

    epoch_tr_loss.append(epoch_train_loss)
    epoch_vl_loss.append(epoch_val_loss)
    epoch_tr_acc.append(epoch_train_acc)
    epoch_vl_acc.append(epoch_val_acc)
    
    # Print Training Metrics
    print(f'Epoch {epoch+1}/{epochs}') 
    print(f'train_loss : {epoch_train_loss} val_loss : {epoch_val_loss}')
    print(f'train_accuracy : {epoch_train_acc*100} val_accuracy : {epoch_val_acc*100}')
    
    # Save Model if this is the best model yet
    if epoch_val_loss <= valid_loss_min:
        # The models directory may not exist yet on the first save
        os.makedirs('./models', exist_ok=True)
        torch.save(model.state_dict(), './models/state_dict.pt')
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,epoch_val_loss))
        valid_loss_min = epoch_val_loss
    
    print(25*'==')
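
If you are curious what gradient clipping by norm does, here is a plain-Python sketch of the idea: compute one norm over all gradients and, if it exceeds the limit, scale every gradient by the same factor. This only illustrates the concept; the function name, the small `eps` term, and the example gradients are assumptions, and `clip_grad_norm_` works on real parameter tensors in place.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so their global L2 norm stays within max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Only ever shrink gradients, never enlarge them
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two "parameter" gradients with global norm sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before)           # 13.0
print(round(norm_after, 3))  # 5.0
```

Because every gradient is scaled by the same factor, the update keeps its direction and only its size is limited.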

In [None]:
# Plot Training/Validation Accuracy and Loss Per Epoch
fig = plt.figure(figsize = (20, 6))
plt.subplot(1, 2, 1)
plt.plot(epoch_tr_acc, label='Train Acc')
plt.plot(epoch_vl_acc, label='Validation Acc')
plt.title("Accuracy")
plt.legend()
plt.grid()
    
plt.subplot(1, 2, 2)
plt.plot(epoch_tr_loss, label='Train loss')
plt.plot(epoch_vl_loss, label='Validation loss')
plt.title("Loss")
plt.legend()
plt.grid()

plt.show()

## Inference On Test Dataset
We have now fully trained our model! The next task is to perform inference on the test dataset so we can save our predictions and submit them to Kaggle. The first step is to load the best model saved during training.

In [None]:
# Instantiate a model with the same parameters
inference_model = SentimentRNN(n_layers, vocab_size, output_dim, hidden_dim, embedding_dim, drop_prob=0.5)

# Load the saved weights and biases from file
# map_location lets weights saved on a GPU load on a CPU-only machine as well
inference_model.load_state_dict(torch.load('./models/state_dict.pt', map_location=device))

# Prep model for inference
inference_model.eval()
inference_model.to(device)

In [None]:
# Define text processing and model output interpreting
def inference(model, text):
    # Process input sentence into a batch containing one token sequence
    text_tensor = torch.tensor([START_IDX] + text_pipeline(text) + [END_IDX], dtype=torch.int32).to(device).unsqueeze(0)
    
    # Pass through model; forward() creates its own hidden state,
    # and gradients are not needed for inference
    with torch.no_grad():
        output, _ = model(text_tensor)
    prediction = output.item()
    
    # Return prediction
    return 1 if prediction > 0.5 else 0

In [None]:
# download test dataset if it doesn't exist locally
if os.path.exists(os.path.join(DATA_DIR, 'test.csv')):
    test_df = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'), index_col='id')
else:
    # Fall back to the hosted copy, using the same index column as the local file
    test_df = pd.read_csv("https://raw.githubusercontent.com/OSU-AIClub/Fall-2022/main/Kaggle%20Competition/data/test.csv", index_col='id')
test_df

In [None]:
# Perform Inference on Test Dataset 
predictions = [] # {id,prediction}

# Iterate over (id, text) rows and use the best saved model loaded above
for id_, text in tqdm(test_df.itertuples(index=True, name=None), total=len(test_df)):
    prediction = inference(inference_model, text)
    predictions.append({'id': id_, 'sentiment': prediction})


# Save to CSV file
preds = pd.DataFrame(predictions)
preds.to_csv('submission.csv', index=False)
preds

## Time to Submit!
We are now ready to submit our predictions. Upload your `submission.csv` file to Kaggle to do this!