# Sentiment Analysis Starter Code
Use this code as a template, starting place, or inspiration... whatever helps you get started!

## Imports
This starter code uses the following packages:
- `pandas`
- `NumPy`
- `PyTorch` (along with `torchtext` and `torchdata`)
- `nltk`
- `scikit-learn`
Be sure to install these using either `pip` or `conda`!

In [294]:
import pandas as pd
import os
import numpy as np
import nltk
nltk.download('punkt')
import torch

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\natha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Downloading Data
Visit [https://www.kaggle.com/competitions/osuaiclub-fall2022-nlp-challenge/data](https://www.kaggle.com/competitions/osuaiclub-fall2022-nlp-challenge/data) to download the dataset!

## Loading Data
We will use the `pandas` package to load our data. All the data is stored in a `.csv` file, which `pandas` can read directly into a dataframe.

In [295]:
DATA_DIR = './data/'

In [296]:
train_df = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'), index_col='id')
train_df

id,text,sentiment
0,"""2nd day on 5mg started to work with rock hard...",0
1,"""He pulled out, but he cummed a bit in me. I t...",0
2,""" I Ve had nothing but problems with the Kepp...",0
3,"""I had Crohn's with a resection 30 years ago a...",0
4,"""Have a little bit of a lingering cough from a...",0
...,...,...
999995,Awful awful awful. Length. Came to top of wa...,0
999996,VERY SMALL. COULDN'T USE AS GIFT AS INTENDED.,0
999997,"Very thin material. Good for summer, spring an...",0
999998,Says Navy. Looks black.,0


## Using Subset of Dataset for Quicker Experimentation
We recommend training on a small subset of the dataset while you are prototyping and trying to get your model to work.

In [297]:
# Calculate the size of the dataset
num_samples = len(train_df.index)

# Define how many samples we want in our smaller dataset
target_num_samples = 1000

# Calculate how many training samples we need to remove
n_remove = num_samples - target_num_samples

# Randomly choose the n_remove indices we will remove
drop_indices = np.random.choice(train_df.index, n_remove, replace=False)
train_df = train_df.drop(drop_indices)

# Show the remaining dataframe
train_df

id,text,sentiment
1988,"""I've been on the pill a couple months now and...",1
2062,"""I've had problems with sleep for around 7-8 y...",1
7912,"""Been taking Daily 4+ years living normal life...",1
8741,"""My mother is 88 and has had a problem with Am...",0
10041,"""Have used Flonase for the last year - I used ...",1
...,...,...
998544,Shoes are great!! Very comfortable.,1
998834,Love It ! Fits comfortable and perfect!,1
998906,Great cross training shoe. Absolutely love the...,1
999439,I love them! They are very soft and warm on t...,1


## Fix Class Imbalance in Dataset
This dataset heavily favors the `1` label, which represents positive sentiment, so there are significantly more positive training samples than negative ones.

In [298]:
train_df['sentiment'].value_counts()

1    653
0    347
Name: sentiment, dtype: int64

For simplicity, we will address this imbalance with [undersampling](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/) by reducing the number of positive sentiment samples in the dataset at random until it matches the number of negative sentiment samples.

In [299]:
# Define values for positive and negative sentiment
POSITIVE_SENTIMENT = 1
NEGATIVE_SENTIMENT = 0

# Count the number of positive and negative samples
num_pos_samples = train_df['sentiment'].value_counts()[POSITIVE_SENTIMENT] 
num_neg_samples = train_df['sentiment'].value_counts()[NEGATIVE_SENTIMENT]

# Calculate the number of positive samples we need to remove to have 
# the same number as negative samples 
num_pos_remove = num_pos_samples - num_neg_samples

In [300]:
# Split the dataset into dataframes of positive-only and negative-only samples
pos_df = train_df[train_df['sentiment'] == POSITIVE_SENTIMENT]
neg_df = train_df[train_df['sentiment'] == NEGATIVE_SENTIMENT]

# Randomly choose the positive dataframe indices to remove
pos_drop_indices = np.random.choice(pos_df.index, num_pos_remove, replace=False)

# Drop the selected samples from the positive dataframe to balance both sentiment values
pos_undersampled = pos_df.drop(pos_drop_indices)
pos_undersampled

id,text,sentiment
1988,"""I've been on the pill a couple months now and...",1
2062,"""I've had problems with sleep for around 7-8 y...",1
14619,"""I applied this cream twice daily, after clean...",1
17107,"""So far works better than anything else I've t...",1
30052,"""Hello fellow subscribers,\nI am a new member ...",1
...,...,...
995509,I like it,1
996264,"Exactly as expected, fast shipping",1
996609,Perfect for multiple holes,1
998834,Love It ! Fits comfortable and perfect!,1


In [301]:
# Combine the negative samples and the positive samples into one dataframe
balanced_train_df = pd.concat([neg_df, pos_undersampled])

# Check the counts to make sure the classes are now even
balanced_train_df['sentiment'].value_counts()

0    347
1    347
Name: sentiment, dtype: int64
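
The undersampling steps above can also be written more compactly. The following is a minimal sketch, assuming pandas 1.1 or newer (for `groupby(...).sample`). `toy_df` and its rows are made-up stand-ins for `train_df`, and `random_state=0` is only there to make the draw reproducible.

```python
import pandas as pd

# Toy data standing in for train_df (hypothetical values)
toy_df = pd.DataFrame({
    "text": ["good", "great", "fine", "love it", "bad", "awful"],
    "sentiment": [1, 1, 1, 1, 0, 0],
})

# Size of the smallest class
min_count = int(toy_df["sentiment"].value_counts().min())

# Draw min_count rows at random from each sentiment class
balanced_toy_df = toy_df.groupby("sentiment").sample(n=min_count, random_state=0)

print(balanced_toy_df["sentiment"].value_counts())
```

Each class ends up with as many rows as the smallest class had, which is the same result the step-by-step version produces.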

## Train, Val, Test Split

In [302]:
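from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
train_val_set, test_set = train_test_split(balanced_train_df, test_size=0.20)
# Take 12.5% of the remaining 80% (10% of the total) for validation,
# leaving 70% of the total for training
train_set, val_set = train_test_split(train_val_set, test_size=0.125)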

# Save these splits for later use
train_set.to_csv(os.path.join(DATA_DIR, 'train_set.csv'))
val_set.to_csv(os.path.join(DATA_DIR, 'val_set.csv'))
test_set.to_csv(os.path.join(DATA_DIR, 'test_set.csv'))
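
Because the second split is applied to what remains after the first, the two `test_size` values do not directly read as fractions of the whole dataset. A small sketch of the arithmetic, using only the two fractions from the code above (the variable names are illustrative):

```python
# Fractions passed to the two train_test_split calls above
test_fraction = 0.20
val_fraction_of_remainder = 0.125

# The second split only sees what is left after the test set is removed
remainder_fraction = 1 - test_fraction
val_fraction = remainder_fraction * val_fraction_of_remainder
train_fraction = remainder_fraction - val_fraction

print(f"train: {train_fraction:.0%}, val: {val_fraction:.0%}, test: {test_fraction:.0%}")
# prints "train: 70%, val: 10%, test: 20%"
```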

## Data Preprocessing
Now that we have created the training, validation, and test splits for our data, we can use techniques like tokenization to make the dataset easier for our model to process and train on. We will only be showing how to apply tokenization, but we encourage you to try other techniques!

We will be using the PyTorch `torchtext` library (together with `torchdata`) to achieve this.

### Creating a "Vocabulary"
Next, we need to create a "vocabulary" of all words in the dataset. In NLP, a vocabulary is the mapping of each word to a unique ID. We will represent words in numerical form for the model to be able to interpret them.

By creating this mapping, one can write a sentence with numbers. For instance, if the vocab is as follows:

```python
{
  "i": 0,
  "the": 1,
  "ate": 2,
  "pizza": 3
}
```

We can say "I ate the pizza" by saying `[0, 2, 1, 3]`.

This is an oversimplified explanation of encoding, but the general idea is the same.
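
To make the mapping concrete, here is a minimal sketch in plain Python. The `encode` and `decode` helpers are illustrative names, not part of `torchtext`, and the whitespace split is a stand-in for a real tokenizer.

```python
# Toy vocabulary from the example above
toy_vocab = {"i": 0, "the": 1, "ate": 2, "pizza": 3}

def encode(sentence, vocab_map):
    """Lowercase the sentence, split on whitespace, and map each word to its ID."""
    return [vocab_map[word] for word in sentence.lower().split()]

def decode(ids, vocab_map):
    """Invert the mapping to turn a list of IDs back into words."""
    id_to_word = {idx: word for word, idx in vocab_map.items()}
    return " ".join(id_to_word[idx] for idx in ids)

encoded = encode("I ate the pizza", toy_vocab)
print(encoded)                     # [0, 2, 1, 3]
print(decode(encoded, toy_vocab))  # i ate the pizza
```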


`<START>` and `<END>` mark the start and end of a sample, respectively. These tokens are placed at the beginning and end of each sample so the model can tell where it starts and stops.

`<UNK>` is the token used to represent any word not in our vocabulary. This is most useful when you want to limit the vocabulary size to speed up training, or when you run inference on text the model has never seen before.

`<PAD>` is the token used to fill shorter samples so that every sample in a batch has the same length.
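
As a rough illustration of how these special tokens behave, the sketch below wraps a sentence in `<START>`/`<END>` and falls back to `<UNK>` for unknown words. The vocabulary IDs and the `encode_with_specials` helper are hypothetical and chosen only for this example.

```python
# Hypothetical toy vocabulary that includes the special tokens
toy_vocab = {"<UNK>": 0, "<START>": 1, "<END>": 2, "i": 3, "the": 4, "ate": 5, "pizza": 6}

def encode_with_specials(sentence, vocab_map):
    """Encode a sentence, mapping unknown words to <UNK> and adding <START>/<END>."""
    unk_id = vocab_map["<UNK>"]
    ids = [vocab_map.get(word, unk_id) for word in sentence.lower().split()]
    return [vocab_map["<START>"]] + ids + [vocab_map["<END>"]]

# "burrito" is not in the vocabulary, so it becomes <UNK> (ID 0)
print(encode_with_specials("I ate the burrito", toy_vocab))  # [1, 3, 5, 4, 0, 2]
```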

In [303]:
from torchdata.datapipes.iter import FileOpener, IterableWrapper

def row_processer(row):
    return (row[1], row[2]) # [1]: text, [2]: sentiment

def build_datapipe(split):
    datapipe = IterableWrapper([os.path.join(DATA_DIR, f"{split}.csv")])
    datapipe = FileOpener(datapipe, mode='b')
    datapipe = datapipe.parse_csv(delimiter=",", skip_lines=1)
    datapipe = datapipe.shuffle()
    datapipe = datapipe.map(row_processer)
    
    return datapipe

In [304]:
train_dp = build_datapipe('train_set')
val_dp = build_datapipe('val_set')
test_dp = build_datapipe('test_set')

In [305]:
# Peek at the first sample of the training datapipe
for sample in train_dp:
    print(sample)
    break

TypeError: argument of type 'int' is not iterable
This exception is thrown by __iter__ of CSVParserIterDataPipe(fmtparams={'delimiter': ','}, source_datapipe=FileOpenerIterDataPipe)

## Build Data Processing Pipelines

In [None]:
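from collections import Counter, OrderedDict
from torchtext.data.utils import get_tokenizer
# Import under an alias so the factory function is not shadowed by our vocab object
from torchtext.vocab import vocab as build_vocab

tokenizer = get_tokenizer('basic_english')
counter = Counter()

# Length, in tokens, of the longest training sample
# (handy if you later want to cap or pad inputs to a fixed length)
MAX_INPUT_LEN = 0

# Count token frequencies over the training split only
for (text, sentiment) in train_dp:
    tokens = tokenizer(text)
    counter.update(tokens)
    MAX_INPUT_LEN = max(MAX_INPUT_LEN, len(tokens))

# Order tokens from most to least frequent so common words get small IDs
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vocab = build_vocab(ordered_dict, min_freq=1, specials=('<UNK>', '<START>', '<END>', '<PAD>'))

# Map any out-of-vocabulary token to <UNK> instead of raising an error
vocab.set_default_index(vocab['<UNK>'])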

In [None]:
[vocab[token] for token in "this is an example".split()]

In [None]:
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x)

In [None]:
text_pipeline('here is an example')

In [None]:
label_pipeline('1')

## Generate DataLoader Object

In [None]:
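from torch import nn
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def pad_tensor(t, target_len):
    # Right-pad a 1-D tensor of token IDs with <PAD> up to target_len
    padding = target_len - t.size(0)
    return nn.functional.pad(t, (0, padding), value=vocab['<PAD>'])

def collate_batch(batch):
    label_list, text_list = [], []

    for (_text, _label) in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))
    # BCEWithLogitsLoss expects float targets
    label_list = torch.tensor(label_list, dtype=torch.float32)

    # Pad every sample to the length of the longest sample in this batch
    target_len = max(t.size(0) for t in text_list)
    # Stack to [batch size, sent len], then transpose to [sent len, batch size],
    # which is the layout nn.RNN expects by default
    text_list = torch.stack([pad_tensor(t, target_len) for t in text_list]).t()

    return label_list.to(device), text_list.to(device)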
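
Padding is what lets samples of different lengths be stacked into a single batch. The idea can be sketched in plain Python, independent of PyTorch. `PAD_ID` and `pad_batch` are hypothetical names used only for this illustration, and the token IDs are made up.

```python
PAD_ID = 3  # hypothetical ID for the <PAD> token

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    target_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (target_len - len(seq)) for seq in sequences]

batch = [[5, 8, 2], [7], [4, 9]]
print(pad_batch(batch, PAD_ID))  # [[5, 8, 2], [7, 3, 3], [4, 9, 3]]
```

After padding, every row has the same length, so the batch can be turned into one rectangular tensor.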

## Define the Model
Now we can create a model and train it!

In [None]:
from torch import nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        # text = [sent len, batch size]

        embedded = self.embedding(text)
        # embedded = [sent len, batch size, emb dim]

        output, hidden = self.rnn(embedded)
        # output = [sent len, batch size, hid dim]
        # hidden = [1, batch size, hid dim]

        # For a single-layer, unidirectional RNN, the last output equals the final hidden state
        assert torch.equal(output[-1, :, :], hidden.squeeze(0))

        return self.fc(hidden.squeeze(0))

In [None]:
OUTPUT_DIM = 1
INPUT_DIM = len(vocab)
EMBEDDING_DIM = 400
HIDDEN_DIM = 256

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM).to(device)
model

## Train the Model
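The training code below follows the [Simple Sentiment Analysis tutorial](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb) from bentrevett's `pytorch-sentiment-analysis` repository.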

In [None]:
import torch.optim as optim

BATCH_SIZE = 64

train_dl = DataLoader(train_dp, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
val_dl = DataLoader(val_dp, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
test_dl = DataLoader(test_dp, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)


optimizer = optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss().to(device)



def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_acc = 0
    # Count batches ourselves: len() is not guaranteed to work on a
    # DataLoader that wraps an iterable datapipe
    num_batches = 0

    model.train()

    for (labels, text) in iterator:

        optimizer.zero_grad()

        predictions = model(text).squeeze(1)

        loss = criterion(predictions, labels)

        acc = binary_accuracy(predictions, labels)

        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
        num_batches += 1

    return epoch_loss / num_batches, epoch_acc / num_batches



def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_acc = 0
    # Count batches ourselves: len() is not guaranteed to work on a
    # DataLoader that wraps an iterable datapipe
    num_batches = 0

    model.eval()

    # No gradients are needed for evaluation
    with torch.no_grad():

        for (labels, text) in iterator:

            predictions = model(text).squeeze(1)

            loss = criterion(predictions, labels)

            acc = binary_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            num_batches += 1

    return epoch_loss / num_batches, epoch_acc / num_batches
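
To see what `binary_accuracy` computes, here is a plain-Python sketch of the same idea: squash each raw model output (logit) with a sigmoid, threshold at 0.5, and compare against the labels. The helper names and the example numbers are made up for illustration.

```python
import math

def sigmoid(x):
    """Map a raw logit to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))

def binary_accuracy_py(logits, labels):
    """Fraction of samples whose thresholded prediction matches the label."""
    # A probability above 0.5 counts as a positive prediction
    preds = [1 if sigmoid(z) > 0.5 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]   # hypothetical raw model outputs
labels = [1, 0, 0, 0]
print(binary_accuracy_py(logits, labels))  # 0.75
```

Three of the four predictions match their labels, so the batch accuracy is 0.75.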

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_dl, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, val_dl, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

In [None]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_dl, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')