<a href="https://colab.research.google.com/github/Marie000/Sentiment_Classifier/blob/main/Sentiment_Classifier_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import pandas as pd
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from sklearn.model_selection import train_test_split

tokenizer = get_tokenizer("basic_english")

In [None]:
BATCH_SIZE = 32
device = "cuda" if torch.cuda.is_available() else "cpu"

## Import Data

In [None]:
df = pd.read_csv("drive/MyDrive/pytorch datasets/IMDB-Dataset.csv")

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df["sentiment"].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [None]:
X_train, X_test_val, y_train, y_test_val = train_test_split(df["review"], df["sentiment"], test_size=0.2)

In [None]:
print(len(X_train), len(X_test_val))

40000 10000


In [None]:
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, test_size=0.5)

In [None]:
print(len(X_train), len(X_test), len(X_val))

40000 5000 5000


In [None]:
def yield_tokens(data_iter):
  for text in data_iter:
    yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(X_train), specials=["<unk>", "<pad>"], max_tokens=20000)

In [None]:
vocab_size = len(vocab)
print(vocab_size)

20000


In [None]:
vocab.set_default_index(vocab["<unk>"])

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))

In [None]:
print(text_pipeline("This movie is terrible!"))

[14, 20, 10, 384, 36]


In [None]:
# X_train = X_train.apply(text_pipeline)
# X_val = X_val.apply(text_pipeline)
# X_test = X_test.apply(text_pipeline)

In [None]:
label_values = ['negative', 'positive']
label_pipeline = lambda x: label_values.index(x)

In [None]:
# y_train = y_train.apply(label_pipeline)
# y_val = y_val.apply(label_pipeline)
# y_test = y_test.apply(label_pipeline)

In [None]:
train_data = list(zip(X_train, y_train))
val_data = list(zip(X_val, y_val))
test_data = list(zip(X_test, y_test))

In [None]:
train_data[0]

('Director Kinka Usher stays true to his own credo, "Play it straight and they will laugh," and with the help of a superb cast has crafted what should become the #1 cult film of all time, `Mystery Men.\' When an evil villain, Casanova Frankenstein (Geoffrey Rush) is released from a mental institution, captures the local superhero, Captain Amazing (Greg Kinnear), and threatens to take over Champion City, three wanna-be superheroes, Mr. Furious (Ben Stiller), The Shoveler (William H. Macy) and The Blue Raja (Hank Azaria) come to the rescue. Frankenstein has been joined by a myriad assortment of underworld scum, however, and has become a formidable opponent. The trio realize that help is needed, and decide to recruit; what they end up with is nothing less than the most unforgettable team of `superheroes\' ever assembled in the history of the cinema. Mr. Furious has his rage; The Shoveler, his shovel; The Blue Raja flings silverware (mainly forks, and the occasional spoon, but never a knif

## Create Datasets and DataLoaders

I create the train, validation, and test sets as separate datasets rather than as subsets of a single dataset, because that makes it easier to build the vocabulary from the training data alone.
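
To illustrate what fitting the vocabulary on the training split means in practice, here is a minimal sketch in plain Python (no torchtext). The helpers `build_vocab` and `encode` and the toy sentences are made up for this example; the notebook itself uses `build_vocab_from_iterator`. A word that only shows up outside the training split falls back to the `<unk>` index, much as `vocab.set_default_index` arranges above.

```python
from collections import Counter

def build_vocab(texts, specials=("<unk>", "<pad>"), max_tokens=None):
    """Map tokens to ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    itos = list(specials)
    room = None if max_tokens is None else max_tokens - len(specials)
    itos += [tok for tok, _ in counts.most_common(room)]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(text, vocab):
    """Convert text to ids; tokens missing from the vocabulary map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in text.lower().split()]

train_texts = ["a great movie", "a terrible movie"]  # toy training split
vocab = build_vocab(train_texts)

# "boring" never occurred in the training split, so it maps to <unk> (index 0).
ids = encode("a boring movie", vocab)
print(ids)  # [2, 0, 3]
```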

In [None]:
# class SentimentDataset(Dataset):

#   def __init__(self, X, y):
#     self.dataset = torch.tensor(X['x'])
#     self.labels = torch.tensor(y.reshape(-1)).long()

#   def __len__(self):
#     return len(self.dataset)

#   def __getitem__(self, idx):
#     return self.dataset[idx], self.labels[idx]

In [None]:
# train_data = SentimentDataset(X_train, y_train)
# test_data = SentimentDataset(X_test, y_test)
# val_data = SentimentDataset(X_val, y_val)

In [None]:
padding_index = vocab["<pad>"]
print(padding_index)

1


In [None]:
def collate_with_padding(data):
  # Turn a list of (review, sentiment) pairs into a padded id tensor and a label tensor.
  text_list, label_list = [], []
  for text, label in data:
    label_list.append(label_pipeline(label))
    text = torch.tensor(text_pipeline(text), dtype=torch.int64)
    text_list.append(text)
  # Pad every review to the length of the longest review in this batch.
  text_list = pad_sequence(text_list, batch_first=True, padding_value=padding_index)
  label_list = torch.tensor(label_list, dtype=torch.int64, device=device)

  return text_list, label_list
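
The collate function pads each batch only up to its own longest review, so the width of the text tensor differs from batch to batch. A rough plain-Python stand-in for what `pad_sequence(..., batch_first=True)` does to the id lists is sketched below. `pad_batch` is a hypothetical helper and the ids are made up; 1 is used as the padding index, as printed above.

```python
PAD = 1  # padding index, assumed to match vocab["<pad>"] in the notebook

def pad_batch(sequences, padding_value=PAD):
    """Right-pad every list of token ids to the longest length in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (longest - len(seq)) for seq in sequences]

batch = pad_batch([[14, 20, 10, 384, 36], [14, 20], [7]])
for row in batch:
    print(row)
```

Every row ends up five ids long here because the longest review in this toy batch has five tokens.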

In [None]:
train_dataloader = DataLoader(
    train_data,
    collate_fn = collate_with_padding,
    batch_size = BATCH_SIZE,
    drop_last = True
)
test_dataloader = DataLoader(
    test_data,
    collate_fn = collate_with_padding,
    batch_size = BATCH_SIZE,
    drop_last = True
)
val_dataloader = DataLoader(
    val_data,
    collate_fn = collate_with_padding,
    batch_size = BATCH_SIZE,
    drop_last = True
)

## Model

In [None]:
class RNNModel(nn.Module):
  def __init__(self, input_dim, embedding_dim, hidden_dim,num_layers=2, output_dim=2):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.num_layers = num_layers
    self.embedding = nn.Embedding(input_dim, embedding_dim)
    self.rnn = nn.LSTM(embedding_dim, hidden_dim,num_layers=num_layers, dropout=0.2)
    self.fc = nn.Linear(hidden_dim*num_layers, output_dim)
    self.init_weights()

  def init_weights(self):
    self.embedding.weight.data.uniform_(-0.5, 0.5)
    #self.rnn.weight.data.uniform_(-0.5, 0.5)
    self.fc.weight.data.uniform_(-0.5, 0.5)

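  def forward(self, x):
    # x: (batch, seq_len) -> (seq_len, batch), the layout nn.LSTM expects by default
    x = x.permute(1,0)
    emb = self.embedding(x)
    # output is unused because this is a many-to-one RNN
    output, (hidden, cell) = self.rnn(emb)
    # hidden: (num_layers, batch, hidden_dim) -> (batch, num_layers * hidden_dim)
    # (no squeeze here: it did nothing for num_layers=2 and scrambled the batch for num_layers=1)
    hidden = hidden.transpose(0,1)
    hidden = hidden.reshape(-1, self.hidden_dim*self.num_layers)
    out = self.fc(hidden)
    return out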
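
One thing to be aware of: because batches are right-padded, the final hidden state that `forward` reads for a short review has been computed after stepping through `<pad>` tokens. A possible refinement, not used in this notebook, is to pack the batch with `pack_padded_sequence` so the LSTM stops at each review's true length. The sketch below uses tiny made-up dimensions and token ids, and assumes the padding index is 1 as in the notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

PAD = 1                      # assumed padding index
HIDDEN, LAYERS = 3, 2        # toy sizes, not the notebook's
embedding = nn.Embedding(10, 4, padding_idx=PAD)
lstm = nn.LSTM(4, HIDDEN, num_layers=LAYERS)

# Two made-up token-id sequences of different lengths.
seqs = [torch.tensor([5, 6, 7, 8]), torch.tensor([2, 3])]
lengths = torch.tensor([len(s) for s in seqs])  # must live on the CPU
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD)  # (2, 4)

emb = embedding(batch.permute(1, 0))            # (seq_len, batch, emb)
packed = pack_padded_sequence(emb, lengths, enforce_sorted=False)
_, (hidden, _) = lstm(packed)                   # (num_layers, batch, hidden)

# Same flattening as in RNNModel.forward.
features = hidden.transpose(0, 1).reshape(-1, HIDDEN * LAYERS)
print(features.shape)
```

With packing, the hidden state of the short sequence matches what the LSTM would produce on that sequence alone, without any padding steps.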

In [None]:
EMBED_SIZE = 64
HIDDEN_DIM = 32
NUM_LAYERS=2

model = RNNModel(vocab_size, EMBED_SIZE, HIDDEN_DIM, num_layers=NUM_LAYERS)

In [None]:
def acc_fn(y_pred, y_true):
  correct = torch.eq(y_true, y_pred).sum().item()
  acc = (correct / len(y_pred)) * 100
  return acc

In [None]:
train_loss_values = []
train_acc_values = []
test_loss_values = []
test_acc_values = []

In [None]:
def train(model, dataloader, loss_fn, optimizer):
  model.train()
  model.to(device)
  train_acc, train_loss = 0, 0
  for text, label in dataloader:
    optimizer.zero_grad()
    label, text = label.to(device), text.to(device)
    y_pred = model(text)
    loss = loss_fn(y_pred, label.squeeze())
    # .item() keeps the running total a plain float with no reference to the autograd graph
    train_loss += loss.item()
    train_acc += acc_fn(y_pred.argmax(dim=1), label)

    loss.backward()
    optimizer.step()

  # Average over batches; all batches have the same size because drop_last=True
  train_loss /= len(dataloader)
  train_acc /= len(dataloader)
  train_loss_values.append(train_loss)
  train_acc_values.append(train_acc)
  print(f'Train Loss: {train_loss}, Accuracy: {train_acc}')
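
`train` averages the per-batch accuracies, which equals the overall accuracy only when every batch has the same size; `drop_last = True` in the DataLoaders guarantees that. The small sketch below uses made-up predictions and a hypothetical helper `batch_accuracy` that mirrors `acc_fn` on plain lists. It shows how a smaller final batch would skew the average.

```python
def batch_accuracy(preds, labels):
    """Percentage of matching predictions in one batch (mirrors acc_fn)."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(preds) * 100

def mean_batch_accuracy(batches):
    """Average of per-batch accuracies, as computed in train()."""
    return sum(batch_accuracy(p, y) for p, y in batches) / len(batches)

def overall_accuracy(batches):
    """Accuracy over all samples, regardless of batch boundaries."""
    correct = sum(p == y for preds, labels in batches for p, y in zip(preds, labels))
    total = sum(len(preds) for preds, _ in batches)
    return correct / total * 100

# Each batch is a (predictions, labels) pair.
equal = [([1, 0, 1, 1], [1, 0, 0, 1]), ([0, 0, 1, 1], [0, 1, 1, 0])]
ragged = [([1, 0, 1, 1], [1, 0, 0, 1]), ([0], [0])]

print(mean_batch_accuracy(equal), overall_accuracy(equal))    # the two agree
print(mean_batch_accuracy(ragged), overall_accuracy(ragged))  # the two differ
```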

In [None]:
# Note: this name shadows Python's built-in eval()
def eval(model, dataloader, loss_fn):
  model.eval()

  eval_acc, eval_loss = 0,0
  with torch.inference_mode():
    for text, label in dataloader:
      label, text = label.to(device), text.to(device)
      y_pred = model(text)
      loss = loss_fn(y_pred, label.squeeze())
      # .item() stores a plain float instead of a tensor that may live on the GPU
      eval_loss += loss.item()
      eval_acc += acc_fn(y_pred.argmax(dim=1), label)

    eval_loss /= len(dataloader)
    eval_acc /= len(dataloader)
    test_loss_values.append(eval_loss)
    test_acc_values.append(eval_acc)
    print(f'Test Loss: {eval_loss}, Accuracy: {eval_acc}')

In [None]:
epochs = 15
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
for epoch in range(epochs):
  print(f'Epoch: {epoch}\n------------------')
  train(model, train_dataloader, loss_fn, optimizer)
  eval(model, val_dataloader, loss_fn)

Epoch: 0
------------------
Train Loss: 0.6953838467597961, Accuracy: 50.345
Test Loss: 0.692063570022583, Accuracy: 51.34214743589744
Epoch: 1
------------------
Train Loss: 0.681847095489502, Accuracy: 53.98
Test Loss: 0.6259027719497681, Accuracy: 65.98557692307692
Epoch: 2
------------------
Train Loss: 0.5081482529640198, Accuracy: 75.865
Test Loss: 0.43813061714172363, Accuracy: 79.84775641025641
Epoch: 3
------------------
Train Loss: 0.3842177391052246, Accuracy: 83.545
Test Loss: 0.4170593023300171, Accuracy: 82.17147435897436
Epoch: 4
------------------
Train Loss: 0.3349791467189789, Accuracy: 86.4075
Test Loss: 0.4025084972381592, Accuracy: 83.41346153846153
Epoch: 5
------------------
Train Loss: 0.30929380655288696, Accuracy: 87.6425
Test Loss: 0.3863655924797058, Accuracy: 84.13461538461539
Epoch: 6
------------------
Train Loss: 0.2924521267414093, Accuracy: 88.4425
Test Loss: 0.37646836042404175, Accuracy: 84.23477564102564
Epoch: 7
------------------
Train Loss: 0.280

In [None]:
eval(model, test_dataloader, loss_fn)

Test Loss: 0.40897777676582336, Accuracy: 85.71714743589743


In [None]:
torch.save(model.state_dict(), "drive/MyDrive/pytorch datasets/model_0.pt")

In [None]:
loaded_model = RNNModel(vocab_size, EMBED_SIZE, HIDDEN_DIM, num_layers=NUM_LAYERS)
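# map_location lets a checkpoint saved on a GPU load on a CPU-only machine as well
loaded_model.load_state_dict(torch.load("drive/MyDrive/pytorch datasets/model_0.pt", map_location=device))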
loaded_model.to(device)

RNNModel(
  (embedding): Embedding(20000, 64)
  (rnn): LSTM(64, 32, num_layers=2, dropout=0.2)
  (fc): Linear(in_features=64, out_features=2, bias=True)
)

In [None]:
eval(loaded_model, test_dataloader, loss_fn)

Test Loss: 0.40897777676582336, Accuracy: 85.71714743589743


In [None]:
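def predict(text):
  loaded_model.eval()  # make sure dropout is disabled at inference time
  with torch.no_grad():
    text = torch.tensor(text_pipeline(text), dtype=torch.int64, device=device)
    text = torch.unsqueeze(text, 0)  # add a batch dimension: (seq_len,) -> (1, seq_len)
    output = loaded_model(text)
    return label_values[output.argmax(1).item()]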
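
`predict` returns only the arg-max label, which hides how close the two scores were. As a sketch, the two logits could be turned into probabilities with a softmax (in PyTorch, `torch.softmax(output, dim=1)`). The plain-Python version below uses made-up logits, and `softmax` and `predict_with_confidence` are hypothetical helpers that are not part of the notebook.

```python
import math

label_values = ["negative", "positive"]  # same order as in the notebook

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return the winning label together with its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_values[best], probs[best]

# Made-up logits for a borderline review: the label is "negative",
# but the probability is only slightly above one half.
label, confidence = predict_with_confidence([0.3, 0.1])
print(label, round(confidence, 3))
```

A probability near 0.5 would flag a review the model is unsure about, which a bare label cannot show.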

In [None]:
predict("this movie is pretty good")

'positive'

In [None]:
predict("Better emotional beats and action than the first film. Less to offer in terms of world building which I preferred in RM pt.1. Overall I thought RM pt. 2 was about average but it does have something to say about war, the scars it causes & whether people can heal from that. Both parts work better when watched together. ")

'negative'

In [None]:
predict(" fell asleep the first hour of it. The rest was just so so bad. Why doesn't anyone just tell Zach No. Someone please make it stop. Unwatchable garbage. Some of the stunts were ok but nothing would ever make me sit through it again. I know there are people who love Z snyders work but I am no longer one of them. I love the 300, BvS, watchmen..but this nah. ")

'negative'

In [None]:
test_example = test_data[128]
print(test_example[0])
print(test_example[1])
predict(test_example[0])

I see this movie as a poor tribute to the old slasher movies. Because it really doesn't hold a candle to the 70's and 80's gold-era of horror, this is of course where personal taste comes in.<br /><br />This movie just falls into the category of "New generation of slashers" in my book, the cast is the typical ones 18-24 years and potential models. I'm personally quite tired of that image in horror movies, the old movies at least had some variation in people. One or more fat people, and dorks in general. Just plain looking persons, of course having a couple of good lookers is fine they always been there. But when the entire cast is just a bunch of nice racks and butts it's getting silly. I mean, OK yeah i like to watch HOT chicks. But not in a horror that is supposed to reflect some ordinary people getting hunted down by for example a knife-wielding maniac... You expect the people being hunted to look something like any random person you see on the street. I think. There are of course a

'positive'