## Assignment 2.2: Text classification via CNN (50 points)

In this assignment you will perform sentiment analysis of IMDB reviews using a CNN architecture. Carefully read [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf) by Yoon Kim.

In [1]:
#!pip install torch==1.6.0
#!pip install torchtext==0.7
#!pip install numpy
#!pip install pandas

In [2]:
#!pip install torch==1.11.0
#!pip install torchtext==0.12.0

In [3]:
# The torchtext.legacy imports below require torchtext 0.9.0 through 0.11.x
#!pip install -U torch==1.8.0 torchtext==0.9.0

In [4]:
import torchtext
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

In [5]:
from torchtext.legacy.data import Field, LabelField
from torchtext.legacy.data import BucketIterator

In [6]:
import numpy as np
import torch

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchtext.legacy.data import BucketIterator
from torchtext.legacy.data import Iterator

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.set(style='darkgrid', font_scale=1.3)

### Preparing Data

In [7]:
TEXT = Field(sequential=True, lower=True, batch_first=True)
LABEL = LabelField(batch_first=True)

In [8]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()

In [9]:
# %%time
TEXT.build_vocab(trn)

In [10]:
LABEL.build_vocab(trn)

### Creating the Iterator


Define an iterator here

In [11]:
train_iter, val_iter, test_iter = BucketIterator.splits((trn, vld, tst), 
                                                  batch_sizes=(64,64,64), 
                                                  sort_key=lambda x: len(x.text), 
                                                  device='cuda', 
                                                  sort=True,
                                                  sort_within_batch=True,
                                                  repeat=False
                                                 )
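
To see why sorting by `len(x.text)` matters, the sketch below counts how many pad tokens are needed when batches are cut from sequences in arbitrary order versus length-sorted order. It is a plain-Python illustration of the idea only, not torchtext's implementation; `padding_needed` is a hypothetical helper and the lengths are made up.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Made-up review lengths (in tokens), for illustration only.
lengths = [12, 250, 31, 198, 15, 240, 28, 205]

unsorted_pad = padding_needed(lengths, batch_size=4)
sorted_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, sorted_pad)  # 981 145
```

Grouping sequences of similar length wastes far less computation on padding, which is the motivation for using a bucketing iterator with a length-based `sort_key`.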

### Define CNN-based text classification model (20 points)

In [12]:
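class CNN(nn.Module):
    def __init__(self, V, D, kernel_sizes, out_channels, dropout=0.5):
        super(CNN, self).__init__()
        
        # V: vocabulary size, D: embedding dimension
        self.emb = nn.Embedding(V, D)
        # One 1D convolution per kernel size, applied in parallel
        self.convs = nn.ModuleList([nn.Conv1d(D, out_channels, K) 
            for K in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        # A single output logit for binary classification
        self.linear = nn.Linear(out_channels * len(kernel_sizes), 1)
        
    def forward(self, x):
        x = self.emb(x)                     # (batch, seq_len, D)
        x = x.permute(0, 2, 1)              # (batch, D, seq_len), as Conv1d expects
        x = [F.relu(conv(x)) for conv in self.convs]
        # Max-over-time pooling: keep the strongest response of each filter
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]
        x = torch.cat(x, 1)                 # (batch, out_channels * len(kernel_sizes))
        x = self.dropout(x)
        logit = self.linear(x).reshape(-1)  # (batch,)
        return logit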

In [13]:
kernel_sizes = [3,4,5]
vocab_size = len(TEXT.vocab)
dropout = 0.5
dim = 300
out_channels = 100

model = CNN(vocab_size, dim, kernel_sizes, out_channels, dropout)
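
As a rough check on the tensor shapes flowing through this model, the sketch below works out the output length of each convolution and the size of the concatenated feature vector for the hyperparameters above. It assumes stride 1 and no padding (the `nn.Conv1d` defaults); `conv_output_length` is a hypothetical helper and `seq_len = 40` is an arbitrary example value.

```python
def conv_output_length(seq_len, kernel_size):
    """Output length of a 1D convolution with stride 1 and no padding."""
    if seq_len < kernel_size:
        raise ValueError("sequence is shorter than the kernel")
    return seq_len - kernel_size + 1

kernel_sizes = [3, 4, 5]
out_channels = 100
seq_len = 40  # arbitrary padded batch length, for illustration

conv_lengths = [conv_output_length(seq_len, k) for k in kernel_sizes]
# Max-over-time pooling reduces each length to 1, so only channels remain.
feature_dim = out_channels * len(kernel_sizes)

print(conv_lengths)  # [38, 37, 36]
print(feature_dim)   # 300
```

One consequence worth keeping in mind: a batch whose padded length is shorter than the largest kernel (5 here) cannot be convolved without extra padding.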

In [None]:
#model.cuda()
model.cuda('cuda:0')

### The training loop (10 points)

Define the optimizer and the loss function.

In [None]:
learning_rate = 1e-3
opt = optim.Adam(model.parameters(), lr=learning_rate)
loss_func = nn.BCEWithLogitsLoss()
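
`nn.BCEWithLogitsLoss` takes raw logits and applies the sigmoid internally, which is why the model returns logits rather than probabilities. The sketch below illustrates, in plain Python, one common numerically stable way to compute binary cross-entropy from a logit and compares it with the naive sigmoid-then-log form. The function names are hypothetical and this is not PyTorch's actual code.

```python
import math

def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Sigmoid followed by log; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

stable = bce_with_logits(2.0, 1.0)
naive = naive_bce(2.0, 1.0)
print(abs(stable - naive) < 1e-9)  # the two forms agree for moderate logits

# For an extreme logit the stable form still returns a finite value,
# whereas the naive form would attempt log(0).
extreme = bce_with_logits(1000.0, 0.0)
print(extreme)
```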

Think carefully about the stopping criteria. 
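
One possible stopping criterion is early stopping on the validation loss: stop once it has failed to improve for a fixed number of epochs. The sketch below is a minimal illustration under that assumption; the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are all hypothetical and not part of the assignment code.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_losses = [0.60, 0.45, 0.47, 0.50, 0.44]  # made-up values
stopped_at = None
for epoch, v in enumerate(val_losses, start=1):
    if stopper.step(v):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 4 0.45
```

In a real training loop, `step` would be called once per epoch with the validation loss, and the weights from the best epoch would typically be kept.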

In [None]:
epochs = 30

In [None]:
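def train_loop():
    """Train for `epochs` epochs and record the mean per-batch losses."""
    for epoch in range(1, epochs + 1):
        running_loss = 0.0
        model.train()
        for batch in train_iter:
            x = batch.text
            # BCEWithLogitsLoss expects float targets; LabelField yields integer labels
            y = batch.label.float()

            opt.zero_grad()
            preds = model(x)
            loss = loss_func(preds, y)
            loss.backward()
            opt.step()
            running_loss += loss.item()

        # loss.item() is already a batch mean, so average over batches, not examples
        epoch_loss = running_loss / len(train_iter)

        val_loss = 0.0
        model.eval()
        with torch.no_grad():  # no gradients are needed for validation
            for batch in val_iter:
                x = batch.text
                y = batch.label.float()

                preds = model(x)
                val_loss += loss_func(preds, y).item()

        val_loss /= len(val_iter)

        # These lists are created in the next cell, before train_loop() is called
        train_loss_history.append(epoch_loss)
        valid_loss_history.append(val_loss)

        print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))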

In [None]:
%%time

train_loss_history = []
valid_loss_history = []

train_loop()

In [None]:
plt.figure(figsize=(15, 8))
plt.plot(np.arange(epochs) + 1, train_loss_history, label='train')
plt.plot(np.arange(epochs) + 1, valid_loss_history, label='valid')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()

In [None]:
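# Assumes a checkpoint was saved earlier, e.g. with torch.save(model.state_dict(), 'cnn_base_5.pt')
model.load_state_dict(torch.load('cnn_base_5.pt'))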

### Calculate performance of the trained model (10 points)

In [None]:
for batch in test_iter:
    x = batch.text
    y = batch.label

Write down the calculated performance

### Accuracy:
### Precision:
### Recall:
### F1:
# See below

In [None]:
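y_pred = []
y_true = []
model.eval()

with torch.no_grad():  # inference only, no gradients needed
    for batch in test_iter:
        x = batch.text
        y = batch.label

        # Threshold the predicted probability at 0.5 to get 0/1 class labels
        y_pred += (torch.sigmoid(model(x)) >= 0.5).long().cpu().tolist()
        y_true += y.cpu().tolist()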

In [None]:
print('Accuracy: ', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall:   ', recall_score(y_true, y_pred))
print('F1:       ', f1_score(y_true, y_pred))
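
For reference, the four metrics can be derived from the confusion-matrix counts. The sketch below is a plain-Python illustration with made-up labels; `binary_metrics` is a hypothetical helper, and class 1 is treated as the positive class, matching the scikit-learn default for `precision_score`, `recall_score`, and `f1_score`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels, for illustration only.
example_true = [1, 1, 1, 0, 0, 0, 1, 0]
example_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(example_true, example_pred)
print(metrics)
```

Note that which review sentiment is mapped to class 1 depends on how `LABEL.build_vocab` ordered the labels, so precision and recall should be read with that mapping in mind.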

### Experiments (10 points)

Experiment with the model to achieve better results. Implement and describe your experiments in detail, and mention what was helpful.

### 1. ?
### 2. ?
### 3. ?