<a href="https://colab.research.google.com/github/MariGaS/Aprendizaje_Maquina/blob/main/Sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to embeddings with sentiment analysis

## Getting Started

### Dataset and task

- [Twitter sentiment analysis](https://www.kaggle.com/c/twitter-sentiment-analysis2/overview) is an open-source dataset available on Kaggle. It contains about 100,000 tweets, each labeled as either negative (0) or positive (1).

- The task consists of writing a model that takes a tweet as input and outputs 1 if the sentiment is positive or 0 if it is negative.

### Import required libraries

In [None]:
# If the next code block gives you: ModuleNotFoundError: No module named 'torchtext.legacy'
# run the following:
!pip install -U torch==1.8.0 torchtext==0.9.0
exit()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1+cu113
    Uninstalling torch-1.12.1+cu113:
      Successfully uninstalled torch-1.12.1+cu113
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.13.1
    Uninstalling torchtext-0.13.1:
      Successfully uninstalled torchtext-0.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.1

In [None]:
from typing import Tuple, List
import time
import random
import os
import zipfile

import pandas
import numpy
import scipy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import functional as F
from torchtext.legacy.data import Field, TabularDataset, Iterator
from google_drive_downloader import GoogleDriveDownloader
import spacy

spacy_en = spacy.load('en_core_web_sm')

### Define some constants

In [None]:
class Constants:
    
    DATA_FILE_ID = '1wrfQmCShiTmbIsr7LpZhEiYw7dhuaOhk'                     # Google drive id to be able to download from drive
    
    SEED = 1                                                               # random seed for reproducibility
    
    DATA_DIR = 'data/twitter/'                                             # path to the csv data
    DATA_ZIP_FILE = f'{DATA_DIR}data.zip'                                  # path where to download the zipped data
    DATA_PATH = '{}data.csv'.format(DATA_DIR)                              # path to the full csv data
    TRAIN_PATH = '{}train.csv'.format(DATA_DIR)
    VALID_PATH = '{}valid.csv'.format(DATA_DIR)
    TEST_PATH = '{}test.csv'.format(DATA_DIR)
    
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # set device to GPU if available

constants = Constants

### Fix random seed for reproducibility

In [None]:
numpy.random.seed(constants.SEED)
random.seed(constants.SEED)
torch.manual_seed(constants.SEED)
torch.backends.cudnn.deterministic = True

### Download the data on your local server

In [None]:
GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.DATA_FILE_ID, dest_path=constants.DATA_ZIP_FILE, unzip=False)

zip_ref = zipfile.ZipFile(constants.DATA_ZIP_FILE, 'r')
zip_ref.extractall(constants.DATA_DIR)
zip_ref.close()

os.rename(f'{constants.DATA_DIR}train.csv', f'{constants.DATA_DIR}data.csv')
!rm data/twitter/test.csv

!ls data/twitter

### Visualize the data with `pandas.DataFrame`

In [None]:
data = pandas.read_csv(constants.DATA_PATH, encoding="ISO-8859-1") # weird encoding: https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python
data.head()
# data.iloc[4]

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


## Methodology

- Validate the data (number of examples, number of features, label distribution, number of `nan`, etc)
- Choose a good metric that you will use for deciding the best model
- Split the data into train/valid/test
- Implement the simplest classifier and evaluate the performance on the train and the validation set
- Data exploration + model exploration (e.g. small literature review)
- Based on the data exploration and the literature, decide on a set of models to test with a range of architectures (this includes preprocessing)
- Select hyperparameters based on the performance on the validation set
- Test your model on the test set and decide if it's good enough for production; else you need a new test set

### Dataset validation

In [None]:
N_OBS = len(data)

assert N_OBS == 99989

N_POSITIVE_LABEL = len(data[data.Sentiment == 1])
N_NEGATIVE_LABEL = len(data[data.Sentiment == 0])

assert N_POSITIVE_LABEL == 56457
assert N_NEGATIVE_LABEL == 43532
assert N_POSITIVE_LABEL + N_NEGATIVE_LABEL == N_OBS

assert len(data.dropna()) == N_OBS  # Make sure there is no nan
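
The label counts checked above also give a useful reference point before training anything: a classifier that always predicts the majority class. A minimal sketch, using only the counts asserted in this cell (the variable names are illustrative and not part of the notebook):

```python
# Label counts taken from the assertions in the dataset validation cell
n_positive = 56457
n_negative = 43532
n_total = n_positive + n_negative

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs in the data.
majority_accuracy = max(n_positive, n_negative) / n_total
print(f"Majority-class accuracy: {majority_accuracy * 100:.2f} %")
```

Any model worth keeping should beat this number on the validation set.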

### Split the data into train, validation and test sets and print some information (split percentage, class distribution)

In [None]:
TRAIN_SIZE = round(0.7 * N_OBS)
VALID_SIZE = round(0.15 * N_OBS) + 1
TEST_SIZE = round(0.15 * N_OBS)
assert TRAIN_SIZE + VALID_SIZE + TEST_SIZE == N_OBS, f'{TRAIN_SIZE + VALID_SIZE + TEST_SIZE} != {N_OBS}'

In [None]:
# randomly assign the indices to the train/valid/test sets
# (`random.sample` requires a sequence, not a set, in recent Python versions)
examples = list(range(N_OBS))
train_indices = set(random.sample(examples, TRAIN_SIZE))
remaining_examples = [i for i in examples if i not in train_indices]
valid_indices = set(random.sample(remaining_examples, VALID_SIZE))
test_indices = [i for i in remaining_examples if i not in valid_indices]

In [None]:
# Split the data
train_df = data.iloc[list(train_indices)]
valid_df = data.iloc[list(valid_indices)]
test_df = data.iloc[list(test_indices)]

In [None]:
n_train = len(train_df)
n_train_positive = len(train_df[train_df.Sentiment == 1])

n_valid = len(valid_df)
n_valid_positive = len(valid_df[valid_df.Sentiment == 1])

n_test = len(test_df)
n_test_positive = len(test_df[test_df.Sentiment == 1])

print('# train example: {} ({:.2f} %) | positive: {:.2f} % | negative: {:.2f} %'.format(n_train, n_train / N_OBS * 100, n_train_positive / n_train * 100, 100 - n_train_positive / n_train * 100))
print('# valid example: {} ({:.2f} %) | positive: {:.2f} % | negative: {:.2f} %'.format(n_valid, n_valid / N_OBS * 100, n_valid_positive / n_valid * 100, 100 - n_valid_positive / n_valid * 100))
print('# test example: {} ({:.2f} %) | positive: {:.2f} % | negative: {:.2f} %'.format(n_test, n_test / N_OBS * 100, n_test_positive / n_test * 100, 100 - n_test_positive / n_test * 100))

# train example: 69992 (70.00 %) | positive: 56.50 % | negative: 43.50 %
# valid example: 14999 (15.00 %) | positive: 56.06 % | negative: 43.94 %
# test example: 14998 (15.00 %) | positive: 56.69 % | negative: 43.31 %
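
The split above samples indices uniformly, so the class proportions differ slightly between the three sets. If identical proportions matter, one option is to split each class separately. The following is a sketch using only the standard library; `stratified_split` and its arguments are hypothetical names, not part of this notebook:

```python
import random
from typing import Dict, List, Sequence


def stratified_split(labels: Sequence[int], fractions: Sequence[float], seed: int = 1) -> List[List[int]]:
    """Split example indices into len(fractions) parts, class by class."""
    rng = random.Random(seed)

    # group the example indices by label
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    parts: List[List[int]] = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for k, fraction in enumerate(fractions):
            # the last part takes whatever is left to avoid rounding gaps
            is_last = k == len(fractions) - 1
            end = len(indices) if is_last else start + round(fraction * len(indices))
            parts[k].extend(indices[start:end])
            start = end
    return parts


toy_labels = [1] * 60 + [0] * 40
train_idx, valid_idx, test_idx = stratified_split(toy_labels, [0.7, 0.15, 0.15])
```

Each part then contains the same 60/40 class ratio as the toy data, up to rounding.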


In [None]:
train_df.to_csv(constants.TRAIN_PATH, encoding='utf-8', index=False)
valid_df.to_csv(constants.VALID_PATH, encoding='utf-8', index=False)
test_df.to_csv(constants.TEST_PATH, encoding='utf-8', index=False)

!ls data/twitter

data.csv  data.zip  test.csv  train.csv  valid.csv


In [None]:
train_df.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
6,7,1,Juuuuuuuuuuuuuuuuussssst Chillin!!
7,8,0,Sunny Again Work Tomorrow :-| ...


## Representing sentences with bag of words

In [None]:
train_inputs = train_df.SentimentText
train_labels = train_df.Sentiment

valid_inputs = valid_df.SentimentText
valid_labels = valid_df.Sentiment

#### Vectorizing the features with `CountVectorizer` [[docs]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [None]:
# CountVectorizer is an object for 
# converting strings into bags of words
vectorizer = CountVectorizer()

vectorizer.fit(train_inputs)
train_bow = vectorizer.transform(train_inputs)
valid_bow = vectorizer.transform(valid_inputs)

In [None]:
print(vectorizer.vocabulary_)
print(len(vectorizer.vocabulary_))

83208


In [None]:
train_inputs.iloc[0]

'                     is so sad for my APL friend.............'

In [None]:
vectorizer.transform(['is is is']).toarray().squeeze()[56200]

1
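
To make the bag-of-words representation concrete, here is a small sketch on a toy corpus (the corpus and the `toy_*` names are made up for illustration). Each row of the resulting matrix counts how often each vocabulary word appears in the corresponding sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["good good movie", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_corpus)

# maps each word to its column index (indices follow alphabetical order)
toy_vocab = toy_vectorizer.vocabulary_

# dense view of the sparse count matrix: one row per sentence
toy_counts = toy_bow.toarray()
print(toy_vocab)
print(toy_counts)
```

Word order is lost in this representation: "good good movie" and "movie good good" map to the same vector.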

#### Classifying tweets with logistic regression [[docs]](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
# Initialize the classifier. `lbfgs` is the default optimizer. 
# Set `max_iter` to 1000 to avoid annoying convergence warning
lr = LogisticRegression(solver='lbfgs', max_iter=1000)

In [None]:
# optimize the parameters of the classifier
lr = lr.fit(train_bow, train_labels)

In [None]:
# Evaluate the accuracy of our baseline model
train_predictions = lr.predict(train_bow)
valid_predictions = lr.predict(valid_bow)

print('Train accuracy: {:.2f} %'.format(accuracy_score(train_predictions, train_labels) * 100))
print('Valid accuracy: {:.2f} %'.format(accuracy_score(valid_predictions, valid_labels) * 100))

Train accuracy: 90.63 %
Valid accuracy: 76.64 %


In [None]:
from sklearn.metrics import confusion_matrix

numpy.round(confusion_matrix(valid_labels, valid_predictions) / VALID_SIZE * 100)

array([[31., 13.],
       [10., 46.]])
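
The confusion matrix can be turned into other metrics than accuracy. A short sketch using the rounded percentages printed above (in scikit-learn's convention, rows are true labels and columns are predicted labels; the variable names are illustrative):

```python
# Rounded percentages from the confusion matrix above
tn, fp = 31.0, 13.0   # true label 0: predicted 0, predicted 1
fn, tp = 10.0, 46.0   # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # among predicted positives, share that are truly positive
recall = tp / (tp + fn)      # among true positives, share that are found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Because the inputs are rounded, these values are only approximations of the exact metrics.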

### Data exploration + model exploration (e.g. small literature review)

- Things to consider in your research
 - The task: "sentiment classification" < "text classification" < "classification"
 - Preprocessing and feature representation
 - ...

- Where to look:
 - [Google scholar](https://scholar.google.ca/schhp?hl=en&as_sdt=0,5)
 - Forums and blogs (e.g. Reddit, Medium)
 - [NLP progress](http://nlpprogress.com/)
 - ...

## Logistic Regression with PyTorch

### DataLoader

In [None]:
vectorizer = CountVectorizer(min_df=5, stop_words='english')
vectorizer.fit(train_inputs)
train_bow = vectorizer.transform(train_inputs)
valid_bow = vectorizer.transform(valid_inputs)
len(vectorizer.vocabulary_) # ~83200 if min_df =1

82908

In [None]:
class SentimentDataset:
    """
    Minimal dataset class wrapping the bag-of-words inputs and their labels.
    PyTorch's DataLoader requires an object with `__getitem__` and `__len__`
    methods for representing the data.

    inputs : scipy.sparse.csr_matrix
       Sparse representation of the bag of words (the input data).
    targets : pandas.Series
        Binary (0 or 1) target data.
    """
    def __init__(self, inputs: scipy.sparse.csr_matrix, targets: pandas.Series):
        self.inputs = inputs
        self.targets = targets.to_numpy().astype(numpy.float32)
        
    def __getitem__(self, idx: int) -> Tuple[numpy.ndarray, float]:
        """
        Select the observation with id `idx`, convert it to an array 
        of type float32 to save some memory usage.

        idx : int
            Index of the selected observation.
        """
        return self.inputs[idx].toarray().ravel().astype(numpy.float32), self.targets[idx]
    
    def __len__(self) -> int:
        """
        Return the number of observations in the dataset.
        """
        return len(self.targets)

In [None]:
train_data = SentimentDataset(train_bow, train_labels)
train_data[0]

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 0.0)

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_data, batch_size=1024)

batch_x, batch_y = next(iter(train_loader))
batch_x.shape

torch.Size([1024, 9501])

In [None]:
class LR(nn.Module):
    """
    Implementation of a logistic regression model in PyTorch.

    input_dim : int
        The dimension of the input data. For instance, 
        if the input data consists of bags of words, `input_dim`
        should be the size of the vocabulary.
    output_dim : int
        The dimension of the output. For a binary classifier,
        `output_dim` should be equal to 1. 
    """
    def __init__(self, input_dim: int, output_dim: int):
        super().__init__()
        # using PyTorch's built-in linear mapping
        self.linear_model = nn.Linear(input_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """
        Map the input data to the probability p(y = 1).

        inputs : torch.Tensor
            Tensor of bag of words representing the inputs.
        """
        h = self.linear_model(inputs)  # h = wx + b
        proba = self.sigmoid(h)        # p = sigmoid(h)
        return proba.squeeze(-1)       # squeeze to remove the trailing singleton dimension
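
As a sanity check of what this forward pass computes, the sketch below compares `nn.Linear` followed by a sigmoid with the same expression written out by hand, `sigmoid(w . x + b)`. The toy dimensions and variable names are assumptions made for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(3, 1)               # toy vocabulary of 3 words, 1 output
x = torch.tensor([[1.0, 0.0, 2.0]])    # a single bag-of-words vector

# what the module computes: linear mapping followed by a sigmoid
proba_module = torch.sigmoid(linear(x)).squeeze(-1)

# the same quantity by hand: sigmoid(w . x + b)
h = (linear.weight * x).sum(dim=1) + linear.bias
proba_manual = 1.0 / (1.0 + torch.exp(-h))

print(proba_module, proba_manual)
```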

In [None]:
def evaluate(model: nn.Module, iterator: DataLoader, device: torch.device):
    """
    Evaluate a trained binary classifier on a specific data iterator 
    by computing the binary cross entropy and the accuracy.

    model : nn.Module
        Trained classifier that will be evaluated.
    iterator : DataLoader
        Data iterator on which the model will be evaluated.
    device : torch.device
        Device on which computations will be done (either cpu or gpu).
    """
    model.eval()
    
    targets, predictions = [], []
    epoch_loss = 0
    
    # no gradients are needed for evaluation
    with torch.no_grad():
    
        for batch_input, batch_label in iterator:

            batch_input = batch_input.to(device)
            batch_label = batch_label.to(device)

            batch_proba = model(batch_input)
            
            # compute and store batch predictions
            batch_prediction = batch_proba.cpu().numpy()
            batch_prediction[batch_prediction < 0.5] = 0
            batch_prediction[batch_prediction >= 0.5] = 1
            
            predictions.extend([y for y in batch_prediction])
            targets.extend([y for y in batch_label.cpu().numpy()])
            
            loss = torch.nn.functional.binary_cross_entropy(batch_proba, batch_label.float())
            epoch_loss += loss.item()
    
    epoch_loss = epoch_loss / len(iterator)
    epoch_acc = accuracy_score(targets, predictions)
    
    return epoch_loss, epoch_acc
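
The two quantities returned by `evaluate` can be reproduced by hand on a few made-up probabilities. This sketch uses only the standard library, and the numbers are illustrative, not taken from the model:

```python
import math

probas = [0.9, 0.2, 0.6]   # predicted p(y = 1) for three examples
labels = [1.0, 0.0, 0.0]   # true labels

# threshold at 0.5, as in `evaluate`
predictions = [1.0 if p >= 0.5 else 0.0 for p in probas]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# binary cross entropy, averaged over the examples
bce = -sum(
    y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    for p, y in zip(probas, labels)
) / len(labels)

print(f"accuracy={accuracy:.3f} bce={bce:.3f}")
```

Note that `evaluate` averages the per-batch mean losses, which matches the per-example mean only when all batches have the same size.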

In [None]:
len(vectorizer.vocabulary_)

9501

In [None]:
model = LR(len(vectorizer.vocabulary_), 1).to(constants.DEVICE)

In [None]:
train_loader = DataLoader(train_data, batch_size=2048)

In [None]:
# Evaluate without optimizing
evaluate(model, train_loader, constants.DEVICE)

(45.47789557320731, 0.5459766830494913)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
criterion = nn.BCELoss()

In [None]:
def epoch_time(start_time, end_time):
    """Utility function converting an elapsed time in seconds into whole minutes and seconds"""
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
train_data = SentimentDataset(train_bow, train_labels)
valid_data = SentimentDataset(valid_bow, valid_labels)

train_loader = DataLoader(train_data, batch_size=512)
valid_loader = DataLoader(valid_data, batch_size=512)

In [None]:
for epoch in range(1):
    model.train()
    
    start_time = time.time()
    
    for batch_input, batch_label in train_loader:

        batch_input = batch_input.to(constants.DEVICE)
        batch_label = batch_label.to(constants.DEVICE)

        batch_proba = model(batch_input)
        assert batch_label.shape == batch_proba.shape, f'{batch_label.shape} != {batch_proba.shape}'

        loss = torch.nn.functional.binary_cross_entropy(batch_proba, batch_label)

        # reset the gradients, otherwise they accumulate across batches
        optimizer.zero_grad()
        loss.backward()

        optimizer.step()
    
    train_loss, train_acc = evaluate(model, train_loader, constants.DEVICE)
    valid_loss, valid_acc = evaluate(model, valid_loader, constants.DEVICE)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s | Train Loss: {train_loss:.3f} | Train Acc.: {train_acc:.2f} | Val. Loss: {valid_loss:.3f} |  Val. Acc.: {valid_acc:.2f}')

Epoch: 01 | Time: 0m 22s | Train Loss: 28.649 | Train Acc.: 0.71 | Val. Loss: 31.556 |  Val. Acc.: 0.68
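
PyTorch adds new gradients to whatever is already stored in `.grad`, which is why training loops usually reset the gradients (for example with `optimizer.zero_grad()`) before each backward pass. A tiny sketch of this behaviour, with a made-up scalar parameter `w`:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()          # gradient of 2*w with respect to w

(2 * w).backward()
accumulated = w.grad.item()    # the second gradient was added to the first

w.grad.zero_()                 # what zero_grad does for each parameter
(2 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```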


## Representing words with *embeddings*

### Tokenization

In [None]:
STOPWORDS = ['a', 'an', 'the', 'and', 'or', 'to', 'it', 'for', 'is']

def tokenizer(text: str) -> List[str]:
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    tokens = [tok.text for tok in spacy_en.tokenizer(text) if tok.text not in STOPWORDS]
    return tokens

tokenizer('i love ice,cream')

['i', 'love', 'ice', ',', 'cream']
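
For readers without spaCy installed, the idea of tokenizing and then filtering stop words can be sketched with a regular expression. This is a simplified stand-in, not spaCy's tokenization rules; `simple_tokenizer` and `TOY_STOPWORDS` are hypothetical names:

```python
import re
from typing import List

TOY_STOPWORDS = {'a', 'an', 'the', 'and', 'or', 'to', 'it', 'for', 'is'}


def simple_tokenizer(text: str) -> List[str]:
    """Split text into words and punctuation marks, then drop stop words."""
    # a token is either a run of word characters or a single punctuation mark
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]


tokens_a = simple_tokenizer('i love ice,cream')
tokens_b = simple_tokenizer('it is the best')
print(tokens_a)
print(tokens_b)
```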

#### Loading and preprocessing a `csv` file with `torchtext.legacy.data.TabularDataset`

#### **Field**

Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented
by tensors.  It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.

In [None]:
input_field = Field(sequential=True, tokenize=tokenizer, pad_token='<pad>', unk_token='<unk>', lower=True, batch_first=True)
label_field = Field(sequential=False, use_vocab=False, is_target=True, unk_token=None, batch_first=True, dtype=torch.float32)

fields = {
    'SentimentText': ('input', input_field),
    'Sentiment': ('label', label_field)
}

train_data = TabularDataset(path=constants.TRAIN_PATH, format='csv', fields=fields)
valid_data = TabularDataset(path=constants.VALID_PATH, format='csv', fields=fields)

<torchtext.legacy.data.dataset.TabularDataset at 0x7f050c976350>

In [None]:
print(train_df.SentimentText.iloc[1])
print(vars(train_data.examples[1]))

                   I missed the New Moon trailer...
{'input': ['                   ', 'i', 'missed', 'new', 'moon', 'trailer', '...'], 'label': '0'}


#### `Field.build_vocab`

- `min_freq`: The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.

In [None]:
input_field.build_vocab(train_data, min_freq=5)
print(dict(input_field.vocab.stoi))
print(len(input_field.vocab.stoi))

10003
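
The effect of `min_freq` can be illustrated with a hand-rolled vocabulary. The sketch below is not torchtext's implementation; it assumes the special tokens come first and that the remaining tokens are ordered by decreasing frequency, and `build_stoi` is a hypothetical helper:

```python
from collections import Counter
from typing import Dict, List


def build_stoi(tokenized: List[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for sentence in tokenized for tok in sentence)
    stoi = {'<unk>': 0, '<pad>': 1}
    # most frequent first, ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            stoi[tok] = len(stoi)
    return stoi


toy_sentences = [['i', 'love', 'tea'], ['i', 'love', 'coffee'], ['i', 'missed', 'tea']]
toy_stoi = build_stoi(toy_sentences, min_freq=2)

# tokens below the frequency threshold fall back to the <unk> index
encoded = [toy_stoi.get(tok, toy_stoi['<unk>']) for tok in ['i', 'missed', 'coffee', 'tea']]
print(toy_stoi)
print(encoded)
```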


#### Testing the `Iterator`

In [None]:
train_iterator = Iterator(train_data, batch_size=64)
train_iterator = iter(train_iterator)

batch = next(train_iterator)

batch_input = batch.input

print(batch_input.shape)

print(batch_input)

torch.Size([64, 34])
tensor([[  28,  184,    0,  ...,    1,    1,    1],
        [8670,    2,   78,  ...,    1,    1,    1],
        [2239,  204,    6,  ...,    1,    1,    1],
        ...,
        [5004,  490, 3806,  ...,    1,    1,    1],
        [   0,  120, 3442,  ...,    1,    1,    1],
        [   5,  342,  975,  ...,    1,    1,    1]])
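
In the batch above, the trailing `1` values are padding: sentences in a batch have different lengths, so shorter ones are filled with the index of `<pad>` until they match the longest. A minimal sketch of that step with `torch.nn.utils.rnn.pad_sequence`, assuming a padding index of 1 and made-up token indices:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed index of the '<pad>' token

# two numericalized sentences of different lengths
sentences = [torch.tensor([28, 184, 0]), torch.tensor([5, 342])]

# pad to the length of the longest sentence; batch dimension first
batch = pad_sequence(sentences, batch_first=True, padding_value=PAD_IDX)
print(batch.shape)
print(batch)
```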


### Building the RNN classifier

In [None]:
class SequenceClassifier(nn.Module):
    def __init__(
        self, 
        input_dim: int, 
        emb_dim: int, 
        pretrained_emb: torch.Tensor, 
        hidden_dim: int, 
        num_layers: int, 
        bidirectional: bool, 
        dropout: float, 
        device: torch.device
    ):
        """
        Classifier that uses rnn to map the input to the class probability.

        input_dim : int
            The dimension of word one hot encoding, i.e. the size of the vocabulary.
        emb_dim : int
            The size of the embeddings.
        pretrained_emb : torch.Tensor
            Pre-optimized embeddings that will be used to represent the words in a 
            lower-dimensional but dense representation. If `pretrained_emb` is None, 
            an embedding matrix will be optimized instead.
        hidden_dim : int
            Dimension of the hidden state vector.
        num_layers : int
            Number of layers of the RNN.
        bidirectional : bool
            Whether the RNN should be bidirectional.
        dropout : float
            Fraction of features that should be dropped before the classification layer.
        device : torch.device
            Device on which calculation should be done (cpu or gpu).
        """
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.num_direction = 2 if bidirectional else 1
        self.device = device
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        if pretrained_emb is not None:
            self.embedding.weight.data.copy_(pretrained_emb)
            self.embedding.weight.requires_grad = False # make embedding non trainable
        
        self.rnn = nn.LSTM(emb_dim, hidden_dim, num_layers, bidirectional=bidirectional, dropout=dropout, batch_first=True)
        
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, 1),
            nn.Sigmoid()
        )
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, batch_input: torch.Tensor) -> torch.Tensor:
        """
        batch_input: torch.Tensor 
            Batch of token indices of shape (`batch_size`, `sentence_length`)
        """
        # `embedded` shape = (batch_size, sentence_length, emb_dim) since batch_first=True
        embedded = self.dropout(self.embedding(batch_input))

        _, (hidden, _) = self.rnn(embedded)  # `hidden` shape = (num_layers * num_directions, batch_size, hidden_dim)

        if self.num_direction == 2:
            # last layer: hidden[-1] is the backward state, hidden[-2] the forward state
            code = torch.cat([hidden[-1], hidden[-2]], 1)
        else:
            # unidirectional: only the last layer's state is used
            code = hidden[-1]
        code = self.dropout(code)
        
        outputs = self.classifier(code).squeeze(-1)
        
        return outputs
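
The shapes involved in the forward pass can be checked on random data. This sketch uses small, made-up dimensions and a standalone `nn.LSTM` configured like the one in the classifier (two layers, bidirectional, `batch_first=True`):

```python
import torch
import torch.nn as nn

batch_size, seq_len, emb_dim, hidden_dim = 3, 5, 4, 8

lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, bidirectional=True, batch_first=True)
embedded = torch.randn(batch_size, seq_len, emb_dim)

outputs, (hidden, cell) = lstm(embedded)

# `hidden` stacks (layer, direction) pairs along its first dimension;
# the last two entries are the forward and backward states of the top layer
code = torch.cat([hidden[-1], hidden[-2]], dim=1)

print(outputs.shape)  # one vector per token, both directions concatenated
print(hidden.shape)   # num_layers * num_directions, batch_size, hidden_dim
print(code.shape)     # one vector per sentence, fed to the classifier
```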

In [None]:
def train_iteration(model: nn.Module, iterator: Iterator, optimizer: optim.Adam, device: torch.device):
    """
    Run one epoch of the Adam optimization routine over all the mini-batches of a dataset.

    model : nn.Module
        The model to optimize
    iterator : Iterator
        torchtext iterator over the batches of the dataset on which the model will be optimized.
    optimizer : optim.Adam
        Implementation of the Adam optimization routine.
    device : torch.device
        Device on which calculation should occur. 
    """
    model.train()
    
    for i, batch in enumerate(iterator):
        
        optimizer.zero_grad()
        
        batch_input = batch.input.to(device)
        batch_label = batch.label.to(device)
        
        output = model(batch_input)
        loss = torch.nn.functional.binary_cross_entropy(output, batch_label)
        
        loss.backward()
        
        optimizer.step()

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.1, 0.1)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [None]:
input_field = Field(sequential=True, tokenize=tokenizer, pad_token='<pad>', unk_token='<unk>', lower=True, batch_first=True)
label_field = Field(sequential=False, use_vocab=False, is_target=True, unk_token=None, batch_first=True, dtype=torch.float32)

fields = {'SentimentText': ('input', input_field), 'Sentiment': ('label', label_field)}

train_data = TabularDataset(path=constants.TRAIN_PATH, format='csv', fields=fields)
valid_data = TabularDataset(path=constants.VALID_PATH, format='csv', fields=fields)
test_data = TabularDataset(path=constants.TEST_PATH, format='csv', fields=fields)

input_field.build_vocab(train_data, min_freq=5) # , vectors="glove.6B.100d"

In [None]:
input_field.vocab.vectors

In [None]:
INPUT_DIM = len(input_field.vocab)
EMB_DIM = 64
HID_DIM = 128
NUM_LAYERS = 2
ENC_DROPOUT = 0.4
N_EPOCHS = 10
BATCH_SIZE = 32
BIDIRECTIONAL = True
pretrained_embeddings = None # input_field.vocab.vectors

model = SequenceClassifier(input_dim=INPUT_DIM, 
                           emb_dim=EMB_DIM, 
                           pretrained_emb=pretrained_embeddings, 
                           hidden_dim=HID_DIM, 
                           num_layers=NUM_LAYERS, 
                           bidirectional=BIDIRECTIONAL,
                           dropout=ENC_DROPOUT, device=constants.DEVICE)
model.to(constants.DEVICE)
model.apply(init_weights)


optimizer = optim.Adam(model.parameters(), lr=0.001)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,234,369 trainable parameters
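
The parameter count printed above can be recovered by hand from the hyperparameters. The sketch below assumes a vocabulary of 10,003 tokens (the size printed after `build_vocab`) and PyTorch's LSTM layout of four gates, each with an input weight matrix, a hidden weight matrix and two bias vectors:

```python
vocab_size, emb_dim, hidden_dim = 10003, 64, 128


def lstm_direction_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer in one direction (4 gates, 2 bias vectors)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


embedding = vocab_size * emb_dim
# layer 1 reads the embeddings; layer 2 reads both directions of layer 1
layer_1 = 2 * lstm_direction_params(emb_dim, hidden_dim)
layer_2 = 2 * lstm_direction_params(2 * hidden_dim, hidden_dim)
classifier = 2 * hidden_dim + 1   # weights plus bias of the final linear layer

total = embedding + layer_1 + layer_2 + classifier
print(f"{total:,}")
```

More than half of the parameters sit in the embedding matrix, which is one reason pretrained, frozen embeddings reduce the number of trainable parameters so much.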


In [None]:
train_iterator = Iterator(train_data, batch_size=BATCH_SIZE, device=constants.DEVICE)
valid_iterator = Iterator(valid_data, batch_size=512, device=constants.DEVICE)
test_iterator = Iterator(test_data, batch_size=512)

In [None]:
best_valid_loss = float('Inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_iteration(model, train_iterator, optimizer, constants.DEVICE)
    
    if (epoch + 1) % 1 == 0:  
        train_loss, train_acc = evaluate(model, train_iterator, constants.DEVICE)
        valid_loss, valid_acc = evaluate(model, valid_iterator, constants.DEVICE)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'tut1-model.pt')

        print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s | Train Loss: {train_loss:.3f} | Train Acc.: {train_acc:.2f} | Best. Loss: {best_valid_loss:.3f} | Val. Loss: {valid_loss:.3f} |  Val. Acc.: {valid_acc:.2f}')

Epoch: 01 | Time: 6m 20s | Train Loss: 10.435 | Train Acc.: 0.90 | Best. Loss: 20.772 | Val. Loss: 20.772 |  Val. Acc.: 0.79
Epoch: 02 | Time: 6m 13s | Train Loss: 9.898 | Train Acc.: 0.90 | Best. Loss: 20.772 | Val. Loss: 20.881 |  Val. Acc.: 0.79
Epoch: 03 | Time: 6m 16s | Train Loss: 9.269 | Train Acc.: 0.91 | Best. Loss: 20.772 | Val. Loss: 21.515 |  Val. Acc.: 0.78
Epoch: 04 | Time: 6m 15s | Train Loss: 8.167 | Train Acc.: 0.92 | Best. Loss: 20.772 | Val. Loss: 21.227 |  Val. Acc.: 0.79
Epoch: 05 | Time: 6m 25s | Train Loss: 7.788 | Train Acc.: 0.92 | Best. Loss: 20.772 | Val. Loss: 21.408 |  Val. Acc.: 0.79
Epoch: 06 | Time: 6m 32s | Train Loss: 6.961 | Train Acc.: 0.93 | Best. Loss: 20.772 | Val. Loss: 21.598 |  Val. Acc.: 0.79
Epoch: 07 | Time: 6m 25s | Train Loss: 6.330 | Train Acc.: 0.94 | Best. Loss: 20.772 | Val. Loss: 21.743 |  Val. Acc.: 0.78
Epoch: 08 | Time: 6m 42s | Train Loss: 5.759 | Train Acc.: 0.94 | Best. Loss: 20.772 | Val. Loss: 21.882 |  Val. Acc.: 0.78
Epoch: 
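
In the log above, the validation loss never improves after the first epoch while the training loss keeps falling, a typical sign of overfitting. One common response is to stop training after a few epochs without improvement. The sketch below replays the validation losses from the log with a hypothetical `PATIENCE` of 3 epochs; this rule is not part of the notebook's training loop:

```python
# validation losses copied from the epochs printed above
valid_losses = [20.772, 20.881, 21.515, 21.227, 21.408, 21.598, 21.743, 21.882]

PATIENCE = 3  # hypothetical: epochs to wait for an improvement before stopping

best_loss = float('inf')
best_epoch = None
stopped_at = None
epochs_without_improvement = 0

for epoch, loss in enumerate(valid_losses, start=1):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0   # this is where a checkpoint would be saved
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            stopped_at = epoch
            break

print(f"best epoch: {best_epoch}, stopped at epoch: {stopped_at}")
```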

In [None]:
# DO NOT RUN THIS UNTIL YOU ARE SURE ABOUT YOUR HYPERPARAMETERS; THERE IS NO GOING BACK ;)
# model.load_state_dict(torch.load('tut1-model.pt'))
# evaluate(model, test_iterator, constants.DEVICE)

(20.01987845102946, 0.80044005867449)