##### Week 3 

Let k be the number of members in your group. Implement and train k different supervised classifiers for each of the three languages separately, using the training data for that language. The classifiers must only use the document and question as input. Evaluate the classifiers on the respective validation sets, report and analyse the performance for each language and compare the scores across languages.

The classifiers can use linguistic/lexical features, e.g., bag-of-words, n-gram counts, overlaps of words between question and document, etc.; word embeddings, or word/sentence representations from neural language models. You can, for example, find pretrained Transformer language models for different languages, trained with different language objectives, and fine-tuned for different downstream tasks, from HuggingFace. You can also train or fine-tune your own neural language models on the dataset. Motivate your choice of features and classifier.
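
As a minimal sketch of the word-overlap features mentioned above: the function name `overlap_features` and the lowercased whitespace tokenization are assumptions made for illustration, and whitespace splitting would not suit every language in the dataset (Japanese, for example).

```python
from typing import Dict


def overlap_features(question: str, document: str) -> Dict[str, float]:
    """Compute simple lexical-overlap features between a question and a document.

    Tokenization is plain lowercased whitespace splitting, which is an
    assumption made for illustration only.
    """
    q_tokens = set(question.lower().split())
    d_tokens = set(document.lower().split())
    shared = q_tokens & d_tokens
    union = q_tokens | d_tokens
    return {
        # Number of distinct question tokens that also occur in the document
        "n_shared": float(len(shared)),
        # Fraction of distinct question tokens covered by the document
        "question_coverage": len(shared) / len(q_tokens) if q_tokens else 0.0,
        # Jaccard similarity of the two token sets
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


features = overlap_features(
    "when was charles fort born",
    "charles fort was born in 1874",
)
print(features)
```

Features like these could be stacked next to bag-of-words counts, since they describe the relation between question and document rather than the content of either one.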

In [None]:
!pip install bpemb
!pip install gensim
!python -m spacy download en_core_web_sm

In [4]:
# Preamble 
import sys 
sys.path.append('..')

In [5]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("copenlu/answerable_tydiqa")

train_set = dataset["train"]
validation_set = dataset["validation"]

df_train = train_set.to_pandas()
df_val = validation_set.to_pandas()

print(len(df_train))
print(len(df_val))

df_train.head()


116067
13325


Unnamed: 0,question_text,document_title,language,annotations,document_plaintext,document_url
0,Milloin Charles Fort syntyi?,Charles Fort,finnish,"{'answer_start': [18], 'answer_text': ['6. elo...",Charles Hoy Fort (6. elokuuta (joidenkin lähte...,https://fi.wikipedia.org/wiki/Charles%20Fort
1,“ダン” ダニエル・ジャドソン・キャラハンの出身はどこ,ダニエル・J・キャラハン,japanese,"{'answer_start': [35], 'answer_text': ['カリフォルニ...",“ダン”こと、ダニエル・ジャドソン・キャラハンは1890年7月26日、カリフォルニア州サンフ...,https://ja.wikipedia.org/wiki/%E3%83%80%E3%83%...
2,వేప చెట్టు యొక్క శాస్త్రీయ నామం ఏమిటి?,వేప,telugu,"{'answer_start': [12], 'answer_text': ['Azadir...","వేప (లాటిన్ Azadirachta indica, syn. Melia aza...",https://te.wikipedia.org/wiki/%E0%B0%B5%E0%B1%...
3,চেঙ্গিস খান কোন বংশের রাজা ছিলেন ?,চেঙ্গিজ খান,bengali,"{'answer_start': [414], 'answer_text': ['বোরজি...",চেঙ্গিজ খান (মঙ্গোলীয়: Чингис Хаан আ-ধ্ব-ব: ...,https://bn.wikipedia.org/wiki/%E0%A6%9A%E0%A7%...
4,రెయ్యలగడ్ద గ్రామ విస్తీర్ణత ఎంత?,రెయ్యలగడ్ద,telugu,"{'answer_start': [259], 'answer_text': ['27 హె...","రెయ్యలగడ్ద, విశాఖపట్నం జిల్లా, గంగరాజు మాడుగుల...",https://te.wikipedia.org/wiki/%E0%B0%B0%E0%B1%...


In [6]:
# Get train and validation data for each language
df_train_bengali = df_train[df_train['language'] == 'bengali']
df_train_arabic = df_train[df_train['language'] == 'arabic']
df_train_indonesian = df_train[df_train['language'] == 'indonesian']

df_val_bengali = df_val[df_val['language'] == 'bengali']
df_val_arabic = df_val[df_val['language'] == 'arabic']
df_val_indonesian = df_val[df_val['language'] == 'indonesian']


# For testing
df_val_english = df_val[df_val['language'] == 'english']
df_train_english = df_train[df_train['language'] == 'english']


In [7]:
# Create a new dataframe per language with the concatenated document + question text
# and a binary label: 0 if the question is unanswerable (answer_start == [-1]), else 1
df_train_bengali_merged = pd.DataFrame({
    'text':(df_train_bengali["document_plaintext"] + df_train_bengali["question_text"]),
    'answerable':(df_train_bengali["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
    })
df_train_arabic_merged = pd.DataFrame({
    'text': (df_train_arabic["document_plaintext"] + df_train_arabic["question_text"]),
    'answerable': (df_train_arabic["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
                                    })
df_train_indonesian_merged = pd.DataFrame({
    'text':(df_train_indonesian["document_plaintext"] + df_train_indonesian["question_text"]),
    'answerable':(df_train_indonesian["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
    })
df_train_english_merged = pd.DataFrame({
    'text':(df_train_english["document_plaintext"] + df_train_english["question_text"]),
    'answerable':(df_train_english["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
    })


## Same for validation data
df_val_bengali_merged = pd.DataFrame({
    'text':(df_val_bengali["document_plaintext"] + df_val_bengali["question_text"]),
    'answerable':(df_val_bengali["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
    })
df_val_arabic_merged = pd.DataFrame({
    'text': (df_val_arabic["document_plaintext"] + df_val_arabic["question_text"]),
    'answerable': (df_val_arabic["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
                                    })
df_val_indonesian_merged = pd.DataFrame({
    'text':(df_val_indonesian["document_plaintext"] + df_val_indonesian["question_text"]),
    'answerable':(df_val_indonesian["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
    })
df_val_english_merged = pd.DataFrame({
    'text':(df_val_english["document_plaintext"] + df_val_english["question_text"]),
    'answerable':(df_val_english["annotations"].apply(lambda x: 0 if x['answer_start'] == [-1] else 1))
    })

df_val_english_merged.head()

Unnamed: 0,text,answerable
30,Wound care encourages and speeds wound healing...,1
47,Brothers Amos and Wilfrid Ayre founded Burntis...,1
59,"For species of mammals, larger brains (in abso...",1
77,"As from 31 March 1989, fishing vessel registra...",1
106,"When Quezon City was created in 1939, the foll...",1
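
The cell above repeats the same construction for every language and split. A possible way to factor it out is sketched below; the helper name `merge_question_document`, its `sep` parameter, and the toy rows are assumptions for illustration. With `sep=""` it reproduces the concatenation used above, while `sep=" "` keeps the last document token and the first question token from being fused into one token.

```python
import pandas as pd


def merge_question_document(df: pd.DataFrame, sep: str = "") -> pd.DataFrame:
    """Build a two-column frame: concatenated text and a binary answerable label.

    sep="" reproduces the concatenation used in the cell above; passing
    sep=" " keeps the last document token and the first question token apart.
    """
    return pd.DataFrame({
        "text": df["document_plaintext"] + sep + df["question_text"],
        # answer_start == [-1] marks an unanswerable question
        "answerable": df["annotations"].apply(
            lambda x: 0 if list(x["answer_start"]) == [-1] else 1
        ),
    })


# Toy rows that mimic the dataset's column layout (made-up content)
toy = pd.DataFrame({
    "document_plaintext": ["Charles Fort was born in 1874.", "An unrelated text."],
    "question_text": ["When was Charles Fort born?", "Who wrote it?"],
    "annotations": [
        {"answer_start": [25], "answer_text": ["1874"]},
        {"answer_start": [-1], "answer_text": [""]},
    ],
})
merged = merge_question_document(toy, sep=" ")
print(merged)
```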


#### Model 1: Logistic Regression

In [8]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer



## Select the input (text) and label (answerable) columns for each language
# Training data
# Indonesian
X_train_indonesian = df_train_indonesian_merged.iloc[:, 0].values.reshape(-1, 1)
y_train_indonesian = df_train_indonesian_merged.iloc[:, 1].values

# Bengali
X_train_bengali = df_train_bengali_merged.iloc[:, 0].values.reshape(-1, 1)
y_train_bengali = df_train_bengali_merged.iloc[:, 1].values

#Arabic
X_train_arabic = df_train_arabic_merged.iloc[:, 0].values.reshape(-1, 1)
y_train_arabic = df_train_arabic_merged.iloc[:, 1].values

# English
X_train_english = df_train_english_merged.iloc[:, 0].values.reshape(-1, 1)
y_train_english = df_train_english_merged.iloc[:, 1].values

# Validation data
# Indonesian
X_val_indonesian = df_val_indonesian_merged.iloc[:, 0].values.reshape(-1, 1)
y_val_indosnesian = df_val_indonesian_merged.iloc[:, 1].values

# Bengali
X_val_bengali = df_val_bengali_merged.iloc[:, 0].values.reshape(-1, 1)
y_val_bengali = df_val_bengali_merged.iloc[:, 1].values

#Arabic
X_val_arabic = df_val_arabic_merged.iloc[:, 0].values.reshape(-1, 1)
y_val_arabic = df_val_arabic_merged.iloc[:, 1].values

# English
X_val_english = df_val_english_merged.iloc[:, 0].values.reshape(-1, 1)
y_val_english = df_val_english_merged.iloc[:, 1].values



# Tokenize the text
vectorizer = CountVectorizer()

# Indonesian
X_train_indonesian_tokenized = vectorizer.fit_transform(X_train_indonesian.ravel())
X_val_tokenized_indonesian = vectorizer.transform(X_val_indonesian.ravel())

# Bengali
X_train_bengali_tokenized = vectorizer.fit_transform(X_train_bengali.ravel())
X_val_tokenized_bengali = vectorizer.transform(X_val_bengali.ravel())

# Arabic
X_train_arabic_tokenized = vectorizer.fit_transform(X_train_arabic.ravel())
X_val_tokenized_arabic = vectorizer.transform(X_val_arabic.ravel())

# English
X_train_english_tokenized = vectorizer.fit_transform(X_train_english.ravel())
X_val_tokenized_english = vectorizer.transform(X_val_english.ravel())

# Create a logistic regression model
model_indonesian = LogisticRegression()
model_bengali = LogisticRegression()
model_arabic = LogisticRegression()
model_english = LogisticRegression()

# Fit the model to the data
model_indonesian.fit(X_train_indonesian_tokenized, y_train_indonesian)
model_bengali.fit(X_train_bengali_tokenized, y_train_bengali)
model_arabic.fit(X_train_arabic_tokenized, y_train_arabic)
model_english.fit(X_train_english_tokenized, y_train_english)


## Test the model on the validation data

# Indonesian
y_pred_indonesian = model_indonesian.predict(X_val_tokenized_indonesian)
print()
print("INDONESIAN - Logistic Regression")
print("Accuracy:", accuracy_score(y_val_indosnesian, y_pred_indonesian))
print("Precision:", precision_score(y_val_indosnesian, y_pred_indonesian))
print("Recall:", recall_score(y_val_indosnesian, y_pred_indonesian))
print("F1:", f1_score(y_val_indosnesian, y_pred_indonesian))

# Bengali
y_pred_bengali = model_bengali.predict(X_val_tokenized_bengali)
print()
print("BENGALI - Logistic Regression")
print("Accuracy:", accuracy_score(y_val_bengali, y_pred_bengali))
print("Precision:", precision_score(y_val_bengali, y_pred_bengali))
print("Recall:", recall_score(y_val_bengali, y_pred_bengali))
print("F1:", f1_score(y_val_bengali, y_pred_bengali))

# Arabic
y_pred_arabic = model_arabic.predict(X_val_tokenized_arabic)
print()
print("ARABIC - Logistic Regression")
print("Accuracy:", accuracy_score(y_val_arabic, y_pred_arabic))
print("Precision:", precision_score(y_val_arabic, y_pred_arabic))
print("Recall:", recall_score(y_val_arabic, y_pred_arabic))
print("F1:", f1_score(y_val_arabic, y_pred_arabic))

# English
y_pred_english = model_english.predict(X_val_tokenized_english)
print()
print("ENGLISH - Logistic Regression")
print("Accuracy:", accuracy_score(y_val_english, y_pred_english))
print("Precision:", precision_score(y_val_english, y_pred_english))
print("Recall:", recall_score(y_val_english, y_pred_english))
print("F1:", f1_score(y_val_english, y_pred_english))



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


INDONESIAN - Logistic Regression
Accuracy: 0.7380352644836272
Precision: 0.7766990291262136
Recall: 0.6700167504187605
F1: 0.7194244604316546

BENGALI - Logistic Regression
Accuracy: 0.7098214285714286
Precision: 0.688
Recall: 0.7678571428571429
F1: 0.7257383966244725

ARABIC - Logistic Regression
Accuracy: 0.7923238696109358
Precision: 0.8054945054945055
Recall: 0.7707676130389064
F1: 0.7877485222998388

ENGLISH - Logistic Regression
Accuracy: 0.7121212121212122
Precision: 0.7160493827160493
Recall: 0.703030303030303
F1: 0.7094801223241589



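
The convergence warning in the output above means the default solver stopped at its iteration limit (`max_iter` defaults to 100) before converging. A minimal sketch of one way to address it follows, on made-up toy data; the value `max_iter=1000` is an illustrative assumption, not a tuned setting, and this is not the configuration that produced the scores reported here. Wrapping the vectorizer and classifier in a pipeline also gives each language its own fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for one language's merged text and labels
texts = [
    "the capital of france is paris what is the capital of france",
    "cats sleep most of the day who won the match",
    "water boils at 100 degrees when does water boil",
    "the river is long who painted the ceiling",
]
labels = [1, 0, 1, 0]

# One vectorizer per pipeline avoids refitting a shared vectorizer per language;
# max_iter=1000 is an illustrative value, not a tuned one
clf = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
predictions = clf.predict(texts)
print(predictions)
```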

#### Model 2: Random Forest

In [9]:
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
model_indonesian = RandomForestClassifier()
model_bengali = RandomForestClassifier()
model_arabic = RandomForestClassifier()
model_english = RandomForestClassifier()

# Fit the model to the data
model_indonesian.fit(X_train_indonesian_tokenized, y_train_indonesian)
model_bengali.fit(X_train_bengali_tokenized, y_train_bengali)
model_arabic.fit(X_train_arabic_tokenized, y_train_arabic)
model_english.fit(X_train_english_tokenized, y_train_english)

# Evaluate the model
# Indonesian
y_pred_indonesian = model_indonesian.predict(X_val_tokenized_indonesian)
print()
print("INDONESIAN - Random Forest")
print("Accuracy:", accuracy_score(y_val_indosnesian, y_pred_indonesian))
print("Precision:", precision_score(y_val_indosnesian, y_pred_indonesian))
print("Recall:", recall_score(y_val_indosnesian, y_pred_indonesian))
print("F1:", f1_score(y_val_indosnesian, y_pred_indonesian))

# Bengali
y_pred_bengali = model_bengali.predict(X_val_tokenized_bengali)
print()
print("BENGALI - Random Forest")
print("Accuracy:", accuracy_score(y_val_bengali, y_pred_bengali))
print("Precision:", precision_score(y_val_bengali, y_pred_bengali))
print("Recall:", recall_score(y_val_bengali, y_pred_bengali))
print("F1:", f1_score(y_val_bengali, y_pred_bengali))

# Arabic
y_pred_arabic = model_arabic.predict(X_val_tokenized_arabic)
print()
print("ARABIC - Random Forest")
print("Accuracy:", accuracy_score(y_val_arabic, y_pred_arabic))
print("Precision:", precision_score(y_val_arabic, y_pred_arabic))
print("Recall:", recall_score(y_val_arabic, y_pred_arabic))
print("F1:", f1_score(y_val_arabic, y_pred_arabic))

# English
y_pred_english = model_english.predict(X_val_tokenized_english)
print()
print("ENGLISH - Random Forest")
print("Accuracy:", accuracy_score(y_val_english, y_pred_english))
print("Precision:", precision_score(y_val_english, y_pred_english))
print("Recall:", recall_score(y_val_english, y_pred_english))
print("F1:", f1_score(y_val_english, y_pred_english))


INDONESIAN - Random Forest
Accuracy: 0.7674223341729639
Precision: 0.7973977695167286
Recall: 0.7185929648241206
F1: 0.7559471365638766

BENGALI - Random Forest
Accuracy: 0.6785714285714286
Precision: 0.6754385964912281
Recall: 0.6875
F1: 0.6814159292035398

ARABIC - Random Forest
Accuracy: 0.7765509989484752
Precision: 0.8002283105022832
Recall: 0.7371188222923238
F1: 0.7673782156540777

ENGLISH - Random Forest
Accuracy: 0.7242424242424242
Precision: 0.7211155378486056
Recall: 0.7313131313131314
F1: 0.7261785356068206


#### A note on word counts vs GloVe embeddings and BPEmb embeddings (remember for report)
Question: Just as a note, you can actually get much better performance using simple word counts -- why do you think this is?

Possible answer:
Simple word counts can sometimes outperform more complex representations such as GloVe or BPEmb embeddings because of the nature of the data and the task itself.

**In some tasks, the presence or absence of specific words can be highly indicative of the class or category. For example, in sentiment analysis, words like 'good', 'awesome', 'bad', 'terrible' etc. can be strong indicators of the sentiment. A simple word count vectorizer captures this information effectively.**

On the other hand, word embeddings like GloVe or BPEmb capture semantic and syntactic relationships between words, which can be very useful for tasks that require understanding of context or when dealing with words not present in the training set. However, these embeddings might introduce noise for tasks that can be solved based on simple word occurrence statistics.

In summary, the effectiveness of a method depends on the specific task and the nature of the data. It's always a good idea to start with simpler methods and then move to more complex ones if necessary.

#### Model 3: LSTM

The code below is taken from lab_2.ipynb and modified to fit the task at hand.

# Reading data into a model

A simple and common way that data is read in PyTorch is to use the two following classes: `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`.

The `Dataset` class can be extended to read in and store the data you are using for your experiment. The only requirements are to implement the `__len__` and `__getitem__` methods. `__len__` simply returns the size of your dataset and `__getitem__` takes an index and returns that sample from your dataset, processed in whatever way is necessary to be input to your model.

The `DataLoader` class determines how to iterate through your `Dataset`, including how to shuffle and batch your data.
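
A minimal sketch of this pattern is shown here; the class name `ToyDataset` and its data are hypothetical and only illustrate the two required methods and how a `DataLoader` batches the samples.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset wrapping parallel lists of feature vectors and labels."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one sample in the form the model expects
        return torch.tensor(self.features[idx]), torch.tensor(self.labels[idx])


dataset = ToyDataset([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], [0, 1, 1])
loader = DataLoader(dataset, batch_size=2, shuffle=False)
# The default collate function stacks the per-sample tensors into batches
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```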



In [31]:
from typing import List, Tuple

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

In [32]:
def text_to_batch_bilstm(text: List, tokenizer, max_len=512) -> Tuple[List, List]:
    """
    Creates a tokenized batch for input to a bilstm model
    :param text: A list of sentences to tokenize
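    :param tokenizer: A tokenizer exposing `encode_ids_with_eos` (e.g. a BPEmb model)
    :param max_len: Maximum number of token IDs to keep per text
    :return: Tokenized text as well as the length of the input sequence
    """
    # Tokenize to BPE IDs (with an end-of-sequence ID) and truncate to max_len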
    input_ids = [tokenizer.encode_ids_with_eos(t)[:max_len] for t in text]

    return input_ids, [len(ids) for ids in input_ids]

In [39]:
def collate_batch_bilstm(input_data: Tuple) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Combines multiple data samples into a single batch
    :param input_data: The combined input_ids, seq_lens, and labels for the batch
    :return: A tuple of tensors (input_ids, seq_lens, labels)
    """
    input_ids = [i[0][0] for i in input_data]
    seq_lens = [i[1][0] for i in input_data]
    labels = [i[2] for i in input_data]

    max_length = max([len(i) for i in input_ids])

    # Pad all of the input samples to the max length (25000 is the ID of the [PAD] token)
    input_ids = [(i + [25000] * (max_length - len(i))) for i in input_ids]

    # Make sure each sample is max_length long
    assert (all(len(i) == max_length for i in input_ids))
    return torch.tensor(input_ids), torch.tensor(seq_lens), torch.tensor(labels)
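
As an aside to the collate function above, the same right-padding can be sketched with PyTorch's built-in `pad_sequence`. The token IDs below are made up, and `PAD_ID` simply mirrors the padding ID used in the function above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 25000  # same padding ID as in the collate function above

# Made-up token ID sequences of different lengths
sequences = [[5, 8, 2], [7, 2], [9, 4, 6, 2]]
seq_lens = torch.tensor([len(s) for s in sequences])

# pad_sequence right-pads every sequence to the longest one in the batch
padded = pad_sequence(
    [torch.tensor(s) for s in sequences],
    batch_first=True,
    padding_value=PAD_ID,
)
print(padded)
print(seq_lens)
```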

In [42]:

# This will load the dataset and process it lazily in the __getitem__ function
class ClassificationDatasetReader(Dataset):
  def __init__(self, df, tokenizer):
    self.df = df
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    row = self.df.values[idx]
    # Calls the text_to_batch_bilstm function on a single-element list
    input_ids,seq_lens = text_to_batch_bilstm([row[0]], self.tokenizer)
    label = row[1]
    return input_ids, seq_lens, label

In [43]:
from bpemb import BPEmb

# Load the Indonesian BPEmb model (lang='id') with 25k word-pieces
bpemb_id= BPEmb(lang='id', dim=100, vs=25000)

In [44]:
# Extract the embeddings and add a zero-initialized embedding for our extra [PAD] token
pretrained_embeddings = np.concatenate([bpemb_id.emb.vectors, np.zeros(shape=(1,100))], axis=0)
# Extract the vocab and add an extra [PAD] token
vocabulary = bpemb_id.emb.index_to_key + ['[PAD]']

In [45]:
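# NOTE: bpemb_id is the Indonesian BPEmb model, but it is applied to Bengali text here;
# a tokenizer trained on Bengali would likely produce fewer unknown-token IDs
reader = ClassificationDatasetReader(df_train_bengali_merged, bpemb_id)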
reader[0]

([[24764,
   0,
   24764,
   0,
   59,
   0,
   24795,
   24764,
   24955,
   24835,
   24837,
   24913,
   24835,
   24846,
   24764,
   24965,
   24826,
   15738,
   24764,
   0,
   24791,
   0,
   24791,
   0,
   24795,
   24764,
   0,
   24767,
   0,
   24776,
   0,
   24770,
   24764,
   24967,
   24763,
   0,
   24784,
   8606,
   187,
   0,
   24780,
   0,
   24801,
   0,
   1081,
   690,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24784,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   59,
   0,
   24795,
   19359,
   0,
   24882,
   0,
   24835,
   24837,
   2286,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   24764,
   0,
   59,
   24841,
   24858,
   0,
   16176,
   24831,


# Creating the model

Next we will create a BiLSTM model with BPE word-piece embeddings. In this case we will extend the PyTorch class `torch.nn.Module`. To create your own module, you need only define your model architecture in the `__init__` method, and define how tensors are processed by your model in the `forward` method.

In [47]:
from torch import nn

# Define a default lstm_dim
lstm_dim = 100

In [49]:

# Define the model
class BiLSTMNetwork(nn.Module):
    """
    Basic BiLSTM network
    """
    def __init__(
            self,
            pretrained_embeddings: torch.Tensor,
            lstm_dim: int,
            dropout_prob: float = 0.1,
            n_classes: int = 2
    ):
        """
        Initializer for basic BiLSTM network
        :param pretrained_embeddings: A tensor containing the pretrained BPE embeddings
        :param lstm_dim: The dimensionality of the BiLSTM network
        :param dropout_prob: Dropout probability
        :param n_classes: The number of output classes
        """

        # First thing is to call the superclass initializer
        super(BiLSTMNetwork, self).__init__()

        # We'll define the network in a ModuleDict, which makes organizing the model a bit nicer
        # The components are an embedding layer, a 1-layer BiLSTM, and a feed-forward output layer
        self.model = nn.ModuleDict({
            'embeddings': nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=pretrained_embeddings.shape[0] - 1),
            'bilstm': nn.LSTM(
                pretrained_embeddings.shape[1],
                lstm_dim,
                1,
                batch_first=True,
                dropout=dropout_prob,
                bidirectional=True),
            'cls': nn.Linear(2*lstm_dim, n_classes)
        })
        self.n_classes = n_classes
        self.dropout = nn.Dropout(p=dropout_prob)

        # Initialize the weights of the model
        self._init_weights()

    def _init_weights(self):
        all_params = list(self.model['bilstm'].named_parameters()) + \
                     list(self.model['cls'].named_parameters())
        for n,p in all_params:
            if 'weight' in n:
                nn.init.xavier_normal_(p)
            elif 'bias' in n:
                nn.init.zeros_(p)

    def forward(self, inputs, input_lens, labels = None):
        """
        Defines how tensors flow through the model
        :param inputs: (b x sl) The IDs into the vocabulary of the input samples
        :param input_lens: (b) The length of each input sequence
        :param labels: (b) The label of each sample
        :return: (loss, logits) if `labels` is not None, otherwise just (logits,)
        """

        # Get embeddings (b x sl x edim)
        embeds = self.model['embeddings'](inputs)

        # Pack padded: This is necessary for padded batches input to an RNN
        lstm_in = nn.utils.rnn.pack_padded_sequence(
            embeds,
            input_lens.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        # Pass the packed sequence through the BiLSTM
        lstm_out, hidden = self.model['bilstm'](lstm_in)

        # Unpack the packed sequence --> (b x sl x 2*lstm_dim)
        lstm_out,_ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)

        # Max pool over the sequence dimension (dim 1) --> (b x 2*lstm_dim)
        ff_in = self.dropout(torch.max(lstm_out, 1)[0])
        # Some magic to get the last output of the BiLSTM for classification (b x 2*lstm_dim)
        #ff_in = lstm_out.gather(1, input_lens.view(-1,1,1).expand(lstm_out.size(0), 1, lstm_out.size(2)) - 1).squeeze()

        # Get logits (b x n_classes)
        logits = self.model['cls'](ff_in).view(-1, self.n_classes)
        outputs = (logits,)
        if labels is not None:
            # Xentropy loss
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)
            outputs = (loss,) + outputs

        return outputs
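
One detail of the pooling step in `forward`: `pad_packed_sequence` fills padded positions with zeros by default, so a plain max over the sequence dimension can return 0 for a feature whose real values are all negative in a shorter sequence. A minimal sketch of a masked alternative follows; the function name `masked_max_pool` is hypothetical and it is not part of the model defined above.

```python
import torch


def masked_max_pool(lstm_out: torch.Tensor, input_lens: torch.Tensor) -> torch.Tensor:
    """Max pool over the sequence dimension while ignoring padded positions.

    lstm_out: (b x sl x d) padded outputs, input_lens: (b) true lengths (all >= 1).
    """
    positions = torch.arange(lstm_out.size(1), device=lstm_out.device)
    # mask[b, t] is True where t is a real (non-padded) timestep
    mask = positions.unsqueeze(0) < input_lens.to(lstm_out.device).unsqueeze(1)
    # Padded positions are set to -inf so they can never win the max
    filled = lstm_out.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    return filled.max(dim=1)[0]


# Toy batch: the second sequence has length 1, its padded step is all zeros
out = torch.tensor([[[-0.5, 0.2], [-0.1, 0.4]],
                    [[-0.3, -0.7], [0.0, 0.0]]])
lens = torch.tensor([2, 1])
pooled = masked_max_pool(out, lens)
print(pooled)
```

For the second toy sequence, the unmasked `out.max(dim=1)` would pick the zero padding, whereas the masked version keeps the real values.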



In [52]:
device = torch.device("cpu")
if torch.cuda.is_available():
  print("cuda available")
  device = torch.device("cuda")

In [53]:
# Create the model
model = BiLSTMNetwork(
    pretrained_embeddings=torch.FloatTensor(pretrained_embeddings),
    lstm_dim=lstm_dim,
    dropout_prob=0.1,
    n_classes=2
  ).to(device)


In [54]:
def accuracy(logits, labels):
  logits = np.asarray(logits).reshape(-1, len(logits[0]))
  labels = np.asarray(labels).reshape(-1)
  return np.sum(np.argmax(logits, axis=-1) == labels).astype(np.float32) / float(labels.shape[0])

In [55]:
from tqdm.notebook import tqdm

In [56]:
def evaluate(model: nn.Module, valid_dl: DataLoader):
  """
  Evaluates the model on the given dataset
  :param model: The model under evaluation
  :param valid_dl: A `DataLoader` reading validation data
  :return: (acc, labels_all, logits_all) The accuracy plus the collected labels and logits
  """
  # VERY IMPORTANT: Put your model in "eval" mode -- this disables dropout and
  # switches layers such as batch normalization to their inference behavior
  model.eval()
  labels_all = []
  logits_all = []

  # ALSO IMPORTANT: Don't accumulate gradients during this process
  with torch.no_grad():
    for batch in tqdm(valid_dl, desc='Evaluation'):
      batch = tuple(t.to(device) for t in batch)
      input_ids = batch[0]
      seq_lens = batch[1]
      labels = batch[2]

      _, logits = model(input_ids, seq_lens, labels=labels)
      labels_all.extend(list(labels.detach().cpu().numpy()))
      logits_all.extend(list(logits.detach().cpu().numpy()))
    acc = accuracy(logits_all, labels_all)

    return acc,labels_all,logits_all

In [57]:
def train(
    model: nn.Module,
    train_dl: DataLoader,
    valid_dl: DataLoader,
    optimizer: torch.optim.Optimizer,
    n_epochs: int,
    device: torch.device,
    patience: int = 10
):
  """
  The main training loop which will optimize a given model on a given dataset
  :param model: The model being optimized
  :param train_dl: The training dataset
  :param valid_dl: A validation dataset
  :param optimizer: The optimizer used to update the model parameters
  :param n_epochs: Number of epochs to train for
  :param device: The device to train on
  :param patience: Number of epochs without validation improvement before stopping early
  :return: (model, losses) The best model and the losses per iteration
  """

  # Keep track of the loss and best accuracy
  losses = []
  best_acc = 0.0
  pcounter = 0

  # Iterate through epochs
  for ep in range(n_epochs):

    loss_epoch = []

    #Iterate through each batch in the dataloader
    for batch in tqdm(train_dl):
      # VERY IMPORTANT: Make sure the model is in training mode, which turns on
      # dropout and the training-time behavior of layers such as batch normalization
      model.train()

      # VERY IMPORTANT: zero out all of the gradients on each iteration -- PyTorch
      # keeps track of these dynamically in its computation graph so you need to explicitly
      # zero them out
      optimizer.zero_grad()

      # Move each tensor to the target device (GPU if available, otherwise CPU)
      batch = tuple(t.to(device) for t in batch)
      input_ids = batch[0]
      seq_lens = batch[1]
      labels = batch[2]

      # Pass the inputs through the model, get the current loss and logits
      loss, logits = model(input_ids, seq_lens, labels=labels)
      losses.append(loss.item())
      loss_epoch.append(loss.item())

      # Calculate all of the gradients and weight updates for the model
      loss.backward()

      # Optional: clip gradients
      #torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

      # Finally, update the weights of the model
      optimizer.step()
      #gc.collect()

    # Perform inline evaluation at the end of the epoch
    acc,_,_ = evaluate(model, valid_dl)
    print(f'Validation accuracy: {acc}, train loss: {sum(loss_epoch) / len(loss_epoch)}')

    # Keep track of the best model based on the accuracy
    if acc > best_acc:
      torch.save(model.state_dict(), 'best_model')
      best_acc = acc
      pcounter = 0
    else:
      pcounter += 1
      if pcounter == patience:
        break
        #gc.collect()

  model.load_state_dict(torch.load('best_model'))
  return model, losses

In [58]:
from torch.optim import Adam

In [59]:
# Define some hyperparameters
batch_size = 32
lr = 3e-4
n_epochs = 100

In [61]:
# Create the dataset readers
train_dataset = ClassificationDatasetReader(df_train_bengali_merged[:5000], bpemb_id)
# Samples are tokenized lazily in __getitem__; collate_batch_bilstm pads each batch
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch_bilstm)

valid_dataset = ClassificationDatasetReader(df_val_bengali_merged[:1000], bpemb_id)
valid_dl = DataLoader(valid_dataset, batch_size=len(df_val_bengali_merged[:1000]), collate_fn=collate_batch_bilstm)

# Create the optimizer
optimizer = Adam(model.parameters(), lr=lr)

# Train
model, losses = train(model, train_dl, valid_dl, optimizer, n_epochs, device)

Validation accuracy: 0.7053571428571429, train loss: 0.6428279948234558
Validation accuracy: 0.7098214285714286, train loss: 0.5990146553516388
Validation accuracy: 0.7008928571428571, train loss: 0.5851875883340836
Validation accuracy: 0.7098214285714286, train loss: 0.5742191980282466
Validation accuracy: 0.7008928571428571, train loss: 0.569000807205836
Validation accuracy: 0.6964285714285714, train loss: 0.5626910463968913
Validation accuracy: 0.6830357142857143, train loss: 0.562864805261294
Validation accuracy: 0.71875, train loss: 0.5586183309555054

In [None]:
import matplotlib.pyplot as plt

plt.plot(losses)
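
Per-batch losses tend to be noisy, so a smoothed curve can be easier to read. The sketch below shows a simple moving average; the helper name `moving_average` and the default window of 50 are assumptions for illustration.

```python
import numpy as np


def moving_average(values, window: int = 50) -> np.ndarray:
    """Smooth a non-empty sequence of per-batch losses with a sliding mean."""
    values = np.asarray(values, dtype=float)
    # Never use a window longer than the series itself
    window = max(1, min(window, len(values)))
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the full window fits
    return np.convolve(values, kernel, mode="valid")


smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=2)
print(smoothed)
```

With the `losses` list returned by `train`, this could be plotted as `plt.plot(moving_average(losses))`.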