Group work (2 students)

<h2 style="text-align: center">344.075 KV: Natural Language Processing (WS2021/22)</h2>
<h1 style="color:rgb(0,120,170)">Assignment 3</h1>
<h2 style="color:rgb(0,120,170)">Document Classification with PyTorch and BERT</h2>

<h2>Table of contents</h2>
<ol>
    <a href="#section-general-guidelines"><li style="font-size:large;font-weight:bold">General Guidelines</li></a>
    <a href="#section-taskA"><li style="font-size:large;font-weight:bold">Task A: Document Classification with PyTorch (25 points)</li></a>
    <a href="#section-taskB"><li style="font-size:large;font-weight:bold">Task B: Document Classification with BERT (15 points)</li></a>
    <a href="#section-tensorboard"><li style="font-size:large;font-weight:bold">Task C: Publishing Results with TensorBoard (2 extra points)</li></a>
    
</ol>

<a name="section-general-guidelines"></a><h2 style="color:rgb(0,120,170)">General Guidelines</h2>

<div style="background-color:rgb(224, 243, 255)">

### Assignment objective
This assignment aims to provide the necessary practices for learning the principles of deep learning programing in NLP using PyTorch. To this end, Task A provides the space for becoming fully familiar with PyTorch programming by implementing a "simple" document (sentence) classification model with PyTorch, and Task B extends this classifier with a BERT model. As the assignment requires working with PyTorch and Huggingface Transformers, please familiarize yourself with these libraries using any possible available teaching resources in particular the libraries' documentations. The assignment has in total **40 points**, and also offers **2 extra points** which can cover any missing point.

This Notebook encompasses all aspects of the assignment, namely the descriptions of tasks as well as your solutions and reports. Feel free to add any required cell for solutions. The cells can contain code, reports, charts, tables, or any other material, required for the assignment. Feel free to provide the solutions in an interactive and visual way! 

Please discuss any unclear point in the assignment in the provided forum in MOODLE. It is also encouraged to provide answers to your peer's questions. However when submitting a post, keep in mind to avoid providing solutions. Please let the tutor(s) know shall you find any error or unclarity in the assignment.

### Implementation & Libraries

The assignment should be implemented with recent versions of `Python`, `PyTorch` and, `transformers`. Any standard Python library can be used, so far that the library is free and can be simply installed using `pip` or `conda`. Examples of potentially useful libraries are `scikit-learn`, `numpy`, `scipy`, `gensim`, `nltk`, `spaCy`, and `AllenNLP`. Use the latest stable version of each library.

### Submission

Each group submits one Notebook file (`.ipynb`) through MOODLE. Do not forget to put in your names and student numbers in the first cell of the Notebook. **In the submitted Notebook, all the results and visualizations should already be present, and can be observed simply by loading the Notebook in a browser.** The Notebook must be self-contained, meaning that one can run all the cells from top to bottom without any error. If you need to include extra files in the submission, compress all files (together with the Notebook) in a `zip` file and submit the zip file to MOODLE. You do not need to include the data files in the submission.

Cover the questions/points, mentioned in the tasks, but also add any necessary point for understanding your experiments.  


### Dataset

To conduct the experiments, two datasets are provided. The datasets are taken from the data of `thedeep` project, produced by the DEEP (https://www.thedeep.io) platform. The DEEP is an open-source platform, which aims to facilitate processing of textual data for international humanitarian response organizations. The platform enables the classification of text excerpts, extracted from news and reports into a set of domain specific classes. The provided dataset has 12 classes (labels) like agriculture, health, and protection. 

The difference between the datasets is in their sizes. We refer to these as `medium` and `small`, containing an overall number of 38,000 and 12,000 annotated text excerpts, respectively. Select one of the datasets, and use it for all of the tasks. `medium` provides more data and therefore reflects a more realistic scenario. `small` is however provided for the sake of convenience, particularly if running the experiments on your available hardware takes too long. Using `medium` is generally recommended, but from the point of view of assignment grading, there is no difference between the datasets.

Download the dataset from [this link](https://drive.jku.at/filr/public-link/file-download/0cce88f07c9c862b017c9cfba294077a/33590/5792942781153185740/nlp2021_22_data.zip).

Whether `medium` or `small`, you will find the following files in the provided zip file:
- `thedeep.$name$.train.txt`: Train set in csv format with three fields: sentence_id, text, and label.
- `thedeep.$name$.validation.txt`: Validation set in csv format with three fields: sentence_id, text, and label.
- `thedeep.$name$.test.txt`: Test set in csv format with three fields: sentence_id, text, and label.
- `thedeep.$name$.label.txt`: Captions of the labels.
- `README.txt`: Terms of use of the dataset.

</div>


<a name="section-taskA"></a><h2 style="color:rgb(0,120,170)">Task A: Document Classification with PyTorch (25 points)</h2>

<div style="background-color:rgb(224, 243, 255)">

The aim of this task is identical to the one of Assignment 2 - Task B, namely to design a document classification model that exploits pre-trained word embeddings. It is of course allowed to use the preprocessed text, the dictionary, or any other relevant code or processings, done in the previous assignments.

In this task, you implement a document classification model using PyTorch, which given a document/sentence (consisting of a set of words) predicts the corresponding class. Before getting started with coding, have a look at the optional <a href="#section-tensorboard">Task C</a>, as you may want to already include `Tensorboard` in the code. The implementation of the classifier should cover the points below.

**Preprocessing and dictionary (1 point):** Following previous assignments, load the train, validation, and test datasets, apply necessary preprocessing steps, and create a dictionary of words. 

**Data batching (4 points):** Using the dictionary, create batches for any given dataset (train/validation/test). Each batch is a two-dimensional matrix of *batch-size* to *max-document-length*, containing the IDs of the words in the corresponding documents. *Batch-size* and *max-document-length* are two hyper-parameters and can be set to any appropriate values (*Batch-size* must be higher than 1 and *max-document-length* at least 50 words). If a document has more than *max-document-length* words, only the first *max-document-length* words should be kept.

**Word embedding lookup (2 point):** Using `torch.nn.Embedding`, create a lookup for the embeddings of all the words in the dictionary. The lookup is in fact a matrix, which maps the ID of each word to the corresponding word vector. Similar to Assignment 2, use the pre-trained vectors of a word embedding model (like word2vec or GloVe) to initialize the word embeddings of the lookup. Keep in mind that the embeddings of the words in the lookup should be matched with the correct vector in the pretrained word embedding. If the vector of a word in the lookup does not exist in the pretrained word embeddings, the corresponding vector should be initialized randomly. 

**Model definition (3 points):** Define the class `ClassificationAverageModel` as a PyTorch model. In the initialization procedure, the model receives the word embedding lookup, and includes it in the model as model's parameters. These embeddings parameters should be trainable, meaning that the word vectors get updated during model training. Feel free to add any other parameters to the model, which might be necessary for accomplishing the functionalities explained in the following.

**Forward function (5 points):** The forward function of the model receives a batch of data, and first fetches the corresponding embeddings of the word IDs in the batch using the lookup. Similar to Assignment 2, the embedding of a document is created by calculating the *element-wise mean* of the embeddings of the document's words. Formally, given the document $d$, consisting of words $\left[ v_1, v_2, ..., v_{|d|} \right]$, the document representation $\mathbf{e}_d$ is defined as:

<center><div>$\mathbf{e}_d = \frac{1}{|d|}\sum_{i=1}^{|d|}{\mathbf{e}_{v_i}}$</div></center>

where $\mathbf{e}_{v}$ is the vector of the word $v$, and $|d|$ is the length of the document. An important point in the implementation of this formula is that the documents in the batch might have different lengths and therefore each document should be divided by its corresponding $|d|$. Finally, this document embedding is utilized to predict the probability of the output classes, done by applying a linear transformation from the embeddings size to the number of classes, followed by Softmax. The linear transformation also belongs to the model's parameters and will be learned in training.

**Loss Function and optimization (2 point):** The loss between the predicted and the actual classes is calculated using Negative Log Likelihood or Cross Entropy. Update the model's parameters using any appropriate optimization mechanism such as Adam.

**Early Stopping (2 points):** After each epoch, evaluate the model on the *validation set* using accuracy. If the evaluation result (accuracy) improves, save the model as the best performing one so far. If the results are not improving after a certain number of evaluation rounds (set as another hyper-parameter) or if training reaches a certain number of epochs, terminate the training procedure. 

**Test Set Evaluation (1 points):** After finishing the training, load the (already stored) best performing model, and use it for class prediction on the test set. Evaluate and report the final results.

**Overall functionality of the training procedure (5 point).**

</div>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

  warn(f"Failed to load image Python extension: {e}")


## Preprocessing and dictionary: 

Following previous assignments, load the train, validation, and test datasets, apply necessary preprocessing steps, and create a dictionary of words.

In [2]:
size = "small"
#size = "medium"
data_nlp = Path("nlp2021_22_data")
data_nlp_nlp = data_nlp / Path("nlpwdl2021_data")

In [3]:
labels = (data_nlp_nlp / "thedeep.labels.txt").read_text().splitlines()
labels = [x.split(",")[1] for x in labels]
labels

['Agriculture',
 'Cross',
 'Education',
 'Food',
 'Health',
 'Livelihood',
 'Logistic',
 'NFI',
 'Nutrition',
 'Protection',
 'Shelter',
 'WASH']

In [4]:
Z_test_dir = (data_nlp_nlp / f"thedeep.{size}.test.txt")
Z_train_dir = (data_nlp_nlp / f"thedeep.{size}.train.txt")
Z_validation_dir = (data_nlp_nlp / f"thedeep.{size}.validation.txt")

In [5]:
pd_test = pd.read_csv(Z_test_dir, sep=',', names=["id", "text", "label"])#, engine='python', encoding="unicode_escape", error_bad_lines=False)
pd_train = pd.read_csv(Z_train_dir, sep=',', names=["id", "text", "label"])#, engine='python', encoding="unicode_escape", error_bad_lines=False)
pd_vali = pd.read_csv(Z_validation_dir, sep=',', names=["id", "text", "label"])#, engine='python', encoding="unicode_escape", error_bad_lines=False)

In [6]:
pd_train

Unnamed: 0,id,text,label
0,6615,Cholera Daily Situation Report as of 4 Novembe...,4
1,659,"12 321 people affected, five deaths, one perso...",10
2,8591,Violent clashes and inter-communal tensions ha...,3
3,8373,AT least 12 people have been killed and severa...,5
4,10125,"Unidentified gunmen attacked a civilian home, ...",9
...,...,...,...
8395,528,Sandbag walls have failed to hold back the flo...,10
8396,2272,"More than 8,000 displaced civilians have been ...",10
8397,11493,Protection Durable Solutions ? Most of the...,9
8398,3816,She said access challenges have continued to d...,3


In [7]:
X_train, X_test, X_vali = [x.text.to_numpy() for x in [pd_train, pd_test, pd_vali]]
y_train, y_test, y_vali = [y.label.to_numpy() for y in [pd_train, pd_test, pd_vali]]

In [8]:
data_path = Path("vocab_words_ordered_small.txt")
pd_word_ids = pd.read_csv(data_nlp / data_path, sep='\n', names=["name"])

In [9]:
pd_word_ids["idxs"] = pd_word_ids.index
pd_word_ids = pd_word_ids.set_index('name')

In [10]:
words_ids_dict = pd_word_ids.to_dict()["idxs"]

In [11]:
# I added this cell afterwards to prevent an error, even though it´s not a perfect fix

words_ids_dict2 = {}
for i, word in enumerate(words_ids_dict):
    words_ids_dict2[word] = i
    
words_ids_dict = words_ids_dict2

In [12]:
len(words_ids_dict), words_ids_dict

(14197,
 {'[UNK]': 0,
  'the': 1,
  'of': 2,
  'and': 3,
  'in': 4,
  'to': 5,
  'a': 6,
  'are': 7,
  'is': 8,
  'for': 9,
  'have': 10,
  'as': 11,
  'on': 12,
  'food': 13,
  'from': 14,
  'by': 15,
  'with': 16,
  'that': 17,
  'people': 18,
  'were': 19,
  'has': 20,
  'been': 21,
  'at': 22,
  'water': 23,
  'cases': 24,
  'reported': 25,
  'children': 26,
  'their': 27,
  'health': 28,
  'areas': 29,
  '2017': 30,
  'an': 31,
  'be': 32,
  'was': 33,
  'affected': 34,
  'or': 35,
  'this': 36,
  'more': 37,
  'access': 38,
  'than': 39,
  'they': 40,
  'due': 41,
  'also': 42,
  'not': 43,
  'which': 44,
  'some': 45,
  'including': 46,
  'most': 47,
  'said': 48,
  'humanitarian': 49,
  'per': 50,
  'who': 51,
  'since': 52,
  'displaced': 53,
  'there': 54,
  'will': 55,
  'it': 56,
  'over': 57,
  'security': 58,
  'million': 59,
  'assistance': 60,
  'need': 61,
  'households': 62,
  'number': 63,
  'during': 64,
  'these': 65,
  'refugees': 66,
  'percent': 67,
  'cent': 68

In [13]:
def show_random_samples(df, how_many = 3):
    for _ in range(how_many):
        x = np.random.choice(len(df))
        l = pd_train.iloc[x].label
        idd = pd_train.iloc[x].id
        print(f"label: {l} = {labels[l]}\t\t id = {idd}", )    
        print(pd_train.iloc[x].text)
        print("\n")

In [14]:
show_random_samples(pd_train, 100)

label: 3 = Food		 id = 5830
Domestic prices of rice, the main staple food in the country, rose considerably for four consecutive months, reaching record levels in December 2016. The spike in prices is due to the reduced 2016 secondary “yala” output, harvested in September, and the unfavourable prospects for the main 2017 “maha” crop. I


label: 10 = Shelter		 id = 10646
UNHCR has been working to bolster the quality of shelters in the camps by supplying higher quality materials as well as expanding technical support for construction and drainage.


label: 3 = Food		 id = 9250
Thus, from June to September 2017, 28% of the population (about 855,800 people) will be in IPC Phases 3 & 4, and would need an emergency action to protect their livelihood, reduce the food deficit and acute malnutrition. In the Southeastern regions, this proportion is more important (34%) than in the South (24%); which can be explained by the numerous emergency interventions carried out since the end of 2015. House

## Data batching:

Using the dictionary, create batches for any given dataset (train/validation/test). Each batch is a two-dimensional matrix of batch-size to max-document-length, containing the IDs of the words in the corresponding documents. Batch-size and max-document-length are two hyper-parameters and can be set to any appropriate values (Batch-size must be higher than 1 and max-document-length at least 50 words). If a document has more than max-document-length words, only the first max-document-length words should be kept.

In [15]:
batch_size = 200
max_doc_length = 70
shuffle_train = False

np.random.seed(7)

In [16]:
len(X_train) / 200

42.0

In [17]:
# since ouer vocab_words are all lowercase & almost without punctuation we need to process ouer input equivalently
import string
import collections
string.punctuation = string.punctuation + "“"


def standardize(word):
    word = word.strip(string.punctuation)    
    word = word.lower()
    return word

def data_batching(data, labels, batch_size, max_doc_length, shuffle_train=False):
    size = len(data)
    sentence_ids_list = np.arange(size)
    tens = lambda t: torch.tensor(t, dtype=torch.int64) # just for convinience
    
    if shuffle_train: np.random.shuffle(sentence_ids_list)
        
    batch = np.zeros((batch_size, max_doc_length), dtype=int)
    batch_i = 0
    list_of_real_lengths = []
    labels_mask = []
    
    for si in sentence_ids_list: # loop trough all sentences in e.g.: the X_train

        words = data[si].split()[ :max_doc_length]
        
        sent_of_ids = []
        for w in [standardize(word) for word in words]:
            try:
                sent_of_ids.append(words_ids_dict[ w ])
            except KeyError:
                sent_of_ids.append(0)
        
        batch[ batch_i ][0:len(sent_of_ids)] = np.array(sent_of_ids, dtype=int)
        batch_i += 1    
        list_of_real_lengths.append( len(sent_of_ids) )
        labels_mask.append(si)
        
        if batch_i == batch_size or si == sentence_ids_list[-1]: # the or is for the case len(batch) < batch_size
            
            yield tens(batch[:]), tens(labels[ labels_mask ]), tens(list_of_real_lengths) 
            
            batch_i = 0
            batch = np.zeros((batch_size, max_doc_length), dtype=int) 
            list_of_real_lengths = []
            labels_mask = []

In [18]:
batch_gen = data_batching(X_train, y_train, batch_size, max_doc_length, shuffle_train)

In [19]:
first, fi_labels, fi_lenghts = next(batch_gen)
second, se_labels, se_lenghts = next(batch_gen)

In [20]:
first.shape, first, fi_lenghts

(torch.Size([200, 70]),
 tensor([[ 109,  581,   93,  ..., 2610,    3, 4256],
         [ 275, 5020,   18,  ...,    0,    0,    0],
         [1340,  588,    3,  ...,    0,    0,    0],
         ...,
         [ 397,   80,    1,  ...,    4,   75,  229],
         [  52, 7468,  176,  ...,    5,    1,  569],
         [  49,    3,  156,  ...,   18,    7,  133]]),
 tensor([70, 46, 37, 70, 36, 58, 70, 60, 70, 22, 66, 55, 70, 42, 26, 30, 70, 68,
         22, 45, 70, 32, 43, 70, 39, 70, 41, 53, 33, 40, 34, 32, 46, 46, 17, 62,
         36, 35, 70, 23, 70, 55, 53, 60, 70, 26, 49, 49, 70, 70, 46, 70, 57, 70,
         20, 48, 70, 70, 36, 31, 70, 62, 66, 13, 70, 70, 66, 70, 70, 32, 15, 70,
         70, 60, 18, 70, 45, 70, 60, 43, 22, 66, 47, 70, 23, 70, 70, 28, 59, 62,
         70, 47, 70, 25, 41, 41, 36, 70, 54, 48, 28, 41, 63, 12, 56, 70, 70, 70,
         38, 70, 28, 70, 25, 53, 70, 33, 15, 65, 65, 33, 70,  9, 68, 70, 48, 21,
         59, 49, 70, 51, 70, 46,  8, 70, 70, 23, 70, 19, 52, 70, 70, 63, 70

In [21]:
second.shape, second, se_lenghts.shape

(torch.Size([200, 70]),
 tensor([[   1, 1453, 5561,  ...,    0,    0,    0],
         [   1, 3809,  451,  ...,    0,    0,    0],
         [ 480,  511,    6,  ..., 2303,    4, 2237],
         ...,
         [  38,    5,  281,  ...,    0,    0,    0],
         [ 186,    0,  310,  ...,    4, 1062, 4226],
         [  81,    2, 6007,  ...,  487,  254,   38]]),
 torch.Size([200]))

## Word embedding lookup:

Using torch.nn.Embedding, create a lookup for the embeddings of all the words in the dictionary. The lookup is in fact a matrix, which maps the ID of each word to the corresponding word vector. Similar to Assignment 2, use the pre-trained vectors of a word embedding model (like word2vec or GloVe) to initialize the word embeddings of the lookup. Keep in mind that the embeddings of the words in the lookup should be matched with the correct vector in the pretrained word embedding. If the vector of a word in the lookup does not exist in the pretrained word embeddings, the corresponding vector should be initialized randomly.

In [22]:
df = pd.read_csv(data_nlp / Path("glove.6B.100d.txt"), sep=" ", quoting=3, header=None, index_col=0)
gloveModel = {key: val.values for key, val in df.T.items()}

In [23]:
gloveModel["service"], gloveModel["service"].shape

(array([-0.4224  , -0.13313 , -0.41418 , -0.23677 ,  0.19041 , -0.32738 ,
        -0.23698 ,  0.57607 , -0.072985,  0.035825,  0.31916 , -0.33207 ,
         0.26005 ,  0.29836 ,  0.026027, -1.0519  ,  1.4233  , -0.28315 ,
        -0.75118 ,  0.19966 , -0.44334 , -0.25169 ,  0.12302 ,  0.12018 ,
         0.48829 , -0.29525 , -0.095826,  0.37184 ,  0.046189,  0.029206,
         0.12688 ,  1.0816  , -0.25109 , -0.42187 ,  0.22496 ,  0.44294 ,
        -0.98031 , -0.070257,  0.2825  , -0.069401,  0.40148 , -0.34647 ,
        -0.22201 , -0.044113,  0.62697 , -0.035738,  0.35029 , -0.80169 ,
         0.4902  , -0.30755 ,  0.72715 ,  0.19385 , -0.066447,  0.93629 ,
        -0.035545, -1.9468  ,  0.3688  , -0.32078 ,  3.1815  ,  0.70017 ,
        -0.1323  , -0.32202 ,  0.35374 , -0.22017 , -0.014307, -0.2664  ,
        -0.24965 , -0.057606,  1.3354  ,  0.63444 , -0.22013 , -0.44862 ,
         0.1921  , -0.61758 ,  0.73737 ,  0.19194 ,  0.67979 ,  0.49879 ,
        -0.96222 , -0.85897 ,  0.41978

In [24]:
number_of_words = len(words_ids_dict)
glove_dim = 100
random_embedding_stdv = 1.0 # 1.0 == numpy default

lookup_matrix = np.zeros((number_of_words, glove_dim))
n_words_found = 0

for i, word in enumerate(words_ids_dict):
    try: 
        lookup_matrix[i] = gloveModel[word] #28.1.22: da is die unstimmigkeit: weil len(words_ids_dict) = 14197 aber nicht 14199
        n_words_found += 1
    except KeyError:
        lookup_matrix[i] = np.random.normal(scale=random_embedding_stdv, size=(glove_dim, ))

lookup_matrix = torch.tensor(lookup_matrix, dtype=torch.float)
        
print(f"Ouer wordsidsdict had {number_of_words} words & only {number_of_words-n_words_found} vecs had to be created randomly")

Ouer wordsidsdict had 14197 words & only 2697 vecs had to be created randomly


In [25]:
lookup_matrix.shape

torch.Size([14197, 100])

In [26]:
emb_layer = torch.nn.Embedding.from_pretrained( lookup_matrix )
emb_layer.weight[3] # this is nearly the same as emb_layer( torch.tensor([3]) )

tensor([-0.0720,  0.2313,  0.0237, -0.5064,  0.3392,  0.1959, -0.3294,  0.1836,
        -0.1806,  0.2896,  0.2045, -0.5496,  0.2740,  0.5833,  0.2047, -0.4923,
         0.1997, -0.0702, -0.8805,  0.2948,  0.1407, -0.1009,  0.9945,  0.3697,
         0.4455,  0.2900, -0.1376, -0.5637, -0.0294, -0.4122, -0.2527,  0.6318,
        -0.4477,  0.2436, -0.1081,  0.2516,  0.4697,  0.3755, -0.2361, -0.1413,
        -0.4454, -0.6574, -0.0424, -0.2864, -0.2881,  0.0638,  0.2028, -0.5354,
         0.4131, -0.5972, -0.3861,  0.1939, -0.1781,  1.6618, -0.0118, -2.3737,
         0.0584, -0.2698,  1.2823,  0.8192, -0.2232,  0.7293, -0.0532,  0.4351,
         0.8501, -0.4293,  0.9266,  0.3905,  1.0585, -0.2456, -0.1826, -0.5328,
         0.0595, -0.6602,  0.1899,  0.2884, -0.2434,  0.5278, -0.6576, -0.1408,
         1.0491,  0.5134, -0.2382,  0.6989, -1.4813, -0.2487, -0.1794, -0.0591,
        -0.0806, -0.4878,  0.0145, -0.6259, -0.3237,  0.4186, -1.0807,  0.4674,
        -0.4993, -0.7189,  0.8689,  0.19

In [27]:
#emb_layer.weight[first].shape, emb_layer.weight[first]

In [28]:
#torch.sum(emb_layer.weight[first], 1).shape, torch.sum(emb_layer.weight[first], 1) # that !-------

In [29]:
#torch.sum(emb_layer.weight[first], 1)[0] == torch.sum(emb_layer.weight[first[0]], 0)

In [30]:
#final = torch.sum(emb_layer.weight[first], 1) / fi_lenghts[:, None] # this is for the forward() function later on
#final.shape, final

## Model definition:

Define the class ClassificationAverageModel as a PyTorch model. In the initialization procedure, the model receives the word embedding lookup, and includes it in the model as model's parameters. These embeddings parameters should be trainable, meaning that the word vectors get updated during model training. Feel free to add any other parameters to the model, which might be necessary for accomplishing the functionalities explained in the following.

For the procedure we oriented ouerself with the help of these docs: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#define-the-model

## Forward function:

The forward function of the model receives a batch of data, and first fetches the corresponding embeddings of the word IDs in the batch using the lookup. Similar to Assignment 2, the embedding of a document is created by calculating the element-wise mean of the embeddings of the document's words. Formally, given the document  𝑑 , consisting of words  [𝑣1,𝑣2,...,𝑣|𝑑|] , the document representation  𝐞𝑑  is defined as:....

In [31]:
from torch import nn

class ClassificationAverageModel(nn.Module):
    def __init__(self, embedding_lookup, num_classes):
        super(ClassificationAverageModel, self).__init__()
        self.vocab_size, self.embed_dim = embedding_lookup.shape 
        #self.embedding = torch.nn.Embedding.from_pretrained( torch.tensor(embedding_lookup, dtype=torch.float) )
        self.embedding = torch.nn.Embedding.from_pretrained( embedding_lookup )
        self.fc = nn.Linear(self.embed_dim, num_classes)
        
    def forward(self, text_batch, list_of_real_lengths):
        embedded = self.embedding.weight[ text_batch ] 
        
        # e.g.: batchsize=200, maxdoclength=70, gloveModelshape=100; calculating the true mean for every document
        # sum(shape(200, 70, 100)) -> shape(200, 100) & then the 200 documents are divided by their corresponding true lengths
        x = torch.sum(embedded, 1) / list_of_real_lengths[:, None]    
        return self.fc(x)
    
        #return nn.Softmax(x) 
        # quote from t Internet: "The definition of CrossEntropyLoss in PyTorch is a combination of softmax and cross-entropy."
        # Since we will use the Pytorch CrossEntropyLoss we therfore dont need nn.Softmax(x) in the network.

In [32]:
model = ClassificationAverageModel(lookup_matrix, len(labels))
model

ClassificationAverageModel(
  (embedding): Embedding(14197, 100)
  (fc): Linear(in_features=100, out_features=12, bias=True)
)

## Loss Function and optimization:

The loss between the predicted and the actual classes is calculated using Negative Log Likelihood or Cross Entropy. Update the model's parameters using any appropriate optimization mechanism such as Adam.

In [33]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 10#500
    start_time = time.time()

    for batch_i, (data, targets, real_lenghts) in enumerate(dataloader):
        scores = model(data, real_lenghts)
        loss = criterion(scores, targets)
        
        optimizer.zero_grad()
        
        loss.backward()
        
        optimizer.step()

        total_acc += (scores.argmax(1) == targets).sum().item()
        total_count += targets.size(0)
        if batch_i % log_interval == 0 and batch_i > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d} batch '
                  '| accuracy {:8.3f}'.format(epoch, batch_i,
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

In [34]:
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 1 # 0

    with torch.no_grad():
        for idx, (data, targets, real_lenghts) in enumerate(dataloader):
            predicted_label = model(data, real_lenghts)
            loss = criterion(predicted_label, targets)
            total_acc += (predicted_label.argmax(1) == targets).sum().item()
            total_count += targets.size(0)
            
    model.train()
    return total_acc/total_count

In [35]:
batch_size = 200
max_doc_length = 70
learning_rate = 0.01 # 0.001
epochs = 20
shuffle_train = False

np.random.seed(7)

In [36]:
# probably it´s just better to use Pytroch´s Dataloader, creating it ouerself added a lot of problems BUT also learnings
### train_dataloader = data_batching(X_train, y_train, batch_size, max_doc_length, shuffle_train)
### vali_dataloader = data_batching(X_vali, y_vali, batch_size, max_doc_length, False)
### test_dataloader = data_batching(X_test, y_test, batch_size, max_doc_length, False)

model = ClassificationAverageModel(lookup_matrix, len(labels))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None

In [37]:
for epoch in range(1, epochs + 1):
    train_dataloader = data_batching(X_train, y_train, batch_size, max_doc_length, shuffle_train)
    vali_dataloader = data_batching(X_vali, y_vali, batch_size, max_doc_length, False)
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(vali_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |    10 batch | accuracy    0.175
| epoch   1 |    20 batch | accuracy    0.248
| epoch   1 |    30 batch | accuracy    0.224
| epoch   1 |    40 batch | accuracy    0.302
-----------------------------------------------------------
| end of epoch   1 | time:  1.68s | valid accuracy    0.377 
-----------------------------------------------------------
| epoch   2 |    10 batch | accuracy    0.386
| epoch   2 |    20 batch | accuracy    0.423
| epoch   2 |    30 batch | accuracy    0.374
| epoch   2 |    40 batch | accuracy    0.368
-----------------------------------------------------------
| end of epoch   2 | time:  0.60s | valid accuracy    0.380 
-----------------------------------------------------------
| epoch   3 |    10 batch | accuracy    0.415
| epoch   3 |    20 batch | accuracy    0.436
| epoch   3 |    30 batch | accuracy    0.383
| epoch   3 |    40 batch | accuracy    0.384
-----------------------------------------------------------
| end of epoch   3 | time:

## Test Set Evaluation:

After finishing the training, load the (already stored) best performing model, and use it for class prediction on the test set. Evaluate and report the final results.

In [38]:
test_dataloader = data_batching(X_test, y_test, batch_size, max_doc_length, False)
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.549


Since we have 12 labels, random evaluation would mean acc.: 1/12. Therfore a test accuracy of 0.553 is quite good since the valid acc. is 0.52

Of course this could be further improved by using more advanced techniques or more epochs but I am not sure how much since it´s quite a big task. One advanced technique that could help would probalby be: to freeze the embedding layer, train the NN layer till convergence & than unfreeze & train everything again.

In [39]:
def text_pipeline(text):
    words = text.split()[ :max_doc_length]

    sent_of_ids = []
    for w in [standardize(word) for word in words]:
        try:
            sent_of_ids.append(words_ids_dict[ w ])
        except KeyError:
            sent_of_ids.append(0)
    
    return torch.tensor(sent_of_ids)

In [40]:
def predict(text):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

In [41]:
#predict(X_test[3])