# Preprocessing and analysis

First I want to analyze the dataset to understand which model to choose and how to parametrize its input. To keep this step short I'll use "pandas" to load the data and collect statistics. I also want to remove all the stopwords as early as possible using "nltk", to get a clearer idea of how many words in each text really matter. [I used the code from the NLP slides for the stopwords; if you get an error as I did, you probably need to download the missing resources, as the Python error message suggests, with:
- nltk.download('stopwords')
- nltk.download('punkt')]

For the word embeddings I'll use "gensim". During the analysis of the text I'll keep only the words that appear in its vocabulary, provided that it contains almost all the words in our training texts.
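
A rough way to check whether a vocabulary contains "almost all the words" is to measure the fraction of tokens it covers. The sketch below assumes a hypothetical helper `coverage` and uses a plain set as a stand-in for the embedding vocabulary; with the real model, `embed` itself should work in its place, since the notebook already tests membership with `w in embed`.

```python
def coverage(token_lists, vocab):
    """Fraction of tokens (counted with repetitions) that are found in vocab."""
    total = 0
    covered = 0
    for tokens in token_lists:
        for w in tokens:
            total += 1
            if w in vocab:
                covered += 1
    # an empty corpus is treated as fully covered to avoid dividing by zero
    return covered / total if total else 1.0


# toy stand-in for the embedding vocabulary (hypothetical data)
toy_vocab = {"stocks", "fell", "market"}
toy_texts = [["stocks", "fell"], ["market", "zzyzx"]]
print(coverage(toy_texts, toy_vocab))  # 3 of the 4 tokens are covered
```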

In [1]:
import pandas as pd
import contractions
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim.downloader as api

In [2]:
torch.manual_seed(21368330231508068)

<torch._C.Generator at 0x7fda66d2e3b0>

In [3]:
#this contains the embeddings for all our words
embed=api.load("word2vec-google-news-300")

In [4]:
#as said above, run these if you haven't already downloaded the nltk resources.
# nltk.download('stopwords') 
# nltk.download('punkt')
def tokenize_and_remove(data_path,embed):
    '''Load the data in "data_path", tokenize it keeping only the words in "embed", and add columns with the lengths of texts and tokens for a small analysis.'''
    sw = set(stopwords.words("english")) #a set makes the membership test below faster
    df = pd.read_json(data_path,lines=True)
    #this list will become a new column of our dataframe after fixing contractions, tokenizing each text and removing stopwords
    token_cols=[]
    #lengths, to compute statistics on them
    text_len=[]
    token_len=[]
    for idx in tqdm(range(len(df['text']))):
        #tokenize and drop the words that are english stopwords, not alphabetic, or not in the gensim vocabulary we adopted
        line=df['text'][idx]
        text_len.append(len(line)) #length in characters
        line=contractions.fix(line)
        line=[w for w in word_tokenize(line) if w.lower() not in sw and w.isalpha() and w in embed]
        token_len.append(len(line))
        token_cols.append(line)
    df['tokens']=token_cols
    df['tokens_len']=token_len
    df['text_len']=text_len
    return df

In [5]:
data='../data/train.jsonl'
df=tokenize_and_remove(data,embed)

100%|██████████| 186282/186282 [01:00<00:00, 3073.30it/s]
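
The filter inside tokenize_and_remove applies three tests to every token. The sketch below reproduces them on toy data with assumed stand-ins: `str.split` replaces `word_tokenize`, and small sets replace the nltk stopword list and the embedding vocabulary. It also shows that the stopword test is case-insensitive while the vocabulary test is not, so differently cased forms of a word are kept as distinct tokens.

```python
def keep_tokens(text, stopwords, vocab):
    """Keep the tokens that are not stopwords, are alphabetic and are in vocab."""
    return [w for w in text.split()
            if w.lower() not in stopwords and w.isalpha() and w in vocab]


# hypothetical toy data
toy_stopwords = {"the", "in"}
toy_vocab = {"New", "new", "York", "rules", "The"}
print(keep_tokens("The new rules in New York 2024", toy_stopwords, toy_vocab))
```

Here "The" is dropped even though it is in the toy vocabulary, because its lowercase form is a stopword, while "new" and "New" both survive.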


## Statistics

In [6]:
print(df[['text_len','tokens_len']].describe())

            text_len     tokens_len
count  186282.000000  186282.000000
mean      287.790823      27.730130
std       100.141311       8.966043
min        22.000000       2.000000
25%       216.000000      22.000000
50%       268.000000      27.000000
75%       352.000000      33.000000
max      8267.000000     870.000000


Using pandas "describe()" I'm trying to get a better understanding of this large dataset from the data we have.
I also repeated the experiment without filtering out the words not in embed; in that case the statistics (same columns, text_len and tokens_len) were:
- count  186282.000000  186282.000000
- mean      287.790823      29.307706
- std       100.141311       9.675957
- min        22.000000       3.000000
- 25%       216.000000      23.000000
- 50%       268.000000      28.000000
- 75%       352.000000      35.000000
- max      8267.000000     917.000000

Since they are not really different (27 vs 29 tokens on average), I decided to keep this instance of gensim to handle my word embeddings.
After tokenization and removal of stopwords and non-embedded words, each text is left with about 27 tokens on average. Note that text_len is measured in characters, so the two columns are not directly comparable.
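
To put the "27 vs 29" comparison in numbers, the two mean token counts reported above can be turned into the share of tokens lost by requiring a word to be in embed. This is plain arithmetic on the two means, not an additional measurement.

```python
mean_with_filter = 27.730130     # mean tokens_len keeping only words in embed
mean_without_filter = 29.307706  # mean tokens_len without that filter

lost = mean_without_filter - mean_with_filter
lost_share = lost / mean_without_filter
print(f"{lost:.2f} tokens per text, {lost_share:.1%} of the total")
```

That is roughly 1.6 tokens per text, a bit more than 5% of the tokens.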

Out of curiosity I want to check the most frequent words in all our documents.

In [7]:
words_count={}
for line in df['tokens']:
    for w in line:
        words_count[w] = words_count.get(w, 0)+1
words_count = dict(sorted(words_count.items(), key=lambda item: item[1], reverse=True))

In [8]:
print("There are",len(words_count.keys()), "different tokens.")

There are 121011 different tokens.


In [9]:
n=10 #n most common words
for key in words_count.keys():
    print(key,words_count[key])
    n-=1
    if not n:
        break

said 27597
new 18238
one 17369
Reuters 14430
AP 14208
first 13895
people 12915
New 12831
would 12716
two 12221
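
The same count can be written more compactly with `collections.Counter` from the standard library, whose `most_common(n)` returns the n most frequent items already sorted. A minimal sketch on toy token lists (the data is made up; in the notebook the lists would come from df['tokens']):

```python
from collections import Counter

# hypothetical token lists standing in for df['tokens']
toy_tokens = [["said", "new", "said"], ["new", "said", "first"]]

counts = Counter()
for line in toy_tokens:
    counts.update(line)  # adds 1 for every occurrence in the line

print("There are", len(counts), "different tokens.")
for word, freq in counts.most_common(2):
    print(word, freq)
```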


# Encoding of the text

At this point I want to exploit the work done so far to encode each text as a 300-dimensional vector, obtained by summing the embeddings of all the words that were not removed. I also thought about concatenating the embeddings of, say, the n most common words (maybe n=20, considering that more than 75% of the texts have at least 20 words), adding 0-padding where fewer than n words are available and keeping the sum of all the others, but maybe I'll come back to this solution later on.
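
The concatenation idea is not implemented in this notebook. One possible reading of it is sketched below with plain lists and 2-dimensional toy vectors: the helper `encode_concat`, the ranking by corpus frequency and all the data are assumptions made for illustration only.

```python
def encode_concat(tokens, embed, counts, n, dim):
    """Concatenate the vectors of the n most frequent tokens, zero-pad,
    and append the sum of the vectors of the remaining tokens."""
    # rank the tokens of this text by their corpus frequency (ties keep text order)
    ranked = sorted(tokens, key=lambda w: counts.get(w, 0), reverse=True)
    top, rest = ranked[:n], ranked[n:]
    out = []
    for w in top:
        out.extend(embed[w])
    out.extend([0.0] * (dim * (n - len(top))))  # zero padding for short texts
    rest_sum = [0.0] * dim
    for w in rest:
        for k, v in enumerate(embed[w]):
            rest_sum[k] += v
    return out + rest_sum  # always (n + 1) * dim values


# hypothetical toy embeddings and corpus counts
toy_embed = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_counts = {"a": 3, "b": 2, "c": 1}
print(encode_concat(["c", "a"], toy_embed, toy_counts, n=3, dim=2))
print(encode_concat(["a", "b", "c", "c"], toy_embed, toy_counts, n=2, dim=2))
```

With n=20 and 300-dimensional embeddings the input of the network would grow from 300 to 6300 values.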

In [10]:
#utilities to retrieve and encode labels
def label_to_idx(label):
    labels_dict={"business":0, "crime":1, "culture/arts":2, "education":3, "entertainment":4,
                "environment":5, "food/drink":6, "home/living":7, "media":8, "politics":9, 
                "religion":10, "sci/tech":11, "sports":12, "wellness":13, "world":14}
    return labels_dict[label]
def idx_to_label(idx):
    labels_list=["business", "crime", "culture/arts", "education", "entertainment",
                "environment", "food/drink", "home/living", "media", "politics", 
                "religion", "sci/tech", "sports", "wellness", "world"]
    return labels_list[idx]
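
Since the two mappings must stay consistent with each other, one option is to derive both from a single list. A possible sketch, using the same 15 labels in the same order (the module-level names LABELS and LABEL_TO_IDX are hypothetical):

```python
LABELS = ["business", "crime", "culture/arts", "education", "entertainment",
          "environment", "food/drink", "home/living", "media", "politics",
          "religion", "sci/tech", "sports", "wellness", "world"]
# the index of each label is its position in LABELS
LABEL_TO_IDX = {label: idx for idx, label in enumerate(LABELS)}


def label_to_idx(label):
    return LABEL_TO_IDX[label]


def idx_to_label(idx):
    return LABELS[idx]


# round trip: every index maps to a label and back
assert all(label_to_idx(idx_to_label(i)) == i for i in range(len(LABELS)))
```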

This class encodes each text as a vector (by summing the contributions of each word token), given the dataframe "df" as input.

In [11]:
class TextDataset(Dataset):
    def encode_text(self,df,embed): #it also encodes the labels (used when the "labels" parameter is True)
        data=[]
        for idx,line in df.iterrows():
            t=torch.zeros(300,)
            for w in line['tokens']:
                t+=torch.from_numpy(embed[w].copy())
            l=torch.tensor(label_to_idx(line['label']),dtype=torch.int64) #only the label since we are going to use crossentropy
            # print("label",l)
            data.append({'id': line['id'],'inputs':t,'outputs':l})
        return data     
    #without output labels
    def encode_text_simple(self,df,embed):
        data=[]
        for idx,line in df.iterrows():
            t=torch.zeros(300,)
            for w in line['tokens']:
                t+=torch.from_numpy(embed[w].copy())
            data.append({'id': line['id'],'inputs':t})
        return data     
        
    def __init__(self, df, embed, labels = True):
        self.labels=labels
        if labels:
            self.data=self.encode_text(df,embed)
        else:
            self.data=self.encode_text_simple(df,embed)
        self.num_samples = len(self.data)
            
    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx]  
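
Each label is stored as a single class index rather than a one-hot vector because that is a target format cross entropy accepts: the loss for one sample is the negative log of the softmax probability given to the right class. A plain-Python sketch of that formula for one sample (only an illustration of the math, not how PyTorch implements it):

```python
import math


def cross_entropy(logits, target):
    """-log(softmax(logits)[target]) for a single sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


print(cross_entropy([2.0, 0.5, -1.0], 0))  # small loss: class 0 has the largest score
print(cross_entropy([2.0, 0.5, -1.0], 2))  # larger loss: class 2 has the smallest score
```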

# Neural Networks model

Once the dataset is ready, rather than going for ML models like SVM, which I've already used in the machine learning course, I decided to practice with neural networks by developing an artificial neural network (a multilayer perceptron). I also decided to use the trainer class that the professor showed us, leaving the training part almost untouched.

In [12]:
class TextClassifier(nn.Module): #an ANN
    def __init__(self,input_dim,output_dim,hparams=None):
        """A multilayer perceptron whose layers halve in width while the width is larger than 4*output_dim.

        Args:
            input_dim (int): the input dimension of our network.
            output_dim (int): the output dimension of our network, i.e. the number of classes.
            hparams (optional): the hyperparameters of our model. Defaults to None for now;
                    in future extensions I could pass here the factor by which each layer is divided (2).
        """
        super(TextClassifier, self).__init__()
        self.in_dim=input_dim
        self.out_dim=output_dim
        self.body=nn.ModuleList()
        i=input_dim
        while(i>4*output_dim):
            self.body.append(nn.Sequential(
                nn.Linear(i,int(i/2)),
                nn.ReLU(),  
                nn.Dropout(0.2) 
            ))
            i=int(i/2)
        self.final=nn.Sequential(
            nn.Linear(i,int(i/2)),
            nn.ReLU(),
            nn.Linear(int(i/2),output_dim),
            )

    def forward(self, x):
        for layer in self.body:
            x=layer(x)
        return self.final(x)
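
To see which layers the constructor builds, the same halving loop can be run on the dimensions alone. The helper below (`layer_dims`, a hypothetical name) mirrors the loop in `__init__` without creating any PyTorch module:

```python
def layer_dims(input_dim, output_dim):
    """Widths of the successive Linear layers built by TextClassifier."""
    dims = [input_dim]
    i = input_dim
    while i > 4 * output_dim:   # body: halve while wider than 4*output_dim
        i = int(i / 2)
        dims.append(i)
    dims.append(int(i / 2))     # first Linear of the final block
    dims.append(output_dim)     # last Linear outputs one score per class
    return dims


print(layer_dims(300, 15))  # [300, 150, 75, 37, 18, 15]
```

With 300 inputs and 15 classes this gives three body blocks (each with ReLU and dropout) followed by the final block.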

In [13]:
class Trainer():
    """Utility class to train and evaluate a model."""

    def __init__(
        self,
        model,
        loss_function,
        optimizer,
        device):
        """
        Args:
            model: the model we want to train.
            loss_function: the loss_function to minimize.
            optimizer: the optimizer used to minimize the loss_function.
            device: cuda or cpu depending on where our training will be performed.
        """
        self.model = model
        self.loss_function = loss_function
        self.optimizer = optimizer
        self.device=device
        self.model.to(self.device)  # move model to GPU if available

    def train(self, train_dataset, valid_dataset, epochs=1):
        """
        Args:
            train_dataset: a Dataset or DataLoader instance containing the training instances.
            valid_dataset: a Dataset or DataLoader instance used to evaluate learning progress.
            epochs: the number of times to iterate over train_dataset.
        Returns:
            avg_train_loss: the average training loss on train_dataset over
                epochs.
        """
        assert isinstance(epochs, int) and epochs >= 1  # the default epochs=1 must be accepted
        print('Training...')
        train_loss = 0.0
        
        for epoch in range(epochs):
            self.model.train()
            print(' Epoch {:03d}'.format(epoch + 1))
            epoch_loss = 0.0
            for _, sample in enumerate(train_dataset):
                inputs = sample['inputs'].to(self.device)
                labels = sample['outputs'].to(self.device)
                self.optimizer.zero_grad()

                predictions = self.model(inputs)               
                sample_loss = self.loss_function(predictions, labels)
                sample_loss.backward()
                self.optimizer.step()
                # sample_loss is a Tensor
                epoch_loss += sample_loss.tolist()
            
            avg_epoch_loss = epoch_loss / len(train_dataset)
            train_loss += avg_epoch_loss
            print('  [E: {:2d}] train loss = {:0.4f}'.format(epoch, avg_epoch_loss))

            valid_loss = self.evaluate(valid_dataset)
            
            print('  [E: {:2d}] valid loss = {:0.4f}'.format(epoch, valid_loss))
            ## these two lines can be removed, I used them to check how well we are doing at each epoch (they rely on the notebook globals data_dev and save_results)
            save_results(data_dev,"../predictions/nnbv1_dev.tsv",self)
            !python3 ../scorer.py --prediction_file ../predictions/nnbv1_dev.tsv --gold_file ../gold/gold_dev.tsv
        print('... Done!')
    
        avg_epoch_loss = train_loss / epochs
        return avg_epoch_loss
    

    def evaluate(self, valid_dataset):
        """
        Args:
            valid_dataset: the dataset to use to evaluate the model.

        Returns:
            avg_valid_loss: the average validation loss over valid_dataset.
        """
        self.model.eval()  # disable dropout while evaluating; train() re-enables it at each epoch
        valid_loss = 0.0
        # no gradient updates here
        with torch.no_grad():
            for sample in valid_dataset:
                inputs = sample['inputs'].to(self.device)
                labels = sample['outputs'].to(self.device)
                predictions = self.model(inputs)
                # print("pred",predictions,"labels",labels)
                sample_loss = self.loss_function(predictions, labels)
                valid_loss += sample_loss.tolist()
        
        return valid_loss / len(valid_dataset)
    

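    def predict(self, x):
        """
        Returns: the predicted labels (hopefully the right ones), one for each row of x.
        """
        self.model.eval()  # disable dropout while predicting
        res=[]
        with torch.no_grad():  # no gradients are needed for inference
            output=self.model(x.to(self.device))
        for y in output:
            res.append(idx_to_label(torch.argmax(y).item()))
        return res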

Now we load the Datasets and create the trainer with model, loss and optimizer.


In [14]:
dftrain=TextDataset(df,embed,True)
print(len(dftrain))
print(dftrain[3])
# load and encode dev and test dataset too
dev='../data/dev.jsonl'
dfde=tokenize_and_remove(dev,embed)
dfdev=TextDataset(dfde,embed,True)

test='../data/test.jsonl'
dfte=tokenize_and_remove(test,embed)
dftest=TextDataset(dfte,embed,False)

  4%|▍         | 279/6844 [00:00<00:02, 2786.84it/s]

186282
{'id': 34826, 'inputs': tensor([ 0.5984,  2.1544, -2.1497,  3.4221, -1.0396, -1.1547,  1.5708, -4.1302,
         0.7389,  3.0746, -0.8054, -3.1855,  1.3969,  2.0278, -2.7422,  2.1461,
         3.3774,  5.3034,  1.4444, -1.1899, -0.1482,  2.1570,  0.6310, -0.3157,
         0.8607, -1.3098, -1.0935,  2.7092,  2.1862, -0.5253, -0.6049,  0.3447,
         0.5847,  2.0790,  0.7202, -0.6893,  2.6213, -1.6462, -0.3467,  2.2975,
         3.0027, -2.3688,  2.2446, -2.7668,  1.4780, -1.9380, -1.8821,  0.6446,
         3.0092,  2.9491, -1.9568,  2.6705, -1.5483, -1.2118,  0.0370, -1.3667,
        -1.7024, -3.1867, -1.5602, -3.4138,  0.6928,  2.0115, -2.8325, -2.8777,
        -0.6772,  1.8848, -2.3920,  2.5192,  0.2378,  2.8743,  2.8725,  1.1666,
         1.1083,  0.7915, -4.2026,  0.6115,  1.8562,  3.4143,  2.0905,  4.6426,
         0.8409, -2.3152, -0.5070, -1.3265, -0.8783, -0.2410, -1.8527,  3.2277,
         1.3466,  1.1963,  0.2794,  0.3676, -1.9456, -2.5863, -1.2511, -1.0417,
         

100%|██████████| 6844/6844 [00:02<00:00, 2749.59it/s]
100%|██████████| 6849/6849 [00:02<00:00, 3241.31it/s]


In [16]:
data_train=DataLoader(dftrain,batch_size=128,num_workers=6)
data_dev=DataLoader(dfdev,batch_size=10)
data_test=DataLoader(dftest,batch_size=10)
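
With these batch sizes, the number of batches per epoch follows from the dataset sizes printed above. Since the trainer divides the accumulated loss by the length of the loader, the losses it reports are averages per batch. A quick check of the arithmetic (assuming the default drop_last=False, so the last, smaller batch is kept):

```python
import math

# (number of samples, batch size) for each loader
sizes = {"train": (186282, 128), "dev": (6844, 10), "test": (6849, 10)}
num_batches = {}
for name, (num_samples, batch_size) in sizes.items():
    num_batches[name] = math.ceil(num_samples / batch_size)
    print(name, num_batches[name])
```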

In [15]:
mod=TextClassifier(300,15)
loss=nn.CrossEntropyLoss()
optimizer = optim.SGD(mod.parameters(), lr=1e-5)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
trainer=Trainer(mod,loss,optimizer,device)

cuda


For the optimizer I ran tons of different combinations, not only of learning rates but also of weight decay and momentum values; the following .pth file holds the best model I was able to achieve.

In [19]:
#run only if you want to start with a pre-trained model resulting from another epoch
trainer.model.load_state_dict(torch.load("nnbonusv1_400+100A.pth"))

<All keys matched successfully>

In [17]:
#utility to save results
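def save_results(dataset,path,trainer):
    '''write one "id<TAB>label" line for each sample of "dataset"'''
    with open(path,"w") as output: #the file is closed even if predict raises
        for x in dataset:
            y=trainer.predict(x['inputs'])
            for label,sample_id in zip(y,x['id']):
                print(str(sample_id.item())+"\t"+label,file=output)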

The best model I was able to obtain is the one that results from training for 400 epochs using SGD and then switching to the Adam optimizer for 100 more.

(I used a really small lr=1e-5; with a higher one, far fewer epochs would probably have been needed.)

The weights of that model can be loaded above from "nnbonusv1_400+100A.pth".
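
The effect of the learning rate on the number of steps can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2, which has nothing to do with the real network or its loss surface; it only shows that, on this toy function, a larger step reaches the same target in fewer iterations. The helper name and its parameters are hypothetical.

```python
def steps_to_reach(lr, target=0.5, x=1.0, max_steps=1_000_000):
    """Gradient descent steps on f(x)=x^2 until |x| <= target."""
    steps = 0
    while abs(x) > target and steps < max_steps:
        x -= lr * 2 * x  # the gradient of x^2 is 2x
        steps += 1
    return steps


small = steps_to_reach(1e-5)
large = steps_to_reach(1e-3)
print(small, large)
```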

In [None]:
trainer.train(data_train,data_dev,400)

In [None]:
trainer.optimizer = optim.Adam(trainer.model.parameters(), lr=1e-5) #the Trainer reads self.optimizer, so this is the attribute to replace

In [None]:
trainer.train(data_train,data_dev,100)

In [None]:
trainer.model.load_state_dict(torch.load("nnbonusv1_400+100A.pth"))

In [22]:
trainer.model.eval()
trainer.evaluate(data_dev)

0.7701703387564116

Save the results of our training.

In [23]:
save_results(data_dev,"../predictions/nnbv1_dev.tsv",trainer)
!python3 ../scorer.py --prediction_file ../predictions/nnbv1_dev.tsv --gold_file ../gold/gold_dev.tsv

{'err_rate': '21.41'}
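
The scorer itself is not shown in this notebook. Assuming that err_rate is the percentage of wrongly predicted ids, and that both files use the "id<TAB>label" format written by save_results, it could be computed roughly as follows (the function names are hypothetical, and the real scorer.py may differ):

```python
import io


def read_tsv(handle):
    """Map each id to its label from lines of the form 'id<TAB>label'."""
    labels = {}
    for line in handle:
        line = line.rstrip("\n")
        if line:
            sample_id, label = line.split("\t", 1)
            labels[sample_id] = label
    return labels


def error_rate(pred, gold):
    """Percentage of gold ids whose prediction is missing or wrong."""
    wrong = sum(1 for k, v in gold.items() if pred.get(k) != v)
    return 100.0 * wrong / len(gold)


# in-memory files with made-up content, in place of the real TSV files
pred = read_tsv(io.StringIO("1\tsports\n2\tworld\n3\tmedia\n4\tcrime\n"))
gold = read_tsv(io.StringIO("1\tsports\n2\tpolitics\n3\tmedia\n4\tcrime\n"))
print({"err_rate": f"{error_rate(pred, gold):.2f}"})
```

Under that assumption, an err_rate of 21.41 would correspond to an accuracy of 78.59% on the dev set.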


In [26]:
save_results(data_test,"../predictions/predictions_test.tsv",trainer)

In [None]:
torch.save(mod.state_dict(), "nnbonusv1_seed200.pth")

In [None]:
torch.seed()