<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex1/ex1_nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4NLP1
## Starting Point for Exercise 1, part II

This notebook is supposed to serve as a starting point and/or inspiration when starting exercise 1, part II.

One of the goals of this exercise is to make you acquainted with **skorch**. You will probably need to consult the [documentation](https://skorch.readthedocs.io/en/stable/).

# Installing skorch and loading libraries

In [2]:
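import subprocess
import sys

# Installation on Google Colab
try:
    import google.colab  # only imported to detect whether we run on Colab
    # Use the running interpreter so skorch is installed into the active environment
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'skorch'], check=True)
except ImportError:
    pass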

In [4]:
import torch
from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

# Set seed for reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

## Training a classifier and making predictions

In [5]:
# Download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /home/work/Documents/GitHub/ML4NLP1/exercises/ex1/x_train.txt
100%|██████████████████████████████████████| 64.1M/64.1M [00:06<00:00, 9.23MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /home/work/Documents/GitHub/ML4NLP1/exercises/ex1/x_test.txt
100%|██████████████████████████████████████| 65.2M/65.2M [00:07<00:00, 8.88MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /home/work/Documents/GitHub/ML4NLP1/exercises/ex1/y_train.txt
100%|████████████████████████████████████████| 480k/480k [00:00<00:00, 5.84MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /home/work/Documents/GitHub/ML4NLP1/exercises/ex1/y_test.txt
100%|████████████████████████████████████████| 480k/480k [00:00<00:00, 7.05MB/s]


In [6]:
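# The files contain text in many scripts, so read them explicitly as UTF-8
with open('x_train.txt', encoding='utf-8') as f:
    x_train = f.read().splitlines()
with open('y_train.txt', encoding='utf-8') as f:
    y_train = f.read().splitlines()
with open('x_test.txt', encoding='utf-8') as f:
    x_test = f.read().splitlines()
with open('y_test.txt', encoding='utf-8') as f:
    y_test = f.read().splitlines()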

In [7]:
# Combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# Write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# Combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})
# Inspect the first 5 items in the train split
train_df.head()

|   | text | label |
|---|------|-------|
| 0 | Klement Gottwaldi surnukeha palsameeriti ning ... | est |
| 1 | Sebes, Joseph; Pereira Thomas (1961) (på eng).... | swe |
| 2 | भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षे... | mai |
| 3 | Après lo cort periòde d'establiment a Basilèa,... | oci |
| 4 | ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung... | tha |


### Data preparation

Prepare your dataset for this experiment using the same method as you did in part 1.

Get a subset of the train/test data that includes 20 languages. Include English, German, Dutch, Danish, Swedish, Norwegian, and Japanese, plus 13 additional languages of your choice based on the items in the list of labels.

Don't forget to encode your labels using the adjusted code snippet from part 1!
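
Because `isin` silently ignores labels that never occur in the data, it can be worth checking the selection before splitting. The following is a minimal sketch on a made-up toy dataframe; the texts, the `wanted` list, and the placeholder label `xxx` are assumptions for illustration and are not taken from the exercise data.

```python
import pandas as pd

# Toy stand-in for the real data; texts and labels are made up.
df = pd.DataFrame({
    "text": ["hello world", "hallo welt", "hallo wereld", "hej verden"],
    "label": ["eng", "deu", "nld", "dan"],
})

# Hypothetical selection; 'xxx' deliberately does not occur in the toy data.
wanted = ["eng", "deu", "xxx"]

# Labels that were requested but are absent from the data.
missing = sorted(set(wanted) - set(df["label"].unique()))
print("Missing labels:", missing)

# Keep only the rows whose label is in the selection.
subset = df[df["label"].isin(wanted)].reset_index(drop=True)
print(subset["label"].nunique(), "of", len(wanted), "requested languages found")
```

If `missing` is not empty, the subset contains fewer languages than intended, and the corresponding label codes should be looked up in the printed label list.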


In [8]:
# TODO: Create your train/test subsets of languages
from sklearn.model_selection import train_test_split

pd.set_option('display.max_rows', None)
# Print the full label distribution
print(train_df['label'].value_counts())

print("Training samples:", len(train_df))
print("Testing samples:", len(test_df))

# Concatenate train and test dataframes
full_df = pd.concat([train_df, test_df]).reset_index(drop=True)
# Inspect the new dataframe
print(full_df['label'].value_counts())

print("Full_df samples:", len(full_df))
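# The 7 required languages plus 13 additional ones.
# Caution: the encoder output further down lists only 19 classes and 'cmn'
# is not among them, so this label apparently does not occur in the data.
languages = ['eng', 'deu', 'nld', 'dan', 'swe', 'nob', 'jpn',
             'cmn', 'rus', 'ukr', 'gle', 'fin', 'ara', 'hin',
             'asm', 'tam', 'cor', 'wuu', 'khm', 'jav']
subset_df = full_df[full_df['label'].isin(languages)]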
final_train_df, final_test_df = train_test_split(subset_df, 
                                                test_size=0.2, 
                                                stratify=subset_df['label'], 
                                                random_state=42)

X_train_final = final_train_df['text']
y_train_final = final_train_df['label']
X_test_final  = final_test_df['text']
y_test_final  = final_test_df['label']

label
est          500
swe          500
mai          500
oci          500
tha          500
orm          500
lim          500
guj          500
pnb          500
zea          500
krc          500
hat          500
pcd          500
tam          500
vie          500
pan          500
szl          500
ckb          500
fur          500
wuu          500
arz          500
ton          500
eus          500
map-bms      500
glk          500
nld          500
bod          500
jpn          500
arg          500
srd          500
ext          500
sin          500
kur          500
che          500
tuk          500
pag          500
tur          500
als          500
koi          500
lat          500
urd          500
tat          500
bxr          500
ind          500
kir          500
zh-yue       500
dan          500
por          500
fra          500
ori          500
nob          500
jbo          500
kok          500
amh          500
khm          500
hbs          500
slv          500
bos          500
tet     

In [9]:
# TODO: Use your adjusted code from part 1 to encode the labels again
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder().fit(y_train_final)
y_train_enc, y_test_enc = label_encoder.transform(y_train_final), label_encoder.transform(y_test_final)
pd.set_option('display.max_columns', None)
print("Classes:", label_encoder.classes_)
print("Encoded y_train:", y_train_enc)
print("Encoded y_test:", y_test_enc)

Classes: ['ara' 'asm' 'cor' 'dan' 'deu' 'eng' 'fin' 'gle' 'hin' 'jav' 'jpn' 'khm'
 'nld' 'nob' 'rus' 'swe' 'tam' 'ukr' 'wuu']
Encoded y_train: [ 8  1  3 ...  0 16 11]
Encoded y_test: [17  0 14 ... 12  3 18]


### Feature Extraction

In [14]:
# First, we extract some simple features as input for the neural network
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=100, binary=True)
X = vectorizer.fit_transform(X_train_final)

In [15]:
# We need to change the datatype to make it play nice with pytorch
X = X.astype(np.float32)
y = y_train_enc.astype(np.int64)
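
The test texts need the same feature space as the training texts, so the fitted vectorizer should be reused with `transform` rather than fitted again. A minimal sketch with made-up toy sentences follows; the names `vec`, `X_tr`, and `X_te` are chosen for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration.
train_texts = ["hello world", "hallo welt", "hej verden"]
test_texts = ["hello welt"]

vec = CountVectorizer(analyzer="char", ngram_range=(2, 2),
                      max_features=100, binary=True)

# Learn the bigram vocabulary on the training texts only ...
X_tr = vec.fit_transform(train_texts).astype(np.float32)
# ... and reuse it for the test texts so both share the same columns.
X_te = vec.transform(test_texts).astype(np.float32)

print(X_tr.shape, X_te.shape)
```

Calling `fit_transform` on the test texts instead would build a different vocabulary, and the columns would no longer line up with what the network saw during training.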

In the following, we define a vanilla neural network with two hidden layers. The output layer should have as many outputs as there are classes. In addition, it should have a nonlinearity function.

In [18]:
# TODO: In the following, you can find a small (almost) working example of a neural network.
# Unfortunately, again, the cat messed up some of the code. Please fix the code such that it is executable. (Hint: the input and output sizes look a bit weird...)

class ClassifierModule(nn.Module):
    """Feed-forward network with two hidden layers."""

    def __init__(
        self,
        input_size,
        num_units=200,
        num_classes=20,
        nonlin=F.relu,
    ):
        super().__init__()
        self.num_units = num_units
        self.nonlin = nonlin

        self.dense0 = nn.Linear(input_size, num_units)
        self.dense1 = nn.Linear(num_units, 50)
        self.output = nn.Linear(50, num_classes)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.nonlin(self.dense1(X))
        # Return raw logits of shape (batch_size, num_classes);
        # nn.CrossEntropyLoss applies log-softmax internally.
        return self.output(X)


In [19]:
# Initialise the neural net classifier.
net = NeuralNetClassifier(
    ClassifierModule(
        input_size=X.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda' if torch.cuda.is_available() else 'cpu',  # falls back to CPU when no GPU is available
)

In [20]:
# Train the classifier
net.fit(X, y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.6641       0.3349        2.3983  0.8601
      2        1.9993       0.4349        1.6370  0.6679
      3        1.4709       0.5362        1.3310  0.6488
      4        1.2569       0.5704        1.1739  0.6536
      5        1.1364       0.5928        1.0891  0.6597
      6        1.0675       0.6201        1.0382  0.6597
      7        1.0210       0.6270        1.0031  0.6547
      8        0.9893       0.6306        0.9818  0.6550
      9        0.9691       0.6352        0.9688  0.6574
     10        0.9546       0.6395        0.9597  0.6552
     11        0.9432       0.64

| Parameter | Value |
|-----------|-------|
| module | `ClassifierMod..., bias=True) )` |
| criterion | `CrossEntropyLoss()` |
| train_split | `<skorch.datas...x7f5e4dd7f390>` |
| classes | |
| optimizer | `<class 'torch.optim.sgd.SGD'>` |
| lr | 0.1 |
| max_epochs | 20 |
| batch_size | 128 |
| iterator_train | `<class 'torch...r.DataLoader'>` |
| iterator_valid | `<class 'torch...r.DataLoader'>` |


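Note that you can also use `GridSearchCV` with `skorch`, but be aware that a grid search trains the network once per hyperparameter combination and cross-validation fold, so it takes much longer than a single training run.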

Play around with 5 different sets of hyperparameters. For example, consider some of the following:

- layer sizes
- activation functions
- regularizers
- early stopping
- vectorizer parameters

Report your best hyperparameter combination.

📝❓ What is the effect of your modifications on validation performance? Discuss potential reasons.

☝ Note: if you run into the infamous CUDA out-of-memory (OOM) error during model development, try clearing the GPU memory with `torch.cuda.empty_cache()` or, failing that, restart the runtime.
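
`torch.cuda.empty_cache()` can only release memory that is no longer referenced, so old nets or tensors should be deleted first (for example with `del net`). A minimal sketch follows; the helper name `free_gpu_memory` is hypothetical.

```python
import gc

import torch


def free_gpu_memory():
    """Collect garbage, then release cached CUDA blocks if a GPU is present.

    Returns the number of bytes still allocated by live tensors (0 on CPU).
    """
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return torch.cuda.memory_allocated()
    return 0


print("Bytes still allocated:", free_gpu_memory())
```

If the reported number stays high, some object is still holding on to GPU tensors, and restarting the runtime is the simpler way out.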


---

📝❓ Write your lab report here addressing all questions in the notebook