<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex1/ex1_nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4NLP1
## Starting Point for Exercise 1, part II

This notebook serves as a starting point and source of inspiration for exercise 1, part II.

One of the goals of this exercise is to make you acquainted with **skorch**. You will probably need to consult the [documentation](https://skorch.readthedocs.io/en/stable/).

# Installing skorch and loading libraries

In [1]:
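import subprocess
import sys

# Installation on Google Colab
try:
    import google.colab
    # Use the running interpreter so that skorch is installed into the active environment
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'skorch'], check=True)
except ImportError:
    pass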

In [2]:
import torch
from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

# Set seed for reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

## Training a classifier and making predictions

In [3]:
# Download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /content/x_train.txt
100% 64.1M/64.1M [00:02<00:00, 27.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /content/x_test.txt
100% 65.2M/65.2M [00:02<00:00, 26.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /content/y_train.txt
100% 480k/480k [00:00<00:00, 103MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /content/y_test.txt
100% 480k/480k [00:00<00:00, 122MB/s]


In [19]:
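# The files contain text in many scripts, so read them explicitly as UTF-8
with open('x_train.txt', encoding='utf-8') as f:
    x_train = f.read().splitlines()
with open('y_train.txt', encoding='utf-8') as f:
    y_train = f.read().splitlines()
with open('x_test.txt', encoding='utf-8') as f:
    x_test = f.read().splitlines()
with open('y_test.txt', encoding='utf-8') as f:
    y_test = f.read().splitlines()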

In [20]:
# Combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# Write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# Combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})
# Inspect the first 5 items in the train split
train_df.head()

| | text | label |
|---|---|---|
| 0 | Klement Gottwaldi surnukeha palsameeriti ning ... | est |
| 1 | Sebes, Joseph; Pereira Thomas (1961) (på eng).... | swe |
| 2 | भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षे... | mai |
| 3 | Après lo cort periòde d'establiment a Basilèa,... | oci |
| 4 | ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung... | tha |


### Data preparation

Prepare your dataset for this experiment using the same method as you did in part 1.

Get a subset of the train/test data that includes 20 languages. Include English, German, Dutch, Danish, Swedish, Norwegian, and Japanese, plus 13 additional languages of your choice based on the items in the list of labels.

Don't forget to encode your labels using the adjusted code snippet from part 1!


In [21]:
# TODO: Create your train/test subsets of languages
# Note, make sure these are the same as what you used in Part 1!

# Create a list of languages to include
languages = ['eng', 'deu', 'nld', 'dan', 'swe', 'nor', 'jpn', 'fra', 'spa', 'ita', 'por', 'rus', 'zho', 'ara', 'hin', 'ben', 'kor', 'tur', 'vie', 'ind']

train_df_subset = train_df[train_df['label'].isin(languages)]
test_df_subset = test_df[test_df['label'].isin(languages)]

# Inspect the first 10 items in the train subset
train_df_subset.head(10)

| | text | label |
|---|---|---|
| 1 | Sebes, Joseph; Pereira Thomas (1961) (på eng).... | swe |
| 14 | Bùi Tiến Dũng (sinh năm 1959 tại huyện Ứng Hòa... | vie |
| 26 | De spons behoort tot het geslacht Haliclona en... | nld |
| 29 | エノが行きがかりでバスに乗ってしまい、気分が悪くなった際に助けるが、今すぐバスを降りたいと運... | jpn |
| 38 | Tsutinalar (İngilizce: Tsuut'ina): Kanada'da A... | tur |
| 46 | シャーリー・フィールドは、サン・ベルナルド・アベニュー沿い市民センターとR&Tマーティン高校... | jpn |
| 49 | Kemunculan pertamanya adalah ketika mencium ka... | ind |
| 52 | Indtil 1545 havde flådecheferne kunnet hyre et... | dan |
| 53 | Barocco (pt: Escândalo de 1ª página) é um film... | por |
| 54 | Association de recherche et de sauvegarde de l... | fra |
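
The model summary printed after training reports `out_features=19` although 20 languages are requested, which suggests that one of the codes in `languages` may not occur in the label set. The following is a minimal sketch of how such a mismatch could be detected. The toy data frame, the made-up code `'xxx'`, and the helper `missing_labels` are assumptions for illustration only and are not part of the exercise code.

```python
import pandas as pd

# Toy stand-in for the real data frame (hypothetical rows)
toy_df = pd.DataFrame({
    'text': ['hello world', 'hallo Welt', 'hej verden'],
    'label': ['eng', 'deu', 'dan'],
})

# 'xxx' is a made-up code that does not occur in the toy labels
requested = ['eng', 'deu', 'dan', 'xxx']


def missing_labels(df, requested_labels):
    """Return the requested labels that never occur in df['label']."""
    present = set(df['label'].unique())
    return sorted(set(requested_labels) - present)


missing = missing_labels(toy_df, requested)
print(missing)
```

Running the same check against `train_df` and `languages` would show whether every requested code is actually present before the subset is created.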


In [22]:
# TODO: Use your adjusted code from part 1 to encode the labels again
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Work on explicit copies so that pandas does not raise a SettingWithCopyWarning
train_df_subset = train_df_subset.copy()
test_df_subset = test_df_subset.copy()
train_df_subset['label'] = label_encoder.fit_transform(train_df_subset['label'])
test_df_subset['label'] = label_encoder.transform(test_df_subset['label'])

# Inspect the first 5 items in the train split
train_df_subset.head()


| | text | label |
|---|---|---|
| 1 | Sebes, Joseph; Pereira Thomas (1961) (på eng).... | 15 |
| 14 | Bùi Tiến Dũng (sinh năm 1959 tại huyện Ứng Hòa... | 17 |
| 26 | De spons behoort tot het geslacht Haliclona en... | 11 |
| 29 | エノが行きがかりでバスに乗ってしまい、気分が悪くなった際に助けるが、今すぐバスを降りたいと運... | 9 |
| 38 | Tsutinalar (İngilizce: Tsuut'ina): Kanada'da A... | 16 |
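
After encoding, the labels are plain integers, so it can help to map them back to language codes when inspecting predictions. A small sketch on toy labels follows (the list `toy_labels` is hypothetical). `LabelEncoder` assigns integers in sorted order of the class names and exposes them via `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not the real data set
toy_labels = ['swe', 'vie', 'nld', 'jpn', 'swe']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted class names; the position is the integer code
code_to_lang = dict(enumerate(encoder.classes_))

# inverse_transform maps integer codes back to the original strings
decoded = encoder.inverse_transform(encoded)

print(code_to_lang)
print(list(decoded))
```

The same `inverse_transform` call on `label_encoder` would turn predicted class indices back into language codes.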


In [23]:
x_train = train_df_subset['text'].tolist()
y_train = train_df_subset['label'].tolist()
x_test = test_df_subset['text'].tolist()
y_test = test_df_subset['label'].tolist()

In [24]:
x_train[0], y_train[0]

('Sebes, Joseph; Pereira Thomas (1961) (på eng). The Jesuits and the Sino-Russian treaty of Nerchinsk (1689): the diary of Thomas Pereira. Bibliotheca Instituti historici S. I., 99-0105377-3 ; 18. Rome. Libris 677492',
 15)

### Feature Extraction

In [25]:
# First, we extract some simple features as input for the neural network
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=100, binary=True)
X = vectorizer.fit_transform(x_train)

In [28]:
y_train = np.array(y_train)
y_test = np.array(y_test)

In [29]:
# We need to change the datatype to make it play nice with pytorch
X = X.astype(np.float32)
y = y_train.astype(np.int64)
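
The cells above only vectorize the training texts. To evaluate on the test split later, the test texts should go through the same fitted vectorizer with `transform` (not `fit_transform`), so that both matrices share one feature space. A minimal sketch on toy strings follows; all names prefixed with `toy_` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature train/test texts
toy_train = ['hello world', 'hallo welt', 'hej verden']
toy_test = ['hello welt']

toy_vectorizer = CountVectorizer(
    analyzer='char', ngram_range=(2, 2), max_features=100, binary=True
)

# Learn the bigram vocabulary on the training texts only
X_toy_train = toy_vectorizer.fit_transform(toy_train).astype(np.float32)
# Reuse the learned vocabulary for the test texts
X_toy_test = toy_vectorizer.transform(toy_test).astype(np.float32)

print(X_toy_train.shape, X_toy_test.shape)
```

Both matrices have the same number of columns, which is what the network's `input_size` has to match.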

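In the following, we define a vanilla neural network with two hidden layers, each followed by a nonlinearity. The output layer should have as many outputs as there are classes. Because the classifier below is trained with `nn.CrossEntropyLoss`, which applies log-softmax internally, the output layer returns raw logits.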

In [38]:
# TODO: In the following, you can find a small (almost) working example of a neural network.
# Unfortunately, again, the cat messed up some of the code. Please fix the code such that it is executable. (Hint: the input and output sizes look a bit weird...)

class ClassifierModule(nn.Module):
    def __init__(
        self,
        num_units=200,
        input_size=100,
        num_classes=20,
        nonlin=F.relu,
    ):
        super().__init__()
        self.num_units = num_units
        self.nonlin = nonlin

        self.dense0 = nn.Linear(input_size, num_units)
        self.dense1 = nn.Linear(num_units, 50)
        self.output = nn.Linear(50, num_classes)

    def forward(self, X, **kwargs):
        # Apply the configurable nonlinearity after each hidden layer
        X = self.nonlin(self.dense0(X))
        X = self.nonlin(self.dense1(X))
        # Return raw logits of shape (batch_size, num_classes)
        return self.output(X)
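
A quick way to catch wrong input or output sizes is to push a dummy batch through the module and inspect the shape of the result. The sketch below uses a hypothetical `TinyClassifier` that mirrors the layer sizes above. It is an illustration, not the exercise solution.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in with the same layer layout as the module above."""

    def __init__(self, input_size=100, num_units=200, num_classes=20):
        super().__init__()
        self.dense0 = nn.Linear(input_size, num_units)
        self.dense1 = nn.Linear(num_units, 50)
        self.output = nn.Linear(50, num_classes)

    def forward(self, X):
        X = torch.relu(self.dense0(X))
        X = torch.relu(self.dense1(X))
        return self.output(X)


# Batch of 8 random feature vectors with 100 features each
dummy_batch = torch.randn(8, 100)
logits = TinyClassifier()(dummy_batch)

# One row per example, one column per class
print(logits.shape)
```

If the first dimension of the layer does not match the number of features, the forward pass fails immediately with a shape error, which is easier to debug than a failure inside `net.fit`.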


In [47]:
# Initialise the neural net classifier.
net = NeuralNetClassifier(
    ClassifierModule(
        input_size=X.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    # Use the GPU when one is available, otherwise fall back to the CPU
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

In [48]:
# Train the classifier
net.fit(X, y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.7897       0.1653        2.6194  0.7891
      2        2.4650       0.2768        2.2633  0.7825
      3        1.9611       0.3905        1.7300  0.7567
      4        1.4942       0.4763        1.4040  0.7543
      5        1.2499       0.5442        1.2211  0.7554
      6        1.1073       0.5979        1.0991  0.9649
      7        1.0025       0.6242        1.0011  1.1600
      8        0.9206       0.6726        0.9250  1.3268
      9        0.8601       0.6926        0.8697  0.8546
     10        0.8149       0.7179        0.8294  0.7809
     11        0.7789       0.72

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=ClassifierModule(
    (dense0): Linear(in_features=100, out_features=200, bias=True)
    (dense1): Linear(in_features=200, out_features=50, bias=True)
    (output): Linear(in_features=50, out_features=19, bias=True)
  ),
)

Note, you can also use `GridSearchCV` with `skorch`, but be aware that training a neural network takes much more time.

Play around with 5 different sets of hyperparameters. For example, consider some of the following:

- layer sizes
- activation functions
- regularizers
- early stopping
- vectorizer parameters

Report your best hyperparameter combination.

📝❓ What is the effect of your modifications on validation performance? Discuss potential reasons.

☝ Note: if you run into the infamous CUDA out-of-memory (OOM) error during model development, try clearing the GPU memory, either with `torch.cuda.empty_cache()` or by restarting the runtime.
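
A small sketch of the first option follows. `torch.cuda.empty_cache()` only releases cached memory that is no longer referenced, so references to old models or tensors should be deleted first (for example with `del net`). The helper name `free_gpu_memory` is hypothetical.

```python
import gc

import torch


def free_gpu_memory():
    """Release unreferenced cached GPU memory; return True if CUDA is available."""
    # Collect Python garbage first so that dropped tensors are actually freed
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


cuda_was_available = free_gpu_memory()
print(cuda_was_available)
```

If memory is still exhausted afterwards, restarting the runtime remains the more reliable fallback.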


---

📝❓ Write your lab report here addressing all questions in the notebook