<a href="https://colab.research.google.com/github/JaveyBae/ML4NLP1/blob/main/ex1_nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4NLP1
## Starting Point for Exercise 1, part II

This notebook serves as a starting point and inspiration for exercise 1, part II.

One of the goals of this exercise is to make you acquainted with **skorch**. You will probably need to consult the [documentation](https://skorch.readthedocs.io/en/stable/).

# Installing skorch and loading libraries

In [None]:
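import subprocess
import sys

# Install skorch when running on Google Colab
try:
    import google.colab  # only imported to detect the Colab environment
    # Use the running interpreter so the package lands in the active environment
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'skorch'], check=True)
except ImportError:
    pass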

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

# Set seed for reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

## Training a classifier and making predictions

In [None]:
# Download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /content/x_train.txt
100% 64.1M/64.1M [00:00<00:00, 131MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /content/x_test.txt
100% 65.2M/65.2M [00:00<00:00, 119MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /content/y_train.txt
100% 480k/480k [00:00<00:00, 8.16MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /content/y_test.txt
100% 480k/480k [00:00<00:00, 7.13MB/s]


In [None]:
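# Read one text (or label) per line; the files hold UTF-8 text in many scripts
with open('x_train.txt', encoding='utf-8') as f:
    x_train = f.read().splitlines()
with open('y_train.txt', encoding='utf-8') as f:
    y_train = f.read().splitlines()
with open('x_test.txt', encoding='utf-8') as f:
    x_test = f.read().splitlines()
with open('y_test.txt', encoding='utf-8') as f:
    y_test = f.read().splitlines()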

In [None]:
# Combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# Write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# Combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})
# Inspect the first 5 items in the train split
train_df.head()

Unnamed: 0,text,label
0,Klement Gottwaldi surnukeha palsameeriti ning ...,est
1,"Sebes, Joseph; Pereira Thomas (1961) (på eng)....",swe
2,भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षे...,mai
3,"Après lo cort periòde d'establiment a Basilèa,...",oci
4,ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung...,tha


### Data preparation

Prepare your dataset for this experiment using the same method as you did in part 1.

Get a subset of the train/test data that includes 20 languages. Include English, German, Dutch, Danish, Swedish, Norwegian, and Japanese, plus 13 additional languages of your choice based on the items in the list of labels.

Don't forget to encode your labels using the adjusted code snippet from part 1!


In [None]:
# TODO: Create your train/test subsets of languages
# Note, make sure these are the same as what you used in Part 1!

selected_labels = [
    'eng', 'deu', 'nld', 'dan', 'swe', 'nob', 'jpn',  # required languages
    'est', 'mai', 'oci', 'tha', 'orm', 'lim', 'guj',  # 13 additional languages
    'pnb', 'zea', 'krc', 'hat', 'pcd', 'tam',
]
print(f'The subset includes {len(set(selected_labels))} languages.')
train_df = train_df[train_df['label'].isin(selected_labels)]
test_df = test_df[test_df['label'].isin(selected_labels)]

X_train = train_df['text']
y_train = train_df['label']
X_test = test_df['text']
y_test = test_df['label']

# Check the subset
print(train_df['label'].value_counts())
print(test_df['label'].value_counts())

The subset includes 20 languages.
label
est    500
swe    500
mai    500
oci    500
tha    500
orm    500
lim    500
guj    500
pnb    500
zea    500
krc    500
hat    500
pcd    500
tam    500
nld    500
jpn    500
dan    500
nob    500
eng    500
deu    500
Name: count, dtype: int64
label
nld    500
hat    500
jpn    500
pnb    500
orm    500
nob    500
eng    500
guj    500
lim    500
krc    500
zea    500
oci    500
mai    500
tam    500
pcd    500
swe    500
dan    500
deu    500
est    500
tha    500
Name: count, dtype: int64


In [None]:
# TODO: Use your adjusted code from part 1 to encode the labels again
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
print(label_encoder.classes_)
print(y_train)
print(y_test)


['dan' 'deu' 'eng' 'est' 'guj' 'hat' 'jpn' 'krc' 'lim' 'mai' 'nld' 'nob'
 'oci' 'orm' 'pcd' 'pnb' 'swe' 'tam' 'tha' 'zea']
[ 3 16  9 ... 12  8  0]
[10  5  6 ...  4  0 15]
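
The integer codes follow the alphabetical order of `label_encoder.classes_`, so a predicted class index can be mapped back to its language code with `inverse_transform`. A minimal sketch on a toy label list (the names `toy_labels` and `toy_encoder` and the three labels are illustrative, not the full subset):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the real label column (illustrative only).
toy_labels = ['swe', 'deu', 'eng', 'deu']

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# Classes are sorted alphabetically, so 'deu' -> 0, 'eng' -> 1, 'swe' -> 2.
print(toy_encoder.classes_)
print(toy_encoded)

# Map integer predictions back to language codes.
toy_decoded = toy_encoder.inverse_transform(np.array([2, 0]))
print(toy_decoded)
```

The same call on the notebook's `label_encoder` turns the output of a classifier's `predict` into readable labels.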


### Feature Extraction

In [None]:
# First, we extract some simple features as input for the neural network
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=100, binary=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec  = vectorizer.transform(X_test)

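# Convert the sparse feature matrices to dense float32 arrays and the labels to int64,
# the dtypes that nn.Linear and nn.CrossEntropyLoss expect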
X = X_train_vec.toarray().astype(np.float32)
X_test = X_test_vec.toarray().astype(np.float32)
y = np.asarray(y_train, dtype=np.int64)
y_test = np.asarray(y_test, dtype=np.int64)

print("train shapes:", X.shape, y.shape)
print("test  shapes:",  X_test.shape,  y_test.shape)

train shapes: (10000, 100) (10000,)
test  shapes: (10000, 100) (10000,)


In the following, we define a vanilla neural network with two hidden layers. The output layer should have as many outputs as there are classes. In addition, the network needs a nonlinearity function, which is applied after each hidden layer.

In [None]:
# TODO: In the following, you can find a small (almost) working example of a neural network.
# Unfortunately, again, the cat messed up some of the code. Please fix the code such that it is executable. (Hint: the input and output sizes look a bit weird...)

class ClassifierModule(nn.Module):
    def __init__(
        self,
        input_size=100,
        num_units1=100,
        num_units2=50,
        num_classes=20,
        nonlin=F.relu,
    ):
        super(ClassifierModule, self).__init__()
        self.nonlin = nonlin
        self.dense0 = nn.Linear(input_size, num_units1)
        self.dense1 = nn.Linear(num_units1, num_units2)
        self.output = nn.Linear(num_units2, num_classes)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.nonlin(self.dense1(X))
        X = self.output(X)   # logits for CrossEntropyLoss
        return X
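
The module returns raw logits because `nn.CrossEntropyLoss` applies the log-softmax itself. A quick way to check that the layer sizes line up is to push a random batch through the network and look at the output shape. The sketch below uses a hypothetical `TinyClassifier` with the same layout as `ClassifierModule` so that it runs on its own; the batch size of 8 is arbitrary.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    """Same layout as ClassifierModule: two hidden layers, then logits."""
    def __init__(self, input_size=100, num_units1=100, num_units2=50, num_classes=20):
        super().__init__()
        self.dense0 = nn.Linear(input_size, num_units1)
        self.dense1 = nn.Linear(num_units1, num_units2)
        self.output = nn.Linear(num_units2, num_classes)

    def forward(self, X):
        X = F.relu(self.dense0(X))
        X = F.relu(self.dense1(X))
        return self.output(X)

batch = torch.rand(8, 100)          # 8 fake samples with 100 features each
logits = TinyClassifier()(batch)
probs = F.softmax(logits, dim=-1)   # softmax turns logits into class probabilities

print(logits.shape)                 # one row of 20 logits per sample
print(probs.sum(dim=-1))            # each row sums to 1 (up to rounding)
```

If `input_size` does not match the number of feature columns, the first `nn.Linear` raises a shape error at this point, which is cheaper to find here than in the middle of training.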



In [None]:
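# Initialise the neural net classifier.
# The module and the criterion are passed as classes; skorch instantiates them itself
# and forwards every `module__*` argument to ClassifierModule.__init__.
net = NeuralNetClassifier(
    ClassifierModule,
    module__input_size=X.shape[1],
    module__num_units1=100,
    module__num_units2=50,
    module__num_classes=len(label_encoder.classes_),
    module__nonlin=F.relu,
    max_epochs=20,
    criterion=nn.CrossEntropyLoss,
    lr=0.1,
    device='cuda' if torch.cuda.is_available() else 'cpu',
)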

In [None]:
# Train the classifier
net.fit(X, y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.8553       0.1650        2.6872  0.5098
      2        2.5362       0.3055        2.3327  0.1481
      3        2.0647       0.4050        1.8415  0.1489
      4        1.6610       0.4450        1.5453  0.1518
      5        1.4436       0.4720        1.3802  0.1529
      6        1.3016       0.5280        1.2774  0.1631
      7        1.1984       0.5470        1.1973  0.1536
      8        1.1157       0.5640        1.1332  0.2002
      9        1.0519       0.5765        1.0850  0.2200
     10        1.0051       0.5925        1.0500  0.2185
     11        0.9699       0.61

Note that you can also use `GridSearchCV` with `skorch`, but be aware that training a neural network takes much more time, and a grid search repeats the training for every parameter combination and every fold.

Play around with 5 different sets of hyperparameters. For example, consider some of the following:

- layer sizes
- activation functions
- regularizers
- early stopping
- vectorizer parameters

Report your best hyperparameter combination. \\
📝❓ What is the effect of your modifications on validation performance? Discuss potential reasons.

☝ Note, during model development, if you run into the infamous CUDA out-of-memory (OOM) error, try clearing the GPU memory with `torch.cuda.empty_cache()` or restarting the runtime.

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

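class CountVectorizerWrapper(BaseEstimator, TransformerMixin):
    """Thin wrapper around CountVectorizer that returns float32 features for skorch."""
    def __init__(self, analyzer='char', ngram_range=(2, 2), max_features=200, **kwargs):
        self.analyzer = analyzer
        self.ngram_range = ngram_range
        self.max_features = max_features
        # Caveat: the inner vectorizer is built once, here. A later
        # set_params(max_features=...) call (as issued by GridSearchCV) only updates
        # the attribute above and does not reach this object.
        self.vectorizer = CountVectorizer(analyzer=self.analyzer, ngram_range=self.ngram_range, max_features=self.max_features, **kwargs)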

    def fit(self, X, y=None):
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X).astype(np.float32)

    def fit_transform(self, X, y=None):
        return self.vectorizer.fit_transform(X).astype(np.float32)

    def get_feature_names_out(self):
        return self.vectorizer.get_feature_names_out()


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# Define a pipeline with feature extraction, TF-IDF transformation, and neural network classifier
pipe = Pipeline([
    ('vect', CountVectorizerWrapper()),
    ('tfidf', TfidfTransformer()),
    ('net', net)                  # Apply the neural network classifier
])

# Define the parameter grid for GridSearchCV
param_grid = {
    'vect__max_features': [200,300],
    'net__module__input_size': [200,300],
    'net__lr': [0.5, 0.1],                     # Learning rate
    'net__max_epochs': [20, 50],               # Number of epochs
}

# Use GridSearchCV to find the best parameters
gs = GridSearchCV(pipe, param_grid, refit=True, cv=5, scoring='accuracy', verbose=1)

# Fit the grid search with training data
gs.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters: ", gs.best_params_)
print("Best Score: ", gs.best_score_)


Fitting 5 folds for each of 16 candidates, totalling 80 fits
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.9837       0.2694        2.9537  0.4872
      2        2.7568       0.3412        2.3293  0.5365
      3        1.8964       0.6338        1.4487  0.4936
      4        1.2136       0.6937        0.8971  0.5221
      5        0.8114       0.7969        0.6331  0.4777
      6        0.6145       0.8331        0.5407  0.5339
      7        0.5275       0.8469        0.4891  0.4803
      8        0.4764       0.8531        0.4566  0.5371
      9        0.4384       0.8594        0.4361  0.4779
     10        0.4099       0.8619        0.42

40 fits failed out of a total of 80.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.12/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sklearn/pipeline.py", line 662, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/usr/local/lib/python3.12/dist-packages/skorch/classifier.py", line 168,

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.9725       0.2090        2.9094  0.6543
      2        2.4094       0.4660        1.7653  0.6063
      3        1.3373       0.6825        1.0079  0.6087
      4        0.8175       0.7210        0.7371  0.6163
      5        0.6167       0.7715        0.5832  0.6123
      6        0.5012       0.8350        0.4497  0.6281
      7        0.4307       0.8745        0.3801  0.6104
      8        0.3859       0.8830        0.3567  0.6137
      9        0.3571       0.8805        0.3514  0.6233
     10        0.3351       0.8810        0.3500  0.6354
     11        0.3161       0.8840

In [None]:
# Collect the cross-validation results and show the five best-ranked candidates
results_df = pd.DataFrame(gs.cv_results_)
results_df = results_df.sort_values(by='rank_test_score')
results_df.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_net__lr,param_net__max_epochs,param_net__module__input_size,param_vect__max_features,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
5,27.813629,0.33165,0.470517,0.131624,0.5,50,200,300,"{'net__lr': 0.5, 'net__max_epochs': 50, 'net__...",0.891,0.906,0.911,0.9145,0.9135,0.9072,0.008617,1
4,27.604493,0.188422,0.523491,0.136353,0.5,50,200,200,"{'net__lr': 0.5, 'net__max_epochs': 50, 'net__...",0.8935,0.9055,0.9085,0.905,0.9195,0.9064,0.008309,2
1,11.939804,0.096168,0.417216,0.012118,0.5,20,200,300,"{'net__lr': 0.5, 'net__max_epochs': 20, 'net__...",0.8925,0.8995,0.8935,0.901,0.896,0.8965,0.003302,3
0,12.25832,0.361542,0.539453,0.155838,0.5,20,200,200,"{'net__lr': 0.5, 'net__max_epochs': 20, 'net__...",0.89,0.8925,0.8915,0.9045,0.9035,0.8964,0.006264,4
13,27.778677,0.223999,0.404107,0.007437,0.1,50,200,300,"{'net__lr': 0.1, 'net__max_epochs': 50, 'net__...",0.872,0.898,0.892,0.906,0.882,0.89,0.011933,5
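
A possible reading of the 40 failed fits, offered as an inference because the traceback above is cut off before the actual exception: `CountVectorizerWrapper` builds its inner `CountVectorizer` in `__init__`, so the `vect__max_features` values set by `GridSearchCV` never reach it and the pipeline always emits 200 features. That would make every candidate with `net__module__input_size=300` fail (8 of 16 candidates, 40 of 80 fits), and it would explain why the table shows `input_size=200` succeeding together with `max_features=300`. The sketch below uses a hypothetical `CharCountVectorizer` that creates the vectorizer inside `fit`, plus a list-of-dicts grid that keeps the two sizes equal. The class name and the toy strings are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class CharCountVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of the wrapper that builds the vectorizer in fit()."""
    def __init__(self, analyzer='char', ngram_range=(2, 2), max_features=200):
        # Only store the parameters here, so set_params() and clone() stay in sync.
        self.analyzer = analyzer
        self.ngram_range = ngram_range
        self.max_features = max_features

    def fit(self, X, y=None):
        # Build the vectorizer from the current parameter values on every fit.
        self.vectorizer_ = CountVectorizer(
            analyzer=self.analyzer,
            ngram_range=self.ngram_range,
            max_features=self.max_features,
        )
        self.vectorizer_.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer_.transform(X).astype(np.float32)

toy_texts = ['abcdef', 'ghijkl', 'mnopqr']   # 15 distinct character bigrams

vect = CharCountVectorizer(max_features=10)
print(vect.fit_transform(toy_texts).shape)

vect.set_params(max_features=5)              # what GridSearchCV does per candidate
print(vect.fit_transform(toy_texts).shape)

# Tie the two sizes together so that no mismatched pair is ever tried.
param_grid = [
    {
        'vect__max_features': [n],
        'net__module__input_size': [n],
        'net__lr': [0.5, 0.1],
        'net__max_epochs': [20, 50],
    }
    for n in (200, 300)
]
```

With this grid the search covers 8 candidates instead of 16, and each one feeds the network as many features as its first layer expects.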



---

📝❓ Write your lab report here, addressing all questions in the notebook.