<a href="https://colab.research.google.com/github/Fisherman-s-Friend/atmt_2024/blob/main/Copy_of_ex1_lr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4NLP1

## Starting Point for Exercise 1, part I

This notebook serves as a starting point and inspiration for exercise 1, part I.

One of the goals of this exercise is to get you acquainted with sklearn and related libraries like pandas and numpy. You will probably need to consult the documentation of those libraries:
- sklearn: [Documentation](https://scikit-learn.org/stable/user_guide.html)
- Pandas: [Documentation](https://pandas.pydata.org/docs/#)
- NumPy: [Documentation](https://numpy.org/doc/)
- SHAP: [Documentation](https://shap.readthedocs.io/en/latest/index.html)

## Task Description

Follow the instructions in this notebook to:

1. Explore the data and create training/test splits for your experiments

2. Build a LogisticRegression classifier and design some relevant features to apply it to your data

3. Conduct hyperparameter tuning to find the optimal hyperparameters for your model

4. Explore your model's predictions and conduct an error analysis to see where the model fails

5. Conduct an interpretability analysis, investigating the model's most important features.

6. Conduct an ablation study using a subset of languages


Throughout the notebook, there are questions that you should address in your report. These are marked with 📝❓.

☝ Note, these questions are intended to provide you with an opportunity to reflect on what it is that you are doing and the kind of challenges you might face along the way.




In [1]:
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

### Loading the datasets

In [2]:
# Download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /content/x_train.txt
100% 64.1M/64.1M [00:02<00:00, 25.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /content/x_test.txt
100% 65.2M/65.2M [00:01<00:00, 38.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /content/y_train.txt
100% 480k/480k [00:00<00:00, 99.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /content/y_test.txt
100% 480k/480k [00:00<00:00, 96.0MB/s]


In [3]:
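# Read each split line by line; the explicit encoding keeps non-Latin scripts intact
with open('x_train.txt', encoding='utf-8') as f:
    x_train = f.read().splitlines()
with open('y_train.txt', encoding='utf-8') as f:
    y_train = f.read().splitlines()
with open('x_test.txt', encoding='utf-8') as f:
    x_test = f.read().splitlines()
with open('y_test.txt', encoding='utf-8') as f:
    y_test = f.read().splitlines()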

In [4]:
# Combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# Write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# Combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})
# Inspect the first 5 items in the train split
train_df.head()

Unnamed: 0,text,label
0,Klement Gottwaldi surnukeha palsameeriti ning ...,est
1,"Sebes, Joseph; Pereira Thomas (1961) (på eng)....",swe
2,भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षे...,mai
3,"Après lo cort periòde d'establiment a Basilèa,...",oci
4,ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung...,tha


In [5]:
# Get list of all labels
labels = train_df['label'].unique().tolist()
print(labels)

['est', 'swe', 'mai', 'oci', 'tha', 'orm', 'lim', 'guj', 'pnb', 'zea', 'krc', 'hat', 'pcd', 'tam', 'vie', 'pan', 'szl', 'ckb', 'fur', 'wuu', 'arz', 'ton', 'eus', 'map-bms', 'glk', 'nld', 'bod', 'jpn', 'arg', 'srd', 'ext', 'sin', 'kur', 'che', 'tuk', 'pag', 'tur', 'als', 'koi', 'lat', 'urd', 'tat', 'bxr', 'ind', 'kir', 'zh-yue', 'dan', 'por', 'fra', 'ori', 'nob', 'jbo', 'kok', 'amh', 'khm', 'hbs', 'slv', 'bos', 'tet', 'zho', 'kor', 'sah', 'rup', 'ast', 'wol', 'bul', 'gla', 'msa', 'crh', 'lug', 'sun', 'bre', 'mon', 'nep', 'ibo', 'cdo', 'asm', 'grn', 'hin', 'mar', 'lin', 'ile', 'lmo', 'mya', 'ilo', 'csb', 'tyv', 'gle', 'nan', 'jam', 'scn', 'be-tarask', 'diq', 'cor', 'fao', 'mlg', 'yid', 'sme', 'spa', 'kbd', 'udm', 'isl', 'ksh', 'san', 'aze', 'nap', 'dsb', 'pam', 'cym', 'srp', 'stq', 'tel', 'swa', 'vls', 'mzn', 'bel', 'lad', 'ina', 'ava', 'lao', 'min', 'ita', 'nds-nl', 'oss', 'kab', 'pus', 'fin', 'snd', 'kaa', 'fas', 'cbk', 'cat', 'nci', 'mhr', 'roa-tara', 'frp', 'ron', 'new', 'bar', 'ltg'


### 1.1 Exploring the training data

📝❓Take a look at a couple of texts from different languages and answer the following questions:

1. Do you notice anything that might be challenging for the classification?
2. How is the data distributed? (i.e., how many instances per label are there in the training and test set? Is it a balanced dataset?)
3. Do you think the train/test split is appropriate (i.e., is the test data representative of the training data)? If not, please rearrange the data in a more appropriate way.
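
One way to approach questions 2 and 3 is to put the per-label counts of both splits side by side. The sketch below uses small made-up frames (`toy_train`, `toy_test`) in place of `train_df` and `test_df`; the helper name `label_summary` is hypothetical, not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df / test_df
toy_train = pd.DataFrame({"label": ["eng", "eng", "deu", "deu", "swe", "swe"]})
toy_test = pd.DataFrame({"label": ["eng", "deu", "deu", "swe"]})


def label_summary(train, test):
    """Return per-label counts for both splits, aligned on the label."""
    counts = pd.DataFrame({
        "train": train["label"].value_counts(),
        "test": test["label"].value_counts(),
    })
    # Labels missing from one split show up as NaN; treat them as zero
    return counts.fillna(0).astype(int)


summary = label_summary(toy_train, toy_test)
print(summary)

# A split is balanced when every label has the same number of instances
is_balanced_train = summary["train"].nunique() == 1
is_balanced_test = summary["test"].nunique() == 1

# Labels that occur in only one of the two splits
only_in_one = set(toy_train["label"]) ^ set(toy_test["label"])
```

On the real data, the same call would be `label_summary(train_df, test_df)`. Equal counts alone do not show that the test texts resemble the training texts, so it is still worth reading a few samples per language.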


In [6]:
# TODO: Inspect the training data
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117500 entries, 0 to 117499
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    117500 non-null  object
 1   label   117500 non-null  object
dtypes: object(2)
memory usage: 1.8+ MB


In [7]:
# number of entries per label in training set
train_df['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
est,500
eng,500
vep,500
sgs,500
uig,500
...,...
lmo,500
mya,500
ilo,500
csb,500


In [8]:
# and in test set
test_df['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
mwl,500
uig,500
tat,500
nno,500
new,500
...,...
frp,500
krc,500
mlg,500
msa,500


In [9]:
# Adjust pandas display options to show full text
pd.set_option('display.max_colwidth', None)

In [10]:
train_df.head(15)

Unnamed: 0,text,label
0,Klement Gottwaldi surnukeha palsameeriti ning paigutati mausoleumi. Surnukeha oli aga liiga hilja ja oskamatult palsameeritud ning hakkas ilmutama lagunemise tundemärke. 1962. aastal viidi ta surnukeha mausoleumist ära ja kremeeriti. Zlíni linn kandis aastatel 1949–1989 nime Gottwaldov. Ukrainas Harkivi oblastis kandis Zmiivi linn aastatel 1976–1990 nime Gotvald.,est
1,"Sebes, Joseph; Pereira Thomas (1961) (på eng). The Jesuits and the Sino-Russian treaty of Nerchinsk (1689): the diary of Thomas Pereira. Bibliotheca Instituti historici S. I., 99-0105377-3 ; 18. Rome. Libris 677492",swe
2,"भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षेत्रीय आह्वान, उत्तेजनासभ एवम प्रयत्नसँ प्रेरित, भारतीय राजनैतिक सङ्गठनद्वारा सञ्चालित अहिंसावादी आ सैन्यवादी आन्दोलन छल, जेकर एक समान उद्देश्य, अङ्ग्रेजी शासनक भारतीय उपमहाद्वीपसँ जडीसँ उखाड फेकनाई छल। ई आन्दोलनक शुरुआत १८५७ मे भेल सिपाही विद्रोहक मानल जाइत अछि। स्वाधीनताक लेल हजारो लोग अपन प्राणक बलि देलक। भारतीय राष्ट्रीय कांग्रेस १९३० कांग्रेस अधिवेशन मे अङ्ग्रेजसँ पूर्ण स्वराजक मांग केने छल।",mai
3,"Après lo cort periòde d'establiment a Basilèa, tornèt a viatjar, venent un «sabent caminaire», . Sa preséncia es atestada a Colmar ont escriguèt sus la sifilis e Bertheonea sive chirurgia minor, manual per «legit» e interpretar los signes corporals, a Esslingen (aprigondèt sas coneissença en sciéncias ocultas), Nurembèrg (novembre de 1529: faguèt la coneissénça del mistic Sébastien Franck, Beratzhausen (1530: comencèt a escriure de la teologia e lo Paragranum), Sankt Gallen (1531: acabèt Liber paramirum), Appenzell (1533), Sterzing (1534 : sonhava de la pèsta), Merano, Sant Moritz, Pfäffers (Bad Ragaz), Ulm, Augsborg (1536), Munic, Eferding (1537), Kromau (en Moràvia: escriguèt son Astronomia magna), Viena (1537-1538, lo recebèt Ferrand I del Sant Empèri, rei de Boèlia e d'Ongria, rei dels Romans), Villach (mai de 1538). En païses minièrs (val d’Inn), a Appenzell, escriguèt sus las malautiás dels minors (1533); dins las vilas d’aigas (coma Pfäffers) estudièt los benfachs de las aigas termalas (1535), fondant atal la medecina professionala e la balneoterapia.",oci
4,ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung) เริ่มตั้งแต่ถนนสนามไชยถึงแม่น้ำเจ้าพระยาที่ถนนตก กรุงเทพมหานคร เป็นถนนรุ่นแรกที่ใช้เทคนิคการสร้างแบบตะวันตก ปัจจุบันผ่านพื้นที่เขตพระนคร เขตป้อมปราบศัตรูพ่าย เขตสัมพันธวงศ์ เขตบางรัก เขตสาทร และเขตบางคอแหลม,tha
5,"Mudde 14, 2012tti, TNQ USA raxibacumab antiraaksii qilleensaan seenuu yaaluu akka limmoodhaan kennamu eeyyame. Raxibacumab antiibodii monoclonial kan summii Baasiles Antiraasisiin uummamu sana halaksiisa. Summiin kun tishuurratti miidhama hin fayyinee fi du'a fida. Antiibodii monoclonial kun pirootiinii antiibodii namaa fakkaatu yoo ta'u qaamota alaa nama keessa seenan kan akka baakteeriyaa fi vaayiresii barbaadanii balleessa.",orm
6,"Kiribati is 't 174e land op de wereld, nao Saõ Tomé en Príncipe en veur Bahrein. 't Besteit oet inkel tientalle eilendsjes, allemaol koraoleilendsjes (väölal atolle), die naovenant mer kort bove zie oetkoume. De eilendsjes groepere ziech zoe: Banaba (e geïsoleerd eiland in 't weste), de Gilberteilen (16 atolle in 't midde), de Fenikseilen (8 atolle en koraoleilen in 't zuidzuidooste) en de Lijneilen (8 atolle en ei rif in 't ooste; dao-oonder Kiritimati, 't groetste bovezies atol op de wereld). De mierderheid vaan de eilen is bewoend. Banaba haolt de veur koraoleilen oetzunderleke huugde vaan 81 meter, de res vaan 't land löp acuut gevaar veur es gevolg vaan 't breujkaseffek oonder te loupe.",lim
7,"He was a economics graduate from Elphinstone College, Mumbai. He was an industrialist in plastics business. He served as a president of Gujarat Chamber of Commerce and Industry in 1990s.",guj
8,براعظم ایشیاء تے یورپ اتے پھیلے ہوئے دیس ترکی دے 81 صوبے نیں ۔ اس دے بوہتے صوبےآں دے ناں اس صوبے دے صدرمقام شہر اتے نیں ۔ صوبہ ہاتے ترکی دے 81 صوبےآں چوں اک اے تے اسدا انتظامی مرکز انطاکیہ شہر اے ۔,pnb
9,"Vanwehe zen Gentsen ofkomst wor 't een ok wè Gandensis enoemd. In meie van 't jaer 980 trouwen Arnold mie Lutgardis van Luxemburg, den dochter van Siegfried van Verdun, graef van Luxemburg. Naeda Arnold was esneuveld in de Slag bie Wienkel, volhen zen zeune Dirk III van 'Olland een op as graef.",zea


In [11]:
# Select a random item to inspect
random_instance = test_df.sample(1)

# Print the contents of the 'text' column at index 0 (there is only 1 item)
print('**** INPUT ****')
print(random_instance['text'].iloc[0])

# Print the contents of the 'label' column at index 0 (there is only 1 item)
print('*** TARGET ****')
print(random_instance['label'].iloc[0])

**** INPUT ****
En il diever linguistic general designeschan ils terms ‹romantica› e ‹romantic› oz per ordinari in stadi sentimental, magari er collià cun brama. Tipicas èn cumbinaziuns da pleds sco ‹muments romantics› u ‹ina cuntrada romantica›. Per ‹in’affera romantica› è er vegnì en diever il term ‹romanza› che designava oriundamain il gener litterar romanza. Er en quest pled sa manifestescha la transfurmaziun da l’idea romantica da l’epoca istorica en il mund dad oz.
*** TARGET ****
roh


### 1.2 Data preparation

Get a subset of the train/test data that includes 20 languages.
Include English, German, Dutch, Danish, Swedish, Norwegian, and Japanese, plus 13 additional languages of your choice based on the items in the list of labels.

In [32]:
language_codes = ['eng', 'deu', 'nld', 'dan', 'swe', 'nob', 'nno', 'jpn', 'tsn', 'xho', 'swa', 'rus', 'afr', 'ell', 'zh-yue', 'aym', 'fao', 'mai', 'hau', 'por']


In [33]:
len(language_codes)

20

In [35]:
# Get the unique labels in DataFrame
unique_labels_in_df = train_df['label'].unique()

# Check which codes are present and which are missing
present_codes = [code for code in language_codes if code in unique_labels_in_df]
missing_codes = [code for code in language_codes if code not in unique_labels_in_df]

print("Present codes:", present_codes)
print("Missing codes:", missing_codes)

print(len(present_codes))

Present codes: ['eng', 'deu', 'nld', 'dan', 'swe', 'nob', 'nno', 'jpn', 'tsn', 'xho', 'swa', 'rus', 'afr', 'ell', 'zh-yue', 'aym', 'fao', 'mai', 'hau', 'por']
Missing codes: []
20


There seem to be 2 variants of Norwegian in the dataset - *Bokmål* (nob) and *Nynorsk* (nno). Zulu is lacking.

In [36]:
! head -n 5 x_test.txt

Ne l fin de l seclo XIX l Japon era inda çconhecido i sótico pa l mundo oucidental. Cula antroduçon de la stética japonesa, particularmente na Sposiçon Ounibersal de 1900, an Paris, l Oucidente adquiriu un apetite ansaciable pul Japon i Heiarn se tornou mundialmente coincido pula perfundidade, ouriginalidade i sinceridade de ls sous cuntos. An sous radadeiros anhos, alguns críticos, cumo George Orwell, acusórun Heiarn de trasferir sou nacionalismo i fazer l Japon parecer mais sótico, mas, cumo l'home qu'oufereciu al Oucidente alguns de sous purmeiros lampeijos de l Japon pré-andustrial i de l Período Meiji, sou trabalho inda ye balioso até hoije.
Schiedam is gelegen tussen Rotterdam en Vlaardingen, oorspronkelijk aan de Schie en later ook aan de Nieuwe Maas. Per 30 april 2017 had de gemeente 77.833 inwoners (bron: CBS). De stad is vooral bekend om haar jenever, de historische binnenstad met grachten, en de hoogste windmolens ter wereld.
ГIурусаз батальонал, гьоркьор гIарадабиги лъун, у

In [37]:
! head -n 5 y_test.txt

mwl
nld
ava
tcy
bjn


In [38]:
# Create training subset
subset_x_train = [x for x,y in zip(x_train, y_train) if y in language_codes]
subset_y_train = [y for y in y_train if y in language_codes]

# Create testing subset
subset_x_test = [x for x,y in zip(x_test, y_test) if y in language_codes]
subset_y_test = [y for y in y_test if y in language_codes]

In [39]:
# TODO: With the following code, we wanted to ENCODE the labels, however, our cat was walking on the keyboard and some of it got changed. Can you fix it?
from sklearn.preprocessing import LabelEncoder

# initialize and fit label encoder
label_encoder = LabelEncoder().fit(subset_y_train)

# transform training and test data
encoded_y_train = label_encoder.transform(subset_y_train)
encoded_y_test = label_encoder.transform(subset_y_test)

label_encoder.classes_
encoded_y_train
encoded_y_test

array([10,  8, 17, ...,  4,  2, 15])

### 2.1 Build a LogisticRegression classifier

To start with, we're going to build a very simple LogisticRegression classifier.
Use a `Pipeline` to chain together a `CountVectorizer` and a `LogisticRegression` estimator. Then perform a 5-fold cross-validation and report the scores of this model as a baseline.

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# TODO: Define a very basic pipeline using a CountVectorizer and a LogisticRegression classifier



In [47]:

# max_features caps the vocabulary so that rare words do not inflate the feature matrix
# (CountVectorizer works at the word level by default)
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(max_features=10000)),
    ('classifier', LogisticRegression(max_iter=1000))
])


In [48]:
# TODO: Run a cross validation to estimate the model's expected performance

# 5-fold cross-validation on our subset of 20 languages, using our subset_x_train and our encoded_y_train as parameters
cv_scores = cross_val_score(pipeline, subset_x_train, encoded_y_train, cv=5, scoring='accuracy')

# cross-validation scores and the mean score as a baseline
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation score: ", np.mean(cv_scores))

Cross-validation scores:  [0.922  0.919  0.9295 0.9225 0.9225]
Mean cross-validation score:  0.9231



### 2.2 Feature Engineering

So far, we've only considered the basic `CountVectorizer` at the word level to encode our input texts for our model.

Your task is to apply some text preprocessing and engineer some more informative features.

To do this, think about what other features might be relevant for determining the language of an input text.

Define a custom set of feature extractors and implement the necessary preprocessing steps to extract these features from strings.

Then initialise a processing pipeline that converts your input data into features that the model can take as input.

☝ Note, this step can be as involved as your heart desires, there is only one minimal requirement: you must use something more than the base `CountVectorizer`. We recommend that you take a look at the [`BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#transformermixin) classes from `sklearn`, as these can be helpful for defining custom transformers.


In [50]:
# TODO: Data cleaning/Feature engineering steps
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer


# Custom functions to calculate additional text features
def word_length_stats(text):
    # Fall back to a single zero-length word so that empty texts do not produce NaN
    word_lengths = [len(word) for word in text.split()] or [0]
    return [np.mean(word_lengths), np.std(word_lengths), np.median(word_lengths)]

# Function to count various punctuation and language-specific symbols
def punctuation_and_symbol_count(text):
    punctuation_counts = {
        'comma': text.count(','),
        'period': text.count('.'),
        'question': text.count('?'),
        'exclamation': text.count('!'),
        'colon': text.count(':'),
        'semicolon': text.count(';'),
        'quote': text.count('"'),
        'parentheses': text.count('(') + text.count(')'),
        # Language-specific symbols
        'inverted_question': text.count('¿'),
        'inverted_exclamation': text.count('¡'),
        'french_quotes': text.count('«') + text.count('»'),
        'german_quotes': text.count('„') + text.count('“')
    }
    return list(punctuation_counts.values())

# Wrapping custom functions for use in the pipeline
word_length_transformer = FunctionTransformer(lambda x: np.array([word_length_stats(doc) for doc in x]))
punctuation_transformer = FunctionTransformer(lambda x: np.array([punctuation_and_symbol_count(doc) for doc in x]))

# Building the pipeline with feature union
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('char_ngrams', CountVectorizer(analyzer='char', ngram_range=(1, 3), max_features=10000)),
        ('word_ngrams', TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=5000)),
        ('word_length_stats', word_length_transformer),
        ('punctuation_count', punctuation_transformer)
    ])),
    ('classifier', LogisticRegression(max_iter=1000))
])


---

### 3.1 Grid Search

Use sklearn's GridSearchCV and experiment with the following hyperparameters:
1. Penalty (Regularization)
2. Solver
3. Experiment with parameters of the Vectorizer (optional, but highly advised)

☝ Note, don't overdo it at the beginning, since runtime might go up fast!

Make sure you read through the [docs](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html#logisticregression) to get an understanding of what these parameters do.


In [56]:
# TODO: GridSearchCV

from sklearn.model_selection import GridSearchCV



# Define the parameter grid
param_grid = {
    'classifier__penalty': ['l1', 'l2'],  # Add 'elasticnet' only if you set 'solver' to 'saga'
    'classifier__solver': ['liblinear'],        # 'saga' supports elasticnet, 'liblinear' doesn't
    'features__char_ngrams__ngram_range': [(1, 2), (1, 3)],  # Character n-grams
    'features__word_ngrams__ngram_range': [(1, 1), (1, 2)],  # Word n-grams
    'features__char_ngrams__max_features': [5000, 10000],     # Adjust to control dimensionality
    'features__word_ngrams__max_features': [3000, 5000]
}

# Instantiate GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=3,  # Cross-validation splits, adjust as needed
    scoring='accuracy',  # Or any other metric suitable for your task
    verbose=1,
    n_jobs=-1  # Use all processors to speed up
)

# Fit on the 20-language subset; the string labels are used directly so that reports show language codes
grid_search.fit(subset_x_train, subset_y_train)

# Display best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)


Fitting 3 folds for each of 32 candidates, totalling 96 fits
Best parameters found:  {'classifier__penalty': 'l2', 'classifier__solver': 'liblinear', 'features__char_ngrams__max_features': 10000, 'features__char_ngrams__ngram_range': (1, 3), 'features__word_ngrams__max_features': 3000, 'features__word_ngrams__ngram_range': (1, 1)}
Best cross-validation score:  0.9751998295210446


### 3.2 Best Model Selection

After conducting our Grid Search, we should be able to identify our best model by inspecting the Grid Search result attribute `cv_results_`. (Hint: `cv_results_` returns a dictionary, so convert it to a Pandas DataFrame for easy inspection.)

📝❓ What were the hyperparameter combinations for your best-performing model on the test set?

📝❓ What is the advantage of grid search cross-validation?


In [58]:
# TODO: Select the best model based on the GridSearch results
# Access cv_results_ from the GridSearchCV instance
cv_results = grid_search.cv_results_

# Convert cv_results_ to a DataFrame
results_df = pd.DataFrame(cv_results)


In [59]:
# Sort by the mean test score in descending order
sorted_results = results_df.sort_values(by='mean_test_score', ascending=False)

# Display the top 5 configurations
print(sorted_results[['mean_test_score', 'std_test_score', 'params']].head())

    mean_test_score  std_test_score  \
28           0.9752        0.001352   
29           0.9752        0.001352   
30           0.9752        0.001352   
31           0.9751        0.001473   
21           0.9746        0.001479   

                                                                                                                                                                                                                                                     params  
28  {'classifier__penalty': 'l2', 'classifier__solver': 'liblinear', 'features__char_ngrams__max_features': 10000, 'features__char_ngrams__ngram_range': (1, 3), 'features__word_ngrams__max_features': 3000, 'features__word_ngrams__ngram_range': (1, 1)}  
29  {'classifier__penalty': 'l2', 'classifier__solver': 'liblinear', 'features__char_ngrams__max_features': 10000, 'features__char_ngrams__ngram_range': (1, 3), 'features__word_ngrams__max_features': 3000, 'features__word_ngrams__ngram_range': (1, 2)}  
30 

### 3.3 Model Evaluation

Once you have identified your best model, use it to predict the languages of texts in the test split.

📝❓ According to standard metrics (e.g. Accuracy, Precision, Recall and F1), how well does your model perform on the held-out test set?


In [60]:
# TODO: Evaluate the model by inspecting the predictions on the heldout test set
# Retrieve the best model from the grid search
best_model = grid_search.best_estimator_

# Use the best model to predict languages on the test split
predictions = best_model.predict(subset_x_test)

from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy
accuracy = accuracy_score(subset_y_test, predictions)
print("Test set accuracy: ", accuracy)

# Detailed classification report
print(classification_report(subset_y_test, predictions))


Test set accuracy:  0.9807
              precision    recall  f1-score   support

         afr       0.99      0.99      0.99       500
         aym       0.99      0.98      0.99       500
         dan       0.96      0.96      0.96       500
         deu       0.96      0.98      0.97       500
         ell       1.00      0.99      0.99       500
         eng       0.90      0.98      0.94       500
         fao       1.00      1.00      1.00       500
         hau       1.00      0.99      1.00       500
         jpn       1.00      0.99      0.99       500
         mai       1.00      0.98      0.99       500
         nld       1.00      0.99      0.99       500
         nno       0.94      0.97      0.96       500
         nob       0.95      0.91      0.93       500
         por       0.98      0.99      0.99       500
         rus       0.99      1.00      1.00       500
         swa       0.97      0.98      0.98       500
         swe       1.00      1.00      1.00       500


---

### 4.1 Error Analysis

Inspect your model's predictions using a confusion matrix and provide a summary of what you find in your report.

📝❓ Where does your model do well and where does it fail?

📝❓ What are some possible reasons for why it fails in these cases?
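
A confusion matrix for this step could be built along the following lines. This is a sketch on made-up labels and predictions (`y_true`, `y_pred`), not results from the model above. In the notebook, the inputs would presumably be `subset_y_test` and `predictions`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical gold labels and predictions, for illustration only
y_true = ["nob", "nob", "nno", "nno", "dan", "dan", "eng"]
y_pred = ["nob", "nno", "nno", "nno", "dan", "nob", "eng"]
labels = sorted(set(y_true))

# Rows are the true labels, columns are the predicted labels
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)

# Collect the off-diagonal cells, i.e. the actual confusions
confusions = [
    (true, pred, int(cm_df.loc[true, pred]))
    for true in labels
    for pred in labels
    if true != pred and cm_df.loc[true, pred] > 0
]
# Most frequent confusions first
confusions.sort(key=lambda item: item[2], reverse=True)
```

Listing the off-diagonal cells makes it easy to see which language pairs are mixed up most often, which is a useful starting point for the two questions above.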

In [None]:
# TODO: Inspect the model's predictions on the different classes

---

### 5.1 Interpretability Analysis

Now that you have your best model, it's time to dive deep into understanding how the model makes predictions.

It is important that we can explain and visualise our models to improve task performance. Explainable models help characterise model fairness, transparency, and outcomes.

Let's try to understand what our best-performing logistic regression classification model has learned.

Inspect the 20 most important features for the languages English, Swedish, Norwegian, and Japanese. Please make sure that the features are named and human-interpretable, not things like "Feat_1". (Hint: if you have used custom feature extractors in your pipeline, you may need to adapt these to make sure that the feature names are maintained.)

📝❓ What is more important, extra features or the outputs of the vectorizer? Please discuss.

We recommend using the [SHAP library](https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/linear_models/Sentiment%20Analysis%20with%20Logistic%20Regression.html) as discussed in the tutorial. We've provided an example notebook for working with SHAP for multi-class classification in the course GitHub repo.

☝ Note, if you prefer to use another interpretability tool, we will accept answers from any explanation library/method as long as the explanations for the model weights are provided in a structured/clear way.



In [62]:
# To use shap, we first need to install it into the current environment
!pip install --upgrade shap

import shap

Collecting shap
  Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Collecting slicer==0.0.8 (from shap)
  Downloading slicer-0.0.8-py3-none-any.whl.metadata (4.0 kB)
Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 540.1/540.1 kB 9.8 MB/s eta 0:00:00
Downloading slicer-0.0.8-py3-none-any.whl (15 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.46.0 slicer-0.0.8


---

### 6.1 Ablation Study

Lastly, we want to conduct a small ablation study to investigate how well our model performs under different conditions.

As a first step, choose the two languages for which the classifier worked best.

Next, re-fit the best model six times, each time reducing the **length** of each instance in the training set. To do this, create a custom `TextReducer` class that you can include as a preprocessing step in your pipeline. The class should take a `max_len` argument as a hyperparameter that can be set to train the following models:

- Model 1: `max_len = None` (i.e. no truncation!)
- Model 2: `max_len = 500`
- Model 3: `max_len = 250`
- Model 4: `max_len = 150`
- Model 5: `max_len = 100`
- Model 6: `max_len = 50`

Use average accuracy over the cross validation scores for each model to measure performance for each ablation setting.

📝❓ How does the reduction of training data affect the performance of the classifier? And what could be some possible reasons for this?
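
A `TextReducer` along the lines described above might look like the following sketch. It assumes that `max_len` counts characters, which the task does not state explicitly. It also uses a small made-up dataset and a plain character-level pipeline rather than the tuned model.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class TextReducer(BaseEstimator, TransformerMixin):
    """Truncate each text to at most max_len characters (None = no truncation)."""

    def __init__(self, max_len=None):
        self.max_len = max_len

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if self.max_len is None:
            return list(X)
        return [text[: self.max_len] for text in X]


# Hypothetical two-language toy data
toy_x = (["the quick brown fox jumps over the lazy dog"] * 6
         + ["der schnelle braune fuchs springt über den faulen hund"] * 6)
toy_y = ["eng"] * 6 + ["deu"] * 6

results = {}
for max_len in [None, 500, 250, 150, 100, 50]:
    pipe = Pipeline([
        ("reducer", TextReducer(max_len=max_len)),
        ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(1, 2))),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, toy_x, toy_y, cv=3, scoring="accuracy")
    results[max_len] = scores.mean()

print(results)
```

Note that as a pipeline step the truncation is applied to the held-out folds as well as the training folds. If only the training texts should be shortened, the reduction would have to happen outside the pipeline.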

In [None]:
# TODO: Ablation study

---

📝❓ Write your lab report here addressing all questions in the notebook

## Inspecting the data

One of the challenges in handling this many languages is that they use different scripts. In addition, many languages (e.g. Japanese, Mandarin, Thai) do not separate words with spaces in writing.