# Language Identification Project
## Problem statement

The aim of this project is to compare different text representations and learning models on a classification task.
Given a paragraph written in one dominant language, the goal is to identify that language.

The dataset used is the [WiLI-2018 - Wikipedia Language Identification Dataset](https://zenodo.org/record/841984).

## Overview

I tried three different text representations:

* HashingVectorizer (BoW)
* TF-IDF
* Dense Word Embedding (FastText)

For each of them I compared three different algorithms for classification:

* Multinomial Naive Bayes
* Rocchio Algorithm (Nearest Centroids)
* k-Nearest Neighbors

I then compared the results of these techniques with a state-of-the-art model:

* BERT Transformer model 


## Configuration

In [None]:
!pip install fasttext transformers

In [2]:
import pandas as pd
import numpy as np
import random
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
from matplotlib import pyplot as plt
from google.colab import drive
import scipy.stats  # explicit submodule import: scipy.stats.norm is used for the confidence intervals
import math
import fasttext
import fasttext.util
from collections import OrderedDict
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments, DistilBertTokenizer, DistilBertForSequenceClassification

drive.mount('/content/drive')
PATH = '/content/drive/MyDrive/UniBO/DM_TM_BDA/TM/'

Mounted at /content/drive


### Download Dataset

In [3]:
!wget https://zenodo.org/record/841984/files/wili-2018.zip?download=1
!unzip wili-2018.zip\?download\=1

--2022-05-20 09:19:11--  https://zenodo.org/record/841984/files/wili-2018.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62403646 (60M) [application/octet-stream]
Saving to: ‘wili-2018.zip?download=1’


2022-05-20 09:19:27 (4.33 MB/s) - ‘wili-2018.zip?download=1’ saved [62403646/62403646]

Archive:  wili-2018.zip?download=1
  inflating: x_train.txt             
  inflating: y_train.txt             
  inflating: x_test.txt              
  inflating: y_test.txt              
  inflating: labels.csv              
  inflating: README.txt              
  inflating: urls.txt                


## Dataset Exploration

In this section we briefly inspect the dataset.

* It contains 1000 paragraphs for each of 235 languages, 235,000 paragraphs in total. We reduced it to 750 paragraphs per language due to time and memory constraints
* Each paragraph is written in one dominant language
* The data is split into a training set (about 67%) and a test set (about 33%):
  * The dataset is perfectly balanced
  * We have 500 paragraphs for each language in the training set and 250 in the test set

In [4]:
labels_df = pd.read_csv('labels.csv', sep=';')
labels_df.head()

Unnamed: 0,Label,English,Wiki Code,ISO 369-3,German,Language family,Writing system,Remarks,Synonyms
0,ace,Achinese,ace,ace,Achinesisch,Austronesian,,,
1,afr,Afrikaans,af,afr,Afrikaans,Indo-European,,,
2,als,Alemannic German,als,gsw,Alemannisch,Indo-European,,(ursprünglich nur Elsässisch),
3,amh,Amharic,am,amh,Amharisch,Afro-Asiatic,,,
4,ang,Old English,ang,ang,Altenglisch,Indo-European,,(ca. 450-1100),Angelsächsisch


In [5]:
n_languages = len(labels_df)
labels = list(labels_df['Label'])
languages_names = labels_df['English']
languages_dict = {labels_df['Label'][i]: labels_df['English'][i] for i in range(len(labels_df))}
print(f'In the dataset there are {n_languages} different languages')

In the dataset there are 235 different languages


In [6]:
class Dataset():
  # NOTE: this class name shadows torch.utils.data.Dataset imported in the Configuration cell
  def __init__(self, dataset_loc):
    self.X_train = self._read_lines(dataset_loc + 'x_train.txt')
    self.Y_train = self._read_lines(dataset_loc + 'y_train.txt')
    self.X_test = self._read_lines(dataset_loc + 'x_test.txt')
    self.Y_test = self._read_lines(dataset_loc + 'y_test.txt')

  @staticmethod
  def _read_lines(path):
    # One paragraph (or label) per line; the context manager closes the file
    with open(path, encoding='utf-8') as f:
      return f.read().splitlines()

  def get_data(self):
    # Keep only 250 test paragraphs per language
    self.split_dataset(n_items = 250)
    return self.X_train, self.Y_train, list(self.X_test), list(self.Y_test)
  
  def split_dataset(self, n_items):
    # Sort the test set by encoded label so that each language forms a contiguous block
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(self.Y_test)
    corpus = list(zip(self.X_test, y_encoded))
    sorted_corpus = sorted(corpus, key=lambda x: x[1])
    self.X_test = []
    self.Y_test = []
    # The original test set has 500 paragraphs per language:
    # keep the first n_items of each block of 500
    for i in range(0,len(sorted_corpus), 500):
      self.X_test += [t[0] for t in sorted_corpus[i:i+n_items]]
      self.Y_test += [t[1] for t in sorted_corpus[i:i+n_items]]
    self.Y_test = label_encoder.inverse_transform(self.Y_test)
    # Shuffle so that the test set is no longer grouped by language
    shuffle_test = list(zip(self.X_test, self.Y_test))
    random.shuffle(shuffle_test)
    self.X_test, self.Y_test = zip(*shuffle_test)
    

In [7]:
X_train, Y_train, X_test, Y_test = Dataset('/content/').get_data()

print("X_train shape:", len(X_train))
print("Y_train shape:", len(Y_train))
print("X_test shape:", len(X_test))
print("Y_test shape:", len(Y_test))

X_train shape: 117500
Y_train shape: 117500
X_test shape: 58750
Y_test shape: 58750


In [8]:
print(f"Example of paragraph: \n\t{X_train[0]}\n\nLanguage: {languages_dict[Y_train[0]]}")

Example of paragraph: 
	Klement Gottwaldi surnukeha palsameeriti ning paigutati mausoleumi. Surnukeha oli aga liiga hilja ja oskamatult palsameeritud ning hakkas ilmutama lagunemise tundemärke. 1962. aastal viidi ta surnukeha mausoleumist ära ja kremeeriti. Zlíni linn kandis aastatel 1949–1989 nime Gottwaldov. Ukrainas Harkivi oblastis kandis Zmiivi linn aastatel 1976–1990 nime Gotvald.

Language: Estonian


Here we count how many paragraphs the training set contains for each language, in order to check whether the dataset is balanced.

In [None]:
from collections import Counter

# Count every label in a single pass instead of rescanning Y_train for each language
class_counts = Counter(Y_train)
# str() keeps the original comparison: a label that pandas parsed as NaN becomes 'nan'
items_per_class = np.array([class_counts[str(label)] for label in labels], dtype=float)

items_per_class_dict = {languages_names[i]: int(items_per_class[i]) for i in range(n_languages)}
print('Number of occurences for each language in the training set:')
for key, count in items_per_class_dict.items():
  print(f'\t{key}: {count}')

Number of occurences for each language in the training set:
	Achinese: 500
	Afrikaans: 500
	Alemannic German: 500
	Amharic: 500
	Old English : 500
	Arabic: 500
	Aragonese: 500
	Egyptian Arabic: 500
	Assamese: 500
	Asturian: 500
	Avar: 500
	Aymara: 500
	South Azerbaijani: 500
	Azerbaijani: 500
	Bashkir: 500
	Bavarian: 500
	Central Bikol: 500
	Belarusian (Taraschkewiza): 500
	Belarusian: 500
	Bengali: 500
	Bhojpuri: 500
	Banjar: 500
	Tibetan: 500
	Bosnian: 500
	Bishnupriya: 500
	Breton: 500
	Bulgarian: 500
	Buryat: 500
	Catalan: 500
	Chavacano: 500
	Min Dong: 500
	Cebuano: 500
	Czech: 500
	Chechen: 500
	Cherokee: 500
	Chuvash: 500
	Central Kurdish: 500
	Cornish: 500
	Corsican: 500
	Crimean Tatar: 500
	Kashubian: 500
	Welsh: 500
	Danish: 500
	German: 500
	Dimli: 500
	Dhivehi: 500
	Lower Sorbian: 500
	Doteli: 500
	Emilian: 500
	Modern Greek: 500
	English: 500
	Esperanto: 500
	Estonian: 500
	Basque: 500
	Extremaduran: 500
	Faroese: 500
	Persian: 500
	Finnish: 500
	French: 500
	Arpitan: 500


## Models and Text representation

I tried three different text representations:

* HashingVectorizer (BoW)
* TF-IDF
* Dense Word Embedding (FastText)

For each of them I compared three different algorithms for classification:

* Multinomial Naive Bayes
* Rocchio Algorithm (Nearest Centroids)
* k-Nearest Neighbors


### Grid search

For some classifiers and models we perform a grid search to find the best hyperparameters for each configuration.

In particular, we tuned the following hyperparameters for the vectorizers:

*  ngram_range: [(1,1), (1,2)]
  *  Unigrams only, or unigrams and bigrams
*  analyzer: word, char, char_wb
  *  Whether the features are word n-grams or character n-grams. The option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

For the classifiers we tuned the following parameters:

* MultinomialNB: 
  * alpha = (0.5, 1.0) : Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
* k-Nearest Neighbors: 
  *  n_neighbors = (5, 7) : Number of neighbors to use
* Rocchio Algorithm: no classifier hyperparameter is tuned, only the vectorizer parameters above
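
To make the `analyzer` and `ngram_range` options concrete, here is a minimal pure-Python sketch of word and character n-gram extraction. It is a simplified illustration, not scikit-learn's implementation: the helper names `word_ngrams` and `char_ngrams` are hypothetical, and details such as lowercasing, the token pattern, and the space padding of `char_wb` are left out.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return the word n-grams for every n in the inclusive ngram_range."""
    tokens = text.split()
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, ngram_range=(1, 1)):
    """Return the character n-grams for every n in the inclusive ngram_range."""
    lo, hi = ngram_range
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]


print(word_ngrams("il gatto nero", (1, 2)))
# ['il', 'gatto', 'nero', 'il gatto', 'gatto nero']
print(char_ngrams("gatto", (1, 2)))
# ['g', 'a', 't', 't', 'o', 'ga', 'at', 'tt', 'to']
```

With `ngram_range=(1, 2)` both the unigrams and the bigrams become features, which is why this setting enlarges the feature space compared with `(1, 1)`.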



In [37]:
def custom_grid_search(clf_flag, tf_idf_flag, x_train, y_train):

  def get_pipeline(classifier, parameters):
    if tf_idf_flag:
      pipeline = Pipeline([('vectorizer', HashingVectorizer()),
                           ('tf_idf', TfidfTransformer()),
                           ('classifier', classifier)])
    else:
      pipeline = Pipeline([('vectorizer', HashingVectorizer()),
                           ('classifier', classifier)])
    return pipeline

  parameters = {
    'vectorizer__analyzer': ('word', 'char', 'char_wb'),
    'vectorizer__ngram_range': ((1, 1), (1, 2)),
    }

  if clf_flag == 'mNB':
    pipeline = get_pipeline(MultinomialNB(), parameters)
    parameters['vectorizer__alternate_sign'] = [False]
    parameters['classifier__alpha'] = (0.5, 1.0)

  elif clf_flag == 'kNN':
    pipeline = get_pipeline(KNeighborsClassifier(), parameters)
    parameters['classifier__n_neighbors'] = (5, 7)
  
  else:
    pipeline = get_pipeline(NearestCentroid(), parameters)
    
  grid_search = GridSearchCV(pipeline, parameters, scoring='accuracy', verbose=1)
  grid_search.fit(x_train, y_train)

  print("Best parameters set:")
  best_parameters = grid_search.best_estimator_.get_params()
  for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

  return grid_search
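
The number of fits that GridSearchCV reports follows directly from the size of the parameter grid. As a quick sanity check, assuming the mNB grid built by the function above and 5-fold cross-validation (the GridSearchCV default when `cv` is not set), the candidates can be enumerated with `itertools.product`. The variable names below are illustrative only.

```python
from itertools import product

# Parameter grid used for Multinomial Naive Bayes in custom_grid_search
mnb_grid = {
    'vectorizer__analyzer': ('word', 'char', 'char_wb'),
    'vectorizer__ngram_range': ((1, 1), (1, 2)),
    'vectorizer__alternate_sign': [False],
    'classifier__alpha': (0.5, 1.0),
}

# Every combination of one value per parameter is one candidate
candidates = list(product(*mnb_grid.values()))
n_folds = 5  # assumed: default number of cross-validation folds
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)
# 12 60
```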

### Confidence interval

In this section we define the functions used to compute the confidence interval of each model's accuracy. The interval computed below is the Wilson score interval for a binomial proportion.

In [10]:
def get_confidence_interval(n, acc, alpha):
  """
  Return the confidence interval for the confidence level 
  specified by the alpha parameter

  Parameters
  ----------
  - n : cardinality of the test set 
  - acc : accuracy of the model 
  - alpha : significance level. The confidence level will be 1-alpha

  Return
  ------
  - [p_min, p_max]: list of lower and upper values of the confidence interval 
 
 """
  # Compute confidence level 
  conf_level = 1-alpha

  # Retrieve the value of Z for the specified confidence level
  Z = scipy.stats.norm.ppf((1+conf_level)/2)

  # Compute p_min and p_max
  first_num = (2*n*acc) + pow(Z,2)
  second_num = Z * math.sqrt(pow(Z,2)+(4*n*acc)-(4*n*pow(acc,2)))
  den = 2*(n+pow(Z,2))
  p_min = (first_num - second_num) / den 
  p_max = (first_num + second_num) / den 

  # Round p_min and p_max to a precision of 3 in decimal digits 
  return [round(p_min,3), round(p_max,3)]
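
As a cross-check of the formula above, here is a small standalone sketch of the same interval written in the usual "center plus or minus half-width" form of the Wilson score interval. It uses only the standard library (`statistics.NormalDist` in place of `scipy.stats.norm`); the function name `wilson_interval` is hypothetical and not part of the notebook.

```python
import math
from statistics import NormalDist


def wilson_interval(n, acc, alpha):
    """Wilson score interval for an accuracy `acc` measured on `n` samples."""
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / n
    center = (acc + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# One class with 250 test paragraphs and accuracy 0.984, at 95% confidence
low, high = wilson_interval(250, 0.984, 0.05)
print(round(low, 3), round(high, 3))
# 0.96 0.994
```

Dividing the numerator and denominator of `p_min` and `p_max` in `get_confidence_interval` by `2*n` gives exactly this form, so the two versions should agree up to rounding.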

In [11]:
def print_conf_int_per_class(Y_test, Y_pred, alpha):
  """
  Print the per-class accuracy (recall) and its confidence interval

  Parameters
  ----------
  - Y_test : true labels of the test set 
  - Y_pred : labels predicted by the model 
  - alpha : significance level. The confidence level will be 1-alpha
  """
  matrix = confusion_matrix(Y_test, Y_pred)
  # Row sums give the number of test paragraphs of each class (250 here)
  support = matrix.sum(axis=1)
  accuracy_per_class = matrix.diagonal()/support
  intervals = []
  for n, acc in zip(support, accuracy_per_class):
    intervals.append(get_confidence_interval(n, acc, alpha))
  df = pd.DataFrame(list(zip(languages_names, accuracy_per_class, intervals)), columns =['Label', 'Accuracy', 'Confidence Interval'])
  pd.set_option('display.max_rows', 235)
  print(df)

## Hashing Vectorizer

HashingVectorizer converts a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information).

It uses the hashing trick to map each token string to a feature integer index, so no vocabulary has to be stored in memory.

We first tried CountVectorizer but had to switch to HashingVectorizer because of RAM limits.
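
The hashing trick itself can be sketched in a few lines. The example below is an illustration under stated assumptions, not scikit-learn's code: `hashed_counts` is a hypothetical helper, `zlib.crc32` stands in for the MurmurHash3 function that scikit-learn uses, and 16 buckets replace the much larger default feature space.

```python
import zlib


def hashed_counts(text, n_features=16):
    """Count tokens into a fixed-size vector, using a hash as the column index."""
    vector = [0] * n_features
    for token in text.split():
        # Stand-in hash (assumption): any deterministic hash works for the sketch
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


a = hashed_counts("la casa è la mia casa")
print(a)
print(sum(a))  # one increment per token, so the total is the token count
```

Because the column index comes from the hash rather than from a stored vocabulary, memory use does not grow with the number of distinct tokens. The price is that different tokens can collide in the same column.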

### Multinomial Naive Bayes

In [None]:
grid_search = custom_grid_search(clf_flag='mNB', tf_idf_flag=False, x_train=X_train, y_train=Y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters set:
	classifier__alpha: 0.5
	vectorizer__alternate_sign: False
	vectorizer__analyzer: 'word'
	vectorizer__ngram_range: (1, 1)


In [None]:
Y_pred = grid_search.best_estimator_.predict(X_test)

In [None]:
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.809     0.984     0.888       250
                 Afrikaans      0.984     0.992     0.988       250
          Alemannic German      0.971     0.940     0.955       250
                   Amharic      0.988     0.992     0.990       250
              Old English       0.978     0.900     0.938       250
                    Arabic      0.905     0.988     0.945       250
                 Aragonese      0.984     0.956     0.970       250
           Egyptian Arabic      0.954     0.916     0.935       250
                  Assamese      0.988     0.964     0.976       250
                  Asturian      0.964     0.960     0.962       250
                      Avar      0.952     0.720     0.820       250
                    Aymara      0.967     0.932     0.949       250
         South Azerbaijani      0.976     0.996     0.986       250
               Azerbaijani      0.992     0.980

In [None]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.9283
Confidence interval:  [0.926, 0.93]


In [None]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.984       [0.96, 0.994]
1                     Afrikaans     0.992      [0.971, 0.998]
2              Alemannic German     0.940      [0.903, 0.963]
3                       Amharic     0.992      [0.971, 0.998]
4                  Old English      0.900      [0.857, 0.931]
5                        Arabic     0.988      [0.965, 0.996]
6                     Aragonese     0.956      [0.923, 0.975]
7               Egyptian Arabic     0.916      [0.875, 0.944]
8                      Assamese     0.964      [0.933, 0.981]
9                      Asturian     0.960      [0.928, 0.978]
10                         Avar     0.720      [0.661, 0.772]
11                       Aymara     0.932      [0.894, 0.957]
12            South Azerbaijani     0.996      [0.978, 0.999]
13                  Azerbaijani     0.980      [0.954, 0.991]
14                      Bashkir     0.968      [0.938, 0.984]
15      

### Rocchio (Nearest Centroids)

In [None]:
grid_search = custom_grid_search(clf_flag='NC', tf_idf_flag=False, x_train=X_train, y_train=Y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters set:
	vectorizer__analyzer: 'char'
	vectorizer__ngram_range: (1, 2)


In [None]:
Y_pred = grid_search.best_estimator_.predict(X_test)

In [None]:
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.991     0.904     0.946       250
                 Afrikaans      0.888     0.892     0.890       250
          Alemannic German      0.725     0.780     0.751       250
                   Amharic      1.000     0.992     0.996       250
              Old English       0.982     0.888     0.933       250
                    Arabic      0.830     0.936     0.880       250
                 Aragonese      0.640     0.732     0.683       250
           Egyptian Arabic      0.920     0.824     0.869       250
                  Assamese      1.000     0.960     0.980       250
                  Asturian      0.770     0.684     0.725       250
                      Avar      0.883     0.696     0.779       250
                    Aymara      0.986     0.852     0.914       250
         South Azerbaijani      0.918     0.980     0.948       250
               Azerbaijani      0.971     0.924

In [None]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.8496
Confidence interval:  [0.847, 0.852]


In [None]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.904      [0.861, 0.935]
1                     Afrikaans     0.892      [0.847, 0.925]
2              Alemannic German     0.780      [0.725, 0.827]
3                       Amharic     0.992      [0.971, 0.998]
4                  Old English      0.888      [0.843, 0.921]
5                        Arabic     0.936       [0.899, 0.96]
6                     Aragonese     0.732      [0.674, 0.783]
7               Egyptian Arabic     0.824      [0.772, 0.866]
8                      Assamese     0.960      [0.928, 0.978]
9                      Asturian     0.684      [0.624, 0.738]
10                         Avar     0.696       [0.636, 0.75]
11                       Aymara     0.852      [0.803, 0.891]
12            South Azerbaijani     0.980      [0.954, 0.991]
13                  Azerbaijani     0.924      [0.884, 0.951]
14                      Bashkir     0.952      [0.918, 0.972]
15      

### k-Nearest Neighbors

In [20]:
grid_search = custom_grid_search(clf_flag='kNN', tf_idf_flag=False, x_train=X_train, y_train=Y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters set:
	classifier__n_neighbors: 5
	vectorizer__analyzer: 'char'
	vectorizer__ngram_range: (1, 2)


In [21]:
Y_pred = grid_search.best_estimator_.predict(X_test)

In [22]:
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.979     0.948     0.963       250
                 Afrikaans      0.758     0.952     0.844       250
          Alemannic German      0.816     0.904     0.858       250
                   Amharic      0.992     0.992     0.992       250
              Old English       0.957     0.888     0.921       250
                    Arabic      0.845     0.960     0.899       250
                 Aragonese      0.737     0.752     0.745       250
           Egyptian Arabic      0.950     0.828     0.885       250
                  Assamese      0.860     0.960     0.907       250
                  Asturian      0.623     0.792     0.697       250
                      Avar      0.769     0.800     0.784       250
                    Aymara      0.983     0.924     0.953       250
         South Azerbaijani      0.969     1.000     0.984       250
               Azerbaijani      0.996     0.956

In [23]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.8922
Confidence interval:  [0.89, 0.895]


In [24]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.948      [0.913, 0.969]
1                     Afrikaans     0.952      [0.918, 0.972]
2              Alemannic German     0.904      [0.861, 0.935]
3                       Amharic     0.992      [0.971, 0.998]
4                  Old English      0.888      [0.843, 0.921]
5                        Arabic     0.960      [0.928, 0.978]
6                     Aragonese     0.752      [0.695, 0.801]
7               Egyptian Arabic     0.828       [0.776, 0.87]
8                      Assamese     0.960      [0.928, 0.978]
9                      Asturian     0.792      [0.737, 0.838]
10                         Avar     0.800      [0.746, 0.845]
11                       Aymara     0.924      [0.884, 0.951]
12            South Azerbaijani     1.000        [0.985, 1.0]
13                  Azerbaijani     0.956      [0.923, 0.975]
14                      Bashkir     0.968      [0.938, 0.984]
15      

## TF-IDF
TfidfTransformer transforms a count matrix into a normalized tf or tf-idf representation.

Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
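
The down-weighting of frequent tokens can be seen on a toy corpus. This sketch uses the textbook definition idf = log(N / df); scikit-learn's TfidfTransformer applies a smoothed variant plus vector normalization by default, so its numbers would differ. The corpus and the helper `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumption): three tokenized documents
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]
n_docs = len(docs)

# Document frequency: number of documents in which each token appears
df = Counter(token for doc in docs for token in set(doc))


def tf_idf(doc):
    """Weight each token of `doc` by term frequency times log(N / df)."""
    tf = Counter(doc)
    return {token: count * math.log(n_docs / df[token])
            for token, count in tf.items()}


weights = tf_idf(docs[0])
print(weights)
```

Here "the" appears in every document and gets weight 0, while "cat", which appears in a single document, gets the largest weight.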

### Multinomial Naive Bayes


In [27]:
grid_search = custom_grid_search(clf_flag='mNB', tf_idf_flag=True, x_train=X_train, y_train=Y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters set:
	classifier__alpha: 0.5
	vectorizer__alternate_sign: False
	vectorizer__analyzer: 'word'
	vectorizer__ngram_range: (1, 1)


In [28]:
Y_pred = grid_search.best_estimator_.predict(X_test)

In [29]:
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.801     0.984     0.883       250
                 Afrikaans      0.984     0.992     0.988       250
          Alemannic German      0.971     0.948     0.960       250
                   Amharic      0.992     0.996     0.994       250
              Old English       0.991     0.904     0.946       250
                    Arabic      0.904     0.980     0.940       250
                 Aragonese      0.992     0.988     0.990       250
           Egyptian Arabic      0.954     0.912     0.933       250
                  Assamese      0.992     0.964     0.978       250
                  Asturian      0.976     0.968     0.972       250
                      Avar      0.919     0.768     0.837       250
                    Aymara      0.963     0.932     0.947       250
         South Azerbaijani      0.973     0.996     0.984       250
               Azerbaijani      0.992     0.980

In [30]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.9347
Confidence interval:  [0.933, 0.937]


In [31]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.984       [0.96, 0.994]
1                     Afrikaans     0.992      [0.971, 0.998]
2              Alemannic German     0.948      [0.913, 0.969]
3                       Amharic     0.996      [0.978, 0.999]
4                  Old English      0.904      [0.861, 0.935]
5                        Arabic     0.980      [0.954, 0.991]
6                     Aragonese     0.988      [0.965, 0.996]
7               Egyptian Arabic     0.912       [0.87, 0.941]
8                      Assamese     0.964      [0.933, 0.981]
9                      Asturian     0.968      [0.938, 0.984]
10                         Avar     0.768      [0.712, 0.816]
11                       Aymara     0.932      [0.894, 0.957]
12            South Azerbaijani     0.996      [0.978, 0.999]
13                  Azerbaijani     0.980      [0.954, 0.991]
14                      Bashkir     0.976      [0.949, 0.989]
15      

### Rocchio (Nearest Centroids)

In [32]:
grid_search = custom_grid_search(clf_flag='NC', tf_idf_flag=True, x_train=X_train, y_train=Y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters set:
	vectorizer__analyzer: 'char'
	vectorizer__ngram_range: (1, 2)


In [33]:
Y_pred = grid_search.best_estimator_.predict(X_test)

In [34]:
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      1.000     0.932     0.965       250
                 Afrikaans      0.933     0.948     0.940       250
          Alemannic German      0.790     0.860     0.824       250
                   Amharic      1.000     0.996     0.998       250
              Old English       1.000     0.888     0.941       250
                    Arabic      0.860     0.960     0.907       250
                 Aragonese      0.902     0.884     0.893       250
           Egyptian Arabic      0.947     0.852     0.897       250
                  Assamese      1.000     0.964     0.982       250
                  Asturian      0.903     0.852     0.877       250
                      Avar      0.922     0.712     0.804       250
                    Aymara      0.995     0.884     0.936       250
         South Azerbaijani      0.914     0.980     0.946       250
               Azerbaijani      0.940     0.948

In [35]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.8866
Confidence interval:  [0.884, 0.889]


In [36]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.932      [0.894, 0.957]
1                     Afrikaans     0.948      [0.913, 0.969]
2              Alemannic German     0.860      [0.812, 0.898]
3                       Amharic     0.996      [0.978, 0.999]
4                  Old English      0.888      [0.843, 0.921]
5                        Arabic     0.960      [0.928, 0.978]
6                     Aragonese     0.884      [0.838, 0.918]
7               Egyptian Arabic     0.852      [0.803, 0.891]
8                      Assamese     0.964      [0.933, 0.981]
9                      Asturian     0.852      [0.803, 0.891]
10                         Avar     0.712      [0.653, 0.765]
11                       Aymara     0.884      [0.838, 0.918]
12            South Azerbaijani     0.980      [0.954, 0.991]
13                  Azerbaijani     0.948      [0.913, 0.969]
14                      Bashkir     0.964      [0.933, 0.981]
15      

### k-Nearest Neighbors

In [38]:
grid_search = custom_grid_search(clf_flag='kNN', tf_idf_flag=True, x_train=X_train, y_train=Y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters set:
	classifier__n_neighbors: 5
	vectorizer__analyzer: 'char'
	vectorizer__ngram_range: (1, 2)


In [39]:
Y_pred = grid_search.best_estimator_.predict(X_test)

In [40]:
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.984     0.964     0.974       250
                 Afrikaans      0.819     0.980     0.893       250
          Alemannic German      0.898     0.848     0.872       250
                   Amharic      1.000     0.996     0.998       250
              Old English       0.970     0.904     0.936       250
                    Arabic      0.852     0.968     0.906       250
                 Aragonese      0.884     0.852     0.868       250
           Egyptian Arabic      0.963     0.844     0.900       250
                  Assamese      0.856     0.972     0.910       250
                  Asturian      0.802     0.840     0.820       250
                      Avar      0.810     0.820     0.815       250
                    Aymara      1.000     0.920     0.958       250
         South Azerbaijani      0.958     1.000     0.978       250
               Azerbaijani      1.000     0.968

In [41]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.9137
Confidence interval:  [0.911, 0.916]


In [42]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.964      [0.933, 0.981]
1                     Afrikaans     0.980      [0.954, 0.991]
2              Alemannic German     0.848      [0.798, 0.887]
3                       Amharic     0.996      [0.978, 0.999]
4                  Old English      0.904      [0.861, 0.935]
5                        Arabic     0.968      [0.938, 0.984]
6                     Aragonese     0.852      [0.803, 0.891]
7               Egyptian Arabic     0.844      [0.794, 0.884]
8                      Assamese     0.972      [0.943, 0.986]
9                      Asturian     0.840       [0.789, 0.88]
10                         Avar     0.820      [0.768, 0.863]
11                       Aymara     0.920       [0.88, 0.948]
12            South Azerbaijani     1.000        [0.985, 1.0]
13                  Azerbaijani     0.968      [0.938, 0.984]
14                      Bashkir     0.968      [0.938, 0.984]
15      

## Word embeddings

So far we have worked with sparse representations, which produce very high-dimensional vectors. Their main drawback is that every term gets its own dimension, so related words share nothing. A dense embedding removes this limitation: words no longer occupy separate dimensions, and semantic relationships are easier to model.

In this section, we will experiment with the FastText dense embedding model.

**Workflow:**

1.   Vocabulary Creation
2.   Training of an embedding model using FastText and the corpus
3.   Compute OOV terms
4.   Compute Embedding Matrix
5.   Prepare Input Data
6.   Training classifiers



In [None]:
def build_vocabulary(train_corpus, test_corpus):
    """
    Given train corpus and test corpus, builds the corresponding word vocabulary.

    --------------
    Return: 
      - word vocabulary: vocabulary index to word
      - inverse word vocabulary: word to vocabulary index
      - word listing: list of unique terms that build up the vocabulary
    """
    # Accept lists as well as array-like inputs (e.g. numpy arrays),
    # also when only one of the two corpora is a list
    corpus = list(train_corpus) + list(test_corpus)

    # Get all unique whitespace-separated words in the corpus
    unique_words = set()
    for sentence in corpus:
        unique_words.update(sentence.split())

    # Cast to a sorted list so that indices are reproducible across runs
    word_listing = sorted(unique_words)

    # Build word_to_idx <word : idx>
    word_to_idx = OrderedDict((word, idx) for idx, word in enumerate(word_listing))

    # Build vocabulary index to word <idx : word>
    idx_to_word = OrderedDict((idx, word) for word, idx in word_to_idx.items())

    return idx_to_word, word_to_idx, word_listing

In [None]:
idx_to_word, word_to_idx, word_listing = build_vocabulary(X_train, X_test)
VOCAB_SIZE = len(word_listing)

Here we train a FastText embedding model from scratch on the training corpus, using the unsupervised skip-gram approach with an embedding dimension of 100. If a previously saved model is found, it is loaded instead of being retrained.

In [None]:
fasttext.FastText.eprint = lambda x: None
EMBEDDING_DIM = 100

try:
    EMBEDDING_MODEL = fasttext.load_model(PATH + "embedding_model.bin")
except ValueError as e:
    print(e)
    EMBEDDING_MODEL = fasttext.train_unsupervised(PATH +'x_train.txt', model='skipgram', minCount=1, dim=EMBEDDING_DIM)
    EMBEDDING_MODEL.save_model(PATH + "embedding_model.bin")

In [None]:
def check_OOV_terms(embedding_model, vocabulary_terms):
  '''Return the out-of-vocabulary (OOV) terms as a list, together with their count.'''
  oov = set(vocabulary_terms).difference(set(embedding_model.words))
  return list(oov), len(oov)

In [None]:
_, n_oov_terms = check_OOV_terms(EMBEDDING_MODEL, word_listing)
print(f"Total OOV terms: {n_oov_terms} ({n_oov_terms*100/len(word_listing):.03f}%)")

Total OOV terms: 694567 (26.447%)


About 26.4% of the vocabulary terms are out-of-vocabulary for the embedding model, because it was trained on the training corpus only, while the vocabulary also includes the test corpus. This would be a problem when computing the embedding matrix, but FastText can build a vector for an OOV term from its character n-grams, whose vectors were learned during training. `get_word_vector` therefore returns an embedding for every term in our vocabulary.
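
To illustrate the idea, here is a minimal sketch of how a word can be split into character n-grams. It is not the FastText implementation: the function name `char_ngrams` is hypothetical, and the bounds of 3 and 6 are assumed to match the library defaults for unsupervised training, since the notebook does not set `minn` and `maxn` explicitly.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# Trigrams only, for readability
print(char_ngrams("where", minn=3, maxn=3))

# A word seen in training and an unseen variant share many n-grams,
# which is why the unseen word can still receive a meaningful vector.
shared = set(char_ngrams("language")) & set(char_ngrams("languages"))
print(len(shared))
```

The vector of an unseen word is then assembled from the vectors of n-grams like these, which were learned from the words that did appear in the training corpus.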

In [None]:
def compute_embedding_matrix(embedding_model, idx_to_word):
  """
  Return the embedding matrix of the vocabulary (train + test):
  row idx holds the FastText vector of the word idx_to_word[idx].
  """
  embedding_matrix = np.zeros((len(idx_to_word), EMBEDDING_DIM), dtype='float32')

  for idx, word in idx_to_word.items():
    embedding_matrix[idx] = embedding_model.get_word_vector(word)
  return embedding_matrix

In [None]:
WEIGHT_MATRIX = compute_embedding_matrix(EMBEDDING_MODEL, idx_to_word)

In [None]:
print(f'Weights matrix size: {WEIGHT_MATRIX.shape}')
print(f'Vocab size: {VOCAB_SIZE}')

Weights matrix size: (2626309, 100)
Vocab size: 2626309


In [None]:
def word_to_idx_conversion(corpus, word_to_idx): 
  """
  Return input data (sentences) encoded
  
  Parameters
  ----------
  - corpus (train or test corpus)

  Return
  ------
  - result [list]: sentences encoded using the word_to_idx dict
  """
  result = []
  for sentence in corpus:
    encoded_sentence = []
    for token in sentence.split():
      encoded_sentence.append(word_to_idx[token])
    result.append(encoded_sentence)
  return result

encoded_X_train = word_to_idx_conversion(X_train, word_to_idx)
encoded_X_test = word_to_idx_conversion(X_test, word_to_idx)

In [None]:
def compute_sentence_embeddings(encoded_X):
  """
  Return an embedding for each sentence
  
  Parameters
  ----------
  - encoded train or test corpus

  Return
  ------
  - result [list]: sentences embedding computed as mean of each word embedding
  """
  result = []
  for sentence in encoded_X:
    embedding_list = []
    for idx in sentence:
      embedding_list.append(WEIGHT_MATRIX[idx])
    sentence_embedding = np.mean(embedding_list, axis=0)
    result.append(sentence_embedding)
  return result

X_train = compute_sentence_embeddings(encoded_X_train)
X_test = compute_sentence_embeddings(encoded_X_test)
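
The mean pooling above can also be sketched as a standalone function that returns a single 2-d array and guards against sentences with no tokens, for which `np.mean` of an empty list would produce NaN. The name `mean_pool` and the toy matrix are hypothetical; using a zero vector for empty sentences is an assumption, not something the notebook does.

```python
import numpy as np

def mean_pool(encoded_sentences, weight_matrix):
    """Average the rows of weight_matrix selected by each sentence's indices."""
    dim = weight_matrix.shape[1]
    pooled = np.zeros((len(encoded_sentences), dim), dtype=weight_matrix.dtype)
    for row, indices in enumerate(encoded_sentences):
        if indices:  # empty sentences keep the zero vector (assumption)
            pooled[row] = weight_matrix[indices].mean(axis=0)
    return pooled

# Toy embedding matrix: 3 words, 2 dimensions
toy_matrix = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 2.0]], dtype="float32")
toy_sentences = [[0, 1], [2], []]

toy_embeddings = mean_pool(toy_sentences, toy_matrix)
print(toy_embeddings)
```

Returning one array instead of a list of vectors also makes the shape `(n_sentences, EMBEDDING_DIM)` explicit before the data is passed to a classifier.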

In [None]:
label_encoder = LabelEncoder()
Y_train = label_encoder.fit_transform(Y_train)
# Reuse the mapping fitted on the training labels instead of refitting on the test labels
Y_test = label_encoder.transform(Y_test)

### Multinomial Naive Bayes

We will not use the MultinomialNB classifier here, because the sentence embeddings have negative components. MultinomialNB models each feature as a count (or a count-like weight such as TF-IDF) drawn from a multinomial distribution, so it requires non-negative feature values.
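
The following sketch shows why negative features break the multinomial model. It reimplements only the smoothed per-class feature estimate in plain Python; it is not scikit-learn's code, and the function name and toy rows are hypothetical. scikit-learn itself rejects negative input when fitting `MultinomialNB`, before any estimate is computed.

```python
import math

def feature_log_probs(class_rows, alpha=1.0):
    """Smoothed multinomial estimate for one class:
    log((N_j + alpha) / (N + alpha * d)), with N_j the summed value of feature j."""
    n_features = len(class_rows[0])
    totals = [sum(row[j] for row in class_rows) for j in range(n_features)]
    denom = sum(totals) + alpha * n_features
    return [math.log((total + alpha) / denom) for total in totals]

# Count features: every estimate is a valid probability
count_rows = [[2, 0, 1], [1, 1, 0]]
count_log_probs = feature_log_probs(count_rows)

# Dense-embedding-like features: a negative total gives a negative "probability"
dense_rows = [[0.4, -1.7, 0.2], [0.1, -1.2, 0.3]]
try:
    feature_log_probs(dense_rows)
    dense_ok = True
except ValueError:  # math.log is undefined for negative arguments
    dense_ok = False

print(count_log_probs, dense_ok)
```

With the count rows the first feature gets probability 0.5, whereas the dense rows lead to the logarithm of a negative number, so no valid model can be estimated.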

### Rocchio (Nearest Centroids)

In [None]:
clf = NearestCentroid()
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.988     0.956     0.972       250
                 Afrikaans      0.984     0.956     0.970       250
          Alemannic German      0.891     0.820     0.854       250
                   Amharic      0.946     0.904     0.924       250
              Old English       0.590     0.892     0.710       250
                    Arabic      0.765     0.744     0.755       250
                 Aragonese      0.925     0.836     0.878       250
           Egyptian Arabic      0.744     0.768     0.756       250
                  Assamese      0.757     0.612     0.677       250
                  Asturian      0.764     0.740     0.752       250
                      Avar      0.284     0.092     0.139       250
                    Aymara      0.616     0.596     0.606       250
         South Azerbaijani      0.687     0.780     0.730       250
               Azerbaijani      0.978     0.908

In [None]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.7899
Confidence interval:  [0.787, 0.793]
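
The helper `get_confidence_interval` is defined earlier in the notebook and is not shown in this section. As a hedged sketch of what such a helper might compute, the function below implements a Wilson score interval for a proportion. Its name and signature are assumptions; it does, however, reproduce the interval printed in this notebook for a class with accuracy 0.956 over 250 samples.

```python
import math
from statistics import NormalDist

def wilson_interval(n, accuracy, alpha=0.05):
    """Wilson score interval for an accuracy measured on n samples."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n
                               + z**2 / (4 * n**2)) / denom
    return [round(center - half_width, 3), round(center + half_width, 3)]

print(wilson_interval(250, 0.956))
```

For large test sets the interval becomes narrow and nearly symmetric around the measured accuracy, which is why the overall accuracies can be reported in the form "value ± half-width".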


In [None]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.956      [0.923, 0.975]
1                     Afrikaans     0.956      [0.923, 0.975]
2              Alemannic German     0.820      [0.768, 0.863]
3                       Amharic     0.904      [0.861, 0.935]
4                  Old English      0.892      [0.847, 0.925]
5                        Arabic     0.744      [0.686, 0.794]
6                     Aragonese     0.836      [0.785, 0.877]
7               Egyptian Arabic     0.768      [0.712, 0.816]
8                      Assamese     0.612        [0.55, 0.67]
9                      Asturian     0.740       [0.682, 0.79]
10                         Avar     0.092      [0.062, 0.134]
11                       Aymara     0.596      [0.534, 0.655]
12            South Azerbaijani     0.780      [0.725, 0.827]
13                  Azerbaijani     0.908      [0.866, 0.938]
14                      Bashkir     0.436      [0.376, 0.498]
15      

### k-Nearest Neighbors

In [None]:
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print(classification_report(Y_test, Y_pred, digits=3, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.984     0.980     0.982       250
                 Afrikaans      0.943     0.984     0.963       250
          Alemannic German      0.818     0.916     0.864       250
                   Amharic      0.953     0.984     0.969       250
              Old English       0.890     0.908     0.899       250
                    Arabic      0.806     0.900     0.851       250
                 Aragonese      0.907     0.856     0.881       250
           Egyptian Arabic      0.884     0.792     0.835       250
                  Assamese      0.833     0.956     0.890       250
                  Asturian      0.748     0.916     0.824       250
                      Avar      0.687     0.772     0.727       250
                    Aymara      0.955     0.928     0.941       250
         South Azerbaijani      0.928     0.976     0.951       250
               Azerbaijani      0.927     0.964

In [None]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.8899
Confidence interval:  [0.887, 0.892]


In [None]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.980      [0.954, 0.991]
1                     Afrikaans     0.984       [0.96, 0.994]
2              Alemannic German     0.916      [0.875, 0.944]
3                       Amharic     0.984       [0.96, 0.994]
4                  Old English      0.908      [0.866, 0.938]
5                        Arabic     0.900      [0.857, 0.931]
6                     Aragonese     0.856      [0.807, 0.894]
7               Egyptian Arabic     0.792      [0.737, 0.838]
8                      Assamese     0.956      [0.923, 0.975]
9                      Asturian     0.916      [0.875, 0.944]
10                         Avar     0.772       [0.716, 0.82]
11                       Aymara     0.928      [0.889, 0.954]
12            South Azerbaijani     0.976      [0.949, 0.989]
13                  Azerbaijani     0.964      [0.933, 0.981]
14                      Bashkir     0.868       [0.82, 0.904]
15      

## BERT Transformer

To put the performance of the previous models in context, we also trained a state-of-the-art model.

We fine-tuned a BERT Transformer model on our corpus.

We used the **distilbert-base-multilingual-cased** model, a lighter and faster distilled version of the **BERT base multilingual** model.
The model was pre-trained on the concatenation of Wikipedia in 104 different languages. It has 6 layers, a hidden dimension of 768 and 12 attention heads, for a total of 134M parameters (compared to 177M parameters for mBERT-base).

In [None]:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
class TransformerDataset(Dataset):
    def __init__(self, tokenizer, X, y):
        self.encodings = tokenizer(X, truncation=True, padding=True)
        self.label_encoder = LabelEncoder()
        self.labels = self.label_encoder.fit_transform(y)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
model_name = 'distilbert-base-multilingual-cased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=n_languages).to(DEVICE)

In [None]:
train_set = TransformerDataset(tokenizer, X_train, Y_train)
test_set = TransformerDataset(tokenizer, X_test, Y_test)

In [None]:
training_args = TrainingArguments(
    output_dir='/content/bert_model/',
    num_train_epochs=1, 
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=50,   
    weight_decay=0.01,    
    logging_dir='./logs',    
    logging_steps=25,
    save_steps=200
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=test_set
)

trainer.train()

In [None]:
model = DistilBertForSequenceClassification.from_pretrained(PATH + '/checkpoint-3200')
model.to(DEVICE)

trainer.model = model
prediction_result = trainer.predict(test_set)

Y_pred = np.argmax(prediction_result.predictions, axis=1).tolist()
Y_pred = LabelEncoder().fit(Y_train).inverse_transform(Y_pred)

In [None]:
print(classification_report(Y_test, Y_pred, digits=3, zero_division=0, target_names=languages_names))

                            precision    recall  f1-score   support

                  Achinese      0.992     0.984     0.988       250
                 Afrikaans      1.000     0.992     0.996       250
          Alemannic German      0.940     0.884     0.911       250
                   Amharic      0.844     0.736     0.786       250
              Old English       0.987     0.944     0.965       250
                    Arabic      0.942     0.976     0.959       250
                 Aragonese      0.992     0.984     0.988       250
           Egyptian Arabic      0.983     0.932     0.957       250
                  Assamese      1.000     0.960     0.980       250
                  Asturian      0.996     0.972     0.984       250
                      Avar      0.871     0.732     0.796       250
                    Aymara      0.975     0.940     0.957       250
         South Azerbaijani      0.996     0.996     0.996       250
               Azerbaijani      1.000     0.988

In [None]:
# Computing the confidence interval for the model accuracy

accuracy = accuracy_score(Y_test, Y_pred)
print('Accuracy: ', np.round(accuracy, 4))
confidence_interval = get_confidence_interval(len(X_test), accuracy, 0.05)
print('Confidence interval: ', confidence_interval)

Accuracy:  0.9428
Confidence interval:  [0.941, 0.945]


In [None]:
# Compute the confidence interval for the model accuracy for each class

print_conf_int_per_class(Y_test, Y_pred, 0.05)

                          Label  Accuracy Confidence Interval
0                      Achinese     0.984       [0.96, 0.994]
1                     Afrikaans     0.992      [0.971, 0.998]
2              Alemannic German     0.884      [0.838, 0.918]
3                       Amharic     0.736      [0.678, 0.787]
4                  Old English      0.944      [0.908, 0.966]
5                        Arabic     0.976      [0.949, 0.989]
6                     Aragonese     0.984       [0.96, 0.994]
7               Egyptian Arabic     0.932      [0.894, 0.957]
8                      Assamese     0.960      [0.928, 0.978]
9                      Asturian     0.972      [0.943, 0.986]
10                         Avar     0.732      [0.674, 0.783]
11                       Aymara     0.940      [0.903, 0.963]
12            South Azerbaijani     0.996      [0.978, 0.999]
13                  Azerbaijani     0.988      [0.965, 0.996]
14                      Bashkir     0.988      [0.965, 0.996]
15      

## Results
* **MultinomialNB** + **HashingVectorizer**: 
  * Accuracy = 0.9283 ± 0.0023
* **Rocchio Algorithm** + **HashingVectorizer**: 
  * Accuracy = 0.8496 ± 0.0026
* **k-Nearest Neighbors** + **HashingVectorizer**: 
  * Accuracy = 0.8922 ± 0.0022

---

* **MultinomialNB** + **TF-IDF Vectorizer**: 
  * Accuracy = 0.9347 ± 0.0017
* **Rocchio Algorithm** + **TF-IDF Vectorizer**: 
  * Accuracy = 0.8866 ± 0.0026
* **k-Nearest Neighbors** + **TF-IDF Vectorizer**: 
  * Accuracy = 0.9137 ± 0.0036

---

* **Rocchio Algorithm** + **Dense Word Embeddings**: 
  * Accuracy = 0.7899 ± 0.003
* **k-Nearest Neighbors** + **Dense Word Embeddings**: 
  * Accuracy = 0.8899 ± 0.003

---

* **BERT Transformer model**:
  * Accuracy = 0.9428 ± 0.0018


## Final considerations

* The best model in terms of accuracy is the **BERT Transformer model** (0.9428), but **MultinomialNB + TF-IDF Vectorizer** (0.9347) also performed very well on this task despite being a much simpler model.

* Regarding the text representations, the **TF-IDF Vectorizer** turned out to be the best approach: compared with the **Hashing Vectorizer**, it achieved better performance for all the classifiers. With the **dense word embedding** model we obtained results comparable to the other representations for **k-Nearest Neighbors** (0.8899), and worse results for the **Rocchio algorithm** (0.7899). This may be due to the simple approach used to build the sentence embeddings (an unweighted mean of the word vectors), or to the fact that we trained the embedding model only on the training corpus.

* Regarding the classifiers, the most effective one turned out to be **MultinomialNB**, followed by **k-Nearest Neighbors** and lastly by the **Rocchio Algorithm**. The gap between the **BERT Transformer** model and **MultinomialNB** is small; this is probably because the SOTA model was fine-tuned only briefly (a single epoch in our configuration), without a thorough tuning of its parameters.

### Future work

* The hyperparameter search could be much more extensive; we had to trade off the time spent searching against the time available for experimentation.

* For the BERT Transformer model, we could try the **BERT base multilingual** model: although it is heavier and takes longer to train, it may provide better performance.

* Implementing better preprocessing of the sentences.

* Training the dense embedding model on the whole corpus (train + test).

* Trying other simple classifiers or neural networks.

* Experiments with sentences containing two languages showed that these models are very poor at predicting both labels, which was expected: the training dataset contains only single-language samples, so the models were trained to predict a single label and are confused by anything else. These issues could be addressed by changing the model architecture and prediction logic, and by extending the dataset with multi-language samples.
