Amazon has released MASSIVE, a parallel dataset covering 51 languages, to the public. The same dataset
is also available on Hugging Face at https://huggingface.co/datasets/qanastek/MASSIVE . The dataset
consists of sentences in 51 languages stored in JSON format. Each JSON record contains the
following fields (features).
['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'tokens', 'ner_tags', 'worker_id', 'slot_method',
'judgments']
Of these fields, we are interested only in the subset {‘locale’, ‘partition’, ‘utt’, ‘tokens’}.
‘locale’ is the language-country pair, ‘partition’ indicates which split the sentence belongs to
({‘train’, ‘test’, ‘validation’}), ‘utt’ is the sentence itself, and ‘tokens’ is the sentence split
into tokens.
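As a minimal sketch of keeping only those four fields, the snippet below projects a record onto the subset. The record is a made-up illustration, not a row copied from the dataset, and `select_fields` is a hypothetical helper name.

```python
# Fields of interest, as listed above
FIELDS = ("locale", "partition", "utt", "tokens")

def select_fields(record, fields=FIELDS):
    """Return a copy of `record` restricted to `fields`."""
    return {name: record[name] for name in fields}

# Made-up record for illustration only (not taken from MASSIVE)
example = {
    "id": "0",
    "locale": "en-US",
    "partition": "train",
    "utt": "set an alarm for seven",
    "tokens": ["set", "an", "alarm", "for", "seven"],
    "worker_id": "0",
}

subset = select_fields(example)
print(subset)
```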
We are going to build a language classifier that covers all the languages written in Roman letters. A
transformer-based classifier for all 51 languages has already been built on this dataset and appears to be
state of the art (SOTA): https://huggingface.co/qanastek/51-languages-classifier. Our goal is not to compete
with transformers; rather, we use this exercise to learn about and overcome the challenges of working with
multilingual datasets.

# Task 1
Let’s construct a dataset ourselves from a subset of languages that are Roman-script based. The following
are the locales that we want to consider in our dataset [27 languages].
af-ZA, da-DK, de-DE, en-US, es-ES, fr-FR, fi-FI, hu-HU, is-IS, it-IT, jv-ID, lv-LV, ms-MY, nb-NO, nl-NL, pl-PL, pt-PT,
ro-RO, ru-RU, sl-SL, sv-SE, sq-AL, sw-KE, tl-PH, tr-TR, vi-VN, cy-GB
Programmatically extract the utterances (“utt”) from the dataset for each of the above languages. You can
choose between your own tokenization and the preexisting tokens. By the end of this step, you should have
27 files (one per language) with one sentence per line. Typically, all 27 files will end up with the same
number of lines, as the dataset is a parallel corpus.
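A quick way to check the "same number of lines" expectation is to count the lines in each file. The sketch below is one possible check; `count_lines` and `is_parallel` are hypothetical helper names, and the demonstration runs on two throwaway files rather than the real 27.

```python
import os
import tempfile

def count_lines(directory):
    """Map each .txt file name in `directory` to its number of lines."""
    counts = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".txt"):
            path = os.path.join(directory, filename)
            with open(path, "r", encoding="utf-8") as handle:
                counts[filename] = sum(1 for _ in handle)
    return counts

def is_parallel(counts):
    """True when every file has the same number of lines."""
    return len(set(counts.values())) <= 1

# Tiny demonstration on throwaway files (stand-ins for the real per-language files)
demo_files = {"en-US.txt": ["hello", "thanks"], "de-DE.txt": ["hallo", "danke"]}
with tempfile.TemporaryDirectory() as demo_dir:
    for name, lines in demo_files.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    demo_counts = count_lines(demo_dir)

print(demo_counts, is_parallel(demo_counts))
```

On the real data, `count_lines('all_language_files')` would be the call to make after the extraction step.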
Besides plain English-like characters, you may encounter other characters and characters with accents. You
may choose to de-accent the characters if accents are not useful in your method. Choose wisely.
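If de-accenting turns out to be useful, one common approach is Unicode decomposition followed by dropping the combining marks. The sketch below uses only the standard `unicodedata` module; `deaccent` is a hypothetical helper name. Letters with no canonical decomposition, such as "ø" or "æ", pass through unchanged.

```python
import unicodedata

def deaccent(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in a consistent form
    return unicodedata.normalize("NFC", stripped)

print(deaccent("café"))        # cafe
print(deaccent("tiếng Việt"))  # tieng Viet
print(deaccent("smørbrød"))    # smørbrød (ø has no decomposition)
```

Whether to apply this is a design choice: accents can be a strong cue for telling languages apart, so removing them may cost accuracy even though it shrinks the vocabulary.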

In [None]:
pip install datasets


In [None]:
import os
from datasets import load_dataset
import unicodedata

Train dataset

In [None]:
locales = ['af-ZA', 'da-DK', 'de-DE', 'en-US', 'es-ES', 'fr-FR', 'fi-FI', 'hu-HU',
           'is-IS', 'it-IT', 'jv-ID', 'lv-LV', 'ms-MY', 'nb-NO', 'nl-NL', 'pl-PL',
           'pt-PT', 'ro-RO', 'ru-RU', 'sl-SL', 'sv-SE', 'sq-AL', 'sw-KE', 'tl-PH',
           'tr-TR', 'vi-VN', 'cy-GB']

data=load_dataset('qanastek/MASSIVE')

output_directory='all_language_files'
os.makedirs(output_directory,exist_ok=True)

def extract_sentence(locale,data):
  output_file=os.path.join(output_directory,f"{locale}.txt")
  local_data=data.filter(lambda x:x['locale']==locale)
  with open(output_file,'w',encoding='utf-8') as file:
    for ex in local_data['train']:
      utt=ex['utt']
      file.write(utt + '\n')
  print(f"finished extracting sentences for {locale}.")


for locale in locales:
  extract_sentence(locale,data)

print("Extraction complete for all 27 languages")


finished extracting sentences for af-ZA.
finished extracting sentences for da-DK.
finished extracting sentences for de-DE.
finished extracting sentences for en-US.
finished extracting sentences for es-ES.
finished extracting sentences for fr-FR.
finished extracting sentences for fi-FI.
finished extracting sentences for hu-HU.
finished extracting sentences for is-IS.
finished extracting sentences for it-IT.
finished extracting sentences for jv-ID.
finished extracting sentences for lv-LV.
finished extracting sentences for ms-MY.
finished extracting sentences for nb-NO.
finished extracting sentences for nl-NL.
finished extracting sentences for pl-PL.
finished extracting sentences for pt-PT.
finished extracting sentences for ro-RO.
finished extracting sentences for ru-RU.
finished extracting sentences for sl-SL.
finished extracting sentences for sv-SE.
finished extracting sentences for sq-AL.
finished extracting sentences for sw-KE.
finished extracting sentences for tl-PH.
finished extracting sentences for tr-TR.
finished extracting sentences for vi-VN.
finished extracting sentences for cy-GB.
Extraction complete for all 27 languages


In [None]:
!zip -r all_language_files.zip /content/all_language_files

  adding: content/all_language_files/ (stored 0%)
  adding: content/all_language_files/jv-ID.txt (deflated 71%)
  adding: content/all_language_files/sw-KE.txt (deflated 73%)
  adding: content/all_language_files/vi-VN.txt (deflated 76%)
  adding: content/all_language_files/nl-NL.txt (deflated 72%)
  adding: content/all_language_files/da-DK.txt (deflated 71%)
  adding: content/all_language_files/it-IT.txt (deflated 74%)
  adding: content/all_language_files/af-ZA.txt (deflated 71%)
  adding: content/all_language_files/ms-MY.txt (deflated 74%)
  adding: content/all_language_files/ro-RO.txt (deflated 71%)
  adding: content/all_language_files/sq-AL.txt (deflated 73%)
  adding: content/all_language_files/es-ES.txt (deflated 73%)
  adding: content/all_language_files/pl-PL.txt (deflated 72%)
  adding: content/all_language_files/fi-FI.txt (deflated 72%)
  adding: content/all_language_files/de-DE.txt (deflated 72%)
  adding: content/all_language_files/lv-LV.txt (deflated 72%)
  adding: content/al

Test dataset

In [None]:
locales = ['af-ZA', 'da-DK', 'de-DE', 'en-US', 'es-ES', 'fr-FR', 'fi-FI', 'hu-HU',
           'is-IS', 'it-IT', 'jv-ID', 'lv-LV', 'ms-MY', 'nb-NO', 'nl-NL', 'pl-PL',
           'pt-PT', 'ro-RO', 'ru-RU', 'sl-SL', 'sv-SE', 'sq-AL', 'sw-KE', 'tl-PH',
           'tr-TR', 'vi-VN', 'cy-GB']

data=load_dataset('qanastek/MASSIVE')

output_directory='test_dataset'
os.makedirs(output_directory,exist_ok=True)

def extract_sentence(locale,data):
  output_file=os.path.join(output_directory,f"{locale}.txt")
  local_data=data.filter(lambda x:x['locale']==locale)
  with open(output_file,'w',encoding='utf-8') as file:
    for ex in local_data['test']:
      utt=ex['utt']
      file.write(utt + '\n')
  print(f"finished extracting sentences for {locale}.")


for locale in locales:
  extract_sentence(locale,data)

print("Extraction complete for all 27 languages")


finished extracting sentences for af-ZA.
finished extracting sentences for da-DK.
finished extracting sentences for de-DE.
finished extracting sentences for en-US.
finished extracting sentences for es-ES.
finished extracting sentences for fr-FR.
finished extracting sentences for fi-FI.
finished extracting sentences for hu-HU.
finished extracting sentences for is-IS.
finished extracting sentences for it-IT.
finished extracting sentences for jv-ID.
finished extracting sentences for lv-LV.
finished extracting sentences for ms-MY.
finished extracting sentences for nb-NO.
finished extracting sentences for nl-NL.
finished extracting sentences for pl-PL.
finished extracting sentences for pt-PT.
finished extracting sentences for ro-RO.
finished extracting sentences for ru-RU.
finished extracting sentences for sl-SL.
finished extracting sentences for sv-SE.
finished extracting sentences for sq-AL.
finished extracting sentences for sw-KE.
finished extracting sentences for tl-PH.
finished extract

In [None]:
!zip -r test_dataset.zip /content/test_dataset

  adding: content/test_dataset/ (stored 0%)
  adding: content/test_dataset/jv-ID.txt (deflated 67%)
  adding: content/test_dataset/sw-KE.txt (deflated 70%)
  adding: content/test_dataset/vi-VN.txt (deflated 73%)
  adding: content/test_dataset/nl-NL.txt (deflated 69%)
  adding: content/test_dataset/da-DK.txt (deflated 68%)
  adding: content/test_dataset/it-IT.txt (deflated 70%)
  adding: content/test_dataset/af-ZA.txt (deflated 68%)
  adding: content/test_dataset/ms-MY.txt (deflated 71%)
  adding: content/test_dataset/ro-RO.txt (deflated 67%)
  adding: content/test_dataset/sq-AL.txt (deflated 70%)
  adding: content/test_dataset/es-ES.txt (deflated 70%)
  adding: content/test_dataset/pl-PL.txt (deflated 68%)
  adding: content/test_dataset/fi-FI.txt (deflated 69%)
  adding: content/test_dataset/de-DE.txt (deflated 69%)
  adding: content/test_dataset/lv-LV.txt (deflated 69%)
  adding: content/test_dataset/pt-PT.txt (deflated 69%)
  adding: content/test_dataset/sl-SL.txt (deflated 67%)
  ad

Validation dataset

In [None]:
locales = ['af-ZA', 'da-DK', 'de-DE', 'en-US', 'es-ES', 'fr-FR', 'fi-FI', 'hu-HU',
           'is-IS', 'it-IT', 'jv-ID', 'lv-LV', 'ms-MY', 'nb-NO', 'nl-NL', 'pl-PL',
           'pt-PT', 'ro-RO', 'ru-RU', 'sl-SL', 'sv-SE', 'sq-AL', 'sw-KE', 'tl-PH',
           'tr-TR', 'vi-VN', 'cy-GB']

data=load_dataset('qanastek/MASSIVE')

output_directory='validation_dataset'
os.makedirs(output_directory,exist_ok=True)

def extract_sentence(locale,data):
  output_file=os.path.join(output_directory,f"{locale}.txt")
  local_data=data.filter(lambda x:x['locale']==locale)
  with open(output_file,'w',encoding='utf-8') as file:
    for ex in local_data['validation']:
      utt=ex['utt']
      file.write(utt + '\n')
  print(f"finished extracting sentences for {locale}.")


for locale in locales:
  extract_sentence(locale,data)

print("Extraction complete for all 27 languages")


finished extracting sentences for af-ZA.
finished extracting sentences for da-DK.
finished extracting sentences for de-DE.
finished extracting sentences for en-US.
finished extracting sentences for es-ES.
finished extracting sentences for fr-FR.
finished extracting sentences for fi-FI.
finished extracting sentences for hu-HU.
finished extracting sentences for is-IS.
finished extracting sentences for it-IT.
finished extracting sentences for jv-ID.
finished extracting sentences for lv-LV.
finished extracting sentences for ms-MY.
finished extracting sentences for nb-NO.
finished extracting sentences for nl-NL.
finished extracting sentences for pl-PL.
finished extracting sentences for pt-PT.
finished extracting sentences for ro-RO.
finished extracting sentences for ru-RU.
finished extracting sentences for sl-SL.
finished extracting sentences for sv-SE.
finished extracting sentences for sq-AL.
finished extracting sentences for sw-KE.
finished extracting sentences for tl-PH.
finished extract

In [None]:
!zip -r validation_dataset.zip /content/validation_dataset

  adding: content/validation_dataset/ (stored 0%)
  adding: content/validation_dataset/jv-ID.txt (deflated 67%)
  adding: content/validation_dataset/sw-KE.txt (deflated 69%)
  adding: content/validation_dataset/vi-VN.txt (deflated 72%)
  adding: content/validation_dataset/nl-NL.txt (deflated 68%)
  adding: content/validation_dataset/da-DK.txt (deflated 67%)
  adding: content/validation_dataset/it-IT.txt (deflated 69%)
  adding: content/validation_dataset/af-ZA.txt (deflated 67%)
  adding: content/validation_dataset/ms-MY.txt (deflated 70%)
  adding: content/validation_dataset/ro-RO.txt (deflated 66%)
  adding: content/validation_dataset/sq-AL.txt (deflated 69%)
  adding: content/validation_dataset/es-ES.txt (deflated 69%)
  adding: content/validation_dataset/pl-PL.txt (deflated 67%)
  adding: content/validation_dataset/fi-FI.txt (deflated 68%)
  adding: content/validation_dataset/de-DE.txt (deflated 68%)
  adding: content/validation_dataset/lv-LV.txt (deflated 68%)
  adding: content/va

# Task 2
Build a multinomial Naive Bayes classifier on your 27-language dataset using the ‘train’ partition of the
dataset. Fine-tune the model with the validation partition. Finally, report the performance metrics for all
three partitions.

In [45]:
from google.colab import drive
import zipfile
import os

drive.mount('/content/drive')
zip_path='/content/drive/MyDrive/all_language_files.zip'
with zipfile.ZipFile(zip_path,'r') as zipref:
  zipref.extractall('/content/train_language_dataset')

zip_path='/content/drive/MyDrive/test_dataset.zip'
with zipfile.ZipFile(zip_path,'r') as zipref:
  zipref.extractall('/content/test_language_dataset')

zip_path='/content/drive/MyDrive/validation_dataset.zip'
with zipfile.ZipFile(zip_path,'r') as zipref:
  zipref.extractall('/content/validation_language_dataset')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [46]:
import os

def get_data_from_directory(directory):
  X,y=[],[]
  for filename in os.listdir(directory):
    if filename.endswith(".txt"):
      label=filename.split(".")[0]
      with open(os.path.join(directory,filename),'r',encoding='utf-8') as file:
        for line in file:
          X.append(line.strip())
          y.append(label)
  return X, y

train_dir='/content/train_language_dataset/all_language_files'
test_dir='/content/test_language_dataset/test_dataset'
validation_dir='/content/validation_language_dataset/validation_dataset'

X_train,y_train=get_data_from_directory(train_dir)
X_test,y_test=get_data_from_directory(test_dir)
X_validation,y_validation=get_data_from_directory(validation_dir)

# Concatenate the training, test, and validation sets into a single dataset (X, Y)
X=X_train+X_test+X_validation
X1=X
Y=y_train+y_test+y_validation
Y1=Y

In [47]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import LabelEncoder


In [48]:
label_encoder = LabelEncoder()
label_encoder.fit(Y)
y_train = label_encoder.transform(y_train)
y_validation = label_encoder.transform(y_validation)
y_test = label_encoder.transform(y_test)

In [49]:
# Fit the TF-IDF vocabulary on the training partition only, then reuse it
# for the other partitions so that all matrices share the same columns.
tfidf=TfidfVectorizer(max_features=10000)
X_train=tfidf.fit_transform(X_train)
X_test=tfidf.transform(X_test)
X_validation=tfidf.transform(X_validation)
# X still holds the concatenated raw sentences from all three partitions
X=tfidf.transform(X)

Tune the hyperparameters

In [50]:
param_grid={
    'alpha':[0.01,0.1,0.5,1.0,2.0],
    'fit_prior':[True,False]
}

model=MultinomialNB()
grid_search=GridSearchCV(model,param_grid,cv=5,scoring='accuracy')

grid_search.fit(X_train,y_train)
print("Best parameters:",grid_search.best_params_)

Best parameters: {'alpha': 0.1, 'fit_prior': True}
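The task statement asks for tuning on the validation partition, whereas the grid search above uses 5-fold cross-validation inside the training partition. One way to tune against a fixed validation set while still using `GridSearchCV` is `PredefinedSplit`. The sketch below uses toy sentences as stand-ins for the real partitions; all variable names are illustrative.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the real partitions (hypothetical sentences)
train_texts = ["good morning", "thank you very much", "guten morgen", "vielen dank"]
train_labels = ["en-US", "en-US", "de-DE", "de-DE"]
val_texts = ["good night and thank you", "guten abend und vielen dank"]
val_labels = ["en-US", "de-DE"]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_texts)
X_va = vectorizer.transform(val_texts)

# -1 keeps a row in the training fold; 0 places it in the single validation fold
test_fold = np.r_[np.full(X_tr.shape[0], -1), np.zeros(X_va.shape[0], dtype=int)]
X_all = vstack([X_tr, X_va], format="csr")
y_all = train_labels + val_labels

search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.01, 0.1, 0.5, 1.0, 2.0], "fit_prior": [True, False]},
    cv=PredefinedSplit(test_fold),
    scoring="accuracy",
    refit=False,  # only select parameters; do not refit on train + validation
)
search.fit(X_all, y_all)
print("Best parameters:", search.best_params_)
```

With `refit=False` the search only reports the selected parameters, so the final model can then be fitted on the training partition alone.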


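Train the model with the best parameters. The cell below fits on X and Y, the concatenation of the training, test, and validation partitions, so the validation and test scores reported afterwards are measured on data the model has already seen.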

In [51]:
best_param=grid_search.best_params_
final_model=MultinomialNB(alpha=grid_search.best_params_['alpha'],fit_prior=grid_search.best_params_['fit_prior'])
final_model.fit(X,Y)
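The cell above fits on `X` and `Y`, which hold all three partitions. A minimal sketch of fitting on the training partition alone could look like the following; the sentences are toy stand-ins for the real data and the variable names are illustrative. Because the model is fitted on encoded labels, `predict` already returns integers and no second `label_encoder.transform` call is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real partitions (hypothetical sentences)
toy_train_texts = ["good morning", "thank you very much", "guten morgen", "vielen dank"]
toy_train_labels = ["en-US", "en-US", "de-DE", "de-DE"]
toy_test_texts = ["thank you for the music", "vielen dank und guten tag"]
toy_test_labels = ["en-US", "de-DE"]

encoder = LabelEncoder().fit(toy_train_labels)
toy_vectorizer = TfidfVectorizer()

# alpha and fit_prior are the values reported by the grid search above
toy_model = MultinomialNB(alpha=0.1, fit_prior=True)
toy_model.fit(toy_vectorizer.fit_transform(toy_train_texts),
              encoder.transform(toy_train_labels))

# Predictions are integer-encoded because the model was trained on encoded labels
toy_test_pred = toy_model.predict(toy_vectorizer.transform(toy_test_texts))
print("Test accuracy:", accuracy_score(encoder.transform(toy_test_labels), toy_test_pred))
```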

Evaluation on the training, validation, and test datasets

In [52]:
y_train_pred = final_model.predict(X_train)
y_train_pred = label_encoder.transform(y_train_pred)
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Classification Report (Train Set):\n", classification_report(y_train, y_train_pred))

y_val_pred = final_model.predict(X_validation)
y_val_pred = label_encoder.transform(y_val_pred)
print("Validation Accuracy:", accuracy_score(y_validation, y_val_pred))
print("Classification Report (Validation Set):\n", classification_report(y_validation, y_val_pred))

y_test_pred = final_model.predict(X_test)
y_test_pred = label_encoder.transform(y_test_pred)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("Classification Report (Test Set):\n", classification_report(y_test, y_test_pred))


Train Accuracy: 0.9606179916237237
Classification Report (Train Set):
               precision    recall  f1-score   support

           0       0.64      0.97      0.77     11514
           1       0.99      0.97      0.98     11514
           2       0.93      0.91      0.92     11514
           3       0.99      0.97      0.98     11514
           4       0.96      0.98      0.97     11514
           5       0.96      0.96      0.96     11514
           6       0.99      0.94      0.97     11514
           7       0.99      0.98      0.98     11514
           8       0.98      0.92      0.95     11514
           9       0.99      0.96      0.98     11514
          10       0.97      0.97      0.97     11514
          11       0.99      0.97      0.98     11514
          12       0.99      0.96      0.97     11514
          13       0.98      0.98      0.98     11514
          14       0.92      0.91      0.92     11514
          15       0.97      0.95      0.96     11514
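The report above labels classes by their encoded integers. To show locale names instead, `classification_report` accepts a `target_names` argument. The sketch below uses three locales and made-up predictions purely for illustration.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up labels and predictions for illustration only
name_encoder = LabelEncoder().fit(["af-ZA", "da-DK", "de-DE"])
y_true_demo = name_encoder.transform(["af-ZA", "da-DK", "de-DE", "de-DE"])
y_pred_demo = name_encoder.transform(["af-ZA", "af-ZA", "de-DE", "de-DE"])

report = classification_report(
    y_true_demo,
    y_pred_demo,
    labels=list(range(len(name_encoder.classes_))),
    target_names=list(name_encoder.classes_),  # locale names in encoder order
    zero_division=0,
)
print(report)
```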
          

# Task 3
Convert your 27-language dataset further into language groups, where the grouping is by the respective
continent names. It appears that the dataset has Asia, Africa, Europe, and North America, so you will have
four classes now. Collapse the dataset into 4 classes by appending the files into large files.
Build a Regularized Discriminant Analysis (RDA) model, which has a hyperparameter lambda to trade off
between LDA and QDA. You may use bag-of-words via CountVectorizer or TfidfVectorizer to create the feature
space of your dataset. It will be a huge feature space, but LDA/QDA can handle large feature spaces, so no
worries there. Of course, you may use some clever feature elimination methods such as low-frequency
pruning, noise removal, etc.
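For reference, one common simplified form of RDA blends each class covariance with the pooled covariance, Sigma_k(lambda) = (1 - lambda) * Sigma_k + lambda * Sigma, so that lambda = 0 behaves like QDA and lambda = 1 like LDA. The sketch below implements that form with NumPy. The direction of lambda, the small ridge term, and the class name `CovarianceBlendRDA` are assumptions of this sketch, not part of the assignment.

```python
import numpy as np

class CovarianceBlendRDA:
    """Gaussian classifier whose class covariances are blended with the pooled one."""

    def __init__(self, lam=0.5, ridge=1e-6):
        self.lam = lam      # 0 -> per-class covariances (QDA), 1 -> pooled (LDA)
        self.ridge = ridge  # small diagonal term to keep matrices invertible

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n, d = X.shape
        self.means_, class_covs, self.log_priors_ = [], [], []
        pooled = np.zeros((d, d))
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            scatter = (Xc - mu).T @ (Xc - mu)
            pooled += scatter
            self.means_.append(mu)
            class_covs.append(scatter / max(len(Xc) - 1, 1))
            self.log_priors_.append(np.log(len(Xc) / n))
        pooled /= max(n - len(self.classes_), 1)
        self.covs_ = [(1 - self.lam) * S + self.lam * pooled + self.ridge * np.eye(d)
                      for S in class_covs]
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=float)
        scores = []
        for mu, S, log_prior in zip(self.means_, self.covs_, self.log_priors_):
            diff = X - mu
            # Squared Mahalanobis distance of each row to the class mean
            maha = np.sum(diff * np.linalg.solve(S, diff.T).T, axis=1)
            _, logdet = np.linalg.slogdet(S)
            scores.append(-0.5 * maha - 0.5 * logdet + log_prior)
        return np.column_stack(scores)

    def predict(self, X):
        return self.classes_[np.argmax(self.decision_function(X), axis=1)]

# Toy check on two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)
for lam in (0.0, 0.5, 1.0):
    acc = np.mean(CovarianceBlendRDA(lam).fit(X_demo, y_demo).predict(X_demo) == y_demo)
    print(f"lambda={lam}: training accuracy {acc:.2f}")
```

In practice lambda would be chosen on the validation partition, in the same way as any other hyperparameter.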

In [53]:
continent_map = {
    'Asia': ['jv-ID', 'ms-MY', 'tl-PH', 'tr-TR', 'vi-VN'],
    'Africa': ['af-ZA', 'sw-KE'],
    'Europe': ['da-DK', 'de-DE', 'es-ES', 'fr-FR', 'fi-FI', 'hu-HU', 'is-IS', 'it-IT', 'lv-LV', 'nb-NO', 'nl-NL', 'pl-PL', 'pt-PT', 'ro-RO', 'ru-RU', 'sl-SL', 'sv-SE', 'sq-AL', 'cy-GB'],
    'North America': ['en-US']
}

def map_language_to_continent(label):
    for continent, languages in continent_map.items():
        if label in languages:
            return continent
    return None

y_continent = [map_language_to_continent(label) for label in Y1]
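The cell above relabels the sentences in memory. The task statement also mentions appending the per-language files into one large file per continent; a minimal sketch of that file-level merge is shown below. `merge_by_continent` is a hypothetical helper, and the demonstration uses throwaway files with made-up sentences.

```python
import os
import tempfile

def merge_by_continent(source_dir, target_dir, continent_map):
    """Append each locale file in `source_dir` to its continent file in `target_dir`."""
    os.makedirs(target_dir, exist_ok=True)
    for continent, locales in continent_map.items():
        target_path = os.path.join(target_dir, f"{continent}.txt")
        with open(target_path, "w", encoding="utf-8") as target:
            for locale in locales:
                source_path = os.path.join(source_dir, f"{locale}.txt")
                if not os.path.exists(source_path):
                    continue  # skip locales that were not extracted
                with open(source_path, "r", encoding="utf-8") as source:
                    for line in source:
                        target.write(line)

# Demonstration on throwaway files (made-up sentences, two continents only)
demo_map = {"Africa": ["af-ZA", "sw-KE"], "North America": ["en-US"]}
demo_sentences = {"af-ZA": "goeie more", "sw-KE": "habari", "en-US": "good morning"}
merged = {}
with tempfile.TemporaryDirectory() as root:
    src, dst = os.path.join(root, "languages"), os.path.join(root, "continents")
    os.makedirs(src)
    for locale, sentence in demo_sentences.items():
        with open(os.path.join(src, f"{locale}.txt"), "w", encoding="utf-8") as handle:
            handle.write(sentence + "\n")
    merge_by_continent(src, dst, demo_map)
    for continent in demo_map:
        with open(os.path.join(dst, f"{continent}.txt"), encoding="utf-8") as handle:
            merged[continent] = handle.read().splitlines()

print(merged)
```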

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(X1, y_continent, test_size=0.3, random_state=42, stratify=y_continent)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)


vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

In [55]:
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

In [56]:
# Reduce dimensionality using TruncatedSVD (LSA)
svd = TruncatedSVD(n_components=300)
X_train_svd = svd.fit_transform(X_train_vec)
X_val_svd = svd.transform(X_val_vec)
X_test_svd = svd.transform(X_test_vec)

In [57]:
class RegularizedDiscriminantAnalysis:
  """Weighted blend of LDA and QDA posterior probabilities.

  Note: this mixes the two models' predicted probabilities; it does not
  blend their covariance estimates, which is how RDA is usually defined.
  """
  def __init__(self, lda_weight=0.5):
    # lda_weight=1.0 gives pure LDA, lda_weight=0.0 gives pure QDA
    self.lda_weight = lda_weight
    self.lda = LinearDiscriminantAnalysis()
    self.qda = QuadraticDiscriminantAnalysis()

  def fit(self, X, y):
    self.lda.fit(X, y)
    self.qda.fit(X, y)
    return self

  def predict_probability(self, X):
    # Weighted average of the LDA and QDA class posteriors
    lda_preds = self.lda.predict_proba(X)
    qda_preds = self.qda.predict_proba(X)
    return (self.lda_weight * lda_preds) + ((1 - self.lda_weight) * qda_preds)

  def predict(self, X):
    # Column index of the most probable class; this matches the encoded
    # label because LabelEncoder assigns the integers 0..n_classes-1
    return np.argmax(self.predict_probability(X), axis=1)


lda_weight = 0.7
rda_model = RegularizedDiscriminantAnalysis(lda_weight)

label_encoder = LabelEncoder()
train_encoded = label_encoder.fit_transform(y_train)
val_encoded = label_encoder.transform(y_val)
test_encoded = label_encoder.transform(y_test)

rda_model.fit(X_train_svd, train_encoded)

val_predictions_encoded = rda_model.predict(X_val_svd)
val_accuracy = accuracy_score(val_encoded, val_predictions_encoded)
print(f"Validation Accuracy: {val_accuracy * 100:.2f}%")

test_predictions_encoded = rda_model.predict(X_test_svd)
test_accuracy = accuracy_score(test_encoded, test_predictions_encoded)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

test_predictions = label_encoder.inverse_transform(test_predictions_encoded)
test_labels = label_encoder.inverse_transform(test_encoded)

print("Classification Report (Test Set):")
print(classification_report(test_labels, test_predictions, target_names=label_encoder.classes_))


train_predictions_encoded = rda_model.predict(X_train_svd)
train_predictions = label_encoder.inverse_transform(train_predictions_encoded)
train_accuracy = accuracy_score(train_encoded, train_predictions_encoded)
print(f"Training Accuracy: {train_accuracy * 100:.2f}%")

Validation Accuracy: 93.42%
Test Accuracy: 93.37%
Classification Report (Test Set):
               precision    recall  f1-score   support

       Africa       0.95      0.78      0.86      4957
         Asia       0.99      0.80      0.89     12391
       Europe       0.92      0.99      0.96     47085
North America       0.92      0.80      0.86      2478

     accuracy                           0.93     66911
    macro avg       0.95      0.84      0.89     66911
 weighted avg       0.94      0.93      0.93     66911

Training Accuracy: 93.50%


Why use SVD?

Reducing the feature space:
Text vectorized with TfidfVectorizer or CountVectorizer typically has a very large number of features, because every unique word or token becomes a feature.
In my case, I am working with text from 27 languages, and the TF-IDF transform is capped at 5000 features (max_features=5000).
SVD reduces this high-dimensional space to a more manageable number of dimensions (from 5000 features to 300 in my code). This matters because LDA and QDA estimate covariance matrices whose size grows quadratically with the number of features, and QDA estimates one per class, so with too many features there may not be enough samples per class to estimate them reliably.
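The TF-IDF to SVD to LDA chain can be seen end to end on a toy corpus. In the sketch below, `TruncatedSVD` works directly on the sparse TF-IDF matrix and returns a dense array that LDA can consume. The sentences, the two-component setting, and the variable names are illustrative assumptions, far smaller than the 5000-to-300 reduction used above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made-up sentences) with two classes
docs = [
    "good morning to you", "thank you very much", "see you tomorrow morning",
    "how are you today", "buenos dias a todos", "muchas gracias por todo",
    "hasta manana a todos", "como estas hoy dia",
]
doc_labels = ["en"] * 4 + ["es"] * 4

X_sparse = TfidfVectorizer().fit_transform(docs)      # sparse, one column per token
svd_demo = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd_demo.fit_transform(X_sparse)            # dense, 2 columns

print(X_sparse.shape, "->", X_dense.shape)
print("Explained variance kept:", svd_demo.explained_variance_ratio_.sum())

lda_demo = LinearDiscriminantAnalysis().fit(X_dense, doc_labels)
print(lda_demo.predict(X_dense))
```

Checking `explained_variance_ratio_.sum()` on the real data is one way to judge whether 300 components retain enough of the TF-IDF signal.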