# Language identification for regional languages of France  

In this lab, you will work on varieties of French and on the task of language identification.

## 0. Install libraries and import data

In [1]:
from collections import Counter

import pandas as pd
import fasttext as ft
from huggingface_hub import hf_hub_download
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns


You now have access to the file `parable_dataset.csv` in the folder `data`. Some explanations about this dataset:

### Description of the dataset

The dataset has been acquired from a book entitled ["Parabole de l'enfant prodigue en divers dialectes, patois de la France, avec une introduction sur la formation des dialectes et patois de la France"](https://archive.org/details/paraboledelenfan00favr/), published in 1879 and authored by Léopold Favre.
It contains translations of the "Parable of the Lost Son" in several regional languages of France, collected at the beginning of the 19th century.

The dataset presents a combination of challenges which are often observed in low-resource settings:
- diachronic variation
- diatopic variation
- lack of spelling standards
- errors due to OCR (Optical Character Recognition)
- approximate characterization of the language due to the lack of dedicated ISO 639-3 codes.

## 1. Exploration of the dataset

**1) Load the parable dataset into a pandas DataFrame object called `parable_df`. Also print the shape of the dataset, its columns, the first five rows and the last five rows.**

In [2]:
# TO DO
parable_df = pd.read_csv("data/parable_dataset.csv")
parable_df.head()

Unnamed: 0,page,verse_number,verse,title,variety,variety_glottocode,language,language_glottocode,lang_iso639.3,var_lang_rel
0,2,11,Jesus leur dit encore : Un homme avait deux fils.,Parabole de l'Enfant prodigue. Evangile selon ...,French,stan1290,French,stan1290,fra,match
1,2,12,"dont le plus jeune dit à son père : mon père, ...",Parabole de l'Enfant prodigue. Evangile selon ...,French,stan1290,French,stan1290,fra,match
2,2,13,"Peu de jours après, le plus jeune de ces deux ...",Parabole de l'Enfant prodigue. Evangile selon ...,French,stan1290,French,stan1290,fra,match
3,2,14,"Après qu’il eut tout dépensé, il survint une g...",Parabole de l'Enfant prodigue. Evangile selon ...,French,stan1290,French,stan1290,fra,match
4,2,15,"Il s’en alla donc, et s’attacha au service d’u...",Parabole de l'Enfant prodigue. Evangile selon ...,French,stan1290,French,stan1290,fra,match
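
Step 1 also asks for the shape, the column names and the last rows, which the cell above does not print. Here is a minimal sketch on a tiny stand-in DataFrame. The rows of `toy_df` are invented for illustration; in the lab you would use the same attributes and methods on `parable_df`.

```python
import pandas as pd

# Hypothetical stand-in for parable_df: made-up rows using two of the real column names.
toy_df = pd.DataFrame({
    "verse": [f"verse {i}" for i in range(8)],
    "variety": ["French"] * 4 + ["Picard"] * 4,
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names
print(toy_df.head())         # first five rows (5 is the default)
print(toy_df.tail())         # last five rows
```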


**2) Print all the verses of the version whose title is `"Parabole de l'Enfant prodigue. Evangile selon Saint-Luc, chap. XV. (Traduction de Le Maistre de Sacy.)"`. If you are a French speaker, you should be able to understand it (even if some sentences "feel old").**

In [6]:
title_1 = "Parabole de l'Enfant prodigue. Evangile selon Saint-Luc, chap. XV. (Traduction de Le Maistre de Sacy.)"

# TO DO
verses = parable_df[parable_df['title']==title_1]["verse"].values
verses

array(['Jesus leur dit encore : Un homme avait deux fils.',
       'dont le plus jeune dit à son père : mon père, donnezmoi ce qui doit me revenir de votre bien. Et le père leur fit le partage de son bien.',
       "Peu de jours après, le plus jeune de ces deux fils, ayant amassé tout ce qu’il avait, s'en alla dans un pays étranger fort éloigné, où il dissipa tout son bien en excés et en débauches.",
       'Après qu’il eut tout dépensé, il survint une grande famine dans ce pays-là, et il commença à tomber en nécessité.',
       "Il s’en alla donc, et s’attacha au service d’un des habitants du pays, qui l'envoya dans sa maison des champs pour y garder les pourceaux.",
       'Et là il eût été bien aise de remplir son ventre des cosses que les pourceaux mangeaient; mais personne ne lui en donnait.',
       'Enfin, étant rentré en lui-même, il dit : Combien y a-t-il, chez mon père, de serviteurs à gages qui ont plus de qu’il ne leur en faut; et moi je meurs ici de faim !',
       "Il fau

**3) Now print all the verses of the version whose title is `"Id. en patois d'Onville, canton de Gorze (Moselle)"`. Can you understand it this time? Which variety is it? Which language?**



In [8]:
title_2 = "Id. en patois d'Onville, canton de Gorze (Moselle)"

# TO DO
verse2 = parable_df[parable_df['title'] == title_2]["verse"].values
print(verse2)

['Ain oumme aiveu daoz offans ;'
 'Lou pu jonne des daut déheu ait se pairre: Papa, ce que deu me revenain dé vote bain, et le p pairre li ao féyeu le pertaige de se bain.'
 'Paot de jou etpré, lou pu jonne de cés daoz offans réméssay tourtou ce que leveu sé enolaye dans ain É . pays e!rége bin long, osse qué le dépainaye tourtou se E bain en excès et en debauches.'
 'Etpré que lé evu tourtou dépainaye, lè airivaye ainne grande faimaine dans ce pays let, et lé dooit béson.'
 "T s'en ait don ennolaye et sé étéchi au que des habitans don pays, qué lé env: en sait mohon des champs, pou y oidaye les cochons."
 "Et toulet l'éreu etu ben ahhe de rémplire se ventre des crofouilles qué les cochons maingaint ; ma péhounne ne li en beilleu."
 "Enfin, réntraye en lu-maime, y debeu: Combain y ait-il dans lait mohon dé me pairre de qu'ont pu dé pé qui ni aut en faut; ei mé jot ioussé et à meurri de fé?"
 'Y faut qué je me leuveusse, et qué j’oleusse treuvaye me pairre, et qué jé li deheusse : Papa,

**4) Now we'll have a look at the different varieties and languages contained in the dataset. Print the list of all varieties contained in the dataset, and the list of all languages.**

In [9]:
# TO DO
print(parable_df["variety"].unique())

['French' 'Auvergnat' 'Walloon' 'Picard' 'Lorrain Roman' 'Franc-Comtois'
 'Bourguignon-Morvandiau' 'Limousin' 'Poitevin-Saintongeais'
 'Languedocian Occitan' 'Gascon' 'Catalan' 'Normand'
 'Vivaro-Alpine Occitan' 'Provençal' 'Arpitan' 'Romansh']


In [11]:
print(parable_df["language"].unique())

['French' 'Occitan' 'Walloon' 'Picard' 'Catalan' 'Arpitan' 'Romansh']


You should find 17 varieties and 7 languages, Occitan being the most represented.


Note that there are 3 types of relations between a variety and a language:
- `dialect`: the variety is considered to be a dialect of the corresponding language
- `match`: the variety and the language match, meaning that this is considered to be an independent language
- `same_macro_language`: the variety is not a dialect of the language, but they share the same macro language. For instance, "Bourguignon-Morvandiau" is not a dialect of French, but "Bourguignon-Morvandiau" and "French" are both "Langues d'oïl".

**5) Which varieties are considered dialects of a given language, according to the `var_lang_rel` column?**

In [20]:
# TO DO
var = parable_df[parable_df['var_lang_rel']=="dialect"]
var

Unnamed: 0,page,verse_number,verse,title,variety,variety_glottocode,language,language_glottocode,lang_iso639.3,var_lang_rel
22,4,11,En home aviot dous efons.,Id. en patois de nahrte ouvérgna,Auvergnat,auve1239,Occitan,occi1239,oci,dialect
23,4,12,Lou pe dzouïne diguet à soun païre: Moun paire...,Id. en patois de nahrte ouvérgna,Auvergnat,auve1239,Occitan,occi1239,oci,dialect
24,4,13,"Quahrques dzours apréz, lou dzouïne garçou ram...",Id. en patois de nahrte ouvérgna,Auvergnat,auve1239,Occitan,occi1239,oci,dialect
25,4,14,"Apréz qu'aguét tout mandza, la famina se fagué...",Id. en patois de nahrte ouvérgna,Auvergnat,auve1239,Occitan,occi1239,oci,dialect
26,4,15,"S'enanét d'ati, se loudzét à un rilge bourdzou...",Id. en patois de nahrte ouvérgna,Auvergnat,auve1239,Occitan,occi1239,oci,dialect
...,...,...,...,...,...,...,...,...,...,...
1804,151,28,"Aco l'aguendou messou en ira, ou nou vourea pa...",Id. en patois génois de Mons et d'Escragnolles...,Provençal,prov1235,Occitan,occi1239,oci,dialect
1805,151,29,Ou gué respougé : Ve-li-za tanti bei agni que ...,Id. en patois génois de Mons et d'Escragnolles...,Provençal,prov1235,Occitan,occi1239,oci,dialect
1806,151,30,Ma prestou qué rou vostrou aoutrou fillou qui ...,Id. en patois génois de Mons et d'Escragnolles...,Provençal,prov1235,Occitan,occi1239,oci,dialect
1807,151,31,"Alavouo rou par gué diché : Mé fillou, ou s'é ...",Id. en patois génois de Mons et d'Escragnolles...,Provençal,prov1235,Occitan,occi1239,oci,dialect


**6) Which varieties are considered independent languages, according to the `var_lang_rel` column?**

In [22]:
# TO DO
parable_df["var_lang_rel"].unique()

array(['match', 'dialect', 'same_macro_language'], dtype=object)

In [34]:
var = parable_df[parable_df['var_lang_rel']=="match"]
var['variety'].unique()

array(['French', 'Walloon', 'Picard', 'Catalan', 'Arpitan', 'Romansh'],
      dtype=object)

**7) Which varieties share a macro language with the corresponding languages, according to the `var_lang_rel` column?**

In [28]:
# TO DO
var = parable_df[parable_df['var_lang_rel']=="same_macro_language"]
var['variety'].unique()

array(['Lorrain Roman', 'Franc-Comtois', 'Bourguignon-Morvandiau',
       'Poitevin-Saintongeais', 'Normand'], dtype=object)

Note that:
- Franc-Comtois is spoken in the Franche-Comté region and some cantons in Switzerland
- Poitevin-Saintongeais is spoken in central Western France
- Lorrain Roman is spoken in Lorraine
- Normand is spoken in Normandy
- Bourguignon-Morvandiau is spoken in Burgundy
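
Questions 5 to 7 can also be answered in one pass by grouping on `var_lang_rel`. This is a small sketch that assumes only the `variety` and `var_lang_rel` columns shown above. The rows of `toy_df` are a hand-picked subset for illustration, not the real data.

```python
import pandas as pd

# Illustrative subset: a few (variety, relation) pairs taken from the outputs above.
toy_df = pd.DataFrame({
    "variety": ["French", "French", "Auvergnat", "Gascon", "Normand", "Picard"],
    "var_lang_rel": ["match", "match", "dialect", "dialect", "same_macro_language", "match"],
})

# For each relation type, collect the distinct varieties (in order of first appearance).
varieties_by_relation = toy_df.groupby("var_lang_rel")["variety"].unique()

for relation, varieties in varieties_by_relation.items():
    print(f"{relation}: {list(varieties)}")
```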

## 2. Language identification


### Question 1  
Try the following language identification tools on the dataset (1 verse = 1 prediction):
  - GlotLID (v2 and v3): https://github.com/cisnlp/GlotLID
  - OpenLID (v2): https://huggingface.co/laurievb/OpenLID-v2
  - fastText: https://huggingface.co/facebook/fasttext-language-identification

You can store the predicted labels and confidence scores directly as columns of the dataset for each model (e.g. in columns `glotlid_2_preds` and `glotlid_2_scores` for model glotlid_2). Then, answer the following questions:
  -  Which model obtains the best overall results (in terms of precision, recall, F-measure)? You can compute scores for each of the languages contained in the dataset. Hint: consider using `sklearn.metrics.classification_report`.
  -  Which model has the largest coverage of the languages in the dataset?
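
As a starting point for the per-language scores, here is a minimal sketch on invented data. It assumes that the models return labels of the form `__label__fra_Latn` (ISO 639-3 code plus script), which should be checked against the actual predictions. The helper `label_to_iso` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import classification_report


def label_to_iso(label: str) -> str:
    """Turn a fastText-style label such as '__label__fra_Latn' into 'fra'.

    Assumption: labels follow the '__label__<iso639-3>_<script>' pattern.
    """
    return label.replace("__label__", "").split("_")[0]


# Toy gold codes and raw predictions, invented for illustration.
toy_df = pd.DataFrame({
    "lang_iso639.3": ["fra", "fra", "oci", "oci", "cat"],
    "glotlid_2_preds": ["__label__fra_Latn", "__label__fra_Latn",
                        "__label__oci_Latn", "__label__cat_Latn", "__label__cat_Latn"],
})

y_true = toy_df["lang_iso639.3"]
y_pred = toy_df["glotlid_2_preds"].map(label_to_iso)

# Restrict the report to the languages present in the gold labels;
# zero_division=0 silences warnings for languages that are never predicted.
target_langs = sorted(y_true.unique())
report = classification_report(
    y_true, y_pred, labels=target_langs, zero_division=0, output_dict=True
)
print(pd.DataFrame(report).T.round(2))
```

For the coverage question, one option is to apply the same normalization to the model's `labels` attribute and check which target codes appear in it.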

In case you need a little help to get started, here are some directions on how to load and use the first model **glotlid_v2**:

In [37]:
# You can use this function to load the different models
def load_model(model_name, file_name):
    model_path = hf_hub_download(repo_id=model_name, filename=file_name)
    model = ft.load_model(model_path)
    return model

# Source information on GlotLID models: https://huggingface.co/cis-lmu/glotlid
# v1: introduced in the GlotLID paper and used in all experiments.
# v2: an edited version of v1, featuring more languages, and cleaned from noisy corpora based on the analysis of v1
# v3: an edited version of v2, featuring more languages, excluding macro languages, further cleaned from noisy corpora
# and incorrect metadata labels based on the analysis of v2, supporting "zxx" and "und" series labels

# Load the glotlid_v2 model
# check the model's documentation to find the values of model_name and file_name
model_name = "cis-lmu/glotlid"
file_name = "model_v2.bin"
glotlid_2 = load_model(model_name = model_name, file_name = file_name)

EntryNotFoundError: 404 Client Error. (Request ID: Root=1-679a3108-49c37c22206a22c83c4e80b4;ceeb004c-743c-4b7d-ae59-4991faf55152)

Entry Not Found for url: https://huggingface.co/laurievb/OpenLID-v2/resolve/main/model_v2.bin.

In [36]:
# Try the model on a few example sentences to understand the type of predictions returned
print(glotlid_2.predict("Hello, how are you today?"))
print(glotlid_2.predict("Bonjour à tous et à toutes, j'espère que vous allez bien"))

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
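
The `ValueError` above appears to come from the fastText Python wrapper calling `np.array(..., copy=False)`, which NumPy 2.x rejects. Two workarounds are worth trying: installing `numpy<2`, or passing a list of strings to `predict`, which (as an assumption to verify on your fastText version) goes through a code path that does not build the array this way. The sketch below takes the second route. `predict_verses` is a hypothetical helper, and `StubModel` is a stand-in so that the example runs without downloading a model.

```python
import numpy as np


def predict_verses(model, verses):
    """Return the top-1 label and score for each verse.

    Assumes `model.predict` accepts a list of strings and returns
    (list of label lists, list of score arrays), as fastText models do.
    """
    # fastText rejects text containing newlines, so flatten them first.
    cleaned = [str(v).replace("\n", " ") for v in verses]
    labels, scores = model.predict(cleaned, k=1)
    top_labels = [lab[0] for lab in labels]
    top_scores = [float(s[0]) for s in scores]
    return top_labels, top_scores


class StubModel:
    """Hypothetical stand-in mimicking the list-input return shape of a fastText model."""

    def predict(self, texts, k=1):
        labels = [["__label__fra_Latn"] for _ in texts]
        scores = [np.array([0.9]) for _ in texts]
        return labels, scores


preds, confs = predict_verses(
    StubModel(), ["Un homme avait deux fils.", "Ain oumme\naiveu daoz offans ;"]
)
print(preds, confs)
```

With a real model, the two returned lists can be stored directly as the `glotlid_2_preds` and `glotlid_2_scores` columns by passing `parable_df["verse"].tolist()` as `verses`.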

In [None]:
# TO DO

### Question 2  
Display the confusion matrix for each model to analyse the most frequent confusions. Do they make sense from a linguistic point of view?
* Consider using `sklearn.metrics.confusion_matrix` and `sns.heatmap` for a nicer display. As you will find that the models sometimes predict labels outside of the target languages, you can have a column in your confusion matrix named `other`. Once you have succeeded, you can also plot a more detailed version of the confusion matrix including columns for other languages commonly predicted by the models (e.g. the top 15).
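
One possible way to build the matrix with an `other` column is sketched below on invented gold and predicted codes. It assumes that predictions have already been normalized to ISO 639-3 codes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy gold and predicted ISO 639-3 codes, invented for illustration.
y_true = pd.Series(["fra", "fra", "oci", "oci", "cat", "wln"])
y_pred = pd.Series(["fra", "fra", "oci", "cat", "cat", "nld"])

target_langs = sorted(y_true.unique())

# Map every prediction outside the target languages to a single "other" label.
y_pred_bucketed = y_pred.where(y_pred.isin(target_langs), "other")

labels = target_langs + ["other"]
cm = confusion_matrix(y_true, y_pred_bucketed, labels=labels)

# Rows are gold languages, columns are predictions.
# The "other" row is dropped because no gold label is ever "other".
cm_df = pd.DataFrame(cm, index=labels, columns=labels).drop(index="other")
print(cm_df)
```

Passing `cm_df` to `sns.heatmap(cm_df, annot=True, fmt="d")` gives the nicer display. For the detailed version, the `other` bucket can be replaced by the most frequent out-of-target predictions, for instance via `Counter(...).most_common(15)`.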

In [None]:
# TO DO

### Question 3  
Analyse the results obtained for the Occitan dialects. Are some dialects better classified than others?
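
A per-dialect accuracy is one simple way to compare the Occitan varieties. The sketch below uses invented rows. `pred_iso` is a hypothetical column holding one model's predictions normalized to ISO 639-3 codes.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy_df = pd.DataFrame({
    "variety": ["Gascon", "Gascon", "Provençal", "Provençal", "Limousin", "French"],
    "lang_iso639.3": ["oci", "oci", "oci", "oci", "oci", "fra"],
    "pred_iso": ["oci", "cat", "oci", "oci", "fra", "fra"],
})

# Keep only the verses whose gold language is Occitan.
occitan = toy_df[toy_df["lang_iso639.3"] == "oci"]

# Share of verses per dialect whose prediction equals the gold code.
correct = occitan["pred_iso"] == occitan["lang_iso639.3"]
accuracy_by_dialect = (
    correct.groupby(occitan["variety"]).mean().sort_values(ascending=False)
)
print(accuracy_by_dialect)
```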

In [None]:
# TO DO

### Question 4  

Write a synthesis of your observations: which tool would you recommend using in this setting, if any? What are the main advantages and disadvantages of each tool?

### Done early?

If you're done early, here are some additional questions that you can consider:


*   Until now, we have considered the performance of each model separately. Is it possible to improve the results by using a majority vote strategy? Or a strategy where we favor the model with the highest confidence on a given verse?
*   Until now, we have performed language identification verse by verse. But can we better recognise the language by predicting the language of an entire version of the parable of the Lost Son?
*   Have a look at the predictions with very low model certainty. Do they correspond to unsupported languages or poorly predicted languages?
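
The two combination strategies from the first bullet can be sketched as follows. The column names (`m1_preds`, `m1_scores`, ...) and all values are hypothetical. Note also that confidence scores from different models are not necessarily comparable, so the "most confident" rule is only a heuristic.

```python
from collections import Counter

import pandas as pd

# Toy predictions from three models (hypothetical column names, invented values).
toy_df = pd.DataFrame({
    "m1_preds": ["fra", "oci", "cat"],
    "m1_scores": [0.90, 0.40, 0.55],
    "m2_preds": ["fra", "cat", "oci"],
    "m2_scores": [0.80, 0.70, 0.60],
    "m3_preds": ["oci", "cat", "fra"],
    "m3_scores": [0.30, 0.65, 0.95],
})
models = ["m1", "m2", "m3"]


def most_confident(row):
    """Label from the model with the highest score on this verse."""
    best = max(models, key=lambda m: row[f"{m}_scores"])
    return row[f"{best}_preds"]


def majority_vote(row):
    """Most frequent label; ties fall back to the most confident model."""
    counts = Counter(row[f"{m}_preds"] for m in models).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return most_confident(row)
    return counts[0][0]


toy_df["vote"] = toy_df.apply(majority_vote, axis=1)
toy_df["confident"] = toy_df.apply(most_confident, axis=1)
print(toy_df[["vote", "confident"]])
```

The same `Counter` logic, applied to all verse-level predictions that share a `title`, would give one prediction per version of the parable.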


