# Language identification for regional languages of France  

In this lab, you will work on varieties of French and on the task of language identification.

## 0. Install libraries and import data

In [None]:
!pip install fasttext==0.9.2

In [None]:
from collections import Counter

import pandas as pd
import fasttext as ft
from huggingface_hub import hf_hub_download
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# create a folder called 'data' (no error if it already exists)
!mkdir -p data
# download the document available at https://seafile.unistra.fr/f/a6246a7152d44122b957/?dl=1 and store it in the folder you just created
# (the URL is quoted so that the shell does not interpret the '?')
!wget --output-document data/parable_dataset.csv "https://seafile.unistra.fr/f/a6246a7152d44122b957/?dl=1"

--2025-01-23 08:59:32--  https://seafile.unistra.fr/f/a6246a7152d44122b957/?dl=1
Resolving seafile.unistra.fr (seafile.unistra.fr)... 77.72.44.41
Connecting to seafile.unistra.fr (seafile.unistra.fr)|77.72.44.41|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://seafile.unistra.fr/seafhttp/files/98184606-23df-47a1-8ecc-f3d28e6baef9/parable_dataset.csv [following]
--2025-01-23 08:59:33--  https://seafile.unistra.fr/seafhttp/files/98184606-23df-47a1-8ecc-f3d28e6baef9/parable_dataset.csv
Reusing existing connection to seafile.unistra.fr:443.
HTTP request sent, awaiting response... 200 OK
Length: 439717 (429K) [application/octet-stream]
Saving to: ‘data/parable_dataset.csv’


2025-01-23 08:59:34 (916 KB/s) - ‘data/parable_dataset.csv’ saved [439717/439717]



You now have access to the file `parable_dataset.csv` in the folder `data`. Some explanations about this dataset:

### Description of the dataset

The dataset has been acquired from a book entitled ["Parabole de l'enfant prodigue en divers dialectes, patois de la France, avec une introduction sur la formation des dialectes et patois de la France"](https://archive.org/details/paraboledelenfan00favr/), published in 1879 and authored by Léopold Favre.
It contains translations of the "Parable of the Lost Son" into several regional languages of France, collected at the beginning of the 19th century.

The dataset presents a combination of challenges which are often observed in low-resource settings:
- diachronic variation
- diatopic variation
- lack of spelling standards
- errors due to OCR (Optical Character Recognition)
- approximate characterization of the language due to the lack of dedicated ISO 639-3 codes.

## 1. Exploration of the dataset

**1) Load the parable dataset into a pandas DataFrame object called `parable_df`. Also print the shape of the dataset, its columns, the first five rows and the last five rows.**

In [None]:
# TO DO

**2) Print all the verses of the version whose title is `"Parabole de l'Enfant prodigue. Evangile selon Saint-Luc, chap. XV. (Traduction de Le Maistre de Sacy.)"`. If you are a French speaker, you should be able to understand it (even if some sentences "feel old").**

In [None]:
title_1 = "Parabole de l'Enfant prodigue. Evangile selon Saint-Luc, chap. XV. (Traduction de Le Maistre de Sacy.)"

# TO DO

**3) Now print all the verses of the version whose title is `"Id. en patois d'Onville, canton de Gorze (Moselle)"`. Can you understand it this time? Which variety is it? Which language?**



In [None]:
title_2 = "Id. en patois d'Onville, canton de Gorze (Moselle)"

# TO DO

**4) Now we'll have a look at the different varieties and languages contained in the dataset. Print the list of all varieties contained in the dataset, and the list of all languages.**

In [None]:
# TO DO

You should find 16 varieties and 7 languages, Occitan being the most represented.


Note that there are 3 types of relations between a variety and a language:
- `dialect`: the variety is considered to be a dialect of the corresponding language
- `match`: the variety and the language match, meaning that this is considered to be an independent language
- `same_macro_language`: the variety is not a dialect of the language, but they have the same macro language. For instance, "Bourguignon-Morvandiau" is not a dialect of French, but "Bourguignon-Morvandiau" and "French" are both "Langues d'oïl".
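As a hint for the questions that follow, here is a minimal sketch of how rows can be filtered on a relation type. It runs on a toy DataFrame, not on the real dataset: the column names `variety` and `language`, the placeholder names "Variety A", "Variety B" and "Language X", and the helper `varieties_with_relation` are assumptions made for illustration. Only `var_lang_rel` and the Bourguignon-Morvandiau example come from the text above.

```python
import pandas as pd

# Toy data only; the column names "variety" and "language" are assumed,
# check parable_df.columns for the real ones.
toy_df = pd.DataFrame({
    "variety": ["Variety A", "Variety B", "Bourguignon-Morvandiau"],
    "language": ["Language X", "Variety B", "French"],
    "var_lang_rel": ["dialect", "match", "same_macro_language"],
})


def varieties_with_relation(df, relation):
    """Return the sorted, de-duplicated varieties having the given relation."""
    mask = df["var_lang_rel"] == relation
    return sorted(df.loc[mask, "variety"].unique())


print(varieties_with_relation(toy_df, "same_macro_language"))
```

The same boolean-mask pattern should carry over to `parable_df` once the actual column names are known.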

**5) Which varieties are considered a dialect of a given language, according to the `var_lang_rel` column?**

In [None]:
# TO DO

**6) Which varieties are considered independent languages, according to the `var_lang_rel` column?**

In [None]:
# TO DO

**7) Which varieties share a macro language with the corresponding languages, according to the `var_lang_rel` column?**

In [None]:
# TO DO

Note that:
- Franc-Comtois is spoken in the Franche-Comté region and some cantons in Switzerland
- Poitevin-Saintongeais is spoken in central Western France
- Lorrain Roman is spoken in Lorraine
- Normand is spoken in Normandy
- Bourguignon-Morvandiau is spoken in Burgundy

## 2. Language identification


### Question 1  
Try the following language identification tools on the dataset (1 verse = 1 prediction):
  - GlotLID (v2 and v3): https://github.com/cisnlp/GlotLID
  - OpenLID (v2): https://huggingface.co/laurievb/OpenLID-v2
  - fastText: https://huggingface.co/facebook/fasttext-language-identification

You can store the predicted labels and confidence scores directly as columns of the dataset for each model (e.g. in columns `glotlid_2_preds` and `glotlid_2_scores` for model glotlid_2). Then, answer the following questions:
  -  Which model obtains the best overall results (in terms of precision, recall, F-measure)? You can compute scores for each of the languages contained in the dataset. Hint: consider using `sklearn.metrics.classification_report`.
  -  Which model has the largest coverage of the languages in the dataset?
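To make the two questions above concrete, here is a small sketch, in plain Python and on toy labels, of the per-language precision, recall and F-measure that `sklearn.metrics.classification_report` summarises, together with one possible way of measuring coverage. The helper names `per_label_scores` and `supported_targets` and the toy label lists are assumptions made for illustration; none of the values are results from the lab.

```python
def per_label_scores(gold, pred):
    """Compute (precision, recall, F1) for every label found in `gold`."""
    scores = {}
    for label in sorted(set(gold)):
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def supported_targets(model_labels, target_codes):
    """Return the target language codes that appear among a model's labels."""
    known = {lab.removeprefix("__label__").split("_")[0] for lab in model_labels}
    return sorted(code for code in target_codes if code in known)


# Toy gold and predicted labels (ISO 639-3 style codes), not real results.
gold = ["oci", "oci", "fra", "fra", "bre"]
pred = ["oci", "cat", "fra", "oci", "bre"]
scores = per_label_scores(gold, pred)
print(scores["oci"])

# Toy label inventory standing in for a model's full label list.
toy_model_labels = ["__label__fra_Latn", "__label__oci_Latn"]
print(supported_targets(toy_model_labels, ["fra", "oci", "bre"]))
```

With a loaded fastText model, the label inventory can be obtained with `model.get_labels()`; whether the target codes match the model's labels exactly is something to verify for each tool.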

In case you need a little help to get started, here are some directions on how to load and use the first model **glotlid_v2**:

In [None]:
# You can use this function to load the different models
def load_model(model_name, file_name):
    model_path = hf_hub_download(repo_id=model_name, filename=file_name)
    model = ft.load_model(model_path)
    return model

# Source information on GlotLID models: https://huggingface.co/cis-lmu/glotlid
# v1: introduced in the GlotLID paper and used in all experiments.
# v2: an edited version of v1, featuring more languages, and cleaned from noisy corpora based on the analysis of v1
# v3: an edited version of v2, featuring more languages, excluding macro languages, further cleaned from noisy corpora
# and incorrect metadata labels based on the analysis of v2, supporting "zxx" and "und" series labels

# Load the glotlid_v2 model
# check the model's documentation to find the values of model_name and file_name
model_name = "cis-lmu/glotlid"
file_name = "model_v2.bin"
glotlid_2 = load_model(model_name = model_name, file_name = file_name)

In [None]:
# Try the model on a few example sentences to understand the type of predictions returned
print(glotlid_2.predict("Hello, how are you today?"))
print(glotlid_2.predict("Bonjour à tous et à toutes, j'espère que vous allez bien"))

(('__label__eng_Latn',), array([0.99998987]))
(('__label__fra_Latn',), array([0.99996352]))
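The tuples printed above pair a label such as `__label__fra_Latn` with a confidence score. Below is a minimal sketch of how such a tuple could be turned into plain values before storing them in the `*_preds` and `*_scores` columns. The helper name `parse_prediction` is hypothetical, and the example tuple imitates the output shown above instead of calling the model.

```python
def parse_prediction(prediction):
    """Split a fastText-style prediction into (language, script, score)."""
    labels, scores = prediction
    # Drop the "__label__" prefix, then separate "fra_Latn" into code and script.
    label = labels[0].removeprefix("__label__")
    language, _, script = label.partition("_")
    return language, script, float(scores[0])


# Stand-in for glotlid_2.predict(...): same shape as the output above.
example = (("__label__fra_Latn",), [0.99996352])
print(parse_prediction(example))  # ('fra', 'Latn', 0.99996352)
```

Note that fastText's `predict` handles one line at a time, so any newline characters in a verse should be replaced before calling it.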


In [None]:
# TO DO

### Question 2  
Display the confusion matrix for each model to analyse the most frequent confusions. Do they make sense from a linguistic point of view?  
* Consider using `sklearn.metrics.confusion_matrix` and `sns.heatmap` for a nicer display. As you will find that the models sometimes predict labels outside of the target languages, you can have a column in your confusion matrix named `other`. Once you have succeeded, you can also plot a more detailed version of the confusion matrix including columns for other languages commonly predicted by the models (e.g. the top 15).
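The `other` column mentioned above can be obtained by relabelling every prediction that falls outside the target languages before counting. The sketch below illustrates this in plain Python on toy labels. The helper names `collapse_to_other` and `confusion_counts` and the label lists are assumptions for illustration only.

```python
from collections import Counter


def collapse_to_other(pred, targets, other="other"):
    """Replace every predicted label outside `targets` with `other`."""
    return [p if p in targets else other for p in pred]


def confusion_counts(gold, pred, rows, cols):
    """Count (gold, predicted) pairs into a rows x cols list of lists."""
    counts = Counter(zip(gold, pred))
    return [[counts[(r, c)] for c in cols] for r in rows]


# Toy labels, not real results.
gold = ["oci", "oci", "fra", "bre"]
pred = ["oci", "cat", "fra", "spa"]

rows = sorted(set(gold))   # target languages, one row each
cols = rows + ["other"]    # same labels plus the catch-all column
collapsed = collapse_to_other(pred, set(rows))
matrix = confusion_counts(gold, collapsed, rows, cols)

for label, row in zip(rows, matrix):
    print(label, row)
```

The resulting list of lists can be passed to `sns.heatmap` with `xticklabels=cols` and `yticklabels=rows`. Alternatively, calling `sklearn.metrics.confusion_matrix` on the collapsed predictions with `labels=cols` should give a square matrix whose `other` row is empty.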

In [None]:
# TO DO

### Question 3  
Analyse the results obtained for the Occitan dialects. Are some dialects better classified than others?

In [None]:
# TO DO

### Question 4  

Write a synthesis of your observations: which tool would you recommend using in this setting, if any? What are the main advantages and disadvantages of each tool?

### Done early?

If you're done early, here are some additional questions that you can consider:  


* Until now, we have considered the performance of the models separately. Is it possible to improve the results by using a majority vote strategy, or a strategy where we favor the model with the highest confidence on a given verse?
* Until now, we have performed language identification verse by verse. But can we recognise the language better by predicting the language of an entire version of the parable of the Lost Son?
* Have a look at the predictions with very low model certainty. Do they correspond to unsupported languages or poorly predicted languages?
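As a starting point for the first two ideas, the sketch below shows one possible form of each strategy on toy predictions: a majority vote across models, a pick of the most confident model, and a version-level decision that sums the confidence scores over all verses. The function names and the toy `(label, score)` pairs are assumptions; other combination rules are equally valid.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def most_confident(predictions):
    """Return the label of the (label, score) pair with the highest score."""
    return max(predictions, key=lambda pair: pair[1])[0]


def version_level_label(verse_predictions):
    """Sum scores per label over all verses and return the best label."""
    totals = defaultdict(float)
    for label, score in verse_predictions:
        totals[label] += score
    return max(totals, key=totals.get)


# Toy predictions of three models for a single verse (not real results).
per_model = [("oci", 0.41), ("cat", 0.87), ("oci", 0.52)]
print(majority_vote([label for label, _ in per_model]))
print(most_confident(per_model))

# Toy predictions of one model for three verses of the same version.
per_verse = [("oci", 0.6), ("cat", 0.9), ("oci", 0.5)]
print(version_level_label(per_verse))
```

On this toy input the two verse-level strategies disagree, which is exactly the kind of case worth inspecting on the real data.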


