# Neural unsupervised approach to proto-form reconstruction: an experiment on the impact of the language model

Authors: Benjamin BADOUAILLE, Eliott CAMOU, Thomas HORRUT (Université de Bordeaux, CPBx)

Supervised by: Rachel Bawden (Inria, Paris, ALMAnaCH group)


---

This notebook contains an implementation of the neural proto-form reconstruction model designed by [Andre He et al.](https://arxiv.org/abs/2211.08684). The implementation covers the training, inference and evaluation of the general model, as well as the training of several language models of the proto-language. Further details and tests concerning the generation, sampling and maximisation steps may be covered in other [notebooks](./notebooks/).

The aim of the experiment is to reproduce the Latin reconstruction from the authors' paper under similar experimental conditions. The main change is that training and evaluation are carried out several times, each time with a differently configured prior language model. The models' performance can then be compared as a function of these language models.

*This experiment is conducted as part of an end-of-preparatory-cycle scientific research project, whose analysis will feed into a wider reflection on the potential and limitations of AI for solving problems in historical linguistics.*

In [2]:
# update working repository

!git pull

Already up to date.


# Framework's datasets



*   $L$ is the set of Romance languages we work with (`french`, `romanian`, `spanish`, `italian`, `portuguese`).
*   $\Sigma$ is the set of IPA characters used in our tokenisation. Together with the special characters `'('` and `')'`, it constitutes the input vocabulary of the language and edit models. The insertion edit model ($q_\textrm{ins}$) returns a probability distribution over $\Sigma \cup \{\textrm{"<del>"}\}$, and the substitution edit model ($q_\textrm{sub}$) returns one over $\Sigma \cup \{\textrm{"<del>"}\}$.
*   Let $C$ be the set of cognate pairs. We denote the batch size by $B := |C|$.
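
As a minimal sketch of the definitions above, the snippet below builds the two vocabularies from a toy alphabet. The names `TOY_SIGMA`, `INPUT_VOCAB` and `EDIT_OUTPUT_VOCAB` are hypothetical, and the three-character alphabet is only an illustration; the actual $\Sigma$ is the one defined in `data.vocab`.

```python
# Toy stand-in for the IPA alphabet (assumption: the real set lives in data.vocab).
TOY_SIGMA = ["a", "k", "t"]

# Input vocabulary of the language and edit models:
# Sigma plus the two special boundary characters.
INPUT_VOCAB = TOY_SIGMA + ["(", ")"]

# Output space of an edit model as described in the list above:
# Sigma plus the special "<del>" token.
EDIT_OUTPUT_VOCAB = TOY_SIGMA + ["<del>"]

print(len(INPUT_VOCAB), len(EDIT_OUTPUT_VOCAB))
```

With this toy alphabet, an edit model would output one probability per entry of `EDIT_OUTPUT_VOCAB`, while its inputs are indexed over `INPUT_VOCAB`.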



## $Σ$ initialisation

In [1]:
from data import vocab

# Index-to-character table for the IPA vocabulary
IPA_VOC = vocab.SIGMA_INV

print("IPA characters : ")
for i in range(len(IPA_VOC)):
    print(f'/ {IPA_VOC[i]} /', end=' ')
    # Break the line after every 20 characters
    if i % 20 == 19:
        print()

IPA characters : 
/  / / z / / ɣ / / m / / ɒ / / u / / s / / ʲ / / ˈ / / ː / / ʔ / / ɨ / / o / / k / / p / / e / / a / / β / / ø / / b / 
/ f / / ɡ / / ʒ / / y / / ɲ / / ɾ / / ˌ / / ɛ / / d / / ɹ / / w / / x / / n / / l / / r / / œ / / ɐ / / ʁ / / v / / ʌ / 
/ ʊ / / ŋ / / ʝ / / ʰ / / ʎ / / j / / h / / ð / / ʃ / / ɪ / / - / / ɔ / / ̃ / / ə / / ɑ / / i / / θ / / t / / ɥ / 

## Preprocessing of $C$

In [5]:
from data.getDataset import getCognatesSet
from data.vocab import wordToOneHots, computeInferenceData
from Types.articleModels import ModernLanguages
from Types.models import InferenceData
from torch.nn.utils.rnn import pad_sequence

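# Load the cognate word lists once, keyed by language
cognates = getCognatesSet()
# One-hot encode each word, pad each language's batch to a common length,
# then compute the inference data from the padded batch
cognatesInferenceData: dict[ModernLanguages, InferenceData] = {
    language: computeInferenceData(
        pad_sequence([wordToOneHots(w) for w in wList], batch_first=True)
    )
    for (language, wList) in cognates.items()
}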
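
To make the encoding and padding step more concrete, here is a minimal sketch in plain Python, with nested lists standing in for tensors. `TOY_VOCAB`, `one_hot_word` and `pad_batch` are hypothetical helpers, not the repository's `wordToOneHots` or `computeInferenceData`, which may for instance add boundary characters. The padding mimics what `pad_sequence(..., batch_first=True)` does: shorter sequences are right-padded with zeros, giving a batch of shape (batch size, maximum length, vocabulary size).

```python
# Hypothetical character-to-index map over a toy alphabet.
TOY_VOCAB = {"a": 0, "k": 1, "t": 2}


def one_hot_word(word):
    """Encode a word as a list of one-hot vectors, one per character."""
    size = len(TOY_VOCAB)
    return [[1 if TOY_VOCAB[c] == i else 0 for i in range(size)] for c in word]


def pad_batch(encoded_words):
    """Right-pad every word with all-zero vectors up to the longest word."""
    max_len = max(len(w) for w in encoded_words)
    size = len(TOY_VOCAB)
    return [
        w + [[0] * size for _ in range(max_len - len(w))]
        for w in encoded_words
    ]


batch = pad_batch([one_hot_word(w) for w in ["kat", "ta"]])
# Batch size, maximum word length, vocabulary size
print(len(batch), len(batch[0]), len(batch[0][0]))
```

In this toy batch, the second word has two characters, so its third position is an all-zero padding vector.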


# Edit models initialisation

For each iteration of Bouchard-Côté et al.'s probabilistic reconstruction model, the backward dynamic program is run and the edit models are trained from the computed probabilities. This initialisation step is independent of the choice of prior language model, so the state of the edit models after these iterations is the same for each of the subsequent EM iterations.
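
Because the initialisation does not depend on the language model, one way to reuse it is to snapshot the initial state once and give each run its own copy. The sketch below only illustrates that idea: `initial_edit_state`, `lm_configs` and the numeric values are hypothetical placeholders, not the notebook's actual objects.

```python
import copy

# Hypothetical stand-in for the edit models' parameters
# after the initialisation iterations.
initial_edit_state = {"q_sub": [0.1, 0.2], "q_ins": [0.3, 0.4]}

# Hypothetical names for the language-model configurations to compare.
lm_configs = ["lm_a", "lm_b"]

runs = {}
for name in lm_configs:
    # Each run gets its own deep copy, so updating one run
    # cannot alter the shared starting point of the others.
    runs[name] = copy.deepcopy(initial_edit_state)

# Simulate a parameter update during one run.
runs["lm_a"]["q_sub"][0] = 0.9

print(initial_edit_state["q_sub"][0], runs["lm_b"]["q_sub"][0])
```

The deep copy matters here: a plain assignment would make all runs share the same underlying lists, so the comparison between language models would no longer start from identical edit models.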