
Let's practice in French! 🔴 ⚪ 🔵

This is a simple app for practicing the French language, based on a pretrained model for the fill-mask task.

Table of Contents

📥 Installation

  • to be done asap

🚀 Usage

Screenshot: example of the FRENCH EXERCISES app on Windows 11 with light mode and the 'blue' theme.

FRENCH_EXERCISES_APP.mp4

Video example of how to practice with the FRENCH EXERCISES app! 🎈

💡 Idea

The idea behind this project is to exploit an existing pretrained model for the fill-mask task in order to learn a language better. In particular, given a text, or even better a collection of texts, it is possible to extract a set of sentences and turn them into a "complete the sentence with the missing word" test. For each sentence, one word is masked at random and the user is offered a list of candidate words/answers. One of them is the right one; the remaining ones are the most likely alternatives returned by a pretrained model for the French language.
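As an illustration of this idea, here is a minimal sketch of how such a question could be built. It assumes the Hugging Face transformers fill-mask pipeline and the moussaKam/mbarthez checkpoint; the function name build_question and the exact checkpoint are illustrative assumptions, not the app's actual API.

```python
# Minimal sketch (not the app's actual code): build one "complete the sentence"
# question from a single sentence, using a French fill-mask model via transformers.
import random

from transformers import pipeline

# Assumed checkpoint; any French fill-mask model would behave similarly.
fill_mask = pipeline("fill-mask", model="moussaKam/mbarthez")


def build_question(sentence: str, n_choices: int = 4):
    """Mask one random word and return (masked sentence, shuffled choices, answer)."""
    words = sentence.split()
    idx = random.randrange(len(words))
    answer = words[idx]
    words[idx] = fill_mask.tokenizer.mask_token  # "<mask>" for BART-like tokenizers
    masked_sentence = " ".join(words)

    # The model's most likely fillers for the mask become the distractors.
    predictions = fill_mask(masked_sentence, top_k=n_choices)
    distractors = [p["token_str"].strip() for p in predictions
                   if p["token_str"].strip() != answer]
    choices = distractors[: n_choices - 1] + [answer]
    random.shuffle(choices)
    return masked_sentence, choices, answer


masked, choices, answer = build_question("Le chat dort sur le canapé.")
print(masked, choices, answer)
```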

✨ Features and Technologies


The relevant features and technologies exploited in this project are:

  • Package configuration with pyproject.toml built with hatch
  • Code formatting and linting with ruff, and black
  • pre-commit configuration file
  • CI-CD Pipelines with GitHub Actions
  • Basic pytest set-up for unit tests
  • Auto-generated docs with mkdocs and mkdocs-material
  • mBARThez as pretrained model for the fill-mask task
  • CustomTkinter as a python UI-library based on Tkinter

🎯 Fill-Mask task


Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. I decided to exploit a model trained for this kind of task, in particular one trained for the French language.
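To make the task concrete, here is a hedged sketch of a single fill-mask query, again assuming the transformers fill-mask pipeline and the moussaKam/mbarthez checkpoint:

```python
# Hedged sketch: one fill-mask query against a French pretrained model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="moussaKam/mbarthez")  # assumed checkpoint

# "<mask>" marks the word the model should predict.
for prediction in fill_mask("Paris est la <mask> de la France.", top_k=5):
    print(f'{prediction["token_str"]!r}: {prediction["score"]:.3f}')
```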

In the next section, the chosen model, mBARThez, is described in more detail.

🎨 Model description


The mBARThez model is a descendant of BART, a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pretrained by corrupting text with an arbitrary noising function and learning to reconstruct the original text. BARThez follows the same recipe: it is pretrained by learning to reconstruct a corrupted input sentence, using a corpus of 66GB of raw French text, so both its encoder and its decoder are pretrained. In addition to BARThez, which is pretrained from scratch, the multilingual BART (mBART) was further pretrained on the same French corpus, which boosted its performance in both discriminative and generative tasks. This French-adapted version is called mBARThez, and it is the one exploited in this project.
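As a small illustration of the encoder-decoder structure described above (assuming the moussaKam/mbarthez checkpoint on the Hugging Face hub), the model can be loaded and its two halves inspected:

```python
# Sketch: load mBARThez and inspect its encoder-decoder structure.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moussaKam/mbarthez")  # assumed checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/mbarthez")

config = model.config
print(config.encoder_layers, config.decoder_layers, config.d_model)  # layer counts, hidden size
print(type(model.get_encoder()).__name__)  # bidirectional (BERT-like) encoder
print(type(model.get_decoder()).__name__)  # autoregressive (GPT-like) decoder
```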

For more details on the model, check the following sources: the paper and the GitHub repository.

📨 Contact

Author: Giulia Gualtieri
Email: giulia.gualtieri@mail.polimi.it

About me: Data Scientist 🔬 with a strong passion for Machine Learning 💻 and Artificial Intelligence 💭 applications. As you can read, I'm also passionate about the French language and culture. Languages spoken: English, Italian, French.