## üß© **Projects**

### üí¨ **Large Language Models (LLM) & NLP**

#### ü§ñ **English to Darija and Darija to English Translator**

**Technologies :** Python, Hugging Face Transformers, PyTorch, FastAPI, Tokenizers, LangChain.

**But :** Cr√©er un mod√®le bidirectionnel capable de traduire l‚Äôanglais vers le darija marocain et vice-versa, en tenant compte des particularit√©s linguistiques locales.

**Comp√©tences :** NLP, LLM, Fine-tuning, Prompt Engineering, Preprocessing linguistique, √©valuation BLEU/METEOR, d√©ploiement d‚ÄôAPI.

# **Une feuille de route compl√®te pour le projet ¬´ English ‚Üî Darija Translator ¬ª**

## 1) Objectif clair

* **But** : Traduire automatiquement entre l‚Äôanglais et le darija marocain.
* **Cas d‚Äôusage** : Chatbots, applications √©ducatives, outils d‚Äôaide √† la communication interculturelle.
* **Cible / label** : Phrases ou segments de texte anglais ‚Üî phrases en darija.
* **Indicateurs de succ√®s** : BLEU ‚â• 25‚Äì30 sur corpus de test, coh√©rence syntaxique et s√©mantique.


## 2) Jeux de donn√©es (suggestions & types)

Types de donn√©es utiles :

* Corpus bilingue anglais ‚Üî darija (textes transcrits, dialogues, r√©seaux sociaux, forums)
* Donn√©es textuelles brutes, tokenisables en sous-mots (BPE / SentencePiece)
* JSON/CSV avec colonnes : `english_text`, `darija_text`

Exemples publics √† rechercher :

* Hugging Face Multilingual datasets
* CCAligned ou OPUS pour corpus anglais ‚Üî arabe dialectal
* Corpus maison collect√© via crowdsourcing ou scraping (forum, commentaires sociaux)


## 3) Strat√©gie mod√®le ‚Äî options probables

* Base : **MarianMT** ou **mBART** ou **nllb-200-distilled-600M** (pr√©-entra√Æn√© multilingue)
* Fine-tuning sur corpus anglais ‚Üî darija
* Option avanc√©e : encoder-decoder transformer sp√©cifique au darija
* Technique : traduction sequence-to-sequence, attention multi-head, embeddings multilingues


## 4) Nettoyage & pr√©-traitement (m√©thodes d√©taill√©es)

* Normalisation des caract√®res (arabes translitt√©r√©s ‚Üí formes standard)
* Suppression des doublons et corrections typographiques
* Tokenization avec subword (BPE / SentencePiece)
* D√©coupage en phrases et gestion des ponctuations


## 5) Gestion du d√©s√©quilibre de classes

* V√©rifier longueur moyenne des phrases, nombre d‚Äôoccurrences des mots rares
* Data augmentation : paraphrases, back-translation pour √©quilibrer le corpus
* Oversampling des phrases peu fr√©quentes en darija


## 6) M√©triques d‚Äô√©valuation

* **BLEU** (coh√©rence traduction)
* **METEOR** (alignement s√©mantique)
* **chrF** (caract√®res et n-grammes)
* Analyse qualitative par des locuteurs natifs


## 7) Validation exp√©rimentale

* Split corpus : 80% train, 10% validation, 10% test
* K-fold cross-validation possible si dataset limit√©
* Monitoring des pertes et m√©triques sur validation √† chaque epoch


## 8) Explainability & interpr√©tabilit√©

* Visualisation des poids d‚Äôattention pour comprendre comment le mod√®le traduit certaines expressions
* Analyse des erreurs fr√©quentes (faux positifs / inversions sens)


## 9) Pipeline d‚Äôentra√Ænement (pratique / hyperparam√®tres de d√©part)

* Optimizer : AdamW
* Learning rate : 5e-5
* Batch size : 16‚Äì32
* Epochs : 3‚Äì5 (ajuster selon corpus)
* Scheduler : linear warmup decay
* Early stopping sur BLEU validation


## 10) Infrastructure & outils

* Colab Pro / GPU T4 ou A100 pour entra√Ænement
* Hugging Face Transformers + Datasets
* TensorBoard pour suivi des m√©triques
* FastAPI pour API de test et d√©mo


## 11) CRISP-DM appliqu√© ‚Äî mapping concret

| √âtape CRISP-DM                | Description concr√®te pour le projet NLP                                                                                                                  |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1. Business Understanding** | D√©finir les labels anglais ‚Üî darija, m√©triques de succ√®s (BLEU, METEOR), cas d‚Äôusage (chatbots, applications √©ducatives, communication interculturelle). |
| **2. Data Understanding**     | Collecte du corpus bilingue, exploration des donn√©es, nettoyage initial, cr√©ation de l‚ÄôAnalytical Base Table (ABT) : mapping patient/texte ‚Üî labels.     |
| **3. Data Preparation**       | Normalisation des textes, tokenization subword, splits train/validation/test, gestion des doublons et des incoh√©rences.                                  |
| **4. Modeling**               | Fine-tuning d‚Äôun LLM pr√©-entra√Æn√© (MarianMT / mBART ou nllb-200-distilled-600M) sur le corpus, recherche d‚Äôhyperparam√®tres, suivi des losses et m√©triques.                          |
| **5. Evaluation**             | Calcul des m√©triques BLEU, METEOR, chrF sur validation/test, analyse qualitative des traductions, revue des erreurs fr√©quentes.                          |
| **6. Deployment**             | Mise en place d‚Äôune API via FastAPI, container Docker, tests end-to-end, monitoring de la performance et logs.                                           |
| **7. Management**             | Documentation compl√®te, versioning des mod√®les et datasets, auditabilit√©, suivi des seeds et reproducibility.                                            |


## 12) Bonnes pratiques & aspects √©thiques / r√©glementaires

* Anonymisation des sources publiques
* Consentement si donn√©es crowdsourc√©es
* V√©rification des biais linguistiques (dialectes, longueur phrases, raret√©)
* Logging, reproducibility, seed fix pour d√©ploiement fiable


## 13) Organisation du repository (exemple)

```
english_darija_translator/
‚îú‚îÄ‚îÄ data/
‚îÇ   ‚îú‚îÄ‚îÄ raw/
‚îÇ   ‚îú‚îÄ‚îÄ processed/
‚îú‚îÄ‚îÄ notebooks/
‚îú‚îÄ‚îÄ src/
‚îÇ   ‚îú‚îÄ‚îÄ preprocessing.py
‚îÇ   ‚îú‚îÄ‚îÄ train.py
‚îÇ   ‚îú‚îÄ‚îÄ evaluate.py
‚îÇ   ‚îî‚îÄ‚îÄ inference.py
‚îú‚îÄ‚îÄ models/
‚îú‚îÄ‚îÄ Dockerfile
‚îú‚îÄ‚îÄ requirements.txt
‚îî‚îÄ‚îÄ README.md
```


## 14) Checklist concr√®te pour d√©marrer (priorit√©s)

* [ ] Collecte corpus anglais ‚Üî darija
* [ ] Nettoyage et tokenization
* [ ] Pr√©parer dataset Hugging Face Datasets / PyTorch Dataset
* [ ] Fine-tuning LLM
* [ ] √âvaluation sur validation et test
* [ ] D√©ploiement API / notebook de test


## 15) Pi√®ges courants & comment les √©viter

* Corpus darija trop petit ‚Üí utiliser augmentation / back-translation
* Orthographe inconsistante ‚Üí normaliser avant tokenization
* Overfitting ‚Üí early stopping + dropout
* M√©triques automatiques seules insuffisantes ‚Üí analyse qualitative obligatoire


## 16) Exemples de pertes / combinaisons utiles

* CrossEntropyLoss pour seq2seq
* Label smoothing pour r√©gularisation
* Optional : weighted loss si certains tokens rares sont critiques

# ‚öôÔ∏è Architecture globale du projet

```

                                           Corpus anglais ‚Üî darija
                                                       ‚îÇ
                                  Pr√©traitement (normalisation, tokenization)
                                                       ‚îÇ
                                            ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
                                            ‚îÇ                    ‚îÇ
                                            ‚îÇ NLLB / Transformer ‚îÇ
                                            ‚îÇ       Decoder      ‚îÇ
                                            ‚îÇ                    ‚îÇ
                                            ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
                                                       ‚îÇ
                                              Fusion / projection
                                                       ‚îÇ
                                        Traduction finale (anglais ‚Üî darija)
```

---

## üß© R√©sum√©

| √âl√©ment       | Recommandation                                                                              |
| ------------- | --------------------------------------------------------------------------------------------|
| **Donn√©es**   | Corpus bilingue anglais ‚Üî darija, texte brut + JSON/CSV metadata                            |
| **Type**      | Textes segment√©s en phrases, tokenis√©s subwords                                             |
| **Mod√®les**   | MarianMT / mBART ou  nllb-200-distilled-600M  pr√©-entra√Æn√© ‚Üí fine-tuning sur corpus darija  |
| **Fusion**    | Encoder-decoder attention ou concat des embeddings                                          |
| **GPU Colab** | T4 ou sup√©rieur, batch size 16‚Äì32                                                           |
| **Objectif**  | Traduction bidirectionnelle fiable et adapt√©e au darija marocain                            |


1Ô∏è‚É£ Choix du Model


| Mod√®le                               | Performance potentielle | Points forts                                                                                                | Points faibles                                                                       |
| ------------------------------------ | ----------------------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| **facebook/nllb-200-distilled-600M** | ‚≠ê‚≠ê‚≠ê‚≠ê‚≠ê                   | Tr√®s robuste sur langues rares et low-resource, multilingue (200 langues), state-of-the-art sur BLEU/METEOR | Tr√®s lourd, n√©cessite GPU, fine-tuning sur Darija n√©cessaire |
| **mBART-large / mBART50**            | ‚≠ê‚≠ê‚≠ê‚≠ê                    | Seq2seq multilingue, excellent transfert learning, bon pour dialectes                                       | Lourd, fine-tuning n√©cessite GPU                                             |
| **MarianMT**                         | ‚≠ê‚≠ê‚≠ê                     | L√©ger, facile √† fine-tuner                                                                                  | Moins performant sur dialectes rares et sur phrases complexes                        |



2Ô∏è‚É£ √âtapes du projet

les √©tapes classiques appliqu√©es √† ce projet :
| √âtape                                                            | Description / Application                                                                                                |
| ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| **1. Collecte de donn√©es (Data Collection)**                     | Corpus bilingue anglais ‚Üî Darija (~110k paires)                                                                          |
| **2. Nettoyage & Normalisation (Text Cleaning & Normalization)** | - Lowercase<br>- Suppression ponctuation inutile<br>- Standardisation Darija translitt√©r√©<br>- Supprimer doublons et NaN |
| **3. Tokenization**                                              | D√©coupage en tokens / sous-mots via SentencePiece ou tokenizer de NLLB                                                   |
| **4. Pr√©paration Dataset**                                       | Cr√©ation de DataLoaders PyTorch pour train/val/test                                                                      |
| **5. Mod√©lisation (Modeling)**                                   | - Utiliser `facebook/nllb-200-distilled-600M`<br>- Encoder-decoder seq2seq<br>- Fine-tuning sur Darija dataset           |
| **6. Entra√Ænement (Training)**                                   | Optimizer AdamW, LR scheduler, batch size adapt√© GPU, early stopping                                                     |
| **7. √âvaluation (Evaluation)**                                   | M√©triques NLP : BLEU, METEOR, chrF; analyse qualitative des traductions                                                  |
| **8. D√©ploiement (Deployment)**                                  | FastAPI / Docker pour fournir une API de traduction en temps r√©el                                                        |
| **9. Maintenance & am√©lioration (Maintenance)**                  | R√©entra√Ænement avec nouvelles donn√©es, suivi des erreurs, logs                                                           |


# ‚öôÔ∏è Task

## Load and preprocess the dataset

### Subtask:
Load the dataset from the specified path, handle missing values, and split it into training, validation, and test sets.


**Reasoning**:
The first step is to load the dataset from the specified path. I will use pandas to read the csv file and then check for missing values.



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Load the dataset
df = pd.read_csv(r'data\raw\darija_english.csv')

# Check for missing values
print("Missing values before handling:")
print(df.isnull().sum())

# Handle missing values (dropping rows with any missing values)
df.dropna(inplace=True)

print("\nMissing values after handling:")
print(df.isnull().sum())

# Split the data into training, validation, and test sets
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print("\nDataset splits:")
print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"Test set size: {len(test_df)}")

Missing values before handling:
darija     56
english    60
dtype: int64

Missing values after handling:
darija     0
english    0
dtype: int64

Dataset splits:
Training set size: 55855
Validation set size: 6982
Test set size: 6982


## Load a pre-trained model and tokenizer

### Subtask:
Choose a suitable pre-trained model for translation (like `facebook/nllb-200-distilled-600M`) and load its corresponding tokenizer.


**Reasoning**:
Import necessary classes, define the model name, load the tokenizer and model, and set the source and target language codes.



In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
# Define the model name
model_name = "facebook/nllb-200-distilled-600M"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Set the source and target language codes for the tokenizer
tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "ary_Arab"

## Prepare the data for training

### Subtask:
Tokenize the text data and create input features and labels that are suitable for the chosen model.


**Reasoning**:
Define a function to tokenize the English and Darija sentences and apply it to the training, validation, and test sets, then convert the resulting DataFrames into Dataset objects and remove the original text columns.



In [None]:
from datasets import Dataset

In [None]:
def tokenize_function(examples):
    # Tokenize English sentences as inputs
    tokenizer.src_lang = "eng_Latn"
    inputs = tokenizer(examples["english"], truncation=True)

    # Tokenize Darija sentences as labels
    tokenizer.tgt_lang = "ary_Arab"
    labels = tokenizer(examples["darija"], truncation=True)

    # Combine inputs and labels into a dictionary
    inputs["labels"] = labels["input_ids"]
    return inputs

# Apply the tokenization function to the dataframes and convert to Dataset
train_dataset = Dataset.from_pandas(train_df).map(tokenize_function, batched=True)
val_dataset = Dataset.from_pandas(val_df).map(tokenize_function, batched=True)
test_dataset = Dataset.from_pandas(test_df).map(tokenize_function, batched=True)

# Remove original text columns
train_dataset = train_dataset.remove_columns(["english", "darija"])
val_dataset = val_dataset.remove_columns(["english", "darija"])
test_dataset = test_dataset.remove_columns(["english", "darija"])

print("Tokenized datasets:")
print(train_dataset)
print(val_dataset)
print(test_dataset)

## Fine-tune the model

### Subtask:
Set up the training parameters and fine-tune the pre-trained model on your dataset.


**Reasoning**:
Import necessary classes, define training arguments, and instantiate and start the Seq2SeqTrainer.



In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

In [None]:
# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/ColabNotebooks/nllb-fine-tuned",
    num_train_epochs=5,
    per_device_train_batch_size=8,  # Batch size per device during training
    per_device_eval_batch_size=8,  # Batch size per device during evaluation
    learning_rate=2e-5,  # Learning rate
    weight_decay=0.01,  # Weight decay
    eval_steps=500,  # Evaluate every 500 training steps
    save_steps=5000,  # Save checkpoints every 500 training steps
    predict_with_generate=True, # Use generation for prediction
)

# Define a data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Instantiate Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()

Step,Training Loss
500,3.8569
1000,3.1229
1500,2.9341
2000,2.7372
2500,2.6348
3000,2.4675
3500,2.3846
4000,2.2915
4500,2.1767
5000,2.1345


TrainOutput(global_step=13964, training_loss=2.0732512658457125, metrics={'train_runtime': 11811.2478, 'train_samples_per_second': 9.458, 'train_steps_per_second': 1.182, 'total_flos': 2438515406708736.0, 'train_loss': 2.0732512658457125, 'epoch': 2.0})

## Inference with Examples

### Subtask:
Use the fine-tuned model to translate example English sentences into Darija.

**Reasoning**:
Load the fine-tuned model and tokenizer, define a function for translation, and use it to translate example sentences.

## Inference

### Subtask:
Use the fine-tuned model to translate example English sentences into Darija.

**Reasoning**:
Load the fine-tuned model and tokenizer, define a function for translation, and use it to translate example sentences.

## Evaluate the model

### Subtask:
Evaluate the fine-tuned model on the test dataset to assess its performance.

**Reasoning**:
Use the `trainer.evaluate()` method to calculate evaluation metrics on the test dataset.

In [59]:
# Evaluate the model on the test set
results = trainer.evaluate(test_dataset)

print("Evaluation results:")
print(results)

Evaluation results:
{'eval_loss': 1.5423707962036133, 'eval_runtime': 55.1568, 'eval_samples_per_second': 126.585, 'eval_steps_per_second': 15.828, 'epoch': 2.0}


In [66]:
# Define a function to translate English sentences to Darija
def translate_english_to_darija(text):
    # Tokenize the input English text and move to GPU
    tokenizer.src_lang = "eng_Latn"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # Generate the translation
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("__ary_Arab__"), max_length=128
    )

    # Decode the translated tokens
    translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

    return translated_text

# Get input from the user
english_sentence = input("Enter an English sentence to translate: ")

# Translate the input sentence
darija_translation = translate_english_to_darija(english_sentence)

# Print the result
print(f"\nEnglish: {english_sentence}")
print(f"Darija: {darija_translation}\n")

Enter an English sentence to translate: how are you

English: how are you
Darija: kifach



In [67]:
!pip install huggingface_hub -q

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# Push the fine-tuned model to the Hugging Face Hub
trainer.push_to_hub("NeoAivara/English_to_Darija_translator")