# MSA-Tunisian Arabic Parallel Corpus Generation Pipeline

This notebook implements the pipeline described in the dataset card for creating a synthetic parallel corpus between Modern Standard Arabic (MSA) and Tunisian Arabic (aeb).

**Key Steps:**
1. Load raw Tunisian corpus from Hugging Face.
2. Preprocess the data.
3. Generate MSA translations using an LLM.
4. Apply quality filtering and regeneration.
5. Reverse pairs and finalize the dataset.
6. Save to JSONL.
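
Steps 5 and 6 can be sketched as follows. This is a minimal sketch, not the notebook's actual implementation: the helper names `reverse_pair` and `write_jsonl`, the field names `source`, `target`, and `direction`, and the reading of "reverse" as making MSA the source side are all assumptions for illustration.

```python
import io
import json
from typing import Dict, Iterable, TextIO


def reverse_pair(tunisian: str, msa: str) -> Dict[str, str]:
    """Turn a (Tunisian, generated MSA) pair into an MSA -> Tunisian record."""
    # Hypothetical schema: the real field names come from the dataset card.
    return {"source": msa, "target": tunisian, "direction": "msa-aeb"}


def write_jsonl(records: Iterable[Dict[str, str]], handle: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        # ensure_ascii=False keeps Arabic script readable instead of \uXXXX escapes.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# In-memory buffer for demonstration; a real run would pass
# open("pairs.jsonl", "w", encoding="utf-8") instead (file name is a placeholder).
buffer = io.StringIO()
written = write_jsonl([reverse_pair("tunisian text", "msa text")], buffer)
```

Passing a file handle rather than a path keeps the writer easy to test and lets the caller control the encoding.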

**Best Practices:**
- Run on a machine with a GPU for LLM inference.
- Monitor memory usage for large datasets.
- Use virtual environments (e.g., conda) for dependency management.

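**Dataset Chosen:** The loading cell in this notebook uses hamzabouajila/tunisian-derja-unified-raw-corpus (860,184 examples in its train split). Alternatives such as AzizBelaweid/Tunisian_Language_Dataset (138k examples, Tunisian Arabic text in Arabic script, CC BY-SA 4.0 license) or linagora/Tunisian_Derja_Dataset can be used by swapping the dataset name.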

In [None]:
# Install required libraries (run once; comment out after)
# unicodedata is part of the Python standard library, so it is not installed via pip
# !pip install datasets transformers sentence-transformers torch numpy tqdm langid presidio-analyzer pyarabic

import json
import uuid
import datetime
import random
import math
import numpy as np
from typing import List, Dict, Any
from tqdm import tqdm
import unicodedata
import langid
from presidio_analyzer import AnalyzerEngine  # For PII redaction
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer  # For LLM
from sentence_transformers import SentenceTransformer, util  # For embeddings

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Constants (as per dataset card)
# Any available Hugging Face causal LM can be substituted here. Encoder-only models
# such as 'aubmindlab/bert-base-arabertv02' cannot generate text, so they do not work as stand-ins.
MODEL_USED = "InceptionAI/jais-13b"
MAX_ATTEMPTS = 3
QUALITY_THRESHOLD = 0.7
SPLITS = {"train": 0.8, "validation": 0.1, "test": 0.1}

# Example: Test imports
print("Libraries imported successfully!")
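
As an illustration of how these constants might be consumed later in the pipeline, here is a minimal sketch under stated assumptions. `generate` and `score` are hypothetical callables standing in for the LLM call and the quality metric; neither they nor `generate_with_retries` and `assign_splits` are defined in this notebook section.

```python
import random
from typing import Callable, Dict, List, Optional

MAX_ATTEMPTS = 3
QUALITY_THRESHOLD = 0.7
SPLITS = {"train": 0.8, "validation": 0.1, "test": 0.1}


def generate_with_retries(
    text: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Optional[str]:
    """Return the first candidate scoring at least QUALITY_THRESHOLD, else None."""
    for _ in range(MAX_ATTEMPTS):
        candidate = generate(text)
        if score(text, candidate) >= QUALITY_THRESHOLD:
            return candidate
    # Every attempt fell below the threshold; the caller decides what to do next.
    return None


def assign_splits(n_items: int, seed: int = 42) -> Dict[str, List[int]]:
    """Shuffle indices deterministically and cut them by the SPLITS ratios."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * SPLITS["train"])
    n_val = int(n_items * SPLITS["validation"])
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        # The test split takes the remainder, absorbing any rounding leftovers.
        "test": indices[n_train + n_val:],
    }
```

Using a local `random.Random(seed)` instead of the global generator keeps the split stable even if other cells consume random numbers first.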

## Load Raw Tunisian Corpus from Hugging Face

We use the `datasets` library to load the corpus. The cell below loads hamzabouajila/tunisian-derja-unified-raw-corpus, which exposes Tunisian Arabic text in a `text` column.

**Best Practices:**
- Load only necessary splits (e.g., 'train').
- Subset for testing (e.g., first 100 examples).
- Inspect data: Print samples to verify language/script.

**Example Output:** `dataset` is a Hugging Face Dataset object with a 'text' column; `raw_corpus` holds the strings selected from it.

In [3]:
from datasets import load_dataset

# Load the dataset (caches locally)
dataset = load_dataset("hamzabouajila/tunisian-derja-unified-raw-corpus", split="train")  # Assuming 'train' split; adjust if needed

# Example: Subset for testing (use full for production)
raw_corpus = list(dataset.select(range(100))['text'])  # Materialize as a plain list of strings; use dataset['text'] for full

# Inspect examples
print("Sample Tunisian texts:")
for text in raw_corpus[:5]:
    print(text[:100] + "...")  # Truncate for display

# Best Practice: Log size
print(f"Loaded {len(raw_corpus)} examples.")

Generating train split:   0%|          | 0/860184 [00:00<?, ? examples/s]

Sample Tunisian texts:
آه هاي تتفرج متكية...
تعدالنا بوك جأنا نصف النهار يعمللنا هيا تمشيوا توا...
حول « حدائق تونس » بمدينة إسطنبول – Al-Sabîl
حول « حدائق تونس » بمدينة إسطنبول
13 janvier 2019 11 fé...
أیام الجهات/ مدن الفنون في مدینة الثقافة: ولایة توزر 12-01-2018 - مدينة الثقافة
Anis2019-01-12T19:28...
يقونة أتو تجي غدوة الصباح غدوة الصباح أتو تجي أتو نعمللها شوية هكة حنة كي تبدا راقدة...
Loaded 100 examples.
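
Several of the samples above mix Arabic script with Latin-script dates and page metadata, so a quick script check can support the "verify language/script" practice. The following is a minimal standard-library sketch; the function name `arabic_ratio` and the 0.8 cut-off in the usage comment are hypothetical and not part of the pipeline described here.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name marks them as Arabic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # unicodedata.name returns the default "" for characters without a name.
    arabic = sum(1 for ch in letters if "ARABIC" in unicodedata.name(ch, ""))
    return arabic / len(letters)


# Possible usage on the loaded corpus (threshold chosen arbitrarily for illustration):
# mostly_arabic = [t for t in raw_corpus if arabic_ratio(t) >= 0.8]
```

Digits, punctuation, and whitespace are ignored, so a date such as "12-01-2018" does not lower the ratio on its own; only Latin letters (for example French month names) do.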
