# Arabic GEC Pipeline Runner

This notebook runs the complete Arabic grammatical error correction (GEC) pipeline end to end, from setup through inference.

**Steps:**
0. Setup (clone the repository, install libraries, and download the data)
1. Data Processing (`m2_to_csv.py`)
2. Model Training (`train_gec.py`)
3. Inference (`inference.py`)

## 0. Setup
Clone the repository, install dependencies, and download the QALB dataset.

In [None]:
# 1. Clone the repository
!git clone https://github.com/HosamKsbaa/Test23.git

# 2. Change directory into the repo
%cd Test23

In [None]:
# 3. Install necessary libraries
!pip install transformers datasets pyarabic gdown sentencepiece
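
Before moving on, it can help to confirm that the install succeeded. The following is a minimal sketch, assuming each package's import name matches its pip name (which holds for the five packages listed above); `missing_packages` is a hypothetical helper, not part of the repository.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names taken from the pip install cell above.
required = ["transformers", "datasets", "pyarabic", "gdown", "sentencepiece"]
print("Missing packages:", missing_packages(required) or "none")
```

If anything is reported as missing, re-running the install cell (or restarting the runtime) is usually the first thing to try.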

In [None]:
# 4. Download and unzip data
import gdown
import zipfile
import os

# Download the file from Google Drive
file_id = '1hvLiiMvvubyCEAZK4KIWgu7qHBNCHOp-'
url = f'https://drive.google.com/uc?id={file_id}'
output_file = 'qalb_dataset.zip'

# Download only if the archive is not already present
if not os.path.exists(output_file):
    gdown.download(url, output_file, quiet=False)

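# Unzip the file (is_zipfile guards against a partial or non-zip download)
if os.path.exists(output_file) and zipfile.is_zipfile(output_file):
    with zipfile.ZipFile(output_file, 'r') as zip_ref:
        zip_ref.extractall("data")
    print("Dataset extracted to 'data' directory.")
else:
    print("Download failed or the file is not a valid zip archive.")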
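
To see what the archive actually contained, the extracted directory can be listed. This is a small sketch, assuming the M2 files use a `.m2` extension (the archive layout is not documented here, so treat the suffix as an assumption); `list_extracted_files` is a hypothetical helper.

```python
from pathlib import Path


def list_extracted_files(root="data", suffix=".m2"):
    """Return sorted file paths under `root` whose names end with `suffix`."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []  # nothing extracted yet
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_file() and p.name.endswith(suffix)
    )


# Assumption: the QALB M2 files end in ".m2"; pass suffix="" to list every file.
for path in list_extracted_files():
    print(path)
```

An empty listing suggests either that extraction did not run or that the files use a different extension.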

## 1. Data Processing
Parse the M2 file and generate `qalb_full_gec.csv`.

In [None]:
!python m2_to_csv.py
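
For readers unfamiliar with the M2 format, the sketch below shows one way such a file can be turned into source/target sentence pairs. It is an illustration only and not the code in `m2_to_csv.py`: it assumes the standard M2 layout (`S` lines hold the tokenized source, `A` lines hold `start end|||type|||correction|||...|||annotator`), keeps a single annotator, and treats an empty or `-NONE-` correction as a deletion. The function names are hypothetical.

```python
def apply_m2_edits(source_tokens, edits):
    """Apply (start, end, correction) token-span edits to a token list."""
    corrected = list(source_tokens)
    # Apply edits right to left so earlier token offsets stay valid.
    for start, end, correction in sorted(edits, key=lambda e: (e[0], e[1]), reverse=True):
        replacement = [] if correction in ("", "-NONE-") else correction.split()
        corrected[start:end] = replacement
    return corrected


def parse_m2(lines, annotator="0"):
    """Yield (source, target) sentence pairs from M2-formatted lines."""
    source_tokens, edits = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("S "):
            if source_tokens is not None:
                yield " ".join(source_tokens), " ".join(apply_m2_edits(source_tokens, edits))
            source_tokens, edits = line[2:].split(), []
        elif line.startswith("A ") and source_tokens is not None:
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            if start < 0 or fields[-1] != annotator:
                continue  # skip "noop" edits and edits from other annotators
            edits.append((start, end, fields[2]))
    if source_tokens is not None:
        yield " ".join(source_tokens), " ".join(apply_m2_edits(source_tokens, edits))


# Tiny English example for readability; the real data is Arabic.
sample = [
    "S This are a a sentence .",
    "A 1 2|||SVA|||is|||REQUIRED|||-NONE-|||0",
    "A 3 4|||Delete||||||REQUIRED|||-NONE-|||0",
    "",
]
for source, target in parse_m2(sample):
    print(source, "->", target)
```

Applying edits from right to left avoids having to track how earlier replacements shift later token positions.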

## 2. Model Training
Fine-tune AraT5v2 with `train_gec.py`. **Note:** Training takes a long time, so a GPU runtime is recommended.

In [None]:
!python train_gec.py

## 3. Inference
Test the model on new sentences.

In [None]:
!python inference.py