### Author: Shams

**Description:**  
This notebook handles the pre-processing step that runs before tokenizer training. It cleans up the raw text by splitting common Arabic prefixes off words, separating English contractions, and isolating punctuation, so the tokenizer sees cleaner word boundaries. Then it saves the result to disk.

### 1. Login to Hugging Face
I need to log in here so I can grab my private dataset from the hub.

In [None]:
from huggingface_hub import notebook_login

# logging in to access the data
notebook_login()

### 2. Loading the Data
Pulling the dataset I uploaded earlier. I'm using a wildcard to catch all the parquet files in the directory.

In [None]:
from datasets import load_dataset
import re
from typing import List, Optional, Tuple, Set

dataset_id = "Shams03/ARZ-EN"
# grabbing all the parquet files
data_files = {"train": "AllDataPARQUET/*.parquet"}

dataset = load_dataset(dataset_id, data_files=data_files, split='train')
print("Got the data.")

### 3. Check the Data
Just printing a few rows to make sure I downloaded the right thing and the columns look correct.

In [None]:
print("Data info:")
print(dataset)

print("\nColumns:")
print(dataset.column_names)

print("\nRandom samples:")
# check 10 random lines
for sample in dataset.shuffle(seed=42).select(range(10)):
    print(f"Text: {sample['sentence']}")

### 4. Arabic Tokenizer Logic
Arabic is tricky because particles attach directly to the word. I need to split prefixes like "al" (the) or "wal" (and the) from the main word. I also split on punctuation so each mark becomes its own token.

In [None]:
class ArzTokenizer:
    """
    My custom logic for splitting Arabic text.
    It cuts off common prefixes and spaces out punctuation.
    """

    def __init__(self):
        # these are the sticky prefixes I want to cut off
        self.prefixes: List[str] = ['ال', 'يا', 'لل', 'وال', 'بال']

        # listing all punctuation to handle them quickly
        punctuation_chars = '؟،؛:!.()[]{}""«»–—…/\\|@#$%^&*+=<>~`' + \
                            '?,;:!.()[]{}""\'---…/\\|@#$%^&*+=<>~`'
        self.punctuation_set: Set[str] = set(punctuation_chars)

        # regex to catch the punctuation
        escaped_punct = ''.join([re.escape(p) for p in self.punctuation_set])
        self.split_pattern = re.compile(f'([{escaped_punct}])')

    def _split_prefixes(self, word: str) -> List[str]:
        # if a word starts with a prefix, cut it off
        # but only if something is left after the prefix (the word is not the bare prefix)
        for prefix in self.prefixes:
            if word.startswith(prefix) and len(word) > len(prefix):
                return [prefix, word[len(prefix):]]
        return [word]

    def tokenize(self, text: str) -> List[str]:
        if not text:
            return []

        # step 1: split by punctuation marks
        parts = self.split_pattern.split(text)

        final_tokens = []
        for part in parts:
            if not part:
                continue

            # step 2: if it is a symbol, just keep it
            if part in self.punctuation_set:
                final_tokens.append(part)
            else:
                # step 3: if it is words, check for prefixes
                words = part.strip().split()
                for word in words:
                    final_tokens.extend(self._split_prefixes(word))

        return final_tokens
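
A quick way to sanity-check the prefix rule is to run it on a few words by hand. This is only a minimal standalone sketch: `split_prefix` is a hypothetical helper that copies the logic of `_split_prefixes` so the cell runs on its own, and the sample words are mine, not taken from the dataset.

```python
from typing import List

# Same prefix list as in ArzTokenizer.
PREFIXES: List[str] = ['ال', 'يا', 'لل', 'وال', 'بال']

def split_prefix(word: str) -> List[str]:
    """Cut the first matching prefix off a word, unless the word is only the prefix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return [prefix, word[len(prefix):]]
    return [word]

print(split_prefix('الكتاب'))   # ['ال', 'كتاب']
print(split_prefix('والبيت'))   # ['وال', 'بيت']
print(split_prefix('ال'))       # ['ال'] (bare prefix, nothing to split)
print(split_prefix('ياسمين'))   # ['يا', 'سمين'] (a name, split anyway)
```

The last case shows the main limitation: the match is purely on leading characters, so a word whose stem happens to start with one of these letter sequences gets split too. I'm accepting that trade-off here, but it is worth keeping in mind when reading the output.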

### 5. English Tokenizer Logic
English is easier. I mainly need to split contractions like "don't" into "do" and "n't" so the model learns "not". I space out punctuation here as well.

In [None]:
class EnTokenizer:
    """
    Simple rules for English text.
    Handles contractions and punctuation.
    """
    def __init__(self):
        # finding things like n't, 's, 've to put a space before them
        self.CONTRACTIONS_RE = re.compile(r"(n't|'s|'ve|'re|'d|'ll)\b", re.IGNORECASE)

        # finding punctuation but ignoring the apostrophe so I don't break the contractions
        # the hyphen is escaped so it is matched literally instead of reading as a ",-." range
        self.PUNCT_RE = re.compile(r'([!"#$%&()*+,\-./:;<=>?@\[\\\]^_`{|}~])')

        # finding extra spaces
        self.WHITESPACE_RE = re.compile(r'\s+')

    def tokenize(self, text: str) -> List[str]:
        # guard against empty or missing text, same as ArzTokenizer
        if not text:
            return []

        # step 1: space out contractions
        text = self.CONTRACTIONS_RE.sub(r' \1', text)

        # step 2: space out symbols
        text = self.PUNCT_RE.sub(r' \1 ', text)

        # step 3: clean up double spaces and split
        text = self.WHITESPACE_RE.sub(' ', text).strip()
        return text.split()
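
To see what these rules actually produce, here is a small standalone sketch of the same steps. `en_tokenize` is a hypothetical function name used only so the cell runs without the class, and it skips the whitespace regex because `str.split()` already collapses repeated spaces.

```python
import re

# Same contraction and punctuation rules as EnTokenizer.
CONTRACTIONS_RE = re.compile(r"(n't|'s|'ve|'re|'d|'ll)\b", re.IGNORECASE)
PUNCT_RE = re.compile(r'([!"#$%&()*+,\-./:;<=>?@\[\\\]^_`{|}~])')

def en_tokenize(text: str) -> list:
    """Put a space before contractions and around punctuation, then split."""
    text = CONTRACTIONS_RE.sub(r' \1', text)
    text = PUNCT_RE.sub(r' \1 ', text)
    return text.split()

print(en_tokenize("Don't worry, she's fine."))
# ['Do', "n't", 'worry', ',', 'she', "'s", 'fine', '.']

print(en_tokenize("I'm here"))
# ["I'm", 'here']
```

The second call shows a gap in the contraction list: `'m` is not covered, so "I'm" stays as one token. Adding it to the pattern would be a small change if that matters for the data.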

### 6. Processing the Dataset
Now I run the whole dataset through the two tokenizer classes. I check the filename (`source_file`) to see if it's English or Arabic and apply the right class. I'm using `num_proc` to spread the work over several processes and make it faster.

In [None]:
TokenizeArz = ArzTokenizer()
TokenizeEN = EnTokenizer()

def TokenBatches(batch):
    """
    Decides which tokenizer to use based on the source file name.
    """
    tokenized_output = []

    # loop through the text and the filename at the same time
    for sentence, source_file in zip(batch['sentence'], batch['source_file']):
        
        # if filename has ARZ, use arabic rules
        if 'ARZ' in source_file:
            tokens = TokenizeArz.tokenize(sentence)
        # if filename has EN, use english rules
        elif 'EN' in source_file:
            tokens = TokenizeEN.tokenize(sentence)
        else:
            # fallback: filename matches neither language, so emit no tokens
            tokens = []
            
        tokenized_output.append(tokens)
        
    return {"pretokenized": tokenized_output}

# running the map function
# num_proc=4 makes it run in parallel
MyPreTokenizedData = dataset.map(
    TokenBatches,
    batched=True,
    num_proc=4, 
    batch_size=2000,
    remove_columns=['sentence'], # dropping the old column to save space
    desc="Running tokenizer..."
)
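
The routing above depends on plain substring checks, so it is worth seeing how they behave on edge cases. This is a minimal sketch: `detect_language` is a hypothetical helper that mirrors the `if`/`elif` in `TokenBatches`, and the file names are made up for illustration (I am not listing the real names in the dataset here).

```python
from typing import Optional

def detect_language(source_file: str) -> Optional[str]:
    """Return 'arz', 'en', or None using the same substring checks as TokenBatches."""
    if 'ARZ' in source_file:
        return 'arz'
    elif 'EN' in source_file:
        return 'en'
    return None

# Hypothetical file names, for illustration only.
names = ['ARZ_part1.parquet', 'EN_part1.parquet',
         'ARZ-EN_mixed.parquet', 'en_part1.parquet']
for name in names:
    print(name, '->', detect_language(name))
# ARZ_part1.parquet -> arz
# EN_part1.parquet -> en
# ARZ-EN_mixed.parquet -> arz
# en_part1.parquet -> None
```

Two things follow from this: a name containing both markers is always treated as Arabic because `ARZ` is checked first, and the check is case-sensitive, so a lowercase name falls through to the empty-token fallback.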

### 7. Inspecting Results
Checking if the column `pretokenized` was created correctly.

In [None]:
print("\nDone.")
print("New dataset structure:")
print(MyPreTokenizedData)

print("\nChecking a processed row:")
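# index 5,000,000 assumes the dataset has at least that many rows; clamp it to be safe
row_index = min(5_000_000, len(MyPreTokenizedData) - 1)
print(MyPreTokenizedData[row_index])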

### 8. Save to Disk
Saving it as a Hugging Face dataset folder so I can load it quickly later without re-running this logic.

In [None]:
save_path = "/kaggle/working/MyPreTokenizedData"
MyPreTokenizedData.save_to_disk(save_path)

print("Saved dataset folder.")

### 9. Save as Parquet
Also saving as a parquet file just in case I need to move it around easily or use it in pandas.

In [None]:
save_path = "/kaggle/working/MyPreTokenizedData.parquet"

try:
    MyPreTokenizedData.to_parquet(save_path)
    print(f"Saved parquet to {save_path}")
except Exception as e:
    print(f"Error saving parquet: {e}")

### 10. Final Check
List the directory to confirm the files are there.

In [None]:
# list the working directory with file sizes
!ls -lh /kaggle/working