# 1. Introduction

This notebook builds a small transcription pipeline to collect and prepare audio data related to the Damascene (Syrian) dialect. The goal is to convert raw audio recordings into clean, structured text that can be used for training, analysis, or building linguistic resources for the dialect.

# 2. Project Overview

The pipeline has four stages:

1. Audio Collection: Storing all Damascene-dialect recordings in a single directory.
2. Automatic Transcription: Converting audio files to text using Whisper.
3. Text Cleaning & Normalization: Removing noise, symbols, and inconsistent spellings, and applying basic Arabic normalization.
4. Dataset Export: Saving all processed text into a CSV dataset that can later be used for NLP tasks or model training.



```mermaid
graph TD
    A[Audio Files] --> B[Whisper Transcription]
    B --> C[Text Normalization]
    C --> D[CSV Output]
    D --> E[Dialect Dataset]
```


# Setup Environment

This section installs and loads the required libraries:

- Whisper: for high-quality speech-to-text transcription.
- Torch: a backend dependency for Whisper.
- pydub: for handling audio formats.
- tqdm: for progress bars during batch transcription.
- Pandas: for saving results in CSV format.

In [None]:
# ============================================================
# Install Required Libraries
# ============================================================
# Now install all dependencies
!pip install -q numpy openai-whisper torch soundfile pydub librosa tqdm pandas rapidfuzz stanza supabase

# ============================================================
# Import Libraries
# ============================================================
import os
import re
import time  # used for timestamps in the transcription loop
import numpy as np
import pandas as pd
import librosa
import soundfile as sf
from tqdm import tqdm
from pydub import AudioSegment

# Whisper for transcription
import whisper

# Optional: text cleaning helpers
from rapidfuzz import fuzz
import stanza

# Optional: Supabase client
from supabase import create_client, Client


# Connect to drive

In [None]:
# Colab cell 1 — mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Define Paths

Here we define:

- The directory containing all Damascene audio files.
- The output path for the CSV file storing transcriptions.
- Additional paths for intermediate and review files.

Keeping every path in one place lets the notebook run repeatedly without changing the core code.

In [None]:
DRIVE_AUDIO_DIR = "/content/drive/MyDrive/Damascene_Accent"
OUTPUT_CSV = "/content/drive/MyDrive/Damascene_Accent/Damascene_transcriptions_large_acc.csv"

REVIEW_TRANSCRIPT = f"{DRIVE_AUDIO_DIR}/reviewed_transcript/reviewed_damascene_transcriptions.csv"
NORMALIZED_PATH = f"{DRIVE_AUDIO_DIR}/reviewed_transcript/normalized_damascene_transcriptions.csv"
TOKENIZED_TRANSCRIPT = f"{DRIVE_AUDIO_DIR}/reviewed_transcript/tokenized_damascene_transcriptions.csv"
DEDUPED_PATH = f"{DRIVE_AUDIO_DIR}/reviewed_transcript/unique_damascene_tokens.csv"
DUPLICATES_PATH = f"{DRIVE_AUDIO_DIR}/reviewed_transcript/possible_duplicates_with_lines.csv"
DATA_PATH = f"{DRIVE_AUDIO_DIR}/final_sample_with_pos_stanza.csv"

DATA_SAMPLE_PATH = f"{DRIVE_AUDIO_DIR}/final_sample.csv"
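
pandas raises an error when asked to write a CSV into a folder that does not exist, so it can help to create the output folders before the later stages run. A minimal sketch, assuming a hypothetical helper named `ensure_parent_dirs` that is not part of the original notebook:

```python
import os

def ensure_parent_dirs(*paths: str) -> None:
    """Create the parent directory of each path if it is missing."""
    for path in paths:
        parent = os.path.dirname(path)
        if parent:  # skip bare filenames that have no directory part
            os.makedirs(parent, exist_ok=True)
```

It could be called once after the path definitions, for example `ensure_parent_dirs(OUTPUT_CSV, NORMALIZED_PATH, TOKENIZED_TRANSCRIPT)`.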

## Extract Transcript from Voice Records

This stage processes each audio file and generates text using Whisper.
Key points to note:

- Whisper supports Arabic and performs reasonably well on many Levantine dialects, including Damascene.
- Model size trades accuracy against speed: larger models (tiny < base < small < medium < large) are generally more accurate but slower.
- Files already processed are skipped to avoid duplication.
- Errors are caught so corrupted audio files do not stop the pipeline.

In [None]:
# Choose model: tiny, base, small, medium, large (or a versioned name such as large-v3)
MODEL_NAME = "large-v3"   # Smaller models such as "small" or "medium" run faster but are less accurate
model = whisper.load_model(MODEL_NAME)
print("✅ Model loaded:", MODEL_NAME)

# Gather audio files
files = sorted([f for f in os.listdir(DRIVE_AUDIO_DIR) if f.lower().endswith(('.wav', '.mp3', '.m4a', '.flac', '.mpeg'))])
print(f"Found {len(files)} audio files")

# Load existing CSV (if any)
if os.path.exists(OUTPUT_CSV) and os.path.getsize(OUTPUT_CSV) > 0:
    df_existing = pd.read_csv(OUTPUT_CSV)
    processed = set(df_existing['filename'].astype(str).tolist())
    print(f"Loaded existing CSV with {len(df_existing)} records; skipping {len(processed)} files.")
else:
    df_existing = pd.DataFrame(columns=["filename","transcription","language","duration","model","error","timestamp"])
    processed = set()
    print("CSV file either does not exist or is empty; initializing new DataFrame.")

# Buffer for new transcriptions
rows = []

# Main loop
for filename in tqdm(files, desc="Transcribing"):
    if filename in processed:
        print(f"⏩ Skipping already processed: {filename}")
        continue

    file_path = os.path.join(DRIVE_AUDIO_DIR, filename)
    ts = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    try:
        result = model.transcribe(file_path, language="ar")
        text = result.get("text", "").strip()
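        # Whisper's result has no "duration" key; approximate it with the end time of the last segment
        segments = result.get("segments") or []
        duration = segments[-1]["end"] if segments else None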
        row = {
            "filename": filename,
            "transcription": text,
            "language": result.get("language", "ar"),
            "duration": duration,
            "model": MODEL_NAME,
            "error": "",
            "timestamp": ts
        }
        print(f"[OK] {filename} -> {len(text.split())} words")
    except Exception as e:
        print(f"[ERR] {filename} -> {e}")
        row = {
            "filename": filename,
            "transcription": "",
            "language": "",
            "duration": None,
            "model": MODEL_NAME,
            "error": str(e),
            "timestamp": ts
        }

    # Add to list of new rows
    rows.append(row)

# ✅ Save all results *once* at the end
if rows:
    df_new = pd.DataFrame(rows)
    df = pd.concat([df_existing, df_new], ignore_index=True)
    df.to_csv(OUTPUT_CSV, index=False, encoding="utf-8-sig")
    print(f"✅ Saved {len(df)} total records to {OUTPUT_CSV}")
else:
    print("⚠️ No new files were processed.")

✅ Model loaded: large-v3
Found 16 audio files
Loaded existing CSV with 16 records; skipping 16 files.


Transcribing: 100%|██████████| 16/16 [00:00<00:00, 70492.50it/s]

⏩ Skipping already processed: segment_1.mp3
⏩ Skipping already processed: segment_10.mp3
⏩ Skipping already processed: segment_11.mp3
⏩ Skipping already processed: segment_12.mp3
⏩ Skipping already processed: segment_13.mp3
⏩ Skipping already processed: segment_14.mp3
⏩ Skipping already processed: segment_15.mp3
⏩ Skipping already processed: segment_16.mp3
⏩ Skipping already processed: segment_2.mp3
⏩ Skipping already processed: segment_3.mp3
⏩ Skipping already processed: segment_4.mp3
⏩ Skipping already processed: segment_5.mp3
⏩ Skipping already processed: segment_6.mp3
⏩ Skipping already processed: segment_7.mp3
⏩ Skipping already processed: segment_8.mp3
⏩ Skipping already processed: segment_9.mp3
⚠️ No new files were processed.
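
The resume logic can be isolated into a small helper. The sketch below uses only the standard library and adds one assumption that the cell above does not make: rows whose `error` column is non-empty are retried rather than skipped. `files_to_process` and `AUDIO_EXTENSIONS` are hypothetical names.

```python
import csv
import os

AUDIO_EXTENSIONS = (".wav", ".mp3", ".m4a", ".flac", ".mpeg")

def files_to_process(audio_files, csv_path):
    """Return audio files that have no successful row in csv_path yet."""
    done = set()
    if os.path.exists(csv_path) and os.path.getsize(csv_path) > 0:
        # utf-8-sig strips the BOM written by to_csv(encoding="utf-8-sig")
        with open(csv_path, newline="", encoding="utf-8-sig") as fh:
            for row in csv.DictReader(fh):
                # A row counts as done only if its error column is empty
                if not (row.get("error") or "").strip():
                    done.add(row["filename"])
    return [
        name for name in sorted(audio_files)
        if name.lower().endswith(AUDIO_EXTENSIONS) and name not in done
    ]
```

With this design, a file that failed once because of a transient problem is picked up again on the next run instead of being skipped permanently.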






## Transcript Transformation Stage

**Goal**: Clean and normalize the reviewed transcript

  - Remove noise, diacritics, elongation, and non-Arabic symbols

  - Produce a uniform Arabic text suitable for tokenization

### Normalization

The purpose of this step is to transform noisy, inconsistent raw transcripts into clean, uniform Arabic text.

Common operations include:

- Removing diacritics.
- Normalizing alef forms (أ / إ / آ → ا). The code below does not apply this step.
- Normalizing taa marbouta (ة → ه), or keeping it as ة, depending on your preference. The code below keeps it.
- Removing extra punctuation, emojis, or non-Arabic symbols.
- Collapsing repeated characters used for emphasis. The code below caps runs of three or more identical characters at two (e.g., "ههههه" → "هه").

Example:

Before:
"ايييي بس ليش هيك؟!!!"

After:
"ايي بس ليش هيك"

In [None]:
# ======================================================
#  Arabic Text Normalization for Reviewed Transcripts
#  Production-grade, no semantic changes
# ======================================================

import re
import unicodedata

# ------------------------
# Precompiled regexes
# ------------------------
RE_DIACRITICS = re.compile(r'[\u0610-\u061A\u064B-\u065F\u06D6-\u06ED]')
RE_NON_ARABIC = re.compile(r'[^ء-ي0-9\u0660-\u0669A-Za-z\s\-]')
RE_MULTI_SPACE = re.compile(r'\s+')
RE_REPEAT = re.compile(r'(.)\1{2,}')  # Match 3+ repeated chars


# ------------------------
# Utility functions
# ------------------------
def unicode_nfc(text: str) -> str:
    """Normalize Unicode to NFC form."""
    return unicodedata.normalize('NFC', text)


def remove_bom_and_trim(text: str) -> str:
    """Remove BOM and left-to-right/right-to-left marks, then trim whitespace."""
    return (
        text.replace('\ufeff', '')
            .replace('\u200f', '')
            .replace('\u200e', '')
            .strip()
    )


def remove_invisible(text: str) -> str:
    """Remove invisible or elongation characters."""
    return (
        text.replace('\u0640', '')  # tatweel
            .replace('\u200c', '')  # zero-width non-joiner
            .replace('\u200d', '')  # zero-width joiner
    )


def remove_diacritics(text: str) -> str:
    """Strip Arabic diacritics."""
    return RE_DIACRITICS.sub('', text)


def remove_non_arabic_punct(text: str) -> str:
    """Keep Arabic/Latin letters, digits, hyphens and spaces."""
    return RE_NON_ARABIC.sub(' ', text)


def collapse_repeats(text: str, max_repeats=2) -> str:
    """Cap runs of 3+ identical characters at max_repeats (e.g., حلوووو → حلوو)."""
    def repl(m):
        return m.group(1) * max_repeats
    return RE_REPEAT.sub(repl, text)


def final_cleanup(text: str) -> str:
    """Normalize spacing and lowercase Latin."""
    text = RE_MULTI_SPACE.sub(' ', text)
    return text.strip().lower()


# ------------------------
# Master normalization function
# ------------------------
def normalize_text(text):
    """Normalize a string and return (text, steps)."""
    if not isinstance(text, str):
        return text, []

    steps = []
    s = unicode_nfc(text); steps.append("unicode_nfc")
    s = remove_bom_and_trim(s); steps.append("trim_bom")
    s = remove_invisible(s); steps.append("remove_invisible")
    s = remove_diacritics(s); steps.append("remove_diacritics")
    s = remove_non_arabic_punct(s); steps.append("remove_punct")
    s = collapse_repeats(s, max_repeats=2); steps.append("collapse_repeats")
    s = final_cleanup(s); steps.append("final_cleanup")

    return s, steps
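
The function above leaves alef and taa marbouta forms unchanged. If a downstream task needs them unified, a step along the following lines could be added. This is an optional sketch, not part of the original notebook: `normalize_letters` and its `unify_taa_marbouta` flag are hypothetical names, and merging these letters can erase distinctions that matter for some analyses.

```python
# Map hamza-carrying and madda alef forms (أ, إ, آ) to bare alef (ا).
ALEF_MAP = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})

def normalize_letters(text: str, unify_taa_marbouta: bool = False) -> str:
    """Unify alef variants; optionally map taa marbouta (ة) to haa (ه)."""
    text = text.translate(ALEF_MAP)
    if unify_taa_marbouta:
        text = text.replace("\u0629", "\u0647")
    return text
```

Keeping the taa marbouta mapping behind a flag makes the more aggressive change an explicit choice.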


In [None]:
# ------------------------
# Apply to your dataset
# ------------------------


# Load the dataset
df = pd.read_csv(REVIEW_TRANSCRIPT, encoding="utf-8")

# Ensure we have a column named 'transcription'
if "transcription" not in df.columns:
    raise ValueError("The CSV must contain a column named 'transcription'.")

# Apply normalization
normalized_texts = []
steps_logs = []

for text in df["transcription"]:
    norm, steps = normalize_text(text)
    normalized_texts.append(norm)
    steps_logs.append("|".join(steps))

df["normalized_transcription"] = normalized_texts
df["normalization_steps"] = steps_logs

# Save the result
df.to_csv(NORMALIZED_PATH, index=False, encoding="utf-8-sig")

print(f"✅ Normalization complete. Saved to:\n{NORMALIZED_PATH}")
print(f"Total rows processed: {len(df)}")

# Display sample
df.head(10)


✅ Normalization complete. Saved to:
/content/drive/MyDrive/Damascene_Accent/reviewed_transcript/normalized_damascene_transcriptions.csv
Total rows processed: 16


Unnamed: 0,filename,transcription,language,duration,model,error,timestamp,normalized_transcription,normalization_steps
0,segment_1.mp3,موسيقى موسيقى موسيقى موسيقى موسيقى موسيقى لازم...,ar,,large-v3,,11/14/2025 20:54,موسيقى موسيقى موسيقى موسيقى موسيقى موسيقى لازم...,unicode_nfc|trim_bom|remove_invisible|remove_d...
1,segment_2.mp3,انا لي الي علاقة يا روحي الله مأنعم عليي بالصح...,ar,,large-v3,,11/14/2025 20:58,انا لي الي علاقة يا روحي الله مأنعم عليي بالصح...,unicode_nfc|trim_bom|remove_invisible|remove_d...
2,segment_3.mp3,كرش انا بغينه عنه ويمكن ما يلزمني منه بلا ايش ...,ar,,large-v3,,11/14/2025 20:59,كرش انا بغينه عنه ويمكن ما يلزمني منه بلا ايش ...,unicode_nfc|trim_bom|remove_invisible|remove_d...
3,segment_4.mp3,عم تلبسي مشد شو صار بالمشد يلي اشتريتلك يا من ...,ar,,large-v3,,11/14/2025 20:59,عم تلبسي مشد شو صار بالمشد يلي اشتريتلك يا من ...,unicode_nfc|trim_bom|remove_invisible|remove_d...
4,segment_5.mp3,يقدروا يوقفوا بوشك يا ستي حتى الحكام يلي بيقيس...,ar,,large-v3,,11/14/2025 21:01,يقدروا يوقفوا بوشك يا ستي حتى الحكام يلي بيقيس...,unicode_nfc|trim_bom|remove_invisible|remove_d...
5,segment_6.mp3,هاردي؟ لا شوفوا التاني وين شطح تفكيرها شوفوا ب...,ar,,large-v3,,11/14/2025 21:01,هاردي لا شوفوا التاني وين شطح تفكيرها شوفوا بس...,unicode_nfc|trim_bom|remove_invisible|remove_d...
6,segment_7.mp3,لكن نشوفي داعي لكل المسخرة ليلى اذا ما عملت مع...,ar,,large-v3,,11/14/2025 21:02,لكن نشوفي داعي لكل المسخرة ليلى اذا ما عملت مع...,unicode_nfc|trim_bom|remove_invisible|remove_d...
7,segment_8.mp3,منشان صالحك يعني منشان منفعتك لك ما شفتيه دي ك...,ar,,large-v3,,11/14/2025 21:02,منشان صالحك يعني منشان منفعتك لك ما شفتيه دي ك...,unicode_nfc|trim_bom|remove_invisible|remove_d...
8,segment_9.mp3,ناقصة كيليين؟ ليش انت هززلي بدني مرة واحدة بس؟...,ar,,large-v3,,11/14/2025 21:02,ناقصة كيليين ليش انت هززلي بدني مرة واحدة بس ا...,unicode_nfc|trim_bom|remove_invisible|remove_d...
9,segment_10.mp3,موسيقى اخذت دواء من الصيدالية لفتح الشهيرية وم...,ar,,large-v3,,11/14/2025 20:54,موسيقى اخذت دواء من الصيدالية لفتح الشهيرية وم...,unicode_nfc|trim_bom|remove_invisible|remove_d...


### Tokenization

After normalizing the transcripts, we tokenize them so that each word becomes a meaningful unit that can be stored in the database.

To make this easier, we'll use the CAMeL Tools library.

In [None]:
# Requires the camel-tools package (pip install camel-tools)
from camel_tools.tokenizers.word import simple_word_tokenize

# ------------------------------------------------------
# Load data
# ------------------------------------------------------
df = pd.read_csv(NORMALIZED_PATH, encoding="utf-8-sig")
if "normalized_transcription" not in df.columns:
    raise ValueError("Missing column 'normalized_transcription' in your CSV.")

print(f"Loaded {len(df)} records from {NORMALIZED_PATH}")

# ------------------------------------------------------
# Tokenization utilities
# ------------------------------------------------------

def basic_token_clean(token):
    """Clean token from stray punctuation/spaces."""
    token = re.sub(r"[^\u0600-\u06FF0-9A-Za-z]", "", token)  # keep Arabic & alphanum
    token = token.strip()
    return token

def tokenize_text(text):
    """Tokenize Arabic text using CAMeL Tools."""
    tokens = simple_word_tokenize(str(text))
    tokens = [basic_token_clean(tok) for tok in tokens if tok.strip()]
    return tokens

# ------------------------------------------------------
# Tokenize all transcripts
# ------------------------------------------------------
all_tokens = []

for i, row in tqdm(df.iterrows(), total=len(df), desc="Tokenizing transcripts"):
    filename = row.get("filename", f"row_{i}")
    text = row["normalized_transcription"]

    tokens = tokenize_text(text)
    for token in tokens:
        if token:  # skip empty
            all_tokens.append({
                "filename": filename,
                "token": token,
                "transcription_id": i
            })

# ------------------------------------------------------
# Save tokenized output
# ------------------------------------------------------
tokens_df = pd.DataFrame(all_tokens)
tokens_df.to_csv(TOKENIZED_TRANSCRIPT, index=False, encoding="utf-8-sig")

print(f"✅ Tokenization complete.")
print(f"Total tokens: {len(tokens_df)}")
print(f"Output saved to: {TOKENIZED_TRANSCRIPT}")

# Preview few results
tokens_df.head(20)

Loaded 16 records from /content/drive/MyDrive/Damascene_Accent/reviewed_transcript/normalized_damascene_transcriptions.csv


Tokenizing transcripts: 100%|██████████| 16/16 [00:01<00:00, 13.52it/s]


✅ Tokenization complete.
Total tokens: 2662
Output saved to: /content/drive/MyDrive/Damascene_Accent/reviewed_transcript/tokenized_damascene_transcriptions.csv


Unnamed: 0,filename,token,transcription_id
0,segment_1.mp3,موسيقى,0
1,segment_1.mp3,موسيقى,0
2,segment_1.mp3,موسيقى,0
3,segment_1.mp3,موسيقى,0
4,segment_1.mp3,موسيقى,0
5,segment_1.mp3,موسيقى,0
6,segment_1.mp3,لازم,0
7,segment_1.mp3,تنحفي,0
8,segment_1.mp3,حالك,0
9,segment_1.mp3,خدي,0


#### Remove Literal Duplicates

First, we strip stray non-alphanumeric characters from each token and drop exact duplicates, so that every word appears only once.


In [None]:
# ======================================================
#  Deduplicate Tokens → Unique Words Only
# ======================================================

# Load token data
df = pd.read_csv(TOKENIZED_TRANSCRIPT, encoding="utf-8-sig")
print(f"Loaded {len(df)} tokens from {TOKENIZED_TRANSCRIPT}")

# ------------------------------------------------------
# Basic cleanup before deduplication
# ------------------------------------------------------
def normalize_token(token):
    """Clean token to unify duplicates."""
    token = str(token).strip()
    token = re.sub(r"[^\u0600-\u06FF0-9A-Za-z]", "", token)  # Keep Arabic & alphanum only
    token = re.sub(r"\s+", "", token)
    return token

df["clean_token"] = df["token"].apply(normalize_token)
df = df[df["clean_token"].astype(bool)]  # Drop empty tokens

# ------------------------------------------------------
# Deduplicate
# ------------------------------------------------------
# Keep the first occurrence of each token together with the file it came from
unique_df = df.drop_duplicates(subset="clean_token")[["clean_token", "filename"]]
unique_df = unique_df.rename(columns={"clean_token": "token"}).reset_index(drop=True)
unique_df.insert(0, "id", range(1, len(unique_df) + 1))

# ------------------------------------------------------
# Save output
# ------------------------------------------------------
unique_df.to_csv(DEDUPED_PATH, index=False, encoding="utf-8-sig")

print(f"✅ Unique token list generated.")
print(f"Total unique tokens: {len(unique_df)}")
print(f"Saved to: {DEDUPED_PATH}")

# Preview
unique_df.head(20)


Loaded 2662 tokens from /content/drive/MyDrive/Damascene_Accent/reviewed_transcript/tokenized_damascene_transcriptions.csv
✅ Unique token list generated.
Total unique tokens: 1391
Saved to: /content/drive/MyDrive/Damascene_Accent/reviewed_transcript/unique_damascene_tokens.csv


Unnamed: 0,id,token,filename
0,1,موسيقى,segment_1.mp3
1,2,لازم,segment_1.mp3
2,3,تنحفي,segment_1.mp3
3,4,حالك,segment_1.mp3
4,5,خدي,segment_1.mp3
5,6,طلعي,segment_1.mp3
6,7,على,segment_1.mp3
7,8,الميزان,segment_1.mp3
8,9,شوفي,segment_1.mp3
9,10,وزنك,segment_1.mp3
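
Deduplication that keeps the file in which each token first appears can be sketched in plain Python. This is a minimal illustration, assuming token records shaped like the rows of the tokenized CSV; `unique_tokens_first_seen` is a hypothetical name.

```python
def unique_tokens_first_seen(records):
    """Keep one row per token, remembering the file where it first appeared."""
    first_seen = {}  # dicts preserve insertion order, so first occurrences stay in order
    for rec in records:
        token = str(rec["token"]).strip()
        if token and token not in first_seen:
            first_seen[token] = rec["filename"]
    return [
        {"id": i, "token": tok, "filename": fname}
        for i, (tok, fname) in enumerate(first_seen.items(), start=1)
    ]
```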


#### Remove Similar Words

This time we run a fuzzy comparison between words to catch near-duplicates, such as spelling variants of the same word.


In [None]:
!pip install camel-tools



In [None]:
# ======================================================
#   Detect Near-Duplicate Arabic Tokens (Fuzzy Matching)
#   + Include line numbers for human review
# ======================================================
# ------------------------------------------------------
# Load tokens
# ------------------------------------------------------
df = pd.read_csv(DEDUPED_PATH, encoding="utf-8-sig")
if "token" not in df.columns:
    raise ValueError("The file must contain a 'token' column.")

tokens = df["token"].dropna().tolist()
print(f"Loaded {len(tokens)} tokens for fuzzy comparison.")

# ------------------------------------------------------
# Settings
# ------------------------------------------------------
SIMILARITY_THRESHOLD = 85 # 85-90 is typical for Arabic dialect variants (78% for precision)

# ------------------------------------------------------
# Compare tokens pairwise
# ------------------------------------------------------
checked_pairs = set()
duplicates = []

for i, token in tqdm(enumerate(tokens), total=len(tokens), desc="Comparing tokens"):
    for j in range(i + 1, len(tokens)):
        other = tokens[j]
        pair = tuple(sorted([token, other]))
        if pair in checked_pairs:
            continue
        score = fuzz.token_sort_ratio(token, other)
        if score >= SIMILARITY_THRESHOLD and token != other:
            duplicates.append({
                "line_1": i + 2,   # CSV line number: +1 for the header row, +1 because lines are 1-based
                "token_1": token,
                "line_2": j + 2,
                "token_2": other,
                "similarity": score
            })
        checked_pairs.add(pair)

# ------------------------------------------------------
# Save results
# ------------------------------------------------------
if duplicates:
    dup_df = pd.DataFrame(duplicates).sort_values("similarity", ascending=False)
    dup_df.to_csv(DUPLICATES_PATH, index=False, encoding="utf-8-sig")
    print(f"✅ Found {len(dup_df)} potential duplicates.")
    print(f"Saved to: {DUPLICATES_PATH}")
else:
    print("✅ No fuzzy duplicates detected above threshold.")

# Preview top results
if duplicates:
    dup_df.head(20)


Loaded 1391 tokens for fuzzy comparison.


Comparing tokens: 100%|██████████| 1391/1391 [00:01<00:00, 996.02it/s] 


✅ Found 298 potential duplicates.
Saved to: /content/drive/MyDrive/Damascene_Accent/reviewed_transcript/possible_duplicates_with_lines.csv


Note: For flexibility, the possible duplicates are saved to a separate file so that a reviewer can confirm whether each pair really has the same meaning.
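
To build intuition for the threshold, the sketch below computes a 0-100 similarity from the longest common subsequence of two tokens. It is a standard-library illustration of a normalized insertion/deletion similarity; that it tracks rapidfuzz closely is an assumption, and its scores may not match `fuzz.token_sort_ratio` exactly.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Return 100 * 2 * LCS / (len(a) + len(b)); identical strings score 100."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total
```

In this sketch, الميزان and لميزان score about 92 and would be flagged at a threshold of 85, while شوفي and شوفو score 75 and would not. Short words are therefore much more sensitive to a single-letter difference than long ones.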

### Part of Speech (POS)

For each word, we need its part of speech (POS). Instead of entering it manually, I used **Stanza**, an NLP toolkit whose pretrained Arabic model predicts POS tags automatically.

In [None]:


# --------------------------------------------------
# Download the Arabic model (run once)
# --------------------------------------------------
stanza.download("ar")

# --------------------------------------------------
# Initialize the Arabic NLP pipeline
# --------------------------------------------------
nlp = stanza.Pipeline(lang="ar", processors="tokenize,pos", use_gpu=True)

# --------------------------------------------------
# Load your dataset
# --------------------------------------------------
df = pd.read_csv(DATA_SAMPLE_PATH, encoding="utf-8-sig")

# --------------------------------------------------
# POS tagging helper function
# --------------------------------------------------
def get_pos_stanza(text):
    """Return the main POS tag for the input Arabic text using Stanza."""
    if not isinstance(text, str) or not text.strip():
        return ""
    doc = nlp(text)
    pos_tags = [word.upos for sent in doc.sentences for word in sent.words]
    if not pos_tags:
        return "UNK"
    # Return the most frequent POS tag
    return max(set(pos_tags), key=pos_tags.count)

# --------------------------------------------------
# Apply POS tagging to your gloss_ar column
# --------------------------------------------------
df["pos"] = df["gloss_ar"].apply(get_pos_stanza)

# --------------------------------------------------
# Save the result
# --------------------------------------------------
OUTPUT_PATH = DATA_SAMPLE_PATH.replace(".csv", "_with_pos_stanza.csv")
df.to_csv(OUTPUT_PATH, index=False, encoding="utf-8-sig")

print(f"✅ POS tagging done! Saved to:\n{OUTPUT_PATH}")
df.head()


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: ar (Arabic) ...


Downloading https://huggingface.co/stanfordnlp/stanza-ar/resolve/v1.11.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/ar/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: ar (Arabic):
| Processor | Package     |
---------------------------
| tokenize  | padt        |
| mwt       | padt        |
| pos       | padt_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Done loading processors!


✅ POS tagging done! Saved to:
/content/drive/MyDrive/Damascene_Accent/final_sample_with_pos_stanza.csv


Unnamed: 0,id,dialect_region,token,gloss_ar,source,filename,source_url,pos
0,1,Damascene,يسلمو,سلمت يداك (شكراً),,,,PUNCT
1,2,Damascene,تؤبرني,لا أطيق الحياة من بعدك,,,,NOUN
2,3,Damascene,شو,ما هي,,,,DET
3,4,Damascene,مع السلامة,مع السلامة,,,,ADP
4,5,Damascene,شو رأيك,ما رأيك؟,,,,VERB
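
A majority vote over all tags can be dominated by punctuation: in the first row above, the gloss contains parentheses and the phrase ends up tagged PUNCT, probably for that reason. One possible refinement, sketched here with hypothetical tag lists rather than real Stanza output, is to ignore PUNCT when other tags are present and to break ties by first appearance. `main_pos` is a hypothetical name.

```python
from collections import Counter

def main_pos(tags, ignore=("PUNCT",)):
    """Return the most frequent tag, skipping ignored tags when possible."""
    if not tags:
        return "UNK"
    # Fall back to the full list if every tag is in the ignore set
    content = [t for t in tags if t not in ignore] or list(tags)
    counts = Counter(content)
    best = max(counts.values())
    # First tag in original order that reaches the top count (deterministic ties)
    return next(t for t in content if counts[t] == best)
```

The function takes a plain list of tags, so inside `get_pos_stanza` it could replace the final `max(...)` line with `main_pos(pos_tags)`.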


Measuring the model efficiency

Rows tagged X or UNK are treated as missing POS tags; these can be filled in manually.

In [None]:

total_rows = len(df)
invalid_rows = df["pos"].isin(["X", "UNK"]).sum()
valid_rows = total_rows - invalid_rows

efficiency = (valid_rows / total_rows) * 100

print(f"🔍 Total rows: {total_rows}")
print(f"❌ Invalid tags (X or UNK): {invalid_rows}")
print(f"✅ Valid tags: {valid_rows}")
print(f"📊 Efficiency: {efficiency:.2f}%")

🔍 Total rows: 364
❌ Invalid tags (X or UNK): 68
✅ Valid tags: 296
📊 Efficiency: 81.32%


## 7. Load Dataset to PostgreSQL

In this step, we integrate with **Supabase**, which provides a hosted **PostgreSQL** database, and upload our dataset.

In [None]:
# !pip install supabase pandas python-dotenv --quiet


Rename and Prepare Dataset

Rename the columns so that they match the database table, and replace NaN values with None so that the rows can be uploaded correctly.

In [None]:

df = pd.read_csv(DATA_PATH, encoding="utf-8-sig")

# Select and rename columns to match your table
df_to_upload = df.rename(columns={
    "token": "word_or_phrase",
    "gloss_ar": "gloss_ar",
    "pos": "pos",
    "dialect_region": "dialect_region",
    "source": "source",
    "filename": "filename",
    "source_url": "source_url",
})[[
    "dialect_region", "word_or_phrase", "gloss_ar",
    "pos", "source", "filename", "source_url"
]]

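# Replace NaN values with None (important for JSON encoding);
# casting to object first keeps None from being converted back to NaN in numeric columns
df_to_upload = df_to_upload.astype(object).where(pd.notnull(df_to_upload), None)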

print("✅ Data prepared for upload.")
df_to_upload.head()

✅ Data prepared for upload.


Unnamed: 0,dialect_region,word_or_phrase,gloss_ar,pos,source,filename,source_url
0,Damascene,يسلمو,سلمت يداك (شكراً),PUNCT,,,
1,Damascene,تؤبرني,لا أطيق الحياة من بعدك,NOUN,,,
2,Damascene,شو,ما هي,DET,,,
3,Damascene,مع السلامة,مع السلامة,ADP,,,
4,Damascene,شو رأيك,ما رأيك؟,VERB,,,
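
The reason for the NaN replacement can be seen with the standard `json` module: Python serializes NaN as the bare token `NaN`, which is not valid JSON, whereas `None` becomes `null`. A small sketch, where `clean_record` is a hypothetical helper and the sample row is made up for illustration:

```python
import json
import math

def clean_record(record: dict) -> dict:
    """Replace float NaN values with None so the record is valid JSON."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }

row = {"word_or_phrase": "شو", "source": float("nan")}  # illustrative row

# Prints {"word_or_phrase": "شو", "source": NaN}
print(json.dumps(row, ensure_ascii=False))

# Prints {"word_or_phrase": "شو", "source": null}
print(json.dumps(clean_record(row), ensure_ascii=False))
```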


### Setup Supabase Connection

In [None]:

SUPABASE_URL = "YOUR_SUPABASE_URL"
SUPABASE_KEY = "YOUR_SUPABASE_KEY"
supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)
print("Connected to Supabase.")


Connected to Supabase.
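
Keys typed directly into a shared notebook are easy to leak. One common alternative is to read them from environment variables. The sketch below assumes the variable names `SUPABASE_URL` and `SUPABASE_KEY`, which are a convention chosen here and not something the original notebook defines; `load_supabase_credentials` is likewise a hypothetical helper.

```python
import os

def load_supabase_credentials(env=os.environ):
    """Read the Supabase URL and key from a mapping of environment variables."""
    url = env.get("SUPABASE_URL", "").strip()
    key = env.get("SUPABASE_KEY", "").strip()
    missing = [
        name for name, value in (("SUPABASE_URL", url), ("SUPABASE_KEY", key))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return url, key
```

The returned pair can then be passed to `create_client(url, key)` in place of the literal strings.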


### Insert Data into Supabase

In [None]:
batch_size = 100
for start in range(0, len(df_to_upload), batch_size):
    end = start + batch_size
    batch = df_to_upload.iloc[start:end].to_dict(orient="records")
    response = supabase.table("Damascene_dialect_words").insert(batch).execute()
    print(f"✅ Uploaded rows {start}–{end}: {len(response.data) if response.data else 'OK'}")

✅ Uploaded rows 0–100: 100
✅ Uploaded rows 100–200: 100
✅ Uploaded rows 200–300: 100
✅ Uploaded rows 300–400: 64
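
The log above reports the last batch as rows 300 to 400 although only 64 rows remain, because `end` is not clamped to the dataset length. A small sketch of a batching helper that reports exact bounds, where `iter_batches` is a hypothetical name:

```python
def iter_batches(records, batch_size=100):
    """Yield (start, end, chunk) with end equal to the true last index + 1."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        yield start, start + len(chunk), chunk
```

For 364 records and a batch size of 100, the final tuple starts at 300 and ends at 364.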


### Verify data is inserted successfully

In [None]:
# Query the same table that the rows were inserted into
response = supabase.table("Damascene_dialect_words").select("*").limit(5).execute()
for row in response.data:
    print(row)


{'id': 7, 'dialect_region': 'Damascene', 'word_or_phrase': 'يسلمو', 'gloss_ar': 'سلمت يداك (شكراً)', 'pos': 'PUNCT', 'source': None, 'filename': None, 'source_url': None}
{'id': 8, 'dialect_region': 'Damascene', 'word_or_phrase': 'تؤبرني', 'gloss_ar': 'لا أطيق الحياة من بعدك', 'pos': 'X', 'source': None, 'filename': None, 'source_url': None}
{'id': 9, 'dialect_region': 'Damascene', 'word_or_phrase': 'شو', 'gloss_ar': 'ما هي', 'pos': 'PRON', 'source': None, 'filename': None, 'source_url': None}
{'id': 10, 'dialect_region': 'Damascene', 'word_or_phrase': 'مع السلامة', 'gloss_ar': 'مع السلامة', 'pos': 'NOUN', 'source': None, 'filename': None, 'source_url': None}
{'id': 11, 'dialect_region': 'Damascene', 'word_or_phrase': 'شو رأيك', 'gloss_ar': 'ما رأيك؟', 'pos': 'VERB', 'source': None, 'filename': None, 'source_url': None}
