
# Step 2: Merge Image Filenames and Assign Unique IDs


We combine the Arabic and English datasets so that each English row carries its original image filename and a new unique ID.

**Why:**

The Arabic file has the real image filenames (used later for matching and renaming images).

The English file has the cleaned and translated dish information.

Merging them links each translated dish to its actual image and gives every dish a numeric ID for easy reference in preprocessing and modeling. The merge is positional (row i of the Arabic file is paired with row i of the English file), so both files must contain the same rows in the same order.

**steps:**

1. Read both CSVs using pandas.
2. Find the “image file” column in the Arabic dataset.
3. Add that column to the English dataset as `original names`.
4. Create a new column `image_id` that numbers each row sequentially.
5. Save the combined dataset as a new CSV for later steps.
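The steps above can be sketched on toy data. The two small DataFrames and their values are made up for illustration; only the column names `image file`, `original names`, and `image_id` come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the two CSVs (values are illustrative, not real data)
df_ar = pd.DataFrame({"image file": ["images/a.jpg", "images/b.jpg", "images/c.jpg"]})
df_en = pd.DataFrame({"dish_name": ["Dish A", "Dish B", "Dish C"]})

# The copy is positional, so the row counts must agree
assert len(df_ar) == len(df_en), "row count mismatch"

# Copy the filenames by position and number the rows 1..N
df_en["original names"] = df_ar["image file"].to_numpy()
df_en["image_id"] = range(1, len(df_en) + 1)

print(df_en)
```

Nothing but row position ties the two files together, so a reordered or filtered file would silently attach the wrong image to a dish. The length check catches only the most obvious case.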


**Input & Output for Step 2**

Input files:

/content/SaudiFoodFile.csv → (Arabic dataset with original image filenames)

/content/SaudiFoodFile_english_FIXED.csv → (English translated dataset)

Output file:

/content/SaudiFoodFile_english_WITH_original_names_and_ID.csv → (English dataset with added original names and image_id columns)

In [29]:
# === Extract 'image file' -> add as 'original names' -> add image_id ===
import pandas as pd

# 1) Load both files (put both CSVs in /content first)
ar_path = "/content/SaudiFoodFile.csv"
en_path = "/content/SaudiFoodFile_english_FIXED.csv"

df_ar = pd.read_csv(ar_path, encoding="utf-8-sig")
df_en = pd.read_csv(en_path, encoding="utf-8-sig")

# 2) Find the 'image file' column in the Arabic CSV (robust to spaces/underscores/case)
def find_col(cols, target="image file"):
    t = target.lower().replace("_", " ").strip()
    for c in cols:
        if c.lower().replace("_", " ").strip() == t:
            return c
    return None

image_col = find_col(df_ar.columns, "image file")
if image_col is None:
    raise KeyError(
        "Could not find a column named 'image file' in SaudiFoodFile.csv. "
        f"Available columns: {list(df_ar.columns)}"
    )

# 3) Basic sanity check (same length)
if len(df_ar) != len(df_en):
    raise ValueError(f"Row count mismatch! Arabic={len(df_ar)}, English={len(df_en)}")

# 4) Add as 'original names' (positional copy; .to_numpy() avoids any index alignment)
df_en["original names"] = df_ar[image_col].to_numpy()

# 5) Add an ID for each 'original names' row (1..N)
df_en["image_id"] = range(1, len(df_en) + 1)

# 6) Save
out_path = "/content/SaudiFoodFile_english_WITH_original_names_and_ID.csv"
df_en.to_csv(out_path, index=False, encoding="utf-8-sig")

print("✅ Done! Saved to:", out_path)
df_en.head()


✅ Done! Saved to: /content/SaudiFoodFile_english_WITH_original_names_and_ID.csv


Unnamed: 0,dish_name,classifications,image_file,scrape_date,original names,image_id
0,Traditional Hijazi almond coffee,loafs | cinnamon | coconut,images/traditional_hejazi_almond_coffee.jpg,30/09/2025,images/قهوة_اللوز_الحجازية_التقليدية.jpg,1
1,Hejaz Shakshuka for Saudi National Day,egg | cheese | bread,images/Shakshuka_Hejazia_for_Saudi_National_Da...,30/09/2025,images/شكشوكة_حجازية_لليوم_الوطني_السعودي.jpg,2
2,Saudi meat kabsa and daqoos salad,tomatoes | hot green pepper | salt | cumin | r...,images/Saudi_meat_kabsa_and_dakous_salad.jpg,30/09/2025,images/كبسة_اللحمة_السعودية_وسلطة_الدقوس.jpg,3
3,How to make Saudi kleija,dates | haw | cinnamon | ginger | summit | eggs,images/how_to_work_the_college_of_Saudi Arabia...,30/09/2025,images/طريقة_عمل_الكليجة_السعودية.jpg,4
4,Saudi style chicken kabsa,saffron | haw | cinnamon | mixed spices | whit...,images/Kabsa_chicken_style_Saudi_style.jpg,30/09/2025,images/كبسة_الدجاج_على_الطريقة_السعودية.jpg,5


# Step 3: Remove Duplicate Rows by 'original names'



We’re cleaning the dataset by removing any duplicate rows that have the same **original image name**.

**Why:**

Some images might appear more than once (for example, the same dish uploaded multiple times). Duplicates can **distort analysis and model training**, so we keep only the first occurrence of each unique image.

**steps:**

1. Load the previous cleaned CSV file.
2. Detect the **"original names"** column (even if spacing or capitalization differs).
3. Use `drop_duplicates()` to remove repeated image entries.
4. Save the new deduplicated file as **`imagesID_no_duplicates.csv`**.
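A small sketch of what `drop_duplicates(..., keep="first")` does here, using made-up rows. The surviving rows keep the `image_id` assigned in the previous step, so the IDs are no longer contiguous after deduplication.

```python
import pandas as pd

# Illustrative rows: the first and third share the same original name
df = pd.DataFrame({
    "original names": ["images/a.jpg", "images/b.jpg", "images/a.jpg", "images/c.jpg"],
    "image_id": [1, 2, 3, 4],
})

# Keep only the first row for each original name
df_clean = df.drop_duplicates(subset=["original names"], keep="first")

print(df_clean["image_id"].tolist())  # [1, 2, 4]
```

Leaving the gaps is harmless as long as later steps look images up by `image_id` rather than by row position.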


Input & Output for This Step

Input file:
/content/SaudiFoodFile_english_WITH_original_names_and_ID.csv
(the file created in Step 2)

Output file:
/content/imagesID_no_duplicates.csv
(same data, but with duplicate image names removed)


In [31]:
# === Remove Duplicate Rows by 'original names' ===
import pandas as pd

# 1) Load your CSV
df = pd.read_csv("/content/SaudiFoodFile_english_WITH_original_names_and_ID.csv", encoding="utf-8-sig")

# 2) Identify the correct column name (handles variations)
cols = [c.lower().replace("_", " ").strip() for c in df.columns]
if "original names" in cols:
    col_name = df.columns[cols.index("original names")]
else:
    raise KeyError(f"❌ Column 'original names' not found. Columns: {df.columns.tolist()}")

# 3) Drop duplicate rows based on that column (keep first)
df_clean = df.drop_duplicates(subset=[col_name], keep="first")

# 4) Save cleaned file
output_path = "/content/imagesID_no_duplicates.csv"
df_clean.to_csv(output_path, index=False, encoding="utf-8-sig")

print(f"✅ Duplicates removed successfully!")
print(f"Original rows: {len(df)}")
print(f"Cleaned rows:  {len(df_clean)}")
print(f"📁 Saved as: {output_path}")


✅ Duplicates removed successfully!
Original rows: 285
Cleaned rows:  278
📁 Saved as: /content/imagesID_no_duplicates.csv


# Step 4: Rename ZIP Images Using CSV Mapping (Arabic-safe)


Rename the images inside a ZIP to the format `img(ID).ext`, using the CSV’s `original names` → `image_id` mapping.

**Why:**
Ensures consistent, model-friendly filenames and a reliable link between each dish row and its image.


**steps:**

1. (Optional) Upload the CSV and ZIP via Colab widgets (or use fixed paths).
2. Read the CSV and build a Unicode-safe lookup from `original names` → `image_id`.
3. Iterate over the ZIP entries; write each matched file to the output folder as `img(ID).ext`.
4. Zip the results and write a rename report (renamed vs. skipped).
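To illustrate why the lookup normalizes names before matching, the sketch below repeats the notebook's `norm_key` helper and applies it to two spellings of the same Arabic letter. The sample filenames are made up for the demonstration.

```python
import unicodedata
from pathlib import Path

def norm_key(name: str) -> str:
    """Basename only, NFC-normalized, case-folded, stripped."""
    base = Path(str(name)).name
    return unicodedata.normalize("NFC", base).casefold().strip()

# U+0622 (alef with madda) vs. its decomposed form U+0627 + U+0653
composed = "images/\u0622.jpg"
decomposed = "\u0627\u0653.JPG"

print(Path(composed).name == decomposed)            # False: different code points and case
print(norm_key(composed) == norm_key(decomposed))   # True after normalization
```

Both strings render the same on screen, but a plain string comparison fails without the normalization step.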


**Inputs:**

CSV: /content/imagesID_no_duplicates.csv (must contain original names and image_id)

ZIP: /content/images.zip (original images)


**Outputs:**

Folder: /content/renamed_images/ (renamed files as img(ID).ext)

ZIP: /content/renamed_images.zip (packaged renamed images)

Report CSV: /content/rename_report.csv (status for each file)
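One possible cause of skipped files, not confirmed for this dataset: Python's `zipfile` decodes an entry name as CP437 unless the archive sets the UTF-8 flag (bit 11 of `flag_bits`), which can garble Arabic names. The helper `recover_zip_name` below is a hypothetical addition, not part of the notebook. It is a minimal sketch of a common workaround, and the entry is simulated with `SimpleNamespace` instead of a real archive.

```python
from types import SimpleNamespace

UTF8_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8

def recover_zip_name(zinfo) -> str:
    """Return the entry name, undoing a CP437 mis-decoding when possible."""
    if zinfo.flag_bits & UTF8_FLAG:
        return zinfo.filename  # already decoded as UTF-8
    try:
        return zinfo.filename.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return zinfo.filename  # not recoverable this way; leave unchanged

# Simulate an entry whose UTF-8 bytes were decoded as CP437
real_name = "images/قهوة.jpg"
garbled = real_name.encode("utf-8").decode("cp437")
entry = SimpleNamespace(filename=garbled, flag_bits=0)

print(recover_zip_name(entry) == real_name)
```

If the rename report lists many skipped files with unreadable names, passing the recovered name to `norm_key` before the lookup may be worth trying.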

In [None]:
# ===========================
# Rename images from ZIP using CSV (Arabic-safe)
# ===========================

# --- SETTINGS ---
USE_UPLOAD_WIDGETS = True  # True → show upload widgets for CSV + ZIP. False → use paths below.

CSV_PATH  = "/content/imagesID_no_duplicates.csv"  # used when USE_UPLOAD_WIDGETS=False
ZIP_PATH  = "/content/images.zip"                  # used when USE_UPLOAD_WIDGETS=False
OUT_DIR   = "/content/renamed_images"              # output folder
OUT_ZIP   = "/content/renamed_images.zip"          # zipped output
NAME_PREF = "img"                                  # final name format: img(ID).ext

import os, csv, zipfile, shutil, unicodedata, io
from pathlib import Path
import pandas as pd

# ---- (A) Upload files interactively in Colab (optional) ----
if USE_UPLOAD_WIDGETS:
    try:
        from google.colab import files
        print("📤 Please upload your CSV (with 'original names' + 'image_id'):")
        up1 = files.upload()
        CSV_PATH = "/content/" + list(up1.keys())[0]
        print("📤 Please upload your IMAGES ZIP:")
        up2 = files.upload()
        ZIP_PATH = "/content/" + list(up2.keys())[0]
    except Exception as e:
        raise RuntimeError("Colab file upload failed. Set USE_UPLOAD_WIDGETS=False to use paths.") from e

print("CSV_PATH:", CSV_PATH)
print("ZIP_PATH:", ZIP_PATH)

# ---- Helpers ----
def norm_header(s: str) -> str:
    return s.lower().replace("_", " ").strip()

def norm_key(name: str) -> str:
    """
    Normalize a filename for robust matching:
    - keep only basename (drop any 'images/...').
    - NFC Unicode normalization for Arabic.
    - casefold() for case-insensitive match.
    - strip spaces.
    """
    base = Path(str(name)).name
    return unicodedata.normalize("NFC", base).casefold().strip()

# ---- 1) Load mapping CSV (Arabic-safe) ----
df = pd.read_csv(CSV_PATH, encoding="utf-8-sig")
col_map = {norm_header(c): c for c in df.columns}

# find columns
if "original names" not in col_map:
    # try common variants
    for alt in ["original name", "image file", "image", "images", "original images"]:
        if alt in col_map:
            col_map["original names"] = col_map[alt]
            break

# norm_header() turns 'image_id' into 'image id', so only that key can be present
id_col_key = "image id" if "image id" in col_map else None
if "original names" not in col_map or id_col_key is None:
    raise KeyError(f"CSV must have 'original names' and 'image_id' columns. Found: {list(df.columns)}")

orig_col = col_map["original names"]
id_col   = col_map[id_col_key]

# Build lookup: normalized basename → image_id (int)
lookup = {}
dups = df[orig_col].duplicated(keep=False)
if dups.any():
    print("⚠️ CSV contains duplicate values in 'original names'. First occurrence will be used.")

for _, row in df.iterrows():
    key = norm_key(row[orig_col])
    # keep first occurrence
    if key not in lookup:
        lookup[key] = int(row[id_col])

print(f"✅ Mapping loaded: {len(lookup)} names → IDs")

# ---- 2) Prepare output dir ----
shutil.rmtree(OUT_DIR, ignore_errors=True)
Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

# ---- 3) Read ZIP and rename files ----
renamed, skipped = [], []

def write_member_to(path_out, zin, zinfo):
    with zin.open(zinfo) as src, open(path_out, "wb") as dst:
        dst.write(src.read())

with zipfile.ZipFile(ZIP_PATH, "r") as zin:
    for zinfo in zin.infolist():
        if zinfo.is_dir():
            continue

        # Get the basename (ignore any folders inside the zip)
        orig_zip_path = zinfo.filename
        base = Path(orig_zip_path).name
        key  = norm_key(base)

        if key in lookup:
            img_id = lookup[key]
            ext = Path(base).suffix  # keep original extension
            new_name = f"{NAME_PREF}({img_id}){ext}"
            out_path = os.path.join(OUT_DIR, new_name)
            write_member_to(out_path, zin, zinfo)
            renamed.append((orig_zip_path, new_name, img_id))
        else:
            skipped.append((orig_zip_path, "no match for 'original names'"))

# ---- 4) Zip the output folder ----
shutil.make_archive(OUT_ZIP.replace(".zip", ""), "zip", OUT_DIR)

# ---- 5) Report CSV ----
report_path = "/content/rename_report.csv"
with open(report_path, "w", newline="", encoding="utf-8-sig") as f:
    w = csv.writer(f)
    w.writerow(["status", "zip_original_path", "new_name_or_reason", "image_id"])
    for o, n, i in renamed:
        w.writerow(["renamed", o, n, i])
    for o, reason in skipped:
        w.writerow(["skipped", o, reason, ""])

print("🎉 Done!")
print(f"• Renamed: {len(renamed)}")
print(f"• Skipped (no match): {len(skipped)}")
print(f"📂 Output folder: {OUT_DIR}")
print(f"🗜️ Output zip:    {OUT_ZIP}")
print(f"🧾 Report CSV:    {report_path}")


📤 Please upload your CSV (with 'original names' + 'image_id'):


KeyboardInterrupt: 

# Step 5: Update Image Filenames Inside the CSV


Rename the image file entries inside the CSV so they match the renamed image format `img(ID).ext`.


**Why:**
To keep the dataset consistent with the renamed images in the folder/ZIP, so that every row’s filename matches its actual image file.


**steps:**

1. Load the latest CSV containing `image_id`.
2. Automatically find the column that holds the image filenames.
3. Replace each filename with the standardized format `img(ID).ext`, keeping the same extension.
4. Save the updated CSV for use in later steps.
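A short sketch of the renaming rule, assuming the same `Path(...).suffix` approach as the notebook. The `default_ext` parameter is an illustrative addition (the notebook hardcodes `.jpg`), and the second and third paths are made up.

```python
from pathlib import Path

def make_new_name(old_path, image_id, default_ext=".jpg"):
    """Build img(ID).ext, reusing the old extension or a default."""
    ext = Path(str(old_path)).suffix or default_ext
    return f"img({image_id}){ext}"

print(make_new_name("images/Saudi_meat_kabsa_and_dakous_salad.jpg", 3))  # img(3).jpg
print(make_new_name("images/photo.PNG", 7))                              # img(7).PNG
print(make_new_name("images/no_extension", 9))                           # img(9).jpg
```

Step 4 takes the extension from the file inside the ZIP, whereas this step takes it from the CSV, so the two names agree only when both extensions are identical (including case).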



**Input file:**
/content/imagesID_no_duplicates.csv (before renaming inside CSV)

**Output file:**
/content/imagesID_renamed_in_csv.csv (filenames now follow the img(ID).ext format)

In [32]:
# === Rename filenames in CSV to match img(ID).ext format ===
import pandas as pd
from pathlib import Path

# 1) Load your CSV (make sure it contains 'image_id' and the column with the image file paths)
CSV_PATH = "/content/imagesID_no_duplicates.csv"  # or your latest version
df = pd.read_csv(CSV_PATH, encoding="utf-8-sig")

# 2) Identify the image path column (commonly 'image_file' or 'images')
def find_col(cols, target="image"):
    t = target.lower()
    for c in cols:
        # skip the ID column, whose name also contains 'image'
        if t in c.lower() and c.lower() != "image_id":
            return c
    return None

img_col = find_col(df.columns, "image")
if not img_col:
    raise KeyError(f"No column found containing 'image'. Columns: {df.columns.tolist()}")

# 3) Rename each image according to its ID, keeping the same file extension
def make_new_name(old_path, image_id):
    ext = Path(str(old_path)).suffix or ".jpg"  # default to .jpg if missing
    return f"img({image_id}){ext}"

# Pair each path with its ID directly, so the result does not depend on the index labels
df[img_col] = [make_new_name(p, i) for p, i in zip(df[img_col], df["image_id"])]

# 4) Save the updated CSV
OUTPUT_PATH = "/content/imagesID_renamed_in_csv.csv"
df.to_csv(OUTPUT_PATH, index=False, encoding="utf-8-sig")

print("✅ All image filenames updated inside the CSV.")
print(f"📁 Saved as: {OUTPUT_PATH}")
df.head()


✅ All image filenames updated inside the CSV.
📁 Saved as: /content/imagesID_renamed_in_csv.csv


Unnamed: 0,dish_name,classifications,image_file,scrape_date,original names,image_id
0,Traditional Hijazi almond coffee,loafs | cinnamon | coconut,img(1).jpg,30/09/2025,images/قهوة_اللوز_الحجازية_التقليدية.jpg,1
1,Hejaz Shakshuka for Saudi National Day,egg | cheese | bread,img(2).jpg,30/09/2025,images/شكشوكة_حجازية_لليوم_الوطني_السعودي.jpg,2
2,Saudi meat kabsa and daqoos salad,tomatoes | hot green pepper | salt | cumin | r...,img(3).jpg,30/09/2025,images/كبسة_اللحمة_السعودية_وسلطة_الدقوس.jpg,3
3,How to make Saudi kleija,dates | haw | cinnamon | ginger | summit | eggs,img(4).jpg,30/09/2025,images/طريقة_عمل_الكليجة_السعودية.jpg,4
4,Saudi style chicken kabsa,saffron | haw | cinnamon | mixed spices | whit...,img(5).jpg,30/09/2025,images/كبسة_الدجاج_على_الطريقة_السعودية.jpg,5


# Step 6: Remove the “original names” Column


Delete the **`original names`** column from the dataset.

**Why:**
After all images have been renamed and matched using IDs, the original filenames are no longer needed.
Removing this column keeps the dataset **clean and ready for modeling or analysis**.

**steps:**

1. Load the latest CSV file.
2. Find and drop the **`original names`** column (handles different spellings or spacing).
3. Save the final cleaned version.



**Input file:**
`/content/imagesID_renamed_in_csv.csv`

**Output file:**
`/content/remove_original_names.csv` *(final cleaned dataset without the old image name column)*


In [37]:
# === Remove "original names" column ===
import pandas as pd

# 1) Load your CSV file
df = pd.read_csv("/content/imagesID_renamed_in_csv.csv", encoding="utf-8-sig")

# 2) Remove the column safely (handles naming variations)
cols = [c.lower().replace("_", " ").strip() for c in df.columns]
if "original names" in cols:
    col_name = df.columns[cols.index("original names")]
    df = df.drop(columns=[col_name])
else:
    raise KeyError(f"Column 'original names' not found. Columns: {df.columns.tolist()}")

# 3) Save the new file
output_path = "/content/remove_original_names.csv"
df.to_csv(output_path, index=False, encoding="utf-8-sig")

print("✅ 'original names' column removed.")
print(f"📁 Clean file saved as: {output_path}")


✅ 'original names' column removed.
📁 Clean file saved as: /content/remove_original_names.csv


# Step 7: Clean and Standardize Dish Names


This step cleans and unifies all dish names, keeping authentic Arabic and Middle Eastern dish names intact while removing unnecessary English words, event mentions, and quantity descriptions.

**Why:**
To make the dataset consistent and ready for analysis by:

Removing irrelevant phrases like “How to make” or “for Saudi National Day”.

Avoiding over-cleaning that might erase Arabic or culturally important dish names.

Standardizing different spellings (e.g., kbsa, kabsah → Kabsa) into one consistent form.

**steps:**

1. Load the latest dataset (`remove_original_names.csv`).
2. Use `clean_dish_name()` to strip only specific English words and phrases, preserving Arabic names.
3. Use `standardize_dish_name()` to unify variations and detect dish types and proteins (e.g., Kabsa Chicken).
4. Display before/after examples of cleaned names.
5. Save the cleaned and standardized dataset as `SaudiFoodFile_cleaned_dishName.csv`.

**Input file:**
remove_original_names.csv

**Output file:**
SaudiFoodFile_cleaned_dishName.csv (final dataset with standardized and culturally accurate dish names)
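The two-stage idea above can be sketched with a reduced pattern set. `REMOVE`, `VARIANTS`, `clean`, and `standardize` are simplified, illustrative names, not the notebook's functions, and protein detection is left out. Only the regex patterns are taken from this step.

```python
import re

# Illustrative subset of the patterns used in this step
REMOVE = [r'\bhow to make\b', r'\bsaudi style\b', r'\([^)]*\)']
VARIANTS = {
    'Kabsa': r'\bkabsa\b|\bkabsah\b|\bkbsa\b',
    'Kleja': r'\bkleija\b|\bkleja\b|\bklija\b',
}

def clean(name: str) -> str:
    """Strip the listed phrases, then collapse leftover whitespace."""
    for pattern in REMOVE:
        name = re.sub(pattern, '', name, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', name).strip()

def standardize(name: str) -> str:
    """Return the canonical dish label if a known spelling appears."""
    for canonical, pattern in VARIANTS.items():
        if re.search(pattern, name.lower()):
            return canonical
    return name.title()

for raw in ["How to make Saudi kleija", "Chicken Kabsah (curry) with rice", "Plain rice"]:
    print(raw, "->", standardize(clean(raw)))
```

A recognized spelling collapses the whole name to the canonical label, so words such as "Saudi" or "with rice" are dropped. Names with no recognized dish are only title-cased.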

In [50]:
import pandas as pd
import numpy as np
import re

# Load the data
df = pd.read_csv('remove_original_names.csv')

# Display initial data info
print("Initial data shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

# Task 1: Clean dish names - remove descriptions, quantities, and non-ingredient information
def clean_dish_name(name):
    """
    Remove descriptions, quantities, and non-ingredient information from dish names
    while preserving authentic Arabic/Middle Eastern dish names
    """
    # Common patterns to remove (ONLY specific English descriptions)
    patterns_to_remove = [
        # Occasions and events (English)
        r'\bfor saudi national day\b',
        r'\bfor hosting\b',
        r'\bsummer offer\b',

        # Cooking methods and styles (English)
        r'\bhow to make\b',
        r'\bhow to boil\b',
        r'\bsaudi style\b',
        r'\bhijazi style\b',
        r'\btraditional\b',
        r'\bauthentic\b',
        r'\beasy\b',
        r'\bcopycat recipe\b',
        r'\bslow-?roast\b',
        r'\bno bake\b',
        r'\brussian style\b',
        r'\brussian\b',

        # Remove ALL parentheses and their content
        r'\([^)]*\)',

        # Very specific quantity descriptions only
        r'\bwhole grain\b',
        r'\bhalf a piece\b',
        r'\bhalf piece\b',
        r'\bone person\b',

        # General English descriptions (be very specific)
        r'\bmethod for\b',
        r'\baccording to\b',
        r'\bway\b',
        r'\brecipe\b',
    ]

    cleaned_name = name.strip()

    # Remove only very specific patterns
    for pattern in patterns_to_remove:
        cleaned_name = re.sub(pattern, '', cleaned_name, flags=re.IGNORECASE)

    # Clean up extra spaces and punctuation carefully
    cleaned_name = re.sub(r'\s+', ' ', cleaned_name)
    cleaned_name = cleaned_name.strip()

    return cleaned_name

# Task 2: Standardize dish name variations to consistent naming
def standardize_dish_name(name):
    """
    Standardize variations of dish names to consistent naming conventions
    while preserving the main dish identity
    """
    if not name or name.strip() == '':
        return 'Unclassified Dish'

    # First, extract the main dish components
    name_lower = name.lower()

    # Identify the main dish type
    main_dish = None
    protein = None

    # Check for main dishes (preserve Arabic names)
    dish_patterns = {
        'Kabsa': r'\bkabsa\b|\bkabsah\b|\bkbsa\b',
        'Mandi': r'\bmandi\b',
        'Madhbi': r'\bmadhbi\b',
        'Madfoon': r'\bmadfoon\b',
        'Madghut': r'\bmadghut\b|\bmadghog\b|\bmadjoun\b|\bmadjoon\b',
        'Shakshuka': r'\bshakshuka\b|\bshaksuka\b',
        'Jareesh': r'\bjareesh\b|\bjarish\b|\bgroats\b',
        'Maqluba': r'\bmaqluba\b|\bmakloubeh\b|\bmagloba\b',
        'Kleja': r'\bkleija\b|\bkleja\b|\bklija\b',
        'Maamoul': r'\bmaamoul\b|\bmamoul\b',
        'Mutabak': r'\bmutabbaq\b|\bmutabak\b',
        'Sambusa': r'\bsambosa\b|\bsambousek\b|\bsamosa\b',
        'Basbousa': r'\bbasbousa\b|\bbasbosa\b',
        'Kunafa': r'\bkunafa\b|\bknafeh\b',
        'Mulukhiyah': r'\bmulukhiyah\b|\bmolokhia\b|\bmolokhiya\b',
        'Saleek': r'\bsaleek\b|\bsaleeq\b|\bsuliq\b|\bsulait\b',
        'Freekeh': r'\bfreekeh\b|\bfreekey\b',
        'Mujaddara': r'\bmujadara\b|\bmujaddara\b',
        'Luqaimat': r'\bluqaimat\b|\bluqaymat\b',
        'Harees': r'\bharees\b|\bhareeseh\b',
        'Thareed': r'\bthareed\b|\btharid\b',
    }

    for dish, pattern in dish_patterns.items():
        if re.search(pattern, name_lower):
            main_dish = dish
            break

    # Detect the protein regardless of whether a main dish was found,
    # so that e.g. "chicken kabsa" can become "Kabsa Chicken"
    if re.search(r'\bchicken\b', name_lower):
        protein = 'Chicken'
    elif re.search(r'\blamb\b', name_lower):
        protein = 'Lamb'
    elif re.search(r'\bmeat\b|\bbeef\b', name_lower):
        protein = 'Meat'
    elif re.search(r'\bfish\b', name_lower):
        protein = 'Fish'
    elif re.search(r'\bshrimp\b', name_lower):
        protein = 'Shrimp'
    elif re.search(r'\bcamel\b', name_lower):
        protein = 'Camel'

    # Build standardized name
    if main_dish:
        if protein:
            standardized_name = f"{main_dish} {protein}"
        else:
            standardized_name = main_dish
    elif protein:
        # If we only have protein but no specific dish, use the cleaned name
        standardized_name = name.title()
    else:
        # For names without clear dish type, clean but preserve the name
        standardized_name = name.title()

        # Apply gentle standardization for common variations
        variations = {
            r'\bkabsah\b': 'Kabsa',
            r'\bkbsa\b': 'Kabsa',
            r'\bmandi\b': 'Mandi',
            r'\bmutabbaq\b': 'Mutabak',
            r'\bsambosa\b': 'Sambusa',
            r'\bmaamoul\b': 'Maamoul',
            r'\bkunafa\b': 'Kunafa',
            r'\bbasbousa\b': 'Basbousa',
        }

        for pattern, replacement in variations.items():
            standardized_name = re.sub(pattern, replacement, standardized_name, flags=re.IGNORECASE)

    return standardized_name.strip()

# Apply cleaning and standardization
print("\nApplying data cleaning...")

# Create cleaned dish names
df['cleaned_dish_name'] = df['dish_name'].apply(clean_dish_name)
df['standardized_dish_name'] = df['cleaned_dish_name'].apply(standardize_dish_name)

# Show before and after examples - focus on problem cases
print("\nName cleaning examples (focusing on problem cases):")
problem_cases = [
    "A quarter of mandi chicken",
    "Quarter goat",
    "Chicken kabsa with rice",
    "Meat kabsa and daqoos salad",
    "Saudi meat kabsa and daqoos salad",
    "How to make Saudi kleija",
    "Saudi style chicken kabsa",
    "Russian style borscht soup",
    "Chicken Kabsa (curry) with rice",
    "Plain mandi rice"
]

for i, row in df.iterrows():
    if any(case.lower() in row['dish_name'].lower() for case in problem_cases):
        print(f"Original: {row['dish_name']}")
        print(f"Cleaned: {row['cleaned_dish_name']}")
        print(f"Standardized: {row['standardized_dish_name']}")
        print("-" * 50)

# Show most common dish names after standardization
print("\nMost common standardized dish names:")
print(df['standardized_dish_name'].value_counts().head(20))

# Save the cleaned data
df_cleaned = df.copy()
# Keep original name and use standardized as main dish name
df_cleaned['dish_name_original'] = df['dish_name']
df_cleaned['dish_name'] = df['standardized_dish_name']

# Drop temporary columns
df_cleaned = df_cleaned.drop(['cleaned_dish_name', 'standardized_dish_name'], axis=1)

print(f"\nFinal data shape: {df_cleaned.shape}")
print("\nFirst few rows of cleaned data:")
print(df_cleaned[['dish_name_original', 'dish_name']].head(20))

# Save to new CSV file
output_filename = 'SaudiFoodFile_cleaned_dishName.csv'
df_cleaned.to_csv(output_filename, index=False)
print(f"\nCleaned data saved to: {output_filename}")

# Additional analysis: Show name standardization results
print("\n" + "="*80)
print("NAME STANDARDIZATION SUMMARY")
print("="*80)

# Show specific problem cases and their resolution
print("\nSpecific problem cases and their resolution:")
test_cases = [
    "A quarter of mandi chicken",
    "Quarter goat",
    "Chicken kabsa with rice",
    "Saudi meat kabsa and daqoos salad",
    "How to make Saudi kleija",
    "Saudi style chicken kabsa",
    "Plain mandi rice",
    "Half grilled chicken with rice",
    "Russian style borscht soup",
    "Chicken Kabsa (curry) with rice",
    "Mandi Chicken (whole grain)",
    "Grilled chicken (half a piece)"
]

for case in test_cases:
    cleaned = clean_dish_name(case)
    standardized = standardize_dish_name(cleaned)
    print(f"Original: '{case}'")
    print(f"Cleaned: '{cleaned}'")
    print(f"Standardized: '{standardized}'")
    print("-" * 40)

# Show reduction in unique names
original_unique = df['dish_name'].nunique()
cleaned_unique = df_cleaned['dish_name'].nunique()
print(f"\nUnique name reduction: {original_unique} → {cleaned_unique} ({(1-cleaned_unique/original_unique)*100:.1f}% reduction)")

print("\nCleaning complete! Arabic dish names are preserved while English descriptions are removed.")

Initial data shape: (278, 5)

First few rows:
                                dish_name  \
0        Traditional Hijazi almond coffee   
1  Hejaz Shakshuka for Saudi National Day   
2       Saudi meat kabsa and daqoos salad   
3                How to make Saudi kleija   
4               Saudi style chicken kabsa   

                                     classifications  image_file scrape_date  \
0                         loafs | cinnamon | coconut  img(1).jpg  30/09/2025   
1                               egg | cheese | bread  img(2).jpg  30/09/2025   
2  tomatoes | hot green pepper | salt | cumin | r...  img(3).jpg  30/09/2025   
3    dates | haw | cinnamon | ginger | summit | eggs  img(4).jpg  30/09/2025   
4  saffron | haw | cinnamon | mixed spices | whit...  img(5).jpg  30/09/2025   

   image_id  
0         1  
1         2  
2         3  
3         4  
4         5  

Applying data cleaning...

Name cleaning examples (focusing on problem cases):
Original: Saudi meat kabsa and daqoos 