<a href="https://colab.research.google.com/github/NoraHK3/DataSciProject/blob/main/Data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1
**Clean dish names by removing irrelevant extra descriptions**

This code cleans and standardizes dish names in the Saudi food dataset.
1. It removes unnecessary words and descriptions (like "for Saudi National Day",
   "how to make", "traditional", etc.) from the dish names.
2. It then standardizes different spellings or variations of the same dish
   (e.g., "kabsah", "kbsa" → "Kabsa", "shaksoka" → "Shakshuka").
3. Finally, it shows before/after examples, reports the most common dish names,
   and saves the cleaned dataset as 'SaudiFoodFile_cleaned.csv' for later use.
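
Because the cleaning is purely pattern-based, a name made up only of removable words (for example "Creamy") is reduced to an empty string, and an empty string is read back as a missing value once the CSV is reloaded with pandas' default settings. The sketch below shows one possible guard. `strip_descriptors` and `clean_with_fallback` are hypothetical helpers with a shortened pattern list, not the notebook's own functions.

```python
import re

# Shortened, illustrative pattern list (assumption: the real list is much longer)
DESCRIPTORS = [r'\bhow to make\b', r'\btraditional\b', r'\bcreamy\b', r'\bsaudi\b']

def strip_descriptors(name: str) -> str:
    """Remove descriptor words, collapse whitespace, and title-case the result."""
    cleaned = name.lower()
    for pattern in DESCRIPTORS:
        cleaned = re.sub(pattern, ' ', cleaned)
    return re.sub(r'\s+', ' ', cleaned).strip().title()

def clean_with_fallback(name: str) -> str:
    """Return the cleaned name, or the original name if cleaning leaves nothing."""
    cleaned = strip_descriptors(name)
    return cleaned if cleaned else name.strip().title()

print(clean_with_fallback("How to make Saudi kleija"))  # Kleija
print(clean_with_fallback("Creamy"))                    # Creamy
```

Falling back to the original name is only one option; flagging such rows for manual review would work as well.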

In [1]:
import pandas as pd
import numpy as np
import re

# Load the data
df = pd.read_csv('SaudiFoodFile_english_FIXED.csv')

# Display initial data info
print("Initial data shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

# Task 1: Clean dish names - remove extra descriptions
def clean_dish_name(name):
    """
    Remove extra descriptions from dish names like 'for Saudi National Day',
    'how to make', 'Saudi style', etc.
    """
    # Common patterns to remove
    patterns_to_remove = [
        r'for saudi national day',
        r'how to make',
        r'saudi style',
        r'the saudi',  # must come before the bare 'saudi' pattern, or it can never match
        r'saudi',
        r'traditional',
        r'method for',
        r'according to',
        r'\bwith\b.*',  # word boundaries keep words such as "without" intact
        r'\bfor\b.*',
        r'the hijazi way',
        r'hijazi',
        r'recipe',
        r'easy',
        r'authentic',
        r'copycat',
        r'slow-?roast',
        r'no bake',
        r'healthy',
        r'vegetarian',
        r'stuffed',
        r'baked',
        r'grilled',
        r'roasted',
        r'creamy',
        r'spiced',
        r'middle eastern'
    ]

    cleaned_name = name.lower().strip()

    # Remove patterns
    for pattern in patterns_to_remove:
        cleaned_name = re.sub(pattern, '', cleaned_name, flags=re.IGNORECASE)

    # Remove parenthesised notes first, while the parentheses are still present
    cleaned_name = re.sub(r'\([^)]*\)', ' ', cleaned_name)

    # Remove extra spaces and punctuation
    cleaned_name = re.sub(r'[^\w\s]', ' ', cleaned_name)  # Remove punctuation
    cleaned_name = re.sub(r'\s+', ' ', cleaned_name)  # Remove extra spaces
    cleaned_name = cleaned_name.strip()

    # Remove common measurement/portion descriptions
    portion_patterns = [
        r'\bwhole grain\b',
        r'\bhalf a piece\b',
        r'\bhalf piece\b',
        r'\bquarter\b',
        r'\bone person\b',
        r'\bperson\b',
        r'\bplain\b',
        r'\bwith rice\b',
        r'\bwithout rice\b'
    ]

    for pattern in portion_patterns:
        cleaned_name = re.sub(pattern, '', cleaned_name, flags=re.IGNORECASE)

    # Final cleanup
    cleaned_name = re.sub(r'\s+', ' ', cleaned_name).strip()

    # Title case for consistency
    cleaned_name = cleaned_name.title()

    return cleaned_name

# Task 2: Standardize dish name variations
def standardize_dish_name(name):
    """
    Standardize variations of dish names (kabsa/kabsah/kbsa -> kabsa)
    """
    standardization_map = {
        r'\bkabsah?\b': 'Kabsa',
        r'\bkbsa\b': 'Kabsa',
        r'\bkleija\b': 'Kleja',
        r'\bkulaija\b': 'Kleja',
        r'\bklija\b': 'Kleja',
        r'\bshaksoka\b': 'Shakshuka',
        r'\bshakshuka\b': 'Shakshuka',
        r'\bshaksuka\b': 'Shakshuka',
        r'\bbasbousa\b': 'Basbousa',
        r'\bbasbosa\b': 'Basbousa',
        r'\bjareesh\b': 'Jareesh',
        r'\bjarish\b': 'Jareesh',
        r'\bgreesh\b': 'Jareesh',
        r'\bgroats\b': 'Jareesh',
        r'\bmaqshoosh\b': 'Maqshush',
        r'\bmaqshush\b': 'Maqshush',
        r'\bmutabbaq\b': 'Mutabak',
        r'\bmutabak\b': 'Mutabak',
        r'\bsaleeq\b': 'Saleek',
        r'\bsaliq\b': 'Saleek',
        r'\bsaleek\b': 'Saleek',
        r'\bsulait?\b': 'Saleek',
        r'\bmaamoul\b': 'Mamoul',
        r'\bmamoul\b': 'Mamoul',
        r'\bmadhbi\b': 'Madhbi',
        r'\bmadghog\b': 'Madhghut',
        r'\bmadjou?h\b': 'Madhghut',
        r'\bmadfoon\b': 'Madfun',
        r'\bmadfoun\b': 'Madfun',
        r'\bmandi\b': 'Mandi',
        r'\bzurbian\b': 'Zurbian',
        r'\bzerbian\b': 'Zurbian',
        r'\bshrimp\b': 'Shrimp',
        r'\bshurbian\b': 'Shrimp',
        r'\bsambosa\b': 'Sambusa',
        r'\bsambousek\b': 'Sambusa',
        r'\bsamosa\b': 'Sambusa',
        r'\bmagloba\b': 'Maqluba',
        r'\bmaqluba\b': 'Maqluba',
        r'\bmakloubeh\b': 'Maqluba',
        r'\bmoussaka\b': 'Musaqa',
        r'\bmoussaqa\b': 'Musaqa',
        r'\bmusakaa\b': 'Musaqa',
        r'\bmolokhia\b': 'Mulukhiyah',
        r'\bmolokhiya\b': 'Mulukhiyah',
        r'\bmulukhiyah\b': 'Mulukhiyah',
        r'\bmargog\b': 'Marqouq',
        r'\bmarqouk\b': 'Marqouq',
        r'\bmarqooq\b': 'Marqouq',
        r'\bmatazeez\b': 'Mataziz',
        r'\bmogalgal\b': 'Muqalqal',
        r'\bmqalqal\b': 'Muqalqal',
        r'\bhemees\b': 'Hamees',
        r'\bhemen\b': 'Hamees',
        r'\bmohalabiya\b': 'Muhalabiya',
        r'\bmohala\b': 'Muhalabiya',
        r'\bkunafa\b': 'Kunafa',
        r'\bknafeh\b': 'Kunafa',
        r'\bsabeeb\b': 'Sabeeb',
        r'\bsabib\b': 'Sabeeb',
        r'\btaheena\b': 'Tahini',
        r'\btainna\b': 'Tahini',
        r'\btahini\b': 'Tahini',
        r'\bfatteh\b': 'Fatteh',
        r'\bfateh\b': 'Fatteh',
        r'\bfreekeh\b': 'Freekeh',
        r'\bfreekey\b': 'Freekeh',
        r'\bhashweh\b': 'Hashu',
        r'\bhashu\b': 'Hashu',
        r'\bmujadara\b': 'Mujaddara',
        r'\bmujaddara\b': 'Mujaddara',
        r'\bzaatar\b': 'Zaatar',
        r'\bza\'?\s?atar\b': 'Zaatar'  # cleaning turns the apostrophe into a space
    }

    standardized_name = name
    for pattern, replacement in standardization_map.items():
        standardized_name = re.sub(pattern, replacement, standardized_name, flags=re.IGNORECASE)

    return standardized_name

# Apply cleaning and standardization
print("\nApplying data cleaning...")

# Create cleaned dish names
df['cleaned_dish_name'] = df['dish_name'].apply(clean_dish_name)
df['standardized_dish_name'] = df['cleaned_dish_name'].apply(standardize_dish_name)

# Show before and after examples
print("\nName cleaning examples:")
sample_size = min(10, len(df))
for i in range(sample_size):
    print(f"Original: {df['dish_name'].iloc[i]}")
    print(f"Cleaned: {df['cleaned_dish_name'].iloc[i]}")
    print(f"Standardized: {df['standardized_dish_name'].iloc[i]}")
    print("-" * 50)

# Show most common dish names after standardization
print("\nMost common standardized dish names:")
print(df['standardized_dish_name'].value_counts().head(20))

# Check for remaining variations
print("\nChecking for remaining variations (sample):")
unique_names = df['standardized_dish_name'].unique()
for name in sorted(unique_names)[:30]:  # Show first 30
    print(f"  - {name}")

# Save the cleaned data
df_cleaned = df.copy()
# You can choose to replace the original dish_name or keep both
df_cleaned['dish_name_original'] = df['dish_name']
df_cleaned['dish_name'] = df['standardized_dish_name']

# Drop temporary columns
df_cleaned = df_cleaned.drop(['cleaned_dish_name', 'standardized_dish_name'], axis=1)

print(f"\nFinal data shape: {df_cleaned.shape}")
print("\nFirst few rows of cleaned data:")
print(df_cleaned[['dish_name_original', 'dish_name']].head(15))

# Save to new CSV file
output_filename = 'SaudiFoodFile_cleaned.csv'
df_cleaned.to_csv(output_filename, index=False)
print(f"\nCleaned data saved to: {output_filename}")

# Additional analysis: Show name standardization results
print("\n" + "="*80)
print("NAME STANDARDIZATION SUMMARY")
print("="*80)

# Group similar names to show standardization effect
name_groups = {}
for orig, new in zip(df['dish_name'], df_cleaned['dish_name']):
    if new not in name_groups:
        name_groups[new] = []
    if orig not in name_groups[new]:
        name_groups[new] = sorted(name_groups[new] + [orig])

print("\nStandardization groups (showing first 15 groups):")
count = 0
for standardized_name, original_names in name_groups.items():
    if len(original_names) > 1:  # Only show names that had variations
        print(f"\n{standardized_name}:")
        for orig_name in original_names:
            print(f"  - {orig_name}")
        count += 1
        if count >= 15:
            break

Initial data shape: (285, 4)

First few rows:
                                dish_name  \
0        Traditional Hijazi almond coffee   
1  Hejaz Shakshuka for Saudi National Day   
2       Saudi meat kabsa and daqoos salad   
3                How to make Saudi kleija   
4               Saudi style chicken kabsa   

                                     classifications  \
0                         loafs | cinnamon | coconut   
1                               egg | cheese | bread   
2  tomatoes | hot green pepper | salt | cumin | r...   
3    dates | haw | cinnamon | ginger | summit | eggs   
4  saffron | haw | cinnamon | mixed spices | whit...   

                                          image_file scrape_date  
0        images/traditional_hejazi_almond_coffee.jpg    30-09-25  
1  images/Shakshuka_Hejazia_for_Saudi_National_Da...    30-09-25  
2       images/Saudi_meat_kabsa_and_dakous_salad.jpg    30-09-25  
3  images/how_to_work_the_college_of_Saudi Arabia...    30-09-25  
4         i

# Step 2

**Changing image names (to match the dish names)**

Purpose: Standardize image file names in the CSV based on dish names, ensure uniqueness,
and save the result for downstream use.

What it does:
1) Loads 'SaudiFoodFile_cleaned.csv' and inspects dish_name quality (missing/non-string).
2) Builds clean image file names from dish_name:
   - lowercase, remove special chars, replace spaces/dashes with underscores,
   - keep the original file extension (e.g., .jpg, .png),
   - fall back to the original image base name if dish_name is missing.
3) Ensures uniqueness by appending _2, _3, ... for duplicates.
4) Reports examples and a summary (duplicate groups, most common dish names, short names).
5) Writes a new CSV 'SaudiFoodFile_final_cleaned.csv' with:
   - image_file_original (old),
   - image_file (new standardized).

Note: This updates names in the CSV only. It does NOT rename files on disk.
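
Items 2) and 3) of the list above can be sketched without pandas as follows. This is a minimal illustration rather than the notebook's code: `slugify` and `make_unique` are hypothetical helper names, and `make_unique` also skips suffixes that are already taken, which is an assumption that goes beyond the description.

```python
import os
import re

def slugify(dish_name: str, original_image_file: str) -> str:
    """Build a lowercase, underscore-separated file name that keeps the original extension."""
    ext = os.path.splitext(original_image_file)[1]
    name = re.sub(r'[^\w\s-]', '', dish_name.lower())  # drop special characters
    name = re.sub(r'[-\s]+', '_', name).strip('_')     # spaces/dashes -> underscores
    return f"{name}{ext}"

def make_unique(filenames):
    """Append _2, _3, ... to repeated names, skipping any name already in use."""
    used = set()
    counts = {}
    result = []
    for fname in filenames:
        stem, ext = os.path.splitext(fname)
        candidate = fname
        while candidate in used:
            counts[fname] = counts.get(fname, 1) + 1
            candidate = f"{stem}_{counts[fname]}{ext}"
        used.add(candidate)
        result.append(candidate)
    return result

names = [
    slugify("Chicken Kabsa", "images/a.jpg"),
    slugify("Chicken Kabsa", "images/b.jpg"),
    slugify("Chicken-Kabsa", "images/c.jpg"),
]
print(make_unique(names))
# ['chicken_kabsa.jpg', 'chicken_kabsa_2.jpg', 'chicken_kabsa_3.jpg']
```

Checking each candidate against the names already issued avoids a clash when one dish's suffixed name (such as `kabsa_2.jpg`) happens to equal another dish's base name.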

In [2]:
import pandas as pd
import re
import os
import numpy as np

# Load the cleaned data
df = pd.read_csv('SaudiFoodFile_cleaned.csv')

# Display initial data info
print("Initial data shape:", df.shape)
print("\nFirst few rows:")
print(df[['dish_name', 'image_file']].head())

# Check for missing or non-string values in dish_name
print(f"\nData types: {df['dish_name'].dtype}")
print(f"Missing values in dish_name: {df['dish_name'].isna().sum()}")
print(f"Non-string values sample: {df[df['dish_name'].apply(lambda x: not isinstance(x, str))].head()}")

# Function to create clean image filename from dish name
def create_image_filename(dish_name, original_image_file):
    """
    Create a clean base image filename from the dish name.
    Duplicates are resolved afterwards, outside this function.
    """
    # Handle NaN or non-string values
    if not isinstance(dish_name, str) or pd.isna(dish_name):
        # Use original image file name as fallback
        base_name = os.path.splitext(os.path.basename(original_image_file))[0]
        clean_name = base_name.lower()
    else:
        # Clean the dish name for filename
        clean_name = dish_name.lower()

    # Remove special characters and replace spaces with underscores
    clean_name = re.sub(r'[^\w\s-]', '', clean_name)
    clean_name = re.sub(r'[-\s]+', '_', clean_name)

    # Keep the file extension from original
    file_extension = os.path.splitext(original_image_file)[1]

    # Create base filename
    base_filename = f"{clean_name}{file_extension}"

    return base_filename

# Apply image filename creation
print("\nCreating standardized image filenames...")

# Create base image filenames
df['base_image_file'] = df.apply(
    lambda row: create_image_filename(row['dish_name'], row['image_file']),
    axis=1
)

# Handle duplicates by adding incremental IDs
print("\nHandling duplicate image filenames...")

# Count occurrences and add IDs to duplicates
duplicate_count = {}
df['new_image_file'] = ""

for idx, row in df.iterrows():
    base_name = row['base_image_file']

    if base_name in duplicate_count:
        duplicate_count[base_name] += 1
        # Add ID to duplicate (before extension)
        name_without_ext, ext = os.path.splitext(base_name)
        final_name = f"{name_without_ext}_{duplicate_count[base_name]}{ext}"
    else:
        duplicate_count[base_name] = 1
        final_name = base_name

    df.at[idx, 'new_image_file'] = final_name

# Show before and after examples
print("\nImage filename standardization examples:")
sample_size = min(20, len(df))
for i in range(sample_size):
    print(f"Dish: {df['dish_name'].iloc[i]}")
    print(f"Original image: {df['image_file'].iloc[i]}")
    print(f"New image: {df['new_image_file'].iloc[i]}")
    print("-" * 60)

# Show duplicates that were handled
duplicates = {name: count for name, count in duplicate_count.items() if count > 1}
if duplicates:
    print(f"\nFound {len(duplicates)} image names with duplicates:")
    for name, count in list(duplicates.items())[:15]:
        print(f"  - {name}: {count} occurrences")

    # Show specific examples of duplicate resolution
    print("\nExamples of duplicate resolution:")
    for duplicate_name in list(duplicates.keys())[:10]:
        matching_rows = df[df['base_image_file'] == duplicate_name]
        print(f"\n{duplicate_name}:")
        for _, row in matching_rows.iterrows():
            print(f"  - {row['new_image_file']} (from: {row['dish_name']})")
else:
    print("\nNo duplicate image names found!")

# Create the final dataframe
df_final = df.copy()
df_final['image_file_original'] = df['image_file']
df_final['image_file'] = df['new_image_file']

# Drop temporary columns
df_final = df_final.drop(['base_image_file', 'new_image_file'], axis=1)

print(f"\nFinal data shape: {df_final.shape}")

# Save to new CSV
output_filename = 'SaudiFoodFile_final_cleaned.csv'
df_final.to_csv(output_filename, index=False)
print(f"\nFinal cleaned data saved to: {output_filename}")

# Summary statistics
print("\n" + "="*80)
print("IMAGE FILENAME STANDARDIZATION SUMMARY")
print("="*80)
print(f"Total dishes: {len(df_final)}")
print(f"Unique original image names: {df['image_file'].nunique()}")
print(f"Unique new image names: {df_final['image_file'].nunique()}")
print(f"Duplicates handled: {len(duplicates)}")

# Show most common dish names and their image files
print("\nMost common dish names and their new image files:")
common_dishes = df_final['dish_name'].value_counts().head(15)
for dish, count in common_dishes.items():
    matching_images = df_final[df_final['dish_name'] == dish]['image_file'].tolist()
    print(f"\n{dish} (appears {count} times):")
    for img in matching_images:
        print(f"  - {img}")

# Show problematic cases (very short names or empty names)
print("\nChecking for problematic dish names:")
short_names = df_final[df_final['dish_name'].str.len() < 3] if 'dish_name' in df_final.columns else pd.DataFrame()
if len(short_names) > 0:
    print("Very short dish names found:")
    for _, row in short_names.iterrows():
        print(f"  - '{row['dish_name']}' -> {row['image_file']}")

# Show the complete mapping for verification
print("\nComplete filename mapping (first 30 entries):")
print("Dish Name -> Original Image -> New Image")
for i in range(min(30, len(df_final))):
    dish_name = df_final['dish_name'].iloc[i] if isinstance(df_final['dish_name'].iloc[i], str) else "MISSING_NAME"
    print(f"{dish_name} -> {df_final['image_file_original'].iloc[i]} -> {df_final['image_file'].iloc[i]}")

# Additional: Show any rows with missing dish names
missing_dish_names = df_final[df_final['dish_name'].isna()]
if len(missing_dish_names) > 0:
    print(f"\nWARNING: Found {len(missing_dish_names)} rows with missing dish names:")
    for idx, row in missing_dish_names.iterrows():
        print(f"  - Row {idx}: Original image: {row['image_file_original']}, New image: {row['image_file']}")

Initial data shape: (285, 5)

First few rows:
                     dish_name  \
0                Almond Coffee   
1              Hejaz Shakshuka   
2  Meat Kabsa And Daqoos Salad   
3                        Kleja   
4                Chicken Kabsa   

                                          image_file  
0        images/traditional_hejazi_almond_coffee.jpg  
1  images/Shakshuka_Hejazia_for_Saudi_National_Da...  
2       images/Saudi_meat_kabsa_and_dakous_salad.jpg  
3  images/how_to_work_the_college_of_Saudi Arabia...  
4         images/Kabsa_chicken_style_Saudi_style.jpg  

Data types: object
Missing values in dish_name: 1
Non-string values sample:     dish_name classifications         image_file scrape_date  \
131       NaN    unclassified  images/creamy.png    30-09-25   

    dish_name_original  
131             Creamy  

Creating standardized image filenames...

Handling duplicate image filenames...

Image filename standardization examples:
Dish: Almond Coffee
Original image: imag

# Step 3

**Image file renaming (done in a separate Colab notebook named "renaming images file")**

In this step, which was performed in a separate Colab notebook,
we renamed all the image files on disk to match their corresponding
standardized names in the CSV file.

What it does:
1) Reads the CSV (which contains the mapping between old and new image names).
2) Finds each original image file in your folder.
3) Renames it to the corresponding new standardized name.
4) Creates a backup (optional) before renaming, to keep the original files safe.
5) Reports missing or renamed files for verification.

Notes:
- This step actually changes filenames in your images folder, unlike the earlier
  CSV-only step that just updated name references in the file.
- Make sure to set the correct folder path for your images before running.
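
The separate renaming notebook is not reproduced here, so the following is only a sketch of how the five steps above might look using the standard library. The function name `rename_images`, its parameters, and the returned report structure are assumptions; the column names `image_file_original` and `image_file` are those written by Step 2.

```python
import csv
import shutil
from pathlib import Path

def rename_images(csv_path, images_dir, backup_dir=None):
    """Rename files in images_dir using the old -> new mapping stored in the CSV.

    Returns a dict with 'renamed', 'missing', and 'skipped' lists for verification.
    """
    images_dir = Path(images_dir)
    report = {'renamed': [], 'missing': [], 'skipped': []}
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            # Only the base name is used, so a leading "images/" prefix is ignored
            old = images_dir / Path(row['image_file_original']).name
            new = images_dir / Path(row['image_file']).name
            if not old.exists():
                report['missing'].append(old.name)
                continue
            if new.exists() and new != old:
                # Never overwrite an existing file; report it instead
                report['skipped'].append(old.name)
                continue
            if backup_dir is not None:
                # Optional backup copy, made before the file is renamed
                Path(backup_dir).mkdir(parents=True, exist_ok=True)
                shutil.copy2(old, Path(backup_dir) / old.name)
            old.rename(new)
            report['renamed'].append((old.name, new.name))
    return report

# Example call (paths are placeholders; adjust them to your own folders):
# report = rename_images('SaudiFoodFile_final_cleaned.csv', 'images', backup_dir='images_backup')
# print(len(report['renamed']), "renamed,", len(report['missing']), "missing")
```

Refusing to overwrite an existing target keeps a wrong folder path or a stale CSV from silently destroying images.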