# Preprocessing (ground truth & ChatGPT predictions)

This notebook is part of our project on evaluating the performance of ChatGPT-4o in recognizing food ingredients from images. It focuses on **preprocessing** the two datasets used in the experiments:

- Ground truth ingredient labels
- Predicted ingredient labels generated by ChatGPT-4o

### Objectives:
- Clean and standardize ingredient names
- Apply a unified normalization map to both datasets
- Ensure format compatibility for fair comparison
- Export the normalized data for use in evaluation and statistical testing



Import libraries and upload files

In [None]:
import pandas as pd
from google.colab import files
import string

In [None]:
df = pd.read_excel('ground_truth.xlsx')
df.head()


Unnamed: 0,image_ID,Dish Name,Number of Items/Ingrdients,Difficulty_level,ing_1,ing_2,ing_3,ing_4,ing_5,ing_6,...,ing_12,ing_13,ing_14,ing_15,ing_16,ing_17,ing_18,ing_19,ing_20,ing_21
0,1,Prawns and garin poke bowls,8,Easy,prawns,chilli,lime,avocado,radishes,mango,...,,,,,,,,,,
1,2,"Chicken, mango & noodle salad",8,Medium,limes,fish sauce,honey,sesame oil,roast chicken,mangoes,...,,,,,,,,,,
2,3,Beef sandwish with pink pickled onions,8,Medium,red onion,rice vinegar,caster sugar,green beans,demi baguette,avocado sauce,...,,,,,,,,,,
3,4,Shake-it-up chopped salad,17,Medium,yogurt,mayo,olive oil,Dijon mustard,garlic clove,honey,...,cucmber,cherry tomatoes,red onions,pepper,lettuce,croutons,,,,
4,5,Chilled-green-soup-with-feta,12,Easy,feta,olive oil,coriander seeds,chilli,lemon zest,baby spimach,...,seeds,,,,,,,,,


In [None]:
df_gpt = pd.read_excel('GPT_data.xlsx')
df_gpt.head()

Unnamed: 0,Image ID,Dish Name,Number of Ingredients,ing_01,ing_02,ing_03,ing_04,ing_05,ing_06,ing_07,...,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57
0,1,Shrimp Grain Bowl,8,Shrimp,avocado,mango,radish,spring onion,grains,black sesame,...,,,,,,,,,,
1,2,Mango Chicken Vermicelli,6,Chicken,mango,rice noodles,lime,red chili,cilantro,,...,,,,,,,,,,
2,3,Steak Baguette Sandwich,7,Baguette,steak,kale,green beans,pickled onions,mustard sauce,sauce,...,,,,,,,,,,
3,4,Italian Antipasto Salad,8,Lettuce,salami,cheese,tomatoes,cucumber,olives,croutons,...,,,,,,,,,,
4,5,Spinach Feta Soup,5,Spinach,feta cheese,mixed seeds,olive oil,spices,,,...,,,,,,,,,,


In this section, I will:

- convert all ingredient names to lowercase
- strip whitespace and punctuation
- apply the normalization map (the map is kept in an Excel file)
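
The cleaning steps listed above can be sketched on a single label as follows. This is a minimal illustration rather than the notebook's exact code: `clean_label` and `sample_map` are hypothetical names, and the three-entry map only stands in for the full normalization map.

```python
import string

# Hypothetical three-entry map standing in for the full normalization map.
sample_map = {
    'basil leaves': 'basil',
    'olive oil': 'oil',
    'cherry tomatoes': 'tomato',
}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_label(raw, mapping):
    """Lowercase, drop punctuation, trim whitespace, then look up the map."""
    cleaned = str(raw).lower().translate(_PUNCT_TABLE).strip()
    # Labels that are missing from the map are passed through unchanged.
    return mapping.get(cleaned, cleaned)


print(clean_label('  Basil Leaves ', sample_map))  # basil
print(clean_label('Olive oil.', sample_map))       # oil
print(clean_label('Croutons', sample_map))         # croutons
```

Because unknown labels fall through unchanged, any spelling variant that is not a key in the map stays as it is and will not match its counterpart in the other dataset.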

**Normalization map**

In [None]:


normalization_map = {
    'avocado': 'avocado',
    'bacon bits': 'bacon',
    'baguette bread': 'bread',
    'baked chicken leg': 'chicken',
    'baked potato': 'potato',
    'baked salmon': 'salmon',
    'basil': 'basil',
    'basil leaves': 'basil',
    'basmati rice': 'rice',
    'battered fish': 'fish',
    'beef': 'beef',
    'beef burger': 'beef',
    'beef mince': 'beef',
    'beef patty': 'beef',
    'bell pepper': 'bell pepper',
    'black beans': 'beans',
    'black olives': 'olives',
    'boiled egg': 'egg',
    'boiled potato': 'potato',
    'bread': 'bread',
    'bread bun': 'bread',
    'broccoli': 'broccoli',
    'brown rice': 'rice',
    'butter': 'butter',
    'cabbage': 'cabbage',
    'canned tuna': 'tuna',
    'carrot': 'carrots',
    'carrots': 'carrots',
    'cauliflower': 'cauliflower',
    'celery': 'celery',
    'cheddar': 'cheese',
    'cheddar cheese': 'cheese',
    'cheese': 'cheese',
    'cherry tomatoes': 'tomato',
    'chicken': 'chicken',
    'chicken breast': 'chicken',
    'chicken drumsticks': 'chicken',
    'chicken thigh': 'chicken',
    'chili flakes': 'spices',
    'chopped onion': 'onion',
    'coriander': 'coriander',
    'corn': 'corn',
    'cucumber': 'cucumber',
    'cumin': 'spices',
    'egg': 'egg',
    'egg noodles': 'pasta',
    'eggs': 'egg',
    'feta cheese': 'cheese',
    'fish': 'fish',
    'flour tortilla': 'bread',
    'fried egg': 'egg',
    'garlic': 'garlic',
    'ginger': 'ginger',
    'goat cheese': 'cheese',
    'grated cheese': 'cheese',
    'grilled chicken': 'chicken',
    'grilled salmon': 'salmon',
    'ground beef': 'beef',
    'iceberg lettuce': 'lettuce',
    'ketchup': 'tomato sauce',
    'kidney beans': 'beans',
    'lamb': 'meat',
    'lasagna noodles': 'pasta',
    'leafy greens': 'lettuce',
    'lettuce': 'lettuce',
    'macaroni': 'pasta',
    'mashed potato': 'potato',
    'mayonnaise': 'mayo',
    'meat': 'meat',
    'meatballs': 'meat',
    'minced beef': 'beef',
    'mozzarella': 'cheese',
    'mozzarella cheese': 'cheese',
    'mushrooms': 'mushrooms',
    'noodles': 'pasta',
    'oil': 'oil',
    'olive oil': 'oil',
    'olives': 'olives',
    'onion': 'onion',
    'parmesan cheese': 'cheese',
    'parsley': 'parsley',
    'pasta': 'pasta',
    'peas': 'peas',
    'penne': 'pasta',
    'penne pasta': 'pasta',
    'pepper': 'pepper',
    'pickle': 'pickle',
    'pickles': 'pickle',
    'plain yogurt': 'yogurt',
    'poached egg': 'egg',
    'potato': 'potato',
    'potatoes': 'potato',
    'red onion': 'onion',
    'rice': 'rice',
    'rice noodles': 'rice',
    'roast chicken': 'chicken',
    'romaine lettuce': 'lettuce',
    'salmon': 'salmon',
    'salt': 'salt',
    'sauce': 'sauce',
    'scrambled egg': 'egg',
    'shredded cheese': 'cheese',
    'smoked salmon': 'salmon',
    'soy sauce': 'soy sauce',
    'spaghetti': 'pasta',
    'spinach': 'spinach',
    'spring onion': 'onion',
    'steamed chicken': 'chicken',
    'steamed rice': 'rice',
    'sweet potatoes': 'potato',
    'tagliatelle': 'pasta',
    'tomato': 'tomato',
    'tomato sauce': 'sauce',
    'tomatoes': 'tomato',
    'toasted bread': 'bread',
    'tuna': 'tuna',
    'turkey': 'turkey',
    'vegetables': 'vegetables',
    'white bread': 'bread',
    'white rice': 'rice',
    'whole wheat bread': 'bread',
    'whole wheat penne': 'whole wheat penne',
    'yogurt': 'yogurt',
    'yogurt dressing': 'yogurt'
}
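
One thing worth checking in a map like this is whether any value is itself a key that maps somewhere else. In the map above, `'ketchup'` maps to `'tomato sauce'`, while `'tomato sauce'` maps to `'sauce'`. Since the map is applied only once per label, a `ketchup` label and a `tomato sauce` label would end up with different normalized names. The sketch below lists such entries; `find_chained_entries` is a hypothetical helper, shown here on a short excerpt of the map.

```python
def find_chained_entries(mapping):
    """Return {key: (value, value_of_value)} where one lookup is not final."""
    chained = {}
    for key, value in mapping.items():
        final = mapping.get(value, value)
        if final != value:
            chained[key] = (value, final)
    return chained


# Short excerpt copied from the normalization map.
excerpt = {
    'ketchup': 'tomato sauce',
    'tomato sauce': 'sauce',
    'sauce': 'sauce',
    'olive oil': 'oil',
    'oil': 'oil',
}

print(find_chained_entries(excerpt))
# {'ketchup': ('tomato sauce', 'sauce')}
```

Running `find_chained_entries(normalization_map)` on the full map would show every entry of this kind, so that each one can be either kept on purpose or pointed directly at its final label.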


**Normalize ground truth labels**

In [None]:
df.columns

Index(['image_ID', 'Dish Name', 'Number of Items/Ingrdients',
       'Difficulty_level', 'ing_1', 'ing_2', 'ing_3', 'ing_4', 'ing_5',
       'ing_6', 'ing_7', 'ing_8', 'ing_9', 'ing_10', 'ing_11', 'ing_12',
       'ing_13', 'ing_14', 'ing_15', 'ing_16', 'ing_17', 'ing_18', 'ing_19',
       'ing_20', 'ing_21'],
      dtype='object')

In [None]:

# identify ingredient columns
ingredient_cols = [col for col in df.columns if col.startswith('ing_')]



# normalize ingredients per row
def normalize_ingredient_list(row):
    ingredients = []
    for col in ingredient_cols:
        item = row[col]
        if pd.notna(item):
            # remove punctuation before trimming so no stray whitespace is left behind
            clean_item = str(item).lower().translate(str.maketrans('', '', string.punctuation)).strip()
            normalized = normalization_map.get(clean_item, clean_item)
            ingredients.append(normalized)
    return ingredients

# apply normalization
df['normalized_ground_truth'] = df.apply(normalize_ingredient_list, axis=1)
df['ground_truth_set'] = df['normalized_ground_truth'].apply(set)

# keep important columns
final_df = df[['image_ID', 'Dish Name', 'Difficulty_level', 'ground_truth_set']]


final_df.head()


Unnamed: 0,image_ID,Dish Name,Difficulty_level,ground_truth_set
0,1,Prawns and garin poke bowls,Easy,"{chilli, prawns, radishes, spring onions, lime..."
1,2,"Chicken, mango & noodle salad",Medium,"{sesame oil, fish sauce, mangoes, honey, chick..."
2,3,Beef sandwish with pink pickled onions,Medium,"{demi baguette, spicy greens, green beans, cas..."
3,4,Shake-it-up chopped salad,Medium,"{oil, dried origano, lettuce, lemon, mayo, pro..."
4,5,Chilled-green-soup-with-feta,Easy,"{oil, natural yogurt, lemon zest, chilli, baby..."
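
The output above still contains labels such as `radishes`, `mangoes`, and `spring onions`, none of which is a key in the normalization map, so they pass through unchanged. A quick coverage check can list the labels the map does not know about. The following is a rough sketch on made-up rows; `unmapped_labels`, `sample_map`, and `sample_rows` are hypothetical names.

```python
from collections import Counter


def unmapped_labels(label_lists, mapping):
    """Count cleaned labels that are neither keys nor values of the map."""
    known = set(mapping) | set(mapping.values())
    counts = Counter()
    for labels in label_lists:
        for label in labels:
            if label not in known:
                counts[label] += 1
    return counts


# Hypothetical miniature map and already-normalized ingredient lists.
sample_map = {'spring onion': 'onion', 'olive oil': 'oil'}
sample_rows = [
    ['prawns', 'radishes', 'spring onions', 'oil'],
    ['mangoes', 'onion', 'radishes'],
]

# Most frequent unknown labels first.
for label, n in unmapped_labels(sample_rows, sample_map).most_common():
    print(label, n)
```

In the notebook, the same function could presumably be called as `unmapped_labels(df['normalized_ground_truth'], normalization_map)` to decide which entries (for example, plural forms) should be added to the map.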


Save the preprocessed data

In [None]:
# save the ground truth with normalized sets
final_df.to_excel('ground_truth_normalized.xlsx', index=False)

**Normalize ChatGPT predictions**

In [None]:
# standardize column name to match 'image_ID' from ground truth
df_gpt = df_gpt.rename(columns={'Image ID': 'image_ID'})

# get ChatGPT ingredient columns
gpt_ingredient_cols = [col for col in df_gpt.columns if col.startswith('ing_')]

# normalization function for GPT ingredient list
def normalize_gpt_ingredients(row):
    ingredients = []
    for col in gpt_ingredient_cols:
        item = row[col]
        if pd.notna(item):
            # remove punctuation before trimming so no stray whitespace is left behind
            clean_item = str(item).lower().translate(str.maketrans('', '', string.punctuation)).strip()
            normalized = normalization_map.get(clean_item, clean_item)
            ingredients.append(normalized)
    return ingredients

# apply normalization
df_gpt['normalized_prediction'] = df_gpt.apply(normalize_gpt_ingredients, axis=1)
df_gpt['prediction_set'] = df_gpt['normalized_prediction'].apply(set)


# keep the Dish Name column along with image_ID and prediction_set
df_gpt = df_gpt[['image_ID', 'Dish Name', 'prediction_set']]

df_gpt.head()


Unnamed: 0,image_ID,Dish Name,prediction_set
0,1,Shrimp Grain Bowl,"{grains, sesame, shrimp, radish, black sesame,..."
1,2,Mango Chicken Vermicelli,"{rice, chicken, lime, mango, red chili, cilantro}"
2,3,Steak Baguette Sandwich,"{sauce, kale, steak, baguette, green beans, pi..."
3,4,Italian Antipasto Salad,"{lettuce, cucumber, cheese, tomato, olives, cr..."
4,5,Spinach Feta Soup,"{oil, cheese, spinach, spices, mixed seeds}"


In [None]:
# save ChatGPT predictions with normalized sets
df_gpt.to_excel('gpt_predictions_normalized.xlsx', index=False)


# Merge files by image ID

In [None]:
# Merge using only the image_ID as the key
df_final = pd.merge(final_df, df_gpt, on='image_ID')

# rename the Dish Name columns for clarity
df_final = df_final.rename(columns={
    'Dish Name_x': 'Dish Name (ground truth)',
    'Dish Name_y': 'Dish Name (ChatGPT)'
})
df_final.head()



Unnamed: 0,image_ID,Dish Name (ground truth),Difficulty_level,ground_truth_set,Dish Name (ChatGPT),prediction_set
0,1,Prawns and garin poke bowls,Easy,"{chilli, prawns, radishes, spring onions, lime...",Shrimp Grain Bowl,"{grains, sesame, shrimp, radish, black sesame,..."
1,2,"Chicken, mango & noodle salad",Medium,"{sesame oil, fish sauce, mangoes, honey, chick...",Mango Chicken Vermicelli,"{rice, chicken, lime, mango, red chili, cilantro}"
2,3,Beef sandwish with pink pickled onions,Medium,"{demi baguette, spicy greens, green beans, cas...",Steak Baguette Sandwich,"{sauce, kale, steak, baguette, green beans, pi..."
3,4,Shake-it-up chopped salad,Medium,"{oil, dried origano, lettuce, lemon, mayo, pro...",Italian Antipasto Salad,"{lettuce, cucumber, cheese, tomato, olives, cr..."
4,5,Chilled-green-soup-with-feta,Easy,"{oil, natural yogurt, lemon zest, chilli, baby...",Spinach Feta Soup,"{oil, cheese, spinach, spices, mixed seeds}"
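
`pd.merge` performs an inner join by default, so an image that is present in only one of the two tables is dropped without any warning. A possible check, sketched here on two hypothetical miniature tables (`truth` and `preds` are made-up names and data), uses an outer join with `indicator=True` to list unmatched IDs and `validate='one_to_one'` to raise an error if an ID is duplicated.

```python
import pandas as pd

# Hypothetical miniature versions of the two normalized tables.
truth = pd.DataFrame({
    'image_ID': [1, 2, 3],
    'ground_truth_set': [{'a'}, {'b'}, {'c'}],
})
preds = pd.DataFrame({
    'image_ID': [1, 2, 4],
    'prediction_set': [{'a'}, {'x'}, {'y'}],
})

# The outer join keeps every ID; the '_merge' column records whether a row
# came from the left table only, the right table only, or both.
check = pd.merge(truth, preds, on='image_ID', how='outer',
                 indicator=True, validate='one_to_one')

# IDs that would be lost by the default inner join.
unmatched = check.loc[check['_merge'] != 'both', ['image_ID', '_merge']]
print(unmatched)
```

If `unmatched` is empty for the real tables, the inner join used above loses nothing.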


In [None]:
# Save the final merged dataset
df_final.to_excel('merged_dataset.xlsx', index=False)

The normalized files produced here (`ground_truth_normalized.xlsx`, `gpt_predictions_normalized.xlsx`, and `merged_dataset.xlsx`) are used in the notebook that presents the statistical tests.
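
When these files are loaded again, the set columns will most likely come back as plain strings such as `"{'oil', 'cheese'}"`, because a spreadsheet cell cannot hold a Python set. The sketch below shows one way to turn such strings back into sets. It assumes each set was stored as its string representation, which is worth verifying on the actual files; `parse_set_cell` is a hypothetical helper name.

```python
import ast


def parse_set_cell(cell):
    """Turn a string such as "{'oil', 'cheese'}" back into a Python set."""
    if isinstance(cell, (set, frozenset)):
        return set(cell)
    text = str(cell).strip()
    # An empty set is printed as 'set()', which is handled explicitly here.
    if text in ('set()', '{}', ''):
        return set()
    return set(ast.literal_eval(text))


# Assumed cell content: the string form of a set.
stored = str({'oil', 'cheese', 'spinach'})
restored = parse_set_cell(stored)
print(restored == {'oil', 'cheese', 'spinach'})  # True
```

After `pd.read_excel('merged_dataset.xlsx')`, the helper could be applied with `.apply(parse_set_cell)` to the `ground_truth_set` and `prediction_set` columns.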