# 01. Data Preparation for Food Similarity Model

This notebook handles the initial data loading, exploration, preprocessing, and feature preparation for the KNN food similarity model.

## Objectives:

- Load and explore the cleaned nutritional dataset
- Prepare features for similarity analysis
- Perform feature scaling and create evaluation subsets
- Set up food lookup structures for similarity queries

**Note**: This is the first step in building a food similarity recommendation system.

In [3]:
# Import required libraries for data preparation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set random seed for reproducibility
np.random.seed(10)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
print("📊 Data preparation libraries imported successfully!")

📊 Data preparation libraries imported successfully!


In [4]:
# Load the cleaned dataset
df = pd.read_csv('../../dataset/daily_food_nutrition_dataset_cleaned.csv')

print("Dataset Shape:", df.shape)
print("\nDataset Info:")
print(df.info())
print("\nFirst 5 rows:")
print(df.head())
print("\nCategory distribution:")
print(df['category'].value_counts())

Dataset Shape: (10000, 13)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             10000 non-null  object 
 1   food_item      10000 non-null  object 
 2   category       10000 non-null  object 
 3   calories       10000 non-null  int64  
 4   proteins       10000 non-null  float64
 5   carbohydrates  10000 non-null  float64
 6   fats           10000 non-null  float64
 7   fibers         10000 non-null  float64
 8   sugars         10000 non-null  float64
 9   sodium         10000 non-null  int64  
 10  cholesterol    10000 non-null  int64  
 11  meal_type      10000 non-null  object 
 12  water_intake   10000 non-null  int64  
dtypes: float64(5), int64(4), object(4)
memory usage: 1015.8+ KB
None

First 5 rows:
                         id       food_item category  calories  proteins  \
0  6843fa1e7fe66773fab3281d  

In [5]:
# Data exploration and quality checks
print("🔍 Data Quality Assessment")
print("=" * 40)

print("Checking for missing values:")
print(df.isnull().sum())

print(f"\nDataset Summary:")
print(f"- Total records: {len(df):,}")
print(f"- Unique food items: {df['food_item'].nunique():,}")
print(f"- Food categories: {df['category'].nunique()}")
print(f"- Duplicate records: {df.duplicated().sum()}")

# Show food distribution by category
print(f"\nFood distribution by category:")
category_counts = df['category'].value_counts()
for category, count in category_counts.items():
    percentage = (count / len(df)) * 100
    print(f"  {category}: {count:,} ({percentage:.1f}%)")

# Show some example food items per category
print(f"\nExample food items by category:")
for category in df['category'].unique()[:5]:
    foods_in_category = df[df['category'] == category]['food_item'].unique()[:3]
    print(f"  {category}: {', '.join(foods_in_category)}")

🔍 Data Quality Assessment
Checking for missing values:
id               0
food_item        0
category         0
calories         0
proteins         0
carbohydrates    0
fats             0
fibers           0
sugars           0
sodium           0
cholesterol      0
meal_type        0
water_intake     0
dtype: int64

Dataset Summary:
- Total records: 10,000
- Unique food items: 35
- Food categories: 7
- Duplicate records: 0

Food distribution by category:
  Dairy: 1,460 (14.6%)
  Fruits: 1,453 (14.5%)
  Beverages: 1,445 (14.4%)
  Snacks: 1,432 (14.3%)
  Meat: 1,418 (14.2%)
  Vegetables: 1,408 (14.1%)
  Grains: 1,384 (13.8%)

Example food items by category:
  Meat: Eggs, Chicken Breast, Beef Steak
  Fruits: Apple, Banana, Grapes
  Grains: Oats, Quinoa, Pasta
  Vegetables: Carrot, Tomato, Broccoli
  Snacks: Cookies, Nuts, Chocolate


In [6]:
# Define and prepare features for similarity analysis
print("🥗 Feature Preparation for Food Similarity")
print("=" * 50)

# Define nutritional features for KNN food similarity model
feature_columns = ['calories', 'proteins', 'carbohydrates', 'fats', 'fibers', 'sugars', 'sodium', 'cholesterol']
X = df[feature_columns]

# Keep food items and categories for analysis (not for prediction)
food_items = df['food_item']
categories = df['category']

print(f"Features selected: {feature_columns}")
print(f"Features shape: {X.shape}")
print(f"Unique food items: {food_items.nunique()}")
print(f"Food categories: {categories.nunique()}")

# Display feature statistics
print(f"\n📊 Nutritional Feature Statistics:")
print(X.describe().round(2))

# Check for any extreme values or outliers
print(f"\n⚠️  Feature Range Analysis:")
for col in feature_columns:
    q1 = X[col].quantile(0.25)
    q3 = X[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = X[(X[col] < lower_bound) | (X[col] > upper_bound)][col].count()
    print(f"  {col}: Range [{X[col].min():.1f}, {X[col].max():.1f}], Outliers: {outliers}")

🥗 Feature Preparation for Food Similarity
Features selected: ['calories', 'proteins', 'carbohydrates', 'fats', 'fibers', 'sugars', 'sodium', 'cholesterol']
Features shape: (10000, 8)
Unique food items: 35
Food categories: 7

📊 Nutritional Feature Statistics:
       calories  proteins  carbohydrates      fats    fibers    sugars  \
count  10000.00  10000.00       10000.00  10000.00  10000.00  10000.00   
mean     327.69     25.52          52.57     25.44      4.99     25.05   
std      158.19     14.13          27.39     14.15      2.86     14.48   
min       50.00      1.00           5.00      1.00      0.00      0.00   
25%      190.00     13.20          28.80     13.30      2.50     12.50   
50%      328.00     25.50          52.80     25.30      5.00     25.00   
75%      464.00     37.70          76.40     37.60      7.50     37.70   
max      600.00     50.00         100.00     50.00     10.00     50.00   

         sodium  cholesterol  
count  10000.00     10000.00  
mean     497
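
The outlier loop in the cell above applies Tukey's 1.5 × IQR rule to each column. For `calories`, the quartiles shown give IQR = 464 − 190 = 274 and bounds of [−221, 875], which lie outside the observed range [50, 600], so the rule cannot flag any calorie value. The same logic can be wrapped in a small helper. This is a sketch: the function name `iqr_outlier_count` and the sample values are illustrative, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_count(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(mask.sum())

# One made-up extreme value among otherwise tight measurements.
skewed = pd.Series([10, 12, 11, 13, 12, 11, 100])
# Evenly spread values over the calorie range seen in the dataset.
even = pd.Series(range(50, 601))

print(iqr_outlier_count(skewed), iqr_outlier_count(even))
```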

In [7]:
# Feature scaling and data preparation for similarity analysis
print("⚖️  Feature Scaling and Data Preparation")
print("=" * 50)

# Feature scaling - essential for KNN distance calculations
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Original features shape: {X.shape}")
print(f"Scaled features shape: {X_scaled.shape}")
print(f"✅ Features scaled successfully!")

# Verify scaling worked correctly
print(f"\nScaling verification:")
print(f"- Scaled features mean: {np.mean(X_scaled, axis=0).round(4)}")
print(f"- Scaled features std: {np.std(X_scaled, axis=0).round(4)}")

# Create a food lookup dataframe for easy access during similarity queries
food_lookup = pd.DataFrame({
    'index': range(len(df)),
    'food_item': food_items,
    'category': categories
}).reset_index(drop=True)

print(f"\n📚 Food lookup table created with {len(food_lookup)} entries")
print(f"Sample lookup entries:")
print(food_lookup.head(10))

⚖️  Feature Scaling and Data Preparation
Original features shape: (10000, 8)
Scaled features shape: (10000, 8)
✅ Features scaled successfully!

Scaling verification:
- Scaled features mean: [ 0. -0. -0.  0.  0. -0.  0.  0.]
- Scaled features std: [1. 1. 1. 1. 1. 1. 1. 1.]

📚 Food lookup table created with 10000 entries
Sample lookup entries:
   index       food_item    category
0      0            Eggs        Meat
1      1           Apple      Fruits
2      2  Chicken Breast        Meat
3      3          Banana      Fruits
4      4          Banana      Fruits
5      5            Oats      Grains
6      6          Carrot  Vegetables
7      7         Cookies      Snacks
8      8           Apple      Fruits
9      9          Quinoa      Grains
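
To make the scaling step concrete, the sketch below checks on toy data that `StandardScaler` computes `z = (x - mean) / std` per column, using the population standard deviation (`ddof=0`). It also shows how a new query row would be scaled with the already fitted scaler. The array values and the names `X_toy`, `scaler_toy`, and `query` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features (think calories and proteins).
X_toy = np.array([[100.0, 5.0], [300.0, 25.0], [500.0, 45.0]])

scaler_toy = StandardScaler().fit(X_toy)

# Manual standardization with the population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=0)
matches_manual = bool(np.allclose(scaler_toy.transform(X_toy), manual))

# A query row goes through transform(), not fit_transform(), so it is
# scaled with the statistics learned from the fitted data.
query = np.array([[300.0, 45.0]])
query_scaled = scaler_toy.transform(query)

print(matches_manual, query_scaled)
```

Calling `fit_transform` on a single query row would centre it on itself and yield all zeros, which is why the fitted scaler is saved later in this notebook.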


In [8]:
# Create evaluation subset for model testing
print("🎯 Creating Evaluation Subset")
print("=" * 40)

# For evaluation, draw a simple random sample (not stratified by category)
np.random.seed(42)  # re-seed here so the sampled indices are reproducible
eval_size = min(1000, len(X_scaled))
eval_indices = np.random.choice(len(X_scaled), size=eval_size, replace=False)

X_eval = X_scaled[eval_indices]
food_eval = food_lookup.iloc[eval_indices].reset_index(drop=True)

print(f"Evaluation subset size: {len(X_eval)} samples")
print(f"Evaluation percentage: {(len(X_eval) / len(X_scaled)) * 100:.1f}%")

# Check category distribution in evaluation set
eval_category_dist = food_eval['category'].value_counts()
print(f"\nCategory distribution in evaluation set:")
for category, count in eval_category_dist.items():
    percentage = (count / len(food_eval)) * 100
    print(f"  {category}: {count} ({percentage:.1f}%)")

print(f"\n✅ Evaluation subset created successfully!")

🎯 Creating Evaluation Subset
Evaluation subset size: 1000 samples
Evaluation percentage: 10.0%

Category distribution in evaluation set:
  Snacks: 153 (15.3%)
  Beverages: 149 (14.9%)
  Dairy: 145 (14.5%)
  Vegetables: 142 (14.2%)
  Meat: 140 (14.0%)
  Grains: 136 (13.6%)
  Fruits: 135 (13.5%)

✅ Evaluation subset created successfully!
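
The subset above is a simple random sample, so its category shares drift slightly from the full dataset (for example, Snacks at 15.3% versus 14.3% overall). If exact proportions matter, a stratified draw is one alternative. The sketch below is not what the notebook does: it uses pandas' `groupby(...).sample(...)` on a toy lookup table with made-up category counts.

```python
import pandas as pd

# Toy lookup table: 60 Dairy rows and 40 Fruits rows (made-up counts).
toy_lookup = pd.DataFrame({"category": ["Dairy"] * 60 + ["Fruits"] * 40})

# Sample the same fraction from every category.
eval_rows = toy_lookup.groupby("category").sample(frac=0.1, random_state=42)

# The sampled index holds the original row positions, usable for X_scaled.
eval_idx = eval_rows.index.to_numpy()
counts = eval_rows["category"].value_counts().to_dict()

print(counts)
```

The resulting `eval_idx` could replace `eval_indices` in the cell above, at the cost of an evaluation size that depends on per-category rounding.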


In [9]:
# Save prepared data for next notebooks
print("💾 Saving Prepared Data")
print("=" * 30)

# Create models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save the prepared data components
import joblib

# Save scaler for consistent feature scaling
joblib.dump(scaler, '../models/feature_scaler.pkl')
print("✅ Saved feature scaler")

# Save scaled features
joblib.dump(X_scaled, '../models/X_scaled.pkl')
print("✅ Saved scaled features")

# Save food lookup table
joblib.dump(food_lookup, '../models/food_lookup.pkl')
print("✅ Saved food lookup table")

# Save evaluation subset
joblib.dump({'X_eval': X_eval, 'food_eval': food_eval}, '../models/eval_subset.pkl')
print("✅ Saved evaluation subset")

# Save feature information
feature_info = {
    'feature_columns': feature_columns,
    'dataset_shape': df.shape,
    'n_foods': len(food_lookup),  # number of records (10,000), not unique food items (35)
    'n_categories': len(categories.unique()),
    'categories': sorted(categories.unique())
}
joblib.dump(feature_info, '../models/feature_info.pkl')
print("✅ Saved feature information")

print(f"\n🎉 Data preparation complete! Ready for model training.")
print(f"📁 Saved files in '../models/' directory:")
print(f"   • feature_scaler.pkl - StandardScaler for feature normalization")
print(f"   • X_scaled.pkl - Scaled feature matrix")
print(f"   • food_lookup.pkl - Food item lookup table")
print(f"   • eval_subset.pkl - Evaluation subset for testing")
print(f"   • feature_info.pkl - Feature metadata and dataset info")

💾 Saving Prepared Data
✅ Saved feature scaler
✅ Saved scaled features
✅ Saved food lookup table
✅ Saved evaluation subset
✅ Saved feature information

🎉 Data preparation complete! Ready for model training.
📁 Saved files in '../models/' directory:
   • feature_scaler.pkl - StandardScaler for feature normalization
   • X_scaled.pkl - Scaled feature matrix
   • food_lookup.pkl - Food item lookup table
   • eval_subset.pkl - Evaluation subset for testing
   • feature_info.pkl - Feature metadata and dataset info

## Data Preparation Summary

This notebook prepared the foundation for food similarity analysis:

### ✅ Completed Tasks:
1. **Data Loading**: Loaded 10,000 food records with 8 nutritional features
2. **Quality Assessment**: Verified data integrity and explored distributions
3. **Feature Scaling**: Applied StandardScaler for consistent KNN distance calculations
4. **Lookup Creation**: Built food lookup table for similarity queries
5. **Evaluation Setup**: Drew a random 1,000-record evaluation subset (10% of the data) for model validation
6. **Data Persistence**: Saved all prepared components for subsequent notebooks
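
As an illustration of how the scaled features and the lookup table from steps 3 and 4 might serve a similarity query, here is a minimal sketch using scikit-learn's `NearestNeighbors`. The toy foods, their values, and the helper `most_similar` are assumptions for illustration. The actual models are built in the next notebook and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy lookup table and made-up [calories, proteins] values.
toy_lookup = pd.DataFrame({
    "food_item": ["Apple", "Banana", "Chicken Breast", "Beef Steak"],
    "category": ["Fruits", "Fruits", "Meat", "Meat"],
})
X_toy = np.array([[95.0, 0.5], [105.0, 1.3], [165.0, 31.0], [270.0, 26.0]])

# Scale first so both features contribute comparably to the distance.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X_toy_scaled)

def most_similar(row: int) -> str:
    """Return the food closest to the given row, excluding the row itself."""
    # With no exact duplicates, the nearest neighbour of a fitted point
    # is the point itself, so the second hit is the closest other food.
    _, idx = knn.kneighbors(X_toy_scaled[[row]])
    return toy_lookup["food_item"].iloc[idx[0, 1]]

print(most_similar(0), "|", most_similar(2))
```

The row positions returned by `kneighbors` index directly into the lookup table, which is why the table is kept in the same order as the feature matrix.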

### 📊 Dataset Overview:
- **Food Items**: 35 unique food items in 10,000 records across 7 categories
- **Features**: calories, proteins, carbohydrates, fats, fibers, sugars, sodium, cholesterol
- **Scaling**: Mean=0, Std=1 per feature, so no single nutrient dominates KNN distances

### ➡️ Next Steps:
Continue to **02_baseline_models.ipynb** to create initial KNN models and establish baseline performance.
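
A follow-up notebook would typically start by reading the saved artifacts back with `joblib.load`. The sketch below shows the round trip in a temporary directory so that it runs on its own. The dictionary contents are a cut-down, made-up stand-in for `feature_info`.

```python
import os
import tempfile

import joblib

# Round trip in a throwaway directory (stand-in for '../models').
with tempfile.TemporaryDirectory() as models_dir:
    info_path = os.path.join(models_dir, "feature_info.pkl")

    # Cut-down stand-in for the feature_info dictionary saved above.
    joblib.dump({"feature_columns": ["calories", "proteins"]}, info_path)

    # Loading returns the same Python object that was dumped.
    feature_info = joblib.load(info_path)

print(feature_info["feature_columns"])
```

In the real project the paths would be the ones written above, such as `'../models/feature_scaler.pkl'` and `'../models/X_scaled.pkl'`, assuming the next notebook sits at the same directory depth. Because joblib files are pickles, they should only be loaded from trusted sources.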