# Task: Cuisine Classification


## Objective: Develop a machine learning model to classify restaurants based on their cuisines.


Steps:
* Preprocess the dataset by handling missing values and encoding categorical variables.
* Split the data into training and testing sets.
* Select a classification algorithm (e.g., logistic regression, random forest) and train it on the training data.
* Evaluate the model's performance using appropriate classification metrics (e.g., accuracy, precision, recall) on the testing data.
* Analyze the model's performance across different cuisines and identify any challenges or biases.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Dataset .csv")
df.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

class_df = df.dropna(subset = ['Cuisines', 'Restaurant Name']).copy()
class_df['Cuisines List'] = class_df['Cuisines'].apply(lambda x: x.split(', '))

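# Count individual cuisines: explode the list column, not the raw comma-separated string
all_cuisines = class_df['Cuisines List'].explode()
top_20_cuisines = all_cuisines.value_counts().nlargest(20).index

# Note: this filtered column is computed but not used below; the target is built from the full 'Cuisines List'
class_df['Cuisine List Filtered'] = class_df['Cuisines List'].apply(lambda lst: [c for c in lst if c in top_20_cuisines])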

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(class_df['Cuisines List'])
# print(y)
cuisine_columns = mlb.classes_ 
# print(cuisine_columns)
print("Target variable 'y' created successfully.")
print(f"Shape of y: {y.shape}")
print(f"This means we are predicting across {y.shape[1]} unique cuisines.")
print("\nHere are the first 10 cuisine columns our model will learn to predict:")
print(cuisine_columns[:10])

features = ['Restaurant Name', 'Locality', 'Average Cost for two']
X = class_df[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print(f"\nData split into {X_train.shape[0]} training samples and {X_test.shape[0]} testing samples.")

# Verification 
y_train_df = pd.DataFrame(y_train , columns = cuisine_columns)

# Find columns (cuisines) that sum to 0 
empty_cuisines = y_train_df.columns[y_train_df.sum() == 0]

# If any such cuisines are found, remove them from the training and testing sets 
if not empty_cuisines.empty:
    print(f"Found cuisines with no examples in the training set: {list(empty_cuisines)}")
    print("Removing these cuisines from the training and testing sets.")

    # Convert y_test to a DataFrame to drop the same columns for consistency 
    y_test_df = pd.DataFrame(y_test, columns = cuisine_columns)
    
    # Drop the problematic columns from both the sets 
    y_train_df = y_train_df.drop(columns = empty_cuisines)
    y_test_df = y_test_df.drop(columns = empty_cuisines)

    # Convert back to numpy arrays for the model 
    y_train = y_train_df.values
    y_test = y_test_df.values

    print(f"Cleaned y_train shape: {y_train.shape}")
    print(f"Cleaned y_test shape: {y_test.shape}")

else:
    print("All cuisines have examples in the training set. No cleaning needed.")


Target variable 'y' created successfully.
Shape of y: (9542, 145)
This means we are predicting across 145 unique cuisines.

Here are the first 10 cuisine columns our model will learn to predict:
['Afghani' 'African' 'American' 'Andhra' 'Arabian' 'Argentine' 'Armenian'
 'Asian' 'Asian Fusion' 'Assamese']

Data split into 7633 training samples and 1909 testing samples.
Found cuisines with no examples in the training set: ['Börek', 'Peranakan', 'Peruvian']
Removing these cuisines from the training and testing sets.
Cleaned y_train shape: (7633, 142)
Cleaned y_test shape: (1909, 142)
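
The cell above computes `top_20_cuisines` and a filtered list column, but builds `y` from the full `Cuisines List`, which is why 145 label columns appear. A minimal sketch of how the filtered column could be used as the target instead is shown below. It runs on a small made-up DataFrame (`toy_df`, `top_n = 2`, `filtered_df`, `mlb_top`, and `y_top` are hypothetical names, not part of the notebook), and it is an illustration under those assumptions rather than the approach the notebook actually takes.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for class_df (made-up rows, for illustration only).
toy_df = pd.DataFrame({
    "Cuisines": [
        "North Indian, Chinese",
        "Chinese",
        "Peruvian",
        "North Indian, Mughlai",
        "Chinese, Fast Food",
    ]
})
toy_df["Cuisines List"] = toy_df["Cuisines"].str.split(", ")

# Count individual cuisines by exploding the list column.
top_n = 2  # the notebook uses 20
top_cuisines = set(
    toy_df["Cuisines List"].explode().value_counts().nlargest(top_n).index
)

# Keep only the frequent cuisines per restaurant.
toy_df["Cuisine List Filtered"] = toy_df["Cuisines List"].apply(
    lambda lst: [c for c in lst if c in top_cuisines]
)

# Drop restaurants that are left with no label at all.
filtered_df = toy_df[toy_df["Cuisine List Filtered"].str.len() > 0]

# Binarize the filtered labels: one column per frequent cuisine.
mlb_top = MultiLabelBinarizer()
y_top = mlb_top.fit_transform(filtered_df["Cuisine List Filtered"])

print(list(mlb_top.classes_))
print(y_top.shape)
```

With a target built this way, the feature matrix would need to be taken from the same filtered rows so that `X` and `y` stay aligned.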


In [59]:
X_train.head()

Unnamed: 0,Restaurant Name,Locality,Average Cost for two
8177,Vaango!,"Logix City Centre, Sector 32, Noida",450
6401,Domino's Pizza,Punjabi Bagh,700
81,Sainte Marie Gastronomia,Vila Sônia,120
1332,Shama Chicken Corner,DLF Phase 2,300
9041,Subway,"Spice World Mall, Sector 25",500


In [60]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline 
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier


# 1. Define the preprocessing steps for our columns.
# The ColumnTransformer applies a different transformation to each column:
#    - 'name_tfidf': converts 'Restaurant Name' into a TF-IDF matrix, ignoring common English stop words.
#    - 'loc_tfidf': converts 'Locality' into a TF-IDF matrix, also ignoring common English stop words.
#    - 'cost_scaler': standardizes 'Average Cost for two' to zero mean and unit variance.
preprocessor = ColumnTransformer(
    transformers = [
        ('name_tfidf', TfidfVectorizer(stop_words = 'english'), 'Restaurant Name'),
        ('loc_tfidf', TfidfVectorizer(stop_words = 'english'), 'Locality'),
        ('cost_scaler', StandardScaler(), ['Average Cost for two'])
    ],
    remainder = 'passthrough'
)

# 2. Create the full machine learning pipeline.
# The pipeline first runs the preprocessor, then trains the model:
#    - 'preprocessor': applies the transformations defined above.
#    - 'classifier': fits one logistic regression per cuisine (multi-label classification).

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', MultiOutputClassifier(LogisticRegression(solver = 'liblinear', random_state = 42)))
])

# Train the pipeline on the training data 

print('Training the classification model...')
pipeline.fit(X_train, y_train)
print("Model training complete!")

Training the classification model...
Model training complete!


In [61]:
from sklearn.metrics import accuracy_score , classification_report 

# Make predictions on the test set 
print("Making predictions on the test data...")
y_pred  = pipeline.predict(X_test)

# Get the cleaned Cuisine Names 
# We need to get the final list of cuisine names after the cleaning step
# The y_train_df from the previous step has the correct columns
final_cuisine_columns = y_train_df.columns

# Evaluate Performance 
# Calculate the strict accuracy (exact match ratio)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nOverall Accuracy (Exact Match Ratio): {accuracy:.2f}")
print("--------------------------------------------------")

# Print the detailed classification report for each cuisine
print("\nClassification Report (per cuisine):")
# Use the final_cuisine_columns as target_names
report = classification_report(y_test, y_pred, target_names=final_cuisine_columns, zero_division=0)
print(report)






Making predictions on the test data...

Overall Accuracy (Exact Match Ratio): 0.25
--------------------------------------------------

Classification Report (per cuisine):
                   precision    recall  f1-score   support

          Afghani       0.00      0.00      0.00         4
          African       0.00      0.00      0.00         3
         American       0.95      0.27      0.43        73
           Andhra       0.00      0.00      0.00         2
          Arabian       0.00      0.00      0.00         2
        Argentine       0.00      0.00      0.00         1
         Armenian       0.00      0.00      0.00         0
            Asian       0.00      0.00      0.00        50
     Asian Fusion       0.00      0.00      0.00         1
         Assamese       0.00      0.00      0.00         0
       Australian       0.00      0.00      0.00         1
           Awadhi       0.00      0.00      0.00         0
              BBQ       0.00      0.00      0.00         8
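
The "Exact Match Ratio" reported above counts a restaurant as correct only when every one of its cuisine labels is predicted correctly, so it is a strict metric for multi-label problems. The sketch below compares it with two more forgiving metrics, Hamming loss and micro-averaged F1, on a small made-up example. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and are not taken from the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Made-up multi-label targets and predictions (4 restaurants, 3 cuisines).
y_true_toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
])
y_pred_toy = np.array([
    [1, 0, 0],  # misses one of two labels: counts as fully wrong for exact match
    [1, 0, 0],
    [0, 1, 0],  # misses one of two labels
    [1, 0, 0],
])

# Exact match ratio: share of rows where all labels are correct.
exact_match = accuracy_score(y_true_toy, y_pred_toy)

# Hamming loss: share of individual label cells that are wrong.
ham = hamming_loss(y_true_toy, y_pred_toy)

# Micro F1: pools true/false positives and negatives over all labels.
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro", zero_division=0)

print(f"Exact match ratio: {exact_match:.2f}")
print(f"Hamming loss:      {ham:.3f}")
print(f"Micro F1:          {micro_f1:.2f}")
```

In this toy case half of the rows fail the exact-match test even though only 2 of the 12 label cells are wrong, which is one reason a low exact-match score on its own may understate how many individual labels are predicted correctly.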
 

# Final Analysis 

### 1. Performance is Proportional to Data Availability
The model performs best on cuisines such as "North Indian" and "Chinese" because they have the most examples (support) in the dataset. With more examples, the model has more patterns to learn from and its predictions are more reliable. For rare cuisines it has very little information and struggles to find a reliable signal: in the report above, cuisines with single-digit support such as Afghani, African, and BBQ score 0.00 on every metric.
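
One way to check the relationship between support and performance is to put per-cuisine F1 and support side by side and sort by support. The sketch below does this on made-up arrays; `labels`, `y_true_toy`, `y_pred_toy`, and `summary` are hypothetical names. In the notebook, `y_test`, `y_pred`, and `final_cuisine_columns` would presumably take their place.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels and predictions (6 restaurants, 3 cuisines).
labels = ["North Indian", "Chinese", "Afghani"]
y_true_toy = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])
y_pred_toy = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
])

# average=None returns one value per label instead of a single aggregate.
_, _, f1, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, average=None, zero_division=0
)

# Sort by support so frequent and rare cuisines can be compared at a glance.
summary = pd.DataFrame({"f1": f1, "support": support}, index=labels)
summary = summary.sort_values("support", ascending=False)
print(summary)
```

A scatter plot of `summary["support"]` against `summary["f1"]` would show the same relationship visually.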

### 2. Geographical Bias Skews Locality
Almost all of the restaurants are concentrated in the Delhi NCR region. This severely limits the usefulness of Locality as a predictive feature on a global scale. The model might learn that a certain neighborhood in Delhi has many Mughlai restaurants, but that knowledge does not transfer to any other city. The model is not learning "Mughlai food is common in this type of area" but rather "Mughlai food is common in this specific named place."
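
The degree of geographic concentration can be quantified from the `City` column. The sketch below uses a made-up DataFrame, and the list `ncr_cities` is an assumption chosen for illustration, not a definitive list of Delhi NCR cities. On the real data, `df["City"]` would replace `toy_df["City"]`.

```python
import pandas as pd

# Made-up city column standing in for df["City"].
toy_df = pd.DataFrame({
    "City": ["New Delhi"] * 6 + ["Gurgaon"] * 2 + ["Noida", "Makati City"]
})

# Share of restaurants per city.
city_share = toy_df["City"].value_counts(normalize=True)

# Assumed (illustrative) set of cities treated as part of Delhi NCR.
ncr_cities = ["New Delhi", "Gurgaon", "Noida"]

# reindex with fill_value=0 avoids a KeyError if a city is absent from the data.
ncr_share = city_share.reindex(ncr_cities, fill_value=0).sum()

print(city_share)
print(f"Share of restaurants in the assumed NCR cities: {ncr_share:.0%}")
```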

### 3. Cuisine Imbalance is the Core Challenge
This is the central issue of the task. The dataset is heavily skewed towards a few dominant cuisines, and there is not enough data to classify the remaining cuisines accurately.
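
One common mitigation for label imbalance, which the notebook does not try, is to reweight classes inside each per-cuisine classifier. The sketch below uses scikit-learn's `class_weight="balanced"` option of `LogisticRegression` on a small synthetic dataset. `X_toy`, `y_toy`, and `fit_and_predict` are hypothetical names, and the comparison is made on the training data only, so it illustrates the mechanism rather than measuring real performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.multioutput import MultiOutputClassifier

# Synthetic one-feature dataset with a common label and a rare label.
X_toy = np.linspace(0, 1, 200).reshape(-1, 1)
y_toy = np.column_stack([
    (X_toy[:, 0] > 0.5).astype(int),  # common label: about half the rows
    (X_toy[:, 0] > 0.9).astype(int),  # rare label: 20 of 200 rows
])


def fit_and_predict(class_weight):
    """Fit one logistic regression per label and predict on the same data."""
    model = MultiOutputClassifier(
        LogisticRegression(
            solver="liblinear", class_weight=class_weight, random_state=42
        )
    )
    model.fit(X_toy, y_toy)
    return model.predict(X_toy)


pred_plain = fit_and_predict(None)
pred_weighted = fit_and_predict("balanced")

# Compare recall on the rare label (evaluated on training data, illustration only).
for name, pred in [("unweighted", pred_plain), ("balanced", pred_weighted)]:
    rare_recall = recall_score(y_toy[:, 1], pred[:, 1], zero_division=0)
    print(f"{name}: recall on rare label = {rare_recall:.2f}")
```

Reweighting typically trades some precision for recall on rare labels, and it cannot help cuisines that have almost no training examples at all.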

### 4. The Ambiguity of "Secondary" Cuisines
Some cuisines are inherently harder to classify because they are often secondary to a primary offering.
* Example: A restaurant might be primarily Italian but also serve Desserts and Beverages.
* Impact: The model struggles to find a unique signal for these secondary categories. The features that predict Italian are strong and clear, while the features for Desserts are weak and appear alongside many other primary cuisines. This contributes to the lower F1-scores for categories like Desserts, Beverages, and even the very broad Fast Food.
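
A rough way to see which cuisines tend to be secondary is to measure how often each one appears after the first position in a restaurant's cuisine list. The sketch below assumes that the first listed cuisine is the primary one, which the dataset does not guarantee. `cuisine_lists` is made-up data standing in for the notebook's `class_df['Cuisines List']`.

```python
from collections import Counter

# Made-up cuisine lists, one per restaurant.
cuisine_lists = [
    ["Italian", "Desserts"],
    ["Cafe", "Desserts", "Beverages"],
    ["Desserts"],
    ["Italian"],
    ["North Indian", "Beverages"],
]

# How often each cuisine appears anywhere, and how often it appears first.
total = Counter(c for lst in cuisine_lists for c in lst)
first = Counter(lst[0] for lst in cuisine_lists)

# Share of appearances in a non-first position (assumed to mean "secondary").
secondary_share = {c: 1 - first[c] / total[c] for c in total}

for cuisine, share in sorted(secondary_share.items(), key=lambda kv: -kv[1]):
    print(f"{cuisine}: listed as secondary in {share:.0%} of its appearances")
```

Cuisines with a high secondary share would be natural candidates to inspect in the classification report.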
