### Task 3

#### Objective: Develop a machine learning model to classify restaurants based on their cuisines.

#### Steps:

1. Preprocess the dataset by handling missing values and encoding categorical variables.
2. Split the data into training and testing sets.
3. Select a classification algorithm (e.g., logistic regression, random forest) and train it on the training data.
4. Evaluate the model's performance using appropriate classification metrics (e.g., accuracy, precision, recall) on the testing data.
5. Analyze the model's performance across different cuisines and identify any challenges or biases.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("Dataset.csv")

In [3]:
df.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


In [4]:
df.shape

(9551, 21)

In [5]:
df.columns

Index(['Restaurant ID', 'Restaurant Name', 'Country Code', 'City', 'Address',
       'Locality', 'Locality Verbose', 'Longitude', 'Latitude', 'Cuisines',
       'Average Cost for two', 'Currency', 'Has Table booking',
       'Has Online delivery', 'Is delivering now', 'Switch to order menu',
       'Price range', 'Aggregate rating', 'Rating color', 'Rating text',
       'Votes'],
      dtype='object')

In [6]:
df.isnull().sum()

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64

In [7]:
print("Columns with missing values:")
print(df.isnull().sum()[df.isnull().sum() > 0])

Columns with missing values:
Cuisines    9
dtype: int64


In [8]:
# Drop the 9 rows that have no cuisine label
df.dropna(subset=['Cuisines'], inplace=True)

In [9]:
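# Split each comma-separated cuisine string into a list of cuisine names
cuisine_list = df['Cuisines'].str.split(', ')
# Add one 0/1 indicator column per distinct cuisine (multi-label encoding)
for cuisine in cuisine_list.explode().unique():
    df[cuisine] = cuisine_list.apply(lambda x: 1 if cuisine in x else 0)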
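
The loop in cell 9 adds the indicator columns one at a time. A minimal alternative sketch, assuming the same `', '` separator, builds all of them in one call with `Series.str.get_dummies`. The `cuisines` series below is a toy stand-in for illustration, not data read from `Dataset.csv`.

```python
import pandas as pd

# Toy stand-in for df['Cuisines'] (illustrative values only)
cuisines = pd.Series(["French, Japanese, Desserts", "Japanese", "Japanese, Sushi"])

# One 0/1 indicator column per distinct cuisine, built in a single call;
# the resulting columns come back in sorted order
indicators = cuisines.str.get_dummies(sep=", ")
print(indicators)
```

The result could be joined back onto the frame with `pd.concat([df, indicators], axis=1)` if this approach were preferred over the loop.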

In [10]:
# Drop the original 'Cuisines' column
df.drop('Cuisines', axis=1, inplace=True)

In [11]:
df.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Average Cost for two,...,Patisserie,South African,Durban,Kebab,Turkish Pizza,Izgara,World Cuisine,D�_ner,Restaurant Cafe,B�_rek
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,1100,...,0,0,0,0,0,0,0,0,0,0
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,1200,...,0,0,0,0,0,0,0,0,0,0
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,4000,...,0,0,0,0,0,0,0,0,0,0
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,1500,...,0,0,0,0,0,0,0,0,0,0
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,1500,...,0,0,0,0,0,0,0,0,0,0


In [12]:
unique_counts = df.nunique()

# Display the counts
print("Count of unique values for each column:")
print(unique_counts)

Count of unique values for each column:
Restaurant ID      9542
Restaurant Name    7437
Country Code         15
City                140
Address            8910
                   ... 
Izgara                2
World Cuisine         2
D�_ner                2
Restaurant Cafe       2
B�_rek                2
Length: 165, dtype: int64


In [13]:
df.corr(numeric_only=True)

Unnamed: 0,Restaurant ID,Country Code,Longitude,Latitude,Average Cost for two,Price range,Aggregate rating,Votes,French,Japanese,...,Patisserie,South African,Durban,Kebab,Turkish Pizza,Izgara,World Cuisine,D�_ner,Restaurant Cafe,B�_rek
Restaurant ID,1.000000,0.146270,-0.224362,-0.052626,-0.001629,-0.134528,-0.327160,-0.147434,-0.001399,-0.016426,...,0.000280,-0.005170,0.010801,-0.011207,-0.010021,-0.005081,-0.007189,-0.003542,-0.007293,-0.003642
Country Code,0.146270,1.000000,-0.694629,0.018049,0.043717,0.245363,0.281295,0.154361,0.074093,0.114408,...,0.065417,0.075907,0.030981,0.108918,0.097409,0.048689,0.068864,0.034427,0.068864,0.034427
Longitude,-0.224362,-0.694629,1.000000,0.045415,0.045948,-0.080257,-0.114733,-0.084371,-0.033692,-0.089167,...,-0.019686,-0.021968,-0.008941,-0.024723,-0.022116,-0.011730,-0.016565,-0.007805,-0.017521,-0.008772
Latitude,-0.052626,0.018049,0.045415,1.000000,-0.111080,-0.166735,0.000197,-0.022914,-0.071050,-0.011465,...,-0.042029,-0.117880,-0.047996,0.041360,0.037007,0.019231,0.027189,0.013073,0.028275,0.014111
Average Cost for two,-0.001629,0.043717,0.045948,-0.111080,1.000000,0.075111,0.051864,0.067833,0.040698,0.063236,...,-0.001340,-0.001122,-0.000559,-0.002266,-0.002035,-0.001010,-0.001364,-0.000730,-0.001411,-0.000737
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Izgara,-0.005081,0.048689,-0.011730,0.019231,-0.001010,0.019108,0.016086,0.014346,-0.000799,-0.001735,...,-0.000297,-0.000363,-0.000148,0.223279,-0.000419,1.000000,-0.000297,-0.000148,-0.000297,-0.000148
World Cuisine,-0.007189,0.068864,-0.016565,0.027189,-0.001364,0.038334,0.022076,0.015209,-0.001131,-0.002453,...,0.249685,-0.000514,-0.000210,-0.000663,-0.000593,-0.000297,1.000000,-0.000210,-0.000419,-0.000210
D�_ner,-0.003542,0.034427,-0.007805,0.013073,-0.000730,0.002205,0.011711,-0.002017,-0.000565,-0.001226,...,-0.000210,-0.000257,-0.000105,0.316079,0.353424,-0.000148,-0.000210,1.000000,-0.000210,-0.000105
Restaurant Cafe,-0.007293,0.068864,-0.017521,0.028275,-0.001411,0.032680,0.018362,0.030121,-0.001131,-0.002453,...,-0.000419,-0.000514,-0.000210,-0.000663,-0.000593,-0.000297,-0.000419,-0.000210,1.000000,-0.000210


In [14]:
df.duplicated().sum()

0

In [15]:
df.columns

Index(['Restaurant ID', 'Restaurant Name', 'Country Code', 'City', 'Address',
       'Locality', 'Locality Verbose', 'Longitude', 'Latitude',
       'Average Cost for two',
       ...
       'Patisserie', 'South African', 'Durban', 'Kebab', 'Turkish Pizza',
       'Izgara', 'World Cuisine', 'D�_ner', 'Restaurant Cafe', 'B�_rek'],
      dtype='object', length=165)

In [16]:
# Drop identifiers, free-text location fields, and other columns not used as features
df = df.drop(columns=[
    'Restaurant Name', 'Restaurant ID', 'Address', 'Locality',
    'Locality Verbose', 'Longitude', 'Latitude', 'Currency',
    'Switch to order menu',
])

In [17]:
df.head()

Unnamed: 0,Country Code,City,Average Cost for two,Has Table booking,Has Online delivery,Is delivering now,Price range,Aggregate rating,Rating color,Rating text,...,Patisserie,South African,Durban,Kebab,Turkish Pizza,Izgara,World Cuisine,D�_ner,Restaurant Cafe,B�_rek
0,162,Makati City,1100,Yes,No,No,3,4.8,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
1,162,Makati City,1200,Yes,No,No,3,4.5,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
2,162,Mandaluyong City,4000,Yes,No,No,4,4.4,Green,Very Good,...,0,0,0,0,0,0,0,0,0,0
3,162,Mandaluyong City,1500,No,No,No,4,4.9,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
4,162,Mandaluyong City,1500,Yes,No,No,4,4.8,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0


In [18]:
df.duplicated().sum()

922

In [19]:
df.drop_duplicates(inplace=True)

In [20]:
df.describe()

Unnamed: 0,Country Code,Average Cost for two,Price range,Aggregate rating,Votes,French,Japanese,Desserts,Seafood,Asian,...,Patisserie,South African,Durban,Kebab,Turkish Pizza,Izgara,World Cuisine,D�_ner,Restaurant Cafe,B�_rek
count,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,...,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0,8620.0
mean,20.016705,1298.707889,1.87935,2.92993,173.42355,0.003364,0.015661,0.072854,0.020186,0.02703,...,0.000464,0.000696,0.000116,0.00116,0.000928,0.000232,0.000464,0.000116,0.000464,0.000116
std,59.09937,16966.496434,0.916551,1.327748,449.446361,0.057908,0.124168,0.259911,0.140643,0.162181,...,0.021538,0.026375,0.010771,0.034042,0.030452,0.015231,0.021538,0.010771,0.021538,0.010771
min,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,300.0,1.0,2.8,9.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,450.0,2.0,3.3,40.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,700.0,2.0,3.7,149.25,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,216.0,800000.0,4.0,4.9,10934.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [21]:
df

Unnamed: 0,Country Code,City,Average Cost for two,Has Table booking,Has Online delivery,Is delivering now,Price range,Aggregate rating,Rating color,Rating text,...,Patisserie,South African,Durban,Kebab,Turkish Pizza,Izgara,World Cuisine,D�_ner,Restaurant Cafe,B�_rek
0,162,Makati City,1100,Yes,No,No,3,4.8,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
1,162,Makati City,1200,Yes,No,No,3,4.5,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
2,162,Mandaluyong City,4000,Yes,No,No,4,4.4,Green,Very Good,...,0,0,0,0,0,0,0,0,0,0
3,162,Mandaluyong City,1500,No,No,No,4,4.9,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
4,162,Mandaluyong City,1500,Yes,No,No,4,4.8,Dark Green,Excellent,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9546,208,��stanbul,80,No,No,No,3,4.1,Green,Very Good,...,0,0,0,0,0,0,0,0,0,0
9547,208,��stanbul,105,No,No,No,3,4.2,Green,Very Good,...,1,0,0,0,0,0,1,0,0,0
9548,208,��stanbul,170,No,No,No,4,3.7,Yellow,Good,...,0,0,0,0,0,0,1,0,0,0
9549,208,��stanbul,120,No,No,No,4,4.0,Green,Very Good,...,0,0,0,0,0,0,0,0,1,0


In [22]:
from sklearn.preprocessing import LabelEncoder

# Label-encode the remaining categorical columns, keeping each fitted
# encoder so the integer codes can be mapped back to the original values
label_encoders = {}
for column in ['City', 'Has Table booking', 'Has Online delivery', 'Is delivering now', 'Rating color', 'Rating text']:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])
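
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why `Yes` becomes 1 and `No` becomes 0 in the encoded frame. The following sketch uses a toy list rather than the notebook's columns to show the mapping and how a stored encoder can reverse it.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for a Yes/No column (illustrative only)
values = ["Yes", "No", "No", "Yes"]

enc = LabelEncoder()
codes = enc.fit_transform(values)       # classes are sorted: No -> 0, Yes -> 1
restored = enc.inverse_transform(codes) # map the codes back to the labels

print(list(enc.classes_))
print(codes.tolist())
print(list(restored))
```

Note that these codes carry no real ordering. That is usually acceptable for tree-based models, but it may mislead linear models such as logistic regression.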

In [23]:
# The cuisine indicator columns are the targets; everything else is a feature
cuisine_columns = cuisine_list.explode().unique()
X = df.drop(columns=cuisine_columns)
y = df[cuisine_columns]
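
The `describe()` output earlier shows several cuisine columns with a mean of about 0.000116, which is one positive row out of 8,620. A label with a single positive example cannot appear in both the training and the test split, so it cannot be learned and evaluated at the same time. One possible mitigation, sketched here on toy data with a hypothetical `min_count` threshold, is to keep only cuisines with enough positive rows.

```python
import pandas as pd

# Toy multi-label target (illustrative only, not the notebook's y)
y_toy = pd.DataFrame({
    "Japanese": [1, 1, 0, 1, 0, 1],
    "French":   [1, 0, 0, 0, 1, 0],
    "Durban":   [0, 0, 1, 0, 0, 0],
})

min_count = 2  # hypothetical threshold; would need tuning on the real data

# Number of positive rows per cuisine
support = y_toy.sum(axis=0)

# Keep only the cuisines that meet the threshold
kept = support[support >= min_count].index
y_filtered = y_toy[kept]
print(list(y_filtered.columns))
```

Whether dropping rare cuisines is appropriate depends on the goal. It simplifies the problem, but the model then says nothing about the removed labels.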

In [24]:
X

Unnamed: 0,Country Code,City,Average Cost for two,Has Table booking,Has Online delivery,Is delivering now,Price range,Aggregate rating,Rating color,Rating text,Votes
0,162,73,1100,1,0,0,3,4.8,0,1,314
1,162,73,1200,1,0,0,3,4.5,0,1,591
2,162,75,4000,1,0,0,4,4.4,1,5,270
3,162,75,1500,0,0,0,4,4.9,0,1,365
4,162,75,1500,1,0,0,4,4.8,0,1,229
...,...,...,...,...,...,...,...,...,...,...,...
9546,208,139,80,0,0,0,3,4.1,1,5,788
9547,208,139,105,0,0,0,3,4.2,1,5,1034
9548,208,139,170,0,0,0,4,3.7,5,2,661
9549,208,139,120,0,0,0,4,4.0,1,5,901


In [25]:
y

Unnamed: 0,French,Japanese,Desserts,Seafood,Asian,Filipino,Indian,Sushi,Korean,Chinese,...,Patisserie,South African,Durban,Kebab,Turkish Pizza,Izgara,World Cuisine,D�_ner,Restaurant Cafe,B�_rek
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9546,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9547,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
9548,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9549,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [26]:
# Saving X as CSV
X.to_csv('X_data.csv', index=False)
# Saving y as CSV
y.to_csv('y_data.csv', index=False)

In [27]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
X = pd.read_csv("X_data.csv")
y = pd.read_csv("y_data.csv")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model (using MultiOutputClassifier to handle multilabel classification)
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
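
`MultiOutputClassifier` fits one independent forest per cuisine column. `RandomForestClassifier` also accepts a 2-D indicator target directly, in which case a single forest is fitted over all labels, so the wrapper is one option rather than a requirement. The sketch below uses synthetic data from `make_multilabel_classification` as a stand-in for the saved CSV files, and the sizes are arbitrary.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic multi-label data (a stand-in for X_data.csv / y_data.csv)
X_syn, y_syn = make_multilabel_classification(
    n_samples=200, n_features=8, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# A single forest fitted on the 2-D indicator target, without a wrapper
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)

# Predictions have one column per label, like the target
y_hat = forest.predict(X_te)
print(y_hat.shape)
```

The two approaches can give different predictions, so it may be worth comparing both on the real data.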

In [29]:
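# Evaluation metrics
# Note: with a multi-label target, accuracy_score is subset (exact-match)
# accuracy, so a row counts as correct only if every cuisine label matches
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')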

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')

Accuracy: 0.04988399071925754
Precision: 0.35475140559623913
Recall: 0.2066182405165456
F1-score: 0.24751251804509314
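
For a multi-label indicator target, `accuracy_score` reports subset accuracy: a restaurant counts as correct only when every cuisine column is predicted correctly. That is a strict measure with this many labels, which helps explain the accuracy of about 0.05 next to a weighted F1 of about 0.25. The sketch below uses toy arrays rather than the notebook's predictions. It contrasts subset accuracy with Hamming loss and computes per-label F1, which is one way to approach step 5 (performance across cuisines).

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

# Toy ground truth and predictions: 4 samples, 3 labels (illustrative only)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_hat = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])

# Exact-match ratio: only rows 1 and 3 are fully correct
subset_acc = accuracy_score(y_true, y_hat)

# Fraction of individual label cells that are wrong (2 of 12 here)
ham = hamming_loss(y_true, y_hat)

# One F1 value per label, to show which labels the model struggles with
per_label_f1 = f1_score(y_true, y_hat, average=None, zero_division=0)

print(subset_acc, ham, per_label_f1)
```

On the real data, the same `average=None` call on `y_test` and `y_pred` would return one score per cuisine column. Pairing those scores with each cuisine's support could show whether the low scores are concentrated in rare cuisines.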
