 ### Task 3: Cuisine Classification

Objective: Develop a machine learning model to classify restaurants based on their cuisines.

Steps:
1. Preprocess the dataset by handling missing values and encoding categorical variables.
2. Split the data into training and testing sets.
3. Select a classification algorithm (e.g., logistic regression, random forest) and train it on the training data.
4. Evaluate the model's performance using appropriate classification metrics (e.g., accuracy, precision, recall) on the testing data.
5. Analyze the model's performance across different cuisines and identify any challenges or biases.


In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [52]:
# Load the dataset
# Note: decoding with ISO-8859-1 keeps the file's UTF-8 byte-order mark as
# visible characters, so the first column is named 'ï»¿Restaurant ID'.
df = pd.read_csv('restaurant.csv', encoding='ISO-8859-1')
df.head()

      ï»¿Restaurant ID           Restaurant Name  Country Code  \
0              6317637          Le Petit Souffle           162   
1              6304287          Izakaya Kikufuji           162   
2              6300002    Heat - Edsa Shangri-La           162   
3              6318506                      Ooma           162   
4              6314302               Sambo Kojin           162   

                  City                                            Address  \
0          Makati City  Third Floor, Century City Mall, Kalayaan Avenu...   
1          Makati City 
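
The `ï»¿` prefix on the first column name is a UTF-8 byte-order mark (BOM) decoded as ISO-8859-1. A minimal sketch of one way to avoid it, assuming the file really is UTF-8 with a BOM; the two-row CSV written below is a made-up stand-in for `restaurant.csv`, and `sample_path` is a hypothetical name:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for restaurant.csv, written as UTF-8 with a BOM.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", encoding="utf-8-sig") as f:
    f.write("Restaurant ID,Restaurant Name\n1,Ooma\n2,Huqqa\n")

# Decoding as ISO-8859-1, as in the cell above.
latin = pd.read_csv(sample_path, encoding="ISO-8859-1")

# 'utf-8-sig' strips the BOM, so the column name is exactly 'Restaurant ID'.
clean = pd.read_csv(sample_path, encoding="utf-8-sig")

print(list(latin.columns))
print(list(clean.columns))
```

With a clean column name, a lookup such as `'Restaurant ID' in df.columns` behaves as expected. Whether `utf-8-sig` decodes every row of the real file is not verified here.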

### Data Preprocessing and Splitting

In [53]:
# Remove features that will not be used for model training
# Note: 'Restaurant ID' does not match the BOM-prefixed name 'ï»¿Restaurant ID'
# read from the file, so that column is kept (see the outputs below).
columns_to_drop = [
    'Restaurant ID', 'Country Code', 'City', 'Address', 'Locality',
    'Locality Verbose', 'Longitude', 'Latitude', 'Currency',
    'Has Table booking', 'Has Online delivery', 'Is delivering now',
    'Switch to order menu', 'Price range', 'Aggregate rating',
    'Rating color', 'Rating text', 'Votes',
]
df.drop(columns=[col for col in columns_to_drop if col in df.columns], inplace=True)

In [54]:
# Count missing values per column
df.isna().sum()

ï»¿Restaurant ID        0
Restaurant Name         0
Cuisines                9
Average Cost for two    0
dtype: int64

In [55]:
df.dropna(inplace=True)

In [56]:
df.shape

(9542, 4)

In [57]:
df.describe(include="all")
     

        ï»¿Restaurant ID  Restaurant Name      Cuisines  Average Cost for two
count             9542.0             9542          9542                9542.0
unique               NaN             7437          1825                   NaN
top                  NaN  Cafe Coffee Day  North Indian                   NaN
freq                 NaN               83           936                   NaN
mean           9043301.0              NaN           NaN           1200.326137
std            8791967.0              NaN           NaN          16128.743876
min                 53.0              NaN           NaN                   0.0
25%             301931.2              NaN           NaN                 250.0
50%            6002726.0              NaN           NaN                 400.0
75%           18352600.0              NaN           NaN                 700.0


In [58]:
from sklearn.preprocessing import LabelEncoder

# Use a separate encoder per column so each mapping can be inverted later
name_encoder = LabelEncoder()
df['Restaurant Name'] = name_encoder.fit_transform(df['Restaurant Name'])

label_encoder = LabelEncoder()
df['Cuisines'] = label_encoder.fit_transform(df['Cuisines'])
df

      ï»¿Restaurant ID  Restaurant Name  Cuisines  Average Cost for two
0              6317637             3742       920                  1100
1              6304287             3167      1111                  1200
2              6300002             2892      1671                  4000
3              6318506             4700      1126                  1500
4              6314302             5515      1122                  1500
...                ...              ...       ...                   ...
9546           5915730             4436      1813                    80
9547           5908749             1310      1824                   105
9548           5915807             3063      1110                   170
9549           5916112              512      1657                   120
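
With 1,825 distinct `Cuisines` values for 9,542 rows, many classes are likely to have only a handful of examples. One option, sketched below, is to keep only the first listed cuisine before encoding. This assumes that multi-cuisine entries are comma-separated strings, which the outputs shown here do not confirm. The sample values and the `Primary Cuisine` column name are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; the real column is assumed to hold comma-separated cuisines.
sample = pd.DataFrame({
    "Cuisines": ["North Indian, Chinese", "Cafe", "North Indian", "Cafe, Desserts"]
})

# Keep only the first listed cuisine to reduce the number of distinct classes.
sample["Primary Cuisine"] = sample["Cuisines"].str.split(",").str[0].str.strip()

# A dedicated encoder for the target lets predictions be mapped back to names.
cuisine_encoder = LabelEncoder()
y_sample = cuisine_encoder.fit_transform(sample["Primary Cuisine"])

print(sample["Cuisines"].nunique(), "->", sample["Primary Cuisine"].nunique())
print(cuisine_encoder.inverse_transform(y_sample))
```

Collapsing labels this way changes the task from predicting the exact cuisine combination to predicting a primary cuisine, so scores would not be directly comparable with the ones reported in this notebook.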


In [59]:
x = df.drop('Cuisines',axis=1)
y = df['Cuisines']

In [60]:

from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
x = scaler.fit_transform(x)

In [61]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=15)
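
In the cells above, the scaler is fitted on all rows before the split, so statistics from the test rows influence the training features. A minimal sketch of the usual alternative, fitting the scaler on the training split only; the synthetic `X_demo` and `y_demo` arrays are placeholders, not the restaurant data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(15)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # placeholder features
y_demo = rng.integers(0, 4, size=200)                     # placeholder labels

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15
)

# Fit the scaler on the training split only, then reuse it for the test split.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```

Tree-based models such as random forests are largely insensitive to feature scaling, so this mainly matters for the logistic regression model.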

### Random Forest Model


In [62]:
from sklearn.ensemble import RandomForestClassifier

In [63]:
model_rfc = RandomForestClassifier(n_estimators=100, random_state=42)

model_rfc.fit(x_train, y_train)

rfc_pred = model_rfc.predict(x_test)

In [64]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [65]:
accuracy = accuracy_score(y_test, rfc_pred)
print(f"Accuracy: {accuracy:.2f}")

# Precision, recall, F1-score (micro-averaged over all classes)
precision = precision_score(y_test, rfc_pred, average='micro')
recall = recall_score(y_test, rfc_pred, average='micro')
f1 = f1_score(y_test, rfc_pred, average='micro')
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

Accuracy: 0.19
Precision: 0.19
Recall: 0.19
F1-score: 0.19
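
The four numbers are identical because, for single-label multiclass predictions, micro-averaged precision, recall, and F1 all reduce to accuracy. A small sketch with made-up labels shows how macro averaging, which weights every class equally, can tell a different story:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)
micro_p = precision_score(y_true, y_hat, average="micro")
micro_r = recall_score(y_true, y_hat, average="micro")
micro_f1 = f1_score(y_true, y_hat, average="micro")

# zero_division=0 scores a class that is never predicted as 0 instead of warning.
macro_p = precision_score(y_true, y_hat, average="macro", zero_division=0)
macro_r = recall_score(y_true, y_hat, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)

print(f"accuracy={acc:.2f} micro P/R/F1={micro_p:.2f}/{micro_r:.2f}/{micro_f1:.2f}")
print(f"macro P/R/F1={macro_p:.2f}/{macro_r:.2f}/{macro_f1:.2f}")
```

Reporting macro or weighted averages alongside accuracy would give a clearer view of how the rare cuisines are handled.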


In [66]:
#cfm = confusion_matrix(y_test, rfc_pred)
#print(cfm)
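
With 1,825 classes, a confusion matrix printed as text is hard to read. As a rough sketch of a per-cuisine breakdown, `classification_report` with `output_dict=True` can be turned into a table and sorted by support. The labels below are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Invented true/predicted cuisine labels for illustration only.
true_labels = ["North Indian"] * 5 + ["Chinese"] * 3 + ["Cafe"] * 2
pred_labels = (
    ["North Indian"] * 5
    + ["North Indian", "Chinese", "Chinese"]
    + ["North Indian", "Cafe"]
)

report = classification_report(
    true_labels, pred_labels, output_dict=True, zero_division=0
)

# Keep only the per-class entries (skip 'accuracy' and the averaged rows).
class_names = set(true_labels)
rows = {name: stats for name, stats in report.items() if name in class_names}
per_class = pd.DataFrame.from_dict(rows, orient="index").sort_values(
    "support", ascending=False
)

print(per_class[["precision", "recall", "support"]])
```

Sorting by support makes it easy to see whether the model only does well on the most common cuisines.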
     

### Logistic Regression Model


In [67]:
from sklearn.linear_model import LogisticRegression

# Note: multi_class is deprecated in recent scikit-learn releases (1.5+)
classifier_logreg = LogisticRegression(multi_class="multinomial")
classifier_logreg.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
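
The warning above means the solver stopped before converging, so the fitted coefficients may be far from optimal. A minimal sketch of the fix the message itself suggests, raising `max_iter`, shown on synthetic data from `make_classification` rather than on the restaurant data. The value 1000 is an arbitrary choice, and no claim is made about how this would change the scores reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real features and labels.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Scaling inside the pipeline plus a larger iteration budget for the solver.
logreg_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_pipe.fit(X_syn, y_syn)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = logreg_pipe.named_steps["logisticregression"].n_iter_
print(iterations_used)
```

If `n_iter_` stays below `max_iter`, the solver converged within its budget. On a target with well over a thousand classes, convergence may need considerably more iterations or a different solver.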


In [68]:
logreg_pred = classifier_logreg.predict(x_test)
print(logreg_pred)

[1306 1306 1306 ... 1306 1306 1306]


In [69]:
accuracy = accuracy_score(y_test, logreg_pred)
print(f"Accuracy: {accuracy:.2f}")

# Precision, recall, F1-score (micro-averaged over all classes)
precision = precision_score(y_test, logreg_pred, average='micro')
recall = recall_score(y_test, logreg_pred, average='micro')
f1 = f1_score(y_test, logreg_pred, average='micro')
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
     

Accuracy: 0.10
Precision: 0.10
Recall: 0.10
F1-score: 0.10
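
An accuracy of 0.10 is close to the share of the most frequent label in the `describe` output (`North Indian`, 936 of 9,542 rows), which is roughly what a model that always predicts one class would score. A hedged sketch of how such a majority-class baseline can be measured with `DummyClassifier`, using made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Share of the most frequent label in the full dataset, from df.describe().
majority_share = 936 / 9542

# Made-up labels: class 7 is the most frequent (6 of 10 samples).
y_toy = np.array([7, 7, 7, 7, 7, 7, 3, 3, 5, 9])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print(f"Majority share in the dataset: {majority_share:.3f}")
print(f"Toy baseline accuracy: {baseline_acc:.2f}")
```

Comparing each model against this kind of baseline on the same test split shows how much it has learned beyond the class frequencies.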


### Conclusion:
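On this dataset, Random Forest (accuracy 0.19) performs better than logistic regression (accuracy 0.10), although both scores are low.
The logistic regression solver reached its iteration limit without converging, and its printed predictions show a single class (1306).
Despite repeated attempts at preprocessing and model selection, I could not raise performance beyond these scores.
This might be due to limitations in the dataset or the training setup: the target has 1,825 distinct cuisine labels for 9,542 rows, and only three features remain after dropping columns (restaurant ID, label-encoded restaurant name, and average cost for two), which probably carry little information about cuisine.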