<a href="https://colab.research.google.com/github/Dishasoni1009/Cognifyz_Technology/blob/main/Predict_Restaurant_Ratings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PREDICT RESTAURANT RATINGS**

Objective: Predict the aggregate rating of a restaurant based on other features using machine learning regression models.


# **DATASET DESCRIPTION**

Here is a brief overview of each column in the restaurant dataset:

1. Restaurant ID: Unique identifier for each restaurant.
2. Restaurant Name: Name of the restaurant.
3. Country Code: Numeric code for the country where the restaurant is located.
4. City: City where the restaurant operates.
5. Address: Full physical address.
6. Locality: Area or neighborhood of the restaurant.
7. Locality Verbose: A more detailed locality description.
8. Longitude / Latitude: Geographical location of the restaurant.
9. Cuisines: Types of cuisines served (can be multiple).
10. Average Cost for two: Cost estimate for two people.
11. Currency: Currency used in the restaurant's country.
12. Has Table booking: Whether the restaurant allows table booking (Yes/No).
13. Has Online delivery: Whether the restaurant provides online delivery (Yes/No).
14. Is delivering now: Whether the restaurant is delivering at the current moment.
15. Switch to order menu: Likely a UX feature on the platform (can be dropped).
16. Price range: Price bracket (e.g., 1 to 4).
17. Aggregate rating: The main target variable: the average rating of the restaurant.
18. Rating color: Visual indicator of the rating (e.g., Dark Green).
19. Rating text: Text label for the rating (e.g., Excellent, Good).
20. Votes: Number of customer votes the restaurant has received.

# **STEPS:**

1. Preprocess the dataset by handling missing values, encoding categorical variables, and splitting the data into training and testing sets.

2. Select a regression algorithm (e.g., linear regression, decision tree regression) and train it on the training data.

3. Evaluate the model's performance using appropriate regression metrics (e.g., mean squared error, R-squared) on the testing data.

4. Interpret the model's results and analyze the most influential features affecting restaurant ratings.
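
The four steps above can be sketched end to end on a small synthetic table. This is only an illustration under assumptions: the column names mirror the dataset, but the values are randomly generated, the link between votes and rating is invented, and `DecisionTreeRegressor` with `max_depth=4` and an 80/20 split are placeholder choices rather than settings taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the restaurant data (all values are made up).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Votes": rng.integers(0, 1000, size=n),
    "Price range": rng.integers(1, 5, size=n),
    "Has Online delivery": rng.choice(["Yes", "No"], size=n),
})
# Assumed relationship, for illustration only: rating grows with votes.
demo["Aggregate rating"] = (
    2.0 + 2.5 * demo["Votes"] / 1000 + rng.normal(0, 0.2, size=n)
).clip(0, 5)

# Step 1: handle missing values, encode the categorical column, split.
demo = demo.dropna()  # no-op here, the synthetic table has no nulls
demo["Has Online delivery"] = demo["Has Online delivery"].map({"Yes": 1, "No": 0})
X = demo.drop(columns="Aggregate rating")
y = demo["Aggregate rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: train a regression model on the training split.
model = DecisionTreeRegressor(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out test split.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}  R-squared: {r2:.3f}")

# Step 4: rank features by the tree's impurity-based importance.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances are only one way to read a tree model; they describe this fitted tree, not a causal effect on ratings.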

In [5]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
# Classifier imports kept from the original cell
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Regressors matching the stated objective (predicting a continuous rating)
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [6]:
# DRIVE MOUNT
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
# READ DATA
df=pd.read_csv('/content/drive/MyDrive/Dataset .csv')
print(df)

      Restaurant ID           Restaurant Name  Country Code              City  \
0           6317637          Le Petit Souffle           162       Makati City   
1           6304287          Izakaya Kikufuji           162       Makati City   
2           6300002    Heat - Edsa Shangri-La           162  Mandaluyong City   
3           6318506                      Ooma           162  Mandaluyong City   
4           6314302               Sambo Kojin           162  Mandaluyong City   
...             ...                       ...           ...               ...   
9546        5915730               Naml۱ Gurme           208         ��stanbul   
9547        5908749              Ceviz A��ac۱           208         ��stanbul   
9548        5915807                     Huqqa           208         ��stanbul   
9549        5916112               A���k Kahve           208         ��stanbul   
9550        5927402  Walter's Coffee Roastery           208         ��stanbul   

                           

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

In [10]:
df.describe()

| | Restaurant ID | Country Code | Longitude | Latitude | Average Cost for two | Price range | Aggregate rating | Votes |
|---|---|---|---|---|---|---|---|---|
| count | 9551.0 | 9551.0 | 9551.0 | 9551.0 | 9551.0 | 9551.0 | 9551.0 | 9551.0 |
| mean | 9051128.0 | 18.365616 | 64.126574 | 25.854381 | 1199.210763 | 1.804837 | 2.66637 | 156.909748 |
| std | 8791521.0 | 56.750546 | 41.467058 | 11.007935 | 16121.183073 | 0.905609 | 1.516378 | 430.169145 |
| min | 53.0 | 1.0 | -157.948486 | -41.330428 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 301962.5 | 1.0 | 77.081343 | 28.478713 | 250.0 | 1.0 | 2.5 | 5.0 |
| 50% | 6004089.0 | 1.0 | 77.191964 | 28.570469 | 400.0 | 2.0 | 3.2 | 31.0 |
| 75% | 18352290.0 | 1.0 | 77.282006 | 28.642758 | 700.0 | 2.0 | 3.7 | 131.0 |
| max | 18500650.0 | 216.0 | 174.832089 | 55.97698 | 800000.0 | 4.0 | 4.9 | 10934.0 |


# DATA ANALYSIS

In [12]:
df.isnull().sum()

| Column | Null count |
|---|---|
| Restaurant ID | 0 |
| Restaurant Name | 0 |
| Country Code | 0 |
| City | 0 |
| Address | 0 |
| Locality | 0 |
| Locality Verbose | 0 |
| Longitude | 0 |
| Latitude | 0 |
| Cuisines | 9 |


In [13]:
# Handle null values: only 'Cuisines' has nulls (9 rows),
# which is a small enough number to drop.
df.dropna(inplace=True)
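
Called without arguments, `dropna()` removes a row if any column is missing. Since only `Cuisines` has nulls here, restricting the check with `subset` gives the same result while guarding against dropping extra rows if another column ever contains nulls. A small sketch on a toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cuisines": ["Japanese", None, "Cafe", "Italian"],
    "Votes": [314, 12, np.nan, 88],
})

# Drops every row with a null in ANY column (rows 1 and 2 here).
dropped_any = toy.dropna()

# Drops only the rows where 'Cuisines' is null (row 1 here).
dropped_cuisines = toy.dropna(subset=["Cuisines"])

print(len(dropped_any), len(dropped_cuisines))  # 2 3
```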

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9542 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9542 non-null   int64  
 1   Restaurant Name       9542 non-null   object 
 2   Country Code          9542 non-null   int64  
 3   City                  9542 non-null   object 
 4   Address               9542 non-null   object 
 5   Locality              9542 non-null   object 
 6   Locality Verbose      9542 non-null   object 
 7   Longitude             9542 non-null   float64
 8   Latitude              9542 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9542 non-null   int64  
 11  Currency              9542 non-null   object 
 12  Has Table booking     9542 non-null   object 
 13  Has Online delivery   9542 non-null   object 
 14  Is delivering now     9542 non-null   object 
 15  Switch to order menu  9542

In [15]:
df.isnull().sum()

| Column | Null count |
|---|---|
| Restaurant ID | 0 |
| Restaurant Name | 0 |
| Country Code | 0 |
| City | 0 |
| Address | 0 |
| Locality | 0 |
| Locality Verbose | 0 |
| Longitude | 0 |
| Latitude | 0 |
| Cuisines | 0 |


**These columns are often dropped because:**

-> Restaurant ID: Just a unique identifier; not useful for prediction.

-> Restaurant Name: A name is not a generalizable feature. The model can't learn useful patterns from names like "KFC" vs. "Local Tandoori".

-> Address: Unstructured text that is usually unique to each entry. Not generalizable unless converted to geographical zones or proximity measures.

-> Locality Verbose: Usually redundant with City or Locality, and also text-heavy.

In [16]:
# Drop unnecessary columns
columns_to_drop = ['Restaurant ID', 'Restaurant Name', 'Address', 'Locality Verbose']
df = df.drop(columns=columns_to_drop)

# Show the remaining columns
df.columns.tolist()

['Country Code',
 'City',
 'Locality',
 'Longitude',
 'Latitude',
 'Cuisines',
 'Average Cost for two',
 'Currency',
 'Has Table booking',
 'Has Online delivery',
 'Is delivering now',
 'Switch to order menu',
 'Price range',
 'Aggregate rating',
 'Rating color',
 'Rating text',
 'Votes']

In [17]:
#Detecting Duplicates
df.duplicated().sum()

np.int64(2)

In [18]:
#drop duplicates
df.drop_duplicates(inplace=True)
print(df.shape)

(9540, 17)
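
One likely reason duplicates appear at this stage is that the identifying columns have just been removed: rows that differed only by `Restaurant ID`, name, or address become identical once those columns are gone. The toy frame below illustrates the effect; its values are invented and it is not a sample of the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Restaurant ID": [1, 2, 3],
    "City": ["New Delhi", "New Delhi", "Noida"],
    "Votes": [0, 0, 5],
})

# With the unique ID present, no two rows are identical.
dupes_with_id = toy.duplicated().sum()

# After dropping the ID, the first two rows match exactly.
reduced = toy.drop(columns=["Restaurant ID"])
dupes_without_id = reduced.duplicated().sum()

deduped = reduced.drop_duplicates()
print(dupes_with_id, dupes_without_id, deduped.shape)  # 0 1 (2, 2)
```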


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9540 entries, 0 to 9550
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country Code          9540 non-null   int64  
 1   City                  9540 non-null   object 
 2   Locality              9540 non-null   object 
 3   Longitude             9540 non-null   float64
 4   Latitude              9540 non-null   float64
 5   Cuisines              9540 non-null   object 
 6   Average Cost for two  9540 non-null   int64  
 7   Currency              9540 non-null   object 
 8   Has Table booking     9540 non-null   object 
 9   Has Online delivery   9540 non-null   object 
 10  Is delivering now     9540 non-null   object 
 11  Switch to order menu  9540 non-null   object 
 12  Price range           9540 non-null   int64  
 13  Aggregate rating      9540 non-null   float64
 14  Rating color          9540 non-null   object 
 15  Rating text           9540

# **Encoding Categorical Variables**

In [20]:
df = pd.get_dummies(df, columns=['City', 'Cuisines', 'Currency'], drop_first=True)
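
Because `Cuisines` can hold several values in one string, `pd.get_dummies` treats each distinct combination as its own category, which is one reason the frame grows to nearly two thousand columns. A possible alternative, sketched below on invented values, is `Series.str.get_dummies`, which creates one indicator per individual cuisine. The `", "` separator is an assumption about how the values are delimited and should be checked against the real column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cuisines": [
        "French, Japanese",
        "Japanese",
        "French, Japanese, Desserts",
        "French",
    ]
})

# One column per distinct string: every combination is a separate category.
per_combination = pd.get_dummies(toy, columns=["Cuisines"])

# One column per individual cuisine (assumes ", " separates the values).
per_cuisine = toy["Cuisines"].str.get_dummies(sep=", ")

print(per_combination.shape[1], per_cuisine.shape[1])  # 4 3
print(sorted(per_cuisine.columns))
```

On the toy data the difference is small, but it widens as the number of distinct combinations grows faster than the number of individual cuisines.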

Convert binary categorical columns to 0/1:

In [21]:
binary_map = {'Yes': 1, 'No': 0}
df['Has Table booking'] = df['Has Table booking'].map(binary_map)
df['Has Online delivery'] = df['Has Online delivery'].map(binary_map)
df['Is delivering now'] = df['Is delivering now'].map(binary_map)
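
`Series.map` with a dictionary returns `NaN` for any value that is not one of its keys, so an unexpected label (different casing, stray whitespace) would silently become missing. A quick check after mapping, sketched here on invented values:

```python
import pandas as pd

binary_map = {"Yes": 1, "No": 0}
toy = pd.Series(["Yes", "No", "yes", "No"], name="Has Table booking")

mapped = toy.map(binary_map)

# Original labels that did not match any key in binary_map.
unmapped = toy[mapped.isna()].unique().tolist()
print(unmapped)  # ['yes']
```

An empty list means every value was covered by the mapping.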

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9540 entries, 0 to 9550
Columns: 1988 entries, Country Code to Currency_Turkish Lira(TL)
dtypes: bool(1974), float64(3), int64(7), object(4)
memory usage: 19.1+ MB
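
The summary still reports four `object` columns, which most scikit-learn estimators cannot consume directly. A sketch of how the remaining non-numeric columns could be listed before modelling is shown below. The toy frame borrows column names from the dataset, but its values are invented, and whether to encode or drop each column is left open here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Locality": ["Makati City", "Connaught Place"],
    "Rating text": ["Excellent", "Good"],
    "Votes": [314, 88],
    "Aggregate rating": [4.8, 3.9],
})

# Everything that is neither numeric nor boolean still needs attention.
non_numeric_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric_cols)  # ['Locality', 'Rating text']

# One option (an assumption, not the notebook's choice): drop them.
numeric_only = toy.drop(columns=non_numeric_cols)
```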
