## Zomato Bangalore Restaurants Prediction
The goal of analyzing the Zomato dataset is to understand the factors that affect the establishment
of different types of restaurants in different parts of Bengaluru.

In [1]:
DF_PATH = '../data/raw/zomato.csv'
DF_SAVE_PATH = '../data/processed/ML_zomato_processed.csv'

### Import libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Read dataframe

In [3]:
df = pd.read_csv(DF_PATH)
df.sample(3)

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
47354,https://www.zomato.com/bangalore/freshmenu-ric...,"Richmond Town, Richmond Road, Bangalore",FreshMenu,Yes,No,3.6 /5,385,080 49653271,Central Bangalore,Delivery,"Salads, Burgers, Salad, Potato Wedges, Chicken...","Continental, Asian, Healthy Food, Burger, Biry...",450,"[('Rated 1.0', 'RATED\n I ordered a Paneer Ti...","['Jeera Rice and Smokey Butter Chicken', 'Mala...",Delivery,Residency Road
42677,https://www.zomato.com/bangalore/the-wok-shop-...,"The Millenia, The Market Place Food Court, Lev...",The Wok Shop,Yes,No,3.4 /5,7,+91 8043745442,Ulsoor,Quick Bites,,"Asian, Chinese, Momos, Fast Food, Indonesian, ...",500,"[('Rated 4.0', 'RATED\n Recently tried the Wo...",[],Delivery,MG Road
37799,https://www.zomato.com/bangalore/cafe-cassia-d...,"Cinnamon, Ground Floor, 24 Gangadhar Chetty Ro...",Cafe Cassia& Deli,No,No,4.2 /5,116,080 41154102,Ulsoor,"Casual Dining, Cafe","Coffee, Pasta, Tea","Lebanese, Mediterranean, Cafe",1000,"[('Rated 5.0', 'RATED\n Beautifully designed ...",[],Cafes,Lavelle Road


**To do list:**
- Drop unneeded features: (name, url, address, phone, dish_liked, reviews_list, menu_item, votes).
- Fix problems in the (rate, approx_cost) features.
- Check for null values and handle them.
- Engineer features (cuisines, rest_type).
- Remove duplicated rows.
- Create the target feature.
- Save the prepared data to a CSV file.

---------

### Drop unneeded features: (name, url, address, phone, dish_liked, reviews_list, menu_item, votes).

In [4]:
cols_to_drop = ['name', 'url', 'address', 'phone', 'dish_liked', 'reviews_list', 'menu_item', 'votes']
df.drop(cols_to_drop, axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   online_order                 51717 non-null  object
 1   book_table                   51717 non-null  object
 2   rate                         43942 non-null  object
 3   location                     51696 non-null  object
 4   rest_type                    51490 non-null  object
 5   cuisines                     51672 non-null  object
 6   approx_cost(for two people)  51371 non-null  object
 7   listed_in(type)              51717 non-null  object
 8   listed_in(city)              51717 non-null  object
dtypes: object(9)
memory usage: 3.6+ MB


### Fix problems in the (rate, approx_cost) features.

In [5]:
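def fix_rate(rate):
    """Parse a rating such as '3.6 /5' into a float; return NaN if that fails."""
    try:
        # The first three characters hold the numeric part, e.g. '3.6'.
        return float(rate[0:3])
    except (TypeError, ValueError):
        # TypeError: missing value (float NaN); ValueError: non-numeric text.
        return np.nan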

In [6]:
df['rate'].apply(fix_rate).unique()

array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
       4.4, 4.3, nan, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2,
       2.3, 4.8, 4.9, 2.1, 2. , 1.8])

In [7]:
df['rate'] = df['rate'].apply(fix_rate)

In [8]:
df['rate'].unique()

array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
       4.4, 4.3, nan, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2,
       2.3, 4.8, 4.9, 2.1, 2. , 1.8])
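
The same cleanup can also be written without a Python-level function. The following is a minimal sketch, assuming the raw ratings look like `'3.6 /5'` as in the sample rows above. The `'NEW'` and `'-'` entries in the toy series are hypothetical stand-ins for non-numeric values, and `raw_rate`, `numerator`, and `clean_rate` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy ratings; 'NEW' and '-' are hypothetical non-numeric placeholders.
raw_rate = pd.Series(['4.1/5', '3.6 /5', 'NEW', '-', np.nan])

# Keep the text before the slash and trim surrounding whitespace.
numerator = raw_rate.str.split('/').str[0].str.strip()

# Convert to numbers; anything that is not numeric becomes NaN.
clean_rate = pd.to_numeric(numerator, errors='coerce')

print(clean_rate.tolist())
```

Unlike slicing the first three characters, splitting on the slash does not depend on the numeric part being exactly three characters wide.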

In [9]:
df['approx_cost(for two people)'].astype(str).apply(lambda c : float(c.replace(",", ""))).unique()

array([ 800.,  300.,  600.,  700.,  550.,  500.,  450.,  650.,  400.,
        900.,  200.,  750.,  150.,  850.,  100., 1200.,  350.,  250.,
        950., 1000., 1500., 1300.,  199.,   80., 1100.,  160., 1600.,
        230.,  130.,   50.,  190., 1700.,   nan, 1400.,  180., 1350.,
       2200., 2000., 1800., 1900.,  330., 2500., 2100., 3000., 2800.,
       3400.,   40., 1250., 3500., 4000., 2400., 2600.,  120., 1450.,
        469.,   70., 3200.,   60.,  560.,  240.,  360., 6000., 1050.,
       2300., 4100., 5000., 3700., 1650., 2700., 4500.,  140.])

In [10]:
df['approx_cost(for two people)'] = df['approx_cost(for two people)'].astype(str).apply(lambda c : float(c.replace(",", "")))

In [11]:
df['approx_cost(for two people)'].unique()

array([ 800.,  300.,  600.,  700.,  550.,  500.,  450.,  650.,  400.,
        900.,  200.,  750.,  150.,  850.,  100., 1200.,  350.,  250.,
        950., 1000., 1500., 1300.,  199.,   80., 1100.,  160., 1600.,
        230.,  130.,   50.,  190., 1700.,   nan, 1400.,  180., 1350.,
       2200., 2000., 1800., 1900.,  330., 2500., 2100., 3000., 2800.,
       3400.,   40., 1250., 3500., 4000., 2400., 2600.,  120., 1450.,
        469.,   70., 3200.,   60.,  560.,  240.,  360., 6000., 1050.,
       2300., 4100., 5000., 3700., 1650., 2700., 4500.,  140.])
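
A vectorized variant of the cost conversion is sketched below, assuming the only formatting issue is a comma used as a thousands separator. The toy values and the names `raw_cost` and `clean_cost` are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy cost column: strings with thousands separators plus one missing value.
raw_cost = pd.Series(['800', '1,200', np.nan, '6,000'])

# Remove the comma as a literal character (regex=False), then convert to numbers.
without_commas = raw_cost.str.replace(',', '', regex=False)
clean_cost = pd.to_numeric(without_commas, errors='coerce')

print(clean_cost.tolist())
```

Missing entries stay missing throughout, so no detour through the string `'nan'` is needed.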

### Check for null values and handle them.

In [12]:
df.isna().mean() * 100

online_order                    0.000000
book_table                      0.000000
rate                           19.436549
location                        0.040606
rest_type                       0.438927
cuisines                        0.087012
approx_cost(for two people)     0.669026
listed_in(type)                 0.000000
listed_in(city)                 0.000000
dtype: float64

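> **Note:** Most columns are missing well under 1% of their values. `rate` is missing in about 19% of rows, but it is the basis for the target feature, so rows without a rating cannot be labeled. We therefore drop every row that contains a null value.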

In [13]:
df.dropna(axis = 0, inplace = True)

In [14]:
df.isna().mean() * 100

online_order                   0.0
book_table                     0.0
rate                           0.0
location                       0.0
rest_type                      0.0
cuisines                       0.0
approx_cost(for two people)    0.0
listed_in(type)                0.0
listed_in(city)                0.0
dtype: float64
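
`dropna` with no `subset` removes a row if any of its columns is null. The sketch below uses a toy frame with made-up values to show how the per-column percentages relate to the rows that survive, and how `subset` could restrict the check to specific columns if a narrower rule were wanted.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: two missing ratings, one missing location.
toy = pd.DataFrame({
    'rate': [4.1, np.nan, 3.8, np.nan],
    'location': ['Ulsoor', 'Ulsoor', None, 'MG Road'],
})

missing_pct = toy.isna().mean() * 100        # percentage of nulls per column
dropped_any = toy.dropna(axis=0)             # drop rows with a null in any column
dropped_rate = toy.dropna(subset=['rate'])   # drop only rows that lack a rating

print(missing_pct)
print(len(dropped_any), len(dropped_rate))
```

Here only one row is complete in every column, while two rows have a rating.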

### Engineer features (cuisines, rest_type).

In [15]:
df['rest_type_counts'] = df['rest_type'].astype(str).apply(lambda c : len(c.split(',')))

In [16]:
df['cuisines_counts'] = df['cuisines'].astype(str).apply(lambda c : len(c.split(',')))
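
Both count features rest on the same idea: a comma-separated string with `n` commas holds `n + 1` items. A minimal sketch with values taken from the sample rows above; the standalone series are for illustration only.

```python
import pandas as pd

rest_type = pd.Series(['Quick Bites', 'Casual Dining, Cafe'])
cuisines = pd.Series(['Lebanese, Mediterranean, Cafe', 'Asian'])

# Number of listed items = number of commas + 1.
rest_type_counts = rest_type.str.count(',') + 1
cuisines_counts = cuisines.str.count(',') + 1

# The split-based version used in the cells above gives the same numbers.
split_counts = cuisines.apply(lambda c: len(c.split(',')))

print(rest_type_counts.tolist(), cuisines_counts.tolist())
```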

#### Drop 2 features (rest_type, cuisines)

In [17]:
df.drop(['rest_type', 'cuisines'], axis = 1, inplace = True)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41263 entries, 0 to 51716
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   online_order                 41263 non-null  object 
 1   book_table                   41263 non-null  object 
 2   rate                         41263 non-null  float64
 3   location                     41263 non-null  object 
 4   approx_cost(for two people)  41263 non-null  float64
 5   listed_in(type)              41263 non-null  object 
 6   listed_in(city)              41263 non-null  object 
 7   rest_type_counts             41263 non-null  int64  
 8   cuisines_counts              41263 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 3.1+ MB


### Remove duplicated rows.

In [19]:
df.duplicated().sum()

2299

In [20]:
df.drop_duplicates(inplace = True)

In [21]:
df.duplicated().sum()

0
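
`duplicated()` flags a row when every remaining column matches an earlier row, and `drop_duplicates()` keeps the first occurrence by default. A small sketch on a toy frame (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'online_order': ['Yes', 'Yes', 'No', 'Yes'],
    'location': ['Ulsoor', 'Ulsoor', 'Ulsoor', 'Ulsoor'],
})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence of each row

print(n_dupes, deduped.index.tolist())
```

Because the comparison uses only the columns still present, rows that differed solely in a column dropped earlier are also counted as duplicates. The surviving rows keep their original index labels.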

### Create the target feature.

In [22]:
df['rate'].describe()

count    38964.000000
mean         3.705726
std          0.446271
min          1.800000
25%          3.400000
50%          3.700000
75%          4.000000
max          4.900000
Name: rate, dtype: float64

In [23]:

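def create_target(rate):
    """Label a restaurant 1 if its rating is at least 3.75, otherwise 0."""
    try:
        return 1 if rate >= 3.75 else 0
    except TypeError:
        # Non-numeric input cannot be compared with the threshold.
        return np.nan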
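
The same threshold rule can be expressed as a single vectorized comparison. This is a sketch on a toy series, assuming the ratings are already numeric; `THRESHOLD` and the sample values are illustrative names and data.

```python
import pandas as pd

THRESHOLD = 3.75  # same cut-off as in create_target

rate = pd.Series([3.4, 3.7, 3.8, 4.2])

# True/False from the comparison, cast to 1/0.
success = (rate >= THRESHOLD).astype(int)

print(success.tolist())
```

Since the ratings are recorded to one decimal place, this cut-off sends 3.7 and below to class 0 and 3.8 and above to class 1.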

In [24]:
df['success'] = df['rate'].apply(create_target)
df.drop('rate', axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38964 entries, 0 to 51716
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   online_order                 38964 non-null  object 
 1   book_table                   38964 non-null  object 
 2   location                     38964 non-null  object 
 3   approx_cost(for two people)  38964 non-null  float64
 4   listed_in(type)              38964 non-null  object 
 5   listed_in(city)              38964 non-null  object 
 6   rest_type_counts             38964 non-null  int64  
 7   cuisines_counts              38964 non-null  int64  
 8   success                      38964 non-null  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 3.0+ MB


In [25]:
df['success'].value_counts()

success
0    19716
1    19248
Name: count, dtype: int64
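
The two classes come out close to balanced. A quick check of the proportions, using the counts printed above (on the dataframe itself, `value_counts(normalize=True)` would give the same shares):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 19716, 1: 19248}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(label, round(share, 3))
```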

### Save the prepared data to a CSV file.

In [26]:
df.to_csv(DF_SAVE_PATH, index = False)
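
A round-trip check can confirm that the file reads back with the expected columns. The sketch below uses an in-memory buffer and a toy frame rather than `DF_SAVE_PATH`, so it runs without touching the real data; the values are made up.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    'online_order': ['Yes', 'No'],
    'approx_cost(for two people)': [800.0, 300.0],
    'success': [1, 0],
})

# Write without the index, as in the cell above, then read the text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape, list(restored.columns))
```

With `index=False`, the saved file contains only the data columns, so no extra `Unnamed: 0` column appears when it is loaded again.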