# Data Science Project
## Zomato Bangalore Restaurants - Data Analysis & Predictive Modeling

**Project Prepared by:** Diaa Aldein Alsayed Ibrahim Osman  
**Prepared for:** Epsilon AI Institute  

**Background:**  
Zomato, founded in 2008, is a global restaurant aggregator and food delivery company. In the context of Bangalore, the IT capital of India, where the restaurant industry continues to grow, this project aims to leverage the Zomato Bangalore Restaurants dataset from Kaggle. The study will provide valuable insights into the factors influencing the establishment of different types of restaurants.

**Project Overview:**  
The objective of this data science project is to conduct a comprehensive analysis of Zomato Bangalore Restaurants and develop a predictive model for determining the success of new restaurants. Leveraging advanced data analytics and machine learning techniques, this project aims to provide actionable insights to restaurant owners, investors, and decision-makers.

**Dataset Description:**  
The dataset is sourced from Kaggle and is available [here](https://www.kaggle.com/datasets/himanshupoddar/zomato-bangalore-restaurants?authuser=0). It contains 51,717 instances and 17 features. The data is accurate to that available on the Zomato website until 15 March 2019. Bengaluru, being one such city, has more than 12,000 restaurants with establishments serving dishes from all over the world.

**Features Description:**  
1. `url`: Contains the URL of the restaurant on the Zomato website.
2. `address`: Contains the address of the restaurant in Bengaluru.
3. `name`: Contains the name of the restaurant.
4. `online_order`: Indicates whether online ordering is available in the restaurant or not.
5. `book_table`: Indicates whether the table booking option is available or not.
6. `rate`: Contains the overall rating of the restaurant out of 5.
7. `votes`: Contains the total number of ratings for the restaurant as of the above-mentioned date.
8. `phone`: Contains the phone number of the restaurant.
9. `location`: Contains the neighborhood in which the restaurant is located.
10. `rest_type`: Contains the type of restaurant.
11. `dish_liked`: Contains dishes that people liked in the restaurant.
12. `cuisines`: Contains food styles, separated by commas.
13. `approx_cost(for two people)`: Contains the approximate cost for a meal for two people.
14. `reviews_list`: Contains a list of tuples containing reviews for the restaurant. Each tuple consists of two values: rating and review by the customer.
15. `menu_item`: Contains a list of menus available in the restaurant.
16. `listed_in(type)`: Contains the type of meal.
17. `listed_in(city)`: Contains the neighborhood in which the restaurant is listed.


## Step 1: Data Cleaning & preparation.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import warnings            
warnings.filterwarnings("ignore") 

In [2]:
# loading the dataset & veiwing the head of the data the first 5 rows.
df = pd.read_csv("../data/raw/zomato.csv")
df.head()

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7/5,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8/5,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari


In [3]:
# checking the shape of the dataset
df.shape

(51717, 17)

In [4]:
# Geitting Information about the Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [5]:
#checking for missing data percentage
df.isnull().mean()*100

url                             0.000000
address                         0.000000
name                            0.000000
online_order                    0.000000
book_table                      0.000000
rate                           15.033741
votes                           0.000000
phone                           2.335789
location                        0.040606
rest_type                       0.438927
dish_liked                     54.291626
cuisines                        0.087012
approx_cost(for two people)     0.669026
reviews_list                    0.000000
menu_item                       0.000000
listed_in(type)                 0.000000
listed_in(city)                 0.000000
dtype: float64

In [6]:
# descriptive statistics for categorical columns
df.describe(include="O")

Unnamed: 0,url,address,name,online_order,book_table,rate,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
count,51717,51717,51717,51717,51717,43942,50509,51696,51490,23639,51672,51371,51717,51717,51717,51717
unique,51717,11495,8792,2,2,64,14926,93,93,5271,2723,70,22513,9098,7,30
top,https://www.zomato.com/bangalore/jalsa-banasha...,Delivery Only,Cafe Coffee Day,Yes,No,NEW,080 43334321,BTM,Quick Bites,Biryani,North Indian,300,[],[],Delivery,BTM
freq,1,128,96,30444,45268,2208,216,5124,19132,182,2913,7576,7595,39617,25942,3279


From all of the above we can found that the following features:
1. `url`: all the value of url is unique and it will not be usefull we are going to drop it
2. `address`: the uniqueness is very high 11495 unique value we are going to Keep it and use it to validate location col and extracting usefull information.
3. `name`: have high unique value of 8792 but we are going to keep it for the analysis stage and drop it for the modeling stage.
4. `rate`: Contains 15% null values we are going to drop these rows. and need more envestigation for strange values.
5. `phone`: have 2% null values and high uniqueness 14926 values and it is useless for the aim of this analysis we are going to drop this column
9. `location`: Contains 0.04% null values we are going to drop these rows.
10. `rest_type`: Contains 0.4% null values we are going to drop these rows.
11. `dish_liked`: Contains 54% null values and high uniqueness of 5271 values we are going to drop this columns.
12. `cuisines`: Contains 0.087% null values we are going to drop these rows.
13. `approx_cost(for two people)`:Contains 0.669% null values we are going to drop these rows.
14. `reviews_list`: as the project will not contain NLP and the uniqueness is very high 22513 values we are going to drop this column. 
15. `menu_item`: as the uniqueness is very high 9098 unique values, Also, it is useless for the aim of this analysis as there are other features reflect the food types we are going to drop this column.

In [7]:
# dropping unnecessary columns ['url','address','phone','dish_liked','reviews_list','menu_item']:
df.drop(['url','phone','dish_liked','reviews_list','menu_item'],axis=1,inplace=True)

In [8]:
# dropping null values rows.
df.dropna(axis=0,inplace=True)

In [9]:
#checking for duplication in data 
df.duplicated().sum()

72

In [10]:
# Drop duplicates and reset the index
df = df.drop_duplicates().reset_index(drop=True)

In [11]:
df.duplicated().sum()

0

## Maintaing features columns name

In [12]:
# Maintaing features column name by removing "(",")" signes and replacing spaces with "_"
df.columns = df.columns.str.replace("("," ").str.replace(")","").str.replace(" ","_")
df.columns

Index(['address', 'name', 'online_order', 'book_table', 'rate', 'votes',
       'location', 'rest_type', 'cuisines', 'approx_cost_for_two_people',
       'listed_in_type', 'listed_in_city'],
      dtype='object')

## Checking features for strange features values

## 1. address col:

In [13]:
# checking numbers of unique resturants address
df.address.nunique()

9432

In [14]:
df.address.unique()

array(['942, 21st Main Road, 2nd Stage, Banashankari, Bangalore',
       '2nd Floor, 80 Feet Road, Near Big Bazaar, 6th Block, Kathriguppe, 3rd Stage, Banashankari, Bangalore',
       '1112, Next to KIMS Medical College, 17th Cross, 2nd Stage, Banashankari, Bangalore',
       ...,
       'Cessna Business Park, Sarjapur Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x96 Marathahalli Outer Ring Road, Kadubeesanahalli, Bellandur Post, Bengaluru, Karnataka',
       '44, Kodigehalli to Hoodi Main Road, Mahadevapura Post, KR Puram, Bangalore',
       '139/C1, Next To GR Tech Park, Pattandur Agrahara, ITPL, Main Road, Whitefield, Bangalore'],
      dtype=object)

In [15]:
# Set display options to show all rows
pd.set_option('display.max_rows', None)
df.address.value_counts()

Delivery Only                                                                                                                                                                                                                                                                                                                                                 95
The Ritz-Carlton, 99, Residency Road, Bangalore                                                                                                                                                                                                                                                                                                               61
Conrad Bengaluru, Kensington Road, Ulsoor, Bangalore                                                                                                                                                                                                                                                  

* there are strange values like '?,ÃÂ'we are going to remove them from address values like (Shop no. LG-5, Spendid Plaza, 5th A Block, 100 Feet Road, Koramangala 5th Block, Bangalore) Also there use of 'Bengaluru' instead of 'Bangalore' we going to unified the wold and using 'Bangalore'

In [16]:
# removing strange values like '?,ÃÂ' in Shop no. LG-5, Spendid Plaza, 5th, replacing 'Bengaluru' with'Bangalore'

df.address = df.address.str.replace('[?ÃÂ]','').str.replace('Bengaluru','Bangalore')

In [17]:
df.address.value_counts()

Delivery Only                                                                                                                                                                                                     95
The Ritz-Carlton, 99, Residency Road, Bangalore                                                                                                                                                                   61
Conrad Bangalore, Kensington Road, Ulsoor, Bangalore                                                                                                                                                              49
14th Main, 4th Sector, HSR, Bangalore                                                                                                                                                                             47
1, 100 Feet Ring Road, 1st Phase, 2nd Stage, BTM, Bangalore                                                                                         

In [18]:
df.address.nunique()

9421

* `address`feature: after removing strange values like '?ÃÂ', replace 'Bengaluru' with 'Bangalore' from address values the Number of unique values drop from 9432 to 9421 unique values of resturant address.

### 2. name col:

In [19]:
# checking numbers of unique resturants name
df.name.nunique()

7112

In [20]:
# Set display options to show all rows
pd.set_option('display.max_rows', None)
print(df.name.value_counts())

Cafe Coffee Day                                                                                                                                                    89
Onesta                                                                                                                                                             85
Empire Restaurant                                                                                                                                                  71
Five Star Chicken                                                                                                                                                  68
Kanti Sweets                                                                                                                                                       68
Just Bake                                                                                                                                                          68
Peto

In [21]:
# removing strange values like 'Ã', replace '[©¢ª¨±]' with 'e' and '[»]' with 'u' from name 

df.name = df.name.str.replace('[ÃÂÂ]','').str.replace('[©¢ª¨±]','e').str.replace('»','u')

In [22]:
print(df.name.value_counts())

Cafe Coffee Day                                            89
Onesta                                                     85
Empire Restaurant                                          71
Just Bake                                                  68
Kanti Sweets                                               68
Five Star Chicken                                          68
Petoo                                                      64
Baskin Robbins                                             63
Pizza Hut                                                  62
Polar Bear                                                 61
Domino's Pizza                                             60
Beijing Bites                                              60
Sweet Truth                                                60
KFC                                                        60
Smoor                                                      59
McDonald's                                                 59
Subway  

Name: name, dtype: int64


In [23]:
df.name.nunique()

7084

* `name`feature: after removing strange values like 'Ã', replace '[©¢ª¨±]' with 'e' and '[»]' with 'u' from name values the Number of unique values drop from 8792 to 7084 unique values of resturant names.

### 3. online_order col:

In [24]:
df.online_order.unique()

array(['Yes', 'No'], dtype=object)

In [25]:
# changing the value to 1 if yes and 0 if false and change dtype to int
df.online_order = df.online_order.apply(lambda x: 1 if x=='Yes' else 0).astype('int')
df.online_order.unique()

array([1, 0])

### 4. book_table col:

In [26]:
df.book_table.unique()

array(['Yes', 'No'], dtype=object)

In [27]:
# changing the value to 1 if yes and 0 if false and change dtype to int
df.book_table = df.book_table.apply(lambda x: 1 if x=='Yes' else 0).astype('int')
df.book_table.unique()

array([1, 0])

### 5. rate col:

In [28]:
df.rate.value_counts()

NEW       2194
3.9/5     2088
3.7/5     2006
3.8/5     1997
3.9 /5    1865
3.8 /5    1819
3.7 /5    1799
3.6/5     1752
4.0/5     1596
4.0 /5    1546
3.6 /5    1533
4.1/5     1467
4.1 /5    1455
3.5/5     1422
3.5 /5    1340
3.4/5     1246
3.4 /5    1197
3.3/5     1147
4.2 /5    1141
3.3 /5    1125
4.2/5     1010
3.2/5      996
4.3 /5     908
3.1/5      851
3.2 /5     847
4.3/5      769
3.1 /5     699
4.4 /5     626
3.0/5      543
4.4/5      510
3.0 /5     447
2.9/5      426
4.5 /5     408
2.9 /5     374
2.8/5      302
2.8 /5     278
4.5/5      245
4.6 /5     174
2.7/5      167
2.6/5      140
2.7 /5     136
4.6/5      125
2.6 /5     109
4.7 /5      86
4.7/5       81
-           65
2.5 /5      56
2.5/5       44
4.8 /5      43
2.4/5       36
4.9 /5      30
2.4 /5      30
2.3/5       28
4.9/5       25
2.3 /5      23
4.8/5       23
2.2/5       19
2.1 /5      13
2.1/5       11
2.2 /5       7
2.0 /5       7
2.0/5        4
1.8 /5       3
1.8/5        2
Name: rate, dtype: int64

In [29]:
df.rate.unique()

array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
       '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
       '4.3/5', 'NEW', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5',
       '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
       '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
       '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
       '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
       '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
       '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

* from the above values for rate features:
1. there are two strange values 'NEW', and '-' need to be replace with null values to drop them.
2. need to remove /5
3. change the data type to float.
4. creating Target columns to determine if a restaurant success is considered good (1) for rate >= 3.75 or not (0) 

In [30]:
# Replace 'NEW' and '-' with NaN, and remove '/5' and convert to float
df['rate'] = df['rate'].replace(['NEW', '-'], np.nan).str.rstrip('/5').astype(float)
df.rate.unique()

array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
       4.4, 4.3, nan, 2.9, 2.6, 3.4, 2. , 2.7, 4.7, 2.4, 2.2, 2.3, 4.8,
       3.5, 2.5, 4.5, 4.9, 2.1, 1.8])

In [31]:
df.rate.isnull().sum()

2259

In [32]:
# dropping null values rows and reset the index.
df = df.dropna(axis=0).reset_index(drop=True)

In [33]:
df.rate.isnull().sum()

0

### 6. votes col:

In [34]:
# checking votes columns values
df.votes.value_counts() # it is looks good

4        1123
6         979
7         857
9         735
11        685
5         659
8         617
10        616
16        524
12        463
13        458
17        446
14        426
23        401
21        397
18        388
24        371
22        361
19        359
15        355
25        328
34        323
20        291
26        288
31        272
27        269
33        264
32        262
28        257
38        249
30        249
41        236
42        233
36        233
46        231
39        225
35        217
48        200
47        200
37        199
29        198
53        197
43        195
44        189
61        176
54        169
49        167
56        164
50        162
57        159
40        155
59        154
74        153
51        153
70        145
62        142
52        141
68        138
67        135
45        132
75        129
64        128
69        127
58        126
66        124
76        120
63        120
73        119
55        118
93        115
97        112
95    

### 7. location col:

In [35]:
df.location.nunique()

92

In [36]:
# checking location values.
df.location.unique() # it look good

array(['Banashankari', 'Basavanagudi', 'Mysore Road', 'Jayanagar',
       'Kumaraswamy Layout', 'Rajarajeshwari Nagar', 'Vijay Nagar',
       'Uttarahalli', 'JP Nagar', 'South Bangalore', 'City Market',
       'Bannerghatta Road', 'BTM', 'Kanakapura Road', 'Bommanahalli',
       'Electronic City', 'Wilson Garden', 'Shanti Nagar',
       'Koramangala 5th Block', 'Richmond Road', 'HSR',
       'Koramangala 7th Block', 'Bellandur', 'Sarjapur Road',
       'Marathahalli', 'Whitefield', 'East Bangalore', 'Old Airport Road',
       'Indiranagar', 'Koramangala 1st Block', 'Frazer Town', 'MG Road',
       'Brigade Road', 'Lavelle Road', 'Church Street', 'Ulsoor',
       'Residency Road', 'Shivajinagar', 'Infantry Road',
       'St. Marks Road', 'Cunningham Road', 'Race Course Road',
       'Commercial Street', 'Vasanth Nagar', 'Domlur',
       'Koramangala 8th Block', 'Ejipura', 'Jeevan Bhima Nagar',
       'Old Madras Road', 'Seshadripuram', 'Kammanahalli',
       'Koramangala 6th Block', 'Ma

### 8. rest_type col:

In [37]:
# checking  rest_type col number of unique values
df.rest_type.nunique()

87

In [38]:
# checking  rest_type col unique values
df.rest_type.unique()

array(['Casual Dining', 'Cafe, Casual Dining', 'Quick Bites',
       'Casual Dining, Cafe', 'Cafe', 'Quick Bites, Cafe',
       'Cafe, Quick Bites', 'Delivery', 'Mess', 'Dessert Parlor',
       'Bakery, Dessert Parlor', 'Pub', 'Bakery', 'Takeaway, Delivery',
       'Fine Dining', 'Beverage Shop', 'Sweet Shop', 'Bar',
       'Dessert Parlor, Sweet Shop', 'Bakery, Quick Bites',
       'Sweet Shop, Quick Bites', 'Kiosk', 'Food Truck',
       'Quick Bites, Dessert Parlor', 'Beverage Shop, Quick Bites',
       'Beverage Shop, Dessert Parlor', 'Takeaway', 'Pub, Casual Dining',
       'Casual Dining, Bar', 'Dessert Parlor, Beverage Shop',
       'Quick Bites, Bakery', 'Microbrewery, Casual Dining', 'Lounge',
       'Bar, Casual Dining', 'Food Court', 'Cafe, Bakery', 'Dhaba',
       'Quick Bites, Sweet Shop', 'Microbrewery',
       'Food Court, Quick Bites', 'Quick Bites, Beverage Shop',
       'Pub, Bar', 'Casual Dining, Pub', 'Lounge, Bar',
       'Dessert Parlor, Quick Bites', 'Food Court, 

* From the above, we can see that there are some values repeated in different orders, such as 'Cafe, Casual Dining' and 'Casual Dining, Cafe.' We will address this issue by creating a function to sort each value and return the sorted result.

In [39]:
# Creating string sort function
def sort_rest_type(rest_type_str):
    # Split the input string into a list of individual rest_type names
    rest_type = rest_type_str.split(', ')
    
    # Sort the list of rest_type names alphabetically
    sorted_rest_type = ', '.join(sorted(rest_type))
    
    # Join the sorted rest_type names back into a single string separated by ', '
    return sorted_rest_type

In [40]:
# Apply the function to the 'rest_type' column
df.rest_type = df.rest_type.apply(sort_rest_type)

In [41]:
# checking  rest_type col unique values
df.rest_type.unique()

array(['Casual Dining', 'Cafe, Casual Dining', 'Quick Bites', 'Cafe',
       'Cafe, Quick Bites', 'Delivery', 'Mess', 'Dessert Parlor',
       'Bakery, Dessert Parlor', 'Pub', 'Bakery', 'Delivery, Takeaway',
       'Fine Dining', 'Beverage Shop', 'Sweet Shop', 'Bar',
       'Dessert Parlor, Sweet Shop', 'Bakery, Quick Bites',
       'Quick Bites, Sweet Shop', 'Kiosk', 'Food Truck',
       'Dessert Parlor, Quick Bites', 'Beverage Shop, Quick Bites',
       'Beverage Shop, Dessert Parlor', 'Takeaway', 'Casual Dining, Pub',
       'Bar, Casual Dining', 'Casual Dining, Microbrewery', 'Lounge',
       'Food Court', 'Bakery, Cafe', 'Dhaba', 'Microbrewery',
       'Food Court, Quick Bites', 'Bar, Pub', 'Bar, Lounge',
       'Dessert Parlor, Food Court', 'Casual Dining, Sweet Shop',
       'Casual Dining, Food Court', 'Casual Dining, Lounge',
       'Cafe, Food Court', 'Beverage Shop, Cafe', 'Cafe, Dessert Parlor',
       'Microbrewery, Pub', 'Bakery, Food Court', 'Club', 'Cafe, Pub',
       '

In [42]:
# checking  rest_type col number of unique values
df.rest_type.nunique()

66

* After applying the `sort_rest_type` function on the `rest_type` column, the number of unique values dropped from 87 to 66.

### 9. cuisines col:

In [43]:
# finding the number of unique values 
df.cuisines.nunique()

2367

In [44]:
# checking cuisines col values
df.cuisines.value_counts() 

North Indian                                                                            2107
North Indian, Chinese                                                                   1949
South Indian                                                                            1231
Cafe                                                                                     620
Bakery, Desserts                                                                         613
Biryani                                                                                  600
South Indian, North Indian, Chinese                                                      561
Desserts                                                                                 545
Fast Food                                                                                513
Chinese                                                                                  409
Ice Cream, Desserts                                                   

* After inspecting the values in the `cuisines` feature, we noticed that certain values are repeated in different orders, for example, 'Bakery, Desserts' and 'Desserts, Bakery'. To address this issue, we will create a function to sort each value and return the sorted result.

In [45]:
# Creating string sort function
def sort_cuisines(cuisine_str):
    # Split the input string into a list of individual cuisine names
    cuisines = cuisine_str.split(', ')
    
    # Sort the list of cuisine names alphabetically
    sorted_cuisines = ', '.join(sorted(cuisines))
    # Join the sorted cuisine names back into a single string separated by ', '
    
    return sorted_cuisines

In [46]:
# Apply the function to the 'rest_type' column
df.cuisines = df.cuisines.apply(sort_cuisines)

In [47]:
# checking cuisines col values
df.cuisines.value_counts() 

Chinese, North Indian                                                                   2284
North Indian                                                                            2107
South Indian                                                                            1231
Chinese, North Indian, South Indian                                                     1058
Bakery, Desserts                                                                         777
Desserts, Ice Cream                                                                      622
Cafe                                                                                     620
Biryani                                                                                  600
Biryani, Chinese, North Indian                                                           550
Desserts                                                                                 545
Fast Food                                                             

In [48]:
# finding the number of unique values 
df.cuisines.nunique()

1688

* After applying the sort_cuisines function on the cuisines column, the number of unique values dropped from 2367 to 1688.

### 10. approx_cost_for_two_people col:

In [49]:
# checking approx_cost_for_two_people column
df['approx_cost_for_two_people'].unique()

array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
       '900', '200', '750', '150', '850', '100', '1,200', '350', '250',
       '950', '1,000', '1,500', '1,300', '199', '1,100', '1,600', '230',
       '130', '1,700', '1,350', '2,200', '1,400', '2,000', '1,800',
       '1,900', '180', '330', '2,500', '2,100', '3,000', '2,800', '3,400',
       '50', '40', '1,250', '3,500', '4,000', '2,400', '2,600', '1,450',
       '70', '3,200', '240', '6,000', '1,050', '2,300', '4,100', '120',
       '5,000', '3,700', '1,650', '2,700', '4,500', '80'], dtype=object)

* From the above we need to remove ',' and then change data type to int

In [50]:
# removing ',' from string using str.replace, and change data type to int
df['approx_cost_for_two_people'] = df['approx_cost_for_two_people'].str.replace(',','').str.strip().astype('int')

df['approx_cost_for_two_people'].unique()

array([ 800,  300,  600,  700,  550,  500,  450,  650,  400,  900,  200,
        750,  150,  850,  100, 1200,  350,  250,  950, 1000, 1500, 1300,
        199, 1100, 1600,  230,  130, 1700, 1350, 2200, 1400, 2000, 1800,
       1900,  180,  330, 2500, 2100, 3000, 2800, 3400,   50,   40, 1250,
       3500, 4000, 2400, 2600, 1450,   70, 3200,  240, 6000, 1050, 2300,
       4100,  120, 5000, 3700, 1650, 2700, 4500,   80])

### 11. listed_in_type column:

In [51]:
# checking listed_in_type column
df['listed_in_type'].unique() # it look good

array(['Buffet', 'Cafes', 'Delivery', 'Desserts', 'Dine-out',
       'Drinks & nightlife', 'Pubs and bars'], dtype=object)

### 12. listed_in_city column:

In [52]:
# checking listed_in(city) column
df['listed_in_city'].unique() # it look good

array(['Banashankari', 'Bannerghatta Road', 'Basavanagudi', 'Bellandur',
       'Brigade Road', 'Brookefield', 'BTM', 'Church Street',
       'Electronic City', 'Frazer Town', 'HSR', 'Indiranagar',
       'Jayanagar', 'JP Nagar', 'Kalyan Nagar', 'Kammanahalli',
       'Koramangala 4th Block', 'Koramangala 5th Block',
       'Koramangala 6th Block', 'Koramangala 7th Block', 'Lavelle Road',
       'Malleshwaram', 'Marathahalli', 'MG Road', 'New BEL Road',
       'Old Airport Road', 'Rajajinagar', 'Residency Road',
       'Sarjapur Road', 'Whitefield'], dtype=object)

In [53]:
# checking info after Cleaning & prepration step.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41202 entries, 0 to 41201
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   address                     41202 non-null  object 
 1   name                        41202 non-null  object 
 2   online_order                41202 non-null  int32  
 3   book_table                  41202 non-null  int32  
 4   rate                        41202 non-null  float64
 5   votes                       41202 non-null  int64  
 6   location                    41202 non-null  object 
 7   rest_type                   41202 non-null  object 
 8   cuisines                    41202 non-null  object 
 9   approx_cost_for_two_people  41202 non-null  int32  
 10  listed_in_type              41202 non-null  object 
 11  listed_in_city              41202 non-null  object 
dtypes: float64(1), int32(3), int64(1), object(7)
memory usage: 3.3+ MB


In [54]:
#checking for duplication in data 
df.duplicated().sum()

0

In [55]:
# saving file after cleaning process.
df.to_csv('../data/processed/cleaned_data.csv',index=False)

## Data Cleaning & prepration Summery:

- The data set shape change from (51717, 17) to (41202 entries, 12 columns) after cleaning process.
- Maintaing features columns name:
    1. From approx_cost(for to people) to approx_cost_for_two_people.
    2. From listed_in(type) to listed_in_type.
    3. From listed_in(city) to listed_in_city.

- The summery of the cleaning for each features as follows:
    1. **`url`:** 
        * all the value of url is unique and it will not be usefull we drop it.
    2. **`address`:** 
        * the uniqueness is very high 11495 unique value we are going to Keep it and use it to validate location col and extracting usefull information.
        * Also, after removing strange values like '?ÃÂ', replace 'Bengaluru' with 'Bangalore' from address values the Number of unique values drop from 9432 to 9421 unique values of resturant address.
    3. **`name`:** 
        * have high unique value of 8792 but we are going to keep it for the analysis stage and drop it for the modeling stage.
        * Also, after removing strange values like 'Ã', replace '[©¢ª¨±]' with 'e' and '[»]' with 'u' from name values the Number of unique values drop from 8792 to 7084 unique values of resturant names.
    4. **`online_order`:** 
        * changing the value to 1 if yes and 0 if false and change dtype from object to int
    5. **`book_table`:** 
        * changing the value to 1 if yes and 0 if false and change dtype from object to int
    6. **`rate`:** 
        * Contains 15% null values we drop these rows.
        * Replace 'NEW' and '-' with np.nan & drop null values, and remove '/5' and convert data type from object to float.
    7. **`votes`:** 
        * it is looks good & nothing changed.
    8. **`phone`:** 
        * have 2% null values and high uniqueness 14926 values and it is useless for the aim of this analysis we drop this column.
    9. **`location`:** 
        * Contains 0.04% null values we drop these rows.   
    10. **`rest_type`:** 
        * Contains 0.4% null values we drop these rows.
        * After applying the `sort_rest_type` function on the `rest_type` column, the number of unique values dropped from 87 to 66.
    11. **`dish_liked`:** 
        * Contains 54% null values and high uniqueness of 5271 values we drop it.
    12. **`cuisines`:** 
        * Contains 0.087% null values we drop these rows.
        * After applying the sort_cuisines function on the cuisines column, the number of unique values dropped from 2367 to 1688.
    13. **`approx_cost_for_two_people`:** 
        * Contains 0.669% null values we drop these rows.
        * removing ',' from string using str.replace, and change data type from object to int
    14. **`reviews_list`:** 
        * As the project will not contain NLP and the uniqueness is very high 22513 values we drop this column.
    15. **`menu_item`:** 
        * as the uniqueness is very high 9098 unique values, Also, it is useless for the aim of this analysis as there are other features reflect the food types we are going to drop this column.
    16. **`listed_in_type`:** 
        * it is looks good & nothing changed.
    17. **`listed_in_city`:** 
        * it is looks good & nothing changed.
    
- saving file as cleaned_data.csv after cleaning process for next project step.