# Exploratory Data Analysis (EDA) on Zomato dataset

Let us learn about univariate and bivariate analysis using a dataset from Zomato, the online food delivery platform.

Before we go ahead with the analysis, let us have a quick understanding of all the columns in the data.

Here’s a brief column-wise description of the zomato.csv dataset for your reference.

1. **Restaurant ID**: A unique identifier for each restaurant in the dataset.
2. **Restaurant Name**: The name of the restaurant.
3. **Country Code**: A numeric code representing the country where the restaurant is located.
4. **City**: The city where the restaurant is located.
5. **Address**: The physical address of the restaurant.
6. **Locality**: A more specific location or area within the city.
7. **Locality Verbose**: A detailed description of the restaurant's location.
8. **Longitude**: The geographic longitude of the restaurant's location.
9. **Latitude**: The geographic latitude of the restaurant's location.
10. **Cuisines**: A list of cuisines offered by the restaurant, such as Indian, Chinese, Italian, etc.
11. **Average Cost for two**: The average cost of a meal for two people at the restaurant.
12. **Currency**: The currency used in the restaurant's pricing.
13. **Has Table booking**: A boolean column indicating whether the restaurant accepts table bookings.
14. **Has Online delivery**: A boolean column indicating whether the restaurant offers online delivery.
15. **Is delivering now**: A boolean column indicating whether the restaurant is currently delivering food.
16. **Switch to order menu**: A boolean column indicating if the restaurant has a menu for ordering food online.
17. **Price range**: A numerical value indicating the price range at the restaurant (1 to 4 scale).
18. **Aggregate rating**: The overall rating of the restaurant based on user reviews.
19. **Rating color**: The rating category (e.g., Dark Green, Green, Orange, etc.), indicating the quality of the restaurant based on the rating.
20. **Rating text**: Descriptive text based on the restaurant's rating (e.g., Excellent, Very Good, Good, etc.).
21. **Votes**: The number of user votes or reviews received for the restaurant.

The 'Country-Code.xlsx' file has two columns:
1. **Country**: The name of the country where the restaurant is located.
2. **Country Code**: The numeric code for each country in the dataset.

### Loading the necessary libraries

In [1]:
# Import necessary libraries required for the data exploration
# Loading "numpy" and "pandas" for manipulating numbers, vectors and data frames
# Loading "matplotlib.pyplot" and "seaborn" for data visualisation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Reading in the 'zomato.csv' and 'Country-Code.xlsx' files as Pandas DataFrames

Here we read a CSV file stored on the local drive. You can also use the URL of the dataset from Kaggle (https://www.kaggle.com/datasets/swatikhedekar/eda-on-zomato-dataset) or GitHub, or use the reader function that matches the file format of the input data.

In [2]:
# Load the zomato.csv file

zomato_df=pd.read_csv("")
# Display the first few rows of the dataframe


# Load the Country-Code.xlsx file


# Display the first few rows of the dataframe


Now that we have loaded both datasets, zomato_df (from zomato.csv) and country_code_df (from Country-Code.xlsx), let us merge them into a single DataFrame to get the country names.

In [3]:
# Merge the datasets on the 'Country Code' column


# Print the column names to check the correct column names


# After the merge, ensure the correct column name for 'Country'
# Check if there's a column like 'Country_x' or 'Country_y' and select it


# Drop unnecessary columns (if any)
# Since we already have the country names in the column 'Country', we would not need the columns 'Country Code' and 'Country_y'. 
# So we are dropping the same.


# Display the updated dataframe


# Now drop 'Country_x' as we have country names in 'Country' column
# Similarly, since we already have the country names in the column 'Country', we would not need the columns 'Country_x'
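The merge step can be sketched as follows. This is only an illustration under stated assumptions: the two small DataFrames are hypothetical stand-ins for zomato.csv and Country-Code.xlsx with made-up rows, and `merged_df` is a name chosen for this example.

```python
import pandas as pd

# Hypothetical stand-ins for the two files (made-up rows, not real data)
zomato_df = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C"],
    "Country Code": [1, 216, 1],
    "Aggregate rating": [4.1, 3.8, 0.0],
})
country_code_df = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# A left merge keeps every restaurant, even if its code has no match
merged_df = zomato_df.merge(country_code_df, on="Country Code", how="left")

# Inspect the column names; suffixes such as _x/_y only appear when both
# frames share a column name that is not the merge key
print(merged_df.columns.tolist())

# Drop the numeric code now that the country name is available
merged_df = merged_df.drop(columns=["Country Code"])
print(merged_df.head())
```

If the real files share extra column names, the printed column list shows which suffixed columns need to be renamed or dropped.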


In [4]:
# We are checking if we have all the relevant columns, and have dropped unnecessary columns.
# Getting the column names of the data frame using ".columns"


Let's get started with our exploration. We shall do this by answering a few critical questions that may be of interest to the company.

# Questions for Practice

1. What is the distribution of restaurant ratings across all restaurants? What inferences can you draw from the same?

   (Note that each row represents a restaurant)
   

In [5]:
# Filter out the restaurants with a rating of 0 since no rating was given to those restaurants


# Plotting the distribution of restaurant ratings


# Adding title and labels


# Display the plot
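One possible way to work through these steps is sketched below. The ratings are hypothetical values invented for the example, and `rated` is an illustrative name; a histogram is one reasonable choice of plot, not the only one.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical ratings; 0 stands for "no rating given"
df = pd.DataFrame(
    {"Aggregate rating": [0.0, 3.1, 3.4, 3.6, 4.0, 4.4, 0.0, 2.8]}
)

# Keep only restaurants that actually received a rating
rated = df[df["Aggregate rating"] > 0]

# Histogram of the remaining ratings
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(rated["Aggregate rating"], bins=10, edgecolor="black")

# Title and axis labels
ax.set_title("Distribution of restaurant ratings (rated only)")
ax.set_xlabel("Aggregate rating")
ax.set_ylabel("Number of restaurants")

plt.show()
```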


### Inferences drawn from the solution

2. What is the average price range of restaurants across different cities? List the cities with the highest average and the ones with the lowest average. Draw conclusions from the output.

(Note that price range values are integers from 1 to 4, where 1 represents affordable restaurants and 4 represents expensive restaurants.)

To answer this question, you can calculate the mean price range for each city.

In [6]:
# Group by 'City' and calculate the mean 'Price range' and sort them in descending order


# Get the highest price range and lowest price range


# List all cities with the highest price range


# List all cities with the lowest price range


# Print the results
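A minimal sketch of the group-and-compare logic, assuming the column names 'City' and 'Price range' from the description above. The rows are hypothetical, and the variable names are chosen for illustration.

```python
import pandas as pd

# Hypothetical rows; 'Price range' runs from 1 (affordable) to 4 (expensive)
df = pd.DataFrame({
    "City": ["Pune", "Pune", "Goa", "Goa", "Agra", "Delhi"],
    "Price range": [4, 4, 1, 1, 2, 4],
})

# Mean price range per city, highest first
city_price = (
    df.groupby("City")["Price range"].mean().sort_values(ascending=False)
)

# Several cities can tie, so collect every city at each extreme
highest, lowest = city_price.max(), city_price.min()
top_cities = city_price[city_price == highest].index.tolist()
bottom_cities = city_price[city_price == lowest].index.tolist()

print(f"Highest mean price range {highest}: {top_cities}")
print(f"Lowest mean price range {lowest}: {bottom_cities}")
```

Comparing against the maximum and minimum, rather than taking only the first and last rows, keeps tied cities in the result.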




### Inferences drawn from the solution

3. What is the distribution of restaurants that are offering online delivery across countries? What inferences can you draw from the same?

To answer this question, we need the number of restaurants that offer online delivery and the number that do not, both overall and per country, so that the countries can be compared.
  

In [7]:
# Frequency of restaurants offering online delivery


# Plotting the frequency of online delivery


# Adding title and labels


# Now, to check if it varies significantly across different countries
# Group by 'Country' ('Country Code' was dropped after the merge) and get the count of restaurants offering online delivery


# Display the result to understand the variation across countries
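The counting step could look like the sketch below. It assumes the merged data has a 'Country' column and that 'Has Online delivery' holds "Yes"/"No" strings; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Country": ["India", "India", "India", "UAE", "UAE", "Brazil"],
    "Has Online delivery": ["Yes", "No", "Yes", "Yes", "No", "No"],
})

# Overall frequency of restaurants with and without online delivery
overall = df["Has Online delivery"].value_counts()
print(overall)

# Country-by-delivery counts; a country with no "Yes" rows gets 0
by_country = pd.crosstab(df["Country"], df["Has Online delivery"])

# Share of restaurants in each country that offer online delivery
by_country["Share Yes"] = by_country["Yes"] / by_country.sum(axis=1)
print(by_country)
```

Shares are easier to compare than raw counts when one country contributes most of the rows.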


### Inferences drawn from the solution

4. What is the distribution of the 'Average Cost for two' in India? Draw inferences from the output.

In [8]:
# Filter for Indian restaurants


# Plot the distribution


# Display summary statistics
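A short sketch of the filter and summary steps, assuming a 'Country' column from the merge and using hypothetical cost values:

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "UAE"],
    "Average Cost for two": [300, 500, 800, 2400, 150],
})

# Keep only Indian restaurants
india = df[df["Country"] == "India"]
cost = india["Average Cost for two"]

# Summary statistics: count, mean, std, min, quartiles, max
print(cost.describe())

# A mean well above the median hints at a right-skewed distribution
print("mean:", cost.mean(), "median:", cost.median(), "skew:", cost.skew())
```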


### Inferences drawn from the solution

So we can see that 'North Indian' cuisine is most popular, and forms 9.8% of the share of all cuisines.

5. Does the availability of the feature 'Has Table booking' impact the aggregate rating? Draw inferences from the output.

To create the boxplot excluding the 0 ratings, we'll first filter the dataset to remove restaurants with a rating of 0. Then, we will generate the boxplot to compare the aggregate ratings between restaurants with and without table booking.

In [9]:
# Filter out the restaurants with 0 ratings


# Boxplot for 'Has Table booking' vs 'Aggregate rating' (excluding 0 ratings)


# Adding title and labels


# Display the plot


# Calculate the average ratings for restaurants with and without table booking (excluding 0 ratings)


# Display the average ratings
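The group comparison might be sketched like this. The rows are hypothetical, and 'Has Table booking' is assumed to hold "Yes"/"No" strings. The quartiles printed per group are the values a boxplot is built from.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Has Table booking": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "Aggregate rating": [4.0, 3.0, 4.4, 0.0, 3.4, 3.6],
})

# Exclude restaurants with a rating of 0
rated = df[df["Aggregate rating"] > 0]

# Quartiles and extremes of the rating for each group
summary = rated.groupby("Has Table booking")["Aggregate rating"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])

# Average rating with and without table booking
mean_rating = rated.groupby("Has Table booking")["Aggregate rating"].mean()
print(mean_rating)
```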


In [10]:
# Calculate the correlation between table booking (Yes coded as 1, No as 0) and aggregate rating

# Convert 'Has Table booking' column to numeric (Yes = 1, No = 0)


# Calculate the Pearson correlation coefficient between 'Table booking' and 'Aggregate rating'


# Display the correlation result
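A minimal sketch of the encoding and correlation steps, using hypothetical rows. The column name 'Table booking' for the numeric copy is chosen for illustration. Pearson correlation with a 0/1 variable is also known as the point-biserial correlation.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Has Table booking": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "Aggregate rating": [4.0, 3.0, 4.4, 0.0, 3.4, 3.6],
})

# Exclude unrated restaurants; copy so the new column is added safely
rated = df[df["Aggregate rating"] > 0].copy()

# Encode Yes/No as 1/0
rated["Table booking"] = rated["Has Table booking"].map({"Yes": 1, "No": 0})

# Series.corr uses the Pearson method by default
corr = rated["Table booking"].corr(rated["Aggregate rating"])
print(f"Pearson correlation: {corr:.3f}")
```

A coefficient near 0 suggests little linear association, while values closer to 1 or -1 suggest a stronger one. Correlation alone does not show that table booking causes higher ratings.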



### Inferences drawn from the solution

6. Does having an online delivery option influence a restaurant's rating? Among restaurants that do not offer online delivery, which country has the best average rating and which has the worst? Draw inferences from the output.

To investigate whether an online delivery option influences restaurant ratings, we can compare the ratings of restaurants that offer online delivery with those that do not. We will use a box plot to visually compare the rating distributions of the two groups, excluding restaurants with a rating of 0.

In [11]:
# Filter out the restaurants with a 0 rating


# Boxplot for 'Has Online delivery' vs 'Aggregate rating' (excluding 0 ratings)


# Adding title and labels


# Display the plot


# Calculate the best and worst countries for restaurants without online delivery


# Group by 'Country' ('Country Code' was dropped after the merge) and calculate the mean rating for restaurants without online delivery


# Find the country with the highest and lowest ratings


# Get the average ratings for the best and worst countries


# Display the results
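The best and worst country lookup could be sketched as follows, assuming a 'Country' column from the merge and "Yes"/"No" values in 'Has Online delivery'. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Country": ["India", "India", "UAE", "UAE", "Brazil", "Brazil", "India"],
    "Has Online delivery": ["No", "Yes", "No", "No", "No", "No", "No"],
    "Aggregate rating": [3.0, 3.8, 4.2, 4.4, 3.5, 0.0, 3.2],
})

# Exclude restaurants with a rating of 0
rated = df[df["Aggregate rating"] > 0]

# Mean rating with and without online delivery
print(rated.groupby("Has Online delivery")["Aggregate rating"].mean())

# Restrict to restaurants without online delivery, then average per country
no_delivery = rated[rated["Has Online delivery"] == "No"]
country_mean = no_delivery.groupby("Country")["Aggregate rating"].mean()

# idxmax/idxmin return the country label of the extreme mean
best, worst = country_mean.idxmax(), country_mean.idxmin()
print(f"Best: {best} ({country_mean[best]:.2f})")
print(f"Worst: {worst} ({country_mean[worst]:.2f})")
```

Checking the number of restaurants behind each mean is worthwhile, since a country with only a handful of rows can top or trail the list by chance.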


### Inferences drawn from the solution

7. Do restaurants in different countries have significantly different ratings?

In [12]:
# Filter out the restaurants with a 0 rating


# Boxplot to visualize the distribution of ratings across countries
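Alongside the boxplot, a numeric summary per country can support the comparison. The sketch below uses hypothetical rows and assumes a 'Country' column. It is descriptive only; judging whether differences are statistically significant would need a formal test such as a one-way ANOVA.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Country": ["India", "India", "India", "UAE", "UAE", "Brazil", "Brazil"],
    "Aggregate rating": [3.0, 3.4, 3.2, 4.2, 4.4, 3.6, 0.0],
})

# Exclude restaurants with a rating of 0
rated = df[df["Aggregate rating"] > 0]

# Count, centre and spread of ratings per country, highest median first
summary = (
    rated.groupby("Country")["Aggregate rating"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("median", ascending=False)
)
print(summary)
```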


### Inferences drawn from the solution

8. What is the distribution of restaurants with and without a table booking option among those that serve only 'Italian', only 'Chinese', or only 'American' cuisine? Draw inferences from the output.

To analyze the distribution of restaurants that serve only 'Italian', 'Chinese', and 'American' cuisines, and whether they offer table booking, we can follow these steps:

In [13]:
# Filter the dataset for restaurants that serve only 'Italian', 'Chinese', or 'American' cuisines


# Countplot for 'Cuisines' vs 'Has Table booking'


# Adding title and labels
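One way to pick out single-cuisine restaurants is an exact match on the 'Cuisines' string, as sketched below with hypothetical rows. This assumes a restaurant with several cuisines lists them in one comma-separated string, so an exact match excludes it.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Cuisines": ["Italian", "Chinese", "American", "Italian, Pizza",
                 "Chinese", "North Indian", "Italian"],
    "Has Table booking": ["Yes", "No", "No", "Yes", "No", "Yes", "No"],
})

# Exact match keeps restaurants whose only cuisine is one of the three
only_one = df[df["Cuisines"].isin(["Italian", "Chinese", "American"])]

# Counts of restaurants per cuisine, split by table booking
counts = pd.crosstab(only_one["Cuisines"], only_one["Has Table booking"])
print(counts)
```

The same table is what a count plot with 'Cuisines' on the x-axis and 'Has Table booking' as the hue would display.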


### Inferences drawn from the solution

9. Since 90% of the restaurants listed on Zomato in this data are from India, find out if the average cost for two has an influence on ratings in India. Draw inferences from the output.

To explore whether there is a relationship between the average cost for two and the restaurant ratings (Aggregate rating), we can calculate the correlation coefficient between the two variables and visualize their relationship using a scatter plot.

In [14]:
# Filter data for India only


# Remove rows with 0 rating or 0 average cost for two


# Create a new DataFrame with only 'Average Cost for two' and 'Aggregate rating'


# Select relevant columns


# Scatter plot


# Calculate the correlation coefficient between cost and rating


# Print the result
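The filtering and correlation steps can be sketched as below, with hypothetical rows and an assumed 'Country' column from the merge.

```python
import pandas as pd

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "India", "UAE"],
    "Average Cost for two": [200, 400, 0, 1200, 2500, 100],
    "Aggregate rating": [3.0, 3.4, 3.1, 0.0, 4.5, 4.0],
})

# Keep Indian restaurants only
india = df[df["Country"] == "India"]

# Drop rows with a 0 rating or a 0 average cost
valid = india[
    (india["Aggregate rating"] > 0) & (india["Average Cost for two"] > 0)
]

# Keep only the two columns of interest
cost_rating = valid[["Average Cost for two", "Aggregate rating"]]

# Pearson correlation between cost and rating
corr = cost_rating["Average Cost for two"].corr(cost_rating["Aggregate rating"])
print(f"Correlation between cost and rating: {corr:.3f}")
```

A scatter plot of the same two columns helps check whether the relationship is roughly linear, which is what the Pearson coefficient measures.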


### Inferences drawn from the solution

10. How does the price range affect the average ratings of restaurants, and how is this influenced by the presence of online delivery? Draw inferences from the output.

In [15]:
# Group by 'Price range' and 'Has Online delivery' to calculate the mean 'Aggregate rating'


# Plotting the results as a line plot to show how the average rating changes by price range and online delivery


# Adding title and labels
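A minimal sketch of the two-way grouping and line plot, using hypothetical rows. Reshaping with `unstack` puts one column per delivery option, so each option becomes its own line.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows (made up for illustration)
df = pd.DataFrame({
    "Price range": [1, 1, 2, 2, 3, 3, 4, 4],
    "Has Online delivery": ["Yes", "No"] * 4,
    "Aggregate rating": [3.2, 3.0, 3.6, 3.3, 3.9, 3.8, 4.1, 4.3],
})

# Mean rating for every (price range, online delivery) combination,
# reshaped so that each delivery option is a column
mean_rating = (
    df.groupby(["Price range", "Has Online delivery"])["Aggregate rating"]
    .mean()
    .unstack("Has Online delivery")
)
print(mean_rating)

# One line per delivery option across the price ranges
ax = mean_rating.plot(marker="o", figsize=(8, 4))
ax.set_title("Mean rating by price range and online delivery")
ax.set_xlabel("Price range")
ax.set_ylabel("Mean aggregate rating")
plt.show()
```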


### Inferences drawn from the solution