# **Zomato Dataset Exploratory Data Analysis**

## **To import libraries:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## **To read csv**

In [None]:
df = pd.read_csv('C:\\Users\\ASAD COMPUTERS\\Desktop\\Data Science\\AtomCamp\\Krish Naik\\Zomato\\zomato.csv',encoding='ISO-8859-1')
df.head()
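The absolute Windows path above only resolves on one machine. Below is a minimal sketch of a more portable load using `pathlib`. The folder layout is an assumption, and a tiny stand-in file (hypothetical contents) is written to a temporary folder so the snippet runs without the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical layout: a folder that holds zomato.csv. A temporary folder with
# a one-row stand-in file is used here so the sketch is self-contained.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "zomato.csv"
csv_path.write_text(
    "Restaurant Name,City\nCafé Demo,New Delhi\n", encoding="ISO-8859-1"
)

# Same read_csv call as in the notebook, with the path built from parts
# instead of a hard-coded, escaped Windows string.
demo_df = pd.read_csv(csv_path, encoding="ISO-8859-1")
print(demo_df.head())
```

With the real data, `data_dir` would simply point at the folder containing `zomato.csv`.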

## **To check columns**

In [None]:
df.columns

## **To determine non-null values and data type**

In [None]:
df.info()

## **To find statistical information about numerical columns**

In [None]:
df.describe()

## **What do we do in data analysis?**
1. Check for missing values
2. Explore numerical variables
3. Explore categorical variables
4. Find relationships between features

## **To check the size (i.e., rows × columns, the total number of cells)**

In [None]:
df.size

## **To check the shape (i.e., rows, columns)**

In [None]:
df.shape
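As a quick illustration of how `size` and `shape` relate, here is a small sketch on a toy DataFrame (the values are made up for the example):

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (hypothetical values).
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
total_cells = toy.size       # size is the total number of cells

print(toy.shape)
print(total_cells)
```

`size` is always the product of the two numbers in `shape`.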

## **To check the missing values profile**

In [None]:
df.isnull().sum()

<p><b>Observation</b></p>
<p><b>Cuisines</b> has <b>9</b> missing values.</p>

<p>Now, we will see:</p>
<p>whether 'Cuisines' has any <b>correlation</b> with any other <b>target variable or independent feature.</b></p>

## **List comprehension to find which columns have missing values**

In [None]:
[features for features in df.columns if df[features].isnull().sum()>0]
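The same question can also be answered without a list comprehension. A minimal sketch on a toy frame (hypothetical values) using `isnull().any()`:

```python
import numpy as np
import pandas as pd

# Toy frame: only 'Cuisines' contains a missing value (hypothetical data).
toy = pd.DataFrame(
    {"Cuisines": ["Cafe", np.nan, "Bakery"], "City": ["A", "B", "C"]}
)

# Column names that contain at least one missing value.
cols_with_na = toy.columns[toy.isnull().any()].tolist()

# Missing-value counts, restricted to the columns that have any.
na_counts = toy.isnull().sum()
na_counts = na_counts[na_counts > 0]

print(cols_with_na)
print(na_counts)
```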

## **To check the missing values profile in heat map**

In [None]:
# With the help of heat map
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

#### **We have two files (zomato.csv and Country-Code.xlsx). We have to merge them on the column 'Country Code'.**

## **To read excel file for 'Country-Code.xlsx'**

In [None]:
df_country = pd.read_excel('C:\\Users\\ASAD COMPUTERS\\Desktop\\Data Science\\AtomCamp\\Krish Naik\\Zomato\\Country-Code.xlsx')
df_country.head()

## **To get its column profile**

In [None]:
df_country.columns

## **To combine both datasets (Column = 'Country Code', Join = 'LEFT')**

In [None]:
final_df=pd.merge(df,df_country,on='Country Code',how='left')
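A left join silently keeps rows whose key has no match and can duplicate rows if the lookup table repeats a key. The sketch below, on toy frames with hypothetical values, uses the `validate` and `indicator` parameters of `pd.merge` to check both conditions.

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical values).
restaurants = pd.DataFrame(
    {"Restaurant Name": ["A", "B", "C"], "Country Code": [1, 1, 216]}
)
countries = pd.DataFrame(
    {"Country Code": [1, 216], "Country": ["India", "United States"]}
)

# validate="m:1" raises if 'Country Code' is not unique in the lookup table;
# indicator=True adds a '_merge' column saying where each row came from.
merged = pd.merge(
    restaurants,
    countries,
    on="Country Code",
    how="left",
    validate="m:1",
    indicator=True,
)

# Rows with no matching country code would show up as 'left_only'.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(len(unmatched))
```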

## **To display top 2 records**

In [None]:
final_df.head(2)

## **Another way to check data types**

In [None]:
final_df.dtypes

## **To check whether the merged DataFrame has the column 'Country'**

In [None]:
final_df.columns

## **To find out how many records each country has**

In [None]:
# To make labels for 'Country names'
country_name = final_df.Country.value_counts().index

In [None]:
# To create an array for values of total country
country_values=final_df.Country.value_counts().values

In [None]:
# To create a pie chart showing Top 3 countries 
plt.pie(country_values[:3],labels=country_name[:3],autopct='%1.2f%%')

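<p><b>Observation:</b></p>
<p>Among the top three countries, <b>India</b> accounts for the largest share of Zomato records at <b>94.39%</b>, followed by the <b>United States</b> at <b>4.73%</b> and the <b>United Kingdom</b> at <b>0.87%</b>. These percentages are shares of the three plotted countries only, not of the whole dataset.</p>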
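Because `plt.pie` normalizes whatever values it receives, the percentages on a top-3 chart differ from each country's share of the full dataset. A small sketch on toy data (hypothetical counts) makes the difference concrete:

```python
import pandas as pd

# Toy country column: 6 + 3 + 2 + 1 = 12 records (hypothetical counts).
toy = pd.Series(
    ["India"] * 6 + ["United States"] * 3 + ["United Kingdom"] * 2 + ["Brazil"],
    name="Country",
)

counts = toy.value_counts()

# Share of every country over the full dataset.
share_all = counts / counts.sum() * 100

# Share among the top three only, which is what a top-3 pie chart displays.
top3 = counts.head(3)
share_top3 = top3 / top3.sum() * 100

print(share_all.round(2))
print(share_top3.round(2))
```

Computing `value_counts()` once and reusing it for both labels and values also avoids running the same aggregation twice.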

## **To summarize the ratings in the Zomato data**

In [None]:
ratings = final_df.groupby(['Aggregate rating','Rating color','Rating text']).size().reset_index().rename(columns={0:'Rating Count'})
# size() counts the rows in each group; reset_index() turns the group keys back into regular columns.
# rename() renames the resulting count column from 0 to 'Rating Count'.
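The `reset_index()` plus `rename()` pair can be collapsed into one step, since `Series.reset_index` accepts a `name` argument. A minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# Toy ratings data (hypothetical values).
toy = pd.DataFrame(
    {
        "Aggregate rating": [0.0, 0.0, 3.1, 4.5],
        "Rating color": ["White", "White", "Orange", "Dark Green"],
        "Rating text": ["Not rated", "Not rated", "Average", "Excellent"],
    }
)

# size() returns a Series; reset_index(name=...) names the count column
# directly, so no separate rename() call is needed.
toy_ratings = (
    toy.groupby(["Aggregate rating", "Rating color", "Rating text"])
    .size()
    .reset_index(name="Rating Count")
)
print(toy_ratings)
```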

In [None]:
ratings

# Observations:
1. When rating is from <b>4.5-4.9</b> ---> <b>Excellent</b>
2. When rating is from <b>4.0-4.4</b> ---> <b>Very Good</b>
3. When rating is from <b>3.5-3.9</b> ---> <b>Good</b>
4. When rating is from <b>2.5-3.4</b> ---> <b>Average</b>
5. When rating is from <b>1.8-2.4</b> ---> <b>Poor</b>
6. When rating is <b>0</b> ---> <b>Not Rated</b>
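The bands listed above can be reproduced with `pd.cut`. This is only a sketch: the bin edges are taken from the observed ranges, and it assumes ratings have one decimal place and that nothing falls between 0 and 1.8 (such a value would be labelled "Poor" here).

```python
import pandas as pd

# Right-closed bins built from the ranges in the observations above.
bins = [-0.1, 0.0, 2.4, 3.4, 3.9, 4.4, 4.9]
labels = ["Not Rated", "Poor", "Average", "Good", "Very Good", "Excellent"]

# Sample ratings at the edges of each band (hypothetical inputs).
sample = pd.Series([0.0, 1.8, 2.4, 2.5, 3.4, 3.5, 3.9, 4.0, 4.4, 4.5, 4.9])

# Each rating is assigned the label of the interval it falls into.
bands = pd.cut(sample, bins=bins, labels=labels)
print(pd.DataFrame({"rating": sample, "band": bands}))
```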




## **Display Aggregate rating vs Rating Count**

In [None]:
import matplotlib
matplotlib.rcParams['figure.figsize']=(12,6)
sns.barplot(x='Aggregate rating',y='Rating Count',data=ratings)
# Apart from the spike at 0 (not rated), the distribution is roughly bell-shaped.

# **Giving rating specific color coding**

In [None]:
sns.barplot(x='Aggregate rating',y='Rating Count',hue='Rating color',data=ratings,palette=['blue','red','orange','yellow','green','darkgreen'])

# Observation
<p><b> 1. Not Rated: </b> the count is very high, at roughly 2,200.</p>
<p><b> 2. Most ratings</b> fall between <b> 2.5 and 3.4. </b></p>

## **Find the frequency of rating colors**

In [None]:
## Count plot: used for categorical variables
## Here it counts rows of the grouped `ratings` table (one per distinct rating value), i.e. how many rating values carry each color
sns.countplot(x='Rating color',data=ratings,palette=['blue','red','orange','yellow','green','darkgreen'])
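Note that a count plot over the grouped `ratings` table counts rows of that table, not records in the full dataset. A small sketch on toy data (hypothetical values) contrasts the two quantities:

```python
import pandas as pd

# Toy version of the grouped ratings table (hypothetical values).
toy_ratings = pd.DataFrame(
    {
        "Aggregate rating": [0.0, 2.5, 2.6, 4.5],
        "Rating color": ["White", "Orange", "Orange", "Dark Green"],
        "Rating Count": [5, 2, 3, 1],
    }
)

# What countplot shows: how many distinct rating values share each color.
rows_per_color = toy_ratings["Rating color"].value_counts()

# How many underlying records carry each color: sum the counts instead.
records_per_color = toy_ratings.groupby("Rating color")["Rating Count"].sum()

print(rows_per_color)
print(records_per_color)
```

To plot the second quantity, one option is `sns.barplot` with `x='Rating color'` and `y='Rating Count'` on the summed table.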

## **Which countries have records with a 0 rating?**

In [None]:
# Boolean indexing keeps the rows where 'Aggregate rating' is 0; grouping by 'Country' and calling size() counts them per country
final_df[final_df['Aggregate rating']==0.0].groupby('Country').size()

In [None]:
# Another way: filter on the 'White' rating color, which marks unrated entries
final_df[final_df['Rating color']=='White'].groupby(['Country']).size().reset_index()

# Observation
The largest number of <b>0</b> ratings comes from <b>India</b>.

## **Which currency is used by which country?**

In [None]:
final_df.columns

In [None]:
# To show currency for each country:
final_df[['Country','Currency']].groupby(['Country','Currency']).size().reset_index() 
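If only the country-to-currency pairs are needed and not the row counts, `drop_duplicates` is a lighter alternative to grouping. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy country/currency columns (hypothetical values).
toy = pd.DataFrame(
    {
        "Country": ["India", "India", "United States"],
        "Currency": ["Indian Rupees(Rs.)", "Indian Rupees(Rs.)", "Dollar($)"],
    }
)

# One row per distinct (Country, Currency) pair, in order of first appearance.
pairs = toy[["Country", "Currency"]].drop_duplicates().reset_index(drop=True)
print(pairs)
```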

## **Which countries have an online delivery option?**

In [None]:
final_df[['Has Online delivery','Country']].groupby(['Has Online delivery','Country']).size().reset_index()

In [None]:
final_df[final_df['Has Online delivery']=='Yes'].Country.value_counts()

# Observation
Online delivery is available only in <b>India</b> and the <b>UAE</b>.
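A cross-tabulation gives the same answer in a single table, with one row per country and one column per delivery option. The sketch below uses toy data with hypothetical values:

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame(
    {
        "Country": ["India", "India", "UAE", "United States"],
        "Has Online delivery": ["Yes", "No", "Yes", "No"],
    }
)

# Rows: countries; columns: 'No' / 'Yes'; cells: number of records.
table = pd.crosstab(toy["Country"], toy["Has Online delivery"])

# Countries with at least one record that offers online delivery.
countries_with_delivery = table.index[table["Yes"] > 0].tolist()

print(table)
print(countries_with_delivery)
```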

## **Create a pie chart of the city distribution**

In [None]:
city_values = final_df.City.value_counts().values #array for values for city
city_labels = final_df.City.value_counts().index #labels for city names

In [None]:
# To plot a pie chart for city distribution
plt.pie(city_values[:5],labels=city_labels[:5],autopct='%1.2f%%')

# Observation
Among the five most frequent cities, <b>New Delhi</b> has the largest share at <b>68.87%</b>, followed by <b>Gurgaon</b> at <b>14.07%</b>; <b>Ghaziabad</b> has the smallest at <b>0.31%</b>.
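Slicing to the top five drops every other city, so the chart's percentages no longer refer to the whole dataset. One possible variation, sketched here on toy data with hypothetical counts, is to fold the remaining cities into an "Other" slice:

```python
import pandas as pd

# Toy city column: 12 records in total (hypothetical counts).
toy = pd.Series(
    ["New Delhi"] * 5 + ["Gurgaon"] * 3 + ["Noida"] * 2 + ["Pune", "Agra"],
    name="City",
)

counts = toy.value_counts()
top_n = 3  # illustrative cut-off; the notebook uses 5

# Keep the top N cities and sum everything else into a single 'Other' entry.
top = counts.head(top_n)
other_total = counts.iloc[top_n:].sum()
pie_data = pd.concat([top, pd.Series({"Other": other_total})])

print(pie_data)
# To draw it: plt.pie(pie_data.values, labels=pie_data.index, autopct='%1.2f%%')
```

With the "Other" slice included, the percentages add up over all records rather than over the plotted cities only.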

## **Find the top 10 cuisines**

In [None]:
cuisines_count = final_df.Cuisines.value_counts().values
cuisines_labels = final_df.Cuisines.value_counts().index

In [None]:
plt.pie(cuisines_count[:10],labels=cuisines_labels[:10],autopct='%1.2f%%')

# Observation
Among the ten most frequent entries, <b>North Indian</b> is the top cuisine at <b>26.58%</b>, followed by <b>North Indian, Chinese</b>; <b>Street Food</b> is last of the ten at <b>4.23%</b>.
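Entries such as "North Indian, Chinese" show that the `Cuisines` column holds comma-separated combinations, so `value_counts()` ranks combinations rather than individual cuisines. A minimal sketch, on toy data with hypothetical values, that splits the combinations before counting:

```python
import numpy as np
import pandas as pd

# Toy cuisines column, including one missing value (hypothetical data).
toy = pd.Series(
    ["North Indian", "North Indian, Chinese", "Chinese", np.nan, "Cafe, Bakery"],
    name="Cuisines",
)

# Drop missing values, split on commas, put each cuisine on its own row,
# and strip the whitespace left over from the split.
single = toy.dropna().str.split(",").explode().str.strip()
single_counts = single.value_counts()

print(single_counts)
```

Whether combinations or individual cuisines are the right unit depends on the question being asked; both views are valid.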