# Food Ordering and Delivery App - Exploratory Data Analysis

In this notebook, we will do a thorough EDA of the dataset and try to find key insights by cross-referencing different features.

Let us start by importing the required libraries.

In [13]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Understanding the structure of the data

In [14]:
# read the CSV file
df_orig = pd.read_csv("food_order.csv")

In [15]:
df = df_orig.copy()

# return the first 5 rows
df.head()

Unnamed: 0,order_id,customer_id,restaurant_name,cuisine_type,cost_of_the_order,day_of_the_week,rating,food_preparation_time,delivery_time
0,1477147,337525,Hangawi,Korean,30.75,Weekend,Not given,25,20
1,1477685,358141,Blue Ribbon Sushi Izakaya,Japanese,12.08,Weekend,Not given,25,23
2,1477070,66393,Cafe Habana,Mexican,12.23,Weekday,5,23,28
3,1477334,106968,Blue Ribbon Fried Chicken,American,29.2,Weekend,3,25,15
4,1478249,76942,Dirty Bird to Go,American,11.59,Weekday,4,25,24


In [16]:
print("There are", df.shape[0], "Rows and", df.shape[1], "Columns in the given dataset")

There are 1898 Rows and 9 Columns in the given dataset


In [17]:
# Use info() to print a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB


## Creating unclean data for the purpose of Data Cleaning/Data Transformation

In [113]:
# Copy and store the dataframe in another df to make it unclean
df_dirty = df_orig.copy()

In [114]:
# 1) Null Values in Cells: Replace "Not given" in the "rating" column with NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df_dirty.replace({"rating":{"Not given":np.nan}}, inplace=True)

In [115]:
# 2) Zero Values: Set some of the "cost_of_the_order" values to 0
df_dirty.replace({"cost_of_the_order":{29.20:0}}, inplace=True)
len(df_dirty[df_dirty["cost_of_the_order"]==0])

10

In [116]:
# 3) Bad Column Naming: Add a new column "voucher_used" that is True when "cost_of_the_order" is 0
df_dirty["voucher_used"] = df_dirty["cost_of_the_order"] == 0
df_dirty.head()

Unnamed: 0,order_id,customer_id,restaurant_name,cuisine_type,cost_of_the_order,day_of_the_week,rating,food_preparation_time,delivery_time,voucher_used
0,1477147,337525,Hangawi,Korean,30.75,Weekend,,25,20,False
1,1477685,358141,Blue Ribbon Sushi Izakaya,Japanese,12.08,Weekend,,25,23,False
2,1477070,66393,Cafe Habana,Mexican,12.23,Weekday,5.0,23,28,False
3,1477334,106968,Blue Ribbon Fried Chicken,American,0.0,Weekend,3.0,25,15,True
4,1478249,76942,Dirty Bird to Go,American,11.59,Weekday,4.0,25,24,False


In [117]:
# 4) Duplicate Rows: Take some rows and append them back to cause duplicate rows
dup_rows = df_dirty.iloc[[2,12,20,26,34,37,49]]
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df_dirty = pd.concat([df_dirty, dup_rows], ignore_index=True)

In [118]:
# Count number of duplicate rows
# df_dirty.groupby(df_dirty.columns.tolist(), as_index=False).size()
print("Number of duplicate rows: ", len(df_dirty)-len(df_dirty.drop_duplicates()))

Number of duplicate rows:  7
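
The count above can also be obtained with `DataFrame.duplicated()`, which makes it possible to look at the repeated rows before deciding how to handle them. The following is a minimal sketch on a toy frame (the name `toy` and its values are illustrative assumptions, not taken from `food_order.csv`):

```python
import pandas as pd

# Toy frame standing in for df_dirty; the values are illustrative only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 2],
    "cost_of_the_order": [12.23, 29.20, 11.59, 29.20],
})

# duplicated() marks every repeat after the first occurrence of a row
n_duplicates = int(toy.duplicated().sum())

# keep=False marks all copies, which makes the repeated rows easy to inspect
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Number of duplicate rows:", n_duplicates)
print(all_copies)
```

Applied to `df_dirty`, the same calls should report the rows that were appended in the previous step.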


## Dirty Dataset: Start Here

In [120]:
df_dirty.head(10)

Unnamed: 0,order_id,customer_id,restaurant_name,cuisine_type,cost_of_the_order,day_of_the_week,rating,food_preparation_time,delivery_time,voucher_used
0,1477147,337525,Hangawi,Korean,30.75,Weekend,,25,20,False
1,1477685,358141,Blue Ribbon Sushi Izakaya,Japanese,12.08,Weekend,,25,23,False
2,1477070,66393,Cafe Habana,Mexican,12.23,Weekday,5.0,23,28,False
3,1477334,106968,Blue Ribbon Fried Chicken,American,0.0,Weekend,3.0,25,15,True
4,1478249,76942,Dirty Bird to Go,American,11.59,Weekday,4.0,25,24,False
5,1477224,147468,Tamarind TriBeCa,Indian,25.22,Weekday,3.0,20,24,False
6,1477894,157711,The Meatball Shop,Italian,6.07,Weekend,,28,21,False
7,1477859,89574,Barbounia,Mediterranean,5.97,Weekday,3.0,33,30,False
8,1477174,121706,Anjappar Chettinad,Indian,16.44,Weekday,5.0,21,26,False
9,1477311,39705,Bukhara Grill,Indian,7.18,Weekday,5.0,29,26,False


# Data Exploration, Cleaning and Pipelines

In [64]:
print("The unique ratings are", df_dirty["rating"].unique())

The unique ratings are [nan '5' '3' '4']
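
The quotes in the output show that the ratings are stored as strings alongside `nan`, so numeric summaries such as a mean are not available yet. One possible way to convert them, sketched on a small stand-in Series (the name `ratings` and its values are illustrative assumptions), is `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Stand-in for df_dirty["rating"]: string ratings mixed with missing values
ratings = pd.Series([np.nan, "5", "3", "4"], name="rating")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric_ratings = pd.to_numeric(ratings, errors="coerce")

# mean() skips missing values by default
mean_rating = numeric_ratings.mean()

print(numeric_ratings)
print("Mean of the given ratings:", mean_rating)
```

Whether missing ratings should stay as NaN or be filled in depends on the analysis; this sketch only leaves them missing.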


In [65]:
print("The unique delivery times are", df_dirty["delivery_time"].unique())

The unique delivery times are [20 23 28 15 24 21 30 26 22 17 25 16 29 27 18 31 32 19 33]


In [66]:
print("The unique food preparation times are", df_dirty["food_preparation_time"].unique())

The unique food preparation times are [25 23 20 28 33 21 29 34 24 30 35 32 31 27 22 26]
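
The cleaning steps suggested by the dirty dataset can be chained with `DataFrame.pipe`. The sketch below is one possible arrangement, not a prescribed method: the function names are hypothetical, the toy rows are illustrative, and treating a cost of 0 as missing is an assumption that may not suit every analysis.

```python
import numpy as np
import pandas as pd


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that repeat an earlier row in every column."""
    return df.drop_duplicates().reset_index(drop=True)


def rating_to_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Convert string ratings such as '5' to numbers, keeping NaN as missing."""
    return df.assign(rating=pd.to_numeric(df["rating"], errors="coerce"))


def zero_cost_to_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Treat a cost of 0 as missing (an assumption, not a rule from the dataset)."""
    return df.assign(cost_of_the_order=df["cost_of_the_order"].replace(0, np.nan))


# Toy rows standing in for df_dirty; the values are illustrative only
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 2],
    "cost_of_the_order": [12.23, 0.0, 11.59, 0.0],
    "rating": ["5", np.nan, "4", np.nan],
})

# Each step returns a new DataFrame, so the input frame is left untouched
cleaned = (
    toy.pipe(drop_exact_duplicates)
       .pipe(rating_to_numeric)
       .pipe(zero_cost_to_missing)
)

print(cleaned)
```

Keeping each step as a small function makes it easy to reorder, drop, or test steps individually.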


# Data Analysis and Visualization

# Storytelling with Data
