
# Project Foundations for Data Science: FoodHub Data Analysis

**Marks: 60**

### Context

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

### Objective

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

### Data Description

The data contains information about individual food orders. The detailed data dictionary is given below.

### Data Dictionary

* order_id: Unique ID of the order
* customer_id: ID of the customer who ordered the food
* restaurant_name: Name of the restaurant
* cuisine_type: Cuisine ordered by the customer
* cost_of_the_order: Cost of the order
* day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
* rating: Rating given by the customer out of 5
* food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
* delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off confirmation.

### Let us start by importing the required libraries

In [1]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
#Setting up decimal values
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
#Setting the color palettes
colors = ['#41d8ca','#c7e3e5','#f9d7c5','#f4a2a9','#ea5058','#eff051','#dbe7b4','#99cadd','#a9bdd5','#5f7c2a','#ffd328','#e2e9d4']
sns.set(style="white")
sns.set_palette(colors)

Mount Google Drive so the notebook can read the dataset:

In [3]:
#Granting access to drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Understanding the structure of the data

In [5]:
# read the data
data = pd.read_csv('/content/drive/MyDrive/2024/Machine Learning/ProjectI/foodhub_order.csv')
#Make a copy to not change the original dataset
df = data.copy()
# returns the first 5 rows
df.head()

Unnamed: 0,order_id,customer_id,restaurant_name,cuisine_type,cost_of_the_order,day_of_the_week,rating,food_preparation_time,delivery_time
0,1477147,337525,Hangawi,Korean,30.75,Weekend,Not given,25,20
1,1477685,358141,Blue Ribbon Sushi Izakaya,Japanese,12.08,Weekend,Not given,25,23
2,1477070,66393,Cafe Habana,Mexican,12.23,Weekday,5,23,28
3,1477334,106968,Blue Ribbon Fried Chicken,American,29.2,Weekend,3,25,15
4,1478249,76942,Dirty Bird to Go,American,11.59,Weekday,4,25,24


In [6]:
#checks the tail of the data
df.tail()

Unnamed: 0,order_id,customer_id,restaurant_name,cuisine_type,cost_of_the_order,day_of_the_week,rating,food_preparation_time,delivery_time
1893,1476701,292602,Chipotle Mexican Grill $1.99 Delivery,Mexican,22.31,Weekend,5,31,17
1894,1477421,397537,The Smile,American,12.18,Weekend,5,31,19
1895,1477819,35309,Blue Ribbon Sushi,Japanese,25.22,Weekday,Not given,31,24
1896,1477513,64151,Jack's Wife Freda,Mediterranean,12.18,Weekday,5,23,31
1897,1478056,120353,Blue Ribbon Sushi,Japanese,19.45,Weekend,Not given,28,24


#### Observations:

The DataFrame has 9 columns, as listed in the Data Dictionary. Each row corresponds to an order placed by a customer. In the rating column, missing values are represented by the string 'Not given'.

### **Question 1:** How many rows and columns are present in the data? [0.5 mark]

In [7]:
# Write your code here
print(f'The data set has {df.shape[0]} rows and {df.shape[1]} columns.')

The data set has 1898 rows and 9 columns.


#### Observations:
The data set contains 1898 rows, each of which appears to represent one order.

### **Question 2:** What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]

In [8]:
# Use info() to print a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB


In [10]:
#It might be useful to define lists of the numerical and categorical columns
numerical_col = ['cost_of_the_order','food_preparation_time','delivery_time']
categorical_col = ['order_id','customer_id','restaurant_name','cuisine_type','day_of_the_week','rating']


#### Observations:
* order_id: Stored as an integer, but it is a categorical nominal variable that identifies each order. Arithmetic on it is meaningless, so its numeric representation carries no significance here.

* customer_id: Also stored as an integer. Like order_id, it is an identifier, so it should be treated as a categorical variable rather than a numerical one.

* restaurant_name and cuisine_type: Both are stored as object types, which is appropriate for categorical nominal data.

* cost_of_the_order: A continuous numerical variable, appropriately stored as a float.

* day_of_the_week: Stored as an object type. It is a categorical binary variable with two possible values: Weekend or Weekday.

* rating: Stored as an object type because of the 'Not given' entries, but it represents a categorical ordinal variable with an inherent ranking. For statistical analysis it may be useful to convert the given ratings to a numeric type.

* food_preparation_time and delivery_time: Both are numerical measures of time duration in minutes and are appropriately stored as integers.

### **Question 3:** Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]

In [9]:

#count the missing values in each column
df.isnull().sum()

order_id                 0
customer_id              0
restaurant_name          0
cuisine_type             0
cost_of_the_order        0
day_of_the_week          0
rating                   0
food_preparation_time    0
delivery_time            0
dtype: int64

In the initial exploration, it was noted that the string 'Not given' is used in the rating column to mark a missing value. Other categorical columns might use similar placeholder strings, so a further inspection of their unique values is needed.

In [12]:
for col in categorical_col:
  print(df[col].unique())
  print("*"*60)

[1477147 1477685 1477070 ... 1477819 1477513 1478056]
************************************************************
[337525 358141  66393 ...  97838 292602 397537]
************************************************************
['Hangawi' 'Blue Ribbon Sushi Izakaya' 'Cafe Habana'
 'Blue Ribbon Fried Chicken' 'Dirty Bird to Go' 'Tamarind TriBeCa'
 'The Meatball Shop' 'Barbounia' 'Anjappar Chettinad' 'Bukhara Grill'
 'Big Wong Restaurant \x8c_¤¾Ñ¼' 'Empanada Mama (closed)' 'Pylos'
 "Lucky's Famous Burgers" 'Shake Shack' 'Sushi of Gari' 'RedFarm Hudson'
 'Blue Ribbon Sushi' 'Five Guys Burgers and Fries' 'Tortaria'
 'Cafe Mogador' 'Otto Enoteca Pizzeria' 'Vezzo Thin Crust Pizza'
 'Sushi of Gari 46' 'The Kati Roll Company' 'Klong' '5 Napkin Burger'
 'TAO' 'Parm' 'Sushi Samba' 'Haru Gramercy Park'
 'Chipotle Mexican Grill $1.99 Delivery' 'RedFarm Broadway' 'Cafeteria'
 'DuMont Burger' "Sarabeth's East" 'Hill Country Fried Chicken' 'Bistango'
 "Jack's Wife Freda" "Mamoun's Falafel" 'Prosperity Du

No other missing-value placeholder appears in the other columns. Next, check the share of 'Not given' entries in the rating column:

In [13]:
df.rating.value_counts(normalize=True)

Not given   0.388
5           0.310
4           0.203
3           0.099
Name: rating, dtype: float64

In [15]:
#Creating a new table with percentage of not given values by restaurant
#rename_axis + reset_index(name=...) gives the same column names in pandas 1.x and 2.x
rating1 = df.restaurant_name.value_counts().rename_axis('restaurant_name').reset_index(name='Total_orders')
rating2 = df.loc[df.rating == 'Not given'].restaurant_name.value_counts().rename_axis('restaurant_name').reset_index(name='Not_given_count')
#The inner merge keeps only restaurants with at least one 'Not given' rating
ratings = pd.merge(rating1,rating2,how='inner',on='restaurant_name')
ratings['Percent_Not_given'] = ratings.Not_given_count *100/ ratings.Total_orders
ratings

Unnamed: 0,restaurant_name,Total_orders,Not_given_count,Percent_Not_given
0,Shake Shack,219,86,39.269
1,The Meatball Shop,132,48,36.364
2,Blue Ribbon Sushi,119,46,38.655
3,Blue Ribbon Fried Chicken,96,32,33.333
4,Parm,68,29,42.647
...,...,...,...,...
129,Spice Thai,1,1,100.000
130,Gaia Italian Cafe,1,1,100.000
131,Rohm Thai,1,1,100.000
132,Alidoro,1,1,100.000


In [25]:
#Creating a new data frame in which 'Not given' ratings are read as NaN
df_null = pd.read_csv('/content/drive/MyDrive/2024/Machine Learning/ProjectI/foodhub_order.csv',na_values=['Not given'])
#Fill each missing rating with the most frequent rating of the same restaurant;
#restaurants without any rated order have no mode and keep NaN
df_null['rating'] = df_null['rating'].fillna(df_null.groupby('restaurant_name')['rating'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan))
df_null.isnull().sum()



order_id                  0
customer_id               0
restaurant_name           0
cuisine_type              0
cost_of_the_order         0
day_of_the_week           0
rating                   30
food_preparation_time     0
delivery_time             0
dtype: int64

In [26]:
df_null.dtypes

order_id                   int64
customer_id                int64
restaurant_name           object
cuisine_type              object
cost_of_the_order        float64
day_of_the_week           object
rating                   float64
food_preparation_time      int64
delivery_time              int64
dtype: object

In [27]:
df_null['order_id'] = df_null.order_id.astype('object')
df_null['customer_id'] = df_null.customer_id.astype('object')
df['order_id'] = df.order_id.astype('object')
df['customer_id'] = df.customer_id.astype('object')

#### Observations:
A significant portion of the ratings is missing: about 38.8% of the orders are marked 'Not given'. Even so, the rating column cannot be discarded, because it is the only direct measure of customer satisfaction. Imputing missing values with the mode of the ratings per restaurant could be beneficial. However, the table of missing ratings by restaurant shows that some restaurants received no rating on any order, so no probable rating can be assumed for them; 30 ratings remain missing after imputation. For this reason two data frames are kept. The first (df_null) replaces missing ratings with the restaurant's mode where one exists and leaves them as NaN otherwise. The second (df) keeps 'Not given' as a valid category, which makes it possible to explore trends among restaurants or orders that lack a user rating.
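
The per-restaurant imputation described above can be illustrated on a small example. This is only a sketch: the `toy` frame, the `group_mode` helper, and the `rating_imputed` column are hypothetical names introduced for illustration, and restaurant "C" is given no rated orders to show why some values stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: restaurant "C" has no rated orders at all
toy = pd.DataFrame({
    "restaurant_name": ["A", "A", "A", "B", "B", "C"],
    "rating": [5.0, 5.0, np.nan, 4.0, np.nan, np.nan],
})

def group_mode(s):
    """Return the most frequent non-null value of a group, or NaN if there is none."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Broadcast each restaurant's mode to its rows and use it only where rating is missing
toy["rating_imputed"] = toy["rating"].fillna(
    toy.groupby("restaurant_name")["rating"].transform(group_mode)
)
print(toy)
```

In this toy case the missing ratings of "A" and "B" are filled, while the single order of "C" remains NaN, which mirrors the ratings that stay missing in df_null.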

### **Question 4:** Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]

In [29]:

# Summary statistics
df_null.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order_id,1898.0,1898.0,1477147.000,1.0,,,,,,,
customer_id,1898.0,1200.0,52832.000,13.0,,,,,,,
restaurant_name,1898.0,178.0,Shake Shack,219.0,,,,,,,
cuisine_type,1898.0,14.0,American,584.0,,,,,,,
cost_of_the_order,1898.0,,,,16.499,7.484,4.47,12.08,14.14,22.297,35.41
day_of_the_week,1898.0,2.0,Weekend,1351.0,,,,,,,
rating,1868.0,,,,4.478,0.697,3.0,4.0,5.0,5.0,5.0
food_preparation_time,1898.0,,,,27.372,4.632,20.0,23.0,27.0,31.0,35.0
delivery_time,1898.0,,,,24.162,4.973,15.0,20.0,25.0,28.0,33.0


In [30]:
print(f'The minimum it takes for food to be prepared is {df.food_preparation_time.min()} minutes')
print(f'The maximum it takes for food to be prepared is {df.food_preparation_time.max()} minutes')
print(f'The average time of food preparation is {round(df.food_preparation_time.mean(),3)} minutes')

The minimum it takes for food to be prepared is 20 minutes
The maximum it takes for food to be prepared is 35 minutes
The average time of food preparation is 27.372 minutes


#### Observations:
`order_id`: There are 1898 unique orders, aligning with expectations.

`customer_id`: The dataset encompasses 1200 unique values, indicating that certain customers have placed multiple orders.

`restaurant_name`: A total of 178 restaurants are represented, with Shake Shack emerging as the most frequently chosen, accounting for 219 orders.

`cuisine_type`: The dataset encompasses 14 distinct cuisines. American cuisine is the most popular, appearing in 584 orders.

`cost_of_the_order`: The average order cost is \$16.499, with a range from \$4.470 to \$35.410. The mean is above the median (\$14.140), and the gap between the 75th percentile (\$22.297) and the maximum points to a number of high-value orders, which suggests a right-skewed distribution.

`food_preparation_time`: The average food preparation time is 27.372 minutes, ranging from a minimum of 20 minutes to a maximum of 35 minutes. The marginal difference between the 75th percentile and the maximum, coupled with a low standard deviation, implies a lack of extreme outliers in the preparation time.

`delivery_time`: The average delivery time is 24.162 minutes, accompanied by a standard deviation of approximately 5 minutes. This suggests a relatively consistent and predictable delivery duration across orders.

### **Question 5:** How many orders are not rated? [1 mark]

In [35]:
# Write the code here
# Looking at the original rating column
not_given_ratings = len(df.loc[df.rating == 'Not given'])
print(f'There are {not_given_ratings} not rated orders.')

There are 736 not rated orders.


In [37]:
#Missing values after imputing with the mode
print(f'There are {df_null.rating.isnull().sum()} missing ratings after imputing missing values with the mode.')

There are 30 missing ratings after imputing missing values with the mode.


#### Observations:
As previously mentioned, the rating column originally contained 736 'Not given' values. These were converted to NaN to create a second data frame (df_null), in which missing values are imputed with the mode for the restaurants where at least one rating was given. After this imputation, 30 ratings remain missing.

### Exploratory Data Analysis (EDA)

### Univariate Analysis

### **Question 6:** Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]

In [None]:
# Write the code here

### **Question 7**: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]

In [None]:
# Write the code here
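
One possible approach, shown here as a minimal sketch on hypothetical toy data rather than on the FoodHub file, is to count orders per restaurant with `value_counts()` and keep the first five entries. The `toy` frame and its single-letter restaurant names are placeholders.

```python
import pandas as pd

# Hypothetical toy data standing in for df["restaurant_name"]
toy = pd.DataFrame({
    "restaurant_name": ["A", "B", "A", "C", "A", "B", "D", "E", "F", "B", "C", "A"]
})

# value_counts() sorts by frequency in descending order, so head(5) gives the top 5
top5 = toy["restaurant_name"].value_counts().head(5)
print(top5)
```

On the real data the same two calls would be applied to `df.restaurant_name`.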

#### Observations:


### **Question 8**: Which is the most popular cuisine on weekends? [1 mark]

In [None]:
# Write the code here
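
A minimal sketch of one way to answer this: filter the rows where `day_of_the_week` is 'Weekend' and count the cuisines. The `toy` frame below is hypothetical and only reuses the column names from the data dictionary.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "cuisine_type": ["American", "Japanese", "American", "Italian", "Japanese", "American"],
    "day_of_the_week": ["Weekend", "Weekend", "Weekday", "Weekend", "Weekday", "Weekend"],
})

# Keep weekend orders only, then count orders per cuisine
weekend_counts = toy.loc[toy["day_of_the_week"] == "Weekend", "cuisine_type"].value_counts()
most_popular = weekend_counts.idxmax()
print(most_popular, weekend_counts.max())
```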

#### Observations:


### **Question 9**: What percentage of the orders cost more than 20 dollars? [2 marks]

In [None]:
# Write the code here
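
Because the mean of a boolean mask equals the share of True values, the percentage can be computed in one expression. The sketch below assumes a hypothetical `toy` frame that holds only the five costs shown by `df.head()` earlier, so its result says nothing about the full dataset.

```python
import pandas as pd

# Costs of the first five orders shown by df.head(); not the full dataset
toy = pd.DataFrame({"cost_of_the_order": [30.75, 12.08, 12.23, 29.20, 11.59]})

# The mean of a boolean Series is the fraction of True values
pct_over_20 = (toy["cost_of_the_order"] > 20).mean() * 100
print(f"{pct_over_20:.2f}% of the orders cost more than 20 dollars")
```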

#### Observations:


### **Question 10**: What is the mean order delivery time? [1 mark]

In [None]:
# Write the code here

#### Observations:


### **Question 11:** The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]

In [None]:
# Write the code here
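
A possible sketch, assuming that "most frequent" means the customers with the largest number of orders: count the occurrences of each `customer_id` and keep the top three. The IDs in the `toy` frame are made up for illustration.

```python
import pandas as pd

# Hypothetical customer IDs; each row represents one order
toy = pd.DataFrame({"customer_id": [101, 102, 101, 103, 101, 102, 104, 103, 102, 101]})

# Index holds the customer IDs, values hold the number of orders placed
top3 = toy["customer_id"].value_counts().head(3)
print(top3)
```

If several customers tie for third place, `head(3)` keeps only one of them, so ties may need to be handled explicitly on the real data.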

#### Observations:


### Multivariate Analysis

### **Question 12**: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]


In [None]:
# Write the code here
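
As a plot-free starting point, the sketch below computes pairwise correlations between the numerical columns and group means of numerical columns per category. It uses a hypothetical `toy` frame built from the first five rows shown by `df.head()`, so the numbers are illustrative only; plots such as heatmaps or boxplots would normally accompany these tables.

```python
import pandas as pd

# Hypothetical toy frame built from the first five rows shown by df.head()
toy = pd.DataFrame({
    "cuisine_type": ["Korean", "Japanese", "Mexican", "American", "American"],
    "cost_of_the_order": [30.75, 12.08, 12.23, 29.20, 11.59],
    "food_preparation_time": [25, 25, 23, 25, 25],
    "delivery_time": [20, 23, 28, 15, 24],
})

numeric_cols = ["cost_of_the_order", "food_preparation_time", "delivery_time"]

# Numerical vs numerical: pairwise Pearson correlations
corr = toy[numeric_cols].corr()
print(corr)

# Numerical vs categorical: mean cost and delivery time per cuisine
by_cuisine = toy.groupby("cuisine_type")[["cost_of_the_order", "delivery_time"]].mean()
print(by_cuisine)
```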

### **Question 13:** The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]

In [None]:
# Write the code here
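
One way to apply both conditions is to aggregate the rating count and mean per restaurant and then filter. This sketch assumes that 'Not given' entries should not count as ratings; `eligible_restaurants`, `min_count`, and `min_mean` are hypothetical names, and the toy thresholds are lowered so that the small example returns a result.

```python
import pandas as pd

# Hypothetical toy data; ratings are strings, as in the original rating column
toy = pd.DataFrame({
    "restaurant_name": ["A", "A", "A", "B", "B", "B", "C"],
    "rating": ["5", "4", "Not given", "3", "4", "Not given", "5"],
})

def eligible_restaurants(df, min_count, min_mean):
    """Return restaurants with more than min_count ratings and a mean rating above min_mean."""
    # 'Not given' becomes NaN and is dropped, so it affects neither count nor mean
    rated = df.assign(rating=pd.to_numeric(df["rating"], errors="coerce")).dropna(subset=["rating"])
    stats = rated.groupby("restaurant_name")["rating"].agg(rating_count="count", rating_mean="mean")
    return stats[(stats["rating_count"] > min_count) & (stats["rating_mean"] > min_mean)]

# On the full data the question's thresholds would be min_count=50 and min_mean=4
result = eligible_restaurants(toy, min_count=1, min_mean=4)
print(result)
```

Using the original 'Not given' column rather than the mode-imputed df_null keeps imputed values from inflating the rating counts.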

#### Observations:


### **Question 14:** The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]

In [None]:
# Write the code here
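
The commission rule can be read as a piecewise rate. The sketch below assumes that orders above 20 dollars are charged 25% only, orders above 5 and up to 20 dollars are charged 15%, and orders of 5 dollars or less generate no revenue; this interpretation and the `toy` costs are assumptions, not part of the question text.

```python
import numpy as np
import pandas as pd

# Hypothetical order costs chosen to hit every branch of the rule
toy = pd.DataFrame({"cost_of_the_order": [30.0, 10.0, 4.0, 20.0]})
cost = toy["cost_of_the_order"]

# np.select applies the first matching condition, so > 20 takes priority over > 5
rate = np.select([cost > 20, cost > 5], [0.25, 0.15], default=0.0)
toy["revenue"] = cost * rate

net_revenue = toy["revenue"].sum()
print(f"Net revenue: {net_revenue:.2f} dollars")
```

Here the order of exactly 20 dollars falls in the 15% band because the rule says "greater than 20".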

#### Observations:


### **Question 15:** The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]

In [None]:
# Write the code here
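
Assuming the total time is the sum of `food_preparation_time` and `delivery_time`, as the question suggests, the share of slow orders follows from a boolean mask. The values in the `toy` frame are hypothetical.

```python
import pandas as pd

# Hypothetical preparation and delivery times in minutes
toy = pd.DataFrame({
    "food_preparation_time": [25, 35, 31, 20],
    "delivery_time": [20, 33, 30, 15],
})

# Total time from order placement to drop-off
toy["total_time"] = toy["food_preparation_time"] + toy["delivery_time"]

# Share of orders that take strictly more than 60 minutes
pct_over_60 = (toy["total_time"] > 60).mean() * 100
print(f"{pct_over_60:.2f}% of the orders take more than 60 minutes")
```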

#### Observations:


### **Question 16:** The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]

In [None]:
# Write the code here
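
A grouped mean is one straightforward way to compare the two day types. The sketch below uses a hypothetical `toy` frame containing only the first five rows shown by `df.head()`, so the resulting means are illustrative and not the answer for the full dataset.

```python
import pandas as pd

# First five rows shown by df.head(); not the full dataset
toy = pd.DataFrame({
    "day_of_the_week": ["Weekend", "Weekend", "Weekday", "Weekend", "Weekday"],
    "delivery_time": [20, 23, 28, 15, 24],
})

# Mean delivery time (in minutes) for each day type
mean_delivery = toy.groupby("day_of_the_week")["delivery_time"].mean()
print(mean_delivery)
```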

#### Observations:


### Conclusion and Recommendations

### **Question 17:** What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]

### Conclusions:
*  

### Recommendations:

*  

---