# FoodHub Data: Exploratory Data Analysis

#### NAME: {Write your name here}

### Context

The number of restaurants in Kenya is increasing day by day. Many students and busy professionals rely on these restaurants because of their hectic lifestyles. Online food delivery is a convenient option for them, as it brings them good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

### Objective

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

### Data Description

The dataset contains information about individual food orders. The detailed data dictionary is given below.

### Data Dictionary

* order_id: Unique ID of the order
* customer_id: ID of the customer who ordered the food
* restaurant_name: Name of the restaurant
* cuisine_type: Cuisine ordered by the customer
* cost_of_the_order: Cost of the order
* day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
* rating: Rating given by the customer out of 5
* food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
* delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off confirmation.

### Let us start by importing the required libraries

In [1]:
# Installing the libraries with the specified version.
#!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 -q --user

**Note**: *After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.*

In [2]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Command to tell Python to actually display the graphs
%matplotlib inline
import warnings
# Ignore all warnings
warnings.filterwarnings('ignore')

### Understanding the structure of the data

In [3]:
# uncomment and run the following lines for Google Colab
#from google.colab import drive
#drive.mount('/content/drive')

#### Importing the dataset

In [4]:
# Importing the dataset and saving it as 'foodhub_data' in the notebook
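A minimal sketch of the import step, assuming the data comes as a CSV file with the columns listed in the data dictionary. The file name `foodhub_order.csv` and the three sample rows are assumptions made for illustration; they are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three made-up rows that use
# the column names from the data dictionary.
sample_csv = io.StringIO(
    "order_id,customer_id,restaurant_name,cuisine_type,cost_of_the_order,"
    "day_of_the_week,rating,food_preparation_time,delivery_time\n"
    "1,101,Cafe A,American,12.50,Weekend,5,25,20\n"
    "2,102,Cafe B,Japanese,30.75,Weekday,Not given,30,28\n"
    "3,101,Cafe A,American,8.20,Weekend,4,22,17\n"
)

# In the notebook, pass the path to the CSV file instead of sample_csv,
# for example pd.read_csv('foodhub_order.csv') (file name assumed).
foodhub_data = pd.read_csv(sample_csv)
print(foodhub_data.shape)
```

`pd.read_csv` accepts either a file path or a file-like object, which is why the in-memory sample works here.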


###  <p style="color:blue;">Observations:</p>


#### View the first and last five rows of the dataset

In [5]:
# Viewing the first 5 rows of the dataset


In [6]:
# Viewing the last five rows of the dataset
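One possible way to view the first and last rows is with `head()` and `tail()`, which both return five rows by default. The toy frame below is a hypothetical stand-in for `foodhub_data`.

```python
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "order_id": list(range(1, 13)),
    "delivery_time": list(range(15, 27)),
})

first_rows = foodhub_data.head()  # first 5 rows by default
last_rows = foodhub_data.tail()   # last 5 rows by default
print(first_rows)
print(last_rows)
```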


###  <p style="color:blue;">Observations:</p>


#### Checking for duplicate entries in the dataset
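A short sketch of a duplicate check, assuming that a duplicate means either a fully repeated row or a repeated `order_id`. The toy values are hypothetical.

```python
import pandas as pd

# Toy frame with one deliberately repeated row (made-up values).
foodhub_data = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [101, 102, 102, 101],
    "cost_of_the_order": [12.5, 30.75, 30.75, 8.2],
})

# Rows that repeat an earlier row in every column.
n_duplicate_rows = int(foodhub_data.duplicated().sum())
# Order IDs that appear more than once (order_id should be unique).
n_duplicate_ids = int(foodhub_data["order_id"].duplicated().sum())
print(n_duplicate_rows, n_duplicate_ids)
```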

###  <p style="color:blue;">Observations:</p>


### **Question 1:** How many rows and columns are present in the data? 

In [7]:
# Checking the number of rows and columns in the dataset
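The `shape` attribute returns a `(rows, columns)` tuple, so one way to answer this question is sketched below on a hypothetical toy frame.

```python
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "order_id": [1, 2, 3],
    "cost_of_the_order": [12.5, 30.75, 8.2],
})

n_rows, n_cols = foodhub_data.shape
print(f"The dataset has {n_rows} rows and {n_cols} columns.")
```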


### <p style="color:blue;">Observations:</p>



### **Question 2:** What are the datatypes of the different columns in the dataset? (The info() function can be used) 

In [8]:
# Write your code here
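A possible sketch for inspecting data types: `info()` prints the dtype and non-null count of each column, and `is_numeric_dtype` flags which columns are numeric. The toy values are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "order_id": [1, 2, 3],
    "cost_of_the_order": [12.5, 30.75, 8.2],
    "rating": ["5", "Not given", "4"],
})

# Prints column names, non-null counts and dtypes.
foodhub_data.info()

# True for numeric columns, False otherwise.
numeric_flags = {
    col: is_numeric_dtype(foodhub_data[col]) for col in foodhub_data.columns
}
print(numeric_flags)
```

Note that `rating` is not numeric here because the text value 'Not given' is mixed in with the digits.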


###  <p style="color:blue;">Observations:</p>

Write your observations here


### **Question 3:** Are there any missing values in the data? If yes, treat them using an appropriate method. 

In [9]:
# Write your code here
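A minimal sketch of a missing-value check. The toy frame contains one deliberately missing cost so that the treatment step has something to do; median imputation is just one possible choice and is an assumption here, not a method prescribed by the assignment.

```python
import pandas as pd

# Toy frame with one deliberately missing cost (made-up values).
foodhub_data = pd.DataFrame({
    "cost_of_the_order": [12.5, None, 8.2],
    "rating": ["5", "Not given", "4"],
})

# Number of missing values per column. 'Not given' is a string,
# so it is not counted as missing.
missing_per_column = foodhub_data.isnull().sum()
print(missing_per_column)

# One possible treatment: fill numeric gaps with the column median.
median_cost = foodhub_data["cost_of_the_order"].median()
foodhub_data["cost_of_the_order"] = foodhub_data["cost_of_the_order"].fillna(median_cost)
```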


###  <p style="color:blue;">Observations:</p>


### **Question 4:** Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? 

In [10]:
# Changing the display format to float to avoid cases where some values may be expressed in standard notation
pd.options.display.float_format = '{:.2f}'.format
# Checking the statistical summary of the data.
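One way to read the minimum, mean and maximum preparation time is to transpose the output of `describe()` and select the relevant row. The toy values below are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "food_preparation_time": [20, 30, 25, 35],
    "delivery_time": [15, 28, 20, 33],
})

# One row per numeric column, one column per statistic.
summary = foodhub_data.describe().T
prep_stats = summary.loc["food_preparation_time", ["min", "mean", "max"]]
print(prep_stats)
```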


###  <p style="color:blue;">Observations:</p>



### **Question 5:** How many orders are not rated? [1 mark]

In [12]:
# Write the code here. Orders that are not rated are indicated as 'Not given'
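Since unrated orders carry the text 'Not given', a boolean comparison followed by `sum()` counts them. The sketch below uses hypothetical toy ratings.

```python
import pandas as pd

# Toy ratings with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({"rating": ["5", "Not given", "4", "Not given"]})

# True counts as 1 when summed, so this is the number of unrated orders.
n_not_rated = int((foodhub_data["rating"] == "Not given").sum())
print(n_not_rated)
```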


###  <p style="color:blue;">Observations:</p>
Write your observations here

### Exploratory Data Analysis (EDA)

### Univariate Analysis

### **Question 6:** Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) 

6.1 Preparing the dataset for Exploratory Data Analysis (EDA)

6.1.1 Changing the `order_id` and `customer_id` features (columns) into the object data type, since no meaningful analysis can be performed on them except counting.

In [15]:
# Code that converts 'order_id' to object data type: foodhub_data['order_id']=foodhub_data['order_id'].astype('object')
# Write code to change 'customer_id' to object data type

In [13]:
#Confirm that 'order_id' and 'customer_id' have been converted from integer to object data type
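A small sketch of the conversion and the confirmation step together, on a hypothetical toy frame. Looping over the two ID columns is one option; converting them one at a time works just as well.

```python
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "order_id": [1, 2],
    "customer_id": [101, 102],
    "cost_of_the_order": [12.5, 30.75],
})

# Treat the identifier columns as labels rather than numbers.
for col in ["order_id", "customer_id"]:
    foodhub_data[col] = foodhub_data[col].astype("object")

# Confirm the new data types.
print(foodhub_data.dtypes)
```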


6.1.2 Creating a list of all numeric features and a list of all object features

In [None]:
# List of numeric features
foodhub_data_numeric=[]
for feature in foodhub_data.columns:
  if foodhub_data[feature].dtype!='object':
    foodhub_data_numeric.append(feature)
# Check the list of numeric features
foodhub_data_numeric

In [None]:
# List of object (non-numeric) features (columns)
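A possible sketch for the non-numeric list. It uses `select_dtypes(exclude='number')` instead of the explicit dtype loop used for the numeric list; this is a swapped-in alternative, and the toy columns are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "restaurant_name": ["Cafe A", "Cafe B"],
    "day_of_the_week": ["Weekend", "Weekday"],
    "cost_of_the_order": [12.5, 30.75],
})

# Keep every column that is not numeric.
foodhub_data_object = foodhub_data.select_dtypes(exclude="number").columns.tolist()
print(foodhub_data_object)
```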


6.2 Checking the distribution of the numeric columns

In [None]:
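for feature in foodhub_data_numeric:
  # (a) Histogram with a KDE overlay for the current numeric column
  print('(a). Histogram for the column',f'"{feature}"')
  plt.figure(figsize=(5,5))
  sns.histplot(data=foodhub_data,x=feature,kde=True)
  plt.ylabel('Number of Orders')
  plt.show()
  # (b) Boxplot of the same column, with outliers shown
  print('(b). Boxplot for the column',f'"{feature}"')
  sns.boxplot(data=foodhub_data,x=feature,showfliers=True)
  plt.show()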


6.3 Looking at the descriptive statistics of the numeric features

In [None]:
#Write code to display descriptive statistics of numeric features

###  <p style="color:blue;">Observations:</p>
1. `cost_of_the_order`



2. `food_preparation_time`



3. `delivery_time`



6.4 Checking some characteristics of the non-numeric features

6.4.1 Popular Restaurants in terms of Orders

In [None]:
# The number of orders made from each restaurant onboarded in the FoodHub App
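One way to count orders per restaurant is `value_counts()`, which sorts from most to least frequent. The restaurant names below are hypothetical.

```python
import pandas as pd

# Toy frame with made-up restaurant names, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "restaurant_name": ["Cafe A", "Cafe B", "Cafe A", "Cafe C", "Cafe A", "Cafe B"],
})

# Number of distinct restaurants in the data.
n_restaurants = foodhub_data["restaurant_name"].nunique()

# Orders per restaurant, most popular first.
orders_per_restaurant = (
    foodhub_data["restaurant_name"]
    .value_counts()
    .rename_axis("Restaurant Name")
    .reset_index(name="Number of Orders")
)
print(n_restaurants)
print(orders_per_restaurant)
```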


###  <p style="color:blue;">Observations:</p>



6.4.2 Popular Cuisine Type by Orders

In [None]:
#The different types of cuisines ordered by customers in the FoodHub App


In [None]:
#Visualizing the number of orders by cuisine type
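A minimal sketch of the visualization, assuming a plain matplotlib bar chart is acceptable in place of a seaborn `countplot`. The cuisine values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "cuisine_type": ["American", "Japanese", "American",
                     "Italian", "American", "Japanese"],
})

# Orders per cuisine, most popular first.
cuisine_counts = foodhub_data["cuisine_type"].value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(cuisine_counts.index.tolist(), cuisine_counts.values)
ax.bar_label(bars)  # write the count above each bar
ax.set_xlabel("Cuisine Type")
ax.set_ylabel("Number of Orders")
ax.set_title("Number of Orders by Cuisine Type")
ax.tick_params(axis="x", rotation=90)
```

Sorting the bars by count makes the ranking readable at a glance, which matters once there are many cuisine types.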


###  <p style="color:blue;">Observations:</p>


6.4.3 Days of the Week and Orders

In [None]:
#The number of orders based on the 'day of the week'
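A short sketch that counts orders per day type and also reports each share of the total, using `value_counts(normalize=True)`. The toy values are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values, standing in for foodhub_data.
foodhub_data = pd.DataFrame({
    "day_of_the_week": ["Weekend", "Weekday", "Weekend", "Weekend"],
})

# Absolute number of orders per day type.
orders_by_day = foodhub_data["day_of_the_week"].value_counts()
# Share of all orders per day type, as a percentage.
share_by_day = foodhub_data["day_of_the_week"].value_counts(normalize=True) * 100
print(orders_by_day)
print(share_by_day)
```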


In [None]:
#Visualizing the number of orders based on the 'day of the week'


###  <p style="color:blue;">Observations:</p>


6.4.4 Ratings and number of orders

In [None]:
# Ratings and number of orders
ratings_count=pd.DataFrame(foodhub_data['rating'].value_counts()).reset_index()
ratings_count.columns=['Rating','Number of Orders']
ratings_count

In [None]:
#The number of orders based on ratings
sns.countplot(data=foodhub_data,x='rating', order=foodhub_data['rating'].value_counts().index);
plt.ylabel('Number of Orders');
plt.xlabel('Rating');
plt.title('Number of Orders by Rating');
plt.bar_label(container=plt.gca().containers[0]);

###  <p style="color:blue;">Observations:</p>
i) The largest single group of orders is the one with no rating: 736 out of 1898 orders were not rated.<br>
ii) Out of 1898 orders, 588 received a `rating` of 5.<br>
iii) Ratings of 4 and 3 were given to 386 and 188 orders respectively.<br>

In [None]:
# Checking number of orders made by top 10 customers
customer_orders=pd.DataFrame(foodhub_data['customer_id'].value_counts()).reset_index()
customer_orders.columns=['Customer_ID','Number of Orders']
customer_orders.head(10)

In [None]:
# Exploring the restaurants and cuisines preferred by the top customer
foodhub_data.loc[foodhub_data['customer_id']==52832]


###  <p style="color:blue;">Observations:</p>
i) Each of the top 10 customers made at least 6 orders.<br>
ii) The customer with the highest number of orders made 13 orders, followed by a customer with 10 orders.<br>
iii) The customer with the highest number of orders sampled different restaurants and cuisine types.

### **Question 7**: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]

In [None]:
# Top 5 restaurants in terms of the number of orders received.
orders_by_restaurants= pd.DataFrame(foodhub_data['restaurant_name'].value_counts()).reset_index()
orders_by_restaurants.columns=['Restaurant Name','Number of Orders']
orders_by_restaurants.head(5)

###  <p style="color:blue;">Observations:</p>
i) The restaurant with the highest number of orders is `Shake Shack` with 219 orders.<br>
ii) The top 5 restaurants account for 634 of the 1898 orders.<br>
iii) Each of the top 5 restaurants received at least 68 orders.<br>


### **Question 8**: Which is the most popular cuisine on weekends? [1 mark]

In [None]:
# Filter data for weekend orders
weekend_orders = foodhub_data[foodhub_data['day_of_the_week'] == 'Weekend']

# Count cuisine types for weekend orders
cuisines_popular = pd.DataFrame(weekend_orders['cuisine_type'].value_counts()).reset_index()
cuisines_popular.columns = ['Cuisine Type', 'Number of Weekend Orders']
cuisines_popular

###  <p style="color:blue;">Observations:</p>
i) The most popular cuisine type on the `Weekend` is `American` with 415 orders.<br>
ii) It is worth noting that `American` is also the most popular `Cuisine Type` overall with 584 orders.<br>


### **Question 9**: What percentage of the orders cost more than 20 dollars? [2 marks]

In [None]:
# Write the code here
foodhub_data[foodhub_data['cost_of_the_order']>20].shape[0]/foodhub_data.shape[0]*100

###  <p style="color:blue;">Observations:</p>
i) The orders that cost more than 20 dollars are 29 percent of the total orders.<br>
ii) About 71 percent of the orders cost 20 dollars or less.<br>


### **Question 10**: What is the mean order delivery time? [1 mark]

In [None]:
# Write the code here
foodhub_data['delivery_time'].mean()

###  <p style="color:blue;">Observations:</p>
i) The mean order `delivery_time` is 24.16 minutes.<br>
ii) The mean order delivery time (24.16 minutes) is lower than the median order delivery time (25.00 minutes).<br>

### **Question 11:** The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]

In [None]:
# Top 3 most frequent customers and the number of orders they placed.
foodhub_top_customers=pd.DataFrame(foodhub_data['customer_id'].value_counts()).reset_index()
foodhub_top_customers.columns=['Customer ID','Number of Orders']
foodhub_top_customers.head(3)

###  <p style="color:blue;">Observations:</p>
i) The top 3 customers placed `13`, `10` and `9` orders respectively.<br>
ii) The total number of orders made by the top 3 customers was 32 out of 1898 orders.<br>

### Multivariate Analysis

### **Question 12**: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]


12.1 Checking pairwise relationships between all the features using `pairplot`.

In [None]:
# Exploring the pairwise relationship amongst the various features (lower half).
sns.pairplot(data=foodhub_data,diag_kind='hist',corner=True);

###  <p style="color:blue;">Observations:</p>
i) There seems to be no obvious pairwise correlation amongst the columns (features).<br>
ii) The `order_id` and `customer_id` columns are not meaningful in this context since they are numeric but are used as labels.<br>

12.2 Checking correlation between the Numeric features using `pairplot`.

In [None]:
# Checking the pairwise relationship of the numeric features (lower half).
sns.pairplot(data=foodhub_data[foodhub_data_numeric],diag_kind='hist',corner=True);

###  <p style="color:blue;">Observations:</p>
i) There seems to be no obvious pairwise correlation amongst the numeric columns (features).<br>


12.3 Digging Deeper into Correlation

In [None]:
foodhub_data[foodhub_data_numeric].corr()

In [None]:
# Visualizing the pairwise correlation using a heatmap.
sns.heatmap(foodhub_data[foodhub_data_numeric].corr(),annot=True,cmap='coolwarm');

###  <p style="color:blue;">Observations:</p>
i) There is a slight positive correlation (0.042) between `cost_of_the_order` and `food_preparation_time`.<br>
ii) There is a slight negative correlation (-0.030) between `delivery_time` and `cost_of_the_order`.<br>

In [None]:
# Using a jointplot to check both correlation and distribution
sns.jointplot(x='delivery_time',y='cost_of_the_order',hue='day_of_the_week',data=foodhub_data)

###  <p style="color:blue;">Observations:</p>
i) We already checked the distributions in 6.2 above.<br>
ii) More orders were made on `Weekend` compared to `Weekday`

12.4 Checking number of orders by Cuisine Type and Day of the Week 

In [None]:
# Cuisine types and their orders based on the day of the week
sns.catplot(data=foodhub_data,x='cuisine_type',kind='count', hue='day_of_the_week');
plt.xticks(rotation=90);
plt.title('Orders Based on Cuisine Type and Day of the Week');
plt.xlabel('Cuisine Type');
plt.ylabel('Number of Orders');
# Label the bars of every hue level, not only the first one
for container in plt.gca().containers:
    plt.bar_label(container);

###  <p style="color:blue;">Observations:</p>
i) The orders for all the `cuisine_type` values were higher on `Weekend` compared to `Weekday`.<br>
ii) `American` had the highest number of orders on both weekends and weekdays.<br>

In [None]:
# Checking ratings based on the day of the week.
sns.catplot(data=foodhub_data,x='day_of_the_week',kind='count', hue='rating');
plt.xticks(rotation=90);
plt.title('Ratings Based on the Day of the Week');
plt.xlabel('Day of the Week');
plt.ylabel('Number of Orders');
# Label the bars of every rating level, not only the first one
for container in plt.gca().containers:
    plt.bar_label(container);

###  <p style="color:blue;">Observations:</p>
i) It is important to note that `Weekend` had the highest number of orders with a `rating` of `Not given`.<br>
ii) It may be interesting to find out why so many customers did not provide ratings during weekends.<br>

### **Question 13:** The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]

13.1 Creating a dataframe with only rated orders (exclude cases where rating was not given)

In [None]:
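# Create a new dataset that excludes records with 'Not given' rating
# .copy() makes this an independent dataframe so that later column changes do not touch foodhub_data
foodhub_data_rated=foodhub_data[foodhub_data['rating']!='Not given'].copy()
foodhub_data_rated.head()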

13.2 Generating a dataset of the restaurants that fulfil the criteria

In [None]:
# Change the data type of 'rating' from object to float so as to calculate mean rating for each restaurant
foodhub_data_rated['rating']=foodhub_data_rated['rating'].astype('float');

In [None]:
# Generate a dataframe of restaurants including the number of times each restaurant was rated and their mean rating
restaurants_rating_count_mean= foodhub_data_rated.groupby('restaurant_name')['rating'].agg(['count','mean']).reset_index()
restaurants_rating_count_mean.head()

In [None]:
#Generate a dataframe of the restaurants that qualify for promotion offer.
restaurants_promotion_offer = restaurants_rating_count_mean[(restaurants_rating_count_mean['count'] > 50) & (restaurants_rating_count_mean['mean'] > 4)]
restaurants_promotion_offer.sort_values(by='mean',ascending=False).reset_index()

###  <p style="color:blue;">Observations:</p>
i) `The Meatball Shop` had the highest mean `rating` at 4.51.<br>
ii) Even though `Shake Shack` had the highest number of rated orders (133), it had the third highest mean rating (4.28).<br>
iii) It is worth noting that `Shake Shack` had the highest total number of orders (219) but only 133 were rated.<br>


### **Question 14:** The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]

In [None]:
# Creating a function to calculate commission for each order based on the criteria given
def commission_charged(order_value):
    if order_value>20:
        commission=order_value*0.25
    elif order_value>5:
        commission=order_value*0.15
    else:
        commission=0
    return commission
    

In [None]:
# Creating a copy of the foodhub_data dataset so that the original dataset is not modified
foodhub_data_1=foodhub_data.copy()
# Adding a new column called 'commission' to the foodhub_data_1 dataframe by
# applying the function to the 'cost_of_the_order' column to calculate the commission for each order
foodhub_data_1['commission']=foodhub_data_1['cost_of_the_order'].apply(commission_charged)
foodhub_data_1

In [None]:
# Calculating the total commission earned by the FoodHub for the aggregation services.
print('The total commission earned by FoodHub is ',foodhub_data_1['commission'].sum())

###  <p style="color:blue;">Observations:</p>
i) `FoodHub` earned a total commission of 6166.303 dollars.<br>
ii) `FoodHub` did not charge commission on orders costing 5 dollars or less.<br>
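
The tiered commission rule in Question 14 can also be expressed without a row-by-row function. The sketch below uses `np.select`, which picks the first matching condition for each order; it is an alternative to the approach used above, and the order costs are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy order costs (made-up values) covering each commission tier.
orders = pd.DataFrame({"cost_of_the_order": [4.0, 5.0, 10.0, 20.0, 25.0]})
cost = orders["cost_of_the_order"]

# Conditions are checked in order, so the 25% tier must come first.
orders["commission"] = np.select(
    [cost > 20, cost > 5],
    [cost * 0.25, cost * 0.15],
    default=0.0,  # orders of 5 dollars or less earn no commission
)
total_commission = orders["commission"].sum()
print(orders)
print(total_commission)
```

On large datasets the vectorized form usually runs faster than `apply`, although for a few thousand orders either approach is fine.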


### **Question 15:** The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]

In [None]:
# Generating a dataframe of all the orders whose total delivery time was more than 60
# total_time_deliver =food_preparation_time+delivery_time
foodhub_data_1['total_delivery_time']=foodhub_data_1['food_preparation_time']+foodhub_data_1['delivery_time']
foodhub_data_1.head(5)


In [None]:
#Calculating the number of orders with total delivery time exceeding 60 minutes as a percentage of the total number of orders.
delivery_time_percentage=(foodhub_data_1[foodhub_data_1['total_delivery_time']>60].shape[0]/foodhub_data_1.shape[0])*100
print('The percentage of orders with total delivery time greater than 60 is',f"{delivery_time_percentage:.2f}",'%')


###  <p style="color:blue;">Observations:</p>
i) The percentage of orders that took more than 60 minutes to deliver is 10.54 percent.<br>
ii) About 89 percent of orders took 60 minutes or less to deliver.<br>


### **Question 16:** The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]

In [None]:
mean_delivery_by_day_week=foodhub_data_1.groupby('day_of_the_week')['delivery_time'].agg(['count','sum','mean']).reset_index()
mean_delivery_by_day_week.columns=['Day of the week','Number of Deliveries','Sum of Delivery Time','Mean Delivery Time']
mean_delivery_by_day_week

###  <p style="color:blue;">Observations:</p>
i) The mean delivery time is longer (28.34 minutes) on `Weekday`.<br>
ii) The mean delivery time is shorter (22.47 minutes) on `Weekend`.<br>

### Conclusion and Recommendations

### **Question 17:** What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]

### Conclusions:
*  There are more orders made on `Weekend` (1351) than on `Weekday` (547).<br>
*  A large number of orders (736 out of a total of 1898) were not rated.<br>
*  The most popular restaurant is `Shake Shack` with 219 orders out of 1898 orders.<br>
*  The most popular cuisine type is `American` with 584 orders, followed by `Japanese` with 470 orders. The two cuisine types are also very popular over the weekend, accounting for 415 and 335 orders respectively.<br>
*  Among the orders that were rated, the most common rating was 5 (588 orders).<br>
*  The top ten customers placed between 6 and 13 orders each across different restaurants and cuisine types.<br>
*  `The Meatball Shop` was the most highly rated restaurant with a mean rating of 4.51.
*  The mean `delivery_time` is longer on `Weekday` even though it has fewer orders compared to `Weekend`.

### Recommendations:
*  Motivate customers to rate their orders. The customers can be motivated by offering loyalty points for all the orders rated.<br>
*  Conduct promotions for weekdays to increase the number of orders. The promotion can include bundled cuisine types or discounts.<br>
*  Rewards for popular restaurants to create a healthy competition among the restaurants.<br>
*  Check on the reasons why `American` and `Japanese` are popular cuisine types. `FoodHub` can conduct an exploratory survey. The insights from the survey can be shared with the restaurants.<br>
*  The most common rating among rated orders was 5. Restaurants that consistently receive this rating should be rewarded to encourage continued good service.<br>
*  Conduct targeted promotion based on orders done by customers. For example, top customers can be rewarded with value addition to their orders while customers with small number of orders can be given discounts.<br>
* Restaurants with high mean rating should be branded. For example, FoodHub can create labels like `premium`, `silver`, `bronze` and others. The labels give the restaurants some level of prestige with customers.

---