## Case Study #2 - Pizza Runner

#### Problem Statement
Did you know that over 115 million kilograms of pizza is consumed daily worldwide??? (Well according to Wikipedia anyway…)

Danny was scrolling through his Instagram feed when something really caught his eye - “80s Retro Styling and Pizza Is The Future!”

Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire - so he had one more genius idea to combine with it - he was going to Uberize it - and so Pizza Runner was launched!

Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.

#### Entity Relationship Diagram

![week2.png](week2.png)

Import modules

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlite3 as sql
pd.set_option('display.max_columns', None)

Initialize SQL

In [2]:
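conn = sql.connect("week2.db")
cursor = conn.cursor()
# A freshly created database file is empty, so load the schema and data once.
if os.stat("week2.db").st_size == 0:
    with open('week2-sql.txt', 'r') as file:
        script = file.read()
    # executescript accepts multi-line scripts as-is; flattening newlines would
    # let any `--` line comment swallow the rest of the script.
    cursor.executescript(script)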

Verify tables

In [3]:
query = """SELECT name FROM sqlite_master WHERE type='table';"""
cursor.execute(query)
tables = [table[0] for table in cursor.fetchall()]
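# Double quotes outside, single quotes inside: reusing the same quote type
# inside an f-string only works on Python 3.12+.
print(f"The tables in the database are: {', '.join(tables)}")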

The tables in the database are: runners, customer_orders, runner_orders, pizza_names, pizza_recipes, pizza_toppings


Fetch table information

In [4]:
for table in tables:
    print("=================================")
    print(f'Table [{table}]')
    df = pd.read_sql_query(f'SELECT * FROM {table}', conn)
    print(f'Dimensions: {df.shape[0]} rows x {df.shape[1]} columns\n')
    print(df.head())
    info_df = pd.DataFrame.from_dict({'Datatypes':df.dtypes, 'NULL count':df.isna().sum()})
    print()
    print(info_df)
    print()

Table [runners]
Dimensions: 4 rows x 2 columns

   runner_id registration_date
0          1        2021-01-01
1          2        2021-01-03
2          3        2021-01-08
3          4        2021-01-15

                  Datatypes  NULL count
runner_id             int64           0
registration_date    object           0

Table [customer_orders]
Dimensions: 14 rows x 6 columns

   order_id  customer_id  pizza_id exclusions extras           order_time
0         1          101         1                    2020-01-01 18:05:02
1         2          101         1                    2020-01-01 19:00:52
2         3          102         1                    2020-01-02 23:51:23
3         3          102         2              None  2020-01-02 23:51:23
4         4          103         1          4         2020-01-04 13:23:46

            Datatypes  NULL count
order_id        int64           0
customer_id     int64           0
pizza_id        int64           0
exclusions     object           0
ext

In [5]:
def query(stmt: str) -> pd.DataFrame:
    """Execute a SQL statement and return the results as a pandas DataFrame.
    
    Parameters
    ----------
    stmt: str
        The SQL statement to be executed
    """
    # conn is only read here, so no `global` declaration is needed
    return pd.read_sql_query(stmt, conn)

## Case Study Questions

**A. Data Cleaning**

Q1: Investigate your data and do the necessary data adjustments and cleaning.

In [6]:
# Check customer_orders table
query("SELECT * FROM customer_orders")

| | order_id | customer_id | pizza_id | exclusions | extras | order_time |
|---|---|---|---|---|---|---|
| 0 | 1 | 101 | 1 | | | 2020-01-01 18:05:02 |
| 1 | 2 | 101 | 1 | | | 2020-01-01 19:00:52 |
| 2 | 3 | 102 | 1 | | | 2020-01-02 23:51:23 |
| 3 | 3 | 102 | 2 | | | 2020-01-02 23:51:23 |
| 4 | 4 | 103 | 1 | 4 | | 2020-01-04 13:23:46 |
| 5 | 4 | 103 | 1 | 4 | | 2020-01-04 13:23:46 |
| 6 | 4 | 103 | 2 | 4 | | 2020-01-04 13:23:46 |
| 7 | 5 | 104 | 1 | | 1 | 2020-01-08 21:00:29 |
| 8 | 6 | 101 | 2 | | | 2020-01-08 21:03:13 |
| 9 | 7 | 105 | 2 | | 1 | 2020-01-08 21:20:29 |


Note: The `exclusions` and `extras` columns mix empty strings (''), the literal string 'null', and real NULLs. A CASE expression can map all three to NULL.

In [7]:
script = """
    DROP TABLE IF EXISTS customer_orders_clean;
    CREATE TEMP TABLE customer_orders_clean AS
    SELECT
        order_id,
        customer_id,
        pizza_id,
        CASE
            WHEN exclusions IS NULL OR exclusions = '' OR exclusions LIKE 'null' THEN NULL
            ELSE exclusions
            END AS exclusions,
        CASE
            WHEN extras IS NULL OR extras = '' OR extras LIKE 'null' THEN NULL
            ELSE extras
            END AS extras,
        order_time
    FROM customer_orders;
"""
cursor.executescript(script)
# Verify result
query("SELECT * FROM customer_orders_clean")

| | order_id | customer_id | pizza_id | exclusions | extras | order_time |
|---|---|---|---|---|---|---|
| 0 | 1 | 101 | 1 | | | 2020-01-01 18:05:02 |
| 1 | 2 | 101 | 1 | | | 2020-01-01 19:00:52 |
| 2 | 3 | 102 | 1 | | | 2020-01-02 23:51:23 |
| 3 | 3 | 102 | 2 | | | 2020-01-02 23:51:23 |
| 4 | 4 | 103 | 1 | 4 | | 2020-01-04 13:23:46 |
| 5 | 4 | 103 | 1 | 4 | | 2020-01-04 13:23:46 |
| 6 | 4 | 103 | 2 | 4 | | 2020-01-04 13:23:46 |
| 7 | 5 | 104 | 1 | | 1 | 2020-01-08 21:00:29 |
| 8 | 6 | 101 | 2 | | | 2020-01-08 21:03:13 |
| 9 | 7 | 105 | 2 | | 1 | 2020-01-08 21:20:29 |
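
The same unification can probably be written more compactly with nested `NULLIF` calls. The sketch below is only an illustration: the in-memory table `demo_orders` and its four rows are made up, and unlike `LIKE 'null'`, `NULLIF` compares case-sensitively.

```python
import sqlite3

# Hypothetical in-memory table covering '', 'null', NULL, and a real value.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE demo_orders (order_id INTEGER, extras TEXT);
    INSERT INTO demo_orders VALUES (1, ''), (2, 'null'), (3, NULL), (4, '1, 5');
""")

# NULLIF(x, y) returns NULL when x = y and x otherwise,
# so nesting two calls handles both placeholder values.
rows = demo.execute("""
    SELECT order_id, NULLIF(NULLIF(extras, ''), 'null') AS extras
    FROM demo_orders
    ORDER BY order_id
""").fetchall()
print(rows)
```

Only the genuine value `'1, 5'` should survive; the other three rows come back as `None`.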


In [8]:
# Check runner_orders table
query("SELECT * FROM runner_orders")

| | order_id | runner_id | pickup_time | distance | duration | cancellation |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2020-01-01 18:15:34 | 20km | 32 minutes | |
| 1 | 2 | 1 | 2020-01-01 19:10:54 | 20km | 27 minutes | |
| 2 | 3 | 1 | 2020-01-03 00:12:37 | 13.4km | 20 mins | |
| 3 | 4 | 2 | 2020-01-04 13:53:03 | 23.4 | 40 | |
| 4 | 5 | 3 | 2020-01-08 21:10:57 | 10 | 15 | |
| 5 | 6 | 3 | | | | Restaurant Cancellation |
| 6 | 7 | 2 | 2020-01-08 21:30:45 | 25km | 25mins | |
| 7 | 8 | 2 | 2020-01-10 00:15:02 | 23.4 km | 15 minute | |
| 8 | 9 | 2 | | | | Customer Cancellation |
| 9 | 10 | 1 | 2020-01-11 18:50:20 | 10km | 10minutes | |


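The `runner_orders` table has inconsistent values: `pickup_time` contains the string 'null'; `distance` contains 'null' and a 'km' suffix, sometimes preceded by a space; `duration` contains 'null' and the suffixes 'minutes', 'minute', and 'mins'; `cancellation` contains 'null' and empty strings.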
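
Before stripping the units, it may help to see how SQLite treats these strings. `TRIM(X, Y)` removes any of the characters in `Y` from both ends rather than a literal suffix, so a value such as '23.4 km' keeps its trailing space, while `CAST(... AS REAL)` reads the longest numeric prefix. A small sketch using distance values copied from the table above (the in-memory connection `db` is just for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
results = []
for value in ["20km", "23.4 km", "13.4km", "10"]:
    # Compare character-set trimming with numeric-prefix casting.
    trimmed, as_real = db.execute(
        "SELECT TRIM(?, 'km'), CAST(? AS REAL)", (value, value)
    ).fetchone()
    results.append((value, trimmed, as_real))
    print(repr(value), "->", repr(trimmed), as_real)
```

Note the space left behind in the trimmed form of `'23.4 km'`; the cast is unaffected by it.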

In [9]:
script = """
    DROP TABLE IF EXISTS runner_orders_clean;
    CREATE TEMP TABLE runner_orders_clean AS
    SELECT 
    order_id, 
    runner_id,  
    CASE
        WHEN pickup_time LIKE 'null' THEN NULL
        ELSE pickup_time
        END AS pickup_time,
    CASE
        WHEN distance LIKE 'null' THEN NULL
        -- RTRIM strips trailing 'k'/'m' characters; TRIM removes the space left by '23.4 km'
        ELSE TRIM(RTRIM(distance, 'km'))
        END AS distance,
    CASE
        WHEN duration LIKE 'null' THEN NULL
        -- The character set of 'minutes' also covers 'mins' and 'minute'
        ELSE TRIM(RTRIM(duration, 'minutes'))
        END AS duration,
    CASE
        WHEN cancellation = '' OR cancellation LIKE 'null' THEN NULL
        ELSE cancellation
        END AS cancellation
    FROM runner_orders
"""
cursor.executescript(script)
# Verify result
query("SELECT * FROM runner_orders_clean")

| | order_id | runner_id | pickup_time | distance | duration | cancellation |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2020-01-01 18:15:34 | 20.0 | 32.0 | |
| 1 | 2 | 1 | 2020-01-01 19:10:54 | 20.0 | 27.0 | |
| 2 | 3 | 1 | 2020-01-03 00:12:37 | 13.4 | 20.0 | |
| 3 | 4 | 2 | 2020-01-04 13:53:03 | 23.4 | 40.0 | |
| 4 | 5 | 3 | 2020-01-08 21:10:57 | 10.0 | 15.0 | |
| 5 | 6 | 3 | | | | Restaurant Cancellation |
| 6 | 7 | 2 | 2020-01-08 21:30:45 | 25.0 | 25.0 | |
| 7 | 8 | 2 | 2020-01-10 00:15:02 | 23.4 | 15.0 | |
| 8 | 9 | 2 | | | | Customer Cancellation |
| 9 | 10 | 1 | 2020-01-11 18:50:20 | 10.0 | 10.0 | |


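Convert `distance` and `duration` from text to numeric types, and declare `pickup_time` as DATETIME.

In [10]:
script = """
    -- Rebuild the table with declared types and CAST the stored text.
    -- Editing sqlite_master via PRAGMA writable_schema would neither convert
    -- existing values nor reach temp tables, which live in sqlite_temp_master.
    DROP TABLE IF EXISTS runner_orders_typed;
    CREATE TEMP TABLE runner_orders_typed (
        order_id INT NOT NULL, 
        runner_id INT NOT NULL,
        pickup_time DATETIME,
        distance FLOAT,
        duration INT,
        cancellation VARCHAR
    );
    INSERT INTO runner_orders_typed
    SELECT
        order_id, runner_id, pickup_time,
        CAST(distance AS REAL), CAST(duration AS INTEGER), cancellation
    FROM runner_orders_clean;
    DROP TABLE runner_orders_clean;
    ALTER TABLE runner_orders_typed RENAME TO runner_orders_clean;
"""
cursor.executescript(script)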

<sqlite3.Cursor at 0x14602d15640>

In [11]:
query("SELECT * FROM runner_orders_clean")

| | order_id | runner_id | pickup_time | distance | duration | cancellation |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2020-01-01 18:15:34 | 20.0 | 32.0 | |
| 1 | 2 | 1 | 2020-01-01 19:10:54 | 20.0 | 27.0 | |
| 2 | 3 | 1 | 2020-01-03 00:12:37 | 13.4 | 20.0 | |
| 3 | 4 | 2 | 2020-01-04 13:53:03 | 23.4 | 40.0 | |
| 4 | 5 | 3 | 2020-01-08 21:10:57 | 10.0 | 15.0 | |
| 5 | 6 | 3 | | | | Restaurant Cancellation |
| 6 | 7 | 2 | 2020-01-08 21:30:45 | 25.0 | 25.0 | |
| 7 | 8 | 2 | 2020-01-10 00:15:02 | 23.4 | 15.0 | |
| 8 | 9 | 2 | | | | Customer Cancellation |
| 9 | 10 | 1 | 2020-01-11 18:50:20 | 10.0 | 10.0 | |


**B. Pizza Metrics**

Q2: How many pizzas were ordered?

In [12]:
query("""
    SELECT COUNT(*) as pizza_order_count
    FROM customer_orders_clean
""")

| | pizza_order_count |
|---|---|
| 0 | 14 |


Q3: How many unique customer orders were made?


In [13]:
query("""
    SELECT COUNT(DISTINCT order_id) as unique_customer_orders
    FROM customer_orders_clean
""")

| | unique_customer_orders |
|---|---|
| 0 | 10 |


Q4: How many successful orders were delivered by each runner?


In [14]:
query("""
    SELECT 
        runner_id, 
        COUNT(order_id) AS successful_orders
    FROM runner_orders_clean
    WHERE cancellation IS NULL  -- delivered orders have no cancellation reason
    GROUP BY runner_id;
""")

| | runner_id | successful_orders |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 1 |


Q5: How many of each type of pizza was delivered?


In [15]:
query("""
    SELECT 
        p.pizza_name, 
        COUNT(c.pizza_id) AS delivered_pizza_count
    FROM customer_orders_clean AS c
    JOIN runner_orders_clean AS r
        ON c.order_id = r.order_id
    JOIN pizza_names AS p
        ON c.pizza_id = p.pizza_id
    WHERE r.cancellation IS NULL  -- delivered orders have no cancellation reason
    GROUP BY p.pizza_name;
""")

| | pizza_name | delivered_pizza_count |
|---|---|---|
| 0 | Meatlovers | 9 |
| 1 | Vegetarian | 3 |


Q6: How many Vegetarian and Meatlovers were ordered by each customer?


In [18]:
query("""
    SELECT
        co.customer_id, pn.pizza_name, COUNT(*) as num_orders
    FROM
        customer_orders co
    INNER JOIN pizza_names pn
        ON co.pizza_id = pn.pizza_id
    GROUP BY co.customer_id, pn.pizza_name
""")

| | customer_id | pizza_name | num_orders |
|---|---|---|---|
| 0 | 101 | Meatlovers | 2 |
| 1 | 101 | Vegetarian | 1 |
| 2 | 102 | Meatlovers | 2 |
| 3 | 102 | Vegetarian | 1 |
| 4 | 103 | Meatlovers | 3 |
| 5 | 103 | Vegetarian | 1 |
| 6 | 104 | Meatlovers | 3 |
| 7 | 105 | Vegetarian | 1 |


Q7: What was the maximum number of pizzas delivered in a single order?


In [19]:
query("""
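    WITH order_counts AS (
        -- Count pizzas per order, keeping only orders that were delivered
        SELECT 
            co.order_id, COUNT(*) as pizza_count
        FROM
            customer_orders_clean co
        INNER JOIN runner_orders_clean ro
            ON co.order_id = ro.order_id
        WHERE ro.cancellation IS NULL
        GROUP BY co.order_id
    )
      
    SELECT
        MAX(pizza_count) AS max_pizza_count
    FROM
        order_counts      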
""")

| | max_pizza_count |
|---|---|
| 0 | 3 |


Q8: For each customer, how many delivered pizzas had at least 1 change and how many had no changes?


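Note: Here, a change means the pizza has at least one exclusion or extra. The query below counts every ordered pizza, including those in cancelled orders; restricting it to delivered pizzas requires a join to `runner_orders_clean` filtered on `cancellation IS NULL`.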

In [22]:
query("""
    SELECT 
        co.customer_id,
        SUM(CASE
            WHEN co.exclusions IS NOT NULL OR co.extras IS NOT NULL THEN 1 
            ELSE 0 END) AS total_with_changes,
        SUM(CASE
            WHEN co.exclusions IS NULL AND co.extras IS NULL THEN 1 
            ELSE 0 END) AS total_no_changes
    FROM
        customer_orders_clean co
    GROUP BY co.customer_id
""")

| | customer_id | total_with_changes | total_no_changes |
|---|---|---|---|
| 0 | 101 | 0 | 3 |
| 1 | 102 | 0 | 3 |
| 2 | 103 | 4 | 0 |
| 3 | 104 | 2 | 1 |
| 4 | 105 | 1 | 0 |
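
Since the question asks about delivered pizzas, one possible variant joins the runner table and drops cancelled orders. The sketch below runs on a made-up four-order sample in an in-memory database (the rows are hypothetical, not the case-study data), so its numbers are not comparable with the output above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer_orders_clean (order_id INT, customer_id INT, exclusions TEXT, extras TEXT);
    CREATE TABLE runner_orders_clean (order_id INT, cancellation TEXT);
    -- Hypothetical rows: order 2 is cancelled, orders 3 and 4 carry a change.
    INSERT INTO customer_orders_clean VALUES
        (1, 101, NULL, NULL), (2, 101, NULL, NULL), (3, 103, '4', NULL), (4, 104, NULL, '1');
    INSERT INTO runner_orders_clean VALUES
        (1, NULL), (2, 'Restaurant Cancellation'), (3, NULL), (4, NULL);
""")

# IS NULL / IS NOT NULL evaluate to 0 or 1 in SQLite, so SUM counts matching rows.
delivered_changes = db.execute("""
    SELECT co.customer_id,
        SUM(co.exclusions IS NOT NULL OR co.extras IS NOT NULL) AS total_with_changes,
        SUM(co.exclusions IS NULL AND co.extras IS NULL) AS total_no_changes
    FROM customer_orders_clean co
    JOIN runner_orders_clean ro ON co.order_id = ro.order_id
    WHERE ro.cancellation IS NULL
    GROUP BY co.customer_id
    ORDER BY co.customer_id
""").fetchall()
print(delivered_changes)
```

Customer 101's cancelled order is excluded, leaving one unchanged pizza for that customer in this sample.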


Q9: How many pizzas were delivered that had both exclusions and extras?

Note: The query below counts every ordered pizza, including those in cancelled orders.


In [24]:
query("""
    SELECT
        SUM(CASE
            WHEN exclusions IS NOT NULL AND extras IS NOT NULL THEN 1 ELSE 0 
        END) AS total_with_both_exclusions_and_extras
    FROM
        customer_orders_clean    
""")

| | total_with_both_exclusions_and_extras |
|---|---|
| 0 | 2 |


Q10: What was the total volume of pizzas ordered for each hour of the day?


In [28]:
query("""
    SELECT
        strftime('%H', order_time) AS order_hour,
        COUNT(*) AS total_pizzas
    FROM
        customer_orders_clean
    GROUP BY
        order_hour
""")

| | order_hour | total_pizzas |
|---|---|---|
| 0 | 11 | 1 |
| 1 | 13 | 3 |
| 2 | 18 | 3 |
| 3 | 19 | 1 |
| 4 | 21 | 3 |
| 5 | 23 | 3 |


Q11: What was the volume of orders for each day of the week?


In [37]:
query("""
    SELECT
        SUBSTR('SunMonTueWedThuFriSat', 1 + 3*strftime('%w', order_time), 3) AS order_day,
        COUNT(*) AS total_pizzas
    FROM
        customer_orders_clean
    GROUP BY order_day
    ORDER BY total_pizzas DESC
""")

| | order_day | total_pizzas |
|---|---|---|
| 0 | Wed | 5 |
| 1 | Sat | 5 |
| 2 | Thu | 3 |
| 3 | Fri | 1 |


**B. Runner and Customer Experience**

Q12: How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01)


In [49]:
query("""
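    SELECT 
        -- Whole 7-day periods elapsed since 2021-01-01; strftime('%W') would
        -- instead count calendar weeks that start on Monday
        CAST((julianday(registration_date) - julianday('2021-01-01')) / 7 AS INTEGER) as week_number,
        COUNT(*) as signup_count
    FROM
        runners
    GROUP BY
        week_number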
""")

| | week_number | signup_count |
|---|---|---|
| 0 | 0 | 2 |
| 1 | 1 | 1 |
| 2 | 2 | 1 |


Q13: What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pick up the order?
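
One way to approach Q13 is to subtract `order_time` from `pickup_time` in seconds and average per runner. The sketch below is a possible approach rather than a definitive answer: it rebuilds a subset of the rows shown in the outputs above in an in-memory database, and assumes `strftime('%s', ...)` is an acceptable way to get epoch seconds.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer_orders_clean (order_id INT, order_time TEXT);
    CREATE TABLE runner_orders_clean (order_id INT, runner_id INT, pickup_time TEXT);
    -- Subset of rows copied from the outputs above; one row per pizza, so orders repeat.
    INSERT INTO customer_orders_clean VALUES
        (1, '2020-01-01 18:05:02'), (2, '2020-01-01 19:00:52'),
        (3, '2020-01-02 23:51:23'), (3, '2020-01-02 23:51:23'),
        (4, '2020-01-04 13:23:46'), (4, '2020-01-04 13:23:46'),
        (5, '2020-01-08 21:00:29');
    INSERT INTO runner_orders_clean VALUES
        (1, 1, '2020-01-01 18:15:34'), (2, 1, '2020-01-01 19:10:54'),
        (3, 1, '2020-01-03 00:12:37'), (4, 2, '2020-01-04 13:53:03'),
        (5, 3, '2020-01-08 21:10:57');
""")

avg_pickup = db.execute("""
    WITH orders AS (
        -- Collapse to one row per order so multi-pizza orders are not over-weighted
        SELECT DISTINCT order_id, order_time FROM customer_orders_clean
    )
    SELECT r.runner_id,
        ROUND(AVG((strftime('%s', r.pickup_time) - strftime('%s', o.order_time)) / 60.0), 1)
            AS avg_pickup_minutes
    FROM runner_orders_clean r
    JOIN orders o ON o.order_id = r.order_id
    WHERE r.pickup_time IS NOT NULL
    GROUP BY r.runner_id
    ORDER BY r.runner_id
""").fetchall()
print(avg_pickup)
```

The `DISTINCT` step is a design choice: without it, an order with three pizzas would count three times in the runner's average.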

Q14: Is there any relationship between the number of pizzas and how long the order takes to prepare?


Q15: What was the average distance travelled for each customer?
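
A possible sketch for Q15, assuming each order should count once regardless of how many pizzas it contains. The rows are the first seven orders from the outputs above, rebuilt in an in-memory database with only the columns the query needs.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer_orders_clean (order_id INT, customer_id INT);
    CREATE TABLE runner_orders_clean (order_id INT, distance REAL);
    -- order_id/customer_id pairs and cleaned distances copied from the outputs above.
    INSERT INTO customer_orders_clean VALUES
        (1, 101), (2, 101), (3, 102), (3, 102), (4, 103), (4, 103), (4, 103),
        (5, 104), (6, 101), (7, 105);
    INSERT INTO runner_orders_clean VALUES
        (1, 20.0), (2, 20.0), (3, 13.4), (4, 23.4), (5, 10.0), (6, NULL), (7, 25.0);
""")

avg_distance = db.execute("""
    SELECT o.customer_id, ROUND(AVG(r.distance), 1) AS avg_distance_km
    FROM (SELECT DISTINCT order_id, customer_id FROM customer_orders_clean) o
    JOIN runner_orders_clean r ON r.order_id = o.order_id
    WHERE r.distance IS NOT NULL  -- cancelled orders have no distance
    GROUP BY o.customer_id
    ORDER BY o.customer_id
""").fetchall()
print(avg_distance)
```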


Q16: What was the difference between the longest and shortest delivery times for all orders?


Q17: What was the average speed for each runner for each delivery and do you notice any trend for these values?
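
For Q17, speed can be derived as distance divided by duration, converted to km/h. The sketch below uses the cleaned distance and duration values shown earlier, loaded into an in-memory table; the column name `speed_kmh` is made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE runner_orders_clean (order_id INT, runner_id INT, distance REAL, duration INT);
    -- Values copied from the cleaned runner_orders output above.
    INSERT INTO runner_orders_clean VALUES
        (1, 1, 20.0, 32), (2, 1, 20.0, 27), (3, 1, 13.4, 20), (4, 2, 23.4, 40),
        (5, 3, 10.0, 15), (6, 3, NULL, NULL), (7, 2, 25.0, 25), (8, 2, 23.4, 15),
        (9, 2, NULL, NULL), (10, 1, 10.0, 10);
""")

speeds = db.execute("""
    SELECT runner_id, order_id,
        ROUND(distance * 60.0 / duration, 1) AS speed_kmh  -- km per minute times 60
    FROM runner_orders_clean
    WHERE distance IS NOT NULL
    ORDER BY runner_id, order_id
""").fetchall()
for row in speeds:
    print(row)
```

On these rows, runner 2's deliveries range from 35.1 to 93.6 km/h, a spread wide enough that the underlying durations may be worth double-checking.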


Q18: What is the successful delivery percentage for each runner?
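
A minimal sketch for Q18, assuming an order counts as successful when `cancellation` is NULL. The ten rows mirror the cleaned `runner_orders` output above, reduced to the columns needed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE runner_orders_clean (order_id INT, runner_id INT, cancellation TEXT);
    -- order_id, runner_id, and cancellation copied from the cleaned output above.
    INSERT INTO runner_orders_clean VALUES
        (1, 1, NULL), (2, 1, NULL), (3, 1, NULL), (4, 2, NULL), (5, 3, NULL),
        (6, 3, 'Restaurant Cancellation'), (7, 2, NULL), (8, 2, NULL),
        (9, 2, 'Customer Cancellation'), (10, 1, NULL);
""")

# `cancellation IS NULL` evaluates to 1 or 0, so SUM counts the delivered orders.
success = db.execute("""
    SELECT runner_id,
        ROUND(100.0 * SUM(cancellation IS NULL) / COUNT(*), 1) AS success_pct
    FROM runner_orders_clean
    GROUP BY runner_id
    ORDER BY runner_id
""").fetchall()
print(success)
```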


**C. Ingredient Optimization**

Q19: What are the standard ingredients for each pizza?
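
Only the table names `pizza_recipes` and `pizza_toppings` appear in this notebook, so the sketch below assumes a layout in which `pizza_recipes.toppings` holds a comma-separated list of topping ids. Both the column names and the sample rows are hypothetical. Under that assumption, a `LIKE` join on a delimiter-padded string can match each recipe to its toppings without splitting the list.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Assumed columns and made-up rows, for illustration only.
    CREATE TABLE pizza_recipes (pizza_id INTEGER, toppings TEXT);
    CREATE TABLE pizza_toppings (topping_id INTEGER, topping_name TEXT);
    INSERT INTO pizza_recipes VALUES (1, '1, 2, 10'), (2, '2, 4');
    INSERT INTO pizza_toppings VALUES
        (1, 'Bacon'), (2, 'Cheese'), (4, 'Mushrooms'), (10, 'Salami');
""")

# Pad both sides with commas so that id 1 cannot match inside id 10.
ingredients = db.execute("""
    SELECT r.pizza_id, t.topping_name
    FROM pizza_recipes r
    JOIN pizza_toppings t
        ON ',' || REPLACE(r.toppings, ' ', '') || ',' LIKE '%,' || t.topping_id || ',%'
    ORDER BY r.pizza_id, t.topping_name
""").fetchall()
print(ingredients)
```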


Q20: What was the most commonly added extra?

Q21: What was the most common exclusion?


Q22: Generate an order item for each record in the `customer_orders` table, in the format of one of the following:
- `Meat Lovers`
- `Meat Lovers - Exclude Beef`
- `Meat Lovers - Extra Bacon`
- `Meat Lovers - Exclude Cheese, Bacon - Extra Mushroom, Peppers`

Q23: Generate an alphabetically ordered comma separated ingredient list for each pizza order from the `customer_orders` table and add a `2x` in front of any relevant ingredients
- For example: "Meat Lovers: 2xBacon, Beef, ... , Salami"

Q24: What is the total quantity of each ingredient used in all delivered pizzas sorted by most frequent first?

**D. Pricing and Ratings**

Q25: If a Meat Lovers pizza costs $12 and Vegetarian costs $10 and there were no charges for changes - how much money has Pizza Runner made so far if there are no delivery fees?
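
As a worked example for Q25, the delivered counts from the Q5 result (9 Meatlovers, 3 Vegetarian) can be multiplied by the stated prices: 9 × 12 + 3 × 10 = 138. The sketch below does the same with a `CASE` expression; the helper table `delivered` is made up to hold the Q5 counts.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Delivered counts taken from the Q5 result above.
    CREATE TABLE delivered (pizza_name TEXT, delivered_pizza_count INTEGER);
    INSERT INTO delivered VALUES ('Meatlovers', 9), ('Vegetarian', 3);
""")

(revenue,) = db.execute("""
    SELECT SUM(
        delivered_pizza_count *
        CASE pizza_name WHEN 'Meatlovers' THEN 12 WHEN 'Vegetarian' THEN 10 END
    )
    FROM delivered
""").fetchone()
print(revenue)
```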


Q26: Refer to Q25, what if there was an additional $1 charge for any pizza extras? Example: Add cheese is $1 extra

Q27: The Pizza Runner team now wants to add an additional ratings system that allows customers to rate their runner, how would you design an additional table for this new dataset - generate a schema for this new table and insert your own data for ratings for each successful customer order between 1 to 5.
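
One possible design for Q27 keys the ratings by `order_id` and enforces the 1 to 5 range with a `CHECK` constraint. Everything in the sketch below is hypothetical: the table name `runner_ratings`, the `rated_at` column, and the inserted ratings are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema: at most one rating per delivered order.
    CREATE TABLE runner_ratings (
        order_id  INTEGER PRIMARY KEY,
        runner_id INTEGER NOT NULL,
        rating    INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
        rated_at  TEXT
    );
    -- Made-up ratings for three delivered orders.
    INSERT INTO runner_ratings VALUES
        (1, 1, 5, '2020-01-01 19:00:00'),
        (2, 1, 4, '2020-01-01 20:00:00'),
        (4, 2, 3, '2020-01-04 15:00:00');
""")

# The CHECK constraint should reject a rating outside the 1 to 5 range.
try:
    db.execute("INSERT INTO runner_ratings VALUES (5, 3, 6, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

avg_ratings = db.execute("""
    SELECT runner_id, ROUND(AVG(rating), 1) AS avg_rating
    FROM runner_ratings
    GROUP BY runner_id
    ORDER BY runner_id
""").fetchall()
print(rejected, avg_ratings)
```

Using `order_id` as the primary key limits each order to a single rating, which seems to match the wording of the question.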


Q28: Using your newly generated table - can you join all of the information together to form a table which has the following information for successful deliveries?
- `customer_id`, `order_id`, `runner_id`, `rating`, `order_time`, `pickup_time`
- Time between order and pickup, Delivery duration, Average speed, Total number of pizzas

Q29: If a Meat Lovers pizza was $12 and Vegetarian $10 fixed prices with no cost for extras and each runner is paid $0.30 per kilometre traveled - how much money does Pizza Runner have left over after these deliveries?


**E. Other**

Q30: If Danny wants to expand his range of pizzas - how would this impact the existing data design? Write an INSERT statement to demonstrate what would happen if a new Supreme pizza with all the toppings was added to the Pizza Runner menu?
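
A hedged sketch for Q30, assuming `pizza_names(pizza_id, pizza_name)`, `pizza_recipes(pizza_id, toppings)` with a comma-separated id list, and `pizza_toppings(topping_id, topping_name)`. Only the table names are confirmed by this notebook; the columns and sample rows are assumptions. Under that layout a new pizza is just new rows, so no schema change would be needed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Assumed column layout with made-up rows, for illustration only.
    CREATE TABLE pizza_names (pizza_id INTEGER, pizza_name TEXT);
    CREATE TABLE pizza_recipes (pizza_id INTEGER, toppings TEXT);
    CREATE TABLE pizza_toppings (topping_id INTEGER, topping_name TEXT);
    INSERT INTO pizza_names VALUES (1, 'Meatlovers'), (2, 'Vegetarian');
    INSERT INTO pizza_recipes VALUES (1, '1, 2'), (2, '2, 3');
    INSERT INTO pizza_toppings VALUES (1, 'Bacon'), (2, 'Cheese'), (3, 'Mushrooms');
""")

# Register the new pizza, then build its recipe from every known topping id.
db.execute("INSERT INTO pizza_names (pizza_id, pizza_name) VALUES (3, 'Supreme')")
db.execute("""
    INSERT INTO pizza_recipes (pizza_id, toppings)
    SELECT 3, GROUP_CONCAT(topping_id, ', ')
    FROM (SELECT topping_id FROM pizza_toppings ORDER BY topping_id)
""")

supreme = db.execute(
    "SELECT toppings FROM pizza_recipes WHERE pizza_id = 3"
).fetchone()[0]
print(supreme)
```

Queries that join on `pizza_id` would pick up the new pizza automatically, while any query that hard-codes pizza names or prices would need updating.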

