# Week 2: CostPro Customer Behavior



🚨 **First things first! Make a copy of this notebook. Your changes will not save unless you create your own copy!**

💡 Build Intuition

As a data scientist at CostPro, your goal is to collect and analyze data on various factors influencing the likelihood of a purchase, such as a customer's past purchasing history, online behavior, and demographic information.

You'll use probability distributions to model customer behavior and determine how likely different outcomes are.

## 🚀 Project Setup

First, we need to download and import all of the dependencies that we will need for the project.

### Dependencies

In [None]:
# Install all required dependencies for the project
!pip install -qqq numpy pandas seaborn matplotlib gdown plotly scipy
# We need this to avoid version incompatibilities between packages (please ignore any warnings)
!pip install --force-reinstall --ignore-installed -qqq protobuf==3.20

In [None]:
# Import dependencies
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import shutil
from scipy.stats import poisson, binom, norm, skew, kurtosis, probplot, loggamma
import gdown
from google.colab import files

# [TO BE IMPLEMENTED]
# Feel free to add any other imports that you need to write your own custom metrics

## Data

[Online Retail](https://archive.ics.uci.edu/ml/datasets/online+retail) is a collection of roughly 400,000 records from an international online retail dataset. For this project, we will use a slightly modified (cleaned up) version of this dataset as the sales data for CostPro.

### Download the Dataset

In [None]:
# Download the file from the co:rise Google Drive
file_name = "online_retail.csv"
unique_id = "16HZKULqv2sX6AMqi4A8ndQ9ULeCLo4Qt"
gdown.download(id=unique_id, output=file_name)

Now we need to create a directory. This will be used to store the files that you will need to produce your dashboard.

In [None]:
# Create the directory
!mkdir -p retail_metrics_dashboard

### Clean & Prepare the Dataset

Now we'll import the data as a pandas dataframe. The code below will show you the first 5 rows of the dataframe, but feel free to explore the data further!

In [None]:
# import data and show first 5 rows
data = pd.read_csv(file_name)
data.head()

## Project Jumpstart

### Explore Basic Statistics You Get Out of the Box with Pandas

In [None]:
# check record count and column count
data.shape

In [None]:
# review basic information about the dataset including column names, data types, non-null counts, and memory usage
data.info()

In [None]:
# check the number of missing values in each column
data.isnull().sum(axis=0)

On the job, you'll need to explore whether there are specific patterns to the missing values. For the sake of expediency and simplicity in this project, we'll drop them.

In [None]:
# Drop missing values and validate that there are no missing values after the fix
data = data.dropna().reset_index(drop=True)
data.isnull().sum(axis=0)
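As a rough illustration of what "exploring patterns in the missing values" could look like, the sketch below checks whether missingness is concentrated in one group. It runs on a made-up toy dataframe, not the CostPro data, and the column names `Country`, `CustomerID`, and `Description` are assumptions about what the real dataset contains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names are assumed, values are made up
toy = pd.DataFrame({
    "Country": ["UK", "UK", "France", "France", "UK", "Germany"],
    "CustomerID": [17850.0, np.nan, 12583.0, np.nan, np.nan, 12662.0],
    "Description": ["MUG", "LAMP", None, "CLOCK", "BOX", "TRAY"],
})

# Share of missing values in each column
missing_share = toy.isnull().mean()

# Share of missing CustomerID values within each country
missing_by_country = toy["CustomerID"].isnull().groupby(toy["Country"]).mean()

print(missing_share)
print(missing_by_country)
```

If one group accounts for most of the missing values, dropping those rows may bias the analysis, which is worth knowing before calling `dropna()`.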

In [None]:
# Check the number of unique values in each column
data.nunique()

In [None]:
# Use pandas describe() method to get a summary of the numerical columns
basic_stats = data.describe()
basic_stats

Now we need to create columns for Year, Month, Day, and Quarter so that your leadership team at CostPro can look at statistics for specific time periods in your dashboard.

In [None]:
# Format date columns
# Convert InvoiceDate to a pandas datetime so the .dt accessor can extract date parts
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

data["Year"] = data["InvoiceDate"].dt.year
data["Month"] = data["InvoiceDate"].dt.month
data["Day"] = data["InvoiceDate"].dt.day
data["Quarter"] = data["InvoiceDate"].dt.quarter

### Clean the Data

Check the basic stats on the numeric data again. What do you notice?

In [None]:
basic_stats

Look at the quantity minimum! `-80995.000000`

Let's take a look at the dataset again to figure out what's going on!

In [None]:
# Review records where the quantity is less than 0
data[data['Quantity'] < 0].head()

After closer examination, we can see that both the quantities and pricing are negative. We can also see that the Invoice Numbers are prepended with a `C`.

A quick glance at the data docs tells us that "InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation."

In [None]:
# Create a function to remove cancelled orders
def remove_cancelled_orders(data):
    # Create a subset of the data that only has cancelled invoices
    # (.copy() avoids a SettingWithCopyWarning when we add a column below)
    cancelled_invoices = data[data['InvoiceNo'].str.contains('C')].copy()

    # Create a new column that contains the original invoice number
    cancelled_invoices['OriginalInvoiceNo'] = cancelled_invoices['InvoiceNo'].str.replace('C', '', regex=False)

    # Filter the dataset to only include invoices that haven't been cancelled
    data = data[~data['InvoiceNo'].str.contains('C')]

    # Filter the data by the original invoice number on the cancelled invoices
    return data[~data['InvoiceNo'].isin(cancelled_invoices['OriginalInvoiceNo'])]

data = remove_cancelled_orders(data)

display(data.head())

# Review the data again to ensure that the cancelled orders have been removed
assert data[data['InvoiceNo'].str.contains('C')].shape[0] == 0


Great! That's looking much better! Now the minimum quantity is 1.

On the job, we'd definitely want to dig into what products get returned and at what rate and factor returns into total sales. For the sake of expediency in this project, we're going to ignore cancellations and returns.

### Create Useful Aggregations

In [None]:
# Create a column that represents the total price of each item
data['TotalPrice'] = data['Quantity'] * data['UnitPrice']

In [None]:
# Create a dataframe that represents the total value of each order
total_value_per_order = data.groupby('InvoiceNo')['TotalPrice'].sum().to_frame()
total_value_per_order.head()

### Filter Data By Year and Quarter

Here's an example. You get the first one for free!

In [None]:
# Write a function to filter a pandas dataframe by year and quarter
def filter_by_year_and_quarter(df, year, quarter):

    # Filter the dataframe by the year and quarter
    filtered_df = df[(df['Year'] == year) & (df['Quarter'] == quarter)]

    # Fail loudly if there are no matching records
    assert filtered_df.shape[0] > 0, "There are no records for the specified year and quarter"

    # Return the filtered dataframe
    return filtered_df

In [None]:
# Create a dataframe that only contains records for 2022 Q4
display(filter_by_year_and_quarter(data, 2022, 4).head(5))

# Write a test to validate the function
assert filter_by_year_and_quarter(data, 2022, 4)['Year'].unique()[0] == 2022, "There should only be records for 2022"
assert filter_by_year_and_quarter(data, 2022, 4)['Quarter'].unique()[0] == 4, "There should only be records for Q4"

## Your Turn

Use probability distributions to answer questions that CostPro can use to drive business decisions!

### Are CostPro Prices Normally Distributed?

CostPro management is worried that there might be a mismatch between the prices they're charging and the prices that customers are willing to pay. Specifically, they're worried that they're selling too many expensive items, and that customers are mostly buying lower-priced items.

Ideally, management would like the distribution of item prices to match the distribution of customer purchase prices.

Your job is to plot the distribution of number of sales at each unit price. Are sales normally distributed across unit price?

Calculate skew and kurtosis for the distribution. Then create a QQ plot of the distribution to compare it to a normal distribution.

💡 Hint: Are there any libraries/methods we've imported that could help? If you're not sure how to use a method, you can often access documentation by writing `help(function_name)`
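To get a feel for these tools before applying them to the sales data, here is a minimal sketch on synthetic samples (not CostPro data). It compares a roughly normal sample with a right-skewed one using `skew`, `kurtosis`, and `probplot` from `scipy.stats`. The sample sizes and the random seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import skew, kurtosis, probplot

rng = np.random.default_rng(seed=0)

# Synthetic samples: one roughly normal, one right-skewed
normal_sample = rng.normal(loc=0.0, scale=1.0, size=5000)
skewed_sample = rng.exponential(scale=1.0, size=5000)

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # kurtosis() returns excess kurtosis by default (fisher=True),
    # so a normal distribution scores close to 0
    print(name, "skew:", skew(sample), "excess kurtosis:", kurtosis(sample))

# Without a plot argument, probplot returns the QQ points and a least-squares fit
(theoretical_q, ordered_values), (slope, intercept, r_normal) = probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = probplot(skewed_sample, dist="norm")

# r close to 1 means the QQ points lie close to a straight line
print("QQ fit correlation, normal:", r_normal, "exponential:", r_skewed)
```

To draw the QQ plot rather than just compute it, `probplot` also accepts a `plot` argument, for example `probplot(sample, dist="norm", plot=plt)`.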


In [None]:
# Create a dataframe that tells us the total number of sales at each unit price
orders_per_price = <YOUR CODE HERE>
orders_per_price.head()

In [None]:
# Write a function to check skew and kurtosis
def check_skew_kurtosis(column: str, data: pd.DataFrame):

    skewness = <YOUR CODE HERE>
    kurtosis_value = <YOUR CODE HERE>
    print(f'The skew of {column} is {skewness}. A normal distribution should have a skew close to 0\n' )
    print(f'The kurtosis of {column} is {kurtosis_value}. A normal distribution should have a kurtosis close to 0\n' )


In [None]:
## Get the skew and Kurtosis for the distribution of the number of sales at each unit price
<YOUR CODE HERE>

In [None]:
## Write a function that outputs the QQ plot of a column of data in a dataframe

def qq_plot(column: str, data: pd.DataFrame):
  <YOUR CODE HERE>

In [None]:
# Display your plot
<YOUR CODE HERE>

### How Likely Is It That CostPro Will Have Enough High-Value Sales?

Your boss at CostPro feels that CostPro needs more "high value" sales, where a sale is "high value" if the total price is over \$1000. Management has set a goal of at least 350 high value sales per quarter.

Your job is to estimate the likelihood that CostPro will meet this goal. How likely is it to have at least 350 high value sales in a given quarter?

To do this, you can model sales as a binomial distribution. Since we're interested only in high value sales, you can count a sale as a "success" if it's over \$1000, and a "failure" if it's \$1000 or less. To calculate the binomial distribution, you just need two key values:

1. n, the total number of sales in a given quarter (you can use the mean number of sales in a quarter)
2. p, the probability that a sale is over \$1000 (you can use the overall rate at which sales are over \$1000 in the dataset)

In the cells below, use the sales dataset to get the values of n and p, and then use these to generate a binomial distribution. You can then use this binomial distribution to calculate the probability that there will be at least 350 high value sales in a given quarter.
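The "at least 350" calculation can be sketched as follows, using made-up values for n and p rather than values from the sales data. It relies on `binom.sf(k, n, p)`, which gives P(X > k), so "at least 350" becomes `sf(349, n, p)`.

```python
from scipy.stats import binom

# Made-up values for illustration only; compute the real n and p from the sales data
n = 5000     # hypothetical number of sales in a quarter
p = 0.07     # hypothetical probability that a sale is over $1000
goal = 350   # at least this many high-value sales

# P(X >= goal) = P(X > goal - 1), and sf(k) gives P(X > k)
prob_at_least_goal = binom.sf(goal - 1, n, p)

# Equivalent calculation via the CDF: 1 - P(X <= goal - 1)
prob_via_cdf = 1 - binom.cdf(goal - 1, n, p)

print(f"P(X >= {goal}) = {prob_at_least_goal:.4f}")
```

The off-by-one matters: `binom.sf(350, n, p)` would give the probability of more than 350 high-value sales, not at least 350.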

In [None]:
# Write a function using the Binomial distribution to model the number of sales
def customer_purchase_proba(data: pd.DataFrame, year: int, quarter: int) -> float:
    """
    Calculate the probability of having at least 350 high-value sales (over $1000) in a quarter
    :param data: The dataframe containing the data
    :param year: The year to calculate the probability for
    :param quarter: The quarter to calculate the probability for
    :return: The probability of having at least 350 high-value sales in the quarter
    """
    # Filter the data by the given year and quarter
    data = filter_by_year_and_quarter(data, year, quarter)

    # Calculate the total number of sales (n) and the rate of high-value sales (p)
    <YOUR CODE HERE>

    # Generate the Binomial distribution
    <YOUR CODE HERE>

    # Calculate the probability of having at least 350 high-value sales
    return <YOUR CODE HERE>

In [None]:
# Call the function for 2022 Q4
print(f'The probability of having at least 350 high-value sales is: {<YOUR CODE HERE>}')

Now, plot the cumulative distribution function for the binomial distribution.

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

<YOUR CODE HERE>

In [None]:
# Visualize the Binomial distribution
<YOUR CODE HERE>

### How Likely Is It That CostPro Will Have a 10% Increase in Customers On Any Given Day?

Your supervisor wants to know how effective the new marketing campaign is. The day after the campaign was launched, CostPro saw a 10% increase in sales, relative to the average number of sales in a day.

Your supervisor wants to know: how likely is it that this was due to random chance?

To help answer this question, you can model customer behavior using the Poisson distribution. In the cells below, use the Poisson distribution to calculate the probability that CostPro has a 10% increase (over the mean) in customers on a given day.
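As a minimal sketch of the idea, the example below uses a made-up daily mean rather than one computed from the data. It treats "a 10% increase" as "at least 10% above the mean", which is one possible reading of the question. `poisson.sf(k, mu)` gives P(X > k).

```python
import math
from scipy.stats import poisson

# Made-up value for illustration only; compute the real daily mean from the data
mean_customers_per_day = 60.0  # hypothetical

# Smallest whole number of customers that is at least 10% above the mean.
# round() guards against floating-point noise such as 66.00000000000001.
threshold = math.ceil(round(mean_customers_per_day * 1.10, 6))

# P(X >= threshold) = P(X > threshold - 1), and sf(k) gives P(X > k)
prob_increase = poisson.sf(threshold - 1, mu=mean_customers_per_day)

print(f"P(X >= {threshold}) = {prob_increase:.4f}")
```

A result that is not small suggests that a day with 10% more customers happens fairly often by chance alone, so a single such day is weak evidence that the campaign worked.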

In [None]:
# Write a function using the Poisson distribution to model how likely CostPro is to have 10% more customers than average on a given day
def customer_arrival_proba(data: pd.DataFrame, year: int, quarter: int) -> float:
    """
    Calculate the probability of having 10% more customers than the daily mean on a given day
    :param data: The dataframe containing the data
    :param year: The year to calculate the probability for
    :param quarter: The quarter to calculate the probability for
    :return: The probability of having 10% more customers than the daily mean on a given day
    """
    # Filter the data by the given year and quarter
    data = filter_by_year_and_quarter(data, year, quarter)

    # Calculate the mean number of customers per day
    <YOUR CODE HERE>

    # Generate the Poisson distribution
    <YOUR CODE HERE>

    # Calculate the probability of having 10% more customers on a given day
    return <YOUR CODE HERE>

In [None]:
# Call the customer arrival function for 2022 Q4
print(f'The probability of having 10% more customers on a given day is: {<YOUR CODE HERE>}')

In [None]:
# Write a test to validate the customer arrival function
assert <YOUR CODE HERE>, "There should be no customers for 2022 Q4"

In [None]:
# Visualize the Poisson distribution for customer visits
def plot_visits_distribution(data):
  <YOUR CODE HERE>


plot_visits_distribution(data)

### Optional

As data scientists, it's good practice to make liberal use of `assert` statements throughout notebooks to catch unexpected issues.

We're often reading in large volumes of data created by someone/some process, and analysis in a notebook created once may be reused on newer data as it comes in. Some basic sense checking can save you a lot of headaches down the line.

Examples include:
* am I getting the number of rows I expect from this data?
* is this column that should never be a null, never actually a null?

Can you think of and write some appropriate assert statements?
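A few checks of this kind might look like the sketch below. It runs on a small made-up dataframe so that it is self-contained. The column names match the ones used earlier in this notebook, but the values and the specific checks are only illustrative.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned sales data
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "Quantity": [6, 2, 12],
    "UnitPrice": [2.55, 3.39, 1.25],
})

# Did we load any rows at all?
assert len(toy) > 0, "Expected at least one row"

# Columns that should never be null
required_columns = ["InvoiceNo", "Quantity", "UnitPrice"]
assert toy[required_columns].notnull().all().all(), "Unexpected nulls in required columns"

# After removing cancellations, every quantity should be positive
assert (toy["Quantity"] > 0).all(), "Found non-positive quantities"

# No cancelled invoices should remain
assert not toy["InvoiceNo"].str.startswith("C").any(), "Found cancelled invoices"

checks_passed = True
print("All sanity checks passed")
```

Each assertion carries a message, so a failure on new data says which expectation was violated.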

### Optional

Come Up With Your Own Question, And Answer It With Probability Distributions

## Make a Dashboard for Your Portfolio!

Optional, but ***Highly Encouraged***, Since You've Already Written the Code!

To bring our analysis to life we'll be using a toolkit known as Jupyter Widgets, alongside an interactive plot creation library called Plotly.

At this stage we don't anticipate that you're necessarily familiar with the ins and outs of how Jupyter Widgets function, but the code and links below should help you get started.

You're welcome to modify it as you please. In fact, data visualization is a crucial component of the data science toolkit, so if you're so inclined, give it a try! Some excellent starting points for learning about Jupyter Widgets can be found [here](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html), [here](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html), and [here](https://towardsdatascience.com/bring-your-jupyter-notebook-to-life-with-interactive-widgets-bc12e03f0916)!

One potential place to start: can you come up with a better plot type to show outliers in the data? Have a scroll through what the plotly.express library can do [here](https://plotly.com/python/plotly-express/) for inspiration.

In [None]:
import pandas as pd
import plotly.express as px
from ipywidgets import interact, Select, Output, link
from IPython.display import display, clear_output

from google.colab import output
output.enable_custom_widget_manager()

# Create a copy of the data we've processed,
#  in case we want to modify it some more
df = data.copy(deep=True)

# Create an output widget
out = Output()

# Define function to execute on change of filter
def metrics_and_visualization(year, quarter):
  # Filter the data for the selected year and quarter
  filtered_df = filter_by_year_and_quarter(df, year, quarter)

  # Sample our data, to make the visualisation less computationally intensive
  sampled_df = filtered_df.sample(frac=0.1, random_state=42)

  out.clear_output(wait=True)
  with out:
    # Display the metrics, computed on the full quarter rather than the sample
    print(f"Q{quarter} Customer Purchase Probability: {customer_purchase_proba(filtered_df, year, quarter)}")
    print(f"Q{quarter} Customer Arrival Probability: {customer_arrival_proba(filtered_df, year, quarter)}")

    # Visualise the poisson distribution for customer visits
    plot_visits_distribution(sampled_df)

    # Feel free to add other visualisations here!
    # <YOUR CODE>

# Create selection widgets to choose the year and quarter
year_widget = Select(options=df["Year"].unique().tolist(), description='Year:')
quarter_widget = Select(options=df["Quarter"].unique().tolist(), description='Quarter:')

# Display widgets
display(year_widget, quarter_widget, out)

# Define function to update quarters widget based on the selected year
def update_quarters(change):
    year = change['new']
    quarters = df[df["Year"] == year]["Quarter"].unique().tolist()
    quarter_widget.options = quarters
    quarter_widget.value = quarters[0] if quarters else None

# Bind the function to changes in the year widget
year_widget.observe(update_quarters, names='value')

# Bind the metrics_and_visualization function to changes in the widgets
def on_change(change):
    metrics_and_visualization(year_widget.value, quarter_widget.value)

quarter_widget.observe(on_change, names='value')

# Call function once to update quarters and display initial data
update_quarters({'new': year_widget.value})
metrics_and_visualization(year_widget.value, quarter_widget.value)

We often want to share visualisations like this with our colleagues and let them play around but without exposing them to all the analysis code we took to get there. Luckily we can easily do this in Colab!
* go to `Edit` -> `Select All Cells`
* go to `View` -> `Show/hide code`

You can also go through and collapse individual sections that you don't want to share (e.g. the bit at the top where we installed some python libraries).

You can then share a link to the notebook (`Share` -> `Copy Link`, setting the `General Access` field appropriately) and then anyone you share the link with can open the notebook, run through it, and see your analysis!

#### 🚀 You Did It!!!

Congratulations, you've completed your second assignment in Applied Statistics for Data Science.