
# Question 1: Given some sample data, write a program to answer the following

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis. 

* Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. 
* What metric would you report for this dataset?
* What is its value?

In [1]:
import pandas as pd

In [75]:
df = pd.read_excel('2019 Winter Data Science Intern Challenge Data Set.xlsx', engine="openpyxl")

In [76]:
df.head()

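|   | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|----------|---------|---------|--------------|-------------|----------------|------------|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56.190022 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:51.999116 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 04:23:55.594730 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:36.648760 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 04:35:10.772536 |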


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   order_id        5000 non-null   int64         
 1   shop_id         5000 non-null   int64         
 2   user_id         5000 non-null   int64         
 3   order_amount    5000 non-null   int64         
 4   total_items     5000 non-null   int64         
 5   payment_method  5000 non-null   object        
 6   created_at      5000 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 273.6+ KB


## ***Part 1: Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.***

Average order value (AOV) is defined as the ratio of total revenue to the total number of orders placed over a given period of time: $\text{AOV} = \frac{\Sigma_t\text{revenue}}{\Sigma_t\text{orders}}$. The AOV provided appears to come from dividing the total order amount by the number of rows in the data set, one row per order: $\frac{\$15{,}725{,}640}{5000} = \$3145.128 \approx \$3145.13$. That mean treats a bulk order of many pairs the same as an order for a single pair, so it says little about what a pair of sneakers costs. We can get a figure that better reflects sneaker prices by dividing the total order amount by the total number of *items* sold, which gives the average revenue per item.
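To make the choice of denominator concrete, here is a minimal sketch that uses only the five rows shown by `df.head()` above, re-entered by hand. The names `sample`, `aov_per_order`, and `avg_per_item` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# The five rows shown by df.head(), typed in by hand for illustration.
sample = pd.DataFrame({
    "order_amount": [224, 90, 144, 156, 156],
    "total_items": [2, 1, 1, 1, 1],
})

revenue = sample["order_amount"].sum()

# Each row is one order, so the number of orders is the number of rows.
aov_per_order = float(revenue / len(sample))

# Dividing by items sold gives an average price per item instead.
avg_per_item = float(revenue / sample["total_items"].sum())

print(aov_per_order)           # 154.0
print(round(avg_per_item, 2))  # 128.33
```

On these five rows the two figures are close because no order contains more than two items. They diverge sharply once bulk orders enter the data.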

From [bigcommerce.com](https://www.bigcommerce.com/ecommerce-answers/what-average-order-value/):

"*AOV is determined using sales per order, not sales per customer. Although one customer may come back multiple times to make a purchase, each order would be factored into AOV separately.*"

In [24]:
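total_revenue = df["order_amount"].sum()
# total_items is the number of pairs in each order, so this sum is items sold, not orders
total_items_sold = df["total_items"].sum()

In [31]:
# Average revenue per item sold
AOV = round(total_revenue / total_items_sold, 2)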
AOV

357.92

__Projected AOV: $\$357.92$__. Because the shoes are supposed to be relatively affordable, this per-item figure appears to be a more plausible reflection of what customers spend than the naive $\$3145.13$.
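Because a mean is sensitive to a few very large orders, one alternative that may be worth reporting alongside it is the median order amount. The sketch below is only an illustration: it takes the five order amounts from `df.head()` and appends one hypothetical bulk order whose amount (704000) is invented, not read from the data set.

```python
import pandas as pd

# First five order amounts from df.head(), plus one hypothetical bulk order.
# The 704000 figure is made up for illustration only.
amounts = pd.Series([224, 90, 144, 156, 156, 704000])

mean_value = float(amounts.mean())
median_value = float(amounts.median())

print(round(mean_value, 2))  # 117461.67
print(median_value)          # 156.0
```

A single extreme order moves the mean by several orders of magnitude, while the median stays at a typical order amount.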

## ***Part 2: What metric would you report for this dataset?***

In Part 1 we measured the AOV. As a business owner I may be interested in increasing the number of items per transaction. Perhaps we could look at every order that contains only one item and encourage those customers to buy multiple items through [bundles or free shipping](https://www.bigcommerce.com/ecommerce-answers/what-average-order-value/). We will define this metric as Single Purchase Percentage (SPP). Another metric we could consider is how often customers return. If we can encourage patrons to come back to our store, we can increase the revenue we generate. We could offer coupons for future orders or institute a rewards program to encourage repeat purchases. We will define this metric, the average number of orders per customer, as Average Customer Retention (ACR).

## ***Part 3: What is its value?***

We can find the SPP by calling pandas' `.value_counts()` method on the `total_items` column and dividing the number of single-item orders by the number of multi-item orders. Note that, defined this way, SPP is the ratio of single-item to multi-item orders expressed as a percentage, not the share of all orders. We can find the ACR in a similar manner by using `.value_counts()` to count how many times each `user_id` occurs in the data set and averaging those counts.
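The two definitions above can be sketched as small helper functions. The function names and the `toy` frame are hypothetical and exist only for this illustration; the column names match the data set.

```python
import pandas as pd

def single_purchase_percentage(orders: pd.DataFrame) -> float:
    """Single-item orders as a percentage of multi-item orders."""
    counts = orders["total_items"].value_counts()
    single = counts.get(1, 0)
    multiple = counts.sum() - single
    # Assumes at least one multi-item order; otherwise this divides by zero.
    return round(float(100 * single / multiple), 2)

def average_customer_retention(orders: pd.DataFrame) -> float:
    """Mean number of orders per distinct user_id."""
    return round(float(orders["user_id"].value_counts().mean()), 2)

# Hypothetical toy data: six orders placed by three customers.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "total_items": [1, 1, 2, 3, 1, 2],
})

print(single_purchase_percentage(toy))  # 100.0
print(average_customer_retention(toy))  # 2.0
```

In the toy data three orders contain one item and three contain more, so the ratio is 100%; the three customers placed 2, 1, and 3 orders, for a mean of 2.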

In [49]:
purchase_counts = df["total_items"].value_counts()
purchase_counts

2       1832
1       1830
3        941
4        293
5         77
2000      17
6          9
8          1
Name: total_items, dtype: int64
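The 17 orders of 2000 items each stand out in the counts above, and orders like these are a likely reason the naive mean is so high. One way to inspect them is to filter on `total_items` and group by shop and user. The sketch below runs on a hypothetical frame with made-up IDs; on the real data the same filter could be applied to `df`.

```python
import pandas as pd

# Hypothetical rows with invented shop and user IDs, for illustration only.
toy = pd.DataFrame({
    "shop_id": [10, 11, 12, 12],
    "user_id": [100, 101, 102, 102],
    "total_items": [2, 1, 2000, 2000],
})

# Keep only the unusually large orders.
bulk = toy[toy["total_items"] >= 2000]

# Count how many bulk orders each (shop, user) pair placed.
summary = bulk.groupby(["shop_id", "user_id"]).size()
print(summary)
```

If the bulk orders turn out to come from a small number of shops or users, they can reasonably be analysed separately from ordinary retail orders.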

In [52]:
single_purchases = purchase_counts[1]
single_purchases

1830

In [77]:
multiple_purchases = purchase_counts.drop(1, axis=0).sum()
multiple_purchases

3170

In [78]:
spp = round(100*(single_purchases / multiple_purchases), 2)
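The value computed in this cell compares single-item orders with multi-item orders. If the share of *all* orders is wanted instead, the denominator changes to the total order count. A quick check using the counts printed above (the variable names `ratio_to_multi` and `share_of_all` are illustrative):

```python
# Counts taken from the value_counts() output above.
single_purchases = 1830
multiple_purchases = 3170
total_orders = single_purchases + multiple_purchases  # 5000

# Ratio of single-item to multi-item orders, as used in this notebook.
ratio_to_multi = round(100 * single_purchases / multiple_purchases, 2)

# Share of all orders that contain exactly one item.
share_of_all = round(100 * single_purchases / total_orders, 2)

print(ratio_to_multi)  # 57.73
print(share_of_all)    # 36.6
```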

In [79]:
customer_freq = df["user_id"].value_counts().sort_values()
customer_freq

717     7
750     7
998     9
719     9
955     9
       ..
727    25
791    26
847    26
868    27
718    28
Name: user_id, Length: 301, dtype: int64

In [80]:
acr = round(customer_freq.mean(), 2)
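Averaging the per-user counts is equivalent to dividing the number of orders by the number of distinct customers, which gives a quick sanity check. The sketch below shows the equivalence on a hypothetical toy frame and then repeats the arithmetic with the figures reported above (5000 orders, 301 distinct `user_id` values).

```python
import pandas as pd

# Hypothetical toy data: six orders placed by three customers.
toy = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3]})

# Mean of the per-user order counts.
via_value_counts = float(toy["user_id"].value_counts().mean())

# Total orders divided by distinct customers.
via_ratio = len(toy) / toy["user_id"].nunique()

print(via_value_counts, via_ratio)  # 2.0 2.0

# Same arithmetic with the notebook's own totals.
print(round(5000 / 301, 2))  # 16.61
```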

In [81]:
print(f"Single Purchase Percentage (SPP): {spp}%, Average Customer Retention (ACR): {acr}")

Single Purchase Percentage (SPP): 57.73%, Average Customer Retention (ACR): 16.61


## Conclusion

We found that the naive AOV of $\$3145.13$ comes from dividing total revenue by the 5,000 order rows, and that dividing by the number of items sold instead gives a more plausible figure of $\$357.92$ per item. We then defined two metrics, SPP and ACR, that could potentially help increase our total revenue. A decrease in SPP would imply that a growing number of orders contain more than one item. An increase in ACR would imply that customers are placing more orders on average, that is, that we are retaining more of them.