#### Question 1a: *Think about what could be going wrong with our calculation. Think about a better way to evaluate this data*

Immediately, I think the average of `3145.13` is being skewed either by a small number of very large orders or by one shop that consistently places very high-volume orders. I would check the standard deviation to confirm this.

In [1]:
import pandas as pd

all_info    = pd.read_csv("data_sneakers.csv")
volume_info = all_info[['order_amount', 'total_items']]

# Show a general analysis of the order_amount and total_items columns, including percentiles in 10% increments.
volume_info.describe(percentiles=[x/10.0 for x in range(0,10)])

| | order_amount | total_items |
|---|---|---|
| count | 5000.0 | 5000.0 |
| mean | 3145.128 | 8.7872 |
| std | 41282.539349 | 116.32032 |
| min | 90.0 | 1.0 |
| 0% | 90.0 | 1.0 |
| 10% | 133.0 | 1.0 |
| 20% | 156.0 | 1.0 |
| 30% | 176.0 | 1.0 |
| 40% | 236.0 | 2.0 |
| 50% | 284.0 | 2.0 |


The `describe()` output shows a rather extreme standard deviation of `41282.5`. None of the percentiles look too drastically weighted, so I assume the anomalous mean is caused by a few HUGE data points near the max value of `704000`. Let's check a few things to confirm this.

In [8]:
order_amounts = volume_info['order_amount']

search_threshold = order_amounts.max() - order_amounts.std()
outliers = order_amounts[ order_amounts >= search_threshold]
print(f"{len(outliers)}/{len(order_amounts)} entries found that exceed {search_threshold} in order_amounts.")

17/5000 entries found that exceed 662717.4606512119 in order_amounts.


#### Question 1b: *What metric would you report for this dataset?*
It looks like a small percentage of outliers is skewing our mean. A median should be good enough to give us a quick look at a reasonable AOV, since only a tiny fraction of the data (the 17 of 5000 orders found above, about 0.34%) sits at the upper extreme.

In [9]:
print(volume_info.median())

order_amount    284.0
total_items       2.0
dtype: float64
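
To illustrate why the median holds up where the mean does not, here is a minimal sketch on synthetic data. The numbers are made up for illustration and are not taken from `data_sneakers.csv`; only the `704000` value echoes the max mentioned above. It uses the standard library's `statistics` module so it runs without the CSV; `Series.mean()` and `Series.median()` in pandas would behave the same way.

```python
from statistics import mean, median

# Synthetic order amounts (an assumption for illustration only):
# 99 ordinary orders between 100 and 296, in steps of 2.
typical_orders = [100 + 2 * i for i in range(99)]

# The same orders plus a single very large one.
orders_with_outlier = typical_orders + [704000]

# Compare how each statistic reacts to the one extreme value.
print("mean without outlier:  ", mean(typical_orders))
print("mean with outlier:     ", mean(orders_with_outlier))
print("median without outlier:", median(typical_orders))
print("median with outlier:   ", median(orders_with_outlier))
```

A single extreme order drags the mean up by orders of magnitude, while the median barely moves. That is the same pattern seen between the `3145.13` mean and the `284.0` median.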


#### Question 1c: *What is its value?*
A value of `284.0` for our AOV seems appropriate given the percentiles from earlier. We could use something like the interquartile range, or simply trim the largest and smallest values, to get a more precise middle value for the dataset, but I doubt that would change much here since the outliers make up such a small percentage of the overall dataset.
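
For reference, the two alternatives mentioned above could be sketched roughly as follows. This is a sketch under assumptions, not part of the original analysis: the helper names `trimmed_mean` and `iqr_filtered`, the 5% trim, and the 1.5 fence factor are all illustrative choices, and the data is synthetic rather than read from `data_sneakers.csv`.

```python
from statistics import mean, median, quantiles


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the values from each end (hypothetical helper)."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values to drop per side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def iqr_filtered(values, factor=1.5):
    """Keep values inside [Q1 - factor*IQR, Q3 + factor*IQR] (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]


# Synthetic order amounts (illustrative only): 99 ordinary orders and one huge one.
orders = [100 + 2 * i for i in range(99)] + [704000]

print("trimmed mean:     ", trimmed_mean(orders))
print("IQR-filtered mean:", mean(iqr_filtered(orders)))
print("median:           ", median(orders))
```

On data shaped like this, all three land close to each other, which is consistent with the expectation that the extra work would not change the reported value much.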