TL;DR: <br>
Q1.a)<br>
At first glance, I imagine that some high-end outliers in our dataset are skewing the AOV.<br>
Q1.b)<br>
In that case the median should work a lot better for us in approximating AOV.<br>
Q1.c)<br>
The median order amount is 284.00<br>
<br>
Read more below on how I would go about investigating the validity of my hypothesis.

#### Question 1a: *Think about what could be going wrong with our calculation. Think about a better way to evaluate this data*

Immediately, I think the average of `3145.13` is being skewed either by a number of very large orders or by one shop that consistently does very high-volume orders. I would check the standard deviation to confirm this.

In [None]:
import pandas as pd

all_info    = pd.read_csv("data_sneakers.csv")
volume_info = all_info[['order_amount', 'total_items']]

# Show a general analysis of the order_amount and total_items columns, including percentiles in 10% increments.
volume_info.describe(percentiles=[x/10.0 for x in range(0,10)])

The `describe()` output shows us the rather extreme standard deviation of `41282.5`. And there we see our `3145.13` number: it's the mean order amount. These percentiles don't tell us much, so let's look a little closer at the upper range.

In [None]:
volume_info.describe(percentiles=[x/100.0 for x in range(90,100)])

Aha! We've found the outliers! All the way up in the 99th percentile it looks like we've got a couple points up near that `70400` max value. Let's play around with some other numbers to see how many outliers we are working with.

In [None]:
order_amounts = volume_info['order_amount']

# Treat anything within one standard deviation of the maximum as an outlier candidate.
search_threshold = order_amounts.max() - order_amounts.std()
outliers = order_amounts[order_amounts >= search_threshold]
print(f"{len(outliers)}/{len(order_amounts)} entries found at or above {search_threshold} in order_amounts.")
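The earlier guess that a single shop might be behind the large orders can be checked by grouping on the shop. Here is a minimal sketch of that idea, assuming the dataset has a `shop_id` column (this notebook never shows one). The small frame below is made-up illustration data, not the real file.

```python
import pandas as pd

# Made-up stand-in for the real CSV; `shop_id` is an assumed column name.
toy = pd.DataFrame({
    "shop_id":      [1, 1, 2, 2, 3, 3],
    "order_amount": [150, 300, 280, 140, 50000, 50000],
})

# Per-shop order count, mean, and max, sorted so the heaviest shops come first.
per_shop = (
    toy.groupby("shop_id")["order_amount"]
       .agg(["count", "mean", "max"])
       .sort_values("mean", ascending=False)
)
print(per_shop)
```

If one shop sits far above the rest in this table, the skew is probably coming from that shop rather than from scattered one-off orders.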

When considering AOV, these data points aren't very useful for us, since they give a bad indication of what the average customer's order value is going to look like. Let's trim these outliers out of our calculation using the interquartile range (IQR).

In [None]:
def IQR_calc(data, group_by, factor=1.5, bottom=0.25, top=0.75):
    # Keep only the rows whose `group_by` value lies within `factor` * IQR of the chosen quantiles.
    Q1, Q3 = data[group_by].quantile([bottom, top])
    IQR = Q3 - Q1
    min_threshold = Q1 - (factor * IQR)
    max_threshold = Q3 + (factor * IQR)
    mask = (data[group_by] >= min_threshold) & (data[group_by] <= max_threshold)
    trimmed_data = data.loc[mask]
    return trimmed_data

# We can be pretty generous with our range here. A standard IQR uses the 25th
# and 75th percentiles; we use the 10th and 90th so the fences sit further out.
trimmed_volume_info = IQR_calc(volume_info, 'order_amount', bottom=0.10, top=0.90)
trimmed_volume_info.describe()


Much better!
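To make the fence arithmetic concrete, here is a small worked sketch of the standard 1.5 × IQR rule with the usual 25th and 75th percentiles. The amounts are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up order amounts with one obvious outlier.
amounts = pd.Series([100, 150, 200, 250, 300, 350, 400, 5000])

# Quartiles and the interquartile range.
q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1

# Anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = amounts[(amounts >= lower) & (amounts <= upper)]
print(f"fences: [{lower}, {upper}], kept {len(kept)} of {len(amounts)}")
```

Only the `5000` order falls outside the fences here, so the trimmed data keeps the seven typical orders.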

#### Question 1b: *What metric would you report for this dataset?*
Looks like there is a small percentage of outliers skewing our mean. A median should be good enough to give us a quick look at what a good AOV would look like, since only about the top 1% of the data is at the upper extreme, and there don't seem to be many lower extremes either.

In [None]:
print("MEANS:")
print(volume_info.mean())
print(trimmed_volume_info.mean())
print("MEDIANS:")
print(volume_info.median())
print(trimmed_volume_info.median())
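As a quick illustration of why the median holds up better here, the sketch below adds a single extreme order to a handful of typical ones and compares both statistics. The numbers are made up and only meant to show the effect.

```python
import pandas as pd

# Made-up "typical" order amounts.
typical = pd.Series([140, 160, 280, 300, 320])

# The same orders plus one extreme bulk order.
with_outlier = pd.concat([typical, pd.Series([50000])], ignore_index=True)

print("typical:      mean =", typical.mean(), " median =", typical.median())
print("with outlier: mean =", with_outlier.mean(), " median =", with_outlier.median())
```

The single large order drags the mean up by thousands while the median moves only slightly.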

#### Question 1c: *What is its value?*
Reporting a median with the data trimmed by our IQR function doesn't even change the result. So for this case, the median of `284.00` works quite well.

But if we want to stick to the strict definition of AOV, a value of `301.84`, the mean of the data trimmed by our `IQR_calc()` function, feels acceptable.