# Answers to Questions 1 and 2, followed by my code and thought process for Question 1

## Question 1

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe.
We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window,
we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively
affordable item, something seems wrong with our analysis.

### a. Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

Using the describe function in Pandas, the summarized statistics show that the mean
of the order_amount for all the stores does come out to 3145.13. So we now know two things:
the average was calculated correctly from the numbers provided in the
csv file, and another value wasn't accidentally used.

The total_items percentiles seem very reasonable, but the max is 2000 items
in a single order, which could be skewing the average. This is further supported by the
standard deviation of order_amount: at 41282.5 it is more than ten times the mean, which shows
how widely the data is spread around the mean.

The data had 17 occurrences of purchases of 2000 items, which increased the average by a lot, so removing
them would be the best course of action.

When the dataframe was grouped by order_amount and the occurrences of each value were counted, we could see other large values of 51450 and 25725
besides the value of 704000 for the 17 transactions of 2000 shoes. To evaluate the data better we would want to
use the median.

### b. What metric would you report for this dataset?
To stop the metric from being affected by extreme outliers we would want to report the median.
A few very large orders pull the mean far upward, while the median only depends on the middle of the sorted order amounts.
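
To make the difference concrete, here is a minimal sketch using Python's built-in statistics module. The order amounts are made up for illustration (only 284 and 704000 echo figures from the analysis), so treat it as a toy example and not the real dataset.

```python
from statistics import mean, median

# Hypothetical order amounts: nine ordinary sneaker orders plus one bulk order.
orders = [150, 160, 200, 250, 284, 300, 320, 390, 450, 704000]

print(f"mean with bulk order:   {mean(orders):.2f}")
print(f"median with bulk order: {median(orders):.2f}")

# The same orders without the single bulk purchase.
typical = orders[:-1]
print(f"mean without bulk order:   {mean(typical):.2f}")
print(f"median without bulk order: {median(typical):.2f}")
```

In this toy list the one bulk order moves the mean by tens of thousands, while the median only shifts from 284 to 292.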

### c. What is its value?
Using the median, the order value comes out to 284.0, which is definitely a more
reasonable number than 3145.13.

## Question 2
For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

### a. How many orders were shipped by Speedy Express in total?
Using the query below, the number of orders shipped by Speedy Express was **54**.

SELECT COUNT(*) AS Orders_Shipped

FROM [Orders]

JOIN [Shippers]
    
    ON [Shippers].ShipperID = [Orders].ShipperID

WHERE [Shippers].ShipperName = 'Speedy Express'

### b. What is the last name of the employee with the most orders?
The last name of the employee with the most orders is **Peacock**, with **40** orders.

SELECT [Employees].LastName, COUNT(*) AS Orders_Shipped

FROM [Orders]

JOIN [Employees]

ON [Orders].EmployeeID = [Employees].EmployeeID

GROUP BY [Employees].LastName

ORDER BY Orders_Shipped DESC

LIMIT 1

### c. What product was ordered the most by customers in Germany?
The product ordered most by customers in Germany was **Boston Crab Meat**, with a
total quantity of **160** across all orders (the query sums Quantity).


SELECT [Products].ProductID, [Products].ProductName, SUM(Quantity) AS Total

FROM [Orders], [OrderDetails], [Customers], [Products]

WHERE [Customers].Country = 'Germany' AND [OrderDetails].OrderID = [Orders].OrderID AND [OrderDetails].ProductID = [Products].ProductID AND [Customers].CustomerID = [Orders].CustomerID

GROUP BY [Products].ProductID

ORDER BY Total DESC

LIMIT 1;
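
As a way to sanity check a query like the one above, here is a small sketch that runs an explicit-JOIN version of it through Python's built-in sqlite3 module. The table and column names come from the query, but the rows, the product names, and the short aliases (o, c, od, p) are assumptions made up for this example, so the result says nothing about the real data set.

```python
import sqlite3

# In-memory database with hypothetical rows, only to exercise the query shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
INSERT INTO Customers VALUES (1, 'Germany'), (2, 'France');
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO Products VALUES (100, 'Product A'), (101, 'Product B');
INSERT INTO OrderDetails VALUES (10, 100, 5), (11, 100, 7), (11, 101, 9), (12, 101, 50);
""")

# Same logic as the comma-join query, written with explicit JOIN ... ON clauses.
query = """
SELECT p.ProductID, p.ProductName, SUM(od.Quantity) AS Total
FROM Orders AS o
JOIN Customers AS c ON c.CustomerID = o.CustomerID
JOIN OrderDetails AS od ON od.OrderID = o.OrderID
JOIN Products AS p ON p.ProductID = od.ProductID
WHERE c.Country = 'Germany'
GROUP BY p.ProductID, p.ProductName
ORDER BY Total DESC
LIMIT 1;
"""
top_product = conn.execute(query).fetchone()
print(top_product)
conn.close()
```

Product B has the largest quantity overall in this toy data, but only its German order counts, so Product A comes out on top.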

In [None]:
# Load required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load dataset into a pandas dataframe and show the column data types
df = pd.read_csv('../input/shopify-shoe-store-data/2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv')
df.dtypes

In [None]:
# Initial look at first 10 rows of data
df.head(10)

In [None]:
# Before any more analysis I would like to check for any NaN values in the data
df.isnull().values.any()
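
The check above gives a single True or False for the whole dataframe. If it ever came back True, a per-column count would show where the gaps are. A minimal sketch on a made-up three-row frame (the values are hypothetical, only the column names match the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame using two column names from the dataset.
sample = pd.DataFrame({
    "order_amount": [224.0, np.nan, 156.0],
    "total_items": [2, 1, 1],
})

# Single True/False answer for the whole frame.
any_missing = sample.isnull().values.any()

# Number of NaN values in each column.
missing_per_column = sample.isnull().sum()

print(any_missing)
print(missing_per_column)
```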

In [None]:
# Use the df.describe() function to use pandas to summarize statistical data
df.describe()

Using the describe function in Pandas, the summarized statistics show that the mean
of the order_amount for all the stores does come out to $3145.13.

The percentiles are very reasonable, but the max items bought is 2000.0, which
seems like a lot for an individual to buy in one order, and each of those orders totals
704000.0. These numbers are definitely skewing the data and
returning a very high average.

The standard deviation of order_amount is quite large at 41282.5, which shows how widely the data is spread
around the mean.

Now that I have found an obvious outlier I would want to see how many other
large values there are in the order_amount and total_items columns.
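
One common way to flag large values without eyeballing them is the 1.5 × IQR rule, which is also what a boxplot uses for its whiskers by default. This is not what the notebook does below (I look at plots and value counts instead), so the following is only a sketch of an alternative, using Python's statistics module on made-up amounts:

```python
from statistics import quantiles

# Hypothetical order amounts; the last two stand in for the extreme values.
amounts = [90, 150, 160, 200, 250, 284, 300, 320, 390, 450, 25725, 704000]

# Quartiles of the data (method="inclusive" treats the list as the full population).
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1

# Anything above Q3 + 1.5 * IQR is flagged as a high outlier.
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in amounts if a > upper_fence]

print(f"Q1={q1}, Q3={q3}, upper fence={upper_fence}")
print(outliers)
```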

In [None]:
# stripplot to show the shop_id by total_items
plt.figure(figsize = (25,20))
sns.stripplot(data = df, x = "shop_id", y = "total_items", jitter = True, palette = "plasma")
plt.show()

In [None]:
# Boxplot that is plotting the order_amount column
plt.figure(figsize = (25,20))
sns.boxplot(data = df["order_amount"])
plt.show()

Above there is a stripplot to analyze the total_items column and a boxplot to analyze the order_amount column. The stripplot makes it obvious that the orders of 2000 items
are skewing the Average Order Value.

The boxplot shows many outliers in the order_amount column: most
of the data sits in a flat line close to zero, with a few points far above it.

Next let's look deeper into the total_items column.

In [None]:
# Create a for loop to see how many values there are above 8
# After some trial and error I found the second highest value in the total_items column was 8
for x in df["total_items"]:
    if x > 8:
        print(x)
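
The loop works, but the same check can be written with a boolean mask, which also makes it easy to count how often each large value occurs. A small sketch on a hypothetical stand-in for the total_items column:

```python
import pandas as pd

# Hypothetical stand-in for the total_items column.
sample = pd.DataFrame({"total_items": [1, 2, 8, 2000, 3, 2000, 1]})

# Boolean mask: keep only rows where more than 8 items were bought.
large_orders = sample.loc[sample["total_items"] > 8, "total_items"]

# value_counts() lists each large quantity and how many times it appears.
counts = large_orders.value_counts()
print(counts)
```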

In [None]:
# Now let us locate all values of 2000 in the total_items column
df.loc[df["total_items"].isin([2000])]

This is interesting: above you can see all the orders of 2000 items in the dataset, and what's odd is that
they were all made at the same store and at exactly the same time.

The 17 instances of the value 2000 are clearly incorrect, and the store with id 42 should be contacted
to see where those numbers are coming from.

For the next step I want to find the outliers in the order_amount boxplot.
Going back to the 75th percentile, the value is 390.0, but
the max value is 704000, so now I want to find all values over 390.0 in the
dataset.

In [None]:
# Use groupby and count() to find how often each of the 15 largest order_amount values occurs
df.groupby("order_amount").count().tail(15)

Above we have more numbers that are very large purchases for shoes.
Those numbers include 704000, 51450, and 25725.

Earlier I talked about the orders of 2000 items and their amount of 704000.
The cell below shows that the orders of 51450 and 25725 are very large amounts for only 1 or 2 items, and that
all of those purchases were made at the same store, with an id of 78.


In [None]:
df.loc[df["order_amount"].isin([704000, 51450, 25725])].sort_values(by = "order_amount", ascending=False)
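
Since each shop sells only one model, dividing order_amount by total_items gives the price per pair, which separates the two kinds of outlier: 704000 / 2000 is 352 per pair (an ordinary price bought in bulk), while 51450 / 2 is 25725 per pair (an extreme price). A minimal sketch of that check on a few hypothetical rows; the unit_price column is a name I made up for this example:

```python
import pandas as pd

# Hypothetical rows; only the column names and the large amounts mirror the dataset.
sample = pd.DataFrame({
    "shop_id": [42, 42, 78, 78, 10],
    "order_amount": [704000, 352, 25725, 51450, 300],
    "total_items": [2000, 1, 1, 2, 2],
})

# One model per shop, so amount / items recovers that shop's price per pair.
sample["unit_price"] = sample["order_amount"] / sample["total_items"]

# Highest per-pair price seen in each shop, most expensive shop first.
price_by_shop = (
    sample.groupby("shop_id")["unit_price"].max().sort_values(ascending=False)
)
print(price_by_shop)
```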

So for the actual Average Order Value we want to remove the order amounts of 25725 and above, which include 704000, 51450, and 25725.

In [None]:
# Using df.drop and specifying we want to remove all values above and equal to 25725 we return the dataframe without those values
df.drop(df[df["order_amount"] >= 25725].index, inplace=True)

In [None]:
# plot a new stripplot to show the shop_id by total_items
plt.figure(figsize = (25,20))
sns.stripplot(data = df, x = "shop_id", y = "total_items", jitter = True, palette = "plasma")
plt.show()

In [None]:
# Use df.describe() to find new Average Order Amount without the outliers
df.describe()

Since we have a lot of large values, we want to use a metric that isn't affected by those values,
such as the median.

The median order value is 284.0, which is much more reasonable than the earlier value of 3145.13.