### Shopify's Data Science Internship Fall 2021: Overview and Answers

If you are the Shopify employee reviewing my application: I have placed my answers at the top of the page and my work below, so that the answers are as easy to review as possible.

If you are a member of a general audience: this notebook is for you as well. Hopefully, it will give you a peek into the skill-testing questions that big tech companies could ask you.
  
While the questions for applicants seem straightforward at first, a careful inspection reveals that there is more going on beneath the surface.  


[Original Job Posting](https://jobs.smartrecruiters.com/Shopify/743999744930811-fall-2021-data-science-internship) (open until May 9th, 2021)


## Question 1
(The following is the prompt given for our dataset. We will aim to answer all the questions).  
  
 
Question 1: Given some sample data, write a program to answer the following:   
  
On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.   

#### Question 1a)
a) Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.  
#### Answer  
Given such a high mean value, we can assume that there are outliers present in our data set.  
Therefore, a better metric would be Median Order Value.  
However, in the work below I demonstrate that this is not the best approach. The better way to evaluate the data is to recalculate the Average Order Value after removing the large (2000-item) orders and the transactions involving store 78 (given that store 78 is charging $25,725 per pair of shoes).

#### Question 1b)
b) What metric would you report for this dataset?    
#### Answer
The metric I use is Average Small Order Value, calculated after removing the overpriced orders from store 78.  
(In case of ATS: Median Order Value)
#### Question 1c)
c) What is its value?  
#### Answer
302.58  
(In case of ATS: 284)

### Question 2:
For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.  
  
#### a) How many orders were shipped by Speedy Express in total?  
  
SELECT COUNT(OrderID) FROM Orders  
WHERE ShipperID =   
(SELECT ShipperID FROM Shippers   
WHERE ShipperName = 'Speedy Express');  
  
Returns 54  

#### b) What is the last name of the employee with the most orders?  
  
SELECT LastName FROM Employees  
WHERE EmployeeID = (SELECT TOP 1 EmployeeID FROM   
(SELECT EmployeeID, COUNT(OrderID) FROM Orders GROUP BY EmployeeID ORDER BY COUNT(OrderID) DESC));  
    
Returns Peacock    
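
The same question can also be phrased with a JOIN instead of nested subqueries. The following is only a minimal sketch, run through Python's built-in `sqlite3` module against made-up toy rows (the names `Alpha` and `Beta` are placeholders, not the challenge data). It uses `LIMIT 1` because SQLite has no `TOP` keyword; the query above was written for the editor linked in the challenge.

```python
import sqlite3

# In-memory database with toy rows (illustrative only, not the challenge data).
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, EmployeeID INTEGER);
    INSERT INTO Employees VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO Orders VALUES (100, 1), (101, 2), (102, 2);
""")

# Count orders per employee, sort descending, and keep the top row.
query = """
    SELECT e.LastName, COUNT(o.OrderID) AS NumOrders
    FROM Orders AS o
    JOIN Employees AS e ON e.EmployeeID = o.EmployeeID
    GROUP BY e.EmployeeID, e.LastName
    ORDER BY NumOrders DESC
    LIMIT 1;
"""
top_employee = conn.execute(query).fetchone()
print(top_employee)
conn.close()
```

Returning the count alongside the name makes it easy to sanity-check the winner against the raw table.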
  
#### c) What product was ordered the most by customers in Germany?   
  
SELECT ProductName FROM Products  
WHERE ProductID IN (SELECT TOP 1 ProductID FROM OrderDetails  
WHERE OrderID IN (  
SELECT OrderID FROM Orders  
WHERE CustomerID IN (SELECT CustomerID FROM Customers  
WHERE Country = 'Germany'))  
GROUP BY ProductID  
ORDER BY COUNT(OrderDetailID) DESC);  
      
Returns "Gorgonzola Telino"
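
Note that the query above ranks products by the number of order lines (`COUNT(OrderDetailID)`). If "ordered the most" were instead read as total units, the ranking expression would become `SUM(Quantity)`, and the two readings can disagree. The sketch below shows the difference on made-up toy rows using Python's `sqlite3`; it assumes an `OrderDetails` table with a `Quantity` column and says nothing about which product wins in the real data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE OrderDetails (
        OrderDetailID INTEGER PRIMARY KEY,
        ProductID INTEGER,
        Quantity INTEGER
    );
    -- Made-up rows: product 1 has more order lines, product 2 has more units.
    INSERT INTO OrderDetails VALUES (1, 1, 5), (2, 1, 5), (3, 1, 5), (4, 2, 40);
""")

# Ranking by number of order lines.
by_lines = conn.execute("""
    SELECT ProductID FROM OrderDetails
    GROUP BY ProductID ORDER BY COUNT(OrderDetailID) DESC LIMIT 1
""").fetchone()[0]

# Ranking by total units ordered.
by_units = conn.execute("""
    SELECT ProductID FROM OrderDetails
    GROUP BY ProductID ORDER BY SUM(Quantity) DESC LIMIT 1
""").fetchone()[0]

print('Most order lines:', by_lines, ', Most units:', by_units)
conn.close()
```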

### Section 1.0: The Set Up

In [None]:
# import our packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# load our dataset.
data = pd.read_csv('../input/shopify-data-science-internship-challenge/Shopify.csv')
data.head()

#### Data Dictionary

Above we can see the basic columns:  
1) order_id: Unique integer that defines our order number.  
2) shop_id: Integer used to identify the shop at which the order was placed.  
3) user_id: Integer value, indicating the user who placed the order.  
4) order_amount: (in dollars) the amount the customer paid to the store.  
5) total_items: Number of items (shoes in this case) bought in the order.  
6) payment_method: cash, credit_card, debit.  
7) created_at: date/time information when the order was placed.

In [None]:
# We are dealing with 5000 rows of order data.
data.shape

### Section 1.1 (Question 1a.)
Recall: the naive Average Order Value was \$3145.13, which is very pricey for sneakers.  
a) Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.  

The simple answer that I believe they are looking for here is Median Order Value. It's a well-known technique in data analysis to use the median as the "average" when there are outliers present in the data. However, before we settle on the simple answer, we must do our due diligence. In our case, we will first confirm our statistics, then visualize the distribution of order sizes to look for outliers.
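
To illustrate why the median resists outliers while the mean does not, here is a minimal sketch on made-up order amounts (toy numbers chosen for illustration, not the challenge data):

```python
import pandas as pd

# Toy order amounts (made up): five ordinary orders plus one extreme outlier.
toy_amounts = pd.Series([150, 200, 284, 310, 400, 704000])

toy_mean = toy_amounts.mean()      # pulled far upward by the single outlier
toy_median = toy_amounts.median()  # stays close to a typical order

print('Mean:', toy_mean, ', Median:', toy_median)
```

A single extreme value is enough to move the mean by orders of magnitude, while the median only shifts between neighbouring typical values.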

In [None]:
# Verify Average Order Value is $3145.13, Display Median Order Value
print('Average Order Value:', data['order_amount'].mean(), ',',
      'Median Order Value:', data['order_amount'].median())

In [None]:
# View the distribution of order sizes.
sns.countplot(x='total_items', data=data)
pd.DataFrame(data['total_items'].value_counts().sort_index())

Aha! Our intuition is correct. We have 17 orders (0.34% of the total orders) that are massive (2000 items each), and they are throwing off our mean calculation.  
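
The 0.34% figure is simply 17 / 5000. As a sketch of how counts and shares per order size can be read off together, the snippet below uses `value_counts(normalize=True)` on a made-up stand-in for the `total_items` column (the values are hypothetical):

```python
import pandas as pd

# Toy stand-in for the `total_items` column (made-up values).
toy_items = pd.Series([1, 2, 2, 1, 3, 2000, 2, 1, 2000, 2])

# Number of orders per order size, and the same as a share of all orders.
size_counts = toy_items.value_counts().sort_index()
size_share = toy_items.value_counts(normalize=True).sort_index()

summary = pd.DataFrame({'orders': size_counts, 'share': size_share})
print(summary)
```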
  
The next step here is to just split our mean calculations into "small orders" (<10 in this case), and "large orders" (orders of size 2000). Interestingly, we still run into issues with this approach:

In [None]:
# Small and large order averages. 
data[data['total_items'] < 10]['order_amount'].mean(), data[data['total_items'] == 2000]['order_amount'].mean()

The Small Order Average is 754.09, yet the Total Median Order Value is 284.  
This should raise some eyebrows: why such a large discrepancy?  
  
So next we ask what the average price of a pair of shoes is in our dataset, which finally uncovers the underlying deception.

In [None]:
# Create a price per pair of shoes for every order.
data['price_per_item'] = data['order_amount']/data['total_items']

In [None]:
pd.DataFrame(data['price_per_item'].describe())

Who is charging \$25,725 for a pair of shoes? Let us see.

In [None]:
pd.DataFrame(data['price_per_item'].value_counts().sort_index())

And here is our key insight: all shoes are priced below \$400, except for 46 orders that all share an outrageous price tag of \$25,725 per pair!  
  
Thus, we ought to examine the heavily overpriced orders to see if there are any patterns.

In [None]:
normal_orders = data[data['price_per_item'] < 400 ]
fraudulent_orders = data[data['price_per_item'] == 25725]

In [None]:
fraudulent_orders

And the pattern is clear as day: shop_id = 78 is the sole culprit of the fraud!
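
Another way to surface the same pattern is to summarize the unit price per shop. This is a minimal sketch on made-up toy orders, where shop 78 is given the overpriced pattern on purpose and the \$400 cutoff is the threshold used above:

```python
import pandas as pd

# Toy orders (made-up values); shop 78 mimics the overpriced pattern.
toy = pd.DataFrame({
    'shop_id':      [10, 10, 42, 78, 78],
    'order_amount': [300, 150, 352, 25725, 51450],
    'total_items':  [2, 1, 1, 1, 2],
})
toy['price_per_item'] = toy['order_amount'] / toy['total_items']

# Each shop sells one model, so each shop should show a single unit price.
price_by_shop = toy.groupby('shop_id')['price_per_item'].agg(['min', 'max', 'count'])

# Shops whose unit price exceeds the $400 cutoff.
overpriced_shops = price_by_shop[price_by_shop['max'] > 400].index.tolist()
print(overpriced_shops)
```

Grouping by shop has the side benefit of checking the "one model per shop" assumption: if `min` and `max` differ for a shop, its unit price is not constant.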

In [None]:
# Shop #78 has 0 orders under $400, and 46 orders priced at $25,725 per pair of shoes!!
len(normal_orders[normal_orders['shop_id']==78]), len(fraudulent_orders[fraudulent_orders['shop_id']==78])

So the final steps of our analysis are to filter out the overpriced orders from shop 78 and to filter out the outliers in order size, to get a decent handle on the Average Order Value.

In [None]:
new_data = data[data['shop_id'] != 78]
new_data_small_orders = new_data[new_data['total_items'] != 2000]
new_data_large_orders = new_data[new_data['total_items'] == 2000]

In [None]:
# QED.
new_data_small_orders['order_amount'].mean(), new_data['order_amount'].median()
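
For reuse, the two filters can be wrapped in a small helper. This is only a sketch: the function name `small_order_aov` and its parameters are hypothetical, and the demonstration runs on made-up toy rows rather than the challenge data.

```python
import pandas as pd

def small_order_aov(orders, excluded_shops=(78,), bulk_size=2000):
    """Mean order amount after dropping excluded shops and bulk orders."""
    kept = orders[~orders['shop_id'].isin(excluded_shops)]
    kept = kept[kept['total_items'] != bulk_size]
    return kept['order_amount'].mean()

# Toy orders (made-up values): one bulk order and one order from shop 78.
toy = pd.DataFrame({
    'shop_id':      [10, 42, 42, 78],
    'order_amount': [300, 704000, 352, 25725],
    'total_items':  [2, 2000, 1, 1],
})

result = small_order_aov(toy)
print(result)
```

Keeping the excluded shops and the bulk size as parameters makes the exclusions explicit, so a reviewer can see exactly which orders were left out of the metric.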