## Case 3: Shopify data

There is one problem in this case: a cohort analysis using Shopify data.

To read the `shopify_orders.parquet` file, you need to have the `pyarrow` library installed. The code cell below includes a `pip` command for installing this package. If you need to install it, uncomment the line in the cell below and run it.

If, after installing `pyarrow`, you still get errors about `pyarrow` not being available when trying to read the data, restart your Jupyter kernel and try loading the data again.

In [None]:
#%pip install --upgrade pyarrow
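
Before installing anything, it can help to check whether `pyarrow` is already importable in the current kernel. The snippet below is a minimal sketch that uses only the standard library; the helper name `has_module` is made up for this illustration.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# If this prints False, uncomment and run the %pip cell, then restart the kernel.
print("pyarrow available:", has_module("pyarrow"))
```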

In [1]:
import pandas as pd

orders = pd.read_parquet("shopify_orders.parquet")
orders.info()
orders.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100436 entries, 0 to 100435
Data columns (total 14 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Day                100436 non-null  object 
 1   customer_type      100436 non-null  object 
 2   Customer ID        100436 non-null  int64  
 3   orders             100436 non-null  int64  
 4   total_sales        100436 non-null  float64
 5   Returns            100436 non-null  float64
 6   Ordered quantity   100436 non-null  int64  
 7   Gross sales        100436 non-null  float64
 8   Net sales          100436 non-null  float64
 9   Shipping           100436 non-null  float64
 10  Tax                100436 non-null  float64
 11  Net quantity       100436 non-null  int64  
 12  Returned quantity  100436 non-null  int64  
 13  Discounts          100436 non-null  float64
dtypes: float64(7), int64(5), object(2)
memory usage: 10.7+ MB


| | Day | customer_type | Customer ID | orders | total_sales | Returns | Ordered quantity | Gross sales | Net sales | Shipping | Tax | Net quantity | Returned quantity | Discounts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-09-02 | Returning | 7609456 | 1 | 31.25 | -0.0 | 7 | 27.21 | 27.21 | 4.04 | 0.0 | 7 | 0 | -0.0 |
| 1 | 2017-07-23 | Returning | 4782112 | 1 | 152.75 | -0.0 | 2 | 134.77 | 134.77 | 0.0 | 17.98 | 2 | 0 | -0.0 |
| 2 | 2018-02-25 | Returning | 5245146 | 1 | 10.25 | -0.0 | 2 | 10.25 | 10.25 | 0.0 | 0.0 | 2 | 0 | -0.0 |
| 3 | 2020-08-09 | Returning | 7033470 | 1 | 1149.84 | -0.0 | 7 | 1014.65 | 1014.65 | 0.0 | 135.19 | 7 | 0 | -0.0 |
| 4 | 2020-06-16 | Returning | 7560082 | 1 | 0.0 | -0.0 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | 0 | -0.0 |


**Definition:** A customer’s cohort is the month in which that customer placed
their first order.
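
To make the definition concrete, here is a small sketch on made-up data (the rows below are not taken from `shopify_orders.parquet`). It assumes every customer's first order is visible in the data, which the hints further down point out is not always true for the real file.

```python
import pandas as pd

# Toy orders for two hypothetical customers.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2],
    "Day": pd.to_datetime(
        ["2016-07-03", "2016-09-10", "2016-08-21", "2016-08-30", "2016-10-02"]
    ),
})

# Earliest order date per customer, repeated on every row of that customer.
first_order = toy.groupby("Customer ID")["Day"].transform("min")

# The cohort is the calendar month that contains the first order.
toy["cohort"] = first_order.dt.to_period("M")
print(toy)
```

Customer 1 belongs to the July 2016 cohort on both rows, even though the second order was placed in September.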

The `customer_type` column indicates whether an order was placed by a new or a returning customer.

We now describe the *want* for the exercise, which we ask you to complete.

**Want**: Compute the monthly total number of orders, total sales, and
total quantity, separated by customer cohort and customer type.

Read that carefully one more time…

### Extended Exercise

Using the reshape and `groupby` tools you have learned, apply the want
operator described above.

See below for advice on how to proceed.

When you are finished, you should have something that looks like this:

<img src="shopify_cohort_answer.png" style="">
  

A few notes on the table above:

1. Your actual output will be much bigger. This is just to give you an idea of what it might look like.
1. The numbers you produce should match the ones in this table. Index into your answer and compare it with this table to verify your progress.
1. By default, the labels will not be in "Month-year" form; they will be dates like `2016-07-31`. This is fine. Converting them to the "Month-year" representation is optional.

Now, how to do it?

There is more than one way to code this, but here are some suggested
steps.

1. Convert the `Day` column from `object` to a `datetime` `dtype` (Hint: use the `pd.to_datetime` function)
1. Add a new column that specifies the date associated with each
  customer’s `"First-time"` order
  - Hint 1: You can do this with a combination of `groupby` and
    `merge`
  - Hint 2: `customer_type` is always either `Returning` or
    `First-time`
  - Hint 3: Some customers don’t have a
    `customer_type == "First-time"` entry. You will need to set the
    value for these customers to some date that precedes the dates in the
    sample. After merging the available first-order dates back into the
    `orders` DataFrame, you can identify which customers don’t have a
    `"First-time"` entry by checking for missing data in the new column.
1. You’ll need to group by 3 things: the order month, the customer cohort, and the customer type
1. You can apply one of the built-in aggregation functions to the GroupBy
1. After doing the aggregation, you’ll need to use your reshaping skills to
  move things to the right place in rows and columns
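
The steps above can be sketched on a tiny made-up DataFrame. This is one possible approach under stated assumptions, not the reference solution: the sentinel date `2000-01-01`, the helper column names `first_order`, `cohort`, and `month`, and the final layout (measure and month in the columns) are all choices made for this illustration, and the expected answer may arrange rows and columns differently.

```python
import pandas as pd

# Toy data that mimics a few columns of the orders table (values are invented).
toy = pd.DataFrame({
    "Day": ["2016-07-05", "2016-08-11", "2016-08-20", "2016-09-02", "2016-09-15"],
    "customer_type": ["First-time", "Returning", "First-time", "Returning", "Returning"],
    "Customer ID": [1, 1, 2, 3, 2],
    "orders": [1, 1, 1, 1, 2],
    "total_sales": [10.0, 20.0, 5.0, 7.5, 12.5],
    "Ordered quantity": [1, 2, 1, 3, 2],
})

# Step 1: parse the dates.
toy["Day"] = pd.to_datetime(toy["Day"])

# Step 2: date of each customer's "First-time" order, via groupby + merge.
first = (
    toy.loc[toy["customer_type"] == "First-time"]
    .groupby("Customer ID")["Day"]
    .min()
    .rename("first_order")
    .reset_index()
)
toy = toy.merge(first, on="Customer ID", how="left")

# Customer 3 has no "First-time" row, so the merge leaves NaT there.
# Assumption: any date before the sample works as a placeholder.
SENTINEL = pd.Timestamp("2000-01-01")
toy["first_order"] = toy["first_order"].fillna(SENTINEL)

# Label cohorts and order months by their month-end date (e.g. 2016-07-31).
toy["cohort"] = toy["first_order"] + pd.offsets.MonthEnd(0)
toy["month"] = toy["Day"] + pd.offsets.MonthEnd(0)

# Steps 3 and 4: group by three keys and apply a built-in aggregation.
grouped = (
    toy.groupby(["customer_type", "cohort", "month"])
    [["orders", "total_sales", "Ordered quantity"]]
    .sum()
)

# Step 5: reshape so that order months run across the columns.
wide = grouped.unstack("month")
print(wide)
```

Combinations that never occur (for example, the July cohort placing a first-time order in August) show up as `NaN` after the `unstack`.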


Good luck!

> NOTE: at the very end of my code, I ran the following to get the dates to appear in a human-readable way:
> ```
>     .rename(columns=lambda x: x.strftime("%B-%y"))
>     .rename(index=lambda x: x.strftime("%B-%y"), level="cohort")
> ```
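
The two `rename` calls in the note are the tail of a method chain. To see what they do in isolation, here is a runnable sketch on a placeholder table; the table contents and the two-level index are invented for this example, and the month names assume an English locale.

```python
import pandas as pd

cohorts = pd.to_datetime(["2016-07-31", "2016-08-31"])
months = pd.to_datetime(["2016-07-31", "2016-08-31", "2016-09-30"])

# Placeholder table: rows are (customer_type, cohort), columns are order months.
index = pd.MultiIndex.from_product(
    [["First-time", "Returning"], cohorts], names=["customer_type", "cohort"]
)
table = pd.DataFrame(0, index=index, columns=months)

pretty = (
    table
    # Format every column label, e.g. Timestamp("2016-07-31") -> "July-16".
    .rename(columns=lambda x: x.strftime("%B-%y"))
    # Format only the "cohort" level of the row index; other levels are untouched.
    .rename(index=lambda x: x.strftime("%B-%y"), level="cohort")
)
print(pretty)
```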

In [None]:
solution = ...

months = ["July-16", "August-16", "September-16"]
solution.loc[pd.IndexSlice[:, :, months], months]
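
The last line of the cell above uses `pd.IndexSlice` to pick rows by the third level of the row index while keeping everything on the first two levels. The sketch below shows the same pattern on an invented table. The level names `variable`, `customer_type`, and `cohort`, and their order, are assumptions made for this illustration; your own `solution` may be organized differently.

```python
import pandas as pd

months = ["July-16", "August-16", "September-16"]

# Invented three-level row index with the cohort as the last level.
index = pd.MultiIndex.from_product(
    [["orders", "total_sales"], ["First-time", "Returning"], months],
    names=["variable", "customer_type", "cohort"],
)
toy = pd.DataFrame(
    [[i * 10 + j for j in range(len(months))] for i in range(len(index))],
    index=index,
    columns=months,
)

# ":" keeps every label on the first two levels; the list filters the third level.
# The second argument to .loc selects the columns.
subset = toy.loc[pd.IndexSlice[:, :, ["July-16"]], ["July-16", "August-16"]]
print(subset)
```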