### Prepping Data Challenge: Customer Classification (Week 9)

The dataset is the sample superstore dataset that comes with Tableau Desktop. We're using just the Orders table.

### Requirements
- Input the data
- Aggregate the data to the years each customer made an order
- Calculate the year each customer made their First Purchase
- Scaffold the dataset so that there is a row for each year after a customer's First Purchase, even if they did not make an order
- Create a field to flag these new rows, making it clear whether a customer placed an order in that year or not
- Calculate the Year on Year difference in the number of customers from each Cohort in each year
  - Cohort = Year of First Purchase
- Create a field which flags whether or not a customer placed an order in the previous year
- Create the Customer Classification using the above definitions
- Join back to the original input data
  - Ensure that in rows where a customer did not place an order, the majority of the original fields are null. The exceptions to this are the Customer Name and Customer ID fields.
- Output the data

In [1]:
import pandas as pd
import numpy as np
import warnings
from pandas.errors import SettingWithCopyWarning  # public location since pandas 1.5
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

In [2]:
# Input the data.
with pd.ExcelFile('Sample - Superstore.xls') as xlsx:
    df = pd.read_excel(xlsx, 'Orders')

In [3]:
# Take an explicit copy so that later column assignments do not write to a view of df.
order = df[['Row ID','Order Date','Customer ID']].copy()

In [4]:
order.head(5)

| | Row ID | Order Date | Customer ID |
|---|---|---|---|
| 0 | 1 | 2020-11-08 | CG-12520 |
| 1 | 2 | 2020-11-08 | CG-12520 |
| 2 | 3 | 2020-06-12 | DV-13045 |
| 3 | 4 | 2019-10-11 | SO-20335 |
| 4 | 5 | 2019-10-11 | SO-20335 |


In [5]:
#Aggregate the data to the years each customer made an order
order['Year'] = pd.to_datetime(order['Order Date']).dt.year
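
Adding a `Year` column leaves one row per order line. A minimal sketch of how the data could be collapsed to one row per customer and year, using a small made-up frame (`toy_orders`, its customer IDs and dates are hypothetical and not taken from the Superstore data):

```python
import pandas as pd

# Hypothetical toy data standing in for the Orders table.
toy_orders = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B"],
    "Order Date": pd.to_datetime(
        ["2019-03-01", "2019-07-15", "2021-01-10", "2020-05-05"]
    ),
})
toy_orders["Year"] = toy_orders["Order Date"].dt.year

# One row per (customer, year) in which at least one order was placed.
customer_years = (
    toy_orders[["Customer ID", "Year"]]
    .drop_duplicates()
    .sort_values(["Customer ID", "Year"])
    .reset_index(drop=True)
)
print(customer_years)
```

Working at this grain means every later step (scaffolding, previous-year flag, classification) can treat one row as one customer-year.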

In [6]:
#Calculate the year each customer made their First Purchase
order['First Purchase'] = order.groupby(['Customer ID'])['Year'].transform('min')

In [7]:
#Scaffold the dataset so that there is a row for each year after a customer's First Purchase,
#even if they did not make an order
present_year = order['Year'].max()

# DataFrame.append was removed in pandas 2.0, so collect the missing rows and concatenate once.
missing_rows = []
for customer_id, group in order.groupby('Customer ID'):
    first_purchase = group['First Purchase'].iloc[0]
    years_ordered = set(group['Year'])
    for year in range(first_purchase, present_year + 1):
        if year not in years_ordered:
            missing_rows.append({'Year': year, 'Customer ID': customer_id,
                                 'First Purchase': first_purchase})

order = pd.concat([order, pd.DataFrame(missing_rows)], ignore_index=True)
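
The scaffolding step can also be written without a Python loop. The following is a sketch under the assumption that the data has already been reduced to one row per customer and year; the frame `customer_years` and the helper column `_key` are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical customer-year data: A ordered in 2019 and 2021, B in 2020.
customer_years = pd.DataFrame({
    "Customer ID": ["A", "A", "B"],
    "Year": [2019, 2021, 2020],
})
first_purchase = (
    customer_years.groupby("Customer ID")["Year"].min()
    .rename("First Purchase")
    .reset_index()
)
all_years = pd.DataFrame(
    {"Year": range(customer_years["Year"].min(), customer_years["Year"].max() + 1)}
)

# Cross join customers with every year via a constant key,
# then keep only the years from the first purchase onward.
scaffold = (
    first_purchase.assign(_key=1)
    .merge(all_years.assign(_key=1), on="_key")
    .drop(columns="_key")
)
scaffold = scaffold[scaffold["Year"] >= scaffold["First Purchase"]]

# Flag which scaffold rows correspond to a year with a real order.
scaffold = scaffold.merge(
    customer_years.assign(**{"Order?": 1}), on=["Customer ID", "Year"], how="left"
)
scaffold["Order?"] = scaffold["Order?"].fillna(0).astype(int)
scaffold = scaffold.sort_values(["Customer ID", "Year"]).reset_index(drop=True)
print(scaffold)
```

Here the `Order?` flag comes directly out of the left join, so no separate null check on `Order Date` is needed.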

In [8]:
#Create a field to flag these new rows, making it clear whether a customer placed an order in that year or not
order["Order?"] = np.where(order['Order Date'].isna(), 0, 1)

In [9]:
#Create a field which flags whether or not a customer placed an order in the previous year
order["Prev_order"] = order.groupby('Customer ID')["Order?"].shift()
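
A shift within each customer only means "previous year" when there is exactly one row per customer and year and the rows are sorted by year. A minimal sketch of that case, on a hypothetical scaffolded frame (the values are made up):

```python
import pandas as pd

# Hypothetical scaffolded data: one row per customer and year, 1 = ordered that year.
scaffold = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B", "B"],
    "Year": [2021, 2019, 2020, 2020, 2021],
    "Order?": [1, 1, 0, 1, 0],
})

# Sort first so that the previous row within a customer is the previous year.
scaffold = scaffold.sort_values(["Customer ID", "Year"]).reset_index(drop=True)
scaffold["Prev_order"] = scaffold.groupby("Customer ID")["Order?"].shift()
print(scaffold)
```

The first year of each customer has no previous row, so `Prev_order` is missing there, which is what the classification step later treats as `New`.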

In [10]:
#Create the Customer Classification using the above definitions
order['Customer Classification'] = np.where(order["Prev_order"].isna(), 'New',
                                   np.where((order["Order?"]==1) & (order["Prev_order"] == 1), 'Consistent',
                                   np.where((order["Order?"]==1) & (order["Prev_order"] == 0),  'Returning', 'Sleeping')))
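
The same four-way rule can be expressed with `np.select`, which some readers find easier to scan than nested `np.where` calls. This is an illustrative sketch on a hypothetical frame `toy`; the conditions mirror the ones in the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical flags: one row per customer-year.
toy = pd.DataFrame({
    "Order?":     [1, 0, 1, 1, 0],
    "Prev_order": [np.nan, 1, 0, 1, 0],
})

# Conditions are checked in order; the first match wins.
conditions = [
    toy["Prev_order"].isna(),                           # no earlier year on record
    (toy["Order?"] == 1) & (toy["Prev_order"] == 1),    # ordered this year and last
    (toy["Order?"] == 1) & (toy["Prev_order"] == 0),    # ordered this year, not last
]
labels = ["New", "Consistent", "Returning"]

# Anything left over did not order this year.
toy["Customer Classification"] = np.select(conditions, labels, default="Sleeping")
print(toy)
```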

In [11]:
#Calculate the Year on Year difference in the number of customers from each Cohort in each year
order['YoY Difference'] = order.groupby(['Year', 'First Purchase'])['Customer ID'].transform('count').diff()
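
`transform('count')` followed by `.diff()` compares each row with the row directly above it. For a difference between consecutive years within a cohort, one option is to count per cohort and year first and then take the difference inside each cohort. The sketch below assumes that "number of customers" means customers who placed an order in that year; the frame `scaffold` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical scaffolded data: one row per customer and year.
scaffold = pd.DataFrame({
    "Customer ID":    ["A", "A", "A", "B", "B", "C", "C", "C"],
    "Year":           [2019, 2020, 2021, 2020, 2021, 2019, 2020, 2021],
    "First Purchase": [2019, 2019, 2019, 2020, 2020, 2019, 2019, 2019],
    "Order?":         [1, 0, 1, 1, 0, 1, 1, 1],
})

# Customers from each cohort who ordered in each year.
cohort_counts = (
    scaffold.groupby(["First Purchase", "Year"])["Order?"].sum()
    .rename("Customers")
    .reset_index()
)

# Difference versus the previous year within the same cohort.
cohort_counts["YoY Difference"] = (
    cohort_counts.groupby("First Purchase")["Customers"].diff()
)
print(cohort_counts)
```

The resulting table has one row per cohort and year and could be merged back onto the customer-level rows on `First Purchase` and `Year`.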

In [12]:
order.head()

| | Row ID | Order Date | Customer ID | Year | First Purchase | Order? | Prev_order | Customer Classification | YoY Difference |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-11-08 | CG-12520 | 2020 | 2019 | 1 | | New | |
| 1 | 2.0 | 2020-11-08 | CG-12520 | 2020 | 2019 | 1 | 1.0 | Consistent | 0.0 |
| 2 | 3.0 | 2020-06-12 | DV-13045 | 2020 | 2020 | 1 | | New | -242.0 |
| 3 | 4.0 | 2019-10-11 | SO-20335 | 2019 | 2019 | 1 | | New | 262.0 |
| 4 | 5.0 | 2019-10-11 | SO-20335 | 2019 | 2019 | 1 | 1.0 | Consistent | 0.0 |


In [13]:
#Join back to the original input data
df2 = order.merge(df, how='left',on='Customer ID')\
           .rename(columns={'Row ID_x':'Row ID','Order Date_x' : 'Order Date'})
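
Merging on `Customer ID` alone pairs each row with every order line of that customer. A sketch of a join that also uses the year as a key, so that no-order years keep null order fields while `Customer Name` stays populated. Both frames (`orders`, `classified`) and their values are hypothetical:

```python
import pandas as pd

# Hypothetical original order lines.
orders = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Customer ID": ["A", "A", "B"],
    "Customer Name": ["Ann", "Ann", "Bob"],
    "Year": [2019, 2019, 2020],
    "Sales": [10.0, 20.0, 5.0],
})

# Hypothetical customer-year classification, including one scaffolded (no-order) row.
classified = pd.DataFrame({
    "Customer ID": ["A", "A", "B"],
    "Year": [2019, 2020, 2020],
    "Customer Classification": ["New", "Sleeping", "New"],
})

# Joining on both keys keeps each order line attached to its own year only.
joined = classified.merge(
    orders.drop(columns="Customer Name"), on=["Customer ID", "Year"], how="left"
)

# Customer Name comes from a per-customer lookup, so it is filled on no-order rows too.
names = orders.drop_duplicates("Customer ID").set_index("Customer ID")["Customer Name"]
joined["Customer Name"] = joined["Customer ID"].map(names)
print(joined)
```

In this toy result the scaffolded row for customer A in 2020 has a null `Row ID` and `Sales` but still carries the customer's ID and name.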

In [14]:
df2 = df2[['Customer Classification','YoY Difference',"Order?", "Year","First Purchase",'Row ID','Order ID','Order Date',
            'Ship Date','Ship Mode','Customer Name','Segment',"Country/Region",'City','State','Postal Code','Region',
            'Product ID','Category','Sub-Category','Product Name','Sales','Quantity','Discount','Profit','Customer ID']]

In [15]:
df2.head()

| | Customer Classification | YoY Difference | Order? | Year | First Purchase | Row ID | Order ID | Order Date | Ship Date | Ship Mode | ... | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Customer ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New | | 1 | 2020 | 2019 | 1.0 | CA-2020-152156 | 2020-11-08 | 2020-11-11 | Second Class | ... | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 | CG-12520 |
| 1 | New | | 1 | 2020 | 2019 | 1.0 | CA-2020-152156 | 2020-11-08 | 2020-11-11 | Second Class | ... | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 | CG-12520 |
| 2 | New | | 1 | 2020 | 2019 | 1.0 | CA-2021-164098 | 2020-11-08 | 2021-01-27 | First Class | ... | Central | OFF-ST-10000615 | Office Supplies | Storage | SimpliFile Personal File, Black Granite, 15w x... | 18.16 | 2 | 0.2 | 1.816 | CG-12520 |
| 3 | New | | 1 | 2020 | 2019 | 1.0 | US-2019-123918 | 2020-11-08 | 2019-10-15 | Same Day | ... | Central | FUR-FU-10004952 | Furniture | Furnishings | C-Line Cubicle Keepers Polyproplyene Holder w/... | 131.376 | 6 | 0.6 | -95.2476 | CG-12520 |
| 4 | New | | 1 | 2020 | 2019 | 1.0 | US-2019-123918 | 2020-11-08 | 2019-10-15 | Same Day | ... | Central | OFF-PA-10003001 | Office Supplies | Paper | Xerox 1986 | 5.344 | 1 | 0.2 | 1.8704 | CG-12520 |


In [17]:
#output the dataset
df2.to_csv('wk9-output.csv', index=False)