# 4.10: Coding Etiquette & Excel Reporting - Part A

### This notebook contains:
    01. Importing Libraries
    02. Importing Data
    03. Coding Etiquette & Excel Reporting
        A. Crosstabs in Python
        B. PII Data Concerns
        C. Customer Behaviour by Geographical Region
        D. Low-Activity Customer Exclusion

## 01. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

## 02. Importing Data

In [2]:
# turning project folder path into string
path = r'/Users/lisa/DA Projects/12-2022 Instacart Basket Analysis'

In [3]:
# Importing latest data set
df_opa = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_all.pkl'))

In [4]:
# Importing department
df_dep = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'))

## 03. Coding Etiquette & Excel Reporting

### A. Crosstabs in Python

In [5]:
# creating crosstab to confirm first time users
crosstab = pd.crosstab(df_opa['days_since_prior_order'], df_opa['order_number'], dropna = False)

In [6]:
# copying crosstab to clipboard
crosstab.to_clipboard()

In [8]:
crosstab

days_since_prior_order / order_number,1,2,3,4,5,6,7,8,9,10,...,90,91,92,93,94,95,96,97,98,99
0.0,0,20536,20441,17984,16438,16046,14384,13890,12774,11460,...,1195,1148,1040,937,1134,1099,1041,883,1037,935
1.0,0,31674,29950,28010,27092,25533,24996,22921,21783,21067,...,2787,2801,2728,2605,2606,2639,2436,2535,2633,2363
2.0,0,46454,46264,43902,40729,40907,38101,37519,35745,33830,...,3623,3555,3314,3589,3319,3072,2985,3057,3091,2609
3.0,0,61637,63388,59996,57882,56183,52869,49291,48295,47826,...,3318,3237,2799,3073,2902,2635,2658,2462,2168,2467
4.0,0,76733,78861,73540,70519,66569,62399,61143,59499,57700,...,2342,2260,2490,2031,1800,2083,1918,1792,1695,1810
5.0,0,88999,91741,86503,81859,77583,73584,70020,64468,63154,...,1630,1630,1367,983,1215,1220,1166,966,912,936
6.0,0,120681,122871,114644,106764,100756,95954,89752,87203,78634,...,754,842,940,1014,835,570,490,602,485,542
7.0,0,184802,181656,167597,157442,143628,137675,128423,120734,114769,...,634,573,521,440,602,421,420,378,419,322
8.0,0,112324,110742,102217,94945,87611,81622,78760,71070,67567,...,238,262,228,244,275,232,234,143,137,84
9.0,0,73676,75379,68513,65045,58933,54801,50593,47752,43483,...,190,90,98,101,146,91,67,102,66,65
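
The `order_number` 1 column in the crosstab contains only zeros, which suggests that first orders carry a missing `days_since_prior_order`. The same check can be made more directly. The following is a minimal sketch on a toy frame (the `toy` DataFrame and its values are invented for illustration and are not taken from `df_opa`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_opa; the values are made up for illustration
toy = pd.DataFrame({
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

# Rows that belong to a customer's first order
is_first = toy['order_number'] == 1
# Rows where the gap to the previous order is missing
gap_missing = toy['days_since_prior_order'].isna()

# True only if the missing gaps occur on first orders and nowhere else
first_orders_match = bool((is_first == gap_missing).all())
print(first_orders_match)
```

On the full data set the same comparison would use `df_opa` in place of `toy`.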


### B. PII Data Concerns

#### Question 2. 
Consider any security implications that might exist for this new data. You’ll need to address any PII data in the data before continuing your analysis.

In [9]:
# removing restrictions to see all columns
pd.options.display.max_columns = None

In [10]:
df_opa.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,order_freq,frequency_flag,first_name,last_name,gender,state,age,date_joined,no_of_dependants,marital_status,income,_merge
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Less busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Less busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both


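Because the customer and prices data were fabricated for this course, they are not subject to PII laws. The user_id and order_id on their own do not provide any means of identifying individuals. In a real data set, however, columns such as first_name and last_name would count as PII and would need to be handled before analysis.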
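
If the name columns did have to be removed, one option would be to drop them before continuing. This is only a sketch on a toy frame; the choice of `pii_columns` is an assumption and would depend on the applicable policy:

```python
import pandas as pd

# Toy frame with a few of the columns shown in df_opa.head()
toy = pd.DataFrame({
    'user_id': [1, 120],
    'first_name': ['Linda', 'Sarah'],
    'last_name': ['Nguyen', 'Rich'],
    'state': ['Alabama', 'Kentucky'],
})

# Assumption: only the name columns are treated as PII here
pii_columns = ['first_name', 'last_name']

# drop() returns a new DataFrame and leaves the original untouched
toy_no_pii = toy.drop(columns=pii_columns)
print(toy_no_pii.columns.tolist())
```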

### C. Customer Behaviour by Geographical Region

#### Question 3.
The Instacart officers are interested in comparing customer behavior in different geographic areas. Create a regional segmentation of the data. You’ll need to create a “Region” column based on the “State” column from your customers data set. 

In [11]:
# Check frequency for States
df_opa['state'].value_counts(dropna = False)

Pennsylvania            667082
California              659783
Rhode Island            656913
Georgia                 656389
New Mexico              654494
Arizona                 653964
North Carolina          651900
Oklahoma                651739
Alaska                  648495
Minnesota               647825
Massachusetts           646358
Wyoming                 644255
Virginia                641421
Missouri                640732
Texas                   640394
Colorado                639280
Maine                   638583
North Dakota            638491
Alabama                 638003
Kansas                  637538
Louisiana               637482
Delaware                637024
South Carolina          636754
Oregon                  636425
Arkansas                636144
Nevada                  636139
New York                635983
Montana                 635265
South Dakota            633772
Illinois                633024
Hawaii                  632901
Washington              632852
Mississi

In [12]:
# checking shape
df_opa.shape

(32404859, 34)

In [13]:
# Create list states northeast
northeast = ['Maine','New Hampshire','Vermont','Massachusetts','Rhode Island','Connecticut','New York','Pennsylvania','New Jersey']

In [14]:
# Create list states midwest
midwest = ['Wisconsin','Michigan','Illinois','Indiana','Ohio','North Dakota','South Dakota','Nebraska','Kansas','Minnesota','Iowa','Missouri']

In [15]:
# Create list states south
south = ['Delaware','Maryland','District of Columbia','Virginia','West Virginia','North Carolina','South Carolina','Georgia','Florida','Kentucky','Tennessee','Mississippi','Alabama','Oklahoma','Texas','Arkansas','Louisiana']

In [16]:
# Create list states west
west = ['Idaho','Montana','Wyoming','Nevada','Utah','Colorado','Arizona','New Mexico','Alaska','Washington','Oregon','California','Hawaii']

In [17]:
# creating for-loop for US Regions
result = []

for value in df_opa["state"]:
  if value in northeast:
    result.append("Northeast")
  elif value in midwest:
    result.append("Midwest")
  elif value in south:
    result.append("South")
  else:
    result.append("West")
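
The loop above labels every state that is not in one of the first three lists as "West", including missing or misspelled values. A possible alternative, shown here as a sketch rather than the method used in this notebook, is to build a state-to-region dictionary and use `Series.map`, which leaves unmatched values as NaN so they can be spotted. The names `regions`, `state_to_region` and `toy_states` are illustrative:

```python
import pandas as pd

# Same region lists as in the cells above
regions = {
    'Northeast': ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island',
                  'Connecticut', 'New York', 'Pennsylvania', 'New Jersey'],
    'Midwest': ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota',
                'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri'],
    'South': ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia',
              'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky',
              'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas',
              'Louisiana'],
    'West': ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona',
             'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii'],
}

# Invert the lists into a flat lookup: state name -> region name
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

# Toy column; the last value is deliberately not a state
toy_states = pd.Series(['Alabama', 'Kansas', 'Maine', 'Oregon', 'Not a state'])

# map() looks up each value; anything missing from the dict becomes NaN
toy_region = toy_states.map(state_to_region)
print(toy_region.tolist())
```

On the full data this would avoid the Python-level loop over more than 32 million rows, and a follow-up `value_counts(dropna = False)` would reveal any states that failed to match.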

In [18]:
result

['South',
 'South',
 'South',
 ...
 'Midwest',
 'Midwest',
 'Midwest',
 ...

In [19]:
# Adding the region labels to df_opa as a new column
df_opa['region'] = result

In [20]:
# frequency check region
df_opa['region'].value_counts(dropna = False)

South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: region, dtype: int64

In [21]:
# check column headers
df_opa.columns

Index(['order_id', 'user_id', 'order_number', 'order_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'first_order',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', 'price_range_loc', 'busiest_day',
       'busiest_days', 'busiest_period_of_day', 'max_order', 'loyalty_flag',
       'avg_price', 'spending_flag', 'order_freq', 'frequency_flag',
       'first_name', 'last_name', 'gender', 'state', 'age', 'date_joined',
       'no_of_dependants', 'marital_status', 'income', '_merge', 'region'],
      dtype='object')

Determine whether there’s a difference in spending habits between the different U.S. regions. (Hint: You can do this by crossing the variable you just created with the spending flag.)

In [22]:
# creating crosstab to investigate region and spending habits
crosstab_reg = pd.crosstab(df_opa['region'], df_opa['spending_flag'], dropna = False)

In [23]:
crosstab_reg

region / spending_flag,High Spender,Low spender
Midwest,155975,7441350
Northeast,108225,5614511
South,209691,10582194
West,160354,8132559


In [24]:
# copying crosstab to clipboard
crosstab_reg.to_clipboard()

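Note that the crosstab counts rows (order line items) rather than revenue. The South has by far the most rows from low spenders (over 10 million), followed by the West, Midwest and Northeast, and high spenders follow the same ranking. In every region, low spenders account for roughly 98% of rows, so the regional differences mostly reflect the overall number of rows per region rather than different spending habits.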
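
Because the regions differ in size, row shares are easier to compare than raw counts. The sketch below rebuilds the crosstab from the counts shown above and converts each row to proportions; the names `counts` and `shares` are illustrative. On the full data, `pd.crosstab(..., normalize='index')` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the region x spending_flag crosstab above
counts = pd.DataFrame(
    {'High Spender': [155975, 108225, 209691, 160354],
     'Low spender': [7441350, 5614511, 10582194, 8132559]},
    index=['Midwest', 'Northeast', 'South', 'West'],
)

# Divide each row by its row total to get within-region shares
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(4))
```

The high-spender share comes out at about 2% in every region, which supports reading the count differences as differences in region size.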

### D. Low-Activity Customer Exclusion

#### Question 4.
The Instacart CFO isn’t interested in customers who don’t generate much revenue for the app. Create an exclusion flag for low-activity customers (customers with less than 5 orders) and exclude them from the data. Make sure you export this sample.

In [25]:
# Creating low activity flag based on max_order
df_opa.loc[df_opa['max_order']>=5,'low_activity_flag']='Normal activity'
df_opa.loc[df_opa['max_order']<5,'low_activity_flag']='Low activity'
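
The same flag could also be written as a single `np.where` call. This is a sketch on a toy frame with invented values, not a replacement for the cell above. One difference to be aware of: `np.where` would label a missing `max_order` as 'Normal activity', whereas the two `.loc` assignments leave it as NaN.

```python
import numpy as np
import pandas as pd

# Toy frame; max_order values chosen to hit both sides of the threshold
toy = pd.DataFrame({'user_id': [1, 120, 7], 'max_order': [10, 3, 5]})

# Fewer than 5 orders -> 'Low activity', otherwise 'Normal activity'
toy['low_activity_flag'] = np.where(toy['max_order'] < 5,
                                    'Low activity', 'Normal activity')
print(toy)
```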

In [26]:
# check results
df_opa.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,order_freq,frequency_flag,first_name,last_name,gender,state,age,date_joined,no_of_dependants,marital_status,income,_merge,region,low_activity_flag
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Less busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Less busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity


In [27]:
# check 2
df_opa['low_activity_flag'].value_counts(dropna=False)

Normal activity    30964564
Low activity        1440295
Name: low_activity_flag, dtype: int64

In [28]:
# checking rows
30964564+1440295

32404859
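
The manual addition above can also be expressed as a check that does not rely on typing the numbers by hand. A minimal sketch on a toy frame (`toy` and `rows_match` are illustrative names):

```python
import pandas as pd

# Toy flag column standing in for df_opa['low_activity_flag']
toy = pd.DataFrame({
    'low_activity_flag': ['Normal activity', 'Low activity', 'Normal activity'],
})

# With dropna=False, missing flags are counted too, so the total covers every row
flag_counts = toy['low_activity_flag'].value_counts(dropna=False)
rows_match = int(flag_counts.sum()) == len(toy)
print(rows_match)
```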

In [29]:
# creating dataframe for low activity customers
df_low_cust = df_opa[df_opa['low_activity_flag']=='Low activity']

In [30]:
# checking results
df_low_cust.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,order_freq,frequency_flag,first_name,last_name,gender,state,age,date_joined,no_of_dependants,marital_status,income,_merge,region,low_activity_flag
1510,520620,120,1,3,11,,True,196,2,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Most orders,3,New customer,9.385714,Low spender,19.0,Regular customer,Sarah,Rich,Female,Kentucky,54,3/2/2017,2,married,99219,both,South,Low activity
1511,3273029,120,3,2,8,19.0,False,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,3,New customer,9.385714,Low spender,19.0,Regular customer,Sarah,Rich,Female,Kentucky,54,3/2/2017,2,married,99219,both,South,Low activity
1512,520620,120,1,3,11,,True,46149,1,0,Zero Calorie Cola,77,7,13.4,Mid-range product,Regularly busy,Less busy,Most orders,3,New customer,9.385714,Low spender,19.0,Regular customer,Sarah,Rich,Female,Kentucky,54,3/2/2017,2,married,99219,both,South,Low activity
1513,3273029,120,3,2,8,19.0,False,46149,1,1,Zero Calorie Cola,77,7,13.4,Mid-range product,Regularly busy,Regularly busy,Average orders,3,New customer,9.385714,Low spender,19.0,Regular customer,Sarah,Rich,Female,Kentucky,54,3/2/2017,2,married,99219,both,South,Low activity
1514,520620,120,1,3,11,,True,26348,3,0,Mixed Fruit Fruit Snacks,50,19,3.1,Low-range product,Regularly busy,Less busy,Most orders,3,New customer,9.385714,Low spender,19.0,Regular customer,Sarah,Rich,Female,Kentucky,54,3/2/2017,2,married,99219,both,South,Low activity


In [31]:
# exporting low activity customer sample
df_low_cust.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'low_activity_customers.pkl'))

In [32]:
# creating new dataframe for normal activity customers
df_opan = df_opa[df_opa['low_activity_flag']=='Normal activity']

In [33]:
# check new df
df_opan.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,order_freq,frequency_flag,first_name,last_name,gender,state,age,date_joined,no_of_dependants,marital_status,income,_merge,region,low_activity_flag
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Less busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Less busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Less busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Normal activity


In [34]:
# check 2
df_opan['low_activity_flag'].value_counts(dropna=False)

Normal activity    30964564
Name: low_activity_flag, dtype: int64

In [35]:
# exporting normal activity customer sample
df_opan.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'normal_activity_customers.pkl'))