# 4.10: Coding Etiquette & Excel Reporting - Part B

This script continues from Part A.
### This notebook contains:
    01. Importing Libraries
    02. Importing Data
    03. Coding Etiquette & Excel Reporting
        E. Customer Profiling
            a. Age Profiling
            b. Income Profiling
            c. Household Size
            d. Child Age Status
            e. Pet Status

## 01. Importing Libraries

In [18]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

## 02. Importing Data

In [19]:
# turning project folder path into string
path = r'/Users/lisa/DA Projects/12-2022 Instacart Basket Analysis'

In [20]:
# Importing latest dataset with normal activity customers
df_opan = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'normal_activity_customers.pkl'))

In [21]:
# dropping "low activity flag" from new dataframe
df_opan = df_opan.drop(columns=['low_activity_flag'])

In [22]:
# Importing department
df_dep = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'))

## 03. Coding Etiquette & Excel Reporting

### E. Customer Profiling

#### Question 5.
The marketing and business strategy units at Instacart want to create more-relevant marketing strategies for different products and are, thus, curious about customer profiling in their database. Create a profiling variable based on age, income, certain goods in the “department_id” column, and number of dependents. You might also use the “orders_day_of_the_week” and “order_hour_of_day” columns if you can think of a way they would impact customer profiles. (Hint: As an example, try thinking of what characteristics would lead you to the profile “Single adult” or “Young parent.”)

In [23]:
# getting overview
df_opan.head()

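| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | last_name | gender | state | age | date_joined | no_of_dependants | marital_status | income | _merge | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | ... | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | False | 196 | 1 | 1 | ... | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | False | 196 | 1 | 1 | ... | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | False | 196 | 1 | 1 | ... | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | False | 196 | 1 | 1 | ... | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South |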


Based on the variables at hand, I will create a series of flags.

In [24]:
df_dep

| | Unnamed: 0 | department |
| --- | --- | --- |
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |
| 5 | 6 | international |
| 6 | 7 | beverages |
| 7 | 8 | pets |
| 8 | 9 | dry goods pasta |
| 9 | 10 | bulk |
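
The `Unnamed: 0` column above holds the department ids. As a minimal sketch, assuming the CSV was written with the ids as an unnamed index column, passing `index_col=0` to `pd.read_csv` keeps them as the index instead. The inline CSV text and the names `dep_demo` and `dep_lookup` are hypothetical stand-ins, not part of the project files.

```python
import io
import pandas as pd

# Hypothetical stand-in for departments_wrangled.csv: the first column has no
# header because it was written out as the index.
csv_text = ",department\n1,frozen\n2,other\n3,bakery\n"

# index_col=0 turns the unnamed first column into the index,
# so no 'Unnamed: 0' column is created
dep_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
dep_demo.index.name = 'department_id'

# dictionary for looking up a department name by its id
dep_lookup = dep_demo['department'].to_dict()
print(dep_lookup[3])
```

With the ids as the index, a department name can be looked up directly when deciding which `department_id` values to use for a flag.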


### a. Age Profiling

    18-30 = Young Adult
    31-45 = Middle-Aged Adult 
    46-60 = Old-Aged Adult
    >60 = Senior

In [25]:
# creating conditions for age flag
df_opan.loc[(df_opan['age']>=18)&(df_opan['age']<=30),'age_group_flag']='Young adult'
df_opan.loc[(df_opan['age']>=31)&(df_opan['age']<=45),'age_group_flag']='Middle-Aged adult'
df_opan.loc[(df_opan['age']>=46)&(df_opan['age']<=60),'age_group_flag']='Old-Aged adult'
df_opan.loc[df_opan['age']>60,'age_group_flag']='Senior'

In [26]:
# check
df_opan.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | gender | state | age | date_joined | no_of_dependants | marital_status | income | _merge | region | age_group_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | False | 196 | 1 | 1 | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | False | 196 | 1 | 1 | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | False | 196 | 1 | 1 | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | False | 196 | 1 | 1 | ... | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult |


In [27]:
# frequency check
df_opan['age_group_flag'].value_counts(dropna=False)

Senior               10112607
Old-Aged adult        7284900
Middle-Aged adult     7262817
Young adult           6304240
Name: age_group_flag, dtype: int64
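
The four age bands can also be assigned in a single call with `pd.cut`. The following is a minimal sketch on a small hypothetical frame (`ages`), not the notebook's `df_opan`; it assumes integer ages, so the right-closed bins `(17, 30]`, `(30, 45]`, `(45, 60]`, `(60, inf]` match the bands defined above.

```python
import pandas as pd

# Hypothetical sample covering both edges of every age band
ages = pd.DataFrame({'age': [18, 30, 31, 45, 46, 60, 61, 81]})

# pd.cut uses right-closed bins by default, e.g. (17, 30] holds ages 18 to 30
ages['age_group_flag'] = pd.cut(
    ages['age'],
    bins=[17, 30, 45, 60, float('inf')],
    labels=['Young adult', 'Middle-Aged adult', 'Old-Aged adult', 'Senior'],
)
print(ages)
```

Unlike the `loc` statements, this produces a categorical column, and any age of 17 or below would be left as `NaN`.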

### b. Income Profiling

    <30000 = Low Income
    30000 to <55000 = Low Middle Income
    55000 to <100000 = Middle Income
    100000 to <375000 = Upper Middle Income
    >=375000 = High Income

Categorization is loosely based on the U.S. News article "Where Do I Fall in the American Economic Class System?",
last read on 2022-12-19: https://money.usnews.com/money/personal-finance/family-finance/articles/where-do-i-fall-in-the-american-economic-class-system

In [28]:
# creating conditions for income flag
# each band includes its lower bound and excludes its upper bound;
# the last band uses >= so an income of exactly 375000 is not left unassigned
df_opan.loc[df_opan['income']<30000,'income_flag']='Low income'
df_opan.loc[(df_opan['income']>=30000)&(df_opan['income']<55000),'income_flag']='Low middle income'
df_opan.loc[(df_opan['income']>=55000)&(df_opan['income']<100000),'income_flag']='Middle income'
df_opan.loc[(df_opan['income']>=100000)&(df_opan['income']<375000),'income_flag']='Upper middle income'
df_opan.loc[df_opan['income']>=375000,'income_flag']='High Income'

In [29]:
# check
df_opan.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | state | age | date_joined | no_of_dependants | marital_status | income | _merge | region | age_group_flag | income_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | ... | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | False | 196 | 1 | 1 | ... | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | False | 196 | 1 | 1 | ... | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | False | 196 | 1 | 1 | ... | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | False | 196 | 1 | 1 | ... | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income |


In [30]:
# frequency check
df_opan['income_flag'].value_counts(dropna=False)

Upper middle income    14149620
Middle income          12304774
Low middle income       4258518
Low income               193870
High Income               57782
Name: income_flag, dtype: int64
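
Because the income bands are checked from lowest to highest, they can be written as an ordered list of conditions for `np.select`, which assigns the label of the first condition that matches. This is a sketch on a hypothetical frame (`inc`) with values chosen to sit on the band boundaries; it is an alternative formulation, not the code used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with incomes placed on the band boundaries
inc = pd.DataFrame({'income': [25000, 30000, 54999, 55000, 100000, 375000, 400000]})

# Conditions are evaluated in order, so each one only needs its upper bound
conditions = [
    inc['income'] < 30000,
    inc['income'] < 55000,
    inc['income'] < 100000,
    inc['income'] < 375000,
]
labels = ['Low income', 'Low middle income', 'Middle income', 'Upper middle income']

# Rows matching none of the conditions (income >= 375000) get the default label
inc['income_flag'] = np.select(conditions, labels, default='High Income')
print(inc)
```

One caveat of this form: a missing income matches no condition and would also receive the default label, so it only suits a column without `NaN` values.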

### c. Household Size

    0 dependants = Small Household
    1-2 dependants = Middle Household
    3 or more dependants = Large Household

In [31]:
# creating conditions for household size flag
df_opan.loc[df_opan['no_of_dependants'] <= 0 ,'household_size_flag']='Small household'
df_opan.loc[(df_opan['no_of_dependants'] >= 1) & (df_opan['no_of_dependants'] <= 2),'household_size_flag']='Middle household'
df_opan.loc[df_opan['no_of_dependants']>=3,'household_size_flag']='Large household'

In [32]:
# check
df_opan.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | age | date_joined | no_of_dependants | marital_status | income | _merge | region | age_group_flag | income_flag | household_size_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | ... | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | False | 196 | 1 | 1 | ... | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | False | 196 | 1 | 1 | ... | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | False | 196 | 1 | 1 | ... | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | False | 196 | 1 | 1 | ... | 31 | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household |


In [33]:
# frequency check
df_opan['household_size_flag'].value_counts(dropna=False)

Middle household    15452367
Large household      7772516
Small household      7739681
Name: household_size_flag, dtype: int64
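
The frequency checks in this notebook count rows of `df_opan`, and each row is one ordered product, so customers who order a lot weigh more heavily in the totals. To count customers instead, the flag can be grouped and the distinct `user_id` values counted. A minimal sketch on a hypothetical frame (`demo`):

```python
import pandas as pd

# Hypothetical order lines: user 1 has three rows, user 2 one row, user 3 two rows
demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3, 3],
    'household_size_flag': ['Large household'] * 3
                           + ['Small household']
                           + ['Middle household'] * 2,
})

# Row-level frequency, as used in the checks above
rows_per_group = demo['household_size_flag'].value_counts()

# Customer-level frequency: distinct user_ids per flag value
customers_per_group = demo.groupby('household_size_flag')['user_id'].nunique()

print(rows_per_group)
print(customers_per_group)
```

Which of the two counts is appropriate depends on whether the report should describe order volume or the customer base.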

### d. Child Age Status

    department_id is 18 and customer has a number of dependants greater than 0 = young children in household
    department_id is not 18 and customer has a number of dependants greater than 0 = older children in household
    else = no children


In [34]:
# check department names and numbers
df_dep

| | Unnamed: 0 | department |
| --- | --- | --- |
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |
| 5 | 6 | international |
| 6 | 7 | beverages |
| 7 | 8 | pets |
| 8 | 9 | dry goods pasta |
| 9 | 10 | bulk |


In [35]:
# creating a subset of orders with department_id 18 (babies)
sub_babies = df_opan[df_opan['department_id']==18]

In [36]:
# check
sub_babies.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | age | date_joined | no_of_dependants | marital_status | income | _merge | region | age_group_flag | income_flag | household_size_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1508 | 1382150 | 109 | 6 | 1 | 9 | 15.0 | False | 3858 | 5 | 0 | ... | 67 | 7/29/2018 | 1 | married | 41805 | both | Northeast | Senior | Low middle income | Middle household |
| 2893 | 2684151 | 290 | 22 | 6 | 10 | 7.0 | False | 45309 | 32 | 0 | ... | 24 | 5/18/2019 | 1 | married | 55550 | both | Midwest | Young adult | Middle income | Middle household |
| 3508 | 2684151 | 290 | 22 | 6 | 10 | 7.0 | False | 15076 | 33 | 0 | ... | 24 | 5/18/2019 | 1 | married | 55550 | both | Midwest | Young adult | Middle income | Middle household |
| 3982 | 2332460 | 420 | 21 | 5 | 17 | 11.0 | False | 14408 | 1 | 0 | ... | 26 | 10/17/2018 | 2 | married | 97248 | both | West | Young adult | Middle income | Middle household |
| 4030 | 58188 | 420 | 3 | 6 | 13 | 23.0 | False | 30161 | 7 | 0 | ... | 26 | 10/17/2018 | 2 | married | 97248 | both | West | Young adult | Middle income | Middle household |


In [37]:
# check 2
sub_babies['department_id'].value_counts(dropna=False)

18    410392
Name: department_id, dtype: int64

In [38]:
# reduce subset for faster processing
sub_babies = sub_babies[['user_id','department_id']]

In [39]:
# check
sub_babies.head()

| | user_id | department_id |
| --- | --- | --- |
| 1508 | 109 | 18 |
| 2893 | 290 | 18 |
| 3508 | 290 | 18 |
| 3982 | 420 | 18 |
| 4030 | 420 | 18 |


In [40]:
# check shape
sub_babies.shape

(410392, 2)

In [41]:
# remove duplicates
sub_babies = sub_babies.drop_duplicates()

In [42]:
# check new shape
sub_babies.shape

(30230, 2)

In [43]:
# creating list with user_ids
list_babies = sub_babies['user_id'].unique()

In [44]:
# check
list_babies

array([   109,    290,    420, ..., 149691, 194803,  21688])

In [45]:
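# creating child age flag conditions
# note: the order of the statements matters. The second statement overwrites
# 'Young children' on every row outside department 18, so the final flag
# describes the order line (as defined above), not the customer as a whole
df_opan.loc[(df_opan['user_id'].isin(list_babies))&(df_opan['no_of_dependants']>0),'child_age_flag']='Young children'
df_opan.loc[(df_opan['department_id']!=18)&(df_opan['no_of_dependants']>0),'child_age_flag']='Older children'
df_opan.loc[df_opan['no_of_dependants']==0,'child_age_flag']='No children'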

In [46]:
# checking results
df_opan.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | date_joined | no_of_dependants | marital_status | income | _merge | region | age_group_flag | income_flag | household_size_flag | child_age_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | ... | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | False | 196 | 1 | 1 | ... | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | False | 196 | 1 | 1 | ... | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | False | 196 | 1 | 1 | ... | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | False | 196 | 1 | 1 | ... | 2/17/2019 | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children |


In [47]:
# checking frequency
df_opan['child_age_flag'].value_counts(dropna=False)

Older children    22917819
No children        7739681
Young children      307064
Name: child_age_flag, dtype: int64
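
The counts above are per order line: 'Young children' only remains on rows that are themselves in department 18. If the flag should instead describe the customer, so that every row of a baby-product buyer with dependants carries 'Young children', one possible variant uses `np.select` with customer-level conditions. This is a sketch on a hypothetical frame (`demo`) and a different definition from the one used in this notebook, so it would produce different frequencies.

```python
import numpy as np
import pandas as pd

# Hypothetical order lines: user 1 buys baby products and has dependants,
# user 2 has dependants but buys no baby products, users 3 and 4 have none
demo = pd.DataFrame({
    'user_id':          [1, 1, 2, 2, 3, 4],
    'department_id':    [18, 4, 4, 7, 18, 4],
    'no_of_dependants': [2, 2, 1, 1, 0, 0],
})

# Customers with at least one order line in department 18 (babies)
baby_buyers = demo.loc[demo['department_id'] == 18, 'user_id'].unique()

has_dependants = demo['no_of_dependants'] > 0
bought_baby = demo['user_id'].isin(baby_buyers)

# The first matching condition wins; the remaining rows have dependants
# but no baby purchases and receive the default label
demo['child_age_flag'] = np.select(
    [~has_dependants, bought_baby],
    ['No children', 'Young children'],
    default='Older children',
)
print(demo)
```

Here both rows of user 1 are flagged 'Young children', including the row outside department 18.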

### e. Pet Status

    customer has ordered from department_id 8 at least once = customer has pets
    customer has never ordered from department_id 8 = customer does not have pets

In [49]:
# creating a subset of orders with department_id 8 (pets)
sub_pets = df_opan[df_opan['department_id']==8]

In [50]:
# check
sub_pets.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | date_joined | no_of_dependants | marital_status | income | _merge | region | age_group_flag | income_flag | household_size_flag | child_age_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1495 | 659764 | 109 | 4 | 2 | 5 | 20.0 | False | 36273 | 9 | 0 | ... | 7/29/2018 | 1 | married | 41805 | both | Northeast | Senior | Low middle income | Middle household | Older children |
| 1496 | 3116901 | 109 | 5 | 0 | 7 | 26.0 | False | 36273 | 3 | 1 | ... | 7/29/2018 | 1 | married | 41805 | both | Northeast | Senior | Low middle income | Middle household | Older children |
| 2791 | 1439283 | 290 | 2 | 6 | 14 | 7.0 | False | 25860 | 8 | 0 | ... | 5/18/2019 | 1 | married | 55550 | both | Midwest | Young adult | Middle income | Middle household | Older children |
| 2792 | 3080196 | 290 | 3 | 6 | 12 | 7.0 | False | 25860 | 21 | 1 | ... | 5/18/2019 | 1 | married | 55550 | both | Midwest | Young adult | Middle income | Middle household | Older children |
| 2793 | 2881272 | 290 | 4 | 6 | 10 | 7.0 | False | 25860 | 7 | 1 | ... | 5/18/2019 | 1 | married | 55550 | both | Midwest | Young adult | Middle income | Middle household | Older children |


In [51]:
# check 2
sub_pets['department_id'].value_counts(dropna=False)

8    93060
Name: department_id, dtype: int64

In [52]:
# reduce subset for faster processing
sub_pets = sub_pets[['user_id','department_id']]

In [53]:
# check
sub_pets.head()

| | user_id | department_id |
| --- | --- | --- |
| 1495 | 109 | 8 |
| 1496 | 109 | 8 |
| 2791 | 290 | 8 |
| 2792 | 290 | 8 |
| 2793 | 290 | 8 |


In [54]:
# check shape
sub_pets.shape

(93060, 2)

In [55]:
# remove duplicates
sub_pets = sub_pets.drop_duplicates()

In [56]:
# check shape
sub_pets.shape

(13175, 2)

In [57]:
# creating list with user_ids
list_pets = sub_pets['user_id'].unique()

In [58]:
# check list
list_pets

array([   109,    290,    709, ..., 128084,  14475,  66119])

In [59]:
# creating pets flag conditions (customer level: every row of a pet-product buyer is flagged)
df_opan.loc[df_opan['user_id'].isin(list_pets),'pets_flag']='Has Pets'
df_opan.loc[~df_opan['user_id'].isin(list_pets),'pets_flag']='No Pets'

In [60]:
# checking results
df_opan.head()

| | order_id | user_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | ... | no_of_dependants | marital_status | income | _merge | region | age_group_flag | income_flag | household_size_flag | child_age_flag | pets_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | ... | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children | No Pets |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | False | 196 | 1 | 1 | ... | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children | No Pets |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | False | 196 | 1 | 1 | ... | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children | No Pets |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | False | 196 | 1 | 1 | ... | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children | No Pets |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | False | 196 | 1 | 1 | ... | 3 | married | 40423 | both | South | Middle-Aged adult | Low middle income | Large household | Older children | No Pets |


In [61]:
# checking frequency
df_opan['pets_flag'].value_counts(dropna=False)

No Pets     27513213
Has Pets     3451351
Name: pets_flag, dtype: int64
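
A two-valued flag like this one can also be written with `np.where`, followed by a consistency check that the number of flagged customers equals the number of distinct pet-product buyers. The sketch below uses a hypothetical frame (`demo`) rather than `df_opan`.

```python
import numpy as np
import pandas as pd

# Hypothetical order lines: users 1 and 3 have bought from department 8 (pets)
demo = pd.DataFrame({
    'user_id':       [1, 1, 2, 3],
    'department_id': [8, 4, 4, 8],
})

# Distinct customers with at least one pet-department order line
pet_buyers = demo.loc[demo['department_id'] == 8, 'user_id'].unique()

# np.where picks the first label where the condition holds, else the second
demo['pets_flag'] = np.where(demo['user_id'].isin(pet_buyers), 'Has Pets', 'No Pets')

# Consistency check: flagged customers should match the list of buyers
flagged_customers = demo.loc[demo['pets_flag'] == 'Has Pets', 'user_id'].nunique()
print(flagged_customers == len(pet_buyers))
```

Applied to the full data, the same check would compare the number of flagged customers against the length of `list_pets`.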

In [62]:
# Exporting Profiling Results
df_opan.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_customers_profiled.pkl'))
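
Question 5 mentions profiles such as "Single adult" and "Young parent". One way the individual flags could be combined into such a profile is sketched below. The rules, the column name `customer_profile`, and the `'single'` value of `marital_status` are illustrative assumptions, not definitions taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; the 'single' marital status value is assumed
demo = pd.DataFrame({
    'age_group_flag':   ['Young adult', 'Young adult', 'Senior', 'Middle-Aged adult'],
    'marital_status':   ['single', 'married', 'married', 'married'],
    'no_of_dependants': [0, 2, 0, 3],
})

is_young = demo['age_group_flag'] == 'Young adult'
has_dependants = demo['no_of_dependants'] > 0
is_single = demo['marital_status'] == 'single'

# Illustrative rules only: the first matching rule wins, everything else is 'Other'
demo['customer_profile'] = np.select(
    [is_young & has_dependants, is_single & ~has_dependants],
    ['Young parent', 'Single adult'],
    default='Other',
)
print(demo)
```

Further rules, for example based on income or pet ownership, could be added to the condition list in the same way.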