## Instacart Analysis – Customer Profile Categorisation
1.	Import libraries, set directory paths & import data
2.	Check data frame dimensions, columns and datatypes
3.	Review customer loyalty
    -	How often does each customer shop at Instacart?
    -	How much does each customer spend?
    -	What is the maximum order placed per customer?
4.	Review age range & frequency
    -	Calculate mean and median ages for assigning profiling age groups
    -	Assign age grouping using loc[] method
    -	Copy value counts to clipboard for writing to the Customer Profiling spreadsheet
5.	Review family situation
    -	Review number of dependants & average no. of dependants
    -	Review marital status
    -	Copy value_counts to clipboard for grouping assessment (refer to Customer Profiling spreadsheet)
    -	Assign family_flag using loc[] method
    -	Copy result to clipboard for writing to the Customer Profiling spreadsheet
6.	Categorise income levels
    -	Assess income range and other descriptive statistics
    -	Assign income categories using loc[] method
    -	Copy value counts to clipboard for writing to Customer Profiling spreadsheet
7.	Export to pickle for further assessment


### import libraries

In [1]:
import pandas as pd
import numpy as np
import os

### set data set directory path

In [2]:
datasetpath = r'D:\My Documents\! Omnicompetent Ltd\Courses\Career Foundry - Data Analytics\Data Analytics Course\Instacart Basket Analysis\02 Data Sets'
datasetpath

'D:\\My Documents\\! Omnicompetent Ltd\\Courses\\Career Foundry - Data Analytics\\Data Analytics Course\\Instacart Basket Analysis\\02 Data Sets'

### import revised data set, following product review

In [3]:
df_testing = pd.read_pickle(os.path.join(datasetpath,'testing_sample_prod.pkl'))
df_testing.head()

| | order_id | user_id | number_of_orders | order_day_of_week | order_hour_of_day | days_since_prior_order | product_id | reordered | product_name | department_id | ... | marital_status | income | region | max_order | prod_price_range | sum_product_order | top_order | product_revenue | big_revenue | key_dept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 196 | 0 | Soda | 7 | ... | married | 40423 | South | 10 | Medium Range Product | 181305 | 0.0 | 91152.0 | 0.0 | 1.0 |
| 1 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | Soda | 7 | ... | married | 40423 | South | 10 | Medium Range Product | 181305 | 0.0 | 91152.0 | 0.0 | 1.0 |
| 2 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | Soda | 7 | ... | married | 40423 | South | 10 | Medium Range Product | 181305 | 0.0 | 91152.0 | 0.0 | 1.0 |
| 3 | 550135 | 1 | 7 | 1 | 9 | 20.0 | 196 | 1 | Soda | 7 | ... | married | 40423 | South | 10 | Medium Range Product | 181305 | 0.0 | 91152.0 | 0.0 | 1.0 |
| 4 | 2539329 | 1 | 1 | 2 | 8 | 0.0 | 14084 | 0 | Organic Unsweetened Vanilla Almond Milk | 16 | ... | married | 40423 | South | 10 | High Range Product | 96579 | 0.0 | 58687.5 | 0.0 | 1.0 |


### review dimensions, columns & datatypes

In [4]:
df_testing.shape

(9268148, 25)

In [5]:
df_testing.dtypes

order_id                    int64
user_id                     int64
number_of_orders            int64
order_day_of_week           int64
order_hour_of_day           int64
days_since_prior_order    float64
product_id                  int64
reordered                   int64
product_name               object
department_id               int64
price                     float64
gender                     object
state                      object
age                         int64
n_dependants                int64
marital_status             object
income                      int64
region                     object
max_order                   int64
prod_price_range           object
sum_product_order           int64
top_order                 float64
product_revenue           float64
big_revenue               float64
key_dept                  float64
dtype: object

## Review Customer Loyalty
    1. How often does each customer shop at Instacart?
    2. How much does each customer spend?
    3. What is the maximum order placed per customer?

### review days_since_prior_order descriptive statistics

In [6]:
df_testing['days_since_prior_order'].describe()

count    9.268148e+06
mean     1.022572e+01
std      8.685513e+00
min      0.000000e+00
25%      4.000000e+00
50%      7.000000e+00
75%      1.400000e+01
max      3.000000e+01
Name: days_since_prior_order, dtype: float64

### calculate average number of days since last order per customer

In [7]:
df_testing['avg_order_days'] = df_testing.groupby('user_id')['days_since_prior_order'].transform('mean')

In [8]:
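# the first positional argument of value_counts is normalize, so the output below shows proportions of rows
df_testing['avg_order_days'].value_counts(normalize=True)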

8.000000     1.169381e-03
9.000000     1.097846e-03
7.000000     1.093207e-03
12.000000    1.078748e-03
10.000000    1.073030e-03
                 ...     
1.714286     7.552749e-07
0.428571     7.552749e-07
1.571429     7.552749e-07
29.571429    7.552749e-07
1.200000     5.394821e-07
Name: avg_order_days, Length: 47918, dtype: float64
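The per-customer average above relies on `transform`, which returns one value per input row rather than one per group. A minimal sketch on toy data (the `orders` frame and its values are hypothetical, not taken from the Instacart set) shows how the group mean is broadcast back to every row:

```python
import pandas as pd

# Toy order lines (hypothetical values): two customers, several rows each
orders = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'days_since_prior_order': [0.0, 10.0, 20.0, 4.0, 6.0],
})

# transform keeps the original row count, so the per-customer mean
# can be assigned straight back as a new column
orders['avg_order_days'] = (
    orders.groupby('user_id')['days_since_prior_order'].transform('mean')
)

print(orders)
```

Every row for customer 1 carries the same average, and likewise for customer 2, which is why the column can be used for row-level flagging afterwards.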

### assign shopping frequency flag
    Often:         <=7 days
    Occasionally:   >7   <=14 days
    Rarely:         >14 days

In [9]:
df_testing.loc[(df_testing['avg_order_days'] <=7), 'shop_freq'] = 'Often'

In [10]:
df_testing.loc[(df_testing['avg_order_days'] >7) & (df_testing['avg_order_days'] <=14), 'shop_freq'] = 'Occasionally'

In [11]:
df_testing.loc[(df_testing['avg_order_days'] >14), 'shop_freq'] = 'Rarely'

In [12]:
df_testing['shop_freq'].value_counts()

Occasionally    4440870
Often           2811485
Rarely          2015793
Name: shop_freq, dtype: int64
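The three `loc[]` assignments could also be expressed in a single step. The sketch below uses `np.select` on a toy Series (the values and the `'Unknown'` default label are assumptions for illustration) with the same thresholds as the notebook:

```python
import numpy as np
import pandas as pd

# Toy averages (hypothetical), including the boundary values 7 and 14 and one missing value
avg_order_days = pd.Series([0.0, 7.0, 7.5, 14.0, 14.5, 30.0, np.nan])

# Conditions are checked in order; the first one that is True wins
conditions = [
    avg_order_days <= 7,
    (avg_order_days > 7) & (avg_order_days <= 14),
    avg_order_days > 14,
]
labels = ['Often', 'Occasionally', 'Rarely']

# Rows matching no condition (here the NaN) receive the default label
shop_freq = pd.Series(
    np.select(conditions, labels, default='Unknown'),
    index=avg_order_days.index,
)

print(shop_freq.tolist())
```

Unlike the `loc[]` approach, which leaves unmatched rows as NaN, this variant labels them explicitly, so they show up in `value_counts()` without needing `dropna=False`.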

### copy value counts to clipboard

In [13]:
shopfreq = df_testing['shop_freq'].value_counts()

In [14]:
shopfreq.to_clipboard()

### review average spending per customer (mean price of the items each customer bought)

In [15]:
df_testing['avg_spend'] = df_testing.groupby('user_id')['price'].transform('mean')

### review avg_spend descriptive statistics

In [16]:
df_testing['avg_spend'].describe()

count    9.268148e+06
mean     7.788362e+00
std      8.529728e-01
min      1.000000e+00
25%      7.304580e+00
50%      7.800493e+00
75%      8.292105e+00
max      2.080000e+01
Name: avg_spend, dtype: float64

### assign spending flag
    Average Spender:  <=8 USD
    High Spender:     >8 USD

In [17]:
df_testing.loc[(df_testing['avg_spend'] <=8), 'spend_level'] = 'Average Spender'

In [18]:
df_testing.loc[(df_testing['avg_spend'] >8), 'spend_level'] = 'High Spender'

### check spend_level value counts

In [19]:
df_testing['spend_level'].value_counts(dropna=False)

Average Spender    5670314
High Spender       3597834
Name: spend_level, dtype: int64
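For a two-way split such as this one, `np.where` is a compact alternative to two `loc[]` assignments. This is a sketch on toy values (hypothetical), using the same 8 USD threshold:

```python
import numpy as np
import pandas as pd

# Toy per-customer averages (hypothetical), including the boundary value 8.0
avg_spend = pd.Series([1.0, 7.8, 8.0, 8.01, 20.8])

# Caveat: a NaN would fail the <= test and fall into the 'High Spender' branch,
# whereas the loc[] approach would leave it unflagged
spend_level = pd.Series(
    np.where(avg_spend <= 8, 'Average Spender', 'High Spender'),
    index=avg_spend.index,
)

print(spend_level.value_counts())
```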

### copy value counts to clipboard

In [20]:
spendlevel = df_testing['spend_level'].value_counts()

In [21]:
spendlevel.to_clipboard()

### using max_order column, assign loyalty flag
    New Customer:     <=10 orders
    Regular Customer: >10   <=40
    Loyal Customer:   >40 orders

In [22]:
df_testing.loc[df_testing['max_order'] <=10, 'loyalty_flag'] = 'New Customer'

In [23]:
df_testing.loc[(df_testing['max_order'] >10) & (df_testing['max_order'] <=40), 'loyalty_flag'] = 'Regular Customer'

In [24]:
df_testing.loc[df_testing['max_order'] >40, 'loyalty_flag'] = 'Loyal Customer'

In [25]:
df_testing['loyalty_flag'].value_counts(dropna=False)

Regular Customer    4750666
Loyal Customer      3074298
New Customer        1443184
Name: loyalty_flag, dtype: int64

### copy value counts to clipboard

In [26]:
loyalty = df_testing['loyalty_flag'].value_counts()

In [27]:
loyalty.to_clipboard()
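Judging by the `head()` output, `df_testing` appears to hold one row per product ordered, so the value counts above tally rows rather than customers. If customer-level counts are wanted, one option is to keep a single row per `user_id` before counting. A sketch on toy data (the `lines` frame is hypothetical):

```python
import pandas as pd

# Toy order lines (hypothetical): customer 1 has three rows, customer 3 has two
lines = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3, 3],
    'loyalty_flag': ['Loyal Customer'] * 3 + ['New Customer'] + ['Regular Customer'] * 2,
})

# Row-level counts: frequent shoppers weigh more because they contribute more rows
row_counts = lines['loyalty_flag'].value_counts()

# Customer-level counts: one row per user_id before counting
customer_counts = lines.drop_duplicates('user_id')['loyalty_flag'].value_counts()

print(row_counts)
print(customer_counts)
```

Both views are valid; which one belongs in the Customer Profiling spreadsheet depends on whether the question is about share of orders or share of customers.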

## Review age range & frequency

In [28]:
df_testing['age'].value_counts(dropna=False).sort_index()

18    143305
19    148345
20    143985
21    146220
22    146593
       ...  
77    143398
78    140718
79    152140
80    147516
81    144857
Name: age, Length: 64, dtype: int64

In [29]:
df_testing['age'].value_counts(dropna=False)

79    152140
48    152124
49    151824
64    151051
31    150910
       ...  
60    138079
36    138065
66    136055
41    134866
25    134432
Name: age, Length: 64, dtype: int64

### calculate mean and median age

In [30]:
avg_age = df_testing['age'].mean()
avg_age

49.46377787665885

In [31]:
median_age = df_testing['age'].median()
median_age

49.0

### assign age groupings
    Snapper:    >=18  <30
    Grown-up:   >=30  <60
    Chief:      >=60  <82

In [32]:
df_testing.loc[(df_testing['age'] <30), 'age_flag'] = 'Snapper'

In [33]:
df_testing.loc[(df_testing['age'] >=30) & (df_testing['age'] <60), 'age_flag'] = 'Grown-up'

In [34]:
df_testing.loc[(df_testing['age'] >=60), 'age_flag'] = 'Chief'

### check assigned age_flag value_counts

In [35]:
age_flag = df_testing['age_flag'].value_counts(dropna=False)

### copy to clipboard

In [36]:
age_flag.to_clipboard()
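The age groups are left-closed ranges (for example, 30 belongs to Grown-up, not Snapper). As an alternative sketch, `pd.cut` with `right=False` encodes the same boundaries in one call; the toy ages below are hypothetical boundary cases:

```python
import pandas as pd

# Toy ages (hypothetical) sitting on each boundary
ages = pd.Series([18, 29, 30, 59, 60, 81])

# right=False makes each bin include its left edge: [18, 30), [30, 60), [60, 82)
age_flag = pd.cut(
    ages,
    bins=[18, 30, 60, 82],
    right=False,
    labels=['Snapper', 'Grown-up', 'Chief'],
)

print(age_flag.tolist())
```

Note that `pd.cut` returns a categorical column rather than plain strings, and any age outside 18 to 81 would become NaN.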

## Review family situation

### review number of dependants & average no. of dependants

In [37]:
df_testing['n_dependants'].value_counts(dropna=False).sort_index()

0    2315290
1    2311418
2    2315062
3    2326378
Name: n_dependants, dtype: int64

In [38]:
df_testing['n_dependants'].mean()

1.501991120556124

### review marital status

In [39]:
df_testing['marital_status'].value_counts(dropna=False)

married                             6509474
single                              1524773
divorced/widowed                     790517
living with parents and siblings     443384
Name: marital_status, dtype: int64

### compare 'n_dependants' vs 'marital_status' via crosstab

In [40]:
cross_dependfam = pd.crosstab(df_testing['n_dependants'], df_testing['marital_status'], dropna=False)

In [41]:
cross_dependfam

| n_dependants | divorced/widowed | living with parents and siblings | married | single |
|---|---|---|---|---|
| 0 | 790517 | 0 | 0 | 1524773 |
| 1 | 0 | 151960 | 2159458 | 0 |
| 2 | 0 | 145341 | 2169721 | 0 |
| 3 | 0 | 146083 | 2180295 | 0 |
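To make the crosstab easier to read, here is a small sketch on toy data (the `customers` frame is hypothetical). Each cell counts the rows that share a given pair of values, and combinations that never occur show as 0:

```python
import pandas as pd

# Toy rows (hypothetical values)
customers = pd.DataFrame({
    'n_dependants': [0, 0, 1, 2, 3, 3],
    'marital_status': [
        'single', 'divorced/widowed', 'married',
        'married', 'married', 'living with parents and siblings',
    ],
})

# Rows are dependant counts, columns are marital statuses
table = pd.crosstab(customers['n_dependants'], customers['marital_status'])

print(table)
```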


### copy crosstab to clipboard

In [42]:
cross_dependfam.to_clipboard()

### assign family_flag using loc[ ] method

#### Free =
    no dependants, single 
    no dependants, divorced/widowed

In [43]:
df_testing.loc[(df_testing['n_dependants'] ==0) & ((df_testing['marital_status'] =='single') | (df_testing['marital_status'] =='divorced/widowed')), 'family_flag'] = 'free'

#### Responsible =
    1 dependant, living with parents and siblings
    1 dependant, married
    2 dependants, living with parents and siblings

In [44]:
df_testing.loc[(df_testing['n_dependants'] ==1) & ((df_testing['marital_status'] =='living with parents and siblings') | (df_testing['marital_status'] =='married')), 'family_flag'] = 'responsible'

In [45]:
df_testing.loc[(df_testing['n_dependants'] ==2) & (df_testing['marital_status'] =='living with parents and siblings'), 'family_flag'] = 'responsible'

#### Busy = 
    2 dependants, married
    3 dependants, married
    3 dependants, living with parents and siblings

In [46]:
df_testing.loc[(df_testing['n_dependants'] ==2) & (df_testing['marital_status'] =='married'), 'family_flag'] = 'busy'

In [47]:
df_testing.loc[(df_testing['n_dependants'] ==3) & ((df_testing['marital_status'] =='living with parents and siblings') | (df_testing['marital_status'] =='married')), 'family_flag'] = 'busy'
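Because the family flag depends on specific pairs of values, the rules can also be kept in a small lookup table and joined on. The sketch below is one possible alternative, not the notebook's method; the `lookup` and `customers` frames are hypothetical, with the pairs copied from the groupings listed above:

```python
import pandas as pd

# One row per (n_dependants, marital_status) pair and its flag
lookup = pd.DataFrame(
    [
        (0, 'single', 'free'),
        (0, 'divorced/widowed', 'free'),
        (1, 'living with parents and siblings', 'responsible'),
        (1, 'married', 'responsible'),
        (2, 'living with parents and siblings', 'responsible'),
        (2, 'married', 'busy'),
        (3, 'living with parents and siblings', 'busy'),
        (3, 'married', 'busy'),
    ],
    columns=['n_dependants', 'marital_status', 'family_flag'],
)

# Toy rows (hypothetical values)
customers = pd.DataFrame({
    'n_dependants': [0, 1, 2, 2, 3],
    'marital_status': [
        'single', 'married', 'living with parents and siblings',
        'married', 'married',
    ],
})

# A left merge keeps every customer row; unmatched pairs would get NaN
flagged = customers.merge(lookup, on=['n_dependants', 'marital_status'], how='left')

print(flagged)
```

Keeping the rules in a table makes them easy to compare against the Customer Profiling spreadsheet and to change without editing several conditions.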

### check family_flag column population

In [48]:
df_family = df_testing[df_testing['family_flag'].isin(['free', 'responsible', 'busy'])]

In [49]:
df_family[['n_dependants','marital_status','family_flag']].head(15)

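| | n_dependants | marital_status | family_flag |
|---|---|---|---|
| 0 | 3 | married | busy |
| 1 | 3 | married | busy |
| 2 | 3 | married | busy |
| 3 | 3 | married | busy |
| 4 | 3 | married | busy |
| 5 | 3 | married | busy |
| 6 | 3 | married | busy |
| 7 | 3 | married | busy |
| 8 | 3 | married | busy |
| 9 | 3 | married | busy |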


### Confirm all records have been populated by:
    a) checking all records have been saved to df_family
    b) reviewing family_flag value_counts

In [50]:
df_family.shape

(9268148, 32)

In [51]:
family_flag = df_testing['family_flag'].value_counts()

### copy to clipboard

In [52]:
family_flag.to_clipboard()
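Comparing the filtered frame's shape with the original is one way to confirm that every record was flagged. A more direct check, sketched here on a toy Series (hypothetical values), is to count missing flags and include NaN in the value counts:

```python
import numpy as np
import pandas as pd

# Toy flags (hypothetical), with one row left unassigned
flags = pd.Series(['free', 'busy', np.nan, 'responsible'])

# Number of rows that no rule matched; 0 would mean full coverage
missing = flags.isna().sum()

# dropna=False lists NaN as its own category instead of hiding it
counts = flags.value_counts(dropna=False)

print(missing)
print(counts)
```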

## Categorise income levels

### review income range

In [53]:
df_testing['income'].value_counts(dropna=False).sort_index()

25903       5
25937       9
25941      11
25955     117
25972      11
         ... 
584097    239
590790      6
591089     62
592409     98
593901    193
Name: income, Length: 95052, dtype: int64

### review descriptive statistics

In [54]:
df_testing['income'].describe()

count    9.268148e+06
mean     9.969125e+04
std      4.312018e+04
min      2.590300e+04
25%      6.733700e+04
50%      9.678000e+04
75%      1.281170e+05
max      5.939010e+05
Name: income, dtype: float64

### assign income categories
    Under:    >$25K   <=$70K
    Standard: >$70K   <=$130K
    Over:     >$130K

In [55]:
df_testing.loc[(df_testing['income'] <=70000), 'income_flag'] = 'Under'

In [56]:
df_testing.loc[(df_testing['income'] >70000) & (df_testing['income'] <=130000), 'income_flag'] = 'Standard'

In [57]:
df_testing.loc[(df_testing['income'] >130000), 'income_flag'] = 'Over'

### check income_flag value_counts, and copy to clipboard

In [58]:
income_flag = df_testing['income_flag'].value_counts()

In [59]:
income_flag.to_clipboard()

### review columns & dimensions

In [60]:
df_testing.dtypes

order_id                    int64
user_id                     int64
number_of_orders            int64
order_day_of_week           int64
order_hour_of_day           int64
days_since_prior_order    float64
product_id                  int64
reordered                   int64
product_name               object
department_id               int64
price                     float64
gender                     object
state                      object
age                         int64
n_dependants                int64
marital_status             object
income                      int64
region                     object
max_order                   int64
prod_price_range           object
sum_product_order           int64
top_order                 float64
product_revenue           float64
big_revenue               float64
key_dept                  float64
avg_order_days            float64
shop_freq                  object
avg_spend                 float64
spend_level                object
loyalty_flag               object
age_flag                   object
family_flag                object
income_flag                object
dtype: object

In [61]:
df_testing.shape

(9268148, 33)

### export to pickle for visual assessment

In [62]:
df_testing.to_pickle(os.path.join(datasetpath,'testing_sample_prodcust.pkl'))
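A quick round-trip check can confirm that an exported pickle reads back unchanged. This sketch writes a toy frame to a temporary directory (the frame, file name, and directory are hypothetical and unrelated to `datasetpath`):

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical values)
df = pd.DataFrame({'user_id': [1, 2], 'income_flag': ['Under', 'Over']})

# Write to a throwaway directory and read the file straight back
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares shape, dtypes and values
round_trip_ok = restored.equals(df)
print(round_trip_ok)
```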