#  4.10: Coding Etiquette & Excel Reporting - Part 1

## Table of contents:

### Task 5 -  Create a profiling variable based on age, income, certain goods in the “department_id” column, and number of dependents

### 1) Create sample data
### 2) Create Age Group
### 3) Create Income Group
### 4) Create Department Group
### 5) Create Parents Group
### 6) Create Customer Profile
### 7) Create Day Shopper Group

###  - Export data

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

### Import Data Set saved from Task 4

In [2]:
path = r'C:\Users\facun\Desktop\Data Analysis\CF\PYTHON\Instacart Basket Analysis'

In [3]:
df_high_act_customer = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'high_activity_customers.pkl'))

### Check output

In [4]:
df_high_act_customer.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,order_frequency_flag,Gender,State,Age,date_joined,n_dependants,fam_status,income,Region,customer_activity
0,2539329,1,1,2,8,,196,1,0,Soda,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,high activity customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,high activity customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,high activity customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,high activity customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,high activity customer


### Check shape

In [5]:
df_high_act_customer.shape

(30965686, 31)

In [6]:
df_high_act_customer.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_last_order', 'product_id',
       'add_to_cart_order', 'reordered', 'product_name', 'aisle_id',
       'department_id', 'prices', 'price_range_loc', 'busiest_day',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'average_price',
       'spending_flag', 'median_days_since_last_order', 'order_frequency_flag',
       'Gender', 'State', 'Age', 'date_joined', 'n_dependants', 'fam_status',
       'income', 'Region', 'customer_activity'],
      dtype='object')

## 1) Creating sample data set of the data frame

I decided to work with a sample because the original data set is too big: when I tried to merge some files, the operation consumed too much of my computer's memory.

In [7]:
# Create a boolean array holding the True/False result of the test np.random.rand() <= 0.7

dev = np.random.rand(len(df_high_act_customer)) <= 0.7

In [8]:
# Store roughly 70% of the rows in the dataframe big

big = df_high_act_customer[dev]

In [9]:
# Store the remaining rows (roughly 30%) in the dataframe small

small = df_high_act_customer[~dev]

In [10]:
# Check results

len(df_high_act_customer)

30965686

In [11]:
len(big) + len(small)

30965686
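
The mask above is drawn without a seed, so the exact rows (and the row counts further down) will differ from run to run. A minimal sketch of a reproducible version of the same split, assuming a seeded NumPy generator is acceptable; the small `df` frame and the seed value are hypothetical stand-ins, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_high_act_customer
df = pd.DataFrame({'order_id': range(1000)})

# A seeded generator produces the same mask on every run (seed value is arbitrary)
rng = np.random.default_rng(seed=4)
dev = rng.random(len(df)) <= 0.7

big = df[dev]      # roughly 70% of the rows
small = df[~dev]   # the remaining rows, roughly 30%

# The two parts are disjoint and together cover the full frame
print(len(big) + len(small) == len(df))
```

Because the mask is random, the split is only approximately 70/30; the check that both parts add up to the original length still holds exactly.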

In [12]:
# Reduce samples to subset of only columns relevant to analysis

relevant_columns = [
    'order_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day',
    'days_since_last_order', 'product_name', 'Age', 'State', 'product_id',
    'n_dependants', 'department_id', 'prices', 'price_range_loc',
    'busiest_day', 'busiest_period_of_day', 'loyalty_flag', 'fam_status',
    'spending_flag', 'order_frequency_flag', 'Gender', 'income', 'Region',
]
df_hac_final = small[relevant_columns]

In [13]:
df_hac_final.head()

Unnamed: 0,order_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,Age,State,product_id,n_dependants,...,price_range_loc,busiest_day,busiest_period_of_day,loyalty_flag,fam_status,spending_flag,order_frequency_flag,Gender,income,Region
0,2539329,1,2,8,,Soda,31,Alabama,196,3,...,Mid-range product,Regularly busy,Average orders,New customer,married,Low_spender,Non-frequent customer,Female,40423,South
3,2254736,4,4,7,29.0,Soda,31,Alabama,196,3,...,Mid-range product,Least busy,Average orders,New customer,married,Low_spender,Non-frequent customer,Female,40423,South
4,431534,5,4,15,28.0,Soda,31,Alabama,196,3,...,Mid-range product,Least busy,Most orders,New customer,married,Low_spender,Non-frequent customer,Female,40423,South
5,3367565,6,2,7,19.0,Soda,31,Alabama,196,3,...,Mid-range product,Regularly busy,Average orders,New customer,married,Low_spender,Non-frequent customer,Female,40423,South
11,2539329,1,2,8,,Original Beef Jerky,31,Alabama,12427,3,...,Low-range product,Regularly busy,Average orders,New customer,married,Low_spender,Non-frequent customer,Female,40423,South


In [14]:
df_hac_final.shape

(9291228, 22)

## 2) Age Group

To create the age groups and their flags, I first need to find the minimum and maximum age.

After that, I decided to categorize customers into 3 groups:
- **Young Adult**
- **Middle-Age Adult**
- **Senior Adult**


In [15]:
df_hac_final['Age'].min()

18

In [16]:
df_hac_final['Age'].max()

81

In [17]:
df_hac_final.loc[df_hac_final['Age'] <= 35, 'age_group'] = 'Young Adult: 18-35'
df_hac_final.loc[(df_hac_final['Age'] > 35) & (df_hac_final['Age'] <= 55), 'age_group'] = 'Middle-Age Adult: 36-55'
df_hac_final.loc[df_hac_final['Age'] > 55, 'age_group'] = 'Senior Adult: 56+'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hac_final.loc[df_hac_final['Age'] <= 35, 'age_group'] = 'Young Adult: 18-35'
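
The warning above appears because `df_hac_final` was created by selecting columns from another DataFrame, so pandas cannot tell whether the assignment should affect the original. One common way to avoid it, sketched here on a hypothetical three-row frame rather than the real data, is to take an explicit `.copy()` when building the subset:

```python
import pandas as pd

# Hypothetical stand-in for the sampled frame `small`
small = pd.DataFrame({
    'order_id': [1, 2, 3],
    'Age': [31, 48, 70],
    'income': [40423, 120000, 90000],
})

# .copy() makes the column subset an independent DataFrame
df_subset = small[['order_id', 'Age']].copy()

# Adding a new column now only touches the copy
df_subset.loc[df_subset['Age'] <= 35, 'age_group'] = 'Young Adult: 18-35'

print('age_group' in df_subset.columns)
print('age_group' in small.columns)
```

The trade-off is memory: the copy duplicates the selected columns, which matters on a frame with millions of rows.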


#### Check frequency

In [18]:
df_hac_final['age_group'].value_counts(dropna = False)

Senior Adult: 56+          3754102
Middle-Age Adult: 36-55    2916519
Young Adult: 18-35         2620607
Name: age_group, dtype: int64

In [19]:
df_hac_final[['order_id','age_group']].head()

Unnamed: 0,order_id,age_group
0,2539329,Young Adult: 18-35
3,2254736,Young Adult: 18-35
4,431534,Young Adult: 18-35
5,3367565,Young Adult: 18-35
11,2539329,Young Adult: 18-35
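
The three `.loc` assignments used for the age groups can also be written as a single binning step. A minimal sketch using `pd.cut` with the same cut-offs (35 and 55); the `ages` frame is made-up boundary data for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the group boundaries
ages = pd.DataFrame({'Age': [18, 35, 36, 55, 56, 81]})

# Bins are right-inclusive by default: (0, 35], (35, 55], (55, inf)
ages['age_group'] = pd.cut(
    ages['Age'],
    bins=[0, 35, 55, np.inf],
    labels=['Young Adult: 18-35', 'Middle-Age Adult: 36-55', 'Senior Adult: 56+'],
)

print(ages)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which usually takes less memory on large frames.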


## 3) Income Group

To create the income groups and their flags, I first need to find the minimum and maximum income.

After that, I decided to categorize customers into 4 groups:
- **Low Income**
- **Medium Income**
- **High Income**
- **Elevated Income**

In [20]:
df_hac_final['income'].min()

25903

In [21]:
df_hac_final['income'].max()

593901

In [22]:
df_hac_final.loc[df_hac_final['income'] <= 100000, 'income_group'] = 'Low Income'
df_hac_final.loc[(df_hac_final['income'] > 100000) & (df_hac_final['income'] <= 200000), 'income_group'] = 'Medium Income'
df_hac_final.loc[(df_hac_final['income'] > 200000) & (df_hac_final['income'] <= 300000), 'income_group'] = 'High Income'
df_hac_final.loc[df_hac_final['income'] > 300000, 'income_group'] = 'Elevated Income'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hac_final.loc[df_hac_final['income'] <= 100000, 'income_group'] = 'Low Income'


#### Check frequency

In [23]:
df_hac_final['income_group'].value_counts(dropna = False)

Low Income         5026729
Medium Income      4187708
High Income          46566
Elevated Income      30225
Name: income_group, dtype: int64

In [24]:
df_hac_final[['order_id','income_group']].head()

Unnamed: 0,order_id,income_group
0,2539329,Low Income
3,2254736,Low Income
4,431534,Low Income
5,3367565,Low Income
11,2539329,Low Income
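
As an alternative to four separate `.loc` assignments, the income thresholds can be expressed as one ordered list of conditions. A minimal sketch with `np.select`, which takes the first condition that matches for each row; the `income` values are made-up boundary cases:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes chosen to sit on the group boundaries
income = pd.Series([25903, 100000, 100001, 200000, 250000, 593901])

# Conditions are checked in order, so each one only needs its upper bound
conditions = [income <= 100000, income <= 200000, income <= 300000]
choices = ['Low Income', 'Medium Income', 'High Income']

# Anything above 300000 falls through to the default label
income_group = np.select(conditions, choices, default='Elevated Income')

print(income_group)
```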


##  4) Department Group

### Importing department csv

In [25]:
df_depts = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'), index_col=False)

### Check output

In [26]:
df_depts.head()

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


In [27]:
# Merge the departments dataframe with the high activity customer data

df_high_act_customer_dept = df_hac_final.merge(df_depts, on = 'department_id')

In [28]:
df_high_act_customer_dept.head()

Unnamed: 0,order_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,Age,State,product_id,n_dependants,...,loyalty_flag,fam_status,spending_flag,order_frequency_flag,Gender,income,Region,age_group,income_group,department
0,2539329,1,2,8,,Soda,31,Alabama,196,3,...,New customer,married,Low_spender,Non-frequent customer,Female,40423,South,Young Adult: 18-35,Low Income,beverages
1,2254736,4,4,7,29.0,Soda,31,Alabama,196,3,...,New customer,married,Low_spender,Non-frequent customer,Female,40423,South,Young Adult: 18-35,Low Income,beverages
2,431534,5,4,15,28.0,Soda,31,Alabama,196,3,...,New customer,married,Low_spender,Non-frequent customer,Female,40423,South,Young Adult: 18-35,Low Income,beverages
3,3367565,6,2,7,19.0,Soda,31,Alabama,196,3,...,New customer,married,Low_spender,Non-frequent customer,Female,40423,South,Young Adult: 18-35,Low Income,beverages
4,2295261,9,1,16,0.0,Zero Calorie Cola,31,Alabama,46149,3,...,New customer,married,Low_spender,Non-frequent customer,Female,40423,South,Young Adult: 18-35,Low Income,beverages
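
`merge` defaults to an inner join, so any row whose `department_id` has no match in the departments table would be dropped silently. A minimal sketch of how the merge could be checked, assuming a left join with `indicator=True` and `validate='many_to_one'`; the two small frames and the id `99` are hypothetical:

```python
import pandas as pd

# Hypothetical stand-ins for df_hac_final and df_depts
orders = pd.DataFrame({'order_id': [1, 2, 3], 'department_id': [4, 5, 99]})
depts = pd.DataFrame({'department_id': [4, 5], 'department': ['produce', 'alcohol']})

# how='left' keeps every order row; validate raises if department_id
# is duplicated in depts; indicator adds a '_merge' column
merged = orders.merge(
    depts, on='department_id', how='left',
    validate='many_to_one', indicator=True,
)

# Rows without a matching department are flagged as 'left_only'
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['order_id', 'department_id']])
```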


## 5) Parents Group

In [29]:
df_high_act_customer_dept.loc[df_high_act_customer_dept['n_dependants'] > 0, 'dependants_group'] = 'Parent'
df_high_act_customer_dept.loc[df_high_act_customer_dept['n_dependants'] == 0, 'dependants_group'] = 'Childless'

In [30]:
df_high_act_customer_dept['dependants_group'].value_counts(dropna = False)

Parent       6969157
Childless    2322071
Name: dependants_group, dtype: int64

## 6) Customer Profile

In [31]:
df_high_act_customer_dept['fam_status'].value_counts(dropna = False)

married                             6525351
single                              1527998
divorced/widowed                     794073
living with parents and siblings     443806
Name: fam_status, dtype: int64

I decided to create a **Family Type** group as the customer profile.

In [32]:
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] > 0) & (df_high_act_customer_dept['fam_status'] == 'married'), 'family_type'] = 'Married with children'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] > 0) & (df_high_act_customer_dept['fam_status'] == 'single'), 'family_type'] = 'Single parent'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] > 0) & (df_high_act_customer_dept['fam_status'] == 'divorced/widowed'), 'family_type'] = 'divorced/widowed with children'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] > 0) & (df_high_act_customer_dept['fam_status'] == 'living with parents and siblings'), 'family_type'] = 'living with parents and siblings with children'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] == 0) & (df_high_act_customer_dept['fam_status'] == 'married'), 'family_type'] = 'Married without children'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] == 0) & (df_high_act_customer_dept['fam_status'] == 'single'), 'family_type'] = 'Single without children'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] == 0) & (df_high_act_customer_dept['fam_status'] == 'divorced/widowed'), 'family_type'] = 'divorced/widowed without children'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['n_dependants'] == 0) & (df_high_act_customer_dept['fam_status'] == 'living with parents and siblings'), 'family_type'] = 'living with parents and siblings without children'

In [33]:
value_counts = df_high_act_customer_dept['family_type'].value_counts(dropna = False)
value_counts

Married with children                             6525351
Single without children                           1527998
divorced/widowed without children                  794073
living with parents and siblings with children     443806
Name: family_type, dtype: int64
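
The eight `.loc` assignments for `family_type` follow one pattern: a label chosen from the pair (family status, has dependants). A minimal sketch of the same idea as a lookup table; the `FAMILY_TYPE` dictionary and the `customers` frame are hypothetical helpers, with labels copied from the cell above and only three statuses shown:

```python
import pandas as pd

# Hypothetical stand-in for df_high_act_customer_dept
customers = pd.DataFrame({
    'fam_status': ['married', 'single', 'divorced/widowed'],
    'n_dependants': [3, 0, 0],
})

# (fam_status, has_dependants) -> label
FAMILY_TYPE = {
    ('married', True): 'Married with children',
    ('married', False): 'Married without children',
    ('single', True): 'Single parent',
    ('single', False): 'Single without children',
    ('divorced/widowed', True): 'divorced/widowed with children',
    ('divorced/widowed', False): 'divorced/widowed without children',
}

has_dependants = customers['n_dependants'] > 0

# .get() returns None for any combination missing from the table
customers['family_type'] = [
    FAMILY_TYPE.get(key) for key in zip(customers['fam_status'], has_dependants)
]

print(customers)
```

A Python-level loop like this is slower than vectorised `.loc` assignments on millions of rows, so it is mainly a readability option.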

I also thought it would be good to know each group's share as a percentage. (This is mostly for my own knowledge: I wanted to see how the results could be shown as percentages.)

In [34]:
total_count = value_counts.sum()
percentage_df = (value_counts / total_count * 100).reset_index()
percentage_df.columns = ['Family Type', 'Percentage']
percentage_df['Percentage'] = percentage_df['Percentage'].apply(lambda x: f'{x:.2f}%')
percentage_df

Unnamed: 0,Family Type,Percentage
0,Married with children,70.23%
1,Single without children,16.45%
2,divorced/widowed without children,8.55%
3,living with parents and siblings with children,4.78%
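
The same percentages can be obtained directly from `value_counts` by passing `normalize=True`, which returns proportions instead of counts. A minimal sketch on a made-up ten-row series:

```python
import pandas as pd

# Hypothetical family_type column with 10 rows
family_type = pd.Series(
    ['Married with children'] * 7
    + ['Single without children'] * 2
    + ['Single parent']
)

# normalize=True returns proportions; multiply by 100 for percentages
percentages = family_type.value_counts(normalize=True).mul(100).round(2)

print(percentages)
```

Keeping the result numeric (rather than formatting it as strings with a `%` sign) makes it easier to sort or plot later.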


## 7) Day Shopper Group

- First, I assigned a day name to each value of `orders_day_of_week`
- Then I created the day shopper group (weekend or weekday shopper)

In [35]:
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 0, 'Day'] = 'Saturday'
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 1, 'Day'] = 'Sunday'
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 2, 'Day'] = 'Monday'
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 3, 'Day'] = 'Tuesday'
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 4, 'Day'] = 'Wednesday'
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 5, 'Day'] = 'Thursday'
df_high_act_customer_dept.loc[df_high_act_customer_dept['orders_day_of_week'] == 6, 'Day'] = 'Friday'

In [36]:
df_high_act_customer_dept['Day'].value_counts(dropna = False)

Saturday     1773487
Sunday       1623479
Friday       1287526
Thursday     1209466
Monday       1206899
Tuesday      1101796
Wednesday    1088575
Name: Day, dtype: int64

In [37]:
df_high_act_customer_dept.loc[(df_high_act_customer_dept['Day'] == 'Saturday') | (df_high_act_customer_dept['Day'] == 'Sunday'), 'Shopper day'] = 'Weekend Shopper'
df_high_act_customer_dept.loc[(df_high_act_customer_dept['Day'] != 'Saturday') & (df_high_act_customer_dept['Day'] != 'Sunday'), 'Shopper day'] = 'Weekday Shopper'


In [38]:
df_high_act_customer_dept['Shopper day'].value_counts(dropna = False)

Weekday Shopper    5894262
Weekend Shopper    3396966
Name: Shopper day, dtype: int64
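
The day names and the weekend/weekday flag can also be derived with a dictionary lookup and `isin`. A minimal sketch that keeps the same numbering used above (0 = Saturday through 6 = Friday); the `orders` frame and the `DAY_NAMES` dictionary are hypothetical names for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_high_act_customer_dept
orders = pd.DataFrame({'orders_day_of_week': [0, 1, 2, 4, 6]})

# Same numbering as the cells above: 0 = Saturday ... 6 = Friday
DAY_NAMES = {
    0: 'Saturday', 1: 'Sunday', 2: 'Monday', 3: 'Tuesday',
    4: 'Wednesday', 5: 'Thursday', 6: 'Friday',
}

orders['Day'] = orders['orders_day_of_week'].map(DAY_NAMES)

# isin() replaces the chained == / != comparisons
is_weekend = orders['Day'].isin(['Saturday', 'Sunday'])
orders['Shopper day'] = np.where(is_weekend, 'Weekend Shopper', 'Weekday Shopper')

print(orders)
```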

In [39]:
df_high_act_customer_dept.head()

Unnamed: 0,order_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_name,Age,State,product_id,n_dependants,...,Gender,income,Region,age_group,income_group,department,dependants_group,family_type,Day,Shopper day
0,2539329,1,2,8,,Soda,31,Alabama,196,3,...,Female,40423,South,Young Adult: 18-35,Low Income,beverages,Parent,Married with children,Monday,Weekday Shopper
1,2254736,4,4,7,29.0,Soda,31,Alabama,196,3,...,Female,40423,South,Young Adult: 18-35,Low Income,beverages,Parent,Married with children,Wednesday,Weekday Shopper
2,431534,5,4,15,28.0,Soda,31,Alabama,196,3,...,Female,40423,South,Young Adult: 18-35,Low Income,beverages,Parent,Married with children,Wednesday,Weekday Shopper
3,3367565,6,2,7,19.0,Soda,31,Alabama,196,3,...,Female,40423,South,Young Adult: 18-35,Low Income,beverages,Parent,Married with children,Monday,Weekday Shopper
4,2295261,9,1,16,0.0,Zero Calorie Cola,31,Alabama,46149,3,...,Female,40423,South,Young Adult: 18-35,Low Income,beverages,Parent,Married with children,Sunday,Weekend Shopper


### Export data set for the next step

In [40]:
df_high_act_customer_dept.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'df_high_act_customer_dept_Step5.pkl'))