# Coding Etiquette & Excel Reporting (2)
# (Part 1 - Q 5)

## Content
#### 1. Import libraries and data
#### 2. Create a profiling variable
#### 2.1. Create new column 'age_group'
#### 2.2. Create new column 'family_size'
#### 2.3. Create new column 'income_class'
#### 2.4. Create new columns 'baby_flag' and 'pet_flag'
#### 2.5. Create new column 'period_of_day'
#### 2.6. Create new column 'income_class_and_age'
#### 2.7. Create one variable for customer profiling 'customer_profile'
#### 3. Drop unnecessary columns
#### 4. Export data

# 1. Import libraries and data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Create project folder path
path = r'C:\Users\Lara\Career Foundry Projects\21-09-2023 Instacart Basket Analysis'

In [3]:
# Import dataset instacart_high_act_users.pkl
df_all = pd.read_pickle (os.path.join (path, '02 Data','Prepared Data', 'instacart_high_act_users.pkl'))

In [4]:
# Check head() and info() of df_all
df_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,product_name,prices,price_range,...,order_frequency_flag,gender,state,age,number_of_dependants,marital_status,income,department,region,activity_flag
0,2539329,1,1,Monday,8,,196,Soda,9.0,Mid-range product,...,Non-frequent customer,Female,Alabama,31,3,married,40423,beverages,South,High-activity customer
1,2398795,1,2,Tuesday,7,15.0,196,Soda,9.0,Mid-range product,...,Non-frequent customer,Female,Alabama,31,3,married,40423,beverages,South,High-activity customer
2,473747,1,3,Tuesday,12,21.0,196,Soda,9.0,Mid-range product,...,Non-frequent customer,Female,Alabama,31,3,married,40423,beverages,South,High-activity customer
3,2254736,1,4,Wednesday,7,29.0,196,Soda,9.0,Mid-range product,...,Non-frequent customer,Female,Alabama,31,3,married,40423,beverages,South,High-activity customer
4,431534,1,5,Wednesday,15,28.0,196,Soda,9.0,Mid-range product,...,Non-frequent customer,Female,Alabama,31,3,married,40423,beverages,South,High-activity customer


In [5]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30992664 entries, 0 to 32434207
Data columns (total 27 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int32  
 1   user_id                 int32  
 2   order_number            int16  
 3   orders_day_of_week      object 
 4   order_hour_of_day       int8   
 5   days_since_prior_order  float32
 6   product_id              int32  
 7   product_name            object 
 8   prices                  float32
 9   price_range             object 
 10  busiest_days            object 
 11  busiest_period_of_day   object 
 12  max_order               int16  
 13  loyalty_flag            object 
 14  mean_price              float32
 15  spending_flag           object 
 16  median_days             float32
 17  order_frequency_flag    object 
 18  gender                  object 
 19  state                   object 
 20  age                     int8   
 21  number_of_dependants    int8   
 2

# 2. Create a customer profiling variable

#### Create a profiling variable based on age, income, certain goods in the “department_id” column, and number of dependents. You might also use the “orders_day_of_the_week” and “order_hour_of_day” columns if you can think of a way they would impact customer profiles. (Hint: As an example, try thinking of what characteristics would lead you to the profile “Single adult” or “Young parent.”)

## 2.1. Create new column 'age_group'

#### Age group intervals are made taking into consideration officially revised age groups made by WHO and UN in 2015. For people older than 60 I made just one group.


In [6]:
# Check min and max values in column 'age'
df_all['age'].min()

18

In [7]:
df_all['age'].max()

81

In [8]:
# Create new column 'age_group'
df_all.loc[df_all['age'] <= 25, 'age_group'] = 'Youth'

In [9]:
df_all.loc[(df_all['age'] > 25) & (df_all['age'] <= 44), 'age_group'] = 'Young adult'

In [10]:
df_all.loc[(df_all['age'] > 44) & (df_all['age'] <= 60), 'age_group'] = 'Middle-age adult'

In [11]:
df_all.loc[df_all['age'] > 60, 'age_group'] = 'Older adult'

In [12]:
# Check if new column was created
df_all.shape

(30992664, 28)

In [13]:
# Check value counts
df_all['age_group'].value_counts(dropna = False)

age_group
Older adult         10121613
Young adult          9222501
Middle-age adult     7780384
Youth                3868166
Name: count, dtype: int64

#### Comment: The reason why age group Youth has low number of counts is because it describes only customers aged between 18 and 25 years with those years included, that is a span of 8 years. Other age groups span over about 15 - 20 years. Interenstingly, biggest customers are Older adults, that is people over 60 years old

In [14]:
# Create a custom made dictionary for new order/arrangement for age groups
arrange_age_gr = {'Youth' : 0, 'Young adult' : 1, 'Middle-age adult' : 2, 'Older adult' : 3} 

In [15]:
# Count all values in column 'age_group' and sort by custom made dictionary arrange_age_gr
df_all['age_group'].value_counts().sort_index(key = lambda x: x.map(arrange_age_gr))

age_group
Youth                3868166
Young adult          9222501
Middle-age adult     7780384
Older adult         10121613
Name: count, dtype: int64

# 2.2. Create new column 'family_size'

In [16]:
df_all['marital_status'].value_counts()

marital_status
married                             21763021
single                               5099271
divorced/widowed                     2647761
living with parents and siblings     1482611
Name: count, dtype: int64

In [17]:
df_all['number_of_dependants'].value_counts().sort_index()

number_of_dependants
0    7747032
1    7725981
2    7740199
3    7779452
Name: count, dtype: int64

In [18]:
# Create crosstab to think about possible 'family_type' value names
crosstab_1 = pd.crosstab(df_all['marital_status'], df_all['number_of_dependants'], dropna = False)

In [19]:
crosstab_1

number_of_dependants,0,1,2,3
marital_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
divorced/widowed,2647761,0,0,0
living with parents and siblings,0,508439,485157,489015
married,0,7217542,7255042,7290437
single,5099271,0,0,0


#### Comment: Dependant can be child, step child, sibling or parent or other close relative. It cannot be a spouse.
#### Crosstab above shows that single adults, or family of one, are people who's marrital status is either single or divorsed/widowed and they are the only ones without dependants. Family size for married people will be equal to number of dependants + 1, and for married people +2 ( because spouse is not a dependant!)

In [20]:
# Create new column 'family_size'
df_all.loc[df_all['number_of_dependants'] == 0, 'family_size'] = '1'
df_all.loc[((df_all['number_of_dependants'] == 1) & (df_all['marital_status'] == 'living with parents and siblings')), 'family_size'] = '2'
df_all.loc[(((df_all['number_of_dependants'] == 1) & (df_all['marital_status'] == 'married')) | ((df_all['number_of_dependants'] == 2) & (df_all['marital_status'] == 'living with parents and siblings'))) , 'family_size'] = '3'
df_all.loc[(((df_all['number_of_dependants'] == 2) & (df_all['marital_status'] == 'married')) | ((df_all['number_of_dependants'] == 3) & (df_all['marital_status'] == 'living with parents and siblings'))) , 'family_size'] = '4'
df_all.loc[((df_all['number_of_dependants'] == 3) & (df_all['marital_status'] == 'married')), 'family_size'] = '5'

In [21]:
# Check if new column was created
df_all.shape

(30992664, 29)

In [22]:
# Check value counts
df_all['family_size'].value_counts(dropna = False).sort_index()

family_size
1    7747032
2     508439
3    7702699
4    7744057
5    7290437
Name: count, dtype: int64

## 2.3. Create new column 'income_class'

In [23]:
# Check min and max of column 'income'
df_all['income'].min()

25903

In [24]:
df_all['income'].max()

593901

#### Calculations below were made with thought that 'income' column contains information about total family income.
#### Income classes were made taking into consideration conclusions from PEW reserch center about income classes in US for 2016. Middle income class family is defined as family of 3 (2 parents and 1 child) who has total family income between 2/3 and double of average US income, that is between 45,200 and 135,600 for year 2016. This should be further adjusted to various family sizes and place of living, but here I'll only adjust it to family size. All income values were rounded to nearest hundred. Link https://www.pewresearch.org/short-reads/2018/09/06/the-american-middle-class-is-stable-in-size-but-losing-ground-financially-to-upper-income-families/

In [25]:
# Create new column 'income_class'
df_all.loc[((df_all['family_size'] == '1') & (df_all['income'] < 26100)), 'income_class'] = 'Low income class'
df_all.loc[((df_all['family_size'] == '1') & ((df_all['income'] >= 26100) & (df_all['income'] <= 78200))), 'income_class'] = 'Middle income class'
df_all.loc[((df_all['family_size'] == '1') & (df_all['income'] > 78200)), 'income_class'] = 'Upper income class'

In [26]:
df_all.loc[((df_all['family_size'] == '2') & (df_all['income'] < 36900)), 'income_class'] = 'Low income class'
df_all.loc[((df_all['family_size'] == '2') & ((df_all['income'] >= 36900) & (df_all['income'] <= 110700))), 'income_class'] = 'Middle income class'
df_all.loc[((df_all['family_size'] == '2') & (df_all['income'] > 110700)), 'income_class'] = 'Upper income class'

In [27]:
df_all.loc[((df_all['family_size'] == '3') & (df_all['income'] < 45200)), 'income_class'] = 'Low income class'
df_all.loc[((df_all['family_size'] == '3') & ((df_all['income'] >= 45200) & (df_all['income'] <= 135600))), 'income_class'] = 'Middle income class'
df_all.loc[((df_all['family_size'] == '3') & (df_all['income'] > 135600)), 'income_class'] = 'Upper income class'

In [28]:
df_all.loc[((df_all['family_size'] == '4') & (df_all['income'] < 52200)), 'income_class'] = 'Low income class'
df_all.loc[((df_all['family_size'] == '4') & ((df_all['income'] >= 52200) & (df_all['income'] <= 156600))), 'income_class'] = 'Middle income class'
df_all.loc[((df_all['family_size'] == '4') & (df_all['income'] > 155600)), 'income_class'] = 'Upper income class'

In [29]:
df_all.loc[((df_all['family_size'] == '5') & (df_all['income'] < 58400)), 'income_class'] = 'Low income class'
df_all.loc[((df_all['family_size'] == '5') & ((df_all['income'] >= 58400) & (df_all['income'] <= 175100))), 'income_class'] = 'Middle income class'
df_all.loc[((df_all['family_size'] == '5') & (df_all['income'] > 175100)), 'income_class'] = 'Upper income class'

In [30]:
# Check if new column was created
df_all.shape

(30992664, 30)

In [31]:
# Create a custom made dictionary for new order/arrangement for age groups
arrange_income_cls = {'Low income class' : 0, 'Middle income class' : 1, 'Upper income class' : 2} 

In [32]:
# Count all values and sort by arrange_income_cls
df_all['income_class'].value_counts(dropna = False).sort_index(key = lambda x: x.map(arrange_income_cls))

income_class
Low income class        2806379
Middle income class    20673240
Upper income class      7513045
Name: count, dtype: int64

## 2.4. Create new columns 'baby_flag' and 'pet_flag'

In [33]:
# Create new column 'baby_flag'
df_all.loc[df_all['department'] == 'babies', 'baby_flag'] = 'Have baby'
df_all.loc[df_all['department'] != 'babies', 'baby_flag'] = 'Do not have baby'

In [34]:
# Check if new column was created
df_all.shape

(30992664, 31)

In [35]:
# Check value counts
df_all['baby_flag'].value_counts(dropna = False)

baby_flag
Do not have baby    30582272
Have baby             410392
Name: count, dtype: int64

In [36]:
# Group data by 'baby_flag' and count unique users
df_all.groupby('baby_flag')['user_id'].nunique()

baby_flag
Do not have baby    162632
Have baby            30230
Name: user_id, dtype: int64

In [37]:
crosstab_2 = pd.crosstab(df_all['marital_status'], df_all['baby_flag'], dropna = False)

In [38]:
crosstab_2

baby_flag,Do not have baby,Have baby
marital_status,Unnamed: 1_level_1,Unnamed: 2_level_1
divorced/widowed,2613410,34351
living with parents and siblings,1463230,19381
married,21475338,287683
single,5030294,68977


In [39]:
# Create new column 'pet_flag'
df_all.loc[df_all['department'] == 'pets', 'pet_flag'] = 'Have pet'
df_all.loc[df_all['department'] != 'pets', 'pet_flag'] = 'Do not have pet'

In [40]:
# Check if new column was created
df_all.shape

(30992664, 32)

In [41]:
# Check value counts
df_all['pet_flag'].value_counts(dropna = False)

pet_flag
Do not have pet    30899599
Have pet              93065
Name: count, dtype: int64

In [42]:
# Group data by 'pet_flag' and count unique users
df_all.groupby('pet_flag')['user_id'].nunique()

pet_flag
Do not have pet    162633
Have pet            13175
Name: user_id, dtype: int64

#### Comment: Even though 15% users buys baby products, from crosstab I can only conclude that people of all family structure buy baby stuff. For example: all "living with siblings and parents" are <=25 years and still buy baby stuff. Possibly siblings are babies. Also "single" users are from all age groups except "older adults" and they all don't have any dependants and still buy baby stuff. Total count of baby products is only 1.3% of all transactions (number of rows).  I will not use this information for customer profiling.
#### Similarly, I won't use 'pet_flag' either as about 6% users have pets and total count of pet realted item bought is only about 0.3% of all transactions (number of rows).

In [43]:
# Drop both columns
df_all = df_all.drop (columns = ['baby_flag', 'pet_flag'])

In [44]:
# Check shape
df_all.shape

(30992664, 30)

## 2.5. Create new columns 'part_of_day'

#### I used partition of the day into 4 parts: morning, mid-day, late afternoon to early night and late night to have 4 periods of the day with 6 hours each, even though previous analyses showed majority of items were bought between 8 am and 4 pm (16).

In [45]:
# Create new column 'part_of_day'
df_all.loc[(df_all['order_hour_of_day'] >= 5) & (df_all['order_hour_of_day'] < 11), 'part_of_day'] = 'Morning'
df_all.loc[(df_all['order_hour_of_day'] >= 11) & (df_all['order_hour_of_day'] < 17 ), 'part_of_day'] = 'Mid-day'
df_all.loc[(df_all['order_hour_of_day'] >= 17) & (df_all['order_hour_of_day'] < 23), 'part_of_day'] = 'Late afternoon to early night'
df_all.loc[(df_all['order_hour_of_day'] >= 23) | (df_all['order_hour_of_day'] < 5), 'part_of_day'] = 'Late night'

In [46]:
# Check shape
df_all.shape

(30992664, 31)

In [47]:
df_all['part_of_day'].value_counts(dropna = False)

part_of_day
Mid-day                          15180993
Morning                           7908986
Late afternoon to early night     7034570
Late night                         868115
Name: count, dtype: int64

## 2.6. Create new column 'income_class_and_age'

In [48]:
# Create crosstab for 'age_group' and 'income_class' to compare values from newly made column below
crosstab_3 = pd.crosstab(df_all['age_group'], df_all['income_class'], dropna = False)

In [49]:
crosstab_3

income_class,Low income class,Middle income class,Upper income class
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Middle-age adult,511993,4731328,2537063
Older adult,657483,6139594,3324536
Young adult,1191349,6735161,1295991
Youth,445554,3067157,355455


In [50]:
crosstab_3.to_clipboard()

#### Younger adult is Youth or Young adult (all ages <= 44)

In [51]:
df_all.loc[((df_all['age'] <= 44) & (df_all['income_class'] == 'Low income class')), 'income_class_and_age'] = 'Younger adult with low income'

In [52]:
df_all.loc[((df_all['age'] <= 44) & (df_all['income_class'] == 'Middle income class')), 'income_class_and_age'] = 'Younger adult with mid income'

In [53]:
df_all.loc[((df_all['age'] <= 44) & (df_all['income_class'] == 'Upper income class')), 'income_class_and_age'] = 'Younger adult with high income'

#### Mid-age adult

In [54]:
df_all.loc[((df_all['age_group'] == 'Middle-age adult') & (df_all['income_class'] == 'Low income class')), 'income_class_and_age'] = 'Mid-age adult with low income'

In [55]:
df_all.loc[((df_all['age_group'] == 'Middle-age adult') & (df_all['income_class'] == 'Middle income class')), 'income_class_and_age'] = 'Mid-age adult with mid income'

In [56]:
df_all.loc[((df_all['age_group'] == 'Middle-age adult') & (df_all['income_class'] == 'Upper income class')), 'income_class_and_age'] = 'Mid-age adult with high income'

#### Older adult

In [57]:
df_all.loc[((df_all['age_group'] == 'Older adult') & (df_all['income_class'] == 'Low income class')), 'income_class_and_age'] = 'Older adult with low income'

In [58]:
df_all.loc[((df_all['age_group'] == 'Older adult') & (df_all['income_class'] == 'Middle income class')), 'income_class_and_age'] = 'Older adult with mid income'

In [59]:
df_all.loc[((df_all['age_group'] == 'Older adult') & (df_all['income_class'] == 'Upper income class')), 'income_class_and_age'] = 'Older with high income'

In [60]:
# Check shape
df_all.shape

(30992664, 32)

In [61]:
# Count all values
df_all['income_class_and_age'].value_counts(dropna = False)

income_class_and_age
Younger adult with mid income     9802318
Older adult with mid income       6139594
Mid-age adult with mid income     4731328
Older with high income            3324536
Mid-age adult with high income    2537063
Younger adult with high income    1651446
Younger adult with low income     1636903
Older adult with low income        657483
Mid-age adult with low income      511993
Name: count, dtype: int64

# 2.7. Create one variable for customer profiling 'customer_profile'

#### Only users without dependats are single and divorced/widowed (family_size is 1) - they will be Single adults. All users who live with siblings and parents are under 25 but they all have 1, 2 or 3 dependants, so they'll be a Head of household under 26.
#### Married and age <=44 will be Young parent, from 45 to 59 mid age parent and for over 60 an older parent.

#### Single adults

In [62]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Younger adult with low income'), 'customer_profile'] = 'Single younger adult with low income'

In [63]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Mid-age adult with low income'), 'customer_profile'] = 'Single mid-age adult with low income'

In [64]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Older adult with low income'), 'customer_profile'] = 'Single older adult with low income'

In [65]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
NaN                                     30992303
Single younger adult with low income         361
Name: count, dtype: int64

In [66]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Younger adult with mid income'), 'customer_profile'] = 'Single younger adult with mid income'

In [67]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Mid-age adult with mid income'), 'customer_profile'] = 'Single mid-age adult with mid income'

In [68]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Older adult with mid income'), 'customer_profile'] = 'Single older adult with mid income'

In [69]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
NaN                                     28347253
Single younger adult with mid income     1927981
Single older adult with mid income        404480
Single mid-age adult with mid income      312589
Single younger adult with low income         361
Name: count, dtype: int64

In [70]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Younger adult with high income'), 'customer_profile'] = 'Single younger adult with high income'

In [71]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Mid-age adult with high income'), 'customer_profile'] = 'Single mid-age adult with high income'

In [72]:
df_all.loc[(df_all['family_size'] == '1')  & (df_all['income_class_and_age'] == 'Older adult with high income'), 'customer_profile'] = 'Single older adult with high income'

In [73]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
NaN                                      25376692
Single younger adult with mid income      1927981
Single mid-age adult with high income     1629572
Single younger adult with high income     1340989
Single older adult with mid income         404480
Single mid-age adult with mid income       312589
Single younger adult with low income          361
Name: count, dtype: int64

#### Head of household under 26

In [74]:
df_all.loc[(df_all['marital_status'] == 'living with parents and siblings')  & (df_all['income_class'] == 'Low income class'), 'customer_profile'] = 'Head of household under 26 with low income'

In [75]:
df_all.loc[(df_all['marital_status'] == 'living with parents and siblings')  & (df_all['income_class'] == 'Middle income class'), 'customer_profile'] = 'Head of household under 26 with mid income'

In [76]:
df_all.loc[(df_all['marital_status'] == 'living with parents and siblings')  & (df_all['income_class'] == 'Upper income class'), 'customer_profile'] = 'Head of household under 26 with high income'

In [77]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
NaN                                            23894081
Single younger adult with mid income            1927981
Single mid-age adult with high income           1629572
Single younger adult with high income           1340989
Head of household under 26 with mid income      1299726
Single older adult with mid income               404480
Single mid-age adult with mid income             312589
Head of household under 26 with low income       166019
Head of household under 26 with high income       16866
Single younger adult with low income                361
Name: count, dtype: int64

#### Parent

In [78]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Younger adult with low income'), 'customer_profile'] = 'Younger parent with low income'

In [79]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Mid-age adult with low income'), 'customer_profile'] = 'Mid-age parent with low income'

In [80]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Older adult with low income'), 'customer_profile'] = 'Older parent with low income'

In [81]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
NaN                                            21254082
Single younger adult with mid income            1927981
Single mid-age adult with high income           1629572
Younger parent with low income                  1470523
Single younger adult with high income           1340989
Head of household under 26 with mid income      1299726
Older parent with low income                     657483
Mid-age parent with low income                   511993
Single older adult with mid income               404480
Single mid-age adult with mid income             312589
Head of household under 26 with low income       166019
Head of household under 26 with high income       16866
Single younger adult with low income                361
Name: count, dtype: int64

In [82]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Younger adult with mid income'), 'customer_profile'] = 'Younger parent with mid income'

In [83]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Mid-age adult with mid income'), 'customer_profile'] = 'Mid-age parent with mid income'

In [84]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Older adult with mid income'), 'customer_profile'] = 'Older parent with mid income'

In [85]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
Younger parent with mid income                 6574611
Older parent with mid income                   5735114
NaN                                            4525618
Mid-age parent with mid income                 4418739
Single younger adult with mid income           1927981
Single mid-age adult with high income          1629572
Younger parent with low income                 1470523
Single younger adult with high income          1340989
Head of household under 26 with mid income     1299726
Older parent with low income                    657483
Mid-age parent with low income                  511993
Single older adult with mid income              404480
Single mid-age adult with mid income            312589
Head of household under 26 with low income      166019
Head of household under 26 with high income      16866
Single younger adult with low income               361
Name: count, dtype: int64

In [86]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Younger adult with high income'), 'customer_profile'] = 'Younger parent with high income'

In [87]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Mid-age adult with high income'), 'customer_profile'] = 'Mid-age parent with high income'

In [88]:
df_all.loc[(df_all['marital_status'] == 'married')  & (df_all['income_class_and_age'] == 'Older adult with high income'), 'customer_profile'] = 'Older parent with high income'

In [89]:
# Check shape
df_all.shape

(30992664, 33)

In [90]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
Younger parent with mid income                 6574611
Older parent with mid income                   5735114
Mid-age parent with mid income                 4418739
NaN                                            3324536
Single younger adult with mid income           1927981
Single mid-age adult with high income          1629572
Younger parent with low income                 1470523
Single younger adult with high income          1340989
Head of household under 26 with mid income     1299726
Mid-age parent with high income                 907491
Older parent with low income                    657483
Mid-age parent with low income                  511993
Single older adult with mid income              404480
Single mid-age adult with mid income            312589
Younger parent with high income                 293591
Head of household under 26 with low income      166019
Head of household under 26 with high income      16866
Single younger adult with low income            

In [91]:
# Replese all NaN values with string 'Other customers'
df_all['customer_profile'] = df_all['customer_profile'].fillna('Other customers')

In [92]:
# Check value counts
df_all['customer_profile'].value_counts(dropna = False)

customer_profile
Younger parent with mid income                 6574611
Older parent with mid income                   5735114
Mid-age parent with mid income                 4418739
Other customers                                3324536
Single younger adult with mid income           1927981
Single mid-age adult with high income          1629572
Younger parent with low income                 1470523
Single younger adult with high income          1340989
Head of household under 26 with mid income     1299726
Mid-age parent with high income                 907491
Older parent with low income                    657483
Mid-age parent with low income                  511993
Single older adult with mid income              404480
Single mid-age adult with mid income            312589
Younger parent with high income                 293591
Head of household under 26 with low income      166019
Head of household under 26 with high income      16866
Single younger adult with low income            

# 3. Drop unnecessary columns

In [93]:
# Drop unnecessary column
df_all = df_all.drop ('income_class_and_age', axis = 1)

In [94]:
# Check shape
df_all.shape

(30992664, 32)

# 4. Export data

In [95]:
# Export df_all as pikle format
df_all.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'instacart_customer_profile.pkl'))