# Project: 12-2023 Instacart Basket Analysis
## Author: Nadia Ordonez
## Step 7 IC Customer profiling

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing data](#2.-Importing-data)
    * [2.1 Importing libraries](#2.1-Importing-libraries)
    * [2.2 Importing data](#2.2-Importing-data)
* [3. Age profile](#3.-Age-profile)
* [4. Dependents profile](#4.-Dependents-profile)
* [5. Income profile](#5.-Income-profile)
* [6. Customer profiles](#6.-Customer-profiles)
* [7. Exporting data](#7.-Exporting-data) 

# 1. Introduction

* The marketing and business strategy units at Instacart want to create more relevant marketing strategies for different products and are therefore interested in profiling the customers in their database. After discussions with the Instacart team, it was agreed to create a customer profile based on age, income, and number of dependents. Here, number of dependents refers to the number of children per user. The following variables are created:
    * For the "age profile" variable three categories will be created: "Young adult", "Adult" and "Retired".
    * For the "dependents profile" variable two categories will be created: "No kids" and "With kids".
    * For the "income profile" variable three categories will be created: "Low income", "Middle income" and "High income".
* The above classifications will be combined to generate 18 different customer profiles, ranging from "Young adult, No kids, Low income" to "Retired, With kids, High income".
* NOTE: After deliberations with stakeholders, it was agreed to evaluate the frequency of users taking into account the volume of orders per user, instead of counting each user once. In the current dataframe, a single user_id is repeated several times in the "user_id" variable, depending on how many products were purchased within their orders.
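
The difference between the two ways of counting can be illustrated with a minimal sketch. The toy dataframe below is hypothetical and only mimics the structure described in the note (one row per purchased product, so `user_id` repeats); it is not taken from the Instacart data.

```python
import pandas as pd

# Hypothetical toy data: one row per purchased product, so user_id repeats
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "age_profile": ["Adult", "Adult", "Adult", "Retired", "Adult", "Adult"],
})

# Row-level frequency: each user is weighted by the number of rows they generate
row_counts = toy["age_profile"].value_counts()

# User-level frequency: each user is counted exactly once
user_counts = toy.drop_duplicates("user_id")["age_profile"].value_counts()

print(row_counts.to_dict())
print(user_counts.to_dict())
```

Row-level counts are what `value_counts()` returns on the full dataframe throughout this notebook, so heavy buyers contribute more to every frequency reported here.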

# 2. Importing data

## 2.1 Importing libraries

In [1]:
#Import analytical libraries
import pandas as pd
import numpy as np
import os

## 2.2 Importing data

In [2]:
#Project folder path into a string to easily retrieve data
path = r'C:\Users\Ich\Documents\12-2023 Instacart Basket Analysis'

### Order products der

In [5]:
#Import “orders_products_der_step6.pkl”
#See "Step 5 IC Orders products final" to check for merging details
orders_products_der = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_der_step6.pkl'))

In [6]:
#Check df size
orders_products_der.shape

(30992966, 30)

In [7]:
#Check headers
orders_products_der.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,max_order_number,price_range_products,busiest_days,busiest_period_of_day,region,loyalty_flag,mean_price_per_user,type_of_spender,median_days_since_prior_order,usage_frequency
0,2539329,1,1,2,8,,196,1,0,Soda,...,10,mid-range,Regular days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Most orders,South,New,6.367797,Low spender,20.5,Non-frequent customer


In [8]:
#Check data types
orders_products_der.dtypes

order_id                            int32
user_id                             int32
order_number                         int8
orders_day_of_week                   int8
order_hour_of_day                    int8
days_since_prior_order            float64
product_id                          int32
add_to_cart_sequence                int32
reordered                            int8
product_name                       object
aisle_id                             int8
department_id                        int8
prices                            float64
gender                           category
state                            category
age                                 int32
date_joined                        object
number_of_dependants                int32
family_status                    category
income                              int64
max_order_number                     int8
price_range_products               object
busiest_days                       object
busiest_period_of_day             

In [9]:
#Create a subset with only the columns needed for profiling to avoid RAM issues
variables = ['order_id', 'user_id', 'days_since_prior_order', 'department_id', 'prices', 'age', 'number_of_dependants', 'income', 'region']
orders_products_der = orders_products_der.loc[:, variables]
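
The effect of column selection on memory can be checked with `DataFrame.memory_usage`. This is a small sketch on a hypothetical toy dataframe (the column values are made up); the same two calls could be run on the real dataframe before and after subsetting.

```python
import pandas as pd

# Hypothetical toy data standing in for the full orders/products dataframe
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 11],
    "product_name": ["Soda", "Milk", "Bread"],
    "prices": [9.0, 4.5, 3.2],
})

# Total memory in bytes, counting the real size of string (object) columns
before = toy.memory_usage(deep=True).sum()

# Keep only the columns needed downstream
keep = ["order_id", "user_id", "prices"]
subset = toy.loc[:, keep]
after = subset.memory_usage(deep=True).sum()

print(before, after)
```

Dropping text columns tends to save the most, because object columns store a full Python string per row.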

In [10]:
#Check df size
orders_products_der.shape

(30992966, 9)

In [11]:
#Check headers
orders_products_der.head()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,age,number_of_dependants,income,region
0,2539329,1,,7,9.0,31,3,40423,South
1,2398795,1,15.0,7,9.0,31,3,40423,South
2,473747,1,21.0,7,9.0,31,3,40423,South
3,2254736,1,29.0,7,9.0,31,3,40423,South
4,431534,1,28.0,7,9.0,31,3,40423,South


# 3. Age profile

* For the "age profile" variable three categories will be created: "Young adult", "Adult" and "Retired":
    * If the user is younger than 26, they will be labeled as “Young adult”.
    * If the user is 26 or older and younger than 65, they will be labeled as “Adult”.
    * Users 65 and older will be labeled as "Retired".
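
As an alternative to separate `.loc` assignments, the three age rules above can be sketched in one step with `pandas.cut`. The sample ages are hypothetical boundary values chosen to exercise each rule; `right=False` makes every bin include its lower edge and exclude its upper edge.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, including the boundary values 25/26 and 64/65
ages = pd.Series([18, 25, 26, 64, 65, 81])

# Bins are [0, 26), [26, 65), [65, inf) because right=False
age_profile = pd.cut(
    ages,
    bins=[0, 26, 65, np.inf],
    right=False,
    labels=["Young adult", "Adult", "Retired"],
)

print(age_profile.tolist())
```

`pd.cut` returns a categorical column, which also uses less memory than plain strings on a dataframe with millions of rows.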

## Variable evaluation

The "age" variable is used to create the "age_profile" labels.

In [12]:
#Evaluate "age"
orders_products_der['age'].describe()
#User age ranges from 18 to 81 years old

count    3.099297e+07
mean     4.946787e+01
std      1.848520e+01
min      1.800000e+01
25%      3.300000e+01
50%      4.900000e+01
75%      6.500000e+01
max      8.100000e+01
Name: age, dtype: float64

## Conditions

In [13]:
#If the user is younger than 26, then user will be labeled as “Young adult”
orders_products_der.loc[orders_products_der['age'] < 26, 'age_profile'] = 'Young adult'

In [14]:
#If the user is 26 or older and younger than 65, they will be labeled as “Adult”
orders_products_der.loc[(orders_products_der['age'] < 65) & (orders_products_der['age'] >= 26), 'age_profile'] = 'Adult' 

In [15]:
#Users 65 and older will be labeled as "Retired"
orders_products_der.loc[orders_products_der['age'] >= 65, 'age_profile'] = 'Retired'

## Output evaluation

In [16]:
#Count values
orders_products_der['age_profile'].value_counts(dropna = False)
#Most records belong to adults (26 or older and younger than 65)

age_profile
Adult          18921994
Retired         8202766
Young adult     3868206
Name: count, dtype: int64

In [17]:
#Check flags
orders_products_der.groupby('age_profile').agg({'age' : ['min', 'max']})
#nothing odd on the results

age_profile,age min,age max
Adult,26,64
Retired,65,81
Young adult,18,25


In [18]:
#Check results
orders_products_der[['age', 'age_profile']].head()
#nothing odd on the results

Unnamed: 0,age,age_profile
0,31,Adult
1,31,Adult
2,31,Adult
3,31,Adult
4,31,Adult


In [19]:
#Check results
orders_products_der[['age', 'age_profile']].tail()
#nothing odd on the results

Unnamed: 0,age,age_profile
32434480,25,Young adult
32434481,25,Young adult
32434482,25,Young adult
32434483,25,Young adult
32434484,25,Young adult


The age profiles were correctly assigned to all users. Approximately 61.05% of the records (18,921,994 of 30,992,966) fall into the "Adult" category.
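
Category shares like the one quoted above can be computed directly rather than by hand. A minimal sketch on hypothetical labels, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be the 'age_profile' column
labels = pd.Series(["Adult", "Adult", "Adult", "Retired", "Young adult"])

# normalize=True returns fractions that sum to 1; convert to percentages
shares = labels.value_counts(normalize=True).mul(100).round(2)

print(shares.to_dict())
```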

# 4. Dependents profile

* For the "dependents_profile" variable two categories will be created: "No kids" and "With kids".
    * If the user does not have dependents (=0), they will be labeled as “No kids”.
    * If the user has 1 or more dependents, they will be labeled as “With kids”.
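
The two rules above can also be expressed without a Python-level loop. The sketch below uses `numpy.where` on a hypothetical sample column; on tens of millions of rows a vectorized expression like this is usually much faster than appending to a list row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the 'number_of_dependants' column
dependants = pd.Series([3, 0, 1, 2, 0], name="number_of_dependants")

# Vectorized rule: 0 dependants -> "No kids", anything else -> "With kids"
dependents_profile = pd.Series(
    np.where(dependants == 0, "No kids", "With kids"),
    index=dependants.index,
)

print(dependents_profile.tolist())
```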

## Variable evaluation

The "number_of_dependants" variable is used to create the "dependents_profile" labels.

In [20]:
#Evaluate "number_of_dependants"
orders_products_der['number_of_dependants'].describe()
#Users have between 0 and 3 kids

count    3.099297e+07
mean     1.501799e+00
std      1.118900e+00
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      3.000000e+00
Name: number_of_dependants, dtype: float64

In [21]:
#Print the frequency of each value in the column to see which values appear most often
orders_products_der['number_of_dependants'].value_counts(dropna = False)
#Most users have at least 1 kid 

number_of_dependants
3    7779460
0    7747091
2    7740423
1    7725992
Name: count, dtype: int64

## Conditions

In [22]:
#The following conditions are applied:
#If the user does not have dependents (=0), they will be labeled as “No kids”.
#If the user has 1 or more dependents, they will be labeled as “With kids”.

result = []

for value in orders_products_der["number_of_dependants"]:
  if value == 0:
    result.append("No kids")
  else:
    result.append("With kids")

In [23]:
#See results
result

['With kids',
 'With kids',
 'With kids',
 ...
 'No kids',
 'No kids',
 'No kids',
 ...]

In [24]:
#Add the results to the dataframe
orders_products_der['dependents_profile'] = result

## Output evaluation

In [25]:
#See results
orders_products_der.head()
#variable "dependents_profile" was added

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,age,number_of_dependants,income,region,age_profile,dependents_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids


In [26]:
#Counting values within new column
orders_products_der['dependents_profile'].value_counts(dropna = False)
#Nothing odd and matches with previous frequency

dependents_profile
With kids    23245875
No kids       7747091
Name: count, dtype: int64

In [27]:
#Check flags
orders_products_der.groupby('dependents_profile').agg({'number_of_dependants' : ['min', 'max']})
#nothing odd on the results

dependents_profile,number_of_dependants min,number_of_dependants max
No kids,0,0
With kids,1,3


In [28]:
#Check results
orders_products_der[['dependents_profile', 'number_of_dependants']].head()
#nothing odd on the results

Unnamed: 0,dependents_profile,number_of_dependants
0,With kids,3
1,With kids,3
2,With kids,3
3,With kids,3
4,With kids,3


In [29]:
#Check results
orders_products_der[['dependents_profile', 'number_of_dependants']].tail()
#nothing odd on the results

Unnamed: 0,dependents_profile,number_of_dependants
32434480,No kids,0
32434481,No kids,0
32434482,No kids,0
32434483,No kids,0
32434484,No kids,0


Labels were correctly assigned to users. Approximately 75% of the records (23,245,875 of 30,992,966) belong to users with at least 1 kid and up to 3 kids.

# 5. Income profile

* For the "income profile" variable three categories will be created: "Low income", "Middle income" and "High income".
    * Users with incomes lower than 60k per year will be labeled as "Low income".
    * Users with incomes of 60k or higher but less than 250k will be labeled as "Middle income".
    * Users with incomes of 250k or higher will be labeled as "High income".
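
A three-way rule like this can be sketched with `numpy.select`. The sample incomes below are hypothetical boundary values. Note that the treatment of exactly 60,000 depends on the comparison operator: this sketch uses a strict `< 60000` for the low band, as the loop in the Conditions cell of this section does, so 60,000 itself lands in "Middle income".

```python
import numpy as np
import pandas as pd

# Hypothetical sample incomes around the two thresholds
income = pd.Series([25903, 59999, 60000, 249904, 250190])

# Conditions are evaluated in order; rows matching none get the default
conditions = [income < 60000, income >= 250000]
choices = ["Low income", "High income"]

income_profile = pd.Series(
    np.select(conditions, choices, default="Middle income"),
    index=income.index,
)

print(income_profile.tolist())
```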

## Variable evaluation

The "income" variable is used to create the "income_profile" labels.

In [30]:
#Evaluate "income"
orders_products_der['income'].describe()
#User income ranges from about 26k to about 594k a year

count    3.099297e+07
mean     9.967341e+04
std      4.313976e+04
min      2.590300e+04
25%      6.728900e+04
50%      9.676500e+04
75%      1.281010e+05
max      5.939010e+05
Name: income, dtype: float64

## Conditions

In [31]:
#The following conditions are applied:
#Users with incomes lower than 60k per year will be labeled as "Low income".
#Users with incomes of 60k or higher but less than 250k will be labeled as "Middle income".
#Users with incomes of 250k or higher will be labeled as "High income".

result = []

for value in orders_products_der["income"]:
  if value < 60000:
    result.append("Low income")
  elif value >= 250000:
    result.append("High income")
  else:
    result.append("Middle income")

In [32]:
#See results
result

['Low income',
 'Low income',
 'Low income',
 ...]

In [33]:
#Adding the results to the df
orders_products_der['income_profile'] = result

## Output evaluation

In [34]:
#See results
orders_products_der.head()
#variable "income_profile" was added

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,age,number_of_dependants,income,region,age_profile,dependents_profile,income_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids,Low income
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids,Low income
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids,Low income
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids,Low income
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids,Low income


In [35]:
#Counting values within new column
orders_products_der['income_profile'].value_counts(dropna = False)

income_profile
Middle income    25004498
Low income        5808573
High income        179895
Name: count, dtype: int64

In [36]:
#Check flags
orders_products_der.groupby('income_profile').agg({'income' : ['min', 'max']})
#nothing odd on the results

income_profile,income min,income max
High income,250190,593901
Low income,25903,59999
Middle income,60000,249904


In [37]:
#Check results
orders_products_der[['income_profile', 'income']].head()
#nothing odd on the results

Unnamed: 0,income_profile,income
0,Low income,40423
1,Low income,40423
2,Low income,40423
3,Low income,40423
4,Low income,40423


In [40]:
#Check results
orders_products_der[['income_profile', 'income']].tail()
#nothing odd on the results

Unnamed: 0,income_profile,income
32434480,Low income,53755
32434481,Low income,53755
32434482,Low income,53755
32434483,Low income,53755
32434484,Low income,53755


The income profile was correctly assigned to all users. Most records (about 80.7%, or 25,004,498 of 30,992,966) belong to users with an income of at least 60k but less than 250k a year.

# 6. Customer profiles

Here, 18 different customer profiles are generated by combining the previously created age, dependents and income profiles. Customer profiles range from "Young adult_No kids_Low income" to "Retired_With kids_High income".
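
The count of 18 follows from 3 age categories × 2 dependents categories × 3 income categories. A short sketch that enumerates every possible label with `itertools.product`, joined with the same `_` separator used in this section:

```python
from itertools import product

age_levels = ["Young adult", "Adult", "Retired"]
dependents_levels = ["No kids", "With kids"]
income_levels = ["Low income", "Middle income", "High income"]

# Every combination of one level from each variable, joined with "_"
profiles = ["_".join(combo)
            for combo in product(age_levels, dependents_levels, income_levels)]

print(len(profiles))
print(profiles[0], "|", profiles[-1])
```

Comparing this list against the labels actually present in the data is a quick way to confirm that no combination is missing or misspelled.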

In [41]:
#Customer profiling based on created variables
orders_products_der['customer_profile'] = orders_products_der['age_profile'] + '_' + orders_products_der['dependents_profile'] + '_' + orders_products_der['income_profile']

In [43]:
#See results
orders_products_der.head()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,age,number_of_dependants,income,region,age_profile,dependents_profile,income_profile,customer_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income


In [44]:
#See results
orders_products_der.tail()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,age,number_of_dependants,income,region,age_profile,dependents_profile,income_profile,customer_profile
32434480,3308056,106143,10.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434481,2988973,106143,5.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434482,930,106143,4.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434483,467253,106143,7.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434484,156685,106143,5.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income


In [45]:
#Counting values within new column
orders_products_der['customer_profile'].value_counts(dropna = False)
#As expected 18 profiles were created
#Adults with kids and a middle income are our largest customer profile

customer_profile
Adult_With kids_Middle income          11328954
Retired_With kids_Middle income         5442930
Adult_No kids_Middle income             3812244
Adult_With kids_Low income              2740559
Young adult_With kids_Middle income     1980497
Retired_No kids_Middle income           1796557
Adult_No kids_Low income                 926872
Young adult_With kids_Low income         926080
Retired_With kids_Low income             694263
Young adult_No kids_Middle income        643316
Young adult_No kids_Low income           307579
Retired_No kids_Low income               213220
Adult_With kids_High income               83414
Retired_With kids_High income             41653
Adult_No kids_High income                 29951
Retired_No kids_High income               14143
Young adult_With kids_High income          7525
Young adult_No kids_High income            3209
Name: count, dtype: int64

## Customer profile expenditure

Here, customer profiles will be examined in detail by calculating the mean, max and min of the prices of the products they purchased.
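
There are two common ways to obtain such per-group statistics in pandas, sketched below on hypothetical toy data: `agg` returns one summary row per profile, while `transform` broadcasts the group statistic back to every original row so it can be stored as a new column.

```python
import pandas as pd

# Hypothetical toy data with two profiles
toy = pd.DataFrame({
    "customer_profile": ["A", "A", "B", "B", "B"],
    "prices": [1.0, 3.0, 2.0, 4.0, 9.0],
})

# One row per profile: a compact summary table
summary = toy.groupby("customer_profile")["prices"].agg(["mean", "max", "min"])

# Same length as the original dataframe: the group mean repeated on each row
toy["mean_price_per_customer_profile"] = (
    toy.groupby("customer_profile")["prices"].transform("mean")
)

print(summary)
print(toy)
```

Passing the statistic as a string (`"mean"`, `"max"`, `"min"`) uses pandas' built-in group aggregations.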

In [46]:
#Calculating the average price of products purchased per customer profile = ("mean_price_per_customer_profile")
#The aggregation is passed as the string 'mean' instead of np.mean to avoid the pandas FutureWarning
orders_products_der['mean_price_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['prices'].transform('mean')

In [47]:
#Calculating the max price of products purchased per customer profile = ("max_price_per_customer_profile")
orders_products_der['max_price_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['prices'].transform('max')

In [48]:
#Calculating the min price of products purchased per customer profile = ("min_price_per_customer_profile")
orders_products_der['min_price_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['prices'].transform('min')

In [49]:
#Check flags
orders_products_der.groupby('customer_profile').agg({'prices' : ['min', 'max', 'mean']})

customer_profile,prices min,prices max,prices mean
Adult_No kids_High income,1.0,25.0,7.844429
Adult_No kids_Low income,1.0,25.0,7.110357
Adult_No kids_Middle income,1.0,25.0,7.958333
Adult_With kids_High income,1.0,25.0,7.772997
Adult_With kids_Low income,1.0,25.0,7.116217
Adult_With kids_Middle income,1.0,25.0,7.958375
Retired_No kids_High income,1.0,25.0,7.728465
Retired_No kids_Low income,1.0,25.0,6.512758
Retired_No kids_Middle income,1.0,25.0,7.961454
Retired_With kids_High income,1.0,25.0,7.764586


In [50]:
#See results
orders_products_der[['customer_profile', 'mean_price_per_customer_profile', 'max_price_per_customer_profile', 'min_price_per_customer_profile', 'prices']].head()

Unnamed: 0,customer_profile,mean_price_per_customer_profile,max_price_per_customer_profile,min_price_per_customer_profile,prices
0,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
1,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
2,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
3,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
4,Adult_With kids_Low income,7.116217,25.0,1.0,9.0


Retired users with a low income, with or without kids, spend on average less per product than the other customer profiles (around 6.5 dollars). The six customer profiles that spend the most per product (around 7.9 dollars) are all middle income, with or without kids, across all three age profiles: young adults, adults and retired.

## Customer profile usage frequency

Here, customer profiles will be examined in detail by calculating the mean, max and min of "days_since_prior_order" to evaluate the frequency at which each customer profile is placing orders in the Instacart app.
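
One detail worth keeping in mind: "days_since_prior_order" is missing (NaN) for a user's first order, as the first row of the dataframe preview shows. The sketch below, on hypothetical toy data, illustrates that pandas group means skip NaN by default, so first orders do not distort the average gap. A lower mean indicates more frequent ordering.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN marks a first order with no prior order
toy = pd.DataFrame({
    "customer_profile": ["A", "A", "A", "B", "B"],
    "days_since_prior_order": [np.nan, 10.0, 20.0, np.nan, 7.0],
})

# NaN values are excluded from the mean by default
mean_gap = toy.groupby("customer_profile")["days_since_prior_order"].mean()

# Number of rows without a prior order
n_first_orders = int(toy["days_since_prior_order"].isna().sum())

print(mean_gap.to_dict(), n_first_orders)
```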

In [51]:
#Calculating the mean usage frequency per customer profile = ("mean_usage_per_customer_profile")
orders_products_der['mean_usage_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['days_since_prior_order'].transform('mean')

In [52]:
#Calculating the max usage frequency per customer profile = ("max_usage_per_customer_profile")
orders_products_der['max_usage_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['days_since_prior_order'].transform('max')

In [53]:
#Calculating the min usage frequency per customer profile = ("min_usage_per_customer_profile")
orders_products_der['min_usage_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['days_since_prior_order'].transform('min')

In [54]:
#Check flags
orders_products_der.groupby('customer_profile').agg({'days_since_prior_order' : ['min', 'max', 'mean']})

customer_profile,days_since_prior_order min,days_since_prior_order max,days_since_prior_order mean
Adult_No kids_High income,0.0,30.0,9.552416
Adult_No kids_Low income,0.0,30.0,11.016815
Adult_No kids_Middle income,0.0,30.0,10.717566
Adult_With kids_High income,0.0,30.0,10.363265
Adult_With kids_Low income,0.0,30.0,11.045855
Adult_With kids_Middle income,0.0,30.0,10.791625
Retired_No kids_High income,0.0,30.0,9.633782
Retired_No kids_Low income,0.0,30.0,11.396369
Retired_No kids_Middle income,0.0,30.0,10.703718
Retired_With kids_High income,0.0,30.0,10.280375


In [55]:
#See results
orders_products_der[['customer_profile', 'mean_usage_per_customer_profile', 'max_usage_per_customer_profile', 'min_usage_per_customer_profile', 'days_since_prior_order']].head()

Unnamed: 0,customer_profile,mean_usage_per_customer_profile,max_usage_per_customer_profile,min_usage_per_customer_profile,days_since_prior_order
0,Adult_With kids_Low income,11.045855,30.0,0.0,
1,Adult_With kids_Low income,11.045855,30.0,0.0,15.0
2,Adult_With kids_Low income,11.045855,30.0,0.0,21.0
3,Adult_With kids_Low income,11.045855,30.0,0.0,29.0
4,Adult_With kids_Low income,11.045855,30.0,0.0,28.0


The three most frequent customer profiles (those with the lowest mean number of days between orders) all have high incomes: young adults with kids, and adults and retired users with no kids.

# 7. Exporting data

In [56]:
#See variables
orders_products_der.dtypes

order_id                             int32
user_id                              int32
days_since_prior_order             float64
department_id                         int8
prices                             float64
age                                  int32
number_of_dependants                 int32
income                               int64
region                              object
age_profile                         object
dependents_profile                  object
income_profile                      object
customer_profile                    object
mean_price_per_customer_profile    float64
max_price_per_customer_profile     float64
min_price_per_customer_profile     float64
mean_usage_per_customer_profile    float64
max_usage_per_customer_profile     float64
min_usage_per_customer_profile     float64
dtype: object

In [57]:
#Check headers
orders_products_der.head()
#10 new variables were added to the 9-column subset of the imported dataframe

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,age,number_of_dependants,income,region,age_profile,dependents_profile,income_profile,customer_profile,mean_price_per_customer_profile,max_price_per_customer_profile,min_price_per_customer_profile,mean_usage_per_customer_profile,max_usage_per_customer_profile,min_usage_per_customer_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income,7.116217,25.0,1.0,11.045855,30.0,0.0
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income,7.116217,25.0,1.0,11.045855,30.0,0.0
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income,7.116217,25.0,1.0,11.045855,30.0,0.0
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income,7.116217,25.0,1.0,11.045855,30.0,0.0
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income,7.116217,25.0,1.0,11.045855,30.0,0.0


In [58]:
#Check size before exporting
orders_products_der.shape
#10 new variables were added to the 9-column subset of the imported dataframe (9 -> 19 columns)
#the number of rows was not altered

(30992966, 19)

In [59]:
#Exporting to prepared data folder
#The pickle format is preferred for large df. This df contains 31M rows
orders_products_der.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_cust_step7.pkl'))
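
Before relying on an exported file in later steps, a quick round-trip check can confirm that it reads back unchanged. This is a minimal sketch on a hypothetical toy dataframe written to a temporary folder; the file name `toy_step7.pkl` is made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe standing in for the exported one
toy = pd.DataFrame({
    "user_id": [1, 2],
    "customer_profile": ["Adult_With kids_Low income",
                         "Young adult_No kids_Low income"],
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_step7.pkl")
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# Values and dtypes should survive the round trip
print(restored.equals(toy))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.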