# 4.9 (Task part 1) The ultimate dataframe
** **
## Table of contents:

1. Importing libraries <br>
2. Importing the customers dataframe <br>
3. Data wrangling
    - 3.1 Renaming columns
    - 3.2 Changing datatypes
    - 3.3 Dropping columns
4. Data consistency checks
    - 4.1 Preliminary check using describe()
    - 4.2 Mixed-type columns
    - 4.3 Checking missing values
    - 4.4 Checking for duplicates
5. The final merge
    - 5.1 Importing the orders products dataframe
    - 5.2 Preliminary checks
    - 5.3 Testing the merge
    - 5.4 Merging dataframes
    - 5.5 Final checks after merge
6. Exporting the new merged dataframe
** **

# 1. Importing libraries
** **

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing the customers dataframe
** **

In [2]:
# Creating a path variable for the folder
path = r'C:\Users\Simone\Desktop\Career Foundry\Esercizi modulo 5\Instacart basket analysis'

In [3]:
# Importing dataframe from csv file
df_custs = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'customers.csv'), index_col = False)

In [4]:
# Printing the first 10 rows
df_custs.head(10)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [5]:
# Checking the shape
df_custs.shape

(206209, 10)

<b> Observations: </b> <br>
This new dataframe contains data on Instacart customers and needs to be merged with the "orders products" dataframe. <br>
Before merging this new dataframe with the "orders products" one, we need to wrangle the data and perform some consistency checks.

# 3. Data wrangling
** **

## 3.1 Renaming columns

Some column names in this dataframe are not consistent with the previous dataframe. <br>
All names will be changed to lowercase, with underscores between words. <br>
A few names will also be changed for cosmetic reasons or because they are not intuitive or self-explanatory.

In [6]:
# Renaming some of the columns for consistency with the main dataframe
df_custs.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [7]:
df_custs.rename(columns = {'Surnam' : 'last_name'}, inplace = True)

In [8]:
df_custs.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [9]:
df_custs.rename(columns = {'STATE' : 'state'}, inplace = True)

In [10]:
df_custs.rename(columns = {'Age' : 'age'}, inplace = True)

In [11]:
df_custs.rename(columns = {'n_dependants' : 'number_dependants'}, inplace = True)

In [12]:
# Checking the head
df_custs.head(10)

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


Column names have been successfully changed.
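The six renames above could also be expressed as a single call with one mapping. The following is a minimal sketch on a toy dataframe (`df_demo` and `column_map` are illustrative names, not part of the notebook); the old and new column names are the ones used in this section.

```python
import pandas as pd

# Toy frame carrying the original column names (two rows taken from the head shown earlier)
df_demo = pd.DataFrame({
    'user_id': [26711, 33890],
    'First Name': ['Deborah', 'Patricia'],
    'Surnam': ['Esquivel', 'Hart'],
    'Gender': ['Female', 'Female'],
    'STATE': ['Missouri', 'New Mexico'],
    'Age': [48, 36],
    'n_dependants': [3, 0],
})

# One dictionary holds every old -> new pair, so a single rename covers them all
column_map = {
    'First Name': 'first_name',
    'Surnam': 'last_name',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'number_dependants',
}
df_demo = df_demo.rename(columns=column_map)

print(df_demo.columns.tolist())
```

Keeping the mapping in one place makes it easier to review the naming scheme at a glance.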

## 3.2 Changing datatypes

In [13]:
# Checking the datatypes
df_custs.dtypes

user_id               int64
first_name           object
last_name            object
gender               object
state                object
age                   int64
date_joined          object
number_dependants     int64
fam_status           object
income                int64
dtype: object

All datatypes are fine except user_id: it is an identifier rather than a quantity, so it should be stored as a string.

In [14]:
# Changing the datatype
df_custs['user_id'] = df_custs['user_id'].astype('str')

In [15]:
# Testing the change
df_custs.dtypes

user_id              object
first_name           object
last_name            object
gender               object
state                object
age                   int64
date_joined          object
number_dependants     int64
fam_status           object
income                int64
dtype: object

Change confirmed.

## 3.3 Dropping columns

I decided to keep all the columns. <br>
The requested analysis does not need information on name, gender or date joined. <br>
It only needs age, state, number of dependants, family status and income to analyze customer behaviour. <br>
I could drop 4 columns (first_name, last_name, gender, date_joined). However, once the dataframe is overwritten they could not be restored without re-importing the file. <br>
I will keep them in case they are needed for further analysis.
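If a slimmer dataframe were wanted later, one option is to assign the result of `drop()` to a new variable so the full dataframe stays available. This is only a sketch on toy data; `df_demo`, `cols_to_drop` and `df_slim` are hypothetical names.

```python
import pandas as pd

# Toy frame with a subset of the customer columns (one row from the head shown earlier)
df_demo = pd.DataFrame({
    'user_id': ['26711'],
    'first_name': ['Deborah'],
    'last_name': ['Esquivel'],
    'gender': ['Female'],
    'date_joined': ['1/1/2017'],
    'age': [48],
    'income': [165665],
})

# The four columns the analysis does not strictly need
cols_to_drop = ['first_name', 'last_name', 'gender', 'date_joined']

# drop() returns a new dataframe; df_demo itself is left untouched
df_slim = df_demo.drop(columns=cols_to_drop)

print(df_slim.columns.tolist())
print(df_demo.columns.tolist())
```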

# 4. Data consistency checks
** **

## 4.1 Preliminary check using describe()

In [16]:
# Using df.describe
df_custs.describe()

Unnamed: 0,age,number_dependants,income
count,206209.0,206209.0,206209.0
mean,49.501646,1.499823,94632.852548
std,18.480962,1.118433,42473.786988
min,18.0,0.0,25903.0
25%,33.0,0.0,59874.0
50%,49.0,1.0,93547.0
75%,66.0,3.0,124244.0
max,81.0,3.0,593901.0


<b> Observations: </b> <br>
No abnormalities stand out in the describe() output. <br>
The count is 206209 for every column (the same as the number of rows). The minimum age is 18 (presumably minors cannot buy through the app), the maximum is 81 and the mean is about 49.5.  <br>
The number of dependants ranges from 0 to 3. <br>
There is a large gap between the minimum income (25903) and the maximum (593901), but there is no way to verify whether the values are correct. <br> 
Presumably, people with a high income also use the Instacart app to buy goods.

In [17]:
# Checking the highest incomes, out of curiosity
df_custs.loc[df_custs['income'] > 200000]

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income
34,117740,Lisa,Sparks,Female,Oregon,55,1/1/2017,1,married,292759
434,159362,Tina,Shannon,Female,Missouri,74,1/3/2017,3,married,372334
818,15683,Michael,Carrillo,Male,New Mexico,40,1/5/2017,1,married,251211
979,200930,Charles,Nichols,Male,South Carolina,60,1/6/2017,1,married,300913
991,136298,Kevin,Ortega,Male,New Mexico,47,1/6/2017,3,married,433206
...,...,...,...,...,...,...,...,...,...,...
205730,173540,Margaret,Hale,Female,Missouri,53,3/30/2020,1,married,214185
205903,618,Harold,Mcclain,Male,Colorado,76,3/31/2020,0,divorced/widowed,206652
205939,150154,Anne,Dorsey,Female,Florida,59,3/31/2020,1,married,210332
206105,5519,Kathy,Daniel,Female,Georgia,78,4/1/2020,3,married,262610


<b> Observations: </b> <br>
Only around 1000 customers (out of roughly 206000) have an income higher than $200000. Nothing abnormal here.
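The filtered output is truncated, so the number of matching rows is easier to read from a count than from the table. A minimal sketch, assuming a toy `income` column (the values are a handful of incomes from the outputs above; `df_demo`, `n_high` and `share` are illustrative names):

```python
import pandas as pd

# Five incomes taken from the outputs above, used only for illustration
df_demo = pd.DataFrame({'income': [165665, 59285, 292759, 372334, 99568]})

# Boolean mask: True where the income exceeds the threshold
high_income = df_demo['income'] > 200000

# Summing a boolean mask counts the True values; the mean gives their share
n_high = int(high_income.sum())
share = float(high_income.mean())

print(n_high, share)
```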

## 4.2 Mixed-type columns

In [18]:
# Checking for mixed-types columns
for col in df_custs.columns.tolist():
  weird = (df_custs[[col]].applymap(type) != df_custs[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_custs[weird]) > 0:
    print (col)

first_name


<b> Observations: </b> <br>
first_name contains mixed-type data.
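The loop above reports which column is mixed, but not which types it contains. The sketch below lists the Python types found in each column of a toy dataframe. Using a missing value (`NaN`, a float) as the second type is an assumption made for the example, not something confirmed for customers.csv.

```python
import numpy as np
import pandas as pd

# Toy frame: first_name holds strings plus one missing value (a float NaN)
df_demo = pd.DataFrame({
    'first_name': ['Deborah', np.nan, 'Kenneth'],
    'last_name': ['Esquivel', 'Hart', 'Farley'],
})

# Collect the distinct Python type names of the values in each column
type_names = {
    col: sorted({type(value).__name__ for value in df_demo[col]})
    for col in df_demo.columns
}

# Keep only the columns that hold more than one type
mixed = {col: names for col, names in type_names.items() if len(names) > 1}

print(mixed)
```

Knowing the offending type helps decide whether a string conversion is appropriate or whether the values are really missing data.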

In [19]:
# Changing datatype of first_name
df_custs['first_name'] = df_custs['first_name'].astype('str')

In [20]:
# Testing the change
for col in df_custs.columns.tolist():
  weird = (df_custs[[col]].applymap(type) != df_custs[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_custs[weird]) > 0:
    print (col)

<b> Observations: </b> <br>
No output. The datatype of first_name has been changed and there are no more mixed-type columns.

## 4.3 Checking missing values

In [21]:
# Checking for missing values
df_custs.isnull().sum()

user_id              0
first_name           0
last_name            0
gender               0
state                0
age                  0
date_joined          0
number_dependants    0
fam_status           0
income               0
dtype: int64

<b> Observations: </b> <br>
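No missing values are reported. Note that first_name was cast with astype('str') in section 4.2; depending on the pandas version, that cast can store missing values as the literal text 'nan', in which case isnull() would no longer flag them.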
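A string cast applied before the missing-value check can change what that check sees, so it may be worth counting missing values before the cast and literal `'nan'` strings after it. This is a sketch on a toy series, assuming the mixed types came from `NaN` entries. The behaviour of `astype('str')` on `NaN` differs between pandas versions, so no fixed output is claimed here.

```python
import numpy as np
import pandas as pd

# Toy column with one genuinely missing name
names = pd.Series(['Deborah', np.nan, 'Kenneth'], dtype='object')

# Count missing values before any conversion
missing_before = int(names.isnull().sum())

# Cast to string, as done for first_name
as_text = names.astype('str')

# After the cast, a missing value is either still missing or stored as the text 'nan'
missing_after = int(as_text.isnull().sum())
literal_nan = int((as_text == 'nan').sum())

print(missing_before, missing_after, literal_nan)
```

If `literal_nan` is greater than zero, the zero reported by `isnull().sum()` for that column understates the real number of missing names.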

## 4.4 Checking for duplicates

In [22]:
# Creating a new test dataframe to check for duplicates
df_dups = df_custs[df_custs.duplicated()]

In [23]:
# Checking the output
df_dups

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income


<b> Observations: </b> <br>
No duplicates found.
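`duplicated()` with no arguments only flags rows that are identical in every column. Since user_id is the merge key, a check restricted to that column can be a useful complement. A minimal sketch on toy data (the repeated id is invented for the example):

```python
import pandas as pd

# Toy frame: the second and third rows share a user_id but differ in state
df_demo = pd.DataFrame({
    'user_id': ['26711', '33890', '33890'],
    'state': ['Missouri', 'New Mexico', 'Idaho'],
})

# Rows identical across all columns
full_row_dups = int(df_demo.duplicated().sum())

# Rows whose user_id has already appeared, whatever the other columns contain
key_dups = int(df_demo.duplicated(subset=['user_id']).sum())

print(full_row_dups, key_dups)
```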

# 5. The final merge
** **

Before the merge, we need to import our main dataframe (orders products) and conduct some preliminary checks.

## 5.1 Importing the orders products dataframe

In [24]:
# Importing the main dataframe
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02. Data', 'Prepared Data', 'orders_products_merged_derived_flags.pkl'))

## 5.2 Preliminary checks

In [25]:
# Checking the shape of main dataframe
df_ords_prods_merged.shape

(30328763, 23)

In [26]:
# Checking the head
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
5,550135,1,7,1,9,20.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
6,3108588,1,8,1,14,14.0,196,2,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
7,2295261,1,9,1,16,0.0,196,4,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
8,2550362,1,10,4,8,30.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
9,2968173,15,15,1,9,7.0,196,2,0,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99,Low spender,10.0,Frequent customer


It seems some columns are not visible.

In [27]:
# Changing the setting for max columns to: None
pd.set_option('display.max_columns', None)

In [28]:
# Checking the head again
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
5,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
6,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
7,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
8,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
9,2968173,15,15,1,9,7.0,196,2,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99,Low spender,10.0,Frequent customer


Now all 23 columns are visible.
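`pd.set_option` changes the display setting for the rest of the session. If the wider view is needed only once, `pd.option_context` is an alternative that restores the previous setting afterwards. A sketch with a hypothetical 30-column frame:

```python
import pandas as pd

# Hypothetical wide frame: one row, 30 columns
df_demo = pd.DataFrame([[0] * 30], columns=[f'col_{i}' for i in range(30)])

# Inside the block every column is rendered; outside it the old setting applies again
with pd.option_context('display.max_columns', None):
    wide = repr(df_demo)

setting_after = pd.get_option('display.max_columns')

print(wide)
print(setting_after)
```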

In [29]:
# Checking the shape of custs dataframe
df_custs.shape

(206209, 10)

In [30]:
# Checking the head
df_custs.head(10)

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


<b> Observations: </b> <br>
The two dataframes have different lengths, but they share a column (user_id) that can be used as a key. <br>
However, it is important that user_id has the same datatype in both dataframes.

In [31]:
# Checking the datatype of the user_id column of the main dataframe
df_ords_prods_merged['user_id'].dtype

dtype('O')

In [32]:
# Checking the datatype of the user_id column of the custs dataframe
df_custs['user_id'].dtype

dtype('O')

<b> Observations: </b> <br>
Both variables are objects (strings). All is ready for the merge.
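The same check can be written as a guard that aligns the key before merging. The sketch below uses two toy frames (`orders` and `customers` are illustrative names) in which the customer key starts out as an integer.

```python
import pandas as pd

# Toy frames: the key is text on one side and integer on the other
orders = pd.DataFrame({'user_id': ['1', '106143'], 'order_id': ['2398795', '467253']})
customers = pd.DataFrame({'user_id': [1, 106143], 'state': ['Alabama', 'Hawaii']})

# Align the customer key with the orders key if the dtypes differ
if customers['user_id'].dtype != orders['user_id'].dtype:
    customers['user_id'] = customers['user_id'].astype(str)

same_dtype = customers['user_id'].dtype == orders['user_id'].dtype

print(same_dtype, customers['user_id'].tolist())
```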

## 5.3 Testing the merge

In [33]:
# Testing the merge using pd.merge()
pd.merge(df_ords_prods_merged, df_custs, on = ['user_id'], indicator = True)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income,_merge
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30328758,467253,106143,25,6,16,7.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Regularly busy,Most orders,26,Regular customer,10.70,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755,both
30328759,156685,106143,26,4,23,5.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Least busy,Least busy,Average orders,26,Regular customer,10.70,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755,both
30328760,1561557,66343,2,1,11,30.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Busiest day,Most orders,4,New customer,8.10,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151,both
30328761,276317,66343,3,6,15,19.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Regularly busy,Most orders,4,New customer,8.10,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151,both


<b> Observations: </b> <br>
All the columns and the indicator have been merged successfully in the test. <br>
Proceeding with the actual merge, using an inner join.
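Each customer should appear once in the customers dataframe while appearing many times in the orders data. pandas can enforce that assumption through the `validate` argument of `merge`, which raises an error if the right-hand key is not unique. A minimal sketch on toy frames (the names `orders` and `customers` are illustrative):

```python
import pandas as pd

# Toy frames: several order rows per user, one customer row per user
orders = pd.DataFrame({
    'user_id': ['1', '1', '106143'],
    'product_name': ['Soda', 'Pistachios', 'Organic Raspberry Black Tea'],
})
customers = pd.DataFrame({
    'user_id': ['1', '106143'],
    'state': ['Alabama', 'Hawaii'],
})

# Inner join that also checks the key is unique on the customers side
merged = orders.merge(customers, on='user_id', how='inner', validate='many_to_one')

print(merged)
```

With a unique key on the right, an inner join can never add rows, so comparing the row count before and after shows whether any order rows were lost.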

## 5.4 Merging dataframes

In [34]:
df_ords_prods_all = df_ords_prods_merged.merge(df_custs, on = ['user_id'], indicator = True)

In [35]:
# Checking the output
df_ords_prods_all

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income,_merge
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30328758,467253,106143,25,6,16,7.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Regularly busy,Most orders,26,Regular customer,10.70,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755,both
30328759,156685,106143,26,4,23,5.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Least busy,Least busy,Average orders,26,Regular customer,10.70,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755,both
30328760,1561557,66343,2,1,11,30.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Busiest day,Most orders,4,New customer,8.10,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151,both
30328761,276317,66343,3,6,15,19.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Regularly busy,Most orders,4,New customer,8.10,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151,both


<b> Observations: </b> <br>
The new dataframe has the same number of rows (30328763) and 10 more columns (9 customer columns plus the _merge indicator). Merge successful.

In [36]:
# Checking the value_counts of the indicator
df_ords_prods_all['_merge'].value_counts()

both          30328763
left_only            0
right_only           0
Name: _merge, dtype: int64

<b> Observations: </b> <br>
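No errors during the merge. The indicator reads both for all 30328763 rows and the row count matches the main dataframe, so every order row found its customer.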
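An inner join keeps only the rows whose key exists on both sides, so its indicator can only ever read `both`. To surface keys that exist on one side only, one option is to run the test merge as an outer join. A sketch on toy frames, where one customer has no orders:

```python
import pandas as pd

# Toy frames: customer 66343 has no order rows in this example
orders = pd.DataFrame({
    'user_id': ['1', '1', '106143'],
    'product_name': ['Soda', 'Pistachios', 'Organic Raspberry Black Tea'],
})
customers = pd.DataFrame({
    'user_id': ['1', '106143', '66343'],
    'state': ['Alabama', 'Hawaii', 'Tennessee'],
})

# Outer join keeps unmatched rows from both sides and labels them in _merge
check = orders.merge(customers, on='user_id', how='outer', indicator=True)
counts = check['_merge'].value_counts()

print(counts)
```

Non-zero `left_only` or `right_only` counts in such a test would point to orders without a customer record, or customers without any orders.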

In [37]:
# Removing the indicator column
df_ords_prods_all = df_ords_prods_all.drop(columns = ['_merge'])

## 5.5 Final checks after merge

In [38]:
# Checking the head
df_ords_prods_all.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
5,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
6,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
7,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
8,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
9,2398795,1,2,3,7,15.0,10258,2,0,Pistachios,117,19,3.0,Low-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423


In [39]:
# Checking the tail
df_ords_prods_all.tail(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag,first_name,last_name,gender,state,age,date_joined,number_dependants,fam_status,income
30328753,3102310,106143,20,3,16,7.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Least busy,Most orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328754,1539810,106143,21,1,18,5.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Busiest day,Average orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328755,3308056,106143,22,4,20,10.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Least busy,Least busy,Average orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328756,2988973,106143,23,2,22,5.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Regularly busy,Average orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328757,930,106143,24,6,12,4.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Regularly busy,Most orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328758,467253,106143,25,6,16,7.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Regularly busy,Regularly busy,Most orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328759,156685,106143,26,4,23,5.0,19675,1,1,Organic Raspberry Black Tea,94,7,10.7,Mid-range product,Least busy,Least busy,Average orders,26,Regular customer,10.7,High spender,7.0,Frequent customer,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
30328760,1561557,66343,2,1,11,30.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Busiest day,Most orders,4,New customer,8.1,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151
30328761,276317,66343,3,6,15,19.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Regularly busy,Most orders,4,New customer,8.1,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151
30328762,2922475,66343,4,1,12,30.0,47210,1,1,Fresh Farmed Tilapia Fillet,15,12,8.1,Mid-range product,Regularly busy,Busiest day,Most orders,4,New customer,8.1,Low spender,30.0,Non-frequent customer,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151


In [40]:
# Checking the shape
df_ords_prods_all.shape

(30328763, 32)

In [41]:
# Checking the datatypes
df_ords_prods_all.dtypes

order_id                          object
user_id                           object
order_number                      object
orders_day_of_week                 int64
order_hour_of_creation             int64
days_since_prior_order           float64
product_id                        object
add_to_cart_order                  int64
reordered                          int64
product_name                      object
aisle_id                          object
department_id                     object
prices                           float64
price_label                       object
busiest_day                       object
busiest_days                      object
busiest_period_of_day             object
max_order                          int64
loyalty_flag                      object
avg_price                        float64
spending_flag                     object
median_days_since_prior_order    float64
order_frequency_flag              object
first_name                        object
last_name       

# 6. Exporting the new merged dataframe
** **

In [42]:
# Exporting the dataframe as a pickle file
df_ords_prods_all.to_pickle(os.path.join(path, '02. Data', 'Prepared Data', 'orders_products_all.pkl'))
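A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. The sketch below does this with a small toy frame in a temporary directory; the file name `orders_products_all_demo.pkl` is hypothetical, and on the full 30-million-row dataframe the round trip would take noticeably longer.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the merged dataframe
df_demo = pd.DataFrame({'user_id': ['1', '106143'], 'income': [40423, 53755]})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_products_all_demo.pkl')
    df_demo.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# equals() compares values and dtypes column by column
round_trip_ok = restored.equals(df_demo)

print(round_trip_ok)
```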