## Contents

1. [Importing Libraries](#1.-Importing-Libraries)     
	1.1. [Importing Dataframes](#1.1-Importing-Dataframes)      
2. [Wrangling the data](#2.-Wrangling-the-data)   
3. [Completing the fundamental data quality and consistency checks](#3.-Completing-the-fundamental-data-quality-and-consistency-checks)   
4. [Combining the customer data with the rest of the prepared Instacart data](#4.-Combining-the-customer-data-with-the-rest-of-the-prepared-Instacart-data)   
5. [Exporting the dataframe as a pickle file](#5.-Exporting-the-dataframe-as-a-pickle-file)    

## 1. Importing Libraries

In [2]:
# Importing Libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

### 1.1. Importing Dataframes

In [3]:
# Importing customer dataset
path = r'C:\Users\User 1\Documents\Instacart Basket Analysis 04-2023'

In [5]:
df_customer_dataset = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

## 2. Wrangling the data

In [6]:
# Previewing the first rows of the dataframe
df_customer_dataset.head(20)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [7]:
df_customer_dataset.tail(20)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
206189,193828,Russell,Travis,Male,North Carolina,46,4/1/2020,1,married,160483
206190,197067,Kathy,Bell,Female,Arizona,42,4/1/2020,0,single,114821
206191,7177,Russell,Zimmerman,Male,Mississippi,71,4/1/2020,3,married,64400
206192,61888,Joshua,Guerra,Male,New Jersey,37,4/1/2020,1,married,68491
206193,103412,Willie,Goodman,Male,Michigan,46,4/1/2020,2,married,154481
206194,189337,Shawn,Wood,Male,New Jersey,56,4/1/2020,1,married,40373
206195,205766,Amanda,Hodge,Female,Oregon,18,4/1/2020,2,living with parents and siblings,48510
206196,139950,Gloria,Murray,Female,Colorado,45,4/1/2020,2,married,150954
206197,74598,Christopher,Velazquez,Male,Minnesota,52,4/1/2020,0,single,140700
206198,83573,Gloria,Murray,Female,Michigan,28,4/1/2020,0,single,32237


In [8]:
df_customer_dataset.shape

(206209, 10)

#### Observations: the surname column is misspelt as "Surnam"; STATE is in capitals while the other column names are not; fam_status is an informal abbreviation that needs a clearer name; first name and last name can be concatenated.

In [12]:
# Renaming columns
df_customer_dataset.rename(columns = {'First Name' : 'first_name', 'Surnam' : 'last_name', 'Gender' : 'gender', 'STATE' : 'state', 'Age' : 'age', 'n_dependants' : 'number_of_dependants', 'fam_status' : 'family_status'}, inplace = True)

In [13]:
# Concatenating first name and last name 
df_customer_dataset['full_name'] = df_customer_dataset['first_name'] +' '+ df_customer_dataset['last_name']

In [14]:
df_customer_dataset.head(20)

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_of_dependants,family_status,income,full_name
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285,Patricia Hart
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568,Kenneth Farley
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049,Michelle Hicks
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374,Ann Gilmore
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643,Cynthia Noble
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746,Chris Walton
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712,Joseph Hickman
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432,Jeremy Vang
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072,Shawn Chung


#### Although number_of_dependants and family_status seem irrelevant at this stage, they should only be removed permanently after hypothesis testing confirms that they are not needed.

## 3. Completing the fundamental data quality and consistency checks

In [15]:
# Checking the descriptive statistics
df_customer_dataset.describe()

Unnamed: 0,user_id,age,number_of_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


In [17]:
# Checking the data types.
df_customer_dataset.dtypes

user_id                  int64
first_name              object
last_name               object
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependants     int64
family_status           object
income                   int64
full_name               object
dtype: object
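
The dtypes above show date_joined stored as object, meaning the dates are plain strings. If date arithmetic is needed later, the column could be parsed as sketched below. This is only an illustration on toy values; the month/day/year format is an assumption based on the values shown in the preview (for example 1/1/2017), and `df_demo` is a hypothetical stand-in for the customer dataframe.

```python
import pandas as pd

# Toy values in the same shape as the date_joined column (assumed month/day/year)
df_demo = pd.DataFrame({'date_joined': ['1/1/2017', '4/1/2020']})

# Parse the strings into datetime64 values using an explicit format
df_demo['date_joined'] = pd.to_datetime(df_demo['date_joined'], format='%m/%d/%Y')

# Confirm that the column is now a datetime column
is_datetime = pd.api.types.is_datetime64_any_dtype(df_demo['date_joined'])
print(is_datetime)
print(df_demo['date_joined'].dt.year.tolist())
```
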

In [23]:
# Checking for missing values
df_customer_dataset.isnull().sum()

user_id                     0
first_name              11259
last_name                   0
gender                      0
state                       0
age                         0
date_joined                 0
number_of_dependants        0
family_status               0
income                      0
full_name               11259
dtype: int64

#### Observation: 11,259 customers are missing a first name, so full_name is also missing for those rows.
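
The identical counts for first_name and full_name follow from how string concatenation treats missing values: any row with a missing first name produces a missing full name. The sketch below reproduces this on toy data and shows one possible fallback, which is to use the last name alone. The dataframe `df_demo` and the column `full_name_filled` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical, not taken from customers.csv)
df_demo = pd.DataFrame({
    'first_name': ['Deborah', np.nan, 'Kenneth'],
    'last_name': ['Esquivel', 'Hart', 'Farley'],
})

# Concatenating with a missing value yields a missing value
df_demo['full_name'] = df_demo['first_name'] + ' ' + df_demo['last_name']

# Possible fallback: use the last name alone where the full name is missing
df_demo['full_name_filled'] = df_demo['full_name'].fillna(df_demo['last_name'])

print(df_demo['full_name'].isnull().sum())
print(df_demo['full_name_filled'].tolist())
```
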

In [25]:
# Counting the first_name values, including missing ones
df_customer_dataset['first_name'].value_counts(dropna = False)

NaN        11259
Marilyn     2213
Barbara     2154
Todd        2113
Jeremy      2104
           ...  
Merry        197
Eugene       197
Garry        191
Ned          186
David        186
Name: first_name, Length: 208, dtype: int64

In [22]:
# check for columns that have mixed data types

for col in df_customer_dataset.columns.tolist():
    weird = (df_customer_dataset[[col]].applymap(type) != df_customer_dataset[[col]].iloc[0].apply(type)).any(axis=1)
    if len (df_customer_dataset[weird]) > 0:
        print (col)

first_name
full_name


#### The first_name and full_name columns are flagged as mixed because they hold both strings and NaN values, which pandas stores as floats. To standardise these columns and bypass the issue, we can cast them to string.
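
A per-column version of the same check can be written with `Series.map`, which avoids `applymap` (deprecated in newer pandas releases). The helper name `mixed_type_columns` and the toy dataframe are hypothetical. The sketch also shows an alternative to casting: depending on the pandas version, `astype('str')` may turn NaN into the literal string 'nan', which hides the missing values from later `isnull()` checks, so an explicit placeholder can be easier to trace.

```python
import numpy as np
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Toy rows (hypothetical): one missing first name, as in the customer data
df_demo = pd.DataFrame({
    'first_name': ['Ann', np.nan, 'Chris'],
    'age': [26, 43, 20],
})

before = mixed_type_columns(df_demo)

# Replace missing names with an explicit placeholder instead of casting
df_demo['first_name'] = df_demo['first_name'].fillna('Unknown')
after = mixed_type_columns(df_demo)

print(before)
print(after)
```
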

In [26]:
# assign first_name and full_name as string
df_customer_dataset['first_name'] = df_customer_dataset['first_name'].astype('str')

In [27]:
df_customer_dataset['full_name'] = df_customer_dataset['full_name'].astype('str')

In [29]:
# Checking to see if this has resolved the issue
for col in df_customer_dataset.columns.tolist():
    weird = (df_customer_dataset[[col]].applymap(type) != df_customer_dataset[[col]].iloc[0].apply(type)).any(axis=1)
    if len (df_customer_dataset[weird]) > 0:
        print (col)

#### Observation: There are no mixed data type columns.

In [31]:
# Checking for duplicates
# making a dataframe to check for duplicates
df_customer_dataset_dups = df_customer_dataset[df_customer_dataset.duplicated()]

In [32]:
df_customer_dataset_dups.shape

(0, 11)

#### Observation: There are no duplicates.
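
`duplicated()` without arguments only flags rows that match on every column. Since user_id is later used as the merge key, it may be worth checking that key on its own as well. The sketch below uses hypothetical rows to show how the two checks can differ; it is an illustration, not a result from the customer data.

```python
import pandas as pd

# Toy rows (hypothetical): the same user_id appears twice with different names
df_demo = pd.DataFrame({
    'user_id': [139950, 83573, 83573],
    'full_name': ['Gloria Murray', 'Gloria Murray', 'Gloria M.'],
})

# Full-row check: no two rows match on every column
full_row_dups = df_demo.duplicated().sum()

# Key-only check: the second occurrence of a repeated user_id is flagged
key_dups = df_demo.duplicated(subset=['user_id']).sum()

print(full_row_dups, key_dups)
print(df_demo['user_id'].is_unique)
```
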

In [33]:
# Dropping first_name and last_name, which are now redundant with full_name, and storing the result as a second dataframe
df_customer_dataset_2 = df_customer_dataset.drop(columns = ['first_name', 'last_name'])

In [34]:
df_customer_dataset_2.head()

Unnamed: 0,user_id,gender,state,age,date_joined,number_of_dependants,family_status,income,full_name
0,26711,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285,Patricia Hart
2,65803,Male,Idaho,35,1/1/2017,2,married,99568,Kenneth Farley
3,125935,Female,Iowa,40,1/1/2017,0,single,42049,Michelle Hicks
4,130797,Female,Maryland,26,1/1/2017,1,married,40374,Ann Gilmore


In [36]:
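# Re-ordering the columns for structure
# Note: reindex returns a new dataframe; the result is not assigned here, so df_customer_dataset_2 keeps its original column order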
df_customer_dataset_2.reindex(['user_id', 'full_name', 'gender', 'age', 'state', 'date_joined', 'number_of_dependants', 'family_status', 'income'], axis = 1)

Unnamed: 0,user_id,full_name,gender,age,state,date_joined,number_of_dependants,family_status,income
0,26711,Deborah Esquivel,Female,48,Missouri,1/1/2017,3,married,165665
1,33890,Patricia Hart,Female,36,New Mexico,1/1/2017,0,single,59285
2,65803,Kenneth Farley,Male,35,Idaho,1/1/2017,2,married,99568
3,125935,Michelle Hicks,Female,40,Iowa,1/1/2017,0,single,42049
4,130797,Ann Gilmore,Female,26,Maryland,1/1/2017,1,married,40374
...,...,...,...,...,...,...,...,...,...
206204,168073,Lisa Case,Female,44,North Carolina,4/1/2020,1,married,148828
206205,49635,Jeremy Robbins,Male,62,Hawaii,4/1/2020,3,married,168639
206206,135902,Doris Richmond,Female,66,Missouri,4/1/2020,2,married,53374
206207,81095,Rose Rollins,Female,27,California,4/1/2020,1,married,99799
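
`reindex` does not modify the dataframe in place, which is why the dtypes listing later in the notebook still shows full_name as the last column. A minimal sketch of the difference, using a hypothetical toy dataframe: the reordered result has to be assigned to a name to persist. Selecting with a list of column names is used here instead of `reindex` because it raises a KeyError on a misspelt name, whereas `reindex` would silently create an empty column.

```python
import pandas as pd

# Toy dataframe (hypothetical) with full_name in the last position
df_demo = pd.DataFrame({
    'user_id': [130797],
    'gender': ['Female'],
    'full_name': ['Ann Gilmore'],
})
new_order = ['user_id', 'full_name', 'gender']

# The reindexed copy is discarded, so df_demo is unchanged
df_demo.reindex(new_order, axis=1)
order_unchanged = df_demo.columns.tolist()

# Assigning the selection keeps the new order
df_reordered = df_demo[new_order]

print(order_unchanged)
print(df_reordered.columns.tolist())
```
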


## 4. Combining the customer data with the rest of the prepared Instacart data

In [40]:
# Importing df_ords_prods_merged_dataset to combine with df_customer_dataset_2
path = r'C:\Users\User 1\Documents\Instacart Basket Analysis 04-2023'

In [41]:
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_aggregated.pkl'))

In [42]:
# Checking the shape and head of the imported dataframe
df_ords_prods_merged.shape

(32404859, 24)

In [43]:
df_ords_prods_merged.head(20)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_range_loc,busiest_day,Busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_frequency,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,Mid-range product,Regularly busy,Regular days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
5,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Regular days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
6,550135,1,7,1,9,20.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
7,3108588,1,8,1,14,14.0,196,2,1,Soda,...,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
8,2295261,1,9,1,16,0.0,196,4,1,Soda,...,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
9,2550362,1,10,4,8,30.0,196,1,1,Soda,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


#### Observation: user_id is the only column shared by both dataframes, so it serves as the merge key.

In [44]:
# Checking the data types of both dataframes
df_customer_dataset_2.dtypes

user_id                  int64
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependants     int64
family_status           object
income                   int64
full_name               object
dtype: object

In [45]:
df_ords_prods_merged.dtypes

order_id                     int64
user_id                      int64
order_number                 int64
orders_day_of_week           int64
order_hour_of_day            int64
days_since_prior_order     float64
product_id                   int64
add_to_cart_order            int64
reordered                    int64
product_name                object
aisle_id                     int64
department_id                int64
prices                     float64
_merge                    category
price_range_loc             object
busiest_day                 object
Busiest_days                object
busiest_period_of_day       object
max_order                    int64
loyalty_flag                object
avg_price                  float64
spending_flag               object
median_frequency           float64
order_frequency_flag        object
dtype: object

#### Observation: the merge key, user_id, has the same data type (int64) in both dataframes.
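
The key check can also be done in code, and `merge` accepts a `validate` argument that raises an error if the relationship between the two tables is not the expected one. The sketch below assumes one row per customer and many order rows per customer; the toy dataframes `customers` and `orders` are hypothetical stand-ins for the two notebook dataframes.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the customer and order dataframes
customers = pd.DataFrame({'user_id': [1, 2], 'state': ['Iowa', 'Texas']})
orders = pd.DataFrame({'user_id': [1, 1, 2], 'order_id': [10, 11, 12]})

# Compare the key dtype on both sides before merging
same_key_dtype = customers['user_id'].dtype == orders['user_id'].dtype

# validate='one_to_many' raises MergeError if user_id repeats in customers
merged = customers.merge(orders, on='user_id', validate='one_to_many')

print(same_key_dtype)
print(merged.shape)
```
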

In [49]:
# Merging the two dataframes on user_id
df_combined = df_customer_dataset_2.merge(df_ords_prods_merged, on = 'user_id')

In [50]:
# Checking the new dataframe
df_combined.head()

Unnamed: 0,user_id,gender,state,age,date_joined,number_of_dependants,family_status,income,full_name,order_id,...,price_range_loc,busiest_day,Busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_frequency,order_frequency_flag
0,26711,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel,2543867,...,Mid-range product,Regularly busy,Busiest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
1,26711,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel,1285508,...,Mid-range product,Regularly busy,Regular days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
2,26711,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel,2578584,...,Mid-range product,Regularly busy,Busiest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
3,26711,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel,423547,...,Mid-range product,Regularly busy,Regular days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
4,26711,Female,Missouri,48,1/1/2017,3,married,165665,Deborah Esquivel,2524893,...,Mid-range product,Regularly busy,Slowest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer


In [51]:
df_combined['_merge'].value_counts()

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64
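
Note that the merge in this section does not pass `indicator`, and `_merge` already appears in the dtypes of df_ords_prods_merged, so the counts above appear to describe an earlier merge and not this one. One way to get a flag for the current merge, sketched on hypothetical toy data, is to drop the carried-over column first and then merge with `indicator=True`. The outer join is an assumption made so that unmatched rows on either side would show up in the counts.

```python
import pandas as pd

# Toy stand-ins (hypothetical); orders carries a flag from an earlier merge
customers = pd.DataFrame({'user_id': [1, 2], 'state': ['Iowa', 'Texas']})
orders = pd.DataFrame({
    'user_id': [1, 1, 2],
    'order_id': [10, 11, 12],
    '_merge': ['both', 'both', 'both'],
})

# Drop the stale flag; pandas refuses indicator=True if '_merge' already exists
orders_clean = orders.drop(columns=['_merge'])

# Merge with a fresh indicator that describes this merge only
merged = customers.merge(orders_clean, on='user_id', how='outer', indicator=True)

print(merged['_merge'].value_counts())
```
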

In [52]:
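# Re-ordering the columns for structure
# Note: the result is not assigned, so df_combined keeps its original column order; the list also omits aisle_id, department_id and prices, which would be dropped if it were assigned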
df_combined.reindex(['user_id', 'full_name', 'gender', 'age', 'state', 'date_joined', 'number_of_dependants', 'family_status', 'income', 'order_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day', 'days_since_prior_order', 'product_id', 'add_to_cart_order', 'reordered', 'product_name', 'price_range_loc', 'busiest_day', 'Busiest_days', 'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_price', 'spending_flag', 'median_frequency', 'order_frequency_flag', '_merge'], axis = 1)

Unnamed: 0,user_id,full_name,gender,age,state,date_joined,number_of_dependants,family_status,income,order_id,...,busiest_day,Busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_frequency,order_frequency_flag,_merge
0,26711,Deborah Esquivel,Female,48,Missouri,1/1/2017,3,married,165665,2543867,...,Regularly busy,Busiest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer,both
1,26711,Deborah Esquivel,Female,48,Missouri,1/1/2017,3,married,165665,1285508,...,Regularly busy,Regular days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer,both
2,26711,Deborah Esquivel,Female,48,Missouri,1/1/2017,3,married,165665,2578584,...,Regularly busy,Busiest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer,both
3,26711,Deborah Esquivel,Female,48,Missouri,1/1/2017,3,married,165665,423547,...,Regularly busy,Regular days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer,both
4,26711,Deborah Esquivel,Female,48,Missouri,1/1/2017,3,married,165665,2524893,...,Regularly busy,Slowest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32404854,80148,Cynthia Noble,Female,55,New York,4/1/2020,1,married,57095,2859858,...,Regularly busy,Regular days,Most orders,4,New customer,3.886667,Low spender,12.0,Regular customer,both
32404855,80148,Cynthia Noble,Female,55,New York,4/1/2020,1,married,57095,2859858,...,Regularly busy,Regular days,Most orders,4,New customer,3.886667,Low spender,12.0,Regular customer,both
32404856,80148,Cynthia Noble,Female,55,New York,4/1/2020,1,married,57095,3209855,...,Regularly busy,Regular days,Most orders,4,New customer,3.886667,Low spender,12.0,Regular customer,both
32404857,80148,Cynthia Noble,Female,55,New York,4/1/2020,1,married,57095,2859858,...,Regularly busy,Regular days,Most orders,4,New customer,3.886667,Low spender,12.0,Regular customer,both


## 5. Exporting the dataframe as a pickle file

In [53]:
# Exporting the dataframe
df_combined.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'df_combined.pkl'))
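
After an export of this size it can be useful to confirm that the file reads back unchanged. The sketch below shows such a round-trip check on a small toy dataframe written to a temporary directory; the file name `df_demo.pkl` and the dataframe are hypothetical, and the same pattern would apply to df_combined.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe (hypothetical) standing in for df_combined
df_demo = pd.DataFrame({'user_id': [1, 2], 'income': [59285, 99568]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'df_demo.pkl')
    df_demo.to_pickle(out_path)          # write the pickle file
    df_check = pd.read_pickle(out_path)  # read it back

# equals() compares shape, values and dtypes
round_trip_ok = df_demo.equals(df_check)
print(round_trip_ok)
```
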