1) Download the customer data set and add it to your “Original Data” folder.

2) Create a new notebook in your “Scripts” folder for part 1 of this task.

3) Import your analysis libraries, as well as your new customer data set as a dataframe.

4) Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

5) Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

6) Combine your customer data with the rest of your prepared Instacart data. Tip: Make sure the key columns are of the same data type!


# -----Task----- 

# 3. Import Libraries and Customer Dataset

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Creating a path to the project folder
path = r'C:\Users\mmoss\20-12-2021 Instacart Basket Analysis'

In [3]:
# Importing the .csv file
df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

In [4]:
# Importing the prepared Instacart data
df_prepared = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', '4.8_ords_products_merged_derived_grouped.pkl'))

# 4. Wrangling the data

In [5]:
# Displaying the data to see the columns
df_customers.head(20)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


Columns to rename: First Name to first_name,
                   Surnam to surname,
                   Gender to gender,
                   STATE to state,
                   Age to age,
                   n_dependants to number_of_dependants,
                   fam_status to marital_status

In [6]:
# First Name
df_customers.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [7]:
# Surnam
df_customers.rename(columns = {'Surnam' : 'surname'}, inplace = True)

In [8]:
# Gender
df_customers.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [9]:
# STATE
df_customers.rename(columns = {'STATE' : 'state'}, inplace = True)

In [10]:
# Age
df_customers.rename(columns = {'Age' : 'age'}, inplace = True)

In [11]:
# n_dependants
df_customers.rename(columns = {'n_dependants' : 'number_of_dependants'}, inplace = True)

In [12]:
# fam_status
df_customers.rename(columns = {'fam_status' : 'marital_status'}, inplace = True)
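
The seven renames above could also be done in a single call by passing one dictionary to `rename`. The sketch below uses a small made-up frame (`df_demo` and its values are illustrative, not the real customer data) that has the same original column names.

```python
import pandas as pd

# Toy frame with the original column names from customers.csv (values are invented).
df_demo = pd.DataFrame({
    'user_id': [1, 2],
    'First Name': ['Ann', 'Bob'],
    'Surnam': ['Lee', 'Ray'],
    'Gender': ['Female', 'Male'],
    'STATE': ['Iowa', 'Texas'],
    'Age': [30, 41],
    'n_dependants': [0, 2],
    'fam_status': ['single', 'married'],
})

# One mapping of old name -> new name covers every column that needs renaming.
rename_map = {
    'First Name': 'first_name',
    'Surnam': 'surname',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'number_of_dependants',
    'fam_status': 'marital_status',
}

# rename returns a new dataframe; columns missing from the mapping are left untouched.
df_demo = df_demo.rename(columns=rename_map)
print(df_demo.columns.tolist())
```

Keeping the mapping in one dictionary makes it easier to review all the name changes at a glance.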

In [13]:
# Test it out
df_customers.head(20)

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


Success!

All columns look useful for the analysis, so none will be dropped.

# 5. Data quality and consistency checks

In [14]:
# Checking for missing values
df_customers.isnull().sum()

user_id                     0
first_name              11259
surname                     0
gender                      0
state                       0
age                         0
date_joined                 0
number_of_dependants        0
marital_status              0
income                      0
dtype: int64

In [15]:
# Creating a subset of just the rows with a missing first name
df_nan = df_customers[df_customers['first_name'].isnull()]

In [16]:
# Displaying it
df_nan

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income
53,76659,,Gilbert,Male,Colorado,26,1/1/2017,2,married,41709
73,13738,,Frost,Female,Louisiana,39,1/1/2017,0,single,82518
82,89996,,Dawson,Female,Oregon,52,1/1/2017,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1/1/2017,1,married,155673
105,29778,,Dawson,Female,Utah,63,1/1/2017,3,married,151819
...,...,...,...,...,...,...,...,...,...,...
206038,121317,,Melton,Male,Pennsylvania,28,3/31/2020,3,married,87783
206044,200799,,Copeland,Female,Hawaii,52,4/1/2020,2,married,108488
206090,167394,,Frost,Female,Hawaii,61,4/1/2020,1,married,45275
206162,187532,,Floyd,Female,California,39,4/1/2020,0,single,56325


The first_name column has 11,259 missing values. The simplest fix is to filter these rows out and create a new dataframe that contains only the records with a first name.

In [17]:
# Running shape to see a before and after row count
df_customers.shape

(206209, 10)

In [18]:
# Removing the rows with a missing first name
df_customers_clean = df_customers[df_customers['first_name'].notnull()]

In [19]:
# Testing the shape
df_customers_clean.shape

(194950, 10)

Success! Missing rows have been removed.
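
As an alternative to a boolean mask, `dropna` with the `subset` argument removes rows that are missing a value in specific columns only. The sketch below compares the two approaches on a made-up frame (`df_demo` is illustrative, not the real customer data).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing first name (values are invented).
df_demo = pd.DataFrame({
    'user_id': [1, 2, 3],
    'first_name': ['Ann', np.nan, 'Cy'],
    'surname': ['Lee', 'Ray', 'Fox'],
})

# Approach 1: boolean mask that keeps rows where first_name is present.
mask_version = df_demo[df_demo['first_name'].notnull()]

# Approach 2: dropna restricted to the first_name column.
dropna_version = df_demo.dropna(subset=['first_name'])

# Both approaches keep the same rows.
print(mask_version.equals(dropna_version))
print(df_demo.shape, dropna_version.shape)
```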

In [20]:
# Checking for duplicates
df_dups = df_customers_clean[df_customers_clean.duplicated()]

In [21]:
# Displaying them
df_dups

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income


There are no duplicates, so there is nothing to address.
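
`duplicated()` with no arguments only flags rows that match on every column. Because user_id is the key for the later merge, it may also be worth checking whether any user_id appears more than once. The sketch below shows the difference on a made-up frame (`df_demo` is illustrative, not the real customer data).

```python
import pandas as pd

# Toy frame: user_id 2 appears twice, but the rows differ in other columns.
df_demo = pd.DataFrame({
    'user_id': [1, 2, 2],
    'first_name': ['Ann', 'Bob', 'Robert'],
    'income': [40000, 52000, 52000],
})

# Full-row duplicates: none here, because no two rows match on every column.
full_row_dups = df_demo[df_demo.duplicated()]

# Key duplicates: keep=False marks every row whose user_id is repeated.
key_dups = df_demo[df_demo.duplicated(subset=['user_id'], keep=False)]

print(len(full_row_dups), len(key_dups))
```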

In [22]:
# Checking for mixed type data
df_customers_clean.head(20)

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


Each column in the displayed rows holds a single type of data, so no mixed-type columns need to be addressed.
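
Looking at the first 20 rows only shows a sample, so a programmatic check over every row can be a useful complement. The sketch below uses a hypothetical helper, `mixed_type_columns`, which flags any column whose values have more than one Python type. It is demonstrated on a made-up frame rather than the real customer data. Note that a missing value (NaN) is a float, so a text column with missing values is also flagged.

```python
import numpy as np
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Toy frame (values are invented).
df_demo = pd.DataFrame({
    'age': [30, 41, 25],                  # integers only
    'first_name': ['Ann', np.nan, 'Cy'],  # strings plus a NaN (float)
    'mixed': [1, 'two', 3.0],             # int, str, and float
})

flagged = mixed_type_columns(df_demo)
print(flagged)
```

A flagged column can then be converted to a single type, for example with `astype('str')`.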

# 6. Combining your data

In [23]:
# Displaying df_customers
df_customers.head(10)

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [24]:
# Displaying df_prepared
df_prepared.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,...,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_day_2,busiest_period_of_day,average_price,spending_flag
0,2539329,1,1,2,8,,True,196,1,0,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Fewest orders,6.367797,Low Spender
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,6.367797,Low Spender
2,473747,1,3,3,12,21.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,6.367797,Low Spender
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Least busy,Regularly busy,Fewest orders,6.367797,Low Spender
4,431534,1,5,4,15,28.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Least busy,Regularly busy,Fewest orders,6.367797,Low Spender
5,3367565,1,6,2,7,19.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Fewest orders,6.367797,Low Spender
6,550135,1,7,1,9,20.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Fewest orders,6.367797,Low Spender
7,3108588,1,8,1,14,14.0,False,196,2,1,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Fewest orders,6.367797,Low Spender
8,2295261,1,9,1,16,0.0,False,196,4,1,...,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Fewest orders,6.367797,Low Spender
9,2550362,1,10,4,8,30.0,False,196,1,1,...,Soda,77,7,9.0,Mid-range product,Least busy,Regularly busy,Fewest orders,6.367797,Low Spender
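
Before merging, it can help to confirm that the key column has the same data type in both dataframes, as the task tip suggests. The sketch below uses two made-up frames (`left` and `right` are illustrative names) in which the key is stored as integers on one side and as strings on the other.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal

# Toy frames: user_id is an integer on the left and a string on the right.
left = pd.DataFrame({'user_id': [1, 2], 'order_id': [10, 11]})
right = pd.DataFrame({'user_id': ['1', '2'], 'state': ['Iowa', 'Texas']})

# Inspect the key dtypes side by side.
print(left['user_id'].dtype, right['user_id'].dtype)

# Align the right-hand key to the left-hand dtype if they differ.
if not is_dtype_equal(left['user_id'].dtype, right['user_id'].dtype):
    right['user_id'] = right['user_id'].astype(left['user_id'].dtype)

# With matching key dtypes, the merge can pair the rows.
merged = left.merge(right, on='user_id')
print(merged.shape)
```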


In [25]:
# Merging on user_id; indicator = True adds the _merge column checked below
df_merged = df_prepared.merge(df_customers, on = 'user_id', indicator = True)

In [26]:
df_merged

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,...,spending_flag,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income
0,2539329,1,1,2,8,,True,196,1,0,...,Low Spender,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Low Spender,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,False,196,1,1,...,Low Spender,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Low Spender,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,False,196,1,1,...,Low Spender,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32399727,156685,106143,26,4,23,5.0,False,19675,1,1,...,High Spender,Gerald,Yates,Male,Hawaii,25,5/26/2017,0,single,53755
32399728,484769,66343,1,6,11,,True,47210,1,0,...,Low Spender,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151
32399729,1561557,66343,2,1,11,30.0,False,47210,1,1,...,Low Spender,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151
32399730,276317,66343,3,6,15,19.0,False,47210,1,1,...,Low Spender,Jacqueline,Arroyo,Female,Tennessee,22,9/12/2017,3,married,46151


In [27]:
df_merged['_merge'].value_counts()

both          32399732
left_only            0
right_only           0
Name: _merge, dtype: int64

Success!
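
One caveat about this check: `merge` defaults to an inner join, which keeps only keys found in both dataframes, so the `_merge` flag can only report `both` in that case. To see whether any keys fail to match, the merge can be run with `how = 'outer'`. The sketch below shows this on two made-up frames (`orders` and `customers` are illustrative, not the real data).

```python
import pandas as pd

# Toy frames: user 4 has orders but no customer record, user 3 has no orders.
orders = pd.DataFrame({'user_id': [1, 1, 2, 4], 'order_id': [10, 11, 12, 13]})
customers = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Iowa', 'Texas', 'Utah']})

# An outer merge keeps unmatched rows from both sides, so the flag is informative.
check = orders.merge(customers, on='user_id', how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)
```

If `left_only` and `right_only` are both zero in the outer merge, every key matched and the inner merge lost no rows.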

#### Exporting to .pkl file

In [28]:
df_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', '4.9_data_merged.pkl'))
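
A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. The sketch below does this with a made-up frame and a temporary folder; the file name `demo_merged.pkl` is an assumption for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the merged data (values are invented).
df_demo = pd.DataFrame({'user_id': [1, 2], 'state': ['Iowa', 'Texas']})

# Write to a temporary folder, read the file back, and compare the two frames.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'demo_merged.pkl')
    df_demo.to_pickle(out_path)
    df_back = pd.read_pickle(out_path)

round_trip_ok = df_demo.equals(df_back)
print(round_trip_ok)
```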