# Exercise 4.9 - Part 1

### Step 1: Download the customer data set and add it to your “Original Data” folder.

### Step 2: Create a new notebook in your “Scripts” folder for part 1 of this task.

### Step 3: Import your analysis libraries, as well as your new customer data set as a dataframe.

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# Creating a path
path = r'D:\Nov Laptop\Ivan Dimitrov - Data Analyst (CF)\13-06-2023 Instacart Basket Analysis'

In [3]:
# Importing the new "customer" data set as a dataframe
customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

In [6]:
# Importing the latest orders and products merged data set
ords_prods_last = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data','Orders_Products_Aggregated_4.8.pkl'))

### Step 4: Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis. (All wrangling will be done on the "customer" data set, since it's the newest data set and needs to be cleaned).

In [17]:
# First we need to see the columns and check for any unusual column names
customers.columns

Index(['user_id', 'First Name', 'Last Name', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [16]:
# Renaming the "Surname" column to "Last Name" so it matches the existing "First Name" column
customers.rename(columns={'Surname':'Last Name'},inplace = True)

### Step 5: Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

In [18]:
# Checking the size of "customers" data set
customers.shape

(206209, 10)

In [19]:
# Checking the "customers" data set for missing values and data types
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Last Name     206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


#### The "First Name" column has missing values (206209 - 194950 = 11259 missing).
#### We can confirm this by running "value_counts" with dropna = False on that column.

In [20]:
customers['First Name'].value_counts(dropna = False)

NaN        11259
Marilyn     2213
Barbara     2154
Todd        2113
Jeremy      2104
           ...  
Merry        197
Eugene       197
Garry        191
Ned          186
David        186
Name: First Name, Length: 208, dtype: int64

##### The missing values make up about 5.5% of the data set. As a rule of thumb, rows should NOT be deleted when more than 5% of the data is affected. In this case, regardless of the percentage, other columns such as "user_id" and "Last Name" can still identify each customer, so we'll leave the missing values as they are.
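
The share of missing values can also be computed directly instead of by hand. The sketch below uses a small made-up dataframe (the `toy` name and its values are assumptions, not part of the customer data) and then repeats the arithmetic with the counts reported by `customers.info()` above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "customers" dataframe (hypothetical values)
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'First Name': ['Ann', np.nan, 'Todd', 'Ned'],
})

# isna() marks missing cells; the mean of the booleans is the share missing
missing_share = toy['First Name'].isna().mean() * 100
print(f"{missing_share:.1f}% of the toy 'First Name' values are missing")

# The same arithmetic with the counts reported by customers.info()
reported_share = (206209 - 194950) / 206209 * 100
print(f"{reported_share:.2f}% of the real 'First Name' values are missing")
```

On the real data, `customers['First Name'].isna().mean() * 100` should give the same figure as the second calculation.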

In [22]:
# Checking the descriptive stats for the "customers" data set
customers.describe()

,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


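##### The stats look normal. The only notable value is the maximum income (593901), which is well above the 75th percentile (124244).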

In [23]:
# Checking a preview of the data set
customers.head(10)

,user_id,First Name,Last Name,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [25]:
# Renaming the columns so they follow one consistent naming style
customers.rename(columns = {'user_id':'User_ID','STATE':'State','date_joined':'Date_Joined','n_dependants':'Dependants','fam_status':'Fam_Status','income':'Income'}, inplace = True)

In [27]:
# Checking result
customers.head(10)

,User_ID,First Name,Last Name,Gender,State,Age,Date_Joined,Dependants,Fam_Status,Income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [28]:
# Checking for mixed-type data: print every column in which
# any value has a different type than the value in the first row
for col in customers.columns.tolist():
    weird = (customers[[col]].applymap(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(customers[weird]) > 0:
        print(col)

First Name


##### This result is expected: the "First Name" column holds the missing values (NaN, stored as floats) alongside the name strings, so we'll convert the whole column to strings.

In [29]:
# Changing the data type of the "First Name" column to string
# (the missing values become the text 'nan', so they no longer count as null)
customers['First Name'] = customers['First Name'].astype('str')
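
Converting with `astype('str')` makes the column pass the mixed-type check, but it also hides the missing values: the later `customers.info()` output reports 206209 non-null entries for "First Name". One alternative, shown here only as a sketch on a made-up Series (the `names` data and the `'Unknown'` placeholder are assumptions), is to fill the gaps with an explicit label so they stay easy to find.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data)
names = pd.Series(['Ann', np.nan, 'Todd'], dtype=object, name='First Name')

# Before: the column mixes str values with a float (NaN is a float)
types_before = set(names.map(type))
print(types_before)

# Filling with a visible placeholder leaves only str values
filled = names.fillna('Unknown')
types_after = set(filled.map(type))
print(types_after)

# The placeholder can still be counted later on
n_unknown = int((filled == 'Unknown').sum())
print(n_unknown)
```

Either approach gives a single-type column; the placeholder simply keeps the 11259 affected rows identifiable.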

In [30]:
# Re-checking for mixed-type data (no output means no mixed-type columns remain)
for col in customers.columns.tolist():
    weird = (customers[[col]].applymap(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(customers[weird]) > 0:
        print(col)

In [31]:
# Checking for fully duplicated rows
df_dups = customers[customers.duplicated()]

In [32]:
df_dups

,User_ID,First Name,Last Name,Gender,State,Age,Date_Joined,Dependants,Fam_Status,Income


##### There are NO duplicate rows
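
`duplicated()` without arguments only flags rows that match in every column. Since "User_ID" is the merge key in the next step, it may also be worth checking that the key alone has no repeats. A minimal sketch on made-up data (the `toy` dataframe is an assumption):

```python
import pandas as pd

# Toy data: no fully identical rows, but User_ID 2 appears twice
toy = pd.DataFrame({'User_ID': [1, 2, 2], 'Income': [100, 200, 250]})

# Rows that repeat an earlier row in every column
full_row_dups = int(toy.duplicated().sum())

# Rows that repeat an earlier User_ID, whatever the other columns hold
key_dups = int(toy.duplicated(subset='User_ID').sum())

print(full_row_dups, key_dups)
```

On the real data the equivalent call would be `customers.duplicated(subset='User_ID').sum()`.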

### Step 6: Combine your customer data with the rest of your prepared Instacart data. (Hint: Make sure the key columns are the same data type!)

In [42]:
# Check "ords_prods_last" data set
ords_prods_last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32404161 entries, 0 to 32404160
Data columns (total 24 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   order_id                int32   
 1   User_ID                 int32   
 2   order_number            int8    
 3   orders_day_of_week      int8    
 4   order_time_of_day       int8    
 5   days_since_prior_order  float16 
 6   product_id              int32   
 7   add_to_cart_order       int32   
 8   reordered               int8    
 9   product_name            object  
 10  aisle_id                int32   
 11  department_id           int8    
 12  prices                  float64 
 13  _merge                  category
 14  price_label             object  
 15  busiest_day             object  
 16  Busiest_Days            object  
 17  busiest_period_of_day   object  
 18  max_order               int8    
 19  loyalty_flag            object  
 20  avg_price               float64 
 21  spendi

In [34]:
# Check "ords_prods_last" size
ords_prods_last.shape

(32404161, 24)

In [37]:
# Checking "customers" data set
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   User_ID      206209 non-null  int32 
 1   First Name   206209 non-null  object
 2   Last Name    206209 non-null  object
 3   Gender       206209 non-null  object
 4   State        206209 non-null  object
 5   Age          206209 non-null  int32 
 6   Date_Joined  206209 non-null  object
 7   Dependants   206209 non-null  int32 
 8   Fam_Status   206209 non-null  object
 9   Income       206209 non-null  int64 
dtypes: int32(3), int64(1), object(6)
memory usage: 13.4+ MB


In [36]:
# Changing dtypes to reduce memory use and to match the int32 "User_ID" key in "ords_prods_last"
customers['User_ID'] = customers['User_ID'].astype('int32')
customers['Age'] = customers['Age'].astype('int32')
customers['Dependants'] = customers['Dependants'].astype('int32')
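
The effect of this kind of downcast can be measured with `memory_usage`. The sketch below uses a made-up "Age" column (the `toy` dataframe is an assumption) and checks that the values fit into `int32` before casting.

```python
import numpy as np
import pandas as pd

# Toy "Age" column stored as int64 (hypothetical values, 64 rows)
toy = pd.DataFrame({'Age': np.arange(18, 82, dtype='int64')})

# Bytes used by the column before the change
before = toy['Age'].memory_usage(index=False)

# Guard: the largest value must fit into int32 before casting
assert toy['Age'].max() <= np.iinfo(np.int32).max

toy['Age'] = toy['Age'].astype('int32')

# Bytes used after the change: half, since int32 uses 4 bytes instead of 8
after = toy['Age'].memory_usage(index=False)
print(before, after)
```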

In [41]:
# Renaming the "user_id" column in "ords_prods_last" to "User_ID" so the key names match
ords_prods_last.rename(columns = {'user_id':'User_ID'}, inplace = True)

In [43]:
# Merging the two data sets on "User_ID" (pandas uses an inner join by default)
ords_prods_cust_combined = ords_prods_last.merge(customers, on = 'User_ID')
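
A few extra checks can make a merge like this safer. The sketch below runs on two made-up dataframes (`orders`, `customers_toy` and the `cust_merge` column name are assumptions). It compares the key dtypes, uses `validate` to confirm each customer appears only once, and uses a custom `indicator` name, since "ords_prods_last" already has a `_merge` column from an earlier merge.

```python
import pandas as pd

# Toy stand-ins for the two data sets (hypothetical values)
orders = pd.DataFrame({
    'User_ID': pd.Series([1, 1, 2, 3], dtype='int32'),
    'order_id': [10, 11, 12, 13],
})
customers_toy = pd.DataFrame({
    'User_ID': pd.Series([1, 2], dtype='int32'),
    'State': ['Alabama', 'Idaho'],
})

# The key columns should share one dtype before merging
assert orders['User_ID'].dtype == customers_toy['User_ID'].dtype

# Inner join; validate raises an error if a User_ID repeats on the customer side
inner = orders.merge(customers_toy, on='User_ID', validate='many_to_one')

# Outer join with an indicator column to see which rows found no match
outer = orders.merge(customers_toy, on='User_ID', how='outer',
                     indicator='cust_merge')

print(len(inner))
print(outer['cust_merge'].value_counts())
```

In the toy data, the order for User_ID 3 has no customer record, so the inner join drops it and the outer join marks it as `left_only`. In the notebook the row count is 32404161 both before and after the merge, which is consistent with every order row finding its customer.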

In [44]:
# Checking that the merge was successful
ords_prods_cust_combined.head(10)

,order_id,User_ID,order_number,orders_day_of_week,order_time_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,frequency_flag,First Name,Last Name,Gender,State,Age,Date_Joined,Dependants,Fam_Status,Income
0,2539329,1,1,2,8,,196,1,0,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
5,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
6,550135,1,7,1,9,20.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
7,3108588,1,8,1,14,14.0,196,2,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
8,2295261,1,9,1,16,0.0,196,4,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
9,2550362,1,10,4,8,30.0,196,1,1,Soda,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423


In [45]:
# Checking the size of the new derived data set
ords_prods_cust_combined.shape

(32404161, 33)

In [46]:
# Checking the data types
ords_prods_cust_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404161 entries, 0 to 32404160
Data columns (total 33 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   order_id                int32   
 1   User_ID                 int32   
 2   order_number            int8    
 3   orders_day_of_week      int8    
 4   order_time_of_day       int8    
 5   days_since_prior_order  float16 
 6   product_id              int32   
 7   add_to_cart_order       int32   
 8   reordered               int8    
 9   product_name            object  
 10  aisle_id                int32   
 11  department_id           int8    
 12  prices                  float64 
 13  _merge                  category
 14  price_label             object  
 15  busiest_day             object  
 16  Busiest_Days            object  
 17  busiest_period_of_day   object  
 18  max_order               int8    
 19  loyalty_flag            object  
 20  avg_price               float64 
 21  spendi

### Step 7: Ensure your notebook contains logical titles, section headings, and descriptive code comments.

##### Done.

### Step 8: Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.

In [47]:
# Exporting the combined dataframe as a pickle file for part 2
ords_prods_cust_combined.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_cust_combined_4.9_Part1.pkl'))