# Exercise 4.9 - Part 1

## Table of Contents — Exercise 4.9 Part 1

1. [Imports & Pathways](#Imports--Pathways)
2. [Initial Dataset Inspection](#Initial-Dataset-Inspection)
3. [Wrangle Headers](#Wrangle-Headers)
4. [Wrangle Missing Values](#Wrangle-Missing-Values)
5. [Investigate Mixed Types](#Investigate-Mixed-Types)
6. [Duplicate Check](#Duplicate-Check)
7. [Merge](#Merge)
8. [Export](#Export)

## Imports/Pathways

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# File pathway shortcut
path = r'C:\Users\Chase\anaconda_projects\Exercise 4\07-2025 Instacart Basket Analysis'

In [3]:
# Original file - customers
df_cust = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers_new.csv'))

In [4]:
# Prepared file - Orders_Products_Merge (Clean)
df_ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_v1.pkl'))

## Data Wrangling

#### Initial Dataset Inspection 

In [5]:
# Quick glance at the data
print(df_cust)

        user_id First Name    Surnam  Gender           STATE  Age date_joined  \
0         26711    Deborah  Esquivel  Female        Missouri   48    1/1/2017   
1         33890   Patricia      Hart  Female      New Mexico   36    1/1/2017   
2         65803    Kenneth    Farley    Male           Idaho   35    1/1/2017   
3        125935   Michelle     Hicks  Female            Iowa   40    1/1/2017   
4        130797        Ann   Gilmore  Female        Maryland   26    1/1/2017   
...         ...        ...       ...     ...             ...  ...         ...   
206204   168073       Lisa      Case  Female  North Carolina   44    4/1/2020   
206205    49635     Jeremy   Robbins    Male          Hawaii   62    4/1/2020   
206206   135902      Doris  Richmond  Female        Missouri   66    4/1/2020   
206207    81095       Rose   Rollins  Female      California   27    4/1/2020   
206208    80148    Cynthia     Noble  Female        New York   55    4/1/2020   

        n_dependants fam_st

In [6]:
# info on the data types
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


The `First Name` column is missing 11,259 entries (194,950 non-null out of 206,209). The `object` columns may also contain mixed types.

In [7]:
# check headers and data
df_cust.head()

| | user_id | First Name | Surnam | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


I see naming issues:
First Name---> first_name;
Surnam-------> last_name;
Gender-------> gender;
STATE--------> state;
Age----------> age;
date_joined--> no change needed;
n_dependants-> num_dependents;
fam_status---> marital_status;
income-------> no change needed;

In [8]:
# basic info from the dataset
df_cust.describe()

| | user_id | Age | n_dependants | income |
|---|---|---|---|---|
| count | 206209.0 | 206209.0 | 206209.0 | 206209.0 |
| mean | 103105.0 | 49.501646 | 1.499823 | 94632.852548 |
| std | 59527.555167 | 18.480962 | 1.118433 | 42473.786988 |
| min | 1.0 | 18.0 | 0.0 | 25903.0 |
| 25% | 51553.0 | 33.0 | 0.0 | 59874.0 |
| 50% | 103105.0 | 49.0 | 1.0 | 93547.0 |
| 75% | 154657.0 | 66.0 | 3.0 | 124244.0 |
| max | 206209.0 | 81.0 | 3.0 | 593901.0 |


#### Wrangle Headers

The headers need to be fixed, and the missing values in the first-name column need investigation.

In [9]:
# Edit headers to be correct
# Make everything lowercase for consistency
df_cust.rename(columns={
    'First Name': 'first_name',
    'Surnam': 'last_name', # change to last_name, easier to understand 
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'date_joined': 'date_joined', 
    'n_dependants': 'num_dependents', # changed to num_dependents, easier to understand
    'fam_status': 'marital_status', # changed to marital_status, clearer and easier to understand 
    'income': 'income'}, inplace=True) 


In [10]:
# Work check on headers
df_cust.head()

| | user_id | first_name | last_name | gender | state | age | date_joined | num_dependents | marital_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


Headers look good
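One thing worth knowing about `rename`: by default it silently ignores keys that match no column, so a typo in the mapping (for example `'Surname'` instead of the actual header `'Surnam'`) would go unnoticed. The sketch below uses a made-up two-column dataframe (`toy` is not part of the exercise data) to show how `errors='raise'` surfaces that kind of mistake.

```python
import pandas as pd

# Toy stand-in for df_cust; values are made up for illustration
toy = pd.DataFrame({'First Name': ['Ann'], 'Surnam': ['Gilmore']})

# Default behaviour: a key that matches no column is silently ignored
unchanged = toy.rename(columns={'Surname': 'last_name'})
print(list(unchanged.columns))

# errors='raise' turns the same typo into a KeyError
try:
    toy.rename(columns={'Surname': 'last_name'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```

Passing `errors='raise'` in the real rename call would be an optional safeguard; the head check above already confirms the headers came out as intended.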

#### Wrangle Missing Values

In [11]:
# Check the columns for missing data
df_cust.isnull().sum().sort_values(ascending=False)

first_name        11259
user_id               0
last_name             0
gender                0
state                 0
age                   0
date_joined           0
num_dependents        0
marital_status        0
income                0
dtype: int64

In [12]:
df_cust['first_name'].isnull().sum()

np.int64(11259)

In [13]:
# Percentage missing from the column (computed from the data instead of hard-coded counts)
missing_count = df_cust['first_name'].isnull().sum()
total_rows = len(df_cust)
missing_pct = (missing_count / total_rows) * 100
print(f"{missing_pct:.2f}% missing")

5.46% missing


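I checked the CSV in Excel, and the first-name column contains genuinely blank cells. I will replace any empty strings with NaN so missing values are represented consistently. Since `isnull()` already counts 11,259 missing entries, this step is mostly a safeguard.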
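As a quick illustration of why the null count was already 11,259 before any replacement: with default settings, `pd.read_csv` parses an empty field as NaN rather than as an empty string. The sketch below uses a hypothetical two-row CSV held in memory (not the real `customers_new.csv`) to show this.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has an empty first-name field
csv_text = "user_id,First Name,Surnam\n1,Ann,Gilmore\n2,,Hart\n"

toy = pd.read_csv(io.StringIO(csv_text))

# With default settings the empty field is parsed as NaN, not as ''
n_nan = toy['First Name'].isna().sum()
n_empty_str = (toy['First Name'] == '').sum()
print(n_nan, n_empty_str)
```

Note that cells containing only whitespace are a different case and would not be covered by this default.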

In [14]:
# Changing the blanks "" to NaN
# Assign the result back instead of calling replace(..., inplace=True) on the column,
# which raises a FutureWarning and will not modify df_cust in pandas 3.0
df_cust['first_name'] = df_cust['first_name'].replace('', np.nan)


In [15]:
# Flag the missing first_names just in case
df_cust['missing_first_name'] = df_cust['first_name'].isna()

In [16]:
# work check
df_cust.head()

| | user_id | first_name | last_name | gender | state | age | date_joined | num_dependents | marital_status | income | missing_first_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 | False |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 | False |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 | False |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 | False |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 | False |


In [17]:
# work check
df_cust['missing_first_name'].value_counts()

missing_first_name
False    194950
True      11259
Name: count, dtype: int64

The `True` count (11,259) matches the number of missing values.

#### Investigate Mixed Types 

In [18]:
df_cust.dtypes

user_id                int64
first_name            object
last_name             object
gender                object
state                 object
age                    int64
date_joined           object
num_dependents         int64
marital_status        object
income                 int64
missing_first_name      bool
dtype: object

In [19]:
# A closer look into the dtypes
for col in df_cust.select_dtypes(include='object').columns:
    print(f"{col}:")
    print(df_cust[col].apply(type).value_counts(), "\n")

first_name:
first_name
<class 'str'>      194950
<class 'float'>     11259
Name: count, dtype: int64 

last_name:
last_name
<class 'str'>    206209
Name: count, dtype: int64 

gender:
gender
<class 'str'>    206209
Name: count, dtype: int64 

state:
state
<class 'str'>    206209
Name: count, dtype: int64 

date_joined:
date_joined
<class 'str'>    206209
Name: count, dtype: int64 

marital_status:
marital_status
<class 'str'>    206209
Name: count, dtype: int64 



The floats are the missing values (NaN), which pandas stores as floats. So the column is not genuinely mixed, and this looks good to me.
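To make the NaN-as-float point concrete, here is a minimal sketch of a mixed-type check that drops missing values before counting types, so that only genuinely mixed columns get flagged. The helper name `mixed_type_columns` and the `toy` dataframe are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one column with strings plus NaN, one genuinely mixed column
toy = pd.DataFrame({
    'first_name': ['Ann', np.nan, 'Rose'],
    'mixed': ['a', 1, 2.5],
})

def mixed_type_columns(df):
    """Return object columns holding more than one Python type, ignoring NaN."""
    flagged = []
    for col in df.select_dtypes(include='object').columns:
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            flagged.append(col)
    return flagged

print(type(np.nan))              # NaN itself is a Python float
print(mixed_type_columns(toy))   # only the 'mixed' column is flagged
```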

In [20]:
# Checking user_id types match
df_ords_prods_merge.dtypes

order_id                     int64
product_id                   int64
add_to_cart_order            int64
reordered                    int64
user_id                      int64
order_number                 int64
orders_day_of_week           int64
order_hour_of_day            int64
days_since_prior_order     float64
first_order                   bool
product_name                object
aisle_id                   float64
department_id              float64
prices                     float64
_merge                    category
price_range                 object
busiest_day                 object
busiest_period_of_day       object
dtype: object

In [21]:
# Reorganizing columns to make the dataframe more intuitive
df_ords_prods_merge = df_ords_prods_merge[
    [
        'user_id', 'order_id', 'product_id',
        'order_number', 'orders_day_of_week', 'order_hour_of_day', 'days_since_prior_order', 'first_order',
        'product_name', 'aisle_id', 'department_id', 'prices', 'price_range',
        'add_to_cart_order', 'reordered',
        '_merge', 'busiest_day', 'busiest_period_of_day'
    ]
]

In [22]:
df_ords_prods_merge.head()

| | user_id | order_id | product_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_name | aisle_id | department_id | prices | price_range | add_to_cart_order | reordered | _merge | busiest_day | busiest_period_of_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202279 | 2 | 33120 | 3 | 5 | 9 | 8.0 | False | Organic Egg Whites | 86.0 | 16.0 | 11.3 | Mid-range product | 1 | 1 | both | Regularly busy | Most orders |
| 1 | 202279 | 2 | 28985 | 3 | 5 | 9 | 8.0 | False | Michigan Organic Kale | 83.0 | 4.0 | 13.4 | Mid-range product | 2 | 1 | both | Regularly busy | Most orders |
| 2 | 202279 | 2 | 9327 | 3 | 5 | 9 | 8.0 | False | Garlic Powder | 104.0 | 13.0 | 3.6 | Low-range product | 3 | 0 | both | Regularly busy | Most orders |
| 3 | 202279 | 2 | 45918 | 3 | 5 | 9 | 8.0 | False | Coconut Butter | 19.0 | 13.0 | 8.4 | Mid-range product | 4 | 1 | both | Regularly busy | Most orders |
| 4 | 202279 | 2 | 30035 | 3 | 5 | 9 | 8.0 | False | Natural Sweetener | 17.0 | 13.0 | 13.7 | Mid-range product | 5 | 0 | both | Regularly busy | Most orders |


#### Duplicate Check

In [23]:
# Check for duplicates
df_ords_prods_merge.duplicated().sum()

np.int64(0)

No duplicates
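The check above covers full-row duplicates in the orders data. Because the customer table supplies the merge key, it may also be worth confirming that `user_id` is unique there, since a repeated key would multiply rows in a left merge. Below is a minimal sketch on a made-up customer table (`toy_cust`, values invented) showing the difference between full-row and key-only duplicate checks.

```python
import pandas as pd

# Toy customer table: user_id 2 appears twice, but with different states
toy_cust = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'state': ['Iowa', 'Idaho', 'Maine', 'Hawaii'],
})

# Full-row check misses this, because the two rows are not identical
full_row_dupes = toy_cust.duplicated().sum()

# Key-only check catches the repeated user_id
key_dupes = toy_cust.duplicated(subset='user_id').sum()

print(full_row_dupes, key_dupes)
```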

## Merge

In [24]:
# Check shape of df_cust
df_cust.shape

(206209, 11)

In [25]:
# check the shape of df_ords_prods_merge
df_ords_prods_merge.shape

(32435059, 18)

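I want to double-check that these numbers make sense. Could there really be only 206,209 customers behind more than 32M rows? Each row is one product within an order rather than a whole order, so this works out to roughly 157 rows per customer on average, which seems plausible.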
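A small sketch of that sanity check, using a made-up five-row stand-in for the orders data (`toy` and its values are hypothetical): counting rows, distinct orders, and distinct customers separately makes the difference between line items and orders explicit.

```python
import pandas as pd

# Toy order lines: each row is one product inside an order
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'order_id':   [10, 10, 11, 20, 20],
    'product_id': [5, 6, 5, 7, 8],
})

n_rows = len(toy)                        # product line items
n_orders = toy['order_id'].nunique()     # distinct orders
n_customers = toy['user_id'].nunique()   # distinct customers
rows_per_customer = n_rows / n_customers

print(n_rows, n_orders, n_customers, rows_per_customer)
```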

In [26]:
# How many unique customer IDs are in the orders dataset?
df_ords_prods_merge['user_id'].nunique()  # Should be close to 206,209

206209

Boom! The orders dataset has the same number of unique customer IDs (206,209) as the customer dataset has rows, so the two should merge cleanly without introducing NaNs.
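Strictly speaking, equal counts do not guarantee that both tables hold the same IDs. A stricter check compares the ID sets directly. The sketch below uses two made-up ID series (`orders_ids` and `cust_ids` are hypothetical) that have the same number of unique values but do not fully overlap.

```python
import pandas as pd

# Both series have 3 unique IDs, but the sets differ
orders_ids = pd.Series([1, 2, 2, 3])
cust_ids = pd.Series([1, 2, 4])

# IDs that appear in orders but have no customer record (would produce NaNs)
missing_from_cust = set(orders_ids) - set(cust_ids)

# Customers that never appear in the orders data
unused_customers = set(cust_ids) - set(orders_ids)

print(missing_from_cust, unused_customers)
```

On the real data, both differences being empty would confirm that every order row finds a customer.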

In [27]:
# Merge df_cust to df_ords_prods_merge using left merge
df_ords_prods_merge = df_ords_prods_merge.merge(df_cust, on='user_id', how='left')

In [28]:
df_ords_prods_merge.head()

| | user_id | order_id | product_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_name | aisle_id | ... | first_name | last_name | gender | state | age | date_joined | num_dependents | marital_status | income | missing_first_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202279 | 2 | 33120 | 3 | 5 | 9 | 8.0 | False | Organic Egg Whites | 86.0 | ... | Paul | Coleman | Male | Idaho | 57 | 2/6/2020 | 3 | married | 98119 | False |
| 1 | 202279 | 2 | 28985 | 3 | 5 | 9 | 8.0 | False | Michigan Organic Kale | 83.0 | ... | Paul | Coleman | Male | Idaho | 57 | 2/6/2020 | 3 | married | 98119 | False |
| 2 | 202279 | 2 | 9327 | 3 | 5 | 9 | 8.0 | False | Garlic Powder | 104.0 | ... | Paul | Coleman | Male | Idaho | 57 | 2/6/2020 | 3 | married | 98119 | False |
| 3 | 202279 | 2 | 45918 | 3 | 5 | 9 | 8.0 | False | Coconut Butter | 19.0 | ... | Paul | Coleman | Male | Idaho | 57 | 2/6/2020 | 3 | married | 98119 | False |
| 4 | 202279 | 2 | 30035 | 3 | 5 | 9 | 8.0 | False | Natural Sweetener | 17.0 | ... | Paul | Coleman | Male | Idaho | 57 | 2/6/2020 | 3 | married | 98119 | False |


Looks like the merge went well. I do see that the columns need to be reorganized to make sense again.
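For a more formal confirmation than eyeballing `head()`, `merge` accepts `validate` and `indicator` arguments. This is a minimal sketch on made-up tables (`toy_orders`, `toy_cust`, and the column name `cust_merge` are all hypothetical). A custom indicator name is used because the real orders dataframe already contains a `_merge` column from an earlier merge.

```python
import pandas as pd

# Toy tables: user 3 has orders but no customer record
toy_orders = pd.DataFrame({'user_id': [1, 1, 2, 3], 'order_id': [10, 11, 20, 30]})
toy_cust = pd.DataFrame({'user_id': [1, 2], 'state': ['Iowa', 'Idaho']})

merged = toy_orders.merge(
    toy_cust,
    on='user_id',
    how='left',
    validate='m:1',          # raises MergeError if user_id repeats in toy_cust
    indicator='cust_merge',  # custom name to avoid clashing with an existing '_merge'
)

# Rows that found no matching customer are labelled 'left_only'
unmatched = (merged['cust_merge'] == 'left_only').sum()
print(len(merged), unmatched)
```

A left merge validated as many-to-one keeps the row count of the left table, so comparing `len(merged)` with the original row count is another cheap check.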

In [29]:
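# Reorganize the columns after the new merge
# Note: columns not listed here (department_id, prices, price_range, add_to_cart_order,
# reordered, _merge, busiest_day, busiest_period_of_day) are dropped from the dataframe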
df_ords_prods_merge = df_ords_prods_merge[
    ['user_id', 'first_name', 'last_name', 'gender', 'age', 'state', 'income',
    'marital_status', 'num_dependents', 'date_joined', 'missing_first_name',
    'order_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day',
    'days_since_prior_order', 'first_order',
    'product_id', 'product_name', 'aisle_id'
    # Add any engineered columns here
    ]
]

In [30]:
# Check work
df_ords_prods_merge.head()

| | user_id | first_name | last_name | gender | age | state | income | marital_status | num_dependents | date_joined | missing_first_name | order_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | product_name | aisle_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202279 | Paul | Coleman | Male | 57 | Idaho | 98119 | married | 3 | 2/6/2020 | False | 2 | 3 | 5 | 9 | 8.0 | False | 33120 | Organic Egg Whites | 86.0 |
| 1 | 202279 | Paul | Coleman | Male | 57 | Idaho | 98119 | married | 3 | 2/6/2020 | False | 2 | 3 | 5 | 9 | 8.0 | False | 28985 | Michigan Organic Kale | 83.0 |
| 2 | 202279 | Paul | Coleman | Male | 57 | Idaho | 98119 | married | 3 | 2/6/2020 | False | 2 | 3 | 5 | 9 | 8.0 | False | 9327 | Garlic Powder | 104.0 |
| 3 | 202279 | Paul | Coleman | Male | 57 | Idaho | 98119 | married | 3 | 2/6/2020 | False | 2 | 3 | 5 | 9 | 8.0 | False | 45918 | Coconut Butter | 19.0 |
| 4 | 202279 | Paul | Coleman | Male | 57 | Idaho | 98119 | married | 3 | 2/6/2020 | False | 2 | 3 | 5 | 9 | 8.0 | False | 30035 | Natural Sweetener | 17.0 |
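Selecting columns with an explicit list also drops every column left off the list. If the goal is only to change the order, one alternative is to put the chosen columns first and append whatever remains. A minimal sketch on a made-up four-column dataframe (`toy`, `front`, and `rest` are illustrative names):

```python
import pandas as pd

# Toy dataframe with columns in an unhelpful order
toy = pd.DataFrame({
    'order_id': [10], 'prices': [3.6], 'user_id': [1], 'first_name': ['Ann'],
})

# Columns to move to the front, in the desired order
front = ['user_id', 'first_name', 'order_id']

# Everything not named above keeps its original relative order
rest = [col for col in toy.columns if col not in front]

reordered = toy[front + rest]
print(list(reordered.columns))
```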


## Export

In [33]:
# Pickle Export
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_v2.pkl'))

In [34]:
# Column Check
df_ords_prods_merge.columns

Index(['user_id', 'first_name', 'last_name', 'gender', 'age', 'state',
       'income', 'marital_status', 'num_dependents', 'date_joined',
       'missing_first_name', 'order_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'first_order',
       'product_id', 'product_name', 'aisle_id'],
      dtype='object')
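An optional extra check after exporting is to read the pickle back and confirm it matches what was written. The sketch below does this with a made-up two-row dataframe in a temporary directory (`toy` and the file name `toy_merge.pkl` are hypothetical), so it does not touch the project folders.

```python
import os
import tempfile
import pandas as pd

# Toy dataframe standing in for the merged data
toy = pd.DataFrame({'user_id': [1, 2], 'first_name': ['Ann', 'Rose']})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_merge.pkl')
    toy.to_pickle(out_path)              # write
    reloaded = pd.read_pickle(out_path)  # read back

# equals() compares shape, column labels, dtypes, and values
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

On a dataframe with tens of millions of rows, reloading takes time and memory, so comparing just the shape and column list may be a lighter option.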