# Customer Data Wrangling & Merge

# Table of Contents

1️⃣ Customer Data Wrangling & Merge  
2️⃣ Check 'user_id' Data Types in Both DataFrames  
3️⃣ Merge DataFrames on 'user_id'  
4️⃣ Preview Merged DataFrame  
5️⃣ Export Merged DataFrame  
6️⃣ Reflection on Merge Process


In [41]:
import pandas as pd
import numpy as np
import os


This cell imports the main libraries:

- **pandas**: for data manipulation and analysis.
- **numpy**: for numerical operations.
- **os**: to handle file paths when loading and saving data.


In [43]:
# Set the path to your main project folder
path = r'C:\Users\rhysm\OneDrive\Desktop\Career Foundry\Data Immersion\Module 4\04-2025 Instacart Basket Analysis'

# Import the prepared Instacart dataframe
ords_prods_merge = pd.read_pickle(os.path.join(
    path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))

# Preview the dataframe
ords_prods_merge.head()


Unnamed: 0,Unnamed: 0.1_x,Unnamed: 0_x,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,Unnamed: 0.1_y,Unnamed: 0_y,product_name,aisle_id,department_id,prices,merge_status
0,1,1,2398795,1,2,3,7,15.0,196,1,1,both,195,195,Soda,77,7,9.0,both
1,1,1,2398795,1,2,3,7,15.0,10258,2,0,both,10258,10258,Pistachios,117,19,3.0,both
2,1,1,2398795,1,2,3,7,15.0,12427,3,1,both,12427,12427,Original Beef Jerky,23,19,4.4,both
3,1,1,2398795,1,2,3,7,15.0,13176,4,0,both,13176,13176,Bag of Organic Bananas,24,4,10.3,both
4,1,1,2398795,1,2,3,7,15.0,26088,5,1,both,26089,26089,Aged White Cheddar Popcorn,23,19,4.7,both


Here, we define the file path to the project folder and load the prepared Instacart dataset (`ords_prods_merge.pkl`), which contains merged order and product data. We then preview the first few rows to confirm it loaded correctly.


In [45]:
# Import the customer dataset
customers = pd.read_csv(os.path.join(
    path, '02 Data', 'Original Data', 'customers.csv'))

# Preview the first few rows
customers.head()


Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


We load the customer dataset (`customers.csv`), which contains user details like name, age, gender, state, family status, and income. Previewing the first few rows ensures it imported properly.


In [47]:
# Rename the columns for clarity
customers.rename(columns={'First Name': 'first_name', 'Surnam': 'last_name'}, inplace=True)

# Check for missing values
customers.isnull().sum()


user_id             0
first_name      11259
last_name           0
Gender              0
STATE               0
Age                 0
date_joined         0
n_dependants        0
fam_status          0
income              0
dtype: int64

We clean up the column names by renaming:

- `First Name` → `first_name`
- `Surnam` → `last_name`

This makes them easier to work with and more consistent with Python naming conventions.
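The rename above covers only two columns, so names such as `Gender`, `STATE`, and `Age` keep their original casing. As a minimal sketch of one way to normalize every column name at once, the example below uses an empty toy dataframe (`toy_customers` and `renamed` are hypothetical names, not part of the notebook) with the same headers as `customers`:

```python
import pandas as pd

# Toy stand-in with the same headers as `customers` (no rows needed here)
toy_customers = pd.DataFrame(
    columns=['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age'])

# Fix the two awkward names first, as in the cell above
renamed = toy_customers.rename(
    columns={'First Name': 'first_name', 'Surnam': 'last_name'})

# Then lower-case everything and replace any remaining spaces with underscores
renamed.columns = (renamed.columns
                   .str.strip()
                   .str.lower()
                   .str.replace(' ', '_'))

print(list(renamed.columns))
```

Whether to apply this to the real dataframe is a judgment call: any later code that refers to `Gender` or `STATE` would need the new lower-case names.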


We count missing values in each column to understand data completeness. The only gaps are in `first_name`, which has 11,259 missing values; every other column, including the merge key `user_id`, is complete, so the missing names do not affect the merge.
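Raw counts are easier to judge next to the share of rows they represent. The sketch below is one possible way to summarize this, using a small made-up dataframe (`toy_customers` and its values are illustrative assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `customers`; the values are made up for illustration
toy_customers = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'first_name': ['Ann', np.nan, 'Kenneth', np.nan],
    'last_name': ['Gilmore', 'Hart', 'Farley', 'Hicks'],
})

# Count and share of missing values per column
missing_summary = pd.DataFrame({
    'missing': toy_customers.isnull().sum(),
    'share': toy_customers.isnull().mean(),
})

# Keep only the columns that actually have gaps
missing_summary = missing_summary[missing_summary['missing'] > 0]
print(missing_summary)
```

In this toy data only `first_name` appears in the summary, with 2 of 4 values missing.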


In [49]:
# Check for duplicates
customers.duplicated().sum()


0

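We check for fully duplicated rows in the `customers` dataframe; the result is 0. Confirming this before merging matters because a repeated customer record would repeat every matching order row in the merged dataframe. Note that `duplicated()` compares entire rows, so on its own it does not prove that each `user_id` appears only once.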
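Because the merge key is `user_id`, it can also be worth checking that column specifically. The following is a small sketch on made-up data (`toy_customers` is a hypothetical example, not the real dataset) showing how a full-row check and a key-only check can disagree:

```python
import pandas as pd

# Toy data: user_id 2 appears twice, but the rows differ in first_name
toy_customers = pd.DataFrame({
    'user_id': [1, 2, 2],
    'first_name': ['Ann', 'Linda', 'Lynda'],
})

# Full-row duplicates: none, because no two rows are identical
full_row_dupes = toy_customers.duplicated().sum()

# Key-only duplicates: one, because user_id 2 is repeated
key_dupes = toy_customers.duplicated(subset='user_id').sum()

print(full_row_dupes, key_dupes)
print(toy_customers['user_id'].is_unique)
```

Here the full-row check reports 0 duplicates while the key-only check reports 1, which is the case that would silently multiply rows in a merge.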


In [51]:
# Check that 'user_id' is the same type in both dataframes
ords_prods_merge['user_id'].dtype
customers['user_id'].dtype


dtype('int64')

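We check the data type of the `user_id` column in both dataframes (`ords_prods_merge` and `customers`), because the merge key should have the same type on both sides. Note that a notebook cell displays only the value of its last expression, so the `dtype('int64')` shown above belongs to `customers['user_id']`; the dtype of `ords_prods_merge['user_id']` is evaluated but not displayed. Mismatched key types (for example, integers on one side and strings on the other) can cause the merge to raise an error or to miss matches.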
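To see both dtypes and compare them in a single cell, something like the sketch below could be used. It runs on toy dataframes (`toy_orders` and `toy_customers` are hypothetical stand-ins) where the key is deliberately stored as strings on one side:

```python
import pandas as pd

# Toy stand-ins: the key is stored as strings on one side, integers on the other
toy_orders = pd.DataFrame({'user_id': ['1', '1', '2']})
toy_customers = pd.DataFrame({'user_id': [1, 2]})

left_dtype = toy_orders['user_id'].dtype
right_dtype = toy_customers['user_id'].dtype

# Print both so neither result is hidden by the notebook's last-expression rule
print(left_dtype, right_dtype)

# Align the key types before merging if they differ
if left_dtype != right_dtype:
    toy_orders['user_id'] = toy_orders['user_id'].astype(right_dtype)

dtypes_match = toy_orders['user_id'].dtype == toy_customers['user_id'].dtype
print(dtypes_match)
```

Casting with `astype` only works when every value can be converted; keys containing non-numeric text would need cleaning first.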


In [53]:
# Merge the two dataframes on 'user_id'
ords_prods_cust_merge = ords_prods_merge.merge(customers, on='user_id', how='left')

# Preview the result
ords_prods_cust_merge.head()


Unnamed: 0,Unnamed: 0.1_x,Unnamed: 0_x,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,...,merge_status,first_name,last_name,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,1,1,2398795,1,2,3,7,15.0,196,1,...,both,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,1,1,2398795,1,2,3,7,15.0,10258,2,...,both,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,1,1,2398795,1,2,3,7,15.0,12427,3,...,both,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,1,1,2398795,1,2,3,7,15.0,13176,4,...,both,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,1,1,2398795,1,2,3,7,15.0,26088,5,...,both,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
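A left merge keeps every order row and attaches the matching customer record, so the row count should not change. As a hedged sketch of how that could be checked, the example below uses the `validate` and `indicator` parameters of `merge` on toy data (`toy_orders`, `toy_customers`, and the indicator name `cust_merge` are illustrative assumptions). A custom indicator name is used because a dataframe that already has a `_merge` column, as `ords_prods_merge` does, cannot take the default one:

```python
import pandas as pd

toy_orders = pd.DataFrame({
    'order_id': [10, 11, 12],
    'user_id': [1, 1, 3],
    '_merge': ['both', 'both', 'both'],  # leftover flag from an earlier merge
})
toy_customers = pd.DataFrame({
    'user_id': [1, 2],
    'fam_status': ['married', 'single'],
})

merged = toy_orders.merge(
    toy_customers,
    on='user_id',
    how='left',
    validate='m:1',          # raise if user_id repeats on the customer side
    indicator='cust_merge',  # custom name avoids clashing with '_merge'
)

# A many-to-one left merge should preserve the number of order rows
row_count_preserved = len(merged) == len(toy_orders)

# Orders whose user_id has no customer record
unmatched = (merged['cust_merge'] == 'left_only').sum()
print(row_count_preserved, unmatched)
```

In the toy data, the order from `user_id` 3 has no customer record, so it is flagged as `left_only` and its `fam_status` is missing.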


In [54]:
# Export the merged dataframe as a pickle file
ords_prods_cust_merge.to_pickle(os.path.join(
    path, '02 Data', 'Prepared Data', 'ords_prods_cust_merge.pkl'))


We export the final merged dataframe as a pickle file (`ords_prods_cust_merge.pkl`) so it can be reused later without repeating the merge steps.
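A quick way to confirm an export worked is to read the file back and compare it with the original. The sketch below does this on a toy dataframe in a temporary folder (`toy_merged` and the file name `toy_merged.pkl` are hypothetical; the real notebook writes to the project's `Prepared Data` folder):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged dataframe
toy_merged = pd.DataFrame({
    'user_id': [1, 1, 2],
    'prices': [9.0, 3.0, 4.4],
})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_merged.pkl')
    toy_merged.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# Same shape, values, and dtypes after the round trip
round_trip_ok = reloaded.equals(toy_merged)
print(reloaded.shape, round_trip_ok)
```

Pickle preserves dtypes and the index, which is why it suits intermediate files like this one; it should only be loaded from sources you trust.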
