# This script includes the following points:

Step 3. Importing the libraries and data

Step 3.1. Exploring the data


Step 4. Data wrangling

Step 4.1. Renaming columns

Step 4.2. Checking variable types

Step 5. Data quality and consistency

Step 5.1. Checking for mixed data

Step 5.2. Checking for missing values

Step 5.3. Dropping columns

Step 5.4. Checking for duplicates

Step 6. Combining data

Step 6.1 Merging dataframes

Step 7. Exporting data in pkl format.

# Step 3. Importing libraries and data

In [1]:
#Import libraries 
import pandas as pd
import numpy as np
import os

In [2]:
path = r'/Users/buketoztekin/Documents/Instacart Basket Analysis/'

In [25]:
df_cust = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)


## Step 3.1 Exploring data

In [26]:
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


In [27]:
df_cust.shape

(206209, 10)

In [28]:
df_cust.describe()

Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


In [33]:
medians = df_cust[['Age', 'income', 'n_dependants']].median()

In [34]:
print(medians)

Age                49.0
income          93547.0
n_dependants        1.0
dtype: float64


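Based on the data exploration, we observe the following key points:

Count Consistency: The numeric columns (user_id, Age, n_dependants, income) all have 206209 non-null values, so none of them has missing values. describe() covers only numeric columns, however. The info() output shows that First Name has only 194950 non-null values, which means 11259 first names are missing.

Age: The ages range from a minimum of 18 to a maximum of 81.
The mean (49.5) and median (49) are almost equal, which suggests a roughly symmetric distribution. This alone does not show that age is normally distributed.

Income: The mean (94633) is slightly above the median (93547), and the maximum (593901) is far above the 75th percentile (124244). This points to a right-skewed distribution with a small number of very high incomes.

Number of Dependents: The number of dependents ranges from a minimum of 0 to a maximum of 3.
It is a discrete count with only four possible values, so it is not normally distributed. The mean (1.5) and the quartiles (0, 1, 3) are consistent with values spread fairly evenly across this range.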
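
Comparing mean and median is only a rough check of a distribution's shape. The sketch below adds the sample skewness as a second indicator. It runs on a small made-up dataframe (`toy` is a hypothetical stand-in, not the customer data), so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for df_cust; the rows are illustrative, not real customers.
toy = pd.DataFrame({
    'Age': [18, 33, 49, 66, 81],
    'income': [25903, 59874, 93547, 124244, 593901],
})

# One row per column: mean, median and sample skewness side by side.
summary = pd.DataFrame({
    'mean': toy.mean(),
    'median': toy.median(),
    'skew': toy.skew(),  # near 0 for a symmetric column, positive for a long right tail
})
print(summary)
```

A skewness close to zero supports symmetry but still does not establish normality; a histogram of the column would be the natural next check.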

# Step 4. Data Wrangling

In [37]:
df_cust.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


## Step 4.1. Renaming columns

In [38]:
#Renaming columns to consistent lowercase names
df_cust.rename(columns = {
    'First Name' : 'first_name',
    'Surnam' : 'surname',
    'Gender' : 'gender',
    'STATE' : 'state',
    'Age' : 'age',
    'n_dependants' : 'number_of_dependants',
    'fam_status' : 'family_status'
}, inplace = True)


In [39]:
df_cust.head()

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


## Step 4.2 Checking variable types

In [None]:
df_cust.dtypes
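
The info() output in Step 3.1 lists date_joined as object, meaning the dates are stored as text. If date arithmetic is needed later, the column could be parsed as sketched below. This is an optional step that the notebook itself does not perform, and the month/day/year format is an assumption based on values such as 1/1/2017, which do not reveal the order of day and month.

```python
import pandas as pd

# Toy rows shaped like the customer data; the values are illustrative only.
toy = pd.DataFrame({
    'user_id': [26711, 33890],
    'date_joined': ['1/1/2017', '1/2/2017'],
})

# Assumption: dates are written month/day/year.
toy['date_joined'] = pd.to_datetime(toy['date_joined'], format='%m/%d/%Y')
print(toy.dtypes)
```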

# Step 5. Data Quality and Consistency Checks

## Step 5.1 Checking for Mixed Data

In [40]:
# Checking for mixed data 
for col in df_cust.columns.tolist():
    if df_cust[col].apply(type).nunique() > 1:
        print(col)

first_name


In [41]:
#Fixing mixed data in 'first_name' column
df_cust['first_name'] = df_cust['first_name'].astype('str')


In [42]:
# Checking for mixed data after fixing
for col in df_cust.columns.tolist():
    if df_cust[col].apply(type).nunique() > 1:
        print(col)
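
The check above can flag a text column as mixed only because it contains missing values, since NaN is stored as a float. A possible variant, sketched here with a hypothetical helper `mixed_type_columns` and made-up data, ignores missing values so that only columns with genuinely different value types are reported.

```python
import numpy as np
import pandas as pd

def mixed_type_columns(df, ignore_missing=True):
    """Return the columns whose values have more than one Python type."""
    flagged = []
    for col in df.columns:
        # Optionally drop NaN first so that a float NaN is not counted as a type.
        values = df[col].dropna() if ignore_missing else df[col]
        if values.apply(type).nunique() > 1:
            flagged.append(col)
    return flagged

# Toy data: 'first_name' is text with a gap, 'code' truly mixes str and int.
toy = pd.DataFrame({
    'first_name': pd.Series(['Ann', np.nan, 'Chris'], dtype=object),
    'code': pd.Series(['a', 1, 'b'], dtype=object),
})

print(mixed_type_columns(toy, ignore_missing=False))
print(mixed_type_columns(toy))
```

With this variant a column like first_name would not need a type conversion at all, and its missing values would stay visible to isnull().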

## Step 5.2 Checking for Missing Values

In [43]:
#Finding missing values
df_cust.isnull().sum()


user_id                 0
first_name              0
surname                 0
gender                  0
state                   0
age                     0
date_joined             0
number_of_dependants    0
family_status           0
income                  0
dtype: int64

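At first glance there are no missing values. However, info() in Step 3.1 showed only 194950 non-null first names, and skimming through the dataframe reveals 'nan' entries in first_name. The astype('str') conversion in Step 5.1 turned the missing values into the string 'nan', so isnull() no longer detects them.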

In [44]:
df_cust.head(60)

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [45]:
#Replacing 'nan' values (as strings) with actual NaN
df_cust = df_cust.replace('nan', np.nan)

In [46]:
#Finding missing values
df_cust.isnull().sum()

user_id                     0
first_name              11259
surname                     0
gender                      0
state                       0
age                         0
date_joined                 0
number_of_dependants        0
family_status               0
income                      0
dtype: int64

11259 first_name values are missing. Since each customer is identified by a unique user_id, the first_name and surname columns can be dropped. If the customers' names are needed later, they can be looked up through their user_id.

## Step 5.3. Dropping Columns

In [47]:
#Dropping first_name and surname columns
df_cust_drop = df_cust.drop(columns = ['first_name', 'surname'])


In [48]:
df_cust_drop.head()

Unnamed: 0,user_id,gender,state,age,date_joined,number_of_dependants,family_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374


## Step 5.4 Checking for Duplicates

In [49]:
#Checking for duplicates
df_dups = df_cust_drop[df_cust_drop.duplicated()]

In [50]:
df_dups.shape

(0, 8)

No duplicates were found.
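
duplicated() without arguments only flags rows that are identical in every column. Because user_id is the merge key in Step 6, it may also be worth checking the key on its own. The sketch below uses a made-up dataframe (`toy`) to show the difference between the two checks.

```python
import pandas as pd

# Toy data: no fully identical rows, but user_id 2 appears twice.
toy = pd.DataFrame({
    'user_id': [1, 2, 2],
    'state': ['Iowa', 'Idaho', 'Texas'],
})

full_row_dups = toy.duplicated().sum()               # rows identical in all columns
key_dups = toy.duplicated(subset='user_id').sum()    # repeated user_id values
print(full_row_dups, key_dups)
```

Applied to the customer data, the second check would be df_cust_drop.duplicated(subset='user_id').sum(), which should be 0 if each customer appears once.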

# Step 6. Combining Data

In [3]:
# Importing prepared data
file_path = os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined_merged_derived.pkl')
df_ords_prods_merge = pd.read_pickle(file_path)

In [52]:
df_ords_prods_merge.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_spending,spender_flag,purchase_frequency,frequency_flag
0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,...,Mid-range product,Regularly busy,Regularly busy,Most orders,32,Regular customer,6.935811,Low spender,8.0,Frequent customer
1,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,...,Mid-range product,Regularly busy,Regularly busy,Average orders,32,Regular customer,6.935811,Low spender,8.0,Frequent customer
2,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,...,Mid-range product,Busiest day,Busiest days,Average orders,5,New customer,7.930208,Low spender,7.0,Frequent customer
3,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,...,Mid-range product,Regularly busy,Slowest days,Most orders,3,New customer,4.972414,Low spender,9.0,Frequent customer
4,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,...,Mid-range product,Least busy,Slowest days,Average orders,3,New customer,4.972414,Low spender,9.0,Frequent customer


Since both df_cust_drop and the merged data have a user_id column in common, these tables will be merged on the user_id column.


In [53]:
df_ords_prods_merge.shape

(32404859, 25)

## Step 6.1. Merging dataframes

In [54]:
#Merging df_cust_drop and df_ords_prods_merge on user_id
df_ords_prods_cust_merged = df_cust_drop.merge(df_ords_prods_merge, on = 'user_id')

In [55]:
df_ords_prods_cust_merged.head()

Unnamed: 0,user_id,gender,state,age,date_joined,number_of_dependants,family_status,income,product_id,product_name,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_spending,spender_flag,purchase_frequency,frequency_flag
0,26711,Female,Missouri,48,1/1/2017,3,married,165665,196,Soda,...,Mid-range product,Regularly busy,Busiest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
1,26711,Female,Missouri,48,1/1/2017,3,married,165665,196,Soda,...,Mid-range product,Regularly busy,Regularly busy,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
2,26711,Female,Missouri,48,1/1/2017,3,married,165665,196,Soda,...,Mid-range product,Regularly busy,Busiest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
3,26711,Female,Missouri,48,1/1/2017,3,married,165665,6184,Clementines,...,Low-range product,Regularly busy,Regularly busy,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer
4,26711,Female,Missouri,48,1/1/2017,3,married,165665,6184,Clementines,...,Low-range product,Regularly busy,Slowest days,Most orders,8,New customer,7.988889,Low spender,19.0,Regular customer


In [56]:
df_ords_prods_cust_merged.shape

(32404859, 32)

The merged dataframe has the same number of rows as the df_ords_prods_merge dataframe (32404859), which is consistent with every order row being matched to exactly one customer.
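
Comparing row counts is an indirect check. pandas can verify a merge more directly through the validate and indicator parameters of merge(). The sketch below demonstrates them on two small made-up tables (`customers` and `orders` are hypothetical names); on the full dataset an outer merge with indicator=True needs additional memory.

```python
import pandas as pd

# Toy tables: customer 3 has no orders, and user 4 has an order but no customer row.
customers = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Iowa', 'Idaho', 'Texas']})
orders = pd.DataFrame({'user_id': [1, 1, 2, 4], 'product_id': [196, 6184, 196, 1]})

merged = customers.merge(
    orders,
    on='user_id',
    how='outer',              # keep unmatched rows from both sides so they can be counted
    validate='one_to_many',   # raises MergeError if user_id repeats in customers
    indicator=True,           # adds a '_merge' column: both / left_only / right_only
)

counts = merged['_merge'].value_counts()
print(counts)
```

If only 'both' has a non-zero count, every row found a match, and an inner merge returns the same rows.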

# Step 7. Exporting Data

In [None]:
df_ords_prods_cust_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_customers_merged.pkl'))