**Table of contents**<a id='toc0_'></a>    
- 1. [Importing Data](#toc1_)    
- 2. [products.csv Data Consistency Checks](#toc2_)    
- 3. [orders_wrangled.csv Data Consistency Checks](#toc3_)    
- 4. [Exporting data frames as csv files](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# 1. <a id='toc1_'></a>[Importing Data](#toc0_)

In [108]:
# Importing Libraries
import numpy as np
import pandas as pd
import os

In [109]:
Path = r'D:\Data Analysis\01-08-2025 Instacart Basket Analysis\Data'
df_pro = pd.read_csv(os.path.join(Path, 'Original Data', 'products.csv'), index_col=False)
df_ord = pd.read_csv(os.path.join(Path, 'Prepared Data', 'orders_wrangled.csv'), index_col=False)
df_ord.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,prior,1,2,8,
1,1,2398795,1,prior,2,3,7,15.0
2,2,473747,1,prior,3,3,12,21.0
3,3,2254736,1,prior,4,4,7,29.0
4,4,431534,1,prior,5,4,15,28.0


In [110]:
# Creating a test dataframe to practice
df_test = pd.DataFrame()
df_test['mix'] = ['a', 'b', 1, True]
df_test

Unnamed: 0,mix
0,a
1,b
2,1
3,True


# 2. <a id='toc2_'></a>[products.csv Data Consistency Checks](#toc0_)

In [111]:
# Check each column for mixed data types: compare the type of every value with the type of
# the column's first value, and print the column name if any value differs.
for col in df_test.columns.tolist():
    weird = (df_test[[col]].map(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)

mix


In [112]:
# Convert the mixed-type column to strings so that every value shares one type
df_test['mix'] = df_test['mix'].astype('str')
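The mixed-type check can also be wrapped in a small reusable function. The sketch below is one possible variant, not the loop used in this notebook: it counts the distinct Python types per column instead of comparing against the first value. The names `find_mixed_type_columns` and `df_demo` are hypothetical and introduced only for illustration.

```python
import pandas as pd


def find_mixed_type_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns whose values span more than one Python type."""
    mixed = []
    for col in df.columns:
        # Map each value to its Python type and count the distinct types
        n_types = df[col].map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


# Hypothetical demo frame: 'mix' holds str, int and bool values; 'clean' holds only integers
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
mixed_cols = find_mixed_type_columns(df_demo)
print(mixed_cols)

# After casting to string, the column is no longer reported
df_demo['mix'] = df_demo['mix'].astype('str')
mixed_after_cast = find_mixed_type_columns(df_demo)
print(mixed_after_cast)
```

Note that a text column containing NaN would also be reported by a type-based check like this one, because NaN is stored as a float.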

In [113]:
#Figuring out which columns have null values
df_pro.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [None]:
# Create a subset of df_pro that contains only the rows where product_name is missing.
# Alternative: df_nan = df_pro.loc[df_pro['product_name'].isna()]
df_nan = df_pro[df_pro['product_name'].isnull()]
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [115]:
# Keep only the rows where product_name is present
df_pro_clean = df_pro[df_pro['product_name'].notna()]
df_pro_clean.shape

(49677, 5)
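As a minimal sketch of the filtering step above, the same boolean mask can produce both the subset with missing names and the cleaned subset, and `dropna(subset=...)` gives the same cleaned result. The frame `df_demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with one missing product name
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Tea', np.nan, 'Soap'],
})

# One mask, two complementary subsets
missing_mask = df_demo['product_name'].isna()
df_missing = df_demo[missing_mask]
df_present = df_demo[~missing_mask]

# dropna restricted to one column removes the same rows
df_dropna = df_demo.dropna(subset=['product_name'])
same_result = df_present.equals(df_dropna)
print(same_result)
```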

In [116]:
# List rows that are exact duplicates of an earlier row (all columns identical)
df_pro_dups = df_pro_clean[df_pro_clean.duplicated()]
df_pro_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [117]:
df_pro_clean.shape

(49677, 5)

In [118]:
# Drop the duplicate rows, keeping the first occurrence of each
df_pro_clean_noDup = df_pro_clean.drop_duplicates()
df_pro_clean_noDup.shape

(49672, 5)
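The drop from 49,677 to 49,672 rows matches the five duplicates listed earlier. A small sketch of the same pattern on made-up data follows; `df_demo` is hypothetical. It also shows `keep=False`, which marks every copy of a duplicated row and can make it easier to inspect the duplicates side by side.

```python
import pandas as pd

# Hypothetical demo frame with one fully duplicated row
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Tea', 'Ranger IPA', 'Ranger IPA', 'Soap'],
    'prices': [3.5, 9.2, 9.2, 9.9],
})

# keep=False marks every copy of a duplicated row, not only the later repeats
all_copies = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row by default
deduped = df_demo.drop_duplicates()

# The number of removed rows should equal the number of repeats
rows_removed = len(df_demo) - len(deduped)
print(all_copies)
print(rows_removed)
```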

# 3. <a id='toc3_'></a>[orders_wrangled.csv Data Consistency Checks](#toc0_)

In [119]:
# Task 2
df_ord.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


All columns except days_since_prior_order show values that fall within their logical ranges. The count for days_since_prior_order (3,214,874) is lower than the count for every other column (3,421,083), a gap of 206,209 rows, which points to missing values that require further investigation.
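Since `count()` ignores NaN, the gap between the row count and each column's count gives the number of missing values per column. A minimal sketch on made-up data, with `df_demo` as a hypothetical stand-in for df_ord:

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: two of five rows lack days_since_prior_order
df_demo = pd.DataFrame({
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

# count() skips NaN, so rows minus count is the number of missing values
missing_per_column = len(df_demo) - df_demo.count()
print(missing_per_column)
```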

In [120]:
# Task 3: Check df_ord for mixed-type columns; no column names printed means none were found.
for col in df_ord.columns.tolist():
    weird = (df_ord[[col]].map(type) != df_ord[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ord[weird]) > 0:
        print(col)

In [121]:
#Task 5: Checking for missing values
df_ord.isna().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_day_of_week              0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

There are 206,209 orders with a missing days_since_prior_order value. This is unlikely to be a processing error in which zeros were read as missing, because 67,755 orders show 0 for days elapsed. A more likely explanation is that these rows are each user's first order, which has no prior order to measure from. Imputing these values would distort the patterns in the data frame, since neither the mean nor the median would represent real customer behavior.
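One way to test the first-order explanation is to compare the rows with missing values against the rows where order_number equals 1. The sketch below uses made-up data; `df_demo` is hypothetical, and the check assumes order_number starts at 1 for every user.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: each user's first order has no prior-order gap
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 0.0, np.nan, 7.0],
})

is_missing = df_demo['days_since_prior_order'].isna()
is_first_order = df_demo['order_number'] == 1

# True only if the missing values and the first orders are exactly the same rows
missing_only_on_first_order = bool((is_missing == is_first_order).all())

# One missing value per user is expected if each user has a single first order
missing_per_user = df_demo[is_missing].groupby('user_id').size()
print(missing_only_on_first_order)
print(missing_per_user)
```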

In [122]:
#Generating a subset that lists user_ids with missing/empty days since prior order values.
user_id_nan = df_ord.loc[df_ord['days_since_prior_order'].isna()]
user_id_nan

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,prior,1,2,8,
11,11,2168274,2,prior,1,2,11,
26,26,1374495,3,prior,1,1,14,
39,39,3343014,4,prior,1,6,11,
45,45,2717275,5,prior,1,3,12,
...,...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,prior,1,4,12,
3420934,3420934,3189322,206206,prior,1,3,18,
3421002,3421002,2166133,206207,prior,1,6,19,
3421019,3421019,2227043,206208,prior,1,1,15,


In [123]:
#Creating a new data frame with details of only user_ids detected by user_id_nan
user_id_withNAN = df_ord.loc[df_ord['user_id'].isin(user_id_nan['user_id'])].sort_values('user_id')
user_id_withNAN.head(50)

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,prior,1,2,8,
1,1,2398795,1,prior,2,3,7,15.0
6,6,550135,1,prior,7,1,9,20.0
5,5,3367565,1,prior,6,2,7,19.0
10,10,1187899,1,train,11,4,8,14.0
9,9,2550362,1,prior,10,4,8,30.0
8,8,2295261,1,prior,9,1,16,0.0
7,7,3108588,1,prior,8,1,14,14.0
3,3,2254736,1,prior,4,4,7,29.0
2,2,473747,1,prior,3,3,12,21.0


In [124]:
# Keep only the orders that have a days_since_prior_order value
df_ord_clean = df_ord[df_ord['days_since_prior_order'].notna()]
df_ord_clean.shape

(3214874, 8)

Task 6: The results above show that the null values arise because each user's first order has no order history, so it is better not to drop or impute them and to work with df_ord as it is. The df_ord_clean data frame, which excludes the 206,209 rows with missing values, was created in case any later code requires a data frame without nulls. The user_id_nan subset created above lists the rows with missing days_since_prior_order values.
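A possible way to keep the nulls while still making them easy to filter is a boolean flag column. This is only a sketch: the column name `first_order` and the frame `df_demo` are hypothetical and not part of the notebook. It also illustrates why filling with zeros would be misleading.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with two first orders
df_demo = pd.DataFrame({
    'order_number': [1, 2, 3, 1],
    'days_since_prior_order': [np.nan, 10.0, 20.0, np.nan],
})

# Hypothetical flag column marking rows with no prior order
df_demo['first_order'] = df_demo['days_since_prior_order'].isna()

# mean() skips NaN by default, so the missing rows do not bias the result
mean_gap = df_demo['days_since_prior_order'].mean()

# Filling with 0 would treat first orders as same-day reorders and pull the mean down
mean_if_zero_filled = df_demo['days_since_prior_order'].fillna(0).mean()
print(mean_gap, mean_if_zero_filled)
```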

In [125]:
#Task 7: Showing duplicate values
df_ord_dups = df_ord[df_ord.duplicated()]
df_ord_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order


Task 7: No duplicate rows were found in the df_ord data frame.

# 4. <a id='toc4_'></a>[Exporting data frames as csv files](#toc0_)

In [126]:
# Task 9: Export df_pro_clean_noDup and df_ord_clean as csv files. df_ord may be used instead, depending on requirements.
# index=False keeps the row index out of the files, so no extra 'Unnamed: 0' column appears on re-import.
df_pro_clean_noDup.to_csv(os.path.join(Path, 'Prepared Data', 'df_pro_NoNANDups.csv'), index=False)
df_ord_clean.to_csv(os.path.join(Path, 'Prepared Data', 'df_ord_NoNANDups.csv'), index=False)
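The `Unnamed: 0` column seen in df_ord is the kind of column that typically appears when a data frame is written with its index and then read back. The sketch below shows the effect with an in-memory buffer instead of files on disk; `df_demo` and the buffer names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical demo frame
df_demo = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 1]})

# Default export writes the row index as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the row index out of the file
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```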