# Task 4.5: Data Consistency Checks

This notebook performs comprehensive data consistency checks on the Instacart project datasets.
We will examine the orders and products data for:
- Mixed data types
- Missing values  
- Duplicate records
- Data quality issues

The focus will be on the df_ords dataframe as specified in the task requirements.

## Importing Libraries and Data

In [12]:
import pandas as pd
import numpy as np
import os

# Set up project path
path = r'/Users/josephadamski/Instacart Basket Analysis'

# Import dataframes
df_prods = pd.read_csv(os.path.join(path, 'Data', 'Original Data', 'products.csv'), index_col=False)
df_ords = pd.read_csv(os.path.join(path, 'Data', 'Prepared Data', 'orders_wrangled.csv'), index_col=False)

print("Data successfully imported!")
print(f"Products dataframe shape: {df_prods.shape}")
print(f"Orders dataframe shape: {df_ords.shape}")

Data successfully imported!
Products dataframe shape: (49693, 5)
Orders dataframe shape: (3421083, 6)


## STEP 1: CONSISTENCY CHECKS ON df_prods

In [13]:
print("=== CHECKING df_prods FOR MIXED TYPES ===")
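mixed_found = False
# Flag rows whose Python type differs from the type in the column's first row.
# DataFrame.map requires pandas >= 2.1; older versions use DataFrame.applymap.
for col in df_prods.columns.tolist():
    weird = (df_prods[[col]].map(type) != df_prods[[col]].iloc[0].apply(type)).any(axis=1)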
    if len(df_prods[weird]) > 0:
        print(f"Mixed types found in column: {col}")
        mixed_found = True

if not mixed_found:
    print("No mixed data types found in df_prods!")

print("\n=== MISSING VALUES IN df_prods ===")
prods_missing = df_prods.isnull().sum()
print(prods_missing)

print("\n=== DUPLICATES IN df_prods ===")
df_prods_dups = df_prods[df_prods.duplicated()]
print(f"Number of duplicate rows: {len(df_prods_dups)}")
if len(df_prods_dups) > 0:
    print("Sample duplicates:")
    print(df_prods_dups.head())

# Clean df_prods for later export
df_prods_clean = df_prods[df_prods['product_name'].notna()].drop_duplicates()
print(f"\nCleaned df_prods shape: {df_prods_clean.shape}")

=== CHECKING df_prods FOR MIXED TYPES ===
Mixed types found in column: product_name

=== MISSING VALUES IN df_prods ===
product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

=== DUPLICATES IN df_prods ===
Number of duplicate rows: 5
Sample duplicates:
       product_id                                       product_name  \
462           462                  Fiber 4g Gummy Dietary Supplement   
18459       18458                                         Ranger IPA   
26810       26808               Black House Coffee Roasty Stout Beer   
35309       35306  Gluten Free Organic Peanut Butter & Chocolate ...   
35495       35491                            Adore Forever Body Wash   

       aisle_id  department_id  prices  
462          70             11     4.8  
18459        27              5     9.2  
26810        27              5    13.4  
35309       121             14     6.8  
35495       127             11     9.9  

Cleaned 

### Step 1 Complete - df_prods Consistency Checks
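**Mixed Types:** Mixed data types found in the product_name column, most likely because the 16 missing entries are stored as NaN (a float) alongside string names  
**Missing Values:** 16 missing values found in the product_name column  
**Duplicates:** 5 fully duplicated rows found
 
Rows with a missing product_name and the duplicate rows are dropped in df_prods_clean, which leaves 49,672 of the original 49,693 rows for export.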
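
The mixed-type check above can also report which Python types a column contains, which makes the cause easier to see. The following is a minimal sketch on a toy dataframe, not the notebook's own method; `find_mixed_type_columns` and `toy_prods` are hypothetical names introduced for illustration, and the values are not taken from products.csv.

```python
import numpy as np
import pandas as pd


def find_mixed_type_columns(df: pd.DataFrame) -> dict:
    """Return {column: set of Python type names} for columns holding more than one type."""
    mixed = {}
    for col in df.columns:
        # Collect the type name of every value in the column
        type_names = {type(value).__name__ for value in df[col]}
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed


# Toy stand-in for df_prods (illustrative values only)
toy_prods = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Ranger IPA", np.nan, "Body Wash"],
})

mixed_columns = find_mixed_type_columns(toy_prods)
print(mixed_columns)
```

In this toy case the report shows both `str` and `float` for product_name, because NaN is a float. That is consistent with the missing values being the source of the mixed types.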


## STEP 2: DESCRIPTIVE STATISTICS FOR df_ords

In [14]:
print("=== DESCRIPTIVE STATISTICS FOR df_ords ===")
df_ords.describe()

=== DESCRIPTIVE STATISTICS FOR df_ords ===


        order_id    user_id  order_number  orders_day_of_week  order_hour_of_day  days_since_prior_order
count  3421083.0  3421083.0     3421083.0           3421083.0          3421083.0               3214874.0
mean   1710542.0   102978.2      17.15486            2.776219           13.45202                11.11484
std     987581.7   59533.72      17.73316            2.046829           4.226088                9.206737
min          1.0        1.0           1.0                 0.0                0.0                     0.0
25%     855271.5    51394.0           5.0                 1.0               10.0                     4.0
50%    1710542.0   102689.0          11.0                 3.0               13.0                     7.0
75%    2565812.0   154385.0          23.0                 5.0               16.0                    15.0
max    3421083.0   206209.0         100.0                 6.0               23.0                    30.0


### Step 2 - Analysis of df_ords Descriptive Statistics
 
From the df_ords.describe() output, I can observe the following about the data quality:

**Orders Day of Week (orders_day_of_week):**
- Range: 0-6, which correctly represents the 7 days of the week
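- Mean of ~2.8 is close to the midpoint of 3, which is consistent with orders being spread across the week, although a mean alone cannot confirm an even distribution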
 
**Order Hour of Day (order_hour_of_day):** 
- Range: 0-23, which correctly represents the 24-hour format
- No values exceed 23, indicating clean time data
 
**Days Since Prior Order:**
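- The count (3,214,874) is lower than for the other columns (3,421,083)
- The difference is exactly 206,209 missing values, which equals the maximum user_id (206,209) and is consistent with one missing value per customer
- The missing values most likely belong to each customer's first order, which has no prior order to measure from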
 
**Order Numbers:**
- Maximum order_number of 100 is plausible for a frequent customer
- Minimum of 1 shows that order numbering starts at 1, as expected
 
**Overall Assessment:** The data ranges appear logical and within expected parameters.
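
The range checks read off the describe() table can also be written as explicit tests. The following is a small sketch on toy data; `toy_ords` and `expected_ranges` are hypothetical names, and the bounds are simply the ranges discussed above. It also computes the share of orders per weekday, which says more about the distribution than the mean does, and repeats the missing-count arithmetic.

```python
import pandas as pd

# Toy stand-in for df_ords (illustrative values only)
toy_ords = pd.DataFrame({
    "orders_day_of_week": [0, 1, 1, 3, 6, 6, 6],
    "order_hour_of_day": [0, 9, 13, 13, 17, 23, 8],
})

# Assumed valid ranges: 7 weekdays (0-6) and 24 hours (0-23)
expected_ranges = {"orders_day_of_week": (0, 6), "order_hour_of_day": (0, 23)}

# True for a column only if every value lies inside its expected range
range_ok = {
    col: bool(toy_ords[col].between(low, high).all())
    for col, (low, high) in expected_ranges.items()
}
print(range_ok)

# Share of orders per weekday, sorted by day number
day_share = toy_ords["orders_day_of_week"].value_counts(normalize=True).sort_index()
print(day_share)

# Missing values implied by the two counts in the describe() output
missing_from_counts = 3_421_083 - 3_214_874
print(missing_from_counts)
```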

## STEP 3: CHECK FOR MIXED-TYPE DATA IN df_ords

In [16]:
print("\n=== CHECKING df_ords FOR MIXED DATA TYPES ===")

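mixed_type_columns = []
# Flag rows whose Python type differs from the type in the column's first row.
# DataFrame.map requires pandas >= 2.1; older versions use DataFrame.applymap.
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].map(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)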
    if len(df_ords[weird]) > 0:
        print(f"Mixed types found in column: {col}")
        mixed_type_columns.append(col)

if len(mixed_type_columns) == 0:
    print("No mixed data types found in df_ords!")

print("\n=== DATA TYPES IN df_ords ===")
print(df_ords.dtypes)


=== CHECKING df_ords FOR MIXED DATA TYPES ===
No mixed data types found in df_ords!

=== DATA TYPES IN df_ords ===
order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object


### Step 3 - Mixed Data Type Analysis

**Results:** No mixed data types were detected in the df_ords dataframe.
 
**Data Type Summary:**
- ID columns (order_id, user_id) and order_number are stored as int64
- Categorical variables (orders_day_of_week, order_hour_of_day) are stored as int64 codes
- days_since_prior_order is float64, because it contains NaN values, which a plain int64 column cannot hold
 
**Conclusion:** The data types are consistent and appropriate for each column. No action needed.
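
The float64 dtype of days_since_prior_order follows from how pandas stores missing values. The sketch below uses toy Series with hypothetical names to show the effect. It also shows pandas' nullable `Int64` dtype as one possible alternative; the notebook does not use that dtype, so this is only an illustration.

```python
import numpy as np
import pandas as pd

# Whole numbers with no gaps are stored as integers
complete = pd.Series([7, 15, 30])
print(complete.dtype)

# A single NaN forces the whole column to float64
with_gap = pd.Series([np.nan, 15, 30])
print(with_gap.dtype)

# The nullable Int64 dtype keeps integers and marks the gap as <NA>
nullable = with_gap.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```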

## STEP 4: FIX MIXED-TYPE DATA (IF FOUND)

In [19]:
# Since no mixed-type data was found, no fixes are needed
# If mixed types were found, we would use:
# df_ords['column_name'] = df_ords['column_name'].astype('desired_type')

print("No mixed-type data to fix in df_ords")

No mixed-type data to fix in df_ords
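
For reference, the astype() fix mentioned in the comment could look like the sketch below. It uses a toy dataframe with a hypothetical column `example_col`; df_ords itself needs no such fix.

```python
import pandas as pd

# Toy column mixing integers and a string (illustrative values only)
toy = pd.DataFrame({"example_col": [1, "2", 3]})

types_before = {type(value).__name__ for value in toy["example_col"]}
print(types_before)

# Cast every value to a string so the column holds a single type
toy["example_col"] = toy["example_col"].astype("str")

types_after = {type(value).__name__ for value in toy["example_col"]}
print(types_after)
```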


## STEP 5: CHECK FOR MISSING VALUES IN df_ords

In [20]:
print("\n=== MISSING VALUES ANALYSIS IN df_ords ===")
missing_values = df_ords.isnull().sum()
print(missing_values)

# Examine the missing values in detail
df_missing_analysis = df_ords[df_ords['days_since_prior_order'].isnull()]
print(f"\nSample of records with missing days_since_prior_order:")
print(df_missing_analysis[['order_id', 'user_id', 'order_number', 'days_since_prior_order']].head(10))

# Check if missing values correspond to first orders
first_order_check = df_missing_analysis['order_number'].value_counts().sort_index()
print(f"\nOrder numbers for records with missing days_since_prior_order:")
print(first_order_check.head())


=== MISSING VALUES ANALYSIS IN df_ords ===
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

Sample of records with missing days_since_prior_order:
    order_id  user_id  order_number  days_since_prior_order
0    2539329        1             1                     NaN
11   2168274        2             1                     NaN
26   1374495        3             1                     NaN
39   3343014        4             1                     NaN
45   2717275        5             1                     NaN
50   2086598        6             1                     NaN
54   2565571        7             1                     NaN
75    600894        8             1                     NaN
79    280530        9             1                     NaN
83   1224907       10             1                     NaN

Order numbers for records wi

### Step 5 - Missing Values Analysis
 
**Findings:**
- 206,209 missing values found in the 'days_since_prior_order' column (6.03% of total records)
- All other columns have complete data
 
**Root Cause Analysis:**
Upon examination, ALL missing values correspond to records where order_number = 1.
This makes sense: a customer's first order has no previous order,
so there is no "days since prior order" value to calculate.
 
**Proposed Solution:**
Keep these rows and add a flag that identifies first orders, which preserves the data structure.
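
The claim that missing values and first orders coincide can be tested in both directions. The sketch below does so on toy data with hypothetical names; it assumes the same column names as df_ords. It also recomputes the 6.03% share from the two counts reported above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords: two customers with 3 and 2 orders
toy_ords = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 10.0],
})

is_missing = toy_ords["days_since_prior_order"].isna()
is_first = toy_ords["order_number"].eq(1)

# Direction 1: every missing value sits on a first order
missing_implies_first = bool(is_first[is_missing].all())
# Direction 2: every first order has a missing value
first_implies_missing = bool(is_missing[is_first].all())
# Both directions together: the two masks agree row by row
masks_identical = bool((is_missing == is_first).all())
print(missing_implies_first, first_implies_missing, masks_identical)

# Share of missing values using the counts reported for df_ords
missing_pct = round(206_209 / 3_421_083 * 100, 2)
print(missing_pct)
```

Checking only one direction would miss cases such as a later order with a missing value, so both masks are compared.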

## STEP 6: ADDRESS MISSING VALUES

In [21]:
print("\n=== ADDRESSING MISSING VALUES ===")

# Create a flag for first-time orders
df_ords['first_order_flag'] = df_ords['days_since_prior_order'].isnull()

# Verify the flag works correctly
first_order_summary = df_ords['first_order_flag'].value_counts()
print("First Order Flag Summary:")
print(f"First-time orders (True): {first_order_summary[True]:,}")
print(f"Repeat orders (False): {first_order_summary[False]:,}")

# Cross-check with order_number = 1
verification = df_ords[df_ords['order_number'] == 1]['first_order_flag'].all()
print(f"\nVerification: All order_number = 1 have first_order_flag = True: {verification}")

print(f"Dataframe shape after adding flag: {df_ords.shape}")


=== ADDRESSING MISSING VALUES ===
First Order Flag Summary:
First-time orders (True): 206,209
Repeat orders (False): 3,214,874

Verification: All order_number = 1 have first_order_flag = True: True
Dataframe shape after adding flag: (3421083, 7)


### Step 6 - Missing Values Resolution
 
**Method Used:** Created a boolean flag 'first_order_flag' that marks each customer's first order

**Why I chose this method:** I created a flag instead of dropping records because the missing 
values represent a legitimate business case (a first order has no prior order). Dropping these 
records would remove 206,209 rows (6.03% of the data), including every customer's first order. 
The flag preserves all data while enabling customer segmentation analysis.
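
The trade-off between flagging and dropping can be seen on a small example. The following is a sketch on toy data with hypothetical names, not a rerun of the notebook. It compares row counts under both approaches and uses the flag to compute a statistic for repeat orders only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords (illustrative values only)
toy_ords = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 10.0],
})

# Flag approach: every row is kept, first orders are marked
toy_ords["first_order_flag"] = toy_ords["days_since_prior_order"].isna()
rows_with_flag = len(toy_ords)

# Drop approach: rows with a missing value are removed
rows_if_dropped = len(toy_ords.dropna(subset=["days_since_prior_order"]))
print(rows_with_flag, rows_if_dropped)

# The flag still allows statistics restricted to repeat orders
repeat_mean_gap = toy_ords.loc[
    ~toy_ords["first_order_flag"], "days_since_prior_order"
].mean()
print(repeat_mean_gap)
```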

## STEP 7: CHECK FOR DUPLICATE VALUES IN df_ords

In [22]:
print("\n=== CHECKING FOR DUPLICATE RECORDS ===")

df_ords_duplicates = df_ords[df_ords.duplicated()]
duplicate_count = len(df_ords_duplicates)

print(f"Number of duplicate rows found: {duplicate_count:,}")

if duplicate_count > 0:
    print("\nSample of duplicate records:")
    print(df_ords_duplicates.head())
else:
    print("No duplicate records found in df_ords!")

# Check for duplicates in key identifier columns
print("\n=== CHECKING KEY COLUMN UNIQUENESS ===")
print(f"Unique order_ids: {df_ords['order_id'].nunique():,} (should equal total rows)")
print(f"Total rows: {len(df_ords):,}")
print(f"Order_id uniqueness: {'PASS' if df_ords['order_id'].nunique() == len(df_ords) else 'FAIL'}")


=== CHECKING FOR DUPLICATE RECORDS ===
Number of duplicate rows found: 0
No duplicate records found in df_ords!

=== CHECKING KEY COLUMN UNIQUENESS ===
Unique order_ids: 3,421,083 (should equal total rows)
Total rows: 3,421,083
Order_id uniqueness: PASS


### Step 7 - Duplicate Records Analysis
 
**Findings:** No complete duplicate records found in df_ords
 
**Key Column Analysis:**
- Order_id uniqueness: VERIFIED
- Each order_id appears exactly once, confirming data integrity
- Total unique order_ids matches total row count (3,421,083)
 
**Assessment:** order_id is unique across all rows, so it can serve as a reliable key, and the dataset contains no duplicate orders.
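
Full-row duplicates and key duplicates are different checks, and a dataset can pass one while failing the other. The sketch below shows both on toy data with hypothetical values. It uses `duplicated(keep=False)` so that every copy is listed, not only the later ones.

```python
import pandas as pd

# Toy orders: order_id 11 is a full-row duplicate,
# order_id 12 repeats with a different user_id
toy_ords = pd.DataFrame({
    "order_id": [10, 11, 11, 12, 12],
    "user_id": [1, 2, 2, 3, 4],
    "order_number": [1, 1, 1, 1, 1],
})

# Rows that are identical in every column
full_row_dups = toy_ords[toy_ords.duplicated(keep=False)]
print(full_row_dups)

# Rows that share an order_id, even if other columns differ
key_dups = toy_ords[toy_ords.duplicated(subset=["order_id"], keep=False)]
print(key_dups)

# Shortcut for the key uniqueness check
order_id_unique = bool(toy_ords["order_id"].is_unique)
print(order_id_unique)
```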

## STEP 8: ADDRESS DUPLICATES

In [24]:
# Since no duplicates were found, no action is needed
print("No duplicate records to remove from df_ords")

# If duplicates had been found, we would use:
# df_ords_clean = df_ords.drop_duplicates()

No duplicate records to remove from df_ords


### Step 8 - Duplicate Handling
 
**Method Used:** No action required as no duplicates were found
 
**Why I chose this method:** df_ords contains no duplicate rows, so it is left unchanged. 
If duplicates had been found, I would have used drop_duplicates() to keep one copy of 
each record.
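
For completeness, the drop_duplicates() fallback could look like this sketch on a toy dataframe with hypothetical values. It counts the removed rows and confirms that no duplicates remain.

```python
import pandas as pd

# Toy orders with one fully duplicated row (illustrative values only)
toy = pd.DataFrame({"order_id": [10, 11, 11, 12], "user_id": [1, 2, 2, 3]})

n_before = len(toy)
# Keep the first copy of each row and rebuild a clean 0..n-1 index
toy_clean = toy.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(toy_clean)
print(n_removed)

no_dups_left = not toy_clean.duplicated().any()
print(no_dups_left)
```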

## STEP 9: EXPORT CLEANED DATA

In [23]:
print("\n=== EXPORTING CLEANED DATASETS ===")

# Export the cleaned orders data with the new flag
orders_export_path = os.path.join(path, 'Data', 'Prepared Data', 'orders_checked.csv')
df_ords.to_csv(orders_export_path, index=False)
print(f"Orders data exported: orders_checked.csv")

# Export the cleaned products data  
products_export_path = os.path.join(path, 'Data', 'Prepared Data', 'products_checked.csv')
df_prods_clean.to_csv(products_export_path, index=False)
print(f"Products data exported: products_checked.csv")

print(f"\nFILES CREATED:")
print(f"- orders_checked.csv ({len(df_ords):,} records)")
print(f"- products_checked.csv ({len(df_prods_clean):,} records)")

print(f"\nTASK 4.5 COMPLETE!")


=== EXPORTING CLEANED DATASETS ===
Orders data exported: orders_checked.csv
Products data exported: products_checked.csv

FILES CREATED:
- orders_checked.csv (3,421,083 records)
- products_checked.csv (49,672 records)

TASK 4.5 COMPLETE!
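
A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. The sketch below does this for a toy dataframe written to a temporary directory; it does not touch the project folder, and all names other than the column names are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the flagged df_ords (illustrative values only)
toy_ords = pd.DataFrame({
    "order_id": [10, 11, 12],
    "days_since_prior_order": [np.nan, 15.0, 7.0],
})
toy_ords["first_order_flag"] = toy_ords["days_since_prior_order"].isna()

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = os.path.join(tmp_dir, "orders_checked.csv")
    toy_ords.to_csv(export_path, index=False)
    reloaded = pd.read_csv(export_path)

# The reloaded file should have the same shape and keep the flag as a boolean
shape_matches = reloaded.shape == toy_ords.shape
flag_dtype = str(reloaded["first_order_flag"].dtype)
missing_after_reload = int(reloaded["days_since_prior_order"].isna().sum())
print(shape_matches, flag_dtype, missing_after_reload)
```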
