# 🧹 Instacart Data Check – `df_ords`

This notebook runs a series of data checks and cleaning steps on the `df_ords` dataframe, following the steps from Exercise 4.5. The final cell also cleans the `df_prods` dataframe and exports both cleaned dataframes.

## 1. Setup and Imports

In [None]:
import pandas as pd
import os

# Project root folder, used by the export cell at the end of the notebook.
# Replace the placeholder with your own path.
path = 'your/project/folder'

# Load previously cleaned product data
df_prods = pd.read_pickle(os.path.join('02_Processed', 'products_checked.pkl'))

# Load orders data
df_ords = pd.read_pickle(os.path.join('02_Processed', 'orders_wrangled.pkl'))  # Adjust if necessary


## 2. Descriptive Statistics

In [None]:
# Generate basic descriptive statistics
df_ords.describe()


**Observations:**  
- All columns show values within expected ranges.  
- `order_hour_of_day` ranges from 0 to 23, as expected for a 24-hour clock.  
- No suspicious outliers found here.
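
The range check above can also be done programmatically rather than by reading the `describe()` output. The following is a minimal sketch on a toy dataframe; `toy_ords` and its values are made up for illustration and are not taken from the Instacart data.

```python
import pandas as pd

# Toy stand-in for df_ords; the values are illustrative only.
toy_ords = pd.DataFrame({"order_hour_of_day": [0, 7, 13, 23, 25]})

# Keep only the rows whose hour falls outside the 0-23 range of a 24-hour clock.
in_range = toy_ords["order_hour_of_day"].between(0, 23)
out_of_range = toy_ords[~in_range]

print(out_of_range)
```

An empty result on the real dataframe would support the observation that no hour values fall outside the expected range.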


## 3. Check for Mixed-Type Columns

In [None]:
# Identify mixed data types
for col in df_ords.columns:
    types = df_ords[col].apply(type).value_counts()
    if len(types) > 1:
        print(f"Mixed types found in '{col}':")
        print(types)


**Result:**  
- If the loop prints any columns, convert each of them to a single data type. Example:


In [None]:
# Example fix (uncomment and adjust if needed):
# df_ords['example_column'] = df_ords['example_column'].astype(str)
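
To show what the check and the fix do, here is a small sketch on a toy dataframe. The dataframe `toy`, its column names, and the helper `find_mixed_columns` are hypothetical and exist only for this illustration; converting to `str` is one possible fix, and the right target type depends on the column.

```python
import pandas as pd

# Toy dataframe: one consistent column and one column mixing int, str and float.
toy = pd.DataFrame({
    "consistent": [1, 2, 3],
    "mixed": [1, "two", 3.0],
})

def find_mixed_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].apply(type).nunique() > 1]

mixed_before = find_mixed_columns(toy)

# Convert the offending column to a single type, then check again.
toy["mixed"] = toy["mixed"].astype(str)
mixed_after = find_mixed_columns(toy)

print(mixed_before, mixed_after)
```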


## 4. Missing Values Check

In [None]:
# Check for null/missing values
df_ords.isnull().sum()


**Conclusion:**  
- Missing values in `days_since_prior_order` are expected for customers placing their first order.  
- No cleaning action needed here unless required for specific analysis.
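
The claim that the missing values belong to first orders can be verified rather than assumed. The sketch below assumes the dataframe has an `order_number` column in which a customer's first order is numbered 1; that column name is an assumption, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; `order_number` is an assumed column name.
toy_ords = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, 14.0, np.nan, 30.0],
})

is_missing = toy_ords["days_since_prior_order"].isnull()
is_first_order = toy_ords["order_number"] == 1

# True only if every missing value sits on a first order and vice versa.
missing_matches_first_order = bool((is_missing == is_first_order).all())

print(missing_matches_first_order)
```

If this check fails on the real data, some missing values have another cause and would need a closer look.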


## 5. Handling Missing Values

The column `days_since_prior_order` contains missing values that indicate a customer is **placing their first order**.
Instead of deleting or imputing these values, we create a new column `new_customer` that flags whether the value is missing.
This keeps the information about which orders are first orders.


In [None]:
# Create a new column to flag new customers
df_ords_clean = df_ords.copy()
df_ords_clean['new_customer'] = df_ords_clean['days_since_prior_order'].isnull()

# Display the updated dataframe
df_ords_clean.head()


## 6. Export Checked Orders Data (Optional)

In [None]:
# Save the checked orders data (the original df_ords, without the new_customer flag)
df_ords.to_pickle(os.path.join('02_Processed', 'orders_checked.pkl'))


In [None]:

# CLEANING PRODUCTS DATAFRAME

# Step 1: Check for missing values
# (print() is needed because a cell only displays its last expression)
print(df_prods.isnull().sum())

# Step 2: Drop duplicates (if any)
df_prods_clean_no_dups = df_prods.drop_duplicates()

# Step 3: Drop rows with missing product_name or department_id
df_prods_clean_no_dups = df_prods_clean_no_dups.dropna(subset=['product_name', 'department_id'])

# Step 4: Reset index
df_prods_clean_no_dups = df_prods_clean_no_dups.reset_index(drop=True)

# Step 5: Inspect cleaned dataframe
print(df_prods_clean_no_dups.head())

# EXPORT CLEANED DATAFRAMES (requires `path` to be defined in the setup cell)
df_prods_clean_no_dups.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'products_cleaned.pkl'))
df_ords_clean.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_cleaned.pkl'))
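
To see how many rows each product cleaning step removes, the steps can be tried on a toy dataframe first. This is a sketch only: `toy_prods` and its values are made up, and `product_id` is an assumed column name (`product_name` and `department_id` come from the cell above).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_prods: one exact duplicate row and two rows with missing values.
toy_prods = pd.DataFrame({
    "product_id": [1, 2, 2, 3, 4],
    "product_name": ["Apples", "Bread", "Bread", None, "Milk"],
    "department_id": [4, 3, 3, 16, np.nan],
})

# Count what will be removed before removing it.
n_duplicates = int(toy_prods.duplicated().sum())
n_missing = toy_prods[["product_name", "department_id"]].isnull().sum()

# Same steps as in the cleaning cell, chained into one expression.
toy_clean = (
    toy_prods
    .drop_duplicates()
    .dropna(subset=["product_name", "department_id"])
    .reset_index(drop=True)
)

print(n_duplicates, len(toy_prods), len(toy_clean))
```

Comparing the row counts before and after makes it easy to confirm that only the intended rows were dropped.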
