# Notebook 01: Data loading & memory optimization

## Why does this step matter?

### The Problem

The Instacart dataset contains **32+ million order-product pairs**. Loading this data with default pandas data types (int64, float64) consumes approximately **1.5GB of RAM**. This creates several problems:

1. **Team collaboration and GitHub compliance**: Not all team members have powerful machines, and GitHub limits each file to 100 MB.
2. **Processing speed**: The larger the dataset, the slower the operations.

### The Solution

**We optimize memory through data type conversion.** Integer and text values are preserved exactly; `float32` keeps about 7 significant decimal digits. This is achieved by:
- Converting `int64` → `int32/int16/int8` based on actual value ranges
- Converting `float64` → `float32` for decimal values
- Converting `object` → `category` for text columns with few unique values
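
The three conversions listed above can be illustrated on a small synthetic frame. This is a minimal sketch rather than the notebook's own code: the data is randomly generated, and the column names are only assumed to resemble those in `orders.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of the order data (all values are made up).
n = 100_000
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "order_id": np.arange(n, dtype="int64"),                    # too large for int16
    "order_dow": rng.integers(0, 7, size=n).astype("int64"),    # fits in int8
    "days_since_prior_order": rng.integers(0, 31, size=n).astype("float64"),
    "eval_set": rng.choice(["prior", "train", "test"], size=n),  # few unique strings
})

before_mb = demo.memory_usage(deep=True).sum() / 1024**2

# Apply one conversion of each kind described above.
demo["order_id"] = demo["order_id"].astype("int32")
demo["order_dow"] = demo["order_dow"].astype("int8")
demo["days_since_prior_order"] = demo["days_since_prior_order"].astype("float32")
demo["eval_set"] = demo["eval_set"].astype("category")

after_mb = demo.memory_usage(deep=True).sum() / 1024**2
print(f"before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
```

Passing `deep=True` makes pandas count the contents of string columns, which is where the `category` conversion tends to save the most.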


In [8]:
### Step 1: Import Required Libraries

import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import warnings

pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)

print("="*70)
print("NOTEBOOK 01: DATA LOADING & MEMORY OPTIMIZATION")
print("="*70)
print(f"Execution started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

NOTEBOOK 01: DATA LOADING & MEMORY OPTIMIZATION
Execution started: 2026-02-20 13:35:22


**Why these libraries?**
- pandas: Data manipulation and analysis
- numpy: Numerical operations
- os: File path operations
- json: Saving metadata for PDF report
- datetime: Timestamps for documentation

In [9]:
DATA_DIR = '../data'
OUTPUT_DIR = '../data/processed'

os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Data directory: {DATA_DIR}")
print(f"Output directory: {OUTPUT_DIR}")

Data directory: ../data
Output directory: ../data/processed


We keep raw and processed data in separate directories so that the raw files stay untouched and the pipeline remains reproducible.

In [25]:
print("\n[1/3] Loading data...")

orders = pd.read_csv(f'{DATA_DIR}/orders.csv')
print(f"   orders: {len(orders):,} rows")

orders_prior = pd.read_csv(f'{DATA_DIR}/order_products__prior.csv')
print(f"   orders_prior: {len(orders_prior):,} rows")

products = pd.read_csv(f'{DATA_DIR}/products.csv')
print(f"   products: {len(products):,} rows")

products_priced = pd.read_csv(f'{DATA_DIR}/product_with_price(in).csv')
print(f"   products priced: {len(products_priced):,} rows")

aisles = pd.read_csv(f'{DATA_DIR}/aisles.csv')
print(f"   aisles: {len(aisles)} rows (134 aisles)")

departments = pd.read_csv(f'{DATA_DIR}/departments.csv')
print(f"   departments: {len(departments)} rows (21 departments)")


[1/3] Loading data...
   orders: 3,421,083 rows
   orders_prior: 32,434,489 rows
   products: 49,688 rows
   products priced: 1,000 rows
   aisles: 134 rows (134 aisles)
   departments: 21 rows (21 departments)


Since every file is needed for our analysis, we load them all at once. This also enables us to check data quality before any processing.
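
An alternative worth considering is to declare the types at load time through the `dtype` parameter of `pd.read_csv`, so that the `int64` intermediates are never created. The sketch below assumes the column layout of the public Instacart `orders.csv`; the rows and the dtype map are invented for illustration and are not taken from this notebook.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for orders.csv; the rows are invented.
csv_text = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
1,10,prior,1,2,8,
2,10,prior,2,3,7,15.0
3,11,train,1,4,12,
"""

# Assumed dtype map, chosen by hand for this example.
orders_dtypes = {
    "order_id": "int32",
    "user_id": "int32",
    "eval_set": "category",
    "order_number": "int16",
    "order_dow": "int8",
    "order_hour_of_day": "int8",
    "days_since_prior_order": "float32",  # float because the column has missing values
}

orders_demo = pd.read_csv(io.StringIO(csv_text), dtype=orders_dtypes)
print(orders_demo.dtypes)
```

This only works when the value ranges are known in advance. Inspecting the data after a default load, as done in this notebook, is the safer first pass.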

In [26]:
print("\n[2/3] Optimizing memory...")

def optimize_dtypes(df, name="DataFrame"):
    """
    Reduce memory by converting columns to smaller data types.
    The DataFrame is modified in place and also returned.
    Source: Project2_PythonML_A25.pdf Section 4
    """
    print(f"\nOptimizing {name}...")
    # Footprint before conversion, in MB (computed for reference; not printed)
    memory_before = df.memory_usage(deep=True).sum() / (1024 ** 2)

    for col in df.columns:
        col_type = df[col].dtype

        # Integers: pick the smallest signed type that holds the observed range
        if col_type == 'int64':
            col_min, col_max = df[col].min(), df[col].max()
            if col_min >= -128 and col_max <= 127:
                df[col] = df[col].astype('int8')
            elif col_min >= -32768 and col_max <= 32767:
                df[col] = df[col].astype('int16')
            elif col_min >= -2**31 and col_max <= 2**31 - 1:
                df[col] = df[col].astype('int32')
            # Otherwise keep int64: the values would overflow int32

        # Floats: float32 keeps about 7 significant digits
        elif col_type == 'float64':
            df[col] = df[col].astype('float32')

        # Text: use category when fewer than half of the values are unique
        elif col_type == 'object' and df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    return df

# Apply optimization
orders = optimize_dtypes(orders, "orders")
orders_prior = optimize_dtypes(orders_prior, "orders_prior")
products = optimize_dtypes(products, "products")
products_priced = optimize_dtypes(products_priced, "products_priced")
aisles = optimize_dtypes(aisles, "aisles")
departments = optimize_dtypes(departments, "departments")


[2/3] Optimizing memory...

Optimizing orders...

Optimizing orders_prior...

Optimizing products...

Optimizing products_priced...

Optimizing aisles...

Optimizing departments...
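
To quantify the effect of the optimization, the memory footprint can be measured before and after the conversion. The helpers below are a hypothetical sketch (`memory_mb` and `report_savings` are not part of the notebook), shown on a made-up single-column frame.

```python
import numpy as np
import pandas as pd

def memory_mb(df):
    """Return the DataFrame's memory footprint in MB, including string contents."""
    return df.memory_usage(deep=True).sum() / 1024**2

def report_savings(before_mb, after_mb, name="DataFrame"):
    """Format a one-line before/after summary (hypothetical helper)."""
    saved_pct = 100 * (before_mb - after_mb) / before_mb if before_mb else 0.0
    return f"{name}: {before_mb:.2f} MB -> {after_mb:.2f} MB ({saved_pct:.1f}% saved)"

# Made-up data: 1,000 integers between 0 and 99, stored as int64.
demo = pd.DataFrame({"small_ints": np.arange(1000, dtype="int64") % 100})

before = memory_mb(demo)
demo["small_ints"] = demo["small_ints"].astype("int8")
after = memory_mb(demo)

print(report_savings(before, after, "demo"))
```

Measuring `before` prior to the conversion matters here because `astype` assignments overwrite the original columns.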


In [27]:
print("\n[3/3] Saving optimized data...")

# Save all files to processed folder
orders.to_csv(f'{OUTPUT_DIR}/orders_optimized.csv', index=False)
orders_prior.to_csv(f'{OUTPUT_DIR}/orders_prior_optimized.csv', index=False)
products.to_csv(f'{OUTPUT_DIR}/products_optimized.csv', index=False)
products_priced.to_csv(f'{OUTPUT_DIR}/products_priced_optimized.csv', index=False)
aisles.to_csv(f'{OUTPUT_DIR}/aisles_optimized.csv', index=False)
departments.to_csv(f'{OUTPUT_DIR}/departments_optimized.csv', index=False)

print("  All files saved to ../data/processed/")
print(f"\nExecution completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


[3/3] Saving optimized data...
  All files saved to ../data/processed/

Execution completed: 2026-02-20 14:59:28
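
One caveat: CSV is a plain-text format and does not store column types, so a plain `pd.read_csv` on the saved files yields `int64`/`float64` again. A possible workaround, sketched below under assumed file and column names, is to save the dtypes in a JSON sidecar file and pass them back through the `dtype` parameter when reloading.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up frame that already uses compact dtypes.
demo = pd.DataFrame({
    "product_id": pd.Series([1, 2, 3], dtype="int32"),
    "aisle_id": pd.Series([61, 104, 94], dtype="int16"),
    "price": pd.Series([1.5, 2.25, 3.0], dtype="float32"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo_optimized.csv")   # hypothetical file name
    dtypes_path = os.path.join(tmp_dir, "demo_dtypes.json")  # hypothetical sidecar

    # Save the data and, separately, a column -> dtype-name mapping.
    demo.to_csv(csv_path, index=False)
    with open(dtypes_path, "w") as f:
        json.dump({col: str(dtype) for col, dtype in demo.dtypes.items()}, f)

    # A plain reload falls back to the default 64-bit types.
    naive = pd.read_csv(csv_path)

    # Reloading with the saved mapping restores the compact types.
    with open(dtypes_path) as f:
        saved_dtypes = json.load(f)
    restored = pd.read_csv(csv_path, dtype=saved_dtypes)

print(naive.dtypes.to_dict())
print(restored.dtypes.to_dict())
```

A binary format such as Parquet stores the types in the file itself and may be a simpler option if an extra dependency (for example `pyarrow`) is acceptable.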


In [22]:
print("\n" + "="*60)
print("NOTEBOOK 01 COMPLETE")
print("="*60)

print(f"\nFiles created:")
for file in os.listdir(OUTPUT_DIR):
    if file.endswith('.csv'):
        size_mb = os.path.getsize(f'{OUTPUT_DIR}/{file}') / 1024 / 1024
        print(f"  ✓ {file}: {size_mb:.2f} MB")

print(f"\nNext: Run Notebook 02 (Currency Conversion)")


NOTEBOOK 01 COMPLETE

Files created:
  ✓ aisles_optimized.csv: 0.00 MB
  ✓ aisle_performance.csv: 0.00 MB
  ✓ baskets.csv: 245.92 MB
  ✓ departments_optimized.csv: 0.00 MB
  ✓ department_performance.csv: 0.00 MB
  ✓ orders_optimized.csv: 106.59 MB
  ✓ orders_prior_optimized.csv: 581.73 MB
  ✓ products_optimized.csv: 2.11 MB
  ✓ rfm_customer_segments.csv: 10.56 MB

Next: Run Notebook 02 (Currency Conversion)
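
In the listing above, `orders_optimized.csv` (106.59 MB) and `orders_prior_optimized.csv` (581.73 MB) are still larger than the 100 MB GitHub limit mentioned in the motivation. A small check like the following could flag such files before a commit. It is a sketch only: `files_over_limit` is a hypothetical helper, and the demo uses tiny temporary files with an artificially low limit.

```python
import os
import tempfile

GITHUB_LIMIT_MB = 100  # per-file limit cited in the motivation section

def files_over_limit(directory, limit_mb=GITHUB_LIMIT_MB):
    """Return (filename, size_mb) pairs for CSV files larger than limit_mb."""
    oversized = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".csv") and os.path.isfile(path):
            size_mb = os.path.getsize(path) / 1024 / 1024
            if size_mb > limit_mb:
                oversized.append((name, size_mb))
    return oversized

# Demo with a tiny limit so that it runs without large files.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "small.csv"), "w") as f:
        f.write("a,b\n1,2\n")
    with open(os.path.join(tmp_dir, "big.csv"), "w") as f:
        f.write("x" * 2048)
    flagged = files_over_limit(tmp_dir, limit_mb=0.001)

print(flagged)
```

In the notebook this would be called as `files_over_limit(OUTPUT_DIR)`. Oversized files would then need to be excluded from the repository, compressed, or shared through another channel.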
