# OmniScale Optimizer - Complete Pipeline

## Overview

This notebook implements an end-to-end machine learning pipeline that combines:

1. Neural Collaborative Filtering (NCF) - Deep Learning for Product Recommendations
2. Logistics Optimization - Warehouse Order Assignment with Capacity Constraints
3. High-Performance Computing (HPC) - C++ with OpenMP for speed
4. Distributed Computing - MapReduce simulation for scalability

## Requirements

Install the required packages before running the notebook:

In [1]:
# Install required packages
!pip install torch pandas numpy scikit-learn pybind11

Collecting pybind11
  Downloading pybind11-3.0.2-py3-none-any.whl.metadata (10 kB)
Downloading pybind11-3.0.2-py3-none-any.whl (310 kB)
Installing collected packages: pybind11
Successfully installed pybind11-3.0.2


## Pipeline Phases

| Phase | Component | Description |
|-------|-----------|-------------|
| 1 | Data Parsing | Stream JSON reviews from Amazon |
| 2 | Feature Mining | Extract user-item interactions plus K-Means clustering |
| 3 | Neural Recommender | Train PyTorch NCF model |
| 4 | HPC Optimizer | C++ logistics solver with OpenMP |
| 5 | Distributed Computing | MapReduce simulation |

---

## Phase 1: Google Drive Setup and Project Initialization

Purpose: Mounts Google Drive to Colab and configures the project environment.

What it does:
1. Mounts Google Drive at /content/drive
2. Sets project root to /content/drive/MyDrive/OmniScale-Optimizer
3. Updates sys.path so Python can import from src/ folder
4. Changes working directory to project root

In [None]:
import sys
import os
from google.colab import drive

# 1. Mount the drive
drive.mount('/content/drive')

# 2. Define the EXACT path to your project root
project_root = "/content/drive/MyDrive/OmniScale-Optimizer"

# 3. Add to sys.path if not already there
if project_root not in sys.path:
    sys.path.append(project_root)

# 4. Change the working directory
os.chdir(project_root)

print(f"Current Working Directory: {os.getcwd()}")
print("Python Path updated. You can now import from 'src'.")

---

## Phase 1: Create Project Structure

Purpose: Creates the complete folder hierarchy and Python module structure.

What it does:
1. Authenticates user with Google
2. Creates project folder
3. Creates subdirectories: data/raw, data/processed, notebooks, src/parser, src/models, src/optimizer/cpp_core, src/distributed, tests, scripts
4. Creates __init__.py files to make folders Python packages
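
The folder and package creation steps listed above can be sketched without Google Drive. The helper name `create_structure` and the temporary directory are assumptions made for illustration; the folder list is the one used in this notebook.

```python
import tempfile
from pathlib import Path

# Folder and package lists taken from this notebook's project layout.
FOLDERS = [
    "data/raw", "data/processed", "notebooks", "src/parser", "src/models",
    "src/optimizer/cpp_core", "src/distributed", "tests", "scripts",
]
PACKAGES = ["src", "src/parser", "src/models", "src/optimizer", "src/distributed"]


def create_structure(root):
    """Create the project folders and empty __init__.py files under `root`."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for package in PACKAGES:
        # touch() leaves the contents of an existing file untouched.
        (root / package / "__init__.py").touch(exist_ok=True)
    return root


# Usage: build the layout in a throwaway directory instead of Drive.
with tempfile.TemporaryDirectory() as tmp:
    project = create_structure(Path(tmp) / "OmniScale-Optimizer")
    created = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("__init__.py")
    )
    print(created)
```

Using `touch()` rather than opening each file in write mode means that rerunning the setup does not empty an `__init__.py` that already has content.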

In [None]:
from google.colab import drive
import os
from google.colab import auth

auth.authenticate_user()
drive.mount('/content/drive', force_remount=True)

# Create a main project folder in your Drive
project_path = "/content/drive/MyDrive/OmniScale-Optimizer"
if not os.path.exists(project_path):
    os.makedirs(project_path)
    print(f"Created project folder at {project_path}")

# Change the current working directory to the project folder
os.chdir(project_path)

folders = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src/parser",
    "src/models",
    "src/optimizer/cpp_core",
    "src/distributed",
    "tests",
    "scripts"
]

for folder in folders:
    path = os.path.join(project_path, folder)
    os.makedirs(path, exist_ok=True)

# Create empty __init__.py files to make them Python modules
init_files = [
    "src/__init__.py",
    "src/parser/__init__.py",
    "src/models/__init__.py",
    "src/optimizer/__init__.py",
    "src/distributed/__init__.py"
]

for file in init_files:
    with open(os.path.join(project_path, file), 'w') as f:
        pass

print("Project structure created successfully!")

---

## Phase 1: Create Stream Parser Module

Purpose: Creates a memory-efficient JSON parser using the Python generator pattern.

What it does: Defines the stream_amazon_data function, which opens a JSON Lines file and yields one parsed record (a dict) per line.

Why Generators: Memory efficient - only one record is held in memory at a time, so the 1.4GB file never has to be loaded in full.
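
A minimal sketch of that laziness, using an in-memory stand-in for the file. The function name `stream_records` and the three sample records are invented for illustration; only the field names come from the dataset.

```python
import io
import json
from itertools import islice


def stream_records(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# Stand-in for an open file: three JSON Lines records held in memory.
sample = io.StringIO(
    '{"reviewerID": "A1", "asin": "B01", "overall": 5.0}\n'
    '{"reviewerID": "A2", "asin": "B02", "overall": 3.0}\n'
    '{"reviewerID": "A1", "asin": "B03", "overall": 4.0}\n'
)

# islice pulls only the first two records; the third line is never parsed.
first_two = list(islice(stream_records(sample), 2))
print([r["reviewerID"] for r in first_two])  # ['A1', 'A2']
```

Because the generator only advances when the consumer asks for the next item, a `limit` on the consumer side bounds both the work done and the memory used.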

In [None]:
%%writefile src/parser/stream_parser.py

import json


def stream_amazon_data(file_path):
    """A generator that yields one parsed record at a time (Phase 1: Parsing)"""
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Skip blank lines, which json.loads would reject
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    print("Stream Parser Module Initialized")

---

## Phase 1: Download Amazon Reviews Dataset

Purpose: Downloads the Amazon Electronics Reviews dataset from Stanford SNAP repository.

Dataset Details:
- Source: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
- Size: approximately 400MB compressed, 1.4GB uncompressed
- Content: about 1.69M product reviews (1,689,188) from the Amazon Electronics category
- Format: JSON (one review per line)

What each cell does:
1. Navigate to data folder
2. Download using wget
3. Extract using gunzip
4. Verify files
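
As an alternative to the separate extraction step, the compressed file could be streamed directly with the standard-library `gzip` module. This is a sketch under the assumption that the archive holds one JSON object per line; `stream_gzipped_json` and the demo file are hypothetical and not part of the project code.

```python
import gzip
import json
import os
import tempfile


def stream_gzipped_json(path):
    """Yield one parsed record per line from a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Usage with a tiny synthetic file (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_reviews.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write('{"reviewerID": "A1", "asin": "B01", "overall": 5.0}\n')
        out.write('{"reviewerID": "A2", "asin": "B02", "overall": 4.0}\n')
    records = list(stream_gzipped_json(demo_path))

print(len(records))  # 2
```

Reading the archive in place trades some CPU time for disk space, since the uncompressed copy never has to be written to Drive.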

In [None]:
import os

# Navigate to data folder
raw_data_path = "/content/drive/MyDrive/OmniScale-Optimizer/data/raw"
os.chdir(raw_data_path)

print(f"Current directory: {os.getcwd()}")

In [None]:
# Download the 5-core Electronics dataset
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz

In [None]:
# Extract the dataset
!gunzip reviews_Electronics_5.json.gz

In [2]:
# Verify downloaded files
!ls -lh

total 4.0K
drwxr-xr-x 1 root root 4.0K Jan 16 14:24 sample_data


---

## Phase 1: Data Quality Check

Purpose: Validates the downloaded dataset to ensure data quality before processing.

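Cell 1: Basic Statistics - Counts total reviews and unique reviewers. Its "Potential duplicates" figure counts every review whose reviewerID has already been seen, so it equals total reviews minus unique reviewers rather than the number of duplicated rows.
Cell 2: True Duplicates - Checks whether the same user reviewed the same product (same reviewerID and asin) more than once.

Expected Output:
Total reviews: 1689188
Unique reviewers: 192403
Potential duplicates: 1496785
True duplicates: 0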
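
The difference between the two checks can be seen on a small example. The four (reviewerID, asin) pairs below are invented for illustration.

```python
from collections import Counter

# Tiny synthetic sample of (reviewerID, asin) pairs.
reviews = [
    ("A1", "B01"),
    ("A1", "B02"),  # same reviewer, different product: not a duplicate
    ("A2", "B01"),
    ("A1", "B01"),  # same reviewer and product again: a true duplicate
]

reviewer_counts = Counter(reviewer for reviewer, _ in reviews)
pair_counts = Counter(reviews)

# Rows whose reviewerID was already seen (what the first check counts).
repeat_reviewer_rows = len(reviews) - len(reviewer_counts)
# Rows whose (reviewerID, asin) pair was already seen (the second check).
true_duplicate_rows = len(reviews) - len(pair_counts)

print(repeat_reviewer_rows, true_duplicate_rows)  # 2 1
```

A large first number is expected in a 5-core dataset, where every reviewer has several reviews; only the second number indicates a data quality problem.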

In [22]:
import json

# Count records in JSON file
count = 0
seen = set()
duplicates = 0

with open('/content/drive/MyDrive/OmniScale-Optimizer/data/raw/reviews_Electronics_5.json', 'r') as f:
    for line in f:
        count += 1
        if line.strip():
            try:
                data = json.loads(line)
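                # reviewerID identifies the reviewer, not the review, so this
                # counts reviews by an already-seen reviewer, not duplicate rows.
                reviewer_id = data.get('reviewerID', '')
                if reviewer_id in seen:
                    duplicates += 1
                seen.add(reviewer_id)
            except json.JSONDecodeError:
                # Skip malformed lines without hiding unrelated errors
                pass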

print(f"Total reviews: {count}")
print(f"Unique reviewers: {len(seen)}")
print(f"Potential duplicates: {duplicates}")

Total reviews: 1689188
Unique reviewers: 192403
Potential duplicates: 1496785


In [None]:
import json

# Check for true duplicates
seen_pairs = set()
true_duplicates = 0

with open('/content/drive/MyDrive/OmniScale-Optimizer/data/raw/reviews_Electronics_5.json', 'r') as f:
    for line in f:
        if line.strip():
            try:
                data = json.loads(line)
                pair = (data.get('reviewerID', ''), data.get('asin', ''))
                if pair[0] and pair[1]:
                    if pair in seen_pairs:
                        true_duplicates += 1
                    seen_pairs.add(pair)
            except json.JSONDecodeError:
                # Skip malformed lines without hiding unrelated errors
                pass

print(f"True duplicates: {true_duplicates}")

---

## Phase 2: Create Feature Miner Module

Purpose: Creates the FeatureMiner class that handles data extraction and K-Means++ clustering.

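Data Extraction: Reads streaming JSON data, extracts key fields (user_id, item_id, rating), and attaches synthetic lat/lon coordinates. The coordinates are drawn at random for every interaction row, so they are placeholders rather than real customer locations, and the same user can receive different coordinates in different rows.

K-Means++ Clustering: Implements K-Means with K-Means++ initialization from scratch in NumPy and groups the rows into delivery zones based on location.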
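
The K-Means++ seeding rule picks each new centroid with probability proportional to the squared distance to the nearest centroid chosen so far. A pure-Python sketch of that rule follows; `dsq_probabilities` and the three sample points are assumptions for illustration, and the project code does the same computation with NumPy.

```python
import math
import random


def dsq_probabilities(points, centroids):
    """Return the K-Means++ sampling probability of each point.

    Each probability is proportional to the squared distance from the
    point to its nearest already-chosen centroid.
    """
    dist_sq = [
        min(math.dist(p, c) ** 2 for c in centroids)
        for p in points
    ]
    total = sum(dist_sq)
    if total == 0:
        # Every point coincides with a centroid: fall back to uniform.
        return [1.0 / len(points)] * len(points)
    return [d / total for d in dist_sq]


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
centroids = [(0.0, 0.0)]  # first centroid already chosen
probs = dsq_probabilities(points, centroids)
print(probs)  # squared distances 0, 1, 25 -> weights 0, 1/26, 25/26

# Draw the next centroid with those weights (seeded for repeatability).
rng = random.Random(0)
next_centroid = rng.choices(points, weights=probs, k=1)[0]
```

Far-away points dominate the draw, which spreads the initial centroids out and usually reduces the number of iterations the main loop needs.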

In [23]:
%%writefile src/parser/feature_miner.py

import numpy as np
import pandas as pd
from src.parser.stream_parser import stream_amazon_data


class FeatureMiner:

    def __init__(self, file_path):
        self.file_path = file_path

    def extract_interactions(self, limit=100000):
        """Extract user/item/rating rows and attach a random lat/lon to each row."""
        data = []
        gen = stream_amazon_data(self.file_path)

        for i, record in enumerate(gen):
            if i >= limit:
                break
            data.append({
                'user_id': record.get('reviewerID'),
                'item_id': record.get('asin'),
                'rating': record.get('overall'),
                'lat': np.random.uniform(30, 45),
                'lon': np.random.uniform(-120, -70)
            })
        return pd.DataFrame(data)

    def manual_kmeans_plusplus(self, points, k, max_iters=100, tol=1e-4, random_state=None):
        """K-Means++ clustering with smart initialization"""
        # Seed NumPy's global RNG when a random_state is given
        if random_state is not None:
            np.random.seed(random_state)

        n_samples = len(points)

        # K-Means++ Initialization
        centroids = [points[np.random.randint(n_samples)]]

        for _ in range(1, k):
            centroids_arr = np.array(centroids)
            dists = np.min(
                np.linalg.norm(points[:, np.newaxis] - centroids_arr, axis=2),
                axis=1
            )

            dist_sq = dists ** 2
            total = dist_sq.sum()

            if total == 0:
                centroids.append(points[np.random.randint(n_samples)])
            else:
                probs = dist_sq / total
                centroids.append(points[np.random.choice(n_samples, p=probs)])

        centroids = np.array(centroids)

        # Main K-Means Loop: stop early once the centroids move less than tol
        for _ in range(max_iters):
            distances = np.linalg.norm(points[:, np.newaxis] - centroids, axis=2)
            labels = np.argmin(distances, axis=1)

            new_centroids = np.array([
                points[labels == i].mean(axis=0) if np.any(labels == i)
                else centroids[i]
                for i in range(k)
            ])

            if np.linalg.norm(new_centroids - centroids) < tol:
                break

            centroids = new_centroids

        inertia = np.sum((points - centroids[labels]) ** 2)

        return centroids, labels, inertia
    
    # Backward compatibility wrapper
    def manual_kmeans(self, points, k, max_iters=100):
        """Wrapper for backward compatibility"""
        centroids, labels, _ = self.manual_kmeans_plusplus(points, k, max_iters)
        return centroids, labels


Overwriting src/parser/feature_miner.py


---

## Phase 2: MAIN DATA PROCESSING (CRITICAL)

Purpose: Transforms raw JSON into structured data for ML models.

What it does:
1. Reads 100K reviews using the stream parser
2. Generates coordinates for each user
3. K-Means Clustering - Groups users into 10 delivery zones
4. Saves to CSV

IMPORTANT: This cell MUST run successfully before the demo notebook.

Expected Output:
Extracted 100000 interactions.
Data saved to Drive!

In [None]:
import os

import numpy as np

from src.parser.feature_miner import FeatureMiner

# Path to the dataset
raw_json = "/content/drive/MyDrive/OmniScale-Optimizer/data/raw/reviews_Electronics_5.json"

# Error handling: Check if file exists
if not os.path.exists(raw_json):
    raise FileNotFoundError(f"Raw data file not found: {raw_json}. Please download the dataset first.")

# Initialize and Process
miner = FeatureMiner(raw_json)
df_interactions = miner.extract_interactions(limit=100000)

print(f"Extracted {len(df_interactions)} interactions.")

# Cluster user locations into 10 Delivery Zones using K-Means++
points = df_interactions[['lat', 'lon']].values
centroids, labels, inertia = miner.manual_kmeans_plusplus(points, k=10)

# Add the Zone Label to our dataframe
df_interactions['delivery_zone'] = labels

# Print clustering quality metrics
# sqrt(inertia / n) is the root-mean-square distance to the assigned centroid
print(f"Clustering Quality (Inertia): {inertia:.2f}")
print(f"RMS distance to zone centroid: {np.sqrt(inertia/len(points)):.2f}")

# Save processed data
df_interactions.to_csv("/content/drive/MyDrive/OmniScale-Optimizer/data/processed/clean_data.csv", index=False)
print("Data saved to Drive!")


---

## Phase 3: Neural Collaborative Filtering (NCF)

What is NCF: Neural Collaborative Filtering is a deep learning approach to recommendations.

Pipeline:
1. Load processed data from CSV
2. Map IDs to integers
3. Create NCF Model with PyTorch embeddings
4. Train using MSE loss
5. Save model to Drive

Expected Output (counts and loss values are illustrative and will differ per run):
Dataset has 50000 users and 30000 items.
Using device: cuda
Epoch 1, Loss: 1.2345
Model saved to Drive!
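
Step 2 of the pipeline exists because an embedding table is indexed by integers from 0 to the table size minus 1, while the raw IDs are strings. A pure-Python sketch of that dense encoding is shown here; `encode_ids` and the sample IDs are hypothetical, and the notebook itself relies on pandas categorical codes.

```python
def encode_ids(raw_ids):
    """Map each distinct ID to a dense integer index (sorted order)."""
    vocab = {value: idx for idx, value in enumerate(sorted(set(raw_ids)))}
    return [vocab[value] for value in raw_ids], vocab


user_ids = ["A3", "A1", "A3", "A2"]  # made-up reviewer IDs
user_idx, user_vocab = encode_ids(user_ids)

print(user_idx)         # [2, 0, 2, 1]
print(len(user_vocab))  # 3 distinct users -> embedding table needs 3 rows
```

Keeping the vocabulary dictionary around is what allows a raw ID to be translated back to its embedding row at inference time.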

In [None]:
import os

import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# 1. Load the data from Phase 2
data_path = "/content/drive/MyDrive/OmniScale-Optimizer/data/processed/clean_data.csv"

# Error handling
if not os.path.exists(data_path):
    raise FileNotFoundError(f"Processed data not found: {data_path}. Please run Phase 2 first.")

df = pd.read_csv(data_path)

# 2. Map IDs to Integers
df['user_idx'] = df['user_id'].astype('category').cat.codes
df['item_idx'] = df['item_id'].astype('category').cat.codes

# 3. Get total counts
num_users = df['user_idx'].nunique()
num_items = df['item_idx'].nunique()

print(f"Dataset has {num_users} users and {num_items} items.")

# 4. Split into Training and Testing
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Save the counts
with open("/content/drive/MyDrive/OmniScale-Optimizer/data/processed/metadata.txt", "w") as f:
    f.write(f"{num_users},{num_items}")

### NCF Model Architecture

The model consists of:

1. Embedding Layers: Lookup tables for user/item indices
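2. MLP: Multi-Layer Perceptron that takes the 64-dimensional concatenated embedding (2 x 32) through Linear layers of width 64 -> 32 -> 1, with ReLU between them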
3. Forward Pass: Concatenate embeddings, pass through MLP
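
To get a feel for where the model's capacity sits, the parameter count of this layout can be worked out by hand. The helper `count_ncf_parameters` and the example sizes (1000 users, 500 items) are assumptions for illustration.

```python
def count_ncf_parameters(num_users, num_items, embed_size=32):
    """Count trainable parameters for the embedding + MLP layout in this section."""
    embeddings = (num_users + num_items) * embed_size
    # Each Linear(in, out) layer has in*out weights plus out biases.
    layer_shapes = [(embed_size * 2, 64), (64, 32), (32, 1)]
    mlp = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)
    return embeddings, mlp


embeddings, mlp = count_ncf_parameters(num_users=1000, num_items=500)
print(embeddings, mlp)  # 48000 6273
```

The MLP size is fixed, so for realistic user and item counts almost all parameters live in the two embedding tables.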

In [None]:
%%writefile src/models/ncf_model.py

import torch
import torch.nn as nn


class NCFModel(nn.Module):

    def __init__(self, num_users, num_items, embed_size=32):
        super(NCFModel, self).__init__()
        
        # Embedding Layers
        self.user_embed = nn.Embedding(num_users, embed_size)
        self.item_embed = nn.Embedding(num_items, embed_size)
        
        # Neural Network Layers (MLP)
        self.fc_layers = nn.Sequential(
            nn.Linear(embed_size * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
        
    def forward(self, user_indices, item_indices):
        u_emb = self.user_embed(user_indices)
        i_emb = self.item_embed(item_indices)
        x = torch.cat([u_emb, i_emb], dim=-1)
        prediction = self.fc_layers(x)
        # Drop only the trailing feature dimension so a batch of one keeps its shape
        return prediction.squeeze(-1)

### Training the NCF Model

Training process:
1. Device Selection: Use CUDA GPU if available
2. Model Initialization: Create model with embedding size 32
3. Optimizer: Adam with learning rate 0.001
4. Loss Function: MSE
5. Training Loop: 5 epochs, each a single full-batch update over the whole training set
6. Save Model

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

from src.models.ncf_model import NCFModel

# 1. Setup Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Initialize Model
model = NCFModel(num_users, num_items).to(device)

# 3. Define Optimizer and Loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# 4. Prepare Tensors
user_train = torch.LongTensor(train['user_idx'].values).to(device)
item_train = torch.LongTensor(train['item_idx'].values).to(device)
ratings_train = torch.FloatTensor(train['rating'].values).to(device)

# 5. Training Loop
model.train()
for epoch in range(5):
    optimizer.zero_grad()
    outputs = model(user_train, item_train)
    loss = criterion(outputs, ratings_train)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# 6. Save Model Weights
torch.save(model.state_dict(), "/content/drive/MyDrive/OmniScale-Optimizer/data/processed/ncf_model.pth")
print("Model saved to Drive!")

---

## Phase 4: Logistics Optimization (Python)

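What is Logistics Optimization: Assigning customer orders to warehouses so that delivery distance is kept small while each warehouse's capacity limit is respected.

Algorithm: A greedy heuristic. For each user, in input order, calculate the distance to all warehouses, sort by distance, and assign the user to the nearest warehouse that still has capacity. The result depends on the order of the users and is not guaranteed to minimize total distance.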

Expected Output:
Total Orders: 100000
Capacity per Warehouse: 11000
Optimization Complete!
Orders unassigned (due to capacity): 0
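
The assignment rule described above can be sketched on three orders and two warehouses. The function `greedy_assign` and the coordinates are invented for illustration; the project's solver applies the same rule with NumPy.

```python
import math


def greedy_assign(users, warehouses, capacities):
    """Assign each user, in input order, to the nearest warehouse with room.

    Returns a list of warehouse indices (-1 if every warehouse is full)
    and the final per-warehouse usage.
    """
    usage = [0] * len(warehouses)
    assignments = []
    for user in users:
        # Warehouse indices ordered from nearest to farthest.
        by_distance = sorted(
            range(len(warehouses)),
            key=lambda w: math.dist(user, warehouses[w]),
        )
        chosen = -1
        for w in by_distance:
            if usage[w] < capacities[w]:
                chosen = w
                usage[w] += 1
                break
        assignments.append(chosen)
    return assignments, usage


users = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]
warehouses = [(0.0, 0.0), (10.0, 0.0)]
assignments, usage = greedy_assign(users, warehouses, capacities=[2, 2])
print(assignments, usage)  # [0, 0, 1] [2, 1]
```

The third order is closest to warehouse 0 but overflows to warehouse 1 because the first two orders already filled it, which shows how processing order affects the outcome.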

In [None]:
%%writefile src/optimizer/solver.py

import numpy as np


class LogisticsOptimizer:

    def __init__(self, warehouse_locations, warehouse_capacities):
        self.warehouses = warehouse_locations
        self.capacities = warehouse_capacities

    def calculate_distance(self, p1, p2):
        return np.linalg.norm(p1 - p2)

    def assign_orders(self, user_locations):
        num_users = len(user_locations)
        assignments = np.full(num_users, -1)
        current_usage = np.zeros(len(self.warehouses))

        for i in range(num_users):
            user_pos = user_locations[i]
            costs = [self.calculate_distance(user_pos, w) for w in self.warehouses]
            preferred_warehouses = np.argsort(costs)
            
            for w_idx in preferred_warehouses:
                if current_usage[w_idx] < self.capacities[w_idx]:
                    assignments[i] = w_idx
                    current_usage[w_idx] += 1
                    break
                    
        return assignments, current_usage

### Running the Logistics Optimizer

Steps:
1. Load processed data
2. Calculate warehouse locations
3. Define capacity constraints
4. Run assignment algorithm
5. Analyze results
6. Save final shipping plan

In [None]:
import os

from src.optimizer.solver import LogisticsOptimizer
import pandas as pd
import numpy as np

# 1. Load processed data
data_path = "/content/drive/MyDrive/OmniScale-Optimizer/data/processed/clean_data.csv"

# Error handling
if not os.path.exists(data_path):
    raise FileNotFoundError(f"Processed data not found: {data_path}. Please run Phase 2 first.")

df = pd.read_csv(data_path)

# 2. Get Warehouse Locations
warehouse_locations = df.groupby('delivery_zone')[['lat', 'lon']].mean().values
num_warehouses = len(warehouse_locations)

# 3. Define Constraints
total_orders = len(df)
capacity_per_warehouse = int((total_orders / num_warehouses) * 1.1) 
capacities = [capacity_per_warehouse] * num_warehouses

print(f"Total Orders: {total_orders}")
print(f"Capacity per Warehouse: {capacity_per_warehouse}")

# 4. Initialize and Run the Optimizer
optimizer = LogisticsOptimizer(warehouse_locations, capacities)
user_coords = df[['lat', 'lon']].values

print("Solving assignment optimization...")
assignments, final_usage = optimizer.assign_orders(user_coords)

# 5. Analyze Results
df['assigned_warehouse'] = assignments
unassigned = np.sum(assignments == -1)

print("Optimization Complete!")
print(f"Orders unassigned (due to capacity): {unassigned}")
print(f"Warehouse Usage: {final_usage}")

# Save the final optimized plan
df.to_csv("/content/drive/MyDrive/OmniScale-Optimizer/data/processed/final_shipping_plan.csv", index=False)

---

## Phase 4: HPC C++ Optimizer

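Why C++: The Python solver above runs its per-order loop in the interpreter on a single core, so its runtime grows with every additional order and warehouse.

C++ Advantages:
1. Compiled code - No interpreter overhead
2. OpenMP parallelization - Splits the per-user loop across all CPU cores
3. Memory efficiency - Operates on the NumPy array buffers through pybind11

Integration: Uses pybind11 to create Python bindings for C++ code. Note that fast_assign accepts a capacities array but does not enforce it: each user is assigned to the nearest warehouse, so its results can differ from the capacity-aware Python solver.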
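
A pure-Python reference for the nearest-warehouse rule can be useful for checking the compiled routine on a handful of points. This is a sketch with a hypothetical name (`nearest_warehouse`) and made-up coordinates, not part of the project code.

```python
import math


def nearest_warehouse(users, warehouses):
    """Return, for each user, the index of the closest warehouse."""
    return [
        min(range(len(warehouses)), key=lambda j: math.dist(user, warehouses[j]))
        for user in users
    ]


users = [(0.0, 0.0), (9.0, 9.0), (4.0, 4.0)]
warehouses = [(1.0, 1.0), (10.0, 10.0)]
print(nearest_warehouse(users, warehouses))  # [0, 1, 0]
```

On ties, `min` keeps the first warehouse index it encounters, which matches a loop that only updates the best candidate on a strictly smaller distance.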

In [None]:
# Setup directory for C++ code
import os
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
project_root = "/content/drive/MyDrive/OmniScale-Optimizer"
os.chdir(project_root)
os.makedirs("src/optimizer/cpp_core", exist_ok=True)

print(f"Current Directory: {os.getcwd()}")

In [None]:
%%writefile src/optimizer/cpp_core/optimizer.cpp

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <vector>
#include <cmath>
#include <algorithm>
#include <omp.h>

namespace py = pybind11;


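// Assigns every user to its nearest warehouse (Euclidean distance).
// `capacities` is accepted to mirror the Python solver's inputs but is not enforced here.
py::array_t<int> fast_assign(py::array_t<double> user_locs,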
                            py::array_t<double> warehouse_locs, 
                            py::array_t<int> capacities) {
    
    auto users = user_locs.unchecked<2>();
    auto warehouses = warehouse_locs.unchecked<2>();
    
    int n_users = users.shape(0);
    int n_warehouses = warehouses.shape(0);
    
    py::array_t<int> assignments({n_users});
    auto assign_ptr = assignments.mutable_unchecked<1>();

    #pragma omp parallel for
    for (int i = 0; i < n_users; i++) {
        double min_dist = 1e18;
        int best_w = -1;

        for (int j = 0; j < n_warehouses; j++) {
            double dx = users(i, 0) - warehouses(j, 0);
            double dy = users(i, 1) - warehouses(j, 1);
            double dist = std::sqrt(dx*dx + dy*dy);
            
            if (dist < min_dist) {
                min_dist = dist;
                best_w = j;
            }
        }
        assign_ptr(i) = best_w;
    }
    
    return assignments;
}


PYBIND11_MODULE(fast_optimizer, m) {
    m.def("fast_assign", &fast_assign, "High-performance order assignment");
}

### Compiling the C++ Optimizer

This cell compiles the C++ code into a Python extension module (.so file).

Expected Output:
Compiling with paths...
Compilation Successful! .so file created.
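
A compiled extension only imports into the Python minor version it was built against, so it can help to print the running interpreter's details before compiling. The sketch below uses only the standard library; `build_info` is a hypothetical helper name.

```python
import sys
import sysconfig


def build_info():
    """Collect the interpreter details that a compiled extension must match."""
    return {
        "version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "include_dir": sysconfig.get_paths()["include"],
        "ext_suffix": sysconfig.get_config_var("EXT_SUFFIX"),
    }


info = build_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the include directory passed to the compiler belongs to a different Python than the one reported here, the import fails with a version mismatch like the one recorded at the end of this notebook.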

In [None]:
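import os
import subprocess
import sysconfig

import pybind11

# Use the headers of the interpreter that runs this kernel. `python3-config`
# can belong to a different Python (for example 3.10 vs 3.12), which leads to
# a "Python version mismatch" ImportError when the module is loaded.
pybind_inc = pybind11.get_include()
python_inc = sysconfig.get_paths()["include"]

src = "src/optimizer/cpp_core/optimizer.cpp"
out = "src/optimizer/cpp_core/fast_optimizer.so"

print("Compiling with paths...")
result = subprocess.run(
    ["g++", "-O3", "-Wall", "-shared", "-std=c++11", "-fPIC", "-fopenmp",
     f"-I{python_inc}", f"-I{pybind_inc}", src, "-o", out],
    capture_output=True, text=True,
)
print(result.stderr)

# Check the compiler's exit code too: a stale .so from an earlier run may exist
if result.returncode == 0 and os.path.exists(out):
    print("Compilation Successful! .so file created.")
else:
    print("Warning: Compilation may have failed. Check the output above.")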

---

## Phase 5: Distributed Computing (MapReduce)

What is MapReduce: A programming model for processing large datasets in parallel across multiple machines.

Two Phases:
1. Map Phase: Split data into chunks. Each worker processes its chunk independently.
2. Reduce Phase: Aggregate results from all workers.

In This Implementation: Uses ProcessPoolExecutor to run the map step in separate processes on a single machine, which simulates distributed workers.

Expected Output (the timing is illustrative and will differ per run):
Data loaded: 100000 rows.
Starting Distributed MapReduce Simulation...
Success! Distributed Execution Time: 0.1234 seconds
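
The two phases can be sketched on a toy list of warehouse assignments. All names and data below are assumptions for illustration, and the sketch uses threads to stay self-contained, whereas the notebook uses processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split a list into `num_chunks` nearly equal contiguous slices."""
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def map_count(chunk):
    """Map step: count assignments per warehouse within one chunk."""
    return Counter(chunk)


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up warehouse assignments for seven orders.
assignments = [0, 1, 1, 2, 0, 1, 2]
chunks = split_into_chunks(assignments, num_chunks=3)

# Threads keep the sketch simple; the notebook itself uses processes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_count, chunks))

usage = reduce_counts(partials)
print(dict(sorted(usage.items())))  # {0: 2, 1: 3, 2: 2}
```

The reduce step works because per-warehouse counts are additive: each chunk can be summarized independently and the summaries merged in any order.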

In [None]:
%%writefile src/distributed/map_reduce_ops.py

import numpy as np
from concurrent.futures import ProcessPoolExecutor
import sys
import os

# Ensure the C++ library can be found by workers
sys.path.append("/content/drive/MyDrive/OmniScale-Optimizer/src/optimizer/cpp_core")
import fast_optimizer


def worker_task(chunk_data, warehouse_coords, caps):
    """The Map Step: Each worker processes a slice of the data."""
    assignments = fast_optimizer.fast_assign(chunk_data, warehouse_coords, caps)
    local_count = len(assignments)
    local_usage = np.bincount(assignments, minlength=len(warehouse_coords))
    
    return {
        "assignments": assignments,
        "usage": local_usage,
        "count": local_count
    }


def run_distributed_optimizer(df, warehouse_coords, caps, num_workers=4):
    """The Master Logic: Orchestrates the distribution and aggregation."""
    user_coords = df[['lat', 'lon']].values.astype(np.float64)
    chunks = np.array_split(user_coords, num_workers)
    
    results = []
    
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(worker_task, chunk, warehouse_coords, caps) for chunk in chunks]
        
        for future in futures:
            results.append(future.result())
            
    # The Reduce Step: Aggregate results from all workers
    total_usage = np.zeros(len(warehouse_coords))
    all_assignments = []
    
    for res in results:
        total_usage += res["usage"]
        all_assignments.extend(res["assignments"])
        
    return all_assignments, total_usage

### Running the Distributed Optimizer

This cell demonstrates the full distributed pipeline:
1. Load Data
2. Prepare Inputs
3. Execute MapReduce
4. Measure execution time

In [None]:
import pandas as pd
import numpy as np
import os

# 1. Reload the data
data_path = "/content/drive/MyDrive/OmniScale-Optimizer/data/processed/clean_data.csv"

# Error handling
if not os.path.exists(data_path):
    raise FileNotFoundError(f"Processed data not found: {data_path}. Please run Phase 2 first.")

df = pd.read_csv(data_path)
print(f"Data loaded: {len(df)} rows.")

# 2. Re-calculate warehouse locations
warehouse_locations = df.groupby('delivery_zone')[['lat', 'lon']].mean().values
num_warehouses = len(warehouse_locations)
total_orders = len(df)
capacity_per_warehouse = int((total_orders / num_warehouses) * 1.1) 
capacities = [capacity_per_warehouse] * num_warehouses

In [21]:
from src.distributed.map_reduce_ops import run_distributed_optimizer
import time

# Ensure inputs are correctly typed for C++
warehouse_coords = warehouse_locations.astype(np.float64)
caps = np.array(capacities).astype(np.int32)

print("Starting Distributed MapReduce Simulation...")
start = time.time()

# Use 2 workers to match Colab CPU count
dist_assignments, dist_usage = run_distributed_optimizer(
    df, 
    warehouse_coords, 
    caps, 
    num_workers=2
)


dist_time = time.time() - start
print(f"Success! Distributed Execution Time: {dist_time:.4f} seconds")
print(f"Total Orders: {len(dist_assignments)}")

ImportError: Python version mismatch: module was compiled for Python 3.10, but the interpreter version is incompatible: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0].

---

## Summary

This notebook demonstrated a complete OmniScale machine learning pipeline:

| Phase | Component | Technology | Output |
|-------|-----------|------------|--------|
| 1 | Data Parsing | Python Generators | Raw JSON stream |
| 2 | Feature Mining | K-Means++ | clean_data.csv |
| 3 | Recommender | PyTorch NCF | ncf_model.pth |
| 4 | Logistics | Python greedy solver, C++/OpenMP | final_shipping_plan.csv, fast_optimizer.so |
| 5 | Distributed | MapReduce | Parallel execution |

### Key Takeaways:

1. End-to-End ML: From raw data to a trained, saved model
2. Hybrid Computing: Python plus C++ for best performance
3. Scalability: The MapReduce pattern splits the work into independent chunks, so the same structure can be extended to larger datasets and more workers
4. Real Data: Amazon reviews provide realistic dataset

### Files Generated:

- data/processed/clean_data.csv
- data/processed/ncf_model.pth
- data/processed/final_shipping_plan.csv
- src/optimizer/cpp_core/fast_optimizer.so

### Next Steps:

- Try the Demo Notebook for visualizations
- Experiment with hyperparameters
- Deploy to cloud for true distributed computing

---

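Run the cells in order from Phase 1 to Phase 5. The last recorded run of the Phase 5 cell stopped with the ImportError shown above, which occurs when fast_optimizer.so was compiled for a different Python version than the running kernel; recompile it in the current runtime before rerunning that cell.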