# <center> Getting Data Ready for MLOps Monitoring </center>

## <center> Setting up housing data for model tracking and drift detection </center>

## <center> By Joemichael Alvarez </center>

***Data Set:*** `California Housing Dataset` ([scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html))

***Data Set Information:***

- **Number of instances:** 20,640

- **Number of attributes:** 9

- **Attribute breakdown:** 8 quantitative input variables and 1 quantitative output variable

- **Missing attribute values:** None

***Attribute Information:***

Each entry gives the variable name, the variable type, a brief description (with the measurement unit where relevant), and whether it is an input or output. The median house value (`MedHouseVal`) is our regression target.

- **MedInc** -- quantitative -- median income in block group -- Input Variable
- **HouseAge** -- quantitative -- median house age in block group -- Input Variable
- **AveRooms** -- quantitative -- average number of rooms per household -- Input Variable
- **AveBedrms** -- quantitative -- average number of bedrooms per household -- Input Variable
- **Population** -- quantitative -- block group population -- Input Variable
- **AveOccup** -- quantitative -- average number of household members -- Input Variable
- **Latitude** -- quantitative -- block group latitude -- Input Variable
- **Longitude** -- quantitative -- block group longitude -- Input Variable
- **MedHouseVal** -- quantitative -- median house value in hundreds of thousands of dollars -- Output Variable

# What We're Building

Before diving into MLOps concepts, it's important to understand what we're building. This project demonstrates a complete MLOps pipeline using housing data to simulate real-world model monitoring scenarios.

**We seek to create a production-ready ML system that can detect when model performance degrades due to data drift, concept drift, or other factors that occur in real deployed systems.**

# Setting Up the Data

## Getting Started

In [None]:
#required libraries
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

#our custom modules
from data_loader import DataLoader

#quality of life
np.random.seed(42)
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")
print(f"Current working directory: {os.getcwd()}")

## Loading Our Data

### First Look

In [None]:
#initialize data loader
data_loader = DataLoader()

#load housing data
print("Loading California housing dataset...")
df = data_loader.load_housing_data()

print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")
print("\nDataset info:")
df.info()

In [None]:
#display first few records vertically and rounded
df.head(10).round(2).T

In [None]:
#basic statistics for our columns
df.describe().round(2).T

In [None]:
#df.info() again on its own: row count, column dtypes, and non-null counts in one view
df.info()

The California housing dataset is clean with no missing values. All features are quantitative and properly formatted. For MLOps simulation, we need to add temporal dimensions to simulate real-world data streaming scenarios.

# <center> Exploring the Data </center>

## Looking at Each Feature

In [None]:
#histogram univariate
for col in df.columns:
    plt.figure()
    sns.histplot(data=df, x=col)  #no hue is set, so multiple="dodge" had no effect
    plt.title(f"Histogram of {col}")
    plt.show()

Most features show reasonable distributions. Income and house values show some right skew which is typical for economic data. Geographic features (latitude/longitude) show clustering patterns reflecting California's population centers.
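
The right skew mentioned above can also be quantified rather than judged by eye. The snippet below is a minimal sketch on synthetic data (the column names `income_like` and `symmetric` are made up for illustration, not taken from the housing set); running `df.skew()` on the real DataFrame should work the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: a log-normal column (right-skewed, like income)
# and a normal column (roughly symmetric)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "income_like": rng.lognormal(mean=1.0, sigma=0.5, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})

# Sample skewness per column: values well above 0 indicate a long right tail
skewness = demo.skew().sort_values(ascending=False)
print(skewness.round(2))
```

A skewness near zero suggests a roughly symmetric distribution, while clearly positive values match the long right tails visible in the histograms.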

## How Features Connect

In [None]:
#pairwise Pearson correlations (the pandas default for df.corr())
corr = df.corr()
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt=".2f")
plt.show()

Let's rank the features by their Spearman correlation with house value.

In [None]:
#create correlations with MedHouseVal
correlations = df.drop("MedHouseVal", axis=1).apply(lambda x: x.corr(df.MedHouseVal, method='spearman'))

#sort the correlations in descending order
correlations.sort_values(ascending=False)
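
Note that the heatmap uses Pearson correlation (the `df.corr()` default), while the ranking uses Spearman. The two can disagree when a relationship is monotonic but not linear. The following is a small illustrative sketch on synthetic data, not on the housing set:

```python
import numpy as np
import pandas as pd

# A perfectly monotonic but strongly non-linear relationship
x = np.linspace(0.0, 5.0, 50)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman measures rank (monotonic) association
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because Spearman works on ranks, it is less sensitive to skewed features, which is one plausible reason to prefer it for this ranking.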

## Adding Time to the Mix

For MLOps monitoring, we need to add timestamps to simulate real-world data streams. The original dataset lacks temporal information, but production systems receive data over time.

___Domain Knowledge:___ Housing markets exhibit temporal patterns due to economic cycles, seasonal trends, and policy changes. Adding timestamps allows us to simulate drift detection scenarios.
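
The `add_timestamps` method lives in our custom `data_loader` module and is not shown in this notebook. As a rough idea of what such a helper could do, here is a minimal sketch that spreads rows evenly across a date range. The function name `add_timestamps_sketch`, the `end_date` parameter, and the even spacing are all assumptions for illustration; the real implementation may differ.

```python
import pandas as pd


def add_timestamps_sketch(df, start_date="2023-01-01", end_date="2023-12-31"):
    """Hypothetical stand-in for DataLoader.add_timestamps (assumed behavior)."""
    out = df.copy()
    # Assign evenly spaced timestamps in row order between start and end
    out["timestamp"] = pd.date_range(start=start_date, end=end_date, periods=len(out))
    # Calendar date without the time component, handy for daily grouping
    out["date"] = out["timestamp"].dt.date
    return out


demo = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "MedHouseVal": [4.5, 3.6, 3.4, 3.4],
})
stamped = add_timestamps_sketch(demo)
print(stamped)
```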

In [None]:
#add timestamps to simulate data streaming
print("Adding timestamps to simulate real-world data streaming...")
df_with_time = data_loader.add_timestamps(df, start_date="2023-01-01")

print(f"Dataset with timestamps shape: {df_with_time.shape}")
print("\nNew temporal columns:")
print(df_with_time[['timestamp', 'date']].head())

#check time range coverage
print(f"\nTime range:")
print(f"Start: {df_with_time['timestamp'].min()}")
print(f"End: {df_with_time['timestamp'].max()}")
print(f"Duration: {df_with_time['timestamp'].max() - df_with_time['timestamp'].min()}")

In [None]:
# Let's see how our house values look over time
plt.figure(figsize=(12, 6))

# Just grab every 1000th point so the plot doesn't get too crowded
sample_df = df_with_time.iloc[::1000].copy()

plt.scatter(sample_df['timestamp'], sample_df['MedHouseVal'], 
           alpha=0.6, s=20, color='navy')
plt.title('House Prices Over Time (Sample of Our Data)')
plt.xlabel('When the Data Came In')
plt.ylabel('Median House Value')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Splitting Data the Right Way

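Monitoring scenarios call for temporal splits rather than random ones: in deployment, a model is trained on historical data and then scored on data that arrives later, so the evaluation should follow the same order.

___MLOps Insight:___ Time-based splits keep later records out of the training set, which avoids temporal leakage and gives a more realistic picture of performance when future data differs from training data. Because the timestamps here are simulated, the split mainly demonstrates the workflow.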
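
The `split_by_time` method is also part of our custom `data_loader` module. A minimal sketch of the idea, assuming rows strictly before the split date go to training and the rest to validation (the function name `split_by_time_sketch`, the `time_col` parameter, and the boundary handling are assumptions):

```python
import pandas as pd


def split_by_time_sketch(df, split_date, time_col="timestamp"):
    """Hypothetical stand-in for DataLoader.split_by_time (assumed behavior)."""
    cutoff = pd.Timestamp(split_date)
    # Everything before the cutoff is "history"; the cutoff itself and later is "future"
    train = df[df[time_col] < cutoff].copy()
    val = df[df[time_col] >= cutoff].copy()
    return train, val


demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-03-01", "2023-06-30", "2023-07-01", "2023-10-15"]
    ),
    "MedHouseVal": [2.1, 3.4, 1.8, 2.9],
})
train_demo, val_demo = split_by_time_sketch(demo, "2023-07-01")
print(len(train_demo), len(val_demo))
```

Unlike a random split, no validation row is older than any training row, which is the property the monitoring simulation relies on.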

In [None]:
#temporal split at mid-year point
split_date = "2023-07-01"  

train_data, val_data = data_loader.split_by_time(df_with_time, split_date)

print(f"Training data shape: {train_data.shape}")
print(f"Validation data shape: {val_data.shape}")
print(f"\nTraining period: {train_data['timestamp'].min()} to {train_data['timestamp'].max()}")
print(f"Validation period: {val_data['timestamp'].min()} to {val_data['timestamp'].max()}")

In [None]:
# Time to save everything so we can use it later
print("Saving all our processed data...")

# Make sure we have the right folders
os.makedirs("../data/raw", exist_ok=True)
os.makedirs("../data/processed", exist_ok=True)

# Save the full dataset with our new timestamps
data_loader.save_processed_data(df_with_time, "housing_with_timestamps.csv")

# Save our train/validation splits
data_loader.save_processed_data(train_data, "train_data.csv")
data_loader.save_processed_data(val_data, "validation_data.csv")

# Create a baseline dataset - we'll use this as our "reference" for drift detection
baseline_data = train_data[train_data['timestamp'] < '2023-02-01'].copy()
data_loader.save_processed_data(baseline_data, "baseline_reference.csv")

print(f"✅ Saved baseline reference: {baseline_data.shape[0]} records from early 2023")

In [None]:
# Create daily chunks for our monitoring demo
print("Breaking data into daily batches for monitoring simulation...")

# Use some of our validation data and limit it so we don't go crazy with files
data_loader.create_daily_batches(val_data.head(10000))  

print("✅ Got our daily batches ready for monitoring!")
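
For readers without access to the `data_loader` module, the batching step can be pictured as grouping rows by calendar day and writing one CSV per day. This is a sketch under stated assumptions: the function name `create_daily_batches_sketch`, the `output_dir` parameter, and the `day_<date>.csv` naming are guesses for illustration and may not match the actual helper.

```python
import os
import tempfile

import pandas as pd


def create_daily_batches_sketch(df, output_dir, time_col="timestamp"):
    """Hypothetical stand-in for DataLoader.create_daily_batches (assumed behavior)."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    # Group rows by calendar day and write each group to its own CSV file
    for day, batch in df.groupby(df[time_col].dt.date):
        path = os.path.join(output_dir, f"day_{day}.csv")  # naming scheme is an assumption
        batch.to_csv(path, index=False)
        paths.append(path)
    return paths


demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-07-01 08:00", "2023-07-01 17:30", "2023-07-02 09:15"]
    ),
    "MedHouseVal": [1.8, 2.9, 3.1],
})
# Write into a throwaway temp directory so the demo does not touch ../data
batch_paths = create_daily_batches_sketch(demo, tempfile.mkdtemp())
print([os.path.basename(p) for p in batch_paths])
```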

## Quick Quality Check

Before training a model, we run basic quality checks on each dataset: missing values, duplicate rows, data types, and outliers flagged by the IQR rule.

In [None]:
# Let's check if our data is clean and ready to go
def perform_data_quality_checks(df, dataset_name):
    print(f"\n=== Quality Check: {dataset_name} ===")
    
    # Any missing values?
    missing_values = df.isnull().sum()
    print(f"Missing values:")
    if missing_values.sum() == 0:
        print("✅ No missing values - we're good!")
    else:
        print(f"{missing_values[missing_values > 0]}")
    
    # Duplicate rows?
    duplicates = df.duplicated().sum()
    if duplicates == 0:
        print("✅ No duplicate rows found")
    else:
        print(f"⚠️ Found {duplicates} duplicate rows")
    
    # What types are we working with?
    print(f"\nData types:")
    print(df.dtypes)
    
    # Check for outliers using the IQR method (pretty standard approach)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    outlier_counts = {}
    
    for col in numeric_cols:
        if col not in ['timestamp']:  # Skip timestamp since it's not really a "feature"
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
            outlier_counts[col] = outliers
    
    print(f"\nOutliers (using IQR method):")
    for col, count in outlier_counts.items():
        if count > 0:
            print(f"  {col}: {count} outliers ({count/len(df)*100:.2f}%)")
        else:
            print(f"  {col}: No outliers ✅")
    
    print("=" * 50)

# Run our checks on all the datasets
perform_data_quality_checks(train_data, "Training Data")
perform_data_quality_checks(val_data, "Validation Data")
perform_data_quality_checks(baseline_data, "Baseline Reference Data")

## Wrapping Up

### What We Got Done

**Data Processing:**
- ✅ Loaded California housing dataset (20,640 instances, 9 attributes)
- ✅ Added temporal dimension for MLOps simulation
- ✅ Implemented temporal data splitting
- ✅ Created baseline reference dataset for drift detection
- ✅ Generated daily batches for monitoring simulation
- ✅ Performed data quality validation

**Output Files:**
- `../data/processed/housing_with_timestamps.csv` - Full dataset with temporal features
- `../data/processed/train_data.csv` - Training split (first 6 months)
- `../data/processed/validation_data.csv` - Validation split (last 6 months)
- `../data/processed/baseline_reference.csv` - Reference data for drift monitoring
- `../data/processed/day_*.csv` - Daily batches for monitoring

**Next Steps:**
- 📊 **Notebook 02**: Model training with MLflow experiment tracking
- 🔄 **Notebook 03**: Data drift simulation and detection
- 📈 **Notebook 04**: Production monitoring with Evidently AI

The preprocessed data is ready for model training and MLOps pipeline implementation.