# 01 - Regression + Threshold Discovery (Residual-Based)

This notebook is where you **justify** MinC, MaxC, and T using evidence:
- Regression plots (Time → Axis #1–#8)
- Residual distributions
- Quantiles (e.g., 95th and 99th of positive residuals)
- Event counts for different persistence windows T

**Important:** Run this notebook before submission so outputs/plots are saved and visible.

In [2]:
import sys
print(sys.executable)


c:\Users\kevin\Downloads\PredictiveMaintenance_Streaming_Lab1\PredictiveMaintenance_Streaming_Lab1\venv\Scripts\python.exe


## 1) Load training data from Neon (PostgreSQL)
Make sure your `.env` file exists in the repo root and is filled in before running.

In [None]:
import os, sys
import pandas as pd
from dotenv import load_dotenv

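# ✅ Make notebook run from project root (so "import src" works).
# Walk up from the current directory until a folder containing "src" is found.
# Unlike a bare "..", this is safe to re-run and does not depend on where Jupyter was started.
PROJECT_ROOT = os.getcwd()
while not os.path.isdir(os.path.join(PROJECT_ROOT, "src")):
    parent = os.path.dirname(PROJECT_ROOT)
    if parent == PROJECT_ROOT:  # reached the filesystem root without finding "src"
        raise FileNotFoundError("Could not find a 'src' folder at or above " + os.getcwd())
    PROJECT_ROOT = parent
os.chdir(PROJECT_ROOT)
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)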

print("Working directory:", os.getcwd())
print("Python executable:", sys.executable)

load_dotenv()

from src.config import TIME_COL, AXIS_COLS
from src.db import read_training_data, ensure_tables

ensure_tables(TIME_COL, AXIS_COLS)
train = read_training_data()
print("Training rows:", len(train))
train.head()


ModuleNotFoundError: No module named 'src'

## 2) Fit regression models + compute residuals

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from src.preprocessing import fit_train_scalers, transform_zscore
from src.regression import fit_models, residuals

scalers = fit_train_scalers(train, AXIS_COLS)
train_z = transform_zscore(train, AXIS_COLS, scalers)

models = fit_models(train_z, TIME_COL, AXIS_COLS)
models

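The helpers imported from `src` are not shown in this notebook, so here is a minimal sketch of what this step is assumed to compute: z-score each axis using training statistics, fit a straight line of the axis against time, and take the residual as observed minus predicted. The function names below (`zscore_fit`, `fit_line`, `line_residuals`) are hypothetical and only illustrate the idea; the real `src.preprocessing` and `src.regression` code may differ.

```python
import numpy as np

def zscore_fit(values):
    """Return (mean, std) from training values only; std falls back to 1.0 if it is zero."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    return values.mean(), (std if std > 0 else 1.0)

def fit_line(t, y):
    """Least-squares straight line: y is approximated by slope * t + intercept."""
    slope, intercept = np.polyfit(np.asarray(t, dtype=float),
                                  np.asarray(y, dtype=float), deg=1)
    return slope, intercept

def line_residuals(t, y, slope, intercept):
    """Return (residual, prediction). Assumed sign: positive = reading above the line."""
    yhat = slope * np.asarray(t, dtype=float) + intercept
    return np.asarray(y, dtype=float) - yhat, yhat

# Toy data: a noiseless linear trend, so every residual should be close to zero.
t = np.arange(10.0)
y = 2.0 * t + 1.0
mean, std = zscore_fit(y)
y_z = (y - mean) / std
slope, intercept = fit_line(t, y_z)
r, yhat = line_residuals(t, y_z, slope, intercept)
```

Under this sign convention, only positive residuals (readings above the fitted trend) feed the threshold analysis later in the notebook.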
## 3) Plot regression lines (per axis)
Save plots to `outputs/figures/` so your README can embed them.

In [None]:
import os
os.makedirs('outputs/figures', exist_ok=True)

for ax in AXIS_COLS:
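    _, yhat = residuals(train_z, TIME_COL, ax, models[ax])
    x = train_z[TIME_COL].to_numpy()
    y = train_z[ax].to_numpy()
    order = np.argsort(x)  # draw the fitted line from left to right
    plt.figure()
    plt.scatter(x, y, s=5)
    # Use a contrasting color: by default the line would match the scatter points
    plt.plot(x[order], np.asarray(yhat)[order], color='red')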
    plt.title(f'Regression: {ax}')
    plt.xlabel(TIME_COL)
    plt.ylabel(ax)
    plt.tight_layout()
    plt.savefig(f'outputs/figures/regression_{ax}.png', dpi=150)
    plt.close()

print('Saved regression plots to outputs/figures/')

## 4) Residual analysis + threshold candidates
Compute positive residual quantiles. Use these to justify MinC and MaxC.

In [None]:
quantiles = [0.90, 0.95, 0.975, 0.99]
rows = []
for ax in AXIS_COLS:
    r, _ = residuals(train_z, TIME_COL, ax, models[ax])
    pos = r[r > 0]
    if len(pos) == 0:
        rows.append({'axis': ax, **{str(q): np.nan for q in quantiles}})
        continue
    qs = np.quantile(pos, quantiles)
    rows.append({'axis': ax, **{str(q): float(v) for q, v in zip(quantiles, qs)}})

qdf = pd.DataFrame(rows)
qdf

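Because the quantiles are taken over positive residuals only, a 95th-percentile threshold does not flag 5% of all samples. The sketch below, run on synthetic normal residuals rather than the lab data, shows how a candidate MinC and MaxC translate into the fraction of all samples they would flag. The helper name `threshold_candidates` and the 0.95/0.99 defaults are assumptions for illustration.

```python
import numpy as np

def threshold_candidates(resid, min_q=0.95, max_q=0.99):
    """Candidate MinC/MaxC from positive residuals, plus the share of ALL samples above each."""
    resid = np.asarray(resid, dtype=float)
    pos = resid[resid > 0]
    if pos.size == 0:
        return None  # no positive residuals, so no threshold can be derived
    min_c, max_c = np.quantile(pos, [min_q, max_q])
    return {
        "MinC": float(min_c),
        "MaxC": float(max_c),
        "frac_above_MinC": float(np.mean(resid > min_c)),
        "frac_above_MaxC": float(np.mean(resid > max_c)),
    }

# Synthetic residuals (NOT the lab data): symmetric around zero.
rng = np.random.default_rng(0)
resid = rng.normal(size=10_000)
cand = threshold_candidates(resid)
print(cand)
```

When roughly half of the residuals are positive, the 95th percentile of the positive ones is exceeded by only about 2.5% of all samples, which is worth stating when you justify the thresholds.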
## 5) Choose MinC, MaxC, T
- MinC: typically around the 95th percentile of positive residuals
- MaxC: typically around the 99th percentile of positive residuals
- T: test 5, 10, and 15 seconds, and pick the window that filters out short noise spikes while still catching sustained deviations
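To compare persistence windows, one option is to count how many events survive each value of T. The sketch below assumes that time is measured in seconds and that an event is an unbroken run of samples with residual at or above the threshold, lasting at least T seconds from its first to its last sample. The function name `count_sustained_events` and the toy signal are hypothetical, not part of the lab code.

```python
import numpy as np

def count_sustained_events(times, resid, threshold, min_duration):
    """Count unbroken runs with resid >= threshold that span at least min_duration seconds."""
    times = np.asarray(times, dtype=float)
    above = np.asarray(resid, dtype=float) >= threshold
    events = 0
    start = None  # index where the current run began, or None when below threshold
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if times[i - 1] - times[start] >= min_duration:
                events += 1
            start = None
    # Close a run that is still open at the end of the data.
    if start is not None and times[-1] - times[start] >= min_duration:
        events += 1
    return events

# Toy signal sampled once per second: a 2-second blip and a 12-second excursion.
times = np.arange(30.0)
resid = np.zeros(30)
resid[2:5] = 2.0
resid[10:23] = 2.0

counts = {T: count_sustained_events(times, resid, threshold=1.0, min_duration=T)
          for T in (5, 10, 15)}
print(counts)
```

On this toy signal the short blip is dropped at every tested T, the long excursion is kept at 5 and 10 seconds, and nothing survives 15 seconds. Running the same comparison on each axis's residuals with your chosen MinC gives the event counts needed to justify T.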