# Case Study: Cascading Models (Regression → Classification)

## 1. The Concept: What is Cascading?
In complex ML systems, we often chain models together. A common pattern is **Cascading**, where the *output* of Model A becomes an *input feature* for Model B.

### Why do this?
Sometimes an intermediate variable is easier to predict than the final target and is also strongly correlated with it, so predicting it first gives the final model a useful signal.

**Scenario**: Real Estate Flipping.
*   **Goal**: Predict *"Is this house a good investment?"* (Binary: Yes/No).
*   **Problem**: "Good investment" depends heavily on the *Predicted Sale Price* vs *Current Listing Price*.
*   **Solution**:
    1.  **Model 1 (Regression)**: Predict the true market value of the house.
    2.  **Feature Engineering**: Calculate `Profit_Margin = Predicted_Value - Listing_Price`.
    3.  **Model 2 (Classification)**: Use `Profit_Margin` (and other features) to classify "Invest" or "Pass".
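
The three steps above can be sketched on synthetic data. Everything in this sketch is hypothetical: the listings are randomly generated, prices are in thousands of dollars, and the names `sqft`, `listing_price`, and `good_investment` are placeholders rather than columns from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical listings (prices in thousands of dollars)
n = 200
sqft = rng.uniform(800, 3000, size=n)
true_value = 0.15 * sqft + rng.normal(0, 20, size=n)        # eventual sale price
listing_price = true_value + rng.normal(0, 40, size=n)      # sellers over- or under-price at random
good_investment = (true_value > listing_price).astype(int)  # label: sold above the asking price

# Step 1 (regression): estimate market value from the spec alone
X_spec = sqft.reshape(-1, 1)
value_model = LinearRegression().fit(X_spec, true_value)
predicted_value = value_model.predict(X_spec)

# Step 2 (feature engineering): margin between the estimate and the asking price
profit_margin = predicted_value - listing_price

# Step 3 (classification): the engineered margin is the classifier's input
X_margin = profit_margin.reshape(-1, 1)
invest_model = LogisticRegression(max_iter=1000).fit(X_margin, good_investment)
train_acc = invest_model.score(X_margin, good_investment)
print(f"Training accuracy using Profit_Margin: {train_acc:.2f}")
```

The classifier never sees `true_value` directly; it only sees the regression model's estimate of it, which is the defining trait of a cascade.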

---

## 2. Implementation: Diamond Premium Prediction
We will use the **Diamonds** dataset.
*   **Model A (Regression)**: Predict the fair price of a diamond based on its physical specs (`carat`, `depth`, `table`, and the dimensions `x`, `y`, `z`).
*   **Model B (Classification)**: Predict whether a diamond has a top-tier cut (`Premium` or `Ideal`) based on its specs AND its predicted fair price.
    *   *Hypothesis*: Top-tier cuts may be priced differently from other cuts in ways that the regression model's price estimate captures.

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 1. Load Data
# We drop missing values and sample for speed
df = sns.load_dataset('diamonds').dropna().sample(2000, random_state=42)

# 2. Define Targets
# Regression Target: Price
y_reg = df['price']
# Classification Target: Is it a 'Premium' or 'Ideal' cut? (1=Yes, 0=No)
# isin() is vectorized and yields plain integers even though 'cut' is a categorical column
y_class = df['cut'].isin(['Premium', 'Ideal']).astype(int)

# Features for Regression (Physical Specs)
X_reg = df[['carat', 'depth', 'table', 'x', 'y', 'z']]

# Split Data (pass all three arrays in ONE call so their rows stay aligned)
X_train, X_test, y_reg_train, y_reg_test, y_class_train, y_class_test = train_test_split(
    X_reg, y_reg, y_class, test_size=0.2, random_state=42
)

print("Data Splitting Complete.")
print(f"Training Samples: {len(X_train)}")
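
A quick way to confirm the alignment claim is to compare the row indices of the pieces that come back from a single `train_test_split` call. The toy frame and its column names below are hypothetical stand-ins for the diamonds data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; the column names are placeholders
toy = pd.DataFrame({
    "feature": np.arange(10),
    "price": np.arange(10) * 100,
    "label": np.arange(10) % 2,
})

# One call, three arrays: the same shuffled row order is applied to each
X_tr, X_te, p_tr, p_te, c_tr, c_te = train_test_split(
    toy[["feature"]], toy["price"], toy["label"], test_size=0.2, random_state=42
)

# All three training pieces carry the same row labels in the same order
rows_aligned = X_tr.index.equals(p_tr.index) and X_tr.index.equals(c_tr.index)
print("Rows aligned:", rows_aligned)
```

Splitting the regression and classification targets in separate calls would only stay aligned if the same `random_state` and `test_size` were repeated each time, which is easy to get wrong.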

## 3. Step 1: Train Regression Model

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_reg_train)

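# Generate predictions on BOTH the train and test sets
# We need these predictions as an input feature for Model 2
# Note: the training-set predictions are in-sample, so they fit the training prices
# more closely than predictions on unseen data will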
price_pred_train = regressor.predict(X_train)
price_pred_test = regressor.predict(X_test)

print(f"Regression R2 Score: {regressor.score(X_test, y_reg_test):.4f}")
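
One caveat with the step above: the classifier is trained on in-sample price predictions but evaluated on out-of-sample ones. A common remedy, which this notebook does not use, is to build the training feature from out-of-fold predictions with scikit-learn's `cross_val_predict`. The sketch below uses synthetic data as a stand-in for `X_train` and `y_reg_train`; the shapes and coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for (X_train, y_reg_train)
X_demo = rng.normal(size=(500, 6))
y_demo = X_demo @ np.array([3.0, -1.0, 0.5, 0.0, 2.0, 1.0]) + rng.normal(scale=1.0, size=500)

# Out-of-fold: each row is predicted by a model fitted on the other 4 folds
oof_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=5)

# In-sample: every row is predicted by a model that saw that row during fitting
in_sample_pred = LinearRegression().fit(X_demo, y_demo).predict(X_demo)

oof_mse = np.mean((y_demo - oof_pred) ** 2)
in_sample_mse = np.mean((y_demo - in_sample_pred) ** 2)
print(f"In-sample MSE:   {in_sample_mse:.4f}")
print(f"Out-of-fold MSE: {oof_mse:.4f}")
```

In the notebook's terms, `oof_pred` would take the place of `price_pred_train`, while the test-set feature would still come from a regressor fitted on the full training set.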

## 4. Step 2: Feature Engineering (The Cascade)

In [None]:
# We create new feature sets for the classifier
X_class_train = X_train.copy()
X_class_test = X_test.copy()

# ADD the regression output as a feature
X_class_train['Predicted_Fair_Price'] = price_pred_train
X_class_test['Predicted_Fair_Price'] = price_pred_test

# Check the new feature matrix
display(X_class_train.head())

## 5. Step 3: Train Classification Model

In [None]:
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_class_train, y_class_train)

# Evaluate
acc = accuracy_score(y_class_test, classifier.predict(X_class_test))
print(f"Classification Accuracy (With Predicted Price Feature): {acc:.4f}")

# Compare with Baseline (Without the Price Feature)
classifier_base = LogisticRegression(max_iter=1000)
classifier_base.fit(X_train, y_class_train) # Using original X_train without price
acc_base = accuracy_score(y_class_test, classifier_base.predict(X_test))

print(f"Baseline Accuracy (Without Price Feature):     {acc_base:.4f}")
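# The difference between two accuracies is measured in percentage points and can be negative
print(f"Change from Cascading: {(acc - acc_base)*100:+.2f} percentage points")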
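
When reading this comparison, keep in mind that a `LinearRegression` prediction is a weighted sum of its input columns plus an intercept. Adding it to the same columns therefore gives a linear classifier no new direction to work with, and any accuracy difference is likely to come from regularization and solver behavior rather than new information. The sketch below checks this on synthetic data (a made-up stand-in for the six specs and the price) by comparing matrix ranks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the six physical specs and the price
X_demo = rng.normal(size=(300, 6))
price_demo = X_demo @ rng.normal(size=6) + rng.normal(size=300)

# Model A: linear regression, predicting on the rows it was fitted on
pred = LinearRegression().fit(X_demo, price_demo).predict(X_demo)

# Design matrices, including the intercept column a linear classifier also fits
ones = np.ones((len(X_demo), 1))
base = np.hstack([ones, X_demo])
cascaded = np.hstack([base, pred.reshape(-1, 1)])

# Equal ranks mean the predicted-price column is a linear combination of the others
rank_base = np.linalg.matrix_rank(base)
rank_cascaded = np.linalg.matrix_rank(cascaded)
print(f"Rank without predicted price: {rank_base}")
print(f"Rank with predicted price:    {rank_cascaded}")
```

A cascade is more likely to add signal when Model A is nonlinear, or when it uses inputs that Model B does not receive directly.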