# AISE 26 – W9D1 Split Strategy Showdown  
### Partner A: Random Holdout + 5-Fold Standard CV  
**Author:** Andrea Churchwell  
**Dataset:** Diabetes Progression (#7)  
**Task Type:** Regression  
**Metric:** R² (can be changed later if needed)  

---

## 📌 Notebook Purpose

This notebook helps me (Partner A) understand **every single step** of my evaluation strategy **before** I convert the final version into the required `eval_partner_a.py` script for the assignment.

I am working slowly and clearly to make sure:
- I follow the instructions *exactly*
- I understand each line of code
- I know why we split the data the way we do
- I know how R² is calculated
- I know what K-Fold CV does and why it’s required
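
To check my understanding of how R² is calculated, here is a minimal sketch that computes it by hand from the definition $R^2 = 1 - SS_{res} / SS_{tot}$. The helper name `r2_by_hand` and the four toy values are made up for illustration; in the notebook itself the score comes from scikit-learn's `r2_score`.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper, not part of scikit-learn)."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]   # made-up targets
y_pred = [2.5, 0.0, 2.0, 8.0]    # made-up predictions

print(round(r2_by_hand(y_true, y_pred), 4))   # 0.9486
print(r2_by_hand(y_true, [2.875] * 4))        # 0.0 (predicting the mean everywhere)
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than the mean.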

This notebook is my learning space.  
The clean Python file (`eval_partner_a.py`) will be produced **after** I understand everything here.

---

## ✔ Partner A Requirements (What I Must Do)

- Perform an **80/20 Random Holdout split**
- Use the **Ridge** regression model (no tuning)
- Use the **same metric** as my partner (R² for now)
- Run **5-Fold Standard KFold Cross-Validation** on the **training** set only
- Print:
  - Test score (R²)
  - CV mean score
  - CV standard deviation
  - Individual fold scores
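
To see what the 5-fold requirement means in practice, here is a small sketch that runs `KFold` on ten made-up samples instead of the real training set. The array `X_toy` and the list `seen_in_validation` are names I invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)   # 10 made-up samples, one feature

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

seen_in_validation = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy), start=1):
    # Each fold trains on 8 samples and validates on the remaining 2
    print(f"Fold {fold}: train={train_idx}, validate={val_idx}")
    seen_in_validation.extend(val_idx.tolist())

# Every sample is used for validation exactly once across the 5 folds
print(sorted(seen_in_validation))
```

Each fold yields one score, which is why the requirements ask for five individual fold scores plus their mean and standard deviation.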

---

## ✔ What This Notebook Will Include

1. Setting up imports  
2. Loading the Diabetes dataset  
3. Inspecting the data  
4. Creating the 80/20 split  
5. Building the pipeline (StandardScaler + Ridge)  
6. Fitting the model  
7. Evaluating using R²  
8. Running 5-Fold KFold CV  
9. Understanding the outputs  
10. Preparing results for the required `comparison.csv`
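
For step 10, here is a rough sketch of how my row for `comparison.csv` could be assembled with the standard-library `csv` module. The column names and the `split_strategy` label are my own guesses at a layout, since the assignment's required columns may differ. The three numbers are the ones this notebook prints in its evaluation cells.

```python
import csv
import io

# Hypothetical column layout; adjust to whatever the assignment specifies
FIELDNAMES = ["partner", "split_strategy", "metric", "test_score", "cv_mean", "cv_std"]

row = {
    "partner": "A",
    "split_strategy": "random_holdout_80_20_plus_5fold_cv",  # my own label
    "metric": "r2",
    "test_score": 0.4541,   # holdout R² printed by this notebook
    "cv_mean": 0.4808,      # CV mean R² printed by this notebook
    "cv_std": 0.0404,       # CV standard deviation printed by this notebook
}

# Write to an in-memory buffer so the sketch has no side effects
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the buffer for `open("comparison.csv", "w", newline="")` would write the same text to disk.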

---

## 🧠 Reminder

This notebook is for learning and exploring.  
Only after I have worked through everything here will I move the final version into `eval_partner_a.py` and submit it on GitHub, as required.
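
As a preview of where this is heading, here is one possible shape for `eval_partner_a.py`. It is only a sketch under my own assumptions: the function names `evaluate` and `main` are mine, and the assignment may require a different layout or output format.

```python
"""Sketch of eval_partner_a.py: random 80/20 holdout plus 5-fold CV on the training set."""
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42


def evaluate():
    """Return the holdout R² and the per-fold CV R² scores (hypothetical helper)."""
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE
    )
    model = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

    # Cross-validate on the training rows only; the test rows stay untouched
    kfold = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="r2")

    # Fit once on the full training set, then score on the 20% holdout
    model.fit(X_train, y_train)
    test_r2 = model.score(X_test, y_test)  # Pipeline.score returns R² for a regressor
    return test_r2, cv_scores


def main():
    test_r2, cv_scores = evaluate()
    print(f"Test R²: {test_r2:.4f}")
    print(f"CV Mean R²: {cv_scores.mean():.4f}")
    print(f"CV Std Dev: {cv_scores.std():.4f}")
    print("Individual fold scores:", cv_scores)


if __name__ == "__main__":
    main()
```

Keeping the computation in `evaluate()` and the printing in `main()` makes it easier to reuse the numbers later, for example when filling in `comparison.csv`.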

In [1]:
import sys
sys.executable

'c:\\Users\\achur\\desktop\\AISE_CLASS_CLONES\\aise-w9d1-splitstrategy-churchwell-diaz\\venv\\Scripts\\python.exe'

In [2]:
import numpy as np            # numerical arrays and math
import pandas as pd           # table-style dataframes

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
RANDOM_STATE

42

In [3]:
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
X = data.data
y = data.target

print("Shape of X (features):", X.shape)
print("Shape of y (target):", y.shape)

X.head()

Shape of X (features): (442, 10)
Shape of y (target): (442,)


|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
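
The values above are small and centered around zero, even for columns like `age` and `sex`. As I understand the scikit-learn documentation, the features in this dataset ship already mean-centered and scaled so that each column's sum of squares is 1. The sketch below checks that. The names `col_means` and `col_sum_sq` are mine, and I load NumPy arrays rather than a DataFrame to keep it short.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)   # NumPy arrays instead of a DataFrame

col_means = X.mean(axis=0)              # expected to be very close to 0
col_sum_sq = (X ** 2).sum(axis=0)       # expected to be very close to 1

print("Column means:          ", np.round(col_means, 6))
print("Column sums of squares:", np.round(col_sum_sq, 6))
```

The `StandardScaler` step in my pipeline still rescales the features to unit variance, so the pre-scaling does not change what the assignment asks me to do.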


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE
)

print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape: (353, 10)
X_test shape:  (89, 10)
y_train shape: (353,)
y_test shape:  (89,)
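
The 353/89 split is not exactly 80/20 because 442 × 0.2 = 88.4 is not a whole number. As far as I can tell, `train_test_split` rounds a fractional `test_size` up and gives the remaining rows to training. This short sketch reproduces the arithmetic.

```python
import math

n_samples = 442
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # 88.4 rounds up to 89
n_train = n_samples - n_test                # 442 - 89 = 353

print(n_train, n_test)   # 353 89
```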


In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge())
])

model.fit(X_train, y_train)

y_pred_test = model.predict(X_test)
test_r2 = r2_score(y_test, y_pred_test)
print(f"Test R² score (20% holdout): {test_r2:.4f}")

Test R² score (20% holdout): 0.4541
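
One reason I wrap the scaler and Ridge in a `Pipeline` is that the scaler then learns its statistics from the training rows only, so nothing about the test set leaks into the model. The sketch below rebuilds the same split and pipeline on NumPy arrays and checks two things: that the fitted scaler's means equal the training-set means, and that `model.score` gives the same number as `r2_score`. The variable names `matches_train` and `same_score` are mine.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
model.fit(X_train, y_train)

# The scaler inside the pipeline was fitted on X_train only
scaler = model.named_steps["scaler"]
matches_train = np.allclose(scaler.mean_, X_train.mean(axis=0))

# For a regressor pipeline, .score() reports R², the same metric as r2_score
same_score = np.isclose(
    model.score(X_test, y_test),
    r2_score(y_test, model.predict(X_test)),
)

print("Scaler means match training set:", matches_train)
print("model.score equals r2_score:    ", same_score)
```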


In [6]:
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(
    n_splits=5,
    shuffle=True,
    random_state=RANDOM_STATE
)

cv_scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=kfold,
    scoring="r2"
)

print("Individual fold scores:", cv_scores)
print(f"CV Mean R²: {cv_scores.mean():.4f}")
print(f"CV Std Dev: {cv_scores.std():.4f}")

Individual fold scores: [0.46995096 0.53808555 0.41334996 0.4902986  0.49223599]
CV Mean R²: 0.4808
CV Std Dev: 0.0404
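
To double-check the summary numbers, this sketch recomputes them from the five fold scores printed above. It also shows one detail I want to remember: NumPy's `.std()` defaults to the population standard deviation (`ddof=0`), and the sample version (`ddof=1`) comes out slightly larger. The array name `fold_scores` is mine.

```python
import numpy as np

# Fold scores copied from the cross-validation output above
fold_scores = np.array([0.46995096, 0.53808555, 0.41334996, 0.4902986, 0.49223599])

print(f"Mean:                    {fold_scores.mean():.4f}")        # 0.4808
print(f"Population std (ddof=0): {fold_scores.std():.4f}")         # 0.0404
print(f"Sample std (ddof=1):     {fold_scores.std(ddof=1):.4f}")   # 0.0452
```

With only five folds the choice of `ddof` changes the reported spread noticeably, so my partner and I should use the same convention in `comparison.csv`.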
