### Imports and data

In [12]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("WineQT.csv")

# Features and target
X_alcohol = df["alcohol"].values.reshape(-1, 1)
X_chlorides = df["chlorides"].values.reshape(-1, 1)

y = df["quality"].values

### Sklearn's SGD function

In [None]:
# Feature and target
X_chl = df[["chlorides"]].values
X_alc = df[["alcohol"]].values
y = df["quality"].values

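def evaluate_feature_sgd(X, y, feature_name, eta=0.01, max_iter=1000):
    """Fit SGDRegressor on one feature with 5-fold CV; print and return per-fold MSE, RMSE, R²."""
    kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5 shuffled folds, fixed seed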
    results = []

    print(f"\nEvaluating {feature_name} with 5-fold cross-validation:")

    for fold, (train_idx, test_idx) in enumerate(kf.split(X), 1):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Train SGDRegressor
        sgd = SGDRegressor(max_iter=max_iter, tol=1e-3, eta0=eta, random_state=42)
        sgd.fit(X_train_scaled, y_train)
        y_pred = sgd.predict(X_test_scaled)

        # Metrics
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)

        results.append({'Fold': fold, 'MSE': mse, 'RMSE': rmse, 'R2': r2})
        print(f"Fold {fold}: MSE={mse:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")

    results_df = pd.DataFrame(results)
    print(f"\nSummary statistics: ({feature_name})")
    print(results_df[['MSE', 'RMSE', 'R2']].agg(['mean', 'std', 'min', 'max']))
    return results_df

# Evaluate chlorides and alcohol
results_chl = evaluate_feature_sgd(X_chl, y, "Chlorides")
results_alc = evaluate_feature_sgd(X_alc, y, "Alcohol")



Evaluating Chlorides with 5-fold cross-validation:
Fold 1: MSE=0.5590, RMSE=0.7477, R²=-0.0046
Fold 2: MSE=0.7270, RMSE=0.8526, R²=0.0128
Fold 3: MSE=0.6591, RMSE=0.8118, R²=0.0275
Fold 4: MSE=0.6658, RMSE=0.8160, R²=0.0245
Fold 5: MSE=0.5916, RMSE=0.7692, R²=-0.0057

Summary statistics: (Chlorides)
           MSE      RMSE        R2
mean  0.640508  0.799461  0.010886
std   0.066123  0.041384  0.015661
min   0.559044  0.747692 -0.005741
max   0.726972  0.852626  0.027457

Evaluating Alcohol with 5-fold cross-validation:
Fold 1: MSE=0.4180, RMSE=0.6465, R²=0.2489
Fold 2: MSE=0.5902, RMSE=0.7683, R²=0.1985
Fold 3: MSE=0.5145, RMSE=0.7173, R²=0.2408
Fold 4: MSE=0.5002, RMSE=0.7073, R²=0.2671
Fold 5: MSE=0.4634, RMSE=0.6807, R²=0.2122

Summary statistics: (Alcohol)
           MSE      RMSE        R2
mean  0.497259  0.704008  0.233520
std   0.063989  0.045172  0.027839
min   0.417962  0.646500  0.198529
max   0.590210  0.768252  0.267143


## Q1.4.1 How well does alcohol alone predict wine quality in each split?
Alcohol’s RMSE ranges from about 0.65 to 0.77 across the five folds, so predictions are typically off by roughly 0.7 quality points. The model predicts wine quality only moderately well.

R² shows how much of the variation in wine quality the model can explain, with higher values meaning a better fit.
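
As a minimal sketch of that definition, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and checked against `r2_score`. The quality scores and predictions below are hypothetical values chosen for illustration, not taken from the wine data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical quality scores and predictions, for illustration only
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_pred = np.array([5.2, 5.8, 5.4, 6.5, 6.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Both values should agree
print(r2_manual, r2_score(y_true, y_pred))
```

An R² of 1 means the residuals are all zero, while an R² of 0 means the model does no better than always predicting the mean of the targets.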

The R² values (≈ 0.20–0.27) indicate that alcohol explains roughly 20–27% of the variance in quality, leaving most of the variance unexplained.

## Q1.4.2 How well does chloride alone predict wine quality in each split?

Chlorides performs poorly: RMSE ≈ 0.75–0.85, and R² values are around 0 (slightly negative in folds 1 and 5). An R² at or below 0 means the model does no better than always predicting the mean quality, so chlorides alone explains almost none of the variance in wine quality.
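
To see what near-zero or negative R² looks like, it can help to cross-validate a baseline that ignores the feature entirely. The sketch below uses scikit-learn's `DummyRegressor` with the same `KFold` settings as above. The data is a synthetic stand-in (random feature, random integer scores), assumed here only so that the block runs without `WineQT.csv`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the wine dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = rng.integers(3, 9, size=200).astype(float)

# Baseline: always predict the mean of the training targets
kf = KFold(n_splits=5, shuffle=True, random_state=42)
baseline_r2 = cross_val_score(
    DummyRegressor(strategy="mean"), X_demo, y_demo, cv=kf, scoring="r2"
)
print(baseline_r2)
```

Because the baseline predicts the training-fold mean rather than the test-fold mean, its R² on each test fold is at most 0. Single-feature scores in that range are therefore indistinguishable from having no predictor at all.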

## Q1.4.3 Do you think the model underfits? Why?
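Yes. Each model uses only one feature (out of the 12 features in the dataset), and the R² values are low (about 0.23 for alcohol and 0.01 for chlorides). A single-feature linear model is too simple to capture most of the variation in quality.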
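
One common way to check for underfitting is to compare training and test error: if both are high and close to each other, the model is too simple rather than overfit. The sketch below illustrates this under stated assumptions. The helper `mean_train_test_rmse` is a hypothetical name, and the data is synthetic (a target that depends on two features), not the wine dataset.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mean_train_test_rmse(X, y):
    """Return (mean train RMSE, mean test RMSE) over 5 shuffled folds."""
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01, random_state=42),
    )
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    # Scores are negated RMSE values, so flip the sign back
    return -scores["train_score"].mean(), -scores["test_score"].mean()

# Synthetic stand-in data: the target depends on two features plus noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 2))
y_syn = 1.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=500)

one_train, one_test = mean_train_test_rmse(X_syn[:, [0]], y_syn)
all_train, all_test = mean_train_test_rmse(X_syn, y_syn)
print(f"one feature : train RMSE={one_train:.3f}, test RMSE={one_test:.3f}")
print(f"two features: train RMSE={all_train:.3f}, test RMSE={all_test:.3f}")
```

With one feature, training and test RMSE should both stay high and similar, which is the underfitting pattern. Adding the second feature should lower both. The same helper could be applied to `X_alc` and to the full feature matrix to test the claim on the wine data.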

## Q1.4.4 Provide the mean and variance from the 5 different folds and comment on the variation in performance across all 5 folds when using alcohol versus chloride.

| Feature   | Mean RMSE | Std RMSE | Var RMSE | Mean R² |
| --------- | --------- | -------- | -------- | ------- |
| Alcohol   | 0.704     | 0.045    | 0.0020   | 0.234   |
| Chlorides | 0.799     | 0.041    | 0.0017   | 0.011   |

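Alcohol predicts wine quality better than chlorides (mean RMSE 0.704 vs. 0.799, mean R² 0.234 vs. 0.011), and alcohol has the lower RMSE in every fold. The fold-to-fold spread is similar for both features (std RMSE of about 0.04–0.05), so the difference between them is consistent rather than the result of one favorable split. Chlorides barely helps, and both models are too simple to capture the full patterns in the data.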
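
Since the question asks for variance as well as the mean, the following sketch recomputes both from the per-fold RMSE values printed in the output above. The numbers are copied from that output at four decimal places, so the results match the summary statistics only up to rounding.

```python
import pandas as pd

# Per-fold RMSE values copied from the printed output above
rmse_by_fold = pd.DataFrame({
    "Alcohol":   [0.6465, 0.7683, 0.7173, 0.7073, 0.6807],
    "Chlorides": [0.7477, 0.8526, 0.8118, 0.8160, 0.7692],
})

# pandas uses the sample (ddof=1) definition for std and var,
# matching the std row of the summary statistics printed earlier
summary = rmse_by_fold.agg(["mean", "std", "var"]).T
print(summary.round(4))
```

The same `agg(["mean", "std", "var"])` call could be applied directly to `results_alc` and `results_chl` to get the variance of MSE and R² as well.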
