# Exercise 2 - Advanced Machine Learning Techniques 


## Q2.1 Which features are most suitable/influential in predicting wine quality? (Tip - You can consider feature importance ranking.)

In [33]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("WineQT.csv")

# Features and target
X_all = df.drop(columns=["quality", "Id"]).values
y = df["quality"].values
feature_names = df.drop(columns=["quality", "Id"]).columns

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)

# Train linear regression with SGD
sgd = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01, random_state=42)
sgd.fit(X_scaled, y)

# Get absolute coefficients as feature importance
importance = np.abs(sgd.coef_)  # magnitude only: the sign (direction of the effect) is discarded
feature_importance = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importance
}).sort_values(by="Importance", ascending=False)

print(feature_importance)


                 Feature  Importance
10               alcohol    0.277888
1       volatile acidity    0.195423
9              sulphates    0.151706
4              chlorides    0.086388
6   total sulfur dioxide    0.084458
0          fixed acidity    0.069059
7                density    0.061787
8                     pH    0.045988
3         residual sugar    0.025933
5    free sulfur dioxide    0.025585
2            citric acid    0.020778


As in Q1, we use scikit-learn's SGDRegressor to fit a linear model that predicts wine quality as a weighted sum of all features. Because the features are standardized, the coefficients are directly comparable: a positive coefficient increases the predicted quality, a negative one decreases it, and a larger absolute value indicates a stronger effect. The ranking above uses absolute values, so it shows the strength of each effect but not its direction.

From the feature importance ranking, alcohol is the most influential feature in predicting wine quality, followed by volatile acidity and sulphates. Chlorides, total sulfur dioxide, fixed acidity, and density have moderate influence, while pH, residual sugar, free sulfur dioxide, and citric acid contribute very little. (Id is only a row identifier and was dropped before training, so it does not appear in the ranking.) This shows that some chemical properties of the wine strongly affect quality, while others have minimal predictive power.

This is consistent with Q1.2, where alcohol and volatile acidity had the two largest absolute correlations with quality.
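The idea of ranking features by the size of their standardized coefficients can be checked on data where the true effects are known. The following is a minimal sketch on synthetic data, not on the wine dataset: the feature names `f0`, `f1`, `f2` and the true weights are assumptions made up for illustration. It keeps the signed coefficient next to the absolute value so that the direction of each effect stays visible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, NOT the wine dataset):
# the target depends strongly on f0, negatively on f1, weakly on f2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.3 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=500))

# Standardize so that coefficient magnitudes are comparable across features
X_demo_scaled = StandardScaler().fit_transform(X_demo)

model = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01, random_state=42)
model.fit(X_demo_scaled, y_demo)

# Keep the signed coefficient (direction) alongside its magnitude (strength)
ranking = pd.DataFrame({
    "Feature": ["f0", "f1", "f2"],
    "Coefficient": model.coef_,
    "Importance": np.abs(model.coef_),
}).sort_values(by="Importance", ascending=False)

print(ranking)
```

On this toy data the ranking should follow the size of the true weights, with `f1` carrying a negative coefficient. On real data with correlated features the coefficients are less stable, so the ranking is best read as a rough guide.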

## Q2.2 The models you trained so far assume a linear relationship between features and target.

### a) Polynomial regression: Extend the feature space to include quadratic or interaction terms. Does this improve performance?

In [34]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Create polynomial features (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_all)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
sgd = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01, random_state=42)
sgd.fit(X_train_scaled, y_train)
y_pred = sgd.predict(X_test_scaled)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Polynomial Regression (degree 2): MSE={mse:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")


Polynomial Regression (degree 2): MSE=0.3760, RMSE=0.6132, R²=0.3243


Looking at the results, the polynomial regression model achieved an MSE of approximately 0.376, an RMSE of about 0.613, and an R² of around 0.324. Compared to the linear multiple regression model, the RMSE decreased slightly, but the R² remained almost the same. This indicates that adding polynomial terms, such as quadratic and interaction features, does not significantly improve performance for this dataset, even though the degree-2 expansion grows the feature space from 11 to 77 columns. It suggests that the relationship between the features and wine quality is mostly linear, or that more complex interactions are not captured well with just degree-2 polynomials. Overall, the linear model already captures most of the variation that a degree-2 polynomial model can explain.
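To make the size of the expanded feature space concrete, here is a small sketch of what `PolynomialFeatures(degree=2, include_bias=False)` does to an 11-column input. The array `X_demo` is placeholder data, not the wine features. The second transformer uses `interaction_only=True`, a scikit-learn option that keeps the pairwise products but drops the squared terms. The notebook above does not use it.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder input with 11 columns, matching the number of wine features
X_demo = np.arange(22, dtype=float).reshape(2, 11)

# Full degree-2 expansion: 11 linear + 11 squared + 55 pairwise terms
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
n_full = poly_demo.fit_transform(X_demo).shape[1]

# Interaction terms only: 11 linear + 55 pairwise terms, no squares
inter_demo = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
n_inter = inter_demo.fit_transform(X_demo).shape[1]

print(f"original: {X_demo.shape[1]}, degree 2: {n_full}, interactions only: {n_inter}")
```

With seven times as many columns and the same number of rows, the model has more room to fit noise. That is one plausible reason the test metrics barely move.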

### b) Regularization: Train models using Ridge and Lasso regression. How do these methods affect the coefficients and model generalization?

Ridge and Lasso are techniques that help linear regression avoid overfitting and work better on new data. 

Ridge adds a penalty based on the square of the coefficients, which shrinks them toward zero but usually keeps all features in the model. 

Lasso adds a penalty based on the absolute value of the coefficients, which can shrink some coefficients exactly to zero, effectively removing less important features. Both methods make the model more stable and improve generalization by reducing the influence of weak or correlated features. Overall, regularization slightly increases bias but lowers variance, helping the model make more reliable predictions on unseen wine samples.
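For reference, the two penalties can be written out as the objectives that scikit-learn documents for `Ridge` and `Lasso`. The symbols are notation introduced here, not taken from the text above: $w$ is the coefficient vector, $X$ the feature matrix, $y$ the target, $n$ the number of training samples, and $\alpha$ the regularization strength.

```latex
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The data-fit term is scaled differently in the two objectives, so the same value of $\alpha$ does not mean the same amount of regularization in Ridge and Lasso.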

In [35]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Features and target
X_temp = df.drop(columns=["quality", "Id"]).values
y = df["quality"].values
feature_names = df.drop(columns=["quality", "Id"]).columns

# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge Regression
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)

# Lasso Regression
lasso = Lasso(alpha=0.01, random_state=42)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)

# Compare coefficients
coefficients = pd.DataFrame({
    "Feature": feature_names,
    "Ridge": ridge.coef_,
    "Lasso": lasso.coef_
}).sort_values(by="Ridge", key=abs, ascending=False)

print("Coefficients (Ridge vs Lasso):")
print(coefficients)

# Evaluate
from sklearn.metrics import mean_squared_error, r2_score
def evaluate(y_true, y_pred, name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"{name}: MSE={mse:.4f}, RMSE={rmse:.4f}, R²={r2:.4f}")

evaluate(y_test, y_pred_ridge, "Ridge")
evaluate(y_test, y_pred_lasso, "Lasso")


Coefficients (Ridge vs Lasso):
                 Feature     Ridge     Lasso
10               alcohol  0.285819  0.293407
1       volatile acidity -0.238646 -0.223573
9              sulphates  0.161463  0.141083
0          fixed acidity  0.086838  0.022495
4              chlorides -0.085712 -0.081140
6   total sulfur dioxide -0.072929 -0.060472
2            citric acid -0.065283 -0.013223
7                density -0.058945 -0.017111
8                     pH -0.037865 -0.035157
5    free sulfur dioxide  0.019222  0.000000
3         residual sugar  0.005442 -0.000000
Ridge: MSE=0.3799, RMSE=0.6163, R²=0.3173
Lasso: MSE=0.3699, RMSE=0.6082, R²=0.3353


Looking at the results, both Ridge and Lasso shrink the coefficients compared to a standard linear regression, with Lasso reducing some coefficients exactly to zero (like free sulfur dioxide and residual sugar), effectively removing them from the model. 

Alcohol remains the strongest positive predictor, and volatile acidity the strongest negative predictor in both models. Performance-wise, Lasso slightly outperforms Ridge, with a lower RMSE (≈0.608 vs 0.616) and higher R² (≈0.335 vs 0.317), indicating slightly better generalization on this test split. Overall, regularization helps simplify the model, reduce overfitting, and highlight the most influential features while maintaining predictive accuracy.
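How strongly the coefficients shrink depends on `alpha`. The sketch below illustrates this on synthetic data, not on the wine dataset: the data, the true weights, and the `alpha` grid are assumptions chosen for illustration. It counts how many coefficients stay non-zero for Ridge and Lasso as `alpha` grows.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): only the first two of five
# features actually influence the target.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 5)))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_nonzero = {}
ridge_nonzero = {}
for alpha in [0.1, 1.0, 10.0]:
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Count the coefficients that were not driven exactly to zero
    lasso_nonzero[alpha] = int(np.count_nonzero(lasso_demo.coef_))
    ridge_nonzero[alpha] = int(np.count_nonzero(ridge_demo.coef_))
    print(f"alpha={alpha}: Lasso non-zero={lasso_nonzero[alpha]}, "
          f"Ridge non-zero={ridge_nonzero[alpha]}")
```

Ridge keeps every feature at all three settings, while Lasso removes more features as `alpha` increases and eventually zeroes out all of them. In practice `alpha` would be chosen by cross-validation (for example with `RidgeCV` or `LassoCV`) rather than fixed by hand.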

### c) Model comparison: Compare your linear regression results to a non-linear model (e.g., Decision Tree or Random Forest). Which performs better, and why?

Example: comparing a Random Forest regressor with linear regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

X = df.drop(columns=["quality", "Id"]).values
y = df["quality"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80/20 split

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train)
y_pred_lin = linreg.predict(X_test_scaled)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

def evaluate(y_true, y_pred, name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"{name}: RMSE={rmse:.3f}, R²={r2:.3f}")

evaluate(y_test, y_pred_lin, "Linear Regression")
evaluate(y_test, y_pred_rf, "Random Forest")


Linear Regression: RMSE=0.616, R²=0.317
Random Forest: RMSE=0.547, R²=0.463


On this test split the Random Forest clearly outperforms linear regression, with a lower RMSE (0.547 vs 0.616) and a higher R² (0.463 vs 0.317). Non-linear models like Decision Trees or Random Forests usually do better when the relationship between features and wine quality isn’t just a straight line.

Linear regression assumes each feature affects quality in a simple, linear way, so it can miss more complex patterns. Decision Trees and Random Forests can automatically capture non-linear effects and interactions between features, often giving lower errors and higher R².

Linear regression is simpler and easier to interpret, while non-linear models are more flexible but can overfit if not tuned properly. For predicting wine quality, the Random Forest is the better choice here, most likely because it can handle the subtle interactions between different chemical properties that the linear model misses.
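The claim that tree ensembles pick up non-linear effects and interactions can be illustrated on data where the target is deliberately non-linear. This is a rough sketch on synthetic data, not on the wine dataset: the target formula is an assumption chosen so that a straight-line fit has little to work with. It uses 5-fold cross-validation instead of a single split, which gives a more stable comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical): the target is a squared term
# plus an interaction, with no purely linear component.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(400, 2))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.1, size=400))

# Mean R² over 5 folds for each model
lin_r2 = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=5, scoring="r2").mean()
rf_r2 = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=42),
                        X_demo, y_demo, cv=5, scoring="r2").mean()

print(f"Linear Regression: mean R²={lin_r2:.3f}")
print(f"Random Forest:     mean R²={rf_r2:.3f}")
```

The gap on the wine data is much smaller than on this toy example, which fits the earlier finding that much of the signal there is roughly linear.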