
In [None]:
# Data generation
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Generate dataset (>1000 samples, >30 features)
X, y = make_regression(
    n_samples=1500,
    n_features=35,
    n_informative=25,
    noise=15,
    random_state=42
)

data = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
data["target"] = y

data.head()


In [None]:
# Introduce some missing values artificially
np.random.seed(42)
for col in data.columns[:5]:
    data.loc[data.sample(frac=0.02).index, col] = np.nan

# Handle missing values using median imputation
data = data.fillna(data.median())


Missing values were filled with each column's median. Unlike the mean, the
median is not pulled toward extreme values, so the imputed entries stay close
to the center of each feature's distribution.
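
As a small illustration of why the median is the more robust choice, the sketch below uses a hypothetical toy column (`col`, not part of the dataset above) with one extreme value and one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one extreme value and one missing entry
col = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan])

# The mean is dragged toward the outlier; the median is not
mean_fill = col.fillna(col.mean())
median_fill = col.fillna(col.median())

print("mean:", col.mean(), "median:", col.median())
print("imputed with mean:  ", mean_fill.iloc[-1])
print("imputed with median:", median_fill.iloc[-1])
```

With the mean, the missing entry would be replaced by a value far from most of the observed data; with the median it lands in the middle of the typical range.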


In [None]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler

X = data.drop("target", axis=1)
y = data["target"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


Standardization rescales every feature to zero mean and unit variance so that
no feature dominates simply because of its scale. This matters most for
scale-sensitive models such as Ridge and Lasso regression, whose penalties act
directly on coefficient magnitudes.
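
The cell above fits the scaler on the full dataset. A common alternative, sketched here on hypothetical toy data (`X_demo` and the other names are illustrative assumptions), is to fit the scaler on the training split only and reuse its statistics for the test split, so that test data does not influence the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)  # reuses the training statistics

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

The training columns come out with mean 0 and standard deviation 1; the test columns are only approximately standardized, which is expected.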


In [None]:
# Feature Selection (SelectKBest)
from sklearn.feature_selection import SelectKBest, mutual_info_regression

selector = SelectKBest(score_func=mutual_info_regression, k=20)
X_selected = selector.fit_transform(X_scaled, y)

selected_features = X.columns[selector.get_support()]
selected_features


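Mutual information was used to select the 20 highest-scoring features.
This step reduces dimensionality and makes the model easier to interpret.
Because the data were generated with 25 informative features, keeping 20
necessarily discards at least five informative ones.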
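
To see how the mutual information scores drive the selection, here is a minimal sketch on hypothetical toy data in which only the first column carries signal. The names (`X_demo`, `selector_demo`) are illustrative; `functools.partial` is used only to fix the estimator's `random_state` so the scores are reproducible.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Hypothetical toy data: only column 0 is related to the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

# Fix random_state so the nearest-neighbor MI estimate is reproducible
score_func = partial(mutual_info_regression, random_state=0)
selector_demo = SelectKBest(score_func=score_func, k=1).fit(X_demo, y_demo)

chosen = np.flatnonzero(selector_demo.get_support())
print("selected column(s):", chosen)
print("MI scores:", selector_demo.scores_.round(3))
```

The `scores_` attribute can be inspected in the same way for the notebook's selector to see how close the 20th and 21st features are.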


In [None]:
# Model Building & Evaluation
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42
)

# Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

r2, rmse


Linear Regression was used as a baseline model.
R² measures the share of target variance the model explains, while RMSE
quantifies prediction error in the original units of the target.
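
For reference, both metrics can be computed by hand. The sketch below uses a hypothetical four-point example (`y_true`, `y_hat`) rather than the notebook's predictions.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# Residual and total sums of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot            # fraction of variance explained
rmse_manual = np.sqrt(ss_res / len(y_true))  # error in the target's units

print(r2_manual, rmse_manual)
```

Here every prediction is off by 0.5, so the RMSE is 0.5, and the residual sum of squares (1.0) is 5% of the total (20.0), giving an R² of 0.95.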


In [None]:
# K-Fold Cross Validation
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(lr, X_selected, y, cv=kf, scoring="r2")

cv_r2.mean(), cv_r2.std()


5-fold cross-validation provides a more robust performance estimate by
averaging results across multiple data splits.
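
In the cell above, scaling and feature selection were fitted on the full dataset before cross-validation, so each held-out fold has already influenced the preprocessing. One way to avoid this, sketched here under the assumption of a smaller hypothetical dataset so that it runs quickly, is to wrap all steps in a `Pipeline` so they are refitted inside every fold.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Smaller hypothetical dataset (sizes are illustrative assumptions)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=5, noise=15, random_state=42
)

# Scaling and selection are refitted on the training part of each fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=mutual_info_regression, k=5)),
    ("lr", LinearRegression()),
])

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo, scoring="r2")

print(pipe_scores.mean(), pipe_scores.std())
```

The same pattern should carry over to the notebook's data by passing the unscaled features and `k=20` instead of the toy values.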


In [None]:
# Model Enhancement
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.05)

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

ridge_r2 = r2_score(y_test, ridge.predict(X_test))
lasso_r2 = r2_score(y_test, lasso.predict(X_test))

ridge_r2, lasso_r2






Ridge regression reduces overfitting by shrinking coefficients toward zero
with an L2 penalty, while Lasso's L1 penalty can additionally set some
coefficients exactly to zero, which acts as automatic feature selection.
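
The difference between the two penalties can be seen by counting exact zeros among the fitted coefficients. This is a minimal sketch on a hypothetical dataset with many uninformative features; the `alpha` values are illustrative assumptions, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Hypothetical dataset: 5 informative features out of 20
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=5, random_state=0
)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Ridge shrinks coefficients but rarely makes them exactly zero;
# Lasso can remove features entirely
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("zero coefficients, Ridge:", n_zero_ridge)
print("zero coefficients, Lasso:", n_zero_lasso)
```

How many coefficients Lasso zeroes out depends strongly on `alpha`; with a small value such as 0.05, few or none may be removed.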


In [None]:
# Polynomial Regression (Creativity)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression())
])

poly_model.fit(X_train, y_train)
poly_pred = poly_model.predict(X_test)

poly_r2 = r2_score(y_test, poly_pred)
poly_r2


Polynomial regression of degree 2 adds squared and pairwise interaction terms,
letting a linear model capture non-linear relationships that standard linear
regression cannot.
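
The degree-2 expansion grows the feature space considerably, which is worth keeping in mind when judging overfitting. A quick sketch of the count for 20 input features (the dummy array is only used to let the transformer infer the input width):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Same settings as the pipeline above, fitted on a dummy row of 20 features
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
poly_demo.fit(np.zeros((1, 20)))

# 20 linear terms + 20 squares + 190 pairwise interactions
n_out = poly_demo.n_output_features_
print(n_out)
```

With 20 selected features this gives 230 columns, so the polynomial model has more than ten times as many coefficients to estimate as the baseline.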


In [None]:
# Visualization
import matplotlib.pyplot as plt

residuals = y_test - y_pred

plt.figure(figsize=(6,4))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()


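Residual analysis helps check the linearity and homoscedasticity assumptions:
the residuals should scatter evenly around zero, with no visible trend or
funnel shape as the predicted values change.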
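
As a rough numeric companion to the plot, one can check whether the size of the residuals grows with the predicted value. The sketch below uses simulated residuals (all names and distributions are hypothetical) to contrast a constant-variance case with a funnel-shaped one; it is an informal heuristic, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 10.0, size=2000)  # simulated predicted values

# Constant spread vs. spread that grows with the prediction
resid_const = rng.normal(scale=1.0, size=2000)
resid_funnel = rng.normal(scale=1.0, size=2000) * pred

# Correlation between predictions and absolute residuals
corr_const = np.corrcoef(pred, np.abs(resid_const))[0, 1]
corr_funnel = np.corrcoef(pred, np.abs(resid_funnel))[0, 1]

print("constant variance:", round(corr_const, 3))
print("funnel shape:     ", round(corr_funnel, 3))
```

A correlation near zero is consistent with homoscedastic residuals, while a clearly positive value suggests the error variance increases with the prediction.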


In [None]:
# Lasso Feature Importance
coef = pd.Series(lasso.coef_, index=selected_features)

coef[coef != 0].sort_values().plot(kind="barh", figsize=(6,6))
plt.title("Lasso Selected Feature Importance")
plt.show()


Lasso regression highlights the most influential features by driving
irrelevant coefficients to zero.


CONCLUSION

This project demonstrated a complete regression workflow including
preprocessing, feature selection, model training, evaluation, and enhancement.

Regularization (Ridge and Lasso) and polynomial regression were then compared
against the linear baseline, and the Lasso coefficients provided additional
insight into feature relevance and model behavior.
