
# Predictive Analysis Using Machine Learning — Regression

**Goal:** Build and evaluate a machine learning regression model to predict outcomes from a dataset, demonstrating **feature selection**, **model training**, and **evaluation**.

**Dataset:** California Housing — available via scikit-learn.

**Outline:**
1. Load and explore data
2. Split data; create preprocessing & baseline model
3. Perform feature selection
4. Train tuned models
5. Evaluate with metrics & visualizations
6. Inspect feature importance and conclude


In [None]:

# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Ensure reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


In [None]:

# Load dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X.head(), y.head()



In [None]:

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

X_train.shape, X_test.shape




## Baseline Model (No Feature Selection)

We'll start with a linear regression model on all features to establish a baseline.


In [None]:

# Baseline: Linear Regression on all features
numeric_features = X.columns.tolist()

preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_features)],
    remainder="drop",
)

baseline_reg = Pipeline(steps=[
    ("prep", preprocess),
    ("reg", LinearRegression())
])

baseline_reg.fit(X_train, y_train)
y_pred_base = baseline_reg.predict(X_test)

baseline_metrics = {
    "MAE": mean_absolute_error(y_test, y_pred_base),
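    # np.sqrt of the MSE works on every scikit-learn version; the `squared=False` argument was deprecated and later removed
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred_base)),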
    "R2": r2_score(y_test, y_pred_base),
}
baseline_metrics
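
The three metrics reported in the baseline cell can be written out directly from their definitions. The following is a minimal sketch using NumPy on small made-up arrays; the helper name `regression_metrics` and the toy values are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Hypothetical helper: MAE, RMSE and R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))            # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))     # square root of the mean squared error
    ss_res = np.sum(residuals ** 2)             # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up targets and predictions, for illustration only
toy_true = [3.0, 1.0, 2.0, 4.0]
toy_pred = [2.5, 1.0, 2.0, 5.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

MAE and RMSE are in the units of the target, while R² is unitless and compares the model against always predicting the mean of `y_true`.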



## Feature Selection

We'll use **f_regression** with `SelectKBest` to score features and choose the most informative ones.
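
To make the scoring step concrete, here is a rough sketch of a univariate F-statistic computed from the Pearson correlation between each feature and the target. To the best of my understanding this is the quantity `f_regression` reports for centered data, but treat it as an illustration rather than the library's implementation. The helper `f_scores` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def f_scores(X, y):
    """Hypothetical helper: univariate F-statistic for each column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    Xc = X - X.mean(axis=0)   # center each feature
    yc = y - y.mean()         # center the target
    # Pearson correlation of every column with the target
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    dof = n_samples - 2       # degrees of freedom for a one-feature linear fit
    return r ** 2 / (1.0 - r ** 2) * dof

# Synthetic data: column 0 drives the target, column 1 is pure noise
rng = np.random.default_rng(0)
informative = rng.normal(size=200)
noise_col = rng.normal(size=200)
toy_X = np.column_stack([informative, noise_col])
toy_y = 3.0 * informative + rng.normal(scale=0.5, size=200)

toy_scores = f_scores(toy_X, toy_y)
top_k = np.argsort(toy_scores)[::-1][:1]   # index of the single best-scoring column
print(toy_scores, top_k)
```

Keeping the `k` columns with the highest scores is the selection rule that `SelectKBest` applies. Because each feature is scored on its own, interactions between features are not captured.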


In [None]:

# Try several k values and pick the one with best cross-validated score
# The dataset has 8 features, so listing 8 and X_train.shape[1] would test the same k twice
k_values = [3, 5, 6, X_train.shape[1]]

pipe_fs = Pipeline(steps=[
    ("prep", preprocess),
    ("select", SelectKBest(score_func=f_regression)),
    ("reg", LinearRegression())
])

param_grid = {"select__k": k_values}
grid_fs = GridSearchCV(pipe_fs, param_grid=param_grid, cv=5, scoring="r2", n_jobs=-1)
grid_fs.fit(X_train, y_train)

best_k = grid_fs.best_params_["select__k"]
best_k, grid_fs.best_score_


In [None]:

# GridSearchCV (refit=True by default) has already refit the best pipeline on the full training set; evaluate it on the test set
best_fs_model = grid_fs.best_estimator_
y_pred_fs = best_fs_model.predict(X_test)

fs_metrics = {
    "MAE": mean_absolute_error(y_test, y_pred_fs),
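    # np.sqrt of the MSE works on every scikit-learn version; the `squared=False` argument was deprecated and later removed
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred_fs)),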
    "R2": r2_score(y_test, y_pred_fs),
}
fs_metrics



## Model Training with an Alternative Regressor

We'll also train a **RandomForestRegressor** behind the same preprocessing and feature selection steps, reusing the `k` chosen above, and tune three hyperparameters: `n_estimators`, `max_depth`, and `min_samples_split`.


In [None]:

pipe_rf = Pipeline(steps=[
    ("prep", preprocess),
    ("select", SelectKBest(score_func=f_regression, k=best_k)),
    ("rf", RandomForestRegressor(random_state=RANDOM_STATE))
])

param_grid_rf = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 5]
}

grid_rf = GridSearchCV(pipe_rf, param_grid=param_grid_rf, cv=5, scoring="r2", n_jobs=-1)
grid_rf.fit(X_train, y_train)

y_pred_rf = grid_rf.best_estimator_.predict(X_test)

rf_metrics = {
    "MAE": mean_absolute_error(y_test, y_pred_rf),
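    # np.sqrt of the MSE works on every scikit-learn version; the `squared=False` argument was deprecated and later removed
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred_rf)),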
    "R2": r2_score(y_test, y_pred_rf),
}

grid_rf.best_params_, rf_metrics
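
The cost of the search grows with the product of the grid sizes. The sketch below uses only the standard library to enumerate the candidates for a grid shaped like the one tuned in this section; the names `toy_grid`, `candidates` and `n_cv_fits` are introduced here for illustration.

```python
from itertools import product

# Same shape as the random forest grid tuned in this section
toy_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 5],
}
n_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(toy_grid, values)) for values in product(*toy_grid.values())]
n_cv_fits = len(candidates) * n_folds

print(len(candidates), n_cv_fits)  # 12 candidates, 60 cross-validation fits
```

With the default `refit=True`, the search then fits the best candidate once more on the full training set, so adding values to any list multiplies the runtime.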



## Feature Scores and Importances

We'll inspect which features were selected and their scores. For tree-based models, we'll also inspect feature importances.
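
One detail is easy to get wrong: a model trained after the selection step only sees the kept columns, so its importances are indexed by position among the selected features, not among the original ones. The sketch below shows the mapping with NumPy. The mask and importance values are made up for illustration; only the feature names come from the California Housing dataset.

```python
import numpy as np

feature_names = np.array([
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
])

# Made-up selection mask (True = column kept) and made-up importances,
# one importance per kept column, in kept-column order
toy_mask = np.array([True, False, True, False, False, False, True, True])
toy_importances = np.array([0.6, 0.1, 0.2, 0.1])

kept_names = feature_names[toy_mask]
# The lengths must agree, otherwise names and importances are misaligned
assert len(kept_names) == len(toy_importances)

ranked = sorted(zip(kept_names, toy_importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.2f}")
```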


In [None]:

# Reuse the SelectKBest step already fitted inside the best linear regression pipeline
selector = grid_fs.best_estimator_.named_steps["select"]

selected_mask = selector.get_support()
selected_features = np.array(numeric_features)[selected_mask]
scores = selector.scores_[selected_mask]

feat_scores = pd.DataFrame({"feature": selected_features, "f_score": scores}).sort_values("f_score", ascending=False)
feat_scores.head(best_k)


In [None]:

# Plot feature scores
fig = plt.figure(figsize=(8, 6))
plt.barh(feat_scores["feature"], feat_scores["f_score"])
plt.gca().invert_yaxis()
plt.xlabel("F-score")
plt.title("Top Features by F-score")
plt.tight_layout()
plt.show()


In [None]:

# Feature importances from RandomForest
rf_model = grid_rf.best_estimator_.named_steps["rf"]
selected_feature_names = selected_features

importances = rf_model.feature_importances_
imp_df = pd.DataFrame({"feature": selected_feature_names, "importance": importances}).sort_values("importance", ascending=False)
imp_df.head(best_k)


In [None]:

# Plot RF feature importances
fig = plt.figure(figsize=(8, 6))
plt.barh(imp_df["feature"], imp_df["importance"])
plt.gca().invert_yaxis()
plt.xlabel("Importance")
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()



## Model Comparison


In [None]:

results = pd.DataFrame([
    {"Model": "Baseline LinearRegression (all features)", **baseline_metrics},
    {"Model": f"LinearRegression + SelectKBest(k={best_k})", **fs_metrics},
    {"Model": f"RandomForest + SelectKBest(k={best_k})", **rf_metrics},
])
results.sort_values("R2", ascending=False)



## Conclusion

- We established a baseline with Linear Regression on all features, against which the other models are compared.
- Using **feature selection (SelectKBest with f_regression)** helped identify the most informative features and can improve interpretability.
- An alternative **Random Forest** model provided a useful comparison; depending on hyperparameters and selected features, it can achieve better R² and lower error metrics.
- This notebook demonstrates a full workflow: preprocessing, feature selection, model training, hyperparameter tuning, and evaluation.
