# 5. Modeling

## 5.1 Using the Modeling Module

We'll use the structured `ModelTrainer` class from `src.modeling` to ensure consistent and reproducible model training and evaluation.

### 5.1.1 Initialize Model Trainer and Load Data

In [1]:
import sys
sys.path.append('../src')
from modeling import ModelTrainer, prepare_training_data
from config import DATA_PATHS, RANDOM_STATE, TEST_SIZE
import warnings
warnings.filterwarnings('ignore')

# Initialize the model trainer
trainer = ModelTrainer(random_state=RANDOM_STATE)

# Load engineered data for training
X, y = prepare_training_data()
print(f"Training data shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target is log-transformed: {y.name == 'SalePrice_log'}")

Training data shape: (1460, 215)
Target shape: (1460,)
Target is log-transformed: True


## 5.2 Train–Test Split Using ModelTrainer

The `split_data` method gives a consistent, randomized split. Here it holds out 20% of the rows (292 of 1,460) for testing:

In [2]:
# Split data using the ModelTrainer method
X_train, X_test, y_train, y_test = trainer.split_data(X, y, test_size=TEST_SIZE)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Test target shape: {y_test.shape}")

Training set shape: (1168, 215)
Test set shape: (292, 215)
Training target shape: (1168,)
Test target shape: (292,)
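
As a rough illustration of what a seeded split does, the sketch below shuffles row indices with a fixed seed and holds out 20% of them. It is plain Python rather than the `split_data` implementation; `split_indices` and the seed value 42 are hypothetical stand-ins for the method and for `RANDOM_STATE`.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, so global state is untouched
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# Hypothetical seed; the notebook takes its value from config.RANDOM_STATE
train_idx, test_idx = split_indices(1460, 0.2, seed=42)
train_again, test_again = split_indices(1460, 0.2, seed=42)

print(len(train_idx), len(test_idx))
print("Same partition on re-run:", test_idx == test_again)
```

Because the shuffle is seeded, re-running the split returns the same partition, which is what makes later metric comparisons between models fair.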


## 5.3 Baseline Model: Linear Regression

A baseline provides a reference point for more complex models. We train it with the `train_baseline_model` method:

In [3]:
# Train baseline linear regression model
lin_reg = trainer.train_baseline_model(X_train, y_train)
print("Linear Regression model trained successfully")

Linear Regression model trained successfully


In [4]:
# Set up the feature engineer, which provides the inverse log transform used for evaluation
from feature_engineering import FeatureEngineer
feature_engineer = FeatureEngineer()

In [5]:
# If the target is log-transformed, map it back to the original price scale for evaluation
if y_train.name == 'SalePrice_log':
    # Back-transform both predictions and targets to prices
    y_pred_lr = feature_engineer.inverse_log_transform(lin_reg.predict(X_test))
    y_test_original = feature_engineer.inverse_log_transform(y_test)
else:
    y_pred_lr = lin_reg.predict(X_test)
    y_test_original = y_test

# NOTE: evaluate_model receives the model and X_test, not y_pred_lr. If it predicts
# internally, log-scale predictions are scored against price-scale targets.
lr_metrics = trainer.evaluate_model(lin_reg, X_test, y_test_original)
print(f"Linear Regression - RMSE: {lr_metrics['RMSE']:.2f}, MAE: {lr_metrics['MAE']:.2f}, R²: {lr_metrics['R2']:.4f}")

Linear Regression - RMSE: 199122.20, MAE: 178827.82, R²: -4.1692
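
An RMSE close to the typical sale price together with a negative R² is the pattern one would expect when log-scale predictions are scored against price-scale targets. The sketch below reproduces that pattern on toy numbers (illustrative values, not taken from the dataset) and shows the same predictions scored after back-transformation. It assumes the inverse transform is `expm1`, which may differ from what `inverse_log_transform` actually does; `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1 - ss_res / ss_tot,
    }


# Toy prices and log-scale predictions that are within a few percent of them
prices = [120000.0, 150000.0, 180000.0, 210000.0, 300000.0]
factors = [1.05, 0.97, 1.02, 0.95, 1.04]
log_preds = [math.log1p(p * f) for p, f in zip(prices, factors)]

# Mismatched scales: log predictions (about 12) against prices (about 10^5)
mismatched = regression_metrics(prices, log_preds)

# Matched scales: back-transform predictions before scoring (expm1 is an assumption)
matched = regression_metrics(prices, [math.expm1(v) for v in log_preds])

print("mismatched:", {k: round(v, 4) for k, v in mismatched.items()})
print("matched:   ", {k: round(v, 4) for k, v in matched.items()})
```

If `evaluate_model` does predict internally, scoring `y_pred_lr` against `y_test_original` directly (or `model.predict(X_test)` against the untransformed `y_test`) would keep both sides on one scale.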


## 5.4 Regularized Models (Ridge, Lasso, Elastic Net)

Regularization is critical given the **high-dimensional feature space** after one-hot encoding: 215 features for only 1,168 training rows.
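
To make the effect of the penalty concrete, here is a minimal sketch of ridge shrinkage for a single centered feature with no intercept, where the closed-form solution is `w = sum(x*y) / (sum(x*x) + alpha)`. The data points and the helper name `ridge_slope` are illustrative only and are not part of `ModelTrainer`.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge coefficient for one feature, no intercept.

    Minimizes sum((y - w*x)^2) + alpha * w^2.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


# Toy data with a true slope near 2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

alphas = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(x, y, a) for a in alphas]
for a, w in zip(alphas, slopes):
    print(f"alpha={a:>5}: slope={w:.4f}")
```

With `alpha=0` this is ordinary least squares. Larger values pull the coefficient toward zero, which is what stabilizes the fit when many correlated one-hot columns compete for the same signal.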

### 5.4.1 Ridge Regression Using ModelTrainer


In [6]:
# Train Ridge regression model
ridge = trainer.train_ridge_model(X_train, y_train, alpha=1.0)

# Evaluate Ridge model against the price-scale targets from the baseline cell
if y_train.name == 'SalePrice_log':
    # Price-scale predictions, kept for inspection; evaluate_model does not receive them
    y_pred_ridge = feature_engineer.inverse_log_transform(ridge.predict(X_test))
ridge_metrics = trainer.evaluate_model(ridge, X_test, y_test_original)

print(f"Ridge Regression - RMSE: {ridge_metrics['RMSE']:.2f}, MAE: {ridge_metrics['MAE']:.2f}, R²: {ridge_metrics['R2']:.4f}")

Ridge Regression - RMSE: 199122.20, MAE: 178827.81, R²: -4.1692


### 5.4.2 Lasso Regression Using ModelTrainer

In [7]:
# Train Lasso regression model
lasso = trainer.train_lasso_model(X_train, y_train, alpha=0.001, max_iter=10000)

# Evaluate Lasso model against the price-scale targets from the baseline cell
if y_train.name == 'SalePrice_log':
    # Price-scale predictions, kept for inspection; evaluate_model does not receive them
    y_pred_lasso = feature_engineer.inverse_log_transform(lasso.predict(X_test))
lasso_metrics = trainer.evaluate_model(lasso, X_test, y_test_original)

print(f"Lasso Regression - RMSE: {lasso_metrics['RMSE']:.2f}, MAE: {lasso_metrics['MAE']:.2f}, R²: {lasso_metrics['R2']:.4f}")

Lasso Regression - RMSE: 199122.20, MAE: 178827.81, R²: -4.1692


### 5.4.3 Elastic Net Using ModelTrainer

In [8]:
# Train Elastic Net model
elastic = trainer.train_elasticnet_model(X_train, y_train, alpha=0.001, l1_ratio=0.5, max_iter=10000)

# Evaluate Elastic Net model against the price-scale targets from the baseline cell
if y_train.name == 'SalePrice_log':
    # Price-scale predictions, kept for inspection; evaluate_model does not receive them
    y_pred_elastic = feature_engineer.inverse_log_transform(elastic.predict(X_test))
elastic_metrics = trainer.evaluate_model(elastic, X_test, y_test_original)

print(f"Elastic Net - RMSE: {elastic_metrics['RMSE']:.2f}, MAE: {elastic_metrics['MAE']:.2f}, R²: {elastic_metrics['R2']:.4f}")

Elastic Net - RMSE: 199122.20, MAE: 178827.81, R²: -4.1692
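
The `alpha` and `l1_ratio` arguments control how the L1 and L2 penalties are mixed. Assuming `train_elasticnet_model` forwards them to scikit-learn's `ElasticNet`, the penalty added to the loss is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The sketch below evaluates that term for a made-up weight vector; `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term in scikit-learn's ElasticNet objective."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_sq = sum(w * w for w in weights)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1 - l1_ratio) * l2_norm_sq


weights = [0.5, -0.25, 0.0, 1.0]  # illustrative coefficients

mixed = elastic_net_penalty(weights, alpha=0.001, l1_ratio=0.5)       # settings used above
pure_l1 = elastic_net_penalty(weights, alpha=0.001, l1_ratio=1.0)     # Lasso-style penalty
pure_l2 = elastic_net_penalty(weights, alpha=0.001, l1_ratio=0.0)     # Ridge-style penalty

print(f"l1_ratio=0.5: {mixed:.9f}")
print(f"l1_ratio=1.0: {pure_l1:.9f}")
print(f"l1_ratio=0.0: {pure_l2:.9f}")
```

At `l1_ratio=0.5` the penalty is the average of the pure L1 and pure L2 cases, so the model keeps some of Lasso's sparsity while retaining Ridge's stability with correlated features.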


## 5.5 Model Comparison Summary Using ModelTrainer

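The `get_model_comparison_table` method returns a table of metrics for all trained models. The call and its output appear in the final cell of this notebook, after the automated pipeline in Section 5.6 has been run.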

## 5.6 Complete Automated Training Pipeline

The `train_all_models` method provides a complete automated training and evaluation pipeline:

In [9]:
# Run complete automated training pipeline
all_results = trainer.train_all_models(X, y, tune_hyperparameters=False)
print("All models trained and evaluated:")
for model_name, metrics in all_results.items():
    print(f"{model_name}: RMSE={metrics['RMSE']:.2f}, R²={metrics['R2']:.4f}")

All models trained and evaluated:
Linear Regression: RMSE=0.21, R²=0.7735
Ridge: RMSE=0.13, R²=0.9059
Lasso: RMSE=0.14, R²=0.8933
Elastic Net: RMSE=0.14, R²=0.8980
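
These RMSE values are several orders of magnitude smaller than those in Sections 5.3 and 5.4, which suggests they are computed on the log-transformed target. Assuming a natural-log (or `log1p`) target, an RMSE of `r` corresponds to a typical multiplicative error of roughly `exp(r) - 1`. The sketch below applies that conversion to two of the reported values; `log_rmse_to_relative_error` is a hypothetical helper.

```python
import math


def log_rmse_to_relative_error(log_rmse):
    """Approximate relative error implied by an RMSE on a natural-log target."""
    return math.expm1(log_rmse)


# RMSE values as reported in this notebook's comparison table
ridge_rel = log_rmse_to_relative_error(0.132508)
linear_rel = log_rmse_to_relative_error(0.205600)

print(f"Ridge: about {ridge_rel:.1%} typical relative error")
print(f"Linear Regression: about {linear_rel:.1%} typical relative error")
```

This is only a rule of thumb, but it gives the log-scale numbers a practical reading in terms of percentage error on the sale price.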


## 5.7 Best Model Selection

The `get_best_model` method automatically identifies the best performing model:

In [10]:
# Get the best model based on RMSE
best_model_name, best_model = trainer.get_best_model(metric='RMSE')
print(f"Best model: {best_model_name}")
print(f"Best model RMSE: {trainer.model_results[best_model_name]['RMSE']:.2f}")

Best model: Ridge
Best model RMSE: 0.13
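
A minimal sketch of how selection by metric could work, using the metric values reported in this notebook's comparison table. `pick_best` is a hypothetical helper and not the `get_best_model` implementation; the point is that error metrics are minimized while R² is maximized.

```python
# Metric values copied from the comparison table in this notebook
results = {
    "Linear Regression": {"RMSE": 0.205600, "MAE": 0.095930, "R2": 0.773479},
    "Ridge": {"RMSE": 0.132508, "MAE": 0.094151, "R2": 0.905910},
    "Lasso": {"RMSE": 0.141097, "MAE": 0.095121, "R2": 0.893316},
    "Elastic Net": {"RMSE": 0.137940, "MAE": 0.093091, "R2": 0.898037},
}


def pick_best(results, metric="RMSE"):
    """Return the model name with the best value for the given metric."""
    # R2 is better when larger; RMSE and MAE are better when smaller
    chooser = max if metric == "R2" else min
    return chooser(results, key=lambda name: results[name][metric])


best_by_rmse = pick_best(results, "RMSE")
best_by_mae = pick_best(results, "MAE")
best_by_r2 = pick_best(results, "R2")

print("Best by RMSE:", best_by_rmse)
print("Best by MAE: ", best_by_mae)
print("Best by R2:  ", best_by_r2)
```

The choice of metric matters here: Ridge wins on RMSE and R², while Elastic Net has the lowest MAE.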


## 5.8 Model Persistence

The `save_model` method allows saving trained models for future use:

In [None]:
# Save the best model under a filename derived from its name, e.g. "ridge_model.pkl"
model_path = trainer.save_model(best_model, f"{best_model_name.lower().replace(' ', '_')}_model.pkl")
print(f"Best model saved to: {model_path}")

Best model saved to: e:\Projects_3\Data Science\Regression\Housing Prices_2\models\ridge_model.pkl
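
For readers without access to `src.modeling`, the sketch below shows one common way to persist and restore a model object using the standard library's `pickle`. Whether `save_model` uses `pickle`, `joblib`, or something else is not shown in this notebook, so treat `save_pickle`, `load_pickle`, and `model_filename` as hypothetical helpers. A plain dictionary stands in for the fitted estimator.

```python
import pickle
import tempfile
from pathlib import Path


def model_filename(model_name):
    """Build a filename the same way the notebook does, e.g. 'ridge_model.pkl'."""
    return f"{model_name.lower().replace(' ', '_')}_model.pkl"


def save_pickle(model, filename, models_dir):
    """Serialize an object into models_dir and return the full path."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / filename
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_pickle(path):
    """Restore an object previously written by save_pickle."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Stand-in for a fitted estimator (illustrative values only)
stand_in_model = {"name": "Ridge", "alpha": 1.0, "coef": [0.12, -0.03, 0.4]}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_pickle(stand_in_model, model_filename("Ridge"), tmp_dir)
    saved_name = saved_path.name
    restored_model = load_pickle(saved_path)

print(saved_name, restored_model == stand_in_model)
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.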


## 5.9 Conclusion

Using the `ModelTrainer` module from `src.modeling`, we have:

* **Built an automated training pipeline** from structured, reusable methods
* Trained and evaluated multiple regression models consistently
* Seen that regularization matters in this high-dimensional feature space: in the automated pipeline, Ridge reached R² = 0.906 versus 0.773 for plain linear regression
* Automated model comparison and selection
* Handled model persistence systematically

### Key Benefits of Using the ModelTrainer Module:

1. **Consistency**: Same training and evaluation approach across all models
2. **Reproducibility**: Fixed random states and systematic data splitting
3. **Automation**: Complete pipeline from training to comparison in single calls
4. **Persistence**: Built-in model saving and loading functionality
5. **Flexibility**: Easy hyperparameter tuning and model comparison

### Next Logical Steps:

* Hyperparameter tuning with `GridSearchCV` (built-in methods available)
* Feature importance interpretation using the `get_feature_importance_data` method
* Model deployment using the saved models
* Advanced modeling techniques (ensemble methods, etc.)

In [12]:
# Create model comparison table
results = trainer.get_model_comparison_table()
print(results)

                       RMSE       MAE        R2
Ridge              0.132508  0.094151  0.905910
Elastic Net        0.137940  0.093091  0.898037
Lasso              0.141097  0.095121  0.893316
Linear Regression  0.205600  0.095930  0.773479
