# **Regression Modeling Functions Notebook**
* This notebook contains functions and steps for regression modeling.

### **Candidate Regression Models**

1. **Linear Regression**: A basic regression model that models the relationship between the independent variables and the target variable using a linear equation.

2. **Ridge Regression (L2 Regularization)**: A linear regression model with added L2 regularization to prevent overfitting.

3. **Lasso Regression (L1 Regularization)**: Similar to Ridge Regression, but with L1 regularization, which can lead to feature selection by driving some coefficients to exactly zero.

4. **ElasticNet Regression**: A combination of Ridge and Lasso, incorporating both L1 and L2 regularization.

5. **Polynomial Regression**: Extends linear regression by including polynomial terms of the features to capture non-linear relationships.

6. **Decision Tree Regression**: Uses decision tree algorithms to predict the target variable based on the feature values.

7. **Random Forest Regression**: An ensemble of decision trees that can handle non-linearity and provide improved performance and generalization.

8. **Gradient Boosting Regression**: A boosting technique that builds multiple weak learners (usually decision trees) sequentially, with each one trying to correct the errors of the previous one.

9. **XGBoost (Extreme Gradient Boosting)**: An optimized gradient boosting library that adds regularization and parallelized tree construction; it is often faster than a basic gradient boosting implementation.

10. **LightGBM**: A gradient boosting framework designed for efficiency; it uses histogram-based splitting and leaf-wise tree growth, which helps it handle large datasets.

11. **CatBoost**: A gradient boosting library with built-in support for categorical features.

12. **Support Vector Regression (SVR)**: Fits a function that keeps most training points within a margin (epsilon) of the prediction, using kernels to capture non-linear relationships.

13. **K-Nearest Neighbors (KNN) Regression**: Predicts the target value based on the average of the K-nearest neighbors' target values.

14. **Neural Network Regression**: Utilizes neural networks to model complex relationships between features and target variables.

15. **Bayesian Regression**: Incorporates Bayesian principles to estimate the posterior distribution of model parameters and predictions.

16. **Huber Regression**: A robust regression technique that's less sensitive to outliers compared to ordinary least squares.

17. **Quantile Regression**: Models a chosen quantile of the target variable (for example, the median) rather than its mean; fitting several quantiles helps describe the entire distribution.

18. **Isotonic Regression**: Fits a monotonic (non-decreasing or non-increasing) function of a single feature, so the ordering of the data is preserved in the predictions.
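
Isotonic regression differs from the other candidates because scikit-learn's `IsotonicRegression` accepts only one feature, so it cannot be trained directly on a multi-column feature matrix. The following is a minimal sketch on synthetic one-dimensional data; the data and the variable names (`x`, `iso`, `x_new`) are illustrative assumptions, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative 1-D data: a noisy, increasing relationship (assumed for this sketch).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=50))
y = np.log1p(x) + rng.normal(scale=0.1, size=50)

# increasing=True constrains the fitted function to be non-decreasing.
# out_of_bounds="clip" maps inputs outside the training range to the
# prediction at the nearest training boundary.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
iso.fit(x, y)

# Two of these points lie outside the training range and are clipped.
x_new = np.array([-1.0, 2.5, 5.0, 12.0])
y_new = iso.predict(x_new)
print(y_new)
```

To use isotonic regression with a multi-feature dataset, one option is to fit it on a single column, or on the output of another model, rather than on the full matrix.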

Remember that the performance of these models can vary depending on the nature of your data, the problem you're trying to solve, and the amount of data available. It's a good practice to try multiple models, tune their hyperparameters, and evaluate their performance using appropriate metrics before selecting the best one for your task.
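
As a rough illustration of the tuning-and-evaluation advice above, the sketch below runs a cross-validated grid search over the `alpha` parameter of `Ridge`. The synthetic data, the grid values, and the 5-fold setup are assumptions chosen for the example; suitable values depend on the dataset at hand.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = 2 * X[:, 0] + 3 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=100)

# Hypothetical search grid for the L2 penalty strength.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    Ridge(),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negated MSE
    cv=cv,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_cv_mse = -search.best_score_  # flip the sign back to a plain MSE
print(f"Best alpha: {best_alpha}, cross-validated MSE: {best_cv_mse:.4f}")
```

The same pattern applies to the other candidates by swapping the estimator and the parameter grid. Cross-validated scores are usually a steadier basis for comparison than a single train/test split on a small dataset.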

In [3]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import QuantileRegressor
#from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import mean_squared_error

# Create a sample dataset
np.random.seed(42)
X = np.random.rand(100, 5)  # Generating 100 samples with 5 features
y = 2*X[:, 0] + 3*X[:, 1] - 1.5*X[:, 2] + np.random.randn(100)  # True target values with added noise

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
def linear_regression(X_train, y_train, X_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

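# Ridge Regression
def ridge_regression(X_train, y_train, X_test):
    model = Ridge(alpha=1.0)
    # alpha: Strength of the L2 penalty. Larger values shrink the coefficients more.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Lasso Regression
def lasso_regression(X_train, y_train, X_test):
    model = Lasso(alpha=1.0)
    # alpha: Strength of the L1 penalty. Larger values drive more coefficients to exactly zero.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# ElasticNet Regression
def elastic_net_regression(X_train, y_train, X_test):
    model = ElasticNet(alpha=1.0, l1_ratio=0.5)
    # alpha: Overall strength of the combined penalty.
    # l1_ratio: Mix between the penalties. 1.0 is pure L1 (Lasso), 0.0 is pure L2.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred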

# Decision Tree Regression
def decision_tree_regression(X_train, y_train, X_test):
    model = DecisionTreeRegressor(max_depth=None, min_samples_split=2)
    # max_depth: Maximum depth of the tree. Controls the level of complexity.
    # min_samples_split: Minimum number of samples required to split an internal node.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Random Forest Regression
def random_forest_regression(X_train, y_train, X_test):
    model = RandomForestRegressor(n_estimators=100, max_depth=None, min_samples_split=2)
    # n_estimators: Number of trees in the forest.
    # max_depth: Maximum depth of each tree in the forest.
    # min_samples_split: Minimum number of samples required to split an internal node.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Gradient Boosting Regression
def gradient_boosting_regression(X_train, y_train, X_test):
    model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
    # n_estimators: Number of boosting stages (trees) to be built.
    # learning_rate: Controls the contribution of each tree to the final prediction.
    # max_depth: Maximum depth of each tree in the ensemble.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# XGBoost Regression
def xgboost_regression(X_train, y_train, X_test):
    model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
    # n_estimators: Number of boosting stages (trees) to be built.
    # learning_rate: Controls the contribution of each tree to the final prediction.
    # max_depth: Maximum depth of each tree in the ensemble.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# LightGBM Regression
def lightgbm_regression(X_train, y_train, X_test):
    model = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
    # n_estimators: Number of boosting stages (trees) to be built.
    # learning_rate: Controls the contribution of each tree to the final prediction.
    # max_depth: Maximum depth of each tree in the ensemble.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# CatBoost Regression
def catboost_regression(X_train, y_train, X_test):
    model = CatBoostRegressor(iterations=100, learning_rate=0.1, depth=3)
    # iterations: Number of boosting stages (trees) to be built.
    # learning_rate: Controls the contribution of each tree to the final prediction.
    # depth: Maximum depth of each tree in the ensemble.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Support Vector Regression
def svr_regression(X_train, y_train, X_test):
    model = SVR(kernel='rbf', C=1.0)
    # kernel: Specifies the kernel type used in the algorithm.
    # C: Regularization parameter. Controls the trade-off between fitting to the data and allowing margin violations.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# K-Nearest Neighbors Regression
def knn_regression(X_train, y_train, X_test):
    model = KNeighborsRegressor(n_neighbors=5)
    # n_neighbors: Number of neighbors to use for prediction.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Neural Network Regression
def neural_network_regression(X_train, y_train, X_test):
    model = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, alpha=0.0001)
    # hidden_layer_sizes: Tuple representing the number of neurons in each hidden layer.
    # max_iter: Maximum number of iterations to converge.
    # alpha: L2 regularization term.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Bayesian Regression
# BayesianRidge uses Bayesian principles to estimate the posterior distribution of model parameters and predictions.
def bayesian_regression(X_train, y_train, X_test):
    model = BayesianRidge()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Huber Regression
# HuberRegressor is a robust regression technique that's less sensitive to outliers compared to ordinary least squares.
def huber_regression(X_train, y_train, X_test):
    model = HuberRegressor(epsilon=1.35)
    # epsilon: Threshold on the scaled residual. Samples beyond it are treated as outliers and get a linear
    # rather than squared loss. Smaller values make the model more robust to outliers.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Quantile Regression
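# QuantileRegressor models a chosen quantile of the target rather than its mean; fitting several quantiles helps describe the entire distribution.
def quantile_regression(X_train, y_train, X_test):
    model = QuantileRegressor(quantile=0.5, alpha=1.0)
    # quantile: The quantile to predict. 0.5 (the default) is the median.
    # alpha: Strength of the L1 penalty on the coefficients (1.0 is the default).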
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# Isotonic Regression
# IsotonicRegression fits a monotonic function that preserves the order of the data.
# It accepts only a single feature, so it is left commented out for this 5-feature dataset.
# def isotonic_regression(X_train, y_train, X_test):
#     model = IsotonicRegression(out_of_bounds='clip')
#     # out_of_bounds: Determines how values outside the training domain are handled. 'clip' restricts predictions to the training range.
#     model.fit(X_train, y_train)
#     y_pred = model.predict(X_test)
#     return y_pred

# Evaluate Models
def evaluate_models(y_true, y_preds):
    for model_name, y_pred in y_preds.items():
        mse = mean_squared_error(y_true, y_pred)
        print(f"{model_name} MSE: {mse:.4f}")

# Perform predictions and evaluate
y_preds = {
    'Linear Regression': linear_regression(X_train, y_train, X_test),
    'Ridge Regression': ridge_regression(X_train, y_train, X_test),
    'Lasso Regression': lasso_regression(X_train, y_train, X_test),
    'ElasticNet Regression': elastic_net_regression(X_train, y_train, X_test),
    'Decision Tree Regression': decision_tree_regression(X_train, y_train, X_test),
    'Random Forest Regression': random_forest_regression(X_train, y_train, X_test),
    'Gradient Boosting Regression': gradient_boosting_regression(X_train, y_train, X_test),
    'XGBoost Regression': xgboost_regression(X_train, y_train, X_test),
    'LightGBM Regression': lightgbm_regression(X_train, y_train, X_test),
    'CatBoost Regression': catboost_regression(X_train, y_train, X_test),
    'SVR Regression': svr_regression(X_train, y_train, X_test),
    'KNN Regression': knn_regression(X_train, y_train, X_test),
    'Neural Network Regression': neural_network_regression(X_train, y_train, X_test),
    'Bayesian Regression': bayesian_regression(X_train, y_train, X_test),
    'Huber Regression': huber_regression(X_train, y_train, X_test),
    'Quantile Regression': quantile_regression(X_train, y_train, X_test),
    #'Isotonic Regression': isotonic_regression(X_train, y_train, X_test)
}

# Evaluate and print results
evaluate_models(y_test, y_preds)


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 140
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 5
[LightGBM] [Info] Start training from score 1.872579
0:	learn: 1.4299279	total: 1.56ms	remaining: 155ms
1:	learn: 1.4008955	total: 2.43ms	remaining: 119ms
2:	learn: 1.3789802	total: 3.37ms	remaining: 109ms
3:	learn: 1.3356283	total: 4.18ms	remaining: 100ms
4:	learn: 1.3060525	total: 4.81ms	remaining: 91.4ms
5:	learn: 1.2725260	total: 5.56ms	remaining: 87.1ms
6:	learn: 1.2453403	total: 6.05ms	remaining: 80.5ms
7:	learn: 1.2265735	total: 8.14ms	remaining: 93.7ms
8:	learn: 1.2047794	total: 8.86ms	remaining: 89.5ms
9:	learn: 1.1797702	total: 9.85ms	remaining: 88.7ms
10:	learn: 1.1630638	total: 10.9ms	remaining: 88.5ms
11:	learn: 1.1444302	total: 11.4ms	remaining: 83.7ms
12:	learn: 1.1210697	total: 12.3ms	remaining: 82.1ms
13:	learn: 1.1021982	total: 13.4ms	remaining: 82.5ms
14:	learn: 1.0874311	total: 14ms	remain
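
Polynomial regression appears in the candidate list but has no function in the code cell above. One common way to build it in scikit-learn is a pipeline of `PolynomialFeatures` followed by `LinearRegression`. The sketch below assumes that approach; the function name `polynomial_regression` and the `degree` default are illustrative choices, and the sample data is regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


# Polynomial Regression (hypothetical helper, following the notebook's function pattern)
def polynomial_regression(X_train, y_train, X_test, degree=2):
    # degree: Highest power of the expanded features (illustrative default).
    # include_bias=False: LinearRegression already fits its own intercept.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)


# Same synthetic setup as the notebook's sample dataset.
np.random.seed(42)
X = np.random.rand(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] - 1.5*X[:, 2] + np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

y_pred_poly = polynomial_regression(X_train, y_train, X_test)
poly_mse = mean_squared_error(y_test, y_pred_poly)
print(f"Polynomial Regression MSE: {poly_mse:.4f}")
```

Because the helper takes the same arguments as the other functions, its result could be added to the `y_preds` dictionary under a key such as `'Polynomial Regression'`.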

