<a href="https://colab.research.google.com/github/BYUIDSS/DSS-ML-Bootcamp/blob/main/3a_Regression_Model/ML_Bootcamp_Regression_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression Machine Learning Model 🧠  📈

---



### 🔴 PLEASE COPY THIS NOTEBOOK INTO YOUR OWN GITHUB OR GOOGLE DRIVE. DO NOT MODIFY THIS VERSION. 🔴


## Overview


**XGBoost Regression is a supervised machine learning algorithm that builds an ensemble of decision trees to predict continuous values. It is optimized for speed and performance using gradient boosting techniques.**  

<br>

---

<br>

**Definition**  
XGBoost (Extreme Gradient Boosting) is an advanced implementation of gradient boosting designed for efficiency and accuracy. It improves predictions by sequentially training trees while correcting previous errors. The key components include:  

* **Boosting Trees**: A collection of decision trees built sequentially to reduce errors.  
* **Gradient-Based Optimization**: Fits each new tree using the gradient (and second derivative) of the loss function with respect to the current predictions.  
* **Regularization**: Controls model complexity to prevent overfitting.  

For **regression**, XGBoost predicts continuous values by minimizing a chosen loss function, commonly Mean Squared Error (MSE) or Mean Absolute Error (MAE).  
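
For reference, the regularized objective can be written as follows. This follows the notation of the XGBoost documentation; the $\alpha$ term is the optional L1 penalty, which is zero by default.

$$
\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Here $l$ is the loss (for MSE, $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$), $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w_j$ is the weight (predicted value) of leaf $j$.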

<br>

---

<br>

**Key Concepts**  
1. **Boosting Mechanism**:  
   - Unlike a single decision tree, XGBoost builds multiple trees in sequence.  
   - Each new tree corrects the errors of the previous ones by focusing on residuals.  

2. **Loss Functions**:  
   - Determines how errors are measured and minimized.  
   - Common choices:  
     * **Mean Squared Error (MSE)** – Penalizes larger errors more heavily.  
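     * **Mean Absolute Error (MAE)** – Penalizes errors in proportion to their size, so it is less sensitive to outliers than MSE.  
     * **Huber Loss** – Quadratic for small errors and linear for large ones, combining MSE and MAE to limit the influence of outliers.  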

3. **Regularization Techniques**:  
   - Prevents overfitting by adding penalties to complex models.  
   - **L1 Regularization (Lasso)** – Shrinks leaf weights toward zero, promoting sparsity (`reg_alpha` in XGBoost).  
   - **L2 Regularization (Ridge)** – Penalizes large leaf weights to reduce variance (`reg_lambda` in XGBoost).  

4. **Feature Importance & Selection**:  
   - XGBoost ranks features by importance, aiding feature selection.  
   - Can be used to eliminate redundant or irrelevant features.  
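
The boosting mechanism in item 1 can be sketched in a few lines. This is a simplified illustration of plain gradient boosting for squared error using scikit-learn trees on synthetic data. It is not XGBoost's actual implementation, which also uses second-order information and regularization. The data, `learning_rate`, and `n_trees` values are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data, not the bicycle dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
n_trees = 20

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    residuals = y - prediction                           # errors of the current ensemble
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    small_tree.fit(X, residuals)                         # the new tree learns the residuals
    prediction += learning_rate * small_tree.predict(X)  # shrink and add its correction
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

Each pass fits a shallow tree to what the ensemble still gets wrong, so the training error shrinks as trees are added.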

<br>

---

<br>

**Pros**  
1. **High Performance** – Optimized for speed, scalability, and efficiency.  
2. **Handles Missing Data** – Automatically learns how to deal with missing values.  
3. **Regularization Built-in** – Reduces overfitting with L1 and L2 penalties.  
4. **Works Well with Large Datasets** – Efficient memory usage and parallel processing.  

<br>

---

<br>

**Cons**  
1. **Complexity** – More difficult to tune compared to simpler models.  
2. **Computationally Intensive** – Training many deep trees can still be slow and memory-hungry on very large datasets, despite the library's optimizations.  
3. **Sensitive to Hyperparameters** – Performance depends on careful tuning of learning rate, tree depth, and regularization.  

<br>

---

<br>

**Tips**  
* **Optimize Hyperparameters** – Use grid search or Bayesian optimization for tuning.  
* **Use Early Stopping** – Stops training if performance stops improving on validation data.  
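* **Scale Features if Needed** – Tree-based XGBoost models split on thresholds, so they generally do not need standardized features; scaling matters mainly when you compare against or combine with scale-sensitive models such as linear regression.  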
* **Leverage Feature Importance** – Identify and remove less relevant features to improve efficiency.  

<br>

---

<br>

**Useful Articles and Videos**  
* [XGBoost Official Documentation](https://xgboost.readthedocs.io/en/stable/tutorials/model.html)  
* [XGBoost for Regression – Machine Learning Mastery](https://machinelearningmastery.com/xgboost-for-regression/)  
* [Understanding XGBoost – Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/)  
* [XGBoost Explained – YouTube](https://www.youtube.com/watch?v=OtD8wVaFm6E)  

<br>


## Import Data/Libraries

In [5]:
!pip install lets_plot



In [6]:
# needed libraries for Regression models
import pandas as pd
from sklearn import tree
import xgboost as xgb
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, mean_squared_error
from sklearn.model_selection import cross_val_score,  train_test_split , KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, Normalizer
import lets_plot as lp
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import kagglehub
import statsmodels.api as sm
import statsmodels.formula.api as smf

# foundation dataset
#### If this gives an error, open the Data folder in GitHub, click the bicycle counts CSV, click "Raw" (under History in the upper right-hand corner), and copy that URL into "data_raw_url"
data_raw_url = "https://raw.githubusercontent.com/BYUIDSS/DSS-ML-Bootcamp/refs/heads/main/3a_Regression_Model/Data/nyc-east-river-bicycle-counts.csv?token=GHSAT0AAAAAAC6JBYNOYS7FRMQ2VJSQ5HZ6Z6N4UCQ"
bicycle_df = pd.read_csv(data_raw_url)

## Explore, Visualize and Understand the Data

In [7]:
bicycle_df.head(10)

| Unnamed: 0.1 | Unnamed: 0 | Date | Day | High Temp (°F) | Low Temp (°F) | Precipitation | Brooklyn Bridge | Manhattan Bridge | Williamsburg Bridge | Queensboro Bridge | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-04-01 00:00:00 | 2016-04-01 00:00:00 | 78.1 | 66.0 | 0.01 | 1704.0 | 3126 | 4115.0 | 2552.0 | 11497 |
| 1 | 1 | 2016-04-02 00:00:00 | 2016-04-02 00:00:00 | 55.0 | 48.9 | 0.15 | 827.0 | 1646 | 2565.0 | 1884.0 | 6922 |
| 2 | 2 | 2016-04-03 00:00:00 | 2016-04-03 00:00:00 | 39.9 | 34.0 | 0.09 | 526.0 | 1232 | 1695.0 | 1306.0 | 4759 |
| 3 | 3 | 2016-04-04 00:00:00 | 2016-04-04 00:00:00 | 44.1 | 33.1 | 0.47 (S) | 521.0 | 1067 | 1440.0 | 1307.0 | 4335 |
| 4 | 4 | 2016-04-05 00:00:00 | 2016-04-05 00:00:00 | 42.1 | 26.1 | 0 | 1416.0 | 2617 | 3081.0 | 2357.0 | 9471 |
| 5 | 5 | 2016-04-06 00:00:00 | 2016-04-06 00:00:00 | 45.0 | 30.0 | 0 | 1885.0 | 3329 | 3856.0 | 2849.0 | 11919 |
| 6 | 6 | 2016-04-07 00:00:00 | 2016-04-07 00:00:00 | 57.0 | 53.1 | 0.09 | 1276.0 | 2581 | 3282.0 | 2457.0 | 9596 |
| 7 | 7 | 2016-04-08 00:00:00 | 2016-04-08 00:00:00 | 46.9 | 44.1 | 0.01 | 1982.0 | 3455 | 4113.0 | 3194.0 | 12744 |
| 8 | 8 | 2016-04-09 00:00:00 | 2016-04-09 00:00:00 | 43.0 | 37.9 | 0.09 | 504.0 | 997 | 1507.0 | 1502.0 | 4510 |
| 9 | 9 | 2016-04-10 00:00:00 | 2016-04-10 00:00:00 | 48.9 | 30.9 | 0 | 1447.0 | 2387 | 3132.0 | 2160.0 | 9126 |


In [8]:
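# Try bicycle_df.info() to check column types and missing values

In [9]:
# Investigate bicycle_df.describe() for summary statistics of the numeric columns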

## Feature Engineering and Data Augmentation

### **Data Augmentation**  
**Definition:**
Data augmentation is the process of artificially expanding the size and diversity of a training dataset by applying transformations or modifications to the existing data while preserving the underlying labels or structure. It is commonly used in machine learning, especially in computer vision and natural language processing, to improve model performance and robustness.

### **Feature Engineering**  
**Definition:**
Feature engineering is the process of creating, modifying, or selecting relevant features (input variables) from raw data to improve the performance of a machine learning model. It involves transforming raw data into a format that makes it more suitable for algorithms to learn patterns.
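
As one possible illustration, the sketch below applies a few feature engineering steps to rows copied from the `bicycle_df.head(10)` output shown earlier. The new column names (`day_of_week`, `is_weekend`, `temp_range`) and the choice to keep only the leading number in `Precipitation` are assumptions made for this example, not part of the dataset or a prescribed approach.

```python
import pandas as pd

# A few rows copied from the bicycle_df.head(10) output
sample = pd.DataFrame({
    "Date": ["2016-04-01 00:00:00", "2016-04-02 00:00:00",
             "2016-04-03 00:00:00", "2016-04-04 00:00:00"],
    "High Temp (°F)": [78.1, 55.0, 39.9, 44.1],
    "Low Temp (°F)": [66.0, 48.9, 34.0, 33.1],
    "Precipitation": ["0.01", "0.15", "0.09", "0.47 (S)"],
})

# Keep only the leading number so entries like "0.47 (S)" become 0.47;
# anything without a number becomes NaN
sample["Precipitation"] = pd.to_numeric(
    sample["Precipitation"].astype(str).str.extract(r"(\d+(?:\.\d+)?)")[0],
    errors="coerce",
)

# Derive calendar features from the date
sample["Date"] = pd.to_datetime(sample["Date"])
sample["day_of_week"] = sample["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
sample["is_weekend"] = sample["day_of_week"] >= 5

# Combine two existing columns into a new one
sample["temp_range"] = sample["High Temp (°F)"] - sample["Low Temp (°F)"]

print(sample)
```

Calendar features like these may matter for bicycle counts because commuting patterns differ between weekdays and weekends, but whether they help is something to verify on the full data.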


## Machine Learning Model

### Split the data to train and test

### Create the model

### Train the model

### Make predictions

#### Hyperparameter Search

In [10]:
# Hint: google GridSearchCV()


### Evaluate the Model

**MSE (Mean Squared Error) – The average of the squared differences between the predicted and actual values.**  
Example: If the squared errors for three predictions are 4, 9, and 1, then MSE = (4 + 9 + 1) / 3 ≈ 4.67.

**RMSE (Root Mean Squared Error) – The square root of the MSE, providing an error measure in the same unit as the target variable.**  
Example: With an MSE of 4.67, RMSE = √4.67 ≈ 2.16.

**MAE (Mean Absolute Error) – The average of the absolute differences between the predicted and actual values.**  
Example: If the absolute errors for three predictions are 2, 3, and 1, then MAE = (2 + 3 + 1) / 3 = 2.

**R-squared (Coefficient of Determination) – The proportion of the variance in the dependent variable that is explained by the independent variables.**  
Example: An R-squared value of 0.8 indicates that 80% of the variability in the output is explained by the model, with the remaining 20% unexplained.
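
The worked examples above can be checked by hand with NumPy. The targets and predictions below are hypothetical values chosen so that the errors are 2, -3, and 1, matching the numbers used in the MSE, RMSE, and MAE examples.

```python
import numpy as np

# Hypothetical targets and predictions whose errors are 2, -3, and 1
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 17.0, 31.0])

errors = y_hat - y_true

mse = np.mean(errors ** 2)     # (4 + 9 + 1) / 3
rmse = np.sqrt(mse)            # same unit as the target
mae = np.mean(np.abs(errors))  # (2 + 3 + 1) / 3

# R-squared: 1 minus the ratio of unexplained to total variation
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:.2f}")   # 4.67
print(f"RMSE: {rmse:.2f}")  # 2.16
print(f"MAE:  {mae:.2f}")   # 2.00
print(f"R2:   {r2:.2f}")    # 0.93
```

Note that MSE weights the error of 3 most heavily (9 of the 14 squared units), while MAE counts each error by its plain size.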


In [12]:
# Evaluate the model using regression metrics
# Assumes y_test (held-out targets) and y_pred (model predictions) were defined in the cells above
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Calculate R-squared (R2)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print('Mean Squared Error (MSE):', mse)
print('Root Mean Squared Error (RMSE):', rmse)
print('Mean Absolute Error (MAE):', mae)
print('R-squared (R2):', r2)