<a href="https://colab.research.google.com/github/Requenamar3/Machine-Learning/blob/main/MR2_CAP4631C_Mini_project_Group_1_Anays_Garcia%2C_Martha_Requena_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comprehensive Assignment / Mini Project

### CAP 4631C - Spring 2025

## General Instructions

# 1. **Submission Requirements:**
# - Every member of the group MUST submit the same iPython Notebook on Canvas.
# - You are only allowed **one** question/clarification for this assignment. If something is unclear, make a **reasonable assumption** and document it.
# - Use **5% as the threshold** for assessing percentage reduction or increase when applicable.

---



In [None]:
# Import necessary libraries for data processing and numerical calculations
import numpy as np  # For numerical calculations
import pandas as pd  # For handling datasets

# Import visualization libraries
import matplotlib.pyplot as plt  # For data visualization

# Import dataset library
from sklearn import datasets  # For loading sample datasets

# Import model selection and evaluation tools
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV  # For model training, validation, and hyperparameter tuning
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

# Import machine learning models - Regression
from sklearn.linear_model import LinearRegression  # Linear regression model
from sklearn.tree import DecisionTreeRegressor  # Decision tree regression model
from sklearn import tree  # For visualizing decision trees
# Explicitly import plot_tree function
from sklearn.tree import plot_tree
from sklearn.preprocessing import PolynomialFeatures  # Creating polynomial features
from sklearn.pipeline import Pipeline  # Automating preprocessing + modeling
from sklearn.preprocessing import StandardScaler  # Used to center or standardize the desired variable(s)

# Import ensemble learning models - Forest-based
from sklearn.ensemble import RandomForestRegressor  # Random forest regression model
from sklearn.ensemble import ExtraTreesRegressor  # Extremely Randomized Trees for regression

# Import ensemble learning models - Boosting
from sklearn.ensemble import GradientBoostingRegressor  # Gradient Boosting regression model
from sklearn.ensemble import AdaBoostRegressor  # AdaBoost regression model
from sklearn.ensemble import BaggingRegressor  # Bagging regression model

---
---
# **Question 1 (30 points): Linear Regression & Best Subset Selection**

# **Instructions:**
# - Use the `diabetes_df` dataset.
# - Outcome variable: `Y`
# - Predictors: All columns except `Y`

## - **Diabetes Dataset** (from `sklearn.datasets`)

### **Understanding Diabetes Progression with Linear Regression**

---
Imagine you're a data scientist tasked with predicting the progression of diabetes based on patient characteristics such as BMI (Body Mass Index), age, and other medical indicators. You want to build a machine learning model that helps doctors estimate the severity of diabetes for different patients. The dataset you have is from **Scikit-learn's Diabetes dataset**, which includes 10 independent variables (features) and one dependent variable (target), which represents a measure of diabetes progression.

---


In [None]:
diabetes_data = datasets.load_diabetes()
diabetes_df = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
diabetes_df['Y'] = diabetes_data.target

In [None]:
# Separate the independent variables (features) and the dependent variable (target)
x_all_diabetes = diabetes_df.drop('Y', axis=1)  # All predictors
y = diabetes_df['Y']  # Target variable

# **Part (a): Train Linear Regression with BMI & Age as predictors**


In [None]:
# Split the dataset into training and testing sets (80% train, 20% test)
x_train_d, x_test_d, y_train_d, y_test_d = train_test_split(x_all_diabetes, y, test_size=0.2, random_state=1)


In [None]:
# Select only 'bmi' and 'age' as independent variables
x_train_diabet = x_train_d[['bmi', 'age']]
x_test_diabet = x_test_d[['bmi', 'age']]

In [None]:
# Train a linear regression model using only 'bmi' and 'age'
lr_diabet_model = LinearRegression().fit(x_train_diabet, y_train_d)

# - Write the regression equation.

In [None]:
# Retrieve the intercept and coefficients of the trained model
intercept = lr_diabet_model.intercept_
coefficients = lr_diabet_model.coef_

In [None]:
# Print regression equation
equation = f"Y = {intercept:.4f} + {coefficients[0]:.4f}*bmi + {coefficients[1]:.4f}*age"
print("Regression Equation:", equation)

Regression Equation: Y = 151.7227 + 953.9009*bmi + 99.7463*age
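To sanity-check the fitted equation, we can plug one patient's values in by hand. The sketch below uses the rounded coefficients printed above. The inputs (`bmi = 0.061`, `age = 0.037`) and the helper name `predict_progression` are our own illustrative choices. scikit-learn ships these predictors mean-centered and scaled, so typical values are small, which is why the coefficients look large.

```python
# Rounded coefficients copied from the printed regression equation
INTERCEPT = 151.7227
COEF_BMI = 953.9009
COEF_AGE = 99.7463


def predict_progression(bmi: float, age: float) -> float:
    """Evaluate Y = intercept + b1*bmi + b2*age for one patient."""
    return INTERCEPT + COEF_BMI * bmi + COEF_AGE * age


# Illustrative (assumed) inputs on the dataset's scaled units
example_prediction = predict_progression(bmi=0.061, age=0.037)
print(f"Predicted Y: {example_prediction:.2f}")
```

A patient at the average of both predictors (both values equal to 0) is predicted at the intercept, 151.72.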


# - Compute the **test Mean Squared Error (MSE)**.

In [None]:
# Obtain predictions on the test set
y_pred_test_lr = lr_diabet_model.predict(x_test_diabet)

# Compute Mean Squared Error (MSE)
mse = mean_squared_error(y_test_d, y_pred_test_lr)

# Display MSE
print("\nTest Mean Squared Error for Linear Regression Model:", mse)


Test Mean Squared Error for Linear Regression Model: 3889.760177676556


In [None]:
rmse_lr = np.sqrt(mean_squared_error(y_test_d, y_pred_test_lr))
print("Root Mean Squared Error Lr:", rmse_lr)


Root Mean Squared Error Lr: 62.36794190669238


In [None]:
# Normalizing the RMSE by dividing it by the mean of the target variable
normalized_rmse_lr = rmse_lr / np.mean(y)

In [None]:
# Printing results
print("Root Mean Squared Error for Lr:", rmse_lr)
print("Normalized RMSE Lr:", normalized_rmse_lr)

Root Mean Squared Error for Lr: 62.36794190669238
Normalized RMSE Lr: 0.4099553904905794


---


## **Part (b): Best Subset Selection**

In [None]:
# Run an external script (bss_definitions.py) from the specified directory
%run "/content/sample_data/bss_definitions.py"



In [None]:
# Perform best subset selection
bss_cv_diabet = my_best_subset_selection_cv(x_train_d, y_train_d, folds=10)


Cross-validation results for predictors: ['bmi']
Cross-validation MSE values: [4428.58156781 3744.94528957 3449.02844641 5168.69924663 4100.97206098
 3333.42686357 3572.49578957 3320.55211654 3569.1143104  4362.38219425]
Mean cross-validation MSE: 3905.0198





Cross-validation results for predictors: ['bmi', 's5']
Cross-validation MSE values: [3627.25716083 3314.60374146 3403.72448862 4063.33802171 3620.28499082
 2965.32576376 3035.39356039 2535.75424936 2612.15419454 3249.17233194]
Mean cross-validation MSE: 3242.7009

Cross-validation results for predictors: ['bmi', 'bp', 's5']
Cross-validation MSE values: [3370.14873007 3351.09415337 3781.19404891 3612.61432856 3447.79996499
 2832.76416333 2892.68468224 2425.49105815 2563.18859693 3209.45307596]
Mean cross-validation MSE: 3148.6433

Cross-validation results for predictors: ['bmi', 'bp', 's3', 's5']
Cross-validation MSE values: [3233.92018207 3397.47425965 3517.09836969 3413.70202595 3449.74626566
 2597.46994378 2847.98880853 2498.05480369 2685.74173376 3094.02162371]
Mean cross-validation MSE: 3073.5218

Cross-validation results for predictors: ['sex', 'bmi', 'bp', 's3', 's5']
Cross-validation MSE values: [2933.3090766  3010.84058364 3479.61214591 3262.24990297 3099.89312772
 2408.3177409

# - Identify the **best predictors** for predicting `Y`.


In [None]:
# Ensure that all column contents are fully visible when printed
pd.set_option('display.max_colwidth', None)
# Display the cross-validation results for the BSS (Best Subset Selection) model
bss_cv_diabet

| Index | Predictors | Mean_CV_MSE |
|---|---|---|
| 0 | [bmi] | 3905.0198 |
| 1 | [bmi, s5] | 3242.7009 |
| 2 | [bmi, bp, s5] | 3148.6433 |
| 3 | [bmi, bp, s3, s5] | 3073.5218 |
| 4 | [sex, bmi, bp, s3, s5] | 2960.2805 |
| 5 | [sex, bmi, bp, s1, s3, s5] | 2968.495 |
| 6 | [sex, bmi, bp, s1, s2, s4, s5] | 2969.4027 |
| 7 | [sex, bmi, bp, s1, s2, s4, s5, s6] | 2978.6571 |
| 8 | [age, sex, bmi, bp, s1, s2, s4, s5, s6] | 2992.3225 |
| 9 | [age, sex, bmi, bp, s1, s2, s3, s4, s5, s6] | 3010.6571 |


In [None]:
# Define the threshold for percent reduction in Mean Cross-Validation MSE (CV MSE)
percent_reduction_threshold = 5

In [None]:
# Compute the percentage reduction in Mean CV MSE between successive models
percent_reduction = -(bss_cv_diabet['Mean_CV_MSE'].pct_change().dropna()) * 100
percent_reduction.name = "Pct reduc from previous model"

# - Report the selected predictors.



In [None]:
# Identify the last model where the percent reduction in Mean CV MSE meets or exceeds the threshold
last_row_above_threshold = bss_cv_diabet.iloc[
    percent_reduction[percent_reduction >= percent_reduction_threshold].index[-1]
]['Predictors']

# Print the selected predictors based on the threshold
print("BSS selects the following predictors for the model:")
print(last_row_above_threshold)  # Display the predictors selected
print()  # Print an empty line for better readability

# Inform the user that adding more predictors results in diminishing returns
print("Adding additional predictors results in a % reduction in Mean CV MSE below",
      percent_reduction_threshold, "%:")

BSS selects the following predictors for the model:
['bmi', 's5']

Adding additional predictors results in a % reduction in Mean CV MSE below 5 %:
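The selection rule can be checked by hand with plain Python. The sketch below copies the first five rows of the Mean CV MSE table and applies the 5% threshold. The variable names are our own, and only the first five subset sizes are included because the later rows no longer reduce the error.

```python
# Mean CV MSE per subset size, copied from the BSS table above
predictors = [
    ["bmi"],
    ["bmi", "s5"],
    ["bmi", "bp", "s5"],
    ["bmi", "bp", "s3", "s5"],
    ["sex", "bmi", "bp", "s3", "s5"],
]
mean_cv_mse = [3905.0198, 3242.7009, 3148.6433, 3073.5218, 2960.2805]
THRESHOLD_PCT = 5.0

# Percent reduction of each model relative to the previous (smaller) one
reductions = [
    (prev - curr) / prev * 100
    for prev, curr in zip(mean_cv_mse, mean_cv_mse[1:])
]

# Keep the largest model whose step still meets the threshold
selected = predictors[0]
for size_index, reduction in enumerate(reductions, start=1):
    if reduction >= THRESHOLD_PCT:
        selected = predictors[size_index]

print([round(r, 2) for r in reductions])
print("Selected:", selected)
```

Only the step from one to two predictors clears 5% (about 16.96%). The next three steps are each below 4%, which is why the rule stops at `bmi` and `s5`.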


In [None]:
# Train a linear regression model using the selected predictors
selected_features = list(last_row_above_threshold)  # Convert selected predictors to a list
x_train_selected = x_train_d[selected_features]
x_test_selected = x_test_d[selected_features]

In [None]:
# Fit model with best subset of features
lr_diabet_bss_training = LinearRegression().fit(x_train_selected, y_train_d)

In [None]:
# Predict on test set
y_pred_test_diabet_bss = lr_diabet_bss_training.predict(x_test_selected)

# - Compute the **test MSE**.

In [None]:
# Compute Test Mean Squared Error (MSE)
mse_bss = mean_squared_error(y_test_d, y_pred_test_diabet_bss)

# Print the test MSE
print("\nTest Mean Squared Error (MSE) for Best Subset Selection Model:", mse_bss)


Test Mean Squared Error (MSE) for Best Subset Selection Model: 3254.1047720984925


In [None]:
rmse_bss = np.sqrt(mean_squared_error(y_test_d, y_pred_test_diabet_bss))

In [None]:
# Normalizing the RMSE by dividing it by the mean of the target variable
normalized_rmse_bss = rmse_bss / np.mean(y)

In [None]:
# Printing results
print("Root Mean Squared Error  for Bss:", rmse_bss)
print("Normalized RMSE for Bss:", normalized_rmse_bss)

Root Mean Squared Error  for Bss: 57.04476112754345
Normalized RMSE for Bss: 0.374965192189138


---
---
# **Part (c): Model Comparison**
# - Compare the **MSE** values of models from (a) and (b).


In [None]:
# Print test MSE comparison
print(f"Test MSE (Linear Regression): {mse:.4f}")
print(f"Test MSE (Best Subset Selection): {mse_bss:.4f}")

Test MSE (Linear Regression): 3889.7602
Test MSE (Best Subset Selection): 3254.1048


In [None]:
mse_difference = (mse - mse_bss) / mse * 100  # Percent reduction in test MSE relative to the Part (a) model
mse_difference

16.341763413233235

In [None]:
# Absolute and relative reduction in test MSE, both measured against the Part (a) model
mse_difference = mse - mse_bss
mse_improvement_percentage = (mse_difference / mse) * 100  # Baseline is the Part (a) test MSE

# Print improvement results
print(f"\nMSE Improvement using Best Subset Selection: {mse_difference:.2f}")
print(f"Percentage Improvement in MSE: {mse_improvement_percentage:.2f}%")



MSE Improvement using Best Subset Selection: 635.66
Percentage Improvement in MSE: 16.34%


In [None]:
# Compute Training MSE using the best subset selection model
y_pred_train_bss = lr_diabet_bss_training.predict(x_train_selected)  # Predict on training data
train_mse_bss = mean_squared_error(y_train_d, y_pred_train_bss)  # Calculate train MSE

# Predict on training data using only the features the model was trained on
y_pred_train_lr = lr_diabet_model.predict(x_train_d[['bmi', 'age']])

# Compute Training Mean Squared Error (MSE)
train_mse_lr = mean_squared_error(y_train_d, y_pred_train_lr)

# Calculate the difference
mse_difference = train_mse_lr - train_mse_bss

print("\n📌 Difference")
print(f"Training MSE (Linear Regression): {train_mse_lr:.4f}")
print(f"Training MSE (BSS): {train_mse_bss:.4f}")
print(f"Difference in Training Error: {mse_difference:.2f}")



📌 Difference
Training MSE (Linear Regression): 3846.9565
Training MSE (BSS): 3194.1284
Difference in Training Error: 652.83


---
The **Best Subset Selection (BSS) model** (`bmi` and `s5`) outperforms the **Part (a) model** (`bmi` and `age`), with a **lower test MSE (3254.10 vs. 3889.76)**. That is a reduction of about 16.3%, well above the 5% threshold. Both models use two predictors, so the gain comes from replacing `age` with the more informative `s5` rather than from a simpler model.  
However, the BSS model's **normalized test RMSE (37.50%)** still exceeds the **20%** threshold we treat as acceptable, so its predictions remain imprecise. Under the 5% rule, BSS stops at two predictors even though larger subsets reach a lower mean CV MSE (2960.28 with five predictors vs. 3242.70 with two), so some useful signal may have been left out.  

---


# **Part (d): Predictions for 8 New Patients**

✅ **Input the given standardized values for `Age`, `BMI`, `BP`, and `S5`**  

In [None]:
# Provided data for new patients
age_data = np.array([0.037, -0.045, 0.101, 0.67, 0.38, 0.002, -0.011, 0.018])
bmi_data = np.array([0.061, 0.03, -0.034, 0.11, -0.087, 0.0001, 0.018, -0.057])
s5_data = np.array([0.031, 0.09, -0.054, -0.011, -0.087, 0.006, 0.058, -0.032])
y_actual = np.array([144, 168, 59, 205, 97, 134, 79, 88])


✅ **Use the selected predictors from Part (b) to predict `Y` for 8 new patients**

In [None]:
# Ensure correct predictors are used from Best Subset Selection
best_subset_features = list(selected_features)  # Flat copy of the BSS-selected predictor names

In [None]:
# Create a dataframe for new patient data with selected predictors ['bmi', 's5']
new_patient_data = pd.DataFrame({
    'bmi': bmi_data,
    's5': s5_data
})

In [None]:
# Select only the predictors identified by Best Subset Selection
new_patient_data = new_patient_data[selected_features]

In [None]:
# Train the model using the correct predictors from the training set
x_train_bss = x_train_d[selected_features]  # Use only the BSS-selected predictors
new_model_bss = LinearRegression()
new_model_bss.fit(x_train_bss, y_train_d)  # Train on selected features

# - Predict `Y` for new patients using the model from (b).


In [None]:
# Predict Y for new patients
y_pred_new_bss = new_model_bss.predict(new_patient_data)

✅ **Compute the Root Mean Squared Error (RMSE)** for these predictions

In [None]:
# Create a DataFrame to display results
df_results_bss = pd.DataFrame({
    'Actual Y': y_actual,
    'Predicted Y (BSS)': y_pred_new_bss.round(2)
})

# Display the DataFrame
print("\n📌 Predictions for New Patients:")
print(df_results_bss)


📌 Predictions for New Patients:
   Actual Y  Predicted Y (BSS)
0       144             212.97
1       168             227.05
2        59              95.39
3       205             221.72
4        97              38.51
5       134             155.43
6        79             199.34
7        88              92.67


In [None]:
# Compute Mean Squared Error (MSE)
mse_new_bss = mean_squared_error(y_actual, y_pred_new_bss)
print("\nTest Mean Squared Error for new_bss:", mse_new_bss)


Test Mean Squared Error for new_bss: 3528.829917394264


In [None]:
# Compute Root Mean Squared Error (RMSE)
rmse_new_bss = np.sqrt(mean_squared_error(y_actual, y_pred_new_bss))
print("Root Mean Squared Error New Bss:", rmse_new_bss)

Root Mean Squared Error New Bss: 59.403955401928116


In [None]:
# Compute Relative RMSE
mean_y = np.mean(y_test_d)  # Average Y from test data
coeff_variation_bss = (rmse_new_bss / mean_y) * 100

In [None]:
print(f"\n📌 Model Performance Evaluation:")
print(f"Root Mean Squared Error (RMSE): {rmse_new_bss:.4f}")
print(f"Mean of Y: {mean_y:.4f}")
print(f"Coefficient of Variation (BSS): {coeff_variation_bss:.2f}%")


📌 Model Performance Evaluation:
Root Mean Squared Error (RMSE): 59.4040
Mean of Y: 147.2022
Coefficient of Variation (BSS): 40.36%


In [None]:
# Normalizing the RMSE by dividing it by the mean of the target variable
normalized_rmse_new_bss = rmse_new_bss/ np.mean(y_actual)

In [None]:
# Printing results
print("Root Mean Squared Error for new_bss:", rmse_new_bss)
print("Normalized rmse_new_bss:", normalized_rmse_new_bss )

Root Mean Squared Error for new_bss: 59.403955401928116
Normalized rmse_new_bss: 0.4879174981677874


The error **is higher than ideal**, so the model’s predictions for these patients **are not highly reliable**. With an **RMSE of 59.40** and a **test-set mean Y of 147.20**, the **coefficient of variation is 40.36%**, **well above the 20% threshold** we treat as acceptable. Relative to the mean of the eight new patients' actual values, the normalized RMSE is higher still, at **48.79%**.  

While Best Subset Selection **removed predictors that added little under the 5% rule**, it may also have **dropped variables that carry useful signal**. Using only `bmi` and `s5` **may not fully capture** the relationship between the predictors and Y, which could explain the high error. The largest miss is patient 6 (predicted 199.34, actual 79).  

Overall, the model **gives a rough estimate, but it is not yet accurate enough for reliable predictions**. Adding predictors or trying other modeling techniques **could improve performance**.  
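To see where the error comes from, we can recompute it directly from the predictions table. This sketch uses the rounded predicted values as printed, so the RMSE differs from the notebook's figure only in the later decimals. The names `worst_patient` and `worst_share` are our own.

```python
import math

# Actual and predicted values copied from the predictions table above
y_actual = [144, 168, 59, 205, 97, 134, 79, 88]
y_predicted = [212.97, 227.05, 95.39, 221.72, 38.51, 155.43, 199.34, 92.67]

squared_errors = [(a - p) ** 2 for a, p in zip(y_actual, y_predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

# Normalize by the mean of these eight patients' actual values
mean_actual = sum(y_actual) / len(y_actual)
normalized = rmse / mean_actual

# Which patient contributes the most squared error?
worst_patient = max(range(len(squared_errors)), key=squared_errors.__getitem__)
worst_share = squared_errors[worst_patient] / sum(squared_errors)

print(f"RMSE: {rmse:.2f}, normalized: {normalized:.4f}")
print(f"Patient {worst_patient} accounts for {worst_share:.0%} of the squared error")
```

With only eight patients, a single large miss can dominate the RMSE, so the figure should be read with that in mind.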

---
# Question 2 (25 points): Decision Trees, Random Forest, and Boosting



## **Part (a): Cost-Complexity Pruning for Decision Tree**

### - Train a Decision Tree with **cost-complexity pruning**.


In [None]:
# Train a Decision Tree without pruning
reg_tree_unpruned = DecisionTreeRegressor(random_state=1)

# Fit the model on the training dataset
reg_tree_unpruned.fit(x_train_d, y_train_d)

In [None]:
# Retrieve the cost complexity pruning path (ccp_alpha values)
ccp_path = reg_tree_unpruned.cost_complexity_pruning_path(x_train_d, y_train_d)

In [None]:
# Define the hyperparameter grid using the extracted alpha values
hyperparam_grid_alpha = {'ccp_alpha': ccp_path['ccp_alphas']}

In [None]:
# Set up K-Fold cross-validation to evaluate the model performance
cv_set_up = KFold(n_splits=5, shuffle=True, random_state=1)  # 5-fold cross-validation with shuffling


In [None]:
# Set up Grid Search CV for pruning using different ccp_alpha values
grid_search_setting_alpha = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=1),
    param_grid=hyperparam_grid_alpha,
    cv=cv_set_up,
    scoring='neg_mean_squared_error'
)


In [None]:
# Run the grid search: fit a tree for every candidate ccp_alpha under 5-fold CV
grid_search_setting_alpha.fit(x_train_d, y_train_d)

In [None]:
# Print the best ccp_alpha value that resulted in the lowest cross-validation (CV) MSE
print("The alpha that led to the lowest CV MSE was: ", grid_search_setting_alpha.best_params_)

In [None]:
# Train a pruned Decision Tree Regressor using the optimal ccp_alpha found via grid search
reg_tree_diabet_postp = DecisionTreeRegressor(
    random_state=1,
    ccp_alpha=grid_search_setting_alpha.best_params_['ccp_alpha']
)


In [None]:
reg_tree_diabet_postp.fit(x_train_d, y_train_d)

In [None]:
y_pred_test_tree_diabet_postp = reg_tree_diabet_postp.predict(x_test_d)

In [None]:
root_mean_squared_error(y_test_d, y_pred_test_tree_diabet_postp)

In [None]:
# Express the RMSE as a percentage of the mean of the target variable
coeff_variation_tree = root_mean_squared_error(y_test_d, y_pred_test_tree_diabet_postp) / np.mean(y) * 100
print(f"coeff_variation_tree: {coeff_variation_tree:.2f}%")

In [None]:
# Compute Mean Squared Error (MSE)
mse_tree = mean_squared_error(y_test_d, y_pred_test_tree_diabet_postp)
print("\nTest Mean Squared Error for Tree:", mse_tree)

In [None]:
plt.rcParams['figure.figsize'] = [20, 10]
tree.plot_tree(reg_tree_diabet_postp,filled=True,rounded=True,feature_names=x_train_d.columns,fontsize=8)
plt.show()

In [None]:
# Predictors actually used by the pruned tree (nonzero feature importance)
x_train_d.columns[reg_tree_diabet_postp.feature_importances_ != 0]

In [None]:
reg_tree_diabet_postp.get_depth()

### - Write all **IF-THEN rules** from the tree.

**Rule 1:**  
   **IF** `bmi <= 0.009` **AND** `s5 <= -0.004`  
   **THEN** Prediction = `95.684`  

---
**Rule 2:**  
   **IF** `bmi <= 0.009` **AND** `s5 > -0.004`  
   **THEN** Prediction = `156.205`  

---
**Rule 3:**  
   **IF** `bmi > 0.009` **AND** `bp <= 0.024`  
   **THEN** Prediction = `176.347`  

---
**Rule 4:**  
   **IF** `bmi > 0.009` **AND** `bp > 0.024`  
   **THEN** Prediction = `242.369`  
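The four rules can also be written as a small function, which makes it easy to check which leaf a patient falls into. This is a hand transcription of the rules above, with thresholds rounded as shown there, and not output from the fitted tree. The function name and the example patients are our own.

```python
def predict_from_rules(bmi: float, s5: float, bp: float) -> float:
    """Apply the four IF-THEN rules transcribed from the pruned tree."""
    if bmi <= 0.009:
        return 95.684 if s5 <= -0.004 else 156.205
    return 176.347 if bp <= 0.024 else 242.369


# Illustrative (assumed) patients, one per rule
print(predict_from_rules(bmi=-0.02, s5=-0.03, bp=0.0))  # Rule 1
print(predict_from_rules(bmi=-0.02, s5=0.01, bp=0.0))   # Rule 2
print(predict_from_rules(bmi=0.05, s5=0.0, bp=0.0))     # Rule 3
print(predict_from_rules(bmi=0.05, s5=0.0, bp=0.05))    # Rule 4
```

Note that `s5` only matters when `bmi <= 0.009`, and `bp` only matters when `bmi > 0.009`.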

---

## **Part (b): Train a Random Forest**

### - Set `p/2` as the number of predictors.


In [None]:
num_features = x_train_d.shape[1] # Get the total number of features
max_features = int(num_features / 2)
feature_subset_array = np.arange(1, max_features + 1)

print("Feature Subset Array:", feature_subset_array)


### - Train trees with **50 to 1000** trees (increase by 50 each time).

In [None]:
number_of_trees = np.arange(50, 1001, 50)
#number_of_trees

In [None]:
# Defining the hyperparameter grid for Random Forest tuning / model
hyperparam_grid_rf = {
    'n_estimators': number_of_trees,  # Number of trees in the forest (controls complexity and performance)
    'max_features': feature_subset_array  # Number of features to consider for best split at each node
}


### - Use **5-fold Cross Validation**.

In [None]:
# Setting up cross-validation with 5 splits, shuffling data, and ensuring reproducibility
cv_set_up = KFold(n_splits=5, shuffle=True, random_state=1)


In [None]:
# Set up GridSearchCV for hyperparameter tuning of Random Forest Regressor
grid_search_setting_rf = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),  # Initialize Random Forest model with a fixed random state for reproducibility
    param_grid=hyperparam_grid_rf,  # Use the defined hyperparameter grid for tuning
    cv=cv_set_up,  # Cross-validation strategy (e.g., K-Fold or other CV method)
    scoring='neg_mean_squared_error'  # Evaluation metric (negative MSE for minimizing error in regression tasks)
)


In [None]:
# Performing hyperparameter tuning using GridSearchCV on the training data
grid_search_setting_rf.fit(x_train_d, y_train_d)


In [None]:

# Printing the best hyperparameter values found by GridSearchCV
print('Selected hyperparameter values:', grid_search_setting_rf.best_params_)


In [None]:
rf_diabetes = RandomForestRegressor(n_estimators=600, max_features=4, random_state=1)

In [None]:
rf_diabetes.fit(x_train_d, y_train_d)

In [None]:
# Making predictions on the test set using the trained Random Forest model
y_pred_test_rf = rf_diabetes.predict(x_test_d)

In [None]:
# Calculating the Root Mean Squared Error (RMSE) for model evaluation
rmse_rf = root_mean_squared_error(y_test_d, y_pred_test_rf)
rmse_rf

In [None]:

# Normalizing the RMSE by dividing it by the mean of the target variable
normalized_rmse_rf = rmse_rf / np.mean(y)

In [None]:
# Printing results
print("Root Mean Squared Error (RMSE):", rmse_rf)
print("Normalized RMSE:", normalized_rmse_rf)

In [None]:
df_feature_imp = pd.Series(data=rf_diabetes.feature_importances_, index=x_train_d.columns , name= "Predictor Importance in Forest")


In [None]:
df_feature_imp.sort_values(ascending=False)

## **Part (c): Train a Boosted Tree Model**

### - Learning rate = **0.01**.
### - Number of trees = **100**.
### - Max depth = **2 to 6** (increase by 1 each time).
### - Use **5-fold Cross Validation**.



In [None]:
# Define hyperparameter values
learning_rate = 0.01
n_estimators = 100
max_depth_values = range(2, 7)  # 2, 3, 4, 5, 6
cv_folds = 5

In [None]:
number_of_trees_boosting = 100

In [None]:
# Displaying the values for verification
print(number_of_trees_boosting)

In [None]:
# Defining a range of small tree depths for Gradient Boosting, as suggested by the textbook
depth_values = range(2, 7)  # 2, 3, 4, 5, 6
depth_values

In [None]:
# Defining lambda (learning rate) values for Gradient Boosting tuning
lambda_values = 0.01
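Part (c) fixes the learning rate at 0.01 and the number of trees at 100, and tunes only the maximum depth (2 to 6) with 5-fold cross-validation. A minimal sketch of that search, assuming the same 80/20 split and `random_state=1` used earlier, might look like the following. The names `boost_search`, `y_pred_test_boost`, and `rmse_boost` are our own, and the block reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Same data and split settings as earlier in the notebook
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fixed learning rate and tree count; only max_depth is tuned
boost_search = GridSearchCV(
    estimator=GradientBoostingRegressor(
        learning_rate=0.01, n_estimators=100, random_state=1
    ),
    param_grid={"max_depth": list(range(2, 7))},  # 2, 3, 4, 5, 6
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error",
)
boost_search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training set by default
y_pred_test_boost = boost_search.best_estimator_.predict(X_test)
rmse_boost = float(np.sqrt(mean_squared_error(y_test, y_pred_test_boost)))

print("Selected max_depth:", boost_search.best_params_["max_depth"])
print("Test RMSE:", rmse_boost)
print("Normalized RMSE:", rmse_boost / float(np.mean(y)))
```

With such a small learning rate and only 100 trees, the ensemble may still be underfit. The test RMSE from this sketch is what Part (d) would compare against the Random Forest's `rmse_rf`.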

### - Set `p/2` as the number of predictors.
### - Train trees with **50 to 1000** trees (increase by 50 each time).
### - Use **5-fold Cross Validation**.



In [None]:
# Use the trained and pruned decision tree to make predictions on the test dataset
y_pred_test_tree_diabet_postp = reg_tree_diabet_postp.predict(x_test_d)




In [None]:
# Calculate RMSE (Root Mean Squared Error) on the test data
# RMSE measures how far predictions are from actual values
test_rmse_2nd_tree = root_mean_squared_error(y_test_d, y_pred_test_tree_diabet_postp)


In [None]:
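# Test RMSE of the unpruned tree (assumed here to be the "1st tree" in this comparison)
rmse = root_mean_squared_error(y_test_d, reg_tree_unpruned.predict(x_test_d))

# Compute the percentage difference dynamically
rmse_difference_ratio_among_trees = ((rmse - test_rmse_2nd_tree) / rmse) * 100

# Print the results
print(f"test_rmse_1st_tree: {rmse:.4f}")
print(f"test_rmse_2nd_tree: {test_rmse_2nd_tree:.4f}")
print(f"Relative RMSE Difference among trees: {rmse_difference_ratio_among_trees:.4f}%")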


In [None]:
# Define the range of trees to test
number_of_trees = np.arange(50, 1001, 50)

# Determine the number of features available from x_train_d
p = x_train_d.shape[1]
p_over_2 = p // 2  # Use integer division to get a whole number

# Define the range of features to consider (3, 5, 7, 9)
number_of_features = np.arange(3, x_train_d.shape[1] + 1, 2)

# Define the hyperparameter grid for Random Forest
hyperparam_grid_rf = {
    'n_estimators': number_of_trees,  # Number of trees in the forest
    'max_features': number_of_features  # Number of features to consider per split
}

In [None]:
# Set up 5-fold cross-validation
cv_set_up = KFold(n_splits=5, shuffle=True, random_state=1)

# Use GridSearchCV to find the best hyperparameters
grid_search_setting_rf = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=hyperparam_grid_rf,
    cv=cv_set_up,
    scoring='neg_mean_squared_error'
)

# Train using Grid Search
grid_search_setting_rf.fit(x_train_d, y_train_d)

# Print the best hyperparameters found
print('This is the hyperparameter combination that led to the lowest CV MSE:', grid_search_setting_rf.best_params_)

In [None]:
# Creating an array of values representing the number of trees for a Random Forest model
# The values range from 50 to 1000 (inclusive) in increments of 50
number_of_trees = np.arange(50, 1001, 50)
number_of_trees

In [None]:
# Getting the number of features in the training set
x_train_d.shape[1]


In [None]:
# Let's try 2, 4, ..., 10
# I do not try the full range to minimize computation time.

# Generate an array of even numbers starting from 2 up to the total number of features in x_train_d
# This will be used to evaluate model performance with different feature subset sizes
number_of_features = np.arange(2, x_train_d.shape[1] + 1, 2)
number_of_features


In [None]:
# Defining the hyperparameter grid for Random Forest tuning / model
hyperparam_grid_rf = {
    'n_estimators': number_of_trees,  # Number of trees in the forest (controls complexity and performance)
    'max_features': number_of_features  # Number of features to consider for best split at each node
}


In [None]:
# Use 5 folds to match the 5-fold cross-validation requirement stated for this question

# Setting up cross-validation with 5 splits, shuffling data, and ensuring reproducibility
cv_set_up = KFold(n_splits=5, shuffle=True, random_state=1)


In [None]:
# Set up GridSearchCV for hyperparameter tuning of Random Forest Regressor
grid_search_setting_rf = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),  # Initialize Random Forest model with a fixed random state for reproducibility
    param_grid=hyperparam_grid_rf,  # Use the defined hyperparameter grid for tuning
    cv=cv_set_up,  # Cross-validation strategy (e.g., K-Fold or other CV method)
    scoring='neg_mean_squared_error'  # Evaluation metric (negative MSE for minimizing error in regression tasks)
)


In [None]:
# Performing hyperparameter tuning using GridSearchCV on the diabetes training data
grid_search_setting_rf.fit(x_train_d, y_train_d)


## **Part (d): Compare Boosted Trees vs Random Forest**

### - Determine which model is better using **error metrics**.

---

# Question 3 (10 points): Regression Tree Analysis

# **Instructions:**
# - Answer the following questions based on a given regression tree image.

# **Part (a): Find tree depth**
# **Part (b): Identify the leaf with the lowest number of observations**
# **Part (c): List all predictor variables in the tree**
# **Part (d): Predict `Y` for given `X` values**
# **Part (e): Identify the leaf with the smallest prediction error**
# **Part (f): Determine `min_samples_split` from the tree**


# TODO: Answer based on tree visualization (Markdown cell)

# - **Orange Dataset** (CSV file from Canvas) → Used for Question 4.

# Question 4 (15 points): Predicting Tree Age from Circumference

# **Instructions:**
# - Load the **Orange dataset**.
# - Train at least **three regression models**.
# - Use **Adjusted R²** to select the best model.
# - Propose the best regression equation for predicting tree age.


Statistical approach:
Evaluate the performance of the equation on the training data only, but adjust the evaluation for overfitting.

How to adjust for overfitting?

Use adjusted R², Residual Standard Error (RSE), Cp, BIC, or any other metric that penalizes model complexity.
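Adjusted R² applies that penalty through the degrees of freedom: it scales the unexplained share of variance by `(n - 1) / (n - p - 1)`, where `n` is the number of observations and `p` the number of predictors. A minimal sketch follows. The helper name `adjusted_r2` and the example inputs (R² = 0.80, `n = 35`) are assumptions for illustration, not results from the Orange data.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical example: the same R^2 is penalized more as predictors are added
N_OBS = 35  # assumed sample size, for illustration only
adj_one_predictor = adjusted_r2(0.80, N_OBS, 1)
adj_two_predictors = adjusted_r2(0.80, N_OBS, 2)
print(round(adj_one_predictor, 4))
print(round(adj_two_predictors, 4))
```

A polynomial term therefore has to raise the plain R² by enough to offset the extra predictor, or the adjusted R² goes down.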

### **Step 1: Load & Explore the Orange Dataset**
✅ **Load the Orange dataset**  
✅ **Display sample data to understand its structure**  
✅ **Identify the outcome variable (`Tree Age`) and predictor (`Circumference`)**  
✅ **Visualize the relationship between `Circumference` and `Tree Age` using scatter plots**  

---

In [None]:
url= 'https://raw.githubusercontent.com/Requenamar3/Machine-Learning/refs/heads/main/Orange.csv'

In [None]:
orange_tree_df = pd.read_csv(url)

In [None]:
orange_tree_df.info()

In [None]:
x = orange_tree_df[['circumference']]
y = orange_tree_df['age']

In [None]:
orange_tree_df[["circumference",'age']].corr()

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(orange_tree_df['circumference'], orange_tree_df['age'], c= 'blue')
plt.xlabel('Circumference')
plt.ylabel('Age')
plt.title('Scatter Plot of Circumference vs Age')
plt.show()

In [None]:
print(np.array(orange_tree_df['age']).reshape(-1, 1).ndim)
print(np.array(orange_tree_df['age']).reshape(-1, 1).shape)

In [None]:
# Reshape both columns into 2-D column vectors, as scikit-learn expects for X
x_circu = np.array(orange_tree_df['circumference']).reshape(-1, 1)
y_age = np.array(orange_tree_df['age']).reshape(-1, 1)

### Linear regression

In [None]:
lr_orange_circ = LinearRegression()
lr_orange_circ.fit(x_circu, y_age)

In [None]:
lr_orange_circ.coef_


In [None]:
lr_orange_circ.intercept_

In [None]:
y_pred_circ = lr_orange_circ.predict(x_circu)

In [None]:
r2 = r2_score(y_age, y_pred_circ)
r2


In [None]:
def rse_calculator(y_actual, y_predicted, p):

    # Compute the sum of squared residuals (errors)
    residual_sum = np.sum((y_actual - y_predicted) ** 2)

    # Compute the RSE using the formula: sqrt(SSE / (n - p - 1))
    rse_value = np.sqrt(residual_sum / (y_actual.size - p - 1))

    # Return the RSE value rounded to 4 decimal places
    return np.round(rse_value, 4)

In [None]:
rse_calculator(y_age, y_pred_circ, 1) / np.mean(y_age)

**R² Score (0.8345) Interpretation:**
   - The model explains approximately **83.45% of the variance** in tree age.
   - This indicates a **good fit**, though some variance remains unexplained.

**Relative Residual Standard Error (RSE / mean):**
   - The RSE divided by the mean age (0.2202) suggests that the typical prediction error is about **22.02% of the average tree age**.
   - This is slightly above the 20% threshold used earlier, so there is still **room for model improvement**.

In [None]:
intercept = lr_orange_circ.intercept_.item()
slope = lr_orange_circ.coef_.item()

# Print regression equation
print(f"Regression Equation: Age = {intercept:.4f} + {slope:.4f} * Circumference")

In [None]:
plt.figure(figsize=(10, 6))

# Scatter plot (Age vs Circumference)
plt.scatter(orange_tree_df['circumference'], orange_tree_df['age'], c='blue')

# Proper axis labels
plt.xlabel('Circumference')
plt.ylabel('Age')

# Title
plt.title('Scatter Plot of Circumference vs Age')

# Correct yticks range and formatting
plt.yticks(np.linspace(orange_tree_df['age'].min(), orange_tree_df['age'].max(), num=10).astype(int))

# Plot the regression line
plt.plot(orange_tree_df['circumference'], y_pred_circ, c='red', linestyle='-')  # Red solid line for regression

# Enable grid
plt.grid(True)

# Show plot
plt.show()


In [None]:
residuals_circ = y_age - y_pred_circ

In [None]:
plt.scatter(y_pred_circ, residuals_circ,c='blue')

plt.xlabel("Predicted y")
plt.ylabel("Residuals")  # Difference between actual and predicted values
plt.axhline(0, c='red', ls='--')

plt.show()

In [None]:
rse_circ= rse_calculator(y_age, y_pred_circ, x_circu.shape[1])
rse_circ

The residual plot shows no strong bias, but potential heteroscedasticity and outliers. The model may not fully capture some patterns, suggesting a need for further refinement or a different model.
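
One rough way to put a number on the heteroscedasticity suspected above is to compare the residual spread in the lower and upper halves of the predicted values. The sketch below uses synthetic residuals whose noise grows with the prediction. The helper `spread_ratio` is hypothetical, and this is a heuristic, not a formal statistical test.

```python
import numpy as np

def spread_ratio(y_predicted, residuals):
    """Ratio of residual std. dev. in the upper half of predictions to the lower half."""
    y_predicted = np.asarray(y_predicted).ravel()
    residuals = np.asarray(residuals).ravel()
    order = np.argsort(y_predicted)          # sort observations by predicted value
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.std(ddof=1) / low.std(ddof=1)

# Synthetic data whose noise grows with the predicted value (illustration only)
rng = np.random.default_rng(0)
pred = np.linspace(100, 1500, 200)
resid = rng.normal(0, 1, 200) * pred * 0.1

ratio = spread_ratio(pred, resid)
print(f"Upper-half / lower-half residual spread: {ratio:.2f}")
```

A ratio close to 1 is consistent with constant variance, while a ratio well above 1 points to a spread that widens as the predictions grow. In the notebook, the same function could be called with `y_pred_circ` and `residuals_circ`.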

---
### Polynomial Regression

In [None]:
# Initialize a polynomial transformer with degree 2
poly2_object = PolynomialFeatures(degree=2)

# Convert 'circumference' column into a NumPy array and reshape it for transformation
x_circu = np.array(orange_tree_df['circumference']).reshape(-1, 1)

# Define the target variable (age)
y_age = orange_tree_df['age']

In [None]:
orange_tree_df['circumference'].corr(orange_tree_df['circumference']**2)

In [None]:
# Pipeline with two steps: centering and polynomial feature transformation
poly2_pipeline = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=False)),  # Step 1: center the predictor only (with_std=False skips scaling to unit variance)
    ('poly_features', PolynomialFeatures(degree=2))  # Step 2: Apply polynomial transformation
])
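
The reason for centering before squaring can be checked numerically: a positive-valued predictor is usually strongly correlated with its own square, and subtracting the mean removes much of that correlation. A minimal sketch follows, assuming a hypothetical evenly spaced predictor rather than the actual circumference column.

```python
import numpy as np

# Hypothetical positive-valued predictor (illustration only)
x = np.linspace(30, 214, 35)

# Correlation between the raw predictor and its square
raw_corr = np.corrcoef(x, x ** 2)[0, 1]

# Correlation after centering the predictor first
x_centered = x - x.mean()
centered_corr = np.corrcoef(x_centered, x_centered ** 2)[0, 1]

print(f"corr(x, x^2) raw:      {raw_corr:.4f}")
print(f"corr(x, x^2) centered: {centered_corr:.4f}")
```

Because this toy predictor is symmetric around its mean, the centered correlation is essentially zero. Real data is rarely symmetric, so the correlation typically shrinks without vanishing.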

In [None]:
x_circu_transf_poly2 = poly2_pipeline.fit_transform(x_circu)
x_circu_transf_poly2

In [None]:
lr_poly2 = LinearRegression()
lr_poly2.fit(x_circu_transf_poly2, y_age)

In [None]:
lr_poly2.intercept_

In [None]:
lr_poly2.coef_

In [None]:
intercept = lr_poly2.intercept_
# Skip the first coefficient: it belongs to the constant (bias) column added by PolynomialFeatures
coefficients = lr_poly2.coef_[1:]

In [None]:
feature_names = ['Intercept'] + ['Coeff_degree_'+ str(i) for i in range(1, len(coefficients)+1)]

coefficients_df = pd.DataFrame({'Coefficient Name': feature_names, 'Coefficient Value': np.concatenate(([intercept], coefficients))})

print(coefficients_df)

### **👉 Why do we do this?**  
- The **intercept is the baseline prediction** when every polynomial term is zero. Because the predictor is centered, that is the predicted age at the mean circumference.  
- The **coefficients show the impact of each polynomial term** on the target variable.  
- Helps **interpret how `circumference²` affects age differently than `circumference` alone**.  

✅ **Abalone Dataset Example:**  
- When predicting **the age of abalone based on shell diameter**, coefficients tell us how much each feature contributes.  
- If `diameter²` has a **positive coefficient**, the **growth rate accelerates with size**; a negative coefficient means it slows down.  
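
The interpretation above can be made concrete with a short worked example. For a quadratic in a centered predictor $c$, the prediction is $b_0 + b_1 c + b_2 c^2$ and its slope is $b_1 + 2 b_2 c$, so the sign of $b_2$ decides whether the effect grows or shrinks as $c$ increases. The coefficient values below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def predict_quadratic(centered_x, intercept, b1, b2):
    """Prediction of a second-degree polynomial in a centered predictor."""
    return intercept + b1 * centered_x + b2 * centered_x ** 2

def marginal_effect(centered_x, b1, b2):
    """Slope of the quadratic at a given centered value: b1 + 2 * b2 * c."""
    return b1 + 2 * b2 * centered_x

# Hypothetical coefficients (illustration only, not the fitted values)
b0, b1, b2 = 1000.0, 8.0, -0.03

# At the mean of the predictor (centered value 0) the prediction equals the intercept
at_mean = predict_quadratic(0.0, b0, b1, b2)

# 50 units above the mean, the negative b2 has reduced the slope
above_mean = predict_quadratic(50.0, b0, b1, b2)
slope_at_mean = marginal_effect(0.0, b1, b2)
slope_above_mean = marginal_effect(50.0, b1, b2)

print(at_mean, above_mean, slope_at_mean, slope_above_mean)
```

With a negative quadratic coefficient, each extra unit of the predictor adds less to the prediction the further the value sits above the mean.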

---

In [None]:
import pandas as pd

# Example DataFrame with coefficients
coefficients_df = pd.DataFrame({
    "Coefficient Name": ["Intercept", "Coeff_degree_1", "Coeff_degree_2"],
    "Coefficient Value": [1029.928998, 7.816031, -0.033573]
})

# Extracting values dynamically
intercept = coefficients_df.loc[coefficients_df["Coefficient Name"] == "Intercept", "Coefficient Value"].values[0]
coeff_1 = coefficients_df.loc[coefficients_df["Coefficient Name"] == "Coeff_degree_1", "Coefficient Value"].values[0]
coeff_2 = coefficients_df.loc[coefficients_df["Coefficient Name"] == "Coeff_degree_2", "Coefficient Value"].values[0]

# Variable name
variable = "centered circumference"

# Dynamically construct the regression equation
equation = f"Predicted age = {intercept:.2f} {coeff_1:+.2f} * ({variable}) {coeff_2:+.3f} * ({variable} squared)"

print(equation)


In [None]:
# Scatter plot
plt.figure(figsize=(8, 8))
plt.scatter(orange_tree_df['circumference'], orange_tree_df['age'], c='blue')
plt.xlabel("circumference")
plt.ylabel("age")
plt.title("Orange tree age vs its circumference")

# Linear regression line
linear_model = LinearRegression().fit(x_circu, y_age)
linear_predictions = linear_model.predict(x_circu)
plt.plot(x_circu, linear_predictions, c='red', ls='-', linewidth=3, label='Linear Model')

# Create xaxis_values (these are sorted x values)
xaxis_values = orange_tree_df['circumference'].sort_values().values.reshape(-1, 1)

# Transform the xaxis_values with the already-fitted pipeline (transform, not fit_transform, so the centering mean is not re-estimated)
xaxis_values_transf_poly2 = poly2_pipeline.transform(xaxis_values)

# Second degree poly curve
poly_predictions = lr_poly2.predict(xaxis_values_transf_poly2)
plt.plot(xaxis_values, poly_predictions, c='green', ls='-', linewidth=3, label='Poly of Second Degree')

plt.legend()
plt.show()



In [None]:
# Calculate the mean (average) circumference of all trees in the dataset
mean_circumference = orange_tree_df['circumference'].mean()

# Calculate the maximum circumference value in the dataset
max_circumference = orange_tree_df['circumference'].max()

# Print the calculated mean circumference
print("Mean circumference:", mean_circumference)

# Print the maximum circumference value
print("Max circumference:", max_circumference)



In [None]:
np.random.seed(1)
X_age_new = np.random.randint(116, 214, 5).reshape(-1, 1)  # New random circumference values
X_age_new

In [None]:

# Use transform (not fit_transform) so the new values are centered with the training mean
X_age_new_transformed = poly2_pipeline.transform(X_age_new)

In [None]:

y_pred_poly2_age = lr_poly2.predict(X_age_new_transformed)

# Display predictions
print("Predicted Age for New Circumference Values:")
print(pd.DataFrame({'Circumference': X_age_new.flatten(), 'Predicted Age': y_pred_poly2_age.round(2)}))
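
When predicting for new circumference values, the centering step should reuse the mean learned from the training data. Calling `fit_transform` on the new values re-estimates the mean from those few points, whereas `transform` keeps the stored one. The sketch below imitates both behaviours with plain NumPy on hypothetical numbers, so the difference is visible without fitting any model.

```python
import numpy as np

# Hypothetical training and new predictor values (illustration only)
train_circ = np.array([30.0, 58.0, 87.0, 115.0, 142.0, 172.0, 214.0])
new_circ = np.array([150.0, 160.0, 170.0])

# What transform does: subtract the mean stored from the training data
train_mean = train_circ.mean()
centered_reused = new_circ - train_mean

# What fit_transform does: subtract a mean re-estimated from the new values
centered_refit = new_circ - new_circ.mean()

print("Centered with training mean:", centered_reused.round(2))
print("Centered with refitted mean:", centered_refit.round(2))
```

The refitted version always averages to zero, so the model would treat the new trees as if they sat around the training mean, whatever their actual size.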

In [None]:

y_pred_poly2_tr_age = lr_poly2.predict(x_circu_transf_poly2)

In [None]:
# Function to calculate adjusted R-squared
def adj_r2_calculator(r2, n, p):
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))
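
To see what the adjustment does, the same formula can be applied to a fixed R² while the number of predictors grows. This is a small sketch with hypothetical inputs (R² = 0.85, n = 35); the function `adjusted_r2` simply repeats the formula used by `adj_r2_calculator`.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical inputs: the same R² reported by models with 1, 2 and 3 predictors
adjusted = {p: adjusted_r2(0.85, n=35, p=p) for p in (1, 2, 3)}

for p, value in adjusted.items():
    print(f"p = {p}: adjusted R² = {value:.4f}")
```

With R² held constant, every additional predictor lowers the adjusted value, so a higher-degree polynomial only scores better if it raises R² by enough to offset the penalty.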

In [None]:
# Compute R² value for the polynomial regression model
r2_value_poly2_circ = r2_score(y_age, y_pred_poly2_tr_age)

In [None]:
# Compute Adjusted R² value (accounting for number of predictors)
adj_r2_value = adj_r2_calculator(r2_value_poly2_circ, n=orange_tree_df.shape[0], p=2)

In [None]:
# Print results
print(f"R² (Polynomial of Second Degree): {r2_value_poly2_circ:.4f}")
print(f"Adjusted R² Value 2nd degree: {adj_r2_value:.4f}")

In [None]:
# Compute Residual Standard Error (RSE) using actual vs. predicted values with 2 predictors
rse_calculator(y_age, y_pred_poly2_tr_age, p=2)


### Train a Third-Degree Polynomial Regression Model

In [None]:
# Create a pipeline for polynomial feature transformation (degree 3) with feature centering
poly3_pipeline = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=False)),  # Center data without scaling variance
    ('poly_features', PolynomialFeatures(degree=3))  # Generate polynomial features up to degree 3
])


In [None]:
x_circu_transf_poly3 = poly3_pipeline.fit_transform(x_circu)
x_circu_transf_poly3

In [None]:
lr_poly3 = LinearRegression()
lr_poly3.fit(x_circu_transf_poly3, y_age)

In [None]:
# Extract model coefficients
intercept = lr_poly3.intercept_
coefficients = lr_poly3.coef_[1:]  # Skip the first zero value

# Create coefficient names
feature_names = ['Intercept'] + [f'Coeff_degree_{i}' for i in range(1, len(coefficients) + 1)]

# Store coefficients in a DataFrame
coefficients_df = pd.DataFrame({'Coefficient Name': feature_names, 'Coefficient Value': np.concatenate(([intercept], coefficients))})

# Display coefficient values
print(coefficients_df)

In [None]:
# Scatter plot
plt.figure(figsize=(8, 8))
plt.scatter(orange_tree_df['circumference'], orange_tree_df['age'], c='blue')
plt.xlabel("circumference")
plt.ylabel("age")
plt.title("Orange tree age vs its circumference")

# Linear regression line
linear_model = LinearRegression().fit(x_circu, y_age)
linear_predictions = linear_model.predict(x_circu)
plt.plot(x_circu, linear_predictions, c='red', ls='-', linewidth=3, label='Linear Model')

# Create xaxis_values (these are sorted x values)
xaxis_values = orange_tree_df['circumference'].sort_values().values.reshape(-1, 1)

# Transform the xaxis_values with the already-fitted degree-2 pipeline (no refitting)
xaxis_values_transf_poly2 = poly2_pipeline.transform(xaxis_values)

# Second degree poly curve
poly_predictions = lr_poly2.predict(xaxis_values_transf_poly2)
plt.plot(xaxis_values, poly_predictions, c='green', ls='-', linewidth=3, label='Poly of Second Degree')


# Transform the xaxis_values with the already-fitted degree-3 pipeline (no refitting)
xaxis_values_transf_poly3 = poly3_pipeline.transform(xaxis_values)

# Third degree poly curve
poly_predictions_poly3 = lr_poly3.predict(xaxis_values_transf_poly3)
plt.plot(xaxis_values, poly_predictions_poly3, c='orange', ls='-', linewidth=3, label='Poly of Third Degree')


plt.legend()
plt.show()



In [None]:
# Make predictions using the trained polynomial regression model (degree 3)
y_pred_poly3 = lr_poly3.predict(x_circu_transf_poly3)

# Compute R-squared (R²) value to measure how well the model explains variance in the data
r2_value_poly3 = r2_score(y_age, y_pred_poly3)

# Calculate Adjusted R-squared, which accounts for the number of predictors (p=3) and sample size (n)
adj_r2_calculator(r2_value_poly3, n=orange_tree_df.shape[0], p=3)


In [None]:
# Compute Residual Standard Error (RSE) to measure the standard deviation of residuals,
# adjusting for the number of predictors (p=3)
rse_calculator(y_age, y_pred_poly3, p=3)


The third-degree polynomial performs worse than the second-degree polynomial, so we should stay with the second-degree polynomial.
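
The comparison behind this conclusion can be laid out side by side. The sketch below fits polynomials of degree 1 to 3 on synthetic data with a built-in quadratic trend, then reports adjusted R², RSE, and the percentage change in RSE from degree 2 to degree 3, which can be read against the assignment's 5% threshold. It uses `np.polyfit` as a stand-in for the notebook's pipeline. The data and the helper `fit_metrics` are assumptions for illustration.

```python
import numpy as np

def fit_metrics(x, y, degree):
    """Fit a polynomial of the given degree on centered x; return (adjusted R², RSE)."""
    xc = x - x.mean()                       # center the predictor, as in the pipelines
    coeffs = np.polyfit(xc, y, degree)
    y_hat = np.polyval(coeffs, xc)
    n, p = len(y), degree                   # a degree-d polynomial has d predictors
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rse = np.sqrt(sse / (n - p - 1))
    return adj_r2, rse

# Synthetic data with a quadratic trend plus noise (illustration only)
rng = np.random.default_rng(42)
x = np.linspace(30, 214, 35)
y = 100 + 12 * x - 0.02 * x ** 2 + rng.normal(0, 20, x.size)

results = {d: fit_metrics(x, y, d) for d in (1, 2, 3)}
for d, (adj_r2, rse) in results.items():
    print(f"degree {d}: adjusted R² = {adj_r2:.4f}, RSE = {rse:.2f}")

# Percentage change in RSE when moving from degree 2 to degree 3
pct_change = 100 * (results[3][1] - results[2][1]) / results[2][1]
print(f"RSE change from degree 2 to 3: {pct_change:+.2f}%")
```

If the extra cubic term does not reduce the RSE by more than the threshold, the simpler second-degree model is the more defensible choice.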

# Regression Tree

# - **Student Performance Dataset** (CSV file from Canvas) → Used for Question 5.

# ## Question 5 (20 points): Predicting Final Exam Scores

# **Instructions:**
# - Load the **Student Performance dataset**.
# - Train a **Linear Regression model** to predict `G3` (final exam score) using `G1` and `G2`.
# - Compute **Mean Squared Error (MSE)** on **10 new students' data**.
# - Evaluate and **justify** if the model performs well.




# TODO: Load Student Performance dataset and train regression model


# ---

# **Final Notes:**
# - Ensure all steps are included in your notebook.
# - Justify your model selection with **data-driven reasoning**.
# - Clearly comment on all code for readability.

# ---
