**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2025-08-01

**Last update:** 2025-09-30

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you find any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

* Your answers for the "⚡ Mandatory" sections of each lab <span style="color:red"><strong>must be submitted before the start of the next lab session</strong></span>.

ENJOY WORKING ON THIS LAB.
***

In [None]:
# Usual loading of our utilities
import sys
import os
sys.path.append(os.path.abspath('../utils'))
from notebook_config import *

# Decision Tree

A decision tree is a supervised, non-parametric ML algorithm used for both classification and regression problems that models decisions and their possible consequences in a tree-like structure. It splits the data into subsets based on the value of input features, creating branches that lead to decision nodes and leaf nodes, which represent the final output or decision. 

In this section, we will begin by loading and exploring the dataset gathered from antenna measurements to understand its structure and key characteristics. This analysis will include examining the relationships between various input features, such as frequency, and the target variable, antenna gain. By identifying patterns and trends within the data, we aim to build a decision tree model that can accurately predict antenna gain based on these input features. The decision tree will split the data into subsets based on feature values, creating a hierarchy of decision rules to make predictions.

***
### Data Analysis

**⚠️ READ CAREFULLY** 

Before moving to the decision tree model, I intend to provide a comprehensive analysis of the data. Thorough data analysis is a critical step in all ML tasks, as it helps you understand the structure, quality, and patterns within the dataset. Without proper data exploration and preparation, it is unlikely that you will develop an accurate or effective model.

Data analysis involves assessing the distribution of variables, identifying relationships between features, detecting outliers or anomalies, and addressing issues such as missing values or incorrect data types. These analyses guide decisions about feature selection, engineering, and preprocessing, which are essential for creating models that generalize well to unseen data.

Neglecting or performing data analysis incorrectly can lead to significant problems, such as introducing bias, overfitting, or failing to capture the true patterns in the data. Machine learning models are fundamentally dependent on the quality of the input data, and any inaccuracies or inconsistencies in the data can directly impact the model's performance and predictions.

Please note and remember that data analysis is not **just** a preliminary step but a **foundational process** that drives the success of the entire ML pipeline. By thoroughly analyzing the data, you ensure that the decision tree (in this notebook or any other model) is built on a solid foundation.
***

Let's read/load the data and explore it using pandas functions.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV file for antenna data
df = pd.read_csv("../datasets/antenna_data.csv", 
                 comment="#", skip_blank_lines=True)

# Split the data into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.3, random_state=42)

# preview the data
train_df.head()

In [None]:
train_df.tail()

As you can see in the CSV file, little information is given about the data. For example, we do not know the units. When this is the case, your first duty is to ask the data provider for more information on the dataset. Here, I can tell you that the frequency is in Hz, the antenna gain is in dB, and the direction is in radians. No further information exists on this dataset.

In [None]:
train_df.info()

In [None]:
val_df.info()

In [None]:
train_df.describe()

***
### ✅ Check your understanding

Study the last three reports you got above. What do they tell you?

***

We aim to generate summary statistics for the object-type columns (categorical or string data) in the `train_df` DataFrame using pandas. This step is essential for quickly understanding the distribution and characteristics of categorical or text-based data within the DataFrame. For example, upon examining the "Polarization" column, we discover that it contains text data. It has 373 data points, consisting of 2 unique categories. The most frequent category is "Ciculat," appearing 187 times. This analysis provides valuable insights into the structure and distribution of the column's data.

In [None]:
train_df.describe(include=['O']) # Summarize object-type (categorical/text) columns
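
To make the four rows of this report concrete, here is a minimal sketch on a hypothetical toy DataFrame (the values `"A"` and `"B"` and the name `toy` are illustrative assumptions, not taken from the antenna dataset). It shows that `count` ignores missing entries, `unique` counts distinct categories, and `top`/`freq` report the most common category and how often it appears.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "Polarization": pd.Series(["A", "B", "A", "A", None], dtype=object),
    "Frequency": [1.0e3, 2.0e3, 5.0e3, 1.0e4, 2.0e4],
})

# Only object-type columns are summarized; "Frequency" is left out
summary = toy.describe(include=["O"])
print(summary)
```

The missing entry is excluded from `count`, which is one quick way to spot missing values in a text column.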

### Univariate analysis
Univariate analysis is the examination and analysis of a single variable within a dataset. Its primary goal is to understand the patterns, distribution, and characteristics of the variable by summarizing its central tendency (e.g., mean, median, mode), dispersion (e.g., variance, standard deviation, range), and shape (e.g., skewness). This type of analysis helps identify key patterns such as the presence of outliers, the spread of data, or the overall data distribution. 
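
As a small illustration of these summary quantities, the sketch below computes them for a hypothetical toy series (the numbers are made up for demonstration). It also applies the common 1.5 × IQR rule to flag outliers, which is the same rule a default boxplot uses for its whiskers; treat it as one possible convention rather than the only definition of an outlier.

```python
import pandas as pd

# Hypothetical toy values with one obvious outlier
values = pd.Series([1.0, 2.0, 2.0, 3.0, 50.0], name="toy")

stats = {
    "mean": values.mean(),      # central tendency, sensitive to outliers
    "median": values.median(),  # central tendency, robust to outliers
    "std": values.std(),        # dispersion
    "skew": values.skew(),      # shape: positive means a long right tail
}

# Flag outliers with the 1.5 * IQR rule
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(stats)
print("Outliers:", outliers.tolist())
```

Notice how a single extreme value pulls the mean far away from the median and produces a positive skewness.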

I've designed the function below (inspired by Kaggle source code) to perform univariate analysis of a continuous variable by visually combining a `boxplot` and a `histogram` into a single figure. The boxplot highlights the data's range, quartiles, and potential outliers, while the histogram displays the frequency distribution of the variable, overlaid with a Kernel Density Estimate (KDE) curve to illustrate the smooth probability density. Additionally, the function calculates and displays key statistical metrics such as mean, median, standard deviation, and skewness directly on the plot, making it a powerful tool for summarizing and interpreting the behavior of a single continuous variable.

As an alternative to `boxplot`, you can use `violinplot`. I've included it in the function below, but commented it out. To use it, comment out the boxplot line and uncomment the violinplot line. But first, use the original setting with the boxplot.

In [None]:
import seaborn as sns

# Plot a combined boxplot and histogram for univariate analysis of a continuous variable from Pandas.
# bins is the number of bins for histogram
def dist_box(data, bins=30):
    if not isinstance(data, pd.Series):
        raise ValueError("Input must be a pandas Series")

    # basic stats
    name   = data.name.upper()
    mean   = data.mean()
    median = data.median()
    std    = data.std()
    skew   = data.skew()

    # Plot
    fig, (ax_box, ax_dis) = plt.subplots(nrows=2, sharex=True,
                                         gridspec_kw={"height_ratios": (0.2, 0.8)},
                                         figsize=(10, 6))

    sns.set_theme(style="whitegrid")
    fig.suptitle(f"UNIVARIATE ANALYSIS: {name}", fontsize=18, fontweight='bold')

    # Boxplot
    sns.boxplot(x=data, showmeans = True, orient = 'h', color = 'violet', ax = ax_box)
    #sns.violinplot(x= data, orient= 'h', color = 'violet', ax = ax_box)
    
    ax_box.set(xlabel = "" )
    ax_box.grid(True)
    ax_box.set_title("Boxplot (Spread & Outliers)", fontsize=12)

    # Histogram
    sns.histplot(data, bins=bins, ax=ax_dis, color='skyblue', edgecolor='black', kde=True)
    
    ax_dis.axvline(mean  , color='g', linestyle='-', linewidth=2, label=f'Mean: {mean:.2f}')
    ax_dis.axvline(median, color='r', linestyle='--' , linewidth=2, label=f'Median: {median:.2f}')
    ax_dis.legend(loc="upper left")
    ax_dis.grid(True)
    ax_dis.set_title("Histogram", fontsize=12)

    # Add text box with stats
    # Build the text line by line so no source indentation leaks into the box
    stats_text = (f"Count: {data.count()}\n"
                  f"Std Dev: {std:.2f}\n"
                  f"Skewness: {skew:.2f}")
    props = dict(boxstyle ="round", facecolor= "white", alpha = 0.7)
    ax_dis.text(0.15, 0.8, stats_text, transform = ax_dis.transAxes,
                verticalalignment='top', horizontalalignment='right', bbox = props, fontsize =10)

    plt.show()


In [None]:
# Select all quantitative columns for checking the spread
list_col = ["Gain", "Frequency", "Direction", "Impedance", "SNR"]

# Loop through each column and apply the dist_box function
for col in list_col:
    if col in train_df.columns:
        dist_box(train_df[col])
    else:
        print(f"Column {col} does not exist in the DataFrame.")

***
### ✅ Check your understanding

- Study the plotted data above and provide a detailed explanation of your observations. Focus on identifying trends, patterns, relationships, and any notable features within the data. Highlight variations, distributions, correlations, or anomalies, and explain their significance in the context of the analysis.

***

As noted in the description of the training data, the dataset contains `NaN` values. Additionally, `Frequency` and `Gain` span large ranges and need to be transformed before we use them in our ML model. For them, scaling alone is not sufficient, and additional transformations are required.

The code below preprocesses the already split training and validation sets by addressing missing values, transforming numerical features, and encoding categorical variables to prepare them for modeling.

⚠️ Remember, you need to apply the same preprocessing steps to both training and validation sets. However, for steps involving fitting (like imputation or scaling), you should fit on the training set and then apply the same transformation to the validation set to avoid data leakage.
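
The fit-on-training, transform-both pattern can be sketched as follows. This is a minimal illustration with hypothetical toy arrays (`X_tr`, `X_va`) and a `StandardScaler`; the same idea applies to imputers and encoders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the validation values lie far from the training values
X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_va = np.array([[10.0], [20.0]])

# Learn the mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_tr)

# Apply the SAME learned statistics to both sets
X_tr_s = scaler.transform(X_tr)
X_va_s = scaler.transform(X_va)

print("Mean learned from training:", scaler.mean_)
print("Scaled validation values:", X_va_s.ravel())
```

The scaled validation values are not centered at zero, and that is expected: the validation set must be described with the training statistics, never with its own.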

***
Stop working on the notebook and read the following page (Sections 11.1 and 11.2) on common pitfalls in using scikit-learn:

https://scikit-learn.org/stable/common_pitfalls.html

***

Step 1) Replace infinite values with NaN in the entire DataFrame

In [None]:
# Replace infinite values with NaN in the entire DataFrame
for d in [train_df, val_df]:
    d.replace([np.inf, -np.inf], np.nan, inplace=True)

Step 2) Convert Frequency and Gain to logarithmic scale in both training and validation sets.

In [None]:
# Convert Frequency and Gain to logarithmic scale
for d in [train_df, val_df]:
    d["Frequency"] = np.log10(d["Frequency"])
    d["Gain"]      = np.log10(d["Gain"])
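
One caveat worth keeping in mind with a logarithmic transform: `np.log10` is only finite for strictly positive inputs. The sketch below uses a hypothetical toy array (not the antenna data) to show what happens otherwise. Whether the real columns contain such values is something you would need to check; I am not assuming they do.

```python
import numpy as np

# Hypothetical toy values, including a zero and a negative number
x = np.array([1.0e3, 1.0e6, 0.0, -5.0])

# Silence the runtime warnings so we can inspect the result
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log10(x)

# Zero maps to -inf and negative values map to NaN
finite_mask = np.isfinite(logged)
print(logged)
print("Finite entries:", finite_mask)
```

If this happens after infinite values have already been replaced, the new `-inf` entries are not cleaned automatically, so a quick `np.isfinite` check after the transform is a cheap safeguard.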

Step 3) Apply label encoding to the categorical column in both the training and validation sets.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the training set, then reuse it on the validation set
label_encoder = LabelEncoder()
train_df["Polarization"] = label_encoder.fit_transform(train_df["Polarization"])
val_df["Polarization"]   = label_encoder.transform(val_df["Polarization"])

Step 4) 
* Create imputer strategy with 'mean' to fill NaN values with the mean of each column.
* Fit and transform the DataFrame using the imputer.

Note: If you want to use other strategies, check the [SimpleImputer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) for more information.

In [None]:
from sklearn.impute import SimpleImputer

# Apply imputer (fit on training, transform both)
imputer = SimpleImputer(strategy="mean")

train_df[:] = imputer.fit_transform(train_df)
val_df[:]   = imputer.transform(val_df)
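
To see exactly what the mean imputer does, here is a minimal sketch on two hypothetical toy DataFrames (`toy_train` and `toy_val`, with made-up columns `a` and `b`). The column means are learned from the training data and then used to fill the gaps in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
toy_val   = pd.DataFrame({"a": [np.nan, 100.0], "b": [np.nan, 5.0]})

imp = SimpleImputer(strategy="mean")

# Fit on training data only, then transform both sets
train_filled = pd.DataFrame(imp.fit_transform(toy_train), columns=toy_train.columns)
val_filled   = pd.DataFrame(imp.transform(toy_val), columns=toy_val.columns)

print("Learned column means:", imp.statistics_)
print(val_filled)
```

The missing validation entries are filled with the training means, even though the validation set's own mean for column `a` would be very different.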

Here, we provide data visualization for exploratory data analysis (EDA) using **Seaborn**. You are aware that Seaborn is a powerful Python visualization library built on top of Matplotlib. The main focus of our code sections is to reveal relationships and patterns in the `train_df` using two common types of plots: correlation heatmaps and pair plots. We/you are not allowed to visualize the `val_df` DataFrame, as it is reserved for validation purposes only.

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(train_df.corr(), annot=True)
plt.show()

Note: You can limit the heatmap to a handful of columns. The code below does this for you.

In [None]:
selective=["Frequency", "Gain", "Direction"]  # Selective columns for heatmap

plt.figure()
sns.heatmap(train_df[selective].corr(), annot=True)
plt.show()

***
⚠️ The tables shown above are not confusion matrices. They are heatmaps of the correlation between different features in the training set.
***

In [None]:
sns.pairplot(train_df, hue="Polarization", diag_kind="kde")
plt.show()

***
### ✅ Check your understanding

Study all the plotted data above and provide a detailed explanation of your observations. Focus on identifying trends, patterns, relationships, and any notable features within the data. Highlight key information, such as variations, distributions, correlations, or anomalies, and explain their significance in the context of the analysis.

**NOTE:** Since our dataset has few samples with limited features, you may not necessarily find strong correlations or patterns in the plotted data. However, remember to provide such an analysis, or an even more in-depth one as you will see later (or have seen earlier), in all of the data analysis in your practical project.

***

In this section, we examine the relationship between antenna Gain and Frequency. Our goal is to develop a ML Regression model that predicts antenna Gain based on Frequency. As a first step in this approach, we perform data visualization to explore the structure and patterns within our training dataset for Frequency and Gain.

Run the code section below first, and then read this note. Remember not to skip reading it!

In the data visualization code below, and in earlier visualizations from previous notebooks, you might have noticed that I often use `alpha` values <1.0 (e.g., 0.7 or 0.5) when plotting. Have you ever wondered why?

Let me break it down for you. This is actually a really useful trick in data analysis for better understanding your data:

1. **Make overlapping points easier to see**: When lots of data points overlap or are really close to each other, using full opacity (`alpha = 1.0`) can make them blend together or even hide some points entirely. Lowering the alpha lets you see these overlaps more clearly, making it easier to spot patterns like clusters or dense areas in your data.

2. **Show data density**: Transparency naturally darkens areas where points overlap, giving you a quick visual hint about where the data is concentrated. It's kind of like creating a density plot or heatmap, but without doing extra work.

3. **Balance class visibility in multi-class data**: If you are plotting data with multiple classes (like with a `hue` in Seaborn), transparency helps to make sure all classes are visible, even if one class has way more data points than the others.

4. **Keep plots readable when they are crowded**: When you work with complex plots (like Seaborn pairplots), things can get messy if there are too many data points. Lowering the alpha reduces visual clutter, making it easier to interpret the plots without overwhelming your eyes.

You can verify all of the points above by setting alpha to 1.0, re-running the plotting code below, and comparing the result with the original plot that uses alpha=0.5.

In [None]:
plt.figure()
plt.plot(train_df["Frequency"], train_df["Gain"], "o", 
           markersize = 8, color="royalblue", 
           markeredgecolor="black", alpha=0.5)

plt.xlabel(r"Frequency: $\log_{10}(f)$ [Hz]")
plt.ylabel(r"Gain: $\log_{10}(G)$ [dB]")
plt.title("Antenna Gain")
plt.grid(True)

plt.show()

***
Did you notice the NaNs that were replaced with mean values in the dataset? As you can see, they are concentrated in a certain frequency range ($2.2<\log_{10}(f)<5$). This may bias your model, and you need to be careful about it. Because of this special distribution, it would have been better to drop the rows containing NaNs instead of replacing them with mean values. You could use the `dropna()` function for this purpose. For this lab, we continue with the imputed dataset.
***
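
To compare the two options, here is a minimal sketch on a hypothetical toy DataFrame (the numbers are made up and only the column names mirror our dataset). It contrasts dropping rows with missing `Gain` against filling them with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing Gain values
toy = pd.DataFrame({
    "Frequency": [1.0, 2.0, 3.0, 4.0],
    "Gain": [0.5, np.nan, 0.7, np.nan],
})

# Option 1: drop the rows where Gain is missing
dropped = toy.dropna(subset=["Gain"])

# Option 2: replace missing Gain values with the column mean
imputed = toy.fillna({"Gain": toy["Gain"].mean()})

print("Rows after dropping:", len(dropped))
print("Std of Gain, dropped:", dropped["Gain"].std())
print("Std of Gain, imputed:", imputed["Gain"].std())
```

Mean imputation keeps every row but shrinks the spread of the column, because all filled values sit exactly at the mean. This is the horizontal band of points you can see in the scatter plot.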

The code section below computes the **SSR** and **MSE** for all possible split points in the training data, sorted by Frequency. We have seen this earlier in the lecture notes. For each frequency index, we split the data to the left and right of that point (for the left and right nodes), calculate the error for each node, by measuring how much the data in each node deviates from its local mean. This forms the basis for evaluating potential split points in our decision tree.

In [None]:
# Sort training data by Frequency. 
# This is important for the SSR and MSE calculations, as we explained in the lecture notes.
train_df = train_df.sort_values(by = "Frequency")

# Extract Frequency and Gain values
f = train_df["Frequency"].values
g = train_df["Gain"].values

# Store SSR and MSE history
ssr_hist_right = []
ssr_hist_left  = []
mse_hist_right = []
mse_hist_left  = []

# Loop to calculate averages, SSR, and MSE
for i in range(len(f)):
    # Calculate averages
    avg_right = np.mean(g[i + 1:]) if i + 1 < len(f) else 0  # Handle empty slice for last index
    avg_left = np.mean(g[0:i + 1])

    # Calculate SSR
    ssr_right = sum((g[i + 1:] - avg_right) ** 2)
    ssr_left = sum((g[0:i + 1] - avg_left ) ** 2)
    
    # Calculate MSE
    mse_right = np.mean((g[i + 1:] - avg_right) ** 2) if i + 1 < len(f) else 0  # Handle empty slice
    mse_left = np.mean((g[0:i + 1] - avg_left ) ** 2)
    
    # Append SSR and MSE to history
    ssr_hist_right.append(ssr_right)
    ssr_hist_left .append(ssr_left )
    mse_hist_right.append(mse_right)
    mse_hist_left .append(mse_left )

# Plot SSR history
plt.figure()
plt.plot(range(len(ssr_hist_right)), ssr_hist_right, color="royalblue", label="SSR Right Node")
plt.plot(range(len(ssr_hist_left )), ssr_hist_left , color="tomato"   , label="SSR Left Node" )
plt.xlabel("Frequency index")
plt.ylabel("SSR")
plt.legend()
plt.grid(True)
plt.title("SSR History")

# Plot MSE history
plt.figure()
plt.plot(range(len(mse_hist_right)), mse_hist_right, color="green" , label="MSE Right Node")
plt.plot(range(len(mse_hist_left)) , mse_hist_left , color="orange", label="MSE Left Node" )
plt.xlabel("Frequency index")
plt.ylabel("MSE")
plt.legend()
plt.grid(True)
plt.title("MSE History")

plt.show()
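
The loop above records the left and right errors separately. To actually choose a split, a regression tree adds them together and picks the candidate with the smallest total. The sketch below is my own minimal illustration of that idea on hypothetical toy data; the helper name `best_split` is an assumption for this example, and the midpoint threshold is the convention I assume scikit-learn follows for a depth-1 tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def best_split(x, y):
    """Return (threshold, total SSR) of the split minimizing left + right SSR."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]

    best_thr, best_ssr = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical x values
        left, right = y[:i], y[i:]
        total = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if total < best_ssr:
            # Place the threshold halfway between the two neighbouring points
            best_thr, best_ssr = 0.5 * (x[i - 1] + x[i]), total
    return best_thr, best_ssr

# Hypothetical toy data with an obvious jump between x = 3 and x = 10
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])

thr, ssr = best_split(x_toy, y_toy)
print(f"Best threshold: {thr}, total SSR: {ssr:.4f}")

# Compare with the root split of a depth-1 tree (a "stump")
stump = DecisionTreeRegressor(max_depth=1).fit(x_toy.reshape(-1, 1), y_toy)
stump_threshold = stump.tree_.threshold[0]
print(f"Stump threshold: {stump_threshold}")
```

Repeating this search inside each resulting node is what grows the full tree.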

Before moving on to the Decision Tree, let's apply what we learned earlier on regression. In the section below, we train and compare **polynomial regression** models of varying degrees (including a linear baseline) to predict antenna Gain based on Frequency. We use the scikit-learn `Pipeline` to streamline preprocessing (including polynomial feature expansion and standardization) and model fitting. The trained models are evaluated by plotting their predictions and reporting the R2-score on the validation set, which helps assess model complexity and detect **underfitting** or **overfitting**.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

X_train = train_df[["Frequency"]].values
y_train = train_df["Gain"].values

X_val = val_df[["Frequency"]].values
y_val = val_df["Gain"].values

# Prepare prediction X values
x_line = np.linspace(train_df["Frequency"].min(), train_df["Frequency"].max(), 100).reshape(-1, 1)

plt.figure()
plt.plot(X_train, y_train, "o", 
           markersize = 8, color="royalblue", 
           markeredgecolor="black", alpha=0.3,
           label="Training")

# -------------------------------
# Linear regression for reference
linear_pipeline = Pipeline([
    ("scaler"   , StandardScaler()), # Standardize the features
    ("regressor", LinearRegression()) ])

linear_pipeline.fit(X_train, y_train)
y_linear = linear_pipeline.predict(x_line)
plt.plot(x_line, y_linear, "-.", linewidth=2, color="royalblue", label="Linear")

# -------------------------------
# Polynomial Fits
degrees = [3, 5, 7, 9]

for degree, color in zip(degrees, colors):
    poly_pipeline = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree)),
        ("scaler", StandardScaler()), # Standardize the features
        ("regressor", LinearRegression()) # Train linear regression using polynomial features
    ])
    
    poly_pipeline.fit(X_train, y_train)
    y_poly = poly_pipeline.predict(x_line)
    plt.plot(x_line, y_poly, linewidth=2, color=color, label=f"Polynom d={degree}")
    
    y_val_pred = poly_pipeline.predict(X_val)
    score = r2_score(y_val, y_val_pred)
    print(f"Degree {degree}: R2-score on validation: {score:.4f}")

plt.xlabel(r"Frequency: $\log_{10}(f)$ [Hz]")
plt.ylabel(r"Gain: $\log_{10}(G)$ [dB]")
plt.legend()
plt.grid(True)
plt.title("Antenna Gain and Polynomial Regression Fits")
plt.show()


The code section below trains a Decision Tree regressor to predict a target variable based on input features and evaluates its performance using metrics such as MSE, MAE, SSR, and the R2-score. These metrics help measure how accurately the model predicts and how well it explains the variance in the target data.

In [None]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error , r2_score
from sklearn.model_selection import cross_val_score

# Train Decision Tree Regressor
# As you see, the concept for developing a Decision Tree Regressor is similar to the polynomial regression, 
# but it uses a tree structure to make decisions based on feature values.
dt_pipeline = Pipeline([
    ('scaler', StandardScaler()), # Standardize the features
    ('dt_regressor', DecisionTreeRegressor(random_state=42))  # Train Decision Tree Regressor
])

dt_pipeline.fit(X_train, y_train)

# Predict on validation set
y_pred = dt_pipeline.predict(X_val)

# Evaluation metrics
mse = mean_squared_error (y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
ssr = np.sum((y_val - y_pred) ** 2)
r2  = r2_score(y_val, y_pred)

print("Decision Tree regression metrics:")
print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"SSR: {ssr:.4f}")
print(f"R2-score: {r2:.4f}")

# Visualize the Decision Tree
plt.figure(figsize=(12, 6))
plot_tree(dt_pipeline.named_steps['dt_regressor'], feature_names=["Frequency"],
          filled   = True,
          rounded  = True,
          max_depth= 3,  # You can remove or increase this if the tree is shallow.
          fontsize = 10 )
plt.title("Decision Tree (Frequency vs. Gain)")
plt.show()
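
The metrics printed above are closely related. As a minimal sketch on hypothetical toy numbers (not our antenna data), the code below computes SSR, MSE, and the R2-score by hand and checks them against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat  = np.array([1.1, 1.9, 3.2, 3.8])

# Sum of squared residuals
ssr_toy = np.sum((y_true - y_hat) ** 2)

# MSE is simply SSR divided by the number of samples
mse_toy = ssr_toy / len(y_true)

# Total sum of squares: spread of the targets around their own mean
sst_toy = np.sum((y_true - y_true.mean()) ** 2)

# R2 compares the model against always predicting the mean
r2_toy = 1.0 - ssr_toy / sst_toy

print(f"SSR: {ssr_toy:.4f}, MSE: {mse_toy:.4f}, R2: {r2_toy:.4f}")
print("Matches sklearn MSE:", np.isclose(mse_toy, mean_squared_error(y_true, y_hat)))
print("Matches sklearn R2 :", np.isclose(r2_toy, r2_score(y_true, y_hat)))
```

An R2-score of 0 means the model is no better than predicting the mean, and a negative value means it is worse.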

***
### 💡 Reflect and Run

- Study the decision tree and make sure you understand how the tree is made.

- Compare model performance and results from `DecisionTreeRegressor` with those obtained earlier from the Linear Regression and Polynomial models. Which model performs better and why?

- What are the advantages and disadvantages of using a Decision Tree Regressor compared to other regression models like Linear Regression or Polynomial models?

- Study `DecisionTreeRegressor` in SciKit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
In case the link did not work, search for `DecisionTreeRegressor` in SciKit-Learn.

- Read about `min_samples_split` and `min_samples_leaf` parameters in DecisionTreeRegressor.

- Modify the `DecisionTreeRegressor` by changing the `max_depth` parameter. Train the model with different depths (e.g., 3, 5, 10) and observe how the evaluation metrics (MSE, MAE, SSR, R2-score) change.

- Use the `.feature_importances_` attribute of the `DecisionTreeRegressor` to rank the importance of features in the dataset. Visualize this using a bar chart.

***
⚠️ For Decision Trees, you need to be careful about overfitting. You can control the depth of the tree using the `max_depth` parameter. A deeper tree can capture more complex patterns but may also lead to overfitting, where the model performs well on training data but poorly on unseen data. In contrast, a shallower tree may underfit the data, failing to capture important patterns. Finding the right balance is crucial for optimal model performance.
***
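
To see this trade-off in action without touching our antenna data, here is a minimal sketch on synthetic data that I generate for illustration only (a noisy sine curve; all names such as `X_syn` are assumptions for this example). It compares the R2-score on the training part and on the held-out part for several depths.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve with Gaussian noise (illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 6, size=(300, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.3, size=300)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

results = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_a, y_a)
    results[depth] = (tree.score(X_a, y_a), tree.score(X_b, y_b))
    print(f"max_depth={str(depth):>4}: train R2={results[depth][0]:.3f}, "
          f"held-out R2={results[depth][1]:.3f}")
```

The unrestricted tree memorizes the training points, so the gap between its training and held-out scores is the signature of overfitting to look for in your own experiments.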

In the previous code, we trained the `DecisionTreeRegressor` with default parameters, which may not provide the best model performance. We therefore need to improve our model by incorporating **hyperparameter tuning** and **cross-validation**. In the code section below, we use `RandomizedSearchCV` to optimize the model by searching for the best combination of hyperparameters (`max_depth`, `min_samples_split`, and `min_samples_leaf`) within a specified range. Additionally, we perform **cross-validation** to check that the model generalizes well to unseen data, reducing the risk of overfitting. This makes the second version of the code more robust and capable of producing a well-optimized decision tree model.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from scipy.stats import randint

# Build pipeline
new_pipeline = Pipeline([
    ('scaler', StandardScaler()), # Standardize the features
    ('new_dt_regressor', DecisionTreeRegressor(random_state=42))  # Train Decision Tree Regressor
])

# Hyperparameter search space
param_distributions = {
    'new_dt_regressor__max_depth'        : randint(2, 10),
    'new_dt_regressor__min_samples_split': randint(2, 10),
    'new_dt_regressor__min_samples_leaf' : randint(1, 10)
}

# Randomized search with cross-validation
search = RandomizedSearchCV(
    estimator=new_pipeline,   # Use the new pipeline
    param_distributions=param_distributions, # Hyperparameter search space
    n_iter       = 30,    # Number of iterations for random search
    cv           = 5,     # Cross-validation folds
    scoring      = 'r2',  # Scoring metric for evaluation
    n_jobs       = -1,    # Use all available CPU cores
    random_state = 42,    # Random state for reproducibility
)

# Fit search
search.fit(X_train, y_train)

# Best model
best_model = search.best_estimator_

# Predict on validation set
y_pred = best_model.predict(X_val)

# Evaluation metrics
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
ssr = np.sum((y_val - y_pred) ** 2)
r2  = r2_score(y_val, y_pred)

print("Best Parameters from RandomizedSearchCV:", search.best_params_)
print("Decision Tree regression metrics:")
print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"SSR: {ssr:.4f}")
print(f"R2-score: {r2:.4f}")

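# Cross-validation with the best hyperparameters.
# Note: cross_val_score refits a clone of best_model on each fold of the data passed in.
cv_scores = cross_val_score(best_model, X_val, y_val, cv=5, scoring='r2')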
print(f"\nCross-Validated R2-Scores: {cv_scores}")
print(f"Average Cross-Validated R2: {cv_scores.mean():.4f}")

# Visualize the Decision Tree
plt.figure(figsize=(12, 6))
plot_tree(best_model.named_steps['new_dt_regressor'], feature_names=["Frequency"],
          filled   = True,
          rounded  = True,
          max_depth= 3,  # You can remove or increase this if the tree is shallow
          fontsize = 10 )
plt.title("Best Decision Tree Model (Frequency vs. Gain)")
plt.show()
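
Besides `best_params_`, the search object keeps a record of every candidate it tried in `cv_results_`. The sketch below shows one way to inspect it, using synthetic data and a bare `DecisionTreeRegressor` that I set up for illustration only (names such as `toy_search` are assumptions for this example). Note also that `randint(2, 10)` from SciPy excludes the upper bound, so it samples integers from 2 to 9.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (illustration only)
rng = np.random.default_rng(2)
X_syn = rng.uniform(0, 6, size=(200, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.2, size=200)

toy_search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={
        "max_depth": randint(2, 10),         # integers 2..9
        "min_samples_leaf": randint(1, 10),  # integers 1..9
    },
    n_iter=10, cv=5, scoring="r2", random_state=0,
)
toy_search.fit(X_syn, y_syn)

# One row per sampled candidate, sorted from best to worst
columns = ["param_max_depth", "param_min_samples_leaf",
           "mean_test_score", "std_test_score", "rank_test_score"]
table = pd.DataFrame(toy_search.cv_results_)[columns].sort_values("rank_test_score")
print(table.head())
```

Looking at `std_test_score` next to `mean_test_score` tells you whether the top candidates are really different or whether their scores overlap within the fold-to-fold variation.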


***
### 💡 Reflect and Run

- Compare performance metrics (MSE, MAE, R2-score) of the tuned model with those obtained earlier from the model with default parameters. Which one performs better and why?

- Change the range of values in `param_distributions`. For example, increase the maximum value of `max_depth` to 20, and/or give `min_samples_split` and `min_samples_leaf` a broader range (e.g., `randint(2, 20)`). Rerun the code and observe how the best parameters and model performance change.

- Increase the `n_iter` parameter in `RandomizedSearchCV` from 30 to 50 or 100. Does the performance improve with more iterations? Why or why not?

***

Up to this point, we focused on developing a Frequency-Gain model, which relied on a single feature (`Frequency`) to predict the target variable (`Gain`). However, our dataset contains additional features, which makes the prediction problem multi-dimensional. To make use of them, the code below includes all available features, except for the target variable (`Gain`), in the feature sets (`X_train` and `X_val`). Excluding `Gain` from the input features is essential to avoid data leakage, where the target variable would improperly influence the predictions.

In [None]:
# Use the entire features for X_train and X_val
X_train = train_df.drop(columns=["Gain"]).values  # Exclude "Gain" from training features
y_train = train_df["Gain"].values  # Target variable

X_val = val_df.drop(columns=["Gain"]).values  # Exclude "Gain" from validation features
y_val = val_df["Gain"].values  # Target variable

# Build pipeline
new_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize the features
    ('full_dt_regressor', DecisionTreeRegressor(random_state=42))  # Train Decision Tree Regressor
])

# Hyperparameter search space
param_distributions = {
    'full_dt_regressor__max_depth'        : randint(2, 10),
    'full_dt_regressor__min_samples_split': randint(2, 10),
    'full_dt_regressor__min_samples_leaf' : randint(1, 10)
}

# Randomized search with cross-validation
search = RandomizedSearchCV(
    estimator=new_pipeline,   # Use the new pipeline
    param_distributions=param_distributions, # Hyperparameter search space
    n_iter       = 30,    # Number of iterations for random search
    cv           = 5,     # Cross-validation folds
    scoring      = 'r2',  # Scoring metric for evaluation
    n_jobs       = -1,    # Use all available CPU cores
    random_state = 42,    # Random state for reproducibility
)

# Fit search
search.fit(X_train, y_train)

# Best model
best_model = search.best_estimator_

# Predict on validation set
y_pred = best_model.predict(X_val)

# Evaluation metrics
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
ssr = np.sum((y_val - y_pred) ** 2)
r2 = r2_score(y_val, y_pred)

print("Best Parameters from RandomizedSearchCV:", search.best_params_)
print("Decision Tree regression metrics:")
print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"SSR: {ssr:.4f}")
print(f"R2-score: {r2:.4f}")

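# Cross-validation with the best hyperparameters.
# Note: cross_val_score refits a clone of best_model on each fold of the data passed in.
cv_scores = cross_val_score(best_model, X_val, y_val, cv=5, scoring='r2')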
print(f"\nCross-Validated R2-Scores: {cv_scores}")
print(f"Average Cross-Validated R2: {cv_scores.mean():.4f}")

# Visualize the Decision Tree
plt.figure(figsize=(12, 6))
plot_tree(best_model.named_steps['full_dt_regressor'],
          feature_names=list(train_df.drop(columns=["Gain"]).columns),  # Names must match the columns of X_train
          filled=True,
          rounded=True,
          max_depth=3,  # You can remove or increase this if the tree is shallow
          fontsize=10)
plt.title("Best Decision Tree Model (All Features)")
plt.show()
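
Once a tree has been trained on several features, its `feature_importances_` attribute tells you how much each feature contributed to reducing the error. Here is a minimal sketch on synthetic data that I generate for illustration (the feature names `f0`, `f1`, `f2` and their coefficients are assumptions for this example, not properties of the antenna dataset).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends strongly on f0, weakly on f1, not on f2
rng = np.random.default_rng(1)
X_syn = pd.DataFrame(rng.normal(size=(400, 3)), columns=["f0", "f1", "f2"])
y_syn = 3.0 * X_syn["f0"] + 0.5 * X_syn["f1"] + rng.normal(0, 0.1, size=400)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_syn, y_syn)

# Pair each importance with its feature name and rank them
importances = pd.Series(tree.feature_importances_, index=X_syn.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# For a bar chart you could call: importances.plot.bar()
```

The importances sum to one, so each value can be read as a share of the total error reduction. When the tree sits inside a `Pipeline`, remember to take it from `named_steps` first and to pair the values with the feature columns only, without the target.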

***
### ⚡ Mandatory submission

Construct the "Rod Breaking" example from the lecture notes, build the decision tree, and find out whether the rod breaks for the given test data.

***
   END
***