## Instructions {-}

1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

2. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The assignment is worth 100 points and is due on **Friday, 2nd May 2025 at 11:59 pm**.

5. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (2 pts). *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn't seem genuine, you will lose points.* 
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

## 1) Regression Problem - Miami housing
### 1a) Data preparation
Read the data *miami-housing.csv*. Check the description of the variables [here](https://www.kaggle.com/datasets/deepcontractor/miami-housing-dataset). Split the data into 60% train and 40% test. Use `random_state = 45`. The response is `SALE_PRC`, and the rest of the columns are predictors, except `PARCELNO`. Print the shape of the predictors dataframe of the train data.

*(2 points)*

### 1b) Baseline Decision Tree Model

Train a **Decision Tree Regressor** to predict `SALE_PRC` using all available predictors.

- Use `random_state=45` and keep all other hyperparameters at their default values.
- After training the model, evaluate and report the following on **both the training and test sets**:
  - **Mean Absolute Error (MAE)**
  - **R² Score**
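
As a reminder of what these two metrics measure, here is a minimal plain-Python sketch of the formulas. The function names `mean_abs_error` and `r_squared` and the sale prices are hypothetical, chosen only for illustration; in the notebook you would normally call `mean_absolute_error` and `r2_score` from `sklearn.metrics` instead.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical sale prices and predictions (not taken from the assignment data).
y_true = [300_000, 450_000, 250_000, 600_000]
y_pred = [320_000, 430_000, 260_000, 550_000]

mae = mean_abs_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(f"MAE: {mae:,.0f}")  # prints "MAE: 25,000"
print(f"R2:  {r2:.4f}")
```

MAE is in the same units as `SALE_PRC` (dollars), while R² is unitless, which is why the assignment asks for both.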

*(3 points)*

### 1c) Tune the Decision Tree Model

Tune the hyperparameters of the **Decision Tree Regressor** developed in the previous question and evaluate its performance.

Your goal is to achieve a **test set MAE (Mean Absolute Error) below \$68,000**.

- You must display the **optimal hyperparameter values** obtained from the tuning process.
- Compute and report the **test MAE and R² Score** using the tuned model.

**Hints:**

1. `BayesSearchCV()` with `max_depth` and `max_features` can often complete in under a minute.
2. You may use **any hyperparameter tuning method** (e.g., GridSearchCV, RandomizedSearchCV, BayesSearchCV).
3. You are free to choose **which hyperparameters to tune** and define your own **search space**.

*(9 points)*


### 1d) Bagged Decision Trees with Out-of-Bag Evaluation

Train a **Bagging Regressor** using Decision Trees as base estimators to predict `SALE_PRC`.

- Enable **out-of-bag (OOB) evaluation** by setting `oob_score=True`.
- Keep all other parameters at their default values.
- Tune only the `n_estimators` hyperparameter: increase the number of trees until the **OOB MAE stabilizes**.
- Ensure that the final **OOB MAE is less than \$48,000**.
- Report the final **OOB MAE**, **test MAE**, and **R² score**.
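
To make the mechanics of OOB evaluation concrete, the sketch below builds the OOB predictions by hand in plain Python. It is only an illustration under stated assumptions: the response values are simulated, and each "estimator" is a stand-in that predicts the mean of its bootstrap sample rather than a decision tree. In scikit-learn, a `BaggingRegressor` fitted with `oob_score=True` exposes the corresponding predictions in its `oob_prediction_` attribute, from which an OOB MAE can be computed against the training response.

```python
import random

rng = random.Random(45)
n_samples, n_estimators = 200, 100

# Simulated response values (hypothetical, not the Miami housing data).
y = [rng.gauss(400_000, 50_000) for _ in range(n_samples)]

oob_sums = [0.0] * n_samples   # running sum of OOB predictions per observation
oob_counts = [0] * n_samples   # number of estimators for which it was OOB

for _ in range(n_estimators):
    # Bootstrap: draw n_samples indices with replacement.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    # Stand-in "model": predicts the mean of its bootstrap sample.
    prediction = sum(y[i] for i in in_bag) / n_samples
    # Only observations NOT drawn into this sample receive an OOB prediction.
    for i in range(n_samples):
        if i not in in_bag_set:
            oob_sums[i] += prediction
            oob_counts[i] += 1

# Average the OOB predictions per observation, then score them.
covered = [i for i in range(n_samples) if oob_counts[i] > 0]
oob_pred = {i: oob_sums[i] / oob_counts[i] for i in covered}
oob_mae = sum(abs(y[i] - oob_pred[i]) for i in covered) / len(covered)

# Share of (observation, estimator) pairs that were out-of-bag.
oob_fraction = sum(oob_counts) / (n_samples * n_estimators)
print(f"OOB fraction: {oob_fraction:.3f}")
print(f"OOB MAE: {oob_mae:,.0f}")
```

With few estimators, some observations are out-of-bag for only a handful of them (or none), so their OOB predictions are noisy. This is one reason the OOB MAE tends to settle down as `n_estimators` grows.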

### 1e) Bagged Decision Trees Without Bootstrapping

Train a **Bagging Regressor** using Decision Trees, but this time **disable bootstrapping** by setting `bootstrap=False`.

- Use the same `n_estimators` value as in the previous question.
- Compute and report the following on the **test set**:
  - **Mean Absolute Error (MAE)**
  - **R² Score**

Explain **why the test MAE** in this case is:

- **Much higher** than the MAE obtained when bootstrapping was enabled (previous question).
- **Lower** than the MAE obtained from a single untuned decision tree (as in Question 1(b)).

> 💡 Hint: Consider the impact of bootstrap sampling on variance reduction and the benefits of aggregation in ensemble methods.


*(2 points for code, 3 + 3 points for reasoning)*


### 1f) Bagged Decision Trees with Feature Bootstrapping Only

Train a **Bagging Regressor** using Decision Trees, with the following configuration:

- **Disable sample bootstrapping** by setting `bootstrap=False`.  
- **Enable feature bootstrapping** by setting `bootstrap_features=True`.  
- **Keep the default setting** for `max_features` (i.e., do not modify it).

Use the same number of estimators (`n_estimators`) as in the previous bagging experiments.

- Compute and report the following on the **test set**:
  
  - **Mean Absolute Error (MAE)**
  - **R² Score**

Explain why the **test MAE** obtained in this setting is **much lower** than the one in the previous question, where neither samples nor features were bootstrapped.


*(2 points for code, 3 points for reasoning)*

### 1g) Tuning a Bagged Tree Model

#### i) Approaches

There are two common approaches for tuning a **bagged tree model**:

1. **Out-of-Bag (OOB) Prediction**
2. **$K$-fold Cross-Validation** using `GridSearchCV`

What is the advantage of each approach over the other? Specifically:

- What is the **advantage of the out-of-bag approach** compared to $K$-fold cross-validation?
- What is the **advantage of $K$-fold cross-validation** compared to the out-of-bag approach?

*(3 + 3 points)*

#### ii) Tuning the Hyperparameters

Tune the hyperparameters of the Bagging Regressor model developed in 1(d).  
You may use any tuning approach of your choice (e.g., grid search, random search, Bayesian search).  
It is up to you to select which hyperparameters to tune and define their candidate values.

**Your tuned model must achieve a test MAE lower than the one obtained in Question 1f.**

After tuning:

- Report the optimal hyperparameter values found.
- Compute and report the **test MAE** and **R² score** using the tuned model.

*(9 points)*

### 1h) Random Forest

#### i) Tuning a Random Forest Model

Train and tune a **Random Forest Regressor** to predict `SALE_PRC`.

- Select hyperparameters and define your own tuning grid.
- Use any tuning approach (e.g., Out-of-Bag (OOB) evaluation or $K$-fold cross-validation).
- Report the following performance metrics on the **test set**:
  - **Mean Absolute Error (MAE)**
  - **R² Score**

> ✅ Your goal is to achieve a **test MAE below \$46,000**.


*(9 points)*

#### ii) Feature Importance

After fitting the tuned **Random Forest Regressor**, extract and display the **feature importances**.

- Print the predictors in **decreasing order of importance** based on the trained model.
- This helps identify which features contribute most to predicting `SALE_PRC`.
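
The ordering step can be sketched as follows. The predictor names and importance values here are hypothetical placeholders; in the notebook, the values would come from the fitted model's `feature_importances_` attribute, which is ordered like the columns of the training predictors.

```python
# Hypothetical predictor names and importance values, for illustration only.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
importances = [0.10, 0.55, 0.05, 0.30]

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, importances),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name:<12}{importance:.2f}")
```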

*(4 points)*

#### iii) Random Forest vs. Bagging: `max_features`

The `max_features` hyperparameter is available in both `RandomForestRegressor()` and `BaggingRegressor()`.

Does `max_features` have the **same meaning** in both models?  
If not, explain the **difference in how it is interpreted and applied**.

> 💡 **Hint:** Refer to the scikit-learn documentation for both estimators to understand how `max_features` affects feature selection during training.

*(1 + 3 points)*

## 2) Classification - Term deposit

The data for this question is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were asked to subscribe to a term deposit.

The training data, *train.csv*, is used to develop the model, and the test data, *test.csv*, is used to test it. Each dataset has the following attributes about the clients called in the marketing campaign:

1. `age`: Age of the client

2. `education`: Education level of the client 

3. `day`: Day of the month the call is made

4. `month`: Month of the call 

5. `y`: Whether the client subscribed to a term deposit (the response)

6. `duration`: Call duration, in seconds. This attribute highly affects the output target (e.g., if `duration`=0 then `y`='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call `y` is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: [Source](https://archive.ics.uci.edu/ml/datasets/bank+marketing). Do not use the raw data source for this assignment. It is just for reference.)

### a) Data Preparation

Begin by examining the **distribution of the target variable** in both the training and test sets. This will help you assess whether there is any significant **class imbalance**.
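
A minimal sketch of such a check, using only the standard library, is shown below. The label list is a hypothetical stand-in for the `y` column of *train.csv*, and the 88/12 split is made up for illustration. With pandas, `Series.value_counts(normalize=True)` gives the same proportions.

```python
from collections import Counter

# Hypothetical labels standing in for the `y` column of train.csv.
y_train = ["no"] * 88 + ["yes"] * 12

counts = Counter(y_train)
proportions = {label: count / len(y_train) for label, count in counts.items()}

for label, share in proportions.items():
    print(f"{label}: {counts[label]} ({share:.0%})")
```

Running the same check on the test labels shows whether the two sets have a similar class balance.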

Next, consider the two available approaches for hyperparameter tuning:

- **Cross-validation (CV)**
- **Out-of-bag (OOB) evaluation**

#### ❓ Which method do you prefer for this dataset, and why?

Discuss your choice based on:

- The **size of the dataset**
- The **class imbalance** in the target variable
- The **reliability and interpretability** of each method
- Whether you need **stratified sampling** to preserve class distribution during evaluation

*(2 points)*

### b) Random Forest for Term Deposit Subscription Prediction

Develop and tune a **Random Forest Classifier** to predict whether a client will subscribe to a term deposit using the following predictors:  

- `age`
- `education`
- `day`
- `month`

The model must satisfy the following performance criteria:

#### ✅ Requirements:

1. **Minimum overall classification accuracy of 75%**, across both *train.csv* and *test.csv*.
2. **Minimum recall of 60%**, across both *train.csv* and *test.csv*.

You must:

- Print the **accuracy** and **recall** for both datasets (*train.csv* and *test.csv*).
- Use **cross-validation on the training data** to optimize the model hyperparameters.
- Select a **threshold probability** for classification and apply it consistently across both datasets.


#### ⚠️ Important Notes:

i. **Do not use `duration`** as a predictor. Its value is determined after the marketing call ends, so using it would leak information about the outcome.

ii. You are free to choose any **decision threshold** for classification, but the same threshold must be used consistently for both training and test evaluation.

iii. Use **cross-validation** to tune hyperparameters such as `max_features`, `max_depth`, and `max_leaf_nodes`.  
  - You may rely on the default cross-validation behavior, which uses **stratified folds by default for classification tasks** to account for class imbalance.

iv. After tuning the model, **plot cross-validated accuracy and recall** across a range of threshold values (e.g., 0.1 to 0.9). Use this plot to select a threshold that satisfies the required trade-off between accuracy and recall.

v. **Evaluate the final tuned model (with the chosen threshold)** on the test dataset. Do not use the test data to guide any part of the tuning or threshold selection.


#### 💡 Hints:

- Restrict the search space to:
  - `max_depth` ≤ 25  
  - `max_leaf_nodes` ≤ 45  
  These limits encourage generalization and help balance recall and accuracy.
  
- Consider using cross-validated predicted probabilities (rather than probabilities predicted on the same data the model was fit to) when plotting recall/accuracy curves.
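
The threshold sweep itself can be sketched in plain Python as below. The labels and probabilities are hypothetical. In the notebook they would be the training labels and the cross-validated predicted probabilities of the positive class, which in scikit-learn can be obtained with `cross_val_predict(..., method="predict_proba")`. The helper name `accuracy_and_recall` is also just an illustrative choice.

```python
def accuracy_and_recall(y_true, proba, threshold):
    """Classify as positive when proba >= threshold, then score the result."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = correct / len(y_true)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return accuracy, recall


# Hypothetical labels (1 = subscribed) and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.55, 0.80]

thresholds = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
results = {t: accuracy_and_recall(y_true, proba, t) for t in thresholds}

# Thresholds meeting both requirements (accuracy >= 75%, recall >= 60%).
feasible = [t for t, (acc, rec) in results.items() if acc >= 0.75 and rec >= 0.60]

for t, (acc, rec) in results.items():
    print(f"threshold={t:.1f}  accuracy={acc:.2f}  recall={rec:.2f}")
print("Feasible thresholds:", feasible)
```

Plotting the accuracy and recall columns of `results` against the threshold gives the curve requested in note iv. Lowering the threshold generally raises recall at the cost of accuracy when the positive class is rare.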



#### 📝 Scoring Breakdown *(22 points total)*:

- **8 points** – Hyperparameter tuning via cross-validation  
- **5 points** – Plotting accuracy and recall across thresholds  
- **5 points** – Threshold selection based on the plot  
- **4 points** – Reporting accuracy and recall on both datasets


## 3) Predictor Transformations in Trees

Can a **non-linear monotonic transformation** of predictors (such as `log()`, `sqrt()`, etc.) be useful in improving the accuracy of **decision tree models**?

Provide a brief explanation based on your understanding of how decision trees split data and handle predictor scales.

*(4 points for answer)*