## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Write your code in the **Code cells** and your answers in the **Markdown cells** of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to render the **.ipynb** file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The assignment is worth 100 points, and is due on **Friday, 23rd May 2025 at 11:59 pm**.

5. **Five points are for properly formatting the assignment**. The breakdown is as follows:
    - Must be an HTML file rendered using Quarto **(1 point)**. *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.* 
    - No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission.  **(1 point)**
    - There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) **(1 point)**
    - Final answers to each question are written in the Markdown cells. **(1 point)**
    - There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. **(1 point)**
6. Please make sure your code results are clearly incorporated in your submitted HTML file.

**Feel free to add data visualizations of your hyperparameter tuning process. Visualizing and analyzing tuning results is important—even if it's not explicitly required in the instructions.**

##  AdaBoost vs Bagging (4 points)

Which model among AdaBoost and Random Forest is more sensitive to outliers? **(1 point)** Explain your reasoning with the theory you learned on the training process of both models. **(3 points)**

##  Regression with Boosting (54 points)

For this question, you will use the **miami_housing.csv** file. You can find the description for the variables [here](https://www.kaggle.com/datasets/deepcontractor/miami-housing-dataset).

The `SALE_PRC` variable is the regression response and the rest of the variables, except `PARCELNO`, are the predictors.

### a) Preprocessing

Read the dataset. Create the training and test sets with a 60%-40% split and `random_state = 1`. **(1 point)**

### b) AdaBoost

Tune an **AdaBoost Regressor** to achieve a **test MAE below \$47,000**.

- You **must set `random_state=1` for all components** (e.g., base estimator, AdaBoost model, etc.).
- **Submissions that meet the MAE cutoff using any other `random_state` will receive zero credit.**

**Scoring:**

- 5 points for achieving test MAE < \$47,000
- 1 point for reporting the training MAE of your tuned model to evaluate generalization
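
As a reminder of what the training and test MAE measure, here is a minimal sketch in plain Python. The helper name `mae` and the toy prices are illustrative assumptions, not values from **miami_housing.csv**; in practice a library metric function would normally be used instead.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy sale prices (hypothetical), used only to show the comparison.
train_true = [300_000, 450_000, 250_000]
train_pred = [310_000, 440_000, 255_000]
test_true = [500_000, 200_000]
test_pred = [460_000, 230_000]

train_mae = mae(train_true, train_pred)
test_mae = mae(test_true, test_pred)

# A test MAE far above the training MAE suggests the model is overfitting.
print(f"train MAE: {train_mae:,.0f}  test MAE: {test_mae:,.0f}")
```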


### c) Loss Functions in Gradient Boosting

Gradient Boosting supports multiple loss functions, including **`squared_error`**, **`absolute_error`**, and **`huber`**.  

- **(1 point)** Which loss function performs best on this dataset?
- **(3 points)** What are the advantages of this loss function compared to the other two?

###  Task: Tune a Gradient Boosting Model

Your goal is to tune a **Gradient Boosting Regressor** to achieve a **cross-validation MAE below \$45,000**.

- You **must keep all `random_state` values set to 1**.  
- **Submissions using any other `random_state` will receive zero credit, even if the MAE cutoff is met.**

**Scoring (10 points total):**

- 5 points for using a well-reasoned hyperparameter search strategy
- 5 points for achieving MAE < \$45,000
- 1 point for reporting the training MAE of your tuned model to evaluate generalization



**Hints**

- **Parallel processing is not supported** in the vanilla `GradientBoostingRegressor`.
- **`BayesSearchCV`**, like gradient boosting itself, performs a sequential search—each trial depends on the result of the previous one—so it does **not support parallel exploration**.
- **Optuna** is generally faster and more efficient than both `BayesSearchCV` and `GridSearchCV`. It supports **parallel execution** of trials and includes several built-in performance enhancements.



### d) XGBoost vs. Gradient Boosting

**XGBoost Enhancements:**

- What improvements make **XGBoost superior to vanilla Gradient Boosting** in terms of performance and runtime?  
  - Explain the **enhancements** (1 point)  
  - Provide the **reasons** behind the improvements (1 point)  
  - Identify relevant **hyperparameters** and describe how they influence model behavior (2 points)

**XGBoost Limitations:**

- What important feature or behavior is **missing in XGBoost** but well-implemented in **vanilla Gradient Boosting**? (1 point)


### e) Tuning XGBoost with Different Search Strategies

Tune an **XGBoost Regressor** to achieve a **cross-validation MAE below \$42,500**.

- You **must keep `random_state=1` in all components** (e.g., XGBoost model, CV splits, search objects).
- **Submissions that meet the cutoff using any other `random_state` will receive zero credit.**

**Scoring (10 points total):**

- 5 points for a well-designed and appropriate hyperparameter search strategy
- 5 points for using three different search strategies
- 5 points for achieving MAE < \$42,500


**Search Strategies (Required Comparison)**

You must tune the model using **three different search settings**:

1. **BayesSearchCV**
   - Unlike vanilla `GradientBoostingRegressor`, XGBoost supports parallel training and can benefit from multi-core processing (`n_jobs=-1`), so `BayesSearchCV` is practical with it.

2. **Optuna (with `n_jobs=-1`)**

3. **Optuna (default single-threaded)**


**Execution Time**

You must report the **execution time** for each tuning strategy.

- You can measure this using:
  - A **Jupyter magic command** like `%%time`, or
  - Python’s `time.time()` (end - start)

For a **fair comparison**, use the **same search space** across all methods.  
Only one of the tuned models needs to meet the performance cutoff, but you should still report times for all three.
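
The `time.time()` approach can be wrapped in a small helper so every strategy is timed the same way. This is only a sketch: `timed` and `dummy_search` are hypothetical names, and `dummy_search` stands in for whatever call launches a tuning run.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - start
    return result, elapsed

# Hypothetical placeholder for a tuning run (e.g. fitting a search object).
def dummy_search(n):
    return sum(i * i for i in range(n))

result, seconds = timed(dummy_search, 100_000)
print(f"Search finished in {seconds:.2f} s")
```

Calling the same helper for all three strategies keeps the reported times comparable.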



### f) Feature Importance

Using the **best hyperparameter settings**, fit the final model and **output the feature importances**.

- Use the `.feature_importances_` attribute or equivalent method from your model.
- Visualize the importances if possible (e.g., with a bar plot).

## Imbalanced Classification with Regularized Gradient Boosting (42 points)

In this question, you will use the **train.csv** and **test.csv** datasets. Each observation represents a marketing call made by a banking institution. The target variable `y` indicates whether the client subscribed to a term deposit (`1`) or not (`0`), making this a binary classification task.

The predictors you should use are: `age`, `day`, `month`, and `education`.

⚠️ **Note:** As discussed last quarter, the variable `duration` **must not be used as a predictor**.  
**No credit will be given** for models that include it.


### a) Data Preprocessing

Perform the following preprocessing steps:

- Read in the training and testing datasets.
- Create a new `season` feature by mapping each `month` to its corresponding season.
- Define the predictor and response variables.
- Convert all categorical predictors to `pandas.Categorical` dtype before passing them to the models.
- Convert the response variable `y` to binary values (`0` and `1`).


**(5 points)**

We will rely on the **native categorical feature support** provided by each library (XGBoost, LightGBM, and CatBoost), so explicit one-hot encoding is **not** required.
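
One possible way to build the `season` feature is a dictionary lookup. The sketch below assumes Northern Hemisphere meteorological seasons and lowercase three-letter month labels (`"jan"`, `"feb"`, ...); check both assumptions against the actual data, since neither is stated above.

```python
# Assumed convention: meteorological seasons, lowercase month abbreviations.
SEASON_BY_MONTH = {
    "dec": "winter", "jan": "winter", "feb": "winter",
    "mar": "spring", "apr": "spring", "may": "spring",
    "jun": "summer", "jul": "summer", "aug": "summer",
    "sep": "fall", "oct": "fall", "nov": "fall",
}

def month_to_season(month):
    """Map a month label to its season, ignoring case and surrounding spaces."""
    return SEASON_BY_MONTH[month.strip().lower()]

months = ["may", "jul", "nov", "jan"]
seasons = [month_to_season(m) for m in months]
print(seasons)
```

With a pandas DataFrame (here a hypothetical `df`), the same dictionary can be applied with `df["month"].map(SEASON_BY_MONTH)`, and `.astype("category")` converts the result to a categorical dtype.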


### b) Target Exploration

For classification tasks, it's important to examine the distribution of the target variable to determine whether the classes are imbalanced. This helps you avoid common pitfalls when dealing with imbalanced classification.

- Explore the class distribution in both the **training** and **test** sets.

**(2 points)**
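
A minimal sketch of a class-distribution check, using only the standard library. The helper `class_distribution` and the toy labels are illustrative assumptions and do not reflect the real proportions in **train.csv** or **test.csv**.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: (count, share_of_total)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}

# Toy labels (hypothetical): 8 negatives and 2 positives.
y_train_toy = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

for cls, (n, share) in class_distribution(y_train_toy).items():
    print(f"class {cls}: {n} rows ({share:.0%})")
```

Running the same check on the training and test labels shows whether both sets are imbalanced to a similar degree.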

### c) LightGBM and CatBoost

LightGBM and CatBoost are gradient boosting frameworks, like XGBoost, but each introduces unique innovations.

- What do LightGBM and CatBoost have in common with XGBoost? **(2 points)**  
- What advantages do they offer over XGBoost? **(2 points)**  
- How are these advantages implemented in each model? **(3 points)**  
- All three libraries support native categorical feature handling.  
  Do they use the same approach? If not, explain the differences. **(3 points)**

### d) Handling Imbalanced Classification in Gradient Boosting Extensions

For all extensions of Gradient Boosting (XGBoost, LightGBM, and CatBoost):

- Are there additional inputs or hyperparameters available to handle imbalanced classification? **(1 point)**  
- If yes, describe how the method works. **(1 point)**  
- How should the value of this hyperparameter be set or tuned for best results? **(1 point)**

### e) Model Evaluation: XGBoost, LightGBM, and CatBoost

Evaluate the performance of the following models: **XGBoost**, **LightGBM**, and **CatBoost**, using the following metrics on the test set:

- **Recall**
- **Precision**
- **F1 Score**
- **AUPRC** (Area Under the Precision-Recall Curve)
- **ROC AUC**

For each model, build and compare **two versions**:

1. **Baseline model**: using default settings with `random_state=1`, without addressing class imbalance.
2. **Imbalance-aware model**: with `scale_pos_weight` enabled to handle class imbalance.

- Compare the performance of both versions for each model.
- Summarize which model and approach performed best for imbalanced classification, and try to explain why.


### f) Tuning LightGBM for Classification

Tune a **LightGBM classifier** to achieve:

- **Cross-validation accuracy ≥ 70%**
- **Cross-validation recall ≥ 65%**

You **must set `random_state=1` in all components** (e.g., model, cross-validation, search objects).  
**Submissions that exceed the cutoffs using any other `random_state` will receive zero credit.**

**Scoring (15 points total):**

- 7.5 points for a well-designed and justified search strategy
- 7.5 points for meeting both performance thresholds


**Hints:**

- For classification, you may also tune the **decision threshold** (not just model hyperparameters).
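
To illustrate what tuning the decision threshold means, the sketch below turns predicted probabilities into labels at several cutoffs and reports accuracy and recall for each. The helper names and the toy probabilities are assumptions for illustration; with a fitted scikit-learn-style classifier, the probabilities would typically come from `predict_proba(X)[:, 1]`.

```python
def predict_with_threshold(probs, threshold):
    """Label an observation 1 when its predicted probability >= threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == 1 and p == 1 for t, p in pairs)
    positives = sum(t == 1 for t in y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall

# Toy labels and predicted probabilities (hypothetical values).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
probs = [0.10, 0.35, 0.20, 0.55, 0.45, 0.80, 0.30, 0.25]

for threshold in (0.5, 0.4, 0.2):
    acc, rec = accuracy_and_recall(y_true, predict_with_threshold(probs, threshold))
    print(f"threshold={threshold:.1f}  accuracy={acc:.3f}  recall={rec:.3f}")
```

In this toy example, lowering the threshold raises recall, while accuracy first improves and then drops, which is why the threshold is worth tuning alongside the model hyperparameters.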


### g) Test Set Evaluation

Evaluate the **tuned LightGBM model** on the **test set**:

- Report the **test accuracy** and **test recall**.
- Include the **threshold** used for classification.

This will help assess how well the model generalizes beyond the training data.  

**(2 points)**

### h) Tuning CatBoost for Classification

Tune a **CatBoost classifier** to achieve:

- **Cross-validation accuracy ≥ 70%**
- **Cross-validation recall ≥ 65%**

You **must set `random_state=1` in all components** (e.g., model, cross-validation, search objects).  
**Submissions that exceed the cutoffs using any other `random_state` will receive zero credit.**

**Scoring (15 points total):**

- 7.5 points for a well-structured and appropriate hyperparameter search
- 7.5 points for meeting both performance thresholds


**Hints:**

- You are free to use **any tuning strategy** and define **any reasonable search space**.
- In addition to tuning hyperparameters, you may also need to **tune the decision threshold** to meet the classification performance criteria.


### i) Test Set Evaluation

Evaluate the **tuned CatBoost model** on the **test set**:

- Report the **test accuracy** and **test recall**.
- Include the **classification threshold** used.

This will help assess whether the model generalizes well beyond the training data.  
**(1 point)**

## 🎁 Bonus (Extra Credit) – 20 Points

To help you prepare for your upcoming prediction project involving hyperparameter tuning, I’ve created the following optional tasks.  
Feel free to skip them if time does not permit.


### a) Comparing Tuning Strategies

Compare the tuning time and results of **`GridSearchCV`** and **`RandomizedSearchCV`** using the same search space you used in Task 2e (`BayesSearchCV` and `Optuna`).

- What are the trade-offs between **exhaustive search**, **random search**, and **smarter strategies** like **Bayesian optimization** and **Optuna**?
- Are the differences in runtime justified by improvements in model performance?




### b) Resumable Tuning Strategies

Do your own research: Among all the tuning strategies you have used, which ones allow you to **continue tuning without starting from scratch** when increasing `n_trials` or `n_iter`?

- Identify the methods that support **incremental or resumable search**.
- Explain how they work and why they are efficient.
- Provide code to demonstrate how these strategies **reuse previous results** rather
