## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The assignment is worth 100 points, and is due on **Friday, 19th May 2023 at 11:59 pm**. 

5. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (2 pts).
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

## D.1 Conceptual

### D.1.1 AdaBoost vs Random Forest

Between AdaBoost and Random Forest, which model is more sensitive to outliers in the response, and why? Consider both regression and classification.

*(1 + 3 points)*

### D.1.2 Loss functions
Which loss functions should you use in boosting algorithms for regression problems to reduce sensitivity to outliers in the response, as compared to the squared error loss function? Name any two such loss functions, and explain how they reduce the sensitivity to outliers.

*(2 + 2 points)*

## D.2 Regression Problem - Miami housing
### D.2.1 Data preparation
Read the data *miami-housing.csv*. Check the description of the variables [here](https://www.kaggle.com/datasets/deepcontractor/miami-housing-dataset). Split the data into 60% train and 40% test. Use `random_state = 45`. The response is `SALE_PRC`, and the rest of the columns are predictors, except `PARCELNO`. Print the shape of the predictors dataframe of the train data.
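
The split described above can be sketched as follows. This is a minimal illustration on a tiny synthetic data frame, not the real *miami-housing.csv*; the predictor columns `LND_SQFOOT` and `age` are placeholders chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for miami-housing.csv (assumption: the real file
# has many more rows and predictors; these columns are placeholders).
data = pd.DataFrame({
    "PARCELNO": range(10),
    "SALE_PRC": [250_000 + 10_000 * i for i in range(10)],
    "LND_SQFOOT": [5_000 + 100 * i for i in range(10)],
    "age": [10 + i for i in range(10)],
})

# The response is SALE_PRC; PARCELNO is excluded from the predictors.
X = data.drop(columns=["SALE_PRC", "PARCELNO"])
y = data["SALE_PRC"]

# 60% train / 40% test, with the seed given in the assignment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)
print(X_train.shape)
```

With the real file, the data frame would instead come from `pd.read_csv`, and the rest of the sketch would stay the same.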

*(2 points)*

### D.2.2 AdaBoost hyperparameter tuning
Develop and tune an AdaBoost model to predict `SALE_PRC` based on all the predictors. Compute the MAE on test data.

You must tune in the following manner:

1. Use `GridSearchCV` to minimize the $5$-fold mean absolute error (MAE). 

2. You are advised to do a coarse grid search first to get an idea of the domain space where the optimal hyperparameter values lie.

3. If you reach the goal with the coarse grid search, you can stop. Otherwise, you may follow it up with a finer grid search to get more precise optimal hyperparameter values.

4. You may decide yourself which hyperparameters you wish to tune. Tuning `max_depth`, `n_estimators`, and `learning_rate` should suffice.

The test MAE must be less than **$46,000**. You must show the optimal values of the hyperparameters obtained, and the test MAE.

*Note: Hyperparameter tuning must be done on train data. Test data is only to assess model performance. Test data must remain untouched until the model is finalized, and must only be used to compute the test MAE.*

**Hint:** Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.

1. Consider tree depths of 3, 5, and 10, number of trees as 10, 50, 100, and 200, and learning rates as 0.0001, 0.001, 0.01, 0.1, and 1.0. `GridSearchCV` takes 2 minutes to execute on a 6-core laptop for these values.

2. With the above search, you will probably fail to achieve the objective. However, when you visualize the 5-fold MAE with each of the hyperparameter values considered, you will realize that there is a particular hyperparameter for which you should consider higher / lower values. You will also realize that you need not consider some of the values of the remaining hyperparameters. 

3. Do another 2-minute grid search based on what you realized in (2), and you should achieve the objective.

*(10 points)*

### D.2.3 AdaBoost feature importance
Arrange and print the predictors in decreasing order of importance.
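
One possible way to rank predictors is sketched below. It uses synthetic data and hypothetical column names (`x1` to `x4`), and relies on the `feature_importances_` attribute of a fitted scikit-learn ensemble.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data with hypothetical predictor names (assumption).
X_arr, y = make_regression(
    n_samples=100, n_features=4, n_informative=2, random_state=45
)
X = pd.DataFrame(X_arr, columns=["x1", "x2", "x3", "x4"])

model = AdaBoostRegressor(n_estimators=20, random_state=45).fit(X, y)

# Pair each importance with its column name, then sort in decreasing order.
importance = pd.Series(model.feature_importances_, index=X.columns)
ranked = importance.sort_values(ascending=False)
print(ranked)
```

Other fitted tree ensembles in scikit-learn, such as `GradientBoostingRegressor`, expose the same attribute, so the sorting step should carry over unchanged.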

*(2 points)*

### D.2.4 Huber loss

What is the advantage of the Huber loss function *(page 349 of [Elements of Statistical Learning](https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf))* over the squared error and absolute error loss functions? 
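
To make the comparison concrete, the three losses can be evaluated on a few residuals. The sketch below uses the common form of the Huber loss with a factor of one half and a hypothetical threshold `delta = 1.0`; textbook presentations may scale it by a constant factor.

```python
def squared_loss(r):
    """Squared error loss for a residual r."""
    return r ** 2


def absolute_loss(r):
    """Absolute error loss for a residual r."""
    return abs(r)


def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    if abs(r) <= delta:
        return 0.5 * r ** 2
    return delta * (abs(r) - 0.5 * delta)


# Small, borderline, and large residuals.
for r in [0.5, 1.0, 10.0]:
    print(r, squared_loss(r), absolute_loss(r), huber_loss(r))
```

For the residual of 10, the squared loss is 100 while the Huber loss is 9.5, which shows how the penalty grows only linearly once the residual exceeds `delta`.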

*(4 points)*

### D.2.5 `RandomizedSearchCV` vs `GridSearchCV`

What's the advantage of `GridSearchCV` over `RandomizedSearchCV` and vice-versa? When will `GridSearchCV` be preferred over `RandomizedSearchCV` and vice-versa?

*(4 points)*

### D.2.6 Gradient boosting (Huber loss) hyperparameter tuning

Develop and tune a Gradient boosting model with `Huber` loss to predict `SALE_PRC` based on all the predictors. Compute the MAE on test data.

You must tune in the following manner:

1. You may use `GridSearchCV` or `RandomizedSearchCV` to minimize the $K$-fold mean absolute error (MAE). You may choose any $K$.

2. You may decide yourself which hyperparameters you wish to tune. Tuning `max_depth`, `n_estimators`, `learning_rate`, and `subsample` should suffice.

The test MAE must be less than **$43,000**. You must show the optimal values of the hyperparameters obtained, and the test MAE.

*Note: Hyperparameter tuning must be done on train data. Test data is only to assess model performance. Test data must remain untouched until the model is finalized, and must only be used to compute the test MAE.*

**Hint:** Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.

1. Use 2-fold cross-validation to make the execution speed higher. Here, we are compromising - adding bias to the CV error to get a lesser execution time.

2. In gradient boosting, the suggested depth of trees is in \[4, 8\] *(see page 363 in [Elements of Statistical Learning](https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf))*. So, consider depths of 4, 6, and 8. Consider 3 values of number of trees in \[100, 1000\], 3 values of learning rates in \[0.01, 0.5\], and 3 subsample values in \[0.5, 1\]. It takes 8 minutes on a 6-core laptop to do an exhaustive search on these values.

3. With the above search, you will probably fail to achieve the objective. However, when you compare the optimal hyperparameter values obtained with the hyperparameter values considered, you will realize that there are some hyperparameters for which you should consider higher / lower values. 

4. Do another 10-minute grid search based on what you realized in (3), and you should achieve the objective.

5. Further fine-tuning may reduce your MAE to as low as \$42,400. However, you can stop once it is below \$43,000 in (4).
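
As an alternative to the exhaustive search in the hint, a randomized search over the same kinds of ranges can be sketched as follows. The data are synthetic, and the ranges and `n_iter` are shrunk so that the example runs quickly; they are illustrative assumptions, not recommended values.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X_train, y_train = make_regression(
    n_samples=80, n_features=5, noise=10.0, random_state=45
)

param_distributions = {
    "max_depth": [4, 6, 8],                 # sampled uniformly from the list
    "n_estimators": randint(20, 60),        # integers from 20 to 59
    "learning_rate": uniform(0.01, 0.49),   # floats in [0.01, 0.50]
    "subsample": uniform(0.5, 0.5),         # floats in [0.5, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(loss="huber", random_state=45),
    param_distributions,
    n_iter=5,                               # number of sampled combinations
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=2, shuffle=True, random_state=45),
    random_state=45,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(-search.best_score_)  # 2-fold MAE of the best sampled combination
```

Here the cost is fixed by `n_iter` times the number of folds, regardless of how many hyperparameters are searched.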


*(14 points)*

### D.2.7 Gradient boosting feature importance
Arrange and print the predictors in decreasing order of importance.

*(2 points)*

### D.2.8 Bias-variance
For each of the following hyperparameters tuned in the previous question, explain how they affect the bias / variance of a gradient boosting model when their values are increased.

#### `max_depth`

#### `n_estimators`

#### `learning_rate`

#### `subsample`

*(8 points)*

### D.2.9 XGBoost objective function
How is XGBoost different from gradient boosting performed with the `GradientBoostingRegressor()` function in the previous question with regard to the optimization objective? How does this difference benefit the prediction accuracy of XGBoost?

*(4 points)*

### D.2.10 XGBoost hyperparameter tuning
Develop and tune an XGBoost model to predict `SALE_PRC` based on all the predictors. Compute the MAE on test data.

You must tune in the following manner:

1. You may use `GridSearchCV` or `RandomizedSearchCV` to minimize the $K$-fold mean absolute error (MAE). You may choose any $K$.

2. You may decide yourself which hyperparameters you wish to tune. Tuning `max_depth`, `n_estimators`, `learning_rate`, `reg_lambda`, `gamma` and `subsample` should suffice.

The test MAE must be less than **$42,000**. You must show the optimal values of the hyperparameters obtained, and the test MAE.

*Note: Hyperparameter tuning must be done on train data. Test data is only to assess model performance. Test data must remain untouched until the model is finalized, and must only be used to compute the test MAE.*

**Hint:** Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.

1. Inspired by the optimal hyperparameter values obtained in [D.2.6](https://nustat.github.io/STAT303-3-class-notes/Assignment%20D.html#gradient-boosting-huber-loss-hyperparameter-tuning), do a search with 2-fold cross-validation. Even though the default loss function in XGBoost is [`squarederror`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor), hyperparameter values similar to the optimal hyperparameter values obtained in [D.2.6](https://nustat.github.io/STAT303-3-class-notes/Assignment%20D.html#gradient-boosting-huber-loss-hyperparameter-tuning) seem to work well. The regularization parameters `gamma` and `reg_lambda` help reduce the MAE further.

2. It took 10 minutes on a 6-core laptop to tune the model with (1), with the values of `gamma` considered as 0 and 10, and values of `reg_lambda` considered as 0, 1, and 10.

*(14 points)*

### D.2.11 XGBoost Feature importance
Arrange and print the predictors in decreasing order of importance.

*(2 points)*

## D.3 Classification - Term deposit

The data for this question is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were asked to subscribe to a term deposit.

There is a training dataset, *train.csv*, which you will use to develop a model. There are two test datasets, *test1.csv* and *test2.csv*, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:

1. `age`: Age of the client

2. `education`: Education level of the client 

3. `day`: Day of the month the call is made

4. `month`: Month of the call 

5. `y`: Did the client subscribe to a term deposit?

6. `duration`: Call duration, in seconds. This attribute highly affects the output target (e.g., if `duration`=0 then `y`='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call `y` is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: [Source](https://archive.ics.uci.edu/ml/datasets/bank+marketing). Do not use the raw data source for this assignment. It is just for reference.)

### D.3.1 Data preparation
Convert all the categorical predictors in the data to dummy variables. Note that `month` and `education` are categorical variables.
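
One way to create the dummy variables is sketched below with `pd.get_dummies`. The rows are made up for illustration, and the `reindex` step is an optional safeguard (an assumption about how you may want to handle it) for test data that lacks a category seen in training.

```python
import pandas as pd

# Made-up rows standing in for train.csv and a test file (assumption).
train = pd.DataFrame({
    "age": [30, 45, 52],
    "education": ["primary", "tertiary", "secondary"],
    "day": [5, 12, 20],
    "month": ["may", "jun", "may"],
})
test = pd.DataFrame({
    "age": [40],
    "education": ["secondary"],
    "day": [3],
    "month": ["may"],
})

categorical = ["education", "month"]
X_train = pd.get_dummies(train, columns=categorical)

# Align the test columns with the training columns; dummy columns that are
# absent from the test data are filled with 0.
X_test = pd.get_dummies(test, columns=categorical).reindex(
    columns=X_train.columns, fill_value=0
)
print(X_train.columns.tolist())
```

Keeping the column order identical across datasets matters because the fitted model expects the predictors in the order seen during training.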

*(1 point)*

### D.3.2 Boosting
Develop and tune any boosting model to predict the probability of a client subscribing to a term deposit based on `age`, `education`, `day` and `month`. The model must have: 

(a)  **Minimum overall classification accuracy of 70%** among the classification accuracies on *train.csv* and *test.csv*.

(b) **Minimum recall of 65%** among the recalls on *train.csv* and *test.csv*.

Print the accuracy and recall for both datasets, *train.csv* and *test.csv*.

Note that: 

i. You cannot use `duration` as a predictor. The predictor is not useful for prediction because its value is determined after the marketing call ends. However, after the call ends, we already know whether the client responded positively or negatively. 

ii. You are free to choose any value of threshold probability for classifying observations. However, you must use the same threshold on both the datasets.

iii. Use cross-validation on train data to optimize the model hyperparameters.

iv. Using the optimal model hyperparameters obtained in (iii), develop the boosting model. Plot the cross-validated accuracy and recall against the decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot, to achieve the required trade-off between recall and accuracy.

v. Evaluate the accuracy and recall of the developed model with the tuned decision threshold probability on both the datasets. Note that the test dataset must only be used to evaluate performance metrics, and not optimize any hyperparameters or decision threshold probability.
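
The threshold tuning in steps (iv) and (v) can be sketched as follows. The labels and probabilities are hypothetical; in practice the probabilities could be out-of-fold predictions on the train data, for example from scikit-learn's `cross_val_predict` with `method="predict_proba"`.

```python
import numpy as np

# Hypothetical true labels (1 = subscribed) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])


def accuracy_and_recall(y_true, y_prob, threshold):
    """Classify as 1 when the probability reaches the threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    accuracy = np.mean(y_pred == y_true)
    recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    return accuracy, recall


thresholds = [0.25, 0.5, 0.75]
feasible = []
for t in thresholds:
    acc, rec = accuracy_and_recall(y_true, y_prob, t)
    print(t, acc, rec)
    # Keep thresholds that meet both requirements of the question.
    if acc >= 0.70 and rec >= 0.65:
        feasible.append(t)
```

Plotting `acc` and `rec` against a finer grid of thresholds gives the curve requested in step (iv); the chosen threshold is then applied unchanged to the test data.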

*(20 points - 10 points for tuning the hyperparameters, 4 points for making the plot, 4 points for tuning the decision threshold probability based on the plot, and 2 points for printing the accuracy & recall on both the datasets)*

It is up to you to pick the hyperparameters and their values in the grid.

**Hint:** Below is one way to solve the problem. Note that there may be several completely different and better ways to solve the problem.

XGBoost may help with tuning of `n_estimators`, `max_depth`, `learning_rate`, `gamma`, `reg_lambda`, and `subsample`. You may take the recommended value of [`scale_pos_weight`](https://xgboost.readthedocs.io/en/latest/parameter.html). Use `RandomizedSearchCV`. Evaluation of 200 models with 5-fold cross validation, i.e., 1000 fits, takes 45 minutes on a 6-core laptop. You may try a 2-fold cross validation to reduce time.
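
For `scale_pos_weight`, the XGBoost parameter documentation suggests the ratio of negative to positive instances as a typical value. A minimal sketch on hypothetical labels:

```python
# Hypothetical training labels: 1 = subscribed, 0 = did not subscribe.
y_train = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

n_pos = sum(y_train)
n_neg = len(y_train) - n_pos

# Ratio of negative to positive instances.
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)
```

The resulting value would be computed from the train data only and passed to the classifier as a fixed setting rather than tuned on the test data.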