## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The assignment is worth 100 points, and is due on **Monday, 5th June 2023 at 11:59 pm**. 

5. All the estimated code execution times in this assignment are based on an instance of *n1-standard-32 (32 cores virtual machine)* on Google Colab.

6. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (2 pts).
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

## Conceptual

### Ensembling

Is it possible for an ensemble model to perform worse than one or more of the individual models? Why or why not?

*(1 + 4 points)*

### Ensemble fail
If an ensemble model does perform worse than one or more of the individual models, then what should be the course of action?

*(3 points)*

## Regression Problem - Miami housing
### Data preparation
Read the data *miami-housing.csv*. Check the description of the variables [here](https://www.kaggle.com/datasets/deepcontractor/miami-housing-dataset). Split the data into 60% train and 40% test. Use `random_state = 45`. The response is `SALE_PRC`, and the rest of the columns are predictors, except `PARCELNO`. Print the shape of the predictors dataframe of the train data.
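The split described above can be sketched as follows. This is a minimal illustration on a synthetic data frame, not the actual dataset: only `SALE_PRC` and `PARCELNO` are column names from the assignment, while `feature_a` and `feature_b` are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for miami-housing.csv (values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "PARCELNO": np.arange(100),                    # identifier, not a predictor
    "feature_a": rng.normal(size=100),             # hypothetical predictor
    "feature_b": rng.normal(size=100),             # hypothetical predictor
    "SALE_PRC": rng.uniform(1e5, 1e6, size=100),   # response
})

# The response is SALE_PRC; PARCELNO is excluded from the predictors.
y = data["SALE_PRC"]
X = data.drop(columns=["SALE_PRC", "PARCELNO"])

# 60% train / 40% test with the random_state given in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)
print(X_train.shape)
```

With the real file, the data frame would instead come from `pd.read_csv("miami-housing.csv")`.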

*(1 point)*

### MARS model
Develop a MARS model to predict `SALE_PRC` based on all the predictors. Compute the MAE on test data.

Assume that you have used `GridSearchCV` to tune the `max_degree` hyperparameter of the model, and the optimal value comes out to be `max_degree = 3`. Use this value to train the model. 

*Estimated code execution time: 1 minute*

The test MAE should be around **$55,000**. 

*(2 points)*

### Bagged MARS model
Bag 20 MARS models with the same value of `max_degree`, and report the test MAE based on the bagged MARS model.

*Estimated code execution time: 5 minutes*

The test MAE should be around **$51,000**. 
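A minimal sketch of the bagging step, assuming scikit-learn's `BaggingRegressor`. A shallow decision tree stands in for the MARS model so that the example runs without a MARS library; in the assignment, the base estimator would be the MARS model with `max_degree = 3`. The data is synthetic, so the resulting MAE is not comparable to the figure quoted above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the housing data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)

# Stand-in base estimator; replace with the MARS model in the assignment.
base = DecisionTreeRegressor(max_depth=5, random_state=0)

# Bag 20 copies of the base estimator, each fit on a bootstrap sample.
bagged = BaggingRegressor(base, n_estimators=20, random_state=1)
bagged.fit(X_train, y_train)

bagged_mae = mean_absolute_error(y_test, bagged.predict(X_test))
print(f"Bagged test MAE: {bagged_mae:.2f}")
```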

*(4 points)*

### Voting ensemble

Develop a voting ensemble model with:

1. The bagged MARS model developed in [E.2.3](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#bagged-mars-model), 

2. The tuned bagged tree model developed in [C.1.6.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20C.html#tuning-the-hyperparameters)

3. The tuned random forest model developed in [C.1.8.1](https://nustat.github.io/STAT303-3-class-notes/Assignment%20C.html#tuning-random-forest)

4. The tuned AdaBoost model developed in [D.2.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20D.html#adaboost-hyperparameter-tuning)

5. The tuned Gradient boosting model *(with Huber loss)* developed in [D.2.6](https://nustat.github.io/STAT303-3-class-notes/Assignment%20D.html#gradient-boosting-huber-loss-hyperparameter-tuning)

6. The tuned XGBoost model developed in [D.2.10](https://nustat.github.io/STAT303-3-class-notes/Assignment%20D.html#xgboost-hyperparameter-tuning)

Report the MAE of each of the above models *(1-6)*, and the voting ensemble.

The MAE of the voting ensemble is likely to be higher than some of the individual models, as these models have a broad range of MAEs *(see equation 10.1 in [class notes](https://nustat.github.io/STAT303-3-class-notes/Lec10_Ensemble.html))*.

*Note:*

*1. If you had replaced the boosting models in (5) and (6) with other boosting models, you can use those.*

*2. You may either use the function `VotingRegressor()` or just take the average of the predictions of all the models and compute the MAE. The latter will be quicker as you have already fitted the individual models to compute their predictions and respective MAEs, so you don't need to fit the models again with `VotingRegressor()`*
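The averaging approach in the second note can be sketched with NumPy alone. The predictions below are made-up numbers for three hypothetical models, chosen so that one model is much weaker than the other two; they are not results from the assignment.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Hypothetical test responses and per-model test predictions.
y_test = np.array([100.0, 200.0, 300.0, 400.0])
preds = {
    "model_a": np.array([110.0, 190.0, 310.0, 390.0]),
    "model_b": np.array([90.0, 210.0, 290.0, 410.0]),
    "model_c": np.array([150.0, 250.0, 350.0, 450.0]),  # much weaker model
}

# MAE of each individual model.
individual_mae = {name: mae(y_test, p) for name, p in preds.items()}

# Voting ensemble for regression: the unweighted mean of the predictions.
ensemble_pred = np.mean(np.column_stack(list(preds.values())), axis=1)
ensemble_mae = mae(y_test, ensemble_pred)

print(individual_mae)
print(f"Voting ensemble MAE: {ensemble_mae:.2f}")
```

In this toy case the ensemble MAE falls between that of the two stronger models and that of the weak one, which is the behaviour the question warns about when the individual MAEs span a broad range.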

*(6 + 4 points)*

### Voting ensemble with good models

Only ensemble those models that have comparable MAEs and relatively low MAEs. These are likely to be models (5) and (6) in the previous question ([E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble)). Report the MAE of this voting ensemble.

This ensemble is likely to have a lower MAE than each of the models *1-6* in the previous question ([E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble)).

*(4 points)*

### Stacking ensemble with Linear regression

Develop a linear regression metamodel based on models *1-6* in [E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble). Report the MAE of the metamodel on test data. Which model has the highest weight in the ensemble?

*Note:*

1. You may use the `StackingRegressor()` function. However, as the next set of questions asks you to develop different metamodels based on the models *1-6* in [E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble), using `StackingRegressor()` will be inefficient, as it will involve fitting each of the individual models every time it is called.

2. A faster way will be to use the `cross_val_predict()` function to compute the 5-fold cross-validated predictions from each of the models *1-6*, consider these predictions from the 6 models as 6 different predictors, and fit the metamodel. Once computed, these cross-validated predictions can be used with different metamodels without the need to fit the individual models repeatedly with `StackingRegressor()`.
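The approach in the second note can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic, and three simple scikit-learn regressors stand in for the six tuned models of the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)

# Stand-in base models (assumption: the assignment uses models 1-6 instead).
base_models = {
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=25, random_state=0),
}

# 5-fold cross-validated predictions on the train data: one column per model.
# These are the predictors of the metamodel and are computed only once.
meta_train = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5)
    for model in base_models.values()
])

# Fit each base model on the full train data and predict on the test data.
meta_test = np.column_stack([
    model.fit(X_train, y_train).predict(X_test)
    for model in base_models.values()
])

# Linear regression metamodel; its coefficients act as model weights.
metamodel = LinearRegression().fit(meta_train, y_train)
stack_mae = mean_absolute_error(y_test, metamodel.predict(meta_test))
weights = dict(zip(base_models, metamodel.coef_))

print(f"Stacking test MAE: {stack_mae:.2f}")
print(weights)
```

The arrays `meta_train` and `meta_test` can be reused with any other metamodel without refitting the base models.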

*(8 points)*

### Stacking ensemble with Lasso
Develop a lasso metamodel based on models *1-6* in [E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble). Tune the regularization hyperparameter of the lasso metamodel (`alpha` in scikit-learn's `Lasso`). Report the MAE of the metamodel on test data. 
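A minimal sketch of tuning a lasso metamodel, assuming scikit-learn, where the regularization strength of `Lasso` is named `alpha` and `LassoCV` selects it by cross-validation. The meta-features here are simulated as the true response plus noise of varying size, a hypothetical stand-in for the cross-validated predictions of models *1-6*; the grid of `alpha` values is also an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
y_train = rng.normal(size=200)
y_test = rng.normal(size=100)

# Simulated predictions of 6 base models: truth plus model-specific noise.
noise_sd = [0.2, 0.3, 0.5, 0.8, 1.0, 1.5]
meta_train = np.column_stack([y_train + rng.normal(0, s, 200) for s in noise_sd])
meta_test = np.column_stack([y_test + rng.normal(0, s, 100) for s in noise_sd])

# 5-fold cross-validation over a log-spaced grid of alpha values.
lasso = LassoCV(alphas=np.logspace(-4, 1, 30), cv=5).fit(meta_train, y_train)
lasso_mae = mean_absolute_error(y_test, lasso.predict(meta_test))

print(f"Selected alpha: {lasso.alpha_:.4g}")
print(f"Lasso metamodel test MAE: {lasso_mae:.3f}")
```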

*(6 points)*

### Stacking ensemble with MARS
Develop a MARS metamodel based on models *1-6* in [E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble). Take `max_degree = 1`. In general, the optimal degree of a MARS metamodel will be 1. This is because the metamodel is based on very strong predictors, and thus increasing its complexity is likely to overfit. Of course, in rare cases the optimal degree may be greater than 1. Report the MAE of the metamodel on test data.

*(4 points)*

### Stacking ensemble with Random Forest
Develop a Random forest metamodel based on models *1-6* in [E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble). Tune the `max_samples` hyperparameter of the metamodel. Report the MAE of the metamodel on test data. 

*(6 points)*

### Stacking ensemble with XGBoost
Develop an XGBoost metamodel based on models *1-6* in [E.2.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble). Tuning the metamodel is optional. Report the MAE of the metamodel on test data. 

*(5 points)*

### Ensemble of ensembles
Develop a voting ensemble of the previous 5 stacking ensembles *(i.e., the stacking ensembles in E.2.6, E.2.7, E.2.8, E.2.9, and E.2.10)*. Report the MAE of the meta-metamodel on test data.

This must be your best model with the least MAE, which must be less than $41,500. 

*(5 points)*

## Classification - Term deposit

The data for this question is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, where bank clients were called to subscribe to a term deposit. 

The train data, *train.csv*, is to be used to develop a model, and the test data, *test.csv*, is to be used to test the model. Each dataset has the following attributes about the clients called in the marketing campaign:

1. `age`: Age of the client

2. `education`: Education level of the client 

3. `day`: Day of the month the call is made

4. `month`: Month of the call 

5. `y`: Did the client subscribe to a term deposit? 

6. `duration`: Call duration, in seconds. This attribute highly affects the output target (e.g., if `duration`=0 then `y`='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call `y` is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: [Source](https://archive.ics.uci.edu/ml/datasets/bank+marketing). Do not use the raw data source for this assignment. It is just for reference.)

### Data preparation
Convert all the categorical predictors in the data to dummy variables. Note that `month` and `education` are categorical variables.

*(1 point)*

### Voting ensemble - hard voting
Develop a voting ensemble *(hard voting)* based on the models in:

1. Tuned Generalized additive model in [B.4](https://nustat.github.io/STAT303-3-class-notes/Assignment%20B.html#gam-for-classification)

2. Tuned Random Forest model in [C.2.3](https://nustat.github.io/STAT303-3-class-notes/Assignment%20C.html#random-forest-1)

3. Tuned boosting model in [D.3.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20D.html#boosting)

Report the accuracy and recall on test data for each of the individual models *(1-3)*, and the hard voting ensemble.

*(7 points - 3 points for reporting the accuracy and recall for the individual models, 2 points for taking the majority vote of predicted class, 2 points for reporting accuracy and recall on test data)*
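The majority vote can be sketched with NumPy alone. The class predictions below are made-up values for three hypothetical classifiers on eight test observations, not outputs of the assignment's models.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of observations classified correctly."""
    return float(np.mean(y_true == y_pred))

def recall(y_true, y_pred):
    """Fraction of actual positives that are predicted positive."""
    return float(np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1))

# Hypothetical true classes and predicted classes (one row per model).
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
class_preds = np.array([
    [1, 0, 0, 1, 0, 1, 1, 0],  # model 1
    [1, 0, 1, 0, 0, 0, 1, 0],  # model 2
    [0, 0, 1, 1, 1, 0, 0, 0],  # model 3
])

# Hard voting: predict 1 when at least 2 of the 3 models predict 1.
votes = class_preds.sum(axis=0)
hard_vote = (votes >= 2).astype(int)

for i, pred in enumerate(class_preds, start=1):
    print(f"model {i}: accuracy={accuracy(y_test, pred):.2f}, "
          f"recall={recall(y_test, pred):.2f}")
print(f"hard vote: accuracy={accuracy(y_test, hard_vote):.2f}, "
      f"recall={recall(y_test, hard_vote):.2f}")
```

With an odd number of binary classifiers there are no ties, so the threshold of 2 votes out of 3 is exactly the majority rule.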

### Voting ensemble - soft voting
Develop a soft voting ensemble based on models *1-3* in [E.3.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble---hard-voting). Tune the decision threshold probability of the soft-voting ensemble to achieve the highest possible accuracy for a minimum recall of 65%. Note that the test data must be untouched while tuning. Report the accuracy and recall of the soft-voting ensemble on test data.

*Note:*

*1. Use the cross-validated predicted probabilities of models 1-3 in [E.3.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble---hard-voting) to find the average cross-validated probability.*

*2. Plot the cross-validated accuracy and recall against decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot to achieve the required trade-off between recall and accuracy.*
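The two notes above can be sketched as a threshold sweep on the train data. The cross-validated probabilities here are simulated for three hypothetical models; in the assignment they would come from the tuned models, for example via `cross_val_predict(..., method="predict_proba")`. The threshold grid is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=500)

# Simulated cross-validated probabilities of class 1 from three models.
cv_probs = np.column_stack([
    np.clip(0.5 * y_train + 0.25 + rng.normal(0, 0.2, size=500), 0, 1)
    for _ in range(3)
])

# Soft voting: average the predicted probabilities across models.
avg_prob = cv_probs.mean(axis=1)

# Cross-validated accuracy and recall at each candidate threshold.
thresholds = np.arange(0.01, 1.0, 0.01)
accuracy = np.array([
    np.mean((avg_prob >= t).astype(int) == y_train) for t in thresholds
])
recall = np.array([
    np.mean(avg_prob[y_train == 1] >= t) for t in thresholds
])

# Highest accuracy among thresholds with recall of at least 65%.
feasible = recall >= 0.65
best_idx = np.flatnonzero(feasible)[np.argmax(accuracy[feasible])]
best_threshold = thresholds[best_idx]

print(f"threshold={best_threshold:.2f}, "
      f"accuracy={accuracy[best_idx]:.3f}, recall={recall[best_idx]:.3f}")
```

The arrays `thresholds`, `accuracy`, and `recall` are the data underlying the requested plot. The chosen threshold is then applied once to the averaged test probabilities.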

*(8 points - 3 points for computing the average probability, 3 points for tuning the decision threshold probability, 2 points for reporting the accuracy and recall on test data)*

### Stacking ensemble - Logistic regression

Develop and tune a stacking ensemble based on models *1-3* in [E.3.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble---hard-voting) with logistic regression as the metamodel. Tune the hyperparameter `C` and the decision threshold probability to maximize accuracy for a recall of at least 65%.

Report the accuracy and recall on test data for the ensemble.

*(8 points - 3 points for tuning `C`, 3 points for tuning the decision threshold probability, 2 points for reporting accuracy and recall on test data)*

### Stacking ensemble - Random forest

Develop and tune a stacking ensemble based on models *1-3* in [E.3.2](https://nustat.github.io/STAT303-3-class-notes/Assignment%20E.html#voting-ensemble---hard-voting) with random forest as the metamodel. Tune the hyperparameter `max_features` and the decision threshold probability to maximize accuracy for a recall of at least 65%.

Report the accuracy and recall on test data for the ensemble.

*(8 points - 3 points for tuning `max_features`, 3 points for tuning the decision threshold probability, 2 points for reporting accuracy and recall on test data)*