## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Do not write your name on the assignment.

3. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

4. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

5. The assignment is worth 100 points, and is due on **Sunday, 23rd April 2023 at 11:59 pm**. 

6. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (2 pts). *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn't seem genuine, you will lose points.* 
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no unnecessary or redundant code, and no unnecessary or redundant text (1 pt).

7. For all questions on cross-validation, you must use `sklearn` functions.

## Degrees of freedom
Find the number of degrees of freedom of the following models. Exclude the intercept when counting the degrees of freedom. You may either show your calculation, or explain briefly how you are computing the degrees of freedom.
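
As a reminder of how the count works for an ordinary regression spline, here is a minimal sketch that builds the truncated power basis for one value of the predictor and counts its columns. The function name, the knot locations, and the example values are hypothetical. A spline of degree $d$ with $K$ knots contributes $d$ polynomial columns plus one column per knot, excluding the intercept. Natural cubic splines impose extra linearity constraints beyond the boundary knots, so they do not follow this count.

```python
def truncated_power_basis(x, degree, knots):
    """Return the basis columns (intercept excluded) of a regression
    spline of the given degree, evaluated at the scalar x."""
    polynomial_terms = [x ** p for p in range(1, degree + 1)]
    knot_terms = [max(0.0, x - k) ** degree for k in knots]
    return polynomial_terms + knot_terms


# Hypothetical example: a cubic spline with 2 knots at 1.0 and 2.0.
row = truncated_power_basis(1.5, degree=3, knots=[1.0, 2.0])
n_df = len(row)  # 3 polynomial terms + 2 knot terms
print(n_df)
```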

### Quadratic spline
A model with one predictor, where the predictor is transformed into a quadratic spline with 5 knots

*(2 points)*

### Natural cubic splines
A model with one predictor, where the predictor is transformed into a natural cubic spline with 4 knots

*(2 points)*

### Generalized additive model
A model with four predictors, where the transformations of the respective predictors are (i) cubic spline transformation with 3 knots, (ii) log transformation, (iii) linear spline transformation with 2 knots, (iv) polynomial transformation of degree 4.

*(4 points)*

## Number of knots
Find the number of knots in the following spline transformations, if each of the transformations corresponds to 7 degrees of freedom (excluding the intercept).

### Cubic splines
Cubic spline transformation 

*(1 point)*

### Natural cubic splines
Natural cubic spline transformation 

*(1 point)*

### Degree 4 spline
Spline transformation of degree 4

*(1 point)*

## Regression problem

Read the file *investment_clean_data.csv*. This data is a cleaned version of the file *train.csv* in last quarter's regression [prediction problem](https://www.kaggle.com/competitions/data-science-2-linear-regression-2023-bank-loans). Refer to the link for a description of the variables. It required some effort to get an RMSE of less than 650 with linear regression. In this question, we'll use MARS / natural cubic splines to get an RMSE of less than 350 with relatively little effort. Use mean squared error as the performance metric in cross-validation.

### Data preparation

Prepare the data for modeling as follows:

1. Use the Pandas function `get_dummies()` to convert all the categorical predictors to dummy variables. 

2. Using the `sklearn` function `train_test_split`, split the data into 20% test and 80% train. Use `random_state = 45`.

*Note:*

*A. The function `get_dummies()` can be used over the entire DataFrame. Don't convert the categorical variables individually.*

*B. The MARS model does not accept categorical predictors, which is why the conversion is done.*

*C. The response is `money_made_inv`*

*(2 points)*
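
The two preparation steps can be sketched on a toy DataFrame as follows. This is only an illustration: the toy values and every column name other than `money_made_inv` are hypothetical, and the real file has many more columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real file; `loan_amnt` values and the `grade`
# column are hypothetical and only illustrate the mechanics.
toy = pd.DataFrame({
    "money_made_inv": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "loan_amnt": [1000, 2000, 1500, 3000, 2500, 1200, 1800, 2200, 2700, 900],
    "grade": ["A", "B", "A", "C", "B", "C", "A", "B", "C", "A"],
})

# One call over the whole DataFrame; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)

X = encoded.drop(columns="money_made_inv")
y = encoded["money_made_inv"]

# 80% train, 20% test, with the seed required by the assignment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45
)
print(X_train.shape, X_test.shape)
```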

### Optimal MARS degree
Use $5$-fold cross validation to find the optimal degree of the MARS model to predict `money_made_inv` based on all the predictors in the dataset.

**Hint:** Start from degree 1, and keep going until it doesn't benefit.

*(4 points)*
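
The search loop itself can be sketched as below. To keep the sketch self-contained, it uses synthetic data and a polynomial regression pipeline as a stand-in estimator, which is an assumption made here and not the MARS model; replace the stand-in with the MARS estimator of the corresponding degree. Note that `cross_val_score` returns the negative MSE for `scoring="neg_mean_squared_error"`, so the sign is flipped before comparing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, used only so that the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_mse = {}
for degree in range(1, 5):
    # Stand-in estimator (assumption): swap in the MARS model of this degree.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # flip the sign to get the MSE

best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_degree)
```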

### Fitting MARS model

With the optimal degree identified in the previous question, fit a MARS model. Print the model summary. How many degrees of freedom does the model have (excluding the intercept)?

*(1 + 1 + 2 points)*

### Interpreting MARS basis functions
Based on the model summary in the previous question, answer the following question. Holding all other predictors constant, what will be the mean increase in `money_made_inv` for a unit increase in `out_prncp_inv`, given that `out_prncp_inv` is in [500, 600], `term` = 36 (months), `loan_amnt` = 1000, and `int_rate` = 0.1?

First, write the basis functions being used to answer the question, and then substitute the values.

Also, which basis function is non-zero over the smallest domain of `out_prncp_inv`? Specify the domain over which it is non-zero.

*(3 + 2 points)*
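
As a reminder of how hinge functions determine a slope, here is a small sketch with a made-up two-term model. The coefficients, knots, and the second predictor `z` are hypothetical and are not taken from any fitted model. Within an interval where the set of active hinges does not change, the effect of a unit increase in `x` is the sum of the contributions of the active terms, with each interaction term scaled by the value of its other factor.

```python
def hinge(value):
    """MARS hinge function h(value) = max(0, value)."""
    return max(0.0, value)


# Hypothetical model: 2.0 * h(x - 100) - 0.5 * h(300 - x) * z
def toy_model(x, z):
    return 2.0 * hinge(x - 100.0) - 0.5 * hinge(300.0 - x) * z


# For x between 100 and 300 both hinges are active, so the slope in x is
# 2.0 (first term) + 0.5 * z (second term, because h(300 - x) falls as x rises).
z = 4.0
slope = toy_model(201.0, z) - toy_model(200.0, z)
print(slope)
```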

### Feature importance

Find the relative importance of each predictor in the MARS model developed in B.3.3. You may choose any criterion for finding feature importance based on the [MARS documentation](https://contrib.scikit-learn.org/py-earth/content.html#multivariate-adaptive-regression-splines). Print a DataFrame with 2 columns - one column consisting of predictors arranged in descending order of relative importance, and the second column quantifying their relative importance. Exclude predictors rejected by the model developed in B.3.3. 

*Note that the forward and backward passes of the algorithm perform feature selection without manual intervention.*

*(4 points)*

### Prediction
Using the model developed in B.3.3, compute the RMSE on test data.

*(2 points)*

### Non-trivial train data {-}
Let us call the part of the dataset where `out_prncp_inv = 0` the trivial subset of the data. For this subset, we can directly predict the response without developing a model *(recall the EDA last quarter)*. For all the questions below, fit / tune the model only on the non-trivial part of the train data. However, when making predictions and computing RMSE, consider the entire test data. Combine the predictions of the model on the non-trivial subset of test data with the direct predictions on the trivial subset of test data to make predictions on the entire test data.
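
One way to organize the combination step is sketched below. The function name and both prediction callables are hypothetical placeholders: `placeholder_rule` stands in for whatever direct rule you recall from the EDA, and `placeholder_model` stands in for the fitted model.

```python
import numpy as np


def combine_predictions(out_prncp_inv, model_predict, trivial_predict, features):
    """Use `trivial_predict` where out_prncp_inv == 0 and `model_predict` elsewhere."""
    out_prncp_inv = np.asarray(out_prncp_inv)
    features = np.asarray(features, dtype=float)
    trivial = out_prncp_inv == 0
    predictions = np.empty(len(out_prncp_inv), dtype=float)
    if trivial.any():
        predictions[trivial] = trivial_predict(features[trivial])
    if (~trivial).any():
        predictions[~trivial] = model_predict(features[~trivial])
    return predictions


# Hypothetical placeholders for the EDA rule and the fitted model.
placeholder_rule = lambda rows: np.full(len(rows), -1.0)
placeholder_model = lambda rows: rows[:, 0] * 2.0

features = np.array([[1.0], [2.0], [3.0], [4.0]])
actual = np.array([-1.0, 4.0, -1.0, 9.0])
combined = combine_predictions([0, 50, 0, 75], placeholder_model, placeholder_rule, features)

# RMSE is computed on the entire (combined) test set.
rmse = float(np.sqrt(np.mean((combined - actual) ** 2)))
print(combined, rmse)
```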

### Prediction with non-trivial train data
Find the optimal degree of the MARS model based on the non-trivial train data, fit the model, and re-compute the RMSE on test data. 

*Note: You should get a lower RMSE than what you got in B.3.6.*

*(4 points)*

### Reducing model variance

The MARS model is highly flexible, which makes it a low bias-high variance model. However, high prediction variance increases the expected mean squared error on test data *(see **equation 2.7 on page 34** of the book)*. How can you reduce the prediction variance of the model without increasing the bias? Check slide 12 of the [bias-variance presentation](https://nuwildcat-my.sharepoint.com/personal/akl0407_ads_northwestern_edu/_layouts/15/onedrive.aspx?ga=1&id=%2Fpersonal%2Fakl0407%5Fads%5Fnorthwestern%5Fedu%2FDocuments%2FSTAT303%2D3%20presentations%2FWeek1%5FBiasVariance%2Epdf&parent=%2Fpersonal%2Fakl0407%5Fads%5Fnorthwestern%5Fedu%2FDocuments%2FSTAT303%2D3%20presentations). The MARS model, in general, corresponds to case B. You can see that by averaging the predictions of multiple models, you will reduce prediction variance without increasing the bias.

Take 10 samples of the train data, each of the same size as the train data, with replacement. For each sample, fit a MARS model with the optimal degree identified earlier. Use the $i^{th}$ model, say $\hat{f}_i$, to make the prediction $\hat{f}_i(\mathbf{x}_{test})$ on each test data point $\mathbf{x}_{test}$ *(Note that predictions will be made using the model on the non-trivial test data, and without the model on the trivial test data)*. Compute the average prediction on each test data point based on the 10 models as follows:

$$\hat{f}(\mathbf{x}_{test}) = \frac{1}{10}\sum_{i=1}^{10} \hat{f}_i(\mathbf{x}_{test})$$

Consider $\hat{f}(\mathbf{x}_{test})$ as the prediction at the test data point $\mathbf{x}_{test}$. Compute the RMSE based on this model, which is the average prediction of 10 models. You should get a lower RMSE than in the previous question (B.3.7).

*Note: For ease in grading, use the Pandas DataFrame method [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) to take samples with replacement, and put `random_state` for the ith sample as i, where i goes from 0 to 9.*

*(6 points)*
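
The resampling and averaging mechanics can be sketched as follows. The toy data and column names are hypothetical, and the straight-line fit is a stand-in chosen only to keep the sketch short; it is not the MARS model. The part that carries over is the use of `sample` with `replace=True` and `random_state=i`, followed by averaging the 10 prediction vectors.

```python
import numpy as np
import pandas as pd

# Toy train / test data; the column names are hypothetical.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.uniform(0, 10, 50)})
train["y"] = 3.0 * train["x"] + rng.normal(0, 1, 50)
x_test = np.array([2.0, 5.0, 8.0])

predictions = []
for i in range(10):
    # Sample of the same size as the train data, drawn with replacement.
    sample = train.sample(n=len(train), replace=True, random_state=i)
    # Stand-in model (assumption): a straight-line fit; swap in the MARS model.
    slope, intercept = np.polyfit(sample["x"].to_numpy(), sample["y"].to_numpy(), deg=1)
    predictions.append(slope * x_test + intercept)

# Average the 10 predictions at each test point.
bagged = np.mean(predictions, axis=0)
print(bagged)
```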

### Generalized additive model (GAM)

Develop a generalized linear model $\hat{f}_{GLM}(.)$ to predict `money_made_inv` as follows:

$$\hat{f}_{GLM}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i=1}^{4} \hat{\beta}_i{f}_i(\mathbf{x}),$$

where ${f}_i(\mathbf{x})$ is a MARS model of degree $i$.

Print the estimated beta coefficients ($\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4$) of the developed model.

*Note: The model is developed on the non-trivial train data*

*(8 points)*
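
Structurally, this amounts to regressing the response on the predictions of the four component models. A minimal sketch is given below, assuming synthetic data and polynomial fits of degree 1 to 4 as stand-ins for the four MARS models; the stand-ins are an assumption made here to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, used only so that the sketch runs on its own.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1.0 + x + 0.5 * x ** 2 + rng.normal(0, 0.1, 100)

# Stand-ins (assumption) for the four MARS models: polynomial fits of
# degree 1 to 4. Column i - 1 holds the predictions of the degree-i model.
base_predictions = np.column_stack([
    np.polyval(np.polyfit(x, y, deg=i), x) for i in range(1, 5)
])

# Regress the response on the four prediction columns.
stacker = LinearRegression().fit(base_predictions, y)
betas = np.concatenate([[stacker.intercept_], stacker.coef_])
print(betas)  # beta_0, beta_1, ..., beta_4
```

The prediction columns of flexible models tend to be strongly correlated, so the individual coefficients can be unstable even when the combined prediction is accurate.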

### Prediction with GAM

Use the GAM developed in the previous question to compute RMSE on test data.

*Note: Predictions will be made using the model on the non-trivial test data, and without the model on the trivial test data*

*(5 points)*

### Reducing GAM prediction variance

As we reduced the variance of the MARS model in B.3.8, follow the same approach to reduce the variance of the GAM developed in B.3.9, and compute the RMSE on test data.

*Note: You should get a lower RMSE than what you got in B.3.10.*

*(8 points)*

### Natural cubic splines

Even though MARS is efficient and highly flexible, natural cubic splines work very well too, if tuned properly.

Consider the predictors identified in the model summary of the MARS model printed in B.3.3. For each predictor, create natural cubic spline basis functions with $d$ degrees of freedom. Include all-order interactions *(i.e., 2-factor, 3-factor, 4-factor interactions, and so on)* of all the basis functions. Use the `sklearn` function `cross_val_score()` to find and report the optimal degrees of freedom for the natural cubic spline of each predictor.

Consider degrees of freedom from 3 to 6 for the natural cubic spline transformation of each predictor.

*(8 points)*
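
The search over degrees of freedom can be organized as a grid over all predictors. The sketch below only shows the enumeration; the function name, the predictor names `a` and `b`, and the scorer are hypothetical. In your solution, the scorer would build the spline and interaction design matrix for a given setting and return the mean cross-validated MSE from `cross_val_score()`.

```python
from itertools import product


def best_df_combination(predictors, cv_mse):
    """Try every assignment of 3 to 6 degrees of freedom to the predictors
    and return the setting with the lowest score, plus the number tried."""
    scored = []
    for dfs in product(range(3, 7), repeat=len(predictors)):
        setting = dict(zip(predictors, dfs))
        scored.append((cv_mse(setting), setting))
    best_setting = min(scored, key=lambda pair: pair[0])[1]
    return best_setting, len(scored)


# Hypothetical scorer standing in for the cross-validated MSE.
toy_cv_mse = lambda setting: (setting["a"] - 4) ** 2 + (setting["b"] - 6) ** 2

best, n_tried = best_df_combination(["a", "b"], toy_cv_mse)
print(best, n_tried)
```

With $p$ predictors the grid has $4^p$ settings, so the cost grows quickly with the number of predictors.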

### Fitting the natural cubic splines model

With the optimal degrees of freedom identified in the previous question, fit a model to predict `money_made_inv`, where the basis functions correspond to the natural cubic splines of each predictor, and all-factor interactions of the basis functions. Compute the RMSE on test data.

*Note: Predictions will be made using the model on the non-trivial test data, and without the model on the trivial test data*

*(4 points)*

## GAM for classification
The data for this question relates to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were asked to subscribe to a term deposit.

There is one train dataset, *train.csv*, which you will use to develop a model. There are two test datasets, *test1.csv* and *test2.csv*, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:

1. `age`: Age of the client

2. `education`: Education level of the client 

3. `day`: Day of the month the call is made

4. `month`: Month of the call 

5. `y`: Did the client subscribe to a term deposit?

6. `duration`: Call duration, in seconds. This attribute highly affects the output target (e.g., if `duration`=0 then `y`='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call `y` is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: [Source](https://archive.ics.uci.edu/ml/datasets/bank+marketing). Do not use the raw data source for this assignment. It is just for reference.)

Develop a **generalized additive model (GAM)** to predict the probability of a client subscribing to a term deposit based on *age, education, day* and *month*. The model must have: \
(a)  **Minimum overall classification accuracy of 75%** among the classification accuracies on *train.csv*, *test1.csv* and *test2.csv*. \
(b) **Minimum recall of 55%** among the recall on *train.csv*, *test1.csv* and *test2.csv*. 

Print the accuracy and recall for all the three datasets - *train.csv*, *test1.csv* and *test2.csv*.

Note that: 

i. You cannot use `duration` as a predictor. The predictor is not useful for prediction because its value is determined after the marketing call ends. However, after the call ends, we already know whether the client responded positively or negatively. 

ii. One way to develop a model satisfying constraints (a) and (b) is to use **spline transformations for *age* and *day*, and to interact *month* with all the predictors (including the spline transformations)**.

iii. You may assume that the distribution of the predictors is the same in all the three datasets. Thus, you may create B-spline basis functions independently for the train and test datasets.

iv. Use cross-validation on train data to optimize the model hyperparameters, and the decision threshold probability. Then, use the optimal hyperparameters to fit the model on train data. Then, evaluate its accuracy and recall on all the three datasets. Note that the test datasets must only be used to evaluate performance metrics, and not optimize any hyperparameters or decision threshold probability.

*(20 points: 10 points for cross validation, 5 points for obtaining and showing the optimal values of the hyperparameters and decision threshold probability, 2 points for fitting the model with the optimal hyperparameters, and 3 points for printing the accuracy & recall on each of the three datasets)*
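
The threshold-tuning step in note iv can be sketched as below. The synthetic data and the plain logistic regression are assumptions made to keep the sketch self-contained; they stand in for *train.csv* and your GAM. The idea is to obtain out-of-fold probabilities on the train data with `cross_val_predict`, then scan candidate thresholds and keep those that meet both targets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the train data (assumption).
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.8, 0.2], random_state=0
)

# Out-of-fold probabilities: each row is scored by a model that never saw it.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]

results = []
for threshold in np.linspace(0.05, 0.95, 19):
    predicted = (proba >= threshold).astype(int)
    results.append((
        round(float(threshold), 2),
        accuracy_score(y, predicted),
        recall_score(y, predicted),
    ))

# Keep thresholds meeting both targets, then prefer the highest accuracy.
feasible = [r for r in results if r[1] >= 0.75 and r[2] >= 0.55]
best = max(feasible, key=lambda r: r[1]) if feasible else None
print(best)
```

If no threshold is feasible, `best` is `None`, which signals that the model itself needs to change rather than the threshold.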