## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Do not write your name on the assignment.

3. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

4. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

5. The assignment is worth 100 points, and is due on **Thursday, 13th April 2023 at 11:59 pm**. 

6. **Four points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 pt). *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn't seem genuine, you will lose points.* 
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

##  Bias-variance trade-off

Throughout the course, conceptual clarity about bias and variance will help you tune models for optimal performance and compare different models in terms of bias and variance. In this question, you will perform simulations to understand and visualize the bias-variance trade-off as in Fig. 2.12 of the [book](https://hastie.su.domains/ISLR2/ISLRv2_website.pdf) (page 36).

Assume that the response $y$ is a function of the predictors $x_1$ and $x_2$ and includes a random error $\epsilon$, as follows:

$$
y = f(x_1, x_2) + \epsilon, \qquad
$$  {#eq-function}

where the function $f(.)$ is the [Bukin function](https://www.sfu.ca/~ssurjano/bukin6.html), $x_1 \sim U[-15, -5]$, $x_2 \sim U[-3, 3]$, and $\epsilon \sim N(0, \sigma^2)$ with $\sigma = 10$. *Here, $U$ refers to the uniform distribution and $N$ refers to the normal distribution.* Use NumPy to simulate values from these distributions.
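
As a minimal sketch of drawing from these distributions with NumPy (the seed `0` and the sample size `n = 5` are arbitrary choices for illustration, not values required by the assignment):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only for this illustration
n = 5              # arbitrary sample size, used only for this illustration

x1 = np.random.uniform(low=-15, high=-5, size=n)  # x1 ~ U[-15, -5]
x2 = np.random.uniform(low=-3, high=3, size=n)    # x2 ~ U[-3, 3]
# `scale` is the standard deviation (sigma), not the variance (sigma^2)
eps = np.random.normal(loc=0, scale=10, size=n)   # eps ~ N(0, 10^2)
```

Note that `np.random.normal` takes the standard deviation through `scale`, so $\sigma = 10$ is passed directly while var($\epsilon$) is $\sigma^2 = 100$.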

You will code an algorithm (described below) to compute the expected squared bias, expected variance, var($\epsilon$), and expected test MSE of the following 7 linear regression models, whose predictors are:

1. $x_1$ and $x_2$

2. All the predictors in the above model, and all polynomial combinations of $x_1$ and $x_2$ of degree 2, which will be $x_1^2, x_2^2$, and $x_1x_2$

3. All the predictors in the above model, and all polynomial combinations of $x_1$ and $x_2$ of degree 3, which will be $x_1^3, x_2^3, x_1^2x_2$, and $x_1x_2^2$

4. All the predictors in the above model, and all polynomial combinations of $x_1$ and $x_2$ of degree 4

5. All the predictors in the above model, and all polynomial combinations of $x_1$ and $x_2$ of degree 5

6. All the predictors in the above model, and all polynomial combinations of $x_1$ and $x_2$ of degree 6

7. All the predictors in the above model, and all polynomial combinations of $x_1$ and $x_2$ of degree 7

As you can see, the models are arranged in increasing order of flexibility / complexity. This corresponds to the horizontal axis of Fig. 2.12 in the book.

Use the following **algorithm** to compute the expected squared bias, expected variance, var($\epsilon$) and expected test MSE of the 7 linear regression models above:

**I. Define the Bukin function** that accepts $x_1$ and $x_2$ as parameters and returns the Bukin function value ($f(x_1, x_2)$).

*(2 points)*
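
One possible way to write such a function is sketched below. It assumes the Bukin function N. 6 from the linked page, $f(x_1, x_2) = 100\sqrt{|x_2 - 0.01x_1^2|} + 0.01|x_1 + 10|$; the name `bukin` is a hypothetical choice.

```python
import numpy as np

def bukin(x1, x2):
    """Bukin function N. 6 (assumed form): 100*sqrt(|x2 - 0.01*x1^2|) + 0.01*|x1 + 10|.

    Works element-wise on scalars or NumPy arrays of equal shape.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return 100 * np.sqrt(np.abs(x2 - 0.01 * x1**2)) + 0.01 * np.abs(x1 + 10)

# Sanity check at the known global minimum (x1, x2) = (-10, 1), where f = 0
value_at_minimum = bukin(-10, 1)
```

Because the function is vectorized, it can be applied directly to the arrays of simulated $x_1$ and $x_2$ values.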

**II.** Repeat steps **III - VII** for all degrees $d$ in $\{1, 2, ..., 7\}$

*(2 points)*

**III.** Considering a model of **degree $d$**, simulate the following test and train datasets.

   A. **Simulate test data**
   
   1. Set a seed of 100. Use the code: `np.random.seed(100)`, where `np` refers to the numpy library

   2. Simulate 100 values of $x_1$ from $U[-15, -5]$.

   3. Simulate 100 values of $x_2$ from $U[-3, 3]$. 

   4. Compute the Bukin function value $f(x_1, x_2)$ for the simulated values of $x_1$ and $x_2$.

   5. Use the function `PolynomialFeatures` from the `preprocessing` module of the `sklearn` library to create all polynomial combinations of $x_1$ and $x_2$ up to degree $d$.
   
*(4 points)*

   B. **Simulate 100 train datasets**, where each train dataset is simulated as follows:

   1. Set a seed of *i* for simulating the *i*th train data. Use the code: `np.random.seed(i)`, where `np` refers to the numpy library.

   2. Simulate 100 values of $x_1$ from $U[-15, -5]$

   3. Simulate 100 values of $x_2$ from $U[-3, 3]$ 

   4. Compute the Bukin function value $f(x_1, x_2)$ for the simulated values of $x_1$ and $x_2$

   5. Simulate the response $y$ by adding simulated errors $\epsilon$ to the values of $f(x_1, x_2)$ computed above, as in @eq-function
    
   6. Use the function `PolynomialFeatures` from the `preprocessing` module of the `sklearn` library to create all polynomial combinations of $x_1$ and $x_2$ up to degree $d$.
   
*(6 points)*

**IV.** For each train data in III(B), develop a linear regression model using the `LinearRegression()` function from the `linear_model` module of the `sklearn` library.

*(2 points)*
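
The fitting step can be sketched on a tiny made-up dataset. Here the response is an exact linear function of two predictors (no noise), so the fitted coefficients can be checked by eye; all data values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data: y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

model = LinearRegression().fit(X, y)

# Predict at a new point (x1, x2) = (2, 2); the true value is 1 + 4 + 6 = 11
pred = model.predict(np.array([[2.0, 2.0]]))
```

In the algorithm, one such model is fit per train dataset, and it is convenient to store each model's predictions on the test data as one row of a 2-D array for the bias and variance computations.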

**V.** Note that the squared bias at a test point $x_{1\_test}, x_{2\_test}$ is:

$$
[Bias(\hat{f}(x_{1\_test}, x_{2\_test}))]^2 = [E(\hat{f}(x_{1\_test}, x_{2\_test})) - f(x_{1\_test}, x_{2\_test})]^2, \qquad
$$ {#eq-bias}

where $E(\hat{f}(x_{1\_test}, x_{2\_test}))$ is the mean prediction of the 100 trained models at $x_{1\_test}, x_{2\_test}$. 

Compute the overall expected squared bias as the average squared bias at all the test data points, as in the equation below:

$$
[Bias(\hat{f}(.))]^2 = \frac{1}{100}\sum_{i=1}^{100} \big[Bias(\hat{f}(x_{1i\_test}, x_{2i\_test}))\big]^2, \qquad
$$ {#eq-bias_overall}

*(8 points)*
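
The squared-bias computation can be illustrated with a tiny made-up example. The sketch assumes the predictions are stored in a 2-D array with one row per trained model and one column per test point (here 2 models and 2 test points instead of 100 and 100); the names `preds` and `f_test` are hypothetical.

```python
import numpy as np

# Made-up predictions: rows = trained models, columns = test points
preds = np.array([[1.0, 4.0],
                  [3.0, 6.0]])
# Made-up true function values f at the two test points
f_test = np.array([2.0, 4.0])

mean_pred = preds.mean(axis=0)                 # E(f_hat) at each test point: [2, 5]
sq_bias_pointwise = (mean_pred - f_test) ** 2  # squared bias at each test point: [0, 1]
overall_sq_bias = sq_bias_pointwise.mean()     # average over test points: 0.5
```

The key point is the order of operations: average the predictions over models first, then square the difference from $f$, then average over test points.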

**VI.** Note that the variance at a test point $x_{1\_test}, x_{2\_test}$ is $Var(\hat{f}(x_{1\_test}, x_{2\_test}))$. Compute the overall expected variance as the average variance at all the test data points, as in the equation below:

$$
Var(\hat{f}(.)) = \frac{1}{100}\sum_{i=1}^{100} Var(\hat{f}(x_{1i\_test}, x_{2i\_test})) \qquad
$$ {#eq-variance}

*(6 points)*
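
A matching sketch for the variance, again assuming a made-up prediction array with one row per trained model and one column per test point. Using NumPy's default `ddof=0` (division by the number of models) is an assumption; the assignment does not state which divisor to use.

```python
import numpy as np

# Made-up predictions: rows = trained models, columns = test points
preds = np.array([[1.0, 4.0],
                  [3.0, 6.0]])

# Variance across models at each test point (ddof=0 is NumPy's default)
var_pointwise = preds.var(axis=0)     # [1, 1]
overall_variance = var_pointwise.mean()  # average over test points: 1.0
```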

**VII.** Compute the overall expected test mean squared error as the sum of the expected squared bias (@eq-bias_overall), expected variance (@eq-variance), and error variance ($\sigma^2$):

$$
MSE = [Bias(\hat{f}(.))]^2 + Var(\hat{f}(.)) + \sigma^2, \qquad
$$ {#eq-mse}

*(4 points)*

**VIII.** Plot the overall expected squared bias, overall expected variance, and overall expected test MSE *(as obtained from @eq-bias_overall, @eq-variance, and @eq-mse respectively)* against the degree $d$ (or flexibility / complexity) of the model. Your plot should look like one of the plots in Fig. 2.12 of the book.

*(3 points)*

**IX.** What is the degree of the optimal model, i.e., the degree that provides the best **bias-variance trade-off**?

*(2 points)*


*Note: While coding the algorithm, comment it well so that it is easy to give partial credit in case of mistakes. Include the numerals of the algorithm (such as III(B), V, VI, etc.) in your comments so that it is easy to check your algorithm for completeness.*

## Tuning a classification model with `sklearn`

###  Data {-}
Read the data *classification_data.csv*. The description of the columns is as follows:

1. `hi_int_prncp_pd`: Indicates if a high percentage of the repayments made went to interest rather than principal. **Target variable.**

2. `out_prncp_inv`: Remaining outstanding principal for portion of total amount funded by investors

3. `loan_amnt`: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

4. `int_rate`: Interest rate on the loan

5. `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.

You will develop and tune a logistic regression model to predict `hi_int_prncp_pd` based on the rest of the columns (predictors) as per the instructions below. 

### Train-test split
Use the function `train_test_split` from the `model_selection` module of the `sklearn` library to split the data into 75% train and 25% test. Stratify the split based on the response. Use `random_state` as 45. Print the proportion of 0s and 1s in both the train and test datasets.

*(4 points)*
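
A minimal sketch of a stratified split on made-up data (100 rows with 40% ones; the arrays `X` and `y` are hypothetical stand-ins for the predictors and response):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 predictors, 40% of responses equal to 1
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=45
)

# With a 0/1 response, the mean is the proportion of 1s
train_prop_ones = y_train.mean()
test_prop_ones = y_test.mean()
print(train_prop_ones, test_prop_ones)
```

Stratifying on the response keeps the class proportions in the train and test sets close to those in the full data.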

### Scaling predictors
Scale the predictors to avoid convergence errors when fitting the logistic regression model.

*Note that last quarter, we were focusing on inference (along with prediction), so we avoided scaling. It is a bit inconvenient to interpret odds with scaled predictors. However, avoiding scaling may lead to convergence errors as some of you saw in your course projects. So, it is a good practice to scale, especially when your focus is prediction.*

*(3 points)*
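
One common way to scale is standardization with `StandardScaler`; this choice is an assumption, as the assignment does not prescribe a particular scaler. The sketch below uses made-up arrays and learns the scaling parameters from the train data only, then reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up train and test predictors on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train data
X_test_scaled = scaler.transform(X_test)        # apply the train mean and std to test data
```

Fitting the scaler on the train data alone avoids leaking information from the test data into the model.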

### Tuning the degree
Use the functions:

1. `cross_val_score` from the `model_selection` module of the `sklearn` library to tune the degree of the logistic regression model for maximizing the stratified 5-fold prediction accuracy. Consider degrees from 1 to 6. 

2. `PolynomialFeatures` from the `preprocessing` module of the `sklearn` library to create all polynomial combinations of the predictors up to degree $d$.

What is the optimal degree?

*(4 points)*

**Notes:** 

- A model of degree $d$ will consist of polynomial transformations and interactions of predictors up to degree $d$. For example, a model of degree 2 will consist of the predictors themselves, the square of each predictor, and all 2-factor interactions of the predictors.

- You may use the `newton-cg` solver to avoid convergence issues.

- Use the default `C` value at this point; you will tune it later.
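
The tuning loop can be sketched on synthetic data. Everything about the data here is made up (`make_classification`), only degrees 1 to 3 are tried to keep the example fast, and wrapping the steps in a pipeline is one possible design rather than a required one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the assignment data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
mean_acc = {}
for d in range(1, 4):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=d, include_bias=False),
        LogisticRegression(solver="newton-cg", max_iter=1000),
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_acc[d] = scores.mean()  # mean stratified 5-fold accuracy for degree d

best_degree = max(mean_acc, key=mean_acc.get)
print(mean_acc, best_degree)
```

Placing the scaler and `PolynomialFeatures` inside the pipeline means they are refit on each training fold, so no information from the validation fold is used during fitting.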

### Test accuracy with optimal degree
For the optimal degree identified in the previous question, compute the test accuracy.

*(4 points)*

### Tuning `C`
With the optimal degree identified in the previous question, find the optimal regularization parameter `C`. Again use the `cross_val_score` function.

*(3 points)*

### Test accuracy with optimal degree and `C`
For the optimal degree and optimal `C` identified in the previous questions, compute the test accuracy.

*(3 points)*

### Tuning decision threshold probability
With the optimal degree and optimal `C` identified in the previous questions, find the optimal decision threshold probability to maximize accuracy. Use the `cross_val_predict` function.

*(4 points)*
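
A sketch of threshold tuning with `cross_val_predict`, using synthetic data and a plain logistic regression as stand-ins for the tuned model; the threshold grid from 0.05 to 0.95 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the assignment data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

model = LogisticRegression(max_iter=1000)

# Out-of-fold predicted probability of class 1 for every observation
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)  # arbitrary grid of candidate thresholds
accs = np.array([accuracy_score(y, (proba >= t).astype(int)) for t in thresholds])
best_threshold = thresholds[accs.argmax()]
print(best_threshold, accs.max())
```

Since each probability comes from a model that did not see that observation during fitting, the accuracy at each threshold is a cross-validated estimate.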

### Test accuracy for optimal degree, `C`, and threshold probability
For the optimal degree, optimal `C`, and optimal decision threshold probability identified in the previous questions, compute the test accuracy.

*(4 points)*

### Simultaneous optimization of multiple parameters
In the above tuning approach, we optimized the hyperparameters and the decision threshold probability sequentially. This is a greedy approach, which doesn't consider all combinations of hyperparameters and decision threshold probabilities, and thus may fail to find the combination of values that maximizes accuracy. Now tune the model hyperparameters (degree and `C`) and the decision threshold probability simultaneously, considering all value combinations. This will take more time, but is more likely to find the optimal combination of parameter values.

*(6 points)*
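
A simultaneous search can be sketched as nested loops over every combination. The data are synthetic, and the small grids for degree, `C`, and threshold are arbitrary choices made to keep the example fast.

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the assignment data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

degrees = [1, 2]                          # arbitrary small grid
C_values = [0.1, 1.0, 10.0]               # arbitrary small grid
thresholds = np.linspace(0.1, 0.9, 9)     # arbitrary small grid

results = []  # one (accuracy, degree, C, threshold) tuple per combination
for d, C in itertools.product(degrees, C_values):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=d, include_bias=False),
        LogisticRegression(C=C, solver="newton-cg", max_iter=1000),
    )
    # One set of out-of-fold probabilities serves all thresholds for this (d, C)
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for t in thresholds:
        acc = accuracy_score(y, (proba >= t).astype(int))
        results.append((acc, d, C, t))

best_acc, best_d, best_C, best_t = max(results, key=lambda r: r[0])
print(best_acc, best_d, best_C, best_t)
```

The threshold loop sits inside the loop over (degree, `C`) because changing the threshold does not require refitting the model.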

### Test accuracy with optimal parameters obtained simultaneously
For the optimal degree, optimal `C`, and optimal decision threshold probability identified in the previous question, compute the test accuracy.

*(4 points)*

### Optimizing parameters for multiple performance metrics
Find the optimal `C` and degree to maximize recall while having a precision of more than 75%. Use the function `cross_validate` from the `model_selection` module of the `sklearn` library.

*Note: The `cross_validate` function is very similar to `cross_val_score`; the only difference is that you can use multiple metrics with the `scoring` input, as you need in this question.*

*(8 points)*
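
A sketch of a constrained search with `cross_validate`, using synthetic data and tuning only `C` for brevity (the degree could be added as an outer loop). The grid of `C` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the assignment data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

candidates = []  # one (C, mean precision, mean recall) tuple per value of C
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    res = cross_validate(model, X, y, cv=5, scoring=["precision", "recall"])
    candidates.append((C, res["test_precision"].mean(), res["test_recall"].mean()))

# Keep only candidates meeting the precision constraint, then maximize recall
feasible = [c for c in candidates if c[1] > 0.75]
best = max(feasible, key=lambda c: c[2]) if feasible else None
print(best)
```

When a list of metric names is passed to `scoring`, the returned dictionary contains one `test_<metric>` array per metric, with one entry per fold.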

### Performance metrics computation
For the optimal degree and `C` identified in the previous question, compute the following performance metrics on test data. Use `sklearn` functions; manual computation is not allowed.

1. Precision

2. Recall

3. Accuracy

4. ROC-AUC

5. Show the confusion matrix

*(10 points)*
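
The relevant `sklearn.metrics` functions can be sketched on a tiny made-up example with six observations; the labels, the predicted probabilities, and the 0.5 threshold are all hypothetical.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

# Made-up true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # hypothetical 0.5 threshold

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_prob)  # uses probabilities, not class labels
cm = confusion_matrix(y_true, y_pred)    # rows = actual class, columns = predicted class

print(precision, recall, accuracy, roc_auc)
print(cm)
```

Note that ROC-AUC is computed from the predicted probabilities, whereas precision, recall, accuracy, and the confusion matrix are computed from the predicted class labels.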