## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Write your code in the **Code cells** and your answers in the **Markdown cells** of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to render the **.ipynb** file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The assignment is worth 100 points, and is due on **Monday, 18th April 2025 at 11:59 pm**. 

5. **Five points are for properly formatting the assignment**. The breakdown is as follows:
    - Must be an HTML file rendered using Quarto **(1 point)**. *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.* 
    - No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission.  **(1 point)**
    - There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) **(1 point)**
    - Final answers to each question are written in the Markdown cells. **(1 point)**
    - There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. **(1 point)**

6. The maximum possible score in the assignment is 103+5 = 108 out of 100.

## 1) Optimizing KNN for Classification (71 points)

In this question, you will use **classification_data.csv**. Each row is a loan, and each column represents some financial information as follows:

- `hi_int_prncp_pd`: Indicates if a high percentage of the repayments went to interest rather than principal. **This is the classification response.**

- `out_prncp_inv`: Remaining outstanding principal for portion of total amount funded by investors

- `loan_amnt`: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

- `int_rate`: Interest Rate on the loan

- `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.

As indicated above, `hi_int_prncp_pd` is the response and all the remaining columns are predictors. You will tune and train a K-Nearest Neighbors (KNN) classifier throughout this question.

### a) Load the Dataset **(1 point)**

Read the dataset into your notebook. 

### b) Define Predictor and Response Variables **(1 point)**

Create the **predictor** (features) and **response** (target) variables from the dataset.



### c) Split the Data into Training and Test Sets **(1 point)**

Create the training and test datasets using the following specifications:

- Use a **75%-25% split**.
- Ensure that the **class ratio is preserved** in both training and test sets (i.e., stratify the split).
- Set `random_state=45` for reproducibility.

### d) Check Class Ratios **(2 points)**

Print the **class distribution** (ratios) for:

- The entire dataset  
- The training set  
- The test set  

This is to verify that the **class ratio is preserved** after splitting.
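
As a rough illustration of what a stratified split and a ratio check can look like, here is a minimal sketch on synthetic data. The DataFrame `demo` and its values are hypothetical stand-ins, not the assignment data; only `train_test_split` and `value_counts` are real library calls.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (hypothetical values, roughly 30% positives)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "int_rate": rng.normal(12, 4, 400),
    "loan_amnt": rng.uniform(1_000, 35_000, 400),
    "hi_int_prncp_pd": (rng.random(400) < 0.3).astype(int),
})
X_demo = demo.drop(columns="hi_int_prncp_pd")
y_demo = demo["hi_int_prncp_pd"]

# stratify=y_demo keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=45
)

# One column of class proportions per subset, for side-by-side comparison
ratios = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(ratios.round(3))
```

The three columns should agree up to rounding; a visible gap would suggest the split was not stratified.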


### e) Scale the Dataset **(2 points)**

Use `StandardScaler` to scale the dataset in order to prepare it for KNN modeling. 

Scaling ensures that all features contribute equally to the distance calculations used by the KNN algorithm. 
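
The sketch below shows one common convention, assumed here rather than required by the text above: the scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows. The arrays are synthetic placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (hypothetical values)
rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=[10, 20_000], scale=[3, 8_000], size=(300, 2))
X_test_demo = rng.normal(loc=[10, 20_000], scale=[3, 8_000], size=(100, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

After scaling, both training columns have mean close to 0 and standard deviation close to 1, so neither dominates a Euclidean distance.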

### f) Set Up Cross-Validation **(2 points)**

Before creating and tuning your model, you need to define cross-validation settings to ensure consistent and accurate evaluation across folds.

Please follow these specifications:

- Use **5 stratified folds** to preserve class distributions in each split.
- **Shuffle** the data before splitting to introduce randomness.
- Set `random_state=14` for reproducibility.

**Note:** You must use these exact cross-validation settings throughout the rest of this question to maintain consistency.
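
For orientation, a cross-validation object with these settings might be built as in the sketch below. The label vector `y_demo` is a made-up example used only to show that each validation fold keeps the same positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 70 negatives and 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((100, 1))  # feature values do not affect how folds are formed

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

# Share of positives in each validation fold
fold_pos_rates = [y_demo[val_idx].mean() for _, val_idx in cv.split(X_demo, y_demo)]
print(fold_pos_rates)
```

Because `random_state` is fixed, calling `cv.split` again yields the same folds, which is what makes results comparable across parts.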

### g) Tune K for KNN Using Cross-Validation **(12 points)**

Tune a **KNN Classifier** using cross-validation with the following specifications:

- Use **every odd K value from 1 to 50** (inclusive).
- Keep all other model settings at their defaults.
- Use the **cross-validation settings** defined in part (f).
- Evaluate performance using the **F1 score** as the metric.

**(4 points)**

Then, complete the following tasks:

- Create a **plot of K values vs. cross-validation F1 scores** to visualize how K balances overfitting and underfitting. **(2 points)**
- Print the **best average cross-validation F1 score**. **(1 point)**
- Report the **K value corresponding to the best F1 score**. **(1 point)**
- Determine whether this is the **only K value** that results in the best F1 score. Use code to justify your answer. **(2 points)**
- Reflect on whether **accuracy** is a good metric for tuning the model in this case. Explain your reasoning. **(2 points)**

**💡 Hint:**  

In addition to reporting the best `K` and best F1 score, you may also want to examine the full cross-validation results to check if other K values achieved the same F1 score.
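
One possible way to examine the full results for ties is sketched below on synthetic data from `make_classification`. The variable names (`mean_f1`, `tied_k`) are illustrative, and this is only one of several reasonable approaches.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, mildly imbalanced data standing in for the real file
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

k_values = list(range(1, 50, 2))  # odd values 1, 3, ..., 49

# Mean cross-validation F1 for every candidate K
mean_f1 = pd.Series({
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=cv, scoring="f1"
    ).mean()
    for k in k_values
})

best_f1 = mean_f1.max()
# np.isclose guards against ties hidden by floating-point noise
tied_k = mean_f1.index[np.isclose(mean_f1, best_f1)].tolist()
print(best_f1, tied_k)
```

If `tied_k` holds more than one value, several K values share the best score and `idxmax` alone would report only the first.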

### h) Optimize the Classification Threshold **(4 points)**

Using the **optimal K value** you identified in part (g), optimize the classification **threshold** to maximize the cross-validation F1 score.

#### Specifications:
- Search across all possible threshold values using a **step size of 0.05**.
- Use the **cross-validation settings** defined in part (f).
- Evaluate performance using the **F1 score**, consistent with part (g).

#### Tasks:
- Visualize the **F1 score vs. different threshold values**. **(2 points)**
- Identify and report the **best threshold** that yields the highest F1 score. **(1 point)**
- Output the **best cross-validation F1 score**. **(1 point)**
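
A threshold search of this kind can be sketched with out-of-fold predicted probabilities, as below. The data are synthetic, `n_neighbors=15` is a placeholder rather than a tuned value, and classifying as positive when the probability is strictly greater than the threshold is an assumed convention.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

# Out-of-fold probability of the positive class for every row
proba = cross_val_predict(
    KNeighborsClassifier(n_neighbors=15),  # placeholder K
    X_demo, y_demo, cv=cv, method="predict_proba",
)[:, 1]

thresholds = np.round(np.arange(0.0, 1.0001, 0.05), 2)  # 0.00, 0.05, ..., 1.00
val_folds = [val for _, val in cv.split(X_demo, y_demo)]  # same folds as above

# F1 per fold for each threshold, then averaged across folds
mean_f1 = np.array([
    np.mean([
        f1_score(y_demo[val], (proba[val] > t).astype(int), zero_division=0)
        for val in val_folds
    ])
    for t in thresholds
])
best_threshold = thresholds[mean_f1.argmax()]
print(best_threshold, mean_f1.max())
```

Averaging fold-level F1 scores mirrors what `cross_val_score` reports; pooling all out-of-fold predictions into a single F1 is an alternative that can give slightly different numbers.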



### i) Evaluate the Tuning Method **(2 points)**

Is the method we used in parts (g) and (h) **guaranteed** to find the best combination of **K** and **threshold**, i.e., to tune the classifier to its optimal values?  
**(1 point)**

Justify your answer.  
**(1 point)**

### j) Evaluate Tuned Classifier on Test Set **(3 points)**

Using the **tuned KNN classifier** and the **optimal threshold** you identified, evaluate the model on the **test set**. Report the following metrics:

- F1 Score  
- Accuracy  
- Precision  
- Recall  
- AUC  
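
The following sketch shows how these five metrics might be computed once a K and a threshold are fixed. It uses synthetic data, and the values `n_neighbors=15` and `threshold=0.4` are placeholders, not results. Note that AUC is computed from the predicted probabilities rather than from the thresholded labels.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=45
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

threshold = 0.4  # placeholder threshold
knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr_s, y_tr)  # placeholder K

proba_te = knn.predict_proba(X_te_s)[:, 1]       # probability of class 1
pred_te = (proba_te > threshold).astype(int)     # apply the custom threshold

metrics = {
    "f1": f1_score(y_te, pred_te, zero_division=0),
    "accuracy": accuracy_score(y_te, pred_te),
    "precision": precision_score(y_te, pred_te, zero_division=0),
    "recall": recall_score(y_te, pred_te, zero_division=0),
    "auc": roc_auc_score(y_te, proba_te),  # threshold-free, uses probabilities
}
print(metrics)
```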



### k) Jointly Tune K and Threshold **(6 points)**

Now, tune **K** and the **classification threshold simultaneously**, rather than sequentially.

- Use the same settings from the previous parts (i.e., odd K values from 1 to 50, threshold step size of 0.05, F1 score as the metric, and the same cross-validation strategy).
- Identify the **best F1 score**, along with the **K value and threshold** that produce it.
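
A joint search can be organized as a loop over every (K, threshold) pair, as in the rough sketch below. It runs on synthetic data, stores results in a long-format DataFrame with hypothetical column names `K`, `threshold`, and `f1`, and assumes the "strictly greater than" threshold convention.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
val_folds = [val for _, val in cv.split(X_demo, y_demo)]
thresholds = np.round(np.arange(0.0, 1.0001, 0.05), 2)

rows = []
for k in range(1, 50, 2):
    # One set of out-of-fold probabilities per K, reused for all thresholds
    proba = cross_val_predict(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
        cv=cv, method="predict_proba",
    )[:, 1]
    for t in thresholds:
        fold_scores = [
            f1_score(y_demo[val], (proba[val] > t).astype(int), zero_division=0)
            for val in val_folds
        ]
        rows.append({"K": k, "threshold": t, "f1": np.mean(fold_scores)})

results = pd.DataFrame(rows)
best = results.loc[results["f1"].idxmax()]  # row with the highest mean F1
print(best)
```

Computing probabilities once per K and reusing them across thresholds avoids refitting the model for every pair.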



### l) Visualize Cross-Validation Results with a Heatmap **(3 points)**

Create a **heatmap** to visualize the cross-validation results in two dimensions.

- The **x-axis** should represent the **K values**.
- The **y-axis** should represent the **threshold values**.
- The color should represent the **F1 score**.

**Note:** This question only requires **one line of code**. You’ll need to recall a **data visualization function** and a **data reshaping method** from 303-1.
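
As a reminder of the reshaping step, the sketch below pivots a tiny long-format table into a threshold-by-K grid. The numbers are invented purely for illustration, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical long-format CV results (values are made up for illustration)
results = pd.DataFrame({
    "K": [1, 1, 3, 3],
    "threshold": [0.3, 0.5, 0.3, 0.5],
    "f1": [0.61, 0.58, 0.66, 0.63],
})

# Rows become thresholds, columns become K values, cells hold the F1 score
grid = results.pivot(index="threshold", columns="K", values="f1")
print(grid)
```

A wide table like `grid` is the shape that heatmap functions such as seaborn's `heatmap` expect.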

### m) Compare Joint vs. Sequential Tuning Results **(4 points)**

- How does the **best cross-validation F1 score** from part (k) compare to the scores from parts (g) and (h)? **(1 point)**
- Did the **optimal K value** and **threshold** change when tuning them jointly? **(1 point)**
- Explain **why or why not**. Consider how tuning the two parameters together might impact the result. **(2 points)**


### n) Evaluate Final Tuned Model on Test Set **(3 points)**

Using the **tuned classifier and threshold** from part (k), evaluate the model on the **test set**. Report the following metrics:

- F1 Score  
- Accuracy  
- Precision  
- Recall  
- AUC  



### o) Compare Tuning Strategies and Computational Cost **(3 points)**

Compare the tuning approach used in parts **(g) & (h)** (separate tuning of K and threshold) with the approach in **part (k)** (joint tuning of K and threshold) in terms of **computational cost**.

- How many **K and threshold combinations** did you evaluate in each approach? **(2 points)**
- Based on this comparison and your answer from part (m), explain the **main trade-off** involved in model tuning (e.g., between computation and performance). **(2 points)**

### p) Tune K Using Multiple Metrics **(5 points)**

`cross_val_score` evaluates only a **single metric** per call, and `GridSearchCV` selects its best model using a **single metric**. In this part, you’ll practice tuning hyperparameters while evaluating **multiple metrics** simultaneously.

Cross-validate a **KNN classifier** using the following specifications:

- Use **every odd K value from 1 to 50** (inclusive), and keep all other hyperparameters at their default settings.
- Apply the **cross-validation settings** from part (f).
- Evaluate the model using **accuracy**, **precision**, and **recall** as metrics **at the same time**.

Save the cross-validation results into a **DataFrame**, compute the **average score for each metric**, and visualize how these metrics change with different values of K.
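
One way to score several metrics in a single pass is scikit-learn's `cross_validate`, sketched below on synthetic data. The resulting DataFrame `cv_results` and its layout are illustrative choices, not a required format.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
metric_names = ["accuracy", "precision", "recall"]

rows = []
for k in range(1, 50, 2):
    out = cross_validate(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
        cv=cv, scoring=metric_names,
    )
    # cross_validate returns one array per metric under the key "test_<name>"
    row = {"K": k}
    row.update({m: out[f"test_{m}"].mean() for m in metric_names})
    rows.append(row)

cv_results = pd.DataFrame(rows).set_index("K")
print(cv_results.head())
```

With K as the index, `cv_results.plot()` would draw one line per metric, although any plotting approach works.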




### q) Optimize for Precision with Recall Constraint **(4 points)**

Identify the **K value** that yields the **highest precision**, while maintaining a **recall of at least 75%**.  
**(3 points)**

Then, print the **average cross-validation metrics** (accuracy, precision, recall) for that K value.  
**(1 point)**
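
The constrained selection amounts to filtering first and maximizing second. The sketch below demonstrates the pattern on a small table whose numbers are invented for illustration only.

```python
import pandas as pd

# Hypothetical average CV metrics per K (made-up numbers)
cv_results = pd.DataFrame(
    {
        "accuracy": [0.80, 0.83, 0.85, 0.84],
        "precision": [0.70, 0.76, 0.81, 0.78],
        "recall": [0.82, 0.79, 0.74, 0.76],
    },
    index=pd.Index([1, 3, 5, 7], name="K"),
)

# Keep only the K values that satisfy the recall constraint
eligible = cv_results[cv_results["recall"] >= 0.75]

# Among those, pick the K with the highest precision
best_k = eligible["precision"].idxmax()
print(best_k)
print(cv_results.loc[best_k])
```

In this toy table K=5 has the highest precision overall but fails the recall constraint, so the selection falls to K=7.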


### r) Tune Threshold for Maximum Precision **(3 points)**

Using the **optimal K value** identified in part (q), find the **threshold** that maximizes **cross-validation precision**, following the specifications below:

- Evaluate all possible threshold values with a **step size of 0.05**.
- Use the **cross-validation settings** from part (f).

Then:
- Print the **best cross-validation precision**. 
- Report the **threshold value** that achieves this precision. 

**Note:** This task is very similar to part (h), but it’s important for the next part.

### s) Evaluate Precision-Optimized Model on Test Set **(2 points)**

Using the **tuned classifier and threshold** from parts (q) and (r), evaluate the model on the **test set**. Report the following metrics:

- Test Accuracy  
- Test Precision  
- Test Recall  
- Test AUC  

### t) Final Reflection: Comparing Tuning Strategies **(3 points)**

You have now tuned your KNN classifier using **three different strategies**:

1. **Sequential tuning** of K and threshold based on **F1 score** (parts g–h)
2. **Joint tuning** of K and threshold using **F1 score** (part k)
3. Tuning based on **multiple metrics**, selecting the K with the **highest precision** while maintaining **recall ≥ 75%** (parts p–r)

Reflect on the following:

- Which tuning strategy led to the **best overall performance on the test set**, based on the metrics you care about most?
- Which strategy would you choose in a real-world application, and why?
- What are the **trade-offs** between tuning for F1 score versus prioritizing precision or recall individually?



**Note:** This is an open-ended question. As long as your reasoning makes sense, you will receive full credit.

## 2) Tuning a KNN Regressor on Bank Loan Data (32 points)

In this question, you will use **bank_loan_train_data.csv** to tune the model hyperparameters and train the model. Each row is a loan, and each column represents some financial information as follows:

- `money_made_inv`: Indicates the amount of money made by the bank on the loan. **This is the regression response.**

- `out_prncp_inv`: Remaining outstanding principal for portion of total amount funded by investors

- `loan_amnt`: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

- `int_rate`: Interest Rate on the loan

- `term`: The number of payments on the loan. Values are in months and can be either 36 or 60

- `mort_acc`: The number of mortgage accounts

- `application_type_Individual`: 1 if the loan is an individual application, 0 if it is a joint application with two co-borrowers

- `tot_cur_bal`: Total current balance of all accounts

- `pub_rec`: Number of derogatory public records

As indicated above, `money_made_inv` is the response and all the remaining columns are predictors. You will tune and train a K-Nearest Neighbors (KNN) regressor throughout this question.

### a) Split, Scale, and Tune a KNN Regressor **(15 points)**

Create the **training and test datasets** using the following specifications:

- Use an **80%-20% split**.
- Set `random_state=1` for reproducibility.

Then, **scale your data**, as KNN is sensitive to the scale of input features.

Next, you will **tune a KNN Regressor** by searching for the optimal hyperparameters using three search approaches: **Grid Search**, **Random Search**, and **Bayesian Search**.

#### Cross-Validation Setting

You should use **5-fold cross-validation**, with the following specifications:

- The data should be **shuffled** before splitting  
- Use `random_state=1` to ensure **reproducibility**

#### Hyperparameters to Tune:

You will tune the following hyperparameters for the KNN Regressor:

1. `n_neighbors`: Number of nearest neighbors 
2. `p`: Power parameter for the Minkowski distance  
3. `weights`: Weight function used in prediction  
 
You must consider the following **5 types of weights**:

- `'uniform'`: All neighbors are weighted equally  
- `'distance'`: Weight is inversely proportional to distance  
- Custom weight functions:
  - $\propto \frac{1}{\text{distance}^2}$
  - $\propto \frac{1}{\text{distance}^3}$
  - $\propto \frac{1}{\text{distance}^4}$

For **each search method** (Grid Search, Random Search, Bayesian Search), report the following:

- `best_params_`: The best combination of hyperparameters  
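- `best_score_`: Cross-validated RMSE on the training set (if you use a negated scorer such as `neg_root_mean_squared_error`, flip the sign before reporting)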
- **Test RMSE** obtained from the best model  
- **Execution time** for the search process 

**Hint:**

Define **three custom weight functions** as shown below:

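```python
# Each function receives an array of neighbor distances and
# returns an array of weights with the same shape.
def dist_power_2(distance):
    return 1 / (1e-10 + distance**2)  # weight proportional to 1/d^2

def dist_power_3(distance):
    return 1 / (1e-10 + distance**3)  # weight proportional to 1/d^3

def dist_power_4(distance):
    return 1 / (1e-10 + distance**4)  # weight proportional to 1/d^4
```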

Note that the small constant `1e-10` avoids division by zero (and the resulting numerical instability) when a neighbor lies at distance 0 from the query point.
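
To show how a callable weight function plugs into a search, here is a minimal sketch covering Grid Search and Random Search on synthetic regression data. The search space is deliberately tiny and its ranges are placeholders, not recommended values. Bayesian Search is left out because it requires an additional library.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (
    GridSearchCV, KFold, RandomizedSearchCV, train_test_split,
)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler


def dist_power_2(distance):
    return 1 / (1e-10 + distance**2)


# Synthetic regression data standing in for the bank loan file
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
# Deliberately tiny demo space; callables can sit next to the string options
space = {
    "n_neighbors": [3, 5, 9],
    "p": [1, 2],
    "weights": ["uniform", "distance", dist_power_2],
}
common = dict(cv=cv, scoring="neg_root_mean_squared_error")

searches = {
    "grid": GridSearchCV(KNeighborsRegressor(), space, **common),
    "random": RandomizedSearchCV(
        KNeighborsRegressor(), space, n_iter=6, random_state=1, **common
    ),
}

report = {}
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X_tr_s, y_tr)
    elapsed = time.perf_counter() - start
    test_rmse = np.sqrt(mean_squared_error(y_te, search.predict(X_te_s)))
    report[name] = {
        "best_params": search.best_params_,
        "cv_rmse": -search.best_score_,  # scorer is negated RMSE, so flip the sign
        "test_rmse": test_rmse,
        "seconds": elapsed,
    }

for name, summary in report.items():
    print(name, summary)
```

Random Search here tries only 6 of the 18 combinations, so it can be faster but cannot beat the exhaustive grid on cross-validation score over the same space.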

### b) Compare Tuning Approaches **(1 point)**

Compare the results from part (2a) in terms of **execution time** and **model performance**.  
Briefly discuss the **main trade-offs** among the three hyperparameter tuning approaches: Grid Search, Random Search, and Bayesian Search.



### c) Feature Selection and Hyperparameter Tuning with GridSearchCV **(15 points)**

KNN performance can **deteriorate significantly** if irrelevant or noisy predictors are included. In this part, you will explore **feature selection** to improve model performance, followed by **hyperparameter tuning** using `GridSearchCV` (with `refit=True`).

Try the following **four different feature selection approaches**:

1. **Correlation-based filtering**:  
   - Select features with an absolute correlation of at least **0.1** with the target variable.

2. **Lasso regression for feature selection**:  
   - Use `Lasso(alpha=50)` to select important features based on non-zero coefficients.

3. **SelectKBest**:  
   - Use `SelectKBest` with `f_regression`, selecting the **top 4** features.

4. **Variance threshold**:  
   - Use `VarianceThreshold(threshold=0.1)` to select features with sufficient variability.


For **each approach**, perform hyperparameter tuning using **GridSearchCV**, and report:

- The **best score** (cross-validated RMSE) on the **training set**
- The **test RMSE** from the best model
- The **best hyperparameters**
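
The sketch below shows how each of the four selectors can return a list of retained column names, using synthetic data with hypothetical column names `x0` to `x7`. One thing worth checking in your own workflow: after standardization every feature has variance 1, so a variance threshold is only informative on unscaled features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import Lasso

# Synthetic data: 8 features, only 3 of which drive the target
X_arr, y_arr = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(8)])
y_demo = pd.Series(y_arr, name="target")

# 1. Correlation filter: keep features with |corr| >= 0.1 against the target
corr = X_demo.corrwith(y_demo).abs()
corr_features = corr[corr >= 0.1].index.tolist()

# 2. Lasso: keep features whose coefficient is not shrunk to exactly zero
lasso = Lasso(alpha=50).fit(X_demo, y_demo)
lasso_features = X_demo.columns[lasso.coef_ != 0].tolist()

# 3. SelectKBest: keep the 4 features with the largest F-statistics
kbest = SelectKBest(f_regression, k=4).fit(X_demo, y_demo)
kbest_features = X_demo.columns[kbest.get_support()].tolist()

# 4. Variance threshold: drop near-constant features
vt = VarianceThreshold(threshold=0.1).fit(X_demo)
vt_features = X_demo.columns[vt.get_support()].tolist()

for name, cols in [("corr", corr_features), ("lasso", lasso_features),
                   ("kbest", kbest_features), ("variance", vt_features)]:
    print(name, cols)
```

Each list can then be used to subset the training and test predictors before running the same `GridSearchCV` setup as in part (a).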


### d) Compare Feature Selection Approaches **(1 point)**

Create a **DataFrame** that summarizes the model performance from each feature selection method, including:

- **Training RMSE**
- **Test RMSE**

Be sure to also include the results from the model trained **without any feature selection** for comparison.

Then, briefly explain what you learned from this experiment.  
For example: Did feature selection improve performance? Which method worked best?
