In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 13 

<div class="alert-warning">
    
## Instructions  
rubric={points}

You will earn points for following these instructions and successfully submitting your work on Gradescope.  

### Before you start  

- Please **read the [Use of Generative AI policy](https://ubc-cs.github.io/cpsc330-2025W1/syllabus.html#use-of-generative-ai-in-the-course) carefully** before starting the homework assignment. 
  
- Review the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html) for detailed guidance on completing and submitting assignments. 

### Group work instructions

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2.
  
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.   

### Before submitting  

- **Run all cells** (▶▶ button) to ensure the notebook executes cleanly from top to bottom.

  - Execution counts must start at **1** and be sequential.
    
  - Notebooks with missing outputs or errors may lose marks.

- **Do not upload or push data files** used in this lab to GitHub or Gradescope. (A `.gitignore` is provided to prevent this.)  


### Submitting on Gradescope  

- Upload **only** your `.ipynb` file (with outputs shown) and any required output files. Do **not** submit extra files.
  
- If needed, refer to the [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/).  
- If your notebook is too large to render, also upload a **Web PDF** or **HTML** version.  
  - You can create one using **File $\rightarrow$ Save and Export Notebook As**.  
  - If you get an error when creating a PDF, try running the following commands in your lab directory:  

    ```bash
    conda install -c conda-forge nbconvert-playwright
    jupyter nbconvert --to webpdf hw5.ipynb
    ```  

  - Ensure all outputs are visible in your PDF or HTML file; TAs cannot grade your work if outputs are missing.

</div>


_Note: Unlike previous assignments, this one is open-ended and project-style. Treat it as an opportunity to explore, experiment, and learn._

<!-- BEGIN QUESTION -->

## Imports

<div class="alert alert-warning">
    
Imports
    
</div>

_Points:_ 0

In [65]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV, 
    cross_val_score,
    cross_validate,
    train_test_split,
)

<!-- END QUESTION -->

## Introduction

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (roughly 10-14 hours) is a good guideline for this project. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem
<hr>
rubric={points:3}

In this mini-project, you can choose which dataset to work on. The tasks you will need to carry out will be similar regardless of your choice.

### Option 1
You can choose to work on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 


### Option 2
You can choose to work on a regression problem using a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset, then you will try to predict `reviews_per_month`, as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

> Note that an updated version of this dataset with more features is available [here](http://insideairbnb.com/). The features we are using are in `listings.csv.gz` for the New York City datasets. You will also see some other files like `reviews.csv.gz`. For your own interest you may want to explore the expanded dataset and try your analysis there. However, please submit your results on the dataset obtained from Kaggle.


<div class="alert alert-info">
    
**Your tasks:**

1. Review the available datasets and choose the one you find most interesting. It may help to read through the dataset documentation on Kaggle before deciding.
2. Once you've selected a dataset, take time to understand the problem it represents and the meaning of each feature. Use the Kaggle documentation to guide you.
3. Download the dataset and load it into a pandas DataFrame.
4. Write a few sentences summarizing your initial thoughts about the problem and the dataset.
   
</div>

<div class="alert alert-warning">
    
Solution_1
    
</div>

_Points:_ 3

This is a binary classification problem: each client either defaults or does not default. From the Kaggle documentation, all of the features appear to be numerically encoded, so most models could be applied to the dataset. We may still need to impute missing values, and we will need to scale the features because they likely differ by orders of magnitude. Overall, the features provided seem relevant to our goal. 

In [None]:
credit_df = pd.read_csv("data/UCI_Credit_Card.csv")
credit_df = credit_df.set_index("ID")
credit_df

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting
<hr>
rubric={points:2}

<div class="alert alert-info">
    
**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=123`.

> If your computer cannot handle training on 70% training data, make the test split bigger.

</div>

<div class="alert alert-warning">
    
Solution_2
    
</div>

_Points:_ 2

In [None]:
X_train, X_test= train_test_split(credit_df, test_size=0.3, random_state=123)
X_test

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA
<hr>
rubric={points:10}

<div class="alert alert-info">

**Your tasks:**

1. **Perform exploratory data analysis (EDA)**: Conduct an initial exploration of the training set to better understand its characteristics.

2. **Summarize and visualize the data**: Include at least **two summary statistics** and **two visualizations** that you find informative. For each, write **one sentence** explaining what insight it provides.

3. **Record your observations**: Summarize your **initial observations** about the dataset based on your EDA.

4. **Select evaluation metrics**: Choose one or more **appropriate metrics** for assessing model performance and briefly justify your choice.

</div>

<div class="alert alert-warning">
    
Solution_3
    
</div>

_Points:_ 10

Let's see if there are any missing values in the dataset.

In [None]:
credit_df.isnull().sum()

The dataset has no missing values. Looking more closely at the features, we can observe the following:

- The `EDUCATION` feature has some issues. It can be seen as ordinal, but it is currently ordered in the opposite direction, with higher numbers given to lower education (graduate school = 1, university = 2, high school = 3). It also has duplicate codes for the same category (unknown = 5 and unknown = 6). Both of these need to be dealt with.
- `PAY_0` (the repayment status in September, the most recent month) and the statuses for other recent months are likely more important for predicting default next month. A person who has not paid for a couple of months already is more likely to default again. This feature importance could potentially be explored using decision trees.
- The relationship between `PAY_AMT*` and `BILL_AMT*` could be useful. Logically, clients who default are likely to have been paying only a small percentage of their bill statements in the last couple of months.
- The trend of `PAY_*` and `BILL_AMT*` (the repayment status and the bill statement in a specific month) could logically be a strong predictor of defaulting. Someone who is already not paying and whose bill statement has been increasing for the last couple of months is unlikely to pay next month. 

To investigate the relationship between `PAY_AMT*` and `BILL_AMT*`, we can measure the average fraction of the bill statements that was paid (the average payment-to-bill ratio) for defaulters and non-defaulters.

In [None]:
bill_cols = ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']
pay_cols  = ['PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

credit_df['pay_ratio'] = (credit_df[pay_cols].sum(axis=1) / (credit_df[bill_cols].sum(axis=1) + 1e-6)) # small addition to avoid 0 division

# We need to filter out the extreme outliers where a person with a very low statement ends up overpaying by a lot;
# otherwise these payments would dominate the mean
credit_df_filtered = credit_df[(credit_df['pay_ratio'] > 0) & (credit_df['pay_ratio'] < 5)]

credit_df_filtered.groupby('default.payment.next.month')['pay_ratio'].mean()

From this statistic, we can see that a client who defaults pays, on average, 12 percent less per month towards their bill statement than a client who does not default. To get a clearer picture, we can visualize the payment to bill ratio and default status using a histogram. 

In [None]:
pay_ratio_non_default = credit_df_filtered[credit_df_filtered['default.payment.next.month']==0]['pay_ratio']
pay_ratio_default = credit_df_filtered[credit_df_filtered['default.payment.next.month']==1]['pay_ratio']

plt.figure(figsize=(8,5))
plt.hist(pay_ratio_non_default, bins=50, alpha=0.6, color='green', label='Non-Defaulters')
plt.hist(pay_ratio_default, bins=50, alpha=0.6, color='red', label='Defaulters')

plt.xlabel('Average Payment-to-Bill Ratio')
plt.ylabel('Number of Clients')
plt.title('Distribution of Average Payment-to-Bill Ratio by Default Status')
plt.legend()
plt.grid(alpha=0.3)
plt.show()


The histogram shows that most clients have average payment-to-bill ratios around 0, and that clients with larger ratios usually do not default. Although there is class imbalance, the histogram shows that clients who default cluster near a payment-to-bill ratio of 0. This suggests that the average payment-to-bill ratio could be a valuable feature for our goal.

To investigate the trend of `PAY_*` and `BILL_AMT*`, we can start by calculating the average monthly change in bill amount.

In [None]:
bill_cols = ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']
credit_df['bill_trend'] = credit_df[bill_cols].iloc[:, ::-1].diff(axis=1).mean(axis=1)
credit_df.groupby('default.payment.next.month')['bill_trend'].mean()
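
As a quick check of what the `bill_trend` expression computes, the sketch below applies the same reverse-then-diff steps to a single made-up client (the amounts are invented for illustration and are not taken from the dataset). Reversing the columns puts the oldest statement (`BILL_AMT6`) first, so each difference is a month-over-month change going forward in time.

```python
import pandas as pd

bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']

# One hypothetical client whose bill grows by 100 each month;
# BILL_AMT6 is the oldest statement and BILL_AMT1 the most recent.
toy = pd.DataFrame([[600, 500, 400, 300, 200, 100]], columns=bill_cols)

# Reverse to chronological order, then take differences between neighbouring months.
monthly_change = toy[bill_cols].iloc[:, ::-1].diff(axis=1)

# The oldest month has no previous month, so its entry is NaN and mean() skips it.
toy_trend = monthly_change.mean(axis=1)
print(toy_trend.iloc[0])
```

A positive value means the bill has been growing on average, and a negative value means it has been shrinking.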

It appears that, on average, clients who default next month have an average monthly increase in bill statement that is 500 NT dollars lower than that of clients who do not default. This raises the question of whether clients with a higher credit limit (who are able to add more to the bill statement each month) are less likely to default. This would make sense, as higher credit is usually given to those with higher salaries and a long record of paying on time. We can investigate this by visualizing the relationship between credit limit and default status, using a histogram of the credit limit that is color-coded by default status.

In [None]:
defaulted = credit_df[credit_df['default.payment.next.month'] == 1]['LIMIT_BAL']
non_defaulted = credit_df[credit_df['default.payment.next.month'] == 0]['LIMIT_BAL']

plt.figure(figsize=(8,5))
plt.hist(non_defaulted, bins=50, alpha=0.6, color='green', label='No Default')
plt.hist(defaulted, bins=50, alpha=0.6, color='red', label='Default')

plt.title('Distribution of Credit Limit by Default Status')
plt.xlabel('Credit Limit (NT dollars)')
plt.ylabel('Number of Clients')
plt.legend(title='Default Next Month')
plt.show()


The histogram shows a clear trend: clients who default next month tend to have lower credit limits than those who do not. Thus, the credit limit could be an important feature in predicting whether a client defaults or not.

To decide which evaluation metrics to use, we must check for potential class imbalance in the dataset. We will do this on the entire dataset to get a clear idea.

In [None]:
class_counts = credit_df.groupby("default.payment.next.month").size()
class_counts

class_prop = credit_df.groupby("default.payment.next.month").size() / len(credit_df) * 100
class_prop

class_imbalance = pd.DataFrame({"counts" : class_counts, "proportion" : class_prop})
class_imbalance

There is a clear class imbalance in this dataset, so accuracy is not a suitable metric on its own. To judge our models more fairly given the imbalance, we will use precision, recall, and F1 score as our metrics. Since our goal is to predict whether a person will default, and there are fewer cases of a person defaulting, we will treat the class where the person does default as the "positive" class.
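
To illustrate why accuracy can hide poor performance on the minority class, here is a small sketch with made-up labels and predictions (none of these numbers come from the credit data). It compares a predictor that always outputs the majority class with one that finds some defaulters, using `pos_label=1` to mark default as the positive class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 8 non-defaulters (0) and 2 defaulters (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

# A model that catches one of the two defaulters and raises one false alarm.
y_model = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

acc_majority = accuracy_score(y_true, y_majority)
recall_majority = recall_score(y_true, y_majority, pos_label=1)

acc_model = accuracy_score(y_true, y_model)
precision_model = precision_score(y_true, y_model, pos_label=1)
recall_model = recall_score(y_true, y_model, pos_label=1)
f1_model = f1_score(y_true, y_model, pos_label=1)

print(f"majority: accuracy={acc_majority:.2f}, recall={recall_majority:.2f}")
print(f"model:    accuracy={acc_model:.2f}, precision={precision_model:.2f}, "
      f"recall={recall_model:.2f}, f1={f1_model:.2f}")
```

Both predictors reach the same accuracy here, but only the second one identifies any defaulters, which is the difference that recall and F1 make visible.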

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Feature engineering
<hr>
rubric={points:1}

<div class="alert alert-info">
    
**Your tasks:**

1. **Perform feature engineering**: Create new features that are relevant to the problem and use this updated feature set in the following exercises. You may need to iterate between **feature engineering** and **preprocessing** to refine your features and improve model performance.
   
</div>

<div class="alert alert-warning">
    
Solution_4
    
</div>

_Points:_ 1

As discussed in the EDA section, one useful feature that can be added is the average payment to bill ratio.
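
A minimal sketch of how this feature could be added is shown below. The helper name `add_pay_ratio` and the toy rows are assumptions made for illustration; the formula mirrors the payment-to-bill ratio computed in the EDA section.

```python
import pandas as pd

bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
pay_cols = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']


def add_pay_ratio(df, eps=1e-6):
    """Return a copy of df with a 'pay_ratio' column (hypothetical helper)."""
    out = df.copy()
    total_paid = out[pay_cols].sum(axis=1)
    total_billed = out[bill_cols].sum(axis=1)
    # eps avoids division by zero for clients with no billed amount
    out['pay_ratio'] = total_paid / (total_billed + eps)
    return out


# Two made-up clients: one pays half of what is billed, one pays nothing.
toy = pd.DataFrame(
    [[1000] * 6 + [500] * 6,
     [2000] * 6 + [0] * 6],
    columns=bill_cols + pay_cols,
)
toy_with_ratio = add_pay_ratio(toy)
print(toy_with_ratio['pay_ratio'])
```

Because the ratio is computed row by row, the same function can be applied to the train and test splits separately without leaking information between them. The new column would then also need to be listed among the scaled numeric features.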

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Preprocessing and transformations
<hr>
rubric={points:10}

<div class="alert alert-info">
    
**Your tasks:**

1. **Identify feature types**: Determine the different types of features in your dataset (e.g., numerical, categorical, ordinal, text) and specify the transformations you would apply to each type.

2. **Define a column transformer (if needed)**: Implement a `ColumnTransformer` to apply the appropriate preprocessing steps to each feature type.
 
</div>

<div class="alert alert-warning">
    
Solution_5
    
</div>

_Points:_ 10

Since there are no missing values in the dataset, there is no need for imputation on any features. 

We have the following ordinal features: `EDUCATION` and `PAY_*` (all of the repayment status columns). We need to fix the ordering of the education column so that higher education has a higher value (i.e., graduate school = 3, university = 2, ...) and merge the duplicate "unknown" codes. `PAY_*` does not need the same treatment because its order is already meaningful, with higher numbers given to more months of missed payments. We do not scale these features: their values are already close to 0, so once the other features are scaled, the remaining difference in magnitude should not be large enough to have serious effects.

In [63]:
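# First we need to fix the education feature
education_map = {
    0: 0,  # invalid 0s in dataset
    1: 3,  # graduate school = 3
    2: 2,  # university = 2
    3: 1,  # high school = 1
    4: 0,  # others = 0
    5: 0,  # unknown will be lumped with others (less noise)
    6: 0,  # unknown (duplicate code), also lumped with others
}

# Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Assignment aligns on the ID index, so each split only receives its own rows
X_train['EDUCATION'] = credit_df['EDUCATION'].map(education_map).astype(int)

# This transformation is just a fixed rule and doesn't learn anything from the test set
X_test['EDUCATION'] = credit_df['EDUCATION'].map(education_map).astype(int)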

# Now we can outline the ordinal features
Ordinal_features = ['EDUCATION', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']


These features are already ordinal encoded so we won't need to apply the ordinal encoder transformer.

We have the following categorical features: `MARRIAGE` and `SEX`. Both require one-hot encoding as they have no ordinality. However, since `SEX` is a binary feature in this dataset, we will encode it as a single binary column (to avoid two mirrored columns).

In [None]:
Binary_features = ['SEX']
binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)


Categorical_features = ['MARRIAGE']
categorical_transformer = OneHotEncoder(sparse_output=True)


We have the following numeric features that all need to be scaled: `LIMIT_BAL`, `AGE`, `BILL_AMT*`, and `PAY_AMT*`.

In [None]:
Numeric_features = ['LIMIT_BAL', 'AGE', 'BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6', 
                    'PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

numeric_transformer = StandardScaler()

We will now create a column transformer to apply the necessary transformations to the features.

In [None]:
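preprocessor = make_column_transformer((numeric_transformer, Numeric_features),
                                       (binary_transformer, Binary_features),
                                       (categorical_transformer, Categorical_features),
                                       # Keep the already-encoded ordinal columns as-is;
                                       # unlisted columns are dropped (remainder="drop")
                                       ("passthrough", Ordinal_features),
                                       verbose_feature_names_out=False)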

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 6. Baseline model
<hr>
rubric={points:2}

<div class="alert alert-info">
    
**Your tasks:**

1. **Establish a baseline**: Use one of `scikit-learn`’s baseline models (e.g., `DummyClassifier` or `DummyRegressor`, depending on your task) and report the results. This will serve as a reference point for evaluating the performance of your more advanced models.

</div>

<div class="alert alert-warning">
    
Solution_6
    
</div>

_Points:_ 2
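
A minimal sketch of a baseline evaluation is given below. It assumes synthetic data with roughly 20% positives in place of the credit data, and the variable names are illustrative. The idea is that a `DummyClassifier` predicting the most frequent class sets the accuracy any real model must beat, while its recall on the positive class is zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data: 200 examples, 40 positives (defaults), 160 negatives.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 40 + [0] * 160)

# Always predicts the majority class, ignoring the features.
dummy = DummyClassifier(strategy='most_frequent')

baseline_scores = cross_validate(
    dummy, X_toy, y_toy, cv=5,
    scoring=['accuracy', 'recall'],
    return_train_score=True,
)

print('mean CV accuracy:', baseline_scores['test_accuracy'].mean())
print('mean CV recall:  ', baseline_scores['test_recall'].mean())
```

On the real data, the same call would take the training features and target in place of `X_toy` and `y_toy`.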

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 7. Linear models 
<hr>
rubric={points:10}

<div class="alert alert-info">
    
**Your tasks:**

1. **Train a linear model**: Use a linear model as your first real attempt at solving the problem.

2. **Tune hyperparameters**: Perform hyperparameter tuning to explore different values of the model's complexity parameter. 

3. **Evaluate with cross-validation**: Report the cross-validation scores along with their standard deviation.

4. **Summarize findings**: Summarize your results, highlighting key observations from your experiments.

</div>

<div class="alert alert-warning">
    
Solution_7
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Different models
<hr>
rubric={points:12}

<div class="alert alert-info">
    
**Your tasks:**

1. **Experiment with additional models**: Train at least **three models** other than a linear model. Ensure that **at least one** of these models is a **tree-based ensemble model** (e.g., Random Forest, Gradient Boosting, or XGBoost).

2. **Compare and interpret results**: Summarize your findings in terms of **overfitting/underfitting** behavior and **fit/score times** for each model. Reflect on your results. Were you able to **outperform the linear model**?

</div>

<div class="alert alert-warning">
    
Solution_8
    
</div>

_Points:_ 12

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 9. Feature selection 
<hr>
rubric={points:2}

<div class="alert alert-info">
    
**Your tasks:**

1. **Perform feature selection**: Attempt to select relevant features using methods such as `RFECV` or forward selection.

2. **Evaluate the impact**: Compare the model performance before and after feature selection. Do the results improve with feature selection?

3. **Summarize findings**: Summarize your observations and decide whether to **keep feature selection** in your pipeline. If it improves results, retain it for the next exercises; otherwise, you may choose to omit it.
</div>

<div class="alert alert-warning">
    
Solution_9
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Hyperparameter optimization
<hr>
rubric={points:10}

<div class="alert alert-info">
    
**Your tasks:**

1. **Optimize hyperparameters**: Attempt to optimize hyperparameters for the models you have tried so far. In at least **one case**, tune **multiple hyperparameters** for a single model.

2. **Use suitable optimization methods**: You may use any of the following approaches for hyperparameter optimization:
   - [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)  
   - [`RandomizedSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)  
   - [Bayesian optimization with scikit-optimize](https://github.com/scikit-optimize/scikit-optimize)

3. **Summarize your results**: Report and compare the optimized results across models. Discuss whether hyperparameter optimization led to performance improvements.

</div>

<div class="alert alert-warning">
    
Solution_10
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 11. Interpretation and feature importances
<hr>
rubric={points:10}

<div class="alert alert-info">
    
**Your tasks:**

1. **Interpret model feature importance**: Use one of the interpretation methods discussed in class (e.g., `shap`), or another suitable method of your choice, to examine the most important features of one of your **non-linear models**.

2. **Summarize insights**: Summarize your observations about which features contribute most to the model's predictions and how they influence the outcomes.

   
</div>

<div class="alert alert-warning">
    
Solution_11
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Results on the test set
<hr>

rubric={points:10}

<div class="alert alert-info">
    
**Your tasks:**

1. **Evaluate on the test set**: Apply your best-performing model to the test data and report the test scores.

2. **Compare and reflect**: Compare the **test scores** with the **validation scores** from previous experiments. Discuss the consistency between them. How much do you **trust your results**? Reflect on whether you might have encountered **optimization bias**.

3. **Explain individual predictions**: Select one or two examples from your test predictions and use an interpretation method (e.g., **SHAP force plots**) to explain these individual predictions.
</div>

<div class="alert alert-warning">
    
Solution_12
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 13. Summary of results
<hr>
rubric={points:12}

Imagine you are preparing to present the summary of your results to your boss and co-workers.

<div class="alert alert-info">
    
**Your tasks:**

1. **Summarize key results**: Create a clear and concise table highlighting your most important results (e.g., models compared, validation/test scores, key observations).

2. **Write concluding remarks**: Summarize your main takeaways from the project, including what worked well and what did not.

3. **Propose future improvements**: Discuss ideas or approaches you did not try but that could potentially improve **performance** or **interpretability**.

4. **Report final results**: Report your **final test score** and the **metric** you used.

</div>


<div class="alert alert-warning">
    
Solution_13
    
</div>

_Points:_ 12

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 14. Your takeaway
<hr>
rubric={points:2}

<div class="alert alert-info">
    
**Your tasks:**

What is your biggest takeaway from the supervised machine learning material we have learned so far? Please write thoughtful answers.  

</div>

<div class="alert alert-warning">
    
Solution_14
    
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

Before submitting your assignment, please ensure you have followed all the steps in the **Instructions** section at the top.  

### Submission checklist  

- [ ] Restart the kernel and run all cells (▶▶ button)
- [ ] Make sure to push the most up to date version of your homework assignment to your GitHub repository so that we can use it for grading if there are any problems with your submission on Gradescope. 
- [ ] The `.ipynb` file runs without errors and shows all outputs.  
- [ ] Only the `.ipynb` file and required output files are uploaded (no extra files).  
- [ ] If the `.ipynb` file is too large to render on Gradescope, upload a Web PDF and/or HTML version as well.


This was a tricky one but you did it 👏👏!  

![](img/eva-well-done.png)