In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")



# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 13 

**Due date: Monday, October 28th, 2024 at 11:59pm**

## Table of contents
0. [Submission instructions](#si)
1. [Understanding the problem](#1)
2. [Data splitting](#2)
3. [EDA](#3)
4. [Feature engineering](#4)
5. [Preprocessing and transformations](#5) 
6. [Baseline model](#6)
7. [Linear models](#7)
8. [Different models](#8)
9. [Feature selection](#9)
10. [Hyperparameter optimization](#10)
11. [Interpretation and feature importances](#11) 
12. [Results on the test set](#12)
13. [Summary of the results](#13)
14. [Your takeaway from the course](#15)

<div class="alert alert-info">

## Submission instructions <a name="si"></a>
<hr>
rubric={points:4}

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.   


Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024W1/blob/master/docs/homework_instructions.md). 

1. Before submitting the assignment, run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Follow the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2024W1/docs/homework_instructions.html), which include information on how to do your assignment and how to submit your assignment.
4. Upload your solution on Gradescope. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope.


_Note: The assignments will get gradually more open-ended as we progress through the course. In many cases, there won't be a single correct solution. Sometimes you will have to make your own choices and your own decisions (for example, on what parameter values to use when they are not explicitly provided in the instructions). Use your own judgment in such cases and justify your choices, if necessary._

</div>

<!-- BEGIN QUESTION -->

## Imports

<div class="alert alert-warning">
    
Imports
    
</div>

_Points:_ 0

In [2]:
#Imports from class demo_10-regression metrics and 6.2 Encoding Text Features

import matplotlib.pyplot as plt
import os
import numpy as np
import pandas as pd
import spacy
import nltk
nltk.download("punkt")
nltk.download("vader_lexicon")
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer



%matplotlib inline
DATA_DIR = os.path.join(os.path.abspath(".."), (".."), "data/")

# Ignore future deprecation warnings from sklearn (using `os` instead of `warnings` also works in subprocesses)

os.environ['PYTHONWARNINGS']='ignore::FutureWarning'

[nltk_data] Downloading package punkt to /Users/sophi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/sophi/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


<!-- END QUESTION -->

## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>
rubric={points:3}

In this mini-project, you can choose which dataset to work on. The tasks you will need to carry out will be similar regardless of your choice.

### Option 1
You can choose to work on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 


### Option 2
You can choose to work on a regression problem using a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset; then you will try to predict `reviews_per_month` as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

> Note there is an updated version of this dataset with more features available [here](http://insideairbnb.com/). The features we are using are in `listings.csv.gz` for the New York City datasets. You will also see some other files like `reviews.csv.gz`. For your own interest you may want to explore the expanded dataset and try your analysis there. However, please submit your results on the dataset obtained from Kaggle.


**Your tasks:**

1. Spend some time understanding the options and pick the one you find more interesting (it may help to spend some time looking at the documentation available on Kaggle for each dataset).
2. After making your choice, focus on understanding the problem and what each feature means, again using the documentation on the dataset page on Kaggle. Write a few sentences on your initial thoughts on the problem and the dataset. 
3. Download the dataset and read it as a pandas dataframe. 

<div class="alert alert-warning">
    
Solution_1
    
</div>

_Points:_ 3

We chose the New York City Airbnb data because, as students and travelers, we are interested in the factors that influence the popularity of an Airbnb listing. We also wanted additional practice implementing regression models and working with regression metrics. Our initial impression is that the `name` field will need some text representation, since a listing's name could plausibly indicate its appeal. We will also need to encode a few categorical variables such as `neighbourhood` and `room_type`. Finally, we need to decide whether `host_name` is better treated as text or as a categorical variable, or whether to identify the host only by `host_id`.

In [30]:
airbnbData = pd.read_csv("AB_NYC_2019.csv")
airbnbData.head()
airbnbData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=123`.

> If your computer cannot handle training on 70% training data, make the test split bigger.  

<div class="alert alert-warning">
    
Solution_2
    
</div>

_Points:_ 2

In [31]:
train_df, test_df = train_test_split(airbnbData, test_size = 0.3, random_state = 123)

X_train = train_df.drop(columns=["reviews_per_month"])
y_train = train_df["reviews_per_month"]
X_train = X_train[y_train.notna()]
y_train = y_train.dropna()

X_test = test_df.drop(columns=["reviews_per_month"])
y_test = test_df["reviews_per_month"]
X_test = X_test[y_test.notna()]
y_test = y_test.dropna()

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]

[(27236, 15), (27236,), (11607, 15), (11607,)]

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

<div class="alert alert-warning">
    
Solution_3
    
</div>

_Points:_ 10

In [32]:
# Summary Statistics
train_df.describe()

| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 34226.0 | 34226.0 | 34226.0 | 34226.0 | 34226.0 | 34226.0 | 34226.0 | 27236.0 | 34226.0 | 34226.0 |
| mean | 18939790.0 | 67262730.0 | 40.729142 | -73.952083 | 151.528399 | 7.094957 | 23.244814 | 1.369816 | 7.042453 | 112.526004 |
| std | 11013320.0 | 78405110.0 | 0.054531 | 0.046201 | 236.628392 | 21.54829 | 44.573323 | 1.700737 | 32.590803 | 131.420031 |
| min | 2539.0 | 2438.0 | 40.50641 | -74.24442 | 0.0 | 1.0 | 0.0 | 0.01 | 1.0 | 0.0 |
| 25% | 9394482.0 | 7721897.0 | 40.690193 | -73.98303 | 69.0 | 1.0 | 1.0 | 0.19 | 1.0 | 0.0 |
| 50% | 19545460.0 | 30745260.0 | 40.72324 | -73.95555 | 106.0 | 3.0 | 5.0 | 0.71 | 1.0 | 45.0 |
| 75% | 29150850.0 | 106837500.0 | 40.763287 | -73.93627 | 175.0 | 5.0 | 23.0 | 2.0 | 2.0 | 225.75 |
| max | 36485610.0 | 274321300.0 | 40.91234 | -73.71299 | 10000.0 | 1250.0 | 629.0 | 58.5 | 327.0 | 365.0 |


`id` and `host_id` are identifiers: each listing has its own `id` and each host its own `host_id`, so they take far too many distinct values to be useful as predictors. `latitude` and `longitude` span only a small range of values, and their predictive information can be exploited more effectively through `neighbourhood` and `neighbourhood_group` instead. `price`, `calculated_host_listings_count`, `availability_365`, and `number_of_reviews` may be useful, so we will plot them to see how they relate to the target variable.  
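The identifier argument above can be checked by counting distinct values per column. The sketch below uses a small made-up table (the `toy` frame and its values are assumptions for illustration, not rows from the Airbnb data) and reports the share of rows that carry a distinct value in each column.

```python
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "host_id": [10, 10, 11, 12, 12, 13],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Shared room", "Entire home/apt", "Private room"],
})

# Number of distinct values divided by number of rows, per column.
# A ratio close to 1 means the column behaves like an identifier.
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio.sort_values(ascending=False))
```

On the real data the same two lines could be run on `train_df` to see which columns are effectively identifiers.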

In [33]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 34226 entries, 36150 to 15725
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              34226 non-null  int64  
 1   name                            34216 non-null  object 
 2   host_id                         34226 non-null  int64  
 3   host_name                       34209 non-null  object 
 4   neighbourhood_group             34226 non-null  object 
 5   neighbourhood                   34226 non-null  object 
 6   latitude                        34226 non-null  float64
 7   longitude                       34226 non-null  float64
 8   room_type                       34226 non-null  object 
 9   price                           34226 non-null  int64  
 10  minimum_nights                  34226 non-null  int64  
 11  number_of_reviews               34226 non-null  int64  
 12  last_review                     2

This summary shows the dtype and the number of non-null values in each column. From this we can group the columns into numeric and categorical for the purpose of plotting their relationship to the target variable. We also note that `last_review` has a significant number of null values, so we will discard it. 
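One way to do this grouping programmatically is `DataFrame.select_dtypes`. The following is a minimal sketch on a made-up frame (the `toy` data and the variable names are assumptions for illustration); it also computes the share of missing values per column, which is the quantity behind the decision about `last_review`.

```python
import pandas as pd

# Toy frame with two numeric and two non-numeric columns (made-up values).
toy = pd.DataFrame({
    "price": [100, 150],
    "latitude": [40.7, 40.8],
    "room_type": ["Private room", "Shared room"],
    "last_review": ["2019-06-30", None],
})

# Split column names by dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

print(numeric_cols, other_cols)
print(missing_share)
```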

In [34]:
#Summary of missing values
train_df.isnull().sum()

id                                   0
name                                10
host_id                              0
host_name                           17
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       6990
reviews_per_month                 6990
calculated_host_listings_count       0
availability_365                     0
dtype: int64

In plotting features against the target variable we made the following observations:
- `last_review` has a significant number of null values (6,990 in the training split), so we will discard it. 
- `price` and `calculated_host_listings_count` both have wide ranges and showed skewed distributions when plotted on a linear axis, so we plotted them on a log scale.
- `minimum_nights` should be binned, since it is more logical to treat it as a discrete variable than as a continuous one. 
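The two transformations mentioned above can be sketched on toy values. This is an illustration under assumptions: `np.log1p` is used here as a log transform that tolerates a price of 0 (the notebook itself only log-scales the plot axis), and the bin edges and labels are hypothetical choices, not ones fixed by the data.

```python
import numpy as np
import pandas as pd

# Toy values (made up) standing in for the skewed columns.
toy = pd.DataFrame({
    "price": [0, 50, 100, 1000, 10000],
    "minimum_nights": [1, 2, 4, 7, 30],
})

# log1p computes log(1 + x), so a price of 0 maps to 0 instead of -inf.
toy["log_price"] = np.log1p(toy["price"])

# Right-closed bins: (0, 1], (1, 3], (3, 5], (5, inf].
bins = [0, 1, 3, 5, np.inf]
labels = ["1 night", "2-3 nights", "4-5 nights", "6+ nights"]
toy["nights_bin"] = pd.cut(toy["minimum_nights"], bins=bins, labels=labels)

print(toy)
```

Using `np.inf` as the last edge keeps very long minimum stays in the final bin instead of turning them into missing values.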

In [35]:
# Numeric visualizations: scatter each numeric feature against the target
numeric_feats = ["price", "calculated_host_listings_count", "availability_365", "number_of_reviews"]
for col in numeric_feats:
    plt.figure()
    plt.scatter(train_df[col], train_df["reviews_per_month"])
    plt.title(col + " vs Reviews per Month")
    plt.xlabel(col)
    plt.ylabel("Reviews per Month")
    # Wide, skewed ranges are easier to read on a log axis
    if col in ["price", "calculated_host_listings_count"]:
        plt.xscale("log")
    plt.show()

# Average reviews per month within bins of minimum_nights
bins = [0, 1, 3, 5, np.inf]  # open-ended last bin so that longer stays fall under "6+ nights"
labels = ["0-1 night", "2-3 nights", "4-5 nights", "6+ nights"]
nights_bins = pd.cut(train_df["minimum_nights"], bins=bins, labels=labels)
reviews = train_df.groupby(nights_bins, observed=True)["reviews_per_month"].mean()
plt.figure()
plt.bar(reviews.index.astype(str), reviews.values)
plt.title("Average Reviews per Month by Minimum Nights")
plt.xlabel("Minimum Nights")
plt.ylabel("Average Reviews Per Month")
plt.show()
          

Numeric Visualizations Summary:
According to the visualizations of the numeric features, the following feature had the strongest relationship with reviews per month and should therefore be considered in our model:
- `minimum_nights`: average reviews per month appear to decrease steadily as minimum nights increases, suggesting that `minimum_nights` may be a useful predictor of reviews per month.

We decided to discard:
- `number_of_reviews`: it shows a weak correlation with the target variable. A large `number_of_reviews` may be more indicative of a listing's age than of its popularity.
- `price`
- `availability_365`
- `calculated_host_listings_count`

In [36]:
# Categorical visualizations: mean target value per category
categorical_feats = ["neighbourhood", "neighbourhood_group", "room_type"]
for col in categorical_feats:
    means_per_feat = train_df.groupby(col)["reviews_per_month"].mean()
    plt.figure()
    # Plot the per-category means rather than the raw rows
    plt.bar(means_per_feat.index, means_per_feat.values)
    plt.title(col + " vs Reviews per Month")
    plt.xlabel(col)
    plt.ylabel("Average Reviews per Month")
    plt.xticks(rotation=90)
    plt.show()

In [37]:
train_df["room_type"].describe()

count               34226
unique                  3
top       Entire home/apt
freq                17848
Name: room_type, dtype: object

Categorical Visualizations Summary:
According to the visualizations of the categorical features, we selected the following features, which clearly relate to the target variable:
- `neighbourhood_group`: we will use `neighbourhood_group` rather than `neighbourhood`, since `neighbourhood_group` covers all the data shown in `neighbourhood` and maps it more clearly. It is less specific and may therefore provide more general insights when considering unseen data.
- `room_type`: we were interested to observe that private rooms had by far the most `reviews_per_month`, even though `Entire home/apt` was the most frequent value. 
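The `room_type` observation contrasts how often a category occurs with its average target value. A minimal sketch of that comparison, on made-up numbers (the `toy` frame is an assumption for illustration and does not reproduce the Airbnb figures):

```python
import pandas as pd

# Toy data: "Entire home/apt" is most frequent, "Private room" has the highest mean.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "reviews_per_month": [0.5, 1.0, 1.5, 2.0, 3.0, 1.0],
})

# Frequency and mean target per category, side by side.
summary = toy.groupby("room_type")["reviews_per_month"].agg(["count", "mean"])
print(summary.sort_values("mean", ascending=False))
```

Reading `count` and `mean` together makes it easy to see when the most common category is not the one with the highest average.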

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

<div class="alert alert-warning">
    
Solution_4
    
</div>

_Points:_ 1

In [38]:
#From appendixA_feature-engineering-text-data.ipynb
sid = SentimentIntensityAnalyzer()

X_train["name"] = X_train["name"].fillna("")
X_train = X_train.assign(vader_sentiment=X_train["name"].apply(lambda x: sid.polarity_scores(x)["pos"]))

X_test["name"] = X_test["name"].fillna("")
X_test = X_test.assign(vader_sentiment=X_test["name"].apply(lambda x: sid.polarity_scores(x)["pos"]))


In [39]:
X_train = X_train.assign(min_price=X_train["price"] * X_train["minimum_nights"])
X_test = X_test.assign(min_price=X_test["price"] * X_test["minimum_nights"])
X_train

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | calculated_host_listings_count | availability_365 | vader_sentiment | min_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20195 | 16162621 | NEW! Exceptional 2BR/1BA Williamsburg Oasis | 104781467 | Russell | Brooklyn | Williamsburg | 40.71306 | -73.94856 | Entire home/apt | 199 | 3 | 1 | 2016-12-11 | 1 | 0 | 0.000 | 597 |
| 18702 | 14807279 | Renovated brownstone apt w/ private outdoor patio | 43853650 | Eric | Brooklyn | Bedford-Stuyvesant | 40.68612 | -73.95927 | Entire home/apt | 225 | 4 | 112 | 2019-06-30 | 1 | 136 | 0.000 | 900 |
| 34780 | 27573483 | Fort Greene two bedroom, quiet and peaceful | 1180925 | Mark | Brooklyn | Fort Greene | 40.69488 | -73.97222 | Entire home/apt | 200 | 7 | 1 | 2018-09-21 | 3 | 163 | 0.348 | 1400 |
| 23690 | 19154733 | 2 bedrooms Williamsburg loft - huge and sunny | 42405567 | Chris | Brooklyn | Williamsburg | 40.71356 | -73.94372 | Entire home/apt | 225 | 4 | 7 | 2018-09-02 | 2 | 8 | 0.560 | 900 |
| 36152 | 28736839 | Gorgeous Harlem loft w/15 ft ceilings, sleeps 10 | 3251620 | S | Manhattan | Harlem | 40.81148 | -73.95140 | Entire home/apt | 300 | 1 | 1 | 2018-10-10 | 1 | 0 | 0.364 | 300 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7763 | 5885201 | SUNNY ROOM A IN CHARMING AREA :) | 4291007 | Graham And Ben | Brooklyn | Bedford-Stuyvesant | 40.69363 | -73.95980 | Private room | 95 | 30 | 40 | 2019-06-01 | 11 | 331 | 0.787 | 2850 |
| 15377 | 12325045 | IDEAL One bedroom apt by Central Park! | 66501870 | K Alexandra | Manhattan | Midtown | 40.76016 | -73.96910 | Entire home/apt | 139 | 2 | 132 | 2019-06-30 | 1 | 154 | 0.424 | 278 |
| 17730 | 13915004 | Sunlit, spacious NY apartment | 7177483 | Dani | Manhattan | Harlem | 40.80380 | -73.95569 | Entire home/apt | 250 | 3 | 10 | 2019-01-01 | 1 | 0 | 0.000 | 750 |
| 28030 | 21897845 | One room. | 159769278 | Musieka | Bronx | Pelham Gardens | 40.86706 | -73.84674 | Private room | 40 | 2 | 17 | 2019-06-04 | 1 | 17 | 0.000 | 80 |


In [40]:
discrete = ["minimum_nights"]
discrete_nights = KBinsDiscretizer(n_bins=3, encode="onehot")
discrete_nights
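The discretizer above is defined but not yet attached to any column. A minimal sketch of how a `KBinsDiscretizer` could be wired into a column transformer follows; the toy data, the `ct` name, and the choice of `encode="onehot-dense"` (to keep the output a plain NumPy array) are assumptions for illustration rather than the notebook's final preprocessing.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Toy data (made-up values) with one column to bin and one to scale.
toy = pd.DataFrame({
    "minimum_nights": [1, 1, 2, 3, 5, 7, 14, 30, 30],
    "price": [40, 60, 80, 100, 120, 150, 200, 250, 300],
})

# Bin minimum_nights into 3 quantile-based bins and one-hot encode the bin index;
# scale price as an ordinary numeric feature.
ct = make_column_transformer(
    (KBinsDiscretizer(n_bins=3, encode="onehot-dense"), ["minimum_nights"]),
    (StandardScaler(), ["price"]),
)
Xt = ct.fit_transform(toy)

# Three indicator columns for the bins plus one scaled price column.
print(Xt.shape)
```

If a column is binned this way, it would normally be removed from the list of features passed to the scaler so that it is not encoded twice.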

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

<div class="alert alert-warning">
    
Solution_5
    
</div>

_Points:_ 10

In [41]:
numeric_feats = ["minimum_nights", "vader_sentiment", "number_of_reviews", "price", "calculated_host_listings_count", "availability_365", "latitude", "longitude", "min_price"]
categorical_feats = ["neighbourhood_group", "room_type", "neighbourhood"]
text_feats = ["name"]
drop_feats = ["host_id", "host_name", "last_review"]

In [42]:
text_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=""),
    FunctionTransformer(lambda x: x.astype(str).flatten()),
    CountVectorizer(ngram_range=(1, 2), min_df=5)
)
    
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_feats),
    (text_transformer, text_feats), 
    (StandardScaler(), numeric_feats),
    ("drop", drop_feats),
)

In [43]:
preprocessor.fit(X_train)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 6. Baseline model <a name="6"></a>
<hr>
rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

<div class="alert alert-warning">
    
Solution_6
    
</div>

_Points:_ 2

In [44]:
dummy = DummyRegressor()

In [45]:
pd.DataFrame(cross_validate(dummy, X_train, y_train, cv=10, return_train_score=True))

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.011645 | 0.001435 | -2.5e-05 | 0.0 |
| 1 | 0.006299 | 0.00112 | -0.000159 | 0.0 |
| 2 | 0.005523 | 0.000788 | -0.00049 | 0.0 |
| 3 | 0.004622 | 0.000785 | -2.7e-05 | 0.0 |
| 4 | 0.004282 | 0.000672 | -7e-05 | 0.0 |
| 5 | 0.004466 | 0.000736 | -0.000577 | 0.0 |
| 6 | 0.004079 | 0.000655 | -5.2e-05 | 0.0 |
| 7 | 0.003909 | 0.00063 | -2.2e-05 | 0.0 |
| 8 | 0.004173 | 0.000952 | -0.000339 | 0.0 |
| 9 | 0.004161 | 0.000716 | -0.003229 | 0.0 |


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.

<div class="alert alert-warning">
    
Solution_7
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [46]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [47]:
alphas = 10.0 ** np.arange(-6, 6, 1)
ridge_model = RidgeCV(alphas=alphas, cv=10)
ridge_model.fit(X_train, y_train)

In [48]:
best_alpha_2 = ridge_model.alpha_
best_alpha_2

100.0

In [49]:
ridge_tuned = Ridge(alpha=best_alpha_2)
pd.DataFrame(cross_validate(ridge_tuned, X_train, y_train, cv = 10, return_train_score = True))

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.134957 | 0.001254 | 0.384417 | 0.424463 |
| 1 | 0.134212 | 0.001155 | 0.411587 | 0.421673 |
| 2 | 0.131298 | 0.001117 | 0.392175 | 0.423863 |
| 3 | 0.128372 | 0.001108 | 0.387918 | 0.424207 |
| 4 | 0.131199 | 0.001082 | 0.285216 | 0.436233 |
| 5 | 0.125924 | 0.001103 | 0.3805 | 0.424954 |
| 6 | 0.128202 | 0.001101 | 0.442961 | 0.419239 |
| 7 | 0.128862 | 0.001074 | 0.376287 | 0.425428 |
| 8 | 0.12969 | 0.001062 | 0.401165 | 0.421649 |
| 9 | 0.122543 | 0.001076 | 0.393881 | 0.423963 |
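The tasks for this section ask for cross-validation scores along with their standard deviation. A minimal sketch of that summary is shown here on synthetic data from `make_regression` rather than the Airbnb features; the data, the `pipe` name, and the fixed `alpha=100.0` are assumptions for illustration. Keeping the scaler inside the pipeline means each fold refits it on that fold's training portion only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real feature matrix.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

# Preprocessing and model in one estimator, so cross-validation refits both per fold.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=100.0))
cv_results = pd.DataFrame(
    cross_validate(pipe, X, y, cv=10, return_train_score=True)
)

# Mean and standard deviation of the fold scores (default scorer for regressors is R^2).
summary = cv_results[["test_score", "train_score"]].agg(["mean", "std"])
print(summary.round(3))
```

The same `agg(["mean", "std"])` call could be applied to any of the `cross_validate` tables in this notebook.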


In [50]:
rfr = RandomForestRegressor(max_depth = 12, n_estimators=10, random_state=123)

In [None]:
pd.DataFrame(cross_validate(rfr, X_train, y_train, cv = 10, return_train_score = True))

In [None]:
kn_model = KNeighborsRegressor(n_neighbors = 3)

In [26]:
pd.DataFrame(cross_validate(kn_model, X_train, y_train, cv = 10, return_train_score = True))

KeyboardInterrupt: 

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Different models <a name="8"></a>
<hr>
rubric={points:12}

**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

<div class="alert alert-warning">
    
Solution_8
    
</div>

_Points:_ 12

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 9. Feature selection <a name="9"></a>
<hr>
rubric={points:2}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 
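As a rough illustration of the `RFECV` option mentioned above, the sketch below runs recursive feature elimination with cross-validation on synthetic data. The dataset, the `Ridge(alpha=1.0)` estimator, and `cv=5` are assumptions chosen to keep the example small; they are not recommendations for the homework data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge

# Synthetic data: 10 features, only 3 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=123
)

# RFECV repeatedly drops the weakest feature (by coefficient magnitude here)
# and uses cross-validation to choose how many features to keep.
selector = RFECV(Ridge(alpha=1.0), cv=5)
selector.fit(X, y)

print("Number of features kept:", selector.n_features_)
print("Mask of kept features:", selector.support_)
```

In a real pipeline the selector would sit between the preprocessor and the final model, so that selection is redone inside each cross-validation fold.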

<div class="alert alert-warning">
    
Solution_9
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:10}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 
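A minimal sketch of optimizing several hyperparameters of one model with `RandomizedSearchCV` is given below. It uses synthetic data, and the search ranges, `n_iter=5`, and `cv=3` are assumptions picked so the example runs quickly. Fixing `random_state` in both the model and the search makes the experiment re-runnable with the same outcome.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the real training data.
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=123)

# Two hyperparameters are sampled jointly; randint(a, b) draws integers in [a, b).
param_dist = {
    "n_estimators": randint(10, 40),
    "max_depth": randint(2, 8),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=123),
    param_distributions=param_dist,
    n_iter=5,          # number of sampled configurations
    cv=3,              # folds per configuration
    random_state=123,  # reproducible sampling of configurations
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```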

<div class="alert alert-warning">
    
Solution_10
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Use the methods we saw in class (e.g., `shap`) (or any other methods of your choice) to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 
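As one example of "any other method of your choice", the sketch below uses scikit-learn's permutation importance rather than SHAP. It is run on synthetic data with a small random forest; the dataset, the model settings, and `n_repeats=5` are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, 2 of which are informative.
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=123
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=123)

model = RandomForestRegressor(n_estimators=50, random_state=123).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=123)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Permutation importance measures how much the validation score depends on each feature; unlike SHAP, it does not explain individual predictions.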

<div class="alert alert-warning">
    
Solution_11
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:10}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain these individual predictions (e.g., with SHAP force plots).  

<div class="alert alert-warning">
    
Solution_12
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 13. Summary of results <a name="13"></a>
<hr>
rubric={points:12}

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability. 
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).

<div class="alert alert-warning">
    
Solution_13
    
</div>

_Points:_ 12

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<br><br>

<!-- BEGIN QUESTION -->

## 14. Your takeaway <a name="15"></a>
<hr>
rubric={points:2}

**Your tasks:**

What is your biggest takeaway from the supervised machine learning material we have learned so far? Please write thoughtful answers.  

<div class="alert alert-warning">
    
Solution_14
    
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
4. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

This was a tricky one but you did it! 

![](img/eva-well-done.png)