# CPSC 330 - Applied Machine Learning 

## Homework 3: Preprocessing 
### Associated lectures: [Lectures 4, 5, 6](https://ubc-cs.github.io/cpsc330/README.html) 

**Due date: Wednesday, Feb 02, 2022 at 11:59pm**

## Table of Contents

- [Instructions](#si)
- [Introduction](#in)
- [Exercise 1: Introducing the dataset](#1)
- [Exercise 2: Exploratory data analysis (EDA)](#2)
- [Exercise 3: Preprocessing](#3)
- [Exercise 4: Building models](#4)
- [Exercise 5: Evaluating on the test set](#5)

## Imports 

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## Instructions <a name="si"></a>
<hr>
rubric={points:6}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

**You may not work with a partner on this homework. Individual submissions are required.** 

_Note: The assignments will get gradually more open-ended as we progress through the course. In many cases, there won't be a single correct solution. Sometimes you will have to make your own choices and your own decisions (for example, on what parameter values to use when they are not explicitly provided in the instructions). Use your own judgment in such cases and justify your choices, if necessary._

## Introduction <a name="in"></a>
<hr>

A crucial step when using machine learning algorithms on real-world datasets is preprocessing. This homework will give you some practice with data preprocessing and with building a supervised machine learning pipeline on a real-world dataset. 

<br><br>

## Exercise 1: Introducing the dataset <a name="1"></a>
<hr>

In this homework, you will be working on [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#). Download the CSV and save it as `adult.csv` locally in this homework folder. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).

The starter code below loads the data CSV (assuming that it is saved as `adult.csv` in this folder). 

_Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary._

In [2]:
census_df = pd.read_csv("adult.csv")
census_df.shape

(32561, 15)

<br><br>

### 1.1 Data splitting 
rubric={points:4}

In order to avoid violation of the golden rule, the first step before we do anything is splitting the data. 

**Your tasks:**

1. Split the data into `train_df` (60%) and `test_df` (40%) with `random_state = 1`. Keep the target column (`income`) in the splits so that we can use it in the exploratory data analysis.  

_Usually having more data for training is a good idea. But here I'm using 60%/40% split because this is kind of a big dataset for a modest laptop. A smaller training data means it won't take too long to train the model on your laptop. A side advantage of this would be that with a bigger test split, we'll have a more reliable estimate of the deployment performance!_

In [3]:
train_df, test_df = train_test_split(census_df, train_size = 0.6, random_state=1)

<br><br>

## Exercise 2: Exploratory data analysis (EDA) <a name="2"></a> 
<hr>

Let's examine our `train_df`. 

In [4]:
train_df.sort_index()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
5,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
9,41,Private,70037,Some-college,10,Never-married,Craft-repair,Unmarried,White,Male,0,3004,60,?,>50K
10,45,Private,172274,Doctorate,16,Divorced,Prof-specialty,Unmarried,Black,Female,0,3004,35,United-States,>50K
12,52,Private,129177,Bachelors,13,Widowed,Other-service,Not-in-family,White,Female,0,2824,20,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


We see some missing values represented with a "?". These are probably questions that some respondents left unanswered during the census. Usually the `.describe()` or `.info()` methods would give you information on missing values. But here they won't count "?" as missing, because these values are encoded as strings rather than as actual `NaN` values. So let's replace them with `np.nan` before we carry out EDA. If you do not do it, you'll encounter an error later on when you try to pass this data to a classifier. 

In [5]:
train_df_nan = train_df.replace("?", np.nan)
test_df_nan = test_df.replace("?", np.nan)
train_df_nan.shape

(19536, 15)

In [6]:
train_df_nan.sort_index()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
5,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
9,41,Private,70037,Some-college,10,Never-married,Craft-repair,Unmarried,White,Male,0,3004,60,,>50K
10,45,Private,172274,Doctorate,16,Divorced,Prof-specialty,Unmarried,Black,Female,0,3004,35,United-States,>50K
12,52,Private,129177,Bachelors,13,Widowed,Other-service,Not-in-family,White,Female,0,2824,20,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


The "?" symbols are now replaced with NaN values. 
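To make the difference concrete, here is a minimal sketch on a made-up three-row CSV (the data and the names `toy_df`, `toy_nan`, and `toy_direct` are hypothetical, not part of the assignment). It shows that `isna()` does not count "?" until the strings are replaced, and that `pd.read_csv` can alternatively be told to treat "?" as missing at load time through its `na_values` argument.

```python
import io

import numpy as np
import pandas as pd

# Made-up data: two of the three workclass entries are "?".
csv_text = "age,workclass\n39,Private\n50,?\n38,?\n"

toy_df = pd.read_csv(io.StringIO(csv_text))
missing_before = int(toy_df.isna().sum().sum())  # "?" is just a string here

# Same replacement as in the cell above.
toy_nan = toy_df.replace("?", np.nan)
missing_after = int(toy_nan["workclass"].isna().sum())

# Alternative: mark "?" as missing while reading the file.
toy_direct = pd.read_csv(io.StringIO(csv_text), na_values="?")
missing_direct = int(toy_direct["workclass"].isna().sum())

print(missing_before, missing_after, missing_direct)
```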

<br><br>

### 2.1 Visualizing features
rubric={points:10}

**Your tasks:**

1. Examine the information given by `train_df_nan.info()` and `train_df_nan.describe()` methods. In case of `.describe()`, use the `include="all"` argument to show summary statistics of all  features.
2. Visualize the histograms of numeric features. 
3. From the visualizations, which features seem relevant for the given prediction task?

> Note: (Optional) If you're feeling excited about this you are welcome to use [`pandas_profiling`](https://github.com/pandas-profiling/pandas-profiling) for more elaborate visualization and EDA. 

In [7]:
train_df_nan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19536 entries, 29859 to 29733
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             19536 non-null  int64 
 1   workclass       18435 non-null  object
 2   fnlwgt          19536 non-null  int64 
 3   education       19536 non-null  object
 4   education.num   19536 non-null  int64 
 5   marital.status  19536 non-null  object
 6   occupation      18431 non-null  object
 7   relationship    19536 non-null  object
 8   race            19536 non-null  object
 9   sex             19536 non-null  object
 10  capital.gain    19536 non-null  int64 
 11  capital.loss    19536 non-null  int64 
 12  hours.per.week  19536 non-null  int64 
 13  native.country  19194 non-null  object
 14  income          19536 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.4+ MB


In [8]:
train_df_nan.describe(include="all")

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
count,19536.0,18435,19536.0,19536,19536.0,19536,18431,19536,19536,19536,19536.0,19536.0,19536.0,19194,19536
unique,,8,,16,,7,14,6,5,2,,,,40,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,13582,,6261,,8923,2494,7856,16708,13067,,,,17496,14899
mean,38.583436,,189181.0,,10.086456,,,,,,1070.484029,84.400235,40.457719,,
std,13.679366,,103384.8,,2.581172,,,,,,7407.933397,394.006448,12.289522,,
min,17.0,,14878.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117963.0,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,177941.0,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,236804.2,,13.0,,,,,,0.0,0.0,45.0,,


<br><br>

### 2.2 Identify transformations to apply
rubric={points:18}

**Your tasks:**
1. Identify the sequence of transformations that you would apply on each column in the dataset and fill in the table below accordingly. An example of the sequence of transformations to be applied on the `occupation` feature is shown in the table below. You may decide not to apply any transformations on a certain column or entirely drop a column from your model. That's totally fine. 
2. Are there common transformations you would like to apply on certain types of features? Identify different feature types for applying different transformations. In particular, fill in the lists below. 
3. Is including the `race` feature for predicting income ethically a good idea? Briefly discuss. 

> Note: This question is a bit open-ended and there is no single correct solution.

| Feature | Transformation |
| --- | ----------- |
| occupation | imputation, OHE |
| age | Scale |
| workclass | imputation, OHE |
| fnlwgt | Scale |
| education | drop |
| education.num | passthrough |
| marital.status | OHE |
| relationship | OHE |
| race | drop |
| sex | OHE |
| capital.gain | Scale |
| capital.loss | Scale |
| hours.per.week | Scale |
| native.country | imputation, OHE |


In [9]:
# Fill in the lists below. 
# It's OK to keep some of the lists empty or add new lists. 
numeric_features = ["age", "fnlwgt", "capital.gain", "capital.loss", "hours.per.week"]
categorical_features = ["marital.status", "native.country", "occupation", "relationship", "sex", "workclass"]
ordinal_features = []
binary_features = ["sex"]  # unused below: sex is one-hot encoded with categorical_features
drop_features = ["education", "race"]
passthrough_features = ["education.num"]
imputer_features = ["occupation", "native.country", "workclass"]
target = "income"



### Answer:
I don't believe including race is ethically sound. Children born into wealthy families are more likely to maintain wealth later in life, and this has nothing to do with their race, as the [same trend is noticed in both white and black families](https://academic.oup.com/qje/article/135/2/711/5687353).


<br><br>

### 2.3 Separating feature vectors and targets  
rubric={points:4}

**Your tasks:**

1. Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df_nan` and `test_df_nan`. 
2. At this point, if you train [`sklearn`'s `SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on `X_train` and `y_train` would it work? Why or why not?

In [10]:
X_train, y_train = (train_df_nan.drop(columns=["income"]), train_df_nan["income"])
X_test, y_test = (test_df_nan.drop(columns=["income"]), test_df_nan["income"])


### Answer:
An `SVC` model will not work at this point because the data still contains categorical (string) columns and missing values. `SVC` is trained on numeric data, so at a bare minimum we need to impute the missing values and label/one-hot encode the categorical columns.
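As a quick check of this claim, the sketch below tries to fit `SVC` on a tiny made-up DataFrame that still has a string column (the data and the names `toy_X`, `toy_y`, and `fit_failed` are hypothetical). The expectation is that scikit-learn raises a `ValueError` when it tries to convert the strings to floats.

```python
import pandas as pd
from sklearn.svm import SVC

# Made-up data with one numeric and one string-valued column.
toy_X = pd.DataFrame(
    {
        "age": [25, 40, 31, 58],
        "workclass": ["Private", "State-gov", "Private", "Local-gov"],
    }
)
toy_y = ["<=50K", ">50K", "<=50K", ">50K"]

try:
    SVC().fit(toy_X, toy_y)
    fit_failed = False
except ValueError as err:
    # Input validation cannot convert the string column to floats.
    fit_failed = True
    print("SVC could not be fit:", err)
```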

## Exercise 3: Preprocessing <a name="3"></a>
<hr>

### 3.1 Preprocessing using `sklearn`'s `ColumnTransformer` and `Pipeline`
rubric={points:18}

Let's carry out preprocessing using `sklearn`'s `ColumnTransformer` and `Pipeline`. Note that you can define pipelines in two ways: 
- by using [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and explicitly providing named steps
- by using [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline), which automatically names the steps in the pipeline with their class names. 

Similarly you can create a column transformer in two ways:
- by using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
- by using [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) 

You may use the method of your choice but `make_pipeline` and `make_column_transformer` are highly recommended.  
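The naming difference between the two pipeline constructors can be sketched as follows. The step names `"imputer"` and `"onehot"` are arbitrary choices for illustration; the names produced by `make_pipeline` are the lowercased class names.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Explicit step names chosen by us (illustrative names).
explicit_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder()),
    ]
)

# Step names generated automatically from the class names.
auto_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

explicit_names = list(explicit_pipe.named_steps)
auto_names = list(auto_pipe.named_steps)
print(explicit_names, auto_names)
```

Knowing the generated names matters later, because they are the keys used to look up a fitted step inside a pipeline.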

**Your tasks:**

1. Create a column transformer `preprocessor` based on transformations you want to apply on the data from 2.2. 
2. Transform the data by calling `fit_transform` on the training set. What's the shape of the transformed data? 
3. Why do we need to use a column transformer in this case? Briefly explain. 

In [11]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(sparse=False)
        ),
        categorical_features
    ),
    ("passthrough", passthrough_features),
    ("drop", drop_features),
)
X_transformed = preprocessor.fit_transform(X_train)
# Recover the one-hot encoded column names from the encoder inside the pipeline.
ohe = preprocessor.transformers_[1][1]["onehotencoder"]
column_names = (
    numeric_features
    + ohe.get_feature_names_out(categorical_features).tolist()
    + passthrough_features
)
pd.DataFrame(X_transformed, columns=column_names)


Unnamed: 0,age,fnlwgt,capital.gain,capital.loss,hours.per.week,marital.status_Divorced,marital.status_Married-AF-spouse,marital.status_Married-civ-spouse,marital.status_Married-spouse-absent,marital.status_Never-married,...,sex_Male,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education.num
0,-1.139222,0.070873,-0.144509,-0.214216,-0.037246,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,10.0
1,0.395976,-0.485480,-0.144509,-0.214216,0.776476,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,10.0
2,-1.431641,1.505840,-0.144509,-0.214216,-1.664690,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9.0
3,-0.700594,-0.452090,-0.144509,-0.214216,1.590199,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0
4,0.322871,-1.569575,-0.144509,-0.214216,0.369615,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19531,-0.042652,-1.511528,-0.144509,-0.214216,1.183338,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.0
19532,-0.408175,-1.487713,-0.144509,-0.214216,0.369615,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,13.0
19533,0.761500,-0.526048,-0.144509,-0.214216,-0.037246,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,10.0
19534,1.931174,-1.602665,-0.144509,5.303602,-0.037246,0.0,0.0,1.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0


### Answer:
The shape of the transformed data is 19536 rows x 83 columns.

We need to use a column transformer because different columns need to undergo different preprocessing steps. Categorical data needs to be imputed and one-hot encoded, whereas numeric data needs to be scaled.

## Exercise 4: Building models <a name="4"></a>
<hr>

Now that we have preprocessed features, we are ready to build models. Below, I'm providing the function we used in class which returns mean cross-validation score along with standard deviation for a given model. Feel free to use it to keep track of your results if you like. 

In [12]:
results_dict = {} # dictionary to store all the results

In [13]:

def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    for i in range(len(mean_scores)):
        # Use positional access; the Series is indexed by score name.
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)

<br><br>

### 4.1 Baseline model 
rubric={points:6}

**Your tasks:**
1. Define a pipeline with two steps: `preprocessor` from 3.1 and `scikit-learn`'s `DummyClassifier` with `strategy="prior"` as your classifier.  
2. Carry out 5-fold cross-validation with the pipeline. Store the results in `results_dict` above. Display the results as a pandas DataFrame.  

> You may use the function `mean_std_cross_val_scores` above to carry out cross-validation and storing results. Refer to the class notes if you are unsure about how to use it. 

In [14]:
# TODO: Ask why np.random.seed(#) doesn't make the code deterministic
dummy_pipe = make_pipeline(preprocessor, DummyClassifier(strategy="prior"))
results_dict["Dummy"] = mean_std_cross_val_scores(dummy_pipe, X_train, y_train, return_train_score=True, cv = 5)
pd.DataFrame(results_dict)

|  | Dummy |
| --- | --- |
| fit_time | 0.058 (+/- 0.009) |
| score_time | 0.019 (+/- 0.003) |
| test_score | 0.763 (+/- 0.000) |
| train_score | 0.763 (+/- 0.000) |
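The 0.763 baseline is what one would expect if `DummyClassifier(strategy="prior")` always predicts the most frequent class: the `describe()` output above reports 14899 of the 19536 training rows as `<=50K`. A minimal sketch of that arithmetic, using made-up toy labels (`X_toy` and `y_toy` are hypothetical) alongside the counts from the training set:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 75% of the examples belong to the majority class.
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K"] * 25)
X_toy = np.zeros((len(y_toy), 1))  # the dummy classifier ignores the features

dummy = DummyClassifier(strategy="prior").fit(X_toy, y_toy)
toy_accuracy = dummy.score(X_toy, y_toy)  # equals the majority-class fraction

# Majority-class fraction in the census training split (counts from describe()).
census_majority_fraction = 14899 / 19536

print(toy_accuracy, round(census_majority_fraction, 3))
```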


<br><br>

### 4.2 Trying different classifiers
rubric={points:14}

**Your tasks:**

1. For each of the models in the starter code below: 
    - Define a pipeline with two steps: `preprocessor` from 3.1 and the model as your classifier. 
    - Carry out 5-fold cross-validation with the pipeline.  
    - Store the results in `results_dict`. 
2. Display all the results so far as a pandas dataframe. 
3. Compare the train and validation accuracies and `fit` and `score` times in each case. How do the validation accuracies compare to the baseline model from 4.1? Which model has the best validation accuracy? Which model is the fastest one?  

> Note that this might take a while to run.

> You may use the function above `mean_std_cross_val_scores` to carry out cross-validation and storing results. Refer to the class notes if you are unsure about how to use it. 

In [15]:
models = {
    "decision tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(),
}

In [16]:
for name, model in models.items():
    curr_pipe = make_pipeline(preprocessor, model)
    results_dict[name] = mean_std_cross_val_scores(curr_pipe, X_train, y_train, return_train_score=True, cv = 5)
pd.DataFrame(results_dict)

|  | Dummy | decision tree | kNN | RBF SVM |
| --- | --- | --- | --- | --- |
| fit_time | 0.058 (+/- 0.009) | 0.182 (+/- 0.018) | 0.064 (+/- 0.005) | 11.426 (+/- 1.012) |
| score_time | 0.019 (+/- 0.003) | 0.016 (+/- 0.002) | 1.283 (+/- 0.025) | 2.252 (+/- 0.331) |
| test_score | 0.763 (+/- 0.000) | 0.810 (+/- 0.005) | 0.829 (+/- 0.006) | 0.850 (+/- 0.004) |
| train_score | 0.763 (+/- 0.000) | 1.000 (+/- 0.000) | 0.879 (+/- 0.002) | 0.851 (+/- 0.001) |


### Answer

#### Decision Tree
The decision tree was the second slowest model to fit. It took only about 3x longer to fit than kNN, which does almost no work at fit time, which suggests that even the fully grown tree isn't very deep. The score time of the decision tree is approximately the same as that of the dummy classifier, which also supports this point. Since we never specified a maximum depth, the tree fits the training data perfectly (train_score of 1.000 vs. test_score of 0.810), which means there is a lot of overfitting. However, even with overfitting, the decision tree still does better than the dummy classifier by about 5 percentage points. **Other than the dummy classifier, this was the fastest model based on the fit + score time.**


#### kNN Classification
Besides the dummy classifier, this model took the shortest time to fit. This makes sense because kNN is a lazy learner: at fit time it does little more than remember the training data. It also makes sense that it takes a long time to score, because for every unseen example it must calculate distances to the examples in the training set. The accuracy on the validation set (test_score) is approximately 5 percentage points lower than the accuracy on the training set (train_score). A large gap between the two would suggest heavy overfitting, but a gap of 5 points suggests only mild overfitting. Compared to the dummy classifier, kNN improves validation accuracy by about 6 to 7 percentage points.

#### RBF SVM
The RBF SVM did better than kNN (0.850 vs. 0.829 validation accuracy). Fitting an RBF SVM is nontrivial compared to the other models, since it has to find the support vectors that best fit the data. This is why the RBF SVM takes the longest time to fit. The difference between the train and test scores is insignificant (0.851 vs. 0.850), which suggests a lack of overfitting. **Compared to the rest of the models, this was the most accurate.**

<br><br>

### (optional) 4.3 Exploring importance of scaling
rubric={points:1}

In this exercise you'll examine whether scaling helps in case of KNNs and SVM RBFs. 

**Your tasks:**

1. Create a column transformer without the `StandardScaler` step for `numeric_features`. 
2. Repeat the steps in 4.2 with this new column transformer. 
3. Compare the results of scaled numeric features with unscaled numeric features. Is scaling necessary for decision trees? Why or why not?

In [20]:
unscaled_results_dict = {
    "Dummy" : results_dict["Dummy"]
}

unscaled_ct = make_column_transformer(
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(sparse=False)
        ),
        categorical_features
    ),
    ("passthrough", passthrough_features),
    ("drop", drop_features),
)

for name, model in models.items():
    unscaled_pipe = make_pipeline(unscaled_ct, model)
    unscaled_results_dict[name] = mean_std_cross_val_scores(unscaled_pipe, X_train, y_train, return_train_score=True, cv = 5)
pd.DataFrame(unscaled_results_dict)



|  | Dummy | decision tree | kNN | RBF SVM |
| --- | --- | --- | --- | --- |
| fit_time | 0.058 (+/- 0.009) | 0.111 (+/- 0.024) | 0.070 (+/- 0.017) | 14.640 (+/- 1.160) |
| score_time | 0.019 (+/- 0.003) | 0.016 (+/- 0.003) | 1.596 (+/- 0.256) | 2.509 (+/- 0.206) |
| test_score | 0.763 (+/- 0.000) | 0.813 (+/- 0.005) | 0.803 (+/- 0.008) | 0.822 (+/- 0.003) |
| train_score | 0.763 (+/- 0.000) | 0.864 (+/- 0.001) | 0.826 (+/- 0.007) | 0.823 (+/- 0.001) |


### Answer
Scaling isn't necessary for decision trees: each split compares a single feature against a threshold, so no distance calculation is involved that could be affected by differing feature ranges.
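One caveat when reading the unscaled results: `make_column_transformer` defaults to `remainder="drop"`, so a column transformer that does not list the numeric columns at all discards them rather than leaving them unscaled. A minimal sketch on made-up data (`toy_X` and both transformer names are hypothetical) of the difference between omitting numeric columns and passing them through unchanged:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up data with one numeric and one categorical column.
toy_X = pd.DataFrame(
    {"age": [25, 40, 31, 58], "sex": ["Male", "Female", "Female", "Male"]}
)

# "age" is not listed, so the default remainder="drop" removes it.
ct_without_numeric = make_column_transformer((OneHotEncoder(), ["sex"]))

# "age" is kept as-is (unscaled) by passing it through explicitly.
ct_with_numeric = make_column_transformer(
    ("passthrough", ["age"]),
    (OneHotEncoder(), ["sex"]),
)

shape_without = ct_without_numeric.fit_transform(toy_X).shape
shape_with = ct_with_numeric.fit_transform(toy_X).shape
print(shape_without, shape_with)
```

Under this reading, a fair scaled-versus-unscaled comparison would add `("passthrough", numeric_features)` to the unscaled transformer before rerunning the models.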

### 4.4 Hyperparameter optimization
rubric={points:10}

In this exercise, you'll carry out hyperparameter optimization for the hyperparameter `C` of SVC RBF classifier. In practice you'll carry out hyperparameter optimization for all different hyperparameters for the most promising classifiers. For the purpose of this assignment, we'll only do it for the `SVC` classifier with one hyperparameter: `C`. 

**Your tasks:**

1. For each `C` value in the `param_grid` in the starter code below: 
    - Create a pipeline object with two steps: preprocessor from 3.1 and `SVC` classifier with the value of `C`.
    - Carry out 5-fold cross validation with the pipeline.  
    - Store the results in `results_dict` and display results as a pandas DataFrame. 
2. Which hyperparameter value seems to be performing the best? Is it different than the default value for the hyperparameter used by `scikit-learn`? 

> Note: Running this might take a while. 

In [29]:
param_grid = {"C": np.logspace(-2, 2, 4)}
cv_results = {"C": [], "mean_cv_score": []}

for c in param_grid["C"]:
    curr_cv_pipe = make_pipeline(preprocessor, SVC(C = c))
    cv_mean = np.mean(cross_validate(curr_cv_pipe, X_train, y_train, cv = 5)["test_score"])
    cv_results["C"].append(c)
    cv_results["mean_cv_score"].append(cv_mean)
pd.DataFrame(cv_results)
    

|  | C | mean_cv_score |
| --- | --- | --- |
| 0 | 0.01 | 0.772676 |
| 1 | 0.215443 | 0.842445 |
| 2 | 4.641589 | 0.852528 |
| 3 | 100.0 | 0.852221 |


### Answer:
`C = 4.641589` seems to have performed the best (mean CV score of 0.852528), although `C = 100` is nearly identical (0.852221). This differs from the default `C = 1.0` of `SVC`.
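For reference, the grid values come from `np.logspace(-2, 2, 4)`, which returns four points whose base-10 exponents are evenly spaced between -2 and 2. The short sketch below (variable names are illustrative) prints them and checks whether the default `C = 1.0` is among them; since it is not, any winner from this grid necessarily differs from the default.

```python
import numpy as np

# Four values with exponents -2, -2/3, 2/3, and 2.
c_grid = np.logspace(-2, 2, 4)

# Is scikit-learn's default C = 1.0 one of the candidates?
default_c_in_grid = bool(np.any(np.isclose(c_grid, 1.0)))

print(c_grid, default_c_in_grid)
```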

<br><br><br><br>

## Exercise 5: Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best performing model, it's time to assess our model on the set aside test set. In this exercise you'll examine whether the results you obtained using cross-validation on the train set are consistent with the results on the test set. 

### 5.1 Scoring on the unseen test set 
rubric={points:10}

**Your tasks:**

1. Train the best performing model on the entire training set. 
2. Report the results of this model on `X_test`. 
3. Are the cross-validation results and test results consistent? 

In [34]:
# TODO: Ask about this. Holand-Netherlands in test but not train which causes problems.
opt_C = param_grid["C"][2]
svm_clf = make_pipeline(preprocessor, SVC(C = opt_C))
svm_clf.fit(X_train, y_train)
svm_clf.score(X_test, y_test)


ValueError: Found unknown categories ['Holand-Netherlands'] in column 1 during transform

In [43]:
set(X_test["native.country"].unique()) - set(X_train["native.country"].unique())

{'Holand-Netherlands'}
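The error arises because `OneHotEncoder` defaults to `handle_unknown="error"`, so a category seen only at transform time raises a `ValueError`. One common way around this, sketched below on made-up data (`toy_train` and `toy_test` are hypothetical), is to construct the encoder with `handle_unknown="ignore"`, which encodes an unseen category as a row of all zeros. This is an assumption about how the preprocessor could be adjusted, not a result from this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up splits: the test category never appears in training.
toy_train = pd.DataFrame({"native.country": ["United-States", "Mexico", "United-States"]})
toy_test = pd.DataFrame({"native.country": ["Holand-Netherlands"]})

# Unknown categories are encoded as all zeros instead of raising an error.
ohe = OneHotEncoder(handle_unknown="ignore").fit(toy_train)
encoded_unknown = ohe.transform(toy_test).toarray()

print(encoded_unknown)
```

With the same argument set on the encoder inside `preprocessor`, the pipeline from the cell above should be able to score `X_test`.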

## Exercise 6: Summary

rubric={Pass/Fail}

You are all done with Homework 3! Your last task is reflecting on what you have learned and answering the following questions. **This task is necessary to pass HW3**. Failing to complete this task means you will not receive points for this assignment.

1. In about 100 words, describe what you have learned from this homework.
2. Write at least one well-formulated question on something that is still not clear about the content of this homework, or you would like to know more about.

I learned how to use column transformers to apply specific transformations to specific types of columns. Unfortunately, I ran into quite a lot of trouble accessing the new OHE feature names for question 3.1, since the lecture never discussed how to access them when the encoder is inside a pipeline that is itself inside a column transformer. To solve that, I also learned how to search for answers effectively on Google. In a similar vein, I gained a better understanding of unfamiliar NumPy/pandas functions through Google. 

Question: What's the easiest way of accessing the new names from an OHE in a column transformer if that OHE is inside a pipeline?

For example: How would you access:
```python
make_column_transformer(
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(sparse=False),  # This
        ),
        categorical_features,
    ),
)
```

<br><br><br><br>

## Submission instructions 

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 

Congratulations on finishing the homework! This was a tricky one but I hope you are feeling good after working on it. You are now ready to build a simple supervised machine learning pipeline on real-world datasets! Well done :clap:! 

![](eva-well-done.png)

