In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab2.ipynb")

![](img/571_lab_banner.png)

# Lab 2: Preprocessing

<br><br>

## Imports 

In [None]:
from hashlib import sha1
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert-warning">
    
## Instructions  
rubric={mechanics}

You will earn points for following these instructions and successfully submitting your work on Gradescope.  

### Before you start  

- Read the **[Use of Generative AI Policy](https://ubc-mds.github.io/policies/)**.
  
- Review the **[General Lab Instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/)**.
    
- Check the **[MDS Rubrics](https://github.com/UBC-MDS/public/tree/master/rubric)** for grading criteria.

### Before submitting  

- **Run all cells** (▶▶ button) to ensure the notebook executes cleanly from top to bottom.

  - Execution counts must start at **1** and be sequential.
    
  - Notebooks with missing outputs or errors may lose marks.
    
- **Include a clickable link to your GitHub repository** below this cell.

- Make at least 3 commits to your GitHub repository and ensure it's up to date. If Gradescope becomes inaccessible, we'll grade the most recent GitHub version submitted before the deadline.

- **Do not upload or push data files** used in this lab to GitHub or Gradescope. (A `.gitignore` is provided to prevent this.)  



### Submitting on Gradescope  

- Upload **only** your `.ipynb` file (with outputs shown) and any required output files. Do **not** submit extra files.
  
- If needed, refer to the [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/).  
- If your notebook is too large to render, also upload a **Web PDF** or **HTML** version.  
  - You can create one using **File $\rightarrow$ Save and Export Notebook As**.  
  - If you get an error when creating a PDF, try running the following commands in your lab directory:  

    ```bash
    conda install -c conda-forge nbconvert-playwright
    jupyter nbconvert --to webpdf lab2.ipynb
    ```  

  - Ensure all outputs are visible in your PDF or HTML file; TAs cannot grade your work if outputs are missing.

</div>


_Points:_ 4

YOUR REPO LINK GOES HERE

<!-- END QUESTION -->

<br><br>

## Introduction <a name="in"></a>
<hr>

A crucial step when using machine learning algorithms on real-world datasets is preprocessing. This homework gives you practice preprocessing data and building a supervised machine learning pipeline on a medium-sized dataset with several types of features.

<br><br>

## Exercise 1: Dataset and preliminary EDA
<hr>


In this lab, you will be working with [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#). Download the CSV and save it as `adult.csv` in the `data` folder inside this lab folder.

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).

The starter code below loads the data CSV (assuming that it is saved as `adult.csv` under the data folder). 

> ⚠️ _Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary._

In [None]:
census_df = pd.read_csv("data/adult.csv")
census_df.shape

<br><br>

<div class="alert alert-info">
    
### 1.1 Data splitting 
rubric={autograde}

To avoid violating the Golden Rule, we split the data before doing anything else.

**Your tasks:**

1. Split the data into `train_df` (40%) and `test_df` (60%) with `random_state = 123`. Keep the target column (`income`) in the splits so that we can use it in the exploratory data analysis.  

> ⚠️ Usually, having more data for training is a good idea. But in this lab we'll use a 40%/60% split because running cross-validation with this dataset can take a long time on a modest laptop. A smaller training set means the models will train a bit faster on your laptop. A side advantage is that a bigger test split gives us a more reliable estimate of model performance!

</div>
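
As a rough illustration of the 40%/60% split described above, here is a minimal sketch on a small made-up dataframe. The name `toy_df` and its columns are assumptions for illustration only; the lab itself uses `census_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the census data used in the lab.
toy_df = pd.DataFrame({"x": range(10), "label": ["a", "b"] * 5})

# train_size=0.4 keeps 40% of the rows for training; the rest go to the test split.
toy_train, toy_test = train_test_split(toy_df, train_size=0.4, random_state=123)

print(toy_train.shape, toy_test.shape)
```

Because the whole dataframe is passed in, the target column stays in both splits, which is convenient for exploratory analysis.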

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 1

In [None]:
train_df = None
test_df = None

...

In [None]:
grader.check("q1.1")

<br><br>

Let's examine our `train_df`. 

In [None]:
train_df.sort_index()

We see some missing values represented with a "?". These are probably questions that some people did not answer during the census. Usually the `.describe()` or `.info()` methods would give you information on missing values. Here, however, they won't count "?" as missing because it is stored as a string rather than an actual `NaN`. So let's replace these values with `np.nan` before we carry out EDA. If you skip this step, you'll encounter an error later when you try to pass this data to a classifier.

In [None]:
train_df = train_df.replace("?", np.nan)
test_df = test_df.replace("?", np.nan)
train_df.shape

In [None]:
train_df.sort_index()

The "?" symbols are now replaced with NaN values. 
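
To see why the replacement matters, here is a small sketch on made-up data (the `toy_df` name and its values are hypothetical). It counts missing values before and after swapping "?" for `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with "?" standing in for an unanswered question.
toy_df = pd.DataFrame(
    {"occupation": ["Sales", "?", "Tech-support"], "age": [25, 40, 31]}
)

# "?" is an ordinary string, so pandas does not count it as missing.
missing_before = toy_df.isna().sum()

# After the replacement, the same cell is a real NaN and is counted.
missing_after = toy_df.replace("?", np.nan).isna().sum()

print(missing_before["occupation"], missing_after["occupation"])
```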

<br><br>

<div class="alert alert-info">
    
### 1.2 `describe()` method
rubric={autograde}

The table below shows the output of `train_df.describe(include='all')`, which summarizes both numeric and categorical features.

**Your tasks:**

1. What is the highest number of hours per week that someone reported? Store it in a variable called `max_hours_per_week`.
2. What is the most frequently occurring occupation in this dataset? Store it in a variable called `most_freq_occupation`.
3. Store the column names of the columns with missing values as a list in a variable called `missing_vals_cols`. 
4. Store the column names of all numeric-looking columns, irrespective of whether you want to include them in your model or not, as a list in a variable called `numeric_cols`.

</div>
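
If the layout of `describe(include="all")` is unfamiliar, the sketch below may help. It runs on a small made-up dataframe (`toy_df` and its columns are assumptions, not lab data) and shows which summary rows are filled for numeric and for categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy_df = pd.DataFrame(
    {
        "hours": [40, 20, 60, 40],
        "job": ["Sales", "Sales", np.nan, "Clerical"],
    }
)

toy_summary = toy_df.describe(include="all")

# Numeric columns fill rows such as "max"; categorical columns fill "top" and "freq".
print(toy_summary.loc["max", "hours"])
print(toy_summary.loc["top", "job"])

# "count" excludes NaN, so a count below len(toy_df) signals missing values.
print(toy_summary.loc["count", "job"], len(toy_df))
```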

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

In [None]:
census_summary = train_df.describe(include="all")
census_summary

In [None]:
max_hours_per_week = None # 1.2.1

...

In [None]:
most_freq_occupation = None # 1.2.2

...

In [None]:
...

In [None]:
missing_vals_cols = None # 1.2.3
numeric_cols = None # 1.2.4

...

In [None]:
# Sorting the lists for the autograder
missing_vals_cols.sort()
numeric_cols.sort()

In [None]:
grader.check("q1.2")

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 1.3 Visualizing features
rubric={viz,reasoning}

**Your tasks:**

1. For each numeric feature listed in `numeric_cols`, generate overlapping histograms showing the distributions for the <=50K and >50K income classes, similar to how you did it in Lab 1. You may use any visualization library of your choice.
   
2. Write a brief summary (1 to 2 sentences) of your preliminary observations based on these histograms.

> ⚠️ If you use `Altair`, note that periods in column names (e.g., `capital.gain`, `capital.loss`) have a special meaning in `Altair`: they indicate nested data structures such as JSON fields. To avoid errors, consider renaming these columns (e.g., replacing `.` with `_`).

</div>
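
One possible way to do the renaming mentioned in the note is sketched below on a made-up dataframe. The names `toy_df` and `plot_df` are hypothetical; the idea is to keep the original dataframe unchanged and plot from a renamed copy.

```python
import pandas as pd

# Hypothetical toy data with periods in the column names.
toy_df = pd.DataFrame({"capital.gain": [0, 5000], "hours.per.week": [40, 50]})

# rename returns a new dataframe; str.replace swaps every "." for "_".
plot_df = toy_df.rename(columns=lambda name: name.replace(".", "_"))

print(list(plot_df.columns))
```

Plotting from a copy means the original column names remain available for the later modelling steps.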

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Identifying different feature types and transformations  
<hr>

Typically, data isn't readily formatted for direct input into machine learning models. It's crucial for a machine learning practitioner to examine each column and determine an effective method for encoding its information. Let's determine the types of features we have and come up with suitable encoding strategies for them. 

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 2.1 Identify transformations to apply
rubric={reasoning}

Before passing this data to a machine learning model, we need to apply some transformations on different features. Below we are providing possible transformations which can be applied on each column in `census_df`.  

**Your tasks:**
1. Write your justification or explanation for each row in the explanation column. Some example explanations are given below. 

> ⚠️ This question is a bit open-ended. If you do not agree with the provided transformation, feel free to argue your case in the explanation. That said, in this assignment, go with the transformations provided below for the purpose of autograding. 

> You can find the information about the columns [here](http://archive.ics.uci.edu/ml/datasets/Adult).

</div>

<div class="alert alert-warning">

Solution_2.1
    
</div>

| Feature | Transformation | Explanation |
| --- | ----------- | ----- |
| age | scaling with `StandardScaler` |  A numeric feature with no missing values, ranging from 17 to 90. Scaling is recommended due to its distinct range compared to other numeric features. While MinMaxScaler might be more suitable, using StandardScaler should be fine too.|
| workclass | imputation, one-hot encoding | |
| fnlwgt | drop |  The column represents the weight assigned by the US census bureau to each row. It is not really relevant for predicting income. |
| education | ordinal encoding | |
| education.num | drop | |
| marital.status | one-hot encoding  | |
| occupation | imputation, one-hot encoding  | Categorical column with missing values |
| relationship | one-hot encoding  | |
| race | drop  |  |
| sex | one-hot encoding with `drop="if_binary"` | |
| capital.gain | scaling with `StandardScaler` |  | 
| capital.loss | scaling with `StandardScaler` | numeric feature with no missing values |
| hours.per.week | scaling with `StandardScaler` | |
| native.country | imputation, one-hot encoding | | 


_Points:_ 10

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<div class="alert alert-info">
    
### 2.2 Identify feature types
rubric={autograde}


**Your tasks:**
1. Based on the types of transformations we want to apply on the features above, identify different feature types and store them in the variables below as lists.

</div>

<div class="alert alert-warning">
    
Solution_2.2
    
</div>

_Points:_ 5

In [None]:
# Fill in the lists below.
numeric_features = []
categorical_features = []
ordinal_features = []
binary_features = []
drop_features = []
target = "income"

...

In [None]:
# Sorting all the lists above for the autograder
numeric_features.sort()
categorical_features.sort()
ordinal_features.sort()
binary_features.sort()
drop_features.sort()

In [None]:
grader.check("q2.2")

<br><br><br><br>

## Exercise 3: Baseline model

<div class="alert alert-info">
    
### 3.1 Separating feature vectors and targets  
rubric={autograde}

**Your tasks:**

1. Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df` and `test_df`.

</div>

<div class="alert alert-warning">
    
Solution_3.1
    
</div>

_Points:_ 1

In [None]:
X_train = None
y_train = None
X_test = None
y_test = None

...

In [None]:
grader.check("q3.1")

<br><br>

<div class="alert alert-info">
    
### 3.2 Dummy classifier
rubric={autograde}

**Your tasks:**

1. Carry out 5-fold cross-validation with a `DummyClassifier` using [`scikit-learn`'s `cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function with `return_train_score=True`, and store the results as a dataframe named `dummy_df` where each row corresponds to the results from one cross-validation fold.

</div>

<div class="alert alert-warning">
    
Solution_3.2
    
</div>

_Points:_ 1

In [None]:
dummy_df = None 

...

In [None]:
grader.check("q3.2")

<br><br><br><br>

## Exercise 4: Column transformer 
<hr>

In this dataset, we have different types of features: numeric features, an ordinal feature, categorical features, and a binary feature. We want to apply different transformations to different columns, so we need a column transformer. First, we'll define the transformations for each feature type, and then we'll create a `scikit-learn` `ColumnTransformer` using `make_column_transformer`. For example, the code below creates a `numeric_transformer` for numeric features. 

In [None]:
from sklearn.preprocessing import StandardScaler

numeric_transformer = StandardScaler()

In the exercises below, you'll create transformers for other types of features. 

<br><br>

<div class="alert alert-info">
    
### 4.1 Preprocessing ordinal features
rubric={autograde}

**Your tasks:**

1. Create a transformer called `ordinal_transformer` for our ordinal features. 

> ⚠️ Note that you need to provide an ordered list of categories when defining your `OrdinalEncoder`. The correct order for education levels isn't obvious, so for this exercise, assume the following order: "HS-grad" < "Prof-school" < "Assoc-voc" < "Assoc-acdm" < "Some-college" < "Bachelors"

</div>
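
The sketch below shows how an ordered category list is passed to `OrdinalEncoder`, using a made-up `size` feature rather than the education column. The names `toy_X`, `size_order`, and `toy_encoder` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy feature with a natural order: small < medium < large.
toy_X = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# categories takes one ordered list per column, lowest level first.
size_order = ["small", "medium", "large"]
toy_encoder = OrdinalEncoder(categories=[size_order], dtype=int)

encoded = toy_encoder.fit_transform(toy_X)
print(encoded.ravel())
```

Each value is mapped to its position in the supplied list, so the integers reflect the intended order rather than alphabetical order.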

<div class="alert alert-warning">
    
Solution_4.1
    
</div>

_Points:_ 5

In [None]:
ordinal_transformer = None

...

In [None]:
...

In [None]:
...

In [None]:
grader.check("q4.1")

<br><br>

Now we'll create a transformer called `binary_transformer` that encodes our binary features as the integers 0 and 1.

> ⚠️ _Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary._


In [None]:
from sklearn.preprocessing import OneHotEncoder

binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)

<br><br>

<div class="alert alert-info">
    
### 4.2 Preprocessing categorical features
rubric={autograde}

There are a few categorical features with missing values in our dataset. Our initial step is to impute these missing values before proceeding to one-hot encode the features. For this assignment, apply imputation to all categorical features, regardless of whether they have missing values. If a feature lacks missing values, the imputation step will have no effect.

If we want to apply more than one transformation on a set of features, we need to create a [`scikit-learn` `Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). For example, for categorical features we can create a `scikit-learn` `Pipeline` with first step as imputation and the second step as one-hot encoding. 

**Your tasks:**

1. Create a `sklearn` `Pipeline` using [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) called `categorical_transformer` for our categorical features with two steps:
- `SimpleImputer` for imputation with `strategy="constant"` and `fill_value="missing"`
- `OneHotEncoder` with `handle_unknown="ignore"` and `sparse_output=False` for one-hot encoding.

</div>
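
To illustrate how `make_pipeline` chains steps in general, the sketch below uses a deliberately different pair of steps from the task: median imputation followed by scaling, on a made-up numeric column. The names `toy_X` and `toy_pipe` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value.
toy_X = np.array([[1.0], [np.nan], [3.0]])

# Steps run in order: fill the NaN with the column median, then standardize.
toy_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_out = toy_pipe.fit_transform(toy_X)

print(toy_out.ravel())
```

The output of each step is passed as input to the next one, so the scaler never sees the missing value.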

<div class="alert alert-warning">
    
Solution_4.2
    
</div>

_Points:_ 4

In [None]:
categorical_transformer = None

...

In [None]:
grader.check("q4.2")

<br><br>

<div class="alert alert-info">
    
### 4.3 Creating a column transformer
rubric={autograde}

Now we're ready to combine the different transformers using a `ColumnTransformer`.

**Your tasks:**
1. Create a `sklearn` `ColumnTransformer` named `preprocessor` using [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) with the transformers defined in the previous exercises. Use the sequence below in the column transformer and add a "drop" step for the `drop_features` at the end.  
    - `numeric_transformer`
    - `ordinal_transformer`
    - `binary_transformer`
    - `categorical_transformer`
2. Transform the data by calling `fit_transform` on the training set and save it as a dataframe in a variable called `transformed_df`. How many new columns have been created in the preprocessed data in comparison to the original `X_train`? Store the difference between the number of columns in `transformed_df` and `X_train` in a variable called `n_new_cols`. 

> You are not required to do this but optionally you can try to get column names of the transformed data and create the dataframe `transformed_df` with proper column names.

</div>

<div class="alert alert-warning">
    
Solution_4.3
    
</div>

_Points:_ 7

In [None]:
preprocessor = None

...

In [None]:
transformed_df = None
n_new_cols = None

...

In [None]:
transformed_df

In [None]:
n_new_cols

In [None]:
grader.check("q4.3")

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 4.4 Short answer questions
rubric={reasoning}

**Your tasks:**

Answer each of the following questions in 2 to 3 sentences. 

1. What is the problem with calling `fit_transform` on your test data with `StandardScaler`?
2. Why is it important to follow the Golden Rule? If you violate it, will that give you a worse classifier?
3. When is it appropriate to use sklearn `ColumnTransformer`?

</div>

<div class="alert alert-warning">
    
Solution_4.4
    
</div>

_Points:_ 3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 5: Building models 

Now that we have preprocessed features, we are ready to build models. Below, I'm providing the function we used in class which returns mean cross-validation score along with standard deviation for a given model. Use it to keep track of your results. 

In [None]:
results_dict = {}  # dictionary to store all the results

In [None]:
from sklearn.model_selection import cross_validate


def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross-validation scores.

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data
    **kwargs :
        additional keyword arguments passed on to cross_validate

    Returns
    -------
        pandas Series of "mean (+/- std)" strings, one per cross_validate output
    """

    # Run cross-validation once and summarize each output column
    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))
    mean_scores = scores.mean()
    std_scores = scores.std()

    out_col = [
        "%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i])
        for i in range(len(mean_scores))
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

Below, I'm showing an example where I call `mean_std_cross_val_scores` with `DummyClassifier`. The function calls `cross_validate` with the passed arguments and returns a series with mean cross-validation results and std of cross-validation. When you train new models, you can just add the results of these models in `results_dict`, which can be easily converted to a dataframe so that you can have a table with all your results. 

In [None]:
# Baseline model

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline

dummy = DummyClassifier(random_state=123)
pipe = make_pipeline(preprocessor, dummy)
results_dict["dummy"] = mean_std_cross_val_scores(
    pipe, X_train, y_train, cv=5, return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 5.1 Trying different classifiers
rubric={accuracy,quality,reasoning}

**Your tasks:**

1. For each model provided in the starter code below:
    - Create a pipeline using `make_pipeline` with two steps: the preprocessor from section 4.3 and the model as your classifier.
      
    - Conduct 5-fold cross-validation using the pipeline. Obtain the mean cross-validation scores and standard deviation using the `mean_std_cross_val_scores` function provided earlier.
    - Store the results in a DataFrame called `income_pred_results_df`, using the model names from the dictionary below as the index. Each row should contain the output from the `mean_std_cross_val_scores` function. In other words, `income_pred_results_df` should look similar to the earlier `results_df` DataFrame, but with additional rows for the new models, as illustrated below:

  | Model          | fit_time | score_time | test_score | train_score |
  |----------------|-----------|-------------|-------------|--------------|
  | dummy          |           |             |             |              |
  | decision tree  |           |             |             |              |
  | kNN            |           |             |             |              |
  | RBF SVM        |           |             |             |              |
  
2. Among the models (excluding the dummy model), which one shows the highest degree of overfitting and which one exhibits the least overfitting?


> ⚠️ Note: The execution might take some time. Please be patient!

</div>

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "decision tree": DecisionTreeClassifier(random_state=123),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(random_state=123),
}

<div class="alert alert-warning">
    
Solution_5.1
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
income_pred_results_df = None 
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 5.2 Hyperparameter optimization
rubric={accuracy,quality}

In this exercise, you'll carry out hyperparameter optimization for the hyperparameter `C` of SVC RBF classifier. In practice, you'll carry out hyperparameter optimization for all different hyperparameters of the most promising classifiers. For the purpose of this assignment, we'll only do it for the `SVC` classifier with one hyperparameter, namely `C`. 

**Your tasks:**

1. For each `C` value in the `param_grid` below: 
    - Create a pipeline object with two steps: preprocessor from 4.3 and `SVC` classifier with the `C` value.
    - Carry out 5-fold cross validation with the pipeline.  
    - Store the results in `results_dict` and display results as a pandas DataFrame called `results_df`. In essence, `results_df` should resemble the `income_pred_results_df` dataframe you created earlier, but has additional rows for SVC RBF models with different C values.
2. Which hyperparameter value seems to perform the best? In this assignment, consider the hyperparameter value that gives you the highest cross-validation score as the "best" one. Store it in a variable called `best_C`. (Since this question is not autograded, please store the value directly as a number, something like `best_C = 0.001`, if `C = 0.001` gives you the highest CV score.) Is it different from the default value that `scikit-learn` uses for this hyperparameter? 

> Note: Running this will take a while. Please be patient.

</div>

In [None]:
param_grid = {"C": [0.1, 100, 1000]}
param_grid

<div class="alert alert-warning">
    
Solution_5.2
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
best_C = None

...

<!-- END QUESTION -->

<br><br>

<div class="alert alert-info">
    
### 5.3 Scoring on the unseen test set 
rubric={autograde}

Now that we have a best performing model, it's time to assess our model on the set aside test set. In this exercise, you'll examine whether the results you obtained using cross-validation on the train set are consistent with the results on the test set. 

**Your tasks:**

1. Create a pipeline named `final_pipeline` with the preprocessor from 4.3 as the first step and the best performing SVC model from 5.2 as the second step. 
2. Train the pipeline on the entire training set `X_train` and `y_train`. 
3. Score the pipeline on `X_test` and `y_test` and store the score in a variable called `test_score`.

</div>

<div class="alert alert-warning">
    
Solution_5.3
    
</div>

_Points:_ 3

In [None]:
final_pipeline = None
test_score = None

...

In [None]:
grader.check("q5.3")

<br><br><br><br>

## Exercise 6: Food for thought
<hr>

Each lab will have a few challenging questions. In some labs, I will include challenging questions that lead into the material for the upcoming week. These are usually low-risk questions and will contribute at most 5% of the lab grade. The main purpose here is to challenge yourself or dig deeper in a particular area. When you start working on labs, attempt all other questions before moving on to these questions. If you are running out of time, please skip these questions. 

We will be more strict with the marking of these questions. There might not be model answers. If you want to get full points in these questions, your answers need to
- be thorough, thoughtful, and well-written
- provide convincing justification and appropriate evidence for the claims you make 
- impress the reader of your lab with your understanding of the material, your analytical and critical reasoning skills, and your ability to think on your own

![](img/eva-game-on.png)

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### (Challenging) 6.1 The `native.country` column
rubric={reasoning}

In our column transformer above, we treated `native.country` as a categorical feature, where a new column will be created for each unique category in this column.

**Your tasks:**

1. Examine the `value_counts` for this column.
2. Point out the problems/limitations associated with the current encoding of this column.   
3. Propose and implement a better approach to encode the column. Justify why your approach is better.  
4. Examine whether you get better accuracy with your best model when you use this encoding. Discuss your results.

</div>

<div class="alert alert-warning">
    
Solution_6.1
    
</div>

_Points:_ 3

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
preprocessor_q_6_1 = None
...

In [None]:
transformed_df_q_6_1 = None
n_new_cols_q_6_1 = None

...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br><br><br>

Before submitting your assignment, please ensure you have followed all the steps in the **Instructions** section at the top.  

### Submission checklist  

- [ ] Restart the kernel and run all cells (▶▶ button)
- [ ] Make at least three commits to your GitHub repository. 
- [ ] The `.ipynb` file runs without errors and shows all outputs.  
- [ ] Only the `.ipynb` file and required output files are uploaded (no extra files).  
- [ ] If the `.ipynb` file is too large to render on Gradescope, upload a Web PDF and/or HTML version as well.
- [ ] Include the link to your lab GitHub repository below the instructions.  


Congratulations on finishing the homework! This was a tricky one but I hope you are feeling good after working on it. You are now ready to build a simple supervised machine learning pipeline on real-world datasets! Well done 👏👏!

![](img/eva-well-done.png)

