<a href="https://colab.research.google.com/github/MarvNC/cs523/blob/master/s25_chapter6_handout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preface

I changed challenge 2. The video suggests you just plug in `KNNImputer`. I am now asking you to write a custom class that wraps `KNNImputer`. So you will need to add it, and some auxiliary code (e.g., `set_config`) that goes with it.

<center>
<h1>Chapter Six</h1>
</center>

<hr>

## LEARNING OBJECTIVES:
- Look at final wrangling step, imputation. Introduce several alternatives.
- Test processing time for alternatives. Becomes important in production system.
- Capture final choice, KNNImputer, in a custom Transformer.
- Place it correctly in Pipeline.

# I. Imputation

The last thing I would like to consider in our pipeline is dealing with those NaN values. Most machine learning algorithms will choke on them. So we need to replace them with numbers. The process is called imputation.


## Set-up

First bring in your library.

In [None]:
github_name = 'smith'
repo_name = 'cis423'
source_file = 'library.py'
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
!rm $source_file
!wget $url
%run -i $source_file

In [None]:
type(CustomRobustTransformer)

In [None]:
url = 'https://raw.githubusercontent.com/fickas/asynch_models/refs/heads/main/datasets/titanic_trimmed.csv'
titanic_table = pd.read_csv(url)  #using our new package to read in an entire dataset - the coolest

In [None]:
titanic_table.head()  #print first 5 rows of the table

In [None]:
titanic_features = titanic_table.drop(columns='Survived')

## Wrangle using your pipeline

I added a Tukey check on `Age`. As we know from the last chapter, this does not cause any clipping with Titanic data. However, we may see new data so want it in place. I am using the outer fence, meaning I am only interested in "probables".

If you are still working on the `TukeyTransformer` from last chapter, you can leave it off for now.

In [None]:
from sklearn.pipeline import Pipeline

## I'm going to add Tukey to Age and Fare

In [None]:
#don't change this code - it loads from your library

transformed_df = titanic_transformer.fit_transform(titanic_features)

In [None]:
transformed_df.head(10)

|index|Age|Gender|Class|Married|Fare|Joined\_Belfast|Joined\_Cherbourg|Joined\_Queenstown|Joined\_Southampton|
|---|---|---|---|---|---|---|---|---|---|
|0|0\.5526315789473685|0|1\.0|0\.0|-0\.2553191489361702|0|0|0|1|
|1|-0\.5|0|0\.0|0\.0|-0\.5531914893617021|0|0|0|1|
|2|-0\.9210526315789473|0|1\.0|NaN|0\.2978723404255319|0|0|0|1|
|3|-0\.7631578947368421|0|1\.0|0\.0|NaN|0|0|0|1|
|4|NaN|0|2\.0|0\.0|0\.46808510638297873|0|1|0|0|
|5|0\.18421052631578946|0|NaN|1\.0|-0\.5531914893617021|0|0|0|1|
|6|0\.7631578947368421|0|1\.0|0\.0|-0\.2553191489361702|0|1|0|0|
|7|-0\.6052631578947368|0|1\.0|0\.0|-0\.2553191489361702|0|1|0|0|
|8|-0\.23684210526315788|0|1\.0|0\.0|-0\.2127659574468085|0|0|0|1|
|9|-0\.23684210526315788|0|0\.0|0\.0|-0\.5531914893617021|0|0|0|1|

# II. Not all NaNs the same

The question is how did the NaN end up there in the first place? Here is a general classification.

* Missing completely at random (MCAR). Cannot be tied to any known variable or event. Perhaps random data entry error. Contains no interesting information.

* Missing at random (MAR). A bit of a misnomer, the missing value is linked to a known cause. Perhaps one of the ship staff was bad at filling in data. So there is a cause: careless staff. But we have lost that information by this time.

* Not missing at random (NMAR). Maybe some passengers did not want to give their nationality because of fear of discrimination. So certain nationalities go missing. We might be able to infer a value from other columns, e.g., Fare, Class.

The question is whether we can detect these 3 cases. If we have a NaN in the `Age` column, can we differentiate among the 3?

We will end up doing a bit of NMAR: try to infer values using other columns. But first let's look at a few simpler approaches.

# III. Simplest approach: deletion

We could simply delete all rows that have a NaN in any column. Could do it with one line of pandas:
<pre>
dropped_table = titanic_features.dropna(axis=0)  #axis=0 says rows
</pre>
But I don't like doing that. Especially with smallish datasets. It loses information. I'd rather try to repair such rows.
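
To see how much deletion can cost, here is a minimal sketch on a made-up 4-row table (the `toy` values are hypothetical, not Titanic data). Two NaNs in different rows are enough to wipe out half the table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, not the Titanic data
toy = pd.DataFrame({'Age':   [22.0, np.nan, 35.0, 41.0],
                    'Fare':  [7.25, 71.28, np.nan, 8.05],
                    'Class': [3, 1, 1, 3]})

dropped = toy.dropna(axis=0)            #drop every row that has at least one NaN
rows_lost = len(toy) - len(dropped)
print(f'{rows_lost} of {len(toy)} rows lost')  #2 of 4 rows lost
```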

# IV. Use stats to replace NaNs

I'll cover 2 general ways to impute a value for a NaN. The first is to use descriptive statistics for each column. So if we have NaNs in the `Age` column, compute the mean of the column and use it to replace the NaNs. Could also use the median or the mode.

The code is simple:
<pre>
transformed_df['Fare'] = transformed_df['Fare'].fillna(value=transformed_df['Fare'].mean())
</pre>
And here is the good news. We can use one of our existing transformers to do it!

In [None]:
scaler = CustomMappingTransformer('Fare', {np.nan: transformed_df['Fare'].mean()})  #need np.nan to match a NaN in table
new_df = scaler.fit_transform(transformed_df)
print(transformed_df['Fare'].isna().sum(), new_df['Fare'].isna().sum())

In [None]:
new_df.head()

## sklearn has mean transformer built-in

It will compute the means for all the columns and fill in NaN values for us in one fell swoop. Cool.

<pre>
means = MeanImputerTransformer()
new_df = means.fit_transform(transformed_df)
</pre>

I won't give this as a challenge, but I hope you feel confident you could write it if you needed to.
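
For reference, the class sklearn ships for this job is `SimpleImputer`. The sketch below assumes a made-up 3-row table (not Titanic data) and wraps the result back into a dataframe by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table, not the Titanic data
toy = pd.DataFrame({'Age':  [20.0, np.nan, 40.0],
                    'Fare': [10.0, 30.0, np.nan]})

imputer = SimpleImputer(strategy='mean')   #strategy could also be 'median' or 'most_frequent'
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)  #the NaN in Age becomes 30.0 and the NaN in Fare becomes 20.0
```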

## The problem

Using the mean works best when a column is roughly symmetric, e.g., normally distributed. That may be true of the `Age` column, but not of `Fare`, which is skewed. I think I have a better way.

The new way will look at rows instead of columns.
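
A quick way to see the concern with the mean is to compare it with the median on a skewed column. The fares below are made up for illustration: one expensive ticket drags the mean far away from what a typical passenger paid, while the median barely notices.

```python
from statistics import mean, median

fares = [7, 8, 8, 9, 500]   #hypothetical fares with one very expensive ticket

fare_mean = mean(fares)
fare_median = median(fares)
print(fare_mean, fare_median)  #106.4 8
```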

# V. Use crowd sourcing to replace NaNs

In essence, if we have a NaN in a row, use the other rows to infer the value. So if we have a NaN for the age of Mr. Smith, look at all the other rows. See if we can find characteristics from these other rows that will allow us to fill in a value for Smith. For instance, if Smith is in first class, married and paid 50 pounds for his ticket, find other males who were in first class, married, and paid a similar amount. Get their ages. Maybe average to get Smith's age.



## K Nearest Neighbors (KNN)

The good news is that there is a well-known technique for doing what I described: finding the rows closest to Smith and then averaging results. It is called KNN. And sklearn has an imputer built on it. Hurray.

I'll give you a brief intro to KNN and then see it in action.


## First insight: recast as a Geometry problem
<img src='https://www.dropbox.com/s/9fcc1crlxp19ijt/major_section.png?raw=1' width='300'>





Our goal is to compute how "similar" 2 rows (2 lists of numbers) are.
Let's take the view that each list of numbers is actually a point in 9-dimensional space (one dimension per column). Sounds kind of scary already! Let me give you the intuition pretending we only have 2 numbers in each list. Call the two lists A and B. We can view each list as a point on a 2D plot as shown below. Ignore the angle theta for now.

<img src='https://www.dropbox.com/s/7rtuzw37hgl1oi8/ed_vs_cos.png?raw=1' height=300>

We can use the distance d as a measure of their similarity. The smaller the distance, the closer they are. If the distance is 0, they are the same point.

### Good news!
<img src='https://www.sapaviva.com/wp-content/uploads/2017/06/6S.-Euclid-of-Alexandria-ca.320-275-BC-225x225.jpg' height='100'>

Euclid (circa 300BC) figured it out for us. He came up with a formula for computing d. It defines the "Euclidean distance" between 2 points (here called p and q instead of A and B) as follows.

  <p>
<img src='https://www.dropbox.com/s/9wao0kf3u32i3e9/euclidian.png?raw=1'>

Linking our plot above into this:

* n = 2. We have 2 values in each list.
* q1 is the same as x1, p1 is the same as x2.
* q2 is the same as y1, p2 is the same as y2.

The capital (Greek) letter sigma represents summation. I'll write the code out longhand for you.



In [None]:
p = [1,3]  #just guessing from the plot
q = [3,2]  #ditto
n = 2

In [None]:
greek_sigma = (p[0]-q[0])**2 + (p[1]-q[1])**2
greek_sigma

In [None]:
d = greek_sigma**.5
d

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg'
height=50 align=center>


Rewrite code above to use a list comprehension that will compute distance for any 2 lists of numbers p and q.

Answer: 2.23606797749979 for p and q above.



In [None]:
#I did it in one line by summing a list comprehension then taking the square root.
#Answer: 2.23606797749979



### Does 2.23606797749979 mean the 2 points are close?

Euclidean distance just gives you the distance. Makes no value judgement. We do know that a distance of 0 says the points are exactly the same. So that is the min value. But there is no limit on the upper value.
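
As a sanity check on both claims, the standard library's `math.dist` (Python 3.8+) computes Euclidean distance directly, so it can serve as an oracle for your own version. The extra points below are made up for illustration.

```python
import math

p = [1, 3]
q = [3, 2]

d = math.dist(p, q)                      #same p and q as above
same = math.dist(p, p)                   #a point compared with itself
far = math.dist([0, 0], [300, 400])      #nothing stops the distance from growing
print(d, same, far)  #about 2.236, then 0.0, then 500.0
```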

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg'
height=50 align=center>

Go ahead and try your solution on the first 2 rows from transformed_df.


### My answer: 1.4821474866516537



In [None]:
row1 = transformed_df.loc[0].to_list()
row2 = transformed_df.loc[1].to_list()
print(row1)  #[0.5526315789473685, 0.0, 1.0, 0.0, -0.2553191489361702, 0.0, 0.0, 0.0, 1.0]
print(row2)  #[-0.5, 0.0, 0.0, 0.0, -0.5531914893617021, 0.0, 0.0, 0.0, 1.0]

In [None]:
#you can use your solution from above here.
#Answer: 1.4821474866516537



## We have a similarity!

We checked a person (row1) against another person (row2) and found a distance of roughly `1.48`. But wait. Is that good? I'm going to punt on that question for now. What I want to do with KNN is find the Euclidean distance between row1 and every other row. So if there are 2000 people/rows, I will come up with a list of 1999 distances. I then choose the ones with the smallest distance. These become my experts.

Let's pretend that we are dubious of the `Married` column in row1. We want the experts' opinion on what the married value should be for row1.
The next step is to have the experts vote on the `Married` value. If most of the experts have a 0 (actually a value <.5) in `Married`, then the voting result is 0. If most have 1 (a value >=.5), then voting result is 1. I take the vote result and that is my prediction.

I was a bit vague with this: choose the ones with the smallest distance. How many should I choose? That is where the K in KNN comes in. I choose the top K. You get to choose the value of K. A typical value is 5. But it could easily be the case that 11 or even higher will give better results. With binary columns like `Married`, I like to use an odd number to avoid ties.

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg'
height=50 align=center>


Among the 4 rows with indices 6 through 9, find the 4 distances to row1.




In [None]:
row1 = transformed_df.loc[0].to_list()  #row at index 0
transformed_df[6:10]  #here are the 4 rows we will compare (differs from video)

I used a combo of a for loop and list comprehension to create `diffs` (list of pairs).

In [None]:
#your code


In [None]:

print(diffs)  #[(6, 1.4297976533901184), (7, 1.8277637214931934), (8, 0.7906196760559859), (9, 1.3084328906182103)]

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg'
height=50 align=center>

Ok, use the `sorted` function to sort on distances.



In [None]:
#here is answer from last quiz in case you did not get it.

diffs = [(6, 1.4297976533901184), (7, 1.8277637214931934), (8, 0.7906196760559859), (9, 1.3084328906182103)]

In [None]:
#sort on distance (2nd item). Reminder: sorted function takes a key argument so you can sort on 2nd item in pair

sdist =


In [None]:
print(sdist)  #[(8, 0.7906196760559859), (9, 1.3084328906182103), (6, 1.4297976533901184), (7, 1.8277637214931934)]

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg'
height=50 align=center>

Pull off top k rows, i.e., just their indices. I came up with the list `[8, 9, 6]`.



In [None]:
#now get indices of top k rows in terms of closeness to row 1
k = 3
#your code below



In [None]:
knn_row_indices  #[8, 9, 6]

## Let's say we did not know the marital status of row1

Let our top 3 rows vote!

In [None]:
the_votes = [transformed_df.loc[i, 'Married'] for i in knn_row_indices]
married_average = sum(the_votes)/len(the_votes)


In [None]:
(the_votes, married_average)

## Votes correct?

In [None]:
transformed_df.loc[0, 'Married']  #real answer

## I hope you see I only have to compute `sdist` once

Then I can choose how many to pull off the top (i.e., k) to vote.
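
Here is a minimal sketch of that idea on made-up rows (the `crowd` data and the helper name `vote` are hypothetical, not part of the library). The distances are computed and sorted once; changing k only changes how many votes get pulled off the front.

```python
import math

target = [0.5, 1.0]   #the row whose Married value we are unsure of

# Hypothetical neighbors: name -> (features, married)
crowd = {'r1': ([0.4, 1.0], 0),
         'r2': ([0.7, 1.0], 0),
         'r3': ([0.5, 0.0], 1),
         'r4': ([3.0, 2.0], 1),
         'r5': ([0.9, 1.0], 1)}

#compute and sort distances once
sdist = sorted([(name, math.dist(target, feats)) for name, (feats, _) in crowd.items()],
               key=lambda pair: pair[1])

def vote(k):
  """Majority vote on Married among the k closest rows."""
  the_votes = [crowd[name][1] for name, _ in sdist[:k]]
  return 1 if sum(the_votes)/len(the_votes) >= .5 else 0

print([name for name, _ in sdist])  #['r1', 'r2', 'r5', 'r3', 'r4']
print(vote(1), vote(3), vote(5))    #0 0 1
```

Notice the prediction flips at k=5, which is one reason the choice of k matters.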


## cosine similarity
<img src='https://www.dropbox.com/s/9fcc1crlxp19ijt/major_section.png?raw=1' width='300'>

We have been using Euclidean Distance to measure the similarity of 2 lists of numbers. Are there alternatives? Yes. One fairly common measure is *cosine similarity*. Let's look at this diagram again.

<img src='https://www.dropbox.com/s/7rtuzw37hgl1oi8/ed_vs_cos.png?raw=1' height=300>

We know that d represents the euclidean distance. But you also see the greek letter theta that measures an angle. We got the angle by drawing 2 lines both starting at (0,0). One line goes to A and the other line goes to B. What I want to do is measure the cosine of the angle (theta) that I get when I draw these lines.

Cosine, residing in both Geometry and Trigonometry, brings new jargon. The lines are called *vectors*. I find it all kind of interesting. Nothing has really changed on our side. We have 2 lists of numbers. But we can take different mathematical perspectives (distance versus angle) to get more abstract views. And this gives us mathematical tools we can use. Cool.

I know you probably have the formula for the cosine of theta scribbled somewhere from your math courses. But just in case, I'll give it to you below.

<img src='https://www.dropbox.com/s/oi1ttx99hf0uejn/cosine.png?raw=1'>

**What are A and B in this formula?** They are 2 lists of values, e.g., 2 rows from `transformed_df`.

Looking at the right-hand side of the formula, **what are A-sub-i and B-sub-i**? In Python, they would be A[i] and B[i].

**What is the greek letter sigma?** We have seen it before. It is summation.
There is one tricky part. You can see that on the sigmas they have i=1, meaning they assume that the lists are indexed 1,2,3, etc. As any sane person would do it. But you know in Python land, you have to have i=0.


The official description of the left-hand version of the formula: for the **numerator** we take the *dot-product* of the 2 vectors, and for the **denominator** we multiply the *L2 norms* of the 2 vectors (the parallel bars around A and B). The formula is nice because you can see how the jargony terms translate into algebra and things we know how to do in pure Python. I can also tell you that `numpy` knows how to deal with the dot-product and L2 norms directly. So when you import numpy, you bring in another abstraction level, one that knows about *linear algebra*.

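One nice thing about the cosine is that it is bounded between -1 and 1. A 1 means the 2 vectors point in exactly the same direction (identical points are one such case), 0 means they are perpendicular, and -1 means they point in opposite directions.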
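
A small numpy sketch makes the bounds concrete. The helper name `cosine` and the toy vectors are assumptions for illustration. Note that a cosine of 1 only says the two vectors point in the same direction: `[1, 2]` and `[2, 4]` are different points but score 1.

```python
import numpy as np

def cosine(a, b):
  """Cosine of the angle between vectors a and b (dot product over product of L2 norms)."""
  a = np.asarray(a, dtype=float)
  b = np.asarray(b, dtype=float)
  return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 2], [2, 4]))    #same direction: 1 (up to floating point rounding)
print(cosine([1, 0], [0, 1]))    #perpendicular: 0.0
print(cosine([1, 2], [-1, -2]))  #opposite direction: -1 (up to floating point rounding)
```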

# VIII. The KNNImputer

The folks at sklearn have built in a transformer that will impute all the NaN values in our table using KNN. And get this - it even works if a row has more than one NaN value! Nice.

In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5,        #a rough guess
                     weights="uniform",    #could alternatively have distance factor in
                     add_indicator=False)  #do not add extra column for NaN

imputed_data = imputer.fit_transform(transformed_df)  #fit and impute in one step

### Did we get them all?

In [None]:
np.isnan(imputed_data)

In [None]:
np.isnan(imputed_data).any()  #Any Trues in there?

I used 3 parameters in above code. Let's look at the larger parameter list (and the defaults).

* `n_neighbors`, default=`5`.
Number of neighboring rows to use for imputation. If the number of available rows is less than this value, then what is available is used.

* `weights`, default=`'uniform'`. With uniform, every neighbor gets an equal say. The alternative, `'distance'`, weights each neighbor by the inverse of its distance, so closer rows have more influence.

* `metric`, default=`'nan_euclidean'`.
Distance metric for searching neighbors. If we want cosine instead, it is possible to write our own function and pass it as a value here. Have to be careful that our function can deal with NaN values. Since I omitted it, I get default.

* `add_indicator`, default=`False`. This will add new binary columns to indicate a cell with an imputed value. See discussion below.
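
To get a feel for the default metric, sklearn exposes it as `nan_euclidean_distances`. As documented, it skips any coordinate where either row has a NaN and scales the remaining squared differences up by (total coordinates / coordinates used). The two rows below are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

row_a = [[3, np.nan, np.nan, 6]]
row_b = [[1, np.nan, 4, 5]]

#only the 1st and 4th coordinates are present in both rows:
#sqrt(4/2 * ((3-1)**2 + (6-5)**2)) = sqrt(10)
d = nan_euclidean_distances(row_a, row_b)[0][0]
print(d)  #about 3.162
```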


## Problem is we have numpy matrix as output

By default, even though `KNNImputer` accepts a dataframe as input, it returns a numpy matrix as output. I'll ask you to fix that in a challenge by wrapping KNNImputer in your own class. This should be kind of standard by now.

## Important note on `add_indicator`

If `add_indicator` is set to True, it will create a whole new column for each column that has at least one NaN. Might be easiest to look at example. Here is table that has NaNs in first 3 columns but not in 4th.

In [None]:
data = [[1,2,np.nan,10],
        [np.nan,3,np.nan,10],
        [5,np.nan,6,10],
        [7,8,9,10]]
test_df = pd.DataFrame(data, columns=['a','b','c','d'])
test_df


In [None]:
imputer = KNNImputer(add_indicator=True)
data = imputer.fit_transform(test_df)
pd.DataFrame(data, columns=['a','b','c','d','nan_a', 'nan_b', 'nan_c'])

As you can see, 3 new columns that mark whether a row had a NaN in a column. There are two reasons I don't like this.

1. I said we were going to treat NaNs as carrying no information. So we do not need to mark them specially. I hope you see if they did carry information, then the new columns would carry that forward even after the NaNs replaced, i.e., a machine learning algorithm might be able to make use of the new columns.

2. I don't like adding new columns like this. We may end up with different sized tables between exploration and production. That would be bad.

So the bottom line is to set `add_indicator=False`.

## I'm going to time it because that will become important later


In [None]:
test_df  #has NaN values!

In [None]:
%%time
imputer = KNNImputer(add_indicator=False)
data = imputer.fit_transform(test_df)
data  #NaNs imputed

In [None]:
%%time
data = imputer.fit_transform(transformed_df)

### Roughly `5ms` for entire table

.005 of a second.

# IX. Using the `IterativeImputer`

I like this imputer quite a bit. It is really a framework or workflow around a base predictor; you have to pass the base predictor you want as an argument to the imputer. It then finds the first feature/column that has one or more NaNs. It treats this column as the target/label column y. It uses all the other features/columns to predict NaN values in this column.

When it is done with one column, i.e., has predicted values for all the NaNs in that column, it moves on to the next. In this way it iterates through all the columns with NaNs and predicts values for each NaN.

There are more details, but I would rather put them off for a subsequent chapter. Here is general use:
<pre>
from sklearn... import some_predictor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
ii_imp = IterativeImputer(
    estimator=some_predictor(), max_iter=10, random_state=1234)

numpy_matrix = ii_imp.fit_transform(old_df)
</pre>
What is missing is "`some_predictor`". We have to choose what predictor to use. Let's try a few on our small test table.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

#3 different choices for "some_predictor"
from sklearn.tree import DecisionTreeRegressor    #alternative 1
from sklearn.ensemble import ExtraTreesRegressor  #alternative 2
from sklearn.linear_model import BayesianRidge    #alternative 3



So I've imported 3 candidates to try for `some_predictor`. We will try each one in turn, first in simple 4 row table then the full Titanic table.

### Note: wall time is what we focus on

This would be the time your end-user would have to wait for an answer.

**Caveat**: wall time is machine dependent. If you are running on a machine with multiple cores (like Colab) then your wall time could easily be (much) smaller than CPU time given parallelism is in play.
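
Outside of the `%%time` magic, the same two numbers can be measured with the standard library. This is a minimal sketch: the workload is made up, and the `time.sleep` call stands in for waiting (e.g., on a network) that costs wall time but almost no CPU time.

```python
import time

wall_start = time.perf_counter()   #wall clock
cpu_start = time.process_time()    #CPU time used by this process

total = sum(i * i for i in range(200_000))  #CPU-bound work
time.sleep(0.1)                             #waiting: counts toward wall time only

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f'wall={wall:.3f}s cpu={cpu:.3f}s')
```

Here wall time comes out larger than CPU time because of the sleep. With several cores busy at once, the reverse can happen.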

## Start with `DecisionTreeRegressor`

A plain decision tree is used. We will discuss decision trees in more detail later in the course.

[Docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).

In [None]:
test_df

In [None]:
%%time
estimator = DecisionTreeRegressor()  #instantiate predictor
ii_imp = IterativeImputer(estimator=estimator, max_iter=10, random_state=1234)  #instantiate imputer

data = ii_imp.fit_transform(test_df)
data

In [None]:
%%time
data = ii_imp.fit_transform(transformed_df)  #4.5 times slower than knn (differs from video)

## Next up is `ExtraTreesRegressor`

Briefly, it uses a "forest" of decision trees to make predictions. Each tree gets a vote. It is a twiddle on something you may have heard: Random Forests. It generally outshines plain decision trees.

[Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html).

In [None]:
test_df

In [None]:
%%time
ii_imp = IterativeImputer(estimator=ExtraTreesRegressor(), max_iter=10, random_state=1234)

data = ii_imp.fit_transform(test_df)
data

In [None]:
%%time
data = ii_imp.fit_transform(transformed_df)   #11.8K vs 10


## Finally we have `BayesianRidge`

This is a fancy form of regression. And yes, I am putting it, and regression in general, off until a later chapter. I mostly want to illustrate that you can try different methods using the `IterativeImputer`. We first looked at trees. Now switching to (fancy) regression.

[Docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html)

If you just can't wait until later, here is a [good tutorial](https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/) on Ridge Regression (and Lasso as well). And a [tutorial on Bayesian regression](https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7).

In [None]:
%%time
ii_imp = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=1234)

data = ii_imp.fit_transform(test_df)
data

In [None]:
%%time
data = ii_imp.fit_transform(transformed_df)   #slower (note parallelism in play by comparing to CPU time)


# X. Which is best?

My gut reaction is that the `IterativeImputer` with `ExtraTreesRegressor` is the best. I'd have to do a more sophisticated analysis to back that up, e.g., place some synthetic NaN values in the table but remember their actual value and compare with imputation.
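
One way to sketch that kind of analysis: start from a table with no NaNs, hide some known values, impute, and measure the error on just the hidden cells. Everything below is an assumption for illustration: the data is synthetic (3 strongly related columns, not Titanic), the helper name `masked_rmse` is hypothetical, and the comparison is KNN versus plain mean imputation rather than the iterative imputers.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

#synthetic table with no NaNs: 3 strongly related columns
x = rng.normal(size=(200, 1))
full = np.hstack([x,
                  2 * x + rng.normal(scale=0.1, size=(200, 1)),
                  -x + rng.normal(scale=0.1, size=(200, 1))])

#hide roughly 10% of the cells but remember where they were
mask = rng.random(full.shape) < 0.1
holey = full.copy()
holey[mask] = np.nan

def masked_rmse(imputer):
  """Root mean squared error between imputed and true values on the hidden cells only."""
  filled = np.asarray(imputer.fit_transform(holey))
  return float(np.sqrt(np.mean((filled[mask] - full[mask]) ** 2)))

knn_rmse = masked_rmse(KNNImputer(n_neighbors=5))
mean_rmse = masked_rmse(SimpleImputer(strategy='mean'))
print(knn_rmse, mean_rmse)  #lower is better
```

The same harness could score any imputer that has a `fit_transform` method, including the `IterativeImputer` variants above.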

The problem is how long the best one takes. When I ran it on the actual Titanic data (a very small dataset), it was taking 2 seconds. That may be too slow if we are running a website where a user is waiting for a response. Because of this I am going to push ahead with the `KNNImputer`. It is the fastest.

# Challenge 1
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

<img src='https://www.dropbox.com/s/oi1ttx99hf0uejn/cosine.png?raw=1'>

Please compute the cosine similarity between the first 2 rows in `transformed_df`. If you are good at reading linear algebra notation, then know that numpy supplies:

* A method for doing dot product.

* A method for computing L2 norm.

If you are shaky on the notation, you can resort to using 3 list comprehensions for the 3 sums.



In [None]:
row1 = transformed_df.loc[0].to_list()
row2 = transformed_df.loc[1].to_list()

In [None]:
#compute and place answer in r1r2_cosine_similarity



In [None]:
r1r2_cosine_similarity  #0.4503419058458463

Here is an oracle, although a little complicated in its details.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(np.array(row1).reshape(1, -1), np.array(row2).reshape(1, -1))[0][0]

### Can we use the `cosine_similarity` function with KNNImputer?

Yes and no. Yes, we could pass it in as a parameter when we instantiate. No, it won't work because it does not handle NaN values well.

We could write our own `nan_cosine_similarity` function that does handle NaN values and then pass it in, but I won't ask you to do it.

# Challenge 2
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Can we just plug the `KNNImputer` into our pipeline? One problem is that it does not return a dataframe but a numpy array. But there is a way around that. I can use `set_config` to tell sklearn that I want all its built-in transformers to input and output dataframes. Cool, check it out.



In [None]:
from sklearn import set_config  #add both of these to your library before you define any transformers

set_config(transform_output="pandas")  #have all transformers (including KNNImputer) now deal in dataframes versus numpy arrays

So that problem is solved. However, there is one more issue. I do not want `add_indicator` to be settable to `True`. It will totally mess up our data later. So I am still going to ask you to write a custom transformer that basically wraps `KNNImputer` and hard-codes `add_indicator` to be set to `False`. For those of you who have taken an FP class, this is in the flavor of a currying operation, albeit in OOP land versus FP land.

Note that you will need to import `KNNImputer` before defining the class; you will need to create an instance of it in the `__init__` method. So make sure this import (and all the imports you see in our notebooks) end up in your library.



### One other constraint

`KNNImputer` is set up to have a `fit` method: it will cause an error if `transform` is called before `fit`. Make sure your class is set up the same. Your `fit` method should simply call the `KNNImputer` `fit` method.
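
To see the behavior being matched, here is a minimal sketch that calls `transform` on an unfitted `KNNImputer` and catches the error sklearn raises. The toy inputs are made up, and `n_features_in_` is just one example of an attribute that only exists after fitting.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.impute import KNNImputer

unfitted = KNNImputer()
try:
  unfitted.transform(np.array([[1.0, np.nan], [3.0, 4.0]]))
  raised = False
except NotFittedError as err:
  raised = True
  print(type(err).__name__)  #NotFittedError

fitted = KNNImputer().fit(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(hasattr(unfitted, 'n_features_in_'), hasattr(fitted, 'n_features_in_'))  #False True
```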

### A fancy type hint




Trying to ensure n_neighbors positive.

In [None]:
from typing import Annotated

PositiveInta = Annotated[int, lambda x: x > 0]

In [None]:
from annotated_types import Gt

PositiveIntb = Annotated[int, Gt(0)]

In [None]:
def foo(x: PositiveIntb):
  print(x)


In [None]:
foo(0)   #should warn with red line but does not - likely because static checkers only check the int part and ignore the Gt(0) metadata
foo(.4)  #should warn with red line and does

In [None]:
from sklearn.impute import KNNImputer

In [None]:
class CustomKNNTransformer(BaseEstimator, TransformerMixin):
  """Imputes missing values using KNN.

  This transformer wraps the KNNImputer from scikit-learn and hard-codes
  add_indicator to be False. It also ensures that the input and output
  are pandas DataFrames.

  Parameters
  ----------
  n_neighbors : int, default=5
      Number of neighboring samples to use for imputation.
  weights : {'uniform', 'distance'}, default='uniform'
      Weight function used in prediction. Possible values:
      "uniform" : uniform weights. All points in each neighborhood
      are weighted equally.
      "distance" : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
  """
  #your code below

In [None]:
test_df

### Make sure `fit` called before `transform`

In [None]:
#Hint: see attributes of KNNImputer that are set after fitting. Can check with hasattr function.

my_knn_imputer = CustomKNNTransformer()
new_df = my_knn_imputer.transform(test_df)  #AssertionError: NotFittedError: This CustomKNNTransformer instance is not fitted yet. Call "fit" with appropriate arguments before using this estimator.
new_df

### Check K>samples

In [None]:
my_knn_imputer = CustomKNNTransformer(n_neighbors=len(test_df)+1)
my_knn_imputer.fit(test_df)  #Warning


### Check fit versus transform columns mismatch

In [None]:
my_knn_imputer = CustomKNNTransformer()
my_knn_imputer.fit(test_df)
my_knn_imputer.transform(test_df.drop(columns='c'))  # Warning and Column names mismatch error
new_df

### Normal test

In [None]:
my_knn_imputer = CustomKNNTransformer()
new_df = my_knn_imputer.fit_transform(test_df)
new_df

<img src='https://www.dropbox.com/s/q54k7qetwarr5mz/Screen%20Shot%202022-01-20%20at%202.58.28%20PM.png?raw=1' height=150>

### A few more tests

In [None]:
my_knn_imputer2 = CustomKNNTransformer(n_neighbors=2)
new_df = my_knn_imputer2.fit_transform(test_df)
new_df

<pre>

a	b	c	d
0	1.0	2.0	7.5	10.0
1	3.0	3.0	7.5	10.0
2	5.0	5.5	6.0	10.0
3	7.0	8.0	9.0	10.0
</pre>

In [None]:
my_knn_imputer3 = CustomKNNTransformer(n_neighbors=2, weights='distance')
new_df = my_knn_imputer3.fit_transform(test_df)
new_df

|index|a|b|c|d|
|---|---|---|---|---|
|0|1\.0|2\.0|7\.098076211353315|10\.0|
|1|5\.0|3\.0|6\.0|10\.0|
|2|5\.0|3\.0|6\.0|10\.0|
|3|7\.0|8\.0|9\.0|10\.0|

In [None]:
my_knn_imputer4 = CustomKNNTransformer(n_neighbors=1, weights='distance')
new_df = my_knn_imputer4.fit_transform(test_df)
new_df

|index|a|b|c|d|
|---|---|---|---|---|
|0|1\.0|2\.0|6\.0|10\.0|
|1|5\.0|3\.0|6\.0|10\.0|
|2|5\.0|3\.0|6\.0|10\.0|
|3|7\.0|8\.0|9\.0|10\.0|

Go ahead and try it on `transformed_df`.

In [None]:
%%time
new_df = my_knn_imputer.fit_transform(transformed_df)


In [None]:
new_df.describe(include='all').T

|index|count|mean|std|min|25%|50%|75%|max|
|---|---|---|---|---|---|---|---|---|
|Age|1313\.0|0\.05787870285004208|0\.753051356322997|-1\.5526315789473684|-0\.4473684210526316|0\.02631578947368421|0\.5526315789473685|2\.289473684210526|
|Gender|1313\.0|0\.3488194973343488|0\.47677833844440637|0\.0|0\.0|0\.0|1\.0|1\.0|
|Class|1313\.0|1\.3972581873571972|1\.0428875432917633|0\.0|1\.0|1\.0|2\.0|3\.0|
|Married|1313\.0|0\.34364051789794364|0\.4749754473150905|0\.0|0\.0|0\.0|1\.0|1\.0|
|Fare|1313\.0|0\.5221593557064381|1\.222958326660415|-0\.5531914893617021|-0\.2553191489361702|0\.0|0\.7659574468085106|3\.74468085106383|
|Joined\_Belfast|1313\.0|0\.06473724295506474|0\.24615539898364328|0\.0|0\.0|0\.0|0\.0|1\.0|
|Joined\_Cherbourg|1313\.0|0\.19268849961919268|0\.3945607792644899|0\.0|0\.0|0\.0|0\.0|1\.0|
|Joined\_Queenstown|1313\.0|0\.06930693069306931|0\.2540721241868543|0\.0|0\.0|0\.0|0\.0|1\.0|
|Joined\_Southampton|1313\.0|0\.6732673267326733|0\.4691972932315876|0\.0|0\.0|1\.0|1\.0|1\.0|

### Looks good

No NaNs remaining (`count` is `1313` for all columns).


# Challenge 3
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Try it on customer data.



In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQPM6PqZXgmAHfRYTcDZseyALRyVwkBtKEo_rtaKq_C7T0jycWxH6QVEzTzJCRA0m8Vz0k68eM9tDm-/pub?output=csv'

In [None]:
customers_df = pd.read_csv(url)
customers_df.head()

In [None]:
customers_features = customers_df.drop(columns='Rating')

## Step 1. Build pipeline and add imputer

I count 8 steps.

In [None]:
#Build pipeline and include scalers from last chapter and imputer from this
customer_transformer = Pipeline(steps=[


## Step 2. Test out `CustomKNNTransformer`



In [None]:
transformed_customer_df = customer_transformer.fit_transform(customers_features)

In [None]:
transformed_customer_df.head()

|index|Gender|Experience Level|Time Spent|Age|OS\_Android|OS\_iOS|ISP\_AT&amp;T|ISP\_Cox|ISP\_HughesNet|ISP\_Xfinity|
|---|---|---|---|---|---|---|---|---|---|---|
|0|1\.0|1\.0|0\.7184632599776198|-0\.041379310344827586|0\.0|1\.0|0\.0|0\.0|0\.0|1\.0|
|1|0\.0|1\.0|-1\.6057441253263711|-0\.6896551724137931|1\.0|0\.0|0\.0|1\.0|0\.0|0\.0|
|2|1\.0|1\.0|0\.6202909362178289|-0\.7586206896551724|0\.0|0\.0|0\.0|1\.0|0\.0|0\.0|
|3|1\.0|1\.0|-0\.5315180902648265|-0\.4827586206896552|1\.0|0\.0|0\.0|0\.0|0\.0|1\.0|
|4|1\.0|1\.0|0\.7814248414770603|-0\.13793103448275862|0\.0|1\.0|0\.0|0\.0|0\.0|1\.0|

# Challenge 4
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Add `CustomKNNTransformer` to your library. It is what we will use in our pipeline to do imputation.

As a preprocessing step, I'd recommend asking Gemini to do full documentation and type hints on it and add that to your library.

Also update your pipeline.