# Boost your score with KNN Feature extraction

For TPS 6, I wanted to take a different approach than before: instead of focusing on modeling, I focused on feature engineering. Given that this competition has anonymized features, this might not be an obvious choice.
However, my [EDA](https://www.kaggle.com/melanie7744/tps6-eda-comparison-to-tps5) also led me to the original dataset, i.e. the dataset used to generate the synthetic one we got for TPS 6. So I researched the solutions that worked back then. Many of them did not seem viable for me, but one looked promising: **KNN feature extraction**.

My quest for knowledge then led me to a nice package: **fastknn**, written by [David Pinto](http://www.kaggle.com/davidpinto/fastknn-show-to-glm-what-knn-see-0-96#Feature-Engineering-with-KNN). It seemed to provide everything I wanted. Unfortunately, it was written in R.

So the search continued until I found ... a Python implementation of this same package by Momijiame! It is called **Gokinjo**, which means "neighborhood" in Japanese, and is available on [GitHub](https://github.com/momijiame/gokinjo).


 
### What is KNN feature extraction? 

From [David Pinto](https://davpinto.github.io/fastknn/articles/knn-extraction.html):

<div class="alert alert-success">
The fastknn provides a function to do feature extraction using KNN. It generates k * c new features, where c is the number of class labels. The new features are computed from the distances between the observations and their k nearest neighbors inside each class, as follows:
<ul>
<li>The first test feature contains the distances between each test instance and its nearest neighbor inside the first class.</li>
<li>The second test feature contains the sums of distances between each test instance and its 2 nearest neighbors inside the first class.</li>
<li>The third test feature contains the sums of distances between each test instance and its 3 nearest neighbors inside the first class.</li>
<li>And so on.</li>
</ul>
This procedure repeats for each class label, generating k * c new features. Then, the new training features are generated using a n-fold CV approach, in order to avoid overfitting. Parallelization is available. You can specify the number of threads via nthread parameter.
</div>
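To make the description above concrete, here is a minimal NumPy sketch of the per-class distance features. It is not the fastknn or Gokinjo implementation: the function name `knn_class_features` is made up for illustration, distances are plain Euclidean, and the brute-force distance matrix only suits small arrays. It also assumes every class has at least k members.

```python
import numpy as np

def knn_class_features(X_train, y_train, X_query, k=1):
    """Return an (n_query, k * c) array of cumulative distances to the
    k nearest training neighbors inside each class (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    blocks = []
    for label in np.unique(y_train):
        members = X_train[y_train == label]
        # pairwise Euclidean distances, shape (n_query, n_members)
        diffs = X_query[:, None, :] - members[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # the k smallest distances per query row, in ascending order
        nearest = np.sort(dists, axis=1)[:, :k]
        # j-th feature = sum of the distances to the j nearest neighbors
        blocks.append(np.cumsum(nearest, axis=1))
    return np.hstack(blocks)

# toy data: two classes with two points each, one query point
toy_X = np.array([[0, 0], [1, 0], [5, 5], [6, 5]])
toy_y = np.array([0, 0, 1, 1])
toy_query = np.array([[0, 1]])

feats = knn_class_features(toy_X, toy_y, toy_query, k=2)
print(feats.shape)  # (1, 4), i.e. k * c = 2 * 2 columns
print(feats)        # roughly 1.0, 2.414, 6.403, 13.614
```

The first two columns belong to class 0 and the last two to class 1. Within each class the second column is never smaller than the first, because it adds the distance to the second-nearest neighbor.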

So, let's try it out!

In [1]:
pip install gokinjo

Collecting gokinjo
  Downloading gokinjo-0.1.0-py3-none-any.whl (10 kB)
Installing collected packages: gokinjo
Successfully installed gokinjo-0.1.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

from gokinjo import knn_kfold_extract
from gokinjo import knn_extract

# list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tabular-playground-series-jun-2021/sample_submission.csv
/kaggle/input/tabular-playground-series-jun-2021/train.csv
/kaggle/input/tabular-playground-series-jun-2021/test.csv


In [3]:
# read competition data
df_train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv') 

# label encode the target column
le = LabelEncoder()
df_train.target = le.fit_transform(df_train.target)

# define X and y for training data
X = df_train.drop(columns=["id","target"])
y = df_train.target

# prepare test data
X_test=df_test.drop(columns="id")

print("First five rows of training data:")
display(X.head())
print("First five rows of test data:")
display(X_test.head())

First five rows of training data:


Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70,feature_71,feature_72,feature_73,feature_74
0,0,0,6,1,0,0,0,0,7,0,...,3,0,0,0,0,0,0,2,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,1,0
2,0,0,0,0,0,1,0,3,0,0,...,8,0,0,0,0,1,0,0,0,0
3,0,0,7,0,1,5,2,2,0,1,...,0,0,4,0,2,2,0,4,3,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


First five rows of test data:


Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70,feature_71,feature_72,feature_73,feature_74
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,2,0,0,0,0,0,0,0,0,...,3,1,3,0,0,0,0,3,0,0
2,0,1,7,1,0,0,0,0,6,0,...,3,0,0,0,0,3,0,2,0,0
3,0,0,0,4,3,1,0,0,0,0,...,0,0,0,1,0,0,0,4,0,0
4,0,0,5,0,0,0,0,0,0,8,...,0,0,0,0,0,0,0,0,1,0


In [4]:
# convert to numpy because gokinjo expects np arrays
X = X.to_numpy()
y = y.to_numpy()
X_test = X_test.to_numpy()
# check shapes
print("X, shape: ", np.shape(X))
print("X_test, shape: ", np.shape(X_test))

X, shape:  (200000, 75)
X_test, shape:  (100000, 75)


In [5]:
# KNN feature extraction for the training set; the data has not been normalized beforehand, so let knn_kfold_extract do it
# you can set a different value for k, just be aware of the increase in computation time
# with k=1 and 9 classes this yields k * c = 9 new features
KNN_feat_train = knn_kfold_extract(X, y, k=1, normalize='standard')
print("KNN features for training set, shape: ", np.shape(KNN_feat_train))
KNN_feat_train[0]

KNN features for training set, shape:  (200000, 9)


array([3.73320075, 3.45451402, 3.81265198, 3.85253456, 3.63653099,
       2.82483479, 3.32534783, 3.28254987, 3.12120553])
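The training features come from a k-fold routine because a training row looked up against the full training set would find itself at distance 0. Below is a minimal NumPy sketch of that out-of-fold idea. It is not Gokinjo's code: both function names are invented here, the folds are plain random splits (whether the library stratifies them is left open), and no normalization is applied.

```python
import numpy as np

def class_distance_features(X_fit, y_fit, X_query, labels, k=1):
    """Cumulative distances to the k nearest neighbors inside each class."""
    blocks = []
    for label in labels:
        members = X_fit[y_fit == label]
        diffs = X_query[:, None, :] - members[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        blocks.append(np.cumsum(np.sort(dists, axis=1)[:, :k], axis=1))
    return np.hstack(blocks)

def out_of_fold_features(X, y, k=1, n_folds=5, seed=0):
    """Build training features so that no row is its own neighbor."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    out = np.empty((len(X), k * len(labels)))
    for held_out in folds:
        # neighbors are searched only among rows outside the held-out fold
        fit_mask = np.ones(len(X), dtype=bool)
        fit_mask[held_out] = False
        out[held_out] = class_distance_features(
            X[fit_mask], y[fit_mask], X[held_out], labels, k
        )
    return out

# toy data: 30 random rows, 3 balanced classes
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(30, 3))
toy_y = np.arange(30) % 3

oof = out_of_fold_features(toy_X, toy_y, k=2, n_folds=5)
print(oof.shape)  # (30, 6)
```

Because every row is scored against the other folds only, none of the resulting distances is the trivial zero distance of a row to itself.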

In [6]:
# create KNN features for the test set; the data has not been normalized beforehand, so let knn_extract do it
KNN_feat_test = knn_extract(X, y, X_test, k=1, normalize='standard')
print("KNN features for test set, shape: ", np.shape(KNN_feat_test))
KNN_feat_test[0]

KNN features for test set, shape:  (100000, 9)


array([0.54760409, 0.38943284, 0.39264324, 0.58408197, 0.60395868,
       0.4631    , 0.59059477, 0.60252318, 0.51435248])

Note: generating the KNN features for the test set was not straightforward for me. I did not find any sample code, so I dug into the source code of the Gokinjo package until I found the solution shown here. I hope this is how it is supposed to be done. If anybody has experience with this package, your feedback is very welcome.
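One inexpensive sanity check for this kind of doubt is to compare the scale of the train and test KNN features column by column, since a single row says little. The sketch below uses made-up arrays as stand-ins for `KNN_feat_train` and `KNN_feat_test`; the helper `median_ratio` is an illustrative assumption and not part of Gokinjo. Ratios far from 1 would not prove a mistake, but they would be a reason to look more closely at how the two sets were built.

```python
import numpy as np

def median_ratio(train_feats, test_feats):
    """Per-column ratio of the median test value to the median train value."""
    train_med = np.median(np.asarray(train_feats, dtype=float), axis=0)
    test_med = np.median(np.asarray(test_feats, dtype=float), axis=0)
    return test_med / train_med

# made-up stand-ins for KNN_feat_train and KNN_feat_test
rng = np.random.default_rng(0)
toy_train = rng.uniform(1.0, 3.0, size=(1000, 9))
toy_test = rng.uniform(1.0, 3.0, size=(500, 9))

ratios = median_ratio(toy_train, toy_test)
print(np.round(ratios, 2))  # one ratio per KNN feature column
```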


In [7]:
# append the KNN features to the original features
X, X_test = np.append(X, KNN_feat_train, axis=1), np.append(X_test, KNN_feat_test, axis=1) 
print("Train set, shape: ", np.shape(X))
print("Test set, shape: ", np.shape(X_test))

Train set, shape:  (200000, 84)
Test set, shape:  (100000, 84)
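The combined arrays no longer carry column names, so it can help to generate names for the new columns before looking at feature importances. The sketch below assumes the column order described in the fastknn quote (all k features of the first class, then the second class, and so on) and that Gokinjo follows the same order. The helper `knn_feature_names` and the naming pattern are made up for illustration, and the class index is assumed to match the label-encoded target.

```python
def knn_feature_names(n_classes, k):
    """Names for k * n_classes KNN columns, grouped by class (hypothetical helper)."""
    return [
        f"knn_class{c}_k{j}"
        for c in range(n_classes)
        for j in range(1, k + 1)
    ]

# 75 original features plus 9 classes with k=1, as in this notebook
columns = [f"feature_{i}" for i in range(75)] + knn_feature_names(9, k=1)
print(len(columns))   # 84
print(columns[-3:])   # ['knn_class6_k1', 'knn_class7_k1', 'knn_class8_k1']
```

Such a list could be passed as `columns=` when wrapping the arrays in a `pd.DataFrame`.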


In [8]:
# store KNN features, they are computationally expensive
np.save('add_feat_train', KNN_feat_train)
np.save('add_feat_test', KNN_feat_test)

# to load them in your notebook you can use:
#new_features = np.load('add_feat_train.npy')

I used these extra features with an XGBoost model:
* My validation logloss improved from 1.75280 to 1.75089
* My public score improved from 1.75592 to 1.75338
* Of the 10 most important features (as ranked by XGBoost), 8 were KNN features

I'd be curious to know what happens if you use a better model, i.e. one that already has a lower logloss than my XGBoost.
* Will such a model already have learned the extra insights from the KNN features and therefore show no improvement in score?
* Will such a model be able to use the additional KNN features more effectively and get a larger improvement in score?

If anybody tries this out, please comment below!