In [None]:
%%HTML
<style type="text/css">
div.h1 {
    background-color:#eebbcb; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 35px; 
    max-width: 1500px; 
    margin: auto; 
    margin-top: 50px;
}

div.h2 {
    background-color:#2ca9e1; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 35px; 
    max-width: 1500px; 
    margin: auto; 
    margin-top: 50px;
}
</style>

## If you like, please Upvote😹

<div class="h1">About this notebook</div>

![KnnClassification](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/220px-KnnClassification.svg.png)

In this notebook, we will gain a better understanding of k-nearest neighbors (KNN) by implementing the method ourselves.

We use a Kaggle dataset that contains user personality data and their movie preferences.

First, we look at the dataset and create the target feature.

Second, we review the theory of KNN and implement it with a KDTree. We also check its validity.

Third, we estimate the target feature with our model.

# <div class="h2">Data overview and target feature creation</div>

### Load library and dataset

In [None]:
import itertools
import pandas as pd
import numpy as np

import seaborn as sns
from sklearn import preprocessing

In [None]:
!ls ../input/top-personality-dataset

In [None]:
df_personality = pd.read_csv("../input/top-personality-dataset/2018-personality-data.csv")
df_ratings = pd.read_csv("../input/top-personality-dataset/2018_ratings.csv")

In [None]:
# Rename the columns whose names contain leading or trailing spaces.
rename_dict = {' openness': 'openness', ' agreeableness': 'agreeableness', ' emotional_stability': 'emotional_stability',
               ' conscientiousness': 'conscientiousness', ' extraversion': 'extraversion', ' assigned metric': 'assigned metric',
               ' assigned condition': 'assigned condition', ' is_personalized': 'is_personalized', ' enjoy_watching ': 'enjoy_watching'}
df_personality = df_personality.rename(columns=rename_dict)

### Data overview

In [None]:
df_personality.head()

In [None]:
df_personality.dtypes

In [None]:
df_personality.describe()

### Label encoding

There are two categorical columns. I'll encode them with LabelEncoder.

In [None]:
assigned_metric_le = preprocessing.LabelEncoder()
assigned_metric_le.fit(df_personality["assigned metric"])
df_personality["assigned metric"] = assigned_metric_le.transform(df_personality["assigned metric"])

assigned_condition_le = preprocessing.LabelEncoder()
assigned_condition_le.fit(df_personality["assigned condition"])
df_personality["assigned condition"] = assigned_condition_le.transform(df_personality["assigned condition"])
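
As a small illustration of what the encoder does, here is a sketch on made-up category values (the strings below are hypothetical and are not taken from the dataset). It shows that the classes are stored in sorted order and that the integer codes can be mapped back to the original strings.

```python
from sklearn import preprocessing

# Hypothetical category values, used only for illustration.
conditions = ["high", "low", "medium", "low", "high"]

le = preprocessing.LabelEncoder()
codes = le.fit_transform(conditions)      # fit and transform in one call

print(list(le.classes_))                  # classes are kept in sorted order
print(list(codes))                        # integer code of each input value
print(list(le.inverse_transform(codes)))  # map the codes back to strings
```

Keep in mind that these integer codes impose an arbitrary order on the categories, and a distance-based method such as KNN will treat that order as meaningful.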

In [None]:
df_personality

### Create target feature

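In this notebook, I'd like to estimate which movie slot (treated here as a category) each user rates the highest.

To do this, I'll create a "movie_choice" column: the 0-based position of the largest value among the 12 predicted_rating columns.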

In [None]:
cols_predicted_rating = [' predicted_rating_1',' predicted_rating_2',' predicted_rating_3',
 ' predicted_rating_4',  ' predicted_rating_5', ' predicted_rating_6', ' predicted_rating_7', 
 ' predicted_rating_8', ' predicted_rating_9', ' predicted_rating_10', ' predicted_rating_11',' predicted_rating_12']

cols_movie = [' movie_1',' movie_2',' movie_3',' movie_4',  ' movie_5', ' movie_6', ' movie_7', 
 ' movie_8', ' movie_9', ' movie_10', ' movie_11',' movie_12']

In [None]:
df_personality[cols_predicted_rating]

In [None]:
df_personality["movie_choice"] = np.argmax(df_personality[cols_predicted_rating].values, axis=1)
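
To make the argmax step concrete, here is a minimal sketch on a toy table. The column names and ratings are hypothetical. It shows that the result is the 0-based column position of the row maximum, and that the first column wins when several columns tie.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted ratings for three users and four movies.
toy = pd.DataFrame(
    {
        "rating_1": [3.5, 4.0, 2.0],
        "rating_2": [4.8, 4.0, 2.5],
        "rating_3": [4.1, 3.0, 4.9],
        "rating_4": [2.2, 1.0, 4.9],
    }
)

# argmax along axis=1 gives, per row, the 0-based position of the largest value.
# When several columns tie for the maximum, the first one is returned.
toy["movie_choice"] = np.argmax(toy.values, axis=1)
print(toy["movie_choice"].tolist())
```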

In [None]:
df_personality = df_personality.drop(columns = (cols_predicted_rating + cols_movie))

Now we have the "movie_choice" column. Let's check its distribution.

In [None]:
sns.countplot(data=df_personality,x="movie_choice")

# <div class="h2">Understand and implement k-nearest neighbor</div>

## Theory

k-nearest neighbors is a nonparametric method. Parametric methods assume a particular form for the distribution, which imposes a strong limitation: the data must actually fit the assumed distribution. For example, a single Gaussian cannot describe a multimodal distribution, so we would have to use a more complex model such as a Gaussian mixture model. Nonparametric methods let us analyze such data with fewer assumptions.

Note: the following discussion is based on reference [2] (see the bottom of the notebook).

----------------

First, consider a small region R. The probability mass *P* associated with R is:

$$
   P = \int_R p(x)dx
$$

If we draw N samples from p(x), the probability that K of them fall in R is:

$$
   Bin(K | N,P) = \frac{N!}{K!(N-K)!} P^K (1-P)^{N-K}
$$

Each sample is either inside or outside the region R, so K follows a binomial distribution.

For the binomial distribution, the mean and variance of K/N are:

$$
   E[K/N] = P
$$

$$
   var[K/N] = P(1-P)/N
$$

If N is large, var[K/N] approaches zero, so K/N is sharply concentrated around its mean P.
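
We can check this numerically with a small simulation. This is only a sketch: the value of P and the number of repeated trials are assumptions chosen for illustration. For each N it draws K from a binomial distribution many times and compares the empirical mean and variance of K/N with P and P(1-P)/N.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 0.05        # probability mass of the region R (assumed value)
trials = 2000   # number of repeated experiments per N (assumed value)

results = {}
for N in (100, 10_000, 1_000_000):
    K = rng.binomial(N, P, size=trials)   # number of samples that fall in R
    ratio = K / N
    results[N] = (ratio.mean(), ratio.var())
    # empirical mean, empirical variance, theoretical variance
    print(N, ratio.mean(), ratio.var(), P * (1 - P) / N)
```

The variance shrinks in proportion to 1/N, so for large N a single observed K/N is already a good estimate of P.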

For this small region R, we can therefore write (approximately),

$$
  K = NP
$$

If we also assume R is small enough that p(x) is roughly constant over R,

$$
  P = p(x) V
$$

where V is the volume of R.

From the last two equations, we get

$$
  p(x) = \frac{K}{NV}
$$

---------------------------

We now have an expression for p(x) in terms of K, N and V. In KNN, we fix K and let V vary. In other words, we choose K when we create the model instance, search for the K nearest points, and take R to be the region around x that reaches out to the K-th nearest point.
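
As a quick sanity check of p(x) = K/(NV), here is a one-dimensional sketch. The sample size, the value of K and the helper name `knn_density` are all my own assumptions for illustration. In one dimension the "volume" V is the length of the interval that reaches out to the K-th nearest sample, and we compare the estimate with the known density of a standard normal at x = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
samples = rng.standard_normal(N)   # draws from a known density, for checking

def knn_density(x, samples, K):
    """Estimate p(x) = K / (N V) in one dimension (hypothetical helper)."""
    dists = np.sort(np.abs(samples - x))
    radius = dists[K - 1]          # distance to the K-th nearest sample
    V = 2.0 * radius               # length of the interval [x - r, x + r]
    return K / (len(samples) * V)

true_p0 = 1.0 / np.sqrt(2.0 * np.pi)   # standard normal density at x = 0
est_p0 = knn_density(0.0, samples, K=1000)
print(est_p0, true_p0)
```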

---------------------------

For classification, we want to know p(C_k|x). By Bayes' theorem,

$$
  p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)}
$$

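From the discussion above, applied only to the N_k training points of class C_k, of which K_k fall in the region R, the class-conditional density is

$$
  p(x|C_k) = \frac{K_k}{N_kV}
$$

and the prior p(C_k) is simply

$$
  p(C_k) = \frac{N_k}{N}
$$

Substituting these, together with p(x) = K/(NV), into Bayes' theorem, the factors N_k, N and V cancel, and we finally get

$$
  p(C_k|x) = \frac{K_k}{N_kV} \cdot \frac{N_k}{N} \cdot \frac{NV}{K} = \frac{K_k}{K}
$$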

---------------------------

Okay, we get a very simple representation of p(C_k|x). Using this, we can get the probability that a given vector belongs to the k-th class!
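
To see the cancellation at work, here is a small numeric check with exact fractions. All the counts and the volume are hypothetical values chosen for illustration. It plugs p(x|C_k) = K_k/(N_k V), p(C_k) = N_k/N and p(x) = K/(NV) into Bayes' theorem, and the result is simply the fraction of the K neighbors that belong to each class.

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration.
N = 1000                              # all training samples
N_k = {"a": 600, "b": 300, "c": 100}  # training samples per class
K_k = {"a": 12, "b": 15, "c": 3}      # neighbors per class inside the region
K = sum(K_k.values())                 # 30 neighbors in total
V = Fraction(1, 50)                   # volume of the region (arbitrary)

p_x = Fraction(K, N) / V              # p(x) = K / (N V)
posterior = {}
for c in N_k:
    likelihood = Fraction(K_k[c], N_k[c]) / V   # p(x|C_k) = K_k / (N_k V)
    prior = Fraction(N_k[c], N)                 # p(C_k) = N_k / N
    posterior[c] = likelihood * prior / p_x     # Bayes' theorem

print(posterior)
```

Note that neither V nor the class sizes N_k affect the result, which is why the implementation only needs to count labels among the neighbors.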

### Implementation

I will implement the model according to the theory described above. Given the points we want to classify, we need to find the k nearest training points for each of them. In this implementation, I will use a KDTree for this.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html

With the scipy.spatial.KDTree.query API, we can get the k nearest points easily. Of course, you could implement a brute-force search instead, but for the sake of speed and simplicity we will use this API.
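
Here is a minimal sketch of the query call on random data (the points, the query vector and k are hypothetical), compared against a brute-force search. It shows that `query` returns the distances and the indices of the k nearest points, nearest first, and that both approaches agree.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((200, 3))   # hypothetical training vectors
query = rng.random(3)           # hypothetical point to classify
k = 5

tree = KDTree(points)
dists, idxs = tree.query(query, k)   # distances and indices, nearest first

# Brute force: compute every distance, then keep the k smallest.
all_dists = np.linalg.norm(points - query, axis=1)
brute_idxs = np.argsort(all_dists)[:k]

print(idxs, brute_idxs)
```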

In [None]:
from scipy.spatial import KDTree
from sklearn.model_selection import train_test_split
from collections import Counter

I'll implement my k-nearest neighbor model as the Knn class.

This class has the following methods.

- **__init__**: Constructor. This is where we set the value of k.

- **fit**: Training. In practice, we build a KDTree instance that holds the training data and store the other data the model needs.

- **predict**: Predict labels for input vectors. To handle multiple vectors, it calls predict_each_point for each one.

- **predict_each_point**: Predict the label for a single input vector. We query the k nearest neighbors of the input vector with the KDTree and return the most common label among them.

- **predict_proba**: Predict label probabilities for input vectors. To handle multiple vectors, it calls predict_each_proba for each one.

- **predict_each_proba**: Predict label probabilities for a single input vector. We query the k nearest neighbors of the input vector with the KDTree and count the points per label. Each count divided by the total number of points in the local region (k) is returned as the probability.

In [None]:
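class Knn:
    def __init__(self, k=3):
        self.N = 0                # number of training samples
        self.N_k = None           # number of training samples per label
        self.dim = None           # dimension of the feature vectors
        self.train_kdtree = None
        self.train_labels = None
        self.K = k                # number of neighbors to query

    def fit(self, X, y):
        self.dim = len(X[0])
        self.N = len(X)
        self.train_kdtree = KDTree(X)
        # Keep the labels as an array so they can be indexed by neighbor indices.
        self.train_labels = np.asarray(y)
        self.N_k = Counter(self.train_labels)

    def predict(self, x):
        return np.array([self.predict_each_point(xi) for xi in x])

    def predict_each_point(self, x):
        _, idxs = self.train_kdtree.query(x, self.K)
        # query returns a scalar index when K == 1, so make it 1-D.
        c_K = Counter(self.train_labels[np.atleast_1d(idxs)])
        # most_common(n) returns a list like [(label, count), ...].
        return c_K.most_common(1)[0][0]

    def predict_proba(self, x):
        return np.array([self.predict_each_proba(xi) for xi in x])

    def predict_each_proba(self, x):
        _, idxs = self.train_kdtree.query(x, self.K)
        c_K = Counter(self.train_labels[np.atleast_1d(idxs)])
        # Number of neighbors with each training label, divided by K.
        p_Ck_x = {k: c_K[k] / self.K for k in self.N_k.keys()}
        # Return the probabilities in sorted label order.
        return [p_Ck_x[label] for label in sorted(p_Ck_x.keys())]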

OK, my model is complete.

Next, I'll split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_personality[[col for col in df_personality.columns if col not in ["userid", "movie_choice"]]], 
                                                    df_personality["movie_choice"], test_size=0.33, random_state=42)

Let's predict movie categories with our model! I create a Knn instance with k=30.

The test data is too big for a demo, so I use only the first 10 samples.

In [None]:
knn = Knn(30)

knn.fit(X_train.values, y_train.values)

predicts = knn.predict(X_test.values[0:10])
predict_probas = knn.predict_proba(X_test.values[0:10])

predicts

### Validation of model

I have finished implementing my k-nearest neighbor model. I will check its validity against scikit-learn's KNeighborsClassifier.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
neigh = KNeighborsClassifier(n_neighbors=30, algorithm='kd_tree')

neigh.fit(X_train.values, y_train.values)

predicts_val = neigh.predict(X_test.values[0:10])
predict_probas_val = neigh.predict_proba(X_test.values[0:10])

predicts_val

You can see that the estimates are quite different.😅

For example, the 3rd sample is estimated as 10 by my KNN model, but as 2 by sklearn.

We can see the reason by checking the probabilities.

In [None]:
print("My KNN model")
print(predict_probas[2])
print("-------------------")
print("Sklearn KNN model")
print(predict_probas_val[2])

As you can see, the probabilities are the same. When several labels tie for the highest probability, the two models seem to differ in which label they choose.
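
The sketch below reproduces this kind of tie on made-up neighbor labels. `Counter.most_common` orders equal counts by first occurrence, so my model returns the tied label that appears first among the neighbors. For the other side I am assuming that sklearn resolves ties in favor of the smaller label, which is consistent with the output above but which I have not verified against its source; the argmax over sorted labels below only imitates that assumed behavior.

```python
from collections import Counter
import numpy as np

# Hypothetical labels of k = 4 neighbors, nearest first; 10 and 2 tie with two votes each.
neighbor_labels = np.array([10, 2, 10, 2])

# Equal counts are ordered by first occurrence, so the label seen first wins the tie.
by_counter = Counter(neighbor_labels.tolist()).most_common(1)[0][0]

# argmax over probabilities listed in sorted label order returns the smallest tied label.
classes = np.unique(neighbor_labels)   # sorted unique labels: 2, 10
probas = np.array([(neighbor_labels == c).mean() for c in classes])
by_argmax = classes[np.argmax(probas)]

print(by_counter, by_argmax)
```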

But the 10th sample seems to differ for another reason.

In [None]:
print("My KNN model")
print(predict_probas[9])
print("-------------------")
print("Sklearn KNN model")
print(predict_probas_val[9])

Here, the second and 10th estimates are interchanged. I suspect this comes from the KDTree implementation, but I am not sure. If you know the reason, please tell me!

# <div class="h2">Inference with our model</div>

Finally, let's run our model on the full test data.

Since this is roughly the same as the previous demo, we'll also measure execution speed.

In [None]:
import time

In [None]:
start = time.time()

k = 30

knn = Knn(k)

knn.fit(X_train.values, y_train.values)

predicts = knn.predict(X_test.values)
predict_probas = knn.predict_proba(X_test.values)

taken_time = time.time() - start

print(f"Our KNN model takes {taken_time} seconds with k = {k}")
print(f"Shape of train data was {X_train.values.shape}")
print(f"Shape of test data was {X_test.values.shape}")

In [None]:
start = time.time()

k = 50

knn = Knn(k)

knn.fit(X_train.values, y_train.values)

predicts = knn.predict(X_test.values)
predict_probas = knn.predict_proba(X_test.values)

taken_time = time.time() - start

print(f"Our KNN model takes {taken_time} seconds with k = {k}")
print(f"Shape of train data was {X_train.values.shape}")
print(f"Shape of test data was {X_test.values.shape}")
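
Since we are measuring speed, one possible speed-up is worth sketching. This is an idea I am adding here, not something the Knn class above does: `KDTree.query` also accepts a whole array of query points, so the Python loop over test vectors can be replaced by a single call. The data below is random and hypothetical; the sketch checks that the batched call returns the same neighbor indices as the per-point loop and then computes all probabilities at once.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X_tr = rng.random((1000, 5))            # hypothetical training vectors
y_tr = rng.integers(0, 12, size=1000)   # hypothetical labels 0..11
X_te = rng.random((50, 5))              # hypothetical query vectors
k = 30

tree = KDTree(X_tr)

# One query per point, as in a Python loop.
loop_idxs = np.array([tree.query(x, k)[1] for x in X_te])

# One call for all points: the result has shape (n_queries, k).
_, batch_idxs = tree.query(X_te, k)

# Count votes per label for every query row, then divide by k.
votes = np.array([np.bincount(row, minlength=12) for row in y_tr[batch_idxs]])
batch_probas = votes / k
print(batch_idxs.shape, batch_probas.shape)
```

If you take the argmax of `batch_probas` to get labels, ties go to the smallest label, which can differ from what `Counter.most_common` returns in the class above.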

# Reference

1. Wikipedia (the picture at the top is taken from here).

2. Pattern Recognition and Machine Learning (Japanese edition)

3. SciPy documentation, especially https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html