# Classification with _k_-nearest neighbors
In this example, we try to predict the genre of a movie. We again use a modified [IMDB (Internet Movie Database) data set on movies](https://www.kaggle.com/nielspace/imdb-data).

In [4]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # splits the data into a train and a test set

In [5]:
df = pd.read_csv('movies.csv')
df = df.dropna()  # first get rid of rows with empty cells
df.head()  # show the first rows as a table

Unnamed: 0,title,runtime,metascore,rating,genre
0,The Dark Knight,152,82.0,9.0,action
2,Inception,148,74.0,8.8,action
3,Interstellar,169,74.0,8.6,drama
4,Kimi no na wa,106,79.0,8.6,drama
5,The Intouchables,112,57.0,8.6,comedy


In [6]:
df['genre'].value_counts() 

drama     314
action    289
comedy    223
Name: genre, dtype: int64

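The k-NN algorithm classifies a movie by looking at the genres of the _k_ movies closest to it, so the features need to be on a comparable scale before we fit the model. Note that scikit-learn's `normalize`, which we use here, rescales each row to unit length. This is not the same as standardizing each column to mean 0 and standard deviation 1, which is what `StandardScaler` does.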

In [7]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = df[['runtime', 'metascore', 'rating']] #create the X matrix
X = normalize(X) #rescale each row (movie) to unit length (L2 norm)
y = df['genre'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
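The difference between row-wise normalization and column-wise standardization can be checked on a few rows. The following is a minimal sketch, assuming the first four movies from the table above as toy input; the names `X_demo`, `X_unit` and `X_std` are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy input: runtime, metascore and rating of the first four movies shown above.
X_demo = np.array([[152.0, 82.0, 9.0],
                   [148.0, 74.0, 8.8],
                   [169.0, 74.0, 8.6],
                   [106.0, 79.0, 8.6]])

X_unit = normalize(X_demo)                      # each ROW is rescaled to unit L2 norm
X_std = StandardScaler().fit_transform(X_demo)  # each COLUMN is rescaled to mean 0, std 1

row_norms = np.linalg.norm(X_unit, axis=1)  # all (approximately) 1
col_means = X_std.mean(axis=0)              # all (approximately) 0
col_stds = X_std.std(axis=0)                # all (approximately) 1
```

When standardizing for a real model, the scaler would usually be fitted on the training data only and then applied to the test data.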

In [9]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=6) #create a KNN-classifier with 6 neighbors (the default is 5)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculates the accuracy (share of correct predictions) on the test data

0.42338709677419356

42% of the movies in the test set are predicted correctly.

For comparison, the share of dramas in the data set is $\frac{314}{826} \times 100 \approx 38.0\%$.

As 38% of the movies are dramas, we could actually get close to this performance by predicting that _everything_ is 'drama'. We can use the _confusion matrix_ to see how well the model tells the different genres apart. A confusion matrix sets the actual classes against the predicted classes and gives the number of predictions for each combination, so it gives a more detailed picture than a single accuracy score.
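The "always predict drama" baseline can be made explicit. The following is a minimal sketch, assuming the genre counts printed above and using scikit-learn's `DummyClassifier`; the feature matrix `X_dummy` is a placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the label column from the genre counts printed above.
y_all = np.array(['drama'] * 314 + ['action'] * 289 + ['comedy'] * 223)
X_dummy = np.zeros((len(y_all), 1))  # placeholder features, ignored by the baseline

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_all)
baseline_acc = baseline.score(X_dummy, y_all)  # equals the share of the majority class

print(f"{baseline_acc:.3f}")  # 0.380
```

Any classifier worth using should score clearly above this number.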

In [11]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #"confusion matrix"
cm

array([[52, 19, 16],
       [26, 20, 16],
       [47, 19, 33]])
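When reading the raw array, it helps to know which class each row and column stands for. By default `confusion_matrix` uses the labels in sorted order, and the `labels` parameter can set a different order. The following is a minimal sketch with made-up toy labels (not taken from the movie data):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, only meant to show the ordering of rows and columns.
y_true_demo = ['drama', 'drama', 'action', 'comedy', 'action']
y_pred_demo = ['drama', 'action', 'action', 'comedy', 'drama']

# Default: classes in sorted order, so rows/columns are action, comedy, drama.
cm_sorted = confusion_matrix(y_true_demo, y_pred_demo)

# Explicit order through the labels parameter: drama, action, comedy.
order = ['drama', 'action', 'comedy']
cm_custom = confusion_matrix(y_true_demo, y_pred_demo, labels=order)
```

Rows are the actual classes and columns are the predicted classes in both cases; only the order changes.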

In [12]:
#To make it easier to read, let's make a dataframe out of it and add labels.
#confusion_matrix orders the classes alphabetically, so the labels have to follow that order.
conf_matrix = pd.DataFrame(cm, index=['action', 'comedy', 'drama'], columns=['action_p', 'comedy_p', 'drama_p'])
conf_matrix

Unnamed: 0,action_p,comedy_p,drama_p
action,52,19,16
comedy,26,20,16
drama,47,19,33


The way to read this is that of the 99 drama movies, 33 are correctly predicted as 'drama', 47 are instead predicted as 'action' and 19 as 'comedy'. _Recall_ is the share of actual dramas that the model finds (the drama row), and _precision_ is the share of predicted dramas that really are dramas (the drama_p column). For the category drama they are:

$recall = \frac{33}{47 + 19 + 33} = .33$

$precision = \frac{33}{16 + 16 + 33} = .51$
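Recall and precision can also be computed for all classes at once from the confusion matrix printed above. The following is a minimal sketch, assuming that rows are the actual classes, columns are the predicted classes, and that the classes follow scikit-learn's default sorted order (action, comedy, drama):

```python
import numpy as np

# Confusion matrix as printed above; rows = actual, columns = predicted.
# Assumption: classes are in scikit-learn's default sorted order.
classes = ['action', 'comedy', 'drama']
cm = np.array([[52, 19, 16],
               [26, 20, 16],
               [47, 19, 33]])

correct = np.diag(cm)                 # correctly predicted movies per class
recall = correct / cm.sum(axis=1)     # divide by row totals (actual class)
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted class)
accuracy = correct.sum() / cm.sum()   # overall share of correct predictions

for name, r, p in zip(classes, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

The overall accuracy computed this way, 105 correct out of 248, matches the value returned by `knn.score` earlier.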

We might improve our scores by trying out different values of _k_.
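One way to try out different values of _k_ is to loop over a range of candidates and compare cross-validated accuracy. The following is a minimal sketch on synthetic data generated with `make_classification` as a stand-in for the movie features; the data, the range of _k_ and the names `scores` and `best_k` are assumptions for illustration, not results for the movie data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with 3 features (an assumption, not the real movie data).
X_syn, y_syn = make_classification(n_samples=300, n_features=3, n_informative=3,
                                   n_redundant=0, n_classes=3,
                                   n_clusters_per_class=1, random_state=1)

scores = {}
for k in range(1, 16, 2):  # odd values of k from 1 to 15
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this value of k.
    scores[k] = cross_val_score(model, X_syn, y_syn, cv=5).mean()

best_k = max(scores, key=scores.get)  # k with the highest mean accuracy
```

For the movie data, `X_syn` and `y_syn` would be replaced by the training features and labels, keeping the test set aside for the final evaluation.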
