### Implementing the K Nearest Neighbours algorithm on an NBA dataset
#### Getting the data.

In [2]:
import pandas as pd

nba = pd.read_csv("nba_2016.csv")
nba.head(4)

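|  | PLAYER | TEAM | POS | AGE | GP | MPG | MIN% | USG% | TOV | FTA | ... | 3P% | TS% | PPG | RPG | TRB% | APG | AST% | SPG | BPG | VI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Brooks | Ind | PG | 32 | 46 | 14.3 | 29.9 | 19.6 | 0.173 | 29 | ... | 0.337 | 0.505 | 5.2 | 1.1 | 4.4 | 2.2 | 22.5 | 0.48 | 0.17 | 6.5 |
| 1 | Aaron Gordon | Orl | SF | 21 | 56 | 27.6 | 57.6 | 19.6 | 0.091 | 137 | ... | 0.292 | 0.503 | 11.2 | 4.6 | 9.1 | 1.9 | 11.2 | 0.75 | 0.43 | 6.7 |
| 2 | Adreian Payne | Min | PF | 25 | 12 | 7.7 | 16.1 | 25.2 | 0.133 | 15 | ... | 0.2 | 0.526 | 4.0 | 1.7 | 12.6 | 0.3 | 6.9 | 0.42 | 0.33 | 6.6 |
| 3 | Al Horford | Bos | C | 30 | 45 | 32.9 | 68.5 | 20.6 | 0.113 | 88 | ... | 0.351 | 0.541 | 14.6 | 6.6 | 11.3 | 4.9 | 23.7 | 0.71 | 1.53 | 9.5 |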


In [5]:
print(nba.columns.values)

['PLAYER' 'TEAM' 'POS' 'AGE' 'GP' 'MPG' 'MIN%' 'USG%' 'TOV' 'FTA' 'FT%'
 '2PA' '2P%' '3PA' '3P%' 'TS%' 'PPG' 'RPG' 'TRB%' 'APG' 'AST%' 'SPG' 'BPG'
 'VI']


### KNN Theory
#### KNN is based on the simple idea of predicting unknown values by matching them with the most similar known values.
#### Similarity can be measured by the Euclidean distance between data points, computed over one or more features.
#### We will create a function that returns the Euclidean distance, accumulated over all selected columns, between a row of the dataframe and a reference row.
#### Here, we will use LeBron James as the reference and find players whose statistics are 'similar' to his.
#### The resulting distances will be stored in the pandas Series 'lebron_distance'.

In [8]:
selected_player = nba[nba["PLAYER"]=="LeBron James"].iloc[0]
distance_columns = ['AGE','GP','MPG','MIN%','USG%','TOV','FTA','FT%','2PA','2P%','3PA','3P%','TS%','PPG','RPG','TRB%','APG','AST%','SPG','BPG','VI']

import math

def euclidean_distance(row):
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_player[k]) ** 2
    return math.sqrt(inner_value)

lebron_distance = nba.apply(euclidean_distance, axis = 1)

In [10]:
print(lebron_distance.head(5))

0    679.595620
1    403.797763
2    792.267875
3    427.299678
4    464.581224
dtype: float64
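
The row-by-row loop in `euclidean_distance` can also be written as a single vectorized expression. The sketch below uses a small made-up dataframe (`toy`, `toy_columns` and the player names are hypothetical, not part of the NBA data) to show the idea: subtract the reference row from every row, square, sum across columns, and take the square root.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the NBA dataframe
toy = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "PPG": [10.0, 13.0, 10.0],
    "RPG": [4.0, 8.0, 4.0],
})
toy_columns = ["PPG", "RPG"]

# Reference row: player "A", as a 1-D Series indexed by column name
reference = toy.loc[toy["PLAYER"] == "A", toy_columns].iloc[0]

# DataFrame minus Series aligns on column names, so the reference
# is subtracted from every row in one step
toy_distance = np.sqrt(((toy[toy_columns] - reference) ** 2).sum(axis=1))
print(toy_distance)
```

Player "B" differs by 3 in one column and 4 in the other, so its distance is 5, while "A" and "C" have identical values and a distance of 0.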


### Normalizing Columns
#### Some attributes may matter more than others in deciding whether a player is similar to LeBron James.
#### However, attributes with larger numeric ranges will dominate the distance by sheer size, regardless of their importance.
#### To avoid this, we will normalize the columns.
#### Each column is rescaled to have a mean of 0 and a standard deviation of 1.
#### Normalized values will be stored in the dataframe 'nba_normalized'.

In [12]:
nba_numeric = nba[distance_columns]

nba_pass = nba_numeric - nba_numeric.mean()
nba_normalized = nba_pass / nba_numeric.std()
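
As a quick sanity check of this kind of normalization, the sketch below applies the same two steps to a made-up dataframe (`toy` and its values are hypothetical) and prints the resulting column means and standard deviations. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data: two columns on very different scales
toy = pd.DataFrame({
    "MPG": [10.0, 20.0, 30.0],
    "FTA": [100.0, 300.0, 500.0],
})

# Subtract each column's mean, then divide by each column's standard deviation
toy_normalized = (toy - toy.mean()) / toy.std()

print(toy_normalized)
print(toy_normalized.mean())  # close to 0 for every column
print(toy_normalized.std())   # close to 1 for every column
```

After rescaling, both columns hold the same values even though the raw numbers differed by an order of magnitude, which is exactly why neither can dominate the distance.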

### Finding Nearest Neighbour
#### We can find nearest neighbours using the distance.euclidean function from the scipy.spatial library.
#### It saves us from writing the distance computation by hand.

In [18]:
from scipy.spatial import distance

nba_normalized.fillna(0, inplace=True)

# Select LeBron's row as a 1-D Series; distance.euclidean expects 1-D inputs
lebron_normalized = nba_normalized[nba['PLAYER'] == 'LeBron James'].iloc[0]

euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis = 1)

distance_frame = pd.DataFrame(data={"dist":euclidean_distances, "idx":euclidean_distances.index})
distance_frame.sort_values("dist",inplace = True)
# The smallest distance (0) belongs to LeBron himself, so take the second row
second_smallest = distance_frame.iloc[1]['idx']
most_similar_to_lebron = nba.loc[int(second_smallest)]["PLAYER"]
print(most_similar_to_lebron)

Eric Bledsoe
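
The same approach extends from the single nearest neighbour to the k nearest. The sketch below is one possible way to do it, assuming `scipy.spatial.distance.cdist` is used to compute all distances at once; the dataframe `toy` and its row labels are made up for illustration.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical toy data; the index labels play the role of player names
toy = pd.DataFrame(
    {"x": [0.0, 1.0, 5.0, 0.5], "y": [0.0, 1.0, 5.0, 0.0]},
    index=["ref", "near", "far", "nearest"],
)

# cdist takes two 2-D arrays and returns a matrix of pairwise distances;
# with a single reference row the result has shape (1, n)
reference = toy.loc[["ref"]].to_numpy()
all_distances = cdist(reference, toy.to_numpy(), metric="euclidean")[0]

# Sort ascending; position 0 is the reference itself (distance 0), so skip it
ranked = pd.Series(all_distances, index=toy.index).sort_values()
top_2 = ranked.iloc[1:3]
print(top_2)
```

Here `top_2` holds the two closest rows other than the reference, in order of increasing distance.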


### Generating Training and Test Sets
#### To make predictions and measure how good they are, we split the data into training and test sets.
#### The model is fit on the training set and evaluated on the test set.

In [21]:
from numpy.random import permutation

# Randomly shuffle the row indices of the dataset
random_indices = permutation(nba.index)

# Number of rows to put in the test set (one third of the data)
test_cutoff = math.floor(len(nba)/3)

# Generate test set by taking the first 1/3 of the randomly shuffled indices.
# The slice starts at 0 so that the first shuffled row is not dropped.
test = nba.loc[random_indices[:test_cutoff]]

# Generate train set with rest of the data
train = nba.loc[random_indices[test_cutoff:]]
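
An alternative to shuffling indices by hand is scikit-learn's `train_test_split`. The sketch below is not what this notebook ran; it uses a made-up dataframe (`toy`) and an arbitrary `random_state` so that the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 9 rows
toy = pd.DataFrame({"feature": range(9), "target": range(9)})

# An integer test_size is an absolute row count; here one third of the rows.
# random_state=0 is an arbitrary seed chosen only to make the split repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=len(toy) // 3, random_state=0
)
print(len(toy_train), len(toy_test))
```

Every row ends up in exactly one of the two sets, and rerunning the cell gives the same split as long as the seed is unchanged.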

### Using Scikit-learn
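#### We will now use scikit-learn's implementation of KNN.
#### Distance computation and neighbour lookup are handled by the library.
#### Note that KNeighborsRegressor does not normalize the features itself, so the raw columns are used as they are here.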

In [24]:
# The columns we will be making predictions with
x_columns = ['AGE','GP','MPG','MIN%','USG%','TOV','FTA','FT%','2PA','2P%','3PA','3P%','TS%','RPG','TRB%','APG','AST%','SPG','BPG','VI']

# The column we want to predict
y_column = ["PPG"]

from sklearn.neighbors import KNeighborsRegressor

# Creating the model
knn = KNeighborsRegressor(n_neighbors=5)

# Fit model on Training data
knn.fit(train[x_columns],train[y_column])

# Make predictions on test set using fit model
predictions = knn.predict(test[x_columns])
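
`KNeighborsRegressor` works on whatever feature values it is given and applies no scaling of its own. One way to add scaling is to chain a `StandardScaler` in front of the regressor with `make_pipeline`. The sketch below is a minimal illustration on synthetic data (`X`, `y` and the seed are made up); it is not the model fitted above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 60), rng.uniform(0, 1000, 60)])
y = 10 * X[:, 0]  # the target depends only on the small-scale feature

# The scaler is fit on the training rows only, then applied before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X[:40], y[:40])

toy_predictions = scaled_knn.predict(X[40:])
print(toy_predictions.shape)
```

Wrapping both steps in a pipeline means the test rows are scaled with the training set's mean and standard deviation, so no information from the test set leaks into the fit.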

### Computing Error
#### After making predictions, we can compute the prediction error.
#### We will use the mean squared error (MSE), the average of the squared differences between predicted and actual values, where lower is better.

In [28]:
actual = test[y_column]

mse = (((predictions - actual) ** 2).sum()) / len(predictions)
print(mse)

PPG    2.94147
dtype: float64
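
The same metric is available as `mean_squared_error` in `sklearn.metrics`. The sketch below uses made-up numbers (`toy_actual`, `toy_predicted`) and also takes the square root, which gives the root mean squared error (RMSE) in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
toy_actual = np.array([10.0, 20.0, 30.0])
toy_predicted = np.array([12.0, 18.0, 31.0])

# Squared errors are 4, 4 and 1, so the mean is 9 / 3
toy_mse = mean_squared_error(toy_actual, toy_predicted)

# RMSE puts the error back on the scale of the target
toy_rmse = np.sqrt(toy_mse)
print(toy_mse, toy_rmse)
```

Because the MSE is in squared units, the RMSE is usually the easier number to interpret when the target is something like points per game.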


### Conclusion
#### Using the K Nearest Neighbours algorithm, we predicted 'points per game' with a mean squared error of approximately 2.94, which corresponds to a root mean squared error of about 1.7 points per game.