Reference:

https://www.dataquest.io/blog/k-nearest-neighbors-in-python/

# Load Dataset

Before we dive into the algorithm, let’s take a look at our data. Each row in the data contains information on how a player performed in the 2013-2014 NBA season.

Here are some selected columns from the data:



*   player — name of the player
*   pos — the position of the player
*   g — number of games the player was in
*   gs — number of games the player started
*   pts — total points the player scored

In [0]:
import pandas
with open("./sample_data/nba_2013.csv", 'r') as csvfile:
    nba = pandas.read_csv(csvfile)
print(nba.columns.values)

['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']


# Euclidean distance

Before we can predict using KNN, we need to find some way to figure out which data rows are “closest” to the row we’re trying to predict on.

A simple way to do this is to use Euclidean distance. The formula is 

![alt text](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.machinelearningplus.com%2Fwp-content%2Fuploads%2F2019%2F04%2F2_multivariate_euclidean_distance_formula-min.png&f=1&nofb=1https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.machinelearningplus.com%2Fwp-content%2Fuploads%2F2019%2F04%2F2_multivariate_euclidean_distance_formula-min.png&f=1&nofb=1)

In [0]:
import math

nba = nba.dropna()

# Select Lebron James from our dataset
selected_player = nba[nba["player"] == "LeBron James"].iloc[0]

# Choose only the numeric columns (we'll use these to compute euclidean distance)
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']

def euclidean_distance(row):
    """
    A simple euclidean distance function
    """
    inner_value = 0
    for k in distance_columns:
      inner_value += (row[k] - selected_player[k]) ** 2
      return math.sqrt(inner_value)

# Find the distance from each player in the dataset to lebron.
lebron_distance = nba.apply(euclidean_distance, axis=1)

In [0]:
lebron_distance

0      6.0
3      1.0
4      4.0
6      1.0
7      5.0
      ... 
476    9.0
477    1.0
478    4.0
479    8.0
480    5.0
Length: 403, dtype: float64

# Normalizing columns

A simple way to deal with this is to normalize all the columns to have a mean of 0, and a standard deviation of 1. This will ensure that no single column has a dominant impact on the euclidean distance calculations.

In [0]:
# Select only the numeric columns from the NBA dataset
nba_numeric = nba[distance_columns]
# Normalize all of the numeric columns
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()

In [0]:
nba_normalized

Unnamed: 0,age,g,gs,mp,fg,fga,fg.,x3p,x3pa,x3p.,x2p,x2pa,x2p.,efg.,ft,fta,ft.,orb,drb,trb,ast,stl,blk,tov,pf,pts
0,-0.814777,0.235407,-0.944580,-0.613829,-0.877328,-0.925592,0.482473,-0.826813,-0.859228,-0.115785,-0.723659,-0.753777,0.320007,-0.010073,-0.637683,-0.607244,-0.589593,0.233644,-0.225131,-0.090114,-0.763546,-0.624501,0.044941,-0.803874,0.102809,-0.874569
3,0.352916,0.676249,1.455748,1.338616,1.426921,1.432458,0.359705,1.545983,1.248019,0.935220,1.106739,1.203540,0.075030,0.532970,1.582841,1.478015,0.519979,-0.412605,0.375022,0.142778,0.864779,-0.279742,-0.705609,1.028200,0.306263,1.568399
4,-0.347700,-0.073182,0.041856,-0.494735,-0.472058,-0.632869,1.546458,-0.903355,-0.962741,-1.867461,-0.229318,-0.345585,1.027974,0.858795,-0.442574,-0.504087,0.670309,0.589081,0.047031,0.218722,-0.674728,-0.624501,0.697593,-0.282680,1.047417,-0.543641
6,0.352916,0.499912,1.324224,1.276779,2.515361,2.549144,0.346064,-0.845948,-0.859228,-0.553704,3.197668,3.535591,-0.071235,-0.322322,1.787240,1.654856,0.570089,1.752328,2.950097,2.689410,0.346676,0.524695,1.415510,0.664944,0.466120,2.143835
7,-0.581239,0.323576,-0.878818,-0.356175,-0.483638,-0.494638,0.196015,-0.865084,-0.874015,-0.856879,-0.256039,-0.215097,-0.083895,-0.444506,-0.656264,-0.629350,-0.589593,0.992986,0.109838,0.390860,-0.445282,-0.595771,0.273369,-0.582761,0.160939,-0.596336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
476,-1.515393,0.632165,-0.418481,0.437399,0.737962,0.882246,-0.076802,-0.137937,0.419908,-0.469847,0.899650,0.899068,0.318398,-0.417354,0.978933,1.404330,-0.725605,0.185175,-0.120453,-0.029360,0.635333,0.955644,-0.281385,1.944236,0.524249,0.744240
477,0.352916,0.279491,-0.648649,0.488930,0.981124,1.101789,0.032325,1.679931,1.617712,0.666213,0.545595,0.628055,0.011661,0.383633,1.220496,1.102226,0.591565,-0.461074,-0.273981,-0.343259,-0.267647,0.036287,-0.411916,0.222719,0.596911,1.176344
478,-0.347700,0.940754,1.620154,1.528707,2.110091,2.169687,0.291501,0.818836,1.188868,0.157164,2.148863,2.140375,0.375373,0.084960,0.551552,0.689595,-0.217349,1.752328,0.933304,1.226236,0.376281,3.512605,0.371267,1.328280,1.425260,1.751780
479,-1.281855,1.073007,-0.845937,0.037749,-0.263634,-0.212757,-0.090442,-0.903355,-0.962741,-1.867461,0.011172,0.173021,-0.500343,-0.770332,0.393606,0.475911,-0.088496,0.976830,0.409915,0.603501,-0.289851,-0.136093,0.534430,0.096369,0.800365,-0.202173


# Finding the nearest neighbor

We now know enough to find the nearest neighbor of a given row in the NBA dataset. We can use the distance.euclidean function from scipy.spatial, a much faster way to calculate euclidean distance.

In [0]:
from scipy.spatial import distance

# Fill in NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)

# Find the normalized vector for lebron james.
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"]

# Find the distance between lebron james and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)

# Create a new dataframe with distances.
distance_frame = pandas.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
# Find the most similar player to lebron (the lowest distance to lebron is lebron, the second smallest is the most similar non-lebron player)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_lebron = nba.loc[int(second_smallest)]["player"]

# Generating training and testing sets

Now that we know how to find the nearest neighbors, we can make predictions on a test set. We’ll try to predict how many points a player scored using the 5 closest neighbors. We’ll find neighbors by using all the numeric columns in the dataset to generate similarity scores.

First, we have to generate test and train sets. In order to do this, we’ll use random sampling. We’ll randomly shuffle the index of the nba dataframe, and then pick rows using the randomly shuffled values.

If we didn’t do this, we’d end up predicting and training on the same data set, which would overfit. We could do cross validation also, which would be slightly better, but slightly more complex.

In [0]:
import random
from numpy.random import permutation

# Randomly shuffle the index of nba.
random_indices = permutation(nba.index)

# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)

# Generate the test set by taking the first 1/3 of the randomly shuffled indices.
test = nba.loc[random_indices[1:test_cutoff]]

# Generate the train set with the rest of the data.
train = nba.loc[random_indices[test_cutoff:]]

# Using sklearn for k nearest neighbors

There’s a regressor and a classifier available, but we’ll be using the regressor, as we have continuous values to predict on.

Sklearn performs the normalization and distance finding automatically

In [0]:
# The columns that we will be making predictions with.
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']
# The column that we want to predict.
y_column = ["pts"]

from sklearn.neighbors import KNeighborsRegressor
# Create the knn model.
# Look at the five closest neighbors.
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data.
knn.fit(train[x_columns], train[y_column])
# Make point predictions on the test set using the fit model.
predictions = knn.predict(test[x_columns])

In [0]:
predictions

array([[ 400.6],
       [ 114. ],
       [  30. ],
       [ 831.4],
       [ 780. ],
       [ 941.2],
       [ 114. ],
       [1943.8],
       [2027.6],
       [1269.6],
       [  33.4],
       [ 467. ],
       [  18. ],
       [ 143.6],
       [ 375.8],
       [1309.6],
       [ 923.2],
       [  86. ],
       [ 423.8],
       [ 317.2],
       [ 467.2],
       [  93.2],
       [ 705.8],
       [ 603.4],
       [  25.4],
       [ 808.2],
       [ 173.2],
       [ 591.2],
       [ 605.6],
       [  10.8],
       [ 554. ],
       [1294. ],
       [1067.6],
       [ 747.8],
       [ 689.2],
       [ 890.2],
       [ 712.4],
       [ 989.2],
       [1574. ],
       [1422. ],
       [ 774.2],
       [   8.8],
       [ 198.2],
       [ 770. ],
       [  71.6],
       [ 931. ],
       [ 188.8],
       [1178.2],
       [ 276.8],
       [ 640.8],
       [ 584.4],
       [ 893. ],
       [ 139.4],
       [ 167. ],
       [  85.2],
       [  72. ],
       [ 786.6],
       [1116.2],
       [ 393. 

In [0]:
# Get the actual values for the test set.
actual = test[y_column]

# Compute the mean squared error of our predictions.
mse = (((predictions - actual) ** 2).sum()) / len(predictions)

In [0]:
mse

pts    4501.658346
dtype: float64