# Using Machine Learning to Find Similar NBA Players

In this notebook, I use a nearest-neighbor search (scikit-learn's `KDTree`) to query a database of NBA players and find similar players based on their statistics. My primary goal is to find players similar to the supporting casts that surrounded Tim Duncan and Dirk Nowitzki on the 2003 and 2011 championship teams, respectively. These two seasons are regarded as two of the most impressive championship runs because the surrounding talent was perceived as weak.

So, are the players surrounding our superstars really subpar?

In [1]:
# Import packages

import pprint
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Import scikit-learn modules
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KDTree

## Import Data

In [2]:
# User path to data file
# Mini is the file path for my Mac Mini
file_mini = "/Users/mini/Documents/nba_data/data_files/player_data_1981-2017.csv"
# Tiny is the file path for my Macbook
#file_tiny = "/Users/benjaminxiao/Documents/nba_data/data_files/player_data_1981-2017.csv"
data = pd.read_csv(file_mini)

In [3]:
data.columns.values

array(['Season', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP_tot',
       'FG_tot', 'FGA_tot', 'FG_perc', '3P_tot', '3PA_tot', '3P_perc',
       '2P_tot', '2PA_tot', '2P_perc', 'eFG_perc', 'FT_tot', 'FTA_tot',
       'FT_perc', 'ORB_tot', 'DRB_tot', 'TRB_tot', 'AST_tot', 'STL_tot',
       'BLK_tot', 'TOV_tot', 'PF_tot', 'PTS_tot', 'MP_per_G', 'FG_per_G',
       'FGA_per_G', '3P_per_G', '3PA_per_G', '2P_per_G', '2PA_per_G',
       'FT_per_G', 'FTA_per_G', 'ORB_per_G', 'DRB_per_G', 'TRB_per_G',
       'AST_per_G', 'STL_per_G', 'BLK_per_G', 'TOV_per_G', 'PF_per_G',
       'PTS_per_G', 'FG_per_36m', 'FGA_per_36m', '3P_per_36m',
       '3PA_per_36m', '2P_per_36m', '2PA_per_36m', 'FT_per_36m',
       'FTA_per_36m', 'ORB_per_36m', 'DRB_per_36m', 'TRB_per_36m',
       'AST_per_36m', 'STL_per_36m', 'BLK_per_36m', 'TOV_per_36m',
       'PF_per_36m', 'PTS_per_36m', 'PER', 'TS_perc', '3PAr', 'FTr',
       'ORB_perc', 'DRB_perc', 'TRB_perc', 'AST_perc', 'STL_perc',
       'BLK_perc', 'TOV_perc', '

In [4]:
data.shape

(15234, 106)

## Data Cleaning

In [5]:
# Subset data to G >= 27, MP_tot >= 450
data = data[(data.G >= 27) & (data.MP_tot >= 450)]

I had to lower the minutes limit to 450 to include Speedy Claxton.

In [6]:
tot_list = data.columns[8:30]

In [7]:
per_100p_list = data.columns[-20:-3]

In [8]:
drop_cols = [] # Store columns to drop
for col in tot_list: # Drop season totals
    drop_cols.append(col)
    
for col in per_100p_list: # Drop per 100 possessions stats
    drop_cols.append(col)

The extra data was making the player query inaccurate from what I could tell.

In [9]:
# Drop unwanted columns (reassign rather than modify the filtered frame in place)
data = data.drop(drop_cols, axis=1)

In [10]:
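# Reset indices for querying later
# Note: without drop=True, the old index is kept as an 'index' column,
# which then appears among the features in x below
data = data.reset_index()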

## Data Preprocessing

In [11]:
# Assign feature and target values
# Exclude things that are labels and not useful statistics
x = data.drop(['Season', 'Player', 'Pos', 'Tm', 'G', 'GS', 'Rounded_Pos'], axis = 1)
y = data['Pos']

In [12]:
x.columns.values

array(['index', 'Age', 'MP_tot', 'MP_per_G', 'FG_per_G', 'FGA_per_G',
       '3P_per_G', '3PA_per_G', '2P_per_G', '2PA_per_G', 'FT_per_G',
       'FTA_per_G', 'ORB_per_G', 'DRB_per_G', 'TRB_per_G', 'AST_per_G',
       'STL_per_G', 'BLK_per_G', 'TOV_per_G', 'PF_per_G', 'PTS_per_G',
       'FG_per_36m', 'FGA_per_36m', '3P_per_36m', '3PA_per_36m',
       '2P_per_36m', '2PA_per_36m', 'FT_per_36m', 'FTA_per_36m',
       'ORB_per_36m', 'DRB_per_36m', 'TRB_per_36m', 'AST_per_36m',
       'STL_per_36m', 'BLK_per_36m', 'TOV_per_36m', 'PF_per_36m',
       'PTS_per_36m', 'PER', 'TS_perc', '3PAr', 'FTr', 'ORB_perc',
       'DRB_perc', 'TRB_perc', 'AST_perc', 'STL_perc', 'BLK_perc',
       'TOV_perc', 'USG_perc', 'OWS', 'DWS', 'WS', 'WS_per_48', 'OBPM',
       'DBPM', 'BPM', 'VORP', 'MP', 'ORtg', 'DRtg'], dtype=object)

In [13]:
# Standardize features so each one contributes comparably to the Euclidean distance
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
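
To see why the features are standardized before the search, here is a minimal sketch using only the standard library. The four toy players and their numbers are invented for illustration, and `zscore_columns` is a hypothetical helper that mirrors what `StandardScaler` does by default (subtract the column mean, divide by the population standard deviation).

```python
import math
import statistics

# Toy rows: (MP_tot, TS_perc). Values are invented for illustration only.
players = {
    "A": (2500.0, 0.60),
    "B": (2400.0, 0.45),   # similar minutes to A, very different efficiency
    "C": (2000.0, 0.60),   # fewer minutes than A, identical efficiency
    "D": (1000.0, 0.52),
}

def zscore_columns(rows):
    """Standardize each column to mean 0 and population standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

names = list(players)
raw = [players[n] for n in names]
scaled = zscore_columns(raw)

def nearest_to_first(rows):
    """Name of the row closest to rows[0] by Euclidean distance."""
    dists = [(math.dist(rows[0], r), n) for r, n in zip(rows[1:], names[1:])]
    return min(dists)[1]

raw_nearest = nearest_to_first(raw)        # minutes dominate the raw distance
scaled_nearest = nearest_to_first(scaled)  # both columns now carry weight
print(raw_nearest, scaled_nearest)
```

On the raw numbers, total minutes swamp the efficiency column, so player A's nearest neighbor is chosen almost entirely by minutes played. After standardization, the efficiency gap counts too and the nearest neighbor changes.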

## Search with KDTree

In [14]:
# Queries KDTree for an array of 30 similar players 
def kdtree_search(input_data, player_index):
    tree = KDTree(input_data, leaf_size=40, metric='euclidean')
    dist, ind = tree.query([input_data[player_index]], k = 30)
    return ind # Return array of 30 similar players
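
As a cross-check on what `tree.query` is expected to return, the following standard-library sketch ranks rows by Euclidean distance with a brute-force scan. It is not the KD-tree algorithm itself, only the result a Euclidean nearest-neighbor query should reproduce. `brute_force_query` and the toy points are hypothetical.

```python
import math

def brute_force_query(points, query_index, k):
    """Return (distances, indices) of the k rows closest to points[query_index]."""
    query = points[query_index]
    ranked = sorted(
        (math.dist(query, p), i) for i, p in enumerate(points)
    )[:k]
    dists = [d for d, _ in ranked]
    inds = [i for _, i in ranked]
    return dists, inds

# Toy 2-D "scaled stats"; values are invented for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (0.5, 0.0), (1.0, 1.0)]
dists, inds = brute_force_query(points, query_index=0, k=3)
print(inds, dists)
```

Because the query row is itself part of the data, it comes back first at distance 0. The same holds when querying the tree with a row from `input_data`, so the target player's own season is always in the neighbor list.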

In [15]:
# Get up to 8 similar players from an array of neighbor indices
# Limits the returned players to rows at or past index 7745 (the cutoff used here for recent seasons)
def get_three_players(ind):
    # Keep recent, similar players in order of shortest distance
    subset_recent_players = ind[0] >= 7745 # Boolean mask for rows at or past the cutoff
    recent_players = ind[0][subset_recent_players]
    # Return the first eight players in the array
    return recent_players[:8]
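
The hard-coded row position 7745 depends on how this particular file is ordered. A possible alternative, sketched below with the standard library, filters neighbors by their season label instead. `season_start_year` and `recent_neighbors` are hypothetical helpers, and the sketch assumes season strings shaped like `'2010-11'`, as in `mavs_season`.

```python
def season_start_year(season):
    """'2010-11' -> 2010. Assumes the 'YYYY-YY' format of the Season column."""
    return int(season.split("-")[0])

def recent_neighbors(neighbor_indices, seasons, min_start_year, limit=8):
    """Keep neighbors (already sorted by distance) whose season starts in
    min_start_year or later, preserving order, up to `limit` entries."""
    kept = [i for i in neighbor_indices
            if season_start_year(seasons[i]) >= min_start_year]
    return kept[:limit]

# Toy data: seasons[i] is the season label of row i (invented for illustration).
seasons = ["2002-03", "2015-16", "2009-10", "2010-11", "2012-13"]
neighbors = [0, 2, 4, 1, 3]  # nearest first
result = recent_neighbors(neighbors, seasons, min_start_year=2010)
print(result)
```

Filtering on the label keeps the cutoff meaningful even if the rows are re-sorted or the minimum-minutes filter changes.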

In [16]:
# Get the indices of our target players
def get_player_indices(df, target_player, target_season):
    return df.index[(df['Player'] == target_player) & (df['Season'] == target_season)][0]

In [17]:
# Collects similar players into a list of tuples
def similar_players(df, array):
    data = []
    for i in array:
        player_name = df['Player'].iloc[i] # Similar player's name
        player_season = df['Season'].iloc[i] # Season of the similar player's row
        player_team = df['Tm'].iloc[i] # Similar player's team that season
        data.append((player_name, player_season, player_team))
    # Return list of tuples of (player name, season, team)
    return data

In [18]:
def search_for_similar_players(df, target_player, target_season, storage_dict):
    """
    Takes a DataFrame, a target player name, a target season, and a storage dictionary.
    Stores a list of (similar player name, season, team that year) tuples
    under the target player's name. Updates storage_dict in place and returns nothing.
    """
    idx = get_player_indices(df, target_player, target_season)
    array = kdtree_search(x_scaled, idx)
    similar_player_indices = get_three_players(array)
    # Store to storage_dict
    storage_dict[target_player] = similar_players(df, similar_player_indices)
In [19]:
# Takes the stored data and removes entries where the similar player is the target player
def rm_same_player(players_dict, players_list):
    for player in players_list: # Handle each target player separately
        sim_players_list = players_dict[player] # Access list of tuples
        # Keep only entries where the similar player is not the target player
        sim_players_list = [pl for pl in sim_players_list if pl[0] != player]
        players_dict[player] = sim_players_list # Update list of tuples
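
A small standalone illustration of what this filter does, with invented names and seasons. The function body is a condensed copy of the one above so the snippet runs on its own.

```python
def rm_same_player(players_dict, players_list):
    # Same logic as the notebook's function: drop entries naming the target player.
    for player in players_list:
        players_dict[player] = [pl for pl in players_dict[player] if pl[0] != player]

# Invented neighbor lists shaped like the output of search_for_similar_players.
demo = {
    "Player A": [("Player A", "2010-11", "DAL"),   # the query row itself
                 ("Player A", "2011-12", "DAL"),   # another season of the same player
                 ("Player B", "2012-13", "NYK")],
}
rm_same_player(demo, ["Player A"])
print(demo)
```

The filter matches on name only, so it removes both the query row and any other seasons of the same player. That is one reason some targets end up with fewer than eight matches.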

## Gathering Similar Players to Three Championship Teams

I picked the 2003 San Antonio Spurs and 2011 Dallas Mavericks as teams where one superstar carried the roster to an NBA championship. The purpose of this project is to compare their supporting casts to modern players, just for fun. For a loaded championship team, the 2008 Boston Celtics are probably a good example. They are known as a great team, but not one built around a single supremely transformative player like Kobe Bryant or 2006 Dwyane Wade.

### Process

* First, we gather a list of significant players.
* Next, we specify the season to look into.
* Finally, create an empty dictionary for data storage.
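
As a side note on the last step, the storage dictionaries in the cells that follow are built with a generator expression. The built-in `dict.fromkeys` gives the same result. The shortened roster here is only for illustration.

```python
roster = ['Jason Terry', 'Shawn Marion']  # shortened roster for illustration

storage_a = dict((el, None) for el in roster)  # form used in this notebook
storage_b = dict.fromkeys(roster)              # equivalent built-in shortcut

print(storage_a == storage_b)
```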

In [20]:
mavs_roster_list = ['Jason Terry', 'Shawn Marion', 'Tyson Chandler', 'J.J. Barea', \
                   'Jason Kidd', 'DeShawn Stevenson', 'Peja Stojakovic']
mavs_season = '2010-11'
similar_players_mavs = dict((el, None) for el in mavs_roster_list)

In [None]:
spurs_roster_list = ['Tony Parker', 'David Robinson*', 'Stephen Jackson', 'Manu Ginobili', \
              'Malik Rose', 'Speedy Claxton', 'Bruce Bowen']
spurs_season = '2002-03'
similar_players_spurs = dict((el,None) for el in spurs_roster_list)

In [21]:
# For the Celtics, use all significant players.
celts_season = '2007-08'
celts_roster_list = ['Paul Pierce', 'Ray Allen', 'Kevin Garnett', 'Rajon Rondo', 'James Posey', \
                    'Leon Powe', 'Eddie House', 'Kendrick Perkins', 'Tony Allen', \
                    'Glen Davis']
similar_players_celts = dict((el, None) for el in celts_roster_list) 

In [22]:
# Call function for 2003 Spurs
for name in spurs_roster_list:
    search_for_similar_players(data, name, spurs_season, similar_players_spurs)
    
# Remove the same players from the list
rm_same_player(similar_players_spurs, spurs_roster_list)

In [23]:
# Call function for 2011 Mavericks
for name in mavs_roster_list:
    search_for_similar_players(data, name, mavs_season, similar_players_mavs)

In [24]:
# Remove the same players from the list
rm_same_player(similar_players_mavs, mavs_roster_list)

In [25]:
# Call function for 2008 Celtics
for name in celts_roster_list:
    search_for_similar_players(data, name, celts_season, similar_players_celts)

rm_same_player(similar_players_celts, celts_roster_list)

## Which players are returned?

In [26]:
pprint.pprint(similar_players_spurs)

{'Bruce Bowen': [('Anthony Parker', '2009-10', 'CLE'),
                 ('Jared Dudley', '2015-16', 'WAS'),
                 ('Shane Battier', '2010-11', 'TOT'),
                 ('Shane Battier', '2008-09', 'HOU'),
                 ('Shane Battier', '2009-10', 'HOU'),
                 ('Courtney Lee', '2014-15', 'MEM'),
                 ('Ime Udoka', '2006-07', 'POR')],
 'David Robinson*': [('Tim Duncan', '2015-16', 'SAS'),
                     ('Samuel Dalembert', '2011-12', 'HOU'),
                     ('Ian Mahinmi', '2015-16', 'IND'),
                     ('Emeka Okafor', '2010-11', 'NOH'),
                     ('Marcus Camby', '2008-09', 'LAC'),
                     ('DeAndre Jordan', '2012-13', 'LAC')],
 'Malik Rose': [('Nene Hilario', '2012-13', 'WAS'),
                ('Luis Scola', '2007-08', 'HOU')],
 'Manu Ginobili': [('James Harden', '2009-10', 'OKC'),
                   ('Trevor Ariza', '2008-09', 'LAL'),
                   ('Mario Chalmers', '2013-14', 'MIA')],
 'Speedy 

In [28]:
pprint.pprint(similar_players_mavs)

{'DeShawn Stevenson': [('Damjan Rudez', '2014-15', 'IND'),
                       ('Mickael Pietrus', '2010-11', 'TOT'),
                       ('Keith Bogans', '2012-13', 'BRK'),
                       ('Eddie House', '2010-11', 'MIA'),
                       ('James Jones', '2009-10', 'MIA'),
                       ('Anthony Tolliver', '2014-15', 'TOT')],
 'J.J. Barea': [('Jordan Crawford', '2013-14', 'TOT'),
                ('Greivis Vasquez', '2013-14', 'TOT'),
                ('Randy Foye', '2009-10', 'WAS'),
                ('D.J. Augustin', '2014-15', 'TOT')],
 'Jason Kidd': [('Nicolas Batum', '2014-15', 'POR'),
                ('Andre Iguodala', '2013-14', 'GSW')],
 'Jason Terry': [('Raymond Felton', '2012-13', 'NYK'),
                 ('Jamal Crawford', '2012-13', 'LAC'),
                 ('Mike Conley', '2014-15', 'MEM'),
                 ('Deron Williams', '2015-16', 'DAL'),
                 ('Nate Robinson', '2012-13', 'CHI')],
 'Peja Stojakovic': [('Eddie House', '2010-11'

In [29]:
pprint.pprint(similar_players_celts)

{'Eddie House': [('Patty Mills', '2015-16', 'SAS'),
                 ('Luther Head', '2007-08', 'HOU'),
                 ('Bobby Jackson', '2007-08', 'TOT'),
                 ('Mickael Pietrus', '2009-10', 'ORL'),
                 ('Jordan Farmar', '2009-10', 'LAL'),
                 ('Kyle Korver', '2010-11', 'CHI')],
 'Glen Davis': [('Zaza Pachulia', '2009-10', 'ATL'),
                ('Zaza Pachulia', '2010-11', 'ATL'),
                ('Ian Mahinmi', '2011-12', 'DAL'),
                ('Cody Zeller', '2013-14', 'CHA'),
                ('Semih Erden', '2010-11', 'TOT'),
                ('Ian Mahinmi', '2012-13', 'IND')],
 'James Posey': [('Quentin Richardson', '2009-10', 'MIA'),
                 ('Marvin Williams', '2014-15', 'CHO'),
                 ('Thabo Sefolosha', '2012-13', 'OKC'),
                 ('Joe Ingles', '2016-17', 'UTA'),
                 ('Jared Dudley', '2015-16', 'WAS'),
                 ('Shane Battier', '2008-09', 'HOU')],
 'Kendrick Perkins': [('DeAndre Jordan

## Caveats

* These only compare REGULAR SEASON statistics to other players' REGULAR SEASON statistics.
* Fewer entries means there were fewer players in recent seasons that matched the target player's season.
* Stats only go up to the 2016-17 season.
* Really great players are great because they are unique, so I don't trust their output nearly as much. This means the Celtics comparisons for their Big 3 are probably not very good.
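
One way to put a number on the last caveat would be to look at the distances that `tree.query` already returns alongside the indices (the notebook discards `dist`). The sketch below uses a brute-force scan and invented points rather than the real data, and `mean_neighbor_distance` is a hypothetical helper.

```python
import math

def mean_neighbor_distance(points, query_index, k):
    """Average Euclidean distance from one row to its k nearest other rows."""
    query = points[query_index]
    dists = sorted(
        math.dist(query, p) for i, p in enumerate(points) if i != query_index
    )
    return sum(dists[:k]) / k

# Invented 2-D points: a tight cluster of role players plus one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
typical = mean_neighbor_distance(points, query_index=0, k=2)
outlier = mean_neighbor_distance(points, query_index=4, k=2)
print(typical, outlier)
```

A player far from everyone else shows a much larger average distance to their nearest neighbors, which suggests the listed "similar" players are similar only in a relative sense.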