# Introduction
In this notebook I collect the shot data used for the project. I first pull shots for a handful of key players to show how the `nba_api` library works. I then load the full shot dataset, which I collected earlier and stored in a local file, and clean it.

## Notebook Objective
The objective of this notebook is to collect and clean shot data.

# Setup

## Imports

In [1]:
pip install nba_api

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request, json
from unicodedata import normalize
import seaborn as sns
import os 
import sys

In [3]:
from nba_api.stats.static import players
from nba_api.stats.static import teams
from nba_api.stats.endpoints import playercareerstats
from nba_api.stats.endpoints import leaguedashplayershotlocations
from nba_api.stats.endpoints import playerdashptshots
from nba_api.stats.endpoints import shotchartdetail

# Parameters

In [4]:
# In
SHOT_DATASET = '../../data/raw/shot_dataset.csv'
NBA_GAMES_DATASET = '../../data/raw/players.csv'

# Out
CLEAN_SHOT_DATASET = '../../data/processed/002_shot_dataset.csv'

## Configuration

In [5]:
%matplotlib inline

# Gathering Data
First, let's get a basic understanding of how this library works.

In [6]:
nba_players = players.get_players()
nba_players[103:107]

[{'id': 1628387,
  'full_name': 'Ike Anigbogu',
  'first_name': 'Ike',
  'last_name': 'Anigbogu',
  'is_active': False},
 {'id': 76050,
  'full_name': 'Michael Ansley',
  'first_name': 'Michael',
  'last_name': 'Ansley',
  'is_active': False},
 {'id': 1512,
  'full_name': 'Chris Anstey',
  'first_name': 'Chris',
  'last_name': 'Anstey',
  'is_active': False},
 {'id': 203507,
  'full_name': 'Giannis Antetokounmpo',
  'first_name': 'Giannis',
  'last_name': 'Antetokounmpo',
  'is_active': True}]

In [7]:
test_player = [player for player in nba_players
                   if player['full_name'] == 'Dirk Nowitzki'][0]
test_player

{'id': 1717,
 'full_name': 'Dirk Nowitzki',
 'first_name': 'Dirk',
 'last_name': 'Nowitzki',
 'is_active': False}
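Filtering the full list with a comprehension and taking `[0]` works, but it raises an `IndexError` when the name is misspelled. Below is a minimal sketch of a more forgiving lookup, assuming the same list-of-dicts shape shown above. The helper name `find_player` and the two-record sample list are made up for illustration.

```python
# Hypothetical helper; the sample records are copied from the outputs above.
sample_players = [
    {'id': 203507, 'full_name': 'Giannis Antetokounmpo', 'first_name': 'Giannis',
     'last_name': 'Antetokounmpo', 'is_active': True},
    {'id': 1717, 'full_name': 'Dirk Nowitzki', 'first_name': 'Dirk',
     'last_name': 'Nowitzki', 'is_active': False},
]

def find_player(player_list, full_name):
    """Return the first player dict whose full_name matches (case-insensitive), else None."""
    wanted = full_name.casefold()
    return next((p for p in player_list if p['full_name'].casefold() == wanted), None)

dirk = find_player(sample_players, 'dirk nowitzki')
missing = find_player(sample_players, 'Not A Player')
print(dirk['id'], missing)
```

Returning `None` for an unknown name lets the caller decide how to handle the miss.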

We will also need the team IDs and team names.

In [8]:
nba_teams = teams.get_teams()
for team in nba_teams:
    print(str(team['id']) + " " + team['full_name'])

1610612737 Atlanta Hawks
1610612738 Boston Celtics
1610612739 Cleveland Cavaliers
1610612740 New Orleans Pelicans
1610612741 Chicago Bulls
1610612742 Dallas Mavericks
1610612743 Denver Nuggets
1610612744 Golden State Warriors
1610612745 Houston Rockets
1610612746 Los Angeles Clippers
1610612747 Los Angeles Lakers
1610612748 Miami Heat
1610612749 Milwaukee Bucks
1610612750 Minnesota Timberwolves
1610612751 Brooklyn Nets
1610612752 New York Knicks
1610612753 Orlando Magic
1610612754 Indiana Pacers
1610612755 Philadelphia 76ers
1610612756 Phoenix Suns
1610612757 Portland Trail Blazers
1610612758 Sacramento Kings
1610612759 San Antonio Spurs
1610612760 Oklahoma City Thunder
1610612761 Toronto Raptors
1610612762 Utah Jazz
1610612763 Memphis Grizzlies
1610612764 Washington Wizards
1610612765 Detroit Pistons
1610612766 Charlotte Hornets


This function, `player_shots()`, returns a `ShotChartDetail` response holding the regular-season shot data for the specified player while they played for the specified team. With `season` left as `None`, the calls below return shots from every season.

In [9]:
def player_shots(team, player, season=None):
    response = shotchartdetail.ShotChartDetail(
        team_id=team,
        player_id=player,
        season_nullable=season,
        context_measure_simple='FGA',
        season_type_all_star='Regular Season'
    )
    return response

This function, `player_df()`, turns the JSON from that response into a dataframe.

In [10]:
def player_df(resp):
    content = json.loads(resp.get_json())
    results = content['resultSets'][0]
    headers = results['headers']
    rows = results['rowSet']
    # Passing the headers directly also works when rowSet is empty
    df = pd.DataFrame(rows, columns=headers)
    return df
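To make the conversion concrete, here is a sketch that runs on a hand-written payload instead of a live response. The payload only imitates the `resultSets` / `headers` / `rowSet` layout that `player_df()` reads. The three column names and the helper name `result_set_to_df` are illustrative assumptions, and a real response has 24 columns.

```python
import json
import pandas as pd

# Hand-written stand-in for the string returned by resp.get_json().
fake_json = json.dumps({
    'resultSets': [{
        'headers': ['PLAYER_NAME', 'SHOT_DISTANCE', 'SHOT_MADE_FLAG'],
        'rowSet': [['Kobe Bryant', 18, 0], ['Kobe Bryant', 16, 1]],
    }]
})

def result_set_to_df(raw_json, index=0):
    """Build a dataframe from one result set of a stats-style JSON string."""
    result = json.loads(raw_json)['resultSets'][index]
    # Passing columns= keeps the headers even when rowSet is empty.
    return pd.DataFrame(result['rowSet'], columns=result['headers'])

demo_df = result_set_to_df(fake_json)
empty_df = result_set_to_df(json.dumps(
    {'resultSets': [{'headers': ['PLAYER_NAME'], 'rowSet': []}]}))
print(demo_df.shape, empty_df.shape)
```

An empty `rowSet` yields a frame with zero rows but the expected columns, which is the case a player with no recorded shots would produce.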

We can now use these functions to get the shot data of some key players.

In [11]:
chosen_players = ['Kobe Bryant', 'Tim Duncan', 'Dirk Nowitzki', 'Stephen Curry', 'Devin Booker', 'Luka Doncic']

In [12]:
for player in nba_players:
    if player['full_name'] in chosen_players:
        print(player['full_name'] + ": " + str(player['id']))

Devin Booker: 1626164
Kobe Bryant: 977
Stephen Curry: 201939
Luka Doncic: 1629029
Tim Duncan: 1495
Dirk Nowitzki: 1717


In [13]:
# Played for LA Lakers 1996-2016
K_Bryant = player_shots(1610612747, 977)

# Played for San Antonio Spurs 1997-2016 
T_Duncan = player_shots(1610612759, 1495)

# Played for Dallas Mavericks 1998-2019
D_Nowitzki = player_shots(1610612742, 1717)

# Plays for Golden State Warriors 2009-present
S_Curry = player_shots(1610612744, 201939)

# Plays for the Phoenix Suns 2015-present
D_Booker = player_shots(1610612756, 1626164)

# Plays for Dallas Mavericks 2018-present
L_Doncic = player_shots(1610612742, 1629029)

In [14]:
KBryant_df = player_df(K_Bryant)
KBryant_df.head()

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_AREA,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM
0,Shot Chart Detail,20000012,10,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,10,27,...,Right Side(R),16-24 ft.,18,167,72,1,0,20001031,POR,LAL
1,Shot Chart Detail,20000012,12,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,10,22,...,Left Side(L),8-16 ft.,15,-157,0,1,0,20001031,POR,LAL
2,Shot Chart Detail,20000012,35,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,7,45,...,Left Side Center(LC),16-24 ft.,16,-101,135,1,1,20001031,POR,LAL
3,Shot Chart Detail,20000012,43,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,6,52,...,Right Side Center(RC),16-24 ft.,22,138,175,1,0,20001031,POR,LAL
4,Shot Chart Detail,20000012,155,977,Kobe Bryant,1610612747,Los Angeles Lakers,2,6,19,...,Center(C),Less Than 8 ft.,0,0,0,1,1,20001031,POR,LAL


In [15]:
KBryant_df.shape

(26198, 24)

In [16]:
KBryant_df.columns

Index(['GRID_TYPE', 'GAME_ID', 'GAME_EVENT_ID', 'PLAYER_ID', 'PLAYER_NAME',
       'TEAM_ID', 'TEAM_NAME', 'PERIOD', 'MINUTES_REMAINING',
       'SECONDS_REMAINING', 'EVENT_TYPE', 'ACTION_TYPE', 'SHOT_TYPE',
       'SHOT_ZONE_BASIC', 'SHOT_ZONE_AREA', 'SHOT_ZONE_RANGE', 'SHOT_DISTANCE',
       'LOC_X', 'LOC_Y', 'SHOT_ATTEMPTED_FLAG', 'SHOT_MADE_FLAG', 'GAME_DATE',
       'HTM', 'VTM'],
      dtype='object')

In [17]:
TDuncan_df = player_df(T_Duncan)
DNowitzki_df = player_df(D_Nowitzki)
SCurry_df = player_df(S_Curry)
DBooker_df = player_df(D_Booker)
LDoncic_df = player_df(L_Doncic)

In [18]:
key_players_shots_df = pd.concat([KBryant_df, TDuncan_df, DNowitzki_df, SCurry_df, DBooker_df, LDoncic_df])
key_players_shots_df.head(5)

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_AREA,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM
0,Shot Chart Detail,20000012,10,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,10,27,...,Right Side(R),16-24 ft.,18,167,72,1,0,20001031,POR,LAL
1,Shot Chart Detail,20000012,12,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,10,22,...,Left Side(L),8-16 ft.,15,-157,0,1,0,20001031,POR,LAL
2,Shot Chart Detail,20000012,35,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,7,45,...,Left Side Center(LC),16-24 ft.,16,-101,135,1,1,20001031,POR,LAL
3,Shot Chart Detail,20000012,43,977,Kobe Bryant,1610612747,Los Angeles Lakers,1,6,52,...,Right Side Center(RC),16-24 ft.,22,138,175,1,0,20001031,POR,LAL
4,Shot Chart Detail,20000012,155,977,Kobe Bryant,1610612747,Los Angeles Lakers,2,6,19,...,Center(C),Less Than 8 ft.,0,0,0,1,1,20001031,POR,LAL


In [19]:
key_players_shots_df.shape

(94194, 24)

In [20]:
key_players_shots_df.isnull().sum()

GRID_TYPE              0
GAME_ID                0
GAME_EVENT_ID          0
PLAYER_ID              0
PLAYER_NAME            0
TEAM_ID                0
TEAM_NAME              0
PERIOD                 0
MINUTES_REMAINING      0
SECONDS_REMAINING      0
EVENT_TYPE             0
ACTION_TYPE            0
SHOT_TYPE              0
SHOT_ZONE_BASIC        0
SHOT_ZONE_AREA         0
SHOT_ZONE_RANGE        0
SHOT_DISTANCE          0
LOC_X                  0
LOC_Y                  0
SHOT_ATTEMPTED_FLAG    0
SHOT_MADE_FLAG         0
GAME_DATE              0
HTM                    0
VTM                    0
dtype: int64

In [21]:
key_players_shots_df[key_players_shots_df['TEAM_ID'] == 1610612742]

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_AREA,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM
0,Shot Chart Detail,0020000007,34,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,1,8,8,...,Center(C),Less Than 8 ft.,0,0,0,1,0,20001031,DAL,MIL
1,Shot Chart Detail,0020000007,49,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,1,6,31,...,Left Side Center(LC),24+ ft.,24,-103,226,1,0,20001031,DAL,MIL
2,Shot Chart Detail,0020000007,174,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,2,7,17,...,Left Side(L),24+ ft.,23,-231,2,1,1,20001031,DAL,MIL
3,Shot Chart Detail,0020000007,186,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,2,5,37,...,Right Side(R),16-24 ft.,18,159,104,1,1,20001031,DAL,MIL
4,Shot Chart Detail,0020000007,274,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,3,10,32,...,Left Side Center(LC),24+ ft.,25,-126,219,1,0,20001031,DAL,MIL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3665,Shot Chart Detail,0022000986,377,1629029,Luka Doncic,1610612742,Dallas Mavericks,3,8,8,...,Right Side Center(RC),24+ ft.,28,89,269,1,0,20210504,MIA,DAL
3666,Shot Chart Detail,0022000986,389,1629029,Luka Doncic,1610612742,Dallas Mavericks,3,6,55,...,Left Side Center(LC),24+ ft.,25,-140,213,1,0,20210504,MIA,DAL
3667,Shot Chart Detail,0022000986,413,1629029,Luka Doncic,1610612742,Dallas Mavericks,3,4,28,...,Left Side Center(LC),24+ ft.,26,-138,231,1,1,20210504,MIA,DAL
3668,Shot Chart Detail,0022000986,448,1629029,Luka Doncic,1610612742,Dallas Mavericks,3,2,14,...,Center(C),24+ ft.,26,11,264,1,0,20210504,MIA,DAL


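We can filter this dataframe to find specific shots. Let's look at every made 3-pointer by one of our key players with no time left on the clock in the fourth quarter. The filter does not check the score, so these are end-of-regulation shots rather than guaranteed game winners, and `PERIOD == 4` leaves out overtime.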

In [22]:
key_players_shots_df[(key_players_shots_df['PERIOD'] == 4) & (key_players_shots_df['MINUTES_REMAINING'] < 1)  &
                     (key_players_shots_df['SECONDS_REMAINING'] < 1) & (key_players_shots_df['SHOT_MADE_FLAG'] == 1) &
                     (key_players_shots_df['SHOT_TYPE'] == '3PT Field Goal')]

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_AREA,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM
15265,Shot Chart Detail,20900281,599,977,Kobe Bryant,1610612747,Los Angeles Lakers,4,0,0,...,Center(C),24+ ft.,27,6,274,1,1,20091204,LAL,MIA
15611,Shot Chart Detail,20900476,520,977,Kobe Bryant,1610612747,Los Angeles Lakers,4,0,0,...,Left Side Center(LC),24+ ft.,25,-235,91,1,1,20100101,LAL,SAC
3437,Shot Chart Detail,20200422,450,1495,Tim Duncan,1610612759,San Antonio Spurs,4,0,0,...,Right Side Center(RC),24+ ft.,24,118,214,1,1,20021228,CHI,SAS
3968,Shot Chart Detail,20200881,485,1495,Tim Duncan,1610612759,San Antonio Spurs,4,0,0,...,Center(C),24+ ft.,24,-36,247,1,1,20030306,SAS,NJN
241,Shot Chart Detail,20000246,502,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,4,0,0,...,Right Side Center(RC),24+ ft.,28,108,268,1,1,20001203,LAL,DAL
10909,Shot Chart Detail,20701174,427,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,4,0,0,...,Left Side Center(LC),24+ ft.,24,-199,143,1,1,20080410,DAL,UTA
22441,Shot Chart Detail,29800393,420,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,4,0,0,...,Left Side Center(LC),24+ ft.,24,-136,201,1,1,19990326,DAL,DEN
23012,Shot Chart Detail,29900418,472,1717,Dirk Nowitzki,1610612742,Dallas Mavericks,4,0,0,...,Right Side Center(RC),24+ ft.,24,224,106,1,1,19991230,DAL,TOR
448,Shot Chart Detail,21800495,651,1629029,Luka Doncic,1610612742,Dallas Mavericks,4,0,0,...,Right Side(R),24+ ft.,22,220,12,1,1,20181223,POR,DAL
3100,Shot Chart Detail,22000485,617,1629029,Luka Doncic,1610612742,Dallas Mavericks,4,0,0,...,Left Side Center(LC),24+ ft.,27,-155,227,1,1,20210223,DAL,BOS
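The same kind of filter is easier to read and reuse when each condition gets a name. Here is a small sketch on a hand-made frame; the four rows are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Toy data using the dataset's column names; the values are invented.
shots = pd.DataFrame({
    'PERIOD': [4, 4, 2, 4],
    'MINUTES_REMAINING': [0, 0, 0, 3],
    'SECONDS_REMAINING': [0, 0, 0, 12],
    'SHOT_MADE_FLAG': [1, 0, 1, 1],
    'SHOT_TYPE': ['3PT Field Goal', '3PT Field Goal',
                  '3PT Field Goal', '2PT Field Goal'],
})

# Name each condition so the final selection reads like a sentence.
at_final_buzzer = (
    (shots['PERIOD'] == 4)
    & (shots['MINUTES_REMAINING'] == 0)
    & (shots['SECONDS_REMAINING'] == 0)
)
made_three = (shots['SHOT_MADE_FLAG'] == 1) & (shots['SHOT_TYPE'] == '3PT Field Goal')

buzzer_threes = shots[at_final_buzzer & made_three]
print(len(buzzer_threes))
```

Changing the first condition to `PERIOD >= 4` would also pick up overtime periods.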


In [23]:
key_players_shots_df.groupby(['PLAYER_ID', 'PLAYER_NAME', 'GAME_ID', 'GAME_EVENT_ID']).sum(numeric_only=True).loc[977]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,TEAM_ID,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG
PLAYER_NAME,GAME_ID,GAME_EVENT_ID,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Kobe Bryant,0020000012,10,1610612747,1,10,27,18,167,72,1,0
Kobe Bryant,0020000012,12,1610612747,1,10,22,15,-157,0,1,0
Kobe Bryant,0020000012,35,1610612747,1,7,45,16,-101,135,1,1
Kobe Bryant,0020000012,43,1610612747,1,6,52,22,138,175,1,0
Kobe Bryant,0020000012,155,1610612747,2,6,19,0,0,0,1,1
Kobe Bryant,...,...,...,...,...,...,...,...,...,...,...
Kobe Bryant,0029901185,450,1610612747,4,0,0,19,-141,128,1,0
Kobe Bryant,0029901185,457,1610612747,5,4,26,1,-12,5,1,0
Kobe Bryant,0029901185,496,1610612747,5,0,41,24,164,179,1,1
Kobe Bryant,0029901185,505,1610612747,5,0,21,24,-237,70,1,1


# Finish gathering data

In [24]:
games = pd.read_csv(NBA_GAMES_DATASET)
games.shape, games.dtypes

((7228, 4),
 PLAYER_NAME    object
 TEAM_ID         int64
 PLAYER_ID       int64
 SEASON          int64
 dtype: object)

In [25]:
games.head()

Unnamed: 0,PLAYER_NAME,TEAM_ID,PLAYER_ID,SEASON
0,Royce O'Neale,1610612762,1626220,2019
1,Bojan Bogdanovic,1610612762,202711,2019
2,Rudy Gobert,1610612762,203497,2019
3,Donovan Mitchell,1610612762,1628378,2019
4,Mike Conley,1610612762,201144,2019


In [26]:
# Note: this rebinds `players`, shadowing the nba_api `players` module imported earlier.
players = list(games.groupby(['TEAM_ID', 'PLAYER_ID']).groups)

In [27]:
len(players)

4281

We can use one of our previous dataframes to get the structure of the new dataframe: filtering on a `TEAM_ID` that never occurs leaves an empty frame with the right columns. We will then append each new player's shots to this dataframe.

In [28]:
all_shot_df = KBryant_df[KBryant_df['TEAM_ID'] == 0]
all_shot_df

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_AREA,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM
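One way to build the full dataset is to collect each player's dataframe in a list and concatenate once at the end, which avoids growing a dataframe inside the loop. The sketch below is an assumption about how such a loop could look, not the code that produced `SHOT_DATASET`. The names `collect_shots`, `fetch`, `pause`, and `fake_fetch` are hypothetical, and the stand-in fetcher replaces the real API call so the example runs offline.

```python
import time
import pandas as pd

def collect_shots(pairs, fetch, pause=0.0):
    """Fetch one dataframe per (team_id, player_id) pair and stack them."""
    frames, failed = [], []
    for team_id, player_id in pairs:
        try:
            frames.append(fetch(team_id, player_id))
        except Exception:
            # Remember failures instead of discarding them silently.
            failed.append((team_id, player_id))
        time.sleep(pause)  # optional gap between requests
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed

def fake_fetch(team_id, player_id):
    """Offline stand-in for player_df(player_shots(team_id, player_id))."""
    if player_id == 0:
        raise ValueError('no shots recorded')
    return pd.DataFrame({'TEAM_ID': [team_id], 'PLAYER_ID': [player_id]})

shots, failed = collect_shots([(1610612747, 977), (1610612742, 0)], fake_fetch)
print(shots.shape, failed)
```

With the real functions, `fetch` could be `lambda t, p: player_df(player_shots(t, p))`, and a non-zero `pause` spaces the requests out. Keeping the `failed` list makes it possible to retry or inspect the pairs that did not return data.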


In [29]:
# This is how I gathered the data. It is now stored locally in SHOT_DATASET.
# Note: DataFrame.append was removed in pandas 2.0; use pd.concat there.
"""
for player in players:
    # use a try except block to filter players that have taken no shots in their career
    # (players that haven't taken shots throw an error when appended to the dataframe)
    try:
        resp = player_shots(player[0], player[1])
        all_shot_df = all_shot_df.append(player_df(resp), ignore_index=True)
    except Exception:
        pass
"""

"\nfor player in players:\n    # use a try except block to filter players that have taken no shots in their career\n    # (players that haven't taken shots throw an error when appended to the dataframe)\n    try:\n        resp = player_shots(player[0], player[1])\n        all_shot_df = all_shot_df.append(player_df(resp), ignore_index=True)\n    except Exception:\n        pass\n"

In [30]:
all_shot_data = pd.read_csv(SHOT_DATASET)
all_shot_data.shape, all_shot_data.dtypes

((2892449, 24),
 GRID_TYPE              object
 GAME_ID                 int64
 GAME_EVENT_ID           int64
 PLAYER_ID               int64
 PLAYER_NAME            object
 TEAM_ID                 int64
 TEAM_NAME              object
 PERIOD                  int64
 MINUTES_REMAINING       int64
 SECONDS_REMAINING       int64
 EVENT_TYPE             object
 ACTION_TYPE            object
 SHOT_TYPE              object
 SHOT_ZONE_BASIC        object
 SHOT_ZONE_AREA         object
 SHOT_ZONE_RANGE        object
 SHOT_DISTANCE           int64
 LOC_X                   int64
 LOC_Y                   int64
 SHOT_ATTEMPTED_FLAG     int64
 SHOT_MADE_FLAG          int64
 GAME_DATE               int64
 HTM                    object
 VTM                    object
 dtype: object)

In [31]:
all_shot_data.groupby(['PLAYER_ID']).sum(numeric_only=True).shape

(1347, 12)

In [32]:
all_shot_data.isnull().sum()

GRID_TYPE               0
GAME_ID                 0
GAME_EVENT_ID           0
PLAYER_ID               0
PLAYER_NAME            58
TEAM_ID                 0
TEAM_NAME               0
PERIOD                  0
MINUTES_REMAINING       0
SECONDS_REMAINING       0
EVENT_TYPE              0
ACTION_TYPE             0
SHOT_TYPE               0
SHOT_ZONE_BASIC         0
SHOT_ZONE_AREA          0
SHOT_ZONE_RANGE         0
SHOT_DISTANCE           0
LOC_X                   0
LOC_Y                   0
SHOT_ATTEMPTED_FLAG     0
SHOT_MADE_FLAG          0
GAME_DATE               0
HTM                     0
VTM                     0
dtype: int64

In [33]:
all_shot_data = all_shot_data.dropna()
all_shot_data.isnull().sum()

GRID_TYPE              0
GAME_ID                0
GAME_EVENT_ID          0
PLAYER_ID              0
PLAYER_NAME            0
TEAM_ID                0
TEAM_NAME              0
PERIOD                 0
MINUTES_REMAINING      0
SECONDS_REMAINING      0
EVENT_TYPE             0
ACTION_TYPE            0
SHOT_TYPE              0
SHOT_ZONE_BASIC        0
SHOT_ZONE_AREA         0
SHOT_ZONE_RANGE        0
SHOT_DISTANCE          0
LOC_X                  0
LOC_Y                  0
SHOT_ATTEMPTED_FLAG    0
SHOT_MADE_FLAG         0
GAME_DATE              0
HTM                    0
VTM                    0
dtype: int64

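Let's write a function to make a season column for each shot. We can do this using the `GAME_ID` column: as loaded from the CSV, its second and third digits hold the two-digit year in which the season started.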

In [34]:
def season(game):
    seas = str(game['GAME_ID'])[1:3]
    if int(seas) > 21:
        return int('19' + seas)
    else:
        return int('20' + seas)

In [35]:
all_shot_data['SEASON'] = all_shot_data.apply(season, axis=1)
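Applying a Python function row by row is slow on a frame with nearly 2.9 million rows. Here is a vectorized sketch of the same rule, assuming `GAME_ID` is an 8-digit integer as in the dtypes above. The helper name `season_from_game_id` is hypothetical, and the sample IDs are taken from outputs earlier in this notebook.

```python
import numpy as np
import pandas as pd

def season_from_game_id(game_ids):
    """Map integer GAME_IDs to the starting year of their season."""
    # Digits 2-3 of an 8-digit ID hold the two-digit season year.
    yy = (game_ids // 100000) % 100
    # Same cutoff as season(): values above 21 belong to the 1900s.
    return pd.Series(np.where(yy > 21, 1900 + yy, 2000 + yy), index=game_ids.index)

sample_ids = pd.Series([20000012, 29800393, 21100002, 22000986])
seasons = season_from_game_id(sample_ids)
print(seasons.tolist())
```

The integer arithmetic replaces the string slicing in `season()`, so it would need adjusting if `GAME_ID` were stored as a zero-padded string.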

In [36]:
all_shot_data[(all_shot_data['TEAM_NAME'] == 'Dallas Mavericks') & (all_shot_data['SEASON'] == 2011)]

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM,SEASON
170150,Shot Chart Detail,21100002,311,467,Jason Kidd,1610612742,Dallas Mavericks,2,1,24,...,24+ ft.,24,176,170,1,1,20111225,DAL,MIA,2011
170151,Shot Chart Detail,21100002,331,467,Jason Kidd,1610612742,Dallas Mavericks,2,0,10,...,24+ ft.,24,132,201,1,1,20111225,DAL,MIA,2011
170152,Shot Chart Detail,21100002,346,467,Jason Kidd,1610612742,Dallas Mavericks,3,11,45,...,24+ ft.,24,196,143,1,0,20111225,DAL,MIA,2011
170153,Shot Chart Detail,21100002,443,467,Jason Kidd,1610612742,Dallas Mavericks,3,2,28,...,24+ ft.,24,-35,244,1,0,20111225,DAL,MIA,2011
170154,Shot Chart Detail,21100002,465,467,Jason Kidd,1610612742,Dallas Mavericks,3,0,26,...,24+ ft.,24,225,102,1,0,20111225,DAL,MIA,2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1277161,Shot Chart Detail,21100983,137,202346,Dominique Jones,1610612742,Dallas Mavericks,2,9,17,...,Less Than 8 ft.,5,-56,20,1,0,20120426,ATL,DAL,2011
1277162,Shot Chart Detail,21100983,367,202346,Dominique Jones,1610612742,Dallas Mavericks,4,9,18,...,Less Than 8 ft.,0,2,4,1,0,20120426,ATL,DAL,2011
1277163,Shot Chart Detail,21100983,384,202346,Dominique Jones,1610612742,Dallas Mavericks,4,7,41,...,Less Than 8 ft.,1,-10,3,1,0,20120426,ATL,DAL,2011
1277164,Shot Chart Detail,21100983,407,202346,Dominique Jones,1610612742,Dallas Mavericks,4,6,23,...,16-24 ft.,20,-97,179,1,0,20120426,ATL,DAL,2011


Some teams have changed names over the years. To correct this, we will make a dictionary mapping each team ID to the team's current name and apply it across the dataset.

In [37]:
all_shot_data.groupby(['TEAM_ID', 'TEAM_NAME']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,SEASON
TEAM_ID,TEAM_NAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1610612737,Atlanta Hawks,21278470.0,268.323344,361891.801954,2.479195,5.361934,28.724592,12.61843,1.849815,84.322025,1.0,0.456072,20134690.0,2012.778725
1610612738,Boston Celtics,21361740.0,263.765036,307869.653363,2.484496,5.338538,28.809424,12.799369,-1.00508,86.001917,1.0,0.458646,20126350.0,2011.937919
1610612739,Cleveland Cavaliers,21302520.0,261.004637,255714.857301,2.467033,5.36444,28.802491,12.169738,-4.919565,76.924578,1.0,0.456335,20127580.0,2012.070439
1610612740,New Orleans Hornets,20957350.0,238.270273,99205.13288,2.469594,5.280071,28.752778,11.922134,3.230169,77.768631,1.0,0.457539,20102990.0,2009.567483
1610612740,New Orleans Pelicans,21627210.0,291.979829,464202.513482,2.489503,5.347215,28.796184,12.263353,-3.886284,83.462883,1.0,0.46405,20169260.0,2016.266128
1610612740,New Orleans/Oklahoma City Hornets,20556790.0,229.706767,47580.86245,2.457762,4.93609,28.319991,10.902698,-3.202123,75.462627,1.0,0.459752,20062730.0,2005.561477
1610612741,Chicago Bulls,21291120.0,262.405965,384668.169309,2.479818,5.336867,28.899887,12.186135,-2.134349,81.259783,1.0,0.447052,20136040.0,2012.905171
1610612742,Dallas Mavericks,21317930.0,260.856866,262705.171414,2.468699,5.374073,28.809739,13.338223,4.625537,89.916938,1.0,0.461685,20123280.0,2011.637945
1610612743,Denver Nuggets,21293830.0,270.036894,315118.061463,2.466111,5.344366,28.781118,11.520345,0.975214,75.7032,1.0,0.466406,20130410.0,2012.346107
1610612744,Golden State Warriors,21301540.0,272.037083,264573.469618,2.46884,5.343534,28.770485,12.990354,1.292064,89.503186,1.0,0.474063,20136910.0,2013.009427


In [38]:
team_id_name_dict = { 1610612737:"Atlanta Hawks", 1610612738:"Boston Celtics", 1610612739:"Cleveland Cavaliers", 
                      1610612740:"New Orleans Pelicans", 1610612741:"Chicago Bulls", 1610612742:"Dallas Mavericks",
                      1610612743:"Denver Nuggets", 1610612744:"Golden State Warriors", 1610612745:"Houston Rockets",
                      1610612746:"Los Angeles Clippers", 1610612747:"Los Angeles Lakers", 1610612748:"Miami Heat",
                      1610612749:"Milwaukee Bucks", 1610612750:"Minnesota Timberwolves", 1610612751:"Brooklyn Nets",
                      1610612752:"New York Knicks", 1610612753:"Orlando Magic", 1610612754:"Indiana Pacers",
                      1610612755:"Philadelphia 76ers", 1610612756:"Phoenix Suns", 1610612757:"Portland Trail Blazers",
                      1610612758:"Sacramento Kings", 1610612759:"San Antonio Spurs", 1610612760:"Oklahoma City Thunder",
                      1610612761:"Toronto Raptors", 1610612762:"Utah Jazz", 1610612763:"Memphis Grizzlies",
                      1610612764:"Washington Wizards", 1610612765:"Detroit Pistons", 1610612766:"Charlotte Hornets" }

In [39]:
clean_all_shot_data = all_shot_data.copy()
clean_all_shot_data['TEAM_NAME'] = clean_all_shot_data['TEAM_ID'].map(team_id_name_dict)
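`Series.map` returns `NaN` for any key missing from the dictionary, so it is worth checking that every `TEAM_ID` was covered. Here is a small sketch on invented rows: the ID `999` is a deliberately unknown team, and `name_by_id` is a two-entry excerpt of the dictionary above.

```python
import pandas as pd

# Two-entry excerpt of the ID-to-name dictionary, for illustration only.
name_by_id = {1610612740: 'New Orleans Pelicans', 1610612742: 'Dallas Mavericks'}

rows = pd.DataFrame({
    'TEAM_ID': [1610612740, 1610612740, 1610612742, 999],
    'TEAM_NAME': ['New Orleans Hornets', 'New Orleans Pelicans',
                  'Dallas Mavericks', 'Unknown'],
})

# Overwrite the historical names with the current name for each ID.
rows['TEAM_NAME'] = rows['TEAM_ID'].map(name_by_id)

# Any ID absent from the dictionary shows up as NaN after the map.
unmapped_ids = rows.loc[rows['TEAM_NAME'].isna(), 'TEAM_ID'].unique().tolist()
print(rows['TEAM_NAME'].nunique(), unmapped_ids)
```

An empty `unmapped_ids` list on the real data would confirm that the dictionary covers every team in the dataset.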

In [40]:
clean_all_shot_data.shape

(2892391, 25)

In [41]:
clean_all_shot_data.groupby(['TEAM_ID', 'TEAM_NAME']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,SEASON
TEAM_ID,TEAM_NAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1610612737,Atlanta Hawks,21278470.0,268.323344,361891.801954,2.479195,5.361934,28.724592,12.61843,1.849815,84.322025,1.0,0.456072,20134690.0,2012.778725
1610612738,Boston Celtics,21361740.0,263.765036,307869.653363,2.484496,5.338538,28.809424,12.799369,-1.00508,86.001917,1.0,0.458646,20126350.0,2011.937919
1610612739,Cleveland Cavaliers,21302520.0,261.004637,255714.857301,2.467033,5.36444,28.802491,12.169738,-4.919565,76.924578,1.0,0.456335,20127580.0,2012.070439
1610612740,New Orleans Pelicans,21318970.0,268.425434,304456.191646,2.480345,5.301349,28.756167,12.066166,-1.138161,80.896779,1.0,0.461355,20138730.0,2013.183681
1610612741,Chicago Bulls,21291120.0,262.405965,384668.169309,2.479818,5.336867,28.899887,12.186135,-2.134349,81.259783,1.0,0.447052,20136040.0,2012.905171
1610612742,Dallas Mavericks,21317930.0,260.856866,262705.171414,2.468699,5.374073,28.809739,13.338223,4.625537,89.916938,1.0,0.461685,20123280.0,2011.637945
1610612743,Denver Nuggets,21293830.0,270.036894,315118.061463,2.466111,5.344366,28.781118,11.520345,0.975214,75.7032,1.0,0.466406,20130410.0,2012.346107
1610612744,Golden State Warriors,21301540.0,272.037083,264573.469618,2.46884,5.343534,28.770485,12.990354,1.292064,89.503186,1.0,0.474063,20136910.0,2013.009427
1610612745,Houston Rockets,21254640.0,267.190768,167895.70346,2.468698,5.372618,28.767135,12.685012,3.860097,83.689096,1.0,0.458279,20132200.0,2012.54037
1610612746,Los Angeles Clippers,21422580.0,269.550481,242177.768367,2.467007,5.377815,28.890681,12.42354,0.222861,83.668673,1.0,0.466209,20135220.0,2012.839464


In [42]:
clean_all_shot_data.columns

Index(['GRID_TYPE', 'GAME_ID', 'GAME_EVENT_ID', 'PLAYER_ID', 'PLAYER_NAME',
       'TEAM_ID', 'TEAM_NAME', 'PERIOD', 'MINUTES_REMAINING',
       'SECONDS_REMAINING', 'EVENT_TYPE', 'ACTION_TYPE', 'SHOT_TYPE',
       'SHOT_ZONE_BASIC', 'SHOT_ZONE_AREA', 'SHOT_ZONE_RANGE', 'SHOT_DISTANCE',
       'LOC_X', 'LOC_Y', 'SHOT_ATTEMPTED_FLAG', 'SHOT_MADE_FLAG', 'GAME_DATE',
       'HTM', 'VTM', 'SEASON'],
      dtype='object')

In [43]:
clean_all_shot_data['GRID_TYPE'].unique()

array(['Shot Chart Detail'], dtype=object)

In [44]:
clean_all_shot_data['EVENT_TYPE'].unique()

array(['Missed Shot', 'Made Shot'], dtype=object)

We can drop `GRID_TYPE` because this column doesn't provide any useful information. We can drop `EVENT_TYPE` because this data is already stored in `SHOT_MADE_FLAG`.

In [45]:
clean_all_shot_data = clean_all_shot_data.drop(columns=['GRID_TYPE', 'EVENT_TYPE'])

In [46]:
clean_all_shot_data.columns

Index(['GAME_ID', 'GAME_EVENT_ID', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID',
       'TEAM_NAME', 'PERIOD', 'MINUTES_REMAINING', 'SECONDS_REMAINING',
       'ACTION_TYPE', 'SHOT_TYPE', 'SHOT_ZONE_BASIC', 'SHOT_ZONE_AREA',
       'SHOT_ZONE_RANGE', 'SHOT_DISTANCE', 'LOC_X', 'LOC_Y',
       'SHOT_ATTEMPTED_FLAG', 'SHOT_MADE_FLAG', 'GAME_DATE', 'HTM', 'VTM',
       'SEASON'],
      dtype='object')

We can also drop `PERIOD`, `MINUTES_REMAINING`, `SECONDS_REMAINING`, `ACTION_TYPE`, `HTM` and `VTM` because I won't be using these columns at any point in this project.

In [47]:
clean_all_shot_data = clean_all_shot_data.drop(columns=['PERIOD', 'MINUTES_REMAINING', 'SECONDS_REMAINING', 'ACTION_TYPE', 'HTM', 'VTM'])

## Filter unwanted seasons
This is a very large dataset, so reducing it as much as possible is important. We will not look at any shot data before the 2009 season because we don't have the full data for those seasons. Let's filter the dataframe to drop every season before 2009.

In [48]:
final_shots = clean_all_shot_data[clean_all_shot_data['SEASON']>2008].copy()
final_shots.shape, final_shots.dtypes

((2293539, 17),
 GAME_ID                 int64
 GAME_EVENT_ID           int64
 PLAYER_ID               int64
 PLAYER_NAME            object
 TEAM_ID                 int64
 TEAM_NAME              object
 SHOT_TYPE              object
 SHOT_ZONE_BASIC        object
 SHOT_ZONE_AREA         object
 SHOT_ZONE_RANGE        object
 SHOT_DISTANCE           int64
 LOC_X                   int64
 LOC_Y                   int64
 SHOT_ATTEMPTED_FLAG     int64
 SHOT_MADE_FLAG          int64
 GAME_DATE               int64
 SEASON                  int64
 dtype: object)

# Save Dataset
Now that the dataset has been cleaned and reduced, we can save the new dataset.

In [49]:
final_shots.to_csv(CLEAN_SHOT_DATASET, index=False)
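`to_csv` does not create missing folders, so the save can fail on a fresh checkout if the `data/processed` directory does not exist yet. Here is a sketch of a guard, using a temporary directory in place of the real path. The helper name `save_csv` is hypothetical.

```python
import tempfile
from pathlib import Path
import pandas as pd

def save_csv(df, path):
    """Create the parent folder if needed, then write df without its index."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # 'processed' does not exist inside tmp until save_csv creates it.
    target = save_csv(pd.DataFrame({'SEASON': [2009, 2010]}),
                      Path(tmp) / 'processed' / 'demo.csv')
    round_trip = pd.read_csv(target)

print(round_trip.shape)
```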

# Clear dataframes
This project contains many dataframes. To make sure we don't run out of memory we will delete our dataframes at the end of each notebook.

In [50]:
%reset -f