# 20190607

-----

Initial exploration of the NBA shot data

In [1]:
import itertools as it
import json
import numpy as np
import pandas as pd

In [2]:
with open('../data/all_shot_data.json') as json_file:  
    data = json.load(json_file)
    
df_raw = pd.DataFrame(list(it.chain.from_iterable([player['shots'] for player in data['data']])))
df_raw.head()

Unnamed: 0,ACTION_TYPE,EVENT_TYPE,GAME_DATE,GAME_EVENT_ID,GAME_ID,GRID_TYPE,HTM,LOC_X,LOC_Y,MINUTES_REMAINING,...,SHOT_ATTEMPTED_FLAG,SHOT_DISTANCE,SHOT_MADE_FLAG,SHOT_TYPE,SHOT_ZONE_AREA,SHOT_ZONE_BASIC,SHOT_ZONE_RANGE,TEAM_ID,TEAM_NAME,VTM
0,Dunk Shot,Made Shot,20181017,28,21800006,Shot Chart Detail,ORL,0,3,9,...,1,0,1,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,1610612753,Orlando Magic,MIA
1,Driving Layup Shot,Missed Shot,20181017,66,21800006,Shot Chart Detail,ORL,-20,27,7,...,1,3,0,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,1610612753,Orlando Magic,MIA
2,Jump Shot,Made Shot,20181017,164,21800006,Shot Chart Detail,ORL,-225,11,0,...,1,22,1,3PT Field Goal,Left Side(L),Left Corner 3,24+ ft.,1610612753,Orlando Magic,MIA
3,Turnaround Jump Shot,Made Shot,20181017,224,21800006,Shot Chart Detail,ORL,51,155,8,...,1,16,1,2PT Field Goal,Right Side Center(RC),Mid-Range,16-24 ft.,1610612753,Orlando Magic,MIA
4,Turnaround Jump Shot,Missed Shot,20181017,238,21800006,Shot Chart Detail,ORL,-91,87,7,...,1,12,0,2PT Field Goal,Left Side(L),Mid-Range,8-16 ft.,1610612753,Orlando Magic,MIA


### Understand the Data
What are the possible values of each field? Let's find a player with lots of shots, use them as a sample, and figure out what all the fields mean.

In [35]:
df_raw\
    .groupby(['PLAYER_ID', 'PLAYER_NAME'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by=['SHOT_MADE_FLAG'], ascending=False)\
    .head()

PLAYER_ID,PLAYER_NAME,SHOT_MADE_FLAG
201935,James Harden,1909
202689,Kemba Walker,1684
202331,Paul George,1614
203078,Bradley Beal,1609
203081,Damian Lillard,1533


In [37]:
harden = df_raw[df_raw['PLAYER_ID'] == 201935]

In [38]:
harden.columns

Index(['ACTION_TYPE', 'EVENT_TYPE', 'GAME_DATE', 'GAME_EVENT_ID', 'GAME_ID',
       'GRID_TYPE', 'HTM', 'LOC_X', 'LOC_Y', 'MINUTES_REMAINING', 'PERIOD',
       'PLAYER_ID', 'PLAYER_NAME', 'SECONDS_REMAINING', 'SHOT_ATTEMPTED_FLAG',
       'SHOT_DISTANCE', 'SHOT_MADE_FLAG', 'SHOT_TYPE', 'SHOT_ZONE_AREA',
       'SHOT_ZONE_BASIC', 'SHOT_ZONE_RANGE', 'TEAM_ID', 'TEAM_NAME', 'VTM'],
      dtype='object')

In [44]:
harden\
    .groupby(['ACTION_TYPE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

# These action types could probably be simplified by grouping them into bigger categories: drive, shot, etc.

ACTION_TYPE,SHOT_MADE_FLAG
Step Back Jump shot,587
Jump Shot,349
Driving Layup Shot,324
Driving Floating Jump Shot,170
Pullup Jump shot,150
Driving Finger Roll Layup Shot,79
Layup Shot,58
Running Layup Shot,37
Floating Jump shot,28
Driving Floating Bank Jump Shot,27


In [45]:
harden\
    .groupby(['EVENT_TYPE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

EVENT_TYPE,SHOT_MADE_FLAG
Missed Shot,1066
Made Shot,843


In [47]:
harden\
    .groupby(['GRID_TYPE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

GRID_TYPE,SHOT_MADE_FLAG
Shot Chart Detail,1909


In [58]:
harden\
    .groupby(['LOC_Y'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

harden['LOC_Y'].min()

# I think LOC_X and LOC_Y are the shot's position on the court, but I'm not sure. Worth checking how they relate to SHOT_DISTANCE.

-32
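One quick way to test that hunch is to compute the straight-line distance from the origin and compare it with `SHOT_DISTANCE`. The sketch below uses only the five rows shown in `df_raw.head()` above. The units (tenths of a foot, hoop at the origin) and the truncation to whole feet are assumptions I am testing, not something the data source documents here.

```python
import numpy as np
import pandas as pd

# The five sample rows from df_raw.head() above (LOC_X, LOC_Y, SHOT_DISTANCE).
sample = pd.DataFrame({
    'LOC_X': [0, -20, -225, 51, -91],
    'LOC_Y': [3, 27, 11, 155, 87],
    'SHOT_DISTANCE': [0, 3, 22, 16, 12],
})

# Hypothesis (an assumption, not documented): coordinates are in tenths of a
# foot with the hoop at (0, 0), and SHOT_DISTANCE is truncated to whole feet.
sample['distance_guess'] = np.floor(
    np.hypot(sample['LOC_X'], sample['LOC_Y']) / 10
).astype(int)

sample['matches'] = sample['distance_guess'] == sample['SHOT_DISTANCE']
print(sample)
```

All five sample rows agree with the guess, which is encouraging but not proof. Running the same comparison over all of `df_raw` would show whether it holds in general.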

In [59]:
harden\
    .groupby(['SHOT_DISTANCE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

SHOT_DISTANCE,SHOT_MADE_FLAG
25,292
26,277
1,216
2,145
27,137
24,120
0,89
3,69
4,61
28,60


In [60]:
harden\
    .groupby(['SHOT_TYPE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

SHOT_TYPE,SHOT_MADE_FLAG
3PT Field Goal,1028
2PT Field Goal,881


In [61]:
harden\
    .groupby(['SHOT_ZONE_AREA'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

SHOT_ZONE_AREA,SHOT_MADE_FLAG
Center(C),1094
Right Side Center(RC),377
Left Side Center(LC),289
Right Side(R),76
Left Side(L),72
Back Court(BC),1


In [62]:
harden\
    .groupby(['SHOT_ZONE_BASIC'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

SHOT_ZONE_BASIC,SHOT_MADE_FLAG
Above the Break 3,942
Restricted Area,519
In The Paint (Non-RA),281
Mid-Range,81
Right Corner 3,50
Left Corner 3,35
Backcourt,1


In [63]:
harden\
    .groupby(['SHOT_ZONE_RANGE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

SHOT_ZONE_RANGE,SHOT_MADE_FLAG
24+ ft.,1027
Less Than 8 ft.,714
8-16 ft.,132
16-24 ft.,35
Back Court Shot,1


**NOTES:** OK, it seems like we have a real trove of data here where we can get shooting percentages by shot type. Remember, the goal here is to develop a metric that tracks how many quality shots a team gets. I'm noticing now that I don't see a field that shows how open the player is. Let me check the data source for that, since it is important for us.

OK, so, this is a bummer. The closest-defender range data is gone, so whatever we do here, we won't have access to it. (Online consensus seems to be that somebody is now selling the data rather than the NBA just exposing it for free.) Can we still look into luck and the basketball meta-game without defender data?

In [64]:
df_raw\
    .groupby(['ACTION_TYPE'])\
    .agg({'SHOT_MADE_FLAG': 'count'})\
    .sort_values(by='SHOT_MADE_FLAG', ascending=False)

# Let's group by the following: Layup, Jump Shot, Pullup Jump Shot, Step Back Jump Shot, Other Jump Shot, Driving Layup, DUNK

ACTION_TYPE,SHOT_MADE_FLAG
Jump Shot,77517
Pullup Jump shot,23959
Driving Layup Shot,20340
Layup Shot,11404
Step Back Jump shot,7878
Driving Floating Jump Shot,7327
Cutting Layup Shot,5286
Tip Layup Shot,4884
Floating Jump shot,4880
Running Layup Shot,4668


In [3]:
conditions_action = [
    (df_raw['ACTION_TYPE'] == 'Pullup Jump shot'),
    (df_raw['ACTION_TYPE'] == 'Step Back Jump shot'),
    (df_raw['ACTION_TYPE'] == 'Driving Layup Shot'),
    (df_raw['ACTION_TYPE'].str.contains('Layup')),
    (df_raw['ACTION_TYPE'].str.contains('Jump')),
    (df_raw['ACTION_TYPE'].str.contains('Dunk')),
]

choices_action = ['Pullup Jump Shot', 'Step Back Jump Shot', 'Driving Layup Shot', 'Layup', 'Other Jump Shot', 'Dunk']

conditions_value = [
    (df_raw['SHOT_TYPE'] == '3PT Field Goal')
]

choices_value = [3]

def clean_threes(x):
    if 'Corner' in x:
        return 'Corner Three'
    else:
        return x

df = df_raw\
    .assign(
        action_type_clean= np.select(conditions_action, choices_action, default='Other'),
        shot_value = np.select(conditions_value, choices_value, default = 2),
        shot_zone = lambda x: x['SHOT_ZONE_BASIC'].apply(clean_threes))

df.reset_index(level=0, inplace=True)

df.head()

Unnamed: 0,index,ACTION_TYPE,EVENT_TYPE,GAME_DATE,GAME_EVENT_ID,GAME_ID,GRID_TYPE,HTM,LOC_X,LOC_Y,...,SHOT_TYPE,SHOT_ZONE_AREA,SHOT_ZONE_BASIC,SHOT_ZONE_RANGE,TEAM_ID,TEAM_NAME,VTM,action_type_clean,shot_value,shot_zone
0,0,Dunk Shot,Made Shot,20181017,28,21800006,Shot Chart Detail,ORL,0,3,...,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,1610612753,Orlando Magic,MIA,Dunk,2,Restricted Area
1,1,Driving Layup Shot,Missed Shot,20181017,66,21800006,Shot Chart Detail,ORL,-20,27,...,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,1610612753,Orlando Magic,MIA,Driving Layup Shot,2,Restricted Area
2,2,Jump Shot,Made Shot,20181017,164,21800006,Shot Chart Detail,ORL,-225,11,...,3PT Field Goal,Left Side(L),Left Corner 3,24+ ft.,1610612753,Orlando Magic,MIA,Other Jump Shot,3,Corner Three
3,3,Turnaround Jump Shot,Made Shot,20181017,224,21800006,Shot Chart Detail,ORL,51,155,...,2PT Field Goal,Right Side Center(RC),Mid-Range,16-24 ft.,1610612753,Orlando Magic,MIA,Other Jump Shot,2,Mid-Range
4,4,Turnaround Jump Shot,Missed Shot,20181017,238,21800006,Shot Chart Detail,ORL,-91,87,...,2PT Field Goal,Left Side(L),Mid-Range,8-16 ft.,1610612753,Orlando Magic,MIA,Other Jump Shot,2,Mid-Range
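The grouping above leans on the fact that `np.select` takes the first condition that is true for each row, so the exact-match rules have to come before the broad `str.contains` rules. Here is a minimal sketch of that behaviour on a handful of labels. `'Hook Shot'` is only a stand-in for something that falls through to the default, and the "swapped" ordering is hypothetical, included just to show what goes wrong.

```python
import numpy as np
import pandas as pd

actions = pd.Series([
    'Driving Layup Shot', 'Cutting Layup Shot', 'Pullup Jump shot',
    'Turnaround Jump Shot', 'Dunk Shot', 'Hook Shot',
])

# Same pattern as the cell above, trimmed: specific rules first, broad ones after.
conditions = [
    actions == 'Pullup Jump shot',
    actions == 'Driving Layup Shot',
    actions.str.contains('Layup'),
    actions.str.contains('Jump'),
    actions.str.contains('Dunk'),
]
choices = ['Pullup Jump Shot', 'Driving Layup Shot', 'Layup', 'Other Jump Shot', 'Dunk']

# np.select takes the first condition that is True for each element.
ordered = np.select(conditions, choices, default='Other')

# Hypothetical reordering: putting the broad 'Layup' rule first swallows
# 'Driving Layup Shot' into the generic 'Layup' bucket.
swapped = np.select(
    [conditions[2], conditions[1]], ['Layup', 'Driving Layup Shot'], default='Other'
)

print(pd.DataFrame({'action': actions, 'ordered': ordered, 'swapped': swapped}))
```

With the original ordering, driving layups keep their own category. With the swapped ordering they end up as plain `Layup`.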


### Analysis
What is the league average by shot type and zone? What is the distribution of each? For each player, can we give them a score that is something like total points above expectation, based on the type and zone of each shot they took?

In [26]:
results = df\
    .groupby(['action_type_clean', 'shot_zone', 'shot_value'])\
    .agg({'SHOT_MADE_FLAG': ['mean', 'count']})\
    .reset_index()

results.columns = ['shot_type', 'shot_zone', 'shot_value', 'league_average', 'frequency']

results\
    .assign(expected_value=lambda x: x['league_average'] * x['shot_value'])\
    .sort_values(by=['expected_value'], ascending=False)

Unnamed: 0,shot_type,shot_zone,shot_value,league_average,frequency,expected_value
4,Dunk,Mid-Range,2,1.0,5,2.0
28,Pullup Jump Shot,Mid-Range,3,0.6,5,1.8
5,Dunk,Restricted Area,2,0.898484,12599,1.796968
25,Pullup Jump Shot,Corner Three,3,0.476331,338,1.428994
8,Layup,Restricted Area,2,0.606739,37334,1.213478
20,Other Jump Shot,Mid-Range,3,0.4,115,1.2
31,Step Back Jump Shot,Above the Break 3,3,0.388802,3215,1.166407
17,Other Jump Shot,Corner Three,3,0.380583,16903,1.14175
32,Step Back Jump Shot,Corner Three,3,0.373228,635,1.119685
23,Pullup Jump Shot,Above the Break 3,3,0.363399,8426,1.090197


**Some Ideas:** What if we calculated the expected value for each shot type and zone, then joined that onto the raw data to get total and average points over expectation for each player in the league? We could use that as a measure of shooting ability.

Then, we can go game by game and compare relative points over expected and relative expected points to see why each team won each game, i.e., they got better shots than the other team, or they made more shots than the other team.

Also, I can compute each team's share of quality shots in each game to see who is getting good looks.
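To make the points-over-expectation idea concrete, here is a small sketch on made-up data. The players, zones and outcomes are invented for illustration, and for brevity it groups by zone only, whereas the notebook also groups by action type and shot value.

```python
import pandas as pd

# Toy shots with made-up players and outcomes, purely for illustration.
shots = pd.DataFrame({
    'player': ['A', 'A', 'A', 'B', 'B', 'B'],
    'shot_zone': ['Restricted Area', 'Corner Three', 'Corner Three',
                  'Restricted Area', 'Restricted Area', 'Corner Three'],
    'shot_value': [2, 3, 3, 2, 2, 3],
    'made': [1, 1, 0, 0, 1, 0],
})

# League-wide make rate per zone, turned into expected points per attempt.
shots['league_average'] = shots.groupby('shot_zone')['made'].transform('mean')
shots['expected_points'] = shots['league_average'] * shots['shot_value']

# Actual points scored on each attempt, and the gap to expectation.
shots['points'] = shots['made'] * shots['shot_value']
shots['points_over_expected'] = shots['points'] - shots['expected_points']

summary = shots.groupby('player')[
    ['points', 'expected_points', 'points_over_expected']
].sum()
print(summary)
```

Because the expectation is built from the same shots, points over expected sums to zero across the whole league. A positive total for a player means they beat the league rate on the shots they took.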

# 20190609

Let's assign an expected value to each shot. Then, for each player, let's total up points, expected points, and the delta between the two, along with the per-shot averages.

In [37]:
results = df\
    .groupby(['action_type_clean', 'shot_zone', 'shot_value'])\
    .agg({'SHOT_MADE_FLAG': ['mean', 'count']})\
    .reset_index()

results.columns = ['action_type_clean', 'shot_zone', 'shot_value', 'league_average', 'frequency']

results = results\
    .assign(
        expected_value=lambda x: x['league_average'] * x['shot_value']
    )\
    .sort_values(by=['expected_value'], ascending=False)

In [35]:
df.columns

Index(['ACTION_TYPE', 'EVENT_TYPE', 'GAME_DATE', 'GAME_EVENT_ID', 'GAME_ID',
       'GRID_TYPE', 'HTM', 'LOC_X', 'LOC_Y', 'MINUTES_REMAINING', 'PERIOD',
       'PLAYER_ID', 'PLAYER_NAME', 'SECONDS_REMAINING', 'SHOT_ATTEMPTED_FLAG',
       'SHOT_DISTANCE', 'SHOT_MADE_FLAG', 'SHOT_TYPE', 'SHOT_ZONE_AREA',
       'SHOT_ZONE_BASIC', 'SHOT_ZONE_RANGE', 'TEAM_ID', 'TEAM_NAME', 'VTM',
       'action_type_clean', 'shot_value', 'shot_zone'],
      dtype='object')

In [55]:
player_aggregates = df[['PLAYER_NAME', 'shot_value', 'shot_zone', 'action_type_clean', 'SHOT_MADE_FLAG']]\
    .merge(results, how='left', on=['shot_zone', 'action_type_clean'])\
    .assign(
        points = lambda x: x['SHOT_MADE_FLAG'] * x['shot_value_y'],
        points_above_expectation = lambda x: x['points'] - x['expected_value']
    )\
    .groupby(['PLAYER_NAME'])\
    .agg({
        'expected_value': ['mean','sum', 'count'],
        'points': ['mean','sum', 'count'],
        'points_above_expectation': ['mean','sum', 'count'],    
    })\
    .reset_index()

player_aggregates.columns = ['player_name', 
                            'expected_value_mean', 'expected_value_sum', 'expected_value_count',
                            'points_mean', 'points_sum', 'points_count',
                            'points_above_expectation_mean', 'points_above_expectation_sum', 'points_above_expectation_count']

player_aggregates\
    .sort_values(by = ['points_above_expectation_sum'], ascending=False)\
    .drop(['expected_value_count', 'points_above_expectation_count', 'points_count'], axis = 1)\
    .head()

Unnamed: 0,player_name,expected_value_mean,expected_value_sum,points_mean,points_sum,points_above_expectation_mean,points_above_expectation_sum
451,Stephen Curry,0.762223,1668.506704,1.108725,2427,0.346502,758.493296
224,James Harden,0.738551,2160.260926,0.974359,2850,0.235808,689.739074
54,Buddy Hield,0.819023,1762.537309,1.071097,2305,0.252074,542.462691
302,Kevin Durant,0.928549,1990.810094,1.173974,2517,0.245424,526.189906
310,Klay Thompson,0.853897,1952.007515,1.082677,2475,0.228781,522.992485


## How Does Steph Get His Buckets?

In [75]:
steph = df[['PLAYER_NAME', 'shot_value', 'shot_zone', 'action_type_clean', 'SHOT_MADE_FLAG']]\
    .query('PLAYER_NAME == "Stephen Curry"')\
    .assign(
        points = lambda x: x['SHOT_MADE_FLAG'] * x['shot_value'],
    )\
    .groupby(['action_type_clean', 'shot_zone', 'shot_value'])\
    .agg({'SHOT_MADE_FLAG': ['mean', 'count']})\
    .reset_index()

steph.columns = ['action_type_clean', 'shot_zone', 'shot_value', 'shooting_percentage', 'frequency']

steph\
    .assign(
        expected_value=lambda x: x['shooting_percentage'] * x['shot_value']
    )\
    .merge(results, how='left', on = ['action_type_clean', 'shot_zone'])\
    [['action_type_clean', 'shot_zone', 'shot_value_x', 'shooting_percentage', 'league_average', 'frequency_x']]

# \
#     .groupby(['PLAYER_NAME'])\
#     .agg({
#         'expected_value': ['mean','sum', 'count'],
#         'points': ['mean','sum', 'count'],
#         'points_above_expectation': ['mean','sum', 'count'],    
#     })\
#     .groupby(['action_type_clean', 'shot_zone', 'shot_value'])\
#     .agg({'SHOT_MADE_FLAG': ['mean', 'count']})\
#     .reset_index()

    

Unnamed: 0,action_type_clean,shot_zone,shot_value_x,shooting_percentage,league_average,frequency_x
0,Driving Layup Shot,In The Paint (Non-RA),2,0.2,0.231177,10
1,Driving Layup Shot,Mid-Range,2,1.0,0.140845,1
2,Driving Layup Shot,Restricted Area,2,0.607843,0.518521,51
3,Dunk,Restricted Area,2,0.333333,0.898484,3
4,Layup,In The Paint (Non-RA),2,0.566667,0.317551,30
5,Layup,Mid-Range,2,0.5,0.311111,6
6,Layup,Restricted Area,2,0.649425,0.606739,174
7,Other,Corner Three,3,0.5,0.142857,2
8,Other,In The Paint (Non-RA),2,0.25,0.462296,4
9,Other,Mid-Range,2,0.333333,0.414327,6


In [77]:
results.shape

(37, 6)

In [78]:
steph.shape

(26, 5)

In [83]:
results.query('shot_zone == "Mid-Range" and action_type_clean == "Step Back Jump Shot"')

Unnamed: 0,action_type_clean,shot_zone,shot_value,league_average,frequency,expected_value
34,Step Back Jump Shot,Mid-Range,2,0.431525,3483,0.863049
35,Step Back Jump Shot,Mid-Range,3,0.166667,6,0.5


In [88]:
results.sort_values(by = ['action_type_clean', 'shot_zone'])

Unnamed: 0,action_type_clean,shot_zone,shot_value,league_average,frequency,expected_value
0,Driving Layup Shot,In The Paint (Non-RA),2,0.231177,2829,0.462354
1,Driving Layup Shot,Mid-Range,2,0.140845,71,0.28169
2,Driving Layup Shot,Restricted Area,2,0.518521,17440,1.037041
3,Dunk,In The Paint (Non-RA),2,0.526718,131,1.053435
4,Dunk,Mid-Range,2,1.0,5,2.0
5,Dunk,Restricted Area,2,0.898484,12599,1.796968
6,Layup,In The Paint (Non-RA),2,0.317551,3168,0.635101
7,Layup,Mid-Range,2,0.311111,90,0.622222
8,Layup,Restricted Area,2,0.606739,37334,1.213478
9,Other,Above the Break 3,3,0.058824,17,0.176471


We need to clean out the shots that are classified as mid-range but have a shot value of three. That doesn't make any sense.
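A side effect of those odd rows is that joining on only `shot_zone` and `action_type_clean` matches a shot to more than one row of `results`, which is where the `shot_value_x` / `shot_value_y` columns come from. A minimal sketch of the effect, using the two Mid-Range step-back rows shown above and one hypothetical shot (`'Example Player'` is made up):

```python
import pandas as pd

# The two Mid-Range step-back rows from `results` above.
results_demo = pd.DataFrame({
    'action_type_clean': ['Step Back Jump Shot', 'Step Back Jump Shot'],
    'shot_zone': ['Mid-Range', 'Mid-Range'],
    'shot_value': [2, 3],
    'expected_value': [0.863049, 0.5],
})

# One hypothetical made two-point shot.
shots_demo = pd.DataFrame({
    'PLAYER_NAME': ['Example Player'],
    'action_type_clean': ['Step Back Jump Shot'],
    'shot_zone': ['Mid-Range'],
    'shot_value': [2],
    'SHOT_MADE_FLAG': [1],
})

# Joining on two keys matches both lookup rows, so the shot is counted twice.
loose = shots_demo.merge(results_demo, how='left', on=['shot_zone', 'action_type_clean'])

# Joining on all three keys keeps one row; validate raises if the lookup
# side ever has duplicate keys.
strict = shots_demo.merge(
    results_demo,
    how='left',
    on=['action_type_clean', 'shot_zone', 'shot_value'],
    validate='many_to_one',
)

print(len(loose), len(strict))
```

Joining on all three keys only stops the duplication. It does not settle whether the three-point Mid-Range rows are mislabeled in the source data, so they still need a look.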

# 20190615

In [None]:
with open('../data/all_shot_data.json') as json_file:  
    data = json.load(json_file)
    
df_raw = pd.DataFrame(list(it.chain.from_iterable([player['shots'] for player in data['data']])))

In [21]:
def clean_threes(x):
    if 'Corner' in x:
        return 'Corner Three'
    else:
        return x
    
conditions_action = [
    (df_raw['ACTION_TYPE'] == 'Pullup Jump shot'),
    (df_raw['ACTION_TYPE'] == 'Step Back Jump shot'),
    (df_raw['ACTION_TYPE'] == 'Driving Layup Shot'),
    (df_raw['ACTION_TYPE'].str.contains('Layup')),
    (df_raw['ACTION_TYPE'].str.contains('Jump')),
    (df_raw['ACTION_TYPE'].str.contains('Dunk')),
]

choices_action = ['Pullup Jump Shot', 'Step Back Jump Shot', 'Driving Layup Shot', 'Layup', 'Other Jump Shot', 'Dunk']

conditions_value = [
    (df_raw['SHOT_TYPE'] == '3PT Field Goal')
]

choices_value = [3]

df = df_raw\
    .assign(
        action_type_clean= np.select(conditions_action, choices_action, default='Other'),
        shot_value = np.select(conditions_value, choices_value, default = 2),
        shot_zone = lambda x: x['SHOT_ZONE_BASIC'].apply(clean_threes)
    )

In [35]:
df['shot_value'] == 2 & (df['shot_zone'] == "Corner Three" | df['shot_zone'] == "Above the Break 3")

TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]

In [39]:
df['shot_value'] == 2 & df['shot_zone'] == "Corner Three"

TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
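Both errors above look like an operator-precedence problem: in Python, `&` and `|` bind more tightly than `==`, so `2 & df['shot_zone']` is evaluated first. Wrapping each comparison in its own parentheses avoids that. A sketch on a tiny stand-in frame (the four rows are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    'shot_value': [2, 3, 2, 3],
    'shot_zone': ['Corner Three', 'Corner Three', 'Mid-Range', 'Mid-Range'],
})

# & and | bind more tightly than ==, so each comparison needs its own parentheses.
two_point_threes = (demo['shot_value'] == 2) & (
    (demo['shot_zone'] == 'Corner Three') | (demo['shot_zone'] == 'Above the Break 3')
)

# isin is an equivalent way to write the zone check.
same_mask = (demo['shot_value'] == 2) & demo['shot_zone'].isin(
    ['Corner Three', 'Above the Break 3']
)

print(demo[two_point_threes])
```

The same pattern, with the value and zones flipped, should pick out the three-point shots labeled Mid-Range.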

TypeError: unsupported operand type(s) for +: 'method' and 'method'