# Team Shot Profile Classifier v1.1
## GOAL: Given a shot taken during an NBA game, determine if the shot fits into the selected team's shot profile.
### In this model, we will use the 2023-24 season.
We are going to take all of the games in a season, 1230 in the regular season. We need to collect the shot data of each team, which will come from two places:

1. play by play -- the key information here is the shot type. eventmsgactiontype gives the shot type in the form of a number.
There are several types of shots. These can be categorized further, but this may be future work.

2. shot chart detail -- this lets us gather details about the shots for a specific team for a specific season

In order to successfully merge these two data sources, we need to gather some key information first.

1. Gather all of the team ids (30 teams).
   • this gives us a way to categorize the rest of the information.
   • requesting this much data can run into issues, such as rate limits...

2. After gathering a list of teams, we will get the shot chart data for every game in a season for each team. We can gather the unique game_ids and game events to merge with our other shot data.

    • What is the necessary data for training?

       1. shot distance

       2. shot location? 

3. Using the unique game_ids, we want to get the play by play data. We can pinpoint the data for a specific team using the game event id.

    • What is the data we need from here? Primarily interested in shot type.

       There are over 30 shot types, possibly 40. These can likely be categorized further.
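
The idea of categorizing shot types further could look something like the sketch below. It is only an illustration of possible future work: the family names, the keyword rules, and the `shot_family` helper are all assumptions, not something the data source provides.

```python
# Hypothetical grouping of shot-type labels into coarse families.
# Family names and keyword rules are assumptions for illustration only.
FAMILY_KEYWORDS = [
    ("Dunk", "dunk"),
    ("Layup", "layup"),
    ("Hook Shot", "hook"),
    ("Jump Shot", "jump"),
    ("Jump Shot", "fadeaway"),
]

def shot_family(label: str) -> str:
    """Return the first family whose keyword appears in the label."""
    lowered = label.lower()
    for family, keyword in FAMILY_KEYWORDS:
        if keyword in lowered:
            return family
    return "Other"

examples = [
    "Pullup Jump Shot",
    "Driving Finger Roll Layup",
    "Alley Oop Dunk",
    "Turnaround Hook Shot",
    "Turnaround Fadeaway",
]
families = {label: shot_family(label) for label in examples}
for label, family in families.items():
    print(f"{label} -> {family}")
```

Rule order matters here: "Dunk" and "Layup" are checked before "Jump Shot" so that a label such as "Tip Layup Shot" is not swallowed by a broader rule.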

### To put our minds at ease about the moving parts, let's import our important libraries

In [50]:
#import nba libraries
from nba_api.stats.endpoints import leaguegamefinder, playbyplayv2, ShotChartDetail
from nba_api.stats.static import teams

#import other useful libraries
import pandas as pd
import time
import numpy as np
from fastai.tabular.all import *

## 1. Gather Team IDs
Here we will request the teams dataset and extract each name, id pair.

In [6]:
#request using teams libraries
nba_teams = teams.get_teams()
#extract info and add to a dictionary
tIDs={}
for tm in nba_teams:
    tIDs[tm['id']]=tm['full_name']

#check the first and last items
print(list(tIDs.items())[0], list(tIDs.items())[-1])

(1610612737, 'Atlanta Hawks') (1610612766, 'Charlotte Hornets')


## 2. Gather Shot Data
#### Part 1: From Shot Chart Detail
1. define a function for executing each request
2. loop through all teams
3. save to some variable (we've chosen a dictionary for now -- season_data)

In [9]:
def get_shot_chart_data(player_id, team_id, season_type):
    try:
        shotchart = ShotChartDetail(
            player_id=player_id,
            team_id=team_id,
            context_measure_simple='FGA',
            season_type_all_star=season_type,
            season_nullable='2023-24'
        )
        
        shot_data = shotchart.get_data_frames()[0]  # Data is returned as a list of dataframes
        return shot_data
    except Exception as e:
        print(f"An error occurred: {e}")
        return pd.DataFrame()  # Return an empty DataFrame on error

In [11]:
player_id = 0  # Since we are looking for a team rather than an individual player, we can sub the id for 0.
season_type = 'Regular Season'
season_data={}
for tid in tIDs:
    # Fetch shot chart data for this team
    season_data[tid]=get_shot_chart_data(player_id, tid, season_type)

In [21]:
# Display some shot data to verify success
print(season_data[tid])

              GRID_TYPE     GAME_ID  GAME_EVENT_ID  PLAYER_ID  \
0     Shot Chart Detail  0022300009             11    1630163   
1     Shot Chart Detail  0022300009             15    1629023   
2     Shot Chart Detail  0022300009             19    1630163   
3     Shot Chart Detail  0022300009             24    1641706   
4     Shot Chart Detail  0022300009             28    1631109   
...                 ...         ...            ...        ...   
7128  Shot Chart Detail  0022301216            587     202330   
7129  Shot Chart Detail  0022301216            596    1628970   
7130  Shot Chart Detail  0022301216            598    1641706   
7131  Shot Chart Detail  0022301216            617    1641706   
7132  Shot Chart Detail  0022301216            621    1626179   

          PLAYER_NAME     TEAM_ID          TEAM_NAME  PERIOD  \
0         LaMelo Ball  1610612766  Charlotte Hornets       1   
1     P.J. Washington  1610612766  Charlotte Hornets       1   
2         LaMelo Ball  1610

In [17]:
print('First:', list(season_data.keys())[0],'Last:',list(season_data.keys())[-1],'Length:',len(season_data.keys()))

First: 1610612737 Last: 1610612766 Length: 30


These last two print statements are verifying that our data is populated as expected.

#### Part 2. From Play by Play
1. Here we are going to create a set of all *game ids* in the 2023-2024 NBA season. This will help us retrieve the play by play events from each game for each team.
   • Given 30 teams, 82 games per team, and 2 teams participating in each game, we should get 30 * 82 / 2 = **1230**.
2. Iterate through all games and request play by play data.
   • We will filter for shot attempts only
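
The arithmetic and the de-duplication idea can be checked with a small sketch. The team names and game ids below are made up, and it uses a plain dictionary of lists in place of the real shot chart data.

```python
# Each game involves two teams, so counting 82 games for each of
# 30 teams counts every game twice.
n_teams, games_per_team, teams_per_game = 30, 82, 2
expected_games = n_teams * games_per_team // teams_per_game

# Toy illustration with made-up ids: the same game shows up in the shot
# data of both participating teams, and a set keeps a single copy.
games_by_team = {
    "A": ["g1", "g2"],
    "B": ["g1", "g3"],
    "C": ["g2", "g3"],
}
game_set = set()
for ids in games_by_team.values():
    game_set.update(ids)

print(expected_games)  # 1230
print(len(game_set))   # 3
```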

In [18]:
gameSet=set()
for tm in season_data:
    check=season_data[tm]['GAME_ID'].unique()
    gameSet.update(check)
print(len(gameSet))

1230



In [19]:
def get_play_by_play_data(gameSet):
    all_play_by_play_data = []
    dloadProg=0
    for game_id in gameSet:
        dloadProg+=1
        if dloadProg%20 == 0:
            print(str(round(dloadProg/len(gameSet)*100, 2))+'% COMPLETE')
        try:
            pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
            pbp_df = pbp.get_data_frames()[0]
            pbp_df['GAME_ID'] = game_id
            all_play_by_play_data.append(pbp_df)
            time.sleep(1)  # To avoid hitting rate limits
        except Exception as e:
            print(f"Error fetching data for game {game_id}: {e}")
    
    combined_df = pd.concat(all_play_by_play_data, ignore_index=True)
    return combined_df
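
The one-second pause above spaces out requests, but a failed request is only reported and that game is then skipped. One possible alternative is to retry with exponential backoff, sketched below. `fetch_with_retry`, `flaky_fetch`, and the delay values are hypothetical helpers written for this illustration and are not part of nba_api. The `fetch` argument stands in for whatever call actually requests a game.

```python
import time

def fetch_with_retry(fetch, game_id, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(game_id), retrying with exponential backoff on failure.

    All names here are hypothetical; `fetch` is any callable taking a game id.
    """
    for attempt in range(retries):
        try:
            return fetch(game_id)
        except Exception as exc:
            if attempt == retries - 1:
                raise  # out of retries, let the caller decide what to do
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Retrying game {game_id} in {delay}s after error: {exc}")
            sleep(delay)

# Simulated request that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch(game_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"GAME_ID": game_id}

# The sleep is stubbed out so the demo finishes immediately.
result = fetch_with_retry(flaky_fetch, "0022300166", sleep=lambda seconds: None)
print(result, "after", calls["n"], "calls")
```

If a game still fails after the final retry, collecting its id in a list would make it easy to request the missing games again later.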

In [20]:
pbp_all = get_play_by_play_data(gameSet)

1.63% COMPLETE
3.25% COMPLETE
4.88% COMPLETE
6.5% COMPLETE
8.13% COMPLETE
9.76% COMPLETE
11.38% COMPLETE
13.01% COMPLETE
14.63% COMPLETE
16.26% COMPLETE
17.89% COMPLETE
19.51% COMPLETE
21.14% COMPLETE
22.76% COMPLETE
24.39% COMPLETE
26.02% COMPLETE
27.64% COMPLETE
29.27% COMPLETE
30.89% COMPLETE
32.52% COMPLETE
34.15% COMPLETE
35.77% COMPLETE
37.4% COMPLETE
39.02% COMPLETE
40.65% COMPLETE
42.28% COMPLETE
43.9% COMPLETE
45.53% COMPLETE
47.15% COMPLETE
48.78% COMPLETE
50.41% COMPLETE
52.03% COMPLETE
53.66% COMPLETE
55.28% COMPLETE
56.91% COMPLETE
58.54% COMPLETE
60.16% COMPLETE
61.79% COMPLETE
63.41% COMPLETE
65.04% COMPLETE
66.67% COMPLETE
68.29% COMPLETE
69.92% COMPLETE
71.54% COMPLETE
73.17% COMPLETE
74.8% COMPLETE
76.42% COMPLETE
78.05% COMPLETE
79.67% COMPLETE
81.3% COMPLETE
82.93% COMPLETE
84.55% COMPLETE
86.18% COMPLETE
87.8% COMPLETE
89.43% COMPLETE
91.06% COMPLETE
92.68% COMPLETE
94.31% COMPLETE
95.93% COMPLETE
97.56% COMPLETE
99.19% COMPLETE


Let's find out some details about our new dataframe.

In [24]:
len(pbp_all) #number of data points

567672

In [25]:
len(pbp_all['GAME_ID'].unique()) #number of unique games. this should be 1230 or close to it. The console would have printed an error message for every missed game when loading the data.

1230

In [26]:
#inspect the first row . . . the first entry of every game should be the start of the 1st period
pbp_all.iloc[0]

GAME_ID                                             0022300166
EVENTNUM                                                     2
EVENTMSGTYPE                                                12
EVENTMSGACTIONTYPE                                           0
PERIOD                                                       1
WCTIMESTRING                                           8:11 PM
PCTIMESTRING                                             12:00
HOMEDESCRIPTION                                           None
NEUTRALDESCRIPTION           Start of 1st Period (8:11 PM EST)
VISITORDESCRIPTION                                        None
SCORE                                                     None
SCOREMARGIN                                               None
PERSON1TYPE                                                  0
PLAYER1_ID                                                   0
PLAYER1_NAME                                              None
PLAYER1_TEAM_ID                                        

In [28]:
#filter for shot attempts using the mapped EVENTMSGTYPE values (1 = make, 2 = miss)
pbp_shots = pbp_all[pbp_all['EVENTMSGTYPE'].isin([1, 2])]

In [29]:
len(pbp_shots)

218705

Here we notice that about 38.5% of game events (218,705 of 567,672) are shot attempts.
Now let's verify the data.

In [30]:
pbp_shots.iloc[0]

GAME_ID                                            0022300166
EVENTNUM                                                    7
EVENTMSGTYPE                                                2
EVENTMSGACTIONTYPE                                         79
PERIOD                                                      1
WCTIMESTRING                                          8:11 PM
PCTIMESTRING                                            11:43
HOMEDESCRIPTION                                          None
NEUTRALDESCRIPTION                                       None
VISITORDESCRIPTION           MISS Ingram 13' Pullup Jump Shot
SCORE                                                    None
SCOREMARGIN                                              None
PERSON1TYPE                                                 5
PLAYER1_ID                                            1627742
PLAYER1_NAME                                   Brandon Ingram
PLAYER1_TEAM_ID                                  1610612740.0
PLAYER1_

Now that we have all of our data, we need to merge what we want and omit
what we don't need. We have a dictionary of season data for each team, and the value
under each key is a DataFrame. We should be able to merge the individual DataFrames, so let's try.

## 3. Merge Shot Data

1. Identify the columns needed to train our model (shot distance, shot type (eventmsgactiontype), shot made, period, and *others in the future, e.g. court region*).
2. Identify the keys on which we are merging (*game_ids* and *eventnum/game_event_id*).
3. Verify the result.
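
The merge described in these steps is an inner join on a two-part key. The sketch below shows the idea in plain Python with made-up rows, as a stand-in for the pandas merge used in the notebook. The row values are hypothetical. Only the column names come from the two sources.

```python
# Toy rows standing in for the two sources; the values are made up.
shot_chart = [
    {"GAME_ID": "001", "GAME_EVENT_ID": 7, "SHOT_DISTANCE": 13, "SHOT_MADE_FLAG": 0},
    {"GAME_ID": "001", "GAME_EVENT_ID": 9, "SHOT_DISTANCE": 2, "SHOT_MADE_FLAG": 1},
    {"GAME_ID": "002", "GAME_EVENT_ID": 7, "SHOT_DISTANCE": 25, "SHOT_MADE_FLAG": 1},
]
play_by_play = [
    {"GAME_ID": "001", "EVENTNUM": 7, "EVENTMSGACTIONTYPE": 79, "PERIOD": 1},
    {"GAME_ID": "001", "EVENTNUM": 9, "EVENTMSGACTIONTYPE": 5, "PERIOD": 1},
    {"GAME_ID": "003", "EVENTNUM": 7, "EVENTMSGACTIONTYPE": 1, "PERIOD": 1},
]

# Index play-by-play rows by the composite key (game id, event number).
pbp_index = {(row["GAME_ID"], row["EVENTNUM"]): row for row in play_by_play}

# Inner join: keep only shots whose key exists on both sides.
merged = []
for shot in shot_chart:
    key = (shot["GAME_ID"], shot["GAME_EVENT_ID"])
    if key in pbp_index:
        merged.append({**shot, **pbp_index[key]})

unmatched = len(shot_chart) - len(merged)
print(len(merged), "merged rows,", unmatched, "shot without a match")
```

Event number 7 appears in more than one game, which is why the game id has to be part of the key. Counting unmatched rows is one simple way to carry out the verify step, since an inner join drops them silently.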

In [31]:
pbpShotFILT=pbp_shots[['GAME_ID','EVENTNUM','EVENTMSGACTIONTYPE','PERIOD']]
seasonFILT={}

In [32]:
for tm in season_data:
    tmFILT=season_data[tm][['GAME_ID','GAME_EVENT_ID','SHOT_DISTANCE','SHOT_MADE_FLAG']]
    mf=pd.merge(tmFILT, pbpShotFILT, 
                     left_on=['GAME_ID', 'GAME_EVENT_ID'],  # Keys in the left frame (tmFILT)
                     right_on=['GAME_ID', 'EVENTNUM'])      # Keys in the right frame (pbpShotFILT)
    seasonFILT[tm]=mf.drop(columns='GAME_EVENT_ID')

In [36]:
print(seasonFILT[list(seasonFILT.keys())[0]])

         GAME_ID  SHOT_DISTANCE  SHOT_MADE_FLAG  EVENTNUM  EVENTMSGACTIONTYPE  \
0     0022300018              6               1        12                  78   
1     0022300018             26               1        16                   1   
2     0022300018             15               1        25                  79   
3     0022300018              3               1        28                   5   
4     0022300018              2               1        32                   5   
...          ...            ...             ...       ...                 ...   
7579  0022301218             27               0       662                   1   
7580  0022301218              9               0       665                  97   
7581  0022301218              4               1       667                   1   
7582  0022301218             29               1       676                  79   
7583  0022301218             28               0       683                  79   

      PERIOD  
0          1

In [38]:
# I could add a win/loss ratio to this to help rank instead of just categorize.
# There is also more positional data I could incorporate...

seasonFILT now has every shot from the season, separated by team. At this point we may be able to train our model.

At first I was considering giving a stat line or some ratio/expected value of shot selections. Instead I will elect to classify which team has the shot profile that most closely aligns with the selected shot's parameters (possibly within a specific matchup).

In [37]:
dfs = []

# Iterate through the dictionary and add a new column with the dictionary keys
for key, df in seasonFILT.items():
    df['TEAM_ID'] = key  # Add the key as a new column
    dfs.append(df)      # Add the DataFrame to the list

# Concatenate all DataFrames in the list into a single DataFrame
sznFILTdf = pd.concat(dfs, ignore_index=True)

# Display the resulting DataFrame
print(sznFILTdf)

           GAME_ID  SHOT_DISTANCE  SHOT_MADE_FLAG  EVENTNUM  \
0       0022300018              6               1        12   
1       0022300018             26               1        16   
2       0022300018             15               1        25   
3       0022300018              3               1        28   
4       0022300018              2               1        32   
...            ...            ...             ...       ...   
218698  0022301216              2               1       587   
218699  0022301216             23               1       596   
218700  0022301216             26               1       598   
218701  0022301216              7               0       617   
218702  0022301216             22               1       621   

        EVENTMSGACTIONTYPE  PERIOD     TEAM_ID  
0                       78       1  1610612737  
1                        1       1  1610612737  
2                       79       1  1610612737  
3                        5       1  1610612737 

## 4. Developing a Target + Feature Engineering

In [40]:
# Step 1: Create a team shot profile based on shot distance and shot type
team_profiles = {}

# Group data by team_id
for team_id, df in sznFILTdf.groupby('TEAM_ID'):
    # Get shot distance and shot type stats per team
    shot_stats = df.groupby(['SHOT_DISTANCE', 'EVENTMSGACTIONTYPE']).agg({
        'SHOT_MADE_FLAG': ['mean', 'count']  # Mean shot success rate and count of attempts
    }).reset_index()
    
    # Rename columns for clarity
    shot_stats.columns = ['SHOT_DISTANCE', 'EVENTMSGACTIONTYPE', 'MEAN_SHOT_SUCCESS_RATE', 'COUNT']
    
    # Store the profile in the dictionary
    team_profiles[team_id] = shot_stats

# Example to show the output
for team_id, profile in team_profiles.items():
    print(f"Team: {team_id}")
    print(profile)  # pandas abbreviates long frames when printing


Team: 1610612737
     SHOT_DISTANCE  EVENTMSGACTIONTYPE  MEAN_SHOT_SUCCESS_RATE  COUNT
0                0                   3                0.500000      2
1                0                   5                0.800000     20
2                0                   6                0.603774     53
3                0                   7                0.923077     26
4                0                   9                0.950000     20
..             ...                 ...                     ...    ...
471             55                  78                0.000000      1
472             55                  79                0.000000      1
473             57                  78                0.000000      1
474             62                  79                0.000000      2
475             72                   1                0.000000      1

[476 rows x 4 columns]
Team: 1610612738
     SHOT_DISTANCE  EVENTMSGACTIONTYPE  MEAN_SHOT_SUCCESS_RATE  COUNT
0                0              

In [89]:
# Step 2: Define a function to compare a shot to a team's profile (including shot type)
def does_shot_fit_profile(shot, team_id, team_profiles):
    shot_distance = shot['SHOT_DISTANCE']
    shot_type = shot['EVENTMSGACTIONTYPE']
    
    # Get the team's profile for shot distance and type
    team_profile = team_profiles.get(team_id)
    
    if team_profile is not None:
        # Match both shot distance and shot type
        closest_match = team_profile[
            (team_profile['SHOT_DISTANCE'] == shot_distance) & 
            (team_profile['EVENTMSGACTIONTYPE'] == shot_type)
        ]
        
        if not closest_match.empty:
            # Define the rule for fitting the profile (success rate above a 0.42 cutoff)
            if closest_match['MEAN_SHOT_SUCCESS_RATE'].values[0] > 0.42:  # Example rule
                return 1  # Shot fits the profile
            else:
                return 0  # Shot does not fit
    return 0  # Default: Does not fit

# Step 3: Apply the function to generate the target column 'FITS_PROFILE'
def generate_profile_labels(df, team_profiles):
    df['FITS_PROFILE'] = df.apply(lambda row: does_shot_fit_profile(row, row['TEAM_ID'], team_profiles), axis=1)
    return df

# Rebuild the team profiles (same as Step 1) so this cell can run on its own
team_profiles = {}

for team_id, df in sznFILTdf.groupby('TEAM_ID'):
    # Get shot distance and shot type stats per team
    shot_stats = df.groupby(['SHOT_DISTANCE', 'EVENTMSGACTIONTYPE']).agg({
        'SHOT_MADE_FLAG': ['mean', 'count']  # Mean shot success rate and count of attempts
    }).reset_index()
    
    # Rename columns for clarity
    shot_stats.columns = ['SHOT_DISTANCE', 'EVENTMSGACTIONTYPE', 'MEAN_SHOT_SUCCESS_RATE', 'COUNT']
    
    # Store the profile in the dictionary
    team_profiles[team_id] = shot_stats

# Apply to the whole dataset (sznFILTdf)
sznFILTdf = generate_profile_labels(sznFILTdf, team_profiles)

# Check the results
print(sznFILTdf)


           GAME_ID  SHOT_DISTANCE  SHOT_MADE_FLAG  EVENTNUM  \
0       0022300018              6               1        12   
1       0022300018             26               1        16   
2       0022300018             15               1        25   
3       0022300018              3               1        28   
4       0022300018              2               1        32   
...            ...            ...             ...       ...   
218698  0022301216              2               1       587   
218699  0022301216             23               1       596   
218700  0022301216             26               1       598   
218701  0022301216              7               0       617   
218702  0022301216             22               1       621   

        EVENTMSGACTIONTYPE  PERIOD     TEAM_ID  FITS_PROFILE  
0                       78       1  1610612737             1  
1                        1       1  1610612737             0  
2                       79       1  1610612737        

In [90]:
# Check the number of instances where shots fit the profile
fit_counts = sznFILTdf['FITS_PROFILE'].value_counts()

# Print the counts and ratio
print(fit_counts, fit_counts[1]/(fit_counts[0]+fit_counts[1]))

FITS_PROFILE
0    112461
1    106242
Name: count, dtype: int64 0.48578208803720113


Under this rule, a shot fits a team's shot profile when the team makes more than 42% of its attempts at that distance and shot type. About 48.6% of shots meet that bar. These shots stand out above the rest and theoretically should be valued even more highly when considering strategy.
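
A small worked example of the labeling rule may help. The sketch below re-implements the rule in plain Python on made-up shots for a single team. `build_profile`, `fits_profile`, and the `min_attempts` guard are hypothetical names for this illustration. With `min_attempts=1` the behavior matches the rule used above.

```python
from collections import defaultdict

def build_profile(shots):
    """Map (distance, shot type) -> (make rate, attempts) for one team."""
    totals = defaultdict(lambda: [0, 0])  # key -> [makes, attempts]
    for s in shots:
        key = (s["SHOT_DISTANCE"], s["EVENTMSGACTIONTYPE"])
        totals[key][0] += s["SHOT_MADE_FLAG"]
        totals[key][1] += 1
    return {k: (makes / n, n) for k, (makes, n) in totals.items()}

def fits_profile(shot, profile, cutoff=0.42, min_attempts=1):
    """Return 1 when the make rate for the shot's cell exceeds the cutoff.

    min_attempts is a hypothetical extra guard; 1 reproduces the rule above.
    """
    key = (shot["SHOT_DISTANCE"], shot["EVENTMSGACTIONTYPE"])
    rate, attempts = profile.get(key, (0.0, 0))
    return int(attempts >= min_attempts and rate > cutoff)

def shot(distance, shot_type, made):
    return {"SHOT_DISTANCE": distance, "EVENTMSGACTIONTYPE": shot_type,
            "SHOT_MADE_FLAG": made}

# Made-up shots: layups at 2 ft, jump shots at 26 ft, one heave at 55 ft.
shots = [shot(2, 5, 1), shot(2, 5, 1), shot(2, 5, 0),
         shot(26, 1, 0), shot(26, 1, 0), shot(26, 1, 1),
         shot(55, 78, 1)]
profile = build_profile(shots)

labels = [fits_profile(s, profile) for s in shots]
labels_guarded = [fits_profile(s, profile, min_attempts=2) for s in shots]
print(labels)          # [1, 1, 1, 0, 0, 0, 1]
print(labels_guarded)  # [1, 1, 1, 0, 0, 0, 0]
```

Two things stand out in the toy data. Every shot in a (distance, shot type) cell receives the same label regardless of whether that particular shot went in, and a cell with a single made attempt is labeled as fitting unless some minimum attempt count is required.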

In [95]:
sznFILTdf.to_csv('nbaShotsFiltered2324.csv',index=False)

In [93]:
cont_names = ['SHOT_DISTANCE']  # Continuous variables
cat_names = ['EVENTMSGACTIONTYPE', 'PERIOD','TEAM_ID']  # Categorical variables including team_id
y_names = 'FITS_PROFILE'  # Target variable


# Create DataLoaders
dataloaders = TabularDataLoaders.from_df(sznFILTdf, 
                                         path='.', 
                                         procs=[Categorify,FillMissing], 
                                         cat_names=cat_names, 
                                         cont_names=cont_names, 
                                         y_names=y_names,
                                         valid_pct=0.2,
                                         test_pct=.1,
                                         bs=128)  # Adjust batch size as needed


In [94]:
# Create a TabularLearner
learn = tabular_learner(dataloaders,
                        metrics=accuracy)  # You can use different metrics as needed

# Train the model
learn.fit_one_cycle(5)  # Number of epochs


epoch,train_loss,valid_loss,accuracy,time
0,0.123172,0.118963,0.514152,00:13
1,0.116747,0.11586,0.514152,00:13
2,0.110367,0.106754,0.514152,00:13
3,0.102153,0.100407,0.514152,00:13
4,0.100728,0.112152,0.514152,00:13


In [53]:
# Show sample predictions next to their targets
learn.show_results()

# Plot confusion matrix
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()


Unnamed: 0,EVENTMSGACTIONTYPE,PERIOD,TEAM_ID,SHOT_DISTANCE,FITS_PROFILE,FITS_PROFILE_pred
0,35.0,4.0,15.0,0.0,1.0,0.943578
1,16.0,2.0,15.0,4.0,1.0,0.247982
2,27.0,3.0,6.0,12.0,0.0,0.445141
3,27.0,1.0,20.0,10.0,1.0,0.675533
4,27.0,4.0,13.0,13.0,0.0,0.278604
5,1.0,4.0,29.0,26.0,0.0,-0.013689
6,28.0,4.0,3.0,25.0,0.0,0.051666
7,27.0,1.0,22.0,29.0,0.0,0.019143
8,36.0,2.0,21.0,1.0,1.0,0.950865


AttributeError: vocab
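
The continuous and occasionally negative predictions, the unchanging accuracy column, and the `AttributeError: vocab` are all consistent with the numeric 0/1 target having been modeled as a regression target rather than as a class. That reading is an inference from the outputs, not something verified here. One possible remedy, untested in this notebook, is to pass `y_block=CategoryBlock()` to `TabularDataLoaders.from_df`. In the meantime, the sample scores above can be turned into class guesses by thresholding, as in this sketch. The 0.5 cutoff is an assumption.

```python
# (target, prediction) pairs copied from the show_results output above.
rows = [
    (1, 0.943578), (1, 0.247982), (0, 0.445141),
    (1, 0.675533), (0, 0.278604), (0, -0.013689),
    (0, 0.051666), (0, 0.019143), (1, 0.950865),
]

# Treat each continuous score as a class by cutting at 0.5 (assumed cutoff).
predicted = [int(score >= 0.5) for _, score in rows]
correct = sum(int(p == target) for p, (target, _) in zip(predicted, rows))

print(f"{correct}/{len(rows)} sample rows match after thresholding")
```

Nine rows are far too few to judge the model, but the result suggests the network learned something even though the reported accuracy metric never moved.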

In [77]:
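# Example of making predictions
# NOTE: TEAM_ID is 15.0 here, which looks like an encoded category index from show_results.
# The training frame holds raw ids such as 1610612737, so this value may be treated as unseen.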
new_data = pd.DataFrame({
    'SHOT_DISTANCE': [14],
    'EVENTMSGACTIONTYPE': [97],
    'PERIOD': [3],
    'TEAM_ID':[15.0]
})

# Convert to DataLoader format
new_dl = learn.dls.test_dl(new_data)

# Make predictions
preds, _ = learn.get_preds(dl=new_dl)
print(preds)


tensor([[-0.1230]])


In [None]:
shotLab={

    1: "Jump Shot",
    2: "Running Jump Shot",
    3: "Hook Shot",
    5: "Layup",
    6: "Driving Layup",
    7: "Dunk",  
    41: "Running Layup",
    43: "Alley Oop Layup",
    44: "Reverse Layup",
    47: "Turnaround Jump Shot",
    50: "Running Dunk",  
    52: "Alley Oop Dunk",
    58: "Turnaround Hook Shot",  
    63: "Fadeaway Jumper",
    66: "Jump Bank Shot",
    71: "Finger Roll Layup",
    72: "Putback Layup",  
    73: "Driving Reverse Layup",
    75: "Driving Finger Roll Layup",
    76: "Running Finger Roll Layup",
    78: "Floating Jump Shot",
    79: "Pullup Jump Shot",
    80: "Step Back Jump Shot",
    86: "Turnaround Fadeaway",
    97: "Tip Layup Shot",
    98: "Cutting Layup Shot",
    99: "Cutting Finger Roll Layup Shot",  
    105: "Turnaround Fadeaway Bank Jump Shot",
    107: "Tip Dunk Shot",
    108: "Cutting Dunk Shot"  

}
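
A mapping like this is most useful with a fallback for codes it does not cover. The sketch below repeats a few entries so it runs on its own. `shot_labels` and `label_for` are hypothetical names, and the "Unknown" placeholder format is an assumption.

```python
# A small subset of the shotLab mapping above, repeated so this runs on its own.
shot_labels = {
    1: "Jump Shot",
    5: "Layup",
    78: "Floating Jump Shot",
    79: "Pullup Jump Shot",
    97: "Tip Layup Shot",
}

def label_for(code):
    """Return the readable label, or a placeholder for unmapped codes."""
    return shot_labels.get(code, f"Unknown ({code})")

# Code 9 appears in the team profile output but has no entry in shotLab.
codes = [78, 1, 79, 5, 9]
labels = [label_for(code) for code in codes]
print(labels)
```

Using `.get` with a default avoids a `KeyError` when the data contains a shot type that has not been added to the dictionary yet.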

In [98]:
season_data[tid]['SHOT_ZONE_AREA'].unique()

array(['Right Side Center(RC)', 'Right Side(R)', 'Left Side Center(LC)',
       'Center(C)', 'Left Side(L)', 'Back Court(BC)'], dtype=object)