This notebook accompanies src/data/streak.py. It explains the ideas and methods implemented in streak.py, the script used to generate streak.csv.

In [1]:
# Imports
import pandas as pd

First, we load 'mirror.csv', an intermediate dataset created to make streak data easier to work with. Each row is a game instance from one team's perspective. For more on the mirror data and how it was generated from the original data, see mirror.ipynb.

In [2]:
mirror_df = pd.read_csv('../data/processed/mirror.csv', index_col=0)
mirror_df.head(6)

GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS,PTS_for,PTS_against,FG_PCT_for,FG_PCT_against,FG3_PCT_for,FG3_PCT_against,AST_for,AST_against,REB_for,REB_against
20600002,1610612762,2003-10-05,2003,1,90.0,85.0,0.457,0.447,0.143,0.25,23.0,20.0,41.0,38.0
20600003,1610612742,2003-10-05,2003,0,85.0,90.0,0.447,0.457,0.25,0.143,20.0,23.0,38.0,41.0
20600004,1610612763,2003-10-06,2003,1,105.0,94.0,0.494,0.427,0.267,0.154,25.0,20.0,48.0,43.0
20600005,1610612749,2003-10-06,2003,0,94.0,105.0,0.427,0.494,0.154,0.267,20.0,25.0,43.0,48.0
20600006,1610612765,2003-10-07,2003,0,96.0,100.0,0.391,0.494,0.444,0.667,19.0,25.0,37.0,52.0
20600007,1610612739,2003-10-07,2003,1,100.0,96.0,0.494,0.391,0.667,0.444,25.0,19.0,52.0,37.0


Now, using these game instances, we want to engineer 'streak' features that describe how a team has been playing leading up to a game, and then use those features to predict the game in question.

To work up to this, we first find all of the games a team has played.

(NOTE: For this project, I limit streak data to the current season. In other words, only games played earlier in the same season count toward the game in question; cross-season data is not considered.)

In [3]:
team_id = 1610612762
season = 2003
# Filter on the variables defined above rather than repeating the literals
team_df = mirror_df[(mirror_df['TEAM_ID'] == team_id) & (mirror_df['SEASON'] == season)]
print(team_df.shape)
team_df.head(6)

(83, 14)


GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS,PTS_for,PTS_against,FG_PCT_for,FG_PCT_against,FG3_PCT_for,FG3_PCT_against,AST_for,AST_against,REB_for,REB_against
20600002,1610612762,2003-10-05,2003,1,90.0,85.0,0.457,0.447,0.143,0.25,23.0,20.0,41.0,38.0
40600024,1610612762,2003-10-29,2003,1,99.0,92.0,0.575,0.429,0.556,0.333,25.0,20.0,29.0,40.0
40600067,1610612762,2003-11-01,2003,0,102.0,127.0,0.44,0.517,0.25,0.391,25.0,28.0,38.0,49.0
40600092,1610612762,2003-11-03,2003,1,93.0,88.0,0.432,0.444,0.333,0.429,17.0,21.0,53.0,33.0
40600118,1610612762,2003-11-05,2003,1,91.0,80.0,0.461,0.375,0.1,0.313,16.0,13.0,48.0,35.0
40600151,1610612762,2003-11-07,2003,0,89.0,95.0,0.384,0.433,0.273,0.462,20.0,22.0,43.0,41.0


For now, we only use rows that have no missing values (to be fixed later...)

In [4]:
team_df = team_df.dropna()
print(team_df.shape)
team_df.head(6)

(83, 14)


GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS,PTS_for,PTS_against,FG_PCT_for,FG_PCT_against,FG3_PCT_for,FG3_PCT_against,AST_for,AST_against,REB_for,REB_against
20600002,1610612762,2003-10-05,2003,1,90.0,85.0,0.457,0.447,0.143,0.25,23.0,20.0,41.0,38.0
40600024,1610612762,2003-10-29,2003,1,99.0,92.0,0.575,0.429,0.556,0.333,25.0,20.0,29.0,40.0
40600067,1610612762,2003-11-01,2003,0,102.0,127.0,0.44,0.517,0.25,0.391,25.0,28.0,38.0,49.0
40600092,1610612762,2003-11-03,2003,1,93.0,88.0,0.432,0.444,0.333,0.429,17.0,21.0,53.0,33.0
40600118,1610612762,2003-11-05,2003,1,91.0,80.0,0.461,0.375,0.1,0.313,16.0,13.0,48.0,35.0
40600151,1610612762,2003-11-07,2003,0,89.0,95.0,0.384,0.433,0.273,0.462,20.0,22.0,43.0,41.0
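Before dropping rows, it can help to see how many values are actually missing. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not rows from mirror.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing entries (illustrative values only)
toy = pd.DataFrame(
    {
        "PTS_for": [90.0, np.nan, 102.0],
        "AST_for": [23.0, 25.0, np.nan],
    }
)

# Number of missing values in each column
missing_per_col = toy.isna().sum()
print(missing_per_col)

# dropna() keeps only rows in which every column is filled
complete = toy.dropna()
print(complete.shape)
```

For the team and season used here, the shape stays (83, 14) after dropna(), so no rows were removed.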


Now that we have a team's data over the course of an entire season, we can start to build streak data. To do this, we first want the games in chronological order. From the output above, it seems that ordering by 'GAME_ID' is already chronological, but we can't be sure, so we sort by 'GAME_DATE_EST' explicitly.

In [5]:
team_df = team_df.sort_values(by='GAME_DATE_EST', ascending=True)
team_df.head(6)

GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS,PTS_for,PTS_against,FG_PCT_for,FG_PCT_against,FG3_PCT_for,FG3_PCT_against,AST_for,AST_against,REB_for,REB_against
20600002,1610612762,2003-10-05,2003,1,90.0,85.0,0.457,0.447,0.143,0.25,23.0,20.0,41.0,38.0
40600024,1610612762,2003-10-29,2003,1,99.0,92.0,0.575,0.429,0.556,0.333,25.0,20.0,29.0,40.0
40600067,1610612762,2003-11-01,2003,0,102.0,127.0,0.44,0.517,0.25,0.391,25.0,28.0,38.0,49.0
40600092,1610612762,2003-11-03,2003,1,93.0,88.0,0.432,0.444,0.333,0.429,17.0,21.0,53.0,33.0
40600118,1610612762,2003-11-05,2003,1,91.0,80.0,0.461,0.375,0.1,0.313,16.0,13.0,48.0,35.0
40600151,1610612762,2003-11-07,2003,0,89.0,95.0,0.384,0.433,0.273,0.462,20.0,22.0,43.0,41.0
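Whether the rows were already in date order can also be checked directly rather than judged by eye. This is a small sketch on toy data (the `toy` frame, its IDs, and its dates are assumptions for illustration); it parses the date strings and tests whether they are non-decreasing:

```python
import pandas as pd

# Toy frame whose index order is NOT chronological (illustrative values only)
toy = pd.DataFrame(
    {"GAME_DATE_EST": ["2003-10-05", "2003-11-01", "2003-10-29"]},
    index=pd.Index([1, 2, 3], name="GAME_ID"),
)

# Parse the strings into real timestamps before comparing
dates = pd.to_datetime(toy["GAME_DATE_EST"])
was_sorted = dates.is_monotonic_increasing
print(was_sorted)

# Sort by the parsed dates; a stable sort keeps same-day rows in their original order
toy_sorted = toy.assign(GAME_DATE_EST=dates).sort_values("GAME_DATE_EST", kind="stable")
print(toy_sorted)
```

Since the dates in mirror.csv are ISO-formatted strings (YYYY-MM-DD), sorting the raw strings gives the same order; parsing them is mainly a safeguard against other formats.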


Now, let's say that we want to build a streak datapoint from the past 5 games to predict the 6th game. To do this, we can simply average each stat column over those 5 games. Furthermore, since this streak data represents the stats of the 5 games preceding the 6th game, we can use the 'GAME_ID' and 'GAME_DATE_EST' of the 6th game to identify the new streak datapoint. Thus we have:

In [6]:
game_6 = team_df.iloc[5]
game_6

TEAM_ID            1610612762
GAME_DATE_EST      2003-11-07
SEASON                   2003
TEAM_WINS                   0
PTS_for                  89.0
PTS_against              95.0
FG_PCT_for              0.384
FG_PCT_against          0.433
FG3_PCT_for             0.273
FG3_PCT_against         0.462
AST_for                  20.0
AST_against              22.0
REB_for                  43.0
REB_against              41.0
Name: 40600151, dtype: object

In [7]:
game_6[['TEAM_ID', 'GAME_DATE_EST', 'SEASON', 'TEAM_WINS']]

TEAM_ID          1610612762
GAME_DATE_EST    2003-11-07
SEASON                 2003
TEAM_WINS                 0
Name: 40600151, dtype: object

In [8]:
past_5_games = team_df.iloc[:5]
past_5_games

GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS,PTS_for,PTS_against,FG_PCT_for,FG_PCT_against,FG3_PCT_for,FG3_PCT_against,AST_for,AST_against,REB_for,REB_against
20600002,1610612762,2003-10-05,2003,1,90.0,85.0,0.457,0.447,0.143,0.25,23.0,20.0,41.0,38.0
40600024,1610612762,2003-10-29,2003,1,99.0,92.0,0.575,0.429,0.556,0.333,25.0,20.0,29.0,40.0
40600067,1610612762,2003-11-01,2003,0,102.0,127.0,0.44,0.517,0.25,0.391,25.0,28.0,38.0,49.0
40600092,1610612762,2003-11-03,2003,1,93.0,88.0,0.432,0.444,0.333,0.429,17.0,21.0,53.0,33.0
40600118,1610612762,2003-11-05,2003,1,91.0,80.0,0.461,0.375,0.1,0.313,16.0,13.0,48.0,35.0


In [9]:
past_5_games.drop(['TEAM_ID', 'GAME_DATE_EST', 'SEASON'], axis=1).mean()

TEAM_WINS           0.8000
PTS_for            95.0000
PTS_against        94.4000
FG_PCT_for          0.4730
FG_PCT_against      0.4424
FG3_PCT_for         0.2764
FG3_PCT_against     0.3432
AST_for            21.2000
AST_against        20.4000
REB_for            41.8000
REB_against        39.0000
dtype: float64
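The averaging step above can be wrapped in a small helper that works for any game position and any streak length. This is only a sketch: `streak_row` and its parameters are hypothetical names, not functions from streak.py, and the toy frame copies just two stat columns of the six games shown earlier.

```python
import pandas as pd

def streak_row(team_df, pos, k, stat_cols):
    """Average stat_cols over the k games immediately before row position pos.

    Assumes team_df is already sorted chronologically. Returns None when
    fewer than k earlier games exist. (Hypothetical helper for illustration.)
    """
    if pos < k:
        return None
    window = team_df.iloc[pos - k:pos]  # rows pos-k .. pos-1, excluding the game itself
    return window[stat_cols].mean()

# First six games of the team above, restricted to two stat columns
toy = pd.DataFrame(
    {
        "TEAM_WINS": [1, 1, 0, 1, 1, 0],
        "PTS_for": [90.0, 99.0, 102.0, 93.0, 91.0, 89.0],
    },
    index=pd.Index(
        [20600002, 40600024, 40600067, 40600092, 40600118, 40600151],
        name="GAME_ID",
    ),
)

# Streak features for the 6th game (position 5) from the 5 games before it
avg = streak_row(toy, pos=5, k=5, stat_cols=["TEAM_WINS", "PTS_for"])
print(avg)
```

Slicing with `iloc[pos - k:pos]` leaves the game at `pos` out of the window, so the game being predicted never contributes to its own features.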

Putting this together for every stat column and for several streak lengths k:

In [10]:
meta_cols = ['TEAM_ID', 'GAME_DATE_EST', 'SEASON']
are_stat_cols = ~mirror_df.columns.isin(meta_cols)
old_stat_cols = mirror_df.columns[are_stat_cols]
print(old_stat_cols)
print(meta_cols)

Index(['TEAM_WINS', 'PTS_for', 'PTS_against', 'FG_PCT_for', 'FG_PCT_against',
       'FG3_PCT_for', 'FG3_PCT_against', 'AST_for', 'AST_against', 'REB_for',
       'REB_against'],
      dtype='object')
['TEAM_ID', 'GAME_DATE_EST', 'SEASON']


In [11]:
ks = [5, 10]

new_stat_cols = []
for k in ks:
    for col in old_stat_cols:
        new_stat_cols.append("{}_prev_{}".format(col, k))
            
print(new_stat_cols)

['TEAM_WINS_prev_5', 'PTS_for_prev_5', 'PTS_against_prev_5', 'FG_PCT_for_prev_5', 'FG_PCT_against_prev_5', 'FG3_PCT_for_prev_5', 'FG3_PCT_against_prev_5', 'AST_for_prev_5', 'AST_against_prev_5', 'REB_for_prev_5', 'REB_against_prev_5', 'TEAM_WINS_prev_10', 'PTS_for_prev_10', 'PTS_against_prev_10', 'FG_PCT_for_prev_10', 'FG_PCT_against_prev_10', 'FG3_PCT_for_prev_10', 'FG3_PCT_against_prev_10', 'AST_for_prev_10', 'AST_against_prev_10', 'REB_for_prev_10', 'REB_against_prev_10']


In [12]:
new_cols = meta_cols + new_stat_cols
new_cols

['TEAM_ID',
 'GAME_DATE_EST',
 'SEASON',
 'TEAM_WINS_prev_5',
 'PTS_for_prev_5',
 'PTS_against_prev_5',
 'FG_PCT_for_prev_5',
 'FG_PCT_against_prev_5',
 'FG3_PCT_for_prev_5',
 'FG3_PCT_against_prev_5',
 'AST_for_prev_5',
 'AST_against_prev_5',
 'REB_for_prev_5',
 'REB_against_prev_5',
 'TEAM_WINS_prev_10',
 'PTS_for_prev_10',
 'PTS_against_prev_10',
 'FG_PCT_for_prev_10',
 'FG_PCT_against_prev_10',
 'FG3_PCT_for_prev_10',
 'FG3_PCT_against_prev_10',
 'AST_for_prev_10',
 'AST_against_prev_10',
 'REB_for_prev_10',
 'REB_against_prev_10']

In [13]:
index = mirror_df.index
streak_df = pd.DataFrame(index=index, columns=new_cols)
streak_df

GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS_prev_5,PTS_for_prev_5,PTS_against_prev_5,FG_PCT_for_prev_5,FG_PCT_against_prev_5,FG3_PCT_for_prev_5,FG3_PCT_against_prev_5,...,PTS_for_prev_10,PTS_against_prev_10,FG_PCT_for_prev_10,FG_PCT_against_prev_10,FG3_PCT_for_prev_10,FG3_PCT_against_prev_10,AST_for_prev_10,AST_against_prev_10,REB_for_prev_10,REB_against_prev_10
20600002,,,,,,,,,,,...,,,,,,,,,,
20600003,,,,,,,,,,,...,,,,,,,,,,
20600004,,,,,,,,,,,...,,,,,,,,,,
20600005,,,,,,,,,,,...,,,,,,,,,,
20600006,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104200263,,,,,,,,,,,...,,,,,,,,,,
104200402,,,,,,,,,,,...,,,,,,,,,,
104200403,,,,,,,,,,,...,,,,,,,,,,
104200422,,,,,,,,,,,...,,,,,,,,,,


In [14]:
seasons = mirror_df['SEASON'].unique()
teams = mirror_df['TEAM_ID'].unique()

team_df = mirror_df[(mirror_df['TEAM_ID'] == teams[0]) & (mirror_df['SEASON'] == seasons[0])]
team_df.head(6)

GAME_ID,TEAM_ID,GAME_DATE_EST,SEASON,TEAM_WINS,PTS_for,PTS_against,FG_PCT_for,FG_PCT_against,FG3_PCT_for,FG3_PCT_against,AST_for,AST_against,REB_for,REB_against
20600002,1610612762,2003-10-05,2003,1,90.0,85.0,0.457,0.447,0.143,0.25,23.0,20.0,41.0,38.0
40600024,1610612762,2003-10-29,2003,1,99.0,92.0,0.575,0.429,0.556,0.333,25.0,20.0,29.0,40.0
40600067,1610612762,2003-11-01,2003,0,102.0,127.0,0.44,0.517,0.25,0.391,25.0,28.0,38.0,49.0
40600092,1610612762,2003-11-03,2003,1,93.0,88.0,0.432,0.444,0.333,0.429,17.0,21.0,53.0,33.0
40600118,1610612762,2003-11-05,2003,1,91.0,80.0,0.461,0.375,0.1,0.313,16.0,13.0,48.0,35.0
40600151,1610612762,2003-11-07,2003,0,89.0,95.0,0.384,0.433,0.273,0.462,20.0,22.0,43.0,41.0


In [15]:
game_id = 20600002
team_df.loc[game_id]

TEAM_ID            1610612762
GAME_DATE_EST      2003-10-05
SEASON                   2003
TEAM_WINS                   1
PTS_for                  90.0
PTS_against              85.0
FG_PCT_for              0.457
FG_PCT_against          0.447
FG3_PCT_for             0.143
FG3_PCT_against          0.25
AST_for                  23.0
AST_against              20.0
REB_for                  41.0
REB_against              38.0
Name: 20600002, dtype: object

In [16]:
team_df.loc[game_id, meta_cols]

TEAM_ID          1610612762
GAME_DATE_EST    2003-10-05
SEASON                 2003
Name: 20600002, dtype: object

In [17]:
streak_df.loc[game_id, new_stat_cols]

TEAM_WINS_prev_5           NaN
PTS_for_prev_5             NaN
PTS_against_prev_5         NaN
FG_PCT_for_prev_5          NaN
FG_PCT_against_prev_5      NaN
FG3_PCT_for_prev_5         NaN
FG3_PCT_against_prev_5     NaN
AST_for_prev_5             NaN
AST_against_prev_5         NaN
REB_for_prev_5             NaN
REB_against_prev_5         NaN
TEAM_WINS_prev_10          NaN
PTS_for_prev_10            NaN
PTS_against_prev_10        NaN
FG_PCT_for_prev_10         NaN
FG_PCT_against_prev_10     NaN
FG3_PCT_for_prev_10        NaN
FG3_PCT_against_prev_10    NaN
AST_for_prev_10            NaN
AST_against_prev_10        NaN
REB_for_prev_10            NaN
REB_against_prev_10        NaN
Name: 20600002, dtype: object

So, in streak.py, I build the set of new columns, initialize streak_df, compute the streak averages leading up to each game, and then manually write each entry into the new dataframe.
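As an alternative to filling entries one at a time, the same features can be computed in a vectorized way with groupby, shift, and rolling. This is a sketch under assumptions, not the code in streak.py: `add_streak_features` is a hypothetical name, and the toy frame below uses made-up values.

```python
import pandas as pd

def add_streak_features(df, stat_cols, ks=(5, 10)):
    """Return meta columns plus '<col>_prev_<k>' averages of the previous k games.

    Streaks are computed per (TEAM_ID, SEASON), so they never cross seasons
    or teams. (Hypothetical helper for illustration.)
    """
    df = df.sort_values(["TEAM_ID", "SEASON", "GAME_DATE_EST"])
    out = df[["TEAM_ID", "GAME_DATE_EST", "SEASON"]].copy()
    grouped = df.groupby(["TEAM_ID", "SEASON"])
    for k in ks:
        for col in stat_cols:
            # shift(1) excludes the current game; rolling(k) then averages the
            # k games before it and yields NaN until k earlier games exist
            prev_mean = grouped[col].transform(
                lambda s, k=k: s.shift(1).rolling(k).mean()
            )
            # transform keeps df's row order, so assign by position
            out[f"{col}_prev_{k}"] = prev_mean.to_numpy()
    return out

# Toy data: two teams, four games each, one stat column (illustrative values)
toy = pd.DataFrame(
    {
        "TEAM_ID": [1] * 4 + [2] * 4,
        "GAME_DATE_EST": ["2003-10-01", "2003-10-03", "2003-10-05", "2003-10-07"] * 2,
        "SEASON": [2003] * 8,
        "PTS_for": [90.0, 100.0, 110.0, 120.0, 80.0, 82.0, 84.0, 86.0],
    },
    index=pd.Index(range(101, 109), name="GAME_ID"),
)

streaks = add_streak_features(toy, ["PTS_for"], ks=(2,))
print(streaks)
```

Rows without enough earlier games stay NaN, which matches the empty entries in the initialized streak_df above; whether to drop or impute them is a separate modeling choice.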