# Data Preparation

This part prepares the data before we explore it and build a prediction model. You must run the `data_retrieval.ipynb` notebook before running this one.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
games = pd.read_csv('data/game_scores.csv', parse_dates=['date'])
stats = pd.read_csv('data/games_statistics.csv')
raw_data = games.merge(stats, on='boxscore_url', how='inner')
raw_data_backup = raw_data.copy()

### Filter All-Star game
The All-Star game is included in the data. We discard it because it is not a regular game:

In [3]:
teams = pd.concat([raw_data.home_team, raw_data.away_team], ignore_index=True).unique() #Important to take home + away teams if the algorithm is run early in the season
print(teams)
filtered_teams = ['All Star France', 'All Star Monde']
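# Drop any game in which an All-Star team appears, at home or away; copy to avoid chained-assignment warnings later
raw_data = raw_data[~raw_data.home_team.isin(filtered_teams) & ~raw_data.away_team.isin(filtered_teams)].copy()
teams = list(filter(lambda team: team not in filtered_teams, teams))

# Validation (`in` on a Series tests the index labels, so check the values with isin)
assert not raw_data['home_team'].isin(filtered_teams).any() and not raw_data['away_team'].isin(filtered_teams).any()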
assert all(team not in teams for team in filtered_teams)

['Dijon' 'Châlons-Reims' 'Boulogne-Levallois' 'Monaco' 'Chalon/Saône'
 'Cholet' 'Boulazac' 'Bourg-en-Bresse' 'Lyon-Villeurbanne' 'Roanne'
 'Le Mans' 'Pau-Lacq-Orthez' 'Limoges' 'Strasbourg' 'Nanterre' 'Le Portel'
 'Gravelines-Dunkerque' 'Orléans' 'All Star France' 'All Star Monde']


### Win/Loss ratio computation

Compute each team's win/loss record before each game:

In [4]:
teams_WL_tmp = pd.DataFrame(data=np.zeros((len(teams), 2)), dtype=np.int64, columns=['wins', 'losses'], index=teams) # Creating a temporary dataframe to hold current team win/loss
raw_data = raw_data.sort_values(by='date', ascending=True) # Sorting by ascending dates (reassigned rather than sorted in place on a filtered frame)
raw_data['home_team_wins'] = 0
raw_data['home_team_losses'] = 0
raw_data['away_team_wins'] = 0
raw_data['away_team_losses'] = 0

In [5]:
for index, row in raw_data.iterrows():
    home, away = row['home_team'], row['away_team']

    # Store each team's record as it stood before this game
    raw_data.at[index, 'home_team_wins'] = teams_WL_tmp.at[home, 'wins']
    raw_data.at[index, 'away_team_wins'] = teams_WL_tmp.at[away, 'wins']
    raw_data.at[index, 'home_team_losses'] = teams_WL_tmp.at[home, 'losses']
    raw_data.at[index, 'away_team_losses'] = teams_WL_tmp.at[away, 'losses']

    # Then update the running records with this game's result
    if row['home_score'] > row['away_score']:
        teams_WL_tmp.at[home, 'wins'] += 1
        teams_WL_tmp.at[away, 'losses'] += 1
    elif row['home_score'] < row['away_score']:
        teams_WL_tmp.at[away, 'wins'] += 1
        teams_WL_tmp.at[home, 'losses'] += 1

# Validation
display(raw_data.sample(n=5))

Unnamed: 0,date,home_team,home_score,away_team,away_score,qt_1_home_score,qt_1_away_score,qt_2_home_score,qt_2_away_score,qt_3_home_score,...,away_ftm,away_blk,away_stl,away_tov,away_pf,away_pfd,home_team_wins,home_team_losses,away_team_wins,away_team_losses
158,2020-01-11,Cholet,83,Gravelines-Dunkerque,65,18,27,24,13,17,...,9,4,3,13,19,13,11,6,5,12
161,2020-01-11,Boulogne-Levallois,101,Nanterre,90,26,22,27,29,25,...,20,5,6,19,24,20,12,5,8,9
155,2020-01-11,Bourg-en-Bresse,109,Chalon/Saône,85,32,12,31,20,23,...,19,0,3,8,21,24,11,6,6,11
28,2019-10-11,Bourg-en-Bresse,90,Orléans,79,25,21,20,13,22,...,14,3,7,16,21,16,3,0,1,2
220,2020-02-29,Cholet,93,Chalon/Saône,99,29,28,22,26,23,...,18,0,6,13,17,23,14,10,9,14


### Possessions & Pace

We estimate the number of possessions with the formula $FGA + 0.44 \times FTA - ORB + TOV$ (field goal attempts, free throw attempts, offensive rebounds and turnovers), where $FGA$ is the sum of two-point and three-point attempts. See notes for more in-depth information. We can then infer the pace (possessions per 40 minutes) by computing $\dfrac{40 \times possessions}{minutes}$, which shows how fast a team plays.  
(*Note that the `minutes` column is divided by 5, because it holds the total minutes played by all 5 players on the floor.*)

In [13]:
raw_data['home_team_possessions'] = round(raw_data['home_2pa'] + raw_data['home_3pa'] + 0.44 * raw_data['home_fta'] - raw_data['home_orbd'] + raw_data['home_tov'], 2)
raw_data['away_team_possessions'] = round(raw_data['away_2pa'] + raw_data['away_3pa'] + 0.44 * raw_data['away_fta'] - raw_data['away_orbd'] + raw_data['away_tov'], 2)
raw_data['home_team_pace'] = round((raw_data['home_team_possessions']*40)/(raw_data['minutes']/5), 2)
raw_data['away_team_pace'] = round((raw_data['away_team_possessions']*40)/(raw_data['minutes']/5), 2)

assert all(np.greater(raw_data[raw_data.minutes > 200].home_team_possessions, raw_data[raw_data.minutes > 200].home_team_pace))
assert all(np.equal(raw_data[raw_data.minutes == 200].home_team_pace, raw_data[raw_data.minutes == 200].home_team_possessions))
assert all(np.greater(raw_data[raw_data.minutes > 200].away_team_possessions, raw_data[raw_data.minutes > 200].away_team_pace))
assert all(np.equal(raw_data[raw_data.minutes == 200].away_team_pace, raw_data[raw_data.minutes == 200].away_team_possessions))

display(raw_data[['home_team', 'away_team', 'home_team_possessions', 'away_team_possessions', 'home_team_pace', 'away_team_pace']].sample(n=5))

Unnamed: 0,home_team,away_team,home_team_possessions,away_team_possessions,home_team_pace,away_team_pace
210,Dijon,Bourg-en-Bresse,74.6,74.8,74.6,74.8
108,Roanne,Boulazac,74.8,73.6,74.8,73.6
221,Le Mans,Bourg-en-Bresse,75.32,75.48,75.32,75.48
36,Dijon,Lyon-Villeurbanne,68.36,69.36,68.36,69.36
92,Lyon-Villeurbanne,Châlons-Reims,73.44,74.48,73.44,74.48
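
In the sample above, pace equals possessions in every row because those games lasted exactly 40 minutes (`minutes == 200`). The sketch below illustrates the difference with a hypothetical box-score line (the numbers are invented; `possessions` and `pace` are illustrative helper names, not functions from this notebook): the same possession count gives a lower pace once the game goes to overtime.

```python
def possessions(fga, fta, orb, tov):
    """Estimate possessions as FGA + 0.44 * FTA - ORB + TOV."""
    return fga + 0.44 * fta - orb + tov


def pace(poss, team_minutes):
    """Possessions per 40 minutes; team_minutes is summed over the 5 players on the floor."""
    return 40 * poss / (team_minutes / 5)


# Hypothetical box-score line: 62 FGA, 20 FTA, 10 ORB, 14 TOV
poss = possessions(fga=62, fta=20, orb=10, tov=14)

regulation_pace = pace(poss, team_minutes=200)  # 40-minute game
overtime_pace = pace(poss, team_minutes=225)    # 45-minute game (one overtime)

print(round(poss, 2), round(regulation_pace, 2), round(overtime_pace, 2))
```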


### Offensive Rating (ORtg), Defensive Rating (DRtg) and Net rating (NRtg)

Now that we have the number of possessions, we can derive the Offensive Rating, Defensive Rating and Net Rating. Offensive Rating is the number of points scored per 100 possessions:  
$ORtg = \dfrac{Pts \times 100}{Poss}$  
  
Defensive Rating is the opponent's Offensive Rating, and Net Rating is the difference between the two:  
$NRtg = ORtg - DRtg$

In [15]:
raw_data['home_ortg'] = raw_data['home_score'] * 100 / raw_data['home_team_possessions']
raw_data['away_ortg'] = raw_data['away_score'] * 100 / raw_data['away_team_possessions']
raw_data['home_drtg'] = raw_data['away_ortg']
raw_data['away_drtg'] = raw_data['home_ortg']
raw_data['home_nrtg'] = raw_data['home_ortg'] - raw_data['home_drtg']
raw_data['away_nrtg'] = raw_data['away_ortg'] - raw_data['away_drtg']

display(raw_data[[
    'home_team', 
    'away_team', 
    'home_score', 
    'away_score', 
    'home_team_possessions', 
    'away_team_possessions',
    'home_ortg',
    'away_ortg',
    'home_drtg',
    'away_drtg',
    'home_nrtg',
    'away_nrtg',
]].sample(n=5))

Unnamed: 0,home_team,away_team,home_score,away_score,home_team_possessions,away_team_possessions,home_ortg,away_ortg,home_drtg,away_drtg,home_nrtg,away_nrtg
140,Limoges,Strasbourg,77,61,70.0,71.08,110.0,85.818796,85.818796,110.0,24.181204,-24.181204
230,Lyon-Villeurbanne,Le Mans,101,85,74.32,77.12,135.898816,110.217842,110.217842,135.898816,25.680974,-25.680974
112,Orléans,Gravelines-Dunkerque,90,78,72.32,71.36,124.446903,109.304933,109.304933,124.446903,15.14197,-15.14197
205,Boulazac,Roanne,96,77,75.44,77.16,127.253446,99.792639,99.792639,127.253446,27.460808,-27.460808
142,Monaco,Bourg-en-Bresse,94,89,75.12,77.24,125.13312,115.225272,115.225272,125.13312,9.907848,-9.907848
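
As a worked check, the sketch below recomputes the ratings of the Limoges vs Strasbourg row shown above from its scores and possessions (`offensive_rating` is an illustrative helper name, not a function from this notebook).

```python
def offensive_rating(points, possessions):
    """Points scored per 100 possessions."""
    return 100 * points / possessions


# Values taken from the Limoges (home) vs Strasbourg (away) row above
home_ortg = offensive_rating(77, 70.0)
away_ortg = offensive_rating(61, 71.08)

# A team's defensive rating is its opponent's offensive rating
home_drtg, away_drtg = away_ortg, home_ortg

home_nrtg = home_ortg - home_drtg
away_nrtg = away_ortg - away_drtg

print(round(home_ortg, 2), round(away_ortg, 2), round(home_nrtg, 2), round(away_nrtg, 2))
```

Because each team's defensive rating is the other team's offensive rating, the two net ratings of a game are always exact opposites.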


### Compute average team statistics before each game

Pseudocode of the function that efficiently computes a statistic over all of a team's previous games:
```
acc_dataframe <- Create dataframe [team=list of unique teams, stats=0]
raw_data <- Sort raw_data by game date
raw_data <- Add new stat column to raw_data filled with 0 for both home and away team
compute_stat(game, home: bool) <- function that given a game returns a stat for home or away team.

for game in raw_data:
    game[home_stat] = acc_dataframe[game[home_team]]
    game[away_stat] = acc_dataframe[game[away_team]]
    acc_dataframe[home_team] = compute_stat(game, home=True)
    acc_dataframe[away_team] = compute_stat(game, home=False)
```
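
The pseudocode above could be turned into Python along the following lines. This is a minimal sketch under stated assumptions, not the notebook's implementation: it takes `compute_stat` to be a running mean, keeps the accumulators in two plain dictionaries (`totals`, `counts`) instead of a DataFrame, and the function name `add_pregame_average`, the `_avg` column suffix and the toy games are all hypothetical. A team's first game has no history, so its average is left as `NaN`.

```python
import numpy as np
import pandas as pd


def add_pregame_average(df, stat):
    """Add 'home_<stat>_avg' and 'away_<stat>_avg': each team's mean of `stat`
    over its earlier games only.

    Assumes columns 'date', 'home_team', 'away_team', 'home_<stat>', 'away_<stat>'.
    """
    df = df.sort_values('date').copy()
    totals, counts = {}, {}  # running sum and number of games played, per team
    for side in ('home', 'away'):
        df[f'{side}_{stat}_avg'] = np.nan

    for idx, game in df.iterrows():
        # 1) Store the averages as they stood before this game
        for side in ('home', 'away'):
            team = game[f'{side}_team']
            if counts.get(team, 0) > 0:
                df.at[idx, f'{side}_{stat}_avg'] = totals[team] / counts[team]
        # 2) Then fold this game's values into the accumulators
        for side in ('home', 'away'):
            team = game[f'{side}_team']
            totals[team] = totals.get(team, 0.0) + game[f'{side}_{stat}']
            counts[team] = counts.get(team, 0) + 1
    return df


# Hypothetical toy schedule
toy_games = pd.DataFrame({
    'date': pd.to_datetime(['2019-09-21', '2019-09-28', '2019-10-05', '2019-10-12']),
    'home_team': ['A', 'B', 'A', 'B'],
    'away_team': ['B', 'C', 'C', 'A'],
    'home_score': [80, 70, 65, 85],
    'away_score': [75, 90, 60, 70],
})
result = add_pregame_average(toy_games, 'score')
print(result[['home_team', 'away_team', 'home_score_avg', 'away_score_avg']])
```

Writing the pre-game value before updating the accumulators is what keeps the current game's result out of its own features.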

### Saving processed data

In [51]:
raw_data.to_csv('data/processed_data.csv', index=False)