## Data Preparation - PER dataset

The [Player Efficiency Rating (PER)](https://en.wikipedia.org/wiki/Player_efficiency_rating) is a basketball statistical measurement developed by John Hollinger that relates all of players contributions to their team in a single metric. 

PER takes into account positive ratings (e.g. scoring and assists) and subtracts negative ratings (e.g. misses and turnovers).
It is adjusted according to each player's playing time, in order to ensure fairness.
Moreover, the metric is also calibrated to the player's team for the season.

The formula for unadjusted PER (uPER) is defined in [<a href="ref2">2</a>] as:

$$
\begin{aligned}
& \text{uPER} = \frac{1}{\text{minutes}} \cdot ( \\\
& \text{threeMade} + \frac{2}{3} \cdot \text{assists} + (2 - \text{factor} \cdot \frac{\text{team\_AST}}{\text{team\_FG}}) \cdot \text{fgMade} . \\\
& + (\text{ftMade} \cdot 0.5 \cdot (1 + (1 - \frac{\text{team\_AST}}{\text{team\_FG}}) + \frac{2}{3} \cdot \frac{\text{team\_AST}}{\text{team\_FG}})) \\\
& - \text{VOP} \cdot \text{turnovers} - \text{VOP} \cdot \text{DRB\%} \cdot (\text{fgAttempted} - \text{fgMade}) \\\
& - \text{VOP} \cdot 0.44 \cdot (0.44 + 0.56 \cdot \text{DRB\%}) \cdot (\text{ftAttempted} - \text{ftMade}) \\\
& + \text{VOP} \cdot (1 - \text{DRB\%}) \cdot (\text{rebounds} - \text{oRebounds}) + \text{VOP} \cdot \text{DRB\%} \cdot \text{oRebounds} \\\
& + \text{VOP} \cdot \text{steals} + \text{VOP} \cdot \text{DRB\%} \cdot \text{blocks} \\\
& - \text{PF} \cdot (\frac{\text{lg\_FT}}{\text{lg\_PF}} - 0.44 \cdot \frac{\text{lg\_FTA}}{\text{lg\_PF}} \cdot \text{VOP}) \\\
& ) 
\end{aligned}
$$

Most of the terms follow the nomenclature defined by the original datasets.
Some additional terms are defined below:

$$
\begin{aligned}
\text{factor} = \left(\frac{2}{3}\right) - \frac{0.5 \cdot \left(\frac{\text{lg\_AST}}{\text{lg\_FG}}\right)}{2 \cdot \left(\frac{\text{lg\_FG}}{\text{lg\_FT}}\right)}
\end{aligned}
$$

$$
\begin{aligned}
\text{VOP} = \frac{\text{lg\_PTS}}{\text{lg\_FGA} - \text{lg\_ORB} + \text{lg\_TOV} + 0.44 \cdot \text{lg\_FTA}}
\end{aligned}
$$

$$
\begin{aligned}
\text{DRB\%} = \frac{\text{lg\_TRB} - \text{lg\_ORB}}{\text{lg\_TRB}}
\end{aligned}
$$


As a reference, the average PER is 15.00 every season.
For a better understanding of this metric, refer to the table below[<a href="#ref1">1</a>]:
| Type of Player             | PER                    |
|---------------------------|-------------------------|
| A Year For The Ages       | 35.0                    |
| Runaway MVP Candidate     | 30.0                    |
| Strong MVP Candidate      | 27.5                    |
| Weak MVP Candidate        | 25.0                    |
| Bona Fide All-Star        | 22.5                    |
| Borderline All-Star       | 20.0                    |
| Solid 2nd Option          | 18.0                    |
| 3rd Banana                | 16.5                    |
| Pretty Good Player*       | 15.0                    |
| In The Rotation           | 13.0                    |
| Scrounging For Minutes    | 11.0                    |
| Definitely Renting        | 9.0                     |
| The Next Stop: D-League   | 5.0                     |


In [40]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import os

#include utils directory
import sys
sys.path.append('..')

from utils.files import *
DATA_PATH = os.path.join('..', 'data')


#### Calculating PER

The following cell calculates the PER for each player in the dataset.
It then is used to create the dataset using either a sum or average of the PER per team.
Trough experimenting, we found that sum could be a better option as it is more sensitive to outliers.

In [41]:
# Players
players_teams_df = pd.read_csv(os.path.join(DATA_PATH, DATA_PLAYERS_TEAMS))
pt_df = preparePlayersTeamsDf(players_teams_df)

new_pt_df = pd.DataFrame()
for col in ['playerID', 'year', 'tmID']:
    new_pt_df[col] = pt_df[col]

teams_df = pd.read_csv(os.path.join(DATA_PATH, DATA_TEAMS))
teams_df = prepareTeamsDf(teams_df)
pt_df = pd.merge(pt_df, teams_df[['year', 'tmID', 'asts', 'fgm']], on=['year', 'tmID'], how='left')
pt_df.rename(columns={'asts': 't_asts', 'fgm': 't_fgm'}, inplace=True)

# Get league stats (yearly)
for index, row in pt_df.iterrows():
    # Assists
    lg_asts = teams_df[(teams_df['year'] == row['year'])]['asts'].sum()
    pt_df.at[index, 'lg_asts'] = lg_asts
    # Field Goals Made
    lg_fgm = teams_df[(teams_df['year'] == row['year'])]['fgm'].sum()
    pt_df.at[index, 'lg_fgm'] = lg_fgm
    # Field Goals Attempted
    lg_fga = pt_df[(pt_df['year'] == row['year'])]['fgAttempted'].sum()
    pt_df.at[index, 'lg_fgAttempted'] = lg_fga
    # Personal Fouls
    lg_pf = pt_df[(pt_df['year'] == row['year'])]['PF'].sum()
    pt_df.at[index, 'lg_pf'] = lg_pf
    # Free Throws Made
    lg_ftMade = pt_df[(pt_df['year'] == row['year'])]['ftMade'].sum()
    pt_df.at[index, 'lg_ftMade'] = lg_ftMade
    # Free Throws Attempted
    lg_ftAttempted = pt_df[(pt_df['year'] == row['year'])]['ftAttempted'].sum()
    pt_df.at[index, 'lg_ftAttempted'] = lg_ftAttempted
    # Points
    lg_points = pt_df[(pt_df['year'] == row['year'])]['points'].sum()
    # Offesnive Rebounds
    lg_oRebounds = pt_df[(pt_df['year'] == row['year'])]['oRebounds'].sum()
    # Rebounds
    lg_rebounds = pt_df[(pt_df['year'] == row['year'])]['rebounds'].sum()
    # Turnovers
    lg_turnovers = pt_df[(pt_df['year'] == row['year'])]['turnovers'].sum()

    pt_df.at[index, 'factor'] = (2 / 3) - (0.5 * (lg_asts / lg_fgm)) / (2 * (lg_fgm / lg_ftMade))
    pt_df.at[index, 'vop'] = lg_points / (lg_fga - lg_oRebounds + lg_turnovers + 0.44 * lg_ftAttempted)
    pt_df.at[index, 'drb'] = (lg_rebounds - lg_oRebounds) / lg_rebounds


# Make PER stats for each player
new_pt_df['uPER'] = 1 / pt_df['minutes'] * (
    pt_df['threeMade'] + (2/3) * pt_df['assists'] 
    + (2 - pt_df['factor'] * (pt_df['t_asts'] / pt_df['t_fgm'])) * pt_df['fgMade']
    + pt_df['ftMade'] * 0.5 * (1 + (1 - (pt_df['t_asts'] / pt_df['t_fgm'])) + 2/3 * (pt_df['t_asts'] / pt_df['t_fgm']))
    - pt_df['vop'] * pt_df['turnovers'] - pt_df['vop'] * pt_df['drb'] * (pt_df['fgAttempted'] - pt_df['fgMade'])
    - pt_df['vop'] * 0.44 * (0.44 + (0.56 * pt_df['drb'])) * (pt_df['ftAttempted'] - pt_df['ftMade'])
    + pt_df['vop'] * (1 - pt_df['drb']) * (pt_df['rebounds'] - pt_df['oRebounds']) + pt_df['vop'] * pt_df['drb'] * pt_df['oRebounds']
    + pt_df['vop'] * pt_df['steals'] + pt_df['vop'] * pt_df['drb'] * pt_df['blocks']
    - pt_df['PF'] * ((pt_df['lg_ftAttempted'] / pt_df['lg_pf']) - 0.44 * (pt_df['lg_ftAttempted'] / pt_df['lg_pf']) * pt_df['vop'])
)

# Standardize PER
LG_AVG = 15
new_pt_df['PER'] = new_pt_df['uPER'] * (15 / new_pt_df['uPER'].mean())
new_pt_df.drop(columns=['uPER'], inplace=True)

display(new_pt_df)

# get the average stats for players the previous year
merged_df = teams_df[['year', 'tmID', 'playoff']].copy()
for index, row in merged_df.iterrows():
    merged_df.loc[index, 'per'] = new_pt_df[(new_pt_df['year'] == row['year'] - 1) & (new_pt_df['tmID'] == row['tmID'])]['PER'].sum()

display(merged_df)


Unnamed: 0,playerID,year,tmID,PER
0,abrossv01w,2,MIN,21.050648
1,abrossv01w,3,MIN,15.237793
2,abrossv01w,4,MIN,20.039299
3,abrossv01w,5,MIN,15.943110
4,abrossv01w,6,MIN,17.684842
...,...,...,...,...
1871,zakalok01w,3,PHO,-8.567689
1872,zarafr01w,6,SEA,11.926119
1873,zellosh01w,10,DET,24.142646
1874,zirkozu01w,4,WAS,18.774724


Unnamed: 0,year,tmID,playoff,per
0,9,ATL,N,0.000000
1,10,ATL,Y,219.722215
2,1,CHA,N,0.000000
3,2,CHA,Y,185.796930
4,3,CHA,Y,161.314610
...,...,...,...,...
137,6,WAS,N,201.160156
138,7,WAS,Y,207.581668
139,8,WAS,N,244.497709
140,9,WAS,N,239.153552


In [42]:
merged_df['playoff'] = merged_df['playoff'].eq('Y').mul(1)
merged_df = merged_df.select_dtypes(['number']) # Remove later
merged_df.dropna(axis=0, inplace=True)
merged_df.head()

print(merged_df.shape)
merged_df.head()

(142, 3)


Unnamed: 0,year,playoff,per
0,9,0,0.0
1,10,1,219.722215
2,1,0,0.0
3,2,1,185.79693
4,3,1,161.31461


In [43]:
# Save the result to a new CSV file
merged_df.to_csv(os.path.join(DATA_PATH, DATA_MERGED), index=False)

### References

<a id="ref1"></a> [1] Maroun, E. (2012, March 7). Understanding advanced statistics: player efficiency rating. Hardwood Paroxysm. https://web.archive.org/web/20170910105350/https://hardwoodparoxysm.com/2012/03/07/understanding-advanced-statistics-player-efficiency-rating/

<a id="ref2"></a> [2] Calculating PER | Basketball-Reference.com. (n.d.). Basketball-Reference.com. https://www.basketball-reference.com/about/per.html