# NBA-Data-2010-2024 🏀

## Schema for Box_scores

### Dimensions
- **season_year**: The year of the basketball season.
- **game_date**: The date of the game.
- **gameId**: Unique identifier for the game.
- **teamId**: Unique identifier for the team.
- **teamCity**: The city where the team is based.
- **teamName**: The name of the team.
- **teamTricode**: A three-letter code representing the team.
- **teamSlug**: A URL-friendly, lowercase form of the team name.
- **personId**: Unique identifier for the person (player).
- **personName**: The name of the person (player).
- **position**: The position of the player.
- **comment**: Any additional comments or notes.
- **jerseyNum**: The jersey number of the player.

### Metrics
- **minutes**: The number of minutes played by the player.
- **fieldGoalsMade**: The number of field goals made by the player.
- **fieldGoalsAttempted**: The number of field goals attempted by the player.
- **fieldGoalsPercentage**: The shooting percentage for field goals.
- **threePointersMade**: The number of three-pointers made by the player.
- **threePointersAttempted**: The number of three-pointers attempted by the player.
- **threePointersPercentage**: The shooting percentage for three-pointers.
- **freeThrowsMade**: The number of free throws made by the player.
- **freeThrowsAttempted**: The number of free throws attempted by the player.
- **freeThrowsPercentage**: The shooting percentage for free throws.
- **reboundsOffensive**: The number of offensive rebounds by the player.
- **reboundsDefensive**: The number of defensive rebounds by the player.
- **reboundsTotal**: The total number of rebounds by the player.
- **assists**: The number of assists by the player.
- **steals**: The number of steals by the player.
- **blocks**: The number of blocks by the player.
- **turnovers**: The number of turnovers by the player.
- **foulsPersonal**: The number of personal fouls committed by the player.
- **points**: The total number of points scored by the player.
- **plusMinusPoints**: The plus-minus statistic for the player, indicating the team's score differential when the player is on the court.

## Schema of game totals 

### Dimensions
- **SEASON_YEAR**: The year of the NBA season.
- **TEAM_ID**: Unique identifier for the team.
- **TEAM_ABBREVIATION**: Abbreviated name of the team.
- **TEAM_NAME**: Full name of the team.
- **GAME_ID**: Unique identifier for the game.
- **GAME_DATE**: Date of the game.
- **MATCHUP**: Matchup details indicating the teams involved.
- **WL**: Outcome of the game (Win or Loss).

### Metrics
- **MIN**: Total minutes played in the game.
- **FGM**: Field goals made.
- **FGA**: Field goals attempted.
- **FG_PCT**: Field goal percentage.
- **FG3M**: Three-point field goals made.
- **FG3A**: Three-point field goals attempted.
- **FG3_PCT**: Three-point field goal percentage.
- **FTM**: Free throws made.
- **FTA**: Free throws attempted.
- **FT_PCT**: Free throw percentage.
- **OREB**: Offensive rebounds.
- **DREB**: Defensive rebounds.
- **REB**: Total rebounds.
- **AST**: Assists.
- **TOV**: Turnovers.
- **STL**: Steals.
- **BLK**: Blocks.
- **BLKA**: Blocked attempts (the team's own shot attempts that the opponent blocked).
- **PF**: Personal fouls.
- **PFD**: Personal fouls drawn.
- **PTS**: Total points scored.
- **PLUS_MINUS**: Plus-minus statistic.
- **GP_RANK**: Rank based on games played.
- **W_RANK**: Rank based on wins.
- **L_RANK**: Rank based on losses.
- **W_PCT_RANK**: Rank based on win percentage.
- **MIN_RANK**: Rank based on minutes played.
- **\*_RANK**: Ranks for the other statistical categories (field goals made, rebounds, assists, etc.), indicated by the suffix `_RANK`.
- **AVAILABLE_FLAG**: Indicates if the data for this row is available.

## Authors

- [@NocturneBear](https://github.com/NocturneBear)

## License

[MIT](https://github.com/NocturneBear/NBA-Data-2010-2024/blob/main/LICENSE)

In [13]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import re
import itertools 
pd.set_option('future.no_silent_downcasting', True)

In [14]:
NUM_GAMES=82
teams=['DAL','MIL','ATL','DEN','HOU','IND','OKC','CHI','ORL','BOS','DET','NYK'
,'CHA','LAL','SAC','MIA','LAC','GSW','POR','MIN','WAS','BKN','MEM','SAS'
,'PHX','NOP','UTA','TOR','PHI','CLE']
all_possible_matchups=list(itertools.combinations(teams, 2))
playoff_games_total=pd.read_csv("./datasets/NBA_DATA_2010_2024/play_off_totals_2010_2024.csv",delimiter=',',header=0)
regular_games_total=pd.read_csv("./datasets/NBA_DATA_2010_2024/regular_season_totals_2010_2024.csv",delimiter=',',header=0)
regular_season_all_parts=pd.concat([
        pd.read_csv("./datasets/NBA_DATA_2010_2024/regular_season_box_scores_2010_2024_part_1.csv",delimiter=',',header=0),
        pd.read_csv("./datasets/NBA_DATA_2010_2024/regular_season_box_scores_2010_2024_part_2.csv",delimiter=',',header=0),
        pd.read_csv("./datasets/NBA_DATA_2010_2024/regular_season_box_scores_2010_2024_part_3.csv",delimiter=',',header=0)])

## Some Utils Functions

In [6]:
def convert_min_to_float(min_str):
    if isinstance(min_str, str) and ':' in min_str:
        mins, secs = map(int, min_str.split(':'))
        return mins + secs / 60
    return 0.0  # handle empty or malformed entries

In [7]:
def convert_int_season_to_str(season):
    if isinstance(season, int):
        return f"{season}-{season%2000 +1 :02d}" 
    return season

# Data Studying

## Justifications
- Why we chose player stats instead of overall team stats
- Why we chose to analyse a single season
- Why we analyse teams based on the players available

## Average Points Per Game, Per Team, Per Season
Here we look for a correlation between the teams' average points across seasons. This can indicate whether or not we should use all seasons to train the model, since these features may or may not be related from one season to the next.

In [None]:
def getTeamAvgPointsBySeason(scores,season=None,teamname=None):
    """
    Average per-game stats and number of wins for each team and season
    :param scores: team game totals, one row per team per game
    :param season: season: str or int or None (for all seasons)
    :param teamname: team abbreviations :List[str] or None (for all teams)
    :return: mean PTS, FG_PCT, FG3_PCT, FT_PCT and number of wins (WL) per team and season
    """
    season = convert_int_season_to_str(season)  # e.g. 2023 -> "2023-24"
    # Filter by season first
    season_scores=scores
    if season is not None:
        season_scores = scores[scores['SEASON_YEAR'] == season]
    # Encode the result as 1 (win) / 0 (loss) so that summing WL counts the wins.
    # Note: with season=None this writes into the `scores` frame that was passed in.
    season_scores.loc[:,'WL'] = season_scores['WL'].replace({'W': 1, 'L': 0}).infer_objects(copy=False)
    # Group by team and season, then average the stats and count the wins
    team_avg = season_scores.groupby([ 'TEAM_ABBREVIATION','SEASON_YEAR']).agg(
        {
            'PTS': 'mean',
            'FG_PCT': 'mean',
            'FG3_PCT': 'mean',
            'FT_PCT': 'mean',
            'WL':'sum'
        }
    ).reset_index()
    # Optional filter by team abbreviation
    if teamname:
        team_avg = team_avg[team_avg['TEAM_ABBREVIATION'].isin(teamname)]
    
    return team_avg.sort_values(by=['SEASON_YEAR','WL'],ascending=True).reset_index(drop=True)

In [None]:
AveragePointsPerGameinSeason= getTeamAvgPointsBySeason(regular_games_total,teamname=["LAL","BOS","PHI","CHI","NYK","MIA","GSW","SAS","OKC","DAL","DEN","MIL","HOU","IND","ORL","CHA","BKN","MEM","NOP","UTA","PHX"])
# Turn the win count into a win rate. NUM_GAMES (82) is the length of a full season,
# so shortened seasons such as 2011-12 (66 games) are understated.
AveragePointsPerGameinSeason['WL'] = AveragePointsPerGameinSeason['WL']/NUM_GAMES
sns.set_theme(style="whitegrid")  # You can also try 'darkgrid', 'white', 'dark', or 'ticks'
plt.figure(figsize=(20, 8))  # Increase width and height for better readability
sns.barplot(data=AveragePointsPerGameinSeason, x='TEAM_ABBREVIATION', y='WL', hue='SEASON_YEAR')
plt.title('Winning Percentage by Team and Season')
plt.tight_layout()  # only has an effect once the plot has been drawn
plt.savefig('figures/winning_percentage.png')

**As the graph above shows, although points seem to rise for all teams over the seasons, there is no direct correlation between the average points per game from one season to the next. We therefore need more features to infer relations and traits in the data and build a more accurate model. (Alterations*)**
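
The visual impression can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a frame shaped like the output of `getTeamAvgPointsBySeason` and uses made-up teams and values, then computes the Pearson correlation of team average points between consecutive seasons.

```python
import pandas as pd

# Synthetic stand-in for the output of getTeamAvgPointsBySeason (all values are made up)
team_avg = pd.DataFrame({
    "TEAM_ABBREVIATION": ["AAA", "BBB", "CCC", "DDD"] * 3,
    "SEASON_YEAR": ["2010-11"] * 4 + ["2011-12"] * 4 + ["2012-13"] * 4,
    "PTS": [98.0, 101.5, 95.2, 104.1,
            99.1, 100.2, 97.0, 103.5,
            102.3, 99.8, 98.4, 105.0],
})

# One row per team, one column per season
pts_by_season = team_avg.pivot(index="TEAM_ABBREVIATION", columns="SEASON_YEAR", values="PTS")

# Pearson correlation between every pair of seasons, computed across teams
season_corr = pts_by_season.corr()

# Keep only the correlation of each season with the one right before it
seasons = list(pts_by_season.columns)
consecutive = pd.Series(
    {f"{a} -> {b}": season_corr.loc[a, b] for a, b in zip(seasons, seasons[1:])},
    name="pearson_r",
)
print(consecutive)
```

Values close to 1 would suggest that team scoring carries over from one season to the next; values near 0 would support treating seasons separately.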

## Average Team Points Per Win Rate
Here we check whether there is a correlation between the average points per game and the win percentage in the regular season.

In [None]:
for year in range(2010,2024):
    sns.set_theme(style="whitegrid")  # You can also try 'darkgrid', 'white', 'dark', or 'ticks'
    plt.figure(figsize=(20, 8))  # Increase width and height for better readability
    ap= getTeamAvgPointsBySeason(regular_games_total,season=year)
    # Win rate over a full 82-game season (understated for shortened seasons)
    ap['WL'] = ap['WL']/NUM_GAMES
    sns.barplot(data=ap, x='PTS', y='WL', hue='TEAM_ABBREVIATION')
    savefig = f"figures/win_loss_per_points_{year}.png"
    plt.title(f"Win Loss Ratio per Points in {year}")
    plt.tight_layout()  # only has an effect once the plot has been drawn
    plt.savefig(savefig)
    plt.show()

**As the graphs show, there is also no direct correlation between the average points per game and the number of wins of each team in a season. The model will therefore probably need multi-dimensional features, and the data exploration may come at the cost of losing a visual representation of the data.**
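
To put a number on this observation, one option is a per-season Pearson correlation between average points and win rate. The sketch below is an illustration under assumptions: the data is synthetic, and the `GP` (games played) and `WIN_PCT` columns are hypothetical additions that the functions above do not produce.

```python
import pandas as pd

# Synthetic team-season averages (all values are made up)
team_avg = pd.DataFrame({
    "SEASON_YEAR": ["2021-22"] * 4 + ["2022-23"] * 4,
    "TEAM_ABBREVIATION": ["AAA", "BBB", "CCC", "DDD"] * 2,
    "PTS": [108.0, 112.5, 105.1, 115.3, 110.2, 113.0, 109.4, 117.8],
    "WL": [35, 48, 30, 52, 41, 44, 38, 57],  # number of wins
    "GP": [82] * 8,                          # hypothetical games-played column
})

# Dividing by the games actually played keeps shortened seasons comparable
team_avg["WIN_PCT"] = team_avg["WL"] / team_avg["GP"]

# Pearson correlation between points and win rate, one value per season
corr_by_season = pd.Series(
    {
        season: grp["PTS"].corr(grp["WIN_PCT"])
        for season, grp in team_avg.groupby("SEASON_YEAR")
    },
    name="pearson_r",
)
print(corr_by_season)
```

A single coefficient per season is easier to compare across the 14 seasons than 14 separate bar charts.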

## Justify the Player choice ... (TODO)
- (per team and season,win percentage by player points played games) DONE
- (per team and season,win percentage by the average plusminuspoints for each player) DONE
- (per team and season,win percentage by the reboundsTotal for each player)
- (per team and season,win percentage by the assists for each player)
- (per team and season, win percentage by [plusMinusPoints (average) by the minutes played (average)])


## Average Player Points per Win Percentage of the games played

In [None]:
def getMatchAndPlayerStats(game,player,season=None,teamname=None):
    """
    Merge the team game totals with the player box scores and average each player's stats
    :param game: team game totals, one row per team per game
    :param player: player box scores, one row per player per game
    :param season: season: str or int or None (for all seasons)
    :param teamname: team tricodes :List[str] or None (for all teams)
    :return: per player, team and season: mean stats, team wins (WL) and gamesPlayed
    """
    season=convert_int_season_to_str(season)
    # Work on a copy and make sure WL is numeric (1 = win, 0 = loss) before summing it
    game = game.copy()
    game['WL'] = game['WL'].replace({'W': 1, 'L': 0}).infer_objects(copy=False)
    # Keep only the players that have a minutes entry for the game
    playerScores = player[player['minutes'].notna()].copy()
    playerScores['minutesParsed'] = playerScores['minutes'].apply(convert_min_to_float)
    gamePlayer=game.merge(playerScores, how='inner', left_on=['GAME_ID','TEAM_ABBREVIATION'], right_on=['gameId','teamTricode'])
    # Counting gameId per group gives the number of games played by each player
    aggregation= gamePlayer.groupby(['personName','teamTricode','season_year']).agg(
        {
            'WL': 'sum',
            'points': 'mean',
            'fieldGoalsPercentage': 'mean',
            'threePointersPercentage': 'mean',
            'reboundsTotal': 'mean',
            'assists': 'mean',
            'plusMinusPoints':'mean',
            'minutesParsed': 'mean',
            'gameId': 'count'
        }
    ).reset_index().rename(columns={'gameId': 'gamesPlayed'})

    if season is not None:
        aggregation = aggregation[aggregation['season_year'] == season]
    if teamname is not None:
        aggregation = aggregation[aggregation['teamTricode'].isin(teamname)]

    return aggregation 

In [None]:
gameP=getMatchAndPlayerStats(regular_games_total,regular_season_all_parts,season=None,teamname=None)
for i in range(2010,2024):
    # Copy the season slice so the assignment below does not write into gameP
    playerStatus = gameP[gameP['season_year'] == convert_int_season_to_str(i)].copy()
    # Share of the player's games that the team won
    playerStatus['WL'] = playerStatus['WL']/playerStatus['gamesPlayed']
    sns.set_theme(style="whitegrid")  # You can also try 'darkgrid', 'white', 'dark', or 'ticks'
    plt.figure(figsize=(20, 8))  # Increase width and height for better readability
    sns.scatterplot(data=playerStatus, y='points', x='WL', hue='teamTricode')
    plt.title(f"Season {i}")
    plt.tight_layout()  # only has an effect once the plot has been drawn
    savefig = f"figures/avg_points_win_percentage_of_each_player_{i}.png"
    plt.savefig(savefig)


**Again, we fail to notice any correlation between the average points per player per game and the number of wins of the team. Perhaps a linear relation can be observed between the total points and the number of wins in the regular season.**

## PlusMinusPoints per Player Win Percentage in a Season and Team

In [None]:
# (per team and season, win percentage by the average plusMinusPoints for each player)
for i in range(2010,2024):
    # Copy the season slice so the assignment below does not write into gameP
    playerStatus = gameP[gameP['season_year'] == convert_int_season_to_str(i)].copy()
    # Share of the player's games that the team won
    playerStatus['WL'] = playerStatus['WL']/playerStatus['gamesPlayed']
    sns.set_theme(style="whitegrid")  # You can also try 'darkgrid', 'white', 'dark', or 'ticks'
    plt.figure(figsize=(20, 8))  # Increase width and height for better readability
    sns.scatterplot(data=playerStatus, x='plusMinusPoints', y='WL', hue='teamTricode')
    plt.title(f"Season {i}")
    plt.tight_layout()  # only has an effect once the plot has been drawn
    savefig = f"figures/avg_plus_minus_win_percentage_of_each_player_{i}.png"
    plt.savefig(savefig)

## Average total Rebounds per Win Percentage on games played by each player

In [None]:
for i in range(2010,2024):
    # Copy the season slice so the assignment below does not write into gameP
    playerStatus = gameP[gameP['season_year'] == convert_int_season_to_str(i)].copy()
    # Share of the player's games that the team won
    playerStatus['WL'] = playerStatus['WL']/playerStatus['gamesPlayed']
    sns.set_theme(style="whitegrid")  # You can also try 'darkgrid', 'white', 'dark', or 'ticks'
    plt.figure(figsize=(20, 8))  # Increase width and height for better readability
    sns.scatterplot(data=playerStatus, x='reboundsTotal', y='WL', hue='teamTricode')
    plt.title(f"Season {i}")
    plt.tight_layout()  # only has an effect once the plot has been drawn
    savefig = f"figures/rebounds_win_rate_p_{i}.png"
    plt.savefig(savefig)

## Average Plus Minus Points per Average Minutes Played by Win Percentage

In [None]:
for i in range(2010,2024):
    # Copy the season slice so the assignments below do not write into gameP
    playerStatus = gameP[gameP['season_year'] == convert_int_season_to_str(i)].copy()
    # Share of the player's games that the team won
    playerStatus['WL'] = playerStatus['WL']/playerStatus['gamesPlayed']
    # Average plus-minus per average minute played
    playerStatus['worth']=playerStatus['plusMinusPoints']/playerStatus['minutesParsed']
    sns.set_theme(style="whitegrid")  # You can also try 'darkgrid', 'white', 'dark', or 'ticks'
    plt.figure(figsize=(20, 8))  # Increase width and height for better readability
    sns.scatterplot(data=playerStatus, x='worth', y='WL', hue='teamTricode')
    plt.title(f"Season {i}")
    plt.tight_layout()  # only has an effect once the plot has been drawn
    # Own file name, so the rebounds figures saved earlier are not overwritten
    savefig = f"figures/plus_minus_per_minute_win_rate_p_{i}.png"
    plt.savefig(savefig)
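
The per-minute ratio is undefined for a player whose average minutes are zero, which can happen when `convert_min_to_float` falls back to `0.0` for malformed entries. A minimal sketch of one way to guard the division follows, using made-up players; treating zero minutes as missing is an assumption, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic per-player averages (all values are made up)
players = pd.DataFrame({
    "personName": ["Player A", "Player B", "Player C"],
    "plusMinusPoints": [4.5, -2.0, 1.0],
    "minutesParsed": [30.0, 12.5, 0.0],  # Player C has no recorded minutes
})

# Treat zero minutes as missing so the ratio becomes NaN instead of +/-inf
minutes = players["minutesParsed"].replace(0, np.nan)
players["worth"] = players["plusMinusPoints"] / minutes

print(players)
```

Seaborn scatter plots skip rows with missing coordinates, so such players simply drop out of the figure instead of stretching the axis.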

## Home/Away win rate
We want to be able to determine the relevance of a game being home or away on the game result.
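
As a starting point, the home or away side can be read from the `MATCHUP` string. The sketch below assumes the usual NBA stats convention that `"AAA vs. BBB"` marks a home game for `AAA` and `"AAA @ BBB"` an away game; the teams and results are made up, and `IS_HOME` and `WIN` are hypothetical helper columns.

```python
import pandas as pd

# Synthetic game totals: one row per team per game (all values are made up)
games = pd.DataFrame({
    "TEAM_ABBREVIATION": ["AAA", "BBB", "AAA", "CCC", "BBB", "CCC"],
    "MATCHUP": ["AAA vs. BBB", "BBB @ AAA", "AAA @ CCC",
                "CCC vs. AAA", "BBB vs. CCC", "CCC @ BBB"],
    "WL": ["W", "L", "L", "W", "L", "W"],
})

# Assumed convention: "vs." means the row's team played at home, "@" means away
games["IS_HOME"] = games["MATCHUP"].str.contains("vs.", regex=False)
games["WIN"] = (games["WL"] == "W").astype(int)

# Mean of the 0/1 win flag is the win rate for home and away rows
home_away = (
    games.groupby("IS_HOME")["WIN"]
    .mean()
    .rename({True: "home", False: "away"})
)
print(home_away)
```

Grouping by `["SEASON_YEAR", "IS_HOME"]` instead would give the same comparison season by season.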

In [28]:
def getMatchupByTeamBySeason(scores,matchup,season=False):
    """
    Function to get the wins and average field goals made for one matchup, per season
    :param scores: team game totals, one row per team per game
    :param matchup: pair of team abbreviations, e.g. ('DAL', 'MIL')
    :optional param season: season to filter the data by season 
    :return: wins (WL) and mean FGM of matchup[0] against matchup[1], per season
    """
    teams=scores.filter(items=['SEASON_YEAR','TEAM_ABBREVIATION','MATCHUP','TEAM_ID','WL','FGA','FGM'])
    if season is not False:
        teams=teams[teams['SEASON_YEAR']==convert_int_season_to_str(season)]
    matchup_tag=matchup[0]+" vs. "+matchup[1]
    # Home ("A vs. B") and away ("A @ B") games get the same label, so only the rows
    # recorded from matchup[0]'s point of view match the tag
    teams = teams.copy()
    teams['MATCHUP_STANDARD'] = teams['MATCHUP'].str.replace("@", "vs.", regex=False)
    teams=teams[teams['MATCHUP_STANDARD'] ==  matchup_tag].copy()
    # Encode the result as 1 (win) / 0 (loss) so that summing WL counts the wins
    teams['WL'] = teams['WL'].replace({'W': 1, 'L': 0}).infer_objects(copy=False)
    aggregation=teams.groupby(['MATCHUP_STANDARD',"SEASON_YEAR"]).agg(
        {
            "WL":"sum",
            "FGM":"mean",
        }
    )
    return aggregation 
    

In [30]:
for matchup in all_possible_matchups:
    mt = getMatchupByTeamBySeason(regular_games_total,matchup,season=2022)
    print(mt)

                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. MIL      2022-23      0  41.0
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. ATL      2022-23      0  46.5
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. DEN      2022-23      2  39.0
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. HOU      2022-23      3  37.0
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. IND      2022-23      1  46.5
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. OKC      2022-23      1  36.0
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. CHI      2022-23      0  41.5
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. ORL      2022-23      1  36.5
                             WL   FGM
MATCHUP_STANDARD SEASON_YEAR         
DAL vs. BOS 

# Pair Plot

In [None]:
# try NN, RNN, MLP,
# Check the use of softmax in the neural networks
