# Feature Selection for NBA Dataset (Predicting Home Team Win Probability)

Import Libraries

In [1]:
import sqlite3 as sql
import pandas as pd
import os

Connect to Database

In [2]:
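db_fp = os.path.join("data", "nba.sqlite")
# sqlite3.connect() creates an empty file if the path is missing, so check first.
assert os.path.exists(db_fp), "Database file does not exist"

# connect() raises sqlite3.Error on failure, so no sentinel check is needed.
conn = sql.connect(db_fp)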

## common_player_info table

Feature Descriptions



In [3]:
df = pd.read_sql_query("SELECT * FROM common_player_info", conn)
df.columns

Index(['person_id', 'first_name', 'last_name', 'display_first_last',
       'display_last_comma_first', 'display_fi_last', 'player_slug',
       'birthdate', 'school', 'country', 'last_affiliation', 'height',
       'weight', 'season_exp', 'jersey', 'position', 'rosterstatus',
       'games_played_current_season_flag', 'team_id', 'team_name',
       'team_abbreviation', 'team_code', 'team_city', 'playercode',
       'from_year', 'to_year', 'dleague_flag', 'nba_flag', 'games_played_flag',
       'draft_year', 'draft_round', 'draft_number', 'greatest_75_flag'],
      dtype='object')

## game table

Feature Descriptions

season_id: A unique identifier for the NBA season (e.g., 22023 might represent the 2023-2024 season).  
game_id: A unique identifier for this specific game.
game_date: The date the game was played (e.g., '2023-10-24').
season_type: Indicates the part of the season (e.g., 'Regular Season', 'Playoffs', 'Pre Season', 'Play-In').

Home Team Information & Stats:

team_id_home: The unique numerical identifier for the home team.  
team_abbreviation_home: The standard abbreviation for the home team (e.g., 'LAL', 'GSW', 'BOS').  
team_name_home: The full name of the home team (e.g., 'Los Angeles Lakers', 'Golden State Warriors').  
matchup_home: A string describing the matchup from the home team's perspective (e.g., 'LAL vs. DEN').  
wl_home: The result of the game for the home team, likely 'W' for a win or 'L' for a loss.  
min: Minutes played in the game. For a team game log, this is usually the total duration (48 for regulation, 53 for 1OT, etc.).  
fgm_home: Field Goals Made by the home team.  
fga_home: Field Goals Attempted by the home team.  
fg_pct_home: Field Goal Percentage for the home team (FGM / FGA).  
fg3m_home: Three-Point Field Goals Made by the home team.  
fg3a_home: Three-Point Field Goals Attempted by the home team.  
fg3_pct_home: Three-Point Field Goal Percentage for the home team (FG3M / FG3A).  
ftm_home: Free Throws Made by the home team.  
fta_home: Free Throws Attempted by the home team.  
ft_pct_home: Free Throw Percentage for the home team (FTM / FTA).  
oreb_home: Offensive Rebounds by the home team.  
dreb_home: Defensive Rebounds by the home team.  
reb_home: Total Rebounds by the home team (OREB + DREB).  
ast_home: Assists by the home team.  
stl_home: Steals by the home team.  
blk_home: Blocks by the home team.  
tov_home: Turnovers committed by the home team.  
pf_home: Personal Fouls committed by the home team.  
pts_home: Total Points scored by the home team.  
plus_minus_home: The final score differential for the home team (Points Scored - Points Allowed; essentially pts_home - pts_away).  
video_available_home: A flag (likely 1 or 0 / True or False) indicating if video footage/highlights are available for this game, potentially from the home team's perspective or feed.  

Away Team Information & Stats:

team_id_away: The unique numerical identifier for the away team.  
team_abbreviation_away: The standard abbreviation for the away team.  
team_name_away: The full name of the away team.  
matchup_away: A string describing the matchup from the away team's perspective (e.g., 'DEN @ LAL').  
wl_away: The result of the game for the away team, likely 'W' for a win or 'L' for a loss.  
fgm_away: Field Goals Made by the away team.  
fga_away: Field Goals Attempted by the away team.  
fg_pct_away: Field Goal Percentage for the away team.  
fg3m_away: Three-Point Field Goals Made by the away team.  
fg3a_away: Three-Point Field Goals Attempted by the away team.  
fg3_pct_away: Three-Point Field Goal Percentage for the away team.  
ftm_away: Free Throws Made by the away team.  
fta_away: Free Throws Attempted by the away team.  
ft_pct_away: Free Throw Percentage for the away team.  
oreb_away: Offensive Rebounds by the away team.  
dreb_away: Defensive Rebounds by the away team.  
reb_away: Total Rebounds by the away team.  
ast_away: Assists by the away team.  
stl_away: Steals by the away team.  
blk_away: Blocks by the away team.  
tov_away: Turnovers committed by the away team.  
pf_away: Personal Fouls committed by the away team.  
pts_away: Total Points scored by the away team.  
plus_minus_away: The final score differential for the away team (Points Scored - Points Allowed; essentially pts_away - pts_home). This should be the negative of plus_minus_home.  
video_available_away: A flag indicating if video footage/highlights are available for this game, potentially from the away team's perspective or feed.  
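
Several of the columns described above are derived from others, so those relationships can be checked directly. The sketch below uses a small hand-made DataFrame (the values are illustrative, not taken from the database) and a hypothetical helper, `check_derived_columns`. Against the real `game` table, rounding in the stored percentages and missing values would probably need a tolerance like the one assumed here.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; column names follow the descriptions above.
games = pd.DataFrame({
    "fgm_home": [40, 35], "fga_home": [80, 70], "fg_pct_home": [0.5, 0.5],
    "oreb_home": [10, 8], "dreb_home": [30, 32], "reb_home": [40, 40],
    "pts_home": [110, 98], "pts_away": [102, 105],
    "plus_minus_home": [8, -7], "plus_minus_away": [-8, 7],
})

def check_derived_columns(df, tol=1e-3):
    """Return one boolean array per documented relationship (hypothetical helper)."""
    return {
        # FG% should equal FGM / FGA, up to rounding.
        "fg_pct": np.isclose(df["fg_pct_home"], df["fgm_home"] / df["fga_home"], atol=tol),
        # Total rebounds should equal offensive plus defensive rebounds.
        "reb": np.isclose(df["reb_home"], df["oreb_home"] + df["dreb_home"]),
        # Home plus/minus should equal the final score differential.
        "plus_minus": np.isclose(df["plus_minus_home"], df["pts_home"] - df["pts_away"]),
        # Away plus/minus should be the negative of home plus/minus.
        "mirror": np.isclose(df["plus_minus_away"], -df["plus_minus_home"]),
    }

results = check_derived_columns(games)
summary = {name: bool(ok.all()) for name, ok in results.items()}
print(summary)
```

Any relationship that fails on real data is worth inspecting before modeling, since it may indicate missing values or inconsistent records.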

In [4]:
df = pd.read_sql_query("SELECT * FROM game", conn)

df.columns

Index(['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home',
       'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home',
       'fga_home', 'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home',
       'ftm_home', 'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home',
       'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home',
       'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away',
       'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away', 'video_available_away',
       'season_type'],
      dtype='object')

Since our goal is to predict the probability of the home team winning before the game starts, we must exclude any feature that would not be known at that point. Otherwise we introduce data leakage: the model would score well during both training and held-out evaluation, because the leaked columns are present in both, yet it would be unusable for real predictions, where those values do not exist yet.

The leaking features include all in-game metrics (shooting, rebounding, points, plus_minus, and so on) as well as 'wl_away', which simply mirrors the target. The 'id' columns can also be disregarded, since arbitrary identifiers provide no useful information to the model. Lastly, 'matchup' and 'team_abbreviation' can be dropped, and 'team_name' will be used instead to retrieve team data.
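
One way to apply this filtering is with an allow-list of pre-game columns rather than a drop-list of leaking ones. The sketch below is a minimal illustration on a toy frame holding only a subset of the `game` columns, with made-up values. The names `KEEP`, `TARGET`, `X`, `y`, and `dropped` are assumptions introduced here, not part of the notebook.

```python
import pandas as pd

# Toy frame with a subset of the `game` columns; values are made up.
games = pd.DataFrame({
    "season_id": ["22023"],
    "game_id": ["0000000001"],
    "game_date": ["2023-10-24"],
    "team_abbreviation_home": ["LAL"],
    "team_name_home": ["Los Angeles Lakers"],
    "matchup_home": ["LAL vs. DEN"],
    "wl_home": ["W"],
    "pts_home": [110],
    "plus_minus_home": [8],
    "team_name_away": ["Denver Nuggets"],
    "wl_away": ["L"],
    "pts_away": [102],
})

# Columns known before tip-off, plus the prediction target.
KEEP = ["game_date", "team_name_home", "team_name_away"]
TARGET = "wl_home"

X = games[KEEP]      # model inputs
y = games[TARGET]    # label to predict

# Everything else is excluded, including in-game stats and wl_away.
dropped = sorted(set(games.columns) - set(KEEP) - {TARGET})
print(dropped)
```

With an allow-list, a column added to the table later cannot leak into the model silently; it has to be added to `KEEP` on purpose.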

Resulting features:
'game_date', 'team_name_home', 'team_name_away'

**The target feature is 'wl_home'.**
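
Most classifiers expect a numeric label, so 'wl_home' would typically be encoded before training. The following sketch assumes the column holds 'W' or 'L' as described above, that rows with a missing result should be discarded, and that 'game_date' is stored as text. The column name `home_win` is hypothetical, and the rows are made up.

```python
import pandas as pd

# Toy rows with made-up values; the last one has no recorded result.
games = pd.DataFrame({
    "game_date": ["2023-10-24 00:00:00", "2023-10-25 00:00:00", "2023-10-26 00:00:00"],
    "wl_home": ["W", "L", None],
})

# Drop games without a result (an assumption about how to treat them).
games = games.dropna(subset=["wl_home"]).copy()

# Encode the target: 1 if the home team won, 0 otherwise.
games["home_win"] = (games["wl_home"] == "W").astype(int)

# Parse dates so that date-based features and chronological splits are possible.
games["game_date"] = pd.to_datetime(games["game_date"])

print(games[["game_date", "home_win"]])
```

Sorting by the parsed 'game_date' and splitting chronologically is one way to keep later games out of the training data.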