# Feature Selection for NBA Dataset (Predicting if the home team wins)

Import Libraries

In [3]:
import sqlite3 as sql
import pandas as pd
import os

Connect to Database

In [4]:
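db_fp = os.path.join("data", "nba.sqlite")
# sqlite3.connect() silently creates an empty database when the file is
# missing, so verify that the file exists before connecting.
assert os.path.exists(db_fp), "Database file does not exist"

# connect() raises sqlite3.Error on failure, so no sentinel check is needed.
conn = sql.connect(db_fp)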
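
Before querying individual tables, it can help to confirm which tables the database actually contains. The sketch below reads table names from SQLite's `sqlite_master` catalog. It builds a small in-memory database with two empty stand-in tables so that it runs on its own; the helper name `list_tables` and the stand-in column definitions are illustrative and not part of the notebook.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables in a SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database (assumption: the real file contains these tables, among others).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE game (game_id TEXT)")
demo.execute("CREATE TABLE common_player_info (person_id TEXT)")

tables = list_tables(demo)
print(tables)
```

With the notebook's connection, `list_tables(conn)` should include the tables examined in the sections below.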

## common_player_info table

Feature Descriptions

person_id: A unique identification number for each player.  
first_name: The first name of the player.  
last_name: The last name of the player.  
display_first_last: The player's name displayed as "First Last" (e.g., "LeBron James").  
display_last_comma_first: The player's name displayed as "Last, First" (e.g., "James, LeBron").  
display_fi_last: The player's name displayed as "FirstInitial. Last" (e.g., "L. James").  
player_slug: A URL-friendly version of the player's name, often used for web links.  
birthdate: The player's date of birth.  
school: The college or university the player attended (if any).  
country: The country where the player was born.  
last_affiliation: Likely refers to the player's last known affiliation (e.g., college, league before NBA).  
height: The player's height, likely in a standardized format (e.g., feet-inches or centimeters).  
weight: The player's weight, likely in pounds or kilograms.  
season_exp: The number of seasons the player has been in the NBA.  
jersey: The player's jersey number.  
position: The primary playing position(s) of the player (e.g., Guard, Forward, Center).  
rosterstatus: Indicates if the player is currently on a team's active roster.  
games_played_current_season_flag: A flag indicating if the player has played games in the current season (likely 1 for yes, 0 for no).  
team_id: A unique identification number for the player's current team.  
team_name: The full name of the player's current team.  
team_abbreviation: The abbreviated name of the player's current team (e.g., LAL for Los Angeles Lakers).  
team_code: A short code for the player's current team.  
team_city: The city of the player's current team.  
playercode: Another unique code or identifier for the player.  
from_year: The first year the player was active in the NBA.  
to_year: The last year the player was active in the NBA (or the current year if still active).  
dleague_flag: A flag indicating if the player has played in the NBA G-League (formerly D-League).  
nba_flag: A flag indicating if the player has played in the NBA.  
games_played_flag: A general flag indicating if the player has played any games (could be in NBA, G-League, etc.).  
draft_year: The year the player was drafted into the NBA.  
draft_round: The round in which the player was drafted.  
draft_number: The overall pick number in which the player was drafted.  
greatest_75_flag: A flag indicating if the player was named to the NBA's 75th Anniversary Team.  

In [5]:
df = pd.read_sql_query("SELECT * FROM common_player_info", conn)
df.columns

Index(['person_id', 'first_name', 'last_name', 'display_first_last',
       'display_last_comma_first', 'display_fi_last', 'player_slug',
       'birthdate', 'school', 'country', 'last_affiliation', 'height',
       'weight', 'season_exp', 'jersey', 'position', 'rosterstatus',
       'games_played_current_season_flag', 'team_id', 'team_name',
       'team_abbreviation', 'team_code', 'team_city', 'playercode',
       'from_year', 'to_year', 'dleague_flag', 'nba_flag', 'games_played_flag',
       'draft_year', 'draft_round', 'draft_number', 'greatest_75_flag'],
      dtype='object')

Player experience can be useful in estimating player performance (player value). A team's aggregate player performance should play a crucial role in predicting the game winner, with higher average player performance increasing the chance of winning.
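
As a rough illustration of "average player experience per team", the sketch below aggregates `season_exp` by `team_name` with a SQL `GROUP BY`. It runs against an in-memory stand-in table with invented rows; only the three columns it needs are created. Note that `team_name` in this table is the player's current team, so the result describes current rosters, not the roster at the time of a historical game.

```python
import sqlite3

# Stand-in table with invented rows (assumption: the real table has these columns).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE common_player_info (person_id TEXT, team_name TEXT, season_exp REAL)"
)
demo.executemany(
    "INSERT INTO common_player_info VALUES (?, ?, ?)",
    [
        ("1", "Lakers", 20.0),
        ("2", "Lakers", 4.0),
        ("3", "Nuggets", 8.0),
        ("4", "Nuggets", 6.0),
    ],
)

# Average seasons of experience per team.
avg_exp = dict(
    demo.execute(
        "SELECT team_name, AVG(season_exp) FROM common_player_info GROUP BY team_name"
    ).fetchall()
)
print(avg_exp)
```

The same query can be passed to `pd.read_sql_query` with the notebook's `conn` to get the result as a DataFrame.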

Player demographics, while less impactful, can still have an effect. Taller or heavier players might perform better, and some countries might produce stronger NBA players than others. Most NBA players reach their 'prime' in their late 20s, so age may matter as well. Further EDA is needed to validate these hypotheses.

Player identifiers, such as name and jersey number, should have no effect and can be disregarded.

Resulting Features:
'birthdate', 'school', 'country', 'height', 'weight', 'draft_year', 'season_exp', 'draft_number'
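
Two of these features need preprocessing before a model can use them. The sketch below derives a player's age on a given date from `birthdate` and converts `height` to inches. Both helpers rest on assumptions about storage format that should be checked against the data: that dates are ISO-formatted strings, and that height is a hyphenated "feet-inches" string such as "6-9". The function names and sample values are illustrative.

```python
from datetime import datetime


def age_on(birthdate, game_date):
    """Whole years between two ISO-formatted date strings."""
    born = datetime.fromisoformat(birthdate)
    played = datetime.fromisoformat(game_date)
    # Subtract one year if the birthday has not yet occurred in the game year.
    before_birthday = (played.month, played.day) < (born.month, born.day)
    return played.year - born.year - before_birthday


def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-9' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


print(age_on("1984-12-30 00:00:00", "2023-10-24 00:00:00"))  # 38
print(height_to_inches("6-9"))  # 81
```

Computing age relative to each game's date, instead of today's date, keeps the feature consistent with what was known when the game was played.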

## draft_combine_stats table

Feature Descriptions

season: The NBA season (e.g., '2022-23') in which the draft combine took place.  
player_id: A unique identification number for each player.  
first_name: The first name of the player.  
last_name: The last name of the player.  
player_name: The full name of the player (e.g., "John Doe").  
position: The primary playing position(s) of the player.  
height_wo_shoes: The player's height measured without shoes, likely in inches or centimeters.  
height_wo_shoes_ft_in: The player's height without shoes, displayed in feet and inches (e.g., "6'8.5").  
height_w_shoes: The player's height measured with shoes, likely in inches or centimeters.  
height_w_shoes_ft_in: The player's height with shoes, displayed in feet and inches.  
weight: The player's weight, likely in pounds.  
wingspan: The player's arm span from fingertip to fingertip, likely in inches or centimeters.  
wingspan_ft_in: The player's wingspan, displayed in feet and inches.  
standing_reach: The player's reach with arms fully extended upwards while standing flat-footed, likely in inches or centimeters.  
standing_reach_ft_in: The player's standing reach, displayed in feet and inches.  
body_fat_pct: The player's body fat percentage.  
hand_length: The length of the player's hand, likely in inches or centimeters.  
hand_width: The width of the player's hand, likely in inches or centimeters.  
standing_vertical_leap: The player's vertical jump height starting from a standing position.  
max_vertical_leap: The player's maximum vertical jump height, often with a running start.  
lane_agility_time: The time taken by the player to complete a specific agility drill on the court.  
modified_lane_agility_time: Another measurement of agility, possibly a variation of the lane agility drill.  
three_quarter_sprint: The time taken by the player to sprint three-quarters of the court length.  
bench_press: The number of repetitions a player can complete on the bench press with a specific weight (often 185 lbs).  

The following columns relate to shooting drills, with varying distances and starting points:  

spot_fifteen_corner_left: Shooting percentage from a "spot up" (stationary) position, 15 feet from the basket, in the left corner.  
spot_fifteen_break_left: Shooting percentage from a spot-up position, 15 feet from the basket, on the left "break" (mid-range area).  
spot_fifteen_top_key: Shooting percentage from a spot-up position, 15 feet from the basket, at the top of the key.  
spot_fifteen_break_right: Shooting percentage from a spot-up position, 15 feet from the basket, on the right "break".  
spot_fifteen_corner_right: Shooting percentage from a spot-up position, 15 feet from the basket, in the right corner.  
spot_college_corner_left: Shooting percentage from a spot-up position, from college three-point range, in the left corner.  
spot_college_break_left: Shooting percentage from a spot-up position, from college three-point range, on the left break.  
spot_college_top_key: Shooting percentage from a spot-up position, from college three-point range, at the top of the key.  
spot_college_break_right: Shooting percentage from a spot-up position, from college three-point range, on the right break.  
spot_college_corner_right: Shooting percentage from a spot-up position, from college three-point range, in the right corner.  
spot_nba_corner_left: Shooting percentage from a spot-up position, from NBA three-point range, in the left corner.  
spot_nba_break_left: Shooting percentage from a spot-up position, from NBA three-point range, on the left break.  
spot_nba_top_key: Shooting percentage from a spot-up position, from NBA three-point range, at the top of the key.  
spot_nba_break_right: Shooting percentage from a spot-up position, from NBA three-point range, on the right break.  
spot_nba_corner_right: Shooting percentage from a spot-up position, from NBA three-point range, in the right corner.  
off_drib_fifteen_break_left: Shooting percentage from "off the dribble" (after dribbling), 15 feet from the basket, on the left break.  
off_drib_fifteen_top_key: Shooting percentage off the dribble, 15 feet from the basket, at the top of the key.  
off_drib_fifteen_break_right: Shooting percentage off the dribble, 15 feet from the basket, on the right break.  
off_drib_college_break_left: Shooting percentage off the dribble, from college three-point range, on the left break.  
off_drib_college_top_key: Shooting percentage off the dribble, from college three-point range, at the top of the key.  
off_drib_college_break_right: Shooting percentage off the dribble, from college three-point range, on the right break.  
on_move_fifteen: Shooting percentage while "on the move" (moving without the ball), from 15 feet.  
on_move_college: Shooting percentage while on the move, from college three-point range.  

In [7]:
df = pd.read_sql_query("SELECT * FROM draft_combine_stats", conn)
df.columns

Index(['season', 'player_id', 'first_name', 'last_name', 'player_name',
       'position', 'height_wo_shoes', 'height_wo_shoes_ft_in',
       'height_w_shoes', 'height_w_shoes_ft_in', 'weight', 'wingspan',
       'wingspan_ft_in', 'standing_reach', 'standing_reach_ft_in',
       'body_fat_pct', 'hand_length', 'hand_width', 'standing_vertical_leap',
       'max_vertical_leap', 'lane_agility_time', 'modified_lane_agility_time',
       'three_quarter_sprint', 'bench_press', 'spot_fifteen_corner_left',
       'spot_fifteen_break_left', 'spot_fifteen_top_key',
       'spot_fifteen_break_right', 'spot_fifteen_corner_right',
       'spot_college_corner_left', 'spot_college_break_left',
       'spot_college_top_key', 'spot_college_break_right',
       'spot_college_corner_right', 'spot_nba_corner_left',
       'spot_nba_break_left', 'spot_nba_top_key', 'spot_nba_break_right',
       'spot_nba_corner_right', 'off_drib_fifteen_break_left',
       'off_drib_fifteen_top_key', 'off_drib_fifteen_bre

## game table

Feature Descriptions

season_id: A unique identifier for the NBA season (e.g., 22023 might represent the 2023-2024 season).  
game_id: A unique identifier for this specific game.  
game_date: The date the game was played (e.g., '2023-10-24').  
season_type: Indicates the part of the season (e.g., 'Regular Season', 'Playoffs', 'Pre Season', 'Play-In').  

Home Team Information & Stats:

team_id_home: The unique numerical identifier for the home team.  
team_abbreviation_home: The standard abbreviation for the home team (e.g., 'LAL', 'GSW', 'BOS').  
team_name_home: The full name of the home team (e.g., 'Los Angeles Lakers', 'Golden State Warriors').  
matchup_home: A string describing the matchup from the home team's perspective (e.g., 'LAL vs. DEN').  
wl_home: The result of the game for the home team, likely 'W' for a win or 'L' for a loss.  
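min: Minutes played in the game. Depending on the source, a team game log records either the game duration (48 for regulation, 53 with one overtime) or the minutes summed over the five players on court (240 for regulation, 265 with one overtime); check the actual values before relying on this column.  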
fgm_home: Field Goals Made by the home team.  
fga_home: Field Goals Attempted by the home team.  
fg_pct_home: Field Goal Percentage for the home team (FGM / FGA).  
fg3m_home: Three-Point Field Goals Made by the home team.  
fg3a_home: Three-Point Field Goals Attempted by the home team.  
fg3_pct_home: Three-Point Field Goal Percentage for the home team (FG3M / FG3A).  
ftm_home: Free Throws Made by the home team.  
fta_home: Free Throws Attempted by the home team.  
ft_pct_home: Free Throw Percentage for the home team (FTM / FTA).  
oreb_home: Offensive Rebounds by the home team.  
dreb_home: Defensive Rebounds by the home team.  
reb_home: Total Rebounds by the home team (OREB + DREB).  
ast_home: Assists by the home team.  
stl_home: Steals by the home team.  
blk_home: Blocks by the home team.  
tov_home: Turnovers committed by the home team.  
pf_home: Personal Fouls committed by the home team.  
pts_home: Total Points scored by the home team.  
plus_minus_home: The final score differential for the home team (Points Scored - Points Allowed; essentially pts_home - pts_away).  
video_available_home: A flag (likely 1 or 0 / True or False) indicating if video footage/highlights are available for this game, potentially from the home team's perspective or feed.  

Away Team Information & Stats:

team_id_away: The unique numerical identifier for the away team.  
team_abbreviation_away: The standard abbreviation for the away team.  
team_name_away: The full name of the away team.  
matchup_away: A string describing the matchup from the away team's perspective (e.g., 'DEN @ LAL').  
wl_away: The result of the game for the away team, likely 'W' for a win or 'L' for a loss.  
fgm_away: Field Goals Made by the away team.  
fga_away: Field Goals Attempted by the away team.  
fg_pct_away: Field Goal Percentage for the away team.  
fg3m_away: Three-Point Field Goals Made by the away team.  
fg3a_away: Three-Point Field Goals Attempted by the away team.  
fg3_pct_away: Three-Point Field Goal Percentage for the away team.  
ftm_away: Free Throws Made by the away team.  
fta_away: Free Throws Attempted by the away team.  
ft_pct_away: Free Throw Percentage for the away team.  
oreb_away: Offensive Rebounds by the away team.  
dreb_away: Defensive Rebounds by the away team.  
reb_away: Total Rebounds by the away team.  
ast_away: Assists by the away team.  
stl_away: Steals by the away team.  
blk_away: Blocks by the away team.  
tov_away: Turnovers committed by the away team.  
pf_away: Personal Fouls committed by the away team.  
pts_away: Total Points scored by the away team.  
plus_minus_away: The final score differential for the away team (Points Scored - Points Allowed; essentially pts_away - pts_home). This should be the negative of plus_minus_home.  
video_available_away: A flag indicating if video footage/highlights are available for this game, potentially from the away team's perspective or feed.  
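
Several of the descriptions above state relationships between columns (for example, `plus_minus_home` equals `pts_home - pts_away`, and `plus_minus_away` is its negative). These can be verified instead of assumed. The sketch below checks one row, represented as a plain dict; the helper name `check_game_row`, the tolerance, and the sample values are illustrative.

```python
def check_game_row(row, tol=1e-3):
    """Return a list of inconsistencies found in one game row (empty if none)."""
    problems = []
    if row["plus_minus_home"] != row["pts_home"] - row["pts_away"]:
        problems.append("plus_minus_home != pts_home - pts_away")
    if row["plus_minus_away"] != -row["plus_minus_home"]:
        problems.append("plus_minus_away != -plus_minus_home")
    # Skip the percentage check when there were no attempts (avoids dividing by zero).
    if row["fga_home"]:
        expected = row["fgm_home"] / row["fga_home"]
        if abs(row["fg_pct_home"] - expected) > tol:
            problems.append("fg_pct_home != fgm_home / fga_home")
    return problems


# Invented sample rows.
good = {
    "pts_home": 119, "pts_away": 107,
    "plus_minus_home": 12, "plus_minus_away": -12,
    "fgm_home": 44, "fga_home": 88, "fg_pct_home": 0.5,
}
bad = dict(good, plus_minus_away=12)

print(check_game_row(good))
print(check_game_row(bad))
```

With the notebook's DataFrame, the same function could be applied to each record of `df.to_dict("records")`; rows with missing values would need to be filtered out first.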

In [6]:
df = pd.read_sql_query("SELECT * FROM game", conn)

df.columns

Index(['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home',
       'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home',
       'fga_home', 'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home',
       'ftm_home', 'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home',
       'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home',
       'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away',
       'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away', 'video_available_away',
       'season_type'],
      dtype='object')

Since our goal is to predict the probability of the home team winning before the game starts, we need to eliminate features that would not be available to the model at that point. This avoids data leakage, which would inflate performance during training and evaluation (the leaked values are present in every split of this table) while failing on real upcoming games, where those values do not yet exist.
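
One way to respect this constraint when engineering features is to compute each game's features only from games played earlier. The sketch below builds a hypothetical feature, the home team's win rate over its previous games, and reads the running counters before updating them with the current result. The feature itself, the function name, and the sample games are assumptions for illustration; the notebook does not define them.

```python
from collections import defaultdict


def prior_home_win_rate(games):
    """For each game, the home team's win rate over its earlier games.

    `games` holds (game_date, team_name_home, team_name_away, wl_home) tuples
    with ISO-formatted dates. Returns (game_date, team_name_home, rate) tuples,
    where rate is None if the home team has no earlier games.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    features = []
    for game_date, home, away, wl_home in sorted(games):
        # Read the feature BEFORE updating the counters, so the current
        # game's result never influences its own feature.
        rate = wins[home] / played[home] if played[home] else None
        features.append((game_date, home, rate))
        home_won = wl_home == "W"
        wins[home] += home_won
        wins[away] += not home_won
        played[home] += 1
        played[away] += 1
    return features


# Invented sample games.
games = [
    ("2023-10-24", "A", "B", "W"),
    ("2023-10-26", "B", "A", "W"),
    ("2023-10-28", "A", "B", "L"),
]
features = prior_home_win_rate(games)
print(features)
```

Swapping the two steps inside the loop (update first, then read) would leak each game's outcome into its own feature.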

The leaking features include all in-game metrics: the box-score statistics, 'plus_minus', and 'wl_away', which mirrors the target. Furthermore, the 'id' columns can be disregarded, as they carry no predictive information. Lastly, 'matchup' and 'team_abbreviation' can be dropped, and 'team_name' will be used instead to retrieve team data.
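
The selection can be written as an explicit allowlist, which is safer than dropping columns one by one because any column not named is excluded by default. The sketch below splits a list of column names into features, target, and dropped columns; the helper name `split_game_columns` and the shortened sample column list are illustrative.

```python
FEATURES = ["game_date", "team_name_home", "team_name_away"]
TARGET = "wl_home"


def split_game_columns(columns):
    """Split column names into (features, target, dropped) using the allowlist."""
    columns = list(columns)
    missing = [c for c in FEATURES + [TARGET] if c not in columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    dropped = [c for c in columns if c not in FEATURES and c != TARGET]
    return FEATURES, TARGET, dropped


# Shortened stand-in for the game table's column list.
sample_columns = [
    "season_id", "team_id_home", "team_name_home", "game_id", "game_date",
    "wl_home", "pts_home", "team_name_away", "wl_away", "pts_away",
]
features, target, dropped = split_game_columns(sample_columns)
print(dropped)
```

With the notebook's DataFrame, `split_game_columns(df.columns)` returns the names to use as `df[features]` and `df[target]`.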

Resulting features:
'game_date', 'team_name_home', 'team_name_away'

**The target feature is 'wl_home'.**
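
Most classifiers expect a numeric target. A minimal sketch of the encoding, assuming `wl_home` holds 'W' and 'L' as described above (any other value, including a missing one, maps to `None` so it can be inspected or dropped); the helper name `encode_wl` is illustrative:

```python
def encode_wl(value):
    """Map 'W' to 1 and 'L' to 0; return None for anything else."""
    return {"W": 1, "L": 0}.get(value)


labels = [encode_wl(v) for v in ["W", "L", None, "W"]]
print(labels)  # [1, 0, None, 1]
```

In pandas, `df["wl_home"].map({"W": 1, "L": 0})` performs the same mapping and leaves unmatched values as NaN.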