# Exploratory Data Analysis for NBA Lineup Prediction and Hidden Patterns

This notebook organizes the data exploration steps from the raw CSV file and then implements several additional views to uncover hidden patterns. We aim to understand:

- **Lineup stability and variation** over time
- **Frequency of player appearances** in different positions (home_0 to home_4)
- **Outcome analysis** by lineup
- **Common matchups:** which home players appear most frequently against certain away lineups

These insights can guide further feature engineering to improve our predictive model.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# For warnings
import warnings
warnings.filterwarnings('ignore')

# Set display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

pd.option_context('display.max_rows', None, 'display.max_columns', None)

<pandas._config.config.option_context at 0x29d5753a080>

## 1. Load and Inspect the Raw Data

Here we load a sample raw CSV file (2007 season) and inspect its contents.

In [2]:
# Define the folder containing the CSV files (update the path as needed)
data_folder = "../data"  # Replace with your actual data folder path

# Load all CSV files from 2007 to 2015
csv_files = sorted(glob.glob(os.path.join(data_folder, "matchups-20*.csv")))
dfs = []
for file in csv_files:
    print(f"Loading {file}")
    df = pd.read_csv(file)
    dfs.append(df)

df_all_years = pd.concat(dfs, ignore_index=True)

# Define the allowed features
allowed_features = [
    'game', 'season', 'home_team', 'away_team', 'starting_min',
    'home_0', 'home_1', 'home_2', 'home_3', 'home_4',
    'away_0', 'away_1', 'away_2', 'away_3', 'away_4',
    'outcome'  
]

# Filter the dataframe
df_filtered = df_all_years[allowed_features].copy()


# Dictionary to map old team acronyms to their new ones
team_acronym_mapping = {
    "NJN": "BRK",   # New Jersey Nets → Brooklyn Nets
    "NOK": "NOP",   # New Orleans/Oklahoma City Hornets → New Orleans Pelicans
    "NOH": "NOP",    # New Orleans Hornets → New Orleans Pelicans
    "CHO": "CHA",
    "SEA": "OKC"
}

# Apply the mapping to the home_team column
df_filtered["home_team"] = df_filtered["home_team"].replace(team_acronym_mapping)

df_filtered["away_team"] = df_filtered["away_team"].replace(team_acronym_mapping)



# Create lineup tuples (sorted alphabetically for consistency)
df_filtered['home_lineup'] = df_filtered[['home_0', 'home_1', 'home_2', 'home_3', 'home_4']].apply(lambda x: tuple(sorted(x)), axis=1)
df_filtered['away_lineup'] = df_filtered[['away_0', 'away_1', 'away_2', 'away_3', 'away_4']].apply(lambda x: tuple(sorted(x)), axis=1)

df_filtered_concat = df_filtered[['game', 'season', 'home_team', 'away_team', 'starting_min', 'home_lineup', 'away_lineup', 'outcome']]

Loading ../data\matchups-2007.csv
Loading ../data\matchups-2008.csv
Loading ../data\matchups-2009.csv
Loading ../data\matchups-2010.csv
Loading ../data\matchups-2011.csv
Loading ../data\matchups-2012.csv
Loading ../data\matchups-2013.csv
Loading ../data\matchups-2014.csv
Loading ../data\matchups-2015.csv


## 4. Unique Lineup Counts per Game

We group by game and count how many unique home and away lineups were used.

In [4]:
# Count unique lineups per game
lineup_counts = df_filtered.groupby('game').agg(
    unique_home_lineups=('home_lineup', 'nunique'),
    unique_away_lineups=('away_lineup', 'nunique')
).reset_index()

lineup_counts.head()

Unnamed: 0,game,unique_home_lineups,unique_away_lineups
0,200610310LAL,11,9
1,200610310MIA,11,16
2,200611010BOS,19,17
3,200611010CHA,16,15
4,200611010CLE,12,10


In [5]:
unique_games = df_all_years['game'].nunique()
print("Total unique games:", unique_games)

Total unique games: 10828


In [6]:
# Create df_filtered_pos with only rows where outcome is 1
df_filtered_pos = df_filtered[df_filtered["outcome"] == 1].copy()


In [7]:
df_filtered_pos.head()

Unnamed: 0,game,season,home_team,away_team,starting_min,home_0,home_1,home_2,home_3,home_4,away_0,away_1,away_2,away_3,away_4,outcome,home_lineup,away_lineup
2,200610310LAL,2007,LAL,PHO,8,Lamar Odom,Luke Walton,Maurice Evans,Ronny Turiaf,Smush Parker,Amar'e Stoudemire,Leandro Barbosa,Raja Bell,Shawn Marion,Steve Nash,1,"(Lamar Odom, Luke Walton, Maurice Evans, Ronny Turiaf, Smush Parker)","(Amar'e Stoudemire, Leandro Barbosa, Raja Bell, Shawn Marion, Steve Nash)"
3,200610310LAL,2007,LAL,PHO,10,Lamar Odom,Luke Walton,Maurice Evans,Ronny Turiaf,Smush Parker,Boris Diaw,James Jones,Kurt Thomas,Leandro Barbosa,Marcus Banks,1,"(Lamar Odom, Luke Walton, Maurice Evans, Ronny Turiaf, Smush Parker)","(Boris Diaw, James Jones, Kurt Thomas, Leandro Barbosa, Marcus Banks)"
5,200610310LAL,2007,LAL,PHO,12,Brian Cook,Maurice Evans,Sasha Vujacic,Smush Parker,Vladimir Radmanovic,Boris Diaw,James Jones,Kurt Thomas,Leandro Barbosa,Marcus Banks,1,"(Brian Cook, Maurice Evans, Sasha Vujacic, Smush Parker, Vladimir Radmanovic)","(Boris Diaw, James Jones, Kurt Thomas, Leandro Barbosa, Marcus Banks)"
6,200610310LAL,2007,LAL,PHO,13,Brian Cook,Jordan Farmar,Lamar Odom,Sasha Vujacic,Vladimir Radmanovic,Boris Diaw,James Jones,Kurt Thomas,Leandro Barbosa,Marcus Banks,1,"(Brian Cook, Jordan Farmar, Lamar Odom, Sasha Vujacic, Vladimir Radmanovic)","(Boris Diaw, James Jones, Kurt Thomas, Leandro Barbosa, Marcus Banks)"
7,200610310LAL,2007,LAL,PHO,16,Brian Cook,Jordan Farmar,Lamar Odom,Sasha Vujacic,Vladimir Radmanovic,Boris Diaw,James Jones,Marcus Banks,Shawn Marion,Steve Nash,1,"(Brian Cook, Jordan Farmar, Lamar Odom, Sasha Vujacic, Vladimir Radmanovic)","(Boris Diaw, James Jones, Marcus Banks, Shawn Marion, Steve Nash)"


## 5. Most Frequently Used Lineups per Game

We now identify the most frequently used home and away lineups for each game.

In [8]:
# Sort the DataFrame by game and starting_min
df_sorted = df_filtered.sort_values(['game', 'starting_min']).copy()

# Compute the duration for each lineup segment per game.
# For each game, duration = next starting_min - current starting_min.
# For the last segment in each game, duration = 48 - current starting_min.
df_sorted['duration'] = df_sorted.groupby('game')['starting_min'].transform(lambda x: x.shift(-1) - x)
df_sorted['duration'] = df_sorted['duration'].fillna(48 - df_sorted['starting_min'])

# Group by the lineup combination and sum the duration (total minutes played)
lineup_usage = df_sorted.groupby(
    ['game', 'home_team', 'away_team', 
     'home_0', 'home_1', 'home_2', 'home_3', 'home_4',
     'away_0', 'away_1', 'away_2', 'away_3', 'away_4', 
     'home_lineup', 'away_lineup']
)['duration'].sum().reset_index(name='total_minutes')

# Sort the DataFrame by game and total_minutes in descending order
most_used_lineups = lineup_usage.sort_values(['game', 'total_minutes'], ascending=[True, False])
display(most_used_lineups.head())

Unnamed: 0,game,home_team,away_team,home_0,home_1,home_2,home_3,home_4,away_0,away_1,away_2,away_3,away_4,home_lineup,away_lineup,total_minutes
8,200610310LAL,LAL,PHO,Brian Cook,Jordan Farmar,Lamar Odom,Sasha Vujacic,Vladimir Radmanovic,Boris Diaw,James Jones,Kurt Thomas,Leandro Barbosa,Marcus Banks,"(Brian Cook, Jordan Farmar, Lamar Odom, Sasha Vujacic, Vladimir Radmanovic)","(Boris Diaw, James Jones, Kurt Thomas, Leandro Barbosa, Marcus Banks)",7.0
5,200610310LAL,LAL,PHO,Andrew Bynum,Lamar Odom,Luke Walton,Sasha Vujacic,Smush Parker,Boris Diaw,Kurt Thomas,Raja Bell,Shawn Marion,Steve Nash,"(Andrew Bynum, Lamar Odom, Luke Walton, Sasha Vujacic, Smush Parker)","(Boris Diaw, Kurt Thomas, Raja Bell, Shawn Marion, Steve Nash)",6.0
2,200610310LAL,LAL,PHO,Andrew Bynum,Lamar Odom,Luke Walton,Maurice Evans,Smush Parker,Boris Diaw,Leandro Barbosa,Raja Bell,Shawn Marion,Steve Nash,"(Andrew Bynum, Lamar Odom, Luke Walton, Maurice Evans, Smush Parker)","(Boris Diaw, Leandro Barbosa, Raja Bell, Shawn Marion, Steve Nash)",5.0
3,200610310LAL,LAL,PHO,Andrew Bynum,Lamar Odom,Luke Walton,Maurice Evans,Smush Parker,Kurt Thomas,Leandro Barbosa,Raja Bell,Shawn Marion,Steve Nash,"(Andrew Bynum, Lamar Odom, Luke Walton, Maurice Evans, Smush Parker)","(Kurt Thomas, Leandro Barbosa, Raja Bell, Shawn Marion, Steve Nash)",4.0
1,200610310LAL,LAL,PHO,Andrew Bynum,Lamar Odom,Luke Walton,Maurice Evans,Smush Parker,Boris Diaw,Kurt Thomas,Raja Bell,Shawn Marion,Steve Nash,"(Andrew Bynum, Lamar Odom, Luke Walton, Maurice Evans, Smush Parker)","(Boris Diaw, Kurt Thomas, Raja Bell, Shawn Marion, Steve Nash)",3.0


## 6. Additional Views and Analyses

In this section, we implement several new views to uncover hidden patterns:

1. **Frequency Analysis by Player Position:** How often does each player appear in a given home position (home_0, home_1, etc.)?
2. **Outcome Analysis by Home Lineup:** What are the win/loss outcomes (or average outcome) for different home lineups?
3. **Common Home Players Against Specific Away Lineups:** Which home players are most frequently used against a given away lineup?
4. **Lineup Variation Over Time:** How does the number of unique lineups change over time?

### 6.1 Frequency Analysis by Player Position

In [9]:
# Analyze frequency of players in each home position
for pos in ['home_0', 'home_1', 'home_2', 'home_3', 'home_4']:
    print(f"Frequency for {pos}:")
    print(df_filtered[pos].value_counts().head(10))
    print("---")

Frequency for home_0:
home_0
Andre Iguodala       5150
Al Jefferson         4492
Amar'e Stoudemire    3804
Boris Diaw           3502
Al Horford           3340
Al Harrington        3129
Beno Udrih           2951
Andrew Bogut         2652
Anderson Varejao     2650
Andray Blatche       2532
Name: count, dtype: int64
---
Frequency for home_1:
home_1
Dwyane Wade        2468
Carmelo Anthony    2446
David West         2433
Dirk Nowitzki      2258
Dwight Howard      2216
Andre Miller       2139
Deron Williams     2029
David Lee          1985
Kevin Durant       1983
Jamal Crawford     1982
Name: count, dtype: int64
---
Frequency for home_2:
home_2
LaMarcus Aldridge    2340
Josh Smith           2114
LeBron James         1864
J.R. Smith           1846
Kobe Bryant          1821
Jarrett Jack         1815
Deron Williams       1763
Paul Pierce          1761
Lamar Odom           1721
Mike Conley          1706
Name: count, dtype: int64
---
Frequency for home_3:
home_3
Tim Duncan       3281
Monta Ellis 

### 6.2 Outcome Analysis by Home Lineup

We group by the home lineup and calculate the average outcome and the number of games for each unique lineup. (Note: The `outcome` column typically indicates win/loss; you may need to adjust if using different metrics.)

In [10]:
# Ensure the dataset is sorted by game and starting_min
df_sorted = df_filtered.sort_values(['game', 'starting_min']).copy()

# Compute the duration of each lineup segment for each game.
# For each game, duration = next starting_min - current starting_min.
# For the last segment in each game, duration = 48 - current starting_min.
df_sorted['duration'] = df_sorted.groupby('game')['starting_min'].transform(lambda x: x.shift(-1) - x)
df_sorted['duration'] = df_sorted['duration'].fillna(48 - df_sorted['starting_min'])

# Group by season, home_team, and lineup columns to compute:
# - games: number of segments (or games) in which the lineup was used
# - avg_outcome: the average outcome for those segments
# - total_time: the sum of durations (i.e., total minutes played) for that lineup
lineup_outcomes = df_sorted.groupby(
    ['season', 'home_team', 'home_0', 'home_1', 'home_2', 'home_3', 'home_4', 'home_lineup']
).agg(
    games=('game', 'count'),
    avg_outcome=('outcome', 'mean'),
    total_time=('duration', 'sum')
).reset_index()

# Sort the results by season, then home_team, then by number of games (descending)
all_lineup_outcomes = lineup_outcomes.sort_values(['season', 'home_team', 'games'])

all_lineup_outcomes.head()


Unnamed: 0,season,home_team,home_0,home_1,home_2,home_3,home_4,home_lineup,games,avg_outcome,total_time
0,2007,ATL,Anthony Johnson,Esteban Batista,Josh Smith,Royal Ivey,Solomon Jones,"(Anthony Johnson, Esteban Batista, Josh Smith, Royal Ivey, Solomon Jones)",1,-1.0,1.0
8,2007,ATL,Anthony Johnson,Joe Johnson,Josh Childress,Marvin Williams,Tyronn Lue,"(Anthony Johnson, Joe Johnson, Josh Childress, Marvin Williams, Tyronn Lue)",1,-1.0,4.0
10,2007,ATL,Anthony Johnson,Joe Johnson,Marvin Williams,Solomon Jones,Speedy Claxton,"(Anthony Johnson, Joe Johnson, Marvin Williams, Solomon Jones, Speedy Claxton)",1,-1.0,3.0
15,2007,ATL,Anthony Johnson,Josh Childress,Josh Smith,Marvin Williams,Tyronn Lue,"(Anthony Johnson, Josh Childress, Josh Smith, Marvin Williams, Tyronn Lue)",1,1.0,1.0
19,2007,ATL,Anthony Johnson,Josh Childress,Josh Smith,Shelden Williams,Solomon Jones,"(Anthony Johnson, Josh Childress, Josh Smith, Shelden Williams, Solomon Jones)",1,-1.0,2.0


In [11]:
all_lineup_outcomes.head()

Unnamed: 0,season,home_team,home_0,home_1,home_2,home_3,home_4,home_lineup,games,avg_outcome,total_time
0,2007,ATL,Anthony Johnson,Esteban Batista,Josh Smith,Royal Ivey,Solomon Jones,"(Anthony Johnson, Esteban Batista, Josh Smith, Royal Ivey, Solomon Jones)",1,-1.0,1.0
8,2007,ATL,Anthony Johnson,Joe Johnson,Josh Childress,Marvin Williams,Tyronn Lue,"(Anthony Johnson, Joe Johnson, Josh Childress, Marvin Williams, Tyronn Lue)",1,-1.0,4.0
10,2007,ATL,Anthony Johnson,Joe Johnson,Marvin Williams,Solomon Jones,Speedy Claxton,"(Anthony Johnson, Joe Johnson, Marvin Williams, Solomon Jones, Speedy Claxton)",1,-1.0,3.0
15,2007,ATL,Anthony Johnson,Josh Childress,Josh Smith,Marvin Williams,Tyronn Lue,"(Anthony Johnson, Josh Childress, Josh Smith, Marvin Williams, Tyronn Lue)",1,1.0,1.0
19,2007,ATL,Anthony Johnson,Josh Childress,Josh Smith,Shelden Williams,Solomon Jones,"(Anthony Johnson, Josh Childress, Josh Smith, Shelden Williams, Solomon Jones)",1,-1.0,2.0


In [12]:
lineup_outcomes.head()

Unnamed: 0,season,home_team,home_0,home_1,home_2,home_3,home_4,home_lineup,games,avg_outcome,total_time
0,2007,ATL,Anthony Johnson,Esteban Batista,Josh Smith,Royal Ivey,Solomon Jones,"(Anthony Johnson, Esteban Batista, Josh Smith, Royal Ivey, Solomon Jones)",1,-1.0,1.0
1,2007,ATL,Anthony Johnson,Esteban Batista,Marvin Williams,Royal Ivey,Shelden Williams,"(Anthony Johnson, Esteban Batista, Marvin Williams, Royal Ivey, Shelden Williams)",2,0.0,3.0
2,2007,ATL,Anthony Johnson,Joe Johnson,Josh Childress,Josh Smith,Marvin Williams,"(Anthony Johnson, Joe Johnson, Josh Childress, Josh Smith, Marvin Williams)",3,-1.0,11.0
3,2007,ATL,Anthony Johnson,Joe Johnson,Josh Childress,Josh Smith,Solomon Jones,"(Anthony Johnson, Joe Johnson, Josh Childress, Josh Smith, Solomon Jones)",2,0.0,3.0
4,2007,ATL,Anthony Johnson,Joe Johnson,Josh Childress,Josh Smith,Zaza Pachulia,"(Anthony Johnson, Joe Johnson, Josh Childress, Josh Smith, Zaza Pachulia)",6,-0.333333,9.0


### 6.3 Most Common Home Players Against Specific Away Lineups

For each unique away lineup, we analyze which home players appear most frequently. Here we melt the home player columns into one long format and then group by away lineup and player.

In [14]:
# Melt the home player columns to analyze individual player frequency per team per season
home_players = df_filtered.melt(
    id_vars=['season', 'game', 'home_team', 'away_team', 'away_lineup'],
    value_vars=['home_0', 'home_1', 'home_2', 'home_3', 'home_4'],
    var_name='home_position',
    value_name='home_player'
)

# Count the frequency of each home player per away lineup, per team, and per season
players_vs_away_count = home_players.groupby(['season', 'home_team', 'away_team', 'away_lineup', 'home_player']).size().reset_index(name='count')

# Sort by frequency of occurrence
players_vs_away_count = players_vs_away_count.sort_values(['season', 'count'], ascending=[True, False])


players_vs_away_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 745798 entries, 19254 to 745797
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   season       745798 non-null  int64 
 1   home_team    745798 non-null  object
 2   away_team    745798 non-null  object
 3   away_lineup  745798 non-null  object
 4   home_player  745798 non-null  object
 5   count        745798 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 39.8+ MB


In [18]:
df_filtered_concat

Unnamed: 0,game,season,home_team,away_team,starting_min,home_lineup,away_lineup,outcome
0,200610310LAL,2007,LAL,PHO,0,"(Andrew Bynum, Lamar Odom, Luke Walton, Sasha Vujacic, Smush Parker)","(Boris Diaw, Kurt Thomas, Raja Bell, Shawn Marion, Steve Nash)",-1
1,200610310LAL,2007,LAL,PHO,6,"(Andrew Bynum, Lamar Odom, Luke Walton, Sasha Vujacic, Smush Parker)","(Amar'e Stoudemire, Leandro Barbosa, Raja Bell, Shawn Marion, Steve Nash)",-1
2,200610310LAL,2007,LAL,PHO,8,"(Lamar Odom, Luke Walton, Maurice Evans, Ronny Turiaf, Smush Parker)","(Amar'e Stoudemire, Leandro Barbosa, Raja Bell, Shawn Marion, Steve Nash)",1
3,200610310LAL,2007,LAL,PHO,10,"(Lamar Odom, Luke Walton, Maurice Evans, Ronny Turiaf, Smush Parker)","(Boris Diaw, James Jones, Kurt Thomas, Leandro Barbosa, Marcus Banks)",1
4,200610310LAL,2007,LAL,PHO,11,"(Luke Walton, Maurice Evans, Ronny Turiaf, Smush Parker, Vladimir Radmanovic)","(Boris Diaw, James Jones, Kurt Thomas, Leandro Barbosa, Marcus Banks)",-1
...,...,...,...,...,...,...,...,...
236907,201504050NYK,2015,NYK,PHI,35,"(Jason Smith, Quincy Acy, Ricky Ledo, Shane Larkin, Tim Hardaway)","(Henry Sims, Hollis Thompson, JaKarr Sampson, Jason Richardson, Jerami Grant)",-1
236908,201504050NYK,2015,NYK,PHI,39,"(Jason Smith, Langston Galloway, Quincy Acy, Ricky Ledo, Tim Hardaway)","(Henry Sims, Hollis Thompson, JaKarr Sampson, Jason Richardson, Jerami Grant)",-1
236909,201504050NYK,2015,NYK,PHI,40,"(Jason Smith, Langston Galloway, Quincy Acy, Ricky Ledo, Tim Hardaway)","(Furkan Aldemir, Hollis Thompson, JaKarr Sampson, Jason Richardson, Nerlens Noel)",-1
236910,201504050NYK,2015,NYK,PHI,42,"(Andrea Bargnani, Jason Smith, Lance Thomas, Langston Galloway, Shane Larkin)","(Furkan Aldemir, Ish Smith, Jerami Grant, Nerlens Noel, Robert Covington)",-1


In [19]:
players_vs_away_count.head()

Unnamed: 0,season,home_team,away_team,away_lineup,home_player,count
19254,2007,DAL,MIN,"(Kevin Garnett, Mark Blount, Mike James, Ricky Davis, Trenton Hassell)",Dirk Nowitzki,15
19259,2007,DAL,MIN,"(Kevin Garnett, Mark Blount, Mike James, Ricky Davis, Trenton Hassell)",Josh Howard,13
21529,2007,DEN,HOU,"(Dikembe Mutombo, Juwan Howard, Rafer Alston, Shane Battier, Tracy McGrady)",Allen Iverson,13
42488,2007,MEM,MIN,"(Kevin Garnett, Mark Blount, Mike James, Ricky Davis, Trenton Hassell)",Mike Miller,13
10735,2007,CHA,NYK,"(Eddy Curry, Malik Rose, Mardy Collins, Nate Robinson, Steve Francis)",Walter Herrmann,12


## 7. Conclusions and Next Steps

We have now organized the initial exploratory code and extended our analysis with additional views:

- **Frequency by Position:** Reveals which players appear most frequently in each home position.
- **Outcome Analysis:** Shows the performance (win/loss average) of different home lineups.
- **Common Matchups:** Identifies which home players are used most often against particular away lineups.
- **Temporal Trends:** Examines lineup variation over time.

These insights can help guide further feature engineering—such as adding a "lineup stability" score, network centrality measures for player synergy, or clustering of lineups—to improve the accuracy, precision, and recall of our missing-player prediction model.

Feel free to extend the analysis by exploring additional correlations (e.g., linking player positions if that information is available externally) or by applying association rule mining to detect frequent player combinations.

Let's continue discussing how we can further manipulate these data tables to gain even deeper insights.

In [20]:
lineup_outcomes.sort_values(['season', 'games'], ascending=[True, False]).head()

Unnamed: 0,season,home_team,home_0,home_1,home_2,home_3,home_4,home_lineup,games,avg_outcome,total_time
4221,2007,MIN,Kevin Garnett,Mark Blount,Mike James,Ricky Davis,Trenton Hassell,"(Kevin Garnett, Mark Blount, Mike James, Ricky Davis, Trenton Hassell)",126,-0.301587,490.0
6684,2007,WAS,Antawn Jamison,Brendan Haywood,Caron Butler,DeShawn Stevenson,Gilbert Arenas,"(Antawn Jamison, Brendan Haywood, Caron Butler, DeShawn Stevenson, Gilbert Arenas)",108,-0.111111,352.0
6397,2007,UTA,Andrei Kirilenko,Carlos Boozer,Derek Fisher,Deron Williams,Mehmet Okur,"(Andrei Kirilenko, Carlos Boozer, Derek Fisher, Deron Williams, Mehmet Okur)",101,-0.049505,338.0
1774,2007,DAL,Devin Harris,Dirk Nowitzki,Erick Dampier,Jason Terry,Josh Howard,"(Devin Harris, Dirk Nowitzki, Erick Dampier, Jason Terry, Josh Howard)",96,0.145833,316.0
5401,2007,PHO,Amar'e Stoudemire,Boris Diaw,Raja Bell,Shawn Marion,Steve Nash,"(Amar'e Stoudemire, Boris Diaw, Raja Bell, Shawn Marion, Steve Nash)",94,-0.021277,355.0


In [21]:
# Extract the starting lineup for each game (row with the smallest starting_min per game)
starting_lineups = df_filtered.sort_values('starting_min').groupby(['season', 'game']).first().reset_index()

# Group by season, home_team, and home_lineup to compute number of games and the average outcome
starting_lineup_stats = starting_lineups.groupby(['season', 'home_team', 'home_lineup']).agg(
    games_count=('game', 'count'),
    avg_outcome=('outcome', 'mean')
).reset_index()

# For each season and home_team, select the lineup with the maximum games_count (most frequently used lineup)
idx = starting_lineup_stats.groupby(['season', 'home_team'])['games_count'].idxmax()
most_common_starting_lineups = starting_lineup_stats.loc[idx].reset_index(drop=True)

# Sort results to see the most used lineups first
most_common_starting_lineups = most_common_starting_lineups.sort_values(['season', 'games_count'], ascending=[True, False])




In [22]:
most_common_starting_lineups.sort_values('season', ascending=True).head()

Unnamed: 0,season,home_team,home_lineup,games_count,avg_outcome
17,2007,MIN,"(Kevin Garnett, Mark Blount, Mike James, Ricky Davis, Trenton Hassell)",26,-0.076923
6,2007,DAL,"(Devin Harris, Dirk Nowitzki, Erick Dampier, Jason Terry, Josh Howard)",23,0.304348
21,2007,ORL,"(Dwight Howard, Grant Hill, Hedo Turkoglu, Jameer Nelson, Tony Battie)",23,0.217391
23,2007,PHO,"(Amar'e Stoudemire, Boris Diaw, Raja Bell, Shawn Marion, Steve Nash)",22,0.363636
4,2007,CHI,"(Ben Gordon, Ben Wallace, Kirk Hinrich, Luol Deng, P.J. Brown)",20,0.4


In [23]:
# Group by season, home_team, and home_lineup to calculate the number of games and average outcome
team_lineup_success = df_filtered.groupby(['season', 'home_team', 'home_lineup']).agg(
    games=('game', 'count'),
    avg_outcome=('outcome', 'mean')
).reset_index()

# Optionally, filter out lineups with very few appearances to avoid outliers
min_games = 5  # Adjust threshold as needed
team_lineup_success = team_lineup_success[team_lineup_success['games'] >= min_games]

# For each season and home_team, select the lineup with the highest average outcome
idx = team_lineup_success.groupby(['season', 'home_team'])['avg_outcome'].idxmax()
most_successful_lineups = team_lineup_success.loc[idx].reset_index(drop=True)

# Sort results for easier reading (by season and success rate)
most_successful_lineups = most_successful_lineups.sort_values(['season', 'avg_outcome'], ascending=[True, False])

# Display the final DataFrame
most_successful_lineups.head()


Unnamed: 0,season,home_team,home_lineup,games,avg_outcome
13,2007,LAL,"(Kobe Bryant, Lamar Odom, Ronny Turiaf, Sasha Vujacic, Shammond Williams)",5,1.0
25,2007,SAC,"(Brad Miller, Corliss Williamson, John Salmons, Kevin Martin, Mike Bibby)",5,1.0
4,2007,CHI,"(Adrian Griffin, Ben Wallace, Chris Duhon, Kirk Hinrich, Luol Deng)",7,0.714286
27,2007,TOR,"(Chris Bosh, Joey Graham, Jorge Garbajosa, Jose Calderon, Morris Peterson)",7,0.714286
1,2007,BOS,"(Allan Ray, Gerald Green, Kevinn Pinkney, Leon Powe, Sebastian Telfair)",6,0.666667


In [24]:
pd.set_option('display.max_colwidth', None)
most_successful_lineups.head()

Unnamed: 0,season,home_team,home_lineup,games,avg_outcome
13,2007,LAL,"(Kobe Bryant, Lamar Odom, Ronny Turiaf, Sasha Vujacic, Shammond Williams)",5,1.0
25,2007,SAC,"(Brad Miller, Corliss Williamson, John Salmons, Kevin Martin, Mike Bibby)",5,1.0
4,2007,CHI,"(Adrian Griffin, Ben Wallace, Chris Duhon, Kirk Hinrich, Luol Deng)",7,0.714286
27,2007,TOR,"(Chris Bosh, Joey Graham, Jorge Garbajosa, Jose Calderon, Morris Peterson)",7,0.714286
1,2007,BOS,"(Allan Ray, Gerald Green, Kevinn Pinkney, Leon Powe, Sebastian Telfair)",6,0.666667


In [25]:
# Extract home players, including season information, and rename the team column
home_players = df_filtered.melt(
    id_vars=['game', 'season', 'home_team'],
    value_vars=['home_0', 'home_1', 'home_2', 'home_3', 'home_4'],
    var_name='position',
    value_name='player'
).rename(columns={'home_team': 'team'})

# Extract away players, including season information, and rename the team column
away_players = df_filtered.melt(
    id_vars=['game', 'season', 'away_team'],
    value_vars=['away_0', 'away_1', 'away_2', 'away_3', 'away_4'],
    var_name='position',
    value_name='player'
).rename(columns={'away_team': 'team'})

# Combine home and away players into one DataFrame
all_players = pd.concat(
    [home_players[['game', 'season', 'team', 'player']], 
     away_players[['game', 'season', 'team', 'player']]],
    ignore_index=True
)

# Group by team and season to get unique players for each team in each season
team_players = all_players.groupby(['team', 'season'])['player'].unique().reset_index()

# Optionally, convert the array of players to a sorted, comma-separated string for easier reading
team_players['players'] = team_players['player'].apply(lambda x: ', '.join(sorted(x)))
team_players = team_players[['team', 'season', 'players']]

team_players.head()


Unnamed: 0,team,season,players
0,ATL,2007,"Anthony Johnson, Cedric Bozeman, Dijon Thompson, Esteban Batista, Jeremy Richardson, Joe Johnson, Josh Childress, Josh Smith, Lorenzen Wright, Marvin Williams, Matt Freije, Royal Ivey, Salim Stoudamire, Shelden Williams, Solomon Jones, Speedy Claxton, Stanislav Medvedenko, Tyronn Lue, Zaza Pachulia"
1,ATL,2008,"Acie Law, Al Horford, Anthony Johnson, Jeremy Richardson, Joe Johnson, Josh Childress, Josh Smith, Lorenzen Wright, Mario West, Marvin Williams, Mike Bibby, Salim Stoudamire, Shelden Williams, Solomon Jones, Tyronn Lue, Zaza Pachulia"
2,ATL,2009,"Acie Law, Al Horford, Joe Johnson, Josh Smith, Mario West, Marvin Williams, Maurice Evans, Mike Bibby, Othello Hunter, Randolph Morris, Ronald Murray, Solomon Jones, Speedy Claxton, Thomas Gardner, Zaza Pachulia"
3,ATL,2010,"Al Horford, Jamal Crawford, Jason Collins, Jeff Teague, Joe Johnson, Joe Smith, Josh Smith, Mario West, Marvin Williams, Maurice Evans, Mike Bibby, Othello Hunter, Randolph Morris, Zaza Pachulia"
4,ATL,2011,"Al Horford, Damien Wilkins, Etan Thomas, Hilton Armstrong, Jamal Crawford, Jason Collins, Jeff Teague, Joe Johnson, Jordan Crawford, Josh Powell, Josh Smith, Kirk Hinrich, Marvin Williams, Maurice Evans, Mike Bibby, Pape Sy, Zaza Pachulia"


In [26]:
# Melt the home player columns to analyze player participation
home_players = df_filtered.melt(
    id_vars=['game', 'season', 'home_team'],
    value_vars=['home_0', 'home_1', 'home_2', 'home_3', 'home_4'],
    var_name='home_position',
    value_name='player'
).rename(columns={'home_team': 'team'})

# Melt the away player columns
away_players = df_filtered.melt(
    id_vars=['game', 'season', 'away_team'],
    value_vars=['away_0', 'away_1', 'away_2', 'away_3', 'away_4'],
    var_name='away_position',
    value_name='player'
).rename(columns={'away_team': 'team'})

# Combine both home and away players into one DataFrame
all_players = pd.concat(
    [home_players[['game', 'season', 'team', 'player']], 
     away_players[['game', 'season', 'team', 'player']]],
    ignore_index=True
)

# Count how many games each player appeared in for each team in each season
player_game_counts = all_players.groupby(['season', 'team', 'player'])['game'].nunique().reset_index()

# Rename the column for clarity
player_game_counts.rename(columns={'game': 'games_played'}, inplace=True)

# Sort the results for better readability
player_game_counts = player_game_counts.sort_values(['season', 'team', 'games_played'], ascending=[True, True, False])


player_game_counts.head()

Unnamed: 0,season,team,player,games_played
13,2007,ATL,Shelden Williams,81
7,2007,ATL,Josh Smith,72
18,2007,ATL,Zaza Pachulia,72
8,2007,ATL,Lorenzen Wright,67
9,2007,ATL,Marvin Williams,64


In [27]:
# First, aggregate unique players by team and season without converting to string
team_players_df = all_players.groupby(['team', 'season'])['player'].unique().reset_index()

# Convert the 'player' column (which is an array) to a list for each row
team_players_df['player'] = team_players_df['player'].apply(list)

# Now, build a nested dictionary: {team: {season: [players]}}
rosters_dict = {}
for _, row in team_players_df.iterrows():
    team = row['team']
    season = row['season']
    players_list = row['player']
    if team not in rosters_dict:
        rosters_dict[team] = {}
    rosters_dict[team][season] = sorted(players_list)  # sorted for consistency




In [28]:
# --- Compute Minutes Played by Each Player for Both Home and Away ---

# Sort the filtered DataFrame by game and starting_min, then compute the duration for each lineup segment.
df_sorted = df_filtered.sort_values(['game', 'starting_min']).copy()
# Duration is the difference between the current starting_min and the next one in the same game;
# for the last segment in each game, use (48 - current starting_min)
df_sorted['duration'] = df_sorted.groupby('game')['starting_min'].transform(lambda x: x.shift(-1) - x)
df_sorted['duration'] = df_sorted['duration'].fillna(48 - df_sorted['starting_min'])

# --- Compute minutes for home players ---
# Melt the home player columns to long format.
home_players = df_sorted.melt(
    id_vars=['game', 'season', 'home_team', 'duration'],
    value_vars=['home_0', 'home_1', 'home_2', 'home_3', 'home_4'],
    var_name='position',
    value_name='player'
)
# Rename the team column for consistency.
home_players = home_players.rename(columns={'home_team': 'team'})

# --- Compute minutes for away players ---
# Melt the away player columns similarly.
away_players = df_sorted.melt(
    id_vars=['game', 'season', 'away_team', 'duration'],
    value_vars=['away_0', 'away_1', 'away_2', 'away_3', 'away_4'],
    var_name='position',
    value_name='player'
)
away_players = away_players.rename(columns={'away_team': 'team'})

# Combine the two DataFrames
all_players_minutes = pd.concat([home_players[['game', 'season', 'team', 'player', 'duration']],
                                 away_players[['game', 'season', 'team', 'player', 'duration']]],
                                ignore_index=True)

# Group by team, season, and player to get total minutes played.
player_minutes = (
    all_players_minutes.groupby(['team', 'season', 'player'])['duration']
    .sum()
    .reset_index()
    .rename(columns={'duration': 'total_minutes'})
)

# Optional: sort for readability
player_minutes = player_minutes.sort_values(['team', 'season', 'total_minutes'], ascending=[True, True, False])
display(player_minutes)

# --- Build a Roster Dictionary Including Minutes Played ---
# This dictionary will have the structure:
# { team: { season: { player: total_minutes, ... }, ... }, ... }

rosters_with_minutes = {}
for _, row in player_minutes.iterrows():
    team = row['team']
    season = row['season']
    player = row['player']
    minutes = row['total_minutes']
    if team not in rosters_with_minutes:
        rosters_with_minutes[team] = {}
    if season not in rosters_with_minutes[team]:
        rosters_with_minutes[team][season] = {}
    rosters_with_minutes[team][season][player] = minutes





Unnamed: 0,team,season,player,total_minutes
7,ATL,2007,Josh Smith,2637.0
5,ATL,2007,Joe Johnson,2355.0
9,ATL,2007,Marvin Williams,2181.0
18,ATL,2007,Zaza Pachulia,2019.0
6,ATL,2007,Josh Childress,1982.0
...,...,...,...,...
4721,WAS,2015,Martell Webster,330.0
4713,WAS,2015,DeJuan Blair,153.0
4728,WAS,2015,Will Bynum,57.0
4716,WAS,2015,Glen Rice,42.0


In [29]:
rosters_with_minutes['TOR'][2015] 

{'Kyle Lowry': 2412.0,
 'Jonas Valanciunas': 2154.0,
 'DeMar DeRozan': 2117.0,
 'Terrence Ross': 2087.0,
 'Patrick Patterson': 2073.0,
 'Amir Johnson': 1983.0,
 'Greivis Vasquez': 1958.0,
 'Lou Williams': 1949.0,
 'James Johnson': 1356.0,
 'Tyler Hansbrough': 1047.0,
 'Chuck Hayes': 254.0,
 'Landry Fields': 206.0,
 'Greg Stiemsma': 50.0,
 'Lucas Nogueira': 18.0,
 'Bruno Caboclo': 16.0}

In [30]:
# --- Calculate minutes played by each player on each team for each season ---

# Sort the DataFrame by game and starting_min, then compute the duration for each lineup segment.
df_sorted = df_filtered.sort_values(['game', 'starting_min']).copy()
df_sorted['duration'] = df_sorted.groupby('game')['starting_min'].transform(lambda x: x.shift(-1) - x)
df_sorted['duration'] = df_sorted['duration'].fillna(48 - df_sorted['starting_min'])

# Melt the home player columns so each row corresponds to a player's appearance in a lineup segment.
# This includes the game, season, home_team, and duration (minutes played in that segment).
home_players = df_sorted.melt(
    id_vars=['game', 'season', 'home_team', 'duration'],
    value_vars=['home_0', 'home_1', 'home_2', 'home_3', 'home_4'],
    var_name='position',
    value_name='player'
)

# Group by team, season, and player to calculate the total minutes played.
player_minutes = home_players.groupby(['home_team', 'season', 'player'])['duration'].sum().reset_index()
player_minutes.rename(columns={'duration': 'total_minutes'}, inplace=True)

# Sort the results for easier reading.
player_minutes = player_minutes.sort_values(['home_team', 'season', 'total_minutes'], ascending=[True, True, False])

# Display the complete DataFrame
display(player_minutes.head())


Unnamed: 0,home_team,season,player,total_minutes
7,ATL,2007,Josh Smith,1322.0
5,ATL,2007,Joe Johnson,1120.0
9,ATL,2007,Marvin Williams,1107.0
6,ATL,2007,Josh Childress,1077.0
18,ATL,2007,Zaza Pachulia,1034.0


In [31]:
most_common_starting_lineups.head()

Unnamed: 0,season,home_team,home_lineup,games_count,avg_outcome
17,2007,MIN,"(Kevin Garnett, Mark Blount, Mike James, Ricky Davis, Trenton Hassell)",26,-0.076923
6,2007,DAL,"(Devin Harris, Dirk Nowitzki, Erick Dampier, Jason Terry, Josh Howard)",23,0.304348
21,2007,ORL,"(Dwight Howard, Grant Hill, Hedo Turkoglu, Jameer Nelson, Tony Battie)",23,0.217391
23,2007,PHO,"(Amar'e Stoudemire, Boris Diaw, Raja Bell, Shawn Marion, Steve Nash)",22,0.363636
4,2007,CHI,"(Ben Gordon, Ben Wallace, Kirk Hinrich, Luol Deng, P.J. Brown)",20,0.4


In [None]:
most_successful_lineups.head()

Unnamed: 0,season,home_team,home_lineup,games,avg_outcome
13,2007,LAL,"(Kobe Bryant, Lamar Odom, Ronny Turiaf, Sasha Vujacic, Shammond Williams)",5,1.000000
25,2007,SAC,"(Brad Miller, Corliss Williamson, John Salmons, Kevin Martin, Mike Bibby)",5,1.000000
4,2007,CHI,"(Adrian Griffin, Ben Wallace, Chris Duhon, Kirk Hinrich, Luol Deng)",7,0.714286
27,2007,TOR,"(Chris Bosh, Joey Graham, Jorge Garbajosa, Jose Calderon, Morris Peterson)",7,0.714286
1,2007,BOS,"(Allan Ray, Gerald Green, Kevinn Pinkney, Leon Powe, Sebastian Telfair)",6,0.666667
...,...,...,...,...,...
253,2015,LAL,"(Ed Davis, Jeremy Lin, Jordan Hill, Kobe Bryant, Wesley Johnson)",12,0.500000
252,2015,LAC,"(Chris Paul, DeAndre Jordan, Glen Davis, J.J. Redick, Matt Barnes)",7,0.428571
268,2015,UTA,"(Dante Exum, Elijah Millsap, Rudy Gobert, Trevor Booker, Trey Burke)",10,0.400000
259,2015,NYK,"(Alexey Shved, Cole Aldrich, Jason Smith, Shane Larkin, Travis Wear)",6,0.333333


In [32]:
df_train


NameError: name 'df_train' is not defined

END ANALYSIS

In [3]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# =============================================================================
# STEP 1: Compute Duration for Each Lineup Segment
# =============================================================================
df_sorted = df_filtered.sort_values(['game', 'starting_min']).copy()
df_sorted['duration'] = df_sorted.groupby('game')['starting_min'].transform(lambda x: x.shift(-1) - x)
df_sorted['duration'] = df_sorted['duration'].fillna(48 - df_sorted['starting_min'])

# =============================================================================
# STEP 2: Compute Mean Encoding for Home & Away Players
# =============================================================================
home_players = df_sorted.melt(
    id_vars=['game', 'season', 'home_team', 'duration', 'outcome'],
    value_vars=['home_0', 'home_1', 'home_2', 'home_3', 'home_4'],
    var_name='position',
    value_name='player'
).rename(columns={'home_team': 'team'})

away_players = df_sorted.melt(
    id_vars=['game', 'season', 'away_team', 'duration', 'outcome'],
    value_vars=['away_0', 'away_1', 'away_2', 'away_3', 'away_4'],
    var_name='position',
    value_name='player'
).rename(columns={'away_team': 'team'})

# Compute Weighted Outcome for Home/Away Players
home_players['weighted_outcome'] = home_players['outcome'] * home_players['duration']
away_players['weighted_outcome'] = (1 - away_players['outcome']) * away_players['duration']

# Aggregate Mean Encoding for Players by Season & Team
def compute_player_encoding(df):
    player_outcome = df.groupby(['season', 'team', 'player']).agg(
        total_weighted_outcome=('weighted_outcome', 'sum'),
        total_minutes=('duration', 'sum')
    ).reset_index()
    player_outcome['base_mean_encoding'] = player_outcome['total_weighted_outcome'] / player_outcome['total_minutes']
    return player_outcome

player_encoding_home = compute_player_encoding(home_players)
player_encoding_away = compute_player_encoding(away_players)

# Build Dictionary for Fast Lookup
mean_encoding_dict_home = dict(zip(
    zip(player_encoding_home['season'], player_encoding_home['team'], player_encoding_home['player']),
    player_encoding_home['base_mean_encoding']
))

mean_encoding_dict_away = dict(zip(
    zip(player_encoding_away['season'], player_encoding_away['team'], player_encoding_away['player']),
    player_encoding_away['base_mean_encoding']
))

# Compute Fallback Mean Encoding
overall_fallback = df_sorted['outcome'].mean()

# =============================================================================
# STEP 3: Apply Mean Encoding to Training Data
# =============================================================================
df_encoded = df_sorted.copy()

def encode_player(row, col, mean_dict):
    key = (row['season'], row['home_team'], row[col])
    return mean_dict.get(key, overall_fallback)

for col in ['home_0', 'home_1', 'home_2', 'home_3', 'home_4']:
    df_encoded[col] = df_encoded.apply(lambda row: encode_player(row, col, mean_encoding_dict_home), axis=1)

for col in ['away_0', 'away_1', 'away_2', 'away_3', 'away_4']:
    df_encoded[col] = df_encoded.apply(lambda row: encode_player(row, col, mean_encoding_dict_away), axis=1)

# Compute Away Lineup Mean Encoding
df_encoded['away_lineup_encoded'] = df_encoded.apply(
    lambda row: np.mean([mean_encoding_dict_away.get((row['season'], row['away_team'], row[col]), overall_fallback) 
                         for col in ['away_0', 'away_1', 'away_2', 'away_3', 'away_4']]), axis=1)

print("✅ Training Data Ready!")


✅ Training Data Ready!


In [None]:
# =============================================================================
# STEP 5: Create Training Data by Simulating Missingness
# =============================================================================
training_rows = []
for idx, enc_row in df_encoded.iterrows():
    raw_row = df_filtered.loc[idx]  # Using raw data for true player names
    for pos in range(5):
        feature_row = enc_row.copy()
        feature_row['missing_position'] = pos
        feature_row['missing_player'] = raw_row[f'home_{pos}']
        # Instead of dropping the column, we set the value to NaN to simulate missingness.
        feature_row[f'home_{pos}'] = np.nan
        training_rows.append(feature_row)

df_train = pd.DataFrame(training_rows).reset_index(drop=True)

In [None]:
#df_train.to_csv("../data/train_updated_1.csv", index=False)

In [4]:
df_train = pd.read_csv("../data/train_updated_1.csv")

In [6]:
df_train.head(3)

Unnamed: 0,game,season,home_team,away_team,starting_min,home_0,home_1,home_2,home_3,home_4,away_0,away_1,away_2,away_3,away_4,outcome,home_lineup,away_lineup,duration,away_lineup_encoded,missing_position,missing_player,home_lineup_avg,away_lineup_avg
0,200610310LAL,2007,LAL,PHO,0,,-0.134421,-0.134525,-0.024876,-0.136656,-0.255631,-0.255631,-0.255631,-0.255631,-0.255631,-1,"('Andrew Bynum', 'Lamar Odom', 'Luke Walton', 'Sasha Vujacic', 'Smush Parker')","('Boris Diaw', 'Kurt Thomas', 'Raja Bell', 'Shawn Marion', 'Steve Nash')",6.0,-0.255631,0,Andrew Bynum,-0.107619,-0.255631
1,200610310LAL,2007,LAL,PHO,0,-0.135498,,-0.134525,-0.024876,-0.136656,-0.255631,-0.255631,-0.255631,-0.255631,-0.255631,-1,"('Andrew Bynum', 'Lamar Odom', 'Luke Walton', 'Sasha Vujacic', 'Smush Parker')","('Boris Diaw', 'Kurt Thomas', 'Raja Bell', 'Shawn Marion', 'Steve Nash')",6.0,-0.255631,1,Lamar Odom,-0.107889,-0.255631
2,200610310LAL,2007,LAL,PHO,0,-0.135498,-0.134421,,-0.024876,-0.136656,-0.255631,-0.255631,-0.255631,-0.255631,-0.255631,-1,"('Andrew Bynum', 'Lamar Odom', 'Luke Walton', 'Sasha Vujacic', 'Smush Parker')","('Boris Diaw', 'Kurt Thomas', 'Raja Bell', 'Shawn Marion', 'Steve Nash')",6.0,-0.255631,2,Luke Walton,-0.107863,-0.255631


In [8]:
# =============================================================================
# STEP 6: Compute Additional Features (Player Performance + Lineup Chemistry)
# =============================================================================
df_train['home_lineup_avg'] = df_train[['home_0', 'home_1', 'home_2', 'home_3', 'home_4']].mean(axis=1)
df_train['away_lineup_avg'] = df_train[['away_0', 'away_1', 'away_2', 'away_3', 'away_4']].mean(axis=1)
df_train.head(3)

Unnamed: 0,game,season,home_team,away_team,starting_min,home_0,home_1,home_2,home_3,home_4,away_0,away_1,away_2,away_3,away_4,outcome,home_lineup,away_lineup,duration,away_lineup_encoded,missing_position,missing_player,home_lineup_avg,away_lineup_avg
0,200610310LAL,2007,LAL,PHO,0,,-0.134421,-0.134525,-0.024876,-0.136656,-0.255631,-0.255631,-0.255631,-0.255631,-0.255631,-1,"('Andrew Bynum', 'Lamar Odom', 'Luke Walton', 'Sasha Vujacic', 'Smush Parker')","('Boris Diaw', 'Kurt Thomas', 'Raja Bell', 'Shawn Marion', 'Steve Nash')",6.0,-0.255631,0,Andrew Bynum,-0.107619,-0.255631
1,200610310LAL,2007,LAL,PHO,0,-0.135498,,-0.134525,-0.024876,-0.136656,-0.255631,-0.255631,-0.255631,-0.255631,-0.255631,-1,"('Andrew Bynum', 'Lamar Odom', 'Luke Walton', 'Sasha Vujacic', 'Smush Parker')","('Boris Diaw', 'Kurt Thomas', 'Raja Bell', 'Shawn Marion', 'Steve Nash')",6.0,-0.255631,1,Lamar Odom,-0.107889,-0.255631
2,200610310LAL,2007,LAL,PHO,0,-0.135498,-0.134421,,-0.024876,-0.136656,-0.255631,-0.255631,-0.255631,-0.255631,-0.255631,-1,"('Andrew Bynum', 'Lamar Odom', 'Luke Walton', 'Sasha Vujacic', 'Smush Parker')","('Boris Diaw', 'Kurt Thomas', 'Raja Bell', 'Shawn Marion', 'Steve Nash')",6.0,-0.255631,2,Luke Walton,-0.107863,-0.255631


In [5]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df_train[df_train['outcome'] == -1]
df_minority = df_train[df_train['outcome'] == 1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                   replace=False,    # Sample without replacement
                                   n_samples=len(df_minority),  # Match minority count
                                   random_state=42)

# Combine downsampled majority and original minority class
df_train_balanced = pd.concat([df_majority_downsampled, df_minority])

# Shuffle the dataset
df_train_balanced = df_train_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Check new balance
print(df_train_balanced['outcome'].value_counts())


outcome
-1    440875
 1    440875
Name: count, dtype: int64


In [6]:
# =============================================================================
# STEP 6: Reduce Training Set by 50%
# =============================================================================
df_train_balanced = df_train_balanced.sample(frac=0.25, random_state=42).reset_index(drop=True)
df_train_balanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220438 entries, 0 to 220437
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   game                 220438 non-null  object 
 1   season               220438 non-null  int64  
 2   home_team            220438 non-null  object 
 3   away_team            220438 non-null  object 
 4   starting_min         220438 non-null  int64  
 5   home_0               176365 non-null  float64
 6   home_1               176273 non-null  float64
 7   home_2               176321 non-null  float64
 8   home_3               176402 non-null  float64
 9   home_4               176391 non-null  float64
 10  away_0               220438 non-null  float64
 11  away_1               220438 non-null  float64
 12  away_2               220438 non-null  float64
 13  away_3               220438 non-null  float64
 14  away_4               220438 non-null  float64
 15  outcome          

In [7]:

# =============================================================================
# STEP 8: Split Data for Training
# =============================================================================
feature_cols = ['season', 'starting_min', 'home_team', 'away_team', 
                'home_lineup_avg', 'away_lineup_avg', 'away_lineup_encoded',
                'home_0', 'home_1', 'home_2', 'home_3', 'home_4', 'away_0', 'away_1', 'away_2', 'away_3', 'away_4', 'outcome',]

X = df_train_balanced[feature_cols]
y = df_train_balanced['missing_player']

categorical_cols = ['season', 'home_team', 'away_team', 'outcome']
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# =============================================================================
# STEP 9: Encode the Target Variable
# =============================================================================
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print("Target Classes:", le.classes_)

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.1, random_state=42)

Target Classes: ['A.J. Price' 'Aaron Brooks' 'Aaron Gordon' 'Aaron Gray' 'Aaron McKie'
 'Aaron Williams' 'Acie Law' 'Adam Morrison' 'Adonal Foyle'
 'Adonis Thomas' 'Adreian Payne' 'Adrian Griffin' 'Al Harrington'
 'Al Horford' 'Al Jefferson' 'Al Thornton' 'Al-Farouq Aminu'
 'Alan Anderson' 'Alan Henderson' 'Alando Tucker' 'Alec Burks'
 'Alex Acker' 'Alex Len' 'Alexander Johnson' 'Alexey Shved'
 'Alexis Ajinca' 'Allan Ray' 'Allen Crabbe' 'Allen Iverson' 'Alonzo Gee'
 'Alonzo Mourning' 'Alvin Williams' "Amar'e Stoudemire" 'Amir Johnson'
 'Anderson Varejao' 'Andray Blatche' 'Andre Barrett' 'Andre Brown'
 'Andre Drummond' 'Andre Emmett' 'Andre Iguodala' 'Andre Miller'
 'Andre Owens' 'Andre Roberson' 'Andrea Bargnani' 'Andrei Kirilenko'
 'Andres Nocioni' 'Andrew Bogut' 'Andrew Bynum' 'Andrew Goudelock'
 'Andrew Nicholson' 'Andrew Wiggins' 'Andris Biedrins' 'Andy Rautins'
 'Anfernee Hardaway' 'Antawn Jamison' 'Anthony Bennett' 'Anthony Carter'
 'Anthony Davis' 'Anthony Johnson' 'Anthony Morr

In [8]:
# =============================================================================
# STEP 9: Train Random Forest Classifier
# =============================================================================
rf_model = RandomForestClassifier(n_estimators=125, max_depth=15, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)


In [9]:
import joblib

# Save the model to a file named 'random_forest_model.pkl'
joblib.dump(rf_model, 'rf_model_trained_5.pkl')


['rf_model_trained_5.pkl']

In [None]:

# Load the trained model
rf_model = joblib.load("rf_model_trained_3.pkl")



In [10]:

# ----------------------------------------------------------------------------
# STEP 1: Load Test Data and Replace "?" with NaN
# ----------------------------------------------------------------------------
df_test = pd.read_csv("../data/NBA_test copy.csv")
df_test_labels = pd.read_csv("../data/NBA_test_labels.csv")

# Merge labels into test data (actual missing player names, if needed for evaluation)
df_test['missing_player'] = df_test_labels['removed_value']

# Replace team acronyms if needed
df_test["home_team"] = df_test["home_team"].replace(team_acronym_mapping)
df_test["away_team"] = df_test["away_team"].replace(team_acronym_mapping)

df_test['outcome'] = 1  

# Replace "?" in player columns with NaN
df_test.replace("?", np.nan, inplace=True)

# ----------------------------------------------------------------------------
# STEP 2: Map Mean Encoding Values to Player Columns in Test Data
# ----------------------------------------------------------------------------
def encode_player_test(row, col, mean_dict, team_col):
    """
    For a given row and player column, look up the encoding using the key:
    (season, team, player). The team depends on whether it is a home or away player.
    NaN values are preserved.
    """
    player = row[col]
    season = row['season']
    team = row[team_col]
    if pd.isna(player):  # Preserve NaN for missing players
        return np.nan
    key = (season, team, player)
    return mean_dict.get(key, overall_fallback)

# Apply mean encoding to home and away players
for col in ['home_0', 'home_1', 'home_2', 'home_3', 'home_4']:
    df_test[col] = df_test.apply(lambda row: encode_player_test(row, col, mean_encoding_dict_home, 'home_team'), axis=1)

for col in ['away_0', 'away_1', 'away_2', 'away_3', 'away_4']:
    df_test[col] = df_test.apply(lambda row: encode_player_test(row, col, mean_encoding_dict_away, 'away_team'), axis=1)

# ----------------------------------------------------------------------------
# STEP 3: Encode the Missing Player Using LabelEncoder
# ----------------------------------------------------------------------------
def encode_missing_player(player):
    """Encodes the missing player using the trained label encoder from training."""
    if pd.isna(player):
        return np.nan  # Preserve NaN values
    elif player in le.classes_:
        return le.transform([player])[0]  # Use existing label
    else:
        return -1  # Assign a special label for unseen players

df_test['missing_player_encoded'] = df_test['missing_player'].apply(encode_missing_player)

# ----------------------------------------------------------------------------
# STEP 4: Compute Aggregate Features
# ----------------------------------------------------------------------------
df_test['away_lineup_encoded'] = df_test.apply(
    lambda row: np.nanmean([row[col] for col in ['away_0', 'away_1', 'away_2', 'away_3', 'away_4']]), 
    axis=1
)

df_test['home_lineup_avg'] = df_test[['home_0', 'home_1', 'home_2', 'home_3', 'home_4']].mean(axis=1, skipna=True)
df_test['away_lineup_avg'] = df_test[['away_0', 'away_1', 'away_2', 'away_3', 'away_4']].mean(axis=1, skipna=True)

# ----------------------------------------------------------------------------
# STEP 5: Apply One-Hot Encoding to Categorical Features
# ----------------------------------------------------------------------------

categorical_cols = ['season', 'home_team', 'away_team', 'outcome']
X_test_eval = df_test[['season', 'starting_min', 'home_team', 'away_team', 
                       'home_lineup_avg', 'away_lineup_avg', 'away_lineup_encoded',
                       'home_0', 'home_1', 'home_2', 'home_3', 'home_4', 
                       'away_0', 'away_1', 'away_2', 'away_3', 'away_4','outcome']]

# Apply one-hot encoding (aligning with training data)
X_test_eval = pd.get_dummies(X_test_eval, columns=categorical_cols, drop_first=True)

# Ensure columns match training data
missing_cols = set(X.columns) - set(X_test_eval.columns)
extra_cols = set(X_test_eval.columns) - set(X.columns)

# Add missing columns with zeros
for col in missing_cols:
    X_test_eval[col] = 0

# Drop extra columns
X_test_eval = X_test_eval[X.columns]


# Ensure test labels match the label-encoded format
y_test_eval = df_test['missing_player_encoded'].astype(int)

print("✅ Test Data Prepared Successfully!")


✅ Test Data Prepared Successfully!


In [13]:
 #----------------------------------------------------------------------------
# STEP 6: Make Predictions and Evaluate Model
# ----------------------------------------------------------------------------
rf_predictions = rf_model.predict(X_test_eval)

# Compute accuracy
rf_accuracy = accuracy_score(y_test_eval.dropna(), rf_predictions)  # Drop NaNs to avoid errors

# Convert numeric predictions back to player names
y_pred_names = le.inverse_transform(rf_predictions)

# Generate classification report
unique_labels = np.unique(y_test_eval.dropna())  # Drop NaN values to avoid issues
target_names = [le.classes_[i] for i in unique_labels]

print(f"🎯 Random Forest Accuracy: {rf_accuracy:.4f}")
print("\n📊 Classification Report:")
print(classification_report(y_test_eval.dropna(), rf_predictions, labels=unique_labels, target_names=target_names))

🎯 Random Forest Accuracy: 0.4410

📊 Classification Report:
                         precision    recall  f1-score   support

     Zydrunas Ilgauskas       0.00      0.00      0.00         8
             A.J. Price       0.00      0.00      0.00         1
           Aaron Brooks       0.33      1.00      0.50         2
               Acie Law       0.00      0.00      0.00         1
          Al Harrington       0.50      0.50      0.50         2
             Al Horford       0.38      1.00      0.55         3
           Al Jefferson       0.50      1.00      0.67         4
        Al-Farouq Aminu       0.50      1.00      0.67         1
          Alan Anderson       0.50      1.00      0.67         1
             Alec Burks       0.00      0.00      0.00         1
           Alexey Shved       1.00      1.00      1.00         1
          Alexis Ajinca       0.00      0.00      0.00         1
           Allen Crabbe       0.00      0.00      0.00         2
          Allen Iverson       

In [None]:
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import pandas as pd

# ----------------------------------------------------------------------------
# STEP 6: Make Predictions and Evaluate Model
# ----------------------------------------------------------------------------
rf_pred_probs = rf_model.predict_proba(X_test_eval)  # Get probability distributions
rf_predictions = np.argmax(rf_pred_probs, axis=1)  # Get the most probable player (Top-1 Prediction)

# Remove any test labels that are -1 before inverse transformation
y_test_filtered = y_test_eval[y_test_eval != -1]  # Exclude unseen labels

# Convert numeric predictions back to player names (Ensure values exist in le.classes_)
y_pred_names = le.inverse_transform(rf_predictions[:len(y_test_filtered)])  # Match length of y_test_filtered
y_test_names = le.inverse_transform(y_test_filtered)  # Ensure only known labels are used

# **Compute Top-3 Predictions**
top_3_preds = np.argsort(rf_pred_probs, axis=1)[:, -3:]  # Indices of top 3 predictions
top_3_pred_names = np.array([le.inverse_transform(row) for row in top_3_preds[:len(y_test_filtered)]])

# **Check if Actual Player is in Top-3 Predictions**
top_3_correct = np.array([y_test_names[i] in top_3_pred_names[i] for i in range(len(y_test_names))])
top_3_accuracy = np.mean(top_3_correct)  # Compute Top-3 Accuracy

# **Compute Top-5 Predictions**
top_5_preds = np.argsort(rf_pred_probs, axis=1)[:, -5:]  # Indices of top 3 predictions
top_5_pred_names = np.array([le.inverse_transform(row) for row in top_5_preds[:len(y_test_filtered)]])

# **Check if Actual Player is in Top-5 Predictions**
top_5_correct = np.array([y_test_names[i] in top_5_pred_names[i] for i in range(len(y_test_names))])
top_5_accuracy = np.mean(top_5_correct)  # Compute Top-6 Accuracy


# Generate classification report
unique_labels = np.unique(y_test_filtered)  # Drop NaN values to avoid issues
target_names = [le.classes_[i] for i in unique_labels]

# **Store Predictions in DataFrame**
df_results = pd.DataFrame({
    'actual_player': y_test_names,
    'predicted_player': y_pred_names,
    'top_3_predictions': [list(row) for row in top_3_pred_names],
    'top_3_correct': top_3_correct
})

# **Display Metrics**
print(f"🎯 Random Forest Top-1 Accuracy: {accuracy_score(y_test_filtered, rf_predictions[:len(y_test_filtered)]):.4f}")
print(f"🎯 Random Forest Top-3 Accuracy: {top_3_accuracy:.4f}")
print(f"🎯 Random Forest Top-5 Accuracy: {top_5_accuracy:.4f}")
print("\n📊 Classification Report:")
print(classification_report(y_test_filtered, rf_predictions[:len(y_test_filtered)], labels=unique_labels, target_names=target_names))

# **Show Sample Predictions**
print("\n🔍 Sample Predictions:")
print(df_results.head(30))  # Show first 15 rows of results


🎯 Random Forest Top-1 Accuracy: 0.1784
🎯 Random Forest Top-3 Accuracy: 0.2964
🎯 Random Forest Top-3 Accuracy: 0.3367

📊 Classification Report:
                         precision    recall  f1-score   support

             A.J. Price       0.00      0.00      0.00         1
           Aaron Brooks       0.33      1.00      0.50         2
               Acie Law       0.00      0.00      0.00         1
          Al Harrington       0.50      0.50      0.50         2
             Al Horford       0.38      1.00      0.55         3
           Al Jefferson       0.50      1.00      0.67         4
        Al-Farouq Aminu       0.50      1.00      0.67         1
          Alan Anderson       0.50      1.00      0.67         1
             Alec Burks       0.00      0.00      0.00         1
           Alexey Shved       1.00      1.00      1.00         1
          Alexis Ajinca       0.00      0.00      0.00         1
           Allen Crabbe       0.00      0.00      0.00         2
          A

In [9]:
print(f"X_train shape: {X_train.shape}, X_test_eval shape: {X_test_eval.shape}")


NameError: name 'X_train' is not defined

In [20]:
print(f"X_test_eval shape: {X_test_eval.shape}")
print(f"y_test_eval shape: {y_test_eval.shape}")


X_test_eval shape: (1000, 80)
y_test_eval shape: (1000,)
