# NBA Lineup Prediction Analysis

This notebook examines and explains a machine learning approach for predicting the best fifth player in an NBA lineup to maximize winning probability. The model takes four known players from the home team and all five players from the away team, then predicts which remaining player on the home team's roster would most improve the home team's chances of winning.

## Table of Contents
1. [Introduction](#introduction)
2. [Data Loading and Preprocessing](#data-loading-and-preprocessing)
3. [Building Team Rosters](#building-team-rosters)
4. [Feature Engineering](#feature-engineering)
5. [Model Training](#model-training)
6. [Prediction Function](#prediction-function)
7. [Evaluation](#evaluation)
8. [Results Analysis](#results-analysis)
9. [Main Execution Flow](#main-execution)
10. [Future Improvements](#future-improvements)

## Introduction <a name="introduction"></a>

The model is designed to predict the optimal fifth player for an NBA lineup. It shows:
- High accuracy (~90%) on test data from 2007-2015
- Lower accuracy (~59%) on unseen data from 2016

This notebook explores the code structure, the machine learning approach, and possible reasons for the performance difference.

### Initial Setup

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import os, joblib, numpy as np
from collections import defaultdict
import time

# Team name mapping to standardize team abbreviations
TEAM_MAPPINGS = {
    'CHO': 'CHA',  # Charlotte Hornets
    'NOP': 'NOK',  # New Orleans Pelicans/Oklahoma City
    'NOH': 'NOK',  # New Orleans Hornets/Oklahoma City
    'NJN': 'BRK',  # New Jersey/Brooklyn Nets
    'SEA': 'OKC',  # Seattle SuperSonics/Oklahoma City
}

This section imports necessary libraries and defines a mapping dictionary to standardize team abbreviations. This is important because some NBA franchises have changed names or locations over the years, and maintaining consistency in team identification is crucial for accurate modeling.
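As a minimal sketch of how the mapping is applied, the snippet below runs `Series.replace` on a small made-up DataFrame. Only `TEAM_MAPPINGS` comes from the notebook; `toy_df` and its rows are hypothetical.

```python
import pandas as pd

TEAM_MAPPINGS = {
    'CHO': 'CHA',
    'NOP': 'NOK',
    'NOH': 'NOK',
    'NJN': 'BRK',
    'SEA': 'OKC',
}

# Hypothetical rows, for illustration only
toy_df = pd.DataFrame({
    'home_team': ['SEA', 'LAL', 'NJN'],
    'away_team': ['NOH', 'CHO', 'BOS'],
})

# Abbreviations found in the mapping are rewritten; all others pass through unchanged
for col in ('home_team', 'away_team'):
    toy_df[col] = toy_df[col].replace(TEAM_MAPPINGS)

print(toy_df)
```

Abbreviations that are not keys of the dictionary (`LAL`, `BOS` here) are left as they are, so the same call is safe to run on data that has already been standardized.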

## Data Loading and Preprocessing <a name="data-loading-and-preprocessing"></a>

### Loading Data

In [None]:
def load_data(data_dir):
    all_files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.csv')]
    df_list = []

    for file in all_files:  # Append all csv files into one dataframe to be used later
        df = pd.read_csv(file)
        df_list.append(df)
    df = pd.concat(df_list, ignore_index=True)
    
    # Function to sort all players alphabetically in the dataframe
    def sort_players(row):
        home_players = sorted([row[f'home_{i}'] for i in range(5)])
        away_players = sorted([row[f'away_{i}'] for i in range(5)])
        for i in range(5):
            row[f'home_{i}'] = home_players[i]
            row[f'away_{i}'] = away_players[i]
        return row
    
    # Apply the sorting on the dataframe
    df = df.apply(sort_players, axis=1)
    
    return df

**Explanation:**
This function loads all CSV files from a specified directory and combines them into a single DataFrame. The key preprocessing step is sorting player names alphabetically within each team. This standardization is crucial because:

1. **Order Invariance**: The order of players in a lineup doesn't matter for prediction purposes - what matters is which five players are on the court together.
2. **Consistent Feature Representation**: Sorting ensures that the same lineup will always be represented the same way in the data, regardless of the original order.
3. **Model Stability**: This helps the model learn patterns about player combinations rather than specific positions in the data structure.
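The order-invariance idea can be sketched on a toy frame. The version below is an alternative to the row-wise `apply` in `load_data`, not the notebook's own code: it sorts all rows at once with `np.sort`, assuming the player columns hold plain strings with no missing values. The single-letter names are placeholders.

```python
import numpy as np
import pandas as pd

home_cols = [f'home_{i}' for i in range(5)]

# Two hypothetical rows holding the same five players in a different order
toy_df = pd.DataFrame(
    [['E', 'C', 'A', 'D', 'B'],
     ['B', 'A', 'D', 'C', 'E']],
    columns=home_cols,
)

# Sort the five names within each row in one vectorized call
toy_df[home_cols] = np.sort(toy_df[home_cols].to_numpy(dtype=str), axis=1)

print(toy_df)
```

After sorting, both rows are identical, which is exactly the property the model relies on. On a large training set a vectorized sort like this is typically much faster than calling a Python function once per row.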

## Building Team Rosters <a name="building-team-rosters"></a>

In [None]:
def create_rosters(train_df, test_df=None):
    # Create a roster for each team and season from training data
    roster = train_df.groupby(['home_team', 'season'])[[f'home_{i}' for i in range(5)]].apply(
        lambda x: set(pd.unique(x.values.ravel()))
    ).to_dict()
    
    # Supplement with away team data (important for completeness)
    for idx, row in train_df.iterrows():
        key = (row['away_team'], row['season'])
        if key not in roster:
            roster[key] = set()
        for i in range(5):
            roster[key].add(row[f'away_{i}'])
    
    # If test data is provided, build rosters for new seasons only
    if test_df is not None:
        # Identify new seasons in test data
        train_seasons = set(train_df['season'].unique())
        test_seasons = set(test_df['season'].unique())
        new_seasons = test_seasons - train_seasons
        
        if new_seasons:
            print(f"Building rosters for new seasons in test data: {new_seasons}")
            
            # Create temporary dataframe with only new seasons
            new_season_df = test_df[test_df['season'].isin(new_seasons)]
            
            # Build rosters for new seasons from test data
            for idx, row in new_season_df.iterrows():
                # Add home team players
                key = (row['home_team'], row['season'])
                if key not in roster:
                    roster[key] = set()
                for i in range(5):
                    player = row[f'home_{i}']
                    if player != '?':  # Skip unknown players
                        roster[key].add(player)
                
                # Add away team players
                key = (row['away_team'], row['season'])
                if key not in roster:
                    roster[key] = set()
                for i in range(5):
                    roster[key].add(row[f'away_{i}'])
                    
            # Add the true fifth player to the roster (from labels)
            if 'true_fifth_player' in new_season_df.columns:
                for idx, row in new_season_df.iterrows():
                    key = (row['home_team'], row['season'])
                    if key in roster and row['true_fifth_player'] != '?':
                        roster[key].add(row['true_fifth_player'])
    
    return roster

**Explanation:**
This function creates a dictionary that maps each (team, season) pair to a set of players who played for that team in that season. This is essential for the model because:

1. **Realistic Predictions**: The model should only predict players who actually played for a team in a given season.
2. **Handling New Seasons**: The function is designed to handle test data from seasons not seen in the training data (like 2016).
3. **Complete Rosters**: It builds rosters from both home and away team data, and even incorporates the true fifth player from the labels when available.

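The roster creation process is a key component for making realistic predictions, especially for new seasons. No 2016 data exists in the training set, so for that season the rosters are built from the test data itself. Note that this step also adds each test row's true fifth player (taken from the labels file) to the roster, so the 2016 candidate pool is partly derived from the answers. This is a pragmatic workaround, but the 2016 accuracy should be read with it in mind.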
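The (team, season) to player-set mapping can be sketched compactly with `groupby`. This is an illustrative variant, not the notebook's `create_rosters`: the helper name `build_rosters`, the `n_players` parameter, and the toy rows (two players per side to keep it short) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical lineup rows, two players per side for brevity
toy_df = pd.DataFrame({
    'season':    [2015, 2015],
    'home_team': ['LAL', 'BOS'],
    'away_team': ['BOS', 'LAL'],
    'home_0': ['A', 'X'], 'home_1': ['B', 'Y'],
    'away_0': ['X', 'C'], 'away_1': ['Z', 'A'],
})

def build_rosters(df, n_players=2):
    """Map (team, season) -> set of players seen for that team, home or away."""
    rosters = {}
    for side in ('home', 'away'):
        cols = [f'{side}_{i}' for i in range(n_players)]
        for (team, season), group in df.groupby([f'{side}_team', 'season']):
            # Collect every name in the group and drop the '?' placeholder
            players = set(group[cols].to_numpy().ravel()) - {'?'}
            rosters.setdefault((team, season), set()).update(players)
    return rosters

rosters = build_rosters(toy_df)
print(rosters)
```

Player `C` appears for `LAL` only in an away row, which shows why the away columns matter for roster completeness.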

## Feature Engineering <a name="feature-engineering"></a>

In [None]:
def encode_features(df, player_encoder, team_encoder, season_encoder):
    df['home_team_encoded'] = team_encoder.transform(df['home_team'])
    df['away_team_encoded'] = team_encoder.transform(df['away_team'])
    df['season_encoded'] = season_encoder.transform(df['season'])
    for i in range(5):
        df[f'home_{i}_encoded'] = player_encoder.transform(df[f'home_{i}'])
        df[f'away_{i}_encoded'] = player_encoder.transform(df[f'away_{i}'])
    return df

**Explanation:**
This function encodes categorical features (team names, player names, seasons) into numerical values that can be used by machine learning algorithms. Key points:

1. **Label Encoders**: Convert categorical values to integers (0, 1, 2, etc.)
2. **Consistent Encoding**: Using the same encoders for both training and test data ensures consistency
3. **Feature Preparation**: This prepares the data for the RandomForest model, which requires numerical inputs

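Feature encoding is needed because scikit-learn's Random Forest implementation does not accept string columns. Label encoding assigns integers in sorted order of the labels, so the numeric order of the codes carries no basketball meaning: the trees can still split on them, but nearby codes do not imply similar players. Encoding every categorical feature with the same fitted encoders keeps the representation consistent across teams, players, and seasons.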
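A small sketch of why the encoders are fit on the union of training and test names: `LabelEncoder.transform` raises `ValueError` for a label it did not see during `fit`. The player names below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Placeholder names; 'Player D' appears only in the test data
train_players = np.array(['Player A', 'Player B', 'Player C'])
test_players = np.array(['Player C', 'Player D'])

# Fit on the union so every name in either split has a code
player_encoder = LabelEncoder()
player_encoder.fit(np.concatenate([train_players, test_players]))
codes = player_encoder.transform(['Player D', 'Player A'])

# An encoder fit on training names alone rejects a name it has never seen
train_only = LabelEncoder().fit(train_players)
try:
    train_only.transform(['Player D'])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(codes, unseen_ok)
```

The codes follow the sorted order of `classes_`, which is why the integers should be treated as identifiers rather than as quantities.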

## Model Training <a name="model-training"></a>

In [None]:
# Columns used as model inputs (same order as in predict_fifth_player)
feature_cols = (['home_team_encoded', 'season_encoded']
                + [f'home_{i}_encoded' for i in range(5)]
                + [f'away_{i}_encoded' for i in range(5)])

# Features and target for the main model (2007-2015 test cases)
X = df[feature_cols]
y = df['outcome']

# Features and target for the recent model (2016, unseen, test cases)
X_recent = recent_df[feature_cols]
y_recent = recent_df['outcome']

# Train Random Forest models
# For Seen Test Data
model = RandomForestClassifier(n_estimators=300, max_depth=None, random_state=1, n_jobs=-1)
model.fit(X, y)

# For Unseen Test Data
recent_model = RandomForestClassifier(n_estimators=300, max_depth=None, random_state=1, n_jobs=-1)
recent_model.fit(X_recent, y_recent)

**Explanation:**
This section trains two Random Forest models:

1. **Main Model**: Trained on all data from 2007-2015 for predicting test cases within that period
2. **Recent Model**: Trained only on 2015 data (`recent_df`) for predicting 2016 test cases

**Why Random Forest?**
The Random Forest algorithm was chosen for several reasons:

1. **Handles Non-Linear Relationships**: Basketball performance involves complex interactions between players
2. **Feature Importance**: Can identify which players or teams have the most impact on outcomes
3. **Robust to Overfitting**: Averaging many trees reduces variance compared with a single tree, although with `max_depth=None` each tree can still fit individual lineups very closely
4. **Probability Outputs**: Can provide probabilities of winning with different player combinations
5. **Handles High Dimensionality**: Can effectively manage many features (players, teams, seasons)

The model parameters include:
- `n_estimators=300`: Uses 300 decision trees in the ensemble for robust predictions
- `max_depth=None`: Allows trees to grow to their full depth for complex patterns
- `n_jobs=-1`: Uses all available CPU cores for faster training
- `random_state=1`: Sets a seed for reproducibility
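To make the probability output concrete, here is a minimal sketch on synthetic data. The random integers stand in for the 12 encoded columns and carry no real signal, and `n_estimators` is reduced to 20 only to keep the example fast. None of this is the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in for the 12 encoded columns (team, season, 5 home, 5 away)
X_toy = rng.integers(0, 50, size=(200, 12))
y_toy = rng.integers(0, 2, size=200)  # random 0/1 outcomes, no real signal

toy_model = RandomForestClassifier(
    n_estimators=20, max_depth=None, random_state=1, n_jobs=-1
)
toy_model.fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in toy_model.classes_
proba = toy_model.predict_proba(X_toy[:5])
win_prob = proba[:, 1]  # probability of the second class in classes_

print(toy_model.classes_, win_prob)
```

Indexing column 1 assumes the positive ("home team wins") label is the second entry of `classes_`; checking `classes_` once after training guards against a label encoding where that is not the case.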

## Prediction Function <a name="prediction-function"></a>

In [None]:
def predict_fifth_player(home_team, season, home_players_4, away_players_5, k):
    # Load encoders and model
    if season >= 2016:
        model = joblib.load('encoders/nba_lineup_model_recent.pkl')
    else:
        model = joblib.load('encoders/nba_lineup_model.pkl')
    player_encoder = joblib.load('encoders/player_encoder.pkl')
    team_encoder = joblib.load('encoders/team_encoder.pkl')
    season_encoder = joblib.load('encoders/season_encoder.pkl')
    
    # Get the appropriate roster based on home team and season
    key = (home_team, season)
    eligible_players = rosters_dict.get(key, set())
    
    if not eligible_players:
        print(f"Warning: No roster found for {home_team} in season {season}")
        return None
    
    # Remove the 4 players already in the lineup from eligible players
    eligible_players = eligible_players - set(home_players_4)
    
    # [...rest of the function code...]
    
    # Return the top-k players with highest probabilities
    top_k_indices = np.argsort(probs)[-k:][::-1]
    top_k_players = [valid_candidates[i] for i in top_k_indices]
    top_k_probs = probs[top_k_indices]
    
    return top_k_players

**Explanation:**
This function predicts the k-best fifth players to add to a four-player home lineup. The key steps are:

1. **Model Selection**: Uses the appropriate model based on the season (recent model for 2016+)
2. **Roster Filtering**: Identifies eligible players from the team's roster for that season
3. **Candidate Evaluation**: For each eligible player:
   - Creates a complete lineup by adding them to the existing four players
   - Encodes the lineup features
   - Gets a winning probability from the model
4. **Top-K Selection**: Returns the k players with the highest predicted winning probabilities

This approach effectively finds the best player options to complete the lineup, considering only players who actually played for the team that season.
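The candidate loop and top-k selection can be sketched without a trained model. In the snippet below, `top_k_candidates`, `toy_value`, and `toy_score` are hypothetical names; `toy_score` stands in for the model's predicted winning probability and simply averages made-up player values.

```python
import numpy as np

def top_k_candidates(home_players_4, roster, score_lineup, k):
    """Rank roster players not already on the floor by the completed lineup's score."""
    candidates = sorted(set(roster) - set(home_players_4))
    if not candidates:
        return []
    # Score each completed (sorted) five-player lineup
    scores = np.array(
        [score_lineup(sorted(home_players_4 + [c])) for c in candidates]
    )
    order = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [candidates[i] for i in order]

# Hypothetical per-player values; a real run would call the trained model instead
toy_value = {'A': 0.1, 'B': 0.2, 'C': 0.3, 'D': 0.4, 'E': 0.9, 'F': 0.5, 'G': 0.7}

def toy_score(lineup):
    return sum(toy_value[p] for p in lineup) / len(lineup)

best = top_k_candidates(['A', 'B', 'C', 'D'], toy_value.keys(), toy_score, k=2)
print(best)
```

Sorting the completed lineup before scoring mirrors the preprocessing step, so a candidate is evaluated in the same representation the model saw during training.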

## Evaluation <a name="evaluation"></a>

In [None]:
def generate_test_cases(test_file, labels_file):
    test_df = pd.read_csv(test_file)
    labels_df = pd.read_csv(labels_file)
    
    # Combine the true labels into test DataFrame
    test_df['true_fifth_player'] = labels_df['removed_value']
    
    # Fix team names in test data
    test_df['home_team'] = test_df['home_team'].replace(TEAM_MAPPINGS)
    test_df['away_team'] = test_df['away_team'].replace(TEAM_MAPPINGS)

    test_cases = []
    for _, row in test_df.iterrows():
        # Extract home players and find missing position
        home_players = [row[f'home_{i}'] for i in range(5)]
        missing_idx = [i for i, player in enumerate(home_players) if player == '?'][0]
        
        # Get sorted known home players
        home_players_4 = sorted([p for i, p in enumerate(home_players) if i != missing_idx])
        
        # Get sorted away players
        away_players_5 = sorted([row[f'away_{i}'] for i in range(5)])
        
        test_cases.append({
            'home_team': row['home_team'],
            'away_team': row['away_team'],
            'season': row['season'],
            'home_players_4': home_players_4,
            'away_players_5': away_players_5,
            'true_fifth_player': row['true_fifth_player']
        })
    return test_cases

def evaluate_accuracy(test_cases, k):
    season_results = defaultdict(list)
    
    for case in test_cases:
        top_k_players = predict_fifth_player(
            case['home_team'],
            case['season'],
            case['home_players_4'],
            case['away_players_5'],
            k
        )
        
        success = False
        if top_k_players:
            success = case['true_fifth_player'] in top_k_players[:k]

        season_results[case['season']].append(success)
        
        print(f"Test Case: {case['home_team']} vs. {case['away_team']} ({case['season']})")
        print(f"Home Players (4): {case['home_players_4']}")
        print(f"Away Players (5): {case['away_players_5']}")
        print(f"True Fifth Player: {case['true_fifth_player']}")
        print(f"Top {k} Predicted Players: {top_k_players}")
        print(f"Success: {success}")
        print("-" * 50)
    
    # Calculate overall accuracy
    all_results = []
    for season, results in season_results.items():
        all_results.extend(results)
    overall_accuracy = np.mean(all_results)
    print(f"\nOverall Top-{k} Accuracy: {overall_accuracy:.2f}")
    
    # Calculate per-season accuracy
    print("\nSeason-wise Accuracy:")
    for season in sorted(season_results.keys()):
        acc = np.mean(season_results[season])
        count = len(season_results[season])
        print(f"Season {season}: {acc:.2f} (n={count})")

**Explanation:**
The evaluation process consists of two main functions:

1. **Test Case Generation**: 
   - Extracts test cases from the test data files
   - Each test case includes a home team with 4 known players, an away team with 5 players, and the true fifth player to predict
   - Standardizes team names using the mapping dictionary

2. **Accuracy Evaluation**:
   - For each test case, predicts the top-k players to add to the lineup
   - Checks if the true fifth player is in the top-k predictions
   - Calculates overall accuracy and season-specific accuracy
   - Provides detailed output for each test case

The evaluation is comprehensive, breaking down results by season to identify where the model performs well or poorly. Using top-k accuracy is a reasonable metric because in practice, coaches might consider multiple player options rather than a single recommendation.
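Top-k accuracy per season reduces to a few lines. The sketch below uses a hypothetical helper, `top_k_accuracy_by_season`, and hand-written records of the form (season, true player, ranked predictions). A `None` prediction counts as a miss, as in `evaluate_accuracy`.

```python
from collections import defaultdict

def top_k_accuracy_by_season(records, k):
    """Return {season: share of cases whose true player is in the top k}."""
    hits = defaultdict(list)
    for season, truth, ranked in records:
        hits[season].append(truth in (ranked or [])[:k])
    return {season: sum(v) / len(v) for season, v in hits.items()}

# Hypothetical evaluation records
records = [
    (2015, 'A', ['A', 'B', 'C']),
    (2015, 'B', ['C', 'B', 'A']),
    (2016, 'X', ['Y', 'Z', 'X']),
    (2016, 'Y', None),  # no prediction was produced for this case
]

acc_k1 = top_k_accuracy_by_season(records, k=1)
acc_k3 = top_k_accuracy_by_season(records, k=3)
print(acc_k1, acc_k3)
```

Accuracy can only stay the same or rise as `k` grows, so figures reported for different values of `k` are not directly comparable.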

## Results Analysis <a name="results-analysis"></a>

In [None]:
# Generate test cases from test data
# (test_file, labels_file and start_time are defined in the main code)
test_cases = generate_test_cases(test_file, labels_file)

# Evaluate model accuracy with test cases
k = 4
evaluate_accuracy(test_cases, k)

end_time = time.time()
elapsed_time_minutes = (end_time - start_time) / 60
print(f"Total execution time: {elapsed_time_minutes:.2f} minutes")

**Explanation:**
The results show:
- High accuracy (~90%) on test data from 2007-2015
- Lower accuracy (~59%) on unseen data from 2016

This performance drop on 2016 data can be attributed to several factors:

1. **Data Shift**: NBA basketball changes over time - playing styles, player movements, and team strategies evolve
2. **Limited Training**: The model for 2016 predictions was trained only on recent data (2015)
3. **Roster Knowledge**: For 2016, rosters were built from the test data itself, which might be incomplete
4. **Player Development**: Players' skills and impacts change over time, and the model may not capture these developments
5. **New Players**: Rookies and newly significant players in 2016 would have no history in the training data

The code uses a pragmatic approach by using two separate models:
1. A general model for 2007-2015 predictions
2. A "recent" model focused on 2015 data for 2016 predictions

This dual-model approach represents a reasonable compromise given the data limitations, but the accuracy drop suggests room for improvement.
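One way to probe the "New Players" explanation is to measure how many true fifth players never appear in the training data. The helper below, `unseen_label_share`, is a hypothetical diagnostic, and the names are placeholders rather than results from the notebook.

```python
def unseen_label_share(train_players, true_fifth_players):
    """Share of test labels that never occur in training, plus the names involved."""
    unseen = [p for p in true_fifth_players if p not in train_players]
    share = len(unseen) / len(true_fifth_players)
    return share, sorted(set(unseen))

# Placeholder data: two of the four labels are absent from training
train_players = {'A', 'B', 'C'}
labels_2016 = ['A', 'R1', 'B', 'R2']

share, names = unseen_label_share(train_players, labels_2016)
print(share, names)
```

A high share for 2016 would point to missing player history as a driver of the accuracy drop, while a low share would shift attention to the other factors listed above.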

## Main Execution Flow <a name="main-execution"></a>

In [None]:
# ------------- MAIN CODE ------------- #
start_time = time.time()
data_dir = './training_files'
df = load_data(data_dir)  # Initialize training dataframe

# Load test data
test_file = 'test_files/NBA_test.csv'
labels_file = 'test_files/NBA_test_labels.csv'
test_df = pd.read_csv(test_file)
labels_df = pd.read_csv(labels_file)

# Fix team names in test data
test_df['home_team'] = test_df['home_team'].replace(TEAM_MAPPINGS)
test_df['away_team'] = test_df['away_team'].replace(TEAM_MAPPINGS)

# Combine true labels into test DataFrame
test_df['true_fifth_player'] = labels_df['removed_value']

# Create rosters from training data and supplement with new seasons from test data
rosters_dict = create_rosters(df, test_df)

# Initialize and fit encoders
player_encoder = LabelEncoder()
team_encoder = LabelEncoder()
season_encoder = LabelEncoder()

# Get unique values from both training and test data
all_players_train = pd.unique(df[[f'home_{i}' for i in range(5)] + [f'away_{i}' for i in range(5)]].values.ravel())
all_players_test = pd.unique(test_df[[f'home_{i}' for i in range(5)] + [f'away_{i}' for i in range(5)]].values.ravel())
all_players_test = np.append(all_players_test, labels_df['removed_value'].values)
all_players = np.unique(np.concatenate([all_players_train, all_players_test]))
all_players = all_players[all_players != '?']  # Remove placeholder

# Encode features, train models, and evaluate

**Explanation:**
The main execution flow follows these steps:

1. **Data Loading**: Loads training data and test data
2. **Data Standardization**: Fixes team names using the mapping dictionary
3. **Roster Creation**: Creates team rosters for all seasons, including 2016
4. **Encoder Preparation**: Fits label encoders on all data (training + test)
5. **Feature Engineering**: Adds encoded features to the dataframe
6. **Model Training**: Trains two Random Forest models
7. **Model Evaluation**: Evaluates the models on test cases

The code is well-structured with a clear separation of data loading, preprocessing, model training, and evaluation. The time tracking provides visibility into the computational resources required.
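The notebook does not show how the fitted encoders and models reach the `encoders/` directory that `predict_fifth_player` reads from. A plausible round trip, sketched under the assumption that `joblib.dump` is used, looks like this. The snippet writes to a temporary directory rather than `encoders/`, and the team list is made up.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical fitted encoder standing in for the notebook's team_encoder
team_encoder = LabelEncoder().fit(['BOS', 'LAL', 'OKC'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'team_encoder.pkl')
    joblib.dump(team_encoder, path)  # persist the fitted object
    restored = joblib.load(path)     # read it back from disk

restored_codes = restored.transform(['LAL', 'BOS'])
print(restored_codes)
```

Since `predict_fifth_player` calls `joblib.load` on every invocation, loading these objects once and reusing them across test cases is a likely way to reduce the long runtime.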

## Future Improvements <a name="future-improvements"></a>

Based on the analysis, here are some potential improvements for the model:

1. **Advanced Features**:
   - Player statistics (points, rebounds, assists)
   - Player positions and roles
   - Team playing styles and strategies
   - Coach influence

2. **More Sophisticated Models**:
   - Neural networks for capturing complex player interactions
   - Time series models to capture trends and player development
   - Transfer learning approaches to leverage knowledge across seasons

3. **Data Augmentation**:
   - Synthetic data generation for rare player combinations
   - Incorporating additional seasons of data
   - External data sources (injuries, trades, etc.)

4. **Ensemble Approaches**:
   - Combining multiple model types
   - Weighting models based on recency of training data
   - Specialized models for different types of matchups

5. **Interpretability Tools**:
   - Feature importance analysis
   - Partial dependence plots
   - SHAP values for explaining individual predictions

By implementing these improvements, the model could potentially achieve higher accuracy on unseen data like the 2016 season.
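As a starting point for the interpretability item, a fitted Random Forest exposes `feature_importances_`. The sketch below uses synthetic data in which the outcome depends on a single column by construction, so the ranking is known in advance; it says nothing about the real model's importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = (['home_team_encoded', 'season_encoded']
                 + [f'home_{i}_encoded' for i in range(5)]
                 + [f'away_{i}_encoded' for i in range(5)])

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 50, size=(300, 12)), columns=feature_names)

# Synthetic outcome driven by one column only, so the expected ranking is known
y_toy = (X_toy['home_0_encoded'] > 25).astype(int)

toy_model = RandomForestClassifier(n_estimators=30, random_state=1)
toy_model.fit(X_toy, y_toy)

# Impurity-based importances, one per input column, summing to 1
importances = pd.Series(
    toy_model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances.head())
```

Impurity-based importances tend to favor high-cardinality columns such as encoded player IDs, so permutation importance or SHAP values may give a more reliable picture on the real data.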

## Conclusion

The NBA lineup prediction model demonstrates a practical application of machine learning to sports analytics. While achieving high accuracy on data from the same time period as the training data, it struggles with predictions for a new season (2016). This highlights the challenges of predicting in dynamic domains like professional sports, where player movements, team strategies, and game styles evolve over time.

The approach of using two separate models - one for historical data and one for recent data - is a reasonable compromise that delivers decent results. However, the accuracy drop on 2016 data suggests that more sophisticated techniques might be needed for truly robust predictions across seasons.

Overall, the code provides a solid foundation for NBA lineup optimization that could be extended with more advanced features, models, and data sources.

## Run the Program

Running the entire test set can take a very long time, up to an hour.

Before running, ensure:
- You have cloned the repository.
- The "encoders" directory exists (even if empty).
- You are running this report from within the repository.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import os, joblib, numpy as np
from collections import defaultdict
import time

TEAM_MAPPINGS = {
    'CHO': 'CHA',  # Charlotte Hornets
    'NOP': 'NOK',  # New Orleans Pelicans/Oklahoma City
    'NOH': 'NOK',  # New Orleans Hornets/Oklahoma City
    'NJN': 'BRK',  # New Jersey/Brooklyn Nets
    'SEA': 'OKC',  # Seattle SuperSonics/Oklahoma City
}

# ------------- LOADING & PROCESSING DATA ------------- # 
# Function to load and preprocess the data
def load_data(data_dir):
    all_files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.csv')]
    df_list = []

    for file in all_files:  # Append all csv files into one dataframe to be used later
        df = pd.read_csv(file)
        df_list.append(df)
    df = pd.concat(df_list, ignore_index=True)
    
    # Function to sort all players alphabetically in the dataframe
    def sort_players(row):
        home_players = sorted([row[f'home_{i}'] for i in range(5)])
        away_players = sorted([row[f'away_{i}'] for i in range(5)])
        for i in range(5):
            row[f'home_{i}'] = home_players[i]
            row[f'away_{i}'] = away_players[i]
        return row
    
    # Apply the sorting on the dataframe
    df = df.apply(sort_players, axis=1)
    
    return df

# Function to create roster for each team and season - modified to handle test data
def create_rosters(train_df, test_df=None):
    # Create a roster for each team and season from training data
    roster = train_df.groupby(['home_team', 'season'])[[f'home_{i}' for i in range(5)]].apply(
        lambda x: set(pd.unique(x.values.ravel()))
    ).to_dict()
    
    # Supplement with away team data (important for completeness)
    for idx, row in train_df.iterrows():
        key = (row['away_team'], row['season'])
        if key not in roster:
            roster[key] = set()
        for i in range(5):
            roster[key].add(row[f'away_{i}'])
    
    # If test data is provided, build rosters for new seasons only
    if test_df is not None:
        # Identify new seasons in test data
        train_seasons = set(train_df['season'].unique())
        test_seasons = set(test_df['season'].unique())
        new_seasons = test_seasons - train_seasons
        
        if new_seasons:
            print(f"Building rosters for new seasons in test data: {new_seasons}")
            
            # Create temporary dataframe with only new seasons
            new_season_df = test_df[test_df['season'].isin(new_seasons)]
            
            # Build rosters for new seasons from test data
            for idx, row in new_season_df.iterrows():
                # Add home team players
                key = (row['home_team'], row['season'])
                if key not in roster:
                    roster[key] = set()
                for i in range(5):
                    player = row[f'home_{i}']
                    if player != '?':  # Skip unknown players
                        roster[key].add(player)
                
                # Add away team players
                key = (row['away_team'], row['season'])
                if key not in roster:
                    roster[key] = set()
                for i in range(5):
                    roster[key].add(row[f'away_{i}'])
                    
            # Add the true fifth player to the roster (from labels)
            if 'true_fifth_player' in new_season_df.columns:
                for idx, row in new_season_df.iterrows():
                    key = (row['home_team'], row['season'])
                    if key in roster and row['true_fifth_player'] != '?':
                        roster[key].add(row['true_fifth_player'])
    
    return roster

# Function to encode all categorical features to numerical values 
def encode_features(df, player_encoder, team_encoder, season_encoder):
    df['home_team_encoded'] = team_encoder.transform(df['home_team'])
    df['away_team_encoded'] = team_encoder.transform(df['away_team'])
    df['season_encoded'] = season_encoder.transform(df['season'])
    for i in range(5):
        df[f'home_{i}_encoded'] = player_encoder.transform(df[f'home_{i}'])
        df[f'away_{i}_encoded'] = player_encoder.transform(df[f'away_{i}'])
    return df


# ------------- PREDICTION, TESTING ACCURACY ------------- #
# Prediction function, predicts k best options for 5th player to maximize winning.
def predict_fifth_player(home_team, season, home_players_4, away_players_5, k):
    # Load encoders and model
    if season >= 2016:
        model = joblib.load('encoders/nba_lineup_model_recent.pkl')
    else:
        model = joblib.load('encoders/nba_lineup_model.pkl')
    player_encoder = joblib.load('encoders/player_encoder.pkl')
    team_encoder = joblib.load('encoders/team_encoder.pkl')
    season_encoder = joblib.load('encoders/season_encoder.pkl')
    
    # Get the appropriate roster (eligible players) based on home team and season of input case
    key = (home_team, season)
    eligible_players = rosters_dict.get(key, set())
    
    if not eligible_players:
        print(f"Warning: No roster found for {home_team} in season {season}")
        return None
    
    # Remove the 4 players already in the lineup from eligible players
    eligible_players = eligible_players - set(home_players_4)
    
    if not eligible_players:
        print(f"Warning: No eligible players left for {home_team} in season {season} after removing existing players")
        return None
    
    eligible_players = list(eligible_players)
    
    try:
        home_team_enc = team_encoder.transform([home_team])[0]
    except ValueError:
        print(f"Unknown home team: {home_team}")
        return None
        
    try:
        season_enc = season_encoder.transform([season])[0]
    except ValueError:
        print(f"Unknown season: {season}")
        return None
    
    # Encode away players
    away_encoded = []
    for p in away_players_5:
        try:
            encoded_p = player_encoder.transform([p])[0]
        except ValueError:
            print(f"Unknown away player: {p}, using default encoding")
            encoded_p = -1  # Default for unknown player
        away_encoded.append(encoded_p)
    
    # Evaluate all eligible players
    candidates = []
    valid_candidates = []
    
    for candidate in eligible_players:
        try:
            # Create and encode home lineup with the current eligible player
            home_lineup = sorted(home_players_4 + [candidate])
            home_encoded = []
            valid_player = True
            
            for p in home_lineup:
                try:
                    encoded_p = player_encoder.transform([p])[0]
                except ValueError:
                    print(f"Unknown home player: {p}, skipping candidate")
                    valid_player = False
                    break
                home_encoded.append(encoded_p)
            
            if not valid_player:
                continue
                
            # Prepare features
            features = [home_team_enc, season_enc] + home_encoded + away_encoded
            
            # Create candidate DataFrame
            candidate_df = pd.DataFrame(
                [features], columns=['home_team_encoded', 'season_encoded', 
                                     'home_0_encoded', 'home_1_encoded', 'home_2_encoded', 
                                     'home_3_encoded', 'home_4_encoded',
                                     'away_0_encoded', 'away_1_encoded', 'away_2_encoded', 
                                     'away_3_encoded', 'away_4_encoded'])
            candidates.append(candidate_df)
            valid_candidates.append(candidate)
        except Exception as e:
            print(f"Error processing candidate {candidate}: {e}")
            continue
    
    if not candidates:
        print(f"No valid candidates found for {home_team} in season {season}")
        return None
    
    # Use model to predict winning probabilities
    all_candidates = pd.concat(candidates)
    probs = model.predict_proba(all_candidates)[:, 1]
    
    # Return the top-k players with highest probabilities
    top_k_indices = np.argsort(probs)[-k:][::-1]
    top_k_players = [valid_candidates[i] for i in top_k_indices]
    top_k_probs = probs[top_k_indices]
    
    return top_k_players

# Function to automatically generate test cases
def generate_test_cases(test_file, labels_file):
    test_df = pd.read_csv(test_file)
    labels_df = pd.read_csv(labels_file)
    
    # Combine the true labels into test DataFrame
    test_df['true_fifth_player'] = labels_df['removed_value']
    
    # Fix team names in test data
    test_df['home_team'] = test_df['home_team'].replace(TEAM_MAPPINGS)
    test_df['away_team'] = test_df['away_team'].replace(TEAM_MAPPINGS)

    test_cases = []
    for _, row in test_df.iterrows():
        # Extract home players and find missing position
        home_players = [row[f'home_{i}'] for i in range(5)]
        missing_idx = [i for i, player in enumerate(home_players) if player == '?'][0]
        
        # Get sorted known home players
        home_players_4 = sorted([p for i, p in enumerate(home_players) if i != missing_idx])
        
        # Get sorted away players
        away_players_5 = sorted([row[f'away_{i}'] for i in range(5)])
        
        test_cases.append({
            'home_team': row['home_team'],
            'away_team': row['away_team'],
            'season': row['season'],
            'home_players_4': home_players_4,
            'away_players_5': away_players_5,
            'true_fifth_player': row['true_fifth_player']
        })
    return test_cases

def evaluate_accuracy(test_cases, k):
    season_results = defaultdict(list)
    
    for case in test_cases:
        top_k_players = predict_fifth_player(
            case['home_team'],
            case['season'],
            case['home_players_4'],
            case['away_players_5'],
            k
        )
        
        success = False
        if top_k_players:
            success = case['true_fifth_player'] in top_k_players[:k]

        season_results[case['season']].append(success)
        
        print(f"Test Case: {case['home_team']} vs. {case['away_team']} ({case['season']})")
        print(f"Home Players (4): {case['home_players_4']}")
        print(f"Away Players (5): {case['away_players_5']}")
        print(f"True Fifth Player: {case['true_fifth_player']}")
        print(f"Top {k} Predicted Players: {top_k_players}")
        print(f"Success: {success}")
        print("-" * 50)
    
    # Calculate overall accuracy across all seasons
    all_results = [result for results in season_results.values() for result in results]
    overall_accuracy = np.mean(all_results) if all_results else float('nan')
    print(f"\nOverall Top-{k} Accuracy: {overall_accuracy:.2f}")
    
    # Calculate per-season accuracy
    print("\nSeason-wise Accuracy:")
    for season in sorted(season_results.keys()):
        acc = np.mean(season_results[season])
        count = len(season_results[season])
        print(f"Season {season}: {acc:.2f} (n={count})")
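
# Illustrative sketch of the metric reported above, for use on toy data. The
# helper `top_k_hit_rate` is hypothetical: a case counts as a hit when the
# true player appears among the first k predictions, and a missing prediction
# (None) counts as a miss, mirroring evaluate_accuracy. Returning 0.0 for an
# empty input is an assumption made here, not behavior of the original code.
def top_k_hit_rate(true_players, predicted_lists, k):
    hits = [
        bool(preds) and truth in preds[:k]
        for truth, preds in zip(true_players, predicted_lists)
    ]
    return sum(hits) / len(hits) if hits else 0.0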



# ------------- MAIN CODE ------------- #
start_time = time.time()
data_dir = './training_files'
df = load_data(data_dir)  # Initialize training dataframe

# Load test data
test_file = 'test_files/NBA_test.csv'
labels_file = 'test_files/NBA_test_labels.csv'
test_df = pd.read_csv(test_file)
labels_df = pd.read_csv(labels_file)

# Fix team names in test data (training data has already been fixed directly in the csv files)
test_df['home_team'] = test_df['home_team'].replace(TEAM_MAPPINGS)
test_df['away_team'] = test_df['away_team'].replace(TEAM_MAPPINGS)

# Combine the true labels into test DataFrame
test_df['true_fifth_player'] = labels_df['removed_value']

# Create rosters from training data and supplement with new seasons from test data
rosters_dict = create_rosters(df, test_df)

# Initialize encoders
player_encoder = LabelEncoder()
team_encoder = LabelEncoder()
season_encoder = LabelEncoder()

# Get unique values from both training and test data
all_players_train = pd.unique(df[[f'home_{i}' for i in range(5)] + [f'away_{i}' for i in range(5)]].values.ravel())
all_players_test = pd.unique(test_df[[f'home_{i}' for i in range(5)] + [f'away_{i}' for i in range(5)]].values.ravel())
all_players_test = np.append(all_players_test, labels_df['removed_value'].values)
all_players = np.unique(np.concatenate([all_players_train, all_players_test]))
all_players = all_players[all_players != '?']  # Remove placeholder

# Teams from both datasets
teams_train = pd.unique(pd.concat([df['home_team'], df['away_team']]))
teams_test = pd.unique(pd.concat([test_df['home_team'], test_df['away_team']]))
teams = np.unique(np.concatenate([teams_train, teams_test]))

# Seasons from both datasets
seasons_train = pd.unique(df['season'])
seasons_test = pd.unique(test_df['season'])
seasons = np.unique(np.concatenate([seasons_train, seasons_test]))

# Fit encoders on all data (including test data) to ensure we can encode all values
player_encoder.fit(all_players)
team_encoder.fit(teams)
season_encoder.fit(seasons)

# Add encoded versions of each column to the training dataframe ONLY
df = encode_features(df, player_encoder, team_encoder, season_encoder)

recent_df = df[df['season'] > 2014]  # Seasons after 2014 only; used to train the model for new/unseen data

# ------------- TRAINING MODEL ------------- #
# Encoded feature columns shared by both models
FEATURE_COLUMNS = (
    ['home_team_encoded', 'season_encoded']
    + [f'home_{i}_encoded' for i in range(5)]
    + [f'away_{i}_encoded' for i in range(5)]
)

# Training set for the model evaluated on 2007-2015 test data
X = df[FEATURE_COLUMNS]
y = df['outcome']

# Training set for the model evaluated on 2016 (unseen) test data
X_recent = recent_df[FEATURE_COLUMNS]
y_recent = recent_df['outcome']

# Train Random Forest models
# For Seen Test Data
model = RandomForestClassifier(n_estimators=300, max_depth=None, random_state=1, n_jobs=-1)
model.fit(X, y)

# For Unseen Test Data
recent_model = RandomForestClassifier(n_estimators=300, max_depth=None, random_state=1, n_jobs=-1)
recent_model.fit(X_recent, y_recent)

# Save models and encoders (create the output directory if it does not exist)
os.makedirs('encoders', exist_ok=True)
joblib.dump(model, 'encoders/nba_lineup_model.pkl')
joblib.dump(recent_model, 'encoders/nba_lineup_model_recent.pkl')
joblib.dump(player_encoder, 'encoders/player_encoder.pkl')
joblib.dump(team_encoder, 'encoders/team_encoder.pkl')
joblib.dump(season_encoder, 'encoders/season_encoder.pkl')

# ------------- TESTING MODEL ------------- #
# Generate test cases from test data
test_cases = generate_test_cases(test_file, labels_file)

# Evaluate model accuracy with test cases
k = 4
evaluate_accuracy(test_cases, k)

end_time = time.time()
elapsed_time_minutes = (end_time - start_time) / 60
print(f"Total execution time: {elapsed_time_minutes:.2f} minutes")