# Artificial Neural Network: NBA Player Dataset Team Optimazitation

## Team

Gabriel Aracena
Joshua Canode
Aaron Galicia

### Project Description

Select a pool of 100 players from the data set, within a 5-year window.
Define "optimal team" based on your decision of the player characteristics necessary to build a team. For example, if all 5 players are 3-point shooters, the team will miss defenders, which will make it unbalanced.
Your task is to identify the optimal team of 5 players from that pool.
Examine the multilayer neural network MLP architecture depicted in the "CST-435 An Artificial Neural Network Model Image."
Build a deep artificial neural network MLP to include the following: a) 1 input layer, b) as many hidden layers as you deem necessary, and c) an output layer fully connected to the hidden layers.
Explain your architecture and how the basketball player characteristics are used as inputs.
Activate the MLP by performing the following steps:

Starting at the input layer, forward propagate the patterns of the training data through the network to generate an output.
Based on the network's output, calculate the error that we want to minimize using a cost function that we will describe later.
Backpropagate the error, find its derivative with respect to each weight in the network, and update the model.
Repeat steps 1 through 3 for multiple epochs and learn the weights of the MLP.
Use forward propagation to calculate the network output and apply a threshold function to obtain the predicted class labels in the one-hot representation.
Interpret the output of your MLP in the context of selecting an optimal basketball team.

## Abstract

The objective is to use a deep artificial neural network (ANN) to determine an optimal team composition from a pool of basketball players. Given player characteristics, we want to identify the best five players that result in a balanced team.

### Data Preparation:

* Load the NBA Players Dataset.
* Filter to get a pool of 100 players from a random 5-year window.
* Normalize/Standardize player characteristics.

### ANN Model Building:

* Design a Multi-layer Perceptron (MLP) based on the architecture of the CST-435 An Artificial Neural Network Model Image (see below)
* Define layers: Input layer, Hidden layers, and Output layer.
* Determine the appropriate activation function, optimizer, and loss function for the MLP.

![ANNModel](ANNModel.png)

### Training the ANN:

* Forward propagation: Use player characteristics to propagate input data through the network and generate an output.
* Calculate the error using a predefined cost function.
* Backpropagate the error to update model weights.
* Repeat the above steps for several epochs.

### Evaluation and Team Selection:

* Use forward propagation on the trained ANN to predict player effectiveness or class labels.
* Apply a threshold function to these predictions.
* Select the top five players that meet the optimal team criteria.

## Model Architecture

* Input Layer: This layer will have neurons equal to the number of player characteristics we're considering (e.g. points, assists, offensive rebounds, defensive rebounds,etc.).
* Hidden Layers: Multiple hidden layers can be used to capture intricate patterns and relationships. We initially thought we would do 5 hidden layers, one for each position,  but we decided to stick with only a single layer for simplicity and might change that later. 
* Output Layer: This layer can have neurons equal to the number of classes or roles in the team we're predicting for (e.g., point guard, shooting guard, center, etc.). Each neuron will give the likelihood of a player fitting that role.

## Activation and Threshold Function

During forward propagation, each neuron processes input data and transmits it to the next layer. An activation function is applied to this data. For this model, we can use the ReLU (Rectified Linear Unit) activation function for hidden layers due to its computational efficiency and the ability to handle non-linearities. The softmax function might be applied to the output layer as it provides a probability distribution.

After obtaining the output, a threshold function is applied to convert continuous values into distinct class labels. In this case, it can be the player's most likely role in the team.

## Interpretation and Conclusion

The final output provides us with a categorization of each player in our pool. By examining the predicted class labels and the associated probabilities, we can:
* Identify which role or position each player is most suited for.
* Select the top players for each role to form our optimal team.

We are going to define target values for each position and use hope to use that in the end of each training to classify if the output team was good or not. 

It's worth noting that the "optimal" team is contingent on the data provided and the neural network's training. For better results, the model should be regularly trained with updated data, and other external factors (like team chemistry and current form) should also be considered in real-world scenarios. For our optimal team we defined some weights based on each player position that will take into account the 2 most important stats for each position according to our criteria. See Definig player types bellow:


## Defining Player types    

After research, the teams will be made up of different positions: center, foward, small forward, guard, and point guard. These positions requre different specialties. Making use of the statistics provided by the CSV, we have chosen two weights that control what factors are important to the role.

In [20]:
"""
5 center
	height = 0.5
	weight = 0.5

4 forward
	net_rating = 0.6
	reb = 0.4

3 small forward
	ast_pct = 0.3
	usg_pct = 0.7

2 guard
	pts = 0.8
	ts_pct = 0.2

1 point guard
	ast = 0.8
	gp = 0.2


"""

'\n5 center\n\theight = 0.5\n\tweight = 0.5\n\n4 forward\n\tnet_rating = 0.6\n\treb = 0.4\n\n3 small forward\n\tast_pct = 0.3\n\tusg_pct = 0.7\n\n2 guard\n\tpts = 0.8\n\tts_pct = 0.2\n\n1 point guard\n\tast = 0.8\n\tgp = 0.2\n\n\n'

In [2]:
import pandas as pd
import random
import tensorflow as tf
from tensorflow import keras
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [3]:

# Specify the file path
file_path = "all_seasons.csv"

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Display the head (first few rows) of the DataFrame
print("Head of the DataFrame:")
print(df.head())

# Display the tail (last few rows) of the DataFrame
print("\nTail of the DataFrame:")
print(df.tail())


Head of the DataFrame:
   Unnamed: 0        player_name team_abbreviation   age  player_height  \
0           0      Dennis Rodman               CHI  36.0         198.12   
1           1  Dwayne Schintzius               LAC  28.0         215.90   
2           2       Earl Cureton               TOR  39.0         205.74   
3           3        Ed O'Bannon               DAL  24.0         203.20   
4           4        Ed Pinckney               MIA  34.0         205.74   

   player_weight                      college country draft_year draft_round  \
0      99.790240  Southeastern Oklahoma State     USA       1986           2   
1     117.933920                      Florida     USA       1990           1   
2      95.254320                Detroit Mercy     USA       1979           3   
3     100.697424                         UCLA     USA       1995           1   
4     108.862080                    Villanova     USA       1985           1   

   ...  pts   reb  ast  net_rating  oreb_pct 

## Defining target stats based on player types

The weights decided above will be used before training

In [4]:


def calculateTargetValue(position, stat1, stat2):
    # Point Guard: ast = 0.8 gp = 0.2
    # Shooting Guard: pts = 0.8 ts_pct = 0.2
    if (position == 1 or position == 2):
        weightedValue = ((stat1) * 0.8 + (stat2) * 0.2 )
        return weightedValue
    
    # Small Forward: ast_pct = 0.3 usg_pct = 0.7
    elif (position == 3):
        weightedValue = ((stat1) * 0.3 + (stat2) * 0.7 )
        return weightedValue
    # Forward: net_rating = 0.6 reb = 0.4
    elif (position == 4):
        weightedValue = ((stat1) * 0.6 + (stat2) * 0.4 )
        return weightedValue
    # Center: height = 0.5 weight = 0.5
    elif (position == 5):
        weightedValue = ((stat1) * 0.6 + (stat2) * 0.4 )
        return weightedValue

'''MAXIMUM_ASSIST = max(df['ast'])
MAXIMUM_GP = max(df['gp'])
MAXIMUM_PTS = max(df['pts'])
MAXIMUM_SHOOTING_RATE = max(df['ts_pct'])
MAXIMUM_ASSIST_PCTG = max(df['ast_pct'])
MAXIMUM_USG_PCT = max(df['usg_pct']) 
MAXIMUM_NET_RATING = max(df['net_rating'])
MAXIMUM_REB = max(df['oreb_pct'])
MAXIMUM_HEIGHT = max(df['player_height']) 
MAXIMUM_WEIGHT = max(df['player_weight']) 

# The target stats will be 80% of the maximum value (it will be really hard to get 100% all the time since we are going to only use 100 players out of the whole dataset)
TARGET_POINT_GUARD_VALUE = calculateTargetValue(1, MAXIMUM_ASSIST, MAXIMUM_GP)
TARGET_SHOOTING_GUARD_VALUE = calculateTargetValue(2, MAXIMUM_PTS, MAXIMUM_SHOOTING_RATE)
TARGET_SMALL_FORWARD_VALUE = calculateTargetValue(3, MAXIMUM_ASSIST_PCTG, MAXIMUM_USG_PCT)
TARGET_FORWARD_VALUE = calculateTargetValue(4, MAXIMUM_NET_RATING, MAXIMUM_REB)
TARGET_CENTER_VALUE = calculateTargetValue(5, MAXIMUM_HEIGHT, MAXIMUM_WEIGHT)

print(TARGET_POINT_GUARD_VALUE)
print(TARGET_SHOOTING_GUARD_VALUE)
print(TARGET_SMALL_FORWARD_VALUE)
print(TARGET_FORWARD_VALUE)
print(TARGET_CENTER_VALUE)'''
# Calculate the 90th percentile for each statistic
PERCENTILE = 0.9
p90_assist = df['ast'].quantile(PERCENTILE)
p90_gp = df['gp'].quantile(PERCENTILE)
p90_pts = df['pts'].quantile(PERCENTILE)
p90_shooting_rate = df['ts_pct'].quantile(PERCENTILE)
p90_assist_pctg = df['ast_pct'].quantile(PERCENTILE)
p90_usg_pct = df['usg_pct'].quantile(PERCENTILE)
p90_net_rating = df['net_rating'].quantile(PERCENTILE)
p90_reb = df['oreb_pct'].quantile(PERCENTILE)
p90_height = df['player_height'].quantile(PERCENTILE)
p90_weight = df['player_weight'].quantile(PERCENTILE)

# Adjust the target values to be 80% of the 90th percentile
TARGET_POINT_GUARD_VALUE = calculateTargetValue(1, p90_assist, p90_gp)
TARGET_SHOOTING_GUARD_VALUE = calculateTargetValue(2, p90_pts, p90_shooting_rate)
TARGET_SMALL_FORWARD_VALUE = calculateTargetValue(3, p90_assist_pctg, p90_usg_pct)
TARGET_FORWARD_VALUE = calculateTargetValue(4, p90_net_rating, p90_reb)
TARGET_CENTER_VALUE = calculateTargetValue(5, p90_height, p90_weight)




In [15]:
df['draft_year'] = df['season'].str.split('-').str[0].astype(int)

# Define the target year and the window size
enough_players = False
window_size = 5
while not enough_players:
    target_year = random.randint(min(df['draft_year']), max(df['draft_year']))
    start_year = target_year - window_size
    end_year = target_year
    filtered_df = df[(df['draft_year'] >= start_year) & (df['draft_year'] <= end_year)]
    
    if len(filtered_df) >= 100:
        enough_players = True
        selected_players = random.sample(range(len(filtered_df)), 100)
        test_df = filtered_df.iloc[selected_players]
        '''random.seed(42)
        selected_players = random.sample(range(len(filtered_df)), 100)
        selected_df = filtered_df.iloc[selected_players]
        '''
        print(test_df)

# Split the rest of the data (excluding the selected 100 players) for training
train_df = df.drop(test_df.index)


      Unnamed: 0               player_name team_abbreviation   age  \
7882        7882           Marvin Williams               UTA  28.0   
9812        9812              Allen Crabbe               BKN  26.0   
9818        9818               Alec Peters               PHX  23.0   
8658        8658  Kentavious Caldwell-Pope               DET  23.0   
8814        8814           Sean Kilpatrick               BKN  26.0   
...          ...                       ...               ...   ...   
7778        7778            Anthony Morrow               NOP  28.0   
8698        8698         LaMarcus Aldridge               SAS  30.0   
8957        8957            Andre Iguodala               GSW  32.0   
9058        9058               Chris Kaman               POR  34.0   
8154        8154             Larry Drew II               PHI  25.0   

      player_height  player_weight           college country  draft_year  \
7882         205.74     107.501304    North Carolina     USA        2013   
9812   

## Defining and Training Model:

In [20]:
# Define the neural network model for position prediction
def create_model():
    model = keras.Sequential([
        keras.layers.Input(shape=(10,)),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(1, activation='linear')  # Predict a score between 0 and 1
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

# Create a model for each position
models = {i: create_model() for i in range(1, 6)}

# Normalize the dataset
scaler = StandardScaler()
selected_features = scaler.fit_transform(train_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']])
names_df = train_df['player_name'] # Player Names

# Train a model for each position using the ideal values as targets
for i in range(1, 6):
    if i == 1:
        target = (train_df['ast']*0.8 + train_df['gp']*0.2) / TARGET_POINT_GUARD_VALUE
    elif i == 2:
        target = (train_df['pts']*0.8 + train_df['ts_pct']*0.2) / TARGET_SHOOTING_GUARD_VALUE
    elif i == 3:
        target = (train_df['ast_pct']*0.3 + train_df['usg_pct']*0.7) / TARGET_SMALL_FORWARD_VALUE
    elif i == 4:
        target = (train_df['net_rating']*0.6 + train_df['oreb_pct']*0.4) / TARGET_FORWARD_VALUE
    elif i == 5:
        target = (train_df['player_height']*0.5 + train_df['player_weight']*0.5) / TARGET_CENTER_VALUE
    models[i].fit(selected_features, target, epochs=10)

# Evaluate each player in the test set (100 players) for each position and select the optimal player
X_test = scaler.transform(test_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Predicting the Optimal Team

In [21]:
optimal_team = {}

for i in range(1, 6):
    scores = models[i].predict(selected_features).flatten()
    best_player_idx = scores.argmax()
    for x, y in optimal_team.items():
        if names_df.iloc[best_player_idx] == y:
            selected_features = np.delete(selected_features, best_player_idx, axis=0)
            names_df.drop(best_player_idx)
            scores = models[i].predict(selected_features).flatten()
            best_player_idx = scores.argmax()
            break
    optimal_team[i] = names_df.iloc[best_player_idx]
    # Remove this player so they aren't selected again
    selected_features = np.delete(selected_features, best_player_idx, axis=0)
    names_df.drop(best_player_idx)

print("Optimal Team:")
for pos, player_idx in optimal_team.items():
    print(f"Position {pos}: {player_idx}")

    


Optimal Team:
Position 1: Chris Paul
Position 2: James Ennis III
Position 3: Gheorghe Muresan
Position 4: Bruce Bowen
Position 5: Joel Freeland


The output above represents the optimal team that the neural network decided. When running the prediction multiple times, the players predicted by the neural network does fluctuate. This could be due to multiple factors. It is likely due to the inherent randomness in some aspects of the code and the network itself. A random time range is chosen with a random 100 players so the players will change.

## Second Approach: 1 ANN 5 Hidden layers

Since we were unsure if it is acceptable to do the project with 5 small ANN's for each position, we decided to also do 1 singular ANN with 5 hidden layer. Each Layer will train and adjust the weights for that correspondent position. 

In [19]:
def create_model2():
    model = keras.Sequential([
        keras.layers.Input(shape=(10,)),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(5, activation='linear')  # Predict 5 scores (one for each position)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=('accuracy'))
    return model

model2 = create_model2()

scaler = StandardScaler()

X_train2 = scaler.fit_transform(train_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']])
X_test2 = scaler.transform(test_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']])

# Create training target values for each position
Y_train2 = np.vstack([
    (train_df['ast']*0.8 + train_df['gp']*0.2) / TARGET_POINT_GUARD_VALUE,
    (train_df['pts']*0.8 + train_df['ts_pct']*0.2) / TARGET_SHOOTING_GUARD_VALUE,
    (train_df['ast_pct']*0.3 + train_df['usg_pct']*0.7) / TARGET_SMALL_FORWARD_VALUE,
    (train_df['net_rating']*0.6 + train_df['oreb_pct']*0.4) / TARGET_FORWARD_VALUE,
    (train_df['player_height']*0.5 + train_df['player_weight']*0.5) / TARGET_CENTER_VALUE
]).T

model2.fit(X_train2, Y_train2, epochs=100)

# Make predictions on the test set
predictions2 = model2.predict(X_test2)

# Select optimal team
optimal_team2 = {}
for i in range(5):  # For each position
    best_player_idx2 = predictions2[:, i].argmax()
    optimal_team2[i + 1] = test_df.iloc[best_player_idx2].name
    # Remove this player so they aren't selected again
    predictions2 = np.delete(predictions2, best_player_idx2, axis=0)

print("Optimal Team:")
for pos, player_idx in optimal_team2.items():
    player_name = df.loc[df.index == player_idx, 'player_name'].values[0]
    print(f"Position {pos}: Player Name {player_name}")

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

## Conclusion

Both methods produce different results, though both exhibit impressively low loss functions, making it challenging to definitively determine superiority. Notably, the 5 hidden layer Artificial Neural Network (ANN) achieves higher accuracy, but this observation is limited to the specific dataset and relies on subjective judgment.

It's crucial to recognize that numerous avenues for potential improvement exist to enhance accuracy. These include working with a larger dataset to account for potential duplicate player entries. Additionally, exploring architectural variations within the neural network, adjusting activation and loss functions, careful data scaling, feature engineering, and strategies to prevent overfitting offer promising opportunities.

Further optimization can be achieved by tweaking training epochs, selecting optimizers, fine-tuning learning rates, and defining more precise evaluation criteria. This iterative process enables ongoing model refinement.

It's worth noting that the pursuit of an ideal model is limitless, with seemingly endless possibilities for improvement. However, our progress has led to the development of a model that consistently achieves an accuracy rate exceeding 90%. This marks significant progress compared to the initial model, which had an accuracy level of approximately 1%, and we are pleased with these results.




