# Natural Language Processing Application: Sentimental Analysis on Steam Reviews (Possibly?)

## Team

* Gabriel Aracena
* Joshua Canode
* Aaron Galicia

### Project Description

A key area of knowledge in data analytics is the ability to extract meaning from text. This assignment provides the foundational skills in this area by detecting whether a text conveys a positive or negative message.

Analyze the sentiment (e.g., negative, neutral, positive) conveyed in a large body (corpus) of texts using the NLTK package in Python. Complete the steps below. Then, write a comprehensive technical report as a Python Jupyter notebook to include all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) Problem statement, b) Algorithm of the solution, c) Analysis of the findings, and d) References.

## Abstract

The objective is to use a deep artificial neural network (ANN) to determine an optimal team composition from a pool of basketball players. Given player characteristics, we want to identify the best five players that result in a balanced team.

### Data Preparation:

* Load the NBA Players Dataset.
* Filter to get a pool of 100 players from a random 5-year window.
* Normalize/Standardize player characteristics.

### ANN Model Building:

* Design a Multi-layer Perceptron (MLP) based on the architecture of the CST-435 An Artificial Neural Network Model Image (see below)
* Define layers: Input layer, Hidden layers, and Output layer.
* Determine the appropriate activation function, optimizer, and loss function for the MLP.

![ANNModel](ANNModel.png)

### Training the ANN:

* Forward propagation: Use player characteristics to propagate input data through the network and generate an output.
* Calculate the error using a predefined cost function.
* Backpropagate the error to update model weights.
* Repeat the above steps for several epochs.

### Evaluation and Team Selection:

* Use forward propagation on the trained ANN to predict player effectiveness or class labels.
* Apply a threshold function to these predictions.
* Select the top five players that meet the optimal team criteria.

## Model Architecture

* Input Layer: This layer will have neurons equal to the number of player characteristics we're considering (e.g. points, assists, offensive rebounds, defensive rebounds,etc.).
* Hidden Layers: Multiple hidden layers can be used to capture intricate patterns and relationships. We initially thought we would do 5 hidden layers, one for each position,  but we decided to stick with only a single layer for simplicity and might change that later. 
* Output Layer: This layer can have neurons equal to the number of classes or roles in the team we're predicting for (e.g., point guard, shooting guard, center, etc.). Each neuron will give the likelihood of a player fitting that role.

## Activation and Threshold Function

During forward propagation, each neuron processes input data and transmits it to the next layer. An activation function is applied to this data. For this model, we can use the ReLU (Rectified Linear Unit) activation function for hidden layers due to its computational efficiency and the ability to handle non-linearities. The softmax function might be applied to the output layer as it provides a probability distribution.

After obtaining the output, a threshold function is applied to convert continuous values into distinct class labels. In this case, it can be the player's most likely role in the team.

## Interpretation and Conclusion

The final output provides us with a categorization of each player in our pool. By examining the predicted class labels and the associated probabilities, we can:
* Identify which role or position each player is most suited for.
* Select the top players for each role to form our optimal team.

We are going to define target values for each position and use hope to use that in the end of each training to classify if the output team was good or not. 

It's worth noting that the "optimal" team is contingent on the data provided and the neural network's training. For better results, the model should be regularly trained with updated data, and other external factors (like team chemistry and current form) should also be considered in real-world scenarios. For our optimal team we defined some weights based on each player position that will take into account the 2 most important stats for each position according to our criteria. See Definig player types bellow:


## Defining Player types    

In [183]:
"""
5 center
	height = 0.5
	weight = 0.5

4 forward
	net_rating = 0.6
	reb = 0.4

3 small forward
	ast_pct = 0.3
	usg_pct = 0.7

2 guard
	pts = 0.8
	ts_pct = 0.2

1 point guard
	ast = 0.8
	gp = 0.2


"""

'\n5 center\n\theight = 0.5\n\tweight = 0.5\n\n4 forward\n\tnet_rating = 0.6\n\treb = 0.4\n\n3 small forward\n\tast_pct = 0.3\n\tusg_pct = 0.7\n\n2 guard\n\tpts = 0.8\n\tts_pct = 0.2\n\n1 point guard\n\tast = 0.8\n\tgp = 0.2\n\n\n'

In [193]:
import pandas as pd
import random
import tensorflow as tf
from tensorflow import keras
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [194]:

# Specify the file path
file_path = "all_seasons.csv"

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Display the head (first few rows) of the DataFrame
print("Head of the DataFrame:")
print(df.head())

# Display the tail (last few rows) of the DataFrame
print("\nTail of the DataFrame:")
print(df.tail())


Head of the DataFrame:
   Unnamed: 0        player_name team_abbreviation   age  player_height  \
0           0      Dennis Rodman               CHI  36.0         198.12   
1           1  Dwayne Schintzius               LAC  28.0         215.90   
2           2       Earl Cureton               TOR  39.0         205.74   
3           3        Ed O'Bannon               DAL  24.0         203.20   
4           4        Ed Pinckney               MIA  34.0         205.74   

   player_weight                      college country draft_year draft_round  \
0      99.790240  Southeastern Oklahoma State     USA       1986           2   
1     117.933920                      Florida     USA       1990           1   
2      95.254320                Detroit Mercy     USA       1979           3   
3     100.697424                         UCLA     USA       1995           1   
4     108.862080                    Villanova     USA       1985           1   

   ...  pts   reb  ast  net_rating  oreb_pct 

In [195]:
# Defining target stats based on player types

PROPORTINALITY_FACTOR = 1000

def calculateTargetValue(position, stat1, stat2):
    # Point Guard: ast = 0.8 gp = 0.2
    # Shooting Guard: pts = 0.8 ts_pct = 0.2
    if (position == 1 or position == 2):
        weightedValue = ((stat1 + PROPORTINALITY_FACTOR) * 0.8 + (stat2 + PROPORTINALITY_FACTOR) * 0.2 ) / PROPORTINALITY_FACTOR
        return weightedValue
    
    # Small Forward: ast_pct = 0.3 usg_pct = 0.7
    elif (position == 3):
        weightedValue = ((stat1 + PROPORTINALITY_FACTOR) * 0.3 + (stat2 + PROPORTINALITY_FACTOR) * 0.7 ) / PROPORTINALITY_FACTOR
        return weightedValue
    # Forward: net_rating = 0.6 reb = 0.4
    elif (position == 4):
        weightedValue = ((stat1 + PROPORTINALITY_FACTOR) * 0.6 + (stat2 + PROPORTINALITY_FACTOR) * 0.4 ) / PROPORTINALITY_FACTOR
        return weightedValue
    # Center: height = 0.5 weight = 0.5
    elif (position == 5):
        weightedValue = ((stat1 + PROPORTINALITY_FACTOR) * 0.6 + (stat2 + PROPORTINALITY_FACTOR) * 0.4 ) / PROPORTINALITY_FACTOR
        return weightedValue

MAXIMUM_ASSIST = max(df['ast'])
MAXIMUM_GP = max(df['gp'])
MAXIMUM_PTS = max(df['pts'])
MAXIMUM_SHOOTING_RATE = max(df['ts_pct'])
MAXIMUM_ASSIST_PCTG = max(df['ast_pct'])
MAXIMUM_USG_PCT = max(df['usg_pct']) 
MAXIMUM_NET_RATING = max(df['net_rating'])
MAXIMUM_REB = max(df['oreb_pct'])
MAXIMUM_HEIGHT = max(df['player_height']) 
MAXIMUM_WEIGHT = max(df['player_weight']) 

# The target stats will be 80% of the maximum value (it will be really hard to get 100% all the time since we are going to only use 100 players out of the whole dataset)
TARGET_POINT_GUARD_VALUE = calculateTargetValue(1, MAXIMUM_ASSIST, MAXIMUM_GP)
TARGET_SHOOTING_GUARD_VALUE = calculateTargetValue(2, MAXIMUM_PTS, MAXIMUM_SHOOTING_RATE)
TARGET_SMALL_FORWARD_VALUE = calculateTargetValue(3, MAXIMUM_ASSIST_PCTG, MAXIMUM_USG_PCT)
TARGET_FORWARD_VALUE = calculateTargetValue(4, MAXIMUM_NET_RATING, MAXIMUM_REB)
TARGET_CENTER_VALUE = calculateTargetValue(5, MAXIMUM_HEIGHT, MAXIMUM_WEIGHT)

print(TARGET_POINT_GUARD_VALUE)
print(TARGET_SHOOTING_GUARD_VALUE)
print(TARGET_SMALL_FORWARD_VALUE)
print(TARGET_FORWARD_VALUE)
print(TARGET_CENTER_VALUE)



1.0263600000000002
1.02918
1.001
1.1804000000000001
1.204001248


In [196]:
df['draft_year'] = df['season'].str.split('-').str[0].astype(int)

# Define the target year and the window size
enough_players = False
while not enough_players:
    target_year = random.randint(min(df['draft_year']), max(df['draft_year']))
    start_year = target_year - window_size
    end_year = target_year
    filtered_df = df[(df['draft_year'] >= start_year) & (df['draft_year'] <= end_year)]
    
    if len(filtered_df) >= 100:
        enough_players = True
        random.seed(42)
        selected_players = random.sample(range(len(filtered_df)), 100)
        selected_df = filtered_df.iloc[selected_players]
        print(selected_df)


      Unnamed: 0        player_name team_abbreviation   age  player_height  \
7958        7958     John Lucas III               UTA  31.0         180.34   
5795        5795  Pops Mensah-Bonsu               TOR  26.0         205.74   
5441        5441          T.J. Ford               IND  26.0         182.88   
6465        6465      Keyon Dooling               MIL  31.0         190.50   
6342        6342        Tony Battie               PHI  35.0         210.82   
...          ...                ...               ...   ...            ...   
6435        6435      Sasha Vujacic               NJN  27.0         200.66   
5610        5610       Maceo Baston               IND  33.0         208.28   
6203        6203     Joel Przybilla               POR  30.0         215.90   
7662        7662      Rashard Lewis               MIA  34.0         208.28   
6627        6627        Omri Casspi               SAC  23.0         205.74   

      player_weight            college   country  draft_year dr

In [199]:
# CELL HERE
# Defining the model

#input_shape = (100, 10)

'''model = keras.Sequential([
    keras.layers.Input(shape = (10,)),  # 8 input features
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(5, activation='softmax')   # Output layer with 5 nodes (one for each player type)
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Define the neural network model for position prediction
def create_model():
    model = keras.Sequential([
        keras.layers.Input(shape=(10,)),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')  # Predict a score between 0 and 1
    ])
    model.compile(optimizer='adam', loss='mse')
    return model'''

# Create a model for each position
models = {i: create_model() for i in range(1, 6)}

# Normalize the dataset
scaler = StandardScaler()
selected_features = scaler.fit_transform(selected_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']])

# Train a model for each position using the ideal values as targets
for i in range(1, 6):
    if i == 1:
        target = (selected_df['ast']*0.8 + selected_df['gp']*0.2) / TARGET_POINT_GUARD_VALUE
    elif i == 2:
        target = (selected_df['pts']*0.8 + selected_df['ts_pct']*0.2) / TARGET_SHOOTING_GUARD_VALUE
    elif i == 3:
        target = (selected_df['ast_pct']*0.3 + selected_df['usg_pct']*0.7) / TARGET_SMALL_FORWARD_VALUE
    elif i == 4:
        target = (selected_df['net_rating']*0.6 + selected_df['oreb_pct']*0.4) / TARGET_FORWARD_VALUE
    elif i == 5:
        target = (selected_df['player_height']*0.5 + selected_df['player_weight']*0.5) / TARGET_CENTER_VALUE
    models[i].fit(selected_features, target, epochs=50)

# Evaluate each player for each position and select the optimal player
optimal_team = {}
for i in range(1, 6):
    scores = models[i].predict(selected_features).flatten()
    best_player_idx = scores.argmax()
    optimal_team[i] = selected_df.iloc[best_player_idx].name
    # Remove this player so they aren't selected again
    selected_features = np.delete(selected_features, best_player_idx, axis=0)

print("Optimal Team:")
for pos, player_idx in optimal_team.items():
    print(f"Position {pos}: Player Index {player_idx}")


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/5

In [189]:
# Defining positions for dataset
# Create new column called position for selected_df
# Do a for loop 5 times where we use the function calculateTargetValue(i, stat1, stat2) and check the biggest return value for
# every value of i to then populate the new column position with the value of i that had the biggest value for that specific player 
def determine_position(row):
    best_value = float('-inf')
    best_position = None

    for i in range(1, 6):  # Loop 5 times as you mentioned
        if i == 1:
            value = calculateTargetValue(i, row['ast'], row['gp']) - TARGET_POINT_GUARD_VALUE
        elif i == 2:
            value = calculateTargetValue(i, row['pts'], row['ts_pct']) - TARGET_SHOOTING_GUARD_VALUE
        elif i == 3:
            value = calculateTargetValue(i, row['ast_pct'], row['usg_pct']) - TARGET_SMALL_FORWARD_VALUE
        elif i == 4:
            value = calculateTargetValue(i, row['net_rating'], row['oreb_pct']) - TARGET_FORWARD_VALUE
        elif i == 5:
            value = calculateTargetValue(i, row['player_height'], row['player_weight']) - TARGET_CENTER_VALUE

        if value > best_value:
            best_value = value
            best_position = i
    return best_position

selected_df = selected_df.copy()
selected_df['position'] = selected_df.apply(determine_position, axis=1)



In [190]:
# Normalize the dataset
scaler = StandardScaler()

# Split the data into training and validation sets
train_df, val_df = train_test_split(selected_df, test_size=0.2, random_state=42)

# Extract the features and labels for training and validation datasets
X_train = train_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']].values
y_train = train_df[['position']].values.ravel()

X_val = val_df[['ast', 'gp', 'pts', 'ts_pct', 'ast_pct', 'usg_pct', 'net_rating', 'oreb_pct', 'player_height', 'player_weight']].values
y_val = val_df[['position']].values.ravel()

# Apply normalization
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))

# Evaluate the model
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Accuracy: {accuracy*100:.2f}%")


Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Validation Accuracy: 100.00%


In [191]:
# Make predictions on the validation dataset
predictions = model.predict(X_val)
predicted_positions = [tf.argmax(pred).numpy() + 1 for pred in predictions]

# Create a DataFrame for results
results_df = pd.DataFrame({
    'Player Index': val_df.index,
    'True Position': y_val,
    'Predicted Position': predicted_positions
})

# Display the results
print(results_df)


    Player Index  True Position  Predicted Position
0           4620              3                   4
1           5640              3                   4
2           4582              3                   4
3           3243              3                   4
4           4149              3                   4
5           4622              3                   4
6           5727              3                   4
7           4068              3                   4
8           5484              3                   4
9           5685              3                   4
10          5531              3                   4
11          3719              3                   4
12          4159              3                   4
13          4204              3                   4
14          4004              3                   4
15          4069              3                   4
16          5561              3                   4
17          5666              3                   4
18          