# ANN for predicting Win/Loss

This project uses a dataset of individual player statistics from the 2024 League of Legends World Championship to build a model that predicts a player's likelihood of winning. The core question is whether a player's early-game performance is a reliable indicator of success. The model uses three statistics: Gold Difference at 15 minutes (GD@15), Creep Score Difference at 15 minutes (CSD@15), and Experience Difference at 15 minutes (XPD@15). Each measures the player's advantage over the opponent in the same role, so together they describe the relative state of the game at the 15-minute mark. A fully connected feedforward neural network is trained on these features to explore how strongly they relate to a player's overall success.

In [1]:
#Import packages
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

This section defines the dataset path with two options for locating the data based on storage location. If the dataset is stored in Google Drive, drive.mount('/content/drive') mounts the drive, allowing us to specify the dataset path within the Google Drive directory. Alternatively, if the dataset is uploaded directly to Colab, we set dataset_path to a local file location ('/content/player_statistics_cleaned_final.csv').

In [2]:
from google.colab import drive

# Mount Google Drive and access the dataset
#drive.mount('/content/drive')
#dataset_path = '/content/drive/MyDrive/NYP AAI DATASET/player_statistics_cleaned_final.csv'

# Dataset path if the file was uploaded directly to Colab
dataset_path = '/content/player_statistics_cleaned_final.csv'

We load the dataset by reading a CSV file containing individual player statistics from the 2024 League of Legends Worlds tournament. `pd.read_csv(dataset_path)` reads the file into a pandas DataFrame in which each row represents a player and each column holds one statistic, such as gold difference, creep score difference, or experience difference at 15 minutes. To verify that the data has been loaded correctly, we print a preview of the DataFrame and check that it shows the intended columns and rows.

In [3]:
data = pd.read_csv(dataset_path)

# Print the DataFrame for verification (pandas truncates long output)
print(data)

           TeamName PlayerName Position  Games  Win rate  KDA  Avg kills  \
0       Top Esports        369      Top      8     0.500  3.1        2.5   
1         Dplus KIA     aiming      Adc      9     0.333  4.8        5.0   
2     MAD Lions KOI     alvaro  Support      5     0.200  1.5        0.2   
3       Team Liquid        apa      Mid     10     0.500  2.4        3.5   
4         PSG Talon       azhi      Top      5     0.200  2.3        2.2   
..              ...        ...      ...    ...       ...  ...        ...   
71      LNG Esports     weiwei   Jungle      8     0.625  3.9        2.3   
72      PaiN Gaming      wizer      Top      4     0.000  1.6        3.0   
73        PSG Talon      woody  Support      5     0.200  3.3        1.0   
74     Weibo Gaming     xiaohu      Mid     13     0.615  6.2        3.8   
75  Bilibili Gaming        xun   Jungle      9     0.778  5.3        2.4   

    Avg deaths  Avg assists  CSPerMin  ...  A

In this section, we process the data to set up our input features and target variable for model training. First, we create a binary target column named 'Win', where each value is 1 if the player's 'Win rate' is greater than 0.6, and 0 otherwise. This turns the task into a binary classification problem. Note that the label therefore marks players whose tournament win rate exceeded 60%, not the outcome of a single game.

Next, we select the input features: 'GD@15' (Gold Difference at 15 minutes), 'CSD@15' (Creep Score Difference at 15 minutes), and 'XPD@15' (Experience Difference at 15 minutes). These features represent critical in-game statistics that measure early-game advantages compared to an opponent in the same role, which can strongly influence the likelihood of winning. We assign these columns to X, our feature matrix, while the newly created 'Win' column is assigned to y, our target variable.

In [4]:
# Create a binary target column for 'Win' (1 if 'Win rate' > 0.6, 0 otherwise)
data['Win'] = (data['Win rate'] > 0.6).astype(int)

# Select 'GD@15', 'CSD@15', and 'XPD@15' as features and 'Win' as the target
X = data[['GD@15', 'CSD@15', 'XPD@15']]  # Selected Inputs
y = data['Win']                          # Target
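To make the labeling rule concrete, the sketch below applies the same `> 0.6` cutoff to the ten win rates visible in the DataFrame preview above and counts the resulting classes. It uses a plain Python list rather than the full dataset, so the counts illustrate the rule only and say nothing definite about the whole table; the names `win_rates`, `labels`, and `majority_baseline` are introduced here for illustration.

```python
# Win rates copied from the ten rows shown in the DataFrame preview
win_rates = [0.500, 0.333, 0.200, 0.500, 0.200, 0.625, 0.000, 0.200, 0.615, 0.778]

# Same rule as the notebook: 1 if win rate > 0.6, else 0
labels = [int(rate > 0.6) for rate in win_rates]

positives = sum(labels)
negatives = len(labels) - positives

# Accuracy obtained by always predicting the more common class
majority_baseline = max(positives, negatives) / len(labels)

print("Labels:", labels)
print("Class counts (0, 1):", (negatives, positives))
print("Majority-class baseline accuracy:", majority_baseline)
```

If the classes turn out to be imbalanced in the full dataset as well, any accuracy figure is best read against this majority-class baseline.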

Since the dataset is relatively small, we use K-Fold Cross-Validation with k=5, dividing the data into 5 parts. In each iteration, the model is trained on 4 folds and tested on the remaining fold, so every data point is used for testing exactly once and for training in the other four iterations. Cross-validation does not itself prevent overfitting, but it gives a more reliable performance estimate than a single split because the score does not depend on one particular partition. In KFold(n_splits=k, shuffle=True, random_state=42), shuffle=True randomizes the order of data points before they are divided into folds, and random_state fixes that shuffle for reproducibility. The fold_accuracies list stores the accuracy for each fold so that an average can be calculated afterwards.

In [5]:
# Define 5-fold cross-validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
fold_accuracies = []
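The rotation described above can be sketched without scikit-learn. This is a hand-rolled illustration, not `KFold`'s actual implementation: the helper `kfold_indices` and its seed-based shuffle are assumptions made for this example, and the sample count of 61 assumes the folds are applied to an 80% remainder of 76 rows.

```python
import random

def kfold_indices(n_samples, k, seed=42):
    """Yield (train_idx, test_idx) pairs for k shuffled folds.

    Hand-rolled sketch: the first n_samples % k folds get one extra sample.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=61, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: {len(train_idx)} train, {len(test_idx)} test")

# Every sample lands in exactly one test fold
all_test = sorted(idx for _, test_idx in folds for idx in test_idx)
print("Each sample tested once:", all_test == list(range(61)))
```

With only about a dozen rows per test fold, a single misclassification shifts that fold's accuracy by roughly eight percentage points.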

This section defines a function, `create_model()`, which builds and compiles a neural network model for binary classification. The model is created using Keras' `Sequential` API, allowing layers to be added sequentially.

1. **Input Layer**: The input shape is set to `(3,)`, matching the three selected features (`GD@15`, `CSD@15`, and `XPD@15`).
2. **Hidden Layers**: The network includes two hidden layers:
   - The first hidden layer has 18 neurons and uses the ReLU activation function, which introduces non-linearity to capture complex patterns in the data.
   - The second hidden layer has 14 neurons, also using ReLU, further refining learned features.
3. **Output Layer**: The output layer has a single neuron with a sigmoid activation function, outputting a probability between 0 and 1 to represent the likelihood of a win. This is ideal for binary classification tasks.

After defining the structure, the model summary is printed to show layer details. The model is then compiled with the Adam optimizer, which adjusts learning rates dynamically, and uses binary cross-entropy as the loss function for evaluating binary classification performance. We also specify `accuracy` as a metric to track during training. Finally, the function returns the compiled model, ready for training.

In [6]:
# Define the model creation function with updated input shape
def create_model():
    model = keras.Sequential([
        layers.Input(shape=(3,)),                # Input layer with 3 features
        layers.Dense(18, activation='relu'),     # Hidden layer 1
        layers.Dense(14, activation='relu'),      # Hidden layer 2
        layers.Dense(1, activation='sigmoid')    # Output layer for binary classification
    ])

    # Print the model summary (summary() prints directly and returns None)
    model.summary()

    # Compile for binary classification
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
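As a sanity check on the architecture, the number of trainable parameters can be computed by hand: each `Dense` layer has one weight per input-output pair plus one bias per output. The helper `dense_params` below is a hypothetical name used only for this sketch; the layer sizes are those in `create_model()`.

```python
def dense_params(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs

layer_sizes = [3, 18, 14, 1]  # input, hidden 1, hidden 2, output

per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(per_layer)

print("Parameters per layer:", per_layer)
print("Total trainable parameters:", total_params)
```

With a few hundred parameters and fewer than 100 rows of data, the network has more free parameters than samples, which is worth keeping in mind when reading the accuracy figures.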

Here, we initialize a new instance of the model by calling create_model(). This function call sets up a fresh neural network with the architecture and parameters defined in the create_model function. Once created, we save the model's initial weights to a file named "initial_weights.weights.h5".

Saving these initial weights allows us to reload the untrained model's weights later. This is particularly useful in cross-validation, where we train the model multiple times on different subsets of the data. By resetting the weights to their initial values for each fold, we ensure that each training iteration starts from the same baseline, providing a fair comparison across folds.

In [7]:
model = create_model()

model.save_weights("initial_weights.weights.h5")

In this step, we use `train_test_split` to separate a small initial test set from the rest of the data. `train_test_split` returns the training portion first and the test portion second, so with `test_size=0.8` the 20% "training" portion is assigned to `X_initial_test`, `y_initial_test`, and the 80% "test" portion is assigned to `X_rest`, `y_rest`. Setting `random_state=42` ensures the same shuffle, and therefore the same split, each time the code is run.

This initial test set allows us to evaluate the model's performance in an untrained state. By using a separate test set right at the beginning, we can establish a baseline performance for the model with random weights, which can be compared to performance after training.

In [8]:
# Split off a test set to evaluate the untrained model.
# train_test_split returns (train, test), so the 20% "train" part becomes the initial test set.
X_initial_test, X_rest, y_initial_test, y_rest = train_test_split(X, y, test_size=0.8, random_state=42)
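The split sizes can be checked with a little arithmetic. The sketch assumes 76 rows in total (the last index in the preview is 75) and assumes scikit-learn rounds the test portion up to a whole sample; both are assumptions of this example rather than values printed by the notebook.

```python
import math

n_samples = 76     # assumed from the preview (indices 0..75)
test_size = 0.8

# Portion returned second by train_test_split (assigned to X_rest here)
n_rest = math.ceil(test_size * n_samples)

# Portion returned first (assigned to X_initial_test here)
n_initial_test = n_samples - n_rest

print("Rows in X_initial_test:", n_initial_test)
print("Rows in X_rest:", n_rest)
```

Under these assumptions the initial test set holds 15 rows, so each prediction moves its accuracy by about 6.7 percentage points and the baseline figure is coarse.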

Here the model's initial weights are loaded from `"initial_weights.weights.h5"` so that the untrained model can be evaluated before any training begins. `model.evaluate()` returns the loss and accuracy on `X_initial_test` and `y_initial_test`. The loss indicates how far the model's predicted probabilities are from the actual labels, while accuracy is the fraction of correct predictions. Since the model is untrained, the loss is likely to be high. The accuracy, however, is not necessarily low: with a small and possibly imbalanced test set, an untrained model that happens to favor the more common class can score well by chance. This baseline provides a reference point for measuring improvement after training.

In [9]:
# Load initial weights and evaluate on the initial test set
model.load_weights("initial_weights.weights.h5")
test_loss, test_acc = model.evaluate(X_initial_test, y_initial_test, verbose=0)
print("Untrained model loss:", test_loss)
print("Untrained model accuracy:", test_acc)

Untrained model loss: 12.447905540466309
Untrained model accuracy: 0.8666666746139526
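A high loss alongside a high accuracy is not a contradiction: accuracy only counts which side of the threshold a prediction falls on, while binary cross-entropy also penalizes how confident a wrong prediction is. The sketch below uses made-up labels and probabilities (not the notebook's data) and a hypothetical helper `binary_cross_entropy` to illustrate the effect.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up example: mostly class 0, with one confidently wrong prediction
y_true = [0, 0, 0, 0, 1]
y_prob = [0.01, 0.01, 0.01, 0.01, 0.000001]

y_pred = [int(p > 0.5) for p in y_prob]
accuracy = sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)
loss = binary_cross_entropy(y_true, y_prob)

print("Accuracy:", accuracy)
print("Loss:", round(loss, 3))
```

Four of the five made-up predictions are correct, yet the single confident miss dominates the mean loss.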


This cell implements cross-validation, a technique for evaluating the model's performance more reliably by testing it on different subsets of the data. In the loop, the `KFold` object splits the `X_rest` dataset into `k` folds (in this case, 5 folds). Each fold acts as a validation set once, while the remaining folds are used for training the model.

For each fold:
1. The dataset is divided into a training set (`X_train_fold`, `y_train_fold`) and a test set (`X_test_fold`, `y_test_fold`) based on the indices from the `kf.split()` method.
2. A new model is created and the initial weights are loaded from `"initial_weights.weights.h5"`, ensuring that the model starts from the same state in each fold.
3. The model is then trained using the `fit()` function on the training data, for 50 epochs with a batch size of 8. The `verbose=0` argument suppresses the output during training.
4. After training, the model is evaluated on the test data from the current fold, and the accuracy is stored in the `fold_accuracies` list.

Cross-validation is useful because it gives a better estimate of how well the model generalizes and makes overfitting easier to detect. By using different subsets of data for training and testing in each fold, it allows a more robust performance evaluation than a single train-test split. The final result is an average accuracy across all folds, offering a clearer picture of how well the model is likely to perform on unseen data.

In [10]:
# Cross-validation loop
for train_index, test_index in kf.split(X_rest):
    # Split data into training and testing sets for the current fold
    X_train_fold, X_test_fold = X_rest.iloc[train_index], X_rest.iloc[test_index]
    y_train_fold, y_test_fold = y_rest.iloc[train_index], y_rest.iloc[test_index]

    # Create a new model and load initial weights for each fold
    model = create_model()
    model.load_weights("initial_weights.weights.h5")

    # Train the model on the current fold
    model.fit(X_train_fold, y_train_fold, epochs=50, batch_size=8, verbose=0)

    # Evaluate the model on the test fold and store accuracy
    loss, accuracy = model.evaluate(X_test_fold, y_test_fold, verbose=0)
    fold_accuracies.append(accuracy)

Here, we are calculating the average accuracy of the model across all folds in the cross-validation process. The np.mean(fold_accuracies) function computes the mean of the accuracy values stored in the fold_accuracies list, which contains the accuracy of the model for each fold.

The print() statement outputs the average accuracy, formatted to four decimal places. This final metric provides an overall performance measure of the model after it has been evaluated on different subsets of the data.

Averaging across all folds reduces the dependence on any single train-test split and gives a more reliable estimate of the model's generalization ability. A high average suggests the model performs well on different subsets of the data, whereas a low average indicates that the model may not generalize effectively and may require adjustments. With a binary target, the average is best judged against the accuracy of always predicting the more common class.

In [11]:
# Calculate and print the average accuracy across all folds
average_accuracy = np.mean(fold_accuracies)
print(f'Average accuracy across {k} folds: {average_accuracy:.4f}')

Average accuracy across 5 folds: 0.6077
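Because each validation fold holds only about a dozen rows, the spread across folds is as informative as the mean. The sketch below reports both using the standard library; the values in `example_fold_accuracies` are made-up placeholders, not the notebook's per-fold results (which were not printed).

```python
import statistics

# Placeholder values for illustration only; substitute the real fold_accuracies list
example_fold_accuracies = [0.50, 0.75, 0.58, 0.67, 0.50]

mean_acc = statistics.mean(example_fold_accuracies)
std_acc = statistics.stdev(example_fold_accuracies)  # sample standard deviation

print(f"Mean accuracy: {mean_acc:.4f}")
print(f"Std across folds: {std_acc:.4f}")
```

A large standard deviation relative to the mean would suggest that the reported average depends heavily on how the rows happened to be partitioned.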


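This section saves the trained model for future use. Note that `model` here refers to the model from the final cross-validation fold, which was trained on four of the five folds of `X_rest`.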

1. `model.save_weights("trained_weights.weights.h5")`: This line saves only the weights of the model to the file `trained_weights.weights.h5`. Weights represent the learned parameters of the model, and saving them allows you to reload the model's state later without needing to retrain it from scratch. However, since this command saves only the weights, the model architecture (layers, activation functions, etc.) must be redefined when loading the weights for the model to work properly.

2. `model.save("trained_model.keras")`: This line saves both the architecture and the weights of the model together in a single file (`trained_model.keras`). By saving the entire model, you can later load it without needing to redefine the architecture explicitly. This makes it easier to deploy the model or use it in different environments, as the model can be fully restored from the saved file, including both the learned parameters and the structure.

In [12]:
# Save only the weights. The architecture must be defined again before these weights can be loaded.
model.save_weights("trained_weights.weights.h5")

# save the whole model (architecture and weights)
model.save("trained_model.keras")

This section of the code is used to load the trained model and use it for making predictions on new data. First, the trained model is loaded from the file `"trained_model.keras"` using `keras.models.load_model()`. This restores both the model architecture and its trained weights, allowing the model to be used without the need for retraining. Next, the prediction data is loaded from the CSV file `"prediction.csv"` using `pd.read_csv()`. The relevant features, `GD@15`, `CSD@15`, and `XPD@15`, are extracted from this data into `X_pred`, which will be used for predictions. The actual "Win rate" values are also extracted into `actual_win_rate` for comparison against the model's predictions.

Once the features are ready, the model predicts the win probability for each data point by passing `X_pred` through the `model.predict()` function. This produces a probability score between 0 and 1 for each instance, representing the likelihood of a win. The predictions are then converted into binary outcomes with a decision threshold of 0.6: if the probability exceeds this threshold, the prediction is classified as a "win" (1), otherwise it is classified as "no win" (0). This decision threshold is a separate choice from the 0.6 win-rate cutoff used to build the `Win` label; the conventional default for a sigmoid output is 0.5.

Finally, the actual win rate, predicted probability, and binary prediction are displayed for each data point using a loop. Because the actual win rate is a continuous value, a prediction counts as correct when the binary prediction matches whether the actual win rate exceeds 0.6, the same rule used to create the training label. Showing the predicted probabilities next to the binary classifications also indicates how confident the model is in each prediction.

In [13]:
# Load the trained model
model = keras.models.load_model("trained_model.keras")

# Load the prediction data from the CSV file
prediction_data_path = '/content/prediction.csv'
prediction_data = pd.read_csv(prediction_data_path)

# Extract the features and the actual "Win rate" for comparison
X_pred = prediction_data[['GD@15', 'CSD@15', 'XPD@15']]
actual_win_rate = prediction_data['Win rate']  # Assuming "Win rate" column is present

# Make predictions
predictions = model.predict(X_pred)

# Convert probabilities to binary predictions (e.g., threshold of 0.6)
binary_predictions = (predictions > 0.6).astype(int)

# Display actual win rates and model predictions
for i in range(len(X_pred)):
    print("Actual Win Rate:", actual_win_rate.iloc[i])  # Positional access, independent of the index
    print("Predicted Probability of Win:", predictions[i][0])  # Probability of Win
    print("Model Prediction (Win=1/No Win=0):", binary_predictions[i][0])  # Binary prediction
    print("\n")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
Actual Win Rate: 0.5
Predicted Probability of Win: 0.00038116798
Model Prediction (Win=1/No Win=0): 0


Actual Win Rate: 0.375
Predicted Probability of Win: 0.3522403
Model Prediction (Win=1/No Win=0): 0


Actual Win Rate: 0.5
Predicted Probability of Win: 0.9991787
Model Prediction (Win=1/No Win=0): 1


Actual Win Rate: 0.875
Predicted Probability of Win: 0.9035286
Model Prediction (Win=1/No Win=0): 1


Actual Win Rate: 0.625
Predicted Probability of Win: 0.90840214
Model Prediction (Win=1/No Win=0): 1
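To score these predictions, the actual win rates need the same `> 0.6` labeling rule that was used for training. The sketch below does that for the five rows printed above, copying the values from the output; the variable names are introduced here for illustration.

```python
# Values copied from the prediction output above
actual_win_rates = [0.5, 0.375, 0.5, 0.875, 0.625]
predicted_probs = [0.00038116798, 0.3522403, 0.9991787, 0.9035286, 0.90840214]

# Label rule used for the training target, and the decision threshold used in the notebook
actual_labels = [int(rate > 0.6) for rate in actual_win_rates]
predicted_labels = [int(prob > 0.6) for prob in predicted_probs]

n_correct = sum(int(a == p) for a, p in zip(actual_labels, predicted_labels))
accuracy = n_correct / len(actual_labels)

print("Actual labels:   ", actual_labels)
print("Predicted labels:", predicted_labels)
print(f"Correct: {n_correct} of {len(actual_labels)} ({accuracy:.0%})")
```

On these five rows the only mismatch is the third, where a player with a 0.5 win rate receives a predicted probability above 0.99.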




The attached documents include the various configurations of neuron counts per layer that were experimented with to achieve more optimal results, the base training dataset, and a dataset used to evaluate the model's predictions.

# Remarks for ANN
Considering the small sample size of the training data, which consists of fewer than 100 samples, it is not surprising that the model consistently achieves an accuracy between 50% and 68%. The limited dataset size likely hinders the model's ability to capture complex patterns and generalize to unseen data. Small datasets increase the risk of both overfitting and underfitting, as the model may memorize the limited examples it is trained on or fail to learn robust representations from insufficient data.

Additionally, the small sample size results in less variation in the training data, which may prevent the neural network from adequately learning the relationships between the input features (GD@15, CSD@15, and XPD@15) and the target outcome (win/loss). While cross-validation helps mitigate some of these challenges, the model still struggles to improve its predictive accuracy due to the scarcity of meaningful data points.

In this scenario, I opted not to perform extensive data preprocessing, such as normalization or feature engineering, as the dataset is relatively simple and the features are already in a form that is directly interpretable by the model. Given the nature of the features, such as gold and experience differences at specific points in the game, significant preprocessing was deemed unnecessary.
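For readers who want to test that assumption, standardization is inexpensive to try. The following is a minimal sketch of z-score scaling in plain Python, not something the notebook does; the `standardize` helper and the sample `gd15` values are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical gold differences at 15 minutes, for illustration only
gd15 = [-420.0, 150.0, 35.0, 610.0, -275.0]

gd15_scaled = standardize(gd15)
print([round(v, 3) for v in gd15_scaled])
```

In a cross-validation setting, the mean and standard deviation would be computed on the training folds only and then applied to the held-out fold, so that no information leaks from the validation data.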

However, further improvements in accuracy may require augmenting the dataset with more player statistics, possibly by incorporating data from other tournaments or seasons, or by using synthetic data generation techniques. A more complex architecture with additional layers and neurons, by contrast, is unlikely to yield significant improvements without sufficient data, as the model could become overly sensitive to noise. Increasing the dataset size would therefore likely be the more effective strategy for improving the model's performance.

# Discussion Between Different Neural Networks

For the task of predicting a player's likelihood of winning based on in-game statistics (GD@15, CSD@15, and XPD@15), different deep learning architectures could be considered. This segment compares the **Fully Connected Feedforward Neural Network (FNN)** with **Convolutional Neural Networks (CNNs)** and **Recurrent Neural Networks (RNNs)** to evaluate their relevance and performance for this task.

### Fully Connected Feedforward Neural Network (FNN)
The **Fully Connected Feedforward Neural Network (FNN)** is well suited to this classification problem because it makes no assumptions about spatial or temporal structure in its inputs, which fits tabular data like this dataset. FNNs pass the input features through successive layers of neurons, where each neuron is connected to every neuron in the previous and next layers. This structure allows the model to learn interactions between the input features, making it suitable for predicting outcomes based on player statistics. In the context of this problem, the FNN can capture the relationships between the features (GD@15, CSD@15, XPD@15) and the target variable, which indicates whether the player is likely to win the game.

Given that the dataset is relatively small, an FNN with only a few small layers is an appropriate choice, since it has fewer parameters to fit than a deeper architecture. Regularization techniques such as dropout, which the current model does not use, could reduce overfitting further. The use of 5-fold cross-validation in the code also helps reveal overfitting by providing an estimate of model performance across multiple data splits.

### Convolutional Neural Networks (CNNs)
**Convolutional Neural Networks (CNNs)** are typically used for image data or other types of data with a spatial structure. CNNs operate by applying convolutional filters that detect local patterns in the data, which are then combined in deeper layers to form more complex features. This process is highly effective for tasks like image classification, where the data consists of pixels arranged in a grid and the relationships between neighboring pixels are important.

However, CNNs are not suitable for this problem, as the dataset consists of tabular data rather than spatial or image-like data. The features (GD@15, CSD@15, XPD@15) do not have spatial relationships that CNNs are designed to exploit. Therefore, using CNNs would be an inefficient approach for this problem, adding unnecessary complexity without improving the model’s ability to predict the win probability.

### Recurrent Neural Networks (RNNs)
**Recurrent Neural Networks (RNNs)** are designed for sequential data, where the order of the data points is critical. RNNs have an internal loop that allows them to maintain a memory of past inputs, which makes them ideal for tasks involving time-series data, such as speech recognition, language modeling, and stock price prediction.

In contrast, the data in this problem does not have sequential dependencies. GD@15, CSD@15, and XPD@15 are three measurements of the same moment in the game, not successive steps in a sequence. Because RNNs are primarily beneficial for sequential data where past information influences future predictions, they would not be the best choice for this problem. Additionally, due to the small dataset size, using RNNs would likely lead to overfitting and unnecessarily complicate the model.

### Conclusion
The **Fully Connected Feedforward Neural Network (FNN)** is the most suitable architecture for this task. It handles tabular data naturally and works well for classification problems with a fixed set of unordered features, such as predicting a player's win probability based on in-game statistics. On the other hand, **Convolutional Neural Networks (CNNs)** and **Recurrent Neural Networks (RNNs)** are not appropriate for this problem. CNNs are better suited for spatial data, while RNNs are designed for sequential data. Given the small size of the dataset, the FNN provides a simpler solution with less risk of overfitting than more complex models like CNNs or RNNs.