<a href="https://colab.research.google.com/github/3RFUNn/Computational-Game-Design/blob/main/ProfilingLab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Erfan Rafieioskouei - 240842587

## ECS7017P Lab 2

**Complete the exercises in this notebook and submit it as part of Coursework 1.**

# Player Profiling in Dota 2

In this week’s lab you will use this [Jupyter Notebook](https://docs.jupyter.org/en/latest/) to create a set of **player profiles** for the popular MOBA game *Dota 2*. Profiles separate players into groups who are similar in some way. The notebook guides you through this process. However, you will need to make your own decisions and interpret the results as part of the analysis. You are also free to remove players or adjust features if that helps the profiling, and to add extra visualisations if they help you understand the data.

A notebook is a series of editable cells, containing either Python code (Code cells) or formatted text (Markdown cells). This notebook was written for [Google Colab](https://colab.research.google.com/), but it can also be run on another cloud service (like [Binder](https://mybinder.org/)) or [locally](https://jupyter.org/install).

### Exercises

The exercises in this lab are assessed, as part of the first coursework in this module. There are 25 marks available.
* Exercise 1: examine correlations between the original features. (2 points)
* Exercise 2: derive some new features to describe players. (3 points)
* Exercise 3: extract new features and reduce the dimensionality of the dataset using PCA. (6 points)
* Exercise 4: use K-Means to look for clusters in the reduced space, and interpret the results as player profiles. (6 points)
* Exercise 5: consider the cluster tendency of the dataset. (2 points)
* Exercise 6: apply AHC to the same dataset. (6 points)

### Submitting Your Work

Save/download a copy of this notebook and include that **ipynb file** in your ZIP file submission for Coursework 1.
That file should include all your lab work, i.e. any code/text you've added, and any results and images you generated.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Dota 2

In Dota 2, two teams (*Radiant* and *Dire*) of five players compete in a match. Each player controls a different *hero* and fights opposing heroes in player versus player (PvP) combat on a large battlefield. The goal is to destroy the opposing team's *base* while defending your own. Both teams have weaker NPC members (*creeps*) and defensive buildings which their opponents will try to destroy, e.g. *barracks* (called *rax*) and *towers*. The map also contains neutral NPCs and buildings.

<img src="https://cdn.fastly.steamstatic.com/steam/apps/570/header.jpg" style="margin-left:auto; margin-right:auto"/>

**Experience points (XP)** are gained by killing enemy units, or being nearby as enemy units get killed (an *assist*). XP is used to increase a hero's level, which raises the hero's attributes and unlocks new abilities.

**Gold** is earned in various ways, including hero kills/assists, destroying buildings, and *last hitting* enemy creeps (landing the final blow). It is also gained passively at regular intervals throughout the game. Gold is used to purchase items, and to bypass the respawn cooldown after death (*buyback*). Players may *deny* opponents gold by killing their own team's creeps.

# The Dataset

The dataset we will be using provides in-game metrics for 865 players. Each player has a unique ID (`PlayerID`) and the following features:

* `GamesPlayed`: number of games the player has played.
* `GamesWon`: number they have won.
* `Ditches`: how many times the player was thrown out of a game.
* `GamesLeft`: how many times they voluntarily left a game early.
* `Points`: experience points gained.
* `Kills`: enemy heroes the player killed.
* `KillsPerMin`: the player’s mean kills per minute played.
* `Deaths`: number of times the player died.
* `Assist`: number of times the player was near, or damaged, an enemy hero who was killed.
* `CreepsKilled`: enemy creep kills.
* `CreepsDenied`: own team's creeps killed.
* `NeutralsKilled`: neutral NPCs killed.
* `TowersDestroyed`: tower buildings destroyed.
* `RaxsDestroyed`: barracks buildings destroyed.
* `TotalTime`: total playtime logged for this player.

# Getting Started

Download the Dota dataset from QMPlus and import it into a Pandas dataframe.

In [None]:
dota = pd.read_csv("DoTalicious_lab4.csv")
dota.describe()

PlayerID is a **nominal** variable: these numbers are just names for individual players. The rest of the features are numeric, and will be the basis for our player profiles.

In [None]:
numData = dota.drop(columns='PlayerID') # Ignore player IDs
numFeatures = list(numData.columns.values) # Make a list of numeric features
numFeatures # Show the list

In [None]:
dota[numFeatures].hist(bins=20,figsize=(10,30),layout=(8,2))
plt.subplots_adjust(hspace=1)

# Correlations

Let's see how these features are [correlated](https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/types-of-correlation.html).

In [None]:
# Calculate correlation between each pair of features
correlationMatrix = dota[numFeatures].corr()
# Display correlation matrix
correlationMatrix.style.background_gradient(cmap='Reds')

It seems many of the features are strongly correlated with how many games were played or how much playtime was logged. This makes sense: for example, the number of games you win (`GamesWon`) is going to be higher the more games you play (`GamesPlayed`).
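To see why raw totals tend to track the number of games, here is a minimal sketch on synthetic data (not the lab dataset). The win probabilities and the `AvgRate` column are made-up assumptions for illustration: a total that accumulates over games correlates strongly with `GamesPlayed`, while a quantity that is already a rate does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic players (NOT the lab dataset): totals accumulate over games played
games_played = rng.integers(10, 500, size=200)
win_prob = rng.uniform(0.3, 0.7, size=200)           # assumed per-player skill
toy = pd.DataFrame({
    "GamesPlayed": games_played,
    "GamesWon": rng.binomial(games_played, win_prob),  # a running total
    "AvgRate": rng.normal(0.1, 0.02, size=200),        # hypothetical rate feature
})

# Correlation of every other feature with GamesPlayed, strongest first
corr_with_games = toy.corr()["GamesPlayed"].drop("GamesPlayed")
print(corr_with_games.sort_values(ascending=False))
```

The same one-liner (`dota[numFeatures].corr()["GamesPlayed"]`) can be applied to the real dataframe if you prefer a sorted list to the coloured matrix.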

---
# Exercise 1

Which feature(s) are not strongly correlated with `GamesPlayed`? Why do you think this is?

Your answer: ...

---

# Feature Engineering


In order to profile players using these in-game metrics, we should focus on how players perform in an *average* game. We want to engineer some new metrics which are independent of how many matches were logged for a particular player (`GamesPlayed`), or how much of their playtime was logged overall (`TotalTime`).

For example, we can define a new feature `WinRate`, which measures a player’s wins per game.

In [None]:
dota['WinRate'] = dota['GamesWon'] / dota['GamesPlayed']

dota[['WinRate']].hist(bins=20,figsize=(10,30),layout=(8,2))
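A ratio such as `WinRate` is undefined for a player with zero logged games. Whether the lab dataset contains such players is not stated here, so the sketch below uses a toy frame (not the lab data) to show one possible guard: replace zero denominators with `NaN` before dividing, then decide separately what to do with the affected rows.

```python
import numpy as np
import pandas as pd

# Toy frame (NOT the lab dataset); the last player has no logged games
toy = pd.DataFrame({
    "GamesPlayed": [10, 4, 0],
    "GamesWon":    [6, 1, 0],
})

# Replace zero denominators with NaN so the ratio is NaN rather than 0/0
games = toy["GamesPlayed"].replace(0, np.nan)
toy["WinRate"] = toy["GamesWon"] / games

print(toy)
print(toy["WinRate"].isna().sum(), "player(s) with undefined WinRate")
```

Rows with an undefined rate can then be dropped or inspected by hand, which is a modelling decision rather than something the code should hide.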

---
# Exercise 2

Following the example of `WinRate`, consider which existing features should be made independent of games played or total playtime. Define these new features. Briefly explain your choice. Plot the distributions for the new features.

Our player profiling will use the new features, and any original features that were already time-independent.

Define list `myFeatures` to include ALL the features you intend to use for player profiling.

In [None]:
# Add your code
myFeatures = ['WinRate']

Explain your choice: ...

---

# Scaling Features

Before we work with the data, we need to scale our selected features.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Scale our features
dota[myFeatures] = scaler.fit_transform(dota[myFeatures])
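As a rough illustration of what `MinMaxScaler` does with its default settings, the sketch below rescales two made-up features on very different scales and reproduces the result by hand with `(x - min) / (max - min)` for each column. The numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different scales (values are made up)
X = np.array([[0.2, 1000.0],
              [0.5, 4000.0],
              [0.8, 2500.0]])

scaled = MinMaxScaler().fit_transform(X)

# The same result by hand: (x - min) / (max - min), column by column
by_hand = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
print(np.allclose(scaled, by_hand))
```

Without scaling, the second column would dominate any distance or variance calculation simply because its numbers are larger, which matters for both PCA and K-Means.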

# Feature Extraction with PCA

We will now try to **extract** some new features from the data using PCA. Define the variable `nPC` as the maximum number of principal components to compute. Then apply PCA to the data.

In [None]:
from sklearn.decomposition import PCA
nPC = 1 # Choose a suitable value here
pca = PCA(n_components=nPC).fit(dota[myFeatures]) # Compute PCA
dota_pca = pca.transform(dota[myFeatures]) # Project the player data in the new space

Let's look at the player data plotted against the first two principal components (this requires `nPC` to be at least 2).

In [None]:
plt.scatter(dota_pca[:,0],dota_pca[:,1])

Examine the proportion of the total variance explained by each of the principal components.

In [None]:
pca.explained_variance_ratio_

Now plot these values for each principal component (PC1, PC2, PC3, etc.):

In [None]:
plt.plot(np.arange(pca.n_components_) + 1, pca.explained_variance_ratio_, 'ro-', linewidth=2)
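One common way to read a plot like this is to accumulate the proportion of variance explained and see how many components are needed to pass a chosen threshold. The sketch below does this on synthetic data (not the lab dataset) built from two hidden factors. The 95% threshold is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (NOT the lab dataset): 5 features driven by 2 hidden factors
factors = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(300, 5))

pca_toy = PCA().fit(X)                      # keep all components
ratios = pca_toy.explained_variance_ratio_  # proportion of variance per PC
cumulative = np.cumsum(ratios)

# Smallest number of components reaching the (arbitrary) 95% threshold
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(ratios.round(3), cumulative.round(3), n_keep)
```

Other criteria, such as looking for an elbow in the plot, can give a different answer, so it is worth comparing more than one.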

Finally, we examine the loadings of your features on the principal components, i.e. the weights used to calculate each principal component as a linear combination of input features.

In [None]:
pcLabels = ["PC"+str(i) for i in range(1,nPC + 1)]
loadings = pd.DataFrame(pca.components_.T, columns=pcLabels, index=myFeatures)
loadings.style.background_gradient(cmap='bwr',vmin=-1,vmax=1)
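To make the "linear combination" concrete, the sketch below checks on synthetic data that the scores returned by `transform` are the mean-centred features multiplied by the weights in `components_`. It assumes the default `PCA` settings (no whitening).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))   # synthetic stand-in for the scaled features

pca_toy = PCA(n_components=2).fit(X)

# Each PC score is a weighted sum of the mean-centred features,
# with the weights given by the rows of components_
manual_scores = (X - pca_toy.mean_) @ pca_toy.components_.T

print(np.allclose(manual_scores, pca_toy.transform(X)))
```

A large positive or negative weight therefore means that feature pushes a player's score on that component up or down, which is the basis for interpreting each PC.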

---
# Exercise 3

The next step is to drop the less important principal components, retaining those which tell us most about how players vary in their behaviour. Which principal components should we keep, and why? Is there more than one alternative?

Provide an interpretation for each of the components you intend to keep, i.e. how each one describes players.

Your answer:

---
Define a list `pcs` of the principal components you are keeping. Add each of these as a new column in the dataframe. This is a new lower dimension feature space you can use to analyse player behaviour.

For example, if you only wanted to keep the first two principal components (i.e. reducing the data to two dimensions):

In [None]:
# Edit this with your choice of PCs
pcs = ['PC1','PC2']
dota['PC1'] = dota_pca[:,0]
dota['PC2'] = dota_pca[:,1]

---

# Clustering Players with K-Means

The next stage is to try to cluster players in your lower-dimensional feature space using K-Means. We will have to decide on the number of clusters `K`.

Let’s define a function `kmeans_eval` to evaluate K-Means clusterings using both the within-cluster sum of squares (WSS) and silhouette scores.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = range(2, 10) # Range of K values to examine

# Compute and evaluate K-means clusters for all k_values
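def kmeans_eval(some_data):
    wss = []          # within-cluster sum of squares for each K
    silhouettes = []  # mean silhouette score for each K
    for k in k_values:
        # Fixed seed and explicit n_init keep results reproducible between runs
        model = KMeans(n_clusters=k, n_init=10, random_state=0)
        km_labels = model.fit_predict(some_data)
        wss.append(model.inertia_)
        silhouettes.append(silhouette_score(some_data, km_labels))
    return wss, silhouettes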

# Apply this to our data
wss, silhouettes = kmeans_eval(dota[pcs])
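For reference, the WSS that scikit-learn exposes as `inertia_` is the sum of squared distances from each point to the centroid of its own cluster. The sketch below recomputes it by hand on two synthetic blobs (not the lab dataset) and compares the two values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two synthetic blobs (NOT the lab dataset)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WSS: squared distance from each point to the centroid of its own cluster
own_centroids = model.cluster_centers_[model.labels_]
wss_by_hand = ((X - own_centroids) ** 2).sum()

print(wss_by_hand, model.inertia_)
```

WSS always falls as K grows, so it is the shape of the curve that matters rather than its minimum.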

Plot the WSS for each value of K.

In [None]:
# Plot WSS for each K
plt.plot(k_values, wss, '-o', color='black')
plt.ylabel('WSS')
plt.show()

Repeat this plot for the silhouette widths. Use these plots to decide on a good choice for K. Define this with a variable `goodK`.

In [None]:
# Plot the silhouette widths for each K

In [None]:
goodK = 1

Rerun K-Means with your chosen number of clusters and visualise the results using the first two principal components. You may want to add plots for other PCs.

In [None]:
# Fixed seed and explicit n_init keep the cluster assignment reproducible
km = KMeans(n_clusters=goodK, n_init=10, random_state=0)
dota['Cluster1'] = km.fit_predict(dota[pcs])
plt.scatter(dota['PC1'],dota['PC2'],c=dota['Cluster1'])
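A scatter plot shows where the clusters sit, but interpreting them usually means going back to the features. One possible approach, sketched here on synthetic players (not the lab dataset), is to average each feature within each cluster. The `KillsPerGame` column is a hypothetical feature invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic players with two made-up per-game features (NOT the lab dataset)
toy = pd.DataFrame({
    "WinRate": np.concatenate([rng.normal(0.4, 0.03, 60),
                               rng.normal(0.6, 0.03, 60)]),
    "KillsPerGame": np.concatenate([rng.normal(3, 0.5, 60),
                                    rng.normal(8, 0.5, 60)]),
})

toy["Cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)

# Mean of each feature within each cluster, plus the cluster sizes
profile = toy.groupby("Cluster").mean()
profile["Size"] = toy["Cluster"].value_counts().sort_index()
print(profile)
```

On the real data the equivalent would be something like `dota.groupby('Cluster1')[myFeatures].mean()`, read alongside the loadings.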

---
# Exercise 4

Why did you choose this value for K? Interpret the clusters as a set of player profiles. Provide an appropriate **name** and **description** of player behaviour for each profile.

Your answer: ...

---
# Exercise 5

Do you think the player data has high cluster tendency? **Justify** your answer using the `pyclustertend` package ([Documentation](https://pyclustertend.readthedocs.io/en/latest/)).
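As background on what a cluster tendency measure looks at, here is a hand-written sketch of the Hopkins statistic using NumPy and scikit-learn rather than `pyclustertend`. The function `hopkins_sketch` is a hypothetical helper written for this illustration, and it uses the convention where values near 1 suggest clustering and values near 0.5 suggest uniformly scattered data. Some packages report the complement, so check the documentation of whichever implementation you use.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_sketch(X, m=50, seed=0):
    """Hopkins statistic; here values near 1 suggest clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from m sampled data points to their nearest *other* point
    sample_idx = rng.choice(len(X), size=m, replace=False)
    w = nn.kneighbors(X[sample_idx])[0][:, 1]   # column 0 is the point itself

    # u: distance from m uniform random points (in the bounding box) to the data
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(4)
blobs = np.vstack([rng.normal(0, 0.1, (150, 2)), rng.normal(3, 0.1, (150, 2))])
noise = rng.uniform(0, 1, (300, 2))
print(hopkins_sketch(blobs), hopkins_sketch(noise))
```

The statistic depends on the random sample, so repeating it with several seeds gives a better picture than a single value.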

In [None]:
# Your code
from pyclustertend import vat


Your answer: ...

---
# Exercise 6

Read the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) for Agglomerative Hierarchical Clustering (AHC) and [their example code](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html).

Apply AHC to the Dota dataset and visualise the results as a dendrogram. Provide an interpretation of the results in terms of player behaviour.
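As a small illustration of the mechanics, the sketch below runs AHC on three synthetic blobs (not the lab dataset). It uses SciPy's `linkage`, `fcluster` and `dendrogram` rather than the scikit-learn class linked above, and the choice of Ward linkage and three clusters is an assumption made for this toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(5)
# Three synthetic blobs (NOT the lab dataset)
X = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in (0, 3, 6)])

# Ward linkage merges, at each step, the pair of clusters
# whose merger increases the within-cluster variance the least
Z = linkage(X, method="ward")

# Cut the tree into at most 3 flat clusters (labels start at 1)
labels = fcluster(Z, t=3, criterion="maxclust")

# dendrogram() draws with matplotlib by default; no_plot=True returns the layout only
tree = dendrogram(Z, no_plot=True)
print(np.bincount(labels)[1:], len(tree["leaves"]))
```

Dropping `no_plot=True` draws the dendrogram, where long vertical gaps between merges suggest natural places to cut the tree.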

In [None]:
# Your code


Your answer: ...

# Submission Reminder

Your Coursework 1 ZIP file submission should include a modified copy of this notebook (a .ipynb file) documenting your lab work.