# Social Computing/Social Gaming

# Exercise Sheet 3: Clustering with EVE

In this exercise, we will revisit the K-means clustering algorithm from the lecture and apply it to data gathered from the sandbox MMO *EVE Online*. We will explore whether we can find indications of Bartle's player types when applying the algorithm with k=4. As a refresher, Bartle's model posits that players can be divided into four groups: socializers, achievers, explorers, and killers.

For those of you unfamiliar with EVE, here is a brief explanation that should help in working with the data: In EVE, a player takes control of a clone in a fictional galaxy with a fully functioning, player-driven economy. There are no set roles, but players usually form alliances and choose roles based on what they like to do most. Much of the game revolves around spaceship combat, but many players concentrate on mining and processing the resources needed to build ships and stations. Other players prefer a lone-wolf style of play and either explore the galaxy or hunt other players. EVE's most distinctive feature is the doctrine of no safe spaces: There are zones in which killing is prohibited, but the NPC law enforcement does not prevent attacks; it only retaliates afterwards. This makes preying on new players much easier than in other games that do have safe havens.

For a more detailed explanation about this game, we **strongly recommend** that you watch this [YouTube video](https://www.youtube.com/watch?v=D76acXPGK7I) [1] (6 min).

If you want to know more about the game or feel that this explanation was not helpful enough, the Wikipedia article about EVE Online is a good place to start.

Before the actual task starts, try to think of general behavior patterns that would describe EVE players falling into Bartle's categories. What characterizes achievers? How would you distinguish them from killers?

## Task 3.0: The Data

Below, you will find the code used to gather the data, given a number of character IDs from the game. **Important: Do not run this code! It only serves as part of the explanation for the data!** We have commented it out for you. 

For every player, the following information was retrieved:

- **soloRatio**: Measures how many kills the player has achieved without the assistance of others (please note that this is not a k/d ratio, and that killing a player is different from destroying a ship).
- **secStatus**: An in-game metric that measures criminal activity: Killing players in safe zones (without a valid reason) lowers the security status.
- **shipsDestroyed**: The number of ships the character has destroyed (alone or with the help of others).
- **combatShipsLost**: The number of ships lost that are classified as combat ships.
- **miningShipsLost**: The number of mining ships lost. These ships are used for harvesting resources and have limited to no combat capabilities.
- **exploShipsLost**: The number of exploration ships lost. These ships are used to explore the galaxy and only have limited defensive capabilities.
- **otherShipsLost**: The number of lost ships that could be considered "support" classes: Freighters (i.e. cargo ships), logistics (i.e. a "healer" in a typical MMO), etc.

Note that losing ships might not always be an indication of what ships players actually use the most (which is what we want to know). It could be possible that some players just never lose a certain type of ship, right? However, given the violent nature of the game and [statistics](http://evemaps.dotlan.net/stats) [2] like these, it is very unlikely.

Based on this, we will assume the following regarding the types of players:

- **Explorers**: Low kill/death ratio, high use of exploration ships, rather low kill numbers, high security status
- **Socializers**: Low kill/death ratio, high security status, low soloRatio, high number of non-combat ships lost
- **Achievers**: High kill/death ratio? (Depends on what you define as "achieving")
- **Killers**: High kill/death ratio, high soloRatio, low security status, losing virtually only combat ships

In [None]:
# Important: Do not run this code!
# It only serves as part of the explanation for the data! 

#import json
#import urllib
#import pandas as pd
#import re
#import requests
#import html
#import math


# Create the dataframe to store everything in:
#columns = ['characterID', 'soloRatio', 'secStatus', 'shipsDestroyed', 'combatShipsLost', 'miningShipsLost', 'exploShipsLost', 'otherShipsLost']
#data = pd.DataFrame()
#character_IDs = pd.read_csv(r'EVEPlayerStats.csv')
#IDList = character_IDs['characterID'].values.tolist()

#IDList = IDList[:500]

#for characterID in IDList:
#    print(characterID)
#    end = False
#    i = 2
#    combatShips = 0
#    exploShips = 0
#    miningShips = 0
#    otherShips = 0
    #Get general info: Name, soloratio, secStatus and handle JSON not existing
#    try:
#        link = "https://zkillboard.com/api/stats/characterID/" + str(characterID) + "/"
#        f = requests.get(link)
#        file= json.loads(f.text)
#    except ValueError:
#        print("")
#    if 'gangRatio' not in file:
#        print("")
#    else:
#        print(file['shipsDestroyed'])
#        soloRatio = 100- file['gangRatio']
#        info = file['info']
#        secStatus = info['secStatus']
#        groups = file['groups']

        # numbers for groups are already present in the JSON
#        frame = pd.DataFrame.from_dict(groups)
#        shipLosses =pd.DataFrame(frame.iloc[1])
#        for key, value in shipLosses.iterrows():
#            if (key == "25" or key=="29" or key == "31" or key == "1246" or key == "1250" or key == "311" or key == "361" or key == "363" or
#            key == "365" or key == "417" or key == "471" or key == "1025" or key == "1249" or key == "1273" or key == "1276"):
#                #TODO ignore:
#                print("")
#            elif key == "380" or key == "513" or key == "832" or key == "1202" or key == "1527":
#                if math.isnan(value[0]):
#                    otherShips+=0
#                else:
#                    otherShips+=int(value[0])
#            elif key == "463" or key == "543" or key == "883" or key == "941" or key == "1283":
#                if math.isnan(value[0]):
#                    miningShips+=0
#                else:
#                    miningShips+=int(value[0])
#            elif key == "830":
#                if math.isnan(value[0]):
#                    exploShips+=0
#                else:
#                    exploShips+=int(value[0])
#            else:
#                if math.isnan(value[0]):
#                    combatShips+=0
#                else:
#                    combatShips+=int(value[0])

#        data = data.append({'characterID': characterID, 'soloRatio': soloRatio, 'secStatus': secStatus, 'shipsDestroyed': 0, 'combatShipsLost': combatShips, 
#                            'miningShipsLost': miningShips, 'exploShipsLost': exploShips, 'otherShipsLost': otherShips}, ignore_index = True)
    
#if len(data) > 0:
#    data.to_csv(r'C:\Users\jgott\Documents\EVEPlayerStatsNew.csv')

## Task 3.1: Preparation

Now that you are armed with all the knowledge needed, let us begin. 

**a)** First, **read** the .csv that you downloaded with this exercise into a dataframe.

**b)** Then **drop** the `Unnamed: 0` and `characterID` columns. We don't need them.
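
As a rough sketch of these two steps on made-up data: `pd.read_csv` loads a table and `DataFrame.drop` removes columns by name. The three rows below are invented for illustration and are not part of the exercise data, and the in-memory `io.StringIO` buffer only stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the empty first header field mimics a saved index,
# which pandas names "Unnamed: 0" when reading.
toy_csv = io.StringIO(
    ",characterID,soloRatio,secStatus\n"
    "0,1001,12.0,5.0\n"
    "1,1002,80.0,-9.5\n"
    "2,1003,0.0,0.3\n"
)

toy = pd.read_csv(toy_csv)
print(list(toy.columns))

# Drop both bookkeeping columns in one call.
toy = toy.drop(columns=["Unnamed: 0", "characterID"])
print(list(toy.columns))
```

Passing a list to `columns=` removes several columns at once; `drop` returns a new dataframe, so the result has to be assigned.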

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# –––––––––––––––––– Solution –––––––––––––––––––––––

# TODO 1: read the EVEPlayerStats.csv file
data = #TODO
# TODO 2: drop the Unnamed: 0 and characterID columns
data = #TODO

# ––––––––––––––– End of Solution –––––––––––––––––––

## Task 3.2: Normalizing & Clustering

As you might have seen, the value ranges differ greatly across the metrics. While the number of kills can reach up to 10,000, the security status rarely exceeds 5. This creates an imbalance, because the distance calculation is affected far more by kill counts than by security status. To rectify this and let all metrics influence the result equally, we need to normalize the data.

**a)** **Go through the dataframe and normalize all values to a [0,1] range**

**Hint:** Consider negative values in your normalization, the range is **not** [-1,1].
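
One possible way to handle negative values is min-max scaling: shift each column so that its minimum becomes 0, then divide by the new maximum. The following is only a sketch on invented numbers (the dataframe `toy` and its values are assumptions, not the exercise data), and other implementations are equally valid.

```python
import pandas as pd

# Invented values; secStatus contains negatives on purpose.
toy = pd.DataFrame({
    "secStatus": [-10.0, 0.0, 5.0],
    "shipsDestroyed": [0.0, 500.0, 10000.0],
})

# Step 1: shift every column so that its minimum becomes 0.
shifted = toy - toy.min()

# Step 2: divide by the shifted maximum so that every column ends at 1.
normalized = shifted / shifted.max()

print(normalized)
```

Note that a column in which every value is identical would have a shifted maximum of 0 and turn into NaN, so such columns would need separate handling.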

There is still one problem with the dataset: Even when normalized, the clustering 'favors' players with higher numbers. For example, were we to posit that explorers have lost a high number of exploration ships, then a player with 200 lost exploration ships would be classified as an explorer rather than a player with 5, even if the former also lost 2000 non-exploration ships and the latter only 2. EVE Online has been around since 2003, so some players have played the game extensively for 13 years, while others might have played for only 2. To mitigate this, we should look at ratios in ship losses: What percentage of a player's total ship losses are combat, exploration, mining, or other ships?

**b)** **Convert the normalized absolute numbers into ratios**. Dividing each absolute number by the total number of ships lost by that player gives you the ratio.
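
To illustrate why ratios help, the sketch below revisits the veteran and the newcomer from the example above. It uses invented raw counts for readability; the same row-wise division applies to the normalized columns. The handling of players without any losses (their ratios stay at 0) is an assumption of this sketch, not a requirement of the task.

```python
import pandas as pd

loss_columns = ["combatShipsLost", "miningShipsLost", "exploShipsLost", "otherShipsLost"]

# Invented loss counts: a veteran, a newcomer, and a player without losses.
toy = pd.DataFrame(
    [[2000, 0, 200, 0],
     [2, 0, 5, 0],
     [0, 0, 0, 0]],
    columns=loss_columns,
    dtype=float,
)

# Total losses per player; a total of 0 is replaced by 1 to avoid dividing by zero.
total = toy[loss_columns].sum(axis=1)
ratios = toy[loss_columns].div(total.where(total > 0, 1.0), axis=0)

print(ratios.round(3))
```

After the conversion, the newcomer has the larger exploration share even though the veteran lost far more exploration ships in absolute terms.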


**c)** **Cluster the dataset** with the k-Means algorithm and print out the centroids.

**Hint:** For the clustering, we will use the k-means algorithm provided by the scikit-learn library. Import the algorithm and use the `fit()` function to let the algorithm do its work. Remember to set the number of clusters to 4.
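
A minimal sketch of the scikit-learn workflow on invented 2D points (the blobs, the variable names, and the choice of `n_init` and `random_state` are assumptions made for reproducibility, not part of the exercise):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented, well-separated blobs in 2D stand in for the player data.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
points = np.vstack([c + 0.05 * rng.standard_normal((25, 2)) for c in centers])

# n_init and random_state are set explicitly so the run is reproducible.
toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

print(toy_kmeans.cluster_centers_)   # one row per centroid
print(toy_kmeans.labels_[:10])       # cluster index assigned to each point
```

The fitted object exposes the centroids as `cluster_centers_` and the per-sample assignments as `labels_`. The numbering of the clusters is arbitrary and can change between runs with different seeds.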

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# TODO 1: Normalize all values to a [0,1] range.

# determine minimum value per column
minValue = #TODO

# normalize values into a [0, inf) interval
for columnName,columnData in data.items():
    data[columnName] = #TODO

# normalize values into a [0, 1] interval
for columnName,columnData in data.items():
    data[columnName] = #TODO
    
# TODO 2: Convert the normalized absolute numbers into ratios.
for index,row in data.iterrows():
    #TODO

# TODO 3: Cluster the dataset with the k-Means algorithm and print out the centroids.
kmeans = #TODO


**Tutor Note:** Tutors, please be aware that there are multiple ways of implementing this. The solution provided above is just an example. Students may come up with different solutions. As long as the code works as expected and the normalization range is [0,1], the points should be awarded accordingly.

## Task 3.3: Analyzing the results

### **a)** Heatmap

Since we have 7 features for each player in total, our datapoints lie in a 7-dimensional space. It can be tricky to read a 7-dimensional graph, so we will first use a heatmap to analyze our data. A heatmap is a data visualization technique that shows the data as color in two dimensions.

**1.** Use the seaborn library to **generate a heatmap**. For readability purposes, **display the last 20 players** from the processed dataset **only**.  
**Hint:** If you feel like the graph is too small, scale it up a bit.

**2.** From these 20 entries, **choose 4** that you think are the most representative for each of Bartle's player groups (one for each group) and **briefly explain** why you chose them based on the heatmap.
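
For orientation, here is a sketch of how a heatmap of the last 20 rows could look with seaborn. The random dataframe `toy`, the column subset, and the colormap are assumptions for illustration only. The `Agg` backend line is there so the sketch also runs outside a notebook and should be omitted in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented stand-in for the processed data: 50 players, 3 features in [0, 1].
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((50, 3)),
    columns=["soloRatio", "secStatus", "exploShipsLost"],
)

last_players = toy.tail(20)

# A taller figure keeps the 20 rows legible.
fig, ax = plt.subplots(figsize=(8, 10))
sns.heatmap(last_players, vmin=0, vmax=1, cmap="viridis", annot=True, fmt=".2f", ax=ax)
ax.set_xlabel("feature")
ax.set_ylabel("player index")
```

Fixing `vmin` and `vmax` to the normalized range keeps the color scale comparable between different subsets of players.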

In [None]:
import seaborn as sns
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# TODO:

**TODO 2: Write your observations here**

### **b)** t-SNE 

Heatmaps are nice, but they become unreadable when we want to display large amounts of data. Therefore, we introduce an algorithm called [t-SNE](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) [3] that can transform a high-dimensional dataset into a two-dimensional plot. For more information, you can check out the linked paper. For a simple but intuitive explanation, have a look at [this video](https://www.youtube.com/watch?v=NEaUSP4YerM) [4].

**1.** **Run the given code to generate a t-SNE graph**. Look at the plot and choose **one** cluster which you want to analyze. From this cluster **choose 2-3 players**, analyze their stats and describe your observations. Can you tell what kind of player type the cluster represents in Bartle's model? Can you explain the meaning of the distance between the clusters?  
**Hint**: You can see the assigned clusters for each player with the list `kmeans.labels_`
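
One way to pick out the players of a single cluster is a boolean mask over the label array. The sketch below uses invented stand-ins: `toy` plays the role of the processed dataframe and `toy_labels` the role of `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# Invented stand-ins: `toy` for the processed data, `toy_labels` for kmeans.labels_.
toy = pd.DataFrame({
    "soloRatio": [0.9, 0.1, 0.8, 0.2, 0.85],
    "secStatus": [0.1, 0.9, 0.2, 0.8, 0.15],
})
toy_labels = np.array([0, 1, 0, 1, 0])

chosen_cluster = 0
members = toy[toy_labels == chosen_cluster]

print(members.head(3))   # a few individual players from the cluster
print(members.mean())    # average profile of the cluster
```

Comparing individual rows against the cluster mean gives a quick impression of how homogeneous the cluster is.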


**Note:** If you get the impression that the clustering is not very accurate, do not feel discouraged: the data set does not contain enough information about the players' other activities besides destroying ships.

In [None]:
# For the tutors: use this to get a better overview of the stats
from IPython.display import display

tempData = data

lists = [[], [], [], []]
for i, el in enumerate(kmeans.labels_):
    lists[el].append(i)

display(tempData.iloc[lists[0]].mean(), tempData.iloc[lists[1]].mean(), tempData.iloc[lists[2]].mean(), tempData.iloc[lists[3]].mean())

In [None]:
# t-SNE Graph
def tsne(tempData):
    tsne = TSNE(n_components=2, random_state=0)
    X_2d = tsne.fit_transform(tempData)

    new = tempData.copy()
    new['tsne-2d-one'] = X_2d[:,0]
    new['tsne-2d-two'] = X_2d[:,1]

    plt.figure(figsize=(16,10))
    sns.scatterplot(
        x = "tsne-2d-one", y = "tsne-2d-two",
        hue = kmeans.labels_,
        palette = sns.color_palette("hls", 4),
        data = new,
        legend = "full"
    )
    
tsne(tempData)

**TODO: Write your observations here**

### **c)** DBSCAN and Gaussian Mixture
Next, look at the clustering with DBSCAN and a Gaussian Mixture Model. Compare these two with k-Means and write down your observations.
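
Before running the cells on the real data, it may help to see how the outputs of the two algorithms differ on a tiny invented example. This is a sketch under assumed parameters (`eps=0.3`, `min_samples=5`, two components), chosen only to suit the toy points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# Two invented dense blobs plus one far-away outlier.
rng = np.random.default_rng(0)
blob_a = 0.05 * rng.standard_normal((30, 2))
blob_b = 0.05 * rng.standard_normal((30, 2)) + [1.0, 1.0]
outlier = np.array([[5.0, 5.0]])
points = np.vstack([blob_a, blob_b, outlier])

# DBSCAN: no cluster count is given; isolated points receive the label -1 (noise).
db_labels = DBSCAN(eps=0.3, min_samples=5).fit(points).labels_
print(sorted(set(db_labels)))

# Gaussian mixture: the cluster count is given; every point gets one
# membership probability per component instead of a single hard label.
gm = GaussianMixture(n_components=2, random_state=0).fit(points)
print(gm.predict_proba(points[:3]).round(3))
```

Unlike k-Means, DBSCAN may return more or fewer clusters than expected and can leave points unassigned, while the Gaussian mixture assigns every point, including the outlier, to some component.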

In [None]:
from sklearn.cluster import DBSCAN
# change the parameters if you want
clustering = DBSCAN(eps = 0.2, min_samples=10, algorithm='ball_tree').fit(data)

tempData2 = data

# DBSCAN chooses the number of clusters itself and labels noise points as -1,
# so we collect the labels that actually occur instead of assuming four clusters
dbscanLabels = sorted(set(clustering.labels_))

def tsne(tempData):
    tsne = TSNE(n_components=2, random_state=0)
    X_2d = tsne.fit_transform(tempData)

    new = tempData.copy()
    new['tsne-2d-one'] = X_2d[:,0]
    new['tsne-2d-two'] = X_2d[:,1]

    plt.figure(figsize=(16,10))
    sns.scatterplot(
        x = "tsne-2d-one", y = "tsne-2d-two",
        hue = clustering.labels_,
        # one color per label that occurs, including the noise label
        palette = sns.color_palette("hls", len(dbscanLabels)),
        data = new,
        legend = "full"
    )

tsne(tempData2)

# mean stats per DBSCAN label (label -1 collects the noise points)
for label in dbscanLabels:
    print("Label", label)
    display(tempData2[clustering.labels_ == label].mean())

In [None]:
from sklearn.mixture import GaussianMixture
gauss = GaussianMixture(n_components=4, random_state=0).fit(data)

tempData3 = data
labels3 = gauss.predict(data)

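# use a separate name so that the tsne() helper function defined above is not overwritten
tsneModel = TSNE(n_components=2, random_state=0)
X_2d = tsneModel.fit_transform(tempData3)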

plt.figure(figsize=(9,7))
sns.scatterplot(data=data, 
                x=X_2d[:,0],
                y=X_2d[:,1], 
                hue=labels3,
                palette=sns.color_palette("hls", 4))


**Compare k-Means, DBSCAN and Gaussian Mixture and write down your findings here:**

## References

[1] https://www.youtube.com/watch?v=D76acXPGK7I
<br>[2] http://evemaps.dotlan.net/stats
<br>[3] http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
<br>[4] https://www.youtube.com/watch?v=NEaUSP4YerM