SMITE is a third-person MOBA developed by Titan Forge Games. In SMITE's primary competitive mode, two teams of five players each select deities from a pool of over 100, and each player then chooses 6 items to purchase over the course of the game from a pool of more than 250. This project focuses on the second of these challenges, attempting to provide a beginner-friendly guide to which items are currently performing well on particular deities.

SMITE's developer API exposes a plethora of information, including kills, deaths, and assists; damage dealt, taken, and mitigated; and team and self healing. To begin, I will explore the connections between a player's performance statistics and their likelihood of victory.

I will investigate both engineered features (such as (K + A) / D, (DM + SH) / DT, or any other interesting combinations) and the raw features themselves to see whether there is a linear or nonlinear relationship between these features and victory.
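
As a minimal sketch of the two ratios named above, assuming they are meant as (K + A) / D and (DM + SH) / DT: the column names below (`kills`, `assists`, `deaths`, `damage_mitigated`, `self_healing`, `damage_taken`) and the helper `add_engineered_features` are hypothetical placeholders rather than the names used in the API data, and flooring each denominator at 1 is my own assumption to avoid dividing by zero.

```python
import pandas as pd


def add_engineered_features(stats: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `stats` with two ratio features added.

    All column names here are hypothetical placeholders for this sketch.
    """
    out = stats.copy()
    # (K + A) / D, with deaths floored at 1 so a deathless game does not divide by zero.
    out['kda'] = (out['kills'] + out['assists']) / out['deaths'].clip(lower=1)
    # (DM + SH) / DT: damage mitigated plus self healing, per point of damage taken.
    out['survivability'] = (
        (out['damage_mitigated'] + out['self_healing'])
        / out['damage_taken'].clip(lower=1)
    )
    return out


# Two made-up players, purely for illustration.
example = pd.DataFrame({
    'kills': [5, 2],
    'assists': [7, 4],
    'deaths': [3, 0],
    'damage_mitigated': [9000, 4000],
    'self_healing': [1000, 0],
    'damage_taken': [20000, 8000],
})
features = add_engineered_features(example)
print(features[['kda', 'survivability']])
```

The resulting columns can be plotted against `win_status` in the same way as the raw features.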

Next, I will explore the relationship between a player's performance and the items they ended the game with. While I cannot see the items purchased or sold throughout the game, the structure of the in-game economy discourages selling most purchased items, so the final build is the highest-resolution data available. Additionally, I prune the item data (restricting it to tier 3 and above items) to prevent skewing.

In [1]:
import json
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sbn
import pandas as pd
import numpy as np
from tqdm import tqdm

from typing import Any, Dict, NamedTuple
import os
from copy import deepcopy

from utilities import data_to_df, data_to_training_df

In [3]:
training_dfs = []
all_dfs = []
root = 'P:/SmiteData/conquest_match_data'
files = os.listdir(root)
for file in tqdm(files):
    training_df, scaler = data_to_training_df(os.path.join(root,file))
    training_dfs.append(training_df)
    all_dfs.append(data_to_df(os.path.join(root, file), scaler))
    
df = training_dfs[0]

100%|█████████████████████████████████████████████████████████████████████████████| 112/112 [00:45<00:00,  2.45it/s]


In [None]:
# First, let's plot our raw features to see any visual correlations between them and victory
for col in df.columns:
    # Skip the bias term, the label, and the match identifier
    if col in ('BIAS', 'win_status', 'match_id'):
        continue
    sbn.boxplot(data=df, x='win_status', y=col)
    plt.show()

From these plots we can see that the distributions of many statistics differ markedly between wins and losses, indicating that there may be a correlation between a player's "performance data" and their chance of victory. Let's see what these data points look like when remapped into a 2-D space.

In [None]:
two_d = TSNE(random_state=0, n_jobs=-1).fit_transform(df.values[:,1:-2])

In [None]:
plt.cla()
# Color each embedded point by its true outcome: blue = win, red = loss
c = ['tab:blue' if e == 1 else 'tab:red' for e in df.values[:,-2]]
plt.scatter(two_d[:, 0], two_d[:, 1], c=c, alpha=0.3)
plt.show()

Once again, the data appear separable into "unlikely to win" and "likely to win" regions, which suggests that a classifier should fit them well. First, let's try fitting a linear model, then see how much better we can do with a nonlinear one!

In [None]:
# The first step is to split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(df.values[:,:-2], df.values[:,-2], test_size=0.33, random_state=0)

In [None]:
parameters_lrc = {
    'C': [10 ** x for x in range(-5,0)]
}
# NOTE: look into LogisticRegressionCV (the cross-validated version of this) and check different params
lrc = LogisticRegression(random_state=0)
clf_lrc = GridSearchCV(lrc, parameters_lrc, n_jobs=-1)
clf_lrc.fit(x_train, y_train.ravel())

In [None]:
cl_df = pd.DataFrame(clf_lrc.cv_results_)
cl_df.loc[cl_df['rank_test_score'] == 1]
# NOTE: best params are C=0.1 (the largest value in the grid searched above)

In [None]:
new_l = clf_lrc.predict(df.values[:,:-2])
clf_lrc.score(x_test,y_test)

In [None]:
plt.cla()
truth = df.values[:,-2]
# Blue/red = predicted win/loss; olive = prediction disagrees with the true label
c = [
    'tab:olive' if pred != actual else ('tab:blue' if pred == 1 else 'tab:red')
    for pred, actual in zip(new_l, truth)
]
plt.scatter(two_d[:, 0], two_d[:, 1], c=c, alpha=0.3)
plt.show()

In [None]:
parameters_rbf = {
    'kernel': ['rbf'],
    'C': [10 ** x for x in range(-3, 3)],
    'gamma': [10 ** x for x in range(-3, 3)],
}
svm = SVC(random_state=0)

In [None]:
clf_rbf = GridSearchCV(svm, parameters_rbf, n_jobs=-1)
clf_rbf.fit(x_train, y_train.ravel())

In [None]:
cr_df = pd.DataFrame(clf_rbf.cv_results_)
cr_df.loc[cr_df['rank_test_score'] == 1]
# NOTE: best params are C=100, gamma=0.001, kernel=rbf

In [None]:
new_l = clf_rbf.predict(df.values[:,:-2])
clf_rbf.score(x_test, y_test)

In [None]:
plt.cla()
truth = df.values[:,-2]
# Blue/red = predicted win/loss; olive = prediction disagrees with the true label
c = [
    'tab:olive' if pred != actual else ('tab:blue' if pred == 1 else 'tab:red')
    for pred, actual in zip(new_l, truth)
]
plt.scatter(two_d[:, 0], two_d[:, 1], c=c, alpha=0.3)
plt.show()

First, I will do some visual exploration of the data, examining the characteristics of each statistic and how it correlates with victory for a particular deity. Once the structure is complete, I intend to try it on one deity of each class.

Next, I'd like to determine whether a linear model will fit the data or whether a non-linear model is required, using SGDC for the linear case and SVMC for the non-linear case.

After that, I need to determine whether and how any data adjustment or de-fuzzing affects the final item classification accuracy (essentially, does "correcting" the performance data labelling favorably impact the item classification accuracy?).

After that, I may explore different classifiers for items and performance, to see whether the performance statistics can be accurately predicted from the items (this would support the performance / item correlation).

Now, I want to see if I can get even better prediction by seeing all the statistics of all the players on a team.

**Determine the noise caused by teammates and if that noise can be meaningfully reduced by taking the prediction of the linear model rather than the actual win label**

Steps:
* Load all gods and make predictions for every entry.
* Assemble all 5-man teams.
* Train a classifier to predict team victory based on the probability of individual player victory.
* Try some mix-and-match, substituting a neutral 0.5 victory probability for teammates and seeing what the output is.
* Determine whether the output of the linear classifier matches the victory predictions for teammates across a range of probabilities, from 0.25 to 0.75.
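
The team-assembly and substitution steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes a frame with `match_id`, `prediction`, and `win_status` columns, and it assumes the two teams in a match can be told apart by `win_status` (one team wins, the other loses). The helper names `assemble_teams` and `mask_teammates` are made up for this sketch, and the toy match has two players per side to keep it short.

```python
import numpy as np
import pandas as pd


def assemble_teams(predictions: pd.DataFrame) -> list:
    """Return one dict per (match, team) with that team's sorted win probabilities.

    Assumption: within a match, teams are distinguished by win_status.
    """
    teams = []
    for (match_id, win_status), group in predictions.groupby(['match_id', 'win_status']):
        teams.append({
            'match_id': match_id,
            'win_status': win_status,
            'player_probs': np.sort(group['prediction'].to_numpy()),
        })
    return teams


def mask_teammates(player_probs: np.ndarray, keep: int, fill: float = 0.5) -> np.ndarray:
    """Replace every probability except index `keep` with a neutral value."""
    masked = np.full_like(player_probs, fill, dtype=float)
    masked[keep] = player_probs[keep]
    return masked


# Toy data: one match, two players per team (a real match has five per team).
toy = pd.DataFrame({
    'match_id': [1, 1, 1, 1],
    'prediction': [0.9, 0.2, 0.7, 0.1],
    'win_status': [1, 0, 1, 0],
})
teams = assemble_teams(toy)
# Keep only the strongest player on the winning team; neutralize the teammate.
masked = mask_teammates(teams[1]['player_probs'], keep=1)
print(teams[1]['player_probs'], masked)
```

Feeding `masked` vectors to a team-level classifier would then show how much a single player's predicted probability moves the team prediction when teammates are held at a neutral value.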

Current idea: first, predict every individual player's likelihood of victory. Next, compute the percentiles of all those probabilities (let's say in steps of 10). Finally, count the number of players on each team who fall into each percentile bin and feed those counts into a classifier targeting the victory label. I am hoping this gives 95%+ accuracy, and I'd love 99% to 100%.
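
A rough sketch of this idea is below. It is not the final implementation: the helper `decile_count_features` is hypothetical, the data are synthetic (winners and losers drawn from two made-up beta distributions), and teams are again assumed to be separable by `win_status` within a match. Any accuracy printed on this toy data says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def decile_count_features(predictions: pd.DataFrame):
    """Count, per team, how many players fall into each decile of predicted win probability."""
    probs = predictions['prediction'].to_numpy()
    # Interior decile edges (10th to 90th percentile) over all players.
    edges = np.percentile(probs, np.arange(10, 100, 10))
    # np.digitize maps each probability to a bin index from 0 to 9.
    binned = predictions.assign(decile=np.digitize(probs, edges))
    features, labels = [], []
    # Assumption: within a match, the two teams are distinguished by win_status.
    for (_, win_status), team in binned.groupby(['match_id', 'win_status']):
        features.append(np.bincount(team['decile'].to_numpy(), minlength=10))
        labels.append(win_status)
    return np.array(features), np.array(labels)


# Synthetic stand-in for the per-player predictions (made-up distributions).
rng = np.random.default_rng(0)
rows = []
for match_id in range(200):
    for win_status, (a, b) in ((1, (5, 2)), (0, (2, 5))):
        for p in rng.beta(a, b, size=5):
            rows.append((match_id, p, win_status))
toy = pd.DataFrame(rows, columns=['match_id', 'prediction', 'win_status'])

X, y = decile_count_features(toy)
team_clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, team_clf.score(X, y))
```

One caveat with this sketch: it uses `win_status` to split a match into teams, which is only possible once the outcome is known. Predicting an unseen match would need a separate team identifier.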

In [5]:
predict_col_names = (
    'match_id',
    'prediction',
    'win_status',
)

predictions_dfs = []
for training_df, all_df in tqdm(zip(training_dfs, all_dfs), total=len(all_dfs)):
    # Fit one logistic regression per file, using the best C found above
    lr = LogisticRegression(random_state=0, C=0.1, n_jobs=-1)
    lr.fit(training_df.values[:,:-2], training_df.values[:,-2].ravel())
    predictions_dfs.append(
        pd.DataFrame(
            np.column_stack((
                all_df.values[:,-1],  # match_id
                lr.predict_proba(all_df.values[:,:-2])[:, 1],  # P(win)
                all_df.values[:,-2],  # win_status
            )),
            columns=predict_col_names,
        )
    )

100%|█████████████████████████████████████████████████████████████████████████████| 112/112 [00:03<00:00, 33.78it/s]


In [6]:
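predictions = pd.concat(predictions_dfs)
# NOTE: sort_values returns a sorted copy rather than sorting in place, so this
# call leaves `predictions` in concatenation order. Assign the result if a
# sorted frame is needed; the grouping below does not depend on row order.
predictions.sort_values('match_id')
percentiles = np.percentile(predictions.values[:, 1], list(range(0,101,10)))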

In [7]:
team_noise = {}
num_rows = range(predictions.values.shape[0])
for i in tqdm(num_rows):
    row = predictions.iloc[i]
    # Collect each player's (prediction, win_status) pair under its match
    team_noise.setdefault(row['match_id'], []).append(
        tuple(row[['prediction', 'win_status']])
    )

100%|█████████████████████████████████████████████████████████████████████| 567230/567230 [04:15<00:00, 2216.83it/s]


In [15]:
len(team_noise.keys())

56723

In [17]:
list(team_noise.items())[10000:10003]

[(1124201970.0,
  [(0.8611375738963265, 1.0),
   (0.08374994558023388, 0.0),
   (0.9881562278147311, 1.0),
   (0.12842607480443832, 0.0),
   (0.9524856538467421, 1.0),
   (0.9289861508657391, 1.0),
   (0.9951852235078725, 1.0),
   (0.024370638679009054, 0.0),
   (0.09077516541573394, 0.0),
   (0.10138273383499986, 0.0)]),
 (1124203678.0,
  [(0.022827118188160918, 0.0),
   (0.0014696720100761422, 0.0),
   (0.011506460408757049, 0.0),
   (0.9995904919845279, 1.0),
   (0.984851284304099, 1.0),
   (0.9904482309271895, 1.0),
   (0.9561156451032152, 1.0),
   (0.09834561562747843, 0.0),
   (0.21504237587129718, 1.0),
   (0.023342706447793696, 0.0)]),
 (1124203696.0,
  [(0.9995494615744414, 1.0),
   (0.868710595959615, 1.0),
   (0.9776702140147104, 1.0),
   (0.01627769931759985, 0.0),
   (0.003139414500156998, 0.0),
   (0.019261372645486077, 0.0),
   (0.9924365761746903, 1.0),
   (0.9819352387315695, 1.0),
   (0.006789046050023316, 0.0),
   (0.010551590324919553, 0.0)])]