# Dota match prediction based on the lineup
The idea is to create a model able to predict the outcome of a Dota match based only on the lineup of both teams.
Clearly, the accuracy can never be very high: there are too many variables at play, especially assuming the games are mostly balanced in terms of picks.

### Imports
Let's start by importing all the needed modules

In [23]:
from json import loads
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType, LongType, BooleanType
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

### Create the Spark Session
Creates a Spark session for the app named _"dotingestion"_.

In [24]:
sc = SparkSession.builder.appName("dotingestion").getOrCreate()
sc

## Pre-Processing 🔧

### Define a schema
This is the schema that contains all the data we need:
```yaml
r0: 3               # (int) radiant first pick
r1: 14              # (int) radiant second pick
r2: 51              # (int) radiant third pick
r3: 113             # (int) radiant fourth pick
r4: 135             # (int) radiant fifth pick
d0: 41              # (int) dire first pick
d1: 55              # (int) dire second pick
d2: 68              # (int) dire third pick
d3: 88              # (int) dire fourth pick
d4: 91              # (int) dire fifth pick
radiant_win: true   # whether the radiant team won
match_id: 10        # sequential id of the match, to make sure each match is accounted for once
```

In [25]:
schema = StructType([StructField("r0", IntegerType(), False),
                    StructField("r1", IntegerType(), False),
                    StructField("r2", IntegerType(), False),
                    StructField("r3", IntegerType(), False),
                    StructField("r4", IntegerType(), False),
                    StructField("d0", IntegerType(), False),
                    StructField("d1", IntegerType(), False),
                    StructField("d2", IntegerType(), False),
                    StructField("d3", IntegerType(), False),
                    StructField("d4", IntegerType(), False),
                    StructField("radiant_win", BooleanType(), False),
                    StructField("match_id", LongType(), False)])
schema

StructType(List(StructField(r0,IntegerType,false),StructField(r1,IntegerType,false),StructField(r2,IntegerType,false),StructField(r3,IntegerType,false),StructField(r4,IntegerType,false),StructField(d0,IntegerType,false),StructField(d1,IntegerType,false),StructField(d2,IntegerType,false),StructField(d3,IntegerType,false),StructField(d4,IntegerType,false),StructField(radiant_win,BooleanType,false),StructField(match_id,LongType,false)))

### Read the data
Reads the match data already collected in the _"data.csv"_ file.  
Immediately drops any row whose values are all null, as well as any repeated row.

In [26]:
path = "data.csv"
df = sc.read.csv(path, schema=schema, header=True).na.drop("all").distinct()
df.printSchema()

root
 |-- r0: integer (nullable = true)
 |-- r1: integer (nullable = true)
 |-- r2: integer (nullable = true)
 |-- r3: integer (nullable = true)
 |-- r4: integer (nullable = true)
 |-- d0: integer (nullable = true)
 |-- d1: integer (nullable = true)
 |-- d2: integer (nullable = true)
 |-- d3: integer (nullable = true)
 |-- d4: integer (nullable = true)
 |-- radiant_win: boolean (nullable = true)
 |-- match_id: long (nullable = true)
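The cleaning done by `.na.drop("all").distinct()` in the cell above can be sketched in plain Python. This is only an illustration on made-up rows, and the helper name is hypothetical: with `"all"`, a row is removed only when every column is null, so a partially null row survives.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], ...]


def drop_empty_and_duplicate_rows(rows: List[Row]) -> List[Row]:
    """Plain-Python stand-in for `.na.drop("all").distinct()` (hypothetical helper)."""
    seen = set()
    kept = []
    for row in rows:
        if all(value is None for value in row):
            continue  # "all": dropped only when every column is null
        if row in seen:
            continue  # exact duplicate of a row that was already kept
        seen.add(row)
        kept.append(row)
    return kept


# Made-up rows: one duplicate, one fully null row, one partially null row.
rows = [(1, 2), (None, None), (1, 2), (None, 3)]
cleaned = drop_empty_and_duplicate_rows(rows)
print(cleaned)  # [(1, 2), (None, 3)]
```

Unlike this sketch, Spark's `distinct()` gives no guarantee on the order of the remaining rows.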



In [27]:
df.count()

500000

In [28]:
df.show(5)

+---+---+---+---+---+---+---+---+---+---+-----------+----------+
| r0| r1| r2| r3| r4| d0| d1| d2| d3| d4|radiant_win|  match_id|
+---+---+---+---+---+---+---+---+---+---+-----------+----------+
| 42| 27| 75| 85| 39| 16|  7| 21|  1| 20|      false|5026549253|
| 26| 12|  2| 86|123| 96| 50| 70|  6| 94|      false|5026550191|
| 27| 48| 98|  9| 10| 52|105| 16| 96| 93|       true|5026550339|
| 81| 86| 14|104| 70| 63|  7| 44| 23| 41|       true|5026550716|
| 26| 44| 53| 99| 79| 29| 57| 11| 69| 60|       true|5026550933|
+---+---+---+---+---+---+---+---+---+---+-----------+----------+
only showing top 5 rows



### Process data 1
The hero_id values go from 1 to 135, but not all indices are used.  
For this reason, we first compact them into a dictionary built from the _heroes.json_ file, mapping each existing hero_id to an index going from **0** to **n - 1**, where n is the number of possible hero_id values.

In [29]:
with open("heroes.json", 'r', encoding="utf-8") as f:
    heroes_dict = {hero['id']: i for i, hero in enumerate(loads(f.read()))}
heroes_dict

{1: 0,
 2: 1,
 3: 2,
 4: 3,
 5: 4,
 6: 5,
 7: 6,
 8: 7,
 9: 8,
 11: 9,
 10: 10,
 12: 11,
 13: 12,
 14: 13,
 15: 14,
 16: 15,
 17: 16,
 18: 17,
 19: 18,
 20: 19,
 21: 20,
 22: 21,
 23: 22,
 25: 23,
 31: 24,
 26: 25,
 27: 26,
 28: 27,
 29: 28,
 30: 29,
 32: 30,
 33: 31,
 34: 32,
 35: 33,
 36: 34,
 37: 35,
 38: 36,
 39: 37,
 40: 38,
 41: 39,
 42: 40,
 43: 41,
 44: 42,
 45: 43,
 46: 44,
 47: 45,
 48: 46,
 49: 47,
 50: 48,
 51: 49,
 52: 50,
 53: 51,
 54: 52,
 55: 53,
 56: 54,
 57: 55,
 58: 56,
 59: 57,
 60: 58,
 61: 59,
 62: 60,
 63: 61,
 64: 62,
 65: 63,
 66: 64,
 67: 65,
 69: 66,
 68: 67,
 70: 68,
 71: 69,
 72: 70,
 73: 71,
 74: 72,
 75: 73,
 76: 74,
 77: 75,
 78: 76,
 79: 77,
 80: 78,
 81: 79,
 82: 80,
 83: 81,
 84: 82,
 85: 83,
 86: 84,
 87: 85,
 88: 86,
 89: 87,
 90: 88,
 91: 89,
 92: 90,
 93: 91,
 94: 92,
 95: 93,
 96: 94,
 97: 95,
 98: 96,
 99: 97,
 100: 98,
 101: 99,
 102: 100,
 103: 101,
 104: 102,
 106: 103,
 107: 104,
 109: 105,
 110: 106,
 111: 107,
 105: 108,
 112: 109,
 113: 110,
 ...}
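As the output above shows (for example `11: 9` before `10: 10`), the compact index follows the order of the entries in the file, not the numeric order of the ids. A minimal sketch on a made-up excerpt also builds the inverse map, which is handy to translate a vector position back to a hero_id. The JSON string and the names `hero_to_index` and `index_to_hero` are assumptions made for the illustration.

```python
import json

# Hypothetical excerpt of a heroes file: ids are neither contiguous
# (there is no 24) nor sorted (31 comes before 26).
heroes_json = '[{"id": 23}, {"id": 25}, {"id": 31}, {"id": 26}]'

heroes = json.loads(heroes_json)

# hero_id -> compact index, in file order (same construction as heroes_dict)
hero_to_index = {hero["id"]: i for i, hero in enumerate(heroes)}

# compact index -> hero_id, to read a vector position back as a hero
index_to_hero = {i: hero_id for hero_id, i in hero_to_index.items()}

print(hero_to_index)     # {23: 0, 25: 1, 31: 2, 26: 3}
print(index_to_hero[2])  # 31
```

The vector size needed later is `len(hero_to_index)`, not the largest hero_id.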

### Process data 2
Add two new columns, _"radiant_lineup_vec"_ and _"dire_lineup_vec"_, built from the five hero_id columns of each team (r0 to r4 and d0 to d4). Each one is a vector filled with 0, except at the indices corresponding to the team's hero_id values, mapped using the *heroes_dict* variable, which have a value of 1.  
The two new columns have type **VectorUDT**. This is required for the **VectorAssembler** to work correctly.

#### Example:
**Assuming there are a total of 10 heroes and the indices go from 1 to 10**

| r0 | r1 | r2 | r3 | r4 | d0 | d1 | d2 | d3 | d4 | radiant_win | match_id |
| - | - | - | - | - | - | - | - | - | - | - | - |
| 2 | 4 | 6 | 7 | 8 | 1 | 3 | 5 | 9 | 10 | true | 4976549005 |

Becomes

| r0 | r1 | r2 | r3 | r4 | d0 | d1 | d2 | d3 | d4 | radiant_win | match_id | radiant_lineup_vec | dire_lineup_vec |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2 | 4 | 6 | 7 | 8 | 1 | 3 | 5 | 9 | 10 | true | 4976549005 | [0, 1, 0, 1, 0, 1, 1, 1, 0, 0] | [1, 0, 1, 0, 1, 0, 0, 0, 1, 1] |
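The example above can be reproduced in plain Python, without Spark. This is a sketch under the same assumption as the table (10 heroes with ids 1 to 10); `toy_heroes_dict` and `multi_hot` are hypothetical names that stand in for *heroes_dict* and for the UDF defined in the next cell.

```python
from typing import Dict, Iterable, List

# Toy mapping for the example: hero ids 1..10 -> compact indices 0..9.
toy_heroes_dict: Dict[int, int] = {hero_id: hero_id - 1 for hero_id in range(1, 11)}


def multi_hot(picks: Iterable[int], mapping: Dict[int, int]) -> List[int]:
    """Return a 0/1 vector with a 1 at the compact index of every picked hero."""
    vector = [0] * len(mapping)
    for hero_id in picks:
        vector[mapping[hero_id]] = 1
    return vector


radiant_vec = multi_hot([2, 4, 6, 7, 8], toy_heroes_dict)
dire_vec = multi_hot([1, 3, 5, 9, 10], toy_heroes_dict)

print(radiant_vec)  # [0, 1, 0, 1, 0, 1, 1, 1, 0, 0]
print(dire_vec)     # [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]
```

Since a hero can be picked only once per match, the two vectors never have a 1 in the same position.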

In [30]:
def convert_heroes_to_lineup(df: DataFrame) -> DataFrame:
    n_heroes = len(heroes_dict)

    def onehot(l0, l1, l2, l3, l4):
        # Set to 1 the compact index of each of the five picked heroes
        lineup = [0.0] * n_heroes
        for hero in (l0, l1, l2, l3, l4):
            lineup[heroes_dict[hero]] = 1.0
        return Vectors.dense(lineup)

    heroes_to_lineup_udf = udf(onehot, VectorUDT())
    return df.withColumn("radiant_lineup_vec", heroes_to_lineup_udf(df.r0, df.r1, df.r2, df.r3, df.r4))\
             .withColumn("dire_lineup_vec", heroes_to_lineup_udf(df.d0, df.d1, df.d2, df.d3, df.d4))

df = convert_heroes_to_lineup(df)
df.show(5)

+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+
| r0| r1| r2| r3| r4| d0| d1| d2| d3| d4|radiant_win|  match_id|  radiant_lineup_vec|     dire_lineup_vec|
+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+
| 42| 27| 75| 85| 39| 16|  7| 21|  1| 20|      false|5026549253|[0.0,0.0,0.0,0.0,...|[1.0,0.0,0.0,0.0,...|
| 26| 12|  2| 86|123| 96| 50| 70|  6| 94|      false|5026550191|[0.0,1.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|
| 27| 48| 98|  9| 10| 52|105| 16| 96| 93|       true|5026550339|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|
| 81| 86| 14|104| 70| 63|  7| 44| 23| 41|       true|5026550716|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|
| 26| 44| 53| 99| 79| 29| 57| 11| 69| 60|       true|5026550933|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|
+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+
only showing top 5 rows



### Process data 3
Cast the _"radiant_win"_ column type from **BooleanType** to **IntegerType**.

#### Example:
**Assuming there are a total of 10 heroes and the indices go from 1 to 10**

| r0 | r1 | r2 | r3 | r4 | d0 | d1 | d2 | d3 | d4 | radiant_win | match_id | radiant_lineup_vec | dire_lineup_vec |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2 | 4 | 6 | 7 | 8 | 1 | 3 | 5 | 9 | 10 | true | 4976549005 | [0, 1, 0, 1, 0, 1, 1, 1, 0, 0] | [1, 0, 1, 0, 1, 0, 0, 0, 1, 1] |

Becomes

| r0 | r1 | r2 | r3 | r4 | d0 | d1 | d2 | d3 | d4 | radiant_win | match_id | radiant_lineup_vec | dire_lineup_vec | radiant_win_int |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2 | 4 | 6 | 7 | 8 | 1 | 3 | 5 | 9 | 10 | true | 4976549005 | [0, 1, 0, 1, 0, 1, 1, 1, 0, 0] | [1, 0, 1, 0, 1, 0, 0, 0, 1, 1] | 1 |

In [31]:
def convert_types(df: DataFrame) -> DataFrame:
    return df.withColumn("radiant_win_int", df.radiant_win.cast(IntegerType()))

df = convert_types(df)
df.show(5)

+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+---------------+
| r0| r1| r2| r3| r4| d0| d1| d2| d3| d4|radiant_win|  match_id|  radiant_lineup_vec|     dire_lineup_vec|radiant_win_int|
+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+---------------+
| 42| 27| 75| 85| 39| 16|  7| 21|  1| 20|      false|5026549253|[0.0,0.0,0.0,0.0,...|[1.0,0.0,0.0,0.0,...|              0|
| 26| 12|  2| 86|123| 96| 50| 70|  6| 94|      false|5026550191|[0.0,1.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|              0|
| 27| 48| 98|  9| 10| 52|105| 16| 96| 93|       true|5026550339|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|              1|
| 81| 86| 14|104| 70| 63|  7| 44| 23| 41|       true|5026550716|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|              1|
| 26| 44| 53| 99| 79| 29| 57| 11| 69| 60|       true|5026550933|[0.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|              1|
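+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+---------------+
only showing top 5 rows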

## ML 🤖

In [32]:
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, VectorSizeHint
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

### ML pipeline
The machine learning pipeline has 4 stages; the first 2 are both **VectorSizeHint** stages, one per lineup vector:
```yaml
VectorSizeHint x2: # needed by the VectorAssembler: when this pipeline is applied to a structured stream, the size of the vectors is not known in advance
    inputCol: dire_lineup_vec [or] radiant_lineup_vec
    size: len(heroes_dict)
    handleInvalid: skip
VectorAssembler: # assembles the "dire_lineup_vec" and "radiant_lineup_vec" vectors into a single "features" vector
    inputCols: ['dire_lineup_vec', 'radiant_lineup_vec']
    outputCol: features
LogisticRegression: # uses the "features" vector to predict the "radiant_win_int" column value
    featuresCol: features
    labelCol: radiant_win_int
```

In [33]:
size_hint_dire = VectorSizeHint(inputCol="dire_lineup_vec", size=len(heroes_dict), handleInvalid="skip")
size_hint_radiant = VectorSizeHint(inputCol="radiant_lineup_vec", size=len(heroes_dict), handleInvalid="skip")
vec_assembler = VectorAssembler(inputCols=['dire_lineup_vec', 'radiant_lineup_vec'], outputCol="features")
regression = LogisticRegression(featuresCol="features", labelCol="radiant_win_int")
pipeline = Pipeline(stages=[size_hint_dire, size_hint_radiant, vec_assembler, regression])
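Because `VectorAssembler` concatenates its inputs in the order given by `inputCols` (dire first, then radiant in the cell above), a position in `features` can be read back as a team and a hero. The sketch below assumes 121 heroes, a value inferred from the size 242 of the `features` vectors shown in the results further down, and uses only a few entries of *heroes_dict* copied from its printed output. `decode_feature` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Assumption: 121 heroes, inferred from the feature size 242 = 2 * 121.
N_HEROES = 121

# A few entries of heroes_dict as printed earlier (hero_id -> compact index).
heroes_dict_excerpt: Dict[int, int] = {1: 0, 74: 72, 76: 74, 87: 85, 88: 86}
index_to_hero = {index: hero_id for hero_id, index in heroes_dict_excerpt.items()}


def decode_feature(position: int) -> Tuple[str, int]:
    """Map a position in the assembled `features` vector to (team, hero_id)."""
    # First N_HEROES slots come from dire_lineup_vec, the rest from radiant_lineup_vec.
    team = "dire" if position < N_HEROES else "radiant"
    return team, index_to_hero[position % N_HEROES]


print(decode_feature(72))        # ('dire', 74)
print(decode_feature(N_HEROES))  # ('radiant', 1)
```

The same decoding could be applied to the fitted coefficients to see which picks push the prediction towards a radiant or a dire win.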

### Param grid and CrossValidator
The param grid and the CrossValidator make it possible to run the same fit operation multiple times, altering the hyperparameters to find the combination that optimizes the result.

In [34]:
paramGrid = ParamGridBuilder()\
    .addGrid(regression.elasticNetParam,[0.0, 0.5, 1.0])\
    .addGrid(regression.maxIter,[1000])\
    .addGrid(regression.regParam,[0.01, 0.5, 2.0]) \
    .build()

In [35]:
evaluator=BinaryClassificationEvaluator(labelCol="radiant_win_int")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
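To get a feeling for the cost of this search, the grid can be enumerated in plain Python. The sketch below only mirrors the values used in the two cells above. It assumes the usual cross-validation scheme, where every combination is fitted once per fold and the best one is then refitted on the whole training set.

```python
from itertools import product

# Values copied from the param grid and the CrossValidator above.
elastic_net_values = [0.0, 0.5, 1.0]
max_iter_values = [1000]
reg_param_values = [0.01, 0.5, 2.0]
num_folds = 5

# One dictionary per hyperparameter combination (Cartesian product of the lists).
grid = [
    {"elasticNetParam": e, "maxIter": m, "regParam": r}
    for e, m, r in product(elastic_net_values, max_iter_values, reg_param_values)
]

fits_during_cv = len(grid) * num_folds  # each combination is fitted once per fold
total_fits = fits_during_cv + 1         # plus the final refit of the best combination

print(len(grid), fits_during_cv, total_fits)  # 9 45 46
```

With 500000 rows this is the most expensive step of the notebook, so new values should be added to the grid with care.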

### ML model
Split the original dataset into a training set and a test set.  
Then use the training set to fit the model; its quality is measured below as the area under the ROC curve.

In [36]:
traint_df, test_df = df.randomSplit([0.8, 0.2])
model = cv.fit(traint_df)

## ML results
Check the results by using the newly created model with the test set.

In [39]:
result_df = model.transform(test_df)
result_df.show(5)

+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+----------+
| r0| r1| r2| r3| r4| d0| d1| d2| d3| d4|radiant_win|  match_id|  radiant_lineup_vec|     dire_lineup_vec|radiant_win_int|            features|       rawPrediction|         probability|prediction|
+---+---+---+---+---+---+---+---+---+---+-----------+----------+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+----------+
|  1| 20| 13|108| 50| 94| 74| 88| 87| 76|      false|5026795783|[1.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|              0|(242,[72,74,85,86...|[-0.1187791040452...|[0.47034008713804...|       1.0|
|  1| 84| 96| 13| 45| 99| 70| 62| 30| 36|      false|5027387970|[1.0,0.0,0.0,0.0,...|[0.0,0.0,0.0,0.0,...|              0|(242,[29,34,60,68...|[0.05718586068549...|[0.51429257039301...|       0.0|
|  1| 95| 64| 2
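The three output columns are tied together by the logistic function. As a rough check on the first row above, applying the sigmoid to the first component of `rawPrediction` gives approximately the first component of `probability`, and the prediction is 1.0 because the remaining probability mass (radiant win) is above 0.5. The input value is truncated in the output, so the match is only approximate; the 0.5 threshold is the default of `LogisticRegression`.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# First component of rawPrediction in the first row (truncated in the output).
raw_class0 = -0.1187791040452

p_class0 = sigmoid(raw_class0)  # probability of label 0 (dire win)
p_class1 = 1.0 - p_class0       # probability of label 1 (radiant win)
prediction = 1.0 if p_class1 > 0.5 else 0.0

print(round(p_class0, 4), round(p_class1, 4), prediction)
```

Both probabilities are close to 0.5, which is consistent with the low predictive power expected from the lineup alone.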

In [38]:
predict_train=model.transform(traint_df)
predict_test=model.transform(test_df)
print("The area under ROC for train set after CV  is {}".format(evaluator.evaluate(predict_train)))
print("The area under ROC for test set after CV  is {}".format(evaluator.evaluate(predict_test)))

The area under ROC for train set after CV  is 0.5923819498967543
The area under ROC for test set after CV  is 0.590186560695288
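The area under the ROC curve, the default metric of `BinaryClassificationEvaluator`, can be read as the probability that a randomly chosen radiant win receives a higher score than a randomly chosen radiant loss. So 0.5 means random guessing and about 0.59 is only slightly better. A small sketch of that definition on made-up labels and scores (`roc_auc` is a hypothetical helper, not the Spark implementation):

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 1 = radiant win, score = predicted probability of a radiant win.
labels = [1, 0, 1, 0, 1]
scores = [0.62, 0.48, 0.45, 0.51, 0.58]
auc = roc_auc(labels, scores)
print(round(auc, 4))  # 0.6667
```

The pairwise loop is quadratic and only meant to show the definition. The train and test values above are very close to each other, which suggests the model is not overfitting.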


In [42]:
model.save("cv_model")