In [1]:
!rm -rf spark-3.1.2-bin-hadoop3.2
!apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark

[33m0% [Working][0m            Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [1 InRelease 5,485 B/88.[0m                                                                               Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [1 InRelease 43.1 kB/88.[0m[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [1 InRelease 47.5 kB/88.[0m[33m0% [2 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (185.125.190.36[0m                                                                               Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,581 B]
[33m0% [2 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (185.125.190.36[0m[33m0% [2 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (185.125.190.36[0m[33

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

In [3]:
# Import the required modules
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [4]:
# Create a new Spark session
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Analysis of esports matches
### Enrique Marín Sánchez

Esports are booming: international tournaments for Counter-Strike, Valorant, League of Legends and even Rocket League grow more popular all the time. Ever since esports were born with StarCraft in South Korea, analysts have been keen to study matches in order to optimize and design strategies.

League of Legends is currently the most popular esport. Each professional team has one or more analysts whose job is to study other teams and their play style. It is the independent analysts, such as LS (a controversial former StarCraft player), who focus on analyzing the game more mathematically and reach conclusions that others find unexpected.

This project focuses mainly on predicting the final result of a match from the relevant statistics at minute 10 and at minute 15 (other statistics, such as dragons or other objectives, are irrelevant compared with the ones chosen). For reference, the average length of a professional LoL match is usually around 30 minutes, depending on the metagame.

The statistics chosen for this study are:
  - Gold: the resource players use to buy items.
  - Experience: the resource used to level up characters.
  - Farm: a resource related to the previous two that is somewhat complex to explain briefly.
  - Kills: the team's total kills.
  - Deaths: the team's total deaths.
  - Assists: the team's total kill assists.
  - Side: the map is split diagonally in two and each team is assigned one half, red or blue. Each side has its advantages and disadvantages.

For the first three (gold, XP and farm) we look at the team's own values, the opponent's values and the difference between them. Why is this necessary? Because a 2k gold lead is not the same when one team has 22k and the opponent 20k as when the totals are 12k and 10k. For the other three (kills, deaths and assists) we only use the team's own values and the opponent's.
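The 2k example can be made concrete by expressing the lead as a fraction of the opponent's gold. This is only an illustrative sketch: `relative_lead` is a hypothetical helper and is not a feature used by the models in this notebook, which receive the raw totals and the raw difference.

```python
def relative_lead(own_gold: float, opp_gold: float) -> float:
    """Hypothetical helper: gold difference as a fraction of the opponent's gold."""
    return (own_gold - opp_gold) / opp_gold

late = relative_lead(22_000, 20_000)   # 2k lead on larger totals
early = relative_lead(12_000, 10_000)  # the same 2k lead on smaller totals

print(f"22k vs 20k: {late:.0%} ahead")
print(f"12k vs 10k: {early:.0%} ahead")
```

The same absolute difference is twice as large in relative terms in the second case, which is why the totals are kept alongside the difference.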

## Statistics at minute 10

We load the previously cleaned CSV with the statistics at minute 10.

In [6]:
teamsat10 = (spark.read
          .format("csv")
          .option('header', 'true')
          .load("/content/drive/MyDrive/teamsat10.csv"))

We preview the first 5 rows.

In [7]:
teamsat10.show(5)

+---+------+----+-----------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|_c0|league|side|   teamname|result|goldat10| xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|
+---+------+----+-----------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
| 46|   LPL|Blue|Top Esports|     0| 16177.0|19640.0| 380.0|     15445.0|   19565.0|     360.0|       732.0|      75.0|      20.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|
| 47|   LPL| Red|     Suning|     1| 15445.0|19565.0| 360.0|     16177.0|   19640.0|     380.0|      -732.0|     -75.0|     -20.0|      0.0|        0.0|       1.0| 

We view the data as a pandas DataFrame.

In [8]:
teamsat10.toPandas()

Unnamed: 0,_c0,league,side,teamname,result,goldat10,xpat10,csat10,opp_goldat10,opp_xpat10,opp_csat10,golddiffat10,xpdiffat10,csdiffat10,killsat10,assistsat10,deathsat10,opp_killsat10,opp_assistsat10,opp_deathsat10
0,46,LPL,Blue,Top Esports,0,16177.0,19640.0,380.0,15445.0,19565.0,360.0,732.0,75.0,20.0,1.0,1.0,0.0,0.0,0.0,1.0
1,47,LPL,Red,Suning,1,15445.0,19565.0,360.0,16177.0,19640.0,380.0,-732.0,-75.0,-20.0,0.0,0.0,1.0,1.0,1.0,0.0
2,58,LPL,Blue,Top Esports,0,16752.0,20020.0,361.0,15250.0,18856.0,321.0,1502.0,1164.0,40.0,2.0,2.0,1.0,1.0,1.0,2.0
3,59,LPL,Red,Suning,1,15250.0,18856.0,321.0,16752.0,20020.0,361.0,-1502.0,-1164.0,-40.0,1.0,1.0,2.0,2.0,2.0,1.0
4,70,LPL,Blue,Oh My God,0,15842.0,18405.0,322.0,15812.0,18712.0,333.0,30.0,-307.0,-11.0,2.0,3.0,2.0,2.0,4.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3869,147875,VCS,Red,GAM Esports,1,15885.0,17439.0,322.0,15964.0,18247.0,296.0,-79.0,-808.0,26.0,3.0,4.0,4.0,4.0,5.0,3.0
3870,147898,VCS,Blue,CERBERUS Esports,1,16533.0,18695.0,354.0,15165.0,18677.0,332.0,1368.0,18.0,22.0,1.0,1.0,1.0,1.0,0.0,1.0
3871,147899,VCS,Red,GAM Esports,0,15165.0,18677.0,332.0,16533.0,18695.0,354.0,-1368.0,-18.0,-22.0,1.0,0.0,1.0,1.0,1.0,1.0
3872,147922,VCS,Blue,GAM Esports,0,17265.0,17931.0,314.0,17364.0,17443.0,311.0,-99.0,488.0,3.0,5.0,7.0,4.0,4.0,4.0,5.0


We print the names of all the columns.

In [9]:
teamsat10.columns

['_c0',
 'league',
 'side',
 'teamname',
 'result',
 'goldat10',
 'xpat10',
 'csat10',
 'opp_goldat10',
 'opp_xpat10',
 'opp_csat10',
 'golddiffat10',
 'xpdiffat10',
 'csdiffat10',
 'killsat10',
 'assistsat10',
 'deathsat10',
 'opp_killsat10',
 'opp_assistsat10',
 'opp_deathsat10']

We check the column types.

In [10]:
teamsat10.dtypes

[('_c0', 'string'),
 ('league', 'string'),
 ('side', 'string'),
 ('teamname', 'string'),
 ('result', 'string'),
 ('goldat10', 'string'),
 ('xpat10', 'string'),
 ('csat10', 'string'),
 ('opp_goldat10', 'string'),
 ('opp_xpat10', 'string'),
 ('opp_csat10', 'string'),
 ('golddiffat10', 'string'),
 ('xpdiffat10', 'string'),
 ('csdiffat10', 'string'),
 ('killsat10', 'string'),
 ('assistsat10', 'string'),
 ('deathsat10', 'string'),
 ('opp_killsat10', 'string'),
 ('opp_assistsat10', 'string'),
 ('opp_deathsat10', 'string')]

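We look at a brief summary of the data. Every column is still typed as a string at this point, so the `min` and `max` rows are compared as text rather than as numbers. That is why, for example, the reported minimum of `_c0` (100018.0) looks larger than its maximum (9995.0).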

In [11]:
teamsat10.describe().toPandas()

Unnamed: 0,summary,_c0,league,side,teamname,result,goldat10,xpat10,csat10,opp_goldat10,...,opp_csat10,golddiffat10,xpdiffat10,csdiffat10,killsat10,assistsat10,deathsat10,opp_killsat10,opp_assistsat10,opp_deathsat10
0,count,3874.0,3874,3874,3874,3874.0,3874.0,3874.0,3874.0,3874.0,...,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0
1,mean,70735.11744966442,,,,0.5,15867.389778007228,18539.29401135777,328.1200309757357,15867.389778007228,...,328.1200309757357,0.0,0.0,0.0,2.140165203923593,3.381775942178627,2.146618482188952,2.140165203923593,3.381775942178627,2.146618482188952
2,stddev,45155.29941221581,,,,0.5000645452787817,969.0709834618204,902.6651953940352,23.320991086374875,969.0709834618202,...,23.320991086375063,1505.970250384317,1120.4006418439144,26.13503435344638,1.8687152383253232,3.482846956613331,1.8731197269581887,1.8687152383253176,3.482846956613336,1.8731197269581883
3,min,100018.0,LCK,Blue,100 Thieves,0.0,13283.0,14196.0,232.0,13283.0,...,232.0,-1.0,-10.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,max,9995.0,WCS,Red,İstanbul Wildcats,1.0,21975.0,22663.0,403.0,21975.0,...,403.0,999.0,996.0,95.0,9.0,9.0,9.0,9.0,9.0,9.0
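The odd `min` and `max` rows in this summary are consistent with text comparison, since the CSV was read without type inference and every column is a string. A small pure-Python sketch of the same effect, using a few `_c0` values that appear in the outputs above (the variable names are ours):

```python
# Values of `_c0` as they come out of the CSV reader: plain strings.
ids_as_text = ["46", "47", "9995", "100018", "147922"]

# Strings are compared character by character, so "1..." sorts before "9...".
print(min(ids_as_text), max(ids_as_text))

# After converting to numbers the ordering is the expected one.
ids_as_numbers = [float(v) for v in ids_as_text]
print(min(ids_as_numbers), max(ids_as_numbers))
```

Once the columns are cast to a numeric type, the summary statistics become meaningful.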


We cast the data columns to numeric types.

In [12]:
from pyspark.sql.functions import col
teams10 = teamsat10.select(col('league'),
                           col('side'),
                           col('teamname'),
                           col('result').cast('float'),
                           col('goldat10').cast('float'),
                           col('xpat10').cast('float'),
                           col('csat10').cast('float'),
                           col('opp_goldat10').cast('float'),
                           col('opp_xpat10').cast('float'),
                           col('opp_csat10').cast('float'),
                           col('golddiffat10').cast('float'),
                           col('xpdiffat10').cast('float'),
                           col('csdiffat10').cast('float'),
                           col('killsat10').cast('float'),
                           col('assistsat10').cast('float'),
                           col('deathsat10').cast('float'),
                           col('opp_killsat10').cast('float'),
                           col('opp_assistsat10').cast('float'),
                           col('opp_deathsat10').cast('float')
                           )
teams10.show()

+------+----+---------------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|league|side|       teamname|result|goldat10| xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|
+------+----+---------------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|   LPL|Blue|    Top Esports|   0.0| 16177.0|19640.0| 380.0|     15445.0|   19565.0|     360.0|       732.0|      75.0|      20.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|
|   LPL| Red|         Suning|   1.0| 15445.0|19565.0| 360.0|     16177.0|   19640.0|     380.0|      -732.0|     -75.0|     -20.0|      0.0|        0.0|       1.0| 

We check that there are no null values.

In [13]:
from pyspark.sql.functions import isnull, when, count, col
teams10.select([count(when(isnull(c), c)).alias(c) for c in teams10.columns]).show()

+------+----+--------+------+--------+------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|league|side|teamname|result|goldat10|xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|
+------+----+--------+------+--------+------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|     0|   0|       0|     0|       0|     0|     0|           0|         0|         0|           0|         0|         0|        0|          0|         0|            0|              0|             0|
+------+----+--------+------+--------+------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+-----------
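Running the null check after the cast is useful because, in Spark's default (non-ANSI) mode, a string that cannot be parsed as a number becomes null instead of raising an error. The sketch below mimics that idea in plain Python with made-up rows; `to_float_or_none` is a hypothetical helper, not a Spark function.

```python
from typing import Optional

def to_float_or_none(text: str) -> Optional[float]:
    """Hypothetical helper: parse a number, yielding None when parsing fails."""
    try:
        return float(text)
    except ValueError:
        return None

# Toy rows: the second one has a deliberately malformed gold value.
rows = [
    {"goldat10": "16177.0", "killsat10": "1.0"},
    {"goldat10": "n/a", "killsat10": "0.0"},
]

parsed = [{k: to_float_or_none(v) for k, v in row.items()} for row in rows]

# Count missing values per column, analogous to the per-column count above.
null_counts = {
    column: sum(1 for row in parsed if row[column] is None)
    for column in rows[0]
}
print(null_counts)
```

A row of zeros, as in the notebook output, therefore suggests that every value was parsed successfully.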

We convert each team's map side column to a numeric value.

In [14]:
from pyspark.ml.feature import StringIndexer
teams10 = StringIndexer(
    inputCol='side', 
    outputCol='side2', 
    handleInvalid='keep').fit(teams10).transform(teams10)
teams10.show()

+------+----+---------------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|league|side|       teamname|result|goldat10| xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|side2|
+------+----+---------------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|   LPL|Blue|    Top Esports|   0.0| 16177.0|19640.0| 380.0|     15445.0|   19565.0|     360.0|       732.0|      75.0|      20.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|  0.0|
|   LPL| Red|         Suning|   1.0| 15445.0|19565.0| 360.0|     16177.0|   19640.0|     380.0|      -732.0|     -75.0|     -20.0|      0.0|
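To see why `Blue` maps to 0.0 and `Red` to 1.0, here is a rough pure-Python sketch of the indexing idea. It assumes the default ordering (most frequent label first, ties broken alphabetically); `fit_string_index` is a hypothetical stand-in, not the Spark implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Hypothetical stand-in for a fitted indexer: label -> index."""
    counts = Counter(values)
    # Most frequent label first; ties broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

sides = ["Blue", "Red", "Blue", "Red"]  # each match contributes one row per side
mapping = fit_string_index(sides)
print(mapping)

# With handleInvalid='keep', labels not seen during fitting go to one extra bucket.
unseen_index = float(len(mapping))
```

Because both sides appear equally often, the alphabetical tie-break decides the order in this sketch.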

We check the column types one last time to confirm that we can work with them.

In [15]:
teams10.dtypes

[('league', 'string'),
 ('side', 'string'),
 ('teamname', 'string'),
 ('result', 'float'),
 ('goldat10', 'float'),
 ('xpat10', 'float'),
 ('csat10', 'float'),
 ('opp_goldat10', 'float'),
 ('opp_xpat10', 'float'),
 ('opp_csat10', 'float'),
 ('golddiffat10', 'float'),
 ('xpdiffat10', 'float'),
 ('csdiffat10', 'float'),
 ('killsat10', 'float'),
 ('assistsat10', 'float'),
 ('deathsat10', 'float'),
 ('opp_killsat10', 'float'),
 ('opp_assistsat10', 'float'),
 ('opp_deathsat10', 'float'),
 ('side2', 'double')]

We drop the columns we do not need.

In [16]:
teams10 = teams10.drop('league')
teams10 = teams10.drop('side')
teams10 = teams10.drop('teamname')
teams10.show()

+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|result|goldat10| xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|side2|
+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|   0.0| 16177.0|19640.0| 380.0|     15445.0|   19565.0|     360.0|       732.0|      75.0|      20.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|  0.0|
|   1.0| 15445.0|19565.0| 360.0|     16177.0|   19640.0|     380.0|      -732.0|     -75.0|     -20.0|      0.0|        0.0|       1.0|          1.0|            1.0|           0.0|  1.0|
|   0.0| 16752.0|20020.0| 361.0|     15250.0|   18856.0|     321.


We gather all the features into a single vector column with VectorAssembler so that the models can work with them.



In [17]:
required_features = ['goldat10',
                     'xpat10',
                     'csat10',
                     'opp_goldat10',
                     'opp_xpat10',
                     'opp_csat10',
                     'golddiffat10',
                     'xpdiffat10',
                     'csdiffat10',
                     'killsat10',
                     'assistsat10',
                     'deathsat10',
                     'opp_killsat10',
                     'opp_assistsat10',
                     'opp_deathsat10',
                     'side2'
                      ]
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
teams10f = assembler.transform(teams10)
teams10f.show(5)

+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+
|result|goldat10| xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|side2|            features|
+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+
|   0.0| 16177.0|19640.0| 380.0|     15445.0|   19565.0|     360.0|       732.0|      75.0|      20.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|  0.0|[16177.0,19640.0,...|
|   1.0| 15445.0|19565.0| 360.0|     16177.0|   19640.0|     380.0|      -732.0|     -75.0|     -20.0|      0.0|        0.0|       1.0|          1.0|            1.0|   
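Conceptually, the assembler concatenates the selected columns, in the order given, into one vector per row. A minimal sketch of that idea follows; `assemble` is a hypothetical helper, and only three of the sixteen features are used to keep it short.

```python
def assemble(row: dict, input_cols: list) -> list:
    """Hypothetical stand-in: concatenate the chosen columns, in order."""
    return [float(row[name]) for name in input_cols]

# First row of the minute-10 data, restricted to a few columns.
first_row = {"result": 0.0, "goldat10": 16177.0, "xpat10": 19640.0, "csat10": 380.0}

features = assemble(first_row, ["goldat10", "xpat10", "csat10"])
print(features)
```

The label column `result` is left out of the vector on purpose, since it is what the model has to predict.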

We split the dataset into two parts, one to train the model and one to test it. We also set a seed so that re-running the notebook always gives the same results.

In [18]:
(teams10_train, teams10_test) = teams10f.randomSplit([0.8,0.2],21)
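The role of the seed can be illustrated without Spark. The sketch below is not Spark's algorithm; it is a simplified stand-in (`seeded_split` is a hypothetical helper) in which every row gets a random draw, so the 80/20 proportion is only approximate but the outcome repeats exactly for a given seed.

```python
import random

def seeded_split(rows, train_fraction, seed):
    """Hypothetical helper: assign each row to train or test with a seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(3874))  # one placeholder per row of the dataset

train_a, test_a = seeded_split(rows, 0.8, seed=21)
train_b, test_b = seeded_split(rows, 0.8, seed=21)

# Same seed, same split; sizes are close to 80/20 but not exact.
print(train_a == train_b, len(train_a), len(test_a))
```

Changing the seed would give a different but equally valid partition, which is why fixing it matters when comparing models.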

First we try building the model with a random forest.

In [19]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol='result', 
                            featuresCol='features',
                            maxDepth=20)

In [20]:
model10 = rf.fit(teams10_train)

Once the random forest model for the minute-10 statistics has been trained, we apply it to the test set.

In [21]:
predictions10rf = model10.transform(teams10_test)

In [22]:
predictions10rf.show(5)

+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+--------------------+--------------------+----------+
|result|goldat10| xpat10|csat10|opp_goldat10|opp_xpat10|opp_csat10|golddiffat10|xpdiffat10|csdiffat10|killsat10|assistsat10|deathsat10|opp_killsat10|opp_assistsat10|opp_deathsat10|side2|            features|       rawPrediction|         probability|prediction|
+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+--------------------+--------------------+----------+
|   0.0| 13743.0|14196.0| 233.0|     20769.0|   18202.0|     300.0|     -7026.0|   -4006.0|     -67.0|      2.0|        4.0|      14.0|         14.0|           32.0|           2.0|  1.0|[13743.0,14196.0,...|          

We evaluate the model to check how well it performs. We measure only the area under the ROC curve: because the dataset is balanced by construction (every match contributes one winning row and one losing row), the PR curve is not needed.

In [23]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Accuracy evaluator, kept under its own name so the next cell does not overwrite it.
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol='result', 
    predictionCol='prediction', 
    metricName='accuracy')

In [24]:
import pyspark.ml.evaluation as ev

evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability', 
    labelCol='result')

print('ROC AUC of RF at minute 10 =',evaluator.evaluate(predictions10rf, {evaluator.metricName: 'areaUnderROC'}))

ROC AUC of RF at minute 10 = 0.7196487989425229
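The number printed above comes from the `areaUnderROC` metric, which measures how well the model ranks winners above losers and is not the same thing as accuracy. The sketch below computes both on made-up scores to show the difference; `roc_auc` and `accuracy` are our own illustrative functions, not the Spark evaluators.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Share of rows whose thresholded score matches the label."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(predicted, labels)) / len(labels)

labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.45]  # made-up win probabilities

print(roc_auc(labels, scores), accuracy(labels, scores))
```

In this toy case three of the four winner/loser pairs are ordered correctly, yet no score reaches the 0.5 threshold, so the two metrics disagree.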


We do the same for the logistic regression model.

In [25]:
import pyspark.ml.classification as cl

logistic = cl.LogisticRegression(
    maxIter=10, 
    regParam=0.01, 
    featuresCol='features',
    labelCol='result')

In [26]:
model10log = logistic.fit(teams10_train)

In [27]:
predictions10log = model10log.transform(teams10_test)

In [28]:
print('ROC AUC of logistic regression at minute 10 =',evaluator.evaluate(predictions10log, {evaluator.metricName: 'areaUnderROC'}))

ROC AUC of logistic regression at minute 10 = 0.7725739550572023


## Statistics at minute 15


We load the previously cleaned CSV with the statistics at minute 15 and repeat the same steps as for minute 10.

In [29]:
teamsat15 = (spark.read
          .format("csv")
          .option('header', 'true')
          .load("/content/drive/MyDrive/teamsat15.csv"))

In [30]:
teamsat15.show(5)

+---+------+----+-----------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|_c0|league|side|   teamname|result|goldat15| xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|
+---+------+----+-----------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
| 46|   LPL|Blue|Top Esports|     0| 24815.0|31121.0| 608.0|     23864.0|   31228.0|     590.0|       951.0|    -107.0|      18.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|
| 47|   LPL| Red|     Suning|     1| 23864.0|31228.0| 590.0|     24815.0|   31121.0|     608.0|      -951.0|     107.0|     -18.0|      0.0|        0.0|       1.0| 

In [31]:
teamsat15.toPandas()

Unnamed: 0,_c0,league,side,teamname,result,goldat15,xpat15,csat15,opp_goldat15,opp_xpat15,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
0,46,LPL,Blue,Top Esports,0,24815.0,31121.0,608.0,23864.0,31228.0,590.0,951.0,-107.0,18.0,1.0,1.0,0.0,0.0,0.0,1.0
1,47,LPL,Red,Suning,1,23864.0,31228.0,590.0,24815.0,31121.0,608.0,-951.0,107.0,-18.0,0.0,0.0,1.0,1.0,1.0,0.0
2,58,LPL,Blue,Top Esports,0,27355.0,32158.0,560.0,25210.0,32578.0,516.0,2145.0,-420.0,44.0,5.0,6.0,6.0,6.0,9.0,5.0
3,59,LPL,Red,Suning,1,25210.0,32578.0,516.0,27355.0,32158.0,560.0,-2145.0,420.0,-44.0,6.0,9.0,5.0,5.0,6.0,6.0
4,70,LPL,Blue,Oh My God,0,24131.0,29284.0,527.0,24588.0,30502.0,543.0,-457.0,-1218.0,-16.0,2.0,3.0,3.0,3.0,5.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3869,147875,VCS,Red,GAM Esports,1,25270.0,27190.0,476.0,27648.0,30278.0,485.0,-2378.0,-3088.0,-9.0,5.0,10.0,10.0,10.0,17.0,5.0
3870,147898,VCS,Blue,CERBERUS Esports,1,26290.0,30222.0,554.0,23305.0,29428.0,523.0,2985.0,794.0,31.0,2.0,2.0,1.0,1.0,0.0,2.0
3871,147899,VCS,Red,GAM Esports,0,23305.0,29428.0,523.0,26290.0,30222.0,554.0,-2985.0,-794.0,-31.0,1.0,0.0,2.0,2.0,2.0,1.0
3872,147922,VCS,Blue,GAM Esports,0,26087.0,29426.0,537.0,25960.0,28646.0,529.0,127.0,780.0,8.0,5.0,7.0,4.0,4.0,4.0,5.0


In [32]:
teamsat15.dtypes

[('_c0', 'string'),
 ('league', 'string'),
 ('side', 'string'),
 ('teamname', 'string'),
 ('result', 'string'),
 ('goldat15', 'string'),
 ('xpat15', 'string'),
 ('csat15', 'string'),
 ('opp_goldat15', 'string'),
 ('opp_xpat15', 'string'),
 ('opp_csat15', 'string'),
 ('golddiffat15', 'string'),
 ('xpdiffat15', 'string'),
 ('csdiffat15', 'string'),
 ('killsat15', 'string'),
 ('assistsat15', 'string'),
 ('deathsat15', 'string'),
 ('opp_killsat15', 'string'),
 ('opp_assistsat15', 'string'),
 ('opp_deathsat15', 'string')]

In [33]:
teamsat15.describe().toPandas()

Unnamed: 0,summary,_c0,league,side,teamname,result,goldat15,xpat15,csat15,opp_goldat15,...,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
0,count,3874.0,3874,3874,3874,3874.0,3874.0,3874.0,3874.0,3874.0,...,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0,3874.0
1,mean,70735.11744966442,,,,0.5,25084.14739287558,29866.96695921528,525.2823954568921,25084.14739287558,...,525.2823954568921,0.0,0.0,0.0,3.8089829633453793,6.459989674754776,3.8177594217862674,3.8089829633453793,6.459989674754776,3.8177594217862674
2,stddev,45155.29941221581,,,,0.5000645452787817,1783.1544721739765,1437.1459861591638,34.50658083139995,1783.1544721739767,...,34.50658083140012,2954.108671383711,2022.1726656489816,39.29105807979333,2.751740401453861,5.247656907688167,2.756460296713686,2.751740401453856,5.247656907688169,2.7564602967136893
3,min,100018.0,LCK,Blue,100 Thieves,0.0,19995.0,23082.0,378.0,19995.0,...,378.0,-10.0,-10.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,max,9995.0,WCS,Red,İstanbul Wildcats,1.0,33596.0,35250.0,616.0,33596.0,...,616.0,997.0,998.0,99.0,9.0,9.0,9.0,9.0,9.0,9.0


In [34]:
from pyspark.sql.functions import col
teams15 = teamsat15.select(col('league'),
                           col('side'),
                           col('teamname'),
                           col('result').cast('float'),
                           col('goldat15').cast('float'),
                           col('xpat15').cast('float'),
                           col('csat15').cast('float'),
                           col('opp_goldat15').cast('float'),
                           col('opp_xpat15').cast('float'),
                           col('opp_csat15').cast('float'),
                           col('golddiffat15').cast('float'),
                           col('xpdiffat15').cast('float'),
                           col('csdiffat15').cast('float'),
                           col('killsat15').cast('float'),
                           col('assistsat15').cast('float'),
                           col('deathsat15').cast('float'),
                           col('opp_killsat15').cast('float'),
                           col('opp_assistsat15').cast('float'),
                           col('opp_deathsat15').cast('float')
                           )
teams15.show()

+------+----+---------------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|league|side|       teamname|result|goldat15| xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|
+------+----+---------------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|   LPL|Blue|    Top Esports|   0.0| 24815.0|31121.0| 608.0|     23864.0|   31228.0|     590.0|       951.0|    -107.0|      18.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|
|   LPL| Red|         Suning|   1.0| 23864.0|31228.0| 590.0|     24815.0|   31121.0|     608.0|      -951.0|     107.0|     -18.0|      0.0|        0.0|       1.0| 

In [35]:
from pyspark.sql.functions import isnull, when, count, col
teams15.select([count(when(isnull(c), c)).alias(c) for c in teams15.columns]).show()

+------+----+--------+------+--------+------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|league|side|teamname|result|goldat15|xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|
+------+----+--------+------+--------+------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+
|     0|   0|       0|     0|       0|     0|     0|           0|         0|         0|           0|         0|         0|        0|          0|         0|            0|              0|             0|
+------+----+--------+------+--------+------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+-----------

In [36]:
from pyspark.ml.feature import StringIndexer
teams15 = StringIndexer(
    inputCol='side', 
    outputCol='side2', 
    handleInvalid='keep').fit(teams15).transform(teams15)
teams15.show(5)

+------+----+-----------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|league|side|   teamname|result|goldat15| xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|side2|
+------+----+-----------+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|   LPL|Blue|Top Esports|   0.0| 24815.0|31121.0| 608.0|     23864.0|   31228.0|     590.0|       951.0|    -107.0|      18.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|  0.0|
|   LPL| Red|     Suning|   1.0| 23864.0|31228.0| 590.0|     24815.0|   31121.0|     608.0|      -951.0|     107.0|     -18.0|      0.0|        0.0|       1

In [37]:
teams15.dtypes

[('league', 'string'),
 ('side', 'string'),
 ('teamname', 'string'),
 ('result', 'float'),
 ('goldat15', 'float'),
 ('xpat15', 'float'),
 ('csat15', 'float'),
 ('opp_goldat15', 'float'),
 ('opp_xpat15', 'float'),
 ('opp_csat15', 'float'),
 ('golddiffat15', 'float'),
 ('xpdiffat15', 'float'),
 ('csdiffat15', 'float'),
 ('killsat15', 'float'),
 ('assistsat15', 'float'),
 ('deathsat15', 'float'),
 ('opp_killsat15', 'float'),
 ('opp_assistsat15', 'float'),
 ('opp_deathsat15', 'float'),
 ('side2', 'double')]

In [38]:
teams15 = teams15.drop('league')
teams15 = teams15.drop('side')
teams15 = teams15.drop('teamname')
teams15.show(5)

+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|result|goldat15| xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|side2|
+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+
|   0.0| 24815.0|31121.0| 608.0|     23864.0|   31228.0|     590.0|       951.0|    -107.0|      18.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|  0.0|
|   1.0| 23864.0|31228.0| 590.0|     24815.0|   31121.0|     608.0|      -951.0|     107.0|     -18.0|      0.0|        0.0|       1.0|          1.0|            1.0|           0.0|  1.0|
|   0.0| 27355.0|32158.0| 560.0|     25210.0|   32578.0|     516.

In [39]:
required_features2 = ['goldat15',
                     'xpat15',
                     'csat15',
                     'opp_goldat15',
                     'opp_xpat15',
                     'opp_csat15',
                     'golddiffat15',
                     'xpdiffat15',
                     'csdiffat15',
                     'killsat15',
                     'assistsat15',
                     'deathsat15',
                     'opp_killsat15',
                     'opp_assistsat15',
                     'opp_deathsat15',
                     'side2'
                      ]
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features2, outputCol='features')
teams15f = assembler.transform(teams15)
teams15f.show(5)

+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+
|result|goldat15| xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|side2|            features|
+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+
|   0.0| 24815.0|31121.0| 608.0|     23864.0|   31228.0|     590.0|       951.0|    -107.0|      18.0|      1.0|        1.0|       0.0|          0.0|            0.0|           1.0|  0.0|[24815.0,31121.0,...|
|   1.0| 23864.0|31228.0| 590.0|     24815.0|   31121.0|     608.0|      -951.0|     107.0|     -18.0|      0.0|        0.0|       1.0|          1.0|            1.0|   

In [40]:
(teams15_train, teams15_test) = teams15f.randomSplit([0.8,0.2],21)

In [41]:
model15 = rf.fit(teams15_train)

In [42]:
predictions15rf = model15.transform(teams15_test)

In [43]:
predictions15rf.show(5)

+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+-------------+-----------+----------+
|result|goldat15| xpat15|csat15|opp_goldat15|opp_xpat15|opp_csat15|golddiffat15|xpdiffat15|csdiffat15|killsat15|assistsat15|deathsat15|opp_killsat15|opp_assistsat15|opp_deathsat15|side2|            features|rawPrediction|probability|prediction|
+------+--------+-------+------+------------+----------+----------+------------+----------+----------+---------+-----------+----------+-------------+---------------+--------------+-----+--------------------+-------------+-----------+----------+
|   0.0| 21279.0|27986.0| 456.0|     25991.0|   29174.0|     521.0|     -4712.0|   -1188.0|     -65.0|      1.0|        2.0|       4.0|          4.0|            4.0|           1.0|  0.0|[21279.0,27986.0,...|   [20.0,0.0]|  [1.0,0.0]|       0.0|
|   0.0| 21463.0|263

In [44]:
print('RF AUC at minute 15 =', evaluator.evaluate(predictions15rf, {evaluator.metricName: 'areaUnderROC'}))

RF AUC at minute 15 = 0.7897266283786866


In [45]:
model15log = logistic.fit(teams15_train)

In [46]:
predictions15log = model15log.transform(teams15_test)

In [47]:
print('Logistic regression AUC at minute 15 =', evaluator.evaluate(predictions15log, {evaluator.metricName: 'areaUnderROC'}))

Logistic regression AUC at minute 15 = 0.8245419116641599


## Conclusions

As we can see, with both the minute-10 and the minute-15 statistics the logistic regression model scores higher than random forest (at minute 15, an area under the ROC curve of 0.825 versus 0.790). The minute-15 predictions are also better than the minute-10 ones, since that snapshot is closer to the final result of the match. Finally, I would like to add that it is normal for the model not to be 100% accurate: by nature this is a 5-versus-5 video game in which many factors that cannot be studied here come into play, for example human mistakes, scaling team compositions, individual superiority, etc. It is surprising that, despite this 'volatility' and given that matches last 30 minutes on average, the result can be predicted at the halfway point with an AUC of about 0.82. Note that this figure is area under the ROC curve rather than accuracy: it means the model ranks a randomly chosen winning team above a randomly chosen losing team about 82% of the time.
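
The metric reported by the evaluator above is area under the ROC curve, which is not the same thing as the fraction of matches predicted correctly. The following is a minimal sketch of the difference in plain Python. The labels and scores are invented for illustration only and do not come from the dataset, and the helper names `auc_by_ranking` and `accuracy` are hypothetical.

```python
from itertools import product


def auc_by_ranking(labels, scores):
    """AUC as the fraction of (win, loss) pairs where the win gets the higher score."""
    wins = [s for y, s in zip(labels, scores) if y == 1]
    losses = [s for y, s in zip(labels, scores) if y == 0]
    # A tie counts as half a correctly ordered pair.
    ordered = sum(
        1.0 if w > l else 0.5 if w == l else 0.0
        for w, l in product(wins, losses)
    )
    return ordered / (len(wins) * len(losses))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


# Hypothetical predicted win probabilities (made up for this example).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2]

auc = auc_by_ranking(labels, scores)
acc = accuracy(labels, scores)
print("AUC:", auc)
print("Accuracy at threshold 0.5:", acc)
```

On this toy data the two numbers differ noticeably, which is why an AUC of 0.82 should not be read directly as "82% of matches guessed right".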

## Improving logistic regression at minute 15 (standardization, hyperparameter tuning, PCA, and chi-squared)

#### Standardization

In [48]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
# Standardize the features column; fit on the training split only so that
# test rows do not influence the scaling statistics
scaler_model = scaler.fit(teams15_train)
scaled_train = scaler_model.transform(teams15_train)
scaled_test = scaler_model.transform(teams15_test)

In [49]:
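# Train the model again.
# Note: featuresCol is left at its default ("features"), so this model does not
# read the "scaledFeatures" column; pass featuresCol="scaledFeatures" to use it.
scaled_lr = cl.LogisticRegression(labelCol='result')
scaled_lr_model = scaled_lr.fit(scaled_train)
scaled_predictions = scaled_lr_model.transform(scaled_test)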

In [50]:
scaled_evaluator = ev.BinaryClassificationEvaluator(labelCol="result")
print("Test AUC after standardization: ",scaled_evaluator.evaluate(scaled_predictions, {scaled_evaluator.metricName: "areaUnderROC"}))

Test AUC after standardization:  0.824382378412872
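
To make the scaling step concrete, here is a small plain-Python sketch of what `withStd=True, withMean=False` amounts to for a single column: each value is divided by the column's standard deviation and the mean is left untouched. It assumes that Spark uses the corrected (n - 1) sample standard deviation, and the helper `scale_to_unit_std` is a hypothetical name, not part of PySpark. The three values are the first `goldat15` entries shown by `teams15.show(5)`.

```python
from statistics import mean, stdev


def scale_to_unit_std(column):
    """Divide every value by the corrected sample standard deviation (n - 1)."""
    sd = stdev(column)
    return [x / sd for x in column]


# First three goldat15 values from the table printed earlier in the notebook.
goldat15 = [24815.0, 23864.0, 27355.0]
scaled = scale_to_unit_std(goldat15)

print("scaled values:", scaled)
print("std after scaling:", stdev(scaled))  # unit standard deviation
print("mean after scaling:", mean(scaled))  # not centred, since withMean=False
```

As far as I know, `LogisticRegression` in Spark ML also has a `standardization` parameter that is enabled by default, which would be one reason why scaling the features beforehand barely moves the result.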


#### Hyperparameter tuning

In [51]:
import pyspark.ml.tuning as tune
grid= tune.ParamGridBuilder().addGrid(logistic.regParam, [0.01, 0.1, 0.5, 1.0])\
    .addGrid(logistic.maxIter, [1, 5, 10, 20]).build()
cv = tune.CrossValidator(estimator=logistic,
                          estimatorParamMaps=grid,
                          evaluator=evaluator,
                          numFolds=5)
cvModel = cv.fit(teams15_train)
cvPredictions = cvModel.transform(teams15_test)
print("Test AUC after hyperparameter tuning: ",evaluator.evaluate(cvPredictions, {evaluator.metricName: "areaUnderROC"}))

Test AUC after hyperparameter tuning:  0.8239151738912436
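
The cost of this search is easy to underestimate, so here is a rough plain-Python sketch that enumerates the same grid and counts how many models get trained. It assumes the usual behaviour of cross-validation (one fit per parameter combination per fold, plus one final refit of the best combination on the whole training set); the variable names are illustrative only.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder call above.
reg_params = [0.01, 0.1, 0.5, 1.0]
max_iters = [1, 5, 10, 20]
num_folds = 5

# Every combination of the two hyperparameters.
grid = [{"regParam": r, "maxIter": m} for r, m in product(reg_params, max_iters)]

cv_fits = len(grid) * num_folds  # one model per combination per fold
total_fits = cv_fits + 1         # plus the final refit of the best combination

print("combinations:", len(grid))
print("fits during cross-validation:", cv_fits)
print("fits in total:", total_fits)
```

With only 16 combinations the search is cheap here, but the count grows multiplicatively with every extra parameter added to the grid.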


#### PCA

We apply PCA and check how well it works.

In [52]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
pca = PCA(k=7, inputCol="features", outputCol="pcaFeatures")
pca_lr = cl.LogisticRegression(featuresCol="pcaFeatures", labelCol="result")
pipeline_pca = Pipeline(stages=[pca,pca_lr])
model_nb_pca = pipeline_pca.fit(teams15_train).transform(teams15_test)
nbaccuracy_pca = evaluator.evaluate(model_nb_pca)
print('Test AUC after PCA =', str(nbaccuracy_pca))

Test AUC after PCA = 0.8228668125256394
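
One likely reason why going from 16 features to 7 components costs so little is that several columns are redundant. The sketch below checks, on the two complete rows printed by `teams15.show(5)`, that each `...diffat15` column is just the team's value minus the opponent's value. This is only verified for those two rows, not for the whole dataset, and `diff_is_redundant` is a hypothetical helper name.

```python
# The two complete rows from the teams15.show(5) output (gold, xp and cs columns only).
rows = [
    {"goldat15": 24815.0, "xpat15": 31121.0, "csat15": 608.0,
     "opp_goldat15": 23864.0, "opp_xpat15": 31228.0, "opp_csat15": 590.0,
     "golddiffat15": 951.0, "xpdiffat15": -107.0, "csdiffat15": 18.0},
    {"goldat15": 23864.0, "xpat15": 31228.0, "csat15": 590.0,
     "opp_goldat15": 24815.0, "opp_xpat15": 31121.0, "opp_csat15": 608.0,
     "golddiffat15": -951.0, "xpdiffat15": 107.0, "csdiffat15": -18.0},
]


def diff_is_redundant(row, stat):
    """True if <stat>diffat15 equals <stat>at15 minus opp_<stat>at15 for this row."""
    return row[f"{stat}diffat15"] == row[f"{stat}at15"] - row[f"opp_{stat}at15"]


checks = {stat: all(diff_is_redundant(row, stat) for row in rows)
          for stat in ("gold", "xp", "cs")}
print(checks)
```

Columns that are exact linear combinations of others add no new directions of variance, so PCA can drop them without losing information.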


#### Chi-squared

In [53]:
from pyspark.ml.feature import ChiSqSelector
chi = ChiSqSelector(numTopFeatures = 10, featuresCol ='features', outputCol='features_chi', labelCol='result')
chi_model = chi.fit(teams15f)
teams15chi = chi_model.transform(teams15f)

In [54]:
required_features2

['goldat15',
 'xpat15',
 'csat15',
 'opp_goldat15',
 'opp_xpat15',
 'opp_csat15',
 'golddiffat15',
 'xpdiffat15',
 'csdiffat15',
 'killsat15',
 'assistsat15',
 'deathsat15',
 'opp_killsat15',
 'opp_assistsat15',
 'opp_deathsat15',
 'side2']

In [55]:
chi_model.selectedFeatures

[2, 5, 8, 9, 10, 11, 12, 13, 14, 15]
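
The indices are easier to read as column names. A small plain-Python sketch that maps them back onto `required_features2` (both lists are copied from the cells above; `selected_names` and `dropped_names` are just illustrative variable names):

```python
required_features2 = ['goldat15', 'xpat15', 'csat15',
                      'opp_goldat15', 'opp_xpat15', 'opp_csat15',
                      'golddiffat15', 'xpdiffat15', 'csdiffat15',
                      'killsat15', 'assistsat15', 'deathsat15',
                      'opp_killsat15', 'opp_assistsat15', 'opp_deathsat15',
                      'side2']

# Output of chi_model.selectedFeatures in the cell above.
selected_indices = [2, 5, 8, 9, 10, 11, 12, 13, 14, 15]

selected_names = [required_features2[i] for i in selected_indices]
dropped_names = [name for i, name in enumerate(required_features2)
                 if i not in selected_indices]

print("kept:   ", selected_names)
print("dropped:", dropped_names)
```

With `numTopFeatures = 10` the selector keeps the creep score, kill, assist, death and side columns, and drops all the gold and experience columns. `ChiSqSelector` is documented as working on categorical features, so its ranking of continuous columns such as gold and experience should probably be read with some caution.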

In [56]:
logistic_chi = cl.LogisticRegression(
    maxIter=10, 
    regParam=0.01, 
    featuresCol='features_chi',
    labelCol='result')
(teams15chi_train, teams15chi_test) = teams15chi.randomSplit([0.8,0.2],21)
model15log_chi = logistic_chi.fit(teams15chi_train)
predictions15logchi = model15log_chi.transform(teams15chi_test)
print('Test AUC after chi-squared selection =', evaluator.evaluate(predictions15logchi, {evaluator.metricName: 'areaUnderROC'}))

Test AUC after chi-squared selection = 0.8097736906878156


## Conclusions on the improvements

The improvements have not been very useful: none of the variants beats the baseline AUC of 0.8245 (0.8244 after standardization, 0.8239 after hyperparameter tuning, 0.8229 with PCA, and 0.8098 with chi-squared selection). As mentioned above, because of the 'volatility' of matches, when the model fails it is not because the data were evaluated badly, but because of human factors and other variables that cannot be analyzed. It is also worth mentioning that, when experimenting with the chi-squared selector and keeping only a single column, the column selected was the total experience difference between the teams. This fits well with the perspective of an independent analyst (LS/@LSXYZ9), who often puts a lot of emphasis on managing this resource well and giving it the importance it deserves.