<a href="https://colab.research.google.com/github/FerminMendez/AlgoritmosAvanzados/blob/main/BigData/ChessGameClassificationModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About this project

In this project we solve a binary classification problem: given a dataset of chess games, we determine which player won. The dataset is available on Kaggle. It contains more than 6 million records and is 4.38 GB in size.


Useful sources:
Some notebooks perform an exploratory analysis of the data. Here are two of them.
https://www.kaggle.com/code/justinwitter/data-preparation

https://www.kaggle.com/code/sumeetpachauri/dm-chess-data


# Importing the data

Retrieved from https://www.kaggle.com/datasets/arevel/chess-games

How do you import the data? To import the dataset directly from Kaggle, you can follow this guide: https://medium.com/analytics-vidhya/how-to-fetch-kaggle-datasets-into-google-colab-ea682569851a


## Configure your Kaggle variables

In [None]:
import os
os.environ['KAGGLE_USERNAME']=''
os.environ['KAGGLE_KEY']=''
!kaggle datasets download -d arevel/chess-games

Downloading chess-games.zip to /content
 99% 1.44G/1.45G [00:18<00:00, 90.6MB/s]
100% 1.45G/1.45G [00:18<00:00, 83.0MB/s]


In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=3b26389507c7e5ad9bcadc6d66ce7c344c7fa29af526d04e61bb6fea180b99b8
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [None]:
from pyspark.sql import SparkSession
# The application name is only a label shown in the Spark UI
spark = SparkSession.builder.appName('chess-games-bigdata').getOrCreate()

Once the data is downloaded, we unzip the archive to get the file chess_games.csv, which we then load into a PySpark DataFrame.

In [None]:
!ls
!unzip \*.zip  && rm *.zip

chess-games.zip  sample_data
Archive:  chess-games.zip
  inflating: chess_games.csv         


In [None]:
data = spark.read.csv('chess_games.csv', header=True)

In [None]:
data.show()

+------------------+---------------+---------------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------+-----------+------------+--------------------+
|             Event|          White|          Black|Result|   UTCDate| UTCTime|WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|             Opening|TimeControl| Termination|                  AN|
+------------------+---------------+---------------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------+-----------+------------+--------------------+
|        Classical |        eisaaaa|       HAMID449|   1-0|2016.06.30|22:00:01|    1901|    1896|           11.0|          -11.0|D10|        Slav Defense|      300+5|Time forfeit|1. d4 d5 2. c4 c6...|
|            Blitz |         go4jas|     Sergei1973|   0-1|2016.06.30|22:00:01|    1641|    1627|          -11.0|           12.0|C20|King's Pawn Openi...|      300+0|      Normal|1. e4 e5 2. b3 Nf

## About the dataset

Number of records: 6.25 million.

Size: 4.38 GB.

Columns:
- Event: Game type.
- White: White's ID.
- Black: Black's ID.
- Result: Game Result (1-0 White wins) (0-1 Black wins)
- UTCDate: UTC Date.
- UTCTime: UTC Time.
- WhiteElo: White's ELO.
- BlackElo: Black's ELO.
- WhiteRatingDiff: White's rating points difference after the game.
- BlackRatingDiff: Black's rating points difference after the game.
- ECO: Opening in ECO encoding.
- Opening: Opening name.
- TimeControl: Time of the game for each player in seconds, in the form `base+increment` (for example, `300+5`). The number after the `+` is the number of seconds before the player's clock starts ticking in each turn.
- Termination: Reason of the game's end.
- AN: Movements in Movetext format.
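
The sample rows in this notebook show `TimeControl` values such as `300+5`, `180+0`, and `-`. A small parser makes that format concrete. This is a minimal sketch in plain Python, assuming every value is either `base+increment` or a placeholder without a clock; `parse_time_control` is a hypothetical helper and is not used elsewhere in the notebook.

```python
from typing import Optional, Tuple


def parse_time_control(value: str) -> Optional[Tuple[int, int]]:
    """Split a 'base+increment' string into two integers (seconds).

    Returns None for values without a clock, such as '-'.
    """
    value = value.strip()
    if "+" not in value:
        return None
    base, increment = value.split("+", 1)
    return int(base), int(increment)


print(parse_time_control("300+5"))  # (300, 5)
print(parse_time_control("-"))      # None
```

Splitting the column this way would give two numeric features if `TimeControl` is added to the model in a later version.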

In [None]:
data.count()

6256184

The dataset has no missing values.
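
One way to check a claim like this is to count the null or blank cells in each column. The sketch below shows the idea in plain Python on a few hand-written rows; `count_missing` is a hypothetical helper, the rows are illustrative, and on the full dataset the same count would have to be computed with Spark rather than in local memory.

```python
from typing import Dict, Iterable, List, Optional


def count_missing(rows: Iterable[Dict[str, Optional[str]]],
                  columns: List[str]) -> Dict[str, int]:
    """Count the cells that are None or blank, per column."""
    missing = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            cell = row.get(name)
            if cell is None or cell.strip() == "":
                missing[name] += 1
    return missing


# Hand-written sample rows (illustrative, not taken from the dataset).
sample = [
    {"WhiteElo": "1901", "BlackElo": "1896", "Result": "1-0"},
    {"WhiteElo": "1641", "BlackElo": "", "Result": "0-1"},
    {"WhiteElo": None, "BlackElo": "1688", "Result": "1-0"},
]
print(count_missing(sample, ["WhiteElo", "BlackElo", "Result"]))
```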

# About the data

We keep only the variables that we will use in the analysis.
We drop the players' names because they are unique identifiers.

Independent variables:
*   Event as GameType: categorical variable with the type of game played (bullet, tournament, blitz, etc.)
*   WhiteElo: integer Elo rating of the player with the white pieces
*   BlackElo: integer Elo rating of the player with the black pieces
*   Opening

Dependent variable:
*   Result

Variables to add in later versions:
*   TimeControl
*   Termination



### Selecting the variables we will use

In [None]:
df = data.select("event", "WhiteElo","BlackElo","Opening","TimeControl","Termination","Result")
df=df.withColumnRenamed("event", "GameType")
df.show()

+------------------+--------+--------+--------------------+-----------+------------+------+
|          GameType|WhiteElo|BlackElo|             Opening|TimeControl| Termination|Result|
+------------------+--------+--------+--------------------+-----------+------------+------+
|        Classical |    1901|    1896|        Slav Defense|      300+5|Time forfeit|   1-0|
|            Blitz |    1641|    1627|King's Pawn Openi...|      300+0|      Normal|   0-1|
| Blitz tournament |    1647|    1688|Scandinavian Defe...|      180+0|Time forfeit|   1-0|
|   Correspondence |    1706|    1317|Van't Kruijs Opening|          -|      Normal|   1-0|
| Blitz tournament |    1945|    1900|Sicilian Defense:...|      180+0|Time forfeit|   0-1|
| Blitz tournament |    1773|    1809|         Vienna Game|      180+0|      Normal|   0-1|
| Blitz tournament |    1895|    1886|Caro-Kann Defense...|      180+0|Time forfeit|   0-1|
| Blitz tournament |    2155|    2356|Queen's Pawn Game...|      180+0|      Nor

## Data preparation

We replace Result with binary values, where 0 means "Black wins" and 1 means "White wins". Games with any other result (for example, draws) are filtered out.

In [None]:
result_mapping = {
    "1-0": "1",
    "0-1": "0",
}
df = df.replace(to_replace=result_mapping, subset=['Result'])
df = df.filter((df["Result"] == 0) | (df["Result"] == 1))
df.show()

+------------------+--------+--------+--------------------+-----------+------------+------+
|          GameType|WhiteElo|BlackElo|             Opening|TimeControl| Termination|Result|
+------------------+--------+--------+--------------------+-----------+------------+------+
|        Classical |    1901|    1896|        Slav Defense|      300+5|Time forfeit|     1|
|            Blitz |    1641|    1627|King's Pawn Openi...|      300+0|      Normal|     0|
| Blitz tournament |    1647|    1688|Scandinavian Defe...|      180+0|Time forfeit|     1|
|   Correspondence |    1706|    1317|Van't Kruijs Opening|          -|      Normal|     1|
| Blitz tournament |    1945|    1900|Sicilian Defense:...|      180+0|Time forfeit|     0|
| Blitz tournament |    1773|    1809|         Vienna Game|      180+0|      Normal|     0|
| Blitz tournament |    1895|    1886|Caro-Kann Defense...|      180+0|Time forfeit|     0|
| Blitz tournament |    2155|    2356|Queen's Pawn Game...|      180+0|      Nor
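
The mapping and filter used in this step can be illustrated without Spark. This is a minimal sketch in plain Python; `encode_result` and `keep_decisive` are hypothetical helpers, and `1/2-1/2` is assumed here as the notation for a draw.

```python
from typing import List, Optional

RESULT_MAPPING = {"1-0": 1, "0-1": 0}


def encode_result(result: str) -> Optional[int]:
    """Return 1 if White won, 0 if Black won, None for anything else."""
    return RESULT_MAPPING.get(result.strip())


def keep_decisive(results: List[str]) -> List[int]:
    """Encode results and drop games that are neither '1-0' nor '0-1'."""
    encoded = (encode_result(r) for r in results)
    return [label for label in encoded if label is not None]


# '1/2-1/2' (assumed draw notation) is dropped from the output.
print(keep_decisive(["1-0", "0-1", "1/2-1/2", "1-0"]))  # [1, 0, 1]
```

Dropping the non-decisive games is what keeps the problem strictly binary.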

We convert the categorical variables with one-hot encoding.

In [None]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Step 1: Map each category string to a numeric index
indexer = StringIndexer(inputCol="GameType", outputCol="GameTypeIndex")
df = indexer.fit(df).transform(df)
indexer = StringIndexer(inputCol="Opening", outputCol="OpeningIndex")
df = indexer.fit(df).transform(df)

# Step 2: Perform one-hot encoding on the indexed columns
encoder = OneHotEncoder(inputCol="GameTypeIndex", outputCol="GameTypeOneHot")
df = encoder.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="OpeningIndex", outputCol="OpeningOneHot")
df = encoder.fit(df).transform(df)
df.show()

+------------------+--------+--------+--------------------+-----------+------------+------+-------------+------------+--------------+------------------+
|          GameType|WhiteElo|BlackElo|             Opening|TimeControl| Termination|Result|GameTypeIndex|OpeningIndex|GameTypeOneHot|     OpeningOneHot|
+------------------+--------+--------+--------------------+-----------+------------+------+-------------+------------+--------------+------------------+
|        Classical |    1901|    1896|        Slav Defense|      300+5|Time forfeit|     1|          1.0|       131.0|(12,[1],[1.0])|(2938,[131],[1.0])|
|            Blitz |    1641|    1627|King's Pawn Openi...|      300+0|      Normal|     0|          0.0|       313.0|(12,[0],[1.0])|(2938,[313],[1.0])|
| Blitz tournament |    1647|    1688|Scandinavian Defe...|      180+0|Time forfeit|     1|          4.0|         1.0|(12,[4],[1.0])|  (2938,[1],[1.0])|
|   Correspondence |    1706|    1317|Van't Kruijs Opening|          -|      Norma
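
The one-hot columns are printed in sparse form, `(size, [indices], [values])`. The sketch below expands that notation into a dense list in plain Python; `sparse_to_dense` is a hypothetical helper written for illustration, not a Spark function.

```python
from typing import List


def sparse_to_dense(size: int, indices: List[int], values: List[float]) -> List[float]:
    """Expand a (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# (12,[1],[1.0]) is the GameTypeOneHot value of the first row shown above.
print(sparse_to_dense(12, [1], [1.0]))
```

Spark's `OneHotEncoder` drops the last category by default (`dropLast=True`), so a vector of size 12 covers 13 indexed categories, with the last one encoded as all zeros.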

We make sure that the numeric variables are treated as integers.

In [None]:
from pyspark.sql.functions import col

# CSV columns are read as strings, so cast the numeric ones explicitly
df = df.withColumn("WhiteElo", col("WhiteElo").cast("int"))
df = df.withColumn("BlackElo", col("BlackElo").cast("int"))
df = df.withColumn("Result", col("Result").cast("int"))

We keep only the columns that we will use to train the model.

In [None]:
df_model=df.select( "WhiteElo","BlackElo","OpeningOneHot","GameTypeOneHot","Result")

In [None]:
from pyspark.ml.feature import VectorAssembler

featassembler = VectorAssembler(inputCols=['WhiteElo',
 'BlackElo',
 'OpeningOneHot',
 'GameTypeOneHot',], outputCol = "Independent Features" )
featassembler

VectorAssembler_3f308d26bef7

In [None]:
df_model = featassembler.transform(df_model)
df_model.show()

+--------+--------+------------------+--------------+------+--------------------+
|WhiteElo|BlackElo|     OpeningOneHot|GameTypeOneHot|Result|Independent Features|
+--------+--------+------------------+--------------+------+--------------------+
|    1901|    1896|(2938,[131],[1.0])|(12,[1],[1.0])|     1|(2952,[0,1,133,29...|
|    1641|    1627|(2938,[313],[1.0])|(12,[0],[1.0])|     0|(2952,[0,1,315,29...|
|    1647|    1688|  (2938,[1],[1.0])|(12,[4],[1.0])|     1|(2952,[0,1,3,2944...|
|    1706|    1317|  (2938,[0],[1.0])|(12,[6],[1.0])|     1|(2952,[0,1,2,2946...|
|    1945|    1900|(2938,[338],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,340,29...|
|    1773|    1809| (2938,[91],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,93,294...|
|    1895|    1886|(2938,[207],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,209,29...|
|    2155|    2356| (2938,[95],[1.0])|(12,[4],[1.0])|     1|(2952,[0,1,97,294...|
|    2010|    2111| (2938,[15],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,17,294...|
|    1764|    17
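
`VectorAssembler` concatenates its inputs in the order given in `inputCols`, which explains the indices in the `Independent Features` column: the two Elo values come first, then the opening vector, then the game type vector. The sketch below reproduces that arithmetic in plain Python, using the vector sizes printed in the output above; `assembled_indices` is a hypothetical helper, and it assumes both categories have a one-hot position (that is, neither is the dropped last category).

```python
from typing import List

NUMERIC_FEATURES = 2   # WhiteElo, BlackElo
OPENING_SIZE = 2938    # size of OpeningOneHot in the output above
GAME_TYPE_SIZE = 12    # size of GameTypeOneHot in the output above


def assembled_indices(opening_index: int, game_type_index: int) -> List[int]:
    """Positions of the non-zero entries in the assembled vector."""
    opening_position = NUMERIC_FEATURES + opening_index
    game_type_position = NUMERIC_FEATURES + OPENING_SIZE + game_type_index
    return [0, 1, opening_position, game_type_position]


print(NUMERIC_FEATURES + OPENING_SIZE + GAME_TYPE_SIZE)  # 2952
print(assembled_indices(1, 4))                           # [0, 1, 3, 2944]
```

The second call matches the third row of the output, where opening index 1 and game type index 4 appear as `(2952,[0,1,3,2944...`.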

### Splitting the dataset into train and test

In [None]:
train_data, test_data = df_model.randomSplit([0.8, 0.2])
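
Because no seed is passed, the split above is drawn differently on each run, so the reported metric can vary slightly between runs. The plain-Python sketch below illustrates how a seed makes a random split repeatable; `seeded_split` is a hypothetical helper that assigns each row independently, so the 80/20 proportion is only approximate.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def seeded_split(rows: Sequence[T], train_fraction: float,
                 seed: int) -> Tuple[List[T], List[T]]:
    """Assign each row to train or test with a seeded random draw."""
    rng = random.Random(seed)
    train: List[T] = []
    test: List[T] = []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.8, seed=42)
train_b, test_b = seeded_split(rows, 0.8, seed=42)
print(train_a == train_b, len(train_a) + len(test_a))  # True 1000
```

In PySpark, the same effect is available through the optional `seed` argument of `randomSplit`.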

## Model


In this case we train a logistic regression model to solve the binary classification problem.

### Logistic regression model

In [None]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol = 'Independent Features', labelCol = 'Result')
lrModel = lr.fit(train_data)

In [None]:
results = lrModel.transform(test_data)
# Showing the results
results.show()

+--------+--------+------------------+--------------+------+--------------------+--------------------+--------------------+----------+
|WhiteElo|BlackElo|     OpeningOneHot|GameTypeOneHot|Result|Independent Features|       rawPrediction|         probability|prediction|
+--------+--------+------------------+--------------+------+--------------------+--------------------+--------------------+----------+
|     799|    1250| (2938,[50],[1.0])|(12,[0],[1.0])|     1|(2952,[0,1,52,294...|[1.65481429251155...|[0.83954065470821...|       0.0|
|     864|    1223|(2938,[279],[1.0])|(12,[0],[1.0])|     0|(2952,[0,1,281,29...|[1.33190051165478...|[0.79115482793485...|       0.0|
|     878|    1370| (2938,[41],[1.0])|(12,[1],[1.0])|     0|(2952,[0,1,43,294...|[1.79121317880432...|[0.85707595138601...|       0.0|
|     881|    1351| (2938,[10],[1.0])|(12,[2],[1.0])|     0|(2952,[0,1,12,294...|[1.58931804096527...|[0.83052013457472...|       0.0|
|     887|    1250| (2938,[42],[1.0])|(12,[0],[1.0])|  

## Results


In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

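# Calling the evaluator (its default metric is areaUnderROC)
res = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='Result')

# Evaluating the AUC on results
ROC_AUC = res.evaluate(results)

In [None]:
print("ROC AUC:", ROC_AUC)

ROC AUC: 0.6550754182814911


The model reaches a score of about 0.655. This value is the area under the ROC curve, the default metric of `BinaryClassificationEvaluator`, rather than accuracy. It is clearly above the 0.5 expected from random guessing, which is a good result for a model that uses only the ratings, the opening, and the game type.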
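
`BinaryClassificationEvaluator` reports the area under the ROC curve by default, and here it receives the hard 0/1 `prediction` column instead of continuous scores. With hard predictions the ROC curve has a single interior point, so the area reduces to the mean of the true positive rate and the true negative rate (balanced accuracy), which can differ from plain accuracy. The sketch below shows the difference in plain Python on made-up labels; all function names are hypothetical.

```python
from typing import List, Tuple


def rates(labels: List[int], predictions: List[int]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, tn / negatives


def accuracy(labels: List[int], predictions: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)


def auc_from_hard_predictions(labels: List[int], predictions: List[int]) -> float:
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1)."""
    tpr, tnr = rates(labels, predictions)
    fpr = 1.0 - tnr
    # Sum of the two trapezoids on either side of the point (FPR, TPR).
    return fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2


# Made-up example: a classifier that always predicts "White wins".
labels = [1] * 8 + [0] * 2
predictions = [1] * 10
print(accuracy(labels, predictions))                   # 0.8
print(auc_from_hard_predictions(labels, predictions))  # 0.5
```

Passing the `rawPrediction` column to the evaluator would instead measure how well the model ranks games across all thresholds; that value is not computed in this notebook.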