<a href="https://colab.research.google.com/github/FerminMendez/ModuleAI/blob/main/BigData/ChessGameClassificationModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing the data

Retrieved from https://www.kaggle.com/datasets/arevel/chess-games

In [1]:
import os
os.environ['KAGGLE_USERNAME']=''
os.environ['KAGGLE_KEY']=''
!kaggle datasets download -d arevel/chess-games

Downloading chess-games.zip to /content
 99% 1.44G/1.45G [00:18<00:00, 151MB/s]
100% 1.45G/1.45G [00:18<00:00, 83.1MB/s]


In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 2.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=3b26389507c7e5ad9bcadc6d66ce7c344c7fa29af526d04e61bb6fea180b99b8
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('chess-games-classification').getOrCreate()

In [4]:
!ls
!unzip \*.zip  && rm *.zip

chess-games.zip  sample_data
Archive:  chess-games.zip
  inflating: chess_games.csv         


In [25]:
data = spark.read.csv('chess_games.csv', header=True)

In [26]:
data.show()

+------------------+---------------+---------------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------+-----------+------------+--------------------+
|             Event|          White|          Black|Result|   UTCDate| UTCTime|WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|             Opening|TimeControl| Termination|                  AN|
+------------------+---------------+---------------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------+-----------+------------+--------------------+
|        Classical |        eisaaaa|       HAMID449|   1-0|2016.06.30|22:00:01|    1901|    1896|           11.0|          -11.0|D10|        Slav Defense|      300+5|Time forfeit|1. d4 d5 2. c4 c6...|
|            Blitz |         go4jas|     Sergei1973|   0-1|2016.06.30|22:00:01|    1641|    1627|          -11.0|           12.0|C20|King's Pawn Openi...|      300+0|      Normal|1. e4 e5 2. b3 Nf

## About the dataset
We have records of more than 6 million chess games. In this notebook we try to predict who wins each game: White or Black.

In [27]:
data.count()

6256184

# About the data

We keep only the variables used in the analysis.
We drop the player names, which act as unique identifiers.

Independent variables:
*   Event as GameType: categorical variable for the type of game played (bullet, tournament, blitz, etc.)
*   WhiteElo: integer Elo rating of the player with the white pieces
*   BlackElo: integer Elo rating of the player with the black pieces
*   Opening

Dependent variable:
*   Result

Variables to add in later versions:
*   TimeControl
*   Termination



### Selecting only the variables we will use

In [34]:
df = data.select("event", "WhiteElo","BlackElo","Opening","TimeControl","Termination","Result")
df=df.withColumnRenamed("event", "GameType")
df.show()

+------------------+--------+--------+--------------------+-----------+------------+------+
|          GameType|WhiteElo|BlackElo|             Opening|TimeControl| Termination|Result|
+------------------+--------+--------+--------------------+-----------+------------+------+
|        Classical |    1901|    1896|        Slav Defense|      300+5|Time forfeit|   1-0|
|            Blitz |    1641|    1627|King's Pawn Openi...|      300+0|      Normal|   0-1|
| Blitz tournament |    1647|    1688|Scandinavian Defe...|      180+0|Time forfeit|   1-0|
|   Correspondence |    1706|    1317|Van't Kruijs Opening|          -|      Normal|   1-0|
| Blitz tournament |    1945|    1900|Sicilian Defense:...|      180+0|Time forfeit|   0-1|
| Blitz tournament |    1773|    1809|         Vienna Game|      180+0|      Normal|   0-1|
| Blitz tournament |    1895|    1886|Caro-Kann Defense...|      180+0|Time forfeit|   0-1|
| Blitz tournament |    2155|    2356|Queen's Pawn Game...|      180+0|      Nor

## Data preparation

We replace Result with binary values, where 0 means "Black wins" and 1 means "White wins". Any other result is filtered out.

In [35]:
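# Map decisive results to string labels; the column is cast to int later.
result_mapping = {
    "1-0": "1",  # White wins
    "0-1": "0",  # Black wins
}
df = df.replace(to_replace=result_mapping, subset=['Result'])
# Keep only decisive games; any other result value is dropped.
df = df.filter(df["Result"].isin("0", "1"))
df.show()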

+------------------+--------+--------+--------------------+-----------+------------+------+
|          GameType|WhiteElo|BlackElo|             Opening|TimeControl| Termination|Result|
+------------------+--------+--------+--------------------+-----------+------------+------+
|        Classical |    1901|    1896|        Slav Defense|      300+5|Time forfeit|     1|
|            Blitz |    1641|    1627|King's Pawn Openi...|      300+0|      Normal|     0|
| Blitz tournament |    1647|    1688|Scandinavian Defe...|      180+0|Time forfeit|     1|
|   Correspondence |    1706|    1317|Van't Kruijs Opening|          -|      Normal|     1|
| Blitz tournament |    1945|    1900|Sicilian Defense:...|      180+0|Time forfeit|     0|
| Blitz tournament |    1773|    1809|         Vienna Game|      180+0|      Normal|     0|
| Blitz tournament |    1895|    1886|Caro-Kann Defense...|      180+0|Time forfeit|     0|
| Blitz tournament |    2155|    2356|Queen's Pawn Game...|      180+0|      Nor
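The mapping and filter above can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming the `Result` column also contains non-decisive values such as `1/2-1/2` (draw) or `*` (unfinished), as is common in PGN-style data; the helper name `encode_results` and the sample list are hypothetical.

```python
# Hypothetical helper mirroring the replace + filter step on a plain list.
RESULT_MAPPING = {"1-0": 1, "0-1": 0}  # 1 = White wins, 0 = Black wins

def encode_results(results):
    """Map decisive results to 1 or 0 and drop every other value."""
    return [RESULT_MAPPING[r] for r in results if r in RESULT_MAPPING]

# Made-up sample: two non-decisive games are expected to disappear.
sample = ["1-0", "0-1", "1/2-1/2", "*", "1-0"]
encoded = encode_results(sample)
print(encoded)
```

Dropping non-decisive games keeps the problem binary, at the cost of the model never being able to predict a draw.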

We convert the categorical variables with one-hot encoding.

In [36]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Step 1: Index each categorical column (the most frequent label gets index 0)
indexer = StringIndexer(inputCol="GameType", outputCol="GameTypeIndex")
df = indexer.fit(df).transform(df)
indexer = StringIndexer(inputCol="Opening", outputCol="OpeningIndex")
df = indexer.fit(df).transform(df)
# Step 2: Perform one-hot encoding
encoder = OneHotEncoder(inputCol="GameTypeIndex", outputCol="GameTypeOneHot")
df = encoder.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="OpeningIndex", outputCol="OpeningOneHot")
df = encoder.fit(df).transform(df)
df.show()

+------------------+--------+--------+--------------------+-----------+------------+------+-------------+------------+--------------+------------------+
|          GameType|WhiteElo|BlackElo|             Opening|TimeControl| Termination|Result|GameTypeIndex|OpeningIndex|GameTypeOneHot|     OpeningOneHot|
+------------------+--------+--------+--------------------+-----------+------------+------+-------------+------------+--------------+------------------+
|        Classical |    1901|    1896|        Slav Defense|      300+5|Time forfeit|     1|          1.0|       131.0|(12,[1],[1.0])|(2938,[131],[1.0])|
|            Blitz |    1641|    1627|King's Pawn Openi...|      300+0|      Normal|     0|          0.0|       313.0|(12,[0],[1.0])|(2938,[313],[1.0])|
| Blitz tournament |    1647|    1688|Scandinavian Defe...|      180+0|Time forfeit|     1|          4.0|         1.0|(12,[4],[1.0])|  (2938,[1],[1.0])|
|   Correspondence |    1706|    1317|Van't Kruijs Opening|          -|      Norma
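The sparse notation in the output, such as `(12,[1],[1.0])`, reads as (vector size, active indices, values). Below is a rough plain-Python sketch of the two steps, assuming the default settings of `StringIndexer` (indices ordered by descending frequency) and `OneHotEncoder` (`dropLast=True`, so the last category becomes the all-zeros vector). The function names and the toy list are illustrative and are not part of PySpark.

```python
from collections import Counter

def string_index(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values), like the sparse vectors in the output."""
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    return (size, [], [])  # the dropped last category is the all-zeros vector

# Toy data, for illustration only.
game_types = ["Blitz", "Blitz", "Blitz", "Classical", "Classical", "Bullet"]
index = string_index(game_types)
encoded = {label: one_hot_sparse(i, len(index)) for label, i in index.items()}
print(index)
print(encoded)
```

Under these defaults, a vector size of 12 for `GameTypeOneHot` would correspond to 13 distinct game types, and 2938 for `OpeningOneHot` to 2939 distinct openings.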

We make sure the numeric variables are treated as integers.

In [57]:
from pyspark.sql.functions import col
df=df.withColumn("WhiteElo", col("WhiteElo").cast("int"))
df=df.withColumn("BlackElo", col("BlackElo").cast("int"))
df=df.withColumn("Result", col("Result").cast("int"))

We select only the columns we will use to train the model.

In [58]:
df_model=df.select( "WhiteElo","BlackElo","OpeningOneHot","GameTypeOneHot","Result")

In [59]:
from pyspark.ml.feature import VectorAssembler

featassembler = VectorAssembler(inputCols=['WhiteElo',
 'BlackElo',
 'OpeningOneHot',
 'GameTypeOneHot',], outputCol = "Independent Features" )
featassembler

VectorAssembler_3f308d26bef7

In [60]:
df_model = featassembler.transform(df_model)
df_model.show()

+--------+--------+------------------+--------------+------+--------------------+
|WhiteElo|BlackElo|     OpeningOneHot|GameTypeOneHot|Result|Independent Features|
+--------+--------+------------------+--------------+------+--------------------+
|    1901|    1896|(2938,[131],[1.0])|(12,[1],[1.0])|     1|(2952,[0,1,133,29...|
|    1641|    1627|(2938,[313],[1.0])|(12,[0],[1.0])|     0|(2952,[0,1,315,29...|
|    1647|    1688|  (2938,[1],[1.0])|(12,[4],[1.0])|     1|(2952,[0,1,3,2944...|
|    1706|    1317|  (2938,[0],[1.0])|(12,[6],[1.0])|     1|(2952,[0,1,2,2946...|
|    1945|    1900|(2938,[338],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,340,29...|
|    1773|    1809| (2938,[91],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,93,294...|
|    1895|    1886|(2938,[207],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,209,29...|
|    2155|    2356| (2938,[95],[1.0])|(12,[4],[1.0])|     1|(2952,[0,1,97,294...|
|    2010|    2111| (2938,[15],[1.0])|(12,[4],[1.0])|     0|(2952,[0,1,17,294...|
|    1764|    17
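The indices in `Independent Features` follow from concatenating the inputs in the order given to the assembler: two slots for the Elo ratings, then the 2938 opening slots, then the 12 game-type slots. A small sketch of that arithmetic, using the sizes shown in the output above (the function name is hypothetical, and rows that fall in a dropped last category are ignored for simplicity):

```python
N_OPENING = 2938    # size of OpeningOneHot in the output above
N_GAME_TYPE = 12    # size of GameTypeOneHot in the output above

def assembled_indices(opening_index, game_type_index):
    """Return (size, active indices) for one row, following the input order."""
    opening_offset = 2                              # slots 0 and 1 hold WhiteElo and BlackElo
    game_type_offset = opening_offset + N_OPENING   # game-type slots start after the openings
    size = game_type_offset + N_GAME_TYPE
    active = [0, 1, opening_offset + opening_index, game_type_offset + game_type_index]
    return size, active

first_row = assembled_indices(131, 1)   # Slav Defense, Classical
third_row = assembled_indices(1, 4)     # Scandinavian Defense, Blitz tournament
print(first_row)   # (2952, [0, 1, 133, 2941])
print(third_row)   # (2952, [0, 1, 3, 2944])
```

These values agree with the visible prefixes `(2952,[0,1,133,29...` and `(2952,[0,1,3,2944...` in the table.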

### Splitting the dataset into train and test

In [61]:
train_data, test_data = df_model.randomSplit([0.8, 0.2])
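`randomSplit` also accepts an optional `seed` argument; without one, each run produces a different split, so the metrics below are not exactly reproducible. The idea of a seeded weighted split can be sketched in plain Python as follows. This is a conceptual illustration, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    bounds = []
    cumulative = 0.0
    for w in weights:
        cumulative += w / total      # normalized cumulative weights
        bounds.append(cumulative)
    rng = random.Random(seed)        # fixed seed gives a repeatable split
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any rounding shortfall in the bounds.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(1000)), [0.8, 0.2], seed=42)
train_again, test_again = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

As with Spark, the split sizes are only approximately 80/20, because each row is assigned independently.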

### Random forest classification

In [62]:
from pyspark.ml.classification import RandomForestClassifier
model = RandomForestClassifier(labelCol="Result", featuresCol="Independent Features")
randomForestModel = model.fit(train_data)

In [63]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

modelPredict = randomForestModel.transform(test_data)

In [66]:
# MulticlassClassificationEvaluator defaults to metricName="f1" (weighted F1), not accuracy
evaluator = MulticlassClassificationEvaluator(labelCol = "Result", predictionCol="prediction")
f1 = evaluator.evaluate(modelPredict)
print("F1:", f1)

F1: 0.46092995836353834
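`MulticlassClassificationEvaluator` reports weighted F1 by default (`metricName="f1"`), so the value above is a weighted F1 rather than a plain accuracy. The two can differ noticeably. The sketch below, in plain Python with made-up labels, shows a classifier that always predicts 1: its accuracy equals the share of the majority class, while its weighted F1 is lower because label 0 is never predicted. Whether the random forest here behaves this way is not established by the output; this is only an illustration of the metric.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's share of the labels."""
    total = 0.0
    for cls in set(labels):
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * labels.count(cls) / len(labels)
    return total

# Toy example: six games won by White, four by Black, always predicting 1.
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
preds = [1] * len(labels)
acc = accuracy(labels, preds)
f1 = weighted_f1(labels, preds)
print(acc, f1)
```

To report accuracy from the evaluator itself, `metricName="accuracy"` can be passed when constructing it.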


### Logistic regression model

In [67]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol = 'Independent Features', labelCol = 'Result')
lrModel = lr.fit(train_data)

In [74]:
trainingSummary = lrModel.summary  # metrics computed on the training data

# Metrics can be inspected on a per-label basis
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

False positive rate by label:
label 0: 0.30439700115469825
label 1: 0.3825185953599698
True positive rate by label:
label 0: 0.6174814046400302
label 1: 0.6956029988453017
Precision by label:
label 0: 0.6538888784977234
label 1: 0.6613106678908223
Recall by label:
label 0: 0.6174814046400302
label 1: 0.6956029988453017
F-measure by label:
label 0: 0.6351638519924729
label 1: 0.678023510308858
Accuracy: 0.6579309820462579
FPR: 0.34484657856092604
TPR: 0.6579309820462579
F-measure: 0.6573556042769662
Precision: 0.6577317115530761
Recall: 0.6579309820462579
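In a binary problem the per-label rates are tied together: the false positive rate of one label is one minus the true positive rate of the other, which is why 0.3044 + 0.6956 and 0.3825 + 0.6175 both sum to 1 in the output above. A short sketch with made-up confusion-matrix counts (the function name and counts are hypothetical):

```python
def per_label_rates(tn, fp, fn, tp):
    """Per-label TPR and FPR for a binary problem; counts treat label 1 as positive."""
    tpr_1 = tp / (tp + fn)   # recall of label 1
    tpr_0 = tn / (tn + fp)   # recall of label 0
    fpr_1 = fp / (fp + tn)   # share of true 0s predicted as 1
    fpr_0 = fn / (fn + tp)   # share of true 1s predicted as 0
    return {"tpr": [tpr_0, tpr_1], "fpr": [fpr_0, fpr_1]}

# Made-up counts, for illustration only.
rates = per_label_rates(tn=60, fp=40, fn=30, tp=70)
print(rates)
```

The same relation explains why "True positive rate by label" and "Recall by label" print identical numbers: they are the same quantity.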


In [68]:
results = lrModel.transform(test_data)
# Showing the results
results.show()

+--------+--------+------------------+--------------+------+--------------------+--------------------+--------------------+----------+
|WhiteElo|BlackElo|     OpeningOneHot|GameTypeOneHot|Result|Independent Features|       rawPrediction|         probability|prediction|
+--------+--------+------------------+--------------+------+--------------------+--------------------+--------------------+----------+
|     799|    1250| (2938,[50],[1.0])|(12,[0],[1.0])|     1|(2952,[0,1,52,294...|[2.08263238624840...|[0.88920364319083...|       0.0|
|     864|    1223|(2938,[279],[1.0])|(12,[0],[1.0])|     0|(2952,[0,1,281,29...|[1.68063617658159...|[0.84298875300409...|       0.0|
|     878|    1370| (2938,[41],[1.0])|(12,[1],[1.0])|     0|(2952,[0,1,43,294...|[2.26809744292539...|[0.90620019267324...|       0.0|
|     881|    1351| (2938,[10],[1.0])|(12,[2],[1.0])|     0|(2952,[0,1,12,294...|[2.06137915117172...|[0.88709237876844...|       0.0|
|     887|    1250| (2938,[42],[1.0])|(12,[0],[1.0])|  

In [69]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

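# Calling the evaluator. Note: rawPredictionCol is set to the hard 0/1 `prediction`
# column, so the AUC below is computed from thresholded labels, not from scores.
res = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='Result')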

# Evaluating the AUC on results
ROC_AUC = res.evaluate(results)

In [70]:
print("ROC AUC:", ROC_AUC)

ROC AUC: 0.6557312003264703
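`BinaryClassificationEvaluator` computes the area under the ROC curve by default, so the value above is an AUC. Because the evaluator was given the hard `prediction` column rather than `rawPrediction`, the ROC curve has a single interior point, and the area reduces to the average of the true positive and true negative rates. The plain-Python sketch below illustrates this with made-up labels and hypothetical scores; it is an illustration of the metric, not Spark's implementation.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.45, 0.1, 0.3, 0.55, 0.4]   # hypothetical model scores
hard = [1 if p >= 0.5 else 0 for p in probs]          # thresholded 0/1 predictions

auc_hard = roc_auc(labels, hard)
auc_probs = roc_auc(labels, probs)
tpr = sum(h == 1 and l == 1 for h, l in zip(hard, labels)) / labels.count(1)
tnr = sum(h == 0 and l == 0 for h, l in zip(hard, labels)) / labels.count(0)
print(auc_hard, (tpr + tnr) / 2, auc_probs)
```

In this toy case the AUC from hard predictions matches the balanced accuracy, while the score-based AUC is higher. Passing `rawPredictionCol='rawPrediction'` would give the usual score-based AUC for the logistic regression.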
