# **Master's in Applied Artificial Intelligence**

## **Course: Analysis of Large Volumes of Data (TC4034.10)**

### Tecnológico de Monterrey

### Prof. Dr. Iván Olmos Pineda

## **Activity 3**
### **Supervised and Unsupervised Learning**

# **Name and Student ID**

Daniel Iván Benitez Fernandez de Jauregui -
A01795860

# 1. Theoretical Introduction

## Supervised Learning

Supervised learning is a Machine Learning technique in which a model is trained on a labeled dataset, that is, data containing both the inputs (features) and the desired outputs (labels or targets). The model learns a mapping from inputs to outputs and can then generalize to predict the output for new, unseen data.

### Common supervised algorithms

- **Decision Trees (DecisionTreeClassifier)**: An interpretable algorithm that splits the feature space using rules based on the values of the variables.
- **Random Forest (RandomForestClassifier)**: An ensemble of multiple decision trees, which typically improves accuracy and reduces overfitting compared with a single tree.
- **Gradient Boosted Trees (GBTClassifier)**: Builds trees sequentially, each one fitted to correct the errors of the previous ones. In `pyspark.ml`, this classifier supports binary labels only.
- **Multilayer Perceptron (MultilayerPerceptronClassifier)**: A feedforward neural network with multiple layers, suited to non-linear relationships in the data.
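
To make the ensemble idea behind Random Forest concrete, here is a minimal plain-Python sketch. It is not the PySpark implementation: each "tree" is reduced to a hypothetical one-rule stump, and the thresholds and the `STUMPS` list are invented for illustration. The forest predicts by majority vote over the stumps.

```python
from collections import Counter

# Each stump is a hypothetical one-rule "tree": (feature, threshold).
# It votes 1 when the feature value exceeds the threshold, otherwise 0.
STUMPS = [("potential", 80), ("value_eur", 1_200_000), ("age", 30)]

def stump_predict(stump, row):
    """Return the vote (0 or 1) of a single stump for one row."""
    feature, threshold = stump
    return 1 if row[feature] > threshold else 0

def forest_predict(stumps, row):
    """Return the class that receives the most votes across all stumps."""
    votes = Counter(stump_predict(s, row) for s in stumps)
    return votes.most_common(1)[0][0]

# Hypothetical player row
player = {"potential": 85, "value_eur": 2_000_000, "age": 24}
prediction = forest_predict(STUMPS, player)
```

A real forest trains each tree on a different bootstrap sample and feature subset; the vote aggregation is the part this sketch keeps.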

## Unsupervised Learning

Unlike supervised learning, unsupervised learning does not use labels. The goal is to discover underlying structure in the data, such as clusters or patterns.

### Common unsupervised algorithms

- **K-Means**: A clustering algorithm that minimizes the within-cluster sum of squared distances between each point and its cluster centroid.
- **Gaussian Mixture Models (GMM)**: Assumes the data come from a mixture of several Gaussian distributions.
- **Power Iteration Clustering (PIC)**: A graph-based algorithm that uses spectral techniques to find groups of connected nodes.
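
As an illustration of what K-Means minimizes, the following plain-Python sketch runs one assignment step and one centroid update on four hypothetical 2-D points and compares the within-cluster sum of squares before and after. The helper names and the starting centroids are assumptions made for the example, and the sketch assumes no cluster ends up empty.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids

def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # arbitrary starting guess

labels = assign(points, centroids)
before = wcss(points, labels, centroids)
centroids = update(points, labels, k=2)
after = wcss(points, labels, centroids)
```

Repeating the two steps until the labels stop changing gives the usual iterative procedure; the objective never increases from one iteration to the next.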

## Availability in PySpark

PySpark, through `pyspark.ml`, provides implementations of the algorithms listed above, both for supervised classification (`classification`) and for unsupervised clustering (`clustering`).


In [1]:
# Install Java and PySpark
!apt-get install openjdk-11-jdk -y
!pip install pyspark
!pip install -q kaggle

# Set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["KAGGLE_CONFIG_DIR"] = "/content"

# Upload kaggle.json manually
from google.colab import files
files.upload()

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
openjdk-11-jdk is already the newest version (11.0.27+6~us1-0ubuntu1~22.04).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


Saving kaggle.json to kaggle (1).json


{'kaggle (1).json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [2]:
# Download the dataset from Kaggle (after uploading your kaggle.json)
!kaggle datasets download -d joebeachcapital/fifa-players
!unzip -o fifa-players.zip -d .

Dataset URL: https://www.kaggle.com/datasets/joebeachcapital/fifa-players
License(s): other
fifa-players.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  fifa-players.zip
  inflating: ./2024/all_players.csv  
  inflating: ./2024/female_players.csv  
  inflating: ./2024/male_players.csv  
  inflating: ./female_coaches_23.csv  
  inflating: ./female_players (legacy)_23.csv  
  inflating: ./female_players_23.csv  
  inflating: ./female_teams_23.csv   
  inflating: ./fifa_2022_datasets/Career Mode female player datasets - FIFA 16-22.xlsx  
  inflating: ./fifa_2022_datasets/Career Mode player datasets - FIFA 15-22.xlsx  
  inflating: ./fifa_2022_datasets/female_players_16.csv  
  inflating: ./fifa_2022_datasets/female_players_17.csv  
  inflating: ./fifa_2022_datasets/female_players_18.csv  
  inflating: ./fifa_2022_datasets/female_players_19.csv  
  inflating: ./fifa_2022_datasets/female_players_20.csv  
  inflating: ./fifa_2022_datasets/fema

In [7]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLFIFA").getOrCreate()
df = spark.read.csv("male_players_23.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)


root
 |-- player_id: integer (nullable = true)
 |-- player_url: string (nullable = true)
 |-- fifa_version: integer (nullable = true)
 |-- fifa_update: integer (nullable = true)
 |-- fifa_update_date: date (nullable = true)
 |-- short_name: string (nullable = true)
 |-- long_name: string (nullable = true)
 |-- player_positions: string (nullable = true)
 |-- overall: integer (nullable = true)
 |-- potential: integer (nullable = true)
 |-- value_eur: integer (nullable = true)
 |-- wage_eur: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- dob: date (nullable = true)
 |-- height_cm: integer (nullable = true)
 |-- weight_kg: integer (nullable = true)
 |-- league_id: integer (nullable = true)
 |-- league_name: string (nullable = true)
 |-- league_level: integer (nullable = true)
 |-- club_team_id: integer (nullable = true)
 |-- club_name: string (nullable = true)
 |-- club_position: string (nullable = true)
 |-- club_jersey_number: integer (nullable = true)
 |-- club_loane

# 2. Data selection

In [8]:

# 2. Data selection

from functools import reduce

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, rand

spark = SparkSession.builder.appName("MLFIFA").getOrCreate()

# Load the dataset (it must be downloaded from Kaggle beforehand)
df = spark.read.csv("male_players_23.csv", header=True, inferSchema=True)

# Partition the data by age band and market-value band
partitions = {
    "joven_bajo": df.filter((col("age") <= 22) & (col("value_eur") <= 450000)),
    "joven_medio": df.filter((col("age") <= 22) & (col("value_eur") > 450000) & (col("value_eur") <= 1200000)),
    "joven_alto": df.filter((col("age") <= 22) & (col("value_eur") > 1200000)),
    "adulto_bajo": df.filter((col("age") > 22) & (col("age") <= 27) & (col("value_eur") <= 450000)),
    "adulto_medio": df.filter((col("age") > 22) & (col("age") <= 27) & (col("value_eur") > 450000) & (col("value_eur") <= 1200000)),
    "adulto_alto": df.filter((col("age") > 22) & (col("age") <= 27) & (col("value_eur") > 1200000)),
    "veterano_bajo": df.filter((col("age") >= 28) & (col("value_eur") <= 450000)),
    "veterano_medio": df.filter((col("age") >= 28) & (col("value_eur") > 450000) & (col("value_eur") <= 1200000)),
    "veterano_alto": df.filter((col("age") >= 28) & (col("value_eur") > 1200000))
}

# Take up to 1000 rows per partition: a random 1000 for most partitions,
# and the 1000 rows with the lowest values of a given column for the "*_alto" partitions
muestras = [
    partitions["joven_bajo"].orderBy(rand()).limit(1000),
    partitions["joven_medio"].orderBy(rand()).limit(1000),
    partitions["joven_alto"].orderBy("potential").limit(1000),
    partitions["adulto_bajo"].orderBy(rand()).limit(1000),
    partitions["adulto_medio"].orderBy(rand()).limit(1000),
    partitions["adulto_alto"].orderBy("value_eur").limit(1000),
    partitions["veterano_bajo"].orderBy(rand()).limit(1000),
    partitions["veterano_medio"].orderBy(rand()).limit(1000),
    partitions["veterano_alto"].orderBy("overall").limit(1000)
]

# Combine the samples; union() replaces the deprecated unionAll()
# and matches columns by position, which is safe here because all share df's schema
M = reduce(DataFrame.union, muestras)
M.show(5)


initial_count_M = M.count()
print(f"Number of records in M before dropna: {initial_count_M}")

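
The cell above mentions systematic sampling, while `orderBy(<column>).limit(1000)` keeps the 1000 rows with the lowest values of that column. For comparison, here is a plain-Python sketch of the partition rule and of an every-k-th-row systematic sample. The helpers `partition_label` and `systematic_sample` and the toy rows are hypothetical and are not part of the notebook.

```python
def partition_label(age, value_eur):
    """Mirror the notebook's age and market-value rules for one player."""
    if age <= 22:
        age_band = "joven"
    elif age <= 27:
        age_band = "adulto"
    else:
        age_band = "veterano"
    if value_eur <= 450_000:
        value_band = "bajo"
    elif value_eur <= 1_200_000:
        value_band = "medio"
    else:
        value_band = "alto"
    return f"{age_band}_{value_band}"

def systematic_sample(rows, n):
    """Pick every k-th row of an ordered list, returning at most n rows."""
    if n <= 0 or not rows:
        return []
    step = max(len(rows) // n, 1)
    return rows[::step][:n]

# Hypothetical rows: (player_id, age, value_eur, potential)
players = [(i, 20, 2_000_000 + i, 60 + i % 30) for i in range(10)]
ordered = sorted(players, key=lambda r: r[3])  # order by potential first
sample = systematic_sample(ordered, 5)
```

Because the sample walks the whole ordered list at a fixed step, it covers the full range of the ordering column instead of only its lowest values.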

In [None]:
# Import the functions module under an alias so that Spark's sum()
# does not shadow Python's built-in sum()
from pyspark.sql import functions as F

# Reduce the combined sample to 1000 random rows
M = M.orderBy(F.rand()).limit(1000)

# Count the nulls in each column
M.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in M.columns]).show()

# Columns the models need; rows are dropped only if one of these is null
columns_required = ["age", "value_eur", "overall", "potential", "pace", "shooting",
                    "passing", "dribbling", "defending", "physic", "preferred_foot"]

M = M.dropna(subset=columns_required)

print("Number of records after dropna on key columns:", M.count())

# 3. Data preparation

In [5]:

# 3. Data preparation

from pyspark.ml.feature import StringIndexer, VectorAssembler


# Drop rows with nulls. With no subset, a null in any column removes the row.
M = M.dropna()
count_after_dropna = M.count()

print(f"Number of records in M after dropna: {count_after_dropna}")


# Index the categorical variable
indexer = StringIndexer(inputCol="preferred_foot", outputCol="foot_indexed")
# Check whether M is empty before fitting the indexer
if M.count() == 0:
    print("DataFrame M is empty after dropping nulls. Cannot fit StringIndexer.")
else:
    M = indexer.fit(M).transform(M)

# VectorAssembler for the predictor variables
features = ["age", "potential", "value_eur", "pace", "shooting", "passing", "dribbling", "defending", "physic", "foot_indexed"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
# Not guarded: this runs even when M is empty and foot_indexed was never created
M = assembler.transform(M)

print("Number of records in M after transformations:", M.count())
# Only display rows if M is not empty
if M.count() > 0:
    M.select("overall", "features").show(5)


Number of records in M after dropna: 0
DataFrame M is empty after dropping nulls. Cannot fit StringIndexer.


IllegalArgumentException: foot_indexed does not exist. Available: player_id, player_url, fifa_version, fifa_update, fifa_update_date, short_name, long_name, player_positions, overall, potential, value_eur, wage_eur, age, dob, height_cm, weight_kg, league_id, league_name, league_level, club_team_id, club_name, club_position, club_jersey_number, club_loaned_from, club_joined_date, club_contract_valid_until_year, nationality_id, nationality_name, nation_team_id, nation_position, nation_jersey_number, preferred_foot, weak_foot, skill_moves, international_reputation, work_rate, body_type, real_face, release_clause_eur, player_tags, player_traits, pace, shooting, passing, dribbling, defending, physic, attacking_crossing, attacking_finishing, attacking_heading_accuracy, attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, skill_ball_control, movement_acceleration, movement_sprint_speed, movement_agility, movement_reactions, movement_balance, power_shot_power, power_jumping, power_stamina, power_strength, power_long_shots, mentality_aggression, mentality_interceptions, mentality_positioning, mentality_vision, mentality_penalties, mentality_composure, defending_marking_awareness, defending_standing_tackle, defending_sliding_tackle, goalkeeping_diving, goalkeeping_handling, goalkeeping_kicking, goalkeeping_positioning, goalkeeping_reflexes, goalkeeping_speed, ls, st, rs, lw, lf, cf, rf, rw, lam, cam, ram, lm, lcm, cm, rcm, rm, lwb, ldm, cdm, rdm, rwb, lb, lcb, cb, rcb, rb, gk, player_face_url
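
The recorded output shows 0 rows after `M.dropna()`, which is what happens when every row has a null in at least one of the table's many columns. The plain-Python sketch below reproduces the effect on hypothetical toy rows and contrasts it with restricting the check to the modeling columns, the same idea as the `dropna(subset=columns_required)` call that appears in the earlier cell. The `dropna` helper here is an illustrative stand-in, not the PySpark method.

```python
def dropna(rows, subset=None):
    """Keep rows with no None in `subset`, or in any column if subset is None."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else row.keys()
        if all(row[c] is not None for c in columns):
            kept.append(row)
    return kept

# Hypothetical toy rows: each one has a None somewhere, as in a wide table
# where some columns only apply to certain players.
rows = [
    {"age": 21, "pace": 80, "preferred_foot": "Right", "club_loaned_from": None},
    {"age": 30, "pace": 65, "preferred_foot": "Left", "club_loaned_from": None},
    {"age": 27, "pace": None, "preferred_foot": "Right", "club_loaned_from": "Club X"},
]

all_columns = dropna(rows)  # checks every column
model_columns = dropna(rows, subset=["age", "pace", "preferred_foot"])
```

With the subset, only the row missing a modeling value is removed, so the indexer and assembler would have data to work with.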

# 4. Preparing the training and test sets

In [7]:

# 4. Train/test split

# Check that M is not empty before splitting
if M.count() == 0:
    print("DataFrame M is empty. Cannot split into train and test sets.")
else:
    train, test = M.randomSplit([0.7, 0.3], seed=42)
    print(f"Number of records in train: {train.count()}")
    print(f"Number of records in test: {test.count()}")
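
`randomSplit` assigns rows to the splits at random, so the 70/30 weights describe expected proportions and the resulting counts are approximate. A plain-Python sketch of that behavior follows, under the assumption of one independent draw per row; `random_split` is a hypothetical helper, not the Spark implementation.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows, train_fraction=0.7, seed=42)
```

Fixing the seed makes the split repeatable for the same input, which is why the notebook passes `seed=42`.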



# 5. Building the supervised and unsupervised learning models

In [8]:

# 5.1 Supervised learning - Random Forest

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Each distinct value of "overall" is treated as a separate class
rf = RandomForestClassifier(labelCol="overall", featuresCol="features", numTrees=10)
modelo_rf = rf.fit(train)
predicciones = modelo_rf.transform(test)

evaluador = MulticlassClassificationEvaluator(labelCol="overall", predictionCol="prediction", metricName="accuracy")
accuracy = evaluador.evaluate(predicciones)
print("Accuracy of the supervised model:", accuracy)

# 5.2 Unsupervised learning - KMeans

from pyspark.ml.clustering import KMeans

# Cluster the full sample M into 4 groups using the assembled feature vector
kmeans = KMeans(k=4, featuresCol="features", seed=42)
modelo_kmeans = kmeans.fit(M)
clusters = modelo_kmeans.transform(M)
clusters.select("prediction").show(5)


Py4JJavaError: An error occurred while calling o242.fit.
: org.apache.spark.SparkException: ML algorithm was given empty dataset.
	at org.apache.spark.ml.util.DatasetUtils$.getNumClasses(DatasetUtils.scala:195)
	at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:75)
	at org.apache.spark.ml.classification.RandomForestClassifier.$anonfun$train$1(RandomForestClassifier.scala:144)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
	at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:47)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
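
The evaluator in the cell above is configured with `metricName="accuracy"`, that is, the share of test rows whose predicted class equals the label. Because `overall` is an integer rating used as the class label, a prediction that is off by one point counts as a miss. A plain-Python sketch of the metric, using hypothetical ratings:

```python
def accuracy(labels, predictions):
    """Fraction of rows whose predicted class equals the true class."""
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical 'overall' ratings treated as class labels, as in the notebook
true_overall = [70, 65, 82, 70, 59]
predicted = [70, 66, 82, 71, 59]
score = accuracy(true_overall, predicted)
```

Here three of the five predictions match exactly, so near misses such as 66 versus 65 lower the score just as much as large errors would.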


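## Conclusions

This notebook sets up a Machine Learning workflow in PySpark on the FIFA 23 player data: partitioning by age and market value, sampling, feature preparation, a supervised Random Forest model and an unsupervised K-Means model. In the recorded run, the unrestricted `dropna()` in section 3 left 0 rows, so the Random Forest fit failed with an empty-dataset error and K-Means was never reached. Null removal has to be limited to the modeling columns before either model can be trained and evaluated.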