# Income Classification with Spark ML

Juan Sebastian Gonzalez - A00371810

Juan Felipe Jojoa Crespo - A00382042

## Objective
Build a **binary classification** model with Spark ML, using **Logistic Regression**, to predict whether a person belongs to the $>50K$ or $<=50K$ class from their demographic and employment characteristics.


## 1. Loading the data
- Read the CSV file into a Spark DataFrame.
- Inspect the schema and show a few records to understand the data.

In [37]:
import os
import sys
from pyspark.sql import SparkSession

# Point the Spark driver and workers at the same Python interpreter
python_path = sys.executable
os.environ['PYSPARK_PYTHON'] = python_path
os.environ['PYSPARK_DRIVER_PYTHON'] = python_path

print(f"Configuring Spark to use Python: {python_path}")

# Stop any Spark session left over from a previous run of this cell
try:
    spark.stop()
except NameError:
    # No session exists yet (first run)
    pass

# Create the Spark session
spark = SparkSession.builder \
    .appName("ClasificacionIngresos") \
    .config("spark.sql.adaptive.enabled", "false") \
    .getOrCreate()

# Load the CSV file
data = spark.read.csv("adult_income_sample.csv", header=True, inferSchema=True)

# Inspect the schema and show the first rows
data.printSchema()
data.show(5, truncate=False)


Configuring Spark to use Python: c:\Users\juans\anaconda3\envs\py310\python.exe
root
 |-- age: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- label: string (nullable = true)

+---+------+---------+------+------------+--------------+-----+
|age|sex   |workclass|fnlwgt|education   |hours_per_week|label|
+---+------+---------+------+------------+--------------+-----+
|58 |Male  |Private  |164194|HS-grad     |34            |>50K |
|65 |Male  |Gov      |305929|Bachelors   |57            |<=50K|
|20 |Male  |Private  |134629|HS-grad     |52            |>50K |
|53 |Male  |Gov      |360726|Some-college|54            |<=50K|
|32 |Female|Gov      |165852|Bachelors   |30            |<=50K|
+---+------+---------+------+------------+--------------+-----+
only showing top 5 rows

## 2. Preprocessing categorical variables
- Use `StringIndexer` to transform the categorical columns:  
  `sex`, `workclass`, `education`, `label`.
- Apply `OneHotEncoder` to convert the indexed feature columns (all except `label`) into binary vectors, so the model does not read an ordering into the category indices.


In [38]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Categorical columns (including the target, "label")
categorical_cols = ["sex", "workclass", "education", "label"]

# Index each category; handleInvalid="skip" drops rows whose value is null or was not seen during fitting
indexers = [StringIndexer(inputCol=col, outputCol=col+"_index", handleInvalid="skip") for col in categorical_cols]

# One-hot encoding (for every column except the target variable "label")
encoder = OneHotEncoder(
    inputCols=["sex_index", "workclass_index", "education_index"],
    outputCols=["sex_vec", "workclass_vec", "education_vec"]
)
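
To make the two steps concrete, here is a plain-Python illustration of what they compute, assuming the default settings (`StringIndexer` orders categories by descending frequency, `OneHotEncoder` drops the last category). It is not Spark's implementation, and the sample values and helper names are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: float(i) for i, category in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a binary vector of length num_categories - 1.

    The last category is represented by the all-zeros vector.
    """
    vec = [0.0] * (num_categories - 1)
    if int(index) < num_categories - 1:
        vec[int(index)] = 1.0
    return vec

# Hypothetical sample of the workclass column
workclass = ["Private", "Gov", "Private", "Self-emp", "Gov", "Private"]
mapping = fit_string_index(workclass)
encoded = [one_hot_drop_last(mapping[w], len(mapping)) for w in workclass]

print(mapping)
print(encoded)
```

Because the index depends on frequency, the same category can receive a different index on a different dataset, which is why the fitted indexers have to be reused at prediction time.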

## 3. Assembling the features
- Build a feature vector (`features`) from the columns  
  `age`, `fnlwgt`, `hours_per_week`,  
  plus the encoded categorical variables.

In [39]:
from pyspark.ml.feature import VectorAssembler

# Numeric columns + encoded categorical columns
feature_cols = ["age", "fnlwgt", "hours_per_week", "sex_vec", "workclass_vec", "education_vec"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
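
Conceptually, the assembler concatenates its input columns in the order given, flattening the vector-valued ones. The plain-Python sketch below mimics that layout. The numeric values come from the first row shown in section 1, while the one-hot vectors and their lengths are made up for illustration.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns, in order, into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

feature_cols = ["age", "fnlwgt", "hours_per_week", "sex_vec", "workclass_vec", "education_vec"]

# Hypothetical encoded row (the one-hot vectors are illustrative, not real pipeline output)
row = {
    "age": 58,
    "fnlwgt": 164194,
    "hours_per_week": 34,
    "sex_vec": [1.0],
    "workclass_vec": [1.0, 0.0],
    "education_vec": [0.0, 1.0, 0.0],
}

features = assemble(row, feature_cols)
print(features)
```

Knowing this order matters when reading the fitted coefficients, since coefficient `i` belongs to position `i` of the assembled vector.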


## 4. Defining and training the model
- Configure a `LogisticRegression` model with Spark ML.  
- Use a `Pipeline` to chain the whole flow:  
  indexing, encoding, assembling, and training.

In [40]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# Logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label_index")

# Build the pipeline
pipeline = Pipeline(stages=indexers + [encoder, assembler, lr])

# Train the model
model = pipeline.fit(data)
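
For a binary label, logistic regression turns a linear score (the margin) into a probability with the sigmoid function. Spark's `probability` column then holds the pair [P(class 0), P(class 1)], and with the default threshold of 0.5 the prediction is 1.0 when P(class 1) exceeds the threshold. A minimal plain-Python sketch of that last step follows; the margin values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real margin to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_margin(margin, threshold=0.5):
    """Return ([P(class 0), P(class 1)], prediction) for a given margin."""
    p1 = sigmoid(margin)
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return probability, prediction

# Hypothetical margins: slightly negative and slightly positive
prob_a, pred_a = predict_from_margin(-0.15)
prob_b, pred_b = predict_from_margin(0.13)

print(prob_a, pred_a)
print(prob_b, pred_b)
```

A margin close to zero gives probabilities close to 0.5, meaning the features barely separate the two classes for that row.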


## 5. Evaluating the model
- Train the model on the 2000 records in the file.  
- Show the predictions together with the probabilities and the true label (`label`).  
- Reflect: what do you observe about the results?  

In [41]:
# Make predictions
predictions = model.transform(data)

# Show results: prediction, probability, and true label
predictions.select("age", "sex", "workclass", "education", "hours_per_week", 
                   "label", "probability", "prediction").show(10, truncate=False)


+---+------+---------+------------+--------------+-----+----------------------------------------+----------+
|age|sex   |workclass|education   |hours_per_week|label|probability                             |prediction|
+---+------+---------+------------+--------------+-----+----------------------------------------+----------+
|58 |Male  |Private  |HS-grad     |34            |>50K |[0.5367656843616814,0.46323431563831863]|0.0       |
|65 |Male  |Gov      |Bachelors   |57            |<=50K|[0.552717753217908,0.44728224678209205] |0.0       |
|20 |Male  |Private  |HS-grad     |52            |>50K |[0.5414849514158411,0.4585150485841589] |0.0       |
|53 |Male  |Gov      |Some-college|54            |<=50K|[0.46718829394714395,0.5328117060528561]|1.0       |
|32 |Female|Gov      |Bachelors   |30            |<=50K|[0.5859348819083985,0.4140651180916015] |0.0       |
|39 |Female|Private  |11th        |26            |>50K |[0.5656434150174948,0.4343565849825052] |0.0       |
|42 |Male  |Self-em

## 6. Predicting on new data
- Build a DataFrame with at least 9 new records (created by hand).  
- Apply the trained model to predict whether those people earn $>50K$ or $<=50K$.  

In [42]:
from pyspark.sql import Row

# Create some new records (using only values that appear in the original dataset)
nuevos_datos = [
    Row(age=25, sex="Male", workclass="Private", fnlwgt=200000, education="Bachelors", hours_per_week=40, label="<=50K"),
    Row(age=45, sex="Female", workclass="Gov", fnlwgt=150000, education="Masters", hours_per_week=50, label="<=50K"),
    Row(age=30, sex="Male", workclass="Self-emp", fnlwgt=180000, education="HS-grad", hours_per_week=60, label="<=50K"),
    Row(age=38, sex="Female", workclass="Private", fnlwgt=120000, education="Bachelors", hours_per_week=35, label="<=50K"),
    Row(age=55, sex="Male", workclass="Gov", fnlwgt=160000, education="Masters", hours_per_week=45, label="<=50K"),
    Row(age=28, sex="Female", workclass="Private", fnlwgt=100000, education="11th", hours_per_week=30, label="<=50K"),
    Row(age=40, sex="Male", workclass="Self-emp", fnlwgt=175000, education="HS-grad", hours_per_week=70, label="<=50K"),
    Row(age=22, sex="Female", workclass="Private", fnlwgt=130000, education="HS-grad", hours_per_week=20, label="<=50K"),
    Row(age=60, sex="Male", workclass="Gov", fnlwgt=140000, education="Masters", hours_per_week=55, label="<=50K")
]

df_nuevos = spark.createDataFrame(nuevos_datos)

# Predict with the trained model
pred_nuevos = model.transform(df_nuevos)

# Show results
pred_nuevos.select("age", "sex", "workclass", "education", "hours_per_week",
                   "probability", "prediction").show(truncate=False)


+---+------+---------+---------+--------------+----------------------------------------+----------+
|age|sex   |workclass|education|hours_per_week|probability                             |prediction|
+---+------+---------+---------+--------------+----------------------------------------+----------+
|25 |Male  |Private  |Bachelors|40            |[0.5832575293597162,0.4167424706402838] |0.0       |
|45 |Female|Gov      |Masters  |50            |[0.484105261454619,0.5158947385453809]  |1.0       |
|30 |Male  |Self-emp |HS-grad  |60            |[0.5109571800619322,0.48904281993806775]|0.0       |
|38 |Female|Private  |Bachelors|35            |[0.5848330980537818,0.4151669019462182] |0.0       |
|55 |Male  |Gov      |Masters  |45            |[0.48814641941662684,0.5118535805833732]|1.0       |
|28 |Female|Private  |11th     |30            |[0.570097701638898,0.42990229836110205] |0.0       |
|40 |Male  |Self-emp |HS-grad  |70            |[0.5029396957374317,0.4970603042625683] |0.0       |
