# Membrete

<img src="https://upload.wikimedia.org/wikipedia/commons/6/6c/Javeriana.svg" alt="Logo Javeriana" width="150"/>

- **Nombre:** Alberto Luis Vigna Arroyo
- **Universidad:** Pontificia Universidad Javeriana
- **Materia:** Procesamiento de Datos a Gran Escala
- **Nombre del Profesor:** John Corredor
- **Correo Electrónico:** a-vigna@javeriana.edu.co
- **Fecha:** 26 de febrero de 2024
- **Objetivo:** Presentar los diferentes métodos de PySpark para el tratamiento de los datos (EDA) y conocer los primeros pasos de ML con Spark.)

**Quiz:** El presente cuaderno se entrega a la fecha de 26/02/2024 como quiz.

# Análisis Exploratorio en PySpark - Datos del Titanic:


The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)


The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the ground truth) for each passenger. 

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.\
We also include


Data Dictionary

- survival  Survival  0 = No, 1 = Si
- pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
- sex	Sex	
- Age	Age in years	
- sibsp	# of siblings / spouses aboard the Titanic	
- parch	# of parents / children aboard the Titanic	
- ticket	Ticket number	
- fare	Passenger fare	
- cabin	Cabin number	
- embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes
- pclass: A proxy for socio-economic status (SES)
    - 1st = Upper
    - 2nd = Middle
    - 3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
- sibsp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

*Some children travelled only with a nanny, therefore parch=0 for them.*


## Bibliotecas:

In [0]:
# PySpart
import pyspark

from pyspark import *
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

# ML
from pyspark.ml import *
from pyspark.ml.feature import *


# Pandas
import pandas as pd


# Para importarlo todo:
from pyspark import *
from pandas import *

In [0]:
# Levantar la sección de PySpark:
sc = SparkContext.getOrCreate()
sql_sc = SQLContext(sc)

sc



## Cargar el enlace para importar los datos:

### Path para test:

In [0]:
# Path de test:
urlTest = "https://raw.githubusercontent.com/corredor-john/ExploratoryDataAnalisys/main/Varios/Titanic/test.csv"

### Path para train:

In [0]:
# Path de train:
urlTrain = "https://raw.githubusercontent.com/corredor-john/ExploratoryDataAnalisys/main/Varios/Titanic/train.csv"

## Creación de los Data Frames

In [0]:
# Se crea el Data Frame del Titanic en Pandas:
DF_Titanic = pd.read_csv(urlTrain, sep = ",")

# Conversión a un Data Frame de Spark:
DF_Titanic = sql_sc.createDataFrame(DF_Titanic)

# Mostrar el modelo:
DF_Titanic.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+------

## Mostrar las columnas

In [0]:
DF_Titanic.columns

Out[174]: ['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

## Tipos de datos en el Data Frame:

In [0]:
DF_Titanic.printSchema()

root
 |-- PassengerId: long (nullable = true)
 |-- Survived: long (nullable = true)
 |-- Pclass: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: long (nullable = true)
 |-- Parch: long (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



## Visualización de Datos:

### Ver la cantidad de registros en el Data Frame:

In [0]:
print("La cantidad de registros en el Data Frame de Spark es: " + str(DF_Titanic.count()))

La cantidad de registros en el Data Frame de Spark es: 891


### Ver la cantidad de personas según su sexo:

In [0]:
DF_Titanic.groupBy("Sex").count().show()

+------+-----+
|   Sex|count|
+------+-----+
|female|  314|
|  male|  577|
+------+-----+



### Ver si hay valores nulos:

In [0]:
# El comando "when" con "otherwise",  funciona como si fuere una desición "si", en caso contrario:

DF_Titanic.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in DF_Titanic.columns]).show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|177|    0|    0|     0|   0|  687|       2|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



### Eliminación de Cabin:

In [0]:
# Esta columna NO da información para estudiar
DF_Titanic = DF_Titanic.drop("Cabin")
DF_Titanic.limit(3).toPandas()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S


### Estratificación en función del saludo

In [0]:
# Para hacer la imputación de edades nulas, se hará la estratificación en función del saludo. Primero se hará una nueva columna que contenga los saludos de cada pasajero:

# Se extrae cualquier palabra que cuyo patrón es de la A hasta la Z finalizando con un (.)
DF_Titanic = DF_Titanic.withColumn("Saludo", regexp_extract(col("Name"), "([A-Za-z]+)\.", 1))

DF_Titanic.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Embarked|Saludo|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+--------+------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25|       S|    Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|       C|   Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925|       S|  Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1|       S|   Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05|       S|    Mr|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    

In [0]:
# Se presentan los valores únicos o distintos de la columna "Saludo":
DF_Titanic.select("Saludo").distinct().show()

+--------+
|  Saludo|
+--------+
|     Don|
|    Miss|
|  Master|
|      Mr|
|     Mrs|
|     Rev|
|      Dr|
|     Mme|
|      Ms|
|   Major|
|     Col|
|    Lady|
|     Sir|
|    Mlle|
|Countess|
|    Capt|
|Jonkheer|
+--------+



In [0]:
# Se agrupan para contarlos:
DF_Titanic.groupby("Saludo").count().show()

+--------+-----+
|  Saludo|count|
+--------+-----+
|     Don|    1|
|    Miss|  182|
|  Master|   40|
|      Mr|  517|
|     Mrs|  125|
|     Rev|    6|
|      Dr|    7|
|     Mme|    1|
|      Ms|    1|
|   Major|    2|
|     Col|    2|
|    Lady|    1|
|     Sir|    1|
|    Mlle|    2|
|Countess|    1|
|    Capt|    1|
|Jonkheer|    1|
+--------+-----+



In [0]:
# Dado a que son muchos saludos, se requiere reducirlos a 4 o 5.
saludo_inicial = ["Don", "Rev", "Dr", "Mme", "Ms", "Major", "Col", "Lady", "Sir", "Mlle", "Countess", "Capt", "Jonkheer"]

saludo_final = ["Mr", "Other", "Other", "Mrs", "Miss", "Mr", "Mr", "Miss", "Mr", "Miss", "Miss", "Mr", "Other"]

DF_Titanic = DF_Titanic.replace(saludo_inicial, saludo_final)
DF_Titanic.groupby("Saludo").count().show()

+------+-----+
|Saludo|count|
+------+-----+
|  Miss|  187|
|Master|   40|
|    Mr|  524|
|   Mrs|  126|
| Other|   14|
+------+-----+



### Extracción del promedio de edad según el saludo:

In [0]:
DF_Titanic.groupby("Saludo").avg("Age").collect()

Out[184]: [Row(Saludo='Miss', avg(Age)=22.09271523178808),
 Row(Saludo='Master', avg(Age)=4.574166666666667),
 Row(Saludo='Mr', avg(Age)=32.727160493827164),
 Row(Saludo='Mrs', avg(Age)=35.788990825688074),
 Row(Saludo='Other', avg(Age)=42.23076923076923)]

### Donde aparezca un NULL, sustituir por el promedio según la categoría de Saludo

In [0]:
# Para Miss:
DF_Titanic = DF_Titanic.withColumn("Age", when((DF_Titanic["Saludo"] == "Miss") & (DF_Titanic["Age"].isNull()), 22).otherwise(DF_Titanic["Age"]))

# Para Master:
DF_Titanic = DF_Titanic.withColumn("Age", when((DF_Titanic["Saludo"] == "Master") & (DF_Titanic["Age"].isNull()), 5).otherwise(DF_Titanic["Age"]))

# Para Mr:
DF_Titanic = DF_Titanic.withColumn("Age", when((DF_Titanic["Saludo"] == "Mr") & (DF_Titanic["Age"].isNull()), 36).otherwise(DF_Titanic["Age"]))

# Para Other:
DF_Titanic = DF_Titanic.withColumn("Age", when((DF_Titanic["Saludo"] == "Other") & (DF_Titanic["Age"].isNull()), 42).otherwise(DF_Titanic["Age"]))

In [0]:
# Visualizar los NULL en el embarque:
DF_Titanic.groupby("Embarked").count().show()

+--------+-----+
|Embarked|count|
+--------+-----+
|       Q|   77|
|    null|    2|
|       C|  168|
|       S|  644|
+--------+-----+



In [0]:
# En embarque hay 2 NULL, se procede a cambiarlos por S (categoría que más apariciones tiene)
DF_Titanic = DF_Titanic.na.fill({"Embarked": "S"})
DF_Titanic.groupby("Embarked").count().show()

+--------+-----+
|Embarked|count|
+--------+-----+
|       Q|   77|
|       C|  168|
|       S|  646|
+--------+-----+



In [0]:
DF_Titanic.limit(5).toPandas()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Saludo
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,Mr


In [0]:
DF_Titanic = DF_Titanic.withColumn("Tamano Familia", col("SibSp") + col("Parch"))

# Se crea una columna de "Solo" (si está o no está solo durante el viaje)
DF_Titanic = DF_Titanic.withColumn("Solo", lit(0))

# Si está o no está solo, en al columna (depende del Tamaño de la Familia)
DF_Titanic = DF_Titanic.withColumn("Solo", when(DF_Titanic["Tamano Familia"] == 0, 1).otherwise(DF_Titanic["Solo"]))


DF_Titanic.limit(11).toPandas()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Saludo,Tamano Familia,Solo
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,Mr,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,Mrs,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,Miss,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,Mrs,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,Mr,0,1
5,6,0,3,"Moran, Mr. James",male,36.0,0,0,330877,8.4583,Q,Mr,0,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S,Mr,0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,S,Master,4,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S,Mrs,2,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C,Mrs,1,0


### Imputación de variables categoricas:

In [0]:
indexador = [StringIndexer(inputCol=columna, outputCol=columna+"ind").fit(DF_Titanic) for columna in ["Sex", "Embarked", "Saludo"]]

# Se crea un PIPELINE, para procesar el indexador:
pipeIndexador = Pipeline(stages = indexador)
DF_Titanic = pipeIndexador.fit(DF_Titanic).transform(DF_Titanic)

### Eliminar las columnas no útiles:

In [0]:
DF_Titanic = DF_Titanic.drop("PassengerId", "Name", "Ticket", "Embarked", "Saludo", "Sex")

DF_Titanic.limit(6).toPandas()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Tamano Familia,Solo,Sexind,Embarkedind,Saludoind
0,0,3,22.0,1,0,7.25,1,0,0.0,0.0,0.0
1,1,1,38.0,1,0,71.2833,1,0,1.0,1.0,2.0
2,1,3,26.0,0,0,7.925,0,1,1.0,0.0,1.0
3,1,1,35.0,1,0,53.1,1,0,1.0,0.0,2.0
4,0,3,35.0,0,0,8.05,0,1,0.0,0.0,0.0
5,0,3,36.0,0,0,8.4583,0,1,0.0,2.0,0.0


## Preparación del Data Frame para Machine Learning

In [0]:
varCaracter = VectorAssembler(inputCols = DF_Titanic.columns[1:], outputCol = "VarCaracteristicas")
varCaracterTrans = varCaracter.transform(DF_Titanic)

DF_Titanic.limit(6).toPandas()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Tamano Familia,Solo,Sexind,Embarkedind,Saludoind
0,0,3,22.0,1,0,7.25,1,0,0.0,0.0,0.0
1,1,1,38.0,1,0,71.2833,1,0,1.0,1.0,2.0
2,1,3,26.0,0,0,7.925,0,1,1.0,0.0,1.0
3,1,1,35.0,1,0,53.1,1,0,1.0,0.0,2.0
4,0,3,35.0,0,0,8.05,0,1,0.0,0.0,0.0
5,0,3,36.0,0,0,8.4583,0,1,0.0,2.0,0.0


### Separar el Train y el Test:

In [0]:
DF_ML_Titanic = varCaracterTrans.select(["VarCaracteristicas", "Survived"])
(DF_Train, DF_Test) = DF_ML_Titanic.randomSplit([0.75, 0.25], seed=0.12)

print(f'La cantidad de datos de entrenamiento es de: {DF_Train.count()}')
print(f'La cantidad de datos de prueba es de: {DF_Test.count()}')
print(f'La cantidad de datos totales es de: {DF_ML_Titanic.count()}')

[0;31m---------------------------------------------------------------------------[0m
[0;31mPy4JJavaError[0m                             Traceback (most recent call last)
File [0;32m<command-4389589189287441>:4[0m
[1;32m      1[0m DF_ML_Titanic [38;5;241m=[39m varCaracterTrans[38;5;241m.[39mselect([[38;5;124m"[39m[38;5;124mVarCaracteristicas[39m[38;5;124m"[39m, [38;5;124m"[39m[38;5;124mSurvived[39m[38;5;124m"[39m])
[1;32m      2[0m (DF_Train, DF_Test) [38;5;241m=[39m DF_ML_Titanic[38;5;241m.[39mrandomSplit([[38;5;241m0.75[39m, [38;5;241m0.25[39m], seed[38;5;241m=[39m[38;5;241m0.12[39m)
[0;32m----> 4[0m [38;5;28mprint[39m([38;5;124mf[39m[38;5;124m'[39m[38;5;124mLa cantidad de datos de entrenamiento es de: [39m[38;5;132;01m{[39;00mDF_Train[38;5;241m.[39mcount()[38;5;132;01m}[39;00m[38;5;124m'[39m)
[1;32m      5[0m [38;5;28mprint[39m([38;5;124mf[39m[38;5;124m'[39m[38;5;124mLa cantidad de datos de prueba es de: [39m[38;5;132