# Variables
1. Age in years
2. Sex
3. Chest pain type (4 values)
4. Resting blood pressure
5. Serum cholesterol in mg/dl
6. Fasting blood sugar > 120 mg/dl
7. Resting electrocardiographic results (values 0, 1, 2)
8. Maximum heart rate achieved
9. Exercise-induced angina
10. Oldpeak = ST depression induced by exercise relative to rest
11. Slope of the peak exercise ST segment
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. Variable of interest (target)

# Setup

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=c3ffbc31a59383576502abce1a8c6562a794b4e367052ad2402818479a54c875
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pyspark.sql
from pyspark.sql.functions import col, when
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.sql.types import *

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [None]:
# sc.stop()

# Initializing Spark

In [None]:
conf = SparkConf().setMaster("local").setAppName("Enfermedad al Corazon")

# Initialize the Spark context
sc = SparkContext(conf = conf)

distFile = sc.textFile("/content/drive/MyDrive/DataScience/M8/CD - M8 AE3 heart.data")

#distFile.collect() # inspect the raw data

# Creating the DataFrame

In [None]:
spark = SparkSession.builder.getOrCreate()

columnas = ['Años', 'Sexo', 'TipoDolor', 'PresionArterial', 'Colesterol', 'AzucarAyuna', 'Electrocardio', 'FrecuenciaCardio', 'Angina', 'Oldpeak', 'PendienteST', 'NumeroVasos', 'Thal', 'VariableInteres']

data = distFile.map(lambda x: x.split()).toDF(columnas)

data.show()

+----+----+---------+---------------+----------+-----------+-------------+----------------+------+-------+-----------+-----------+----+---------------+
|Años|Sexo|TipoDolor|PresionArterial|Colesterol|AzucarAyuna|Electrocardio|FrecuenciaCardio|Angina|Oldpeak|PendienteST|NumeroVasos|Thal|VariableInteres|
+----+----+---------+---------------+----------+-----------+-------------+----------------+------+-------+-----------+-----------+----+---------------+
|70.0| 1.0|      4.0|          130.0|     322.0|        0.0|          2.0|           109.0|   0.0|    2.4|        2.0|        3.0| 3.0|              2|
|67.0| 0.0|      3.0|          115.0|     564.0|        0.0|          2.0|           160.0|   0.0|    1.6|        2.0|        0.0| 7.0|              1|
|57.0| 1.0|      2.0|          124.0|     261.0|        0.0|          0.0|           141.0|   0.0|    0.3|        1.0|        0.0| 7.0|              2|
|64.0| 1.0|      4.0|          128.0|     263.0|        0.0|          0.0|           105

# Creating the "Enfermo" Variable

In [None]:
data = data.withColumn("Enfermo", when(col("Thal") > 3, 1).otherwise(0))
data.show()

+----+----+---------+---------------+----------+-----------+-------------+----------------+------+-------+-----------+-----------+----+---------------+-------+
|Años|Sexo|TipoDolor|PresionArterial|Colesterol|AzucarAyuna|Electrocardio|FrecuenciaCardio|Angina|Oldpeak|PendienteST|NumeroVasos|Thal|VariableInteres|Enfermo|
+----+----+---------+---------------+----------+-----------+-------------+----------------+------+-------+-----------+-----------+----+---------------+-------+
|70.0| 1.0|      4.0|          130.0|     322.0|        0.0|          2.0|           109.0|   0.0|    2.4|        2.0|        3.0| 3.0|              2|      0|
|67.0| 0.0|      3.0|          115.0|     564.0|        0.0|          2.0|           160.0|   0.0|    1.6|        2.0|        0.0| 7.0|              1|      1|
|57.0| 1.0|      2.0|          124.0|     261.0|        0.0|          0.0|           141.0|   0.0|    0.3|        1.0|        0.0| 7.0|              2|      1|
|64.0| 1.0|      4.0|          128.0|   

# Preprocessing the Data

In [None]:
data.printSchema()

root
 |-- Años: string (nullable = true)
 |-- Sexo: string (nullable = true)
 |-- TipoDolor: string (nullable = true)
 |-- PresionArterial: string (nullable = true)
 |-- Colesterol: string (nullable = true)
 |-- AzucarAyuna: string (nullable = true)
 |-- Electrocardio: string (nullable = true)
 |-- FrecuenciaCardio: string (nullable = true)
 |-- Angina: string (nullable = true)
 |-- Oldpeak: string (nullable = true)
 |-- PendienteST: string (nullable = true)
 |-- NumeroVasos: string (nullable = true)
 |-- Thal: string (nullable = true)
 |-- VariableInteres: string (nullable = true)
 |-- Enfermo: integer (nullable = false)



Inspecting the schema shows that the values were read as strings rather than as numbers, so each column has to be cast to its proper numeric type.

In [None]:
columnas = ['Años', 'Sexo', 'TipoDolor', 'PresionArterial', 'Colesterol', 'AzucarAyuna', 'Electrocardio', 'FrecuenciaCardio', 'Angina', 'Oldpeak', 'PendienteST', 'NumeroVasos', 'Thal', 'VariableInteres']

# Cast every column that was read as a string to DoubleType
for colName in columnas:
    data = data.withColumn(colName, data[colName].cast(DoubleType()))

In [None]:
data.printSchema()

root
 |-- Años: double (nullable = true)
 |-- Sexo: double (nullable = true)
 |-- TipoDolor: double (nullable = true)
 |-- PresionArterial: double (nullable = true)
 |-- Colesterol: double (nullable = true)
 |-- AzucarAyuna: double (nullable = true)
 |-- Electrocardio: double (nullable = true)
 |-- FrecuenciaCardio: double (nullable = true)
 |-- Angina: double (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- PendienteST: double (nullable = true)
 |-- NumeroVasos: double (nullable = true)
 |-- Thal: double (nullable = true)
 |-- VariableInteres: double (nullable = true)
 |-- Enfermo: integer (nullable = false)



This confirms that every column now has a numeric type.
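
The same conversion could also be done while parsing each record, before the DataFrame is built. The following is a minimal sketch in plain Python, not the notebook's method: `parse_line` is a hypothetical helper, and the sample record is the first row shown by `data.show()`, assuming the file is whitespace-delimited as the `split()` call suggests.

```python
def parse_line(line):
    """Split a whitespace-delimited record and convert each field to float."""
    return [float(field) for field in line.split()]


# First record as displayed earlier by data.show() (assumed raw layout)
raw_line = "70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2"
row = parse_line(raw_line)

print(row[:4])
print(all(isinstance(value, float) for value in row))
```

In the notebook, a function like this might replace the `lambda x: x.split()` passed to `distFile.map`, so that `toDF` would receive numeric values; that substitution is untested here.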

# Assembler Data

In [None]:
# Also tried without the "VariableInteres" column, but the result was the same
#columnas2 = ['Años', 'Sexo', 'TipoDolor', 'PresionArterial', 'Colesterol', 'AzucarAyuna', 'Electrocardio', 'FrecuenciaCardio', 'Angina', 'Oldpeak', 'PendienteST', 'NumeroVasos', 'Thal']

assembler = VectorAssembler(inputCols = columnas, outputCol='features')

output = assembler.transform(data)
modelData = output.select('features','Enfermo')
modelData.show()

+--------------------+-------+
|            features|Enfermo|
+--------------------+-------+
|[70.0,1.0,4.0,130...|      0|
|[67.0,0.0,3.0,115...|      1|
|[57.0,1.0,2.0,124...|      1|
|[64.0,1.0,4.0,128...|      1|
|[74.0,0.0,2.0,120...|      0|
|[65.0,1.0,4.0,120...|      1|
|[56.0,1.0,3.0,130...|      1|
|[59.0,1.0,4.0,110...|      1|
|[60.0,1.0,4.0,140...|      1|
|[63.0,0.0,4.0,150...|      1|
|[59.0,1.0,4.0,135...|      1|
|[53.0,1.0,4.0,142...|      1|
|[44.0,1.0,3.0,140...|      0|
|[61.0,1.0,1.0,134...|      0|
|[57.0,0.0,4.0,128...|      0|
|[71.0,0.0,4.0,112...|      0|
|[46.0,1.0,4.0,140...|      1|
|[53.0,1.0,4.0,140...|      1|
|[64.0,1.0,1.0,110...|      0|
|[40.0,1.0,1.0,140...|      1|
+--------------------+-------+
only showing top 20 rows



# StandardScaler Data

In [None]:
scaler = StandardScaler(inputCol = 'features', outputCol= 'scaledFeatures', withMean = True, withStd= True)

scalerModel = scaler.fit(modelData)
scaledData = scalerModel.transform(modelData)
finalData = scaledData.select('scaledFeatures','Enfermo')
finalData.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|scaledFeatures                                                                                                                                                                                                                                                                         |Enfermo|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|[1.7089200771370505,0.6882216640697679,0.8693133244601348,-0.07527006652510361,1.3996132196232811,-0.41625583610924666,0.97984406
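
To make the scaling step concrete, here is a minimal plain-Python sketch of what `withMean=True, withStd=True` does to a single column: subtract the column mean and divide by the standard deviation. It assumes the sample (n-1) standard deviation, which is how Spark's documentation describes `StandardScaler`. `standardize` is a hypothetical helper, and the five ages come from the first rows of the assembler output, so the results will not match the notebook values, which use statistics from the full dataset.

```python
import statistics


def standardize(values):
    """Center a column on its mean and scale it to unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(value - mean) / std for value in values]


# First five 'Años' values from the assembler output (illustrative subset only)
ages = [70.0, 67.0, 57.0, 64.0, 74.0]
scaled = standardize(ages)

print(scaled)
print(statistics.mean(scaled), statistics.stdev(scaled))
```

After scaling, the column has a mean of approximately 0 and a sample standard deviation of approximately 1, which keeps features with large ranges (such as Colesterol) from dominating features with small ranges (such as Oldpeak).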

# Train Test Split

In [None]:
trainData , testData = finalData.randomSplit([0.5,0.5])

# Logistic Regression Modeling

In [None]:
logicClass = LogisticRegression(featuresCol = 'scaledFeatures', labelCol = 'Enfermo')

model = logicClass.fit(trainData)


In [None]:
pred = model.transform(testData)

# AUC and Accuracy Metrics

In [None]:
# rawPrediction holds the continuous scores that the ROC curve is built from
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='Enfermo')
AUC = evaluator.evaluate(pred)
print("AUC-ROC:", AUC)

accuracyEvaluator = MulticlassClassificationEvaluator(labelCol="Enfermo", metricName="accuracy")
accuracy = accuracyEvaluator.evaluate(pred)
print("Accuracy:", accuracy)

# recallByLabel reports recall for a single label (metricLabel, 0.0 by default)
recallEvaluator = MulticlassClassificationEvaluator(labelCol="Enfermo", metricName="recallByLabel")
recall = recallEvaluator.evaluate(pred)
print("Recall:", recall)

f1Evaluator = MulticlassClassificationEvaluator(labelCol="Enfermo", metricName="f1")
f1Score = f1Evaluator.evaluate(pred)
print("F1-Score:", f1Score)


AUC-ROC: 1.0
Accuracy: 1.0
Recall: 1.0
F1-Score: 1.0


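According to these metrics, a value of 1.0 means that the model classifies every test row correctly. Note that Enfermo was derived directly from Thal (Thal > 3) and that Thal is also one of the input features, so the label is fully determined by one of the inputs; a perfect score is therefore expected here and is not by itself evidence of predictive power.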
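
As a rough illustration of how the metrics above are defined, the following plain-Python sketch computes accuracy, recall for one label, and a pairwise AUC. The helper functions are hypothetical and are not Spark's implementation. The labels and scores are taken (rounded) from the first eight rows of the `pred` output in this notebook, assuming that the class-1 score is the negated first component of `rawPrediction`, as is the convention for binary logistic regression in Spark.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the label."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)


def recall_for_label(labels, predictions, label):
    """Fraction of rows with the given true label that were predicted as that label."""
    relevant = [p for y, p in zip(labels, predictions) if y == label]
    return sum(1 for p in relevant if p == label) / len(relevant)


def auc(labels, scores):
    """Probability that a random positive row scores above a random negative row."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 0, 0, 1, 0, 1]
# Class-1 scores: negated first rawPrediction component, rounded (assumption)
scores = [-19.727, -20.312, 25.172, -21.456, -15.457, 26.380, -22.616, 26.485]
predictions = [1 if s > 0 else 0 for s in scores]

print("AUC-ROC:", auc(labels, scores))
print("Accuracy:", accuracy(labels, predictions))
print("Recall (label 0):", recall_for_label(labels, predictions, 0))
```

Because every positive row scores above every negative row, all three values are 1.0 on this subset, which is consistent with the evaluator output.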

In [None]:
# pred.select('Enfermo','prediction').show()
pred.show()

+--------------------+-------+--------------------+--------------------+----------+
|      scaledFeatures|Enfermo|       rawPrediction|         probability|prediction|
+--------------------+-------+--------------------+--------------------+----------+
|[-2.2431863111028...|      0|[19.7274290219951...|[0.99999999729301...|       0.0|
|[-2.1334055780961...|      0|[20.3120875148059...|[0.99999999849140...|       0.0|
|[-2.1334055780961...|      1|[-25.172143898952...|[1.16916769511768...|       1.0|
|[-1.9138441120828...|      0|[21.4560037986716...|[0.99999999951940...|       0.0|
|[-1.9138441120828...|      0|[15.4570281103857...|[0.99999980631414...|       0.0|
|[-1.8040633790761...|      1|[-26.379777685277...|[3.49468565622738...|       1.0|
|[-1.6942826460694...|      0|[22.6160868120582...|[0.99999999984935...|       0.0|
|[-1.5845019130628...|      1|[-26.484884640965...|[3.14601469287166...|       1.0|
|[-1.4747211800561...|      0|[21.8130480123959...|[0.99999999966371...|    

Comparing the predictions with the test labels, rows whose first rawPrediction component is negative are classified as 1 ("sick"), and rows where it is positive are classified as 0 ("healthy").
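
The relationship between the sign of the raw score and the predicted class can be sketched with the logistic function. This is a minimal illustration, assuming that the first `rawPrediction` component is the class-0 score and that the default decision threshold of 0.5 applies; `class0_probability` and `predict` are hypothetical helpers, and the two inputs are the (truncated) values from the first and third rows of the output.

```python
import math


def class0_probability(raw0):
    """Logistic function applied to the class-0 raw score."""
    return 1.0 / (1.0 + math.exp(-raw0))


def predict(raw0, threshold=0.5):
    """Return 1.0 when the class-1 probability exceeds the threshold, else 0.0."""
    class1_probability = 1.0 - class0_probability(raw0)
    return 1.0 if class1_probability > threshold else 0.0


for raw0 in (19.7274290219951, -25.172143898952):
    print(raw0, class0_probability(raw0), predict(raw0))
```

The first score yields a class-0 probability of about 0.9999999973 and the second about 1.169e-11, in line with the `probability` column, so a positive score maps to class 0 and a negative one to class 1.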

# Checking Prediction vs Enfermo

In [None]:
predData = pred.withColumn('correcta', F.when(F.col('Enfermo') == F.col('prediction'), True).otherwise(False))
predData.select('Enfermo','prediction','correcta').show()

+-------+----------+--------+
|Enfermo|prediction|correcta|
+-------+----------+--------+
|      0|       0.0|    true|
|      0|       0.0|    true|
|      1|       1.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      1|       1.0|    true|
|      0|       0.0|    true|
|      1|       1.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      1|       1.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      0|       0.0|    true|
|      1|       1.0|    true|
+-------+----------+--------+
only showing top 20 rows



Here we check whether each prediction matches the Enfermo variable. The 'correcta' column is True when both values are equal and False otherwise. Every row shown matches.
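
Since `show()` prints only the top 20 rows, an aggregate tally is more conclusive than reading rows by eye. The sketch below does this in plain Python for the 20 displayed rows; on the full DataFrame, something like `predData.groupBy('Enfermo', 'prediction').count()` would presumably serve the same purpose, although that call was not run here.

```python
from collections import Counter

# 'Enfermo' and 'prediction' columns copied from the 20 rows displayed above
labels = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
predictions = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0,
               0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]

# Count each (label, prediction) pair, i.e. a small confusion matrix
tally = Counter(zip(labels, predictions))
mismatches = sum(count for (label, predicted), count in tally.items()
                 if label != predicted)

print(dict(tally))
print("Mismatches:", mismatches)
```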