# Applied exercise: Machine Learning with Spark

In this notebook we train a binary classification model that predicts heart disease from various measurements of biochemical parameters. To do so, we use a dataset from Kaggle.

In [22]:
import numpy as np 
import pandas as pd 

In [23]:
import findspark
findspark.init()

In [24]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('UCI Heart disease').getOrCreate()

<div class="alert alert-info">
<h3><font color="#0000"><i><b> Why Use It?:</b> This is the first time in these notebooks that we use a library focused on ML. 
<br>Logistic regression estimates the probability that an event occurs, such as voting or not voting, from a given set of independent variables.
</i></font></h3>
</div>
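
For reference, the model that `LogisticRegression` fits can be written as below. The symbols are introduced here for illustration and do not appear in the notebook: $\mathbf{x}$ is the feature vector, and $\mathbf{w}$ and $b$ are the learned weights and intercept.

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

With the default threshold of 0.5, the predicted class is 1 when this probability is above one half, which is the same as $\mathbf{w}^\top \mathbf{x} + b > 0$.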


In [25]:
from pyspark.ml.classification import LogisticRegression

In [26]:
heart = spark.read.csv('data/heart.csv', 
                       inferSchema = True, 
                       header = True)

In [27]:
heart.show(5)

+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
|age|sex| cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
| 63|  1|  3|     145| 233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|
| 37|  1|  2|     130| 250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|
| 41|  0|  1|     130| 204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|
| 56|  1|  1|     120| 236|  0|      1|    178|    0|    0.8|    2|  0|   2|     1|
| 57|  0|  0|     120| 354|  0|      1|    163|    1|    0.6|    2|  0|   2|     1|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
only showing top 5 rows



In [28]:
heart.rdd.getNumPartitions()

1

In [29]:
heart_pandas = heart.toPandas()
heart_pandas

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [30]:
heart.printSchema()

root
 |-- age: integer (nullable = true)
 |-- sex: integer (nullable = true)
 |-- cp: integer (nullable = true)
 |-- trestbps: integer (nullable = true)
 |-- chol: integer (nullable = true)
 |-- fbs: integer (nullable = true)
 |-- restecg: integer (nullable = true)
 |-- thalach: integer (nullable = true)
 |-- exang: integer (nullable = true)
 |-- oldpeak: double (nullable = true)
 |-- slope: integer (nullable = true)
 |-- ca: integer (nullable = true)
 |-- thal: integer (nullable = true)
 |-- target: integer (nullable = true)



### <font color = #ff4fa4> <i> <b> Shape? </b>:

In [31]:
heart_pandas.shape

(303, 14)

In [32]:
type(heart)

pyspark.sql.dataframe.DataFrame

### Data preprocessing

In [33]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [34]:
heart.columns

['age',
 'sex',
 'cp',
 'trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalach',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal',
 'target']

### <font color = #ff4fa4> <i> <b> Why Use It? </b>: 
* inputCols: This is a list of columns (features) from your dataset that you want to combine into one. In this case, columns like 'age', 'sex', 'cp', 'trestbps', etc.
* outputCol: This is the name of the new column that will contain the combined features. In this notebook, the new column is named "features".
### <font color = #ff4fa4> <i> <b> Machine learning algorithms in Spark generally require a single vector column as input, where each feature is one element of that vector. The VectorAssembler performs this transformation. <br> <br> <font color = #ff0000> The target column is not included because it is the variable to predict.

In [35]:
assembler = VectorAssembler(
                            inputCols=['age',
                            'sex',
                            'cp',
                            'trestbps',
                            'chol',
                            'fbs',
                            'restecg',
                            'thalach',
                            'exang',
                            'oldpeak',
                            'slope',
                            'ca',
                            'thal'],
                            outputCol="features")

In [36]:
output = assembler.transform(heart)

In [37]:
output.show(5)

+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
|age|sex| cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|            features|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
| 63|  1|  3|     145| 233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|[63.0,1.0,3.0,145...|
| 37|  1|  2|     130| 250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|[37.0,1.0,2.0,130...|
| 41|  0|  1|     130| 204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|[41.0,0.0,1.0,130...|
| 56|  1|  1|     120| 236|  0|      1|    178|    0|    0.8|    2|  0|   2|     1|[56.0,1.0,1.0,120...|
| 57|  0|  0|     120| 354|  0|      1|    163|    1|    0.6|    2|  0|   2|     1|[57.0,0.0,0.0,120...|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
only showing top 5 rows
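
To make the `features` column above concrete, here is a minimal pure-Python sketch of what the assembly step does to a single row. It uses a plain dict in place of a Spark DataFrame, and `assemble` is a hypothetical helper written for illustration, not part of PySpark.

```python
# Conceptual sketch only: mimics what VectorAssembler does to one row,
# using a plain dict instead of a Spark DataFrame.
feature_cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# First row of heart.show(5); 'target' is present but deliberately not assembled.
row = {'age': 63, 'sex': 1, 'cp': 3, 'trestbps': 145, 'chol': 233, 'fbs': 1,
       'restecg': 0, 'thalach': 150, 'exang': 0, 'oldpeak': 2.3, 'slope': 0,
       'ca': 0, 'thal': 1, 'target': 1}


def assemble(row, input_cols):
    """Hypothetical helper: return the values of input_cols, in order, as floats."""
    return [float(row[c]) for c in input_cols]


features = assemble(row, feature_cols)
print(features)
```

The result is a 13-element vector whose first entries (63.0, 1.0, 3.0, 145.0, ...) match the first row of the `features` column, and the label stays in its own column.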



In [54]:
In_Pandas = output.toPandas()
a = In_Pandas.loc[0:12, 'thal':'features']
a

Unnamed: 0,thal,target,features
0,1,1,"[63.0, 1.0, 3.0, 145.0, 233.0, 1.0, 0.0, 150.0..."
1,2,1,"[37.0, 1.0, 2.0, 130.0, 250.0, 0.0, 1.0, 187.0..."
2,2,1,"[41.0, 0.0, 1.0, 130.0, 204.0, 0.0, 0.0, 172.0..."
3,2,1,"[56.0, 1.0, 1.0, 120.0, 236.0, 0.0, 1.0, 178.0..."
4,2,1,"[57.0, 0.0, 0.0, 120.0, 354.0, 0.0, 1.0, 163.0..."
5,1,1,"[57.0, 1.0, 0.0, 140.0, 192.0, 0.0, 1.0, 148.0..."
6,2,1,"[56.0, 0.0, 1.0, 140.0, 294.0, 0.0, 0.0, 153.0..."
7,3,1,"[44.0, 1.0, 1.0, 120.0, 263.0, 0.0, 1.0, 173.0..."
8,3,1,"[52.0, 1.0, 2.0, 172.0, 199.0, 1.0, 1.0, 162.0..."
9,2,1,"[57.0, 1.0, 2.0, 150.0, 168.0, 0.0, 1.0, 174.0..."


### <font color = #ff4fa4> <i> <b> output.select(): 
* `features`: the assembled input vector used to predict the target variable
* `target`: the label to predict

In [55]:
final_data = output.select("features",'target')

### Model training

### <font color = #ff4fa4> <i> <b> We use roughly 70% of the data for training and test the model on the remaining 30%

In [40]:
train, test = final_data.randomSplit([0.7,0.3])
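
`randomSplit` does not cut the DataFrame at exact 70/30 row counts. Rows are sampled at random according to the weights, so the split sizes are approximate, and without a seed they change between runs. Passing the optional `seed` argument, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, makes the split repeatable for the same input data and partitioning. The pure-Python sketch below only illustrates the idea; `random_split` is a hypothetical helper and not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical helper: assign each row to one split by an independent uniform draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split's interval within [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(303))  # stand-in for the 303 rows of the dataset
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # close to a 70/30 split, not exactly 212/91
```

Calling `random_split` again with the same seed returns the same two lists, which is what makes metrics comparable from one run to the next.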

In [41]:
lr = LogisticRegression(labelCol="target",
                        featuresCol="features")

### <font color = #ff4fa4> `fit` trains the logistic regression model defined in the previous step. <BR> The fitted model is then applied to the train and test sets.

In [42]:
model = lr.fit(train)

In [43]:
predict_train=model.transform(train)
predict_test=model.transform(test)
predict_test.select("target","prediction").show(10)

+------+----------+
|target|prediction|
+------+----------+
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       0.0|
|     0|       1.0|
|     1|       1.0|
+------+----------+
only showing top 10 rows
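
As a quick sanity check, the ten rows displayed above can be tallied by hand. The sketch below copies those `(target, prediction)` pairs and counts the confusion-matrix cells. It covers only the displayed sample, so the numbers say nothing definitive about the full test set.

```python
# (target, prediction) pairs copied from the 10 rows displayed above.
# This is only the displayed sample, not the full test set.
pairs = [(1, 1.0), (1, 1.0), (0, 0.0), (1, 1.0), (1, 1.0),
         (0, 1.0), (1, 1.0), (0, 0.0), (0, 1.0), (1, 1.0)]

tp = sum(1 for t, p in pairs if t == 1 and p == 1.0)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0.0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1.0)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0.0)  # false negatives

accuracy = (tp + tn) / len(pairs)
print(tp, tn, fp, fn, accuracy)  # 6 2 2 0 0.8
```

In this sample both errors are false positives: healthy patients (`target` 0) predicted as having heart disease.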



### Model evaluation

### <font color = #ff4fa4> <i> <b> The model's predictions are compared against the original labels

In [44]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                          labelCol='target')

predict_test.select("target","rawPrediction","prediction","probability").show(5)


+------+--------------------+----------+--------------------+
|target|       rawPrediction|prediction|         probability|
+------+--------------------+----------+--------------------+
|     1|[-1.0076827274690...|       1.0|[0.26743358952919...|
|     1|[-1.5202184212452...|       1.0|[0.17942935802066...|
|     0|[1.39933771985257...|       0.0|[0.80207877366843...|
|     1|[-4.8672134215876...|       1.0|[0.00763602006585...|
|     1|[-3.8788490556679...|       1.0|[0.02025582558526...|
+------+--------------------+----------+--------------------+
only showing top 5 rows
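
The `rawPrediction` and `probability` columns are linked by the logistic (sigmoid) function. For binary logistic regression, the first component of each vector refers to class 0, so a small first probability means the model favours class 1. The sketch below applies the sigmoid to the first `rawPrediction` component of each displayed row and compares it with the displayed probability. The values are copied from the truncated output above, so agreement is only expected up to the digits shown.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# (rawPrediction[0], probability[0]) as displayed above (digits truncated by show()).
shown = [(-1.0076827274690, 0.26743358952919),
         (-1.5202184212452, 0.17942935802066),
         (1.39933771985257, 0.80207877366843),
         (-4.8672134215876, 0.00763602006585),
         (-3.8788490556679, 0.02025582558526)]

for raw0, prob0 in shown:
    print(f"{raw0:+.4f} -> {sigmoid(raw0):.6f} (shown: {prob0:.6f})")
```

The third row is the only one with a positive first score, which gives a class-0 probability of about 0.80 and matches its `prediction` of 0.0.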



In [45]:
print("The area under ROC for train set is {}".format(evaluator.evaluate(predict_train)))

print("The area under ROC for test set is {}".format(evaluator.evaluate(predict_test)))

The area under ROC for train set is 0.9253140096618357
The area under ROC for test set is 0.9045833333333335
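
`BinaryClassificationEvaluator` reports the area under the ROC curve by default (`metricName='areaUnderROC'`). One way to read that number is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that pairwise quantity directly on hypothetical toy data. It illustrates the metric and is not how Spark computes it internally; `auc_by_pairs` is a name made up for this example.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the heart dataset.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_pairs(labels, scores)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```

Read this way, the test value of about 0.90 reported above means the model ranks a diseased patient above a healthy one in roughly nine out of ten such pairs.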
