<a href="https://colab.research.google.com/github/Sayed-Hossein-Hosseini/SparkML_Heart_Risk_Classifier/blob/master/SparkML_Heart_Risk_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SparkML Heart Risk Classifier**

## **Libraries**

In [7]:
%pip install pyspark



In [8]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.sql.functions import col

## **Loading Dataset**

In [9]:
# Create a local Spark session
spark = SparkSession.builder.appName("HeartDiseaseClassification").getOrCreate()

# Load the CSV file, using the first row as the header and inferring column types
data = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

# Print the schema and the first five rows
data.printSchema()
data.show(5)

root
 |-- id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- dataset: string (nullable = true)
 |-- cp: string (nullable = true)
 |-- trestbps: integer (nullable = true)
 |-- chol: integer (nullable = true)
 |-- fbs: boolean (nullable = true)
 |-- restecg: string (nullable = true)
 |-- thalch: integer (nullable = true)
 |-- exang: boolean (nullable = true)
 |-- oldpeak: double (nullable = true)
 |-- slope: string (nullable = true)
 |-- ca: integer (nullable = true)
 |-- thal: string (nullable = true)
 |-- num: integer (nullable = true)

+---+---+------+---------+---------------+--------+----+-----+--------------+------+-----+-------+-----------+---+-----------------+---+
| id|age|   sex|  dataset|             cp|trestbps|chol|  fbs|       restecg|thalch|exang|oldpeak|      slope| ca|             thal|num|
+---+---+------+---------+---------------+--------+----+-----+--------------+------+-----+-------+-----------+---+--------------
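The schema above comes from `inferSchema=True`, which asks Spark to scan the file and choose a type for each column. The plain-Python sketch below illustrates the general idea only; it is not Spark's actual inference logic, `infer_type` is a hypothetical helper, and the sample values are made up for illustration rather than taken from the dataset.

```python
def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    # Empty strings and None are treated as missing and ignored
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "string"

    def all_parse(parser):
        # True when every present value is accepted by the parser
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    if all(v.lower() in ("true", "false") for v in present):
        return "boolean"
    return "string"


# Illustrative values only (not rows from heart_disease_uci.csv)
samples = {
    "age": ["63", "67", ""],
    "oldpeak": ["2.3", "1"],
    "fbs": ["TRUE", "false"],
    "sex": ["Male", "Female"],
}
for name, values in samples.items():
    print(name, infer_type(values))
```

Inference costs an extra pass over the file, so when the column types are already known, passing an explicit schema to the reader is a common alternative.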

## **Data Preprocessing**

In [10]:
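# Flag null cells column by column (show() prints only the first 20 rows)
data.select([col(c).isNull().alias(c) for c in data.columns]).show()

# Assemble the feature columns into a single vector.
# The label column in this dataset is 'num' (the schema has no 'target' column),
# and 'id' is a row identifier, so both are excluded. VectorAssembler rejects
# string columns, so only numeric and boolean columns are assembled here.
feature_cols = [name for name, dtype in data.dtypes
                if name not in ("id", "num") and dtype != "string"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Index the label: StringIndexer maps each distinct 'num' value to a label index
labelIndexer = StringIndexer(inputCol="num", outputCol="label")

# Split the data into training (80%) and test (20%) sets
(trainingData, testData) = data.randomSplit([0.8, 0.2], seed=42)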

+-----+-----+-----+-------+-----+--------+-----+-----+-------+------+-----+-------+-----+-----+-----+-----+
|   id|  age|  sex|dataset|   cp|trestbps| chol|  fbs|restecg|thalch|exang|oldpeak|slope|   ca| thal|  num|
+-----+-----+-----+-------+-----+--------+-----+-----+-------+------+-----+-------+-----+-----+-----+-----+
|false|false|false|  false|false|   false|false|false|  false| false|false|  false|false|false|false|false|
|false|false|false|  false|false|   false|false|false|  false| false|false|  false|false|false|false|false|
|false|false|false|  false|false|   false|false|false|  false| false|false|  false|false|false|false|false|
|false|false|false|  false|false|   false|false|false|  false| false|false|  false|false|false|false|false|
|false|false|false|  false|false|   false|false|false|  false| false|false|  false|false|false|false|false|
|false|false|false|  false|false|   false|false|false|  false| false|false|  false|false|false|false|false|
|false|false|false|  false|f
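The table above shows null flags for the first 20 rows only, so a null further down the file would go unnoticed. Counting nulls per column covers the whole table; in PySpark this is commonly written as `count(when(col(c).isNull(), c))` with `count` and `when` from `pyspark.sql.functions`. The plain-Python sketch below shows what that count computes, together with the frequency-ordered mapping that `StringIndexer` applies by default (most frequent value gets index 0, ties broken alphabetically). Both helpers, `null_counts` and `index_labels`, are hypothetical, the rows are made up for illustration, and Spark emits the indices as doubles rather than the integers used here.

```python
from collections import Counter


def null_counts(rows, columns):
    """Count missing (None) values per column across all rows."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is None:
                counts[c] += 1
    return counts


def index_labels(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Illustrative rows only (not taken from heart_disease_uci.csv)
rows = [
    {"age": 63, "chol": 233, "num": 0},
    {"age": 67, "chol": None, "num": 2},
    {"age": 41, "chol": 204, "num": 0},
    {"age": 56, "chol": None, "num": 1},
]

print(null_counts(rows, ["age", "chol", "num"]))
print(index_labels([r["num"] for r in rows]))
```

If the per-column counts turn out to be non-zero, the affected rows need to be dropped or imputed before assembling features, because `VectorAssembler` raises an error on null inputs under its default `handleInvalid="error"` setting.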