# For this exercise, you will analyze the heart disease dataset and build a classification model using logistic regression.

## Instruction: You may experiment with the code using trial and error and step-by-step methods. For submission, please prepare a clean copy of the notebook file with the final code and final result for each task, and with extra blank cells removed. Irrelevant code will be counted as wrong answers.

## Step 1: Create a Spark session and import libraries

In [1]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('logrex').getOrCreate()

## Step 2: Import data and describe the data

In [2]:
data = spark.read.csv('heart_disease.csv',inferSchema=True,header=True)

In [3]:
data.printSchema()

root
 |-- age: integer (nullable = true)
 |-- sex: integer (nullable = true)
 |-- chest: integer (nullable = true)
 |-- resting_blood_pressure: integer (nullable = true)
 |-- serum_cholestoral: integer (nullable = true)
 |-- fasting_blood_sugar: integer (nullable = true)
 |-- resting_electrocardiographic_results: integer (nullable = true)
 |-- maximum_heart_rate_achieved: integer (nullable = true)
 |-- exercise_induced_angina: integer (nullable = true)
 |-- oldpeak: double (nullable = true)
 |-- slope: integer (nullable = true)
 |-- number_of_major_vessels: integer (nullable = true)
 |-- thal: integer (nullable = true)
 |-- class: string (nullable = true)



## Step 3: Transform features and label columns

### Instruction: 
### - All the features need to be transformed into a vector. The class column ("present" vs "absent") needs to be converted to a label column (1 vs 0).
### - The final dataset should contain two columns (features and label).
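
As a rough illustration of the target shape described above, independent of Spark: each row ends up as one feature vector plus a numeric label. The plain-Python sketch below uses made-up rows, only a subset of the real columns, and a hypothetical helper `to_features_and_label`; it is not how PySpark works internally.

```python
# Hypothetical example rows; the values are made up for illustration only.
rows = [
    {"age": 70, "sex": 1, "oldpeak": 2.4, "class": "present"},
    {"age": 67, "sex": 0, "oldpeak": 1.6, "class": "absent"},
]

# A subset of the real feature columns, kept short for readability.
FEATURE_COLUMNS = ["age", "sex", "oldpeak"]
# The mapping required by the instructions: "present" -> 1, "absent" -> 0.
LABEL_MAP = {"absent": 0, "present": 1}


def to_features_and_label(row):
    """Return (features, label): a list of floats and a 0/1 label."""
    features = [float(row[col]) for col in FEATURE_COLUMNS]
    label = LABEL_MAP[row["class"]]
    return features, label


final_rows = [to_features_and_label(r) for r in rows]
for features, label in final_rows:
    print(features, label)
```

In the notebook itself, the same two jobs are done by `StringIndexer` (label) and `VectorAssembler` (features).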

In [12]:
from pyspark.ml.feature import (VectorAssembler,
                               VectorIndexer,
                               OneHotEncoder,
                               StringIndexer)
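
For intuition about what `OneHotEncoder` produces, here is a plain-Python sketch of one-hot encoding. The helper `one_hot` and its signature are illustrative only. Its `drop_last` flag mimics the Spark encoder's default `dropLast=True` setting, under which the last category is encoded as an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


for category in range(3):
    print(category, one_hot(category, 3), one_hot(category, 3, drop_last=False))
```

Dropping the last slot avoids a redundant column: once the other slots are known, the last category is already implied.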

In [13]:
# StringIndexer: create index (0,1,2,...) for categories ('A','B','C',...)
# Here it converts the string column 'class' into the numeric 'label' column.
# stringOrderType='alphabetAsc' ensures 'absent' -> 0.0 and 'present' -> 1.0
class_indexer = StringIndexer(inputCol='class',outputCol='label',
                              stringOrderType='alphabetAsc')

In [14]:
# An integer index implies an order between categories (2 > 1 > 0), which is
#    misleading for nominal features
# OneHotEncoder: create a vector indicating category
# for example: categories (0,1,2)
# category 0 would be [1,0,0]
# category 1 would be [0,1,0]
# category 2 would be [0,0,1]
# (Spark drops the last category by default, so its vectors are one element shorter)
# Assumption: 'chest' is a nominal code, not a quantity
chest_encoder = OneHotEncoder(inputCol='chest',outputCol='chestVec')

In [16]:
# Assumption: 'thal' is likewise a nominal code
thal_encoder = OneHotEncoder(inputCol='thal',outputCol='thalVec')

In [20]:
# Assemble all 13 features into one vector; 'class' is left out because it is the label
assembler = VectorAssembler(inputCols=['age','sex','chestVec',
                                      'resting_blood_pressure','serum_cholestoral',
                                      'fasting_blood_sugar',
                                      'resting_electrocardiographic_results',
                                      'maximum_heart_rate_achieved',
                                      'exercise_induced_angina','oldpeak','slope',
                                      'number_of_major_vessels','thalVec'],outputCol='features')

In [21]:
from pyspark.ml import Pipeline

In [23]:
pipeline = Pipeline(stages=[class_indexer,
                           chest_encoder,thal_encoder,
                           assembler])

In [6]:
# Drop rows with missing values; VectorAssembler raises an error on nulls by default
new_data = data.na.drop()

In [11]:
# Fit the pipeline stages, then apply them to add the 'label' and 'features' columns
output = pipeline.fit(new_data).transform(new_data)

In [1]:
output.head()

In [None]:
# Keep only the two columns required by the instructions
final_data = output.select(['features','label'])

In [None]:
final_data.show()

## Step 4: Build a logistic regression model
### Instructions:
### - split the data into training and test sets
### - fit the model on training set
### - output regression coefficients
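
To make the role of the regression coefficients concrete: a fitted logistic regression scores a row by passing the weighted sum of its features, plus an intercept, through the logistic function. The sketch below uses made-up coefficients and an illustrative helper `predict_probability`; neither comes from the fitted model in this notebook.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Return P(label = 1) for one row under a logistic regression model."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up coefficients for a two-feature model.
coefficients = [0.5, -1.0]
intercept = 0.0

p_neutral = predict_probability([2.0, 1.0], coefficients, intercept)  # z = 0
p_higher = predict_probability([4.0, 0.0], coefficients, intercept)   # z = 2
print(p_neutral, p_higher)
```

A positive coefficient pushes the predicted probability up as its feature increases, and a negative one pushes it down.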

In [24]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
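
Note that `randomSplit` assigns each row to a split at random, so the 70/30 proportions are approximate rather than exact. The plain-Python sketch below imitates that behaviour with a hypothetical helper `random_split` over made-up row ids; the seed is an arbitrary choice that only makes the illustration repeatable.

```python
import random


def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # hypothetical row ids
train_rows, test_rows = random_split(rows, 0.7, seed=42)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, but the split sizes vary from seed to seed.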

In [2]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol='features',labelCol='label')
lr_model = lr.fit(train_data)

In [3]:
# output coefficients
lr_model.coefficients

In [None]:
# AUC on the training set
lr_model.summary.areaUnderROC

## Step 5: Test the model on the test set and display the results (label and prediction)

In [4]:
results = lr_model.transform(test_data)

In [5]:
results.select('label','prediction').show()

## Step 6: Evaluate the model and display AUC

In [7]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

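In [8]:
# Score with the continuous 'rawPrediction' column so the ROC curve is not built
# from hard 0/1 predictions; the default metric is areaUnderROC
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                          labelCol='label')

In [9]:
AUC = evaluator.evaluate(results)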

In [None]:
AUC
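
One way to read the AUC value: it is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as one half. The sketch below computes it by brute force on made-up labels and scores, using an illustrative helper `auc_from_scores` that is not part of Spark.

```python
def auc_from_scores(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four pairs are ranked correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
auc = auc_from_scores(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.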

## Step 7: Conclusions

### Briefly describe the evaluation results and how well your model predicts heart disease based on the given features.
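
When writing the conclusion, it can help to summarize the label and prediction pairs with a few standard metrics alongside AUC. The sketch below is one possible way to do that in plain Python; the helper `classification_summary` and the example pairs are made up for illustration and are not results from this notebook.

```python
def classification_summary(pairs):
    """Compute accuracy, precision and recall from (label, prediction) pairs."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up (label, prediction) pairs.
pairs = [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1)]
summary = classification_summary(pairs)
print(summary)
```

For a heart disease screen, recall (the share of true cases the model catches) is often the number worth discussing first.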