#### Rex Gayas
#### Week 10 Exercise 10.2 Spring 2024
#### DSC400-T301 Big Data, Technology, and Algo (2245-1)
#### Classification in PySpark and Keras

**Assignment 10**

Assignment 10.1

In [4]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

# Initialize the SparkSession
spark = SparkSession.builder.appName("DSC 400 Assignment 10").getOrCreate()

# Download the sample data file
!wget https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt -P /content/data/mllib/

# Path to the sample data
sample_libsvm_data_path = "/content/data/mllib/sample_libsvm_data.txt"

# Load training data
training = spark.read.format("libsvm").load(sample_libsvm_data_path)

# Create LogisticRegression instance and set parameters
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print coefficients and intercept for Logistic Regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

# Extract the training summary from the LogisticRegressionModel fitted above
trainingSummary = lrModel.summary

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

# Set the model threshold to maximize F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lrModel.setThreshold(bestThreshold)


--2024-05-19 01:32:29--  https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104736 (102K) [text/plain]
Saving to: ‘/content/data/mllib/sample_libsvm_data.txt’


2024-05-19 01:32:29 (4.50 MB/s) - ‘/content/data/mllib/sample_libsvm_data.txt’ saved [104736/104736]

Coefficients: (692,[272,300,323,350,351,378,379,405,406,407,428,433,434,435,455,456,461,462,483,484,489,490,496,511,512,517,539,540,568],[-7.520689871384125e-05,-8.115773146847006e-05,3.814692771846427e-05,0.0003776490540424338,0.0003405148366194403,0.0005514455157343107,0.0004085386116096912,0.0004197467332749452,0.0008119171358670031,0.000502770837266875,-2.3929260406600902e-05,0.0005745048020902297,0.0009037546426

LogisticRegressionModel: uid=LogisticRegression_0e3fee72adc0, numClasses=2, numFeatures=692

Initialized a “SparkSession” and downloaded the “sample_libsvm_data.txt” dataset from a remote URL. The dataset was loaded into a Spark DataFrame using the “libsvm” format. Created a “LogisticRegression” model with specific parameters (“maxIter=10”, “regParam=0.3”, “elasticNetParam=0.8”) and trained it on the dataset.
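To illustrate what “regParam” and “elasticNetParam” control, the sketch below computes the elastic-net penalty in plain Python. It assumes the form given in the Spark ML documentation, where “regParam” is the overall strength and “elasticNetParam” mixes the L1 and L2 terms. The function name and the sample weights are hypothetical and chosen only for illustration. Spark also standardizes features by default, which this sketch ignores.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared
    )


# Hypothetical weights; the parameter values match the ones used in the cell above.
penalty = elastic_net_penalty([0.5, -0.25, 0.0], reg_param=0.3, elastic_net_param=0.8)
print(f"penalty = {penalty:.6f}")
```

With “elasticNetParam=0.8” most of the penalty comes from the L1 term, which is consistent with the sparse coefficient vector in the output (only a small subset of the 692 features has non-zero weights).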

The output shows the coefficients of the model, the intercept, and the ROC curve data. The “objectiveHistory” array provides the value of the objective function at each training iteration, which shows how the model converged. The ROC curve data lists the false positive rate (FPR) and true positive rate (TPR) at each threshold. The area under the ROC curve (AUC) is 1.0, which means the model separates the two classes perfectly on this dataset. Because the summary is computed on the training data itself, this value does not measure performance on unseen data.
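As a rough illustration of the two quantities used above, the following plain-Python sketch computes the AUC as the fraction of positive/negative pairs that are ranked correctly, and picks the threshold that maximizes the F-measure. It is not how Spark computes these values internally. The function names, the toy labels and scores, and the choice to predict the positive class when the score is at or above the threshold are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_f1_threshold(labels, scores):
    """Try every distinct score as a threshold and keep the one with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Toy data (hypothetical): every positive scores higher than every negative.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]

auc = roc_auc(labels, scores)
threshold, f1 = best_f1_threshold(labels, scores)
print(auc, threshold, f1)
```

On perfectly separable toy data like this, the AUC is 1.0 and the best threshold falls at the lowest positive score, which mirrors the situation reported for the training set above.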

Assignment 10.2

In [5]:
# Import necessary libraries
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

# Load the dataset
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

# Display the shape of the dataset and the first few rows
print(dataframe.shape)
print(dataframe.head())

# Split the data into training and validation sets
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    f"Using {len(train_dataframe)} samples for training "
    f"and {len(val_dataframe)} for validation"
)

# Convert dataframes to tf.data.Dataset objects
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

# Batch the datasets
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

# Define functions for feature encoding
def encode_numerical_feature(feature, name, dataset):
    normalizer = layers.Normalization()
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    normalizer.adapt(feature_ds)
    encoded_feature = normalizer(feature)
    return encoded_feature

def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = layers.StringLookup if is_string else layers.IntegerLookup
    lookup = lookup_class(output_mode="binary")
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    lookup.adapt(feature_ds)
    encoded_feature = lookup(feature)
    return encoded_feature

# Define inputs for the model
sex = keras.Input(shape=(1,), name="sex", dtype="int64")
cp = keras.Input(shape=(1,), name="cp", dtype="int64")
fbs = keras.Input(shape=(1,), name="fbs", dtype="int64")
restecg = keras.Input(shape=(1,), name="restecg", dtype="int64")
exang = keras.Input(shape=(1,), name="exang", dtype="int64")
ca = keras.Input(shape=(1,), name="ca", dtype="int64")
thal = keras.Input(shape=(1,), name="thal", dtype="string")
age = keras.Input(shape=(1,), name="age")
trestbps = keras.Input(shape=(1,), name="trestbps")
chol = keras.Input(shape=(1,), name="chol")
thalach = keras.Input(shape=(1,), name="thalach")
oldpeak = keras.Input(shape=(1,), name="oldpeak")
slope = keras.Input(shape=(1,), name="slope")

all_inputs = [
    sex, cp, fbs, restecg, exang, ca, thal,
    age, trestbps, chol, thalach, oldpeak, slope,
]

# Encode features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, False)
cp_encoded = encode_categorical_feature(cp, "cp", train_ds, False)
fbs_encoded = encode_categorical_feature(fbs, "fbs", train_ds, False)
restecg_encoded = encode_categorical_feature(restecg, "restecg", train_ds, False)
exang_encoded = encode_categorical_feature(exang, "exang", train_ds, False)
ca_encoded = encode_categorical_feature(ca, "ca", train_ds, False)
thal_encoded = encode_categorical_feature(thal, "thal", train_ds, True)
age_encoded = encode_numerical_feature(age, "age", train_ds)
trestbps_encoded = encode_numerical_feature(trestbps, "trestbps", train_ds)
chol_encoded = encode_numerical_feature(chol, "chol", train_ds)
thalach_encoded = encode_numerical_feature(thalach, "thalach", train_ds)
oldpeak_encoded = encode_numerical_feature(oldpeak, "oldpeak", train_ds)
slope_encoded = encode_numerical_feature(slope, "slope", train_ds)

# Combine encoded features
all_features = layers.concatenate([
    sex_encoded, cp_encoded, fbs_encoded, restecg_encoded,
    exang_encoded, ca_encoded, thal_encoded,
    age_encoded, trestbps_encoded, chol_encoded,
    thalach_encoded, oldpeak_encoded, slope_encoded,
])

# Build the model
x = layers.Dense(32, activation="relu")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])

# Visualize the model
keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

# Train the model
model.fit(train_ds, epochs=50, validation_data=val_ds)

# Inference on new data
sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)

print(
    f"This particular patient had a {100 * predictions[0][0]:.1f} "
    "percent probability of having a heart disease, "
    "as evaluated by our model."
)


(303, 14)
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   1       145   233    1        2      150      0      2.3      3   
1   67    1   4       160   286    0        2      108      1      1.5      2   
2   67    1   4       120   229    0        2      129      1      2.6      2   
3   37    1   3       130   250    0        0      187      0      3.5      3   
4   41    0   2       130   204    0        2      172      0      1.4      1   

   ca        thal  target  
0   0       fixed       0  
1   3      normal       1  
2   2  reversible       0  
3   0      normal       0  
4   0      normal       0  
Using 242 samples for training and 61 for validation
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epo

Performed binary classification to predict heart disease using a neural network built with TensorFlow and Keras. Loaded the dataset from a CSV file into a Pandas DataFrame with 303 samples and 14 columns: 13 patient attributes plus the “target” label indicating the presence of heart disease. The dataset was split into training (242 samples) and validation (61 samples) sets, which were converted into “tf.data.Dataset” objects and batched in groups of 32.

Each numerical feature was normalized with a “Normalization” layer, and each categorical feature was encoded with an “IntegerLookup” or “StringLookup” layer, all adapted on the training set only. The encoded features were concatenated and passed through a 32-unit dense layer with ReLU activation, a dropout layer with a rate of 0.5, and a single sigmoid output unit. The model was compiled with the Adam optimizer and binary cross-entropy loss and trained for 50 epochs, reaching a training accuracy of up to 90.08% and a validation accuracy of 80.33%. The output shows the loss and accuracy metrics for each epoch, which track how the model's performance changed over training.

Finally, made a prediction on a new sample, estimating that patient's probability of having heart disease at 13.9%. This demonstrates how the trained model can score a new record from its raw attributes.
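The feature preprocessing described above can be sketched in plain Python to show what the adapted layers compute. This is a simplified illustration, not the Keras implementation: the helper names and toy values are hypothetical, the vocabulary is sorted alphabetically (Keras orders it by frequency), and index 0 is assumed to be reserved for out-of-vocabulary values.

```python
import math


def fit_normalizer(values):
    """Learn mean and (population) standard deviation, return a scaling function."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return lambda v: (v - mean) / std


def fit_lookup(values):
    """Learn a vocabulary and return a function producing a 0/1 indicator vector."""
    vocabulary = sorted(set(values))
    # Slot 0 is kept for values not seen while fitting (an assumption of this sketch).
    index = {token: i + 1 for i, token in enumerate(vocabulary)}

    def encode(value):
        vector = [0] * (len(vocabulary) + 1)
        vector[index.get(value, 0)] = 1
        return vector

    return encode


# Toy training values (hypothetical), standing in for the "age" and "thal" columns.
normalize_age = fit_normalizer([40, 50, 60])
encode_thal = fit_lookup(["fixed", "normal", "reversible", "normal"])

print(normalize_age(60))       # about 1.22 standard deviations above the mean
print(encode_thal("normal"))   # indicator at the slot assigned to "normal"
print(encode_thal("unknown"))  # unseen value falls into slot 0
```

Fitting both helpers on training values only, as the notebook does with “train_ds”, keeps validation data from influencing the statistics and vocabulary.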