# Healthcare Data Analysis for Predictive Modeling

This project predicts the likelihood of disease from patient health records, using heart disease as the worked example. By analyzing historical healthcare data, the goal is to build predictive models that identify individuals at high risk. Such models can assist healthcare providers with early detection and personalized treatment plans, potentially improving patient outcomes and reducing healthcare costs.


# Dataset Overview and Preprocessing

The dataset includes patient information such as age, sex, chest pain type, blood pressure, cholesterol levels, and exercise-induced angina. These features are used to predict the presence or absence of heart disease. Key preprocessing steps include handling missing values, type casting, and encoding categorical variables.

In [1]:
import boto3

# Initialize the DynamoDB client (local instance)
dynamodb = boto3.resource(
    'dynamodb',
    endpoint_url='http://dynamodb-local:8000',  # Local DynamoDB URL
    region_name='us-west-2',               # AWS region (optional for local)
    aws_access_key_id='dummy',             # Dummy credentials for local use
    aws_secret_access_key='dummy'          # Dummy credentials for local use
)

DB_TABLE_NAME = "processed"

### Creating a DynamoDB Table for Data Storage

Here, we set up a DynamoDB table to store the processed data for each patient. Each entry is given a unique identifier and stored with attributes corresponding to patient details and heart disease indicators.

In [2]:
# Create DB Table
# Define table parameters
table_name = DB_TABLE_NAME
attribute_definitions = [
    {
        'AttributeName': 'pk',
        'AttributeType': 'S'  # S for String, N for Number, B for Binary
    }
]
key_schema = [
    {
        'AttributeName': 'pk',
        'KeyType': 'HASH'  # Partition key
    }
]
provisioned_throughput = {
    'ReadCapacityUnits': 1,
    'WriteCapacityUnits': 1
}

# Create the table
try:
    table = dynamodb.create_table(
        TableName=table_name,
        AttributeDefinitions=attribute_definitions,
        KeySchema=key_schema,
        ProvisionedThroughput=provisioned_throughput
    )
    # Wait until the table is created
    table.meta.client.get_waiter('table_exists').wait(TableName=table_name)
    print(f"Table '{table_name}' created successfully.")
except Exception as e:
    print(f"Error creating table: {e}")



Table 'processed' created successfully.
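
Re-running the cell above against a table that already exists makes `create_table` fail, and the broad `except` only prints the error. A minimal sketch of an idempotent alternative follows. `ensure_table` is a hypothetical helper, and it assumes a low-level DynamoDB client (for example `dynamodb.meta.client`) is passed in.

```python
def ensure_table(client, table_name, key_name="pk"):
    """Create the table only if it does not exist; return True if it was created.

    `client` is assumed to be a low-level DynamoDB client, for example
    boto3.client("dynamodb", endpoint_url=...) or dynamodb.meta.client.
    """
    try:
        client.describe_table(TableName=table_name)
        return False  # table already exists, nothing to do
    except client.exceptions.ResourceNotFoundException:
        pass  # fall through and create it

    client.create_table(
        TableName=table_name,
        AttributeDefinitions=[{"AttributeName": key_name, "AttributeType": "S"}],
        KeySchema=[{"AttributeName": key_name, "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    )
    # Block until the table is ready to accept writes
    client.get_waiter("table_exists").wait(TableName=table_name)
    return True
```

With the resource created earlier, a call could look like `ensure_table(dynamodb.meta.client, DB_TABLE_NAME)`.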


### Reading the Healthcare Dataset
We load and inspect the dataset, which contains various patient attributes such as age, sex, blood pressure, cholesterol levels, and more. This dataset is crucial for building a predictive model for heart disease detection.

The dataset is stored on HDFS. After loading it, we assign meaningful column names to facilitate further analysis and transformations.

In [4]:
# Read data
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, col, when, count, isnan
from pyspark.ml.feature import StringIndexer, VectorAssembler


spark = SparkSession.builder.appName("HealthcareDataAnalysis").getOrCreate()

columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"
]

data = spark.read.csv("hdfs://namenode:9000/input/heart_decease.csv", header=False, inferSchema=True)

data = data.toDF(*columns)


data.show()

+----+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+
| age|sex| cp|trestbps| chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|
+----+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+
|63.0|1.0|1.0|   145.0|233.0|1.0|    2.0|  150.0|  0.0|    2.3|  3.0|0.0| 6.0|     0|
|67.0|1.0|4.0|   160.0|286.0|0.0|    2.0|  108.0|  1.0|    1.5|  2.0|3.0| 3.0|     2|
|67.0|1.0|4.0|   120.0|229.0|0.0|    2.0|  129.0|  1.0|    2.6|  2.0|2.0| 7.0|     1|
|37.0|1.0|3.0|   130.0|250.0|0.0|    0.0|  187.0|  0.0|    3.5|  3.0|0.0| 3.0|     0|
|41.0|0.0|2.0|   130.0|204.0|0.0|    2.0|  172.0|  0.0|    1.4|  1.0|0.0| 3.0|     0|
|56.0|1.0|2.0|   120.0|236.0|0.0|    0.0|  178.0|  0.0|    0.8|  1.0|0.0| 3.0|     0|
|62.0|0.0|4.0|   140.0|268.0|0.0|    2.0|  160.0|  0.0|    3.6|  3.0|2.0| 3.0|     3|
|57.0|0.0|4.0|   120.0|354.0|0.0|    0.0|  163.0|  1.0|    0.6|  1.0|0.0| 3.0|     0|
|63.0|1.0|4.0|   130.0|254.0|0.0|    2.0|  147.0|  0.0

## Column Headers and Their Descriptions

**age**: Age of the patient in years.

**sex**: Sex of the patient.
- `1` = Male
- `0` = Female

**cp**: Chest pain type (4 types).
- `1` = Typical angina: chest pain related to decreased blood supply to the heart.
- `2` = Atypical angina: chest pain with some, but not all, features of typical angina.
- `3` = Non-anginal pain: chest pain not attributed to the heart, for example of esophageal origin.
- `4` = Asymptomatic: no chest pain.

**trestbps**: Resting blood pressure (in mm Hg on admission to the hospital).
- This is the blood pressure measured while at rest.

**chol**: Serum cholesterol in mg/dl.
- Represents the cholesterol level of the patient.

**fbs**: Fasting blood sugar > 120 mg/dl.
- `1` = True (fasting blood sugar is higher than 120 mg/dl)
- `0` = False (fasting blood sugar is lower than or equal to 120 mg/dl)

**restecg**: Resting electrocardiographic results (values 0, 1, 2).
- `0` = Normal.
- `1` = Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV).
- `2` = Showing probable or definite left ventricular hypertrophy by Estes' criteria.

**thalach**: Maximum heart rate achieved.
- The highest heart rate reached during a stress test.

**exang**: Exercise-induced angina.
- `1` = Yes (angina induced by exercise)
- `0` = No (no angina induced by exercise)

**oldpeak**: ST depression induced by exercise relative to rest.
- Measures how far the ST segment of the ECG is depressed during exercise compared with rest.

**slope**: The slope of the peak exercise ST segment.
- `1` = Upsloping: generally considered the least concerning pattern.
- `2` = Flat: the ST segment stays level at peak exercise.
- `3` = Downsloping: often associated with reduced blood flow to the heart.

**ca**: Number of major vessels (0-3) colored by fluoroscopy.
- Represents the number of major blood vessels that are visible after injecting a contrast dye.
- The values range from `0` to `3`.

**thal**: Result of the thallium stress test (a nuclear imaging test of blood flow to the heart muscle).
- `3` = Normal.
- `6` = Fixed defect: a region with reduced blood flow both at rest and during exercise.
- `7` = Reversible defect: blood flow is reduced during exercise but normal at rest.

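**target**: Diagnosis of heart disease (angiographic disease status).
- `0` = No presence of heart disease.
- `1` to `4` = Presence of heart disease. The data loaded in this notebook contains the values `0` through `4`, so the models below treat the target as a five-class label.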
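
The `target` values in the data preview go beyond 0 and 1. If a binary label is wanted, one common convention is to collapse every nonzero value into 1. The sketch below assumes that convention. `binarize_target` is a hypothetical helper and is not part of this notebook's pipeline, which trains on the multi-valued target.

```python
def binarize_target(value):
    """Map the multi-valued diagnosis to a binary label: 0 stays 0, anything else becomes 1."""
    return 0 if int(value) == 0 else 1

# Target values taken from the first rows of the data preview
preview_targets = [0, 2, 1, 0, 0, 0, 3, 0]
binary_targets = [binarize_target(t) for t in preview_targets]
print(binary_targets)  # [0, 1, 1, 0, 0, 0, 1, 0]
```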


### Verifying DynamoDB Table Connectivity
To confirm that our DynamoDB table is set up correctly, we insert a sample entry with a unique identifier (`pk`) and a sample attribute (`Name`). The success of this operation indicates that the DynamoDB table is ready for storing our processed healthcare data.

In [5]:
# Persist data
table = dynamodb.Table(DB_TABLE_NAME)

# Put an item into the table
table.put_item(
    Item={
        'pk': '123',
        'Name': 'Example Item'
    }
)

{'ResponseMetadata': {'RequestId': 'fa214045-baae-43e9-af79-5c8f0f7e1679',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Jetty(12.0.8)',
   'date': 'Sat, 26 Oct 2024 14:55:07 GMT',
   'x-amzn-requestid': 'fa214045-baae-43e9-af79-5c8f0f7e1679',
   'content-type': 'application/x-amz-json-1.0',
   'x-amz-crc32': '2745614147',
   'content-length': '2'},
  'RetryAttempts': 0}}

### Storing Processed Data in DynamoDB
To store patient data, we iterate through each record in our DataFrame, converting numeric fields to the `Decimal` format that DynamoDB requires. Each record is keyed by a combination of the patient's `age` and `sex` attributes. This key is not unique per patient: records that share an age and sex overwrite one another, so the table ends up with at most one item per age and sex combination.

In [6]:
from decimal import Decimal, ROUND_HALF_UP, InvalidOperation

# Helper function to safely convert to Decimal
def safe_decimal(value):
    # Check if the value is None or non-numeric
    if value is None:
        return None
    try:
        return Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    except (InvalidOperation, TypeError):
        return None  # Return None or another placeholder if conversion fails

# Iterate through each row in the DataFrame and store it in DynamoDB
for row in data.collect():
    table.put_item(
        Item={
            'pk': f"patient_{int(row['age'])}_{int(row['sex'])}",  # Key combining age and sex (not unique: later rows overwrite earlier ones)
            'age': safe_decimal(row['age']),
            'sex': 'Male' if row['sex'] == 1 else 'Female',
            'cp': safe_decimal(row['cp']),
            'trestbps': safe_decimal(row['trestbps']),
            'chol': safe_decimal(row['chol']),
            'fbs': safe_decimal(row['fbs']),
            'restecg': safe_decimal(row['restecg']),
            'thalach': safe_decimal(row['thalach']),
            'exang': safe_decimal(row['exang']),
            'oldpeak': safe_decimal(row['oldpeak']),
            'slope': safe_decimal(row['slope']),
            'ca': safe_decimal(row['ca']),
            'thal': safe_decimal(row['thal']),
            'target': safe_decimal(row['target'])
        }
    )
print("Data stored in DynamoDB successfully.")

Data stored in DynamoDB successfully.
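
Because `put_item` replaces any existing item that has the same key, rows sharing an age and sex overwrite one another under the key scheme above. The sketch below illustrates the collision on a few hand-made rows and shows one possible alternative that adds the row position to the key. The sample rows and both helper names are illustrative assumptions.

```python
def age_sex_key(row):
    """Key scheme used in the cell above: not unique per patient."""
    return f"patient_{int(row['age'])}_{int(row['sex'])}"

def indexed_key(index, row):
    """Hypothetical alternative: the row position makes every key distinct."""
    return f"patient_{index:04d}_{int(row['age'])}_{int(row['sex'])}"

# Hand-made sample: two 67-year-old male patients, as in the data preview
rows = [
    {"age": 63.0, "sex": 1.0},
    {"age": 67.0, "sex": 1.0},
    {"age": 67.0, "sex": 1.0},
]

# A dict stands in for the table: writing an existing key overwrites the item
store_by_age_sex = {age_sex_key(r): r for r in rows}
store_by_index = {indexed_key(i, r): r for i, r in enumerate(rows)}

print(len(rows), len(store_by_age_sex), len(store_by_index))  # 3 2 3
```

Comparing the number of stored items with `data.count()` after the load is a quick way to detect this kind of silent overwrite.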


### Retrieving Sample Data from DynamoDB
To verify successful data storage, we retrieve a few sample records from the DynamoDB table. This step ensures that the data structure aligns with expectations and that each attribute is stored as intended. This retrieval can be useful for validation before proceeding to any further data processing.

In [7]:
import boto3
from boto3.dynamodb.conditions import Key

# Initialize the DynamoDB client (assuming local instance)
dynamodb = boto3.resource(
    'dynamodb',
    endpoint_url='http://dynamodb-local:8000',  # Local DynamoDB URL
    region_name='us-west-2',
    aws_access_key_id='dummy',
    aws_secret_access_key='dummy'
)

# Specify the table name
table = dynamodb.Table('processed')

# Retrieve a few items from DynamoDB
try:
    # Scan the table with a limit (enough for a quick validation)
    response = table.scan(Limit=5)  # Retrieve at most 5 items to spot-check the stored data
    
    # Print the retrieved items
    items = response.get('Items', [])
    print("Retrieved items from DynamoDB:")
    for item in items:
        print(item)
except Exception as e:
    print(f"Error retrieving items: {e}")

Retrieved items from DynamoDB:
{'exang': Decimal('1'), 'sex': 'Male', 'thal': Decimal('6'), 'chol': Decimal('169'), 'slope': Decimal('3'), 'cp': Decimal('4'), 'trestbps': Decimal('120'), 'target': Decimal('2'), 'oldpeak': Decimal('2.8'), 'thalach': Decimal('144'), 'fbs': Decimal('0'), 'pk': 'patient_44_1', 'age': Decimal('44'), 'ca': Decimal('0'), 'restecg': Decimal('0')}
{'exang': Decimal('1'), 'sex': 'Female', 'thal': Decimal('3'), 'chol': Decimal('243'), 'slope': Decimal('2'), 'cp': Decimal('4'), 'trestbps': Decimal('138'), 'target': Decimal('0'), 'oldpeak': Decimal('0'), 'thalach': Decimal('152'), 'fbs': Decimal('0'), 'pk': 'patient_46_0', 'age': Decimal('46'), 'ca': Decimal('0'), 'restecg': Decimal('2')}
{'exang': Decimal('0'), 'sex': 'Female', 'thal': Decimal('3'), 'chol': Decimal('306'), 'slope': Decimal('1'), 'cp': Decimal('2'), 'trestbps': Decimal('126'), 'target': Decimal('0'), 'oldpeak': Decimal('0'), 'thalach': Decimal('163'), 'fbs': Decimal('0'), 'pk': 'patient_41_0', 'age

### Loading Dataset from HDFS into Spark DataFrame
This code loads the heart disease dataset stored in HDFS into a Spark DataFrame, allowing efficient handling of large data volumes. Columns are renamed to match the dataset schema, and a sample of the data is displayed to confirm successful loading and structure.

In [26]:
# Load data from HDFS
data = spark.read.csv("hdfs://namenode:9000/input/heart_decease.csv", header=False, inferSchema=True)

# Rename columns as needed
columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"
]
data = data.toDF(*columns)

# Display to confirm successful loading
data.show(5)

+----+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+
| age|sex| cp|trestbps| chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|
+----+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+
|63.0|1.0|1.0|   145.0|233.0|1.0|    2.0|  150.0|  0.0|    2.3|  3.0|0.0| 6.0|     0|
|67.0|1.0|4.0|   160.0|286.0|0.0|    2.0|  108.0|  1.0|    1.5|  2.0|3.0| 3.0|     2|
|67.0|1.0|4.0|   120.0|229.0|0.0|    2.0|  129.0|  1.0|    2.6|  2.0|2.0| 7.0|     1|
|37.0|1.0|3.0|   130.0|250.0|0.0|    0.0|  187.0|  0.0|    3.5|  3.0|0.0| 3.0|     0|
|41.0|0.0|2.0|   130.0|204.0|0.0|    2.0|  172.0|  0.0|    1.4|  1.0|0.0| 3.0|     0|
+----+---+---+--------+-----+---+-------+-------+-----+-------+-----+---+----+------+
only showing top 5 rows



### Data Preprocessing and Feature Engineering
This section preprocesses the data to make it suitable for machine learning models. Key steps include:
- Converting the `ca` and `thal` columns to `float` type for compatibility.
- Handling missing values by dropping rows that contain nulls.
- Indexing the `sex` column with `StringIndexer`. The column is already numeric, so this step mainly re-labels it by frequency (the most frequent value becomes `0.0`).
- Assembling feature columns into a single vector (`features`) needed for modeling.

The preprocessed data now has a `features` column with a vector of values and a `target` column for labels.

In [29]:
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StringIndexer

# Convert columns to float type to ensure compatibility with VectorAssembler
data = data.withColumn("ca", col("ca").cast("float"))
data = data.withColumn("thal", col("thal").cast("float"))

# Check and remove the 'sex_indexed' column if it already exists
if 'sex_indexed' in data.columns:
    data = data.drop('sex_indexed')

# Handle missing values (if any exist, drop for simplicity here)
data = data.na.drop()

# Index categorical columns like 'sex'
indexer = StringIndexer(inputCol="sex", outputCol="sex_indexed")
data = indexer.fit(data).transform(data)

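# Assemble features into a single vector for training
# (the loop variable is named `c` to avoid confusion with the imported `col` function;
#  the resulting order is the original columns minus `sex` and `target`, then `sex_indexed`)
feature_columns = [c for c in data.columns if c not in ['target', 'sex']]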
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(data).select("features", "target")

data.show(5)

24/10/26 15:25:40 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+--------------------+------+
|            features|target|
+--------------------+------+
|[63.0,1.0,145.0,2...|     0|
|[67.0,4.0,160.0,2...|     2|
|[67.0,4.0,120.0,2...|     1|
|[37.0,3.0,130.0,2...|     0|
|[41.0,2.0,130.0,2...|     0|
+--------------------+------+
only showing top 5 rows
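
The cast-then-drop logic above can be illustrated without Spark. The sketch assumes that missing entries are encoded as a non-numeric marker such as `?`, which a lenient float cast turns into a null. The marker and the sample rows are assumptions for illustration.

```python
def to_float_or_none(value):
    """Mimic a lenient cast: non-numeric strings become None instead of raising."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Hypothetical raw rows with `ca` and `thal` read as strings
raw_rows = [
    {"ca": "0.0", "thal": "6.0"},
    {"ca": "?", "thal": "3.0"},   # missing `ca`
    {"ca": "2.0", "thal": "?"},   # missing `thal`
]

cast_rows = [{k: to_float_or_none(v) for k, v in row.items()} for row in raw_rows]
# Equivalent in spirit to data.na.drop(): keep only rows without any null
complete_rows = [row for row in cast_rows if None not in row.values()]

print(len(raw_rows), len(complete_rows))  # 3 1
```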



### Baseline Model Training: Logistic Regression
This section establishes a baseline model using Logistic Regression to predict heart disease likelihood based on patient features. The process includes:
- **Data Splitting**: Dividing data into training (70%) and test (30%) sets.
- **Model Training**: Training a Logistic Regression model on the training data.
- **Predictions**: Generating predictions on the test data.
- **Evaluation**: Calculating model accuracy to understand its initial performance.

Initial results show a model accuracy of 66%, providing a starting point for further model optimization and comparison with other algorithms.

In [30]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Step 1: Split the data into training and testing sets
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)

# Step 2: Initialize and train the Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="target")
model = lr.fit(train_data)

# Step 3: Make predictions on the test data
predictions = model.transform(test_data)

# Step 4: Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print(f"Model Accuracy: {accuracy:.2f}")

# Show some predictions for inspection
predictions.select("features", "target", "prediction").show(5)


24/10/26 15:27:06 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/10/26 15:27:06 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS


Model Accuracy: 0.66
+--------------------+------+----------+
|            features|target|prediction|
+--------------------+------+----------+
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     1|       0.0|
+--------------------+------+----------+
only showing top 5 rows
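
To put the baseline accuracy in context, it helps to compare it with a trivial classifier that always predicts the most frequent class. The sketch below computes that reference value from the class counts reported in this notebook's class distribution check. It uses the full dataset rather than the 30% test split, so it is only an approximate yardstick.

```python
# Class counts as reported by the notebook's class distribution check
class_counts = {0: 160, 1: 54, 2: 35, 3: 35, 4: 13}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
majority_accuracy = class_counts[majority_label] / total

print(total, majority_label, round(majority_accuracy, 2))  # 297 0 0.54
```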



### Feature Scaling
Standardizing feature values is crucial for machine learning models that depend on the scale of input data, especially when combining features of different units and magnitudes. Here, we:
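- **Scale Features**: Using `StandardScaler` with its default settings (`withStd=True`, `withMean=False`), each feature is divided by its standard deviation so that it has unit standard deviation. The features are not mean-centered, which is why the scaled values in the output remain positive.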
- **Output Adjustments**: The transformed `scaled_features` column is renamed back to `features` for seamless model integration.

With standardized features, models can make more consistent predictions, unaffected by variations in individual feature scales.

In [34]:
from pyspark.sql.functions import col
from pyspark.ml.feature import StandardScaler

# Make sure 'features' column only exists once
data = data.select("features", "target")

# Apply StandardScaler on the 'features' column
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(data)
scaled_data = scaler_model.transform(data)

# Select the scaled features and target, renaming 'scaled_features' to 'features' for the model input
final_data = scaled_data.select(col("scaled_features").alias("features"), "target")

final_data.show(5)

+--------------------+------+
|            features|target|
+--------------------+------+
|[6.96152928881618...|     0|
|[7.40353114842356...|     2|
|[7.40353114842356...|     1|
|[4.08851720136823...|     0|
|[4.53051906097561...|     0|
+--------------------+------+
only showing top 5 rows
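
Spark's `StandardScaler` divides by the standard deviation but, with its default `withMean=False`, does not subtract the mean. The plain-Python sketch below contrasts the two variants on made-up numbers; the sample values are illustrative only.

```python
from statistics import mean, stdev

values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up feature column
mu, sigma = mean(values), stdev(values)  # stdev is the sample standard deviation

scaled_only = [v / sigma for v in values]          # like withMean=False
standardized = [(v - mu) / sigma for v in values]  # like withMean=True

# Both variants have unit standard deviation; only the second is centered at zero
print(round(stdev(scaled_only), 3), round(stdev(standardized), 3))  # 1.0 1.0
print(round(mean(scaled_only), 3))  # 1.897
```

In PySpark the centered variant corresponds to `StandardScaler(..., withMean=True)`, which produces dense output vectors.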



### Checking Class Distribution
To understand the class balance in our dataset, we count the instances of each target class:
- **Class Balance**: Imbalance in classes can bias the model, causing it to favor the majority class.
- **Insights**: Here, the `target=0` class has the most samples (160 of 297), while `target=4` has only 13, indicating a clear imbalance. The next step addresses this with oversampling.

In [36]:
# Check class distribution
final_data.groupBy("target").count().show()

+------+-----+
|target|count|
+------+-----+
|     1|   54|
|     3|   35|
|     4|   13|
|     2|   35|
|     0|  160|
+------+-----+



### Oversampling Minority Classes
To handle class imbalance, we oversample the minority classes in the dataset:
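- **Purpose**: A more balanced dataset helps the model learn from every class rather than favoring the majority class.
- **Method**: We use a custom function to duplicate instances in the minority classes (`target=1, 2, 3, 4`) by sampling with replacement at a specified sample ratio.
- **Result**: After oversampling, the minority classes are better represented, although the classes are still not equal in size (for example, `target=4` grows only from 13 to 18 samples). Because the oversampling is applied before the train/test split, duplicated rows can end up in both sets, which tends to inflate test metrics.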

In [41]:
from pyspark.sql import DataFrame

# Function to oversample the minority classes
def oversample(df: DataFrame, target_col: str, sample_ratio: float) -> DataFrame:
    # Separate the majority and minority classes
    majority_class = df.filter(f"{target_col} = 0")
    other_classes = df.filter(f"{target_col} != 0")
    
    # Oversample each minority class and union them
    oversampled_df = majority_class
    for label in [1, 2, 3, 4]:
        sampled_class = other_classes.filter(f"{target_col} = {label}").sample(withReplacement=True, fraction=sample_ratio)
        oversampled_df = oversampled_df.union(sampled_class)
        
    return oversampled_df

# Oversample the minority classes (tune sample_ratio as needed)
balanced_data = oversample(final_data, "target", sample_ratio=2.0)
balanced_data.groupBy("target").count().show()  # Check the new class distribution

+------+-----+
|target|count|
+------+-----+
|     0|  160|
|     1|  118|
|     2|   80|
|     3|   69|
|     4|   18|
+------+-----+
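
A single `sample_ratio` roughly doubles every minority class but leaves the classes unequal, as the counts above show. One possible refinement is a per-class fraction chosen so that each class approaches the majority count. The sketch below only computes those fractions from the original class counts; passing them to `sample` for each label is an assumption about how the function could be extended.

```python
# Class counts before oversampling, from the class distribution check
class_counts = {0: 160, 1: 54, 2: 35, 3: 35, 4: 13}

majority_count = max(class_counts.values())

# Fraction f such that count * f is approximately the majority count
fractions = {
    label: majority_count / count
    for label, count in class_counts.items()
    if count < majority_count
}

for label, fraction in sorted(fractions.items()):
    print(label, round(fraction, 2))
# 1 2.96
# 2 4.57
# 3 4.57
# 4 12.31
```

Sampling with replacement is random, so the resulting counts only approximate the target. Passing `seed=` to `sample` makes a run repeatable.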



### Training and Evaluating Logistic Regression on Balanced Data
To improve predictive performance, we use a Logistic Regression model trained on the balanced data:
- **Data Split**: We split the data into 70% training and 30% test sets.
- **Model Training**: Logistic Regression is chosen for its simplicity and interpretability.
- **Evaluation Metrics**:
  - **Accuracy**: Measures the overall correctness of predictions.
  - **F1 Score**: Combines precision and recall. Spark's `f1` metric averages the per-class F1 scores weighted by class frequency, which makes it useful for imbalanced datasets.
- **Result**: This step allows us to assess the Logistic Regression model's performance on the balanced data before moving to more complex models.
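
To make the two metrics concrete, the sketch below computes accuracy and a support-weighted F1 score by hand on a small made-up set of labels. It assumes the weighted-average definition of F1 (per-class F1 weighted by the number of true instances of each class). The labels and predictions are illustrative, not taken from the model.

```python
from collections import Counter

def accuracy(labels, preds):
    """Share of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Average of per-class F1 scores, weighted by each class's true count."""
    support = Counter(labels)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(labels)

labels = [0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 1, 0, 2, 0]
print(round(accuracy(labels, preds), 3), round(weighted_f1(labels, preds), 3))  # 0.75 0.733
```

In this example the model over-predicts class 0, so the weighted F1 falls slightly below the accuracy.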

### Logistic Regression Performance on Balanced Data
Upon evaluating the Logistic Regression model on balanced data, the results show:
- **Model Accuracy**: 0.59
- **F1 Score**: 0.59

These metrics suggest that the Logistic Regression model struggles with classifying the target variable accurately. Given this, exploring more complex algorithms, such as Random Forest or Decision Trees, could improve predictive performance.

In [42]:
# Split the balanced data into training and test sets
train_data, test_data = balanced_data.randomSplit([0.7, 0.3], seed=42)

In [43]:
from pyspark.ml.classification import LogisticRegression

# Initialize and train the Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="target", maxIter=10)
model = lr.fit(train_data)

In [44]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Make predictions on the test data
predictions = model.transform(test_data)

# Evaluate accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy:.2f}")

# Calculate F1 Score
f1_evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="f1")
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score: {f1_score:.2f}")

# Show a few sample predictions
predictions.select("features", "target", "prediction").show(5)


Model Accuracy: 0.59
F1 Score: 0.59
+--------------------+------+----------+
|            features|target|prediction|
+--------------------+------+----------+
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
+--------------------+------+----------+
only showing top 5 rows



### Random Forest Model Performance
Using a Random Forest classifier with 100 trees, the model achieved:
- **Model Accuracy**: 0.76
- **F1 Score**: 0.76

This improvement over Logistic Regression suggests that Random Forest captures more of the data's structure. The five sample predictions shown are all from class 0, with one of them misclassified as class 1, so they say little about how the other classes are handled. Further tuning of the Random Forest parameters or exploring other algorithms may improve these results.

In [58]:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

# Initialize the Random Forest model
model = RandomForestClassifier(featuresCol="features", labelCol="target", numTrees=100)

# Train the model
tree_model = model.fit(train_data)

# Make predictions
predictions = tree_model.transform(test_data)

In [53]:
# Re-run evaluation on the test data
accuracy = evaluator.evaluate(predictions)
f1_score = f1_evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy:.2f}")
print(f"F1 Score: {f1_score:.2f}")

# Show a few sample predictions to understand class distribution
predictions.select("features", "target", "prediction").show(5)

Model Accuracy: 0.76
F1 Score: 0.76
+--------------------+------+----------+
|            features|target|prediction|
+--------------------+------+----------+
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       1.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
|(13,[0,1,2,3,6,9,...|     0|       0.0|
+--------------------+------+----------+
only showing top 5 rows



### Hyperparameter Tuning with Cross-Validation
To optimize the Random Forest model, 3-fold cross-validation over a parameter grid was applied, with the F1 score as the selection metric. This tuning tested multiple configurations:
- **numTrees**: Number of trees (values: 50, 100, 150)
- **maxDepth**: Maximum depth of trees (values: 5, 10, 15)

After cross-validation, the best model achieved:
- **Tuned Random Forest Accuracy**: 0.81
- **Tuned Random Forest F1 Score**: 0.81

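These results show that tuning raised test accuracy and F1 from 0.76 to 0.81. Because the test set was drawn from the oversampled data, it may contain duplicates of training rows, so these figures are likely optimistic as an estimate of how well the model generalizes.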

In [50]:
# Disable automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Reduce logging level to suppress DAGScheduler warnings if they persist (optional)
spark.sparkContext.setLogLevel("ERROR")  # Show only error-level messages

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Estimator to tune (numTrees and maxDepth are supplied by the parameter grid)
rf_model = RandomForestClassifier(featuresCol="features", labelCol="target")

# Define evaluator and parameter grid for tuning
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="f1")
param_grid = ParamGridBuilder() \
    .addGrid(rf_model.numTrees, [50, 100, 150]) \
    .addGrid(rf_model.maxDepth, [5, 10, 15]) \
    .build()

# Setup CrossValidator for tuning
crossval = CrossValidator(estimator=rf_model, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)
cv_model = crossval.fit(train_data)

# Apply best model from cross-validation to test data
rf_tuned_predictions = cv_model.bestModel.transform(test_data)

# Evaluate the model
rf_tuned_accuracy = evaluator.evaluate(rf_tuned_predictions, {evaluator.metricName: "accuracy"})
rf_tuned_f1 = evaluator.evaluate(rf_tuned_predictions)

print(f"Tuned Random Forest Accuracy: {rf_tuned_accuracy:.2f}")
print(f"Tuned Random Forest F1 Score: {rf_tuned_f1:.2f}")

Tuned Random Forest Accuracy: 0.81
Tuned Random Forest F1 Score: 0.81
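
The cost of this search grows with the size of the grid. The sketch below enumerates the same parameter values with `itertools.product` and counts the model fits that a 3-fold cross-validation implies. It is a counting illustration only and does not train anything.

```python
from itertools import product

num_trees_values = [50, 100, 150]
max_depth_values = [5, 10, 15]
num_folds = 3

# Every combination of the two hyperparameters
grid = [
    {"numTrees": n, "maxDepth": d}
    for n, d in product(num_trees_values, max_depth_values)
]

fits_during_search = len(grid) * num_folds
print(len(grid), fits_during_search)  # 9 27
```

After the search, `CrossValidator` refits the best combination once on the full training set, so 28 models are trained in total.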


### Feature Importances
The numbers correspond to the contribution of each feature (higher values mean the feature has a larger impact on predictions). The values follow the order of `feature_columns` built during preprocessing, in which `sex` was removed and `sex_indexed` was appended last:

- **age**: 0.1367
- **cp (chest pain type)**: 0.0627
- **trestbps (resting blood pressure)**: 0.1010
- **chol (cholesterol)**: 0.0963
- **fbs (fasting blood sugar)**: 0.0196
- **restecg (resting ECG results)**: 0.0251
- **thalach (maximum heart rate achieved)**: 0.1445
- **exang (exercise-induced angina)**: 0.0423
- **oldpeak (ST depression induced by exercise)**: 0.1377
- **slope (slope of the peak exercise ST segment)**: 0.0372
- **ca (number of major vessels colored by fluoroscopy)**: 0.0812
- **thal (thallium stress test result)**: 0.0731
- **sex (indexed)**: 0.0427

### Key Insights
- **Top Features**: `thalach`, `oldpeak`, `age`, and `trestbps` are among the top contributors, suggesting that they are significant indicators of heart disease likelihood in this dataset.
- **Low Impact Features**: `fbs` and `restecg` have relatively low feature importance, indicating they have a smaller effect on predictions.

In [54]:
feature_importances = cv_model.bestModel.featureImportances
print("Feature Importances:", feature_importances)

Feature Importances: (13,[0,1,2,3,4,5,6,7,8,9,10,11,12],[0.13671807808413206,0.06273086359763451,0.10095429121553508,0.0962744782030749,0.019580941444494866,0.02510959181143669,0.14448491866333338,0.04226907342420461,0.137696543492306,0.037196738904549304,0.08117716395702822,0.0731145382315963,0.0426927789706739])


### Storing Results in DynamoDB
To facilitate future retrieval and analysis, the predictions and feature importances are stored in a DynamoDB table:

1. **Predictions**: Each prediction entry includes the input features, actual target, and model-predicted target.
2. **Feature Importances**: Feature importance scores are saved in a structured format, allowing insights into which features contributed most to the model's predictions.

Upon running this code, a success message confirms the data storage process.

In [56]:
import boto3
from decimal import Decimal

# Initialize the DynamoDB client
dynamodb = boto3.resource(
    'dynamodb',
    endpoint_url='http://dynamodb-local:8000',  # Adjust if needed
    region_name='us-west-2',
    aws_access_key_id='dummy',
    aws_secret_access_key='dummy'
)

# Define or create the table
table_name = 'ModelResults'
try:
    table = dynamodb.Table(table_name)
    table.load()  # Verify if table exists
except Exception:
    # Create table if it doesn't exist
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
        AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
        ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1}
    )
    table.meta.client.get_waiter('table_exists').wait(TableName=table_name)

# Save predictions
for i, row in enumerate(rf_tuned_predictions.collect()):
    table.put_item(
        Item={
            'id': f"sample_{i}",
            'features': [Decimal(str(x)) for x in row['features']],
            'actual_target': int(row['target']),
            'predicted_target': int(row['prediction'])
        }
    )

# Save feature importances
table.put_item(
    Item={
        'id': 'feature_importances',
        'importances': {name: Decimal(str(float(imp))) for name, imp in zip(feature_columns, feature_importances.toArray())}
    }
)

print("Predictions and feature importances saved to DynamoDB successfully!")

Predictions and feature importances saved to DynamoDB successfully!
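
boto3 rejects Python `float` values when writing to DynamoDB, which is why the cell above wraps numbers in `Decimal`. A small recursive helper can apply that conversion to nested items in one place. The sketch below is one possible version; `to_dynamo` is a hypothetical name, and the sample item is illustrative.

```python
from decimal import Decimal

def to_dynamo(value):
    """Recursively convert floats to Decimal so the item can be serialized."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int; keep it unchanged
    if isinstance(value, float):
        return Decimal(str(value))  # str() avoids binary floating-point noise
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo(v) for v in value]
    return value

item = to_dynamo({
    "id": "sample_0",
    "features": [63.0, 2.3],
    "actual_target": 0,
    "importances": {"age": 0.1367},
})
print(item["features"], item["importances"])
# [Decimal('63.0'), Decimal('2.3')] {'age': Decimal('0.1367')}
```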


## Conclusion and Summary

This project applied machine learning techniques to healthcare data with the aim of predicting the likelihood of heart disease. Through data preprocessing, oversampling of minority classes, and model optimization, a tuned Random Forest model achieved an accuracy and F1 score of 0.81 on the test split of the oversampled data. Feature importance analysis, read in the order of the assembled feature vector, highlighted maximum heart rate achieved (`thalach`), exercise-induced ST depression (`oldpeak`), and age as the strongest predictors in this dataset.

The final predictions and feature importances were stored in DynamoDB, enabling scalable, persistent access for further analysis and integration with other applications. This project illustrates how machine learning models can support early diagnosis and preventive healthcare, potentially leading to more personalized and effective patient care strategies.