In [0]:
table_name = "clinisight_analytics.default.cleaned_healthcare_data_unique"

# Read table
df = spark.table(table_name)

# Quick sanity checks
print("Table name:", table_name)
print("Number of rows:", df.count())  # count() returns an exact count and triggers a full scan
display(df.limit(5))


In this step, we prepare the dataset for predictive modeling. Specifically, we will:

- Select the columns relevant to the prediction task (patient ID, age, gender, medical condition, hospital, length of stay, and billing amount), which keeps the feature set small and focused.
- Standardize data types by casting the numeric columns: `Age` to integer, and `Length of Stay` and `Billing Amount ($)` to double, after stripping `$` and `,` from the billing values.
- Remove rows that contain missing or null values in any of the columns used for modeling, so that training and evaluation do not fail or silently skew on incomplete records.

These steps give the model consistent, complete inputs and simplify the rest of the workflow.
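
As a rough illustration of the type-standardization step, the plain-Python sketch below shows what stripping currency symbols and converting to a number does to a single billing value. The helper name `parse_billing` and the sample strings are hypothetical and not part of this notebook; the sketch assumes that values which cannot be parsed should become `None` so that a later null filter can remove them.

```python
import re
from typing import Optional


def parse_billing(raw: Optional[str]) -> Optional[float]:
    """Strip '$' and ',' from a billing string and convert it to float.

    Hypothetical helper for illustration only. Returns None for missing
    or unparseable input so such rows can be filtered out afterwards.
    """
    if raw is None:
        return None
    cleaned = re.sub(r"[$,]", "", raw).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Made-up sample values: two parseable, one invalid, one missing
samples = ["$1,234.50", "987", "n/a", None]
parsed = [parse_billing(s) for s in samples]

# Equivalent of dropping rows with nulls in this column
valid = [p for p in parsed if p is not None]

print(parsed)  # [1234.5, 987.0, None, None]
print(valid)   # [1234.5, 987.0]
```

The regular expression `[$,]` is the same pattern used in the cell below; everything else here is a simplification for a single value rather than a DataFrame column.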

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, DoubleType

df_ml = df.select(
    "Patient_ID",
    "Age",
    "Gender",
    "Medical Condition",
    "Hospital",
    "Length of Stay",
    "Billing Amount ($)"
)

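# Clean and cast numeric columns.
# Cleaned values go into new columns with identifier-safe names
# (Length_of_Stay, BillingAmount); the original source columns are left in place.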
df_ml = df_ml.withColumn(
    "Age",
    F.col("Age").cast(IntegerType())
).withColumn(
    "Length_of_Stay",
    F.col("Length of Stay").cast(DoubleType())
).withColumn(
    "BillingAmount",
    F.regexp_replace(
        F.col("Billing Amount ($)"),
        "[$,]",
        ""
    ).cast(DoubleType())
)

# Drop rows where any modeling column is null
# (missing in the source, or null after cleaning and casting)
df_ml = df_ml.na.drop(
    subset=[
        "Age",
        "Gender",
        "Medical Condition",
        "Hospital",
        "Length_of_Stay",
        "BillingAmount"
    ]
)

print("Row count after cleaning:", df_ml.count())
display(df_ml.limit(10))

## Train Your Prediction Model: Predicting Patient Length of Stay

In this step, we build and train a machine learning model that predicts a patient's length of stay from the features prepared above. This process involves:

- Selecting the input features: age, gender, medical condition, hospital, and billing amount.
- Splitting the data into training (80%) and test (20%) sets so the model is evaluated on rows it has not seen.
- Using linear regression, a simple baseline algorithm for predicting a continuous outcome.
- Training the model on the training set and measuring its accuracy on the test set with RMSE and R².
- Interpreting the results to understand how well these features explain length of stay.

By the end of this step, you will have a predictive model that estimates how long a patient is likely to remain in care, which can support resource planning and operational efficiency.
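
To make the train/test split more concrete, here is a minimal plain-Python sketch of a seeded random split in which each row is assigned independently to one side. The function `random_split` and the row data are hypothetical and only illustrate the idea; this is not Spark's implementation. One practical consequence of this style of sampling is that the resulting proportions are approximately, not exactly, 80/20.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row independently to train or test using a seeded RNG.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1,000 patient records
train, test = random_split(rows)
train_again, test_again = random_split(rows)

print(len(train), len(test))   # roughly 800 and 200, not exactly
print(train == train_again)    # same seed, same split
```

Fixing the seed is what makes the split repeatable between runs, which matters when comparing metrics across experiments.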

In [0]:
# Step 5: Build and train a simple ML model to predict Length of Stay

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

# 1️⃣ Define categorical columns
categorical_cols = ["Gender", "Medical Condition", "Hospital"]

# Index + one-hot encode categorical features
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_Index", handleInvalid="keep") for c in categorical_cols]
encoder = OneHotEncoder(inputCols=[f"{c}_Index" for c in categorical_cols],
                        outputCols=[f"{c}_Vec" for c in categorical_cols])

# 2️⃣ Define numeric columns and final feature vector
numeric_cols = ["Age", "BillingAmount"]
assembler_inputs = [f"{c}_Vec" for c in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

# 3️⃣ Define the regression model
lr = LinearRegression(featuresCol="features", labelCol="Length_of_Stay")

# 4️⃣ Create a pipeline for transformations + model
pipeline = Pipeline(stages=indexers + [encoder, assembler, lr])

# 5️⃣ Split data into training and testing sets
train_df, test_df = df_ml.randomSplit([0.8, 0.2], seed=42)

# 6️⃣ Fit (train) the model
model = pipeline.fit(train_df)

# 7️⃣ Make predictions on test data
predictions = model.transform(test_df)

# Show some predicted vs actual values
display(predictions.select("Patient_ID", "Length_of_Stay", "prediction").limit(10))


# Step 6: Evaluate model performance

evaluator_rmse = RegressionEvaluator(labelCol="Length_of_Stay", predictionCol="prediction", metricName="rmse")
evaluator_r2 = RegressionEvaluator(labelCol="Length_of_Stay", predictionCol="prediction", metricName="r2")

rmse = evaluator_rmse.evaluate(predictions)
r2 = evaluator_r2.evaluate(predictions)

print(f"RMSE (error): {rmse}")
print(f"R² (goodness of fit): {r2}")
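
For readers less familiar with the two metrics printed above, the following plain-Python sketch computes them by hand on a tiny made-up example. The function names `rmse_score` and `r2_score` and all numbers are hypothetical and chosen only for illustration; they are not results from this dataset.

```python
import math


def rmse_score(actual, predicted):
    """Root mean squared error: typical size of a prediction error, in label units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up lengths of stay in days (not taken from the dataset)
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

# A baseline that always predicts the mean of the actual values
baseline = [sum(actual) / len(actual)] * len(actual)

print(f"RMSE: {rmse_score(actual, predicted):.3f}")         # RMSE: 0.707
print(f"R2: {r2_score(actual, predicted):.3f}")             # R2: 0.900
print(f"Baseline R2: {r2_score(actual, baseline):.3f}")     # Baseline R2: 0.000
```

RMSE is expressed in the same unit as the label (here, days), so lower is better. R² compares the model against the mean-only baseline: a value near 1 means most of the variance is explained, a value near 0 means the model does little better than predicting the average, and negative values mean it does worse.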


Now that the model produces predictions, the next step is to save them to a table so they can be queried and visualized in your Databricks dashboard. Storing the predicted and actual lengths of stay side by side in a structured table keeps them available to other team members for reporting, monitoring, and further analysis.

In [0]:
# Rename "Medical Condition" to "Medical_Condition" to avoid invalid characters
predictions_table = predictions.select(
    "Patient_ID",
    "Age",
    "Gender",
    F.col("Medical Condition").alias("Medical_Condition"),
    "Hospital",
    F.round("Length_of_Stay", 2).alias("Actual_Stay"),
    F.round("prediction", 2).alias("Predicted_Stay")
)

# Save as a managed Delta table; mode("overwrite") replaces the contents of any existing table with this name
output_table = "clinisight_analytics.default.ai_length_of_stay_predictions"
predictions_table.write.mode("overwrite").saveAsTable(output_table)

print(f"✅ Predictions saved successfully to: {output_table}")
display(predictions_table.limit(10))