In [1]:
import joblib

# Load model
model_path = "model/model.joblib"
lr = joblib.load(model_path)

In [2]:
import polars as pl

# Extract steps
clf = lr.named_steps["clf"]  # LogisticRegression
preprocessor = lr.named_steps["preprocess"]

# Get processed feature names
feature_names = preprocessor.get_feature_names_out()

# Get coefficients
coefs = clf.coef_[0]

# Create Polars DataFrame
importance = (
    pl.DataFrame(
        {
            "feature": feature_names,
            "coefficient": coefs,
        }
    )
    .with_columns(abs_coeff=pl.col("coefficient").abs())
    .sort("abs_coeff", descending=True)
)


In [3]:
import plotly.express as px

fig = px.bar(
    importance,
    x="abs_coeff",
    y="feature",
    orientation="h",
    title="Logistic Regression Feature Importance",
    labels={"abs_coeff": "Absolute Coefficient", "feature": "Feature"},
)
fig.update_layout(yaxis=dict(autorange="reversed"))  # highest at the top
fig.show()


The plot above shows the absolute value of each model coefficient, so the features are ranked by the magnitude of their effect on the prediction. Notably, "ST_slope" has the largest coefficient. Recall that we encoded it as an ordinal variable using domain knowledge. Had we one-hot encoded it instead, we would have lost the ordering information, and the performance of the model (especially a linear model) would likely have suffered.
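As a reminder of what the ordinal encoding looks like, here is a minimal sketch using scikit-learn's `OrdinalEncoder` with an explicit category order. The raw labels `"Down"`, `"Flat"` and `"Up"` and the sample values are assumptions for illustration; the actual preprocessing step in the pipeline may be set up differently.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumed raw labels; the explicit order fixes Down -> 0, Flat -> 1, Up -> 2.
slope_order = ["Down", "Flat", "Up"]
encoder = OrdinalEncoder(categories=[slope_order])

# Toy column of ST slope values (illustrative, not taken from the dataset).
raw = np.array([["Up"], ["Flat"], ["Down"], ["Flat"]])
encoded = encoder.fit_transform(raw)

print(encoded.ravel())
```

Because the result is a single numeric column, the linear model fits one coefficient for the whole feature and can use the ordering directly, instead of fitting one unrelated coefficient per category.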

In [4]:
fig = px.bar(
    importance,
    x="coefficient",
    y="feature",
    orientation="h",
    title="Logistic Regression Coefficients",
    labels={"coefficient": "Coefficient", "feature": "Feature"},
)
fig.update_layout(yaxis=dict(autorange="reversed"))  # largest magnitude at the top
fig.show()


This plot shows both the magnitude and the direction of each feature's effect on the prediction. A positive coefficient means that higher feature values increase the predicted probability of the positive class, i.e. heart disease. Conversely, a negative coefficient means that higher feature values decrease that probability. For example, high ST slope values are predictive of the absence of heart disease, and low ST slope values are predictive of heart disease. This agrees with domain knowledge and with our encoding of the variable: 0 = down, 1 = flat, 2 = up.
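Logistic regression coefficients live on the log-odds scale, so exponentiating a coefficient gives an odds ratio, which is often easier to read. The sketch below uses made-up coefficient values and a hypothetical feature name (`some_binary_flag`), not the fitted ones. Note that "one unit" refers to the preprocessed feature, so for a standardized feature it means one standard deviation.

```python
import math

# Hypothetical coefficients on the log-odds scale (not the fitted values).
example_coefs = {"ST_slope": -1.0, "some_binary_flag": 0.5}

# exp(beta) is the factor by which the odds of the positive class are
# multiplied when the feature increases by one unit, all else held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in example_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiply by {ratio:.2f} per one-unit increase")
```

An odds ratio below 1 corresponds to a negative coefficient (lower odds of heart disease), and an odds ratio above 1 corresponds to a positive coefficient.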

Since we dropped one category of each one-hot encoded feature, we can interpret the remaining coefficients directly relative to the dropped category. For example, chest pain type ASY (asymptomatic) was dropped. All the other chest pain types have negative coefficients, meaning that compared to an asymptomatic presentation, they are **less indicative** of heart disease. This would be interesting to review with a domain expert for additional interpretation.

Similarly, we observe that **males** are at higher risk than females, and that having **exercise-induced angina** is indicative of heart disease compared to no angina.

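Dropping one category per one-hot encoded feature matters mainly for interpretability: it removes the redundancy between the dummy columns and the intercept, so each remaining coefficient is measured against the dropped category. Without regularization the model predictions remain unchanged; with a regularized model they can differ slightly.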
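The equivalence between the two parameterizations can be checked by hand. The sketch below starts from hypothetical coefficients for a fully one-hot encoded chest pain feature, folds the ASY coefficient into the intercept, and confirms that the predicted probabilities are identical. All numbers are invented, and the labels `ATA`, `NAP` and `TA` are assumed category names.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical full one-hot parameterization (one coefficient per category).
intercept_full = 0.2
coef_full = {"ASY": 0.9, "ATA": -0.6, "NAP": -0.3, "TA": -0.1}

# Re-express everything relative to the dropped (reference) category.
reference = "ASY"
intercept_ref = intercept_full + coef_full[reference]
coef_ref = {
    name: beta - coef_full[reference]
    for name, beta in coef_full.items()
    if name != reference
}

for category in coef_full:
    p_full = sigmoid(intercept_full + coef_full[category])
    # The reference category has no column, so it contributes 0.0.
    p_ref = sigmoid(intercept_ref + coef_ref.get(category, 0.0))
    assert math.isclose(p_full, p_ref)

print(coef_ref)
```

With these made-up numbers every remaining coefficient comes out negative, mirroring the pattern seen in the plot: each one is a difference from the ASY baseline rather than an absolute effect.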