# Interpretable Modeling and Local Explanations with LIME

This notebook trains two classifiers on the **Census Income (Adult)** dataset: a Decision Tree, which is inherently interpretable and supports global explanations, and a Random Forest, a typically stronger but less transparent ensemble whose individual predictions are explained locally with LIME.


In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt

import lime
from lime.lime_tabular import LimeTabularExplainer


## Load dataset

In [None]:
df = pd.read_csv("adult_data.csv")
df.head()


In [None]:
df.info()


In [None]:
df["income"].unique()


## Preprocessing

- Separate features and target.
- Encode each categorical feature as integer codes with `LabelEncoder`, fitting one encoder per column.
- Encode the target labels.
- Split into train/test.
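
To make the encoding step concrete, here is a minimal sketch on a toy frame. The column values and the `encoders` dictionary are assumptions for illustration and are not part of the notebook's data. It shows that `LabelEncoder` assigns codes by sorted class order and that keeping the fitted encoder allows the codes to be mapped back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the Adult data (values are made up for illustration).
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "age": [39, 50, 38, 53],
})

encoders = {}  # hypothetical helper dict: column name -> fitted LabelEncoder
encoded = toy.copy()
# Select non-numeric columns; the notebook selects object-dtype columns instead.
for col in encoded.select_dtypes(exclude=["number"]).columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(encoded[col])
    encoders[col] = enc

# classes_ is sorted, so each integer code is a position in this array.
print(list(encoders["workclass"].classes_))

# Map the integer codes back to the original strings.
decoded = encoders["workclass"].inverse_transform(encoded["workclass"])
print(list(decoded))
```

The integer codes carry an arbitrary order. Tree-based models split on thresholds and usually tolerate this, but the ordering has no meaning in itself.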


In [None]:
X_raw = df.drop("income", axis=1)
y_raw = df["income"]

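X = X_raw.copy()
label_encoders = {}  # column name -> fitted encoder, kept so codes can be decoded later
for col in X.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

# Encode the target; le_income.classes_ holds the original income labels
le_income = LabelEncoder()
y = le_income.fit_transform(y_raw)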

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


## Global interpretability with a Decision Tree

The Decision Tree is inherently interpretable. We inspect:
- The full tree visualization.
- Feature importance scores from the fitted model.
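
Alongside the plot, the fitted rules can also be read as plain text with scikit-learn's `export_text`. The sketch below uses synthetic data (an assumption for illustration, not the Adult dataset) in which only the first feature determines the label, so the tree needs a single split and assigns that feature all of the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (illustrative): the label depends only on feature 0.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Text rendering of the split rules, one line per node.
rules = export_text(demo_tree, feature_names=["f0", "f1", "f2"])
print(rules)

# Impurity-based importances; they sum to 1 across features.
print(demo_tree.feature_importances_)
```

The same call on the notebook's tree would take `tree` and `list(X.columns)` as arguments.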


In [None]:
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

plt.figure(figsize=(20, 10))
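plot_tree(
    tree,
    # Plain lists are passed because some scikit-learn versions reject
    # a pandas Index or NumPy array for these arguments.
    feature_names=list(X.columns),
    class_names=list(le_income.classes_),
    filled=True,
    rounded=True
)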
plt.title("Decision Tree")
plt.show()


In [None]:
importances = tree.feature_importances_
features = X.columns

feat_importance_df = (
    pd.DataFrame({"feature": features, "importance": importances})
    .sort_values(by="importance", ascending=False)
)

feat_importance_df.head(15)


In [None]:
plt.figure(figsize=(12, 6))
plt.barh(feat_importance_df["feature"], feat_importance_df["importance"])
plt.gca().invert_yaxis()
plt.title("Feature Importance (Decision Tree)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()


## Non-interpretable model + local explanations with LIME

A Random Forest averages the predictions of many trees, so its decision logic cannot be read off a single diagram the way a lone tree's can. We use LIME to explain individual predictions (local interpretability) for three training records.
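
The core idea behind a local explanation can be sketched without the `lime` package: perturb the instance, query the black-box model, weight the perturbed samples by proximity, and fit a weighted linear model. The code below is a simplified illustration on synthetic data, not the library's implementation. `local_surrogate` is a hypothetical helper, and the real library differs in details such as discretization and feature selection. The default kernel width follows LIME's documented default of `0.75 * sqrt(n_features)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (illustrative): only feature 0 drives the label.
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)


def local_surrogate(instance, predict_proba, scale, n_samples=2000, kernel_width=None):
    """Hypothetical helper: fit a proximity-weighted linear model around `instance`."""
    n_features = instance.shape[0]
    if kernel_width is None:
        kernel_width = 0.75 * np.sqrt(n_features)
    # 1. Perturb the instance with Gaussian noise scaled per feature.
    noise = rng.normal(size=(n_samples, n_features))
    samples = instance + noise * scale
    # 2. Query the black-box model for the probability of class 1.
    target = predict_proba(samples)[:, 1]
    # 3. Weight samples by closeness to the instance (exponential kernel).
    distances = np.linalg.norm(noise, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # 4. Fit an interpretable linear surrogate on the weighted samples.
    surrogate = Ridge(alpha=1.0).fit(samples, target, sample_weight=weights)
    return surrogate.coef_


coefs = local_surrogate(
    np.array([0.1, 0.0, 0.0]), black_box.predict_proba, X_demo.std(axis=0)
)
print(dict(zip(["f0", "f1", "f2"], coefs.round(3))))
```

The surrogate's coefficients play the role of the per-feature weights that LIME reports: a large positive coefficient means that increasing the feature near this instance pushes the prediction toward class 1.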


In [None]:
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)

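# Column positions of the label-encoded categorical features, so that LIME
# treats them as categories rather than as continuous values.
categorical_idx = [
    i for i, col in enumerate(X_raw.columns) if X_raw[col].dtype == object
]

explainer = LimeTabularExplainer(
    training_data=np.array(X_train),
    feature_names=list(X.columns),
    class_names=list(le_income.classes_),
    categorical_features=categorical_idx,
    mode="classification",
    random_state=42,  # make the sampled perturbations reproducible
)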

# Select three training instances to explain
instances_to_explain = X_train.iloc[[1, 2, 5]]
instances_to_explain


In [None]:
explanations = {}

for idx in instances_to_explain.index:
    exp = explainer.explain_instance(
        data_row=X_train.loc[idx].values,
        predict_fn=model_rf.predict_proba,
        num_features=10
    )
    explanations[idx] = exp
    print(f"\nExplanation for record {idx}:\n")
    # Display a table view inside the notebook
    exp.show_in_notebook(show_table=True)


## Findings

**Decision Tree (global view):**
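- The feature importance ranking (impurity-based and normalized to sum to 1) shows which variables the depth-5 tree relies on most when splitting the data; features that never appear in a split score 0.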

**Random Forest + LIME (local view):**
- For each explained record, LIME identifies the strongest feature conditions that push the prediction toward either `<=50K` or `>50K`.
- In this dataset, income predictions tend to be influenced by combinations of education-related signals, capital gains/losses, relationship/marital indicators, and age.
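
The notebook does not report a performance metric for either model, so the trade-off between interpretability and accuracy is worth checking on held-out data. The sketch below shows one way to do this on synthetic data generated with `make_classification` (an assumption for illustration). On the notebook's data, the same loop would use `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Adult features (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same hyperparameters as the two models trained in the notebook.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```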
