# Assignment 5.4 – Feature Importance and SHAP Analysis
### Geovanny Peña

In this lab, I analyze how a Random Forest model makes predictions using global feature importance and SHAP explanations.  
Although Logistic Regression was selected as the best tuned model in Assignment 5.3, Random Forest is used here because tree-based models provide clearer and more intuitive interpretability for feature importance and SHAP analysis.


In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse

# Load feature pipeline
pipeline = joblib.load("./etl_pipeline/stedi_feature_pipeline.pkl")

# Load transformed data
X_train_transformed = joblib.load("./etl_pipeline/X_train_transformed.pkl")
X_test_transformed = joblib.load("./etl_pipeline/X_test_transformed.pkl")

# Helper function to ensure numeric arrays
def to_float_matrix(arr: np.ndarray) -> np.ndarray:
    if arr.ndim == 0:
        arr = arr.item()
        if issparse(arr):
            arr = arr.toarray()
        arr = np.array(arr, dtype=float)
    elif arr.dtype == object:
        arr = np.array([x.toarray() if issparse(x) else np.array(x, dtype=float) for x in arr])
        arr = np.vstack(arr)
    elif issparse(arr):
        arr = arr.toarray()
    else:
        arr = np.array(arr, dtype=float)
    return arr

X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)

# Load labels
y_train = np.ravel(joblib.load("./etl_pipeline/y_train.pkl"))
y_test = np.ravel(joblib.load("./etl_pipeline/y_test.pkl"))

X_train.shape, X_test.shape, y_train.shape, y_test.shape

The dataset shapes confirm that the transformed feature matrices and labels are aligned correctly and ready for model interpretation.
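
The same alignment check can be made explicit instead of read off the printed shapes. The sketch below is only an illustration: `check_alignment` is a hypothetical helper that is not part of the original pipeline, and the arrays are toy stand-ins rather than the STEDI data.

```python
import numpy as np

def check_alignment(X, y, name):
    """Return (n_rows, n_features) or raise if X and y do not line up.

    Hypothetical helper for illustration; not part of the original pipeline.
    """
    X = np.asarray(X, dtype=float)
    y = np.ravel(y)
    if X.ndim != 2:
        raise ValueError(f"{name}: expected a 2D feature matrix, got {X.ndim}D")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{name}: {X.shape[0]} rows but {y.shape[0]} labels")
    if not np.isfinite(X).all():
        raise ValueError(f"{name}: feature matrix contains NaN or infinite values")
    return X.shape

# Toy stand-ins (made-up values, not the STEDI data)
X_demo = np.zeros((4, 3))
y_demo = np.array([0, 1, 0, 1])
demo_shape = check_alignment(X_demo, y_demo, "demo")
print(demo_shape)
```

In this notebook the helper could be called once with `X_train, y_train` and once with `X_test, y_test`, followed by a comparison of the two column counts.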

In [0]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest (use the same hyperparameters from 5.3)
model = RandomForestClassifier(
    n_estimators=50,
    max_depth=5,
    random_state=42
)
model.fit(X_train, y_train)

model  # show summary

In [0]:
import numpy as np

importances = model.feature_importances_
importance_order = np.argsort(importances)[::-1]

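try:
    feature_names = pipeline.named_steps["preprocess"].get_feature_names_out()
except Exception:
    # Fall back to generic names if the pipeline has no "preprocess" step
    # or the step cannot report its output feature names
    feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]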

# Print top 10 features
for idx in importance_order[:10]:
    print(feature_names[idx], ":", importances[idx])

### Global Feature Importance Interpretation

- The global feature importance results show a very strong dominance of the feature **num__distance_cm**, which accounts for approximately **92.7%** of the total importance in the Random Forest model. This indicates that the model relies primarily on the distance measurement to distinguish between *step* and *no_step* events.
- This behavior is logically consistent with the STEDI dataset, as stepping movements are expected to produce systematic changes in distance that are not present during non-stepping states. Therefore, distance acts as a direct and highly informative signal for the classification task.
- The remaining importance is distributed across categorical features such as **sensor_type** (accelerometer, gyroscope, ultrasonic sensor) and **device_id**. These features contribute marginally but likely help the model adjust predictions based on sensor characteristics or device-specific calibration differences.
- While the dominance of a single feature may raise concerns about over-reliance, in this context it appears reasonable given the physical nature of the problem. As long as distance measurements are reliable, the model's importance pattern is interpretable and trustworthy.
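
Because the categorical features are one-hot encoded, their importance is split across several columns, which can make **sensor_type** and **device_id** look smaller than they are as a group. A minimal sketch of summing importances back to the original columns is shown below. The helper `group_importances`, the example feature names, and the numbers are all assumptions made for illustration; they are not the values produced by this model.

```python
import numpy as np

def group_importances(feature_names, importances, base_columns):
    """Sum per-column importances back to their source columns.

    Assumes names follow the ColumnTransformer pattern "<step>__<column>[_<category>]".
    Hypothetical helper for illustration only.
    """
    totals = {col: 0.0 for col in base_columns}
    # Try longer column names first so a short name cannot shadow a longer one
    candidates = sorted(base_columns, key=len, reverse=True)
    for name, value in zip(feature_names, importances):
        stripped = name.split("__", 1)[-1]  # drop the transformer prefix
        for col in candidates:
            if stripped == col or stripped.startswith(col + "_"):
                totals[col] += float(value)
                break
        else:
            # No known source column matched
            totals["other"] = totals.get("other", 0.0) + float(value)
    return totals

# Made-up names and importances, only to show the mechanics
demo_names = [
    "num__distance_cm",
    "cat__sensor_type_accelerometer",
    "cat__sensor_type_gyroscope",
    "cat__device_id_A",
]
demo_importances = np.array([0.90, 0.04, 0.03, 0.03])
demo_grouped = group_importances(
    demo_names, demo_importances, ["distance_cm", "sensor_type", "device_id"]
)
print(demo_grouped)
```

With the real model, `feature_names` and `importances` from the cell above would replace the demo inputs.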


In [0]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.barh([feature_names[i] for i in importance_order[:10]],
         importances[importance_order[:10]])
plt.xlabel("Importance")
plt.title("Top Global Feature Importance")
plt.gca().invert_yaxis()
plt.show()

This visualization highlights the features that contribute most to the Random Forest's decisions and will be useful for dashboard reporting in later assignments.

### Step 4 – Install and Import SHAP

In [0]:
%pip install shap

In [0]:
import shap
shap.__version__


In [0]:
import shap
shap.initjs()

# Create SHAP explainer for Random Forest
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Confirm shapes (class 1 = step; same indexing as the plots below)
X_test.shape, shap_values[..., 1].shape
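
The return type of `explainer.shap_values` for a classifier depends on the installed SHAP release: to my knowledge, older versions return a list with one array per class, while newer versions return a single array of shape `(n_samples, n_features, n_classes)`. The sketch below shows one way to get the class-1 matrix in either case. `positive_class_shap` is a hypothetical helper, and the arrays are toy data.

```python
import numpy as np

def positive_class_shap(shap_values, class_index=1):
    """Return the (n_samples, n_features) SHAP matrix for one class.

    Hypothetical helper; handles both the list-per-class layout and the
    3D-array layout described above.
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[class_index])
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        return arr[..., class_index]
    return arr  # already 2D (single-output explanation)

# Toy data in both layouts: 3 samples, 2 features, 2 classes
demo_list = [np.zeros((3, 2)), np.ones((3, 2))]
demo_array = np.stack(demo_list, axis=-1)  # shape (3, 2, 2)

from_list = positive_class_shap(demo_list)
from_array = positive_class_shap(demo_array)
print(from_list.shape, from_array.shape)
```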

### Step 5 – SHAP Summary Plot (Global View)

In [0]:
shap.summary_plot(
    shap_values[..., 1],
    X_test,
    feature_names=feature_names,
    rng=42
)

### SHAP Summary Plot Observations

- The SHAP summary plot confirms the findings from the global feature importance analysis. The feature **num__distance_cm** shows the largest spread of SHAP values, indicating that it has the largest influence on model predictions across the dataset.
- Higher or lower values of distance significantly shift predictions toward either the *step* or *no_step* class, demonstrating that the model uses this feature as its primary decision driver.
- Categorical features such as **sensor_type** and **device_id** appear lower in the ranking and exhibit much smaller SHAP magnitudes. Their influence is secondary and appears to fine-tune predictions rather than determine them outright.
- Overall, the SHAP summary plot reinforces that the model behavior is coherent and largely driven by a physically meaningful variable, with no obvious signs of spurious or illogical feature influence.
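
The ranking in the summary plot is, as far as I understand it, based on the mean absolute SHAP value of each feature. A small sketch of that calculation follows, so the ranking can be compared numerically against `feature_importances_`. `mean_abs_shap` is a hypothetical helper and the SHAP matrix is made up for the example.

```python
import numpy as np

def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_matrix: (n_samples, n_features) values for a single class.
    Hypothetical helper for illustration only.
    """
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    scores = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(scores)[::-1]
    return [(feature_names[k], float(scores[k])) for k in order]

# Made-up SHAP values: 3 samples, 2 features
demo_shap = np.array([
    [-0.40, 0.01],
    [0.30, -0.02],
    [-0.20, 0.00],
])
demo_ranking = mean_abs_shap(
    demo_shap, ["num__distance_cm", "cat__sensor_type_accelerometer"]
)
for name, score in demo_ranking:
    print(name, ":", score)
```

In this notebook the input would be `shap_values[..., 1]` together with `feature_names`.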


### Step 6 – SHAP Force Plot (Local Explanation)

In [0]:
i = 0  # example index
shap.force_plot(
    explainer.expected_value[1],
    shap_values[...,1][i],
    X_test[i],
    feature_names=feature_names,
    matplotlib=True  # <- static version
)

### Local Prediction Explanation Using SHAP Force Plot

- For the selected instance (index 0), the SHAP force plot shows that **num__distance_cm** is the primary feature influencing the prediction. In this case, its contribution is negative, pushing the model toward the *no_step* classification.
- Several device-specific and sensor-related features contribute smaller positive or negative effects. These features slightly adjust the prediction but do not outweigh the dominant influence of distance.
- The explanation is understandable and consistent with human reasoning: the observed distance value does not strongly resemble a stepping pattern, and therefore the model predicts *no_step*. A human analyst reviewing the same data would likely reach a similar conclusion based on this information.


In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# We choose the row we want to analyze.
i = 0

# We create a DataFrame with feature names and their SHAP contributions.
shap_contributions = pd.DataFrame({
    "Feature": feature_names,
    "SHAP_value": shap_values[...,1][i]  # values for the 'step' class
})

# We select the 10 features with the greatest absolute impact.
top_shap = shap_contributions.reindex(shap_contributions.SHAP_value.abs().sort_values(ascending=False).index).head(10)

# Show table
display(top_shap)

# Bar chart
plt.figure(figsize=(10,6))
colors = np.where(top_shap["SHAP_value"] > 0, 'green', 'red')  # positive = green, negative = red
plt.barh(top_shap["Feature"], top_shap["SHAP_value"], color=colors)
plt.xlabel("SHAP Value (Impact on prediction)")
plt.title(f"Top 10 Feature Impacts for prediction index {i}")
plt.gca().invert_yaxis()
plt.show()

### Additional Local SHAP Analysis

- An inspection of the top SHAP contributions for this instance confirms that **num__distance_cm** has the largest absolute impact on the prediction. Its negative SHAP value indicates that the observed distance reduces the likelihood of a *step* classification.
- All other features have SHAP values several orders of magnitude smaller, suggesting that their influence is minimal for this specific prediction. This further emphasizes the central role of distance in the model's decision-making process and highlights the relatively limited effect of sensor type and device identity at the local level.
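
A useful sanity check on any local explanation is SHAP's additivity property: the base value plus the sum of a row's SHAP values should reproduce the model output for that row. The sketch below assumes that `TreeExplainer` explains the predicted class probability for a scikit-learn random forest, which is my understanding but is not verified in this lab. `additivity_gap` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def additivity_gap(base_value, shap_row, predicted_probability):
    """Absolute difference between (base value + sum of SHAP values) and the prediction.

    Hypothetical helper; a gap near zero means the explanation adds up.
    """
    reconstructed = float(base_value) + float(np.sum(shap_row))
    return abs(reconstructed - float(predicted_probability))

# Made-up numbers: base value 0.5, one dominant negative contribution
demo_base = 0.5
demo_row = np.array([-0.30, 0.02, 0.01])
demo_gap = additivity_gap(demo_base, demo_row, 0.23)
print(demo_gap)
```

For the instance analyzed above, the call would look like `additivity_gap(explainer.expected_value[1], shap_values[..., 1][i], model.predict_proba(X_test[[i]])[0, 1])`.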


### Step 7 – Reflection Questions (Model Behavior and Intuition)

**Global Insight**

Which features are the most important overall? Why do you think they matter for predicting step vs. no_step?

- The most important feature by a very large margin is distance_cm, which alone accounts for more than 92% of the model's total feature importance.
- This makes sense because the STEDI system detects steps based on changes in physical distance, so distance measurements are directly tied to step detection.
- Sensor type features (accelerometer, gyroscope, ultrasonic sensor) appear next, but their importance is much smaller, suggesting they provide contextual support rather than being primary drivers.
- Device-specific features (individual spotter IDs) have minor influence, indicating that the model relies more on movement patterns than on the specific device generating the data.

**Local Insight**

What did the SHAP force plot reveal about a single prediction? What features pushed the prediction up or down?

- For the selected observation (index 0), distance_cm had the strongest negative SHAP value, pushing the prediction toward no_step.
- This indicates that, for this specific row, the distance measurement was not consistent with a stepping motion.
- Some sensor and device-related features (such as accelerometer and certain device IDs) had small positive SHAP values, slightly pushing the prediction toward step, but their influence was not strong enough to override the distance signal.
- Overall, the prediction was dominated by physical distance rather than metadata or sensor type.

**Human Intuition Check**

Does the model's logic match what a human might expect from the STEDI data? Why or why not?

- Yes, the model's behavior aligns well with human intuition. A person analyzing step data would naturally focus first on distance and movement patterns rather than on which device recorded the data.
- The fact that distance overwhelmingly drives predictions suggests the model is learning meaningful physical signals rather than memorizing device identifiers.
- The smaller contributions from sensor type also make sense, as different sensors capture motion differently but should not dominate the decision on their own.

**Dashboard Preparation**

Which visualizations from this lab do you plan to include in your Week 6 dashboard?

- I plan to include the Global Feature Importance bar chart to clearly show which features the model relies on overall.
- The SHAP summary plot will be included to illustrate both the direction and magnitude of feature effects across all predictions.
- I will also include one local explanation (either the SHAP force plot or the SHAP contribution bar chart) to demonstrate how the model makes a decision for an individual data point.
- Together, these visuals provide both transparency and interpretability for technical and non-technical audiences.

In [0]:
from sklearn.metrics import classification_report, confusion_matrix

# Predictions on the test set
y_pred = model.predict(X_test)

# Metrics report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:")
confusion_matrix(y_test, y_pred)

In [0]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Build the heatmap manually with matplotlib
plt.figure()
plt.imshow(cm)
plt.title("Confusion Matrix Heatmap")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.xticks([0, 1], ["no_step", "step"])
plt.yticks([0, 1], ["no_step", "step"])

# Show the values inside the matrix
# (row/col are used so the SHAP example index `i` is not overwritten)
for row in range(cm.shape[0]):
    for col in range(cm.shape[1]):
        plt.text(col, row, cm[row, col], ha="center", va="center")

plt.colorbar()
plt.show()

In [0]:
from sklearn.metrics import roc_curve, auc

# Probabilities for the positive class (step)
y_proba = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()