# Step 3: Bivariate Patterns by Branch and Certificate

This notebook moves from simple counts to **bivariate analysis** of medication error patterns.

Goals:

- Identify the most frequent `Pattern Specifics` categories.
- Compare how often these patterns occur by **Branch** (Air vs Ground).
- Compare how often these patterns occur by **Certificate / Source** (AEL, GFL, MTC, REACH, AMR, etc.).

Outputs:

- A table and heatmap of **top error patterns by Branch**.
- A table of **top error patterns by Certificate**.

This mirrors the bivariate EDA step from the loan assignment, applied here to **clinical medication-error patterns** across branches and certificates.
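
The branch comparison described above can be sketched on a small toy frame before touching the real data. The rows below are invented for illustration (only the column names `Pattern Specifics` and `Branch` come from this notebook). The sketch contrasts raw counts with column-normalized shares, which may be easier to compare if Air and Ground have very different event volumes.

```python
import pandas as pd

# Hypothetical toy data: pattern labels and rows are invented for illustration.
toy = pd.DataFrame({
    "Pattern Specifics": ["Wrong dose", "Wrong dose", "Wrong drug",
                          "Wrong dose", "Wrong drug", "Omission"],
    "Branch": ["Air", "Air", "Air", "Ground", "Ground", "Ground"],
})

# Raw counts: rows = pattern, columns = branch.
counts = pd.crosstab(toy["Pattern Specifics"], toy["Branch"])

# Column-normalized shares: each branch column sums to 1,
# so branches with different volumes become comparable.
shares = pd.crosstab(toy["Pattern Specifics"], toy["Branch"], normalize="columns")

print(counts)
print(shares.round(2))
```

Counts answer "how many events", while shares answer "what fraction of this branch's events"; which one is more useful depends on whether the branches report comparable volumes.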

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so display() also works outside a notebook
sns.set_style("whitegrid")

# ---------------------------------------------------------
# 1. Load the Medication data
# ---------------------------------------------------------
try:
    med = pd.read_excel('Krista 240726 Final.xlsx', sheet_name='Medication')
except FileNotFoundError:
    med = pd.read_csv('Krista 240726 Final.xlsx - Medication.csv')

print("Medication sheet shape:", med.shape)

# ---------------------------------------------------------
# 2. Top N Pattern Specifics categories (overall)
# ---------------------------------------------------------
pattern_counts = (
    med["Pattern Specifics"]
    .dropna()
    .value_counts()
)

n_top = 10
top_patterns = pattern_counts.head(n_top).index.tolist()

print(f"Top {n_top} Pattern Specifics (overall):")
print(pattern_counts.head(n_top))

# Filter to just these top patterns
med_top = med[med["Pattern Specifics"].isin(top_patterns)].copy()

# ---------------------------------------------------------
# 3. Bivariate: Pattern Specifics x Branch (Air vs Ground)
# ---------------------------------------------------------
ct_branch_pattern = pd.crosstab(
    med_top["Pattern Specifics"],
    med_top["Branch"]
)

print("\nCrosstab of top Pattern Specifics by Branch (Air vs Ground):")
display(ct_branch_pattern)

plt.figure(figsize=(10, 6))
sns.heatmap(
    ct_branch_pattern,
    annot=True,
    fmt="d",
    cmap="YlOrRd"
)
plt.title(f"Top {n_top} Medication Error Patterns by Branch (Air vs Ground)")
plt.xlabel("Branch")
plt.ylabel("Pattern Specifics")
plt.tight_layout()
plt.show()

# ---------------------------------------------------------
# 4. Bivariate: Pattern Specifics x Certificate / Source
# ---------------------------------------------------------
# Drop rows with missing Pattern Specifics
df_clean = med.dropna(subset=['Pattern Specifics']).copy()

# Cross-tab: rows = Pattern Specifics, columns = Source (certificate)
pattern_summary = pd.crosstab(df_clean['Pattern Specifics'], df_clean['Source'])

# Add a Total Events column
pattern_summary['Total Events'] = pattern_summary.sum(axis=1)

# Sort by total volume
pattern_summary_sorted = pattern_summary.sort_values(by='Total Events', ascending=False)

print(f"\nTop {n_top} Pattern Specifics by Certificate / Source (sorted by Total Events):")
# Show the per-certificate columns alongside the total, not the total alone
display(pattern_summary_sorted.head(n_top))
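
Sections 2 to 4 above follow the same recipe twice: rank the categories, cross-tabulate against a second column, and add a total. One way to avoid repeating it is a small helper. The function name `top_n_crosstab` and the toy rows are hypothetical and not part of the original notebook; this is a minimal sketch, assuming both columns hold categorical labels.

```python
import pandas as pd

def top_n_crosstab(df, row_col, col_col, n=10):
    """Cross-tabulate the n most frequent row_col values against col_col.

    Hypothetical helper for illustration. Rows missing either column are
    dropped first, so the totals match what pd.crosstab counts.
    """
    clean = df.dropna(subset=[row_col, col_col])
    # Most frequent categories, in descending order of frequency
    top = clean[row_col].value_counts().head(n).index
    ct = pd.crosstab(clean[row_col], clean[col_col]).loc[top]
    ct["Total Events"] = ct.sum(axis=1)
    return ct

# Toy data: pattern labels and rows are invented; the Source codes appear in this notebook.
toy = pd.DataFrame({
    "Pattern Specifics": ["Wrong dose", "Wrong dose", "Wrong dose",
                          "Wrong drug", "Wrong drug", "Omission", None],
    "Source": ["AEL", "GFL", "AEL", "MTC", "AEL", "GFL", "AEL"],
})

result = top_n_crosstab(toy, "Pattern Specifics", "Source", n=2)
print(result)
```

With the real data, a call such as `top_n_crosstab(med, "Pattern Specifics", "Source", n=n_top)` would give the certificate table, and swapping `"Source"` for `"Branch"` would give the branch table.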
