# Task 3 — Exploratory Data Analysis (EDA) on Merged Dataset (Outcome-Oriented)

**Input:** merged dataset from Task 2 (`task2_merged_cleaned.xlsx`)  
Generated: **2025-12-16 06:33**

## What this notebook delivers (business outcomes)
- Trend analysis with **at least 3** interpretable visualizations
- Root-cause style breakdowns:
  - Failure component/condition ↔ Cost
  - Failure component/condition ↔ Actual Hours
  - Fix component/condition patterns
- Stakeholder-ready insights and “what to do next” recommendations


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

merged_path = r"/mnt/data/axionray_task2_3/task2_merged_cleaned.xlsx"
df = pd.read_excel(merged_path)
print("Merged shape:", df.shape)
df.head()

---
## 1) Feature engineering for EDA
**Why:** Some columns combine multiple concepts (Condition + Component). Splitting them improves interpretability.

**Outcome:** Enables clean group-bys like “Cost by Failure Component”.
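
To illustrate the intended result, here is a minimal sketch on a few made-up values. The `demo` frame and its contents are illustrative assumptions, not rows from the merged dataset, and only the ` - ` separator is handled.

```python
import pandas as pd

# Hypothetical sample values; the real column may use other separators.
demo = pd.DataFrame({
    "Failure Condition - Failure Component": [
        "Leaking - Hydraulic Hose",
        "Worn - Brake Pad",
        None,
    ]
})

# Split once on " - " into two columns; missing values become "UNKNOWN".
parts = (demo["Failure Condition - Failure Component"]
         .str.split(" - ", n=1, expand=True)
         .fillna("UNKNOWN"))
parts.columns = ["Failure_Condition", "Failure_Component"]
print(parts)
```

With the two parts in separate columns, `groupby("Failure_Component")` works without further string handling.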


In [None]:
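import re

def normalize_text(x):
    """Collapse newlines and repeated whitespace; map missing values to ''."""
    if pd.isna(x):
        return ""
    x = str(x).replace("\n", " ").replace("\r", " ")
    return re.sub(r"\s+", " ", x).strip()

def parse_condition_component(s):
    """Split 'Condition - Component' text into a (condition, component) tuple."""
    s = normalize_text(s)
    if not s:
        return ("UNKNOWN", "UNKNOWN")
    # Try separators in priority order and split once on the first one present.
    for sep in [" - ", " – ", " | ", ":", ";"]:
        if sep in s:
            a, b = s.split(sep, 1)
            return (a.strip() or "UNKNOWN", b.strip() or "UNKNOWN")
    # No separator found: treat the whole string as the component.
    return ("UNKNOWN", s)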

# Split each combined column into separate condition / component columns.
for src, prefix in [("Failure Condition - Failure Component", "Failure"),
                    ("Fix Condition - Fix Component", "Fix")]:
    if src in df.columns:
        cols = [f"{prefix}_Condition", f"{prefix}_Component"]
        parsed = df[src].map(parse_condition_component).tolist()
        df[cols] = pd.DataFrame(parsed, index=df.index, columns=cols)

# Preview only the columns that were actually created, to avoid a KeyError.
new_cols = [c for c in ["Failure_Condition","Failure_Component","Fix_Condition","Fix_Component"] if c in df.columns]
df[new_cols].head()

---
## 2) Trend Analysis (at least 3 visuals)
We focus on trends that directly support decision-making.

### Visual A: Work orders over time (monthly, by Order Date)
**Outcome:** Detect spikes that can point to supplier issues, releases, or seasonal effects.


In [None]:
if "Order Date" in df.columns:
    df["Order Date"] = pd.to_datetime(df["Order Date"], errors="coerce")
    ts = df.dropna(subset=["Order Date"]).set_index("Order Date").resample("MS").size()
    plt.figure(figsize=(9,4.8))
    ts.plot(kind="line")
    plt.title("Work Orders Over Time (Monthly)")
    plt.xlabel("Month")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()
else:
    print("Order Date not found")
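
One simple way to turn "detect spikes" into a repeatable check is to compare each month with a baseline. The sketch below uses made-up monthly counts; the `ts_demo` values and the 2x-median threshold are illustrative assumptions rather than a recommended rule.

```python
import pandas as pd

# Hypothetical monthly work-order counts (month-start index).
idx = pd.date_range("2024-01-01", periods=12, freq="MS")
ts_demo = pd.Series([20, 22, 19, 21, 23, 20, 22, 60, 21, 19, 22, 20], index=idx)

# Use the median as a baseline because it is not pulled up by the spike itself.
baseline = ts_demo.median()

# Flag months above twice the baseline (arbitrary illustrative threshold).
spikes = ts_demo[ts_demo > 2 * baseline]
print(spikes)
```

Flagged months are candidates for a closer look at suppliers, releases, or seasonality, not proof of a cause.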

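### Visual B: Average Cost by Failure Component (Top 10 by Total Cost)
**Outcome:** For the ten components with the highest total cost, shows whether that total comes from expensive individual repairs (high average cost) or from many cheaper ones (high count).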


In [None]:
if "Cost" in df.columns and "Failure_Component" in df.columns:
    g = (df.groupby("Failure_Component")["Cost"]
         .agg(["count","mean","median","sum"])
         .sort_values("sum", ascending=False)
         .head(10))
    plt.figure(figsize=(9,4.8))
    g["mean"].plot(kind="bar")
    plt.title("Average Cost by Failure Component (Top 10 by Total Cost)")
    plt.xlabel("Failure Component")
    plt.ylabel("Avg Cost")
    plt.tight_layout()
    plt.show()
    display(g)  # Jupyter only auto-renders a bare expression on the last top-level line of a cell
else:
    print("Need Cost and Failure_Component")
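
The frequency-versus-impact distinction can also be made explicit in a table. The following is a rough sketch on made-up records; the `demo` data and the "above the median across components" cut-offs are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical repair records.
demo = pd.DataFrame({
    "Failure_Component": ["Hose", "Hose", "Hose", "Hose", "Engine", "Filter"],
    "Cost": [100.0, 120.0, 90.0, 110.0, 5000.0, 40.0],
})

stats = demo.groupby("Failure_Component")["Cost"].agg(["count", "mean", "sum"])

# Label each component by comparing it with the median across components.
stats["high_frequency"] = stats["count"] > stats["count"].median()
stats["high_severity"] = stats["mean"] > stats["mean"].median()
print(stats.sort_values("sum", ascending=False))
```

Components that are frequent but cheap usually call for a different response than components that are rare but expensive, so it can help to keep both flags visible.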

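### Visual C: Average Actual Hours by Fix Component (Top 10 by Average Hours)
**Outcome:** Identifies labor-heavy fixes that may need process improvement or better diagnostics. Because the ranking uses the mean, check the `count` column before acting on components with only a few records.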


In [None]:
if "Actual Hours" in df.columns and "Fix_Component" in df.columns:
    h = (df.groupby("Fix_Component")["Actual Hours"]
         .agg(["count","mean","median","sum"])
         .sort_values("mean", ascending=False)
         .head(10))
    plt.figure(figsize=(9,4.8))
    h["mean"].plot(kind="bar")
    plt.title("Average Actual Hours by Fix Component (Top 10)")
    plt.xlabel("Fix Component")
    plt.ylabel("Avg Actual Hours")
    plt.tight_layout()
    plt.show()
    display(h)  # Jupyter only auto-renders a bare expression on the last top-level line of a cell
else:
    print("Need Actual Hours and Fix_Component")
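
A ranking by mean can be dominated by components that appear only once or twice. One possible guard, sketched here on made-up data, is to require a minimum number of records before ranking; `MIN_COUNT` and the `demo` values are assumptions.

```python
import pandas as pd

# Hypothetical fix records; "Gearbox" has a single, very long job.
demo = pd.DataFrame({
    "Fix_Component": ["Pump", "Pump", "Pump", "Valve", "Valve", "Valve", "Gearbox"],
    "Actual Hours": [4.0, 5.0, 6.0, 2.0, 2.5, 1.5, 40.0],
})

MIN_COUNT = 3  # assumed cut-off; tune to the dataset size

h_demo = demo.groupby("Fix_Component")["Actual Hours"].agg(["count", "mean", "median"])

# Keep only components with enough records, then rank by average hours.
reliable = h_demo[h_demo["count"] >= MIN_COUNT].sort_values("mean", ascending=False)
print(reliable)
```

The excluded low-count components are still worth listing separately, since a single 40-hour job may itself be a finding.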

### Bonus Visual D: Correlation heatmap (numeric metrics)
**Outcome:** Quantifies linear (Pearson) relationships between numeric metrics, such as cost vs. hours.


In [None]:
num_cols = [c for c in ["Revenue","Cost","Actual Hours","Qty","Segment Total $","Meter 1 Reading"] if c in df.columns]
if len(num_cols) >= 2:
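    corr = df[num_cols].corr(numeric_only=True)
    plt.figure(figsize=(7,5))
    # Fix the color scale to [-1, 1] so colors are comparable across runs.
    plt.imshow(corr, aspect="auto", vmin=-1, vmax=1, cmap="coolwarm")
    # Label ticks from corr itself: non-numeric columns are dropped by corr().
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha="right")
    plt.yticks(range(len(corr.index)), corr.index)
    plt.title("Correlation Heatmap (Numeric Features)")
    plt.colorbar()
    plt.tight_layout()
    plt.show()
    display(corr)  # a bare expression inside an if-block is not rendered by Jupyter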
else:
    print("Not enough numeric columns for correlation.")
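
Pearson correlation is sensitive to extreme records, which are common in cost data. As a hedged sketch, the snippet below compares Pearson with rank-based Spearman correlation on made-up values; the `demo` numbers are assumptions chosen to include one outlier.

```python
import pandas as pd

# Hypothetical values: cost rises with hours, with one extreme record.
demo = pd.DataFrame({
    "Actual Hours": [1, 2, 3, 4, 5, 6],
    "Cost": [100, 210, 290, 405, 500, 10000],
})

# DataFrame.corr supports both methods; pick out the single pair of interest.
pearson = demo.corr(method="pearson").loc["Actual Hours", "Cost"]
spearman = demo.corr(method="spearman").loc["Actual Hours", "Cost"]
print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A large gap between the two values suggests that a few extreme rows, rather than a consistent linear trend, are driving the Pearson figure.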

---
## 3) Root-cause style investigations (Failure ↔ Fix)
This section answers stakeholder questions:
- Which failures drive the most total cost?
- Which failures consume the most labor?
- Are certain fix actions repeatedly used for certain failures?

**Outcome:** Actionable prioritization for engineering and ops.


In [None]:
# Top failures by total cost
if "Cost" in df.columns and "Failure_Component" in df.columns:
    top_fail_cost = (df.groupby("Failure_Component")["Cost"]
                     .agg(["count","mean","sum"])
                     .sort_values("sum", ascending=False)
                     .head(15))
    display(top_fail_cost)  # a bare expression inside an if-block is not rendered by Jupyter
else:
    print("Need Cost and Failure_Component")
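
To answer "which failures drive the most total cost" in one number, a cumulative-share (Pareto) view can help. The sketch below uses made-up totals; the `totals` values and the 80% target are illustrative assumptions.

```python
import pandas as pd

# Hypothetical total cost per failure component.
totals = pd.Series(
    {"Engine": 5000.0, "Hose": 3500.0, "Pump": 700.0, "Filter": 500.0, "Belt": 300.0},
    name="Cost",
).sort_values(ascending=False)

# Cumulative share of total cost, from most to least expensive component.
cum_share = totals.cumsum() / totals.sum()

# Smallest leading set of components that covers at least 80% of cost.
n_needed = int((cum_share < 0.80).sum()) + 1
vital_few = cum_share.index[:n_needed].tolist()
print(cum_share.round(2))
print(vital_few)
```

The same calculation applies to `Actual Hours` by swapping the input series.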

In [None]:
# Top failures by total actual hours
if "Actual Hours" in df.columns and "Failure_Component" in df.columns:
    top_fail_hours = (df.groupby("Failure_Component")["Actual Hours"]
                      .agg(["count","mean","sum"])
                      .sort_values("sum", ascending=False)
                      .head(15))
    display(top_fail_hours)  # a bare expression inside an if-block is not rendered by Jupyter
else:
    print("Need Actual Hours and Failure_Component")

In [None]:
# Failure ↔ Fix matrix (top 10 failures & fixes)
if "Failure_Component" in df.columns and "Fix_Component" in df.columns:
    topF = df["Failure_Component"].value_counts().head(10).index
    topX = df["Fix_Component"].value_counts().head(10).index
    sub = df[df["Failure_Component"].isin(topF) & df["Fix_Component"].isin(topX)]
    # crosstab counts rows directly, so no "Primary Key" column is required.
    mat = pd.crosstab(sub["Failure_Component"], sub["Fix_Component"])
    plt.figure(figsize=(10,5))
    plt.imshow(mat.values, aspect="auto")
    plt.xticks(range(mat.shape[1]), mat.columns, rotation=45, ha="right")
    plt.yticks(range(mat.shape[0]), mat.index)
    plt.title("Failure Component vs Fix Component (counts)")
    plt.colorbar()
    plt.tight_layout()
    plt.show()
    display(mat)  # a bare expression inside an if-block is not rendered by Jupyter
else:
    print("Need Failure_Component and Fix_Component")
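
Raw counts favor frequent failures, so it can be useful to look at row shares as well. This sketch, on made-up pairs, finds the most common fix for each failure and the share of cases it accounts for; the `demo` data is an assumption.

```python
import pandas as pd

# Hypothetical failure / fix pairs.
demo = pd.DataFrame({
    "Failure_Component": ["Hose", "Hose", "Hose", "Hose", "Pump", "Pump"],
    "Fix_Component": ["Hose", "Hose", "Hose", "Clamp", "Pump", "Seal"],
})

# normalize="index" turns each row into shares that sum to 1.
share = pd.crosstab(demo["Failure_Component"], demo["Fix_Component"], normalize="index")

# For each failure, the most frequent fix and the share of cases it covers.
dominant = pd.DataFrame({
    "top_fix": share.idxmax(axis=1),
    "top_share": share.max(axis=1),
})
print(dominant)
```

Failures with a high `top_share` are natural candidates for a standardized repair playbook; a low share suggests the fix varies case by case.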

---
## 4) Stakeholder-ready synthesis (how to use these findings)
Use the outputs above to:
1. Prioritize **high total cost** failure components for engineering investigation
2. Improve diagnostics/process for **labor-heavy fixes**
3. Standardize repair playbooks where the same fix repeats for the same failure

**Recommended next step:** build a dashboard with filters by Manufacturer / Model / Model Year / Product Category.
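
Before building a dashboard, the filter logic can be prototyped in the notebook. The helper below is a hypothetical sketch: `filter_records` is not part of any library, and the `demo` columns are assumed to mirror the merged dataset.

```python
import pandas as pd

def filter_records(frame, filters):
    """Return rows matching every {column: value or list of values} filter."""
    mask = pd.Series(True, index=frame.index)
    for col, wanted in filters.items():
        # Accept a single value or a collection of allowed values.
        values = list(wanted) if isinstance(wanted, (list, tuple, set)) else [wanted]
        mask &= frame[col].isin(values)
    return frame[mask]

# Hypothetical records with assumed column names.
demo = pd.DataFrame({
    "Manufacturer": ["A", "A", "B"],
    "Model Year": [2020, 2021, 2020],
    "Cost": [100.0, 250.0, 80.0],
})

subset = filter_records(demo, {"Manufacturer": "A", "Model Year": [2020, 2021]})
print(subset["Cost"].sum())
```

Taking the filters as a dictionary keeps column names with spaces, such as `Model Year`, usable as keys.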
