# Data Exploration

This notebook walks through a small data exploration workflow using only the Python standard library.
Every code cell is executable and produces visible output.

## Generating Sample Data

We start by creating a synthetic dataset of 20 pipeline runs, each with a stage, a status, and a duration in milliseconds.

In [1]:
import random
import statistics

random.seed(42)

STAGES = ["load", "chunk", "embed", "retrieve", "generate"]
STATUSES = ["success", "success", "success", "success", "failure"]  # each draw succeeds with probability 0.8 (4 of 5 entries)

runs = []
for run_id in range(1, 21):
    stage = random.choice(STAGES)
    status = random.choice(STATUSES)
    duration_ms = round(random.gauss(450, 120), 1)
    runs.append({"run_id": run_id, "stage": stage, "status": status, "duration_ms": duration_ms})

print(f"Generated {len(runs)} pipeline run records.")

Generated 20 pipeline run records.
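
Repeating "success" four times in `STATUSES` sets the expected success rate, but any one small sample can land some distance from it. The sketch below is one alternative way to express the same probability, using `random.choices` with explicit weights. The helper name `draw_status` and the sample size of 1000 are assumptions made for illustration. Swapping it into the cell above would change the random stream, so the outputs shown in this notebook would no longer match.

```python
import random


def draw_status(rng, success_rate=0.8):
    """Return "success" with probability success_rate, otherwise "failure"."""
    return rng.choices(
        ["success", "failure"],
        weights=[success_rate, 1 - success_rate],
        k=1,
    )[0]


# A separate generator instance, so the global seed used elsewhere is untouched.
rng = random.Random(42)
sample = [draw_status(rng) for _ in range(1000)]
observed = sample.count("success") / len(sample)
print(f"Observed success rate over {len(sample)} draws: {observed:.1%}")
```

With only 20 runs, as in this notebook, the observed rate can differ noticeably from 80%. A larger sample should land closer to it.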


## Inspecting the Data

Let's print the first ten records as a table, then compute summary statistics for the durations.

In [2]:
print(f"{'run_id':>6}  {'stage':<10}  {'status':<8}  {'duration_ms':>11}")
print(f"{'------':>6}  {'----------':<10}  {'--------':<8}  {'-----------':>11}")
for r in runs[:10]:
    print(f"{r['run_id']:>6}  {r['stage']:<10}  {r['status']:<8}  {r['duration_ms']:>11}")

run_id  stage       status    duration_ms
------  ----------  --------  -----------
     1  load        success         445.2
     2  chunk       success         360.2
     3  generate    success         424.3
     4  load        success         433.6
     5  chunk       failure         326.5
     6  generate    success         357.9
     7  chunk       success         265.0
     8  load        success         333.8
     9  retrieve    success         435.4
    10  embed       success         532.3


In [3]:
durations = [r["duration_ms"] for r in runs]

print("Duration Statistics")
print(f"  Mean:    {statistics.mean(durations):.1f} ms")
print(f"  Median:  {statistics.median(durations):.1f} ms")
print(f"  Stdev:   {statistics.stdev(durations):.1f} ms")
print(f"  Min:     {min(durations):.1f} ms")
print(f"  Max:     {max(durations):.1f} ms")

Duration Statistics
  Mean:    419.3 ms
  Median:  429.0 ms
  Stdev:   121.3 ms
  Min:     135.6 ms
  Max:     641.1 ms
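
Mean and standard deviation are sensitive to extreme values such as the 135.6 ms minimum, so quartiles can be a useful complement. The sketch below uses `statistics.quantiles`, which requires Python 3.8 or later. To stay self-contained it runs on the ten durations shown in the table above rather than on the full `durations` list, so its numbers will not match the full-dataset statistics.

```python
import statistics

# Durations copied from the first ten rows of the table printed by cell 2.
sample_durations = [445.2, 360.2, 424.3, 433.6, 326.5, 357.9, 265.0, 333.8, 435.4, 532.3]

# n=4 returns the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(sample_durations, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(f"Q1: {q1:.1f} ms, median: {q2:.1f} ms, Q3: {q3:.1f} ms, IQR: {iqr:.1f} ms")
```

In the notebook itself, passing `durations` instead of `sample_durations` would give quartiles for all 20 runs.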


## Aggregating by Stage

Group runs by pipeline stage and compute per-stage success rates and average durations.

In [4]:
from collections import defaultdict

by_stage = defaultdict(list)
for r in runs:
    by_stage[r["stage"]].append(r)

print(f"{'Stage':<11}  {'Runs':>4}  {'Success Rate':>12}  {'Avg Duration':>12}")
print(f"{'-----------':<11}  {'----':>4}  {'------------':>12}  {'------------':>12}")
for stage in sorted(by_stage):
    stage_runs = by_stage[stage]
    total = len(stage_runs)
    successes = sum(1 for r in stage_runs if r["status"] == "success")
    avg_dur = statistics.mean(r["duration_ms"] for r in stage_runs)
    print(f"{stage:<11}  {total:>4}  {successes/total*100:>11.1f}%  {avg_dur:>9.1f} ms")

Stage        Runs  Success Rate  Avg Duration
-----------  ----  ------------  ------------
chunk           5         80.0%      367.7 ms
embed           4         75.0%      462.7 ms
generate        4        100.0%      473.7 ms
load            5        100.0%      397.9 ms
retrieve        2        100.0%      406.3 ms
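
If the per-stage numbers are needed for further processing rather than only for printing, the same grouping can be wrapped in a function that returns a dictionary. This is a minimal sketch. The function name `summarize_by_stage` and the three-record `sample_runs` list are hypothetical and exist only to make the example self-contained.

```python
import statistics
from collections import defaultdict


def summarize_by_stage(runs):
    """Return {stage: (run_count, success_rate, mean_duration_ms)}."""
    by_stage = defaultdict(list)
    for r in runs:
        by_stage[r["stage"]].append(r)

    summary = {}
    for stage, stage_runs in by_stage.items():
        total = len(stage_runs)
        successes = sum(1 for r in stage_runs if r["status"] == "success")
        mean_dur = statistics.mean(r["duration_ms"] for r in stage_runs)
        summary[stage] = (total, successes / total, mean_dur)
    return summary


# Hypothetical records with the same keys as the notebook's run dictionaries.
sample_runs = [
    {"run_id": 1, "stage": "load", "status": "success", "duration_ms": 400.0},
    {"run_id": 2, "stage": "load", "status": "failure", "duration_ms": 500.0},
    {"run_id": 3, "stage": "embed", "status": "success", "duration_ms": 300.0},
]

summary = summarize_by_stage(sample_runs)
for stage in sorted(summary):
    total, rate, mean_dur = summary[stage]
    print(f"{stage:<8} runs={total} success_rate={rate:.0%} mean={mean_dur:.1f} ms")
```

Stages with only a handful of runs, such as `retrieve` with 2 runs above, give success rates that are too coarse to compare reliably.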


## Duration Distribution

Visualize the distribution of run durations using a text-based histogram.

In [5]:
def text_histogram(values, bins=5):
    lo, hi = min(values), max(values)
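    # Fall back to a width of 1 when all values are equal, which would otherwise divide by zero below.
    bin_width = (hi - lo) / bins or 1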
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / bin_width), bins - 1)
        counts[idx] += 1

    print("Duration Distribution (ms)\n")
    for i, count in enumerate(counts):
        left = int(lo + i * bin_width)
        right = int(lo + (i + 1) * bin_width)
        bar = "\u2588" * count
        print(f"{left:>3}-{right:<3}  {bar} {count}")

text_histogram(durations)

Duration Distribution (ms)

135-236  █ 1
236-337  ████ 4
337-438  ███████ 7
438-540  █████ 5
540-641  ███ 3


## Failure Analysis

Filter and examine failed runs to identify patterns.

In [6]:
failures = [r for r in runs if r["status"] == "failure"]

print(f"Failed Runs: {len(failures)} / {len(runs)} ({len(failures)/len(runs)*100:.1f}%)\n")
print(f"{'run_id':>6}  {'stage':<10}  {'duration_ms':>11}")
print(f"{'------':>6}  {'----------':<10}  {'-----------':>11}")
for r in failures:
    print(f"{r['run_id']:>6}  {r['stage']:<10}  {r['duration_ms']:>11}")

fail_stages = defaultdict(int)
for r in failures:
    fail_stages[r["stage"]] += 1

print("\nFailure counts by stage:")
for stage, count in sorted(fail_stages.items()):
    print(f"  {stage}: {count}")

Failed Runs: 2 / 20 (10.0%)

run_id  stage       duration_ms
------  ----------  -----------
     5  chunk             326.5
    16  embed             303.7

Failure counts by stage:
  chunk: 1
  embed: 1
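
One pattern worth checking is whether failed runs differ in duration from successful ones. Both failures listed above (326.5 ms and 303.7 ms) are below the overall mean of 419.3 ms, but two data points are far too few to support a conclusion. The sketch below shows how such a comparison could be computed. The function name `mean_duration_by_status` and the `sample_runs` values are hypothetical, chosen only for illustration.

```python
import statistics


def mean_duration_by_status(runs):
    """Return {status: mean duration_ms} for each status present in runs."""
    grouped = {}
    for r in runs:
        grouped.setdefault(r["status"], []).append(r["duration_ms"])
    return {status: statistics.mean(values) for status, values in grouped.items()}


# Hypothetical records with the same keys as the notebook's run dictionaries.
sample_runs = [
    {"run_id": 1, "stage": "load", "status": "success", "duration_ms": 400.0},
    {"run_id": 2, "stage": "chunk", "status": "success", "duration_ms": 500.0},
    {"run_id": 3, "stage": "embed", "status": "failure", "duration_ms": 300.0},
    {"run_id": 4, "stage": "chunk", "status": "failure", "duration_ms": 320.0},
]

means = mean_duration_by_status(sample_runs)
for status, mean_dur in sorted(means.items()):
    print(f"{status:<8} mean duration: {mean_dur:.1f} ms")
```

In the notebook, calling `mean_duration_by_status(runs)` would apply the same comparison to the generated dataset.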
