# MR-KG Database Characteristics

This notebook provides comprehensive visualizations of the MR-KG knowledge
graph, including:
- Overall database statistics
- Trait profile similarity metrics
- Evidence profile similarity metrics
- Cross-database comparisons

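**Data dependencies:** Run `just generate-all-summary-stats` in the
processing directory before executing this notebook. Each loading cell
raises `FileNotFoundError` if its input CSV is missing.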
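
A pre-flight check can report every missing input at once rather than stopping at the first failed load. The sketch below is illustrative only: `find_missing_inputs` and `EXPECTED_FILES` are hypothetical names that are not part of the project, and the relative paths are copied from the loading cells in this notebook.

```python
from pathlib import Path

# Paths relative to data/processed, as used by the loading cells in this notebook.
EXPECTED_FILES = (
    "overall-stats/database-summary.csv",
    "overall-stats/model-statistics.csv",
    "overall-stats/journal-statistics.csv",
    "trait-profiles/analysis/summary-stats-by-model.csv",
    "trait-profiles/analysis/similarity-distributions.csv",
    "trait-profiles/analysis/metric-correlations.csv",
    "trait-profiles/analysis/trait-count-distributions.csv",
    "evidence-profiles/analysis/summary-stats-by-model.csv",
    "evidence-profiles/analysis/similarity-distributions.csv",
    "evidence-profiles/analysis/completeness-by-model.csv",
    "evidence-profiles/analysis/matched-pairs-distribution.csv",
)


def find_missing_inputs(processed_dir: Path, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected relative paths that do not exist under processed_dir."""
    return [rel for rel in expected if not (processed_dir / rel).is_file()]
```

Calling `find_missing_inputs(PROCESSED_DIR)` once the Setup cell has run returns an empty list when all inputs are in place.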

## Setup

In [None]:
from pathlib import Path

import altair as alt
import pandas as pd
from IPython.display import display
from yiutils.project_utils import find_project_root

# ---- Project paths ----
PROJECT_ROOT = find_project_root("docker-compose.yml")
DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
OVERALL_STATS_DIR = PROCESSED_DIR / "overall-stats"
TRAIT_ANALYSIS_DIR = PROCESSED_DIR / "trait-profiles" / "analysis"
EVIDENCE_ANALYSIS_DIR = PROCESSED_DIR / "evidence-profiles" / "analysis"

# ---- Altair configuration ----
alt.data_transformers.enable("default", max_rows=None)
alt.themes.enable("default")

print(f"Project root: {PROJECT_ROOT}")
print(f"Overall stats directory: {OVERALL_STATS_DIR}")
print(f"Trait analysis directory: {TRAIT_ANALYSIS_DIR}")
print(f"Evidence analysis directory: {EVIDENCE_ANALYSIS_DIR}")

## Data Loading and Validation

In [None]:
def load_csv_with_check(filepath: Path, description: str) -> pd.DataFrame:
    """Load CSV file with existence check and error message.
    
    Args:
        filepath: Path to CSV file
        description: Human-readable description for error message
    
    Returns:
        Loaded DataFrame
    
    Raises:
        FileNotFoundError: If file does not exist
    """
    if not filepath.exists():
        raise FileNotFoundError(
            f"{description} not found: {filepath}\n"
            "Run 'just generate-all-summary-stats' in the processing "
            "directory first."
        )
    return pd.read_csv(filepath)

print("Loading data files...")

In [None]:
# ---- Overall database statistics ----
db_summary = load_csv_with_check(
    OVERALL_STATS_DIR / "database-summary.csv",
    "Database summary"
)
model_stats = load_csv_with_check(
    OVERALL_STATS_DIR / "model-statistics.csv",
    "Model statistics"
)
journal_stats = load_csv_with_check(
    OVERALL_STATS_DIR / "journal-statistics.csv",
    "Journal statistics"
)

print("Loaded overall database statistics")
print(f"  - Database summary: {len(db_summary)} rows")
print(f"  - Model statistics: {len(model_stats)} models")
print(f"  - Journal statistics: {len(journal_stats)} journals")

In [None]:
# ---- Trait profile statistics ----
trait_summary = load_csv_with_check(
    TRAIT_ANALYSIS_DIR / "summary-stats-by-model.csv",
    "Trait profile summary"
)
trait_similarity_dist = load_csv_with_check(
    TRAIT_ANALYSIS_DIR / "similarity-distributions.csv",
    "Trait similarity distributions"
)
trait_metric_corr = load_csv_with_check(
    TRAIT_ANALYSIS_DIR / "metric-correlations.csv",
    "Trait metric correlations"
)
trait_count_dist = load_csv_with_check(
    TRAIT_ANALYSIS_DIR / "trait-count-distributions.csv",
    "Trait count distributions"
)

print("\nLoaded trait profile statistics")
print(f"  - Summary by model: {len(trait_summary)} models")
print(f"  - Similarity distributions: {len(trait_similarity_dist)} records")
print(f"  - Metric correlations: {len(trait_metric_corr)} records")
print(f"  - Trait count distributions: {len(trait_count_dist)} records")

In [None]:
# ---- Evidence profile statistics ----
evidence_summary = load_csv_with_check(
    EVIDENCE_ANALYSIS_DIR / "summary-stats-by-model.csv",
    "Evidence profile summary"
)
evidence_similarity_dist = load_csv_with_check(
    EVIDENCE_ANALYSIS_DIR / "similarity-distributions.csv",
    "Evidence similarity distributions"
)
evidence_completeness = load_csv_with_check(
    EVIDENCE_ANALYSIS_DIR / "completeness-by-model.csv",
    "Evidence completeness by model"
)
evidence_matched_pairs = load_csv_with_check(
    EVIDENCE_ANALYSIS_DIR / "matched-pairs-distribution.csv",
    "Evidence matched pairs distribution"
)

print("\nLoaded evidence profile statistics")
print(f"  - Summary by model: {len(evidence_summary)} models")
print(f"  - Similarity distributions: {len(evidence_similarity_dist)} records")
print(f"  - Completeness by model: {len(evidence_completeness)} records")
print(f"  - Matched pairs: {len(evidence_matched_pairs)} records")

print("\nAll data files loaded successfully!")

---
# Section A: Overall Database Characteristics

## Plot 1: Database Entity Counts

In [None]:
# ---- Prepare data for entity counts ----
entity_data = pd.DataFrame({
    "Entity": [
        "Unique Papers (PMIDs)",
        "Unique Traits",
        "Model Extraction Results",
    ],
    "Count": [
        db_summary["total_unique_pmids"].iloc[0],
        db_summary["total_unique_traits"].iloc[0],
        db_summary["total_model_results"].iloc[0],
    ],
})

display(entity_data)

In [None]:
entity_chart = (
    alt.Chart(entity_data)
    .mark_bar()
    .encode(
        y=alt.Y("Entity:N", title=None, sort="-x"),
        x=alt.X("Count:Q", title="Count"),
        color=alt.Color("Entity:N", legend=None, scale=alt.Scale(scheme="category10")),
        tooltip=[
            alt.Tooltip("Entity:N", title="Entity"),
            alt.Tooltip("Count:Q", title="Count", format=","),
        ],
    )
    .properties(
        width=600,
        height=250,
        title="MR-KG Database Entity Counts",
    )
)

# Add text labels
text = entity_chart.mark_text(
    align="left",
    baseline="middle",
    dx=5,
).encode(text=alt.Text("Count:Q", format=","))

entity_chart + text

## Plot 2: Per-Model Extraction Statistics

In [None]:
# ---- Prepare model statistics data ----
print("Model statistics:")
display(model_stats)

In [None]:
model_chart = (
    alt.Chart(model_stats)
    .mark_bar()
    .encode(
        x=alt.X("model:N", title="Model", axis=alt.Axis(labelAngle=-45)),
        y=alt.Y("extraction_count:Q", title="Extraction Count"),
        color=alt.Color("model:N", legend=None, scale=alt.Scale(scheme="tableau10")),
        tooltip=[
            alt.Tooltip("model:N", title="Model"),
            alt.Tooltip("extraction_count:Q", title="Extractions", format=","),
            alt.Tooltip("unique_pmids:Q", title="Unique Papers", format=","),
            alt.Tooltip(
                "avg_traits_per_extraction:Q",
                title="Avg Traits/Extraction",
                format=".2f",
            ),
        ],
    )
    .properties(
        width=600,
        height=400,
        title="Extraction Results by Model",
    )
)

model_chart

## Plot 3: Top Journals

In [None]:
# ---- Show top 20 journals ----
# Select by paper_count explicitly rather than relying on the CSV row order
top_journals = journal_stats.nlargest(20, "paper_count")

journal_chart = (
    alt.Chart(top_journals)
    .mark_bar()
    .encode(
        y=alt.Y("journal:N", title=None, sort="-x"),
        x=alt.X("paper_count:Q", title="Number of Papers"),
        color=alt.Color(
            "paper_count:Q",
            scale=alt.Scale(scheme="blues"),
            legend=None,
        ),
        tooltip=[
            alt.Tooltip("journal:N", title="Journal"),
            alt.Tooltip("paper_count:Q", title="Papers", format=","),
            alt.Tooltip("percentage:Q", title="Percentage", format=".2f"),
        ],
    )
    .properties(
        width=600,
        height=500,
        title="Top 20 Journals in MR-KG Corpus",
    )
)

journal_chart

---
# Section B: Trait Profile Similarity

## Plot 4: Similarity Score Distributions

In [None]:
# ---- Examine trait similarity distributions ----
print("Trait similarity distribution columns:")
print(trait_similarity_dist.columns.tolist())
print("\nFirst few rows:")
display(trait_similarity_dist.head())

In [None]:
# ---- Prepare data for layered density plot ----
# Create long format for both semantic and Jaccard similarities
semantic_data = trait_similarity_dist[
    ["model", "mean_semantic_similarity"]
].copy()
semantic_data["metric"] = "Semantic Similarity"
semantic_data.rename(
    columns={"mean_semantic_similarity": "similarity_value"},
    inplace=True,
)

jaccard_data = trait_similarity_dist[
    ["model", "mean_jaccard_similarity"]
].copy()
jaccard_data["metric"] = "Jaccard Similarity"
jaccard_data.rename(
    columns={"mean_jaccard_similarity": "similarity_value"},
    inplace=True,
)

combined_similarity = pd.concat(
    [semantic_data, jaccard_data],
    ignore_index=True,
)

print(f"Combined similarity data: {len(combined_similarity)} records")
display(combined_similarity.head())

In [None]:
similarity_dist_chart = (
    alt.Chart(combined_similarity)
    .transform_density(
        density="similarity_value",
        as_=["similarity_value", "density"],
        groupby=["metric", "model"],
    )
    .mark_area(opacity=0.6)
    .encode(
        x=alt.X("similarity_value:Q", title="Similarity Score"),
        y=alt.Y("density:Q", title="Density"),
        color=alt.Color("metric:N", title="Metric Type"),
        facet=alt.Facet("model:N", columns=2, title="Model"),
    )
    .properties(
        width=300,
        height=200,
        title="Distribution of Trait Similarity Scores by Model",
    )
)

similarity_dist_chart

## Plot 5: Trait Count Distribution

In [None]:
# ---- Examine trait count distributions ----
print("Trait count distribution columns:")
print(trait_count_dist.columns.tolist())
print("\nFirst few rows:")
display(trait_count_dist.head())
print("\nSummary statistics:")
display(trait_count_dist.describe())

In [None]:
trait_count_hist = (
    alt.Chart(trait_count_dist)
    .mark_bar()
    .encode(
        x=alt.X(
            "trait_count:Q",
            bin=alt.Bin(maxbins=30),
            title="Number of Traits per Study",
        ),
        y=alt.Y("count():Q", title="Frequency"),
        tooltip=[
            # Use the same binning as the x axis so tooltip ranges match the bars
            alt.Tooltip(
                "trait_count:Q",
                bin=alt.Bin(maxbins=30),
                title="Trait Count",
            ),
            alt.Tooltip("count():Q", title="Frequency"),
        ],
    )
    .properties(
        width=700,
        height=400,
        title="Distribution of Traits per Study",
    )
)

trait_count_hist

## Plot 6: Similarity Metric Correlation

In [None]:
# ---- Examine metric correlations ----
print("Metric correlation columns:")
print(trait_metric_corr.columns.tolist())
print("\nFirst few rows:")
display(trait_metric_corr.head())

In [None]:
metric_corr_scatter = (
    alt.Chart(trait_metric_corr)
    .mark_circle(size=60, opacity=0.5)
    .encode(
        x=alt.X(
            "semantic_similarity:Q",
            title="Semantic Similarity",
            scale=alt.Scale(domain=[0, 1]),
        ),
        y=alt.Y(
            "jaccard_similarity:Q",
            title="Jaccard Similarity",
            scale=alt.Scale(domain=[0, 1]),
        ),
        color=alt.Color("model:N", title="Model"),
        tooltip=[
            alt.Tooltip("model:N", title="Model"),
            alt.Tooltip(
                "semantic_similarity:Q",
                title="Semantic Similarity",
                format=".3f",
            ),
            alt.Tooltip(
                "jaccard_similarity:Q",
                title="Jaccard Similarity",
                format=".3f",
            ),
        ],
    )
    .properties(
        width=600,
        height=600,
        title="Semantic vs Jaccard Similarity Correlation",
    )
)

# Add regression line
regression = metric_corr_scatter.transform_regression(
    "semantic_similarity",
    "jaccard_similarity",
    groupby=["model"],
).mark_line()

metric_corr_scatter + regression

---
# Section C: Evidence Profile Similarity

## Plot 7: Evidence Similarity Metrics

In [None]:
# ---- Examine evidence similarity distributions ----
print("Evidence similarity distribution columns:")
print(evidence_similarity_dist.columns.tolist())
print("\nFirst few rows:")
display(evidence_similarity_dist.head())

In [None]:
# ---- Prepare data for multi-panel visualization ----
# Create long format for all metrics
evidence_metrics = []

for metric in [
    "direction_concordance",
    "effect_size_similarity",
    "statistical_consistency",
    "evidence_overlap",
]:
    if metric in evidence_similarity_dist.columns:
        # Copy before adding columns to avoid pandas' SettingWithCopyWarning
        metric_data = evidence_similarity_dist[
            ["model", metric]
        ].dropna().copy()
        metric_data["metric_type"] = metric.replace("_", " ").title()
        metric_data.rename(columns={metric: "value"}, inplace=True)
        evidence_metrics.append(metric_data)

if evidence_metrics:
    combined_evidence = pd.concat(evidence_metrics, ignore_index=True)
    print(f"Combined evidence metrics: {len(combined_evidence)} records")
    display(combined_evidence.head())
else:
    print("No evidence metrics found in the data")

In [None]:
if evidence_metrics:
    evidence_violin = (
        alt.Chart(combined_evidence)
        .transform_density(
            density="value",
            as_=["value", "density"],
            groupby=["metric_type", "model"],
        )
        .mark_area(orient="horizontal", opacity=0.7)
        .encode(
            x=alt.X("density:Q", title=None, stack="center", axis=None),
            y=alt.Y("value:Q", title="Metric Value"),
            color=alt.Color("model:N", title="Model"),
            column=alt.Column("metric_type:N", title="Metric Type"),
        )
        .properties(
            width=150,
            height=300,
            title="Distribution of Evidence Similarity Metrics",
        )
    )
    
    # A bare expression inside an if-block is not auto-rendered by Jupyter
    display(evidence_violin)
else:
    print("No data available for visualization")

## Plot 8: Data Completeness by Model

In [None]:
# ---- Examine completeness data ----
print("Evidence completeness columns:")
print(evidence_completeness.columns.tolist())
print("\nData:")
display(evidence_completeness)

In [None]:
# ---- Prepare data for stacked bar chart ----
# Reshape for stacking
completeness_long = pd.melt(
    evidence_completeness,
    id_vars=["model"],
    value_vars=["prop_high", "prop_medium", "prop_low"],
    var_name="completeness_category",
    value_name="proportion",
)

# Clean up category names
completeness_long["completeness_category"] = (
    completeness_long["completeness_category"]
    .str.replace("prop_", "")
    .str.title()
)

print("Completeness data (long format):")
display(completeness_long)

In [None]:
completeness_chart = (
    alt.Chart(completeness_long)
    .mark_bar()
    .encode(
        x=alt.X("model:N", title="Model"),
        y=alt.Y("proportion:Q", title="Proportion", stack="normalize"),
        color=alt.Color(
            "completeness_category:N",
            title="Completeness",
            scale=alt.Scale(
                domain=["High", "Medium", "Low"],
                range=["#2ca02c", "#ff7f0e", "#d62728"],
            ),
        ),
        tooltip=[
            alt.Tooltip("model:N", title="Model"),
            alt.Tooltip(
                "completeness_category:N",
                title="Completeness",
            ),
            alt.Tooltip("proportion:Q", title="Proportion", format=".2%"),
        ],
    )
    .properties(
        width=600,
        height=400,
        title="Data Completeness Distribution by Model",
    )
)

completeness_chart

## Plot 9: Matched Pairs Distribution

In [None]:
# ---- Examine matched pairs data ----
print("Matched pairs columns:")
print(evidence_matched_pairs.columns.tolist())
print("\nFirst few rows:")
display(evidence_matched_pairs.head())

In [None]:
matched_pairs_box = (
    alt.Chart(evidence_matched_pairs)
    .mark_boxplot(extent="min-max")
    .encode(
        x=alt.X("model:N", title="Model"),
        y=alt.Y("matched_pairs:Q", title="Number of Matched Pairs"),
        color=alt.Color("model:N", legend=None),
        tooltip=[
            alt.Tooltip("model:N", title="Model"),
            alt.Tooltip("matched_pairs:Q", title="Matched Pairs"),
        ],
    )
    .properties(
        width=600,
        height=400,
        title="Distribution of Matched Trait Pairs",
    )
)

matched_pairs_box

---
# Section D: Cross-Database Comparison

## Plot 10: Model Coverage Comparison

In [None]:
# ---- Prepare coverage comparison data ----
trait_coverage = trait_summary[["model", "total_combinations"]].copy()
trait_coverage["database"] = "Trait Profiles"
trait_coverage.rename(
    columns={"total_combinations": "combinations"},
    inplace=True,
)

evidence_coverage = evidence_summary[["model", "total_combinations"]].copy()
evidence_coverage["database"] = "Evidence Profiles"
evidence_coverage.rename(
    columns={"total_combinations": "combinations"},
    inplace=True,
)

combined_coverage = pd.concat(
    [trait_coverage, evidence_coverage],
    ignore_index=True,
)

print("Coverage comparison data:")
display(combined_coverage)

In [None]:
coverage_chart = (
    alt.Chart(combined_coverage)
    .mark_bar()
    .encode(
        x=alt.X("model:N", title="Model", axis=alt.Axis(labelAngle=-45)),
        y=alt.Y("combinations:Q", title="Number of Combinations"),
        color=alt.Color("database:N", title="Database Type"),
        column=alt.Column("database:N", title=None),
        tooltip=[
            alt.Tooltip("model:N", title="Model"),
            alt.Tooltip("database:N", title="Database"),
            alt.Tooltip("combinations:Q", title="Combinations", format=","),
        ],
    )
    .properties(
        width=300,
        height=400,
        title="Trait vs Evidence Profile Coverage by Model",
    )
)

coverage_chart

---
# Summary

## Key Insights

Based on the visualizations above:

**Literature Coverage:**
- Total papers in the corpus
- Publication year range
- Growth trajectory over time

**Trait Diversity:**
- Number of unique traits extracted
- Distribution of traits per study
- Variation across extraction models

**Model Performance:**
- Extraction quality varies by model
- Semantic and Jaccard similarities show correlation
- Different models have different coverage patterns

**Similarity Patterns:**
- Trait profiles show distinct similarity distributions
- Evidence profiles are sparser with more variability
- Cross-database patterns reveal different aspects of MR studies

## Data Quality Notes

- **Evidence profiles** have sparser data because they require
  quantitative results
- **Trait profiles** cover all studies with extracted traits
- **Completeness** varies by model and reporting standards in the
  literature
- **Matched pairs** depend on trait overlap between studies
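
For reference, the Jaccard similarity between two studies' trait sets is the size of their intersection divided by the size of their union, so studies that share no traits score 0. The sketch below uses made-up trait names. The function name and the convention of returning 0 for two empty sets are assumptions, not the project's implementation.

```python
def jaccard_similarity(traits_a: set[str], traits_b: set[str]) -> float:
    """Jaccard index: |A & B| / |A | B|."""
    union = traits_a | traits_b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(traits_a & traits_b) / len(union)


# Illustrative trait sets (not taken from the MR-KG data)
study_1 = {"body mass index", "type 2 diabetes", "ldl cholesterol"}
study_2 = {"body mass index", "coronary heart disease", "ldl cholesterol"}

# Two shared traits out of four distinct traits overall
score = jaccard_similarity(study_1, study_2)
```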

## Next Steps

- See `docs/processing/summary-statistics.md` for detailed methodology
- Analysis scripts available in `processing/scripts/analysis/`
- Export visualizations using the cells below for manuscript inclusion

## Export Visualizations (Optional)

Uncomment and run the cells below to save each plot as a Vega-Lite JSON
specification for manuscript use.
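
As an alternative to one `save` call per chart, the export can be driven from a dictionary of names and charts. This is a sketch under stated assumptions: `export_charts` is a hypothetical helper, and it only assumes that each object exposes a `save(path)` method, as Altair charts do.

```python
from pathlib import Path


def export_charts(charts: dict, output_dir: Path, suffix: str = ".json") -> list[Path]:
    """Save each chart to output_dir/<name><suffix> and return the written paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, chart in charts.items():
        path = output_dir / f"{name}{suffix}"
        # Altair infers the output format from the file extension
        chart.save(str(path))
        written.append(path)
    return written
```

For example, `export_charts({"model_statistics": model_chart, "top_journals": journal_chart}, PROCESSED_DIR / "figures" / "mr-kg")` writes two JSON files and returns their paths.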

In [None]:
# # ---- Create output directory ----
# output_dir = PROCESSED_DIR / "figures" / "mr-kg"
# output_dir.mkdir(parents=True, exist_ok=True)
# print(f"Output directory: {output_dir}")

In [None]:
# # ---- Save all plots as JSON ----
# (entity_chart + text).save(str(output_dir / "entity_counts.json"))  # include count labels
# model_chart.save(str(output_dir / "model_statistics.json"))
# journal_chart.save(str(output_dir / "top_journals.json"))
# similarity_dist_chart.save(str(output_dir / "trait_similarity_distributions.json"))
# trait_count_hist.save(str(output_dir / "trait_count_distribution.json"))
# (metric_corr_scatter + regression).save(str(output_dir / "metric_correlation.json"))
# if evidence_metrics:
#     evidence_violin.save(str(output_dir / "evidence_similarity_metrics.json"))
# completeness_chart.save(str(output_dir / "completeness_by_model.json"))
# matched_pairs_box.save(str(output_dir / "matched_pairs_distribution.json"))
# coverage_chart.save(str(output_dir / "coverage_comparison.json"))
# print("All plots saved successfully!")