# Knowledge Graph Analysis — Methods & Key Findings

Concise slide deck summarizing methods, logic, and analysis across the notebooks in this workspace.

## Notebook: `knowledge_graph_rule_discovery.ipynb` — Overview

- Goal: discover logical rules (inverse, symmetric, and Horn clauses) in the KG and quantify quality with support & confidence.
- Data: triples (subject relation object) from `train.txt`.
- Output: ranked rules with examples, CSV exports, and visual summaries.

## `knowledge_graph_rule_discovery.ipynb` — Method & Logic

- Parse triples into fast indices: relation -> set of (subj,obj), outgoing/incoming maps.
- Inverse discovery: evaluate ordered relation pairs rel1(X,Y) -> rel2(Y,X); compute matches, confidence = matches / count(rel1).
- Symmetric discovery: rel(X,Y) -> rel(Y,X) using same-relation matching.
- Horn clauses: enumerate small relation combinations and check chain patterns rel1(X,Y) ∧ rel2(Y,Z) → rel3(X,Z); count body occurrences and successful heads.
- Thresholding: apply minimum confidence and support filters (configurable).
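
The inverse and symmetric checks above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function names, the toy triples, and the default thresholds are assumptions made for the example.

```python
from collections import defaultdict

def build_index(triples):
    """Map each relation to the set of (subject, object) pairs it connects."""
    pairs_by_rel = defaultdict(set)
    for subj, rel, obj in triples:
        pairs_by_rel[rel].add((subj, obj))
    return pairs_by_rel

def inverse_rules(pairs_by_rel, min_conf=0.5, min_support=1):
    """Score rel1(X,Y) -> rel2(Y,X) for every ordered relation pair.

    rel1 == rel2 is the symmetric case. Confidence = matches / count(rel1).
    The threshold defaults are illustrative, not the notebook's settings.
    """
    rules = []
    for rel1, body in pairs_by_rel.items():
        for rel2, head in pairs_by_rel.items():
            # A match is a body pair (s, o) whose reversal (o, s) holds for rel2.
            matches = sum((o, s) in head for s, o in body)
            conf = matches / len(body)
            if matches >= min_support and conf >= min_conf:
                kind = "symmetric" if rel1 == rel2 else "inverse"
                rules.append((kind, rel1, rel2, matches, conf))
    # Rank by confidence first, then by support.
    return sorted(rules, key=lambda r: (-r[4], -r[3]))

# Hypothetical toy triples (subject, relation, object), not taken from train.txt.
triples = [
    ("ann", "parent_of", "bob"),
    ("bob", "child_of", "ann"),
    ("ann", "parent_of", "cat"),
    ("bob", "sibling_of", "cat"),
    ("cat", "sibling_of", "bob"),
]
rules = inverse_rules(build_index(triples))
for kind, rel1, rel2, support, conf in rules:
    print(f"{kind:9s} {rel1}(X,Y) -> {rel2}(Y,X)  support={support} conf={conf:.2f}")
```

On this toy input, `parent_of -> child_of` scores only 0.5 because `child_of(cat, ann)` is missing, which is the kind of counter-example worth reviewing by hand.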

## `knowledge_graph_rule_discovery.ipynb` — Analysis & Recommendations

- Inspect top rules by confidence then support; review counter-examples to validate noise vs true exceptions.
- Sparse graphs yield low support even when confidence is moderate; consider imputing missing edges or searching at a coarser granularity.
- For large relation sets, sample or limit combinations (already applied for Horn-3).
- Next steps: validate discovered rules against a holdout set, and export high-confidence rules for use in completion tasks.
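
A chain-rule (Horn) check with a cap on the number of body combinations might look like the sketch below. It assumes that a body occurrence is a distinct (X, Z) pair reachable through rel1 then rel2; the notebook may count groundings differently. The `max_body_pairs` cap, the function name, and the toy triples are hypothetical.

```python
from collections import defaultdict
from itertools import product

def chain_rules(triples, max_body_pairs=1000, min_support=1, min_conf=0.5):
    """Score rel1(X,Y) AND rel2(Y,Z) -> rel3(X,Z) over relation combinations."""
    pairs_by_rel = defaultdict(set)                      # rel -> {(subj, obj)}
    out_by_rel = defaultdict(lambda: defaultdict(set))   # rel -> subj -> {obj}
    for s, r, o in triples:
        pairs_by_rel[r].add((s, o))
        out_by_rel[r][s].add(o)

    rels = sorted(pairs_by_rel)
    rules = []
    for i, (r1, r2) in enumerate(product(rels, repeat=2)):
        if i >= max_body_pairs:  # hypothetical cap to bound the search
            break
        # Distinct (X, Z) pairs connected by r1 followed by r2.
        body = {(x, z)
                for x, y in pairs_by_rel[r1]
                for z in out_by_rel[r2].get(y, ())}
        if not body:
            continue
        for r3 in rels:
            support = len(body & pairs_by_rel[r3])  # heads that actually hold
            conf = support / len(body)
            if support >= min_support and conf >= min_conf:
                rules.append((r1, r2, r3, support, conf))
    return sorted(rules, key=lambda r: (-r[4], -r[3]))

# Hypothetical toy triples for illustration only.
triples = [
    ("ann", "parent_of", "bob"),
    ("bob", "parent_of", "cat"),
    ("bob", "parent_of", "dan"),
    ("ann", "grandparent_of", "cat"),
]
rules = chain_rules(triples)
print(rules)
```

Here the body `parent_of ∘ parent_of` yields two (X, Z) pairs but only one has a `grandparent_of` edge, so the rule is kept at confidence 0.5; the same routine run on a holdout split would show whether such a rule generalizes.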

## Notebook: `component_modularity_notebook.ipynb` — Overview

- Goal: compute modularity per weakly-connected component and visualize network statistics.
- Data: same triple-edge list; components built on undirected projection.
- Output: modularity table, community counts, visual plots, and CSV files (with diameters).

## `component_modularity_notebook.ipynb` — Method & Logic

- Build undirected adjacency from edges; find weakly connected components via BFS.
- For each component: construct NetworkX graph, run greedy modularity community detection, compute modularity, density, avg clustering.
- Diameter: compute each component's diameter (the longest shortest-path length between any two of its nodes) and attach it to the results.
- Visuals: modularity distributions, modularity vs size, heatmaps, and top-component comparisons.
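
The notebook relies on NetworkX for these statistics. To make the quantities concrete, the sketch below computes modularity for a given partition and the diameter of a connected component using only the standard library. It assumes an unweighted, undirected graph without self-loops, and it takes the community partition as input instead of running greedy community detection; all names are hypothetical.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets; assumes no self-loops."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def modularity(adj, communities):
    """Q = sum over communities of (intra_edges / m) - (degree_sum / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # undirected edge count
    q = 0.0
    for comm in communities:
        comm = set(comm)
        # Each intra-community edge is seen from both endpoints, hence / 2.
        intra = sum(1 for u in comm for v in adj[u] if v in comm) / 2
        degree_sum = sum(len(adj[u]) for u in comm)
        q += intra / m - (degree_sum / (2 * m)) ** 2
    return q

def eccentricity(adj, source):
    """Longest BFS distance from source within its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Maximum eccentricity; adj must describe a single connected component."""
    return max(eccentricity(adj, node) for node in adj)

# Toy component: two triangles joined by one bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
adj = build_adjacency(edges)
q = modularity(adj, [{"a", "b", "c"}, {"d", "e", "f"}])
diam = diameter(adj)
print(f"modularity={q:.4f} diameter={diam}")
```

The all-pairs BFS in `diameter` is quadratic in component size, so for large components a sampled or approximate diameter may be the more practical choice.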

## `component_modularity_notebook.ipynb` — Analysis & Recommendations

- Use modularity and community counts to identify cohesive subgraphs; inspect high-modularity components for domain patterns.
- Diameter identifies component span — large diameters may indicate chain-like structures (low cohesion).
- For noisy graphs, consider filtering low-degree nodes before community detection to emphasize meaningful structure.
- Save `component_modularity_results_with_diameter.csv` for downstream reporting and linking to rule-discovery results.

## Notebook: `find_components_notebook.ipynb` — Overview & Logic

- Purpose: identify connected components and optionally export component membership.
- Method: undirected adjacency, BFS/DFS to collect component node sets, sort by size.
- Analysis: component-size distributions and basic summaries for subsequent per-component analysis.
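
Component discovery over the undirected projection can be sketched with a plain BFS. The function name and toy triples are illustrative assumptions, not the notebook's code.

```python
from collections import defaultdict, deque

def weakly_connected_components(triples):
    """Return node sets of connected components, largest first."""
    # Undirected projection: ignore edge direction and relation type.
    adj = defaultdict(set)
    for subj, _rel, obj in triples:
        adj[subj].add(obj)
        adj[obj].add(subj)

    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy triples for illustration only.
triples = [
    ("a", "r1", "b"),
    ("c", "r2", "b"),
    ("d", "r1", "e"),
]
components = weakly_connected_components(triples)
sizes = [len(c) for c in components]
print(sizes)
```

The resulting size list is the input for the component-size distribution, and each node set can be passed on to the per-component analyses.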

## Notebook: `directed_graph_viz_notebook.ipynb` & `knowledge_graph_explorer.ipynb` — Methods

- `directed_graph_viz_notebook`: visual exploration of directed edges — drawing ego networks, edge-type filters, and visual encodings for relation types.
- `knowledge_graph_explorer`: interactive exploration utilities — lookup by entity, relation-centric views, and small graph extraction for manual inspection.
- Both notebooks emphasize inspection-first workflows to validate data quality before large-scale mining.

## Notebook: `kg_completion_distmult3.ipynb` — Overview

- Goal: train a knowledge graph embedding model (DistMult or similar) for link prediction / completion.
- Method: embed entities and relations, score triples via bilinear form, train with negative sampling and ranking loss.
- Analysis: monitor validation MRR and Hits@K, and use the trained model to suggest candidate inverse or missing edges.
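
For reference, the DistMult score and the two ranking metrics can be written in a few lines. This sketch uses hand-set toy embeddings rather than a trained model, ranks against all entities without filtering known triples, and omits training entirely; every name and value here is a hypothetical stand-in.

```python
def distmult_score(e_s, w_r, e_o):
    """DistMult bilinear-diagonal score: sum_i e_s[i] * w_r[i] * e_o[i]."""
    return sum(s * r * o for s, r, o in zip(e_s, w_r, e_o))

def rank_of_true_object(entity_emb, w_r, subj, true_obj):
    """Raw (unfiltered) rank of the true object among all candidate entities."""
    true_score = distmult_score(entity_emb[subj], w_r, entity_emb[true_obj])
    better = sum(
        distmult_score(entity_emb[subj], w_r, emb) > true_score
        for name, emb in entity_emb.items()
        if name != true_obj
    )
    return 1 + better

def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and Hits@K over a list of 1-based ranks."""
    mrr = sum(1 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# Hand-set toy embeddings (not trained) purely to exercise the functions.
entity_emb = {"ann": [1.0, 0.0], "bob": [0.5, 1.0], "cat": [0.0, 1.0]}
w_rel = [1.0, 0.5]

forward = distmult_score(entity_emb["ann"], w_rel, entity_emb["bob"])
backward = distmult_score(entity_emb["bob"], w_rel, entity_emb["ann"])
rank = rank_of_true_object(entity_emb, w_rel, "ann", "bob")
mrr, hits_at_3 = mrr_and_hits([1, 2, 4], k=3)
print(forward, backward, rank, round(mrr, 3), round(hits_at_3, 3))
```

Note that `forward` and `backward` are equal: the DistMult score is symmetric in subject and object, so the model alone cannot tell the direction of an edge. Candidate inverse edges suggested by it are therefore best cross-checked against the symbolic inverse rules.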

## Cross-Notebook Analysis: Key Insights

- Data sparsity is the main blocker: many logical patterns exist but have low support.
- Rule discovery (symbolic) and embedding-based completion (statistical) are complementary: use high-confidence rules to seed constrained inference and embeddings to suggest missing links.
- Component-level analysis (modularity, diameter) helps prioritize which subgraphs to analyze or to augment data for (e.g., dense clusters vs sparse chains).

## Practical Next Steps

- Run `knowledge_graph_rule_discovery.ipynb` and export high-confidence rules (confidence >= 0.9 and support above the configured minimum) for validation.
- Use `kg_completion_distmult3.ipynb` to produce candidate edges, then re-run rule discovery on the augmented graph to measure recall improvement.
- Focus manual inspection on top components with high modularity and moderate diameter — likely to contain meaningful family structures.

## Delivery

- Notebook: `presentation_slides.ipynb` (this file) — ready for quick export to slides (e.g., nbconvert).