Snow6667/Knowledge-Graph-Study-Note

AI for Science Knowledge Graph Toolkit

[Figures: knowledge graph preview; 3D graph viewer]

Table of Contents

  1. Project Goals
  2. Repository Contents
  3. Prerequisites
  4. End-to-End Workflow
  5. 3D WebGL Viewer
  6. Troubleshooting & Tips
  7. Next Steps

Project Goals

This workspace assembles a full pipeline for turning large AI-for-Science bibliographic exports into an interactive knowledge graph:

  • Merge multiple Web of Science BibTeX dumps into a single canonical file for processing (@merge_ai_for_science_bib.py#1-68).
  • Explore, clean, and summarize bibliographic metadata with pandas, seaborn, and matplotlib (@merge_ai_for_science_dataset.py#1-172; @eda_ai_for_science.py#1-358).
  • Import enriched entities and relationships into Neo4j, building authorship, keyword, institution, venue, and similarity networks (@import_ai_for_science_merged.py#1-480; @import_bib.py#1-170).
  • Run NetworkX analytics directly from Neo4j to surface influential authors, hotspots, and collaboration patterns (@data_analysis.py#1-218; @networkx_global_analysis.py#40-210).
  • Export a curated subgraph to an interactive 3D WebGL viewer powered by 3d-force-graph, with PageRank-driven sizing and neighborhood sampling (@export_webgl_graph.py#1-267; graph_webgl.html).

The resulting assets include static dashboards, CSV summaries, and a standalone HTML experience suitable for demos or offline sharing.
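To make the first goal concrete, merging can be as simple as concatenating the exports with a blank line between files. The following is a minimal sketch under that assumption, not the logic in merge_ai_for_science_bib.py: the helper name `merge_bib_files` is hypothetical, and it neither parses nor deduplicates entries.

```python
from pathlib import Path


def merge_bib_files(sources, target):
    """Concatenate BibTeX files into one file, leaving entry text untouched.

    Hypothetical helper for illustration: no parsing, no deduplication.
    Returns the number of lines that start an entry ('@...').
    """
    chunks = []
    for src in sources:
        # utf-8-sig reads plain UTF-8 and also drops a leading BOM if present.
        text = Path(src).read_text(encoding="utf-8-sig").strip()
        if text:  # skip empty exports
            chunks.append(text)
    merged = "\n\n".join(chunks) + "\n"
    Path(target).write_text(merged, encoding="utf-8")
    return sum(1 for line in merged.splitlines() if line.lstrip().startswith("@"))
```

With the five exports in the working directory, a call such as `merge_bib_files([f"AI-FOR-SCIENCE-DATA{i}.bib" for i in range(1, 6)], "ai_for_science_merged.bib")` would write the merged file and return a rough entry count.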

[Figure: EDA dashboard]


Repository Contents

Path | Purpose
--- | ---
AI-FOR-SCIENCE-DATA*.bib, WOS-TEST-KG-*.bib | Raw Web of Science exports used across the pipeline.
merge_ai_for_science_bib.py | Combines the five AI-for-Science BibTeX files into ai_for_science_merged.bib (@merge_ai_for_science_bib.py#17-67).
merge_ai_for_science_dataset.py | Builds a structured pandas dataset with required bibliographic columns; optional CSV export (@merge_ai_for_science_dataset.py#47-167).
eda_ai_for_science.py | Generates multi-panel EDA plots and textual summaries for merged data (@eda_ai_for_science.py#137-357).
import_ai_for_science_merged.py | Primary Neo4j ingest script with constraint management, deduplication, and relationship construction (@import_ai_for_science_merged.py#13-480).
import_bib.py, import_bib_00.py, import_bib_01.py | Lightweight importers for individual BibTeX subsets (@import_bib.py#1-170).
data_analysis.py | In-database analytics for co-authorship, citation, and keyword networks (@data_analysis.py#29-218).
networkx_global_analysis.py | Loads the full Neo4j graph into NetworkX for PageRank/bridging reports and CSV summaries (@networkx_global_analysis.py#40-210).
export_webgl_graph.py | Samples core nodes plus neighbors, computes PageRank sizing, and writes graph_webgl.html (@export_webgl_graph.py#47-267).
viz_3d.py | Plotly-based 3D preview directly from Neo4j (@viz_3d.py#1-50).
graph_webgl.html, graph_webgl.json | Standalone WebGL visualization (HTML embeds data; JSON stores raw payload).
EDA-R.png, KG-3D-GRAPH.png, final_graph.png, bib_graph-little.png | Generated visuals used throughout this README.
network_metrics_summary.csv | Example output from networkx_global_analysis.py.
.venv/ | Optional local Python virtual environment (not tracked).

Prerequisites

Data

  • Five Web of Science BibTeX exports named AI-FOR-SCIENCE-DATA1.bib through AI-FOR-SCIENCE-DATA5.bib.
  • Optional smaller subsets (WOS-TEST-KG-00.bib, etc.) for rapid iteration.

Python Environment

Create and activate a virtual environment (PowerShell example):

python -m venv .venv
.venv\Scripts\activate
pip install pandas networkx neo4j matplotlib seaborn plotly bibtexparser

Neo4j

  • Local Neo4j instance reachable at neo4j://127.0.0.1:7687 with credentials configured in the scripts (@import_ai_for_science_merged.py#13-18; @data_analysis.py#6-11).
  • APOC is not required; Cypher queries use built-in aggregations.
  • Ensure adequate heap/page cache for large imports.
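Before running any importer, it can help to confirm that something is listening at the configured address. The sketch below uses only the standard library; the helper names `bolt_endpoint` and `is_reachable` are hypothetical and do not come from the repository's scripts. It checks TCP reachability only, so it says nothing about credentials or database state.

```python
import socket
from urllib.parse import urlparse


def bolt_endpoint(uri, default_port=7687):
    """Split a URI such as neo4j://127.0.0.1:7687 into (host, port)."""
    parsed = urlparse(uri)
    return parsed.hostname, parsed.port or default_port


def is_reachable(uri, timeout=2.0):
    """Return True if a TCP connection to the URI's host and port succeeds."""
    host, port = bolt_endpoint(uri)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `is_reachable("neo4j://127.0.0.1:7687")` should return False when the local Neo4j service is stopped, which separates "server not running" from "wrong password" when an import fails.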

End-to-End Workflow

  1. Merge the raw BibTeX files

    python merge_ai_for_science_bib.py

    Produces ai_for_science_merged.bib with entry formatting preserved and lightweight progress logging (@merge_ai_for_science_bib.py#35-63).

  2. (Optional) Assemble a tabular dataset

    python merge_ai_for_science_dataset.py

    Check terminal output for field coverage and set SAVE_DATASET = True to persist a CSV (@merge_ai_for_science_dataset.py#119-167).

  3. Run exploratory data analysis

    python eda_ai_for_science.py

    Generates the multi-panel visualization shown above and prints summary tables for years, authors, keywords, and research areas (@eda_ai_for_science.py#137-357).

  4. Import into Neo4j (full pipeline)

    python import_ai_for_science_merged.py --bib ai_for_science_merged.bib --no-reset

    Key features include duplicate tracking, canonical keyword mapping, institution noise filtering, collaboration edges, keyword co-occurrence, and similarity thresholds (@import_ai_for_science_merged.py#144-475). Use --reset for a clean slate.

  5. Alternative quick imports for smaller test files

    python import_bib.py
    python import_bib_00.py
    python import_bib_01.py

    Each script rebuilds a minimal author/keyword graph suited for smoke tests (@import_bib.py#51-170).

  6. Validate the database connection

    python test_neo4j_conn.py

    Confirms driver initialization and Cypher execution (@test_neo4j_conn.py#1-20).

  7. Run analytical reports

    • Co-authorship, citation, keyword layers (console summaries):
      python data_analysis.py
      (@data_analysis.py#29-218)
    • Global metrics with CSV export:
      python networkx_global_analysis.py --directed --output network_metrics_summary.csv
      (@networkx_global_analysis.py#56-210)
  8. Export a 3D-ready subgraph

    python export_webgl_graph.py --output graph_webgl --core-candidate-limit 400 --core-keep 150

    The script samples hub nodes, computes PageRank-based sizing (safe_scale_values), and writes an HTML viewer with embedded data (@export_webgl_graph.py#130-266).

  9. Explore the Plotly preview (optional)

    python viz_3d.py

    Opens an interactive Plotly figure for up to 1,000 relationships (@viz_3d.py#19-49).

  10. Open the WebGL visualization

    Double-click graph_webgl.html or serve it locally to interact with the 3D force layout.
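Step 4 mentions keyword co-occurrence and similarity thresholds. The sketch below shows one way such edges could be derived in plain Python, assuming papers are given as a mapping from paper ID to keyword list. The function names and the `min_common` parameter are illustrative; the import script itself builds these relationships inside Neo4j and may use a different rule.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(papers):
    """Count how often each unordered keyword pair appears on the same paper."""
    counts = Counter()
    for keywords in papers.values():
        # Sorting gives each pair a single canonical order, e.g. (a, b) not (b, a).
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


def similar_papers(papers, min_common=2):
    """Return (paper_a, paper_b, shared) for pairs sharing >= min_common keywords."""
    result = []
    for (id_a, kw_a), (id_b, kw_b) in combinations(sorted(papers.items()), 2):
        shared = len(set(kw_a) & set(kw_b))
        if shared >= min_common:
            result.append((id_a, id_b, shared))
    return result


papers = {
    "p1": ["deep learning", "protein folding", "graph neural network"],
    "p2": ["deep learning", "graph neural network"],
    "p3": ["protein folding"],
}
pair_counts = keyword_cooccurrence(papers)
similar = similar_papers(papers, min_common=2)
```

The pairwise comparison is quadratic in the number of papers, so raising the shared-keyword threshold mainly reduces the number of edges written, not the number of comparisons made.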


3D WebGL Viewer

[Figure: final WebGL export]

  • Nodes are colored by inferred community (paper, author, keyword, institution, or research area) and sized by PageRank with soft capping to keep the scene legible (@export_webgl_graph.py#140-159; @export_webgl_graph.py#223-245).
  • Clicking a node animates the camera to focus on that entity, making it easy to inspect local neighborhoods (graph_webgl.html).
  • The generated graph_webgl.json file stores the raw payload in case you wish to feed other visualization frameworks.
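To illustrate PageRank-driven sizing, here is a self-contained sketch: a plain power-iteration PageRank followed by a square-root rescale that stops a few hubs from dwarfing everything else. Both functions are hypothetical stand-ins. The export script's own sizing (safe_scale_values) and its capping rule may differ, and the square-root compression is an assumption chosen for the example.

```python
import math


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def soft_capped_sizes(scores, min_size=2.0, max_size=12.0):
    """Map scores to node sizes; the square root compresses very large hubs."""
    if not scores:
        return {}
    roots = {n: math.sqrt(max(s, 0.0)) for n, s in scores.items()}
    lo, hi = min(roots.values()), max(roots.values())
    if hi == lo:  # all scores equal: avoid dividing by zero
        return {n: (min_size + max_size) / 2 for n in roots}
    return {n: min_size + (r - lo) / (hi - lo) * (max_size - min_size)
            for n, r in roots.items()}


edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
ranks = pagerank(edges)
sizes = soft_capped_sizes(ranks)
```

In this toy graph "hub" receives the largest size and the two nodes with no incoming edges receive the smallest, while the square root narrows the gap between mid-ranked and top-ranked nodes.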

Troubleshooting & Tips

  • Large imports: Use the --progress and --min-common-keywords flags on import_ai_for_science_merged.py to tune throughput (@import_ai_for_science_merged.py#438-475).
  • Neo4j memory: Increase page cache for exports involving tens of thousands of relationships.
  • Keyword normalization: Add new synonyms directly in the keyword mapper within the import script to keep communities consistent (@import_ai_for_science_merged.py#220-305).
  • Visualization size: Adjust --core-candidate-limit, --core-keep, and --neighbors-per-core to balance density and load time (@export_webgl_graph.py#269-275).
  • PowerShell path quoting: Wrap bib paths containing spaces with double quotes when invoking scripts on Windows.
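The visualization-size tip is easier to reason about with a picture of what the three flags might control. The sketch below is one plausible reading, not the code in export_webgl_graph.py: shortlist candidates by degree, keep the best-scoring ones as the core, then attach a bounded number of neighbors to each. The function name, the two-stage selection, and the default for `neighbors_per_core` are all assumptions.

```python
from collections import defaultdict


def sample_subgraph(edges, scores, candidate_limit=400, core_keep=150,
                    neighbors_per_core=10):
    """Keep top-scoring core nodes plus a bounded number of neighbors for each.

    Hypothetical interpretation of the export flags, for illustration only.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Stage 1: shortlist the highest-degree nodes (ties broken by name).
    candidates = sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))
    candidates = candidates[:candidate_limit]
    # Stage 2: keep the best-scoring candidates as the core.
    core = sorted(candidates, key=lambda n: (-scores.get(n, 0.0), n))[:core_keep]
    kept = set(core)
    for node in core:
        ranked = sorted(adjacency[node], key=lambda n: (-scores.get(n, 0.0), n))
        kept.update(ranked[:neighbors_per_core])
    # Only edges with both endpoints retained survive.
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


edges = [("hub", f"n{i}") for i in range(5)] + [("n0", "x")]
scores = {"hub": 0.5, "n0": 0.2, "n1": 0.1}
kept, kept_edges = sample_subgraph(
    edges, scores, candidate_limit=2, core_keep=1, neighbors_per_core=2
)
```

Under this reading, the node count is bounded by core_keep × (1 + neighbors_per_core), which makes it a quick way to estimate scene density before exporting.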

Next Steps

  1. Enrich the Neo4j model with citation relationships sourced from Cited-References fields.
  2. Extend the WebGL page with tooltips that surface PageRank, degree, and community metadata.
  3. Publish a Docker Compose stack bundling Neo4j, the ETL scripts, and a static web server for the HTML viewer.
  4. Automate nightly refreshes by wiring the scripts into a scheduled workflow.

Happy graph exploring!
