- Project Goals
- Repository Contents
- Prerequisites
- End-to-End Workflow
- 3D WebGL Viewer
- Troubleshooting & Tips
- Next Steps
## Project Goals

This workspace assembles a full pipeline for turning large AI-for-Science bibliographic exports into an interactive knowledge graph:
- Merge multiple Web of Science BibTeX dumps into a single canonical file for processing (@merge_ai_for_science_bib.py#1-68).
- Explore, clean, and summarize bibliographic metadata with pandas, seaborn, and matplotlib (@merge_ai_for_science_dataset.py#1-172; @eda_ai_for_science.py#1-358).
- Import enriched entities and relationships into Neo4j, building authorship, keyword, institution, venue, and similarity networks (@import_ai_for_science_merged.py#1-480; @import_bib.py#1-170).
- Run NetworkX analytics directly from Neo4j to surface influential authors, hotspots, and collaboration patterns (@data_analysis.py#1-218; @networkx_global_analysis.py#40-210).
- Export a curated subgraph to an interactive 3D WebGL viewer powered by 3d-force-graph, with PageRank-driven sizing and neighborhood sampling (@export_webgl_graph.py#1-267; graph_webgl.html).
The resulting assets include static dashboards, CSV summaries, and a standalone HTML experience suitable for demos or offline sharing.
## Repository Contents

| Path | Purpose |
|---|---|
| `AI-FOR-SCIENCE-DATA*.bib`, `WOS-TEST-KG-*.bib` | Raw Web of Science exports used across the pipeline. |
| `merge_ai_for_science_bib.py` | Combines the five AI-for-Science BibTeX files into `ai_for_science_merged.bib` (@merge_ai_for_science_bib.py#17-67). |
| `merge_ai_for_science_dataset.py` | Builds a structured pandas dataset with the required bibliographic columns; optional CSV export (@merge_ai_for_science_dataset.py#47-167). |
| `eda_ai_for_science.py` | Generates multi-panel EDA plots and textual summaries for the merged data (@eda_ai_for_science.py#137-357). |
| `import_ai_for_science_merged.py` | Primary Neo4j ingest script with constraint management, deduplication, and relationship construction (@import_ai_for_science_merged.py#13-480). |
| `import_bib.py`, `import_bib_00.py`, `import_bib_01.py` | Lightweight importers for individual BibTeX subsets (@import_bib.py#1-170). |
| `data_analysis.py` | In-database analytics for co-authorship, citation, and keyword networks (@data_analysis.py#29-218). |
| `networkx_global_analysis.py` | Loads the full Neo4j graph into NetworkX for PageRank/bridging reports and CSV summaries (@networkx_global_analysis.py#40-210). |
| `export_webgl_graph.py` | Samples core nodes plus neighbors, computes PageRank sizing, and writes `graph_webgl.html` (@export_webgl_graph.py#47-267). |
| `viz_3d.py` | Plotly-based 3D preview rendered directly from Neo4j (@viz_3d.py#1-50). |
| `graph_webgl.html`, `graph_webgl.json` | Standalone WebGL visualization (the HTML embeds the data; the JSON stores the raw payload). |
| `EDA-R.png`, `KG-3D-GRAPH.png`, `final_graph.png`, `bib_graph-little.png` | Generated visuals used throughout this README. |
| `network_metrics_summary.csv` | Example output from `networkx_global_analysis.py`. |
| `.venv/` | Optional local Python virtual environment (not tracked). |
## Prerequisites

- Five Web of Science BibTeX exports named `AI-FOR-SCIENCE-DATA1.bib` … `AI-FOR-SCIENCE-DATA5.bib`.
- Optional smaller subsets (`WOS-TEST-KG-00.bib`, etc.) for rapid iteration.
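Before running the pipeline, it can help to verify that all five exports are actually present. A minimal sketch of such a check; the helper `missing_exports` is not part of the repository, just an illustration:

```python
from pathlib import Path

# Expected Web of Science export filenames (see the list above).
EXPECTED_EXPORTS = [f"AI-FOR-SCIENCE-DATA{i}.bib" for i in range(1, 6)]

def missing_exports(directory="."):
    """Return the expected export files that are not present in *directory*."""
    base = Path(directory)
    return [name for name in EXPECTED_EXPORTS if not (base / name).exists()]
```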
Create and activate a virtual environment (PowerShell example):

```powershell
python -m venv .venv
.venv\Scripts\activate
pip install pandas networkx neo4j matplotlib seaborn plotly bibtexparser
```

- A local Neo4j instance reachable at `neo4j://127.0.0.1:7687`, with credentials configured in the scripts (@import_ai_for_science_merged.py#13-18; @data_analysis.py#6-11).
- APOC is not required; the Cypher queries use built-in aggregations.
- Ensure adequate heap/page cache for large imports.
## End-to-End Workflow

1. **Merge the raw BibTeX files**

   ```powershell
   python merge_ai_for_science_bib.py
   ```

   Produces `ai_for_science_merged.bib` with preserved entry formatting and light logging (@merge_ai_for_science_bib.py#35-63).

2. **(Optional) Assemble a tabular dataset**

   ```powershell
   python merge_ai_for_science_dataset.py
   ```

   Check the terminal output for field coverage and set `SAVE_DATASET = True` to persist a CSV (@merge_ai_for_science_dataset.py#119-167).

3. **Run exploratory data analysis**

   ```powershell
   python eda_ai_for_science.py
   ```

   Generates the multi-panel visualization shown above and prints summary tables for years, authors, keywords, and research areas (@eda_ai_for_science.py#137-357).
4. **Import into Neo4j (full pipeline)**

   ```powershell
   python import_ai_for_science_merged.py --bib ai_for_science_merged.bib --no-reset
   ```

   Key features include duplicate tracking, canonical keyword mapping, institution noise filtering, collaboration edges, keyword co-occurrence, and similarity thresholds (@import_ai_for_science_merged.py#144-475). Use `--reset` for a clean slate.

5. **Alternative quick imports**

   For smaller test files:

   ```powershell
   python import_bib.py
   python import_bib_00.py
   python import_bib_01.py
   ```

   Each script rebuilds a minimal author/keyword graph suited for smoke tests (@import_bib.py#51-170).

6. **Validate the database connection**

   ```powershell
   python test_neo4j_conn.py
   ```

   Confirms driver initialization and Cypher execution (@test_neo4j_conn.py#1-20).
7. **Run analytical reports**

   - Co-authorship, citation, and keyword layers with console summaries (@data_analysis.py#29-218):

     ```powershell
     python data_analysis.py
     ```

   - Global metrics with CSV export (@networkx_global_analysis.py#56-210):

     ```powershell
     python networkx_global_analysis.py --directed --output network_metrics_summary.csv
     ```

8. **Export a 3D-ready subgraph**

   ```powershell
   python export_webgl_graph.py --output graph_webgl --core-candidate-limit 400 --core-keep 150
   ```

   The script samples hub nodes, computes PageRank-based sizing (`safe_scale_values`), and writes an HTML viewer with embedded data (@export_webgl_graph.py#130-266).

9. **Explore the Plotly preview (optional)**

   ```powershell
   python viz_3d.py
   ```

   Opens an interactive Plotly figure for up to 1,000 relationships (@viz_3d.py#19-49).

10. **Open the WebGL visualization**

    Double-click `graph_webgl.html` or serve it locally to interact with the 3D force layout (graph_webgl.html).
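Conceptually, the merge in step 1 boils down to concatenating the entry text of each export. A simplified sketch of that idea; the real merge_ai_for_science_bib.py adds logging and formatting preservation, and `merge_bib_files` here is purely illustrative:

```python
from pathlib import Path

def merge_bib_files(paths, output_path):
    """Concatenate BibTeX files into one, separating them with blank lines."""
    chunks = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    merged = "\n\n".join(chunk for chunk in chunks if chunk) + "\n"
    Path(output_path).write_text(merged, encoding="utf-8")
    return merged
```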
## 3D WebGL Viewer

- Nodes are colored by inferred community (paper, author, keyword, institution, or research area) and sized by PageRank with soft capping to keep the scene legible (@export_webgl_graph.py#140-159; @export_webgl_graph.py#223-245).
- Clicking a node animates the camera to focus on that entity, making it easy to inspect local neighborhoods (graph_webgl.html).
- The generated `graph_webgl.json` file stores the raw payload in case you want to feed other visualization frameworks.
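The PageRank-to-size mapping can be pictured as a min-max rescale into a bounded output range, so that a single mega-hub cannot dominate the scene. A rough sketch of the idea; the actual `safe_scale_values` in export_webgl_graph.py may differ in detail:

```python
def scale_node_sizes(pagerank, min_size=2.0, max_size=12.0):
    """Rescale PageRank scores into a bounded node-size range."""
    if not pagerank:
        return {}
    lo, hi = min(pagerank.values()), max(pagerank.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {
        node: min_size + (max_size - min_size) * (score - lo) / span
        for node, score in pagerank.items()
    }
```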
## Troubleshooting & Tips

- **Large imports:** Use the `--progress` and `--min-common-keywords` flags on `import_ai_for_science_merged.py` to tune throughput (@import_ai_for_science_merged.py#438-475).
- **Neo4j memory:** Increase the page cache for exports involving tens of thousands of relationships.
- **Keyword normalization:** Add new synonyms directly to the keyword mapper within the import script to keep communities consistent (@import_ai_for_science_merged.py#220-305).
- **Visualization size:** Adjust `--core-candidate-limit`, `--core-keep`, and `--neighbors-per-core` to balance density and load time (@export_webgl_graph.py#269-275).
- **PowerShell path quoting:** Wrap .bib paths containing spaces in double quotes when invoking scripts on Windows.
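Keyword normalization amounts to a lookup table from raw variants to one canonical form. A minimal sketch of the pattern; the synonym entries below are illustrative, not the mapping shipped in import_ai_for_science_merged.py:

```python
# Illustrative synonym table; extend it with project-specific variants.
KEYWORD_SYNONYMS = {
    "ml": "machine learning",
    "machine-learning": "machine learning",
    "deep neural network": "deep learning",
}

def canonical_keyword(raw):
    """Lower-case, trim, and map a raw keyword to its canonical form."""
    key = raw.strip().lower()
    return KEYWORD_SYNONYMS.get(key, key)
```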
## Next Steps

- Enrich the Neo4j model with citation relationships sourced from `Cited-References` fields.
- Extend the WebGL page with tooltips that surface PageRank, degree, and community metadata.
- Publish a Docker Compose stack bundling Neo4j, the ETL scripts, and a static web server for the HTML viewer.
- Automate nightly refreshes by wiring the scripts into a scheduled workflow.
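As a starting point for the citation enrichment, Web of Science `Cited-References` fields are typically semicolon-separated reference strings. A hedged first-cut parser, assuming that field layout (it may not match every export):

```python
def split_cited_references(field):
    """Split a Cited-References field into individual reference strings."""
    return [ref.strip() for ref in field.replace("\n", " ").split(";") if ref.strip()]
```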
Happy graph exploring!



