- Project Goals
- Repository Contents
- Prerequisites
- End-to-End Workflow
- 3D WebGL Viewer
- Troubleshooting & Tips
- Next Steps
## Project Goals

This workspace assembles a full pipeline for turning large AI-for-Science bibliographic exports into an interactive knowledge graph:
- Merge multiple Web of Science BibTeX dumps into a single canonical file for processing (@merge_ai_for_science_bib.py#1-68).
- Explore, clean, and summarize bibliographic metadata with pandas, seaborn, and matplotlib (@merge_ai_for_science_dataset.py#1-172; @eda_ai_for_science.py#1-358).
- Import enriched entities and relationships into Neo4j, building authorship, keyword, institution, venue, and similarity networks (@import_ai_for_science_merged.py#1-480; @import_bib.py#1-170).
- Run NetworkX analytics directly from Neo4j to surface influential authors, hotspots, and collaboration patterns (@data_analysis.py#1-218; @networkx_global_analysis.py#40-210).
- Export a curated subgraph to an interactive 3D WebGL viewer powered by 3d-force-graph, with PageRank-driven sizing and neighborhood sampling (@export_webgl_graph.py#1-267; graph_webgl.html).
The resulting assets include static dashboards, CSV summaries, and a standalone HTML experience suitable for demos or offline sharing.
## Repository Contents

| Path | Purpose |
|---|---|
| `AI-FOR-SCIENCE-DATA*.bib`, `WOS-TEST-KG-*.bib` | Raw Web of Science exports used across the pipeline. |
| `merge_ai_for_science_bib.py` | Combines the five AI-for-Science BibTeX files into `ai_for_science_merged.bib` (@merge_ai_for_science_bib.py#17-67). |
| `merge_ai_for_science_dataset.py` | Builds a structured pandas dataset with the required bibliographic columns; optional CSV export (@merge_ai_for_science_dataset.py#47-167). |
| `eda_ai_for_science.py` | Generates multi-panel EDA plots and textual summaries for the merged data (@eda_ai_for_science.py#137-357). |
| `import_ai_for_science_merged.py` | Primary Neo4j ingest script with constraint management, deduplication, and relationship construction (@import_ai_for_science_merged.py#13-480). |
| `import_bib.py`, `import_bib_00.py`, `import_bib_01.py` | Lightweight importers for individual BibTeX subsets (@import_bib.py#1-170). |
| `data_analysis.py` | In-database analytics for co-authorship, citation, and keyword networks (@data_analysis.py#29-218). |
| `networkx_global_analysis.py` | Loads the full Neo4j graph into NetworkX for PageRank/bridging reports and CSV summaries (@networkx_global_analysis.py#40-210). |
| `export_webgl_graph.py` | Samples core nodes plus neighbors, computes PageRank sizing, and writes `graph_webgl.html` (@export_webgl_graph.py#47-267). |
| `viz_3d.py` | Plotly-based 3D preview rendered directly from Neo4j (@viz_3d.py#1-50). |
| `graph_webgl.html`, `graph_webgl.json` | Standalone WebGL visualization (the HTML embeds the data; the JSON stores the raw payload). |
| `EDA-R.png`, `KG-3D-GRAPH.png`, `final_graph.png`, `bib_graph-little.png` | Generated visuals used throughout this README. |
| `network_metrics_summary.csv` | Example output from `networkx_global_analysis.py`. |
| `.venv/` | Optional local Python virtual environment (not tracked). |
## Prerequisites

- Five Web of Science BibTeX exports named `AI-FOR-SCIENCE-DATA1.bib` … `AI-FOR-SCIENCE-DATA5.bib`.
- Optional smaller subsets (`WOS-TEST-KG-00.bib`, etc.) for rapid iteration.
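Before running the pipeline, it can help to verify that all five exports are actually present. A minimal sketch of such a check; the helper `missing_exports` is not part of the repository, just an illustration:

```python
from pathlib import Path

# Expected Web of Science export filenames (see the list above).
EXPECTED_EXPORTS = [f"AI-FOR-SCIENCE-DATA{i}.bib" for i in range(1, 6)]

def missing_exports(directory="."):
    """Return the expected export files that are not present in *directory*."""
    base = Path(directory)
    return [name for name in EXPECTED_EXPORTS if not (base / name).exists()]
```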
Create and activate a virtual environment (PowerShell example):

```powershell
python -m venv .venv
.venv\Scripts\activate
pip install pandas networkx neo4j matplotlib seaborn plotly bibtexparser
```

- A local Neo4j instance reachable at `neo4j://127.0.0.1:7687`, with credentials configured in the scripts (@import_ai_for_science_merged.py#13-18; @data_analysis.py#6-11).
- APOC is not required; the Cypher queries use built-in aggregations.
- Ensure adequate heap/page cache for large imports.
## End-to-End Workflow

1. **Merge the raw BibTeX files**

   ```powershell
   python merge_ai_for_science_bib.py
   ```

   Produces `ai_for_science_merged.bib` with preserved entry formatting and light logging (@merge_ai_for_science_bib.py#35-63).

2. **(Optional) Assemble a tabular dataset**

   ```powershell
   python merge_ai_for_science_dataset.py
   ```

   Check the terminal output for field coverage and set `SAVE_DATASET = True` to persist a CSV (@merge_ai_for_science_dataset.py#119-167).

3. **Run exploratory data analysis**

   ```powershell
   python eda_ai_for_science.py
   ```

   Generates the multi-panel visualization shown above and prints summary tables for years, authors, keywords, and research areas (@eda_ai_for_science.py#137-357).
4. **Import into Neo4j (full pipeline)**

   ```powershell
   python import_ai_for_science_merged.py --bib ai_for_science_merged.bib --no-reset
   ```

   Key features include duplicate tracking, canonical keyword mapping, institution noise filtering, collaboration edges, keyword co-occurrence, and similarity thresholds (@import_ai_for_science_merged.py#144-475). Use `--reset` for a clean slate.

5. **Alternative quick imports**

   For smaller test files:

   ```powershell
   python import_bib.py
   python import_bib_00.py
   python import_bib_01.py
   ```

   Each script rebuilds a minimal author/keyword graph suited for smoke tests (@import_bib.py#51-170).

6. **Validate the database connection**

   ```powershell
   python test_neo4j_conn.py
   ```

   Confirms driver initialization and Cypher execution (@test_neo4j_conn.py#1-20).
7. **Run analytical reports**

   - Co-authorship, citation, and keyword layers with console summaries (@data_analysis.py#29-218):

     ```powershell
     python data_analysis.py
     ```

   - Global metrics with CSV export (@networkx_global_analysis.py#56-210):

     ```powershell
     python networkx_global_analysis.py --directed --output network_metrics_summary.csv
     ```

8. **Export a 3D-ready subgraph**

   ```powershell
   python export_webgl_graph.py --output graph_webgl --core-candidate-limit 400 --core-keep 150
   ```

   The script samples hub nodes, computes PageRank-based sizing (`safe_scale_values`), and writes an HTML viewer with embedded data (@export_webgl_graph.py#130-266).

9. **Explore the Plotly preview (optional)**

   ```powershell
   python viz_3d.py
   ```

   Opens an interactive Plotly figure for up to 1,000 relationships (@viz_3d.py#19-49).

10. **Open the WebGL visualization**

    Double-click `graph_webgl.html` or serve it locally to interact with the 3D force layout (graph_webgl.html).
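Conceptually, the merge in step 1 boils down to concatenating the entry text of each export. A simplified sketch of that idea; the real merge_ai_for_science_bib.py adds logging and formatting preservation, and `merge_bib_files` here is purely illustrative:

```python
from pathlib import Path

def merge_bib_files(paths, output_path):
    """Concatenate BibTeX files into one, separating them with blank lines."""
    chunks = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    merged = "\n\n".join(chunk for chunk in chunks if chunk) + "\n"
    Path(output_path).write_text(merged, encoding="utf-8")
    return merged
```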
## 3D WebGL Viewer

- Nodes are colored by inferred community (paper, author, keyword, institution, or research area) and sized by PageRank with soft capping to keep the scene legible (@export_webgl_graph.py#140-159; @export_webgl_graph.py#223-245).
- Clicking a node animates the camera to focus on that entity, making it easy to inspect local neighborhoods (graph_webgl.html).
- The generated `graph_webgl.json` file stores the raw payload in case you want to feed other visualization frameworks.
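The PageRank-to-size mapping can be pictured as a min-max rescale into a bounded output range, so that a single mega-hub cannot dominate the scene. A rough sketch of the idea; the actual `safe_scale_values` in export_webgl_graph.py may differ in detail:

```python
def scale_node_sizes(pagerank, min_size=2.0, max_size=12.0):
    """Rescale PageRank scores into a bounded node-size range."""
    if not pagerank:
        return {}
    lo, hi = min(pagerank.values()), max(pagerank.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {
        node: min_size + (max_size - min_size) * (score - lo) / span
        for node, score in pagerank.items()
    }
```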
## Troubleshooting & Tips

- **Large imports:** Use the `--progress` and `--min-common-keywords` flags on `import_ai_for_science_merged.py` to tune throughput (@import_ai_for_science_merged.py#438-475).
- **Neo4j memory:** Increase the page cache for exports involving tens of thousands of relationships.
- **Keyword normalization:** Add new synonyms directly to the keyword mapper within the import script to keep communities consistent (@import_ai_for_science_merged.py#220-305).
- **Visualization size:** Adjust `--core-candidate-limit`, `--core-keep`, and `--neighbors-per-core` to balance density and load time (@export_webgl_graph.py#269-275).
- **PowerShell path quoting:** Wrap .bib paths containing spaces in double quotes when invoking scripts on Windows.
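Keyword normalization amounts to a lookup table from raw variants to one canonical form. A minimal sketch of the pattern; the synonym entries below are illustrative, not the mapping shipped in import_ai_for_science_merged.py:

```python
# Illustrative synonym table; extend it with project-specific variants.
KEYWORD_SYNONYMS = {
    "ml": "machine learning",
    "machine-learning": "machine learning",
    "deep neural network": "deep learning",
}

def canonical_keyword(raw):
    """Lower-case, trim, and map a raw keyword to its canonical form."""
    key = raw.strip().lower()
    return KEYWORD_SYNONYMS.get(key, key)
```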
## Next Steps

- Enrich the Neo4j model with citation relationships sourced from `Cited-References` fields.
- Extend the WebGL page with tooltips that surface PageRank, degree, and community metadata.
- Publish a Docker Compose stack bundling Neo4j, the ETL scripts, and a static web server for the HTML viewer.
- Automate nightly refreshes by wiring the scripts into a scheduled workflow.
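As a starting point for the citation enrichment, Web of Science `Cited-References` fields are typically semicolon-separated reference strings. A hedged first-cut parser, assuming that field layout (it may not match every export):

```python
def split_cited_references(field):
    """Split a Cited-References field into individual reference strings."""
    return [ref.strip() for ref in field.replace("\n", " ").split(";") if ref.strip()]
```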
Happy graph exploring!



