# 🕸️ Tutorial 4: Building Knowledge Graphs

**Learn to build knowledge graphs from research papers using AI.**

## What You'll Learn:
- Extract entities from papers using AI
- Build knowledge graphs automatically
- Visualize research connections
- Query graphs for insights

**Time:** 15 minutes | **Level:** Beginner

## Step 1: Setup

In [2]:
# Import what we need
import sys
import os

# Add parent directory to path
if os.path.basename(os.getcwd()) == 'tutorial':
    sys.path.insert(0, '..')
else:
    sys.path.insert(0, '.')

# Enable widgets for visualization
try:
    import google.colab
    from google.colab import output
    output.enable_custom_widget_manager()
    print("📱 Google Colab widget support enabled")
except ImportError:
    # Not running in Google Colab, so no extra widget setup is needed
    pass

# Import the GraphRAG system
from src.langchain_graph_rag import LangChainGraphRAG

print("✅ Setup complete!")

✅ Setup complete!


## Step 2: Create GraphRAG System

In [3]:
# Create knowledge graph system
graph_rag = LangChainGraphRAG(
    llm_model="llama3.1:8b",
    embedding_model="nomic-embed-text"
)

print("🕸️ Knowledge graph system ready!")
print(f"📊 Current papers: {len(graph_rag.get_all_papers())}")

INFO:src.enhanced_knowledge_graph:🕸️ Enhanced Knowledge Graph initialized
INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:src.langchain_graph_rag:🕸️ LangChain GraphRAG initialized


🕸️ Knowledge graph system ready!
📊 Current papers: 2


## Step 3: Add Your First Paper

In [4]:
# Create substantial sample text with rich entities and connections
print("📄 Creating rich sample text for knowledge graph demonstration...")
print("⚡ This will be much faster than processing real papers!")

# Rich sample content with overlapping entities and clear relationships
paper_content = """
Transformer-Based Drug Discovery using Graph Neural Networks
Authors: Dr. Sarah Chen (MIT), Prof. Michael Torres (Stanford), Dr. Yuki Tanaka (Tokyo Institute of Technology)

Abstract:
This paper presents ChemFormer, a novel transformer architecture for molecular property prediction in drug discovery. 
Our approach combines graph neural networks with self-attention mechanisms to analyze chemical compounds.

Introduction:
Machine learning has revolutionized drug discovery, with deep learning methods showing particular promise. 
Previous work by Chen et al. (2019) demonstrated the effectiveness of BERT-style transformers on molecular data.
Graph neural networks (GNNs) have emerged as powerful tools for representing molecular structures.

Methods:
We developed ChemFormer using the following components:
1. Molecular graph encoding using Graph Convolutional Networks (GCN)
2. Transformer attention layers adapted for chemical structures  
3. Multi-task learning framework for property prediction

Our model was trained on several key datasets:
- PubChem: 100M molecular structures
- ChEMBL: Bioactivity data for 2M compounds
- QM9: Quantum mechanical properties for 134K molecules
- ZINC: Drug-like compounds for virtual screening

We used PyTorch and RDKit for implementation, with training performed on NVIDIA V100 GPUs.
The model achieves 94% accuracy on molecular solubility prediction tasks.

Experiments:
We evaluated ChemFormer on multiple benchmarks:
- Molecular property prediction (RMSE: 0.23)
- Drug-target interaction prediction (AUC: 0.89)
- Toxicity classification (F1-score: 0.91)

Comparison with existing methods shows significant improvements:
- Outperforms Graph Attention Networks by 12%
- Exceeds classical fingerprint methods by 35% 
- Matches performance of specialized chemical predictors

Results:
ChemFormer demonstrates superior performance across all evaluated tasks.
The attention mechanism successfully identifies key molecular substructures.
Transfer learning enables rapid adaptation to new chemical domains.

Applications in pharmaceutical research include:
- Lead compound optimization
- ADMET property prediction  
- Virtual compound screening
- Drug repurposing analysis

Conclusions:
We have successfully combined transformer architectures with graph neural networks for drug discovery.
The ChemFormer model represents a significant advance in computational chemistry.
Future work will explore larger scale pretraining and multi-modal molecular representations.

Acknowledgments:
This work was supported by NIH grants R01-AI123456 and NSF Award CHE-7890123.
We thank Google Cloud Platform for computational resources and the open-source community.
"""

paper_title = "Transformer-Based Drug Discovery using Graph Neural Networks"

print(f"✅ Sample text created successfully!")
print(f"📰 Title: {paper_title}")
print(f"📊 Content length: {len(paper_content):,} characters")
print(f"🔬 Rich with entities: authors, institutions, methods, datasets, technologies!")

# Add the paper to the knowledge graph (the log output shows that enhanced extraction is used)
result = graph_rag.extract_entities_and_relationships(
    paper_content=paper_content,
    paper_title=paper_title,
    paper_id="paper_1"
)

print(f"\n✅ Sample paper added to knowledge graph!")
print(f"📝 Documents created: {result['documents_added']}")
print(f"🏷️ Entities extracted: {sum(len(entities) for entities in result['entities'].values())}")
print(f"📊 This demonstrates the knowledge graph system with meaningful entities!")

INFO:src.langchain_graph_rag:🔍 Processing paper: Transformer-Based Drug Discovery using Graph Neural Networks
INFO:src.langchain_graph_rag:🚀 Using enhanced entity extraction for richer knowledge graph...
INFO:src.enhanced_knowledge_graph:🔍 Enhanced extraction for: Transformer-Based Drug Discovery using Graph Neural Networks
INFO:src.enhanced_knowledge_graph:📄 Split paper into 1 sections
INFO:src.enhanced_knowledge_graph:📖 Processing section_1 section...


📄 Creating rich sample text for knowledge graph demonstration...
⚡ This will be much faster than processing real papers!
✅ Sample text created successfully!
📰 Title: Transformer-Based Drug Discovery using Graph Neural Networks
📊 Content length: 2,692 characters
🔬 Rich with entities: authors, institutions, methods, datasets, technologies!


INFO:src.enhanced_knowledge_graph:📊 Built graph: 58 nodes, 5 edges
INFO:src.enhanced_knowledge_graph:✅ Enhanced extraction: 74 entities, 5 relationships
INFO:src.langchain_graph_rag:📈 Enhanced extraction: 74 entities from 1 sections
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO:src.langchain_graph_rag:✅ Added paper to graph: 4 documents



✅ Sample paper added to knowledge graph!
📝 Documents created: 4
🏷️ Entities extracted: 74
📊 This demonstrates the knowledge graph system with meaningful entities!
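
The log above reports nodes and edges separately (58 nodes and 5 edges for this paper). As a rough illustration of what those two numbers describe, the sketch below stores a few entities as nodes and typed relationships as edges using plain dictionaries. The triples and the relation names (`AFFILIATED_WITH`, `TRAINED_ON`, `USES`) are hypothetical examples based on the sample text; they are not the actual output or internal format of `LangChainGraphRAG`.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples based on the sample paper.
triples = [
    ("Dr. Sarah Chen", "AFFILIATED_WITH", "MIT"),
    ("Prof. Michael Torres", "AFFILIATED_WITH", "Stanford"),
    ("ChemFormer", "TRAINED_ON", "PubChem"),
    ("ChemFormer", "TRAINED_ON", "ChEMBL"),
    ("ChemFormer", "USES", "Graph Convolutional Networks (GCN)"),
]

def build_graph(triples):
    """Return (nodes, adjacency), where adjacency maps node -> [(relation, target)]."""
    nodes = set()
    adjacency = defaultdict(list)
    for subject, relation, obj in triples:
        nodes.update((subject, obj))          # every endpoint becomes a node
        adjacency[subject].append((relation, obj))  # each triple is one directed edge
    return nodes, dict(adjacency)

nodes, adjacency = build_graph(triples)
print(f"{len(nodes)} nodes, {len(triples)} edges")
for relation, target in adjacency["ChemFormer"]:
    print(f"ChemFormer -[{relation}]-> {target}")
```

A graph with many nodes and few edges, like the one in the log, means most extracted entities are not yet linked to each other by an explicit relationship.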


## Step 4: See What AI Found

In [5]:
# Show extracted entities
entities = result['entities']

print("🤖 AI Found These Entities:")
print("=" * 30)

for category, entity_list in entities.items():
    if entity_list:
        print(f"\n📋 {category.upper()}:")
        for entity in entity_list:
            print(f"   • {entity}")

🤖 AI Found These Entities:

📋 AUTHORS:
   • Prof. Michael Torres
   • Chen et al.
   • Dr. Sarah Chen
   • Dr. Yuki Tanaka
   • BERT authors (not specified)

📋 INSTITUTIONS:
   • NIH
   • Stanford University
   • Tokyo Institute of Technology
   • MIT
   • NSF

📋 METHODS:
   • Graph Convolutional Networks (GCN)
   • Transformer attention layers
   • RDKit
   • Classical fingerprint methods
   • Specialized chemical predictors
   • PyTorch
   • Graph Attention Networks
   • Multi-task learning framework

📋 CONCEPTS:
   • Drug discovery
   • Graph neural networks (GNNs)
   • Virtual screening
   • Self-attention mechanisms
   • Quantum mechanical properties
   • Molecular structures
   • Chemical compounds
   • Molecular property prediction
   • Bioactivity data

📋 TECHNOLOGIES:
   • QM9 dataset
   • PubChem database
   • Open-source community
   • RDKit
   • Transformer architecture
   • ZINC database
   • PyTorch
   • NVIDIA V100 GPUs
   • ChEMBL database
   • Google Cloud Platform

📋 
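
In the output above, `RDKit` and `PyTorch` appear under both METHODS and TECHNOLOGIES. The sketch below shows one way to spot entities that land in more than one category. The helper `find_cross_category_entities` is a hypothetical name introduced here, and the dictionary is a hand-copied subset of the printed output with assumed lowercase keys, so treat it as an illustration rather than part of the library.

```python
from collections import defaultdict

# A subset of the extraction output shown above, keyed by category.
entities = {
    "methods": ["Graph Convolutional Networks (GCN)", "RDKit", "PyTorch"],
    "technologies": ["QM9 dataset", "RDKit", "PyTorch", "NVIDIA V100 GPUs"],
    "institutions": ["NIH", "MIT", "NSF"],
}

def find_cross_category_entities(entities):
    """Map each entity that appears in more than one category to those categories."""
    seen = defaultdict(list)
    for category, names in entities.items():
        for name in names:
            key = name.strip().lower()  # normalize so case differences still match
            if category not in seen[key]:
                seen[key].append(category)
    return {name: cats for name, cats in seen.items() if len(cats) > 1}

duplicates = find_cross_category_entities(entities)
for name, categories in sorted(duplicates.items()):
    print(f"{name}: {', '.join(categories)}")
```

With a real run, passing `result['entities']` instead of the hand-written dictionary should work as long as it maps category names to lists of strings, which is how the cell above iterates over it.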

## Step 5: Add Another Paper

In [6]:
# Create a second related paper with overlapping entities
print("📄 Creating second related paper...")
print("🔗 This will have overlapping authors and methods for connections!")

# Second paper with overlapping entities to demonstrate connections
paper_content_2 = """
Neural Network Approaches for Chemical Property Prediction
Authors: Prof. Michael Torres (Stanford), Dr. Elena Rodriguez (UCSF), Dr. James Wilson (Berkeley)

Abstract:
This study explores various neural network architectures for predicting molecular properties.
We compare traditional feedforward networks with modern graph-based approaches for chemical analysis.

Introduction:
Chemical informatics has benefited greatly from advances in machine learning.
Building on previous transformer work in molecular modeling, we investigate simpler neural architectures.
Our goal is to provide accessible alternatives to complex graph neural networks.

Methods:
We implemented several neural network variants:
1. Multi-layer perceptrons with molecular fingerprints
2. Convolutional neural networks for SMILES sequences
3. Recurrent neural networks for sequential molecular data
4. Comparison with Graph Convolutional Networks

Our training utilized the following datasets:
- PubChem: Molecular structure database
- ChEMBL: Bioactivity measurements
- Tox21: Toxicity prediction challenges
- FreeSolv: Solvation free energy data

Implementation used TensorFlow and scikit-learn, with experiments on NVIDIA RTX 3090 GPUs.
Best models achieved 89% accuracy on drug solubility tasks.

Experiments:
We conducted comprehensive evaluations:
- QSAR modeling (R²: 0.85)
- Bioactivity prediction (precision: 0.87)
- Toxicity assessment (recall: 0.83)

Performance analysis reveals:
- Simpler models often match complex architectures
- Molecular fingerprints remain highly effective
- Transfer learning improves generalization

Results:
Neural networks demonstrate robust performance across chemical prediction tasks.
Feature engineering proves as important as architectural complexity.
Ensemble methods combining different approaches show promise.

Applications span multiple domains:
- Pharmaceutical screening
- Environmental toxicology
- Materials discovery
- Agrochemical development

Conclusions:
We demonstrate that well-designed simple neural networks can compete with sophisticated graph methods.
The choice of molecular representation significantly impacts model performance.
Future research should focus on hybrid approaches combining multiple representation types.

Funding:
Research supported by NIH grant R01-GM789012 and Stanford University internal funds.
Special thanks to the PyTorch community and Molecular AI collaborative network.
"""

paper_title_2 = "Neural Network Approaches for Chemical Property Prediction"

print(f"✅ Second paper created!")
print(f"📰 Title: {paper_title_2}")
print(f"📊 Content length: {len(paper_content_2):,} characters")

# Add second paper to knowledge graph
result_2 = graph_rag.extract_entities_and_relationships(
    paper_content=paper_content_2,
    paper_title=paper_title_2, 
    paper_id="paper_2"
)

print(f"\n✅ Second paper added to knowledge graph!")
print(f"📚 Total papers in graph: {result_2.get('total_papers', 'N/A')}")
print(f"🏷️ New entities extracted: {sum(len(entities) for entities in result_2['entities'].values())}")

# Show total system stats
total_papers = result_2.get('total_papers', 2)
print(f"\n📈 Knowledge Graph now contains:")
print(f"   📄 Papers: {total_papers}")
print(f"   🔗 Shared authors: Prof. Michael Torres appears in both papers!")
print(f"   🗃️ Shared datasets: PubChem, ChEMBL used by both research groups")
print(f"   🧠 Related methods: Both use neural networks for chemical analysis")

INFO:src.langchain_graph_rag:🔍 Processing paper: Neural Network Approaches for Chemical Property Prediction
INFO:src.langchain_graph_rag:🚀 Using enhanced entity extraction for richer knowledge graph...
INFO:src.enhanced_knowledge_graph:🔍 Enhanced extraction for: Neural Network Approaches for Chemical Property Prediction
INFO:src.enhanced_knowledge_graph:📄 Split paper into 1 sections
INFO:src.enhanced_knowledge_graph:📖 Processing section_1 section...


📄 Creating second related paper...
🔗 This will have overlapping authors and methods for connections!
✅ Second paper created!
📰 Title: Neural Network Approaches for Chemical Property Prediction
📊 Content length: 2,433 characters


INFO:src.enhanced_knowledge_graph:📊 Built graph: 64 nodes, 11 edges
INFO:src.enhanced_knowledge_graph:✅ Enhanced extraction: 64 entities, 11 relationships
INFO:src.langchain_graph_rag:📈 Enhanced extraction: 64 entities from 1 sections
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO:src.langchain_graph_rag:✅ Added paper to graph: 3 documents



✅ Second paper added to knowledge graph!
📚 Total papers in graph: 2
🏷️ New entities extracted: 64

📈 Knowledge Graph now contains:
   📄 Papers: 2
   🔗 Shared authors: Prof. Michael Torres appears in both papers!
   🗃️ Shared datasets: PubChem, ChEMBL used by both research groups
   🧠 Related methods: Both use neural networks for chemical analysis
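
The "shared authors" and "shared datasets" lines above are hard-coded in the `print` calls. A small sketch of how the overlap could be computed instead, using set intersection per category, follows. The two dictionaries are hand-written stand-ins for `result['entities']` and `result_2['entities']`, and `shared_entities` is a hypothetical helper, not a method of `LangChainGraphRAG`.

```python
def shared_entities(entities_a, entities_b):
    """Return {category: sorted shared names} for categories present in both inputs."""
    shared = {}
    for category in entities_a.keys() & entities_b.keys():
        common = set(entities_a[category]) & set(entities_b[category])
        if common:
            shared[category] = sorted(common)
    return shared

# Hand-written stand-ins for result['entities'] and result_2['entities'].
paper_1_entities = {
    "authors": ["Dr. Sarah Chen", "Prof. Michael Torres", "Dr. Yuki Tanaka"],
    "datasets": ["PubChem", "ChEMBL", "QM9", "ZINC"],
}
paper_2_entities = {
    "authors": ["Prof. Michael Torres", "Dr. Elena Rodriguez", "Dr. James Wilson"],
    "datasets": ["PubChem", "ChEMBL", "Tox21", "FreeSolv"],
}

overlap = shared_entities(paper_1_entities, paper_2_entities)
for category in sorted(overlap):
    print(f"{category}: {', '.join(overlap[category])}")
```

This matches names exactly, so variants such as "Stanford" and "Stanford University" would not count as shared unless they are normalized first.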


## Step 6: Find Connections

In [7]:
# Find papers connected by shared authors
connections = graph_rag.find_related_papers("paper_1", "authors")

print("🔗 Paper Connections Found:")
print("=" * 30)

if connections['related_papers']:
    for paper_id, info in connections['related_papers'].items():
        print(f"\n📄 {info['paper_title']}")
        print(f"   🔗 Shared authors: {', '.join(info['shared_entities'])}")
else:
    print("No connections found")

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


🔗 Paper Connections Found:

📄 Neural Network Approaches for Chemical Property Prediction
   🔗 Shared authors: Prof. Michael Torres
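
To give a feel for how a lookup like this can work across many papers, the sketch below builds an inverted index from entity to paper IDs and uses it to find papers that share an author. This is an assumed design for illustration only; the source does not show how `find_related_papers` is implemented, and the function names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical author lists per paper ID, taken from the two sample papers.
paper_authors = {
    "paper_1": ["Dr. Sarah Chen", "Prof. Michael Torres", "Dr. Yuki Tanaka"],
    "paper_2": ["Prof. Michael Torres", "Dr. Elena Rodriguez", "Dr. James Wilson"],
}

def build_index(paper_entities):
    """Invert {paper_id: [entity]} into {entity: {paper_id}}."""
    index = defaultdict(set)
    for paper_id, names in paper_entities.items():
        for name in names:
            index[name].add(paper_id)
    return index

def related_papers(paper_id, paper_entities):
    """Return {other_paper_id: [shared entities]} for papers sharing an entity."""
    index = build_index(paper_entities)
    related = defaultdict(list)
    for name in paper_entities[paper_id]:
        for other_id in index[name] - {paper_id}:  # skip the paper itself
            related[other_id].append(name)
    return dict(related)

print(related_papers("paper_1", paper_authors))
```

The index makes each lookup proportional to the number of entities in the starting paper rather than to the size of the whole collection.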


## Step 7: Query the Knowledge Graph

In [8]:
# Ask questions about your research
query = "machine learning and chemistry"

results = graph_rag.query_graph(query)

print(f"🔍 Query: '{query}'")
print("=" * 40)
print(f"📊 Found {results['papers_found']} relevant papers")

for paper_id, paper_data in results['papers'].items():
    print(f"\n📄 {paper_data['paper_title']}")
    print(f"   💬 Relevant sections: {len(paper_data['chunks'])}")
    
    # Show snippet
    if paper_data['chunks']:
        snippet = paper_data['chunks'][0][:100]
        print(f"   📝 {snippet}...")

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"

🔍 Query: 'machine learning and chemistry'
📊 Found 2 relevant papers

📄 Neural Network Approaches for Chemical Property Prediction
   💬 Relevant sections: 3
   📝 Neural Network Approaches for Chemical Property Prediction
Authors: Prof. Michael Torres (Stanford),...

📄 Transformer-Based Drug Discovery using Graph Neural Networks
   💬 Relevant sections: 1
   📝 Applications in pharmaceutical research include:
- Lead compound optimization
- ADMET property predi...
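
The log shows calls to an `/api/embed` endpoint, which suggests that retrieval compares embedding vectors. A minimal sketch of that idea, ranking text chunks by cosine similarity to a query vector, is shown below. The three-dimensional vectors are made up for illustration; a real embedding model produces much longer vectors, and the actual ranking logic inside `query_graph` is not shown in this tutorial.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" standing in for real model output.
chunks = {
    "chunk about neural networks for chemistry": [0.9, 0.8, 0.1],
    "chunk about funding acknowledgments": [0.1, 0.2, 0.9],
    "chunk about molecular datasets": [0.6, 0.9, 0.2],
}
query_vector = [1.0, 0.9, 0.0]

# Sort chunks from most to least similar to the query.
ranked = sorted(
    chunks.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
for text, vector in ranked:
    print(f"{cosine_similarity(query_vector, vector):.3f}  {text}")
```

Chunks whose vectors point in nearly the same direction as the query score close to 1 and are listed first.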


## Step 8: Visualize Your Knowledge Graph

In [9]:
# Visualize Your Knowledge Graph
from src.notebook_visualization import show_knowledge_graph

print("🎨 Creating knowledge graph visualization...")

# Display the interactive graph; a separate name keeps the Step 3 `result` intact
graph_displayed = show_knowledge_graph(graph_rag)

if graph_displayed:
    print("\n💡 If you see a graph above, you can:")
    print("   • Drag nodes to rearrange")
    print("   • Use sidebar to explore properties") 
    print("   • Search for specific entities")
    print("   • Zoom and pan to navigate")
else:
    print("\n⚠️ Visualization had issues - but your knowledge graph is working!")
    
    # Show what we built anyway
    summary = graph_rag.get_graph_summary()
    print(f"\n📊 Your Knowledge Graph:")
    print(f"   📄 Papers: {summary['total_papers']}")
    print(f"   📝 Documents: {summary['total_documents']}")
    print(f"   🏷️ Entities: {sum(summary['unique_entities'].values())}")
    print("\n🎉 Knowledge graph built successfully!")

🎨 Creating knowledge graph visualization...


GraphWidget(layout=Layout(height='800px', width='100%'))

✅ Interactive yFiles graph displayed above!
🎯 Use the sidebar to explore nodes and relationships
🔍 Try the search function to find specific entities

💡 If you see a graph above, you can:
   • Drag nodes to rearrange
   • Use sidebar to explore properties
   • Search for specific entities
   • Zoom and pan to navigate


## Step 9: Explore Your Graph

In [10]:
# Get overview of your knowledge graph
summary = graph_rag.get_graph_summary()

print("📊 Knowledge Graph Summary:")
print("=" * 30)
print(f"📄 Papers: {summary['total_papers']}")
print(f"📝 Document chunks: {summary['total_documents']}")

print("\n🏷️ Unique Entities:")
for entity_type, count in summary['unique_entities'].items():
    if count > 0:
        print(f"   • {entity_type.title()}: {count}")

print("\n🎉 You built a knowledge graph!")

📊 Knowledge Graph Summary:
📄 Papers: 2
📝 Document chunks: 18

🏷️ Unique Entities:
   • Authors: 7
   • Institutions: 9
   • Methods: 14
   • Concepts: 19
   • Technologies: 16
   • Datasets: 12

🎉 You built a knowledge graph!
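
The summary is labeled "Unique Entities", which suggests that an entity found in several papers is counted once. The sketch below shows how such per-type counts could be derived from per-paper entity dictionaries by merging them into sets. The input dictionaries are shortened, hand-written stand-ins, and `count_unique_entities` is a hypothetical helper rather than the code behind `get_graph_summary`.

```python
def count_unique_entities(*papers):
    """Count distinct entity names per category across any number of papers."""
    unique = {}
    for paper in papers:
        for category, names in paper.items():
            # A set keeps one copy of each name, however many papers mention it.
            unique.setdefault(category, set()).update(names)
    return {category: len(names) for category, names in unique.items()}

# Hand-written stand-ins for the per-paper 'entities' dictionaries.
paper_1_entities = {
    "authors": ["Dr. Sarah Chen", "Prof. Michael Torres"],
    "datasets": ["PubChem", "ChEMBL", "QM9"],
}
paper_2_entities = {
    "authors": ["Prof. Michael Torres", "Dr. Elena Rodriguez"],
    "datasets": ["PubChem", "Tox21"],
}

counts = count_unique_entities(paper_1_entities, paper_2_entities)
for entity_type, count in counts.items():
    print(f"{entity_type.title()}: {count}")
```

Because shared names collapse into one entry, the unique totals can be smaller than the sum of the per-paper extraction counts.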


## Try It Yourself!

In [11]:
# Add your own paper content here!
your_paper = """
Replace this with content from your research area:
- Title and authors
- Abstract or summary
- Key methods and findings
- Datasets used
"""

# Uncomment and modify to add your paper:
# result = graph_rag.extract_entities_and_relationships(
#     paper_content=your_paper,
#     paper_title="Your Paper Title",
#     paper_id="your_paper"
# )
# print("✅ Your paper added to the knowledge graph!")

print("💡 Add your own research content above to expand the graph!")

💡 Add your own research content above to expand the graph!


## 🎓 What You Learned

**Congratulations!** You built an AI-powered knowledge graph that:

✅ **Extracts entities** from research papers automatically  
✅ **Finds connections** between different papers  
✅ **Answers questions** about your research  
✅ **Visualizes relationships** in an interactive graph  

### 🚀 Next Steps:
- Add real papers from your research area
- Try different query types
- Explore the interactive visualization
- Scale up to larger paper collections

### 🔗 Real-world Applications:
- **Literature reviews** - Find research gaps
- **Collaboration mapping** - Discover research networks  
- **Citation analysis** - Track research influence
- **Knowledge discovery** - Uncover hidden connections