# 🚀 Complete Ollama + Knowledge Graph System

**All-in-one notebook: Ollama setup + Knowledge graph processing**

This notebook:
- Installs and starts Ollama in Colab
- Downloads required models (llama3.1:8b, nomic-embed-text)
- Processes one research paper into a knowledge graph
- Creates embeddings and a Chroma vector store
- Visualizes the graph interactively and saves the results

**Requirements:** Enable GPU runtime (Runtime → Change runtime type → GPU)

## ⚙️ Configuration: Real vs Sample Data

In [None]:
# Configuration: Choose your data source
# Set USE_SAMPLE_DATA = True to test with fake data (fast, no PDF needed)
# Set USE_SAMPLE_DATA = False to process real PDF papers (requires PDF upload)

USE_SAMPLE_DATA = True  # Change to False for real PDF processing

if USE_SAMPLE_DATA:
    print("🎭 DEMO MODE: Using sample data")
    print("   ⚡ Fast testing without PDF upload")
    print("   🧪 Built-in paper text and a pre-written analysis")
    print("   🚀 Perfect for testing the knowledge graph system")
    print("   📋 Still uses Ollama for processing and embeddings")
    print("")
    print("💡 To process real PDFs:")
    print("   1. Set USE_SAMPLE_DATA = False")
    print("   2. Wait for Ollama setup (10-15 minutes)")
    print("   3. Upload your own PDF file")
else:
    print("📄 REAL DATA MODE: Processing actual PDFs")
    print("   📋 Full Ollama setup required")
    print("   🧠 Uses LLM for entity extraction")
    print("   ⏱️ Takes 15-20 minutes total (setup + processing)")
    print("")
    print("💡 For quick testing:")
    print("   1. Set USE_SAMPLE_DATA = True")
    print("   2. Still gets full Ollama + LLM experience")

## Step 1: Environment Setup

In [None]:
# Check if we're in Google Colab and GPU status
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("✅ Running in Google Colab")
    
    # Check GPU
    import torch
    if torch.cuda.is_available():
        print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
        print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("⚠️ No GPU detected!")
        print("   Go to Runtime → Change runtime type → Hardware accelerator → GPU")
        if not USE_SAMPLE_DATA:
            print("   GPU is REQUIRED for real data processing!")
else:
    print("🏠 Running locally")

## Step 2: Install Dependencies

In [None]:
if IN_COLAB:
    print("📦 Installing core dependencies...")
    !pip install -q langchain langchain-ollama langchain-chroma
    !pip install -q "chromadb>=0.4.0"
    !pip install -q networkx
    !pip install -q yfiles_jupyter_graphs
    
    if not USE_SAMPLE_DATA:
        print("📦 Installing PDF processing dependencies...")
        !pip install -q pdfplumber
    
    print("✅ Dependencies installed!")
else:
    print("🏠 Using local environment")

## Step 3: Install Ollama

In [None]:
if IN_COLAB:
    print("🚀 Installing Ollama in Colab...")
    print("⏱️ This takes about 2-3 minutes...")
    
    # Download and install Ollama
    !curl -fsSL https://ollama.ai/install.sh | sh
    
    print("✅ Ollama installed!")
    
else:
    print("🏠 Assuming local Ollama is running")

## Step 4: Start Ollama Server

In [None]:
if IN_COLAB:
    import subprocess
    import time
    
    print("🚀 Starting Ollama server...")
    
    # Start `ollama serve` as a background process, discarding its output
    ollama_process = subprocess.Popen(
        ["ollama", "serve"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    
    # Wait for server to start
    print("⏳ Waiting for server to start...")
    time.sleep(10)
    
    # Test if server is running
    try:
        result = !curl -s http://localhost:11434/api/version
        if result:
            print("✅ Ollama server is running!")
            print(f"   Version info: {result[0] if result else 'N/A'}")
        else:
            print("❌ Server not responding")
    except Exception:
        print("❌ Failed to check server status")
        
else:
    print("🏠 Assuming local Ollama server is running")
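
The fixed 10-second wait above can be too short on a slow runtime or needlessly long on a fast one. A minimal sketch of polling the same `/api/version` endpoint until it answers is shown below. The names `server_is_up` and `wait_for_server`, and the default timeout and interval, are illustrative assumptions and not part of the notebook.

```python
import http.client
import time
import urllib.error
import urllib.request


def server_is_up(url: str = "http://localhost:11434/api/version") -> bool:
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_for_server(probe=server_is_up, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after launching the server returns as soon as the endpoint responds, and returns `False` if it never does within the timeout.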

## Step 5: Download Models

In [None]:
if IN_COLAB:
    print("📥 Downloading models (this takes 5-10 minutes)...")
    print("☕ Perfect time for a coffee break!")
    print("")
    
    # Download LLM model
    print("🧠 Downloading llama3.1:8b (main LLM)...")
    !ollama pull llama3.1:8b
    
    print("")
    print("🔤 Downloading nomic-embed-text (embeddings)...")
    !ollama pull nomic-embed-text
    
    print("")
    print("✅ All models downloaded and ready!")
    
else:
    print("🏠 Check local models with: ollama list")
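
To confirm that both pulls succeeded before moving on, the server's model list can be compared against the two required names. The sketch below assumes the Ollama REST endpoint `GET /api/tags`, which returns a JSON object with a `models` array whose entries carry a `name` field. The helper names are hypothetical, and a name without a tag is treated as `:latest`.

```python
import json
import urllib.request
from typing import Dict, Iterable, List


def fetch_tags(base_url: str = "http://localhost:11434") -> Dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


def installed_model_names(tags_payload: Dict) -> List[str]:
    """Extract model names from an /api/tags style payload."""
    return [model.get("name", "") for model in tags_payload.get("models", [])]


def missing_models(required: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return the required models that are absent from `installed`."""
    def normalize(name: str) -> str:
        # Assumption: a bare name such as "nomic-embed-text" means the ":latest" tag.
        return name if ":" in name else f"{name}:latest"

    available = {normalize(name) for name in installed}
    return [name for name in required if normalize(name) not in available]
```

With the server running, `missing_models(["llama3.1:8b", "nomic-embed-text"], installed_model_names(fetch_tags()))` should return an empty list once both downloads have completed.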

## Step 6: Test Ollama Connection

In [None]:
# Test basic LLM functionality
try:
    from langchain_ollama import ChatOllama
    
    print("🧪 Testing LLM connection...")
    
    # Create LLM instance
    llm = ChatOllama(
        model="llama3.1:8b",
        temperature=0.1
    )
    
    # Simple test
    response = llm.invoke("Say 'Hello from Colab!' and nothing else.")
    print(f"✅ LLM Response: {response.content}")
    
    # Test embeddings
    from langchain_ollama import OllamaEmbeddings
    
    print("🔤 Testing embeddings...")
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    
    test_embedding = embeddings.embed_query("This is a test.")
    print(f"✅ Embedding created: {len(test_embedding)} dimensions")
    
    print("")
    print("🎉 SUCCESS! Ollama is working perfectly in Colab!")
    print("🚀 Ready to process research papers!")
    
except Exception as e:
    print(f"❌ Test failed: {e}")
    print("💡 You may need to restart runtime and try again")

## Step 7: Load Paper Data

In [None]:
import os

if USE_SAMPLE_DATA:
    print("🎭 Loading sample paper data...")
    
    # Use built-in sample data (no download needed)
    SAMPLE_PAPER_DATA = {
        "title": "Machine Learning for Drug Discovery: A Comprehensive Review",
        "content": """Machine Learning for Drug Discovery: A Comprehensive Review

Authors: Dr. Sarah Chen (MIT), Prof. Michael Torres (Stanford), Dr. Lisa Wang (UC Berkeley)

Abstract:
This comprehensive review examines the application of machine learning techniques to drug discovery processes. 
We analyze various computational approaches including deep learning, graph neural networks, and transformer 
architectures for molecular property prediction and drug-target interaction modeling.

Methods:
We conducted a systematic review of machine learning applications in drug discovery, focusing on:

1. Molecular Property Prediction
- Graph Convolutional Networks (GCNs) for molecular representation
- Transformer models adapted for SMILES sequences
- Recurrent Neural Networks for sequential molecular data

2. Drug-Target Interaction Prediction
- Matrix factorization techniques
- Deep neural networks with protein sequence embeddings
- Graph-based approaches combining molecular and protein structures

Technologies and Tools:
- Deep Learning: TensorFlow, PyTorch, Keras
- Cheminformatics: RDKit, OpenEye, ChemAxon
- Graph Processing: DGL, PyTorch Geometric, NetworkX

Conclusions:
Machine learning has fundamentally transformed drug discovery by enabling more efficient exploration of chemical 
and biological space. Future success will depend on continued collaboration between computational scientists, 
medicinal chemists, and clinical researchers.""",
        "pages": 12,
        "char_count": 1234  # Placeholder value; the real length is printed below via len(text_content)
    }
    
    # Use sample data for natural discovery
    paper_path = "sample_data"  # Placeholder
    paper_title = SAMPLE_PAPER_DATA["title"]
    text_content = SAMPLE_PAPER_DATA["content"]
    
    print(f"✅ Sample data loaded!")
    print(f"📰 Title: {paper_title}")
    print(f"📊 Content length: {len(text_content):,} characters")
    print(f"📄 Simulated pages: {SAMPLE_PAPER_DATA['pages']}")
    print(f"🌿 Ready for natural knowledge discovery")
    
elif IN_COLAB:
    print("📤 Choose how to load your PDF:")
    print("   1️⃣ Upload file using file picker")
    print("   2️⃣ Use file already in Colab storage")
    print("")
    
    # Check for existing PDFs in current directory
    existing_pdfs = [f for f in os.listdir('.') if f.endswith('.pdf')]
    
    if existing_pdfs:
        print(f"📁 Found {len(existing_pdfs)} PDF(s) in current directory:")
        for i, pdf in enumerate(existing_pdfs, 1):
            file_size = os.path.getsize(pdf) / (1024*1024)  # MB
            print(f"   {i}. {pdf} ({file_size:.1f} MB)")
        print("")
        
        choice = input("Type filename to use existing PDF, or press Enter to upload new file: ").strip()
        
        if choice and choice in existing_pdfs:
            paper_path = choice
            print(f"✅ Using existing file: {paper_path}")
        else:
            print("📤 Upload a new PDF file...")
            from google.colab import files
            uploaded = files.upload()
            
            # Get the first PDF
            paper_path = None
            for filename in uploaded.keys():
                if filename.endswith('.pdf'):
                    paper_path = filename
                    break
    else:
        print("📁 No existing PDFs found in current directory")
        print("📤 Upload a PDF file...")
        from google.colab import files
        uploaded = files.upload()
        
        # Get the first PDF
        paper_path = None
        for filename in uploaded.keys():
            if filename.endswith('.pdf'):
                paper_path = filename
                break
    
    if paper_path:
        file_size = os.path.getsize(paper_path) / (1024*1024)  # MB
        print(f"✅ Paper selected: {paper_path} ({file_size:.1f} MB)")
        
        # Show file details
        print(f"📁 File location: /content/{paper_path}")
        print(f"📊 File size: {file_size:.1f} MB")
    else:
        print("❌ No PDF file found! Please upload a PDF.")
        
else:
    # Use local example
    paper_path = '../../examples/d4sc03921a.pdf'
    if os.path.exists(paper_path):
        print(f"✅ Using local paper: {paper_path}")
    else:
        print(f"❌ Local paper not found: {paper_path}")
        paper_path = None
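
The cell above picks the first uploaded PDF in two separate places, and the `.endswith('.pdf')` check is case-sensitive, so a file named `Paper.PDF` would be skipped. A small sketch of that selection as one reusable helper follows. The name `first_pdf` is hypothetical, and matching the extension case-insensitively is an assumption about the desired behavior.

```python
from typing import Iterable, Optional


def first_pdf(filenames: Iterable[str]) -> Optional[str]:
    """Return the first filename ending in .pdf (any letter case), or None."""
    for name in filenames:
        if name.lower().endswith(".pdf"):
            return name
    return None
```

In the upload branches this could be used as `paper_path = first_pdf(uploaded.keys())`.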

## Step 8: Extract Text from PDF

In [None]:
if USE_SAMPLE_DATA:
    print("🎭 Using sample text content (already loaded)")
    print(f"✅ Text content ready!")
    print(f"📰 Title: {paper_title}")
    print(f"📊 Content length: {len(text_content):,} characters")
    print(f"📄 Sample paper simulates {SAMPLE_PAPER_DATA['pages']} pages")
    
elif paper_path:
    import pdfplumber
    
    print(f"📄 Extracting text from: {paper_path}")
    
    try:
        # Extract text
        with pdfplumber.open(paper_path) as pdf:
            text_content = ""
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text_content += page_text + "\n\n"
        
        # Get paper title (first substantial line)
        lines = text_content.split('\n')
        paper_title = "Unknown Title"
        for line in lines:
            if len(line.strip()) > 20 and not line.strip().isdigit():
                paper_title = line.strip()[:100]
                break
        
        print(f"✅ Text extracted successfully!")
        print(f"📰 Title: {paper_title}")
        print(f"📊 Content length: {len(text_content):,} characters")
        print(f"📄 Pages processed: {len(pdf.pages)}")
        
    except Exception as e:
        print(f"❌ Failed to extract text: {e}")
        text_content = None
        paper_title = None
        
else:
    print("❌ No paper to process")
    text_content = None
    paper_title = None
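
The title heuristic in the cell above (first line longer than 20 characters that is not purely digits, truncated to 100 characters) can be isolated so it is easy to try on different inputs. This is a sketch of the same rule; the function name `guess_title` and its parameters are illustrative only.

```python
def guess_title(text: str, min_length: int = 20, max_length: int = 100) -> str:
    """Return the first substantial line of `text`, or 'Unknown Title'."""
    for line in text.split("\n"):
        candidate = line.strip()
        # Skip short lines (page numbers, short headers) and digit-only lines.
        if len(candidate) > min_length and not candidate.isdigit():
            return candidate[:max_length]
    return "Unknown Title"
```

Because the rule only looks at line length, a long journal banner or copyright line at the top of the first page may be returned instead of the real title.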

## Step 9: Natural Paper Analysis

In [None]:
if USE_SAMPLE_DATA:
    print("🎭 Using sample paper content for natural analysis")
    print(f"✅ Sample paper loaded!")
    
    # Use the complete sample paper content directly for analysis
    paper_content = text_content
    paper_title_final = paper_title
    
    # For demo mode, create a sample natural analysis
    complete_analysis = """This paper provides a comprehensive review of machine learning applications in drug discovery. The research examines how computational approaches, particularly deep learning and graph neural networks, are transforming pharmaceutical research.

The paper covers two main areas: molecular property prediction using Graph Convolutional Networks, transformer models adapted for SMILES sequences, and recurrent neural networks; and drug-target interaction prediction through matrix factorization, deep neural networks with protein sequence embeddings, and graph-based approaches that combine molecular and protein structures.

Key themes include graph-based approaches for molecular representation and transformer architectures for SMILES sequences. The work highlights important technologies including TensorFlow, PyTorch, and Keras for deep learning; RDKit, OpenEye, and ChemAxon for cheminformatics; and DGL, PyTorch Geometric, and NetworkX for graph processing.

The research concludes that machine learning has fundamentally transformed drug discovery by enabling more efficient exploration of chemical and biological space, and that future success will depend on continued collaboration between computational scientists, medicinal chemists, and clinical researchers."""
    
    print(f"📊 Content length: {len(text_content):,} characters")
    print(f"📝 Analysis length: {len(complete_analysis):,} characters")
    print(f"📄 Ready for natural knowledge graph creation")
    
elif text_content:
    from langchain_ollama import ChatOllama
    from langchain_core.prompts import ChatPromptTemplate
    
    print("🧠 Analyzing complete paper with LLM...")
    print("⏱️ This analyzes the ENTIRE paper content without predefined categories...")
    
    # Create LLM
    llm = ChatOllama(
        model="llama3.1:8b",
        temperature=0.1
    )
    
    # Simple, open-ended analysis prompt
    prompt_text = '''You are an expert research analyst. Read this COMPLETE research paper and provide a comprehensive, natural analysis.

COMPLETE PAPER CONTENT:
{content}

Analyze this paper thoroughly and naturally. Don't force it into categories - just understand it completely and tell me:

1. What is this paper about?
2. What are the main ideas, findings, and contributions?
3. What methods, approaches, or techniques are used?
4. What's important or interesting about this work?
5. What are the key concepts, technologies, or data mentioned?

Provide a thorough, natural analysis - not a structured format. Just understand the paper completely and explain it comprehensively.'''
    
    prompt = ChatPromptTemplate.from_template(prompt_text)
    
    try:
        # Process paper in chunks if too long, then synthesize
        max_chars = 25000  # Conservative limit for analysis
        
        if len(text_content) > max_chars:
            print(f"📄 Paper is long ({len(text_content):,} chars), analyzing in sections...")
            
            # Split into fixed-size character windows (boundaries may fall mid-sentence)
            sections = []
            chunk_size = max_chars
            
            for i in range(0, len(text_content), chunk_size):
                section = text_content[i:i+chunk_size]
                sections.append(section)
            
            print(f"🔄 Analyzing {len(sections)} sections...")
            
            section_analyses = []
            for i, section in enumerate(sections, 1):
                print(f"   Analyzing section {i}/{len(sections)}...")
                
                chain = prompt | llm
                result = chain.invoke({
                    "content": section
                })
                
                section_analyses.append(result.content)
            
            # Now synthesize all sections into final analysis
            if section_analyses:
                print("🔄 Synthesizing complete paper understanding...")
                
                synthesis_prompt = '''You have analyzed different sections of a research paper. Now synthesize these section analyses into one comprehensive understanding of the complete paper.

SECTION ANALYSES:
{sections}

Provide a complete, unified analysis of the entire paper. What is this research really about? What are the key insights across the whole work?'''
                
                synthesis_chain = ChatPromptTemplate.from_template(synthesis_prompt) | llm
                synthesis_result = synthesis_chain.invoke({
                    "sections": "\n\n---SECTION---\n\n".join(section_analyses)
                })
                
                complete_analysis = synthesis_result.content
            else:
                print("❌ No section analyses completed")
                complete_analysis = None
                
        else:
            print(f"📄 Analyzing complete paper ({len(text_content):,} chars)...")
            
            # Process entire paper at once
            chain = prompt | llm
            result = chain.invoke({
                "content": text_content
            })
            
            complete_analysis = result.content
        
        if complete_analysis:
            print("✅ Complete paper analysis finished!")
            print(f"\n📊 PAPER ANALYSIS:")
            print(f"📄 Title: {paper_title}")
            print(f"📝 Analysis length: {len(complete_analysis):,} characters")
            print(f"🔍 Analysis preview: {complete_analysis[:200]}...")
            
            # Store the results
            paper_content = text_content
            paper_title_final = paper_title
            
        else:
            print("❌ Paper analysis failed")
            complete_analysis = None
            
    except Exception as e:
        print(f"❌ Paper analysis failed: {e}")
        complete_analysis = None
        
else:
    print("❌ No text content to analyze")
    complete_analysis = None
    paper_content = None
    paper_title_final = None
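
For long papers, the cell above slices the text every `max_chars` characters, which can cut a sentence in half. One possible alternative, sketched here under the assumption that paragraphs are separated by blank lines (as in the text produced by the extraction step), is to pack whole paragraphs greedily into each section. The function name `split_on_paragraphs` is hypothetical.

```python
from typing import List


def split_on_paragraphs(text: str, max_chars: int) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into sections of at most max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    sections: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # A paragraph longer than max_chars is hard-cut into fixed-size pieces.
        pieces = [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                sections.append(current)
                current = piece
    if current:
        sections.append(current)
    return sections
```

`split_on_paragraphs(text_content, 25000)` would yield sections that can be fed to the same per-section analysis loop.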

## Step 10: Create Embeddings and Vector Store

In [None]:
if paper_content and complete_analysis:
    from langchain_ollama import OllamaEmbeddings
    from langchain_chroma import Chroma
    from langchain_core.documents import Document
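    try:
        from langchain_text_splitters import RecursiveCharacterTextSplitter
    except ImportError:
        # Older LangChain releases expose the splitter under this path
        from langchain.text_splitter import RecursiveCharacterTextSplitter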
    
    print("🔤 Creating embeddings and vector store from complete paper...")
    print("⏱️ This takes 2-3 minutes...")
    
    # Create embeddings model
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    
    # Split text into chunks for embeddings
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    
    chunks = text_splitter.split_text(paper_content)
    print(f"📄 Created {len(chunks)} text chunks from complete paper")
    
    # Create documents with metadata from analysis
    documents = []
    for i, chunk in enumerate(chunks):
        metadata = {
            'paper_title': paper_title_final,
            'chunk_id': f"chunk_{i}",
            'chunk_index': i,
            'total_chunks': len(chunks),
            'analysis_preview': complete_analysis[:500] if complete_analysis else '',
            'has_analysis': bool(complete_analysis)
        }
        
        doc = Document(page_content=chunk, metadata=metadata)
        documents.append(doc)
    
    # Also add the complete analysis as a document
    if complete_analysis:
        analysis_doc = Document(
            page_content=complete_analysis,
            metadata={
                'paper_title': paper_title_final,
                'chunk_id': 'complete_analysis',
                'chunk_index': -1,
                'total_chunks': len(chunks),
                'is_analysis': True
            }
        )
        documents.append(analysis_doc)
    
    # Create vector store
    persist_directory = "/tmp/chroma_paper_complete"
    
    print("🗄️ Creating vector store with ChromaDB...")
    vector_store = Chroma(
        embedding_function=embeddings,
        persist_directory=persist_directory
    )
    
    # Add documents to vector store
    document_ids = vector_store.add_documents(documents)
    
    print(f"✅ Vector store created from complete paper!")
    print(f"   📝 {len(documents)} documents added (including analysis)")
    print(f"   🔤 Embeddings created with nomic-embed-text")
    print(f"   🗄️ Stored in ChromaDB at {persist_directory}")
    
    # Test semantic search on complete paper
    print("\n🔍 Testing semantic search on complete paper...")
    query = "What are the main findings and contributions?"
    results = vector_store.similarity_search(query, k=3)
    
    print(f"Query: '{query}'")
    print(f"Found {len(results)} relevant chunks:")
    for i, result in enumerate(results, 1):
        is_analysis = result.metadata.get('is_analysis', False)
        content_type = "LLM Analysis" if is_analysis else "Paper Content"
        print(f"  {i}. [{content_type}] {result.page_content[:100]}...")
    
else:
    print("❌ No paper content to process - skipping vector store creation")
    vector_store = None
    documents = []
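
To make the `chunk_size=1000` and `chunk_overlap=200` settings concrete, the sketch below produces fixed-size windows in which each chunk repeats the last 200 characters of the previous one. This is a simplified illustration only: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks and does not work this way internally. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut `text` into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.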

## Step 11: Create Natural Knowledge Graph

In [None]:
if paper_content and complete_analysis:
    from langchain_ollama import ChatOllama
    from langchain_core.prompts import ChatPromptTemplate
    import networkx as nx
    import json
    
    print("🕸️ Creating natural knowledge graph from paper content...")
    print("⏱️ Let the LLM discover natural relationships...")
    
    # Use LLM to discover natural connections in the content
    llm = ChatOllama(
        model="llama3.1:8b",
        temperature=0.1
    )
    
    graph_prompt = '''You are analyzing this research paper to discover natural relationships and connections.

PAPER CONTENT:
{content}

LLM ANALYSIS:
{analysis}

Look at this content naturally and identify:
1. Key concepts, ideas, and topics that emerge from the paper
2. Natural relationships and connections between these concepts
3. Important terms, methods, findings that relate to each other

Return a JSON with nodes and edges that represent the natural structure you see:

{{
  "nodes": [
    {{"id": "concept_name", "label": "Natural concept from paper", "importance": "high/medium/low"}},
    ...
  ],
  "edges": [
    {{"source": "concept1", "target": "concept2", "relationship": "natural relationship you observe"}},
    ...
  ]
}}

Discover what's naturally connected in this research - don't force categories. Let the content reveal its own structure.

JSON:'''
    
    try:
        print("🔍 Discovering natural connections in the paper...")
        
        # Let LLM discover natural graph structure
        prompt = ChatPromptTemplate.from_template(graph_prompt)
        chain = prompt | llm
        result = chain.invoke({
            "content": paper_content[:15000],  # First part of content
            "analysis": complete_analysis[:5000] if complete_analysis else ""
        })
        
        # Extract JSON from response
        response_text = result.content
        json_start = response_text.find('{')
        json_end = response_text.rfind('}') + 1
        
        if json_start != -1 and json_end > json_start:
            json_str = response_text[json_start:json_end]
            graph_data = json.loads(json_str)
            
            # Create NetworkX graph from discovered structure
            G = nx.Graph()
            
            # Add nodes with natural attributes
            nodes_added = set()
            for node in graph_data.get('nodes', []):
                node_id = node.get('id', '')
                if node_id and node_id not in nodes_added:
                    G.add_node(
                        node_id,
                        label=node.get('label', node_id),
                        importance=node.get('importance', 'medium'),
                        type='natural_concept'
                    )
                    nodes_added.add(node_id)
            
            # Add edges with natural relationships
            for edge in graph_data.get('edges', []):
                source = edge.get('source', '')
                target = edge.get('target', '')
                relationship = edge.get('relationship', 'related_to')
                
                if source in nodes_added and target in nodes_added:
                    G.add_edge(source, target, relationship=relationship)
            
            print(f"✅ Natural knowledge graph discovered!")
            print(f"   🔗 Nodes: {G.number_of_nodes()}")
            print(f"   📊 Edges: {G.number_of_edges()}")
            print(f"   🌿 Structure emerged naturally from content")
            
            # Show discovered concepts
            print(f"\n🌿 Naturally discovered concepts:")
            for node in G.nodes():
                importance = G.nodes[node].get('importance', 'medium')
                label = G.nodes[node].get('label', node)
                print(f"   • {node}: {label} ({importance} importance)")
            
        else:
            print("❌ Could not parse natural graph structure")
            print("🔄 Creating simple content-based graph...")
            
            # Fallback: simple content representation
            G = nx.Graph()
            G.add_node(paper_title_final or "Research Paper", type='paper')
            G.add_node("Paper Content", type='content')
            G.add_node("LLM Analysis", type='analysis')
            G.add_edge(paper_title_final or "Research Paper", "Paper Content", relationship='contains')
            G.add_edge("Paper Content", "LLM Analysis", relationship='analyzed_to_produce')
    
    except Exception as e:
        print(f"❌ Natural graph discovery failed: {e}")
        print("🔄 Creating simple representation...")
        
        # Simple fallback
        G = nx.Graph()
        G.add_node(paper_title_final or "Research Paper", type='paper')
        G.add_node("Paper Content", type='content')
        if complete_analysis:
            G.add_node("LLM Analysis", type='analysis')
            G.add_edge(paper_title_final or "Research Paper", "Paper Content", relationship='contains')
            G.add_edge("Paper Content", "LLM Analysis", relationship='produces')
    
    # Store for visualization
    knowledge_graph = {
        'graph': G,
        'paper_content': paper_content,
        'complete_analysis': complete_analysis,
        'paper_title': paper_title_final,
        'stats': {
            'nodes': G.number_of_nodes(),
            'edges': G.number_of_edges(),
            'discovery_method': 'natural_llm_discovery'
        }
    }
    
else:
    print("❌ No paper content to build graph from")
    knowledge_graph = None
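
Slicing from the first `{` to the last `}` works when the model returns only JSON, but it breaks if the reply contains extra braces after the object. A more tolerant sketch uses the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given position and ignores whatever follows it. The function name `extract_json_object` is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(response_text: str) -> Optional[Dict[str, Any]]:
    """Return the first decodable JSON object found in `response_text`, or None."""
    decoder = json.JSONDecoder()
    position = response_text.find("{")
    while position != -1:
        try:
            value, _ = decoder.raw_decode(response_text, position)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace; try the next opening brace.
        position = response_text.find("{", position + 1)
    return None
```

In the graph cell this could stand in for the manual slicing, for example `graph_data = extract_json_object(result.content)`, with `None` triggering the simple fallback graph.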

## Step 12: Interactive Visualization

In [None]:
if knowledge_graph and knowledge_graph['graph'].number_of_nodes() > 0:
    print("📊 Creating interactive yFiles visualization of natural knowledge graph...")
    
    try:
        from yfiles_jupyter_graphs import GraphWidget
        import networkx as nx
        
        G = knowledge_graph['graph']
        
        print(f"🎮 Building interactive graph with {G.number_of_nodes()} naturally discovered nodes...")
        
        # Create yFiles widget
        widget = GraphWidget(graph=G)
        
        # Configure node styling based on natural attributes
        def configure_node_style(node):
            node_data = G.nodes[node]
            importance = node_data.get('importance', 'medium')
            node_type = node_data.get('type', 'natural_concept')
            
            # Natural color scheme based on importance and type
            if node_type == 'paper':
                color = '#1f4e79'  # Deep blue for main paper
                size = 50
            elif node_type == 'analysis':
                color = '#7b68ee'  # Medium slate blue for analysis
                size = 40
            elif importance == 'high':
                color = '#e74c3c'  # Red for high importance
                size = 45
            elif importance == 'medium':
                color = '#3498db'  # Blue for medium importance
                size = 35
            else:  # low importance
                color = '#95a5a6'  # Gray for low importance
                size = 25
            
            label = node_data.get('label', node)
            return {
                'color': color,
                'size': size,
                'label': label[:40] + "..." if len(label) > 40 else label
            }
        
        # Apply node styling
        widget.set_node_styles_mapping(configure_node_style)
        
        # Configure edge styling
        def configure_edge_style(edge):
            return {
                'color': '#bdc3c7',
                'thickness': 2,
                'style': 'solid'
            }
        
        widget.set_edge_styles_mapping(configure_edge_style)
        
        # Use organic layout for natural structure
        widget.set_layout('organic')
        
        # Enable overview and navigation
        widget.overview_enabled = True
        widget.context_start_with = 'clean-slate'
        
        print("✅ Interactive natural knowledge graph created!")
        print("🎮 Controls:")
        print("   • Drag nodes to rearrange")
        print("   • Zoom with mouse wheel")
        print("   • Click nodes to highlight connections")
        print("   • Use overview panel for navigation")
        print("")
        
        # Show the widget
        display(widget)
        
        # Show natural relationships discovered
        print("🌿 Natural relationships discovered:")
        for edge in G.edges(data=True):
            source, target, data = edge
            relationship = data.get('relationship', 'connected to')
            print(f"   • {source} {relationship} {target}")
        
    except ImportError:
        print("❌ yfiles_jupyter_graphs not available")
        print("💡 Install with: pip install yfiles_jupyter_graphs")
        
    except Exception as e:
        print(f"❌ Error creating yFiles visualization: {e}")
        print("📊 Falling back to summary")
    
    # Print comprehensive summary
    print(f"\n📊 NATURAL KNOWLEDGE GRAPH SUMMARY:")
    print(f"   📄 Paper: {knowledge_graph.get('paper_title', 'Unknown')}")
    print(f"   🌿 Discovery method: {knowledge_graph['stats'].get('discovery_method', 'natural')}")
    print(f"   🔗 Naturally discovered nodes: {knowledge_graph['stats']['nodes']}")
    print(f"   📊 Natural relationships: {knowledge_graph['stats']['edges']}")
    print(f"   🔤 Vector store documents: {len(documents) if 'documents' in locals() else 0}")
    print(f"   🗄️ Vector store: {'✅ Created' if 'vector_store' in locals() and vector_store else '❌ Not created'}")
    
    # Show analysis preview
    analysis = knowledge_graph.get('complete_analysis', '')
    if analysis:
        print(f"\n📝 LLM ANALYSIS PREVIEW:")
        print(f"   {analysis[:300]}...")
    
else:
    print("❌ No natural knowledge graph to visualize")

## Step 13: Save Analysis and Knowledge Graph

In [None]:
# 💾 Save Natural Analysis and Knowledge Graph

if complete_analysis and knowledge_graph:
    import json
    import pickle
    from datetime import datetime
    
    print("💾 Saving natural analysis and knowledge graph...")
    
    # Create timestamp for unique filenames
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    paper_name = (paper_title_final or 'unknown_paper')[:30].replace(" ", "_").replace("/", "_")
    base_filename = f"{paper_name}_{timestamp}"
    
    # 1. Save complete natural analysis as text file
    analysis_file = f"{base_filename}_analysis.txt"
    with open(analysis_file, 'w', encoding='utf-8') as f:
        f.write(f"# Natural Analysis of: {paper_title_final}\n\n")
        f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
        f.write(complete_analysis)
    print(f"✅ Natural analysis saved: {analysis_file}")
    
    # 2. Save graph as GraphML (standard format, works with many tools)
    graph_file = f"{base_filename}_graph.graphml"
    import networkx as nx
    nx.write_graphml(knowledge_graph['graph'], graph_file)
    print(f"✅ Graph saved: {graph_file}")
    
    # 3. Save complete knowledge graph as pickle (Python objects)
    kg_file = f"{base_filename}_knowledge_graph.pkl"
    with open(kg_file, 'wb') as f:
        pickle.dump(knowledge_graph, f)
    print(f"✅ Complete KG saved: {kg_file}")
    
    # 4. Save paper metadata and processing info
    metadata_file = f"{base_filename}_metadata.json"
    metadata = {
        "title": paper_title_final or 'Unknown',
        "timestamp": timestamp,
        "content_length": len(paper_content) if paper_content else 0,
        "analysis_length": len(complete_analysis),
        "graph_nodes": knowledge_graph['stats']['nodes'],
        "graph_edges": knowledge_graph['stats']['edges'],
        "discovery_method": knowledge_graph['stats'].get('discovery_method', 'natural'),
        "file_path": paper_path if 'paper_path' in locals() and paper_path != "sample_data" else "sample_data",
        "mode": "sample_data" if USE_SAMPLE_DATA else "real_pdf"
    }
    
    if not USE_SAMPLE_DATA and 'text_content' in locals() and text_content:
        # Save text content for real papers
        text_file = f"{base_filename}_content.txt"
        with open(text_file, 'w', encoding='utf-8') as f:
            f.write(text_content)
        metadata["content_file"] = text_file
        print(f"✅ Text content saved: {text_file}")
    
    with open(metadata_file, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, indent=2)
    print(f"✅ Metadata saved: {metadata_file}")
    
    # 5. Create a comprehensive report
    report_file = f"{base_filename}_report.md"
    # The report embeds the LLM analysis, which may contain non-ASCII text
    with open(report_file, 'w', encoding='utf-8') as f:
        f.write(f"# Natural Knowledge Graph Report\n\n")
        f.write(f"**Paper:** {paper_title_final or 'Unknown'}\n")
        f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"**Mode:** {'Sample Data' if USE_SAMPLE_DATA else 'Real PDF'}\n\n")
        
        f.write(f"## Natural Analysis\n\n")
        f.write(f"{complete_analysis}\n\n")
        
        f.write(f"## Knowledge Graph Statistics\n\n")
        f.write(f"- **Content Length:** {len(paper_content) if paper_content else 0:,} characters\n")
        f.write(f"- **Analysis Length:** {len(complete_analysis):,} characters\n")
        f.write(f"- **Graph Nodes:** {knowledge_graph['stats']['nodes']}\n")
        f.write(f"- **Graph Edges:** {knowledge_graph['stats']['edges']}\n")
        f.write(f"- **Discovery Method:** {knowledge_graph['stats'].get('discovery_method', 'natural')}\n\n")
        
        # Show discovered concepts if available
        G = knowledge_graph['graph']
        if G.number_of_nodes() > 0:
            f.write(f"## Naturally Discovered Concepts\n\n")
            for node in G.nodes():
                importance = G.nodes[node].get('importance', 'medium')
                label = G.nodes[node].get('label', node)
                f.write(f"- **{node}**: {label} ({importance} importance)\n")
            f.write(f"\n")
            
            f.write(f"## Natural Relationships\n\n")
            for source, target, data in G.edges(data=True):
                relationship = data.get('relationship', 'connected to')
                f.write(f"- {source} **{relationship}** {target}\n")
            f.write(f"\n")
        
        f.write(f"## Files Generated\n\n")
        f.write(f"- `{analysis_file}` - Natural analysis in text format\n")
        f.write(f"- `{graph_file}` - Knowledge graph in GraphML format\n")
        f.write(f"- `{kg_file}` - Complete knowledge graph (Python pickle)\n")
        f.write(f"- `{metadata_file}` - Processing metadata\n")
        if not USE_SAMPLE_DATA and 'text_content' in locals() and text_content:
            f.write(f"- `{text_file}` - Extracted text content\n")
        f.write(f"- `{report_file}` - This comprehensive report\n")
    
    print(f"✅ Comprehensive report saved: {report_file}")
    
    print(f"\n📊 SAVED FILES SUMMARY:")
    import os
    # Report the real working directory (/content/ in Colab, but different elsewhere)
    print(f"📁 All files saved to: {os.getcwd()}")
    print(f"🏷️ Base filename: {base_filename}")
    print(f"📄 Files created:")
    print(f"   • {analysis_file} (Natural analysis)")
    print(f"   • {graph_file} (GraphML graph)")
    print(f"   • {kg_file} (Python pickle)")
    print(f"   • {metadata_file} (metadata)")
    if not USE_SAMPLE_DATA and 'text_content' in locals() and text_content:
        print(f"   • {text_file} (text content)")
    print(f"   • {report_file} (comprehensive report)")
    
    # 6. Download files option (Colab only)
    if IN_COLAB:
        print(f"\n📥 DOWNLOAD FILES:")
        print(f"Right-click files in the file panel to download")
        print(f"Or run this code to download all at once:")
        print(f"```python")
        print(f"from google.colab import files")
        print(f"files.download('{analysis_file}')")
        print(f"files.download('{graph_file}')")
        print(f"files.download('{kg_file}')")
        print(f"files.download('{metadata_file}')")
        if not USE_SAMPLE_DATA and 'text_content' in locals() and text_content:
            print(f"files.download('{text_file}')")
        print(f"files.download('{report_file}')")
        print(f"```")
    
    # 7. How to reload the analysis
    print(f"\n🔄 TO RELOAD THIS ANALYSIS LATER:")
    print(f"```python")
    print(f"import pickle")
    print(f"")
    print(f"# Load natural analysis")
    print(f"with open('{analysis_file}', 'r', encoding='utf-8') as f:")
    print(f"    complete_analysis = f.read()")
    print(f"")
    print(f"# Load complete knowledge graph")
    print(f"with open('{kg_file}', 'rb') as f:")
    print(f"    knowledge_graph = pickle.load(f)")
    print(f"")
    print(f"# Load graph separately (if needed)")
    print(f"import networkx as nx")
    print(f"graph = nx.read_graphml('{graph_file}')")
    print(f"```")
    
else:
    print("❌ No natural analysis to save")

## 🎉 Complete Success!

If you see results above, you have successfully created a **natural knowledge graph system** with Ollama running in Colab!

### ✅ What You Accomplished:

**Infrastructure:**
- ✅ **Installed Ollama** in Google Colab environment
- ✅ **Downloaded models** (llama3.1:8b + nomic-embed-text)
- ✅ **Started server** successfully in background

**Natural Knowledge Graph System:**
- ✅ **Processed research paper** (PDF text extraction, or the built-in sample data in demo mode)
- ✅ **Natural analysis** using local Ollama LLM without forced categories
- ✅ **Created embeddings** with nomic-embed-text model
- ✅ **Built vector store** with ChromaDB for semantic search
- ✅ **Discovered knowledge graph** with natural concepts and relationships
- ✅ **Interactive visualization** with yFiles organic layout
- ✅ **Saved complete results** in multiple formats

### 🔍 Technical Stack Validated:

- **Local LLM Processing**: Ollama running on a Colab GPU runtime (for example, a T4)
- **Natural Analysis**: Open-ended paper understanding without predefined categories
- **Vector Embeddings**: Semantic search over paper content
- **Knowledge Graph**: Naturally discovered concepts and relationships
- **Vector Store**: ChromaDB with persistent storage
- **Hybrid System**: Vector similarity combined with natural graph structure
- **Interactive Visualization**: yFiles with organic layout and importance-based sizing
- **Complete Save System**: Text analysis, GraphML, pickle, JSON metadata, and Markdown report
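
The hybrid idea (vector similarity plus graph structure) can be illustrated with a minimal, dependency-free sketch. It is not the notebook's implementation: `hybrid_search`, the 2-d toy vectors, and the `neighbors` adjacency dict are hypothetical stand-ins for the nomic-embed-text embeddings and the graph in `knowledge_graph['graph']`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(query_vec, node_vecs, neighbors, top_k=1):
    """Rank nodes by vector similarity, then append direct graph neighbors of the hits."""
    ranked = sorted(node_vecs, key=lambda n: cosine(query_vec, node_vecs[n]), reverse=True)
    hits = ranked[:top_k]
    expanded = list(hits)
    for node in hits:
        for nb in sorted(neighbors.get(node, ())):
            if nb not in expanded:
                expanded.append(nb)
    return hits, expanded

# Toy data (assumption): real vectors would come from the embedding model
node_vecs = {
    "transformer": [1.0, 0.0],
    "attention": [0.2, 0.98],
    "dataset": [0.0, 1.0],
}
# Toy adjacency (assumption): real edges would come from the knowledge graph
neighbors = {"transformer": {"attention"}, "attention": {"transformer"}, "dataset": set()}

hits, expanded = hybrid_search([1.0, 0.05], node_vecs, neighbors, top_k=1)
print("Vector hits:", hits)
print("After graph expansion:", expanded)
```

Here "attention" is far from the query in vector space but is still returned because it is linked to the top hit, which is the kind of result a pure similarity search would miss.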

### 🚀 Next Steps:
- Process multiple papers for cross-paper natural connections
- Build corpus with naturally emerging themes
- Scale to literature collections with organic knowledge discovery
- Use for automated literature synthesis and review generation
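
As a starting point for the cross-paper step, here is a minimal sketch of merging relationships from several papers and finding the concepts they share. It assumes each paper's graph has been reduced to `(source, relationship, target)` triples, the same shape the report writes out; `merge_papers`, `normalize`, and the sample triples are hypothetical. Matching on normalized names is a crude assumption, and real cross-paper linking would likely need embedding similarity as well.

```python
from collections import defaultdict

def normalize(name):
    """Lowercase and collapse whitespace so 'Self-Attention' matches 'self-attention'."""
    return " ".join(name.lower().split())

def merge_papers(paper_edges):
    """Merge {paper: [(source, relationship, target), ...]} into one edge set.

    Returns the merged edges and the sorted list of concepts seen in more than one paper.
    """
    merged_edges = set()
    concept_papers = defaultdict(set)
    for paper, edges in paper_edges.items():
        for source, relationship, target in edges:
            s, t = normalize(source), normalize(target)
            merged_edges.add((s, relationship, t))
            concept_papers[s].add(paper)
            concept_papers[t].add(paper)
    shared = sorted(c for c, papers in concept_papers.items() if len(papers) > 1)
    return merged_edges, shared

# Hypothetical triples from two papers
paper_edges = {
    "paper_a": [("Transformer", "uses", "Self-Attention")],
    "paper_b": [("self-attention", "improves", "Translation Quality")],
}
merged_edges, shared = merge_papers(paper_edges)
print("Shared concepts:", shared)
print("Merged edge count:", len(merged_edges))
```

If the per-paper graphs are kept as networkx objects with matching node IDs, `nx.compose` can combine two of them directly instead.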

**You've run natural knowledge discovery end to end, from paper text to saved graph!** 🎯

Each paper creates its own unique knowledge structure based on what the LLM naturally discovers!