# Ollama Setup and Testing

This notebook will help you set up and test Ollama with LangChain connectors before starting the main RAG assignment.

## Prerequisites

1. **Install Ollama** from https://ollama.ai
   - On Linux: `curl -fsSL https://ollama.com/install.sh | sh`
   - On macOS and Windows: Download and run the installer

2. **Verify Installation**
   - Run `ollama -v` in your terminal
   - The reported version should be 0.11.10 or later

3. **Pull Required Models**
   ```bash
   # For chat/inference
   ollama pull gpt-oss:20b
   
   # For embeddings
   ollama pull embeddinggemma:latest
   ```
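
The version requirement in step 2 can also be checked programmatically. The following is a minimal sketch, not part of the original setup: `parse_version` and `meets_minimum` are hypothetical helper names, and the sample strings only approximate what `ollama -v` prints. Comparing versions as integer tuples avoids the string-comparison trap where `"0.9.6"` sorts after `"0.11.10"`.

```python
import re

MIN_VERSION = (0, 11, 10)  # minimum version stated in the prerequisites


def parse_version(text):
    """Return the first 'X.Y.Z' triple found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"No version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_VERSION):
    """True if the version found in text is at least the minimum."""
    return parse_version(text) >= minimum


# Illustrative strings; the exact wording of `ollama -v` output may differ.
print(meets_minimum("ollama version is 0.11.10"))
print(meets_minimum("ollama version is 0.9.6"))
```

To check a live installation, one option is to pass in the output of `subprocess.run(["ollama", "-v"], capture_output=True, text=True).stdout`.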


## Step 1: Test Ollama Connection

First, let's verify that Ollama is running and accessible:


In [1]:
import requests

# Test if Ollama is running
try:
    # A timeout keeps the cell from hanging if the server is unresponsive
    response = requests.get('http://localhost:11434/api/tags', timeout=5)
    if response.status_code == 200:
        models = response.json()
        print("✅ Ollama is running!")
        print("\nAvailable models:")
        for model in models.get('models', []):
            print(f"  - {model['name']}")
    else:
        print(f"❌ Ollama is not responding properly (HTTP {response.status_code})")
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
    print("❌ Cannot connect to Ollama. Make sure it's running!")
    print("Start Ollama by running 'ollama serve' in a terminal")


✅ Ollama is running!

Available models:
  - embeddinggemma:latest
  - gpt-oss:20b
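
Beyond confirming that the server responds, it can help to verify that both required models appear in the list. The sketch below assumes the `/api/tags` payload has the shape used in the cell above (a `models` list whose entries carry a `name`). `missing_models` is a hypothetical helper, and the sample payload is made up for illustration.

```python
REQUIRED_MODELS = ["gpt-oss:20b", "embeddinggemma:latest"]


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names that are absent from an /api/tags payload."""
    installed = {model.get("name") for model in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


# Made-up payload in which only the embedding model has been pulled.
sample_payload = {"models": [{"name": "embeddinggemma:latest"}]}
print(missing_models(sample_payload))  # ['gpt-oss:20b']
```

In the notebook, calling `missing_models(response.json())` on the response from the cell above should return an empty list when both models are pulled.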


## Step 2: Test Embeddings with Ollama

Now let's test creating embeddings using the LangChain Ollama connector:


In [2]:
from langchain_ollama import OllamaEmbeddings

# Initialize the embedding model
embedding_model = OllamaEmbeddings(
    model="embeddinggemma:latest",
    base_url="http://localhost:11434"  # Default Ollama URL
)

print("✅ Embedding model initialized")

✅ Embedding model initialized


In [3]:
# Test embedding a single query
test_query = "What is the meaning of life?"

print(f"Embedding query: '{test_query}'")
embedding = embedding_model.embed_query(test_query)

print(f"\n✅ Successfully created embedding!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")


Embedding query: 'What is the meaning of life?'

✅ Successfully created embedding!
Embedding dimension: 768
First 10 values: [-0.14623165, 0.029127467, 0.037609786, -0.02487736, -0.02654742, 0.016047234, -0.027490975, 0.027250502, 0.01138677, -2.815361e-07]


In [4]:
# Test embedding multiple documents
test_documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science."
]

print("Embedding multiple documents...")
embeddings = embedding_model.embed_documents(test_documents)

print(f"\n✅ Successfully created {len(embeddings)} embeddings!")
for i, doc in enumerate(test_documents):
    print(f"\nDocument {i+1}: '{doc[:50]}...'")
    print(f"  Embedding dimension: {len(embeddings[i])}")
    print(f"  First 5 values: {embeddings[i][:5]}")


Embedding multiple documents...

✅ Successfully created 3 embeddings!

Document 1: 'The quick brown fox jumps over the lazy dog....'
  Embedding dimension: 768
  First 5 values: [-0.14782572, 0.002899217, 0.05214841, -0.029081974, -0.036695857]

Document 2: 'Machine learning is a subset of artificial intelli...'
  Embedding dimension: 768
  First 5 values: [-0.124072984, -0.0027437524, -0.0003292989, 0.010069011, 0.0016800934]

Document 3: 'Python is a popular programming language for data ...'
  Embedding dimension: 768
  First 5 values: [-0.1626911, -0.010286673, 0.025470829, 0.0006976369, -0.01888673]
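
A quick sanity check on embeddings is to compare them with cosine similarity, which is also how retrieval in a RAG pipeline typically ranks documents against a query. The following is a minimal pure-Python sketch; `cosine_similarity` and `rank_documents` are hypothetical helpers, and the two-dimensional vectors are toy values rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors: document 2 points the same way as the query, document 0 is orthogonal.
ranking = rank_documents([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
print([index for index, _ in ranking])  # [2, 1, 0]
```

With the objects defined in this notebook, `rank_documents(embedding_model.embed_query(test_query), embeddings)` would rank the three test documents against the test query.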


## Step 3: Test Model Inference with Ollama

Next, let's test text generation (inference) with Ollama through the LangChain connector:


In [5]:
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

# Initialize the chat model
chat_model = ChatOllama(
    model="gpt-oss:20b",
    temperature=0.6,
    base_url="http://localhost:11434",
    verbose=True
)

print("✅ Chat model initialized")


✅ Chat model initialized


Let's add a helper function to measure the model's inference performance:

In [6]:
def detailed_performance_metrics(response_metadata):
    """
    Calculate comprehensive performance metrics from Ollama response metadata
    """
    # Extract all timing data (in nanoseconds)
    total_duration = response_metadata.get('total_duration', 0)
    load_duration = response_metadata.get('load_duration', 0)
    prompt_eval_duration = response_metadata.get('prompt_eval_duration', 0)
    eval_duration = response_metadata.get('eval_duration', 0)
    
    # Extract token counts
    prompt_eval_count = response_metadata.get('prompt_eval_count', 0)
    eval_count = response_metadata.get('eval_count', 0)
    
    # Convert to seconds
    total_seconds = total_duration / 1_000_000_000
    load_seconds = load_duration / 1_000_000_000
    prompt_eval_seconds = prompt_eval_duration / 1_000_000_000
    eval_seconds = eval_duration / 1_000_000_000
    
    # Calculate metrics; rates fall back to 0 when a duration is missing or zero
    metrics = {
        'generation_tokens_per_second': eval_count / eval_seconds if eval_seconds > 0 else 0,
        'prompt_tokens_per_second': prompt_eval_count / prompt_eval_seconds if prompt_eval_seconds > 0 else 0,
        'total_tokens': prompt_eval_count + eval_count,
        'total_time_seconds': total_seconds,
        'load_time_seconds': load_seconds,
        'generation_time_seconds': eval_seconds,
        'prompt_processing_time_seconds': prompt_eval_seconds,
    }

    print(f"Total tokens: {metrics['total_tokens']}")
    print(f"Total time seconds: {metrics['total_time_seconds']}")
    print(f"Load time seconds: {metrics['load_time_seconds']}")
    print(f"Generation time seconds: {metrics['generation_time_seconds']}")
    print(f"Prompt processing time seconds: {metrics['prompt_processing_time_seconds']}")
    print(f"Generation tokens per second: {metrics['generation_tokens_per_second']}")
    print(f"Prompt tokens per second: {metrics['prompt_tokens_per_second']}")

    return metrics

In [7]:
# Test simple inference
prompt = "Explain quantum computing in one sentence."

print(f"Prompt: {prompt}")
print("\nGenerating response...")

response = chat_model.invoke(prompt)

print(f"\n✅ Response generated!")
print(f"\nModel output: {response.content}")


Prompt: Explain quantum computing in one sentence.

Generating response...

✅ Response generated!

Model output: Quantum computing harnesses qubits that can occupy multiple states at once and become entangled, enabling certain complex problems to be solved exponentially faster than with classical computers.


In [8]:
_ = detailed_performance_metrics(response.response_metadata)

Total tokens: 248
Total time seconds: 25.746506958
Load time seconds: 20.192723042
Generation time seconds: 3.760666666
Prompt processing time seconds: 1.77428375
Generation tokens per second: 46.268392137257294
Prompt tokens per second: 41.70697048879583


In [9]:
# Test with system message and human message
messages = [
    SystemMessage(content="You are a helpful AI assistant that explains complex topics simply."),
    HumanMessage(content="What is machine learning?")
]

print("Sending messages to model...")
response = chat_model.invoke(messages)

print(f"\n✅ Response generated!")
print(f"\nModel output: {response.content}")


Sending messages to model...

✅ Response generated!

Model output: **Machine learning (ML)** is a way of teaching computers to make decisions or predictions by learning from data, rather than by following a set of hard‑coded rules written by a programmer.

---

### The big idea
1. **Data** – You give the computer a bunch of examples (images, numbers, text, etc.).  
2. **Pattern‑finding** – The computer automatically looks for patterns or regularities in those examples.  
3. **Model** – It builds an internal “model” that captures those patterns.  
4. **Prediction / Decision** – When you give it a new, unseen example, the model uses what it learned to guess the answer or make a decision.

---

### How it works in a nutshell

| Step | What happens | Example |
|------|--------------|---------|
| 1. **Collect data** | Gather many examples of what you want the computer to learn. | Thousands of pictures of cats and dogs. |
| 2. **Choose a model** | Pick a mathematical function that can repres

In [10]:
_ = detailed_performance_metrics(response.response_metadata)

Total tokens: 882
Total time seconds: 17.905281583
Load time seconds: 0.120607542
Generation time seconds: 17.363052708
Prompt processing time seconds: 0.397103667
Generation tokens per second: 45.4989115845976
Prompt tokens per second: 231.67753824846952


## Step 4: Test Streaming Response

Ollama supports streaming responses, which is useful for real-time applications:


In [11]:
# Test streaming
prompt = "Write a haiku about artificial intelligence."

print(f"Prompt: {prompt}")
print("\nStreaming response:")
print("-" * 40)

for chunk in chat_model.stream(prompt):
    print(chunk.content, end="", flush=True)

print("\n" + "-" * 40)
print("\n✅ Streaming completed!")


Prompt: Write a haiku about artificial intelligence.

Streaming response:
----------------------------------------
Silicon mind hums,  
patterns bloom in code’s quiet,  
futures whisper soft.
----------------------------------------

✅ Streaming completed!
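
For real-time applications, a useful number is the time until the first visible text arrives. The sketch below collects a stream into one string and records that delay. `consume_stream` is a hypothetical helper. It assumes each chunk exposes a string `content` attribute, as the chunks in the loop above do, and it uses stand-in chunks so that it runs without a model.

```python
import time
from types import SimpleNamespace


def consume_stream(chunks):
    """Join chunk contents and return (full_text, seconds_until_first_non_empty_chunk)."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for chunk in chunks:
        if chunk.content and first_token_at is None:
            first_token_at = time.perf_counter() - start
        pieces.append(chunk.content)
    return "".join(pieces), first_token_at


# Stand-in chunks for illustration; a real stream would come from chat_model.stream(prompt).
fake_chunks = [SimpleNamespace(content=piece) for piece in ["", "Silicon ", "mind ", "hums"]]
text, first_token_at = consume_stream(fake_chunks)
print(text)  # Silicon mind hums
```

In the notebook, `consume_stream(chat_model.stream(prompt))` would return the full haiku together with the delay before the first text arrived.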


## Summary

If all the tests above passed, you're ready to use Ollama with LangChain! Here's what we tested:

✅ **Embeddings**: 
- Created embeddings for single queries
- Created embeddings for multiple documents
- Verified embedding dimensions

✅ **Model Inference**:
- Simple text generation
- Chat with system and human messages
- Streaming responses
- Performance metrics from the response metadata

## Troubleshooting

If you encounter issues:

1. **Model Not Found**: Pull the required models (`ollama pull <model-name>`)
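2. **Slow Performance**: Ollama uses a supported GPU automatically when it detects one and falls back to the CPU otherwise. If responses are slow:
   - Use smaller models for testing
   - Check that your GPU is detected (`ollama ps` shows whether a loaded model runs on CPU or GPU)
   - Expect the first request after startup to be slower, since it includes model load time (about 20 seconds in the run above)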
3. **Memory Issues**: Large models require significant RAM. Try smaller variants if needed.
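
To see whether a loaded model is actually held in GPU memory, Ollama's `/api/ps` endpoint lists the models currently loaded along with their `size` and `size_vram` in bytes. The sketch below computes the share of each model that sits in GPU memory. `gpu_fraction` is a hypothetical helper, and the payload is made-up sample data, not output from this notebook. The field names are assumed to match the Ollama API documentation.

```python
def gpu_fraction(ps_payload):
    """Map each loaded model name to the fraction of its bytes held in GPU memory."""
    fractions = {}
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        size_vram = model.get("size_vram", 0)
        fractions[model.get("name", "unknown")] = size_vram / size if size > 0 else 0.0
    return fractions


# Made-up sample: one model fully on the GPU, one entirely on the CPU.
sample_ps = {
    "models": [
        {"name": "gpt-oss:20b", "size": 1000, "size_vram": 1000},
        {"name": "embeddinggemma:latest", "size": 1000, "size_vram": 0},
    ]
}
print(gpu_fraction(sample_ps))  # {'gpt-oss:20b': 1.0, 'embeddinggemma:latest': 0.0}
```

Against a live server, the payload would come from `requests.get('http://localhost:11434/api/ps').json()`. A fraction well below 1.0 suggests the model is partly or fully running on the CPU, which usually explains slow generation.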

## Next Steps

Now you're ready to proceed with the main RAG assignment using Ollama!
