# Ollama Sentiment Analysis Pipeline — Tutorial

Welcome to this hands-on tutorial on building a **sentiment analysis pipeline** using **Ollama**, an open-source platform for running Large Language Models (LLMs) locally or on-premise.

## What You'll Learn

In this tutorial, you will:
- 🔧 Connect to an Ollama API endpoint and verify connectivity
- 📊 Load and preprocess text data from CSV files
- 🤖 Use LLMs for intelligent column detection and text analysis
- 🎭 Perform sentiment classification and emotion detection
- ✅ Validate and structure LLM outputs using Pydantic models
- 💾 Save results for downstream analysis

## Why Ollama?

Unlike cloud-based solutions (Azure OpenAI, AWS Bedrock, etc.), Ollama allows you to:
- Run models **entirely on-premise** (no data leaves your network)
- Avoid per-token API costs
- Maintain full control over your infrastructure
- Use open-source models (Llama, Mistral, Gemma, etc.)

## Prerequisites

Before starting, ensure you have:
1. **Ollama installed and running** (download from [ollama.ai](https://ollama.ai))
2. A model pulled (e.g., `ollama pull gemma3` or `ollama pull llama3`)
3. Python 3.8+ with required packages installed
4. A dataset with text reviews (we provide a sample mental health dataset)
     - The dataset is provided in this repo as a zip archive (`./datasets/Mental Health Dataset.csv.zip`)  [License](https://creativecommons.org/licenses/by/4.0/)
        - Source: https://www.kaggle.com/datasets/sujaykapadnis/mental-health-insights-data
        - Be sure to extract the dataset file and update the `DATA_SET_PATH` variable.

👉 **No authentication or API keys required!**

Let's get started! 🚀

## 1) Setup and Imports

First, we'll import all the Python libraries needed for this tutorial. Understanding what each library does will help you customize the pipeline later.

### Libraries We're Using:

| Library | Purpose |
|---------|---------|
| `pandas` | Data manipulation and CSV handling |
| `requests` | HTTP communication with Ollama API |
| `json` | Parsing JSON responses from the model |
| `difflib` | Fuzzy string matching for column detection |
| `re` | Regular expressions for cleaning LLM output |
| `tenacity` | Retry logic for network resilience |
| `pydantic` | Data validation and schema enforcement |
| `typing` | Type hints for better code clarity |

### Why Import Everything Up Front?

Following Python best practices, we consolidate all imports at the top of the notebook. This makes it easy to:
- See all dependencies at a glance
- Identify missing packages before running the pipeline
- Maintain clean, organized code

Run this cell first to ensure all dependencies are available.
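
If you would rather find out about missing packages before the import cell fails, one option is to probe for them with the standard library. This is a minimal sketch and not part of the original notebook: the `REQUIRED_PACKAGES` list and the `missing_packages` helper are illustrative names, and the list assumes the third-party packages named in the table above.

```python
import importlib.util

# Third-party packages this notebook imports (assumed from the table above).
REQUIRED_PACKAGES = ["pandas", "requests", "tenacity", "pydantic"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

The check only looks for the package on the import path; it does not verify versions.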

In [None]:
import pandas as pd
import textwrap
import requests
import json
import difflib
import re
from tenacity import retry, stop_after_attempt, wait_random_exponential
from pydantic import BaseModel, Field
from typing import List

## 2) Configure Ollama Connection

Now we'll configure the connection parameters for your Ollama instance. This is the only section you'll need to customize for your environment.

### Configuration Parameters:

**`OLLAMA_URL`** - The HTTP endpoint where Ollama is listening
- Default: `http://localhost:11434/api/chat` (Ollama's default local server)
- Change this if Ollama runs on a different machine or port
- Example: `http://192.168.1.100:11434/api/chat` for a server on your network

**`MODEL_NAME`** - The LLM model to use for analysis
- We're using `gemma3` (Google's efficient model, good for sentiment tasks)
- Other recommended options:
  - `llama3` - Meta's powerful general-purpose model
  - `mistral` - Fast and accurate for text classification
  - `phi3` - Microsoft's small but capable model
  - `qwen2` - Strong multilingual support

**`DATA_SET_PATH`** - Location of your input CSV file
- Default: `./datasets/mental-health-dataset.csv`
- Can be an absolute path or relative to this notebook

### Column Detection Settings:

**`FORCE_OLLAMA_COLUMN_SELECTION`** - Always use LLM for column detection
- `False`: Use fast heuristic first, LLM only if ambiguous
- `True` (the value set in the cell below): Always ask the LLM (slower but more accurate)

**`ENABLE_OLLAMA_FALLBACK`** - Use LLM when heuristic is unsure
- `True` (default): Get LLM help when needed
- `False`: Fall back to first text column if heuristic fails

💡 **Tip**: Start with defaults, then adjust based on your dataset characteristics.

In [None]:
# Ollama configuration
OLLAMA_URL = "http://localhost:11434/api/chat"  # 🔧 Change this to your actual Ollama endpoint
MODEL_NAME = "gemma3"  # Choose a good model for sentiment tasks
DATA_SET_PATH = "./datasets/mental-health-dataset.csv"  # Path to your dataset

# Column detection settings
FORCE_OLLAMA_COLUMN_SELECTION = True  # Set True to always use LLM for column detection (slower but smarter)
ENABLE_OLLAMA_FALLBACK = True  # Set True to use LLM when heuristic is ambiguous

## 3) Define Helper Functions

Before diving into the analysis, we'll define two reusable helper functions that encapsulate common operations. This follows the **DRY principle** (Don't Repeat Yourself) and makes the code more maintainable.

### Function 1: `query_ollama()`

This is the **core communication function** that sends prompts to Ollama and handles streaming responses.

**Key Features:**
- ✅ **Streaming support** - Processes NDJSON responses incrementally
- ✅ **Retry logic** - Makes up to 3 attempts, with a randomized exponential wait between them (via `@retry` decorator)
- ✅ **Error handling** - Gracefully handles malformed responses
- ✅ **Verbose mode** - Optional diagnostic output for troubleshooting

**How it works:**
1. Constructs a chat-format payload (a `messages` list of role/content pairs, similar in shape to the OpenAI chat format)
2. Sends POST request to Ollama with streaming enabled
3. Iterates through NDJSON lines, extracting message content
4. Returns assembled response text

**Why streaming?** Ollama sends responses incrementally (token by token), which allows for:
- Real-time feedback in production applications
- Lower memory usage for long responses
- Ability to stop generation early if needed
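
To see the NDJSON handling in isolation, here is a small offline sketch that assembles a reply from a few hand-written chunks. The `assemble_ndjson` name and the sample chunks are illustrative assumptions; the chunks only imitate the `message`/`content` shape that `query_ollama()` reads, and no server is contacted.

```python
import json


def assemble_ndjson(lines):
    """Concatenate message.content from an iterable of NDJSON byte lines."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        try:
            obj = json.loads(raw.decode("utf-8", errors="replace"))
        except json.JSONDecodeError:
            continue  # ignore chunks that are not valid JSON
        message = obj.get("message") if isinstance(obj, dict) else None
        if isinstance(message, dict):
            parts.append(str(message.get("content", "")))
    return "".join(parts)


# Hand-written sample chunks shaped like a streamed chat reply (illustrative only).
sample_stream = [
    b'{"message": {"role": "assistant", "content": "PO"}, "done": false}',
    b'',
    b'not json at all',
    b'{"message": {"role": "assistant", "content": "NG"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(assemble_ndjson(sample_stream))
```

Each line is parsed on its own, so one malformed chunk does not discard the rest of the reply.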

### Function 2: `extract_json()`

LLMs sometimes wrap JSON in markdown code fences (` ```json ... ``` `), which breaks JSON parsers. This function cleans the response.

**Key Features:**
- ✅ Removes markdown code fences (` ``` `, ` ```json `)
- ✅ Strips whitespace
- ✅ Safe for responses that are already clean JSON

**Example:**
```python
# Input:  "```json\n{\"key\": \"value\"}\n```"
# Output: "{\"key\": \"value\"}"
```

These functions will be used throughout the notebook, making later cells much cleaner and easier to read.
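
Code fences are not the only thing a model may wrap around its JSON; sometimes there is a sentence before or after it. As a hedged alternative (not what `extract_json()` does), the sketch below scans for the first decodable JSON object using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(text):
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # this brace did not start a valid object; try the next one
        start = text.find("{", start + 1)
    return None


reply = (
    "Here is the analysis:\n"
    "```json\n"
    '{"sentiment_label": "negative", "confidence_score": 0.8, "emotions": ["fear"]}\n'
    "```\n"
    "Hope this helps!"
)
print(extract_first_json_object(reply))
```

Unlike `extract_json()`, this returns a parsed `dict` rather than a string, so it would replace the `json.loads()` step as well.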

In [None]:
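# reraise=True surfaces the original exception (e.g. requests' ConnectionError) after the
# final attempt instead of wrapping it in tenacity.RetryError, so callers can catch it by type.
@retry(wait=wait_random_exponential(min=10, max=30), stop=stop_after_attempt(3), reraise=True)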
def query_ollama(prompt, model=MODEL_NAME, verbose=False):
    """
    Send a prompt to Ollama and return the assembled response.
    
    Args:
        prompt: The text prompt to send
        model: Model name to use (default: MODEL_NAME)
        verbose: If True, print diagnostic info during processing
    
    Returns:
        Assembled response text from the model
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}]
    }
    
    if verbose:
        print(f'Sending request to {OLLAMA_URL} with model {model}...')
    
    response = requests.post(OLLAMA_URL, json=payload, timeout=60, stream=True)
    response.raise_for_status()
    
    # Handle streaming NDJSON response
    assembled = ''
    for line in response.iter_lines(decode_unicode=False):
        if not line:
            continue
        try:
            obj = json.loads(line.decode('utf-8', errors='replace'))
            if isinstance(obj, dict):
                msg = obj.get('message')
                if isinstance(msg, dict) and 'content' in msg:
                    assembled += str(msg['content'])
        except Exception:
            continue
    
    return assembled if assembled else "No response content"


def extract_json(text):
    """
    Extract JSON from response, handling markdown code fences.
    
    Args:
        text: Response text that may contain JSON wrapped in ```json...```
    
    Returns:
        Cleaned JSON string
    """
    text = text.strip()
    # Remove markdown code fences if present
    if text.startswith('```'):
        text = re.sub(r'^```(?:json)?\s*\n?', '', text)
        text = re.sub(r'\n?```\s*$', '', text)
    return text.strip()

## 4) Quick Ollama Connectivity Test

Before processing your data, it's crucial to verify that Ollama is reachable and responding correctly. This diagnostic step will save you time troubleshooting later.

### What This Test Does:

1. **Displays configuration** - Shows the URL and model you're using
2. **Sends a simple prompt** - Asks the model to reply with "PONG"
3. **Validates the response** - Checks if the model understood the request
4. **Reports status** - Clear success/failure messages with troubleshooting hints

### Possible Outcomes:

✅ **Success** - Ollama is working, proceed with confidence!

⚠️ **Unexpected Response** - Model responded but didn't say "PONG"
   - May indicate prompt understanding issues
   - Check if the correct model is loaded

❌ **Connection Error** - Can't reach Ollama
   - Make sure Ollama is running; start it with `ollama serve` (in a terminal)
   - Check `OLLAMA_URL` is correct
   - Ensure no firewall is blocking port 11434

❌ **Timeout** - Request took too long
   - Model might be loading (first run can be slow)
   - Check system resources (CPU/RAM/GPU usage)
   - Try a smaller model

### Troubleshooting Tips:

- **Not installed?** Run `curl -fsSL https://ollama.com/install.sh | sh` (Linux/Mac)
- **Model not pulled?** Run `ollama pull gemma3` in your terminal
- **Port conflict?** Check if another service is using port 11434

Run this cell now - don't skip it! A working connection is essential for everything that follows.
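
A quick way to tell "model not pulled" apart from other failures is to ask Ollama which models it has locally. The sketch below assumes the `/api/tags` endpoint described in Ollama's API reference and a base URL of `http://localhost:11434`; the helper names `fetch_tags` and `model_is_pulled` are illustrative, and the sample payload is hand-written so the check can be tried without a running server.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434", timeout=10):
    """Fetch the locally available models from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_is_pulled(tags_payload, model_name):
    """Check whether `model_name` appears in a parsed /api/tags payload."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Entries carry a tag suffix such as "gemma3:latest"; compare the base name too.
    return any(n == model_name or n.split(":")[0] == model_name for n in names)


# Hand-written payload in the assumed response shape (illustrative only).
sample_payload = {"models": [{"name": "gemma3:latest"}, {"name": "llama3:8b"}]}
print(model_is_pulled(sample_payload, "gemma3"))
print(model_is_pulled(sample_payload, "mistral"))

# Against a live server (not executed here):
# model_is_pulled(fetch_tags(), "gemma3")
```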

In [None]:
print('OLLAMA_URL:', OLLAMA_URL)
print('MODEL_NAME:', MODEL_NAME)
print('\n-- Testing Ollama connectivity --')

try:
    test_response = query_ollama("Please reply with the single token: PONG", verbose=True)
    print(f'\nResponse: {test_response}')
    
    if 'PONG' in test_response.upper():
        print('\n✅ SUCCESS: Ollama is reachable and responding correctly!')
    else:
        print(f'\n⚠️ WARNING: Got unexpected response: {test_response[:200]}')
        
except requests.exceptions.ConnectionError as ce:
    print(f'❌ Connection error: {ce}')
    print('Please verify OLLAMA_URL is correct and Ollama is running.')
except requests.exceptions.Timeout as te:
    print(f'❌ Request timed out: {te}')
except Exception as e:
    print(f'❌ Unexpected error: {e}')

## 5) Load Dataset

Now we'll load your CSV dataset using pandas. This tutorial uses a **mental health dataset** with patient reviews, but the pipeline works with any text data.

### What Happens Here:

The code performs two operations:
1. **Reads the CSV file** specified in `DATA_SET_PATH`
2. **Displays the first 5 rows** using `.head()` for quick inspection

### Understanding Your Data:

After running this cell, examine the output to answer:
- ✅ How many columns does the dataset have?
- ✅ What are the column names?
- ✅ Which column likely contains the review text?
- ✅ Are there any missing values (NaN)?
- ✅ What's the data type of each column?
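
`.head()` alone does not show missing-value counts or data types, so a short summary helper can answer the questions above in one go. This is a sketch on a tiny made-up frame; `summarize_frame` is a hypothetical helper, and in the notebook you would pass `reviews` instead of `demo`.

```python
import pandas as pd


def summarize_frame(df):
    """Collect the basic facts worth checking before choosing a text column."""
    return {
        "n_rows": len(df),
        "columns": list(df.columns),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
    }


# Tiny made-up frame standing in for `reviews` (illustrative values only).
demo = pd.DataFrame({
    "review_text": ["Great product!", None, "Not satisfied"],
    "rating": [5, 3, 2],
})

for key, value in summarize_frame(demo).items():
    print(f"{key}: {value}")
```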

### Common Dataset Structures:

Your CSV might look like one of these patterns:

**Pattern 1: Simple reviews**
```
review_text, rating
"Great product!", 5
"Not satisfied", 2
```

**Pattern 2: Social media posts**
```
post_id, username, content, timestamp, likes
123, user1, "Loving this!", 2024-01-01, 42
```

**Pattern 3: Survey responses**
```
respondent_id, question_1, question_2, comments
A001, 4, 5, "Additional feedback here"
```

The next step (column detection) will intelligently identify which column contains your text data, regardless of the structure.

### Troubleshooting:

- **File not found?** Check `DATA_SET_PATH` points to the correct location
- **Encoding errors?** `pd.read_csv()` assumes UTF-8 by default; try passing `encoding='latin-1'` or `encoding='utf-8-sig'`
- **Large file?** Use `pd.read_csv(path, nrows=1000)` to load only the first 1000 rows for testing

In [None]:
reviews = pd.read_csv(DATA_SET_PATH)
reviews.head()

## 6) Inspect a Single Review — Robust Column Detection

This cell automatically detects which column contains the review text using either:
1. **LLM-based detection** (if `FORCE_OLLAMA_COLUMN_SELECTION=True`) - slower but more accurate
2. **Heuristic matching** - fast keyword and fuzzy matching
3. **LLM fallback** (if `ENABLE_OLLAMA_FALLBACK=True`) - used when heuristic is ambiguous

The cell then displays the first review from the selected column.
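
The heuristic step can be tried on its own, without loading a dataset or calling the model. The sketch below scores column names with keyword hits plus `difflib` similarity to a target name; `score_columns` and `best_column` are illustrative names, and the column list is borrowed from the survey-style pattern in section 5.

```python
import difflib

SYNONYMS = ["review", "text", "comment", "body", "post", "content", "message", "feedback"]


def score_columns(columns, target="review_text"):
    """Score each column name: +2 per keyword hit, plus similarity to `target`."""
    scores = {}
    for col in columns:
        name = str(col).lower()
        keyword_score = sum(2.0 for syn in SYNONYMS if syn in name)
        similarity = difflib.SequenceMatcher(None, name, target).ratio()
        scores[col] = keyword_score + similarity
    return scores


def best_column(columns, target="review_text"):
    """Return the highest-scoring column name together with its score."""
    scores = score_columns(columns, target)
    best = max(scores, key=scores.get)
    return best, scores[best]


columns = ["respondent_id", "question_1", "question_2", "comments"]
for col, score in score_columns(columns).items():
    print(f"{col}: {score:.2f}")
print("Best:", best_column(columns))
```

Printing every score is a handy way to see how much of the result comes from keyword hits and how much from name similarity alone.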

In [None]:
print('Available columns:', list(reviews.columns))

target = 'review_text'
chosen_col = None

def call_ollama_for_column(cols):
    """Ask Ollama to identify the text column from the list."""
    prompt = (
        "You are a helpful assistant for a data-preprocessing script.\n"
        "Given the following dataset columns: \n"
        f"{cols}\n\n"
        "Which column is most likely to contain the main user-generated review, comment, or post text?\n"
        "Answer with the exact column name only, nothing else."
    )
    try:
        reply = query_ollama(prompt).strip().splitlines()[0].strip().strip('"').strip("'")
        print(f"Ollama replied: {reply}")
        
        # Try exact match
        if reply in reviews.columns:
            return reply
        # Try case-insensitive match
        ci = [c for c in reviews.columns if c.lower() == reply.lower()]
        if ci:
            return ci[0]
        # Try fuzzy match
        close = difflib.get_close_matches(reply, reviews.columns, n=1, cutoff=0.6)
        if close:
            print(f"Using fuzzy match to column: {close[0]}")
            return close[0]
        print('Ollama reply did not match any column.')
    except Exception as e:
        print(f"Ollama call failed: {e}")
    return None

# Strategy 1: Force LLM selection
if FORCE_OLLAMA_COLUMN_SELECTION:
    print('Using LLM for column selection...\n')
    chosen_col = call_ollama_for_column(list(reviews.columns))
    if chosen_col:
        print(f"✅ LLM selected column: {chosen_col}")

# Strategy 2: Exact match or heuristic
if chosen_col is None:
    if target in reviews.columns:
        chosen_col = target
        print(f"✅ Found exact match: {chosen_col}")
    else:
        # Heuristic scoring
        synonyms = ['review', 'text', 'comment', 'body', 'post', 'content', 'message', 'feedback']
        scores = {}
        for col in reviews.columns:
            col_low = col.lower()
            score = sum(2.0 for syn in synonyms if syn in col_low)
            score += difflib.SequenceMatcher(None, col_low, target).ratio()
            scores[col] = score
        
        best_col = max(scores, key=scores.get)
        best_score = scores[best_col]
        
        if best_score > 0.3:
            chosen_col = best_col
            print(f"✅ Heuristic selected: {best_col} (score={best_score:.2f})")
        else:
            print(f"⚠️ Heuristic ambiguous (best={best_col}, score={best_score:.2f})")
            
            # Strategy 3: LLM fallback
            if ENABLE_OLLAMA_FALLBACK:
                print('Falling back to LLM for column selection...')
                ol_col = call_ollama_for_column(list(reviews.columns))
                if ol_col:
                    chosen_col = ol_col
                    print(f"✅ LLM fallback selected: {chosen_col}")
            
            # Strategy 4: Dtype-based fallback
            if chosen_col is None:
                obj_cols = reviews.select_dtypes(include=['object']).columns.tolist()
                if obj_cols:
                    chosen_col = obj_cols[0]
                    print(f"⚠️ Falling back to first object column: {chosen_col}")
                else:
                    raise KeyError(f"No text column found. Available: {list(reviews.columns)}")

# Display the first review
review = reviews[chosen_col].iloc[0]
print(f"\n--- First value from '{chosen_col}' ---\n")
review

## 7) Create Sentiment Analysis Prompt

Prompts are the "instructions" we give to LLMs. Good prompts = good results! Let's break down what makes this prompt effective.

### The Prompt Template:

```
You are a sentiment analysis expert. Analyze the following patient review and return a JSON object with:
- sentiment_label: 'positive', 'negative', or 'neutral'
- confidence_score: a number between 0 and 1
- emotions: a list of detected emotions like ['joy', 'anger', 'sadness', 'fear', 'surprise']
Return only the JSON object and no extra text.

Review text:
{text}
```

### Prompt Engineering Principles Applied:

#### 1. **Role Assignment** ("You are a sentiment analysis expert")
- Sets the model's context and expertise domain
- Encourages specialized, high-quality responses
- Similar to how you'd brief a human analyst

#### 2. **Clear Task Definition** ("Analyze the following patient review")
- Specifies exactly what needs to be done
- Provides domain context ("patient review" vs generic text)
- Helps model understand the expected tone and content

#### 3. **Structured Output Requirements**
- Lists exact fields needed: `sentiment_label`, `confidence_score`, `emotions`
- Specifies data types and formats
- Provides examples for complex fields (emotion list)

#### 4. **Constraints** ("Return only the JSON object and no extra text")
- Critical for parsing! Prevents preambles like "Here's the analysis:"
- Ensures machine-readable output
- Reduces the need for complex text cleanup

#### 5. **Format Specification** ("JSON object")
- Ensures consistent, parseable responses
- Enables automated validation with Pydantic
- Makes batch processing reliable

### Why Three Sentiment Labels?

We use `positive`, `negative`, and `neutral` (not just positive/negative) because:
- ✅ Many reviews are mixed or balanced
- ✅ Neutral sentiment is valuable information
- ✅ Reflects real-world sentiment distribution more accurately

### Confidence Scores:

The `confidence_score` (0.0 to 1.0) helps you:
- Filter low-confidence predictions for human review
- Weight results in aggregate statistics
- Identify edge cases where sentiment is unclear
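
As a sketch of the first use, the snippet below splits parsed results into a confident group and a group to send for human review. The 0.6 threshold, the `split_by_confidence` name, and the sample records are all illustrative assumptions; pick a threshold that suits your data.

```python
def split_by_confidence(records, threshold=0.6):
    """Partition parsed results into (confident, needs_review) lists."""
    confident, needs_review = [], []
    for rec in records:
        score = rec.get("confidence_score", 0.0)
        bucket = confident if score >= threshold else needs_review
        bucket.append(rec)
    return confident, needs_review


# Made-up parsed outputs (illustrative values only).
sample = [
    {"sentiment_label": "negative", "confidence_score": 0.85, "emotions": ["fear"]},
    {"sentiment_label": "neutral", "confidence_score": 0.40, "emotions": []},
    {"sentiment_label": "positive", "confidence_score": 0.92, "emotions": ["joy"]},
]

confident, needs_review = split_by_confidence(sample)
print(f"{len(confident)} confident, {len(needs_review)} flagged for review")
```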

### Emotion Detection:

Beyond the three-way sentiment label, emotions provide **richer insights**:
- `positive` + `joy` = satisfaction
- `positive` + `surprise` = delight/exceeded expectations
- `negative` + `fear` = anxiety/concern
- `negative` + `anger` = frustration/complaint

### Customizing the Prompt:

You can modify this template to fit your needs. Keep the `{text}` placeholder so that `PROMPT_TEMPLATE.format(text=...)` can insert the review, and update the Pydantic model in step 10 to match the new fields.

**For product reviews:**
```python
PROMPT_TEMPLATE = '''Analyze this product review and return JSON with:
- sentiment_label: 'positive', 'negative', or 'neutral'
- confidence_score: 0 to 1
- aspects: list of product features mentioned (e.g., ['quality', 'price', 'shipping'])
- recommendation: would the reviewer recommend this? (true/false)
Return only the JSON object and no extra text.

Review text:
{text}
'''
```

**For customer support tickets:**
```python
PROMPT_TEMPLATE = '''Analyze this support ticket and return JSON with:
- sentiment_label: 'positive', 'negative', or 'neutral'
- urgency: 'low', 'medium', or 'high'
- category: type of issue (e.g., 'technical', 'billing', 'feature request')
- requires_human: boolean (true if complex/sensitive)
Return only the JSON object and no extra text.

Ticket text:
{text}
'''
```

**For social media posts:**
```python
PROMPT_TEMPLATE = '''Analyze this social media post and return JSON:
- sentiment_label: 'positive', 'negative', or 'neutral'
- intent: 'complaint', 'praise', 'question', or 'general'
- brand_mentioned: boolean
- engagement_potential: 'low', 'medium', or 'high'
Return only the JSON object and no extra text.

Post text:
{text}
'''
```

### Testing Your Prompt:

Before running on your full dataset:
1. Test with 3-5 diverse examples
2. Check if JSON is valid and complete
3. Verify sentiment labels make sense
4. Ensure confidence scores are reasonable (not all 0.5 or all 1.0)
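
These checks are easy to automate. The sketch below takes a list of raw replies (assumed to have already been cleaned with `extract_json()`) and reports problems in plain language. The `check_outputs` helper and the sample replies are hypothetical, written only to illustrate the checklist.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}
REQUIRED_KEYS = {"sentiment_label", "confidence_score", "emotions"}


def check_outputs(raw_outputs):
    """Return a list of human-readable problems found in raw model replies."""
    problems = []
    scores = []
    for i, raw in enumerate(raw_outputs):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f"reply {i}: not valid JSON")
            continue
        if not isinstance(data, dict):
            problems.append(f"reply {i}: JSON is not an object")
            continue
        missing = REQUIRED_KEYS - set(data)
        if missing:
            problems.append(f"reply {i}: missing keys {sorted(missing)}")
        label = data.get("sentiment_label")
        if label not in ALLOWED_LABELS:
            problems.append(f"reply {i}: unexpected label {label!r}")
        if isinstance(data.get("confidence_score"), (int, float)):
            scores.append(data["confidence_score"])
    # Identical scores across several replies suggest the model is not discriminating.
    if len(scores) > 1 and len(set(scores)) == 1:
        problems.append(f"all confidence scores are identical ({scores[0]})")
    return problems


# Made-up replies (illustrative only).
replies = [
    '{"sentiment_label": "positive", "confidence_score": 0.9, "emotions": ["joy"]}',
    'Sure! Here is the analysis...',
    '{"sentiment_label": "mixed", "confidence_score": 0.5}',
]
for problem in check_outputs(replies):
    print(problem)
```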

The next cell will test this prompt on a single review - perfect for validation! 🔍

In [None]:
PROMPT_TEMPLATE = '''You are a sentiment analysis expert. Analyze the following patient review and return a JSON object with:
- sentiment_label: 'positive', 'negative', or 'neutral'
- confidence_score: a number between 0 and 1
- emotions: a list of detected emotions like ['joy', 'anger', 'sadness', 'fear', 'surprise']
Return only the JSON object and no extra text.

Review text:
{text}
'''

## 8) Test the Prompt on a Single Review

In [None]:
prompt = PROMPT_TEMPLATE.format(text=review)
response_text = query_ollama(prompt)
print(response_text)

## 9) Batch Sentiment Analysis
Apply the same query to multiple reviews and store the structured results.
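
When you scale this up, it helps to remember which row each output came from and to keep going when a single request fails. The sketch below shows one way to do that; `analyze_batch` is a hypothetical helper, and `fake_query` is a stand-in for `query_ollama` so the example runs without a server.

```python
def analyze_batch(texts, query_fn, template):
    """Run `query_fn` on each text and keep the row position with every result."""
    rows = []
    for position, text in enumerate(texts):
        prompt = template.format(text=text)
        try:
            rows.append({"position": position, "output": query_fn(prompt), "error": None})
        except Exception as exc:  # record the failure and continue with the next row
            rows.append({"position": position, "output": None, "error": str(exc)})
    return rows


def fake_query(prompt):
    """Stand-in for query_ollama(): fails on purpose for one input."""
    if "boom" in prompt:
        raise RuntimeError("simulated failure")
    return '{"sentiment_label": "neutral", "confidence_score": 0.5, "emotions": []}'


demo_rows = analyze_batch(["fine", "boom", "also fine"], fake_query, "Review text:\n{text}\n")
for row in demo_rows:
    print(row["position"], "ok" if row["error"] is None else f"failed: {row['error']}")
```

Keeping the position makes it possible to join parsed results back to the original rows even when some requests fail.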

In [None]:
results = []
for text in reviews[chosen_col].head(3):  # Use detected column, analyze first 3 for demo
    prompt = PROMPT_TEMPLATE.format(text=text)
    output = query_ollama(prompt)
    results.append(output)

results

## 10) Parse and Validate Results
Convert each model output to JSON, then use **Pydantic** to ensure valid fields and types.

In [None]:
class SentimentOutput(BaseModel):
    sentiment_label: str = Field(..., description="'positive'|'negative'|'neutral'")
    confidence_score: float = Field(..., description="Value between 0 and 1")
    emotions: List[str]

parsed = []
for res in results:
    try:
        cleaned = extract_json(res)
        data = json.loads(cleaned)
        parsed.append(SentimentOutput(**data).model_dump())
    except Exception as e:
        print(f"⚠️ Could not parse response: {e}\nResponse was: {res[:200]}...")

df_results = pd.DataFrame(parsed)
df_results

## 11) Save Results

You've done the hard work of analysis - now let's preserve it! This cell saves your validated results to a CSV file for future use.

### What This Cell Does:

```python
df_results.to_csv('ollama_sentiment_results.csv', index=False)
```

**Parameters explained:**
- **Filename**: `'ollama_sentiment_results.csv'` - Written to the current working directory (normally the same directory as this notebook)
- **`index=False`**: Excludes row numbers (0, 1, 2...) from the CSV - usually what you want

### The Output File:

The file opens in Excel, Google Sheets, or any CSV viewer. With illustrative values, it looks like this:

```csv
sentiment_label,confidence_score,emotions
negative,0.85,"['fear', 'sadness', 'anger']"
positive,0.92,['joy']
neutral,0.71,['surprise']
```
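
One thing to be aware of: the `emotions` column is written as the text form of a Python list, so it comes back as a plain string when the file is read again. The sketch below shows one way to restore it with `ast.literal_eval`, using an in-memory CSV with made-up values; `load_results` is a hypothetical helper.

```python
import ast
import csv
import io

# In-memory stand-in for the saved file (illustrative values only).
csv_text = '''sentiment_label,confidence_score,emotions
negative,0.85,"['fear', 'sadness', 'anger']"
positive,0.92,"['joy', 'surprise']"
'''


def load_results(handle):
    """Read result rows, converting scores to float and emotions back to lists."""
    rows = []
    for row in csv.DictReader(handle):
        row["confidence_score"] = float(row["confidence_score"])
        row["emotions"] = ast.literal_eval(row["emotions"])  # "['joy']" -> ['joy']
        rows.append(row)
    return rows


rows = load_results(io.StringIO(csv_text))
print(rows[0]["emotions"], type(rows[0]["emotions"]).__name__)
```

With pandas, the same conversion can be applied while reading by passing `converters={'emotions': ast.literal_eval}` to `pd.read_csv()`.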

### Why Save to CSV?

✅ **Universal format** - Readable by virtually every data tool  
✅ **Version control friendly** - Git can track changes  
✅ **Portable** - Easy to share with colleagues  
✅ **No dependencies** - Doesn't require Python to open  
✅ **Merge-able** - Combine with other datasets easily

### Alternative Storage Options:

#### 1. **Excel format (for non-technical stakeholders)**
```python
df_results.to_excel('sentiment_results.xlsx', index=False, engine='openpyxl')
```

#### 2. **JSON (for APIs or JavaScript applications)**
```python
df_results.to_json('sentiment_results.json', orient='records', indent=2)
```

#### 3. **Parquet (for big data pipelines)**
```python
df_results.to_parquet('sentiment_results.parquet', index=False)
```

#### 4. **Database (for production systems)**
```python
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/dbname')
df_results.to_sql('sentiment_analysis', engine, if_exists='append', index=False)
```

#### 5. **Pickle (preserves exact Python objects)**
```python
import pickle
with open('sentiment_results.pkl', 'wb') as f:
    pickle.dump(df_results, f)
```

### File Naming Best Practices:

Include metadata in filenames for future you:

```python
from datetime import datetime

# With timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
df_results.to_csv(f'sentiment_{timestamp}.csv', index=False)

# With dataset name and model
df_results.to_csv(f'sentiment_{MODEL_NAME}_mentalhealth.csv', index=False)

# With row count
df_results.to_csv(f'sentiment_{len(df_results)}_reviews.csv', index=False)
```

### What to Do with the Results:

**Immediate analysis:**
```python
# Sentiment distribution
print(df_results['sentiment_label'].value_counts())

# Average confidence
print(f"Mean confidence: {df_results['confidence_score'].mean():.2f}")

# Most common emotions
from collections import Counter
all_emotions = [e for sublist in df_results['emotions'] for e in sublist]
print(Counter(all_emotions).most_common(5))
```

**Visualization:**
```python
import matplotlib.pyplot as plt

df_results['sentiment_label'].value_counts().plot(kind='bar')
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
```

**Merge with original data:**
```python
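# Combine with original reviews for context.
# Assumes all 3 responses parsed in step 10, so df_results has one row per review
# in the same order; if any response was skipped, the rows will not line up.
reviews_with_sentiment = reviews.head(3).reset_index(drop=True)
reviews_with_sentiment = pd.concat([reviews_with_sentiment, df_results], axis=1)
reviews_with_sentiment.to_csv('reviews_analyzed.csv', index=False)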
```

### Archiving Best Practices:

For production pipelines:
1. **Version your results** - Include dates/run IDs
2. **Keep metadata** - Save model name, parameters, timestamps
3. **Document changes** - Maintain a changelog or README
4. **Backup regularly** - Don't rely on a single storage location
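
For the "keep metadata" point, one lightweight option is a JSON sidecar file written next to the results. This is a sketch under assumptions: the `write_run_metadata` helper, the `.meta.json` naming, and the recorded fields are illustrative choices, and the demo writes into a temporary directory.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_run_metadata(results_path, model_name, n_rows, prompt_template):
    """Write a JSON sidecar next to `results_path` describing how it was produced."""
    results_path = Path(results_path)
    metadata = {
        "results_file": results_path.name,
        "model_name": model_name,
        "n_rows": n_rows,
        "prompt_template": prompt_template,
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    sidecar = results_path.with_name(results_path.stem + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_run_metadata(
        Path(tmp) / "ollama_sentiment_results.csv", "gemma3", 3, "Review text:\n{text}\n"
    )
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```

In the notebook you would pass `MODEL_NAME`, `len(df_results)`, and `PROMPT_TEMPLATE` right after saving the CSV.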

### Next Steps After Saving:

✅ Share the CSV with stakeholders  
✅ Build dashboards (Tableau, Power BI, Streamlit)  
✅ Run statistical tests on the results  
✅ Compare with human-labeled data (if available)  
✅ Set up automated reporting  
✅ Integrate into production systems

Your analysis is now preserved and ready for action! 💾

## ✅ Summary and Next Steps

Congratulations! 🎉 You've successfully built an **end-to-end sentiment analysis pipeline** using Ollama. Let's recap what you've accomplished and explore where to go from here.

---

## What You've Built

This pipeline demonstrates:

### 1. **On-Premise LLM Integration** 🏠
- Connected to a local Ollama instance (no cloud dependencies)
- Handled streaming NDJSON responses efficiently
- Implemented retry logic for network resilience

### 2. **Intelligent Data Handling** 🧠
- Automatic column detection using heuristics + LLM fallback
- Flexible approach works across diverse dataset structures
- Multi-strategy system balances speed and accuracy

### 3. **Robust Text Analysis** 📊
- Sentiment classification (positive/negative/neutral)
- Confidence scoring for quality assessment
- Emotion detection for nuanced insights

### 4. **Production-Ready Engineering** ⚙️
- Data validation using Pydantic models
- Error handling and graceful degradation
- Clean code structure with reusable functions

### 5. **Reproducible Workflow** 🔄
- Configuration management (easy to adapt to new environments)
- Clear documentation for future maintainers
- Modular design supports customization

---

## Key Improvements in This Version

Compared to typical tutorials, this notebook features:

✅ **Consolidated imports** - All dependencies declared upfront  
✅ **Reusable helper functions** - `query_ollama` and `extract_json` used throughout  
✅ **No code duplication** - Streaming logic centralized in one place  
✅ **Simplified connectivity test** - From 100+ lines to 15 lines  
✅ **Clearer configuration** - Explicit naming (`FORCE_OLLAMA_COLUMN_SELECTION` vs `FALLBACK_TO_OLLAMA`)  
✅ **Modern Pydantic V2** - Uses `model_dump()` instead of deprecated `.dict()`  
✅ **Comprehensive documentation** - Every cell explained in detail  
✅ **Logical flow** - Setup → Helpers → Validation → Load → Process → Results

---

## Customization Ideas

### Performance Optimization

**For Large Datasets (10k+ reviews):**
1. Use smaller models (Phi-3, Gemma 2B)
2. Implement parallel processing with ThreadPoolExecutor
3. Add checkpointing to resume interrupted jobs
4. Consider GPU acceleration if available
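
As a rough sketch of item 2, the snippet below fans requests out over a thread pool while keeping results in input order. `analyze_parallel` is a hypothetical helper and `fake_query` stands in for a call such as `query_ollama(PROMPT_TEMPLATE.format(text=...))`. Whether this gives a real speed-up depends on how your Ollama server handles concurrent requests, so measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_parallel(texts, query_fn, max_workers=4):
    """Apply `query_fn` to every text concurrently, preserving input order."""
    def safe_call(text):
        try:
            return query_fn(text)
        except Exception as exc:  # keep one failure from aborting the whole batch
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(safe_call, texts))


def fake_query(text):
    """Stand-in for the real model call: fails on purpose for one input."""
    if text == "bad":
        raise ValueError("simulated failure")
    return text.upper()


print(analyze_parallel(["first", "bad", "third"], fake_query, max_workers=2))
```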

**For Real-Time Applications:**
1. Keep model loaded in memory (persistent Ollama service)
2. Use async/await for concurrent requests
3. Cache common queries
4. Set up load balancing across multiple Ollama instances

### Integration Opportunities

**Database Integration:**
```python
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost/mydb')
df_results.to_sql('sentiment_analysis', engine, if_exists='append')
```

**API Endpoint:**
```python
from fastapi import FastAPI
app = FastAPI()

@app.post("/analyze")
def analyze_text(text: str):
    prompt = PROMPT_TEMPLATE.format(text=text)
    result = query_ollama(prompt)
    return json.loads(extract_json(result))
```

**Scheduled Jobs:**
```python
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()

@scheduler.scheduled_job('cron', hour=2)  # Run at 2 AM daily
def nightly_analysis():
    reviews = load_new_reviews()  # Your custom function
    # Run pipeline...
    
scheduler.start()
```

---

## Recommended Experiments

### 1. **Model Comparison**
Try different models and compare results:
- Llama 3 8B (balanced)
- Mistral 7B (fast)
- Gemma 2B (lightweight)
- Qwen 2 (multilingual)

### 2. **Prompt Engineering**
Test variations:
- More explicit instructions
- Few-shot examples in the prompt
- Chain-of-thought reasoning
- Role-playing (e.g., "You are a psychologist")
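
To make the few-shot idea concrete, here is a sketch that builds a prompt from a handful of labeled examples. The `build_few_shot_prompt` helper, the instruction wording, and the example reviews are all made up for illustration; you would supply examples drawn from your own data.

```python
import json


def build_few_shot_prompt(examples, text):
    """Build a prompt with worked (review, answer) pairs followed by the new review."""
    lines = [
        "You are a sentiment analysis expert. Return only a JSON object with",
        "sentiment_label, confidence_score, and emotions.",
        "",
    ]
    for review, answer in examples:
        lines += ["Review text:", review, "JSON:", json.dumps(answer), ""]
    lines += ["Review text:", text, "JSON:"]
    return "\n".join(lines)


# Made-up examples (illustrative only).
examples = [
    ("The staff were kind and patient.",
     {"sentiment_label": "positive", "confidence_score": 0.9, "emotions": ["joy"]}),
    ("I waited two hours and nobody helped.",
     {"sentiment_label": "negative", "confidence_score": 0.9, "emotions": ["anger"]}),
]

print(build_few_shot_prompt(examples, "The appointment was fine, nothing special."))
```

The resulting string can be passed straight to `query_ollama()` in place of `PROMPT_TEMPLATE.format(text=...)`.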

### 3. **Validation Improvements**
Add stricter Pydantic validators:
- Enum constraints for sentiment labels
- Range validation for confidence scores
- Vocabulary checks for emotions
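
A minimal sketch of the first two ideas, assuming only `Literal` types and `Field` bounds: the model below rejects unknown labels and out-of-range scores. `StrictSentimentOutput` and `is_valid` are illustrative names, not part of the notebook's pipeline.

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError


class StrictSentimentOutput(BaseModel):
    # Only these three labels are accepted.
    sentiment_label: Literal["positive", "negative", "neutral"]
    # ge/le enforce the 0 to 1 range.
    confidence_score: float = Field(..., ge=0.0, le=1.0)
    emotions: List[str]


def is_valid(payload):
    """Return True if `payload` satisfies the stricter schema."""
    try:
        StrictSentimentOutput(**payload)
        return True
    except ValidationError:
        return False


print(is_valid({"sentiment_label": "positive", "confidence_score": 0.9, "emotions": ["joy"]}))
print(is_valid({"sentiment_label": "mixed", "confidence_score": 0.9, "emotions": []}))
print(is_valid({"sentiment_label": "neutral", "confidence_score": 1.5, "emotions": []}))
```

A vocabulary check for emotions could follow the same pattern by typing the field as a list of `Literal` values, at the cost of rejecting any emotion outside the chosen list.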

### 4. **Batch Size Tuning**
Find optimal throughput:
```python
import time

batch_sizes = [1, 5, 10, 20]
for size in batch_sizes:
    time_start = time.time()
    # Process 'size' reviews...
    time_elapsed = time.time() - time_start
    print(f"Batch {size}: {time_elapsed:.2f}s")
```

---

## Learning Resources

### Ollama Documentation
- [Official Docs](https://ollama.com/docs)
- [Model Library](https://ollama.com/library)
- [API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)

### Prompt Engineering
- [OpenAI Prompt Engineering Guide](https://platform.openai.com/docs/guides/prompt-engineering)
- [Anthropic Prompt Engineering](https://docs.anthropic.com/claude/docs/prompt-engineering)

### Pydantic
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [Validation Tutorial](https://docs.pydantic.dev/latest/concepts/validators/)

### Sentiment Analysis
- [VADER Sentiment (compare to your results)](https://github.com/cjhutto/vaderSentiment)
- [Hugging Face Sentiment Models](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending)

---

## Common Troubleshooting

| Problem | Solution |
|---------|----------|
| Ollama not responding | Make sure the service is running; start it with `ollama serve` |
| Slow performance | Try smaller model or enable GPU |
| Inconsistent JSON | Make prompt more explicit with examples |
| High error rate | Add retries and improve error handling |
| Memory issues | Process in smaller batches, free cache regularly |

---

## Production Deployment Checklist

Before deploying to production:

- [ ] Test on diverse, representative data samples
- [ ] Benchmark performance and resource usage
- [ ] Set up monitoring and alerting
- [ ] Implement comprehensive error handling
- [ ] Add logging for debugging
- [ ] Document API/interface clearly
- [ ] Plan for model updates and versioning
- [ ] Establish data privacy and security measures
- [ ] Set up backup and disaster recovery
- [ ] Train team members on maintenance

---

## What's Next?

### Short Term (Next Session)
1. Run full dataset analysis (remove `.head(3)` limitation)
2. Generate summary statistics and visualizations
3. Compare results with baseline methods (if available)

### Medium Term (This Week)
1. Experiment with different models and prompts
2. Integrate into existing workflows
3. Build a simple dashboard for stakeholders

### Long Term (This Month)
1. Automate the pipeline with scheduling
2. Set up continuous monitoring
3. Gather feedback and iterate

---

## Final Thoughts

You now have a **flexible, powerful sentiment analysis system** that:
- Runs entirely on your infrastructure
- Requires no cloud API keys or costs
- Adapts to different datasets automatically
- Produces structured, validated output
- Scales from prototypes to production

The techniques you've learned here apply beyond sentiment analysis:
- **Named Entity Recognition** (extract people, places, organizations)
- **Text Classification** (categorize documents by topic)
- **Summarization** (condense long texts)
- **Q&A Systems** (answer questions about documents)
- **Content Moderation** (detect inappropriate content)

**The foundation you've built is a springboard for countless AI applications!** 🚀

---

## We'd Love Your Feedback

If you found this tutorial helpful:
- ⭐ Star the repo on GitHub
- 💬 Share your use case in discussions
- 🐛 Report issues or suggest improvements
- 📚 Contribute additional examples

**Happy analyzing!** 🎉