<h1 style="text-align: center; font-size: 50px;">Scientific Presentation Script Generator with Local LLM & ChromaDB</h1>

## 🎯 **Overview**

This notebook demonstrates how to build a comprehensive **Scientific Presentation Script Generator** using:

- **arXiv Paper Retrieval**: Search and download academic papers  
- **Document Processing**: Text extraction and chunking for optimal processing  
- **Vector Database**: ChromaDB for semantic search and retrieval  
- **Local LLM Integration**: Meta Llama 3.1 model for analysis and generation  
- **Interactive Generation**: Step-by-step script creation with user approval  

**Pipeline Flow**: arXiv → Text Extraction → Vector Storage → Analysis → Script Generation → Interactive Refinement

---

## 🛠 **What You'll Learn**

- Paper retrieval from arXiv using search queries  
- Text extraction and chunking strategies  
- Vector database setup with ChromaDB  
- LLM configuration for local inference  
- Script generation and evaluation workflows  
- MLflow model registration and deployment  

---

## 📋 **Prerequisites**

- LangChain setup and configuration  
- Vector database fundamentals  
- Basic understanding of embeddings and retrieval systems

## Imports

This step installs the necessary libraries for local LLM processing and document analysis.

In [None]:
!pip install -r ../requirements.txt --quiet

In [None]:
# System
import os
import sys
import yaml
import mlflow
import logging
from pathlib import Path
import warnings
import torch

# Add the src directory to the path to import utils
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
from src.utils import configure_hf_cache
from src.utils import configure_proxy
from src.utils import load_config_and_secrets
from src.utils import initialize_llm

# Import transformers from huggingface
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Import components of notebook
from core.extract_text.arxiv_search import ArxivSearcher
from core.generator.script_generator import ScriptGenerator
from core.analyzer.scientific_paper_analyzer import ScientificPaperAnalyzer
from core.deploy.text_generation_service import TextGenerationService

# Import langchain libraries
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEndpoint
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp

# Libraries from python
from typing import List

## Configurations and Secrets Loading


In [None]:
# Suppress Python warnings
warnings.filterwarnings("ignore")

In [None]:
# === Create logger ===
logger = logging.getLogger("text-generation-notebook")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                             datefmt="%Y-%m-%d %H:%M:%S") 

stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False

In [None]:
# Additional standard library imports (os, Path, and mlflow are already imported above)
import time
import json
from operator import itemgetter

# Data processing and similarity utilities
import pandas as pd
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Additional LangChain import (the prompt, parser, and passthrough classes are imported above)
from langchain.schema.runnable import RunnableLambda

# === Additional project-specific helpers (from src.utils) ===
from src.utils import (
    login_huggingface,
    clean_code,
    generate_code_with_retries,
    get_model_context_window,
    get_context_window,
    dynamic_retriever,
    format_docs_with_adaptive_context,
    estimate_tokens_accurate
)
In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

In [None]:
# CONFIG_PATH and SECRETS_PATH are not defined in this notebook, so rely on the helper's defaults
config, secrets = load_config_and_secrets()

In [None]:
# Initialize logging for notebook execution
logger.info('Notebook execution started - Local text generation pipeline')
logger.info('All dependencies loaded successfully')

### Verify Assets

In [None]:
# Load configuration and secrets
config, secrets = load_config_and_secrets()

# Configure proxy if specified in config
configure_proxy(config)

print("✅ Configuration loaded successfully.")
print(f"📁 Model source: {config.get('model_source', 'local')}")

# Setup HuggingFace authentication if available
if "HUGGINGFACE_API_KEY" in secrets:
    try:
        login_huggingface(secrets)
    except Exception as e:
        print(f"⚠️ HuggingFace login failed: {e}")
else:
    print("ℹ️ No HuggingFace API key found - using models without authentication")

### Proxy Configuration

On some enterprise networks, you may need to configure proxy settings to reach external services. If that applies to you, set the `proxy` field in your `config.yaml`; the following cell then configures the necessary environment variable.

In [None]:
configure_proxy(config)
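
The internals of `configure_proxy` are not shown in this notebook. As a rough illustration only, a helper of this kind could read the `proxy` field and export it through the conventional `HTTP_PROXY` and `HTTPS_PROXY` variables. The function name `configure_proxy_sketch`, the choice of variables, and the `env` parameter are assumptions made for this sketch, not the project's actual implementation.

```python
import os
from typing import Mapping, MutableMapping, Optional


def configure_proxy_sketch(
    config: Mapping[str, object],
    env: Optional[MutableMapping[str, str]] = None,
) -> bool:
    """Hypothetical stand-in for src.utils.configure_proxy.

    Copies config["proxy"] into HTTP_PROXY and HTTPS_PROXY (an assumption
    about which variables matter) and reports whether anything was set.
    """
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    proxy = config.get("proxy")
    if not proxy:
        return False  # No proxy configured, leave the environment untouched.
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        env[name] = str(proxy)
    return True


# Example with a throwaway dict so the real environment is not modified.
demo_env = {}
was_set = configure_proxy_sketch({"proxy": "http://proxy.example:8080"}, demo_env)
print(was_set, demo_env)
```

Passing a dictionary instead of `os.environ` keeps the example free of side effects; in the notebook itself the real helper is called with `config` only.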

### Hugging Face Cache Configuration

The next cell configures the Hugging Face cache so that downloaded models persist locally, even after the workspace is closed. Built-in support for this is a desired future feature for AI Studio and the GenAI addon.

In [None]:
# Configure HuggingFace cache
configure_hf_cache()
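
The body of `configure_hf_cache` is not shown here. One plausible approach, sketched below under that caveat, is to point the `HF_HOME` environment variable at a persistent directory. The function name `configure_hf_cache_sketch`, its parameters, and the example path are hypothetical; only `HF_HOME` itself is a real Hugging Face setting.

```python
import os
import tempfile
from pathlib import Path
from typing import MutableMapping, Optional, Union


def configure_hf_cache_sketch(
    cache_dir: Union[str, Path],
    env: Optional[MutableMapping[str, str]] = None,
) -> Path:
    """Hypothetical stand-in for src.utils.configure_hf_cache.

    Creates cache_dir if needed and records it in HF_HOME so that
    Hugging Face libraries store downloads there.
    """
    env = os.environ if env is None else env
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    env["HF_HOME"] = str(path)
    return path


# Demonstration in a temporary directory with a throwaway environment dict.
with tempfile.TemporaryDirectory() as tmp:
    demo_env = {}
    cache_path = configure_hf_cache_sketch(Path(tmp) / "hf_cache", env=demo_env)
    created = cache_path.is_dir()
    print(created, demo_env["HF_HOME"])
```

A setting like this generally needs to be in place before the Hugging Face libraries are imported, since they tend to read their cache location at import time.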

In [None]:
# Initialize HuggingFace Embeddings
embeddings = HuggingFaceEmbeddings()

## 🎯 Step 4: Script Generation and Evaluation

### Interactive Script Generation

The ScriptGenerator orchestrates the prompt flow, allowing users to generate each section of the presentation interactively, with built-in approval workflows for quality control.

**Key Features:**
- **Interactive Approval**: Review and approve each generated section  
- **Iterative Refinement**: Regenerate content until satisfied  
- **Structured Output**: Organized presentation script format  
- **Context-Aware Generation**: Uses analyzed content for accurate scripts
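
The approval workflow described above can be pictured as a small loop: generate a draft, ask for approval, and regenerate until the draft is accepted or an attempt limit is reached. The sketch below is not the `ScriptGenerator` implementation; `generate_with_approval`, its callbacks, and `max_attempts` are hypothetical names used only to illustrate the pattern.

```python
from typing import Callable, Dict, List


def generate_with_approval(
    sections: List[str],
    generate: Callable[[str, int], str],
    approve: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Generate each section, regenerating until approved or attempts run out."""
    script: Dict[str, str] = {}
    for section in sections:
        draft = ""
        for attempt in range(1, max_attempts + 1):
            draft = generate(section, attempt)
            if approve(section, draft):
                break  # Reviewer accepted this draft.
        # The last draft is kept even if it was never approved.
        script[section] = draft
    return script


# Stub callbacks: a fake generator and a reviewer that only accepts second drafts.
def fake_generate(section: str, attempt: int) -> str:
    return f"{section} draft v{attempt}"


def fake_approve(section: str, draft: str) -> bool:
    return draft.endswith("v2")


result = generate_with_approval(["title", "introduction"], fake_generate, fake_approve)
print(result)
```

In an interactive notebook, the `approve` callback could wrap `input()` to ask the user a yes/no question, and `generate` could call the LLM with the section's prompt.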

### 1. ✅ **Local LLM Initialization**

Initialize the local language model for content analysis and script generation.

In [None]:
# Configuration is already loaded - no additional setup needed
print("✅ Environment configured for local LLM processing")
print("🔧 Ready to proceed with document analysis and script generation")

In [None]:
# Initialize the language model for script generation
print("🤖 Initializing local language model...")

try:
    # Load LLM using configuration
    llm = initialize_llm(
        model_source=config.get("model_source", "local"),
        secrets=secrets
    )
    print("✅ Language model loaded successfully!")
    print(f"📊 Context window: {get_context_window(llm)} tokens")
    
except Exception as e:
    print(f"❌ Error loading language model: {e}")
    raise

### 🧱 Step 2: Processing and Embedding Generation
In this step, we transform the raw text extracted from the papers into structured embeddings that can be stored and retrieved efficiently in the RAG pipeline.

The flow includes three main stages:

1. **📄 Create Document Objects**
The full text of each paper is wrapped into Document objects — a standard structure used by LangChain to manage and manipulate textual data.

2. **✂️ Split Text into Chunks**
Using LangChain's RecursiveCharacterTextSplitter, the documents are segmented into smaller blocks (chunks) based on character limits. This makes the downstream embedding and retrieval process more effective.

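The `chunk_size` parameter defines the maximum length of each chunk in characters, and `chunk_overlap` sets how many characters consecutive chunks may share so that context is not lost at chunk boundaries.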

3. **📊 Generate Embeddings**
Each text chunk is converted into a vector representation (embedding) using HuggingFaceEmbeddings. These embeddings are later used to populate the vector store and serve as the foundation for similarity-based retrieval in the generation step.
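
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only shows the size and overlap arithmetic.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 400) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # How far the window advances each time.
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text.
    return chunks


sample = "abcdefghij" * 300  # 3000 characters of toy text
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With the values used in this notebook (1200 and 400), each window advances by 800 characters, so the last 400 characters of one chunk reappear at the start of the next.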



In [None]:
# Build one Document per paper. This assumes `papers` is a list of dicts with
# 'text' and 'title' keys produced by the arXiv retrieval step.
# Each Document holds the article text plus a metadata dictionary containing the title.
documents = [Document(page_content=paper['text'], metadata={"title": paper['title']}) for paper in papers]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=400)
splits = text_splitter.split_documents(documents)

### 🧩 Step 3: Vector Data Storage and Retrieval
This step handles the storage of embeddings into a vector database and configures a retriever to enable similarity-based search — a key component in the RAG pipeline.

**🧠 Store Embeddings with Chroma**

The segmented text chunks are embedded and stored in a local vector store using ChromaDB. This enables efficient access to semantically similar information later on.

**🔎 Configure the Retriever**

After storing the embeddings, a retriever is set up to perform similarity search queries. This retriever is responsible for:

- Receiving a user query or prompt

- Searching through the stored embeddings

- Returning the most relevant chunks based on vector similarity

> 📦 This mechanism allows the generation model to work with only the most relevant information, improving accuracy and reducing hallucinations.
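
The retrieval idea can be illustrated without any vector database: score every stored vector against the query vector and keep the best matches. The sketch below uses cosine similarity on toy two-dimensional vectors purely for illustration. The metric the vector store actually uses depends on its configuration, and `cosine_similarity` and `top_k` are hypothetical helpers, not part of Chroma or LangChain.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar chunk vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: the query points mostly along the first axis.
toy_chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
results = top_k([1.0, 0.1], toy_chunks, k=2)
print(results)
```

A real vector store replaces the linear scan with an index so that lookups stay fast as the number of chunks grows.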

In [None]:
# Build the vector database: Chroma embeds each chunk with `embeddings` and stores the result
vectordb = Chroma.from_documents(documents=splits, embedding=embeddings)

In [None]:
retriever = vectordb.as_retriever()

## 🧠 Chapter 2: Building a Prompt Flow for Generating Scientific Presentation Scripts
In this chapter, we build a prompt flow to generate a complete scientific presentation script using LLMs. Each section of the script (e.g., title, introduction, methodology) is created individually through dedicated prompt templates.

The process is composed of three main steps:

1. 🧠 **Model Selection**
Choose the best-suited LLM for the generation task, depending on performance or local availability.

2. 🔍 **Analysis with ScientificPaperAnalyzer**
Using the component ScientificPaperAnalyzer, a custom LangChain chain is built to analyze the scientific paper and generate context-aware responses.

3. 🧾 **Script Generation**
The ScriptGenerator orchestrates the prompt flow, allowing users to generate each section of the presentation interactively.
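
The notion of one dedicated prompt template per script section can be sketched with plain string templates. The section names and template wording below are invented for illustration and are not the prompts used by `ScriptGenerator`; `SECTION_PROMPTS` and `build_prompt` are hypothetical names.

```python
from typing import Dict

# Hypothetical templates: one prompt per section of the presentation script.
SECTION_PROMPTS: Dict[str, str] = {
    "title": "Propose a concise presentation title for this paper:\n{context}",
    "introduction": "Write a short spoken introduction based on:\n{context}",
    "methodology": "Explain the methodology for a live audience using:\n{context}",
}


def build_prompt(section: str, context: str) -> str:
    """Fill the template for the given section with retrieved paper context."""
    if section not in SECTION_PROMPTS:
        raise KeyError(f"No prompt template defined for section '{section}'")
    return SECTION_PROMPTS[section].format(context=context)


prompt = build_prompt("title", "A paper about attention-based sequence models.")
print(prompt)
```

Keeping the templates in a dictionary makes it easy to add or reorder sections without touching the generation logic.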

#### ⚙️ Step 4: Configure Environment


In [None]:
# Configuration for the script generation project
PROJECT_NAME = 'Academic Script Generator'
print(f"✅ Project configured: {PROJECT_NAME}")
print("🚀 Ready for script generation pipeline")

## Local Environment Setup

This section configures the local environment for script generation. The following steps will:

1. ✅ **Initialize Local Configuration**
2. ✅ **Set Up Script Generator**
3. ✅ **Configure Content Generation Parameters**

The ScriptGenerator orchestrates the prompt flow, allowing users to generate each section of the presentation interactively, with all processing done locally using the configured LLM.

In [None]:
# Configure environment for local development
import os
from datetime import datetime

# Set up working directory and logging
WORK_DIR = os.getcwd()
TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")

print(f"✅ Working directory: {WORK_DIR}")
print(f"✅ Session timestamp: {TIMESTAMP}")
print("🔧 Environment ready for script generation")

In [None]:
# Initialize Script Generator
from core.generator.script_generator import ScriptGenerator

# Instantiate generator
generator = ScriptGenerator()

# Initialize
print("✅ Script generator initialized successfully")
print("🚀 Ready to generate academic scripts with LLM")

### ✅ Step 6: Run and Approve
The ScriptGenerator component is responsible for generating each section of the scientific presentation script in an interactive and human-in-the-loop fashion.

In [None]:
# Configure content generation parameters
generation_config = {
    "topic": "The Impact of Artificial Intelligence on Modern Education",
    "script_type": "academic_presentation",
    "duration_minutes": 10,
    "target_audience": "university_students",
    "tone": "informative_engaging"
}

# Display configuration
print("📝 Content Generation Configuration:")
for key, value in generation_config.items():
    print(f"   {key}: {value}")
    
print("\n✅ Configuration set - ready to generate script content")

## Model Service

In this section, we implement the **Model Service**, a REST API responsible for serving the language model. The API is automatically documented using Swagger (via FastAPI), enabling interactive testing and clear documentation of the endpoints.


In [None]:
# Generate academic script content
print("🚀 Starting script generation...")

try:
    # Generate script using the configured parameters
    generated_script = generator.generate_script(
        topic=generation_config["topic"],
        script_type=generation_config["script_type"],
        duration_minutes=generation_config["duration_minutes"],
        target_audience=generation_config["target_audience"],
        tone=generation_config["tone"]
    )
    
    print("✅ Script generation completed successfully!")
    print(f"📄 Generated script length: {len(generated_script)} characters")
    
    # Display first 500 characters as preview
    print("\n📖 Script Preview:")
    print("-" * 50)
    print(generated_script[:500] + "..." if len(generated_script) > 500 else generated_script)
    print("-" * 50)
    
except Exception as e:
    print(f"❌ Error during script generation: {str(e)}")
    print("Please check your configuration and try again.")

In [None]:
# Analyze generated script locally
print("📊 Local Script Analysis")
print("=" * 40)

if 'generated_script' in locals():
    # Basic text analysis
    word_count = len(generated_script.split())
    char_count = len(generated_script)
    estimated_reading_time = word_count / 150  # Assumes a pace of roughly 150 words per minute
    
    print(f"📈 Word count: {word_count}")
    print(f"📈 Character count: {char_count}")
    print(f"⏱️  Estimated reading time: {estimated_reading_time:.1f} minutes")
    print(f"🎯 Target duration: {generation_config['duration_minutes']} minutes")
    
    # Check if content meets target duration
    duration_diff = abs(estimated_reading_time - generation_config['duration_minutes'])
    if duration_diff <= 1:
        print("✅ Script duration matches target well!")
    elif estimated_reading_time < generation_config['duration_minutes']:
        print("⚠️  Script may be shorter than target duration")
    else:
        print("⚠️  Script may be longer than target duration")
        
    print("\n🎉 Local analysis completed!")
else:
    print("❌ No generated script found. Please run the generation cell first.")

Built with ❤️ using Z by HP AI Studio.