# 🔗 Tutorial 2: LangChain Fundamentals

**Welcome back!** Now that you know how to chat with AI, let's learn how to work with documents and text files using LangChain.

## 🎯 What You'll Learn:
- What LangChain really does
- How to load and read PDF files
- How to break documents into smaller pieces
- How to create "prompts" (instructions for AI)
- How to build simple AI workflows

## ⏱️ Time: 25-30 minutes
## 📚 Level: Beginner
## 📋 Prerequisites: Tutorial 1 completed

## 🤔 What is LangChain Really?

In Tutorial 1, you used LangChain to chat with AI. But LangChain can do much more!

### 🔧 Think of LangChain as a "Toolkit" for AI:
- **Document Loaders**: Read PDF, Word, and text files
- **Text Splitters**: Break long documents into chunks
- **Prompts**: Create better instructions for AI
- **Chains**: Connect multiple AI steps together
- **Memory**: Help AI remember previous conversations

### 🌟 Real-World Example:
Instead of just asking "What's 2+2?", you can:
1. Load a 50-page research paper
2. Ask "What are the main findings in this paper?"
3. Get intelligent answers based on the actual content

## 📚 Step 1: Setup and Imports

Let's import the LangChain tools we'll need for working with documents.

In [None]:
# Import the tools we need
import sys
sys.path.append('..')  # This lets us use our project files

# LangChain tools for working with documents
from langchain_ollama import ChatOllama
from langchain_community.document_loaders import PyPDFLoader, TextLoader
# Newer LangChain releases ship the splitters in the langchain_text_splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

print("📚 LangChain document tools imported!")
print("🔧 Ready to work with PDF files and text documents")

## 📄 Step 2: Loading Your First Document

Let's learn how to load a PDF file. We'll use the example paper that comes with this project.

In [None]:
# Load a PDF file
print("📄 Loading a PDF document...")

# Path to our example paper
pdf_path = "../examples/d4sc03921a.pdf"

# Create a PDF loader
pdf_loader = PyPDFLoader(pdf_path)

# Load the document
print("⏳ Reading PDF... (this might take a few seconds)")
documents = pdf_loader.load()

print("✅ PDF loaded successfully!")
print(f"📊 Number of pages: {len(documents)}")
print(f"📝 First page has {len(documents[0].page_content)} characters")

# Let's see what the first page looks like
print("\n📖 First 200 characters of the paper:")
print(documents[0].page_content[:200] + "...")

## ✂️ Step 3: Breaking Documents into Chunks

AI models can only read a limited amount of text at once. So we need to break long documents into smaller "chunks".
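
To get a feel for that limit, here is a minimal sketch that estimates whether a piece of text fits in a model's context window. It assumes roughly 4 characters per token, which is only a rule of thumb for English text and not an exact tokenizer. The helper names `estimate_tokens` and `fits_in_context` and the 512-token reserve for the answer are hypothetical choices for illustration.

```python
# Rough rule of thumb (an assumption, not an exact tokenizer):
# English text averages about 4 characters per token.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for English text (rounded up)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fits_in_context(text: str, context_tokens: int = 4096, reserved_for_answer: int = 512) -> bool:
    """Check whether the text likely fits, leaving room for the model's reply."""
    return estimate_tokens(text) <= context_tokens - reserved_for_answer


short_page = "word " * 500        # 2,500 characters
long_paper = "word " * 40_000     # 200,000 characters

print(estimate_tokens(short_page), fits_in_context(short_page))
print(estimate_tokens(long_paper), fits_in_context(long_paper))
```

A single page fits comfortably, while a whole paper does not, which is why the document gets split before it is sent to the model.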

### 🧩 Why Split Documents?
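- **AI Limitation**: A model reads only a fixed number of tokens at once. This tutorial sets the limit to 4,096 tokens (`num_ctx=4096` in Step 5), roughly 3,000 English words, and that budget covers the prompt, the document text, and the answer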
- **Better Answers**: Smaller chunks = more focused responses
- **Efficiency**: Only use relevant parts of the document
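
The idea of chunk size and overlap can be sketched in plain Python. This is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to cut at paragraph and line boundaries, while this hypothetical `split_with_overlap` helper cuts at fixed character positions. It only shows what the two numbers mean.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2,500 characters of repeating digits, used as dummy text
text = "".join(str(i % 10) for i in range(2500))
pieces = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)

print(len(pieces))
print([len(p) for p in pieces])
# The end of one chunk is repeated at the start of the next
print(pieces[0][-200:] == pieces[1][:200])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk.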

In [None]:
# Create a text splitter
print("✂️ Breaking document into smaller chunks...")

# Tries to split on paragraph breaks first, then line breaks, then spaces,
# so chunks end at natural boundaries where possible
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum chunk length, in characters
    chunk_overlap=200,      # Neighboring chunks share up to 200 characters
    length_function=len     # Measure length by counting characters
)

# Split our PDF into chunks
chunks = text_splitter.split_documents(documents)

print(f"✅ Document split into {len(chunks)} chunks")
print(f"📏 Average chunk size: {sum(len(chunk.page_content) for chunk in chunks) // len(chunks)} characters")

# Let's look at the first chunk
print("\n📝 First chunk:")
print(chunks[0].page_content)
print(f"\n📊 This chunk has {len(chunks[0].page_content)} characters")

## 💬 Step 4: Creating Better Prompts

A "prompt" is like giving instructions to the AI. Good prompts get better answers!

### 🎯 Prompt Tips:
- Be specific about what you want
- Give context and examples
- Tell the AI what format to use
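
A prompt template is, at its core, a string with named placeholders that get filled in each time you ask a question. The sketch below uses plain `str.format` as an analogy. It is not how `ChatPromptTemplate` is implemented, and the `fill_template` helper and the sample sentence are made up for illustration.

```python
TEMPLATE = (
    "Here is a piece of text from a research paper:\n"
    "{document_text}\n\n"
    "Question: {question}\n"
)


def fill_template(template: str, **values: str) -> str:
    """Substitute each {placeholder} in the template with the matching value."""
    return template.format(**values)


prompt = fill_template(
    TEMPLATE,
    document_text="Water boils at 100 degrees Celsius at sea level.",
    question="At what temperature does water boil?",
)
print(prompt)
```

Because the instructions stay fixed and only the placeholders change, every question is asked in the same consistent way.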

In [None]:
# Let's create a prompt template
print("💬 Creating a smart prompt template...")

# This is a template for asking questions about documents
prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant that reads scientific papers and answers questions.

Here is a piece of text from a research paper:
{document_text}

Question: {question}

Please provide a clear, accurate answer based only on the text above. 
If the answer isn't in the text, say "I can't find that information in this text."
""")

print("✅ Prompt template created!")
print("🎯 This template will help us ask better questions about documents")

# Let's see what the template looks like
print("\n📋 Our prompt template structure:")
print("   1. Tells AI it's a helpful assistant")
print("   2. Gives it the document text")
print("   3. Asks the specific question")
print("   4. Instructs how to answer")

## 🔗 Step 5: Building Your First Chain

A "chain" connects multiple steps together. Let's build a chain that can answer questions about documents.
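
To see how steps can be joined with the `|` symbol, here is a toy sketch in plain Python. The `Step` class is hypothetical and far simpler than LangChain's real machinery, and the "model" here is a fake function that just echoes its input. The point is only that each step's output becomes the next step's input.

```python
class Step:
    """Wrap a function so steps can be joined with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Build a new step that runs self first, then feeds the result to other.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        """Run the step (or the whole joined pipeline) on one input."""
        return self.func(value)


# Three toy steps standing in for prompt -> model -> output parser.
build_prompt = Step(lambda inputs: f"Question: {inputs['question']}")
fake_model = Step(lambda prompt: {"content": f"  (echo) {prompt}  "})
parse_text = Step(lambda message: message["content"].strip())

chain = build_prompt | fake_model | parse_text
print(chain.invoke({"question": "What is a chain?"}))
```

Calling `invoke` once runs all three steps in order, so the caller never handles the intermediate values.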

In [None]:
# Create an AI assistant
ai_assistant = ChatOllama(
    model="llama3.1:8b",
    temperature=0.1,    # Low temperature for factual answers
    num_ctx=4096        # Context window size, in tokens
)

# Create a chain: Prompt → AI → Text Output
print("🔗 Building a document analysis chain...")

# This chain connects: prompt template → AI → text parser
document_chain = prompt_template | ai_assistant | StrOutputParser()

print("✅ Chain created!")
print("🔧 Chain flow: Prompt → AI → Clean Text Output")
print("📚 Ready to answer questions about documents!")

## 🎯 Step 6: Ask Questions About the Document

Now let's use our chain to ask questions about the research paper!

In [None]:
# Ask a question about the first chunk
print("🎯 Asking a question about the document...")
print("⏳ This might take 10-20 seconds...")

# Our question
question = "What is this paper about? Give me a brief summary."

# Use the chain to get an answer
answer = document_chain.invoke({
    "document_text": chunks[0].page_content,  # Use first chunk
    "question": question
})

print(f"\n❓ Question: {question}")
print("\n🤖 AI Answer:")
print(answer)

print("\n💡 The AI read the document chunk and answered based on that content!")

In [None]:
# Try another question
question2 = "Who are the authors of this paper?"

print(f"❓ Question: {question2}")
print("🤖 AI Answer:")

answer2 = document_chain.invoke({
    "document_text": chunks[0].page_content,
    "question": question2
})

print(answer2)

print("\n💡 If the author list appears in this chunk, the AI can report it. Otherwise it should say it can't find the information.")

## 🔍 Step 7: Searching Through Multiple Chunks

What if the answer isn't in the first chunk? Let's search through multiple chunks to find information.
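
Asking the AI about every chunk in order can be slow. One cheap way to decide which chunks to check first is to count the words each chunk shares with the question. The sketch below shows that idea. It is a different, simpler technique than the search function in this step, the helper names `keywords` and `rank_chunks` are hypothetical, and the sample chunks are made up.

```python
import re


def keywords(text: str) -> set[str]:
    """Lowercase words of four or more letters, a crude stand-in for important terms."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def rank_chunks(question: str, chunk_texts: list[str]) -> list[int]:
    """Return chunk indexes ordered by how many question keywords each chunk shares."""
    wanted = keywords(question)
    scores = [len(wanted & keywords(chunk)) for chunk in chunk_texts]
    # sorted() is stable, so chunks with equal scores keep their document order.
    return sorted(range(len(chunk_texts)), key=lambda i: scores[i], reverse=True)


# Made-up sample chunks for illustration
sample_chunks = [
    "The authors thank the funding agencies for their support.",
    "Our main findings show that the catalyst doubles the reaction yield.",
    "Experimental methods: samples were heated for two hours.",
]
order = rank_chunks("What are the main findings?", sample_chunks)
print(order)
```

Word matching misses synonyms and paraphrases, so treat it as a rough pre-filter only.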

In [None]:
# Function to search through chunks for an answer
def search_document(question, chunks, max_chunks=3):
    """Search through document chunks to find the best answer"""
    
    print(f"🔍 Searching through {min(max_chunks, len(chunks))} chunks for: {question}")
    
    best_answer = "No relevant information found."
    
    for i, chunk in enumerate(chunks[:max_chunks]):
        print(f"   📄 Checking chunk {i+1}...")
        
        answer = document_chain.invoke({
            "document_text": chunk.page_content,
            "question": question
        })
        
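        # Treat the answer as useful unless it contains the fallback wording.
        # Normalize curly apostrophes first, and match a full phrase so ordinary
        # words such as "not included" do not trigger a false rejection.
        normalized = answer.lower().replace("’", "'")
        if "can't find" not in normalized and "not in the text" not in normalized: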
            best_answer = answer
            print(f"   ✅ Found answer in chunk {i+1}!")
            break
    
    return best_answer

# Test our search function
question = "What are the main conclusions or findings?"
answer = search_document(question, chunks, max_chunks=5)

print(f"\n❓ Question: {question}")
print("\n🎯 Best Answer Found:")
print(answer)

## 📝 Step 8: Working with Different Document Types

LangChain can work with many types of documents. Let's try creating and loading a simple text file.
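
When a project mixes file types, a small lookup keyed on the file extension can decide which loader to use. The sketch below returns only the loader's class name as a string so that it runs without LangChain installed. The mapping and the `pick_loader_name` helper are hypothetical, and the two class names are the loaders imported in Step 1.

```python
from pathlib import Path

# Loader class names used in this tutorial, keyed by file extension.
# The mapping itself is a hypothetical helper, not part of LangChain.
LOADER_NAMES = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
}


def pick_loader_name(path: str) -> str:
    """Return the name of the loader class that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_NAMES:
        raise ValueError(f"No loader configured for '{suffix}' files")
    return LOADER_NAMES[suffix]


print(pick_loader_name("../examples/d4sc03921a.pdf"))
print(pick_loader_name("../tutorial/sample_text.TXT"))
```

Raising an error for unknown extensions makes an unsupported file obvious right away instead of failing later during loading.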

In [None]:
# Create a simple text file to practice with
sample_text = """
Introduction to Machine Learning

Machine learning is a type of artificial intelligence that allows computers to learn 
and make decisions without being explicitly programmed for every task.

There are three main types of machine learning:
1. Supervised Learning: Learning from examples with correct answers
2. Unsupervised Learning: Finding patterns in data without examples
3. Reinforcement Learning: Learning through trial and error with rewards

Applications of machine learning include:
- Email spam detection
- Image recognition
- Voice assistants
- Recommendation systems

Machine learning is used in many industries including healthcare, finance, 
transportation, and entertainment.
"""

# Save it as a text file
with open("../tutorial/sample_text.txt", "w", encoding="utf-8") as f:
    f.write(sample_text)

print("📝 Created a sample text file!")

# Load the text file
text_loader = TextLoader("../tutorial/sample_text.txt", encoding="utf-8")
text_documents = text_loader.load()

print("✅ Text file loaded!")
print(f"📊 Document length: {len(text_documents[0].page_content)} characters")

# Ask a question about our text
question = "What are the three types of machine learning?"
answer = document_chain.invoke({
    "document_text": text_documents[0].page_content,
    "question": question
})

print(f"\n❓ Question: {question}")
print(f"🤖 Answer: {answer}")

## 🧪 Step 9: Experiment Time!

Now it's your turn to experiment. Try different questions and see what happens!

In [None]:
# 🎯 YOUR TURN: Ask any question about the documents

# Try questions about the research paper or the machine learning text
your_question = "What are some applications of machine learning?"  # 👈 Change this!

print(f"❓ Your Question: {your_question}")
print("\n🤖 Answer about the text file:")

# Answer from the text file
answer = document_chain.invoke({
    "document_text": text_documents[0].page_content,
    "question": your_question
})
print(answer)

print("\n🔍 Answer from searching the research paper:")
# Search through the research paper
paper_answer = search_document(your_question, chunks, max_chunks=3)
print(paper_answer)

print("\n💡 Notice how the AI gives different answers based on different documents!")

## 🎓 What You've Learned

**Excellent progress!** You've now learned the fundamentals of working with documents using LangChain.

### ✅ **Key Concepts:**
- **Document Loaders**: How to read PDF and text files
- **Text Splitting**: Breaking long documents into manageable chunks
- **Prompts**: Creating better instructions for AI
- **Chains**: Connecting multiple steps together
- **Document Search**: Finding information across multiple chunks

### ✅ **Skills You've Gained:**
- Loading and processing PDF files
- Creating intelligent prompt templates
- Building AI chains for document analysis
- Searching through documents for specific information
- Working with different document types

### 🚀 **What's Next:**
In **Tutorial 3**, you'll learn about **RAG (Retrieval-Augmented Generation)**:
- What RAG means and why it's powerful
- How to create embeddings (lists of numbers that capture the meaning of text)
- How to find the most relevant parts of documents
- Building a smart question-answering system

### 🎯 **Practice Ideas:**
- Try loading your own PDF files
- Experiment with different chunk sizes
- Create prompts for different types of questions
- Test the system with various document types

## 🏆 Final Challenge

Test your understanding with this challenge!

In [None]:
# 🏆 CHALLENGE: Create a document summarizer
print("🏆 FINAL CHALLENGE: Build a Document Summarizer")
print("=" * 50)

# A prompt template that asks for a summary (try writing your own version first!)
summary_prompt = ChatPromptTemplate.from_template("""
Please read this text and provide a brief, 2-3 sentence summary:

{text}

Summary:
""")

# A chain for summarizing: prompt → AI → text output
summary_chain = summary_prompt | ai_assistant | StrOutputParser()

# Test it on our machine learning text
summary = summary_chain.invoke({"text": text_documents[0].page_content})

print("📄 Original Text Length:", len(text_documents[0].page_content), "characters")
print("\n📝 AI Summary:")
print(summary)
print("\n📊 Summary Length:", len(summary), "characters")

print("\n🎉 Challenge Complete! You've built a document summarizer!")
print("🚀 Ready for Tutorial 3: Understanding RAG")