<a href="https://colab.research.google.com/github/Rohit-Munda/GenAIWorkshop/blob/main/langchain_text_splitting_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 LangChain Workshop: Document Loading & Text Splitting
Welcome! In this hands-on session, we will:
- Load a document using LangChain
- Explore different text splitting techniques

In [6]:
!pip install -q langchain-community langchain-text-splitters tiktoken


## 📄 Step 1: Create and Load a Sample Document

In [7]:

from langchain_community.document_loaders import TextLoader

# Create a sample text file
sample_text = (
    "Artificial Intelligence (AI) is transforming industries across the globe. "
    "From healthcare to finance, AI applications are driving innovation. "
    "Large Language Models (LLMs) are at the core of this revolution, enabling machines to understand and generate human-like language. "
    "In this session, we explore how to prepare documents for LLMs using LangChain."
)

with open("sample_doc.txt", "w", encoding="utf-8") as f:
    f.write(sample_text)

# Load the file into a list of Document objects (one Document per file)
loader = TextLoader("sample_doc.txt", encoding="utf-8")
documents = loader.load()

print("Loaded document:")
print(documents[0].page_content)


Loaded document:
Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications are driving innovation. Large Language Models (LLMs) are at the core of this revolution, enabling machines to understand and generate human-like language. In this session, we explore how to prepare documents for LLMs using LangChain.


## ✂️ Step 2: Basic Text Splitting with `CharacterTextSplitter`

In [8]:

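from langchain_text_splitters import CharacterTextSplitter

# Split on spaces, then merge words into chunks of at most 40 characters,
# repeating up to 10 characters of trailing words at the start of the next chunk
splitter = CharacterTextSplitter(separator=" ", chunk_size=40, chunk_overlap=10)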
basic_chunks = splitter.split_documents(documents)

print(f"Number of chunks: {len(basic_chunks)}")
for i, chunk in enumerate(basic_chunks):
    print(f"Chunk {i+1}: {chunk.page_content}")


Number of chunks: 12
Chunk 1: Artificial Intelligence (AI) is
Chunk 2: (AI) is transforming industries across
Chunk 3: across the globe. From healthcare to
Chunk 4: to finance, AI applications are driving
Chunk 5: driving innovation. Large Language
Chunk 6: Language Models (LLMs) are at the core
Chunk 7: the core of this revolution, enabling
Chunk 8: enabling machines to understand and
Chunk 9: and generate human-like language. In
Chunk 10: In this session, we explore how to
Chunk 11: how to prepare documents for LLMs using
Chunk 12: LLMs using LangChain.
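
The overlap is easiest to see by writing the merge step out by hand. The following is a minimal pure-Python sketch, not LangChain's actual implementation: the helper name `split_with_overlap` is hypothetical, and it assumes the greedy behaviour visible in the output above (split on the separator, fill each chunk up to `chunk_size` characters, then carry over trailing words totalling at most `chunk_overlap` characters).

```python
def split_with_overlap(text, chunk_size=40, chunk_overlap=10, separator=" "):
    """Greedy split-and-merge sketch (hypothetical helper, not the LangChain API)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The next piece does not fit: emit the chunk built so far.
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail is within the
            # overlap budget and leaves room for the next piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# First two sentences of the sample document
text = (
    "Artificial Intelligence (AI) is transforming industries across the globe. "
    "From healthcare to finance, AI applications are driving innovation."
)
chunks = split_with_overlap(text)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk}")
```

Because whole words are carried over, the overlap is usually shorter than `chunk_overlap`: "(AI) is" is only 7 characters, since adding "Intelligence" would exceed the 10-character budget.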


## 🔁 Step 3: Recursive Splitting with `RecursiveCharacterTextSplitter`

In [9]:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Default separators are tried in order: "\n\n", "\n", " ", then ""
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=15)
recursive_chunks = recursive_splitter.split_documents(documents)

print(f"Number of chunks: {len(recursive_chunks)}")
for i, chunk in enumerate(recursive_chunks):
    print(f"Chunk {i+1}: {chunk.page_content}")


Number of chunks: 11
Chunk 1: Artificial Intelligence (AI) is transforming
Chunk 2: transforming industries across the globe. From
Chunk 3: globe. From healthcare to finance, AI
Chunk 4: to finance, AI applications are driving
Chunk 5: are driving innovation. Large Language Models
Chunk 6: Models (LLMs) are at the core of this revolution,
Chunk 7: revolution, enabling machines to understand and
Chunk 8: understand and generate human-like language. In
Chunk 9: language. In this session, we explore how to
Chunk 10: explore how to prepare documents for LLMs using
Chunk 11: for LLMs using LangChain.
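
The sample document is a single line, so only the space separator comes into play above. To show what "recursive" means when a text does contain paragraph breaks, here is a simplified pure-Python sketch. It is an illustration under stated assumptions rather than LangChain's implementation: the helper `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops the separators it splits on.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting without overlap (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur: fall through to the next, finer one.
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Too long even on its own: flush, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif not current:
            current = piece
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current += sep + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer and will not fit in one chunk."
)
result = recursive_split(text, chunk_size=40)
for chunk in result:
    print(repr(chunk))
```

The short first paragraph survives as one chunk, while the longer second paragraph falls through to word-level splitting. Coarse separators are only abandoned for pieces that exceed `chunk_size`.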


## 🧮 Step 4: Token-based Splitting with `TokenTextSplitter`

In [10]:

from langchain_text_splitters import TokenTextSplitter

# chunk_size and chunk_overlap are counted in tiktoken tokens, not characters
token_splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)
token_chunks = token_splitter.split_documents(documents)

print(f"Number of chunks: {len(token_chunks)}")
for i, chunk in enumerate(token_chunks):
    print(f"Chunk {i+1}: {chunk.page_content}")


Number of chunks: 5
Chunk 1: Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications
Chunk 2:  to finance, AI applications are driving innovation. Large Language Models (LLMs) are at the core
Chunk 3: ) are at the core of this revolution, enabling machines to understand and generate human-like language.
Chunk 4:  human-like language. In this session, we explore how to prepare documents for LLMs using Lang
Chunk 5:  for LLMs using LangChain.
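
Note that chunk 2 opens with " to finance, AI applications", the last stretch of chunk 1: that is the 5-token overlap. Token-based splitting can be pictured as a fixed-size window sliding over the token sequence. The sketch below shows only that windowing step, with plain integers standing in for token ids. The helper name `sliding_token_windows` is hypothetical, and a real splitter would also encode the text to ids first and decode each window back to text.

```python
def sliding_token_windows(tokens, chunk_size=20, chunk_overlap=5):
    """Return overlapping windows over a token list (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):
            break  # the final window reached the end of the sequence
        start += step
    return windows


token_ids = list(range(50))  # stand-in for ids produced by a real tokenizer
windows = sliding_token_windows(token_ids, chunk_size=20, chunk_overlap=5)
for i, window in enumerate(windows, start=1):
    print(f"Window {i}: ids {window[0]}..{window[-1]} ({len(window)} tokens)")
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last `chunk_overlap` tokens of one window reappear at the start of the next. Because the cut points fall on token boundaries rather than word boundaries, a word can be split across chunks, as with "Lang" and "Chain" in chunk 4 above.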



## 📊 Comparison of Text Splitting Strategies

| Splitter Type | Description | Use Case |
|---------------|-------------|----------|
| `CharacterTextSplitter` | Splits on a single separator (such as a space or newline) and merges the pieces into chunks measured in characters | Simple and fast; good for small documents with regular structure |
| `RecursiveCharacterTextSplitter` | Tries a list of separators in order (by default paragraph breaks, line breaks, spaces, then single characters) | Keeps paragraphs and lines together when they fit in a chunk; sentences can still be cut unless sentence separators are supplied |
| `TokenTextSplitter` | Splits by token count rather than character count | Ideal for staying within the input token limits of transformer models |

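Each strategy has its advantages depending on the structure and size of your documents. Use `RecursiveCharacterTextSplitter` when you want chunks to follow the document's natural boundaries (paragraphs, then lines, then words) as far as `chunk_size` allows. Use `TokenTextSplitter` when working with models that have strict token limits, since its chunk sizes are measured in tokens rather than characters.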


## ✅ Summary
In this notebook, you've learned how to:
- Load documents using LangChain
- Apply different text splitting techniques:
  - Character-based splitting
  - Recursive splitting
  - Token-based splitting

These techniques are essential for preparing data for large language models (LLMs).