# Chunking Techniques in This Notebook

## Overview
This notebook demonstrates how to split a large text file into manageable chunks using both rule-based and LLM-based techniques. Chunking is essential for tasks such as information retrieval, semantic search, and efficient processing of long documents.

## Techniques Used

### 1. Rule-Based Chunking
- **Tool Used:** `RecursiveCharacterTextSplitter` from `langchain_text_splitters`
- **Parameters:**
    - `chunk_size=500`: Each chunk contains up to 500 characters.
    - `chunk_overlap=50`: Consecutive chunks share up to 50 characters to preserve context across chunk boundaries.
- **Process:**
    - The text is read from `text_book.txt`.
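    - The splitter tries its separators in order (by default paragraph breaks, then line breaks, then spaces, then single characters) and packs the pieces into chunks that respect the specified size and overlap.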
    - The resulting chunks are stored in the `documents` list.
- **Advantages:**
    - Simple and fast.
    - Preserves context with overlap.
- **Limitations:**
    - May split mid-sentence or mid-paragraph when a passage exceeds `chunk_size`.
    - Relies on separators and length only; it has no notion of topic or meaning.
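
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators); the function name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    this ignores separators and cuts purely by character position.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in fixed_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why a larger overlap produces more chunks for the same text.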

### 2. LLM-Based Chunking
- **Tool Used:** Google Gemini API via `genai.Client`
- **Process:**
    - A prompt is created to instruct the LLM to chunk the text.
    - The Gemini model generates chunked content based on the prompt.
- **Advantages:**
    - Can understand semantic boundaries (e.g., paragraphs, topics).
    - More flexible and context-aware.
- **Limitations:**
    - Requires API access and may incur costs.
    - Slower than rule-based methods.
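
Because the model returns free-form text, it can help to ask for an explicit delimiter and split on it afterwards. The sketch below is one possible approach, not the prompt used in this notebook; the delimiter `<<<CHUNK>>>` and the helper names `build_chunking_prompt` and `parse_chunks` are assumptions made for illustration.

```python
CHUNK_DELIMITER = "<<<CHUNK>>>"  # hypothetical marker, not part of the notebook


def build_chunking_prompt(text: str) -> str:
    """Build a prompt asking the model to separate chunks with a fixed delimiter."""
    return (
        "Split the following text into semantically coherent chunks. "
        "Do not rewrite, summarize, or omit any of the text. "
        f"Separate consecutive chunks with a line containing only {CHUNK_DELIMITER}.\n\n"
        f"{text}"
    )


def parse_chunks(response_text: str) -> list[str]:
    """Split the model's reply on the delimiter and drop empty pieces."""
    parts = response_text.split(CHUNK_DELIMITER)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    reply = "Intro paragraph.\n<<<CHUNK>>>\nSecond topic.\n"
    print(parse_chunks(reply))
```

The string returned by `build_chunking_prompt(text)` could be passed as `contents` to the Gemini call, and `parse_chunks(response.text)` would then yield a list comparable to `documents`. The model is not guaranteed to follow the delimiter instruction, so the parsed result should be checked before use.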

## Applications
- **Vector Database Storage:** Chunks are stored in a ChromaDB collection for efficient retrieval and semantic search.
- **Information Retrieval:** Queries can be run against the chunked data to find relevant information.
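
As a sketch of the storage step, the chunks can be prepared as parallel lists of IDs, documents, and metadata and added in a single call rather than one call per chunk. The helper `build_add_payload` is hypothetical; the `book_{i}` naming mirrors the IDs used in this notebook.

```python
def build_add_payload(chunks: list[str], prefix: str = "book") -> dict[str, list]:
    """Build parallel ids/documents/metadatas lists for a batched collection.add call."""
    ids = [f"{prefix}_{i}" for i in range(len(chunks))]  # book_0, book_1, ...
    return {
        "ids": ids,
        "documents": list(chunks),
        "metadatas": [{"source": chunk_id} for chunk_id in ids],
    }


if __name__ == "__main__":
    payload = build_add_payload(["first chunk", "second chunk"])
    print(payload["ids"])
    # With a ChromaDB collection available, the payload could be stored with:
    # collection.add(**payload)
```

Batching avoids one round trip per chunk, which may matter once the number of chunks grows well beyond the few dozen produced here.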

## Summary Table

| Technique           | Tool/Library                     | Pros                                    | Cons                                         |
|---------------------|----------------------------------|-----------------------------------------|----------------------------------------------|
| Rule-Based Chunking | `RecursiveCharacterTextSplitter` | Fast, simple, overlap preserves context | May split mid-sentence; ignores semantics    |
| LLM-Based Chunking  | Google Gemini API                | Semantic, context-aware                 | Requires API access, may incur costs, slower |

## Conclusion
Chunking is a crucial preprocessing step for handling large texts in NLP and information retrieval. This notebook showcases both rule-based and LLM-based chunking, highlighting their strengths and weaknesses for different use cases.

In [None]:
with open("text_book.txt", "r", encoding="utf-8") as file: # Opening File (assumes the file is UTF-8 encoded)
    text = file.read()

In [3]:
print(text)

ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING: A COMPREHENSIVE GUIDE

Table of Contents:
1. Introduction to Artificial Intelligence
2. Machine Learning Fundamentals
3. Deep Learning and Neural Networks
4. Natural Language Processing
5. Computer Vision
6. AI Ethics and Future Considerations

CHAPTER 1: INTRODUCTION TO ARTIFICIAL INTELLIGENCE

What is Artificial Intelligence?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.

The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn and adapt to new data without being assisted by humans.

Histor

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter # Text Splitter

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50) # Chunk Size and Overlap


In [None]:
documents = splitter.split_text(text) # Splitting Text into Chunks

In [None]:
print(len(documents)) # Number of Chunks

54
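
A quick sanity check on the split is to look at the distribution of chunk lengths. The following is a minimal sketch; `summarize_chunk_lengths` is a hypothetical helper, not part of the notebook or of LangChain.

```python
from statistics import mean


def summarize_chunk_lengths(chunks: list[str]) -> dict[str, float]:
    """Return the count plus min, max, and mean character length of the chunks."""
    if not chunks:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


if __name__ == "__main__":
    print(summarize_chunk_lengths(["ab", "abcd"]))
```

Calling `summarize_chunk_lengths(documents)` on the chunks above would be expected to report a maximum no larger than the configured `chunk_size` of 500.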


In [None]:
print(documents[21]) # Example Chunk

Key NLP Tasks

Text Classification:
Assigning predefined categories to text documents. Examples include:
- Sentiment analysis: Determining if text expresses positive, negative, or neutral sentiment
- Spam detection: Identifying unwanted emails
- Topic classification: Categorizing news articles by subject


In [None]:
import chromadb # Vector Database

In [None]:
chroma_client = chromadb.PersistentClient("./chunking_db") # Persistent Client

In [None]:
collection = chroma_client.get_or_create_collection(name="books_chunking") # Create Collection

In [None]:
for i, doc in enumerate(documents): # Adding Documents to Collection
    collection.add(
        documents=[doc],
        metadatas=[{"source": f"book_{i}"}], # One metadata dict per document
        ids=[f"book_{i}"] # book_0, book_1, ...
    )


In [None]:
results = collection.query(n_results=3, query_texts=["What is RAG?"]) # Querying
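
The query result is a dictionary of nested lists, with one inner list per query text. A minimal sketch for pairing each match with its ID and distance follows; it assumes the default result fields (`ids`, `documents`, `distances`) are present, and `top_matches` is a hypothetical helper.

```python
def top_matches(results: dict, query_index: int = 0) -> list[tuple[str, str, float]]:
    """Pair up id, document, and distance for one query in a ChromaDB-style result."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    distances = results["distances"][query_index]
    return list(zip(ids, docs, distances))


if __name__ == "__main__":
    # Stand-in data shaped like a query result; real values come from collection.query.
    fake_results = {
        "ids": [["book_3", "book_7"]],
        "documents": [["first match", "second match"]],
        "distances": [[0.1, 0.4]],
    }
    for chunk_id, doc, distance in top_matches(fake_results):
        print(f"{chunk_id} (distance={distance}): {doc[:80]}")
```

Smaller distances indicate closer matches, so the first tuple is the chunk ChromaDB considers most relevant to the query.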

In [None]:
# create prompt for chunking
prompt = f"""Create chunking for the following text:
{text}
"""

In [None]:
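# LLM Based Chunking
import os
from dotenv import load_dotenv
from google import genai

load_dotenv(override=True) # Load variables from .env, overriding existing ones
GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY") # The key is read from GOOGLE_API_KEY
if not GEMINI_API_KEY:
    raise RuntimeError("GOOGLE_API_KEY is not set") # Fail early with a clear message

client = genai.Client(api_key=GEMINI_API_KEY)

response = client.models.generate_content(
    model="gemini-2.5-flash", contents=prompt
)
print(response.text)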

Here's the text chunked based on its logical structure, primarily using headings and subheadings:

---

**Chunk 1: Title and Main Overview**
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING: A COMPREHENSIVE GUIDE

**Chunk 2: Table of Contents**
Table of Contents:
1. Introduction to Artificial Intelligence
2. Machine Learning Fundamentals
3. Deep Learning and Neural Networks
4. Natural Language Processing
5. Computer Vision
6. AI Ethics and Future Considerations

**Chunk 3: CHAPTER 1: INTRODUCTION TO ARTIFICIAL INTELLIGENCE - What is Artificial Intelligence?**
CHAPTER 1: INTRODUCTION TO ARTIFICIAL INTELLIGENCE
What is Artificial Intelligence?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.

The ideal characteristic of artificial intelligence is its ability to rationaliz