# Advanced Chunking Techniques for Preprocessing Textual Data

In [17]:
text = """

**The History and Future of Artificial Intelligence**  

### 1. Introduction  
Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect of modern technology. Once confined to academic discussions, AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI evolve, and where is it headed?  

### 2. The Early Days of AI  
The idea of artificial intelligence dates back to ancient myths, where mechanical beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially marked the birth of AI as a field. Early AI models relied on rule-based systems, which worked well for structured tasks but struggled with uncertainty and complexity.  

### 3. The Rise of Machine Learning  
By the 1990s, a shift occurred—machine learning (ML) began to outperform traditional rule-based AI. Instead of manually encoding rules, ML models learned patterns from data. Breakthroughs in deep learning, powered by neural networks, revolutionized fields like image recognition, natural language processing (NLP), and game-playing AI. Notable achievements include IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997 and Google DeepMind’s AlphaGo outplaying human Go masters.  

### 4. Modern AI Applications  
AI now permeates various industries:  
- **Healthcare**: AI assists in diagnosing diseases, predicting patient outcomes, and personalizing treatments.  
- **Finance**: Fraud detection and algorithmic trading rely heavily on AI-driven analytics.  
- **Transportation**: Autonomous vehicles use AI to navigate complex environments.  
- **Entertainment**: Recommendation systems, such as those used by Netflix and Spotify, tailor content based on user preferences.  

### 5. Challenges and Ethical Considerations  
Despite its potential, AI faces several challenges:  
- **Bias and Fairness**: AI models can inherit biases from training data, leading to unfair outcomes.  
- **Privacy Concerns**: The collection of massive datasets raises questions about user privacy and data security.  
- **Job Displacement**: Automation threatens traditional job roles, prompting discussions on reskilling the workforce.  
- **AI Safety**: As AI systems become more autonomous, ensuring they align with human values is critical.  

### 6. The Future of AI  
Looking ahead, AI research focuses on:  
- **Explainable AI (XAI)**: Making AI decisions transparent and interpretable.  
- **General AI**: Developing systems that can reason and learn across multiple domains.  
- **AI and Creativity**: From generating art to composing music, AI is expanding its creative potential.  
- **Human-AI Collaboration**: Future AI systems will likely augment human intelligence rather than replace it.  

### 7. Conclusion  
Artificial intelligence is not just a technological evolution—it is a societal transformation. While challenges remain, AI’s potential to enhance lives is undeniable. As research progresses, the key will be balancing innovation with ethical responsibility, ensuring AI serves humanity's best interests.  

"""

## Fixed-Size (Sliding Window) Chunking

🔹 How It Works:

Splits text into fixed-size chunks (e.g., 200 characters).
Uses overlapping windows to retain context.
🔹 Best For:
✅ Large language models (LLMs) – Ensures even chunk sizes for token-limited models.
✅ Text retrieval (RAG) – Maintains context across chunks.
✅ Summarization & Sentiment Analysis – Ensures all chunks have similar length.

🔥 Most Used In:

LangChain-based RAG pipelines
LLM context windows (GPT, Claude, Gemini)
Embedding storage in vector databases
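
The window arithmetic described above can be checked with a small sketch. The helper name `window_starts` is hypothetical; it only assumes that each window begins `chunk_size - overlap` characters after the previous one.

```python
def window_starts(text_len, chunk_size, overlap):
    """Return the start offset of every sliding window over a text of text_len characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    return list(range(0, text_len, stride))


# A 250-character text with 100-character windows and a 20-character overlap
print(window_starts(250, 100, 20))  # [0, 80, 160, 240]
```

The number of chunks is therefore about `len(text) / (chunk_size - overlap)`, so a larger overlap means more chunks to embed and store.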

In [2]:
def fixed_size_chunking(text, chunk_size=200, overlap=50):
    """
    Splits text into fixed-size overlapping chunks.
    
    Parameters:
        text (str): The input text to be chunked.
        chunk_size (int): The length of each chunk.
        overlap (int): The number of overlapping characters between consecutive chunks.

    Returns:
        List[str]: A list of text chunks.
    """
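    # Guard against a non-advancing window: overlap >= chunk_size would loop forever
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be smaller than chunk_size")

    chunks = []
    start = 0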
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        
        # Move the window forward by chunk_size - overlap
        start += chunk_size - overlap
    
    return chunks

chunks = fixed_size_chunking(text, chunk_size=100, overlap=20)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Chunk 1:


**The History and Future of Artificial Intelligence**  

### 1. Introduction  
Artificial Intellig

Chunk 2:

Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect o

Chunk 3:
fundamental aspect of modern technology. Once confined to academic discussions, AI now powers everyd

Chunk 4:
AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI ev

Chunk 5:
s. But how did AI evolve, and where is it headed?  

### 2. The Early Days of AI  
The idea of artif

Chunk 6:
  
The idea of artificial intelligence dates back to ancient myths, where mechanical beings were dep

Chunk 7:
ical beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan

Chunk 8:
n in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence 

Chunk 9:
achine intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially m

C

## Sentence-Based Chunking (sent_tokenize)

🔹 How It Works:

Splits text at sentence boundaries using NLTK’s sent_tokenize().
🔹 Best For:
✅ Extractive summarization – Ensures full sentences are maintained.
✅ Dialogue-based AI – Keeps sentence structure intact for chatbots.
✅ Fine-tuned LLMs – Works well when sentence integrity is important.

🔥 Most Used In:

Summarization tasks
Text classification & topic modeling
Dialogue systems (chatbots, FAQ bots)
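
Single sentences are often too short to embed or retrieve on their own, so a common follow-up is to pack consecutive sentences into chunks under a size budget. The sketch below illustrates that idea only. `naive_sentences` is a hypothetical regex stand-in for `sent_tokenize` so the example runs without NLTK data, and `pack_sentences` and `max_chars` are likewise illustrative names.

```python
import re


def naive_sentences(text):
    """Rough sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(sentences, max_chars=200):
    """Greedily join consecutive sentences until adding one more would exceed max_chars."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One two. Three four! Five six? Seven."
print(pack_sentences(naive_sentences(sample), max_chars=20))
# ['One two. Three four!', 'Five six? Seven.']
```

A sentence longer than `max_chars` is kept whole in its own chunk, so the budget is a target rather than a hard limit.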


In [3]:
import nltk
from nltk.tokenize import sent_tokenize

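# Newer NLTK releases look up 'punkt_tab' instead of 'punkt', so fetch both.
nltk.download('punkt')
nltk.download('punkt_tab')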

def sentence_based_chunking(text):
    """
    Splits text into chunks based on sentences using nltk's sent_tokenize.
    
    Parameters:
        text (str): The input text to be chunked.

    Returns:
        List[str]: A list of sentence-based chunks.
    """
    sentences = sent_tokenize(text)
    return sentences

chunks = sentence_based_chunking(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/Users/kevin.garrison/nltk_data'
    - '/Users/kevin.garrison/VS_Code_Projekte/Generativ-AI-techniques/.venv/nltk_data'
    - '/Users/kevin.garrison/VS_Code_Projekte/Generativ-AI-techniques/.venv/share/nltk_data'
    - '/Users/kevin.garrison/VS_Code_Projekte/Generativ-AI-techniques/.venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Document-Based Chunking (PyPDFLoader)

🔹 How It Works:

Loads a document and returns one chunk per page. The example uses PyPDFLoader for PDFs; Word or HTML files need a different loader.
🔹 Best For:
✅ Processing large documents – Keeps structural context (chapters, sections).
✅ Legal, research, and academic papers – Ensures integrity across pages.
✅ Enterprise AI – Business documents, compliance reports, contracts.

🔥 Most Used In:

LangChain document retrieval (RAG)
Enterprise search
AI-powered contract analysis

In [5]:
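# Recent LangChain releases ship document loaders in the langchain_community package;
# on older installs, use `from langchain.document_loaders import PyPDFLoader` instead.
from langchain_community.document_loaders import PyPDFLoader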

def document_based_chunking(pdf_path):
    """
    Splits a document into chunks using LangChain's PyPDFLoader.
    
    Parameters:
        pdf_path (str): Path to the PDF document.

    Returns:
        List[str]: A list of text chunks from the document.
    """
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    return [page.page_content for page in pages]

# Example usage:
pdf_path = "test.pdf"  # Provide the path to a PDF file
chunks = document_based_chunking(pdf_path)
for i, chunk in enumerate(chunks[:3]):  # Display first 3 chunks
    print(f"Chunk {i+1}:\n{chunk}\n")


Chunk 1:
Scopes of Alignment
Kush R. Varshney, Zahra Ashktorab, Djallel Bouneffouf, Matthew Riemer, and Justin D. Weisz
IBM Research
1101 Kitchawan Road
Yorktown Heights, NY 10598
Abstract
Much of the research focus on AI alignment seeks to
align large language models and other foundation mod-
els to the context-less and generic values of helpfulness,
harmlessness, and honesty. Frontier model providers
also strive to align their models with these values. In this
paper, we motivate why we need to move beyond such
a limited conception and propose three dimensions for
doing so. The first scope of alignment is competence:
knowledge, skills, or behaviors the model must possess
to be useful for its intended purpose. The second scope
of alignment is transience: either semantic or episodic
depending on the context of use. The third scope of
alignment is audience: either mass, public, small-group,
or dyadic. At the end of the paper, we use the proposed
framework to position some technologies an

## Semantic-Based Chunking (SentenceTransformer + KMeans)

🔹 How It Works:

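Embeds each sentence with a Sentence Transformers model (all-MiniLM-L6-v2 here) and groups similar sentences with KMeans clustering. Clusters ignore sentence order, so a chunk can mix sentences from different parts of the text.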
🔹 Best For:
✅ RAG & AI search – Creates contextually meaningful chunks.
✅ Question-answering systems – Keeps related information together.
✅ Context-aware embeddings – Enhances vector database search efficiency.

🔥 Most Used In:

FAISS, Pinecone, Weaviate (for semantic search)
Q&A retrieval (e.g., legal, customer support, finance)
AI-powered summarization
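
Clustering is one way to group sentences by meaning. A different and order-preserving variant starts a new chunk wherever the similarity between neighbouring sentences drops. The sketch below shows that alternative, not the KMeans approach. To stay dependency-free it uses bag-of-words counts as a stand-in for real sentence embeddings; `bow_vector`, `similarity_breakpoint_chunks`, and the `threshold` value are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Toy 'embedding': lowercase word counts (a stand-in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_breakpoint_chunks(sentences, threshold=0.2, embed=bow_vector):
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    chunks, current, prev_vec = [], [], None
    for sentence in sentences:
        vec = embed(sentence)
        if current and cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["Cats chase mice.", "Mice fear cats.", "Stocks fell sharply.", "Stocks may recover."]
print(similarity_breakpoint_chunks(sentences))
# ['Cats chase mice. Mice fear cats.', 'Stocks fell sharply. Stocks may recover.']
```

Passing a real embedding function as `embed` would require a matching similarity function; the overall loop stays the same.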

In [6]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def semantic_based_chunking(text, num_clusters=5):
    """
    Splits text into semantically similar chunks using Sentence Transformers and KMeans clustering.
    
    Parameters:
        text (str): The input text to be chunked.
        num_clusters (int): Number of clusters to create.

    Returns:
        List[str]: A list of semantically similar text chunks.
    """
    sentences = text.split(". ")  # Basic sentence splitting
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)
    
    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    kmeans.fit(embeddings)
    
    clustered_sentences = {i: [] for i in range(num_clusters)}
    for i, label in enumerate(kmeans.labels_):
        clustered_sentences[label].append(sentences[i])
    
    return [". ".join(clustered_sentences[i]) for i in range(num_clusters)]


chunks = semantic_based_chunking(text, num_clusters=3)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")




Chunk 1:
Instead of manually encoding rules, ML models learned patterns from data.  
- **Finance**: Fraud detection and algorithmic trading rely heavily on AI-driven analytics.  
- **Entertainment**: Recommendation systems, such as those used by Netflix and Spotify, tailor content based on user preferences. Challenges and Ethical Considerations  
Despite its potential, AI faces several challenges:  
- **Bias and Fairness**: AI models can inherit biases from training data, leading to unfair outcomes.  
- **Privacy Concerns**: The collection of massive datasets raises questions about user privacy and data security.  
- **Job Displacement**: Automation threatens traditional job roles, prompting discussions on reskilling the workforce.  
- **AI Safety**: As AI systems become more autonomous, ensuring they align with human values is critical.  
- **General AI**: Developing systems that can reason and learn across multiple domains

Chunk 2:


**The History and Future of Artificial Intelligen



## Overlapping Chunking

🔹 How It Works:

Uses the same sliding window as Fixed-Size Chunking above: each chunk repeats the last `overlap` characters (e.g., 50) of the previous one to maintain context.
🔹 Best For:
✅ LLMs that need smooth transitions – Ensures less context loss between chunks.
✅ Text retrieval systems – Helps prevent missing relevant info in vector search.
✅ Summarization & context preservation – Improves coherence in AI outputs.

🔥 Most Used In:

RAG pipelines with OpenAI / Cohere embeddings
LLM-based document processing
Text segmentation for retrieval models
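
Character-based windows can cut a word in half at a chunk boundary. One possible refinement, sketched here under the assumption that whitespace-separated words are an acceptable unit, is to count both the window and the overlap in words. The function name and parameters are hypothetical.

```python
def overlapping_word_chunks(text, chunk_words=20, overlap_words=5):
    """Sliding window over words, so no chunk starts or ends in the middle of a word."""
    if chunk_words <= 0 or not 0 <= overlap_words < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap_words < chunk_words")
    words = text.split()  # note: original whitespace and line breaks are not preserved
    stride = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reached the end; skip a redundant tail chunk
    return chunks


print(overlapping_word_chunks("a b c d e f", chunk_words=4, overlap_words=2))
# ['a b c d', 'c d e f']
```

The early `break` avoids emitting a final chunk that only repeats words already covered by the previous window.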


In [7]:
def overlapping_chunking(text, chunk_size=200, overlap=50):
    """
    Splits text into overlapping chunks.
    
    Parameters:
        text (str): The input text to be chunked.
        chunk_size (int): The length of each chunk.
        overlap (int): The number of overlapping characters between consecutive chunks.

    Returns:
        List[str]: A list of overlapping text chunks.
    """
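    # Guard against a non-advancing window: overlap >= chunk_size would loop forever
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be smaller than chunk_size")

    chunks = []
    start = 0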
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        
        # Move the window forward by chunk_size - overlap
        start += chunk_size - overlap
    
    return chunks


chunks = overlapping_chunking(text, chunk_size=100, overlap=20)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:


**The History and Future of Artificial Intelligence**  

### 1. Introduction  
Artificial Intellig

Chunk 2:

Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect o

Chunk 3:
fundamental aspect of modern technology. Once confined to academic discussions, AI now powers everyd

Chunk 4:
AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI ev

Chunk 5:
s. But how did AI evolve, and where is it headed?  

### 2. The Early Days of AI  
The idea of artif

Chunk 6:
  
The idea of artificial intelligence dates back to ancient myths, where mechanical beings were dep

Chunk 7:
ical beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan

Chunk 8:
n in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence 

Chunk 9:
achine intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially m

C

## Recursive Chunking (RecursiveCharacterTextSplitter)

🔹 How It Works:

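Tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) and only falls back to a finer separator when a piece is still larger than chunk_size.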
🔹 Best For:
✅ Hierarchical text processing – Ideal for structured content like books & reports.
✅ Multi-level chunking – Ensures optimal granularity for embeddings.
✅ Hybrid LLM workflows – Combines large & small context windows.

🔥 Most Used In:

LangChain-based document loaders
Legal, research, and knowledge management AI
Large dataset processing
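
The recursive idea can be illustrated without LangChain. The following is a simplified sketch, not the library's implementation: it has no overlap handling, and the function name `recursive_split` is hypothetical. It tries coarse separators first and recurses with finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, coarsest separator first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


print(recursive_split("alpha beta\n\ngamma delta epsilon", chunk_size=12))
# ['alpha beta', 'gamma delta', 'epsilon']
```

Because paragraph breaks are tried before spaces, short paragraphs stay intact and only long ones are broken at word boundaries.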


In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_chunking(text, chunk_size=200, overlap=50):
    """
    Splits text into recursive character chunks using LangChain's RecursiveCharacterTextSplitter.
    
    Parameters:
        text (str): The input text to be chunked.
        chunk_size (int): The length of each chunk.
        overlap (int): The number of overlapping characters between consecutive chunks.

    Returns:
        List[str]: A list of recursively split text chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len
    )
    
    return splitter.split_text(text)


chunks = recursive_chunking(text, chunk_size=100, overlap=20)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:
**The History and Future of Artificial Intelligence**

Chunk 2:
### 1. Introduction

Chunk 3:
Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect

Chunk 4:
fundamental aspect of modern technology. Once confined to academic discussions, AI now powers

Chunk 5:
AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI

Chunk 6:
But how did AI evolve, and where is it headed?

Chunk 7:
### 2. The Early Days of AI

Chunk 8:
The idea of artificial intelligence dates back to ancient myths, where mechanical beings were

Chunk 9:
beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan

Chunk 10:
in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence

Chunk 11:
intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially marked

Chunk 12:
officially marked the birth of AI as a field. Early 

## Agentic Chunking (AutoTokenizer + Seq2SeqLM)

🔹 How It Works:

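In principle, a transformer model (GPT, T5, BART) decides where each chunk should end. The code below is only a placeholder for that idea: it passes the text through the T5 tokenizer and returns it as a single chunk, and the loaded model is never asked to pick boundaries.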
🔹 Best For:
✅ Adaptive AI workflows – Uses ML-based decision-making to split text.
✅ Chatbots & conversational AI – Keeps dialogue intent in context.
✅ Multi-turn LLM interactions – Adjusts chunking dynamically.

🔥 Most Used In:

AI-powered text summarization
Conversational AI (Rasa, Dialogflow, LangChain)
Personalized AI agents
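
One way to picture model-driven chunking is a loop that asks a decision function, for each new sentence, whether it belongs with the chunk built so far. The sketch below is an illustration under stated assumptions: `same_chunk` stands in for an LLM call, and `shared_word_decider` is a deliberately crude toy replacement so the example runs offline. Neither name comes from any library.

```python
import re
from typing import Callable, List


def agentic_chunking_sketch(sentences: List[str], same_chunk: Callable[[str, str], bool]) -> List[str]:
    """Grow a chunk sentence by sentence; the decider says when to start a new one."""
    chunks, current = [], []
    for sentence in sentences:
        if current and not same_chunk(" ".join(current), sentence):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def shared_word_decider(chunk_so_far: str, sentence: str) -> bool:
    """Toy stand-in for an LLM judgment: same chunk if any word longer than 3 letters is shared."""
    def content_words(s):
        return {w for w in re.findall(r"[a-z]+", s.lower()) if len(w) > 3}
    return bool(content_words(chunk_so_far) & content_words(sentence))


sentences = ["Neural networks learn patterns.", "Deep networks need data.", "Bread needs yeast."]
print(agentic_chunking_sketch(sentences, shared_word_decider))
# ['Neural networks learn patterns. Deep networks need data.', 'Bread needs yeast.']
```

In a real pipeline the decider would send both strings to a model and parse a yes/no answer, which costs one model call per sentence.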

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def agentic_chunking(text, model_name="t5-small", max_length=200):
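    """
    Placeholder for agentic chunking: round-trips the text through a tokenizer.

    No model decides the chunk boundaries here. The text is tokenized,
    truncated to max_length tokens, and decoded back into a single chunk.
    
    Parameters:
        text (str): The input text to be chunked.
        model_name (str): The pre-trained model to use (default: "t5-small").
        max_length (int): Maximum token length per chunk.

    Returns:
        List[str]: A one-element list holding the (possibly truncated) text.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Loaded but never called: this placeholder does not use the model.
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
    # truncation=True drops everything past max_length tokens
    tokenized_text = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=max_length)
    
    # Decoding the token ids restores the text with whitespace normalized
    chunks = tokenizer.batch_decode(tokenized_text["input_ids"], skip_special_tokens=True)
    return chunks


# len(text) is a character count, used here as a loose upper bound on the token count
chunks = agentic_chunking(text, model_name="t5-small", max_length=len(text))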

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Chunk 1:
**The History and Future of Artificial Intelligence** ### 1. Introduction Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect of modern technology. Once confined to academic discussions, AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI evolve, and where is it headed? ### 2. The Early Days of AI The idea of artificial intelligence dates back to ancient myths, where mechanical beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially marked the birth of AI as a field. Early AI models relied on rule-based systems, which worked well for structured tasks but struggled with uncertainty and complexity. ### 3. The Rise of Machine Learning By the 1990s, a shift occurred—machine learning (ML) began to

## Content-Aware Chunking (Chapters, Sections, Headers)

🔹 How It Works:

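Uses regular expressions to detect headings, sections, and content markers. Because the pattern is a capturing group, re.split returns each header and its body as separate chunks.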
🔹 Best For:
✅ Technical & structured documents – Extracts meaningful sections.
✅ Legal, research, and educational documents – Maintains structure integrity.
✅ RAG pipelines for knowledge bases – Keeps structured data clean.

🔥 Most Used In:

PDF, HTML, and Markdown processing
Enterprise search & knowledge management
AI-powered research assistants
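
For retrieval it is often more useful to keep each header together with the text beneath it. The sketch below is one possible way to do that for Markdown-style `#` headers; the function name `header_sections` and the dictionary layout are assumptions made for illustration.

```python
import re


def header_sections(markdown_text):
    """Return one dict per section: header level, title, and the body text under it."""
    header_re = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)
    matches = list(header_re.finditer(markdown_text))
    sections = []

    # Text before the first header (if any) becomes a level-0 section
    first_start = matches[0].start() if matches else len(markdown_text)
    preamble = markdown_text[:first_start].strip()
    if preamble:
        sections.append({"level": 0, "title": "", "body": preamble})

    for i, match in enumerate(matches):
        # A section's body runs until the next header or the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        sections.append({
            "level": len(match.group(1)),
            "title": match.group(2).strip(),
            "body": markdown_text[match.end():end].strip(),
        })
    return sections


doc = "Intro line\n## A\nbody a\n### B\nbody b\n"
for section in header_sections(doc):
    print(section)
```

Keeping the header level makes it possible to rebuild the document hierarchy later, for example to attach a parent section title to each chunk.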


In [11]:
import re

def content_aware_chunking(text):
    """
    Splits text into chunks based on content structure, such as chapters, sections, and headers.
    
    Parameters:
        text (str): The input text to be chunked.

    Returns:
        List[str]: A list of content-aware chunks.
    """
    # Define regex pattern to identify sections, headers, or chapter markers
    pattern = r'(?<=\n)(# .*|## .*|### .*)'  # Matches headers like # Title, ## Section, ### Subsection
    
    # Split text based on headers
    sections = re.split(pattern, text)
    
    # Filter out empty sections
    chunks = [s.strip() for s in sections if s.strip()]
    
    return chunks

chunks = content_aware_chunking(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:
**The History and Future of Artificial Intelligence**

Chunk 2:
### 1. Introduction

Chunk 3:
Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect of modern technology. Once confined to academic discussions, AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI evolve, and where is it headed?

Chunk 4:
### 2. The Early Days of AI

Chunk 5:
The idea of artificial intelligence dates back to ancient myths, where mechanical beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially marked the birth of AI as a field. Early AI models relied on rule-based systems, which worked well for structured tasks but struggled with uncertainty and complexity.

Chunk 6:
### 3. The Rise of Machine Learning

Chunk 7:
B

## Token-Based Chunking (GPT2Tokenizer)

🔹 How It Works:

Uses LLM tokenizers (e.g., GPT2Tokenizer) to split text based on token limits.
🔹 Best For:
✅ Optimizing LLM calls – Ensures chunks fit within model limits (e.g., OpenAI’s 4K/16K tokens).
✅ Preprocessing for transformer-based models – Avoids out-of-memory errors.
✅ Long-form document processing – Keeps context within LLM constraints.

🔥 Most Used In:

OpenAI GPT / Claude / Gemini API calls
Text-to-embedding pipelines
Memory-efficient NLP processing


In [12]:
from transformers import GPT2Tokenizer

def token_based_chunking(text, model_name="gpt2", chunk_size=50):
    """
    Splits text into chunks based on token length using a GPT-2 tokenizer.
    
    Parameters:
        text (str): The input text to be chunked.
        model_name (str): The tokenizer model to use (default: "gpt2").
        chunk_size (int): Maximum token length per chunk.

    Returns:
        List[str]: A list of token-based text chunks.
    """
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    tokens = tokenizer.encode(text)
    
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunks.append(chunk_text)
    
    return chunks

chunks = token_based_chunking(text, model_name="gpt2", chunk_size=20)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:


**The History and Future of Artificial Intelligence**  

### 1. Introduction 

Chunk 2:
 
Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect of modern

Chunk 3:
 technology. Once confined to academic discussions, AI now powers everyday applications, from voice assistants to medical diagn

Chunk 4:
ostics. But how did AI evolve, and where is it headed?  

### 2

Chunk 5:
. The Early Days of AI  
The idea of artificial intelligence dates back to ancient myths,

Chunk 6:
 where mechanical beings were depicted as sentient. However, modern AI research formally began in the 1950s.

Chunk 7:
 Alan Turing, a pioneer in computing, introduced the concept of machine intelligence through his famous "Turing

Chunk 8:
 Test." In 1956, the Dartmouth Conference officially marked the birth of AI as a field. Early AI

Chunk 9:
 models relied on rule-based systems, which worked well for structured tasks but struggled with uncertainty and complexit

## Topic-Based Chunking (LDA + CountVectorizer)

🔹 How It Works:

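Uses Latent Dirichlet Allocation (LDA) to group text by topic. The code below fits LDA on the whole text as a single document and returns each topic's top words rather than passages; with only one document the topics are poorly separated, which likely explains why Topics 2 and 3 in the output are identical.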
🔹 Best For:
✅ Thematic segmentation – Groups text by topic clusters.
✅ News summarization – Extracts key topic-based insights.
✅ Auto-categorization – Improves document tagging & classification.

🔥 Most Used In:

AI-powered content recommendation
Thematic search engines
News/media summarization

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_based_chunking(text, num_topics=5, num_words=200):
    """
    Extracts the top words for each topic using Latent Dirichlet Allocation (LDA).
    
    Parameters:
        text (str): The input text to be analyzed.
        num_topics (int): Number of topics to generate.
        num_words (int): Number of top words to return per topic.

    Returns:
        List[str]: One space-joined string of top words per topic.
    """
    vectorizer = CountVectorizer(stop_words='english')
    doc_term_matrix = vectorizer.fit_transform([text])
    
    lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda_model.fit(doc_term_matrix)
    
    feature_names = vectorizer.get_feature_names_out()
    topic_chunks = []
    
    for topic_idx, topic in enumerate(lda_model.components_):
        topic_words = [feature_names[i] for i in topic.argsort()[:-num_words - 1:-1]]
        topic_chunks.append(" ".join(topic_words))
    
    return topic_chunks

chunks = topic_based_chunking(text, num_topics=3, num_words=5)
for i, chunk in enumerate(chunks):
    print(f"Topic {i+1}:\n{chunk}\n")


Topic 1:
ai intelligence systems artificial human

Topic 2:
xai fraud focuses finance fields

Topic 3:
xai fraud focuses finance fields



## Keyword-Based Chunking (TF-IDF)

🔹 How It Works:

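Uses TF-IDF to pick high-weight words, then collects the sentences that contain each one. Because the vectorizer is fitted on a single document, the IDF term is the same for every word and the ranking reduces to term frequency. The matching step is a case-sensitive substring test, so a short keyword such as `ai` also matches inside words like "training".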
🔹 Best For:
✅ Extractive summarization – Highlights important information.
✅ Keyword-driven AI retrieval – Finds relevant chunks for search.
✅ News & report summarization – Works well for fact-based content.

🔥 Most Used In:

Search engine optimizations (SEO, AI-driven ranking)
Extractive text summarization tools
Financial & legal text analysis
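
When sentences are selected by keyword, the matching rule matters as much as the keyword ranking. The sketch below shows one possible stricter rule: case-insensitive, whole-word matching with a regular expression. The function name `sentences_mentioning` is hypothetical and the sample sentences are made up for the demonstration.

```python
import re


def sentences_mentioning(sentences, keyword):
    """Keep sentences that contain keyword as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


sentences = [
    "AI models learn from training data.",
    "Rain is uncertain.",
    "The ai field grew.",
]

# A plain substring test also hits "training", "Rain" and "uncertain"
print([s for s in sentences if "ai" in s])

# Whole-word matching keeps only the sentences that actually mention AI
print(sentences_mentioning(sentences, "ai"))
# ['AI models learn from training data.', 'The ai field grew.']
```

Ignoring case matters here because TfidfVectorizer lowercases its vocabulary by default, while the source sentences keep their original capitalization.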

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def keyword_based_chunking(text, num_keywords=10):
    """
    Splits text into chunks based on keyword importance using TF-IDF.
    
    Parameters:
        text (str): The input text to be chunked.
        num_keywords (int): Number of top keywords to extract.

    Returns:
        List[str]: A list of keyword-based text chunks.
    """
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform([text])
    feature_names = vectorizer.get_feature_names_out()
    
    sorted_indices = np.argsort(tfidf_matrix.toarray()).flatten()[::-1]
    top_keywords = [feature_names[i] for i in sorted_indices[:num_keywords]]
    
    chunks = []
    for keyword in top_keywords:
        sentences = [sent for sent in text.split('. ') if keyword in sent]
        chunks.append(". ".join(sentences))
    
    return chunks

chunks = keyword_based_chunking(text, num_keywords=10)
for i, chunk in enumerate(chunks):
    print(f"Keyword Chunk {i+1}:\n{chunk}\n")


Keyword Chunk 1:
Early AI models relied on rule-based systems, which worked well for structured tasks but struggled with uncertainty and complexity.  
- **Entertainment**: Recommendation systems, such as those used by Netflix and Spotify, tailor content based on user preferences. Challenges and Ethical Considerations  
Despite its potential, AI faces several challenges:  
- **Bias and Fairness**: AI models can inherit biases from training data, leading to unfair outcomes.  
- **Privacy Concerns**: The collection of massive datasets raises questions about user privacy and data security. The Future of AI  
Looking ahead, AI research focuses on:  
- **Explainable AI (XAI)**: Making AI decisions transparent and interpretable.  
- **General AI**: Developing systems that can reason and learn across multiple domains. While challenges remain, AI’s potential to enhance lives is undeniable

Keyword Chunk 2:
The Early Days of AI  
The idea of artificial intelligence dates back to ancient myths, w