<a href="https://colab.research.google.com/github/Shahbaz894/Generative-Ai-/blob/main/LangChainTextSplitter_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain transformers sentence-transformers faiss-cpu



# 1️⃣ RecursiveCharacterTextSplitter

In [23]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """Artificial Intelligence (AI) is a rapidly evolving field that encompasses various techniques and methodologies.
From natural language processing (NLP) to computer vision, AI enables machines to mimic human-like abilities.
Machine learning, a subset of AI, allows computers to learn from data patterns and improve over time.
Deep learning, a more advanced subset, uses neural networks to process vast amounts of information.
Applications of AI can be found in healthcare, finance, autonomous vehicles, and beyond.
"""

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)

print(chunks)

['Artificial Intelligence (AI) is a rapidly evolving field that encompasses various techniques and', 'techniques and methodologies.', 'From natural language processing (NLP) to computer vision, AI enables machines to mimic human-like', 'to mimic human-like abilities.', 'Machine learning, a subset of AI, allows computers to learn from data patterns and improve over', 'and improve over time.', 'Deep learning, a more advanced subset, uses neural networks to process vast amounts of information.', 'Applications of AI can be found in healthcare, finance, autonomous vehicles, and beyond.']
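To see what `chunk_size` and `chunk_overlap` did here, the following sketch measures each chunk and finds the text shared by neighbouring chunks. It works on the printed output above rather than calling LangChain, and `shared_overlap` is a hypothetical helper written for this illustration, not part of the library.

```python
# Chunks copied from the output above (chunk_size=100, chunk_overlap=20).
chunks = [
    "Artificial Intelligence (AI) is a rapidly evolving field that encompasses various techniques and",
    "techniques and methodologies.",
    "From natural language processing (NLP) to computer vision, AI enables machines to mimic human-like",
    "to mimic human-like abilities.",
    "Machine learning, a subset of AI, allows computers to learn from data patterns and improve over",
    "and improve over time.",
    "Deep learning, a more advanced subset, uses neural networks to process vast amounts of information.",
    "Applications of AI can be found in healthcare, finance, autonomous vehicles, and beyond.",
]


def shared_overlap(prev: str, nxt: str) -> str:
    """Return the longest suffix of `prev` that is also a prefix of `nxt` (hypothetical helper)."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev.endswith(nxt[:size]):
            return nxt[:size]
    return ""


lengths = [len(chunk) for chunk in chunks]
overlaps = [shared_overlap(a, b) for a, b in zip(chunks, chunks[1:])]

for chunk, length in zip(chunks, lengths):
    print(f"{length:>3} chars | {chunk}")
print(overlaps)
```

Every chunk stays within 100 characters, and overlap only appears where a single line had to be cut in two. Chunks that begin at a line break share nothing with the chunk before them.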


# 2️⃣ CharacterTextSplitter

📌 Best for: Splitting text on a single, fixed separator (e.g., `\n` or `.`).

In [25]:
from langchain.text_splitter import CharacterTextSplitter
text = """The internet has revolutionized communication and knowledge sharing.
Social media platforms, blogs, and online forums have given everyone a voice.
The rapid spread of misinformation, however, remains a significant challenge.
With advancements in AI, detecting fake news is becoming more effective.
"""

splitter = CharacterTextSplitter(separator=".", chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

print(chunks)




['The internet has revolutionized communication and knowledge sharing', 'Social media platforms, blogs, and online forums have given everyone a voice', 'The rapid spread of misinformation, however, remains a significant challenge', 'With advancements in AI, detecting fake news is becoming more effective']
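Every chunk above is longer than the requested `chunk_size=50`. `CharacterTextSplitter` cuts only at the separator, so a piece that is already too long is returned whole (LangChain logs a warning instead of cutting further). The sketch below reproduces that effect with a plain `str.split`; it is a simplified stand-in for illustration, not the library's implementation.

```python
text = """The internet has revolutionized communication and knowledge sharing.
Social media platforms, blogs, and online forums have given everyone a voice.
The rapid spread of misinformation, however, remains a significant challenge.
With advancements in AI, detecting fake news is becoming more effective.
"""

chunk_size = 50

# Split only on the separator, then drop surrounding whitespace and empty pieces.
pieces = [piece.strip() for piece in text.split(".") if piece.strip()]

# Pieces that exceed chunk_size cannot be shortened without a finer separator.
oversized = [piece for piece in pieces if len(piece) > chunk_size]

for piece in pieces:
    print(len(piece), piece)
print(f"{len(oversized)} of {len(pieces)} pieces exceed chunk_size={chunk_size}")
```

If a hard upper bound matters, `RecursiveCharacterTextSplitter` from the first section is the safer choice, because it falls back to finer separators.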


# 3️⃣ TokenTextSplitter

📌 Best for: Splitting based on token count instead of characters.

🔹 Uses OpenAI's tiktoken for precise tokenization.

In [21]:
!pip install tiktoken




In [22]:
from langchain_text_splitters import TokenTextSplitter

# Example text to split
document_text = """
Artificial Intelligence (AI) is transforming the world rapidly. It is being used in various industries such as healthcare, finance, and education.
AI models, including deep learning and natural language processing (NLP), are improving decision-making, automation, and efficiency.

Machine Learning (ML), a subset of AI, allows systems to learn and improve from experience without being explicitly programmed.
Popular ML algorithms include decision trees, support vector machines, and neural networks.

Deep Learning, which mimics the human brain's neural networks, has led to advancements in image recognition, speech processing, and generative AI.
Large Language Models (LLMs) such as GPT, Mistral, and LLaMA are being trained on vast datasets to generate human-like text and assist in content creation.
"""

# Initialize TokenTextSplitter
text_splitter = TokenTextSplitter(
    chunk_size=2000,  # Maximum tokens per chunk; this text is far shorter, so it comes back as one chunk
    chunk_overlap=20,  # Number of tokens shared between consecutive chunks
)

# Split the document text into smaller chunks
text_chunks = text_splitter.split_text(document_text)

# Print the first few chunks
for i, chunk in enumerate(text_chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*50}")



Chunk 1:

Artificial Intelligence (AI) is transforming the world rapidly. It is being used in various industries such as healthcare, finance, and education. 
AI models, including deep learning and natural language processing (NLP), are improving decision-making, automation, and efficiency. 

Machine Learning (ML), a subset of AI, allows systems to learn and improve from experience without being explicitly programmed. 
Popular ML algorithms include decision trees, support vector machines, and neural networks. 

Deep Learning, which mimics the human brain's neural networks, has led to advancements in image recognition, speech processing, and generative AI. 
Large Language Models (LLMs) such as GPT, Mistral, and LLaMA are being trained on vast datasets to generate human-like text and assist in content creation.

--------------------------------------------------
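Only one chunk is printed because the whole text fits within 2000 tokens. The sketch below shows how a token window with overlap behaves once the text is longer than `chunk_size`. It assumes whitespace-separated words as a stand-in for tiktoken tokens, and `window_tokens` is a hypothetical helper, so the chunk boundaries will differ from what `TokenTextSplitter` produces.

```python
from typing import List


def window_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return windows


# Assumption: words stand in for real tokenizer output.
tokens = "Artificial Intelligence (AI) is transforming the world rapidly.".split()

small = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
large = window_tokens(tokens, chunk_size=2000, chunk_overlap=20)

for window in small:
    print(window)
print(len(large), "window when chunk_size exceeds the input length")
```

With `chunk_size=4` and `chunk_overlap=1`, each window starts on the last token of the previous one. With `chunk_size=2000`, the input comes back as a single window, which matches the single chunk above.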


# 4️⃣ NLTKTextSplitter

📌 Best for: Splitting text using sentence tokenization from NLTK.

🔹 Code Example

In [5]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
from langchain.text_splitter import NLTKTextSplitter

text = "This is the first sentence. Here is another one. And yet another."

splitter = NLTKTextSplitter()
chunks = splitter.split_text(text)

print(chunks)


['This is the first sentence.\n\nHere is another one.\n\nAnd yet another.']
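The result is a single list element, not three. The splitter detects the sentences and then merges them back together, joined by a blank line, until a chunk reaches `chunk_size`, and the default chunk size is far larger than this text. When individual sentences are needed, one option is to split the chunk on that blank line, as in this small sketch that works on the printed output:

```python
# Output copied from the cell above: one chunk, sentences joined by "\n\n".
chunks = ["This is the first sentence.\n\nHere is another one.\n\nAnd yet another."]

# Recover the individual sentences by splitting on the blank-line separator.
sentences = [s for chunk in chunks for s in chunk.split("\n\n") if s]

print(sentences)
```

Passing a smaller `chunk_size` to the splitter is the other way to get shorter chunks.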


# 5️⃣ SpacyTextSplitter

📌 Best for: Splitting text into sentences using SpaCy's NLP capabilities.

In [7]:
from langchain.text_splitter import SpacyTextSplitter

text = "Hello world! This is an AI-powered document processing tool."

splitter = SpacyTextSplitter()
chunks = splitter.split_text(text)

print(chunks)


['Hello world!\n\nThis is an AI-powered document processing tool.']




# 6️⃣ MarkdownTextSplitter

📌 Best for: Splitting Markdown documents into structured chunks.

In [26]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_content = """
# Introduction
AI is reshaping the world in many ways.
## History of AI
The origins of AI date back to the 1950s.
## Applications
AI is used in healthcare, finance, and education.
"""

splitter = MarkdownTextSplitter()
chunks = splitter.split_text(markdown_content)

print(chunks)


['# Introduction\nAI is reshaping the world in many ways.\n## History of AI\nThe origins of AI date back to the 1950s.\n## Applications\nAI is used in healthcare, finance, and education.']
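The document comes back as one chunk because it is shorter than the default chunk size, so the splitter never has to cut at a heading. To illustrate what heading-based splitting looks like, here is a minimal sketch using only the standard library. `split_on_headers` is a hypothetical helper and a deliberate simplification, not the algorithm `MarkdownTextSplitter` uses.

```python
import re
from typing import List

markdown_content = """
# Introduction
AI is reshaping the world in many ways.
## History of AI
The origins of AI date back to the 1950s.
## Applications
AI is used in healthcare, finance, and education.
"""


def split_on_headers(markdown: str) -> List[str]:
    """Start a new section at every line that begins with 1-6 '#' characters and a space."""
    # The lookahead keeps the heading line inside the section it opens.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown.strip())
    return [section.strip() for section in sections if section.strip()]


sections = split_on_headers(markdown_content)
for section in sections:
    print(repr(section))
```

Within LangChain, a smaller `chunk_size` on `MarkdownTextSplitter`, or the separate `MarkdownHeaderTextSplitter` class, serves a similar purpose.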


# 7️⃣ LatexTextSplitter

📌 Best for: Splitting LaTeX documents into logical sections.

In [9]:
from langchain.text_splitter import LatexTextSplitter

text = "\\section{Introduction} This is the intro. \\subsection{Details} More details here."

splitter = LatexTextSplitter()
chunks = splitter.split_text(text)

print(chunks)


['\\section{Introduction} This is the intro. \\subsection{Details} More details here.']


# 8️⃣ PythonCodeTextSplitter

In [10]:
from langchain.text_splitter import PythonCodeTextSplitter

code = """
def hello():
    print('Hello, World!')

class Greeting:
    def say_hello(self):
        return 'Hello!'
"""

splitter = PythonCodeTextSplitter()
chunks = splitter.split_text(code)

print(chunks)


["def hello():\n    print('Hello, World!')\n\nclass Greeting:\n    def say_hello(self):\n        return 'Hello!'"]
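The snippet is short enough to stay in one chunk. For comparison, the sketch below separates the same source into its top-level definitions with Python's `ast` module. This is a different technique from the one the splitter uses (which works on text separators), and it is shown only to make the idea of definition-level chunks concrete.

```python
import ast

code = """
def hello():
    print('Hello, World!')

class Greeting:
    def say_hello(self):
        return 'Hello!'
"""

# Parse the source and keep one block per top-level function or class.
tree = ast.parse(code)
blocks = [
    ast.get_source_segment(code, node)
    for node in tree.body
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
]

for block in blocks:
    print(block)
    print("-" * 30)
```

Unlike text-based splitting, this approach requires the input to be syntactically valid Python.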


# 9️⃣ HTMLSectionSplitter

📌 Best for: Splitting HTML documents based on headers.

In [24]:
from langchain_text_splitters import HTMLSectionSplitter

html_content = """
<h1>Artificial Intelligence</h1>
<p>AI is transforming industries worldwide.</p>
<h2>Machine Learning</h2>
<p>ML is a subset of AI that enables systems to learn.</p>
<h2>Deep Learning</h2>
<p>Deep learning utilizes neural networks for complex tasks.</p>
"""

# Define headers to split on
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

# Initialize the splitter with headers
splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)

# Split HTML content
chunks = splitter.split_text(html_content)

# Print the chunks
for i, chunk in enumerate(chunks, 1):
    print(f"🔹 Chunk {i}:\n{chunk}\n")


🔹 Chunk 1:
page_content='Artificial Intelligence 
 AI is transforming industries worldwide.' metadata={'Header 1': 'Artificial Intelligence'}

🔹 Chunk 2:
page_content='Machine Learning 
 ML is a subset of AI that enables systems to learn.' metadata={'Header 2': 'Machine Learning'}

🔹 Chunk 3:
page_content='Deep Learning 
 Deep learning utilizes neural networks for complex tasks.' metadata={'Header 2': 'Deep Learning'}
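Each chunk above pairs a section's text with the header that opened it. The following sketch reproduces that grouping with the standard library's `html.parser`, to make the idea concrete. `HeaderSectionParser` is a hypothetical class written for this illustration and does not mirror how the LangChain splitter is implemented.

```python
from html.parser import HTMLParser

html_content = """
<h1>Artificial Intelligence</h1>
<p>AI is transforming industries worldwide.</p>
<h2>Machine Learning</h2>
<p>ML is a subset of AI that enables systems to learn.</p>
<h2>Deep Learning</h2>
<p>Deep learning utilizes neural networks for complex tasks.</p>
"""


class HeaderSectionParser(HTMLParser):
    """Group text into sections, starting a new section at each configured header tag."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.sections = []
        self._open_header = None

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self._open_header = tag
            self.sections.append({"metadata": {}, "text": []})

    def handle_endtag(self, tag):
        if tag == self._open_header:
            self._open_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data or not self.sections:
            return  # Skip whitespace and any text before the first header.
        section = self.sections[-1]
        if self._open_header:
            section["metadata"][self.header_names[self._open_header]] = data
        section["text"].append(data)


parser = HeaderSectionParser([("h1", "Header 1"), ("h2", "Header 2")])
parser.feed(html_content)
parser.close()

for section in parser.sections:
    print(section["metadata"], "|", " ".join(section["text"]))
```

As in the output above, the header text appears both in the section body and in its metadata.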



# 🔟 RecursiveJsonSplitter

📌 Best for: Splitting JSON files into logical chunks.

In [29]:
import json
from langchain_text_splitters import RecursiveJsonSplitter

json_data = """
{
  "name": "AI Assistant",
  "description": "An AI system designed to assist users in various tasks.",
  "capabilities": [
    "Natural Language Processing",
    "Machine Learning",
    "Computer Vision"
  ],
  "applications": {
    "Healthcare": "AI is used for diagnosis and treatment recommendations.",
    "Finance": "AI helps in fraud detection and automated trading."
  }
}
"""

# Convert JSON string to Python dictionary
json_dict = json.loads(json_data)

# Initialize splitter
splitter = RecursiveJsonSplitter(max_chunk_size=150)

# Split the JSON dictionary (not string)
chunks = splitter.split_json(json_dict)

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")


Chunk 1: {'name': 'AI Assistant', 'description': 'An AI system designed to assist users in various tasks.'}

Chunk 2: {'capabilities': ['Natural Language Processing', 'Machine Learning', 'Computer Vision']}

Chunk 3: {'applications': {'Healthcare': 'AI is used for diagnosis and treatment recommendations.'}}

Chunk 4: {'applications': {'Finance': 'AI helps in fraud detection and automated trading.'}}
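Two properties of these chunks are worth checking: each one fits the 150-character budget, and each keeps the full key path to its values (note how `applications` is repeated in chunks 3 and 4). The sketch below verifies both on the printed output. It assumes size is measured as the length of the `json.dumps` string, and `deep_merge` is a hypothetical helper, not a LangChain function.

```python
import copy
import json

original = {
    "name": "AI Assistant",
    "description": "An AI system designed to assist users in various tasks.",
    "capabilities": ["Natural Language Processing", "Machine Learning", "Computer Vision"],
    "applications": {
        "Healthcare": "AI is used for diagnosis and treatment recommendations.",
        "Finance": "AI helps in fraud detection and automated trading.",
    },
}

# Chunks copied from the output above.
chunks = [
    {"name": "AI Assistant", "description": "An AI system designed to assist users in various tasks."},
    {"capabilities": ["Natural Language Processing", "Machine Learning", "Computer Vision"]},
    {"applications": {"Healthcare": "AI is used for diagnosis and treatment recommendations."}},
    {"applications": {"Finance": "AI helps in fraud detection and automated trading."}},
]

# Assumption: chunk size is the length of the serialized JSON string.
sizes = [len(json.dumps(chunk)) for chunk in chunks]


def deep_merge(target: dict, source: dict) -> dict:
    """Merge `source` into `target`, recursing into nested dicts (hypothetical helper)."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        else:
            target[key] = copy.deepcopy(value)
    return target


merged = {}
for chunk in chunks:
    deep_merge(merged, chunk)

print(sizes)
print(merged == original)
```

Because every chunk carries its own key path, the chunks can be merged back into the original document without extra bookkeeping.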



# 1️⃣1️⃣ XML with BeautifulSoup and RecursiveCharacterTextSplitter

In [30]:
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sample long XML data
xml_data = """
<articles>
    <article>
        <title>AI in Healthcare</title>
        <content>Artificial Intelligence (AI) is revolutionizing healthcare by improving diagnosis, treatment, and patient care. AI-powered systems analyze medical data to detect diseases like cancer early, helping doctors make informed decisions. Machine learning models are also used in drug discovery, predicting potential compounds and speeding up research. However, challenges like data privacy and ethical concerns remain critical issues.</content>
    </article>
    <article>
        <title>Advancements in Quantum Computing</title>
        <content>Quantum computing is an emerging field that leverages quantum mechanics to solve complex problems faster than classical computers. Companies like Google and IBM are racing to build more powerful quantum processors. These computers could revolutionize cryptography, materials science, and artificial intelligence. Despite the progress, practical quantum computers are still in their infancy, requiring further breakthroughs in qubit stability and error correction.</content>
    </article>
    <article>
        <title>Future of Space Exploration</title>
        <content>Space exploration is entering a new era with ambitious missions from NASA, SpaceX, and other organizations. Mars colonization, asteroid mining, and deep-space travel are becoming real possibilities. Advanced propulsion systems, AI-driven navigation, and robotic missions are making space more accessible. Challenges such as radiation exposure, long-duration space travel, and planetary sustainability must be addressed to ensure successful missions.</content>
    </article>
</articles>
"""

# Parse XML data using BeautifulSoup
soup = BeautifulSoup(xml_data, "xml")

# Extract text content from XML
articles = soup.find_all("article")
text_chunks = [article.get_text(separator=" ") for article in articles]

# Initialize LangChain text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)

# Split extracted text into chunks
chunks = splitter.split_text(" ".join(text_chunks))

# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")



Chunk 1:
AI in Healthcare

Chunk 2:
Artificial Intelligence (AI) is revolutionizing healthcare by improving diagnosis, treatment, and patient care. AI-powered systems analyze medical data to detect diseases like cancer early, helping

Chunk 3:
like cancer early, helping doctors make informed decisions. Machine learning models are also used in drug discovery, predicting potential compounds and speeding up research. However, challenges like

Chunk 4:
However, challenges like data privacy and ethical concerns remain critical issues.

Chunk 5:
Advancements in Quantum Computing

Chunk 6:
Quantum computing is an emerging field that leverages quantum mechanics to solve complex problems faster than classical computers. Companies like Google and IBM are racing to build more powerful

Chunk 7:
racing to build more powerful quantum processors. These computers could revolutionize cryptography, materials science, and artificial intelligence. Despite the progress, practical quantum computers

Chunk 