### Data Splitting - Splitting the text / data:

- After data ingestion, we split the data into chunks because LLMs have a limited context size
- There are different types of splitters, depending on the type of data
- Some of them are described below:

#### 1. Recursive Character Text Splitter:

- This method is recommended for generic text

- It splits the text on a list of separators, tried in order (by default `["\n\n", "\n", " ", ""]`) until the chunks are small enough
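To make the "list of characters" idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is left out, and the only detail taken from the library is the default separator order.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Coarse separators are tried first; a piece that is still too long
    is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
sample_chunks = recursive_split(sample, chunk_size=20)
for chunk in sample_chunks:
    print(repr(chunk))
```

The first paragraph fits and stays whole; the second is too long for one chunk, so it falls through to the space separator and is split between words.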

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load the text file as a list of Document objects
loader = TextLoader(
    file_path="E:/Downloads/highlights.txt"
)

# Keep only the raw text of the first document
loaded_text = loader.load()[0].page_content
loaded_text

'**00:00 - 10:00**: Okay, here are the highlights for the transcript segment (minutes 0-10):\n\n*   **Project Introduction:** Kalian introduces a Python project for email slicing, which separates the username and domain name from an email address.\n*   **Email Structure Explanation:** Kalian explains that an email consists of a username (before the "@" symbol) and a domain name (after the "@" symbol).\n*   **Variable Creation and User Input:** Kalian creates an "email" variable to store the email address entered by the user.\n*   **Username Extraction:** Kalian extracts the username using string slicing, taking the portion of the email string from the beginning up to the "@" symbol.\n*   **Domain Name Extraction:** Kalian extracts the domain name using string slicing, starting from the character after the "@" symbol to the end of the string.\n*   **Output:** Kalian prints the extracted username and domain name using an f-string for formatted output.\n*   **Script Execution and Example:

In [7]:
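splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,    # maximum number of characters per chunk
    chunk_overlap=20   # maximum number of characters shared by neighbouring chunks
)

splitted_text = splitter.split_text(loaded_text)

# Print the first three chunks
for chunk in splitted_text[:3]:
    print(chunk)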

**00:00 - 10:00**: Okay, here are the highlights for the transcript segment (minutes 0-10):
*   **Project Introduction:** Kalian introduces a Python project for email slicing, which separates
which separates the username and domain name from an email address.
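The third chunk printed above starts with "which separates", the same words that end the second chunk. That repetition appears to be `chunk_overlap=20` at work. A small sketch to measure it, using a hypothetical helper `shared_overlap` and the two chunks copied from the output:

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Second and third chunks, copied from the printed output
chunk_1 = ("*   **Project Introduction:** Kalian introduces a Python project "
           "for email slicing, which separates")
chunk_2 = "which separates the username and domain name from an email address."

overlap = shared_overlap(chunk_1, chunk_2)
print(repr(overlap), len(overlap))
```

The shared text is 15 characters long, which stays under the configured limit of 20; the overlap here falls on word boundaries rather than being cut at exactly 20 characters.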


#### 2. Character Text Splitter:

- It is the simplest method

- Splits the text on a given character sequence [default value = "\n\n"]

- The text is split only on that one separator, which we pass in; pieces are not split any further, so a chunk can end up longer than `chunk_size`
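A minimal pure-Python sketch of this split-then-merge behaviour follows. It is an illustration under assumptions, not LangChain's code: the function name `split_on_separator` is hypothetical and overlap is left out.

```python
def split_on_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator, then merge neighbours up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece is kept whole even if it is longer than chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


long_paragraph = "This paragraph is deliberately longer than the forty character limit."
text = "Intro line.\n\nSecond short paragraph.\n\n" + long_paragraph + "\n\nEnd."

merged_chunks = split_on_separator(text, separator="\n\n", chunk_size=40)
for chunk in merged_chunks:
    print(len(chunk), repr(chunk))
```

The two short paragraphs are merged into one chunk, while the long paragraph is emitted as a single oversized chunk because there is no finer separator to fall back on.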


In [17]:
from langchain_text_splitters import CharacterTextSplitter

# Split on periods instead of the default "\n\n"
splitter = CharacterTextSplitter(
    separator='.',
    chunk_size=100,
    chunk_overlap=30
)

splitted = splitter.split_text(loaded_text)
splitted

Created a chunk of size 243, which is longer than the specified 100
Created a chunk of size 155, which is longer than the specified 100
Created a chunk of size 124, which is longer than the specified 100
Created a chunk of size 159, which is longer than the specified 100
Created a chunk of size 160, which is longer than the specified 100
Created a chunk of size 108, which is longer than the specified 100


['**00:00 - 10:00**: Okay, here are the highlights for the transcript segment (minutes 0-10):\n\n*   **Project Introduction:** Kalian introduces a Python project for email slicing, which separates the username and domain name from an email address',
 '*   **Email Structure Explanation:** Kalian explains that an email consists of a username (before the "@" symbol) and a domain name (after the "@" symbol)',
 '*   **Variable Creation and User Input:** Kalian creates an "email" variable to store the email address entered by the user',
 '*   **Username Extraction:** Kalian extracts the username using string slicing, taking the portion of the email string from the beginning up to the "@" symbol',
 '*   **Domain Name Extraction:** Kalian extracts the domain name using string slicing, starting from the character after the "@" symbol to the end of the string',
 '*   **Output:** Kalian prints the extracted username and domain name using an f-string for formatted output',
 '*   **Script Execution

In [44]:
from langchain_text_splitters import CharacterTextSplitter

# No separator is passed here, so the default "\n\n" is used
splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=30
)

splitted = splitter.split_text(loaded_text)
splitted

['**00:00 - 10:00**: Okay, here are the highlights for the transcript segment (minutes 0-10):',
 '*   **Project Introduction:** Kalian introduces a Python project for email slicing, which separates the username and domain name from an email address.\n*   **Email Structure Explanation:** Kalian explains that an email consists of a username (before the "@" symbol) and a domain name (after the "@" symbol).\n*   **Variable Creation and User Input:** Kalian creates an "email" variable to store the email address entered by the user.\n*   **Username Extraction:** Kalian extracts the username using string slicing, taking the portion of the email string from the beginning up to the "@" symbol.\n*   **Domain Name Extraction:** Kalian extracts the domain name using string slicing, starting from the character after the "@" symbol to the end of the string.\n*   **Output:** Kalian prints the extracted username and domain name using an f-string for formatted output.\n*   **Script Execution and Exampl

In [45]:
splitted[1]

'*   **Project Introduction:** Kalian introduces a Python project for email slicing, which separates the username and domain name from an email address.\n*   **Email Structure Explanation:** Kalian explains that an email consists of a username (before the "@" symbol) and a domain name (after the "@" symbol).\n*   **Variable Creation and User Input:** Kalian creates an "email" variable to store the email address entered by the user.\n*   **Username Extraction:** Kalian extracts the username using string slicing, taking the portion of the email string from the beginning up to the "@" symbol.\n*   **Domain Name Extraction:** Kalian extracts the domain name using string slicing, starting from the character after the "@" symbol to the end of the string.\n*   **Output:** Kalian prints the extracted username and domain name using an f-string for formatted output.\n*   **Script Execution and Example:** Kalian runs the script, enters an email address, and demonstrates the successful separation 

#### 3. HTML Header Text Splitter:

- A "structure-aware" chunker

- Splits text at the HTML element level and adds metadata for each header relevant to any given chunk

- Can return chunks element by element or combine elements with the same metadata, with the objectives of:
    (i) keeping related text grouped (more or less) semantically
    (ii) preserving context-rich information encoded in document structures

- Can be used with other text splitters as part of a chunking pipeline
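The idea of attaching header metadata to each chunk can be sketched with the standard library's `html.parser`. This is a simplified illustration, not how `HTMLHeaderTextSplitter` is implemented; the class name `HeaderTracker` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect text nodes and tag each one with the headers above it."""

    IGNORED = {"title", "script", "style"}

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)            # tag -> metadata key
        self.order = [tag for tag, _ in headers_to_split_on]
        self.active = {}          # tag -> header text currently in scope
        self.open_header = None   # header tag we are inside, if any
        self.ignored_depth = 0
        self.chunks = []          # list of (text, metadata) tuples

    def handle_starttag(self, tag, attrs):
        if tag in self.IGNORED:
            self.ignored_depth += 1
        elif tag in self.names:
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag in self.IGNORED and self.ignored_depth:
            self.ignored_depth -= 1
        elif tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.ignored_depth:
            return
        if self.open_header:
            level = self.order.index(self.open_header)
            # A new header ends every header at the same or a deeper level.
            for tag in self.order[level:]:
                self.active.pop(tag, None)
            self.active[self.open_header] = text
        metadata = {self.names[t]: self.active[t]
                    for t in self.order if t in self.active}
        self.chunks.append((text, metadata))


sample_html = (
    "<h1>Intro</h1><p>Welcome.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h1>Usage</h1><p>Run it.</p>"
)
tracker = HeaderTracker([("h1", "header 1"), ("h2", "header 2")])
tracker.feed(sample_html)
tracker.close()

for text, metadata in tracker.chunks:
    print(metadata, "->", text)
```

Each paragraph carries the headers it sits under, and a new `h1` clears the earlier `h2`, so "Run it." is tagged with "Usage" only.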

In [33]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_code = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <h1>This is header 1</h1>
        <h2>This is header 2</h2>
    <br>
    <h3>This is header 3</h3>
        <p>This is a paragraph</p>
    <br>
</body>
</html>
"""

headers_to_split = [
    ('h1','header 1'),
    ('h2','header 2'),
    ('h3','header 3'),
]

In [42]:
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split
)

# Split an HTML string held in memory
content = html_splitter.split_text(html_code)
# Split an HTML file on disk
content1 = html_splitter.split_text_from_file("sf.html")

In [43]:
print(content)
print("-"*20)
print(content1)

[Document(metadata={'header 1': 'This is header 1'}, page_content='This is header 1'), Document(metadata={'header 1': 'This is header 1', 'header 2': 'This is header 2'}, page_content='This is header 2'), Document(metadata={'header 1': 'This is header 1', 'header 2': 'This is header 2', 'header 3': 'This is header 3'}, page_content='This is header 3'), Document(metadata={'header 1': 'This is header 1', 'header 2': 'This is header 2', 'header 3': 'This is header 3'}, page_content='This is a paragraph')]
--------------------
[Document(metadata={'header 1': 'This is header 1'}, page_content='This is header 1'), Document(metadata={'header 1': 'This is header 1', 'header 2': 'This is header 2'}, page_content='This is header 2'), Document(metadata={'header 1': 'This is header 1', 'header 2': 'This is header 2', 'header 3': 'This is header 3'}, page_content='This is header 3'), Document(metadata={'header 1': 'This is header 1', 'header 2': 'This is header 2', 'header 3': 'This is header 3'}, 

#### 4. Recursive JSON Splitter:

Used to split JSON data, such as the response to an API request, into smaller chunks

In [51]:
from langchain_text_splitters import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(
    max_chunk_size=100,
)

js = {
    'name':{'initial':'M','first_name':'Kalyan','middle_name':'Sai','last_name':'Prasad'},
    'age':18,
    'loves_anime':True
}

In [52]:
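# Only the nested 'name' object is passed in; it fits within max_chunk_size,
# so a single chunk comes back
json_content = json_splitter.split_json(js['name'])
json_content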

[{'initial': 'M',
  'first_name': 'Kalyan',
  'middle_name': 'Sai',
  'last_name': 'Prasad'}]
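In the run above the `name` object fits in one chunk, so nothing is actually split. The sketch below shows what splitting nested JSON while keeping the key paths could look like. It is a simplified illustration, not the `RecursiveJsonSplitter` implementation: the helpers `json_size`, `set_path`, and `split_nested` are hypothetical, and the size is assumed to be the length of the serialized chunk.

```python
import copy
import json


def json_size(obj):
    """Size of an object measured as the length of its JSON text."""
    return len(json.dumps(obj))


def set_path(target, path, value):
    """Store value in target under the nested keys given by path."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_nested(data, max_chunk_size, path=(), chunks=None):
    """Pack nested dict entries into chunks of at most max_chunk_size."""
    chunks = [{}] if chunks is None else chunks
    for key, value in data.items():
        new_path = path + (key,)
        # Try the entry in the current chunk and measure the real size.
        trial = copy.deepcopy(chunks[-1])
        set_path(trial, new_path, value)
        if json_size(trial) <= max_chunk_size:
            chunks[-1] = trial
        elif isinstance(value, dict) and value:
            # Too big as a whole: descend and place its entries one by one.
            split_nested(value, max_chunk_size, new_path, chunks)
        else:
            # A leaf that does not fit starts a new chunk
            # (a single oversized leaf is still kept whole).
            if chunks[-1]:
                chunks.append({})
            set_path(chunks[-1], new_path, value)
    return chunks


js = {
    'name': {'initial': 'M', 'first_name': 'Kalyan',
             'middle_name': 'Sai', 'last_name': 'Prasad'},
    'age': 18,
    'loves_anime': True,
}

small_chunks = split_nested(js, max_chunk_size=60)
for chunk in small_chunks:
    print(json_size(chunk), chunk)
```

Every chunk keeps the outer `name` key around the fields it holds, so each piece remains valid JSON that shows where its values came from.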