### Document Transformers – Splitting Text for NLP Processing

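- Overview: Demonstrates splitting text at three granularities: sentence-level chunks, words, and characters. Only the first uses a LangChain splitter; the other two use plain Python.
- File Handling: Loads text from sample_text.txt using LangChain’s TextLoader.
- Sentence Splitting: Uses CharacterTextSplitter with a ". " separator and chunk_size=1000. The splitter merges the pieces back together up to chunk_size, so a short input is returned as a single chunk rather than one item per sentence.
- Token Splitting: Splits on whitespace with str.split(), so punctuation stays attached to the neighboring word (for example, 'models.').
- Character Splitting: Converts the text into a list of individual characters for fine-grained analysis.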
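
The loader expects sample_text.txt to exist already. The following is a minimal sketch for creating it, assuming the file holds the three sentences that appear in the printed output (each of the first two lines ends with two spaces and a newline). The names SAMPLE_TEXT and write_sample_file are hypothetical helpers, not part of LangChain.

```python
from pathlib import Path

# Sample content reconstructed from the notebook's printed output (an assumption).
SAMPLE_TEXT = (
    "LangChain is a framework for developing applications powered by large language models.  \n"
    "It helps with data retrieval, memory, and document processing.  \n"
    "AI agents use LangChain to handle conversations and reasoning."
)


def write_sample_file(path):
    """Write the sample text to `path` as UTF-8 (hypothetical helper)."""
    Path(path).write_text(SAMPLE_TEXT, encoding="utf-8")
```

Calling write_sample_file("sample_text.txt") once before running the loader cell should be enough to make the example reproducible.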

In [8]:
# IMPORT NECESSARY MODULES
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# LOAD TEXT FROM FILE USING LANGCHAIN'S DOCUMENT LOADER
loader = TextLoader("sample_text.txt", encoding="utf-8")
documents = loader.load()
document_text = documents[0].page_content  # EXTRACT TEXT CONTENT

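# FUNCTION TO SPLIT TEXT ON ". " AND MERGE THE PIECES INTO CHUNKS
# NOTE: CharacterTextSplitter RE-JOINS PIECES UNTIL chunk_size IS REACHED,
# SO TEXT SHORTER THAN 1000 CHARACTERS COMES BACK AS A SINGLE CHUNK
def split_by_sentence(text):
    text_splitter = CharacterTextSplitter(separator=". ", chunk_size=1000)
    return text_splitter.split_text(text)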

# FUNCTION TO SPLIT TEXT INTO WORDS
def split_by_word(text):
    return text.split()  # SPLIT ON ANY WHITESPACE; PUNCTUATION STAYS ATTACHED

# FUNCTION TO SPLIT TEXT INTO CHARACTERS
def split_by_character(text):
    return list(text)  # CONVERT STRING TO LIST OF CHARACTERS

# DEMONSTRATING THE SPLITS
print("\n--- SPLIT BY PERIOD (SENTENCES) ---")
print(split_by_sentence(document_text))

print("\n--- SPLIT BY TOKEN (WORDS) ---")
print(split_by_word(document_text))

print("\n--- SPLIT BY CHARACTER ---")
print(split_by_character(document_text))


--- SPLIT BY PERIOD (SENTENCES) ---
['LangChain is a framework for developing applications powered by large language models.  \nIt helps with data retrieval, memory, and document processing.  \nAI agents use LangChain to handle conversations and reasoning.']
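
The output above is a single list element because CharacterTextSplitter merged the split pieces back together under chunk_size. One way to get one item per sentence is sketched below, assuming Python's built-in re module rather than any LangChain API. The name split_sentences is a hypothetical helper, and the rule it uses (split after ., !, or ? followed by whitespace) is a simplification that mis-splits abbreviations such as "e.g. this".

```python
import re


def split_sentences(text):
    """Split after sentence-ending punctuation followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = (
    "LangChain is a framework for developing applications powered by large language models.  \n"
    "It helps with data retrieval, memory, and document processing.  \n"
    "AI agents use LangChain to handle conversations and reasoning."
)

for sentence in split_sentences(sample):
    print(sentence)
```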

--- SPLIT BY TOKEN (WORDS) ---
['LangChain', 'is', 'a', 'framework', 'for', 'developing', 'applications', 'powered', 'by', 'large', 'language', 'models.', 'It', 'helps', 'with', 'data', 'retrieval,', 'memory,', 'and', 'document', 'processing.', 'AI', 'agents', 'use', 'LangChain', 'to', 'handle', 'conversations', 'and', 'reasoning.']
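
Note that tokens such as 'models.' and 'retrieval,' keep their punctuation, because str.split() only looks at whitespace. If punctuation should become its own token, a regular expression is one option. This is a minimal sketch, not a full tokenizer; tokenize_words is a hypothetical name.

```python
import re


def tokenize_words(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize_words("It helps with data retrieval, memory, and document processing."))
```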

--- SPLIT BY CHARACTER ---
['L', 'a', 'n', 'g', 'C', 'h', 'a', 'i', 'n', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'r', 'a', 'm', 'e', 'w', 'o', 'r', 'k', ' ', 'f', 'o', 'r', ' ', 'd', 'e', 'v', 'e', 'l', 'o', 'p', 'i', 'n', 'g', ' ', 'a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', 's', ' ', 'p', 'o', 'w', 'e', 'r', 'e', 'd', ' ', 'b', 'y', ' ', 'l', 'a', 'r', 'g', 'e', ' ', 'l', 'a', 'n', 'g', 'u', 