### Introduction to Data Ingestion with LangChain

In [1]:

import os
from typing import List, Dict, Any
import pandas as pd

In [2]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)
print("Set up Completed!")

Set up Completed!


### Understanding the Document Structure in LangChain

In [3]:
# A Document pairs the text to embed with free-form metadata about its origin.
doc = Document(
    page_content="This is the content that will get embedded",
    metadata={
        "source": "example.txt",
        "page": 1,
        "author": "Ameen",
        "date_create": "2024-01-01",
    }
)

print(doc.page_content)
print("Metadata", doc.metadata)

This is the content that will get embedded
Metadata {'source': 'example.txt', 'page': 1, 'author': 'Ameen', 'date_create': '2024-01-01'}
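
The metadata dictionary is what makes filtering and source tracking possible later in a pipeline. The sketch below illustrates the idea in plain Python. `SimpleDocument` and `filter_by_metadata` are hypothetical names used only for illustration and are not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus free-form metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDocument], key: str, value: Any) -> List[SimpleDocument]:
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


docs = [
    SimpleDocument("First page text", {"source": "example.txt", "page": 1}),
    SimpleDocument("Second page text", {"source": "example.txt", "page": 2}),
    SimpleDocument("Other file text", {"source": "other.txt", "page": 1}),
]

# Keep only the documents that came from example.txt.
from_example = filter_by_metadata(docs, "source", "example.txt")
print([d.metadata["page"] for d in from_example])
```

The same pattern applies to any metadata key, such as `author` or `page`.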


#### Reading a Text File

In [4]:
import os 
os.makedirs("data/text_files", exist_ok=True)

In [7]:
Sample_texts = {
    "data/text_files/python_intor.txt": """Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
    Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.""",

}

for filepath, content in Sample_texts.items():
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

print("Sample text file created.")


Sample text file created.


### Text Loader

In [22]:
# from langchain.document_loaders import TextLoader
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/text_files/python_intor.txt", encoding="utf-8")

documents = loader.load()
print(documents[0].page_content[:20])
print(type(documents))

Python is an interpr
<class 'list'>
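
As the output shows, `load()` returns a list even for a single file. A rough plain-Python equivalent of that behaviour might look like the sketch below. `load_text_file` is a hypothetical helper, and dictionaries stand in for `Document` objects; this is not how the library implements the loader.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_text_file(path: str, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Read one file and wrap it as a single-item list of document-like dicts."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


# Use a throwaway file so the sketch does not depend on the notebook's data folder.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Python is an interpreted language.", encoding="utf-8")
    loaded = load_text_file(str(sample_path))

print(type(loaded).__name__, len(loaded))
print(loaded[0]["page_content"][:20])
```

Passing the encoding explicitly, as the notebook does, avoids depending on the platform default.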


### Directory Loader - Multiple Text Files

In [31]:
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader("data/text_files",
                             glob='**/*.txt',
                             loader_cls=TextLoader,
                             loader_kwargs={'encoding': 'utf-8'},
                             show_progress=True)

documents = dir_loader.load()
print(documents)

for i, doc in enumerate(documents):
    print(f"Document {i+1}")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content: {doc.page_content[:20]}")

100%|██████████| 2/2 [00:00<00:00, 1975.65it/s]

[Document(metadata={'source': 'data\\text_files\\machine_learning.txt'}, page_content='Universal character support: UTF-8 can represent every character in the Unicode standard, making it ideal for websites and applications that need to support multiple languages.\nASCII compatibility: It is backward-compatible with ASCII. The first 128 characters are the same in both encodings, meaning existing ASCII text is already valid UTF-8, which simplifies migration and handling.\nEfficient storage: For common characters like those in the English alphabet, UTF-8 uses only one byte, which is very efficient. It only uses more bytes (two to four) for characters outside of the ASCII range.\nSelf-synchronization: UTF-8 is designed to be self-synchronizing, meaning a program can find the beginning of a character even if it starts reading in the middle of a sequence of bytes, which helps in recovering from errors.\nWeb standard: UTF-8 is the dominant character encoding on the World Wide Web and is the d
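
The `glob='**/*.txt'` pattern matches `.txt` files in the directory and in every subdirectory. The sketch below illustrates that matching rule with `pathlib` alone. `load_directory` is a hypothetical helper that returns dictionaries, not the library's loader.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_directory(root: str, pattern: str = "**/*.txt", encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Read every file under root that matches pattern, one document-like dict per file."""
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append({
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            })
    return docs


# Build a small throwaway tree: one top-level file, one nested file, one non-matching file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top-level file", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("nested file", encoding="utf-8")
    (root / "notes.md").write_text("ignored by the pattern", encoding="utf-8")
    dir_docs = load_directory(tmp)

names = [Path(d["metadata"]["source"]).name for d in dir_docs]
print(names)
```

Each result keeps its own `source` path, which is how the loop above can report where every document came from.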




### Text Splitting Strategies

In [54]:
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

In [55]:
text = documents[0].page_content

In [59]:
### Character Text Splitter
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)
chunks = char_splitter.split_text(text)
print('-------------')
print(chunks[0])
print("-----")
print(chunks[2])

Created a chunk of size 208, which is longer than the specified 200
Created a chunk of size 208, which is longer than the specified 200
Created a chunk of size 224, which is longer than the specified 200


-------------
Universal character support: UTF-8 can represent every character in the Unicode standard, making it ideal for websites and applications that need to support multiple languages.
-----
Efficient storage: For common characters like those in the English alphabet, UTF-8 uses only one byte, which is very efficient. It only uses more bytes (two to four) for characters outside of the ASCII range.
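
The warnings above appear because the text is only ever cut at the separator: a line longer than `chunk_size` is kept whole. The sketch below is a simplified illustration of that split-then-merge behaviour under the assumption that overlap is ignored. `split_on_separator` is a hypothetical helper, not the library's implementation.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole, not cut.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 30 + "\nanother short line"
demo_chunks = split_on_separator(sample, separator="\n", chunk_size=20)
print([len(c) for c in demo_chunks])
```

The middle chunk is 30 characters long despite the limit of 20, which mirrors the oversized chunks reported in the notebook output.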


### Recursive Character Text Splitter

In [62]:
recursive_splitter = RecursiveCharacterTextSplitter(
    # Split on spaces only; the library default is ["\n\n", "\n", " ", ""].
    separators=[" "],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

In [64]:
recursive_chunks = recursive_splitter.split_text(text)
print(len(recursive_chunks))
print('-------------') 
print(recursive_chunks[0])
print("-----")
print(recursive_chunks[2])

6
-------------
Universal character support: UTF-8 can represent every character in the Unicode standard, making it ideal for websites and applications that need to support multiple languages.
ASCII compatibility: It
-----
migration and handling.
Efficient storage: For common characters like those in the English alphabet, UTF-8 uses only one byte, which is very efficient. It only uses more bytes (two to four) for
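
The recursive strategy tries separators in order and only falls back to a finer one for pieces that are still too long. The sketch below is a simplified illustration of that idea, assuming no overlap. `recursive_split` is a hypothetical helper, not the library's implementation.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split with the first separator; recurse with the rest on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is too long on its own, so try the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Paragraph one is short.\n\nParagraph two is a bit longer and needs splitting."
demo_chunks = recursive_split(sample, separators=["\n\n", " "], chunk_size=30)
for chunk in demo_chunks:
    print(repr(chunk))
```

The first paragraph survives intact because it already fits, while the second is broken at spaces. With a single separator such as `[" "]`, as in the cell above, there is nothing finer to fall back to.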


### Token Text Splitter

In [66]:
token_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
)
token_chunks = token_splitter.split_text(text)
print(len(token_chunks))

6


In [68]:
print(token_chunks[0])
print("-----")
print(token_chunks[1])

Universal character support: UTF-8 can represent every character in the Unicode standard, making it ideal for websites and applications that need to support multiple languages.
ASCII compatibility: It is backward-compatible with ASCII. The first 128 characters are the same
-----
 with ASCII. The first 128 characters are the same in both encodings, meaning existing ASCII text is already valid UTF-8, which simplifies migration and handling.
Efficient storage: For common characters like those in the English alphabet, UTF
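
The second chunk above starts with text that also ends the first chunk, which is the effect of `chunk_overlap`. The sketch below illustrates the sliding-window idea, assuming whitespace-separated words as a stand-in for real tokenizer output. `window_tokens` is a hypothetical helper, not the library's implementation.

```python
from typing import List


def window_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += stride
    return windows


# Assumption: words stand in for tokens; a real tokenizer produces subword units.
tokens = "UTF-8 is the dominant character encoding on the web".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window repeats the last token of the previous one, so context that straddles a boundary appears in both chunks.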
