<a href="https://colab.research.google.com/github/Alex112525/LangChain-with-LLMs/blob/main/Loaders_and_Splitting_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain pypdf tiktoken

# Document Loaders

Document loaders load data from a source as `Document` objects. A `Document` is a piece of text plus associated metadata. For example, there are document loaders for loading a simple .txt file, the text contents of any web page, or even the transcript of a YouTube video.
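
As a rough illustration of that idea (not LangChain's actual classes), a document can be modelled as text plus a metadata dictionary. The names `SimpleDocument` and `load_text_file` below are hypothetical stand-ins, and the sample file is created only for the demo.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Hypothetical loader: read one .txt file into a list with one document."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Create a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "note.txt"
    sample_path.write_text("Attention is all you need.", encoding="utf-8")
    docs = load_text_file(sample_path)

print(docs[0].page_content)
print(docs[0].metadata)
```

A real loader does the same job for its own source type: it returns a list of documents, each carrying its text and a record of where that text came from.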

## PDFs

We load the paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762), uploaded to the Colab session as `/content/Attention.pdf`.

In [2]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/Attention.pdf")
pages = loader.load()

In [3]:
len(pages)  # number of pages

15

In [4]:
page = pages[0]
print(page.page_content[586:1500])


Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a smal


Each `Document` also carries `metadata`, a dictionary holding anything you want to store about the text (source URL, author, etc.).

In [5]:
page.metadata

{'source': '/content/Attention.pdf', 'page': 0}
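
One possible use of this metadata is building a human-readable citation for a chunk. The sketch below assumes the `source` and `page` keys shown in the output above, and treats `page` as zero-based because the first page reports `0`. `format_citation` is a hypothetical helper, not a LangChain function.

```python
from pathlib import PurePosixPath


def format_citation(metadata):
    """Hypothetical helper: turn loader metadata into 'file, page N' (1-based)."""
    file_name = PurePosixPath(metadata["source"]).name
    return f"{file_name}, page {metadata['page'] + 1}"


first_page_metadata = {"source": "/content/Attention.pdf", "page": 0}
citation = format_citation(first_page_metadata)
print(citation)  # Attention.pdf, page 1
```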

## Web pages


In [6]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.langchain.com/docs/")

In [7]:
web = loader.load()

In [8]:
print(web[0].page_content[355:800])

LangChainLangChain is a framework for developing applications powered by language models.
We believe that the most powerful and differentiated applications will not only call out to a language model via an api, but will also:Be data-aware: connect a language model to other sources of dataBe agentic: Allow a language model to interact with its environmentAs such, the LangChain framework is designed with the objective in mind to enable those t


# Document Splitting

### RecursiveCharacterTextSplitter

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

The **RecursiveCharacterTextSplitter** splits documents recursively on a list of separators, tried in order: first `"\n\n"`, then `"\n"`, then `" "`.
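
The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores overlap entirely, and it drops empty pieces between repeated separators.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones where needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; keep the oversized piece

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond one."
result = recursive_split(sample, chunk_size=12)
print(result)
```

In this run the second paragraph fits in one chunk and stays whole, while the first is too long and falls back to splitting on spaces.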

This is useful because it tries to keep semantically related text (paragraphs, then lines, then words) together for as long as possible. `chunk_size` sets the maximum size of each chunk, measured in characters by default. `chunk_overlap` sets how many characters consecutive chunks share, which helps avoid losing context where the text is cut.
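
To see what the overlap does on its own, here is a minimal sketch that ignores separators and simply slides a fixed window over the characters. `sliding_chunks` is a hypothetical helper, not part of LangChain, and the real splitters behave differently because they cut on separators first.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each shares chunk_overlap chars with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the one before it, so text near a cut always appears in full in at least one chunk.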

In [10]:
chunk_size = 20
chunk_overlap = 5

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' ',
)

In [11]:
text = "The CharacterTextSplitter and RecursiveCharacterTextSplitter are a splitters that splits text into characters. It is used to split text into characters for indexing purposes."

In [12]:
r_splitter.split_text(text)

['The',
 'CharacterTextSplitt',
 'plitter',
 'and',
 'RecursiveCharacterT',
 'cterTextSplitter',
 'are a splitters',
 'that splits text',
 'text into',
 'into characters. It',
 'It is used to split',
 'text into',
 'into characters for',
 'for indexing',
 'purposes.']

In [13]:
c_splitter.split_text(text)



['The',
 'CharacterTextSplitter',
 'and',
 'RecursiveCharacterTextSplitter',
 'are a splitters that',
 'that splits text',
 'text into',
 'into characters. It',
 'It is used to split',
 'split text into',
 'into characters for',
 'for indexing',
 'purposes.']
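
Comparing the two outputs above shows the practical difference. With only `' '` as a separator, the character splitter leaves words longer than `chunk_size` whole, while the recursive splitter cuts them. The short check below copies both printed lists and measures them, and it assumes nothing beyond those outputs.

```python
chunk_size = 20

# Copied from the RecursiveCharacterTextSplitter output above.
r_chunks = ['The', 'CharacterTextSplitt', 'plitter', 'and', 'RecursiveCharacterT',
            'cterTextSplitter', 'are a splitters', 'that splits text', 'text into',
            'into characters. It', 'It is used to split', 'text into',
            'into characters for', 'for indexing', 'purposes.']

# Copied from the CharacterTextSplitter output above.
c_chunks = ['The', 'CharacterTextSplitter', 'and', 'RecursiveCharacterTextSplitter',
            'are a splitters that', 'that splits text', 'text into',
            'into characters. It', 'It is used to split', 'split text into',
            'into characters for', 'for indexing', 'purposes.']

r_too_long = [chunk for chunk in r_chunks if len(chunk) > chunk_size]
c_too_long = [chunk for chunk in c_chunks if len(chunk) > chunk_size]

print("recursive, over limit:", r_too_long)
print("character, over limit:", c_too_long)
```

Only the character splitter's list contains chunks over the limit, namely the two class names that contain no space to split on.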

In [14]:
Doc = """
There are two main value props the LangChain framework provides:

Components: LangChain provides modular abstractions for the components neccessary to work with language models. LangChain also has collections of implementations for all these abstractions. The components are designed to be easy to use, regardless of whether you are using the rest of the LangChain framework or not.
Use-Case Specific Chains: Chains can be thought of as assembling these components in particular ways in order to best accomplish a particular use case. These are intended to be a higher level interface through which people can easily get started with a specific use case. These chains are also designed to be customizable.
"""

In [15]:
r_sep_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    separators=["\n\n", "\n", " ", ""]
)

In [16]:
r_sep_splitter.split_text(Doc)

['There are two main value props the LangChain framework provides:',
 'Components: LangChain provides modular abstractions for the components neccessary to work with',
 'work with language models. LangChain also has collections of implementations for all these',
 'all these abstractions. The components are designed to be easy to use, regardless of whether you',
 'you are using the rest of the LangChain framework or not.',
 'Use-Case Specific Chains: Chains can be thought of as assembling these components in particular',
 'ways in order to best accomplish a particular use case. These are intended to be a higher level',
 'level interface through which people can easily get started with a specific use case. These chains',
 'chains are also designed to be customizable.']

### TokenTextSplitter

In [17]:
from langchain.text_splitter import TokenTextSplitter

The **TokenTextSplitter** is a LangChain class that splits raw text by first encoding it into BPE tokens, then grouping those tokens into chunks, and finally decoding each chunk back into text. Here `chunk_size` and `chunk_overlap` are measured in tokens rather than characters.
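
The encode, chunk, decode pipeline can be sketched without a real tokenizer. The toy tokenizer below is an assumption made purely for illustration: it treats each word together with its leading whitespace as one token, which is not BPE. `toy_encode`, `toy_decode`, and `split_by_tokens` are hypothetical names.

```python
import re


def toy_encode(text):
    """Stand-in for BPE encoding: one token per word, keeping leading whitespace."""
    return re.findall(r"\s*\S+|\s+", text)


def toy_decode(tokens):
    """Inverse of toy_encode: concatenating tokens restores the original text."""
    return "".join(tokens)


def split_by_tokens(text, chunk_size, chunk_overlap):
    """Chunk the token list with overlap, then decode each chunk back to text."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(toy_decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already covers the last token
        start += step
    return chunks


token_chunks = split_by_tokens("a b c d e", chunk_size=3, chunk_overlap=1)
print(token_chunks)  # ['a b c', ' c d e']
```

Because whitespace belongs to the token that follows it, decoded chunks can begin with a space, and chunk lengths in characters vary even though the token count per chunk is fixed.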

In [18]:
token_splitter = TokenTextSplitter(chunk_size=20,
                                  chunk_overlap=3)

In [19]:
token_splitter.split_text(Doc)

['\nThere are two main value props the LangChain framework provides:\n\nComponents: LangChain',
 ': LangChain provides modular abstractions for the components neccessary to work with language models. Lang',
 ' models. LangChain also has collections of implementations for all these abstractions. The components are designed to',
 ' are designed to be easy to use, regardless of whether you are using the rest of the LangChain',
 ' the LangChain framework or not.\nUse-Case Specific Chains: Chains can be thought of as',
 ' thought of as assembling these components in particular ways in order to best accomplish a particular use case. These',
 ' case. These are intended to be a higher level interface through which people can easily get started with a',
 ' started with a specific use case. These chains are also designed to be customizable.\n']