# This notebook only covers document loading. No chunking, embedding, or vector storage is done here.


# Document Structure

In [1]:
# Document is LangChain's container for a piece of text (page_content) plus its metadata.
from langchain_core.documents import Document

### LangChain Document Structure

A `Document` has two core fields:
- `page_content`: the actual text
- `metadata`: contextual information (source, page, file path)

Loaders produce this structure, and later pipeline stages (text splitters, vector stores) consume it.
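
To make the two-field shape concrete without depending on LangChain, here is a small sketch that uses a plain dataclass as a stand-in. `SimpleDocument` and `filter_by_metadata` are hypothetical names chosen for illustration, not part of LangChain, and the real `Document` class does more than this.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs, key, value):
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


docs = [
    SimpleDocument("Intro to RAG", {"source": "file1.txt", "page": 0}),
    SimpleDocument("Intro to BERT", {"source": "file2.txt", "page": 0}),
]
matches = filter_by_metadata(docs, "source", "file2.txt")
print(matches[0].page_content)
```

Keeping metadata in a dictionary next to the text is what makes this kind of filtering by source or page possible later in a pipeline.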


In [2]:
# Build a Document by hand to see its structure.
doc = Document(
    page_content="Biography", # The text content of the source.
    metadata={
        "Author": "Rakshak",
        "DOB": "31-10-2005",
        "Gender": "Male",
        "Course": "AIML"
    } # Descriptive details about the source.
)
print(doc) # Displays page_content and metadata.

page_content='Biography' metadata={'Author': 'Rakshak', 'DOB': '31-10-2005', 'Gender': 'Male', 'Course': 'AIML'}


In [3]:
print(doc.page_content) # Each field can be accessed as an attribute.

Biography


In [4]:
print(doc.metadata) # The metadata is a plain dictionary.

{'Author': 'Rakshak', 'DOB': '31-10-2005', 'Gender': 'Male', 'Course': 'AIML'}


In [5]:
print(type(doc)) # The class of the Document object.

<class 'langchain_core.documents.base.Document'>


## Data Ingestion

In [6]:
# Create a directory for the sample text files.
import os
try:
    os.makedirs("../data/text", exist_ok=True) # Location used to store the test files.
except Exception as e:
    print(f"Error creating directory. Error: {e}") # Report the failure instead of raising.

In [7]:
# Placeholder text for testing. Structure -> {file path: content}
sample = {
    "../data/text/file1.txt": """
Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.

Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.

We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations: one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token.

We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open-domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse, and factual language than a state-of-the-art parametric-only seq2seq baseline.

1. Introduction

Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data. They can do so without any access to an external memory, acting as a parameterized implicit knowledge base. While this development is exciting, such models have downsides: they cannot easily expand or revise their memory, cannot straightforwardly provide insight into their predictions, and may produce hallucinations.

Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted.
""",

    "../data/text/file2.txt": """
Abstract

arXiv:1810.04805v2 [cs.CL] 24 May 2019

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5%, MultiNLI accuracy to 86.7%, SQuAD v1.1 Test F1 to 93.2, and SQuAD v2.0 Test F1 to 83.1.

1. Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks. These include sentence-level tasks such as natural language inference and paraphrasing, as well as token-level tasks such as named entity recognition and question answering.

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo, uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as GPT, introduces minimal task-specific parameters and is trained on downstream tasks by simply fine-tuning all pre-trained parameters.

The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
"""
}

In [8]:
# Write each sample text to its file.
try:
    for filename, content in sample.items():
        with open(filename, 'w', encoding="utf-8") as f: # Writing as UTF-8 keeps the encoding consistent across platforms.
            f.write(content)
    print("Successful File creation.")
except Exception as e:
    print(f"Error while creating or writing the content. Error: {e}") # Report the failure instead of raising.

Successful File creation.


In [9]:
# TextLoader reads a text file and wraps its contents in a Document, filling in the `source` metadata automatically.
# The older import path `from langchain.document_loaders import TextLoader` did not work in this environment,
# so the loader is imported from langchain_community instead.
from langchain_community.document_loaders import TextLoader # Loader class for plain-text files.

In [10]:
# Load a single text file by its path.
try:
    file_loader = TextLoader("../data/text/file1.txt", encoding="utf-8") # Read the file as UTF-8 (Unicode Transformation Format, 8-bit).
    print(file_loader.load())  # Returns a list of Documents with the source metadata filled in.
except Exception as e:
    print(f"Error loading the text file. Error: {e}") # Report the failure instead of raising.

[Document(metadata={'source': '../data/text/file1.txt'}, page_content='\nAbstract\n\nLarge pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.\n\nPre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.\n\nWe introduce RAG models where the parametric memory is a pre-trained seq2seq model

In [11]:
print(type(file_loader.load())) # load() returns a list, so items are accessed by index.
print(file_loader.load()[0]) # TextLoader yields one Document per file, so index 0 is the only valid index here.

<class 'list'>
page_content='
Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.

Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.

We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector 

In [12]:
# DirectoryLoader walks a directory, finds the matching files, and loads each one with the given loader class.
from langchain_community.document_loaders import DirectoryLoader # Loader class for whole directories.

In [13]:
# Load every text file found under a directory.
dir_loader = DirectoryLoader(
    "../data/text", # Directory to search.
    glob="**/*.txt", # Pattern of files to match (all .txt files, including subdirectories).
    loader_cls=TextLoader, # Loader class used for each matched file.
    loader_kwargs={"encoding": "utf-8"}, # Keyword arguments passed to each TextLoader.
    show_progress=True # Show a progress bar (optional).
)

In [14]:
try:
    text_doc = dir_loader.load() # Load all matching files in the directory.
    print(text_doc) # Prints the Documents for every loaded file.
except Exception as e:
    print(f"Error loading the text files. Error: {e}") # Report the failure instead of raising.

100%|██████████| 2/2 [00:00<?, ?it/s]

[Document(metadata={'source': '..\\data\\text\\file1.txt'}, page_content='\nAbstract\n\nLarge pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.\n\nPre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.\n\nWe introduce RAG models where the parametric memory is a pre-trained seq2seq mo




### Loaders always return a list of Documents because:
- A single file can produce multiple Documents (for example, one per PDF page)
- A directory contains multiple files
- Downstream RAG steps operate on collections of Documents, not on a single one
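
As a rough illustration of why a list is the natural return type, the sketch below imitates a directory loader using only the standard library. `load_text_directory` is a hypothetical helper, not LangChain's implementation, and it returns plain dictionaries rather than `Document` objects.

```python
import tempfile
from pathlib import Path


def load_text_directory(directory, pattern="**/*.txt"):
    """Read every matching file and return one record per file."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records


# Create two throwaway files so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    loaded = load_text_directory(tmp)

print(len(loaded))
print([Path(r["metadata"]["source"]).name for r in loaded])
```

Even with a single matching file the function would return a one-element list, so calling code never needs a special case.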


In [15]:
print(type(text_doc))  # The result is a list, so each loaded file can be accessed by index.
print(text_doc[0]) # Document for the first loaded file.

<class 'list'>
page_content='
Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.

Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.

We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector 

In [16]:
# Loaders for PDF files. PyMuPDFLoader is used below; it is often preferred over PyPDFLoader because it returns richer metadata.
from langchain_community.document_loaders import PyMuPDFLoader, PyPDFLoader # Loader classes for PDF files.

In [17]:
# Load every PDF found under a directory.
pdf_loader = DirectoryLoader(
    path="../data", # Directory to search.
    glob="**/*.pdf", # Pattern of files to match (all .pdf files, including subdirectories).
    loader_cls=PyMuPDFLoader, # Loader class used for each matched file.
    # loader_kwargs={'encoding':"utf-8"}, # PyMuPDFLoader does not accept an encoding argument.
    show_progress=True # Show a progress bar (optional).
)

In [18]:
try:
    pdf_doc=pdf_loader.load() # Load all PDFs in the directory; each page becomes its own Document.
    print(pdf_doc[18]) # Print a single page (index 18) because printing every page would be very long.
except Exception as e:
    print(f"Error loading the pdf files. Error: {e}") # Report the failure instead of raising.

100%|██████████| 2/2 [00:00<00:00,  6.51it/s]

page_content='Document 1: his works are considered classics of American
literature ... His wartime experiences formed the basis for his novel
”A Farewell to Arms” (1929) ...
Document 2: ... artists of the 1920s ”Lost Generation” expatriate
community. His debut novel, ”The Sun Also Rises”, was published
in 1926.
BOS ”
The
Sun
Also
R
ises
”
is
a
novel
by
this
authorof
”
A
Fare
well
to
Arms”
Doc 1
Doc 2
Doc 3
Doc 4
Doc 5
Figure 2: RAG-Token document posterior p(zi|x, yi, y−i) for each generated token for input “Hem-
ingway" for Jeopardy generation with 5 retrieved documents. The posterior for document 1 is high
when generating “A Farewell to Arms" and for document 2 when generating “The Sun Also Rises".
Table 3: Examples from generation tasks. RAG models generate more speciﬁc and factually accurate
responses. ‘?’ indicates factually incorrect responses, * indicates partially correct responses.
Task
Input
Model
Generation
MS-
MARCO
deﬁne middle
ear
BART
?The middle ear is the part of the e




In [19]:
print(type(pdf_doc)) # The result is a list with one Document per PDF page.
print(pdf_doc[0]) # First page of the first loaded PDF.

<class 'list'>
page_content='Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difﬁcult to train. We
present a residual learning framework to ease the training
of networks that are substantially deeper than those used
previously. We explicitly reformulate the layers as learn-
ing residual functions with reference to the layer inputs, in-
stead of learning unreferenced functions. We provide com-
prehensive empirical evidence showing that these residual
networks are easier to optimize, and can gain accuracy from
considerably increased depth. On the ImageNet dataset we
evaluate residual nets with a depth of up to 152 layers—8×
deeper than VGG nets [41] but still having lower complex-
ity. An ensemble of these residual nets achieves 3.57% error
on the ImageNet test set. This result won the 1st place on the
ILSVRC 2015 classiﬁcation task. We 

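### Typical metadata includes:
- `source`: path of the file the Document came from (it includes the directory and file name)
- `file_path`, `page`, `total_pages`: per-page details added by `PyMuPDFLoader`
- PDF properties such as `producer`, `creator`, and `creationdate`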
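
Because every page carries its `source`, page-level results can be grouped back into files. The sketch below shows one way to count pages per source; the `pages` list is made-up data shaped like the metadata dictionaries, not output from the notebook.

```python
from collections import Counter

# Hypothetical page-level metadata records (file names are invented).
pages = [
    {"source": "paper_a.pdf", "page": 0},
    {"source": "paper_a.pdf", "page": 1},
    {"source": "paper_b.pdf", "page": 0},
]

# Count how many page records belong to each source file.
pages_per_source = Counter(meta["source"] for meta in pages)
for source, count in pages_per_source.items():
    print(f"{source}: {count} page(s)")
```

With the notebook's own results, the equivalent expression would be `Counter(d.metadata["source"] for d in pdf_doc)`.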


In [20]:
print(pdf_doc[0].metadata) # Metadata of the first page, generated automatically by the loader.

{'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-12-11T01:13:45+00:00', 'source': '..\\data\\pdf\\He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf', 'file_path': '..\\data\\pdf\\He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf', 'total_pages': 12, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2015-12-11T01:13:45+00:00', 'trapped': '', 'modDate': 'D:20151211011345Z', 'creationDate': 'D:20151211011345Z', 'page': 0}


In [21]:
# WebBaseLoader loads pages from URLs and uses beautifulsoup4 (BeautifulSoup) to parse the HTML.
from langchain_community.document_loaders import WebBaseLoader # Loader class for web pages.

USER_AGENT environment variable not set, consider setting it to identify your requests.
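
The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of setting one follows, assuming it is run before the `WebBaseLoader` import; the value `document-loading-notebook/0.1` is a placeholder chosen for illustration, and whether the warning disappears may depend on the installed version.

```python
import os

# Set a user agent only if one is not already defined; the value is a placeholder.
os.environ.setdefault("USER_AGENT", "document-loading-notebook/0.1")

# The loader import would follow here, after the variable is set:
# from langchain_community.document_loaders import WebBaseLoader
print(os.environ["USER_AGENT"])
```

`setdefault` leaves an existing value untouched, so a user agent configured outside the notebook still takes precedence.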


In [22]:
# Create a loader for a single web page.
link = WebBaseLoader(
    web_path="https://httpbin.org/html", # URL of the test page.
    encoding="utf-8" # Encoding used to decode the response.
)

In [23]:
try:
    web_loader=link.load() # Fetch the page; despite its name, web_loader holds the resulting list of Documents.
    print(web_loader)
except Exception as e:
    print(f"Error loading the web data. Error: {e}") # Report the failure instead of raising.

[Document(metadata={'source': 'https://httpbin.org/html', 'language': 'No language found.'}, page_content="\n\n\n\n\nHerman Melville - Moby-Dick\n\n\n          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wiel

In [24]:
print(type(web_loader)) # The result is a list, so items are accessed by index.
print(web_loader[0]) # Only one URL was given, so index 0 is the only valid index here.

<class 'list'>
page_content='




Herman Melville - Moby-Dick


          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petulance did come from him. Silent

### Important Note: Documents are NOT embeddings

Before vector storage:
1. Documents are split into chunks
2. Chunks are embedded
3. Embeddings are stored

This notebook stops before step 1: it only loads documents.
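
To show how loaded text differs from embeddings, here is a toy sketch of the three steps using only the standard library. `split_text` and `toy_embed` are hypothetical helpers: the splitter is a naive fixed-size one, and the "embedding" is a character-count placeholder, not a real embedding model or a LangChain API.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Fixed-size character chunks; consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(chunk, dims=8):
    """Placeholder embedding: bucketed character counts, NOT a real model."""
    vector = [0.0] * dims
    for ch in chunk:
        vector[ord(ch) % dims] += 1.0
    return vector


text = "Retrieval-augmented generation combines parametric and non-parametric memory."
chunks = split_text(text)                  # step 1: split into chunks
vectors = [toy_embed(c) for c in chunks]   # step 2: embed each chunk
store = list(zip(chunks, vectors))         # step 3: store chunk/vector pairs

print(len(chunks), len(vectors[0]))
```

The loaded text only becomes a list of numeric vectors at step 2, which is why a `Document` on its own cannot be searched by similarity.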


### What comes next
- Text splitting (RecursiveCharacterTextSplitter)
- Embedding generation
- Vector store ingestion
- Retrieval (RAG)


## Next step: splitting documents into chunks before embeddings.
