### ParentDocument Retriever

In [None]:
!pip install langchain langchain_community langchain_groq langchain_chroma chromadb sentence-transformers langchain_huggingface



In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.0.0


In [None]:
# Import required libraries
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

In [None]:
# Load the Groq API key from Colab secrets
from google.colab import userdata
groq_api_key = userdata.get("GROQ_API_KEY")
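
The cell above reads the key from Colab secrets, so it only runs inside Google Colab. A minimal sketch of a fallback for other environments, assuming the key is exported as an environment variable named `GROQ_API_KEY` (the helper name `load_api_key` is hypothetical and not part of any library):

```python
import os


def load_api_key(name: str = "GROQ_API_KEY"):
    """Return the secret from Colab if available, else from the environment.

    Assumption: outside Colab the key is set as an environment variable.
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Not running in Colab: fall back to the process environment.
        return os.environ.get(name)
    return userdata.get(name)
```

With this helper, `groq_api_key = load_api_key()` would replace the direct `userdata.get` call.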

In [None]:
# Download the BERT paper from arXiv
!wget "https://arxiv.org/pdf/1810.04805.pdf"

--2025-09-15 07:25:03--  https://arxiv.org/pdf/1810.04805.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.131.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /pdf/1810.04805 [following]
--2025-09-15 07:25:03--  https://arxiv.org/pdf/1810.04805
Reusing existing connection to arxiv.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 775166 (757K) [application/pdf]
Saving to: ‘1810.04805.pdf’


2025-09-15 07:25:03 (19.0 MB/s) - ‘1810.04805.pdf’ saved [775166/775166]



In [None]:
# Load the PDF; each page becomes one Document
loader = PyPDFLoader("1810.04805.pdf")
documents = loader.load()

In [None]:
documents

[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-05-28T00:07:51+00:00', 'author': '', 'keywords': '', 'moddate': '2019-05-28T00:07:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '1810.04805.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT, which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlab

In [None]:
len(documents)

16

In [None]:
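# Embedding model used to index the child chunks
model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,  # previously defined but never passed
)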

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
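
The `normalize_embeddings` flag in `encode_kwargs` controls whether each embedding is scaled to unit length. A small sketch, using plain Python lists rather than real model output, of why that matters: once vectors are normalized, a plain dot product gives the same value as cosine similarity. The vectors `a` and `b` are made-up illustration values.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional "embeddings" (illustrative values only).
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, the dot product matches the cosine similarity.
print(dot(a_unit, b_unit), cosine_similarity(a, b))
```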



In [None]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=embeddings
)
# The storage layer for the parent documents
store = InMemoryStore()
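
The relationship between the two splitters and the two stores can be sketched in plain Python. This is a toy model under stated assumptions, not how `ParentDocumentRetriever` is implemented: `split_fixed` is a naive fixed-width splitter standing in for `RecursiveCharacterTextSplitter`, a dict stands in for the docstore, and substring matching stands in for embedding search. All function names here are hypothetical.

```python
import uuid


def split_fixed(text, chunk_size, overlap):
    """Naive fixed-width splitter with overlap (stand-in for a real splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def build_index(text, parent_size=2000, parent_overlap=100,
                child_size=400, child_overlap=50):
    """Split text into parents, then split each parent into children."""
    docstore = {}     # parent_id -> full parent chunk
    child_index = []  # (child chunk, parent_id) pairs; children are what get searched
    for parent in split_fixed(text, parent_size, parent_overlap):
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = parent
        for child in split_fixed(parent, child_size, child_overlap):
            child_index.append((child, parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index):
    """Match on small children, return their de-duplicated larger parents."""
    parent_ids = []
    for child, parent_id in child_index:
        if query.lower() in child.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


sample_text = ("filler " * 500) + "needle " + ("filler " * 500)
docstore, child_index = build_index(sample_text)
results = retrieve_parents("needle", docstore, child_index)
print(len(docstore), len(child_index), len(results))
```

The point of the design is that search runs over small chunks, which tend to embed more precisely, while the caller receives the larger parent chunk with more surrounding context.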

In [None]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
# Split the pages into parent chunks (kept in `store`) and child chunks (embedded in `vectorstore`)
retriever.add_documents(documents, ids=None)

In [None]:
# Number of parent chunks held in the docstore
len(list(store.yield_keys()))

41

In [None]:
# Querying the vector store directly returns the small child chunks
sub_docs = vectorstore.similarity_search("BERT")

In [None]:
print(sub_docs[0].page_content)

speciﬁc architectures. BERT is the ﬁrst ﬁne-
tuning based representation model that achieves
state-of-the-art performance on a large suite
of sentence-level and token-level tasks, outper-
forming many task-speciﬁc architectures.
• BERT advances the state of the art for eleven
NLP tasks. The code and pre-trained mod-
els are available at https://github.com/
google-research/bert.
2 Related Work


In [None]:
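# Querying through the retriever returns the larger parent chunks.
# NOTE: this query looks like a leftover from another example and is unrelated
# to the BERT paper, so the closest match below is a chunk of the reference list.
retrieved_docs = retriever.invoke("justice breyer")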

In [None]:
len(retrieved_docs[0].page_content)

1989

In [None]:
print(retrieved_docs[0].page_content)

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke
Zettlemoyer. 2017. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehen-
sion. In ACL.
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,
Richard Zemel, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Skip-thought vectors. In
Advances in neural information processing systems,
pages 3294–3302.
Quoc Le and Tomas Mikolov. 2014. Distributed rep-
resentations of sentences and documents. In Inter-
national Conference on Machine Learning , pages
1188–1196.
Hector J Levesque, Ernest Davis, and Leora Morgen-
stern. 2011. The winograd schema challenge. In
Aaai spring symposium: Logical formalizations of
commonsense reasoning, volume 46, page 47.
Lajanugen Logeswaran and Honglak Lee. 2018. An
efﬁcient framework for learning sentence represen-
tations. In International Conference on Learning
Representations.
Bryan McCann, James Bradbury, Caiming Xiong, and
Richard Socher. 2017. Learned in translation: Con-
tex

### RAG Pipeline

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate


In [None]:
llm = ChatGroq(model="gemma2-9b-it", api_key=groq_api_key)

In [None]:
template = """
You are a helpful assistant that answers questions based on the following context.
If you don't find the answer in the context, just say that you don't know.
Context: {context}

Question: {input}

Answer:

"""
prompt = ChatPromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
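
In the chain above, the retriever returns a list of `Document` objects, and that list is inserted into `{context}` as its string representation, metadata included. A common alternative is to join only the page text before it reaches the prompt. A minimal sketch, where `Doc` is a stand-in dataclass (an assumption, so the example runs without LangChain installed) and `format_docs` is a hypothetical helper name:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only `page_content` is used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("BERT is a language representation model."),
    Doc("GLUE is a benchmark."),
]
context = format_docs(docs)
print(context)
```

In the notebook's chain, the same helper could be applied by using `retriever | format_docs` in place of `retriever` for the `"context"` key. That is a suggestion, not something the original notebook does.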


In [None]:
response = rag_chain.invoke("Describe the Feature-based Approach with BERT?")
print(response)

The feature-based approach with BERT involves extracting fixed features from the pre-trained BERT model and using them as input to a separate, task-specific model. 

Here's a breakdown:

* **Fixed Features:** Instead of fine-tuning all of BERT's parameters,  activations from one or more of its layers are extracted and treated as fixed features.
* **Task-Specific Model:** These features are then fed into a smaller, task-specific model (like a BiLSTM in the example)  which is trained to perform the desired downstream task.

**Advantages:**

* **Efficiency:** Pre-computing the BERT features once allows for faster training and experimentation with various task-specific models.
* **Flexibility:**  Suitable for tasks that don't easily fit into the Transformer architecture.

**Example:**

In the CoNLL-2003 NER task example, BERT embeddings were extracted and used as input to a BiLSTM classifier.  Various BERT layer activations (e.g., last layer, last four layers) were used to see which provid

In [None]:
response = rag_chain.invoke("What is BERT?")
print(response)

BERT stands for Bidirectional Encoder Representations from Transformers. 



In [None]:
response = rag_chain.invoke("What is GLUE?")
print(response)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.  

