# Document Retrieval and Reranking

In this notebook we load the pages of a PDF, split them into chunks, store the chunks in a vector store, and finish with the last step: prompt augmentation.


## Install dependencies

In [1]:
!pip install langchain pypdf openai langchain_experimental langchain_openai faiss-cpu tiktoken


## Document Loading



We will load the transcript of a lecture from Andrew Ng's CS229 course:

[Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018)](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf)

In [2]:
# Document loaders live in langchain_community in current LangChain releases.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf")
lecture1_pages = loader.load()

In [3]:
len(lecture1_pages)

22

In [4]:
page = lecture1_pages[0]

In [5]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the 


In [6]:
page.metadata

{'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creator': 'PScript5.dll Version 5.2.2',
 'creationdate': '2008-07-11T11:25:23-07:00',
 'author': '',
 'moddate': '2008-07-11T11:25:23-07:00',
 'title': '',
 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf',
 'total_pages': 22,
 'page': 0,
 'page_label': '1'}

## Document Chunking

In [7]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [8]:
docs = text_splitter.split_documents(lecture1_pages)

In [9]:
len(docs)

78
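
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a minimal pure-Python sketch of separator-based chunking with overlap. It is an illustration under simplifying assumptions, not the actual `CharacterTextSplitter` implementation, and the function name `chunk_lines` is hypothetical.

```python
def chunk_lines(text, chunk_size=1000, chunk_overlap=150, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks.

    A chunk only exceeds chunk_size when a single piece is longer than that.
    Trailing pieces of the previous chunk are carried over as overlap.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(["aaaa", "bbbb", "cccc", "dddd"])
for chunk in chunk_lines(sample, chunk_size=9, chunk_overlap=4):
    print(repr(chunk))
```

In the toy run each chunk repeats the last line of the previous one, which is the effect `chunk_overlap` is meant to have: context that straddles a boundary appears in both neighbouring chunks.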

In [10]:
print(docs[0])

page_content='MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the most exciting field of all the computer 
sciences. So I'm actually always excited about teaching this class. Sometimes I actually 
think that machine learning is not only the most exciting thing in computer science, but 
the most exciting thing in all of human endeavor, so maybe a little bias there.  
I also want to introduce the TAs, who are all graduate students doing research in or 
related to the machine learning and all aspects of machine learning. Paul Baumstarck' metadata={'

## Retrieval

In [11]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

Enter your OpenAI API Key:··········


In [12]:
from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings

vectorstore = FAISS.from_documents(docs, embedding=OpenAIEmbeddings())

In [13]:
query = "Can you tell me something about the honor code?"

In [14]:
docs_and_scores = vectorstore.similarity_search_with_score(query)

In [15]:
for doc in docs_and_scores:
  print(doc)

(Document(id='1146d30d-64a7-441b-8f2a-d43f7d14f4e7', metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 6, 'page_label': '7'}, page_content="that people that have taken this class in previous years may have written out by \nthemselves, okay?  \nSadly, in this class, there are usually — sadly, in previous y ears, there have often been a \nfew honor code violations in this class. And last year, I think I prosecuted five honor code \nviolations, which I think is a ridiculously large number. And so just don't work without \nsolutions, and hopefully there'll be zero honor code violations this year. I'd love for that \nto happen.  \nThe section here on the late homework policy if you ever want to hand in a ho
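
The number paired with each document is a distance, so a lower value means a closer match. The sketch below assumes the default FAISS index (L2, reporting squared distances) and unit-length embeddings; both are assumptions about the setup rather than something this notebook verifies. Under those assumptions the squared L2 distance and the cosine similarity are tied together by `distance = 2 - 2 * cosine`. The vectors are made-up toy values.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance: lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy 3-dimensional "embeddings" (hypothetical values).
query_vec = normalize([1.0, 2.0, 2.0])
near_vec = normalize([1.0, 2.0, 1.5])
far_vec = normalize([-2.0, 0.5, 1.0])

for name, vec in [("near", near_vec), ("far", far_vec)]:
    d = squared_l2(query_vec, vec)
    c = cosine(query_vec, vec)
    # For unit vectors, d equals 2 - 2 * c up to floating-point error.
    print(f"{name}: squared L2 = {d:.3f}, cosine = {c:.3f}")
```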

## Rerank

In [16]:
# The scores are distances (lower means more similar): keep chunks closer than 0.4.
filtered_documents = filter(lambda x: x[1] < 0.4, docs_and_scores)

In [17]:
# reverse=True orders by descending distance, so the closest chunk ends up last.
sorted_documents = sorted(filtered_documents, key=lambda x: x[1], reverse=True)

In [18]:
for doc in sorted_documents:
  print(doc)

(Document(id='340db324-a469-479a-851f-83293f435bac', metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 5, 'page_label': '6'}, page_content="use that to try to form a study group.  \nBut some of the problems sets in this class are reasonably difficult. People that have \ntaken the class before may tell you they were very difficult. And just I bet it would be \nmore fun for you, and you'd probably have a better learning experience if you form a \nstudy group of people to work with. So I definitely encourage you to do that.  \nAnd just to say a word on the honor code, which is I definitely encourage you to form a \nstudy group and work together, discuss homework problems together. But if you discuss"), np
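
If the goal is to put the best match first, a distance score calls for an ascending sort. The sketch below shows that variant on hypothetical `(text, distance)` pairs; the texts and values are made up for illustration and are not the scores returned in this notebook.

```python
# Hypothetical (chunk_text, distance) pairs standing in for docs_and_scores.
scored_chunks = [
    ("honor code violations", 0.31),
    ("late homework policy", 0.45),
    ("study groups and the honor code", 0.36),
    ("course logistics", 0.52),
]

MAX_DISTANCE = 0.4  # same cut-off as in the Rerank cells

# A list comprehension, unlike filter(), can be iterated more than once.
kept = [(text, score) for text, score in scored_chunks if score < MAX_DISTANCE]

# Ascending order: the smallest distance (closest chunk) comes first.
reranked = sorted(kept, key=lambda pair: pair[1])

for text, score in reranked:
    print(f"{score:.2f}  {text}")
```

Which order is preferable depends on how the chunks are fed to the model; the point here is only that the sort direction has to match the meaning of the score.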

## Prompt Augmentation

In [42]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.
Don't invent anything. Use only the Context, do not use your own knowledge.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)

In [43]:
chunk_retriever = vectorstore.as_retriever(
    search_type = "similarity_score_threshold",
    search_kwargs = { "k": 5, "score_threshold": 0.4}
)

In [44]:
chunk_retriever.invoke("Can you tell me something about the honor code?")

[Document(id='1146d30d-64a7-441b-8f2a-d43f7d14f4e7', metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 6, 'page_label': '7'}, page_content="that people that have taken this class in previous years may have written out by \nthemselves, okay?  \nSadly, in this class, there are usually — sadly, in previous y ears, there have often been a \nfew honor code violations in this class. And last year, I think I prosecuted five honor code \nviolations, which I think is a ridiculously large number. And so just don't work without \nsolutions, and hopefully there'll be zero honor code violations this year. I'd love for that \nto happen.  \nThe section here on the late homework policy if you ever want to hand in a ho
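
With `search_type="similarity_score_threshold"` the retriever works with relevance scores, where higher means more relevant. That is the opposite direction from the raw distances used in the Rerank section, so the two `0.4` thresholds are not interchangeable. The following is a rough sketch of the selection semantics on made-up scores; `select_relevant` is a hypothetical helper, not LangChain's code.

```python
def select_relevant(scored_chunks, k=5, score_threshold=0.4):
    """Return up to k chunks whose relevance is at least score_threshold.

    Scores are assumed to lie in [0, 1], with higher meaning more relevant.
    Results are ordered best first.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:k] if score >= score_threshold]


# Hypothetical (chunk_text, relevance) pairs.
candidates = [
    ("course logistics", 0.22),
    ("honor code violations", 0.81),
    ("late homework policy", 0.39),
    ("study groups and the honor code", 0.74),
]

for text, score in select_relevant(candidates, k=5, score_threshold=0.4):
    print(f"{score:.2f}  {text}")
```

When no chunk clears the threshold the result is empty, which is consistent with the chain answering "I don't know" for off-topic questions.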

In [45]:
from langchain_openai import ChatOpenAI

base_model = ChatOpenAI(model = "gpt-4.1-mini", temperature=0.3)

In [46]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context" : chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | base_model
    | StrOutputParser()
)
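
In the chain above, the list returned by `chunk_retriever` is passed to the prompt as-is, so as far as we can tell the context is rendered from the documents' string representation, metadata included. A common variation is to join only the chunk texts before filling the template. The sketch below shows that idea in plain Python with `str.format`; `Chunk`, `format_docs`, and `build_prompt` are hypothetical names introduced for illustration, and the template is a shortened copy of `rag_template`.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


RAG_TEMPLATE = (
    "Use the following context to answer the user's query. "
    "If you cannot answer, please respond with 'I don't know'.\n\n"
    "User's Query:\n{question}\n\n"
    "Context:\n{context}\n"
)


def format_docs(chunks):
    """Join the chunk texts, separated by blank lines."""
    return "\n\n".join(chunk.page_content for chunk in chunks)


def build_prompt(question, chunks):
    """Fill the template with the question and the retrieved text."""
    return RAG_TEMPLATE.format(question=question, context=format_docs(chunks))


retrieved = [
    Chunk("I definitely encourage you to form a study group."),
    Chunk("Last year, I think I prosecuted five honor code violations."),
]
print(build_prompt("Can you tell me something about the honor code?", retrieved))
```

In the LangChain chain, the equivalent wiring would be `"context": chunk_retriever | format_docs`, assuming `format_docs` reads `page_content` from the retrieved documents as it does here.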

In [47]:
rag_chain.invoke("Can you tell me something about the honor code?")

'The honor code in this class encourages students to form study groups and discuss homework problems together. However, when it comes to submitting homework, each student must write their own solutions independently without referring to notes taken during group study sessions. Showing your solutions to others or copying solutions directly is strictly prohibited. Additionally, students are asked not to look at solutions from previous years, whether official or those written by past students, to avoid honor code violations. The instructor has noted that there have been several honor code violations in previous years and hopes for zero violations this year.'

In [48]:
rag_chain.invoke("Can you tell me something about the sex code?")

"I don't know."

In [49]:
rag_chain.invoke("Who is Leo Messi?")

"I don't know."

In [50]:
rag_chain.invoke("Can you explain Transformers in LLMs?")

"I don't know."

In [51]:
rag_chain.invoke("Can you explain Learning Algorithms?")

'Learning algorithms are computational methods that enable systems to learn patterns and make decisions or predictions based on data. They are widely used in many everyday applications, often without users realizing it—for example, algorithms that automatically read zip codes on mail or recommend movies and products based on your preferences. Learning algorithms can optimize tasks such as improving driving performance for fuel efficiency or analyzing medical records to advance healthcare.\n\nThese algorithms work by identifying patterns in data and using those patterns to make predictions or decisions. For instance, a learning algorithm might learn the relationship between the size of a house and its price by analyzing many examples. Learning theory helps us understand when and how well these algorithms perform, including how much data is needed to achieve a certain level of accuracy.\n\nThere are different types of learning algorithms, including supervised learning, where the algorith