# Steps to build a RAG from scratch
### Step-1: Import `ChatTogether` from `langchain_together`

In [1]:
! pip install langchain-together

Collecting langchain-together
  Downloading langchain_together-0.3.0-py3-none-any.whl.metadata (1.9 kB)
Collecting langchain-openai<0.4,>=0.3 (from langchain-together)
  Downloading langchain_openai-0.3.18-py3-none-any.whl.metadata (2.3 kB)
Collecting langchain-core<0.4.0,>=0.3.29 (from langchain-together)
  Downloading langchain_core-0.3.61-py3-none-any.whl.metadata (5.8 kB)
Downloading langchain_together-0.3.0-py3-none-any.whl (12 kB)
Downloading langchain_openai-0.3.18-py3-none-any.whl (63 kB)
Downloading langchain_core-0.3.61-py3-none-any.whl (438 kB)
Installing collected packages: langchain-core, langchain-openai, langchain-together
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 0.3.59
    Unins

In [2]:
from langchain_together import ChatTogether

In [3]:
api_key="API_Key"
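
Hard-coding the key in a notebook makes it easy to leak. A minimal sketch of reading it from the environment instead, assuming the key has been exported as `TOGETHER_API_KEY` (the variable name and the `load_api_key` helper are illustrative assumptions, not part of the original notebook):

```python
import os


def load_api_key(env_var: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption; use whatever name you exported.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (uncomment once the variable is set):
# api_key = load_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the first model call.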

### Step-3: Define your model

In [4]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain_community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

In [5]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.5.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.5.0-py3-none-any.whl (303 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.5.0


### Step-6: Load the PDF for context

In [7]:
from langchain_community.document_loaders import PyPDFLoader

loader=PyPDFLoader("paper.pdf")
pages=loader.load_and_split()
pages



[Document(metadata={'producer': 'macOS Version 14.4.1 (Build 23E224) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20240813034150Z00'00'", 'moddate': "D:20240813034150Z00'00'", 'source': 'paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Machine LearningLogistic Regression DR. BHARGA VI RSCOPEVIT CHENNAI'),
 Document(metadata={'producer': 'macOS Version 14.4.1 (Build 23E224) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20240813034150Z00'00'", 'moddate': "D:20240813034150Z00'00'", 'source': 'paper.pdf', 'total_pages': 15, 'page': 1, 'page_label': '2'}, page_content='Bhargavi RClassification - ApplicationsBinary Classification • Online transactions – Fraudulent / Not Fraudulent• Email – Spam/ Not spam ?• Tumor classification – Malignant/BenignMulti-class Classification• Optical Character Recognition• Face classification Multi-Label ClassificationA variant of theclassificationproblem where multiple nonexclusive labels may be assigned to

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)
page_split = text_splitter.split_documents(pages)
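
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-width chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries the `separators` in order so chunks break at natural boundaries); `sliding_chunks` is a hypothetical helper for illustration only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width chunks where neighbours share some characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.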

### Step-8: Create a vector store from the embeddings and the context PDF

> #### DocArrayInMemorySearch Explanation
>
> This code creates a simple in-memory vector store that:
> - Converts document pages into vector embeddings
> - Stores them in memory for quick similarity searches
> - Is useful for testing and prototyping RAG applications
> - Keeps data only temporarily; it is cleared when the program ends

In [10]:
! pip install docarray

Collecting docarray
  Downloading docarray-0.41.0-py3-none-any.whl.metadata (36 kB)
Collecting types-requests>=2.28.11.6 (from docarray)
  Downloading types_requests-2.32.0.20250515-py3-none-any.whl.metadata (2.1 kB)
Downloading docarray-0.41.0-py3-none-any.whl (302 kB)
Downloading types_requests-2.32.0.20250515-py3-none-any.whl (20 kB)
Installing collected packages: types-requests, docarray
Successfully installed docarray-0.41.0 types-requests-2.32.0.20250515


In [11]:
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [12]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(
    page_split,
    embedding=embedding_function
)
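
As a rough mental model of an in-memory vector store, the sketch below embeds each text, keeps the vectors in a Python list, and ranks them by cosine similarity at query time. Everything here is a stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, and `ToyInMemoryStore` is a hypothetical class, not how `DocArrayInMemorySearch` is implemented.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyInMemoryStore:
    """Keeps (text, vector) pairs in a list; the data is gone when the process exits."""

    def __init__(self) -> None:
        self._items = []

    def add_texts(self, texts: list[str]) -> None:
        self._items.extend((text, toy_embed(text)) for text in texts)

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored texts most similar to the query."""
        query_vec = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyInMemoryStore()
store.add_texts([
    "logistic regression is a linear model",
    "gradient descent updates the weights",
    "spam email classification",
])
print(store.search("gradient descent", k=1))
# ['gradient descent updates the weights']
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count stand-in cannot do.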



### Step-9: Defining the retriever

> #### Retriever Operations
> This code:
> - Creates a retriever from the vector store using `as_retriever()`
> - Uses `invoke()` to search for documents related to "Machine Learning"
> - Returns semantically similar content from the stored documents
>


In [13]:
retriever=vectorstore.as_retriever()

retriever.invoke("Machine Learning")

[Document(metadata={'producer': 'macOS Version 14.4.1 (Build 23E224) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20240813034150Z00'00'", 'moddate': "D:20240813034150Z00'00'", 'source': 'paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Machine LearningLogistic Regression DR. BHARGA VI RSCOPEVIT CHENNAI'),
 Document(metadata={'producer': 'macOS Version 14.4.1 (Build 23E224) Quartz PDFContext', 'creator': 'PyPDF', 'creationdate': "D:20240813034150Z00'00'", 'moddate': "D:20240813034150Z00'00'", 'source': 'paper.pdf', 'total_pages': 15, 'page': 3, 'page_label': '4'}, page_content='Bhargavi RLogistic Regression - Introduction• Linear model.• Used for binary classification• Can be extended to handle multiclass as well• Computationally inexpensive• Easy to implement.• Logistic Regression models the response/prediction as probability that y (output variable) belongs to a particular category.\nx1\nx2'),
 Document(metadata={'producer': 'macOS Version 14.4

In [15]:
from langchain.prompts import PromptTemplate

template="""
    Answer the question based on the context below. If you don't know the answer, just say so.
    Context: {context}
    Question: {question}
"""
prompt=PromptTemplate.from_template(template)
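
The two placeholders are filled at run time: `{context}` with the retrieved text and `{question}` with the user's question. A minimal sketch of that substitution using plain `str.format`, with a hypothetical `render_prompt` helper that joins chunk texts with blank lines (the joining step is an assumption added for readability, not something the notebook's chain does):

```python
TEMPLATE = """
    Answer the question based on the context below. If you don't know the answer, just say so.
    Context: {context}
    Question: {question}
"""


def render_prompt(context_chunks: list[str], question: str) -> str:
    """Fill the template with joined context text and the question."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(context=context, question=question)


print(render_prompt(
    ["Logistic regression is a linear model."],
    "What is logistic regression?",
))
```

Printing the rendered prompt like this is a quick way to check exactly what text the model will receive.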

In [17]:
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"

model = ChatTogether(
    together_api_key=api_key,
    model=MODEL,
)



### Step-10: Defining the chain for the RAG

In [19]:
from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

In [21]:
from operator import itemgetter

chain = (
    {
        "context":itemgetter("question") | retriever,
        "question":itemgetter("question")
    }
    | prompt
    | model
    | parser
)
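
The `|` operator composes the steps so that each one's output feeds the next. As a rough mental model only (this is plain Python, not LangChain's implementation), the chain behaves like the function below; `run_chain` and the four stub components are hypothetical stand-ins.

```python
from operator import itemgetter


def stub_retriever(query: str) -> list[str]:
    """Stand-in for the vector store retriever."""
    return [f"chunk about {query}"]


def stub_prompt(values: dict) -> str:
    """Stand-in for the prompt template."""
    return f"Context: {values['context']}\nQuestion: {values['question']}"


def stub_model(prompt_text: str) -> dict:
    """Stand-in for the chat model; returns a message-like dict."""
    return {"content": f"(answer to a {len(prompt_text)}-character prompt)"}


def stub_parser(message: dict) -> str:
    """Stand-in for the output parser; keeps only the text."""
    return message["content"]


def run_chain(inputs: dict) -> str:
    get_question = itemgetter("question")
    # Step 1: the dict fans the input out into the two prompt variables.
    values = {
        "context": stub_retriever(get_question(inputs)),
        "question": get_question(inputs),
    }
    # Steps 2-4: prompt, then model, then parser, each fed by the previous one.
    return stub_parser(stub_model(stub_prompt(values)))


print(run_chain({"question": "What is the pdf about?"}))
```

Note that the question is used twice: once as the search query for the retriever and once verbatim in the prompt.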

In [22]:
chain.invoke({"question":"What is the pdf about?"})

'The PDF appears to be about machine learning and data analysis, specifically focusing on topics such as binary classification, logistic regression, and gradient descent, with examples related to credit card marketing and customer behavior.'

In [23]:
quizText=chain.invoke({"question":"Can you generate a quiz of 15 questions based on the provided notes ?"})
print(quizText)

Based on the provided context, I can generate a quiz with 15 questions. Here are the questions:

1. What is the topic discussed on page 12 of the document?
A) Gradient Descent
B) Multiclass Classification
C) Prediction
D) Regression

Answer: B) Multiclass Classification

2. What is the formula for updating the weights in Gradient Descent?
A) w* = w* - α * (J(W))
B) w* = w* + α * (J(W))
C) w* = w* - α * ∑(h'(x) - y) * x
D) w* = w* + α * ∑(h'(x) - y) * x

Answer: A) w* = w* - α * (J(W))

3. What is the learning rate in Gradient Descent?
A) α
B) β
C) γ
D) δ

Answer: A) α

4. What is the goal of the marketing department in the given scenario?
A) To convince existing holders of the company's premium credit card to upgrade to the standard card
B) To convince existing holders of the company's standard credit card to upgrade to the premium card
C) To convince new customers to apply for the company's credit card
D) To convince existing holders of the company's credit card to cancel their accoun

In [24]:
loader=PyPDFLoader("Chapter-1.pdf")
pages=loader.load_and_split()
pages

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250118193239', 'source': 'Chapter-1.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1'}, page_content='Chapter I: Introduction to LLMs'),
 Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250118193239', 'source': 'Chapter-1.pdf', 'total_pages': 30, 'page': 1, 'page_label': '2'}, page_content='What are Large Language Models\nBy now, you might have heard of them. Large Language Models,\ncommonly known as LLMs, are a sophisticated type of neural network.\nThese models ignited many innovations in the field of natural language\nprocessing (NLP) and are characterized by their large number of\nparameters, often in billions, that make them proficient at processing and\ngenerating text. They are trained on extensive textual data, enabling them to\ngrasp various language patterns and structures. The primary goal of LLMs\nis to interpret and create human-like text that captures

In [26]:
vectorstore = DocArrayInMemorySearch.from_documents(
    pages,
    embedding=embedding_function
)

In [27]:
retriever=vectorstore.as_retriever()

retriever.invoke("LLM")

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250118193239', 'source': 'Chapter-1.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1'}, page_content='Chapter I: Introduction to LLMs'),
 Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250118193239', 'source': 'Chapter-1.pdf', 'total_pages': 30, 'page': 1, 'page_label': '2'}, page_content='What are Large Language Models\nBy now, you might have heard of them. Large Language Models,\ncommonly known as LLMs, are a sophisticated type of neural network.\nThese models ignited many innovations in the field of natural language\nprocessing (NLP) and are characterized by their large number of\nparameters, often in billions, that make them proficient at processing and\ngenerating text. They are trained on extensive textual data, enabling them to\ngrasp various language patterns and structures. The primary goal of LLMs\nis to interpret and create human-like text that captures

In [28]:
chain = (
    {
        "context":itemgetter("question") | retriever,
        "question":itemgetter("question")
    }
    | prompt
    | model
    | parser
)
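
The load, index, and retrieve cells are repeated for every new PDF. One option is to wrap them in a helper. `build_retriever` below is a hypothetical name, and it only reuses calls that already appear in this notebook. The imports sit inside the function so that defining it does not require the packages to be installed yet.

```python
def build_retriever(pdf_path: str, embedding_function):
    """Load a PDF, index its pages in memory, and return a retriever."""
    # Imported lazily; both modules are the ones used in the cells above.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.vectorstores import DocArrayInMemorySearch

    pages = PyPDFLoader(pdf_path).load_and_split()
    vectorstore = DocArrayInMemorySearch.from_documents(
        pages,
        embedding=embedding_function,
    )
    return vectorstore.as_retriever()


# Usage, assuming `embedding_function` is defined as in the earlier cell:
# retriever = build_retriever("Chapter-1.pdf", embedding_function)
```

The chain still has to be rebuilt after swapping the retriever, because the existing chain holds a reference to the old one.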

In [29]:
chain.invoke({"question":"What is the pdf about?"})

'The PDF appears to be about Large Language Models (LLMs), specifically their introduction, applications, challenges, and potential biases. It covers topics such as language modeling, tokenization, few-shot learning, and mitigating hallucinations and bias in AI systems. The overall theme seems to be an educational or informative discussion about LLMs, likely from a textbook or academic chapter.'

In [30]:
print(chain.invoke({"question":"Give an essay on LLMs"}))

Large Language Models (LLMs) are a sophisticated type of neural network that has revolutionized the field of natural language processing (NLP). These models are characterized by their large number of parameters, often in billions, which enable them to process and generate human-like text. LLMs are trained on extensive textual data, allowing them to grasp various language patterns and structures, and their primary goal is to interpret and create text that captures the nuances of natural language, including syntax and semantics.

One of the most remarkable aspects of LLMs is their ability to develop emergent abilities, such as conducting arithmetic calculations, unscrambling words, and even demonstrating proficiency in professional exams, like the US Medical Licensing Exam. This is achieved through a straightforward training objective, which focuses on predicting the next word in a sentence. The autoregressive text generation process in LLMs generates the next tokens based on the sequenc

In [31]:
print(chain.invoke({"question":"Give 5 descriptive questions on LLMs with answers"}))

Here are 5 descriptive questions on LLMs with answers based on the provided context:

1. **What are Large Language Models (LLMs), and how do they work?**
Answer: LLMs are a sophisticated type of neural network that process and generate text. They are trained on extensive textual data, enabling them to grasp various language patterns and structures. The primary goal of LLMs is to interpret and create human-like text that captures the nuances of natural language.

2. **What is the core training objective of LLMs, and what emergent abilities have they demonstrated?**
Answer: The core training objective of LLMs focuses on predicting the next word in a sentence. This straightforward objective has led to the development of emergent abilities, such as conducting arithmetic calculations, unscrambling words, and demonstrating proficiency in professional exams, like passing the US Medical Licensing Exam.

3. **How do LLMs generate text, and what is the role of the attention mechanism in this pro

In [33]:
loader=PyPDFLoader("VITEEE_Brochure.pdf")
pages=loader.load_and_split()

vectorstore = DocArrayInMemorySearch.from_documents(
    pages,
    embedding=embedding_function
)

retriever=vectorstore.as_retriever()

In [34]:
retriever.invoke("Machine Learning")

[]

In [35]:
retriever.invoke("VITEEE")

[]
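
Both queries return an empty list. The notebook does not say why; one possible cause (an assumption) is that no text could be extracted from this file, for example if it consists of scanned images. A small sketch for checking this before building the vector store, using a hypothetical `pages_with_text` helper that works on any objects with a `page_content` attribute:

```python
from types import SimpleNamespace


def pages_with_text(pages, min_chars: int = 1):
    """Keep only pages whose extracted text has at least min_chars non-space characters."""
    return [p for p in pages if len(p.page_content.strip()) >= min_chars]


# Stand-in objects that mimic the `page_content` attribute of loaded pages.
sample = [
    SimpleNamespace(page_content="  \n"),
    SimpleNamespace(page_content="VITEEE 2024"),
]
usable = pages_with_text(sample)
print(f"{len(usable)} of {len(sample)} pages contain text")
```

If the count is zero for a real file, the retriever has nothing to search, whatever the query.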

In [37]:
loader=PyPDFLoader("VITEEE-2024-information-brochure.pdf")
pages=loader.load_and_split()

vectorstore = DocArrayInMemorySearch.from_documents(
    pages,
    embedding=embedding_function
)

retriever=vectorstore.as_retriever()
pages

[Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows)', 'creationdate': '2023-12-12T12:45:23+06:30', 'moddate': '2023-12-13T10:15:04+05:30', 'title': 'VITEEE BROCHURE_2024_Updated 11', 'source': 'VITEEE-2024-information-brochure.pdf', 'total_pages': 20, 'page': 0, 'page_label': '1'}, page_content='/vituniversity /vellore_vit www.vit.ac.in /vellore-institute-of-technology/VIT_univ\n2024\n2024\nVIT ENGINEERING ENTRANCE\nEXAMINATION\nVIT ENGINEERING ENTRANCE\nEXAMINATION\nFor Admission to B.T ech. Programmes of\nVIT - Vellore | VIT - Chennai | VIT - AP | VIT - Bhopal\nFor Admission to B.T ech. Programmes of\nVIT - Vellore | VIT - Chennai | VIT - AP | VIT - Bhopal\nciogNg caHT jUk\nVELLORE INSTITUTE OF TECHNOLOGY\nVIT\nVIT\nVellore Institute of Technology\n(Deemed to be University under section 3 of UGC Act, 1956)\nR\nVITEEE\nProspectus'),
 Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows

In [38]:
retriever.invoke("VITEEE")

[Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows)', 'creationdate': '2023-12-12T12:45:23+06:30', 'moddate': '2023-12-13T10:15:04+05:30', 'title': 'VITEEE BROCHURE_2024_Updated 11', 'source': 'VITEEE-2024-information-brochure.pdf', 'total_pages': 20, 'page': 9, 'page_label': '10'}, page_content='Submission of VITEEE\nApplication through\nWeb by\napplicants\nhttps://viteee.vit.ac.in \nGenerate link for\ntest slot booking\nto the candidates\nApplication\nVeriﬁcation\n&\nScrutiny\nAppear for\nVITEEE 2024\nAnnouncement of\nOnline Counselling\nbased on\nVITEEE 2024 Rank\nTest Slot,\nCentre booking &\nAdmit Card\nGeneration\nby the candidates\nVITEEE 2024\nResult\nAnnouncement\nProvisional\nAdmission Letter\nto B.Tech Programme\nbased on Merit after\nrequired fee paymentDocument\nVeriﬁcation\nProgramme wishlist\nby the candidates\n Vellore Institute of Technology Engineering Entrance Examination (VITEEE) is conducted for admission to under

In [39]:
retriever.invoke("VIT")

[Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows)', 'creationdate': '2023-12-12T12:45:23+06:30', 'moddate': '2023-12-13T10:15:04+05:30', 'title': 'VITEEE BROCHURE_2024_Updated 11', 'source': 'VITEEE-2024-information-brochure.pdf', 'total_pages': 20, 'page': 19, 'page_label': '20'}, page_content='About VIT\n17'),
 Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows)', 'creationdate': '2023-12-12T12:45:23+06:30', 'moddate': '2023-12-13T10:15:04+05:30', 'title': 'VITEEE BROCHURE_2024_Updated 11', 'source': 'VITEEE-2024-information-brochure.pdf', 'total_pages': 20, 'page': 1, 'page_label': '2'}, page_content='VIT - Vellore\nVIT - Chennai\nVIT - AP\nVIT - Bhopal'),
 Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows)', 'creationdate': '2023-12-12T12:45:23+06:30', 'moddate': '2023-12-13T10:15:04+05:30', 'title': 'VITEEE BROCHURE_2024_Updat

In [40]:
retriever.invoke("B.Tech")

[Document(metadata={'producer': 'Adobe PDF library 17.00', 'creator': 'Adobe Illustrator 27.9 (Windows)', 'creationdate': '2023-12-12T12:45:23+06:30', 'moddate': '2023-12-13T10:15:04+05:30', 'title': 'VITEEE BROCHURE_2024_Updated 11', 'source': 'VITEEE-2024-information-brochure.pdf', 'total_pages': 20, 'page': 14, 'page_label': '15'}, page_content='B.TECH.- Bioengineering\nB.TECH.- Biotechnology\nB.TECH.- Chemical Engineering\nB.TECH.- Civil Engineering\nB.TECH.- Electrical and Electronics Engineering\nB.TECH.- Electronics and Instrumentation Engineering\nB.TECH.- Fashion Technology\nB.TECH.- Aerospace Engineering\nB.TECH.- Computer Science & Engineering (E-Commerce Technology)\nB.TECH.- Computer Science & Engineering (Education Technology)\nB.TECH.- Computer Science and Engineering and Business Systems\nB.TECH.- Computer Science and Engineering\nB.TECH.- Computer Science and Engineering (Artiﬁcial Intelligence and Machine Learning)\nB.TECH.- Computer Science and Engineering (Artiﬁcial

In [41]:
chain = (
    {
        "context":itemgetter("question") | retriever,
        "question":itemgetter("question")
    }
    | prompt
    | model
    | parser
)

In [42]:
chain.invoke({"question":"Give an essay on LLMs"})

"I don't know the answer to this question based on the provided context. The context appears to be related to the VITEEE brochure and admission process, and does not mention LLMs (Large Language Models) or provide any relevant information for an essay on the topic."

In [43]:
chain.invoke({"question":"When is the exam?"})

'The VITEEE 2024 exam is tentatively scheduled to be conducted between April 19 and 30, 2024. The number of days will vary for test cities.'

In [44]:
print(chain.invoke({"question":"What are the courses offered by VIT?"}))

The context provided does not explicitly list all the courses offered by VIT. However, it mentions "B. Tech Programmes Oﬀered" and "admission to undergraduate engineering programmes" which suggests that VIT offers various B.Tech programs. The exact courses are not specified in the given context.
