<a href="https://colab.research.google.com/github/Mangai2024/Mangai2024/blob/main/Complete_RAG_Pipeline_solar_System__Using_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
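#Step 1: Load the PDF
!pip install -U langchain-community pypdf
# Document loaders live in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import PyPDFLoader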

Collecting pypdf
  Downloading pypdf-6.2.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.2.0-py3-none-any.whl (326 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.2.0


In [None]:
#Load the PDF
loader = PyPDFLoader("/content/solar_system.pdf")

In [None]:
#Extract text as documents (one per page)
docs = loader.load()

In [None]:
docs[0].page_content

'Solar System\nThe Solar System consists of the Sun and the objects that orbit it, including eight planets, dwarf\nplanets, moons, asteroids, and comets. The Sun is the central star and provides the energy needed for\nlife on Earth. The planets orbit the Sun in elliptical paths. The inner planets are Mercury, Venus, Earth,\nand Mars. The outer planets are Jupiter, Saturn, Uranus, and Neptune.'

In [None]:
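#Chunking
# The splitter now ships in the langchain-text-splitters package (installed as a langchain dependency)
from langchain_text_splitters import RecursiveCharacterTextSplitter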

In [None]:
#Create a text splitter: chunks of at most 500 characters, with 50 characters shared between neighbours
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)

In [None]:
#Split the PDF pages into chunks
chunks = splitter.split_documents(docs)

In [None]:
print("Total chunks:",  len(chunks))
print("First chunk:\n")
print(chunks[0].page_content)

Total chunks: 1
First chunk:

Solar System
The Solar System consists of the Sun and the objects that orbit it, including eight planets, dwarf
planets, moons, asteroids, and comets. The Sun is the central star and provides the energy needed for
life on Earth. The planets orbit the Sun in elliptical paths. The inner planets are Mercury, Venus, Earth,
and Mars. The outer planets are Jupiter, Saturn, Uranus, and Neptune.
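
The single chunk above follows from the splitter settings: the page text is shorter than chunk_size, so nothing gets cut. As a rough illustration of how chunk_size and chunk_overlap interact on longer text, here is a minimal fixed-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to cut at separators such as paragraph and line breaks), and the function name sliding_window_chunks is made up for this example.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window start advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical inputs: one text shorter than chunk_size, one longer
short_text = "x" * 400
long_text = "".join(str(i % 10) for i in range(1200))

short_chunks = sliding_window_chunks(short_text)
long_chunks = sliding_window_chunks(long_text)

print(len(short_chunks))                            # 1: the whole text fits in one window
print(len(long_chunks))                             # 3
print(long_chunks[0][-50:] == long_chunks[1][:50])  # True: neighbours share 50 characters
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.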


In [None]:
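#Embedding steps
# Install faiss-cpu first, then import from langchain_community (the langchain.* paths are deprecated)
!pip install faiss-cpu
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS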

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [None]:
#Create the embedding model (runs locally, no API key needed)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



In [None]:
# Create FAISS index from documents and embeddings, then save it
db = FAISS.from_documents(chunks, embeddings)
db.save_local("faiss_hf_index")

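# Now load the local FAISS index. The saved docstore is pickled, so
# allow_dangerous_deserialization=True is required; only enable it for index files you created yourself.
db = FAISS.load_local("faiss_hf_index", embeddings, allow_dangerous_deserialization=True)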

In [None]:
#Retrieval only
query = "What are the planets in the Solar System?"
# k=3 asks for up to 3 chunks; this index holds only 1, so 1 comes back
retriever = db.as_retriever(search_kwargs={"k": 3})
results = retriever.invoke(query)
for r in results:
    print(r.page_content[:400])
    print("-------------------")

Solar System
The Solar System consists of the Sun and the objects that orbit it, including eight planets, dwarf
planets, moons, asteroids, and comets. The Sun is the central star and provides the energy needed for
life on Earth. The planets orbit the Sun in elliptical paths. The inner planets are Mercury, Venus, Earth,
and Mars. The outer planets are Jupiter, Saturn, Uranus, and Neptune.
-------------------
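
Only one result comes back even though k is 3, because the index holds a single chunk. Conceptually, the retriever embeds the query and returns the k stored vectors closest to it. The sketch below shows that idea with hand-written 2-D vectors and Euclidean distance (which, as far as I know, is the default metric of LangChain's FAISS wrapper). The function top_k_by_l2 and the toy vectors are assumptions for illustration, not the real 384-dimensional MiniLM embeddings.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors nearest to query_vec, nearest first."""
    scored = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort()  # smallest distance first; ties fall back to the lower index
    return [idx for _, idx in scored[:k]]


# Toy "embeddings": vectors 0 and 2 point roughly the same way as the query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.0]

ranked = top_k_by_l2(query_vec, doc_vecs, k=2)
print(ranked)  # [0, 2]

# With a single stored vector, asking for k=3 still yields one hit
single = top_k_by_l2(query_vec, [[0.5, 0.5]], k=3)
print(single)  # [0]
```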


In [None]:
#Retrieval and Generation (RAG)
#Load the LLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

In [None]:
model_name="google/flan-t5-small"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

Device set to use cpu


In [None]:
#Retrieve context
query = "What are the planets in the Solar System?"
retriever = db.as_retriever(search_kwargs={"k": 3})
# Use a separate name so the pages loaded earlier into docs are not overwritten
retrieved_docs = retriever.invoke(query)
context = ""
for d in retrieved_docs:
    context += d.page_content + "\n\n"

In [None]:
#Generate answer
# "\nQuestion" was missing its backslash, which glued a stray "n" onto the context
prompt = f"Context:\n{context}\nQuestion: {query}\nAnswer:"
# max_new_tokens limits only the generated tokens; passing max_length instead conflicts with the pipeline's default max_new_tokens
answer = llm(prompt, max_new_tokens=200)[0]["generated_text"]
print(answer)


dwarf planets, moons, asteroids, and comets
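
The retrieval and generation cells above can be folded into one reusable function. This is a sketch, not the notebook's own code: build_prompt, rag_answer, and the two stub callables are hypothetical names. In this notebook, retrieve would wrap retriever.invoke (returning each document's page_content) and generate would wrap the llm pipeline.

```python
from typing import Callable, Sequence


def build_prompt(context_chunks: Sequence[str], question: str) -> str:
    """Join the retrieved chunks and place the question after them."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def rag_answer(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context for the question, build the prompt, and generate an answer."""
    chunks = retrieve(question)
    return generate(build_prompt(chunks, question))


# Stubs standing in for the real retriever and model, so the flow can be checked offline
def fake_retrieve(question: str) -> list[str]:
    return ["The outer planets are Jupiter, Saturn, Uranus, and Neptune."]


def fake_generate(prompt: str) -> str:
    # Echo the prompt length instead of calling a model
    return f"prompt had {len(prompt)} characters"


demo_prompt = build_prompt(["a", "b"], "q?")
demo_answer = rag_answer("Which planets are outer planets?", fake_retrieve, fake_generate)
print(demo_prompt)
print(demo_answer)
```

Passing the retriever and model in as callables keeps the prompt logic testable without downloading any weights.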
