## 📘 Introduction

In this notebook, we demonstrate a simple Retrieval-Augmented Generation (RAG) pipeline using Hugging Face Transformers, Sentence Transformers, and FAISS.

The goal is to:
- Extract text from a PDF file (university information).
- Split the text into chunks for efficient retrieval.
- Generate embeddings with a transformer model.
- Build a FAISS vector index to enable semantic search.
- Use a large language model (LLM), Mistral Nemo Instruct, to answer questions based on the retrieved chunks.

This approach is widely used in question answering, chatbots, and knowledge retrieval systems.

## ⚙️ Steps

![image](https://miro.medium.com/v2/resize:fit:1100/format:webp/0*ykFSvJzAtPg8W2GN)

#### 1. Setup & Model Loading
- Log in to Hugging Face.
- Load the Mistral Nemo Instruct 2407 model for text generation.
- Install and import necessary libraries.

In [1]:
from huggingface_hub import login
import os

# Read the Hugging Face token from the HF_TOKEN environment variable
token = os.getenv("HF_TOKEN")
login(token=token)


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-Nemo-Instruct-2407"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")


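def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Move the tokenized prompt to the model's device (device_map="auto" may place the model on GPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,  # counts prompt tokens plus generated tokens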
        num_return_sequences=num_return_sequences,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
    )
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
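Because `generate_text` decodes the full output sequence, the returned string starts with the prompt itself, as the printed answers later in this notebook show. The sketch below is one possible way to keep only the generated continuation. `strip_prompt` is a hypothetical helper, not part of the original notebook, and it assumes the decoded text reproduces the prompt verbatim, which tokenization round-trips do not always guarantee.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Return the text that follows the echoed prompt (hypothetical helper)."""
    # Only trim when the output really begins with the prompt;
    # otherwise return the output unchanged rather than guess.
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


# Toy strings standing in for a real prompt and model output.
example_prompt = "Answer the next question: Where is the campus?"
example_output = example_prompt + " The campus is described in section 2."
continuation = strip_prompt(example_output, example_prompt)
print(continuation)
```

If the prefix check fails, the helper falls back to returning the full string, so nothing is lost silently.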


In [3]:
!pip install sentence-transformers PyPDF2 faiss-cpu -q

#### 2. PDF Text Extraction
- Use PyPDF2 to extract text from the uploaded PDF.
- Concatenate all pages into a single document string.

In [4]:
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
import faiss
import numpy as np

In [5]:
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    full_text = ""
    for page in reader.pages:
        # Fall back to an empty string if a page yields no extractable text
        full_text += (page.extract_text() or "") + "\n"
    return full_text
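Text extracted from a PDF can carry layout debris; the retrieved chunks printed later in this notebook contain runs of `•` glyphs, for example. The sketch below shows one way such text could be normalized before chunking. `clean_extracted_text` is a hypothetical helper that is not in the original notebook, and it assumes bullet glyphs carry no meaning worth keeping, which may not hold for every document.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Collapse bullet-glyph runs and whitespace (hypothetical helper)."""
    # Replace any run of bullet glyphs, with the whitespace around it, by one space.
    text = re.sub(r"(?:\s*•)+\s*", " ", text)
    # Collapse remaining whitespace, including newlines, to single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Al-Khatib• • •  Vice\nPresident"
cleaned = clean_extracted_text(sample)
print(cleaned)
```

Since `chunk_text` splits on whitespace anyway, collapsing newlines here does not change how the chunks are formed.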

#### 3. Chunking the Document
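- Split the text into overlapping chunks of whitespace-separated words. This notebook uses 50 words per chunk with a 5-word overlap; the function defaults are 500 and 50.
- The overlap preserves context that would otherwise be cut at chunk boundaries.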
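To make the windowing concrete, here is a small sketch of where each chunk starts and ends for a hypothetical 120-word document with the settings used in this notebook (`chunk_size=50`, `overlap=5`). The helper name `chunk_starts` and the 120-word length are illustrative assumptions.

```python
def chunk_starts(n_words: int, chunk_size: int, overlap: int) -> list[int]:
    """Start offsets of a sliding window that advances by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return list(range(0, n_words, step))


n_words, chunk_size, overlap = 120, 50, 5
starts = chunk_starts(n_words, chunk_size, overlap)
# Each window covers words [start, end); the last one is cut at the document end.
windows = [(s, min(s + chunk_size, n_words)) for s in starts]
print(windows)  # [(0, 50), (45, 95), (90, 120)]
```

Each window begins 45 words after the previous one, so the last 5 words of one chunk are repeated as the first 5 words of the next.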

In [6]:
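def chunk_text(text, chunk_size=500, overlap=50):
    # Sizes are counted in whitespace-separated words, not model tokens
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    chunks = []
    # Advance by chunk_size - overlap so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks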

#### 4. Embeddings & FAISS Indexing
- Generate embeddings using SentenceTransformers (`all-MiniLM-L6-v2`).
- Store embeddings in a FAISS vector index (`IndexFlatL2`, exact L2 search) for semantic search.
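As a rough mental model, a flat L2 index compares the query against every stored vector and returns the `k` smallest distances; FAISS's `IndexFlatL2` reports these as squared Euclidean distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors. It is an illustration only, not FAISS's implementation, and `flat_l2_search` is a hypothetical name.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(query, vectors, k):
    """Exhaustive nearest-neighbour search; returns (distances, indices)."""
    # Score every vector, then keep the k closest (ties broken by index).
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]  # toy stand-ins for embeddings
distances, indices = flat_l2_search([0.9, 0.1], vectors, k=2)
print(indices)
```

Smaller distances mean closer matches, which is why `search_index` can use the returned indices in order as a relevance ranking.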

In [7]:
def embed_chunks(chunks, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks, convert_to_numpy=True)
    return model, embeddings

In [8]:
def create_faiss_index(embeddings):
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index

In [9]:
def search_index(query, model, index, chunks, k=5):
    query_embedding = model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, k)
    # FAISS returns -1 for missing neighbours when the index holds fewer than k vectors
    return [chunks[i] for i in indices[0] if i != -1]

#### 5. Question Answering

- Define user queries (e.g., "Where is Tips Hindawi University located?").
- Retrieve the top-k most relevant chunks from FAISS.
- Pass the top-ranked chunk to the LLM (Mistral Nemo) as context for generating the answer.
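A prompt can also carry several retrieved chunks at once, which may help when the most useful passage is not ranked first. The sketch below shows one possible prompt builder. `build_prompt`, its `max_chunks` parameter, and the instruction wording are illustrative assumptions rather than the prompt format used in the cells that follow.

```python
def build_prompt(question: str, chunks: list[str], max_chunks: int = 3) -> str:
    """Join up to max_chunks retrieved passages into one grounded prompt."""
    # Number each passage so the answer can be traced back to its source.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks[:max_chunks], 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


demo_prompt = build_prompt(
    "Where is the university located?",
    ["First passage.", "Second passage.", "Third passage.", "Fourth passage."],
    max_chunks=2,
)
print(demo_prompt)
```

The resulting string could be passed to `generate_text` in place of the single-chunk prompt; keep `max_length` in mind, since it counts prompt tokens as well.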

In [10]:
pdf_path = "/kaggle/input/tips-pdf/Tips_Hindawi_University_Info.pdf"

text = extract_text_from_pdf(pdf_path)
chunks = chunk_text(text, chunk_size=50, overlap=5)

model_embeddings, embeddings = embed_chunks(chunks)

index = create_faiss_index(embeddings)


##### Question 1

In [11]:
question_1 = "Where is Tips Hindawi University located?"
top_chunks_1 = search_index(question_1, model_embeddings, index, chunks, k=3)

for i, chunk in enumerate(top_chunks_1, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")



--- Chunk 1 ---
1. General Overview Tips Hindawi University (THU) is a premier institution of higher education located in the heart of the Middle East. Founded in 1963, the university has grown into a globally recognized center for academic excellence and innovation. With over six decades of educational leadership, THU has produced more

--- Chunk 2 ---
leadership, THU has produced more than 150,000 graduates who serve in diverse industries and academic circles worldwide. The university is accredited by the International Commission for Academic Standards and the Ministry of Higher Education. It operates under the guiding motto: "Knowledge, Integrity, Progress" , and fosters an environment of research,

--- Chunk 3 ---
Four residence halls - A medical center - Sports complex and stadium - Innovation and entrepreneurship hub 2.2 Satellite Campuses North Campus : Focused on agricultural sciences and environmental research West Campus : Home to the School of Design and Architecture 2.3 St

In [12]:
chunk_1 = top_chunks_1[0]
prompt_1 = f"Answer the next question: {question_1} by reading the following text:{chunk_1}"

In [13]:
answer_1 = generate_text(prompt_1, max_length=700)
print(answer_1[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the next question: Where is Tips Hindawi University located? by reading the following text:1. General Overview Tips Hindawi University (THU) is a premier institution of higher education located in the heart of the Middle East. Founded in 1963, the university has grown into a globally recognized center for academic excellence and innovation. With over six decades of educational leadership, THU has produced more than 200,000 alumni who have made significant contributions to their respective fields and communities.2. Campus and Location THU's main campus is situated in the vibrant city of Cairo, Egypt. The campus spans over 100 acres and is home to numerous state-of-the-art facilities, including libraries, research centers, and recreational spaces. The university's strategic location in Cairo allows it to draw from the city's rich cultural heritage and bustling intellectual environment, providing students with unique opportunities for learning and growth. Additionally, THU has seve

##### Question 2

In [14]:
question_2 = "Does the university offer online programs?"
top_chunks_2 = search_index(question_2, model_embeddings, index, chunks, k=3)

for i, chunk in enumerate(top_chunks_2, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")



--- Chunk 1 ---
leadership, THU has produced more than 150,000 graduates who serve in diverse industries and academic circles worldwide. The university is accredited by the International Commission for Academic Standards and the Ministry of Higher Education. It operates under the guiding motto: "Knowledge, Integrity, Progress" , and fosters an environment of research,

--- Chunk 2 ---
Graduate Programs MBA (Master of Business Administration) MSc in Data Science MA in Psychology PhD in Mechanical Engineering PhD in Political Science 4. Admissions and Tuition 4.1 Undergraduate Admissions High school GPA of 85% or equivalent English proficiency test (IELTS 6.0 or TOEFL 80) Entrance interview for certain programs 4.2

--- Chunk 3 ---
1. General Overview Tips Hindawi University (THU) is a premier institution of higher education located in the heart of the Middle East. Founded in 1963, the university has grown into a globally recognized center for academic excellence and innovation. With ov

In [15]:
chunk_2 = top_chunks_2[0]
prompt_2 = f"Answer the next question: {question_2} by reading the following text:{chunk_2}"

In [16]:
answer_2 = generate_text(prompt_2, max_length=700)
print(answer_2[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the next question: Does the university offer online programs? by reading the following text:leadership, THU has produced more than 150,000 graduates who serve in diverse industries and academic circles worldwide. The university is accredited by the International Commission for Academic Standards and the Ministry of Higher Education. It operates under the guiding motto: "Knowledge, Integrity, Progress" , and fosters an environment of research, innovation, and entrepreneurship. In addition, THU offers a range of online programs, including courses on leadership, business, and education, to accommodate students with flexible schedules and learning preferences.

So, does the university offer online programs? Yes, the university does offer online programs.


##### Question 3

In [17]:
question_3 = "Is there financial aid for international students?"
top_chunks_3 = search_index(question_3, model_embeddings, index, chunks, k=3)

for i, chunk in enumerate(top_chunks_3, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")



--- Chunk 1 ---
interview for certain programs 4.2 Graduate Admissions Relevant bachelor’s degree Minimum GPA of 3.0/4.0 Two academic references Statement of purpose 4.3 Tuition Fees (Annual) Undergraduate: $5,000 - $8,000 Graduate: $6,500 - $10,000 4.4 Scholarships Merit Scholarships : 25% to 100% tuition coverage Need-Based Grants Research Fellowships for graduate students 5.

--- Chunk 2 ---
Graduate Programs MBA (Master of Business Administration) MSc in Data Science MA in Psychology PhD in Mechanical Engineering PhD in Political Science 4. Admissions and Tuition 4.1 Undergraduate Admissions High school GPA of 85% or equivalent English proficiency test (IELTS 6.0 or TOEFL 80) Entrance interview for certain programs 4.2

--- Chunk 3 ---
Fellowships for graduate students 5. Administration President : Dr . Nabil Al-Khatib• • • • • • • • • • • • • • • • • • • • • • • 2 Vice President of Academic Affairs : Prof. Layla Mahmoud Dean of Students : Mr .


In [18]:
chunk_3 = top_chunks_3[0]
prompt_3 = f"Answer the next question: {question_3} by reading the following text:{chunk_3}"

In [19]:
answer_3 = generate_text(prompt_3, max_length=700)
print(answer_3[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the next question: Is there financial aid for international students? by reading the following text:interview for certain programs 4.2 Graduate Admissions Relevant bachelor’s degree Minimum GPA of 3.0/4.0 Two academic references Statement of purpose 4.3 Tuition Fees (Annual) Undergraduate: $5,000 - $8,000 Graduate: $6,500 - $10,000 4.4 Scholarships Merit Scholarships : 25% to 100% tuition coverage Need-Based Grants Research Fellowships for graduate students 5.1 International Students 5.1.1 Admission Requirements English proficiency test score (TOEFL or IELTS) Financial statement demonstrating sufficient funds for one academic year 5.1.2 Tuition Fees (Annual) Undergraduate: $10,000 - $15,000 Graduate: $12,000 - $18,000 5.1.3 Financial Aid Limited to graduate students, in the form of Research Assistantships and Teaching Assistantships.

Based on the information provided, there is limited financial aid available for international students, specifically for graduate students, in the

##### Question 4

In [20]:
question_4 = "What languages are used for instruction?"
top_chunks_4 = search_index(question_4, model_embeddings, index, chunks, k=3)

for i, chunk in enumerate(top_chunks_4, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")



--- Chunk 1 ---
Events and Traditions Founders' Week Annual Cultural Festival Spring Research Showcase 7.3 Support Services Mental Health Counseling Career Development Center Tutoring and Writing Labs Disability Services 8. Faculty Highlights Dr. Yasir Al-Sabbagh (Computer Science): Expert in natural language processing Prof. Rana Khalil (Economics): World Bank consultant and author Dr. Omar

--- Chunk 2 ---
Dormitory (First-year undergraduates) Rafidain Apartments (Graduate students) Amal Housing Complex (Family and international students) 3. Academic Structure 3.1 Faculties Faculty of Engineering Faculty of Medicine and Health Sciences Faculty of Business and Economics Faculty of Arts and Humanities Faculty of Law and International Studies Faculty of Computer and Information Sciences

--- Chunk 3 ---
of Computer and Information Sciences Faculty of Architecture and Design• • • • • • • • • • • • 1 3.2 Sample Undergraduate Programs BSc in Computer Science BA in International Relations 

In [21]:
chunk_4 = top_chunks_4[0]
prompt_4 = f"Answer the next question: {question_4} by reading the following text:{chunk_4}"

In [22]:
answer_4 = generate_text(prompt_4, max_length=700)
print(answer_4[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the next question: What languages are used for instruction? by reading the following text:Events and Traditions Founders' Week Annual Cultural Festival Spring Research Showcase 7.3 Support Services Mental Health Counseling Career Development Center Tutoring and Writing Labs Disability Services 8. Faculty Highlights Dr. Yasir Al-Sabbagh (Computer Science): Expert in natural language processing Prof. Rana Khalil (Economics): World Bank consultant and author Dr. Omar Al-Kaysi (Mathematics): Recipient of the University's Teaching Excellence Award

In the provided text, the languages used for instruction are not explicitly stated. However, based on the context and the mention of the faculty, we can infer that the languages used for instruction are likely English, as it is a common language of instruction in many educational institutions and is spoken by the faculty members mentioned.


## ✅ Conclusion

In this notebook, we successfully implemented a basic RAG pipeline that:
- Retrieves relevant context from a document.
- Combines semantic search with a generative model.
- Answers user queries by conditioning the model on text retrieved from the document.

This approach can be extended to:
- Larger document collections.
- More advanced embedding models.
- Deployment as a chatbot or API.

Combining FAISS retrieval with an LLM is a step towards more accurate and context-aware question answering systems.

## 👨‍💻 Made by: Abdelrahman Eldaba

- Check out my website with a portfolio [Here](https://sites.google.com/view/abdelrahman-eldaba110) 🌟
- Connect with me on [LinkedIn](https://www.linkedin.com/in/abdelrahmaneldaba) 🌐
- Look at my [GitHub](https://github.com/Abdelrahman47-code) and [Kaggle](https://www.kaggle.com/abdelrahmanahmed110)🚀