# Lecture 2 Practice

*Using Mistral for retrieval-augmented generation (RAG)*

**Methodology**
- Extracting the text that will act as the database for Mistral
- Chunking the text into overlapping blocks of words
- Embedding the chunks for later use
- Storing the embedded chunks (indexed) in a suitable vector store (a FAISS index)
- Retrieving the chunks closest to the embedded question
- Forming an answer with Mistral
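
The steps above can be sketched end to end without downloading any model. The toy below is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for the sentence-transformer, a brute-force scan stands in for the FAISS index, and the last step only assembles a prompt instead of calling Mistral.

```python
import math
from collections import Counter

def chunk_words(text, chunk_size, overlap):
    """Split text into windows of chunk_size words sharing `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def l2_distance(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def retrieve(question, chunks, k=1):
    """Return the k chunks whose toy embeddings are closest to the question."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: l2_distance(q, toy_embed(c)))
    return ranked[:k]

document = (
    "SQL is the lingua franca of data. "
    "Python is the glue language of the data ecosystem. "
    "Orchestration tools schedule and monitor pipelines."
)
chunks = chunk_words(document, chunk_size=8, overlap=0)
question = "Which language is the glue of the data ecosystem?"
context = retrieve(question, chunks, k=1)
prompt = f"Question: {question}\nContext: {context[0]}"
print(prompt)
```

Note that word-based chunking cuts across sentence boundaries: here the word "Python" ends up in the first chunk while the rest of its sentence lands in the second.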

---

## 1. Kaggle Setup

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/intro-to-data-engineering/chapter_1.pdf


---

## 2. Installing Required Packages

In [2]:
!pip install transformers==4.52.4

Collecting transformers==4.52.4
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers==4.52.4)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Downloading transformers-4.52.4-py3-none-any.whl (10.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.5/10.5 MB 86.3 MB/s eta 0:00:00
Downloading huggingface_hub-0.36.0-py3-none-any.whl (566 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 566.1/566.1 kB 37.5 MB/s eta 0:00:00
Installing collected packages: huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 1.0.0rc2
    Uninstalling huggingface-hub-1.0.0rc2:
      Successfully uninstalled huggingface-hub-1.0.0rc2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.3
    Uninstalling transfo

In [3]:
!pip install sentence-transformers PyPDF2 faiss-cpu

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>

---

## 3. Importing Packages

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [5]:
import faiss
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer


--- 

## 4. Helper Functions

In [6]:
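def generate_text(prompt, max_length=100, num_return_sequences=1):
    """Sample completions from the globally loaded `model` and `tokenizer`.

    `max_length` counts the prompt tokens plus the generated tokens.
    """
    # Put the input tensors on the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
    )
    # Each decoded string contains the prompt followed by the generated text.
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]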
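
Because `generate_text` decodes the full output sequence, every returned string starts with the prompt and only then continues with the model's answer. For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so one option is to drop the first `prompt_len` token ids before decoding. The sketch below shows that slicing on plain Python lists with made-up token ids; `strip_prompt_tokens` is a hypothetical helper, not part of the notebook.

```python
def strip_prompt_tokens(sequences, prompt_len):
    """Drop the first prompt_len token ids from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]

# Made-up token ids: the first three ids stand for the prompt,
# the remainder for the newly generated continuation.
prompt_ids = [101, 7, 42]
generated = [prompt_ids + [9, 9, 2], prompt_ids + [5, 2]]

new_tokens_only = strip_prompt_tokens(generated, prompt_len=len(prompt_ids))
print(new_tokens_only)  # [[9, 9, 2], [5, 2]]
```

With the tensors used in `generate_text`, the equivalent slice would be `outputs[:, inputs["input_ids"].shape[1]:]`, applied before `tokenizer.decode`. Relatedly, `max_new_tokens` limits only the generated part, whereas `max_length` also counts the prompt.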

In [7]:
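def extract_text_from_pdf(pdf_path):
    """Return the text of all pages in the PDF, each page followed by a newline."""
    reader = PdfReader(pdf_path)
    full_text = ""
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text.
        full_text += (page.extract_text() or "") + "\n"
    return full_text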
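
Text extracted from PDFs often keeps typographic ligatures such as "ﬁ" and "ﬂ", which is visible in the retrieved chunks printed later in this notebook. A minimal cleanup sketch, assuming Unicode NFKC normalization is acceptable for the text at hand (it also folds other compatibility characters, such as non-breaking spaces); `normalize_extracted_text` is a hypothetical helper and not part of the original pipeline.

```python
import re
import unicodedata

def normalize_extracted_text(text):
    """Fold compatibility characters (e.g. the 'fi' ligature) and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()

# \ufb01 is the "fi" ligature, \ufb02 is the "fl" ligature.
sample = "A data engineer must be pro\ufb01cient in tools like Air\ufb02ow.\n\n"
print(normalize_extracted_text(sample))
```

This does not repair other extraction damage, such as a word split around a wrongly mapped glyph (`di Ưerent`).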

In [8]:
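def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    # Step by chunk_size - overlap so neighbouring chunks share `overlap` words.
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks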
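
To see how the stride `chunk_size - overlap` produces overlapping windows, the sketch below computes only the word offsets of each chunk. `window_bounds` is a hypothetical helper used for illustration; it mirrors the `range(0, len(words), chunk_size - overlap)` loop in `chunk_text`.

```python
def window_bounds(n_words, chunk_size, overlap):
    """Return (start, end) word offsets for each chunk, end exclusive."""
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, n_words))
            for start in range(0, n_words, step)]

bounds = window_bounds(n_words=12, chunk_size=5, overlap=2)
print(bounds)  # [(0, 5), (3, 8), (6, 11), (9, 12)]
```

Each window starts 3 words after the previous one, so consecutive windows share 2 words, and the last window is shorter than `chunk_size`.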

In [9]:
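def embed_chunks(chunks, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """Embed the chunks and return both the embedding model and the vectors."""
    model = SentenceTransformer(model_name)
    # One row per chunk; the same model must later embed the question.
    embeddings = model.encode(chunks, convert_to_numpy=True)
    return model, embeddings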

In [10]:
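def create_faiss_index(embeddings):
    """Build an exact (brute-force) L2 index over the chunk embeddings."""
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    # Row i of the index corresponds to chunks[i].
    index.add(embeddings)
    return index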
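
FAISS indexes operate on float32 vectors. `SentenceTransformer.encode(..., convert_to_numpy=True)` typically returns float32 already, but if the embeddings come from another source, a defensive conversion before `index.add` may help. A minimal sketch; `as_faiss_array` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def as_faiss_array(vectors):
    """Return the input as a 2-D, C-contiguous float32 array."""
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {arr.shape}")
    return arr

vectors = as_faiss_array([[0.1, 0.2], [0.3, 0.4]])
print(vectors.dtype, vectors.shape)  # float32 (2, 2)
```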

In [11]:
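def search_index(query, model, index, chunks, k=5):
    """Return up to k chunks whose embeddings are nearest to the query."""
    query_embedding = model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, k)
    # FAISS pads the result with -1 when the index holds fewer than k vectors.
    return [chunks[i] for i in indices[0] if i != -1]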
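
`IndexFlatL2` performs an exhaustive search and reports squared L2 distances. The NumPy sketch below reproduces that computation on a tiny made-up dataset, to make explicit what `index.search` returns; `brute_force_search` is a hypothetical name, and the vectors are illustrative only.

```python
import numpy as np

def brute_force_search(query_vec, vectors, k):
    """Return (squared L2 distances, indices) of the k nearest rows."""
    diffs = vectors - query_vec            # broadcast the query over all rows
    sq_dists = np.sum(diffs ** 2, axis=1)  # squared Euclidean distance per row
    order = np.argsort(sq_dists)[:k]       # smallest distance first
    return sq_dists[order], order

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
query = np.array([0.9, 0.1], dtype=np.float32)

dists, idx = brute_force_search(query, vectors, k=2)
print(idx.tolist())  # [1, 0]
```

The returned indices are row positions, which is why `search_index` can use them directly to look up entries in `chunks`.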

---

## 5. Trying It Out

In [12]:
# Loading the Mistral model and its tokenizer
model_name = "mistralai/Mistral-Nemo-Instruct-2407"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")


In [13]:
# Extracting the PDF text and chunking it
pdf_path = "/kaggle/input/intro-to-data-engineering/chapter_1.pdf"

text = extract_text_from_pdf(pdf_path)
chunks = chunk_text(text, chunk_size=50, overlap=5)

In [14]:
# Embedding text chunks
model_embeddings, embeddings = embed_chunks(chunks)


In [15]:
# Creating the vector database
index = create_faiss_index(embeddings)

In [25]:
question = "What are the main programming languages that data engineers commonly use?"
top_chunks = search_index(question, model_embeddings, index, chunks, k=3)

for i, chunk in enumerate(top_chunks, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")


--- Chunk 1 ---
lingua franca of data. Has reemerged as a powerful interface for processing massive datasets in data warehouses, data lakes, and streaming frameworks. Python The "glue" language of the data ecosystem and the bridge to data science. Underlies popular tools like pandas, Airﬂow, and PySpark. JVM (Java/Scala) Prevalent in many core

--- Chunk 2 ---
Reuse Creating components and processes that can be leveraged across multiple projects. Interoperability Designing systems where di Ưerent tools and services work together seamlessly. Technical Skills A data engineer must be proﬁcient in production-grade software engineering. Primary Programming Languages Description and Use Cases SQL The lingua franca of data. Has

--- Chunk 3 ---
"modern data stack." Success in the ﬁeld requires a dual proﬁciency in deep technical skills—most notably SQL, Python, and robust software engineering practices—and essential business competencies, including cross-functional communication, cost manage

In [37]:
prompt = f"""Answer the next question: {question}
Based on this list of paragraphs {top_chunks}, 
and if the answer does not lie in the paragraphs, your output should state that

Output format should be like:
Answer: ...
Explanation: ...
"""

In [38]:
answer = generate_text(prompt, max_length=700)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [39]:
print(answer[0])

Answer the next question: What are the main programming languages that data engineers commonly use?
Based on this list of paragraphs ['lingua franca of data. Has reemerged as a powerful interface for processing massive datasets in data warehouses, data lakes, and streaming frameworks. Python The "glue" language of the data ecosystem and the bridge to data science. Underlies popular tools like pandas, Airﬂow, and PySpark. JVM (Java/Scala) Prevalent in many core', 'Reuse Creating components and processes that can be leveraged across multiple projects. Interoperability Designing systems where di Ưerent tools and services work together seamlessly. Technical Skills A data engineer must be proﬁcient in production-grade software engineering. Primary Programming Languages Description and Use Cases SQL The lingua franca of data. Has', '"modern data stack." Success in the ﬁeld requires a dual proﬁciency in deep technical skills—most notably SQL, Python, and robust software engineering practices—