# RAG Student Assistant

## Ingestion

In [1]:
from langchain_community.document_loaders import PyPDFLoader

In [10]:
# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader('./course_1.pdf')
docs = loader.load()

In [19]:
print(len(docs))
print(docs[12].page_content)

64
Binary Numbers 
• Arithmetic operations with numbers in base r follow the same rules as for 
decimal numbers.  
• When a base other than the familiar base 10 is used, one must be careful to 
use only the r-allowable digits.  
• Examples of different arithmetic operations on two binary numbers are as 
follows: 
 
 
13 



In [22]:
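# Clean the loaded PDF text: strip the page number that ends each page.
# Assumes the number sits alone on the last line; the regex covers one- and
# two-digit numbers (slicing off one character only handles pages 1-9).
import re

for doc in docs:
    doc.page_content = re.sub(r'(?<=\n)[ \t]*\d+\s*$', '', doc.page_content)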

print(docs[2].page_content)

Bit / Byte 
 
The most basic unit of information in a digital computer is called a bit, which is 
a contraction of binary digit.  
A bit has two numerical values: 0 and 1. 
In the concrete sense, a bit is nothing more than a state of “on” or “off ” (or 
“high” and “low”) within a computer circuit.  
In 1964, the designers of the IBM System/360 computer established a 
convention of using groups of 8 bits as the basic unit of addressable computer 
storage.  
They called this collection of 8 bits a byte. 



### Chunking

In [None]:
# Chunk by course part: consecutive pages that share the same title
# (their first line) are merged into a single chunk.
current_title = docs[0].page_content.split('\n')[0]
chunks = [docs[0].page_content]
for doc in docs[1:]:
    lines = doc.page_content.split('\n')
    if lines[0] == current_title:
        # Same part as the previous page: append the body without the repeated title
        chunks[-1] += ''.join('\n' + line for line in lines[1:])
    else:
        # New part: start a new chunk
        current_title = lines[0]
        chunks.append(doc.page_content)
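
As an illustration of the title-based merge above, here is a minimal sketch that runs on plain strings rather than LangChain documents. The function name `merge_pages_by_title` and the sample pages are hypothetical, made up for this example. It assumes, as the cell above does, that the first line of each page is its part title.

```python
def merge_pages_by_title(pages):
    """Merge consecutive pages whose first line (the title) is identical."""
    chunks = []
    current_title = None
    for page in pages:
        lines = page.split('\n')
        if chunks and lines[0] == current_title:
            # Continuation page: append the body, dropping the repeated title
            chunks[-1] += ''.join('\n' + line for line in lines[1:])
        else:
            # New title: start a new chunk
            current_title = lines[0]
            chunks.append(page)
    return chunks


# Hypothetical sample pages (not taken from the course PDF)
sample_pages = [
    "Binary Numbers\nBase 2 uses the digits 0 and 1.",
    "Binary Numbers\nAddition carries at 2.",
    "Bit / Byte\nA byte is a group of 8 bits.",
]
merged = merge_pages_by_title(sample_pages)
print(len(merged))
```

The two "Binary Numbers" pages collapse into one chunk with a single title line, while "Bit / Byte" starts a chunk of its own.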

In [46]:
# Apply sub-chunking for long chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=40,
    separators=['\n\n', '\n', ' ', '']
)

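# Build a new list rather than resizing `chunks` while iterating over it
split_chunks = []
for chunk in chunks:
    if len(chunk.split()) > 1000:  # threshold in words; chunk_size is in characters
        split_chunks.extend(splitter.split_text(chunk))
    else:
        split_chunks.append(chunk)
chunks = split_chunks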
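
Because the length check above counts words while `chunk_size` counts characters, it can help to look at both measures before settling on the two numbers. The sketch below is a hypothetical helper (`length_report` is not part of LangChain) that uses only the standard library. The sample strings are made up.

```python
def length_report(chunks):
    """Return (index, character count, word count) for each chunk."""
    return [(i, len(chunk), len(chunk.split())) for i, chunk in enumerate(chunks)]


# Hypothetical sample chunks (not taken from the course PDF)
sample_chunks = ["a bit is a binary digit", "byte " * 1200]
report = length_report(sample_chunks)
for index, n_chars, n_words in report:
    print(f"chunk {index}: {n_chars} characters, {n_words} words")
```

Here the second sample exceeds both the 1000-word threshold and the 4000-character `chunk_size`, so it would be split.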

In [48]:
print(chunks)

['Democratic and Popular Republic of Algeria \nMinistry of Higher Education and Scientific Research \nHigher School of Computer Science, May 8, 1945, Sidi Bel Abbes  \n \nAcademic year: 2023-2024                                                                    Semester  1 \nPreparatory class 1                                                                                 Module: Computer Architecture 1 \n', 'Introduction \n \n\uf0d8The organization of any computer depends considerably on how it represents \ndata.  \n\uf0d8This chapter describes the various ways in which computers can store and \nmanipulate numbers and characters.  \n\uf0d8The ideas presented in the following sections form the basis for understanding \nthe organization and function of all types of digital systems. \n', 'Bit / Byte \n \n\uf0d8The most basic unit of information in a digital computer is called a bit, which is \na contraction of binary digit.  \n\uf0d8A bit has two numerical values: 0 and 1. \n\uf0d8In t

### Embedding

In [8]:
# Embed a test sentence with OpenAI's text-embedding-3-large
import os
from openai import OpenAI

# The API key is read from the OPENAI_API_KEY environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=openai_api_key)

response = client.embeddings.create(
    input="this is a test sentence",
    model="text-embedding-3-large"
)

print(response.data[0].embedding)

[0.001314252964220941, 0.026127932593226433, -0.012435459531843662, 0.04970592260360718, 0.012992137111723423, -0.020489327609539032, -0.01737372949719429, 0.044246893376111984, -0.028408516198396683, 0.0028978660702705383, 0.017346791923046112, 0.04198426753282547, 0.0059169442392885685, -0.04277439042925835, 0.00901009701192379, 0.023488203063607216, -0.01863972097635269, 0.015999991446733475, -0.00799550674855709, -0.010855213738977909, 0.012974180281162262, -0.02738494798541069, -0.02291356958448887, 0.024475857615470886, 0.03151513636112213, 0.0024085284676402807, -0.008094272576272488, 0.025158237665891647, -0.020974176004529, -0.003317618975415826, 0.030024677515029907, 0.04148146137595177, 0.0036161597818136215, 0.018927039578557014, 0.038644202053546906, -0.02303927019238472, 0.02966552972793579, 0.02402692474424839, -0.004992141388356686, 0.008031421341001987, 0.047084152698516846, -0.01731087826192379, -0.045970797538757324, 0.023703692480921745, 0.00757799856364727, -0.0023
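
The cell above embeds a single test sentence. To embed every entry of `chunks`, one option is to send them in batches, since the `input` parameter of `client.embeddings.create` also accepts a list of strings. The sketch below is an assumption about how that could be organised, not the notebook's own code: `embed_in_batches`, `make_openai_embedder`, and the batch size of 64 are hypothetical names and values chosen for illustration. The batching logic takes the embedding call as a parameter so it can be tried without an API key.

```python
def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed texts in order, calling embed_batch on slices of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors


def make_openai_embedder(client, model="text-embedding-3-large"):
    """Wrap an OpenAI client as an embed_batch function (needs a valid API key)."""
    def embed_batch(batch):
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index so the vectors line up with the input order
        ordered = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in ordered]
    return embed_batch


# Offline stand-in for the API: maps each text to a one-dimensional "vector"
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["bit", "byte", "binary"], fake_embed_batch, batch_size=2)
print(demo_vectors)
```

With a real client, the call would look like `chunk_vectors = embed_in_batches(chunks, make_openai_embedder(client))`, which yields one vector per chunk in the same order as `chunks`.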