<a href="https://colab.research.google.com/github/Boulder1-kihara/create-your-first-rag-system/blob/main/my_first_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation
**1. Retrieval (The Search):** You ask a question. The system searches through your folder of PDFs and finds the passages in your handouts that are most relevant to it.

**2. Augmented (The Context):** It takes those passages and pastes them into the prompt behind the scenes, alongside your question.

**3. Generation (The Answer):** The AI reads those passages and writes an answer grounded in that source.
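The three steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the pipeline built below: `retrieve` ranks documents by shared words instead of embeddings, `build_prompt` is a hypothetical helper, and the generation step is left out because it needs a language model.

```python
def retrieve(question, documents, k=1):
    """Toy retrieval: rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def shared_words(doc):
        return len(question_words & set(doc.lower().split()))

    return sorted(documents, key=shared_words, reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Augmentation: paste the retrieved text into the prompt ahead of the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up example documents, standing in for pages of a handout
documents = [
    "A decision tree classifies an example by testing its attributes one at a time.",
    "Backpropagation adjusts the weights of a multilayer perceptron.",
]
question = "How does a decision tree classify an example?"

top_chunks = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The printed prompt is what would be sent to the language model in the generation step.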


# **Why this is great:**
**Fewer Hallucinations:** Because the answer is grounded in your documents, the AI is much less likely to make up fake facts.

**Privacy:** If you run this locally on your own laptop, your data stays with you.

**Custom:** It becomes an expert on your specific degree, not just general knowledge.

We are going to use the following libraries:
1. **pypdf:** To read your PDF files.

2. **langchain:** The framework that connects everything.

3. **chromadb:** A database to store the "memory" (vectors).

4. **sentence-transformers:** To turn your text into numbers the AI understands.

Before anything else, open a notebook (Google Colab is recommended). Before writing any code, install the libraries by running the following command in a terminal. In a Colab cell, put `!` in front of the command.

        pip install langchain langchain-community pypdf chromadb sentence-transformers

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Use the actual name of your PDF
pdf_filename = "An Introduction to Machine Learning -- Miroslav Kubát -- ( WeLib.org ).pdf"

try:
    print(f"📖 Attempting to load: {pdf_filename}...")

    # This tool reads the PDF
    loader = PyPDFLoader(pdf_filename)
    pages = loader.load()

    print(f"✅ SUCCESS! I read {len(pages)} pages.")
    print("\n--- Here is what the first 500 characters look like: ---")
    print(pages[0].page_content[:500])

except Exception as e:
    print(f"❌ Error: {e}")
    print("Tip: Check if the file name is spelled exactly right!")

📖 Attempting to load: An Introduction to Machine Learning -- Miroslav Kubát -- ( WeLib.org ).pdf...
✅ SUCCESS! I read 348 pages.

--- Here is what the first 500 characters look like: ---
Miroslav/uni00A0Kubat
An Introduction 
to Machine 
Learning
 Second Edition


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Step 2: The "Chop" (Text Splitting)**
**Here is the problem:** We can't feed all 348 pages into an AI at once. It's like trying to eat a whole pizza in one bite: you'll choke (or run out of memory).

We need to slice the book into smaller, bite-sized pieces called Chunks.


# **Why we do this:**

1. **Accuracy:** When you ask a question later, the AI can find the exact specific paragraph that has the answer, rather than scanning the whole book.

2. **Overlap:** We repeat a little text between neighboring chunks, so a sentence that gets cut at a chunk boundary usually still appears whole in one of them.
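To see what overlap means in practice, here is a simplified sketch of a sliding-window splitter. It is an assumption-laden stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on paragraphs, lines, and words, while the hypothetical `chunk_text` below just cuts fixed-size character windows.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the same idea as `chunk_overlap=200` on a much smaller scale.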

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Configure the splitter
# chunk_size=1000 means each piece will be at most about 1000 characters long
# chunk_overlap=200 means neighboring pieces share up to 200 characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

print("✂️ Splitting the book into chunks...")

# 2. Split the documents
# We are using the 'pages' variable from the previous step
splits = text_splitter.split_documents(pages)

print(f"✅ Done! We turned {len(pages)} pages into {len(splits)} small chunks.")
print("\n--- Example of a single chunk ---")
print(splits[10].page_content)

✂️ Splitting the book into chunks...
✅ Done! We turned 348 pages into 1080 small chunks.

--- Example of a single chunk ---
5.1 Multilayer Perceptrons as Classifiers ................................ 91
5.2 Neural Network’s Error .............................................. 95
5.3 Backpropagation of Error ............................................ 97
5.4 Special Aspects of Multilayer Perceptrons .......................... 100
5.5 Architectural Issues .................................................. 104
5.6 Radial-Basis Function Networks .................................... 106
5.7 Summary and Historical Remarks ................................... 109
5.8 Solidify Your Knowledge ............................................ 110
6 Decision Trees ............................................................... 113
6.1 Decision Trees as Classifiers ......................................... 113
6.2 Induction of Decision Trees ......................................... 117
6.3 Ho

# **Step 3: The "Translator" (Embeddings)**
We are going to turn those text chunks into **Lists of Numbers (called Vectors).**

**Why?** Computers don't understand "Machine Learning." They understand math.

We will use a free model from Hugging Face to do this translation.
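As a rough picture of what "turning text into numbers" means, here is a toy sketch. It is only an illustration: the hypothetical `toy_embed` counts words from a tiny made-up vocabulary, whereas a real embedding model produces learned, dense vectors that capture meaning rather than word counts.

```python
def toy_embed(text, vocabulary):
    """Turn text into a fixed-length vector: one word count per vocabulary entry."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


# Made-up vocabulary, chosen only for this example
vocabulary = ["machine", "learning", "tree", "pizza"]

vec_a = toy_embed("Machine learning is learning from data", vocabulary)
vec_b = toy_embed("A decision tree", vocabulary)

print(vec_a)
print(vec_b)
```

Every text becomes a list of the same length, which is what lets the database compare any two texts with simple math.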

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

print("🔮 Loading the Embedding Model (this translates text to numbers)...")
# We use a small, fast model called 'all-MiniLM-L6-v2'
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

print("💾 Creating the Vector Database (this might take a moment)...")
# This takes your 1080 chunks, turns them into numbers, and saves them in memory
vector_db = Chroma.from_documents(
    documents=splits,
    embedding=embedding_model
)

print("✅ Success! Your specific university brain is ready.")

🔮 Loading the Embedding Model (this translates text to numbers)...
💾 Creating the Vector Database (this might take a moment)...
✅ Success! Your specific university brain is ready.


To view the raw numbers (optional), run the following code:

In [None]:
# 1. Fetch one record from the database, including the raw math (embeddings)
data = vector_db.get(limit=1, include=["embeddings", "documents"])

# 2. Extract the vector
raw_vector = data['embeddings'][0]
text_content = data['documents'][0]

print(f"📄 The Text Chunk starts with: '{text_content[:50]}...'")
print(f"\n🔢 The Vector has {len(raw_vector)} dimensions (numbers).")
print("Here are the first 20 numbers of that vector:")
print(raw_vector[:20])

📄 The Text Chunk starts with: 'Miroslav/uni00A0Kubat
An Introduction 
to Machine ...'

🔢 The Vector has 384 dimensions (numbers).
Here are the first 20 numbers of that vector:
[-0.03509036 -0.06561454  0.00587055 -0.02697201  0.01989697  0.04304998
  0.02650404  0.02410588 -0.05778159 -0.04572349 -0.01883329 -0.01903909
 -0.00053616 -0.02233853 -0.05741227  0.06519616  0.01334015  0.00904386
 -0.02129865 -0.08295383]


# **Step 4: The Search (Retrieval)**
We will just ask your vector database to find the most relevant chunk for a specific question.

**Here is how the math works:**

1. You ask a question (e.g., "What is a decision tree?").

2. The system turns your question into a vector (a list of numbers).

3. It compares your question's vector to all 1080 chunk vectors in your database.

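4. It finds the ones that are mathematically closest. A common measure is Cosine Similarity; Chroma's default is squared Euclidean (L2) distance, which ranks chunks in the same order when the vectors have unit length, as this model's embeddings do.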
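Steps 3 and 4 can be sketched with plain Python lists. The vectors below are made-up three-dimensional values (the real ones have 384 dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers written for this illustration, not how Chroma implements its search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Return the indexes of the k chunk vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vec), index)
        for index, vec in enumerate(chunk_vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Made-up vectors standing in for three stored chunks and one question
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [0.9, 0.1, 0.0]

best = top_k(query_vector, chunk_vectors, k=2)
print(best)
```

The indexes in `best` point at the two chunks whose vectors are most aligned with the question's vector, which plays the role of `k` in the search call used in this notebook.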

In [None]:
# 1. Define your question
query = input("Enter your prompt ")

print(f"🕵️‍♀️ Searching for: '{query}'")

# 2. Ask the database to find the best matches (k=3 means find the top 3)
results = vector_db.similarity_search(query, k=3)

# 3. Print the results
print("\n--- Top Result Found ---")
print(results[0].page_content)

print("\n--- Source Metadata ---")
print(results[0].metadata)

Enter your prompt what is machine learning
🕵️‍♀️ Searching for: 'what is machine learning'

--- Top Result Found ---
Introduction
Machine learning has come of age. And just in case you might think this is a mere
platitude, let me clarify.
The dream that machines would one day be able to learn is as old as computers
themselves, perhaps older still. For a long time, however, it remained just that: a
dream. True, Rosenblatt’s perceptron did trigger a wave of activity, but in retrospect,
the excitement has to be deemed short-lived. As for the attempts that followed, these
fared even worse; barely noticed, often ignored, they never made a breakthrough—
no software companies, no major follow-up research, and not much support from
funding agencies. Machine learning remained an underdog, condemned to live in
the shadow of more successful disciplines. The grand ambition lay dormant.
And then it all changed.
A group of visionaries pointed out a weak spot in the knowledge-based sys