# Cloud Computing Academic RAG Study Assistant
## Part 1: Data Collection & Understanding

### Objective
In this notebook, we analyze the Cloud Computing course materials before building the RAG system.

Understanding the data is critical because:
- Real-world PDFs contain formatting issues
- Structure affects chunking strategy
- Domain terminology affects retrieval quality

This step ensures we design a system tailored to our dataset.

In [2]:
import os
!pip install PyPDF2






## Step 1: Load Course Materials

We collected 5 Cloud Computing PDFs covering:

1. Introduction to Cloud Computing
2. Cloud Service Models (IaaS, PaaS, SaaS)
3. Virtualization
4. Cloud Deployment Models
5. Security in Cloud Computing

Total content: 50+ pages

Now we extract text from all PDFs.

In [3]:
import os
from PyPDF2 import PdfReader

data_folder = "../data/"
documents = []

for file in os.listdir(data_folder):
    if file.endswith(".pdf"):
        reader = PdfReader(os.path.join(data_folder, file))
        
        for page in reader.pages:
            text = page.extract_text()
            if text:
                documents.append(text)

print("Total documents loaded:", len(documents))

Total documents loaded: 109


In [4]:
from PyPDF2 import PdfReader

reader = PdfReader("../data/Ass1.pdf")

for i, page in enumerate(reader.pages):
    text = page.extract_text()
    print(f"Page {i+1} text:", text)
    break

Page 1 text: 


## Step 2: Data Analysis

### Document Types
- Text-based PDFs (not scanned)
- Lecture slides converted to PDF
- Some bullet-point heavy documents
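Whether a PDF is really text-based can be checked from the extraction output itself. The sketch below assumes the per-page strings have already been extracted (for example with `page.extract_text()`); `empty_page_ratio` is a hypothetical helper name introduced here for illustration, not part of PyPDF2.

```python
def empty_page_ratio(page_texts):
    """Return the fraction of pages whose extracted text is empty or whitespace."""
    if not page_texts:
        return 0.0
    # Treat None (no text returned) the same as an empty string.
    empty = sum(1 for text in page_texts if not (text or "").strip())
    return empty / len(page_texts)


# Toy input standing in for extracted page strings (assumption, not real course data).
sample_pages = ["Cloud Computing UNIT-I", "", None, "   \n"]
print(empty_page_ratio(sample_pages))  # 3 of 4 pages are empty -> 0.75
```

A high ratio for a file would suggest scanned or image-only pages, which would likely need OCR before chunking.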

### Observed Challenges

1. Bullet points merged into paragraphs
2. Tables formatted incorrectly
3. Some headers repeated on every page
4. Irregular spacing and line breaks
5. Technical terminology (IaaS, VM, hypervisor)

### Structure

Most documents follow:
- Chapter title
- Section headings
- Bullet explanations
- Diagrams (text not captured)

### Data Quality Issues

- Some pages include page numbers in middle of text
- Tables lose column alignment
- No semantic markers for sections

These challenges will influence chunking and retrieval strategies.
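Some of the issues listed above (repeated headers, stray page numbers, irregular spacing) can be reduced before chunking. The following is a minimal sketch under stated assumptions: pages are plain strings, a line counts as a repeated header if it appears on most pages, and page numbers sit on their own line. The helper names `find_repeated_lines` and `clean_page` and the `min_fraction` threshold are hypothetical choices, not something used elsewhere in this notebook.

```python
import re
from collections import Counter


def find_repeated_lines(pages, min_fraction=0.6):
    """Return lines that appear on at least `min_fraction` of pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}


def clean_page(page, repeated):
    """Drop repeated lines and bare page numbers, then collapse runs of whitespace."""
    kept = []
    for line in page.splitlines():
        stripped = line.strip()
        if not stripped or stripped in repeated:
            continue
        if re.fullmatch(r"\d{1,4}", stripped):  # a line holding only a page number
            continue
        kept.append(re.sub(r"\s+", " ", stripped))
    return "\n".join(kept)


# Toy pages (assumption) imitating a repeated header and trailing page numbers.
pages = [
    "Cloud Computing Notes\nIaaS provides   virtual machines.\n1",
    "Cloud Computing Notes\nPaaS provides a runtime.\n2",
    "Cloud Computing Notes\nSaaS delivers applications.\n3",
]
repeated = find_repeated_lines(pages)
cleaned = [clean_page(p, repeated) for p in pages]
print(cleaned)
```

This sketch does not handle page numbers embedded in the middle of a sentence or misaligned tables; those would need document-specific rules.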

## Part 2: Baseline RAG Implementation

In this section, we build a simple RAG pipeline using:

- Fixed-size chunking
- Sentence-transformer embeddings
- ChromaDB vector storage
- Basic prompt

This will serve as our baseline for later experiments.

In [19]:
# pip install langchain chromadb sentence-transformers openai

In [5]:
# Install required packages (run once)
# !pip install langchain langchain-community langchain-text-splitters pypdf

# Import libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Load PDF
loader = PyPDFLoader("../data/CloudComputingNotes.pdf")
documents = loader.load()

print("Total pages loaded:", len(documents))

# Step 2: Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

# Step 3: Output results
print("Total chunks:", len(chunks))
print("Type of chunk:", type(chunks[0]))

# Step 4: Preview first chunk
print("\n--- First Chunk Content ---\n")
print(chunks[0].page_content)

Total pages loaded: 110
Total chunks: 403
Type of chunk: <class 'langchain_core.documents.base.Document'>

--- First Chunk Content ---

Cloud Computing 
 
UNIT-I                        
Introduction to Cloud Computing:  
1. Cloud Computing in a Nutshell 
The term Cloud refers to a Network or Internet. In other words, we can say that Cloud is 
something, which is present at remote location. Cloud can provide services over public and 
private networks, i.e., WAN, LAN or VPN. 
Applications such as e-mail, web conferencing, customer relationship management (CRM) 
execute on cloud. 
What is Cloud Computing?


## Why 500 Characters?

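- Large enough to keep a short concept definition inside a single chunk
- Small enough that the retrieved chunks together stay well within the LLM's token limit
- The 200-character overlap (40% of each chunk) reduces context loss at chunk boundaries

Fixed-size chunking with overlap is a common baseline approach.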
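To make the interaction between chunk size and overlap concrete, here is a simplified sketch of fixed-size chunking in plain Python. It is an illustration only: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its chunks will not match this character-exact version. `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=200):
    """Split text into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts 300 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 1100 characters of toy text (assumption), cycling through the alphabet.
text = "".join(chr(97 + i % 26) for i in range(1100))
chunks = fixed_size_chunks(text)
print(len(chunks), [len(c) for c in chunks])
```

With these settings each chunk advances by only 300 new characters, so the number of chunks grows by roughly 500/300 compared with non-overlapping chunks of the same size.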

In [6]:
# Step: Load Embedding Model

from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}   # Use "cuda" if GPU available
)

# Test embedding on sample text
text = "Hello, how are you?"
embedding_vector = embedding_model.embed_query(text)

# Print embedding size
print("Embedding vector length:", len(embedding_vector))


Embedding vector length: 384


We use `all-MiniLM-L6-v2` because:

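- Lightweight: it produces 384-dimensional embeddings, as the output above confirms
- Fast enough to run on CPU
- Good semantic performance for its size
- Free and open-source

This choice balances speed and quality.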

In [7]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector DB
db = Chroma.from_documents(chunks, embeddings)

# Create retriever
retriever = db.as_retriever()

print("Retriever ready")


Retriever ready


Because `db.as_retriever()` is called without `search_kwargs`, the retriever uses its default and returns the 4 most similar chunks for each query.

This ensures:
- Sufficient context
- Not too much irrelevant noise
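The idea behind top-k retrieval can be sketched with toy vectors. This is a conceptual illustration, assuming cosine similarity as the scoring function; the distance metric and index that Chroma actually uses depend on its configuration. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Two-dimensional toy embeddings (assumption); real embeddings here have 384 dimensions.
toy_chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], toy_chunks, k=2))  # -> [0, 1]
```

A larger `k` gives the LLM more context but also raises the chance of including loosely related chunks, which is the trade-off described above.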

In [8]:
# Step: RAG Query Function using Ollama

import ollama

def ask_rag(query):
    # Step 1: Retrieve relevant chunks
    docs = retriever.invoke(query)

    # Step 2: Combine context
    context = "\n\n".join([doc.page_content for doc in docs])

    # Step 3: Create prompt
    prompt = f"""
You are a Cloud Computing Tutor AI.

Use the following context to answer the question clearly and in simple words.

---------------------
CONTEXT:
{context}
---------------------

QUESTION:
{query}

ANSWER:
"""

    # Step 4: Call local LLM using Ollama
    response = ollama.chat(
        model="phi3",
        messages=[{"role": "user", "content": prompt}]
    )

    # Step 5: Return final answer
    return response['message']['content']

## Baseline Observations

- Factual questions are answered well.
- Answers to long conceptual questions are sometimes incomplete.
- Some irrelevant chunks are retrieved because fixed chunk boundaries split related content.

These limitations motivate our experiments.

# Experiment 1: Chunking Strategy Comparison

We compare:

1. Fixed-size chunking (500 characters)
2. Sentence-based chunking

Goal: Identify which works best for Cloud Computing documents.

In [9]:
!pip install nltk






In [10]:
from nltk.tokenize import sent_tokenize
import nltk

# Download required tokenizer
nltk.download('punkt')

# 👇 ADD YOUR TEXT HERE
full_text = """This is a sample paragraph. It contains multiple sentences.
Natural Language Processing is interesting. NLTK helps in text processing.
We are splitting this text into chunks of sentences."""

# Tokenize sentences
sentences = sent_tokenize(full_text)

# Chunking logic
sentence_chunks = []
current_chunk = ""

for sentence in sentences:
    if len(current_chunk) + len(sentence) + 1 <= 500:
        current_chunk += " " + sentence
    else:
        if current_chunk.strip():
            sentence_chunks.append(current_chunk.strip())
        current_chunk = sentence

# Append last chunk
if current_chunk.strip():
    sentence_chunks.append(current_chunk.strip())

# Output
print("Sentence-based chunks:", len(sentence_chunks))
for i, chunk in enumerate(sentence_chunks):
    print(f"\nChunk {i+1}:\n{chunk}")



Sentence-based chunks: 1

Chunk 1:
This is a sample paragraph. It contains multiple sentences. Natural Language Processing is interesting. NLTK helps in text processing. We are splitting this text into chunks of sentences.


## Observations

Sentence-based chunking:

- ✔ Preserves definitions
- ✔ Maintains logical flow
- ✔ Improves conceptual answers

Fixed chunking:

- ✘ Sometimes cuts definitions
- ✘ Merges unrelated sections

Conclusion: Sentence-based chunking performs better for structured academic content.
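One way to put numbers behind these observations is to compute simple statistics for each chunking strategy. The sketch below is an assumption-laden proxy, not a retrieval-quality metric: it treats a chunk as "complete" if it ends in `.`, `?`, or `!`. `chunk_stats` is a hypothetical helper, and the toy chunks are made up for illustration.

```python
def chunk_stats(chunks):
    """Summarize chunk strings: count, mean length, and share ending on a sentence boundary."""
    if not chunks:
        return {"count": 0, "mean_chars": 0.0, "ends_on_sentence": 0.0}
    complete = sum(1 for c in chunks if c.rstrip().endswith((".", "?", "!")))
    return {
        "count": len(chunks),
        "mean_chars": sum(len(c) for c in chunks) / len(chunks),
        "ends_on_sentence": complete / len(chunks),
    }


# Toy chunks (assumption): the fixed-size version cuts words and sentences mid-way.
fixed = ["Cloud is a net", "work. IaaS off", "ers VMs."]
sentence = ["Cloud is a network.", "IaaS offers VMs."]

print("fixed:   ", chunk_stats(fixed))
print("sentence:", chunk_stats(sentence))
```

Applied to the real `chunks` (via `page_content`) and `sentence_chunks` lists, the same function would give a quick side-by-side view of how often each strategy cuts a sentence.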

# Experiment 2: Prompt Comparison

We compare:

1. Basic prompt
2. Structured academic prompt

Goal: Improve answer clarity and reduce hallucination.

In [11]:
import ollama

def ask_rag(query):
    # Step 1: Retrieve documents
    docs = retriever.invoke(query)[:4]

    # Step 2: Build context
    context = "\n\n".join([doc.page_content for doc in docs])

    # Step 3: Prompt (INSIDE function)
    prompt = f"""
You are a Cloud Computing study assistant.

Instructions:
- Use ONLY the provided context
- If answer not found, say "Not available in provided material."
- Explain clearly in academic language
- Give definition and explanation

--------------------
CONTEXT:
{context}
--------------------

QUESTION:
{query}

ANSWER:
"""

    # Step 4: Call Ollama LLM
    response = ollama.chat(
        model="phi3",
        messages=[{"role": "user", "content": prompt}]
    )

    return response['message']['content']

In [12]:
print(ask_rag("What is cloud computing?"))
print(ask_rag("Explain IaaS, PaaS, SaaS"))

In academic parlance, Cloud Computing encapsulates a paradigm where hardware resources such as servers, storage devices along with computational software operate remotely through the Internet. It enables platform independence and operational flexibility across public or private networks including Wide Area Networks (WAN), Local Area Networks (LAN) or Virtual Private Networks (VPN). Applications spanning e-mail services to customer relationship management systems are routinely executed within this cloud infrastructure. Essentially, Cloud Computing embodies the concept of accessing and using computing resources on demand over a network without necessarily having physical possession of those devices; it operates through three primary service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The technology relies heavily upon the advancement in hardware virtualization, multi-core chipsets for parallel computing tasks, robust interne

## Findings

Improved prompt:

- ✔ More structured answers
- ✔ Reduced hallucinations
- ✔ Clear explanations

Basic prompt:

- ✘ Sometimes vague
- ✘ Occasionally adds external info

Conclusion: The structured academic prompt is better suited to the study assistant use case.
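For a side-by-side comparison, both prompts can be built from the same context and question so that only the instructions differ. The sketch below copies the wording of the two prompts used in this notebook (with the separator lines omitted); the `TEMPLATES` dictionary and `build_prompt` function are hypothetical names introduced for illustration.

```python
BASIC_TEMPLATE = """You are a Cloud Computing Tutor AI.

Use the following context to answer the question clearly and in simple words.

CONTEXT:
{context}

QUESTION:
{query}

ANSWER:
"""

STRUCTURED_TEMPLATE = """You are a Cloud Computing study assistant.

Instructions:
- Use ONLY the provided context
- If answer not found, say "Not available in provided material."
- Explain clearly in academic language
- Give definition and explanation

CONTEXT:
{context}

QUESTION:
{query}

ANSWER:
"""

TEMPLATES = {"basic": BASIC_TEMPLATE, "structured": STRUCTURED_TEMPLATE}


def build_prompt(style, context, query):
    """Fill the chosen template with the retrieved context and the user question."""
    return TEMPLATES[style].format(context=context, query=query)


# Same toy context and question (assumption) for both styles.
context = "Cloud refers to a Network or Internet."
for style in TEMPLATES:
    print(f"--- {style} ---")
    print(build_prompt(style, context, "What is cloud computing?"))
```

Keeping the templates in one place makes it easier to send identical retrieved context to both prompts, so differences in the answers can be attributed to the instructions alone.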

# Test Questions for Cloud Computing RAG System

We designed 12 questions covering:

- 4 Factual Questions
- 4 Conceptual Questions
- 4 Application-Based Questions

These questions will be used consistently across:
- Baseline
- Chunking Experiment
- Prompt Experiment
- Final System

This ensures fair comparison.
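Running the same questions through every variant can be sketched as a small harness. Note that the second `ask_rag` definition in this notebook replaces the first, so each variant needs its own callable to be compared. The sketch below assumes each system is a function that takes a question string and returns an answer string; `run_evaluation` and the two stand-in systems are hypothetical and do not call a real LLM.

```python
def run_evaluation(questions, systems):
    """Ask every question of every system and collect one result row per pair."""
    rows = []
    for question in questions:
        for name, answer_fn in systems.items():
            rows.append({
                "question": question,
                "system": name,
                "answer": answer_fn(question),
            })
    return rows


# Stand-in systems (assumption): replace with the real baseline and final RAG functions.
def fake_baseline(question):
    return f"baseline answer to: {question}"


def fake_final(question):
    return f"final answer to: {question}"


rows = run_evaluation(
    ["What is cloud computing?", "What is virtualization in cloud computing?"],
    {"baseline": fake_baseline, "final": fake_final},
)
for row in rows:
    print(row["system"], "|", row["question"], "|", row["answer"])
```

Collecting the answers as rows makes it straightforward to review them per question or export them to a table for manual scoring.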

In [None]:
test_questions = [
    "What is cloud computing?",
    "Explain characteristics of cloud computing",
    "What are the advantages of cloud computing?",
    "Explain IaaS, PaaS and SaaS models",
    "What is virtualization in cloud computing?",
    "Explain public cloud, private cloud and hybrid cloud",
    "What are the components of cloud architecture?",
    "What is scalability in cloud computing?",
    "Explain cloud service providers",
    "What is load balancing in cloud computing?",
    "Explain cloud security challenges",
    "What is elasticity in cloud computing?"
]

for i, q in enumerate(test_questions, 1):
    print(f"\n🔹 Question {i}: {q}")
    print("Answer:\n", ask_rag(q))
    print("-" * 80)


🔹 Question 1: What is cloud computing?
Answer:
 Cloud Computing, as defined in Unit-I's "Introduction to Cloud Computing", denotes the manipulation, configuration, and access of hardware and software resources remotely via a network or Internet. It encompasses online data storage, infrastructure provision, and application delivery on pay-per-use terms over various networks such as public (WAN), local area network (LAN), or virtual private network (VPN). Notably, it supports platform independence by operating applications like email, web conferencing, customer relationship management independently of the user's hardware. Cloud computing evolved from advancements in diverse technologies including but not limited to:
- Virtualization and multi-core chips which are key drivers for efficient resource utilization; 
- Web services along with service-oriented architectures (SOA) that facilitate modular integration of cloud resources, ensuring interoperability among disparate systems thereb