## Collect & Prepare Your Own Documents

Task:

1. Create a folder named `my_docs/`.
2. Place at least 3 text-based files in it: `.txt`, `.md`, or text extracted from a PDF and saved as `.txt` (the loader below reads only `.txt` and `.md` files). Examples:
   - A research paper summary you wrote
   - A blog article about AI or sustainability
   - Course notes or company reports
3. Load them into Python.

In [None]:
import os

folder = "my_docs"
documents = []

# TODO: Load your own files.
# Verify that the data loaded properly.
# Write a few lines describing what kind of documents you chose and why.
for file in os.listdir(folder):
    if file.endswith((".txt", ".md")):
        with open(os.path.join(folder, file), "r", encoding="utf-8") as f:
            documents.append(f.read())

print(f"Loaded {len(documents)} documents.")
if documents:
    print(documents[0][:300])  # preview the first 300 characters
else:
    print(f"No .txt or .md files found in {folder}/")
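
One way to approach the verification step is to keep each file name next to its text and print basic statistics per file. The sketch below is a minimal variant of the loader above, not a required solution; `load_documents` and `describe` are hypothetical helper names introduced here for illustration.

```python
import os


def load_documents(folder, extensions=(".txt", ".md")):
    """Return a list of (filename, text) pairs for matching files in folder."""
    loaded = []
    # sorted() makes the order reproducible; os.listdir order is arbitrary
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(extensions):
            path = os.path.join(folder, name)
            with open(path, "r", encoding="utf-8") as f:
                loaded.append((name, f.read()))
    return loaded


def describe(loaded):
    """Build one summary line per document: name, characters, words."""
    return [
        f"{name}: {len(text)} chars, {len(text.split())} words"
        for name, text in loaded
    ]


# Only runs if the folder exists, so the cell does not fail on a fresh checkout
if os.path.isdir("my_docs"):
    for line in describe(load_documents("my_docs")):
        print(line)
```

A file that shows 0 words, or far fewer than expected, is a hint that it is empty or was saved in a different encoding.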


## Chunk Your Texts

**Goal**: Break long text into manageable pieces for retrieval.

In [None]:
def chunk_text(text, chunk_size=200):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# TODO: Apply chunking to all your documents
# Try different chunk sizes (100, 200, 400).
# Which size produced more relevant retrievals later? Why?
chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

print("Total chunks created:", len(chunks))
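
To get a feel for the chunk-size experiment, it can help to count how many chunks each size produces before looking at retrieval quality. The sketch below repeats `chunk_text` so it runs on its own; `compare_chunk_sizes` and `sample_docs` are hypothetical names, and the placeholder text stands in for your own `documents` list.

```python
def chunk_text(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def compare_chunk_sizes(docs, sizes=(100, 200, 400)):
    """Map each chunk size to the total number of chunks it produces."""
    return {size: sum(len(chunk_text(doc, size)) for doc in docs) for size in sizes}


# Placeholder document of 1000 words; replace with your own `documents` list
sample_docs = ["word " * 1000]
for size, count in compare_chunk_sizes(sample_docs).items():
    print(f"chunk_size={size}: {count} chunks")
```

Smaller chunks give the retriever more, narrower candidates; larger chunks give fewer candidates that each carry more surrounding context.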


## Build Your Own Retriever (Semantic)

**Goal**: Retrieve the most relevant chunks using embeddings.

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# TODO: Embed your chunks
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

def retrieve_chunks(query, k=2):
    query_embed = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embed, chunk_embeddings)[0]
    # torch.topk raises an error if k exceeds the number of chunks
    top_k = torch.topk(scores, min(k, len(chunks)))
    return [chunks[i] for i in top_k.indices.tolist()]

# TEST
print(retrieve_chunks("What is discussed about AI ethics?", k=2))
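
The ranking logic inside `retrieve_chunks` (score every chunk against the query, then keep the top `k`) can be illustrated without downloading a model. The sketch below swaps the sentence embeddings for a toy bag-of-words vector, which is an assumption made purely so the example runs anywhere; `bow_vector`, `cosine`, `retrieve_with_scores`, and `toy_chunks` are hypothetical names, and word-count vectors will not capture meaning the way `all-MiniLM-L6-v2` does.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_with_scores(query, chunks, k=2):
    """Return up to k (score, chunk) pairs, highest similarity first."""
    q = bow_vector(query)
    scored = [(cosine(q, bow_vector(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


toy_chunks = [
    "ai ethics covers fairness and accountability",
    "solar panels convert sunlight into electricity",
    "ethics reviews are required for ai research",
]
for score, chunk in retrieve_with_scores("ai ethics", toy_chunks, k=2):
    print(f"{score:.2f}  {chunk}")
```

Returning the score alongside each chunk makes it easier to see how close the runner-up chunks were, which is useful when deciding on a value for `k`.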


## Connect Retrieval with Generation

**Goal**: Use retrieved chunks as context for text generation.

In [None]:
from transformers import pipeline

# temperature only takes effect when sampling is enabled (do_sample=True)
generator = pipeline("text-generation", model="distilgpt2", max_new_tokens=120, do_sample=True, temperature=0.3)

def mini_rag(query, k=2):
    retrieved = retrieve_chunks(query, k)
    context = "\n".join(retrieved)
    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\n\nAnswer:"
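    # return_full_text=False returns only the newly generated answer, not the prompt
    return generator(prompt, return_full_text=False)[0]['generated_text']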

# TODO: Try 2–3 questions based on your own documents.
# Observe how the generator uses the retrieved context.
# Run the same question with and without retrieval.
# Compare and describe the differences.
print(mini_rag("Summarize the main idea from my document."))
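
For the with-and-without-retrieval comparison, one possible setup is to build both prompts from the same question and send each through the same generator. The sketch below is an illustration under stated assumptions: `build_prompt`, `compare_answers`, and `stub_generate` are hypothetical names, and the stub replaces the real model so the example runs without a download.

```python
def build_prompt(query, retrieved=None):
    """Build a prompt with retrieved context, or a bare question if none is given."""
    if retrieved:
        context = "\n".join(retrieved)
        return (
            "Use the following context to answer the question:\n\n"
            f"{context}\n\nQuestion: {query}\n\nAnswer:"
        )
    return f"Question: {query}\n\nAnswer:"


def compare_answers(query, retrieved, generate):
    """Run the same question with and without context through `generate`."""
    return {
        "with_retrieval": generate(build_prompt(query, retrieved)),
        "without_retrieval": generate(build_prompt(query)),
    }


# Stub generator (assumption): it only reports the prompt length in words
def stub_generate(prompt):
    return f"[{len(prompt.split())} prompt words]"


result = compare_answers(
    "What is RAG?",
    ["RAG combines retrieval with generation."],
    stub_generate,
)
for label, answer in result.items():
    print(label, "->", answer)
```

In the notebook, the stub could be replaced with a function that calls `generator` and returns its `generated_text`, and `retrieved` with the output of `retrieve_chunks(query, k)`.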


## Add a LangChain RAG Chain

**Goal**: Chain the retrieval and generation steps modularly.

In [None]:
# Import paths for recent LangChain releases; HuggingFacePipeline needs the langchain-huggingface package
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generator)

prompt = ChatPromptTemplate.from_template("""
Context:
{context}

Question:
{question}

Answer clearly and concisely:
""")

rag_chain = (
    {"context": lambda q: "\n".join(retrieve_chunks(q, 2)), "question": RunnablePassthrough()}
    | prompt
    | llm
)

# TODO: Run your chain
# Modify the retriever to return top 3 instead of 2.
# Print retrieved chunks before answering.
# Discuss whether adding more chunks improved or worsened quality.
print(rag_chain.invoke("What are the key challenges mentioned in my research?"))
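
When experimenting with more chunks, keep in mind that `distilgpt2` inherits the GPT-2 context limit of 1024 tokens, which has to cover the prompt and the generated tokens together. Raising `top_k` or the chunk size can push the prompt past that limit. The sketch below shows one simple way to cap the context; `fit_context` is a hypothetical helper, and `max_words=500` is an assumed conservative budget, since word counts only approximate token counts.

```python
def fit_context(retrieved, max_words=500):
    """Keep whole chunks in ranked order until the word budget is used up."""
    kept, used = [], 0
    for chunk in retrieved:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += n_words
    return kept


# Three placeholder chunks of 200 words each; only two fit into 500 words
ranked = ["alpha " * 200, "beta " * 200, "gamma " * 200]
kept = fit_context(ranked)
print(f"Kept {len(kept)} of {len(ranked)} chunks")
```

Inside the chain, the context entry could wrap the retriever call, for example `"\n".join(fit_context(retrieve_chunks(q, 3)))`, so that the lowest-ranked chunks are dropped first.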


## Evaluation & Reflection

### Goal
Critically assess and document the performance of **your own RAG system**.  
Reflect on how your retrieval and generation pipeline behaved with your chosen documents.

---

### Evaluation Questions

| **Question** | **Your Answer** |
|---------------|--------------------|
| **How many documents did you use?** |  |
| **What type of content (topic/domain)?** |  |
| **Which retrieval size (chunk length, `top_k`) worked best?** |  |
| **Did the model produce hallucinations? When?** |  |
| **What improvement would you try next?** |  |

---

### Optional Extension Ideas

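- **Try another model**, e.g. `google/flan-t5-base` or `facebook/bart-large`, and compare outputs. Both are sequence-to-sequence models, so they typically need the `text2text-generation` pipeline task rather than `text-generation`.  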
- **Visualize cosine similarity scores** between your query and retrieved chunks as a bar chart to better understand retrieval ranking.
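
As a starting point for the visualization idea, similarity scores can be rendered as a quick text-based bar chart before moving to a plotting library. This is a minimal sketch: `text_bar_chart` is a hypothetical helper, and the scores are made-up values for illustration rather than results from any real query.

```python
def text_bar_chart(labels, scores, width=40):
    """Render scores in [0, 1] as horizontal bars made of '#' characters."""
    lines = []
    for label, score in zip(labels, scores):
        clamped = max(0.0, min(1.0, score))  # negative similarities get an empty bar
        bar = "#" * round(clamped * width)
        lines.append(f"{label:<10} {bar} {score:.2f}")
    return lines


# Made-up scores for illustration only; use your own cosine similarities in practice
example_labels = ["chunk 0", "chunk 1", "chunk 2"]
example_scores = [0.75, 0.5, 0.1]
for line in text_bar_chart(example_labels, example_scores):
    print(line)
```

A steep drop between the first and second bar suggests that one chunk dominates the query, while a flat profile suggests that a larger `top_k` may be worth testing.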

---


