# Data Ingestion for Space Exploration RAG

This notebook demonstrates how to load articles from Wikipedia, split them into chunks, and index them into a local vector database (Chroma) for our RAG application.

Submitted by: Amina O. & Ossama Z. & Smia I.

## 1. Setup and Imports

In [None]:
import os
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Define the directory for persistent storage
PERSIST_DIRECTORY = "./chroma_db"

## 2. Load Data from Wikipedia
We will load articles related to Space Exploration to build our knowledge base.

In [None]:
topics = ["Space exploration", "NASA", "SpaceX", "Mars exploration", "International Space Station"]
documents = []

print("Loading documents from Wikipedia...")
for topic in topics:
    try:
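        # load_max_docs=1 keeps only the top search hit for each topic.
        # Note: the loader truncates each article to doc_content_chars_max
        # characters (4000 by default, as far as we know; check your version).
        loader = WikipediaLoader(query=topic, load_max_docs=1)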
        docs = loader.load()
        documents.extend(docs)
        print(f"Loaded: {topic}")
    except Exception as e:
        print(f"Error loading {topic}: {e}")

print(f"Total documents loaded: {len(documents)}")

## 3. Split Text into Chunks
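We split the articles into smaller, overlapping chunks so that retrieval can return focused passages instead of whole articles. Setting `add_start_index=True` records each chunk's character offset in its metadata.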

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True
)

splits = text_splitter.split_documents(documents)
print(f"Created {len(splits)} chunks.")
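
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character windows. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs using plain fixed-size character windows.

    Simplified illustration only: separators are ignored, unlike the
    recursive splitter used in the notebook.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder text of 2500 characters, not real article content.
sample = "x" * 2500
windows = sliding_window_chunks(sample)
for start, chunk in windows:
    print(start, len(chunk))
```

With these settings, consecutive windows share 200 characters, so a sentence that straddles a boundary is still likely to appear whole in at least one chunk.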

## 4. Embed and Store in ChromaDB
We will use a local HuggingFace embedding model (`all-MiniLM-L6-v2`), a small sentence-transformers model that produces 384-dimensional vectors and runs comfortably on CPU.
The index will be persisted locally to `./chroma_db`.

In [None]:
print("Initializing Embedding Model...")
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

print("Indexing data into ChromaDB... (This might take a minute)")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embedding_model,
    persist_directory=PERSIST_DIRECTORY
)

print(f"Done! Data indexed and saved to {PERSIST_DIRECTORY}")
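
Re-running the indexing cell against an existing `./chroma_db` may add the same chunks a second time. One possible guard, sketched below, is to derive a stable ID for each chunk from its metadata. This assumes each chunk carries a `source` key (set by the loader) and a `start_index` key (set by `add_start_index=True`). The helper `chunk_id` and the example metadata are hypothetical.

```python
import hashlib


def chunk_id(metadata: dict) -> str:
    """Build a deterministic ID from a chunk's source and start offset.

    Assumption: 'source' and 'start_index' are present in the metadata;
    missing keys fall back to an empty string.
    """
    source = str(metadata.get("source", ""))
    start = str(metadata.get("start_index", ""))
    digest = hashlib.sha256(f"{source}|{start}".encode("utf-8")).hexdigest()
    return digest[:32]  # shortened for readability; still deterministic


# Illustrative metadata only, not output from the notebook.
example_metadata = [
    {"source": "https://en.wikipedia.org/wiki/SpaceX", "start_index": 0},
    {"source": "https://en.wikipedia.org/wiki/SpaceX", "start_index": 800},
]
ids = [chunk_id(m) for m in example_metadata]
print(ids)
```

If your installed `langchain_chroma` version accepts an `ids` argument in `Chroma.from_documents`, passing `ids=[chunk_id(d.metadata) for d in splits]` should let repeated runs overwrite existing entries rather than duplicate them. Verify this against the version you have installed.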

## 5. Test Retrieval
Let's verify that we can retrieve relevant information.

In [None]:
query = "Who founded SpaceX?"
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
results = retriever.invoke(query)

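print(f"Query: {query}")
if results:
    # Show only the highest-ranked of the k=3 retrieved chunks
    print("Top Result:")
    print(results[0].page_content)
else:
    print("No results returned.")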
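
For intuition about what the retriever does with `k`, the sketch below ranks a few made-up three-dimensional vectors by cosine similarity to a query vector and keeps the top `k`. It is a toy illustration only. The real embeddings have 384 dimensions, and the distance metric Chroma actually uses depends on how the collection is configured. The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for chunk embeddings.
toy_index = [
    ("chunk about SpaceX", [0.9, 0.1, 0.0]),
    ("chunk about the ISS", [0.1, 0.9, 0.0]),
    ("chunk about Mars", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]  # stands in for the embedded question

ranked = top_k(toy_query, toy_index, k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```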