# Advanced Multisource RAG Project


Key details about this initial project:
* Goal: Build a robust Q&A system.
* Data Ingestion: The system must ingest data from at least three distinct sources, such as databases, websites, or PDFs.
* Retrieval Mechanisms: It must use multiple retrieval methods, such as graph-based retrieval, vector search, or sentence window retrieval.
* Processing Workflow: The system merges all the retrieved context and reranks it before sending it to the Large Language Model (LLM).
* End Product: A web interface where a user can upload files (or point to data sources such as a website, CSV, or database) and ask a natural-language question, receiving a precise answer along with the source citations.
* (Optional feature) Add a metric that shows the latency/accuracy tradeoff, and include citations to boost credibility.
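
As an illustration of one of the retrieval methods listed above, here is a minimal sketch of sentence window retrieval: once a single sentence matches the query, the sentences around it are returned as context. The naive regex splitter and the function names are assumptions made for this example, not the project's actual implementation.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_window(sentences: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched sentence plus `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

text = (
    "Qdrant stores vectors. Memgraph stores graphs. "
    "Streamlit serves the UI. Docker runs it all."
)
sentences = split_sentences(text)

# Pretend the retriever matched sentence 1 ("Memgraph stores graphs.")
print(sentence_window(sentences, hit_index=1, window=1))
```

A larger `window` gives the LLM more surrounding context at the cost of a longer prompt.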

# Minimal Requirements

This list is saved in the project file structure as a txt file (`requirements.txt`):
* llama-index
* sentence-transformers
* qdrant-client
* gqlalchemy
* spacy
* rank_bm25
* streamlit
* pandas
* pdfplumber
* torch
* faiss-cpu


# Tech Stack 

| Layer                        | Tool                                                                 | Purpose                                                                           |
| ---------------------------- | -------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| **Data ingestion**           | Python scripts / `llama-index` connectors                            | Read from PDFs, CSVs, databases, web pages                                        |
| **Chunking & preprocessing** | Custom + `llama-index` utilities                                     | Clean and split text into overlapping chunks                                      |
| **Embeddings (local)**       | `sentence-transformers` model (HF)                                   | Generate open-source text embeddings                                              |
| **Vector DB**                | **Qdrant** (Docker/self-hosted)                                      | Vector + metadata storage and semantic retrieval                                  |
| **Graph DB**                 | **Memgraph**                                                         | Build small knowledge graphs for entity/relationship retrieval (GraphRAG pattern) |
| **Fusion / Rerank**          | Cross-encoder re-ranker (HF) + Reciprocal Rank Fusion                | Improve top-k quality                                                             |
| **LLM inference**            | Local or API-based (open-weights LLM like Mistral-7B, Llama 3 local) | Generate grounded answer                                                          |
| **Frontend**                 | Streamlit                                                            | File upload + query + answers + citations                                         |
| **Deployment**               | Local Docker compose (Qdrant + Memgraph + Streamlit app)             | Self-contained and free                                                           |
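
The Fusion / Rerank layer mentions Reciprocal Rank Fusion (RRF). A minimal sketch of the idea follows: each retriever contributes `1 / (k + rank)` for every document it returns, and the sums decide the fused order. The chunk IDs, the three hit lists, and `k = 60` (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical results from three retrievers, best hit first
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
graph_hits = ["chunk_b", "chunk_d"]
bm25_hits = ["chunk_b", "chunk_a"]

fused = reciprocal_rank_fusion([vector_hits, graph_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

RRF only needs ranks, not comparable scores, which is why it suits mixing vector, graph, and keyword results. The fused top-k can then be passed to the cross-encoder re-ranker.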



# Docker Desktop Install on Windows 11

Follow these steps to run Docker Desktop on Windows 11 with WSL2.

* Optional: Check that virtualization is enabled on your system (it usually is).
	* Open Task Manager → Performance → CPU and check that it shows "Virtualization: Enabled".
	* If it is disabled, enable it in your BIOS (usually under "Intel VT-x" or "AMD-V").
* Download Docker Desktop for Windows from the official Docker website: https://www.docker.com/products/docker-desktop. Use a version that supports WSL2, which is standard for recent releases.
* Before installing Docker Desktop, verify that WSL2 is properly set up:
	* Open PowerShell as Administrator and run `wsl --install` to install WSL and a default Linux distribution such as Ubuntu.
	* After rebooting, confirm that WSL version 2 is active by running `wsl -l -v`.
	* If needed, set WSL2 as the default version with `wsl --set-default-version 2`.
	* Install the latest WSL2 Linux kernel update package from Microsoft's website to ensure compatibility.
* During installation, make sure the option "Use WSL 2 based engine" is enabled. This setting makes Docker use the WSL2 backend for better performance and compatibility.
* After installing Docker Desktop, open it and go to Settings > Resources > WSL Integration to enable integration with your desired WSL2 distributions.
* Once configured, verify the installation by opening a WSL2 terminal and running `docker version` to confirm that both client and server are active.

# Install WSL Extension for VS Code

* Open VS Code.

* Go to the Extensions view (Ctrl+Shift+X).

* Search for and install:

    - “WSL” (official Microsoft extension)

    - “Remote Development” (optional but recommended)

    - “Docker” (by Microsoft)


These allow you to:

* Open VS Code inside your WSL distro
* Manage containers and images from the VS Code UI

# Install Docker

* Install WSL
```
wsl --install
wsl --status
```
It should report:

    Default Distribution: Ubuntu
    Default Version: 2

* Download and install Docker Desktop
    * In the installer, keep the default options (Use WSL 2)
    * Restart the PC when done
* Run the following:
```
docker --version
docker run hello-world
```
The output should include "Hello from Docker!"






# WSL Session in VS Code

Once the extension is installed:

* Open the Command Palette (Ctrl+Shift+P)

* Type and select:

`Remote-WSL: New WSL Window`


* VS Code will reopen, now connected to your Linux environment.
You will see something like:
```
[WSL: Ubuntu]
```
in the bottom-left corner.

## Verify Docker from inside WSL

Open a terminal inside VS Code:
    - Open the integrated terminal (Ctrl + ñ on a Spanish keyboard layout, Ctrl + backtick on a US layout)
    ```
    docker version
    ```
    Both Client and Server (Engine) should appear in the output.

# Access / Share Win11 Files with WSL Ubuntu

Inside your WSL terminal (Ubuntu), your Windows drives are already mounted automatically under /mnt.

Example:
```
cd /mnt/c/Users/Diego
ls
```

That gives you access to all your Windows files — Desktop, Documents, Downloads, etc.

You can also open the current folder in Windows File Explorer from WSL:
```
explorer.exe .
```

This opens File Explorer in your current WSL folder!

So, if you’re in ```/mnt/c/Users/Diego/Projects```, you’ll see your Windows folder directly.

# Visual Studio Code Project Setup

VS Code is now linked with Ubuntu, but if VS Code was a fresh install it still needs to be set up with everything the project requires (Python, extensions, etc.).

* Install Python. First update the system and install the base build tools:
```
sudo apt update && sudo apt upgrade -y
sudo apt install build-essential curl wget git -y
```
If needed, install or upgrade Python:
```
sudo apt install python3 python3-pip python3-venv -y
```
Then check the installed versions:
```
python3 --version
pip3 --version
```
* Create a dedicated Python virtual environment.
**Always** use a venv inside your project folder to avoid permission and dependency issues. It can be deactivated later with `deactivate`.
```
cd ~/Advanced\ Multisources\ RAG\ Project/
python3 -m venv .venv
source .venv/bin/activate
```
* Install VS Code Server inside WSL (if not done before) by running `code .` from the project folder.

* Inside VS Code, install the following extensions:

| Extension | Purpose |
| --- | --- |
| `ms-python.python` | Core Python language features |
| `ms-toolsai.jupyter` | Run notebooks directly in VS Code |
| `ms-python.vscode-pylance` | Language server for autocompletion & linting |
| `ms-vscode-remote.remote-wsl` | Connect VS Code to WSL |
| `ms-azuretools.vscode-docker` (optional) | For container work |
| GitLens (optional) | Git management |
| `ms-vscode-remote.remote-containers` (optional) | For dev containers later |

* Install the project dependencies from `requirements.txt`:
```
pip install -r requirements.txt
```

* Verify the setup. With `.venv` active, run:
```
which python
which pip
```
This should return a path like
```/home/<user>/your_project/.venv/bin/python```
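
The same check can be done from inside Python, which is handy in a notebook cell. This is a small sketch using only the standard library; `in_virtualenv` is a helper name chosen for this example.

```python
import sys

def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the venv folder while
    sys.base_prefix still points at the system Python.
    """
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a venv:", in_virtualenv())
```

If the second line prints `False`, the notebook or terminal is using the system Python rather than `.venv`.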



# Project Dev Journal

07/11/2025

* Started development of the Advanced Multisource RAG Project
* Installed WSL on Win11
* Installed Docker Desktop
* Linked Visual Studio Code with Docker
* Set up file access / sharing between Win11 and WSL Ubuntu
* Set up Visual Studio Code with the project requirements





08/11/2025

* Completed the rework of Notebook 01
* Started and finished Notebook 02
* Had to install tqdm
* Got an automated script that checks for outdated packages
* Got to embedding storage using Qdrant. Had issues running Qdrant on Docker; checks for this are already included in the script.



# Install ipywidgets and tqdm
* Make sure you're in your venv

```source ~/Advanced\ Multisources\ RAG\ Project/.venv/bin/activate```
* Cleanly install the modern stack

```pip install --upgrade jupyterlab notebook ipywidgets tqdm```

Moving forward, I will be using the settings.json file.

# Check for outdated packages script

Just run the `check_env.py` script in the root folder while `.venv` is active.
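
The contents of `check_env.py` are not reproduced here. As a rough idea of how such a check can work, the sketch below asks pip for outdated packages in JSON form and formats a short report. The function names and the sample package are hypothetical; only `pip list --outdated --format=json` is a real pip command, and it needs network access.

```python
import json
import subprocess
import sys

def parse_outdated(pip_json: str) -> list[tuple[str, str, str]]:
    """Convert pip's JSON output into (name, installed, latest) tuples."""
    return [(p["name"], p["version"], p["latest_version"]) for p in json.loads(pip_json)]

def find_outdated() -> list[tuple[str, str, str]]:
    """Query the pip that belongs to the current interpreter (uses the network)."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--outdated", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    return parse_outdated(result.stdout)

def report(outdated: list[tuple[str, str, str]]) -> str:
    """Build a human-readable summary."""
    if not outdated:
        return "All packages are up to date."
    return "\n".join(f"{name}: {installed} -> {latest}" for name, installed, latest in outdated)

# Illustrative input only; this is not real pip output
sample = '[{"name": "example-package", "version": "1.0.0", "latest_version": "1.1.0"}]'
print(report(parse_outdated(sample)))
```

With `.venv` active, calling `print(report(find_outdated()))` would run the real check against the packages installed in the environment.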


# Qdrant on Docker
Started Qdrant by running:

```docker run -p 6333:6333 qdrant/qdrant```
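
Before a notebook tries to upload embeddings, it can help to confirm that something is actually listening on the published port. The sketch below is a plain TCP check with the standard library and assumes the container is exposed on `localhost:6333` as in the command above; it does not verify that the service behind the port is Qdrant.

```python
import socket

def port_is_open(host: str = "localhost", port: int = 6333, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures
        return False

if port_is_open():
    print("Port 6333 is reachable.")
else:
    print("Nothing is listening on localhost:6333; start the container first.")
```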

# 13/11/2025

There have been several issues with VS Code and Jupyter notebook cells.

Each notebook gets its own kernel instance, which caused multiple kernels to compete for resources and at some point caused the system to hang. This happened very often. After many attempts, adding a kernel_spawn_tracer and a kernel_watcher, and modifying the VS Code settings, the behaviour did not change.

If a kernel gets stuck, run:

`pkill -9 -f "ipykernel_launcher" && sleep 2 && echo "Killed all kernels"`

# 14/11/2025

Changed the VS Code settings in `.vscode/settings.json`.
I've added settings that:

* Force all notebooks to use the same kernel instance (no spawning per notebook)
* Increase the kernel provider timeout to 30 seconds (to prevent premature "stuck" detection)
* Specify the kernel name explicitly

Check for running ipykernel processes:

```ps aux | grep ipykernel | grep -v grep```

Same as before, if a kernel gets stuck, run:

```pkill -9 -f "ipykernel_launcher" && sleep 2 && echo "Killed all kernels"```
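
For a check that can run from a script, the process list can also be read directly from `/proc`. This is a Linux-only sketch (which fits WSL Ubuntu); `find_kernel_pids` is a name chosen for this example and it only lists processes, it does not kill anything.

```python
import os
from pathlib import Path

def find_kernel_pids(marker: str = "ipykernel_launcher") -> list[int]:
    """Return PIDs whose command line contains `marker`, excluding this process."""
    pids = []
    for entry in Path("/proc").glob("[0-9]*"):
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if marker in cmdline and int(entry.name) != os.getpid():
            pids.append(int(entry.name))
    return sorted(pids)

pids = find_kernel_pids()
print(f"{len(pids)} ipykernel process(es) running: {pids}")
```

More than one PID while a single notebook is open suggests kernels are being spawned per notebook again.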





# 14/11/2025 (continued)

## Solution: JupyterLab instead of VS Code Notebooks

Given repeated issues with VS Code spawning multiple kernels per notebook despite configuration changes, decided to use **JupyterLab** instead.

### Why JupyterLab?
- Native kernel management (one shared kernel across all notebooks)
- No per-notebook spawning (unlike VS Code)
- More stable for multi-notebook workflows
- Better suited for RAG pipeline development (4 notebooks in sequence)

### Setup & Usage

**Start JupyterLab:**
```bash
cd ~/advanced-multisource-rag-project
./start_jupyter_lab.sh
```

The script will:
1. Activate the `.venv`
2. Kill any existing kernel processes (clean slate)
3. Launch JupyterLab at `http://localhost:8888`
4. All notebooks will share a **single kernel instance**

**Workflow:**
1. Open notebook 01 in JupyterLab
2. Run all cells (ingests PDF/CSV, chunks text)
3. Switch to notebook 02 (same kernel — no hang)
4. Run cells (generates embeddings, uploads to Qdrant)
5. Switch to notebook 03 (continues with same kernel)
6. Switch to notebook 04 (uses `query` and `context_for_llm` from 03)
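
To make the hand-off in step 6 concrete, here is a sketch of how `query` and `context_for_llm` might be combined into a prompt that asks for cited answers. Only those two variable names come from the notebooks; the chunk dictionary shape, the file names, and the helper functions are hypothetical.

```python
def build_context(chunks: list[dict], max_chunks: int = 3) -> str:
    """Number each retrieved chunk and attach its source so the LLM can cite it."""
    lines = []
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    return "\n".join(lines)

def build_prompt(query: str, context_for_llm: str) -> str:
    """Assemble the final prompt sent to the LLM."""
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources as [n].\n\n"
        f"Context:\n{context_for_llm}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical reranked retrieval results
retrieved = [
    {"text": "Qdrant listens on port 6333.", "source": "notes.pdf"},
    {"text": "Memgraph stores the knowledge graph.", "source": "stack.csv"},
]

query = "Which port does Qdrant use?"
context_for_llm = build_context(retrieved)
print(build_prompt(query, context_for_llm))
```

Keeping the source next to each numbered chunk is what lets the web interface show citations alongside the answer.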

**Benefits over VS Code:**
- ✅ Single kernel for entire session
- ✅ No "Connecting to kernel..." hangs
- ✅ Faster switching between notebooks
- ✅ Variables persist across notebooks naturally
- ✅ No need for systemd watchers or watcher scripts

**If kernel still hangs in JupyterLab (rare):**
```bash
pkill -9 -f "ipykernel_launcher"
```
Then restart the `start_jupyter_lab.sh` script.

### Next Steps if JupyterLab Still Has Issues
If JupyterLab doesn't solve the problem, fall back to native Ubuntu (dual boot or VM).
Ubuntu removes WSL2 overhead and resolves path/IPC issues.


# 15/11/2025

* **Finished** the first stage of the project.
The architecture already has clean separations:
    - Notebook 01 — Data ingestion
    - Notebook 02 — Embedding + storage
    - Notebook 03 — Retrieval pipeline + LLM answer

* Found out that you can change the whole WSL configuration on Windows by creating a `.wslconfig` file in the Windows user profile folder.
    Done: created a new config file meant to resolve instabilities while running this project:
    ```
    [wsl2]
    memory=10GB
    processors=4
    swap=8GB
    ```
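
The new limits only apply after WSL is restarted (`wsl --shutdown` from Windows). One way to confirm the memory cap from inside Ubuntu is to read `MemTotal` from `/proc/meminfo`. The sketch below assumes a Linux environment; the reported value is usually slightly below the configured limit.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal (reported in kB) from /proc/meminfo text and convert to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")

try:
    with open("/proc/meminfo") as fh:
        print(f"Memory visible to this distro: {mem_total_gib(fh.read()):.1f} GiB")
except FileNotFoundError:
    print("/proc/meminfo not available (not running on Linux).")
```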


# 15/11/2025 (continued)

Made changes to improve the chunking function, which now implements:

* sentence-aware splitting
* recursive fallback splitting
* token-based chunk size (via tiktoken)
* multi-level separators
* guaranteed uniform chunk sizes
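
The project's chunking function is not shown in this journal. The sketch below only illustrates the general technique named in the list: pack pieces up to a token budget, and fall back to a finer separator when a piece is too large. It is written under two assumptions: token counting uses whitespace words as a stand-in for a tiktoken-based counter, and the three separator levels are chosen for the example. It guarantees an upper bound on chunk size, not uniform sizes, and it does not add overlap.

```python
import re

# Coarse to fine: paragraph breaks, sentence ends, single words
SPLIT_PATTERNS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]

def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace words); swap in a tiktoken-based counter."""
    return len(text.split())

def chunk_text(text: str, max_tokens: int, level: int = 0) -> list[str]:
    """Split text into chunks of at most `max_tokens`, preferring coarse separators."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens or level >= len(SPLIT_PATTERNS):
        return [text]
    chunks, current = [], ""
    for piece in re.split(SPLIT_PATTERNS[level], text):
        candidate = f"{current} {piece}".strip()
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # piece still fits: keep packing
            continue
        if current:
            chunks.append(current)  # flush the full chunk
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece alone is too big: recurse with the next, finer separator
            chunks.extend(chunk_text(piece, max_tokens, level + 1))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "One two three. Four five six. Seven eight nine ten eleven twelve thirteen."
for chunk in chunk_text(sample, max_tokens=6):
    print(count_tokens(chunk), "|", chunk)
```

The first two sentences fit together in one chunk, while the long third sentence falls through to word-level splitting.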

# 18/11/2025

1. Synced Project with Google Drive

Today we configured a workflow to store the project both locally in WSL and in Google Drive for redundancy and future migration (e.g., dual-booting to Ubuntu).

Steps Completed: 

* Identified project directory in WSL:

`~/Projects/advanced-multisource-rag-project`


* Created a Google Drive–synced folder on Windows:

`C:\Users\Diego\GDrive\Projects\`


* Copied the project from WSL → Google Drive using:

`cp -r ~/Projects/advanced-multisource-rag-project /mnt/c/Users/Diego/GDrive/Projects/`


* Verified the new copy using:

`ls /mnt/c/Users/Diego/GDrive/Projects`


* Confirmed that Google Drive automatically synced the folder to the cloud.
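
For repeated backups, the copy step could also be scripted in Python. The sketch below is a variation on the `cp -r` step, not what was actually run: unlike `cp -r`, it skips `.venv`, `__pycache__` and `.ipynb_checkpoints`, on the assumption that those folders are cheap to recreate and not worth syncing. `mirror_project` is a hypothetical helper, and the demo uses a temporary folder rather than the real project path.

```python
import shutil
import tempfile
from pathlib import Path

def mirror_project(src: Path, dest_root: Path) -> Path:
    """Copy `src` into `dest_root`, skipping folders that are easy to rebuild."""
    target = dest_root / src.name
    shutil.copytree(
        src,
        target,
        ignore=shutil.ignore_patterns(".venv", "__pycache__", ".ipynb_checkpoints"),
        dirs_exist_ok=True,  # allows re-running the backup over an existing copy
    )
    return target

# Demo on a throwaway directory so the example is safe to run anywhere
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demo-project"
    (src / ".venv").mkdir(parents=True)
    (src / "notebook_01.ipynb").write_text("{}")
    copied = mirror_project(src, Path(tmp) / "backup")
    print(sorted(p.name for p in copied.iterdir()))
```

For the real project, the call would take the WSL project directory as `src` and the Google Drive folder under `/mnt/c` as `dest_root`.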

### Outcome

✔ Project is now mirrored inside Google Drive.

✔ Safe for backup, sharing, and migration to another OS installation.

# 18/11/2025 (continued)

2. Prepared the Project for GitHub Version Control

The project was structured properly for Git versioning and then pushed to GitHub.

### Steps Completed:

* 2.1 Initialized Git repo:
    * `git init`
* 2.2 Configured Git identity:
    * `git config --global user.name "Diego Miranda"`
    * `git config --global user.email "<github-email>"`
* 2.3 Created a proper .gitignore, with ignore rules for:
    * Python cache, logs
    * .venv virtual environment
    * VS Code settings
    * Jupyter checkpoints
    * Data/model directories
    * OS-specific junk files
* 2.4 Added and committed the project:
    * `git add .`
    * `git commit -m "Initial commit - Advanced Multisource RAG Project"`
* 2.5 Created GitHub repo online:
    * Repository name: advanced-multisource-rag-project
    * No README / .gitignore added on GitHub (to avoid conflicts)
* 2.6 Connected local → GitHub:
    * `git remote add origin https://github.com/<username>/advanced-multisource-rag-project.git`
    * `git branch -M main`
    * `git push -u origin main`

### Outcome

✔ Project is now fully version-controlled

✔ Hosted on GitHub

✔ Ready for collaboration, backups, and CI/CD later