# Program - RAG

Blake Green

In [2]:
import os
from dotenv import load_dotenv
from pypdf import PdfReader
from langchain_text_splitters import CharacterTextSplitter
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
import numpy as np
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores import Qdrant as QdrantVectorStore
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain_mistralai.chat_models import ChatMistralAI
import time


import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## SECTION 1: Environment Setup & PDF Ingestion

- **`load_dotenv()`**  
  Loads environment variables from a `.env` file into `os.environ`, so we can keep API keys and paths out of source control.
  
- **Environment variables**  
  ```python
  PDF_DIR         = os.getenv("PDF_DIR")
  QDRANT_URL      = os.getenv("QDRANT_URL")
  QDRANT_API_KEY  = os.getenv("QDRANT_API_KEY")
  MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
  ```
  Fetches the directory containing PDFs, Qdrant endpoint & key, and the Mistral API key at runtime.
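
`os.getenv` returns `None` for any variable that is not set, and the failure then surfaces later (for example inside `os.listdir`). A minimal sketch of an early check, assuming a hypothetical helper named `require_env` that is not part of the original notebook:

```python
import os

def require_env(names, environ=None):
    """Return the requested variables as a dict; raise if any is unset or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Possible usage after load_dotenv():
# settings = require_env(["PDF_DIR", "QDRANT_URL", "QDRANT_API_KEY", "MISTRAL_API_KEY"])
```

Failing at startup with the names of the missing variables is usually easier to debug than a `TypeError` raised from a later call.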

### Reading and Limiting PDFs
```python
all_text = []
for fname in os.listdir(PDF_DIR):
    if fname.lower().endswith(".pdf"):
        reader = PdfReader(os.path.join(PDF_DIR, fname))
        text = "\n".join(p.extract_text() or "" for p in reader.pages)
        all_text.append(text)
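# keep at most 10 reports (fewer if the directory holds fewer than 10)
all_text = all_text[:10]
```
Iterates over every PDF in `PDF_DIR`, extracts text page by page, and collects each document's full text in the `all_text` list. The list is then truncated to at most 10 documents to limit resource usage. Note that every PDF is read before the truncation, and `os.listdir` returns entries in arbitrary order, so which 10 reports are kept is not guaranteed to be the same from one machine to another.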
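
If a reproducible selection matters, one option is to sort the file names and apply the limit before any PDF is parsed. The sketch below uses only the standard library; `select_pdf_paths` is a hypothetical helper, not something the notebook defines:

```python
import os

def select_pdf_paths(pdf_dir, limit=10):
    """Return up to `limit` PDF paths from `pdf_dir`, in sorted file-name order."""
    names = sorted(
        name for name in os.listdir(pdf_dir)
        if name.lower().endswith(".pdf")
    )
    return [os.path.join(pdf_dir, name) for name in names[:limit]]

# Possible usage: parse only the selected files.
# for path in select_pdf_paths(PDF_DIR):
#     reader = PdfReader(path)
```

Selecting first also avoids extracting text from reports that are discarded afterwards.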

### Optimizing Chunk Size and Overlap
```python
lengths     = [len(doc) for doc in all_text]
avg_length  = sum(lengths) / len(lengths)
chunk_size  = int(avg_length * 0.25)
chunk_overlap = int(chunk_size * 0.10)
print(f"Using chunk_size={chunk_size}, chunk_overlap={chunk_overlap}")
```
Derives chunking parameters from document statistics using a simple heuristic (all lengths are in characters):
- Measures the length of each document
- Computes the average document length
- Sets chunk size to 25% of average document length
- Sets chunk overlap to 10% of the chunk size
- Outputs the calculated parameters for transparency
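
The same arithmetic can be wrapped in a small function, which also makes the empty-input case explicit (the original divides by `len(lengths)` and would raise `ZeroDivisionError` if no PDFs were found). This is a sketch; `chunk_params` and its parameter names are illustrative, not part of the notebook:

```python
def chunk_params(doc_lengths, size_ratio=0.25, overlap_ratio=0.10):
    """Return (chunk_size, chunk_overlap) from a list of document lengths in characters."""
    if not doc_lengths:
        raise ValueError("no documents to measure")
    avg_length = sum(doc_lengths) / len(doc_lengths)
    chunk_size = int(avg_length * size_ratio)        # 25% of the average length
    chunk_overlap = int(chunk_size * overlap_ratio)  # 10% of the chunk size
    return chunk_size, chunk_overlap

# An average length of 11656 characters gives chunk_size=2914 and
# chunk_overlap=291, the values printed by the notebook run.
print(chunk_params([11656]))
```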

### Document Splitting
```python
splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len
)
documents = splitter.create_documents(all_text)
```
Creates a text splitter that divides documents at paragraph breaks while respecting the calculated chunk parameters, then processes all documents into properly sized chunks.
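
Because `CharacterTextSplitter` splits only at the given separator, a paragraph longer than `chunk_size` can come through as a single oversized chunk. A quick way to check for that after splitting is sketched below; `summarize_chunks` is a hypothetical helper that works on plain strings:

```python
def summarize_chunks(chunks, chunk_size):
    """Report how many chunks exist and how many exceed `chunk_size` characters."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "max_length": max(lengths, default=0),
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }

# Possible usage with the notebook's variables:
# summarize_chunks([doc.page_content for doc in documents], chunk_size)
```

A non-zero `oversized` count suggests that some reports lack blank-line paragraph breaks after PDF extraction.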


In [4]:
load_dotenv()  # reads .env into os.environ

PDF_DIR         = os.getenv("PDF_DIR")
QDRANT_URL      = os.getenv("QDRANT_URL")
QDRANT_API_KEY  = os.getenv("QDRANT_API_KEY")
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

# ─── 1. Read all PDFs (then grab only 10) ──────────────────────────────────
all_text = []
for fname in os.listdir(PDF_DIR):
    if fname.lower().endswith(".pdf"):
        reader = PdfReader(os.path.join(PDF_DIR, fname))
        text = "\n".join(p.extract_text() or "" for p in reader.pages)
        all_text.append(text)

# keep at most 10 reports (fewer if the directory holds fewer than 10)
all_text = all_text[:10]

# ─── 2. Compute “optimal” chunk-size & overlap ─────────────────────────────
# e.g. use 25% of the average document length as chunk_size,
# and 10% of that as the overlap.
lengths     = [len(doc) for doc in all_text]
avg_length  = sum(lengths) / len(lengths)
chunk_size  = int(avg_length * 0.25)
chunk_overlap = int(chunk_size * 0.10)

print(f"Using chunk_size={chunk_size}, chunk_overlap={chunk_overlap}")

splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len
)
documents = splitter.create_documents(all_text)


Using chunk_size=2914, chunk_overlap=291


## SECTION 2: Embedding Model Definitions

```python
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
def embed_sbert(texts):
    return sbert_model.encode(texts, convert_to_numpy=True)

instructor_model = SentenceTransformer(
    "hkunlp/instructor-xl",
    trust_remote_code=True,
    device="cpu"
)
def embed_instructor(texts):
    return instructor_model.encode(texts, convert_to_numpy=True)

mpnet_model = SentenceTransformer(
    "sentence-transformers/all-mpnet-base-v2",
    device="cpu"
)
def embed_mpnet(texts):
    return mpnet_model.encode(texts, convert_to_numpy=True)

os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
```

**[all-MiniLM-L6-v2 (SBERT, 384-dim)](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)**  
A distilled Sentence-BERT model offering fast, lightweight 384-dim embeddings. Well-suited for large-scale semantic search where inference speed and memory are critical.

**[hkunlp/instructor-xl (Instructor-XL, 768-dim)](https://huggingface.co/hkunlp/instructor-xl)**  
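An instruction-tuned model that conditions its embeddings on supplied prompts, producing 768-dim vectors. It suits tasks where explicit instructions or task descriptions are fed into the embedding process. The `embed_instructor` wrapper in this notebook passes raw text with no instruction, so that conditioning is not exercised here.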

**[all-mpnet-base-v2 (MPNet, 768-dim)](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)**  
Based on the MPNet pretraining objective (masked & permuted language modeling), this 768-dim model often outperforms BERT-based alternatives on semantic tasks, striking a balance between quality and compute cost.

**Embedding functions**  
Each embed_* wrapper converts a list of strings into a NumPy array of dense vectors, ready for upload to a vector database.

In [6]:
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
def embed_sbert(texts):
    return sbert_model.encode(texts, convert_to_numpy=True)

instructor_model = SentenceTransformer(
    "hkunlp/instructor-xl",
    trust_remote_code=True,
    device="cpu"
)
def embed_instructor(texts):
    return instructor_model.encode(texts, convert_to_numpy=True)

mpnet_model = SentenceTransformer(
    "sentence-transformers/all-mpnet-base-v2",
    device="cpu"
)
def embed_mpnet(texts):
    return mpnet_model.encode(texts, convert_to_numpy=True)

# Silence HF symlink warnings (optional)
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


## SECTION 3: Qdrant Client Initialization & Collection Setup

```python
client = QdrantClient(
    url=QDRANT_URL,
    prefer_grpc=False,
    api_key=QDRANT_API_KEY
)

collections = {"sbert": 384, "instructor": 768, "mpnet": 768}
for name, dim in collections.items():
    client.recreate_collection(
        collection_name=f"{name}_collection",
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE)
    )
```

**QdrantClient**  
Connects to your Qdrant vector database using HTTP (or gRPC if prefer_grpc=True) and your API key.

**recreate_collection**  
For each embedding type (sbert, instructor, mpnet), drops any existing collection of that name and creates a new one with the appropriate vector dimension and cosine similarity as the distance metric.
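
For reference, cosine similarity between two vectors is their dot product divided by the product of their norms. The pure-Python sketch below illustrates the metric only; it is not how Qdrant computes it internally:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

The dimension check mirrors why each collection is created with the `size` of its own model: a 384-dim SBERT vector cannot be compared against 768-dim vectors.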

## SECTION 4: Bulk Uploading Embeddings to Qdrant

```python
for name, embed_fn in [("sbert", embed_sbert),
                       ("instructor", embed_instructor),
                       ("mpnet", embed_mpnet)]:
    texts      = [doc.page_content for doc in documents]
    embeddings = embed_fn(texts)
    points     = [
        PointStruct(id=i, vector=embeddings[i], payload={"text": texts[i]})
        for i in range(len(texts))
    ]
    client.upload_points(
        collection_name=f"{name}_collection",
        points=points
    )
```

**Text extraction**  
Gathers the .page_content from each chunked Document.

**Embedding generation**  
Calls the appropriate embed_fn to get a NumPy array of shape (num_chunks, dim).

**Point creation & upload**  
Wraps each vector in a PointStruct, attaching the original text as payload, and uploads them in bulk to Qdrant.
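
For larger corpora it can help to convert vectors to plain lists of floats and send points in fixed-size batches. The sketch below shows only the data preparation, with no Qdrant calls; `build_point_dicts` and `batched` are hypothetical helpers, and the dictionaries simply mirror the `id`, `vector`, and `payload` fields used above:

```python
def build_point_dicts(texts, vectors):
    """Pair each text with its vector as a plain dict with list-of-float vectors."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    return [
        {"id": i, "vector": [float(x) for x in vec], "payload": {"text": text}}
        for i, (text, vec) in enumerate(zip(texts, vectors))
    ]

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Iterating over a NumPy array row by row works with `build_point_dicts` as well, since each element is converted with `float`.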


In [9]:
client = QdrantClient(
    url=QDRANT_URL,
    prefer_grpc=False,
    api_key=QDRANT_API_KEY
)

# Create one collection per embedding
collections = {"sbert": 384, "instructor": 768, "mpnet": 768}
for name, dim in collections.items():
    client.recreate_collection(
        collection_name=f"{name}_collection",
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE)
    )

# Upload embeddings
for name, embed_fn in [("sbert", embed_sbert),
                       ("instructor", embed_instructor),
                       ("mpnet", embed_mpnet)]:
    texts     = [doc.page_content for doc in documents]
    embeddings= embed_fn(texts)
    points    = [
        PointStruct(id=i, vector=embeddings[i], payload={"text": texts[i]})
        for i in range(len(texts))
    ]
    client.upload_points(collection_name=f"{name}_collection",
                         points=points)


## SECTION 5: LLM Setup & Question Bank Definition

```python
llm = ChatMistralAI(
    model="mistral-small",
    api_key=MISTRAL_API_KEY,
    temperature=0.0,
    max_retries=2,
)

rag_configs = {
    "sbert_collection":      "all-MiniLM-L6-v2",
    "instructor_collection": "hkunlp/instructor-xl",
    "mpnet_collection":      "sentence-transformers/all-mpnet-base-v2",
}

multi_questions = [
    "What accident was the most dangerous?",
    "What accident featured the least amount of people onboard?",
    "How many accidents featured serious, but non-fatal injuries?",
    "What accident featured the oldest pilot?",
]

single_questions = [
    "What was the cause of incident DCA22LA182?",
    "When was the captain of incident DCA23LA384 hired, and by which airline?",
    "In the fatal aircraft incident, was the name of the airshow it occurred at?",
    "What object was struck in accident DCA21LA137, and which wing sustained damage?",
]
```

**ChatMistralAI**  
Instantiates the `mistral-small` chat model at temperature 0.0 for near-deterministic outputs, with up to two retries (`max_retries=2`) on failed requests.

**rag_configs**  
Maps each Qdrant collection name to its original Hugging Face model identifier, used by SentenceTransformerEmbeddings.

**Question lists**
- **multi_questions**: Broad, comparative queries over the entire corpus.
- **single_questions**: Specific, incident-level inquiries keyed to particular report IDs.

In [11]:
llm = ChatMistralAI(
    model="mistral-small",
    api_key=MISTRAL_API_KEY,
    temperature=0.0,
    max_retries=2,
)

rag_configs = {
    "sbert_collection":      "all-MiniLM-L6-v2",
    "instructor_collection": "hkunlp/instructor-xl",
    "mpnet_collection":      "sentence-transformers/all-mpnet-base-v2",
}

multi_questions = [
    "What accident was the most dangerous?",
    "What accident featured the least amount of people onboard?",
    "How many accidents featured serious, but non-fatal injuries?",
    "What accident featured the oldest pilot?",
]

single_questions = [
    "What was the cause of incident DCA22LA182?",
    "When was the captain of incident DCA23LA384 hired, and by which airline?",
    "In the fatal aircraft incident, was the name of the airshow it occurred at?",
    "What object was struck in accident DCA21LA137, and which wing sustained damage?",
]

## SECTION 6: Executing MultiRAG Queries

```python
for coll_name, emb_model in rag_configs.items():
    print(f"\n=== MultiRAG using {coll_name} ({emb_model}) ===")
    embeddings  = SentenceTransformerEmbeddings(model_name=emb_model)
    vectorstore = QdrantVectorStore(
        client=client,
        collection_name=coll_name,
        embeddings=embeddings,
        content_payload_key="text",
    )
    qa_multi = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
    )
    for q in multi_questions:
        print(f"\nQ: {q}")
        ans = qa_multi.run(q)
        print(f"A: {ans}")
        time.sleep(30)
```

**Building each retriever**  
Wraps the Qdrant collection in a LangChain QdrantVectorStore, telling it how to embed new queries and which payload field holds the text.

**RetrievalQA.from_chain_type("stuff")**  
Creates a simple "stuff" chain that concatenates the top-k retrieved chunks (here, k=8, matching the executed cell below) and feeds them to the LLM in a single prompt.

**Query loop**  
For each of the four multi-document questions, prints the question, retrieves & answers it, then waits 30 seconds to avoid rate limits.
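
The loop prints answers and also sleeps after the last question. A small variation that collects the answers for later comparison and pauses only between calls could look like the sketch below; `ask_all` is a hypothetical helper, and `ask` stands for any callable that takes a question and returns an answer (for example `qa_multi.run`):

```python
import time

def ask_all(ask, questions, delay_seconds=30, sleep=time.sleep):
    """Call `ask` on each question, pausing between calls, and return the answers."""
    answers = {}
    for index, question in enumerate(questions):
        if index > 0:
            sleep(delay_seconds)  # pause between calls, not after the last one
        answers[question] = ask(question)
    return answers

# Possible usage with the notebook's chain:
# answers = ask_all(qa_multi.run, multi_questions)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.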

https://admiralcloudberg.medium.com/passing-the-buck-the-story-of-the-2022-wings-over-dallas-air-show-collision-9bbe5947297b

In [14]:
for coll_name, emb_model in rag_configs.items():
    print(f"\n=== MultiRAG using {coll_name} ({emb_model}) ===")
    embeddings  = SentenceTransformerEmbeddings(model_name=emb_model)
    vectorstore = QdrantVectorStore(
        client=client,
        collection_name=coll_name,
        embeddings=embeddings,
        content_payload_key="text",
    )
    qa_multi = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
    )
    for q in multi_questions:
        print(f"\nQ: {q}")
        ans = qa_multi.run(q)
        print(f"A: {ans}")
        time.sleep(30)


=== MultiRAG using sbert_collection (all-MiniLM-L6-v2) ===

Q: What accident was the most dangerous?
A: Based on the information provided, the most dangerous accident in terms of the number of fatalities was the midair collision between a Boeing B17 and a Bell P63 during an air show at Dallas Executive Airport on November 12, 2022, which resulted in 6 fatalities (5 on the Boeing B17 and 1 on the Bell P63). However, it is important to note that the severity of an accident can also be measured in other ways, such as the extent of damage to the aircraft or the impact on the environment.

Q: What accident featured the least amount of people onboard?
A: The accident with the least amount of people onboard was CEN24LA035, which involved a single-engine airplane with two people onboard.

Q: How many accidents featured serious, but non-fatal injuries?
A: In the provided reports, there are two accidents (DCA21LA058 and DCA23LA304) where a flight attendant sustained serious injuries but there w

# MultiRAG Response Analysis

## Multiple-Document Questions

### 1. What accident was the most dangerous?
* **Expected:** CEN23MA034, a midair collision at an airshow that resulted in 6 fatalities.
* **SBERT** (`all-MiniLM-L6-v2`):
   * ✅ Correctly identifies the midair collision between a B-17 and P-63 with 6 fatalities.
   * ❌ Omits the report code (CEN23MA034).
* **Instructor** (`hkunlp/instructor-xl`):
   * ✅ Identifies the same collision with 6 fatalities.
   * ❌ Also omits the specific report ID.
* **MPNet** (`all-mpnet-base-v2`):
   * ❌ Misattributes "most dangerous" to a birdstrike (CEN24LA035), citing aircraft damage rather than fatalities.

### 2. What accident featured the least amount of people onboard?
* **Expected:** CEN24LA035, with 1 person onboard.
* **SBERT:**
   * ✅ Correct report (CEN24LA035).
   * ❌ Incorrect count ("two people onboard" vs. 1).
* **Instructor:**
   * ❌ Picks CEN23MA034 (10 onboard), entirely off.
* **MPNet:**
   * ✅ Correctly reports CEN24LA035 with 1 person onboard.

### 3. How many accidents featured serious, but non-fatal injuries?
* **Expected:** Three (DCA21LA058, DCA23LA196, DCA23LA304).
* **SBERT:**
   * ❌ Under-counts (only 2 accidents).
* **Instructor:**
   * ✅ Counts three, correctly lists three distinct serious-injury cases.
* **MPNet:**
   * ✅ Also counts three matching the expected set.

### 4. What accident featured the oldest pilot?
* **Expected:** CEN23MA034 (pilot aged 67).
* **SBERT:**
   * ❌ Attributes it to DCA21LA137 (62 years old).
* **Instructor:**
   * ❌ Attributes it to SkyWest 5069 (58 years old).
* **MPNet:**
   * ❌ Attributes it to DCA21LA058 (60 years old).

## Overall MultiRAG Comparison
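* **Best model:** No clear winner. **Instructor-XL** and **MPNet** each answer two of the four questions correctly (Q 1 and Q 3 for Instructor-XL; Q 2 and Q 3 for MPNet), while **SBERT** answers Q 1 correctly and gets Q 2 only partly right (correct report, wrong headcount).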
* **Major strength of Instructor-XL:** Precise counts on injuries (Q 3).
* **Major weakness of MPNet:** Misinterprets "most dangerous" (Q 1).
* **Common gap:** None of the three models correctly identified the oldest pilot (Q 4) or cited report codes consistently.
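
The per-question outcomes above can be tallied in a few lines. The grading labels come from the analysis in this section; counting a partly correct answer as 0.5 and ignoring the missing report codes in Q 1 are assumptions made here for illustration:

```python
# Outcomes transcribed from the analysis above.
multi_results = {
    "SBERT":      {"Q1": "correct", "Q2": "partial", "Q3": "wrong",   "Q4": "wrong"},
    "Instructor": {"Q1": "correct", "Q2": "wrong",   "Q3": "correct", "Q4": "wrong"},
    "MPNet":      {"Q1": "wrong",   "Q2": "correct", "Q3": "correct", "Q4": "wrong"},
}

# Assumed weights: partial credit counts as half a point.
WEIGHTS = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}

def tally(results, weights=WEIGHTS):
    """Sum the weighted outcomes for each model."""
    return {
        model: sum(weights[outcome] for outcome in outcomes.values())
        for model, outcomes in results.items()
    }

print(tally(multi_results))
```

Under these weights SBERT scores 1.5 and the other two models score 2.0 each, out of 4.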


## SECTION 7: Executing SingleRAG Queries

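```python
for coll_name, emb_model in rag_configs.items():
    print(f"\n=== SingleRAG using {coll_name} ({emb_model}) ===")
    # Rebuild the vector store for this collection; otherwise `vectorstore`
    # would still point at the last collection built in Section 6.
    embeddings  = SentenceTransformerEmbeddings(model_name=emb_model)
    vectorstore = QdrantVectorStore(
        client=client,
        collection_name=coll_name,
        embeddings=embeddings,
        content_payload_key="text",
    )
    qa_single = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
    )
    for q in single_questions:
        print(f"\nQ: {q}")
        ans = qa_single.run(q)
        print(f"A: {ans}")
        time.sleep(30)
```

Builds the same retriever setup as Section 6 for each embedding collection. The vector store has to be constructed inside the loop: if it is not, `vectorstore` still refers to the last collection from Section 6 (`mpnet_collection`) and all three runs query that one collection.

Single-incident queries ask about one report at a time; apart from that, the loop is identical to Section 6 but iterates over the `single_questions` list.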
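
One way to guarantee that each collection is queried through its own retriever is to build the chain inside the loop through a factory. In the sketch below, `run_per_collection` and `make_chain` are hypothetical names; `make_chain(coll_name, emb_model)` is assumed to return a callable that answers a question using that collection:

```python
def run_per_collection(rag_configs, make_chain, questions):
    """Build a fresh chain per collection and collect its answers."""
    results = {}
    for coll_name, emb_model in rag_configs.items():
        chain = make_chain(coll_name, emb_model)  # never reused across collections
        results[coll_name] = {question: chain(question) for question in questions}
    return results
```

Keeping chain construction in one place also means Sections 6 and 7 cannot drift apart on settings such as `k`.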


In [16]:
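for coll_name, emb_model in rag_configs.items():
    print(f"\n=== SingleRAG using {coll_name} ({emb_model}) ===")
    # NOTE: `vectorstore` is not rebuilt in this loop, so every iteration uses
    # the store left over from the last Section 6 iteration (mpnet_collection).
    qa_single = RetrievalQA.from_chain_type(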
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
    )
    for q in single_questions:
        print(f"\nQ: {q}")
        ans = qa_single.run(q)
        print(f"A: {ans}")
        time.sleep(30)


=== SingleRAG using sbert_collection (all-MiniLM-L6-v2) ===

Q: What was the cause of incident DCA22LA182?
A: The National Transportation Safety Board (NTSB) determined the probable cause of incident DCA22LA182 to be a lateral runway excursion during landing for reasons that could not be determined based on the available evidence.

Q: When was the captain of incident DCA23LA384 hired, and by which airline?
A: The captain of incident DCA23LA384 was hired on June 10, 2010, by United Airlines.

Q: In the fatal aircraft incident, was the name of the airshow it occurred at?
A: The name of the airshow where the fatal aircraft incident occurred is the Commemorative Air Force’s (CAF) Wings Over Dallas air show.

Q: What object was struck in accident DCA21LA137, and which wing sustained damage?
A: In accident DCA21LA137, the left wing of the airplane struck a light pole.

=== SingleRAG using instructor_collection (hkunlp/instructor-xl) ===

Q: What was the cause of incident DCA22LA182?
A: The 

## SingleRAG Response Analysis

### 1. What was the cause of incident DCA22LA182?
* **Expected:** A lateral runway excursion during landing (undetermined reason).
* **All models (SBERT, Instructor, MPNet):**
   * ✅ Perfect match on wording and substance.

### 2. When was the captain of incident DCA23LA384 hired, and by which airline?
* **Expected:** Hired by United Airlines in April 2001 as a first officer.
* **SBERT:**
   * ❌ "June 10, 2010" by United.
* **Instructor:**
   * ❌ "June 10, 2020" by United.
* **MPNet:**
   * ❌ Same "June 10, 2020."
* **Analysis:** All three return a wrong hire date and omit that the captain was hired as a first officer; none recovers April 2001.

### 3. In the fatal aircraft incident, was the name of the airshow it occurred at?
* **Expected:** Commemorative Air Force's Wings Over Dallas.
* **SBERT:**
   * ✅ Correct name.
* **Instructor & MPNet:**
   * ✅ Also mention "CAF Wings Over Dallas" (plus location detail in Instructor).

### 4. What object was struck in accident DCA21LA137, and which wing sustained damage?
* **Expected:** Light pole struck left wing.
* **All models:**
   * ✅ Correctly cite "left wing" and "light pole."

## Overall SingleRAG Comparison
* **Strengths:** All embeddings reliably retrieve cause (Q 1), airshow name (Q 3), and struck object with wing (Q 4).
* **Weaknesses:** All three fail the hiring date question (Q 2), suggesting that employment-history details may be sparsely represented or harder to retrieve.
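* **Best model:** All perform similarly. SBERT's answer to Q 2 (2010) is closer to the expected April 2001 than the others' (2020), but all three are wrong, so this is not a meaningful advantage. Note also that the executed cell reuses the `vectorstore` left over from Section 6 rather than rebuilding it per collection, so differences between the three runs here may not reflect the embeddings themselves.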
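
The manual grading above can be approximated with a simple keyword check. This is a rough sketch, assuming that an answer counts as correct when it contains every expected keyword; `contains_all` is a hypothetical helper and the keyword lists are chosen here for illustration:

```python
def contains_all(answer, keywords):
    """Return True if every keyword appears in the answer, ignoring case."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in keywords)

# SBERT answers copied from the output above.
q4_answer = "In accident DCA21LA137, the left wing of the airplane struck a light pole."
q2_answer = "The captain of incident DCA23LA384 was hired on June 10, 2010, by United Airlines."

print(contains_all(q4_answer, ["left wing", "light pole"]))   # Q 4 keywords
print(contains_all(q2_answer, ["United Airlines", "2001"]))   # Q 2 keywords
```

Keyword matching misses paraphrases and cannot judge counts or comparisons, so it complements manual review rather than replacing it.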
