# Local RAG with Mistral 7B & CodeCarbon Evaluation

This notebook runs a complete Retrieval-Augmented Generation (RAG) pipeline locally on your machine. It uses:

- **`llama-index`**: To build the RAG pipeline.
- **`llama-cpp-python`**: To run the quantized Mistral 7B GGUF model.
- **GPU Acceleration**: The model is configured to run on your NVIDIA GPU (`n_gpu_layers=-1`).
- **`codecarbon`**: To measure the energy consumption and CO2 emissions for each query you make in real-time.

## 1. Setup & Installations

This cell installs all the required Python libraries. 

**Note:** This assumes you have already installed `llama-cpp-python` with the correct CUDA (GPU) support. If not, you may need to run this command in your terminal first:

`$env:CMAKE_ARGS = "-DGGML_CUDA=on"`
`pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir`

In [1]:
!pip install llama-index
!pip install llama-index-llms-llama-cpp
!pip install llama-index-embeddings-huggingface
!pip install sentence-transformers
!pip install pypdf
!pip install torch torchvision torchaudio
!pip install codecarbon
!pip install langchain-community # Dependency for Ragas



ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


## 2. Imports and Configuration

Here we import all necessary modules and set up the file paths for your model and data. 

**Please double-check that `MODEL_PATH` and `DATA_PATH` are correct for your system.**

In [2]:
# import os
import time
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from codecarbon import OfflineEmissionsTracker
from llama_index.core import PromptTemplate
import textwrap

# --- 1. Configuration ---

# Set the path to your downloaded GGUF model
# IMPORTANT: Use a raw string (r"...") for Windows paths
# MODEL_PATH =r"D:\\Mistral7B\\mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Set the path to your data (PDFs, .txt, etc.)
DATA_PATH = r"D:\Mistral7B\data"

# Set your country's 3-letter ISO code for CodeCarbon
# Find your code: https://en.wikipedia.org/wiki/List_of_ISO_3166-1_alpha-3_codes
YOUR_COUNTRY_ISO_CODE = "EGY"

print("Configuration loaded.")




Configuration loaded.


## 3. Initialize Models and Index

This cell loads the Mistral 7B model into your GPU VRAM, loads the embedding model, and then scans your `DATA_PATH` to build the searchable RAG index. This step may take a moment.

In [None]:
print("Initializing models...")
MODEL_PATH = "D:/Mistral7B/mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Load the local LLM (Mistral 7B) with GPU offloading
llm = LlamaCPP(
    model_path=MODEL_PATH,
    temperature=0.1,
    max_new_tokens=1024,
    context_window=3900,
    generate_kwargs={},
    # Set n_gpu_layers to -1 to offload all layers to GPU
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,
)

# Load the local Embedding Model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Set up LlamaIndex global settings to use our local models
Settings.llm = llm
Settings.embed_model = embed_model

print("\nLoading and indexing documents...")
documents = SimpleDirectoryReader(DATA_PATH).load_data()
print(f"Loaded {len(documents)} document(s).")

index = VectorStoreIndex.from_documents(documents)
print("Indexing complete.")


# --- ADD THIS SECTION ---
# Define the new, strict prompt template
qa_template_str = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Do not mention the context, the source document, or 'the text'.\n"
    "2. Answer the question directly, as if you knew the information yourself.\n"
    "3. If the answer is not in the context, state that you do not have enough "
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template_str2 = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Use the context as inspiration, but do not copy it.'.\n"
    "2. Expand or interpret the ideas creatively, producing a short paragraph.\n"
    "3. Keep the tone natural and imaginative, as if writing your own reflection"
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template_str3 = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Rewrite the given information in your own words.'.\n"
    "2. Preserve meaning and tone without copying phrases directly..\n"
    "3. The output should read naturally like an original paragraph."
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template_str4 = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Provide short, well-structured answers (2â€“5 sentences)..'.\n"
    "2. Use only logical reasoning\n"
    "3. Do not add assumptions or outside facts"
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template = PromptTemplate(qa_template_str)
# --- END SECTION ---


# --- MODIFY THIS LINE ---
# Create the query engine, passing in the new template
query_engine = index.as_query_engine(
    streaming=True,
    text_qa_template=qa_template,  # <-- Pass the template here
)
# --- END MODIFICATION ---

print("Query engine is ready (with custom anti-leak prompt).")

Initializing models...


ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3050 Laptop GPU) - 3302 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from D:/Mistral7B/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loade


Loading and indexing documents...
Loaded 1 document(s).
Indexing complete.
Query engine is ready (with custom anti-leak prompt).


## 4. Start Interactive RAG + Carbon Tracking

Run this cell to start the interactive chat. You can ask questions about your documents.

- Type your question and press Enter.
- The model will stream its answer.
- After the answer, `codecarbon` will print the latency and environmental cost for that specific query.
- Type `exit` to stop the loop and see the total emissions for the session.

In [4]:
print(f"\nInitializing CodeCarbon tracker for country: {YOUR_COUNTRY_ISO_CODE}")
tracker = OfflineEmissionsTracker(country_iso_code=YOUR_COUNTRY_ISO_CODE)
tracker.start()

print("\n--- Query Engine Ready (Tracking Emissions) ---")
print("Type 'exit' to quit.")

try:
    while True:
        query = input("Ask a question about your documents: ")
        if query.lower() == "exit":
            break

        # --- Start tracking just for the query ---
        tracker.start_task("RAG Query")
        start_time = time.time()

        response_stream = query_engine.query(query)

        # Print the user's question
        print(f"\n\nYour Question: {query}")
        print("\nAssistant: ")

        # --- Iterate stream and wrap text ---
        full_answer_text = ""
        for chunk_text in response_stream.response_gen:
            full_answer_text += chunk_text

        # Wrap the complete answer to a width of 80 characters
        wrapped_answer = textwrap.fill(full_answer_text, width=80)
        print(wrapped_answer)
        # --- END MODIFICATION --

        # --- Stop tracking and get emissions for this single query ---
        end_time = time.time()
        emissions_data = tracker.stop_task()

        print("\n\n--- Query Metrics ---")
        print(f"Latency: {end_time - start_time:.2f} seconds")
        print(f"Emissions: {emissions_data.emissions * 1000:.6f} gCO2eq")
        print(f"Energy: {emissions_data.energy_consumed * 1000:.6f} Wh")
        print("-" * 50)

finally:
    # This stops the main tracker and saves the total emissions.csv file
    total_emissions_kg = tracker.stop()
    print("\n\n--- Total Emissions Summary (Session) ---")
    # Access total energy from the tracker object itself
    if tracker.final_emissions_data:
        print(
            f"Total Energy Consumed: {tracker.final_emissions_data.energy_consumed * 1000:.4f} Wh"
        )
    print(f"Total CO2 Emitted: {total_emissions_kg * 1000:.4f} gCO2eq")
    print("Full report saved to 'emissions.csv'")


Initializing CodeCarbon tracker for country: EGY

--- Query Engine Ready (Tracking Emissions) ---
Type 'exit' to quit.


Your Question: Summarize the main events during the Apollo 11 lunar landing in 3 sentences.

Assistant: 


llama_perf_context_print:        load time =   70428.16 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  1885 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   102 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  119472.65 ms /  1987 tokens


 The Apollo 11 lunar landing was marked by Armstrong and Aldrin reporting they
were passing landmarks too early due to Eagle traveling too fast, and
encountering unexpected 1201 and 1202 program alarms. The guidance computer,
rather than forcing an abort, took recovery action and prevented an abort,
allowing Armstrong to take semi-automatic control and land the spacecraft in a
clear patch of ground, despite having limited propellant remaining.


--- Query Metrics ---
Latency: 119.76 seconds
Emissions: 1.457414 gCO2eq
Energy: 2.555495 Wh
--------------------------------------------------


Your Question: What were the main challenges Armstrong faced while landing the Eagle?

Assistant: 


Llama.generate: 1858 prefix-match hit, remaining 19 prompt tokens to eval
llama_perf_context_print:        load time =   70428.16 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    19 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   140 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   68114.01 ms /   159 tokens


 The main challenges Armstrong faced while landing the Eagle were passing
landmarks earlier than expected due to Eagle traveling too fast, encountering
unexpected 1201 and 1202 program alarms, and dealing with a gravitational
anomaly caused by mascons in the Moon's crust. Additionally, Armstrong had to
take semi-automatic control when the computer's landing target was in a boulder-
strewn area and had to land at the first possible site due to dwindling
propellant supply. Lunar dust kicked up by the LM's engine also impaired his
ability to determine the spacecraft's motion.


--- Query Metrics ---
Latency: 68.81 seconds
Emissions: 0.855907 gCO2eq
Energy: 1.500786 Wh
--------------------------------------------------


Your Question: Describe the activities the astronauts performed on the lunar surface.

Assistant: 


Llama.generate: 1858 prefix-match hit, remaining 20 prompt tokens to eval
llama_perf_context_print:        load time =   70428.16 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    20 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   274 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  137524.22 ms /   294 tokens


 The astronauts, Armstrong and Aldrin, found themselves passing landmarks on the
lunar surface earlier than expected due to Eagle traveling too fast. They
reported this to Mission Control and experienced several unexpected 1201 and
1202 program alarms. Mission Control assured them it was safe to continue the
descent. The alarms indicated 'executive overflows', meaning the guidance
computer could not complete all its tasks in real-time and had to postpone some.
Margaret Hamilton, the Director of Apollo Flight Computer Programming, later
recalled that the computer was programmed to do more than just recognize error
conditions and had a complete set of recovery programs incorporated into the
software. The computer's action was to eliminate lower priority tasks and re-
establish the more important ones, preventing an abort. Armstrong took semi-
automatic control when he saw the computer's landing target was in a boulder-
strewn area. Throughout the descent, Aldrin called out navigation dat

Llama.generate: 1858 prefix-match hit, remaining 21 prompt tokens to eval
llama_perf_context_print:        load time =   70428.16 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    21 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   211 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  103056.42 ms /   232 tokens


 Based on the context, Armstrong and Aldrin found themselves passing landmarks
earlier than expected during their descent, indicating they were traveling too
fast. They reported being "long" and miles west of their target point. The LM
guidance computer (LGC) experienced unexpected 1201 and 1202 program alarms,
which were later determined to be "executive overflows," meaning the computer
could not complete all its tasks in real-time and had to postpone some of them.
The computer's recovery programs prevented an abort and allowed the successful
Moon landing. Armstrong took semi-automatic control when the computer's landing
target was in a boulder-strewn area. He was determined to land at the first
possible site due to dwindling propellant. The actual landing site was not
explicitly stated in the context, so it's unclear how it compares to the planned
site. However, it appears that the descent and landing deviated from the planned
timeline.


--- Query Metrics ---
Latency: 103.24 seconds

AttributeError: 'OfflineEmissionsTracker' object has no attribute 'final_emissions_data'