# Local RAG with Mistral 7B & CodeCarbon Evaluation

This notebook runs a complete Retrieval-Augmented Generation (RAG) pipeline locally on your machine. It uses:

- **`llama-index`**: To build the RAG pipeline.
- **`llama-cpp-python`**: To run the quantized Mistral 7B GGUF model.
- **GPU Acceleration**: The model is configured to run on your NVIDIA GPU (`n_gpu_layers=-1`).
- **`codecarbon`**: To measure the energy consumption and CO2 emissions for each query you make in real-time.

## 1. Setup & Installations

This cell installs all the required Python libraries. 

**Note:** This assumes you have already installed `llama-cpp-python` with the correct CUDA (GPU) support. If not, you may need to run this command in your terminal first:

`$env:CMAKE_ARGS = "-DGGML_CUDA=on"`
`pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir`

In [1]:
!pip install llama-index
!pip install llama-index-llms-llama-cpp
!pip install llama-index-embeddings-huggingface
!pip install sentence-transformers
!pip install pypdf
!pip install torch torchvision torchaudio
!pip install codecarbon
!pip install langchain-community # Dependency for Ragas



ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


## 2. Imports and Configuration

Here we import all necessary modules and set up the file paths for your model and data. 

**Please double-check that `MODEL_PATH` and `DATA_PATH` are correct for your system.**

In [2]:
# import os
import time
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from codecarbon import OfflineEmissionsTracker
from llama_index.core import PromptTemplate
import textwrap

# --- 1. Configuration ---

# Set the path to your downloaded GGUF model
# IMPORTANT: Use a raw string (r"...") for Windows paths
# MODEL_PATH =r"D:\\Mistral7B\\mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Set the path to your data (PDFs, .txt, etc.)
DATA_PATH = r"D:\Mistral7B\data"

# Set your country's 3-letter ISO code for CodeCarbon
# Find your code: https://en.wikipedia.org/wiki/List_of_ISO_3166-1_alpha-3_codes
YOUR_COUNTRY_ISO_CODE = "EGY"

print("Configuration loaded.")




Configuration loaded.


## 3. Initialize Models and Index

This cell loads the Mistral 7B model into your GPU VRAM, loads the embedding model, and then scans your `DATA_PATH` to build the searchable RAG index. This step may take a moment.

In [3]:
print("Initializing models...")
MODEL_PATH = r"D:\Mistral7B\tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

# Load the local LLM (Mistral 7B) with GPU offloading
llm = LlamaCPP(
    model_path=MODEL_PATH,
    temperature=0.1,
    max_new_tokens=1024,
    context_window=3900,
    generate_kwargs={},
    # Set n_gpu_layers to -1 to offload all layers to GPU
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,
)

# Load the local Embedding Model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Set up LlamaIndex global settings to use our local models
Settings.llm = llm
Settings.embed_model = embed_model

print("\nLoading and indexing documents...")
documents = SimpleDirectoryReader(DATA_PATH).load_data()
print(f"Loaded {len(documents)} document(s).")

index = VectorStoreIndex.from_documents(documents)
print("Indexing complete.")


# --- ADD THIS SECTION ---
# Define the new, strict prompt template
qa_template_str = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Do not mention the context, the source document, or 'the text'.\n"
    "2. Answer the question directly, as if you knew the information yourself.\n"
    "3. If the answer is not in the context, state that you do not have enough "
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template_str2 = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Use the context as inspiration, but do not copy it.'.\n"
    "2. Expand or interpret the ideas creatively, producing a short paragraph.\n"
    "3. Keep the tone natural and imaginative, as if writing your own reflection"
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template_str3 = (
    "You are an expert assistant. Answer the user's question based *only* on the "
    "provided context.\n\n"
    "Strict Rules:\n"
    "1. Rewrite the given information in your own words.'.\n"
    "2. Preserve meaning and tone without copying phrases directly..\n"
    "3. The output should read naturally like an original paragraph."
    "information to answer.\n\n"
    "--- Context ---\n"
    "{context_str}\n"
    "--- Question ---\n"
    "{query_str}\n\n"
    "Answer:"
)
qa_template = PromptTemplate(qa_template_str2)
# --- END SECTION ---


# --- MODIFY THIS LINE ---
# Create the query engine, passing in the new template
query_engine = index.as_query_engine(
    streaming=True,
    text_qa_template=qa_template,  # <-- Pass the template here
    similarity_top_k=3,
    include_source_nodes=True,
)
# --- END MODIFICATION ---

print("Query engine is ready (with custom anti-leak prompt).")

Initializing models...


ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3050 Laptop GPU) - 3302 MiB free
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from D:\Mistral7B\tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader


Loading and indexing documents...
Loaded 1 document(s).
Indexing complete.
Query engine is ready (with custom anti-leak prompt).


## 4. Start Interactive RAG + Carbon Tracking

Run this cell to start the interactive chat. You can ask questions about your documents.

- Type your question and press Enter.
- The model will stream its answer.
- After the answer, `codecarbon` will print the latency and environmental cost for that specific query.
- Type `exit` to stop the loop and see the total emissions for the session.

In [None]:
# --- Define the new prompts for the Critic and Refiner ---

CRITIC_PROMPT = """
You are a 'Critic' AI. Your job is to provide specific, actionable feedback on a
draft answer. You will be given the original question, the source context, and the
draft answer.

Evaluate the 'Draft Answer' based *only* on the 'Source Context' for two criteria:
1.  **Faithfulness:** Is all information in the answer supported by the context?
2.  **Relevance:** Does the answer directly address the question?

Provide your feedback as a list of bullet points. If the draft is perfect,
simply respond with "The draft is perfect."

---
**Source Context:**
{context}
---
**Original Question:**
{question}
---
**Draft Answer:**
{draft}
---
**Your Feedback:**
"""

REFINER_PROMPT = """
You are a 'Refiner' AI. Your job is to rewrite a draft answer based on
feedback from a critic.

Your goal is to produce a final, high-quality answer that directly
answers the question, is fully supported by the context (which you haven't seen),
and incorporates the critic's feedback.

**Strict Rules:**
- Do not mention the context, the feedback, or the draft.
- Do not add any new information.
- Just provide the final, improved answer.

---
**Original Draft:**
{draft}
---
**Critic's Feedback:**
{feedback}
---
**Your Refined Answer:**
"""

# --- Define the number of refinement cycles ---
REFINEMENT_CYCLES = 3

print(f"\nInitializing CodeCarbon tracker for country: {YOUR_COUNTRY_ISO_CODE}")
tracker = OfflineEmissionsTracker(country_iso_code=YOUR_COUNTRY_ISO_CODE)
tracker.start()

print(f"\n--- Query Engine Ready (Recursive Editing: {REFINEMENT_CYCLES} cycles) ---")
print("Type 'exit' to quit.")

try:
    while True:
        query = input("Ask a question about your documents: ")
        if query.lower() == "exit":
            break

        # --- Start tracking the entire multi-step process ---
        tracker.start_task("Recursive RAG Query")
        start_time = time.time()

        # --- Step 1: Get the Draft (and Context) ---
        response_stream = query_engine.query(query)

        # Collect the streamed draft text
        draft_text = ""
        for chunk_text in response_stream.response_gen:
            draft_text += chunk_text

        # Extract the source context
        context_str = "\n---\n".join(
            [node.get_content() for node in response_stream.source_nodes]
        )

        print(f"\n\nYour Question: {query}")
        print("\n--- Initial Draft (from RAG) ---")
        print(textwrap.fill(draft_text, width=80))

        # --- MODIFICATION: Start Recursive Loop ---

        current_draft = draft_text  # Initialize the loop with the first draft

        for i in range(REFINEMENT_CYCLES):
            print(f"\n--- Refinement Cycle {i + 1}/{REFINEMENT_CYCLES} ---")

            # --- Step 2: Run the Critic ---
            print("--- Critic is thinking... ---")
            critic_prompt = CRITIC_PROMPT.format(
                context=context_str,
                question=query,
                draft=current_draft,  # Use the *current* draft
            )

            feedback_response = llm.complete(critic_prompt)
            feedback_text = feedback_response.text
            print(feedback_text)

            # --- Step 3: Check for Convergence ---
            if "The draft is perfect" in feedback_text:
                print("--- Critic approved. Stopping refinement loop. ---")
                break  # Exit the for loop early

            # --- Step 4: Run the Refiner ---
            print("--- Refiner is working... ---")
            refiner_prompt = REFINER_PROMPT.format(
                draft=current_draft, feedback=feedback_text
            )

            refiner_response = llm.complete(refiner_prompt)

            # --- Step 5: Update Draft for Next Loop ---
            current_draft = (
                refiner_response.text
            )  # The refined answer becomes the new draft

            print(f"--- Intermediate Refined Draft (Cycle {i + 1}) ---")
            print(textwrap.fill(current_draft, width=80))

        # --- END OF MODIFIED LOOP ---

        # The loop is finished, 'current_draft' holds the final answer
        final_answer = current_draft

        # Print the final, refined answer
        print("\n--- Final Refined Answer (After All Cycles) ---")
        print(textwrap.fill(final_answer, width=80))

        # --- Stop tracking and get emissions ---
        end_time = time.time()
        emissions_data = tracker.stop_task()

        print(f"\n\n--- Query Metrics (Full Loop) ---")
        print(f"Latency: {end_time - start_time:.2f} seconds")
        print(f"Emissions: {emissions_data.emissions * 1000:.6f} gCO2eq")
        print(f"Energy: {emissions_data.energy_consumed * 1000:.6f} Wh")
        print("-" * 50)

finally:
    # This stops the main tracker
    total_emissions_kg = tracker.stop()
    print("\n\n--- Total Emissions Summary (Session) ---")
    if tracker.emissions_data:
        print(
            f"Total Energy Consumed: {tracker.emissions_data.energy_consumed * 1000:.4f} Wh"
        )
    print(f"Total CO2 Emitted: {total_emissions_kg * 1000:.4f} gCO2eq")
    print("Full report saved to 'emissions.csv'")


Initializing CodeCarbon tracker for country: EGY

--- Query Engine Ready (Recursive Editing: 3 cycles) ---
Type 'exit' to quit.


llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  1925 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /  1023 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   10866.38 ms /  2948 tokens
Llama.generate: 1 prefix-match hit, remaining 3044 prompt tokens to eval




Your Question:  Imagine being one of the people in Mission Control. How would you feel while watching the landing?

--- Initial Draft (from RAG) ---
 I would feel a mix of excitement, relief, and sadness. I would be excited to
see the spacecraft touch down safely, but I would also be sad to see the mission
come to an end. The thought of the people who worked so hard to make this
mission a success would be a constant source of pride and motivation.  ---
Context --- page_label: 1 file_path: D:\Mistraal7B\data\Apolllo.pdf  As  the
descent  began,  Armstrong  and  Aldrin  found  themselve  passing  landmarks
on  the   surface   two   or   three   seconds   early,   and   reported   that
they   weren't     “long”;   they   would   land   mile   south   of   their
target   point.   Eagle   was   traveling   too   many   of   the   Moons’
crater   Moons   ,     and   only   ,         and   only the       Moons   Moons
LagM   Landing    900      90        900  90  90  90Mo’s  9   900.   90 i

llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  3044 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   858 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    9046.62 ms /  3902 tokens
Llama.generate: 7 prefix-match hit, remaining 2061 prompt tokens to eval


cr.
cr
cr
cr
Cr
cr
cr
cr
cr
cr
cr
cr:



cr


cr

cr
cr
cr
cr.
cr
cr
crcrcrcr
cr
cr
cr
cr
cr
cr
cr
cr
cr.
dra.
cr.
cr.
cr.
cr,the,dra,dradra,the,
the, crs
the,
the
the
the,thedra
the
theecr
the crs
thedspes
the:
thee, anddra.
cr crs.crdradradradradracr?the?cr:cr:cr:the:thee-the crscrcrgsgs
the
the-the
cr.

the.
the the the
the, thescr.the,the:the.cr.cr.crscrsedcr.cr.
cr.
cr.cr.cr.
the
the
the
the
cr,the,
the
the crscr.cr.the
the
cr.cr.the
cr.
cr.
cr
cr
cr
the
cr
cr
cr
cr
cr
cr.cr.cr.cr.cr.cr.cr.cr.cr.cr.cr.cr.cr.cral.cralalideal
cr.
cride
the
the
the
the
crdraws


the
the
cr
cr.
cr
cr
cr.
cr
cr
cr
cr
cr

cr

cr



cr
cr
cr
cr
cr

cr





























































































































































the






theal’theu















theo

















the


























to






















the


the

the












































































llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  2061 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /  1023 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    9688.05 ms /  3084 tokens
Llama.generate: 7 prefix-match hit, remaining 2989 prompt tokens to eval


--- Intermediate Refined Draft (Cycle 1) ---
 The original draft was a high-quality piece of writing that provided a detailed
explanation of the mission's objectives, the team's roles, and the critical
components of the mission. The draft was well-structured, with a clear
introduction, a well-written body, and a strong conclusion. The use of vivid and
descriptive language helped to create a vivid picture of the mission's
environment, and the team's interactions with the spacecraft. The draft also
included a clear and concise explanation of the mission's objectives, the team's
roles, and the critical components. The draft also included a well-structured
introduction, a well-written body, and a strong conclusion, which helped to
provide a well-rational and well-structured. The drafted.  The draft. The draft.
The space. The draft.  The draft.  The draft, the draft, the draft. The drafted.
The draft. The team. The team. The team. The team. The The team. The team. The
team. The team. The te

llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  2989 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   907 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    9679.70 ms /  3896 tokens
Llama.generate: 7 prefix-match hit, remaining 2058 prompt tokens to eval


Cr.Cr.Cr.Cr.
cr.
cr.
cr.Cr.
crsedcr
cr
crcrsedcrcrcr.crcrcrcr.cr.cr:cr.cr.
cr.
cr.crcrcr
cr
cr
cr
cr
cr
cr
cr
cr
cr:
Cr

cr



cr
cr
cr
cr
cr.
cr
cr
crcrcr
cr
cr
cr
cr
cr
cr
crs
cr
dra.
dra.
cr.
cr.
cr.cr,crdra,cr,dradra,the,
cr,
cr,
the,the the
the,thedraging
the
thecrcr.
the crsthedthe.
the the the anddrageddra cr crs.crdradradradradradra?the?cr:dra:cr:the:thee-the the crscrgsgs
theferging
crcible
cr.
the.
the the the
the,thespes.the:the:cr.cr.cr.crscrsedcr.crs
cr:
cr.cr.

cr
the

cr.
cr,cr,the
the
the crsthe crscr.cr
cr.cr.cr.
cr.
cr.
cr
cr
cr
cr
cr
cr
cr
cr
cr
cr.cr.cr.cr.
cr.cr.cr.cr.cr.cr.cr.cr.cral.cralalaldraws
cr.
crdradra
the
cr
the
craldraws

the
the
cr
cr.
cr
cr.
cr.
cr
cr
cr
cr
cr
cr






cr
cr
cr
cr
cr
cr

cr


























































































































































cr
the






theal





































the


























to






















t

llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  2058 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /  1023 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    9800.57 ms /  3081 tokens
Llama.generate: 7 prefix-match hit, remaining 2987 prompt tokens to eval


--- Intermediate Refined Draft (Cycle 2) ---
 The original draft was a high-quality piece of writing that provided a detailed
explanation of the mission's objective, the team's roles, and the critical
components of the mission. The draft was well-structured, with a clear
introduction, a well-written body, and a strong conclusion. The use of vivid and
descriptive languaire is a vital component of the mission's environment,
mission's objective, the team's interaction with the spacecraft. The drafted.
The draft.  The space. The draft.  The draft. The team's roles, the clear and
concise, with a well-structured, with a clear introduction, a well-structed, a
well-structure, a well-struct, the drafted. The team's roles, with a well-
structs, the drafted, the team's roles, the team's introduction, the team's
conc's conc'drafted, the team's conc'd's, the team's conc'draft's conc's conc's
conc'd'd's the conc'd'd'd'd, the conc'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'draft'
d'd'd'd'd'd'draft'd'd'd'd'd

llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  2987 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   909 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   10181.23 ms /  3896 tokens
Llama.generate: 7 prefix-match hit, remaining 2062 prompt tokens to eval


Cr.Cr.Cr.Cr.Cr.cr.
cr.
cr.Cr.
crsedcrsedcrsedcr.Crsedcrcrcr.cr.crcr.cr.cr.cr.cr.
cr.
cr.cr
cr
cr
cr
crcrcr
crcrcr
cr
cr
cr:


cr



cr
cr
cr
cr
cr.
cr
cr
crcrcrcr
cr
cr
cr
cr
cr
cr
crs
cr
cr,
dra.
cr.
cr.
cr.cr,crdra,cr,dradradra,
the,
dra,
the,
the
the,the,draging
the
thecrcr
the crs
thedthe.
the 
the anddrageddcr
the crsdradradradradradradra?the?cr:dra:the:the:thee-the the thedrags
the
theferging
cr.
cr.
the
the the the
the,the,the,the:the:cr.cr.cr.crscrsedcr.cr
cr.
cr.cr.cr.
cr
the

the
cr.cr,the:
the
the crscrcr.cr.the
the
cr.cr.
cr.
cr.
cr
cr
cr
cr
cr
cr
cr
cr
cr.cr.cr.cr.cr.
cr.cr.cr.cr.cr.cr.cr.cr.cralal,thealidealdraw
cr.
cr
the
Cr.
the
the
cr
the

the
the
Cr
cr.
cr
cr
cr.
cr
cr
cr

cr
cr

cr





cr
cr
cr
cr
cr
cr
cr




























































































































































the al’



the,



















theo
















the



























to
to

















llama_perf_context_print:        load time =    1173.22 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  2062 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /  1023 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   10586.15 ms /  3085 tokens


--- Intermediate Refined Draft (Cycle 3) ---
 The original draft was a high-quality piece of writing that provided a detailed
explanation of the mission's objective, the team's roles, and the critical
components of the mission. The draft was well-structured, with a clear
introduction, a well-written body, and a strong conclusion. The use of vivid and
descriptive languaire is a vital component of the mission's environment,
mission's objective, the team's interaction with the spacecraft. The drafted.
The space. The draft.  The team's roles, the clear and conc's conc's conc's
conc's conc's conc's conc's conc's's's's conc's conc's's
conc's'd's's's's'd's's's's's's'd's'd's'd's'd's's'd's's'd's'd'd's'd's'd's's of's
ofs's ofd'd'd'd's'd'd's ofdraft'd'd'd'd'd's ofd'd'd'd's'd's's'draft's's'd'd'd'dr
aft's'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd'd's'd'd'd'd'd'd's's'd's'd'd'd'd'd'd'
s'd'd'd'd'd'd'd'd'd'd'd'd'd's'd's'd'd'd'd'd'dding'ddingding's'‘d'd’‘‘‘‘toddingd,
d’‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘‘

AttributeError: 'OfflineEmissionsTracker' object has no attribute 'emissions_data'