<a href="https://colab.research.google.com/github/Rashi-Dwivedi1812/Rag-Query-Engine/blob/main/integrating_open_ai_api_key.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [34]:
!pip install -q langchain langchain-community langchain-openai sentence-transformers faiss-cpu PyMuPDF groq langchain-groq python-dotenv




## Creating the required directories

In [35]:
import os

# Define the source directory for your raw files
source_directory = "raw_data"

# Define the destination directory for the converted text files
destination_directory = "processed_data"

# Create the directories
os.makedirs(source_directory, exist_ok=True)
os.makedirs(destination_directory, exist_ok=True)

print(f"Directory '{source_directory}' created.")
print(f"Please upload your PDF and JSON files to the '{source_directory}' folder in the file browser on the left.")

Directory 'raw_data' created.
Please upload your PDF and JSON files to the 'raw_data' folder in the file browser on the left.


## Converting Raw Data into Text format

In [45]:
import os
import json
import fitz  # PyMuPDF

# --- Function to Convert PDFs to Text ---
def convert_pdfs_to_txt(source_dir: str, dest_dir: str):
    """Converts all PDF files in a source directory to text files."""
    print("--- Converting PDF files to text... ---")
    if not os.path.exists(source_dir):
        print(f"Source directory '{source_dir}' not found.")
        return

    for filename in os.listdir(source_dir):
        if filename.lower().endswith(".pdf"):
            pdf_path = os.path.join(source_dir, filename)
            txt_filename = os.path.splitext(filename)[0] + ".txt"
            txt_path = os.path.join(dest_dir, txt_filename)

            try:
                # The context manager closes the PDF even if extraction fails.
                with fitz.open(pdf_path) as doc:
                    text_content = "".join(page.get_text() for page in doc)

                with open(txt_path, "w", encoding="utf-8") as f:
                    f.write(text_content)

                print(f"Successfully converted '{filename}' to '{txt_filename}'.")
            except Exception as e:
                print(f"Error converting '{filename}': {e}")


# --- Function to Convert JSON to Text ---
def convert_json_to_txt(source_dir: str, dest_dir: str):
    """Converts each JSON file of Q&A pairs to a text file with the same base name."""
    print("\n--- Converting JSON files to text... ---")
    if not os.path.exists(source_dir):
        print(f"Source directory '{source_dir}' not found.")
        return

    for filename in os.listdir(source_dir):
        if filename.lower().endswith(".json"):
            json_path = os.path.join(source_dir, filename)
            txt_filename = os.path.splitext(filename)[0] + ".txt"
            txt_path = os.path.join(dest_dir, txt_filename)

            try:
                with open(json_path, "r", encoding="utf-8") as f:
                    data = json.load(f)

                text_content = ""
                # Check if data is a list of dictionaries (Q&A format)
                if isinstance(data, list):
                    for entry in data:
                        # Skip entries that are not dicts or lack either key.
                        if isinstance(entry, dict) and "question" in entry and "answer" in entry:
                            text_content += f"Question: {entry['question']}\nAnswer: {entry['answer']}\n\n"
                else:
                    print(f"Warning: JSON file '{filename}' is not in the expected Q&A list format. Skipping.")
                    continue


                with open(txt_path, "w", encoding="utf-8") as f:
                    f.write(text_content)

                print(f"Successfully converted '{filename}' to '{txt_filename}'.")
            except Exception as e:
                print(f"Error converting '{filename}': {e}")


# --- Main Execution Block ---

# Define directory paths (same values as in the previous cell)
source_data_dir = "raw_data"
dest_data_dir = "processed_data"

# Run the conversion functions
convert_pdfs_to_txt(source_data_dir, dest_data_dir)
convert_json_to_txt(source_data_dir, dest_data_dir)

print(f"\nConversion complete! All processed documents are now in the '{dest_data_dir}' directory.")
print("You can now use these text files for your next processing steps.")

--- Converting PDF files to text... ---
Successfully converted 'mql5book.pdf' to 'mql5book.txt'.

--- Converting JSON files to text... ---

Conversion complete! All processed documents are now in the 'processed_data' directory.
You can now use these text files for your next processing steps.
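
No JSON files were present in this run, so the JSON converter produced no output. For reference, here is a minimal sketch of the input shape `convert_json_to_txt` checks for (a list of objects with `question` and `answer` keys) and the text it renders. The sample entries and the helper name `render_qa` are hypothetical and only mirror the formatting used in the cell above:

```python
import json

# Hypothetical sample in the Q&A list format the converter checks for.
sample_json = json.dumps([
    {"question": "What is an Expert Advisor?", "answer": "An automated trading program."},
    {"note": "entries without both keys are skipped"},
])

def render_qa(data):
    """Mirror the converter's formatting for an already-parsed Q&A list."""
    parts = []
    for entry in data:
        if isinstance(entry, dict) and "question" in entry and "answer" in entry:
            parts.append(f"Question: {entry['question']}\nAnswer: {entry['answer']}\n\n")
    return "".join(parts)

rendered = render_qa(json.loads(sample_json))
print(rendered)
```

A JSON file whose top level is an object rather than a list would be skipped with a warning by the converter.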


## Storing data in Vector DB
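
Before splitting, the indexing script in this section pulls fenced code out of the prose so the two can be tagged separately. Here is a minimal sketch of that step in isolation, run on a hypothetical sample string. The fence marker is built with string multiplication only to keep the example easy to read; the pattern is otherwise the same as in the script:

```python
import re

# Three backticks, built programmatically; equivalent to the literal in the script.
FENCE = "`" * 3
CODE_PATTERN = re.compile(re.escape(FENCE) + r"(.*?)" + re.escape(FENCE), re.DOTALL)

def separate_code_and_text(content):
    """Return (prose, [code blocks]) for text that uses markdown-style fences."""
    code_blocks = [m.group(1).strip() for m in CODE_PATTERN.finditer(content)]
    text_content = CODE_PATTERN.sub("", content).strip()
    return text_content, code_blocks

# Hypothetical sample: one paragraph, one fenced block, one closing line.
sample = f"Intro paragraph.\n{FENCE}\nint x = 1;\n{FENCE}\nClosing remark."
text, code = separate_code_and_text(sample)
print(repr(text))
print(code)
```

Text extracted from a PDF rarely contains such fences, in which case the whole file ends up as a single `text` block. Note also that a language tag placed after the opening fence would be captured as the first line of the code block.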

In [47]:
# FILE: create_faiss_index.py
# DESCRIPTION: Splits the processed text with RecursiveCharacterTextSplitter, embeds the chunks with OpenAI, and saves a FAISS index.

import os
import re
# RecursiveCharacterTextSplitter tries paragraph, line, and word boundaries before falling back to single characters.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.docstore.document import Document
from dotenv import load_dotenv

# Define paths
DOCS_PATH = "processed_data"
FAISS_INDEX_PATH = "data/faiss_index"

def setup_api_key():
    """Sets up the OpenAI API key from a local .env file."""
    load_dotenv()
    if not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY not found in .env file.")

def separate_code_and_text(content: str):
    """
    Separates text content into natural language and code blocks
    using markdown-style ``` fences.
    """
    code_pattern = re.compile(r"```(.*?)```", re.DOTALL)
    code_blocks = [match.group(1).strip() for match in code_pattern.finditer(content)]
    text_content = code_pattern.sub("", content).strip()
    return text_content, code_blocks

def create_faiss_index():
    """
    Processes documents, creates embeddings with OpenAI, and saves a FAISS index.
    """
    print("Starting the FAISS index creation process...")
    setup_api_key()

    all_docs = []
    if not os.path.exists(DOCS_PATH):
        print(f"Error: Directory '{DOCS_PATH}' not found.")
        return

    # 1. Load each file and process its content
    for filename in os.listdir(DOCS_PATH):
        if filename.endswith(".txt"):
            file_path = os.path.join(DOCS_PATH, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            text, code = separate_code_and_text(content)

            if text:
                all_docs.append(Document(page_content=text, metadata={"source": filename, "type": "text"}))
            for i, code_block in enumerate(code):
                if code_block:
                    code_source = f"{filename}_code_{i+1}"
                    all_docs.append(Document(page_content=code_block, metadata={"source": code_source, "type": "code"}))

    if not all_docs:
        print(f"No processable content found in '{DOCS_PATH}'.")
        return

    print(f"Loaded and processed {len(all_docs)} content blocks. Splitting...")

    # 2. Split documents into smaller chunks
    # RecursiveCharacterTextSplitter yields chunks of at most 1000 characters,
    # with up to 100 characters of overlap between consecutive chunks.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    split_docs = text_splitter.split_documents(all_docs)

    print(f"Split into {len(split_docs)} chunks. Generating embeddings with OpenAI...")

    # 3. Create embeddings; chunk_size=100 caps each embedding request at 100 texts
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=100)

    # 4. Create the FAISS index from documents
    db = FAISS.from_documents(split_docs, embeddings)
    db.save_local(FAISS_INDEX_PATH)

    print("FAISS index created successfully!")
    print(f"Index saved at: {FAISS_INDEX_PATH}")

if __name__ == "__main__":
    os.makedirs(os.path.dirname(FAISS_INDEX_PATH), exist_ok=True)
    create_faiss_index()

Starting the FAISS index creation process...
Loaded and processed 1 content blocks. Splitting...
Split into 4772 chunks. Generating embeddings with OpenAI...
FAISS index created successfully!
Index saved at: data/faiss_index
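
To make the chunk count above easier to interpret, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers paragraph, line, and word boundaries). The function `chunk_with_overlap` is a hypothetical helper that only illustrates how `chunk_size` and `chunk_overlap` interact:

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

# Hypothetical input: 2500 characters of repeating digits.
sample_text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample_text)
print([len(c) for c in chunks])
```

Each chunk after the first starts with the last 100 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.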


In [38]:
!pip install langchain-openai



## RAG model for answering users' queries
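
The script in this section joins the retrieved chunks with a `---` separator and substitutes them, together with the user's question, into a prompt template. Here is a minimal sketch of that assembly step. It uses plain `str.format` in place of LangChain's `PromptTemplate`, a shortened stand-in template, and hypothetical chunk strings:

```python
# Shortened stand-in for the full prompt used in the script.
TEMPLATE = "CONTEXT:\n{context}\n\nQUESTION:\n{question}\n"
SEPARATOR = "\n\n---\n\n"

def build_prompt(chunks, question):
    """Join retrieved chunk texts and substitute them into the template."""
    context_text = SEPARATOR.join(chunks)
    return TEMPLATE.format(context=context_text, question=question)

# Hypothetical retrieved chunks (two here for brevity; the script retrieves three).
chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_prompt(chunks, "What does the book say about trailing stops?")
print(prompt)
```

The separator gives the model a visible boundary between chunks, which come from different places in the source and may not read as continuous text.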

In [68]:
# FILE: final_rag_app.py
# Answers from the retrieved documents first and falls back to the model's general knowledge, using a single OpenAI API key.

import os
import traceback
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from dotenv import load_dotenv

# --- CONFIGURATION ---
CONFIG = {
    "FAISS_INDEX_PATH": "data/faiss_index",
    "EMBEDDING_MODEL_NAME": "text-embedding-3-small",
    "LLM_MODEL_NAME": "gpt-4o",
    "SEARCH_K": 3
}

# --- API KEY SETUP ---
def setup_api_key():
    """Sets up the OpenAI API key from a local .env file."""
    load_dotenv()
    if not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY not found.")
    print("--- API Key is set. ---")

# --- RAG SETUP ---
def setup_rag_components():
    """Initializes the main components for the RAG system."""
    print("--- Setting up RAG Components ---")
    llm = ChatOpenAI(model_name=CONFIG['LLM_MODEL_NAME'], temperature=0.1)
    embeddings = OpenAIEmbeddings(model=CONFIG['EMBEDDING_MODEL_NAME'])

    if not os.path.exists(CONFIG['FAISS_INDEX_PATH']):
        raise FileNotFoundError(f"FAISS index not found at '{CONFIG['FAISS_INDEX_PATH']}'. Please run the index creation script first.")

    print(f"Loading FAISS index from: {CONFIG['FAISS_INDEX_PATH']}...")
    db = FAISS.load_local(
        CONFIG['FAISS_INDEX_PATH'], embeddings, allow_dangerous_deserialization=True
    )
    retriever = db.as_retriever(search_kwargs={"k": CONFIG['SEARCH_K']})
    print("--- RAG Components setup complete. ---")
    return llm, retriever

# --- MAIN LOOP ---
def main():
    try:
        setup_api_key()
        llm, retriever = setup_rag_components()

        # --- PROMPT TEMPLATE ---
        # One prompt encodes the fallback logic: prefer the context, supplement it, or ignore it if irrelevant.
        final_template = """
You are an expert technical assistant. Answer the user's QUESTION using the provided CONTEXT.
The context is retrieved from a user's local documents and may contain code snippets, parts of a book, or other text.

Your instructions are:
1. First, carefully analyze the CONTEXT. If the context contains information that directly answers the QUESTION, synthesize a comprehensive answer based **primarily on the CONTEXT**.
2. If the CONTEXT is relevant but not fully sufficient, use the information from the CONTEXT and supplement it with your own general knowledge to provide a complete answer.
3. If the CONTEXT seems completely irrelevant to the QUESTION, then ignore the context and answer the question using only your own general knowledge.
4. **Do not** mention the context in your answer. For example, do not say "Based on the context provided...". Just answer the question directly and naturally.

CONTEXT:
{context}

QUESTION:
{question}
"""
        final_prompt = PromptTemplate.from_template(final_template)

        while True:
            print("\n" + "=" * 50)
            user_query = input("Ask a question (or type 'exit' to quit): ")
            if user_query.strip().lower() in ["exit", "quit"]:
                print("Exiting...")
                break
            if not user_query.strip():
                continue

            print("Retrieving documents...")
            retrieved_docs = retriever.invoke(user_query)
            context_text = "\n\n---\n\n".join([doc.page_content for doc in retrieved_docs])

            print("Generating answer...")

            final_chain = final_prompt | llm | StrOutputParser()
            response = final_chain.invoke({
                "context": context_text,
                "question": user_query
            })

            print("\n--- Answer ---")
            print(response)

    except Exception as e:
        print("\nAn unexpected error occurred:")
        print(f"{type(e).__name__}: {e}")
        traceback.print_exc()

if __name__ == "__main__":
    main()

--- API Key is set. ---
--- Setting up RAG Components ---
Loading FAISS index from: data/faiss_index...
--- RAG Components setup complete. ---

Ask a question (or type 'exit' to quit): What does mql5book.txt say about the BandOsMA.mq5 Expert Advisor?
Retrieving documents...
Generating answer...

--- Answer ---
The document describes the BandOsMA.mq5 Expert Advisor as a trading system that uses the OsMA histogram and Bollinger bands for decision-making. It mentions that the Expert Advisor opens and closes trades based on the crossing of the OsMA histogram with the Bollinger bands. The general settings for the Expert Advisor include a magic number, a fixed lot size, and a stop loss distance, with no take profit used. The stop loss is managed using a trailing stop. The document also discusses the creation of an OnTester handler to evaluate the performance of the Expert Advisor using metrics like profit, profitability, the number of trades, and the Sharpe ratio. Additionally, it mentions t