<a href="https://colab.research.google.com/github/Colsai/AI_RAG_Modeling/blob/main/RAG_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI-Based Basic RAG
A simple retrieval-augmented generation (RAG) setup that uses langchain_community and a web-based data pull to answer basic questions about documents and reports at a federal agency (here, the HHS OIG work plan table).

Sources:
https://medium.com/@drjulija/how-i-built-a-basic-rag-for-pdf-qa-in-a-few-lines-of-python-code-9849c32e59f0
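
The retrieve-then-generate flow used in this notebook can be sketched without any API calls. The block below is only an illustration under stated assumptions: it swaps OpenAI embeddings and ChromaDB for a plain bag-of-words cosine similarity, and the names `tokenize`, `cosine`, `retrieve`, `build_prompt`, and `toy_chunks` are hypothetical helpers with invented placeholder text, not real work plan items.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (stand-in for a vector store)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the question and the retrieved chunks into a single prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Question: {question}\n\nContext:\n{context}"


# Invented placeholder chunks, for illustration only.
toy_chunks = [
    "Audit of COVID-19 relief funding distributed to hospitals.",
    "Review of nursing home staffing levels.",
    "Evaluation of telehealth services during the COVID-19 pandemic.",
]

top = retrieve("COVID-19 audits", toy_chunks, k=2)
prompt = build_prompt("COVID-19 audits", top)
print(prompt)
```

In the notebook itself, the retrieval step is handled by ChromaDB over OpenAI embeddings, and the assembled prompt is sent to the chat completions API.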



In [1]:
# Install dependencies
!pip install langchain_community langchain openai chromadb tiktoken



In [18]:
import os
import pandas as pd
from google.colab import auth
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from openai import OpenAI
from google.colab import userdata
from langchain.schema import Document

# Authenticate user in Google Colab
auth.authenticate_user()

# Retrieve the OpenAI API key from Colab secrets (stored under the name 'openai_key')
openai_api_key = userdata.get('openai_key')

# Model Selection
model_select = "gpt-4o-mini-2024-07-18"  # Change to "gpt-3.5-turbo" if needed

# OpenAI API Client Initialization
client = OpenAI(api_key=openai_api_key)

# Data Load Function
def return_oig_work_plans(site: str = "https://oig.hhs.gov/reports-and-publications/workplan/active-item-table.asp") -> pd.DataFrame:
    """
    Fetch OIG Work Plans from the given website.
    """
    try:
        temp_df = pd.read_html(site)[0].iloc[:-1]  # Exclude last row if necessary
    except Exception as e:
        # Chain the original exception so the underlying traceback is preserved
        raise RuntimeError(f"Error fetching OIG work plans: {e}") from e
    return temp_df

# Response Functions
def user_query_similarity_search(query: str = '') -> str:
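    """
    Run a similarity search against the global Chroma store `db_chroma`
    (created in the main block) and return the top 5 chunks joined by blank lines.
    """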
    docs_chroma = db_chroma.similarity_search_with_score(query, k=5)
    return "\n\n".join([doc.page_content for doc, _ in docs_chroma])

def generate_response(user_query: str, temperature: float = 0.1, max_tokens: int = 2500) -> str:
    """
    Generate a response using the OpenAI API.
    """
    response = client.chat.completions.create(
        model=model_select,  # Uses the model chosen in model_select
        messages=[
            {"role": "system", "content": "Take the role of a federal expert at HHS OIG. Provide insight into a specific question to the public."},
            {"role": "user", "content": user_query}
        ],
        temperature=temperature,  # Honor the function argument instead of a hard-coded 0
        top_p=1,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

def generate_rag_response(question: str = 'Tell me about HHS OIG work occurring on COVID-19.') -> str:
    """
    Generate a Retrieval-Augmented Generation (RAG) response using the context.
    """
    context = user_query_similarity_search(query=question)
    response_template = f"""
    Please use your expertise to answer this question: {question}.
    To answer in more detail, use the following context: {context}.
    Provide a detailed answer, use bullet points when applicable, and use quotations and sources from the context where appropriate.
    Do not make assumptions or guesses about current work.
    """

    # The f-string above is already fully formatted, so send it to the model directly.
    return generate_response(response_template)

def ask_question():
    # Get User Query
    user_question = input("Please ask a question: ")

    # Generate RAG-based Response
    resp = generate_rag_response(user_question)

    return resp

# Run Main
if __name__ == '__main__':
    # Load OpenAI Embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

    # Fetch OIG summaries
    oig_summaries = return_oig_work_plans()['Summary'].tolist()

    # Split Text into Chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)

    # Ensure only valid summaries are processed (exclude NaN/float values)
    filtered_summaries = [summary for summary in oig_summaries if isinstance(summary, str)]

    # Process only valid text summaries
    chunks = [chunk for summary in filtered_summaries for chunk in text_splitter.split_text(summary)]

    # Convert Chunks into Document Objects
    documents = [Document(page_content=chunk) for chunk in chunks]

    # Initialize ChromaDB with Embedded Documents
    persist_dir = os.path.join(os.getcwd(), "chroma_db")
    db_chroma = Chroma.from_documents(documents, embeddings, persist_directory=persist_dir)

    # Ask a question
    print(ask_question())

Please ask a question: Tell me about COVID-19 audits by HHS OIG. Include Details.
As a federal expert at the HHS Office of Inspector General (OIG), I can provide insight into the COVID-19 audits being conducted by our office, particularly in relation to the redirection of PEPFAR (President's Emergency Plan for AIDS Relief) funds for COVID-19 response efforts. Here are the key details:

### Overview of COVID-19 Audits by HHS OIG

- **Collaborative Audit**: The HHS OIG is conducting a collaborative audit with the U.S. Agency for International Development's (USAID) OIG. This partnership aims to ensure a comprehensive review of how PEPFAR funds have been utilized in response to the COVID-19 pandemic.
  
- **Focus on PEPFAR Funds**: The audit specifically examines the "redirection of PEPFAR funds for COVID-19." This involves assessing whether funds originally allocated for HIV/AIDS programs were appropriately redirected to support COVID-19 response efforts.

- **Separate Reports**: Each OIG