## **Leveraging LangChain, OpenAI, and RAG for Enhanced, Context-Specific AI Understanding**

WIP Version: 1 <br>
Last Updated:  2/21/2025 <br>
This workbook is the companion code to this <a href="https://decisionsciences.blog/2025/02/25/harnessing-llms-for-real-time-domain-specific-decision-support/" target="_blank">blog post</a> on decisionsciences.blog.<br>

***

## Import Libraries: ##

In [3]:
import json
import pandas as pd
import numpy as np
import seaborn as sns
import os
import sqlite3
import openai
from sqlalchemy import create_engine
from langchain.chat_models import ChatOpenAI  # Correct import for chat models
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.agents import create_openai_functions_agent
from langchain.tools import Tool
from langchain.prompts import PromptTemplate
from langchain.document_loaders import WebBaseLoader
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize
import requests
import warnings

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [4]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Robb\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Set View Options: ##

In [6]:
# turn off warnings
import warnings; warnings.simplefilter('ignore')
#set some options for the output
get_ipython().run_line_magic('matplotlib', 'inline')  # .magic() is deprecated in recent IPython releases
pd.set_option('display.notebook_repr_html', True) 
pd.set_option('display.max_columns', 45) 
pd.set_option('display.max_rows', 40) 
pd.set_option('display.width', 180)
pd.set_option('display.max_colwidth', 240)  # or 199
sns.set_style("whitegrid", {'axes.grid' : False})

## Introduction: ##


In this project, I built a custom system for answering complex queries about **decision sciences**. It integrates web scraping, document indexing, retrieval-augmented generation (RAG), and a large language model (LLM) such as OpenAI's GPT-4. By combining my own blog content with the general capabilities of an LLM, the system generates detailed, domain-specific answers.

#### Putting It All Together: How These Components Interact
When you combine these elements, the information flows as follows:

*`User Query → Retriever (RAG) → Vector DB (blog content) → Relevant Chunks → LLM (GPT-4) → Final Response`*

- The retriever ensures the model gets the most relevant information from your custom data source.
- The LLM uses this retrieved context plus its own general knowledge to generate the final response.
- LangChain simplifies orchestrating these steps.
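
The flow above can be sketched in plain Python without calling LangChain or OpenAI. This is a toy illustration only: `retrieve` and `build_prompt` are hypothetical helper names, the word-overlap ranking is a crude stand-in for vector search, and the sample chunks are made up for the example.

```python
# Toy stand-ins for the retrieval and prompt-building stages (no LangChain or OpenAI calls).
def retrieve(query, chunks, k=2):
    """Rank chunks by how many query words they share (a crude proxy for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Combine the retrieved chunks and the user query into one prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"


# Illustrative chunks, not actual blog text
blog_chunks = [
    "Impact paths connect insights to measurable outcomes.",
    "Saturation curves describe diminishing returns on spend.",
    "Monte Carlo simulations estimate a range of outcomes.",
]

query = "What are impact paths?"
top_chunks = retrieve(query, blog_chunks, k=1)
prompt = build_prompt(query, top_chunks)  # This string is what would be sent to the LLM
print(prompt)
```

In the real pipeline, the ranking step is handled by the vector store and the final string is passed to the chat model.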


## Steps Taken to Build the System:

#### **Step 1: Document Loading**

I used **LangChain's `WebBaseLoader`** to collect and load blog posts from specific URLs on my website, focusing on decision sciences topics.

```python
urls = ["https://www.decisionsciences.blog",
        "https://decisionsciences.blog/2025/02/10/solving-the-last-mile-problem-in-analytics-introducing-the-decision-science-pre-analysis-framework/",
        ...]  # remaining post URLs elided
loader = WebBaseLoader(urls)
documents = loader.load()
```

---

#### **Step 2: Document Splitting**

After loading the documents, I split them into smaller, manageable chunks using **LangChain's `RecursiveCharacterTextSplitter`**. This step is crucial for efficiently processing large documents and making them suitable for embedding and retrieval.

```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
```

- `chunk_size=1000`: The maximum size of each chunk, measured in characters by default.
- `chunk_overlap=100`: The number of characters (not tokens) shared between consecutive chunks, which preserves context across splits.
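
To make the overlap concrete, here is a minimal sketch of fixed-width character chunking. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` helper below just slides a window over the text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width character chunks; each chunk starts with the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

The last three characters of each chunk reappear at the start of the next, so a sentence cut at a boundary is still partly visible in both chunks.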

---

#### **Step 3: Embedding Creation**

To make the text searchable by meaning, I used **OpenAI's `text-embedding-ada-002`** model to generate **embeddings**. Each embedding represents a chunk as a vector in a high-dimensional space, and semantic search works by comparing those vectors.

```python
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_db = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")
vector_db.persist()
```

- **OpenAIEmbeddings**: Converts documents into vector embeddings using OpenAI's API.
- **Chroma**: This vector store handles the storage of embeddings and ensures they are indexed for efficient retrieval.
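
As a rough illustration of how vectors support semantic search, the sketch below ranks two chunks against a query by cosine similarity. The three-dimensional vectors and chunk labels are made up for the example (real `text-embedding-ada-002` vectors have 1,536 dimensions), and cosine similarity is only one common measure; the distance metric a given vector store actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embedded chunks
chunk_vectors = {
    "monte carlo post": [0.9, 0.1, 0.0],
    "decile tables post": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]  # Made-up embedding of the user query

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)  # monte carlo post
```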

---

#### **Step 4: Setting Up Retrieval Chain**

I then used **LangChain's `RetrievalQA`** to set up a chain that queries the vector database. For each question, the chain retrieves the most relevant document chunks and passes them to the model, so answers are grounded in the indexed content.

```python
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=vector_db.as_retriever()
)
```

- **`RetrievalQA`**: This allows the query to interact with both the language model and the vector store, retrieving the most relevant document chunks.
- **`ChatOpenAI`**: A wrapper for OpenAI’s chat models, providing a conversational approach to querying and answering.
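
The `chain_type="stuff"` setting places all retrieved chunks into a single prompt. The sketch below contrasts that with a map-style alternative that builds one prompt per chunk. Both helper functions and the prompt wording are hypothetical placeholders, not LangChain's internal implementation.

```python
def stuff_prompts(chunks, question):
    """'stuff' style: a single prompt containing every retrieved chunk."""
    context = "\n\n".join(chunks)
    return [f"Context:\n{context}\n\nQuestion: {question}"]


def per_chunk_prompts(chunks, question):
    """Map-style alternative: one prompt per chunk, to be combined in a later step."""
    return [f"Context:\n{chunk}\n\nQuestion: {question}" for chunk in chunks]


chunks = ["chunk A text", "chunk B text", "chunk C text"]  # Placeholder chunk contents
question = "How does Monte Carlo simulation apply to decision sciences?"

print(len(stuff_prompts(chunks, question)))      # 1 prompt
print(len(per_chunk_prompts(chunks, question)))  # 3 prompts, one per chunk
```

Stuffing keeps the pipeline to one model call, at the cost of a prompt that grows with every additional chunk retrieved.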

---

#### **Step 5: Querying the System**

Once the retrieval chain is set up, I was able to send queries to the system, which would then retrieve relevant documents and use the language model to generate a response.

```python
query = "How does Monte Carlo simulation apply to decision sciences?"
result = qa_chain.run(query)
```

This allows me to query the vector database and retrieve context-specific responses, which help answer questions related to decision sciences.

---

#### **Step 6: Handling Missing Information**

In some cases, when a query relates to information not found in the loaded documents, the system may respond that no information is available. This behavior reduces the risk of false or unsupported claims, although it does not eliminate it.

```python
query = "What are impact paths in decision sciences?"
result = qa_chain.run(query)
```

In this case, if the term "impact paths" is not found in the documents, the model might indicate that it cannot provide an answer based on the available data. This is important for maintaining accurate and trustworthy responses.
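
One way to make this behavior explicit, rather than relying on the model alone, is to decline before calling the LLM when nothing relevant was retrieved. The sketch below assumes a hypothetical list of `(chunk_text, relevance_score)` pairs where higher scores mean more relevant; the `answer_or_decline` helper, the 0.75 threshold, and the sample scores are all illustrative assumptions.

```python
NO_ANSWER = "I don't have information about that in the indexed posts."


def answer_or_decline(scored_chunks, min_score=0.75):
    """Return the usable context, or a fixed fallback when no chunk clears the threshold."""
    relevant = [text for text, score in scored_chunks if score >= min_score]
    if not relevant:
        return NO_ANSWER
    # In a real pipeline this context would be passed on to the LLM
    return "\n\n".join(relevant)


# Made-up scores for illustration
scored = [
    ("Impact paths connect insights to measurable outcomes.", 0.82),
    ("Saturation curves describe diminishing returns on spend.", 0.40),
]
print(answer_or_decline(scored))
print(answer_or_decline([("Unrelated text.", 0.10)]))
```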

---

In [11]:
# Read the OpenAI API key from the OPENAI_API_KEY environment variable (set at the system level to the project key)
openai.api_key = os.environ["OPENAI_API_KEY"]

#### Load and Process Your Blog Content (Specific URLs)
- Scrape my blog posts or load them as text files.
- This is a quick-and-dirty web scraping and text cleansing effort, but it illustrates how this step in the process works.
- It uses sentence-based ("semantic") chunking, which breaks text only at sentence boundaries so that related ideas stay together, instead of character-based chunking (e.g., chunk size = 1000), which can cut text at arbitrary points (mid-sentence or mid-paragraph).

Why is this better?
- Splits text by sentences, preserving logical flow.
- Keeps related ideas together, improving retrieval.
- Still works with overlap: the later `RecursiveCharacterTextSplitter` pass (`chunk_overlap=100`) re-splits any chunk longer than 1,000 characters, although `semantic_chunking` itself adds no overlap.
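
For comparison, here is a minimal sketch of sentence-level chunking that carries overlap itself by repeating the last sentence of each chunk at the start of the next. It is an illustration under stated assumptions, not the code used in this workbook: `chunk_sentences` is a hypothetical helper, it uses a simple regular expression instead of NLTK's `sent_tokenize`, and the sample text is made up.

```python
import re


def chunk_sentences(text, max_chars=80, overlap_sentences=1):
    """Group whole sentences into chunks, repeating the last sentence(s) in the next chunk."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Impact paths link insights to outcomes. They name an owner for each action. "
    "They define how impact is measured. Without them insights get ignored."
)
sentence_chunks = chunk_sentences(sample, max_chars=80, overlap_sentences=1)
for chunk in sentence_chunks:
    print(chunk)
```

With these settings each chunk holds two sentences, and the second sentence of one chunk opens the next. A chunk can exceed `max_chars` when the carried-over sentence plus the new one is already too long, so a production version would need to handle that case.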

In [13]:
# Ignore deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Download the sentence tokenizer if not already installed
# nltk.download('punkt')

# List of URLs to scrape
urls = [
    "https://www.decisionsciences.blog", 
    "https://decisionsciences.blog/2025/02/10/solving-the-last-mile-problem-in-analytics-introducing-the-decision-science-pre-analysis-framework/",
    "https://decisionsciences.blog/2025/01/18/the-executive-decision-series-post-4-avoiding-risks-with-monte-carlo-simulations-in-marketing/",
    "https://decisionsciences.blog/2025/02/07/the-sleep-training-of-data-driven-decision-making-how-intermittent-reinforcement-creates-bad-habits/",
    "https://decisionsciences.blog/2025/02/06/roas-and-the-flaws-of-linear-thinking-in-marketing-spend-decisions/",
    "https://decisionsciences.blog/2025/01/27/data-helps-people-deliver-reflecting-on-my-last-10-years-as-a-data-scientist-and-looking-ahead-to-the-next-10/",
    "https://decisionsciences.blog/2025/01/26/the-economy-of-insights-in-a-genai-driven-world-a-decision-scientists-perspective/",
    "https://decisionsciences.blog/2025/01/22/the-power-of-decision-driven-analysis/",
    "https://decisionsciences.blog/2025/01/16/from-insights-to-out-of-sights-how-decision-sciences-and-genai-help-drive-consistency/",
    "https://decisionsciences.blog/2025/01/15/the-executive-decision-series-post-3-transforming-decile-tables-into-roi-powerhouses/",
    "https://decisionsciences.blog/2025/01/12/the-executive-decision-series-post-2-how-to-read-a-decile-table-for-response-optimization-part-1/",
    "https://decisionsciences.blog/2025/01/03/navigating-the-pitfalls-of-data-driven-decision-making-insights-and-strategies-from-hbr/",
    "https://decisionsciences.blog/2025/01/02/500-cats-or-one-tiger-the-promise-and-reality-of-data-driven-decision-making/",
    "https://decisionsciences.blog/2025/01/01/the-executive-decision-series-post-1-understanding-saturation-curves-in-spend-optimization-mmm/",
    "https://decisionsciences.blog/2024/12/30/beyond-the-data-veneer-a-decision-sciences-call-to-action-for-2025/",
    "https://decisionsciences.blog/2024/12/28/decision-sciences-why-data-science-needs-clearer-impact-paths/",
    "https://decisionsciences.blog/2024/05/18/bridging-the-intelligence-value-gap-advancing-decision-sciences-with-predictive-prescriptive-analytics/",
    "https://decisionsciences.blog/2024/05/06/genai-as-catalyst-for-emergence-of-the-decision-scientist/",
    "https://decisionsciences.blog/2024/05/16/workbook-quantifying-incrementality-a-b-testing-mean-as-a-model-statistical-analysis-and-confidence-intervals/",
    "https://decisionsciences.blog/2024/05/06/the-importance-of-incrementality-and-lift-for-decision-sciences-in-marketing/",
    "https://decisionsciences.blog/2024/04/30/key-skillset-of-the-decision-scientist/",
    "https://decisionsciences.blog/2024/05/07/analysts-vs-data-scientists-vs-decision-scientists/",
    "https://decisionsciences.blog/2024/04/30/what-is-decision-sciences/",
    "https://decisionsciences.blog/2025/02/24/data-informed-vs-data-driven-vs-decision-driven-why-executives-must-understand-the-difference/"
]


### **Function to Scrape & Clean Web Pages Using BeautifulSoup**
def scrape_and_clean(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an error for bad responses (e.g., 404)

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")

        # Remove unwanted elements (navigation, scripts, footers, ads, etc.)
        for tag in soup(["script", "style", "footer", "nav", "aside", "header"]):
            tag.decompose()

        # Extract main content
        text = soup.get_text(separator="\n", strip=True)
        print(text[:1000])  # Preview cleaned content

        return text

    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return None


### **Function to Perform Semantic Chunking Using Sentence Tokenization**
def semantic_chunking(text, max_chunk_size=1000):
    """Split text into semantic chunks using sentence-based chunking."""
    sentences = sent_tokenize(text)  # Break text into sentences
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_chunk_size:
            current_chunk += sentence + " "
        else:
            if current_chunk:  # Skip the append when an oversized first sentence would store an empty chunk
                chunks.append(current_chunk.strip())  # Store complete chunk
            current_chunk = sentence + " "

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks


### **Scrape, Clean, and Store Documents**
documents = []
for url in urls:
    content = scrape_and_clean(url)
    if content:
        documents.extend(semantic_chunking(content))  # Apply semantic chunking

# Ensure some documents were collected
if not documents:
    raise ValueError("No documents were loaded. Please check the URLs or network connection.")

print(f"Successfully loaded {len(documents)} cleaned and chunked documents.")


### **Step 2: Apply Final Chunking with RecursiveCharacterTextSplitter**
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.create_documents(documents)

### **Step 3: Convert Text into Embeddings & Store in Chroma Vector Database**
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_db = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")

# Persist the database to keep it for future queries
vector_db.persist()

print("Documents successfully processed and stored in Chroma DB.")


Decision Sciences: Marketing – Discussing topics related to Decision Sciences in Marketing
Beyond Insights: Shaping Strategic Outcomes
Welcome to my blog, where I delve into the dynamic world of Decision Sciences, focusing especially on its role in enhancing strategic decision-making within Marketing organizations. Here, I’ll share insights, thoughts, and occasionally, practical resources such as links to code and models on GitHub to deepen understanding and illustrate key concepts. My posts will venture into areas I’m passionate about, connecting the dots between Advanced Analytics, Data Science, and the burgeoning field of Decision Sciences. My writing might at times echo a manifesto for the Decision Scientist. This is intentional, as I am exploring and advocating for the pivotal role that Decision Scientists can and should play in a business world increasingly driven by Generative AI and a data-savvy workforce.
Expect a non-traditional approach as I share these thoughts and ideas. T

### Implement RAG Querying
- When you ask a question, the chain retrieves relevant content from www.decisionsciences.blog and integrates it into the response.

#### Example 1: 
- Prompt to pull from a single post

In [16]:
# Reload vector database
vector_db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Define retrieval-based QA chain using ChatOpenAI for GPT-4 chat model
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),  # Corrected to use ChatOpenAI
    retriever=vector_db.as_retriever(search_kwargs={"k": 5})  # k sets how many chunks the retriever fetches (here, the top 5)
)

# Example query
query = "Explain the concept of impact paths in decision sciences as described in the blog post."
response = qa_chain.run(query)
print(response)

The concept of impact paths in Decision Sciences refers to structured frameworks that connect insights to measurable outcomes. They are established by Decision Scientists to ensure that insights lead to action. An impact path answers critical questions such as what decision an insight informs and why it matters, how the insight aligns with business goals and priorities, what actions should follow from the insight, who is responsible for these actions, and what the anticipated impact of the actions is and how it will be measured. Without these clear pathways, even compelling insights risk being ignored, misunderstood, or deprioritized.


#### Example 2: 
- Prompt to pull from a single post. Limitation: as configured, the chain does not connect the retrieved content to the LLM's broader knowledge.

In [18]:
# Reload vector database
vector_db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Define retrieval-based QA chain using ChatOpenAI for GPT-4 chat model
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),  # Corrected to use ChatOpenAI
    retriever=vector_db.as_retriever(search_kwargs={"k": 10})  # k sets how many chunks the retriever fetches (here, the top 10)
)
# Example query
query = "Using both your knowledge and the blog post about impact paths, explain how Monte Carlo simulations could enhance impact paths in decision sciences."
response = qa_chain.run(query)
print(response)


I'm sorry, but I can't provide the information you're looking for because the provided context doesn't contain any information about impact paths or how Monte Carlo simulations could enhance them in decision sciences.


#### Example 3: 
- Prompt to explicitly leverage pre-trained knowledge of LLM

Change the `RetrievalQA` settings: adjust the prompt template to encourage more inference or creativity from the LLM.
- Answers are not limited strictly to the blog URLs, although the retrieval step draws only on the blog content provided.
- The LLM does have general pre-trained knowledge. It may just need a clear prompt or configuration to leverage that knowledge explicitly.

In [20]:
template = """
You are a helpful assistant in the field of Decision Sciences. Using both your own knowledge and the following context, answer the question thoroughly, even 
if some connections aren't explicit in the provided context.

Context: {context}

Question: {question}

Answer:"""

custom_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vector_db.as_retriever(),
    chain_type_kwargs={"prompt": custom_prompt}
)

# Example query
query = "Using both your knowledge and the blog post about impact paths, explain how Monte Carlo simulations could enhance impact paths in decision sciences."
response = qa_chain.run(query)
print(response)


Monte Carlo simulations could enhance impact paths in decision sciences by providing more detailed and probabilistic understanding of potential outcomes. Impact paths refer to the chain of events that lead from a decision to its final outcome. Traditionally, these paths are often evaluated based on static point forecasts or best guesses. 

However, Monte Carlo simulations allow for a more dynamic and comprehensive assessment. By running thousands, or even millions, of simulations, we can gain insight into a wide range of possible scenarios and their associated probabilities. This allows decision-makers to better understand the full spectrum of potential impacts, rather than just a single expected outcome.

Moreover, given the uncertainty inherent in many decision-making contexts, the probabilistic nature of Monte Carlo simulations can provide a more robust foundation for decision-making. It allows leaders to prepare for a variety of scenarios and to optimize strategies accordingly, thu

#### What Do `{context}` and `{question}` Do in the Prompt Template?
In the prompt template above, `{context}` and `{question}` are placeholders that are dynamically replaced by actual values before the prompt is sent to the language model (GPT-4). These values come from the retrieval-augmented generation (RAG) pipeline.

1️⃣ **{context} → Injects Retrieved Information**
- Purpose: This placeholder is replaced with the most relevant documents retrieved from the ChromaDB vector store.
- Source of Value: `retriever=vector_db.as_retriever()` fetches the top-k relevant text chunks from the embedded documents.
- Why It Matters: This grounds GPT-4's response in the blog content rather than only its pre-trained knowledge.

2️⃣ **{question} → Inserts User Query**
- Purpose: This placeholder is replaced with the actual question the user asks (query).
- Source of Value: The query variable (e.g., "Explain impact paths in decision sciences.").
- Why It Matters: The model needs to know the question explicitly to provide a focused answer.
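
The substitution itself can be illustrated with Python's built-in `str.format`, which fills named placeholders in the same way. This is a simplified sketch: the shortened template and the retrieved chunk below are made up for the example, and in the actual chain the filling is done by `PromptTemplate` and `RetrievalQA`.

```python
# Shortened, illustrative version of the prompt template
template = "Context: {context}\n\nQuestion: {question}\n\nAnswer:"

retrieved_chunks = ["Impact paths connect insights to measurable outcomes."]  # Made-up chunk
user_question = "Explain impact paths in decision sciences."

filled = template.format(
    context="\n\n".join(retrieved_chunks),  # Retrieved text replaces {context}
    question=user_question,                 # The user's query replaces {question}
)
print(filled)
```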

#### BONUS: Analyze Summarized Data with LLM

In [53]:
# Load the CSV file
df = pd.read_csv("synthetic_leads_dataset.csv")

# Summarize key stats
summary = df.describe()
summary


| | lead_id | company_size | engagement_score | cost | treatment | converted |
|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 1448.9 | 50.46 | 274.503642 | 0.514 | 0.464 |
| std | 144.481833 | 1945.086248 | 29.448512 | 126.650607 | 0.500305 | 0.499202 |
| min | 1.0 | 50.0 | 0.0 | 50.463266 | 0.0 | 0.0 |
| 25% | 125.75 | 162.5 | 25.0 | 163.069948 | 0.0 | 0.0 |
| 50% | 250.5 | 200.0 | 50.0 | 283.192093 | 1.0 | 0.0 |
| 75% | 375.25 | 1000.0 | 76.25 | 381.113501 | 1.0 | 1.0 |
| max | 500.0 | 5000.0 | 100.0 | 497.966221 | 1.0 | 1.0 |


In [55]:
# Prepare a summary string to pass to the LLM
summary_text = summary.to_string()

# Initialize OpenAI client with your API key
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # Reads the API key from the environment

# Ask the model to analyze trends
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a decision scientist."},
        {"role": "user", "content": f"Analyze the following summary of sales data:\n{summary_text}\nWhat trends do you notice?"}
    ]
)

# Print response
print(response.choices[0].message.content)

Based on the provided summary, the following conclusions can be drawn:

1. Lead ID: As it's likely a unique identifier for each entry, this variable probably lacks inherent meaning concerning the company's sales performance.

2. Company Size: On average, the companies in this data set are large (avg. of 1448.9 employees). This dataset contains a wide variety of company sizes, however, ranging from very small companies (minimal 50 employees) to very large ones (maximum 5000). It has a high standard deviation (1945.09), indicating a significant variation in company size.

3. Engagement Score: This lies between 0 and 100, with an average (mean) of 50.46. The standard deviation of around 29.45 suggests some level of variability in the engagement score.

4. Cost: It averages at 274.50 with a standard deviation of 126.65, ranging from approximately 50.46 to 497.97. The variation suggests differences in the cost of converting leads.

5. Treatment: Roughly half of the entries (51.4%) received 