# Overview

This project uses an LLM (via the OpenAI API) to extract commonly asked questions from unstructured text data and automatically writes them to a Notion page (via the Notion API).

The Slack message data used in this example was originally extracted using **Fivetran**.  However, the code is **designed to work with any text-based message data**. You can adapt it easily to any dataset containing text conversations (e.g., customer chats, support tickets, forum posts).

The results created by the automated Notion integration can be viewed here:  
🔗 [View Results on Notion](https://www.notion.so/Auto-generated-FAQ-1afd8cbfdbfd80279fa5c9200855c19b)

---

# Requirements

To run this notebook, you will need:

- A database or file containing the text message data you want to analyze.  
  In this example, Slack message data (`team_slack`) is loaded from **Google Drive**.  
  ➔ Replace this part with your own data loading method, depending on where your messages are stored (e.g., local file, database, or cloud storage).
- An **OpenAI API key** for processing and summarizing your text data.
- A **Notion API integration**, which provides you with a Notion "secret" (your Notion API key), for automatically exporting the results to Notion.
- An empty **Notion page** to export your results to.

👉 **Important:**  
- You must have a **billed OpenAI account** (free-tier accounts without billing enabled cannot use the API).
- Running this notebook will **incur API costs** based on the number of tokens processed.  
  Costs depend on the size of your message dataset and the OpenAI model you use.
- You can create and manage your OpenAI API key here: [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)

- You can create a **Notion integration** here: [https://www.notion.so/my-integrations](https://www.notion.so/my-integrations).  
  After creating the integration, copy your **Internal Integration Token** (the Notion secret/API key), and  
  ➔ **share the target Notion page** with your integration to give it the necessary permissions.

Please make sure you have both API keys ready before running the code.  
You can set them securely as environment variables (recommended) or insert them directly into the code (not recommended for shared environments).

Example environment variables:
- `OPENAI_API_KEY`
- `NOTION_API_KEY`
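A minimal sketch of reading both keys from the environment and failing early when one is missing. The helper name `require_env` is hypothetical and not part of the notebook; only the two variable names come from the list above.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Example usage (uncomment once both variables are set):
# openai_key = require_env("OPENAI_API_KEY")
# notion_key = require_env("NOTION_API_KEY")
```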
---

# Cost Estimate

Running this code on approximately 2,000 text messages using the OpenAI `gpt-4o` model cost me less than **$1** in API usage.

👉 **Important:** Actual costs will vary depending on:
- The total number of tokens processed — which depends heavily on **how many messages** you process and **how long each message is** (more text = more tokens).
- The model used (e.g., GPT-4o is cheaper than GPT-4-turbo).
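For a rough feel of how message volume turns into cost, the sketch below estimates input tokens from character counts. It assumes roughly 4 characters per token (a common rule of thumb for English text, not an exact count) and takes the price as a parameter. The function name and the price in the example are placeholders, so check the current OpenAI pricing for your model. Output tokens and the prompt template itself are not included.

```python
def estimate_input_cost(messages, price_per_million_tokens, chars_per_token=4):
    """Very rough input-cost estimate; hypothetical helper, not part of the notebook.

    chars_per_token=4 is an approximation, and price_per_million_tokens
    must be taken from the current pricing page for the chosen model.
    """
    total_chars = sum(len(m) for m in messages)
    approx_tokens = total_chars / chars_per_token
    approx_cost = approx_tokens * price_per_million_tokens / 1_000_000
    return approx_tokens, approx_cost

# Example with made-up data and a placeholder price of 2.0 USD per million tokens
tokens, cost = estimate_input_cost(["How do I get API access?"] * 2000, 2.0)
print(f"~{tokens:.0f} input tokens, ~${cost:.4f}")
```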

🛡️ **Tip:**  
To control spending, you can disable "Auto-recharge" in your OpenAI billing settings and/or set a usage limit.  
Always monitor your usage in the [OpenAI usage dashboard](https://platform.openai.com/usage) to avoid unexpected charges.


# A Note on Batch Size and Temperature

Batch size and temperature are parameters set inside the LLM function to control how the messages are processed and how consistent or creative the model's responses are.


### 1️⃣ Batch Size: Affects Context and Accuracy

**What It Does:**

* Determines how many messages are processed in a single LLM call.
* Larger batches provide more context, allowing the model to find recurring themes more effectively.
* Smaller batches may miss broader patterns because the LLM sees less data at once.

**Impact of Batch Size:**

| Batch Size | Pros | Cons |
|------------|---------------------------------|------------------------------------------------|
| **Small (e.g., 10-20 messages per batch)** | Faster responses, easier error handling | Less context, less effective at spotting trends |
| **Medium (e.g., 50-100 messages per batch, default)** | Balance of context and performance | Still may not catch all patterns |
| **Large (e.g., 200+ messages per batch)** | More context, better at summarization | Slower response, might hit token limits |

**Best Practice Summary for Batch Size:**

* ✅ Use medium or large batch sizes (50-200 messages per batch).
* ✅ If patterns aren't emerging, try increasing the batch size so the model sees more context.
* ✅ If individual API calls are slow or hit token limits, reduce the batch size so each call processes a smaller chunk.
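Because large batches can hit token limits, one alternative to a fixed message count is to fill each batch up to an approximate token budget. This is only a sketch: `batch_by_token_budget` is a hypothetical helper, the default budget is arbitrary, and the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
def batch_by_token_budget(messages, max_tokens=6000, chars_per_token=4):
    """Group messages into batches that stay under an approximate token budget."""
    batches, current, current_tokens = [], [], 0
    for msg in messages:
        msg_tokens = len(msg) // chars_per_token + 1  # rough estimate per message
        # Start a new batch when adding this message would exceed the budget
        if current and current_tokens + msg_tokens > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(msg)
        current_tokens += msg_tokens
    if current:
        batches.append(current)
    return batches

# Example with made-up messages and a deliberately tiny budget
demo_batches = batch_by_token_budget(["a" * 40] * 5, max_tokens=25)
print([len(b) for b in demo_batches])
```

A single message longer than the budget still ends up in its own batch, so very long messages may need truncating first.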

<br>

> **Approach Chosen In This Case: Combine batch 100 & batch 300 results**
>
> The notebook runs the extraction twice, once with batch size 100 and once with batch size 300, and then uses the LLM to combine both results into a single, balanced summary. This lets the LLM merge the high-level themes from batch 300 with the more detailed breakdown from batch 100, resulting in a structured but not overly detailed summary.

### 2️⃣ Temperature: Affects Consistency vs. Creativity

**What It Does:**

* Controls how random or deterministic the responses are.
* Lower values (0.1 - 0.3) → More consistent, structured, and predictable outputs.
* Higher values (0.7 - 1.0) → More creative, diverse, and sometimes inconsistent results.

**Impact of Temperature:**

| Temperature | Behavior | Use Case |
|-------------|-----------------------------------|------------------------------------------------------------|
| **0.1 - 0.3** | Consistent, structured, focused | Best for summaries, factual outputs, and categorization |
| **0.4 - 0.6** | Some variation, slight randomness | Good for brainstorming alternative groupings |
| **0.7 - 1.0** | Highly creative, less structured | Best for generating unique ideas, but not for structured summarization |


**Best Practice Summary for Temperature settings:**

* ✅ Use a low temperature (0.2 - 0.3) to get a clear, structured summary.
* ✅ If the model's output is too rigid, try increasing the temperature slightly (0.4 - 0.5).
* ✅ If the output is too chaotic or inconsistent, lower the temperature to 0.1 - 0.2.
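As a toy illustration of why lower temperatures give more predictable output: in standard sampling, the model's raw scores (logits) are divided by the temperature before they are turned into probabilities. The logits below are made-up numbers, and the sketch does not reproduce OpenAI's internal implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability mass ends up on the highest-scoring token, which is why low values tend to produce more consistent summaries.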


# Here Comes the Code

## Import necessary libraries

In [None]:
# Install the OpenAI Python package if you haven't already
# (You can run this in your terminal or notebook once)
# pip install openai

# Import necessary libraries
import openai  # For interacting with the OpenAI API
import pandas as pd  # For working with tabular data (e.g., storing and processing messages)
import os  # For accessing environment variables (e.g., loading the OpenAI API key) and handling file paths

## Set Up LLM Access (OpenAI)

In [None]:
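# Recommended: set OPENAI_API_KEY as an environment variable before starting the notebook.
# Fallback for private use only: insert your key below (never commit or share a real key).
os.environ.setdefault("OPENAI_API_KEY", "insert_your_openai_api_key")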

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## Load the message data

⚡ Note: In this example, the Slack messages ("team_slack") are loaded from Google Drive.

Replace this with your own data loading method if your messages are stored elsewhere (e.g., local CSV, database query, other cloud storage).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = "/content/drive/My Drive/data/team_slack_20250224.csv"
team_slack = pd.read_csv(file_path)

In [None]:
team_slack.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1864 entries, 0 to 1863
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   message_channel_id  1864 non-null   object 
 1   parent_message_ts   1864 non-null   float64
 2   ts                  1864 non-null   float64
 3   user_id             1864 non-null   object 
 4   subtype             197 non-null    object 
 5   last_read           0 non-null      float64
 6   subscribed          291 non-null    object 
 7   source_team         0 non-null      float64
 8   type                1864 non-null   object 
 9   team                1543 non-null   object 
 10  is_locked           291 non-null    object 
 11  edited_user         176 non-null    object 
 12  _fivetran_deleted   1864 non-null   bool   
 13  _fivetran_synced    1864 non-null   object 
 14  edited_ts           176 non-null    float64
 15  reply_count         291 non-null    float64
 16  thread

## Preprocess data

This step applies basic cleaning to the Slack messages:
- Removes any leading or trailing spaces
- Ensures that only valid text strings are processed (non-strings are replaced with an empty string)


In [None]:
# Define the cleaning function, then apply it to the 'masked_text' column and store the cleaned text in a new column 'clean_text'


def preprocess_text(text):
    """Basic cleaning of Slack messages"""
    if isinstance(text, str):  # Ensure it's a string before processing
        return text.strip()  # Remove leading/trailing spaces
    return ""

team_slack['clean_text'] = team_slack['masked_text'].apply(preprocess_text)
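To see what the cleaning step does on individual values, here is a standalone check. The function is repeated so the snippet runs on its own, and the sample values are made up.

```python
def preprocess_text(text):
    """Basic cleaning of Slack messages (same logic as above)."""
    if isinstance(text, str):
        return text.strip()
    return ""

# Made-up sample values: padded text, a missing value, and a non-string
samples = ["  How do I get API access?  ", None, 42]
cleaned = [preprocess_text(s) for s in samples]
print(cleaned)
```

Note that missing or non-string values become empty strings rather than `NaN`, so a later `dropna()` will not remove them.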

## Use OpenAI to extract common questions


### LLM Function and Prompt

The function below sends a structured prompt to the OpenAI model (default: GPT-4o) to identify and group actual questions from a batch of Slack messages into an FAQ-style format.

In the development process, I ran several versions of the prompt and gradually refined it to improve the results.

The prompt was improved step-by-step to ensure:
* ✅ Focus on extracting **actual questions** from the text (not just themes or summaries)
* ✅ **Grouping similar questions** together into a coherent FAQ format
* ✅ **Removing unnecessary thematic descriptions** and ensuring the questions are **clear and concise**

In this cleaned-up version, only the final, optimized prompt is included.

In [None]:
def extract_faq_questions(messages, model="gpt-4o"):
    """Extract actual FAQ-style common questions from Slack messages"""

    prompt = f"""
You are analyzing Slack messages to extract **common recurring questions** that team members frequently ask.

### **Instructions:**
- Extract **only actual questions** (e.g., "How do I get API access?" instead of "Discussing API access").
- Group them into **clear FAQ categories**.
- Do **not include summaries or descriptions**—only list questions.
- Do **not switch to general topic descriptions**—always extract full-sentence questions.

### **Expected Output Format:**

**1. Meeting Schedules & Availability**
- What time works for everyone for the next team meeting?
- Can we schedule a 1-1 session?
- Is there a slot available to meet before our scheduled call?

**2. API Access & Authentication**
- How do I obtain OAuth credentials for API access?
- What permissions are required to authenticate via the API?
- How do I troubleshoot OAuth authentication failures?

**3. Debugging & Technical Issues**
- How do I resolve an SSL protocol error?
- What's the best way to debug OAuth authentication issues?
- How do I fix issues with database connections?

**4. Project Management & Task Coordination**
- How should we organize project tasks in Notion?
- What is the best practice for managing Git branches?
- How do we handle merging conflicts effectively?

---

**Here are the extracted Slack messages:**

{messages}

**Now, extract and format only direct questions under their respective categories. Do NOT include descriptions or summaries. Avoid switching to topic descriptions.**


    """

    # Uses the module-level `client` created in the setup cell above

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert in identifying and structuring common questions in team discussions."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Low temperature favors consistent, structured responses (not fully deterministic)
    )

    return response.choices[0].message.content  # Extract the FAQ-style questions
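API calls can fail temporarily (for example because of rate limits or network errors), and a failure in one batch would otherwise abort the whole run. Below is a generic retry sketch with exponential backoff. `call_with_retries` is a hypothetical helper, and catching every `Exception` is a simplification; in practice you may want to catch only the error types you consider retryable.

```python
import time

def call_with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Example usage with the function above (assumes `batch` holds the joined messages):
# result = call_with_retries(lambda: extract_faq_questions(batch))
```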


### Run with batch size 100

In [None]:
def batch_messages_100(df, column='masked_text', batch_size=100):
    """Processes messages in batches and extracts common questions"""
    questions = []
    for i in range(0, len(df), batch_size):
        batch = "\n".join(df[column].iloc[i:i+batch_size].dropna().astype(str).tolist())  # Drop NaNs, then join one message per line
        result = extract_faq_questions(batch)
        questions.append(result)
    return questions

In [None]:
# Use the cleaned column created in the preprocessing step
faq_b100 = batch_messages_100(team_slack, column='clean_text')


In [None]:
faq_b100

### Run with batch size 300

Larger batches provide more context, allowing the model to find recurring themes more effectively.

In [None]:
def batch_messages_300(df, column='masked_text', batch_size=300):
    """Processes messages in batches and extracts common questions"""
    questions = []
    for i in range(0, len(df), batch_size):
        batch = "\n".join(df[column].iloc[i:i+batch_size].dropna().astype(str).tolist())  # Drop NaNs, then join one message per line
        result = extract_faq_questions(batch)
        questions.append(result)
    return questions

In [None]:
# Use the cleaned column created in the preprocessing step
faq_b300 = batch_messages_300(team_slack, column='clean_text')
faq_b300
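The two batch functions differ only in their default batch size, so a single helper that takes the batch size and the extraction function as arguments could cover both runs. This is a sketch: `extract_in_batches` is a hypothetical name, and the stub below stands in for the real LLM call so the snippet runs without an API key.

```python
def extract_in_batches(messages, batch_size, extract_fn):
    """Join messages into batches of batch_size and apply extract_fn to each batch."""
    results = []
    for start in range(0, len(messages), batch_size):
        batch_text = "\n".join(messages[start:start + batch_size])
        results.append(extract_fn(batch_text))
    return results

# Stub in place of the LLM call: it only counts the messages in each batch
def count_messages(batch_text):
    return batch_text.count("\n") + 1

demo_messages = [f"message {i}" for i in range(5)]  # made-up data
print(extract_in_batches(demo_messages, 2, count_messages))
```

With the notebook's objects, the same helper would be called as `extract_in_batches(team_slack['clean_text'].tolist(), 100, extract_faq_questions)` and again with `300`.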

### Merge b100 and b300 results

This step combines the results of batch size 100 and batch size 300 into a single, balanced summary. It lets the LLM merge the high-level themes from batch 300 with the more detailed breakdown from batch 100, resulting in a structured but not overly detailed summary.

In [None]:
def merge_summaries(faq_b100, faq_b300, model="gpt-4o"):
    """Merge and refine the two batch summaries into a single, balanced list."""

    # Each input is a list with one result string per batch; join them so the
    # prompt contains readable text instead of a Python list representation
    b100_text = "\n\n".join(faq_b100) if isinstance(faq_b100, list) else faq_b100
    b300_text = "\n\n".join(faq_b300) if isinstance(faq_b300, list) else faq_b300

    prompt = f"""
    You are analyzing two summaries of common Slack discussions, focusing on the common questions raised:

    1. **The first summary (Batch 300)** is a high-level overview of the most recurring questions.
    2. **The second summary (Batch 100)** is a more detailed breakdown of different questions.

    Your task is to merge these into a **single, refined summary** that:
    - Keeps the most important topics and questions from Batch 300.
    - Retains useful details from Batch 100 without excessive repetition.
    - Ensures topics are **not too broad, but also not overly detailed**.
    - Groups similar questions together.
    - Avoids listing redundant information.

    Here are the summaries:

    **Batch 300 Summary:**
    {b300_text}

    **Batch 100 Summary:**
    {b100_text}

    Please return the merged summary in a structured format with grouped categories.
    """

    # Uses the module-level `client` created in the setup cell above

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": "You are an AI expert at refining and merging topic summaries."},
                  {"role": "user", "content": prompt}],
        temperature=0.3  # Low temperature keeps the output structured and consistent
    )

    return response.choices[0].message.content  # Extract refined summary


In [None]:
merged_faqs = merge_summaries(faq_b100, faq_b300)

# Summary of results

Please note:
Here I **manually removed the last category ("General Inquiries")** because it was a little messy and included a mix of internal feedback and random one-off asks.

In [None]:
def remove_last_category(merged_faqs_text, category_title="**6. General Inquiries**"):
    """Removes the last unwanted category and anything after it from the merged FAQ text."""
    if category_title in merged_faqs_text:
        # Only keep the part before the unwanted category
        cleaned_text = merged_faqs_text.split(category_title)[0].strip()
        return cleaned_text
    else:
        # If the category is not found, return the original text
        return merged_faqs_text
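The category number can change between runs, so matching on the exact title `**6. General Inquiries**` may not always find it. A sketch of a more tolerant variant, assuming headings follow the `**N. Title**` pattern requested in the prompt: `drop_category` is a hypothetical helper that removes only the named section, whatever its number, and keeps everything else.

```python
import re

def drop_category(faq_text, category_name):
    """Remove one '**N. Title**' section (heading plus its bullets) by title."""
    heading = re.compile(r"^\*\*\d+\.\s*(.+?)\*\*$")
    kept, skipping = [], False
    for line in faq_text.splitlines():
        match = heading.match(line.strip())
        if match:
            # A new heading starts: skip it only if its title is the unwanted one
            skipping = match.group(1).strip().lower() == category_name.lower()
        if not skipping:
            kept.append(line)
    return "\n".join(kept).strip()

# Made-up example text in the expected output format
demo_text = "**1. A**\n- q1\n\n**2. General Inquiries**\n- q2\n\n**3. B**\n- q3"
print(drop_category(demo_text, "General Inquiries"))
```

Unlike the function above, this keeps any categories that come after the removed one, and the remaining headings are not renumbered.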


In [None]:
# Remove the 6th category before printing or uploading
cleaned_merged_faqs = remove_last_category(merged_faqs)

# Now print the cleaned version
print(cleaned_merged_faqs)


**Merged Summary of Slack Discussions**

**1. Meeting Schedules & Availability**
- What time works for everyone for the next team meeting?
- Can we schedule a 1-1 session?
- Is there a slot available to meet before our scheduled call?
- Should we have a meeting tomorrow morning?
- Please let me know what hours work for you, so that we can plan team calls accordingly.
- Can you please record the meeting?
- Does anyone have schedule conflicts with the time?

**2. API Access & Authentication**
- How do I obtain OAuth credentials for API access?
- What permissions are required to authenticate via the API?
- How do I troubleshoot OAuth authentication failures?
- Could you check if there’s an issue with the access or if I need to take any additional steps?
- Do I need to create a new Slack app to obtain the required API access token?

**3. Debugging & Technical Issues**
- How do I resolve an SSL protocol error?
- What's the best way to debug OAuth authentication issues?
- How do I fix issues

# Notion Integration

In [None]:
# pip install notion-client

In [None]:
from notion_client import Client
import os

# --- SETUP: Replace with your credentials ---
# Reads NOTION_API_KEY from the environment if set; otherwise falls back to the placeholder
NOTION_API_KEY = os.getenv("NOTION_API_KEY", "insert_your_own_Notion_API_key_here")
NOTION_PAGE_ID = "1afd8cbfdbfd80279fa5c9200855c19b"  # Replace with the ID of your own Notion page

# --- Initialize Notion Client ---
notion = Client(auth=NOTION_API_KEY)
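The page ID used above is the 32-character string at the end of the Notion page URL. A small sketch for extracting it, assuming the URL ends with that ID as in the results link at the top of this notebook: `page_id_from_url` is a hypothetical helper and not part of `notion-client`.

```python
import re
from urllib.parse import urlparse

def page_id_from_url(url):
    """Return the trailing 32-character hex ID from a Notion page URL."""
    path = urlparse(url).path  # drops query strings such as ?pvs=...
    match = re.search(r"[0-9a-f]{32}$", path.replace("-", "").lower())
    if not match:
        raise ValueError("No 32-character page ID found in URL")
    return match.group(0)

demo_url = "https://www.notion.so/Auto-generated-FAQ-1afd8cbfdbfd80279fa5c9200855c19b"
print(page_id_from_url(demo_url))
```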

In [None]:
# The output generated by the LLM is in Markdown-style formatting.
# This function parses the Markdown into a list of Notion blocks.

def parse_markdown_to_notion_blocks(markdown_text):
    """
    Parses simple Markdown text into a list of Notion blocks (headings and bullet list items).
    Only handles bold headings (**) and bullets (-).
    """

    notion_blocks = []
    lines = markdown_text.splitlines()

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines

        # Heading detection: lines starting and ending with **
        if line.startswith("**") and line.endswith("**"):
            heading_text = line.strip("*").strip()
            block = {
                "object": "block",
                "type": "heading_2",
                "heading_2": {
                    "rich_text": [
                        {"type": "text", "text": {"content": heading_text}}
                    ]
                }
            }
            notion_blocks.append(block)

        # Bullet detection: lines starting with "- " (a bare "---" divider is not treated as a bullet)
        elif line.startswith("- "):
            bullet_text = line[2:].strip()
            block = {
                "object": "block",
                "type": "bulleted_list_item",
                "bulleted_list_item": {
                    "rich_text": [
                        {"type": "text", "text": {"content": bullet_text}}
                    ]
                }
            }
            notion_blocks.append(block)

        # Otherwise, ignore the line (safe for now)

    return notion_blocks
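The two block dictionaries in the parser share the same shape, so a small helper can build both. Below is a compact sketch of the same parsing idea that runs without a Notion connection. `make_block` and `parse_faq_markdown` are illustrative names and not part of the notebook or the Notion SDK.

```python
def make_block(block_type, text):
    """Build a Notion block dict of the given type with a single text item."""
    return {
        "object": "block",
        "type": block_type,
        block_type: {"rich_text": [{"type": "text", "text": {"content": text}}]},
    }

def parse_faq_markdown(markdown_text):
    """Turn '**Heading**' lines and '- bullet' lines into Notion blocks."""
    blocks = []
    for raw in markdown_text.splitlines():
        line = raw.strip()
        if len(line) > 4 and line.startswith("**") and line.endswith("**"):
            blocks.append(make_block("heading_2", line.strip("*").strip()))
        elif line.startswith("- "):
            blocks.append(make_block("bulleted_list_item", line[2:].strip()))
        # Any other line (blank lines, dividers, free text) is ignored
    return blocks

# Made-up example input
demo_blocks = parse_faq_markdown("**1. Meetings**\n- Can we schedule a 1-1 session?\n---\nnote")
print([b["type"] for b in demo_blocks])
```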


In [None]:
# Function to Add Parsed Content to Notion Page
def add_blocks_to_notion(blocks, page_id):
    """Uploads a list of Notion blocks to the specified Notion page."""

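    # The Notion API accepts at most 100 child blocks per request, so upload in chunks of 100
    for start in range(0, len(blocks), 100):
        notion.blocks.children.append(page_id, children=blocks[start:start + 100])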


In [None]:
# Parse the merged_faqs markdown into structured Notion blocks
notion_blocks = parse_markdown_to_notion_blocks(cleaned_merged_faqs)

# Upload the structured blocks to Notion
add_blocks_to_notion(notion_blocks, NOTION_PAGE_ID)

print("✅ Data successfully added to Notion!")


✅ Data successfully added to Notion!
