# Notion LLM Tutor

This notebook demonstrates a Notion-powered LLM tutor that answers technical questions using both general knowledge and reference material from a Notion page. It fetches, cleans, and incorporates Notion content to enhance explanations about Python, software engineering, data science, and LLMs. The workflow includes API setup, content extraction, prompt engineering, and model interaction for detailed, context-aware responses.

In [50]:
# imports
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from openai import OpenAI
import ollama
import os
from notion_client import Client as NotionClient
from notion_exporter import NotionExporter
import asyncio
import nest_asyncio
import re

In [51]:
# constants and setup
parent_dir = os.path.dirname(os.getcwd())  # Get the parent directory (GenAI-Discovery)
env_path = os.path.join(parent_dir, '.env')

# Load environment variables
load_dotenv(dotenv_path=env_path, override=True)

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
NOTION_API_KEY = os.getenv('NOTION_API_KEY')
NOTION_PAGE_ID = os.getenv('NOTION_PAGE_ID')
MODEL_GPT = 'gpt-4o-mini'
MODEL_LLAMA = 'llama3.2'

# Initialize OpenAI client with API key
CLIENT = OpenAI(api_key=OPENAI_API_KEY)

In [7]:
# check OpenAI API key format
if OPENAI_API_KEY and OPENAI_API_KEY.startswith('sk-proj-') and len(OPENAI_API_KEY)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key?")
    
# check Notion API key format
if NOTION_API_KEY and NOTION_API_KEY.startswith('ntn_') and len(NOTION_API_KEY)>10:
    print("Notion API key looks good so far")
else:
    print("Notion API key not found or invalid format")

# check Notion Page ID format (8-4-4-4-12 characters)
if NOTION_PAGE_ID and \
    len(NOTION_PAGE_ID) == (32+4) and \
    all(len(part) == expected for part, expected in zip(NOTION_PAGE_ID.split('-'), [8, 4, 4, 4, 12])):
    print("Notion Page ID looks good so far")
else:
    print("Invalid Notion Page ID format. It should follow the pattern: 8-4-4-4-12")

API key looks good so far
Notion API key looks good so far
Notion Page ID looks good so far


In [66]:
# Using NotionExporter instead as it handles block-to-markdown conversion
# Initialize the client
# notion = NotionClient(auth=NOTION_API_KEY)
# Read a page
# page = notion.pages.retrieve(page_id=NOTION_PAGE_ID)

In [53]:
# Need with NotionExporter as Jupyter is already a running event loop
# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

# Helper function to run async code in Jupyter
def run_async(coro):
    try:
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(coro)
    except RuntimeError as e:
        if "There is no current event loop in thread" in str(e):
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            return loop.run_until_complete(coro)
        raise
    
# Example of how to use run_async with an async function
async def async_operation():
    # Your async code here
    pass

# Run it using the helper
result = run_async(async_operation())

In [None]:
# Fetch Notion page content
exporter = NotionExporter(notion_token=NOTION_API_KEY)
notion_content = exporter.export_pages(page_ids=[NOTION_PAGE_ID])

In [69]:
# string representation of the return content
notion_content[NOTION_PAGE_ID]



In [70]:
def remove_images_links_files(text: str) -> str:
    """
    Remove images, links and uploaded files (names and links) from an HTML/Markdown-ish string.
    Handles:
      - HTML <figure>...</figure> blocks (commonly wrap images)
      - HTML <img ...> tags
      - Markdown images: ![alt](url)
      - HTML anchors: <a ...>...</a> (removes tag and inner text)
      - Markdown links: [text](url)
      - Standalone http(s):// URLs
      - Bare filenames with common extensions (e.g. requirements.txt, report.pdf, diagnostics.ipynb)
    Returns cleaned string with extra blank lines collapsed.
    """
    if not text:
        return text

    # Patterns to remove. Use DOTALL and IGNORECASE where needed.
    patterns = [
        # Remove entire figure blocks which often contain images/files
        re.compile(r'(?is)<figure\b.*?>.*?</figure>'),
        # Remove img tags
        re.compile(r'(?is)<img\b[^>]*>'),
        # Remove markdown images ![alt](url)
        re.compile(r'!\[.*?\]\(.*?\)'),
        # Remove HTML anchors and their content (link text or file name)
        re.compile(r'(?is)<a\b[^>]*>.*?</a>'),
        # Remove markdown links [text](url)
        re.compile(r'\[.*?\]\(.*?\)'),
        # Remove standalone URLs (http/https)
        re.compile(r'https?://\S+'),
        # Remove bare filenames with common extensions (case-insensitive)
        re.compile(r'(?i)\b[\w\-\.() ]+\.(?:pdf|txt|ipynb|py|yml|yaml|md|docx|pptx|csv|xlsx|png|jpe?g|gif|zip|tar|gz)\b'),
    ]

    cleaned = text
    for pat in patterns:
        cleaned = pat.sub('', cleaned)

    # Remove common leftover HTML attributes or sequences like src="..." or href="..." if any remain (optional)
    cleaned = re.sub(r'(?i)\b(src|href)=["\'][^"\']*["\']', '', cleaned)

    # Remove leftover sequences of punctuation that might remain after filename removal (e.g., ' ,', ' .')
    cleaned = re.sub(r'[\s]*[,;:\-][\s]*', ' ', cleaned)

    # Collapse multiple blank lines into two and trim
    cleaned = re.sub(r'\n\s*\n+', '\n\n', cleaned).strip()

    return cleaned


In [71]:
# remove attached files, images, links from Notion page content
notion_content_clean = remove_images_links_files(notion_content[NOTION_PAGE_ID])

In [72]:
# original content
# display(Markdown(notion_content[NOTION_PAGE_ID]))

In [73]:
# cleaned content
# display(Markdown(notion_content_clean))

In [74]:
# example question:
# What was the first built project in this course?
question = input("Please enter your question:")

In [75]:
# Create enhanced system prompt with Notion content
base_system_prompt = """You are a helpful technical tutor who answers questions about python code, software engineering, data science and LLMs.

You have access to additional reference material that may be relevant to the user's questions. Use this information to provide more comprehensive and accurate answers when applicable.

REFERENCE MATERIAL:
"""

# Combine base prompt with Notion content
if notion_content_clean:
    system_prompt = base_system_prompt + f"\n{notion_content_clean}\n\n" + """
INSTRUCTIONS:
- Use the reference material above when it's relevant to the user's question
- Always prioritize accuracy and clarity in your explanations
- If the reference material contains relevant information, mention that you're drawing from additional resources
- If the question is outside the scope of the reference material, answer based on your general knowledge
"""
else:
    system_prompt = "You are a helpful technical tutor who answers questions about python code, software engineering, data science and LLMs"

user_prompt = "Please give a detailed explanation to the following question: " + question

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

# Show system prompt length for debugging
print(f"System prompt length: {len(system_prompt)} characters")
print(f"Contains Notion content: {'Yes' if notion_content else 'No'}")

System prompt length: 52560 characters
Contains Notion content: Yes


In [76]:
# Get gpt-4o-mini to answer, with streaming
stream = CLIENT.chat.completions.create(
    model=MODEL_GPT,
    messages=messages,
    stream=True
)
response = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    response += chunk.choices[0].delta.content or ''
    response = response.replace("```","").replace("markdown", "")
    update_display(Markdown(response), display_id=display_handle.display_id)

10/30/2025 03:48:24 PM HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


The first project in the course is the **"Instant Gratification Project: Creating an AI Powered Web Page Summarizer."** Here’s a breakdown of what this project involves:

### Overview of the Project
The goal of the project is to build a web page summarization tool using a large language model (LLM). It allows users to input a URL and receive a concise summary of the contents on that webpage. This showcases how LLMs can be applied to real-world tasks like summarization.

### Key Components of the Project

1. **Environment Setup:**
   - Prior to starting the project, participants are instructed to set up their development environment, which includes tools like Jupyter Lab, Python, and the necessary libraries such as Beautiful Soup for web scraping.

2. **Retrieving Web Content:**
   - The project involves creating a class `Website`, which is responsible for fetching the content of a specified URL. It uses the `requests` library to get the HTML content and `BeautifulSoup` to parse that content. The class extracts the webpage's title and main text while filtering out irrelevant elements such as scripts and styles.

3. **Generating Summaries with LLM:**
   - The project utilizes an LLM (like OpenAI’s GPT models or Ollama's Llama model) to process the extracted content. 
   - Participants define **system and user prompts** to guide the LLM in generating a summary. The system prompt sets the context for the task, while the user prompt provides the content that needs summarization.

4. **API Integration:**
   - The project includes making API calls to the OpenAI service or to local instances of models like Llama using Ollama. This entails setting up the necessary API keys and routes for making requests to the LLM.

5. **Combining Components:**
   - Finally, the various components (web scraping, LLM integration, and output formatting) are combined into a cohesive function to create the summarizer. The results are formatted and displayed in a user-friendly Markdown format.

### Example Code Snippets
Here is a simplified version of what parts of the implementation may look like:

python
import requests
from bs4 import BeautifulSoup

class Website:
    def __init__(self, url):
        self.url = url
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        self.text = soup.body.get_text(separator="\n", strip=True)

# Function to summarize the content using LLM
def summarize(url):
    website = Website(url)
    # Here you would call the LLM API with the website.text to get a summary
    # For example:
    response = openai.chat.completions.create(
        model='your-llm-model',
        messages=[
            {"role": "system", "content": "Summarize the content of the following text."},
            {"role": "user", "content": website.text}
        ]
    )
    return response.choices[0].message.content

# Example usage
print(summarize("http://example.com"))


### Learning Outcomes
- Participants gain practical experience in web scraping, API integration, and working with LLMs.
- The project emphasizes understanding how to construct effective prompts and manage the flow of data from input (web content) to output (summarized text).

This formative project sets the stage for the entire course by illustrating critical concepts and skills relevant to working with LLMs and applying them to solve real-world problems.

In [None]:
# Get Llama 3.2 to answer
response = ollama.chat(model=MODEL_LLAMA, messages=messages)
reply = response['message']['content']
display(Markdown(reply))