# Turn blog feed into QA dataset

We will use Llama 3.2 (through Ollama) to do the text -> QA.

This allows users to run the model locally without sharing potential confidential information with others.

### Install all libraries

Run:
- poetry install
- poetry run python -m spacy download en_core_web_sm

In [27]:
import spacy
import html2text
import requests
from difflib import SequenceMatcher
import re
from datetime import datetime
from datasets import Dataset
import os
from dotenv import load_dotenv

load_dotenv()

True

### Get started with Ollama

Follow instructions here: https://ollama.com/

Select a model to run locally using https://ollama.com/search.

In this case, I want to run `llama3.2:latest` (https://ollama.com/library/llama3.2).

So I run: `ollama pull llama3.2:latest`.

<img src="./assets/ollama_pull.png" width="600px" alt="ollama pull">

Then, I check that the model has been downloaded with: `ollama list`

<img src="./assets/ollama_list.png" width="600px" alt="ollama list">

Finally, I test that it works with `ollama run llama3.2:latest`

<img src="./assets/ollama_run.png" width="600px" alt="ollama run">

### Test whether we can reach Llama 3.2:latest with Ollama API

In [11]:
def generate_ollama_response(prompt, model="llama3.2:latest", temperature=0.7):
    """Generate response using Ollama API"""
    response = requests.post('http://localhost:11434/api/generate',
                           json={
                               "model": model,
                               "prompt": prompt,
                               "temperature": temperature,
                               "stream": False
                           })
    return response.json()['response']

output = generate_ollama_response(prompt="What is the capital of France?")
output

'The capital of France is Paris.'

### Check that we can retrieve data from the blog

In [17]:
url = "https://didierlopes.com/blog/feed.json"

response = requests.get(url)
blog_data = response.json()
blog_data["items"][0]

{'id': 'https://didierlopes.com/blog/ai-chatbots-wont-revolutionize-finance-but-intelligent-workspaces-will',
 'content_html': '<p align="center"><img width="600" src="https://didierlopes.com/blog/2024-12-27-ai-chatbots-wont-revolutionize-finance-but-intelligent-workspaces-will.png"></p>\n<p>Why the future of financial analysis isn\'t about chatbots, but about intelligent workspaces that combine your data, tools, and AI exactly when you need them.</p>\n<div style="border-top:1px solid #0088CC;margin:1.5em 0"></div>\n<p>When ChatGPT launched, everyone rushed to build financial chatbots. But they missed two fundamental truths:</p>\n<ul>\n<li>The best AI model is useless without access to your data.</li>\n<li>Access to data isn\'t enough - AI needs to handle complete workflows, not just conversations.</li>\n</ul>\n<p>The problem with financial chatbots:</p>\n<ol>\n<li>They can\'t access your proprietary data</li>\n<li>They can\'t handle complex financial workflows</li>\n<li>They force ana

### Clean html data to markdown format for AI

In [23]:
# Initialize HTML to text converter with stricter settings
h = html2text.HTML2Text()
h.ignore_links = True  # Now ignore links
h.ignore_images = True  # Now ignore images
h.ignore_emphasis = True  # Now ignore emphasis
h.body_width = 0  # Don't wrap text
h.skip_internal_links = True
h.inline_links = False
h.protect_links = False
h.images_to_alt = False  # Don't convert images to alt text
h.unicode_snob = True  # Use Unicode instead of ASCII
h.wrap_links = False

def cleanup_markdown(text):
    # Remove any remaining image markdown
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)
    # Remove empty lines
    text = re.sub(r'\n\s*\n', '\n\n', text)
    # Remove any remaining URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # Remove any remaining markdown links
    text = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', text)
    return text.strip()

html_content = blog_data["items"][0]["content_html"]

content_markdown = h.handle(html_content)
content_markdown_further_cleaned = cleanup_markdown(content_markdown)

print("BEFORE\n")
print(html_content[:300])
print("\nAFTER\n")
print(content_markdown_further_cleaned[:300])


BEFORE

<p align="center"><img width="600" src="https://didierlopes.com/blog/2024-12-27-ai-chatbots-wont-revolutionize-finance-but-intelligent-workspaces-will.png"></p>
<p>Why the future of financial analysis isn't about chatbots, but about intelligent workspaces that combine your data, tools, and AI exactl

AFTER

Why the future of financial analysis isn't about chatbots, but about intelligent workspaces that combine your data, tools, and AI exactly when you need them.

When ChatGPT launched, everyone rushed to build financial chatbots. But they missed two fundamental truths:

  * The best AI model is useless


### Count sentences in blog

This is important as it will allow users to select a number of QA pairs to be extracted based on the number of sentences in each blogpost.

In [26]:
def count_sentences_spacy(text):
    # Load the English language model
    # Use 'python -m spacy download en_core_web_sm' if not already downloaded
    nlp = spacy.load("en_core_web_sm")
    
    # Process the text
    doc = nlp(text)
    
    # Get sentences
    sentences = list(doc.sents)
    
    return len(sentences), sentences

# Example usage
text = """The Great Wall of China was built over many centuries by different dynasties. 
Construction began more than 2,300 years ago and continued through the Ming Dynasty in the 1600s. 
The wall was primarily built for defense purposes and spans approximately 13,171 miles."""

count, sentences = count_sentences_spacy(text)
print(f"Number of sentences: {count}")
print("\nSentences:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.text.strip()}")

Number of sentences: 3

Sentences:
1. The Great Wall of China was built over many centuries by different dynasties.
2. Construction began more than 2,300 years ago and continued through the Ming Dynasty in the 1600s.
3. The wall was primarily built for defense purposes and spans approximately 13,171 miles.


### Generate QA pairs from text

We are going to utilize Conversational Question Answering Dataset (CoQA)

In [31]:
def generate_qa_pairs(text, min_question_length=20):
    # Count sentences to determine the number of QA pairs to generate
    num_sentences, _ = count_sentences_spacy(text)  # Unpack the tuple
    # Generate 1 QA pair per 5 sentences, minimum 1 pair
    num_qa_pairs = max(1, num_sentences // 3) 
    
    # Generate questions
    question_prompt = f"""Generate {num_qa_pairs} specific questions that capture the main points from this post written by Didier Lopes (founder and CEO of OpenBB). 
    Create a series of questions that follow a conversational flow, as if in a dialogue about the content.
    Each question should build upon the previous ones, exploring the topic in more depth.
    Focus on what you found most interesting, original, or important.
    Ensure questions are diverse, covering different aspects of the content.
    Avoid generic questions and those answerable with a simple yes/no.
    Questions should encourage detailed responses that reveal key information from the text.
    
    Text: {text}
    
    Questions (in conversational order):"""

    # Higher temperature for more diverse questions
    questions_response = generate_ollama_response(
        question_prompt,
        model="llama3.2:latest",
        temperature=0.8
    )
    
    # Parse questions (assuming numbered list format)
    questions = [q.strip() for q in questions_response.split('\n') 
                if q.strip() and any(c.isdigit() for c in q)]
    
    qa_pairs = []
    for question in questions:
        # Remove leading numbers and dots
        question = ' '.join(question.split('.')[1:]).strip()
        
        # Filter out low-quality questions
        if (len(question) < min_question_length or 
            question.lower().startswith(('what is', 'tell me about', 'can you')) or
            'text' in question.lower()):
            continue
        
        # Generate answer
        answer_prompt = f"""You are Didier Lopes (founder and CEO of OpenBB).
        You wrote the post where this question was taken from.
        Provide an extremely concise answer, no longer than 2-3 sentences.
        Include key details or numbers if absolutely necessary.
        Avoid all unnecessary information.
        Be direct and to the point.
        
        Context: {text}
        Question: {question}
        Answer (2-3 sentences max): """
        
        # Lower temperature for more concise answers
        answer = generate_ollama_response(
            answer_prompt,
            model="llama3.2:latest",
            temperature=0.1
        )
        
        # Filter answers that have too little information or model wasn't able to generate a proper answer
        if (len(answer) > 40 and
            not any(x in answer.lower() for x in ['cannot', 'text does not', 'no information'])):
            qa_pairs.append({
                "question": question,
                "answer": answer,
                "context": text
            })
    
    # Remove duplicates
    filtered_pairs = []
    seen_questions = set()
    for pair in qa_pairs:
        q_normalized = ' '.join(pair['question'].lower().split())
        # Check if question is similar to any previously seen questions
        if not any(SequenceMatcher(None, q_normalized, sq).ratio() > 0.8 for sq in seen_questions):
            seen_questions.add(q_normalized)
            filtered_pairs.append(pair)
    
    return filtered_pairs

qa_pairs = generate_qa_pairs(content_markdown_further_cleaned)

print("Generated Q&A Pairs:")
for pair in qa_pairs[:5]:
    print(f"\nQ: {pair['question']}")
    print(f"A: {pair['answer']}")

Generated Q&A Pairs:

Q: Didier, you mentioned that when ChatGPT launched, everyone rushed to build financial chatbots  What do you think was the initial excitement about building these chatbots, and what did they miss in their approach?
A: The initial excitement around building financial chatbots was likely driven by the promise of automation and efficiency gains. However, they missed that the best AI model is only as good as the data it has access to, and that true value comes from integrating AI into a complete workflow, not just a conversational interface.

Q: You stated that the best AI model is useless without access to your data  Can you elaborate on why having access to data is crucial for effective AI implementation in financial analysis? What types of data do you think are most important to have?
A: Access to high-quality, relevant data is critical for training and fine-tuning AI models in finance. Proprietary company data, such as financial statements, transactions, and mark

### Creating the QA datset

In [37]:
# Set the format to be saved in the dataset
qa_dataset = {
    "title": [],
    "conversation": [],
    "context": [],
    "url": [],
    "date": []
}
# Process each post
for post in blog_data['items'][:2]:
    # Convert HTML to markdown
    content_markdown = h.handle(post['content_html'])
    content_markdown_further_cleaned = cleanup_markdown(content_markdown)
    
    # Parse the date string into a datetime object
    try:
        date_obj = datetime.fromisoformat( post['date_modified'].replace('Z', '+00:00'))
        # Format the date as YYYY-MM-DD
        formatted_date = date_obj.strftime('%Y-%m-%d')
    except ValueError:
        # If date parsing fails, use a placeholder
        formatted_date = 'Unknown'

    # Generate QA pairs
    qa_pairs = generate_qa_pairs(content_markdown_further_cleaned)

    # Create the conversation structure
    conversation = []
    for pair in qa_pairs:
        conversation.extend([
            {"role": "user", "content": pair['question']},
            {"role": "assistant", "content": pair['answer']}
        ])

    # Append to the formatted data structure directly
    qa_dataset["title"].append(post['title'])
    qa_dataset["conversation"].append(conversation)
    qa_dataset["context"].append(content_markdown_further_cleaned)
    qa_dataset["url"].append(post['url'])
    qa_dataset["date"].append(formatted_date)

qa_dataset["conversation"][0][:5]

[{'role': 'user',
  'content': 'You mentioned that when ChatGPT launched, everyone rushed to build financial chatbots  What were some of the fundamental truths that those who built these chatbots missed?'},
 {'role': 'assistant',
  'content': "Those building financial chatbots missed two fundamental truths:\n\n1. AI models are useless without access to your data.\n2. Access to data isn't enough - AI needs to handle complete workflows, not just conversations.\n\nThese limitations led to chatbots that can't access proprietary data, can't handle complex workflows and restrict analysts to an unnatural chat interface."},
 {'role': 'user',
  'content': 'I noticed that OpenBB emphasizes complete data access as a key component of its platform  Can you tell us more about how this is achieved and what kind of data sources are supported?'},
 {'role': 'assistant',
  'content': 'At OpenBB, we ensure complete data access by allowing users to run everything on-premise or in their VPC. We also support

### Push the dataset to HuggingFace

Set your HF_TOKEN through Hugging Face, following https://huggingface.co/docs/hub/en/security-tokens.

Notice that you will need "write" permissions to push the dataset into HuggingFace.

Copy the HF_TOKEN created and paste it in an .env file as following:

```
HF_TOKEN=hf_123abc456def
```

Then set your dataset_repo by setting your HuggingFace username followed by a / with the name that you want this dataset repository to have.

In [38]:
dataset_repo = "didierlopes/my-blog-qa-dataset"

# Create the Dataset object
dataset = Dataset.from_dict(qa_dataset)

# Save the dataset to the HuggingFace Hub
dataset.push_to_hub(
    dataset_repo,
    token=os.getenv('HF_TOKEN')
)

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 203.51ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  2.50it/s]


CommitInfo(commit_url='https://huggingface.co/datasets/didierlopes/test-my-blog-qa-dataset/commit/8918bbeb475e08ed60deb2302055ecb5628f6565', commit_message='Upload dataset', commit_description='', oid='8918bbeb475e08ed60deb2302055ecb5628f6565', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/didierlopes/test-my-blog-qa-dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='didierlopes/test-my-blog-qa-dataset'), pr_revision=None, pr_num=None)