# How to Make an Up-to-Date AI Chatbot

ChatGPT and other AI chatbots are amazing: they can write blog posts and emails, fix bugs, document code, and have been shown to complete tasks like planning trips, all from natural language prompts. Despite these capabilities, they fall short on the most recent information about the world and on out-of-domain knowledge, and they may provide incorrect information without the right prompting.

The way to help the language model in the areas where it falls short is sophisticated prompt templating and injecting information into the context. For example, if I were to ask ChatGPT,

>"What happened with Meta in the news today?" 

I would get the response:

> "I'm sorry, but my training data only goes up until September 2021, and I'm unable to provide real-time news updates. To find out the latest news about Meta (formerly known as Facebook), please visit a reliable news website or search for "Meta news" in your preferred search engine."

But I could solve this problem by providing more information as context along with the right prompt. For example, I could copy and paste an article and change the prompt to something like:

> "Given this article from today: `Article about META in the news today`. Answer the following: What happened with Meta in the news today?"

I now get the response:

> "In the news today, Elon Musk, the CEO of Tesla and owner of Twitter, has accused Meta's messaging app, WhatsApp, of being untrustworthy. The accusation was based on a tweet by a Twitter engineer, Foad Dabiri, who claimed that WhatsApp had been using the microphone in the background without their knowledge. Musk pointed out that WhatsApp cannot be trusted and reminded people that WhatsApp is owned by Meta (previously Facebook) and that its founders left Meta in disagreement over its policies and decisions, particularly regarding personal data processing. This news may have an impact on Meta's public image and its platforms, as trust and privacy are crucial factors for users of social media and messaging apps."

This is basic and oversimplified, but the idea is that we can help the language model provide what we ultimately want with a little bit of work: asking the right things and giving it the right information as context.
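
The copy-and-paste step above can be turned into a small template function. This is a minimal sketch rather than a prescribed format: the name `build_prompt` and the exact wording of the template are assumptions made for illustration, and the article text is a placeholder.

```python
def build_prompt(question: str, articles: list[str]) -> str:
    """Combine article text and a question into a single prompt string."""
    # Number each article so the model can tell the sources apart.
    context = "\n\n".join(
        f"Article {i}:\n{text.strip()}" for i, text in enumerate(articles, start=1)
    )
    return (
        "Given these articles from today:\n\n"
        f"{context}\n\n"
        f"Answer the following: {question}"
    )


prompt = build_prompt(
    "What happened with Meta in the news today?",
    ["Placeholder text of an article about Meta."],
)
print(prompt)
```

Keeping the template in one function makes it easy to swap in different articles for different questions, which is the part the rest of this post automates.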

Since we understand that this works for our one-off use case, how can we make something that will scale to many questions about different subjects? This is where vector databases and embeddings come into play.

## What is an embedding and a vector database?

An embedding is simply a representation of something as a vector, or in other words a list of numbers. It is a way to intelligently turn categorical things into numerical representations so that we can reason about them in different ways, like determining how similar they are or grouping them in multi-dimensional space. A vector database is a database that allows us to store our embeddings as vectors so that we can efficiently retrieve them. Since the vectors can be compared for similarity, a query against the database can return the most similar items.
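
To make "compared for similarity" concrete, here is a small sketch using cosine similarity, the same distance the Qdrant collection is configured with later in this post. The three-dimensional vectors and their names are made up for illustration; real embeddings from the model used below have 384 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, made up for illustration.
stored = {
    "meta_earnings": [0.9, 0.1, 0.0],
    "bitcoin_price": [0.1, 0.9, 0.2],
    "meta_whatsapp": [0.8, 0.0, 0.3],
}
query = [1.0, 0.0, 0.1]

# Rank stored items from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query, stored[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of ranking, but with indexes that avoid comparing the query against every stored vector.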

So suppose we consume news as it is created, create embeddings to represent the news, and store them in a vector database. When we ask a question, we will be able to pull the articles most similar to our query, give them to a large language model as context, and then ask the language model to provide us with information.

To support the ingestion, we will need to build a pipeline that is capable of ingesting data in real time, processing it, running it through an ML model to create an embedding, and then writing it to a vector database. We can use Bytewax as the real-time processing framework.

## Building the pipeline

We are going to leverage Bytewax to create a real-time embedding pipeline from a live news source; in this case, a financial news source via Alpaca. This pipeline could be scaled to consume many different news sources and aggregate them all into a vector database like Qdrant, as shown below.

In [1]:
# Install requirements
!pip install -r requirements.txt
!pip install unstructured transformers torch sentencepiece pydantic
!pip install qdrant-client==1.1.1







To start, we will define an input for the pipeline. The input is based on the `StatelessSource` class and takes its input from the Alpaca API, which provides a websocket-based feed. Note that this input cannot resume, so data will be lost if the service goes down.

In [2]:
import os
import json

from bytewax.dataflow import Dataflow
from bytewax.inputs import DynamicInput, StatelessSource
from bytewax.connectors.stdio import StdOutput
from websocket import create_connection  # pip install websocket-client

API_KEY = os.getenv("API_KEY")
API_SECRET = os.getenv("API_SECRET")

class AlpacaSource(StatelessSource):
    def __init__(self, worker_tickers):
        self.worker_tickers = worker_tickers

        self.ws = create_connection("wss://stream.data.alpaca.markets/v1beta1/news")
        print(self.ws.recv())
        self.ws.send(
            json.dumps(
                {"action":"auth",
                 "key":f"{API_KEY}",
                 "secret":f"{API_SECRET}"}
            )
        )
        print(self.ws.recv())
        self.ws.send(
            json.dumps(
                {"action":"subscribe","news":self.worker_tickers}
            )
        )
        print(self.ws.recv())

    def next(self):
        return self.ws.recv()


class AlpacaNewsInput(DynamicInput):
    TICKERS = ["*"]

    def build(self, worker_index, worker_count):
        prods_per_worker = int(len(self.TICKERS) / worker_count)
        worker_tickers = self.TICKERS[
            int(worker_index * prods_per_worker) : int(
                worker_index * prods_per_worker + prods_per_worker
            )
        ]
        return AlpacaSource(worker_tickers)


flow = Dataflow()
flow.input("input", AlpacaNewsInput())
flow.inspect(print)
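
The slicing in `AlpacaNewsInput.build` splits the ticker list evenly across workers, so it is worth checking how it behaves when the list does not divide evenly. The sketch below mirrors that arithmetic in a standalone helper; the name `tickers_for_worker` and the example ticker symbols are assumptions made for illustration.

```python
def tickers_for_worker(tickers: list[str], worker_index: int, worker_count: int) -> list[str]:
    """Mirror the slicing done in AlpacaNewsInput.build for one worker."""
    per_worker = int(len(tickers) / worker_count)
    start = worker_index * per_worker
    return tickers[start : start + per_worker]


# One worker: the wildcard subscription is assigned as expected.
one_worker = tickers_for_worker(["*"], 0, 1)
print(one_worker)

# Two workers: per_worker is 0, so neither worker subscribes to anything.
two_workers = [tickers_for_worker(["*"], i, 2) for i in range(2)]
print(two_workers)

# Three tickers over two workers: the last ticker is never assigned.
uneven = [tickers_for_worker(["AAPL", "META", "TSLA"], i, 2) for i in range(2)]
print(uneven)
```

With the `["*"]` wildcard used here, the input therefore only makes sense with a single worker.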



## Preprocess data

With our input configured, we will move on to processing our data. We will use pydantic to define a model for our document and then use unstructured to process the text. Finally, to finish preparing the data we will chunk the text into the appropriate lengths.

In [3]:
from unstructured.partition.html import partition_html
from unstructured.cleaners.core import clean, replace_unicode_quotes, clean_non_ascii_chars
from unstructured.staging.huggingface import chunk_by_attention_window

from unstructured.staging.huggingface import stage_for_transformers

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

In [4]:
news = [json.dumps([{"T":"n","id":32052192,"headline":"What\u0026#39;s Going On With ContraFecta Stock Today","summary":"\n\tContraFect Corporation (NASDAQ: CFRX) shares are trading higher Thursday morning. However, there is no specific news to justify the move.\n","author":"Vandana Singh","created_at":"2023-04-27T18:24:49Z","updated_at":"2023-04-27T18:24:49Z","url":"https://www.benzinga.com/general/biotech/23/04/32052192/whats-going-on-with-contrafecta-stock-today","content":"\u003cul\u003e\r\n\t\u003cli\u003e\u003cstrong\u003eContraFect Corporation\u0026nbsp;\u003c/strong\u003e(NASDAQ:\u003ca class=\"ticker\" href=\"https://www.benzinga.com/stock/CFRX#NASDAQ\"\u003eCFRX\u003c/a\u003e) shares are trading higher Thursday morning. However, there is no specific news to justify the move.\u003c/li\u003e\r\n\t\u003cli\u003eWednesday morning, ContraFect\u0026nbsp;\u003ca class=\"editor-rtfLink\" href=\"https://www.benzinga.com/pressreleases/23/04/g32009856/contrafect-announces-first-patient-dosed-in-the-phase-1b2-study-of-exebacase-in-patients-with-chro\" style=\"color:#4a6ee0; background:transparent; margin-top:0pt; margin-bottom:0pt\" target=\"_blank\"\u003eannounced the dosing\u003c/a\u003e\u0026nbsp;of the first patient in Phase 1b/2 of exebacase in the setting of an arthroscopic debridement, antibiotics, irrigation, and retention procedure in patients with chronic prosthetic joint infections of the knee due to\u0026nbsp;\u003cem\u003eStaphylococcus aureus\u003c/em\u003e\u0026nbsp;or Coagulase-Negative Staphylococci.\u003c/li\u003e\r\n\t\u003cli\u003eThe study was initiated\u0026nbsp;\u003ca class=\"editor-rtfLink\" href=\"https://www.benzinga.com/pressreleases/23/04/g31631700/contrafect-announces-initiation-of-a-phase-1b2-study-of-exebacase-in-patients-with-chronic-prosthe\" style=\"color:#4a6ee0; background:transparent; margin-top:0pt; margin-bottom:0pt\" target=\"_blank\"\u003eearlier this month\u003c/a\u003e.\u003c/li\u003e\r\n\t\u003cli\u003eContraFect stock is 
gaining on heavy volume, with a session volume of 55 million shares traded, compared to the trailing 100-day volume of 3.08 million shares.\u003c/li\u003e\r\n\t\u003cli\u003eAccording to data from\u0026nbsp;\u003ca class=\"editor-rtfLink\" href=\"https://benzinga.grsm.io/register174\" style=\"color:#4a6ee0; background:transparent; margin-top:0pt; margin-bottom:0pt\" target=\"_blank\"\u003eBenzinga Pro\u003c/a\u003e, CFRX has a 52-week high of $362.4 and a 52-week low of $0.90.\u003c/li\u003e\r\n\t\u003cli\u003e\u003cstrong\u003ePrice Action:\u003c/strong\u003e\u0026nbsp;CFRX shares are up 68.20% at $2.22 on the last check Thursday.\u003c/li\u003e\r\n\u003c/ul\u003e\r\n ","symbols":["CFRX"],"source":"benzinga"}])]

In [5]:
import hashlib
from pydantic import BaseModel
from typing import Any, Optional

class Document(BaseModel):
    id: str
    group_key: Optional[str] = None
    metadata: Optional[dict] = {}
    text: Optional[list] = None
    embeddings: Optional[list] = []
        

flow.flat_map(lambda x: json.loads(x))


# Clean the article text and populate the Document model
def parse_article(_data):
    document_id = hashlib.md5(_data['content'].encode()).hexdigest()
    document = Document(id = document_id)
    print(_data)
    article_elements = partition_html(text=_data['content'])
    print(article_elements)
    _data['content'] = clean_non_ascii_chars(replace_unicode_quotes(clean(" ".join([str(x) for x in article_elements]))))
    _data['headline'] = clean_non_ascii_chars(replace_unicode_quotes(clean(_data['headline'])))
    _data['summary'] = clean_non_ascii_chars(replace_unicode_quotes(clean(_data['summary'])))

    document.text = [_data['headline'], _data['summary'], _data['content']]
    document.metadata['headline'] = _data['headline']
    document.metadata['summary'] = _data['summary']
    document.metadata['url'] = _data['url']
    document.metadata['symbols'] = _data['symbols']
    document.metadata['author'] = _data['author']
    document.metadata['created_at'] = _data['created_at']
    return (document.id, document)

flow.map(parse_article)

# chunk the news article and summary
def chunk(document_id__document):
    document_id, document = document_id__document
    chunks = []
    for text in document.text:
        chunks += chunk_by_attention_window(text, tokenizer)
    
    document.text = chunks
    return(document_id, document)

flow.map(chunk)
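
Chunking exists because the embedding model can only attend to a limited number of tokens at once, so longer text has to be split into pieces that fit. As a rough illustration of the idea, the sketch below splits on whitespace-separated words instead of model tokens; it is a simplified stand-in, not how `chunk_by_attention_window` is implemented, and the name `chunk_by_words` is hypothetical.

```python
def chunk_by_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start : start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_by_words(
    "ContraFect shares are trading higher Thursday morning", max_words=3
)
print(chunks)
```

A token-based chunker follows the same pattern but counts tokens produced by the tokenizer, which is what guarantees each chunk fits the model.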
    

## Create Embeddings

An embedding is a vector representation of some input, which could be words, sounds, letters, or events. We are using a Hugging Face model to transform our chunks of text into embeddings. These will be used later when we want to query for new information.

In [6]:
# create embedding and store in vector db
def embedding(document_id__document):
    document_id, document = document_id__document
    for chunk in document.text:
        inputs = tokenizer(chunk, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        embeddings = result.last_hidden_state[:, 0, :].cpu().detach().numpy()
        lst = embeddings.flatten().tolist()
        document.embeddings.append(lst)
    return (document_id, document)
    
flow.map(embedding)
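
The `embedding` function above keeps only the hidden state of the first token of each chunk. As far as I know, the model card for `sentence-transformers/all-MiniLM-L6-v2` instead describes averaging all token vectors, weighted by the attention mask (mean pooling), so that may be worth trying if retrieval quality is poor. The sketch below shows the difference on made-up two-dimensional token vectors in plain Python; `mean_pool` is a hypothetical helper, not part of any library.

```python
def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose mask value is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Made-up 2-dimensional token vectors; the last position is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

cls_vector = tokens[0]                 # what last_hidden_state[:, 0, :] selects
mean_vector = mean_pool(tokens, mask)  # average over the non-padding tokens
print(cls_vector, mean_vector)
```

Whichever pooling is used, the same choice has to be applied to both the stored chunks and the query, otherwise the vectors are not comparable.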
    

In [7]:
import uuid

from bytewax.outputs import DynamicOutput, StatelessSink

from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from qdrant_client.models import PointStruct
from qdrant_client.http.api_client import UnexpectedResponse


class _QdrantVectorSink(StatelessSink):
    
    def __init__(self, client, collection_name):
        self._client=client
        self._collection_name=collection_name

    def write(self, id_hash__doc):
        id_hash, doc = id_hash__doc
        # Copy the metadata so the document itself is not mutated.
        _payload = dict(doc.metadata)
        _payload.update({"text":doc.text})
        self._client.upsert(
            collection_name=self._collection_name,
            points=[
                PointStruct(
                    # Derive a stable, unique UUID per chunk. A bare chunk index
                    # would make every article overwrite points 0..n.
                    id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{id_hash}-{idx}")),
                    vector=vector,
                    payload=_payload
                )
                for idx, vector in enumerate(doc.embeddings)
            ]
        )


class QdrantVectorOutput(DynamicOutput):
    """Qdrant.

    Workers are the unit of parallelism.

    Can support at-least-once processing. Messages from the resume
    epoch will be duplicated right after resume.

    """
    def __init__(self, collection_name, vector_size, schema='', host='localhost', port=6333):
        self.collection_name=collection_name
        self.vector_size=vector_size
        self.schema=schema
        # Use the port argument rather than a hard-coded value.
        self.client=QdrantClient(host, port=port)
    
        # Create the collection only if it does not exist yet.
        try: 
            self.client.get_collection(collection_name=self.collection_name)
        except UnexpectedResponse:
            # schema is kept as an attribute but not passed on here, on the
            # assumption that recreate_collection does not accept it.
            self.client.recreate_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(size=self.vector_size, distance=Distance.COSINE),
            )

    def build(self, worker_index, worker_count):
        
        return _QdrantVectorSink(self.client, self.collection_name)
    

flow.output("output", QdrantVectorOutput("test_collection", 384))



In [11]:
for articles in news:
    for article in json.loads(articles):
        out1 = parse_article(article)
        out2 = chunk(out1)
        out3 = embedding(out2)
        output = QdrantVectorOutput("test_collection", 384)
        output_sink = output.build(1,1)
        output_sink.write(out3)

{'T': 'n', 'id': 32052192, 'headline': 'What&#39;s Going On With ContraFecta Stock Today', 'summary': '\n\tContraFect Corporation (NASDAQ: CFRX) shares are trading higher Thursday morning. However, there is no specific news to justify the move.\n', 'author': 'Vandana Singh', 'created_at': '2023-04-27T18:24:49Z', 'updated_at': '2023-04-27T18:24:49Z', 'url': 'https://www.benzinga.com/general/biotech/23/04/32052192/whats-going-on-with-contrafecta-stock-today', 'content': '<ul>\r\n\t<li><strong>ContraFect Corporation&nbsp;</strong>(NASDAQ:<a class="ticker" href="https://www.benzinga.com/stock/CFRX#NASDAQ">CFRX</a>) shares are trading higher Thursday morning. However, there is no specific news to justify the move.</li>\r\n\t<li>Wednesday morning, ContraFect&nbsp;<a class="editor-rtfLink" href="https://www.benzinga.com/pressreleases/23/04/g32009856/contrafect-announces-first-patient-dosed-in-the-phase-1b2-study-of-exebacase-in-patients-with-chro" style="color:#4a6ee0; background:transparent;

In [12]:
from bytewax.testing import run_main

run_main(flow)

[{"T":"success","msg":"connected"}]
[{"T":"success","msg":"authenticated"}]
[{"T":"subscription","news":["*"]}]
[{"T":"n","id":32167101,"headline":"Bitcoin Defies Gravity: Grayscale\u0026#39;s Bold Prediction Amid Rate Increase Fears","summary":"As the price of Bitcoin (CRYPTO: BTC) experiences fluctuations ahead of the U.S.","author":"Murtuza Merchant","created_at":"2023-05-02T18:38:37Z","updated_at":"2023-05-02T18:38:38Z","url":"https://www.benzinga.com/markets/cryptocurrency/23/05/32167101/bitcoin-defies-gravity-grayscales-bold-prediction-amid-rate-increase-fears","content":"\u003cp\u003eAs the price of \u003cstrong\u003eBitcoin \u003c/strong\u003e(CRYPTO: \u003ca class=\"ticker\" href=\"https://www.benzinga.com/quote/btc/usd\"\u003eBTC\u003c/a\u003e) experiences fluctuations ahead of the U.S. Federal Reserve\u0026#39;s upcoming interest rate decision on Wednesday, \u003cstrong\u003eGrayscale \u003c/strong\u003ehas suggested\u0026nbsp;cryptocurrency prices could still benefit from a

[{"T":"n","id":32167101,"headline":"Bitcoin Defies Gravity: Grayscale\u0026#39;s Bold Prediction Amid Rate Increase Fears","summary":"As the price of Bitcoin (CRYPTO: BTC) experiences fluctuations ahead of the U.S.","author":"Murtuza Merchant","created_at":"2023-05-02T18:38:37Z","updated_at":"2023-05-02T18:41:27Z","url":"https://www.benzinga.com/markets/cryptocurrency/23/05/32167101/bitcoin-defies-gravity-grayscales-bold-prediction-amid-rate-increase-fears","content":"\u003cp\u003eAs the price of \u003cstrong\u003eBitcoin \u003c/strong\u003e(CRYPTO: \u003ca class=\"ticker\" href=\"https://www.benzinga.com/quote/btc/usd\"\u003eBTC\u003c/a\u003e) experiences fluctuations ahead of the U.S. Federal Reserve\u0026#39;s upcoming interest rate decision on Wednesday, \u003cstrong\u003eGrayscale \u003c/strong\u003ehas suggested\u0026nbsp;cryptocurrency prices could still benefit from a rate increase, provided that the outcome \u003ca href=\"https://www.benzinga.com/markets/cryptocurrency/23/04/3

[{"T":"n","id":32169840,"headline":"How To Attend TransDigm Gr Q2 2023 Earnings Conference Call","summary":" ","author":"Benzinga Insights","created_at":"2023-05-02T18:45:10Z","updated_at":"2023-05-02T18:45:10Z","url":"https://www.benzinga.com/news/earnings/23/05/32169840/how-to-attend-transdigm-gr-q2-2023-earnings-conference-call","content":"\u003cp\u003eTransDigm Gr (NYSE:\u003ca class=\"ticker\" href=\"https://www.benzinga.com/stock/TDG#NYSE\"\u003eTDG\u003c/a\u003e) will host a conference call at 11:00 AM ET on May 9, 2023, to discuss Q2 2023 earnings results.\u003c/p\u003e\n\u003ch3\u003eHow to Attend TransDigm Gr (TDG) Conference Call\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://edge.media-server.com/mmc/p/3qt2zk2b\"\u003eFollow this link to access the live webcast.\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eTo listen to the call, dial  (US) or  (International)\u003c/p\u003e\n\u003ch3\u003eWhat Is an Earnings Conference Call?\u003c/h3\u003e\n\u003cp\u003eEarnings conference

[{"T":"n","id":32169895,"headline":"Building A Stock Portfolio? Here\u0026#39;s Your Competition","summary":"While there\u0026#39;s no silver bullet for building a portfolio, you don\u0026#39;t have to be a Wall Street strategist to do it. Benchmarks are a useful guidepost, and you can learn a lot from them about your performance and strategy as you move along.","author":"Tornado","created_at":"2023-05-02T18:47:48Z","updated_at":"2023-05-02T18:47:48Z","url":"https://www.benzinga.com/23/05/32169895/building-a-stock-portfolio-heres-your-competition","content":"\u003cp\u003eWhile there\u0026#39;s no silver bullet for building a portfolio, you don\u0026#39;t have to be a Wall Street strategist to do it. Benchmarks are a useful guidepost, and you can learn a lot from them about your performance and strategy as you move along.\u003c/p\u003e\r\n\r\n\u003cp\u003eConsider the S\u0026amp;P 500. It features around 500 companies and covers 80% of available US public equity market capitalization, a

[{"T":"n","id":32169895,"headline":"Building A Stock Portfolio? Here\u0026#39;s Your Competition","summary":"While there\u0026#39;s no silver bullet for building a portfolio, you don\u0026#39;t have to be a Wall Street strategist to do it. Benchmarks are a useful guidepost, and you can learn a lot from them about your performance and strategy as you move along.","author":"Tornado","created_at":"2023-05-02T18:47:48Z","updated_at":"2023-05-02T18:48:13Z","url":"https://www.benzinga.com/23/05/32169895/building-a-stock-portfolio-heres-your-competition","content":"\u003cp\u003eWhile there\u0026#39;s no silver bullet for building a portfolio, you don\u0026#39;t have to be a Wall Street strategist to do it. Benchmarks are a useful guidepost, and you can learn a lot from them about your performance and strategy as you move along.\u003c/p\u003e\r\n\r\n\u003cp\u003eConsider the S\u0026amp;P 500. It features around 500 companies and covers 80% of available US public equity market capitalization, a

thread '<unnamed>' panicked at 'Box<dyn Any>', src/inputs.rs:465:31
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace



KeyboardInterrupt: (src/run.rs:120:17) interrupt signal received, all processes have been shut down

In [None]:
query_string = "what is new on META"
inputs = tokenizer(query_string, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)
embeddings = result.last_hidden_state[:, 0, :].cpu().detach().numpy()
lst = embeddings.flatten().tolist()

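# Connect to the same local Qdrant instance the pipeline writes to.
client = QdrantClient("localhost", port=6333)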
query_vector = lst
hits = client.search(
    collection_name="test_collection",
    query_vector=query_vector,
    limit=5  # Return 5 closest points
)
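
The last step described at the start of this post is to hand the retrieved articles to a language model as context. Here is a minimal sketch of turning search hits into context text, assuming each hit exposes a `payload` dict with the `url`, `headline`, and `text` keys written by the sink above. The helper name `hits_to_context` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real search results so the example runs without a database.

```python
from types import SimpleNamespace


def hits_to_context(hits, max_articles: int = 3) -> str:
    """Format search hits as context text, keeping one entry per article URL."""
    seen_urls = set()
    sections = []
    for hit in hits:  # hits are assumed to be sorted by descending score
        payload = hit.payload or {}
        url = payload.get("url")
        if url in seen_urls:
            continue  # several chunks of one article share the same payload
        seen_urls.add(url)
        body = " ".join(payload.get("text", []))
        sections.append(f"{payload.get('headline', '')}\n{body}")
        if len(sections) == max_articles:
            break
    return "\n\n".join(sections)


# Stand-ins for the objects returned by client.search; values are made up.
fake_hits = [
    SimpleNamespace(score=0.9, payload={"url": "u1", "headline": "A", "text": ["a1", "a2"]}),
    SimpleNamespace(score=0.8, payload={"url": "u1", "headline": "A", "text": ["a1", "a2"]}),
    SimpleNamespace(score=0.5, payload={"url": "u2", "headline": "B", "text": ["b1"]}),
]
context = hits_to_context(fake_hits)
print(context)
```

The resulting string can be placed in the prompt ahead of the question, in the same way the article was pasted in by hand in the introduction.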