# Power your products with ChatGPT and your own data

This notebook walks through how to build starter Q&A and Chatbot applications using the ChatGPT API and your own data.

It is laid out in these sections:
- **Setup:** 
    - Initiate variables and source the data
- **Lay the foundations:**
    - Set up the vector database to accept vectors and data
    - Load the dataset, chunk the data up for embedding and store in the vector database
- **Make it a product:**
    - Add a retrieval step where users provide queries and we return the most relevant entries
    - Summarise search results with GPT-3
    - Test out this basic Q&A app in Streamlit
- **Build your moat:**
    - Create an Assistant class to manage context and interact with our bot
    - Use the Chatbot to answer questions using semantic search context
    - Test out this basic Chatbot app in Streamlit
    
Upon completion, you will have the building blocks to create your own production chatbot or Q&A application using OpenAI APIs and a vector database.

This notebook was originally presented with [these slides](https://drive.google.com/file/d/1dB-RQhZC_Q1iAsHkNNdkqtxxXqYODFYy/view?usp=share_link), which provide visual context for this journey.

In [23]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Setup

First, we'll set up our libraries and environment variables.

In [24]:
import openai
import os
import requests
import numpy as np
import pandas as pd
from typing import Iterator
import tiktoken
import textract
from numpy import array, average

from database import get_redis_connection

# Set our default models and chunking size
from config import COMPLETIONS_MODEL, EMBEDDINGS_MODEL, CHAT_MODEL, TEXT_EMBEDDING_CHUNK_SIZE, VECTOR_FIELD_NAME

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

# Python reports unclosed sockets as ResourceWarning, so filter on that category
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [25]:
pd.set_option('display.max_colwidth', 0)

In [26]:
data_dir = os.path.join(os.curdir,'data')
pdf_files = sorted([x for x in os.listdir(data_dir) if 'DS_Store' not in x])
pdf_files

["FIA Practice Directions - Competitor's Staff Registration System.pdf",
 'fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf',
 'fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf',
 'fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf',
 'fia_formula_1_financial_regulations_iss.13.pdf']

## Laying the foundations

### Storage

We're going to use Redis as our database for both the document contents and the vector embeddings. You will need the full Redis Stack to use RediSearch, the module that enables semantic search; more detail is in the [docs for Redis Stack](https://redis.io/docs/stack/get-started/install/docker/).

To set this up locally, you will need to install Docker and then run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.

The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).

After setting up the Docker instance of Redis Stack, follow the instructions below to open a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.
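
To make the `COSINE` distance metric used for the index concrete, here is a minimal pure-Python sketch. It is only an illustration: Redis computes the distance internally, and the helper name `cosine_distance` is made up for this example. With this metric the reported score is 1 minus the cosine similarity, so smaller values mean closer matches.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
print(cosine_distance(query, [2.0, 0.0]))  # same direction: distance 0.0
print(cosine_distance(query, [0.0, 3.0]))  # orthogonal: distance 1.0
```

Because the metric depends only on direction, the length of an embedding does not affect how close two chunks appear.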

In [27]:
# Setup Redis
from redis import Redis
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField,
    NumericField
)
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)

redis_client = get_redis_connection()

In [28]:
# Constants
VECTOR_DIM = 1536 #len(data['title_vector'][0]) # length of the vectors
#VECTOR_NUMBER = len(data)                 # initial number of vectors
PREFIX = "sportsdoc"                            # prefix for the document keys
DISTANCE_METRIC = "COSINE"                # distance metric for the vectors (ex. COSINE, IP, L2)

In [29]:
# Create search index

# Index
INDEX_NAME = "f1-index"           # name of the search index
VECTOR_FIELD_NAME = 'content_vector'

# Define RediSearch fields for each of the columns in the dataset
# This is where you should add any additional metadata you want to capture
filename = TextField("filename")
text_chunk = TextField("text_chunk")
file_chunk_index = NumericField("file_chunk_index")

# define RediSearch vector fields to use HNSW index

text_embedding = VectorField(VECTOR_FIELD_NAME,
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC
    }
)
# Add all our field objects to a list to be created as an index
fields = [filename,text_chunk,file_chunk_index,text_embedding]

In [30]:
redis_client.ping()

True

In [31]:
# Optional step to drop the index if it already exists
#redis_client.ft(INDEX_NAME).dropindex()

# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except Exception as e:
    print(e)
    # Create RediSearch Index
    print('Not there yet. Creating')
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

Index already exists


### Ingestion

We'll load up our PDFs and do the following:
- Initiate our tokenizer
- Run a processing pipeline to:
    - Mine the text from each PDF
    - Split them into chunks and embed them
    - Store them in Redis
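
The chunking step can be sketched as follows. This is not the code in `transformers.py`, which is not shown here; it is a simplified stand-in that splits a token list into fixed-size windows. The function name `chunk_tokens` and the whitespace tokenizer are assumptions made for illustration, whereas the notebook itself uses tiktoken's `cl100k_base` encoding and the `TEXT_EMBEDDING_CHUNK_SIZE` setting from `config`.

```python
from typing import Iterator, List, Sequence

def chunk_tokens(tokens: Sequence, chunk_size: int) -> Iterator[List]:
    """Yield consecutive windows of at most `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield list(tokens[start:start + chunk_size])

# Hypothetical stand-in for a real tokenizer: split on whitespace.
text = "Checks and scrutineering shall be carried out by duly appointed officials"
tokens = text.split()

for index, chunk in enumerate(chunk_tokens(tokens, chunk_size=4)):
    # In the real pipeline each chunk would be embedded and stored with its index.
    print(index, " ".join(chunk))
```

Keeping the chunk index alongside each chunk is what allows a field such as `file_chunk_index` to record where in the document a search hit came from.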

In [32]:
# The transformers.py file contains all of the transforming functions, including ones to chunk, embed and load data
# For more details, open the file and work through each function individually
from transformers import handle_file_string

In [41]:
%%time
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")
print(pdf_files)

print(data_dir)
# data_dir = "data/"

for pdf_file in pdf_files:
    pdf_path = os.path.join(data_dir,pdf_file)
    print(pdf_path)
    text = textract.process(pdf_path)
    # Extract the raw text from each PDF using textract
    # text = textract.process(pdf_path, method='pdfminer')
    handle_file_string((pdf_file,text.decode("utf-8")),tokenizer,redis_client,VECTOR_FIELD_NAME,INDEX_NAME)
    
    
    

["FIA Practice Directions - Competitor's Staff Registration System.pdf", 'fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf', 'fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf', 'fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', 'fia_formula_1_financial_regulations_iss.13.pdf']
data/
data/FIA Practice Directions - Competitor's Staff Registration System.pdf
data/fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf
data/fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf
data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf
data/fia_formula_1_financial_regulations_iss.13.pdf
CPU times: total: 6.98 s
Wall time: 20.4 s


In [42]:
%%time
# This step takes about 5 minutes

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")
print(pdf_files)

# data_dir = "data/"

# Process each PDF file and prepare for embedding
for pdf_file in pdf_files:
    
    pdf_path = os.path.join(data_dir,pdf_file)
    print(pdf_path)
    
    # Extract the raw text from each PDF using textract
    text = textract.process(pdf_path) #, method='pdfminer'
    print(text)
    
    # Chunk each document, embed the contents and load to Redis
    handle_file_string((pdf_file,text.decode("utf-8")),tokenizer,redis_client,VECTOR_FIELD_NAME,INDEX_NAME)

["FIA Practice Directions - Competitor's Staff Registration System.pdf", 'fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf', 'fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf', 'fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', 'fia_formula_1_financial_regulations_iss.13.pdf']
data/FIA Practice Directions - Competitor's Staff Registration System.pdf
b"FIA Legal Department Practice Directions - Competitor's Staff Registration System\r\n\r\n17 March 2011\r\n\r\nPRACTICE DIRECTIONS\r\nCOMPETITOR'S STAFF REGISTRATION SYSTEM FIA FORMULA ONE WORLD CHAMPIONSHIP\r\nSince the FIA General Assembly of 5 November 2010, the main members of a Competitor's staff have to register with the FIA if they take part in an FIA World Championship. This system, which is based on disciplinary and safety considerations, is part of the FIA International Sporting Code (herein referred to as the `ISC').\r\nThe objective of this system is to preserve and protec

In [43]:
# Check that our docs have been inserted
redis_client.ft(INDEX_NAME).info()['num_docs']

'691'

## Make it a product

Now we can test that our search works as intended by:
- Querying our data in Redis using semantic search and verifying results
- Adding a step to pass the results to GPT-3 for summarisation
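
The `get_redis_results` helper used in the next cells lives in `database.py`, which is not shown in this notebook. As a rough sketch of what a semantic search request to RediSearch involves, the snippet below builds a KNN query string and packs a query embedding into the FLOAT32 byte layout the index expects. The helper names, the `$vec_param` parameter name and the `vector_score` alias are assumptions for illustration, not the notebook's actual implementation.

```python
import struct
from typing import Sequence

def build_knn_query(k: int, vector_field: str = "content_vector") -> str:
    """Build a RediSearch KNN query string over all documents ('*')."""
    return f"*=>[KNN {k} @{vector_field} $vec_param AS vector_score]"

def to_float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack floats as little-endian 32-bit values, matching TYPE FLOAT32."""
    return struct.pack(f"<{len(vector)}f", *vector)

query_string = build_knn_query(k=2)
query_blob = to_float32_bytes([0.1, 0.2, 0.3])  # toy 3-dimensional embedding

print(query_string)
print(len(query_blob))  # 4 bytes per dimension
```

In a real call, the query string and the packed bytes would be passed to the search index together, with the bytes bound to the `vec_param` name.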

In [44]:
from database import get_redis_results

In [45]:
%%time

f1_query='what are the criteria for disqualification'

result_df = get_redis_results(redis_client,f1_query,index_name=INDEX_NAME)
result_df.head(2)

CPU times: total: 0 ns
Wall time: 602 ms


Unnamed: 0,id,result,certainty
0,0,"\r b) Require a car to be dismantled by the Competitor to make sure that the conditions of eligibility or conformity are fully satisfied.\r c) Require a Competitor to pay the reasonable expenses which exercise of the powers mentioned in this Article may entail.\r d) Require a Competitor to supply them with such parts or samples as they may deem necessary.\r \r 2022 Formula 1 Sporting Regulations ©2022 Fédération Internationale de l'Automobile\r \r 35/107\r \r 19 October 2022 Issue 9\r \r 31.6 The Race Director or the clerk of the course may require that any car involved in an accident be stopped and checked.\r 31.7 Checks and scrutineering shall be carried out by duly appointed officials who shall also be responsible for the operation of the parc fermé and who alone are authorised to give instructions to the Competitors.\r 31.8 The stewards will publish the findings of the scrutineers each time cars are checked during the Competition. These results will not include any specific figure except when a car is found to be in breach of the Technical Regulations.\r \r 32) CHANGES OF DRIVER 32.1 During a Championship each Competitor will be permitted to use a maximum of four (4)\r drivers in races.\r 32.2 A change of driver may be made at any time before the start of the qualifying practice session provided any change proposed after the end of initial scrutineering receives the consent of the stewards. Additional changes for reasons of force majeure will be considered separately.\r 32.3 Any new driver may score points in the Championship.\r 32.4 In addition to the provisions of Article 32.1, each Competitor will be permitted to use additional drivers during P1 and P2 provided that:\r a) The FIA are informed which cars and drivers each Competitor intends to use in each session no less than twenty-four (24) hours before the scheduled start of P1.",0.198021292686
1,1,"\r The following sets out examples of the type of behaviours which might constitute an infringement of the FIA Code of Good Standing (non-exhaustive list of examples) in relation to a person who is subject to the Code of Good Standing:\r - giving instructions to a driver or other member of a Competitor's staff with the intention or with the likely result of causing an accident, collision or crash or a race to be stopped or suspended \r - any action which is likely to endanger or materially compromise the safety of any driver, other members of the Competitor's staff, other participants in a race, Officials or any spectators or other members of the public who attend an event \r - giving instructions to make any changes to a car in breach of any safety requirements or regulations \r - giving instructions to tamper with or adversely affect the set-up or performance of the car of any other Competitor \r 4 / 5\r \r FIA Legal Department Practice Directions - Competitor's Staff Registration System\r \r 17 March 2011\r \r - giving instructions to a driver or otherwise taking any action by which the result or course of a race may be influenced or affected for the purpose of profiting or assisting someone to profit through betting on the outcome of a race or any part of a race or\r - being convicted of a criminal offence (other than a driving offence) which carries a maximum prison sentence of five years.\r VII. AMENDMENTS TO THE COMPETITOR'S STAFF REGISTRATION SYSTEM\r The FIA will not make any amendments with regard to the Competitor's Staff Registration System, either to the International Sporting Code or to the Practice Directions, prior consultation with the Competitors entered in the FIA Formula One World Championship and adequate opportunity to provide input on the proposed amendments.",0.201568305492


In [46]:
# Build a prompt to provide the original query, the result and ask to summarise for the user
summary_prompt = '''Summarise this result in a bulleted list to answer the search query a customer has sent.
Search query: SEARCH_QUERY_HERE
Search result: SEARCH_RESULT_HERE
Summary:
'''
summary_prepped = summary_prompt.replace('SEARCH_QUERY_HERE',f1_query).replace('SEARCH_RESULT_HERE',result_df['result'][0])
summary = openai.Completion.create(engine=COMPLETIONS_MODEL,prompt=summary_prepped,max_tokens=500)
# Response provided by GPT-3
print(summary['choices'][0]['text'])

- Checks and scrutineering by duly appointed officials
- Publish the findings of the scrutineers
- Maximum of 4 drivers in a Championship
- Changes of driver before start of qualifying practice
- Additional drivers can be used during P1 and P2 with prior notice to the FIA


### Search

Now that we've got our knowledge embedded and stored in Redis, we can create an internal search application. It's not sophisticated, but it'll get the job done for us.

In the directory containing this app, execute ```streamlit run search.py```. This will open up a Streamlit app in your browser where you can ask questions of your embedded data.

__Example Questions__:
- what is the cost cap for a power unit in 2023
- what should competitors include on their application form

## Build your moat

The Q&A app is useful, but it limits how complex an interaction can be: if the user asks a sub-optimal question, the system does nothing to prompt them for more information or to steer the conversation down the right path.

For the next step we'll make a Chatbot using the Chat Completions endpoint, which will:
- Be given instructions on how it should act and what the goals of its users are
- Be supplied some required information that it needs to collect
- Go back and forth with the customer until it has populated that information
- Say a trigger word that will kick off semantic search and summarisation of the response

For more details on our Chat Completions endpoint and how to interact with it, please check out the docs [here](https://platform.openai.com/docs/guides/chat).

### Framework

This section outlines a basic framework for working with the API and storing context of previous conversation "turns". Once this is established, we'll extend it to use our retrieval endpoint.
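
The idea of storing conversation turns can be sketched without calling the API. In the snippet below, `fake_model` is a hypothetical stand-in for the Chat Completions call, and `ConversationLog` is an illustrative name. The point is only that every request sends the full list of prior messages, and each reply is appended to that list.

```python
from typing import Callable, Dict, List

ChatMessage = Dict[str, str]

def fake_model(messages: List[ChatMessage]) -> ChatMessage:
    """Hypothetical stand-in for the chat endpoint: reports what it received."""
    return {"role": "assistant", "content": f"I have seen {len(messages)} message(s)."}

class ConversationLog:
    """Keep every turn so the model receives the full context each time."""

    def __init__(self, model: Callable[[List[ChatMessage]], ChatMessage]):
        self.model = model
        self.history: List[ChatMessage] = []

    def ask(self, content: str) -> ChatMessage:
        # Record the user turn, send the whole history, then record the reply.
        self.history.append({"role": "user", "content": content})
        reply = self.model(self.history)
        self.history.append(reply)
        return reply

log = ConversationLog(fake_model)
print(log.ask("What can you do to help me")["content"])
print(log.ask("Tell me more about option 2")["content"])
```

The second call sees three messages (two user turns and one assistant turn), which is why a follow-up such as "option 2" can be resolved against the earlier answer.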

In [47]:
# A basic example of how to interact with our ChatCompletion endpoint
# It requires a list of "messages", consisting of a "role" (one of system, user or assistant) and "content"
question = 'How can you help me'


completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": question}
  ]
)
print(f"{completion['choices'][0]['message']['role']}: {completion['choices'][0]['message']['content']}")

assistant: I can help you in various ways depending on your needs. As an AI, I can assist you with answering questions, providing information, helping with research, giving suggestions or recommendations, providing language translations, and offering general assistance in various topics. Feel free to let me know how I can assist you specifically.


In [48]:
from termcolor import colored

# A basic class to create a message as a dict for chat
class Message:
    
    
    def __init__(self,role,content):
        
        self.role = role
        self.content = content
        
    def message(self):
        
        return {"role": self.role,"content": self.content}
        
# Our assistant class we'll use to converse with the bot
class Assistant:
    
    def __init__(self):
        self.conversation_history = []

    def _get_assistant_response(self, prompt):
        
        try:
            completion = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=prompt
            )
            
            response_message = Message(completion['choices'][0]['message']['role'],completion['choices'][0]['message']['content'])
            return response_message.message()
            
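        except Exception as e:
            
            # Return a message dict so callers can always read 'role' and 'content'
            return Message('assistant',f'Request failed with exception {e}').message()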

    def ask_assistant(self, next_user_prompt, colorize_assistant_replies=True):
        self.conversation_history.extend(next_user_prompt)
        assistant_response = self._get_assistant_response(self.conversation_history)
        self.conversation_history.append(assistant_response)
        return assistant_response
            
        
    def pretty_print_conversation_history(self, colorize_assistant_replies=True):
        for entry in self.conversation_history:
            if entry['role'] == 'system':
                pass
            else:
                prefix = entry['role']
                content = entry['content']
                output = colored(prefix +':\n' + content, 'green') if colorize_assistant_replies and entry['role'] == 'assistant' else prefix +':\n' + content
                print(output)

In [49]:
# Initiate our Assistant class
conversation = Assistant()

# Create a list to hold our messages and insert both a system message to guide behaviour and our first user question
messages = []
system_message = Message('system','You are a helpful business assistant who has innovative ideas')
user_message = Message('user','What can you do to help me')
messages.append(system_message.message())
messages.append(user_message.message())
messages

[{'role': 'system',
  'content': 'You are a helpful business assistant who has innovative ideas'},
 {'role': 'user', 'content': 'What can you do to help me'}]

In [50]:
# Get back a response from the Chatbot to our question
response_message = conversation.ask_assistant(messages)
print(response_message['content'])

As a business assistant, I can help you in various ways by providing innovative ideas and support. Here are some ways I can assist you:

1. Brainstorming: I can help you generate new and creative ideas for your business, whether it's for marketing campaigns, product development, or process improvements. We can collaborate on brainstorming sessions to explore different possibilities.

2. Market Research: I can conduct market research and analysis to help you gather valuable insights about your target audience, competitors, industry trends, and potential growth opportunities. This information can guide your decision-making process and help you stay ahead in the market.

3. Strategy Development: I can assist you in developing business strategies that align with your goals and objectives. We can work together to create a comprehensive plan that addresses areas such as market positioning, competitive advantage, growth strategies, and risk management.

4. Process Optimization: I can analyze 

In [51]:
next_question = 'Tell me more about option 2'

# Initiate a fresh messages list and insert our next question
messages = []
user_message = Message('user',next_question)
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
print(response_message['content'])

Market research is a crucial aspect of running a successful business. By conducting market research, you can gain valuable insights about your target audience, competitors, industry trends, and potential growth opportunities. Here's a closer look at how I can assist you with market research:

1. Target Audience Insights: I can help you identify and understand your target audience better. This involves collecting demographic data, preferences, behavior patterns, and buying habits. By analyzing this information, we can create more effective marketing strategies and tailor your products or services to meet their needs.

2. Competitor Analysis: I can conduct thorough research on your competitors, analyzing their strengths, weaknesses, marketing strategies, pricing, and customer perceptions. This information can help you identify your competitive advantages and develop strategies to differentiate yourself in the market.

3. Industry Trends: Staying up to date with the latest industry trends

In [52]:
# Print out a log of our conversation so far

conversation.pretty_print_conversation_history()

user:
What can you do to help me
assistant:
As a business assistant, I can help you in various ways by providing innovative ideas and support. Here are some ways I can assist you:

1. Brainstorming: I can help you generate new and creative ideas for your business, whether it's for marketing campaigns, product development, or process improvements. We can collaborate on brainstorming sessions to explore different possibilities.

2. Market Research: I can conduct market research and analysis to help you gather valuable insights about your target audience, competitors, industry trends, and potential growth opportunities. This information can guide your decision-making process and help you stay ahead in the market.

3. Strategy Development: I can assist you in developing business strategies that align with your goals and objectives. We can work together to create a comprehensive plan that addresses areas such as market positioning, competitive advantage, growth strategies, and risk man

### Knowledge retrieval

Now we'll extend the class to call a downstream service when a stop sequence is spoken by the Chatbot.

The main changes are:
- The system message is more comprehensive, giving criteria for the Chatbot to advance the conversation
- Adding an explicit stop sequence for it to use when it has the info it needs
- Extending the class with a function ```_get_search_results``` which sources Redis results
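
The control flow behind these changes can be sketched independently of OpenAI and Redis. Here `search_fn` and `answer_fn` are hypothetical callables standing in for the Redis lookup and the follow-up chat call, the stub return values are made up for the demonstration, and `TRIGGER_PHRASE` mirrors the phrase the system prompt asks the Chatbot to say.

```python
from typing import Callable

TRIGGER_PHRASE = "searching for answers"

def route_reply(reply: str, question: str,
                search_fn: Callable[[str], str],
                answer_fn: Callable[[str, str], str]) -> str:
    """Return the reply as-is, or run retrieval when the trigger phrase appears."""
    if TRIGGER_PHRASE in reply.lower():
        context = search_fn(question)
        return answer_fn(question, context)
    return reply

# Hypothetical stand-ins for the Redis search and the second chat call.
def stub_search(question: str) -> str:
    return "Each Competitor may use a maximum of four drivers in races."

def stub_answer(question: str, context: str) -> str:
    return f"Based on the regulations: {context}"

print(route_reply("Certainly, what year would you like this for?",
                  "driver limits", stub_search, stub_answer))
print(route_reply("Searching for answers.",
                  "driver limits", stub_search, stub_answer))
```

Matching on the lower-cased reply keeps the check tolerant of capitalisation and trailing punctuation in the Chatbot's wording.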

In [53]:
# Updated system prompt requiring Question and Year to be extracted from the user
system_prompt = '''
You are a helpful Formula 1 knowledge base assistant. You need to capture a Question and Year from each customer.
The Question is their query on Formula 1, and the Year is the year of the applicable Formula 1 season.
If they haven't provided the Year, ask them for it again.
Once you have the Year, say "searching for answers".

Example 1:

User: I'd like to know the cost cap for a power unit

Assistant: Certainly, what year would you like this for?

User: 2023 please.

Assistant: Searching for answers.
'''

# New Assistant class to add a vector database call to its responses
class RetrievalAssistant:
    
    def __init__(self):
        self.conversation_history = []  

    def _get_assistant_response(self, prompt):
        
        try:
            completion = openai.ChatCompletion.create(
              model=CHAT_MODEL,
              messages=prompt,
              temperature=0.1
            )
            
            response_message = Message(completion['choices'][0]['message']['role'],completion['choices'][0]['message']['content'])
            return response_message.message()
            
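        except Exception as e:
            
            # Return a message dict so callers can always read 'role' and 'content'
            return Message('assistant',f'Request failed with exception {e}').message()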
    
    # The function to retrieve Redis search results
    def _get_search_results(self,prompt):
        latest_question = prompt
        search_content = get_redis_results(redis_client,latest_question,INDEX_NAME)['result'][0]
        return search_content
        

    def ask_assistant(self, next_user_prompt):
        self.conversation_history.extend(next_user_prompt)
        assistant_response = self._get_assistant_response(self.conversation_history)
        
        # Answer normally unless the reply contains the trigger phrase "searching for answers"
        if 'searching for answers' in assistant_response['content'].lower():
            question_extract = openai.Completion.create(model=COMPLETIONS_MODEL,prompt=f"Extract the user's latest question and the year for that question from this conversation: {self.conversation_history}. Extract it as a sentence stating the Question and Year")
            search_result = self._get_search_results(question_extract['choices'][0]['text'])
            
            # We insert an extra system prompt here to give fresh context to the Chatbot on how to use the Redis results
            # In this instance we add it to the conversation history, but in production it may be better to hide
            self.conversation_history.insert(-1,{"role": 'system',"content": f"Answer the user's question using this content: {search_result}. If you cannot answer the question, say 'Sorry, I don't know the answer to this one'"})
            #[self.conversation_history.append(x) for x in next_user_prompt]
            
            assistant_response = self._get_assistant_response(self.conversation_history)
            print(next_user_prompt)
            print(assistant_response)
            self.conversation_history.append(assistant_response)
            return assistant_response
        else:
            self.conversation_history.append(assistant_response)
            return assistant_response
            
        
    def pretty_print_conversation_history(self, colorize_assistant_replies=True):
        for entry in self.conversation_history:
            if entry['role'] == 'system':
                pass
            else:
                prefix = entry['role']
                content = entry['content']
                output = colored(prefix +':\n' + content, 'green') if colorize_assistant_replies and entry['role'] == 'assistant' else prefix +':\n' + content
                #prefix = entry['role']
                print(output)

In [54]:
conversation = RetrievalAssistant()
messages = []
system_message = Message('system',system_prompt)
user_message = Message('user','How can a competitor be disqualified from competition')
messages.append(system_message.message())
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
response_message

{'role': 'assistant',
 'content': 'Certainly, what year would you like this for?'}

In [55]:
messages = []
user_message = Message('user','For 2023 please.')
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
#response_message

[{'role': 'user', 'content': 'For 2023 please.'}]
{'role': 'assistant', 'content': 'Searching for answers.'}


In [56]:
conversation.pretty_print_conversation_history()

user:
How can a competitor be disqualified from competition
assistant:
Certainly, what year would you like this for?
user:
For 2023 please.
assistant:
Searching for answers.


### Chatbot

Now we'll put all this into action with a real (basic) Chatbot.

In the directory containing this app, execute ```streamlit run chat.py```. This will open up a Streamlit app in your browser where you can ask questions of your embedded data. 

__Example Questions__:
- what is the cost cap for a power unit in 2023
- what should competitors include on their application form
- how can a competitor be disqualified

### Consolidation

Over the course of this notebook you have:
- Laid the foundations of your product by embedding your knowledge base
- Created a Q&A application to serve basic use cases
- Extended this to be an interactive Chatbot

These are the foundational building blocks of any Q&A or Chat application using our APIs - these are your starting point, and we look forward to seeing what you build with them!