In [1]:
# You can either store the OpenAI key in the "OPENAI_API_KEY" environment variable
# or read it from a config.ini file, as done here.
import configparser
import os

workingFolder = r'C:\Users\jfrancis\AI Journey\Gen AI'
# Read the configuration file
config = configparser.ConfigParser()
config.read(os.path.join(workingFolder, 'config.ini'))
OPENAI_API_KEY=config.get('General','OPENAI_API_KEY')
ACTIVELOOP_TOKEN=config.get('General','ACTIVELOOP_TOKEN')
ACTIVELOOP_ORG_ID=config.get('General','ACTIVELOOP_ORG_ID')
HUGGINGFACEHUB_API_TOKEN=config.get('General','HUGGINGFACEHUB_API_TOKEN')
GOOGLE_API_KEY=config.get('General','GOOGLE_API_KEY')
GOOGLE_CSE_ID=config.get('General','GOOGLE_CSE_ID')
COHERE_API_KEY=config.get('General','COHERE_API_KEY')

In [2]:
# Obtain the tokens from the OpenAI and Activeloop websites beforehand; here we read them from config.ini
import os
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN
# Activeloop organization id, used later in the Deep Lake dataset path
# TODO: use your organization id here (by default, the org id is your username)
my_activeloop_org_id = ACTIVELOOP_ORG_ID

## Build a Customer Support Question Answering Chatbot

### Having a Knowledge Base

LLMs can significantly enhance chatbot functionality by associating broader intents with documents from a Knowledge Base (KB) instead of specific questions and answers. This approach streamlines intent management and generates more tailored responses to user inquiries.

GPT-3 has a maximum prompt size of around 4,000 tokens, which is substantial but insufficient for incorporating an entire knowledge base in a single prompt.

Future LLMs may not have this limitation while retaining the text generation capabilities. However, for now, we need to design a solution around it.
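
To make the size constraint concrete, here is a rough sketch that estimates whether a set of documents would fit into a single prompt. It assumes the common rule of thumb of about four characters per token for English text; the helper names and the sample documents are hypothetical, and an exact count would need a real tokenizer such as tiktoken.

```python
from typing import Sequence

# Rough estimate only: assumes ~4 characters per token (a rule of thumb
# for English text), not an exact tokenizer count.
CHARS_PER_TOKEN = 4
MAX_PROMPT_TOKENS = 4000  # approximate limit mentioned above


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for `text`."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def fits_in_prompt(documents: Sequence[str], reserved_for_answer: int = 500) -> bool:
    """Check whether all documents fit in one prompt, leaving room for the answer."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_answer <= MAX_PROMPT_TOKENS


# hypothetical knowledge base: 8 articles of about 10,000 characters each
knowledge_base = ["x" * 10_000 for _ in range(8)]
print(sum(estimate_tokens(doc) for doc in knowledge_base))
print(fits_in_prompt(knowledge_base))
```

Even a handful of articles exceeds the budget several times over, which is why only the most relevant chunks are passed to the model.
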

### Workflow

This project aims to build a chatbot that leverages GPT-3 to search for answers within documents. The workflow for the experiment is explained in the following diagram.

<img src="https://images.spr.so/cdn-cgi/imagedelivery/j42No7y-dcokJuNgXeA0ig/c2508d93-940f-4e93-b84a-ffdab7b535a2/Screenshot_2023-06-09_at_13.24.32/w=1920,quality=80"/>

First, we scrape content from online articles, split it into small chunks, compute their embeddings, and store them in Deep Lake. Then, we use the user's query to retrieve the most relevant chunks from Deep Lake and insert them into a prompt, which the LLM uses to generate the final answer.

It is important to note that there is always a risk of generating hallucinations or false information when using LLMs. Although this might not be acceptable for many customer support use cases, the chatbot can still help operators draft answers that they can double-check before sending them to the user.

In the next steps, we'll explore how to manage conversations with GPT-3 and provide examples to demonstrate the effectiveness of this workflow:

First, set up the OPENAI_API_KEY and ACTIVELOOP_TOKEN environment variables with your API keys and tokens.
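
One possible way to combine both options (environment variable or config.ini) is sketched below. The `get_secret` helper is hypothetical and not part of any library; the in-memory configuration and its placeholder values stand in for a real config.ini.

```python
import configparser
import os


def get_secret(name, config, section="General"):
    """Prefer the environment variable; fall back to the config file."""
    value = os.environ.get(name)
    if value:
        return value
    if config.has_option(section, name):
        return config.get(section, name)
    raise KeyError(f"{name} not found in the environment or in [{section}]")


# in-memory stand-in for config.ini (placeholder values, not real keys)
config = configparser.ConfigParser()
config.read_string(
    "[General]\n"
    "OPENAI_API_KEY = sk-placeholder\n"
    "ACTIVELOOP_TOKEN = al-placeholder\n"
)

resolved = {key: get_secret(key, config) for key in ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN")}
print(sorted(resolved))
```

With this arrangement a key exported in the shell takes precedence, and the file is only consulted when the variable is missing.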

As we’re going to use the UnstructuredURLLoader LangChain class, and it uses the unstructured Python library, let’s install it using pip. It is recommended to install the latest version of the library. Nonetheless, please be aware that the code has been tested specifically on version 0.7.7.

In [3]:
#langchain==0.0.208
#deeplake==3.6.5
#openai==0.27.8
#tiktoken==0.4.0
#unstructured==0.7.7
#python-magic-bin==0.4.14

In [4]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI
from langchain.document_loaders import UnstructuredURLLoader
from langchain import PromptTemplate



These libraries provide functionality for handling OpenAI embeddings, managing vector storage, splitting text, and interacting with the OpenAI API. They also enable the creation of a context-aware question-answering system, incorporating retrieval and text generation.

The database for our chatbot will consist of articles regarding technical issues.

In [5]:
# we'll use information from the following articles
urls = ['https://beebom.com/what-is-nft-explained/',
        'https://beebom.com/how-delete-spotify-account/',
        'https://beebom.com/how-download-gif-twitter/',
        'https://beebom.com/how-use-chatgpt-linux-terminal/',
        'https://beebom.com/how-delete-spotify-account/',
        'https://beebom.com/how-save-instagram-story-with-music/',
        'https://beebom.com/how-install-pip-windows/',
        'https://beebom.com/how-check-disk-usage-linux/']
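
Note that the Spotify article appears twice in the list above, so its chunks would be embedded and stored twice. If that is not intended, one way to drop repeated entries while keeping the original order is sketched here on a shortened, illustrative list (`sample_urls` is a made-up name).

```python
sample_urls = [
    "https://beebom.com/how-delete-spotify-account/",
    "https://beebom.com/how-download-gif-twitter/",
    "https://beebom.com/how-delete-spotify-account/",  # repeated entry
]

# dict.fromkeys keeps the first occurrence of each key and preserves order
unique_urls = list(dict.fromkeys(sample_urls))
print(unique_urls)
```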

#### 1: Split the documents into chunks and compute their embeddings

We load the documents from the provided URLs and split them into chunks using the CharacterTextSplitter with a chunk size of 1000 and no overlap:

In [6]:
# load the documents with UnstructuredURLLoader (it relies on the unstructured library)
loader = UnstructuredURLLoader(urls=urls)
docs_not_splitted = loader.load()

# we split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs_not_splitted)

Created a chunk of size 1226, which is longer than the specified 1000
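
The warning above appears because CharacterTextSplitter only splits on its separator (a blank line, `"\n\n"`, by default) and then merges the pieces up to the chunk size; a single piece that is longer than the chunk size is kept whole. The following is a simplified plain-Python sketch of that behavior, not LangChain's actual implementation, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    `chunk_size` characters; an oversized single piece is kept whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


text = "short paragraph\n\n" + "x" * 30 + "\n\nanother short one"
for chunk in split_on_separator(text, chunk_size=20):
    print(len(chunk), repr(chunk[:20]))
```

The middle paragraph has no blank line inside it, so it ends up as one chunk that is longer than the requested size, just like the 1226-character chunk reported above.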


In [7]:
docs

[Document(page_content='Home  Internet  NFTs Explained: What is an NFT and What is Its Use\n\nNFTs Explained: What is an NFT and What is Its Use\n\nArjun Sha\n\nLast Updated: December 6, 2021 4:42 pm\n\nAfter Bitcoin and Blockchain, NFT is another word to have entered our lexicon. The buzzword is everywhere and people are wondering what is NFT and what is its use? Well, there is not really a one-line explainer. And that’s why we have brought a comprehensive explainer on NFT, what is its use in digital art, and more. So without wasting any time, let’s go ahead and learn about NFTs (Non-fungible Token) in complete detail.\n\nWhat is NFT: A Definitive Explainer (2021)\n\nHere, we have mentioned all the questions that people have in their minds regarding NFT. You can click on the table to find all the sections that we have covered in this article and click on the link to move to the corresponding section.\n\nTable of Contents\n\nNFTs Explained: What is NFT in Crypto?\n\nWhat is the Use of 

Next, we compute the embeddings using OpenAIEmbeddings and store them in a Deep Lake vector store on the cloud.

In [8]:
# Before executing the following code, make sure to have
# your OpenAI key saved in the “OPENAI_API_KEY” environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_dataset_name = "langchain_course_customer_support"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!



Dataset(path='hub://jfrancis/langchain_course_customer_support', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (131, 1536)  float32   None   
    id        text      (131, 1)      str     None   
 metadata     json      (131, 1)      str     None   
   text       text      (131, 1)      str     None   


 

['bd5535cd-82f5-11ee-b60b-401c83da435e',
 'bd5535ce-82f5-11ee-98a7-401c83da435e',
 'bd5535cf-82f5-11ee-b082-401c83da435e',
 'bd5535d0-82f5-11ee-b8f9-401c83da435e',
 'bd5535d1-82f5-11ee-9aea-401c83da435e',
 'bd5535d2-82f5-11ee-8297-401c83da435e',
 'bd5535d3-82f5-11ee-9092-401c83da435e',
 'bd5535d4-82f5-11ee-bdca-401c83da435e',
 'bd5535d5-82f5-11ee-a0b8-401c83da435e',
 'bd5535d6-82f5-11ee-9f19-401c83da435e',
 'bd5535d7-82f5-11ee-99ea-401c83da435e',
 'bd5535d8-82f5-11ee-bac4-401c83da435e',
 'bd5535d9-82f5-11ee-b229-401c83da435e',
 'bd5535da-82f5-11ee-b51c-401c83da435e',
 'bd5535db-82f5-11ee-9369-401c83da435e',
 'bd5535dc-82f5-11ee-bdf0-401c83da435e',
 'bd5535dd-82f5-11ee-9e80-401c83da435e',
 'bd5535de-82f5-11ee-96ea-401c83da435e',
 'bd5535df-82f5-11ee-a556-401c83da435e',
 'bd5535e0-82f5-11ee-b910-401c83da435e',
 'bd5535e1-82f5-11ee-ba1e-401c83da435e',
 'bd5535e2-82f5-11ee-8dfa-401c83da435e',
 'bd5535e3-82f5-11ee-a04c-401c83da435e',
 'bd5535e4-82f5-11ee-88e9-401c83da435e',
 'bd5535e5-82f5-

To retrieve the most similar chunks to a given query, we can use the similarity_search method of the Deep Lake vector store:

In [9]:
# let's see the top relevant documents to a specific query
query = "how to check disk usage in linux?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Home  Tech  How to Check Disk Usage in Linux (4 Methods)

How to Check Disk Usage in Linux (4 Methods)

Beebom Staff

Last Updated: June 19, 2023 5:14 pm

There may be times when you need to download some important files or transfer some photos to your Linux system, but face a problem of insufficient disk space. You head over to your file manager to delete the large files which you no longer require, but you have no clue which of them are occupying most of your disk space. In this article, we will show some easy methods to check disk usage in Linux from both the terminal and the GUI application.

Monitor Disk Usage in Linux (2023)

Table of Contents

Check Disk Space Using the df Command
		
Display Disk Usage in Human Readable FormatDisplay Disk Occupancy of a Particular Type

Check Disk Usage using the du Command
		
Display Disk Usage in Human Readable FormatDisplay Disk Usage for a Particular DirectoryCompare Disk Usage of Two Directories
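
Conceptually, a similarity search ranks the stored embeddings by how close they are to the embedding of the query. The sketch below illustrates the idea with cosine similarity on tiny hand-made vectors. The real vectors have 1,536 dimensions (see the dataset summary above), the distance metric Deep Lake actually uses may differ, and all names and numbers here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# toy 3-dimensional "embeddings" (made up for illustration)
stored = [
    ("df command", [0.9, 0.1, 0.0]),
    ("delete Spotify account", [0.0, 0.2, 0.9]),
    ("du command", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]
for text, score in top_k(query_vector, stored, k=2):
    print(f"{score:.3f}  {text}")
```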


#### 2: Craft a prompt for GPT-3 using the suggested strategies

We will create a prompt template that incorporates role-prompting, relevant Knowledge Base information, and the user's question:

In [10]:
# let's write a prompt for a customer support chatbot that
# answers questions using information extracted from our db
template = """You are an exceptional customer support chatbot that gently answer questions.

You know the following context information.

{chunks_formatted}

Answer to the following question from a customer. Use only information from the previous context information. Do not invent stuff.

Question: {query}

Answer:"""

prompt = PromptTemplate(
    input_variables=["chunks_formatted", "query"],
    template=template,
)

The template sets the chatbot's persona as an exceptional customer support chatbot. The template takes two input variables: chunks_formatted, which consists of the pre-formatted chunks from articles, and query, representing the customer's question. The objective is to generate an accurate answer using only the provided chunks without creating any false or invented information.
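
To see what the model actually receives, the sketch below fills the same two placeholders with Python's built-in `str.format`. For a simple template like this one, the result should match what `PromptTemplate.format` produces, although that equivalence is an assumption here. The shortened `demo_template` and the sample chunks are made up for illustration.

```python
# shortened stand-in for the template above, with the same two placeholders
demo_template = """You know the following context information.

{chunks_formatted}

Question: {query}

Answer:"""

# made-up chunks standing in for retrieved article text
sample_chunks = [
    "The df command reports free disk space.",
    "The du command reports the size of a directory.",
]

rendered = demo_template.format(
    chunks_formatted="\n\n".join(sample_chunks),
    query="How to check disk usage in linux?",
)
print(rendered)
```

Ending the prompt with "Answer:" nudges a completion model to continue with the answer itself rather than restating the question.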

#### 3: Utilize the GPT-3 model with a temperature of 0 for text generation

To generate a response, we first retrieve the top-k chunks most similar to the user query (the call below does not pass k, so LangChain's default applies, which should be 4), format the prompt, and send it to the GPT-3 model with a temperature of 0 so that the output is as deterministic as possible.

In [11]:
# the full pipeline

# user question
query = "How to check disk usage in linux?"

# retrieve relevant chunks
docs = db.similarity_search(query)
retrieved_chunks = [doc.page_content for doc in docs]

# format the prompt
chunks_formatted = "\n\n".join(retrieved_chunks)
prompt_formatted = prompt.format(chunks_formatted=chunks_formatted, query=query)

# generate answer
llm = OpenAI(model="text-davinci-003", temperature=0)
answer = llm(prompt_formatted)
print(answer)

 You can check disk usage in Linux using the df command. This command will show the current disk usage and the available disk space in Linux. You can also use the Disk Usage Analyzer Tool to check disk usage in Linux. This tool will scan the entire device and display a ring chart of the disk occupancy for all the folders. You can also use the Gnome Disk Tool to check disk usage in Linux.
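
The pipeline above can also be wrapped in a function so that retrieval and generation are easy to swap or test. This is only a sketch: `answer_question`, `retrieve`, and `generate` are hypothetical names, and the stubs return canned text so that the block runs without API keys.

```python
def answer_question(query, retrieve, generate, prompt_template, k=3):
    """Retrieve k chunks, build the prompt, and return the generated answer."""
    chunks = retrieve(query, k)
    chunks_formatted = "\n\n".join(chunks)
    prompt_text = prompt_template.format(chunks_formatted=chunks_formatted, query=query)
    # completions often start with a space, so strip surrounding whitespace
    return generate(prompt_text).strip()


# Stubs so the sketch runs offline. With the objects from this notebook,
# they could be replaced by something like:
#   retrieve = lambda q, k: [d.page_content for d in db.similarity_search(q, k=k)]
#   generate = llm
def fake_retrieve(query, k):
    corpus = ["Use the df command.", "Use the du command.", "Unrelated text.", "More text."]
    return corpus[:k]


def fake_generate(prompt_text):
    return " (stub answer based on %d characters of prompt)" % len(prompt_text)


demo_template = "Context:\n{chunks_formatted}\n\nQuestion: {query}\n\nAnswer:"
print(answer_question("How to check disk usage in linux?",
                      fake_retrieve, fake_generate, demo_template, k=2))
```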
