# Embedding Zendesk articles for search

This notebook shows how we prepared a dataset of Zendesk Help Center articles for search, used in [Question_answering_using_embeddings.ipynb](Question_answering_using_embeddings.ipynb).

Procedure:

0. Prerequisites: Import libraries, set API key (if needed)
1. Collect: We download the Help Center articles from Zendesk, grouped by section
2. Chunk: Documents are split into short, semi-self-contained sections to be embedded
3. Embed: Each section is embedded with the OpenAI API
4. Store: Embeddings are saved in a CSV file (for large datasets, use a vector database)

## 0. Prerequisites

### Import libraries

In [None]:
# imports
import os  # for file paths and environment variables
import re
import typing  # for type hints
from datetime import datetime  # for date-stamped output folders
from pprint import pprint

import numpy as np  # for arrays to store embeddings
import openai  # for generating embeddings
import pandas as pd  # for DataFrames to store articles and embeddings
import tiktoken  # for counting tokens
from bs4 import BeautifulSoup  # for stripping HTML tags from article bodies
from scipy import spatial  # for calculating vector similarities for search
from zenpy import Zenpy  # Zendesk API client
from zenpy.lib.api_objects import Ticket


Install any missing libraries with `pip install` in your terminal. E.g.,

```zsh
pip install openai
```

(You can also do this in a notebook cell with `!pip install openai`.)

If you install any libraries, be sure to restart the notebook kernel.

### Set API key (if needed)

Note that the OpenAI library will try to read your API key from the `OPENAI_API_KEY` environment variable. If you haven't already, set this environment variable by following [these instructions](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

In [None]:
# Read credentials from the environment; never hard-code secrets in a notebook
openai.organization = os.getenv("OPENAI_ORGANIZATION")  # optional
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
MAX_INPUT_TOKENS = 8191
COMPLETIONS_MODEL = "text-davinci-003"
CHAT_COMPLETIONS_MODEL = "gpt-3.5-turbo"

## Configure the Zendesk API

The Zendesk API is configured in the [Zendesk dashboard](https://app.zendesk.com/hc/en-us/articles/360001111134-Zendesk-API-Configuration).

In [None]:
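# Zenpy accepts an API token.
# Credentials are read from environment variables instead of being hard-coded;
# the variable names below are this notebook's own convention.
creds = {
    "email": os.environ["ZENDESK_EMAIL"],
    "token": os.environ["ZENDESK_API_TOKEN"],
    "subdomain": "omnicentra",
}

# Default
zenpy_client = Zenpy(**creds)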

## 1. Collect articles

In this example, we'll download the Help Center articles from Zendesk, section by section, and label each one as either "IT" or "HR".

In [None]:
def get_date_string():
    """Return today's date as YYYY-MM-DD, used to name output folders."""
    return datetime.now().strftime("%Y-%m-%d")


def fetch_zendesk_sections():
    """Fetch all Help Center sections, relabeling each as "IT" or "HR"."""
    sections = []
    for section in zenpy_client.help_center.sections():
        # Any section other than "IT Queries" is treated as HR
        section.name = "IT" if section.name == "IT Queries" else "HR"
        sections.append(section)
    return sections


def fetch_all_zendesk_articles():
    """Fetch and pretty-print every Help Center article."""
    articles = zenpy_client.help_center.articles()
    for article in articles:
        pprint(article)
    return articles


def fetch_zendesk_articles_by_section(sections):
    """Return (title, body, category) tuples for the articles in each section."""
    my_articles = []
    for _section in sections:
        category = "IT" if _section.name == "IT" else "HR"
        print(f"Searching for articles in section {_section.name}")
        articles = zenpy_client.help_center.sections.articles(section=_section)
        print(f"Found {len(articles)} articles in section {_section.name}")
        for article in articles:
            my_articles.append((article.title, article.body, category))
    return my_articles

### Fetch all article sections

In [None]:
article_sections = fetch_zendesk_sections()
print(article_sections)

In [None]:
for section in article_sections:
    print(section.name)

### Fetch all articles for each section

In [None]:
articles = fetch_zendesk_articles_by_section(article_sections)
len(articles)

In [None]:
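def create_txt_knowledge_base(articles, path: str):
    """Write every (title, body, category) tuple to <path>/base.txt."""
    # makedirs also creates missing parent folders (os.mkdir fails on nested paths)
    os.makedirs(path, exist_ok=True)

    with open(f"{path}/base.txt", "w", encoding="utf-8") as file:
        for title, body, category in articles:
            file.write(f"{title}\n{body}\n{category}\n\n")
    return True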

In [None]:
create_txt_knowledge_base(articles, f"knowledge_base/{get_date_string()}")

## 2. Chunk documents

Now that we have our reference documents, we need to prepare them for search.

Because GPT can only read a limited amount of text at once, we'll split each document into chunks short enough to be read.

For these Zendesk articles, we'll:
- Strip HTML tags (e.g., `<div>`, `<p>`) from each article body
- Keep each article's title and category alongside its text; the title is prepended to the body at embedding time to help the model understand the context
- Split any article that exceeds the embedding model's input limit (8,191 tokens) into smaller pieces

In [None]:
def num_tokens_from_text(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [None]:
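def clean_up_text(articles):
    """Strip HTML from each article body and split bodies that exceed the token limit."""
    encoding = tiktoken.get_encoding("cl100k_base")
    cleaned_articles = []
    for title, body, category in articles:
        cleaned_body = BeautifulSoup(body or "", "html.parser").get_text().strip()
        # The title is prepended (with a space) at embedding time, so reserve room for it
        budget = MAX_INPUT_TOKENS - num_tokens_from_text(title.strip() + " ")
        tokens = encoding.encode(cleaned_body)
        if len(tokens) <= budget:
            cleaned_articles.append((title, cleaned_body, category))
            continue
        # Split the cleaned text on token boundaries (the original sliced the raw HTML by characters)
        for start in range(0, len(tokens), budget):
            chunk = encoding.decode(tokens[start:start + budget])
            cleaned_articles.append((title, chunk, category))
    return cleaned_articles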


In [None]:
CLEANED_ARTICLES = clean_up_text(articles)
np.array(CLEANED_ARTICLES).shape

In [None]:
# print example data
for article in CLEANED_ARTICLES[:5]:
    print(article[0])
    display(article[1][:77])
    print(article[2])
    print("-"*50)


Beyond the hard input limit, long sections can also be split recursively into smaller sections.

There's no perfect recipe for splitting text into sections.

Some tradeoffs include:
- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Shorter sections allow more sections to be retrieved, which may help with recall
- Overlapping sections may help prevent answers from being cut by section boundaries
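
The last tradeoff, overlapping sections, can be sketched with a sliding window. This is a minimal illustration rather than part of the pipeline above: the function name `chunk_with_overlap` is hypothetical, and it counts whitespace-separated words as a stand-in for tokens.

```python
from typing import List


def chunk_with_overlap(words: List[str], chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split words into chunks of `chunk_size`, each sharing `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny example: 10 words, windows of 4 words that overlap by 2
example_words = [f"w{i}" for i in range(10)]
example_chunks = chunk_with_overlap(example_words, chunk_size=4, overlap=2)
```

A larger overlap lowers the chance that an answer straddles a boundary, at the cost of embedding more tokens overall.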

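The code above takes the simplest route: it only splits articles that exceed the embedding model's 8,191-token input limit. A finer-grained approach is to limit sections to about 1,600 tokens each, recursively halving any sections that are too long. To avoid cutting in the middle of useful sentences, split along paragraph boundaries when possible.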
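
A recursive halving strategy of this kind might look like the sketch below. It is an illustration under stated assumptions, not code this notebook runs: `split_recursively` and `count_words` are hypothetical names, and the default counter uses whitespace-separated words as a stand-in for tokens (a tiktoken-based counter such as `num_tokens_from_text` could be passed in instead).

```python
from typing import Callable, List, Sequence


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


def split_recursively(
    text: str,
    max_tokens: int = 1600,
    count_tokens: Callable[[str], int] = count_words,
    delimiters: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Halve `text` until every piece fits in `max_tokens`, preferring paragraph boundaries."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    # Try paragraphs first, then lines, then single words
    for delimiter in delimiters:
        parts = text.split(delimiter)
        if len(parts) < 2:
            continue
        mid = len(parts) // 2  # halve by number of parts, which is only roughly half the tokens
        left = delimiter.join(parts[:mid])
        right = delimiter.join(parts[mid:])
        return (
            split_recursively(left, max_tokens, count_tokens, delimiters)
            + split_recursively(right, max_tokens, count_tokens, delimiters)
        )
    return [text]  # nothing left to split on


# Example: 8 paragraphs of 10 words each, with a limit of 25 "tokens" per chunk
example_article = "\n\n".join(" ".join(["word"] * 10) for _ in range(8))
example_sections = split_recursively(example_article, max_tokens=25)
```

Each half is split again only if it is still too long, so short articles pass through unchanged.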

## 3. Embed document chunks

Now that we've split our library into shorter self-contained strings, we can compute embeddings for each.

(For large embedding jobs, use a script like [api_request_parallel_processor.py](api_request_parallel_processor.py) to parallelize requests while throttling to stay under rate limits.)
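
For smaller jobs, a lighter alternative is to retry a failed request with exponential backoff. The helper below is a generic sketch: `call_with_backoff` and its parameters are hypothetical names, and it wraps any zero-argument callable rather than a specific OpenAI call.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on the given exceptions with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Double the wait each attempt and add jitter so parallel clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Possible usage with the embedding call in the next cell (assumes the pre-1.0 openai library):
# response = call_with_backoff(
#     lambda: openai.Embedding.create(model=EMBEDDING_MODEL, input=batch_text),
#     retry_on=(openai.error.RateLimitError,),
# )
```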

In [None]:
# calculate embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request


def calculate_embeddings(articles):
    titles = []
    content = []
    categories = []
    embeddings = []
    for batch_start in range(0, len(articles), BATCH_SIZE):
        batch_end = batch_start + BATCH_SIZE
        batch = articles[batch_start:batch_end]
        titles.extend([article[0] for article in batch])
        content.extend([article[1] for article in batch])
        categories.extend([article[2] for article in batch])
        batch_text = [title + " " + body for title, body, category in batch]
        print(f"Batch {batch_start} to {batch_end - 1}")
        response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch_text)
        for i, be in enumerate(response["data"]):
            assert i == be["index"]  # double check embeddings are in same order as input
        batch_embeddings = [e["embedding"] for e in response["data"]]
        embeddings.extend(batch_embeddings)

    return pd.DataFrame({"titles": titles, "content": content, "categories": categories, "embedding": embeddings}), embeddings


In [None]:
DF, EMBEDDINGS = calculate_embeddings(CLEANED_ARTICLES)

## 4. Store document chunks and embeddings

Because this example only uses a few thousand strings, we'll store them in a CSV file.

(For larger datasets, use a vector database, which will be more performant.)

In [None]:
# save document chunks and embeddings
def save_dataframe_to_csv(df: pd.DataFrame, path: str, filename: str):
    if not os.path.exists(path):
        # makedirs also creates missing parent folders (os.mkdir fails on nested paths like data/<date>)
        os.makedirs(path)
        print(f"Created {path}")
    df.to_csv(f"{path}/{filename}", index=False)


In [None]:
save_dataframe_to_csv(DF, f"data/{get_date_string()}", "zendesk_vector_embeddings.csv")
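
One thing to watch when reading this CSV back later: pandas writes each embedding list as its string representation, so the column loads as text unless it is parsed. The sketch below shows the round trip on a made-up one-row frame (the sample title and vector are illustrative only) and uses an in-memory buffer in place of a file.

```python
import ast
import io

import pandas as pd

# Made-up sample row standing in for the real DataFrame
sample = pd.DataFrame({"titles": ["VPN setup"], "embedding": [[0.1, 0.2, 0.3]]})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Without a converter, the embedding column comes back as a string like "[0.1, 0.2, 0.3]"
buffer.seek(0)
raw = pd.read_csv(buffer)

# ast.literal_eval safely parses that string back into a Python list
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"embedding": ast.literal_eval})
```

The search functions later in this notebook expect `row["embedding"]` to be a list of floats, so a frame loaded from CSV would need this conversion first.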

## 5. Store embeddings in a Pinecone index

In [None]:
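# Initialise pinecone client with valid API key and environment
import pinecone

# Read the API key from the environment rather than hard-coding it
# (the variable name PINECONE_API_KEY is this notebook's own convention)
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-west1-gcp-free")
# Connect to the "alfred" index
index = pinecone.Index("alfred")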

In [None]:
# Insert the vector embeddings into the index
from tqdm.auto import tqdm  # this is our progress bar


def store_embeddings_into_pinecone(df: pd.DataFrame, index: pinecone.Index, batch_size: int = 32):
    """Upsert each row's embedding into Pinecone, with title, content and category as metadata."""
    for i in tqdm(range(0, len(df), batch_size)):
        batch = df.iloc[i:i + batch_size]
        i_end = i + len(batch)
        ids_batch = [str(n) for n in range(i, i_end)]  # row positions serve as string IDs
        embeddings_batch = batch["embedding"].tolist()
        # prep metadata
        meta = [
            {"title": title, "content": content, "category": category}
            for title, content, category in zip(batch["titles"], batch["content"], batch["categories"])
        ]
        # upsert (id, vector, metadata) tuples to Pinecone
        index.upsert(vectors=list(zip(ids_batch, embeddings_batch, meta)))

In [None]:
store_embeddings_into_pinecone(DF, index)

In [None]:
DF.head()

In [None]:
def get_queries():
    queries = []
    # removing the new line characters
    with open('test/queries.txt') as f:
        lines = [line.rstrip() for line in f]
        for line in lines:
            queries.append(line)
    return queries

## 6. Embed the test query and retrieve the most similar articles

In [None]:
from openai.embeddings_utils import cosine_similarity, get_embedding

# search function
def strings_ranked_by_relatedness(
        query: str,
        df: pd.DataFrame,
        relatedness_fn=cosine_similarity,
        top_n: int = 100
) -> tuple[tuple[str, ...], tuple[float, ...], tuple[list[float], ...]]:
    """Returns strings, relatednesses, and embeddings, sorted from most related to least."""
    question_vector = get_embedding(query, EMBEDDING_MODEL)
    strings_and_relatednesses = [
        (row["content"], relatedness_fn(row["embedding"], question_vector), row["embedding"])
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses, embeddings = zip(*strings_and_relatednesses)
    # truncate all three sequences, so they stay the same length
    return strings[:top_n], relatednesses[:top_n], embeddings[:top_n]
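
To make the ranking step concrete, here is a dependency-free sketch of the same idea: score every stored vector against the query by cosine similarity, then sort. The function names and the two-dimensional toy vectors are illustrative assumptions; real embeddings from `text-embedding-ada-002` have 1,536 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vector: Sequence[float],
    documents: Dict[str, Sequence[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_n (name, score) pairs, most similar first."""
    scored = [(name, cosine(vector, query_vector)) for name, vector in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors standing in for article embeddings
toy_documents = {
    "reset password": [1.0, 0.0],
    "holiday policy": [0.0, 1.0],
    "vpn access": [0.7, 0.7],
}
toy_ranking = rank_by_similarity([1.0, 0.1], toy_documents, top_n=2)
```

Scanning every row like this is fine for a small knowledge base; a vector database performs the same nearest-neighbour lookup at larger scale.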

In [None]:
def get_similarities(query: str, df: pd.DataFrame) -> pd.DataFrame:
    """Return the top 3 matches for a query, with their scores and embeddings."""
    scores = []
    answers = []
    answer_embeddings = []
    strings, relatednesses, embeddings = strings_ranked_by_relatedness(query, df, top_n=3)
    # zip stops at the shortest input, so at most top_n rows are kept
    for string, relatedness, embedding in zip(strings, relatednesses, embeddings):
        answers.append(string)
        scores.append("%.3f" % relatedness)
        answer_embeddings.append(embedding)

    results = pd.DataFrame({"answers": answers, "match_scores": scores, "embeddings": answer_embeddings})
    return results

## 7. Combine the top n answers into one chunk of text to use as knowledge-base context for GPT

In [None]:
def generate_context_array(results: pd.DataFrame) -> str:
    context_array = []
    for i, row in results.iterrows():
        context_array.append(row.answers)

    context = "\n".join(context_array)
    return context

## 8. Use a GPT model to generate user-friendly answers to the query

In [None]:
from typing import Literal

def generate_gpt_opt_response(record: pd.Series, category: Literal["IT", "HR"], company: str="Omnicentra", description: str="an AI software company"):
    question = record.question
    context = record.top_answer
    prompt = f"""Name: Alfred

"Answer the following question by rephrasing the context below"
Context:
{context}

Question:
{question}

You are an AI-powered assistant designed to help employees with {category} questions at {company}. You have been programmed to provide fast and accurate solutions to their inquiries. As an AI, you do not have a gender, age, sexual orientation or human race.

As an experienced assistant, you can create Zendesk tickets and forward complex inquiries to the appropriate person. If you are unable to provide an answer, you will respond by saying "I don't know, would you like me to create a ticket on Zendesk or ask {category}?" and follow the steps accordingly based on their response.

If a question is outside your scope, you will make a note of it and store it as a "knowledge gap" to learn and improve. It is important to address employees in a friendly and compassionate tone, speaking to them in first person terms.

Please feel free to answer any {category} related questions, and do your best to assist employees with questions promptly and professionally."""

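    # pprint(prompt)
    response = openai.Completion.create(
        prompt=prompt,
        temperature=0.9,
        max_tokens=500,
        frequency_penalty=0,
        presence_penalty=0,
        top_p=1,
        model=COMPLETIONS_MODEL
    )
    text = response['choices'][0]['text'].strip()
    # str.strip(" Answer:") would strip those *characters* from both ends,
    # so remove the literal "Answer:" prefix instead (Python 3.9+)
    return text.removeprefix("Answer:").strip()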

In [None]:
def query_message(query: str, context: str, company: str, token_budget: int) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    introduction = f"""You are an AI-powered assistant designed to help employees with HR and IT questions at {company}. You have been programmed to provide fast and accurate solutions to their inquiries. As an AI, you do not have a gender, age, sexual orientation or human race.

As an experienced assistant, you can create Zendesk tickets and forward complex inquiries to the appropriate person.

When a HR / IT related question is asked by the user, only use information provided in the context and never use general knowledge. If the question asked is not in the context given to you or the context does not answer the question properly, you will respond apologetically saying something along the lines of "this information is not provided within the company’s knowledge base, would you like me to create a ticket on Zendesk or ask HR/IT?" and follow the steps accordingly based on their response.

If a question is outside your scope, you will make a note of it and store it as a "knowledge gap" to learn and improve. It is important to address employees in a friendly and compassionate tone, speaking to them in first person terms.

Please feel free to answer any HR or IT related questions."""
    question = f"\n\nQuestion: {query}"
    message = introduction
    context = f'\n\nContext:\n"""\n{context}\n"""'
    num_tokens = num_tokens_from_text(message + context + question)
    if num_tokens > token_budget:
        print(f"Context omitted, prompt too long: {num_tokens} tokens (budget: {token_budget})")
    else:
        message += context

    return message + question
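
`query_message` above drops the context entirely when the full prompt exceeds `token_budget`. A gentler alternative, sketched here under assumptions, is to add the ranked answers one at a time and stop once the budget is used up. `build_context` is a hypothetical helper, and its default counter uses whitespace-separated words in place of real tokens (`num_tokens_from_text` could be passed in instead).

```python
from typing import Callable, List


def build_context(
    chunks: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> str:
    """Join chunks in ranked order, stopping before the token budget is exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # the next chunk would not fit; keep what we have
        selected.append(chunk)
        used += cost
    return "\n".join(selected)


# Example with word counts as the budget unit
ranked_chunks = ["a b c", "d e", "f g h i"]
partial_context = build_context(ranked_chunks, budget=5)
```

Because the chunks arrive sorted by relatedness, the most relevant answers are the ones that survive a tight budget.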

In [None]:
def generate_gpt_chat_response(query: str, context: str, company: str = "Omnicentra"):
    message = query_message(query, context, company, MAX_INPUT_TOKENS)
    messages = [
        {"role": "system", "content": f"Your name is Alfred. You are a helpful assistant that answers HR and IT questions at {company}"},
        {"role": "user", "content": message},
    ]
    # pprint(prompt)
    response = openai.ChatCompletion.create(
        model=CHAT_COMPLETIONS_MODEL,
        messages=messages,
        temperature=0
    )
    sanitized_response = response['choices'][0]['message']['content'].strip(" \n")
    return sanitized_response, messages

In [None]:
# Choose a random query from the query list
from numpy import random

QUERIES = get_queries()
rand_index = random.randint(0, len(QUERIES))  # numpy's upper bound is exclusive
rand_query = QUERIES[rand_index]
rand_query

In [None]:
similarities = get_similarities(rand_query, DF)
similarities

In [None]:
context = generate_context_array(similarities)
context

### GPT-generated response

In [None]:
# The third argument is the company name, so rely on its default rather than passing "HR"
response, messages = generate_gpt_chat_response(rand_query, context)

In [None]:
response