# Evaluating OpenAI Embeddings on MC Synthetic Data
## 1. Data preparation

Preprocess synthetic MomConnect data for testing OpenAI embeddings.

We use synthetic questions in order to abide by our data sharing agreement.

Ideally, we would split synthetic questions into the following:

* reference questions: for question-question matching
* training questions: for training BERT
* test questions: for evaluating BERT or OpenAI

But many FAQs only have 4 synthetic questions.
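
To make the constraint concrete, here is a minimal sketch of the three-way split described above. The function name `split_questions` and the sizes `n_ref=2` and `n_test=1` are assumptions for illustration only; they are not part of this notebook's pipeline.

```python
import random


def split_questions(questions, n_ref=2, n_test=1, seed=0):
    """Shuffle one FAQ's questions and split them into three groups."""
    if len(questions) < n_ref + n_test:
        raise ValueError("not enough questions for the requested split")
    rng = random.Random(seed)  # local generator, so the split is reproducible
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return {
        "reference": shuffled[:n_ref],
        "test": shuffled[n_ref:n_ref + n_test],
        "train": shuffled[n_ref + n_test:],
    }


# An FAQ with only 4 questions (hypothetical placeholders)
example = split_questions(["q1", "q2", "q3", "q4"])
print({name: len(group) for name, group in example.items()})
```

With only 4 questions and these sizes, a single question is left for training, which is why the full split is not practical for many FAQs.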

### 1.1 Load FAQs and synthetic questions

In [None]:
import pandas as pd
import s3fs


fs = s3fs.S3FileSystem()
s3_path = "s3://praekelt-static-resources/experiment/data/[Sam] Helpdesk Q&A FAQ Content _ MOMZA AAQ _ Oct 2022.xlsx - Final FAQs for AAQ admin.csv"

with fs.open(s3_path) as f:
    faqs = pd.read_csv(f)

In [None]:
faqs.head()

In [None]:
faqs.info()

### 1.2 Clean FAQs and synthetic questions

In [None]:
# Clean column names in FAQs file
column_map = {
    'VALIDATION questions (user generated)': 'questions_usr',
    'SYNTHETIC questions': 'questions_syn',
    'FAQ Content': 'faq_content',
    'FAQ Name': 'faq_name',
    'FAQ title': 'faq_title',
    'TAGS (FINAL VERSION)': 'faq_tags',
    'Trimester tag': 'trimester_tags',
    'Week tags': 'week_tags',
    'Added to Database': 'added',
}

faqs = faqs.rename(columns=column_map)

In [None]:
# Keep only the data we need
faqs = faqs.loc[faqs.added == "Yes", list(column_map.values())].copy()

In [None]:
faqs[faqs["questions_syn"].isnull() | faqs["questions_usr"].isnull()]

Clean up tags

In [None]:
# Remove tags that are struck through in the Google Sheet by overwriting them with the cleaned tag lists
inserts = { # faq_name: new tags
    "Preg - Shortness of breath": "['shortness', 'breath']",
    "Preg - Safe foods to eat": "['safe', 'food']",
    "Baby - Family planning & birth spacing": "['family', 'planning', 'birth', 'spacing']",
    "Baby - Umbilical cord care": "['umbilical', 'cord', 'care']",
    "HIV - Viral load": "['hiv', 'viral', 'load', 'arv']",
}

for name, new_tags in inserts.items():
    faqs.loc[faqs.faq_name == name, "faq_tags"] = new_tags

In [None]:
faqs["faq_tags"] = faqs.faq_tags.apply(lambda x: x.replace("[", "{").replace("]", "}").replace("'", "\""))
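
The chained replacements above turn a Python-style list string into a brace-delimited string with double-quoted items. The sketch below shows the effect on one of the tag strings from this notebook, next to a parsing-based variant. Both helper names are hypothetical, and `ast.literal_eval` is swapped in here as an alternative; it is not what the notebook uses.

```python
import ast


def to_brace_string(tags: str) -> str:
    # Same character substitutions as the cell above.
    return tags.replace("[", "{").replace("]", "}").replace("'", '"')


def to_brace_string_parsed(tags: str) -> str:
    # Alternative: parse the list literal first, then rebuild the string.
    items = ast.literal_eval(tags)
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


sample = "['shortness', 'breath']"
converted = to_brace_string(sample)
converted_parsed = to_brace_string_parsed(sample)
print(converted)  # {"shortness", "breath"}
```

For simple tags the two variants agree. The parsing-based one may be more robust if a tag ever contains an apostrophe, since plain character replacement would turn that apostrophe into a double quote.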

Preprocess questions

In [None]:
import re
import numpy as np


# Parse the example questions columns so each element is an array of questions (we use numpy arrays so we can index them)
for questions_col in ["questions_syn", "questions_usr"]:
    faqs.loc[:, questions_col] = faqs[questions_col].apply(
        lambda x: np.asarray(
            re.sub('\n+', "\n", x.strip()).split('\n')
        ) if isinstance(x, str) else np.asarray([])
    )
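
As a quick illustration of the parsing rule above, here is a small sketch on a toy cell value. It uses plain lists instead of numpy arrays to stay self-contained, and `parse_questions` is a hypothetical helper name.

```python
import re


def parse_questions(cell):
    """Split a newline-separated cell into a list of questions."""
    if not isinstance(cell, str):
        return []  # e.g. NaN for an empty spreadsheet cell
    # Collapse runs of newlines so blank lines do not become empty questions
    return re.sub(r"\n+", "\n", cell.strip()).split("\n")


raw = "How often should I feed?\n\n\nIs formula safe?\n"
parsed = parse_questions(raw)
print(parsed)  # ['How often should I feed?', 'Is formula safe?']
```

Note that a cell containing only whitespace parses to a list holding one empty string, which is the case the empty-question check looks for.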

Check for empty questions.

1. Empty question strings?

In [None]:
faqs[faqs.questions_usr.apply(lambda x: any(len(y) == 0 for y in x))]

2. Empty set of questions?


In [None]:
faqs[faqs.questions_usr.apply(len) == 0]

We need at least 4 questions per FAQ to have enough training data for other models (e.g. BERT)

In [None]:
# Keep FAQs with at least 4 example questions
print(f"FAQs that had been added to DB: {len(faqs)}")
faqs = faqs[faqs.questions_usr.apply(lambda x: len(x)) >= 4]
print(f"FAQs with at least 4 synthetic questions: {len(faqs)}")

Check for content duplicates

In [None]:
faqs[faqs.faq_content.duplicated(keep=False)]

"Baby - Newborn care" has the wrong content (a duplicate of "HIV - Breastfeeding & weaning"), so I will drop it.

In [None]:
faqs = faqs[faqs.faq_name != "Baby - Newborn care"]

Histogram of number of questions per FAQ

In [None]:
faqs.questions_syn.apply(lambda x: len(x)).hist()

In [None]:
faqs.questions_usr.apply(lambda x: len(x)).hist()

Split data

In [None]:
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence

# (Only relevant for question-question matching)
# Split into reference questions (tied to the FAQ) and example questions for training
rs = RandomState(MT19937(SeedSequence(123456789)))

def get_ref_split(l):
    r = np.arange(len(l))
    rs.shuffle(r)
    return r[:2], r[2:]

faqs.loc[:, "_splits"] = faqs.questions_usr.apply(get_ref_split)
faqs.loc[:, "question_ref"] = faqs.apply(lambda x: x.questions_usr[x._splits[0]], axis=1)
faqs.loc[:, "question"] = faqs.apply(lambda x: x.questions_usr[x._splits[1]], axis=1)

# Cast numpy arrays into lists
for col in ['question', 'question_ref', 'questions_usr',]:
    faqs[col] = faqs[col].apply(lambda x: list(x))
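
The split above can be tried on a single toy array to see what the index arrays look like. This is a sketch with placeholder questions; it reuses the same seed and generator setup as the cell above.

```python
import numpy as np
from numpy.random import MT19937, RandomState, SeedSequence

rs = RandomState(MT19937(SeedSequence(123456789)))

# Placeholder questions for one FAQ
questions = np.asarray(["q0", "q1", "q2", "q3", "q4"])

# Shuffle the positions, then take the first 2 as reference questions
idx = np.arange(len(questions))
rs.shuffle(idx)
ref_idx, rest_idx = idx[:2], idx[2:]

# Index arrays select the corresponding questions
question_ref = questions[ref_idx]
question = questions[rest_idx]
print(question_ref, question)
```

Because a single `rs` is shared across all rows, each call advances the generator state, so the split for a given FAQ also depends on the order of the rows.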

In [None]:
faqs.faq_content.nunique()

In [None]:
faqs.faq_name.nunique()

In [None]:
faqs.isnull().any()

Distribution of number of synthetic questions per FAQ (excluding 2 reference questions, which we'll use for validation)

In [None]:
faqs._splits.apply(lambda x: len(x[1])).hist(bins=10)

In [None]:
!pip install plotly

In [None]:
!pip install scipy

In [None]:
!pip install scikit-learn

## 2. Evaluate OpenAI embeddings

In [None]:
import tiktoken

# Note: openai.embeddings_utils is only available in openai<1.0 (it was removed in v1.0)
from openai.embeddings_utils import get_embedding, get_embeddings


# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002

### 2.1 Cost estimation

In [None]:
encoding = tiktoken.get_encoding(embedding_encoding)

faqs_n_tokens = faqs.faq_content.apply(lambda x: len(encoding.encode(x)))
faqs_n_tokens.describe()

In [None]:
faqs_n_tokens.sum()

OpenAI rate limits for pay-as-you-go (see the [OpenAI Cookbook notebook on handling rate limits](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb)):

* 60 requests / minute
* 250,000 davinci tokens / minute (and proportionally more for cheaper models)
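
Given these limits, embedding requests may occasionally need to be retried. Below is a minimal sketch of retrying with exponential backoff, in the spirit of the linked cookbook notebook. `call_with_backoff` and its parameters are hypothetical names, `flaky` is a toy stand-in for an API call, and the injected `sleep` argument exists only so the example runs without waiting.

```python
import random
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponentially growing delays on any exception."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; a little jitter spreads out retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Toy stand-in for an API call that fails twice before succeeding
calls = {"n": 0}


def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return [0.0, 1.0]


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, "hello", sleep=delays.append)
print(result, len(delays))
```

In real use one would probably catch only the rate-limit error type rather than every exception.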

Pricing

* \$0.0004  / 1K tokens for Ada
* \$0.0005  / 1K tokens for Babbage
* \$0.0020  / 1K tokens for Curie
* \$0.0200  / 1K tokens for Davinci

In [None]:
pricing = {
    "Ada": 0.0004 / 1000,
    "Babbage": 0.0005 / 1000,
    "Curie": 0.002 / 1000,
    "Davinci": 0.02 / 1000
}

print("Estimated cost for all FAQs")
for model, rate in pricing.items():
    print(f"{model}: ${rate * faqs_n_tokens.sum():.2f}")
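
The estimate is simply the token count divided by 1,000 and multiplied by the per-1K rate. A worked example with a hypothetical total of 50,000 tokens (not the actual count for these FAQs):

```python
def estimate_cost_usd(n_tokens, usd_per_1k_tokens):
    """Cost in USD for n_tokens at a given price per 1K tokens."""
    return n_tokens / 1000 * usd_per_1k_tokens


n_tokens = 50_000  # hypothetical token count, for illustration only
ada_cost = estimate_cost_usd(n_tokens, 0.0004)
davinci_cost = estimate_cost_usd(n_tokens, 0.02)
print(f"Ada: ${ada_cost:.2f}, Davinci: ${davinci_cost:.2f}")
# Ada: $0.02, Davinci: $1.00
```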

### 2.2 Get embeddings for FAQ

In [None]:
faq_embeddings = get_embeddings(faqs.faq_content.tolist(), engine=embedding_model)

In [None]:
print(len(faq_embeddings))
print(len(faq_embeddings[0]))

In [None]:
!pip install pyarrow

In [None]:
faqs["faq_content_embedding"] = pd.Series(faq_embeddings, index=faqs.index).apply(np.asarray)
faqs[["faq_name", "faq_title", "faq_content_embedding"]].to_parquet("../data/faq_embeddings_updated.parquet")

In [None]:
faqs[faqs.faq_content_embedding.isnull()]

### 2.3 Get embeddings for queries

#### Check query data

In [None]:
validation_data = faqs.explode("question_ref").reset_index()

In [None]:
validation_data.head()

In [None]:
validation_data.loc[validation_data.question_ref.apply(len) > 250, "question_ref"].tolist()

In [None]:
validation_data.question_ref.apply(len).hist()

#### Get embeddings for queries

In [None]:
validation_data_questions = validation_data.question_ref.tolist()

In [None]:
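# Note: a failed request appends nothing, so query_embeddings_list can end up
# shorter than validation_data; check failed_indices before using the results.
query_embeddings_list = []
failed_indices = []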
for i, query in enumerate(validation_data_questions):
    try:
        query_embeddings_list.append(get_embedding(query, engine=embedding_model))
    except Exception as e:
        print(f"{i}: {e}")
        failed_indices.append(i)
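
One way to keep embeddings aligned with their queries when a request fails is to store a placeholder for the failed item. The sketch below assumes a generic `embed_fn` callable; `embed_all` and `fake_embed` are hypothetical names, and `fake_embed` is a toy stand-in for the real API call.

```python
def embed_all(queries, embed_fn):
    """Embed each query, keeping one output slot per query (None on failure)."""
    embeddings, failed_indices = [], []
    for i, query in enumerate(queries):
        try:
            embeddings.append(embed_fn(query))
        except Exception as e:
            print(f"{i}: {e}")
            embeddings.append(None)  # placeholder keeps positions aligned
            failed_indices.append(i)
    return embeddings, failed_indices


def fake_embed(text):
    # Toy embedding: fails on empty input, otherwise returns a 2-d vector
    if not text:
        raise ValueError("empty input")
    return [float(len(text)), 1.0]


embeddings, failed = embed_all(
    ["is this safe?", "", "when is my due date?"], fake_embed
)
```

Entries that are `None` would then need to be retried or dropped before computing similarities.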

In [None]:
validation_data.shape

In [None]:
len(validation_data_questions)

In [None]:
len(query_embeddings_list)

In [None]:
faqs.faq_content_embedding.isnull().any()

### 2.4 Compute top K accuracies

In [None]:
from openai.embeddings_utils import cosine_similarity


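def get_top_k_faqs_for_embedding(query_embedding, k=10):
    """Return the names of the k FAQs whose content embeddings are most similar to the query embedding."""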
    faqs["current_query_cossim"] = faqs.faq_content_embedding.apply(lambda x: cosine_similarity(x, query_embedding))
    
    results = (
        faqs.sort_values("current_query_cossim", ascending=False)
        .head(k)
        .faq_name
        .tolist()
    )
    del faqs["current_query_cossim"]
    return results
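
To make the ranking step concrete, here is a self-contained sketch on toy 2-d vectors. It uses a plain-Python cosine similarity and a dictionary of embeddings instead of the DataFrame, so `top_k_names` and the toy data are illustrative assumptions and not the notebook's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_names(query_embedding, faq_embeddings, k=10):
    """Rank FAQ names by similarity to the query, most similar first."""
    scored = sorted(
        faq_embeddings.items(),
        key=lambda item: cosine_similarity(item[1], query_embedding),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy FAQ embeddings (hypothetical values)
toy_faq_embeddings = {
    "faq_a": [1.0, 0.0],
    "faq_b": [0.0, 1.0],
    "faq_c": [1.0, 1.0],
}
top2 = top_k_names([0.9, 0.1], toy_faq_embeddings, k=2)
print(top2)  # ['faq_a', 'faq_c']
```

Unlike the function above, this variant does not add and remove a temporary column on a shared DataFrame.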

In [None]:
validation_data["top10_pred"] = list(map(get_top_k_faqs_for_embedding, query_embeddings_list))

In [None]:
for k in [1, 3, 5, 7, 10]:
    validation_data[f"isin_top{k}"] = validation_data.apply(lambda row: row.faq_name in row.top10_pred[:k], axis=1)

In [None]:
for k in [1, 3, 5, 7, 10]:
    acc = validation_data[f'isin_top{k}'].mean()
    print(f"Top {k} accuracy: {acc:.1%}")
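
Top K accuracy here is the share of queries whose correct FAQ appears among the first K predictions. A minimal sketch on toy labels (the names and data are hypothetical):

```python
def top_k_accuracy(true_names, ranked_predictions, k):
    """Fraction of queries whose true FAQ is within the first k predictions."""
    hits = sum(
        true in preds[:k]
        for true, preds in zip(true_names, ranked_predictions)
    )
    return hits / len(true_names)


# Three toy queries with their true FAQ and ranked predictions
true_names = ["faq_a", "faq_b", "faq_c"]
ranked_predictions = [
    ["faq_a", "faq_b", "faq_c"],  # correct at rank 1
    ["faq_c", "faq_b", "faq_a"],  # correct at rank 2
    ["faq_a", "faq_b", "faq_c"],  # correct at rank 3
]

accuracies = {k: top_k_accuracy(true_names, ranked_predictions, k) for k in [1, 2, 3]}
for k, acc in accuracies.items():
    print(f"Top {k} accuracy: {acc:.1%}")
```

Accuracy can only stay the same or increase as K grows, since each larger K includes all earlier predictions.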

## 3. Save validation data to upload to S3

In [None]:
validation_data[["question_ref", "faq_name", "faq_title", "faq_content"]].to_csv("../data/synthetic_validation_data_updated.csv", index=False)

In [None]:
(
    faqs.loc[
        faqs.faq_content_embedding.notnull(),
        ["faq_name", "faq_title", "faq_content", "faq_tags", "questions_usr", "question_ref"]
    ]
    .rename(columns={"faq_content": "faq_content_to_send"})
    .to_csv("../data/faqs_with_synthetic_questions_updated.csv", index=False)
)

Validation with custom embeddings + WMD + scoring on the entire content:

* Top 1 accuracy: 0.28
* Top 5 accuracy: 0.63
* Top 10 accuracy: 0.75

Google News pretrained embedding + StepwiseKeyedVectorScorer + tags:

* Top 1 accuracy: 0.29
* Top 5 accuracy: 0.50
* Top 10 accuracy: 0.68

# _Fix tags_

In [None]:
import pandas as pd
_df = pd.read_csv("../data/faqs_with_synthetic_questions_updated.csv")

In [None]:
# Remove the trailing comma before the closing brace (literal match, not a regex)
_df["faq_tags"] = _df.faq_tags.str.replace(",}", "}", regex=False)

In [None]:
_df.loc[61, "faq_tags"]

In [None]:
_df.to_csv("../data/faqs_with_synthetic_questions_updated.csv", index=False)