<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Text Embedding API

Text embedding lets us convert text documents directly into vectors with a single API call to OpenAI.

Keep in mind that, just like other OpenAI services, it is not free and has its own pricing structure (it's typically much cheaper than GPT on a per-token basis, since the processing is simpler). You can view the pricing here: https://openai.com/api/pricing/

## Imports

In [6]:
import os  # needed for os.getenv below

import openai
import pandas as pd
import tiktoken # https://github.com/openai/tiktoken

In [94]:
openai.api_key = os.getenv("OPENAI_API_KEY")

### What happens when GPT doesn't know anything about a topic?

For example, we know GPT is limited by its training data, which is not up to date to the present day (depending on the model, the cut-off can be quite recent, though). There are also limitations based on how esoteric the topic is.

Let's ask GPT about a "unicorn" company:

---

In [188]:
prompt = "What does the start-up company Pentera do and who invested in it?"

response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

Pentera is a start-up company that provides software solutions to help organizations manage their employee benefits programs. The company has raised $3.5 million in seed funding from investors including Y Combinator, SV Angel, and Social Leverage.


While this may sound somewhat correct, the model is hallucinating! This is a common issue with LLMs: they are eager to please, and with enough context they can make up something that sounds right but actually isn't. In my personal research, it looks like Y Combinator did NOT actually invest in Pentera. Also, Pentera isn't an HR company. Pentera is a penetration testing company that develops and provides an automated security validation platform to reduce cybersecurity risks.

We could try to alleviate this issue with some prompt engineering:

In [191]:
prompt = """Only answer the question below if you have 100% certainty of the facts.

Q: What does the start-up company Pentera do and who invested in it?
A:"""


response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

I cannot answer this question with 100% certainty.


Alright, very interesting! How can we help the model? We could input some context from our own data. In fact, we have a data set about recent Unicorn companies.  

## Text Data

Let's grab some text data and send it to OpenAI to receive the embeddings back.
 


In [199]:
df = pd.read_csv("unicorns.csv") 

In [200]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",


Let's create a new column that summarizes each company with the information from the other columns

In [215]:
import ast 
def summary(company,crunchbase_url,city,country,industry,investor_list):
    investors = 'The investors in the company are'
     
    for investor in ast.literal_eval(investor_list):
        investors += f" {investor}, "

    text = f"{company} has headquarters in {city} in {country} and is in the field of {industry}. {investors}. You can find more information at {crunchbase_url}"

    return text 

In [216]:
# `row` is one DataFrame row; naming it `df` would shadow the DataFrame itself
df['summary'] = df.apply(lambda row: summary(row['Company'],row['Crunchbase Url'],row['City'],row['Country'],row['Industry'],row['Investors']),axis=1)

In [217]:
df['summary'][0]

'Esusu has headquarters in New York in United States and is in the field of Fintech. The investors in the company are Next Play Ventures,  Zeal Capital Partners,  SoftBank Group, . You can find more information at https://www.cbinsights.com/company/esusu'
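
The summary above ends with a dangling `, .` because every investor is appended with a trailing comma. A minimal alternative sketch, assuming the same stringified-list format for `investor_list`, joins the names instead. The name `summary_joined` is hypothetical, and switching to it would change the summaries (and therefore the token counts and embeddings) shown in the rest of this notebook.

```python
import ast

def summary_joined(company, crunchbase_url, city, country, industry, investor_list):
    """Hypothetical variant of summary() that avoids the trailing ', .'."""
    # investor_list is a string such as '["Accel","14W","GS Growth"]'
    investors = ", ".join(ast.literal_eval(investor_list))
    return (
        f"{company} has headquarters in {city} in {country} "
        f"and is in the field of {industry}. "
        f"The investors in the company are {investors}. "
        f"You can find more information at {crunchbase_url}"
    )

print(summary_joined(
    "Fever Labs",
    "https://www.cbinsights.com/company/fever-labs",
    "New York",
    "United States",
    "Internet software & services",
    '["Accel","14W","GS Growth"]',
))
```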

In [218]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...


### Token Count

If you are ever worried about how many tokens your text actually contains (for example, to estimate your costs), OpenAI has a library called "tiktoken" that counts the tokens in a string. You can then turn that count into a cost estimate.

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.

At the time of writing, the **tiktoken** encodings relevant to the OpenAI models used here are:

* "gpt2" for most gpt-3 models
* "p50k_base" for code models, and Davinci models, like "text-davinci-003"
* "cl100k_base" for text-embedding-ada-002
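
To keep the encoding choice in one place, the list above can be captured in a small lookup. This is only a sketch: the dictionary and the helper name `encoding_name_for` are hypothetical, and the `"gpt2"` fallback simply mirrors the first bullet. tiktoken also provides `tiktoken.encoding_for_model(model_name)`, which performs this lookup for models it knows about.

```python
# Mapping taken from the list above; extend it for other models you use.
ENCODING_BY_MODEL = {
    "text-davinci-003": "p50k_base",
    "text-embedding-ada-002": "cl100k_base",
}

def encoding_name_for(model, default="gpt2"):
    """Return the encoding name for a model, falling back to `default`."""
    return ENCODING_BY_MODEL.get(model, default)

print(encoding_name_for("text-embedding-ada-002"))  # cl100k_base
```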

In [166]:
import tiktoken

def num_tokens_from_string(string, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

Let's run a quick example on some text:

In [219]:
num_tokens_from_string(df['summary'][0],encoding_name='cl100k_base')

58

Note how this is higher than the actual word count. OpenAI tokens are not the same as words: punctuation and word length come into play. As a rough estimate, 1,000 tokens is about 750 words. With the function above, you can check your real token count before sending text to OpenAI. Let's get a cost estimate of vectorizing our entire data set, starting with a token count for every row:

In [220]:
df['token_count'] = df['summary'].apply(lambda text: num_tokens_from_string(text,'cl100k_base'))

In [221]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,58
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,60
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,57
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,62
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58
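
When tiktoken is not at hand, the rule of thumb mentioned earlier (1,000 tokens is about 750 words) can serve as a quick sanity check. The sketch below is only a heuristic, not a replacement for a real token count, and the function name `rough_token_estimate` is hypothetical.

```python
import math

def rough_token_estimate(text):
    """Estimate tokens from whitespace-separated words (about 4 tokens per 3 words)."""
    words = len(text.split())
    return math.ceil(words * 4 / 3)

print(rough_token_estimate("word " * 750))  # 1000
```

Real counts can differ noticeably for text that is heavy on URLs, punctuation, or unusual names, so prefer the tiktoken count when estimating costs.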


### Estimating Embedding Costs

Let's now do a quick monetary estimate of how much this will all cost. At the time of writing, the ADA-002 embedding model costs $0.0004 / 1K tokens.

Pay careful attention: that isn't 4 cents per 1,000 tokens (that would be $0.04). It is 1/100 of that cost, so it is quite inexpensive, depending on your document workload.

So, let's estimate the cost:

In [222]:
df['token_count'].sum() * 0.0004 / 1000

0.028168400000000003
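
The same arithmetic can be wrapped in a small helper so the price is a parameter rather than a hard-coded number. This is a sketch: the function name `embedding_cost_usd` is hypothetical, and the default price is the one quoted above, which may have changed since.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimated cost in US dollars for embedding `total_tokens` tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

# 70,421 tokens is the total implied by the result above (0.0281684 / 0.0004 * 1000).
print(round(embedding_cost_usd(70_421), 4))  # 0.0282
```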

Another thing to keep in mind is the size limit for embeddings. Currently the ADA-002 model accepts at most 8191 tokens per input. Let's quickly check against this limit:

In [223]:
df[df['token_count'] > 8191]

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count


Looks like we're okay! It also looks like this will only cost us about 3 cents to embed. Not too bad!

## Text Embedding

To begin, we'll create a simple function to grab the embedding. In our case, we'll specify the ADA-002 model:

In [227]:
def get_embedding(text):
    # Note how this function assumes you already set your OpenAI key!
    result = openai.Embedding.create(
        model='text-embedding-ada-002',
        input=text
    )
    return result["data"][0]["embedding"]


## Create Embeddings

Now, to create the embeddings, we can simply call this function. For example:

In [228]:
get_embedding(df['summary'][0])

[0.012057947926223278,
 -0.017802061513066292,
 -0.022373223677277565,
 -0.034451279789209366,
 -0.013807321898639202,
 0.01033538393676281,
 -0.016340898349881172,
 0.025375980883836746,
 0.0037467442452907562,
 -0.02198447287082672,
 0.01943749189376831,
 -0.013076740317046642,
 0.005415687337517738,
 -0.016877105459570885,
 0.01076434925198555,
 -0.023780766874551773,
 -0.0008374039898626506,
 -0.023686930537223816,
 0.0170647781342268,
 0.0026877359487116337,
 -0.023123912513256073,
 -0.015724260360002518,
 -0.009805879555642605,
 0.021314214915037155,
 -0.02210512012243271,
 -0.0009475777624174953,
 -0.011776438914239407,
 -0.0014787574764341116,
 -0.011628982611000538,
 -0.011246935464441776,
 0.011977517046034336,
 -0.0036830694880336523,
 -0.018968310207128525,
 -0.004118737298995256,
 -0.012687990441918373,
 0.006142917554825544,
 -0.02238662913441658,
 0.014263097196817398,
 0.019370466470718384,
 -0.011816654354333878,
 0.03174343332648277,
 0.008733466267585754,
 -0.0140754

Let's embed the rest of the rows by applying the function to the whole summary column:

In [229]:
# This will take a while due to the number of calls to the API:
# one call per row, at about 0.5 seconds per row.
df['embedding'] = df['summary'].apply(get_embedding)
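
Because the call above hits the API once per row, one way to cut the number of round trips is to send several texts per request. To my understanding the embeddings endpoint also accepts a list of strings as `input`, but check the API reference before relying on that. The sketch below only shows the batching logic: `embed_in_batches` and the `embed_batch` callable are hypothetical names, and a fake embedder stands in for the real API call so the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` by calling `embed_batch` on slices of at most `batch_size` items.

    `embed_batch` is a hypothetical callable: a list of strings in,
    a list of embedding vectors out, in the same order.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in for a real API call so the sketch runs offline:
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

A real `embed_batch` would wrap the same `openai.Embedding.create` call used in `get_embedding`, passing the batch as `input` and collecting one embedding per returned item. Treat that wiring as an assumption to verify against the API documentation.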

In [230]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embedding
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,58,"[0.01195491198450327, -0.017717931419610977, -..."
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,60,"[0.009171437472105026, 0.01314949057996273, -0..."
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,57,"[0.002730059437453747, -0.03737899661064148, 0..."
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,62,"[-0.0024771858006715775, -0.024587858468294144..."
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58,"[0.011331121437251568, -0.011193273589015007, ..."


In [231]:
df.to_csv('unicorns_with_embeddings.csv',index=False)
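
One thing to watch when reloading this file later: a CSV stores each embedding list as text, so `pd.read_csv` gives back strings such as `'[0.0113, -0.0111, ...]'` rather than lists of floats. A minimal sketch for converting a cell back (the helper name `parse_embedding` is hypothetical):

```python
import ast

def parse_embedding(cell):
    """Turn a CSV cell like '[0.1, -0.2]' back into a list of floats."""
    return [float(x) for x in ast.literal_eval(cell)]

print(parse_embedding("[0.0113, -0.0111, 0.5]"))  # [0.0113, -0.0111, 0.5]
```

After reading the file back with `pd.read_csv('unicorns_with_embeddings.csv')`, this could be applied with `df['embedding'] = df['embedding'].apply(parse_embedding)`.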

## Document Similarity 

We can now take a new string, embed it into a vector, and perform a cosine similarity search against all the vector embeddings in our DataFrame:

In [232]:
prompt = "What does the company Pentera do and who invested in it?"

In [233]:
prompt_embedding = get_embedding(prompt)

In [235]:
import numpy as np
# There are other services/programs for larger numbers of vectors.
# Take a look at vector search engines like Pinecone or Weaviate.
def vector_similarity(vec1,vec2):
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(vec1), np.array(vec2))
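
The dot-product shortcut only works because the vectors have length 1. As a dependency-free sketch of the general case, the hypothetical helper below divides by both norms, so it also works for vectors that are not normalized:

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine similarity that does not assume unit-length inputs."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

# For unit-length vectors both norms are 1, so this equals the plain dot product.
u = [0.6, 0.8]
v = [1.0, 0.0]
print(cosine_similarity(u, v))
```

Scaling either vector leaves the result unchanged, which is the property that makes cosine similarity a measure of direction rather than magnitude.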


In [237]:
df["prompt_similarity"] = df['embedding'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

In [239]:
df.sort_values("prompt_similarity", ascending=False).head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embedding,prompt_similarity
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58,"[0.011331121437251568, -0.011193273589015007, ...",0.883279
933,"10/31/2022, 2:34:02 AM",Pendo,https://www.cbinsights.com/company/pendoio,2.6,10/17/2019,2019,Raleigh,United States,Internet software & services,"[""Contour Venture Partners"",""Battery Ventures""...",,Pendo has headquarters in Raleigh in United St...,59,"[0.01703517511487007, -0.0028536927420645952, ...",0.826227
61,"10/31/2022, 2:36:13 AM",Perimeter 81,https://www.cbinsights.com/company/perimeter-81,1.0,6/6/2022,2022,Tel Aviv,Israel,Cybersecurity,"[""Insight Partners"",""Toba Capital"",""Spring Ven...",,Perimeter 81 has headquarters in Tel Aviv in I...,57,"[0.006611596792936325, -0.0017099825199693441,...",0.819634
1183,"10/31/2022, 2:33:34 AM",Intarcia Therapeutics,https://www.cbinsights.com/company/intarcia-th...,3.8,4/1/2014,2014,Boston,United States,Health,"[""New Enterprise Associates"",""New Leaf Venture...",,Intarcia Therapeutics has headquarters in Bost...,62,"[0.016609707847237587, -0.002032819902524352, ...",0.804336
988,"10/31/2022, 2:36:23 AM",Momenta,https://www.cbinsights.com/company/momenta,1.0,10/17/2018,2018,Beijing,China,Artificial intelligence,"[""Sinovation Ventures"",""Tencent Holdings"",""Seq...",,Momenta has headquarters in Beijing in China a...,55,"[0.0063818651251494884, -0.03462127968668938, ...",0.803542


Now we can easily grab the summary for the most similar embedding, then insert that as context to our GPT request!

In [261]:
# Could also use sort_values() with ascending=False, but nlargest should be more performant
df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

'Pentera has headquarters in Petah Tikva in Israel and is in the field of Cybersecurity . The investors in the company are AWZ Ventures,  Blackstone,  Insight Partners, . You can find more information at https://www.cbinsights.com/company/pcysys'
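
Taking only the single best match is the simplest option. If a question might need more than one company, a variation is to keep the top few summaries instead. The sketch below assumes a plain list of `(similarity, summary)` pairs, with `top_k_summaries` as a hypothetical name and the similarity scores copied from the table above.

```python
import heapq

def top_k_summaries(scored_summaries, k=3):
    """Return the k summaries with the highest similarity, best first.

    `scored_summaries` is a list of (similarity, summary) pairs.
    """
    best = heapq.nlargest(k, scored_summaries, key=lambda pair: pair[0])
    return [summary for _, summary in best]

scored = [(0.826227, "Pendo ..."), (0.883279, "Pentera ..."), (0.803542, "Momenta ...")]
print(top_k_summaries(scored, k=2))  # ['Pentera ...', 'Pendo ...']
```

With the DataFrame itself, `df.nlargest(3, 'prompt_similarity')['summary'].tolist()` gives the same kind of list.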

## Question Answering with Embeddings

Let's insert the summary into the prompt and see if it actually helps the model:

In [262]:
summary = df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

In [263]:
prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
Here is some context:
{summary}
Q: What does the start-up company Pentera do and who invested in it?
A:"""


response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

Pentera is a start-up company in the field of Cybersecurity with headquarters in Petah Tikva, Israel. The investors in the company are AWZ Ventures, Blackstone, and Insight Partners.


Nice! Let's clean this all up by creating a function that wraps all the steps together. Note that this is pretty limited due to our data, but hopefully you can see how it generalizes to your own data sets and prompts. Each situation will be different!

In [265]:
def embed_prompt_lookup():
    # initial question
    question = input("What question do you have about a Unicorn company? ")
    # Get embedding
    prompt_embedding = get_embedding(question)
    # Get prompt similarity with embeddings
    # Note how this will overwrite the prompt similarity column each time!
    df["prompt_similarity"] = df['embedding'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

    # get most similar summary
    summary = df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

    prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
            Here is some context:
            {summary}
            Q: {question}
            A:"""


    response = openai.Completion.create(
        prompt=prompt,
        temperature=0,
        max_tokens=500,
        model="text-davinci-003"
    )
    print(response["choices"][0]["text"].strip(" \n"))

In [266]:
embed_prompt_lookup()

Momenta is a company in the field of Artificial Intelligence with headquarters in Beijing, China.
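
One detail in `embed_prompt_lookup`: because the f-string is indented inside the function, every prompt line after the first is sent to the model with leading spaces. A minimal sketch of building the same prompt without that indentation is shown below. The helper name `build_prompt` is hypothetical, and whether the extra whitespace affects the model's answer is not something tested here.

```python
def build_prompt(question, context):
    """Assemble the question-answering prompt with no indentation inside the string."""
    lines = [
        "Only answer the question below if you have 100% certainty of the facts, "
        "use the context below to answer.",
        "Here is some context:",
        context,
        f"Q: {question}",
        "A:",
    ]
    return "\n".join(lines)

print(build_prompt("What does Pentera do?", "Pentera is in the field of Cybersecurity."))
```

Keeping prompt construction in its own function also makes it easy to inspect the exact text before it is sent to the API.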
