Source: https://www.pinecone.io/learn/openai-gen-qa/

In [1]:
# get the openai secret key
import getpass

OPENAI_API_KEY = getpass.getpass('Please enter your openai key: ')

In [2]:
!pip install -qU openai pinecone-client datasets cohere tiktoken


In [3]:
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = OPENAI_API_KEY

In [5]:
query = "who was the 12th person on the moon and when did they land?"

# now query gpt-3.5-turbo WITHOUT context
res = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": query,
        }
    ]
)

res.choices[0].message.content


"The 12th person to walk on the moon was Eugene Cernan, who was a part of NASA's Apollo 17 mission. He landed on the moon on December 11, 1972."

In [6]:
# first let's make it simpler to get answers
def complete(prompt):
    # query gpt-3.5-turbo through the chat completions endpoint
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user", 
                "content": prompt
            }
        ]
    )
    return response.choices[0].message.content

query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

complete(query)

"If you only have pairs of related sentences and want to train sentence transformers, you can use a Siamese-like training setup. Here's a recommended approach:\n\n1. Data Collection: Gather a large number of sentence pairs where the sentences share a semantic relationship (e.g., similar meaning, paraphrases, etc.). You can create your own labeled dataset or use an existing one such as the STSbenchmark dataset.\n\n2. Preprocessing: Preprocess your sentences by tokenizing, removing stop words, lowercasing, etc.\n\n3. Model Architecture: Use a model architecture suitable for sentence embeddings, such as BERT, RoBERTa, or other transformer-based models. These models can be adapted for sentence embeddings by adding task-specific layers.\n\n4. Training Objective: Define a training objective to optimize sentence embeddings for semantic similarity. One common approach is to use contrastive loss, such as the triplet loss or the multiple negative pairs loss.\n\n   a. Triplet Loss: Randomly selec

In [9]:
embed_model = "text-embedding-ada-002"

res = openai.embeddings.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], model=embed_model
)

In [26]:
# vector embeddings for each document
res.data

[Embedding(embedding=[-0.0030535899568349123, 0.011695603840053082, -0.005062929820269346, -0.02725830115377903, -0.01638462394475937, 0.03228417783975601, -0.01620945893228054, -0.0010257232934236526, -0.025843510404229164, -0.006609093863517046, 0.020197821781039238, 0.01664063334465027, -0.009135506115853786, 0.023472052067518234, -0.010186493396759033, 0.013467460870742798, 0.02522369846701622, -0.01688317023217678, 0.012113303877413273, -0.01634420081973076, -0.00420058099552989, -0.00645414087921381, -0.0044060624204576015, 0.02081763558089733, -0.010503137484192848, -0.0037390899378806353, 0.01364936213940382, -0.026342056691646576, -0.00040612072916701436, -0.0021794515196233988, 0.005820853170007467, -0.010125859640538692, -0.028241917490959167, -0.016222933307290077, -0.0042915320955216885, 0.007457968313246965, -0.002913795178756118, -0.03144877776503563, 0.023835856467485428, -0.033335164189338684, -0.0003707509604282677, 0.013083445839583874, 0.007073953747749329, -0.00569

In [19]:
# we have created two vectors (one for each sentence input)
len(res.data)

2

In [21]:
# we have created two 1536-dimensional vectors
len(res.data[0].embedding), len(res.data[1].embedding)

(1536, 1536)
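
Each embedding is a plain list of 1536 floats, and the index created later in this notebook compares vectors with `metric='cosine'`. As a rough illustration of what that metric computes, here is a minimal pure-Python sketch. `cosine_similarity` is a hypothetical helper written for this example, not part of the openai or pinecone clients, and the short toy vectors stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# toy vectors standing in for real 1536-dimensional embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

With the data above, the same helper could be applied as `cosine_similarity(res.data[0].embedding, res.data[1].embedding)`.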

In [27]:
# we can also get the vector for a single sentence
res.data[0].embedding

[-0.0030535899568349123,
 0.011695603840053082,
 -0.005062929820269346,
 -0.02725830115377903,
 -0.01638462394475937,
 0.03228417783975601,
 -0.01620945893228054,
 -0.0010257232934236526,
 -0.025843510404229164,
 -0.006609093863517046,
 0.020197821781039238,
 0.01664063334465027,
 -0.009135506115853786,
 0.023472052067518234,
 -0.010186493396759033,
 0.013467460870742798,
 0.02522369846701622,
 -0.01688317023217678,
 0.012113303877413273,
 -0.01634420081973076,
 -0.00420058099552989,
 -0.00645414087921381,
 -0.0044060624204576015,
 0.02081763558089733,
 -0.010503137484192848,
 -0.0037390899378806353,
 0.01364936213940382,
 -0.026342056691646576,
 -0.00040612072916701436,
 -0.0021794515196233988,
 0.005820853170007467,
 -0.010125859640538692,
 -0.028241917490959167,
 -0.016222933307290077,
 -0.0042915320955216885,
 0.007457968313246965,
 -0.002913795178756118,
 -0.03144877776503563,
 0.023835856467485428,
 -0.033335164189338684,
 -0.0003707509604282677,
 0.013083445839583874,
 0.0070739

In [28]:
from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})

In [29]:
data[0]

{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
 'published': '2021-07-06 13:00:03 UTC',
 'url': 'https://youtu.be/35Pdoyi6ZoQ',
 'video_id': '35Pdoyi6ZoQ',
 'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
 'id': '35Pdoyi6ZoQ-t0.0',
 'text': 'Hi, welcome to the video.',
 'start': 0.0,
 'end': 9.36}

In [30]:
from tqdm.auto import tqdm

new_data = []

window = 20  # number of sentences to combine
stride = 4  # number of sentences to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # the window spans two different videos, so skip it
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })

  0%|          | 0/52155 [00:00<?, ?it/s]
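
The `window` and `stride` settings above produce overlapping chunks: each chunk starts `stride` sentences after the previous one and covers up to `window` sentences. A simplified sketch on toy data may make the overlap easier to see. `sliding_windows` is a hypothetical helper, and it leaves out the video-title boundary check that the loop above performs.

```python
def sliding_windows(sentences, window=20, stride=4):
    """Yield (start_index, merged_text) for overlapping runs of sentences."""
    for start in range(0, len(sentences), stride):
        chunk = sentences[start:start + window]
        yield start, " ".join(chunk)

# ten toy "sentences" with a small window so the overlap is visible
toy = [f"s{n}" for n in range(10)]
chunks = list(sliding_windows(toy, window=4, stride=2))
for start, text in chunks:
    print(start, text)
```

With `window=4` and `stride=2`, consecutive chunks share two sentences. The notebook's settings (`window=20`, `stride=4`) mean consecutive chunks share 16 sentences.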

In [31]:
new_data[0]

{'start': 0.0,
 'end': 74.12,
 'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
 'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.",
 'id': '35Pdoyi6ZoQ-t0.0',
 'url': 'https://youtu.be/35Pdoyi6ZoQ',
 'published': '2021-07-06 13:0

In [32]:
PINECONE_API_KEY = getpass.getpass('Please enter your pinecone key: ')

In [37]:
import pinecone

index_name = 'test' # change this to your index name

# initialize connection (get API key at app.pinecone.io)
# note: pinecone.init is the pinecone-client 2.x API; 3.x and later use a Pinecone class instead
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment="gcp-starter"  # find next to API key
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res.data[0].embedding),
        metric='cosine',
        metadata_config={
            'indexed': ['channel_id', 'published']
        }
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [39]:
from tqdm.auto import tqdm
import datetime
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, retrying every 5 seconds while we hit RateLimitError
    while True:
        try:
            res = openai.embeddings.create(input=texts, model=embed_model)
            break
        except openai.RateLimitError:
            sleep(5)
    embeds = [record.embedding for record in res.data]
    # cleanup metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/487 [00:00<?, ?it/s]
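
The loop above waits a fixed 5 seconds between retries. A common alternative is exponential backoff with a cap on the number of attempts, sketched below under stated assumptions: `with_backoff` and `flaky` are hypothetical names, and `flaky` only simulates a rate-limited call so the example runs without network access.

```python
import time

def with_backoff(fn, retry_on=(Exception,), max_retries=5,
                 base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with doubling waits."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)

# stand-in for a rate-limited API call: fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # record the requested delays instead of actually sleeping
result = with_backoff(flaky, retry_on=(RuntimeError,), sleep=waits.append)
```

In the embedding loop this might be used as `with_backoff(lambda: openai.embeddings.create(input=texts, model=embed_model), retry_on=(openai.RateLimitError,))`, so that only rate-limit errors are retried and anything else fails immediately.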

In [40]:
res = openai.embeddings.create(
    input=[query],
    model=embed_model
)

# take the embedding of the query
xq = res.data[0].embedding

# retrieve the most similar chunks (with their metadata) from Pinecone
res = index.query(xq, top_k=2, include_metadata=True)
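
Conceptually, a `top_k` query returns the stored vectors that score highest against the query vector. The sketch below shows that idea as a brute-force search over a toy in-memory dictionary. It is only an illustration and makes no claim about how Pinecone implements search internally. `brute_force_top_k` and `toy_index` are hypothetical names, and the vectors are assumed to be unit-length, in which case the dot product equals cosine similarity.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_top_k(query_vec, vectors, top_k=2):
    """Return the top_k (id, score) pairs by dot product, highest first.

    `vectors` maps an id to a vector. All vectors are assumed unit-length,
    so the dot product equals cosine similarity.
    """
    scored = ((vec_id, dot(query_vec, vec)) for vec_id, vec in vectors.items())
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])

# toy 2-d unit vectors standing in for stored embeddings
toy_index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [0.6, 0.8],
}
matches = brute_force_top_k([1.0, 0.0], toy_index, top_k=2)
```

Scanning every vector costs time proportional to the size of the collection, which is why a dedicated vector index is used for the roughly 48,700 chunks upserted above.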

In [41]:
res

{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 568.4,
                           'published': datetime.datetime(2021, 11, 24, 16, 24, 24, tzinfo=tzutc()),
                           'start': 418.88,
                           'text': 'pairs of related sentences you can go '
                                   'ahead and actually try training or '
                                   'fine-tuning using NLI with multiple '
                                   "negative ranking loss. If you don't have "
                                   'that fine. Another option is that you have '
                                   'a semantic textual similarity data set or '
                                   'STS and what this is is you have so you '
                                   'have sentence A here, sentence B here and '
                                   'then you have a score from from 0 to 1 '
    

In [42]:
limit = 3750

def retrieve(query):
    res = openai.embeddings.create(
        input=[query],
        model=embed_model
    )

    # take the embedding of the query
    xq = res.data[0].embedding

    # retrieve the most similar chunks (with their metadata) from Pinecone
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # add contexts in order, stopping before the joined text reaches `limit` characters
    # (also handles zero or one retrieved context, where the old loop never set `prompt`)
    kept = []
    for context in contexts:
        if len("\n\n---\n\n".join(kept + [context])) >= limit:
            break
        kept.append(context)
    prompt = prompt_start + "\n\n---\n\n".join(kept) + prompt_end
    return prompt

In [43]:
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

"Answer the question based on the context below.\n\nContext:\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can 

In [44]:
print(query_with_contexts)

Answer the question based on the context below.

Context:
pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine

In [45]:
# then we complete the context-infused query
complete(query_with_contexts)

'When you only have pairs of related sentences, you should use a training method called semantic textual similarity (STS). This method involves training the model using a similarity score between the two sentences, usually calculated using cosine similarity loss.'