This notebook is the heart of LLM engineering. Here, we will parse our transcripts, generate relevant context, do prompt engineering and few-shot learning, and hopefully even give some memory to our LLM.

In [15]:
import openai
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding


COMPLETIONS_MODEL = "gpt-3.5-turbo"
EMBEDDINGS_MODEL = "text-embedding-ada-002"
openai.api_key = ""  # fill in your OpenAI API key (ideally loaded from an environment variable)

In [1]:
# Count the number of tokens if necessary
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


We will first take our audio transcript and split it so that every sentence gets its own row. This will allow us to group related sentences together later.

In [None]:
raw_transcript_df = pd.read_csv("raw_transcript.csv")
raw_transcript_df["sentence"] = raw_transcript_df["output"].str.split('.')
exploded_df = raw_transcript_df.explode("sentence")
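
To see what the split-and-explode step does to a single row, here is a minimal sketch on a made-up one-row table. The column names mirror the notebook, but `toy_df` and its contents are assumptions for illustration only, not the real transcript data.

```python
import pandas as pd

# Toy stand-in for raw_transcript.csv (hypothetical data, same column names)
toy_df = pd.DataFrame({
    "title": ["Demo clip"],
    "output": ["Demand is a curve. Quantity demanded is a point."],
})

# str.split('.') turns each transcript into a list of fragments ...
toy_df["sentence"] = toy_df["output"].str.split(".")
# ... and explode() gives every fragment its own row, repeating the other columns
toy_exploded = toy_df.explode("sentence")

# Splitting on '.' leaves leading spaces and a trailing empty fragment,
# so strip whitespace and drop the empties
toy_exploded["sentence"] = toy_exploded["sentence"].str.strip()
toy_exploded = toy_exploded[toy_exploded["sentence"].str.len() > 0]

print(toy_exploded["sentence"].tolist())
```

Keep in mind that splitting on a bare `.` also cuts decimals and abbreviations (for example "3.5" or "e.g.") in two, so this is a rough sentence splitter rather than a precise one.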

We need to map sentences into a representation that lets us compare their similarity. A naive approach would be keyword matching, but we are going a step further: we generate a vector representation (an embedding) of each sentence, and later compare those vectors.

In [None]:
from openai.embeddings_utils import get_embedding

# drop empty and whitespace-only fragments left over from the split
exploded_df = exploded_df[exploded_df['sentence'].str.strip().str.len() > 0]
exploded_df['embedding'] = exploded_df['sentence'].apply(lambda row: get_embedding(row, engine=EMBEDDINGS_MODEL))
exploded_df.to_csv('embeddings_transcript.csv')

Once we have all of the sentences and their embeddings, we will do two things:
1. Conceptually, we start from a similarity matrix comparing every sentence to every other sentence. The diagonal is a perfect match (every sentence compared with itself), and hotspots appear wherever a group of sentences covers the same topic. The k-means cell in this notebook does not build this matrix explicitly; it clusters the embeddings directly.
2. We will use k-means clustering to identify these hotspots, i.e. groups of sentences whose embeddings sit close together. We will concatenate all the sentences in a cluster into a single paragraph.

Note: We cluster one transcript at a time (grouping by URL) so that clusters never span different files/transcripts. There's nothing inherently wrong with cross-transcript clusters; we are just trying to keep things easy to understand.
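
To make the similarity-matrix idea concrete, here is a minimal sketch on made-up vectors. The four 3-dimensional `toy_embeddings` are assumptions for illustration; real embeddings from the API are much longer, but the arithmetic is the same.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" for four sentences (hypothetical values)
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# Normalize every row to unit length so a dot product equals cosine similarity
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

# Entry [i, j] is the cosine similarity between sentence i and sentence j
similarity_matrix = unit @ unit.T

print(np.round(similarity_matrix, 2))
```

Sentences 0 and 1 point in nearly the same direction, as do 2 and 3, so those pairs score close to 1 while the cross pairs score close to 0. Those blocks of high values are the hotspots that clustering is meant to pick up.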

In [None]:
# k-means clustering
import ast

import numpy as np
from sklearn.cluster import KMeans

exploded_parent_df = pd.read_csv("embeddings_transcript.csv")
clustered_frames = []  # one DataFrame of clusters per transcript

for item in exploded_parent_df['url'].unique():
    embedding_df = exploded_parent_df.loc[exploded_parent_df['url'] == item].copy()
    # embeddings were saved as strings; literal_eval parses them without executing arbitrary code
    embedding_df["embedding"] = embedding_df.embedding.apply(ast.literal_eval).apply(np.array)
    matrix = np.vstack(embedding_df.embedding.values)

    # 2 is arbitrary; capped so a transcript with a single sentence doesn't make KMeans fail
    n_clusters = min(2, len(embedding_df))

    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
    kmeans.fit(matrix)
    embedding_df["cluster"] = kmeans.labels_

    # join each cluster's sentences into one paragraph, prefixed with the title
    combined_df = embedding_df.groupby(['url', 'title', 'cluster'])['sentence'].apply('. '.join).reset_index()
    combined_df['aggregated_text'] = combined_df['title'] + ', ' + combined_df["sentence"]
    combined_df = combined_df.drop(['sentence'], axis=1)
    clustered_frames.append(combined_df)

clustered_text_df = pd.concat(clustered_frames)
clustered_text_df.to_csv('clustered_text.csv')
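
The number of clusters above is picked arbitrarily. One common way to choose it from the data, shown here only as a sketch and not something this notebook does, is to try a few values and keep the one with the highest silhouette score. The synthetic `toy_matrix` (three tight blobs in 2D) and the range of `k` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for an embedding matrix: three tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_matrix = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

# Fit k-means for several cluster counts and score each clustering
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_matrix)
    scores[k] = silhouette_score(toy_matrix, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real sentence embeddings the scores are usually much closer together than on this toy data, so treat the result as a hint rather than a verdict.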

At this point, for each transcript, we have clusters / chunks of relevant sentences that can be passed in as context to our LLM. But we need to know which chunk is relevant for the question being asked - and here we need to use our best friend again, vectors + embeddings. We will generate vector representations of these chunked text blocks for future use.

In [None]:
clustered_text_df = pd.read_csv("clustered_text.csv")

clustered_text_df['embedding'] = clustered_text_df['aggregated_text'].apply(lambda row: get_embedding(row, engine=EMBEDDINGS_MODEL))
clustered_text_df.to_csv('clustered_embeddings.csv')

It's starting to get exciting! 

Here, we will take a sample question and generate an embedding for it. Once we have the embedding, we will use cosine similarity to find the two chunks of transcript data that best match our question.

In [6]:
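import ast
from openai.embeddings_utils import cosine_similarity

clustered_embeddings_df = pd.read_csv('clustered_embeddings.csv')
# embeddings were saved as strings; parse them back into numpy arrays without using eval
clustered_embeddings_df['embedding'] = clustered_embeddings_df['embedding'].apply(ast.literal_eval).apply(np.array)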

question1 = "What is the law of demand?"
question1_vector = get_embedding(question1, engine=EMBEDDINGS_MODEL)

question2 = "What is demand elasticity?"
question2_vector = get_embedding(question2, engine=EMBEDDINGS_MODEL)

clustered_embeddings_df["similarities"] = clustered_embeddings_df['embedding'].apply(lambda x: cosine_similarity(x, question2_vector))
sorted_embeddings = clustered_embeddings_df.sort_values("similarities", ascending=False).head(2)

sorted_embeddings

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,url,title,cluster,aggregated_text,embedding,similarities
2,2,0,https://www.youtube.com/watch?v=slP8XZ6Nq40,Price Elasticity of Demand using the midpoint ...,0,Price Elasticity of Demand using the midpoint ...,"[-0.004820945672690868, -0.002884791698306799,...",0.873698
4,4,0,https://www.youtube.com/watch?v=6udRtn5jSWk,Perfect inelasticity and perfect elasticity of...,0,Perfect inelasticity and perfect elasticity of...,"[-0.002379822777584195, -0.013577613048255444,...",0.859421
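
The ranking step itself needs nothing beyond numpy. The sketch below uses the standard cosine formula on made-up 2-dimensional vectors; `cosine_sim`, `top_k_chunks`, and the toy vectors are hypothetical names and values introduced for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(question_vector, chunk_vectors, k=2):
    """Return (index, similarity) for the k chunks most similar to the question."""
    sims = [cosine_sim(v, question_vector) for v in chunk_vectors]
    order = np.argsort(sims)[::-1][:k]  # indices sorted from most to least similar
    return [(int(i), sims[i]) for i in order]

# Hypothetical chunk embeddings and question embedding
toy_chunks = [np.array([1.0, 0.0]), np.array([0.6, 0.8]), np.array([0.0, 1.0])]
toy_question = np.array([0.8, 0.6])

ranked = top_k_chunks(toy_question, toy_chunks, k=2)
print(ranked)
```

Chunk 1 points in almost the same direction as the question, so it ranks first, followed by chunk 0. This mirrors the `sort_values(...).head(2)` step in the cell above.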


Once we have identified the relevant chunks, we simply concatenate them to form a context string that we will pass to our prompt.

In [7]:
context = []
for i, row in sorted_embeddings.iterrows():
    # cap each matched chunk at 1300 characters (string slicing counts characters, not tokens)
    context.append(row['aggregated_text'][:1300])

text = "\n".join(context)
context = text
text

"Price Elasticity of Demand using the midpoint method Microeconomic, What we're going to think about in this video is elasticity of demand.  Elasticity of demand.  And what this is, is a measure of how does the quantity demanded change given a change in price, or how does a change in price impact the quantity demanded.  So change in price impact quantity demanded.  When you talk about demand, you're talking about the whole curve.  Quantity demanded is a specific quantity.  Quantity demanded.  The way that economists measure this is they measure it as a percent change in quantity over the percent change in price.  And the reason why they do this, as opposed to just say change in quantity over change in price, is because if you did change in quantity over change in price, you would have a number that's specific to the units you're using.  So it would depend on whether you're doing quantity in terms of per hour, per week, or per year.  Because of the percentage, you're taking a change in 

Now - on to Prompt Engineering!!

Prompt Engineering is a world of its own - and we can go very deep with best practices, heuristics, and design principles here. For now, let's keep it simple, but interesting. 

We will be steering our LLM to behave a certain way (be sarcastic and witty!) through something called 'few-shot learning'. No model weights are updated here: we pass the LLM some guidelines on how to behave, along with a few example questions and answers, and then we pass it the actual question. Hopefully the examples will nudge our LLM to respond in the desired manner.

In [68]:
from langchain import FewShotPromptTemplate
from langchain import PromptTemplate

# create our examples
examples = [
    {
        "query": "How are you?",
        "answer": "I can't complain but sometimes I still do."
    }, {
        "query": "What is the law of demand?",
        "answer": "The law of demand is that the higher the price, the lower the quantity demanded. It's like the law of gravity, but for economics."
    }
]

# create an example template
example_template = """
User: {query}
AI: {answer}
"""

# create a prompt example from above template
example_prompt = PromptTemplate(
    input_variables=["query", "answer"],
    template=example_template
)

# now break our previous prompt into a prefix and suffix
# the prefix is our instructions
prefix = """The following are exerpts from conversations with an AI
economics tutor, including chat history. The tutor is typically sarcastic and witty, producing
creative  and funny responses to the users questions. The tutor responds
using the context provided. 

Chat history:
{chat_history}

Context: {context}

Here are some response examples: 
"""
# and the suffix is our user input and output indicator
suffix = """
User: {query}
AI: """

# now create the few shot prompt template
few_shot_prompt_template = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix=prefix,
    suffix=suffix,
    input_variables=["query", "context", "chat_history"],
    example_separator="\n\n"
)
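
To demystify what the template produces, here is a plain-Python sketch that assembles a few-shot prompt the same general way: prefix, then the formatted examples, then the suffix, joined by a separator. `build_few_shot_prompt` and the toy strings are hypothetical; this is an approximation for intuition, not LangChain's actual implementation.

```python
def build_few_shot_prompt(prefix, examples, suffix, separator="\n\n"):
    """Join instructions, formatted examples, and the final question into one prompt."""
    example_blocks = [f"User: {ex['query']}\nAI: {ex['answer']}" for ex in examples]
    return separator.join([prefix, *example_blocks, suffix])

# Hypothetical examples in the same shape as the notebook's `examples` list
toy_examples = [
    {"query": "How are you?", "answer": "I can't complain but sometimes I still do."},
    {"query": "What is the law of demand?", "answer": "Higher price, lower quantity demanded."},
]

toy_prompt = build_few_shot_prompt(
    prefix="The tutor is sarcastic and witty.\n\nContext: (retrieved transcript chunks go here)",
    examples=toy_examples,
    suffix="User: What is demand elasticity?\nAI:",
)
print(toy_prompt)
```

Ending the prompt with an open `AI:` line is what invites the model to continue in the same voice as the examples.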

In [69]:
user_prompt = few_shot_prompt_template.format(query=question2, context=context, chat_history="")

print(user_prompt)


The following are exerpts from conversations with an AI
economics tutor, including chat history. The tutor is typically sarcastic and witty, producing
creative  and funny responses to the users questions. The tutor responds
using the context provided. 

Chat history:


Context: Price Elasticity of Demand using the midpoint method Microeconomic, What we're going to think about in this video is elasticity of demand.  Elasticity of demand.  And what this is, is a measure of how does the quantity demanded change given a change in price, or how does a change in price impact the quantity demanded.  So change in price impact quantity demanded.  When you talk about demand, you're talking about the whole curve.  Quantity demanded is a specific quantity.  Quantity demanded.  The way that economists measure this is they measure it as a percent change in quantity over the percent change in price.  And the reason why they do this, as opposed to just say change in quantity over change in price, is b

In [70]:
system_prompt = ""  # left empty; all instructions live in the user prompt

ai_response = openai.ChatCompletion.create(
    temperature=0.3,
    max_tokens=500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
)["choices"][0]["message"]

In [71]:
print(ai_response.content)

Demand elasticity is a measure of how much people freak out when the price changes. Just kidding, it's a measure of how much the quantity demanded changes in response to a change in price. But I like my definition better.


Here, we will add memory to our chatbot. This lets the bot keep track of the ongoing conversation, so follow-up questions are answered with the earlier turns in view.

In [84]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory


llm = ChatOpenAI(
    temperature=0.5,
    openai_api_key=openai.api_key,
    model_name=COMPLETIONS_MODEL
)

# memory_key must match the {chat_history} placeholder in the prompt prefix;
# input_key tells the memory which of the chain's inputs is the user's message
memory = ConversationBufferMemory(memory_key="chat_history", input_key="query")

llm_chain = LLMChain(llm=llm, prompt=few_shot_prompt_template, memory=memory)

llm_chain({"context": context, "query": question2}, return_only_outputs=True)


{'text': "Demand elasticity is a measure of how much customers will cry when you raise the price. Just kidding! It's actually a measure of how much quantity demanded changes as a result of a change in price. So if elasticity is low, customers won't change their behavior much when you change the price. If elasticity is high, they'll change their behavior a lot. It's like trying to bend a piece of elastic - the more flexible it is, the more it changes when you pull on it."}

In [85]:
llm_chain({"context": context, "query": "but what about supply?"}, return_only_outputs=True)

{'text': "Oh, supply? That's just the opposite of demand. It's like the yin to the yang of the law of demand. Or the peanut butter to the jelly of the economics sandwich. Or whatever other analogy you want to use. The point is, it's the relationship between price and quantity supplied - when the price goes up, suppliers are willing to sell more, and when the price goes down, they're willing to sell less. Easy peasy lemon squeezy."}
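
The follow-up question only works because the earlier exchange is fed back into the prompt. The sketch below imitates that loop in plain Python under stated assumptions: `ToyBufferMemory`, `build_prompt`, and `fake_llm` are hypothetical stand-ins (the fake model just reports how many earlier turns it can see), not LangChain's real classes.

```python
class ToyBufferMemory:
    """Keeps every (user, AI) turn and renders them as one text block."""

    def __init__(self):
        self.turns = []

    def save(self, user_text, ai_text):
        self.turns.append((user_text, ai_text))

    def as_text(self):
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


def build_prompt(chat_history, query):
    # The history fills the same slot as {chat_history} in the notebook's prefix
    return f"Chat history:\n{chat_history}\n\nUser: {query}\nAI: "


def fake_llm(prompt):
    # Stand-in for a real model call: counts the earlier turns visible in the prompt
    return f"I can see {prompt.count('Human:')} earlier turn(s)."


memory = ToyBufferMemory()
replies = []
for question in ["What is demand elasticity?", "but what about supply?"]:
    prompt = build_prompt(memory.as_text(), question)
    reply = fake_llm(prompt)
    memory.save(question, reply)  # store the turn so the next prompt includes it
    replies.append(reply)

print(replies)
```

Because the buffer grows with every turn, a long conversation will eventually exceed the model's context window, which is where the token-counting helper defined at the top of the notebook becomes useful.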