# Definer Experiments
The previous notebook was getting quite full, so here is a new notebook for the Definer project.

In [1]:
# Put all imports here to be efficient
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from pydantic import BaseModel, Field, validator
from langchain.output_parsers import PydanticOutputParser
from langchain.schema import OutputParserException
from typing import List
import os

In [2]:
# Way to generate a random test input using transcripts from Lex Fridman's podcast
# Make sure you have the transcripts downloaded in the folder lex_whisper_transcripts

import test_on_lex

transcripts = test_on_lex.load_lex_transcripts(random_n=10, transcript_folder="./lex_whisper_transcripts/", chunk_time_seconds=20)

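import random
def generate_test_input():
    # Pick one of the loaded transcripts at random; randrange excludes the
    # upper bound, so the index is always valid
    idx = random.randrange(len(transcripts))
    key = list(transcripts.keys())[idx]
    transcript = transcripts[key]
    # Choose a start point away from both ends of the transcript
    trans_idx = random.randint(10, len(transcript)-10)
    latest = transcript[trans_idx:trans_idx+7]
    # The first five chunks are the previous context, the sixth is the current chunk
    prev_transcripts, curr_transcripts = ",".join(latest[0:5]), latest[5]
    return prev_transcripts + "\n" + curr_transcripts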

generate_test_input()

Processing episode_250_large...
Processing episode_217_large...
Processing episode_085_large...
Processing episode_201_large...
Processing episode_142_large...
Processing episode_088_large...
Processing episode_078_large...
Processing episode_241_large...
Processing episode_140_large...
Processing episode_057_large...


" That, to me, is an interesting application of a humanoid form because humans are drawn, like I mentioned to you, like robots, we're drawn to legs and limbs and body language and all that kind of stuff. And even a face, even if you don't have the facial features, which you might not want to have to reduce the creepiness factor, all that kind of stuff. But yeah, that, to me, the humanoid form is compelling., But in terms of that being the right form for the factory environment, I'm not so sure. Yeah, for the factory environment, like right off the bat, what are you optimizing for? Is it strength? Is it mobility? Is it versatility, right? Like that changes completely the look and feel of the robot that you create, you know, and almost certainly the human form is over designed for some dimensions and constrained for some dimensions., And so, like, what are you grasping? Is it big? Is it little, right? So you would customize it and make it customizable for the different needs if that was 

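To make the windowing in `generate_test_input` easier to see, here is a minimal sketch on synthetic data. The helper `sample_window` and the `chunk N` strings are made up for illustration; the real function works on the loaded Lex transcripts.

```python
import random

def sample_window(chunks, rng, n_prev=5):
    """Pick n_prev consecutive chunks as context plus the next chunk as the current one."""
    # Hypothetical helper: start is chosen so the window never runs off the end
    start = rng.randrange(0, len(chunks) - n_prev)
    window = chunks[start:start + n_prev + 1]
    prev, curr = ",".join(window[:n_prev]), window[n_prev]
    return prev + "\n" + curr

# Synthetic stand-in for one transcript split into 20-second chunks
chunks = [f"chunk {i}" for i in range(30)]
sample = sample_window(chunks, random.Random(0))
prev_part, curr_part = sample.split("\n")
print(prev_part)
print(curr_part)
```

The first line of the result holds the five context chunks joined by commas, and the second line holds the chunk that immediately follows them.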
In [3]:
def format_list_data(list_data: list):
    return "\n".join([f"{i+1}. {example}" for i, example in enumerate(list_data)])
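
As a quick check of what `format_list_data` produces for the `definitions_history` slot of the prompt, here is a minimal sketch. The sample entries are made up for illustration.

```python
def format_list_data(list_data: list):
    return "\n".join([f"{i+1}. {example}" for i, example in enumerate(list_data)])

# Hypothetical history entries, for illustration only
history = ["Moore's Law", "Pareto Principle"]
formatted = format_list_data(history)
print(formatted)
# 1. Moore's Law
# 2. Pareto Principle
```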

In [40]:
proactive_rare_word_agent_prompt_blueprint = """
# Objective: 
Identify "Rare Entities" in a conversation transcript. These include rare words, phrases, jargon, adages, people, places, organizations, events, etc. that are not well known to the average high schooler, in accordance with current trends. You can also intelligently detect entities that are described in the conversation but not explicitly mentioned.

# Criteria for Rare Entities in order of importance:
1. Rarity: Select entities that are unlikely for an average high schooler to know. Well-known entities, which should be skipped, include Fortune 500 organizations, worldwide-known events, popular locations, and entities popularized by recent news or events such as "COVID-19", "Bitcoin", or "Generative AI".
2. Utility: Definition should help a user understand the conversation better and achieve their goals.
3. No Redundancy: Exclude definitions if already defined in the conversation.
4. Complexity: Choose terms with non-obvious meanings, such as "Butterfly Effect" and not "Electric Car".
5. Definability: Must be clearly and succinctly definable in under 10 words.
6. Searchability: Likely to have a specific and valid reference source: Wikipedia page, dictionary entry etc.

# Conversation Transcript:
<Transcript start>{conversation_context}<Transcript end>

# Output Guidelines:
Output an array:
entities: [{{ name: string, definition: string, ekg_search_keyword: string }}], where definition is concise (< 12 words) and ekg_search_keyword is the best search keyword for the Google Knowledge Graph.

## Additional Guidelines:
- Entity names should be quoted from the conversation, so the output definitions can be referenced back to the conversation.
- For the search keyword, use complete, official and context relevant keyword(s) to search for that entity. You might need to autocomplete entity names or use their official names or add additional context keywords to help with searchability, especially if the entity is ambiguous or has multiple meanings. For rare words, include "definition" in the search keyword.
- Definitions should use simple language to be easily understood.
- Select entities whose definitions you are very confident about, otherwise skip them.
- Multiple entities can be detected from one phrase, for example, "The Lugubrious Game" can be defined as a painting, and the rare word "lugubrious" is also worth defining.
- Limit results to 3 entities, prioritize rarity.
- Examples:
    - Completing incomplete name example: If the conversation talks about "Balmer" and "Microsoft", the keyword is "Steve Ballmer", but the entity name would be "Balmer" because that is the name quoted from the conversation.
    - Replacing unofficial name example: If the conversation talks about "Clay Institute", the keyword is "Clay Mathematics Institute" since that is the official name, but the entity name would be "Clay Institute" because that is the name quoted from the conversation.
    - Adding context example: If the conversation talks about "Theory of everything", the keyword needs context keywords such as "Theory of everything (concept)", because there is a popular movie with the same name. 

## Recent Definitions:
These have already been defined so don't define them again:
{definitions_history}

## Example Output:
entities: [{{ name: "Moore's Law", definition: "Computing power doubles every ~2 yrs", ekg_search_keyword: "Moore's Law" }}, {{ name: "80/20 Rule", definition: "Productivity concept; Majority of results come from few causes", ekg_search_keyword: "Pareto Principle" }}]

{format_instructions} 
If no relevant entities are identified, output an empty entities array.
"""

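The doubled braces in the blueprint are escapes. Assuming the template is rendered with Python's `str.format`-style substitution (which I believe is the default for LangChain's `PromptTemplate`), `{{` and `}}` come out as literal braces while single-braced names are filled in. A small stand-alone sketch, using a shortened made-up template rather than the full blueprint:

```python
# Shortened, made-up template that mirrors the escaping used in the blueprint
mini_blueprint = """<Transcript start>{conversation_context}<Transcript end>
entities: [{{ name: string, definition: string }}]
Already defined: {definitions_history}"""

rendered = mini_blueprint.format(
    conversation_context="we talked about the Riemann hypothesis",
    definitions_history="None",
)
print(rendered)
```

The `entities: [{ ... }]` line survives with single braces, which is what the model sees in the final prompt.
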
In [42]:
def run_proactive_rare_word_agent_and_definer(
    conversation_context: str, definitions_history: list = None
):
    # avoid a shared mutable default argument
    if definitions_history is None:
        definitions_history = []

    # run proactive rare word agent to find the entities worth defining
    proactive_rare_word_agent_response = run_proactive_rare_word_agent(
        conversation_context, definitions_history
    )

    # do nothing else if parsing failed (None) or no entities were identified
    if (
        proactive_rare_word_agent_response is None
        or not proactive_rare_word_agent_response.entities
    ):
        return []

    # TODO: pass words to define to definer agent; for now, return the entities
    print("proactive_rare_word_agent_response", proactive_rare_word_agent_response)
    return proactive_rare_word_agent_response.entities

class ProactiveRareWordAgentQuery(BaseModel):
    """
    Proactive rare word agent that identifies rare entities in a conversation context
    """

    to_define_list: list = Field(
        description="the rare entities to define",
    )

class Entity(BaseModel):
    name: str = Field(
        description="entity name",
    )
    definition: str = Field(
        description="entity definition",
    )
    ekg_search_keyword: str = Field(
        description="keyword to search for entity on Google Enterprise Knowledge Graph",
    )

class ConversationEntities(BaseModel):
    entities: List[Entity] = Field(
        description="list of entities and their definitions",
        default=[]
    )

proactive_rare_word_agent_query_parser = PydanticOutputParser(
    pydantic_object=ConversationEntities
)

def run_proactive_rare_word_agent(conversation_context: str, definitions_history: list):
    # start up GPT4 connection
    llm = ChatOpenAI(
        temperature=0,
        openai_api_key=os.environ.get("OPEN_AI_API_KEY"),
        model="gpt-4-1106-preview",
    )

    extract_proactive_rare_word_agent_query_prompt = PromptTemplate(
        template=proactive_rare_word_agent_prompt_blueprint,
        input_variables=[
            "conversation_context",
            "definitions_history",
        ],
        partial_variables={
            "format_instructions": proactive_rare_word_agent_query_parser.get_format_instructions()
        },
    )

    if len(definitions_history) > 0:
        definitions_history = format_list_data(definitions_history)
    else:
        definitions_history = "None"

    proactive_rare_word_agent_query_prompt_string = (
        extract_proactive_rare_word_agent_query_prompt.format_prompt(
            conversation_context=conversation_context,
            definitions_history=definitions_history,
        ).to_string()
    )

    # print("Proactive meta agent query prompt string", proactive_rare_word_agent_query_prompt_string)

    response = llm(
        [HumanMessage(content=proactive_rare_word_agent_query_prompt_string)]
    )

    print(response.content)
    try:
        res = proactive_rare_word_agent_query_parser.parse(
            response.content
        )
        return res
    except OutputParserException:
        return None
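
Because `run_proactive_rare_word_agent` returns `None` when the reply cannot be parsed, callers need to handle three cases: a parse failure, an empty entity list, and a populated list. The sketch below mimics that contract with only the standard library. `parse_entities` and `SketchEntity` are hypothetical stand-ins, and this is not how `PydanticOutputParser` is implemented.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SketchEntity:  # hypothetical stand-in for the Entity model
    name: str
    definition: str
    ekg_search_keyword: str

def parse_entities(raw: str) -> Optional[List[SketchEntity]]:
    """Return parsed entities, or None if the reply does not fit the schema."""
    try:
        payload = json.loads(raw)
        return [SketchEntity(**item) for item in payload.get("entities", [])]
    except (json.JSONDecodeError, TypeError, AttributeError):
        # Not JSON, wrong top-level type, or missing/extra fields
        return None

# Made-up replies covering the three cases
good = '{"entities": [{"name": "Riemann hypothesis", "definition": "Unsolved conjecture about the zeta function", "ekg_search_keyword": "Riemann hypothesis"}]}'
empty = '{"entities": []}'
bad = "Sorry, I could not find any entities."

print(parse_entities(good))
print(parse_entities(empty))
print(parse_entities(bad))
```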

In [43]:
# test_transcript = generate_test_input()
test_transcript = """yes, you can find it, okay? If you had this algorithm in your hands, okay? You could ask your computer, you know, I mean, P versus NP is one of these seven problems that carries this million dollar prize from the Clay Foundation. But what I like to say, the way that we can see that P versus NP is the biggest of all of these questions is that if you had this fast algorithm, then you could solve all seven of them, okay? You just ask your computer, you know, is there a short proof of the Riemann hypothesis, right? You know, that a machine could, in a language where a machine could verify it,
 and provided that such a proof exists, then your computer finds it in a short amount of time without having to do a brute force search, okay? So, I mean, those are the stakes of what we're talking about. But I hope that also helps to give your listeners some intuition of why I and most of my colleagues would put our money on P not equaling NP. Is it possible, I apologize this is a really dumb question, but is it possible to, we should go to the gallery to look at The Lugubrious Game, maybe that will help us relax"""
print(test_transcript)
res = run_proactive_rare_word_agent_and_definer(test_transcript, [])
res

yes, you can find it, okay? If you had this algorithm in your hands, okay? You could ask your computer, you know, I mean, P versus NP is one of these seven problems that carries this million dollar prize from the Clay Foundation. But what I like to say, the way that we can see that P versus NP is the biggest of all of these questions is that if you had this fast algorithm, then you could solve all seven of them, okay? You just ask your computer, you know, is there a short proof of the Riemann hypothesis, right? You know, that a machine could, in a language where a machine could verify it,
 and provided that such a proof exists, then your computer finds it in a short amount of time without having to do a brute force search, okay? So, I mean, those are the stakes of what we're talking about. But I hope that also helps to give your listeners some intuition of why I and most of my colleagues would put our money on P not equaling NP. Is it possible, I apologize this is a really dumb questio