# Definer Experiments
The previous notebook was getting quite full, so here is a new notebook for the Definer project.

In [1]:
# Put all imports here to be efficient
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from pydantic import BaseModel, Field, validator
from langchain.output_parsers import PydanticOutputParser
from langchain.schema import OutputParserException
from typing import List
import os

In [2]:
# Way to generate a random test input using transcripts from Lex Fridman's podcast
# Make sure you have the transcripts downloaded in the folder lex_whisper_transcripts

import test_on_lex

transcripts = test_on_lex.load_lex_transcripts(random_n=10, transcript_folder="./lex_whisper_transcripts/", chunk_time_seconds=20)

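import random

def generate_test_input():
    # Pick a random episode; random.choice avoids the off-by-one of
    # random.randint(0, 10), which is inclusive and can index past 10 items
    key = random.choice(list(transcripts.keys()))
    transcript = transcripts[key]
    # Pick a random start index, staying away from both ends of the episode
    trans_idx = random.randint(10, len(transcript)-10)
    latest = transcript[trans_idx:trans_idx+7]
    # First five chunks are the previous context, the sixth is the current chunk
    prev_transcripts, curr_transcripts = ",".join(latest[0:5]), latest[5]
    return prev_transcripts + "\n" + curr_transcripts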

generate_test_input()

Processing episode_317_large...
Processing episode_211_large...
Processing episode_186_large...
Processing episode_242_large...
Processing episode_061_large...
Processing episode_149_large...
Processing episode_153_large...
Processing episode_015_large...
Processing episode_290_large...
Processing episode_191_large...


" And if there was a really shitty job to do, he'd say, see the job, do the job, stay out of the misery. Just don't indulge any negativity, do the things that need done. And so there's like, there's an empowerment and a nobility together. And yeah, extraordinarily fortunate. Is there ways you think you could have been a better son?, Is there things you regret? Interesting question. Let me first say, just as a bit of a criticism, that what kind of man do you think you are not wearing a suit and tie, if a real man should? Exactly I agree with your dad on that point., You mentioned offline that he suggested a real man should wear a suit and tie. But outside of that, is there ways you could have been a better son? Maybe next time on your show, I'll wear a suit and tie. My dad would be happy about that., I can answer the question later in life, not early. I had just a huge amount of respect and reverence for my dad when I was young. So I was asking myself that question a lot. So there weren

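The windowing in `generate_test_input` can be checked on synthetic data without downloading the transcripts. This is a minimal sketch, assuming each transcript is a list of chunk strings; `build_window` and the fake chunks are hypothetical and not part of `test_on_lex`.

```python
def build_window(chunks, start, n_prev=5):
    """Join n_prev chunks with commas, then append the next chunk on a new line."""
    prev = ",".join(chunks[start:start + n_prev])
    curr = chunks[start + n_prev]
    return prev + "\n" + curr

# Fake transcript: 30 short chunks instead of real podcast text
chunks = [f"chunk {i}" for i in range(30)]
window = build_window(chunks, start=10)
print(window)
```

Using a fixed `start` instead of a random index keeps the check deterministic.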
In [3]:
def format_list_data(list_data: list):
    return "\n".join([f"{i+1}. {example}" for i, example in enumerate(list_data)])
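A quick check of what `format_list_data` produces for the definitions history. The function is repeated here so the snippet runs on its own; the sample history values are made up for illustration.

```python
def format_list_data(list_data: list) -> str:
    # Number each item starting at 1, one item per line
    return "\n".join(f"{i+1}. {example}" for i, example in enumerate(list_data))

history = ["Moore's Law", "Pareto Principle"]
formatted = format_list_data(history)
print(formatted)
```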

In [25]:
proactive_rare_word_agent_prompt_blueprint = """
# Objective: 
Identify "Rare Entities" in a conversation transcript. These include rare words, phrases, jargon, adages, people, places, organizations, events, etc. that are not well known to the average high schooler, in accordance with current trends. We are using a really lousy transcription service, so words are often misspelled, but you can autocorrect them and piece together implied entities that are described in the conversation context but not explicitly mentioned, using your vast knowledge base to derive the "Rare Entity" originally mentioned by the user. If you feel the need to search for the entity, then it is most likely mistranscribed.

# Criteria for Rare Entities in order of importance:
1. Rarity: Select entities that are unlikely for an average high schooler to know. Well-known entities include Fortune 500 organizations, worldwide-known events, popular locations, and entities popularized by recent news or events such as "COVID-19", "Bitcoin", or "Generative AI".
2. Utility: Definition should help a user understand the conversation better and achieve their goals.
3. No Redundancy: Exclude definitions if already defined in the conversation.
4. Complexity: Choose terms with non-obvious meanings, such as "Butterfly Effect" and not "Electric Car".
5. Definability: Must be clearly and succinctly definable in under 12 words.
6. Searchability: Likely to have a specific and valid reference source: Wikipedia page, dictionary entry etc.

# Conversation Transcript:
<Transcript start>{conversation_context}<Transcript end>

# Output Guidelines:
Output an array:
entities: [{{ name: string, definition: string, ekg_search_keyword: string }}], where definition is concise (< 12 words) and ekg_search_keyword is the best search keyword for the Google Knowledge Graph.

## Additional Guidelines:
- Entity names should be quoted from the conversation, so the output definitions can be referenced back to the conversation, unless they are transcribed wrongly, in which case use the official name.
- For the search keyword, use complete, official and context relevant keyword(s) to search for that entity. You might need to autocomplete entity names or use their official names or add additional context keywords to help with searchability, especially if the entity is ambiguous or has multiple meanings. For rare words, include "definition" in the search keyword.
- Definitions should use simple language to be easily understood.
- Select entities whose definitions you are very confident about, otherwise skip them.
- Multiple entities can be detected from one phrase, for example, "The Lugubrious Game" can be defined as a painting, and the rare word "lugubrious" is also worth defining.
- Limit results to 4 or fewer entities; prioritize rarity.
- Examples:
    - Completing incomplete name example: If the conversation talks about "Ballmer" and "Microsoft", the keyword is "Steve Ballmer", but the entity name would be "Ballmer" because that is the name quoted from the conversation.
    - Replacing unofficial name example: If the conversation talks about "Clay Institute", the keyword is "Clay Mathematics Institute" since that is the official name, but the entity name would be "Clay Institute" because that is the name quoted from the conversation.
    - Adding context example: If the conversation talks about "Theory of everything", the keyword needs context keywords such as "Theory of everything (concept)", because there is a popular movie with the same name. 

## Recent Definitions:
These have already been defined so don't define them again:
{definitions_history}

## Example Output:
entities: [{{ name: "Moore's Law", definition: "Computing power doubles every ~2 yrs", ekg_search_keyword: "Moore's Law" }}, {{ name: "80/20 Rule", definition: "Productivity concept; Majority of results come from few causes", ekg_search_keyword: "Pareto Principle" }}]

{format_instructions} 
If no relevant entities are identified, output an empty array.
"""

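The blueprint mixes literal braces (`{{ }}`) with template variables such as `{conversation_context}`. This is a minimal sketch using plain `str.format`, which follows the same escaping rule as the f-string style template format assumed here; the shortened template is illustrative only and is not the real prompt.

```python
# Doubled braces become single literal braces; single braces are variables
short_template = (
    "entities: [{{ name: string }}]\n"
    "<Transcript start>{conversation_context}<Transcript end>"
)

rendered = short_template.format(conversation_context="hello world")
print(rendered)
```

Forgetting to double a literal brace in the blueprint would make formatting fail with a missing-variable error, so this is worth checking whenever the output schema in the prompt changes.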
In [26]:
def run_proactive_rare_word_agent_and_definer(
    conversation_context: str, definitions_history: list = None
):
    # avoid sharing one mutable default list across calls
    if definitions_history is None:
        definitions_history = []

    # run proactive rare word agent to find the entities worth defining
    proactive_rare_word_agent_response = run_proactive_rare_word_agent(
        conversation_context, definitions_history
    )

    # do nothing else if parsing failed (None) or no entities were found
    if (
        proactive_rare_word_agent_response is None
        or not proactive_rare_word_agent_response.entities
    ):
        return []

    # TODO: pass entities to the definer agent; for now, print and return them
    print("proactive_rare_word_agent_response", proactive_rare_word_agent_response)
    return proactive_rare_word_agent_response.entities

class ProactiveRareWordAgentQuery(BaseModel):
    """
    Proactive rare word agent that identifies rare entities in a conversation context
    """

    to_define_list: list = Field(
        description="the rare entities to define",
    )

class Entity(BaseModel):
    name: str = Field(
        description="entity name",
    )
    definition: str = Field(
        description="entity definition",
    )
    ekg_search_keyword: str = Field(
        description="keyword to search for entity on Google Enterprise Knowledge Graph",
    )

class ConversationEntities(BaseModel):
    entities: List[Entity] = Field(
        description="list of entities and their definitions",
        default=[]
    )

proactive_rare_word_agent_query_parser = PydanticOutputParser(
    pydantic_object=ConversationEntities
)

def run_proactive_rare_word_agent(conversation_context: str, definitions_history: list):
    # start up GPT4 connection
    llm = ChatOpenAI(
        temperature=0,
        openai_api_key=os.environ.get("OPEN_AI_API_KEY"),
        model="gpt-4-1106-preview",
    )

    extract_proactive_rare_word_agent_query_prompt = PromptTemplate(
        template=proactive_rare_word_agent_prompt_blueprint,
        input_variables=[
            "conversation_context",
            "definitions_history",
        ],
        partial_variables={
            "format_instructions": proactive_rare_word_agent_query_parser.get_format_instructions()
        },
    )

    if len(definitions_history) > 0:
        definitions_history = format_list_data(definitions_history)
    else:
        definitions_history = "None"

    proactive_rare_word_agent_query_prompt_string = (
        extract_proactive_rare_word_agent_query_prompt.format_prompt(
            conversation_context=conversation_context,
            definitions_history=definitions_history,
        ).to_string()
    )

    # print("Proactive meta agent query prompt string", proactive_rare_word_agent_query_prompt_string)

    response = llm(
        [HumanMessage(content=proactive_rare_word_agent_query_prompt_string)]
    )

    print(response.content)
    try:
        res = proactive_rare_word_agent_query_parser.parse(
            response.content
        )
        return res
    except OutputParserException:
        return None
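The parse step can be exercised offline, without calling the model. This is a minimal sketch using only the standard library, not the `PydanticOutputParser` used above; `EntitySketch` and `parse_entities` are hypothetical names, and the assumption is that the model wraps its JSON in a fenced `json` block.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EntitySketch:
    name: str
    definition: str
    ekg_search_keyword: str

# Matches an optional ```json ... ``` wrapper around the payload
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)

def parse_entities(raw: str) -> Optional[List[EntitySketch]]:
    """Return the parsed entities, or None if the text is not valid output."""
    match = FENCE_RE.search(raw)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload)
        return [EntitySketch(**item) for item in data.get("entities", [])]
    except (json.JSONDecodeError, TypeError, AttributeError):
        # Bad JSON, wrong or missing keys, or a non-object top level
        return None

sample = (
    '```json\n{"entities": [{"name": "Hugging Face", '
    '"definition": "AI company specializing in NLP", '
    '"ekg_search_keyword": "Hugging Face"}]}\n```'
)
parsed = parse_entities(sample)
print(parsed)
```

A response that uses a different key name, such as `ekg_search_key`, is rejected by this sketch, which is one reason to keep the example output in the prompt aligned with the schema.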

In [28]:
# test_transcript = generate_test_input()
test_transcript = """
In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falkon LLM model"""
print(test_transcript)
res = run_proactive_rare_word_agent_and_definer(test_transcript, [])
res


In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falkon LLM model
```json
{
  "entities": [
    {
      "name": "Hugging Face",
      "definition": "AI company specializing in NLP",
      "ekg_search_keyword": "Hugging Face"
    },
    {
      "name": "OpenAI",
      "definition": "AI research laboratory",
      "ekg_search_keyword": "OpenAI"
    },
    {
      "name": "Falcon LLM model",
      "definition": "Large language machine learning model",
      "ekg_search_keyword": "Falcon LLM model"
    }
  ]
}
```
proactive_rare_word_agent_response entities=[Entity(name='Hugging Face', definition='AI company specializing in NLP', ekg_search_keyword='Hugging Face'), Entity(name='OpenAI', definition='AI research laboratory', ekg_search_keyword='OpenAI'), Entit