# Definer Experiments
The previous notebook was getting quite full, so here is a new notebook for the Definer project.

In [1]:
# Put all imports here to be efficient
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from pydantic import BaseModel, Field, validator
from langchain.output_parsers import PydanticOutputParser
from langchain.schema import OutputParserException
from typing import List
import os

In [2]:
# Way to generate a random test input using transcripts from Lex Fridman's podcast
# Make sure you have the transcripts downloaded in the folder lex_whisper_transcripts

import test_on_lex

transcripts = test_on_lex.load_lex_transcripts(random_n=10, transcript_folder="./lex_whisper_transcripts/", chunk_time_seconds=20)

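import random

def generate_test_input():
    # Pick a random episode; random.choice cannot index past the end of the list
    key = random.choice(list(transcripts.keys()))
    transcript = transcripts[key]
    # Pick a window start, leaving room for the chunks sliced below
    trans_idx = random.randint(10, len(transcript) - 10)
    latest = transcript[trans_idx:trans_idx + 7]
    # The first five chunks form the context; the sixth is the current utterance
    prev_transcripts, curr_transcripts = ",".join(latest[0:5]), latest[5]
    return prev_transcripts + "\n" + curr_transcripts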

generate_test_input()

Processing episode_158_large...
Processing episode_255_large...
Processing episode_108_large...
Processing episode_145_large...
Processing episode_269_large...
Processing episode_276_large...
Processing episode_251_large...
Processing episode_167_large...
Processing episode_270_large...
Processing episode_095_large...


" Won't selection result in a few pockets of interesting complexities? I mean, yeah, if we ran Earth over again, over and over and over, you're saying it's going to come up with, there's not going to be elephants every time? Yeah, I don't think so. I think that there will be similarities., And I think we don't know enough about how selection globally works. But it might be that the emergence of elephants was wired into the history of Earth in some way, like the gravitational force, how evolution was going, Cambrian explosions, blah, blah, blah, the emergence of mammals. But I just don't know enough about the contingency,, the variability. All I do know is you count the number of bits of information required to make an elephant and think about the causal chain that provide the lineage of elephants going all the way back to Luca, there's a huge scope for divergence. Yeah, but just like you said, with chemistry and selection,, the things that result in self replicating chemistry and self 

In [12]:
def format_list_data(list_data: list):
    return "\n".join([f"{i+1}. {example}" for i, example in enumerate(list_data)])
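
As a quick check, the helper above turns a list into a numbered block, which is the form `definitions_history` takes when it is substituted into the prompt. The sample entries below are made up for illustration.

```python
def format_list_data(list_data: list) -> str:
    # Number each item starting from 1, one item per line
    return "\n".join(f"{i+1}. {example}" for i, example in enumerate(list_data))

history = ["Moore's Law", "Pareto Principle"]  # hypothetical earlier definitions
formatted = format_list_data(history)
print(formatted)
```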

In [13]:
proactive_rare_word_agent_prompt_blueprint = """
# Objective: 
Identify "Rare Entities" in a conversation transcript. These include rare words, phrases, jargon, adages, people, places, organizations, events, etc. that are not well known to the average high schooler, in accordance with current trends. We are using a really lousy transcription service, so words are often misspelled, but you can autocorrect them and piece together implied entities that are described in the conversation context but not explicitly mentioned, using your vast knowledge base to derive the "Rare Entity" originally mentioned by the user. If you feel the need to search for the entity, then it is most likely mistranscribed.

# Criteria for Rare Entities in order of importance:
1. Rarity: Select entities that are unlikely for an average high schooler to know. Well known entities are like Fortune 500 organizations, worldwide-known events, popular locations, and entities popularized by recent news or events such as "COVID-19", "Bitcoin", or "Generative AI".
2. Utility: Definition should help a user understand the conversation better and achieve their goals.
3. No Redundancy: Exclude definitions if already defined in the conversation.
4. Complexity: Choose terms with non-obvious meanings, such as "Butterfly Effect" and not "Electric Car".
5. Definability: Must be clearly and succinctly definable in under 10 words.
6. Searchability: Likely to have a specific and valid reference source: Wikipedia page, dictionary entry etc.

# Conversation Transcript:
<Transcript start>{conversation_context}<Transcript end>

# Output Guidelines:
Output an array:
entities: [{{ name: string, definition: string, ekg_search_keyword: string }}], where definition is concise (< 12 words), and ekg_search_keyword as the best search keyword for the Google Knowledge Graph.  

## Additional Guidelines:
- Entity names should be quoted from the conversation, so the output definitions can be referenced back to the conversation, unless they are transcribed wrongly, then use the official name.
- For the search keyword, use complete, official and context relevant keyword(s) to search for that entity. You might need to autocomplete entity names or use their official names or add additional context keywords to help with searchability, especially if the entity is ambiguous or has multiple meanings. For rare words, include "definition" in the search keyword.
- Definitions should use simple language to be easily understood.
- Select entities whose definitions you are very confident about, otherwise skip them.
- Multiple entities can be detected from one phrase, for example, "The Lugubrious Game" can be defined as a painting, and the rare word "lugubrious" is also worth defining.
- Limit results to 4 or fewer entities, prioritizing rarity.
- Examples:
    - Completing incomplete name example: If the conversation talks about "Balmer" and "Microsoft", the keyword is "Steve Ballmer", but the entity name would be "Balmer" because that is the name quoted from the conversation.
    - Replacing unofficial name example: If the conversation talks about "Clay Institute", the keyword is "Clay Mathematics Institute" since that is the official name, but the entity name would be "Clay Institute" because that is the name quoted from the conversation.
    - Adding context example: If the conversation talks about "Theory of everything", the keyword needs context keywords such as "Theory of everything (concept)", because there is a popular movie with the same name. 

## Recent Definitions:
These have already been defined so don't define them again:
{definitions_history}

## Example Output:
entities: [{{ name: "Moore's Law", definition: "Computing power doubles every ~2 yrs", ekg_search_keyword: "Moore's Law" }}, {{ name: "80/20 Rule", definition: "Productivity concept; Majority of results come from few causes", ekg_search_keyword: "Pareto Principle" }}]

{format_instructions} 
If no relevant entities are identified, output an empty array.
"""

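The doubled braces in the blueprint matter. By default, `PromptTemplate` fills placeholders in the style of Python's `str.format`, so `{{` and `}}` come out as literal braces while single-brace names such as `{conversation_context}` are substituted. A minimal sketch using plain `str.format` on a shortened, made-up template (not the full blueprint):

```python
# Shortened stand-in for the blueprint; only the brace handling is of interest here
template = (
    "<Transcript start>{conversation_context}<Transcript end>\n"
    "entities: [{{ name: string, definition: string }}]"
)

rendered = template.format(conversation_context="We talked about Luca.")
print(rendered)
```

A single unescaped brace in the output schema would instead be treated as a missing input variable.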
In [18]:
def run_proactive_rare_word_agent_and_definer(
    conversation_context: str, definitions_history: list = None
):
    # avoid a shared mutable default argument
    if definitions_history is None:
        definitions_history = []

    # run the proactive rare word agent to find entities worth defining
    proactive_rare_word_agent_response = run_proactive_rare_word_agent(
        conversation_context, definitions_history
    )

    # return an empty result if parsing failed (None) or no entities were found,
    # so callers can always iterate over `.entities`
    if (
        proactive_rare_word_agent_response is None
        or not proactive_rare_word_agent_response.entities
    ):
        return ConversationEntities()

    # the definer step is not wired in here; log and return the entities
    print("proactive_rare_word_agent_response", proactive_rare_word_agent_response)

    return proactive_rare_word_agent_response

class ProactiveRareWordAgentQuery(BaseModel):
    """
    Proactive rare word agent that identifies rare entities in a conversation context
    """

    to_define_list: list = Field(
        description="the rare entities to define",
    )

class Entity(BaseModel):
    name: str = Field(
        description="entity name",
    )
    definition: str = Field(
        description="entity definition",
    )
    ekg_search_keyword: str = Field(
        description="keyword to search for entity on Google Enterprise Knowledge Graph",
    )

class ConversationEntities(BaseModel):
    entities: List[Entity] = Field(
        description="list of entities and their definitions",
        default=[]
    )

proactive_rare_word_agent_query_parser = PydanticOutputParser(
    pydantic_object=ConversationEntities
)

def run_proactive_rare_word_agent(conversation_context: str, definitions_history: list):
    # start up GPT4 connection
    llm = ChatOpenAI(
        temperature=0,
        openai_api_key=os.environ.get("OPEN_AI_API_KEY"),
        model="gpt-4-1106-preview",
    )

    extract_proactive_rare_word_agent_query_prompt = PromptTemplate(
        template=proactive_rare_word_agent_prompt_blueprint,
        input_variables=[
            "conversation_context",
            "definitions_history",
        ],
        partial_variables={
            "format_instructions": proactive_rare_word_agent_query_parser.get_format_instructions()
        },
    )

    if len(definitions_history) > 0:
        definitions_history = format_list_data(definitions_history)
    else:
        definitions_history = "None"

    proactive_rare_word_agent_query_prompt_string = (
        extract_proactive_rare_word_agent_query_prompt.format_prompt(
            conversation_context=conversation_context,
            definitions_history=definitions_history,
        ).to_string()
    )

    # print("Proactive meta agent query prompt string", proactive_rare_word_agent_query_prompt_string)

    response = llm(
        [HumanMessage(content=proactive_rare_word_agent_query_prompt_string)]
    )

    print(response.content)
    try:
        res = proactive_rare_word_agent_query_parser.parse(
            response.content
        )
        return res
    except OutputParserException:
        return None
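
The model replies with JSON wrapped in a fenced `json` block, and `PydanticOutputParser` takes care of unwrapping and validating it. To make that step visible, here is a rough stdlib-only stand-in. It is not LangChain's actual implementation, and `SimpleEntity` and `parse_entities` are hypothetical names used only for this sketch.

```python
import json
import re
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # built dynamically so the sample reply can sit inside this example


@dataclass
class SimpleEntity:  # hypothetical stand-in for the Entity model above
    name: str
    definition: str
    ekg_search_keyword: str


def parse_entities(text: str) -> List[SimpleEntity]:
    # Prefer the contents of a fenced block if present, else treat the whole reply as JSON
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    payload = match.group(1) if match else text
    try:
        data = json.loads(payload)
        return [SimpleEntity(**item) for item in data.get("entities", [])]
    except (json.JSONDecodeError, TypeError, AttributeError):
        # Malformed JSON or unexpected fields: return nothing rather than raise
        return []


reply = (
    FENCE + "json\n"
    '{"entities": [{"name": "Hugging Face", '
    '"definition": "AI company specializing in NLP", '
    '"ekg_search_keyword": "Hugging Face"}]}\n' + FENCE
)
entities = parse_entities(reply)
print(entities[0].name)
```

Returning an empty list on failure mirrors the intent of catching `OutputParserException` above: a bad reply should not crash the pipeline.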

In [28]:
# test_transcript = generate_test_input()
test_transcript = """
In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falkon LLM model"""
print(test_transcript)
res = run_proactive_rare_word_agent_and_definer(test_transcript, [])
res


In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falkon LLM model
```json
{
  "entities": [
    {
      "name": "Hugging Face",
      "definition": "AI company specializing in NLP",
      "ekg_search_keyword": "Hugging Face"
    },
    {
      "name": "OpenAI",
      "definition": "AI research laboratory",
      "ekg_search_keyword": "OpenAI"
    },
    {
      "name": "Falcon LLM model",
      "definition": "Large language machine learning model",
      "ekg_search_keyword": "Falcon LLM model"
    }
  ]
}
```
proactive_rare_word_agent_response entities=[Entity(name='Hugging Face', definition='AI company specializing in NLP', ekg_search_keyword='Hugging Face'), Entity(name='OpenAI', definition='AI research laboratory', ekg_search_keyword='OpenAI'), Entit

### Search tool
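EKG (Google Enterprise Knowledge Graph) lookups are unreliable, so this section uses the Serper Google Search API to find a URL and image for each entity instead.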

In [30]:
from typing import Any, List, Literal
import aiohttp
import asyncio
import os

k: int = 3
gl: str = "us"
hl: str = "en"
tbs = None
num_sentences = 7
serper_api_key=os.environ.get("SERPER_API_KEY")
search_type: Literal["news", "search", "places", "images"] = "search"
result_key_for_type = {
        "news": "news",
        "places": "places",
        "images": "images",
        "search": "organic",
    }

async def serper_search_async(
    search_term: str, search_type: str = "search", **kwargs: Any
) -> dict:
    headers = {
        "X-API-KEY": serper_api_key or "",
        "Content-Type": "application/json",
    }
    params = {
        "q": search_term,
        **{key: value for key, value in kwargs.items() if value is not None},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"https://google.serper.dev/{search_type}", headers=headers, json=params) as response:
            response.raise_for_status()
            search_results = await response.json()
            return search_results


async def parse_snippets_async(results: dict, scrape_pages: bool = False, summarize_pages: bool = True, num_sentences: int = 3) -> List[str]:
    snippets = []
    if results.get("answerBox"):
        answer_box = results.get("answerBox", {})
        if answer_box.get("answer"):
            snippets.append(f"The answer is {answer_box.get('answer')}")
        elif answer_box.get("snippet"):
            snippets.append(f"The answer might be in the snippet: {answer_box.get('snippet')}")
        elif answer_box.get("snippetHighlighted"):
            snippets.append(f"The answer might be in the snippet: {answer_box.get('snippetHighlighted')}")

    if results.get("knowledgeGraph"):
        kg = results.get("knowledgeGraph", {})
        title = kg.get("title")
        entity_type = kg.get("type")
        if entity_type:
            snippets.append(f"Knowledge Graph Results: {title}: {entity_type}.")
        description = kg.get("description")
        if description:
            snippets.append(f"Knowledge Graph Results: {title}: {description}.")
        for attribute, value in kg.get("attributes", {}).items():
            snippets.append(f"Knowledge Graph Results: {title} {attribute}: {value}.")

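    if scrape_pages:
        # NOTE: scrape_page_async is not defined in this notebook; it must be
        # provided before calling with scrape_pages=True
        tasks = []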
        for result in results[result_key_for_type[search_type]][:k]:
            task = asyncio.create_task(scrape_page_async(result["link"], summarize_page=summarize_pages, num_sentences=num_sentences))
            tasks.append(task)
        summarized_pages = await asyncio.gather(*tasks)
        for i, page in enumerate(summarized_pages):
            result = results[result_key_for_type[search_type]][i]
            if page:
                snippets.append(f"Title: {result.get('title', '')}\nSource:{result['link']}\nSnippet: {result.get('snippet', '')}\nSummarized Page: {page}")
            else:
                snippets.append(f"Title: {result.get('title', '')}\nSource:{result['link']}\nSnippet: {result.get('snippet', '')}\n")
    else:
        for result in results[result_key_for_type[search_type]][:k]:
            snippets.append(f"Title: {result.get('title', '')}\nSource:{result['link']}\nSnippet: {result.get('snippet', '')}\n")

    if len(snippets) == 0:
        return ["No good Google Search Result was found"]
    return snippets

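def extract_entity_url_and_image(results: dict):
    print(results)  # debug: dump the raw Serper response
    res = {}
    if results.get("knowledgeGraph"):
        res["url"] = results.get("knowledgeGraph", {}).get("descriptionLink")
        res["imageUrl"] = results.get("knowledgeGraph", {}).get("imageUrl")

    # NOTE: every organic link overwrites res["url"], so the last of the top-k
    # results wins, even over the Knowledge Graph descriptionLink set above
    for result in results.get(result_key_for_type[search_type], [])[:k]:
        if result.get("link"):
            res["url"] = result.get("link")

    return res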

async def search_url_for_entity_async(query: str):
    results = await serper_search_async(
        search_term=query,
        gl=gl,
        hl=hl,
        num=k,
        tbs=tbs,
        search_type=search_type,
    )
    return extract_entity_url_and_image(results)
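
The request body in `serper_search_async` drops any keyword argument whose value is `None` (here `tbs`), so only options that are actually set get sent. Isolating that step in a hypothetical helper, `build_serper_params`, makes it easy to check without a network call:

```python
from typing import Any, Dict


def build_serper_params(search_term: str, **kwargs: Any) -> Dict[str, Any]:
    # Same filtering as in serper_search_async: keep only options that are set
    return {
        "q": search_term,
        **{key: value for key, value in kwargs.items() if value is not None},
    }


params = build_serper_params("ChatGpt", gl="us", hl="en", num=3, tbs=None)
print(params)
```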

In [33]:
await search_url_for_entity_async("ChatGpt")

{'searchParameters': {'q': 'ChatGpt', 'gl': 'us', 'hl': 'en', 'num': 3, 'type': 'search', 'engine': 'google'}, 'knowledgeGraph': {'title': 'ChatGPT', 'type': 'Computer program', 'imageUrl': 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSE0_kvfFhLnJU0KZHDXOdED-y5VUeuVi9TmqBcKSg&s=0', 'description': 'ChatGPT is a large language model-based chatbot developed by OpenAI and launched on November 30, 2022, that enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language.', 'descriptionSource': 'Wikipedia', 'descriptionLink': 'https://en.wikipedia.org/wiki/ChatGPT', 'attributes': {'Initial release date': 'November 30, 2022', 'Developer(s)': 'OpenAI', 'Engine': 'GPT-3.5 (free and paid); GPT-4 (paid only)', 'License': 'Proprietary service', 'Platform': 'Cloud computing platforms', 'Stable release': 'November 21, 2023; 13 days ago', 'Written in': 'Python'}}, 'organic': [{'title': 'ChatGPT', 'link': 'https://chat.openai.com/', 's

{'url': 'https://openai.com/',
 'imageUrl': 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSE0_kvfFhLnJU0KZHDXOdED-y5VUeuVi9TmqBcKSg&s=0'}
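
In the output above, the returned `url` is `https://openai.com/` rather than the Knowledge Graph's Wikipedia link, presumably because each organic result overwrites the previous value. If the intent is to prefer the Knowledge Graph link and fall back to the first organic hit, one possible variant is sketched below. This is an assumption about the desired behavior, not the notebook's current code, and `pick_entity_url_and_image` is a hypothetical name.

```python
def pick_entity_url_and_image(results: dict, organic_key: str = "organic") -> dict:
    # Hypothetical variant: Knowledge Graph link first, first organic link as fallback
    kg = results.get("knowledgeGraph") or {}
    res = {}
    if kg.get("descriptionLink"):
        res["url"] = kg["descriptionLink"]
    if kg.get("imageUrl"):
        res["imageUrl"] = kg["imageUrl"]
    if "url" not in res:
        for result in results.get(organic_key, []):
            if result.get("link"):
                res["url"] = result["link"]
                break  # stop at the first usable organic link
    return res


sample = {
    "knowledgeGraph": {"descriptionLink": "https://en.wikipedia.org/wiki/ChatGPT"},
    "organic": [{"link": "https://chat.openai.com/"}],
}
with_kg = pick_entity_url_and_image(sample)
without_kg = pick_entity_url_and_image({"organic": sample["organic"]})
print(with_kg, without_kg)
```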

In [28]:
test_transcript = """
In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falkon LLM model"""
print(test_transcript)
res = run_proactive_rare_word_agent_and_definer(test_transcript, [])
print(res)
for entity in res.entities:
    print(entity.name)
    print(await search_url_for_entity_async(entity.name))


In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falkon LLM model
```json
{
  "entities": [
    {
      "name": "Hugging Face",
      "definition": "AI company specializing in NLP models",
      "ekg_search_keyword": "Hugging Face"
    },
    {
      "name": "OpenAI",
      "definition": "AI research lab, created GPT models",
      "ekg_search_keyword": "OpenAI"
    },
    {
      "name": "Falcon LLM model",
      "definition": "Large language model for AI applications",
      "ekg_search_keyword": "Falcon LLM model"
    }
  ]
}
```
proactive_rare_word_agent_response entities=[Entity(name='Hugging Face', definition='AI company specializing in NLP models', ekg_search_keyword='Hugging Face'), Entity(name='OpenAI', definition='AI research lab, created GPT m
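
The loop above awaits one search at a time. Since the lookups are independent, they could be issued concurrently with `asyncio.gather`. The sketch below uses a fake coroutine, `fake_search`, in place of `search_url_for_entity_async` so that it runs without an API key; both it and `lookup_all` are hypothetical.

```python
import asyncio
from typing import Dict, List


async def fake_search(query: str) -> dict:
    # Stand-in for search_url_for_entity_async; simulates network latency
    await asyncio.sleep(0.01)
    return {"url": f"https://example.com/{query.replace(' ', '_')}"}


async def lookup_all(names: List[str]) -> Dict[str, dict]:
    # Launch every lookup at once; gather returns results in input order
    results = await asyncio.gather(*(fake_search(name) for name in names))
    return dict(zip(names, results))


urls = asyncio.run(lookup_all(["Hugging Face", "OpenAI"]))
print(urls)
```

Inside a notebook cell, where an event loop is already running, `await lookup_all([...])` would be used instead of `asyncio.run`.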