# Definer Experiments
The previous notebook was getting quite full, so here is a new notebook for the Definer project.

In [1]:
# Put all imports here to be efficient
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from pydantic import BaseModel, Field, validator
from langchain.output_parsers import PydanticOutputParser
from langchain.schema import OutputParserException
from typing import List
import os

In [2]:
# Way to generate a random test input using transcripts from Lex Fridman's podcast
# Make sure you have the transcripts downloaded in the folder lex_whisper_transcripts

import test_on_lex

transcripts = test_on_lex.load_lex_transcripts(random_n=10, transcript_folder="./lex_whisper_transcripts/", chunk_time_seconds=20)

import random
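def generate_test_input():
    # Pick a random episode from the loaded transcripts
    key = random.choice(list(transcripts.keys()))
    transcript = transcripts[key]
    # Pick a random window of chunks, staying away from the start and end of the episode
    trans_idx = random.randint(10, len(transcript)-10)
    latest = transcript[trans_idx:trans_idx+7]
    # The first five chunks act as context, the sixth is the "current" chunk
    prev_transcripts, curr_transcripts = ",".join(latest[0:5]), latest[5]
    return prev_transcripts + "\n" + curr_transcripts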

generate_test_input()

Processing episode_069_large...
Processing episode_305_large...
Processing episode_166_large...
Processing episode_023_large...
Processing episode_095_large...
Processing episode_238_large...
Processing episode_058_large...
Processing episode_239_large...
Processing episode_253_large...
Processing episode_006_large...


" The art of running such a lab is that there are strategic priorities for the company. And there are areas where we do want to invest and pressing problems. And so it's a little bit of a trickle down and filter up meets in the middle. And so you don't tell people you have to do X, but you say X would be particularly appreciated this year. And then people reinterpret, X through the filter of things they want to do and they're interested in. And miraculously, it usually comes together very well. One thing that really helps is Adobe has a really broad portfolio of products. So if we have a good idea, there's usually a product team that is intrigued or interested. So it means we don't have to qualify things too much ahead of time., Once in a while, the product teams sponsor extra intern, because they have a particular problem that they really care about, in which case it's a little bit more, we really need one of these. And then we sort of say, great, I get an extra intern, we find an int

In [3]:
def format_list_data(list_data: list):
    return "\n".join([f"{i+1}. {example}" for i, example in enumerate(list_data)])

In [69]:
proactive_rare_word_agent_prompt_blueprint = """
# Objective
Your role is to identify and define "Rare Entities" in a conversation transcript. Types of "Rare Entities" include rare words, phrases, jargon, adages, people, places, organizations, events etc. that are not well known to the average high schooler, in accordance with current trends. You can also intelligently detect entities that are described in the conversation but not explicitly mentioned.

# Criteria for Rare Entities in order of importance
1. Rarity: Select entities that are unlikely for an average high schooler to know. Well-known entities include Fortune 500 organizations, worldwide-known events, popular locations, and entities popularized by recent news or events such as "COVID-19", "Bitcoin", or "Generative AI".
2. Utility: Definition should help a user understand the conversation better and achieve their goals.
3. No Redundancy: Exclude definitions if already defined in the conversation.
4. Complexity: Choose terms with non-obvious meanings, such as "Butterfly Effect" but not "Electric Car".
5. Definability: Must be clearly and succinctly definable in under 10 words.
6. Existence: Don't select entities if you have no knowledge of them.

# Conversation Transcript:
<Transcript start>{conversation_context}<Transcript end>

# Output Guidelines:
Output an array (ONLY OUTPUT THIS) of the entities you identified using the following template: `[{{ name: string, definition: string, search_keyword: string }}]`

- name is the entity name shown to the user, if it is mistranscribed, autocorrect it, otherwise use the name quoted from the conversation
- definition is concise (< 12 words)
- search_keyword as the best Internet search keywords to find the entity, add entity type defined above for better searchability
- it's OK to output an empty array - most of the time, the array will be empty, only include items if they fit all the requirements

## Additional Guidelines:
- Do not define entities you yourself are not familiar with, you can try to piece together the implied entity, but if you are not 90% confident, skip it.
- For the search keyword, use complete, official and context relevant keyword(s) to search for that entity. You might need to autocomplete entity names or use their official names or add additional context keywords (like the type of entity) to help with searchability, especially if the entity is ambiguous or has multiple meanings. Additionally, for rare words, add "definition" to the search keyword.
- Definitions should use simple language to be easily understood.
- Select entities whose definitions you are very confident about, otherwise skip them.
- Multiple entities can be detected from one phrase, for example, "The Lugubrious Game" can be defined as a painting (iff the entire term "the lugubrious game" is mentioned), and the rare word "lugubrious" is also worth defining.
- Limit results to {number_of_definitions} entities, prioritize rarity.
- Examples:
    - Completing incomplete name example: If the conversation talks about "Ballmer" and "Microsoft", the keyword is "Steve Ballmer + person", and the name would be "Steve Ballmer" because it is complete.
    - Replacing unofficial name example: If the conversation talks about "Clay Institute", the keyword is "Clay Mathematics Institute + organization" since that is the official name, but the entity name would be "Clay Institute" because that is the name quoted from the conversation.
    - Adding context example: If the conversation talks about "Theory of everything", the keyword needs context keywords such as "Theory of everything + concept", because there is a popular movie with the same name. 
    - Inferring transcript errors example: If the conversation mentions "Coleman Sachs" in the context of finance, you can infer it was supposed to be "Goldman Sachs", so you autocorrect and define it as "Goldman Sachs" and give its definition.

## Recent Definitions:
These have already been defined so don't define them again:
{definitions_history}

## Example Output:
entities: [{{ name: "80/20 Rule", definition: "Productivity concept; Majority of results come from few causes", search_keyword: "80/20 Rule + concept" }}]

{format_instructions} 
If no relevant entities are identified, output empty arrays.
"""
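
The doubled braces in the output template above are there because the prompt is later filled in with Python-style formatting (assuming `PromptTemplate` is used with its default f-string template format). A minimal sketch of that escaping, using a shortened, hypothetical template and plain `str.format` rather than LangChain:

```python
# Hypothetical, shortened template; only the brace escaping mirrors the prompt above.
template = (
    "Transcript: {conversation_context}\n"
    "Output: [{{ name: string, definition: string, search_keyword: string }}]"
)

# Single braces are placeholders; doubled braces are rendered as literal braces.
rendered = template.format(conversation_context="We met at the Clay Institute.")
print(rendered)
```

Forgetting to double a literal brace would make the formatter treat `name: string, ...` as a placeholder and raise an error when the prompt is formatted.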

In [70]:
def run_proactive_rare_word_agent_and_definer(
    conversation_context: str, definitions_history: list = None
):
    # run the proactive rare word agent to find entities worth defining
    proactive_rare_word_agent_response = run_proactive_rare_word_agent(
        conversation_context, definitions_history or []
    )

    # return an empty result if parsing failed or no entities were found
    if (
        proactive_rare_word_agent_response is None
        or not proactive_rare_word_agent_response.entities
    ):
        return ConversationEntities()

    # hand the identified entities back to the caller
    print("proactive_rare_word_agent_response", proactive_rare_word_agent_response)

    return proactive_rare_word_agent_response

class ProactiveRareWordAgentQuery(BaseModel):
    """
    Proactive rare word agent that identifies rare entities in a conversation context
    """

    to_define_list: list = Field(
        description="the rare entities to define",
    )

class Entity(BaseModel):
    name: str = Field(
        description="entity name",
    )
    definition: str = Field(
        description="entity definition",
    )
    search_keyword: str = Field(
        description="keyword to search for entity on the Internet",
    )

class ConversationEntities(BaseModel):
    entities: List[Entity] = Field(
        description="list of entities and their definitions",
        default=[]
    )

proactive_rare_word_agent_query_parser = PydanticOutputParser(
    pydantic_object=ConversationEntities
)

def run_proactive_rare_word_agent(conversation_context: str, definitions_history: list):
    # start up GPT4 connection
    llm = ChatOpenAI(
        temperature=0,
        openai_api_key=os.environ.get("OPEN_AI_API_KEY"),
        model="gpt-4-1106-preview",
    )

    extract_proactive_rare_word_agent_query_prompt = PromptTemplate(
        template=proactive_rare_word_agent_prompt_blueprint,
        input_variables=[
            "conversation_context",
            "definitions_history",
        ],
        partial_variables={
            "format_instructions": proactive_rare_word_agent_query_parser.get_format_instructions(),
            "number_of_definitions": 3,
        },
    )

    if len(definitions_history) > 0:
        definitions_history = format_list_data(definitions_history)
    else:
        definitions_history = "None"

    proactive_rare_word_agent_query_prompt_string = (
        extract_proactive_rare_word_agent_query_prompt.format_prompt(
            conversation_context=conversation_context,
            definitions_history=definitions_history,
        ).to_string()
    )

    # print("Proactive meta agent query prompt string", proactive_rare_word_agent_query_prompt_string)

    response = llm(
        [HumanMessage(content=proactive_rare_word_agent_query_prompt_string)]
    )

    print(response.content)
    try:
        res = proactive_rare_word_agent_query_parser.parse(
            response.content
        )
        return res
    except OutputParserException as e:
        print("Error parsing output" , e)
        return None

In [71]:
# test_transcript = generate_test_input()
test_transcript = """
In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falcon LLM model"""
print(test_transcript)
res = run_proactive_rare_word_agent_and_definer(test_transcript, [])
res


In the realm of artificial intelligence and big data, several key players stand out with their innovative contributions. Hugging Fase, a leader in machine learning models. Another major entity, OpenYI, has revolutionized language models. We now have the largest LLMs ever such as the Falcon LLM model
```json
{
  "entities": [
    {
      "name": "Hugging Face",
      "definition": "AI company specializing in natural language processing",
      "search_keyword": "Hugging Face + AI company"
    },
    {
      "name": "OpenAI",
      "definition": "AI research lab, creators of GPT models",
      "search_keyword": "OpenAI + AI research lab"
    },
    {
      "name": "Falcon LLM model",
      "definition": "A large language model for AI applications",
      "search_keyword": "Falcon LLM model + AI"
    }
  ]
}
```
proactive_rare_word_agent_response entities=[Entity(name='Hugging Face', definition='AI company specializing in natural language processing', search_keyword='Hugging Face + AI co

ConversationEntities(entities=[Entity(name='Hugging Face', definition='AI company specializing in natural language processing', search_keyword='Hugging Face + AI company'), Entity(name='OpenAI', definition='AI research lab, creators of GPT models', search_keyword='OpenAI + AI research lab'), Entity(name='Falcon LLM model', definition='A large language model for AI applications', search_keyword='Falcon LLM model + AI')])
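
The model reply above arrives wrapped in a Markdown JSON fence, which the output parser has to strip before validating the fields. As a rough illustration of that step (not LangChain's actual implementation), here is a standard-library sketch; `EntityRecord` and `parse_entities` are hypothetical names, and the sample reply is built by hand:

```python
import json
import re
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class EntityRecord:  # hypothetical stand-in for the Entity model above
    name: str
    definition: str
    search_keyword: str


# Matches an optional "json" language tag and captures the fenced body.
_FENCED = re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL)


def parse_entities(reply: str) -> List[EntityRecord]:
    """Extract the JSON object from a (possibly fenced) reply and check its fields."""
    match = _FENCED.search(reply)
    payload = match.group(1) if match else reply
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    records = []
    for item in data.get("entities") or []:
        try:
            records.append(EntityRecord(**item))
        except TypeError:
            continue  # skip items with missing or unexpected keys
    return records


sample_reply = FENCE + "json\n" + json.dumps({
    "entities": [{
        "name": "Hugging Face",
        "definition": "AI company specializing in natural language processing",
        "search_keyword": "Hugging Face + AI company",
    }]
}) + "\n" + FENCE

entities = parse_entities(sample_reply)
print(entities)
```

Returning an empty list on malformed output keeps downstream loops simple, at the cost of hiding the parse error; logging it, as the cell above does, is a reasonable complement.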

### Search tool
EKG is unreliable

In [83]:
from typing import Any, List, Literal
import aiohttp
import asyncio
import os

k: int = 3
gl: str = "us"
hl: str = "en"
tbs = None
num_sentences = 7
serper_api_key=os.environ.get("SERPER_API_KEY")
search_type: Literal["news", "search", "places", "images"] = "images"
result_key_for_type = {
        "news": "news",
        "places": "places",
        "images": "images",
        "search": "organic",
    }

async def serper_search_async(
    search_term: str, search_type: str = "search", **kwargs: Any
) -> dict:
    headers = {
        "X-API-KEY": serper_api_key or "",
        "Content-Type": "application/json",
    }
    params = {
        "q": search_term,
        **{key: value for key, value in kwargs.items() if value is not None},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"https://google.serper.dev/{search_type}", headers=headers, json=params) as response:
            response.raise_for_status()
            search_results = await response.json()
            return search_results


async def parse_snippets_async(results: dict, scrape_pages: bool = False, summarize_pages: bool = True, num_sentences: int = 3) -> List[str]:
    snippets = []
    if results.get("answerBox"):
        answer_box = results.get("answerBox", {})
        if answer_box.get("answer"):
            snippets.append(f"The answer is {answer_box.get('answer')}")
        elif answer_box.get("snippet"):
            snippets.append(f"The answer might be in the snippet: {answer_box.get('snippet')}")
        elif answer_box.get("snippetHighlighted"):
            snippets.append(f"The answer might be in the snippet: {answer_box.get('snippetHighlighted')}")

    if results.get("knowledgeGraph"):
        kg = results.get("knowledgeGraph", {})
        title = kg.get("title")
        entity_type = kg.get("type")
        if entity_type:
            snippets.append(f"Knowledge Graph Results: {title}: {entity_type}.")
        description = kg.get("description")
        if description:
            snippets.append(f"Knowledge Graph Results: {title}: {description}.")
        for attribute, value in kg.get("attributes", {}).items():
            snippets.append(f"Knowledge Graph Results: {title} {attribute}: {value}.")

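    if scrape_pages:
        # NOTE: scrape_page_async is not defined in this notebook; it must be
        # provided before calling this function with scrape_pages=True
        tasks = []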
        for result in results[result_key_for_type[search_type]][:k]:
            task = asyncio.create_task(scrape_page_async(result["link"], summarize_page=summarize_pages, num_sentences=num_sentences))
            tasks.append(task)
        summarized_pages = await asyncio.gather(*tasks)
        for i, page in enumerate(summarized_pages):
            result = results[result_key_for_type[search_type]][i]
            if page:
                snippets.append(f"Title: {result.get('title', '')}\nSource:{result['link']}\nSnippet: {result.get('snippet', '')}\nSummarized Page: {page}")
            else:
                snippets.append(f"Title: {result.get('title', '')}\nSource:{result['link']}\nSnippet: {result.get('snippet', '')}\n")
    else:
        for result in results[result_key_for_type[search_type]][:k]:
            snippets.append(f"Title: {result.get('title', '')}\nSource:{result['link']}\nSnippet: {result.get('snippet', '')}\n")

    if len(snippets) == 0:
        return ["No good Google Search Result was found"]
    return snippets

import requests

def can_embed_url(url: str) -> bool:
    # A failed or timed-out request is treated as "cannot embed"
    try:
        response = requests.head(url, timeout=5)
    except requests.RequestException:
        return False

    # 'X-Frame-Options' or a CSP 'frame-ancestors' directive means the page restricts embedding
    x_frame_options = response.headers.get('X-Frame-Options')
    csp = response.headers.get('Content-Security-Policy')

    return not (x_frame_options or (csp and 'frame-ancestors' in csp))

def extract_entity_url_and_image(search_results: dict, image_results: dict):
    # Only get the first top url and image_url
    res = {}
    if search_results.get("knowledgeGraph"):
        result = search_results.get("knowledgeGraph", {})
        if result.get("descriptionSource") == "Wikipedia":
            ref_url = result.get("descriptionLink")
            res["url"] = ref_url

    # Fall back to the first embeddable organic result if no Wikipedia link was found
    for result in search_results.get(result_key_for_type["search"], [])[:k]:
        if "url" not in res and result.get("link") and can_embed_url(result.get("link")):
            res["url"] = result.get("link")
            break

    if image_results is None:
        return res

    # Take the first image result that has a usable URL
    for result in image_results.get(result_key_for_type["images"], [])[:k]:
        if "image_url" not in res and result.get("imageUrl"):
            res["image_url"] = result.get("imageUrl")
            break

    return res

async def search_url_for_entity_async(query: str):
    search_results = await serper_search_async(
        search_term=query,
        gl=gl,
        hl=hl,
        num=k,
        tbs=tbs,
        search_type="search",
    )

    image_results = None if "definition" in query else await serper_search_async(
            search_term=query,
            gl=gl,
            hl=hl,
            num=k,
            tbs=tbs,
            search_type="images",
        )
    
    return extract_entity_url_and_image(search_results, image_results)

In [84]:
await search_url_for_entity_async("ChatGpt")

{'url': 'https://en.wikipedia.org/wiki/ChatGPT',
 'image_url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/ChatGPT_logo.svg/1200px-ChatGPT_logo.svg.png'}
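
To exercise the selection logic without hitting the network, here is a small sketch that applies the same priority order (Wikipedia link from the knowledge graph, then the first embeddable organic result, then the first image) to hand-written dictionaries. The function name `pick_url_and_image`, the injected `can_embed` callback, and the sample data are assumptions for illustration; only the field names come from the code above.

```python
from typing import Callable, Dict, Optional


def pick_url_and_image(
    search_results: dict,
    image_results: Optional[dict],
    can_embed: Callable[[str], bool],
    k: int = 3,
) -> Dict[str, str]:
    picked: Dict[str, str] = {}

    # 1. Prefer the Wikipedia link attached to the knowledge graph, if any.
    kg = search_results.get("knowledgeGraph") or {}
    if kg.get("descriptionSource") == "Wikipedia" and kg.get("descriptionLink"):
        picked["url"] = kg["descriptionLink"]

    # 2. Otherwise take the first of the top-k organic links that can be embedded.
    if "url" not in picked:
        for result in search_results.get("organic", [])[:k]:
            link = result.get("link")
            if link and can_embed(link):
                picked["url"] = link
                break

    # 3. Take the first of the top-k images that has a URL.
    for result in (image_results or {}).get("images", [])[:k]:
        if result.get("imageUrl"):
            picked["image_url"] = result["imageUrl"]
            break

    return picked


# Hand-written sample data (hypothetical, not a real API response).
sample_search = {
    "organic": [
        {"link": "https://example.com/a"},
        {"link": "https://example.com/b"},
    ]
}
sample_images = {"images": [{"imageUrl": "https://example.com/logo.png"}]}

picked = pick_url_and_image(sample_search, sample_images, can_embed=lambda u: u.endswith("/b"))
print(picked)
```

Passing the embeddability check in as a callback keeps the selection logic testable and leaves room to swap the blocking `requests.head` call for an async one.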

In [85]:
test_transcript = generate_test_input()
print(test_transcript)
res = run_proactive_rare_word_agent_and_definer(test_transcript, [])
print(res)
# res may be None or an empty list when parsing fails, so fall back to no entities
for entity in getattr(res, "entities", []):
    print(entity.search_keyword)
    print(await search_url_for_entity_async(entity.search_keyword))

 And sometimes they're in a network. So you imagine them connected with network links and a dynamic network, those can change, right? So I was talking to you, but now I can't talk to you anymore. Now I'm connected to a person over here. It's a really hard environment mathematically speaking. And there's a lot of really strong lower bounds, which you could imagine if the network can change all the time and a bad guy is doing it, it's like hard to do things well., So there's an algorithm running on every single node in the network. Yeah. And then you're trying to say something of any kind that makes any kind of definitive sense about the performance of that algorithm. Yeah, so I just submitted a new paper on this a couple of weeks ago. And we were looking at a very simple problem. There's some messages in the network. We want everyone to get them. If the network doesn't change,, you can do this pretty well. You can pipeline them. There's some basic algorithms that work really well. If th

## Just go with the most intuitive approach that works best

Pipeline
1. Check if the page can be embedded
2. Check if url is accurate for definition

### Check if page can be embedded

In [35]:
import requests

# URL of the page you want to check
url = 'https://www.labellerr.com/blog/what-are-adversarial-attacks-in-machine-learning-and-how-can-you-prevent-them/'

# Send a request to the URL
response = requests.head(url)

# Check the headers for 'X-Frame-Options' or 'Content-Security-Policy'
x_frame_options = response.headers.get('X-Frame-Options')
csp = response.headers.get('Content-Security-Policy')

if x_frame_options or ('frame-ancestors' in csp if csp else False):
    print("The page cannot be embedded.")
else:
    print("The page can be embedded.")


The page can be embedded.
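
The substring test on `frame-ancestors` above treats any mention of the directive as a block. A slightly more careful, still conservative heuristic can be written as a pure function over the response headers, which also makes it testable offline. The function name `headers_allow_embedding` is hypothetical, and the rule "only `frame-ancestors *` counts as embeddable" is an assumption made because the embedding origin is not known here:

```python
from typing import Mapping


def headers_allow_embedding(headers: Mapping[str, str]) -> bool:
    """Conservative guess at whether a page may be shown in a third-party iframe."""
    normalized = {key.lower(): value for key, value in headers.items()}

    # Any X-Frame-Options value (DENY, SAMEORIGIN, ...) restricts third-party framing.
    if normalized.get("x-frame-options"):
        return False

    # Look for a frame-ancestors directive inside the Content-Security-Policy.
    csp = normalized.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.strip().lower().split()
        if parts and parts[0] == "frame-ancestors":
            # Only a wildcard source is treated as "anyone may embed".
            return "*" in parts[1:]

    return True


print(headers_allow_embedding({"X-Frame-Options": "DENY"}))
print(headers_allow_embedding({"Content-Security-Policy": "frame-ancestors *"}))
```

In the notebook this could be called with `response.headers` from the `requests.head` call; headers are only a hint, since some sites block framing with scripts instead.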
