# Google Searcher

The goal of this notebook is to create an automated google searcher leveraging the semantic capabilities of GPT-3.
The use case is for a user who has an arbitrary question they want answered (one would ordinarily google specific keywords for this and go find the answer in some website found in the google results).
This is the following cognitive process we want to replicate:
1. The input string is the question the user wants answered.  We want to extract multiple strings from it via GPT-3, each of them containing a proposal for a google search.  Ask GPT-3 to output as python list.
2. Execute each google search
    
    2.a. retrieve the top n urls
    
    2.b. fetch the website's text and do embeddings for semantic search
    
3. Store page urls of the google search in a table
4. Use GPT-3 to evaluate the credibility score of each website
5. Execute a semantic search for the input question against the text corpus (which should be labeled with the urls as metadata or something) to get top m candidates.
6. Shoot the question against each of the candidates using GPT-3
7. Display answers and url, ordered by credibility score.

### Index:

0. Imports & functions

1. Create Google Searches

2. a. Top n URLS

2. b. Embeddings

3. URL Table

4. Credibility Score

5. Semantic Search with credibility

6. GPT-3 Answers

7. Examples

You can skip all the steps and go down to the bottom to see a few examples.

## 0. Imports & Functions

In [None]:
import re
import os
import openai
from time import time, sleep
import textwrap
import pandas as pd
import ast
import requests
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
import tiktoken
import json
import bs4 as bs
import time

In [87]:
def open_file(filepath):
    # This function opens a file located at the specified filepath and returns a string containing the file's content.
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()


def save_file(filepath, content):
    # This function saves the content in a file located at the specified filepath.
    with open(filepath, 'w', encoding='utf-8') as outfile:
        outfile.write(content)

# change path to where you store your credentials
openai.api_key = open_file('creds/creds.txt')

rapid_api_key = open_file('creds/rapid_api_key.txt')
rapid_api_host = open_file('creds/rapid_api_host.txt')

def gpt3_completion(prompt, label='gpt3', engine='text-davinci-003', temp=0, top_p=1.0, tokens=400, freq_pen=2.0, pres_pen=2.0, stop=['asdfasdf', 'asdasdf']):
    # This function uses OpenAI's GPT-3 Engine to generate completions for the given prompt.
    max_retry = 5
    retry = 0
    prompt = prompt.encode(encoding='ASCII', errors='ignore').decode()  # force it to fix any unicode errors
    while True:
        try:
            response = openai.Completion.create(
                engine=engine,
                prompt=prompt,
                temperature=temp,
                max_tokens=tokens,
                top_p=top_p,
                frequency_penalty=freq_pen,
                presence_penalty=pres_pen,
                stop=stop)
            text = response['choices'][0]['text'].strip()
            text = re.sub('\s+', ' ', text)
            return text
        except Exception as oops:
            retry += 1
            if retry >= max_retry:
                return "GPT3 error: %s" % oops
            print('Error communicating with OpenAI:', oops)
            sleep(1)
            
def deduplicate_list(x):
    # This function removes duplicate elements from a list and returns the deduplicated list.
    return list(dict.fromkeys(x))


## 1. Search strings
Here we obtain search strings from a question I want to answer.  Our initial question for the testing of the functions below will be "How long until we cure cancer?"

In [2]:
# Example question
my_question = "How long until we cure cancer?"

In [88]:
def get_search_strings(input):
    ### input is a string, containing the initial question of the user
    ### Returns string containing a JSON formatted response
    question = input
    prompt = """The following is a question a user has.  Please propose a list of between 5 and 10 potentially useful Google Searches the user can execute to retrieve useful information to answer the question.  The list should be output as a python list, using double quotes for strings.
    QUESTION:
    {0}

    LIST RESPONSE: """.format(question)
    gpt_response = gpt3_completion(prompt)
    try:
        search_list = ast.literal_eval(gpt_response)
    except:
        gpt_response = gpt3_completion(prompt, temp=0.6)
        search_list = ast.literal_eval(gpt_response)
    return search_list
    

Here are some examples of proposed searches by GPT-3:

In [19]:
get_search_strings(my_question)

['timeline for curing cancer',
 'when will we cure cancer?',
 'progress in curing cancer research',
 'how close are scientists to a cure for cancer?',
 'what is the prognosis of finding a cure for all cancers?',
 'cancer breakthroughs timeline',
 'current progress towards a universal treatment or prevention of any type of malignant tumor ',
 'scientific advances that could lead to cures and treatments against different types of tumors ',
 'research on new methods to treat various forms of neoplasms']

## 2. a. Top n URLS

We will want do a google search with each of the queries in the list above and fetch a fixed number of urls from each of the search results (we will deduplicate and use them later on).  For that we will need the following function which executes a google search and returns the top N links.

In [98]:
query = "When will we cure cancer?"
n = 10

def get_top_urls(query, n, pprint = True):
    # input is a google search (a string) and an integer n
    # output top n urls
    url = "https://google-search72.p.rapidapi.com/search"
    #num is number n of top results
    querystring = {"query":query,"gl":"us","lr":"en","num":str(n),"start":"0","sort":"relevance"}

    headers = {
        "X-RapidAPI-Key": rapid_api_key,
        "X-RapidAPI-Host": rapid_api_host
    }

    response = requests.request("GET", url, headers=headers, params=querystring)
    if pprint:
        global resp_debug 
        resp_debug = response.text
    data = json.loads(response.text)
    # Get list of top urls
    top_urls = []
    for item in data['items']:
        if not item['link'].endswith(".pdf"):
            top_urls.append(item['link'])
        
    return top_urls
#print(response.text)

In [65]:
top_urls = get_top_urls(query,3)

Here are the top 3 urls from the google search (in order returned by google search results):

In [66]:
top_urls

['https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer',
 'https://www.foxchase.org/blog/are-we-any-closer-curing-cancer',
 'https://www.hsph.harvard.edu/magazine/magazine_article/the-cancer-miracle-isnt-a-cure-its-prevention/']

## 2. b. Embeddings

Now we want to store the text from the websites in encoded format to do the semantic search.

In [90]:
# Functions to fetch the text
def get_website(url):
    # The input is a string containing the desired url
    # The output is a string containing the text from the website at the url
    # The function uses the requests library to get the text from the website
    # at the url and returns the text
    try:
        response = requests.get(url)
        return response.text
    except:
        return 'Error'

def extract_clean_text(site):
    # The input is a string containing all the text from a website, including the HTML
    # The output is a string containing only the human readable text i.e. without all the HTML tags
    # The function uses the BeautifulSoup library to parse the HTML and return only the human readable text
    # Parse the HTML as a string
    soup = bs.BeautifulSoup(site,'html.parser')
    # Get the text out of the soup and return it
    text = ''.join(map(lambda p: p.text, soup.find_all('p')))
    return text

def split_string(string, x = 2500): 
    # Split the string into n-sized chunks. Last chunk might be smol.
    res=[string[y-x:y] for y in range(x, len(string)+x,x)]
    #split_strings.append(res) 
    return res


In [91]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191
encoding = tiktoken.get_encoding(embedding_encoding)


In [100]:
#Create empty DataFrame with specific column names & types
df = pd.DataFrame({'url': pd.Series(dtype='str'),
                   'snippet_id': pd.Series(dtype='str'),
                   'text': pd.Series(dtype='str'),
                  'embedding': pd.Series(dtype='str')})
# Using NumPy
dtypes = np.dtype(
    [
        ("url", str),
        ("snippet_id", str),
        ("text", str),
        ("embedding",str)
    ]
)
df = pd.DataFrame(np.empty(0, dtype=dtypes))

for url in top_urls:
    text = get_website(url)
    clean = extract_clean_text(text)
    snippet_list = split_string(clean, 2500)
    #url_list = [url] * len(snippet_list)     
    encoding_list = [get_embedding(snippet, engine=embedding_model) for snippet in snippet_list]
    for i in range(len(snippet_list)):
        df = pd.concat([df,pd.DataFrame({'url': [url], 'snippet_id': [i], 'text':[snippet_list[i]], 'embedding': [encoding_list[i]]})], ignore_index=True) 

   
        
        

In [99]:
def create_df(top_urls, pprint = True):
    # Input is a deduplicated list of urls
    # Output is a dataframe containing url, snippet_id, text, embedding obtained by fetching the website and calling openai
    df = pd.DataFrame({'url': pd.Series(dtype='str'),
                   'snippet_id': pd.Series(dtype='str'),
                   'text': pd.Series(dtype='str'),
                  'embedding': pd.Series(dtype='str')})
    # Using NumPy
    dtypes = np.dtype(
        [
            ("url", str),
            ("snippet_id", str),
            ("text", str),
            ("embedding",str)
        ]
    )
    df = pd.DataFrame(np.empty(0, dtype=dtypes))
    all_df_list = []
    for url in top_urls:
        text = get_website(url)
        if pprint:
            print(url)
        if text =='Error':
            continue
        try:
            clean = extract_clean_text(text)
        except Exception as e:
            print(e)
            continue
        snippet_list = split_string(clean, 2500)
        #url_list = [url] * len(snippet_list)   
        if pprint:
            print('running encoder')
        encoding_list = [get_embedding(snippet, engine=embedding_model) for snippet in snippet_list]
        if pprint:
            print('finished encoder')
            print('adppending to dataframe')
        for i in range(len(snippet_list)):
            all_df_list.append([url, i, snippet_list[i], encoding_list[i]])
            #print(all_df_list[i][0])
            #print(all_df_list[i][1])
            #df = pd.concat([df,pd.DataFrame({'url': [url], 'snippet_id': [i], 'text':[snippet_list[i]], 'embedding': [encoding_list[i]]})], ignore_index=True) 
        all_df = pd.concat([df,pd.DataFrame(all_df_list, columns=df.columns)], ignore_index=True)       
    return all_df 


In the table below, you can now see we have url, text snippet, and embedding of the text snippet into a vector space.

In [54]:
df = create_df(top_urls[0:2])
df.head()

https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
running encoder
finished encoder
adppending to dataframe
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
0
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
1
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
2
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
3
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
4
https://www.foxchase.org/blog/are-we-any-closer-curing-cancer
running encoder
finished encoder
adppending to dataframe
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
0
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer
1
https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer


Unnamed: 0,url,snippet_id,text,embedding
0,https://ec.europa.eu/research-and-innovation/e...,0,We asked three cancer experts - Nobel laureate...,"[0.012906081974506378, -0.005089065060019493, ..."
1,https://ec.europa.eu/research-and-innovation/e...,1,ces. They are slightly modified bacterial plas...,"[-0.013120053336024284, 0.00935688428580761, -..."
2,https://ec.europa.eu/research-and-innovation/e...,2,em is stimulated to attack the cancer) has als...,"[-0.007977260276675224, -0.0039186542853713036..."
3,https://ec.europa.eu/research-and-innovation/e...,3,sk and to help detect certain cancers early.Fo...,"[0.027712570503354073, -0.0203661248087883, 0...."
4,https://ec.europa.eu/research-and-innovation/e...,4,"shape the Horizon Europe work programme, an i...","[-0.001030220533721149, 0.00011779329361161217..."


In [93]:
# search through the reviews for a specific product
def search_snippets(df, question, n=3, pprint=False):
    # input is a df with embeddings of the texts (it contains url, snippet_id, snippet text as well)
    # output is the top n results, in the same format as df
    question_embedding = get_embedding(
        question,
        engine="text-embedding-ada-002"
    )
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, question_embedding))

    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
    )
    if pprint:
        for r in results:
            print(r[:200])
            print()
    return results

Below you can see the similarity column.  First we embedded the initial question as well, then found nearby vectors i.e. computed the similarity of that particular snippet with respect to our initial query.

In [99]:
search_snippets(df, my_question, n=3, pprint=True)

url

snippet_id

text

encoding

embedding

similarity



Unnamed: 0,url,snippet_id,text,encoding,embedding,similarity
29,https://www.icr.ac.uk/blogs/science-talk/page-...,0,We’ve learned a lot about cancer in the last d...,,"[0.013241356238722801, -0.00850482378154993, 0...",0.87177
30,https://www.icr.ac.uk/blogs/science-talk/page-...,1,of patients and the public in wanting cancer ...,,"[-0.01164434663951397, 0.004861883819103241, 0...",0.852989
5,https://www.foxchase.org/blog/are-we-any-close...,0,"Due to a system-wide technology update, we are...",,"[-0.0031348944175988436, 0.004773442167788744,...",0.852107


In [172]:
df['text'][0]

'We asked three cancer experts - Nobel laureate Professor Harald zur Hausen, Professor Walter Ricciardi and Dr Elisabete Weiderpass – for their thoughts on curing cancer. They all sit on the EU’s Horizon Europe mission board for cancer and will help to define a concrete target for Europe in this area over the next decade.Prof. Harald zur Hausen, German Cancer Research Center, Heidelberg‘Evidence of infections linked to cancer provide hope of preventing up to half of all cancers’If we can ever cure cancer completely – that is an open question which I cannot answer. We have a good chance of drastically reducing the incidence of cancers, but what we see at present is that the incidence, or occurrence, of cancer is increasing globally.The mortality of cancer patients is slightly decreasing, but the increase in incidence is not compensated by the decrease in mortality. There are still a large number of cases coming up every year, and if we really want to do something against cancer in the f

## 3.URL Table

Make URL table so we can add a column with the credibility score of that website

In [123]:
top_urls

['https://ec.europa.eu/research-and-innovation/en/horizon-magazine/will-we-ever-cure-cancer',
 'https://www.foxchase.org/blog/are-we-any-closer-curing-cancer',
 'https://www.hsph.harvard.edu/magazine/magazine_article/the-cancer-miracle-isnt-a-cure-its-prevention/',
 'https://futurism.com/neoscope/vaccine-predict-cancer-vaccine-2030',
 'https://www.worldwidecancerresearch.org/news-opinion/2021/march/why-havent-we-cured-cancer-yet/',
 'https://www.icr.ac.uk/blogs/science-talk/page-details/what-s-coming-for-cancer-in-the-2020s',
 'https://www.roswellpark.org/cancertalk/201909/cure-cancer-whats-taking-so-long',
 'https://www.verywellhealth.com/will-cancer-ever-be-cured-4686392',
 'https://www.theatlantic.com/magazine/archive/2014/01/when-will-genomics-cure-cancer/355739/',
 'https://www.cancer.gov/news-events/cancer-currents-blog/2022/mrna-vaccines-to-treat-cancer']

In [125]:
#To be populated below
df_urls = pd.DataFrame({'url': pd.Series(dtype='str'),
                       'score': pd.Series(dtype='int')})

## 4. Credibility Score

We deviate from the semantic search for a bit to create a credibility score column, based on the url (if it looks trustworthy or not).  We ask GPT-3 to do this.  Experimenting a bit (notebook with analysis should be in place, but have not done yet) it does a decent job at least in the relative aspect i.e. It gives Nature.com a 10 and homeopathy.com a 6, although the latter probably deserves a 2, but the order is still useful.

In [142]:
prompt_template = """Assume you are a University Profressor, experienced in researching and evaluating sources of information. 
Here is my question. I google "{0}" and got this link: {1} Can I trust the information on this url? 
From 1 to 10, how would you score the reputation of this source? Please only output the number.
"""
score_list = []
for url in top_urls:
    prompt = prompt_template.format(my_question, url)
    #print(prompt)
    gpt_response = gpt3_completion(prompt)
    counter = 1
    temp =0
    # If we don't get a numeric response or number out of 1-10 range, keep trying with more temperature
    for i in range(4):
        if gpt_response.isnumeric()==True:
            if int(gpt_response) in range(11):
                score_list.append(int(gpt_response))
                df_urls = pd.concat([df_urls,pd.DataFrame({'url': [url],'score': [int(gpt_response)]})], ignore_index=True) 
                break
            else:
                temp = temp + 2*counter/10
                gpt_response = gpt3_completion(prompt, temp=temp)
                counter = counter + 1
                continue
        else:
            temp = temp + 2*counter/10
            gpt_response = gpt3_completion(prompt, temp=temp)
        counter = counter + 1
        if counter == 4:
            break
    
    

In [100]:
def create_df_urls(top_urls, df, my_question, pprint = True):
    """This function creates a dataframe (df_urls) of URLs and their respective trust scores.
    Inputs: 
    top_urls (list): A list of URLs
    df (DataFrame): A DataFrame
    my_question (str): A string representing a question
    pprint (bool): A boolean determining if print statements should be executed
    Outputs:
    df_urls (DataFrame): A DataFrame of URLs and their respective trust scores
    """
    
    df_urls = pd.DataFrame({'url': pd.Series(dtype='str'),
                       'score': pd.Series(dtype='int')})
    score_list = []
    for url in top_urls:
        prompt_template = """Assume you are a University Profressor, experienced in researching and evaluating sources of information. 
        Here is my question. I google "{0}" and got this link: {1} Can I trust the information on this url? 
        From 1 to 10, how would you score the reputation of this source? Please only output the number.
        """
        prompt = prompt_template.format(my_question, url)
        #print(prompt)
        if pprint:
            print(url)
            print('fetch gpt')
        gpt_response = gpt3_completion(prompt)
        if pprint:
            print(gpt_response)
        temp =0
        # If we don't get a numeric response or number out of 1-10 range, keep trying with more temperature
        for i in range(4):
            if gpt_response.isnumeric()==True:
                if int(gpt_response) in range(11):
                    score_list.append(int(gpt_response))
                    df_urls = pd.concat([df_urls,pd.DataFrame({'url': [url],'score': [int(gpt_response)]})], ignore_index=True) 
                    break
                else:
                    temp = temp + 2*i/10
                    if pprint:
                        print('fetch gpt on rebound')
                    gpt_response = gpt3_completion(prompt, temp=temp)
                    if pprint:
                        print(gpt_response)
                    counter = counter + 1
                    continue
            else:
                temp = temp + 2*i/10
                gpt_response = gpt3_completion(prompt, temp=temp)
    return df_urls
    

In [140]:
score_list

[8, 7, 9, 7, 8, 8, 8, 7, 8, 9]

In [143]:
df_urls

Unnamed: 0,url,score
0,https://ec.europa.eu/research-and-innovation/e...,8
1,https://www.foxchase.org/blog/are-we-any-close...,7
2,https://www.hsph.harvard.edu/magazine/magazine...,9
3,https://futurism.com/neoscope/vaccine-predict-...,7
4,https://www.worldwidecancerresearch.org/news-o...,8
5,https://www.icr.ac.uk/blogs/science-talk/page-...,8
6,https://www.roswellpark.org/cancertalk/201909/...,8
7,https://www.verywellhealth.com/will-cancer-eve...,7
8,https://www.theatlantic.com/magazine/archive/2...,8
9,https://www.cancer.gov/news-events/cancer-curr...,9


## 5. Semantic Search with credibility

We will now filter by credibility initially (top m urls), then find the top n similar snippets according to semantic search.  Remember, each url wil give rise to many snippets i.e. we split the text in the url's website into snippets that we can pass to GPT-3 later on.

In [95]:
def find_snippets(my_question, df, df_urls, top_m_urls, top_n_snippets):
    """This function takes a question, a dataframe, a dataframe of urls, the number of top URLs, and the number of top snippets 
        as input, and returns a dataframe of the most relevant snippets as output. It first sorts the urls dataframe by score
        and takes the top m urls. It then merges this dataframe with the original dataframe and removes any duplicated columns. 
        It then searches the dataframe for the most relevant snippets and returns a dataframe of the top n snippets.
    """
    url_sorted = df_urls.sort_values(by='score', ascending=False).head(top_m_urls)
    df = pd.merge(df,url_sorted, on='url', how='inner')
    df = df.loc[:,~df.columns.duplicated()].copy()
    df = search_snippets(df, my_question, n = top_n_snippets, pprint=True)
    return df



In [276]:
search_df = find_snippets(my_question, df, df_urls, 4,5)

url

snippet_id

text

embedding

score

similarity



In [277]:
search_df = search_df.sort_values(by='score', ascending=False)

In [278]:
search_df

Unnamed: 0,url,snippet_id,text,embedding,score,similarity
2,https://ec.europa.eu/research-and-innovation/e...,2,em is stimulated to attack the cancer) has als...,"[-0.007977260276675224, -0.0039186542853713036...",9,0.684178
9,https://www.hsph.harvard.edu/magazine/magazine...,1,"ic predispositions. The majority of risk, the ...","[0.007141884882003069, -0.0015429595950990915,...",9,0.679147
0,https://ec.europa.eu/research-and-innovation/e...,0,We asked three cancer experts - Nobel laureate...,"[0.012988211587071419, -0.005143945105373859, ...",8,0.683805
4,https://ec.europa.eu/research-and-innovation/e...,4,"shape the Horizon Europe work programme, an i...","[-0.001004464807920158, 6.958011363167316e-05,...",8,0.652864


## 6. GPT-3 Answers

For each candidate snippet of text we ask GPT-3 to extract an answer to the question (or output "No relevant info" so we can filter later)

In [11]:
gpt_answers = []
for index, row in search_df.iterrows():
    prompt = """I have a question and a paragraph.  Please extract an answer from the paragraph if present, otherwise say "No relevant information here"
    QUESTION:
    {0}
    PARAGRAPH:
    {1}
    ANSWER:""".format(my_question, row['text'])
    completion = gpt3_completion(prompt)
    gpt_answers.append(completion)
    

NameError: name 'search_df' is not defined

In [112]:
def get_answer(search_df, my_question, pprint = True):
    """
    This function takes three parameters: a search dataframe consisting of text, a question string, and a 
    boolean value for pprint. It uses a GPT-3 completion to extract an answer from each row of the search dataframe based 
    on the question string, appending the answer to the search dataframe as a new column. If the pprint parameter is set to 
    True, the prompt and answer will be printed to the console. The function returns the search 
    dataframe with the answers added.
    """
    gpt_answers = []
    for index, row in search_df.iterrows():
        prompt = """I have a question and a paragraph.  Please extract an answer from the paragraph if present, otherwise say "No relevant information here
        QUESTION:
        {0}
        PARAGRAPH:
        {1}
        ANSWER:""".format(my_question, row['text'])
        completion = gpt3_completion(prompt)
        gpt_answers.append(completion)
        if pprint:
            print(prompt)
    search_df['answer'] = gpt_answers
    return search_df

In [178]:
search_df['answer'] = gpt_answers

In [236]:
search_df

Unnamed: 0,url,snippet_id,text,embedding,score,answer
2,https://ec.europa.eu/research-and-innovation/e...,2,em is stimulated to attack the cancer) has als...,"[-0.007977260276675224, -0.0039186542853713036...",9,It is not possible to provide an exact timelin...
9,https://www.hsph.harvard.edu/magazine/magazine...,1,"ic predispositions. The majority of risk, the ...","[0.007141884882003069, -0.0015429595950990915,...",9,It is unlikely that cancer could ever be eradi...
0,https://ec.europa.eu/research-and-innovation/e...,0,We asked three cancer experts - Nobel laureate...,"[0.012988211587071419, -0.005143945105373859, ...",8,It is an open question whether we can ever com...


## 7. Search Function

We put all of the above together. The following function takes your main question and some params and does the research as specified in the intro of the notebook.

In [113]:
# a function to run all of the above process
def run_search(my_question, urls_per_search, top_m_urls, top_n_snippets, pprint = True):
    """This function runs a search for a given question and returns the top search results and snippets, as well as 
    proposed answer by GPT-3. It takes in 4 parameters: my_question (the question to be searched for), urls_per_search
    (the number of urls per search), top_m_urls (the top number of urls to be returned), and top_n_snippets 
    (the top number of snippets to be returned). The output is a search_df dataframe that contains the top search 
    results and snippets for the given question.
    """
    search_strings = get_search_strings(my_question)
    print(search_strings)
    top_urls = []
    for search in search_strings:
        top_urls_search = get_top_urls(search, urls_per_search, pprint)
        top_urls = top_urls_search + top_urls
        top_urls = deduplicate_list(top_urls)
        sleep(1)
    print('total urls: ')
    print(len(top_urls))
    df = create_df(top_urls, pprint)
    #global z # Yes, this is horrible, but helped me debug..
    #z = df
    print('df created with length:')
    print(len(df))
    df_urls = create_df_urls(top_urls, df, my_question, pprint)
    #global a 
    #a = df_urls
    print('df_urls done')
    search_df = find_snippets(my_question, df, df_urls, top_m_urls, top_n_snippets)
    #global b
    #b = search_df
    print('search_df-done')
    search_df = get_answer(search_df, my_question, pprint = pprint)
    return search_df
        

# 8. Example 1

Let's get info for the question "What are the most recent breakthroughs in Menieres disease?". Instead of the user searching and browsing each url, we should be getting back the answer to this question, synthesized by GPT-3, according to the relevant info in each website that contains a potential answer. GPT-3 initially comes up with multiple search queries to google them and fetch the corresponding urls.

In [None]:
start = time.time()
my_question = "What are the most recent breakthroughs in Meniere's disease"
urls_per_search = 15
top_m_urls = 40
top_n_snippets = 40
sdf = run_search(my_question, urls_per_search, top_m_urls, top_n_snippets)
end = time.time()
print(end - start)

In [None]:
sdf[sdf['answer']!='No relevant information here.']#['text'][0]'

In [293]:
for index, row in sdf[sdf['answer']!='No relevant information here.'].iterrows():
    print(row['url'])
    print('SCORE:')
    print(row['score'])
    print(row['answer'])

https://emedicine.medscape.com/article/1159069-overview
SCORE:
9
The most recent breakthroughs in Meniere's disease include the identification of endolymphatic hydrops utilizing delayed postcontrast 3D FLAIR and fused 3D FLAIR and CISS color maps, transtympanic steroids for Mnire's Disease, Betahistine dihydrochloride in the treatment of peripheral vestibular vertigo, clinical long-term effects of Meniett pulse generator for Meniere’s disease.
https://www.mountsinai.org/locations/center-hearing-balance/conditions/vertigo-balance-disorders/menieres-disease
SCORE:
9
Low-salt diet, medications such as gentamicin injections, and cutting back on salt to help keep the inner ear fluid low are some of the most recent breakthroughs in Meniere's disease.
https://www.tandfonline.com/doi/full/10.1080/00016489.2020.1776385
SCORE:
8
Visualization of EH invivo might make a great substantial improvement in diagnose of MD.
https://www.medscape.com/viewarticle/928652
SCORE:
8
The new clinical practice g

# Example 2

Let's look for AI music products, this works reasonably well

In [297]:
start = time.time()
my_question = "What is the best and latest AI music models that are commercially available for users? i.e. models that can compose music."
urls_per_search = 15
top_m_urls = 40
top_n_snippets = 40
sdf = run_search(my_question, urls_per_search, top_m_urls, top_n_snippets)
end = time.time()
print(end - start)

['best AI music models commercially available', 'latest AI music models for users', 'AI-generated musical compositions', 'music composition with artificial intelligence technology ', 'commercial applications of machine learning in music production and composition ', 'artificial neural networks for automated song generation ', 'machine learning algorithms to create original songs from scratch', 'deep learning techniques used in creating new pieces of artful melodies', 'AI tools that generate unique tunes based on user input']
total urls: 
113
df created with length:
613
df_urls done
url

snippet_id

text

embedding

score

similarity

search_df-done
297.09651803970337


In [294]:
#create_df_urls(top_urls, df, my_question)

In [295]:
#sdf[sdf['answer']!='No relevant information here.']#['text'][0]'

In [298]:
for index, row in sdf[sdf['answer']!='No relevant information here.'].iterrows():
    print(row['url'])
    print('SCORE:')
    print(row['score'])
    print(row['answer'])

https://www.xyonix.com/blog/how-ai-is-transforming-music-composition
SCORE:
7
OpenAI's Jukebox, which uses a Vector Quantized Variational AutoEncoder (VQ-VAE) to downsample original audio from the standard sampling rate of 44.1kHz down to 344Hz and then composes new songs based on compressed audio files.
https://ai.stackexchange.com/questions/7734/what-are-the-best-machine-learning-models-for-music-composition
SCORE:
8
The most recent AI music model commercially available is from DeepMind: The challenge of realistic music generation: modelling raw audio at scale.
https://www.xyonix.com/blog/how-ai-is-transforming-music-composition
SCORE:
7
Artificial Intelligence Virtual Artist (AIVA)
https://www.loudly.com/artificial-intelligence-music-composer
SCORE:
7
Experiments in Music Intelligence (EMI)
https://filmora.wondershare.com/audio-editing/best-ai-music-composer.html
SCORE:
7
Amper Music, AIVA Technologies and Jukedeck are three of the best AI music models that are commercially availabl

# Example 3

The first ones were quite good.  The following is probably more complicated, since there might not be much information out there.
The question is "Will audioLM, the AI model for music continuation, be put into a product?" 

In [None]:
start = time.time()
my_question = "Will audioLM, the AI model for music continuation, be put into a product?"
urls_per_search = 10 # How many urls to fetch from each google search, so total urls =this*number_of_proposed_searches_by_gpt3 minus any duplication
top_m_urls = 15 # Once scored by credibility, how many to keep
top_n_snippets = 25 # many snippets per url. How many to send to gpt3 as candidates for containing answer
sdf = run_search(my_question, urls_per_search, top_m_urls, top_n_snippets)
end = time.time()
print(end - start)

In [81]:
for index, row in sdf[sdf['answer']!='No relevant information here.'].iterrows():
    print(row['url'])
    print('SCORE:')
    print(row['score'])
    print(row['answer'])

https://www.lenseup.com/en/google-audio-lm-is-already-capable-of-making-speeches-with-your-voice/
SCORE:
7
For the moment AudioLM is not open to the public, it is only a language model that can be integrated into different projects.


In [82]:
sdf
# There seem to be many urls for which there is no relevant information in them.

Unnamed: 0,url,snippet_id,text,embedding,score,similarity,answer
17,https://www.technologyreview.com/2022/10/07/10...,0,"The technique, called AudioLM, generates natur...","[-0.018406299874186516, -0.002623820910230279,...",8,0.869921,No relevant information here.
24,https://www.automationalley.com/articles/googl...,0,"The technique, called AudioLM, generates natur...","[-0.013355117291212082, -0.027182146906852722,...",7,0.8662,No relevant information here.
22,https://www.lenseup.com/en/google-audio-lm-is-...,0,Computers are already able to play chess games...,"[-0.009313440881669521, -0.008414073847234249,...",7,0.866068,No relevant information here.
18,https://www.technologyreview.com/2022/10/07/10...,1,when piano keys are struck. The music also ha...,"[-0.0166457649320364, 0.002376771531999111, 0....",8,0.858494,No relevant information here.
25,https://musically.com/2022/10/10/google-harmon...,0,\nUsername or Email Address\n\n\nPassword\n\n ...,"[-0.017838450148701668, -0.012966717593371868,...",7,0.855964,No relevant information here.
2,https://ai.googleblog.com/2022/10/audiolm-lang...,2,We release more samples on this webpage.\n\nT...,"[-0.013859458267688751, -0.005921768490225077,...",9,0.85581,No relevant information here.
23,https://www.lenseup.com/en/google-audio-lm-is-...,1,artificial intelligence is not that it is abl...,"[-0.01638486608862877, -0.016690753400325775, ...",7,0.84696,For the moment AudioLM is not open to the publ...
27,https://www.analyticsinsight.net/the-ais-new-w...,0,\nAnalytics Insight\nLinux vs Windows: Which O...,"[-0.00034020928433164954, -0.01493803132325410...",7,0.845151,No relevant information here.
28,https://www.analyticsinsight.net/the-ais-new-w...,1,nerative Pre-trained Transformer 3 (GPT-3) pre...,"[-0.007595139089971781, -0.015444154851138592,...",7,0.839728,No relevant information here.
1,https://ai.googleblog.com/2022/10/audiolm-lang...,1,"racted from w2v-BERT, a self-supervised audio ...","[-0.013595780357718468, 0.004617499187588692, ...",9,0.834435,No relevant information here.


In [None]:
sorted_response = (
    sdf.sort_values("score", ascending=False)
    .head(n)
)

# Example 4

Last example was not great.. This one is more controversial, the question is "Was COVID caused by a lab leak?".  We surprisingly get no answers extracted by GPT-3, even though there is information there.. Maybe some safety mechanism kicked in. I tried with the following prompt and got "No relevant information here".

##### I have a question and a paragraph.  Please extract an answer from the paragraph if present, otherwise say "No relevant information here"
#####    QUESTION:
#####    Was COVID caused by a lab leak?
#####    PARAGRAPH:
#####    There was a lab leak in Wuhan and the virus came from there.
#####    ANSWER: No relevant information here.

On the other hand if I specified "COVID virus", then it outputs "Yes". Not sure what is happening here, but could be looked into further if this kind of behaviour shows up often

In [114]:
start = time.time()
my_question = "Was COVID caused by a lab leak?"
urls_per_search = 10 # How many urls to fetch from each google search, so total urls =this*number_of_proposed_searches_by_gpt3 minus any duplication
top_m_urls = 15 # Once scored by credibility, how many to keep
top_n_snippets = 25 # many snippets per url. How many to send to gpt3 as candidates for containing answer
sdf = run_search(my_question, urls_per_search, top_m_urls, top_n_snippets, pprint = False)
end = time.time()
print(end - start)

['COVID lab leak evidence', 'Was COVID caused by a laboratory accident?', 'Scientific research on the origin of COVID-19', 'Did SARS-CoV2 originate in a lab?', 'Is there proof that coronavirus was created in a laboratory?', 'Evidence for and against the theory of Covid 19 originating from Wuhan Lab', "Does scientific data support claims about covid's origins?", 'What is known about potential sources of 2019 novel Coronavirus (2019nCov) infection? ', 'Are scientists certain where Covid originated from?', 'Research into whether or not COVID came from an animal source']
total urls: 
68
df created with length:
381
df_urls done
url

snippet_id

text

embedding

score

similarity

search_df-done
201.50075340270996


In [115]:
for index, row in sdf[sdf['answer']!='No relevant information here.'].iterrows():
    print(row['url'])
    print('SCORE:')
    print(row['score'])
    print(row['text'])
    print(row['answer'])

In [116]:
sdf

Unnamed: 0,url,snippet_id,text,embedding,score,similarity,answer
90,https://www.cidrap.umn.edu/covid-19/scientists...,0,"COVID-19 virus, highly magnified., NIAIDSince ...","[-0.003664524294435978, -0.011543416418135166,...",9,0.871554,No relevant information here.
67,https://www.cfr.org/backgrounder/will-world-ev...,1,ly viewed as implausible. In the first weeks o...,"[0.006211551371961832, -0.016655869781970978, ...",8,0.866936,No relevant information here.
46,https://www.nature.com/articles/d41586-021-015...,1,"emergencies at the WHO, asked for less politi...","[0.012558558024466038, -0.0007459056214429438,...",9,0.863793,No relevant information here.
87,https://www.newyorker.com/science/elements/the...,15,"DARPA grant. When Jon Cohen, a writer for Sci...","[-0.004357331898063421, -0.010118323378264904,...",8,0.851911,No relevant information here.
45,https://www.nature.com/articles/d41586-021-015...,0,Thank you for visiting nature.com. You are usi...,"[0.010364314541220665, -0.007895675487816334, ...",9,0.84859,No relevant information here.
33,https://www.nature.com/articles/d41586-021-003...,2,robiologist at the University of Manitoba in W...,"[0.010820635594427586, -0.017626848071813583, ...",9,0.846242,No relevant information here.
37,https://www.nature.com/articles/d41586-022-007...,1,"sts say that they, too, would like to see more...","[-0.0007520264480262995, 0.0022817824501544237...",9,0.84589,No relevant information here.
75,https://www.newyorker.com/science/elements/the...,3,ing for an origins investigation that took the...,"[-0.003812673268839717, -0.014424169436097145,...",8,0.841592,No relevant information here.
51,https://www.nature.com/articles/d41586-021-015...,6,", Zhao Lijian, said that US labs should instea...","[0.0007590571185573936, -0.007768488489091396,...",9,0.83817,No relevant information here.
74,https://www.newyorker.com/science/elements/the...,2,ed that SARS-CoV-2 was engineered with genetic...,"[0.0024028965272009373, -0.013563762418925762,...",8,0.837807,No relevant information here.


# Example 5

Here we look for KG software by asking "What are examples of the current best knowledge graph software in the market?".  Results are not mind-blowing, but it should be mentioned that we are only evaluating 25 candidate snippets due to time/cost and design of this experiment.  So results are pretty decent, particularly in some of the examples above.

In [117]:
start = time.time()
my_question = "What are examples of the current best knowledge graph software in the market?"
urls_per_search = 10 # How many urls to fetch from each google search, so total urls =this*number_of_proposed_searches_by_gpt3 minus any duplication
top_m_urls = 15 # Once scored by credibility, how many to keep
top_n_snippets = 25 # many snippets per url. How many to send to gpt3 as candidates for containing answer
sdf = run_search(my_question, urls_per_search, top_m_urls, top_n_snippets, pprint = False)
end = time.time()
print(end - start)

['best knowledge graph software', 'top knowledge graph software', 'knowledge graph solutions', 'current market trends in Knowledge Graphs', 'examples of successful Knowledge Graph implementations', 'Knowledge Graph technology reviews', 'comparison of popular Knowledge Graph tools and platforms ', 'list of leading companies using Knowledge graphs for their products or services ', 'advantages and disadvantages to different types of Knowlegegraph technologies']
total urls: 
67
df created with length:
8443
df_urls done
url

snippet_id

text

embedding

score

similarity

search_df-done
1269.8911747932434


In [119]:
for index, row in sdf[sdf['answer']!='No relevant information here.'].iterrows():
    print(row['url'])
    print('SCORE:')
    print(row['score'])
    #print(row['text'])
    print(row['answer'])

https://www.ibm.com/topics/knowledge-graph
SCORE:
8
IBM Cloud with Red Hat, Watson AI and open source code or visual modeling.
https://www.ontotext.com/knowledgehub/webinars/state-of-kg-adoption/
SCORE:
7
GraphDB, Ontotexts free application
https://enterprise-knowledge.com/what-is-an-enterprise-knowledge-graph-and-why-do-i-want-one/
SCORE:
7
Some of the more prominent examples of knowledge graphs are Google's implementation and LinkedIn.
https://enterprise-knowledge.com/wireframes-visualize-knowledge-graphs/
SCORE:
7
1. As a researcher, I need to identify experts in a particular field by browsing related webinars, publications, conferences, committees, HR data and other such entities; 2. As a lab equipment purchaser I need to access all available content about specific product category so that can make the most informed buying decision possible; 3.As a data scientist I need to see how various financial institutions replied on same regulatory form and be able traverse relationship
https