In [None]:

'''
V2 Version of Research Machine
0. Starts with the key question that needs to be answered
- In memory, 1) context from one level up is provided

1. Searchers (
    INPUT: [key question, context, verifier feedback, running list of clues & relevant info]
    PROCESS: 
    OUTPUT: [key question, context, verifier feedback, updated running list of clues & relevant info, best answer, history of actions]
) -- they figure out if they need to 1.1) search more or 1.2) decompose to smaller key questions that need to be answered

1.1. Seekers
2.1) Key question decompositions are created and sent to verifiers to see if there's a reasonable answer
2.2) Searchers look to find answer online -- need to figure out how to do this well. And when to stop searching.
- In memory, 1) context from directly one level up is provided, 2) biggest clues and top answers are tracked, and 3) feedback from verifiers

2. Verifiers (
    INPUT: [key question, context, verifier feedback, updated running list of clues & relevant info, best answer, updated history]
    PROCESS: 
        1. Evaluate best answer, updated running list of clues & relevant info to see if there's a reasonable answer to the key question
        2. If yes, return answer to previous verifier with this reasonable answer as the answer (learned tactic)
        3. If no, give feedback as to what's wrong to analyzers
    OUTPUT: [updated key question, updated context, updated verifier feedback, updated running list of clues & relevant info, updated history]
) -- they check based on the top answers whether there is a reasonable answer to the question yet. If yes, then return the answer to the previous verifier so they can evaluate if they have a reasonable answer with this new information.
- Memory = full history of decomposition. They are responsible for backtracking and posing the key question for analyzers
- Checks for logical validity
- Checks for reasonable answer or not
- Give feedback to Analyzers

3. Analyzers (
    PROCESS: Given feedback and history, figure out what key question we need to answer. 
)

Risks:
1. Analyzers probably need more frameworks to backtrack on their approach, because decomposition might not work
2. This process is not MECE, I need to spend more time thinking about the right approach later.
'''
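The plan above can be read as a loop over one shared state object. Below is a minimal sketch of that loop, not the notebook's implementation: `ResearchState`, `run_research_loop`, and the three callables (`search`, `verify`, `analyze`) are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ResearchState:
    """Shared record passed between searchers, verifiers, and analyzers (hypothetical)."""
    key_question: str
    context: str = ""
    verifier_feedback: str = ""
    clues: List[str] = field(default_factory=list)
    best_answer: Optional[str] = None
    history: List[str] = field(default_factory=list)


def run_research_loop(
    state: ResearchState,
    search: Callable[[ResearchState], ResearchState],
    verify: Callable[[ResearchState], Tuple[bool, str]],
    analyze: Callable[[ResearchState], str],
    max_rounds: int = 3,
) -> ResearchState:
    """Alternate search -> verify -> analyze until the verifier accepts or rounds run out."""
    for round_idx in range(max_rounds):
        state = search(state)
        state.history.append(f"round {round_idx}: searched '{state.key_question}'")
        accepted, feedback = verify(state)
        if accepted:
            return state
        # Verifier rejected the answer: keep its feedback and let the analyzer pick the next question
        state.verifier_feedback = feedback
        state.key_question = analyze(state)
    return state
```

Each role only needs to read and update the fields listed under INPUT/OUTPUT in the notes, which keeps the three roles swappable while the prompts are still changing.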

In [None]:
# GPT template
prompt = f'''

'''
res = chat_openai(prompt, model="gpt-3.5-turbo")[0]
print(res)

# Searchers

### Coming up with context and key question

In [1]:
# context = "I'm supporting a biocement research project. I have to run a carbon neutrality analysis, so I'd like to know what efficiencies we need to achieve with ECR enzymes for the experimental process to be carbon negative, neutral, or positive."

In [1]:
# key question
search_query = "How efficiently do the ECR enzymes work in Kitsatospor setae bacteria?"

In [2]:
from autogpt.commands.web_selenium import browse_website, scrape_text_with_selenium_no_agent
import json
from util import sanitize_filename
import os
from prompts import get_predicted_usefulness_of_text_prompt
from collections import defaultdict
from llm import chat_openai
from autogpt.commands.web_search import web_search_ddg
from datetime import datetime

In [3]:
search_query_file_safe = sanitize_filename(search_query)

In [4]:
search_query_file_safe

'How_efficiently_do_the_ECR_enzymes_work_in_Kitsatospor_setae_bacteria_'
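`sanitize_filename` comes from the local `util` module, which is not shown in this notebook. A minimal sketch that reproduces the output above, assuming the helper simply replaces every character outside letters, digits, underscore, and hyphen (the real helper may differ; `sanitize_filename_sketch` is a hypothetical name):

```python
import re


def sanitize_filename_sketch(name: str) -> str:
    """Replace every character that is not a letter, digit, underscore, or hyphen with '_'."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", name)


print(sanitize_filename_sketch(
    "How efficiently do the ECR enzymes work in Kitsatospor setae bacteria?"
))
```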

In [5]:
search_engine = "academic"
# search_engine = "general"

In [6]:
folder_path = f'autoscious_logs/{search_query_file_safe}'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

In [None]:
# 1. Implementing a "write down the context, key question, deliverable, and any clarifying questions before you start searching for the answer to the question".
# prompt = f'''

# '''
# res = chat_openai(prompt, model="gpt-3.5-turbo")[0]
# print(res)


In [None]:
# 2. Implementing a verifier that responds to the clarifying questions as best as possible.

Verifier check (personal check for now):
- The key question seems pretty clear to me. I added answers to FAQs / clarifying questions, but I'd like to be able to support this later. Clarifying questions are key in this step to get feedback from the human. Without that feedback, I suppose the verifierLM can take its best guess at what the deliverable is.

### [Skip] Getting key questions and the decomposition and context

In [8]:
# # Create a decomposition for each key question only
# context = "Enoyl-CoA carboxylase/reductase enzymes (ECRs)"
# key_question_decomposition_list = []
# for driver_key, driver_value in decomposition['key_drivers'].items():
#     for hypothesis_key, hypothesis_value in driver_value['hypotheses'].items():
#         for question_key, question_value in hypothesis_value['key_questions'].items():
#             new_decomposition = decomposition.copy()
#             new_decomposition['key_drivers'] = {
#                 driver_key: {
#                     'driver': driver_value['driver'],
#                     'hypotheses': {
#                         hypothesis_key: {
#                             'hypothesis': hypothesis_value['hypothesis'],
#                             'key_questions': {
#                                 question_key: question_value
#                             }
#                         }
#                     }
#                 }
#             }
#             key_question_decomposition_list.append(new_decomposition)
# print("Key questions decomposition list: ", key_question_decomposition_list)

Key questions decomposition list:  [{'project_question': 'How efficiently do the ECR enzymes work, especially in Kitsatospor setae bacteria?', 'project_objective': 'To determine the efficiency of ECR enzymes in Kitsatospor setae bacteria', 'key_drivers': {'1': {'driver': 'ECR enzyme activity', 'hypotheses': {'1': {'hypothesis': 'ECR enzyme activity is high in Kitsatospor setae bacteria', 'key_questions': {'1': 'What is the level of ECR enzyme activity in Kitsatospor setae bacteria?'}}}}}}, {'project_question': 'How efficiently do the ECR enzymes work, especially in Kitsatospor setae bacteria?', 'project_objective': 'To determine the efficiency of ECR enzymes in Kitsatospor setae bacteria', 'key_drivers': {'1': {'driver': 'ECR enzyme activity', 'hypotheses': {'1': {'hypothesis': 'ECR enzyme activity is high in Kitsatospor setae bacteria', 'key_questions': {'2': 'How does ECR enzyme activity compare to other bacteria?'}}}}}}, {'project_question': 'How efficiently do the ECR enzymes work,

### Coming up with many good search queries

In [10]:
folder_path = f'autoscious_logs/{search_query_file_safe}/sources'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

In [11]:
def get_initial_search_queries_prompt(key_question, search_engine):
  return f'''
Key question:
{key_question}

Task:
For the key question, write a clear and comprehensive but short (1 query) list of search queries optimized for best search engine results, so that you can confidently and quickly surface the most relevant information to determine the best answer to the question. Extract a string of search keywords query from the key question.

The output should be in JSON format: 
```json
{{
  "1": "<insert query>",
  "keywords_query": "<insert keywords>"
}}

Respond only with the output, with no explanation or conversation.
'''

In [12]:
print(get_initial_search_queries_prompt(search_query, search_engine))


Key question:
How efficiently do the ECR enzymes work in Kitsatospor setae bacteria?

Task:
For the key question, write a clear and comprehensive but short (1 query) list of search queries optimized for best search engine results, so that you can confidently and quickly surface the most relevant information to determine the best answer to the question. Extract a string of search keywords query from the key question.

The output should be in JSON format: 
```json
{
  "1": "<insert query>",
  "keywords_query": "<insert keywords>"
}

Respond only with the output, with no explanation or conversation.



In [33]:
context = "Enoyl-CoA carboxylase/reductase enzymes (ECRs)"

# for decomposition_idx, key_question_decomposition in enumerate(key_question_decomposition_list):
key_question_initial_search_queries = json.loads(chat_openai(get_initial_search_queries_prompt(search_query, search_engine), model="gpt-3.5-turbo")[0])

keywords_query = key_question_initial_search_queries.pop('keywords_query')

with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_queries.json', 'w') as f:
    json.dump(key_question_initial_search_queries, f, indent=2)

with open(f'autoscious_logs/{search_query_file_safe}/sources/keywords_query.txt', 'w') as f:
    f.write(keywords_query)  # plain text file; json.dump would wrap the string in quotes

Prompt:  
Key question:
How efficiently do the ECR enzymes work in Kitsatospor setae bacteria?

Task:
For the key question, write a clear and comprehensive but short (1 query) list of search queries optimized for best search engine results, so that you can confidently and quickly surface the most relevant information to determine the best answer to the question. Extract a string of search keywords query from the key question.

The output should be in JSON format: 
```json
{
  "1": "<insert query>",
  "keywords_query": "<insert keywords>"
}

Respond only with the output, with no explanation or conversation.

Completion info:  {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "{\n  \"1\": \"Efficiency of ECR enzymes in Kitsatospor setae bacteria\",\n  \"keywords_query\": \"ECR enzymes efficiency Kitsatospor setae bacteria\"\n}",
        "role": "assistant"
      }
    }
  ],
  "created": 1691419510,
  "id": "chatcmpl-7kvlm90gatY

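The cell above passes the model's reply straight to `json.loads`. The prompt shows the expected output inside a Markdown code fence, so the model may wrap its reply in one, which would make the parse fail. A small tolerant parser, as a sketch (`parse_json_response` and `FENCE` are hypothetical names, not part of `llm` or `util`):

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built here to avoid writing it literally


def parse_json_response(raw: str) -> dict:
    """Parse a model reply as JSON, tolerating an optional Markdown fence around it."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.startswith("json"):
            text = text[len("json"):]
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    # json.loads ignores leading and trailing whitespace
    return json.loads(text)
```

It would be used as `parse_json_response(chat_openai(prompt, model="gpt-3.5-turbo")[0])` in place of the bare `json.loads` call.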
### (Debug testing) Google Scholar search given search keywords

In [9]:
import json
from scholarly import scholarly
from scholarly import ProxyGenerator

# Set up a ProxyGenerator object to use free proxies
# This needs to be done only once per session
pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)

In [117]:
scholar_res_gen = scholarly.search_pubs('Enoyl-CoA carboxylase/reductase (ECR) enzyme activity Kitasatospora setae bacteria')

In [136]:
scholar_res_gen

<scholarly.publication_parser._SearchScholarIterator at 0x1f0d8958160>

In [137]:
first_res = next(scholar_res_gen)

In [138]:
print(first_res['eprint_url'])

https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/126447/1/ETH23842.pdf


In [130]:
import requests
from PyPDF2 import PdfReader 
from io import BytesIO

def try_getting_pdf(url):
    """Return True if the content at `url` can be parsed as a PDF."""
    response = requests.get(url)
    f = BytesIO(response.content)
    try:
        PdfReader(f)
        return True
    except Exception:
        return False

# Get the PDF content
def try_getting_pdf_content(url='https://pubs.acs.org/doi/pdf/10.1021/acs.chemrev.2c00581'):
    """Return the concatenated text of every page, or "" if the PDF cannot be read."""
    response = requests.get(url)
    f = BytesIO(response.content)
    try:
        pdf = PdfReader(f)
        content = ""

        for page in pdf.pages:
            # extract_text() can come back empty for image-only pages
            content += page.extract_text() or ""
        return content
    except Exception:
        return ""

In [146]:
scholar_res_gen = scholarly.search_pubs('Enoyl-CoA carboxylase/reductase (ECR) enzyme activity Kitasatospora setae bacteria')

MaxTriesExceededException: Cannot Fetch from Google Scholar.
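Google Scholar requests through free proxies fail intermittently, as the exception above shows. One common mitigation is to retry with exponential backoff. The helper below is a generic sketch (`call_with_retries` is a hypothetical name, and the delay values are assumptions); it does not guarantee that Scholar will respond.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            print(f"Attempt {attempt + 1} failed ({exc!r}); retrying")
            sleep(base_delay * 2 ** attempt)
```

Usage would look like `call_with_retries(lambda: scholarly.search_pubs(query))`, with a fallback to the general web search if the final attempt still raises.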

In [None]:
web_search_res = []
for res in scholar_res_gen:
    item = {}
    item['title'] = res['bib']['title']
    # Not every result has an 'eprint_url'; fall back to the publisher page
    eprint_url = res.get('eprint_url')
    if eprint_url and try_getting_pdf(eprint_url):
        item['href'] = eprint_url
    else:
        item['href'] = res.get('pub_url')
    item['body'] = res['bib'].get('abstract', '')
    web_search_res.append(item)

In [None]:
print(web_search_res)

[]


In [74]:
from autogpt.commands.web_selenium import scrape_text_with_selenium_no_agent

In [103]:
text = scrape_text_with_selenium_no_agent('https://pubs.acs.org/doi/abs/10.1021/acs.chemrev.2c00581', None, search_engine='firefox')

Going through url:  https://pubs.acs.org/doi/abs/10.1021/acs.chemrev.2c00581
select firefox options!
Driver is getting url
set timeout!
Page loaded within 15 seconds
Driver got url
Driver has found page source
Handing off to Beautiful Soup!
done extractin
Text:  Download Hi-Res ImageDownload to MS-PowerPointCite This:Chem. Rev. 2023, 123, 9, 5702-5754
ADVERTISEMENT
RETURN TO ISSUEPREVReviewNEXTEnzymatic Conversion of CO2: From Natural to Artificial UtilizationSarah BierbaumerSarah BierbaumerInstitute of Chemistry, University of Graz, NAWI Graz, Heinrichstraße 28, 8010 Graz, AustriaMore by Sarah BierbaumerView Biographyhttps://orcid.org/0000-0003-3883-7108, Maren NattermannMaren NattermannDepartment of Biochemistry and Synthetic Metabolism, Max Planck In


In [None]:
print(text)

In [91]:
from selenium import webdriver
import time

### Web search given search keywords

In [14]:
import os
from dotenv import load_dotenv
from googleapiclient.discovery import build
load_dotenv()

True

In [15]:
import json
from scholarly import scholarly
from scholarly import ProxyGenerator

# Set up a ProxyGenerator object to use free proxies
# This needs to be done only once per session
pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)

In [44]:
# import requests

# def download_pdf(url, target_path):
#     response = requests.get(url)
    
#     # Ensure we got a valid response
#     if response.status_code == 200:
#         with open(target_path, 'wb') as f:
#             f.write(response.content)
#     else:
#         print(f"Unable to get URL: {url}")
#         print(f"Response Code: {response.status_code}")

In [47]:
# download_pdf("https://www.africau.edu/images/default/sample.pdf", r'C:\Users\1kevi\Desktop\projects\Research\autoscious-carbon-capture\question_answering\test.pdf')

In [16]:
# # Need to support pdfs! https://towardsdatascience.com/how-to-extract-text-from-any-pdf-and-image-for-large-language-model-2d17f02875e6
# # Looks like with requests, we can't always download pdfs unfortunately, they'll probably need to be added as sources to check through by users manually
from io import BytesIO
from PIL import Image
from pytesseract import image_to_string
import pypdfium2 as pdfium

def convert_pdf_to_images(file_path, scale=300/72):

    pdf_file = pdfium.PdfDocument(file_path)

    page_indices = [i for i in range(len(pdf_file))]

    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )

    final_images = []

    for i, image in zip(page_indices, renderer):

        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        image_byte_array = image_byte_array.getvalue()
        final_images.append(dict({i: image_byte_array}))

    return final_images

# 2. Extract text from images via pytesseract


def extract_text_from_img(list_dict_final_images):

    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for index, image_bytes in enumerate(image_list):

        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)

    return "\n".join(image_content)


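def extract_content_from_url(url: str):
    # NOTE: despite the name, `url` is handed to pdfium.PdfDocument as-is,
    # so a remote PDF has to be downloaded to a local file first
    images_list = convert_pdf_to_images(url)
    text_with_pytesseract = extract_text_from_img(images_list)

    return text_with_pytesseract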

In [None]:
# res = extract_content_from_url("https://phys.org/news/2022-04-soil-microbe-rev-artificial-photosynthesis.pdf")

In [None]:
# print(try_getting_pdf_content("https://phys.org/news/2022-04-soil-microbe-rev-artificial-photosynthesis.pdf"))

In [38]:
# import requests
# from PyPDF2 import PdfReader 
# from io import BytesIO

# def is_pdf_encrypted(url):
#     response = requests.get(url)
#     f = BytesIO(response.content)

#     try:
#         pdf = PdfReader(f)
#         return pdf.isEncrypted
#     except Exception as e:
#         print(f"Error occurred: {e}")
#         return False

In [40]:
# url = "https://phys.org/news/2022-04-soil-microbe-rev-artificial-photosynthesis.pdf"
# if is_pdf_encrypted(url):
#     print("The PDF is encrypted.")
# else:
#     print("The PDF is not encrypted.")

Error occurred: EOF marker not found
The PDF is not encrypted.
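The "EOF marker not found" error above suggests the URL may have returned something other than a PDF (for example an HTML page), even though the path ends in `.pdf`. A cheap pre-check is to look for the `%PDF-` header before handing the bytes to `PdfReader`. This is a sketch; `looks_like_pdf` is a hypothetical helper, and the 1024-byte window is an assumption based on how leniently common readers locate the header.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header appears near the start of the downloaded bytes."""
    return b"%PDF-" in data[:1024]


print(looks_like_pdf(b"%PDF-1.7\n..."))              # header present
print(looks_like_pdf(b"<!DOCTYPE html><html>..."))   # an HTML page served at a .pdf URL
```

Calling it on `response.content` before parsing separates "not a PDF at all" from "a PDF that PyPDF2 cannot read", which are currently reported the same way.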


In [17]:
import requests
from PyPDF2 import PdfReader 
from io import BytesIO

def try_getting_pdf(url):
    """Return True if the content at `url` can be parsed as a PDF."""
    response = requests.get(url)
    f = BytesIO(response.content)
    try:
        PdfReader(f)
        return True
    except Exception:
        print("Could not get pdf")
        return False

# Get the PDF content
def try_getting_pdf_content(url):
    """Return the concatenated text of every page, or "" if the PDF cannot be read."""
    response = requests.get(url)
    f = BytesIO(response.content)
    try:
        pdf = PdfReader(f)
        content = ""

        for page in pdf.pages:
            # extract_text() can come back empty for image-only pages
            content += page.extract_text() or ""
        return content
    except Exception:
        print("Error getting PDF content")
        return ""

In [18]:
import time
def google_search_raw(search_term, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=os.getenv('DEV_KEY'))
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()

    search_results = res.get("items", [])
    time.sleep(1)  # brief pause between requests

    # Return the full result items (title, link, snippet, ...), not only the URLs
    return search_results

In [19]:
def search_google(search_query):
    num_google_searches = 8
    results = google_search_raw(search_query, os.getenv('MY_CSE_ID'), num=num_google_searches, lr="lang_en", cr="countryUS")
    return results

In [20]:
MAX_RETRIES = 3

# for decomposition_idx, key_question_decomposition in enumerate(key_question_decomposition_list):
with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_queries.json', 'r') as f:
    key_question_initial_search_queries = json.load(f)

for idx, query in key_question_initial_search_queries.items():
    print("query: ", query)
    # query = "ECR enzyme efficiency in k setae" # Hard coded to get the results I want

    web_search_res = []
    if search_engine == "academic":
        print("trying academic search")
        try:
            scholar_res_gen = scholarly.search_pubs(query)

            for res in scholar_res_gen:
                item = {}
                item['title'] = res['bib']['title']
                # 'eprint_url' can be missing; without .get() a KeyError would abort the whole academic search
                eprint_url = res.get('eprint_url')
                if eprint_url and try_getting_pdf(eprint_url):
                    item['href'] = eprint_url
                    item['pdf'] = True
                else:
                    item['href'] = res.get('pub_url')
                    item['pdf'] = False
                item['body'] = res['bib'].get('abstract', '')
                web_search_res.append(item)
        except Exception as e:
            print("Exception, trying normal search:", e)
    if web_search_res == []:
        # DDG
        print("trying normal search")
        web_search_res = json.loads(web_search_ddg(query))
        if len(web_search_res) == 0:
            print("trying google search!")
            # Google
            web_search_res_raw = search_google(query) # google uses 'link' instead of 'href'
            web_search_res = [{
                'title': web_search_res_raw[i]['title'], 
                'href': web_search_res_raw[i]['link'], 
                'body': web_search_res_raw[i]['snippet'],
                'pdf': False
                } for i in range(len(web_search_res_raw))
            ]

    # save web search results
    with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_results_query_{idx}.json', 'w') as f:
        json.dump(web_search_res, f, indent=2)

query:  Efficiency of ECR enzymes in Kitsatospor setae bacteria
trying academic search
trying normal search
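The search cell above merges three sources with different field names into one `{title, href, body, pdf}` shape. That mapping can be isolated in one function, sketched below. `normalize_result` is a hypothetical name, and setting `pdf` from the mere presence of `eprint_url` is a simplification: the notebook only sets it after `try_getting_pdf` succeeds.

```python
def normalize_result(raw: dict, source: str) -> dict:
    """Map a raw search hit onto the {title, href, body, pdf} shape used in this notebook."""
    if source == "google":
        # Google Custom Search uses 'link' and 'snippet'
        return {
            "title": raw.get("title", ""),
            "href": raw.get("link", ""),
            "body": raw.get("snippet", ""),
            "pdf": False,
        }
    if source == "scholar":
        bib = raw.get("bib", {})
        eprint = raw.get("eprint_url")
        return {
            "title": bib.get("title", ""),
            "href": eprint or raw.get("pub_url", ""),
            "body": bib.get("abstract", ""),
            "pdf": bool(eprint),  # simplification: assumes the eprint link is a readable PDF
        }
    # DuckDuckGo results already use title/href/body
    return {
        "title": raw.get("title", ""),
        "href": raw.get("href", ""),
        "body": raw.get("body", ""),
        "pdf": False,
    }
```

Keeping one schema means the later rating and skimming cells never need to know which engine produced a result.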


### Reading type 1: filtering unlikely relevant sources based on title and body

In [21]:
def get_filtering_web_results_ratings(key_question, web_search_res):
    return f'''
Key question:
{key_question}

Task:
Based on the key question and each search result's title and body content, reason and assign a predicted usefulness score of the search result's content and potential useful references to answering the key question using a 5-point Likert scale, with 1 being very not useful, 2 being not useful, 3 being somewhat useful, 4 being useful, 5 being very useful.

Search results:
{web_search_res}

The output should be in JSON format: 
```json
{{
  "href": "relevance score",
  etc.
}}
```

Respond only with the output, with no explanation or conversation.
'''

In [22]:
from collections import defaultdict

with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_queries.json', 'r') as f:
    key_question_initial_search_queries = json.load(f)

for query_idx, query in key_question_initial_search_queries.items():
    # load web search results
    with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_results_query_{query_idx}.json', 'r') as f:
        web_search_res = json.loads(f.read())
    
    filtered_web_results = {}
    if web_search_res != []:
        # filter web results based on title and body
        filtered_web_results = json.loads(chat_openai(get_filtering_web_results_ratings(search_query, web_search_res), model="gpt-3.5-turbo")[0])

    ratings_url_dict = defaultdict(list)
    for url, rating in filtered_web_results.items():
        ratings_url_dict[str(rating)].append(url)

    # save filtered search results
    with open(f'autoscious_logs/{search_query_file_safe}/sources/rated_web_results_query_{int(query_idx)}.json', 'w') as f:
        json.dump(ratings_url_dict, f, indent=2)

Prompt:  
Key question:
How efficiently do the ECR enzymes work in Kitsatospor setae bacteria?

Task:
Based on the key question and each search result's title and body content, reason and assign a predicted usefulness score of the search result's content and potential useful references to answering the key question using a 5-point Likert scale, with 1 being very not useful, 2 being not useful, 3 being somewhat useful, 4 being useful, 5 being very useful.

Search results:
[{'title': 'Awakening the Sleeping Carboxylase Function of Enzymes: Engineering the ...', 'href': 'https://pubs.acs.org/doi/10.1021/jacs.9b03431', 'body': 'However, for enoyl-CoA carboxylase/reductase from Kitasatospora setae (ECR Ks), four conserved amino acids that form a CO 2-binding pocket at the active site were described recently (Figure 1a). These four amino acids anchor and position the CO 2 molecule during catalysis, in which a reactive enolate is formed that attacks the CO 2. Figure 1'}, {'title': 'Four amino
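The model's ratings come back as free-form JSON values, so they can arrive as integers, numeric strings, or occasionally words. The grouping step can validate them before bucketing. A sketch, assuming only whole-number scores from 1 to 5 should be kept (`bucket_by_rating` is a hypothetical name):

```python
from collections import defaultdict


def bucket_by_rating(ratings: dict) -> dict:
    """Group URLs by integer rating 1-5, skipping values that cannot be read as such."""
    buckets = defaultdict(list)
    for url, rating in ratings.items():
        try:
            score = int(str(rating).strip())
        except ValueError:
            continue  # e.g. "useful" instead of a number
        if 1 <= score <= 5:
            buckets[str(score)].append(url)
    return dict(buckets)


print(bucket_by_rating({"https://a": 5, "https://b": "5", "https://c": " 4 ", "https://d": "useful"}))
```

The keys stay strings so the result can be dumped to `rated_web_results_query_<n>.json` and read back with the same `'5'`, `'4'`, `'3'` comparisons used later.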

### Reading type 2 & 3: filtering based on skimming and sampling from each source, and only saving most relevant sources for fact extraction and quotes.

Skimming and rating is irrelevant for now. Only useful for prioritizing which articles to read given limited tokens. Still need to run to get the full text though.

In [23]:
# COMPLETE code for predicting usefulness of very relevant (5) and relevant (4) results.
from autogpt.commands.web_selenium import scrape_text_with_selenium_no_agent

CHUNK_SIZE = 1000
SAMPLING_FACTOR = 0.1 # Also cap it so it falls under the max token limit
MAX_TOKENS = 2500 * 4 # 1 token = 4 chars, 2500 + 500 (prompt) tokens is high for GPT3.5
MAX_CHUNKS = int(MAX_TOKENS / CHUNK_SIZE)
# context = "Enoyl-CoA carboxylase/reductase enzymes (ECRs)"
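A worked example of how these constants interact may help. Note that `MAX_TOKENS` is really a character budget (2500 tokens times roughly 4 characters each), so `MAX_CHUNKS` comes out to 10. The `minimum` parameter below is an assumption added for illustration, to show the effect of enforcing a floor on the sample count.

```python
CHUNK_SIZE = 1000
SAMPLING_FACTOR = 0.1
MAX_TOKENS = 2500 * 4                      # character budget: 2500 tokens * ~4 chars
MAX_CHUNKS = int(MAX_TOKENS / CHUNK_SIZE)  # 10


def num_samples(text_length: int, minimum: int = 0) -> int:
    """Number of chunks to sample from a text of `text_length` characters."""
    total_chunks = text_length / CHUNK_SIZE
    sampled = min(int(total_chunks * SAMPLING_FACTOR), MAX_CHUNKS)
    return max(sampled, minimum)


for length in (5_000, 55_000, 300_000):
    print(length, num_samples(length), num_samples(length, minimum=1))
```

A 5,000-character page yields 0 samples without a floor and 1 with it; a 55,000-character page yields 5; a 300,000-character page is capped at 10.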

In [24]:
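def get_sample_chunks(text, CHUNK_SIZE, num_chunk_samples):
    # Nothing to sample from an empty text or a zero sample count
    if not text or num_chunk_samples <= 0:
        return []

    # Evenly spaced start offsets; never let the step drop to zero
    step_size = max(len(text) // num_chunk_samples, 1)

    chunks = []
    for i in range(0, len(text), step_size):
        chunk = text[i:i+CHUNK_SIZE]
        chunks.append(chunk)

        # Break after getting the required number of chunks
        if len(chunks) >= num_chunk_samples:
            break

    return chunks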

In [25]:
# Need to determine how useful the text is likely to be for answering the key questions
def get_predicted_usefulness_of_text_prompt(key_question, sample_text_chunks):
    return f'''
Key question:
{key_question}

Task: 
Based on the key question and the sample text chunks of the source text, the goal is to identify how useful reading the full source text would be to extract direct quoted facts or references to determine the best answer to the key question. 

Deliverable:
Assign a predicted usefulness score of the full source text using a 5-point Likert scale, with 1 being very unlikely to be useful, 2 being unlikely to be useful, 3 being somewhat likely to be useful, 4 being likely to be useful, and 5 being very likely useful and containing facts or references that answer the key question.

Sample text chunks from the source text:
{sample_text_chunks}

The output should be of the following JSON format
{{
    "predicted_usefulness": <insert predicted usefulness rating>,
   etc.
}}


Respond only with the output, with no explanation or conversation.
'''

In [26]:
# Function to return sample chunks based on relevance to key_question using cosine similarity
# TODO: have ChatGPT come up with what the form of the answer might look like and match on that!
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import re

model = SentenceTransformer('all-MiniLM-L6-v2')  # or 'all-mpnet-base-v2'

def get_most_relevant_chunks(key_question, text, CHUNK_SIZE, num_chunk_samples):
    # 1. Split text into chunks
    chunks = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # 2. For each chunk, compute the cosine similarity with key_question
    key_embedding = model.encode([key_question])
    chunk_embeddings = model.encode(chunks)
    
    similarities = cosine_similarity(key_embedding, chunk_embeddings)[0]  # one row: question vs. every chunk
    
    # 3. Sort chunks by similarity
    sorted_chunks = [chunk for _, chunk in sorted(zip(similarities, chunks), key=lambda pair: pair[0], reverse=True)]

    # 4. Return top num_chunk_samples chunks
    return sorted_chunks[:num_chunk_samples]



In [27]:
from rank_bm25 import BM25Okapi
import re

def get_most_relevant_chunks_with_bm25(key_question, text, CHUNK_SIZE, num_chunk_samples):
    # 1. Split text into chunks
    chunks = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # 2. Tokenize the chunks
    tokenized_chunks = [re.findall(r"\w+", chunk) for chunk in chunks]

    # 3. Initialize BM25
    bm25 = BM25Okapi(tokenized_chunks)

    # 4. Query BM25 with the key question
    tokenized_question = re.findall(r"\w+", key_question)
    scores = bm25.get_scores(tokenized_question)

    # 5. Sort chunks by BM25 scores
    sorted_chunks = [chunk for _, chunk in sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)]

    # 6. Return top num_chunk_samples chunks
    return sorted_chunks[:num_chunk_samples]
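To make the ranking step less of a black box, here is a dependency-free sketch of Okapi BM25 scoring over chunks. It is a simplified stand-in, not `rank_bm25`'s implementation: it lowercases tokens (the cell above does not), uses a smoothed IDF so scores never go negative, and the `k1` and `b` defaults are conventional values chosen as assumptions. `bm25_rank` is a hypothetical name.

```python
import math
import re
from collections import Counter


def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Return chunks sorted by a simplified Okapi BM25 score, best first."""
    docs = [re.findall(r"\w+", c.lower()) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / max(n, 1)
    # Number of chunks each term appears in
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = re.findall(r"\w+", query.lower())

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms weigh more, and the value stays positive
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    order = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [chunks[i] for i in order]


chunks = [
    "The weather was sunny.",
    "ECR enzymes from Kitasatospora setae show high catalytic efficiency.",
    "Enzymes are proteins.",
]
print(bm25_rank("ECR enzymes efficiency", chunks)[0])
```

Because BM25 rewards exact term matches, it pairs naturally with the short `keywords_query` string rather than the full natural-language question.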


In [28]:
# def get_sample_chunks(key_question, text, CHUNK_SIZE, num_chunk_samples):
#     step_size = len(text) // num_chunk_samples

#     chunks = []
#     for i in range(0, len(text), step_size):
#         chunk = text[i:i+CHUNK_SIZE]
#         chunks.append(chunk)

#         # Break after getting the required number of chunks
#         if len(chunks) >= num_chunk_samples:
#             break

#     return chunks

def find_title(url, web_search_info):
    for item in web_search_info:
        if item["href"] == url:
            return item["title"]
    return None

def check_pdf(url, web_search_info):
    # Look up the 'pdf' flag of the entry that matches this URL
    for item in web_search_info:
        if item["href"] == url and item.get("pdf"):
            return True
    print("Not pdf")
    return False

In [30]:
folder_path = f'autoscious_logs/{search_query_file_safe}/sources/full_text'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

# Skimming through each highly relevant paper from skimming
with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_queries.json', 'r') as f:
    key_question_initial_search_queries = json.load(f)

for query_idx, query in key_question_initial_search_queries.items():
    # open filtered search results
    with open(f'autoscious_logs/{search_query_file_safe}/sources/rated_web_results_query_{int(query_idx)}.json', 'r') as f:
        ratings_url_dict = json.loads(f.read())

    # open web search info to extract metadata
    with open(f'autoscious_logs/{search_query_file_safe}/sources/initial_search_results_query_{int(query_idx)}.json', 'r') as f:
        web_search_info = json.load(f)
    
    for rating, urls in ratings_url_dict.items():
        if rating in ('5', '4', '3'): # Scraping all useful websites to skim through
            # Iterate through the 3s, 4s, and 5s of ratings_url_dict
            folder_path = f'autoscious_logs/{search_query_file_safe}/sources/predicted_usefulness_{rating}'
            if not os.path.exists(folder_path):
                os.makedirs(folder_path)

            for rating_source_idx, url in enumerate(urls):
                print("query ", query_idx, "rating_source_idx", rating_source_idx, "Skimming url:", url)

                # Ensure the url hasn't already been visited
                title = find_title(url, web_search_info)
                if title and not os.path.exists(f'autoscious_logs/{sanitize_filename(search_query)}/sources/full_text/{sanitize_filename(title)}.txt') and not os.path.exists(f'{folder_path}/query_{query_idx}_url_index_{rating_source_idx}.json'):

                    # Check if it's a pdf or not
                    if try_getting_pdf(url):
                        print("PDF found!")
                        text = try_getting_pdf_content(url)
                    else:
                        text = scrape_text_with_selenium_no_agent(url, None, search_engine='firefox')

                    # Only evaluate websites you're able to scrape
                    if text and text != "No information found":
                        total_chunks = len(text) / CHUNK_SIZE
                        # Sample at least one chunk so short pages are still evaluated
                        num_chunk_samples = max(1, min(int(total_chunks * SAMPLING_FACTOR), MAX_CHUNKS))
                        # sample_chunks = get_sample_chunks(text, CHUNK_SIZE, num_chunk_samples)
                        sample_chunks = get_most_relevant_chunks_with_bm25(keywords_query, text, CHUNK_SIZE, num_chunk_samples) # Using BM25 to search for keywords instead of general query
                        print("len(sample_chunks)", len(sample_chunks))

                        # Get predicted usefulness based on sample chunks
                        predicted_usefulness_results = json.loads(chat_openai(get_predicted_usefulness_of_text_prompt(search_query, sample_chunks), model="gpt-3.5-turbo")[0])

                        # save filtered search results
                        with open(f'{folder_path}/query_{query_idx}_url_index_{rating_source_idx}.json', 'w') as f:
                            predicted_usefulness_results['title'] = title
                            predicted_usefulness_results['url'] = url
                            json.dump(predicted_usefulness_results, f, indent=2)
                        
                        # Check if any scores were (4 or) 5, because then we should save the full text
                        pred_usefulness = predicted_usefulness_results.values()

                        # TODO: perhaps make this more dynamic
                        if 5 in pred_usefulness or '5' in pred_usefulness or 4 in pred_usefulness or '4' in pred_usefulness:
                        # DEBUG: Just looking at the scraping results
                            with open(f'autoscious_logs/{sanitize_filename(search_query)}/sources/full_text/{sanitize_filename(title)}.txt', 'w', encoding='utf-8') as f:
                                f.write(title + '\n')
                                f.write(url + '\n')
                                f.write(text)
                else:
                    print("URL or text already visited!")

query  1 rating_source_idx 0 Skimming url: https://pubs.acs.org/doi/10.1021/jacs.9b03431
URL or text already visited!
query  1 rating_source_idx 1 Skimming url: https://www.researchgate.net/figure/Reaction-scheme-and-structural-organization-of-the-K-setae-ECR-complex-a_fig1_360182999
Could not get pdf
Going through url:  https://www.researchgate.net/figure/Reaction-scheme-and-structural-organization-of-the-K-setae-ECR-complex-a_fig1_360182999
select firefox options!
Driver is getting url
set timeout!
Page loaded within 15 seconds
Driver got url
Driver has found page source
Handing off to Beautiful Soup!
done extractin
Text:  Figure 1 - available via license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 InternationalContent may be subject to copyright.DownloadView publicationCopy referenceCopy captionEmbed figureReaction scheme and structural organization of the K. setae ECR complex. (a) Carboxylation reaction scheme of ECR. (b) Anisotropic B-factors of the tetramer of t
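The loop above decides whether to save the full text by checking for `4`, `'4'`, `5`, or `'5'` among the reply's values, with a TODO to make this more dynamic. One possible shape for that, as a sketch (`should_save_full_text` and its `threshold` parameter are hypothetical): parse each value as an integer score and compare against a configurable threshold, ignoring anything outside the 1 to 5 scale so that metadata such as a numeric-looking title cannot trigger a save.

```python
def should_save_full_text(results: dict, threshold: int = 4) -> bool:
    """True if any value in the model's reply parses as a 1-5 score at or above threshold."""
    for value in results.values():
        try:
            score = int(str(value).strip())
        except ValueError:
            continue  # titles, URLs, and other non-numeric values
        if threshold <= score <= 5:
            return True
    return False


print(should_save_full_text({"predicted_usefulness": "5", "title": "t", "url": "u"}))
print(should_save_full_text({"predicted_usefulness": 3}))
```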

# Verifiers

# Analyzers