# Prompt to Incident Reports Pipeline

We combine SerpApi with Selenium WebDriver to build a pipeline that goes from a search engine prompt to the text of the matching incident reports. We first set up a SerpApi client for retrieving search results.

In [None]:
import serpapi
from ncisKey import ncis_serp_key       # This is a local file containing the NCIS SerpApi key

api_key = ncis_serp_key()
client = serpapi.Client(api_key=api_key)

The SerpApi client returns search results as a JSON object. With the standard Google engine, the first page of results is stored under the `organic_results` key: a list of dictionaries, one per search result, each holding the result's position, title, and link, among other information. Another entry of interest is `related_questions`, which holds the related questions Google suggests for the search. These could perhaps be used to generate further prompts of interest.
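As a minimal sketch of how these entries might be read, the snippet below works on a hand-written dictionary that imitates the layout described above; it is not real search output. The helper names `extract_links` and `follow_up_prompts` are hypothetical, and the `question` field inside each `related_questions` entry is an assumption about the response layout.

```python
# Hand-written stand-in for a SerpApi response; NOT real search output.
sample_response = {
    "organic_results": [
        {"position": 1, "title": "Example report A", "link": "https://example.com/a"},
        {"position": 2, "title": "Example report B", "link": "https://example.com/b"},
    ],
    # Assumption: each related question is a dict with a "question" field.
    "related_questions": [
        {"question": "What is catch misreporting?"},
    ],
}

def extract_links(response, key="organic_results"):
    """Return the link of every result stored under `key` (hypothetical helper)."""
    return [r["link"] for r in response.get(key, []) if "link" in r]

def follow_up_prompts(response):
    """Return related questions as candidate follow-up prompts (hypothetical helper)."""
    return [q["question"] for q in response.get("related_questions", []) if "question" in q]

print(extract_links(sample_response))
print(follow_up_prompts(sample_response))
```

Using `.get` with a default keeps the helpers from raising when an entry is missing from a response.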

With the Google News engine, the results are stored under the `news_results` key instead. Again, `news_results` is a list of dictionaries, one per search result, each holding the result's position, title, and link, along with authorship and publication date information. Unlike the standard Google engine, the Google News engine returns *all* results of the search, not just the first page.
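The sketch below shows one way the metadata of each news result could be flattened into a simple record. The sample dictionary is made up for illustration, and the exact field layout (a top-level `date` and a nested `source` dictionary with `authors`) is an assumption, which is why every field is read with `.get`. The helper name `summarize_news` is hypothetical.

```python
# Made-up stand-in for a Google News response; field layout is an assumption.
sample_news = {
    "news_results": [
        {
            "position": 1,
            "title": "Example headline",
            "link": "https://example.com/news/1",
            "date": "2024-06-01",
            "source": {"name": "Example Times", "authors": ["A. Writer"]},
        },
        # A result with no date or source information.
        {"position": 2, "title": "Another headline", "link": "https://example.com/news/2"},
    ]
}

def summarize_news(response):
    """Flatten each news result into a dict of the fields of interest."""
    rows = []
    for item in response.get("news_results", []):
        source = item.get("source") or {}
        rows.append({
            "position": item.get("position"),
            "title": item.get("title"),
            "link": item.get("link"),
            "date": item.get("date"),                 # None when missing
            "authors": source.get("authors", []),     # empty list when missing
        })
    return rows

for row in summarize_news(sample_news):
    print(row)
```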

In [None]:
results = client.search(
    q="vessel caught underreporting catch",
    engine="google_news",   # SerpApi makes it easy to choose the search engine; here we use Google News.
    hl="en",                # Language the results should be in.
    gl="us",                # Country from which the results should be generated.
)

In [None]:
type(results['news_results'][0]['link'])

str

We now set up our web scraping framework using Selenium. For details, see [`web-scraper.ipynb`](https://github.com/j4ck-k/m2pi-ncis-prompts/blob/main/web-scraper.ipynb).

In [None]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.21.0-py3-none-any.whl (9.5 MB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.25.1-py3-none-any.whl (467 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [None]:
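from bs4 import BeautifulSoup
from selenium import webdriver

# Run Chrome headless, i.e. in the background without a visible window.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Assumption: in containerized or root environments (e.g. Colab), Chrome often
# fails to start unless these two flags are also set.
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")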

In [None]:
def get_url_text(url : str, save : bool = False, file : str = 'scraped-text') -> str:
    '''
    Uses the Selenium WebDriver to scrape the paragraph text from the webpage at the provided URL.

    url : URL of the webpage to be scraped.
    save : If True, the scraped text is also written to `{file}.txt`.
    file : Base name (without extension) of the text file written when `save` is True.

    Returns the text of all <p> elements on the page as a single string.
    '''

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        page_soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()       # Always close the browser, even if the page fails to load.

    # Join the text of every paragraph, each preceded by a single space.
    text = ''.join(' ' + p.get_text() for p in page_soup.find_all("p"))

    if save:
        with open(f"{file}.txt", "w", encoding="utf-8") as text_file:
            text_file.write(text)

    return text
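
The paragraph-joining step can be tried out on a static HTML string without launching a browser. The sketch below deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra installs; it is an illustration of the idea, not the scraper used above, and the names `ParagraphText` and `paragraphs_to_text` are hypothetical.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.paragraphs.append("".join(self._buf))
            self.in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> is still reported here.
        if self.in_p:
            self._buf.append(data)

def paragraphs_to_text(html):
    """Join all paragraph texts, each preceded by a single space."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(" " + p for p in parser.paragraphs)

sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <a href='#'>link</a> text.</p>"
    "<div>skipped</div><p>Second.</p></body></html>"
)
print(repr(paragraphs_to_text(sample_html)))
```

Only text inside `<p>` elements is kept, so headings, navigation, and other page furniture are left out of the result.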

In [None]:
results['news_results']

We can now combine the SerpApi client with the web scraper to define a complete pipeline going from a prompt to a set of scraped incident reports.

In [None]:
from numpy import inf       # This will allow us to retrieve all search results.
import pickle               # This will allow us to save the list of scraped texts.

In [None]:
def prompt_to_reports(prompt : str,
                      num_results : int = inf,
                      engine : str = 'google_news',
                      hl : str = 'en',
                      gl : str = 'us',
                      save : bool = False,
                      file : str = 'scraped-results') -> list:
    '''
    Takes in a prompt and returns a list of strings containing the text of the first `num_results` search results.

    prompt : The prompt to be searched.
    num_results : Maximum number of search results to scrape. Defaults to all results returned.
    engine : The search engine to use, either 'google' or 'google_news'. Defaults to Google News.
    hl : Language to use for search. Defaults to English. For supported languages, see https://serpapi.com/google-languages
    gl : Country to use for search. Defaults to US. For countries supported, see https://serpapi.com/google-countries
    save : Boolean determining whether scraped text should be pickled for later use.
    file : Name of the pickle file written when `save` is True.

    Returns scraped text from search results as a list of strings.
    '''

    result_type = {'google' : 'organic_results',
                   'google_news' : 'news_results'}

    results_json = client.search(
        q = prompt,
        engine = engine,
        hl = hl,
        gl = gl
    )

    results = results_json[result_type[engine]]
    to_scrape = []

    for i in range(min(num_results, len(results))):
        to_scrape.append(results[i]['link'])

    texts = []

    for url in to_scrape:
        texts.append(get_url_text(url))

    if save:
        with open(f"{file}", "wb") as pickle_file:
            pickle.dump(texts, pickle_file)

    return texts

In [None]:
prompt_to_reports('vessel caught misreporting catch', num_results=10, save=True)

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /root/.cache/selenium/chrome/linux64/126.0.6478.62/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)


The traceback above comes from a run in which Chrome failed to start. In a successful run, text was scraped from 9 of the top 10 Google News results for the prompt 'vessel caught misreporting catch'. Not bad!
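
Since not every page can always be scraped, one option is to record failures per URL instead of letting a single exception abort the whole run. The sketch below is not part of the pipeline above: `scrape_all` is a hypothetical helper, and `fetch` stands for any function that maps a URL to text (for example `get_url_text`). A fake fetcher is used here so the example runs without a browser.

```python
def scrape_all(urls, fetch):
    """Apply `fetch` to each URL, collecting texts and recording failures."""
    texts, failures = [], []
    for url in urls:
        try:
            texts.append(fetch(url))
        except Exception as exc:        # Keep going if one page fails.
            failures.append((url, repr(exc)))
    return texts, failures

def fake_fetch(url):
    """Stand-in for a real scraper; fails on one made-up URL."""
    if "broken" in url:
        raise RuntimeError("page could not be loaded")
    return f"text of {url}"

urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/c"]
texts, failures = scrape_all(urls, fake_fetch)
print(len(texts), "scraped;", len(failures), "failed")
```

Returning the failures alongside the texts makes it easy to report how many of the requested results were actually scraped.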

The above search results are now saved in the pickle file `scraped-results` (the default value of `file`; no extension is added). They can be reloaded as follows:

In [None]:
with open('scraped-results', 'rb') as pickle_file:
    scraped_results = pickle.load(pickle_file)

scraped_results

[' Today, the Environmental Justice Foundation commends the European Commission for approving new rules that require stricter controls for landings by EU fishing vessels, providing new tools to prevent significant misreporting of unsorted catches when landing in selected ports, including those in third countries. It is welcome that the new rules require advanced and stricter control tools, such as CCTV to monitor landings, and set minimum benchmarks for the rates of inspection on trans-shipments. If properly implemented, this can increase transparency and accuracy in reporting by EU fleets that catch a large number of species, including those that have been overfished, such as yellowfin tuna in the Indian Ocean. Sean Parramore, Senior EU Advocacy Officer at the Environmental Justice Foundation, said: "Stricter control measures on landings by EU vessels that have more leeway to misreport their catches are critical to prevent hidden overfishing. In the long run, everyone loses if we open

In [None]:
!pip install -U sentence-transformers &> /dev/null

In [None]:
import spacy
from sentence_transformers import SentenceTransformer, util

# Small general-purpose sentence-embedding model.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
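def get_score1(prompt, article):
  '''Cosine similarity between the embeddings of the prompt and the article.'''
  embedding_1 = model.encode(prompt, convert_to_tensor=True)
  embedding_2 = model.encode(article, convert_to_tensor=True)

  # cos_sim returns a 1x1 tensor for two single texts; .item() converts it to a
  # Python float and also works when the tensors live on a GPU.
  return util.cos_sim(embedding_1, embedding_2).item()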

In [None]:
prompt = "vessel underreporting misreporting catch fish"

article = """The South Korean-flagged trawler belonged to the fleet operated by the Sajo Oyang
corporation, notorious for its record of high seas transgressions, as documented by The Guardian.
In recent years, the Oyang 77 had gotten in trouble in New Zealand for illegally dumping
dead fish overboard, underreporting catch and failing to pay workers, according to a report from
Oceana, a nonprofit focused on ocean conservancy. In February 2019, the Argentine Coast Guard
discovered the trawler with its nets extended inside the EEZ. They found more than 310,000 pounds
of seafood on board. Leaving nothing to chance, they deployed a helicopter and an airplane
to assist the Coast Guard in escorting the Oyang 77 to shore, releasing it after confiscating
its fishing equipment and extracting a fine of 25 million Argentine pesos, or about $550,000."""

get_score1(prompt, article)
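
For reference, the cosine similarity underlying this score is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on small made-up vectors; the function name `cosine_similarity` is ours, and the zero-vector convention is an assumption made to avoid division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0      # Assumed convention for zero vectors.
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel, different lengths
```

The score depends only on the direction of the embeddings, not on their length, so values range from -1 to 1.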

In [None]:
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp(article)

for ent in doc.ents:
    print(ent.text, "|",ent.label_, "|", spacy.explain(ent.label_))

In [None]:
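N = 10      # Normalizing constant for the number of distinct entity types.

def get_score2(article):
  '''Number of distinct named-entity types found in the article, divided by N.'''
  doc = nlp(article)
  labels = {ent.label_ for ent in doc.ents}     # A set keeps each label only once.
  return len(labels)/N      # Can exceed 1 if more than N entity types appear.

get_score2(article)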
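
To make the counting concrete without loading a spaCy model, the sketch below applies the same idea to a hand-written list of entity labels. The labels are made up for illustration and `label_diversity` is a hypothetical helper.

```python
def label_diversity(labels, n=10):
    """Number of distinct labels divided by the normalizing constant n."""
    return len(set(labels)) / n

# Made-up entity labels, as a named-entity recognizer might emit them.
example_labels = ["ORG", "GPE", "ORG", "DATE", "GPE"]
print(label_diversity(example_labels))   # three distinct labels out of n = 10
```

Repeated labels do not raise the score, so an article mentioning many organizations counts the same as one mentioning a single organization.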

In [None]:
def get_score(prompt, article):
  return 0.6*get_score1(prompt, article) + 0.4*get_score2(article)
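
As a small worked example of the weighting, the sketch below combines two made-up component scores with the same 0.6 and 0.4 weights. The helper `combine_scores` and its `w_similarity` parameter are hypothetical.

```python
def combine_scores(similarity, diversity, w_similarity=0.6):
    """Weighted average of the two component scores; the weights sum to 1."""
    return w_similarity * similarity + (1 - w_similarity) * diversity

# Made-up component scores: 0.6 * 0.5 + 0.4 * 0.3, approximately 0.42.
print(combine_scores(0.5, 0.3))
```

Because the weights sum to 1, the combined score stays between the two component scores.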

In [None]:
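num_results = 10
def score_prompt(prompt):
  '''Average get_score over the articles retrieved for the prompt.'''
  articles = prompt_to_reports(prompt, num_results=num_results)
  if not articles:
    return 0.0      # No articles were retrieved.
  score = 0
  for article in articles:
    score += get_score(prompt, article)
  return score/len(articles)     # Average over the articles actually retrieved.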