In [1]:
import warnings
import pandas as pd

# Generating Prompts Using Structure and Synonyms

In this notebook, we demonstrate a protocol for generating search engine prompts of a given structure. We propose that prompts should be of the form "Watercraft" + "Crime" + "Consequence" + "Specifics" where each term in the sequence is randomly generated from a list of keywords. The "Specifics" term should be optional and can be used to generate prompts that can yield more specific incident reports. Examples of specific information that could be included are types of fish and geographic areas.

## Generating Keyword Lists for Each Term

Our first step is to generate lists of keywords for each term in our prompt. To this end, we use the database WordNet to find large lists of synonyms and related words for each term. WordNet is accessed in Python using the `nltk` library.

In [2]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jackkendrick/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We will use WordNet in the following way: First, we manually decide on the highest level of abstraction for each term that we should consider and then generate all synonyms of this word. In the case that a term has many meanings, each different usage is encoded as a separate entity within WordNet and so we must be careful to use the correct one. Once we have found all synonyms, we then find all 'hyponyms' (words with more specific meanings) of our term to form a keyword list.
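
To make the idea of collecting all hyponyms concrete, here is a minimal sketch on a toy hierarchy. The dictionary `TOY_HYPONYMS` and the helper `closure` are hand-written for illustration and are not WordNet data or the `nltk` API; they only mimic the breadth-first walk over "more specific" terms described above.

```python
# Toy hyponym hierarchy (hand-written for illustration, not taken from WordNet).
TOY_HYPONYMS = {
    "craft": ["vessel", "aircraft"],
    "vessel": ["boat", "ship"],
    "boat": ["fishing boat"],
    "fishing boat": ["trawler"],
}

def closure(term, hyponyms=TOY_HYPONYMS):
    """Return every descendant of `term`, breadth first, without repeats."""
    seen = []
    queue = list(hyponyms.get(term, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        # Queue the more specific terms below the current one.
        queue.extend(hyponyms.get(current, []))
    return seen

print(closure("vessel"))
```

Starting from a more abstract term such as `"craft"` returns a longer list, which is why the choice of starting concept matters.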

First, we will generate a list of words for watercraft using WordNet. The highest level of abstraction we will consider here is "Craft". 

In [3]:
craft = wn.synsets('craft', 'n')[1]               # A synset is a collection of synonyms in WordNet; the function synsets returns one synset for each
hypos = lambda s:s.hyponyms()                       # meaning of the word. In our case, the relevant meaning of 'craft' is given by the second synset.

craft_hyponyms = list(craft.closure(hypos))              # The function hypos retrieves the direct hyponyms of a synset. The function closure applies it repeatedly so that we see all related terms.

for hypo in craft_hyponyms[:]:
    if wn.synset('fishing_boat.n.01').path_similarity(hypo) < 0.3:      # Here we are making sure that our keywords stay relevant. The similarity parameter can be changed.
        craft_hyponyms.remove(hypo)

craft_hyponyms



[Synset('vessel.n.02'),
 Synset('bareboat.n.01'),
 Synset('boat.n.01'),
 Synset('fishing_boat.n.01'),
 Synset('galley.n.01'),
 Synset('galley.n.02'),
 Synset('iceboat.n.02'),
 Synset('patrol_boat.n.01'),
 Synset('sailing_vessel.n.01'),
 Synset('ship.n.01'),
 Synset('shrimper.n.01'),
 Synset('weather_ship.n.01'),
 Synset('yacht.n.01'),
 Synset('trawler.n.02')]

The above code generates a list of relevant terms as a collection of 'synsets'. We can access each word in the set as follows:

In [4]:
craft_keywords = set()                        # We are using the set object to ensure that no word is added to our list twice.

for hypo in craft_hyponyms:
    for l in hypo.lemmas():
        name = l.name().replace('_', ' ')
        craft_keywords.add(name)

craft_keywords=list(craft_keywords)

We now repeat similar processes for consequences and other specific information. The following function can be used to automate this process:

In [81]:
def get_synonyms(concept:str, abst:int, control:str = None, control_param:float = None) -> list:
    '''
    Function for getting synonyms at varying levels of abstraction.

    concept : Concept as defined in WordNet. Should be of the form word.n.01.
    abst : Levels of abstraction. The function climbs this many levels up the hypernym hierarchy from concept and then collects all hyponyms of the resulting synset.
    control : A base concept that generated synonyms should be similar to. If None, function will return all generated synonyms.
    control_param : Parameter for how similar synonyms should be to control. Should be a value between 0 and 1, with smaller values allowing for less similarity.

    Returns list of generated synonyms as strings.
    '''

    warnings.filterwarnings('ignore') 
    
    syn = wn.synset(concept)
    if abst !=0:
        for _ in range(abst):
            if len(syn.hypernyms()) >0:
                syn = syn.hypernyms()[0]

    print(syn)

    hypos = lambda s:s.hyponyms()                       
    hyponyms = list(syn.closure(hypos))

    if control is not None:
        for hypo in hyponyms[:]:
            if wn.synset(control).path_similarity(hypo) < control_param:
                hyponyms.remove(hypo)

    words = set()                        

    for hypo in hyponyms:
        for l in hypo.lemmas():
            name = l.name().replace('_', ' ')
            words.add(name)

    return list(words)
    

One issue with this function is that it requires the user to specify a concept by its WordNet identifier (for example `arrest.n.01`). It is possible that this can be fixed later.

We now use the above function to generate synonyms for consequences and some specifics that may be included in our prompts.

In [85]:
consequences = ['arrest.n.01', 'hearing.n.01', 'trial.n.04', 'fine.n.01']
abstractions = 2
similarity = 0.7

consequence_kw = []

for consequence in consequences:
    consequence_kw.extend(get_synonyms(consequence, abstractions, consequence, similarity))

consequence_kw


Synset('acquiring.n.01')
Synset('due_process.n.01')
Synset('due_process.n.01')
Synset('payment.n.01')


['collar',
 'catch',
 'arrest',
 'apprehension',
 'pinch',
 'taking into custody',
 'hearing',
 'trial',
 'mulct',
 'amercement',
 'fine']

In [87]:
for hypo in wn.synset('international_waters.n.01').closure(hypos):
    for l in hypo.lemmas():
        print(l.name())

In [91]:
specifics = ['seafood.n.01', 'water.n.02']
abstractions = 0
similarity = 0.5

specifics_kw = []

for specific in specifics:
    specifics_kw.extend(get_synonyms(specific, abstractions, specific, similarity))

specifics_kw

Synset('seafood.n.01')
Synset('body_of_water.n.01')


['calamari',
 'shellfish',
 'roe',
 'whitefish',
 'prawn',
 'saltwater fish',
 'milt',
 'winkle',
 'calamary',
 'octopus',
 'periwinkle',
 'squid',
 'shrimp',
 'soft roe',
 'freshwater fish',
 'whelk',
 'hard roe',
 'recess',
 'seven seas',
 'drink',
 'bay',
 'waterfall',
 'channel',
 'lake',
 'sea',
 'international waters',
 'sound',
 'ocean',
 'stream',
 'watercourse',
 'falls',
 'estuary',
 'polynya',
 'shallow',
 'territorial waters',
 'shoal',
 'puddle',
 'high sea',
 'waterway',
 'main',
 'ford',
 'gulf',
 'briny',
 'embayment',
 'mid-water',
 'offing',
 'pool',
 'inlet',
 'flowage',
 'backwater',
 'crossing']

We do not use the same method to generate crimes since these are words and phrases that are specific to the domain. Instead, we use a list of known crimes.

In [92]:
file_path = 'crimes.txt'
 
with open(file_path, 'r') as file:
    file_content = file.read()

crimes = []

# Each entry in the file starts with '- '. We keep the text before the first ':' on the entry's first line.
for entry in file_content.split('- '):
    name = entry.split('\n')[0].split(":")[0]
    if name != '':
        crimes.append(name)
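
The layout of `crimes.txt` is not shown in this notebook. As a sketch, assuming each entry sits on its own line in the form `- Name: description`, the names could be extracted line by line as below. The sample text and the function name `parse_crime_names` are hypothetical; the descriptions are placeholders written only for this example.

```python
# Hypothetical sample in the assumed layout; the real crimes.txt is not shown.
sample_text = """\
- Illegal Transshipment: moving catch between vessels without authorization
- Data Falsification: altering catch records
- Unreported Land-Based Processing: processing catch outside reporting channels
"""

def parse_crime_names(text):
    """Return the name before the first ':' on each line that starts with '- '."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip blank lines and anything that is not a list entry
        name = line[2:].split(":", 1)[0].strip()
        if name:
            names.append(name)
    return names

print(parse_crime_names(sample_text))
```

Working line by line means a hyphen inside a name (as in "Land-Based") or a "- " inside a description cannot start a new entry by accident.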

## Generating Prompts Using Keywords

Now for the fun part: actually generating the prompts. We will generate a collection of prompts randomly using the keyword lists generated above. The following function generates a specified number of prompts. A user-specified proportion of them are specific (i.e. they include either a geographic area or a type of fish), and the remaining prompts are more general.

In [93]:
import numpy as np

In [94]:
def get_prompts(n:int, spec_prop:float) -> list:
    '''
    Generates a list of prompts with a proportion containing extra specific information.

    n : Number of prompts to be generated
    spec_prop : Proportion of prompts that should be specific. Float between 0 and 1

    Returns list of prompts as strings.
    '''

    prompts = []

    for _ in range(int(np.floor(n*(spec_prop)))):
        prompts.append(craft_keywords[np.random.randint(0, len(craft_keywords))] + ' ' + crimes[np.random.randint(0, len(crimes))] + ' ' 
                       + consequence_kw[np.random.randint(0, len(consequence_kw))] + ' ' + specifics_kw[np.random.randint(0, len(specifics_kw))])

    for _ in range(int(n-np.floor(n*(spec_prop)))):
        prompts.append(craft_keywords[np.random.randint(0, len(craft_keywords))] + ' ' + crimes[np.random.randint(0, len(crimes))] + ' ' 
                       + consequence_kw[np.random.randint(0, len(consequence_kw))])

    return prompts
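
As a variation on the function above, the sketch below takes the keyword lists as arguments instead of reading them from global variables, and accepts an optional seed so that a batch of prompts can be reproduced. The name `build_prompts` and its parameters are assumptions made for this example, and the short keyword lists are stand-ins for the full lists generated earlier.

```python
import numpy as np

def build_prompts(n, spec_prop, craft, crimes, consequences, specifics, seed=None):
    """Return n prompts; the first floor(n * spec_prop) end with a 'Specifics' term."""
    rng = np.random.default_rng(seed)

    def pick(words):
        # Draw one keyword uniformly at random from a list.
        return words[rng.integers(len(words))]

    n_specific = int(np.floor(n * spec_prop))
    prompts = []
    for i in range(n):
        terms = [pick(craft), pick(crimes), pick(consequences)]
        if i < n_specific:
            terms.append(pick(specifics))
        prompts.append(" ".join(terms))
    return prompts

# Stand-in keyword lists, much shorter than the ones generated above.
demo = build_prompts(
    5, 0.4,
    craft=["trawler", "fishing boat"],
    crimes=["Data Falsification", "Weight Fraud"],
    consequences=["fine", "arrest"],
    specifics=["squid", "estuary"],
    seed=0,
)
print(demo)
```

Calling the function twice with the same seed returns the same prompts, which helps when comparing search results across runs.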

In [95]:
prompts = get_prompts(1000, 0.1)
prompts

['fishing vessel Transshipment Underreporting amercement shallow',
 'ship Data Falsification arrest watercourse',
 'watercraft Unreported Incidental Catch amercement squid',
 'fishing boat Black Market Sales hearing embayment',
 'patrol boat Unreported Land-Based Processing mulct shellfish',
 'boat Illegal Transshipment fine shoal',
 'ship Data Falsification catch puddle',
 'racing yacht Quarantine Misreporting apprehension drink',
 'shrimper Landing Mismatches amercement estuary',
 'vessel Logbook Discrepancies trial shallow',
 'fishing smack Quarantine Misreporting hearing gulf',
 'sailing ship Black Market Fishing taking into custody prawn',
 'iceboat Harvest Misrepresentation catch whelk',
 'dragger Catch Distribution Misreporting fine roe',
 'vessel Dockside Sales hearing estuary',
 'watercraft Weight Fraud fine estuary',
 'dragger Unreported Fishing mulct shellfish',
 'scooter Multiple Logbooks pinch ford',
 'sailing ship False Declarations trial hard roe',
 'sailing ship Weight 

## Searching the Web

We can now pass our prompts to a search engine and get links for the top results. To run this code, you must have a Google API key and a Custom Search Engine ID. Since these IDs should not be shared, I have my personal keys stored in a local file and import them below using some custom functions.

The code below was adapted from [this post on StackOverflow](https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search) and you can find instructions for generating your own API key and Custom Search Engine ID there.

In [99]:
from googleapiclient.discovery import build
from apiKeys import google_api_key, search_engine_id          # This is a local file that contains my personal API key and search engine ID.


my_api_key = google_api_key()
my_cse_id = search_engine_id()

The functions below query a custom search engine using the API key and Custom Search Engine ID and then return the top n results (10 by default) as links.

In [100]:
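def google_search(search_term, api_key, cse_id, **kwargs):
    # Build a client for the Custom Search API and run a single query.
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()

    # If the response has no 'items' key, return an empty list instead of None.
    return res.get('items', [])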

def get_links(prompt:str, api_key:str, cse_id:str, n=10):
    '''
    Takes in a search prompt and returns the top n results as links.

    prompt : Search query.
    api_key : Personal API key for accessing Google.
    cse_id : Custom Search Engine ID
    n : Number of results to return.

    Returns results as a list of urls.
    '''

    links =[]

    results = google_search(prompt, api_key, cse_id, num=n)
    if results is not None:
        for result in results:
            links.append(result['link'])

    return links

We now iterate the process of searching and retrieving links over a long list of prompts. We store the links as a set so that no links are repeated.

In [102]:
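links = set()

# `sample_prompts` is not defined in the cells shown here; it is assumed to be a
# DataFrame with a 'Prompt' column of prompts to search for.
for prompt in list(sample_prompts['Prompt'].values):
    results = get_links(prompt, my_api_key, my_cse_id, 3)
    links.update(results)            # get_links always returns a list, so no None check is needed.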
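
A set removes duplicates but forgets both the order of the results and which prompt produced each link. As a sketch of one alternative, a dictionary can keep links in first-seen order along with the prompt that found them. The names `collect_links` and `fake_search` are hypothetical, and `fake_search` with its `example.org` URLs is only a stand-in so the example runs without API keys.

```python
def collect_links(prompts, search, per_prompt=3):
    """Run `search` on each prompt and return unique links in first-seen order."""
    unique = {}
    for prompt in prompts:
        for link in search(prompt)[:per_prompt]:
            # Remember which prompt found the link first; later duplicates are ignored.
            unique.setdefault(link, prompt)
    return unique

# Stand-in for a real search function (hypothetical URLs).
def fake_search(prompt):
    canned = {
        "ship Data Falsification arrest": ["https://example.org/a", "https://example.org/b"],
        "boat Illegal Transshipment fine": ["https://example.org/b", "https://example.org/c"],
    }
    return canned.get(prompt, [])

found = collect_links(
    ["ship Data Falsification arrest", "boat Illegal Transshipment fine"],
    fake_search,
)
print(list(found))
```

With real credentials, `search` could be something like `lambda p: get_links(p, my_api_key, my_cse_id, 3)`.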

In [103]:
links

{'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LmhhemFyZHMub3JnL25ld3MvMjAxMC9pbmRleC5odG3SAQA?oc=5',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9LUw1eGk4RGJRNWPSAQA?oc=5/',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9M043TDhYaGt6cW_SAQA?oc=5',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9MmxpdHpzRkN3a0HSAQA?oc=5',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9RTlFa085VXFoOUXSAQA?oc=5',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9U3otNmpScmJ0dUnSAQA?oc=5',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9Uk1oMVpLY0h1WUnSAQA?oc=5',
 'https://news.google.com/__i/rss/rd/articles/CBMiK2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9Y0dhd1NPdXViQTDSAQA?oc=5',
 'https://news.google.com/__i/r

By inspection, the results here aren't what we would expect to see: the links shown are `news.google.com` redirect URLs rather than direct links to incident reports. ㅠㅡㅠ

As an alternative to the Google API and custom search engine, we use SerpApi. Here we are only running a query for one specific prompt since SerpApi limits the number of requests a user can make.

In [104]:
import serpapi
from apiKeys import serp_api_key

api_key = serp_api_key()

In [105]:
client = serpapi.Client(api_key=api_key)
result = client.search(
    q='vessel caught underreporting catch',    # One pro of using SerpApi is that it is very easy to specify which search engine to use.
    engine="google_news",                      # Here we are specifically using Google News.
    hl="en",                                   # We can also specify what language the results should be in
    gl="us",                                   # and the location from which results should be generated.
)

In [106]:
for r in result['news_results']:
    print(r['link'])

https://www.greenpeace.org/aotearoa/press-release/cameras-reveal-mass-underreporting-of-dolphin-albatross-and-fish-bycatch-by-commercial-fishing-industry/
https://en.yna.co.kr/view/AEN20240402008500315
https://www.thepost.co.nz/politics/350242260/caught-out-cameras-boats-reveal-massive-under-reporting-wildlife-deaths
https://theconversation.com/when-fishing-boats-go-dark-at-sea-theyre-often-committing-crimes-we-mapped-where-it-happens-196694
https://www.washingtonpost.com/climate-environment/interactive/2023/map-illegal-fishing/
https://www.nature.com/articles/ncomms10244
https://www.theguardian.com/environment/2023/mar/21/uk-fishing-industry-underreporting-whale-dolphin-porpoise-bycatch
https://insider.si.edu/2014/07/scientists-say-panama-fish-catch-vastly-reported/
https://www.nationalfisherman.com/northeast/maine-fishermen-take-plea-deal-for-herring-fraud
https://www.pewtrusts.org/en/research-and-analysis/fact-sheets/2013/08/27/faq-illegal-unreported-and-unregulated-fishing
https://

These results are much more relevant!