pubcrawler

Description

pubcrawler downloads all files of a specified type from an organisation's website and then extracts metadata from each document using LLMs. The library's main use case (as demonstrated in the core scripts) is downloading PDFs from think tanks and policy organisations and mapping authorship, publishing output and institutional affiliations.

How to use

Prerequisites

You will need an OpenAI API key to run the scripts. The key must be stored in the .env file in this directory.
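
For example, the .env file would contain a single line with your key (this assumes the standard OPENAI_API_KEY variable name expected by the OpenAI client):

OPENAI_API_KEY=<your-api-key>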

Run

To explore the research output of an institution of your choosing, you will need to configure the following parameters. Here is an example of the parameters required to scrape Autonomy's site for PDFs and then process them with GPT-4:

url = 'https://autonomy.work/' # organisation's main webpage
directory_name = 'autonomy' # choose a name for the folder to save output data to
file_type = 'pdf' # filetype to download
model = 'gpt-4' # openai model for processing text
org_name = 'autonomy' # name of your organisation (used to filter out irrelevant results)

To run the software you can pass the parameters like so:

python -m pubcrawler.cli.core -cs url='https://autonomy.work/' directory_name='autonomy' file_type='pdf' model='gpt-4' org_name='autonomy'

Alternatively, you can set these values manually in const.ipynb and then run:

nbdev_export
pip install -e .
python -m pubcrawler.cli.core

Execute core scripts

proj.core.run_all()
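
Here proj presumably refers to the imported pubcrawler package; a minimal sketch of the same call from a Python session or notebook, assuming the installed package exposes run_all in its core module:

from pubcrawler import core  # assumes the core module of the installed pubcrawler package
core.run_all()               # run the full scrape-and-extract pipeline with the configured parameters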

List of commands

!python -m pubcrawler.cli

Execute commands

!python -m pubcrawler.cli.core

You can view the manual for each command using the -h flag.
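
For example, to see the options for the core command:

python -m pubcrawler.cli.core -h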