pubcrawler

Description

pubcrawler downloads all files of a specified type from an organisation's website and then extracts metadata from each document using LLMs. The library's main use case (as demonstrated in the core scripts) is downloading PDFs from think tanks and policy organisations and mapping authorship, publishing output and institutional affiliations.

How to use

Prerequisites

You will need an OpenAI API key to run the scripts. The key must be stored in the .env file in this directory.
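
For example, the .env file would contain a single line with your key (this assumes the standard OPENAI_API_KEY variable name expected by the OpenAI client):

OPENAI_API_KEY=<your-api-key>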

Run

To explore the research output of an institution of your choosing, you will need to configure the following parameters. Here is an example of the parameters required to scrape Autonomy's site for PDFs and then process them with GPT-4:

url = 'https://autonomy.work/' # organisation's main webpage
directory_name = 'autonomy' # choose a name for the folder to save output data to
file_type = 'pdf' # filetype to download
model = 'gpt-4' # openai model for processing text
org_name = 'autonomy' # name of your organisation (used to filter out irrelevant results)

To run the software you can pass the parameters like so:

python -m pubcrawler.cli.core -cs url='https://autonomy.work/' directory_name='autonomy' file_type='pdf' model='gpt-4' org_name='autonomy'

Alternatively, you can set these values manually in const.ipynb and then run:

nbdev_export
pip install -e .
python -m pubcrawler.cli.core

Execute core scripts

proj.core.run_all()
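
Here proj presumably refers to the imported pubcrawler package; a minimal sketch of the same call from a Python session or notebook, assuming the installed package exposes run_all in its core module:

from pubcrawler import core  # assumes the core module of the installed pubcrawler package
core.run_all()               # run the full scrape-and-extract pipeline with the configured parameters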

List of commands

!python -m pubcrawler.cli

Execute commands

!python -m pubcrawler.cli.core

You can view the manual for each command using the -h flag.
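
For example, to see the options for the core command:

python -m pubcrawler.cli.core -h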