Automatically collect license and other metadata from open-source data science repositories.
The repository contains a Prefect workflow that downloads a list of open-source repositories from paperswithcode (see links between papers and code) and runs gimie on each git repository. The metadata extracted from all repositories is then combined into a single RDF graph, and a table is derived from this graph to expose specific attributes.

The input dataset contains links to ~200k GitHub repositories. It is provided by paperswithcode under the CC-BY-SA license (download link).
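The filtering of the input dataset can be sketched as follows. This is an illustrative example, not the project's actual code: the `repo_url` field name and the sample records are assumptions about the paperswithcode links file.

```python
import json
from urllib.parse import urlparse

# Illustrative sample records; the real dataset has ~200k entries.
raw = json.dumps([
    {"paper_title": "Paper A", "repo_url": "https://github.com/example/repo-a"},
    {"paper_title": "Paper B", "repo_url": "https://bitbucket.org/example/repo-b"},
    {"paper_title": "Paper C", "repo_url": "https://gitlab.com/example/repo-c"},
])

def is_supported(url: str) -> bool:
    """Keep only repositories hosted on GitHub or GitLab."""
    host = urlparse(url).netloc.lower()
    return host in {"github.com", "gitlab.com"}

papers = json.loads(raw)
filtered = [p for p in papers if is_supported(p["repo_url"])]
print([p["paper_title"] for p in filtered])  # → ['Paper A', 'Paper C']
```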
The repository is composed of 3 workflows:

- `retrieve.py`: Download the paperswithcode dataset and filter papers with GitHub/GitLab repositories.
- `extract.py`: Run gimie on each repository and extract RDF metadata using the GitLab / GitHub API.
- `enhance.py`: Combine the original paperswithcode dataset with repository metadata for visualization.
⚠️ You will need to provide your own GitLab / GitHub API tokens as environment variables `GITHUB_TOKEN` and `GITLAB_TOKEN`.
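To fail fast when a token is missing, the scripts can read the environment variables up front. The helper below is a hypothetical sketch, not part of the project:

```python
import os

def get_token(name: str) -> str:
    """Read an API token from the environment, failing fast if unset."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return token

# Example: set a dummy value, then read it back.
os.environ["GITHUB_TOKEN"] = "ghp_dummy"
print(get_token("GITHUB_TOKEN"))  # → ghp_dummy
```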
```mermaid
graph TD;
    paperswithcode -->|retrieve.py| filtered_papers.json;
    filtered_papers.json -->|extract.py| repo_metadata.ttl;
    repo_metadata.ttl -->|enhance.py| combined.csv;
    filtered_papers.json -->|enhance.py| combined.csv;
```
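The final join of paper records with repository metadata into a CSV can be sketched with the standard library. Field names and values below are illustrative assumptions, not the project's actual schema:

```python
import csv
import io

# Hypothetical filtered paper records and per-repository metadata.
papers = [
    {"repo_url": "https://github.com/example/repo-a", "paper_title": "Paper A"},
]
metadata = {
    "https://github.com/example/repo-a": {"license": "MIT"},
}

# Join on repo_url and write the combined rows as CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["repo_url", "paper_title", "license"])
writer.writeheader()
for paper in papers:
    meta = metadata.get(paper["repo_url"], {})
    writer.writerow({**paper, "license": meta.get("license", "")})

print(buf.getvalue())
```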
For convenience, the `main.py` script composes the above steps into a higher-level workflow, which can be run with a single command:

```shell
make pipeline
```

Workflow configuration is defined in `config.py`.
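The composition can be pictured roughly as follows. The function bodies and config keys are illustrative stand-ins; the actual scripts are Prefect workflows configured via `config.py`:

```python
# Illustrative stand-ins for the three workflow steps.
def retrieve(config):
    return ["https://github.com/example/repo-a"]  # filtered papers

def extract(repos, config):
    return {r: {"license": "MIT"} for r in repos}  # repo metadata

def enhance(papers, metadata, config):
    return [(p, metadata[p]["license"]) for p in papers]  # combined rows

def pipeline(config):
    """Compose retrieve -> extract -> enhance, as main.py does."""
    papers = retrieve(config)
    metadata = extract(papers, config)
    return enhance(papers, metadata, config)

print(pipeline({"output_dir": "data"}))
```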
- Install Poetry
- Set up the environment:

```shell
make setup
make activate
```
To install new PyPI packages, run:

```shell
poetry add <package-name>
```
To run the Python scripts, type the following:

```shell
make pipeline
```
Issues and pull requests are welcome. We format code with black, using `--line-length=79` to follow the PEP 8 line-length recommendation. This project uses `pyproject.toml` to define package information, requirements, and tooling configuration.
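For instance, the black settings could live in `pyproject.toml` like this (a minimal sketch; the project's actual configuration may differ):

```toml
[tool.black]
line-length = 79
```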