sdsc-ordes/license-collector

License collector

Automatically collect license and other metadata from open-source data science repositories.

Workflow

The repository contains a Prefect workflow that downloads a list of open-source repositories from paperswithcode (see links between papers and code) and runs gimie on each git repository. The metadata extracted from all repositories is then combined into a single RDF graph. A table of selected attributes is also extracted from this graph.

The input dataset contains links to ~200k GitHub repositories. It is provided by paperswithcode under the CC-BY-SA license (download link).

The repository is composed of 3 workflows:

  • retrieve.py: Download the paperswithcode dataset and keep only papers with GitHub/GitLab repositories.
  • extract.py: Run gimie on each repository and extract RDF metadata using the GitLab / GitHub API.
  • enhance.py: Combine the original paperswithcode dataset with the repository metadata for visualization.
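The filtering done in the retrieve step can be sketched as follows. This is a minimal illustration, not the code in retrieve.py: the function names are hypothetical, and the `repo_url` field name and the exact set of hosts are assumptions about the input records.

```python
from urllib.parse import urlparse

# Hosts accepted by this sketch (assumption; retrieve.py may differ).
SUPPORTED_HOSTS = {"github.com", "gitlab.com"}


def has_supported_repo(paper: dict) -> bool:
    """Return True if the paper's repository URL is on GitHub or GitLab."""
    # "repo_url" is an assumed field name for the repository link.
    host = urlparse(paper.get("repo_url") or "").netloc.lower()
    return host.removeprefix("www.") in SUPPORTED_HOSTS


def filter_papers(papers: list[dict]) -> list[dict]:
    """Keep only the papers that link to a supported git host."""
    return [paper for paper in papers if has_supported_repo(paper)]
```

Matching on the parsed host rather than on a substring avoids keeping URLs that merely mention "github" somewhere in their path.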

⚠️ You will need to provide your own GitLab / GitHub API tokens as environment variables GITHUB_TOKEN and GITLAB_TOKEN.
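Before launching a long run, it can help to confirm that both tokens are set. The check below is a small sketch using only the standard library; the helper name `missing_tokens` is hypothetical and not part of this repository.

```python
import os

# Environment variables named in the warning above.
REQUIRED_TOKENS = ("GITHUB_TOKEN", "GITLAB_TOKEN")


def missing_tokens(env=None) -> list[str]:
    """Return the required token variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TOKENS if not env.get(name)]
```

Calling `missing_tokens()` with no argument inspects the current process environment; an empty list means both variables are available.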

graph TD;
    paperswithcode -->|retrieve.py| filtered_papers.json;
    filtered_papers.json -->|extract.py| repo_metadata.ttl;
    repo_metadata.ttl -->|enhance.py| combined.csv;
    filtered_papers.json -->|enhance.py| combined.csv;
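The last two edges of the graph, where the filtered papers and the repository metadata both feed into combined.csv, amount to a join on the repository URL. The sketch below illustrates that idea only. It assumes the RDF metadata has already been reduced to a plain mapping from repository URL to license, and the field names `paper_title`, `repo_url` and `license` are assumptions; the actual enhance.py works on the RDF graph and may produce different columns.

```python
import csv
import io

# Column names are assumptions made for this sketch.
FIELDS = ["paper_title", "repo_url", "license"]


def combine(papers: list[dict], licenses: dict[str, str]) -> str:
    """Return CSV text with one row per paper and its license, if known."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for paper in papers:
        url = paper["repo_url"]
        writer.writerow(
            {
                "paper_title": paper["paper_title"],
                "repo_url": url,
                # Repositories without extracted metadata get an empty cell.
                "license": licenses.get(url, ""),
            }
        )
    return buffer.getvalue()
```

Papers are kept even when no license was found for their repository, so the output has the same number of rows as the filtered input.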

For convenience, the main.py script composes the above steps into a higher-level workflow, which can be run with a single command:

make pipeline

Workflow configuration is defined in config.py.

Quick Start

Set up the environment

  1. Install Poetry
  2. Set up the environment:
make setup
make activate

Install new packages

To install new PyPI packages, run:

poetry add <package-name>

Run Python scripts

To run the Python scripts, type the following:

make pipeline

Contributing

Issues and pull requests are welcome. Code is formatted with black, using --line-length=79 to follow PEP 8 recommendations. This project uses pyproject.toml to define package information, requirements and tooling configuration.
