Automatically collect license and other metadata from open-source data science repositories.
The repository contains a Prefect workflow that downloads a list of open-source repositories from paperswithcode (see links between papers and code) and runs gimie on each git repository. The metadata extracted from all repositories is then combined into a single RDF graph, and a table is derived from this graph to expose specific attributes.

The input dataset contains links to ~200k GitHub repositories. It is provided by paperswithcode under the CC-BY-SA license (download link).
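The filtering of the input dataset can be sketched as follows. This is an illustrative example, not the project's actual code: the `repo_url` field name and the sample records are assumptions about the paperswithcode links file.

```python
import json
from urllib.parse import urlparse

# Illustrative sample records; the real dataset has ~200k entries.
raw = json.dumps([
    {"paper_title": "Paper A", "repo_url": "https://github.com/example/repo-a"},
    {"paper_title": "Paper B", "repo_url": "https://bitbucket.org/example/repo-b"},
    {"paper_title": "Paper C", "repo_url": "https://gitlab.com/example/repo-c"},
])

def is_supported(url: str) -> bool:
    """Keep only repositories hosted on GitHub or GitLab."""
    host = urlparse(url).netloc.lower()
    return host in {"github.com", "gitlab.com"}

papers = json.loads(raw)
filtered = [p for p in papers if is_supported(p["repo_url"])]
print([p["paper_title"] for p in filtered])  # → ['Paper A', 'Paper C']
```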
The repository is composed of 3 workflows:

- `retrieve.py`: Download the paperswithcode dataset and filter papers with GitHub/GitLab repositories.
- `extract.py`: Run gimie on each repository and extract RDF metadata using the GitLab / GitHub API.
- `enhance.py`: Combine the original paperswithcode dataset with repository metadata for visualization.
⚠️ You will need to provide your own GitLab / GitHub API tokens as environment variables `GITHUB_TOKEN` and `GITLAB_TOKEN`.
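To fail fast when a token is missing, the scripts can read the environment variables up front. The helper below is a hypothetical sketch, not part of the project:

```python
import os

def get_token(name: str) -> str:
    """Read an API token from the environment, failing fast if unset."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return token

# Example: set a dummy value, then read it back.
os.environ["GITHUB_TOKEN"] = "ghp_dummy"
print(get_token("GITHUB_TOKEN"))  # → ghp_dummy
```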
```mermaid
graph TD;
    paperswithcode -->|retrieve.py| filtered_papers.json;
    filtered_papers.json -->|extract.py| repo_metadata.ttl;
    repo_metadata.ttl -->|enhance.py| combined.csv;
    filtered_papers.json -->|enhance.py| combined.csv;
```
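The final join of paper records with repository metadata into a CSV can be sketched with the standard library. Field names and values below are illustrative assumptions, not the project's actual schema:

```python
import csv
import io

# Hypothetical filtered paper records and per-repository metadata.
papers = [
    {"repo_url": "https://github.com/example/repo-a", "paper_title": "Paper A"},
]
metadata = {
    "https://github.com/example/repo-a": {"license": "MIT"},
}

# Join on repo_url and write the combined rows as CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["repo_url", "paper_title", "license"])
writer.writeheader()
for paper in papers:
    meta = metadata.get(paper["repo_url"], {})
    writer.writerow({**paper, "license": meta.get("license", "")})

print(buf.getvalue())
```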
For convenience, the `main.py` script composes the above steps into a higher-level workflow, which can be run with a single command:

```shell
make pipeline
```

Workflow configuration is defined in `config.py`.
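The composition can be pictured roughly as follows. The function bodies and config keys are illustrative stand-ins; the actual scripts are Prefect workflows configured via `config.py`:

```python
# Illustrative stand-ins for the three workflow steps.
def retrieve(config):
    return ["https://github.com/example/repo-a"]  # filtered papers

def extract(repos, config):
    return {r: {"license": "MIT"} for r in repos}  # repo metadata

def enhance(papers, metadata, config):
    return [(p, metadata[p]["license"]) for p in papers]  # combined rows

def pipeline(config):
    """Compose retrieve -> extract -> enhance, as main.py does."""
    papers = retrieve(config)
    metadata = extract(papers, config)
    return enhance(papers, metadata, config)

print(pipeline({"output_dir": "data"}))
```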
- Install Poetry
- Set up the environment:

```shell
make setup
make activate
```
To install new PyPI packages, run:

```shell
poetry add <package-name>
```
To run the Python scripts, type the following:

```shell
make pipeline
```
Issues and pull requests are welcome. We format code with black, using `--line-length=79` to follow the PEP 8 line-length recommendation. This project uses `pyproject.toml` to define package information, requirements, and tooling configuration.
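For instance, the black settings could live in `pyproject.toml` like this (a minimal sketch; the project's actual configuration may differ):

```toml
[tool.black]
line-length = 79
```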