This repository contains the code and analyses for "Operationalizing Research Software for Supply Chain Security" by Kalu et al.
It is built on top of the Research Software Encyclopedia (rseng) framework and uses the rseng/software database. We credit and acknowledge the rseng repositories and database that provide the underlying data and framework (https://github.com/rseng/software/tree/1.0.0).
Our work uses this foundational data to propose a Research Software Supply Chain (RSSC) based taxonomy for research software. We then test this taxonomy with the OpenSSF Scorecard (collection and analysis scripts), collect a benchmark database of Apache Software Foundation (ASF) project repositories (for which we also obtained OpenSSF Scorecard reports), and finally present summary tables to highlight our results.
To cite this repo, use:
```bibtex
@misc{cross_artifact_github,
  author       = {Kalu, Kelechi G. and Rattan, Soham and Schorlemmer, Taylor R. and Thiruvathukal, George K. and Carver, Jeffrey C. and Davis, James C.},
  title        = {{Operationalizing Research Software for Supply Chain Security (Software Artifact)}},
  howpublished = {\url{https://github.com/PurdueDualityLab/CROSS}},
  note         = {GitHub repository},
  year         = {2026}
}
```
All project-specific work for this paper lives in `scripts/`. Below is what each script does and how to run it.
- `scripts/annotate_db_gpt.py`

  What batch annotation does:
  - Reads `database/github/**/metadata.json` and applies the RSSC prompt (defined inside `scripts/annotate_db_gpt.py`).
  - Writes results to `New_SSC_Taxonomy.gpt-5.1` inside each metadata file.

  For large runs of the GPT annotator, `scripts/annotate_db_gpt.py` supports batching and resumable progress logs:
  - `--batch-size` + `--batch-index` let you process the dataset in chunks.
  - `--auto-batch` runs all chunks sequentially.
  - `--progress-file` + `--resume` skip completed entries and allow restart-safe runs (see the sketch after the commands below).
  ```bash
  # process a single batch
  python scripts/annotate_db_gpt.py --batch-size 100 --batch-index 0

  # process all batches sequentially
  python scripts/annotate_db_gpt.py --batch-size 100 --auto-batch

  # record progress and resume safely after failures
  python scripts/annotate_db_gpt.py --batch-size 100 --auto-batch \
    --progress-file /tmp/annotate.progress.jsonl --resume
  ```
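  The progress log is what makes reruns restart-safe. A minimal sketch of the pattern, assuming a JSONL log keyed by a per-entry `id` field (the script's actual record format may differ):

  ```python
  import json
  import os

  def load_done(progress_file: str) -> set:
      """Return ids already recorded as completed in the JSONL log."""
      done = set()
      if os.path.exists(progress_file):
          with open(progress_file) as fh:
              done = {json.loads(line)["id"] for line in fh if line.strip()}
      return done

  def annotate_all(entries, annotate, progress_file: str) -> None:
      """Skip finished entries; append one log line per completed entry."""
      done = load_done(progress_file)
      with open(progress_file, "a") as log:
          for entry in entries:
              if entry["id"] in done:
                  continue  # --resume: already annotated in an earlier run
              annotate(entry)  # caller-supplied annotation function
              log.write(json.dumps({"id": entry["id"]}) + "\n")
              log.flush()  # persist immediately so a crash loses nothing
  ```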
- `scripts/scorecard_runner.py`

  Runs OpenSSF Scorecard over the rseng database and writes results into each repo's `metadata.json` under `openssf_scorecard`.

  ```bash
  python scripts/scorecard_runner.py --token-file .scorecard_tokens.env --progress-file .scorecard.progress.jsonl --resume
  ```
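  Conceptually, the merge step looks like this (a sketch assuming Scorecard's JSON output via `--format=json`; the script's real error handling and token rotation are omitted):

  ```python
  import json
  import subprocess
  from pathlib import Path

  def add_scorecard(metadata_path: Path, repo_url: str) -> None:
      """Run Scorecard on one repo and store the report under openssf_scorecard."""
      out = subprocess.run(
          ["scorecard", f"--repo={repo_url}", "--format=json"],
          capture_output=True, text=True, check=True,
      )
      meta = json.loads(metadata_path.read_text())
      meta["openssf_scorecard"] = json.loads(out.stdout)
      metadata_path.write_text(json.dumps(meta, indent=2))
  ```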
- `scripts/run_scorecard_list.py`

  Runs Scorecard over a list of repo URLs and writes JSON outputs under `apache/scorecard/` plus an append-only JSONL at `apache/scorecard.results.jsonl`.

  ```bash
  python scripts/run_scorecard_list.py --repo-list apache/apache_github_org_repos.txt --token-file .scorecard_tokens.env --resume
  ```
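  A rough sketch of the output layout described above (the per-repo file naming is an assumption):

  ```python
  import json
  from pathlib import Path

  def record_result(repo_url: str, report: dict) -> None:
      """Write one JSON file per repo plus an append-only JSONL line."""
      out_dir = Path("apache/scorecard")
      out_dir.mkdir(parents=True, exist_ok=True)
      # Hypothetical naming: owner__repo.json derived from the URL.
      slug = repo_url.split("github.com/")[-1].strip("/").replace("/", "__")
      (out_dir / f"{slug}.json").write_text(json.dumps(report, indent=2))
      with open("apache/scorecard.results.jsonl", "a") as fh:
          fh.write(json.dumps({"repo": repo_url, "report": report}) + "\n")
  ```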
- `scripts/get_apache_projects.py`

  Fetches the ASF project registry and writes a GitHub repo list to `apache/apache_repos.txt` by default.

  ```bash
  python scripts/get_apache_projects.py
  ```
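  For reference, the ASF publishes a machine-readable registry; a sketch of extracting GitHub repos from it (the endpoint and field names are assumptions about what the script consumes):

  ```python
  import json
  from urllib.request import urlopen

  # Assumed public ASF registry endpoint; the script may use another source.
  REGISTRY = "https://projects.apache.org/json/foundation/projects.json"

  def asf_github_repos() -> list:
      """Collect GitHub repository URLs mentioned in the ASF registry."""
      projects = json.load(urlopen(REGISTRY))
      urls = set()
      for project in projects.values():
          repos = project.get("repository") or []  # may be str or list
          if isinstance(repos, str):
              repos = [repos]
          urls.update(r for r in repos if "github.com" in r)
      return sorted(urls)

  if __name__ == "__main__":
      with open("apache/apache_repos.txt", "w") as fh:
          fh.write("\n".join(asf_github_repos()) + "\n")
  ```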
- `scripts/get_apache_github_org_repos.py`

  Lists all repositories in the `apache` GitHub org and writes `apache/apache_github_org_repos.txt`.

  ```bash
  python scripts/get_apache_github_org_repos.py
  ```
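  Listing an org's repositories is a standard paginated GitHub REST call; a sketch using `requests` (not in the pinned dependency list below, so treat it as illustrative):

  ```python
  import os
  import requests

  def list_org_repos(org: str = "apache") -> list:
      """Page through GET /orgs/{org}/repos and return all repo URLs."""
      token = os.environ.get("GITHUB_TOKEN", "")
      headers = {"Authorization": f"token {token}"} if token else {}
      urls, page = [], 1
      while True:
          resp = requests.get(
              f"https://api.github.com/orgs/{org}/repos",
              params={"per_page": 100, "page": page},
              headers=headers, timeout=30,
          )
          resp.raise_for_status()
          batch = resp.json()
          if not batch:
              break  # past the last page
          urls.extend(repo["html_url"] for repo in batch)
          page += 1
      return urls
  ```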
- `scripts/db_results.py`

  Builds:
  - `scorecard_by_actor.csv` (aggregated stats by `actor_unit`)
  - `scorecard_missing.csv` (repos missing scorecard and/or taxonomy)
  - `scorecard_repo_results.csv` (per-repo rows with taxonomy + scorecard scores + `created_at`)

  ```bash
  python scripts/db_results.py --repo-root . --repo-output scorecard_repo_results.csv
  ```
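  The by-actor aggregation amounts to a groupby over the per-repo rows; a sketch (`actor_unit` comes from the output description above, while the `score` column name is an assumption):

  ```python
  import pandas as pd

  df = pd.read_csv("scorecard_repo_results.csv")  # per-repo rows

  # Aggregate overall Scorecard scores by taxonomy actor_unit.
  by_actor = (
      df.groupby("actor_unit")["score"]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
  )
  by_actor.to_csv("scorecard_by_actor.csv")
  ```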
- `scripts/summarize_repo_results.py`

  Summarizes the per-repo CSV into:
  - `taxonomy_category_percentages.csv`
  - `scorecard_by_distribution_pathway.csv`
  - `taxonomy_sankey.csv` (Sankey links as CSV)

  ```bash
  python scripts/summarize_repo_results.py --input-csv scorecard_repo_results.csv --exclude-na
  ```

  `--exclude-na` drops `-1` scores (not applicable / insufficient evidence) from averages while keeping `0` (evaluated and failed); see the sketch below.
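  The `-1`-vs-`0` distinction matters for averages; a toy illustration (assuming a `score` column of per-check values):

  ```python
  import pandas as pd

  df = pd.DataFrame({"score": [10, 0, -1, 7, -1]})

  # --exclude-na: drop -1 (not applicable / no evidence) before averaging,
  # but keep 0, which means the check ran and failed.
  print(df["score"].mean())                    # 3.0   (includes -1)
  print(df[df["score"] >= 0]["score"].mean())  # 5.67  (excludes -1)
  ```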
- `scripts/make_summary_table.py`

  Produces a combined ASF vs RS summary table: `results/summary_table.csv`, `results/summary_table.tex`.

  ```bash
  python scripts/make_summary_table.py \
    --rs scorecard_repo_results.csv \
    --asf apache/scorecard.results.jsonl \
    --outdir results/
  ```
Key output files:

- `scorecard_repo_results.csv` — per-repo taxonomy + scorecard scores (rseng database)
- `taxonomy_category_percentages.csv` — taxonomy distribution table
- `scorecard_by_distribution_pathway.csv` — average scores by `distribution_pathway`
- `taxonomy_sankey.csv` — Sankey links (source, target, value)
- `results/summary_table.csv` / `results/summary_table.tex` — ASF vs RS comparisons
Create `.scorecard_tokens.env` (ignored by git) with one token per line or `KEY=VALUE`:

```
GITHUB_TOKEN_1="ghp_..."
GITHUB_TOKEN_2="ghp_..."
```
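A sketch of a loader that accepts both formats (the scripts' actual parsing may differ):

```python
def load_tokens(path: str = ".scorecard_tokens.env") -> list:
    """Accept bare tokens or KEY=VALUE lines; skip blanks and comments."""
    tokens = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:  # KEY=VALUE form
                line = line.split("=", 1)[1]
            tokens.append(line.strip().strip('"'))
    return tokens
```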
Set `OPENAI_API_KEY` in `.env`:

```
OPENAI_API_KEY="sk-..."
```
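The scripts can pick this up via `python-dotenv` (installed below), e.g.:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
api_key = os.environ["OPENAI_API_KEY"]
```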
```bash
brew install scorecard
# or
go install github.com/ossf/scorecard/v5/cmd/scorecard@latest
```
```bash
python -m venv .venv
source .venv/bin/activate
pip install pandas numpy tqdm python-dotenv openai
```
- Sochat, V. (2021, October 7). rseng/software: Research Software Encyclopedia Database Release v1.0.0 (Version 1.0.0) [Data set]. https://github.com/rseng/software/tree/1.0.0