Operationalizing Research Software for Supply Chain Security (Artifact)

This repository contains the code and analyses for "Operationalizing Research Software for Supply Chain Security" by Kalu et al.

It is built on top of the Research Software Engineering (rseng) project and uses the rseng/software database. We credit and acknowledge the rseng repositories and database that provide the underlying data and framework (see Reference 1; https://github.com/rseng/software/tree/1.0.0).

Our work uses this foundational data to propose a Research Software Supply Chain (RSSC) based taxonomy for research software. We then test this taxonomy with the OpenSSF Scorecard (collection and analysis scripts), collect a benchmark database of Apache Software Foundation (ASF) project repositories (for which we also obtain OpenSSF Scorecard reports), and finally present summary tables that highlight our results.

To cite this repo, use:

@misc{cross_artifact_github,
  author       = {Kalu, Kelechi G. and Rattan, Soham and Schorlemmer, Taylor R. and Thiruvathukal, George K. and Carver, Jeffrey C. and Davis, James C.},
  title        = {{Operationalizing Research Software for Supply Chain Security (Software Artifact)}},
  howpublished = {\url{https://github.com/PurdueDualityLab/CROSS}},
  note         = {GitHub repository},
  year         = {2026}
}

Our Contributions (scripts and analyses)

All project-specific work for this paper lives in scripts/. Below is what each script does and how to run it.

1) Batch taxonomy annotation (GPT 5.1)

What batch annotation does:

  • Reads database/github/**/metadata.json and applies the RSSC prompt (defined inside scripts/annotate_db_gpt.py).
  • Writes results to New_SSC_Taxonomy.gpt-5.1 inside each metadata file.
  • --batch-size + --batch-index let you process the dataset in chunks.
  • --auto-batch runs all chunks sequentially.
  • --progress-file + --resume skip completed entries and allow restart-safe runs.

For large runs of the GPT annotator, scripts/annotate_db_gpt.py supports batching and resumable progress logs:
# process a single batch
python scripts/annotate_db_gpt.py --batch-size 100 --batch-index 0

# process all batches sequentially
python scripts/annotate_db_gpt.py --batch-size 100 --auto-batch

# record progress and resume safely after failures
python scripts/annotate_db_gpt.py --batch-size 100 --auto-batch \
  --progress-file /tmp/annotate.progress.jsonl --resume
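
For orientation, the snippet below sketches how an annotated metadata file might be inspected after a run. The New_SSC_Taxonomy.gpt-5.1 key is the one named above; whether it is stored flat or nested, and what its inner fields are, is an assumption for illustration and may differ from what annotate_db_gpt.py actually writes.

# inspect_annotation.py — minimal sketch; key layout and inner fields are assumed, not authoritative
import json
from pathlib import Path

def print_annotation(metadata_path: str) -> None:
    """Load one database/github/**/metadata.json and show its GPT taxonomy block."""
    data = json.loads(Path(metadata_path).read_text())
    # The README names the key "New_SSC_Taxonomy.gpt-5.1"; try both a flat and a nested layout.
    annotation = data.get("New_SSC_Taxonomy.gpt-5.1")
    if annotation is None and isinstance(data.get("New_SSC_Taxonomy"), dict):
        annotation = data["New_SSC_Taxonomy"].get("gpt-5.1")
    if annotation is None:
        print(f"{metadata_path}: not annotated yet")
    elif isinstance(annotation, dict):
        for key, value in annotation.items():
            print(f"{key}: {value}")
    else:
        print(annotation)

if __name__ == "__main__":
    # Hypothetical path for illustration; any database/github/**/metadata.json works.
    print_annotation("database/github/example-org/example-repo/metadata.json")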

2) OpenSSF Scorecard collection

  • scripts/scorecard_runner.py
    Runs OpenSSF Scorecard over the rseng database and writes results into each repo’s metadata.json under openssf_scorecard.

    python scripts/scorecard_runner.py --token-file .scorecard_tokens.env --progress-file .scorecard.progress.jsonl --resume
  • scripts/run_scorecard_list.py
    Runs Scorecard over a list of repo URLs and writes JSON outputs under apache/scorecard/ plus an append-only JSONL at apache/scorecard.results.jsonl.

    python scripts/run_scorecard_list.py --repo-list apache/apache_github_org_repos.txt --token-file .scorecard_tokens.env --resume
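
As a rough illustration of consuming the append-only JSONL, the sketch below computes the mean aggregate score across ASF repositories. It assumes each line is one Scorecard JSON report with a top-level score field, which matches Scorecard's standard JSON output; adjust if the runner stores a different shape.

# summarize_asf_scores.py — minimal sketch, assuming one Scorecard JSON report per JSONL line
import json
from pathlib import Path

def mean_aggregate_score(jsonl_path: str = "apache/scorecard.results.jsonl") -> float:
    scores = []
    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue
        report = json.loads(line)
        score = report.get("score")  # Scorecard's top-level aggregate score
        if isinstance(score, (int, float)) and score >= 0:
            scores.append(score)
    return sum(scores) / len(scores) if scores else float("nan")

if __name__ == "__main__":
    print(f"mean ASF aggregate score: {mean_aggregate_score():.2f}")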

3) Apache repository list builders

  • scripts/get_apache_projects.py
    Fetches ASF project registry and writes a GitHub repo list to apache/apache_repos.txt by default.

    python scripts/get_apache_projects.py
  • scripts/get_apache_github_org_repos.py
    Lists all repositories in the apache GitHub org and writes apache/apache_github_org_repos.txt.

    python scripts/get_apache_github_org_repos.py
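
The sketch below shows one way such a list could be rebuilt with the public GitHub REST API (GET /orgs/{org}/repos with pagination). It is an illustration, not the actual implementation of get_apache_github_org_repos.py, and it omits token handling and rate-limit retries.

# list_org_repos.py — minimal sketch using the public GitHub REST API; not the project's script
import json
import urllib.request

def list_org_repos(org: str = "apache") -> list[str]:
    """Return the HTML URL of every repository in a GitHub organization."""
    urls, page = [], 1
    while True:
        api = f"https://api.github.com/orgs/{org}/repos?per_page=100&page={page}"
        with urllib.request.urlopen(api) as resp:
            repos = json.load(resp)
        if not repos:
            break
        urls.extend(repo["html_url"] for repo in repos)
        page += 1
    return urls

if __name__ == "__main__":
    for url in list_org_repos():
        print(url)  # redirect to apache/apache_github_org_repos.txt if desired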

4) Database and analysis outputs

  • scripts/db_results.py
    Builds:

    • scorecard_by_actor.csv (aggregated stats by actor_unit)
    • scorecard_missing.csv (repos missing scorecard and/or taxonomy)
    • scorecard_repo_results.csv (per-repo rows with taxonomy + scorecard scores + created_at)
    python scripts/db_results.py --repo-root . --repo-output scorecard_repo_results.csv
  • scripts/summarize_repo_results.py
    Summarizes the per-repo CSV into:

    • taxonomy_category_percentages.csv
    • scorecard_by_distribution_pathway.csv
    • taxonomy_sankey.csv (Sankey links as CSV)
    python scripts/summarize_repo_results.py --input-csv scorecard_repo_results.csv --exclude-na

    --exclude-na drops -1 scores (not applicable / insufficient evidence) from averages while keeping 0 scores (evaluated and failed); see the sketch after this list.

  • scripts/make_summary_table.py
    Produces a combined ASF vs RS summary table:

    • results/summary_table.csv
    • results/summary_table.tex
    python scripts/make_summary_table.py \
      --rs scorecard_repo_results.csv \
      --asf apache/scorecard.results.jsonl \
      --outdir results/
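
The --exclude-na averaging rule mentioned above can be pictured with the short pandas sketch below; the column name is an illustrative Scorecard check, not a guarantee of the exact CSV schema.

# exclude_na_example.py — minimal sketch of the --exclude-na averaging rule
import pandas as pd

# Illustrative rows: -1 means "not applicable / insufficient evidence", 0 means "evaluated and failed"
df = pd.DataFrame({
    "repo": ["a", "b", "c", "d"],
    "Pinned-Dependencies": [10, 0, -1, 7],
})

naive_mean = df["Pinned-Dependencies"].mean()  # counts -1 as if it were a real score
exclude_na = df.loc[df["Pinned-Dependencies"] >= 0, "Pinned-Dependencies"].mean()  # keeps 0, drops -1

print(f"naive mean: {naive_mean:.2f}, --exclude-na mean: {exclude_na:.2f}")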

Outputs (CSV / Sankey)

  • scorecard_repo_results.csv — per-repo taxonomy + scorecard scores (rseng database)
  • taxonomy_category_percentages.csv — taxonomy distribution table
  • scorecard_by_distribution_pathway.csv — average scores by distribution_pathway
  • taxonomy_sankey.csv — Sankey links (source, target, value)
  • results/summary_table.csv / results/summary_table.tex — ASF vs RS comparisons
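
The Sankey CSV is a flat edge list; a minimal sketch of loading it, assuming the source, target, and value column names described above:

# read_sankey_links.py — minimal sketch; assumes the (source, target, value) columns described above
import pandas as pd

links = pd.read_csv("taxonomy_sankey.csv")
for row in links.itertuples(index=False):
    print(f"{row.source} -> {row.target}: {row.value}")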

Tokens and configuration

GitHub tokens (Scorecard)

Create .scorecard_tokens.env (ignored by git) with one token per line or KEY=VALUE:

GITHUB_TOKEN_1="ghp_..."
GITHUB_TOKEN_2="ghp_..."
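
As a rough sketch of how a token file in this format could be parsed (both bare tokens and KEY=VALUE lines), not necessarily how scorecard_runner.py does it:

# load_tokens.py — minimal sketch; the runner's actual parsing may differ
from pathlib import Path

def load_tokens(path: str = ".scorecard_tokens.env") -> list[str]:
    """Accept either one bare token per line or KEY=VALUE lines, ignoring comments."""
    tokens = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        value = line.split("=", 1)[1] if "=" in line else line
        tokens.append(value.strip().strip('"'))
    return tokens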

OpenAI tokens (GPT annotation)

Set OPENAI_API_KEY in .env:

OPENAI_API_KEY="sk-..."
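
A minimal sketch of picking the key up from .env, assuming the python-dotenv and openai packages from the install step below; the annotation script's actual client setup may differ:

# openai_client.py — minimal sketch, not the annotation script's actual setup
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])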

Installing dependencies

OpenSSF Scorecard CLI

brew install scorecard
# or
go install github.com/ossf/scorecard/v5/cmd/scorecard@latest

Python packages

python -m venv .venv
source .venv/bin/activate
pip install pandas numpy tqdm python-dotenv openai

References

  1. Sochat, V. (2021, October 7). rseng/software: Research Software Encyclopedia Database Release v1.0.0 (Version 1.0.0) [Data set].
