Tools for Discovering, Classifying, and Searching NASA-Relevant Open-Source Repositories on GitHub
This project provides a modular and scalable framework to automate the discovery of high-value, NASA-related open-source software on GitHub. It includes multiple discovery pipelines based on DOI links, keyword search, curated organization lists, and metadata from the Astrophysics Source Code Library (ASCL).
The system extracts GitHub repository links, retrieves README content and metadata, and classifies repositories for NASA relevance using large language models. It also supports semantic code search that allows users to find relevant code blocks based on natural language queries.
The final output is a structured CSV file that can be used in downstream workflows such as NASA's Science Discovery Engine (SDE).
Install uv using one of the following methods:
# Using pip
pip install uv
# Using Homebrew
brew install uv

Clone the repository:

git clone https://github.com/NASA-IMPACT/github-code-discovery.git
cd github-code-discovery
uv venv
source .venv/bin/activate
uv pip install -e .

Create and configure your environment variables:
touch .env
nano .env

Add the following to your .env file:
S2_API_KEY=your_s2_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
GITHUB_ACCESS_TOKEN=your_github_token_here

Test that your API keys are properly configured:
python3 scripts/config.py

You're now ready to use the GitHub Code Discovery tool! Check the documentation below for usage instructions and examples.
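The exact checks performed by `scripts/config.py` are not shown here, but a minimal validation along these lines (assuming the three variable names above) confirms the keys are present before running any pipeline:

```python
import os

# The three keys required by the tool, per the .env template above.
REQUIRED_KEYS = ["S2_API_KEY", "OPENAI_API_KEY", "GITHUB_ACCESS_TOKEN"]

def check_env(env: dict) -> list:
    """Return the names of any required keys that are missing or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

missing = check_env(dict(os.environ))
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All API keys configured.")
```

This is a sketch of the idea, not the actual contents of `config.py`.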
A pipeline to extract GitHub repository links from scientific papers using DOI links.
- Fetches metadata and full-text PDF URLs from Semantic Scholar for a list of DOIs
- Downloads full-text PDFs (if available)
- Extracts GitHub links from PDFs using regex-based heuristics
- Fetches README content from GitHub repos via GraphQL
- Runs a downstream Relevancy classification pipeline on the retrieved repositories
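The regex heuristics are not specified in the source; a minimal sketch of how GitHub links might be pulled from extracted PDF text (the pattern and trailing-punctuation cleanup are illustrative assumptions):

```python
import re

# Illustrative pattern: owner/repo slugs after github.com. PDF text
# extraction often attaches sentence-final punctuation to URLs.
GITHUB_RE = re.compile(r"github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE)

def extract_github_links(text: str) -> list:
    """Return deduplicated base-level repository URLs found in text."""
    seen, links = set(), []
    for owner, repo in GITHUB_RE.findall(text):
        repo = repo.rstrip(".")  # strip sentence-final periods
        url = f"https://github.com/{owner}/{repo}"
        if url.lower() not in seen:
            seen.add(url.lower())
            links.append(url)
    return links
```

Real papers also contain archive links, subpaths, and line-wrapped URLs, so the production heuristics are likely more involved than this.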
Note: The CSV file must contain a column named 'DOI', from which the links will be extracted.
python scripts/gh_search.py dois <doi_csv_file>
Example:
python scripts/gh_search.py dois ./data/extracted_dois.csv

A pipeline to discover and classify GitHub repositories using keyword-based search over the GitHub API.
- Searches GitHub repositories using a provided keyword and time range (in days)
- Automatically splits queries into 5 time intervals to bypass the 1000-results-per-query API limitation
- Fetches repository links and README content using GitHub's REST API
- Runs a downstream Relevancy classification pipeline on the retrieved repositories
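The source does not document the splitting scheme beyond the five-interval count; one plausible sketch divides the lookback window into five equal date ranges, each of which becomes its own sub-query:

```python
from datetime import date, timedelta

def split_intervals(days_back: int, end: date, n: int = 5) -> list:
    """Split the last `days_back` days into n contiguous (start, end) windows."""
    start = end - timedelta(days=days_back)
    step = days_back / n
    bounds = [start + timedelta(days=round(i * step)) for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

# Each window can then be appended to the search query as a created:
# date-range qualifier, keeping each sub-query under GitHub's
# 1000-result cap, e.g. f"Hubble created:{lo.isoformat()}..{hi.isoformat()}"
```

This is an assumption about the implementation, not the tool's actual code.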
python scripts/gh_search.py keywords "<search_keyword>" <days_back>
Example:
python scripts/gh_search.py keywords "Hubble" 90

A pipeline to extract and classify GitHub repository links from the Astrophysics Source Code Library (ASCL) JSON index.
- Fetches the latest metadata from the ASCL API (https://ascl.net/code/json)
- Extracts and filters new GitHub repository links not already present in the local dataset
- Retrieves README content for newly found repositories using GitHub's GraphQL API
- Runs a downstream Relevancy classification pipeline on the retrieved repositories
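The extract-and-filter step could be sketched as follows; the `site_list` field layout is an assumption about the ASCL JSON structure, and a real run would fetch the index from the URL above rather than take it as an argument:

```python
import re

GITHUB_RE = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+", re.IGNORECASE)

def new_github_links(ascl_entries: dict, known: set) -> list:
    """Scan ASCL entry metadata for GitHub URLs not already in `known`.

    `ascl_entries` maps ASCL IDs to metadata dicts; assumes each entry
    carries candidate URLs under a 'site_list' key.
    """
    found = []
    for entry in ascl_entries.values():
        for url in entry.get("site_list", []):
            m = GITHUB_RE.match(url)
            if m and m.group(0) not in known:
                found.append(m.group(0))
                known.add(m.group(0))
    return found
```

Only the links returned here would go on to the README-retrieval and classification stages.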
python scripts/gh_search.py ascl

A pipeline to discover and classify GitHub repositories from a list of GitHub organizations.
- Parses each input URL to determine whether it is a direct repository link or an organization page
- For organization pages, uses the GitHub GraphQL API to enumerate all public repositories under the organization
- Fetches the README file for each repository
- Saves the list of valid base-level repositories and their READMEs to CSV
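The URL-parsing step can be illustrated with a small classifier (a sketch of the idea, not the tool's actual logic; real inputs may need extra normalization for query strings, subpaths, and so on):

```python
from urllib.parse import urlparse

def classify_github_url(url: str) -> str:
    """Classify a GitHub URL as an 'org' page, a 'repo' link, or 'other'."""
    parts = urlparse(url.strip())
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        return "other"
    path = [p for p in parts.path.split("/") if p]
    if len(path) == 1:
        return "org"   # e.g. https://github.com/nasa
    if len(path) >= 2:
        return "repo"  # e.g. https://github.com/nasa/fprime
    return "other"
```

Organization pages would then be expanded into their member repositories via GraphQL, while direct repository links pass through as-is.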
Note: The CSV file must contain a column named 'URL', from which the links will be extracted.
python scripts/gh_search.py org <input_csv_path>
Example:
python scripts/gh_search.py org "./data/SMD_Sources.csv"

Validate setup:
python scripts/code_search.py validate-setup

Search repositories:
python scripts/code_search.py search-repo "weather forecasting" --top-k=5

Search code:
python scripts/code_search.py search-code
python scripts/code_search.py search-code "atmospheric pressure simulation" --output-format=json --output-file="tmp/test.json" --top-cr=10 --top-r=5 --top-fr=5 --top-k=5

The search happens at two levels:
- First, search for a list of the top N repositories (repo search)
- Then, search for code in each of those repos (code search per repo)
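The two-level flow can be sketched as below; the function names, index structures, and scoring callables are illustrative stand-ins for the tool's embedding-similarity machinery, not its actual API:

```python
def two_level_search(query, repo_index, code_index, top_r=5, top_k=5):
    """Sketch: rank repositories first, then rank code blocks within each hit.

    `repo_index` maps repo name -> similarity scorer; `code_index` maps
    repo name -> list of (code_block, scorer) pairs.
    """
    # Level 1: pick the top_r most relevant repositories.
    repos = sorted(repo_index, key=lambda r: repo_index[r](query),
                   reverse=True)[:top_r]
    results = []
    for repo in repos:
        # Level 2: within each repo, pick the top_k most relevant blocks.
        blocks = sorted(code_index.get(repo, []),
                        key=lambda bs: bs[1](query), reverse=True)[:top_k]
        results.append((repo, [b for b, _ in blocks]))
    return results
```

Restricting the per-block search to a handful of pre-ranked repositories keeps the expensive code-level comparison bounded, which is the point of the two-stage design.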