SGMC Pipeline

Welcome to the SGMC Pipeline repository! This Python-based information-retrieval application automates and streamlines the collection and curation of geospatial metadata from the NCBI Sequence Read Archive (SRA). The pipeline consists of four main steps: data retrieval, web scraping, institution identification, and location identification.

Prerequisites

To use the SGMC Pipeline, you will need the following:

Python version: 3.10.6
R version: 4
Jupyter Notebook
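
The version requirements above can be checked before opening any notebook. The following is a minimal sketch, not part of the repository: it assumes that any Python 3.10.x patch release is acceptable (the list pins 3.10.6) and that R is exposed on PATH as Rscript. The names python_ok, r_available, and report are hypothetical helpers introduced for illustration.

```python
import shutil
import sys

# Values taken from the prerequisites list above.
REQUIRED_PYTHON = (3, 10)  # the list pins 3.10.6; accepting any 3.10.x is an assumption
R_EXECUTABLE = "Rscript"   # assumed name of the R launcher on PATH


def python_ok(version=None):
    """Return True when the interpreter is a Python 3.10.x release."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == REQUIRED_PYTHON


def r_available(executable=R_EXECUTABLE):
    """Return True when the R launcher can be found on PATH."""
    return shutil.which(executable) is not None


def report():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if not python_ok():
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python 3.10.x expected, found {found}")
    if not r_available():
        problems.append(f"{R_EXECUTABLE} not found on PATH")
    return problems
```

Calling report() returns a list of problems to resolve; the R major version is not checked here and would need to be confirmed separately.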

The following keys are also required at various points in the pipeline. Instructions on how to generate each key are included in the notebook file where each one is required:

Google Service Account Key
AWS Credentials Access Key
OpenAI API key

Installation options

Installation via pip

To get started with the SGMC Pipeline, follow these steps:

Navigate to the directory where you want to install the repository, and then clone the repository to your local machine:

$ git clone https://github.com/CDCgov/PASS.git

Navigate to the repository directory:

$ cd PASS/SGMC

Install the required dependencies:

$ pip install -r requirements.txt

Installation via Poetry

To get started with the SGMC Pipeline, follow these steps:

Navigate to the directory where you want to install the repository, and then clone the repository to your local machine:

$ git clone https://github.com/CDCgov/PASS.git

Navigate to the repository directory:

$ cd PASS/SGMC

Install Poetry:

$ pip install poetry

Activate the environment:

$ poetry shell

Create a Jupyter kernel, then start Jupyter Notebook:

$ poetry run ipython kernel install --user --name=sgmc_kernel
$ jupyter notebook

Then select the new kernel under "Kernel" -> "Change kernel" -> "sgmc_kernel". Make sure that "sgmc_kernel" is selected in every notebook file.
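
If the kernel does not show up in the menu, it may help to confirm that it was registered. The sketch below is an illustration rather than part of the repository: it parses the output of the `jupyter kernelspec list --json` command and looks for the "sgmc_kernel" name used above. The helper names kernel_registered and read_kernel_listing are hypothetical.

```python
import json
import subprocess

KERNEL_NAME = "sgmc_kernel"  # the name passed to --name in the step above


def kernel_registered(listing_json, name=KERNEL_NAME):
    """Return True if `name` appears in `jupyter kernelspec list --json` output."""
    listing = json.loads(listing_json)
    return name in listing.get("kernelspecs", {})


def read_kernel_listing():
    """Run the Jupyter CLI and return its JSON kernel listing as text."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With Jupyter installed, kernel_registered(read_kernel_listing()) should be True once the kernel install command has succeeded. The parsing is kept separate from the subprocess call so it can be exercised without a Jupyter installation.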

Usage

The SGMC Pipeline consists of several Jupyter Notebook files that represent each step of the pipeline. Here's a brief description of each file:

00_GCP_SRA_Cloud_metadata.ipynb and 00_AWS_SRA_Cloud_metadata.ipynb: These notebooks handle the preliminary retrieval of SRA data from the cloud. Two options are provided: GCP and AWS.
01_NCBI_webScrape.ipynb: This notebook handles the data retrieval and web scraping process to extract relevant information from the NCBI SRA.
02_ChatGPT_API_Institution Imputation.ipynb: This notebook utilizes the ChatGPT API for institution identification. It leverages artificial intelligence to curate and update geospatial metadata.
03_rmaps.ipynb: This notebook runs R code to generate maps from the geospatial metadata produced by this pipeline.
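
The notebook list above can be summarized as a run order. This is a small sketch under two assumptions that the README implies but does not state outright: the numeric prefixes give the execution order, and only one of the two 00 notebooks is needed, depending on the cloud provider. The function notebook_sequence is a hypothetical helper.

```python
# Notebook file names are copied from the list above.
CLOUD_NOTEBOOKS = {
    "gcp": "00_GCP_SRA_Cloud_metadata.ipynb",
    "aws": "00_AWS_SRA_Cloud_metadata.ipynb",
}
DOWNSTREAM_NOTEBOOKS = [
    "01_NCBI_webScrape.ipynb",
    "02_ChatGPT_API_Institution Imputation.ipynb",
    "03_rmaps.ipynb",
]


def notebook_sequence(cloud):
    """Return the notebooks to run, in order, for the chosen cloud provider."""
    key = cloud.strip().lower()
    if key not in CLOUD_NOTEBOOKS:
        raise ValueError(
            f"cloud must be one of {sorted(CLOUD_NOTEBOOKS)}, got {cloud!r}"
        )
    # Assumed order: the chosen 00 notebook first, then 01, 02, 03.
    return [CLOUD_NOTEBOOKS[key], *DOWNSTREAM_NOTEBOOKS]
```

For example, notebook_sequence("aws") lists the AWS retrieval notebook followed by the three downstream notebooks.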

Please note that the data directory contains the data files necessary for pipeline execution, and the rstudio_maps directory stores the generated map visualizations.

Feel free to customize and modify the pipeline as per your specific requirements. You can add additional steps or modify the existing ones to suit your needs.

Acknowledgments

This pipeline was developed to automate the geospatial metadata curation process. The study demonstrates the effectiveness of the pipeline in saving human effort, reducing errors, and improving the quality of public data.

Note: This README provides an overview of the SGMC Pipeline and how to use it. For more detailed instructions, refer to the individual notebook files and their corresponding documentation within the repository.
