A simple Python script that extracts basic SNP (rsID) information from NCBI SNP pages and saves it to an Excel file.
It retrieves chromosome, GRCh37 / GRCh38 positions, alleles, and gene consequence.
- Input: a comma-separated list of rsIDs (e.g. rs429358, rs7412)
- Output: Excel file with columns: rsID, Chromosome, GRCh37 Position, GRCh38 Position, Alleles, Gene Consequence
- Ideal for small-scale SNP lookups.
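The comma-separated input described above could be normalized along these lines. This is a minimal sketch, not the script's actual input handling: the helper name `parse_rsids` and the validation rule (the letters `rs` followed by digits) are assumptions made for illustration.

```python
import re

# Assumed format: "rs" followed by one or more digits.
RSID_PATTERN = re.compile(r"rs\d+")


def parse_rsids(raw):
    """Split a comma-separated string into a de-duplicated list of rsIDs."""
    seen = set()
    rsids = []
    for token in raw.split(","):
        rsid = token.strip().lower()
        if not rsid:
            continue  # skip empty entries such as a trailing comma
        if not RSID_PATTERN.fullmatch(rsid):
            raise ValueError(f"not a valid rsID: {token.strip()!r}")
        if rsid not in seen:
            seen.add(rsid)
            rsids.append(rsid)
    return rsids


print(parse_rsids("rs429358, rs7412"))
```

Rejecting malformed IDs before any request is made avoids fetching pages that cannot exist.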
- Python 3.8+
- Python packages:
  pip install requests beautifulsoup4 pandas openpyxl lxml

Output columns:
- rsID: SNP identifier
- Chromosome: Chromosome number (if available)
- GRCh37 Position: Numeric position on GRCh37
- GRCh38 Position: Numeric position on GRCh38
- Alleles: Allele string
- Gene Consequence: Consequence, or None if missing
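The output columns can be sketched as a pandas `DataFrame` as follows. The helper `build_frame`, the record values, and the output file name are hypothetical; the values are placeholders for illustration, not real annotations.

```python
import pandas as pd

COLUMNS = [
    "rsID",
    "Chromosome",
    "GRCh37 Position",
    "GRCh38 Position",
    "Alleles",
    "Gene Consequence",
]


def build_frame(records):
    """Arrange per-SNP dicts in the output column order; absent keys become NaN."""
    return pd.DataFrame(records, columns=COLUMNS)


# Placeholder records: the second one is missing every field except the rsID.
records = [
    {
        "rsID": "rs0000001",
        "Chromosome": "1",
        "GRCh37 Position": 1000,
        "GRCh38 Position": 2000,
        "Alleles": "A>G",
        "Gene Consequence": None,
    },
    {"rsID": "rs0000002"},
]

df = build_frame(records)
print(df)

# Writing the sheet needs openpyxl; the file name here is hypothetical.
# df.to_excel("snp_results.xlsx", index=False)
```

Missing values (`None` or NaN) are written by `to_excel` as empty cells by default.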
- Fetches the NCBI SNP webpage for each rsID: https://www.ncbi.nlm.nih.gov/snp/{rsid}
- Uses BeautifulSoup to parse the HTML and extract:
  - Position, Alleles, Gene : Consequence
  - Placement table (genomics_placements_table) for GRCh37/GRCh38 coordinates
- Combines all results in a pandas DataFrame and writes to Excel.
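The parsing step above might look roughly like the sketch below. It rests on assumptions about the page markup that are not confirmed by this README: that the summary labels appear as `<dt>`/`<dd>` pairs, and that `genomics_placements_table` is the `id` of a `<table>` element. The function names are hypothetical, and the built-in `html.parser` backend is used here rather than lxml to keep the example dependency-light.

```python
from bs4 import BeautifulSoup


def _clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def parse_summary(html):
    """Map summary labels (e.g. 'Alleles') to their text values."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            fields[_clean(dt.get_text())] = _clean(dd.get_text())
    return fields


def parse_placements(html):
    """Return the data rows of the placement table as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="genomics_placements_table")
    if table is None:
        return []  # table absent: caller treats coordinates as missing
    rows = []
    for tr in table.find_all("tr"):
        cells = [_clean(td.get_text()) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th> cells and are skipped
            rows.append(cells)
    return rows
```

Here `html` would be the text of the page fetched from the URL pattern above. Labels that are not on the page are simply absent from the returned dict, so `fields.get("Gene : Consequence")` yields `None` when no consequence is listed.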
- The script scrapes NCBI public pages. For large-scale queries, use NCBI Entrez/e-utilities.
- Some fields may be missing for certain SNPs; these will appear blank in Excel.
- Internet connection is required.
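For the larger queries mentioned in the notes above, a request to the E-utilities `esummary` endpoint can cover several rsIDs at once. The sketch below only builds the request URL; it is not part of this script, the helper name `build_esummary_url` is hypothetical, and it assumes the `snp` database expects numeric IDs without the `rs` prefix.

```python
from urllib.parse import urlencode

EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(rsids, api_key=None):
    """Build one esummary URL covering all given rsIDs."""
    # Drop the leading "rs" so only the numeric part is sent (assumption above).
    ids = ",".join(r[2:] if r.lower().startswith("rs") else r for r in rsids)
    params = {"db": "snp", "id": ids, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    # safe="," keeps the ID list readable instead of encoding commas as %2C.
    return f"{EUTILS_ESUMMARY}?{urlencode(params, safe=',')}"


print(build_esummary_url(["rs429358", "rs7412"]))
```

NCBI asks clients to limit their request rate (a few requests per second without an API key), so batching IDs into one call is generally preferable to one request per SNP.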
This project is licensed under the MIT License. See LICENSE for details.
Created by Rahul Madhav