Kin-zala/seqtrace

🧬 SeqTrace — DSI Reuse Tracker

A bioinformatics tool that tracks the downstream journey of plant DNA sequences — from their original submission in public databases, through scientific literature and patents worldwide.

Built as a prototype for a thesis application at IPK Gatersleben. Topic: "Analyse downstream reuse of Digital Sequence Information (DSI) in literature, patents, and databases."


What is this project about?

When a researcher sequences a plant's DNA and submits it to a public database like ENA, that sequence becomes Digital Sequence Information (DSI). Other researchers around the world can then use that sequence — in their papers, in patents, in new discoveries — often without any formal link back to the original submission.

This raises an important question for biodiversity and benefit-sharing policy:

How often is DSI actually reused, and can we trace it?

SeqTrace answers this by automatically fetching plant sequences, searching for their appearances in literature and patents, and visualising the results in an interactive dashboard.


Key Findings

Analysis of 500 plant sequences submitted to ENA (filtered to sequences with country metadata, predominantly Japanese plant biodiversity samples):

Metric                         Value
Total sequences analysed       500
Unique organisms               299
Sequences with paper links     135 (27%)
Sequences with patent links    ~37%
Unique patents found           134 across 5 offices
Traceable overall              ~73%
Traceability gap               ~27%

About 27% of sequences remain untraceable, which shows that linkage infrastructure gaps persist even for sequences from well-documented biodiversity studies. This is the core finding relevant to DSI benefit-sharing policy.
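The percentages in the table come down to counting per-sequence flags. The sketch below is a minimal illustration, not the repository's code: the SequenceRecord fields has_paper and has_patent are hypothetical names, and it assumes a sequence counts as traceable when it has at least one paper or patent link (the README does not spell out that definition).

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    accession: str      # hypothetical field names, for illustration only
    has_paper: bool
    has_patent: bool


def traceability_summary(records):
    """Return counts and percentages of linked and unlinked sequences."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarise")
    with_paper = sum(r.has_paper for r in records)
    with_patent = sum(r.has_patent for r in records)
    # Assumption: "traceable" means at least one paper OR patent link.
    traceable = sum(r.has_paper or r.has_patent for r in records)

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "total": total,
        "paper_pct": pct(with_paper),
        "patent_pct": pct(with_patent),
        "traceable_pct": pct(traceable),
        "gap_pct": pct(total - traceable),
    }


# Toy data (made up), not the project's results.
demo = [
    SequenceRecord("SEQ0001", True, False),
    SequenceRecord("SEQ0002", True, True),
    SequenceRecord("SEQ0003", False, True),
    SequenceRecord("SEQ0004", False, False),
]
summary = traceability_summary(demo)
```

Because one sequence can carry both kinds of link, the paper and patent percentages can overlap and need not add up to the traceable figure.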

Other findings:

  • Japan is the dominant submitting country — sequences cover rich Japanese plant biodiversity including mosses, ferns, aquatic plants, and forest trees
  • The US leads with 32 patents across 12 organisms, followed by JP (30), WO (29), EP (23), and KR (20)
  • Lathyrus oleraceus (garden pea, formerly Pisum sativum) is the most patented organism with 14 patents
  • Chlamydomonas reinhardtii has the most literature-linked sequence, with 10 paper connections
  • Patent offices represented: USPTO 🇺🇸, JPO 🇯🇵, WIPO 🌐, EPO 🇪🇺, KIPO 🇰🇷
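The per-office breakdown can be reproduced by grouping patent identifiers on their two-letter prefix. This is a sketch under an assumption: that identifiers are stored with the office code at the front (as patent publication numbers are commonly written); how the repository stores them is not documented here. The identifiers in the demo are made-up placeholders, while REPORTED_COUNTS repeats the numbers listed above.

```python
from collections import Counter

# Two-letter publication prefixes mapped to the offices named above.
OFFICE_BY_PREFIX = {"US": "USPTO", "JP": "JPO", "WO": "WIPO", "EP": "EPO", "KR": "KIPO"}

# Per-office counts as reported in the findings above.
REPORTED_COUNTS = {"US": 32, "JP": 30, "WO": 29, "EP": 23, "KR": 20}


def count_by_office(patent_ids):
    """Tally patents per office, using the first two characters as the office code."""
    counts = Counter()
    for patent_id in patent_ids:
        prefix = patent_id.strip()[:2].upper()
        counts[OFFICE_BY_PREFIX.get(prefix, "other")] += 1
    return counts


# Placeholder identifiers (made up) to show the grouping.
demo_counts = count_by_office(["US0000001", "us0000002", "EP0000003", "XX0000004"])

# The reported per-office counts add up to the 134 unique patents in the table.
reported_total = sum(REPORTED_COUNTS.values())
```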

Project Structure

seqtrace/
│
├── scripts/
│   ├── fetch_ena.py          # Fetches 500 plant sequences from ENA
│   ├── track_reuse.py        # Searches for paper links via NCBI + EuropePMC
│   ├── fetch_patents.py      # Searches for patent links via NCBI
│   ├── app.py                # Streamlit interactive dashboard
│   └── config.py             # API keys — NOT pushed to GitHub
│
├── data/
│   └── seqtrace.db           # SQLite database (auto-generated, not in git)
│
├── results/
│   ├── ena_plant_sequences_raw.csv
│   ├── seqtrace_reuse.csv
│   └── seqtrace_patents.csv
│
├── requirements.txt          # Python dependencies
├── .streamlit/
│   └── config.toml           # Streamlit theme config
│
└── seqtrace_dashboard.html   # Static standalone HTML dashboard

Data Sources

Source               What it provides
ENA Portal API       Plant DNA sequences with metadata
NCBI Entrez API      Literature links + patent sequences
EuropePMC API        Paper metadata (title, journal, year, citations)
NCBI Taxonomy API    Taxonomic lineage per organism
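To make the table concrete, here is a rough sketch of what requests to three of these services might look like. It only builds URLs and makes no network calls. The endpoints are the public ENA Portal, NCBI E-utilities, and EuropePMC REST endpoints; the chosen query, field list, and function names are assumptions for illustration and may differ from what the scripts actually send.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
NCBI_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def ena_plant_sequences_url(limit=500):
    """URL for plant sequence records with country metadata (assumed field list)."""
    params = {
        "result": "sequence",
        "query": "tax_tree(33090)",   # 33090 = Viridiplantae (green plants)
        "fields": "accession,scientific_name,country",
        "format": "tsv",
        "limit": limit,
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"


def ncbi_pubmed_links_url(sequence_id, api_key=None):
    """URL asking ELink for PubMed records linked to a nucleotide record."""
    params = {"dbfrom": "nuccore", "db": "pubmed", "id": sequence_id, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{NCBI_ELINK}?{urlencode(params)}"


def europepmc_search_url(term):
    """URL for a EuropePMC full-text/metadata search returning JSON."""
    return f"{EUROPEPMC_SEARCH}?{urlencode({'query': term, 'format': 'json'})}"
```

Any of these URLs can be passed to requests.get; the response parsing is left out because it depends on which fields the scripts keep.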

Installation & Setup

1. Clone the repository

git clone https://github.com/Kin-zala/seqtrace.git
cd seqtrace

2. Create a conda environment

conda create -n seqtrace python=3.11 -y
conda activate seqtrace

3. Install dependencies

pip install requests pandas plotly streamlit

4. Add your NCBI API key

Create a file called config.py in the scripts/ folder:

NCBI_API_KEY = "your_key_here"

Get a free API key at: https://www.ncbi.nlm.nih.gov/account/

⚠️ config.py is listed in .gitignore and will never be pushed to GitHub.
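A script could read the key defensively so that a missing config.py does not crash it. The helper below is a hypothetical sketch, not code from the repository: the function name and the environment-variable fallback are assumptions added for illustration.

```python
import importlib
import os


def load_ncbi_api_key(module_name="config", env_var="NCBI_API_KEY"):
    """Return the key from config.py if present, else from the environment, else None."""
    try:
        module = importlib.import_module(module_name)
        key = getattr(module, "NCBI_API_KEY", None)
        if key:
            return key
    except ModuleNotFoundError:
        pass  # no config.py on the import path; fall back to the environment
    # Assumption: an environment variable is an acceptable fallback.
    return os.environ.get(env_var) or None
```

Returning None instead of raising lets the caller decide whether to continue without a key or stop with a clear message.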


How to Run

Run each script from the scripts/ folder in order — each one builds on the previous:

cd scripts

# Step 1 — Fetch 500 plant sequences from ENA (with country data)
python fetch_ena.py

# Step 2 — Track literature links (run 10 times to process all 500)
for i in {1..10}; do python track_reuse.py; done

# Step 3 — Find patent links
python fetch_patents.py

# Step 4 — Launch the interactive dashboard
streamlit run app.py

The dashboard opens automatically at http://localhost:8501.
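Step 2 is run ten times, which suggests that each run handles a batch of roughly 50 unprocessed sequences and records its progress. One way such a resumable batch could work is sketched below with the standard-library sqlite3 module. The table layout, the processed column, and the batch size of 50 are assumptions; track_reuse.py may do this differently.

```python
import sqlite3

BATCH_SIZE = 50  # assumption: 500 sequences spread over 10 runs


def init_db(conn, accessions):
    """Create a hypothetical progress table and register the accessions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "accession TEXT PRIMARY KEY, processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO sequences (accession) VALUES (?)",
        [(a,) for a in accessions],
    )
    conn.commit()


def process_next_batch(conn, lookup, batch_size=BATCH_SIZE):
    """Process up to batch_size unprocessed accessions; return how many were handled."""
    rows = conn.execute(
        "SELECT accession FROM sequences WHERE processed = 0 "
        "ORDER BY accession LIMIT ?",
        (batch_size,),
    ).fetchall()
    for (accession,) in rows:
        lookup(accession)  # e.g. query NCBI / EuropePMC for this accession
        conn.execute(
            "UPDATE sequences SET processed = 1 WHERE accession = ?", (accession,)
        )
    conn.commit()
    return len(rows)


# Demo with 500 made-up accessions and a lookup that only records its input.
conn = sqlite3.connect(":memory:")
init_db(conn, [f"SEQ{i:04d}" for i in range(500)])
seen = []
runs = 0
while process_next_batch(conn, seen.append) > 0:
    runs += 1
```

Marking rows as processed means an interrupted run can simply be repeated without redoing earlier lookups.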


Dashboard Features

The Streamlit app has 6 tabs:

  • 📊 Overview — summary stats and traceability breakdown
  • 📄 Literature — paper links by year, most cited sequences
  • ⚖️ Patents — patents by office, most patented organisms
  • 🗂️ Data Table — full searchable sequence table
  • 🌍 World Map — two choropleth maps (submissions + traceability by country)
  • 🌿 Taxonomy — interactive sunburst chart (Order → Family → Organism)

All charts update live when you use the sidebar filters (organism, country, year, traceability).
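The filtering behind the sidebar can be thought of as a plain function that every chart calls before drawing. The following is a sketch with hypothetical row keys (organism, country, year, traceable) and a hypothetical function name; in a Streamlit app the arguments would typically come from sidebar widgets such as st.sidebar.multiselect and st.sidebar.slider.

```python
def filter_sequences(rows, organisms=None, countries=None, year_range=None, traceable=None):
    """Keep rows matching every filter that is set; None or empty means 'no filter'."""
    selected = []
    for row in rows:
        if organisms and row["organism"] not in organisms:
            continue
        if countries and row["country"] not in countries:
            continue
        if year_range and not (year_range[0] <= row["year"] <= year_range[1]):
            continue
        if traceable is not None and row["traceable"] != traceable:
            continue
        selected.append(row)
    return selected


# Made-up rows, for illustration only.
rows = [
    {"organism": "Organism A", "country": "Japan", "year": 2015, "traceable": True},
    {"organism": "Organism B", "country": "Japan", "year": 2019, "traceable": False},
    {"organism": "Organism A", "country": "Germany", "year": 2021, "traceable": True},
]
japan_only = filter_sequences(rows, countries={"Japan"})
recent_traceable = filter_sequences(rows, year_range=(2018, 2022), traceable=True)
```

Keeping the filter separate from the plotting code makes it straightforward to feed the same filtered rows to every tab.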


Relevance to DSI Policy

This project is a direct prototype for the kind of analysis needed to support the WiLDSI portal at IPK Gatersleben and to inform DSI benefit-sharing discussions under the Kunming-Montreal Global Biodiversity Framework. The traceability gap identified here — ~27% of sequences with no recoverable downstream link — illustrates why robust DSI tracking infrastructure is urgently needed, even for well-documented biodiversity collections.


Author

Kinnari Zala
M.Sc. Bioinformatics, Deggendorf Institute of Technology
GitHub: Kin-zala
Website: kinnarizala.me


License

This project is open source and available under the MIT License.

About

Tracks downstream reuse of plant DNA sequences across scientific literature and patents — a DSI traceability tool built on ENA, NCBI, and EuropePMC APIs.
