🧬 SeqTrace — DSI Reuse Tracker

A bioinformatics tool that tracks the downstream journey of plant DNA sequences — from their original submission in public databases, through scientific literature and patents worldwide.

Built as a prototype for an IPK Gatersleben thesis application. Topic: "Analyse downstream reuse of Digital Sequence Information (DSI) in literature, patents, and databases."

What is this project about?

When a researcher sequences a plant's DNA and submits it to a public database like ENA, that sequence becomes Digital Sequence Information (DSI). Other researchers around the world can then use that sequence — in their papers, in patents, in new discoveries — often without any formal link back to the original submission.

This raises an important question for biodiversity and benefit-sharing policy:

How often is DSI actually reused, and can we trace it?

SeqTrace answers this by automatically fetching plant sequences, searching for their appearances in literature and patents, and visualising the results in an interactive dashboard.

Key Findings

Analysis of 500 plant sequences submitted to ENA (filtered to sequences with country metadata, predominantly Japanese plant biodiversity samples):

Metric	Value
Total sequences analysed	500
Unique organisms	299
Sequences with paper links	135 (27%)
Sequences with patent links	~37%
Unique patents found	134 across 5 offices
Traceable overall	~73%
Traceability gap	~27%

Roughly 27% of sequences remain untraceable, which indicates that linkage infrastructure gaps persist even for sequences from well-documented biodiversity studies. This is the core finding relevant to DSI benefit-sharing policy.
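The percentages in the table can be derived from per-sequence link flags. The sketch below shows one way to do that arithmetic. The record fields and the definition of "traceable" (at least one paper or patent link) are assumptions for illustration, and the figures above may rest on a different definition.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    accession: str          # hypothetical field names, not the SeqTrace schema
    has_paper_link: bool
    has_patent_link: bool


def traceability_summary(records):
    """Return percentage shares for a list of SequenceRecord objects."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarise")

    def pct(count):
        return round(100 * count / total, 1)

    with_paper = sum(r.has_paper_link for r in records)
    with_patent = sum(r.has_patent_link for r in records)
    # Assumption: "traceable" means at least one paper or patent link.
    traceable = sum(r.has_paper_link or r.has_patent_link for r in records)
    return {
        "paper_pct": pct(with_paper),
        "patent_pct": pct(with_patent),
        "traceable_pct": pct(traceable),
        "gap_pct": pct(total - traceable),
    }


# Toy data only; these are not SeqTrace results.
toy = [
    SequenceRecord("SEQ1", True, False),
    SequenceRecord("SEQ2", False, True),
    SequenceRecord("SEQ3", True, True),
    SequenceRecord("SEQ4", False, False),
]
summary = traceability_summary(toy)
```

The traceability gap is simply the complement of the traceable share, so the two always sum to 100%.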

Other findings:

Japan is the dominant submitting country — sequences cover rich Japanese plant biodiversity including mosses, ferns, aquatic plants, and forest trees
US leads patents with 32 patents across 12 organisms, followed by JP (30), WO (29), EP (23), KR (20)
Lathyrus oleraceus (garden pea, syn. Pisum sativum) is the most patented organism with 14 patents
Chlamydomonas reinhardtii is the most literature-linked organism, with 10 paper connections
Patent offices represented: USPTO 🇺🇸, JPO 🇯🇵, WIPO 🌐, EPO 🇪🇺, KIPO 🇰🇷
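The per-office counts above could be tallied from patent publication numbers. This is a minimal sketch, assuming each identifier starts with a two-letter office code (US, JP, WO, EP, KR). The function name and the sample identifiers are made up for illustration and are not taken from the SeqTrace scripts.

```python
from collections import Counter

# Two-letter prefixes for the five offices listed above.
OFFICE_NAMES = {"US": "USPTO", "JP": "JPO", "WO": "WIPO", "EP": "EPO", "KR": "KIPO"}


def count_by_office(publication_numbers):
    """Count patents per office, based on the first two characters of each ID."""
    counts = Counter()
    for number in publication_numbers:
        prefix = number.strip()[:2].upper()
        # Unknown prefixes are grouped under "other" instead of being dropped.
        counts[OFFICE_NAMES.get(prefix, "other")] += 1
    return counts


# Made-up publication numbers for illustration.
sample = ["US0000001", "jp0000002", "WO0000003", "US0000004", "XX0000005"]
office_counts = count_by_office(sample)
```

Keeping an explicit "other" bucket makes unexpected prefixes visible rather than silently shrinking the totals.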

Project Structure

seqtrace/
│
├── scripts/
│   ├── fetch_ena.py          # Fetches 500 plant sequences from ENA
│   ├── track_reuse.py        # Searches for paper links via NCBI + EuropePMC
│   ├── fetch_patents.py      # Searches for patent links via NCBI
│   ├── app.py                # Streamlit interactive dashboard
│   └── config.py             # API keys — NOT pushed to GitHub
│
├── data/
│   └── seqtrace.db           # SQLite database (auto-generated, not in git)
│
├── results/
│   ├── ena_plant_sequences_raw.csv
│   ├── seqtrace_reuse.csv
│   └── seqtrace_patents.csv
│
├── requirements.txt          # Python dependencies
├── .streamlit/
│   └── config.toml           # Streamlit theme config
│
└── seqtrace_dashboard.html   # Static standalone HTML dashboard

Data Sources

Source	What it provides
ENA Portal API	Plant DNA sequences with metadata
NCBI Entrez API	Literature links + patent sequences
EuropePMC API	Paper metadata (title, journal, year, citations)
NCBI Taxonomy API	Taxonomic lineage per organism
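As an illustration of the first source, the sketch below builds a search URL for the ENA Portal API without sending it. The query, field list, and function name are assumptions rather than the contents of fetch_ena.py, and should be checked against the ENA Portal API documentation before use.

```python
from urllib.parse import urlencode

ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_ena_query(limit=500):
    """Build a search URL for plant sequence records (sketch, not fetch_ena.py)."""
    params = {
        "result": "sequence",
        # Assumption: restrict to NCBI taxon 33090 (Viridiplantae) and descendants.
        "query": "tax_tree(33090)",
        # Assumption: these three fields are enough for the later steps.
        "fields": "accession,scientific_name,country",
        "format": "json",
        "limit": limit,
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


url = build_ena_query()
```

The resulting URL can be passed to requests.get, and records with an empty country field can then be dropped client-side.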

Installation & Setup

1. Clone the repository

git clone https://github.com/Kin-zala/seqtrace.git
cd seqtrace

2. Create a conda environment

conda create -n seqtrace python=3.11 -y
conda activate seqtrace

3. Install dependencies

pip install requests pandas plotly streamlit

4. Add your NCBI API key

Create a file called config.py in the scripts/ folder (next to the other scripts, as shown in Project Structure):

NCBI_API_KEY = "your_key_here"

Get a free API key at: https://www.ncbi.nlm.nih.gov/account/

⚠️ config.py is listed in .gitignore and will never be pushed to GitHub.
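A script might pick up the key as follows. This is a sketch under assumptions: the helper name elink_url and the choice of the nuccore-to-pubmed ELink lookup are illustrative and not taken from track_reuse.py. NCBI E-utilities also accept requests without a key, at a lower rate limit.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

try:
    from config import NCBI_API_KEY  # the config.py file described above
except ImportError:
    NCBI_API_KEY = None  # fall back to keyless access at a lower rate limit


def elink_url(seq_id, api_key=NCBI_API_KEY):
    """Build an ELink URL asking for PubMed records linked to a nucleotide ID."""
    params = {"dbfrom": "nuccore", "db": "pubmed", "id": seq_id, "retmode": "json"}
    if api_key:
        # The key is only attached when one is configured.
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"
```

Reading the key in one place keeps it out of the other scripts and out of version control.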

How to Run

Run each script from the scripts/ folder in order — each one builds on the previous:

cd scripts

# Step 1 — Fetch 500 plant sequences from ENA (with country data)
python fetch_ena.py

# Step 2 — Track literature links (run 10 times to process all 500)
for i in {1..10}; do python track_reuse.py; done

# Step 3 — Find patent links
python fetch_patents.py

# Step 4 — Launch the interactive dashboard
streamlit run app.py

The dashboard opens automatically at http://localhost:8501.
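Step 2 is run repeatedly because each run handles only part of the data. One way to implement such resumable batching is sketched below with an in-memory SQLite table. The batch size of 50 is inferred from 500 sequences over 10 runs, and the table layout, column names, and lookup callback are hypothetical rather than the actual seqtrace.db schema.

```python
import sqlite3

BATCH_SIZE = 50  # assumption: 500 sequences / 10 runs


def process_next_batch(conn, lookup):
    """Process up to BATCH_SIZE unprocessed sequences; return how many were done."""
    rows = conn.execute(
        "SELECT accession FROM sequences WHERE processed = 0 "
        "ORDER BY accession LIMIT ?",
        (BATCH_SIZE,),
    ).fetchall()
    for (accession,) in rows:
        link_count = lookup(accession)  # e.g. a literature search for this accession
        conn.execute(
            "UPDATE sequences SET processed = 1, paper_links = ? WHERE accession = ?",
            (link_count, accession),
        )
    conn.commit()
    return len(rows)


# Demo with 120 placeholder accessions and a lookup that finds no links.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, processed INTEGER DEFAULT 0, paper_links INTEGER)"
)
conn.executemany(
    "INSERT INTO sequences (accession) VALUES (?)",
    [(f"SEQ{i:03d}",) for i in range(120)],
)

runs = []
while True:
    done = process_next_batch(conn, lookup=lambda accession: 0)
    if done == 0:
        break
    runs.append(done)
```

Because progress is stored per row, an interrupted run can be repeated without reprocessing sequences that are already marked as done.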

Dashboard Features

The Streamlit app has 6 tabs:

📊 Overview — summary stats and traceability breakdown
📄 Literature — paper links by year, most cited sequences
⚖️ Patents — patents by office, most patented organisms
🗂️ Data Table — full searchable sequence table
🌍 World Map — two choropleth maps (submissions + traceability by country)
🌿 Taxonomy — interactive sunburst chart (Order → Family → Organism)

All charts update live when you use the sidebar filters (organism, country, year, traceability).
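The filtering behind those sidebar controls can be sketched as a plain function that every chart calls before plotting. The row keys and the function name are assumptions for illustration, not the code in app.py. In a Streamlit app the argument values would come from the sidebar widgets.

```python
def apply_filters(rows, organism=None, country=None, year_range=None, traceable=None):
    """Keep rows matching every filter that is set; None means 'no filter'."""
    selected = []
    for row in rows:
        if organism is not None and row["organism"] != organism:
            continue
        if country is not None and row["country"] != country:
            continue
        if year_range is not None and not (year_range[0] <= row["year"] <= year_range[1]):
            continue
        if traceable is not None and row["traceable"] != traceable:
            continue
        selected.append(row)
    return selected


# Toy rows with hypothetical keys; not SeqTrace data.
rows = [
    {"organism": "A", "country": "Japan", "year": 2015, "traceable": True},
    {"organism": "B", "country": "Japan", "year": 2020, "traceable": False},
    {"organism": "A", "country": "Germany", "year": 2021, "traceable": True},
]
japan_recent = apply_filters(rows, country="Japan", year_range=(2018, 2022))
```

Since Streamlit reruns the script whenever a widget changes, feeding every chart from the same filtered list is enough to keep them in sync.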

Relevance to DSI Policy

This project is a direct prototype for the kind of analysis needed to support the WiLDSI portal at IPK Gatersleben and to inform DSI benefit-sharing discussions under the Kunming-Montreal Global Biodiversity Framework. The traceability gap identified here — ~27% of sequences with no recoverable downstream link — illustrates why robust DSI tracking infrastructure is urgently needed, even for well-documented biodiversity collections.

Author

Kinnari Zala
M.Sc. Bioinformatics, Deggendorf Institute of Technology
GitHub: Kin-zala
Website: kinnarizala.me

License

This project is open source and available under the MIT License.
