# 01: Build Raw Protein Family Dataset (UniProtKB)

This notebook performs the **data acquisition step** for the project:
we download curated protein sequences from **UniProtKB (Swiss-Prot)** for a set of functional families
and save them in a raw, reproducible format.

---

## What happens in this notebook

1. **Configuration & imports**
   - Project paths and constants are taken from `src/config.py`
     (`PROJECT_ROOT`, `RAW_DIR`, `FAMILY_KEYWORDS`, `MIN_SEQ_LEN`, `MAX_SEQ_LEN`, etc.).
   - We use a dedicated helper module `src/data_uniprot.py` to interact with the UniProt REST API.

2. **Data fetching from UniProtKB**
   - Only **reviewed** entries are requested (`reviewed:true`).
   - Sequences are filtered by length: **`MIN_SEQ_LEN ≤ length ≤ MAX_SEQ_LEN`**  
     (in this project: `50 ≤ length ≤ 1000` to respect ESM-1b context limits).
   - For each functional family in `FAMILY_KEYWORDS`, we request up to `n_per_class` sequences
     (here: `n_per_class=500`).
   - Basic de-duplication is applied:
     - remove duplicate UniProt accessions (`uniprot_id`),
     - remove duplicate sequences (`sequence`).

3. **Saving raw artifacts**
   The notebook saves two core artifacts under `data/raw`:
   - `protein_families_small.csv`: table with columns  
     `uniprot_id, protein_name, organism, length, sequence, family`
   - `raw_sequences_small.fasta`: the same dataset in FASTA format

These files form the **raw input** for all downstream steps.
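
The fetching rules from step 2 can be sketched as a query against the UniProt REST search endpoint. This is a minimal illustration, not the actual contents of `src/data_uniprot.py`: the function names `build_query` and `build_search_url`, the use of the `keyword:` query field, and the selected return fields are all assumptions.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query(keyword: str, min_len: int = 50, max_len: int = 1000) -> str:
    """Compose a UniProtKB query: reviewed entries, one keyword, bounded length."""
    return f"reviewed:true AND keyword:{keyword} AND length:[{min_len} TO {max_len}]"


def build_search_url(keyword: str, n_per_class: int = 500) -> str:
    """Return a search URL asking for up to n_per_class rows in TSV format."""
    params = {
        "query": build_query(keyword),
        "format": "tsv",
        # Assumed return fields; they mirror the CSV columns listed above.
        "fields": "accession,protein_name,organism_name,length,sequence",
        "size": n_per_class,
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


print(build_query("kinase"))
# reviewed:true AND keyword:kinase AND length:[50 TO 1000]
```

Building the URL separately from sending the request keeps the filter logic easy to inspect and test without network access.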

---

## Next steps

Further analysis of the dataset is intentionally **not** performed here.

All exploratory work and cleaning happen in:

- `02_eda_and_fetch.ipynb`:  
  detailed EDA (class balance, organism diversity, length distribution, sequence sanity checks)
  and preparation of a cleaned dataset under `data/processed/`.

Subsequent notebooks will cover:

- `03_esm_embeddings.ipynb`: ESM-1b embeddings,
- `04_train_and_eval.ipynb`: model training & evaluation,
- `05_interpret_and_visualize.ipynb`: UMAP, SHAP, position-wise and 3D analysis.


In [4]:
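from pathlib import Path
import sys

# Machine-specific project root; adjust to your local checkout.
PROJECT_ROOT = Path(r"D:\ML\BioML\ESM")
# Make the `src` package importable from the notebook.
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from src.data_uniprot import build_protein_family_dataset

# Fetch up to 500 reviewed sequences per family and save the raw CSV/FASTA.
df = build_protein_family_dataset(
    n_per_class=500,
)
df["length"].describe()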

[kinase] Requesting 500 entries from UniProt...
[kinase] Got 500 unique sequences.
[transporter] Requesting 500 entries from UniProt...
[transporter] Got 499 unique sequences.
[ion_channel] Requesting 500 entries from UniProt...
[ion_channel] Got 499 unique sequences.
[transcription] Requesting 500 entries from UniProt...
[transcription] Got 499 unique sequences.
[chaperone] Requesting 500 entries from UniProt...
[chaperone] Got 493 unique sequences.
[receptor] Requesting 500 entries from UniProt...
[receptor] Got 500 unique sequences.
[hydrolase] Requesting 500 entries from UniProt...
[hydrolase] Got 499 unique sequences.
[ligase] Requesting 500 entries from UniProt...
[ligase] Got 499 unique sequences.
[dna_binding] Requesting 500 entries from UniProt...
[dna_binding] Got 500 unique sequences.
[protease] Requesting 500 entries from UniProt...
[protease] Got 499 unique sequences.

Saved CSV: D:\ML\BioML\ESM\data\raw\protein_families_small.csv (n=4264)
Saved FASTA: D:\ML\BioML\ESM\data

count    4264.000000
mean      493.312148
std       217.393913
min        51.000000
25%       333.000000
50%       471.000000
75%       639.000000
max       999.000000
Name: length, dtype: float64
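
The `describe()` output above (min 51, max 999) is consistent with the configured length window. A check along the following lines could make that explicit. The helper name `check_lengths` is hypothetical, the bounds are passed in rather than imported from `src/config.py`, and the rows are toy data rather than real UniProt records.

```python
import pandas as pd


def check_lengths(df: pd.DataFrame, min_len: int, max_len: int) -> pd.DataFrame:
    """Return rows whose length is out of bounds or disagrees with the sequence."""
    seq_len = df["sequence"].str.len()
    bad = (df["length"] != seq_len) | (seq_len < min_len) | (seq_len > max_len)
    return df[bad]


# Toy rows for illustration only (not real UniProt records).
toy = pd.DataFrame(
    {
        "uniprot_id": ["X1", "X2", "X3"],
        "sequence": ["M" * 60, "M" * 10, "M" * 80],
        "length": [60, 10, 75],
    }
)
print(check_lengths(toy, min_len=50, max_len=1000)["uniprot_id"].tolist())
# ['X2', 'X3']  (X2 is too short; X3 has a length value that does not match)
```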

In [5]:
df.head()

| Unnamed: 0 | uniprot_id | protein_name | organism | length | sequence | family |
|---|---|---|---|---|---|---|
| 0 | O00444 | Serine/threonine-protein kinase PLK4 (EC 2.7.1... | Homo sapiens (Human) | 970 | MATCIGEKIEDFKVGNLLGKGSFAGVYRAESIHTGLEVAIKMIDKK... | kinase |
| 1 | O00506 | Serine/threonine-protein kinase 25 (EC 2.7.11.... | Homo sapiens (Human) | 426 | MAHLRGFANQHSRVDPEELFTKLDRIGKGSFGEVYKGIDNHTKEVV... | kinase |
| 2 | O00746 | Nucleoside diphosphate kinase, mitochondrial (... | Homo sapiens (Human) | 187 | MGGLFWRSALRGLRCGPRAPGPSLLVRHGSGGPSWTRERTLVAVKP... | kinase |
| 3 | O14757 | Serine/threonine-protein kinase Chk1 (EC 2.7.1... | Homo sapiens (Human) | 476 | MAVPFVEDWDLVQTLGEGAYGEVQLAVNRVTEEAVAVKIVDMKRAV... | kinase |
| 4 | O15111 | Inhibitor of nuclear factor kappa-B kinase sub... | Homo sapiens (Human) | 745 | MERPPGLRPGAGGPWEMRERLGTGGFGNVCLYQHRELDLKIAIKSC... | kinase |


In [12]:
df.shape

(4264, 7)

In [13]:
df["family"].value_counts()

family
kinase           500
transporter      499
ligase           495
chaperone        490
transcription    484
hydrolase        445
ion_channel      420
receptor         418
protease         356
dna_binding      157
Name: count, dtype: int64
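
The per-family log lines add up to 4,987 unique sequences, while the saved table has 4,264 rows. One plausible explanation, assuming the final de-duplication runs across all families and keeps the first occurrence, is that proteins matching several keywords are counted only under the family fetched first. The toy example below illustrates that effect. The accessions are made up, and the real logic in `src/data_uniprot.py` may differ.

```python
import pandas as pd

# Made-up records: P2 and P3 match both "transcription" and "dna_binding".
frames = [
    pd.DataFrame({"uniprot_id": ["P1", "P2", "P3"],
                  "sequence": ["MKT", "MAA", "MGG"],
                  "family": "transcription"}),
    pd.DataFrame({"uniprot_id": ["P2", "P3", "P4"],
                  "sequence": ["MAA", "MGG", "MLL"],
                  "family": "dna_binding"}),
]

combined = pd.concat(frames, ignore_index=True)

# Global de-duplication: first by accession, then by sequence, keeping the
# first occurrence, so earlier families win any overlap.
deduped = (
    combined
    .drop_duplicates(subset="uniprot_id", keep="first")
    .drop_duplicates(subset="sequence", keep="first")
)

print(deduped["family"].value_counts().to_dict())
# {'transcription': 3, 'dna_binding': 1}
```

Under this assumption, families fetched late in the list and overlapping with earlier ones would shrink the most, which matches the pattern in the counts above.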

## Summary of this notebook

In this notebook we constructed the **raw protein family dataset** used throughout the project.  
This step ensures that all downstream processing (EDA → embeddings → models → interpretation) is based on a clean, reproducible and well-defined data source.

### ✔ What we accomplished

1. **Defined the set of functional protein families**  
   We selected 10 diverse UniProt keyword categories, covering enzymes, receptors, DNA-associated proteins and transport systems:

   - kinase  
   - transporter  
   - ion_channel  
   - transcription  
   - chaperone  
   - receptor  
   - hydrolase  
   - ligase  
   - dna_binding  
   - protease  

   These families give the project a biologically meaningful and sufficiently challenging multi-class classification task.

2. **Fetched annotated reviewed proteins from UniProtKB**
   - Used UniProt REST API (`https://rest.uniprot.org`).
   - Only **reviewed** Swiss-Prot entries were selected.
   - Applied length constraints (`50–1000 aa`) to match ESM-1b's context limit.
   - Requested up to 500 sequences per family.

3. **Cleaned the dataset**
   - Removed duplicated entries by `uniprot_id`.
   - Removed duplicated protein sequences to avoid trivial redundancy.
   - Ensured that all sequences are non-empty and valid FASTA sequences.

4. **Saved results for downstream processing**
   - Raw CSV:
     ```
     data/raw/protein_families_small.csv
     ```
   - FASTA file with headers:
     ```
     data/raw/raw_sequences_small.fasta
     ```

5. **Final dataset summary**
   - Total sequences: **4264**
   - All sequences within allowed length range for ESM-1b.
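   - Class sizes range from 157 (`dna_binding`) to 500 (`kinase`). Eight of the ten classes keep more than 400 sequences; `protease` (356) and especially `dna_binding` (157) are smaller, probably because many of their entries also match a family fetched earlier and were removed by de-duplication across families.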
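
The FASTA export from step 4 can be sketched as follows. The header layout (`>accession|family|organism`), the 60-residue line width and the helper name `to_fasta` are assumptions for illustration; the header format actually written by the project is not shown in this notebook.

```python
import textwrap

import pandas as pd


def to_fasta(df: pd.DataFrame, width: int = 60) -> str:
    """Render rows as FASTA text with sequences wrapped at `width` residues."""
    records = []
    for row in df.itertuples(index=False):
        # Hypothetical header layout: accession, family label, organism.
        header = f">{row.uniprot_id}|{row.family}|{row.organism}"
        body = "\n".join(textwrap.wrap(row.sequence, width))
        records.append(f"{header}\n{body}")
    return "\n".join(records) + "\n"


# Toy row for illustration only (not a real UniProt record).
toy = pd.DataFrame(
    {
        "uniprot_id": ["X1"],
        "family": ["kinase"],
        "organism": ["Homo sapiens (Human)"],
        "sequence": ["MKT" * 30],
    }
)
print(to_fasta(toy))
```

Putting the family label in the header keeps the FASTA file usable on its own, without joining back to the CSV.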

### ✔ Why this notebook matters

This notebook provides a **clean, well-controlled, reproducible source of protein data**.  
Every subsequent step of the project (EDA, embedding computation, ML modelling, SHAP interpretation and 3D visualization) depends on this standardized dataset.

With the raw data now fully prepared, we can move on to:

👉 `02_eda_and_fetch.ipynb`: exploratory analysis and quality checks of the assembled dataset.
