# MicroTPCT – Interactive Tutorial Notebook

## Introduction

Welcome to the **MicroTPCT interactive tutorial**.

This notebook is designed to help biologists and bioinformaticians:

* Understand what MicroTPCT does conceptually
* Learn how to run the pipeline step by step
* Explore intermediate objects (inputs, databases, matches)
* Perform exploratory analyses beyond the default CLI pipeline

Instead of calling `run_pipeline()` directly, we will reproduce the internal steps of the pipeline in a transparent and pedagogical way. This mirrors the implementation in `core/pipeline.py` while making each stage inspectable.


## 1. Imports and environment setup

We first import the main components of MicroTPCT and standard libraries.

We recommend creating a dedicated virtual environment.

In [14]:
!pip install -e ..
from pathlib import Path

from microtpct.io.readers import read_file, SequenceRole
from microtpct.io.validators import validate_protein_input, validate_peptide_input, validates_wildcards
from microtpct.io.converters import build_database
from microtpct.core.match import MATCHING_ENGINES
from microtpct.core.match.wildcards_matcher import run_wildcard_match
from microtpct.io.writers import write_outputs



## 2. Define input files and analysis parameters

Here we specify the target proteome and the list of query peptides whose microproteotypicity we want to check.

Some information about them is given below:

#### Input files
For this interactive tutorial, we provide fake data created specifically to test the pipeline.
* **Target:** `proteome.fasta` is a UniProt-like [1] FASTA file. If needed, you can also provide a tabular file containing accessions and sequences, as long as you explicitly specify the column names.
* **Query:** `peptides.xlsx` is by default a Proline [2] output file, but you can use any tabular format as long as it contains an accession column and a sequence column and you explicitly specify their names.

#### Wildcards
Sometimes, the target file may contain ambiguous amino acids, most likely written with the wildcard symbol `X`. This is why **MicroTPCT** offers optional wildcard-aware matching. *In any case, MicroTPCT will inform you if your target file contains wildcard symbols, even if you set `allow_wildcard` to `False`.*

#### Matching engines
MicroTPCT has several matching engines, which vary in speed and RAM consumption depending on how you use them and on your input files. In most cases, the Aho-Corasick algorithm (`matching_engine = "aho"`) should be suitable. *For more details about matching engine performance, please refer to the “Benchmark” section of the `README`.*
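
To give an intuition of why an Aho-Corasick engine scales well with many peptides, here is a minimal pure-Python sketch of the algorithm. It is not MicroTPCT's implementation: the names `build_automaton` and `find_all` are hypothetical, and the sketch assumes non-empty patterns and reports 0-based start positions.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns that end at each state
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Breadth-first traversal: failure links of shallow states are known first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (pattern_index, start_position) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((idx, i - len(patterns[idx]) + 1))
    return hits


toy_hits = find_all("MKLAKLA", ["KLA", "LAK", "AK"])
print(sorted(toy_hits))
```

The text is scanned only once, whatever the number of patterns, which is the main reason this family of algorithms suits large peptide lists.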

In [35]:
# Fake data for testing purposes. Feel free to try with your own!
target_file = Path("example_data/proteome.fasta")
query_file = Path("example_data/peptides.xlsx")

matching_engine = "find"          # "aho" by default
allow_wildcard = True             # Enable wildcard-aware matching
wildcards = "X"                   # Ambiguous amino acid symbol
analysis_name = "tutorial_run" 
output_path = Path("tutorial_run_output")
output_format = "csv"             # You can also choose "excel" if you prefer .xlsx files

# If needed, you can manually set target_format, query_format, target_separator, query_separator
# In most cases, MicroTPCT will be able to infer them by itself

## 3. Reading input sequences

MicroTPCT first reads protein (target) sequences and peptide (query) sequences using unified readers.

In [None]:
target_inputs = list(
    read_file(target_file, role=SequenceRole.PROTEIN)
) # Assign a role to each sequence so that it is handled appropriately throughout the process

query_inputs = list(
    read_file(query_file, role=SequenceRole.PEPTIDE)
)

print(f"Loaded {len(target_inputs)} target proteins")
print(f"Loaded {len(query_inputs)} query peptides")

Loaded 21999 target proteins
Loaded 206 query peptides


  warn("Workbook contains no default style, apply openpyxl's default")


At this point:

* Each target is an object holding a protein sequence and its accession
* Each query is an object holding a short peptide sequence and the accession of the protein it belongs to
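
To make the idea of an "input object" concrete, the sketch below parses a tiny in-memory FASTA text into lightweight records. It is only an illustration under stated assumptions: `ToyProteinInput` and `parse_fasta` are hypothetical names, not MicroTPCT classes, and the accession is assumed to be the second `|`-separated field of a UniProt-like header.

```python
from dataclasses import dataclass, field
from io import StringIO


@dataclass(frozen=True)
class ToyProteinInput:
    """Hypothetical stand-in for an input object: an accession plus a sequence."""
    accession: str
    sequence: str = field(repr=False)  # keep long sequences out of the repr


def parse_fasta(handle):
    """Yield one ToyProteinInput per FASTA record read from a text handle."""
    accession, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                yield ToyProteinInput(accession, "".join(chunks))
            header = (line[1:].split() or [""])[0]
            parts = header.split("|")
            # Assumption: UniProt-like headers look like ">db|ACCESSION|NAME ..."
            accession = parts[1] if len(parts) >= 3 else header
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        yield ToyProteinInput(accession, "".join(chunks))


toy_fasta = StringIO(">sp|P00001|TOY_PROT Toy protein\nMKLA\nKLA\n>Q00002\nMPEPTIDE\n")
toy_targets = list(parse_fasta(toy_fasta))
print(toy_targets[0], toy_targets[0].sequence)
```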

In [20]:
print(f"For instance, sequence of target protein {target_inputs[0]} is: {target_inputs[0].sequence}")
print(f"And sequence of query peptide {query_inputs[0]} is: {query_inputs[0].sequence}")

For instance, sequence of target protein ProteinInput(accession='A0A021WW32') is: MFYEHIILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPKVKLALRTSGHLLLGVVRIYSRKAKYLLADCNEAFVKIKMAFRPGMVDLPEGHREANVNAITLPEVFHDFDTALPELNDIDIEAQFSINQSRADEITMREDYGSLSLSLQDDGFGDIGFEAETPEIIRCSIPSNINDKIFDNDVLENIESLDPHSLDAHADMPGSRLDGDGFGDSFGQPALFEDDLFGDPSQPVEQITKESTTVLNADDSDEDAIDNIHNVPSPATSLVNSIEDEKEENNLNGHASVSENVPMNEITLVQNEDEGFALAPLDVSMYKGVTKAKRKRKLIIDEIKNISGEEMKAQLADTSDILTTLDLAPPTKRLMYWKETGGVEKLFSLPSRSIPARALFGNYNRQLFSHSTFFEDFSSVVPMEILALEFYTKENENALIIFNKKGRKRKNDNMSNLFLDHVPDSVVQSLEAPEVLRANHKSLGVSTVSVEIVSKEQESISCQNELTFFDNMRSPDLLSLNEMEQFSSINELPLTPRNMNHEMGDDFNQGDSTPAGLDHGHATPQHGNIGEMDHDSVIPTKKTAVILNESVGTSVLSDNGVSKRTNNILKGWDNYEIPSFVGQGIRHAGHCYQQYH
And sequence of query peptide PeptideInput(accession='P13607-4') is: LNIPVSEVNPR


## 4. Wildcard configuration and validation

As previously mentioned, some protein databases may contain ambiguous amino acids (for example `X`). MicroTPCT can optionally enable wildcard-aware matching.

In [None]:
effective_allow_wildcard = allow_wildcard

if allow_wildcard and not wildcards: # If no wildcard symbol provided
    print("Warning: wildcard matching enabled but no wildcard provided → disabling")
    effective_allow_wildcard = False

if wildcards: 
    # Format wildcard(s) in a set
    wildcards = set(wildcards) if isinstance(wildcards, list) else {wildcards}

    # The wildcards are validated to make sure that the provided symbol(s)
    # do not overlap with the amino acid characters.
    # If they overlap, MicroTPCT raises an error.
    validates_wildcards(wildcards)

    print(f"Wildcard characters enabled: {wildcards}")

Wildcard characters enabled: {'X'}
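
The overlap check described in the comments above can be pictured as a simple set intersection. The following is a sketch, not the body of `validates_wildcards`: the function name `check_wildcards` is hypothetical, and the alphabet is assumed to be the 20 standard amino acids (the real alphabet accepted by MicroTPCT may be larger).

```python
# Assumption: the 20 standard amino acids; MicroTPCT's own alphabet may differ.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def check_wildcards(wildcards):
    """Return the wildcard set, or raise if a symbol is unusable."""
    wildcards = set(wildcards)
    for symbol in wildcards:
        if len(symbol) != 1:
            raise ValueError(f"Wildcard {symbol!r} must be a single character")
    overlap = wildcards & STANDARD_AMINO_ACIDS
    if overlap:
        raise ValueError(f"Wildcards overlap with amino acids: {sorted(overlap)}")
    return wildcards


print(check_wildcards("X"))
```

Rejecting overlapping symbols early avoids a situation where a real residue such as `L` would silently match any peptide position.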


## 5. Validation of protein and peptide inputs

All sequences are validated before building the databases. MicroTPCT checks the integrity of all input objects. This step also tests for the presence of wildcard symbols in target sequences, allowing them to be flagged if required.

### 5.1 Protein validation and detection of wildcards

In [22]:
n_with_wildcards = 0

for obj in target_inputs:
    wildcards_detected = validate_protein_input(obj, wildcards)

    # Count protein sequences that contain wildcard(s), even if allow_wildcard = False, to inform the user
    if wildcards_detected:
        n_with_wildcards += 1

    if effective_allow_wildcard: # Flag only if allow_wildcard = True; this creates the contain_wildcards attribute on the target input object.
        object.__setattr__(obj, "contain_wildcards", wildcards_detected)

print("All target proteins are valid")
print(f"Proteins containing wildcards: {n_with_wildcards}")

All target proteins are valid
Proteins containing wildcards: 466


### 5.2 Peptide validation

In [23]:
for obj in query_inputs:
    validate_peptide_input(obj)

print("All query peptides are valid")

All query peptides are valid


At this point, all sequences are guaranteed to follow valid amino-acid syntax.
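
As an illustration of what "valid amino-acid syntax" can mean, here is a minimal sketch of a validator that rejects unexpected characters and reports whether a wildcard was seen. The name `validate_sequence` and the 20-letter alphabet are assumptions made for this example; the actual rules applied by `validate_protein_input` and `validate_peptide_input` may be stricter or more permissive.

```python
# Assumption: only the 20 standard residues are accepted besides wildcards.
VALID_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence, wildcards=frozenset()):
    """Raise ValueError on invalid input; return True if a wildcard is present."""
    if not sequence:
        raise ValueError("Empty sequence")
    residues = set(sequence)
    allowed_wildcards = set(wildcards)
    invalid = residues - VALID_RESIDUES - allowed_wildcards
    if invalid:
        raise ValueError(f"Invalid residue(s): {sorted(invalid)}")
    return bool(residues & allowed_wildcards)


print(validate_sequence("LNIPVSEVNPR"))         # peptide without wildcard
print(validate_sequence("MKXAKLA", {"X"}))      # protein containing a wildcard
```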

## 6. Building the target and query databases

MicroTPCT converts raw sequence objects into optimized internal databases for matching.

In [None]:
from microtpct.core.pipeline import _inject_wildcard_metadata

target_db = build_database(target_inputs, role=SequenceRole.PROTEIN)

if effective_allow_wildcard:
    # When wildcard matching is enabled, MicroTPCT dynamically augments the TargetDB object with additional metadata 
    # and helper methods to isolate sequences containing ambiguous residues.
    _inject_wildcard_metadata(target_db, target_inputs)

query_db = build_database(query_inputs, role=SequenceRole.PEPTIDE)

print(f"TargetDB: {target_db.size} sequences")
print(f"QueryDB: {query_db.size} peptides ({query_db.n_unique_accessions()} unique accessions)")

TargetDB: 21999 sequences
QueryDB: 206 peptides (81 unique accessions)


These database objects provide:

* Fast indexed access to sequences
* Accession management
* Optional handling of ambiguous residues

In [28]:
# For a better understanding, let's display the target_db database
target_db.to_dataframe()

| | id | accession | sequence | ambiguous_il_sequence | contain_wildcard |
|---|---|---|---|---|---|
| 0 | T000001 | A0A021WW32 | MFYEHIILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQ... | MFYEHJJJAKKGPJARJWJAAHWDKKJTKAHVFETNJEKSVEGJJQ... | False |
| 1 | T000002 | A0A023GRW4 | MMGSPGSQASAIATSVGIRSGRRGQAGGSLLLRLLAVTFVLAACHA... | MMGSPGSQASAJATSVGJRSGRRGQAGGSJJJRJJAVTFVJAACHA... | False |
| 2 | T000003 | A0A0B4JCQ5 | MNSLARVFSNFRDFYNDINAATLTGAIDVIVVEQRDGEFQCSPFHV... | MNSJARVFSNFRDFYNDJNAATJTGAJDVJVVEQRDGEFQCSPFHV... | True |
| 3 | T000004 | A0A0B4JCQ7 | MPFPSLQECEQMVQMLRVVELQKILSFLNISFAGRKTDLQSRILSF... | MPFPSJQECEQMVQMJRVVEJQKJJSFJNJSFAGRKTDJQSRJJSF... | False |
| 4 | T000005 | A0A0B4JCS1 | MKLGDSGEAFFVEECLEDEDEELPANLATSPIPNSFLASRDKANDT... | MKJGDSGEAFFVEECJEDEDEEJPANJATSPJPNSFJASRDKANDT... | False |
| ... | ... | ... | ... | ... | ... |
| 21994 | T021995 | X2JKH0 | MSKLSDTKIPITEFLEAYRRQPCLYNTLLDSYKNRVSREEAYGAII... | MSKJSDTKJPJTEFJEAYRRQPCJYNTJJDSYKNRVSREEAYGAJJ... | False |
| 21995 | T021996 | X2JL95 | MPPQLMSQMENGGEEEGCRLECKEMDPLRLTTLILSADPRYRIKPM... | MPPQJMSQMENGGEEEGCRJECKEMDPJRJTTJJJSADPRYRJKPM... | False |
| 21996 | T021997 | X2JLC4 | MSQKKSNISHLVIQCIDAMDGFASKEMILKAVSKATDKKFWKMGRR... | MSQKKSNJSHJVJQCJDAMDGFASKEMJJKAVSKATDKKFWKMGRR... | False |
| 21997 | T021998 | X2JLD3 | MSQKKSNISHLVIQCIDAMDGFASKEMILKAVSKATDKKFWKMGRR... | MSQKKSNJSHJVJQCJDAMDGFASKEMJJKAVSKATDKKFWKMGRR... | False |


Here is a description of each attribute:

* `id`: Internal unique ID, which begins with "**T**" for `target_db` and "**Q**" for `query_db`. Especially useful for peptides in `query_db`, since the same protein accession can refer to multiple peptides.
* `accession`: Protein accession retrieved from the input file
* `sequence`: Protein or peptide sequence retrieved from the input file
* `ambiguous_il_sequence`: Protein or peptide sequence in which "**L**" and "**I**" have been substituted by "**J**" to enable matching, as leucine and isoleucine are not distinguishable by mass spectrometry.
* `contain_wildcard`: *(Only for `target_db`)* Flag indicating whether the sequence contains wildcard symbol(s)

Both `target_db` and `query_db` have the same format (except for the `contain_wildcard` column).
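
The `ambiguous_il_sequence` column can be reproduced with a one-line character translation. This is a sketch of the substitution described above, with a hypothetical helper name `to_ambiguous_il`; it is not taken from the MicroTPCT source.

```python
# Translation table mapping both isoleucine (I) and leucine (L) to J.
IL_TO_J = str.maketrans({"I": "J", "L": "J"})


def to_ambiguous_il(sequence):
    """Return the sequence with every I and L replaced by J."""
    return sequence.translate(IL_TO_J)


print(to_ambiguous_il("MFYEHIILAKK"))   # start of the first target sequence
print(to_ambiguous_il("LNIPVSEVNPR"))   # first query peptide
```

Applying the same substitution to targets and queries means that a plain string comparison on the transformed sequences treats I and L as equivalent.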

## 7. Selection and execution of the matching engine

As already mentioned, MicroTPCT supports multiple matching backends. Here we perform the **strict matching step**, which is efficient but unable to match sequences containing wildcards.

In [30]:
matching_func = MATCHING_ENGINES[matching_engine]

result_strict_matching = matching_func(target_db, query_db)

print(f"Strict matches found: {len(result_strict_matching)}")

Strict matches found: 338



Each match typically stores:

* Query peptide ID
* Target protein ID
* Match position(s)
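
The three fields listed above can be illustrated with a naive substring search based on `str.find`. This is a sketch under assumptions: the helper names are hypothetical, sequences are assumed to be already I/L-normalized, and positions are 0-based here (the convention used in MicroTPCT's own output is not specified in this tutorial).

```python
def find_all_positions(target, peptide):
    """Return every start index of peptide in target, overlaps included."""
    positions = []
    start = target.find(peptide)
    while start != -1:
        positions.append(start)
        start = target.find(peptide, start + 1)
    return positions


def strict_match(targets, queries):
    """Return (query_id, target_id, position) tuples for all exact occurrences."""
    rows = []
    for query_id, peptide in queries.items():
        for target_id, sequence in targets.items():
            for position in find_all_positions(sequence, peptide):
                rows.append((query_id, target_id, position))
    return rows


# Toy databases keyed by internal IDs (values are I/L-normalized sequences).
toy_target_seqs = {"T000001": "MKJAKJA", "T000002": "MPEPTJDE"}
toy_query_seqs = {"Q000001": "KJA", "Q000002": "PTJ"}
toy_rows = strict_match(toy_target_seqs, toy_query_seqs)
print(toy_rows)
```

This nested loop costs one scan per (peptide, protein) pair, which helps explain why the choice of engine matters on large inputs.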

In [31]:
# Again, we can display the result to understand it better
result_strict_matching.to_dataframe()

| | query_id | target_id | position |
|---|---|---|---|
| 0 | Q000012 | T000019 | 1858 |
| 1 | Q000013 | T000019 | 777 |
| 2 | Q000014 | T000019 | 1856 |
| 3 | Q000015 | T000019 | 988 |
| 4 | Q000016 | T000019 | 1467 |
| ... | ... | ... | ... |
| 333 | Q000189 | T019365 | 69 |
| 334 | Q000139 | T019609 | 103 |
| 335 | Q000140 | T019609 | 144 |
| 336 | Q000201 | T019717 | 68 |


## 8. Wildcard-aware matching (optional)

If wildcard matching is enabled and ambiguous residues were detected, a second, wildcard-aware pass is performed.

In [None]:
result_wildcard_matching = None # Initialize with None to avoid a later NameError (name 'result_wildcard_matching' is not defined)

if effective_allow_wildcard and n_with_wildcards > 0:
    # Extract only the target sequences that contain wildcards
    wildcard_targets = target_db.get_wildcard_targets()

    result_wildcard_matching = run_wildcard_match(
        wildcard_targets,
        query_db,
        wildcards,
    )

    print(f"Wildcard matches found: {len(result_wildcard_matching)}")

Wildcard matches found: 35


As `result_wildcard_matching` is an object of the same class as `result_strict_matching`, it exposes the same information, attributes, and methods.
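
To illustrate what a wildcard-aware comparison involves, the sketch below slides the peptide along the target and lets any wildcard residue in the target match any peptide residue. It is not the logic of `run_wildcard_match`: the function name is hypothetical, and the choice to report only windows that actually contain a wildcard (so that exact hits from the strict pass are not counted twice) is an assumption of this example.

```python
def wildcard_positions(target, peptide, wildcards):
    """Return 0-based start positions where peptide matches through a wildcard."""
    hits = []
    width = len(peptide)
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        # Assumption: windows without wildcards belong to the strict pass.
        if not any(residue in wildcards for residue in window):
            continue
        if all(t == p or t in wildcards for t, p in zip(window, peptide)):
            hits.append(start)
    return hits


print(wildcard_positions("MKXAKJA", "KJA", {"X"}))
```

Because every window must be compared residue by residue, this kind of pass is slower than strict matching, which is consistent with restricting it to the targets flagged as containing wildcards.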

## 9. Writing output files

MicroTPCT can export:

* A detailed match table
* A statistics summary file

In [37]:
output_path.mkdir(exist_ok=True)

# This function compiles all results into a comprehensive format and computes statistics about them
result_file, stats_file = write_outputs(
    output_path=output_path,
    output_format=output_format,
    analysis_name=analysis_name,
    query_db=query_db,
    target_db=target_db,
    result_strict=result_strict_matching,
    result_wildcard=result_wildcard_matching,
    n_target_with_wildcards=n_with_wildcards,
    matching_engine=matching_engine,
    allow_wildcard=effective_allow_wildcard,
    wildcards=wildcards if wildcards else None,
)

print("Results written to:", result_file)
print("Statistics written to:", stats_file)

Results written to: tutorial_run_output\microtpct_matching_result_tutorial_run_20260125_233011.csv
Statistics written to: tutorial_run_output\microtpct_statistics_tutorial_run_20260125_233011.csv


## 10. Exploratory analysis ideas

This notebook can now be extended with custom analyses, for example:

* Length distribution of peptides that match vs. those that do not
* Number of peptides per protein
* Proteins enriched in micropeptide hits
* Comparison between strict and wildcard matches

Example:

In [None]:
# Example: count matches per target protein
from collections import Counter

protein_hits = Counter([m.target_id for m in result_strict_matching])

protein_hits.most_common(10)
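
Another possible direction is the comparison between strict and wildcard matches. The sketch below works on plain `(query_id, target_id, position)` tuples built by hand, so it runs without MicroTPCT; the toy rows and the `summarize_matches` helper are hypothetical, and real results would first need to be converted to such tuples (for example from `to_dataframe()`).

```python
from collections import Counter

# Toy rows standing in for real results: (query_id, target_id, position).
strict_rows = [("Q000001", "T000001", 1), ("Q000001", "T000001", 4),
               ("Q000002", "T000002", 3)]
wildcard_rows = [("Q000002", "T000003", 10), ("Q000003", "T000003", 7)]
all_query_ids = {"Q000001", "Q000002", "Q000003", "Q000004"}


def summarize_matches(strict_rows, wildcard_rows, all_query_ids):
    """Group query IDs by the kind of match they obtained."""
    strict_queries = {q for q, _, _ in strict_rows}
    wildcard_queries = {q for q, _, _ in wildcard_rows}
    # Count distinct target proteins hit by each query, across both passes.
    pairs = {(q, t) for q, t, _ in strict_rows + wildcard_rows}
    return {
        "strict_only": sorted(strict_queries - wildcard_queries),
        "wildcard_only": sorted(wildcard_queries - strict_queries),
        "both": sorted(strict_queries & wildcard_queries),
        "unmatched": sorted(all_query_ids - strict_queries - wildcard_queries),
        "targets_per_query": Counter(q for q, _ in pairs),
    }


summary = summarize_matches(strict_rows, wildcard_rows, all_query_ids)
print(summary)
```

Queries whose `targets_per_query` count is greater than one hit several target proteins and may deserve a closer look.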

## Conclusion

This tutorial illustrated the full MicroTPCT pipeline step by step:

1. Reading inputs
2. Validating sequences
3. Building databases
4. Running matching engines
5. Handling ambiguous residues
6. Exporting and exploring results

For automated production runs, we recommend using the high-level function:

In [None]:
from microtpct.core.pipeline import run_pipeline

# and enable logging if needed

For teaching, debugging, and research exploration, this notebook provides a transparent and extensible workflow.

Happy micropeptide hunting!

## References

1. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D609–D617. https://doi.org/10.1093/nar/gkae1010

2. Bouyssié D, Hesse AM, Mouton-Barbosa E, Rompais M, Macron C, Carapito C, Gonzalez de Peredo A, Couté Y, Dupierris V, Burel A, Menetrey JP, Kalaitzakis A, Poisat J, Romdhani A, Burlet-Schiltz O, Cianférani S, Garin J, Bruley C. Proline: an efficient and user-friendly software suite for large-scale proteomics. Bioinformatics. 2020 May 1;36(10):3148-3155. https://doi.org/10.1093/bioinformatics/btaa118