# MicroTPCT – Step-by-step interactive tutorial


## Introduction

Welcome to the **MicroTPCT interactive tutorial**.

This notebook is designed for biologists and bioinformaticians who want to:

- Understand the conceptual workflow of MicroTPCT  
- Learn how to run the pipeline step by step  
- Inspect intermediate objects (inputs, databases, matches)  
- Perform exploratory analyses beyond the default CLI pipeline  

Instead of calling `run_pipeline()` directly, we reproduce the internal steps of the pipeline in a transparent and pedagogical way. This notebook mirrors the implementation in `core/pipeline.py` while making each stage inspectable and modifiable.


## Overview of the MicroTPCT pipeline

MicroTPCT implements the following workflow:

1. Read protein and peptide input files  
2. Validate sequence integrity and detect ambiguous residues  
3. Build optimized internal databases  
4. Perform strict exact matching  
5. Perform optional wildcard-aware matching  
6. Merge results and export summary tables  

In this notebook, we explicitly reproduce each of these steps.

## 1. Imports and environment setup

We recommend creating a dedicated virtual environment and installing MicroTPCT in editable mode from the root of the repository:

In [None]:
# `!cd` runs in a temporary subshell and does not persist, so use the %cd magic instead
%cd MicroTPCT
%pip install -e .

This allows the notebook to directly import the local development version of MicroTPCT.

Then, we import the main components of MicroTPCT and standard libraries.

In [None]:
from pathlib import Path

from microtpct.io.readers import read_file, SequenceRole
from microtpct.io.validators import validate_target_input, validate_query_input, validates_wildcards
from microtpct.io.converters import build_database
from microtpct.core.match import MATCHING_ENGINES
from microtpct.core.match.wildcards_matcher import run_wildcard_match
from microtpct.io.writers import write_outputs

## 2. Define input files and analysis parameters

Here we specify the target proteome and the list of query peptides whose microproteotypicity we want to check.

Some information about these inputs and parameters is given below:

#### Input files
For this interactive tutorial, we provide synthetic data specifically designed to test the pipeline.

- **Target**: `proteome.fasta` is a UniProt-like [1] FASTA file. Alternatively, you may provide a tabular file containing accessions and sequences, explicitly specifying the column names.

- **Query**: `peptides.xlsx` is by default a Proline [2] output file, but any tabular format can be used as long as it contains an accession column and a sequence column, which must be explicitly specified.
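
To make the target format concrete, here is a minimal sketch of how a UniProt-like FASTA file can be turned into (accession, sequence) pairs. The `parse_fasta` helper and the example records are hypothetical and written for illustration only; this is not the reader MicroTPCT uses internally.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (accession, sequence) pairs from a FASTA-formatted text stream."""
    accession, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if accession is not None:
                yield accession, "".join(chunks)
            # UniProt-style header: >db|ACCESSION|ENTRY_NAME description
            parts = line[1:].split("|")
            accession = parts[1] if len(parts) >= 3 else line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


# Synthetic example data (not taken from example_data/proteome.fasta)
example = StringIO(
    ">sp|P00001|DEMO_HUMAN Synthetic demo protein\n"
    "MKTLLVAG\n"
    "XLPEPTIDE\n"
    ">sp|P00002|TEST_HUMAN Another synthetic protein\n"
    "MSEQVENCE\n"
)
records = list(parse_fasta(example))
print(records)
```

Sequences split over several lines are concatenated, which is why each record is only emitted once the next header (or the end of the file) is reached.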

#### Wildcards
Sometimes, the target file may contain ambiguous amino acids. In most proteomics datasets, the wildcard symbol is typically `X`.  
For this reason, MicroTPCT offers an optional wildcard-aware matching mode.

In any case, MicroTPCT will inform you if your target file contains wildcard symbols, even if `allow_wildcard` is set to `False`.

#### Matching engines
MicroTPCT provides several matching engines that differ in speed and memory consumption depending on the size and structure of the input data.  

In most use cases, the Aho–Corasick algorithm (`matching_engine = "aho"`) provides an excellent trade-off between speed and memory efficiency.  

For detailed performance comparisons, please refer to the “Benchmark” section of the README.
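
To give an intuition of why Aho–Corasick scales well with many peptides, the sketch below implements a small pure-Python version of the algorithm: all patterns are compiled into one automaton, and the text is then scanned once. The function names (`build_automaton`, `find_all`) and the sequences are hypothetical; the `aho` engine of MicroTPCT may be implemented differently, for example on top of a dedicated library.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns recognised in each state
    for idx, pattern in enumerate(patterns):
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        out[state].append(idx)
    # Breadth-first traversal so that shallower states are finalised first
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (pattern, 0-based start) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    hits, state = [], 0
    for pos, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for idx in out[state]:
            hits.append((patterns[idx], pos - len(patterns[idx]) + 1))
    return hits


peptides = ["PEP", "EPT", "TIDE", "KKK"]
protein = "MPEPTIDEPEP"
hits = sorted(find_all(protein, peptides))
print(hits)
```

The protein is read only once regardless of the number of peptides, whereas a naive approach searches the protein again for each peptide.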

In [None]:
# Synthetic data for testing purposes. Feel free to try with your own files!
target_file = Path("example_data/proteome.fasta")
query_file = Path("example_data/peptides.xlsx")

matching_engine = "find"          # Engine used in this tutorial; the default is "aho"
allow_wildcard = True             # Enable wildcard-aware matching
wildcards = "X"                   # Ambiguous amino acid symbol
analysis_name = "tutorial_run" 
output_path = Path("tutorial_run_output")
output_format = "csv"             # You can also choose "excel" if you prefer .xlsx files

# If needed, you can manually set target_format, query_format, target_separator and query_separator.
# In most cases, MicroTPCT will be able to infer them by itself.

## 3. Reading input sequences

MicroTPCT first reads protein (target) sequences and peptide (query) sequences using unified readers.

In [None]:
target_inputs = list(
    read_file(target_file, role=SequenceRole.TARGET)
) # Assign a role to each sequence so that it can be processed appropriately in the pipeline

query_inputs = list(
    read_file(query_file, role=SequenceRole.QUERY)
)

print(f"Loaded {len(target_inputs)} target proteins")
print(f"Loaded {len(query_inputs)} query peptides")

At this point:

* Each target is an object holding a protein sequence and its accession
* Each query is an object holding a short peptide sequence and the accession of the protein it belongs to

In [None]:
print(f"For instance, sequence of target protein {target_inputs[0]} is: {target_inputs[0].sequence}")
print(f"And sequence of query peptide {query_inputs[0]} is: {query_inputs[0].sequence}")

## 4. Wildcard configuration and validation

As previously mentioned, some protein databases may contain ambiguous amino acids (for example `X`). MicroTPCT can optionally enable wildcard-aware matching.

In [None]:
effective_allow_wildcard = allow_wildcard

if allow_wildcard and not wildcards: # If no wildcard symbol provided
    print("Warning: wildcard matching enabled but no wildcard provided → disabling")
    effective_allow_wildcard = False

if wildcards:
    # Format wildcard(s) as a set (also accepts a set, so the cell can be re-run safely)
    wildcards = set(wildcards) if isinstance(wildcards, (list, set, tuple)) else {wildcards}

    # To avoid overlapping between provided wildcard symbol(s) and the amino acid characters,
    # the wildcards are validated.
    # In the case of overlapping, MicroTPCT raises an error.
    validates_wildcards(wildcards)

    print(f"Wildcard characters enabled: {wildcards}")
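
The overlap check described in the comments above can be sketched as follows. This is an illustration under stated assumptions, not the actual body of `validates_wildcards`: the `check_wildcards` name, the 20-letter alphabet, and the error messages are all hypothetical.

```python
# Assumed alphabet: the 20 standard amino acids (MicroTPCT may accept more symbols)
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_wildcards(wildcards):
    """Raise ValueError if a wildcard is not one character or clashes with an amino acid."""
    for symbol in wildcards:
        if len(symbol) != 1:
            raise ValueError(f"Wildcard {symbol!r} must be a single character")
    clashes = set(wildcards) & AMINO_ACIDS
    if clashes:
        raise ValueError(f"Wildcards overlap with amino acids: {sorted(clashes)}")


check_wildcards({"X"})  # passes silently: X is not a standard residue

try:
    check_wildcards({"X", "L"})  # L is leucine, so this must be rejected
except ValueError as err:
    message = str(err)
    print(message)
```

Rejecting such overlaps early matters because a wildcard that is also a real residue would make every occurrence of that residue match any amino acid.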

## 5. Validation of target and query inputs

All sequences are validated before the databases are built: MicroTPCT checks the integrity of every input object. This step also detects wildcard symbols in target sequences, so that they can be flagged if required.

### 5.1 Protein validation and detection of wildcards

In [None]:
n_with_wildcards = 0

for obj in target_inputs:
    wildcards_detected = validate_target_input(obj, wildcards)

    # Count proteins containing wildcard(s) even when allow_wildcard is False, so the user is always informed
    if wildcards_detected:
        n_with_wildcards += 1

    # Flag the object only when wildcard matching is enabled; this creates the
    # contain_wildcards attribute on the target input object.
    if effective_allow_wildcard:
        object.__setattr__(obj, "contain_wildcards", wildcards_detected)

print("All target proteins are valid")
print(f"Proteins containing wildcards: {n_with_wildcards}")

### 5.2 Peptide validation

In [None]:
for obj in query_inputs:
    validate_query_input(obj)

print("All query peptides are valid")

At this point, all sequences are guaranteed to follow valid amino-acid syntax.
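
As an illustration of what "valid amino-acid syntax" can mean in practice, here is a minimal sketch of a validator that rejects unexpected characters and reports whether a wildcard is present. The `validate_sequence` helper and its alphabet are assumptions made for this example; the real `validate_target_input` and `validate_query_input` functions may apply different or additional rules.

```python
# Assumed alphabet: the 20 standard amino acids
VALID_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence, wildcards=frozenset()):
    """Return True if the sequence contains a wildcard; raise ValueError if it is invalid."""
    if not sequence:
        raise ValueError("Empty sequence")
    residues = set(sequence.upper())
    invalid = residues - VALID_RESIDUES - set(wildcards)
    if invalid:
        raise ValueError(f"Invalid characters: {sorted(invalid)}")
    return bool(residues & set(wildcards))


has_wildcard = validate_sequence("MKTXLLV", wildcards={"X"})  # X is tolerated and reported
no_wildcard = validate_sequence("PEPTIDE")
print(has_wildcard, no_wildcard)
```

Returning the wildcard flag from the validator mirrors the way section 5.1 counts wildcard-containing proteins while validating them.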

## 6. Building the target and query databases

MicroTPCT converts raw sequence objects into optimized internal databases for matching.

In [None]:
from microtpct.core.pipeline import _inject_wildcard_metadata

target_db = build_database(target_inputs, role=SequenceRole.TARGET)

if effective_allow_wildcard:
    # When wildcard matching is enabled, MicroTPCT dynamically augments the TargetDB object with additional metadata 
    # and helper methods to isolate sequences containing ambiguous residues.
    _inject_wildcard_metadata(target_db, target_inputs)

query_db = build_database(query_inputs, role=SequenceRole.QUERY)

print(f"TargetDB: {target_db.size} sequences")
print(f"QueryDB: {query_db.size} peptides ({query_db.n_unique_accessions()} unique accessions)")

These database objects provide:

* Fast indexed access to sequences
* Accession management
* Optional handling of ambiguous residues

In [None]:
# For a better understanding, let's display the content of target_db
target_db.to_dataframe()

Here is a description of each attribute:

- id: Internal unique identifier, starting with "T" for `target_db` and "Q" for `query_db`. This is especially useful for peptides, since the same protein accession may refer to multiple peptide entries.

- accession: Protein accession retrieved from the input file  

- sequence: Protein or peptide sequence retrieved from the input file  

- ambiguous_il_sequence: Sequence in which leucine (L) and isoleucine (I) are replaced by "J", since these residues are indistinguishable by mass spectrometry  

- contain_wildcard: *(`target_db` only)* Flag indicating whether the sequence contains wildcard symbols  

Both `target_db` and `query_db` share the same structure, except for the `contain_wildcard` column.
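
The I/L ambiguity encoding can be reproduced with a simple character translation. The sketch below is an assumption about how such a column could be computed (the `to_ambiguous_il` name is hypothetical), not the code used by `build_database`.

```python
# Map both isoleucine (I) and leucine (L) to the ambiguity code J
IL_TO_J = str.maketrans({"I": "J", "L": "J"})


def to_ambiguous_il(sequence):
    """Return the sequence with every I and L replaced by J."""
    return sequence.translate(IL_TO_J)


encoded = to_ambiguous_il("PEPTIDEL")
print(encoded)

# Two peptides that differ only by I/L become identical after encoding
same = to_ambiguous_il("LEAK") == to_ambiguous_il("IEAK")
print(same)
```

Matching on the encoded sequences therefore treats peptides that differ only by I/L as equivalent.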


## 7. Selection and execution of the matching engine

This step performs strict exact matching. It is computationally efficient but cannot match sequences containing wildcard characters.

In [None]:
matching_func = MATCHING_ENGINES[matching_engine]

result_strict_matching = matching_func(target_db, query_db)

print(f"Strict matches found: {len(result_strict_matching)}")


Each match typically stores:

* Query peptide ID
* Target protein ID
* Match position(s)
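
To illustrate the kind of record listed above, here is a minimal sketch of strict exact matching based on `str.find`, which collects every (possibly overlapping) occurrence of each peptide. The `Match` dataclass, the `find_positions` helper, and the toy sequences are hypothetical; the result object returned by MicroTPCT has its own structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    query_id: str
    target_id: str
    positions: tuple  # 0-based start offsets in the target sequence


def find_positions(target, query):
    """Return all start offsets of query in target, overlapping hits included."""
    positions = []
    start = target.find(query)
    while start != -1:
        positions.append(start)
        start = target.find(query, start + 1)
    return tuple(positions)


# Toy data with internal IDs in the "T"/"Q" style described earlier
targets = {"T1": "MPEPTIDEPEP", "T2": "MKKLV"}
queries = {"Q1": "PEP", "Q2": "KK", "Q3": "WWW"}

matches = []
for q_id, q_seq in queries.items():
    for t_id, t_seq in targets.items():
        positions = find_positions(t_seq, q_seq)
        if positions:
            matches.append(Match(q_id, t_id, positions))

for match in matches:
    print(match)
```

A peptide without any hit (here `Q3`) simply produces no record.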

In [None]:
# Again, we can display the result to better understand its content
result_strict_matching.to_dataframe()

## 8. Wildcard-aware matching (optional)

If wildcard matching is enabled and ambiguous residues were detected, a second, wildcard-aware matching pass is performed on the affected targets.

In [None]:
result_wildcard_matching = None # Initialize to None so the variable is always defined for the export step

if effective_allow_wildcard and n_with_wildcards > 0:
    # Extract only the target sequences that contain wildcards
    wildcard_targets = target_db.get_wildcard_targets()

    result_wildcard_matching = run_wildcard_match(
        wildcard_targets,
        query_db,
        wildcards,
    )

    print(f"Wildcard matches found: {len(result_wildcard_matching)}")

Since `result_wildcard_matching` is an object of the same class as `result_strict_matching`, it contains the same information, attributes, and methods.
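
Conceptually, wildcard-aware matching treats each wildcard position of the target as compatible with any residue of the peptide. The sliding-window sketch below illustrates this idea under that assumption; `wildcard_positions` is a hypothetical helper, and `run_wildcard_match` may use a different and more efficient strategy.

```python
def wildcard_positions(target, query, wildcards):
    """Return start offsets where query fits target, target wildcards matching any residue."""
    hits = []
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        # A position is compatible if residues are equal or the target holds a wildcard
        if all(t == q or t in wildcards for t, q in zip(window, query)):
            hits.append(start)
    return hits


target = "MPEXTIDE"  # synthetic protein with one ambiguous residue
through_wildcard = wildcard_positions(target, "PEPT", {"X"})
print(through_wildcard)

# Without any declared wildcard, the same peptide is not found
strict_only = wildcard_positions(target, "PEPT", set())
print(strict_only)
```

Each window is compared residue by residue, which costs more than exact matching; this is consistent with restricting the second pass to targets that actually contain wildcards.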

## 9. Writing output files

MicroTPCT can export:

* A detailed match table
* A statistics summary file

In [None]:
output_path.mkdir(exist_ok=True)

# This function compiles all results into a comprehensive format and computes summary statistics
result_file, stats_file = write_outputs(
    output_path=output_path,
    output_format=output_format,
    analysis_name=analysis_name,
    query_db=query_db,
    target_db=target_db,
    result_strict=result_strict_matching,
    result_wildcard=result_wildcard_matching,
    n_target_with_wildcards=n_with_wildcards,
    matching_engine=matching_engine,
    allow_wildcard=effective_allow_wildcard,
    wildcards=wildcards if wildcards else None,
)

print("Results written to:", result_file)
print("Statistics written to:", stats_file)

## Conclusion

This tutorial illustrated the complete MicroTPCT pipeline step by step:

1. Reading inputs
2. Validating sequences
3. Building databases
4. Running matching engines
5. Handling ambiguous residues
6. Exporting results

For automated production runs, we recommend using the high-level function:

In [None]:
from microtpct.core.pipeline import run_pipeline

For teaching, debugging, and research exploration, this notebook provides a transparent and extensible workflow.

Happy micropeptide hunting!

## References

1. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research. 2025 Jan 6;53(D1):D609–D617. https://doi.org/10.1093/nar/gkae1010

2. Bouyssié D, Hesse AM, Mouton-Barbosa E, Rompais M, Macron C, Carapito C, Gonzalez de Peredo A, Couté Y, Dupierris V, Burel A, Menetrey JP, Kalaitzakis A, Poisat J, Romdhani A, Burlet-Schiltz O, Cianférani S, Garin J, Bruley C. Proline: an efficient and user-friendly software suite for large-scale proteomics. Bioinformatics. 2020 May 1;36(10):3148-3155. https://doi.org/10.1093/bioinformatics/btaa118