# MicroTPCT – Interactive Tutorial Notebook

## Introduction

Welcome to the **MicroTPCT interactive tutorial**.

This notebook is designed to help biologists and bioinformaticians:

* Understand what MicroTPCT does conceptually
* Learn how to run the pipeline step by step
* Explore intermediate objects (inputs, databases, matches)
* Perform exploratory analyses beyond the default CLI pipeline

Instead of calling `run_pipeline()` directly, we will reproduce the internal steps of the pipeline in a transparent and pedagogical way. This mirrors the implementation in `core/pipeline.py` while making each stage inspectable.


## 1. Imports and environment setup

We first import the main components of MicroTPCT and standard libraries.

In [None]:
from pathlib import Path

from microtpct.io.readers import read_file, SequenceRole
from microtpct.io.validators import validate_protein_input, validate_peptide_input, validates_wildcards
from microtpct.io.converters import build_database
from microtpct.core.match import MATCHING_ENGINES
from microtpct.core.match.wildcards_matcher import run_wildcard_match
from microtpct.io.writers import write_outputs

## 2. Define input files and analysis parameters

Here we specify the target proteome and the list of query peptides.

In [None]:
target_file = Path("data/example_targets.fasta")
query_file = Path("data/example_queries.fasta")

matching_engine = "aho"          # Aho–Corasick exact matching
allow_wildcard = True
wildcards = "X"                 # ambiguous amino acid symbol
analysis_name = "tutorial_run"
output_format = "csv"

## 3. Reading input sequences

MicroTPCT first reads protein (target) sequences and peptide (query) sequences using unified readers.

In [None]:
target_inputs = list(
    read_file(target_file, role=SequenceRole.PROTEIN)
)

query_inputs = list(
    read_file(query_file, role=SequenceRole.PEPTIDE)
)

print(f"Loaded {len(target_inputs)} target proteins")
print(f"Loaded {len(query_inputs)} query peptides")

At this stage:

* Each target is a protein sequence with an identifier and accession
* Each query is a short peptide sequence
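
To make the reading step concrete, here is a minimal sketch of a FASTA parser that yields one `(header, sequence)` pair per record. It is an illustration only: `iter_fasta` is a hypothetical helper, not MicroTPCT's `read_file`, whose parsing rules and returned objects may differ.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    Hypothetical helper for illustration; not part of MicroTPCT.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input: one wrapped protein record and one peptide record
example = io.StringIO(
    ">sp|P00001|TEST_HUMAN toy protein\nMKTAYIAK\nQRQISFVK\n>pep1\nAYIAK\n"
)
records = list(iter_fasta(example))
```

Each record keeps its full header line, so an identifier or accession could be extracted from it in a later step.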

## 4. Wildcard configuration and validation

Some protein databases contain ambiguous amino acids (for example `X`). MicroTPCT can optionally enable wildcard-aware matching.

In [None]:
effective_allow_wildcard = allow_wildcard

if allow_wildcard and not wildcards:
    print("Warning: wildcard matching enabled but no wildcard provided → disabling")
    effective_allow_wildcard = False

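if wildcards:
    # Normalize to a set: a list, tuple or set becomes a set of its items,
    # a single string becomes a one-element set
    wildcards = set(wildcards) if isinstance(wildcards, (list, tuple, set)) else {wildcards}
    validates_wildcards(wildcards)
    print(f"Wildcard characters enabled: {wildcards}")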

## 5. Validation of protein and peptide inputs

All sequences are validated before building the databases.

### 5.1 Protein validation and detection of wildcards

In [None]:
n_with_wildcards = 0

for obj in target_inputs:
    wildcards_detected = validate_protein_input(obj, wildcards)

    if wildcards_detected:
        n_with_wildcards += 1

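    if effective_allow_wildcard:
        # object.__setattr__ sets the flag even if the input object
        # rejects normal attribute assignment (e.g. a frozen dataclass)
        object.__setattr__(obj, "contain_wildcards", wildcards_detected)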

print(f"Proteins containing wildcards: {n_with_wildcards}")
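
The `object.__setattr__` call above is the usual way to set a field on an object that rejects normal assignment. The sketch below shows this on a frozen dataclass. `ToyProtein` is a hypothetical stand-in, and we only assume (the tutorial does not state it) that MicroTPCT's input objects are immutable in a similar way.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ToyProtein:
    """Hypothetical stand-in for a MicroTPCT input object."""
    accession: str
    sequence: str
    contain_wildcards: bool = False


protein = ToyProtein("P00001", "MKTXYIAK")

# Normal assignment is blocked on a frozen dataclass
try:
    protein.contain_wildcards = True
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True

# object.__setattr__ bypasses the frozen check and sets the field anyway
object.__setattr__(protein, "contain_wildcards", "X" in protein.sequence)
```

This bypass is convenient for annotating objects after validation, but it should be used sparingly, since it sidesteps the immutability guarantee.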

### 5.2 Peptide validation

In [None]:
for obj in query_inputs:
    validate_peptide_input(obj)

print("All query peptides are valid")

At this point, all sequences are guaranteed to follow valid amino-acid syntax.
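
As an illustration of what such a syntax check can look like, the sketch below reports every character outside an allowed alphabet. `STANDARD_AMINO_ACIDS` and `find_invalid_residues` are hypothetical names. The alphabet is assumed to be the 20 standard one-letter codes, and the MicroTPCT validators may accept a different set.

```python
# Assumption: only the 20 standard one-letter amino-acid codes are valid
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence, wildcards=frozenset()):
    """Return the sorted list of characters that are neither standard
    amino acids nor declared wildcards (hypothetical helper)."""
    allowed = STANDARD_AMINO_ACIDS | set(wildcards)
    return sorted(set(sequence.upper()) - allowed)


clean = find_invalid_residues("MKTAYIAK")
dirty = find_invalid_residues("MKT1YIAK*")
ambiguous_rejected = find_invalid_residues("MKTXYIAK")
ambiguous_accepted = find_invalid_residues("MKTXYIAK", wildcards={"X"})
```

An empty list means the sequence passes. Declaring `X` as a wildcard turns it from an invalid character into an accepted one.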

## 6. Building the target and query databases

MicroTPCT converts raw sequence objects into optimized internal databases for matching.

In [None]:
target_db = build_database(target_inputs, role=SequenceRole.PROTEIN)
query_db = build_database(query_inputs, role=SequenceRole.PEPTIDE)

print(f"TargetDB: {target_db.size} sequences")
print(f"QueryDB: {query_db.size} peptides")

These database objects provide:

* Fast indexed access to sequences
* Accession management
* Optional handling of ambiguous residues

## 7. Selection and execution of the matching engine

MicroTPCT supports multiple matching backends. Here we use Aho–Corasick for exact multi-pattern search.

In [None]:
matching_func = MATCHING_ENGINES[matching_engine]

result_strict_matching = matching_func(target_db, query_db)

print(f"Strict matches found: {len(result_strict_matching)}")


Each match typically stores:

* Query peptide ID
* Target protein ID
* Match position(s)
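
To show how Aho–Corasick finds many peptides in a single pass over a protein, here is a compact pure-Python sketch of the algorithm. `build_automaton` and `find_all` are hypothetical names. This is a teaching version, not MicroTPCT's engine, which may rely on a different implementation or library.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure and output tables for a set of patterns."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    output = [[]]   # output[state] -> patterns that end at this state

    # Phase 1: insert every pattern into a trie
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].append(pattern)

    # Phase 2: breadth-first traversal to compute the failure links
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            # Inherit patterns that end at the fallback state
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, patterns):
    """Return (pattern, start) for every occurrence of any pattern in text."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    hits = []
    for end, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in output[state]:
            hits.append((pattern, end - len(pattern) + 1))
    return hits


# Toy data: three peptides occur in the protein, one does not
hits = find_all("MKTAYIAKQR", ["AYIAK", "IAK", "KQ", "WWW"])
```

The protein is scanned once regardless of the number of peptides, which is why this family of algorithms suits large peptide lists.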


## 8. Wildcard-aware matching (optional)

If wildcard matching is enabled and ambiguous residues were detected, a second pass is performed.

In [None]:
result_wildcard_matching = None

if effective_allow_wildcard and n_with_wildcards > 0:
    wildcard_targets = target_db.get_wildcard_targets()

    result_wildcard_matching = run_wildcard_match(
        wildcard_targets,
        query_db,
        wildcards,
    )

    print(f"Wildcard matches found: {len(result_wildcard_matching)}")

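The total number of matches is the sum of strict and wildcard-aware results. Note that `result_wildcard_matching` stays `None` when the second pass is skipped, so check for `None` before calling `len()` on it.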
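
The sketch below illustrates one way a wildcard-aware pass can work: a wildcard residue in the target is treated as compatible with any peptide residue. `find_wildcard_matches` and `total_matches` are hypothetical helpers. We also assume that only windows containing at least one wildcard are reported, so that the results do not overlap with the strict pass; the actual behaviour of `run_wildcard_match` may differ.

```python
def find_wildcard_matches(target, peptide, wildcards):
    """Return start positions where the peptide is compatible with a
    target window that contains at least one wildcard (assumption)."""
    positions = []
    for start in range(len(target) - len(peptide) + 1):
        window = target[start:start + len(peptide)]
        uses_wildcard = any(t in wildcards for t in window)
        compatible = all(
            t == p or t in wildcards for t, p in zip(window, peptide)
        )
        if uses_wildcard and compatible:
            positions.append(start)
    return positions


def total_matches(strict, wildcard):
    """Sum both result sets; wildcard may be None if the pass was skipped."""
    return len(strict) + (len(wildcard) if wildcard is not None else 0)


# Toy data: one ambiguous occurrence (position 3) and one exact one (position 8)
wildcard_positions = find_wildcard_matches("MKTXYIAKAYIAK", "AYIAK", {"X"})
total_with_pass = total_matches([8], wildcard_positions)
total_without_pass = total_matches([8], None)
```

This window-by-window comparison is simple but slow on large inputs, which is one reason to restrict it to targets that actually contain wildcards.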

## 9. Writing output files

MicroTPCT can export:

* A detailed match table
* A statistics summary file

In [None]:
output_path = Path("results")
output_path.mkdir(parents=True, exist_ok=True)  # also creates missing parent directories

result_file, stats_file = write_outputs(
    output_path=output_path,
    output_format=output_format,
    analysis_name=analysis_name,
    query_db=query_db,
    target_db=target_db,
    result_strict=result_strict_matching,
    result_wildcard=result_wildcard_matching,
    n_target_with_wildcards=n_with_wildcards,
    matching_engine=matching_engine,
    allow_wildcard=effective_allow_wildcard,
    wildcards=wildcards if wildcards else None,
)

print("Results written to:", result_file)
print("Statistics written to:", stats_file)

## 10. Exploratory analysis ideas

This notebook can now be extended with custom analyses, for example:

* Length distribution of matching versus non-matching peptides
* Number of peptides per protein
* Proteins enriched in micropeptide hits
* Comparison between strict and wildcard matches

Example:

In [None]:
# Example: count strict matches per target protein
from collections import Counter

protein_hits = Counter(m.target_id for m in result_strict_matching)

# Show the ten proteins with the most hits
protein_hits.most_common(10)
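
The peptide-length comparison suggested above can be sketched in the same way. The example uses toy data only: `peptides` and `matched_ids` are hypothetical stand-ins for information that would be extracted from `query_db` and the match results, whose exact attributes are not shown in this tutorial.

```python
from collections import Counter

# Toy data (hypothetical): peptide ID -> sequence, and IDs with at least one hit
peptides = {"pep1": "AYIAK", "pep2": "IAK", "pep3": "WWWWW", "pep4": "KQ"}
matched_ids = {"pep1", "pep2", "pep4"}

# Length distribution of peptides with and without a match
matched_lengths = Counter(
    len(seq) for pid, seq in peptides.items() if pid in matched_ids
)
unmatched_lengths = Counter(
    len(seq) for pid, seq in peptides.items() if pid not in matched_ids
)
```

Both counters map a peptide length to a number of peptides and can be passed directly to a plotting library for a side-by-side histogram.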

## Conclusion

This tutorial illustrated the full MicroTPCT pipeline step by step:

1. Reading inputs
2. Validating sequences
3. Building databases
4. Running matching engines
5. Handling ambiguous residues
6. Exporting and exploring results

For automated production runs, we recommend the high-level function `run_pipeline`, which is imported as follows (see `core/pipeline.py` for its arguments):

In [None]:
from microtpct.core.pipeline import run_pipeline

For teaching, debugging, and research exploration, this notebook provides a transparent and extensible workflow.

Happy micropeptide hunting!