# Primary Analysis: FASTQ to Count Matrix

This notebook parses targeted-sequencing FASTQ files for costimulatory domain sequences and generates a count matrix for downstream statistical analysis.

## Workflow

1. Configure the reader with paths and parameters
2. Parse FASTQ files (parallelized across cores)
3. Serialize the count matrix and metadata

## Requirements

```bash
pip install -e .  # Install costim-screen package
```

In [None]:
from primary_analysis import ReadCostimFASTQ

# =============================================================================
# Configuration - Update these paths for your data
# =============================================================================

# Path to directory containing FASTQ files (*_R1.fastq.gz)
fastq_path = '/path/to/fastq/directory'

# Path to costim metadata Excel file (must have 'ID' and 'Costim' columns)
feature_metadata_fn = '/path/to/costim_metadata.xlsx'

# Path to sample metadata (optional; set to None to infer metadata from the FASTQ filenames)
sample_metadata_fn = '/path/to/sample_metadata.xlsx'

# Output directory for results
results_path = '/path/to/results'

# =============================================================================
# Processing parameters
# =============================================================================

# Filename encoding scheme for inferring sample metadata
# Options: 'PT_EC_RP_SS' (Patient_ExpCond_Replicate_Tsubset)
#          'PT_EC_RP_EX' (Patient_ExpCond_Replicate_SortCat)
fastq_encoding = 'PT_EC_RP_SS'

# Number of bases to use for matching (from start of costim sequence)
n_char_truncate = 30

# Maximum number of mismatched bases allowed for fuzzy matching (Hamming distance)
read_error_threshold = 2

# Number of CPU cores (None = auto-detect)
num_cores = None

# Debug mode (limits reads processed, extra logging)
debug = False
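
When `sample_metadata_fn` is `None`, sample metadata is inferred from the FASTQ filenames according to `fastq_encoding`. The sketch below shows one way such an encoding could be decoded. It is an illustration, not the package's implementation: it assumes the fields are underscore-separated and appear in the order given by the encoding name, and `ENCODING_FIELDS`, `parse_sample_name`, and the example filename are all hypothetical.

```python
from pathlib import Path

# Hypothetical field names for each encoding, following the expansions
# given in the configuration comments (assumption, not the package's API).
ENCODING_FIELDS = {
    "PT_EC_RP_SS": ("patient", "exp_cond", "replicate", "t_subset"),
    "PT_EC_RP_EX": ("patient", "exp_cond", "replicate", "sort_cat"),
}


def parse_sample_name(filename: str, encoding: str) -> dict[str, str]:
    """Split an underscore-delimited FASTQ filename into named metadata fields."""
    fields = ENCODING_FIELDS[encoding]
    name = Path(filename).name
    suffix = "_R1.fastq.gz"
    if not name.endswith(suffix):
        raise ValueError(f"expected a *{suffix} file, got {name!r}")
    # Drop the read suffix, then split the remaining stem into fields.
    parts = name[: -len(suffix)].split("_")
    if len(parts) != len(fields):
        raise ValueError(
            f"{name!r} does not have {len(fields)} underscore-separated fields"
        )
    return dict(zip(fields, parts))


# Made-up filename, used only to exercise the function.
example = parse_sample_name("P01_Stim_Rep1_CD4_R1.fastq.gz", "PT_EC_RP_SS")
print(example)
```

If your filenames carry extra tokens (lane or sample-sheet indices, for example), a lookup like this would need adjusting, and supplying an explicit `sample_metadata_fn` is likely the simpler route.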

In [None]:
# =============================================================================
# Create reader and process FASTQ files
# =============================================================================

# Instantiate the reader
reader = ReadCostimFASTQ(
    fastq_path=fastq_path,
    feature_metadata_fn=feature_metadata_fn,
    sample_metadata_fn=sample_metadata_fn,
    debug=debug,
    fastq_encoding=fastq_encoding,
    fastq_type='gDNA',
    n_char_features=n_char_truncate,
    num_cores=num_cores,
    read_error_threshold=read_error_threshold,
)

# Parse all FASTQ files (parallelized)
reader.read()

# Save results
reader.serialize(results_path=results_path)

# Print summary
reader.print_summary()
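
The `n_char_truncate` and `read_error_threshold` settings control how reads are assigned to features: each reference sequence is truncated to its first `n_char_truncate` bases, and a read counts as a match when it differs from a truncated reference at no more than `read_error_threshold` positions. The toy sketch below illustrates that idea. It is not the code inside `ReadCostimFASTQ`; `hamming_distance`, `match_read`, and the sequences are hypothetical, and it assumes the costim sequence begins at the first base of the read.

```python
from typing import Optional


def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def match_read(read: str, references: dict[str, str],
               n_char: int, max_mismatches: int) -> Optional[str]:
    """Return the ID of the unique closest reference within the threshold, else None."""
    prefix = read[:n_char]
    if len(prefix) < n_char:
        return None  # read is too short to compare
    best_id, best_dist, tie = None, max_mismatches + 1, False
    for ref_id, seq in references.items():
        dist = hamming_distance(prefix, seq[:n_char])
        if dist < best_dist:
            best_id, best_dist, tie = ref_id, dist, False
        elif dist == best_dist and best_id is not None:
            tie = True  # two references are equally close
    return None if tie else best_id


# Toy reference sequences (made up for illustration).
toy_references = {"costim_A": "ACGTACGTAA", "costim_B": "TTGGCCAATT"}

print(match_read("ACGTACGAGG", toy_references, n_char=8, max_mismatches=2))
print(match_read("GGGGGGGGGG", toy_references, n_char=8, max_mismatches=2))
```

The first read has a single mismatch against `costim_A` within the first 8 bases and is assigned to it; the second is more than two mismatches from both references and is left unassigned. Returning `None` on ties is a design choice of this sketch, not a documented behavior of the package.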

## Output Files

After running, the following files will be created in `results_path`:

| File | Description |
|------|-------------|
| `merged_counts.xlsx` | Count matrix (features × samples) |
| `feature_metadata.xlsx` | Feature metadata (costim sequences, IDs) |
| `sample_metadata.xlsx` | Sample metadata (inferred or from file) |
| `ambiguous_costim_report.xlsx` | Report of truncated sequences mapping to multiple features |
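
Truncating reference sequences to their first `n_char_truncate` bases can make distinct features indistinguishable, which is the situation `ambiguous_costim_report.xlsx` describes. A minimal sketch of how such collisions could be detected, assuming the feature sequences are available as an ID-to-sequence mapping (`find_ambiguous_prefixes` and the toy sequences are hypothetical and not part of the package):

```python
from collections import defaultdict


def find_ambiguous_prefixes(sequences: dict[str, str], n_char: int) -> dict[str, list[str]]:
    """Map each truncated prefix shared by two or more features to their sorted IDs."""
    by_prefix = defaultdict(list)
    for feature_id, seq in sequences.items():
        by_prefix[seq[:n_char]].append(feature_id)
    return {prefix: sorted(ids) for prefix, ids in by_prefix.items() if len(ids) > 1}


# Toy sequences (made up for illustration).
toy_sequences = {
    "costim_A": "ACGTACGTAA",
    "costim_B": "ACGTACGTCC",  # shares its first 8 bases with costim_A
    "costim_C": "TTGGCCAATT",
}

ambiguous = find_ambiguous_prefixes(toy_sequences, n_char=8)
print(ambiguous)
```

Increasing `n_char` until this mapping is empty is one way to choose a truncation length that keeps every feature distinguishable, at the cost of requiring longer error-free stretches in each read.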

## Next Steps

Use the count matrix with `costim_screen` for statistical analysis:

```python
import costim_screen as cs
from pathlib import Path

# Load the count matrix you just created
counts = cs.load_counts_matrix(Path(results_path) / "merged_counts.xlsx")
```

In [None]:
# =============================================================================
# Inspect the results
# =============================================================================

# View the count matrix
print("Count matrix shape:", reader.merged_counts_df.shape)
print("\nFirst few rows:")
reader.merged_counts_df.head()

In [None]:
# View sample metadata
print("Sample metadata:")
reader.sample_metadata_df.head(10)