Several helper functions are available to help you integrate your pipeline (or data type) into Cirro.

These include:
- File validation rules / sample matching pattern testing
- Cirro Preprocess script & sample metadata outputs (used for preparing sample sheets for your pipeline)

### File validation rules / sample matching pattern testing

In [1]:
from cirro import DataPortal

portal = DataPortal()
helper = portal.developer_helper

First we need to get a dataset to test against.

In [13]:
dataset = portal.get_dataset(
    project="Pipeline Development",
    dataset="Short reads for hybrid assembly 2"
)
files = dataset.list_files()
print(files)

data/4263-B_S31_R1_001.fastq.gz (842.28 MB)
data/4263-B_S31_R2_001.fastq.gz (808.39 MB)


The dataset has two files. We want to write a regex that extracts the sample name (`4263-B`) from the file names.

Developing a regex can be tricky. [Pythex.org](https://pythex.org/) is a useful resource because it includes a cheat sheet; note that it uses Python's regex syntax, where named groups are written `(?P<name>...)` rather than `(?<name>...)`. AI tools can also generate regex patterns, but always test the result to confirm it works as expected.

Name patterns are evaluated in order, so you should put the most specific patterns first.
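To see why ordering matters, here is a minimal local sketch of first-match-wins evaluation using Python's `re` module. It is not Cirro's implementation. The helper names `to_python_syntax` and `first_match` are hypothetical, and the sketch assumes that patterns are searched against the base file name only (suggested by the sample name `4263-B` being extracted from files under `data/`). Because Python's `re` spells named groups `(?P<name>...)`, the sketch converts the `(?<name>...)` form used in this notebook before compiling.

```python
import re
from pathlib import PurePosixPath


def to_python_syntax(pattern: str) -> str:
    """Rewrite `(?<name>` as `(?P<name>`; lookbehinds `(?<=` and `(?<!` are left alone."""
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def first_match(file_name: str, patterns: list[str]):
    """Return (pattern index, named groups) for the first matching pattern, or None."""
    # Assumption: only the base name is matched, not the folder prefix.
    base_name = PurePosixPath(file_name).name
    for index, pattern in enumerate(patterns):
        match = re.search(to_python_syntax(pattern), base_name)
        if match:
            return index, match.groupdict()
    return None


specific = r"(?<sampleName>\S*)_S(?<libraryIndex>\S*)_(?<readType>R|I)(?<read>1|2|3|4)_001\.fastq\.gz"
generic = r"(?<sampleName>\S*)\.fastq\.gz"
name = "data/4263-B_S31_R1_001.fastq.gz"

# Specific pattern first: the Illumina suffix is kept out of the sample name.
print(first_match(name, [specific, generic]))
# Generic pattern first: it matches any FASTQ name, so the suffix leaks into the sample name.
print(first_match(name, [generic, specific]))
```

With the specific pattern first, the sample name is `4263-B`; with the generic pattern first, it becomes `4263-B_S31_R1_001`, which is why the fallback belongs at the end of the list.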

In [14]:
file_name_patterns = [
    # Illumina format with no lane information
    # Backslashes are doubled because this is a Python string literal; use single backslashes in the Cirro text field
    "(?<sampleName>\\S*)_S(?<libraryIndex>\\S*)_(?<readType>R|I)(?<read>1|2|3|4)_001\\.fastq\\.gz",
    # You can specify multiple patterns if there are different naming conventions
    # Fall back to a more generic pattern if the above does not match
    "(?<sampleName>\\S*)\\.fastq\\.gz"
]

matches = helper.test_file_name_validation_for_dataset(
    project_id=dataset.project_id,
    dataset_id=dataset.id,
    file_name_patterns=file_name_patterns
)

matches.print()

Matches: 2

data/4263-B_S31_R1_001.fastq.gz
Sample name: 4263-B
Matched regex: (?<sampleName>\S*)_S(?<libraryIndex>\S*)_(?<readType>R|I)(?<read>1|2|3|4)_001\.fastq\.gz

data/4263-B_S31_R2_001.fastq.gz
Sample name: 4263-B
Matched regex: (?<sampleName>\S*)_S(?<libraryIndex>\S*)_(?<readType>R|I)(?<read>1|2|3|4)_001\.fastq\.gz



Both files matched the first pattern, and the sample name was extracted from each file name.
We can now use this pattern when creating the pipeline or data type.
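When copying a pattern from Python into the Cirro text field, the doubled backslashes mentioned in the pattern list's comment need to become single ones. One way to avoid the mismatch is to write the pattern as a raw string, which already contains exactly the characters you would paste. A small sketch (the variable names are illustrative only):

```python
# Escaped form, as written in a regular Python string literal.
escaped = "(?<sampleName>\\S*)\\.fastq\\.gz"
# Raw-string form: backslashes are taken literally, so none are doubled.
raw = r"(?<sampleName>\S*)\.fastq\.gz"

# Both literals hold the same characters; printing shows the text to paste into Cirro.
print(escaped == raw)
print(raw)
```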

You can also use the `test_file_name_validation` method if you do not have a dataset to test against. This will return a list of matches for the provided file names.

In [4]:
matches = helper.test_file_name_validation(
    file_name_patterns=file_name_patterns,
    file_names=[
        "4263-B_S1_R1_001.fastq.gz",
        "4263-B_S1_R2_001.fastq.gz"
    ]
)
matches.print()


Matches: 2

4263-B_S1_R1_001.fastq.gz
Sample name: 4263-B
Matched regex: (?<sampleName>\S*)_S(?<libraryIndex>\S*)_(?<readType>R|I)(?<read>1|2|3|4)_001\.fastq\.gz

4263-B_S1_R2_001.fastq.gz
Sample name: 4263-B
Matched regex: (?<sampleName>\S*)_S(?<libraryIndex>\S*)_(?<readType>R|I)(?<read>1|2|3|4)_001\.fastq\.gz



### Preprocess testing (sample sheet generation)

To build a `PreprocessDataset` object from the sample sheets that Cirro provides to your pipeline, use the `generate_preprocess_for_input_datasets` method.

You can also use the `generate_samplesheets_for_dataset` method to access the sample sheets directly. This is useful for generating test data when writing unit tests for your Preprocess script.

In [5]:
dataset = portal.get_dataset(
    project="Example Datasets",
    dataset="10X 5' Immune Profiling Libraries Pooled with Hashtags"
)

In [6]:
ds = helper.generate_preprocess_for_input_datasets(
    project_id=dataset.project_id,
    input_dataset_ids=[dataset.id],
    params={
        'param_1': 'value_1',
    }
)

We can then inspect the `PreprocessDataset` object to see the sample sheets that have been generated.

In [7]:
ds.samplesheet.head()

|   | sample | grouping | feature_types |
|---|---|---|---|
| 0 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 | Multiplexing Capture |
| 1 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 | VDJ-B |
| 2 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 | Gene Expression |
| 3 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 | VDJ-T |


In [8]:
ds.files.head()

|   | sample | file | process | dataset | sampleIndex | read | readType |
|---|---|---|---|---|---|---|---|
| 0 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | single-cell-10X | b393c81b-1001-4b8c-8c79-bba690bdce04 | 7 | 2 | R |
| 1 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | single-cell-10X | b393c81b-1001-4b8c-8c79-bba690bdce04 | 3 | 2 | R |
| 2 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | single-cell-10X | b393c81b-1001-4b8c-8c79-bba690bdce04 | 2 | 1 | R |
| 3 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | single-cell-10X | b393c81b-1001-4b8c-8c79-bba690bdce04 | 1 | 1 | R |
| 4 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | single-cell-10X | b393c81b-1001-4b8c-8c79-bba690bdce04 | 6 | 1 | R |


You can use the two dataframes to create your own sample sheet for your pipeline. See the [Preprocess full example](https://docs.cirro.bio/pipelines/preprocess-script/#full-example) for more details on constructing your sample sheet manually.

The `pivot_samplesheet` method is also available. It outputs the sample sheet format typically used by pipelines that work on paired-end data.

In [10]:
ds.pivot_samplesheet(
    pivot_columns=['read'],  # Pivot using the read column, to become `fastq_<read>`
    column_prefix='fastq_',  # This is typical of pipelines that work on paired-end data
    metadata_columns=['grouping'],  # My pipeline doesn't allow any additional columns, only `grouping`
    file_filter_predicate='readType == "R"'  # I only want to include read files, not index files
)

|   | sample | fastq_1 | fastq_2 | grouping |
|---|---|---|---|---|
| 0 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 1 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 2 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1_ab | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 3 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 4 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 5 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 6 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 7 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
| 8 | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | s3://project-c05cb0bc-f472-4390-b521-f798486f8... | PBMC-ALL_60k_universal_HashAB1-4_BL_4tags_Rep1 |
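
To make the long-to-wide reshaping concrete, here is a conceptual sketch in plain Python on toy data. It is not the SDK's implementation: the function `pivot_files`, its parameters, and all values are hypothetical stand-ins chosen to mirror the arguments passed to `pivot_samplesheet`. It groups by sample name only, so it assumes one file per sample and read. The real output contains several rows for the same sample, so the SDK evidently keys on more than the sample name; that is not reproduced here.

```python
from collections import defaultdict

# Toy stand-in for the rows of `ds.files` (hypothetical values).
files = [
    {"sample": "sampleA", "file": "s3://bucket/A_R1.fastq.gz", "read": 1, "readType": "R"},
    {"sample": "sampleA", "file": "s3://bucket/A_R2.fastq.gz", "read": 2, "readType": "R"},
    {"sample": "sampleA", "file": "s3://bucket/A_I1.fastq.gz", "read": 1, "readType": "I"},
    {"sample": "sampleB", "file": "s3://bucket/B_R1.fastq.gz", "read": 1, "readType": "R"},
    {"sample": "sampleB", "file": "s3://bucket/B_R2.fastq.gz", "read": 2, "readType": "R"},
]
# Toy stand-in for `ds.samplesheet`: one metadata record per sample.
metadata = {"sampleA": {"grouping": "group1"}, "sampleB": {"grouping": "group2"}}


def pivot_files(files, metadata, pivot_column="read", column_prefix="fastq_",
                metadata_columns=("grouping",),
                keep=lambda row: row["readType"] == "R"):
    """Turn one-row-per-file records into one-row-per-sample records."""
    wide = defaultdict(dict)
    for row in filter(keep, files):
        # For example, read == 1 becomes the column `fastq_1`.
        wide[row["sample"]][f"{column_prefix}{row[pivot_column]}"] = row["file"]
    return [
        {"sample": sample, **columns,
         **{name: metadata[sample][name] for name in metadata_columns}}
        for sample, columns in wide.items()
    ]


for row in pivot_files(files, metadata):
    print(row)
```

Each printed row has the columns `sample`, `fastq_1`, `fastq_2`, and `grouping`. The index read file is dropped by the filter, in the same way `file_filter_predicate` excludes index files in the call above.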
