## Running analysis on a dataset

In [1]:
from pubweb import DataPortal

portal = DataPortal()

In [2]:
# Get the project by name
project = portal.get_project_by_name('Test Project') 
print(f"Project '{project.name}' contains {len(project.list_datasets()):,} datasets")

# Get a particular dataset from that project
dataset = project.get_dataset_by_name('Test data for CRISPR MAGeCK')
print(f"Dataset '{dataset.name}' contains {len(dataset.list_files()):,} files")

# Get the process to run on the dataset
process = portal.get_process_by_name('MAGeCK Count')
print(f"Using the '{process.name}' process (ID: {process.id})")

Project 'Test Project' contains 99 datasets
Dataset 'Test data for CRISPR MAGeCK' contains 6 files
Using the 'MAGeCK Count' process (ID: process-hutch-magic_count-1_0)


Look up the parameters that the process requires. You'll set values for these parameters in a later step.

In [3]:
param_spec = process.get_parameter_spec()
param_spec.print()

Parameters:
	FASTQ (key=fastq, type=string)
	Library (key=library, type=string)
	5' Adapter (key=adapter, default=CTTGTGGAAAGGACGAAACACCG, type=string, description=Adapter sequence to be trimmed from the 5' end of each read)
	Insert Length (key=insert_length, default=20, type=integer, description=Length of the sgRNA sequences contained in each read)
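
To see at a glance which parameters still need a value, the printed spec can be mirrored as plain Python data. The sketch below is illustrative only: the `SPEC` list and both helper functions are hypothetical and are not part of the `pubweb` SDK, and treating "no default" as "must be supplied" is an assumption.

```python
# Hypothetical, hand-copied mirror of the spec printed above (not an SDK object).
SPEC = [
    {"key": "fastq", "type": "string"},
    {"key": "library", "type": "string"},
    {"key": "adapter", "type": "string", "default": "CTTGTGGAAAGGACGAAACACCG"},
    {"key": "insert_length", "type": "integer", "default": 20},
]


def keys_without_default(spec):
    """Return the keys of entries that declare no default value."""
    return [entry["key"] for entry in spec if "default" not in entry]


def apply_defaults(spec, overrides):
    """Start from the declared defaults, then layer user-supplied values on top."""
    params = {e["key"]: e["default"] for e in spec if "default" in e}
    params.update(overrides)
    return params


print(keys_without_default(SPEC))
```

Here `keys_without_default` picks out `fastq` and `library`, the two entries printed without a default.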


Look up the references you'll need to use as input parameters. See the [Using_references](Using_references.ipynb) notebook for more information on how to find references.

In [4]:
references = project.list_references('CRISPR sgRNA Library')
print("The CRISPR references available are:\n" + "\n".join(str(ref) for ref in references))
reference_library = project.get_reference_by_name('BroadGPP-Brunello', 'CRISPR sgRNA Library')
reference_library

The CRISPR references available are:
BroadGPP-Brunello


<pubweb.sdk.reference.DataPortalReference at 0x124332c70>

Look up the files you'll use as input parameters.

In [5]:
files = dataset.list_files()
fastqs = files.filter_by_pattern('**/controls/*.fastq.gz')
[str(f) for f in fastqs]

['data/controls/MO_Brunello_gDNA_1.fastq.gz (185096 bytes)',
 'data/controls/MO_Brunello_gDNA_2.fastq.gz (177349 bytes)']
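
The pattern passed to `filter_by_pattern` reads like a shell glob. As a rough illustration of how such a pattern selects paths, the sketch below applies the standard library's `fnmatch` to a hypothetical list of relative paths. It shows glob matching in general and is not a description of how the SDK implements the filter; `fnmatch` also lets `*` match `/`, so edge cases may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical relative paths; only the two control files appear in the output above.
paths = [
    "data/controls/MO_Brunello_gDNA_1.fastq.gz",
    "data/controls/MO_Brunello_gDNA_2.fastq.gz",
    "data/treatment/sample_1.fastq.gz",
    "data/controls/notes.txt",
]

pattern = "**/controls/*.fastq.gz"

# Keep only the paths that match the glob-style pattern (case-sensitive).
matched = [p for p in paths if fnmatchcase(p, pattern)]
print(matched)
```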

Define the parameters you want to use. The keys come from the `param_spec` printed above (use the `key` shown for each entry).

In [6]:
params = {
    # Comma-separated list of the FASTQ file locations
    'fastq': ','.join(f.absolute_path for f in fastqs),
    'adapter': 'CTTGTGGAAAGGACGAAACACCG',
    'insert_length': 20,
    # Location of the sgRNA library reference
    'library': reference_library.absolute_path
}
params

{'fastq': 's3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/datasets/de2dda9a-c103-4841-ae46-b2df74390f90/data/controls/MO_Brunello_gDNA_1.fastq.gz,s3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/datasets/de2dda9a-c103-4841-ae46-b2df74390f90/data/controls/MO_Brunello_gDNA_2.fastq.gz',
 'adapter': 'CTTGTGGAAAGGACGAAACACCG',
 'insert_length': 20,
 'library': 's3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/resources/data/references/crispr_libraries/BroadGPP-Brunello/library.csv'}

The client validates the parameters automatically before it submits the analysis. You can also validate them manually with `param_spec.validate_params`, which raises an exception if a value is invalid.

In [7]:
try:
    param_spec.validate_params({
        'library': 1
    })
except Exception as e:
    print(e)

Parameter at $.library error: 1 is not of type 'string'
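
The error message above names the offending parameter and the type it expected. A minimal pure-Python sketch of that kind of type check follows. The `EXPECTED_TYPES` mapping and the `check_types` function are hypothetical stand-ins, and the SDK's own validation may well check more than types.

```python
# Hypothetical mapping from parameter key to the Python type it should have,
# derived by hand from the spec printed earlier.
EXPECTED_TYPES = {
    "fastq": str,
    "library": str,
    "adapter": str,
    "insert_length": int,
}


def check_types(params, expected=EXPECTED_TYPES):
    """Return a list of human-readable problems; an empty list means no type errors."""
    problems = []
    for key, value in params.items():
        if key not in expected:
            problems.append(f"{key}: unknown parameter")
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        elif not isinstance(value, expected[key]) or (
            expected[key] is int and isinstance(value, bool)
        ):
            problems.append(
                f"{key}: {value!r} is not of type {expected[key].__name__}"
            )
    return problems


print(check_types({"library": 1}))
print(check_types({"library": "library.csv", "insert_length": 20}))
```

Collecting the problems in a list, rather than raising on the first one, makes it possible to report every bad parameter in a single pass.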


Run the analysis on the dataset, passing the process name and the parameters you defined above. The call returns the ID of the new dataset that will hold the results.

In [8]:
# Run the analysis, specifying a name and description for the resulting dataset
new_dataset_id = dataset.run_analysis(
    name='count analysis',
    description='test from SDK',
    process='MAGeCK Count',
    params=params
)
print(new_dataset_id)

c0463121-94fc-4077-8475-1e9432069f35
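
Validation and submission can be wrapped in one helper so that an invalid parameter set is caught before anything is submitted. The sketch below uses only the two calls shown in this notebook (`param_spec.validate_params` and `dataset.run_analysis`). The `submit_checked` function itself is hypothetical, and it assumes `validate_params` raises an exception on invalid input, as the validation cell suggests.

```python
def submit_checked(dataset, param_spec, process_name, params, name, description=""):
    """Validate params against the spec, then submit; return the new dataset ID.

    `dataset` and `param_spec` are expected to be the SDK objects created in the
    cells above. This wrapper is illustrative and not part of the SDK.
    """
    # Assumed to raise if any parameter is invalid, as in the validation cell above.
    param_spec.validate_params(params)
    return dataset.run_analysis(
        name=name,
        description=description,
        process=process_name,
        params=params,
    )
```

With the objects defined above, a call might look like `submit_checked(dataset, param_spec, 'MAGeCK Count', params, name='count analysis', description='test from SDK')`.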
