## Running analysis on a dataset

In [1]:
from pubweb import PubWeb

client = PubWeb()

In [2]:
project = client.project.find_by_name('Test Project') 
datasets = client.dataset.find_by_project(project_id=project.id, name='Test data for CRISPR MAGeCK')
dataset = datasets[0]

process = client.process.find_by_name('MAGeCK Count')

Look up the parameters that are required for the process. You'll have to set values for these parameters later.

In [3]:
param_spec = client.process.get_parameter_spec(process.id)
param_spec.print()

Parameters:
	FASTQ (key=fastq, type=string)
	Library (key=library, type=string)
	5' Adapter (key=adapter, default=CTTGTGGAAAGGACGAAACACCG, type=string, description=Adapter sequence to be trimmed from the 5' end of each read)
	Insert Length (key=insert_length, default=20, type=integer, description=Length of the sgRNA sequences contained in each read)


Look up the references you'll need to use as input parameters. See the [Using_references](Using_references.ipynb) notebook for more information on how to find references.

In [4]:
references = client.project.get_references(project.id, 'crispr_libraries')
reference_library = references.find_by_name('BroadGPP-Brunello')
reference_library

Reference(path=data/references/crispr_libraries/BroadGPP-Brunello/library.csv)

Look up the files in the dataset that you'll use as input parameters.

In [5]:
from pubweb.file_utils import filter_files_by_pattern

files = client.dataset.get_dataset_files(project_id=project.id,
                                         dataset_id=dataset.id)
fastqs = filter_files_by_pattern(files, '**/controls/*.fastq.gz')
fastqs

[File(path=data/controls/MO_Brunello_gDNA_1.fastq.gz),
 File(path=data/controls/MO_Brunello_gDNA_2.fastq.gz)]
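To make the glob pattern above easier to reason about, here is a minimal standalone sketch of pattern-based filtering over plain path strings. It uses the standard library's `fnmatch` module and is not the SDK's implementation; `filter_paths_by_pattern` and the `data/treated/...` path are hypothetical names introduced only for illustration, and the SDK's `filter_files_by_pattern` may treat `**` differently.

```python
from fnmatch import fnmatchcase


def filter_paths_by_pattern(paths, pattern):
    """Return the paths that match a shell-style glob pattern (hypothetical helper)."""
    # In fnmatch, '*' matches any run of characters, including '/'.
    # '**/controls/*.fastq.gz' therefore matches any path that contains
    # '/controls/' and ends in '.fastq.gz'.
    return [p for p in paths if fnmatchcase(p, pattern)]


paths = [
    'data/controls/MO_Brunello_gDNA_1.fastq.gz',
    'data/controls/MO_Brunello_gDNA_2.fastq.gz',
    'data/treated/sample_1.fastq.gz',  # hypothetical path, not in the dataset above
]

matched = filter_paths_by_pattern(paths, '**/controls/*.fastq.gz')
print(matched)
```

Only the two paths under `controls/` are kept, which mirrors the result shown above.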

Define the parameters you want to use. Each dictionary key must match the `key` of an entry in the `param_spec` printed above.

In [6]:
from pubweb.models.process import RunAnalysisCommand

params = {
    # Multiple FASTQ files are passed as a single comma-separated string
    'fastq': ','.join([f.absolute_path for f in fastqs]),
    'adapter': 'CTTGTGGAAAGGACGAAACACCG',
    'insert_length': 20,
    'library': reference_library.absolute_path
}
params

{'fastq': 's3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/datasets/de2dda9a-c103-4841-ae46-b2df74390f90/data/controls/MO_Brunello_gDNA_1.fastq.gz,s3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/datasets/de2dda9a-c103-4841-ae46-b2df74390f90/data/controls/MO_Brunello_gDNA_2.fastq.gz',
 'adapter': 'CTTGTGGAAAGGACGAAACACCG',
 'insert_length': 20,
 'library': 's3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/resources/data/references/crispr_libraries/BroadGPP-Brunello/library.csv'}
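Two of the parameters in the spec list a default value, so one way to assemble the dictionary is to start from those defaults and override only what differs. The sketch below does this in plain Python. `SPEC` is a hand-written, simplified copy of the printed spec and `build_params` is a hypothetical helper; neither is part of the SDK, and whether the service fills in defaults for omitted keys is not something this notebook shows. The `s3://example-bucket/...` paths are placeholders.

```python
# Hypothetical, simplified copy of the parameter spec printed earlier
SPEC = [
    {'key': 'fastq', 'type': 'string'},
    {'key': 'library', 'type': 'string'},
    {'key': 'adapter', 'type': 'string', 'default': 'CTTGTGGAAAGGACGAAACACCG'},
    {'key': 'insert_length', 'type': 'integer', 'default': 20},
]


def build_params(spec, overrides):
    """Merge spec defaults with user-supplied values (hypothetical helper)."""
    known_keys = {entry['key'] for entry in spec}
    unknown = sorted(set(overrides) - known_keys)
    if unknown:
        raise KeyError(f'Unknown parameter keys: {unknown}')

    # Start from the defaults, then let explicit values take precedence
    merged = {e['key']: e['default'] for e in spec if 'default' in e}
    merged.update(overrides)

    # Any key without a default must be supplied explicitly
    missing = [e['key'] for e in spec if e['key'] not in merged]
    if missing:
        raise ValueError(f'Missing parameters without defaults: {missing}')
    return merged


example_params = build_params(SPEC, {
    'fastq': 's3://example-bucket/a.fastq.gz,s3://example-bucket/b.fastq.gz',
    'library': 's3://example-bucket/library.csv',
})
print(example_params)
```

Writing every key out explicitly, as in the cell above, works just as well; the helper only guards against misspelled or forgotten keys.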

Before submitting the analysis, the client automatically validates the parameters.
You can also validate them manually using `validate_params`.

In [7]:
try:
    param_spec.validate_params({
        'library': 1
    })
except Exception as e:
    print(e)

Parameter at $.library error: 1 is not of type 'string'
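The error above is a type mismatch: the spec declares `library` as a string, but an integer was passed. As a rough illustration of that kind of check, the sketch below compares each value against a declared type and collects messages in a similar style. `check_types`, `SPEC_TYPES`, and `TYPE_MAP` are hypothetical names; this is not how the SDK implements `validate_params`, only a minimal stand-in under those assumptions.

```python
# Hypothetical mapping from spec type names to Python types
TYPE_MAP = {'string': str, 'integer': int}

# Hypothetical, simplified view of the spec: parameter key -> declared type
SPEC_TYPES = {
    'fastq': 'string',
    'library': 'string',
    'adapter': 'string',
    'insert_length': 'integer',
}


def check_types(params, spec_types):
    """Return a list of error messages; an empty list means no type errors."""
    errors = []
    for key, value in params.items():
        expected = spec_types.get(key)
        if expected is None:
            errors.append(f'$.{key}: unknown parameter')
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(value, bool) or not isinstance(value, TYPE_MAP[expected]):
            errors.append(f"$.{key}: {value!r} is not of type '{expected}'")
    return errors


errors = check_types({'library': 1, 'insert_length': 20}, SPEC_TYPES)
print(errors)
```

Collecting every problem into a list, rather than raising on the first one, makes it easier to fix several parameters in a single pass.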


Run the analysis using the process, dataset, project, and parameters you defined above.

In [8]:
command = RunAnalysisCommand(
    name='count analysis',
    description='test from SDK',
    process_id=process.id,
    parent_dataset_id=dataset.id,
    project_id=project.id,
    params=params,
    notifications_emails=[]
)

new_dataset_id = client.process.run_analysis(command)
print(new_dataset_id)

9943abf5-561f-45f1-99ef-f5316fd861c2
