# Setup

The package is imported as `cr` in this chunk. Sub-modules can be called using:
- `cr.pp` (functions in the `crispr/processing` directory scripts)
- `cr.ax` (functions in the `crispr/analysis` directory scripts)
- `cr.pl` (functions in the `crispr/visualization` directory scripts)
- `cr.tl` (functions in the `crispr/utils` directory scripts)

You can access functions this way directly (e.g., `cr.tl.print_counts()`) if they are "exposed," i.e., listed in the `__init__.py` script under a given sub-module (e.g., `crispr/processing/__init__.py`).

You may need to import other functions or objects directly (e.g., the default perturbation layer name, using `from crispr.analysis.perturbations import layer_perturbation`); however, most or all functions are exposed.

There are also certain `Crispr` class initialization arguments defined here. No need to change anything***; these arguments are all just to identify columns in the CRISPR screening data and such.

You will specify arguments to your liking in the next chunk of code under the "Options" section.

---

\*** One exception: you may want to set `col_cell_type` to "leiden" or "majority_voting" so that one of those is the default cell type label used in downstream processes. You can still override the default by passing a different `col_cell_type` argument to a given method.
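
As a minimal sketch of that override, the initialization arguments can be copied with a single key changed, which leaves the original dictionary intact. The name `kws_init_leiden` is hypothetical, and the dictionary here is an abbreviated stand-in for the full `kws_init` defined in this notebook.

```python
# Abbreviated stand-in for the `kws_init` dictionary defined in this notebook
kws_init = dict(col_gene_symbols="gene_symbols",
                col_cell_type="predicted_labels",
                col_condition="target_gene_name")

# Hypothetical variant: same arguments, but Leiden clusters as the default label
kws_init_leiden = {**kws_init, "col_cell_type": "leiden"}

print(kws_init_leiden["col_cell_type"])  # leiden
print(kws_init["col_cell_type"])  # predicted_labels (original is unchanged)
```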

In [21]:
%load_ext autoreload
%autoreload 2

import crispr as cr
from crispr.class_sc import Omics
from anndata import AnnData
import scanpy as sc
import copy
import os
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from examples import config

col_sample_id = "orig.ident"
kws_harmony = dict(plot_convergence=True, random_state=1618)
kws_init = dict(assay=None, assay_protein=None, col_sample_id=None,
                col_gene_symbols="gene_symbols",
                col_cell_type="predicted_labels",
                col_perturbed="perturbation",
                col_guide_rna="feature_call",
                col_num_umis="num_umis",
                col_condition="target_gene_name",
                key_control="NT", key_treatment="KD")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [25]:
cr.tl.print_pretty_dictionary(kws_cluster)

method_cluster="leiden"
kws_umap=dict(min_dist=0.3)
kws_neighbors=None
resolution=1.2
kws_pca=dict(n_comps=20, use_highly_variable=True)


# File Path

Here I specify the file paths for HH03-HH06. Each dataset is specified via `{"directory": <FILE_PATH>}` to signify that it's a CellRanger output.

I specify the `file_path` argument via code below in order to detect where your data are relative to the working directory of this script; however, in practice, you can specify paths directly/manually. 

---

For example, you can specify `file_path` as a dictionary of multiple datasets in order to integrate the samples into one object using Harmony (replacing `<DIR>` below with the path to the directory).

```
file_path = {"HH03_normoxia_noMDP": {"directory": "<DIR>/HH03"}, "HH04_normoxia_MDP": {"directory": "<DIR>/HH04"},
             "HH05_hypoxia_noMDP": {"directory": "<DIR>/HH05"}, "HH06_hypoxia_MDP": {"directory": "<DIR>/HH06"}}
```

If you use this method, you'd need to specify certain initialization arguments differently. Refer to `crispr/examples/crispr_integration.ipynb` for an example.

---

**In this notebook**, for ease of examining minimally-processed objects separately, we will create multiple objects, using each element of the `file_path_list` as the `file_path` argument for each dataset. This will return a list of individual objects (instantiations of the class defined in `crispr/class_crispr.py`) for each dataset. These paths should point to the directory within which the feature_bc_matrix (and any protospacer files) are found (i.e., in the "outs" subdirectory in CellRanger output).

```
file_path_list = [{"directory": "/home/elizabeth/elizabeth/crispr/examples/data/crispr-screening/HH03"},
                  {"directory": "/home/elizabeth/elizabeth/crispr/examples/data/crispr-screening/HH04"},
                  {"directory": "/home/elizabeth/elizabeth/crispr/examples/data/crispr-screening/HH05"},
                  {"directory": "/home/elizabeth/elizabeth/crispr/examples/data/crispr-screening/HH06"}]
selves = [cr.Crispr(f) for f in file_path_list]  # one object per dataset
```

(These file paths are specific to Spark.)
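
Before creating the objects, it can help to confirm that each directory contains the expected inputs. The sketch below is not part of the package: `find_missing_inputs` is a hypothetical helper, and the required entry name (`filtered_feature_bc_matrix`, a standard CellRanger output folder) is an assumption about what the loader looks for, so adjust `required` to match your data.

```python
import os
import tempfile


def find_missing_inputs(file_path_list, required=("filtered_feature_bc_matrix",)):
    """Return (directory, entry) pairs for required entries that are absent."""
    missing = []
    for entry in file_path_list:
        directory = entry["directory"]
        for name in required:
            if not os.path.exists(os.path.join(directory, name)):
                missing.append((directory, name))
    return missing


# Demonstration with temporary stand-in directories (not real CellRanger output)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "HH03")
    bad = os.path.join(tmp, "HH04")
    os.makedirs(os.path.join(good, "filtered_feature_bc_matrix"))
    os.makedirs(bad)
    demo_missing = find_missing_inputs([{"directory": good}, {"directory": bad}])

demo_summary = [(os.path.basename(d), n) for d, n in demo_missing]
print(demo_summary)
```

An empty list means every directory passed the check.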

In [2]:
file_path = ["HH03_normoxia_noMDP", "HH04_normoxia_MDP",
             "HH05_hypoxia_noMDP", "HH06_hypoxia_MDP"]
file_path = dict([(x, {"directory": os.path.join(
    config.DIR, "crispr-screening", x.split("_")[0])}) for x in file_path])
file_path_list = [file_path[x] for x in file_path]

# Options

Here you can specify certain arguments to guide pre-processing of the scRNA-seq (e.g., filtering by MT counts) and Perturb-seq data (e.g., guide RNA filtering and target gene mapping/assignment).

## scRNA-seq Pre-Processing & Clustering

You don't necessarily need to change these, but you can if you want to fiddle with the options. 

The main wrapper function for preprocessing is `cr.pp.process_data()`. This function calls additional functions in the same script (and some others in the visualization module, `crispr/visualization/preprocessing.py`).

<u> __Notes__ </u>  

- More options are available for greater customization. See the docstrings of the functions in `crispr/processing/preprocessing.py` (the `cr.pp` module, if you imported the package using `import crispr as cr`, as is done in the first chunk of this notebook) and `crispr/analysis/clustering.py`. 

- Certain highly-customized (i.e., infrequently-used) arguments are "catch-alls" that require you to specify a dictionary of arguments to be passed to a function in an external package (e.g., the `kws_neighbors` argument in the pre-processing method is passed to `scanpy.pp.neighbors()`). Refer to the relevant package's documentation for a list of potential arguments.
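
The "catch-all" pattern described in the last note can be illustrated with stand-in functions. This is a sketch of the general keyword-forwarding idea only: `external_neighbors` and `cluster_sketch` are hypothetical and do not reproduce the package's or scanpy's actual code.

```python
def external_neighbors(n_neighbors=15, metric="euclidean"):
    """Stand-in for an external function such as `scanpy.pp.neighbors()`."""
    return {"n_neighbors": n_neighbors, "metric": metric}


def cluster_sketch(resolution=1.0, kws_neighbors=None):
    """Hypothetical wrapper that forwards `kws_neighbors` to the external call."""
    kws_neighbors = kws_neighbors or {}  # None means "use the external defaults"
    settings = external_neighbors(**kws_neighbors)
    return {"resolution": resolution, **settings}


default_run = cluster_sketch(resolution=1.2)
custom_run = cluster_sketch(resolution=1.2, kws_neighbors=dict(n_neighbors=30))
print(default_run)
print(custom_run)
```

Because the dictionary is passed through unchanged, any key in it must be a valid argument of the external function; a misspelled key would raise a `TypeError` at the forwarded call.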

In [14]:
kws_umap = dict(min_dist=0.3)
kws_pp = {"kws_hvg": {"min_mean": 0.0125, "max_mean": 10, 
                      "min_disp": 0.5, "flavor": "cell_ranger"}, 
          "target_sum": 10000.0, 
          "cell_filter_pmt": [0, 15], 
          "cell_filter_ngene": [200, 7000], 
          "cell_filter_ncounts": [500, 60000], 
          "gene_filter_ncell": [3, None], 
          "kws_scale": "z", "regress_out": None, "kws_umap": kws_umap}
kws_cluster = dict(method_cluster="leiden", kws_umap=dict(min_dist=0.3),
                   kws_neighbors=None, resolution=1.2,
                   kws_pca=dict(n_comps=20, use_highly_variable=True))
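
The filter arguments above are `[minimum, maximum]` pairs. The sketch below shows one plausible reading of them, assuming that bounds are inclusive and that `None` leaves a side open (as in `"gene_filter_ncell": [3, None]`); `within_bounds` and the example metrics are hypothetical, and the package's own handling may differ.

```python
def within_bounds(value, bounds):
    """Check a value against a [minimum, maximum] pair; None leaves a side open."""
    lower, upper = bounds
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


# Hypothetical QC metrics for a single cell
cell = {"pct_mt": 4.2, "n_genes": 150, "n_counts": 900}
checks = {
    "cell_filter_pmt": within_bounds(cell["pct_mt"], [0, 15]),
    "cell_filter_ngene": within_bounds(cell["n_genes"], [200, 7000]),
    "cell_filter_ncounts": within_bounds(cell["n_counts"], [500, 60000]),
}
keep_cell = all(checks.values())
print(checks, keep_cell)
```

Here the cell fails only the gene-count filter (150 detected genes is below the minimum of 200), so it would be removed under this reading.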

## Guide RNA Processing

These are arguments for gRNA filtering and target gene assignment fed to functions contained in `crispr/processing/guide_rna.py`.

The main wrapper function is `cr.pp.process_guide_rna()`. 
- This function calls other functions in that same script, where you can see their code. 
- You can explore how they work individually by **importing them from the submodule** and providing appropriate inputs or by using the **debugger** to step into the function when it's called within `cr.Crispr` initialization.
- Some arguments used by this function (column names, keys) are provided via `kws_init` in this example because the `Crispr` class object uses these definitions in contexts beyond gRNA processing; they are stored in attributes (e.g., `self._columns["col_guide_rna"]`) so that they are consistently accessible.
- Not all possible arguments are explicitly listed in the wrapper function's definition. A list of **additional keyword arguments** that can be provided can be found in documentation of the function `filter_by_guide_counts()`, to which the wrapper function feeds additional arguments not listed in its function definition. These additional arguments are also specified in the example below.

Many of these arguments are designed to implement the complex procedure requested in [Jira issue 737](https://mssm-ipm.atlassian.net/browse/CHOLAB-737?atlOrigin=eyJpIjoiYjNjMzFkNzkzZjBhNDU5Zjg0YjY5OTQ2MjM4NDI0NTQiLCJwIjoiaiJ9) (particularly, [this comment](https://mssm-ipm.atlassian.net/browse/CHOLAB-737?focusedCommentId=11589)) and (for now) have defaults reflecting that procedure for ease of remembering them. Most users not wanting to employ this particular procedure will therefore probably have to specify None, 0 (for minimum), 100 (for maximum percent arguments), etc. for certain arguments to avoid these steps.

<u> __Label Processing Arguments__ </u>  

- `feature_split`: This argument defines the character that separates multiple guide RNAs (e.g., "|" in "STAT1-1|NOD2-2-2").
- `guide_split`: This argument defines the character that separates individual IDs within gRNAs targeting the same gene (e.g., "-" in "STAT1-1" and "STAT1-1-2").
- `key_control_patterns`: This is a list of patterns that count as control guides. For instance, if control guides are denoted by labels such as "NEGCTRL-1-2" and "CONTROL" and "CTRL-2" and such, you could specify `["CTRL", "CONTROL"]` if those specific strings are not found in any targeting gRNA labels.
- `key_control` is mainly aesthetic because it simply renames the control guides to something else (e.g., "Control") in the final targeting gene label column.
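
The label-processing arguments can be pictured with a small pure-Python sketch. `parse_guide_label` is a hypothetical helper, not the package's implementation; in particular, stripping trailing numeric IDs is an assumption about how the gene name is recovered from a guide label.

```python
def parse_guide_label(label, feature_split="|", guide_split="-",
                      key_control_patterns=("CTRL",), key_control="Control"):
    """Map a raw guide label to a list of target gene (or control) names."""
    targets = []
    for guide in label.split(feature_split):
        # Any guide containing a control pattern is renamed to `key_control`
        if any(pattern in guide for pattern in key_control_patterns):
            targets.append(key_control)
            continue
        # Assumption: trailing numeric parts are guide IDs, not part of the gene name
        parts = guide.split(guide_split)
        while len(parts) > 1 and parts[-1].isdigit():
            parts.pop()
        targets.append(guide_split.join(parts))
    return targets


print(parse_guide_label("STAT1-1|NOD2-2-2"))  # ['STAT1', 'NOD2']
print(parse_guide_label("NEGCTRL-1-2|STAT1-1-2"))  # ['Control', 'STAT1']
```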

<u> __Guide Mapping and Filtering Arguments__ </u>  

The arguments below are listed roughly in the order in which the corresponding mapping/filtering steps are performed.

First, we determine which control guides can be removed from the list of guides with which that cell is considered transfected.
- `remove_multi_transfected`: If True, multiply-transfected cells will be removed from the data, unless they are labeled as pseudo-single-transfected according to other argument specifications.
- `max_pct_control_drop`: If the percent of UMI counts that are control guides (combined across individual guide IDs, e.g., "NEGCTRL-1-1" and "NEGCTRL-1-2" and so on) is below this number, the control label will be removed from the list of guides for that cell (e.g., "CTRL|STAT1|NOD2" --> "STAT1|NOD2").
- `min_n_target_control_drop`: If the percent of control UMI counts is above `max_pct_control_drop`, but the total non-control (targeting) sgRNA UMI count is over this number (100 in the example below), control is removed from the list of guides.
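
The control-removal rule described by these two arguments can be sketched for a single cell as follows. `drop_control` is a hypothetical helper that takes per-target UMI counts; whether the package uses strict or inclusive comparisons at the thresholds is an assumption here.

```python
def drop_control(umis, key_control="Control", max_pct_control_drop=75,
                 min_n_target_control_drop=100):
    """Return the guide labels kept for one cell after the control-drop step.

    `umis` maps each target label to its UMI count in the cell.
    """
    n_control = umis.get(key_control, 0)
    n_target = sum(n for guide, n in umis.items() if guide != key_control)
    if n_control == 0 or n_target == 0:
        return list(umis)  # nothing to decide: only control or only targeting
    pct_control = 100 * n_control / (n_control + n_target)
    # Assumed comparisons: "below" the percent threshold, "over" the count threshold
    if pct_control < max_pct_control_drop or n_target > min_n_target_control_drop:
        return [guide for guide in umis if guide != key_control]
    return list(umis)


print(drop_control({"Control": 10, "STAT1": 60, "NOD2": 30}))  # ['STAT1', 'NOD2']
print(drop_control({"Control": 900, "STAT1": 150}))  # ['STAT1']
print(drop_control({"Control": 900, "STAT1": 50}))  # ['Control', 'STAT1']
```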

For cells that are still not considered pseudo-single-transfected after the control guide filtering described above:

- `min_pct_avg_n`: All guides whose UMIs make up less than this percentage of the total UMIs will be dropped from the list of guides.
- `min_pct_dominant`: For cells that are still considered multiply-transfected after that step, if a single guide makes up this percent or more of the total UMIs, it is considered the dominant guide, and the cell is considered pseudo-single-transfected for the corresponding target gene.
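
A minimal sketch of the dominant-guide rule, assuming the percentage is taken over the guide UMIs passed in for that cell (which total the package actually uses is not confirmed here). `dominant_guide` is a hypothetical helper.

```python
def dominant_guide(umis, min_pct_dominant=80):
    """Return the dominant guide label for one cell, or None if there is none."""
    total = sum(umis.values())
    if total == 0:
        return None
    guide, count = max(umis.items(), key=lambda item: item[1])
    # "This percent or greater" is read as an inclusive threshold
    return guide if 100 * count / total >= min_pct_dominant else None


print(dominant_guide({"STAT1": 85, "NOD2": 15}))  # STAT1
print(dominant_guide({"STAT1": 60, "NOD2": 40}))  # None
```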

For cells that are still considered multiply-transfected:
- `drop_multi_control`: If False, cells that (after the filtering described above) are still considered transfected with a control guide plus only one type of targeting guide will be considered pseudo-single-transfected for the target guide. Otherwise, if both control and targeting guides remain in the list of guides for that cell after all other gRNA filtering/processing, the cell will be dropped.

If `remove_multi_transfected` is True, then all cells that are still considered multiply-transfected will be dropped.

In [43]:
kws_grna = dict(feature_split="|", guide_split="-", 
                key_control_patterns=["CTRL"], key_control="Control",
                remove_multi_transfected=True, drop_multi_control=False,
                remove_from_gex=False,
                max_pct_control_drop=75, min_n_target_control_drop=100,
                min_pct_avg_n=40, min_pct_dominant=80)

# Data

In [39]:
self = cr.Crispr(file_path_list[0], **kws_init, kws_process_guide_rna=kws_grna)



<<< INITIALIZING CRISPR CLASS OBJECT >>>



<<< INITIALIZING CRISPR CLASS OBJECT >>>


Unused keyword arguments: {'kws_process_guide_rna': {'col_guide_rna': 'feature_call', 'col_num_umis': 'num_umis', 'key_control': 'Control', 'col_guide_rna_new': 'target_gene_name', 'feature_split': '|', 'guide_split': '-', 'key_control_patterns': ['CTRL'], 'remove_multi_transfected': False, 'drop_multi_control': False, 'max_pct_control_drop': 75, 'min_n_target_control_drop': 100, 'min_pct_avg_n': 40, 'min_pct_dominant': 80}}.

col_gene_symbols="gene_symbols"
col_cell_type="leiden"
col_sample_id=None
col_batch=None
col_condition="target_gene_name"
col_num_umis=None
key_control="NT"
key_treatment="KD"

<<< LOADING PROTOSPACER METADATA >>>


Cell Counts: Raw

15078


Gene Counts: Raw



<<< PERFORMING gRNA PROCESSING AND FILTERING >>>



	*** Removing filtered-out cells...
Dropped 5175 out of 15078 observations (34.32%).


Cell Counts: Post-Guide RNA Processing

9903


Gene Counts: Post-Guide RNA Proc

In [None]:
selves = [cr.Crispr(f, **kws_init, kws_process_guide_rna=kws_grna) 
          for f in file_path_list] # create object

## Check Whether Target Gene Expression Data is Available

In [35]:
[any([x in s.adata.var_names for x in s.adata.obs.target_gene_name.unique()]) for s in selves] 

[False, False, False, False]

# Plotting


<u>__These are just examples!__</u>  

The plotting functionality is highly flexible, feature-rich, and customizable. Its capabilities cannot easily be summarized here. 

<u>__If you want to create different plots, explore the documentation and other example notebooks and experiment__</u> with the functions/methods to see if your desired figures can be created.  

## Violin Plots

In [None]:
for self in selves:
    self.plot(kind="violin")