In [None]:
!pip install pycandi

In [None]:
!candi-install

# Getting Started

Let's go over the basic functionality and use cases of the CanDI package.

### Importing

CanDI must be imported from the main CanDI directory. The core CanDI objects are contained within the CanDI.candi module and are imported as follows.

In [None]:
import CanDI.candi as can
# Can also be imported as
from CanDI import candi as can

### Data Object
The Data object is instantiated when CanDI is imported and is accessed as data within the candi module.
CanDI dataset paths are defined as attributes of the Data object.

In [None]:
print(can.data.gene_effect) # depmap ceres score
print(can.data.expression) # ccle rna seq data
print(can.data.gene_cn) # ccle copy number data

## How to Directly Load a Dataset
The load method of the Data object loads a specific dataset into memory. Each loaded dataset is stored as a pandas DataFrame in the attribute of the Data object that previously held its path.

In [None]:
can.data.load("expression")

## Cell Lines
The Cell Lines dataset contains all cell line metadata. This table is loaded automatically when candi is imported.

In [None]:
can.data.cell_lines.head(5)

## Genes
The genes dataset contains relevant gene metadata. 
The genes dataset is loaded into memory automatically when candi is imported. 

In [None]:
can.data.genes.head(5)

## Locations
The locations dataset contains location annotations for all genes and their associated confidence scores. Confidence scores were crowd sourced from several protein localization papers and integrated into one scale. This dataset is automatically loaded into memory when candi is imported. 

In [None]:
can.data.locations.head(5)

## Basic Object Instantiation
- The user input for object instantiation is used directly for indexing.
- This means that if the input is misspelled, candi will not be able to retrieve the data the user is interested in.


In [None]:
kras = can.Gene("KRAS")
lung = can.Cancer("Lung Cancer")
membrane = can.Organelle("Plasma membrane")
a549 = can.CellLine("A549") 
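
Because names are used verbatim for indexing, it can help to validate a name before instantiating an object. The following is a minimal sketch using only the standard library; `known_genes` and `check_name` are hypothetical helpers written for illustration and are not part of CanDI.

```python
from difflib import get_close_matches

# Hypothetical list of valid names; in practice this would come from
# whichever index the name is looked up in.
known_genes = ["KRAS", "NRAS", "HRAS", "TP53", "EGFR"]

def check_name(name, valid_names):
    """Return name if it matches exactly, otherwise raise with close suggestions."""
    if name in valid_names:
        return name
    suggestions = get_close_matches(name, valid_names, n=3, cutoff=0.6)
    raise LookupError(f"{name!r} not found; close matches: {suggestions}")

print(check_name("KRAS", known_genes))
try:
    check_name("KRSA", known_genes)  # transposed letters
except LookupError as err:
    print(err)
```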

## Gene Object Methods and Attributes
The following function prints the internal attributes and functions of CanDI objects. 

In [None]:
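def pretty_print_attr(obj):
    """Print the public scalar attributes, list attributes, and remaining members of an object."""
    attr, ls_attr, meth = [], [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder names
        value = getattr(obj, name)
        if type(value) in (str, int):
            attr.append(name)
        elif type(value) == list:
            ls_attr.append(name)
        else:
            meth.append(name)  # everything else is reported under "Methods"

    print("Attributes:\n")
    for name in attr:
        print(name + ":", getattr(obj, name))
    for name in ls_attr:
        items = getattr(obj, name)
        # Guard against empty lists, which would raise IndexError on [0]
        print(name + " list first item:", items[0] if items else None)
    for name in ls_attr:
        print(name + " length:", len(getattr(obj, name)))
    print("\nMethods:\n")
    for name in meth:
        print(name)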

pretty_print_attr(kras)


## Gene Indexing examples
If a dataset has not been loaded into memory, candi will prompt you.
Once a dataset is loaded, Gene.expression gives all the RNA-seq transcript data for that specific object.
In this case we have already instantiated a Gene object.

In [None]:
kras.expression

### Basic CanDI filtering
The Gene.expressed() method retrieves the cell lines in which the user-defined gene is expressed above 1 transcript per million (TPM).
The output is a list of cell line IDs, which can be used to instantiate CellLine or CellLineCluster objects.


In [None]:
kras.expressed()[0:10]

The user can pass style="values" to get the TPM values along with the DepMap IDs.

In [None]:
kras.expressed(style="values")

If you pass a DepMap ID as an argument to Gene.expressed(), you get a boolean showing whether the gene is expressed in that cell line.

In [None]:
kras.expressed(a549.depmap_id)
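
The three call patterns above can be pictured as one threshold filter over a gene's TPM values. This is a plain-Python sketch, not CanDI's implementation: the cell line IDs, the TPM numbers, the `cutoff` parameter, and the `"ids"` default for `style` are all assumptions made for illustration.

```python
# Hypothetical TPM values for one gene, keyed by made-up cell line IDs.
tpm_by_cell_line = {"LINE-A": 12.4, "LINE-B": 0.3, "LINE-C": 1.0, "LINE-D": 5.7}

def expressed(tpm, cell_line=None, style="ids", cutoff=1.0):
    """Filter one gene's TPM values with a strict 'above cutoff' rule."""
    above = {line: value for line, value in tpm.items() if value > cutoff}
    if cell_line is not None:
        # A single cell line: return its value or its expression status.
        return tpm[cell_line] if style == "values" else cell_line in above
    if style == "values":
        return above          # IDs together with their TPM values
    return list(above)        # IDs only

print(expressed(tpm_by_cell_line))
print(expressed(tpm_by_cell_line, style="values"))
print(expressed(tpm_by_cell_line, "LINE-B"))
```

Note that a value of exactly 1.0 is excluded here, since the text describes the rule as "above" 1 TPM.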

The user can use the Gene.expression_of() method to check that gene's expression in a specific cell line.
When called from a Gene object, this method accepts only cell line DepMap IDs as its argument.

In [None]:
kras.expression_of(a549.depmap_id)

CanDI works the same way across all classes and data types. For example, the mutations attribute is accessed just like the expression attribute.

In [None]:
kras.mutations

The Gene.mutated() method allows very specific filtering.
The variant argument selects the column on which to filter, and the item argument specifies the value of interest in that column. The example below retrieves all cell lines with KRAS missense mutations.

In [None]:
kras.mutated(variant="Variant_Classification", item="Missense_Mutation")[0:10]
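
Conceptually, this is a column-equals-value filter over mutation records. The sketch below uses plain dictionaries with invented rows; only the column name `Variant_Classification` and its two values are taken from this tutorial, while the `cell_line` key and the standalone `mutated` function are hypothetical.

```python
# Invented mutation records for one gene.
mutation_rows = [
    {"cell_line": "LINE-A", "Variant_Classification": "Missense_Mutation"},
    {"cell_line": "LINE-B", "Variant_Classification": "Nonsense_Mutation"},
    {"cell_line": "LINE-C", "Variant_Classification": "Missense_Mutation"},
    {"cell_line": "LINE-A", "Variant_Classification": "Missense_Mutation"},
]

def mutated(rows, variant, item):
    """Return unique cell lines whose `variant` column equals `item`, in first-seen order."""
    matches = []
    for row in rows:
        if row.get(variant) == item and row["cell_line"] not in matches:
            matches.append(row["cell_line"])
    return matches

print(mutated(mutation_rows, variant="Variant_Classification", item="Missense_Mutation"))
```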

Users can use the unload method of the Data object to remove a dataset from memory. The corresponding attribute then reverts to a file path string.

In [None]:
can.data.unload('mutations')
can.data.mutations
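
The path-or-data behaviour of these attributes can be illustrated with a toy container. This is only a sketch of the idea under stated assumptions: `LazyData`, `fake_reader`, and the path `data/mutations.csv` are hypothetical and do not reflect how CanDI is actually implemented.

```python
class LazyData:
    """Toy container: each attribute holds a path string until it is loaded."""

    def __init__(self, paths, reader):
        self._paths = dict(paths)   # dataset name -> file path
        self._reader = reader       # function that turns a path into data
        for name, path in self._paths.items():
            setattr(self, name, path)

    def load(self, name):
        """Replace the path stored under `name` with the loaded data."""
        setattr(self, name, self._reader(self._paths[name]))
        return getattr(self, name)

    def unload(self, name):
        """Drop the loaded data and restore the original path string."""
        setattr(self, name, self._paths[name])

def fake_reader(path):
    # Stand-in for a real file reader; returns a small in-memory table.
    return [{"source": path, "row": 0}]

store = LazyData({"mutations": "data/mutations.csv"}, fake_reader)
print(store.mutations)   # a path string before loading
store.load("mutations")
print(store.mutations)   # in-memory data after loading
store.unload("mutations")
print(store.mutations)   # back to the path string
```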

## CellLine Methods and Attributes

In [None]:
pretty_print_attr(a549)

All methods work in essentially the same way regardless of the candi object in use.
The CellLine.expressed() method returns all genes expressed above 1 transcript per million in that specific cell line.

In [None]:
a549.expressed()[:10]

Just like Gene.expressed(), the user can ask for the values.

In [None]:
a549.expressed(style="values")

The user can also ask for the expression status of a specific gene.

In [None]:
a549.expressed("KRAS")

Calling expressed() with style="values" for a specific gene gives the same result as expression_of().

In [None]:
a549.expression_of("KRAS")

In [None]:
a549.expressed("KRAS", style="values")

The CellLine.mutations attribute gives all mutation data for that specific cell line.

In [None]:
a549.mutations

In [None]:
# calling the CellLine.mutated() method works the same way with all CanDI objects
a549.mutated(variant="Variant_Classification", item="Nonsense_Mutation")[:10]


## Cancer Methods and Attributes


In [None]:
pretty_print_attr(lung)

Cancer objects essentially work as a group of CellLine objects.
The Cancer.expression attribute returns a pandas DataFrame rather than a pandas Series, since there are multiple cell lines to consider.

In [None]:
lung.expression

The Cancer.expressed() method uses an arbitrary threshold to filter genes. By default, a gene is reported as expressed only if it is expressed in 100 percent of the cell lines within the Cancer object.

In [None]:
lung.expressed()[0:10]

The user can relax this threshold as necessary with the threshold argument, given as a fraction of cell lines.

In [None]:
lung.expressed(threshold=0.50)[0:10]

In [None]:
lung.expressed(threshold=0.50, style="values")
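
The threshold can be read as the minimum fraction of cell lines in which a gene must exceed 1 TPM. The sketch below shows that idea in plain Python with invented gene names and values. Whether CanDI compares the fraction with `>=` or `>` is not stated here, so the `>=` used below is an assumption, as are the function name and the `cutoff` parameter.

```python
# Invented TPM values: gene -> values across the cell lines of one cancer.
tpm_by_gene = {
    "GENE1": [5.0, 3.2, 8.1, 2.4],   # above 1 TPM in 4 of 4 cell lines
    "GENE2": [4.0, 0.2, 6.3, 0.1],   # above 1 TPM in 2 of 4 cell lines
    "GENE3": [0.0, 0.5, 0.1, 0.3],   # above 1 TPM in 0 of 4 cell lines
}

def expressed_in_fraction(tpm, threshold=1.0, cutoff=1.0):
    """Keep genes whose fraction of cell lines above `cutoff` is at least `threshold`."""
    kept = []
    for gene, values in tpm.items():
        fraction = sum(value > cutoff for value in values) / len(values)
        if fraction >= threshold:
            kept.append(gene)
    return kept

print(expressed_in_fraction(tpm_by_gene))                  # strict default
print(expressed_in_fraction(tpm_by_gene, threshold=0.50))  # relaxed
```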

Cancer and CellLineCluster objects have an additional method that outputs a binary matrix
of which genes/cell lines have mutations

In [None]:
lung.mutation_matrix()
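
A binary mutation matrix marks each gene and cell line pair with 1 if a mutation is recorded and 0 otherwise. The sketch below builds one from invented (cell line, gene) pairs using nested dictionaries. The orientation (genes as outer keys) and the standalone `mutation_matrix` function are assumptions for illustration and may differ from what CanDI returns.

```python
# Invented (cell line, gene) pairs, one per recorded mutation.
mutation_pairs = [
    ("LINE-A", "KRAS"),
    ("LINE-A", "TP53"),
    ("LINE-B", "TP53"),
]

def mutation_matrix(pairs):
    """Return {gene: {cell_line: 0 or 1}} covering every gene and cell line seen."""
    cell_lines = sorted({line for line, _ in pairs})
    genes = sorted({gene for _, gene in pairs})
    present = set(pairs)
    return {
        gene: {line: int((line, gene) in present) for line in cell_lines}
        for gene in genes
    }

for gene, row in mutation_matrix(mutation_pairs).items():
    print(gene, row)
```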

## Organelle Methods and Attributes


In [None]:
pretty_print_attr(membrane)