# Prologue - Welcome to the dry-run hackathon.ipynb

In [1]:
# Load the dataset
import pandas as pd
df = pd.read_parquet("s3://allencell-cytodata-variance-data/variance-dataset/processed/hackathon_manifest_092022.parquet")
df.set_index(df['CellId'].astype(int), inplace=True)
print(f'Number of cells: {len(df)}')
print(f'Number of columns: {len(df.columns)}')

Number of cells: 214037
Number of columns: 78


In [17]:
# Visualize a cell
import nbvv
from serotiny.io.image import image_loader

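# Boolean indexing: pick one random TUBA1B cell in stage M3 with cell_volume > 3000
a_cell = df[
    (df['gene'] == "TUBA1B") &
    (df['cell_stage'] == "M3") &
    (df['cell_volume'] > 3000)
].sample(1).iloc[0]

# Load the registered image as an array, together with its channel names
img_data, channel_names = image_loader(a_cell["registered_path"], return_as_torch=False, return_channels=True)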

nbvv.volshow(
    img_data,
    spacing=[1,1,1],  # full_img.physical_pixel_sizes,
    channel_names=channel_names
)

VolumeWidget(dimensions={'tile_width': 204, 'tile_height': 136, 'rows': 15, 'cols': 10, 'atlas_width': 2040, '…

In [23]:
# Load the dataset
import pandas as pd
df = pd.read_parquet("s3://variance-dataset/processed/manifest.parquet")
print(f'Number of cells: {len(df)}')
print(f'Number of columns: {len(df.columns)}')

FileNotFoundError: variance-dataset/processed/manifest.parquet

In [21]:
# Visualize a cell
import nbvv
from aicsimageprocessing import read_ome_zarr

# Boolean indexing: pick one random TUBA1B cell in stage M3 with cell_volume > 3000
a_cell = df[
    (df['gene'] == "TUBA1B") &
    (df['cell_stage'] == "M3") &
    (df['cell_volume'] > 3000)
].sample(1).iloc[0]

full_img = read_ome_zarr(a_cell["registered_path"])
img_data = full_img.data.squeeze()
print(img_data.shape)
channel_names = full_img.channel_names
print(channel_names)

nbvv.volshow(
    img_data,
    spacing=[1,1,1],  # full_img.physical_pixel_sizes,
    channel_names=channel_names
)

(7, 136, 245, 381)
['bf', 'dna', 'membrane', 'structure', 'dna_segmentation', 'membrane_segmentation', 'struct_segmentation_roof']


VolumeWidget(dimensions={'tile_width': 204, 'tile_height': 136, 'rows': 15, 'cols': 10, 'atlas_width': 2040, '…

In [27]:
df_cell_metadata = df.filter(items=(col_df[col_df.category=="cell metadata"]['column name']))
df_cell_metadata.head(3)

CellId,cell_stage,CellId,Cellular Component,Description (from Uniprot),Draft mitotic state resolved,edge_flag,gene,Structure,this_cell_index,this_cell_nbr_complete,Protein
230741,M4M5,230741,cytoplasm,Central component of the receptor complex resp...,,0,TOMM20,mitochondria,1,1,Tom20
230745,M0,230745,cytoplasm,Central component of the receptor complex resp...,,0,TOMM20,mitochondria,5,1,Tom20
230746,M0,230746,cytoplasm,Central component of the receptor complex resp...,M0,0,TOMM20,mitochondria,6,0,Tom20


The code above works for any of the four categories (`cell metadata`, `field-of-view metadata`, `cell metric`, `cell images`): change the category string to get the corresponding columns. Section 2.4 gives verbose descriptions of the columns to help you understand what each value represents.
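The category lookup can be wrapped in a small helper. This is a minimal sketch on toy data: `columns_in_category` is a hypothetical helper name, the two toy frames only stand in for `df` and `col_df`, and `example_fov_column` is a made-up column name used for illustration.

```python
import pandas as pd

# Toy stand-in for the column-description table (col_df); contents are hypothetical.
col_df = pd.DataFrame({
    "column name": ["gene", "cell_stage", "cell_volume", "example_fov_column"],
    "description": ["", "", "", ""],
    "category": ["cell metadata", "cell metadata", "cell metric", "field-of-view metadata"],
})

# Toy stand-in for the manifest (df); values are hypothetical.
df = pd.DataFrame({
    "gene": ["TOMM20", "NUP153"],
    "cell_stage": ["M0", "M3"],
    "cell_volume": [1900.0, 3100.0],
    "example_fov_column": [1, 2],
})

def columns_in_category(df, col_df, category):
    """Return only the manifest columns tagged with the given category."""
    names = col_df.loc[col_df["category"] == category, "column name"]
    return df.filter(items=names.tolist())

df_cell_metric = columns_in_category(df, col_df, "cell metric")
print(df_cell_metric.columns.tolist())
```

Passing the category as an argument avoids editing the filter expression each time you switch categories.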

#### Compute mean volume of cells by cell line

We have already seen how to split the data by broad column category, such as `cell metadata`. We can also operate on specific columns using the usual pandas syntax. For example, every cell has a `cell_volume` metric, and we might ask whether all the cell lines have a similar cell volume. To answer this we need only two columns from the manifest: 1) **cell_volume** and 2) **gene**. The code below accesses these columns.

First we create a new dataframe with only the columns we are interested in. Then we group by gene and calculate the mean of the volumes.

In [28]:
# Keep only the two columns of interest, then average cell_volume per gene
df_cell_vol = df[['cell_volume','gene']].copy()
df_cell_vol.groupby('gene').mean()

gene,cell_volume
ACTB,1933.371832
ACTN1,2029.72911
ATP2A2,1567.063147
CETN2,1963.204244
CTNNB1,2093.322649
DSP,2001.827471
FBL,1917.92583
GJA1,1960.413409
HIST1H2BJ,1697.193264
LAMP1,1942.496855


Parsing the data this way, we can see the mean cell volume for each cell line.
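The same per-gene mean can also be computed by selecting the column after grouping, which returns a Series that is easy to sort. The sketch below uses a toy frame with made-up volumes rather than the real manifest.

```python
import pandas as pd

# Toy data with hypothetical volumes; the real manifest has one row per cell.
toy = pd.DataFrame({
    "gene": ["ACTB", "ACTB", "FBL", "FBL"],
    "cell_volume": [1000.0, 3000.0, 1500.0, 1700.0],
})

# Group first, then select the metric; sort so the largest mean comes first.
mean_volume = (
    toy.groupby("gene")["cell_volume"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_volume)
```

Sorting makes it easier to spot which cell lines sit at the extremes.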

### Advanced

In [29]:
# See if you can plot the results from above in this code block

Throughout this hackathon you will often interact with pandas DataFrames. For those unfamiliar with them, this sub-chapter explores some helpful querying and grouping functions.

Subsets of the data can be generated from conditionals, which follow standard boolean logic. Say, for example, you want to subset on very specific criteria: only cells from the `NUP153` cell line that were in interphase (`M0`) and had a `nuclear_height` > 3 microns. The code below expresses that:

#### Filtering

In [30]:
# Boolean Indexing
df_filtered_boolean_indexing = df[
    (df['gene'] == "NUP153") &
    (df['cell_stage'] == "M0") &
    (df['nuclear_height'] > 3)
]

df_filtered_boolean_indexing.shape

(16817, 78)

We now have a new dataframe that contains **16817** cells, all from the NUP153 cell line, in interphase, and with a nuclear height greater than 3 microns!

Similarly, you can filter categorical variables using list indexing. This gathers rows across several different values; in this case we create a new dataframe containing `NUP153`, `PXN`, and `TOMM20`.
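The same filter can be written with `DataFrame.query`, which some people find easier to read when several conditions are combined. This is a sketch on a toy frame with made-up values, not the real manifest.

```python
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "gene": ["NUP153", "NUP153", "PXN", "NUP153"],
    "cell_stage": ["M0", "M3", "M0", "M0"],
    "nuclear_height": [4.2, 5.0, 6.1, 2.5],
})

# Boolean-mask form, as used in the cell above.
mask_result = toy[
    (toy["gene"] == "NUP153") &
    (toy["cell_stage"] == "M0") &
    (toy["nuclear_height"] > 3)
]

# Equivalent query-string form.
query_result = toy.query('gene == "NUP153" and cell_stage == "M0" and nuclear_height > 3')

print(query_result)
```

Both forms return the same rows; note that the mask form needs parentheses around each condition because `&` binds more tightly than `==` and `>`.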

In [31]:
# List Indexing 
value_list = [
    "NUP153",
    "PXN",
    "TOMM20",
]

df_list_index = df[df["gene"].isin(value_list)]
df_list_index.shape

(45851, 78)
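To exclude a list of values instead of keeping them, the `isin` mask can be negated with `~`. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data; only the gene column matters for this example.
toy = pd.DataFrame({"gene": ["NUP153", "PXN", "TOMM20", "ACTB", "FBL"]})

value_list = ["NUP153", "PXN", "TOMM20"]

# Rows whose gene is in the list.
kept = toy[toy["gene"].isin(value_list)]

# Rows whose gene is NOT in the list: negate the boolean mask with ~.
excluded = toy[~toy["gene"].isin(value_list)]

print(excluded["gene"].tolist())
```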

#### Grouping

Grouping data by some criterion is a useful tool for analysis. The built-in pandas method `.groupby` lets us gather information on a particular slice of the data. We have already used it in the previous code blocks; here we demonstrate the useful method `describe`. Say, for example, we quickly want to know the count, mean, std, min, and max of a specific metric. Below we look at `nuclear_volume` for each `gene` by simply adding `.describe()` after the column we are interested in.

In [32]:
# Grouping
df_groupby = df.groupby(['gene'])
df_groupby.nuclear_volume.describe()

gene,count,mean,std,min,25%,50%,75%,max
ACTB,4010.0,550.073328,172.187209,76.509056,438.949047,535.582462,653.01694,1494.93663
ACTN1,8214.0,549.262034,184.254973,41.30271,436.248275,539.876617,663.695749,1520.946938
ATP2A2,10163.0,444.919202,132.225361,38.021226,356.023869,433.659386,527.043102,1384.668082
CETN2,7575.0,540.922476,177.04184,40.227106,428.286448,533.457952,650.581891,1707.55926
CTNNB1,6217.0,546.300572,176.876904,51.836262,433.761098,536.598311,655.345828,1508.413474
DSP,10228.0,555.827474,166.991559,47.630469,445.461477,540.170946,660.704462,1392.924556
FBL,10415.0,544.089984,170.460219,68.143242,433.550681,529.421257,651.274168,1889.263977
GJA1,6546.0,538.954979,173.463913,58.973903,426.52397,526.549799,645.722599,1699.281172
HIST1H2BJ,15875.0,475.939712,147.809395,40.804321,380.370549,465.113829,566.70952,1446.498822
LAMP1,10725.0,533.995609,182.976931,40.328818,417.756711,519.033916,646.606222,1907.677667
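When only a few of these statistics are needed, `agg` with named aggregations returns just those columns under names you choose. The sketch below uses toy data with made-up volumes; the output column names (`n_cells`, `mean_volume`, `median_volume`) are illustrative choices, not names from the manifest.

```python
import pandas as pd

# Toy data with hypothetical nuclear volumes.
toy = pd.DataFrame({
    "gene": ["ACTB", "ACTB", "FBL", "FBL", "FBL"],
    "nuclear_volume": [400.0, 600.0, 500.0, 550.0, 600.0],
})

# Named aggregation: keyword = output column name, value = aggregation to apply.
summary = toy.groupby("gene")["nuclear_volume"].agg(
    n_cells="count",
    mean_volume="mean",
    median_volume="median",
)
print(summary)
```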


#### Missing Data

Many columns in the dataset may be missing values. Though sparse, these omissions can affect your analysis. This subsection explores some simple ways to handle empty cells.

In [33]:
# Dropping missing values by column or whole dataset
df_dropna = df.copy()
df_dropna.dropna(subset=['shape_mode_3_major_tilt'], inplace = True)
df_dropna.shape

(175147, 78)

In [34]:
# Filling, either by single column or whole dataset 
df_fill_na = df.copy()
df_fill_na['shape_mode_3_major_tilt'] = df_fill_na['shape_mode_3_major_tilt'].fillna(0)
df_fill_na.shape

(214037, 78)
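Before dropping or filling, it can help to count how many values are actually missing. In the outputs above, dropping rows on `shape_mode_3_major_tilt` leaves 175147 of 214037 cells, so 38890 rows lack that value. The sketch below shows one way to get such counts directly, using a toy frame with made-up values.

```python
import pandas as pd

nan = float("nan")

# Toy data with hypothetical values; two of three tilt entries are missing.
toy = pd.DataFrame({
    "cell_volume": [1900.0, 2100.0, 1800.0],
    "shape_mode_3_major_tilt": [0.5, nan, nan],
})

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Fraction of missing values per column (mean of the boolean mask).
missing_fraction = toy.isna().mean()

print(missing_counts)
print(missing_fraction)
```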

The dataset contains 78 columns with important metrics that you may want to use during the challenge. In this section we'll create a tiny Dash app that runs within the notebook and displays an interactive table of the columns and their descriptions. **The table is searchable**, which makes querying more efficient.

In [39]:
from jupyter_dash import JupyterDash
import dash
from dash import dcc
from dash import html 
JupyterDash.infer_jupyter_proxy_config()
col_df = pd.read_csv("resources/hackathon_column_descriptions.csv",delimiter=",") #Already read in but just in case
col_df.columns = ["column name","description","category"] 

In [40]:
# Searchable lookup table for the column definitions
from dash import dash_table
app = JupyterDash(__name__)
server = app.server

app.layout = dash_table.DataTable(
    col_df.to_dict('records'), 
    [{"name": i, "id": i} for i in col_df.columns],
    style_data={
        'whiteSpace':'normal',
        'height': 'auto',
        'lineHeight':'15px',
        'backgroundColor': 'rgb(50,50,50)',
        'color': 'white',
    },
    style_header={
        'backgroundColor':'rgb(30,30,30)',
        'color':'white'
    },  
    style_cell={
        'textAlign':'left'
    },
    filter_action="native",
)
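If serving the Dash app is not possible in your environment, a plain-pandas search over the description table gives a similar lookup. This is a minimal sketch: `search_columns` is a hypothetical helper, and the toy `col_df` below, including its descriptions, is made up for illustration.

```python
import pandas as pd

# Toy stand-in for col_df; the descriptions are hypothetical.
col_df = pd.DataFrame({
    "column name": ["cell_volume", "nuclear_volume", "gene"],
    "description": [
        "Volume of the cell (hypothetical description)",
        "Volume of the nucleus (hypothetical description)",
        "Tagged gene (hypothetical description)",
    ],
    "category": ["cell metric", "cell metric", "cell metadata"],
})

def search_columns(col_df, term):
    """Return rows whose column name or description contains term (case-insensitive)."""
    in_name = col_df["column name"].str.contains(term, case=False, regex=False)
    in_description = col_df["description"].str.contains(term, case=False, regex=False)
    return col_df[in_name | in_description]

hits = search_columns(col_df, "volume")
print(hits["column name"].tolist())
```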


In [41]:
# The app is served on a local port (8050 in the output below); enable port forwarding on your machine to view it.
# TODO: work with Gui to enable this directly from the app rather than port forwarding.
app.run_server(mode="jupyterlab", debug=False)

 * Running on http://127.0.0.1:8050
Press CTRL+C to quit