# Prologue - Welcome to the dry-run hackathon

In [1]:
# Load the dataset
import pandas as pd
df = pd.read_parquet("s3://allencell-cytodata-variance-data/variance-dataset/processed/hackathon_manifest_092022.parquet")
print(f'Number of cells: {len(df)}')
print(f'Number of columns: {len(df.columns)}')

Number of cells: 214037
Number of columns: 78


In [2]:
# Visualize a cell
import nbvv
from serotiny.io.image import image_loader

# Boolean Indexing
a_cell = df[
    (df['gene'] == "TUBA1B") &
    (df['cell_stage'] == "M3") &
    (df['cell_volume'] > 3000)
].sample(1).iloc[0]

img_data, channel_names = image_loader(a_cell["registered_path"], return_as_torch=False, return_channels=True)

nbvv.volshow(
    img_data,
    spacing=[1,1,1],  # full_img.physical_pixel_sizes,
    channel_names=channel_names
)



VolumeWidget(dimensions={'tile_width': 204, 'tile_height': 136, 'rows': 15, 'cols': 10, 'atlas_width': 2040, '…

In [None]:
# # Load the dataset
# import pandas as pd
# df = pd.read_parquet("s3://variance-dataset/processed/manifest.parquet")
# print(f'Number of cells: {len(df)}')
# print(f'Number of columns: {len(df.columns)}')

In [None]:
# # Visualize a cell
# import nbvv
# from aicsimageprocessing import read_ome_zarr

# # Boolean Indexing
# a_cell =df[
#     (df['gene'] == "TUBA1B") &
#     (df['cell_stage'] == "M3") &
#     (df['cell_volume'] > 3000)
# ].sample(1).iloc[0]

# full_img = read_ome_zarr(a_cell["registered_path"])
# img_data = full_img.data.squeeze()
# print(img_data.shape)
# channel_names = full_img.channel_names
# print(channel_names)

# nbvv.volshow(
#     img_data,
#     spacing=[1,1,1],  # full_img.physical_pixel_sizes,
#     channel_names=channel_names
# )

In [None]:
# col_df is the column-description table (columns: "column name", "description", "category")
df_cell_metadata = df.filter(items=(col_df[col_df.category=="cell metadata"]['column name']))
df_cell_metadata.head(3)

The code above can be pointed at any of the four categories [`cell metadata`, `field-of-view metadata`, `cell metric`, `cell images`] to quickly pull the corresponding columns. Section 2.4 gives verbose descriptions of the columns, which will help you understand what each of these values represents.
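
As a rough sketch of the category lookup described above, the snippet below wraps it in a small helper. The helper name `columns_in_category` and the toy tables are hypothetical stand-ins, not part of the hackathon resources; only the `column name` and `category` headers mirror the description table used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the manifest and the column-description table (hypothetical values).
toy_df = pd.DataFrame({
    "gene": ["TUBA1B", "NUP153"],
    "cell_stage": ["M0", "M3"],
    "cell_volume": [2500.0, 3100.0],
})
toy_col_df = pd.DataFrame({
    "column name": ["gene", "cell_stage", "cell_volume"],
    "category": ["cell metadata", "cell metadata", "cell metric"],
})

def columns_in_category(col_table, category):
    """Return the column names that the description table assigns to `category`."""
    return col_table.loc[col_table["category"] == category, "column name"].tolist()

# Keep only the manifest columns that belong to the requested category.
toy_metrics = toy_df.filter(items=columns_in_category(toy_col_df, "cell metric"))
print(toy_metrics.columns.tolist())
```

Swapping the category string is all that is needed to pull a different group of columns.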

#### Compute mean volume of cells by cell line

We have already seen how to split the data by broad column category, such as cell metadata. We can also operate directly on specific columns with the usual pandas syntax. For example, every cell has a `cell_volume` metric, and we may ask whether all the cell lines have a similar cell volume. To answer this we need only two columns from the manifest: 1) **cell_volume** and 2) **gene**. The code below accesses these columns.

First we create a new dataframe with only the columns we are interested in. Then we group by gene and calculate the mean of the volumes.

In [None]:
# Keep only the two columns of interest, then average cell volume per gene
df_cell_vol = df[['cell_volume','gene']].copy()
df_cell_vol.groupby('gene').mean()

Parsing the data this way we can see the mean cell volumes for each cell line. 
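
To compare cell lines at a glance, it can help to select the metric column after grouping and sort the result. The sketch below uses a toy dataframe with made-up volumes rather than the real manifest.

```python
import pandas as pd

# Toy data with hypothetical volumes; the real manifest has one row per cell.
toy = pd.DataFrame({
    "gene": ["TUBA1B", "TUBA1B", "NUP153", "NUP153"],
    "cell_volume": [2000.0, 3000.0, 1000.0, 2000.0],
})

# Group by cell line, average the volume, and sort from largest to smallest.
mean_volume = (
    toy.groupby("gene")["cell_volume"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_volume)
```

Selecting `["cell_volume"]` after `groupby` returns a Series indexed by gene, which is convenient for sorting or plotting.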

### Advanced

In [None]:
#See if you can try and plot the results from above in this code block

Throughout this hackathon you will often interact with Pandas DataFrames. For those unfamiliar with Pandas DataFrames, some helpful querying and grouping functions are explored within this sub-chapter. 

Subsets of the data can be generated from a conditional, and these conditionals follow standard boolean logic. Say, for example, you want to subset on very specific criteria: only cells from the `NUP153` cell line that were in interphase (`M0`) and had a `nuclear_height` greater than 3 microns. That is expressed by the code below:

#### Filtering

In [None]:
# Boolean Indexing
df_filtered_boolean_indexing = df[
    (df['gene'] == "NUP153") &
    (df['cell_stage'] == "M0") &
    (df['nuclear_height'] > 3)
]

df_filtered_boolean_indexing.shape

We now have a new dataframe that contains **16817** cells, all from the NUP153 cell line, in interphase, and with a nuclear height greater than 3 microns!
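
The same kind of filter can also be written with `DataFrame.query`, which some people find easier to read. The sketch below checks on a toy dataframe (hypothetical values) that both forms select the same rows.

```python
import pandas as pd

# Toy data with hypothetical values, using the same column names as the manifest.
toy = pd.DataFrame({
    "gene": ["NUP153", "NUP153", "PXN", "NUP153"],
    "cell_stage": ["M0", "M3", "M0", "M0"],
    "nuclear_height": [4.2, 5.0, 6.1, 2.5],
})

# Boolean indexing: each condition is parenthesised and combined with &.
mask = (
    (toy["gene"] == "NUP153")
    & (toy["cell_stage"] == "M0")
    & (toy["nuclear_height"] > 3)
)
by_mask = toy[mask]

# Equivalent filter written as a query string.
by_query = toy.query('gene == "NUP153" and cell_stage == "M0" and nuclear_height > 3')

print(by_mask.equals(by_query))
```

Note that the parentheses around each condition are required with `&`, because `&` binds more tightly than `==` and `>` in Python.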

Similarly, you can filter categorical variables against a list of values. This gathers data across several values at once; in this case we create a new dataframe containing the `NUP153`, `PXN`, and `TOMM20` cell lines.

In [None]:
# List Indexing 
value_list = [
    "NUP153",
    "PXN",
    "TOMM20",
]

df_list_index = df[df["gene"].isin(value_list)]
df_list_index.shape
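
A small sketch of the same idea on toy data (hypothetical rows): `isin` builds a boolean mask, and prefixing it with `~` inverts the mask to exclude the listed values instead.

```python
import pandas as pd

# Toy data; gene names are taken from the examples in this notebook.
toy = pd.DataFrame({"gene": ["NUP153", "PXN", "TOMM20", "TUBA1B", "PXN"]})
wanted = ["NUP153", "PXN", "TOMM20"]

# Rows whose gene is in the list.
selected = toy[toy["gene"].isin(wanted)]

# Rows whose gene is NOT in the list (the ~ operator negates the mask).
excluded = toy[~toy["gene"].isin(wanted)]

# Quick check of how many cells each selected gene contributes.
print(selected["gene"].value_counts().to_dict())
```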

#### Grouping

Grouping data by some criterion is a useful analysis tool. With the pandas `.groupby` method we can gather information on a particular slice of the data. We have already used it in the previous code blocks, but here we demonstrate the useful `describe` method. Say, for example, we quickly want the count, mean, std, min, and max of a specific metric. Below we look at `nuclear_volume` for each `gene` by appending `.describe()` to the column we are interested in.

In [None]:
# Grouping
df_groupby = df.groupby(['gene'])
df_groupby.nuclear_volume.describe()
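
When only a few statistics are needed, `agg` with a list of function names is a lighter alternative to `describe`. The sketch below uses a toy dataframe with made-up nuclear volumes.

```python
import pandas as pd

# Toy data with hypothetical nuclear volumes.
toy = pd.DataFrame({
    "gene": ["NUP153", "NUP153", "PXN"],
    "nuclear_volume": [400.0, 600.0, 500.0],
})

# One row per gene, one column per requested statistic.
summary = toy.groupby("gene")["nuclear_volume"].agg(["count", "mean", "min", "max"])
print(summary)
```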

#### Missing Data

Many columns in the dataset may be missing values. Though sparse, these omissions can affect your analysis. This subsection explores some simple ways to handle empty cells.

In [None]:
# Dropping missing values by column or whole dataset
df_dropna = df.copy()
df_dropna.dropna(subset=['shape_mode_3_major_tilt'], inplace = True)
df_dropna.shape

In [None]:
# Filling, either by single column or whole dataset 
df_fill_na = df.copy()
df_fill_na['shape_mode_3_major_tilt'] = df_fill_na['shape_mode_3_major_tilt'].fillna(0)
df_fill_na.shape
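
Before dropping or filling, it is often worth counting how many values are actually missing. The sketch below does this on a toy dataframe (hypothetical values) and then fills with the column median instead of 0, which is one possible choice among several, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in the column used in the cells above.
toy = pd.DataFrame({
    "gene": ["NUP153", "PXN", "TOMM20"],
    "shape_mode_3_major_tilt": [0.5, np.nan, 1.5],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop the rows where this column is missing.
dropped = toy.dropna(subset=["shape_mode_3_major_tilt"])

# Option 2: fill the gaps with the column median (computed ignoring NaN).
median_tilt = toy["shape_mode_3_major_tilt"].median()
filled = toy.fillna({"shape_mode_3_major_tilt": median_tilt})
```

Neither call modifies `toy` in place, so the original data stays available for comparison.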

The dataset contains 78 columns with important metrics that you may want to incorporate during the challenge. In this section we'll create a tiny Dash app that runs within the notebook and displays an interactive table of the columns and their descriptions. **The table is searchable**, which makes looking up a column more efficient.

In [None]:
from jupyter_dash import JupyterDash
import dash
from dash import dcc
from dash import html 
JupyterDash.infer_jupyter_proxy_config()
col_df = pd.read_csv("resources/hackathon_column_descriptions.csv", delimiter=",")  # Already read in, but just in case
col_df.columns = ["column name", "description", "category"]

In [None]:
# Searchable lookup table for the column definitions
from dash import dash_table
app = JupyterDash(__name__)
server = app.server

app.layout = dash_table.DataTable(
    col_df.to_dict('records'), 
    [{"name": i, "id": i} for i in col_df.columns],
    style_data={
        'whiteSpace':'normal',
        'height': 'auto',
        'lineHeight':'15px',
        'backgroundColor': 'rgb(50,50,50)',
        'color': 'white',
    },
    style_header={
        'backgroundColor':'rgb(30,30,30)',
        'color':'white'
    },  
    style_cell={
        'textAlign':'left'
    },
    filter_action="native",
)


In [None]:
app.run_server(mode="jupyterlab", debug=False)  # Runs on a specific port; you may need to enable port forwarding on your machine. TODO: work with Gui to enable this directly from the app rather than via port forwarding.