# Part 1: Load data

## Gate data in FlowJo

The flow cytometer saves data in `.fcs` files, one per sample. While it is possible to load these files in Python, it is (currently) easier to do an initial gating step in FlowJo, a paid flow cytometry analysis software. Specifically, you'll want to gate single cells in two steps:

1. **Gate cells**: in an FSC-A vs SSC-A plot, select the main cluster of events (assuming voltages are set properly) to remove debris and obvious clumps
2. **Gate single cells**: in an FSC-A vs FSC-H (or SSC-A vs SSC-H) plot of the gated cell population, exclude the population with similar height but larger area to remove doublets

Draw these gates on one sample, then apply them uniformly to the entire experiment. Spot-check wells to ensure the gates capture the desired populations.

Finally, export the single cell population as `.csv` files, saving them in Smithsonian in a folder near the `.fcs` files. Use the 'CSV - scale value' setting. For easiest data loading later, save the files with the default filename in FlowJo, which is 'export_[Filename]_[PopulationName].csv'. For plate experiments, this will create a file like 'export_A1_singlets.csv'. Note that this operation will generate one `.csv` file per sample (`.fcs` file).

Because the FlowJo software is paid, it is currently only available on the lab computers (e.g., Attune computer). If possible, it is convenient to perform this single cell gating immediately after finishing your Attune run. You can then do all subsequent data analysis steps on your own computer.

## Set up your Jupyter notebook

You should do all data analysis in a Git repository for version control and collaboration. Create or clone a Git repo for your project, or follow along with this tutorial by cloning the following repo: [https://github.com/GallowayLabMIT/example-training.git](https://github.com/GallowayLabMIT/example-training.git) 

**Tip:** Follow [this checklist](https://gallowaylabmit.github.io/protocols/en/latest/training/onboarding/startup_checklist.html) for working with repositories. See also [this section](https://gallowaylabmit.github.io/protocols/en/latest/training/onboarding/environment_check.html#the-next-frontier) in the Computational environment check protocol for creating a new project repo and a first Jupyter notebook.

A common repo layout may be:

```text
your-repo
├── data-analysis
│   ├── exp001.ipynb
│   ├── exp002.ipynb
│   └── ...
├── inputs
│   └── plasmid-metadata.csv
├── output
│   ├── exp001
│   │   ├── data.gzip
│   │   ├── gates.svg
│   │   ├── scatter.png
│   │   └── ...
│   ├── exp002
│   └── ...
├── .gitignore
├── datadir.txt
├── README.md
└── requirements.txt
```


It is common to perform data analysis in Jupyter notebooks located in a subfolder called 'data_analysis', 'flow', or similar. (In this repo, Jupyter notebooks are in the folder 'analysis_tutorials'.) In this folder, it is common to create a new Jupyter notebook (`.ipynb` file) for each experiment. Other files and directories are discussed below.

In the Jupyter notebook, the first step is to import the relevant packages you'll need for analysis.

In [None]:
# Import all our favorite packages
import matplotlib       # basic plotting
import matplotlib.pyplot as plt
import numpy as np      # array manipulation and math
import pandas as pd     # working with 2D arrays (DataFrame structures)
import rushd as rd      # created by the lab for reproducible data analysis, see more below
import scipy as sp      # statistics, etc.
import seaborn as sns   # fancy plotting

## Specify the path to your data

#### Data path

To load your data in Python, you'll need the path to the `.csv` files you exported from FlowJo. You could look up the absolute path on your computer and use that directly. However, an absolute path would mean that others can't run your notebook as-is: they'd have to replace the path with one specific to their computer. For reproducible analysis and easy collaboration, you should instead specify a relative path that does not depend on the exact location of the data on your computer. That way, future users on other computers don't need to edit any paths.

While there are several ways this could be achieved, the Python package `rushd` offers an elegant solution. `rushd` was created by members of the Galloway lab for this exact purpose: to make data management easier, robust, and reproducible. Using this package makes it much simpler to collaborate on data analysis, and for the broader scientific community to reproduce analysis and plots in published manuscripts.

`rd.datadir` is a variable representing the path to your main data directory, i.e., the folder where all of your data lives. This should be the highest level directory you'll ever need for your project. For the Galloway lab, this is the root of Smithsonian, the Nextcloud-synced server where all our data is stored. You'll need to set up this path for `rushd` just once for the entire repo.

To do so, create the file `datadir.txt` in the root of your repo (here, directly in the 'example-training' folder). In the file, paste the full, absolute path to your main data directory. You do not need quotes or any other characters around this path, and this should be the only line in the file. For example, the path might look something like `/Users/username/Library/CloudStorage/Nextcloud/kerberos@mit.edu@smithsonian.mit.edu`.

**Tip:** Be sure to tell Git to ignore `datadir.txt`! To do so, add 'datadir.txt' to a new line in the `.gitignore` file in your repo. Also add a description of `datadir.txt` and how to create it in your repo's `README.md`, so that anyone who clones your repo can follow these same steps.

**Tip:** If you edit `datadir.txt` while running a Jupyter notebook, you must restart the kernel for the changes to take effect.
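
As a rough illustration of the file format described above, the following standard-library sketch reads a `datadir.txt` file and checks that it holds a single path to an existing directory. The helper `read_datadir` is hypothetical and is not how `rushd` is implemented. The demo uses temporary folders as stand-ins for a repo and a data directory.

```python
from pathlib import Path
import tempfile

def read_datadir(repo_root):
    """Hypothetical helper: read the single path stored in datadir.txt."""
    text = (Path(repo_root) / 'datadir.txt').read_text()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 1:
        raise ValueError('datadir.txt should contain exactly one line')
    datadir = Path(lines[0])
    if not datadir.is_dir():
        raise FileNotFoundError(f'{datadir} is not an existing directory')
    return datadir

# Demo: temporary folders stand in for the repo root and the data directory
with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as data_dir:
    (Path(repo) / 'datadir.txt').write_text(data_dir + '\n')
    demo_datadir = read_datadir(repo)
    demo_ok = demo_datadir == Path(data_dir)

print(demo_ok)  # True
```

A check like this can catch the most common setup mistakes (quotes around the path, extra lines, or a path that is not synced on this computer) before any data loading is attempted.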

Here's an example set of data we'll use for this training. To run the analysis below, make sure you have `datadir.txt` set up in your repo.

In [None]:
# Path to example data
data_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.12.04_exp092.3'/'export'

#### Output path

Next, in your Jupyter notebook, it is a good idea to specify a path to a folder where you can save any plots you make. A common place to put these is in a folder called 'output' in your repo, in a sub-folder named for the experiment (probably similar to the name of your Jupyter notebook).

**Tip:** Be sure to tell Git to ignore the 'output' folder! These files can be regenerated by running your code, so there's no need for Git to track and store them.

`rushd` also provides a convenient way to specify paths within your repo. While it is possible to write paths relative to your Jupyter notebook, these would need to be updated if you ever changed the file structure or location of the notebook. Instead, use `rd.rootdir`, which is the path to the root of your repo (here, the path to the 'example-training' directory). 

In [None]:
# Path to directory to save analysis outputs
output_path = rd.rootdir/'output'/'flow-tutorial'

In addition to the path variables `rd.rootdir` and `rd.datadir`, `rushd` also includes functions for loading common data types (more on this below), performing common calculations, and plotting. See the [documentation](https://gallowaylabmit.github.io/rushd/en/main/index.html) for details.

## Add well metadata with a `.yaml` file

#### What are `.yaml` files?

A second important feature of `rushd` is the way it loads data with user-specified metadata for wells. Specifically, it maps metadata—typically experimental condition information—to samples based on their well ID (e.g., A1). This metadata is encoded in a `.yaml` file.

YAML is a human-readable file format (data serialization language) that is often used for writing configuration files. It is similar to JSON. Basically, it is an easy, standardized way to write nested dictionaries and arrays, which is how you'll use it here. For quick tips on syntax, check out [this brief tutorial](https://learnxinyminutes.com/yaml/), from which some examples are reproduced below.

Dictionary elements are written as key-value pairs:

```yaml
key: value
key2: value2
```

These can be nested using indentation:

```yaml
nested_dict:
  key: value
```

You can also create lists:

```yaml
list:
  - item1
  - item2
```

And combine lists with dictionaries:

```yaml
nested_dict:
  list:
    - key: value
    - key2: value2
```

Numbers are interpreted as `int`s or `float`s; enclose with quotes to treat them as strings. Letters/words are treated as strings even without quotes. There's more, of course, but this is the bare minimum you'll need to write metadata `.yaml` files compatible with `rushd`.
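
For orientation, the YAML snippets above correspond to the following Python objects once parsed. This sketch writes the structures out by hand rather than calling a YAML parser, and the variable names (and the 'amount' key) are purely illustrative.

```python
# `key: value` pairs become a dictionary
flat = {'key': 'value', 'key2': 'value2'}

# Indentation creates a nested dictionary
nested = {'nested_dict': {'key': 'value'}}

# `- item` entries become a list
with_list = {'list': ['item1', 'item2']}

# A list of key-value pairs becomes a list of single-entry dictionaries
combined = {'nested_dict': {'list': [{'key': 'value'}, {'key2': 'value2'}]}}

# Unquoted numbers are numeric; quoted numbers stay strings
#   amount: 10    ->  {'amount': 10}
#   amount: '10'  ->  {'amount': '10'}
numeric, text = {'amount': 10}, {'amount': '10'}
print(type(numeric['amount']).__name__, type(text['amount']).__name__)  # int str
```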



#### Writing a `.yaml` metadata file

`rushd` can read `.yaml` files to map conditions to wells. Multiple features (kinds of conditions) can be specified per well, e.g., plasmid and inducer. Conveniently, you can write rectangular regions as ranges, e.g., A1-H12 for an entire 96-well plate.

The general form looks something like this: 

```yaml
metadata:
  feature1:
    - treatment1: A1
    - treatment2: A2-A12
    - treatment3: B1-B12
  feature2:
    - 1: A1-B1, A12
    - 10: A2-B2
    - 100: A3-B3
    - 0: A4-B11
```

In this example, well A1 has 'treatment1' for the feature 'feature1' and `1` for 'feature2'. Since we don't specify 'feature2' for well B12, it will be loaded as `<NA>`. Note that `rushd` expects everything to be nested under the `metadata:` key.
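
To make the range notation concrete, here is a minimal sketch of how a specification like 'A1-B1, A12' can be expanded into individual wells. The helper `expand_wells` is hypothetical and is not `rushd`'s own parser. It assumes single-letter uppercase rows and treats each range as a rectangle defined by its two corners.

```python
from itertools import product
from string import ascii_uppercase

def expand_wells(spec):
    """Hypothetical helper: expand 'A1-B2, A12' into a set of well IDs."""
    wells = set()
    for part in str(spec).split(','):
        start, _, end = part.strip().partition('-')
        end = end or start  # a single well is a 1x1 rectangle
        first_row = ascii_uppercase.index(start[0])
        last_row = ascii_uppercase.index(end[0])
        rows = ascii_uppercase[first_row:last_row + 1]
        cols = range(int(start[1:]), int(end[1:]) + 1)
        wells.update(f'{r}{c}' for r, c in product(rows, cols))
    return wells

# 'feature2' from the example above
feature2 = {1: 'A1-B1, A12', 10: 'A2-B2', 100: 'A3-B3', 0: 'A4-B11'}
assigned = set().union(*(expand_wells(s) for s in feature2.values()))
print('A12' in assigned, 'B12' in assigned)  # True False
```

Well B12 never appears in any 'feature2' range, which is why it has no value for that feature.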

In a more realistic example, here are the contents of a `.yaml` file ([link](https://mitprod.sharepoint.com/:u:/s/GallowayLab/EaYvEuC2pFZAthDMbKNZvyQBtWvaCUpX1tWi6wEhoGeLHw?e=PXO9n7)) for a real experiment in the lab. In this experiment, a reporter plasmid ('construct') was transfected with different transcriptional activator plasmids ('activator') and treated with different small-molecule inducers ('inducer'). (Here, 'NT' refers to the non-transfected (aka untransfected) condition.) Each experimental condition is defined by a combination of construct, activator, and inducer, and there are multiple wells per condition.

```yaml
metadata:
  construct:
    - pGEEC525: A1-A4, C1-C4, E1-E4, G1-G4
    - pGEEC526: B1-B4, D1-D4, F1-F4, H1-H4
    - NT: A5-A8
    - pGEEC549: B5-B8
    - pGEEC516: C5-C8
  activator:
    - pGEEC527: A1-A4, E1-E4
    - pGEEC528: B1-B4, F1-F4
    - pGEEC541: C1-C4, G1-G4
    - pGEEC542: D1-D4, H1-H4
    - NT: A5-A8
    - none: B5-C8
  inducer:
    - none: A1-D4, A5-C8
    - Rap: E1-E4, G1-G4
    - dox: F1-F4, H1-H4
```

As you can see, this format is a lot easier than writing out the full condition for each well!

#### Specify the path to the `.yaml` file

You should save the `.yaml` file in the same folder as (or near) your raw data (i.e., in Smithsonian). Most data-loading functions in `rushd` take the arguments 'data_path', the path to the data files to load, and 'yaml_path', the path to the `.yaml` file containing metadata for that set of data. Usually, both of these will be written using `rd.datadir`.

Then, double-check that the `.yaml` file correctly specifies conditions. It's easy to write errors in the `.yaml` file, but these can be spotted by visualizing the plate map using `rushd`. A separate plate map will be drawn for each feature. Note: If multiple values for a feature are specified for a single well, they will be loaded as 'value1.value2' (usually, you don't want this). See [this tutorial](https://gallowaylabmit.github.io/rushd/en/main/tutorial/overview/plot_well_metadata.html) for details.

In [None]:
# Visualize .yaml metadata (i.e., well mapping / plate map)
yaml_path = rd.datadir/'data'/'attune'/'kasey'/'2024.07.18_exp100'/'export'/'wells.yaml'
rd.plot.plot_well_metadata(yaml_path)

In [None]:
# Another example, using the .yaml file printed above
rd.plot.plot_well_metadata(rd.datadir/'data'/'attune'/'kasey'/'2024.07.16_exp099'/'export_comp'/'wells.yaml')

## Load `.csv` data into a pandas DataFrame

#### What to expect

Now you are finally ready to load your data! As a summary, you should have the following:

1. A folder of `.csv` files containing your data gated on single cells, stored in Smithsonian
2. A Jupyter notebook file in the folder 'data_analysis' (or similar) in your repo where you will analyze the data from your experiment
3. The folder 'output' in your repo, containing a sub-folder where you will save analysis outputs for this experiment
4. The file `datadir.txt` with the path to your local sync of Smithsonian, located in the root directory of your repo
5. A `.yaml` file with well-level metadata for your experiment, typically located in the same folder as your data in Smithsonian

If you are following along with the examples in this notebook, you only need to set up #3-4.

The basic idea is to load the data from a single experiment into one [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). This data structure is a long-form array, where each row is a data point (cell), and each column is a feature of that data, including all the channels the flow cytometer measures and any metadata. The `seaborn` plotting package works well with these data structures and contains a [nice explanation](https://seaborn.pydata.org/tutorial/data_structure.html) of long-form data.

Loading the data from a single experiment will require combining data for each well (each `.csv` file) and possibly for multiple plates. At the same time, it is useful to add metadata, such as well, plate, and condition information. `rushd` has several functions that do this.

#### `rushd` functions for loading flow data

- `flow.load_csv`: load data from multiple samples without well metadata
- `flow.load_csv_with_metadata`: load data from multiple samples (a plate) with well metadata
- `flow.load_groups_with_metadata`: load data from multiple plates with well metadata

See the [documentation](https://gallowaylabmit.github.io/rushd/en/main/api/rushd.html#rushd.flow.load_csv) for more information. The latter two functions are most commonly used. The first function is useful for samples that were run as tubes on the flow cytometer, which do not have well IDs in the sample names. Instead, the function saves metadata from the filename only. It is possible to add condition-level metadata for these samples after loading using a strategy discussed later.
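
To illustrate how metadata can be pulled from a filename, the sketch below parses FlowJo's default export name with a regular expression. The pattern and the helper `parse_filename` are assumptions made for illustration. They are not taken from `rushd`, which has its own filename handling described in its documentation.

```python
import re

# Pattern for FlowJo's default export name, 'export_{well}_{population}.csv'
FILENAME_PATTERN = re.compile(
    r'^export_(?P<well>[A-P]\d{1,2})_(?P<population>.+)\.csv$'
)

def parse_filename(name):
    """Hypothetical helper: return {'well': ..., 'population': ...} or None."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None

print(parse_filename('export_A1_singlets.csv'))  # {'well': 'A1', 'population': 'singlets'}
print(parse_filename('notes.txt'))               # None
```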

In [None]:
'''
Example 1: Single plate

This uses the default filename from exporting in FlowJo, namely 'export_{well}_{population}.csv'.
Columns labeled 'well' and 'population' are added based on the filename.
Here, for instance, the file 'export_A1_singlets.csv' is added with 'A1' in the 'well' column 
and 'singlets' in the 'population' column.
'''
data_path = rd.datadir/'data'/'attune'/'kasey'/'2024.12.04_exp092.3'/'export'
yaml_path = data_path/'wells.yaml'
data = rd.flow.load_csv_with_metadata(data_path, yaml_path)
display(data)

# Print the column names to see all the flow cytometer channels
display(data.columns)

In [None]:
'''
Example 2: Load selected columns

This example loads the same data as above, but only the two channels we care about,
specified via the 'columns' argument. This saves time/space by not storing values 
for channels not used in the experiment.
'''
channel_list = ['mRuby2-A','mGL-A']
data = rd.flow.load_csv_with_metadata(data_path, yaml_path, columns=channel_list)
display(data)

In [None]:
'''
Example 3: Multiple plates

This example loads four plates. Files are named with the default FlowJo naming as above, and
the data for each plate are stored in separate folders. These data are then loaded into a single
DataFrame with extra metadata specifying the cell type in each plate.
'''
base_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.02.07_exp77.3'/'export'

plates = pd.DataFrame({
    'data_path': [base_path/f'plate{n+1}' for n in range(4)],
    'yaml_path': [base_path/'exp77.3_plate1_wells.yaml', base_path/'exp77.3_plate2_wells.yaml',]*2,
    'cell': ['MEF', 'MEF', '293T', '293T'],
})

data2 = rd.flow.load_groups_with_metadata(plates, columns=['mRuby2-A','mGL-A'])
display(data2)

In [None]:
''' 
Example 4: Tube experiment
'''
# TODO: data3

#### Save data in a local cache

Loading data from a locally synced cloud server can sometimes take several minutes, as each file must first be downloaded from the server (assuming you are using a virtual file system). To speed up this step for future analyses, you can download the data once from the server and store it locally on your computer. Note that this will take up storage space—save space by only loading and saving the data channels you care about.

It's convenient to store these data caches in a 'data_cache' folder in your repo, or alternatively in the 'output' sub-folder associated with your Jupyter notebook.

**Tip:** Be sure to tell Git to ignore the folder where you store local caches! These files are large and should not be tracked by Git.

`pandas` can store DataFrames in the Parquet format, a compact columnar file format. See [here](https://pandas.pydata.org/docs/user_guide/io.html#parquet) for more details. It can be convenient to wrap the path passed to the save call in a call to `rd.outfile`, which writes metadata to an associated `.yaml` file. It also creates any directories in the specified cache path that do not yet exist.

In [None]:
'''
Local data cache

Save the data as a Parquet file (named 'data.gzip' here) in a sub-folder of the 'output' 
folder in your Git repo. The if-else structure here is convenient because you can re-run 
this cell of the notebook whether or not the cache already exists.
'''
cache_path = rd.rootdir/'output'/'exp092.3'/'data.gzip'

data = pd.DataFrame()

# If cache exists, load data from cache
if cache_path.is_file(): 
    data = pd.read_parquet(cache_path)

# Otherwise, load from datadir and create cache
else: 
    data = rd.flow.load_csv_with_metadata(data_path, data_path/'wells.yaml', columns=channel_list)
    data.to_parquet(rd.outfile(cache_path))
    
display(data)

## Add condition-level metadata

Sometimes, the values of various features in your data have additional metadata. For instance, experimental conditions may be defined by a plasmid, inducer amount, and replicate. It's easiest (and clearest) to add the plasmid information to the `.yaml` well mapping using plasmid IDs (e.g., pKG01000, pTA021). However, when analyzing your data, you may be interested in other features of the plasmid, for instance the promoter, gene, or syntax. As this metadata remains the same across experiments, it's convenient to encode this only once, then add it to your data as needed.

For example, a spreadsheet for TANGLES constructs could look like this:

| plasmid | upstream_gene | downstream_gene | spacer | syntax             |
| ------- | ------------- | --------------- | ------ | ------------------ |
| pTA001  | tagBFP        | mRuby2          | 1x     | downstream_tandem  |
| pTA002  | tagBFP        | mRuby2          | 1x     | divergent          |
| pTA003  | tagBFP        | mRuby2          | 1x     | convergent         |
| pTA004  | mRuby2        | tagBFP          | 1x     | upstream_tandem    |

#### Using an external file

Specifically, you can create a single mapping of plasmid ID (or cell line ID, or other condition) to metadata, load this as a DataFrame, then merge the DataFrame with your data. The most readable format to write the mapping is probably an Excel file (or `.csv`). You should save this file in your repo so it's tracked by Git and available to others. A good place to put this is in an 'inputs' folder.

In [None]:
'''
Example 1: Column names match

If the name of the first column in the metadata file matches a column name in your data, 
you can merge directly using the 'on' argument. Here, the plasmid IDs are in the 'construct' 
column of both the data and the metadata.
'''
# Load metadata
metadata_path = rd.rootdir/'inputs'/'plasmid-metadata.csv'
metadata = pd.read_csv(metadata_path)
display(metadata)

# Add metadata to data
data = data.merge(metadata, how='left', on='construct')
display(data)
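
With `how='left'`, any construct that is missing from the metadata file silently gets empty metadata columns. One way to catch this, sketched below on made-up toy data, is the `indicator` argument of `merge`. The plasmid ID 'pTA999' and the fluorescence values are invented for the example.

```python
import pandas as pd

# Toy stand-ins for the flow data and the plasmid metadata (values are made up)
cells = pd.DataFrame({
    'construct': ['pTA001', 'pTA002', 'pTA999'],
    'mGL-A': [120.0, 340.0, 95.0],
})
plasmids = pd.DataFrame({
    'construct': ['pTA001', 'pTA002'],
    'syntax': ['downstream_tandem', 'divergent'],
})

# indicator=True adds a '_merge' column recording where each row found a match
merged = cells.merge(plasmids, how='left', on='construct', indicator=True)

# Rows marked 'left_only' had no entry in the metadata
missing = merged.loc[merged['_merge'] == 'left_only', 'construct'].unique().tolist()
print(missing)  # ['pTA999']
```

Once the check passes, the '_merge' column can be dropped with `merged.drop(columns='_merge')`.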

In [None]:
'''
Example 2: Multiple columns with plasmids

If you need to add metadata for two columns in your data, you can repeatedly call `merge` 
on each. See the pandas documentation for more information. This example adds plasmid 
metadata to the 'construct' column (the reporter plasmid, a better name would be 'reporter')
and the 'activator' column (co-transfected activator plasmid) from one of the previous examples.
Notice that the activator metadata contain a suffix to differentiate them.
'''
# Load new dataset
data_path4 = rd.datadir/'data'/'attune'/'kasey'/'2024.07.16_exp099'/'export_comp'
cache_path4 = rd.rootdir/'output'/'KL_exp099'/'data.gzip'

data4 = pd.DataFrame()
if cache_path4.is_file(): data4 = pd.read_parquet(cache_path4)
else: 
    channel_list = ['mRuby2-A','iRFP720-A','tagBFP-A']
    data4 = rd.flow.load_csv_with_metadata(data_path4, data_path4/'wells.yaml', columns=channel_list)
    data4.to_parquet(rd.outfile(cache_path4))

# Add metadata
metadata4 = pd.read_csv(rd.rootdir/'inputs'/'plasmid-metadata-geec.csv')
data4 = data4.merge(metadata4, how='left', on='construct') # you can chain `merge` calls, but these are too long for one line
data4 = data4.merge(metadata4, how='left', right_on='construct', left_on='activator', suffixes=(None,'_activator'))
display(data4)

#### Directly in Python

Of course, there are other ways to write and add this condition-level metadata. Feel free to explore other possibilities! For instance, it might be easier to add small amounts of metadata (e.g., ~1-2 features for up to ~5 conditions) using a dictionary and pandas functions.

In [None]:
# example: TODO
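
One possible version of the dictionary approach is sketched here on made-up toy data, using the pandas `map` method to translate plasmid IDs into a new column. The column names and values are illustrative only.

```python
import pandas as pd

# Toy data (values are made up for illustration)
df = pd.DataFrame({
    'construct': ['pTA001', 'pTA002', 'pTA001', 'NT'],
    'mGL-A': [120.0, 340.0, 95.0, 8.0],
})

# Map each plasmid ID to one extra feature
syntax_map = {'pTA001': 'downstream_tandem', 'pTA002': 'divergent'}
df['syntax'] = df['construct'].map(syntax_map)

# IDs missing from the dictionary become NaN; fill them if a default is meaningful
df['syntax'] = df['syntax'].fillna('none')
print(df['syntax'].tolist())  # ['downstream_tandem', 'divergent', 'downstream_tandem', 'none']
```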

Now you have your data loaded with useful metadata! It's time to see what the data shows...

# Part 2: Explore data

**Warning!** The section below has not been revised in 2026. Some parts may be out of date.

### Set plotting defaults

To explore trends in your data, you'll make a bunch of plots. To make the plots look nicer, you can set some basic defaults for font size, line width, etc. (This is much more important for polished figures, but starting with decent plots now will make even quick slides easier to understand.) 

This is also a good time to define a color palette for your conditions. See Seaborn's [Choosing color palettes](https://seaborn.pydata.org/tutorial/color_palettes.html) for suggestions, or try a palette from someone else in lab. You can specify colors using the matplotlib named colors, hex codes, or a few other formats.

In [None]:
'''
Seaborn style (applies to entire notebook)

The 'talk' context sets font sizes, etc. appropriate for a presentation.
(Other options include 'notebook', 'paper', and 'poster'.)
This also sets the font to Helvetica Neue, but you can change this to whatever 
you prefer. See Seaborn/matplotlib documentation for other parameters.
'''
sns.set_style('ticks')
sns.set_context('talk', rc={'font.family': 'sans-serif', 'font.sans-serif':['Helvetica Neue']})

In [None]:
'''
Define a color palette

Use a dictionary to map categorical condition values to colors.
Here, colors are assigned to different gene syntaxes. Additionally,
this modified form of the viridis palette (yellow -> purple) is 
good for continuous values (e.g., small molecule amounts).
'''
# Categorical palette
palette = {
    'tandem_reporter_upstream': '#225A9B',
    'tandem_reporter_downstream': '#19D2BF',
    'convergent': '#FFB133',
    'divergent': '#FE484E',
}

# Continuous palette
no_yellow_viridis = matplotlib.colors.ListedColormap(matplotlib.colormaps['viridis'](np.linspace(0, 0.85, 256)))

### Remove cells with negative channel values

Negative values from the Attune are essentially "off the chart" and represent non-expressing cells. There aren't usually too many of them, and it is safe to simply exclude them. This makes it simpler to plot the data, which is log-distributed. (Note that some plotting functions or calculations will throw an error when run on data with non-positive values.)

To remove these cells, you can use a "mask", finding the rows that satisfy some True/False statement and then reassigning `data` to this value. To remove cells with negative channel measurements, the statement will specify that the value in each channel column be greater than zero.

In [None]:
'''
Remove cells with negative channel values

The 'channel_list' should contain all the Attune channels you're 
interested in (e.g., mGL-A) that could have non-positive values.
Here, we filter all three datasets that were loaded above.
'''
for c in ['mRuby2-A','mGL-A']:
    data = data[data[c] > 0]
    data2 = data2[data2[c] > 0]

# data4 was loaded with a different set of channels
channel_list = ['mRuby2-A','iRFP720-A','tagBFP-A']
for c in channel_list:
    data4 = data4[data4[c] > 0]
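
As an alternative to looping over channels, the same kind of mask can be built for all channels in one step. This is a minimal sketch on made-up values, not a replacement for the cell above.

```python
import pandas as pd

# Toy data with some non-positive channel values (made up for illustration)
df = pd.DataFrame({
    'mRuby2-A': [150.0, -20.0, 300.0, 80.0],
    'mGL-A': [500.0, 40.0, -5.0, 0.0],
    'well': ['A1', 'A1', 'A2', 'A2'],
})
channels = ['mRuby2-A', 'mGL-A']

# True only for rows where every listed channel is strictly positive
mask = (df[channels] > 0).all(axis=1)
filtered = df[mask]
print(f'kept {mask.sum()} of {len(df)} cells')  # kept 1 of 4 cells
```

Printing how many cells were kept is a quick sanity check that the filter removes only a small fraction of events.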

### Plot histograms & joint distributions

The first thing you'll likely want to do is visualize distributions of expression across relevant channels for various conditions. The easiest way to plot multiple conditions at once is by using Seaborn's [FacetGrid and related functions](https://seaborn.pydata.org/tutorial/axis_grids.html). With these functions, you shouldn't have to define each axis yourself!

Seaborn plotting functions come in several types:

1. **Functions that generate a single axis** (return a matplotlib `Axes` object). You can facet the data on up to two columns (e.g., one category on the x-axis and the other as the color). This includes `scatterplot`, `kdeplot`, etc.

2. **Functions that generate a grid of axes** (return a Seaborn `FacetGrid` object). You can facet the data on up to four columns (e.g., x-axis, color, row in the grid, column in the grid). These functions typically have several different plot type options, which you can specify with the `kind` argument. For instance, `catplot` includes 'strip', 'bar', 'violin' and other plot types with one categorical axis, while `displot` includes 'kde' and 'hist' plots to show 1D or 2D distributions.

You probably want to begin by using [`kdeplot`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to plot 1D or 2D distributions. Note that 2D kdeplots may take several minutes to generate.

In [None]:
''' 
Example 1: Plot 1D kdeplot

For kdeplots, be sure to use a log scale and normalize the area under the
curve within conditions rather than across them (no "common normalization",
i.e., 'common_norm=False').
You can manually adjust the placement of the legend to move it outside of
the plot area.
'''
ax = sns.kdeplot(
        data=data, x='mGL-A', hue='construct',  # specify the DataFrame, x-axis, and color column
        log_scale=True, common_norm=False       # additional arguments for kdeplots 
    )
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In [None]:
''' 
Example 2: Plot kdeplots in a grid

Use the corresponding grid functions to facet your data along 
rows/columns of a grid. 
Here, we subset the data to only plot conditions where the 
'ts_kind' column has the value 'NT' (OL circuit) or 'T' (CL circuit).
'''
# Subset data into a DataFrame for plotting
plot_df = data[data.ts_kind.isin(['NT','T'])]

# Plot in a grid
g = sns.displot(
        data=plot_df, x='mGL-A', hue='ts_num',  # same arguments as above
        palette=no_yellow_viridis,              # specify the color palette (here a continuous colormap; a dictionary mapping hue values to colors also works)
        col='ts_kind', kind='kde',              # additional arguments for displot: facet grid columns on 'ts_kind' and plot a kde plot
        log_scale=True, common_norm=False,      # same kde-specific arguments as above
    )

# Loop over the axes to add the untransfected condition for comparison
for ax in g.axes_dict.values():
    sns.kdeplot(data=data[data.construct=='UT'], x='mGL-A', color='black', ls=':', ax=ax)

In [None]:
''' 
Example 3: Plot 2D kdeplot in a grid

This plot uses the same plotting function as above.
It can help to downsample your data to fewer cells per condition
so that initial plots generate faster (remove this for 
final figures).
'''
# Subset data to remove conditions where the activator is 'na'
plot_df = data4[(data4.gene_activator!='na') & (data4.promoter_activator!='na')]

# Downsample to 1000 cells per condition, where conditions are defined
#  by the 'construct' and 'activator' columns
plot_df = plot_df.groupby(['construct','activator']).sample(1000)

# Plot
g = sns.displot(
        data=plot_df, x='AF514-A', y='mRuby2-A', hue='inducer',
        row='gene_activator', col='promoter_activator',
        log_scale=True, common_norm=False, kind='kde', 
        facet_kws=dict(margin_titles=True)                      # additional facet argument to move row titles to the right
    )

# Adjust plots
g.set(xlim=(1e1,1e5), ylim=(1e0,1e5)) # 'set' includes many more parameters, see matplotlib documentation
g.set_titles(row_template='{row_name}', col_template='activator promoter: {col_name}')
g.refline(y=2e2, color='black', ls='-', zorder=0) # add a horizontal reference line to each axis
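
Note that `sample(1000)` raises an error if any group contains fewer than 1000 cells. A minimal sketch of one workaround, on made-up toy data, is to cap the sample size at the size of each group. The fixed `random_state` is an optional choice to make the draw repeatable.

```python
import pandas as pd

# Toy data: group 'a' has 5 rows, group 'b' has only 2
df = pd.DataFrame({'construct': ['a'] * 5 + ['b'] * 2, 'mGL-A': range(7)})
n = 3

# Sample at most n rows per group
small = pd.concat([
    group.sample(min(len(group), n), random_state=0)
    for _, group in df.groupby('construct')
])
print(small['construct'].value_counts().to_dict())  # {'a': 3, 'b': 2}
```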

### Gate expressing cells

Depending on your experiment, you might want to analyze only a fraction of the population. For transfections, we typically only care about transfected cells, or those expressing the transfection marker (co-delivered fluorescent protein). You can manually eyeball this threshold (or gate), or you can set it based on some high percentile of the untransfected cells. It can be helpful to make a new DataFrame with only these cells. Be sure to exclude any conditions that lack the transfection marker!

In [None]:
''' 
Example 1: Choose a threshold manually
'''
gate = 3e2  # manual gate
data_gated = data[(data['mGL-A']>gate) & (data.construct!='UT')].copy()

ax = sns.kdeplot(data=data, x='mGL-A', hue='construct', 
                 log_scale=True, common_norm=False)
ax.axvline(gate, color='black', zorder=0)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In [None]:
''' 
Example 2: Gate based on the untransfected population

Here, we use the 99.9th percentile of the untransfected population (UT)
as the gate, meaning that only 0.1% of untransfected cells are (mis)labeled
as expressing. This gate is for the mGL-A channel, which represents the
transfection marker.
'''
# Compute the gate
gate = data.loc[data.construct=='UT', 'mGL-A'].quantile(0.999)
display(gate)

data_gated = data[(data['mGL-A']>gate) & (data.construct!='UT')].copy()

ax = sns.kdeplot(data=data, x='mGL-A', hue='construct', 
                 log_scale=True, common_norm=False)
ax.axvline(gate, color='black', zorder=0)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In [None]:
'''
Example 3: Use different gates for different conditions

We want to use different gates for the MEFs and 293T cells since they have
different autofluorescence profiles. We'll define gates based on the 
uninfected ('UI') populations for each cell type.
'''
# Define a function to choose the gate and return the gated population
#  Input: (group of a) DataFrame --> Returns: subset of that DataFrame group
def gate_data(df):
    gate = df.loc[df.construct=='UI', 'mGL-A'].quantile(0.999)
    display(gate)
    return df[(df['mGL-A']>gate) & (df.construct!='UI')]

# Gate data
data2_gated = data2.groupby(['cell'])[data2.columns].apply(gate_data).reset_index(drop=True)
display(data2_gated)

# Plot uninfected populations to visualize autofluorescence profiles
plot_df = data2[data2.construct=='UI'].groupby('cell').sample(1000)
cell_palette = {'293T': 'teal', 'MEF': 'orange'}
g = sns.displot(data=plot_df, x='mGL-A', y='mRuby2-A', col='cell', 
                hue='cell', palette=cell_palette,
                kind='kde', log_scale=True)
g.refline(x=257, ls='-', color=cell_palette['293T'], zorder=0)
g.refline(x=345, ls='-', color=cell_palette['MEF'], zorder=0)

### Calculate summary statistics

Now that you've explored the distributions for each condition, you probably want to quantify trends. Calculating summary statistics (mean, standard deviation, etc.) is straightforward and quick with Pandas functions. 
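
Because flow data are roughly log-distributed, the geometric mean is often a more representative summary than the arithmetic mean. The geometric mean is the exponential of the mean of the log-transformed values, which is why it requires strictly positive data. A small sketch with made-up values:

```python
import numpy as np

values = np.array([1e2, 1e3, 1e4, 1e6])  # made-up fluorescence values

arithmetic_mean = values.mean()
geometric_mean = np.exp(np.log(values).mean())  # exp of the mean of the logs

print(f'{arithmetic_mean:.0f}')  # 252775
print(f'{geometric_mean:.0f}')   # 5623
```

The arithmetic mean is dominated by the single brightest value, while the geometric mean sits near the middle of the values on a log scale.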

In [None]:
'''
Calculate summary statistics for multiple channels at once

We want to compute the geometric mean (gmean) and standard 
deviation for both the output and marker channels. Conveniently,
Pandas can do all four of these calculations at once. The 
DataFrame resulting from this computation ('agg' function) has
both a multi-level index (from 'groupby') and multi-level 
column names. We can collapse the index into columns with 'reset_index',
but flattening the column names is more complicated.
'''
# Compute geometric mean (gmean) and standard deviation on two relevant channels
channel_list = ['mGL-A','mRuby2-A']
stats = data_gated.groupby('construct')[channel_list].agg([sp.stats.gmean, np.std]).reset_index().dropna()

# Flatten the multi-level column names to '{level1}_{level2}' 
#  without leaving a hanging '_' for single-level columns
stats.columns = ['_'.join(c).rstrip('_') for c in stats.columns.to_flat_index()]
display(stats)
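
To see what the column-flattening line does on its own, here is a small sketch with hand-built multi-level column names. The `toy` DataFrame is hypothetical and mimics the shape you get after `agg` followed by `reset_index`.

```python
import pandas as pd

# Hypothetical multi-level columns, like those produced by agg([...]).reset_index()
cols = pd.MultiIndex.from_tuples([
    ('construct', ''),      # a former index level has an empty second level
    ('mGL-A', 'gmean'),
    ('mGL-A', 'std'),
])
toy = pd.DataFrame([['A', 100.0, 5.0]], columns=cols)

# Each column is a tuple; join its parts with '_' and strip the trailing '_'
#  left behind by the empty second level
flat = ['_'.join(c).rstrip('_') for c in toy.columns.to_flat_index()]
toy.columns = flat
print(flat)
```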

In [None]:
''' 
A function for computing summary statistics

The computation above can be summarized into a convenient function. 
Maybe soon we'll add it to rushd, but for now feel free to use & modify
it yourself!

Inputs
  df: a DataFrame with your data
  by: a list of columns used to group the data for summarizing
  columns: a list of columns to summarize (i.e., channels)
  stat_list: a list of functions to use to summarize (i.e., stats)
Returns
  stats: a new DataFrame, where each row is a unique condition defined
    by 'by' and the number of columns = len(by) + len(columns) x len(stat_list)
'''
def summarize(df, by, columns, stat_list):
    stats = df.groupby(by)[columns].agg(stat_list).reset_index().dropna()
    stats.columns = ['_'.join(c).rstrip('_') for c in stats.columns.to_flat_index()]
    return stats

# Calculate statistics on gated data
stats = summarize(data_gated, 'construct', ['mGL-A','mRuby2-A'], [sp.stats.gmean, np.std])

# Add construct metadata to stats DataFrame too
stats = stats.merge(metadata, how='left', on='construct')
display(stats)

### Plot summary statistics

There are many ways to plot summary statistics (box plot, scatter plot, bar plot, etc.) and several ways to display the variability between measurements (error bars, shading, etc.). Choose your favorite representation!

One recommendation: use bar plots only for values with a meaningful zero, and always include zero on the axis when you do. This ensures the sizes of the bars accurately reflect the relative values they represent. For example, bar plots are effective for displaying percentages (e.g., reprogramming purity) but not geometric mean fluorescence values (which are log-distributed and typically much higher than 0). 

In [None]:
'''
Example 1: Dataset 1

Here, we plot the gmean of the output gene (mRuby2) as a function
of target site number for the OL and CL ComMAND circuits. Note that
this plot would make more sense with additional biological replicates.
'''
# 'stats' already includes the plasmid metadata (merged in the previous cell),
#  so we can use the metadata columns directly

# Plot mRuby2 geometric mean for each condition
plot_df = stats[stats['ts_kind']!='na']
ax = sns.stripplot(
        data=plot_df, x='ts_num', y='mRuby2-A_gmean', 
        hue='ts_kind', palette={'NT': 'gray', 'T': 'teal'},
        size=10, jitter=False  # arguments specific to stripplot: the size of the markers and the amount of random x-offset
    )
ax.set(yscale='log', xlabel='# of target sites', ylabel='output (gmean)')
sns.despine()

In [None]:
'''
Example 2: Small molecule titration experiment

Here, we load another dataset, where varying the concentration
of a small molecule (auxin) changes the expression of EGFP.
'''
# Load another dataset: Emma's auxin calibration curve
data_path = rd.datadir/'instruments'/'data'/'attune'/'Emma'/'2022.03.12_Auxin_Calib'/'Data'
data5 = rd.flow.load_csv_with_metadata(data_path, data_path/'wells.yaml', columns=['EGFP-A'])
data5 = data5[data5['EGFP-A']>0]
display(data5)

# Compute EGFP gmean for each replicate of each auxin concentration, excluding untransfected cells (NT)
stats5 = data5[data5.Auxin!='NT'].groupby(['Auxin','Replicates'])['EGFP-A'].apply(sp.stats.gmean).rename('EGFP-A_gmean').reset_index()
display(stats5)

# Plot summary statistics
plot_df = stats5[stats5.Auxin > 0] # exclude Auxin = 0 condition for more convenient plotting
ax = sns.scatterplot(
        data=plot_df, x='Auxin', y='EGFP-A_gmean', 
        hue='Auxin', palette=no_yellow_viridis, hue_norm=matplotlib.colors.LogNorm(), # distribute viridis colors in logspace
        legend=False  # exclude the legend, since colors are the same as the x-values
    )
ax.set(xscale='log',yscale='log', xlim=(1e-1,1e3)) # note: use xscale='symlog' to include the Auxin=0 condition

In [None]:
'''
Example 3: Alternative ways to plot replicates

Using the same data as above, we can collapse replicates into
a single value with an estimate of their spread. Notice that 
these functions will summarize the replicates for you, without
requiring any additional calculations.
'''
fig, axes = plt.subplots(1,2, figsize=(10,5), sharey=True)
plot_df = stats5

# Plot as line with shaded region
sns.lineplot(
        data=plot_df, x='Auxin', y='EGFP-A_gmean', 
        estimator='median', errorbar='ci',   # arguments specific to lineplot
        ax=axes[0]  # since we created a figure above, pass the axis to plot on
    )

# Plot as points with error bars
sns.lineplot(
        data=plot_df, x='Auxin', y='EGFP-A_gmean',
        estimator='median', errorbar='ci',    # plot the same stats as above (median + confidence interval)
        err_style='bars', marker='o', ls='',  # instead of a line, plot individual points with errorbars
        ax=axes[1],
    )

for ax in axes:
    ax.set(xscale='log', yscale='log', ylim=(1e2,1e4))

### Some additional computations

Besides simple summary statistics, you may be interested in computing metrics like fraction positive in a particular channel or the fold change of one condition relative to another. Below are a few metrics that might be useful, or that might give you ideas for approaching other calculations. Note that there are several ways to perform these calculations; these are each just one approach.

The more you learn about the Pandas package, the better you can leverage built-in functions to help you perform computations more efficiently. The "[split-apply-combine](https://pandas.pydata.org/docs/user_guide/groupby.html)" framework is often a helpful strategy.
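
As a minimal sketch of split-apply-combine on made-up data (the `toy` DataFrame and its column names are hypothetical), compare `agg`-style reductions, which return one value per group, with `transform`, which returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data: two conditions with two measurements each
toy = pd.DataFrame({
    'construct': ['A', 'A', 'B', 'B'],
    'signal':    [1.0, 3.0, 10.0, 30.0],
})

# Split by 'construct', apply 'mean', combine into one value per group
per_group = toy.groupby('construct')['signal'].mean()

# 'transform' broadcasts the group result back to every row in that group,
#  which is convenient for normalizing each row by its own group's value
toy['group_mean'] = toy.groupby('construct')['signal'].transform('mean')
toy['normalized'] = toy['signal'] / toy['group_mean']
print(per_group)
print(toy)
```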

In [None]:
'''
Gated fraction

For a given channel, calculate the fraction of cells in each
condition that have values greater than the specified threshold.
Essentially, divide the count of gated cells by the total count of cells
(before gating) in each group. Then, clean up the resulting DataFrame so the
index and columns are easy to read.
'''
fraction = (data_gated.groupby('construct')['mGL-A'].count() / 
            data.groupby('construct')['mGL-A'].count()).reset_index().rename(columns={'mGL-A': 'fraction'}).dropna()
display(fraction)
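
Another way to get a gated fraction, sketched here on made-up data, is to take the mean of a boolean column, since `True` counts as 1 and `False` as 0. The `toy` DataFrame and the `toy_gate` value are hypothetical.

```python
import pandas as pd

# Hypothetical toy data and gate
toy = pd.DataFrame({
    'construct': ['A'] * 4 + ['B'] * 4,
    'mGL-A':     [50, 150, 250, 350, 10, 20, 30, 500],
})
toy_gate = 100

# The mean of a boolean Series is the fraction of True values
toy_fraction = (
    (toy['mGL-A'] > toy_gate)
    .groupby(toy['construct'])
    .mean()
    .rename('fraction')
    .reset_index()
)
print(toy_fraction)
```

This version needs only the ungated data and the gate value, so it may be handy when you have not created a separate gated DataFrame.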

In [None]:
'''
Fold change

Compute the fold change of one statistic for some condition,
or set of conditions, relative to some baseline condition.
Here, we find the fold change of the output (mRuby2) for OL
and CL circuits relative to their respective 1x target site 
conditions. The new function defined takes DataFrame groups, 
computes the baseline value for that group (output gmean of 1x
target site condition), and returns the original DataFrame group
with an added column containing the computed fold change.
'''
# Define a function to compute fold change within a group
def get_fc(df):
    d = df.copy()
    baseline = d.loc[d['ts_num']==1, 'mRuby2-A_gmean'].mean()
    d['fold_change'] = d['mRuby2-A_gmean'] / baseline
    return d

# Compute fold change within each circuit type (base/OL/CL defined by 'ts_kind')
stats = stats.groupby('ts_kind')[stats.columns].apply(get_fc).reset_index(drop=True)
display(stats)
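
For comparison, here is a sketch of the same fold-change idea without a custom function, using made-up numbers. It assumes there is exactly one baseline row per group (here, one `ts_num == 1` row per `ts_kind`); the `toy` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical summary statistics for two circuit types
toy = pd.DataFrame({
    'ts_kind':        ['NT', 'NT', 'T', 'T'],
    'ts_num':         [1, 4, 1, 4],
    'mRuby2-A_gmean': [1000.0, 900.0, 800.0, 200.0],
})

# Baseline value for each group, indexed by group name
#  (assumes a single baseline row per group)
baseline = toy[toy.ts_num == 1].set_index('ts_kind')['mRuby2-A_gmean']

# Look up each row's baseline by its group, then divide
toy['fold_change'] = toy['mRuby2-A_gmean'] / toy['ts_kind'].map(baseline)
print(toy)
```

If a group has several baseline rows (e.g., replicates), averaging them first, as `get_fc` does above, is the safer choice.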

In [None]:
'''
Quantile binning

Rather than calculating summary statistics on an entire
condition, bin the data into equal-quantile groups based 
on values of a given channel (e.g., 10 bins each with 10% 
of the data). Here, we bin on the transfection marker (mGL).
'''
# Assign quantiles
num_bins = 20
data['bin_quantiles'] = data.groupby('construct')['mGL-A'].transform(lambda x: pd.qcut(x, q=num_bins, duplicates='drop'))

# Calculate the median of each bin
quantiles = data.groupby(['construct','bin_quantiles'])['mGL-A'].median().rename('bin_quantiles_median').reset_index()

# Create a new column in data with the bin median
data = data.merge(quantiles, how='left', on=['construct','bin_quantiles'])
display(data)
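
To check what `pd.qcut` is doing, it can help to try it on a simple made-up Series where the answer is easy to predict. In this sketch the values are just the integers 1 to 100, so four quantile bins should each hold 25 values. All names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 1..100
values = pd.Series(np.arange(1, 101), name='mGL-A')

# Default labels are the bin intervals
bins = pd.qcut(values, q=4)
counts = bins.value_counts()

# With labels=False, qcut returns integer bin indices (0..3) instead,
#  which are often easier to group on
bin_idx = pd.qcut(values, q=4, labels=False)
bin_median = values.groupby(bin_idx).transform('median')
print(counts)
print(bin_median.unique())
```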

In [None]:
'''  
Quadrants defined by two gates

Categorize cells into quadrants based on two gates/channels.
The function defined below takes the two columns to gate on
plus their gates, then returns a DataFrame column (Series) with
the quadrant number.
Possible values for 'quadrant':
  0 = double negative
  1 = x-positive only
  2 = y-positive only
  3 = double positive
Then, we compute the fraction of cells in each quadrant (notice that
this is an alternative to the earlier approach) and rename the 
quadrants with useful labels.
'''
def get_quadrant(x, y, gate_x, gate_y):
    df_quad = pd.DataFrame()
    df_quad['x'] = x > gate_x
    df_quad['y'] = y > gate_y
    df_quad['quadrant'] = df_quad['x'].astype(int) + df_quad['y'].astype(int)*2
    return df_quad['quadrant']

# Categorize each cell into a quadrant
gate_mRuby2 = data.loc[data.construct=='UT', 'mRuby2-A'].quantile(0.999)
data['quadrant'] = get_quadrant(data['mGL-A'], data['mRuby2-A'], gate, gate_mRuby2)

# Compute fraction of cells in each quadrant
quadrants = data.groupby(['construct','quadrant'])['mGL-A'].count().rename('count')
quadrants = (quadrants/quadrants.groupby('construct').transform('sum')).dropna().reset_index(name='fraction')

# Rename quadrant numbers with interpretable labels
quadrants['label'] = quadrants.quadrant.map({0: 'double-negative', 1: 'mGL-positive', 2: 'mRuby2-positive', 3: 'double-positive'})
display(quadrants)
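
The quadrant encoding is easiest to verify on a handful of made-up points placed on either side of each gate. In this sketch, the `toy` DataFrame and both gate values are hypothetical; `value_counts(normalize=True)` is used as a shortcut for the fraction in each quadrant when there is only one condition.

```python
import pandas as pd

# Hypothetical points on either side of two hypothetical gates
toy = pd.DataFrame({
    'x': [1, 10, 1, 10, 10, 10],
    'y': [1, 1, 10, 10, 10, 1],
})
gate_x, gate_y = 5, 5

# x-positive contributes 1 and y-positive contributes 2, so the sum is 0, 1, 2, or 3
toy['quadrant'] = (toy['x'] > gate_x).astype(int) + 2 * (toy['y'] > gate_y).astype(int)

# Fraction of cells in each quadrant
toy_fractions = toy['quadrant'].value_counts(normalize=True).sort_index()
print(toy)
print(toy_fractions)
```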

In [None]:
'''
Fitting to a model

Here, we use one of the earlier datasets where Emma has generated a calibration 
curve for auxin (a small molecule that leads to degradation of proteins with the 
associated AID tag) by varying the auxin concentration and measuring the 
resulting drop in EGFP-AID fluorescence. From the literature, we find an 
equation that explains this relationship (described in 'my_model') and fit the 
coefficients using scipy's 'curve_fit' function. Then, we plot the results.
'''
# Define a function for the model, where x is auxin concentration in µM
#  and the result is log10(fluorescence)
def my_model(x, basal_fluorescence, amplitude, EC50):
    return basal_fluorescence - amplitude * x/(x+EC50)

# Fit the data to the model and print the results
fit_df = stats5[stats5.Auxin > 0] # ignore the Auxin=0 condition for easier plotting
popt, pcov = sp.optimize.curve_fit(         # returns the optimized parameters ('popt') and their covariance ('pcov')
        my_model,                           # the function to fit to
        fit_df.Auxin,                       # the x-values
        np.log10(fit_df['EGFP-A_gmean'])    # the y-values
    )

# Print 'popt', the optimized values for the other arguments in 'my_model',
#  along with 'perr', their errors derived from 'pcov' (the estimated covariance of 'popt')
perr = np.sqrt(np.diag(pcov)) # gives one standard deviation error for the parameter estimates
print(f'basal fluorescence (log10): {popt[0]:.1f} +/- {perr[0]:.1f}')
print(f'amplitude: {popt[1]:.1f} +/- {perr[1]:.1f}')
print(f'EC50 (µM): {popt[2]:.1f} +/- {perr[2]:.1f}')

# Plot the data
ax = sns.scatterplot(data=fit_df, x='Auxin', y='EGFP-A_gmean',)
ax.set(xscale='log', yscale='log')

# Plot the model fit on the same axes
xs = np.logspace(np.log10(fit_df.Auxin.min()), np.log10(fit_df.Auxin.max()), 1000)
ys = my_model(xs, *popt)
sns.lineplot(x=xs, y=10**ys)

# Include EC_50 info on the plot
ax.axvline(popt[2], color='gray', zorder=0)
ax.axvspan(popt[2]-perr[2], popt[2]+perr[2], facecolor='gray', alpha=0.2, edgecolor=None)
ax.annotate(
        r'EC$_{50}$ = ' + f'{popt[2]:.1f} µM',  # the text to add (use an r-string to read latex)
        (0.05, 0.05), xycoords='axes fraction', # the location of the text, in fraction of the axes (from 0 to 1)
        color='gray'
    )
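
One way to build confidence in a fit is to test the procedure on synthetic data where the true parameters are known. The sketch below generates made-up points from the same model form with a little noise, then checks that `curve_fit` recovers the parameters. The parameter values, noise level, initial guesses (`p0`), and bounds are all assumptions chosen for illustration. By default `curve_fit` starts every parameter at 1, so supplying a rough guess from the data can help the fit converge.

```python
import numpy as np
from scipy.optimize import curve_fit

# Same functional form as 'my_model': x is concentration, result is log10(fluorescence)
def toy_model(x, basal, amplitude, ec50):
    return basal - amplitude * x / (x + ec50)

# Hypothetical "true" parameters and synthetic, slightly noisy data
true_params = (4.0, 1.5, 10.0)
xs = np.logspace(-1, 3, 20)
rng = np.random.default_rng(0)
ys = toy_model(xs, *true_params) + rng.normal(0, 0.01, xs.size)

# Rough initial guesses taken from the data; amplitude and EC50 constrained to be >= 0
p0 = (ys.max(), ys.max() - ys.min(), np.median(xs))
popt, pcov = curve_fit(toy_model, xs, ys, p0=p0,
                       bounds=([-np.inf, 0, 0], np.inf))
perr = np.sqrt(np.diag(pcov))  # one standard deviation error per parameter

for name, true, est, err in zip(['basal', 'amplitude', 'EC50'], true_params, popt, perr):
    print(f'{name}: true = {true}, fit = {est:.2f} +/- {err:.2f}')
```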

### Now go forth and explore your data!