# Part 1: Load data

### Setup: Import packages

In [None]:
# Import all our favorite packages
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rushd as rd
import scipy as sp
import seaborn as sns

### Specifying the path to the SharePoint using `datadir`
To enable reproducible analysis, Python notebooks/code should be able to run from locations other than your personal computer. However, because our data is stored in the cloud, the location of the data files will differ across users. Namely, accessing the lab SharePoint requires an absolute path specific to your computer. 

To get around this issue, `rushd` lets you specify the path to this data directory, or `datadir`, once per location (i.e., a GitHub repository stored on your computer). Because the absolute path to the data-containing directory lives outside your Python notebook, cloning the repository or running the code elsewhere requires only one change for the analyses to run properly.

To specify a `datadir`, write the absolute path to the data-containing directory in a text file called `datadir.txt`. Use the highest-level directory you anticipate needing, i.e., the main SharePoint (not one of the subfolders). Include only this line in the file, and do not enclose the path in quotes. For instance, the path to the SharePoint on my computer is: 

`/Users/kaseylove/Massachusetts Institute of Technology/GallowayLab - Documents`

so my `datadir.txt` file contains just this line. The file is stored in the root directory of my git repository.

Then, to access data in this directory, simply use `rd.datadir` as the path to this folder. To access subfolders, normal Path-type operations apply, e.g., `rd.datadir/'subfolder'`. Additional usage examples are below. 

The path to the root directory of your repository is also conveniently accessible via `rushd` using `rd.rootdir` (no text file required). This is helpful for specifying output paths when saving figures/files.
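
The idea behind `datadir.txt` can be illustrated with plain `pathlib`. This is only a sketch of the concept, not how `rushd` is actually implemented; the helper `read_datadir`, the example path, and the temporary folder standing in for a repository are all hypothetical.

```python
import tempfile
from pathlib import Path

def read_datadir(repo_root):
    """Hypothetical helper: return the path stored in <repo_root>/datadir.txt."""
    text = (Path(repo_root) / 'datadir.txt').read_text()
    # The file holds a single unquoted line, so strip only surrounding whitespace
    return Path(text.strip())

# Stand-in for a repository root (a real repo would already contain datadir.txt)
with tempfile.TemporaryDirectory() as repo_root:
    (Path(repo_root) / 'datadir.txt').write_text('/Users/me/GallowayLab - Documents\n')
    datadir = read_datadir(repo_root)

# Subfolders are reached with the usual Path '/' operator
export_dir = datadir / 'instruments' / 'data'
print(export_dir)
```

Only the text file changes between computers; the notebook code that builds paths from the data directory stays the same.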

### Add well metadata using a `.yaml` file

The default Attune and FlowJo filenames for samples comprise the well number without other condition information. To add this metadata (e.g., plasmids, small molecules), create a `.yaml` file to map conditions to wells. See this quick tutorial for creating these files: https://learnxinyminutes.com/docs/yaml/. Here is an example format:

```
metadata:
  key1:
    - valueA: A1-A12
    - valueB: B1-B12
  key2:
    - 0: A1-H1
    - 1: A2-H2
```

Save this file in the same folder as (or near) your raw data.

To double check that the `.yaml` file correctly specifies conditions, you can display the associated plate map using `rushd`.
See the example below and https://gallowaylabmit.github.io/rushd/en/main/tutorial/plot_well_metadata.html for details.
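
To build intuition for what a well specification such as `A1-H1` covers, here is a minimal sketch that expands a range into individual wells. It assumes a range denotes a rectangular block of the plate; the helper `expand_wells` is hypothetical and is not part of `rushd`, whose own parser may accept more syntax than this.

```python
import string

def expand_wells(spec):
    """Hypothetical helper: expand 'A1-B3' into the wells of that rectangular block."""
    start, _, end = spec.partition('-')
    end = end or start  # a single well such as 'C5' has no '-'
    rows = string.ascii_uppercase
    r0, r1 = rows.index(start[0]), rows.index(end[0])
    c0, c1 = int(start[1:]), int(end[1:])
    return [f'{rows[r]}{c}' for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

# The 'key2' part of the example format above, written as a plain dictionary
metadata = {'key2': [{0: 'A1-H1'}, {1: 'A2-H2'}]}

# Build a well -> value lookup for this key
well_to_value = {}
for entry in metadata['key2']:
    for value, spec in entry.items():
        for well in expand_wells(spec):
            well_to_value[well] = value

print(well_to_value['C1'], well_to_value['H2'])
```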

In [None]:
'''
Plate layout example

View the layout of conditions on an example plate.
'''
yaml_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.01.24_exp80.3'/'export'/'exp80.3_wells.yaml'
rd.plot.plot_well_metadata(yaml_path)

### Load data: Use `rushd` package to import `.csv` data into a Pandas DataFrame with associated metadata
The function loads data from all wells into a single DataFrame object, which is nicely compatible with Seaborn plotting functions (see description of 'long-form' data here: https://seaborn.pydata.org/tutorial/data_structure.html and Pandas documentation here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

Each row is a single cell, and columns are the data associated with that cell. This includes all the measurement channels on the Attune (e.g., FSC-H, VL1-A, mCherry-A, Time -- it will use the names that you set for the channels on the Attune software) as well as the information from the filename (e.g., well, FlowJo population) and your `.yaml` file (e.g., plasmid/construct, small molecule conditions, replicate number).

See examples below. Feel free to take a look at the filenames and `.yaml` files in the indicated folder for reference. If your `datadir` is set up properly, you should be able to run these cells and see the data loaded into DataFrames.
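
As a rough illustration of the long-form layout, the sketch below parses the well and population from FlowJo-style filenames and attaches them to every cell (row). It is not the `rushd` implementation: the regular expression, the in-memory "files", and the channel values are made up for the example.

```python
import io
import re
import pandas as pd

# Hypothetical file contents keyed by filename (real files live in the export folder)
files = {
    'export_A1_singlets.csv': 'mGL-A,mRuby2-A\n120,15\n340,22\n',
    'export_A2_singlets.csv': 'mGL-A,mRuby2-A\n95,410\n',
}

# Assumed naming scheme: export_{well}_{population}.csv
pattern = re.compile(r'^export_(?P<well>[A-P]\d+)_(?P<population>.+)\.csv$')

frames = []
for name, text in files.items():
    match = pattern.match(name)
    if match is None:
        continue  # skip files that do not follow the naming scheme
    df = pd.read_csv(io.StringIO(text))
    # Every cell (row) in this file shares the same filename-derived metadata
    df['well'] = match['well']
    df['population'] = match['population']
    frames.append(df)

toy = pd.concat(frames, ignore_index=True)
print(toy)
```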

In [None]:
'''
Example 1: Single plate

This uses the default filename from exporting in FlowJo, namely `export_{well}_{population}.csv`.
Columns labeled 'well' and 'population' are added based on the filename.
Here, for instance, the file `export_A1_singlets.csv` is added with 'A1' in the 'well' column
and 'singlets' in the 'population' column.
'''
data_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.12.04_exp092.3'/'export'
yaml_path = data_path/'wells.yaml'
data = rd.flow.load_csv_with_metadata(data_path, yaml_path)
display(data)

In [None]:
'''
Example 2: Load selected columns

This example loads the same data as above, but only the two channels we care about,
specified via the 'columns' argument. This saves time/space by not storing values 
for irrelevant Attune channels.
'''
channel_list = ['mRuby2-A','mGL-A']
data = rd.flow.load_csv_with_metadata(data_path, yaml_path, columns=channel_list)
display(data)

In [None]:
'''
Example 3: Multiple plates

This example loads four plates. Files are named with the default FlowJo naming as above, and
the data for each plate is stored in separate folders. This data is then loaded into a single
DataFrame with extra metadata specifying the cell type in each plate.
'''
base_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.02.07_exp77.3'/'export'

plates = pd.DataFrame({
    'data_path': [base_path/f'plate{n}' for n in range(1,5)],
    'yaml_path': [base_path/'exp77.3_plate1_wells.yaml', base_path/'exp77.3_plate2_wells.yaml',]*2,
    'cell': ['MEF', 'MEF', '293T', '293T'],
})

data2 = rd.flow.load_groups_with_metadata(plates)
display(data2)

### Save data to a local cache

Loading data from the SharePoint can sometimes take several minutes since each file must be downloaded from the server. To speed up this step for future analyses, you can download the data once from the SharePoint and save the relevant components locally on your computer. It is useful to combine this step with the original loading one.

A convenient place for this cache is in a folder in the analysis repo. I like to save this file in the same place as any plots I generate during analysis: a folder for the experiment in the `output` folder of my git repo. Make sure to add the `output` folder to `.gitignore` so that git doesn't track these large files.

`rushd` also provides the function `outfile` that creates a `.yaml` file containing metadata associated with the output (e.g., git version). It will also create the directories in the path if they don't already exist, which is convenient. You can wrap any output path in `outfile`, including your data cache and any plots you generate.

In [None]:
cache_path = rd.rootdir/'output'/'exp092.3'/'data.gzip'

# Specify the data to load
channel_list = ['mRuby2-A','mGL-A']
data_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.12.04_exp092.3'/'export'
yaml_path = data_path/'wells.yaml'

data3 = pd.DataFrame()

# If cache exists, load data from cache
if cache_path.is_file(): 
    data3 = pd.read_parquet(cache_path)

# Otherwise, load from SharePoint and create cache
else: 
    data3 = rd.flow.load_csv_with_metadata(data_path, yaml_path, columns=channel_list)
    data3.to_parquet(rd.outfile(cache_path))
    
display(data3)

### Add additional condition-level metadata

Sometimes conditions, such as plasmids, vary across multiple dimensions (promoter, gene, syntax, etc.) that would be helpful to add as metadata. This can be cumbersome to add to your `.yaml` file, especially if you have multiple plasmids per condition or reuse plasmids across experiments. Therefore, it can be helpful to include only the plasmid name in the `.yaml` file mapping conditions to wells and also to create a single spreadsheet with plasmid metadata. You can save this in the git repo directly, or in a project folder in the SharePoint.

For example, a spreadsheet for TANGLES constructs could look like this:

| plasmid | upstream_gene | downstream_gene | spacer | syntax             |
| ------- | ------------- | --------------- | ------ | ------------------ |
| pTA001  | tagBFP        | mRuby2          | 1x     | downstream_tandem  |
| pTA002  | tagBFP        | mRuby2          | 1x     | divergent          |
| pTA003  | tagBFP        | mRuby2          | 1x     | convergent         |
| pTA004  | mRuby2        | tagBFP          | 1x     | upstream_tandem    |

You can then load this spreadsheet as a DataFrame and merge it with your data.
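
A small, self-contained sketch of such a merge on made-up values is shown here. The column names follow the spreadsheet layout above; the measurements and the `UT` condition are invented for illustration.

```python
import pandas as pd

# Toy per-cell data: one row per cell, with plasmid names from the .yaml file
cells = pd.DataFrame({
    'construct': ['pTA001', 'pTA001', 'pTA002', 'UT'],
    'mRuby2-A': [510.0, 480.0, 95.0, 3.0],
})

# Toy plasmid metadata, one row per plasmid
plasmids = pd.DataFrame({
    'plasmid': ['pTA001', 'pTA002'],
    'syntax': ['downstream_tandem', 'divergent'],
})

# A left merge keeps every cell; constructs missing from the metadata get NaN
merged = cells.merge(plasmids, how='left', left_on='construct', right_on='plasmid')
print(merged)
```

Using `how='left'` means conditions without a metadata entry (here, `UT`) stay in the data rather than being silently dropped.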


In [None]:
'''
Example 1: Column names match

If the name of the first column in the metadata file matches a column name in your data, 
you can merge directly using the 'on' argument.
'''
metadata_path = rd.datadir/'projects'/'miR-iFFL'/'plasmids'/'construct-metadata.xlsx'
metadata = pd.read_excel(metadata_path)
data = data.merge(metadata, how='left', on='construct')
display(data)

In [None]:
'''
Example 2: Multiple columns with plasmids

If you need to add metadata information for two columns in your data, you can repeatedly
call `merge` on each. See the pandas documentation for more information. This example 
adds plasmid metadata to the 'construct' column (the reporter plasmid) and the 'activator' 
column (co-transfected activator plasmid). Notice that the activator metadata contain a
suffix to differentiate them.
'''

# Load data 
data_path = rd.datadir/'instruments'/'data'/'attune'/'kasey'/'2024.07.16_exp099'/'export_comp'
yaml_path = data_path/'wells.yaml'
cache_path = rd.rootdir/'output'/'KL_exp099'/'data.gzip'

data3 = pd.DataFrame()
if cache_path.is_file(): data3 = pd.read_parquet(cache_path)
else: 
    channel_list = ['mRuby2-A','AF514-A','tagBFP-A']
    data3 = rd.flow.load_csv_with_metadata(data_path, yaml_path, columns=channel_list)
    data3.to_parquet(rd.outfile(cache_path))

# Add metadata
metadata3 = pd.read_excel(rd.datadir/'projects'/'geec'/'construct-metadata_KL.xlsx')
data3 = data3.merge(metadata3, how='left', on='construct')
data3 = data3.merge(metadata3, how='left', right_on='construct', left_on='activator', suffixes=(None,'_activator'))
display(data3)

Now you have your data loaded with useful metadata! It's time to see what the data shows...

# Part 2: Explore data

### Set plotting defaults

To explore trends in your data, you'll make a bunch of plots. To make the plots look nicer, you can set some basic defaults for font size, line width, etc. (This is much more important for polished figures, but starting with decent plots now will make even quick slides easier to understand.) 

This is also a good time to define a set color palette. See Seaborn "Choosing color palettes" for suggestions, or try a palette from someone else in lab. You can specify colors using the matplotlib named colors, hex codes, or a few other formats.

In [None]:
'''
Seaborn style (applies to entire notebook)

The 'talk' context sets font size, etc. appropriate for a presentation.
(Other options include 'notebook', 'paper', and 'poster'.)
I also set the font to Helvetica Neue, but you can change this to whatever you prefer.
See Seaborn/matplotlib documentation for other parameters.
'''
sns.set_style('ticks')
sns.set_context('talk', rc={'font.family': 'sans-serif', 'font.sans-serif':['Helvetica Neue']})

In [None]:
'''
Define a color palette

Use a dictionary to map categorical condition values to colors. 
This modified form of the viridis palette (yellow -> purple) is 
good for continuous values (e.g., small molecule amounts).
'''
palette = {
    'tandem_reporter_upstream': '#225A9B',
    'tandem_reporter_downstream': '#19D2BF',
    'convergent': '#FFB133',
    'divergent': '#FE484E',
}

no_yellow_viridis = matplotlib.colors.ListedColormap(matplotlib.colormaps['viridis'](np.linspace(0, 0.85, 256)))

### Remove cells with negative channel values

Negative values from the Attune are essentially "off the chart" and represent non-expressing cells. There aren't usually too many of them, and it is safe to simply exclude them. This makes it simpler to plot the data, which is log-distributed.

To remove these cells, you can use a "mask", finding the rows that satisfy some True/False statement and then reassigning `data` to this value. To remove cells with negative channel measurements, the statement will specify that the value in each channel column be greater than zero.
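
Here is a minimal sketch of a mask on a toy DataFrame (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'mGL-A': [250.0, -12.0, 1800.0, 40.0],
    'mRuby2-A': [30.0, 500.0, -4.0, 75.0],
})

# Each comparison yields a True/False Series; '&' keeps rows where both hold
mask = (toy['mGL-A'] > 0) & (toy['mRuby2-A'] > 0)
positive = toy[mask]
print(f'kept {len(positive)} of {len(toy)} cells')
```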

In [None]:
'''
Remove cells with negative channel values

The 'channel_list' should contain all the Attune channels you're 
interested in (e.g., mGL-A).
'''
for c in ['mRuby2-A','mGL-A']:
    data = data[data[c] > 0]
    data2 = data2[data2[c] > 0]

channel_list = ['mRuby2-A','AF514-A','tagBFP-A']
for c in channel_list:
    data3 = data3[data3[c] > 0]

### Plot histograms & joint distributions

The first thing you'll likely want to do is visualize distributions of expression across relevant channels for various conditions. The easiest way to plot multiple conditions at once is by using Seaborn's FacetGrid and related functions (https://seaborn.pydata.org/tutorial/axis_grids.html). With these functions, you shouldn't have to loop over manual axes!

Use `kdeplot` to plot 1D or 2D distributions (https://seaborn.pydata.org/generated/seaborn.kdeplot.html). Note that 2D kdeplots may take several minutes to generate.

In [None]:
''' 
Example 1: Plot 1D kdeplot

For kdeplots, be sure to use a log scale and normalize the area under the
curve within conditions rather than across them (no "common normalization",
i.e., 'common_norm=False').
You can manually adjust the placement of the legend to move it outside of
the plot area.
'''
ax = sns.kdeplot(data=data, x='mGL-A', hue='construct',
                log_scale=True, common_norm=False)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In [None]:
''' 
Example 2: Plot kdeplots in a grid

Use the corresponding grid functions to facet your data along 
rows/columns of a grid. 
Here, we subset the data to only plot conditions where the 
'ts_kind' column has the value 'NT' (OL circuit) or 'T' (CL circuit).
'''
plot_df = data[data.ts_kind.isin(['NT','T'])]
g = sns.displot(data=plot_df, x='mGL-A', hue='ts_num', palette=no_yellow_viridis,
                col='ts_kind',
                log_scale=True, common_norm=False, kind='kde', facet_kws=dict(margin_titles=True))

# Loop over the axes to add the untransfected condition for comparison
for ax in g.axes_dict.values():
    sns.kdeplot(data=data[data.construct=='UT'], x='mGL-A', color='black', ls=':', ax=ax)

In [None]:
''' 
Example 3: Plot 2D kdeplot in a grid

It can help to downsample your data to fewer cells per condition
so that initial plots generate more quickly (remove this for 
final figures).
'''
plot_df = data3[(data3.gene_activator!='na') & (data3.promoter_activator!='na')].groupby(['construct','activator']).sample(1000)
g = sns.displot(data=plot_df, x='AF514-A', y='mRuby2-A', hue='inducer',
                row='gene_activator', col='promoter_activator',
                log_scale=True, common_norm=False, kind='kde', facet_kws=dict(margin_titles=True))
g.set(xlim=(1e1,1e5), ylim=(1e0,1e5))
g.set_titles(row_template='{row_name}', col_template='activator promoter: {col_name}')
g.refline(y=2e2, color='black', ls='-', zorder=0)

### Gate expressing cells

Depending on your experiment, you might want to analyze only a fraction of the population. For transfections, we typically only care about transfected cells, or those expressing the transfection marker (co-delivered fluorescent protein). You can manually eyeball this threshold (or gate), or you can set it based on some high percentile of the untransfected cells. It can be helpful to make a new DataFrame with only these cells. Be sure to exclude any conditions that lack the transfection marker!
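
To see how a percentile-based gate behaves, here is a sketch on simulated data. The log-normal parameters and the condition name `reporter` are arbitrary assumptions chosen only to give two separated populations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'construct': ['UT'] * 5000 + ['reporter'] * 5000,
    'mGL-A': np.concatenate([
        rng.lognormal(mean=4, sigma=0.5, size=5000),  # simulated autofluorescence only
        rng.lognormal(mean=7, sigma=1.0, size=5000),  # simulated transfected population
    ]),
})

# Gate at the 99.9th percentile of the untransfected ('UT') cells
gate = toy.loc[toy.construct == 'UT', 'mGL-A'].quantile(0.999)

# Fraction of each condition above the gate
above = (toy['mGL-A'] > gate).groupby(toy['construct']).mean()
print(above)
```

By construction, roughly 0.1% of the untransfected cells fall above the gate, while most of the simulated transfected population does.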

In [None]:
''' 
Example 1: Choose a threshold manually
'''
gate = 3e2
data_gated = data[(data['mGL-A']>gate) & (data.construct!='UT')].copy()

ax = sns.kdeplot(data=data, x='mGL-A', hue='construct', 
                 log_scale=True, common_norm=False)
ax.axvline(gate, color='black', zorder=0)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In [None]:
''' 
Example 2: Gate based on the untransfected population

Here, we use the 99.9th percentile of the untransfected population
as the gate, meaning that only 0.1% of untransfected cells are (mis)labeled
as expressing.
'''
gate = data.loc[data.construct=='UT', 'mGL-A'].quantile(0.999)
display(gate)
data_gated = data[(data['mGL-A']>gate) & (data.construct!='UT')].copy()

ax = sns.kdeplot(data=data, x='mGL-A', hue='construct', 
                 log_scale=True, common_norm=False)
ax.axvline(gate, color='black', zorder=0)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In [None]:
'''
Example 3: Use different gates for different conditions

We want to use different gates for the MEFs and 293T cells since they have
different autofluorescence profiles. We'll define gates based on the 
uninfected ('UI') populations for each cell type.
'''
# Define a function to choose the gate and return the gated population
def gate_data(df):
    gate = df.loc[df.construct=='UI', 'mGL-A'].quantile(0.999)
    display(gate)
    return df[(df['mGL-A']>gate) & (df.construct!='UI')]

# Gate data
data2_gated = data2.groupby(['cell'])[data2.columns].apply(gate_data).reset_index(drop=True)
display(data2_gated)

# Plot uninfected populations to visualize autofluorescence profiles
plot_df = data2[data2.construct=='UI'].groupby('cell').sample(1000)
cell_palette = {'293T': 'teal', 'MEF': 'orange'}
g = sns.displot(data=plot_df, x='mGL-A', y='mRuby2-A', col='cell', 
                hue='cell', palette=cell_palette,
                kind='kde', log_scale=True)
g.refline(x=243, ls='-', color=cell_palette['293T'], zorder=0)
g.refline(x=335, ls='-', color=cell_palette['MEF'], zorder=0)

### Calculate summary statistics

Now that you've explored the distributions for each condition, you probably want to quantify trends. Calculating summary statistics (mean, standard deviation, etc.) is straightforward and quick with Pandas functions. 
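
For log-distributed data such as fluorescence, the geometric mean is often a more representative summary than the arithmetic mean. The cells below use `sp.stats.gmean`; this sketch shows the equivalent computation with NumPy on made-up numbers.

```python
import numpy as np

values = np.array([100.0, 1000.0, 10000.0])

arithmetic_mean = values.mean()
# Geometric mean: exponentiate the mean of the logs
geometric_mean = np.exp(np.log(values).mean())

print(arithmetic_mean, geometric_mean)
```

The arithmetic mean is pulled toward the largest value, whereas the geometric mean sits at the center of the values on a log scale.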

In [None]:
'''
Calculate summary statistics for multiple channels at once

Group the gated cells by condition, then aggregate each channel with
several functions in a single call to `agg`.
'''
# Compute geometric mean (gmean) and standard deviation on two relevant channels
channel_list = ['mGL-A','mRuby2-A']
stats = data_gated.groupby('construct')[channel_list].agg([sp.stats.gmean, np.std]).reset_index().dropna()

# Flatten the multi-level column names
stats.columns = ['_'.join(c).rstrip('_') for c in stats.columns.to_flat_index()]
display(stats)

In [None]:
''' 
A function for computing summary statistics

I turned the computation above into a convenient function. 
Maybe soon we'll add it to rushd, but for now feel free to
use & modify it yourself!

Inputs
  df: your DataFrame
  by: a list of columns used to group the data for summarizing
  columns: a list of columns to summarize
  stats: a list of functions to use to summarize
'''
def summarize(df, by, columns, stats):
    # Use a separate name for the result so the 'stats' argument is not overwritten
    summary = df.groupby(by)[columns].agg(stats).reset_index().dropna()
    summary.columns = ['_'.join(c).rstrip('_') for c in summary.columns.to_flat_index()]
    return summary

stats = summarize(data_gated, 'construct', channel_list, [sp.stats.gmean, np.std])
display(stats)

### Plot summary statistics

There are many ways to plot summary statistics (box plot, scatter plot, bar plot, etc.) and several ways to display the variability between measurements (error bars, shading, etc.). Choose your favorite representation!

One recommendation: do not use bar plots for values without a relevant zero, and always display the zero on the axis. This ensures the sizes of the bars accurately reflect the relative values they represent. For example, bar plots are effective for displaying percentages (e.g., reprogramming purity) but not geometric mean fluorescence values (which are log-distributed and typically much higher than 0). 

In [None]:
'''
Example 1: Dataset 1

Here, we plot the gmean of the output gene (mRuby2) as a function
of target site number for the OL and CL ComMAND circuits. Note that
this plot would make more sense with additional biological replicates.
'''
# Add plasmid metadata to stats
stats = stats.merge(metadata, how='left', on='construct')

# Plot mRuby2 geometric mean for each condition
plot_df = stats[stats['ts_kind']!='na']
ax = sns.stripplot(data=plot_df, x='ts_num', y='mRuby2-A_gmean', 
                   hue='ts_kind', palette={'NT': 'gray', 'T': 'teal'},
                   size=10, jitter=False)
ax.set(yscale='log', xlabel='# of target sites', ylabel='output (gmean)')
sns.despine()

In [None]:
'''
Example 2: Small molecule titration experiment

Here, we load another dataset, where varying the concentration
of a small molecule (auxin) changes the expression of EGFP.
'''
# Load another dataset: Emma's auxin calibration curve
data_path = rd.datadir/'instruments'/'data'/'attune'/'Emma'/'2022.03.12_Auxin_Calib'/'Data'
data4 = rd.flow.load_csv_with_metadata(data_path, data_path/'wells.yaml', columns=['EGFP-A'])
data4 = data4[data4['EGFP-A']>0]
display(data4)

# Compute gmean for each auxin concentration, excluding untransfected cells (NT)
stats4 = data4[data4.Auxin!='NT'].groupby(['Auxin','Replicates'])['EGFP-A'].apply(sp.stats.gmean).rename('EGFP-A_gmean').reset_index()

# Plot summary statistics
plot_df = stats4[stats4.Auxin > 0]
ax = sns.scatterplot(data=plot_df, x='Auxin', y='EGFP-A_gmean', 
                     hue='Auxin', palette=no_yellow_viridis, hue_norm=matplotlib.colors.LogNorm(),
                     legend=False)
ax.set_xscale('symlog', linthresh=0.5)
ax.set(yscale='log', xlim=(0.4,1e3))

In [None]:
'''
Example 3: Alternative ways to plot replicates

Using the same data as above, we can collapse replicates into
a single value with an estimate of their spread. Notice that 
these functions will summarize the replicates for you, without
requiring any additional calculations.
'''
fig, axes = plt.subplots(1,2, figsize=(10,5), sharey=True)
plot_df = stats4

# Plot as line with shaded region
sns.lineplot(data=plot_df, x='Auxin', y='EGFP-A_gmean', ax=axes[0],
             estimator='median', errorbar='ci')

# Plot as points with error bars
sns.lineplot(data=plot_df, x='Auxin', y='EGFP-A_gmean', ax=axes[1],
             estimator='median', errorbar='ci', err_style='bars',
             marker='o', ls='')

for ax in axes:
    ax.set(xscale='log', yscale='log', ylim=(1e2,1e4))

### Some additional computations

Besides simple summary statistics, you may be interested in computing metrics like fraction positive in a particular channel or the fold change of one condition relative to another. Below are a few metrics that might be useful, or that might give you ideas for approaching other calculations. Note that there are several ways to perform these calculations; these are each just one approach.
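
As a warm-up, the two metrics mentioned above can be sketched on toy values before applying them to real data. The condition names, threshold, and numbers here are made up.

```python
import pandas as pd

# Made-up per-condition summary values
toy_stats = pd.DataFrame({
    'condition': ['baseline', 'low', 'high'],
    'gmean': [200.0, 400.0, 1600.0],
})

# Fold change: divide every value by the baseline condition's value
baseline = toy_stats.loc[toy_stats.condition == 'baseline', 'gmean'].iloc[0]
toy_stats['fold_change'] = toy_stats['gmean'] / baseline

# Fraction positive: the mean of a True/False Series is the fraction of True values
toy_cells = pd.Series([50.0, 500.0, 900.0, 20.0], name='mGL-A')
fraction_positive = (toy_cells > 300).mean()

print(toy_stats)
print(fraction_positive)
```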

In [None]:
'''
Gated fraction

For a given channel, calculate the fraction of cells in each
condition that have values greater than the specified threshold.
'''
fraction = (data_gated.groupby('construct')['mGL-A'].count() / 
            data.groupby('construct')['mGL-A'].count()).reset_index().rename(columns={'mGL-A': 'fraction'}).dropna()
display(fraction)

In [None]:
'''
Fold change

Compute the fold change of one statistic for some condition,
or set of conditions, relative to some baseline condition.
Here, we find the fold change of the output (mRuby2) for OL
and CL circuits relative to their respective 1x target site 
conditions.
'''
# Define a function to compute fold change within a group
def get_fc(df):
    d = df.copy()
    baseline = d.loc[d['ts_num']==1, 'mRuby2-A_gmean'].mean()
    d['fold_change'] = d['mRuby2-A_gmean'] / baseline
    return d

stats = stats.groupby(by=['ts_kind'])[stats.columns].apply(get_fc).reset_index(drop=True)
display(stats)

In [None]:
'''
Quantile binning

Rather than calculating summary statistics on an entire
condition, bin the data into equal-quantile groups based 
on values of a given channel (e.g., 10 bins each with 10% 
of the data). Here, we bin on the transfection marker (mGL).
'''
# Assign quantiles
num_bins = 20
data['bin_quantiles'] = data.groupby('construct')['mGL-A'].transform(lambda x: pd.qcut(x, q=num_bins, duplicates='drop'))

# Calculate the median of each bin
quantiles = data.groupby(['construct','bin_quantiles'])['mGL-A'].median().rename('bin_quantiles_median').reset_index()

# Create a new column in data with the bin median
data = data.merge(quantiles, how='left', on=['construct','bin_quantiles'])
display(data)

In [None]:
'''  
Quadrants defined by two gates

Categorize cells into quadrants based on two gates/channels.
Possible values:
  0 = double negative
  1 = x-positive
  2 = y-positive
  3 = double positive
Then, compute the fraction of cells in each quadrant and rename
the quadrants with useful labels.
'''
def get_quadrant(x,y,gate_x,gate_y):
    df_quad = pd.DataFrame()
    df_quad['x'] = x > gate_x
    df_quad['y'] = y > gate_y
    df_quad['quadrant'] = df_quad['x'].astype(int) + df_quad['y'].astype(int)*2
    return df_quad['quadrant']

# Categorize each cell into a quadrant
gate_mRuby2 = data.loc[data.construct=='UT', 'mRuby2-A'].quantile(0.999)
data['quadrant'] = get_quadrant(data['mGL-A'], data['mRuby2-A'], gate, gate_mRuby2)

# Compute fraction of cells in each quadrant
quadrants = data.groupby(['construct','quadrant'])['mGL-A'].count().rename('count')
quadrants = (quadrants/quadrants.groupby('construct').transform('sum')).dropna().reset_index(name='fraction')

# Rename quadrant numbers with interpretable labels
quadrants['label'] = quadrants.quadrant.map({0: 'double-negative', 1: 'mGL-positive', 2: 'mRuby2-positive', 3: 'double-positive'})
display(quadrants)

In [None]:
'''
Fitting to a model

Here, we reuse the auxin dataset from above, where Emma has generated a calibration curve 
for auxin (a small molecule that leads to degradation of proteins with the 
associated AID tag) by varying the auxin concentration and measuring the 
resulting drop in EGFP-AID fluorescence. From the literature, we find an 
equation that explains this relationship and fit the coefficients using
scipy's 'curve_fit'. Then, we plot the results.
'''
# Define a function for the model, where x is auxin concentration in µM
#  and the result is log10(fluorescence)
def my_model(x, basal_fluorescence, amplitude, EC50):
    return basal_fluorescence - amplitude * x/(x+EC50)

# Fit the data to the model and print the results
fit_df = stats4[stats4.Auxin > 0]
popt, pcov = sp.optimize.curve_fit(my_model, fit_df.Auxin, np.log10(fit_df['EGFP-A_gmean']))
print('basal fluorescence (log10): {0:.1f}\namplitude: {1:.1f}\nEC50 (µM): {2:.1f}'.format(*popt))

# Plot the data
ax = sns.scatterplot(data=fit_df, x='Auxin', y='EGFP-A_gmean',)
ax.set(xscale='log', yscale='log')

# Plot the model fit on the same axes
xs = np.logspace(np.log10(fit_df.Auxin.min()), np.log10(fit_df.Auxin.max()), 1000)
ys = my_model(xs, *popt)
sns.lineplot(x=xs, y=10**ys, ax=ax)
ax.axvline(popt[2], color='gray', zorder=0)
ax.annotate(r'EC$_{50}$ = ' + f'{popt[2]:.1f} µM', (0.05,0.05), xycoords='axes fraction', color='gray')

### Now go forth and explore your data!