# Compute Summary Motility Metrics
---

#### Overview
This notebook computes summary motility metrics from individual cell trajectories that were extracted from time lapse microscopy data acquired across two different imaging experiments.

_The idea is that this notebook generates the datasets used by the subsequent notebooks, so that you do not have to download the raw trajectory data to work with the summary motility metrics._

Details regarding the imaging experiments, the motility metrics, and the procedure for extracting cell trajectories from the time lapse microscopy data can be found in the [pub]() [*add link*]. The cell trajectories are stored in CSV files that can be found in the [data repository]() [*add link*].

This notebook outputs two CSV files to [`data/`](../data), one for each imaging experiment. The columns of each CSV file are the various motility metrics (total distance, net distance, confinement ratio, etc.), and each row holds the summary motility metrics for one cell trajectory. Only trajectories longer than 20 µm and lasting more than 10 s are written to the CSV files. Because each time lapse lasts 20 s, the duration threshold ensures that no cell is represented more than once.
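
For orientation, the relationship between a few of these metrics can be sketched as follows. This is a minimal illustration on a made-up trajectory, not the `chlamytracker` implementation; the function name `summarize_trajectory` and the input format are assumptions. It takes the confinement ratio to be the net distance divided by the total distance, which is consistent with the values in the dataframe preview later in this notebook.

```python
import math


def summarize_trajectory(points_um):
    """Summarize a trajectory given as a list of (x, y) positions in µm.

    Hypothetical helper for illustration only.
    """
    # total distance: sum of the lengths of all point-to-point steps
    total_distance = sum(math.dist(p, q) for p, q in zip(points_um, points_um[1:]))
    # net distance: straight-line distance from the first to the last position
    net_distance = math.dist(points_um[0], points_um[-1])
    # confinement ratio: 1 for a straight path, near 0 for a path that ends where it began
    confinement_ratio = net_distance / total_distance if total_distance > 0 else float("nan")
    return {
        "total_distance": total_distance,
        "net_distance": net_distance,
        "confinement_ratio": confinement_ratio,
    }


# an L-shaped path: 3 µm to the right, then 4 µm up
metrics = summarize_trajectory([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)])
print(metrics)
```

For this path the total distance is 7 µm and the net distance is 5 µm, so the confinement ratio is 5/7.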

Note that cell trajectories were originally output by running the script [`../src/chlamytracker/scripts/track_cells.py`](../src/chlamytracker/scripts/track_cells.py).

#### Datasets
The datasets from which the cell trajectories derive are split by vessel type.

| Experiment ID | Date          | Vessel                            |
|---------------|---------------|-----------------------------------|
| AMID-04       | 01 March 2024 | Agar microchamber pools ("Pools") |
| AMID-05       | 01 March 2024 | 384-well plate ("Wells")          |

#### Imports

In [1]:
import json
import re
from pathlib import Path

import pandas as pd
from chlamytracker.timelapse import Timelapse
from chlamytracker.tracking_metrics import TrajectoryCSVParser
from natsort import natsorted
from tqdm.notebook import tqdm

#### Collect CSV files
As stated above, the cell trajectories are stored in CSV files that can be found in the [data repository]() [*add link*]. Update the file paths in `input_directories` to point to wherever the data have been downloaded.

In [2]:
input_directories = {
    "AMID-04": Path(
        "/Volumes/Microscopy/Published_Datasets/HTP-motility-assay/AMID-04_Pools_cc124"
    ),
    "AMID-05": Path(
        "/Volumes/Microscopy/Published_Datasets/HTP-motility-assay/AMID-05_Wells_cc124"
    ),
}

csv_files = {
    "AMID-04": natsorted(input_directories["AMID-04"].glob("*/processed/*/*_tracks.csv")),
    "AMID-05": natsorted(input_directories["AMID-05"].glob("processed/*_tracks.csv")),
}

out = f"""\
AMID-04 csv files :: {len(csv_files["AMID-04"])}
AMID-05 csv files :: {len(csv_files["AMID-05"])}
--------------------
Total :: {sum([len(csvs) for csvs in csv_files.values()])}
"""
print(out)

AMID-04 csv files :: 258
AMID-05 csv files :: 24
--------------------
Total :: 282
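
Because a glob pattern silently yields nothing when a path is wrong, it can help to fail early. The following is a minimal sketch under stated assumptions: `collect_track_csvs` is a hypothetical helper, not part of `chlamytracker`, and it uses the built-in `sorted` rather than `natsorted` to stay dependency-free.

```python
import tempfile
from pathlib import Path


def collect_track_csvs(directory, pattern):
    """Return sorted paths matching `pattern`, raising if nothing is found."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Input directory not found: {directory}")
    files = sorted(directory.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching '{pattern}' in {directory}")
    return files


# demonstrate on a throwaway directory that mimics the AMID-05 glob pattern
with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp) / "processed"
    processed.mkdir()
    (processed / "example_tracks.csv").touch()
    found = [path.name for path in collect_track_csvs(tmp, "processed/*_tracks.csv")]

print(found)
```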



#### Load experimental parameters
As described in the [pub]() [*add link*], cells were prepared under a variety of experimental conditions, tabulated below.

| species        | vessel_type    | position_in_tube | time_in_water |
|----------------|----------------|------------------|---------------|
| C. reinhardtii | pools          | top              | 4 hrs         |
|                | wells          | middle           | 21 hrs        |

To match cell motility data with the proper set of experimental variables, we must be able to map each CSV file to its known parameter set. The information that enables this mapping is stored in [`data/experimental_parameters.json`](../data/experimental_parameters.json).

In [3]:
experimental_parameters_json = Path("../data/experimental_parameters.json")
experimental_parameter_sets = json.loads(experimental_parameters_json.read_text())
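
The lookups later in this notebook index the parsed JSON first by experiment ID, then by slide ID (AMID-04) or well ID (AMID-05), and read the keys `strain`, `vessel_type`, `position_in_tube`, and `time_in_water`. A minimal sketch of a structure compatible with those lookups is shown below. The slide ID `slide-01` and the AMID-04 values are made up for illustration, and the actual file may be organized differently in its details.

```python
import json

# hypothetical excerpt; the real IDs and values live in experimental_parameters.json
example_json = """
{
  "AMID-04": {
    "slide-01": {"strain": "cc124", "vessel_type": "pools",
                 "position_in_tube": "top", "time_in_water": 21}
  },
  "AMID-05": {
    "I03": {"strain": "cc124", "vessel_type": "wells",
            "position_in_tube": "middle", "time_in_water": 21}
  }
}
"""
parameter_sets = json.loads(example_json)

# same two-level lookup as in the main loop: experiment ID, then slide or well ID
parameters = parameter_sets["AMID-05"].get("I03")
print(parameters["vessel_type"], parameters["position_in_tube"])

# .get() returns None for an unknown ID, so a missing entry is worth checking for
missing = parameter_sets["AMID-05"].get("Z99")
print(missing is None)
```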

#### Get frame rates and pixel sizes of raw image data
Because they are derived from image data, the cell trajectories stored in the CSV files are in units of pixels and frames. We must extract the pixel size and frame rate from the image metadata in order to properly scale the trajectories.

In [4]:
nd2_files = {
    "AMID-04": next(input_directories["AMID-04"].glob("*/*.nd2")),
    "AMID-05": next(input_directories["AMID-05"].glob("*.nd2")),
}

pixelsizes = {
    "AMID-04": Timelapse(nd2_files["AMID-04"], load=False).pixelsize_um,
    "AMID-05": Timelapse(nd2_files["AMID-05"], load=False).pixelsize_um,
}

framerates = {
    "AMID-04": Timelapse(nd2_files["AMID-04"], load=False).framerate,
    "AMID-05": Timelapse(nd2_files["AMID-05"], load=False).framerate,
}

out = f"""\
Pixelsizes
----------
AMID-04 pixel size :: {pixelsizes["AMID-04"]:.3f} µm/px
AMID-05 pixel size :: {pixelsizes["AMID-05"]:.3f} µm/px

Framerates
----------
AMID-04 frame rate :: {framerates["AMID-04"]:.1f} fps
AMID-05 frame rate :: {framerates["AMID-05"]:.1f} fps
"""
print(out)

Pixelsizes
----------
AMID-04 pixel size :: 0.642 µm/px
AMID-05 pixel size :: 0.433 µm/px

Framerates
----------
AMID-04 frame rate :: 20.0 fps
AMID-05 frame rate :: 20.0 fps
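
The scaling itself is a pair of unit conversions: a position in pixels multiplied by the pixel size (µm/px) gives a position in µm, and a frame index divided by the frame rate (frames/s) gives a time in seconds. The sketch below uses the AMID-04 values printed above. The function name and input format are assumptions, and `TrajectoryCSVParser` may handle this differently internally.

```python
def scale_trajectory(frames, points_px, framerate, pixelsize_um):
    """Convert frame indices to seconds and pixel coordinates to µm."""
    times_s = [frame / framerate for frame in frames]
    points_um = [(x * pixelsize_um, y * pixelsize_um) for x, y in points_px]
    return times_s, points_um


# made-up track: three detections, 10 frames apart
frames = [0, 10, 20]
points_px = [(100.0, 50.0), (110.0, 50.0), (120.0, 60.0)]

times_s, points_um = scale_trajectory(frames, points_px, framerate=20.0, pixelsize_um=0.642)
print(times_s)  # [0.0, 0.5, 1.0]
```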



## Compute summary motility metrics
---

In [5]:
# initialize dataframe to collect the summary motility metrics
# for each cell trajectory
motility_metrics_dataframe = pd.DataFrame()

# loop through all the csv files of cell trajectories
for experiment_ID, csvs in csv_files.items():
    for csv in tqdm(csvs):
        # ID tags relevant for agar microchamber pools
        if experiment_ID == "AMID-04":
            well_ID = "NA"
            slide_ID = csv.parents[2].name
            timelapse_ID = int(re.findall(r"\d+", csv.parent.name)[-1])
            pool_ID = "x".join(csv.stem.split("_")[1:3])
            # get experimental parameters from slide ID
            experimental_parameters = experimental_parameter_sets[experiment_ID].get(slide_ID)

        # ID tags relevant for 384-well plate
        elif experiment_ID == "AMID-05":
            well_ID = csv.name[4:7]
            slide_ID = "NA"
            timelapse_ID = "NA"
            pool_ID = "NA"
            # get experimental parameters from well ID
            experimental_parameters = experimental_parameter_sets[experiment_ID].get(well_ID)

        else:
            raise ValueError(f"Unknown experiment ID {experiment_ID}")

        # parse motility data from csv
        framerate = framerates[experiment_ID]
        pixelsize = pixelsizes[experiment_ID]
        cell_trajectories = TrajectoryCSVParser(csv, framerate, pixelsize)

        # estimate cell count and compute motility measurements
        # for a batch of cell trajectories
        cell_count = cell_trajectories.estimate_cell_count()
        motility_metrics = cell_trajectories.compute_summary_statistics()
        dataframe = pd.DataFrame(motility_metrics)

        # build up dataframe
        dataframe["cell_count"] = cell_count
        dataframe["experiment_ID"] = experiment_ID
        dataframe["strain"] = experimental_parameters["strain"]
        dataframe["vessel_type"] = experimental_parameters["vessel_type"]
        dataframe["position_in_tube"] = experimental_parameters["position_in_tube"]
        dataframe["time_in_water"] = experimental_parameters["time_in_water"]
        dataframe["well_ID"] = well_ID
        dataframe["slide_ID"] = slide_ID
        dataframe["timelapse_ID"] = timelapse_ID
        dataframe["pool_ID"] = pool_ID

        # concatenate batch of motility metrics
        motility_metrics_dataframe = pd.concat([motility_metrics_dataframe, dataframe])

# clean up dataframe by removing superfluous `cell_id` column and resetting the index
motility_metrics_dataframe = motility_metrics_dataframe.drop("cell_id", axis=1).reset_index(
    drop=True
)

# preview dataframe
motility_metrics_dataframe.drop("slide_ID", axis=1).groupby("experiment_ID").head(5)

|      | total_time | total_distance | net_distance | max_sprint_length | confinement_ratio | mean_curvilinear_speed | mean_linear_speed | mean_angular_speed | num_rotations | num_direction_changes | pivot_rate | cell_count | experiment_ID | strain | vessel_type | position_in_tube | time_in_water | well_ID | timelapse_ID | pool_ID |
|------|------------|----------------|--------------|-------------------|-------------------|------------------------|-------------------|--------------------|---------------|-----------------------|------------|------------|---------------|--------|-------------|------------------|---------------|---------|--------------|---------|
| 0    | 2.55051    | 149.796243     | 87.404928    | 19.246914         | 0.583492          | 58.731878              | 34.269588         | 2.263515           | 0.0           | 6                     | 0.040054   | 1          | AMID-04       | cc124  | pools       | top              | 21            |         | 1.0          | 2x0     |
| 1    | 14.60292   | 723.059408     | 51.339131    | 22.719548         | 0.071003          | 49.514714              | 3.515676          | 3.595302           | 0.0           | 33                    | 0.045639   | 1          | AMID-04       | cc124  | pools       | top              | 21            |         | 1.0          | 2x0     |
| 2    | 1.20024    | 79.21207       | 74.641985    | 18.294917         | 0.942306          | 65.996859              | 62.189217         | 2.072528           | 0.0           | 1                     | 0.012624   | 1          | AMID-04       | cc124  | pools       | top              | 21            |         | 1.0          | 2x1     |
| 3    | 8.5017     | 485.537964     | 7.913013     | 18.817182         | 0.016297          | 57.110691              | 0.930757          | 2.763627           | 0.0           | 17                    | 0.035013   | 1          | AMID-04       | cc124  | pools       | top              | 21            |         | 1.0          | 2x1     |
| 4    | 4.20084    | 265.723997     | 16.991896    | 19.515305         | 0.063946          | 63.254967              | 4.044881          | 2.603108           | 0.0           | 8                     | 0.030106   | 1          | AMID-04       | cc124  | pools       | top              | 21            |         | 1.0          | 2x1     |
| 1269 | 5.401102   | 11.501189      | 1.680213     | 0.928078          | 0.14609           | 2.129415               | 0.311087          | 12.126698          | 0.0           | 47                    | 4.086534   | 7          | AMID-05       | cc124  | wells       | middle           | 21            | I03     |              |         |
| 1270 | 1.600327   | 3.539114       | 0.438554     | 0.802046          | 0.123916          | 2.211495               | 0.274041          | 18.509826          | 1.0           | 18                    | 5.086018   | 7          | AMID-05       | cc124  | wells       | middle           | 21            | I03     |              |         |
| 1271 | 2.600531   | 6.484695       | 0.96429      | 0.98697           | 0.148702          | 2.493605               | 0.370805          | 16.742955          | 2.0           | 30                    | 4.626277   | 7          | AMID-05       | cc124  | wells       | middle           | 21            | I03     |              |         |
| 1272 | 3.850786   | 15.982466      | 1.970273     | 1.503816          | 0.123277          | 4.150443               | 0.511655          | 17.890597          | 3.0           | 46                    | 2.878154   | 7          | AMID-05       | cc124  | wells       | middle           | 21            | I03     |              |         |
| 1273 | 16.853439  | 288.823532     | 131.213339   | 6.633705          | 0.454303          | 17.137365              | 7.785553          | 2.682857           | 1.0           | 23                    | 0.079633   | 7          | AMID-05       | cc124  | wells       | middle           | 21            | I03     |              |         |
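
The ID parsing in the loop above depends on the directory layout and file naming. The sketch below walks through the same expressions on made-up paths. The slide name, folder names, and file names are hypothetical, chosen only to match the glob patterns and slicing used in this notebook.

```python
import re
from pathlib import Path

# hypothetical AMID-04 path matching the pattern "*/processed/*/*_tracks.csv"
pools_csv = Path("slide-01/processed/timelapse_3/pool_2_0_tracks.csv")

# slide ID: name of the third folder up from the file
slide_ID = pools_csv.parents[2].name
# time lapse ID: last run of digits in the name of the containing folder
timelapse_ID = int(re.findall(r"\d+", pools_csv.parent.name)[-1])
# pool ID: second and third underscore-separated fields of the file stem
pool_ID = "x".join(pools_csv.stem.split("_")[1:3])
print(slide_ID, timelapse_ID, pool_ID)

# hypothetical AMID-05 file name; characters 4 through 6 hold the well ID
wells_csv = Path("processed/WellI03_tracks.csv")
well_ID = wells_csv.name[4:7]
print(well_ID)
```

Fixed-position slicing such as `csv.name[4:7]` only works if every file name has the same prefix length, so a different naming scheme would need a different rule.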


#### Filter based on trajectory distance and duration
As stated above, we only output the motility metrics for trajectories longer than 20 µm and lasting more than 10 s. These criteria filter out artifacts and broken cells, and ensure that no cell is represented more than once.

In [6]:
# filtering criteria
total_time_threshold = 10  # seconds
total_distance_threshold = 20  # µm

# apply filters
motility_metrics_dataframe_filtered = motility_metrics_dataframe.loc[
    (motility_metrics_dataframe["total_time"] > total_time_threshold)
    & (motility_metrics_dataframe["total_distance"] > total_distance_threshold)
]

# output stats
num_trajectories = len(motility_metrics_dataframe)
num_filtered = num_trajectories - len(motility_metrics_dataframe_filtered)
num_remaining = len(motility_metrics_dataframe_filtered)

msg = (
    f"Filtered out {num_filtered} of {num_trajectories} trajectories.\n"
    f"{num_remaining} trajectories ({num_remaining/num_trajectories:.0%}) remaining."
)
print(msg)

Filtered out 2059 of 2295 trajectories.
236 trajectories (10%) remaining.
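
One way to read the uniqueness argument is as a counting one. If tracking loses a cell and later picks it up again, the cell appears as several trajectory fragments. Assuming the fragments of a given cell do not overlap in time, their durations cannot sum to more than the 20 s time lapse, so at most one fragment can exceed 10 s. A toy illustration with made-up fragment durations, in plain Python rather than pandas to keep it self-contained:

```python
timelapse_duration = 20  # seconds
total_time_threshold = 10  # seconds

# made-up durations (s) of three fragments of one cell's track within a single time lapse
fragment_durations = [12.5, 4.0, 3.5]
assert sum(fragment_durations) <= timelapse_duration

# fragments that survive the duration filter
kept = [duration for duration in fragment_durations if duration > total_time_threshold]
print(kept)  # [12.5]
```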


## Export
---

In [7]:
# output directory
output_directory = Path("../data/")

# group dataframe by experiment ID and export to csv
for experiment_ID, dataframe in motility_metrics_dataframe_filtered.groupby("experiment_ID"):
    csv_file = f"{experiment_ID}_summary_motility_metrics.csv"
    dataframe.to_csv(output_directory / csv_file, index=False)
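
Subsequent notebooks can then load the exported files directly with `pd.read_csv`. A minimal sketch of the round trip, using a throwaway directory and a made-up two-row dataframe in place of the real data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# made-up stand-in for the filtered dataframe
example = pd.DataFrame(
    {
        "total_time": [12.0, 15.5],
        "total_distance": [250.0, 310.0],
        "experiment_ID": ["AMID-04", "AMID-05"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    output_directory = Path(tmp)
    # same export pattern as above: one CSV file per experiment ID
    for experiment_ID, dataframe in example.groupby("experiment_ID"):
        csv_file = f"{experiment_ID}_summary_motility_metrics.csv"
        dataframe.to_csv(output_directory / csv_file, index=False)

    # load one experiment back, as a downstream notebook would
    reloaded = pd.read_csv(output_directory / "AMID-04_summary_motility_metrics.csv")

print(reloaded)
```

Writing with `index=False` keeps the dataframe index out of the file, so the reloaded table has no extra unnamed index column.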