# Profile calculations
Written by: Amanda Ng R.H.
<br>Language: `python3`
<br>Created on: 19 Apr 2023
<br>Last updated on: 06 Sep 2023
<br>Prior data processing: Required (📔 2_QC_ProfileAssembly_FeatureSelection.ipynb)
<br>Documentation status: In progress ([Sphinx documentation style](https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html))

The morphological profile was assembled, and the relevant features were selected with different feature selection strategies, in 📔 2_QC_ProfileAssembly_FeatureSelection.ipynb.

## a| Calculations for RKO WT data with features selected using the `global_c662_rko_wt` strategy were done in `julia`

The calculations were executed outside this notebook using the following commands, which run the 📜 execute_distanceCalculation.sh script:

```bash
# Path to the script
script=/research/lab_winter/users/ang/isogenicCPA_repo/3a_execute_distanceCalculation.sh

# Prepare the script for execution
dos2unix ${script}
chmod +x ${script}

# Execute the script
sbatch ${script}
```

The calculations comprise UMAP reduction of the profile with `UMAP.jl` and distance calculation with the robust Hellinger distance from `BioProfiling.jl`. The resulting data are used for visualization in downstream notebooks.

I use these calculations to assess the morphological perturbation strength of each compound.

## b| Calculations for CRBN-dependency interpretation in `python`
The calculations for CRBN-dependency prediction are done in this notebook.

In brief, features are selected for each treatment ("treatment-centric"). These features exhibit high heterogeneity in RKO WT but a clear morphological perturbation in RKO CRBN OE. This perturbation tends to be stronger in RKO CRBN OE than in RKO CRBN KO.

I hypothesize that these treatment-centric features could be used to calculate an **induction score** (loosely inspired by the approaches described by [Schneidewind _et al._](https://doi.org/10.1016/j.chembiol.2021.06.003) and [Woehrmann _et al._](https://doi.org/10.1039/C3MB70245F)). The induction score should reflect the overall change in treatment-centric features per image (per treatment and per cell line). The induction score should be the highest in RKO CRBN OE and the lowest in RKO CRBN KO. The difference between these two cell lines can be quantified using the U statistic from the Mann-Whitney U test. (I use the **corrected U** statistic instead, which is the U statistic as a fraction of the maximum possible U statistic for a given treatment.)

Considering the way features are selected with the treatment-centric strategy, CRBN-independent treatments could exhibit the same induction score trend. I would, however, expect the trend to be much less pronounced than for CRBN-dependent treatments.
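The corrected U described above can be sketched in a few lines. This is a minimal illustration, not the implementation behind `compareInduction`, which is not shown in this notebook. It assumes that the maximum possible U is the product of the two sample sizes (the standard upper bound for the Mann-Whitney U statistic) and counts pairs directly, with ties contributing one half. The function name `corrected_u` and the score values are hypothetical.

```python
def corrected_u(oe_scores, ko_scores):
    """Return U for the OE sample as a fraction of the maximum possible U.

    Assumption: the maximum possible U equals len(oe_scores) * len(ko_scores).
    U is computed by pair counting: each (OE, KO) pair adds 1 if the OE score
    is larger, 0.5 if the two scores are tied, and 0 otherwise.
    """
    if len(oe_scores) == 0 or len(ko_scores) == 0:
        raise ValueError("Both samples must contain at least one score.")

    u_statistic = 0.0
    for oe in oe_scores:
        for ko in ko_scores:
            if oe > ko:
                u_statistic += 1.0
            elif oe == ko:
                u_statistic += 0.5

    max_u = len(oe_scores) * len(ko_scores)
    return u_statistic / max_u


# Hypothetical induction scores per image for one treatment
oe_example = [3.0, 4.0, 5.0]  # RKO CRBN OE
ko_example = [0.0, 1.0, 2.0]  # RKO CRBN KO

# Every OE score exceeds every KO score, so the corrected U is 1.0
print(corrected_u(oe_example, ko_example))
```

Under this definition, a corrected U near 1 means the induction scores in RKO CRBN OE are almost always higher than in RKO CRBN KO, a value near 0.5 means no separation, and a value near 0 means the reverse trend. Recent SciPy versions should report the same U (before dividing by the maximum) from `scipy.stats.mannwhitneyu` when the OE sample is passed first.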

### Import packages

In [None]:
import os
import warnings

import pandas as pd

import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch

import statistics
import numpy as np
from scipy.stats import mannwhitneyu
import math

### Import self-written functions

In [None]:
# Add the module directories to Python's module search path
import sys
root = "/research/lab_winter/users/ang"
module_paths = [
    f"{root}/isogenicCPA_repo/2_feature_analysis"
]
for module_path in module_paths:
    sys.path.insert(0, module_path)
    
from general_modules import *
from profile_interpretation_modules import *

### Setting up

In [None]:
######################################################################################
# Paths to the relevant files and directories prior or unrelated to feature extraction
######################################################################################
# Path to the parent directory with all the data for the experiment
parent_dir = "/research/lab_winter/users/ang/projects/GW015_SuFEX_IMiDs/GW015_006__full_cpa"

# Path to the drug metadata sheet
drugMetadata_path = f"{parent_dir}/drug_metadata_sheet.csv"

###################################################################################
# Paths to the relevant files and directories from post-feature extraction pipeline
###################################################################################
# Path to the outputs from the post-feature extraction pipeline
post_feature_extraction_output_dir = f"{parent_dir}/cleanPostFeatureExtraction/output_dir"

# Path to the morphology profile after robust Z standardization and baseline feature selection
profile_path = f"{post_feature_extraction_output_dir}/2_ProfileAssemblyAndFeatureSelection_output/baseline_output.csv"

# Path to the features selected
selectedFeatures_path = f"{post_feature_extraction_output_dir}/2_ProfileAssemblyAndFeatureSelection_output/featuresSelected_output.csv"

############################################################
# Path to the directory for the output(s) from this notebook
############################################################
# Path to the output directory, then create it
output_dir = f"{post_feature_extraction_output_dir}/3_ProfileInterpretation_output/Python_calculations"
makeDirectory(output_dir)

### Induction score and corrected U calculation

In [None]:
# Path to export the induction scores calculated
inductionScores_path = f"{output_dir}/inductionScores.csv"

# Calculate the induction scores
inductionScores = calculateInductionScores(
    selectedFeatures = pd.read_csv(selectedFeatures_path),
    profile = pd.read_csv(profile_path),
    treatmentsToUse = "all"
)

# Export the induction scores
inductionScores.to_csv(inductionScores_path, index = False)

# Have a look at the induction scores calculated
inductionScores.sample(n = 5)

In [None]:
# Path to export the corrected U calculated
correctedU_path = f"{output_dir}/correctedU.csv"

# Calculate the corrected U
correctedU = compareInduction(
    inductionScores = inductionScores,
    ko = "c1141_rko_ko",
    oe = "c1327_rko_oe"
)

# Export the corrected U
correctedU.to_csv(correctedU_path, index = False)

# Have a look at the corrected U calculated
correctedU.sample(n = 5)

## Completion of profile calculations
Proceed to <span style="color:blue">**Profile Interpretations**</span>.