### Statistical tests
**Notebook description / method** <br>

Previously, I calculated the delay time differences between each of the GNE, GNEO1, and GNEO2 networks and the GN reference network, $\Delta\tau_{i -GN}$, where $i \in \{GNE, GNEO1, GNEO2\}$.

In this notebook, I perform statistical analysis on the $\Delta\tau_{i -GN}$ data.
For each condition (parameter, rel change), I compare the delay time differences between networks.

First, I check whether the distribution of delay time differences is normal.
Besides plotting the distribution (graphical method), I perform a Shapiro-Wilk test to assess whether the delay time differences are normally distributed. The null hypothesis is: "The delay time differences follow a normal distribution." I choose the significance level $\alpha=0.05$ (common practice).

If the normality assumption is valid, I perform a one-sided t-test with the null hypothesis: "The delay time differences of network i+1 are the same as or smaller than the delay time differences of network i." For example, if i = GNE, then i+1 = GNEO1.

If the delay time differences do not follow a normal distribution, I perform a one-sided Mann-Whitney U test. The null hypothesis is the same: "The delay time differences of network i+1 are the same as or smaller than the delay time differences of network i."
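
A minimal sketch of this decision rule, assuming SciPy is available. The helper name `compare_one_sided`, the use of an independent two-sample (Welch) t-test, and the synthetic data are my assumptions for illustration; they are not taken from `Module_241021`.

```python
import numpy as np
from scipy import stats


def compare_one_sided(sample_next, sample_prev, alpha=0.05):
    """Test H0: sample_next is the same as or smaller than sample_prev.

    Hypothetical helper: uses a one-sided Welch t-test when both samples
    pass the Shapiro-Wilk test, otherwise a one-sided Mann-Whitney U test.
    """
    both_normal = all(
        stats.shapiro(sample).pvalue > alpha for sample in (sample_next, sample_prev)
    )
    if both_normal:
        result = stats.ttest_ind(
            sample_next, sample_prev, equal_var=False, alternative="greater"
        )
        test_name = "t-test"
    else:
        result = stats.mannwhitneyu(sample_next, sample_prev, alternative="greater")
        test_name = "Mann-Whitney U"
    return test_name, float(result.pvalue)


# Synthetic, skewed data (not the notebook's data)
rng = np.random.default_rng(seed=0)
prev = rng.exponential(scale=1.0, size=200)
nxt = rng.exponential(scale=1.0, size=200) + 1.0
test_name, p_value = compare_one_sided(nxt, prev)
print(test_name, p_value)
```

With `alternative="greater"`, a small p-value is evidence against H0, i.e. evidence that network i+1 has larger delay time differences than network i.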

In both cases, I use the significance levels $\alpha=0.05$ (\*, reasonable confidence), $\alpha=0.01$ (\*\*), and $\alpha=0.001$ (\*\*\*, high-confidence results) to grade the strength of evidence. [Note to self: I will only indicate the asterisks if the null hypothesis is rejected.]
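
The asterisk convention can be sketched as a small lookup. The function name `significance_stars` is hypothetical and only illustrates the thresholds listed above.

```python
def significance_stars(p_value):
    """Map a p-value to '***', '**', '*' or '' (hypothetical helper)."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # H0 not rejected: no asterisk is shown


for p in (0.0005, 0.005, 0.03, 0.2):
    print(p, repr(significance_stars(p)))
```
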
**Result** <br>

Example of result (update when the analysis is done; this is just to remember what I am interested in): None of the distributions follow a normal distribution.


In [1]:
import pandas as pd # to read single csv files

In [2]:
from Module_241021 import (load_data_w_glob,
                           shapiro_wilk_test_and_plots,
                           MW_test,
                           wilcoxon_test_GNE_timedata
                          )

In [3]:
perform_shapiro_wilk_test = False
perform_mannwhitney_u_test = False

## Do the distributions of the delay time differences follow a normal distribution?
#### Perform a formal normality test (Shapiro-Wilk test)
H0: The sample is normally distributed

Choice of significance level: $\alpha$=0.05 (widely accepted in practice)

Note on interpretation of the p-value:
If p-value > 0.05, the SW-test does not reject the null hypothesis, and the data is consistent with being normally distributed.

If p-value < 0.05, the SW-test does reject the null hypothesis, suggesting the data is not normally distributed.
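
As an illustration of this interpretation, here is a minimal sketch using `scipy.stats.shapiro` on synthetic data. The helper `shapiro_wilk_row` is hypothetical; it only mimics the `p_value`, `significance level`, and `reject` columns of the saved results and is not the implementation of `shapiro_wilk_test_and_plots`.

```python
import numpy as np
from scipy import stats


def shapiro_wilk_row(sample, alpha=0.05):
    """Run the Shapiro-Wilk test and return one result row (hypothetical helper)."""
    p_value = float(stats.shapiro(sample).pvalue)
    return {
        "p_value": p_value,
        "significance level": alpha,
        "reject": p_value < alpha,  # True: H0 (normality) is rejected
    }


# Synthetic, clearly skewed sample (not the notebook's data)
rng = np.random.default_rng(seed=0)
skewed = rng.exponential(scale=1.0, size=200)
row = shapiro_wilk_row(skewed)
print(row)
```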

## Test for normality

In [4]:
# load time diff data
diff_tau_all = pd.read_csv("time_delay_diff_241021.csv")

# load + pickout velocity data
PSA_all_df = load_data_w_glob(directory_path="PSA_data", file_pattern="*241021.csv")
velocity_all = PSA_all_df[["network", "parameter", "rel change", "condition 1", "condition 2", "velocity"]]

In [5]:
if perform_shapiro_wilk_test:
    
    # delay time difference, normal test
    shapiro_wilk_test_and_plots(dataframe=diff_tau_all, network="two_networks", datatype="time_diff",
                             date="241021", verbose=True, savefig=True
                            )
    
    # velocity data, normal test
    shapiro_wilk_test_and_plots(dataframe=velocity_all, network="network", datatype="velocity",
                             date="241021", verbose=True, savefig=True
                            )

## Review normality test results

In [12]:
SW_results_delay_time_diff = pd.read_csv("distribution stats/shapiro_test_results_time_diff_241021.csv")
SW_results_velocity = pd.read_csv("distribution stats/shapiro_test_results_velocity_241021.csv")

# cases of where H0 is not rejected (normality)
print("delay time diff:\n",SW_results_delay_time_diff[SW_results_delay_time_diff["reject"]==False])
print("velocity:\n",SW_results_velocity[SW_results_velocity["reject"]==False])



print(len(SW_results_delay_time_diff[SW_results_delay_time_diff["networks"]=="GNEvGN"]))
print(len(SW_results_velocity))

delay time diff:
    networks parameter  rel change   p_value  significance level  reject  \
10   GNEvGN   alphaEO         0.2  0.064536                0.05   False   
66   GNEvGN       KEG         2.0  0.200544                0.05   False   

    condition 1  condition 2  
10         True         True  
66         True         True  
velocity:
 Empty DataFrame
Columns: [networks, parameter, rel change, p_value, significance level, reject, condition 1, condition 2]
Index: []
100
400


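I perform a Mann-Whitney U test because H0 (normality) is not rejected in only 2 out of 300 cases for the delay time differences; that is, only 2 cases are consistent with a normal distribution. For the velocity data, H0 is rejected in all 400 cases.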

### Perform a one sided Mann-Whitney U test
**Delay Time Diff** <br> I test the following:
- Case A with $H_0$: The $\Delta\tau_{GNEO2-GN}$ are equal to or smaller than the $\Delta\tau_{GNE-GN}$
- Case B with $H_0$: The $\Delta\tau_{GNEO1-GN}$ are equal to or smaller than the $\Delta\tau_{GNE-GN}$
- Case C with $H_0$: The $\Delta\tau_{GNEO2-GN}$ are equal to or smaller than the $\Delta\tau_{GNEO1-GN}$
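
For one such case, the one-sided test can be sketched with `scipy.stats.mannwhitneyu`. The arrays below are synthetic stand-ins for the two sets of delay time differences, and the variable names are my own; whether `MW_test` calls SciPy in exactly this way is an assumption.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic stand-ins (not the notebook's data)
rng = np.random.default_rng(seed=1)
delta_tau_ref = rng.exponential(scale=1.0, size=300)
delta_tau_compare = rng.exponential(scale=1.0, size=300) + 0.5

# H0: delta_tau_compare is equal to or smaller than delta_tau_ref.
# alternative="greater" puts the first sample on the "greater" side.
result = mannwhitneyu(delta_tau_compare, delta_tau_ref, alternative="greater")
reject_h0 = bool(result.pvalue < 0.05)
print(result.pvalue, reject_h0)
```

The order of the two samples matters: swapping them while keeping `alternative="greater"` tests the opposite direction.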

In [7]:
run_test=False
if perform_mannwhitney_u_test:
    # prepare data
    mask0 = diff_tau_all["condition 1"]==True # bi-stable
    mask1 = diff_tau_all["condition 2"]==True # ref sfp

    GNEvGN_time_df = diff_tau_all[(mask0)&(mask1)&(diff_tau_all["two_networks"]=="GNEvGN")]
    GNEO1vGN_time_df = diff_tau_all[(mask0)&(mask1)&(diff_tau_all["two_networks"]=="GNEO1vGN")]
    GNEO2vGN_time_df = diff_tau_all[(mask0)&(mask1)&(diff_tau_all["two_networks"]=="GNEO2vGN")]

    # check that the dataframes have the same length
    print(len(GNEvGN_time_df), len(GNEO1vGN_time_df), len(GNEO2vGN_time_df))
    
    if run_test: # extra safety belt!!
        # Case A
        MW_test(df_compare=GNEO2vGN_time_df, df_ref=GNEvGN_time_df,
                datatype="time diff", date="241021", alternative="greater")

        # Case B
        MW_test(df_compare=GNEO1vGN_time_df, df_ref=GNEvGN_time_df,
                datatype="time diff", date="241021", alternative="greater")

        # Case C
        MW_test(df_compare=GNEO2vGN_time_df, df_ref=GNEO1vGN_time_df,
                datatype="time diff", date="241021", alternative="greater")

**Velocity Data** <br>
My question: Does adding a pluripotency marker (Esrrb or Oct4) slow down the cell specification? For example, are the velocities of the GNE network *smaller* than the velocities of the GN network?

To answer my question, I perform a Mann-Whitney U test with the null hypothesis stated in case D. Cases E-I apply the same test to the other network pairs.
I test the following:
- Case D with $H_0$: The velocities from the GNE network are equal to or larger than the velocities from the GN network.
- Case E with $H_0$: The velocities from the GNEO1 network are equal to or larger than the velocities from the GNE network.
- Case F with $H_0$: The velocities from the GNEO2 network are equal to or larger than the velocities from the GNEO1 network.
- Case G with $H_0$: The velocities from the GNEO1 network are equal to or larger than the velocities from the GN network.
- Case H with $H_0$: The velocities from the GNEO2 network are equal to or larger than the velocities from the GNE network.
- Case I with $H_0$: The velocities from the GNEO2 network are equal to or larger than the velocities from the GN network.

Remember to set `alternative="less"` in the SciPy Mann-Whitney U test (`scipy.stats.mannwhitneyu`).
`df_compare` is the network with an additional pluripotency marker. `df_ref` is the network without.
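
A minimal sketch of how such a per-condition comparison could look, assuming the test is run separately for each (parameter, rel change) pair. `mw_per_condition` is a hypothetical stand-in, not the actual `MW_test` from `Module_241021`, and the toy dataframes are synthetic.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def mw_per_condition(df_compare, df_ref, value_col="velocity", alpha=0.05):
    """One-sided Mann-Whitney U test per (parameter, rel change) group.

    H0: values in df_compare are equal to or larger than values in df_ref,
    so the alternative is "less".
    """
    keys = ["parameter", "rel change"]
    ref_groups = {key: grp[value_col] for key, grp in df_ref.groupby(keys)}
    rows = []
    for key, grp in df_compare.groupby(keys):
        ref_values = ref_groups.get(key)
        if ref_values is None or ref_values.empty:
            continue  # no reference data for this condition: skip it
        p_value = mannwhitneyu(grp[value_col], ref_values, alternative="less").pvalue
        rows.append({"parameter": key[0], "rel change": key[1],
                     "p_value": float(p_value), "reject": bool(p_value < alpha)})
    return pd.DataFrame(rows)


# Toy data: the "compare" velocities are all smaller than the reference ones
toy_ref = pd.DataFrame({"parameter": ["KEG"] * 6, "rel change": [2.0] * 6,
                        "velocity": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]})
toy_compare = toy_ref.assign(velocity=toy_ref["velocity"] - 10.0)
toy_result = mw_per_condition(toy_compare, toy_ref)
print(toy_result)
```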

In [7]:
run_test=True
if perform_mannwhitney_u_test:
    
    # prepare the data
    # edit 25.10.2024 include all data regardless of robustness test.
    #mask0 = velocity_all["condition 1"]==True
    #mask1 = velocity_all["condition 2"]==True
    #GN_velocity_df = velocity_all[(mask0)&(mask1)& (velocity_all["network"]=="GN")]
    #GNE_velocity_df = velocity_all[(mask0)&(mask1)& (velocity_all["network"]=="GNE")]
    #GNEO1_velocity_df = velocity_all[(mask0)&(mask1)& (velocity_all["network"]=="GNEO1")]
    #GNEO2_velocity_df = velocity_all[(mask0)&(mask1)& (velocity_all["network"]=="GNEO2")]
    
    GN_velocity_df = velocity_all[velocity_all["network"] == "GN"]
    GNE_velocity_df = velocity_all[velocity_all["network"] == "GNE"]
    GNEO1_velocity_df = velocity_all[velocity_all["network"] == "GNEO1"]
    GNEO2_velocity_df = velocity_all[velocity_all["network"] == "GNEO2"]
    

    # the not-empty check takes care of cases where one of the dataframes does not fulfill conditions 1 and 2.
    # in that case, for a certain parameter and rel change, that dataframe is empty and the
    # function continues to the next. So, it is not a problem if the dataframes differ in size when
    # they are evaluated in the MW_test function.
    print(len(GN_velocity_df), len(GNE_velocity_df),len(GNEO1_velocity_df), len(GNEO2_velocity_df))
    
    if run_test: # extra safety measure to avoid overwriting files. - change the date (filename)
        # Case D
        MW_test(df_compare=GNE_velocity_df, df_ref=GN_velocity_df,
                datatype="velocity", date="241025", alternative="less")
        # Case E
        MW_test(df_compare=GNEO1_velocity_df, df_ref=GNE_velocity_df,
                datatype="velocity", date="241025", alternative="less")
        # Case F
        MW_test(df_compare=GNEO2_velocity_df, df_ref=GNEO1_velocity_df,
                datatype="velocity", date="241025", alternative="less")
        # Case G
        MW_test(df_compare=GNEO1_velocity_df, df_ref=GN_velocity_df,
                datatype="velocity", date="241025", alternative="less")
        # Case H
        MW_test(df_compare=GNEO2_velocity_df, df_ref=GNE_velocity_df,
                datatype="velocity", date="241025", alternative="less")
        # Case I
        MW_test(df_compare=GNEO2_velocity_df, df_ref=GN_velocity_df,
                datatype="velocity", date="241025", alternative="less")
    

112700 112700 112700 112700
Mann-Whitney U test results are saved to MW_results_velocity_GNE_GN_241025.csv 
Mann-Whitney U test results are saved to MW_results_velocity_GNEO1_GNE_241025.csv 
Mann-Whitney U test results are saved to MW_results_velocity_GNEO2_GNEO1_241025.csv 
Mann-Whitney U test results are saved to MW_results_velocity_GNEO1_GN_241025.csv 
Mann-Whitney U test results are saved to MW_results_velocity_GNEO2_GNE_241025.csv 
Mann-Whitney U test results are saved to MW_results_velocity_GNEO2_GN_241025.csv 


## Question: Are $\Delta\tau_{GNE-GN}$ positive?

H0: $\Delta\tau_{GNE-GN}$ is equal to or smaller than zero. <br>
Test: Wilcoxon. By eye, the box plots do not show normal distributions. I have also confirmed this by running a Shapiro-Wilk test.
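
A minimal sketch of such a test, assuming the one-sample Wilcoxon signed-rank test from `scipy.stats.wilcoxon` is meant. The data are synthetic, and this is not the implementation of `wilcoxon_test_GNE_timedata`.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic stand-in for the delay time differences: mostly positive values
rng = np.random.default_rng(seed=2)
delta_tau = rng.exponential(scale=1.0, size=100) - 0.2

# H0: the differences are equal to or smaller than zero.
# alternative="greater" tests whether they are shifted above zero.
result = wilcoxon(delta_tau, alternative="greater")
reject_h0 = bool(result.pvalue < 0.05)
print(result.pvalue, reject_h0)
```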

In [9]:
# prepare data
mask0 = diff_tau_all["condition 1"]==True # bi-stable
mask1 = diff_tau_all["condition 2"]==True # ref sfp
GNEvGN_time_df = diff_tau_all[(mask0)&(mask1)&(diff_tau_all["two_networks"]=="GNEvGN")]

# run test
wilcoxon_test_GNE_timedata(dataframe=GNEvGN_time_df, date="241021", datatype="time diff")

Wilcoxon test results are saved in csv file wilcoxon_test_GNEvGN_241021.csv
