# Behavior Analysis

## Assessing Sex as a Statistically Significant Factor

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import pingouin as pg

## Open Field

In [3]:
of_raw = pd.read_csv('/Users/labc02/Documents/PDCB_data/Behavior/Open_Field/Open_Field_pool.csv')

In [4]:
of_raw.rename(
    columns={
        'Subject Name': 'Name',
        'Subject Group': 'Group',
        'Subject Gender': 'Sex',
        'Subject Genotype': 'Genotype',
        'Total Distance': 'Distance',
        'Time in Zone (%) - Center': 'Time_in_Center',
    },
    inplace=True,
)

#### Check normality for Total Distance, Crosses, and Time in Center per group

In [5]:
for var_ in ['Distance', 'Crosses', 'Time_in_Center']:
    print(f'Normality test (Shapiro), {var_}')
    print(pg.normality(of_raw, dv=var_, group='Group'))

Normality test (Shapiro), Distance
            W      pval  normal
KOF  0.893278  0.250967    True
KOM  0.898312  0.278996    True
WTF  0.808044  0.049156   False
WTM  0.859051  0.093632    True
Normality test (Shapiro), Crosses
            W      pval  normal
KOF  0.920730  0.435819    True
KOM  0.950585  0.717135    True
WTF  0.945284  0.686676    True
WTM  0.931077  0.491669    True
Normality test (Shapiro), Time_in_Center
            W      pval  normal
KOF  0.929021  0.507204    True
KOM  0.944262  0.653440    True
WTF  0.849858  0.122533    True
WTM  0.954134  0.735383    True


#### Check homoscedasticity; Levene's Test

In [6]:
for var_ in ['Distance', 'Crosses', 'Time_in_Center']:
    print(f'Homoscedasticity test (Levene), {var_}')
    print(pg.homoscedasticity(of_raw, dv=var_, group='Group'))

Homoscedasticity test (Levene), Distance
               W      pval  equal_var
levene  1.637141  0.203219       True
Homoscedasticity test (Levene), Crosses
               W      pval  equal_var
levene  0.281895  0.838005       True
Homoscedasticity test (Levene), Time_in_Center
               W      pval  equal_var
levene  3.171074  0.039658      False
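The decision path followed in the rest of this notebook can be summarised in a small helper. This is only a sketch: `choose_test` is a hypothetical function, not part of pingouin. The Kruskal-Wallis and Welch t-test branches are conventional alternatives added for completeness; the notebook itself only uses ANOVA, Welch ANOVA, the t-test, and Mann-Whitney U.

```python
def choose_test(all_normal, equal_var, n_groups):
    """Hypothetical helper: map the assumption checks to a candidate test."""
    if not all_normal:
        # Shapiro-Wilk failed in at least one group: use a rank-based test
        return 'Mann-Whitney U' if n_groups == 2 else 'Kruskal-Wallis'
    if not equal_var:
        # Normal data, but Levene's test rejected equal variances
        return 'Welch t-test' if n_groups == 2 else 'Welch ANOVA'
    # Both assumptions hold: classic parametric test
    return 't-test' if n_groups == 2 else 'ANOVA'


print(choose_test(all_normal=True, equal_var=True, n_groups=4))   # ANOVA
print(choose_test(all_normal=True, equal_var=False, n_groups=4))  # Welch ANOVA
print(choose_test(all_normal=False, equal_var=True, n_groups=2))  # Mann-Whitney U
```

The flags would come from the `normal` and `equal_var` columns printed above.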


#### Total Distance; normality passed (WTF borderline, p = 0.049), homoscedasticity passed -> Two-way ANOVA

In [7]:
pg.anova(data = of_raw, dv='Distance', between = ['Sex', 'Genotype'])

           Source          SS    DF          MS          F     p-unc       np2
0             Sex    182530.0   1.0    182530.0   0.147029  0.704287  0.005224
1        Genotype  14882800.0   1.0  14882800.0  11.988174  0.001738  0.299793
2  Sex * Genotype    215594.1   1.0    215594.1   0.173662  0.680055  0.006164
3        Residual  34760800.0  28.0   1241457.0


The two-way ANOVA shows a significant main effect of Genotype (p = 0.0017). Sex is not significant, either as a main effect (p = 0.70) or through its interaction with Genotype (p = 0.68).
Collapsing the Sex category leaves a two-group comparison.
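As a sanity check, the F statistic and partial eta squared for Genotype can be recomputed by hand from the sums of squares in the ANOVA table. The sketch below only copies the displayed (rounded) values, so small differences in the last decimals are expected.

```python
# Values copied from the Distance ANOVA table
ss_genotype = 14882800.0
ss_residual = 34760800.0
df_genotype = 1.0
df_residual = 28.0

# Mean squares: sum of squares divided by degrees of freedom
ms_genotype = ss_genotype / df_genotype
ms_residual = ss_residual / df_residual

# F is the ratio of the effect mean square to the residual mean square
f_genotype = ms_genotype / ms_residual

# Partial eta squared: effect SS over effect SS plus residual SS
np2_genotype = ss_genotype / (ss_genotype + ss_residual)

print(f'F = {f_genotype:.3f}, np2 = {np2_genotype:.3f}')  # F = 11.988, np2 = 0.300
```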

#### Normality check for Genotype, collapsing Sex

In [8]:
pg.normality(of_raw, dv='Distance', group='Genotype')

           W      pval  normal
KO  0.900918  0.083137    True
WT  0.868974  0.026231   False


Normality failed for WT (p = 0.026) -> Mann-Whitney U

In [9]:
pg.mwu(x=of_raw['Distance'][of_raw['Genotype'] == 'WT'], y=of_raw['Distance'][of_raw['Genotype'] == 'KO'])

     U-val       tail     p-val      RBC     CLES
MWU  200.0  two-sided  0.007044  -0.5625  0.78125


Significant difference: p = 0.007 (p ≤ 0.05)
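The two effect sizes reported next to the U statistic can be recovered by hand. The sketch below assumes 16 animals per genotype; this is not stated in the notebook, but it is consistent with the 32 animals implied by the residual DF of 28 in the four-group ANOVA and it reproduces the reported values. The sign convention of the rank-biserial correlation differs between references and possibly between pingouin versions, so treat the formula as the one that matches this output rather than as a definition.

```python
u_val = 200.0        # U statistic reported by pg.mwu
n_wt, n_ko = 16, 16  # ASSUMPTION: group sizes inferred, not stated in the notebook

# Common language effect size: share of WT/KO pairs counted in U
cles = u_val / (n_wt * n_ko)

# Rank-biserial correlation, in the convention that matches the reported value
rbc = 1 - 2 * u_val / (n_wt * n_ko)

print(cles, rbc)  # 0.78125 -0.5625
```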

### Same approach for Crosses
Normality: Passed

Homoscedasticity: Passed

Test: Two-Way ANOVA

In [10]:
pg.anova(data = of_raw, dv='Crosses', between = ['Sex', 'Genotype'])

           Source          SS    DF          MS         F     p-unc       np2
0             Sex    5.954724   1.0    5.954724  0.077077  0.783338  0.002745
1        Genotype  560.041734   1.0  560.041734  7.249086   0.01184  0.205653
2  Sex * Genotype  145.040315   1.0  145.040315  1.877377  0.181519  0.062836
3        Residual  2163.19246  28.0   77.256874


Same conclusion: only Genotype is significant. Check normality per Genotype after collapsing Sex.

In [11]:
pg.normality(of_raw, dv='Crosses', group='Genotype')

           W      pval  normal
KO  0.952178  0.524968    True
WT  0.947595  0.452579    True


Normality passed.
Test: unpaired t-test

In [12]:
pg.ttest(x=of_raw['Crosses'][of_raw['Genotype'] == 'WT'], y=of_raw['Crosses'][of_raw['Genotype'] == 'KO'])

               T  dof       tail     p-val         CI95%   cohen-d   BF10     power
T-test  2.717191   30  two-sided  0.010826  [2.1, 14.78]  0.960672  4.769  0.748377


Again, a significant result (p = 0.011), with a statistical power of 0.75, slightly below the conventional 0.80 threshold.
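The reported Cohen's d can be cross-checked from the T statistic alone. This sketch assumes two independent groups of 16 animals each; the group sizes are not stated in the notebook, but they are consistent with dof = 30.

```python
import math

t_val = 2.717191     # T statistic reported by pg.ttest
n_wt, n_ko = 16, 16  # ASSUMPTION: equal group sizes, consistent with dof = 30

# For an independent-samples t-test with pooled SD: d = t * sqrt(1/n1 + 1/n2)
cohen_d = t_val * math.sqrt(1 / n_wt + 1 / n_ko)

print(round(cohen_d, 4))  # 0.9607
```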

#### Time in Center violated the homogeneity-of-variance assumption. Should proceed with Welch ANOVA, although it is not factorial (one-way only).

In [13]:
pg.normality(of_raw, dv='Time_in_Center', group='Genotype')

           W      pval  normal
KO   0.94545   0.42119    True
WT  0.917852  0.155765    True


In [14]:
pg.welch_anova(data=of_raw, dv='Time_in_Center', between='Group')

   Source  ddof1      ddof2         F     p-unc       np2
0   Group      3  14.353239  0.535941  0.665047  0.020896


#### Never mind! There's no significant difference in the time spent in the center (p = 0.67).

## Social Interaction

### Same approach

In [15]:
si_raw = pd.read_csv('/Users/labc02/Documents/PDCB_data/Behavior/Social Interaction/Social_Interaction_data.csv')

In [16]:
si_clean = si_raw[si_raw['Phase'] == 'Sample'].copy()  # explicit copy avoids SettingWithCopyWarning

In [17]:
si_clean.rename(columns={'Time Object/New Cons Chamber':'Time_Object', 'Time Conspecific Chamber': 'Time_Conspecific', 'Total Exploration': 'Total'}, inplace = True)

In [18]:
for var_ in ['Time_Object', 'Time_Conspecific', 'Total']:
    print(f'Normality test (Shapiro), {var_}')
    print(pg.normality(data=si_clean, dv=var_, group='Group'))

Normality test (Shapiro), Time_Object
            W      pval  normal
KOF  0.944423  0.516654    True
KOM  0.720208  0.000891   False
WTF  0.917846  0.301104    True
WTM  0.914444  0.275046    True
Normality test (Shapiro), Time_Conspecific
            W      pval  normal
KOF  0.928938  0.330071    True
KOM  0.924065  0.284459    True
WTF  0.948646  0.626702    True
WTM  0.861803  0.060766    True
Normality test (Shapiro), Total
            W      pval  normal
KOF  0.945061  0.525663    True
KOM  0.808731  0.008671   False
WTF  0.930491  0.415843    True
WTM  0.854802  0.049291   False


In [19]:
def detec_outlier(df, var_name, var_group):
    '''[DataFrame, list of str, str -> DataFrame]
    Outlier detection based on absolute deviation from the median (MAD-median rule).
    Returns a copy of the original DataFrame without the rows flagged as outliers
    in any of the variables listed in var_name.'''
    outliers_idx = set()
    for var_ in var_name:
        # Apply the rule within each group separately
        for _, values in df.groupby(var_group)[var_]:
            outliers = pg.madmedianrule(values)  # boolean mask, True = outlier
            outliers_idx.update(values.index[outliers])
    # drop() returns a new DataFrame, so the input is left untouched
    return df.drop(index=list(outliers_idx))

Removing outliers with the MAD-median rule

In [20]:
si_tidy = detec_outlier(si_clean, ['Time_Object', 'Time_Conspecific', 'Total'], 'Group')
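To make the rule concrete, here is a minimal pure-Python sketch of the MAD-median rule on toy data. `mad_median_outliers` is a hypothetical name, and the two constants (1.4826 to rescale the MAD, and a cut-off of about 2.2414, the square root of the 0.975 quantile of a chi-square with 1 df) are the usual textbook choices; they are assumed here to match what `pg.madmedianrule` uses.

```python
import statistics


def mad_median_outliers(values):
    """Flag values whose scaled distance from the median exceeds the cut-off."""
    med = statistics.median(values)
    # Median absolute deviation, rescaled to be comparable to a standard deviation
    mad = statistics.median(abs(v - med) for v in values) * 1.4826
    threshold = 2.2414  # ASSUMPTION: conventional cut-off, see the text above
    # Note: raises ZeroDivisionError if more than half of the values are identical
    return [abs(v - med) / mad > threshold for v in values]


sample = [10, 12, 11, 13, 12, 40]  # toy data, not taken from the experiment
flags = mad_median_outliers(sample)
print(flags)  # [False, False, False, False, False, True]
```

Because both the median and the MAD are robust, the extreme value 40 barely shifts the cut-off, which is why this rule is less prone to masking than a mean ± SD criterion.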

#### Re-check normality

In [21]:
for var_ in ['Time_Object', 'Time_Conspecific', 'Total']:
    print(f'Normality test (Shapiro), {var_}')
    print(pg.normality(data=si_tidy, dv=var_, group='Group'))

Normality test (Shapiro), Time_Object
            W      pval  normal
KOF  0.935501  0.442069    True
KOM  0.923020  0.382836    True
WTF  0.917846  0.301104    True
WTM  0.970216  0.896569    True
Normality test (Shapiro), Time_Conspecific
            W      pval  normal
KOF  0.931436  0.395579    True
KOM  0.916366  0.327636    True
WTF  0.948646  0.626702    True
WTM  0.930693  0.487973    True
Normality test (Shapiro), Total
            W      pval  normal
KOF  0.945132  0.567252    True
KOM  0.960652  0.793276    True
WTF  0.930491  0.415843    True
WTM  0.938467  0.565858    True


#### Check homoscedasticity

In [22]:
for var_ in ['Time_Object', 'Time_Conspecific', 'Total']:
    print(f'Homoscedasticity test (Levene), {var_}')
    print(pg.homoscedasticity(data=si_tidy, dv=var_, group='Group'))

Homoscedasticity test (Levene), Time_Object
               W      pval  equal_var
levene  0.992779  0.406591       True
Homoscedasticity test (Levene), Time_Conspecific
               W      pval  equal_var
levene  0.611101  0.611966       True
Homoscedasticity test (Levene), Total
               W      pval  equal_var
levene  4.142158  0.012347      False


In [23]:
samp_time = pd.melt(si_tidy, id_vars=['Subject', 'Group', 'Sex', 'Genotype'], value_vars=['Time_Object', 'Time_Conspecific'], var_name='Side', value_name='Time')

In [24]:
pg.anova(data=samp_time, dv='Time', between=['Sex', 'Genotype', 'Side'])

                  Source            SS    DF            MS           F         p-unc       np2
0                    Sex    151.291092   1.0    151.291092    0.498434     0.4823478  0.006516
1               Genotype     309.31747   1.0     309.31747    1.019057      0.315946  0.013231
2                   Side  68571.428571   1.0  68571.428571  225.910915  1.806106e-24   0.74827
3         Sex * Genotype    115.598555   1.0    115.598555    0.380843     0.5389967  0.004986
4             Sex * Side   5320.438465   1.0   5320.438465   17.528366  7.538056e-05  0.187412
5        Genotype * Side   5495.574246   1.0   5495.574246   18.105357  5.897705e-05  0.192395
6  Sex * Genotype * Side   3896.114818   1.0   3896.114818   12.835883   0.000597548   0.14449
7               Residual  23068.511616  76.0    303.533048


### Getting complicated

Neither Sex nor Genotype is significant on its own. But the Sex * Side interaction IS significant, Genotype * Side IS ALSO significant, and so is the three-way Sex * Genotype * Side interaction (p ≈ 0.0006).

There's sexual dimorphism in social interaction.

In [25]:
pg.anova(data=si_tidy, dv='Total', between=['Sex', 'Genotype'])

           Source            SS    DF          MS         F     p-unc       np2
0             Sex    302.582184   1.0  302.582184  0.995574  0.324695   0.02553
1        Genotype     618.63494   1.0   618.63494   2.03547  0.161832  0.050842
2  Sex * Genotype    231.197109   1.0  231.197109  0.760699  0.388588  0.019626
3        Residual  11549.238889  38.0  303.927339


#### There's no significant difference in the total time spent exploring both chambers.