# Example

In [1]:
import numpy as np
import pandas as pd

# don't forget the most important one :)
import otpsy as ot

## Data simulation
Consider a study investigating the impact of art exposition on the visual perception of angry facial expressions in 60 participants. The variables could include the **duration (ms)** of exploration of the painting scene, **behavioral performance** (accuracy and RT) in discriminating between angry and happy faces, and **scores** related to depression. Before running the analysis, we want to control for several factors:

* Does the participant look at the painting scene during the art exposition?
* Is the participant performing the task properly? (fatigue, lack of motivation, ...)
* Does the participant have an excessively high depression score (>12)? If so, they are excluded.

In [2]:
# Make the results reproducible
rng = np.random.default_rng(seed=22404)

In [3]:
# art exposition
art_looking_time = rng.normal(loc=2000, scale=400, size=60)

# discrimination task
discrimination_performance = rng.normal(loc=0.9, scale=0.05, size=60)
discrimination_time = rng.normal(loc=400, scale=100, size=60)

# questionnaire
depression_score = rng.normal(loc=2, scale=2, size=60)
gender = ["M" if i % 2 == 0 else "W" for i in range(1, 61)]
age = rng.normal(loc=30, scale=4, size=60)
random_col = rng.normal(loc=20, scale=2, size=60)
index_participant = [f"P{i}" for i in range(1, 61)]

In [4]:
# Introduce some aberrant values in the data
art_looking_time[9:11] = 200 # participants P10 and P11 (indices 9 and 10) didn't look at the painting scene (only 200 ms of exploration time)
discrimination_performance[36] = 0.51 # participant P37 (index 36): discrimination score is near chance level
discrimination_time[36] = 95 # participant P37 (index 36): mean response time is far too short for a human response
depression_score[4] = 21 # participant P5 (index 4) has a high depression score (above 12)

In [5]:
df = pd.DataFrame({
    "index_participant": index_participant,
    "gender": gender,
    "age": age,
    "random_col": random_col,
    "art_looking_time": art_looking_time,
    "discrimination_performance": discrimination_performance,
    "discrimination_time": discrimination_time,
    "depression_score": depression_score,
})

## Outlier detection

The first step is to define a sample object that specifies which columns a given method should be applied to. You need one sample for each method you plan to apply. For *art looking time*, *discrimination performance*, and *discrimination time*, we can use robust methods for continuous data, such as IQR or MAD. We therefore create a sample object, visualise the columns to test, and then apply the method.

In [14]:
sample = ot.Sample(df=df,
                   columns_to_test=["art_looking_time", 
                                   "discrimination_performance", 
                                   "discrimination_time"],
                   participant_column="index_participant")

In [None]:
# Visualise the data
sample.visualise()

In [16]:
outliers = sample.method_MAD(distance = 2.5)
print(outliers)

---------------------------------
Summary of the outliers detection
---------------------------------

Method used : Median Absolute Distance
Distance used : 2.5
Column tested : art_looking_time, discrimination_performance, discrimination_time
Total number of outliers : 3
Total number of flagged values : 4
------------------------------

The column art_looking_time has 2 outliers : P10, P11
Low threshold : 750.87 / High threshold : 2957.41

The column discrimination_performance has 1 outlier : P37
Low threshold : 0.74 / High threshold : 1.04

The column discrimination_time has 1 outlier : P37
Low threshold : 149.12 / High threshold : 637.71
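
To make the thresholds in the summary above less of a black box, here is a minimal NumPy sketch of a MAD rule. It assumes bounds of the form median ± distance × 1.4826 × MAD, where MAD is the median absolute deviation (labelled "Median Absolute Distance" in the output). The function name `mad_thresholds`, the consistency constant 1.4826, and the toy values are assumptions made for illustration; this is not otpsy's implementation, and its exact formula may differ.

```python
import numpy as np


def mad_thresholds(values, distance=2.5, scale=1.4826):
    """Return (low, high) bounds as median -/+ distance * scale * MAD.

    Hypothetical helper for illustration only; not part of otpsy.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Median of the absolute deviations from the median
    mad = np.median(np.abs(values - median))
    spread = distance * scale * mad
    return median - spread, median + spread


# Toy looking times (ms), not the simulated sample from this tutorial
toy = np.array([1900.0, 2100.0, 2000.0, 1950.0, 2050.0, 200.0])
low, high = mad_thresholds(toy, distance=2.5)

# Values outside [low, high] are flagged
flagged = toy[(toy < low) | (toy > high)]
print(f"low={low:.2f}, high={high:.2f}, flagged={flagged}")
```

Because both the centre and the spread are medians, a single extreme value such as 200 ms barely shifts the bounds, which is why this kind of rule is considered robust.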


In [17]:
# To obtain more details about the different values
print(outliers.inspect())

                  art_looking_time discrimination_performance  \
index_participant                                               
P10                          200.0                      False   
P11                          200.0                      False   
P37                          False                       0.51   

                  discrimination_time  
index_participant                      
P10                             False  
P11                             False  
P37                              95.0  


As we can see, the outliers we introduced are detected by the median absolute distance method.
Interestingly, P37 has very low performance together with a very short reaction time, which suggests that this participant did not perform the task properly.   
We could remove them now, but we also want to account for excessively high depression scores. We therefore create a second outliers object and concatenate it with the first.

In [18]:
outliers_depression = ot.Sample(
    df,
    "depression_score",
    "index_participant"
).method_cutoff(
    high_threshold=12,
    threshold_included=False)
print(outliers_depression)

---------------------------------
Summary of the outliers detection
---------------------------------

Method used : Cut-Off
Distance used : [-3.4410861785291784, 12.0]
Column tested : depression_score
Total number of outliers : 1
Total number of flagged values : 1
------------------------------

The column depression_score has 1 outlier : P5
Low threshold : -3.44 / High threshold : 12.0
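
A cut-off rule boils down to comparing each score with a fixed bound. The sketch below uses plain pandas rather than otpsy to contrast a strict and an inclusive comparison. It assumes that `threshold_included=False` means a score exactly equal to 12 is not flagged, which matches the ">12" criterion stated earlier; the toy scores and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical depression scores for three participants
scores = pd.Series([2.0, 12.0, 21.0], index=["A", "B", "C"])
high_threshold = 12

# Strict comparison: a score equal to the threshold is kept
strict = scores.index[scores > high_threshold].tolist()

# Inclusive comparison: a score equal to the threshold is flagged too
inclusive = scores.index[scores >= high_threshold].tolist()

print("strict:", strict)
print("inclusive:", inclusive)
```

The two variants only disagree for scores that sit exactly on the threshold, so the choice matters mainly for integer-valued questionnaires.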


In [19]:
# Concatenate both objects
final_outliers_object = ot.concat([outliers, outliers_depression])
print(final_outliers_object)

---------------------------------
Summary of the outliers detection
---------------------------------

Method used  : Median Absolute Distance, Cut-Off
Distance used : 2.5 (mad), (-3.4410861785291784, 12.0) (cut-off)
Column tested : discrimination_performance (mad), depression_score (cut-off), discrimination_time (mad), art_looking_time (mad)
Total number of outliers : 4
Total number of flagged values : 5
------------------------------

The column discrimination_performance has 1 outlier : P37
MAD: low: 0.74 / high: 1.04 

The column depression_score has 1 outlier : P5
CUT-OFF: low: -3.44 / high: 12.0 

The column discrimination_time has 1 outlier : P37
MAD: low: 149.12 / high: 637.71 

The column art_looking_time has 2 outliers : P10, P11
MAD: low: 750.87 / high: 2957.41 
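
The summary distinguishes outliers (participants) from flagged values (participant and column pairs): P37 is flagged in two columns, so 4 participants account for 5 flagged values. The counting can be reproduced in plain Python from the lists printed above. The dictionary below is copied by hand for illustration and is not an otpsy data structure.

```python
# Participants flagged per column, copied from the printed summary
flags = {
    "art_looking_time": ["P10", "P11"],
    "discrimination_performance": ["P37"],
    "discrimination_time": ["P37"],
    "depression_score": ["P5"],
}

# One flagged value per (participant, column) pair
n_flagged_values = sum(len(participants) for participants in flags.values())

# One outlier per distinct participant, sorted by participant number
unique_participants = sorted(
    set().union(*flags.values()),
    key=lambda label: int(label[1:]),
)

print(n_flagged_values, len(unique_participants), unique_participants)
```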


Finally, one participant (P44) reported having understood your hypothesis and having acted so as to confirm it. You decide to exclude this participant, which you can do simply by adding them to the outliers object.

In [20]:
final_outliers_object.add("P44")

If you decide that a flagged participant is not really an outlier, you can remove that participant (or several) from the outliers object with the `.remove()` method.

In [21]:
final_outliers_object.remove("P1")
# obj.remove(["P1", "P2"]) if you want to remove more than one outlier
# obj.remove({"Col1": "P1"}) if you want to remove an outlier on a specific column.

Finally, you can obtain your dataframe without outliers.

In [13]:
df_cleaned = final_outliers_object.manage("delete")
# "na" if you want to replace aberrant values with missing values
# "winsorise" if you want to replace aberrant values with the threshold.
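
To illustrate what the three strategies named above mean, here is a small sketch in plain pandas, without otpsy. The toy data, the thresholds, and the variable names are hypothetical, and how `manage` behaves internally is an assumption based only on the comments above.

```python
import numpy as np
import pandas as pd

# Hypothetical response times (ms) for four participants
toy = pd.DataFrame(
    {"participant": ["A", "B", "C", "D"], "rt": [410.0, 395.0, 95.0, 430.0]}
).set_index("participant")

# Hypothetical bounds; in practice they would come from the detection step
low, high = 150.0, 640.0
is_outlier = (toy["rt"] < low) | (toy["rt"] > high)

# "delete": drop the rows of flagged participants
deleted = toy[~is_outlier]

# "na": keep the rows but replace flagged values with missing values
as_na = toy.assign(rt=toy["rt"].mask(is_outlier))

# "winsorise": pull flagged values back to the nearest threshold
winsorised = toy.assign(rt=toy["rt"].clip(lower=low, upper=high))

print(deleted, as_na, winsorised, sep="\n\n")
```

Deleting reduces the sample size, replacing with missing values keeps the participant's other columns usable, and winsorising keeps every observation at the cost of compressing the tails.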