# Example

In [1]:
import numpy as np
import pandas as pd

# don't forget the most important one :)
import otpsy as ot

## Data simulation
Consider a research study investigating the impact of art exposition on the visual perception of angry facial expressions in 60 participants. The variables could include the **duration (ms)** of exploration of the painting scene, **behavioral performance** (accuracy and RT) in discriminating between angry and happy faces, and depression **scores**. We then want to control for several factors in the analysis:

* Does the participant look at the painting scene during the art exposition?
* Is the participant performing the task properly? (fatigue, lack of motivation, ...)
* Does the participant have an excessively high depression score (>12)? If so, they are excluded.

In [2]:
# Make the result reproducible
rng = np.random.default_rng(seed=22404)
NB_PART = 60 # number of participants

In [3]:
# art exposition
art_looking_time = rng.normal(loc=2000, scale=400, size=NB_PART)

# discrimination task
discrimination_performance = rng.normal(loc=0.9, scale=0.05, size=NB_PART)
discrimination_time = rng.normal(loc=400, scale=100,size=NB_PART)

# questionnaire
depression_score = rng.normal(loc=2, scale = 2, size=NB_PART)
gender = ["M" if i % 2 == 0 else "W" for i in range(1, NB_PART + 1)]
age = rng.normal(loc=30, scale=4, size=NB_PART)
random_col = rng.normal(loc=20, scale=2, size=NB_PART)
index_participant = [f"P{i}" for i in range(1, NB_PART + 1)]
# endpoint=True makes `high` inclusive, so answers cover the full 1-7 range
likert1 = rng.integers(low=1, high=7, size=NB_PART, endpoint=True)
likert2 = rng.integers(low=1, high=7, size=NB_PART, endpoint=True)
likert3 = rng.integers(low=1, high=7, size=NB_PART, endpoint=True)
likert4 = rng.integers(low=1, high=7, size=NB_PART, endpoint=True)

In [4]:
# Introduce some aberrations in the data (array position i corresponds to participant P{i+1})
art_looking_time[9:11] = 200 # participants P10 and P11 didn't look at the painting scene (only 200 ms of exploration time)
discrimination_performance[36] = 0.51 # participant P37's discrimination score is near chance level
discrimination_time[36] = 95 # participant P37's mean response time is far too short for human ability
depression_score[4] = 21 # participant P5 has a high depression score (above 12)
likert1[3] = likert2[3] = likert3[3] = likert4[3] = 1 # P4 gives the same answer to all 4 Likert items (despite inverted items)
likert1[5] = likert2[5] = likert3[5] = likert4[5] = 7 # P6 gives the same answer to all 4 Likert items (despite inverted items)
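
Because NumPy arrays are zero-indexed while the participant labels start at `P1`, the value at array position `i` belongs to participant `P{i+1}`. The short check below is only a sketch that rebuilds the labels independently of the cell above (the name `labels` is introduced here for illustration) to make the mapping explicit.

```python
# Rebuild the participant labels used in the simulation (P1 ... P60)
labels = [f"P{i}" for i in range(1, 61)]

# Array position 36 belongs to the 37th participant
print(labels[36])    # P37

# The slice 9:11 covers positions 9 and 10
print(labels[9:11])  # ['P10', 'P11']
```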

In [5]:
df = pd.DataFrame({
    "index_participant":index_participant,
    "gender": gender,
    "age": age,
    "random_col": random_col,
    'art_looking_time':art_looking_time,
    'discrimination_performance': discrimination_performance,
    'discrimination_time':discrimination_time,
    'depression_score': depression_score,
    'likert1': likert1,
    'likert2': likert2,
    'likert3': likert3,
    'likert4': likert4,
})

In [17]:
df.to_csv("./tests/data.csv", sep = ";")

## Outlier detection

The first step is to define a sample object that specifies which columns a given method should be applied to. You need one sample for each method you plan to apply. For *art looking time*, *discrimination performance*, and *discrimination time*, we can use robust methods for continuous data, such as IQR or MAD. We therefore create a sample object, visualise the columns to test, and then apply the method.

In [7]:
sample = ot.Sample(df=df,
                   columns_to_test=["art_looking_time", 
                                   "discrimination_performance", 
                                   "discrimination_time"],
                   participant_column="index_participant")

In [None]:
# Visualise the data
sample.visualise()

In [None]:
outliers = sample.method_MAD(distance = 2.5)
print(outliers)
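
For intuition, a MAD-based rule can be sketched in plain pandas as follows. This is not otpsy's implementation: the helper `mad_outliers`, the toy data, and the use of the 1.4826 scaling constant are assumptions made for illustration, and the library may compute its thresholds differently.

```python
import pandas as pd


def mad_outliers(values: pd.Series, distance: float = 2.5) -> pd.Series:
    """Return the entries lying more than `distance` scaled MADs from the median."""
    median = values.median()
    # 1.4826 makes the MAD comparable to a standard deviation for normal data
    mad = 1.4826 * (values - median).abs().median()
    low, high = median - distance * mad, median + distance * mad
    return values[(values < low) | (values > high)]


# Hypothetical toy looking times (ms), not the simulated dataset
looking_time = pd.Series(
    [1900.0, 2100.0, 2000.0, 2050.0, 1950.0, 200.0],
    index=["P1", "P2", "P3", "P4", "P5", "P6"],
)

flagged = mad_outliers(looking_time)
print(flagged.index.tolist())  # ['P6']
```

Because both the median and the MAD are barely affected by the extreme value itself, the 200 ms entry does not inflate its own threshold, which is the main reason to prefer this rule over a mean and standard deviation criterion.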

In [None]:
# To obtain more details about the different values
print(outliers.inspect())

As we can see, the outliers we introduced are detected by the median absolute deviation (MAD) method.
Interestingly, P37 has very low performance combined with a very short reaction time, which suggests that this participant did not perform the task properly.   
We could remove them now, but we also want to account for excessively high depression scores. We therefore create another outliers object and concatenate the two.

In [None]:
outliers_depression = ot.Sample(
    df,
    "depression_score",
    "index_participant"
).method_cutoff(
    high_threshold=12,
    threshold_included=False)
print(outliers_depression)
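
The difference between a strict and an inclusive cutoff can be sketched in plain pandas. The example assumes that `threshold_included=False` corresponds to the strict comparison, which matches the >12 criterion stated earlier but is not confirmed here; the toy scores and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy scores, not the simulated dataset
depression = pd.Series([2.0, 12.0, 21.0], index=["P1", "P2", "P3"])
HIGH_THRESHOLD = 12

# Strict comparison: a score exactly at the threshold is kept
strict = depression[depression > HIGH_THRESHOLD]

# Inclusive comparison: a score exactly at the threshold is flagged as well
inclusive = depression[depression >= HIGH_THRESHOLD]

print(strict.index.tolist())     # ['P3']
print(inclusive.index.tolist())  # ['P2', 'P3']
```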

In [None]:
# Concatenate both objects
final_outliers_object = ot.concat([outliers, outliers_depression])
print(final_outliers_object)

In addition, one participant (P44) reported that he understood your hypothesis and acted so as to confirm it. You decide to exclude him. You can simply add him to the outliers object.

In [13]:
final_outliers_object.add("P44")

If you decide that a flagged participant is not really an outlier, you can remove them from the object with the `.remove()` method.

In [14]:
final_outliers_object.remove("P1")
# obj.remove(["P1", "P2"]) if you want to remove more than one outlier
# obj.remove({"Col1": "P1"}) if you want to remove an outlier on a specific column.

Finally, you can obtain your dataframe without outliers.

In [15]:
df_cleaned = final_outliers_object.manage("delete")
# "na" if you want to replace aberrant values with missing values
# "winsorise" if you want to replace aberrant values with the threshold.
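
The three strategies can be illustrated with plain pandas on a toy series. This is only a sketch of the general idea (drop, replace with missing values, or cap at the threshold); it does not reproduce how `manage` works internally, and the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy scores with one aberrant value
scores = pd.Series([1.0, 2.0, 3.0, 21.0], index=["P1", "P2", "P3", "P4"])
HIGH_THRESHOLD = 12
is_outlier = scores > HIGH_THRESHOLD

# "delete": drop the flagged participants
deleted = scores[~is_outlier]

# "na": keep the participants but replace the aberrant values with NaN
as_missing = scores.mask(is_outlier)

# "winsorise": cap the aberrant values at the threshold
winsorised = scores.clip(upper=HIGH_THRESHOLD)

print(deleted.index.tolist())  # ['P1', 'P2', 'P3']
print(winsorised["P4"])        # 12.0
```

Deleting shrinks the sample, whereas the other two strategies keep every participant, which matters when the remaining columns of a flagged participant are still usable.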

In [None]:
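# Flag participants who gave the same answer to every Likert item
sample = ot.Sample(df,
                   columns_to_test=[f"likert{i}" for i in range(1, 5)],
                   participant_column="index_participant")
outliers = sample.method_identical()
# Replace the flagged answers with missing values
df3 = outliers.manage(method="na")
df3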
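
The idea behind detecting identical answers can be sketched in plain pandas: a participant is flagged when every item in a row holds the same value. This is an illustration on hypothetical toy data, not otpsy's implementation, which may apply additional rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers, not the simulated dataset
likert = pd.DataFrame(
    {
        "likert1": [1, 3, 7],
        "likert2": [1, 5, 7],
        "likert3": [1, 2, 7],
        "likert4": [1, 6, 7],
    },
    index=["P4", "P5", "P6"],
)

# A row with a single distinct value means the same answer on every item
is_identical = likert.nunique(axis=1) == 1
flagged = likert.index[is_identical].tolist()
print(flagged)  # ['P4', 'P6']

# Replace the flagged rows with missing values, leaving the others untouched
cleaned = likert.astype(float)
cleaned.loc[is_identical] = np.nan
```

With inverted items in the questionnaire, a genuinely attentive participant should not give the same answer everywhere, which is why such rows are treated as suspicious.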