# AsaPy

### Asa Analysis

#### Analysis.detect_outliers

    """
    Detect outliers in a Pandas DataFrame using either Inter-Quartile Range (IQR) or Z-Score method.

    Args:
        df (pd.DataFrame): The input DataFrame containing numerical data.
        method (str): The method used for outlier detection, options are 'IQR' or 'zscore'. Default is 'IQR'.
        thr (float): The threshold value for Z-Score method. Default is 3.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: 
            - The first DataFrame contains the index, column name, and values of the outliers.
            - The second DataFrame contains the outlier thresholds for each column.

    Raises:
        ValueError: If the method is not 'IQR' or 'zscore'.
    """

#### Analysis.remove_outliers
    """
    Remove outliers from a Pandas DataFrame using the Interquartile Range (IQR) method.
    
    Args:
        df (pd.DataFrame): DataFrame containing the data.
        verbose (bool, optional): If True, print the number of rows removed. Defaults to False.
    
    Returns:
        Tuple[pd.DataFrame, List[int]]: DataFrame with the outliers removed, 
                                        List of indexes of the rows that were removed (unique indices).
    """

### Detecting and Removing Outliers

The code in this section applies two different techniques to detect and list outliers in a dataset.

**IQR Method (Interquartile Range)**

The first part of the code uses the interquartile range (IQR) to identify outliers. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of the data. Any data point outside the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is considered an outlier.
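
The interval above can be computed directly with pandas. The following is a minimal sketch and not AsaPy's implementation: the helper name `iqr_bounds`, its parameter `k`, and the toy data are hypothetical, and it assumes pandas' default linear interpolation for quantiles.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR (hypothetical helper)."""
    q1 = series.quantile(0.25)  # first quartile
    q3 = series.quantile(0.75)  # third quartile
    iqr = q3 - q1               # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Toy data: five ordinary values and one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
lower, upper = iqr_bounds(values)

# Keep only the points that fall outside [lower, upper].
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [100.0]
```

For this toy series Q1 = 2.25 and Q3 = 4.75, so the interval is [-1.5, 8.5] and only the value 100 falls outside it.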

**Z-Score Method (zscore)**

The second part of the code uses the Z-score to identify outliers. The Z-score of an observation is the number of standard deviations it lies from the mean. An observation whose absolute Z-score exceeds a chosen threshold (such as 2; the `thr` argument of `detect_outliers` defaults to 3) is generally considered an outlier.
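
The Z-score rule can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not AsaPy's code: the helper `zscore_outliers` and the toy data are hypothetical, and the sketch uses the population standard deviation (`ddof=0`), which may differ from the convention the library uses.

```python
import pandas as pd


def zscore_outliers(series: pd.Series, thr: float = 3.0) -> pd.Series:
    """Return the values whose absolute Z-score exceeds thr (hypothetical helper)."""
    # Assumption: population standard deviation (ddof=0).
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > thr]


# Toy data: nine values near 10 and one far from the rest.
values = pd.Series([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 30.0])

flagged = zscore_outliers(values, thr=2.0)
print(flagged.tolist())  # [30.0]

# With a stricter threshold of 3, the same point is no longer flagged.
print(zscore_outliers(values, thr=3.0).empty)  # True
```

The value 30 has a Z-score of about 2.95 here, so whether it counts as an outlier depends entirely on the chosen threshold.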

Both methods help characterize the distribution of the data and identify points that differ significantly from the rest of the dataset. Identifying and handling outliers can be vital for building robust statistical models, because outliers can have a disproportionate effect on parameter estimates.

In [1]:
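import asapy
from sklearn.datasets import load_diabetes

# Load the diabetes dataset as a DataFrame of features (X) and a target Series (y)
X, y = load_diabetes(as_frame=True, return_X_y=True)

Analysis = asapy.Analysis

# Detect outliers using the IQR method
outliers_iqr, thresholds_iqr = Analysis.detect_outliers(X, method='IQR')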


In [2]:
# Inspect the outliers found with the IQR method
outliers_iqr.head(5)

| | index | column | outlier_value |
|---|---|---|---|
| 0 | 256 | bmi | 0.160855 |
| 1 | 366 | bmi | 0.137143 |
| 2 | 367 | bmi | 0.170555 |
| 3 | 123 | s1 | 0.152538 |
| 4 | 161 | s1 | 0.133274 |


In [3]:
# Detect outliers using the Z-score method
outliers_zscore, thresholds_zscore = Analysis.detect_outliers(X, method='zscore')

# Inspect the outliers found with the Z-score method
outliers_zscore.head(5)

| | index | column | outlier_value |
|---|---|---|---|
| 0 | 256 | bmi | 0.160855 |
| 1 | 367 | bmi | 0.170555 |
| 2 | 123 | s1 | 0.152538 |
| 3 | 230 | s1 | 0.153914 |
| 4 | 123 | s2 | 0.198788 |


In [4]:
# Remove outliers from a pandas DataFrame in a single call using the Interquartile Range (IQR) method.
data_update, drop_lines = Analysis.remove_outliers(X)
data_update

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 404 | -0.056370 | -0.044642 | -0.074108 | -0.050428 | -0.024960 | -0.047034 | 0.092820 | -0.076395 | -0.061177 | -0.046641 |
| 405 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 |
| 406 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018118 | 0.044485 |
| 407 | 0.041708 | 0.050680 | -0.015906 | 0.017282 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046879 | 0.015491 |
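
To make the row-removal step more concrete, here is a rough sketch of how IQR-based removal could be written in plain pandas. It is not the code behind `Analysis.remove_outliers`: the function `drop_iqr_outlier_rows`, its parameter `k`, and the toy DataFrame are hypothetical, and both the rule "drop a row if any column is outside its limits" and the index reset are assumptions of this sketch.

```python
from typing import List, Tuple

import pandas as pd


def drop_iqr_outlier_rows(df: pd.DataFrame, k: float = 1.5) -> Tuple[pd.DataFrame, List[int]]:
    """Drop rows with at least one value outside its column's IQR limits (hypothetical helper)."""
    q1 = df.quantile(0.25)  # per-column first quartile
    q3 = df.quantile(0.75)  # per-column third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # Boolean mask: True where a cell lies outside its column's limits.
    outside = (df < lower) | (df > upper)

    # Each offending row is listed once, even if several columns flag it.
    dropped = df.index[outside.any(axis=1)].tolist()

    # Assumption: renumber the remaining rows from 0.
    cleaned = df.drop(index=dropped).reset_index(drop=True)
    return cleaned, dropped


toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
        "b": [10.0, -50.0, 11.0, 12.0, 13.0, 14.0],
    }
)
cleaned, dropped = drop_iqr_outlier_rows(toy)
print(dropped)       # [1, 5]
print(len(cleaned))  # 4
```

Row 5 is dropped because of column `a` and row 1 because of column `b`, which mirrors the shape of the return value documented for `remove_outliers`: a cleaned DataFrame plus a list of unique removed indices.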
