In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load Dataset

In [46]:
df = pd.read_csv("Data/group_22.csv")

# snapshot of the numeric columns
num_df   = df.select_dtypes(include=[np.number]).copy()
num_cols = num_df.columns

# DataFrame that collects the dispersion measures
disp = pd.DataFrame(index=num_cols)


## 1. Descriptive Statistics

**Objective.** Summarize the dataset along three fronts:


### 1.1 Central tendency
*mean*, *median*, *mode*
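
As a minimal sketch of how the three measures differ, the toy series below (illustrative values, not taken from `group_22.csv`) computes each one with pandas. It also shows that `mode()` returns every tied value in ascending order, so `.iloc[0]` keeps only the smallest of them, and that the mean of a 0/1 indicator column is simply the share of ones.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset)
toy = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

toy_mean = toy.mean()      # arithmetic average: 40 / 8 = 5.0
toy_median = toy.median()  # average of the two middle values: (4 + 5) / 2 = 4.5
toy_modes = toy.mode()     # all most-frequent values, sorted ascending

# With a tie, mode() returns several values; .iloc[0] keeps the smallest one
tied = pd.Series([1, 1, 3, 3, 8])
tied_modes = tied.mode().tolist()
first_mode = tied.mode().iloc[0]

# For a 0/1 indicator column, the mean is the proportion of ones
flags = pd.Series([0, 0, 1, 0])
share_of_ones = flags.mean()

print(toy_mean, toy_median, toy_modes.tolist())
print(tied_modes, first_mode, share_of_ones)
```

When the mean sits well above the median and mode, as in the toy data, the distribution is pulled to the right by a few large values.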

In [47]:
# mean, median, and mode per column
central_tendency = pd.DataFrame({
    "mean": df[num_cols].mean(),
    "median": df[num_cols].median(),
    "mode": df[num_cols].mode(dropna=True).iloc[0]  # first (smallest) mode found per column, ignoring null values
})
central_tendency



| | mean | median | mode |
|---|---|---|---|
| duration_1 | 0.06733333 | 0.0 | 0.0 |
| duration_2 | 0.1713333 | 0.0 | 0.0 |
| duration_3 | 0.319 | 0.0 | 0.0 |
| duration_4 | 0.4263333 | 0.0 | 0.0 |
| duration_5 | 0.016 | 0.0 | 0.0 |
| loudness_level | 1.714 | 2.0 | 2.0 |
| popularity_level | 2.043 | 2.0 | 3.0 |
| tempo_class | 1.018 | 1.0 | 1.0 |
| time_signature | 0.07697062 | 0.2218242 | 0.221824 |
| key_mode | -0.01977656 | -0.07678645 | 0.485996 |


### 1.2 Dispersion

#### 1.2.1 MAD function (Median Absolute Deviation)
A robust measure of dispersion, useful because extreme values have little influence on it (`median(|x − median(x)|)`).  
A good alternative to the standard deviation when the distribution is skewed or contains outliers.
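
The robustness claim can be checked on toy data. The sketch below uses a hypothetical helper, `mad_sketch` (a NumPy-only restatement of the formula above, named differently so it does not clash with the notebook's own function), and invented values that are not from the dataset. It also illustrates a side effect worth keeping in mind: the MAD is 0 whenever more than half of the values equal the median, which is common for 0/1 columns.

```python
import numpy as np

def mad_sketch(values):
    """Median absolute deviation: median(|x - median(x)|). Illustrative helper."""
    x = np.asarray(values, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

# Toy data for illustration only
clean = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = [2, 4, 4, 4, 5, 5, 7, 900]  # largest value replaced by an extreme one

mad_clean = mad_sketch(clean)
mad_outlier = mad_sketch(with_outlier)

# Sample standard deviation (ddof=1) for comparison
std_clean = float(np.std(clean, ddof=1))
std_outlier = float(np.std(with_outlier, ddof=1))

# More than half of the values equal the median (0), so the MAD collapses to 0
mostly_zero = [0, 0, 0, 0, 1, 1]
mad_mostly_zero = mad_sketch(mostly_zero)

print(mad_clean, mad_outlier)
print(std_clean, std_outlier)
print(mad_mostly_zero)
```

The MAD is the same for both lists, while the standard deviation grows by orders of magnitude once the outlier is introduced.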

In [48]:
# MAD function (Median Absolute Deviation)
def mad(series: pd.Series) -> float:
    x = series.dropna().to_numpy()
    if x.size == 0:
        return np.nan
    med = np.median(x)
    return float(np.median(np.abs(x - med)))

In [50]:
disp["MAD"] = num_df.apply(mad)
disp

| | MAD |
|---|---|
| duration_1 | 0.0 |
| duration_2 | 0.0 |
| duration_3 | 0.0 |
| duration_4 | 0.0 |
| duration_5 | 0.0 |
| loudness_level | 1.0 |
| popularity_level | 1.0 |
| tempo_class | 0.0 |
| time_signature | 0.0 |
| key_mode | 0.844174 |


#### 1.2.2 Min and Max

Minimum and maximum values observed in each column. Useful for seeing the limits of the data, but sensitive to outliers.


In [51]:
disp["Min"]=df[num_cols].min()
disp["Max"]=df[num_cols].max()
disp

| | MAD | Min | Max |
|---|---|---|---|
| duration_1 | 0.0 | 0.0 | 1.0 |
| duration_2 | 0.0 | 0.0 | 1.0 |
| duration_3 | 0.0 | 0.0 | 1.0 |
| duration_4 | 0.0 | 0.0 | 1.0 |
| duration_5 | 0.0 | 0.0 | 1.0 |
| loudness_level | 1.0 | 0.0 | 4.0 |
| popularity_level | 1.0 | 0.0 | 4.0 |
| tempo_class | 0.0 | 0.0 | 3.0 |
| time_signature | 0.0 | -6.712656 | 2.533318 |
| key_mode | 0.844174 | -1.511882 | 1.611562 |


#### 1.2.3 Range

Total spread of the data, from the smallest to the largest value. Very simple and very sensitive to outliers. `Range = max - min`
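
A short sketch of that sensitivity, using invented values rather than the dataset: a single extreme observation is enough to change the range completely.

```python
import numpy as np

# Toy data for illustration only
clean = np.array([2, 4, 4, 4, 5, 5, 7, 9])
with_outlier = np.array([2, 4, 4, 4, 5, 5, 7, 900])

range_clean = int(clean.max() - clean.min())                  # 9 - 2
range_outlier = int(with_outlier.max() - with_outlier.min())  # 900 - 2

print(range_clean, range_outlier)
```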


In [52]:
disp["Range"]= disp["Max"] - disp["Min"]
disp

| | MAD | Min | Max | Range |
|---|---|---|---|---|
| duration_1 | 0.0 | 0.0 | 1.0 | 1.0 |
| duration_2 | 0.0 | 0.0 | 1.0 | 1.0 |
| duration_3 | 0.0 | 0.0 | 1.0 | 1.0 |
| duration_4 | 0.0 | 0.0 | 1.0 | 1.0 |
| duration_5 | 0.0 | 0.0 | 1.0 | 1.0 |
| loudness_level | 1.0 | 0.0 | 4.0 | 4.0 |
| popularity_level | 1.0 | 0.0 | 4.0 | 4.0 |
| tempo_class | 0.0 | 0.0 | 3.0 | 3.0 |
| time_signature | 0.0 | -6.712656 | 2.533318 | 9.245973 |
| key_mode | 0.844174 | -1.511882 | 1.611562 | 3.123444 |


#### 1.2.4 Variance

Average of the squared deviations from the mean; it measures dispersion in squared units. The sample version divides by \\(N-1\\) rather than \\(N\\), which is also the default of `DataFrame.var()` in pandas.

Example  
Data: [2, 4, 4, 4, 5, 5, 7, 9]

Mean \\(\bar{x}=5\\).  
Deviations: \\([-3,-1,-1,-1,0,0,2,4]\\).  
Squares: \\([9,1,1,1,0,0,4,16]\\). Sum \\(=32\\).

**Sample variance:** \\(\frac{32}{7}\approx 4.5714\\) (Bessel's correction, dividing by \\(N-1 = 7\\))
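
The worked example can be reproduced step by step. The sketch below recomputes it by hand and then compares against the library defaults; note that pandas uses `ddof=1` (sample variance) by default, whereas `np.var` uses `ddof=0` (population variance), so the two give different numbers on the same data.

```python
import numpy as np
import pandas as pd

data = [2, 4, 4, 4, 5, 5, 7, 9]
x = np.array(data, dtype=float)

# Manual computation, following the steps of the worked example
squared_devs = (x - x.mean()) ** 2
sum_sq = squared_devs.sum()          # 32.0
sample_var = sum_sq / (len(x) - 1)   # divide by N-1 (Bessel's correction)
population_var = sum_sq / len(x)     # divide by N

# Library defaults differ: pandas -> ddof=1, NumPy -> ddof=0
pandas_var = pd.Series(data).var()
numpy_var = np.var(x)

print(sample_var, pandas_var)      # both approximately 4.5714
print(population_var, numpy_var)   # both 4.0
```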

In [53]:
disp["Variance"]= df[num_cols].var()
disp

| | MAD | Min | Max | Range | Variance |
|---|---|---|---|---|---|
| duration_1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0628205 |
| duration_2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.1420256 |
| duration_3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2173114 |
| duration_4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2446548 |
| duration_5 | 0.0 | 0.0 | 1.0 | 1.0 | 0.01574925 |
| loudness_level | 1.0 | 0.0 | 4.0 | 4.0 | 1.816143 |
| popularity_level | 1.0 | 0.0 | 4.0 | 4.0 | 0.9534688 |
| tempo_class | 0.0 | 0.0 | 3.0 | 3.0 | 0.06102968 |
| time_signature | 0.0 | -6.712656 | 2.533318 | 9.245973 | 0.613258 |
| key_mode | 0.844174 | -1.511882 | 1.611562 | 3.123444 | 1.010088 |


#### 1.2.5 Standard Deviation

Measures the typical distance of the data from the mean.
As the square root of the variance, it is expressed in the same units as the variable.

Data: `[2, 4, 4, 4, 5, 5, 7, 9]`  
Mean \\(\bar{x}=5\\). The squared deviations sum to **32**.
- **Sample variance:** \\(32/7 \approx 4.5714\\)  
- **Sample standard deviation:** \\(\sqrt{4.5714}\approx 2.138\\)
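
A brief sketch confirming the relationship between the two measures, first on the toy data and then on one variance/standard-deviation pair copied from this notebook's own output for `duration_1`.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()               # pandas default ddof=1 (sample)
from_variance = np.sqrt(s.var())   # square root of the sample variance
population_std = s.std(ddof=0)     # divide by N instead of N-1

# Values reported for duration_1 in this notebook's output tables
reported_variance = 0.0628205
reported_std = 0.25064
gap = abs(np.sqrt(reported_variance) - reported_std)

print(round(sample_std, 3), round(from_variance, 3), population_std)
print(gap)
```

The gap in the last line is only rounding error, so the reported standard deviation is indeed the square root of the reported variance.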


In [55]:
disp["Std"]= df[num_cols].std() 
disp

| | MAD | Min | Max | Range | Variance | Std |
|---|---|---|---|---|---|---|
| duration_1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0628205 | 0.25064 |
| duration_2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.1420256 | 0.376863 |
| duration_3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2173114 | 0.466167 |
| duration_4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2446548 | 0.494626 |
| duration_5 | 0.0 | 0.0 | 1.0 | 1.0 | 0.01574925 | 0.125496 |
| loudness_level | 1.0 | 0.0 | 4.0 | 4.0 | 1.816143 | 1.347643 |
| popularity_level | 1.0 | 0.0 | 4.0 | 4.0 | 0.9534688 | 0.976457 |
| tempo_class | 0.0 | 0.0 | 3.0 | 3.0 | 0.06102968 | 0.247042 |
| time_signature | 0.0 | -6.712656 | 2.533318 | 9.245973 | 0.613258 | 0.783108 |
| key_mode | 0.844174 | -1.511882 | 1.611562 | 3.123444 | 1.010088 | 1.005031 |


#### 1.2.6 Quantiles and Quartiles

Quantiles divide the distribution into **fractions**.  
The quantile of order \\(p\\) (with \\(0 \le p \le 1\\)) is the value below which **p·100%** of the data lie.

- **Median** = quantile **0.5** (50th percentile).
- **Quartiles:**  
  - \\(Q_1\\) = quantile **0.25** (25th percentile)  
  - \\(Q_3\\) = quantile **0.75** (75th percentile)

The interval between \\(Q_1\\) and \\(Q_3\\) captures the "**core**" of the data (the central 50%).

The data are sorted in ascending order (from smallest to largest).  
At quantile 0.25 (25%), about one quarter of the values fall below this point.
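
As a small illustration on toy data (not the dataset), the sketch below computes the three quartiles with pandas and the width of the interval between \\(Q_1\\) and \\(Q_3\\), commonly called the interquartile range. It relies on the default linear interpolation of `Series.quantile`, so a quantile that falls between two observations is interpolated rather than snapped to one of them.

```python
import pandas as pd

# Toy data for illustration only, already in ascending order
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

q1 = s.quantile(0.25)      # position 1.75 lies between 4 and 4
median = s.quantile(0.50)  # position 3.5 lies between 4 and 5
q3 = s.quantile(0.75)      # position 5.25 lies between 5 and 7
iqr = q3 - q1              # width of the central 50% of the data

# Several quantiles can also be requested in a single call
quartiles = s.quantile([0.25, 0.50, 0.75])

print(q1, median, q3, iqr)
print(quartiles)
```

When \\(Q_1\\), the median, and \\(Q_3\\) all coincide, as happens for columns dominated by a single value, the central 50% of the data has zero width.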


In [56]:
disp["Q1"]= df[num_cols].quantile(0.25)
disp["Median"]= df[num_cols].quantile(0.50)
disp["Q3"] = df[num_cols].quantile(0.75)
disp

| | MAD | Min | Max | Range | Variance | Std | Q1 | Median | Q3 |
|---|---|---|---|---|---|---|---|---|---|
| duration_1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0628205 | 0.25064 | 0.0 | 0.0 | 0.0 |
| duration_2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.1420256 | 0.376863 | 0.0 | 0.0 | 0.0 |
| duration_3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2173114 | 0.466167 | 0.0 | 0.0 | 1.0 |
| duration_4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2446548 | 0.494626 | 0.0 | 0.0 | 1.0 |
| duration_5 | 0.0 | 0.0 | 1.0 | 1.0 | 0.01574925 | 0.125496 | 0.0 | 0.0 | 0.0 |
| loudness_level | 1.0 | 0.0 | 4.0 | 4.0 | 1.816143 | 1.347643 | 1.0 | 2.0 | 3.0 |
| popularity_level | 1.0 | 0.0 | 4.0 | 4.0 | 0.9534688 | 0.976457 | 1.0 | 2.0 | 3.0 |
| tempo_class | 0.0 | 0.0 | 3.0 | 3.0 | 0.06102968 | 0.247042 | 1.0 | 1.0 | 1.0 |
| time_signature | 0.0 | -6.712656 | 2.533318 | 9.245973 | 0.613258 | 0.783108 | 0.221824 | 0.2218242 | 0.221824 |
| key_mode | 0.844174 | -1.511882 | 1.611562 | 3.123444 | 1.010088 | 1.005031 | -0.920961 | -0.07678645 | 0.767388 |
