# Obtaining estimates of location and variability

### OBJECTIVE

- Use `estimates of location and variability` to describe the `numeric columns` of a dataset

## General data loading

To compute our estimates, we will use the following libraries and functions:

In [1]:
from scipy import stats
import pandas as pd
import numpy as np

In [2]:
# Computes the trimmed mean of a column x, cutting 10% of the values from each tail
def trim_mean_10(x):
    return stats.trim_mean(x, 0.1)
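
To make the effect of `stats.trim_mean(x, 0.1)` concrete, here is a minimal sketch on ten made-up readings (the values are hypothetical, not taken from the dataset). It shows that the second argument is the fraction removed from each tail, so with ten observations one value is dropped at the bottom and one at the top.

```python
import numpy as np
from scipy import stats

# Ten hypothetical readings with one extreme value at the top
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# proportiontocut=0.1 removes 10% of the observations from EACH tail:
# here the smallest value (1) and the largest value (100)
trimmed = stats.trim_mean(values, 0.1)

# The same computation done by hand, for comparison
manual = np.sort(values)[1:-1].mean()

plain = values.mean()

print(plain)    # plain mean, pulled up by the extreme value
print(trimmed)  # trimmed mean
print(manual)   # matches the trimmed mean
```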

In [3]:
# Returns a function that computes quantile n (a fraction between 0 and 1) of a column x
def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    return percentile_
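
As a quick illustration of how this closure is meant to be used, the sketch below applies it inside a named aggregation on a small made-up DataFrame (the `station` and `value` columns are hypothetical). Note that `n` is passed to `Series.quantile`, so it must be a fraction such as `0.25`, not `25`.

```python
import pandas as pd

def percentile(n):
    # Returns a function that computes quantile n of a Series
    def percentile_(x):
        return x.quantile(n)
    return percentile_

# Hypothetical toy data: two stations with four readings each
df = pd.DataFrame({
    "station": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "value": [1, 2, 3, 4, 10, 20, 30, 40],
})

# Each keyword becomes a column of the result
summary = df.groupby("station")["value"].agg(
    p25=percentile(0.25),
    p75=percentile(0.75),
)
print(summary)
```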

We will do the following:

- Load the data files we will examine in order to obtain our estimates
- Normalize the date to `ns` format

In [4]:
df_global = pd.read_csv('../Datasets/Datos_2016_2020.csv')  # General data
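
The date normalization step listed above is not shown in the cells, and `date` still appears as `object` in the `dtypes` output. A minimal sketch of how it could be done with `pd.to_datetime`, using a hypothetical two-row sample in the same layout as the `date` column. The day-first format string is an assumption, since `01/01/2016` does not reveal which field is the day.

```python
import pandas as pd

# Hypothetical sample that mimics the layout of the `date` column
sample = pd.DataFrame({"date": ["01/01/2016 01:00", "01/01/2016 02:00"]})

# Assumption: dates are day-first (dd/mm/YYYY HH:MM); adjust the format otherwise
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y %H:%M")

# The column is now a datetime dtype instead of object
print(sample["date"].dtype)
```

The same call could be applied to `df_global['date']` once the actual format of the file is confirmed.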

## Estimates for $PM_{2.5}$ and anthropogenic variables


In [5]:
df_global.head()

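| | date | id_station | PM2.5 | RH | TMP | WDR | WSP |
|---|---|---|---|---|---|---|---|
| 0 | 01/01/2016 01:00 | AJM | 45 | 61 | 15.1 | 185 | 1.9 |
| 1 | 01/01/2016 01:00 | AJU | 20 | 88 | 5.6 | 197 | 1.3 |
| 2 | 01/01/2016 01:00 | BJU | 78 | 59 | 16.6 | 195 | 1.3 |
| 3 | 01/01/2016 01:00 | HGM | 68 | 52 | 17.1 | 51 | 0.5 |
| 4 | 01/01/2016 01:00 | MER | 61 | 58 | 17.0 | 97 | 0.8 |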


In [7]:
df_global.dtypes

date           object
id_station     object
PM2.5           int64
RH              int64
TMP           float64
WDR             int64
WSP           float64
dtype: object

In [11]:
df_global.rename(columns={'PM2.5': 'PM2_5'}, inplace=True)

We will compute the following `estimates of location and variability` for each `station`:

- Mean
- Median
- 10% trimmed mean
- Standard deviation
- Minimum
- 25th percentile
- 50th percentile
- 75th percentile
- Maximum
- Range
- Interquartile range (IQR)

In [13]:
df_global.groupby('id_station').PM2_5.agg(
    media='mean',
    mediana='median',
    media_truncada=trim_mean_10,
    desv_estandar='std',
    minimo='min',
    percentile_25=percentile(0.25),
    percentile_50=percentile(0.5),
    percentile_75=percentile(0.75),
    maximo='max',
    rango=np.ptp, # max - min
    IQR=stats.iqr
)

| id_station | media | mediana | media_truncada | desv_estandar | minimo | percentile_25 | percentile_50 | percentile_75 | maximo | rango | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AJM | 19.153503 | 17 | 17.752157 | 12.349253 | 1 | 10 | 17 | 25 | 117 | 116 | 15 |
| AJU | 19.314821 | 17 | 17.789067 | 14.511454 | 1 | 9 | 17 | 26 | 302 | 301 | 17 |
| BJU | 21.961316 | 19 | 20.339694 | 14.09439 | 1 | 12 | 19 | 29 | 250 | 249 | 17 |
| GAM | 23.893179 | 21 | 22.078107 | 15.673482 | 1 | 13 | 21 | 31 | 359 | 358 | 18 |
| HGM | 23.730537 | 21 | 21.954238 | 15.54503 | 1 | 13 | 21 | 31 | 346 | 345 | 18 |
| INN | 13.869627 | 12 | 12.719888 | 9.606686 | 1 | 7 | 12 | 18 | 246 | 245 | 11 |
| MER | 25.103629 | 22 | 23.50941 | 16.162922 | 1 | 14 | 22 | 33 | 380 | 379 | 19 |
| MGH | 23.343532 | 21 | 21.716301 | 14.528993 | 1 | 13 | 21 | 30 | 173 | 172 | 17 |
| MON | 20.414607 | 18 | 18.578729 | 14.415153 | 1 | 11 | 18 | 26 | 227 | 226 | 15 |
| MPA | 19.354674 | 16 | 17.598002 | 14.138447 | 1 | 9 | 16 | 26 | 211 | 210 | 17 |
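
The same aggregation is repeated for each column in this section. One way to avoid the repetition is to wrap it in a helper. The sketch below is only an illustration: `summarize` is a hypothetical name, the toy DataFrame is made up, and range and IQR are computed as differences of columns already in the summary rather than with `np.ptp` and `stats.iqr`.

```python
import pandas as pd
from scipy import stats

def trim_mean_10(x):
    # Trimmed mean, cutting 10% of the values from each tail
    return stats.trim_mean(x, 0.1)

def percentile(n):
    # Returns a function that computes quantile n of a Series
    def percentile_(x):
        return x.quantile(n)
    return percentile_

def summarize(df, column, by="id_station"):
    """Hypothetical helper: location and variability estimates per group."""
    summary = df.groupby(by)[column].agg(
        media="mean",
        mediana="median",
        media_truncada=trim_mean_10,
        desv_estandar="std",
        minimo="min",
        percentile_25=percentile(0.25),
        percentile_75=percentile(0.75),
        maximo="max",
    )
    # Range and IQR follow directly from the columns above
    summary["rango"] = summary["maximo"] - summary["minimo"]
    summary["IQR"] = summary["percentile_75"] - summary["percentile_25"]
    return summary

# Made-up data with the same column names as the notebook
toy = pd.DataFrame({
    "id_station": ["A"] * 5 + ["B"] * 5,
    "PM2_5": [1, 2, 3, 4, 50, 10, 20, 30, 40, 50],
})
result = summarize(toy, "PM2_5")
print(result)
```

With the real data, the helper could be called in a loop, for example over `['PM2_5', 'RH', 'TMP', 'WSP']`, to produce one summary table per column.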


In [14]:
df_global.groupby('id_station').RH.agg(
    media='mean',
    mediana='median',
    media_truncada=trim_mean_10,
    desv_estandar='std',
    minimo='min',
    percentile_25=percentile(0.25),
    percentile_50=percentile(0.5),
    percentile_75=percentile(0.75),
    maximo='max',
    rango=np.ptp, # max - min
    IQR=stats.iqr
)

| id_station | media | mediana | media_truncada | desv_estandar | minimo | percentile_25 | percentile_50 | percentile_75 | maximo | rango | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AJM | 52.762641 | 53 | 53.06441 | 20.53573 | 2 | 36 | 53 | 70 | 101 | 99 | 34 |
| AJU | 70.479903 | 76 | 72.453796 | 22.345575 | 7 | 54 | 76 | 89 | 100 | 93 | 35 |
| BJU | 52.579918 | 54 | 52.95946 | 20.184942 | 2 | 36 | 54 | 69 | 94 | 92 | 33 |
| GAM | 57.464545 | 59 | 58.043547 | 20.425361 | 6 | 41 | 59 | 74 | 99 | 93 | 33 |
| HGM | 48.281173 | 49 | 48.517288 | 19.169283 | 5 | 33 | 49 | 64 | 87 | 82 | 31 |
| INN | 67.133004 | 72 | 69.012109 | 22.466029 | 5 | 50 | 72 | 87 | 97 | 92 | 37 |
| MER | 52.630897 | 54 | 53.027145 | 20.408351 | 2 | 36 | 54 | 70 | 95 | 93 | 34 |
| MGH | 46.991888 | 46 | 47.023917 | 20.674605 | 1 | 30 | 46 | 65 | 89 | 88 | 35 |
| MON | 57.336548 | 60 | 58.331617 | 21.568782 | 2 | 40 | 60 | 76 | 93 | 91 | 36 |
| MPA | 60.938274 | 63 | 62.043571 | 21.342613 | 3 | 45 | 63 | 79 | 98 | 95 | 34 |


In [15]:
df_global.groupby('id_station').TMP.agg(
    media='mean',
    mediana='median',
    media_truncada=trim_mean_10,
    desv_estandar='std',
    minimo='min',
    percentile_25=percentile(0.25),
    percentile_50=percentile(0.5),
    percentile_75=percentile(0.75),
    maximo='max',
    rango=np.ptp, # max - min
    IQR=stats.iqr
)

| id_station | media | mediana | media_truncada | desv_estandar | minimo | percentile_25 | percentile_50 | percentile_75 | maximo | rango | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AJM | 16.1316 | 15.8 | 16.084483 | 3.785415 | 1.1 | 13.6 | 15.8 | 18.7 | 27.9 | 26.8 | 5.1 |
| AJU | 10.123797 | 10.6 | 10.339948 | 5.658911 | -6.2 | 6.2 | 10.6 | 14.5 | 23.4 | 29.6 | 8.3 |
| BJU | 17.707097 | 17.2 | 17.656742 | 4.267441 | 2.8 | 14.9 | 17.2 | 20.8 | 31.2 | 28.4 | 5.9 |
| GAM | 18.232813 | 17.7 | 18.178271 | 4.36201 | 1.9 | 15.4 | 17.7 | 21.3 | 32.2 | 30.3 | 5.9 |
| HGM | 17.58015 | 17.1 | 17.519764 | 4.215733 | 3.1 | 14.8 | 17.1 | 20.5 | 31.2 | 28.1 | 5.7 |
| INN | 10.857725 | 10.5 | 10.798188 | 4.969519 | -3.5 | 7.3 | 10.5 | 14.5 | 25.8 | 29.3 | 7.2 |
| MER | 17.806035 | 17.3 | 17.750899 | 4.366044 | 2.0 | 14.9 | 17.3 | 20.9 | 30.6 | 28.6 | 6.0 |
| MGH | 17.500604 | 16.9 | 17.39207 | 4.466399 | 2.4 | 14.5 | 16.9 | 20.7 | 31.5 | 29.1 | 6.2 |
| MON | 17.623488 | 17.2 | 17.628101 | 5.562761 | 1.2 | 14.0 | 17.2 | 21.8 | 33.4 | 32.2 | 7.8 |
| MPA | 14.795386 | 14.4 | 14.627928 | 4.485424 | 0.7 | 11.4 | 14.4 | 17.9 | 29.0 | 28.3 | 6.5 |


In [16]:
df_global.groupby('id_station').WSP.agg(
    media='mean',
    mediana='median',
    media_truncada=trim_mean_10,
    desv_estandar='std',
    minimo='min',
    percentile_25=percentile(0.25),
    percentile_50=percentile(0.5),
    percentile_75=percentile(0.75),
    maximo='max',
    rango=np.ptp, # max - min
    IQR=stats.iqr
)

| id_station | media | mediana | media_truncada | desv_estandar | minimo | percentile_25 | percentile_50 | percentile_75 | maximo | rango | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AJM | 2.743468 | 2.5 | 2.558582 | 1.51728 | 0.0 | 1.7 | 2.5 | 3.4 | 14.3 | 14.3 | 1.7 |
| AJU | 2.58286 | 2.4 | 2.449145 | 1.221964 | 0.0 | 1.7 | 2.4 | 3.1 | 11.8 | 11.8 | 1.4 |
| BJU | 1.789737 | 1.7 | 1.702243 | 0.831121 | 0.0 | 1.3 | 1.7 | 2.2 | 7.7 | 7.7 | 0.9 |
| GAM | 1.922274 | 1.6 | 1.772715 | 1.234218 | 0.0 | 1.1 | 1.6 | 2.5 | 8.8 | 8.8 | 1.4 |
| HGM | 1.822161 | 1.6 | 1.697121 | 1.108285 | 0.0 | 1.0 | 1.6 | 2.4 | 8.8 | 8.8 | 1.4 |
| INN | 1.648204 | 1.4 | 1.520329 | 1.026929 | 0.0 | 0.8 | 1.4 | 2.2 | 7.5 | 7.5 | 1.4 |
| MER | 2.107434 | 1.9 | 1.993267 | 1.061193 | 0.0 | 1.3 | 1.9 | 2.7 | 8.8 | 8.8 | 1.4 |
| MGH | 1.973737 | 1.9 | 1.901692 | 0.996859 | 0.0 | 1.3 | 1.9 | 2.5 | 8.2 | 8.2 | 1.2 |
| MON | 2.05655 | 1.7 | 1.857069 | 1.562397 | 0.0 | 0.8 | 1.7 | 2.8 | 8.4 | 8.4 | 2.0 |
| MPA | 2.751328 | 2.5 | 2.618989 | 1.291523 | 0.1 | 1.8 | 2.5 | 3.4 | 9.8 | 9.7 | 1.6 |


### Conclusion

From the results above, we can observe the following:

- Our `ranges` are very wide compared with the IQR, because of the `outliers` in the dataset. This is most visible for PM2_5, where station MER has a range of 379 against an IQR of 19.
- Our `means` are not far from the `medians`, which seems to indicate `low skew`.
- As expected, the trimmed mean is generally closer to the median than the plain mean, since `stats.trim_mean(x, 0.1)` discards 10% of the values at each end of the data.
- The standard deviations are large, but in every case smaller than the corresponding median.
- The central 50% of the data, bounded by the `25th and 75th percentiles`, falls within the median $\pm$ 1 $\sigma$.
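
These observations can be checked by hand on a single row. The sketch below uses the PM2_5 values reported for station MER in the first summary table. The variable names are illustrative, and the standardized gap between mean and median is only a rough indicator of skew, not a formal test.

```python
# PM2_5 estimates for station MER, copied from the summary table
mean, median, std = 25.103629, 22, 16.162922
p25, p75 = 14, 33
value_range, iqr = 379, 19

# How many times wider the range is than the IQR
range_to_iqr = value_range / iqr

# Gap between mean and median, expressed in standard deviations
mean_median_gap = (mean - median) / std

# Do the quartiles fall inside median +/- one standard deviation?
quartiles_within_one_sigma = (median - std <= p25) and (p75 <= median + std)

print(round(range_to_iqr, 2))
print(round(mean_median_gap, 2))
print(quartiles_within_one_sigma)
```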