# Assignment. Exploratory analysis of a household's electricity consumption

The dataset analyzed in this assignment consists of measurements of a household's electricity consumption, obtained from [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption#] at the UCI Machine Learning Repository (Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).

The original dataset is much larger; to make it more manageable, the values have been aggregated by hour by taking averages.

## First step: loading the data
After importing the necessary modules and defining `DATA_DIRECTORY`, load the file `household_hourly_power_consumption.txt` into a DataFrame called `vivienda`.

In [1]:
# Complete here
import pandas as pd
import numpy as np
from pathlib import Path
DATA_DIRECTORY = Path("..") / ".." / "data"

def nice(label, value):
    # Print a labelled value; `label` avoids shadowing the built-in `str`
    print(f"*-* {label} *-*\n{value}\n")

vivienda = pd.read_csv(
    DATA_DIRECTORY / "household_hourly_power_consumption.txt",
    skiprows=13,
    sep=";",
    parse_dates=["date_hour"],
    index_col="date_hour",
)
# --------------------
vivienda

date_hour,global_active_power,global_reactive_power,voltage,global_intensity,sub_metering_1,sub_metering_2,sub_metering_3
2006-12-16 17:00:00,4.222889,0.229000,234.643889,18.100000,0.0,0.527778,16.861111
2006-12-16 18:00:00,3.632200,0.080033,234.580167,15.600000,0.0,6.716667,16.866667
2006-12-16 19:00:00,3.400233,0.085233,233.232500,14.503333,0.0,1.433333,16.683333
2006-12-16 20:00:00,3.268567,0.075100,234.071500,13.916667,0.0,0.000000,16.783333
2006-12-16 21:00:00,3.056467,0.076667,237.158667,13.046667,0.0,0.416667,17.216667
...,...,...,...,...,...,...,...
2010-11-26 17:00:00,1.725900,0.061400,237.069667,7.216667,0.0,0.000000,12.866667
2010-11-26 18:00:00,1.573467,0.053700,237.531833,6.620000,0.0,0.000000,0.000000
2010-11-26 19:00:00,1.659333,0.060033,236.741000,7.056667,0.0,0.066667,0.000000
2010-11-26 20:00:00,1.163700,0.061167,239.396000,4.913333,0.0,1.066667,0.000000
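
As a rough illustration of what the `read_csv` options used above do, the sketch below parses a small in-memory text instead of the real file. The two data rows are copied from the output above (restricted to two columns); the preamble lines, the names `sample_text` and `sample`, and the value `skiprows=2` are assumptions made only for this example, standing in for whatever occupies the 13 skipped lines of the actual file.

```python
import io
import pandas as pd

# Hypothetical miniature of the data file: two preamble lines, then ';'-separated data.
sample_text = """# preamble line 1 (assumed)
# preamble line 2 (assumed)
date_hour;global_active_power;voltage
2006-12-16 17:00:00;4.222889;234.643889
2006-12-16 18:00:00;3.632200;234.580167
"""

sample = pd.read_csv(
    io.StringIO(sample_text),
    skiprows=2,                 # skip the preamble lines
    sep=";",                    # fields are separated by semicolons
    parse_dates=["date_hour"],  # convert this column to datetimes
    index_col="date_hour",      # and use it as the index
)
print(type(sample.index).__name__)
print(sample.shape)
```

Having a datetime index is what later makes attributes such as `year`, `month` and `hour` available directly on `vivienda.index`.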


> In the dataset, the metering zones corresponding to "sub_metering_n" are the following:
- sub_metering_1: kitchen
- sub_metering_2: laundry room, which has a washing machine and a tumble dryer as well as a refrigerator
- sub_metering_3: electric water heater and an air conditioner.

2. What is the minimum value of the global active power found in the dataset? What is the maximum value? On what dates and at what times did they occur? (Hint: take a look at [idxmin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.idxmin.html#pandas.Series.idxmin) and its sibling `idxmax`.)



In [4]:
# Complete here
h_min = vivienda['global_active_power'].idxmin()
h_max = vivienda['global_active_power'].idxmax()

# Named p_min / p_max to avoid shadowing the built-ins `min` and `max`
p_min = vivienda['global_active_power'].min()
p_max = vivienda['global_active_power'].max()

print(f"The minimum value of global_active_power is {p_min} at date and time: {h_min}")
print(f"The maximum value of global_active_power is {p_max} at date and time: {h_max}")


# --------------------


The minimum value of global_active_power is 0.124 at date and time: 2008-08-23 21:00:00
The maximum value of global_active_power is 6.56053333333333 at date and time: 2008-11-23 18:00:00
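
The relationship between `min`/`max` and `idxmin`/`idxmax` can be sketched on a tiny series. The values and timestamps below are made up purely for illustration, and the names `power`, `when_min` and `when_max` are hypothetical.

```python
import pandas as pd

# Tiny synthetic series; values and timestamps are invented for this example.
power = pd.Series(
    [1.5, 0.3, 2.7],
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"]
    ),
)

when_min = power.idxmin()  # index label at which the smallest value occurs
when_max = power.idxmax()  # index label at which the largest value occurs

# Looking the labels up again recovers the extreme values themselves.
print(power.loc[when_min], when_min)
print(power.loc[when_max], when_max)
```

Because `idxmin` and `idxmax` return index labels, with a datetime index they directly give the date and hour of each extreme.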


3. What is the average value of the global intensity in the dataset?


In [3]:
# Complete here
nice('vivienda.mean()["global_intensity"]', vivienda.mean()["global_intensity"])

# --------------------


*-* vivienda.mean()["global_intensity"] *-*
4.628238362989026



## Manipulations

### Adding a *sub_metering_resto* column

The columns sub_metering_1, sub_metering_2 and sub_metering_3 measure the active energy in three zones of the household. To compute the active energy in the rest of the household, we subtract them from the global_active_power column (after multiplying the latter by 1000/60 to convert it from kilowatts to watt-hours per minute), according to the formula

> (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) 

You must add this column, named sub_metering_resto, to the *vivienda* DataFrame.
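
As a quick sanity check of the formula, the sketch below applies it to the first row of the table displayed earlier in the notebook, using the rounded values as printed. The names `row`, `total_wh` and `resto` are hypothetical, and the unit comments assume the sub-metering columns are expressed in watt-hours.

```python
import pandas as pd

# First row of the table shown earlier (values rounded as displayed).
row = pd.DataFrame(
    {
        "global_active_power": [4.222889],  # kW
        "sub_metering_1": [0.0],            # assumed Wh
        "sub_metering_2": [0.527778],
        "sub_metering_3": [16.861111],
    }
)

# 1 kW sustained for one minute is 1000 W * (1/60) h = 1000/60 Wh.
total_wh = row["global_active_power"] * 1000 / 60

# Column arithmetic is vectorized: it works on every row at once.
resto = (
    total_wh
    - row["sub_metering_1"]
    - row["sub_metering_2"]
    - row["sub_metering_3"]
)
print(resto.iloc[0])
```

With these rounded inputs the result should land close to 52.99, and the same column expression works unchanged on the full DataFrame.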

In [33]:
vivienda

date_hour,global_active_power,global_reactive_power,voltage,global_intensity,sub_metering_1,sub_metering_2,sub_metering_3
2006-12-16 17:00:00,4.222889,0.229000,234.643889,18.100000,0.0,0.527778,16.861111
2006-12-16 18:00:00,3.632200,0.080033,234.580167,15.600000,0.0,6.716667,16.866667
2006-12-16 19:00:00,3.400233,0.085233,233.232500,14.503333,0.0,1.433333,16.683333
2006-12-16 20:00:00,3.268567,0.075100,234.071500,13.916667,0.0,0.000000,16.783333
2006-12-16 21:00:00,3.056467,0.076667,237.158667,13.046667,0.0,0.416667,17.216667
...,...,...,...,...,...,...,...
2010-11-26 17:00:00,1.725900,0.061400,237.069667,7.216667,0.0,0.000000,12.866667
2010-11-26 18:00:00,1.573467,0.053700,237.531833,6.620000,0.0,0.000000,0.000000
2010-11-26 19:00:00,1.659333,0.060033,236.741000,7.056667,0.0,0.066667,0.000000
2010-11-26 20:00:00,1.163700,0.061167,239.396000,4.913333,0.0,1.066667,0.000000


In [5]:
# Complete here
def formula(x):
    return x["global_active_power"] * 1000 / 60 - x["sub_metering_1"] - x["sub_metering_2"] - x["sub_metering_3"]

vivienda["sub_metering_resto"] = formula(vivienda)  # vectorized: operates on whole columns instead of row by row



# --------------------
vivienda

date_hour,global_active_power,global_reactive_power,voltage,global_intensity,sub_metering_1,sub_metering_2,sub_metering_3,sub_metering_resto
2006-12-16 17:00:00,4.222889,0.229000,234.643889,18.100000,0.0,0.527778,16.861111,52.992593
2006-12-16 18:00:00,3.632200,0.080033,234.580167,15.600000,0.0,6.716667,16.866667,36.953333
2006-12-16 19:00:00,3.400233,0.085233,233.232500,14.503333,0.0,1.433333,16.683333,38.553889
2006-12-16 20:00:00,3.268567,0.075100,234.071500,13.916667,0.0,0.000000,16.783333,37.692778
2006-12-16 21:00:00,3.056467,0.076667,237.158667,13.046667,0.0,0.416667,17.216667,33.307778
...,...,...,...,...,...,...,...,...
2010-11-26 17:00:00,1.725900,0.061400,237.069667,7.216667,0.0,0.000000,12.866667,15.898333
2010-11-26 18:00:00,1.573467,0.053700,237.531833,6.620000,0.0,0.000000,0.000000,26.224444
2010-11-26 19:00:00,1.659333,0.060033,236.741000,7.056667,0.0,0.066667,0.000000,27.588889
2010-11-26 20:00:00,1.163700,0.061167,239.396000,4.913333,0.0,1.066667,0.000000,18.328333


## Exploring the amount of data




Using the `year` and `month` attributes of the index of `vivienda`, together with the `value_counts` method, find out how many measurements we have for each year and for each month.


In [57]:
# Complete here
nice('vivienda.index.year.value_counts()', vivienda.index.year.value_counts())
nice('vivienda.index.month.value_counts()', vivienda.index.month.value_counts())

# --------------------


*-* vivienda.index.year.value_counts() *-*
date_hour
2008    8784
2007    8760
2009    8760
2010    7918
2006     367
Name: count, dtype: int64

*-* vivienda.index.month.value_counts() *-*
date_hour
1     2976
3     2976
5     2976
7     2976
8     2976
10    2976
4     2880
6     2880
9     2880
11    2782
2     2712
12    2599
Name: count, dtype: int64



To obtain a frequency table of the number of data points by year and by month, we can use the pandas `crosstab` function.

In [58]:
# Complete here
pd.crosstab(vivienda.index.year, vivienda.index.month)

# --------------------


col_0,1,2,3,4,5,6,7,8,9,10,11,12
row_0,,,,,,,,,,,,
2006,0,0,0,0,0,0,0,0,0,0,0,367
2007,744,672,744,720,744,720,744,744,720,744,720,744
2008,744,696,744,720,744,720,744,744,720,744,720,744
2009,744,672,744,720,744,720,744,744,720,744,720,744
2010,744,672,744,720,744,720,744,744,720,744,622,0
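
A minimal sketch of `crosstab` on a small synthetic index shows how the table is built. The dates are arbitrary and the names `idx` and `counts` are hypothetical; the optional `rownames` and `colnames` arguments, which the cell above does not use, replace the default `row_0` / `col_0` labels.

```python
import pandas as pd

# Synthetic hourly index spanning a year boundary (dates chosen arbitrarily).
idx = pd.date_range("2019-12-31 00:00", "2020-01-01 23:00", freq="h")

# One row per year, one column per month, each cell counting the timestamps.
counts = pd.crosstab(
    idx.year,
    idx.month,
    rownames=["year"],
    colnames=["month"],
)
print(counts)
```

Each full day contributes 24 hourly timestamps to its (year, month) cell, and combinations that never occur are filled with 0, as in the 2006 row above.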


## Summarizing by groups
In this section we use `groupby` applied to the index of the DataFrame `vivienda` to obtain several summaries of the household's energy consumption.

### Power profile over the course of the day

To begin, we want to see the mean hourly profile of the global active power, that is, for each hour of the day (0 to 23), the average value of the global active power.

In [73]:
# Complete here
vivienda.groupby(lambda x: x.hour)["global_active_power"].mean()
# --------------------


date_hour
0     0.659562
1     0.539325
2     0.480618
3     0.444850
4     0.443844
5     0.453674
6     0.791606
7     1.502373
8     1.460940
9     1.331642
10    1.260913
11    1.246408
12    1.207061
13    1.144471
14    1.082750
15    0.990806
16    0.948805
17    1.056164
18    1.326433
19    1.733428
20    1.899073
21    1.876063
22    1.412681
23    0.902142
Name: global_active_power, dtype: float64
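
Grouping by a function of the index, as done above, can also be written by passing the index attribute itself as the grouping key. The sketch below compares both forms on synthetic data; the frame `demo` and its values are invented for illustration.

```python
import pandas as pd

# Two full days of hourly timestamps with made-up values 0.0, 1.0, ..., 47.0.
idx = pd.date_range("2020-01-01 00:00", periods=48, freq="h")
demo = pd.DataFrame(
    {"global_active_power": [float(v) for v in range(48)]},
    index=idx,
)

# Form used in the notebook: a function applied to each index label.
by_lambda = demo.groupby(lambda ts: ts.hour)["global_active_power"].mean()

# Alternative: pass the precomputed hour of every timestamp as the key.
by_attr = demo.groupby(demo.index.hour)["global_active_power"].mean()

print(by_attr.head(3))
```

Both forms yield one row per hour of the day; the second avoids calling a Python function once per index label, which may matter on larger datasets.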

Repeat the previous instruction, this time also adding the maximum global power per hour, the number of data points that entered the calculation, and the minimum power. Store the result in a DataFrame called `perfil_horario_vivienda`.

In [28]:
# Complete here
perfil_horario_vivienda = vivienda.groupby(lambda x: x.hour).agg(
    {
        "global_active_power": [
            ("potencia_media", "mean"),
            ("potencia_maxima", "max"),
            ("numero", "count"),  # number of non-missing values
            ("potencia_minima", "min"),
        ]
    }
)

# --------------------
perfil_horario_vivienda
 
       

,global_active_power,global_active_power,global_active_power,global_active_power
date_hour,potencia_media,potencia_maxima,numero,potencia_minima
0,0.659562,5.1555,1426,0.1276
1,0.539325,5.759067,1424,0.1441
2,0.480618,3.498267,1424,0.1315
3,0.44485,2.847333,1424,0.135067
4,0.443844,2.9925,1422,0.127533
5,0.453674,2.9103,1421,0.130467
6,0.791606,3.590267,1421,0.151067
7,1.502373,4.4166,1422,0.131633
8,1.46094,4.4189,1422,0.144567
9,1.331642,3.716267,1422,0.129233
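
When choosing how to count the data points in each group, it helps to know that `"count"` and `"size"` differ when values are missing. The sketch below, on synthetic data with one deliberately missing reading, uses keyword-style named aggregation on a single column; the `filas` column and the frame `demo` are hypothetical additions for this example.

```python
import pandas as pd

# Two days of hourly data with made-up values and one missing reading.
idx = pd.date_range("2020-01-01 00:00", periods=48, freq="h")
demo = pd.DataFrame(
    {"global_active_power": [float(v) for v in range(48)]},
    index=idx,
)
demo.iloc[0, 0] = float("nan")  # missing value at hour 0 of the first day

profile = demo.groupby(demo.index.hour)["global_active_power"].agg(
    potencia_media="mean",
    potencia_maxima="max",
    numero="count",   # non-missing values only
    filas="size",     # all rows, missing or not
    potencia_minima="min",
)
print(profile.head(3))
```

At hour 0 the group holds two rows but only one valid reading, so `numero` and `filas` disagree there; selecting the column before aggregating also gives flat column names rather than a two-level header.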


## Proportion of power corresponding to the kitchen

We start by computing, for each row, the sum of the columns from `sub_metering_1` through `sub_metering_resto`.

In [6]:
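# Complete here
# Bound the slice explicitly so that re-running the cell does not include columns added later
vivienda['sub_metering_suma'] = vivienda.loc[:, 'sub_metering_1':'sub_metering_resto'].sum(axis=1)
vivienda['sub_metering_suma']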
# --------------------


date_hour
2006-12-16 17:00:00    70.381481
2006-12-16 18:00:00    60.536667
2006-12-16 19:00:00    56.670556
2006-12-16 20:00:00    54.476111
2006-12-16 21:00:00    50.941111
                         ...    
2010-11-26 17:00:00    28.765000
2010-11-26 18:00:00    26.224444
2010-11-26 19:00:00    27.655556
2010-11-26 20:00:00    19.395000
2010-11-26 21:00:00    15.577778
Name: sub_metering_suma, Length: 34589, dtype: float64

Next, we add to the `vivienda` dataset the computed column `prop_cocina`, containing the percentage that `sub_metering_1` represents of the sum of the `sub_metering` columns.


In [12]:
# Complete here
vivienda['prop_cocina'] = vivienda['sub_metering_1'] / vivienda['sub_metering_suma'] * 100
# --------------------
vivienda

date_hour,global_active_power,global_reactive_power,voltage,global_intensity,sub_metering_1,sub_metering_2,sub_metering_3,sub_metering_resto,sub_metering_suma,prop_cocina
2006-12-16 17:00:00,4.222889,0.229000,234.643889,18.100000,0.0,0.527778,16.861111,52.992593,70.381481,0.0
2006-12-16 18:00:00,3.632200,0.080033,234.580167,15.600000,0.0,6.716667,16.866667,36.953333,60.536667,0.0
2006-12-16 19:00:00,3.400233,0.085233,233.232500,14.503333,0.0,1.433333,16.683333,38.553889,56.670556,0.0
2006-12-16 20:00:00,3.268567,0.075100,234.071500,13.916667,0.0,0.000000,16.783333,37.692778,54.476111,0.0
2006-12-16 21:00:00,3.056467,0.076667,237.158667,13.046667,0.0,0.416667,17.216667,33.307778,50.941111,0.0
...,...,...,...,...,...,...,...,...,...,...
2010-11-26 17:00:00,1.725900,0.061400,237.069667,7.216667,0.0,0.000000,12.866667,15.898333,28.765000,0.0
2010-11-26 18:00:00,1.573467,0.053700,237.531833,6.620000,0.0,0.000000,0.000000,26.224444,26.224444,0.0
2010-11-26 19:00:00,1.659333,0.060033,236.741000,7.056667,0.0,0.066667,0.000000,27.588889,27.655556,0.0
2010-11-26 20:00:00,1.163700,0.061167,239.396000,4.913333,0.0,1.066667,0.000000,18.328333,19.395000,0.0


We can now obtain the hourly evolution of the proportion corresponding to the kitchen, with the same indicators as for the global power.

In [26]:
# Complete here
vivienda.groupby(lambda x: x.hour).agg(
    {  "prop_cocina" : [
        ("prop_media", "mean"),
        ("prop_maxima", "max"),
        ("numero", "size"),
        ("prop_minima", "min")       
        ]
    }
)
# --------------------


,prop_cocina,prop_cocina,prop_cocina,prop_cocina
date_hour,prop_media,prop_maxima,numero,prop_minima
0,1.547905,71.124576,1441,0.0
1,1.080553,68.412045,1441,0.0
2,0.683529,65.83916,1441,0.0
3,0.350238,64.637306,1441,0.0
4,0.256954,64.203017,1441,0.0
5,0.169877,53.975143,1441,0.0
6,0.133651,56.187602,1441,0.0
7,0.779337,29.547224,1441,0.0
8,3.992971,66.780186,1441,0.0
9,4.974586,73.693355,1441,0.0


We now want to add the day of the week as a grouping factor (the `weekday()` method of a `datetime` object, or the `dayofweek` attribute of a pandas `Timestamp`). For the grouping (day of the week, hour), compute the average value of the proportion corresponding to the kitchen and sort the results from highest to lowest. The weekday takes the value 0 for Monday and 6 for Sunday. When does this family use the kitchen the most?


In [27]:
# Complete here
vivienda.groupby([lambda x: x.dayofweek, lambda x: x.hour]).agg(
    {
        "prop_cocina": "mean"
    }
).rename_axis(["dayofweek", "hour"]).sort_values("prop_cocina", ascending=False)
# --------------------


dayofweek,hour,prop_cocina
5,15,12.068463
6,12,11.794802
6,15,10.361347
5,14,10.335298
6,11,9.900869
...,...,...
1,4,0.000000
1,5,0.000000
2,6,0.000000
4,2,0.000000
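
To see how the two-level grouping and the sort fit together, here is a sketch on one synthetic week. The kitchen shares are invented (nonzero only at two chosen moments), and the names `demo` and `ranking` are hypothetical; the result says nothing about the real household.

```python
import pandas as pd

# One full week of hourly timestamps; 2024-01-01 is a Monday (dayofweek 0).
idx = pd.date_range("2024-01-01 00:00", periods=7 * 24, freq="h")
demo = pd.DataFrame({"prop_cocina": 0.0}, index=idx)

# Made-up pattern: 10% on Saturday at 15:00 and 5% on Sunday at 12:00.
demo.loc[pd.Timestamp("2024-01-06 15:00"), "prop_cocina"] = 10.0
demo.loc[pd.Timestamp("2024-01-07 12:00"), "prop_cocina"] = 5.0

ranking = (
    demo.groupby([demo.index.dayofweek, demo.index.hour])["prop_cocina"]
    .mean()
    .rename_axis(["dayofweek", "hour"])
    .sort_values(ascending=False)
)
print(ranking.head(2))
```

The first entries of the sorted result are the (day of the week, hour) pairs with the highest average share, which is how the table above should be read: a `dayofweek` of 5 or 6 corresponds to Saturday or Sunday.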
