# Cleaning: Predicting the presence of high PM10 levels

In this notebook we take the pollutant and meteorology datasets obtained from http://www.aire.cdmx.gob.mx/default.php

We reshape each dataframe with a pivot table and then join them by the time of each measurement.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
import pandas as pd
import matplotlib
import seaborn as sns

from datetime import timedelta

We define helper functions that convert dates from DD-MM-YYYY to YYYY-MM-DD and rewrite the hour 24:00 as 00:00. Note that the calendar date is left unchanged when 24:00 becomes 00:00:

In [2]:
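def time_converter(x):
    """Convert 'DD/MM/YYYY HH:MM' to 'YYYY-MM-DD HH:MM', rewriting 24:00 as 00:00."""
    x0 = x.split(" ")[0].split("/")  # [day, month, year]
    x1 = x.split(" ")[1]             # time of day
    if x1.endswith("24:00"):
        return x0[2]+"-"+x0[1]+"-"+x0[0]+" 00:00"
    else:
        return x0[2]+"-"+x0[1]+"-"+x0[0]+" "+x1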

In [3]:
def time_converter_guion(x):
    """Same as time_converter, for dates written with hyphens ('DD-MM-YYYY HH:MM')."""
    x0 = x.split(" ")[0].split("-")  # [day, month, year]
    x1 = x.split(" ")[1]             # time of day
    if x1.endswith("24:00"):
        return x0[2]+"-"+x0[1]+"-"+x0[0]+" 00:00"
    else:
        return x0[2]+"-"+x0[1]+"-"+x0[0]+" "+x1

In [5]:
def time_converter_date(x):
    """Convert a date without a time part from 'DD/MM/YYYY' to 'YYYY-MM-DD'."""
    x0 = x.split("/")  # [day, month, year]
    return x0[2]+"-"+x0[1]+"-"+x0[0]
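
The converters above keep the calendar date when they rewrite 24:00 as 00:00. If the source uses 24:00 to mean the end of the day (an assumption that should be checked against the data provider's documentation), the timestamp would need to roll over to the next day. A minimal sketch of that variant follows; `to_timestamp_rollover` and its `sep` parameter are hypothetical names that are not used elsewhere in this notebook.

```python
from datetime import datetime, timedelta


def to_timestamp_rollover(x, sep="/"):
    """Parse 'DD<sep>MM<sep>YYYY HH:MM', treating 24:00 as 00:00 of the next day.

    Hypothetical helper: assumes 24:00 marks the end of the stated day.
    """
    date_part, time_part = x.split(" ")
    day, month, year = date_part.split(sep)
    base = datetime(int(year), int(month), int(day))
    if time_part.endswith("24:00"):
        # Midnight at the end of the day belongs to the following date.
        return base + timedelta(days=1)
    hour, minute = time_part.split(":")[:2]
    return base + timedelta(hours=int(hour), minutes=int(minute))


end_of_year = to_timestamp_rollover("31/12/2018 24:00")
first_hour = to_timestamp_rollover("01-01-2018 01:00", sep="-")
```

Returning `datetime` objects directly would also make the later `pd.to_datetime` call on the converted strings unnecessary.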

## Loading the datasets   <a class="anchor" id="limpieza-bullet"></a>

We define the year to clean:

In [344]:
a="2018"

We load the data.

### Pressure

In [345]:
pre_2018 = pd.read_csv(str('/Users/danielbustillos/Documents/servicio/Contaminación PM10/presion/PA_' + a + ".csv"),header=8)

We drop the rows that contain NaN and remove the `unit` column:

In [346]:
pre_2018 = pre_2018.dropna(how='any') 
pre_2018 = pre_2018.drop(['unit'], axis=1)

In [347]:
#pre_2018.head()

### Meteorology

In [348]:
met_2018 = pd.read_csv(str('/Users/danielbustillos/Documents/servicio/Contaminación PM10/Metereología/meteorología_' + a + ".CSV"),header=10)
met_2018 = met_2018.dropna(how='any')
met_2018 = met_2018.drop(['unit'], axis=1)

In [349]:
#met_2018.head(20)

### Pollutants

In [350]:
cont_2018 = pd.read_csv(str('/Users/danielbustillos/Documents/servicio/Contaminación PM10/Contaminantes/contaminantes_'+ a +'.CSV'),header=10)
cont_2018 = cont_2018.dropna(how='any')
cont_2018 = cont_2018.drop(['unit'], axis=1)

In [351]:
#cont_2018.head(200)

### UVA radiation

In [352]:
radA_2018 = pd.read_csv(str('/Users/danielbustillos/Documents/servicio/Contaminación PM10/UV/UVA/UVA_'+ a +'.csv'),header=8)
radA_2018 = radA_2018.fillna(0)
radA_2018 = radA_2018.drop(['unit'], axis=1)

### UVB radiation

In [353]:
radB_2018 = pd.read_csv(str('/Users/danielbustillos/Documents/servicio/Contaminación PM10/UV/UVB/UVB_'+ a +'.csv'),header=8)
radB_2018 = radB_2018.fillna(0)
radB_2018 = radB_2018.drop(['unit'], axis=1)

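## Appending radiation

We concatenate the UVA and UVB dataframes:

In [354]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
rad_2018 = pd.concat([radA_2018, radB_2018], ignore_index=True)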

## Precipitation

In [355]:
prec_2018 = pd.read_excel(str('/Users/danielbustillos/Documents/servicio/Contaminación PM10/Precipitación/'+a+"PPH.xls"))

Missing values are encoded as -99. We remove the rows where the `LOM` column has that value (only `LOM` is checked by the filter below):

In [356]:
prec_2018 = prec_2018.where(prec_2018.LOM != -99.00)

In [357]:
prec_2018 = prec_2018.dropna()
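
Because the filter above only inspects `LOM`, a -99 in another station column would survive it. One possible alternative, sketched here on a toy frame, is to turn the sentinel into NaN in every station column and decide afterwards how to handle the gaps. The station code `XYZ` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: "XYZ" is a hypothetical station code, values are invented.
toy = pd.DataFrame({
    "FECHA": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "LOM": [0.0, -99.0, 1.2],
    "XYZ": [-99.0, 0.4, 0.0],
})

# Every column except the date holds station readings.
station_cols = toy.columns.drop("FECHA")

cleaned = toy.copy()
# Replace the -99 sentinel with NaN in all station columns at once.
cleaned[station_cols] = cleaned[station_cols].replace(-99, np.nan)
```

Keeping the rows and marking the gaps as NaN preserves valid readings from other stations on the same date, at the cost of having to handle NaN later.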

We rename the `FECHA` column:

In [358]:
prec_2018 = prec_2018.rename(columns={'FECHA': 'fecha'})

We reshape the dataframe from wide to long format (one row per date and station):

In [359]:
prec_2018 = pd.melt(prec_2018, id_vars=["fecha"], 
                   var_name="id_station",value_name="Precip")
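
To make the effect of `pd.melt` concrete, here is a small illustration on invented data. The station code `XYZ` and the values are assumptions; only the column names `fecha`, `id_station` and `Precip` come from the cell above.

```python
import pandas as pd

# Wide layout: one column per station (toy values, "XYZ" is hypothetical).
wide = pd.DataFrame({
    "fecha": ["2018-01-01", "2018-01-02"],
    "LOM": [0.0, 1.2],
    "XYZ": [0.4, 0.0],
})

# Long layout: the station columns are stacked into (id_station, Precip) pairs.
long = pd.melt(wide, id_vars=["fecha"],
               var_name="id_station", value_name="Precip")
```

Each of the two station columns contributes one row per date, so the two-row wide frame becomes a four-row long frame.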

In [360]:
prec_2018['fecha'] =  pd.to_datetime(prec_2018['fecha'], format='%Y-%m-%d %H:%M')

# Step-by-step cleaning

### Pressure

We keep the listed columns and reset the index:

In [361]:
pre_ACO = pre_2018
pre_ACO = pre_ACO.reset_index(drop=False)
pre_ACO = pre_ACO[["Date","cve_station","parameter","value"]]

In [362]:
pre_ACO = pre_ACO.rename(columns={'Date': 'date', 'cve_station': 'id_station','parameter': 'id_parameter'})

### Pollutants

In [363]:
cont_ACO = cont_2018
#cont_ACO = cont_ACO[(cont_ACO["id_parameter"] == "PM10")]
cont_ACO = cont_ACO.reset_index(drop=False)
cont_ACO = cont_ACO[["date","id_station","id_parameter","value"]]

### Meteorology

In [364]:
met_ACO = met_2018
#cont_ACO = cont_ACO[(cont_ACO["id_parameter"] == "PM10")]
met_ACO = met_ACO.reset_index(drop=False)
met_ACO = met_ACO[["date","id_station","id_parameter","value"]]

### Radiation

In [365]:
rad_2018 = rad_2018.rename(columns={'Date': 'date', 'cve_station': 'id_station','parameter': 'id_parameter'})

In [366]:
rad_ACO = rad_2018
#cont_ACO = cont_ACO[(cont_ACO["id_parameter"] == "PM10")]
rad_ACO = rad_ACO.reset_index(drop=False)
rad_ACO = rad_ACO[["date","id_station","id_parameter","value"]]

## Pivot_Table <a class="anchor" id="pivot-bullet"></a>

### Pressure

We create the pivot table:

In [367]:
pre_ACO_hour.head()

id_parameter,date,id_station,PA
0,2016-01-01 01:00:00,AJM,561.0
1,2016-01-01 01:00:00,CUT,586.0
2,2016-01-01 01:00:00,HGM,586.0
3,2016-01-01 01:00:00,LAA,587.0
4,2016-01-01 01:00:00,TLA,584.0


In [368]:
pre_ACO_hour = pd.pivot_table(pre_ACO,index=["date","id_station"],columns=["id_parameter"])

We reset the index to get rid of the row MultiIndex:

In [369]:
pre_ACO_hour = pre_ACO_hour.reset_index(drop=False)

In [370]:
pre_ACO_hour.columns = pre_ACO_hour.columns.droplevel()

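We redefine the columns. After `droplevel()`, the `date` and `id_station` columns are left with empty names, so we recover them by position: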

In [371]:
pre_ACO_hour["id_station"] = pre_ACO_hour.iloc[:,1]
pre_ACO_hour["date"] = pre_ACO_hour.iloc[:,0]

We drop the unnamed columns:

In [372]:
pre_ACO_hour = pre_ACO_hour.drop([""],axis=1)

We apply the conversion functions defined at the start of the notebook:

In [373]:
pre_ACO_hour['date'] = pre_ACO_hour.apply(lambda row: time_converter_guion(row['date']), axis=1) 

We convert the date column to datetime:

In [374]:
pre_ACO_hour['date'] =  pd.to_datetime(pre_ACO_hour['date'], format='%Y-%m-%d %H:%M')

In [375]:
pre_ACO_hour = pre_ACO_hour[[ "date" ,"id_station",'PA']]

In [376]:
#pre_ACO_hour = pre_ACO_hour.groupby(["date"]).mean().reset_index(drop=False)
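
The reset, drop-level, reassign and drop sequence above is needed because `pd.pivot_table` is called without `values`, which produces two-level column labels. As a sketch of an alternative under that reading, passing `values="value"` keeps the columns single-level, so the key columns keep their names. `pivot_long_to_wide` is a hypothetical helper name and the data is invented.

```python
import pandas as pd


def pivot_long_to_wide(df):
    """Hypothetical helper: one row per (date, id_station), one column per parameter."""
    wide = pd.pivot_table(df, index=["date", "id_station"],
                          columns="id_parameter", values="value")
    wide = wide.reset_index()   # date and id_station become ordinary columns
    wide.columns.name = None    # remove the leftover "id_parameter" label
    return wide


# Toy long-format data with the same column names as pre_ACO.
toy = pd.DataFrame({
    "date": ["01-01-2018 01:00", "01-01-2018 01:00", "01-01-2018 02:00"],
    "id_station": ["AJM", "CUT", "AJM"],
    "id_parameter": ["PA", "PA", "PA"],
    "value": [561.0, 586.0, 560.0],
})
wide = pivot_long_to_wide(toy)
```

Note that `pivot_table` averages duplicate (date, station, parameter) entries by default, in this sketch as well as in the cells above.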

### Pollutants

In [377]:
cont_ACO_hour = pd.pivot_table(cont_ACO,index=["date","id_station"],columns=["id_parameter"])
cont_ACO_hour = cont_ACO_hour.reset_index(drop=False)
cont_ACO_hour.columns = cont_ACO_hour.columns.droplevel()
cont_ACO_hour["id_station"] = cont_ACO_hour.iloc[:,1]
cont_ACO_hour["date"] = cont_ACO_hour.iloc[:,0]
cont_ACO_hour = cont_ACO_hour.drop([""],axis=1)
cont_ACO_hour['date'] = cont_ACO_hour.apply(lambda row: time_converter(row['date']), axis=1) 
cont_ACO_hour['date'] =  pd.to_datetime(cont_ACO_hour['date'], format='%Y-%m-%d %H:%M')
cont_ACO_hour = cont_ACO_hour[[ "date" ,"id_station",'CO', 'NO', 'NO2', 'NOX', 'O3', 'PM2.5', 'PMCO', 'SO2','PM10']]

### Meteorology

In [378]:
met_ACO_hour = pd.pivot_table(met_ACO,index=["date","id_station"],columns=["id_parameter"])
met_ACO_hour = met_ACO_hour.reset_index(drop=False)
met_ACO_hour.columns = met_ACO_hour.columns.droplevel()
met_ACO_hour["id_station"] = met_ACO_hour.iloc[:,1]
met_ACO_hour["date"] = met_ACO_hour.iloc[:,0]
met_ACO_hour = met_ACO_hour.drop([""],axis=1)
met_ACO_hour['date'] = met_ACO_hour.apply(lambda row: time_converter(row['date']), axis=1) 
met_ACO_hour['date'] =  pd.to_datetime(met_ACO_hour['date'], format='%Y-%m-%d %H:%M')
met_ACO_hour = met_ACO_hour[["date","id_station","RH","TMP","WSP","WDR"]]

### Radiation

In [379]:
rad_ACO = pd.pivot_table(rad_ACO,index=["date","id_station"],columns=["id_parameter"])
rad_ACO = rad_ACO.reset_index(drop=False)
rad_ACO.columns = rad_ACO.columns.droplevel()
rad_ACO["id_station"] = rad_ACO.iloc[:,1]
rad_ACO["date"] = rad_ACO.iloc[:,0]
rad_ACO = rad_ACO.drop([""],axis=1)
rad_ACO['date'] = rad_ACO.apply(lambda row: time_converter_guion(row['date']), axis=1)
rad_ACO['date'] =  pd.to_datetime(rad_ACO['date'], format='%Y-%m-%d %H:%M')
rad_ACO = rad_ACO[[ "date","id_station",'UVA',"UVB"]]
rad_ACO = rad_ACO.dropna(how='any')

# Merging the dataframes   <a class="anchor" id="merge-bullet"></a>

Hourly: precipitation cannot be added at this resolution because the source data is daily.

We merge the dataframes:

In [380]:
pd.merge(cont_ACO_hour, met_ACO_hour, on=["date","id_station"],how="outer").id_station.unique()

array(['AJM', 'ATI', 'BJU', 'CAM', 'CCA', 'CHO', 'COY', 'CUA', 'CUT',
       'FAC', 'GAM', 'INN', 'IZT', 'LLA', 'LPR', 'MER', 'MGH', 'NEZ',
       'PED', 'SAG', 'SFE', 'SJA', 'TAH', 'TLA', 'TLI', 'UAX', 'UIZ',
       'VIF', 'XAL', 'HGM', 'ACO', 'AJU', 'MPA', 'MON'], dtype=object)

In [381]:
data_hour_merge = pd.merge(cont_ACO_hour, met_ACO_hour, on=["date","id_station"],how="outer")

In [382]:
data_hour_merge = pd.merge(data_hour_merge, pre_ACO_hour, on=["date","id_station"],how="outer")

In [383]:
data_hour_merge = pd.merge(data_hour_merge, rad_ACO, on=["date","id_station"],how="outer")
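
The three successive outer merges above share the same keys, so they could also be written as a single fold over a list of frames. The following is only a sketch on invented data; the frame names `cont`, `met` and `pre` are illustrative stand-ins, not the notebook's variables.

```python
from functools import reduce

import pandas as pd

keys = ["date", "id_station"]
ts = pd.to_datetime("2018-01-01 01:00")

# Toy frames with invented values; each covers a different set of stations.
cont = pd.DataFrame({"date": [ts, ts], "id_station": ["AJM", "ATI"],
                     "PM10": [33.0, 40.0]})
met = pd.DataFrame({"date": [ts], "id_station": ["AJM"], "TMP": [12.3]})
pre = pd.DataFrame({"date": [ts], "id_station": ["HGM"], "PA": [586.0]})

# Fold the list with an outer merge so no (date, station) pair is lost.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how="outer"),
    [cont, met, pre],
)
```

With `how="outer"`, a station that appears in only one source still gets a row, and the columns from the other sources are NaN for it. This matches the many NaN entries in the merged output shown in this notebook.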

We drop the NaN values (this step is currently commented out, so rows with missing values are kept):

In [384]:
#data_hour_merge = data_hour_merge.dropna(how='any')

We define the `hora` (hour), `dia` (day) and `mes` (month) columns:

In [385]:
data_hour_merge["hora"] = pd.DatetimeIndex(data_hour_merge['date']).hour
data_hour_merge["dia"] = pd.DatetimeIndex(data_hour_merge['date']).day
data_hour_merge["mes"]= pd.DatetimeIndex(data_hour_merge['date']).month
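
For reference, the same three fields can be read through the `.dt` accessor of a datetime column, which avoids building a `DatetimeIndex` each time. A minimal sketch on invented timestamps:

```python
import pandas as pd

# Toy frame with two invented timestamps.
df = pd.DataFrame({"date": pd.to_datetime(["2018-03-15 07:00",
                                           "2018-12-31 23:00"])})

# .dt exposes the datetime components of a Series element-wise.
df["hora"] = df["date"].dt.hour
df["dia"] = df["date"].dt.day
df["mes"] = df["date"].dt.month
```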

We rename `date` to `fecha`:

In [386]:
data_hour_merge = data_hour_merge.rename(columns={'date': 'fecha'})

In [387]:
data_hour_merge.head()

id_parameter,fecha,id_station,CO,NO,NO2,NOX,O3,PM2.5,PMCO,SO2,...,RH,TMP,WSP,WDR,PA,UVA,UVB,hora,dia,mes
0,2017-01-01 01:00:00,AJM,0.5,0.0,11.0,11.0,30.0,23.0,10.0,1.0,...,64.0,12.3,3.0,207.0,560.0,,,1,1,1
1,2017-01-01 01:00:00,ATI,0.8,,,,8.0,,,2.0,...,,,,,,,,1,1,1
2,2017-01-01 01:00:00,BJU,1.0,,,,2.0,40.0,11.0,,...,64.0,13.5,0.7,226.0,,,,1,1,1
3,2017-01-01 01:00:00,CAM,0.8,1.0,27.0,28.0,12.0,57.0,23.0,3.0,...,,,,,,,,1,1,1
4,2017-01-01 01:00:00,CCA,0.8,1.0,25.0,26.0,8.0,50.0,,1.0,...,,,,,,,,1,1,1


In [388]:
data_hour_merge = data_hour_merge[["fecha",'hora',"dia", 'mes', 'id_station','UVA',"UVB", "PA",'CO', 'NO', 'NO2', 'NOX', 'O3',
       'PM2.5', 'PMCO', 'SO2', 'RH', 'TMP', 'WSP', 'WDR', 'PM10']]

We preview the final table and export it:

In [389]:
data_hour_merge.head()

id_parameter,fecha,hora,dia,mes,id_station,UVA,UVB,PA,CO,NO,...,NOX,O3,PM2.5,PMCO,SO2,RH,TMP,WSP,WDR,PM10
0,2017-01-01 01:00:00,1,1,1,AJM,,,560.0,0.5,0.0,...,11.0,30.0,23.0,10.0,1.0,64.0,12.3,3.0,207.0,33.0
1,2017-01-01 01:00:00,1,1,1,ATI,,,,0.8,,...,,8.0,,,2.0,,,,,
2,2017-01-01 01:00:00,1,1,1,BJU,,,,1.0,,...,,2.0,40.0,11.0,,64.0,13.5,0.7,226.0,51.0
3,2017-01-01 01:00:00,1,1,1,CAM,,,,0.8,1.0,...,28.0,12.0,57.0,23.0,3.0,,,,,80.0
4,2017-01-01 01:00:00,1,1,1,CCA,,,,0.8,1.0,...,26.0,8.0,50.0,,1.0,,,,,


In [390]:
data_hour_merge.to_csv(str("/Users/danielbustillos/Documents/servicio/Contaminación PM10/Outputs/por_hora/cont_hora" + a + ".csv"))
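
The export above also writes the row index as an extra first column. If that column is not wanted when the file is read back, one option is `index=False` on write together with `parse_dates` on read. The sketch below uses an in-memory buffer and invented data instead of the real output path.

```python
import io

import pandas as pd

# Toy frame standing in for the merged table (values are invented).
df = pd.DataFrame({
    "fecha": pd.to_datetime(["2018-01-01 01:00"]),
    "id_station": ["AJM"],
    "PM10": [33.0],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # skip the row index on export
buffer.seek(0)

# parse_dates restores the datetime dtype of the fecha column.
restored = pd.read_csv(buffer, parse_dates=["fecha"])
```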

## Daily average