# Temperature Data Extraction

## Abstract

This notebook summarizes how the **temperature** data for the period 2014-2022 was prepared for five Spanish cities:

* Madrid
* Barcelona
* Sevilla
* Valencia
* Bilbao

These cities were selected for their economic importance, population, and geographic distribution, which together capture reasonably well the temperature variations experienced across the Iberian Peninsula over the same period.

The temperature data was extracted from the **Copernicus Climate Data Store** website. Copernicus is an initiative of the European Commission and the European Space Agency to build an autonomous Earth observation system for monitoring the environment, how environmental changes affect it, where those changes originate, and how they influence people's lives.

The data request is made at the following link:

https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=form


## 0. Importing and installing libraries

To read the **.grib** files returned by the **Copernicus** website, we install the following packages.

In [1]:
pip install xarray

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install eccodes

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install cfgrib

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [4]:
import requests
import json
import numpy as np
import datetime
import string
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

## 1. Reading the files

We read one of the files to see what information it provides and which fields we will keep.

In [5]:
ds_dataframe = xr.open_dataset('temperature_datasets/original/madrid_2014_2022.grib', engine='cfgrib').to_dataframe()

Can't read index file 'temperature_datasets/original/madrid_2014_2022.grib.923a8.idx'
Traceback (most recent call last):
  File "/home/berni/.local/lib/python3.10/site-packages/cfgrib/messages.py", line 547, in from_indexpath_or_filestream
    self = cls.from_indexpath(indexpath)
  File "/home/berni/.local/lib/python3.10/site-packages/cfgrib/messages.py", line 429, in from_indexpath
    index = pickle.load(file)
EOFError: Ran out of input


In [6]:
ds_dataframe

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,number,surface,valid_time,t2m
time,step,latitude,longitude,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2013-12-31,0 days 01:00:00,40.41,-3.71,0,0.0,2013-12-31 01:00:00,
2013-12-31,0 days 02:00:00,40.41,-3.71,0,0.0,2013-12-31 02:00:00,
2013-12-31,0 days 03:00:00,40.41,-3.71,0,0.0,2013-12-31 03:00:00,
2013-12-31,0 days 04:00:00,40.41,-3.71,0,0.0,2013-12-31 04:00:00,
2013-12-31,0 days 05:00:00,40.41,-3.71,0,0.0,2013-12-31 05:00:00,
...,...,...,...,...,...,...,...
2022-02-28,0 days 20:00:00,40.41,-3.71,0,0.0,2022-02-28 20:00:00,284.110107
2022-02-28,0 days 21:00:00,40.41,-3.71,0,0.0,2022-02-28 21:00:00,282.633789
2022-02-28,0 days 22:00:00,40.41,-3.71,0,0.0,2022-02-28 22:00:00,281.246094
2022-02-28,0 days 23:00:00,40.41,-3.71,0,0.0,2022-02-28 23:00:00,280.459717


To make the fields of the resulting dataframe easier to read, we write it to a .csv file and read it back.

In [7]:
ds_dataframe.to_csv(r'temperature_datasets/processed/madrid_2014_2022.csv', index=True)

In [8]:
ds_dataframe = pd.read_csv('temperature_datasets/processed/madrid_2014_2022.csv')
ds_dataframe

Unnamed: 0,time,step,latitude,longitude,number,surface,valid_time,t2m
0,2013-12-31,0 days 01:00:00,40.41,-3.71,0,0.0,2013-12-31 01:00:00,
1,2013-12-31,0 days 02:00:00,40.41,-3.71,0,0.0,2013-12-31 02:00:00,
2,2013-12-31,0 days 03:00:00,40.41,-3.71,0,0.0,2013-12-31 03:00:00,
3,2013-12-31,0 days 04:00:00,40.41,-3.71,0,0.0,2013-12-31 04:00:00,
4,2013-12-31,0 days 05:00:00,40.41,-3.71,0,0.0,2013-12-31 05:00:00,
...,...,...,...,...,...,...,...,...
71563,2022-02-28,0 days 20:00:00,40.41,-3.71,0,0.0,2022-02-28 20:00:00,284.11010
71564,2022-02-28,0 days 21:00:00,40.41,-3.71,0,0.0,2022-02-28 21:00:00,282.63380
71565,2022-02-28,0 days 22:00:00,40.41,-3.71,0,0.0,2022-02-28 22:00:00,281.24610
71566,2022-02-28,0 days 23:00:00,40.41,-3.71,0,0.0,2022-02-28 23:00:00,280.45972
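
The CSV round trip above mainly turns the index levels (`time`, `step`, `latitude`, `longitude`) into ordinary columns. As a minimal sketch, assuming a small hand-built frame that mimics that layout (two rows copied from the output above, with the `number` and `surface` columns left out), `reset_index()` should give the same flattening in memory:

```python
import pandas as pd

# Hypothetical two-row frame mimicking the layout returned by to_dataframe():
# a four-level MultiIndex plus data columns.
index = pd.MultiIndex.from_tuples(
    [
        ("2022-02-28", "0 days 22:00:00", 40.41, -3.71),
        ("2022-02-28", "0 days 23:00:00", 40.41, -3.71),
    ],
    names=["time", "step", "latitude", "longitude"],
)
sample = pd.DataFrame(
    {
        "valid_time": ["2022-02-28 22:00:00", "2022-02-28 23:00:00"],
        "t2m": [281.246094, 280.459717],
    },
    index=index,
)

# reset_index() moves every index level into a regular column.
flat = sample.reset_index()
print(flat.columns.tolist())
```

Writing the .csv is still useful when the intermediate file is needed later, as it is for the coordinate check in section 2.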


We inspect the information for each column of the dataframe.

In [9]:
ds_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71568 entries, 0 to 71567
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   time        71568 non-null  object 
 1   step        71568 non-null  object 
 2   latitude    71568 non-null  float64
 3   longitude   71568 non-null  float64
 4   number      71568 non-null  int64  
 5   surface     71568 non-null  float64
 6   valid_time  71568 non-null  object 
 7   t2m         71545 non-null  float64
dtypes: float64(4), int64(1), object(3)
memory usage: 4.4+ MB


We check that the latitude and longitude each take a single value.

In [10]:
ds_dataframe['latitude'].unique()

array([40.41])

In [11]:
ds_dataframe['longitude'].unique()

array([-3.71])

We observe that the temperature data starts on 01/01/2014 (the rows for 2013-12-31 have no `t2m` value) and that the temperature values (t2m) are given in Kelvin.

In [12]:
ds_dataframe.loc[ds_dataframe['time'] == '2013-12-31']

Unnamed: 0,time,step,latitude,longitude,number,surface,valid_time,t2m
0,2013-12-31,0 days 01:00:00,40.41,-3.71,0,0.0,2013-12-31 01:00:00,
1,2013-12-31,0 days 02:00:00,40.41,-3.71,0,0.0,2013-12-31 02:00:00,
2,2013-12-31,0 days 03:00:00,40.41,-3.71,0,0.0,2013-12-31 03:00:00,
3,2013-12-31,0 days 04:00:00,40.41,-3.71,0,0.0,2013-12-31 04:00:00,
4,2013-12-31,0 days 05:00:00,40.41,-3.71,0,0.0,2013-12-31 05:00:00,
5,2013-12-31,0 days 06:00:00,40.41,-3.71,0,0.0,2013-12-31 06:00:00,
6,2013-12-31,0 days 07:00:00,40.41,-3.71,0,0.0,2013-12-31 07:00:00,
7,2013-12-31,0 days 08:00:00,40.41,-3.71,0,0.0,2013-12-31 08:00:00,
8,2013-12-31,0 days 09:00:00,40.41,-3.71,0,0.0,2013-12-31 09:00:00,
9,2013-12-31,0 days 10:00:00,40.41,-3.71,0,0.0,2013-12-31 10:00:00,


We keep the data from 01/01/2014 onward by dropping the rows with missing values.

In [13]:
ds_dataframe = ds_dataframe.dropna().reset_index(drop=True)
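
Note that `dropna()` keeps exactly the rows from 2014 onward only because the missing values all fall on 2013-12-31. The following is a minimal sketch, on a hypothetical three-row frame with illustrative values, of checking that assumption before dropping anything:

```python
import pandas as pd

# Hypothetical miniature of the Madrid frame: one empty 2013 row, two valid rows.
sample = pd.DataFrame(
    {
        "valid_time": [
            "2013-12-31 23:00:00",
            "2014-01-01 00:00:00",
            "2014-01-01 01:00:00",
        ],
        "t2m": [float("nan"), 277.84985, 277.74854],
    }
)

# Count the missing temperatures and list the dates they belong to.
missing = sample[sample["t2m"].isna()]
print(len(missing), missing["valid_time"].str[:10].unique())

# Drop them only after confirming they all precede the period of interest.
assert (missing["valid_time"] < "2014-01-01").all()
cleaned = sample.dropna(subset=["t2m"]).reset_index(drop=True)
```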

We split the **valid_time** column into separate dates and hours. We start by converting it to a valid **datetime** format.

In [14]:
ds_dataframe['valid_time'] = pd.to_datetime(ds_dataframe['valid_time'])

From the valid_time column we can create two columns: **date** and **hour**.

In [15]:
ds_dataframe['date'] = ds_dataframe['valid_time'].dt.date

In [16]:
ds_dataframe['hour'] = ds_dataframe['valid_time'].dt.time

In [17]:
ds_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71545 entries, 0 to 71544
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   time        71545 non-null  object        
 1   step        71545 non-null  object        
 2   latitude    71545 non-null  float64       
 3   longitude   71545 non-null  float64       
 4   number      71545 non-null  int64         
 5   surface     71545 non-null  float64       
 6   valid_time  71545 non-null  datetime64[ns]
 7   t2m         71545 non-null  float64       
 8   date        71545 non-null  object        
 9   hour        71545 non-null  object        
dtypes: datetime64[ns](1), float64(4), int64(1), object(4)
memory usage: 5.5+ MB
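
As a minimal sketch on a hypothetical two-value series, this is how a datetime column splits into date and time parts. The results hold plain Python `date` and `time` objects, which would explain why `info()` reports `date` and `hour` as `object`:

```python
import datetime

import pandas as pd

# Hypothetical timestamps in the same format as valid_time.
valid_time = pd.to_datetime(pd.Series(["2014-01-01 00:00:00", "2022-02-28 23:00:00"]))

dates = valid_time.dt.date  # Series of datetime.date objects
hours = valid_time.dt.time  # Series of datetime.time objects
hour_labels = valid_time.dt.strftime("%H")  # zero-padded hour strings

print(type(dates.iloc[0]), type(hours.iloc[0]), hour_labels.tolist())
```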


Since we only need the temperature values over time, we simplify the dataframe.

In [18]:
# Drop the columns that are no longer needed.
ds_dataframe = ds_dataframe.drop(
    columns=['time', 'step', 'latitude', 'longitude', 'number', 'surface', 'valid_time']
)

In [19]:
ds_dataframe = ds_dataframe[['date', 'hour', 't2m']]

We make the following changes to the **hour** and **t2m** columns:
* **hour:** formatted as a two-digit hour (`00`).
* **t2m:** values converted to degrees Celsius.

In [20]:
ds_dataframe['hour'] = pd.to_datetime(ds_dataframe['hour'], format='%H:%M:%S').dt.strftime('%H')

In [21]:
ds_dataframe['t2m'] = ds_dataframe['t2m']-273.15
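
As a quick check of the Kelvin-to-Celsius conversion, here is a minimal sketch applied to one reading from the Madrid output shown earlier. The helper name `kelvin_to_celsius` is hypothetical and not part of the notebook:

```python
# Offset between the Kelvin and Celsius scales.
KELVIN_OFFSET = 273.15


def kelvin_to_celsius(kelvin: float) -> float:
    """Convert a temperature from Kelvin to degrees Celsius."""
    return kelvin - KELVIN_OFFSET


# Madrid reading for 2022-02-28 23:00 from the earlier output, in Kelvin.
print(round(kelvin_to_celsius(280.459717), 5))
```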

In [22]:
ds_dataframe

Unnamed: 0,date,hour,t2m
0,2014-01-01,00,4.69985
1,2014-01-01,01,4.59854
2,2014-01-01,02,4.65054
3,2014-01-01,03,4.71353
4,2014-01-01,04,4.80654
...,...,...,...
71540,2022-02-28,20,10.96010
71541,2022-02-28,21,9.48380
71542,2022-02-28,22,8.09610
71543,2022-02-28,23,7.30972


We save the resulting dataframe.

In [23]:
# to_csv() returns None when given a path, so its result is not assigned.
ds_dataframe.to_csv(r'temperature_datasets/processed/madrid_2014_2022_processed.csv', index=True)

In [24]:
ds_dataframe = pd.read_csv('temperature_datasets/processed/madrid_2014_2022_processed.csv')

In [25]:
ds_dataframe = ds_dataframe.drop("Unnamed: 0",axis=1)

In [26]:
ds_dataframe

Unnamed: 0,date,hour,t2m
0,2014-01-01,0,4.69985
1,2014-01-01,1,4.59854
2,2014-01-01,2,4.65054
3,2014-01-01,3,4.71353
4,2014-01-01,4,4.80654
...,...,...,...
71540,2022-02-28,20,10.96010
71541,2022-02-28,21,9.48380
71542,2022-02-28,22,8.09610
71543,2022-02-28,23,7.30972
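
After the round trip through the .csv file, the zero-padded hours (`00`, `01`, ...) come back as integers (`0`, `1`, ...), as the output above shows. The following is a minimal sketch, assuming a simplified two-row CSV without the index column, of keeping the padded form by passing `dtype` to `pd.read_csv`:

```python
import io

import pandas as pd

# Hypothetical CSV content in a simplified version of the processed layout.
csv_text = "date,hour,t2m\n2014-01-01,00,4.69985\n2014-01-01,01,4.59854\n"

# Default parsing infers integers, so "00" becomes 0.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype preserves the zero-padded labels.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"hour": str})

print(as_int["hour"].tolist(), as_str["hour"].tolist())
```

Whether the integer or the string form is preferable depends on how the hour is used downstream.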


## 2. Processing all files together

We define a function that wraps the whole process above so that the remaining files can be prepared in the same way.

In [27]:
def convert_file_to_csv(grib_file):
    """Convert one .grib file into a processed .csv and return the new file name."""
    # Extract the file name without its extension:
    file_name = grib_file.replace('.grib', '')

    # Read the GRIB file and convert it to a dataframe:
    ds = xr.open_dataset('temperature_datasets/original/' + grib_file, engine='cfgrib')
    ds_dataframe = ds.to_dataframe()

    # Write the raw data to .csv and read it back, which turns the index levels into columns:
    raw_csv = 'temperature_datasets/processed/' + file_name + '.csv'
    ds_dataframe.to_csv(raw_csv, index=True)
    ds_dataframe = pd.read_csv(raw_csv)

    # Create the date and hour columns (hour as a zero-padded 'HH' string):
    ds_dataframe['valid_time'] = pd.to_datetime(ds_dataframe['valid_time'])
    ds_dataframe['date'] = ds_dataframe['valid_time'].dt.date
    ds_dataframe['hour'] = ds_dataframe['valid_time'].dt.strftime('%H')

    # Keep only the needed columns, in order, and drop rows with NaN values:
    ds_dataframe = ds_dataframe[['date', 'hour', 't2m']].dropna().reset_index(drop=True)

    # Convert 't2m' from Kelvin to degrees Celsius:
    ds_dataframe['t2m'] = ds_dataframe['t2m'] - 273.15

    # Save the processed .csv file:
    new_file = file_name + '_processed.csv'
    ds_dataframe.to_csv('temperature_datasets/processed/' + new_file, index=True)

    return new_file

We process the following temperature files.

In [30]:
files = [
    'madrid_2014_2022.grib', 
    'barcelona_2014_2022.grib', 
    'sevilla_2014_2022.grib', 
    'bilbao_2014_2022.grib', 
    'valencia_2014_2022.grib'
]

# Convert every file and collect the names of the processed .csv files.
files_processed = [convert_file_to_csv(file) for file in files]

We verify that the coordinates in each file are different and correspond to the cities they represent.

In [31]:
# Read each city's intermediate .csv and print its unique coordinates.
for city in ['madrid', 'barcelona', 'bilbao', 'sevilla', 'valencia']:
    city_dataframe = pd.read_csv('temperature_datasets/processed/' + city + '_2014_2022.csv')
    print(city.capitalize())
    print('latitude:', city_dataframe['latitude'].unique())
    print('longitude:', city_dataframe['longitude'].unique())

Madrid
latitude: [40.41]
longitude: [-3.71]
Barcelona
latitude: [41.37]
longitude: [2.14]
Bilbao
latitude: [43.25]
longitude: [-2.95]
Sevilla
latitude: [37.37]
longitude: [-6.]
Valencia
latitude: [39.45]
longitude: [-0.39]
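
The visual check above can also be expressed as a small programmatic test. This is a minimal sketch that uses the coordinates printed above; the bounding box is a rough assumption made for illustration, not a value from the dataset:

```python
# Coordinates printed above, keyed by city.
coordinates = {
    "Madrid": (40.41, -3.71),
    "Barcelona": (41.37, 2.14),
    "Bilbao": (43.25, -2.95),
    "Sevilla": (37.37, -6.0),
    "Valencia": (39.45, -0.39),
}

# Every city should map to its own (latitude, longitude) pair.
all_distinct = len(set(coordinates.values())) == len(coordinates)

# Rough bounding box for peninsular Spain (an assumption made for this check).
in_range = all(
    36.0 <= lat <= 44.0 and -10.0 <= lon <= 4.0
    for lat, lon in coordinates.values()
)

print(all_distinct, in_range)
```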


We create a single dataframe with each city's temperature values over time.

In [53]:
madrid_dataframe = pd.read_csv('temperature_datasets/processed/'+'madrid_2014_2022_processed.csv')
del madrid_dataframe['Unnamed: 0']

In [54]:
madrid_dataframe

Unnamed: 0,date,hour,t2m
0,2014-01-01,0,4.69985
1,2014-01-01,1,4.59854
2,2014-01-01,2,4.65054
3,2014-01-01,3,4.71353
4,2014-01-01,4,4.80654
...,...,...,...
71540,2022-02-28,20,10.96010
71541,2022-02-28,21,9.48380
71542,2022-02-28,22,8.09610
71543,2022-02-28,23,7.30972


In [55]:
barcelona_dataframe = pd.read_csv('temperature_datasets/processed/'+'barcelona_2014_2022_processed.csv')
del barcelona_dataframe['Unnamed: 0']

In [56]:
bilbao_dataframe = pd.read_csv('temperature_datasets/processed/'+'bilbao_2014_2022_processed.csv')
del bilbao_dataframe['Unnamed: 0']

In [57]:
valencia_dataframe = pd.read_csv('temperature_datasets/processed/'+'valencia_2014_2022_processed.csv')
del valencia_dataframe['Unnamed: 0']

In [58]:
sevilla_dataframe = pd.read_csv('temperature_datasets/processed/'+'sevilla_2014_2022_processed.csv')
del sevilla_dataframe['Unnamed: 0']

In [60]:
temp_data_final = madrid_dataframe
temp_data_final['madrid_temp'] = temp_data_final['t2m']
del temp_data_final['t2m']
temp_data_final['barcelona_temp'] = barcelona_dataframe['t2m']
temp_data_final['bilbao_temp'] = bilbao_dataframe['t2m']
temp_data_final['sevilla_temp'] = sevilla_dataframe['t2m']
temp_data_final['valencia_temp'] = valencia_dataframe['t2m']

In [61]:
temp_data_final

Unnamed: 0,date,hour,madrid_temp,barcelona_temp,bilbao_temp,sevilla_temp,valencia_temp
0,2014-01-01,0,4.69985,4.82314,7.79263,9.41128,10.86538
1,2014-01-01,1,4.59854,4.57510,6.95278,9.13394,10.87760
2,2014-01-01,2,4.65054,4.45498,6.63174,9.30117,10.91030
3,2014-01-01,3,4.71353,4.49868,7.63442,9.53726,10.86562
4,2014-01-01,4,4.80654,4.52236,9.08853,9.85757,10.86440
...,...,...,...,...,...,...,...
71540,2022-02-28,20,10.96010,10.00160,10.66055,15.38467,11.31730
71541,2022-02-28,21,9.48380,9.52505,9.47500,13.25698,10.29922
71542,2022-02-28,22,8.09610,8.93716,9.06484,12.03090,9.78700
71543,2022-02-28,23,7.30972,8.22085,8.95205,12.30728,8.31997
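
Assigning the columns one by one relies on every city's frame having the same rows in the same order. The following is a minimal sketch, on two hypothetical two-row frames with values taken from the tables above, of matching rows by `date` and `hour` instead of by position:

```python
import pandas as pd

# Hypothetical per-city frames in the processed layout.
madrid = pd.DataFrame(
    {"date": ["2014-01-01"] * 2, "hour": [0, 1], "t2m": [4.69985, 4.59854]}
)
barcelona = pd.DataFrame(
    {"date": ["2014-01-01"] * 2, "hour": [0, 1], "t2m": [4.82314, 4.57510]}
)

frames = {"madrid": madrid, "barcelona": barcelona}

combined = None
for city, frame in frames.items():
    # Rename t2m so each city gets its own column.
    renamed = frame.rename(columns={"t2m": f"{city}_temp"})
    if combined is None:
        combined = renamed
    else:
        # validate="one_to_one" raises if a (date, hour) pair is duplicated.
        combined = combined.merge(
            renamed, on=["date", "hour"], how="inner", validate="one_to_one"
        )

print(combined.columns.tolist())
```

Building each column name from the dictionary key also avoids pairing a city's label with another city's data by hand.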


Finally, we save the resulting dataframe as a csv.

In [64]:
temp_data_final.to_csv(r'temperature_datasets/final/tempa_data_2014_2022.csv', index=True)
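
Because the file is written with `index=True`, reading it back yields an extra `Unnamed: 0` column, as in the earlier cells. The following is a minimal sketch, assuming a hypothetical two-row excerpt and a temporary file, of writing with `index=False` when the row index does not need to be stored:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row excerpt of the combined frame.
sample = pd.DataFrame(
    {
        "date": ["2014-01-01", "2014-01-01"],
        "hour": [0, 1],
        "madrid_temp": [4.69985, 4.59854],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name
    # index=False omits the row index, so no "Unnamed: 0" column appears on reload.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```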