# Data visualization

This Jupyter notebook allowed me to choose the station I wanted to use for my experiments. Please note that I highly recommend restarting your kernel whenever you enter a new main section (each one begins with a cell containing imports).

## All stations

This notebook only considers the stations that are both meteorological AND rainfall stations.

In this notebook, some data visualization is done on the datasets to assess their viability for my next experiment (Second experiment directory).

In [None]:
import os
import sys
import re
import pandas as pd
import numpy as np
import json

import subprocess

from matplotlib import pyplot as plt

from time import time
from threading import Thread
from threading import Lock

import sklearn as sk
import xgboost as xgb

pd.set_option('display.max_columns', None)


The **data extraction Jupyter notebook** needs to be run **before** this one.

In [None]:
if "data/output" not in os.getcwd():
    os.chdir("data/output")

Get the list of the meteorological stations.

In [None]:
station_list = !ls AlertaRio_DadosMet/full | sed "s/\.csv//g"

Get the list of the rainfall stations.

In [None]:
rainfall_station_list = !ls AlertaRio_DadosPluv/full | sed "s/\.csv//g"
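
The two `!ls ... | sed` cells rely on IPython shell magics, so they only work inside a notebook on a Unix-like system. As a rough sketch, the same listing could be done in plain Python with `pathlib`. The helper name `list_station_names` and the file names in the demonstration are my own assumptions, not part of the extraction script, and unlike `ls` this version only returns `.csv` files.

```python
from pathlib import Path
import tempfile


def list_station_names(directory):
    """Return the sorted file stems of every .csv file in `directory`."""
    return sorted(path.stem for path in Path(directory).glob("*.csv"))


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("vidigal.csv", "iraja.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_stations = list_station_names(tmp)

print(demo_stations)
```

In this notebook it would be called as `list_station_names("AlertaRio_DadosMet/full")` and `list_station_names("AlertaRio_DadosPluv/full")`.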

### Rainfall and meteorological stations

Check which stations are both meteorological and rainfall stations.

In [None]:
for station in station_list:
    if station in rainfall_station_list:
        print("OK:", station)
    else:
        print("Not Ok:", station)

The following stations are both meteorological and rainfall stations:

- alto_da_boa_vista
- guaratiba
- iraja
- jardim_botanico
- riocentro
- santa_cruz
- sao_cristovao
- vidigal
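
The membership check above can also return the two groups instead of printing them. This is a minimal sketch: the function name `split_by_membership` and the demo station `hypothetical_met_only` are assumptions made for illustration only.

```python
def split_by_membership(met_stations, rain_stations):
    """Split the meteorological stations into those that also appear in the
    rainfall list and those that do not, preserving the input order."""
    rain = set(rain_stations)  # set lookup is O(1) per station
    both = [s for s in met_stations if s in rain]
    met_only = [s for s in met_stations if s not in rain]
    return both, met_only


# Made-up lists: "hypothetical_met_only" is not a real station name.
met_demo = ["guaratiba", "iraja", "hypothetical_met_only"]
rain_demo = ["iraja", "guaratiba", "vidigal"]

both_demo, met_only_demo = split_by_membership(met_demo, rain_demo)
print("OK:", both_demo)
print("Not Ok:", met_only_demo)
```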

### Data loading

Loading the data of all the stations:

In [None]:
data = {}
sources = ['AlertaRio_DadosPluv', 'AlertaRio_DadosMet']

In [None]:
for station in station_list:
    data[station] = {}
    for source in sources:
        print(source + "/full/" + station + ".csv")
        data[station][source] = pd.read_csv(source + "/full/" + station + ".csv", sep=',')

### Checking the data

Convert the date to pandas datetime format.

In [None]:
init_time = time()
for station in station_list:
    for source in sources:
        data[station][source]['datetime'] = pd.to_datetime(data[station][source]['Dia'] + data[station][source]['Hora'], format='%d/%m/%Y%H:%M:%S')
        data[station][source].set_index('datetime', inplace=True)
print(time() - init_time)
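
To make the parsing step concrete, here is a small self-contained sketch on made-up rows. It assumes, as in the cell above, that `Dia` holds `DD/MM/YYYY` strings and `Hora` holds `HH:MM:SS` strings; the values themselves are invented.

```python
import pandas as pd

# Two invented rows using the same column names as the station files.
raw = pd.DataFrame({
    "Dia": ["01/02/2003", "13/02/2003"],
    "Hora": ["00:15:00", "23:45:00"],
    "Chuva": [0.0, 1.2],
})

# Concatenate date and time, then parse with an explicit day-first format.
raw["datetime"] = pd.to_datetime(raw["Dia"] + raw["Hora"], format="%d/%m/%Y%H:%M:%S")
raw = raw.set_index("datetime")

print(raw.index)
```

Passing an explicit `format` avoids any day/month ambiguity: `01/02/2003` is read as 1 February, not 2 January.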

Check the dates (i.e. that they were read in the right format, DD/MM/YYYY).

In [None]:
data[station][source].head(1500).tail() # Dates seem OK

Check for duplicated values (if there were any, it would mean the extraction script didn't work).

In [None]:
for station in station_list:
    for source in sources:
        print(station, source)
        print(len(data[station][source]), data[station][source].index.duplicated().sum())
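
As an illustration of what this check counts, here is a toy example with invented timestamps. `Index.duplicated()` flags every repeat after the first occurrence, while `keep=False` flags all copies, which is handy for inspecting the offending rows.

```python
import pandas as pd

# Invented index in which 00:15 appears twice.
idx = pd.to_datetime(["2003-02-01 00:00", "2003-02-01 00:15", "2003-02-01 00:15"])
demo = pd.DataFrame({"value": [1.0, 2.0, 2.0]}, index=idx)

# Number of index entries that repeat an earlier one.
n_duplicates = int(demo.index.duplicated().sum())

# All rows involved in a duplication, including the first occurrence.
duplicated_rows = demo[demo.index.duplicated(keep=False)]

print(len(demo), n_duplicates)
print(duplicated_rows)
```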

Check that the data is sorted by date (if it weren't, it would mean the extraction script didn't work).

In [None]:
is_sorted = True

for station in station_list:
    for source in sources:
        is_sorted &= data[station][source].index.is_monotonic_increasing
        is_sorted &= data[station][source].sort_index().equals(data[station][source])
print(is_sorted)
# The data is sorted by index (dates)
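
For reference, a tiny sketch of the property used above, on invented timestamps. Note that `is_monotonic_increasing` is non-strict: repeated timestamps still count as sorted, which is why duplicates are checked separately.

```python
import pandas as pd

sorted_idx = pd.to_datetime(["2003-02-01 00:00", "2003-02-01 00:15", "2003-02-01 00:30"])
shuffled_idx = pd.to_datetime(["2003-02-01 00:15", "2003-02-01 00:00", "2003-02-01 00:30"])

sorted_ok = sorted_idx.is_monotonic_increasing      # chronological order
shuffled_ok = shuffled_idx.is_monotonic_increasing  # 00:15 comes before 00:00

print(sorted_ok, shuffled_ok)
```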

Checking the type of the data.

In [None]:
for station in station_list:
    print("=====", station, "=====")
    for source in sources:
        print("\t", source)
        print(data[station][source].info())

All the features have the right format.

## Missing data

Checking the amount of missing data in each station.

Missing values in the data as loaded:

In [None]:
fig, ax = plt.subplots(len(station_list), len(sources), figsize=(17, 10 * len(station_list)))

for a, station in zip(ax[:,0], station_list):
    a.set_ylabel(station, rotation=0, size='large')
    
for a, source in zip(ax[0], sources):
    a.set_title(source)

fig.tight_layout()

for i in range(len(station_list)):
    for j in range(len(sources)):
        station = station_list[i]
        source = sources[j]
        
        N = data[station][source].shape[0] * data[station][source].shape[1]
        N_missing = data[station][source].isnull().sum().sum()
        ax[i][j].pie([N - N_missing, N_missing], autopct='%1.2f%%')
        ax[i][j].legend(["Data", "Missing data"])
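
The percentage shown in each pie is the number of empty cells divided by the total number of cells (rows times columns). A minimal sketch of that computation, with a hypothetical helper `missing_fraction` and invented values:

```python
import numpy as np
import pandas as pd


def missing_fraction(frame):
    """Fraction of cells in `frame` that are null (0.0 for an empty frame)."""
    total = frame.shape[0] * frame.shape[1]
    if total == 0:
        return 0.0
    return float(frame.isnull().sum().sum()) / total


# Invented values: 3 of the 8 cells are missing.
demo = pd.DataFrame({
    "Temperatura": [21.5, np.nan, 22.0, np.nan],
    "Umidade": [80.0, 81.0, np.nan, 79.0],
})

demo_fraction = missing_fraction(demo)
print(f"{demo_fraction:.2%}")

# Per-column share of missing values, useful to see which feature is affected.
print(demo.isnull().mean())
```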
        

Convert all the data to a 15-minute frequency. (In pandas the minute alias is `T`, also spelled `min`, because `M` stands for month.)

In [None]:
data_complete = {}

for station in station_list:
    data_complete[station] = {}
    for source in sources:
        data_complete[station][source] = data[station][source].asfreq("15T")
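
To show what `asfreq` does to the amount of missing data, here is a toy example with invented readings. Timestamps absent from the original index are inserted as rows filled with `NaN`, so a station sampled less often than every 15 minutes gains many empty rows. The sketch uses the `min` spelling of the minute alias.

```python
import pandas as pd

# Invented readings with a gap between 00:15 and 01:00.
idx = pd.to_datetime(["2003-02-01 00:00", "2003-02-01 00:15", "2003-02-01 01:00"])
demo = pd.DataFrame({"Chuva": [0.0, 0.2, 0.0]}, index=idx)

# Reindex onto a regular 15-minute grid; 00:30 and 00:45 become NaN rows.
regular = demo.asfreq("15min")

n_missing = int(regular["Chuva"].isnull().sum())
print(len(demo), len(regular), n_missing)
```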

In [None]:
fig, ax = plt.subplots(len(station_list), len(sources), figsize=(17, 10 * len(station_list)))

for a, station in zip(ax[:,0], station_list):
    a.set_ylabel(station, rotation=0, size='large')
    
for a, source in zip(ax[0], sources):
    a.set_title(source)

fig.tight_layout()

for i in range(len(station_list)):
    for j in range(len(sources)):
        station = station_list[i]
        source = sources[j]
        
        N = data_complete[station][source].shape[0] * data_complete[station][source].shape[1]
        N_missing = data_complete[station][source].isnull().sum().sum()
        ax[i][j].pie([N - N_missing, N_missing], autopct='%1.2f%%')
        ax[i][j].legend(["Data", "Missing data"])
        

As we can see, there is a lot of missing data after resampling, and this is easy to explain: some stations changed their sampling rate while in service, so every timestamp they skipped on the 15-minute grid counts as missing. São Cristóvão has a fairly low amount of missing data and does not have this frequency problem, so I will use the São Cristóvão dataset for my experiments.

## São Cristóvão dataset

In this section, some information is gathered from the dataset, such as the amount of missing data, the features, etc.

In [None]:
import os
import sys
import re
import pandas as pd
import numpy as np
import json

import subprocess

from matplotlib import pyplot as plt

from time import time
from threading import Thread
from threading import Lock

import sklearn as sk
import xgboost as xgb

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [None]:
if "data/output" not in os.getcwd():
    os.chdir("data/output")

In [None]:
translate_dict = {
    "15min" : "15min",
    "01h" : "01h",
    "04h" : "04h",
    "24h" : "24h",
    "96h" : "96h",
    "DirVento" : "WindDir",
    "Pressao" : "Pressure",
    "Temperatura" : "Temperature",
    "Umidade" : "Humidity",
    "VelVento" : "WindSpeed"
}
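
`translate_dict` is only defined here. As a sketch of how such a mapping could be applied, `DataFrame.rename` translates the Portuguese column names and leaves any column not listed in the mapping untouched. The small `translate_demo` mapping and the values below are assumptions for illustration.

```python
import pandas as pd

# Subset of the mapping above, repeated so the example is self-contained.
translate_demo = {
    "Pressao": "Pressure",
    "Temperatura": "Temperature",
    "Umidade": "Humidity",
}

# Invented single-row frame; "Chuva" is not in the mapping.
demo = pd.DataFrame({"Temperatura": [21.5], "Umidade": [80.0], "Chuva": [0.0]})

renamed = demo.rename(columns=translate_demo)
print(list(renamed.columns))
```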

### Loading data

In [None]:
data = {}
station = "sao_cristovao"
sources = ['AlertaRio_DadosPluv', 'AlertaRio_DadosMet']
data[station] = {}
for source in sources:
    data[station][source] = pd.read_csv(source + "/full/" + station + ".csv", sep=',')

for source in sources:
    data[station][source]['datetime'] = pd.to_datetime(data[station][source]['Dia'] + data[station][source]['Hora'], format='%d/%m/%Y%H:%M:%S')
    data[station][source].set_index('datetime', inplace=True)
    data[station][source] = data[station][source].asfreq("15T")["2000":"2023-05-18 02:00:00"]
data[station][sources[1]].drop(columns=["Chuva"], inplace=True)

### Checking some data content

During its first two years, the station recorded no data on wind, temperature or humidity; therefore, only the data from 2002 onward will be used.

In [None]:
data[station][sources[1]].loc['2000'].head()

In [None]:
data[station][sources[1]].loc['2000'].tail()

In [None]:
data[station][sources[1]].loc['2001-11'].tail()

### Plotting Missing Data

Amount of missing data in general.

In [None]:
fig, ax = plt.subplots(1, len(sources), figsize=(17, 10))

year = '1800'

station = "sao_cristovao"
for i, source in enumerate(sources):
    N = data[station][source][year:].shape[0] * data[station][source][year:].shape[1]
    N_missing = data[station][source][year:].isnull().sum().sum()
    ax[i].pie([N - N_missing, N_missing], autopct='%1.2f%%')
    ax[i].legend(["Data", "Missing data"])
    ax[i].set_title(source)
# Figure-level title: plt.title() would overwrite the title of the last subplot
fig.suptitle("Missing data in general at the São Cristóvão station")
    
# plt.tight_layout()
plt.savefig("Fig/Dataset-full.png", bbox_inches='tight', dpi=600)

Amount of missing data from 2002 onward.

In [None]:
fig, ax = plt.subplots(1, len(sources), figsize=(18, 10))

year = '2002'

station = "sao_cristovao"
for i, source in enumerate(sources):
    N = data[station][source][year:].shape[0] * data[station][source][year:].shape[1]
    N_missing = data[station][source][year:].isnull().sum().sum()
    ax[i].pie([N - N_missing, N_missing], autopct='%1.2f%%')
    ax[i].legend(["Data", "Missing data"])
    ax[i].set_title(source)
# Figure-level title: plt.title() would overwrite the title of the last subplot
fig.suptitle("Missing data from 2002 at the São Cristóvão station")

# plt.tight_layout()
plt.savefig("Fig/Dataset-2002.png", bbox_inches='tight', dpi=600)

### Creating one dataframe

Combine the two datasets into one.

In [None]:
drop_list = ['Dia', 'Hora']
data_features = pd.concat([data["sao_cristovao"]["AlertaRio_DadosPluv"].drop(columns=drop_list), data["sao_cristovao"]["AlertaRio_DadosMet"].drop(columns=drop_list)], axis=1)
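
With `axis=1`, `pd.concat` aligns the two frames on their index and places their columns side by side; a timestamp present in only one frame gets `NaN` in the other frame's columns. A toy sketch with invented values (the column names `15min` and `Temperatura` are borrowed from the keys of `translate_dict`, and which file each belongs to is an assumption):

```python
import pandas as pd

# Invented rainfall frame with three 15-minute readings.
rain = pd.DataFrame(
    {"15min": [0.0, 0.2, 0.0]},
    index=pd.date_range("2003-02-01 00:00", periods=3, freq="15min"),
)

# Invented meteorological frame that starts one step later.
met = pd.DataFrame(
    {"Temperatura": [21.5, 21.4]},
    index=pd.date_range("2003-02-01 00:15", periods=2, freq="15min"),
)

# Column-wise concatenation: rows are matched on the datetime index.
combined = pd.concat([rain, met], axis=1)
print(combined)
```

Because both sources were put on the same 15-minute grid and cut to the same date range beforehand, their indexes should match row for row in the real data.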

In [None]:
data_features.loc['2002'].head()