In [1]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Data integration

### References
- https://github.com/jorisvandenbossche/2015-PyDataParis
- https://github.com/FIIT-IAU/IAU-2019-2020

<!--
Tutorials on using pandas, matplotlib and numpy for data processing. They do not explain how to do exploratory analysis, only how to use the libraries.

I will pick the interesting parts from these and link them to a book chapter on processing, cleaning and transforming data.
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_10_pandas_introduction.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_11_pandas_adding_data.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_12_pandas_groupby.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_13_pandas_movies.ipynb 
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_14_pandas_reshape.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_15_pandas_transforming.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_21_pandas_processing.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_22_pandas_cleaning.ipynb

http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_23_titanic_example.ipynb
-->

### Working with a Pandas dataframe

<img src="https://github.com/FIIT-IAU/2015-PyDataParis/raw/b900fdb9f3c12e9206bb417022dd004abf023c0f/img/dataframe.png" width="50%" height="50%" />


# Case study: Air quality in Europe
**[European air quality information reported by EEA member countries](https://www.eea.europa.eu/data-and-maps/data/aqereporting-8#tab-european-data).**

AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

In [3]:
filename = "data/input/BETR8010000800100hour.1-1-1990.31-12-2012"
df = pd.read_csv(filename)
df.head()

Unnamed: 0,1990-01-01\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0\t-999.000\t0
0,1990-01-02\t-999.000\t0\t-999.000\t0\t-999.000...
1,1990-01-03\t51.000\t1\t50.000\t1\t47.000\t1\t4...
2,1990-01-04\t-999.000\t0\t-999.000\t0\t-999.000...
3,1990-01-05\t51.000\t1\t51.000\t1\t48.000\t1\t5...
4,1990-01-06\t-999.000\t0\t-999.000\t0\t-999.000...


We can see that several things went wrong during loading. Let's look at the raw data in an editor (or with `head`) before loading it again:

In [4]:
%%bash
head data/input/BETR8010000800100hour.1-1-1990.31-12-2012

1990-01-01	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0
1990-01-02	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	48.000	1	-999.000	0	-999.000	0	48.000	1	50.000	1	55.000	1	59.000	1	58.000	1	59.000	1	58.000	1	57.000	1	58.000	1	54.000	1	49.000	1	48.000	1
1990-01-03	51.000	1	50.000	1	47.000	1	48.000	1	51.000	1	52.000	1	58.000	1	57.000	1	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	69.000	1	74.000	1	-999.000	0	-999.000	0	103.000	1	84.000	1	75.000	1	-999.000	0	-999.000	0	-999.000	0
1990-01-04	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	-999.000	0	67.000	1	57.000	1	57.000	1	-999.000	0	71.000	1	74.000	1	70.000	1	70.000	1	69.000	1	65.000	1	64

So far this tells us only that the file is in **CSV format with `\t` as the value separator**, that the values are all numeric, and that the attributes are unnamed.
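
The same check can be done from Python, which also works where `head` is not available. A minimal sketch, assuming a short in-memory sample in place of the real file; the candidate separator list and the name `guess_separator` are illustrative and not part of the original notebook:

```python
import io


def guess_separator(lines, candidates=("\t", ",", ";")):
    """Return the first candidate that occurs equally often (at least once) on every line."""
    for sep in candidates:
        counts = {line.count(sep) for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return sep
    return None


# Stand-in for the first lines of the station file (shortened to two hours per day).
sample = io.StringIO(
    "1990-01-01\t-999.000\t0\t-999.000\t0\n"
    "1990-01-02\t48.000\t1\t50.000\t1\n"
)
first_lines = [line.rstrip("\n") for line in sample]
separator = guess_separator(first_lines)
print(repr(separator), "fields per line:", first_lines[0].count(separator) + 1)
```

A separator that appears the same number of times on every line is only a heuristic; it should still be confirmed by looking at the parsed result.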

In [5]:
%%bash
ls -lh data/input/BETR8010000800100hour.1-1-1990.31-12-2012

-rw-r--r--  1 giangnguyen  staff   1.9M Sep  5 12:01 data/input/BETR8010000800100hour.1-1-1990.31-12-2012


In [6]:
%%bash 
wc -l data/input/BETR8010000800100hour.1-1-1990.31-12-2012

    8392 data/input/BETR8010000800100hour.1-1-1990.31-12-2012


So there is not that much data, and we can safely load all of it into memory.

In [7]:
data = pd.read_csv(filename, 
                   sep='\t', 
                   header=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
0,1990-01-01,-999.0,0,-999.0,0,-999.0,0,-999.0,0,-999.0,...,-999.0,0,-999.0,0,-999.0,0,-999.0,0,-999.0,0
1,1990-01-02,-999.0,0,-999.0,0,-999.0,0,-999.0,0,-999.0,...,57.0,1,58.0,1,54.0,1,49.0,1,48.0,1
2,1990-01-03,51.0,1,50.0,1,47.0,1,48.0,1,51.0,...,84.0,1,75.0,1,-999.0,0,-999.0,0,-999.0,0
3,1990-01-04,-999.0,0,-999.0,0,-999.0,0,-999.0,0,-999.0,...,69.0,1,65.0,1,64.0,1,60.0,1,59.0,1
4,1990-01-05,51.0,1,51.0,1,48.0,1,50.0,1,51.0,...,-999.0,0,-999.0,0,-999.0,0,-999.0,0,-999.0,0


We have 49 columns: a date and 48 further numeric attributes. Every second one appears to be binary, probably some kind of flag.

The data consist of measurements of some quantity, apparently one for each hour of the day.

There is one row per day. Each hour has its own column, plus a column for a flag that we do not care about for now.

There are some odd values that probably should not be there: -999 and -9999.

The date will probably serve as the index.
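
Before dropping the flag columns it is worth testing these guesses in code. A small sketch on an in-memory sample (two hours per day instead of 24); the sample rows and variable names are made up for illustration:

```python
import io

import pandas as pd

# Made-up sample in the same layout: date, then (value, flag) pairs.
sample = (
    "1990-01-01\t-999.000\t0\t-999.000\t0\n"
    "1990-01-02\t48.000\t1\t-999.000\t0\n"
    "1990-01-03\t51.000\t1\t50.000\t1\n"
)
toy = pd.read_csv(io.StringIO(sample), sep="\t", header=None, index_col=0)

values = toy.iloc[:, 0::2]  # measurement columns
flags = toy.iloc[:, 1::2]   # every second column, assumed to be a flag

# Guess 1: the flag columns contain only 0 and 1.
flags_are_binary = bool(flags.isin([0, 1]).all().all())

# Guess 2: a flag of 0 marks exactly the sentinel (missing) values.
is_sentinel = values.isin([-999.0, -9999.0]).to_numpy()
flag_matches_sentinel = bool((is_sentinel == (flags.to_numpy() == 0)).all())

print(flags_are_binary, flag_matches_sentinel)
```

If the second check failed on the real file, the flags would carry information beyond "missing" and should not be dropped blindly.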

In [8]:
data = pd.read_csv(filename, 
                   sep='\t', 
                   header=None,
                   na_values=[-999, -9999], 
                   index_col=0
                  )
data.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,39,40,41,42,43,44,45,46,47,48
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1990-01-01,,0,,0,,0,,0,,0,...,,0,,0,,0,,0,,0
1990-01-02,,0,,0,,0,,0,,0,...,57.0,1,58.0,1,54.0,1,49.0,1,48.0,1
1990-01-03,51.0,1,50.0,1,47.0,1,48.0,1,51.0,1,...,84.0,1,75.0,1,,0,,0,,0
1990-01-04,,0,,0,,0,,0,,0,...,69.0,1,65.0,1,64.0,1,60.0,1,59.0,1
1990-01-05,51.0,1,51.0,1,48.0,1,50.0,1,51.0,1,...,,0,,0,,0,,0,,0


In [9]:
# let's drop the flags we are not interested in; they happen to be every second column
data.columns[1::2]

Int64Index([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
            36, 38, 40, 42, 44, 46, 48],
           dtype='int64')

In [10]:
data = data.drop(data.columns[1::2], 
                 axis=1)
data.head()

Unnamed: 0_level_0,1,3,5,7,9,11,13,15,17,19,...,29,31,33,35,37,39,41,43,45,47
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1990-01-01,,,,,,,,,,,...,,,,,,,,,,
1990-01-02,,,,,,,,,,48.0,...,55.0,59.0,58.0,59.0,58.0,57.0,58.0,54.0,49.0,48.0
1990-01-03,51.0,50.0,47.0,48.0,51.0,52.0,58.0,57.0,,,...,69.0,74.0,,,103.0,84.0,75.0,,,
1990-01-04,,,,,,,,,,,...,,71.0,74.0,70.0,70.0,69.0,65.0,64.0,60.0,59.0
1990-01-05,51.0,51.0,48.0,50.0,51.0,58.0,65.0,66.0,69.0,74.0,...,,,,,,,,,,


In [11]:
# let's give the resulting columns sensible names
["{:02d}".format(i) for i in range(len(data.columns))]

['00',
 '01',
 '02',
 '03',
 '04',
 '05',
 '06',
 '07',
 '08',
 '09',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23']

In [12]:
# the column names are a mess, so replace them with zero-padded hour labels
data.columns = ["{:02d}".format(i) for i in range(len(data.columns))]
data.head()

Unnamed: 0_level_0,00,01,02,03,04,05,06,07,08,09,...,14,15,16,17,18,19,20,21,22,23
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1990-01-01,,,,,,,,,,,...,,,,,,,,,,
1990-01-02,,,,,,,,,,48.0,...,55.0,59.0,58.0,59.0,58.0,57.0,58.0,54.0,49.0,48.0
1990-01-03,51.0,50.0,47.0,48.0,51.0,52.0,58.0,57.0,,,...,69.0,74.0,,,103.0,84.0,75.0,,,
1990-01-04,,,,,,,,,,,...,,71.0,74.0,70.0,70.0,69.0,65.0,64.0,60.0,59.0
1990-01-05,51.0,51.0,48.0,50.0,51.0,58.0,65.0,66.0,69.0,74.0,...,,,,,,,,,,


**Let's move each measurement to its own row**

In [13]:
data = data.stack()
data.head()

1990-01-02  09    48.0
            12    48.0
            13    50.0
            14    55.0
            15    59.0
dtype: float64

In [14]:
# the result of the reshaping is a Series with a multi-level index, not a DataFrame
type(data)  

pandas.core.series.Series
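
`stack()` moves the column labels into a second index level, so the result is a `Series` indexed by (date, hour). A toy sketch with made-up values; the explicit `dropna()` is an assumption added here so that missing measurements are removed regardless of how the installed pandas version's `stack` treats NaN:

```python
import numpy as np
import pandas as pd

# Two days, two hourly columns, with one missing value per day.
wide = pd.DataFrame(
    {"00": [np.nan, 51.0], "01": [48.0, np.nan]},
    index=["1990-01-02", "1990-01-03"],
)

# One row per (date, hour) pair; rows without a measurement are removed.
long = wide.stack().dropna()
print(long)
print(type(long.index).__name__, long.index.nlevels)
```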

In [15]:
# give the column a sensible name, e.g. the name of the measuring station, which is part of the file name
_, fname = os.path.split(filename)
station = fname[:7]
print(filename)
print(station)

data/input/BETR8010000800100hour.1-1-1990.31-12-2012
BETR801


In [16]:
# reset_index turns it into a DataFrame
data = data.reset_index(name=station) 
# data = data.reset_index() 

print(type(data))
data.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,level_1,BETR801
0,1990-01-02,9,48.0
1,1990-01-02,12,48.0
2,1990-01-02,13,50.0
3,1990-01-02,14,55.0
4,1990-01-02,15,59.0


In [17]:
data = data.rename(columns = {0:'date', 'level_1':'hour'})
data.head()

Unnamed: 0,date,hour,BETR801
0,1990-01-02,9,48.0
1,1990-01-02,12,48.0
2,1990-01-02,13,50.0
3,1990-01-02,14,55.0
4,1990-01-02,15,59.0


In [18]:
# now build a new index from the date and the hour
data.index = pd.to_datetime(data['date'] + ' ' + data['hour'])
data.head()

Unnamed: 0,date,hour,BETR801
1990-01-02 09:00:00,1990-01-02,9,48.0
1990-01-02 12:00:00,1990-01-02,12,48.0
1990-01-02 13:00:00,1990-01-02,13,50.0
1990-01-02 14:00:00,1990-01-02,14,55.0
1990-01-02 15:00:00,1990-01-02,15,59.0


In [19]:
# drop the columns we no longer need
data = data.drop(['date', 'hour'], axis=1)
data.head()

Unnamed: 0,BETR801
1990-01-02 09:00:00,48.0
1990-01-02 12:00:00,48.0
1990-01-02 13:00:00,50.0
1990-01-02 14:00:00,55.0
1990-01-02 15:00:00,59.0


# The code above for one station is collected in the Python file `airbase.py`
**We are going to work with more stations.**
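
The contents of `airbase.py` are not shown here. As a rough sketch of how the steps above might be wrapped into a reusable function and applied to several stations, assuming every station file has the same layout: `read_station`, the station name `STATION_B` and the in-memory samples are hypothetical stand-ins for the real module and files.

```python
import io

import pandas as pd


def read_station(source, station):
    """Hypothetical helper: parse one hourly file into a one-column DataFrame."""
    wide = pd.read_csv(source, sep="\t", header=None,
                       na_values=[-999, -9999], index_col=0)
    wide = wide.drop(wide.columns[1::2], axis=1)  # drop the flag columns
    wide.columns = ["{:02d}".format(i) for i in range(len(wide.columns))]

    # One row per measurement; dropna() keeps this independent of stack's NaN handling.
    long = wide.stack().dropna().reset_index(name=station)
    long.columns = ["date", "hour", station]
    long.index = pd.to_datetime(long["date"] + " " + long["hour"],
                                format="%Y-%m-%d %H")
    return long.drop(["date", "hour"], axis=1)


# Made-up samples (two hours per day) standing in for two station files.
samples = {
    "BETR801": "1990-01-02\t48.000\t1\t-999.000\t0\n"
               "1990-01-03\t51.000\t1\t50.000\t1\n",
    "STATION_B": "1990-01-02\t30.000\t1\t31.000\t1\n",
}
frames = [read_station(io.StringIO(text), name) for name, text in samples.items()]

# Align the stations on their timestamps; gaps become NaN.
combined = pd.concat(frames, axis=1)
print(combined)
```

Concatenating along `axis=1` keeps one column per station and one row per timestamp, which is the shape the plots and resampling calls later in the notebook expect.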

In [20]:
import airbase
no2 = airbase.load_data()

FileNotFoundError: [Errno 2] No such file or directory: 'data/BETR8010000800100hour.1-1-1990.31-12-2012'

In [None]:
no2.head(3)

In [None]:
no2.tail()

In [None]:
no2.info()

In [None]:
no2.describe()

In [None]:
no2.plot(kind='box')

In [None]:
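# a boxplot can also show outliers
# (data= makes the wide-form input explicit; flierprops styles the outlier markers)
sns.boxplot(data=no2, flierprops={'marker': '.', 'markeredgecolor': 'k'})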
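
The points a boxplot draws as outliers are, with matplotlib's default settings, the values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule on made-up numbers:

```python
import pandas as pd

values = pd.Series([10, 11, 11, 12, 12, 13, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a point is drawn individually instead of inside the whiskers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

A point flagged this way is only unusual relative to the bulk of the data; whether it is a measurement error still has to be judged from context.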

In [None]:
no2['BETN029'].plot(kind='hist', bins=50)

In [None]:
sns.violinplot(data=no2)

In [None]:
# a first plot of everything
no2.plot(figsize=(12,6))

In [None]:
# I can ask for just a smaller part
no2[-500:].plot(figsize=(12,6))

**Or use more interesting time-series operations**

In [None]:
# since the index consists of timestamps, I can do interesting things with it
no2.index 

In [None]:
# for example, define ranges using date strings
no2["2010-01-01 09:00": "2010-01-01 12:00"] 

In [None]:
# or select all the data from one particular year
# (.loc is needed here: recent pandas versions read no2['2012'] as a column lookup)
no2.loc['2012']
# no2.loc['2012'].head()

# or only the data from March
# no2.loc['2012-03']
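
String-based selection is easiest to see on a tiny frame. A sketch with a made-up hourly index that crosses a year boundary; using `.loc` makes it explicit that rows, not columns, are being selected:

```python
import pandas as pd

# Six hourly timestamps: two at the end of 2011, four at the start of 2012.
idx = pd.date_range("2011-12-31 22:00", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"A": range(6)}, index=idx)

# A year string selects every row whose timestamp falls in that year.
year_2012 = toy.loc["2012"]

# A string slice includes both end points.
last_two_of_2011 = toy.loc["2011-12-31 22:00":"2011-12-31 23:00"]

print(len(year_2012), len(last_two_of_2011))
```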

In [None]:
# the components of the date are accessible from the index
# no2.index.hour
no2.index.year

In [None]:
# and, more interestingly, I can change the sampling frequency
no2.resample('D').mean().head()

In [None]:
# is there some seasonality?
no2.resample('M').mean().plot()

In [None]:
# a long-term trend?
no2.resample('A').mean().plot()

In [None]:
# weekly seasonality?
no2['2012-3':'2012-4'].resample('D').mean().plot()

In [None]:
# I can also apply several aggregation functions and compare them
no2.loc['2009':, 'FR04037'].resample('M').agg(['mean', 'median']).plot()
# no2.loc['2009':, 'FR04037'].resample('M').agg(['mean', 'std']).plot()

## Beware: resample != groupby

In [None]:
# This is a time series at monthly granularity: values are averaged within each calendar month
no2.resample('M').mean().plot()

In [None]:
# Here all values for the month with a given number are averaged together, across years.
# The result is the average course of the value over a year, at monthly granularity.
no2.groupby(no2.index.month).mean().plot()
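
The difference shows up directly in the shape of the result. A sketch on two years of made-up daily data; the rule is passed as a `pd.offsets.MonthEnd()` object because the string alias for month-end frequency differs between pandas versions (`'M'` in older releases, `'ME'` from 2.2 on):

```python
import pandas as pd

# Two full years of daily values; each value is simply the month number.
days = pd.date_range("2011-01-01", "2012-12-31", freq="D")
toy = pd.Series(days.month.to_numpy(), index=days, dtype="float64")

# resample: one value per calendar month, so the years stay separate.
per_calendar_month = toy.resample(pd.offsets.MonthEnd()).mean()

# groupby: one value per month number, pooled over all years.
per_month_number = toy.groupby(toy.index.month).mean()

print(len(per_calendar_month), len(per_month_number))
```

`resample` therefore answers "how did the value develop over time", while `groupby` on the month number answers "what does a typical year look like".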

# Summary: what to take away from this EDA

* Make sure the data are encoded correctly (most often this means looking into the data manually).
* Make sure the data fall within the expected range and all have the expected shape (for example, the time format).
* Never change the data manually. Always use code that you keep and rerun whenever you repeat the experiment, so that the analysis is reproducible.
* Plot everything you can, to confirm visually that things are the way they should be.
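
The first two points can be turned into code that runs every time the data is loaded. A sketch of such checks, assuming a frame shaped like the one built earlier (a `DatetimeIndex` and one numeric column per station); the function name `check_measurements` and the upper bound of 1000 are arbitrary placeholders, not thresholds from the data provider:

```python
import numpy as np
import pandas as pd


def check_measurements(df, upper=1000.0):
    """Raise AssertionError if the frame violates basic expectations."""
    assert isinstance(df.index, pd.DatetimeIndex), "index must hold timestamps"
    assert df.index.is_monotonic_increasing, "timestamps must be sorted"
    assert not df.index.has_duplicates, "timestamps must be unique"

    observed = df.stack().dropna()  # all non-missing measurements
    assert not observed.isin([-999.0, -9999.0]).any(), "sentinel values left in data"
    assert observed.between(0, upper).all(), "value outside the expected range"
    return True


idx = pd.to_datetime(["1990-01-02 09:00", "1990-01-02 10:00"])
good = pd.DataFrame({"BETR801": [48.0, np.nan]}, index=idx)
bad = pd.DataFrame({"BETR801": [48.0, -999.0]}, index=idx)

print(check_measurements(good))
try:
    check_measurements(bad)
except AssertionError as err:
    print("rejected:", err)
```

Keeping the checks in a function means they are rerun with every repetition of the experiment, in line with the reproducibility point above.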