# Group assignment Pandas - dataset SYKE

For this part of the assignment we decided to use the open data provided by the Finnish Environment Institute (SYKE), specifically the Hydrology section of its Environmental data API: https://www.syke.fi/en-US/Open_information/Open_web_services/Environmental_data_API.

To download the data and save it in CSV format we used https://pragmatiqa.com/xodata/#, which let us download batches of up to 50000 rows; SYKE's API itself doesn't allow automatic downloads of batches bigger than 500 rows.
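The batching idea can also be sketched in Python. This is a minimal sketch and not the method we used: `fetch_page` is a hypothetical callable standing in for one request to an OData-style endpoint with skip/top paging, and the page size of 500 is the limit mentioned above. An in-memory fake replaces the real API so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


def iter_rows(fetch_page: Callable[[int, int], List[Row]], page_size: int = 500) -> Iterator[Row]:
    """Yield rows page by page until the source returns a short or empty page."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:
            break
        skip += page_size


# Hypothetical stand-in for the real API: 1234 made-up rows served in slices.
_FAKE_ROWS = [{"Paikka_Id": 400 + i % 10, "Arvo": i} for i in range(1234)]


def fake_fetch_page(skip: int, top: int) -> List[Row]:
    return _FAKE_ROWS[skip:skip + top]


rows = list(iter_rows(fake_fetch_page))
print(len(rows))
```

Injecting the page-fetching function keeps the paging logic separate from the HTTP details, which we deliberately leave out here.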

Our goal is to combine SYKE's datasets on ice thickness and water temperature in Finnish lakes, covering the whole period over which they have been measured, and to analyse the changes over that time frame. We retrieved several simple datasets from SYKE's database, which will be merged to get more comprehensive data for further analysis. Since each dataset depends on how temperatures developed in the past, we expect a strong correlation across these datasets.

<img src="https://drive.google.com/uc?export=download&id=1TDvTLatcpcrsKhrjSpiVVKBRMc-RQne8" height=1200 width=750 alt="Pond hockey on Saimaa lake in Mikkeli (2018)">

Pond hockey on frozen Saimaa lake in Mikkeli (©Teemu Paappanen, 2018)

In [172]:
import pandas as pd
import numpy as np

## Combining data of ice thickness

All files are saved as CSV, so we use the pandas read_csv() function to load them into the Jupyter notebook.

In [173]:
ice_thickness1 = pd.read_csv("Ice_thickness_1_4.csv", sep=",")
ice_thickness2 = pd.read_csv("Ice_thickness_2_4.csv", sep=",")
ice_thickness3 = pd.read_csv("Ice_thickness_3_4.csv", sep=",")
ice_thickness4 = pd.read_csv("Ice_thickness_4_4.csv", sep=",")

During the download from SYKE's API, every file except the last one was cut off in the middle of a date, and the following file starts again from the beginning of that date. This is visible in the last and first rows of the files printed below, and it has to be addressed when combining the files.

In [174]:
print(ice_thickness1.tail(2), "\n")
print(ice_thickness2.head(2), "\n")
print(ice_thickness2.tail(2), "\n")
print(ice_thickness3.head(2), "\n")
print(ice_thickness3.tail(2), "\n")
print(ice_thickness4.head(2), "\n")
print(ice_thickness4.tail(2))

       Paikka_Id                 Aika  Arvo  Lippu_Id
49998        495  1987-03-20T00:00:00    22        37
49999        496  1987-03-20T00:00:00    21        37 

   Paikka_Id                 Aika  Arvo  Lippu_Id
0        401  1987-03-20T00:00:00    17        37
1        402  1987-03-20T00:00:00    18        37 

       Paikka_Id                 Aika  Arvo  Lippu_Id
49998        413  2000-02-28T00:00:00    52        41
49999        421  2000-02-28T00:00:00    39        41 

   Paikka_Id                 Aika  Arvo  Lippu_Id
0        402  2000-02-28T00:00:00     0        37
1        409  2000-02-28T00:00:00    19        37 

       Paikka_Id                 Aika  Arvo  Lippu_Id
49998        495  2016-11-20T00:00:00     0        37
49999        503  2016-11-20T00:00:00     0        37 

   Paikka_Id                 Aika  Arvo  Lippu_Id
0        449  2016-11-20T00:00:00     0        37
1        470  2016-11-20T00:00:00     0        37 

       Paikka_Id                 Aika  Arvo  Lippu_I

We'll concatenate all four files into a single DataFrame and rename the columns to their English translations.

In [175]:
ice_thickness = pd.concat([ice_thickness1, ice_thickness2, ice_thickness3, ice_thickness4], ignore_index=True)
ice_thickness.columns = ["Place_Id", "Date", "Value", "Flag_Id"]
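
The effect of the cut at the file boundaries can be illustrated on a small scale. The snippet below is a minimal sketch with made-up rows (the frames `part1` and `part2` are hypothetical stand-ins for two consecutive download files): one row is repeated verbatim in both parts, and it disappears once duplicates are dropped after concatenation.

```python
import pandas as pd

# Made-up stand-ins for two consecutive download files; the boundary date
# 1987-03-20 appears in both, and one row is repeated verbatim.
part1 = pd.DataFrame({
    "Paikka_Id": [495, 496],
    "Aika": ["1987-03-10T00:00:00", "1987-03-20T00:00:00"],
    "Arvo": [22, 21],
    "Lippu_Id": [37, 37],
})
part2 = pd.DataFrame({
    "Paikka_Id": [496, 401],
    "Aika": ["1987-03-20T00:00:00", "1987-03-20T00:00:00"],
    "Arvo": [21, 17],
    "Lippu_Id": [37, 37],
})

# Stack the parts and renumber the index from 0.
combined = pd.concat([part1, part2], ignore_index=True)
combined.columns = ["Place_Id", "Date", "Value", "Flag_Id"]

# Count fully identical rows, then drop them.
n_repeated = int(combined.duplicated().sum())
deduplicated = combined.drop_duplicates(ignore_index=True)
print(n_repeated, len(deduplicated))
```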

In [176]:
ice_thickness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164868 entries, 0 to 164867
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Place_Id  164868 non-null  int64 
 1   Date      164868 non-null  object
 2   Value     164868 non-null  int64 
 3   Flag_Id   164868 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 5.0+ MB


As the DataFrame overview above shows, the **Date** column's Dtype is *object*, so we'll convert it to Pandas' *datetime* to make it easier to work with during our analysis. We also create a column called **Place_date** by combining the **Place_Id** and **Date** columns, and then remove any duplicate rows.

In [177]:
ice_thickness["Date"] = pd.to_datetime(ice_thickness["Date"])
ice_thickness["Place_date"] = ice_thickness.Place_Id.astype(str).str.cat(ice_thickness.Date.astype(str), sep="_")
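
The composite key is one way to identify a measurement place on a given day. As a minimal sketch on made-up rows (the frame `toy` is hypothetical), the same duplicates can also be found by passing both columns as `subset` to `duplicated()`, without building a string key first:

```python
import pandas as pd

# Made-up rows: the first two describe the same place on the same day.
toy = pd.DataFrame({
    "Place_Id": [401, 401, 402],
    "Date": ["1987-03-20T00:00:00", "1987-03-20T00:00:00", "1987-03-20T00:00:00"],
    "Value": [17, 17, 18],
})

# Parse the ISO timestamps into datetime64 values.
toy["Date"] = pd.to_datetime(toy["Date"])

# Composite string key: "<place id>_<date>".
toy["Place_date"] = toy["Place_Id"].astype(str) + "_" + toy["Date"].dt.strftime("%Y-%m-%d")

# Both approaches flag the same rows as repeats.
by_key = toy.duplicated(subset="Place_date")
by_columns = toy.duplicated(subset=["Place_Id", "Date"])
print(toy["Place_date"].tolist())
```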

In [178]:
ice_thickness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164868 entries, 0 to 164867
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Place_Id    164868 non-null  int64         
 1   Date        164868 non-null  datetime64[ns]
 2   Value       164868 non-null  int64         
 3   Flag_Id     164868 non-null  int64         
 4   Place_date  164868 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 6.3+ MB


In [179]:
ice_thickness.drop_duplicates(inplace=True)

In [180]:
ice_thickness.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 164748 entries, 0 to 164867
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Place_Id    164748 non-null  int64         
 1   Date        164748 non-null  datetime64[ns]
 2   Value       164748 non-null  int64         
 3   Flag_Id     164748 non-null  int64         
 4   Place_date  164748 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 7.5+ MB


Finally, we'll check whether any data is missing; rows with missing values would be removed from the DataFrame.

In [181]:
ice_thickness.isna().sum()

Place_Id      0
Date          0
Value         0
Flag_Id       0
Place_date    0
dtype: int64
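
Every count above is zero, so nothing has to be removed here. For completeness, this is a minimal sketch of how such rows could be removed if there were any; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing measurement.
toy = pd.DataFrame({
    "Place_Id": [401, 402, 403],
    "Value": [17.0, np.nan, 19.0],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value and renumber the index.
cleaned = toy.dropna().reset_index(drop=True)
print(missing_per_column.to_dict(), len(cleaned))
```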

## Combining data of water temperatures

Again, all files are saved as CSV, so we use the Pandas read_csv() function to load them into the Jupyter notebook. However, since there are 10 files in total this time, we'll use a loop to simplify the DataFrame creation process.

In [182]:
# Read all 10 files and concatenate them in one step
# (DataFrame.append is deprecated since pandas 1.4, so pd.concat is used instead)
water_files = [pd.read_csv(f"Surface_Water_Temperature_{i}_10.csv", sep=",") for i in range(1, 11)]
water_temperature = pd.concat(water_files, ignore_index=True)

In [183]:
water_temperature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473104 entries, 0 to 473103
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Paikka_Id  473104 non-null  int64  
 1   Aika       473104 non-null  object 
 2   Arvo       473104 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 10.8+ MB


In [184]:
water_temperature.head()

| | Paikka_Id | Aika | Arvo |
|---|---|---|---|
| 0 | 1667 | 1916-04-30T00:00:00 | 1.0 |
| 1 | 1667 | 1916-05-01T00:00:00 | 1.0 |
| 2 | 1667 | 1916-05-02T00:00:00 | 1.0 |
| 3 | 1667 | 1916-05-03T00:00:00 | 2.0 |
| 4 | 1667 | 1916-05-04T00:00:00 | 2.0 |


In [185]:
water_temperature.tail()

| | Paikka_Id | Aika | Arvo |
|---|---|---|---|
| 473099 | 2604 | 2020-11-26T00:00:00 | 4.7 |
| 473100 | 2605 | 2020-11-26T00:00:00 | 5.6 |
| 473101 | 2606 | 2020-11-26T00:00:00 | 3.4 |
| 473102 | 2608 | 2020-11-26T00:00:00 | 0.2 |
| 473103 | 3094 | 2020-11-26T00:00:00 | 4.5 |


As with the ice thickness DataFrame, we'll rename the columns to English, change the **Date** column's Dtype to *datetime*, create a new column called **Place_date** by combining the **Place_Id** and **Date** columns, and then remove any duplicates.

In [186]:
water_temperature.columns = ["Place_Id", "Date", "Value"]
water_temperature["Date"] = pd.to_datetime(water_temperature["Date"])
water_temperature["Place_date"] = water_temperature.Place_Id.astype(str).str.cat(water_temperature.Date.astype(str), sep="_")

In [187]:
water_temperature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473104 entries, 0 to 473103
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Place_Id    473104 non-null  int64         
 1   Date        473104 non-null  datetime64[ns]
 2   Value       473104 non-null  float64       
 3   Place_date  473104 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 14.4+ MB


In [188]:
water_temperature.drop_duplicates(inplace=True)
water_temperature.reset_index(inplace=True, drop=True)

In [189]:
water_temperature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 472962 entries, 0 to 472961
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Place_Id    472962 non-null  int64         
 1   Date        472962 non-null  datetime64[ns]
 2   Value       472962 non-null  float64       
 3   Place_date  472962 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 14.4+ MB


Again we'll check whether there's any missing data and if so remove such rows.

In [190]:
water_temperature.isna().sum()

Place_Id      0
Date          0
Value         0
Place_date    0
dtype: int64

## Brief overview of both datasets

In [191]:
ice_thickness.describe(include="all", datetime_is_numeric=True)

| | Place_Id | Date | Value | Flag_Id | Place_date |
|---|---|---|---|---|---|
| count | 164748.0 | 164748 | 164748.0 | 164748.0 | 164748 |
| unique | | | | | 51857 |
| top | | | | | 509_2018-04-30 |
| freq | | | | | 7 |
| mean | 652.294425 | 1994-12-06 22:58:06.983789696 | 27.487593 | 43.028456 | |
| min | 401.0 | 1910-01-15 00:00:00 | 0.0 | 37.0 | |
| 25% | 440.0 | 1985-03-20 00:00:00 | 10.0 | 38.0 | |
| 50% | 479.0 | 1995-01-30 00:00:00 | 24.0 | 40.0 | |
| 75% | 521.0 | 2008-05-10 00:00:00 | 43.0 | 41.0 | |
| max | 3809.0 | 2020-11-25 00:00:00 | 126.0 | 114.0 | |


In [192]:
water_temperature.describe(include="all", datetime_is_numeric=True)

| | Place_Id | Date | Value | Place_date |
|---|---|---|---|---|
| count | 472962.0 | 472962 | 472962.0 | 472962 |
| unique | | | | 472962 |
| top | | | | 1658_1998-11-04 00:00:00 |
| freq | | | | 1 |
| mean | 1795.825096 | 1984-06-12 08:01:03.024093312 | 10.261295 | |
| min | 1657.0 | 1916-04-30 00:00:00 | 0.0 | |
| 25% | 1670.0 | 1970-09-01 00:00:00 | 3.8 | |
| 50% | 1685.0 | 1983-01-17 00:00:00 | 10.9 | |
| 75% | 1707.0 | 2002-07-02 00:00:00 | 16.2 | |
| max | 3485.0 | 2020-11-26 00:00:00 | 28.8 | |


From the overviews above ...

In [210]:
ice_thickness["Flag_Id"].value_counts()

41     51657
37     34141
38     32185
40     23718
39     14270
112     2920
114     2857
113     2824
43       176
Name: Flag_Id, dtype: int64

In [254]:
flag_ids = list(ice_thickness["Flag_Id"].unique())


In [255]:
flag_ids

[41, 37, 38, 40, 39, 112, 113, 114, 43]

In [199]:
lippu_file = pd.read_json("Lippu.json")

In [202]:
lippu_file.head()

| | odata.metadata | value |
|---|---|---|
| 0 | http://rajapinnat.ymparisto.fi/api/Hydrologiar... | {'Lippu_id': 1, 'LippuKoodi': '\*', 'Kuvaus': '... |
| 1 | http://rajapinnat.ymparisto.fi/api/Hydrologiar... | {'Lippu_id': 2, 'LippuKoodi': '"', 'Kuvaus': '... |
| 2 | http://rajapinnat.ymparisto.fi/api/Hydrologiar... | {'Lippu_id': 3, 'LippuKoodi': '=', 'Kuvaus': '... |
| 3 | http://rajapinnat.ymparisto.fi/api/Hydrologiar... | {'Lippu_id': 4, 'LippuKoodi': '!', 'Kuvaus': '... |
| 4 | http://rajapinnat.ymparisto.fi/api/Hydrologiar... | {'Lippu_id': 5, 'LippuKoodi': '00', 'Kuvaus': ... |


In [207]:
lippu = pd.json_normalize(data=lippu_file["value"])

In [218]:
lippu.head()

| | Lippu_id | LippuKoodi | Kuvaus | KuvausEng |
|---|---|---|---|---|
| 0 | 1 | \* | Interpoloitu | |
| 1 | 2 | " | Muu vertailu | |
| 2 | 3 | = | Redukoitu | |
| 3 | 4 | ! | Mahdollisesti virheellinen arvo | |
| 4 | 5 | 00 | Ei huomauttamista | |
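
The flattening step can be tried out without the file. This is a minimal sketch on a made-up payload that only imitates the shape shown above (a metadata string plus a list of records under "value"); the URL and the two records are placeholders, not the real content of Lippu.json.

```python
import pandas as pd

# Made-up payload shaped like the file: one metadata string plus a
# "value" list of flat records.
payload = {
    "odata.metadata": "http://example.invalid/$metadata#Lippu",
    "value": [
        {"Lippu_id": 1, "LippuKoodi": "*", "Kuvaus": "Interpoloitu", "KuvausEng": None},
        {"Lippu_id": 5, "LippuKoodi": "00", "Kuvaus": "Ei huomauttamista", "KuvausEng": None},
    ],
}

# Each record becomes one row and each key becomes one column.
flags = pd.json_normalize(payload["value"])
print(flags.columns.tolist())
```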


In [228]:
lippu[lippu["Lippu_id"].isin(flag_ids)]


| | Lippu_id | LippuKoodi | Kuvaus | KuvausEng |
|---|---|---|---|---|
| 35 | 37 | Lumen syvyys | Lumen syvyys kairausreiän päältä (cm) | |
| 36 | 38 | Veden pinta | Veden korkeus jään alareunasta veden pintaan (cm) | |
| 37 | 39 | Kohvasauva | Sauvasta luetun lumen tai lumettoman jään kork... | |
| 38 | 40 | Kohva+Vesikerros | Kairausreiästä mitatun kohvan paksuus (cm) + J... | |
| 39 | 41 | Jäänpaksuus | Jään kokonaispaksuus alareunasta yläreunaan (cm) | |
| 41 | 43 | Heikko jää | Jäätä on näköpiirissä, mutta se on liian heikk... | |
| 57 | 112 | Kohva | Kairausreiästä mitatun kohvan paksuus (cm) | |
| 58 | 113 | Teräsjää | Kairausreiästä mitatun teräsjään paksuus (cm) | |
| 59 | 114 | Vesikerros | Jään välissä olevien mahdollisten vesikerroste... | |
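
Instead of reading the flag counts and the lookup table side by side, the two can be joined. This is a minimal sketch on made-up excerpts: `measurements` and `flags` are hypothetical small frames, with three flag names copied from the table above.

```python
import pandas as pd

# Made-up excerpts; the flag names are copied from the lookup table above.
measurements = pd.DataFrame({"Flag_Id": [41, 37, 41, 43]})
flags = pd.DataFrame({
    "Lippu_id": [37, 41, 43],
    "LippuKoodi": ["Lumen syvyys", "Jäänpaksuus", "Heikko jää"],
})

# Count rows per flag and turn the result into a two-column frame.
counts = (
    measurements["Flag_Id"].value_counts()
    .rename_axis("Flag_Id")
    .reset_index(name="n_rows")
)

# Attach the flag name to each count; a left join keeps flags without a description.
labelled = counts.merge(flags, left_on="Flag_Id", right_on="Lippu_id", how="left")
print(labelled[["Flag_Id", "LippuKoodi", "n_rows"]])
```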


In [244]:
# Check that every flag found in the ice data is described in the lookup table
pd.Series(flag_ids).isin(lippu["Lippu_id"]).all()

True

Based on the flag descriptions above we can confirm that only **Flag_Id** 41 (*Jäänpaksuus*, the total thickness of the ice from its bottom edge to its top edge in cm) is relevant for our analysis, because we're interested in ice thickness. The remaining flags aren't important for our analysis, so we'll remove their rows from the DataFrame.

In [256]:
flag_ids_drop = flag_ids.copy()
flag_ids_drop.remove(41)

In [257]:
print(flag_ids_drop)

[37, 38, 40, 39, 112, 113, 114, 43]


In [259]:
# Series.drop() removes rows by index label, not by value, so we filter on the values instead
ice_thickness = ice_thickness[~ice_thickness["Flag_Id"].isin(flag_ids_drop)]
ice_thickness["Flag_Id"].value_counts()

41    51657
Name: Flag_Id, dtype: int64
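
When removing unwanted flags it matters that Series.drop() interprets its argument as index labels, not as values. This is a minimal sketch on a made-up frame (`toy` and its index are hypothetical) that contrasts dropping by label with filtering by value:

```python
import pandas as pd

# Made-up frame whose index labels overlap with flag values on purpose.
toy = pd.DataFrame(
    {"Flag_Id": [41, 37, 41, 43], "Value": [52, 0, 39, 0]},
    index=[37, 100, 101, 102],
)

# Dropping "37" removes the row whose index label is 37 (a Flag_Id 41 row)
# and keeps the row whose Flag_Id value is 37.
by_label = toy["Flag_Id"].drop(labels=[37])

# A boolean mask filters on the values themselves.
by_value = toy[toy["Flag_Id"] == 41]

print(by_label.tolist(), by_value["Flag_Id"].tolist())
```

Filtering with a mask also keeps the other columns, so the result can be assigned back to the DataFrame directly.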