# Step 1: Import, Clean and Transform your data

In this notebook, we'll walk through how to import your data from a CSV file, clean it, and transform it into
a usable format. Lastly, we'll save the result so it can be used in other notebooks.

In [91]:
#Import your libraries
import pandas as pd
import numpy as np
import os

In [96]:
#Define the directory that holds your data. For this example it is Datasets; if you create your own folders, it will be different.
file_dir = os.path.join("..",'Datasets')
df = pd.read_csv(os.path.join(file_dir,'EconLossesEU.csv'))

In [97]:
df.head()

Unnamed: 0,stk_flow,geo,Year,Value
0,LOSS_CLIM_MEUR,EU28,2016,681.0
1,LOSS_CUM_EUR_HAB,AT,2016,1590.0
2,LOSS_CUM_EUR_HAB,BE,2016,399.0
3,LOSS_CUM_EUR_HAB,BG,2016,296.0
4,LOSS_CUM_EUR_HAB,CH,2016,2580.0


Let's check out our data

In [98]:
df.dtypes

stk_flow     object
geo          object
Year          int64
Value       float64
dtype: object

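We could change Year to datetime, but in this case it doesn't affect much.
If you wanted to, you could run the line below. The `format='%Y'` argument matters: without it, pandas reads the
integers as nanoseconds since 1970-01-01, so every row would end up in 1970.

```python
df['Year'] = pd.to_datetime(df['Year'].astype(str), format='%Y')
```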
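
To see how pandas parses integer years with and without a format, here is a small sketch on toy data. The series `years` is a stand-in for the Year column and is not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the Year column (illustrative values only)
years = pd.Series([1980, 2016])

# No format: the integers are interpreted as nanoseconds since 1970-01-01
naive = pd.to_datetime(years)

# Explicit format: each value is parsed as a four-digit calendar year
parsed = pd.to_datetime(years.astype(str), format='%Y')

print(naive.dt.year.tolist())   # every row lands in 1970
print(parsed.dt.year.tolist())  # the original years are preserved
```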

In [99]:
df.describe()

Unnamed: 0,Year,Value
count,182.0,182.0
mean,2001.362637,4644.203297
std,11.946026,6070.755348
min,1980.0,1.0
25%,1991.0,734.0
50%,2002.0,2446.5
75%,2013.75,5367.0
max,2016.0,30056.0


In [100]:
df.isnull().sum()

stk_flow    0
geo         0
Year        0
Value       0
dtype: int64

In [101]:
df.stk_flow.value_counts()

LOSS_HYD_MEUR       37
LOSS_CLIM_MEUR      37
LOSS_MEUR           37
LOSS_MET_MEUR       37
LOSS_CUM_EUR_HAB    34
Name: stk_flow, dtype: int64

But what do these mean?
Looking at the source info at https://ec.europa.eu/eurostat/cache/metadata/en/cli_iad_loss_esms.htm,
we can see that:

LOSS_HYD_MEUR = hydrological losses (million EUR)

LOSS_CLIM_MEUR = climatological losses (million EUR)

LOSS_MEUR = total losses (million EUR)

LOSS_MET_MEUR = meteorological losses (million EUR)

LOSS_CUM_EUR_HAB = cumulative losses since 1980 in EUR per capita


In [102]:
translate = {
    'LOSS_HYD_MEUR': 'water_losses',
    'LOSS_CLIM_MEUR': 'clim_losses',
    'LOSS_MEUR': 'total_losses',
    'LOSS_MET_MEUR': 'met_losses',
    'LOSS_CUM_EUR_HAB': 'per_cap_losses',
}



In [103]:
#The map function maps each value of stk_flow to the variable name we want to use, as defined in the dictionary 'translate'
df['variable']= df['stk_flow'].map(translate)
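
One thing to watch with a dictionary-based `map`: any code that is missing from the dictionary becomes NaN rather than raising an error. The sketch below shows one way to check for that on toy data. `LOSS_OTHER` is a made-up code used only for illustration, and `lookup` is a shortened stand-in for `translate`.

```python
import pandas as pd

# Toy codes; 'LOSS_OTHER' is hypothetical and deliberately absent from the lookup
codes = pd.Series(['LOSS_HYD_MEUR', 'LOSS_MEUR', 'LOSS_OTHER'])
lookup = {'LOSS_HYD_MEUR': 'water_losses', 'LOSS_MEUR': 'total_losses'}

mapped = codes.map(lookup)

# Codes that had no entry in the dictionary come back as NaN
unmapped = codes[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean that every code was translated.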

In [104]:
df.head()

Unnamed: 0,stk_flow,geo,Year,Value,variable
0,LOSS_CLIM_MEUR,EU28,2016,681.0,clim_losses
1,LOSS_CUM_EUR_HAB,AT,2016,1590.0,per_cap_losses
2,LOSS_CUM_EUR_HAB,BE,2016,399.0,per_cap_losses
3,LOSS_CUM_EUR_HAB,BG,2016,296.0,per_cap_losses
4,LOSS_CUM_EUR_HAB,CH,2016,2580.0,per_cap_losses


In [105]:
#That looks like it worked, so let's get rid of stk_flow:
df = df.drop('stk_flow',axis = 1)

In [106]:
df.geo.value_counts()

EU28    149
IT        1
RO        1
BG        1
LI        1
NL        1
AT        1
HU        1
ES        1
LV        1
EL        1
CY        1
SE        1
PL        1
NO        1
HR        1
CZ        1
FI        1
TR        1
LU        1
PT        1
SK        1
EE        1
MT        1
IS        1
SI        1
FR        1
IE        1
UK        1
BE        1
DK        1
LT        1
DE        1
CH        1
Name: geo, dtype: int64

There seems to be only one year and one metric of data for each country, but more data is available for the EU28 aggregate.
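
A quick way to confirm this kind of coverage is to count the distinct years and metrics per geo code. This is a minimal sketch on a toy frame that mirrors the notebook's column names; the rows are illustrative, not the real data.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (illustrative rows)
toy = pd.DataFrame({
    'geo': ['EU28', 'EU28', 'EU28', 'AT', 'BE'],
    'Year': [2015, 2016, 2016, 2016, 2016],
    'variable': ['total_losses', 'total_losses', 'per_cap_losses',
                 'per_cap_losses', 'per_cap_losses'],
})

# Number of distinct years and distinct metrics available for each geo code
coverage = toy.groupby('geo')[['Year', 'variable']].nunique()
print(coverage)
```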

In [107]:
df[df['geo']=='EU28']['variable'].value_counts()

clim_losses       37
total_losses      37
water_losses      37
met_losses        37
per_cap_losses     1
Name: variable, dtype: int64

In [108]:
df[df['geo']=='EU28']['Year'].value_counts()

2016    5
1997    4
1995    4
1994    4
1993    4
1992    4
1991    4
1990    4
1989    4
1988    4
1987    4
1986    4
1985    4
1984    4
1983    4
1982    4
1981    4
1996    4
1998    4
2015    4
1999    4
2014    4
2013    4
2012    4
2011    4
2010    4
2009    4
2008    4
2007    4
2006    4
2005    4
2004    4
2003    4
2002    4
2001    4
2000    4
1980    4
Name: Year, dtype: int64

In [67]:
#Let's see what's available for the country-level info

In [68]:
df[df['geo']!='EU28'][['variable','Year']].drop_duplicates()

Unnamed: 0,variable,Year
1,per_cap_losses,2016


For any analysis we do, it probably makes sense to split this into two data sets:
one for country comparisons, and one for EU comparisons over time.

In [69]:
country = df[df['geo']!='EU28']

In [70]:
EU = df[df['geo']=='EU28']
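
The same split can be written with a single boolean mask, taking a `.copy()` of each part so that later edits to one subset cannot trigger pandas' chained-assignment warnings. This is a sketch on toy data; the names `is_eu`, `eu_rows` and `country_rows` are illustrative.

```python
import pandas as pd

# Toy frame (illustrative rows)
toy = pd.DataFrame({
    'geo': ['EU28', 'AT', 'BE', 'EU28'],
    'Value': [681.0, 1590.0, 399.0, 9396.0],
})

# Build the mask once, then use it and its negation for the two subsets
is_eu = toy['geo'] == 'EU28'
eu_rows = toy[is_eu].copy()
country_rows = toy[~is_eu].copy()

print(len(eu_rows), len(country_rows))
```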

In [71]:
country.head()

Unnamed: 0,geo,Year,Value,variable
1,AT,2016,1590.0,per_cap_losses
2,BE,2016,399.0,per_cap_losses
3,BG,2016,296.0,per_cap_losses
4,CH,2016,2580.0,per_cap_losses
5,CY,2016,574.0,per_cap_losses


In [63]:
#Some of this is now unnecessary: every row is a 2016 per capita loss,
#so we can drop the variable column and give the rest clearer names

In [73]:
country = country.drop('variable',axis = 1)
country.columns = ['Country','Year','Per_Cap_Losses']
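
Assigning to `.columns` relies on the column order being exactly what you expect. An alternative, sketched here on a toy frame, is `rename` with an explicit old-to-new mapping, which leaves unmentioned columns untouched.

```python
import pandas as pd

# Toy frame with the column names left after dropping 'variable' (illustrative row)
toy = pd.DataFrame({'geo': ['AT'], 'Year': [2016], 'Value': [1590.0]})

# Rename by name rather than by position
renamed = toy.rename(columns={'geo': 'Country', 'Value': 'Per_Cap_Losses'})
print(renamed.columns.tolist())
```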

In [74]:
country.head()

Unnamed: 0,Country,Year,Per_Cap_Losses
1,AT,2016,1590.0
2,BE,2016,399.0
3,BG,2016,296.0
4,CH,2016,2580.0
5,CY,2016,574.0


In [75]:
#Now for the EU data...
EU.head()

Unnamed: 0,geo,Year,Value,variable
0,EU28,2016,681.0,clim_losses
12,EU28,2016,852.0,per_cap_losses
35,EU28,2016,5950.0,water_losses
36,EU28,2016,2765.0,met_losses
37,EU28,2016,9396.0,total_losses


In [76]:
#Let's drop geo, since it's all EU28
EU = EU.drop('geo',axis = 1)

In [77]:
#let's turn each variable into its own column so we can compare them
EU_piv = EU.pivot_table(index = ['Year'],columns='variable',values = 'Value')
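
Note that `pivot_table` aggregates with the mean by default, so if a Year/variable pair ever appeared twice the two values would be averaged silently. The sketch below, on toy data with a deliberate duplicate, shows that behaviour and one way to count duplicates before pivoting.

```python
import pandas as pd

# Toy long-format data with a deliberate duplicate (1980, total_losses) pair
long_df = pd.DataFrame({
    'Year': [1980, 1980, 1980],
    'variable': ['total_losses', 'total_losses', 'met_losses'],
    'Value': [100.0, 300.0, 50.0],
})

# Default aggfunc is the mean, so the duplicate pair collapses to one cell
wide = long_df.pivot_table(index='Year', columns='variable', values='Value')

# Count repeated (Year, variable) pairs; 0 means the pivot loses nothing
n_dupes = int(long_df.duplicated(subset=['Year', 'variable']).sum())
print(wide)
print(n_dupes)
```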

In [78]:
EU_piv.head()

Year,clim_losses,met_losses,per_cap_losses,total_losses,water_losses
1980,3170.0,158.0,,3414.0,86.0
1981,2.0,1919.0,,2277.0,356.0
1982,7026.0,2452.0,,14331.0,4853.0
1983,4149.0,708.0,,11026.0,6169.0
1984,2.0,4346.0,,4690.0,342.0


In [79]:
EU_piv.describe()

variable,clim_losses,met_losses,per_cap_losses,total_losses,water_losses
count,37.0,37.0,1.0,37.0,37.0
mean,2579.027027,4279.432432,852.0,11092.108108,4233.648649
std,3029.190646,5011.413354,,7876.957759,5167.074039
min,1.0,158.0,852.0,2277.0,29.0
25%,681.0,1457.0,852.0,4812.0,750.0
50%,2116.0,2452.0,852.0,9265.0,1882.0
75%,3170.0,5411.0,852.0,15628.0,5371.0
max,16556.0,22958.0,852.0,30056.0,23991.0


In [80]:
EU_piv[EU_piv['per_cap_losses'].notnull()]

Year,clim_losses,met_losses,per_cap_losses,total_losses,water_losses
2016,681.0,2765.0,852.0,9396.0,5950.0


In [81]:
#Since we only have 1 entry for this, let's drop it. 
EU_piv = EU_piv.drop('per_cap_losses',axis = 1)

In [82]:
EU_piv.head()

Year,clim_losses,met_losses,total_losses,water_losses
1980,3170.0,158.0,3414.0,86.0
1981,2.0,1919.0,2277.0,356.0
1982,7026.0,2452.0,14331.0,4853.0
1983,4149.0,708.0,11026.0,6169.0
1984,2.0,4346.0,4690.0,342.0


In [90]:
#Now that these look usable, let's export to csv to use in another notebook:
country.to_csv(os.path.join(file_dir,'EU_Country_Losses_16.csv'))
EU_piv.to_csv(os.path.join(file_dir,'EU_Total_Losses_80-15.csv'))
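
By default `to_csv` writes the index as the first column. For a Year-indexed frame that is what you want, but a leftover row index reappears as an `Unnamed: 0` column when the file is read back. A small sketch of the round trip on toy data, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame with a leftover, non-meaningful row index (illustrative rows)
toy = pd.DataFrame(
    {'Country': ['AT', 'BE'], 'Per_Cap_Losses': [1590.0, 399.0]},
    index=[1, 2],
)

# Default: the index is written as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False: only the real columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether to pass `index=False` depends on what the notebooks that read these files expect, so treat this as an option rather than a required change.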
