In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# Identifying limed catchments

Atle has provided two Excel files:

 1. A list of 1005 lakes proposed for a new national scale monitoring programme, and <br><br>
 
 2. A list of 2957 lakes that have been historically limed (originally provided by Kari).
 
This notebook links these two datasets based on lake IDs from NVE to see which of the sites in the new programme may have been limed in the past.

## 1. Read data

The file containing data for the new programme has a few incomplete lines right at the end. I've removed these manually and copied the data to a new Excel file, along with the data from Kari.

In [2]:
# Read liming data
xl_path = r'../data/data_tidied.xlsx'
lim_df = pd.read_excel(xl_path, sheet_name='Limed_Catchments')
lim_df.head()

Unnamed: 0,Navn,NVE-nr,UTM E32,UTM N32,Hoh (m),Prøve fra år,"""Ukalket"" Ca (mg/l)",TOC (mg/L),Høyde,Kalk,...,"""Ukalket"" Ca (mg/l).1",TOC (mg/L).1,Kalk.1,Humus.1,G/M ANC (uekv/l).1,"""Ukalket"" ANC (uekv/l) rapport","Ny ""Ukalket"" ANC (uekv/l) 2014",Usikkerhet,Rapport,2014
0,"""Øvre"" Damtjørn",80270,519698.464062,6661278.0,0.0,2003-10-17 00:00:00,2.742715,11.6835,Lavland,1-4,...,,,1-4,>5,30.0,164.577793,,40,S,S
1,"""Øvre"" Damtjørn",80270,519698.464062,6661278.0,0.0,2006-09-06 00:00:00,1.222431,8.1,Lavland,1-4,...,,,1-4,>5,30.0,61.859097,,40,U,U
2,"""Øvre"" Damtjørn",80270,519698.464062,6661278.0,0.0,2007-05-28 00:00:00,0.80163,5.7,Lavland,<1,...,,,0.75-1,>5,30.0,44.701154,,40,U,U
3,"""Øvre"" Damtjørn",80270,519698.464062,6661278.0,0.0,2007-09-23 00:00:00,1.543639,11.0,Lavland,1-4,...,,,1-4,>5,30.0,98.684502,,40,S,S
4,"(S)M, Ertevann",3380,638671.642401,6597786.0,239.0,2011-11-06 00:00:00,1.284453,17.2,Skog,1-4,...,,,1-4,>5,30.0,83.570362,,25,U*,S


In [3]:
# Read new programme data
new_df = pd.read_excel(xl_path, sheet_name='New_Project')
new_df.head()

Unnamed: 0,Station ID,Station Code,Station name,Lake/River name,Kommune Nr,Kommune,Fylke Nr,Fylke,NVE Vatn nr,UTM North,UTM East,UTM Zone,Lake Area,Altitude
0,2999,101-4-1,Femsjøen,Femsjøen,101,Halden kommune,1,Østfold,316.0,6559050.0,642460.0,32,10.64,79
1,9,101-2-7,Hokksjøen,Hokksjøen,101,Halden kommune,1,Østfold,3608.0,6543369.0,647241.906048,32,0.133879,148
2,2998,101-2-2,Steinsvatnet,Steinsvatnet,101,Halden kommune,1,Østfold,3562.0,6554004.0,652477.0,32,0.21,178
3,3000,105-3-6,Isesjøen,Isesjøen,105,Sarpsborg kommune,1,Østfold,133.0,6572400.0,626400.0,32,6.2,38
4,3001,105-3-10,Tunevatnet,Tunevatnet,105,Sarpsborg kommune,1,Østfold,3451.0,6576676.0,619486.0,32,2.25,40


## 2. Data checking

In [4]:
# Are the IDs in the limed dataset unique?
print('Is unique?:        %s' % lim_df['NVE-nr'].is_unique)
print('Number unique:     %s' % len(lim_df['NVE-nr'].unique()))
print('Number without ID: %s' % lim_df['NVE-nr'].isnull().sum())

Is unique?:        False
Number unique:     1565
Number without ID: 0


So, there are many duplicated IDs in the limed dataset, but all rows have an ID. Looking down the e-mail chain from Atle, this is expected: Kari says the data sometimes includes single samples, so the same lake can appear on multiple lines. Based on the above, there are actually **1565 unique lakes in the limed dataset**.
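
To see which lakes account for the repeated IDs, one option is to count rows per ID. The sketch below uses a small invented frame (the rows are illustrative, not taken from the spreadsheet) that borrows the `Navn` and `NVE-nr` column names:

```python
import pandas as pd

# Toy stand-in for lim_df; the rows are invented for illustration
toy_df = pd.DataFrame({
    'Navn': ['Lake A', 'Lake A', 'Lake A', 'Lake B', 'Lake C', 'Lake C'],
    'NVE-nr': [101, 101, 101, 202, 303, 303],
})

# Number of rows (samples) recorded for each lake ID
samples_per_lake = toy_df['NVE-nr'].value_counts()

# IDs that occur on more than one row
repeated_ids = samples_per_lake[samples_per_lake > 1]

print('Rows:              %s' % len(toy_df))
print('Unique lakes:      %s' % toy_df['NVE-nr'].nunique())
print('Lakes with >1 row: %s' % len(repeated_ids))
```

Applied to `lim_df`, the same pattern would show how many samples each limed lake contributes.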

What about the new dataset?

In [5]:
# Are the IDs in the new dataset unique?
print('Is unique?:        %s' % new_df['NVE Vatn nr'].is_unique)
print('Number unique:     %s' % len(new_df['NVE Vatn nr'].unique()))
print('Number without ID: %s' % new_df['NVE Vatn nr'].isnull().sum())

Is unique?:        False
Number unique:     1004
Number without ID: 2


In [6]:
# Identify lakes with no ID
new_df[new_df['NVE Vatn nr'].isnull()]

Unnamed: 0,Station ID,Station Code,Station name,Lake/River name,Kommune Nr,Kommune,Fylke Nr,Fylke,NVE Vatn nr,UTM North,UTM East,UTM Zone,Lake Area,Altitude
271,15705,830-26,Måvatn,,830,Nissedal kommune,8,Telemark,,6563000.0,478400.0,32,0.63,665
305,15712,904-12,Snøløsvatn,,904,Grimstad kommune,9,Aust-Agder,,6481832.0,468564.0,32,1.24,109


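Two lakes in the new dataset are missing IDs. Because `unique()` counts the missing value as a single entry, the 1004 unique values are 1003 distinct IDs plus the blank, so the two blanks are the only reason `is_unique` is `False`. Can these two lakes be found in the limed dataset?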

In [7]:
# Search limed data for Måvatn
lim_df[lim_df['Navn'].str.contains(u'Måvatn', na=False)]

Unnamed: 0,Navn,NVE-nr,UTM E32,UTM N32,Hoh (m),Prøve fra år,"""Ukalket"" Ca (mg/l)",TOC (mg/L),Høyde,Kalk,...,"""Ukalket"" Ca (mg/l).1",TOC (mg/L).1,Kalk.1,Humus.1,G/M ANC (uekv/l).1,"""Ukalket"" ANC (uekv/l) rapport","Ny ""Ukalket"" ANC (uekv/l) 2014",Usikkerhet,Rapport,2014
1417,Måvatn,8146,505672.432872,6531369.0,142.0,2011-10-04 00:00:00,1.036021,10.5,Lavland,1-4,...,,,1-4,>5,30.0,52.127344,,30,U,U
1418,"Måvatn, store",13527,492168.776802,6599101.0,791.0,2011-10-16 00:00:00,0.654453,7.4,Skog,<1,...,,,0.5-0.75,>5,20.0,37.412865,,30,U,U
1419,"Måvatn, vesle",13542,492953.501412,6598952.0,797.0,2011-10-16 00:00:00,0.775113,8.3,Skog,<1,...,,,0.75-1.0,>5,30.0,45.807358,,30,U,U


There are three lakes named Måvatn in the limed dataset, but none matches the co-ordinates or the elevation (665 m) of the Måvatn in the new project. **Ask Atle to check whether Måvatn in the new project is one of the three above**.
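
The mismatch can be quantified with a quick distance calculation. This is a minimal sketch using the co-ordinates printed in the tables above. It assumes plain Euclidean distance is adequate, which is reasonable here because all four points are given in UTM zone 32:

```python
import math

# Måvatn in the new project (UTM zone 32), from the table above
new_east, new_north = 478400.0, 6563000.0

# The three Måvatn candidates in the limed dataset (UTM zone 32)
candidates = {
    8146: (505672.432872, 6531369.0),   # Måvatn
    13527: (492168.776802, 6599101.0),  # Måvatn, store
    13542: (492953.501412, 6598952.0),  # Måvatn, vesle
}

# Straight-line distance in km between each candidate and the new site
dist_km = {
    nve_id: math.hypot(east - new_east, north - new_north) / 1000.0
    for nve_id, (east, north) in candidates.items()
}

for nve_id, km in sorted(dist_km.items()):
    print('NVE %s: %.1f km' % (nve_id, km))
```

All three candidates lie tens of kilometres from the new-project site, so a co-ordinate rounding error seems an unlikely explanation.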

In [8]:
# Search limed data for Snøløsvatn
lim_df[lim_df['Navn'].str.contains(u'Snøløsvatn', na=False)]

Unnamed: 0,Navn,NVE-nr,UTM E32,UTM N32,Hoh (m),Prøve fra år,"""Ukalket"" Ca (mg/l)",TOC (mg/L),Høyde,Kalk,...,"""Ukalket"" Ca (mg/l).1",TOC (mg/L).1,Kalk.1,Humus.1,G/M ANC (uekv/l).1,"""Ukalket"" ANC (uekv/l) rapport","Ny ""Ukalket"" ANC (uekv/l) 2014",Usikkerhet,Rapport,2014


I can't find anything in the limed data to match 'Snøløsvatn'. For now, I will therefore ignore these two sites (Måvatn and Snøløsvatn).
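
An exact substring search will miss spelling variants, for example a definite-form ending such as "-vatnet". A looser search might be worth trying before giving up on a name match. The sketch below uses the standard-library `difflib` module on an invented list of names (not real rows from the limed data), and the 0.8 cutoff is an arbitrary choice:

```python
import difflib

# Invented lake names standing in for lim_df['Navn']; not real data
limed_names = ['Snøløsvatnet', 'Måvatn', 'Femsjøen']

# Return up to 3 names whose similarity ratio to the target is at least 0.8
close = difflib.get_close_matches('Snøløsvatn', limed_names, n=3, cutoff=0.8)
print(close)
```

With the real data, the candidate list could be built from `lim_df['Navn'].dropna().unique()`.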

In [9]:
# Drop rows without IDs from new dataset
new_df.dropna(axis=0, subset=['NVE Vatn nr'], inplace=True)
print('Number of sites in new project with IDs: %s' % len(new_df))

Number of sites in new project with IDs: 1003


In [10]:
# Convert NVE ID cols to str for safe matching
lim_df['NVE-nr'] = lim_df['NVE-nr'].astype(str)
new_df['NVE Vatn nr'] = new_df['NVE Vatn nr'].astype(int).astype(str)
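
The extra `astype(int)` step matters because `NVE Vatn nr` is shown as a float column in the output above (e.g. `316.0`), whereas `NVE-nr` holds integers. Converting the floats straight to strings would keep the trailing `.0`, and nothing would match. A small illustration, using two IDs from the first rows of the new-programme table:

```python
import pandas as pd

float_ids = pd.Series([316.0, 3608.0])  # floats, as in 'NVE Vatn nr'
int_ids = pd.Series([316, 3608])        # integers, as in 'NVE-nr'

direct = float_ids.astype(str)               # keeps the '.0' suffix
via_int = float_ids.astype(int).astype(str)  # drops it

# Compare each version against the integer-based string IDs
print(direct.isin(int_ids.astype(str)).tolist())
print(via_int.isin(int_ids.astype(str)).tolist())
```

Note that the integer cast only works once rows with missing IDs have been dropped, since NaN cannot be converted to `int`.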

## 3. Identify limed catchments

In [11]:
# Get list of unique IDs for limed catchments
lim_uni = lim_df['NVE-nr'].unique()

# List lakes in 'new_df' that appear in 'lim_df'
# (an intersection of the two ID sets, not a union)
match_df = new_df[new_df['NVE Vatn nr'].isin(lim_uni)]

# Save to CSV
match_df.to_csv(r'../Data/limed_sites_in_new_project.csv', encoding='utf-8')

# Print results
print('There are %s sites in the new project '
      'that appear listed in the limed dataset:' % len(match_df))
match_df

There are 8 sites in the new project that appear listed in the limed dataset:


Unnamed: 0,Station ID,Station Code,Station name,Lake/River name,Kommune Nr,Kommune,Fylke Nr,Fylke,NVE Vatn nr,UTM North,UTM East,UTM Zone,Lake Area,Altitude
11,3008,128-2-111,Store Hosten,Store Hosten,128,Rakkestad kommune,1,Østfold,3415,6592256.0,644363.0,32,0.2,190
32,3029,402-2-33,Fjellsjøen,Fjellsjøen,402,Kongsvinger kommune,4,Hedmark,4112,6681939.0,348777.0,33,0.59,384
58,3052,428-2-24,Fiskebekktjørna,Fiskebekktjørna,428,Trysil kommune,4,Hedmark,33453,6825188.0,650148.0,32,0.28,815
159,3138,604-2-18,Tverrvatnet,Tverrvatnet,604,Kongsberg kommune,6,Buskerud,6384,6593783.0,535801.0,32,0.14,529
189,3168,621-1-27,Flåvatna,Flåvatna,621,Sigdal kommune,6,Buskerud,18142,6673699.0,510145.0,32,0.07,854
193,3172,623-2-16,Damheggesjø,Damheggesjø,623,Modum kommune,6,Buskerud,5310,6645867.0,558939.0,32,0.17,588
240,17942,817-606,Kleppsvatn,,817,Drangedal kommune,8,Telemark,1245,6560359.0,482226.0,32,1.25,542
641,3574,1432-1-7,Nordvatnet,Nordvatnet,1432,Førde kommune,14,Sogn og Fjordane,28322,6829875.0,347236.0,32,0.07,893
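
As a cross-check on the `isin` filter, the same question can be phrased as a join. The sketch below uses invented IDs and pandas' `merge` with `indicator=True`, which labels each row according to where its key was found. It is an alternative formulation, not what the cell above does:

```python
import pandas as pd

# Toy frames with invented IDs; column names mirror the two datasets
new_toy = pd.DataFrame({'NVE Vatn nr': ['10', '20', '30'],
                        'Station name': ['A', 'B', 'C']})
lim_toy = pd.DataFrame({'NVE-nr': ['20', '20', '40']})  # repeated samples

# De-duplicate the limed IDs first so the join cannot multiply rows
lim_ids = lim_toy[['NVE-nr']].drop_duplicates()

merged = new_toy.merge(lim_ids, how='left', left_on='NVE Vatn nr',
                       right_on='NVE-nr', indicator=True)

# '_merge' is 'both' where the site also appears in the limed list
limed_sites = merged[merged['_merge'] == 'both']
print(limed_sites[['NVE Vatn nr', 'Station name']])
```

The join form makes it easy to carry limed-dataset columns across to the matched sites, at the cost of needing the de-duplication step.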


## 4. Summary

 * There are 2 sites in the new project (Måvatn and Snøløsvatn) without NVE IDs <br><br>
 
 * Of these two, one (Måvatn) *might* be listed in the limed dataset, but there is no exact match. Perhaps Atle can check? For now, I have simply ignored these two sites <br><br>
 
 * Of the remaining 1003 sites in the new project, 8 are listed in the limed dataset