In [1]:
import pandas as pd

From Evans et al. (2017) supplementary information:

Evans et al. data: lake and reservoir budget data only; a single (mean) value per site regardless of study duration; includes both source and sink sites.
Catalán et al.: restricted to sites acting as DOC sinks, incorporated data from a wider range of waterbody types and assessment methods, and treated multiple years of data from some sites as independent measurements in the statistical analysis. Of the 107 data points used, 29 were from rivers or catchment studies, 8 from oceans or estuaries, and 75 from lakes or reservoirs. Taking into account the use of multiple data years from single sites, the study included data from 48 unique lakes or reservoirs. Of these, 22 k values were derived from modelled data and four from summer-only studies, so 26 were not considered reliable, leaving 22 lakes.

Evans et al. excluded these 26 data points to compare to their results, giving 22 sites with annual lake input-output budgets that were common to both studies. Of these, 20 were obtained from Ontario, Canada, from the Experimental Lakes Area (ELA) and the Dorset region. The original authors of the ELA study calculated DOC half-lives assuming that the lakes were closed systems with no outflows. Evans et al. re-calculated T½ for these sites to take account of DOC loss through the lake outflows, thus obtaining somewhat different estimates from those used in the analysis of Catalán et al.
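
The half-lives (T½) discussed above and the rate constants k in the Catalán et al. spreadsheet can be put on a common footing if DOC loss is assumed to follow simple first-order decay, in which case T½ = ln(2) / k. The sketch below is only that textbook conversion, not the outflow-corrected recalculation by Evans et al.; the function names and the 365.25 days-per-year constant are assumptions for illustration.

```python
import math

DAYS_PER_YEAR = 365.25  # assumption: mean Julian year


def half_life_years(k_per_day: float) -> float:
    """Half-life in years for a first-order decay rate given in d-1."""
    if k_per_day <= 0:
        raise ValueError("k_per_day must be positive")
    return math.log(2) / k_per_day / DAYS_PER_YEAR


def k_per_day_from_half_life(t_half_years: float) -> float:
    """Inverse conversion: first-order rate (d-1) from a half-life in years."""
    if t_half_years <= 0:
        raise ValueError("t_half_years must be positive")
    return math.log(2) / (t_half_years * DAYS_PER_YEAR)
```

For example, `half_life_years(0.001)` is about 1.9 years. Applied to the notebook's data, this would be something like `df["k_per_day"].map(half_life_years)`, provided all k values are positive.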

Overall, therefore, the two lake datasets are not wholly independent, but there are major differences in site selection criteria, geographic range and data analysis methods, and approximately three quarters of the data points used by each study were independent of the other.  

In [9]:
# Read in data
data_fpath = r"data/catalan_2016_doc_data.xlsx"
df_full = pd.read_excel(data_fpath)

df_full.rename(
    {
        "WRT (years) ": "tau",
        "Degradation rate temperature corrected (d-1)": "k_per_day",
        "FieldII": "Method",
        "System10": "System",
        "Latitude (d.d.)": "Lat",
        "Longitude (d.d.)": "Lon",
        "Cite": "Ref"
    },
    axis="columns",
    inplace=True,
)

# Drop data that Catalan et al. excluded from their regression analyses, for comparability
df = df_full.loc[df_full["Excluded from analysis"] == 0].copy()

# Keep only lake rows with Method == "FieldModel".
# Build the mask from df (not df_full) so it matches the frame being filtered.
cat_df = df.loc[(df["Method"] == "FieldModel") & (df["System"] == "Lake")]

# df of unique lakes with lat & lon
groupby_cols = ["Site", "Ref", "Lat", "Lon"]
cat_lake_df = cat_df[groupby_cols].drop_duplicates().reset_index(drop=True)

cat_lake_df.to_csv(r'data/catalan_lake_sites.csv')

cat_lake_df

Unnamed: 0,Site,Ref,Lat,Lon
0,Mirror lake (epilimnion),Cole et al. 1989,43.944013,-71.693437
1,Friskjön,Sobek et al. 2006,57.937636,16.697475
2,239_ELA,Schindler et al. 1992,49.663097,-93.723074
3,114_ELA,"Curtis and Schindler, 1997",49.671072,-93.757008
4,221_ELA,"Curtis and Schindler, 1997",49.702057,-93.72628
5,222_ELA,"Curtis and Schindler, 1997",49.696173,-93.72319
6,224_ELA,"Curtis and Schindler, 1997",49.690621,-93.716152
7,225_ELA,"Curtis and Schindler, 1997",49.687955,-93.714264
8,226_ELA,"Curtis and Schindler, 1997",49.689954,-93.74173
9,227_ELA,"Curtis and Schindler, 1997",49.687733,-93.688515


Notes:

- Site 239_ELA has 17 rows with different WRT (water residence time) and k values, i.e. it includes changes over time. The lake's WRT ranges from 5 to 17 years.
- What are the rows derived from Hanson et al. 2011? "Large lakes simulation" and "Small lakes simulation" are given as the Site name...
- San Diego Bay doesn't sound like a lake... Maybe that's OK. It appears four times, once per season.

Some sites still repeat because they were sampled at different times of year. Drop duplicate coordinates and keep only the first Site name and Ref for each.

In [13]:
unique_coords_df = cat_lake_df.drop_duplicates(subset=["Lat", "Lon"]).reset_index(drop=True)
unique_coords_df

Unnamed: 0,Site,Ref,Lat,Lon
0,Mirror lake (epilimnion),Cole et al. 1989,43.944013,-71.693437
1,Friskjön,Sobek et al. 2006,57.937636,16.697475
2,239_ELA,Schindler et al. 1992,49.663097,-93.723074
3,114_ELA,"Curtis and Schindler, 1997",49.671072,-93.757008
4,221_ELA,"Curtis and Schindler, 1997",49.702057,-93.72628
5,222_ELA,"Curtis and Schindler, 1997",49.696173,-93.72319
6,224_ELA,"Curtis and Schindler, 1997",49.690621,-93.716152
7,225_ELA,"Curtis and Schindler, 1997",49.687955,-93.714264
8,226_ELA,"Curtis and Schindler, 1997",49.689954,-93.74173
9,227_ELA,"Curtis and Schindler, 1997",49.687733,-93.688515
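
Before dropping repeats by coordinate as in the cell above, it can help to see which rows would collapse and how many rows each location contributes. The sketch below uses a small made-up table (the site names, references and coordinates are placeholders, not values from the spreadsheet) with the same `Site`, `Ref`, `Lat` and `Lon` columns.

```python
import pandas as pd

# Hypothetical stand-in for cat_lake_df; values are illustrative only
sites = pd.DataFrame(
    {
        "Site": ["Lake A (spring)", "Lake A (summer)", "Lake B"],
        "Ref": ["Ref 1", "Ref 1", "Ref 2"],
        "Lat": [50.0, 50.0, 60.0],
        "Lon": [-90.0, -90.0, 15.0],
    }
)

coord_cols = ["Lat", "Lon"]

# Every row that shares its coordinates with at least one other row
repeats = sites[sites.duplicated(subset=coord_cols, keep=False)]

# One row per location: first Site/Ref kept, plus a count of rows merged
collapsed = sites.groupby(coord_cols, as_index=False, sort=False).agg(
    Site=("Site", "first"),
    Ref=("Ref", "first"),
    n_rows=("Site", "count"),
)

print(repeats)
print(collapsed)
```

Keeping `n_rows` makes it easy to spot locations, such as seasonal repeats, where only the first name survives.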


In the end, the comparison was done manually. Result: 15 of the 85 Evans et al. sites are also in the Catalán et al. data.
Restricting to oligotrophic lakes with retention > 0.1, 15 of 49 lakes overlap between the two datasets. Dropping them would leave only 34 data points, which is hard to afford. Live with the overlap, or drop?
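
The manual comparison could be cross-checked with a coordinate match. This is a rough sketch, not the procedure actually used: it assumes the Evans et al. sites are available as a table with `Lat` and `Lon` columns (the `evans_sites` frame and its rows below are hypothetical), and the function names and the 0.2 km tolerance are illustrative choices.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def flag_overlap(left, right, max_km=0.2):
    """Flag rows of `left` with a site in `right` within `max_km`."""
    dist = haversine_km(
        left["Lat"].to_numpy()[:, None],
        left["Lon"].to_numpy()[:, None],
        right["Lat"].to_numpy()[None, :],
        right["Lon"].to_numpy()[None, :],
    )
    out = left.copy()
    out["nearest_km"] = dist.min(axis=1)
    out["in_other"] = out["nearest_km"] <= max_km
    return out


# Two rows copied from the Catalán et al. site table above
catalan_sites = pd.DataFrame(
    {
        "Site": ["239_ELA", "Friskjön"],
        "Lat": [49.663097, 57.937636],
        "Lon": [-93.723074, 16.697475],
    }
)
# Hypothetical Evans et al. rows, for illustration only
evans_sites = pd.DataFrame(
    {"Site": ["Lake 239", "Elsewhere"], "Lat": [49.6630, 60.0], "Lon": [-93.7231, 10.0]}
)

flagged = flag_overlap(catalan_sites, evans_sites)
print(flagged[["Site", "nearest_km", "in_other"]])
```

The tolerance needs care: several ELA lakes in the table above are well under 1 km apart, so a loose threshold would match neighbouring lakes to each other, and any automated match should still be checked against site names.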