In [1]:
%matplotlib inline 

# Import required modules
import matplotlib.pyplot as plt # Basic plotting
import seaborn as sn            # Advanced plotting
import mpld3                    # "Interactive" plotting  
import pandas as pd             # Data handling

sn.set_context('notebook')      # Plot styling

# Introduction to resampling, interpolation and joins with Python

The Excel file *resa2_data.xlsx* has been exported directly from RESA2 and contains the discharge and NO3-N data from Langtjern from 1990 to the present day. In Excel, I've deleted unnecessary columns and the units row, but made no other changes.

## 1. Basic parsing of data

In [2]:
# Read data
in_xlsx = r'C:\Data\James_Work\Staff\Kari_A\Python_Example\Example_Data\resa2_data.xlsx'
df = pd.read_excel(in_xlsx, sheet_name='DATA', index_col=0)

df.head(10) # Displays the first 10 rows

Date,Qs,NO3-N
1990-01-06,0.005,27
1990-01-13,0.005,27
1990-01-20,0.007,26
1990-01-28,0.014,40
1990-02-03,0.078,63
1990-02-11,0.142,73
1990-02-17,0.062,72
1990-02-25,0.129,61
1990-03-03,0.07,40
1990-03-09,0.042,58


It's a good idea to check the data types, as RESA2 often includes '<' characters for detection limit values. For this example, we'll check and replace any '<' values with the detection limit itself.

In [3]:
# Print data types for columns
df.dtypes

Qs       float64
NO3-N     object
dtype: object

'float64' is a decimal number format, so the Qs column does not contain any '<' characters. 'object' means that the 'NO3-N' column includes mixed data types: most probably a mixture of numbers and '<' symbols. To fix this, we first convert the whole column to text ('str'), then remove the '<' characters and convert it back to decimals ('float').

In [4]:
# Convert NO3-N column, removing '<'
df['NO3-N'] = df['NO3-N'].astype(str).str.strip('<').astype(float)

# Print data types again
print(df.dtypes)

# Print the first 5 rows
df.head()

Qs       float64
NO3-N    float64
dtype: object


Date,Qs,NO3-N
1990-01-06,0.005,27.0
1990-01-13,0.005,27.0
1990-01-20,0.007,26.0
1990-01-28,0.014,40.0
1990-02-03,0.078,63.0


Note that both columns are now of 'float64' type.
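
As a variation, it can be useful to remember which values were originally below the detection limit, because the conversion above throws that information away. The sketch below uses a small made-up series (`raw`, not the real Langtjern data) and keeps the flags in a separate boolean series; all variable names here are my own and only for illustration.

```python
import pandas as pd

# Hypothetical values mimicking a RESA2 export with detection-limit flags
raw = pd.Series([27, '<5', 40.5, '<10'], name='NO3-N')

# Work on a text version of the column so the string methods are available
as_text = raw.astype(str)

# True wherever the original value was reported as "less than" the limit
below_limit = as_text.str.startswith('<')

# Numeric values, with flagged entries replaced by the detection limit itself
values = as_text.str.lstrip('<').astype(float)

print(values.tolist())
print(below_limit.tolist())
```

Keeping `below_limit` alongside `values` makes it possible to try other substitutions later (for example half the detection limit) without re-reading the Excel file.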

## 2. Basic plotting

The code below gives an "interactive" plot: use the tools at the bottom-left of the plot to navigate around.

In [5]:
df.plot(subplots=True, figsize=(12,6))
mpld3.display()

## 3. Resampling

Suppose we want to calculate mean values for each parameter in each month.

(Note that you can also calculate medians, sums, standard deviations etc. in exactly the same way.)

In [6]:
# Resample to monthly
mon_df = df.resample('M').mean()
mon_df.head()

Date,Qs,NO3-N
1990-01-31,0.00775,30.0
1990-02-28,0.10275,67.25
1990-03-31,0.105,43.0
1990-04-30,0.17975,25.5
1990-05-31,0.039,7.0
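
To illustrate the point about other statistics, here is a minimal sketch using the first six Qs values from the table in section 1. Passing a list of names to `agg` calculates several statistics in one go. I've used the 'MS' (month start) frequency rather than 'M' only because it is accepted by both older and newer pandas versions, so the rows are labelled with the first day of each month; the `demo` dataframe is purely illustrative.

```python
import pandas as pd

# First six raw discharge values, copied from the table in section 1
dates = pd.to_datetime(['1990-01-06', '1990-01-13', '1990-01-20',
                        '1990-01-28', '1990-02-03', '1990-02-11'])
demo = pd.DataFrame({'Qs': [0.005, 0.005, 0.007, 0.014, 0.078, 0.142]},
                    index=dates)

# Several monthly statistics at once (labels are month starts)
stats = demo['Qs'].resample('MS').agg(['mean', 'median', 'sum'])
print(stats)
```

The January mean from this sketch should match the 0.00775 shown in the monthly table above.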


Or we can go the other way, and convert the raw series to daily resolution, with 'NoData' (NaN) values wherever data are missing.

In [7]:
# Resample to daily
day_df = df.resample('D').mean()
day_df.head(10)

Date,Qs,NO3-N
1990-01-06,0.005,27.0
1990-01-07,,
1990-01-08,,
1990-01-09,,
1990-01-10,,
1990-01-11,,
1990-01-12,,
1990-01-13,0.005,27.0
1990-01-14,,
1990-01-15,,


## 4. Interpolation

Having created the daily series, we might want to interpolate over the data gaps. There are lots of methods to choose from: simply set the 'method' parameter to one of the following: ‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’,
‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’.

In [8]:
# Linear interpolation
lin_df = day_df.interpolate(method='linear')
lin_df.head(10)

Date,Qs,NO3-N
1990-01-06,0.005,27.0
1990-01-07,0.005,27.0
1990-01-08,0.005,27.0
1990-01-09,0.005,27.0
1990-01-10,0.005,27.0
1990-01-11,0.005,27.0
1990-01-12,0.005,27.0
1990-01-13,0.005,27.0
1990-01-14,0.005286,26.857143
1990-01-15,0.005571,26.714286


As an example, let's compare linear versus cubic interpolation for just the flow data. (Again, use the tools at the bottom-left to zoom in and look at the differences.)

In [9]:
# Extract just the flows data (copy, so new columns are not added to a view of day_df)
q_df = day_df[['Qs']].copy()

# Interpolation
q_df['q_linear'] = q_df['Qs'].interpolate(method='linear') # Linear
q_df['q_cubic'] = q_df['Qs'].interpolate(method='cubic')   # Cubic

# Plot
q_df[['q_linear', 'q_cubic']].plot(subplots=True, figsize=(12,6))
mpld3.display()

Sometimes you want to interpolate over small data gaps, but not large ones. In this case you can also pass an additional 'limit' parameter as follows:

    q_df['Qs'].interpolate(method='linear', limit=7)

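This will fill at most 7 consecutive missing time steps (7 days in this case) in each gap. Gaps of 7 steps or fewer are filled completely. Note that pandas also fills the first 7 steps of longer gaps, so large gaps are only partly left unfilled.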
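
A quick way to see what 'limit' actually does is to try it on a tiny made-up series. In the sketch below, `limit=2` fills the first two missing values of every gap, including the start of the 4-step gap. If you would rather leave long gaps completely empty, one possible approach (my own workaround, not a built-in pandas option) is to measure the length of each gap and only keep interpolated values where the gap is short enough. The names `max_gap`, `gap_id` and `gap_len` are just for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('1990-01-01', periods=10, freq='D')
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan,
               8.0, np.nan, 10.0], index=idx)

# Plain 'limit': fills up to 2 consecutive NaNs, even at the start of a long gap
partial = s.interpolate(method='linear', limit=2)

# Workaround: only fill gaps that are at most max_gap steps long
max_gap = 2
is_na = s.isna()
gap_id = (~is_na).cumsum()                        # Constant within each run of NaNs
gap_len = is_na.groupby(gap_id).transform('sum')  # Length of the gap each row belongs to
keep = ~is_na | (gap_len <= max_gap)              # Observed values and short gaps
filled = s.interpolate(method='linear').where(keep)

print(pd.DataFrame({'raw': s, 'limit=2': partial, 'short_gaps_only': filled}))
```

With this series, the 'limit=2' column has two remaining NaNs inside the long gap, whereas 'short_gaps_only' leaves all four steps of that gap empty.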

## 5. Annual loads

Let's try a very simple (and not very accurate!) calculation of annual loads. We'll linearly interpolate the raw data to daily resolution, calculate daily loads by multiplying flows by concentrations, then sum the daily loads in each year to estimate a time series of annual loads. We start off with 'day_df', which we created above.

In [10]:
# Linear interpolation of Qs and NO3-N columns
lin_df = day_df.interpolate(method='linear')

# Calculate daily load (including converting units from ug-N/l and m3/s to kg/day)
lin_df['load_kg/day'] = lin_df['NO3-N']*lin_df['Qs']*24*60*60*1E-6

# Sum daily loads to annual totals (kg/yr; the column keeps its 'load_kg/day' name)
ann_df = lin_df[['load_kg/day']].resample('A').sum()

ann_df.head()

Date,load_kg/day
1990-12-31,58.945491
1991-12-31,58.785563
1992-12-31,52.195263
1993-12-31,55.491904
1994-12-31,114.82709
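
The unit conversion in the load calculation is easy to get wrong, so here it is spelled out step by step. 1 ug/l is the same as 1 mg/m3, so concentration times flow gives mg/s; multiplying by 86400 s/day gives mg/day, and multiplying by 1E-6 kg/mg gives kg/day. The function name below is just for illustration, and the check uses the first row of the raw data (27 ug/l and 0.005 m3/s).

```python
SECONDS_PER_DAY = 24 * 60 * 60   # s/day
KG_PER_MG = 1e-6                 # kg/mg

def daily_load_kg(conc_ug_per_l, flow_m3_per_s):
    """Convert a concentration (ug/l) and a flow (m3/s) to a load in kg/day."""
    mg_per_s = conc_ug_per_l * flow_m3_per_s   # ug/l == mg/m3, so this is mg/s
    mg_per_day = mg_per_s * SECONDS_PER_DAY
    return mg_per_day * KG_PER_MG

# First row of the raw data: 27 ug-N/l at 0.005 m3/s
first_day = daily_load_kg(27, 0.005)
print(first_day)
```

This comes to roughly 0.0117 kg-N/day, which is the same factor (24*60*60*1E-6) used in the cell above.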


In [11]:
# Plot
ann_df.plot()
mpld3.display()

## 6. Joins

The reason why many people end up using a database for this kind of analysis is that they need to be able to "join" different datasets, for example by matching dates in two time series to get the values to align correctly. This can all be done very easily with Python/Pandas too.

As an example, let's create another time series showing annual mean nitrate concentrations (using the raw data rather than the interpolated values) and then "join" it back to the annual loads series we created just above.

In [12]:
# Calculate annual average nitrate
nit_df = df[['NO3-N']].resample('A').mean()
nit_df.head()

Date,NO3-N
1990-12-31,22.568627
1991-12-31,20.372549
1992-12-31,19.607843
1993-12-31,18.921569
1994-12-31,24.45098


And now join this to the loads series by matching the dates.

In [13]:
# Database-style join by matching dates
join_df = ann_df.join(nit_df)
join_df.head()

Date,load_kg/day,NO3-N
1990-12-31,58.945491,22.568627
1991-12-31,58.785563,20.372549
1992-12-31,52.195263,19.607843
1993-12-31,55.491904,18.921569
1994-12-31,114.82709,24.45098
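
In the example above the two series cover exactly the same years, so every row finds a match. When the indexes differ, the type of join matters: `join` defaults to a "left" join, which keeps every row of the first dataframe and fills unmatched values with NaN. The sketch below uses two small made-up dataframes (rounded numbers, hypothetical names) to show the difference between 'left', 'inner' and 'outer'.

```python
import pandas as pd

# Made-up annual series with only partly overlapping years
loads = pd.DataFrame({'load_kg': [58.9, 58.8, 52.2]}, index=[1990, 1991, 1992])
conc = pd.DataFrame({'NO3-N': [20.4, 19.6, 18.9]}, index=[1991, 1992, 1993])

left = loads.join(conc)                # Default: keep all years in 'loads'
inner = loads.join(conc, how='inner')  # Only years present in both
outer = loads.join(conc, how='outer')  # All years from either dataframe

print(left)
print(inner)
print(outer)
```

Here 'left' has a NaN concentration for 1990, 'inner' only contains 1991 and 1992, and 'outer' covers 1990 to 1993.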


In [14]:
# Plot
join_df.plot(subplots=True, figsize=(12,6))
mpld3.display()

## 7. Comparison to RESA2

RESA2 includes options for automatically calculating loads. Most of the code files for RESA are on the network here:

K:\Prosjekter\langtransporterte forurensninger\RESAII\Kode

but the file names aren't always obvious (at least to me), and so far I haven't managed to find the bit of code that does the interpolation. However, based on looking at the output and Kari's e-mails, I can take a pretty good guess at what Tore's code does. Probably something like this:

 1. Read the NVE daily flow data associated with the site in question. Scale it according to the ratio of catchment areas between the discharge and chemistry stations. <br><br>
 
 2. Read the chemistry data for the site in question. <br><br>
 
 3. Convert both series to daily resolution and fill any no data gaps by linear interpolation. <br><br>
 
 4. Calculate daily loads. <br><br>
 
 5. Sum the loads to the desired frequency.
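
The five steps above can be sketched with pandas as follows. This is only my guess at the logic, working on series already held in memory rather than reading from the database, and it is not Tore's actual code. The function name, its arguments and the `area_ratio` scaling factor are all hypothetical, the units are assumed to be m3/s and ug/l as in the earlier sections, and for simplicity it only sums to annual totals.

```python
import pandas as pd

def sketch_annual_loads(flow_m3s, conc_ugl, area_ratio=1.0):
    """Rough sketch of the assumed RESA2 logic; returns annual loads in kg."""
    # Steps 1 and 3: scale the flows by catchment area ratio, then make daily
    flow = (flow_m3s * area_ratio).resample('D').mean().interpolate(method='linear')

    # Steps 2 and 3: chemistry to daily resolution, gaps filled linearly
    conc = conc_ugl.resample('D').mean().interpolate(method='linear')

    # Align the two series on date and drop days where either is missing
    daily = pd.concat([flow, conc], axis=1, keys=['flow', 'conc']).dropna()

    # Step 4: daily loads, converting ug/l * m3/s to kg/day
    daily_kg = daily['flow'] * daily['conc'] * 24 * 60 * 60 * 1e-6

    # Step 5: sum the daily loads within each calendar year
    return daily_kg.groupby(daily_kg.index.year).sum()

# Made-up example: constant flow, two chemistry samples ten days apart
days = pd.date_range('1990-01-01', '1990-01-10', freq='D')
flow = pd.Series(1.0, index=days)
conc = pd.Series([10.0, 10.0], index=pd.to_datetime(['1990-01-01', '1990-01-10']))

loads = sketch_annual_loads(flow, conc, area_ratio=0.5)
print(loads)
```

In the made-up example, ten days at 0.5 m3/s and 10 ug/l give 0.432 kg/day, so the total for 1990 should be about 4.32 kg.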
 
I'd like to duplicate this behaviour and have written a function in Python which bypasses Tore's application entirely and reads directly from the database. Start off by importing my new functions.

In [15]:
# Import custom functions
import imp
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')

resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

My function has the following options (from my very incomplete documentation):

    simple_loads(stn_code, par, st_dt, end_dt, 
                 interp_limit=None, output_unit='kg', freq='A'):
    """ Calculates loads at the desired frequency. First extracts daily water 
        chemistry and flow time series for the selected site, parameter
        and time period. Data gaps are filled by linear interpolation, 
        daily loads are calculated and the resulting series is summed 
        to the specified frequency.
        
        Note: The units conversion step is currently hard-coded and 
        needs adapting!
        
    Args: 
        stn_code     RESA2 station code
        par          Parameter of interest. Must match an entry in the 
                     PARAMETER_DEFINITIONS table
        st_dt        Format: 'YYYY-MM-DD'
        end_dt       Format: 'YYYY-MM-DD'
        interp_limit Maximum number of steps to interpolate. 
                     Default: interpolate all
        output_unit  Units for loads (per specified frequency)
        freq         'A', Annual; 'M', Monthly; 'D', daily
    
    Returns:
        Dataframe of loads
    """

In the example below, I'm calculating annual N loads (kg-N/yr) for Langtjern. 

In [16]:
# Calculate annual loads
df = resa2_basic.simple_loads('LAE01', 'NO3-N', '1990-01-01', '2015-12-31')
df.index = df.index.year
df.head()

Year,Load_kg/A
1990,68.25497
1991,54.280991
1992,46.37372
1993,50.11088
1994,127.221746


How does this compare to the output from RESA2? The code below reads the annual loads from RESA and joins them to the above dataframe.

In [17]:
# Read data
in_xlsx = r'C:\Data\James_Work\Staff\Kari_A\Python_Example\Example_Data\loads_from_resa2.xlsx'
resa_df = pd.read_excel(in_xlsx, sheet_name='DATA', index_col=0)

df = df.join(resa_df)
df

Year,Load_kg/A,RESA_Load_kg
1990,68.25497,68.317732
1991,54.280991,54.280991
1992,46.37372,46.37372
1993,50.11088,50.11088
1994,127.221746,127.221746
1995,77.4708,77.4708
1996,52.05754,52.05754
1997,42.163319,42.163319
1998,59.84065,59.84065
1999,58.560264,58.560264


For the years shown, the results are identical from 1991 onwards and differ by less than 0.1% in 1990, which suggests that my code is doing the same as Tore's. A bit more testing will be required, though, as my code is very rough at present.
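
Rather than comparing the two columns by eye, the difference can be calculated directly. The sketch below re-types the first three rows of the table above into a small dataframe (`cmp_df` and `diff_pct` are names I've made up for illustration) and expresses the difference as a percentage of the RESA2 value.

```python
import pandas as pd

# First three rows of the comparison table above
cmp_df = pd.DataFrame({'Load_kg/A': [68.25497, 54.280991, 46.37372],
                       'RESA_Load_kg': [68.317732, 54.280991, 46.37372]},
                      index=[1990, 1991, 1992])

# Difference between my loads and RESA2's, as a percentage of the RESA2 value
cmp_df['diff_pct'] = (100 * (cmp_df['Load_kg/A'] - cmp_df['RESA_Load_kg'])
                      / cmp_df['RESA_Load_kg'])

print(cmp_df)
```

Only 1990 shows a non-zero difference, and it is smaller than 0.1% in absolute terms.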