
# Ambient Temperature (TAmb) Data Filtering & Cleaning

**Steps in this file:**

- **Step 1 -** Repeating values are removed

- **Step 2 -** Identical values across temperature sensors are removed

- **Step 3 -** Outliers are removed (data points beyond both two standard deviations and 3% of the mean across sensors)

- **Step 4 -** Gaps shorter than 3 hrs: filled by linear interpolation between the last valid value before the gap and the first valid value after it

- **Step 5 -** Longer gaps: null data are estimated from the same timestep on the previous and next day

### Imports

Imports for timeseries

In [1]:
import pandas as pd 
import numpy as np
import datetime

### Load Data

#### Notes:
-  Setting first column as dataframe index
-  Automatically interpreting date-like values as dates through 'parse_dates=True'
-  Interpreting dates with format dd/mm/yyyy through 'dayfirst=True'

In [2]:
data = pd.read_csv("Temp data.csv",index_col=0, parse_dates=True, dayfirst=True)
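
As a quick check of what these options do, the sketch below parses a small in-memory CSV. The sample rows, the `Timestamp` header and the name `sample_csv` are made up for illustration and are not taken from the real input file.

```python
import io
import pandas as pd

# Hypothetical two-row sample, assuming dd/mm/yyyy timestamps in the first column
sample_csv = io.StringIO(
    "Timestamp,Tamb_sensor_1\n"
    "01/12/2019 00:15,25.18\n"
    "02/12/2019 00:15,24.61\n"
)

sample = pd.read_csv(sample_csv, index_col=0, parse_dates=True, dayfirst=True)

# With dayfirst=True, 02/12/2019 is read as 2 December rather than 12 February
print(sample.index[1])
```

Without `dayfirst=True`, ambiguous dates such as 02/12/2019 could be read month-first, which would scramble the time order of the index.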

### Check upload

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12960 entries, 2019-12-01 00:15:00 to 2020-04-14 00:00:00
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Tamb_sensor_1   12941 non-null  float64
 1   Tamb_sensor_2   12941 non-null  float64
 2   Tamb_sensor_3   12941 non-null  float64
 3   Tamb_sensor_4   12941 non-null  float64
 4   Tamb_sensor_5   12941 non-null  float64
 5   Tamb_sensor_6   12941 non-null  float64
 6   Tamb_sensor_7   12941 non-null  float64
 7   Tamb_sensor_8   12941 non-null  float64
 8   Tamb_sensor_9   12941 non-null  float64
 9   Tamb_sensor_10  12941 non-null  float64
 10  Tamb_sensor_11  12941 non-null  float64
 11  Tamb_sensor_12  12941 non-null  float64
dtypes: float64(12)
memory usage: 1.3 MB


In [4]:
data.head()

| | Tamb_sensor_1 | Tamb_sensor_2 | Tamb_sensor_3 | Tamb_sensor_4 | Tamb_sensor_5 | Tamb_sensor_6 | Tamb_sensor_7 | Tamb_sensor_8 | Tamb_sensor_9 | Tamb_sensor_10 | Tamb_sensor_11 | Tamb_sensor_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-12-01 00:15:00 | 25.18 | 25.51 | 25.51 | 25.42 | 25.49 | 25.14 | 25.64 | 0.0 | 25.72 | 25.97 | 25.83 | 26.24 |
| 2019-12-01 00:30:00 | 24.61 | 24.98 | 24.98 | 25.32 | 25.06 | 24.86 | 24.86 | 0.0 | 25.38 | 25.72 | 25.47 | 25.83 |
| 2019-12-01 00:45:00 | 24.12 | 24.39 | 24.39 | 24.81 | 24.53 | 24.38 | 24.44 | 0.0 | 24.88 | 25.23 | 24.92 | 25.26 |
| 2019-12-01 01:00:00 | 23.75 | 24.06 | 24.06 | 24.43 | 24.2 | 23.98 | 23.99 | 0.0 | 24.42 | 24.81 | 24.49 | 24.91 |
| 2019-12-01 01:15:00 | 23.32 | 23.72 | 23.72 | 24.03 | 23.87 | 23.5 | 23.57 | 0.0 | 24.17 | 24.57 | 24.28 | 24.62 |


# STEP 0 - CREATE SUMMARY DATAFRAME

This dataframe records the average after each cleaning (sub-)step, so that the impact of each step can be viewed and the order of the methodology can be checked

In [5]:
summary_df = pd.DataFrame()

Calculate the average across sensors for the initial file

In [6]:
summary_df['Step 0'] = data.mean(axis=1)
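
The row-wise mean used for the summary skips nulls by default, so the average stays defined as long as at least one sensor still has a value at that timestep. A minimal sketch with made-up readings (the column names `s1` to `s3` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up readings from three sensors; s2 is missing at the second timestep
toy = pd.DataFrame({"s1": [25.0, 24.0], "s2": [26.0, np.nan], "s3": [27.0, 25.0]})

# mean(axis=1) averages across columns and ignores NaN by default
row_mean = toy.mean(axis=1)
print(row_mean.tolist())
```

The second value is the mean of the two remaining sensors only, which is why the summary columns can shift after each cleaning step.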

# STEP 1 - REMOVE REPEATING VALUES

pandas' diff() function returns the difference between each value and the value at the previous timestep.

Remove values that are identical to the value in the previous timestep, as this indicates a communication error.

Also remove values that differ from the previous timestep by more than **3 degrees**. This threshold is arbitrary and was set on a 10-minute basis; the file loaded above has a 15-minute timestep, so the threshold is to be re-assessed.

In [7]:
data_diff = data.diff()

In [8]:
# Iterate through each sensor
for sensors in data:

    # Remove identical values using diff() function 
    data[sensors] = np.where(data_diff[sensors] == 0, np.nan, data[sensors])
    
    # Remove values if previous timestep is erroneous 
    #data[sensors] = np.where(data_diff[sensors].shift(1) == 0, np.nan, data[sensors])
    
    # Remove values if following timestep is erroneous
    #data[sensors] = np.where(data_diff[sensors].shift(-1) == 0, np.nan, data[sensors])
    
    # Remove values if the change from the previous timestep is greater than +/-3 degrees Celsius
    data[sensors] = np.where((data_diff[sensors] > 3) | (data_diff[sensors] < -3), np.nan, data[sensors])
    

In [9]:
summary_df['Step 1'] = data.mean(axis=1)
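
To see how the diff()-based filter behaves, the sketch below applies the same two conditions to a made-up series. It uses `Series.mask` instead of `np.where`, which should give the same result here; the values and names are illustrative only.

```python
import pandas as pd

# Made-up readings: a repeated value (position 2) and a 5-degree jump (position 4)
toy = pd.Series([20.0, 20.4, 20.4, 20.9, 25.9, 26.1])

# Change from the previous reading; the first entry is NaN
step = toy.diff()

# Flag repeats (no change) and jumps larger than 3 degrees in either direction
bad = (step == 0) | (step.abs() > 3)

# Replace flagged readings with NaN
cleaned = toy.mask(bad)
print(cleaned.tolist())
```

Only the reading where the jump occurs is removed. The following reading (26.1) is kept, because its change from the previous value is small.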

# STEP 2 - REMOVING IDENTICAL VALUES ACROSS SENSORS

Calculate the standard deviation (std) across sensors at each timestep

In [10]:
std = data.std(axis=1)

### Remove identical values

A standard deviation of 0 means that all of the temperature sensors report the same value, which indicates a communication error.

In [11]:
# Iterate through each sensor
for sensors in data:

    # Remove values where standard deviation across timestep is zero
    data[sensors] = np.where(std == 0, np.nan, data[sensors])

In [12]:
summary_df['Step 2'] = data.mean(axis=1)
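
The zero-spread check can be tried on a small made-up frame. This is only a sketch: the column names are hypothetical, and it uses `.loc` assignment rather than the `np.where` loop above.

```python
import numpy as np
import pandas as pd

# Made-up readings from three sensors; the second timestep is identical across sensors
toy = pd.DataFrame(
    {"s1": [25.1, 24.0, 23.7], "s2": [25.4, 24.0, 23.9], "s3": [25.2, 24.0, 23.5]}
)

# Spread across sensors at each timestep
row_std = toy.std(axis=1)

# Blank out every sensor at timesteps where the spread is exactly zero
cleaned = toy.copy()
cleaned.loc[row_std == 0, :] = np.nan
print(cleaned)
```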

# STEP 3 - REMOVE OUTLIERS

Compare each sensor value with a lower and an upper boundary. At each timestep, each boundary is the wider of two limits around the mean across sensors: 2 standard deviations, or 3% of the mean.

In [13]:
pyr_accuracy = 0.03
average = data.mean(axis=1)

In [14]:
upper_std_boundary = average + (std * 2)
lower_std_boundary = average - (std * 2)
upper_accuracy_boundary = average * (1 + pyr_accuracy)
lower_accuracy_boundary = average * (1 - pyr_accuracy)

In [15]:
upper_boundary = np.maximum(upper_std_boundary, upper_accuracy_boundary)
lower_boundary = np.minimum(lower_std_boundary, lower_accuracy_boundary)

Remove values that fall outside the boundaries; keep all other values.

In [16]:
# Iterate through each sensor
for sensors in data:

    # Remove values outside either boundary
    data[sensors] = np.where((data[sensors] > upper_boundary) | (data[sensors] < lower_boundary),
                             np.nan, data[sensors])

In [17]:
summary_df['Step 3'] = data.mean(axis=1)
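
The boundary logic can be checked on a single made-up timestep. In this sketch the mean and standard deviation are both computed from the same toy data, and the sensor names and readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up single timestep with eight sensors; the last one reads high
toy = pd.DataFrame([[25.0] * 7 + [27.5]], columns=[f"s{i}" for i in range(1, 9)])

accuracy = 0.03  # same 3% tolerance as pyr_accuracy above

mean = toy.mean(axis=1)
std = toy.std(axis=1)

# Each boundary is the wider of the 2-std limit and the 3% limit
upper = np.maximum(mean + 2 * std, mean * (1 + accuracy))
lower = np.minimum(mean - 2 * std, mean * (1 - accuracy))

# Compare every sensor against the boundaries of its own timestep
outside = toy.gt(upper, axis=0) | toy.lt(lower, axis=0)
cleaned = toy.mask(outside)
print(cleaned)
```

Here the 2-std limit is the wider one, so only the 27.5 reading is removed; a reading between the 3% limit and the 2-std limit would have been kept.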

# STEP 4 - FILL IN DATA GAPS LESS THAN 3HRS

Create a mask that is True for valid data points and for null data points that belong to a gap shorter than 3 hrs (12 timesteps at the 15-minute resolution of this file)

*Solution from JohnE on https://stackoverflow.com/questions/30533021/interpolate-or-extrapolate-only-small-gaps-in-pandas-dataframe*

In [18]:
mask = data.copy()

# Iterate through each sensor
for sensors in data:
    
    # Create new DataFrame
    df = pd.DataFrame(data[sensors])
    
    # This column labels each run of consecutive valid data points or consecutive nulls (increments at every change)
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    
    # This column filled with 1s is counted by the groupby function below to get the length of each run
    df['ones'] = 1
    
    # Add to the mask if the run is shorter than 3hrs (12 timesteps of 15 min), or contains valid datapoints.
    mask[sensors] = (df.groupby('new')['ones'].transform('count') < 12) | data[sensors].notnull()

Interpolate all nulls (back-filling any leading nulls), then keep only the values selected by the mask

In [19]:
data = data.interpolate().bfill()[mask]

In [20]:
summary_df['Step 4'] = data.mean(axis=1)
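
The run-length mask is easier to follow on a short made-up series. The sketch below uses a hypothetical limit of 3 timesteps (`max_gap`) purely so that the toy data contains one short gap and one long gap.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Made-up series with a 2-step gap and a 4-step gap
toy = pd.Series([1.0, nan, nan, 4.0, nan, nan, nan, nan, 9.0])
max_gap = 3  # hypothetical limit for this toy example

# Label each run of consecutive valid values or consecutive nulls with its own id
run_id = (toy.notnull() != toy.shift().notnull()).cumsum()

# Length of the run that each value belongs to
run_length = run_id.groupby(run_id).transform("count")

# Keep valid values, and nulls that sit in a run shorter than max_gap
mask = (run_length < max_gap) | toy.notnull()

# Interpolate everything, then discard the interpolated values in long gaps
filled = toy.interpolate().where(mask)
print(filled.tolist())
```

The 2-step gap is filled by linear interpolation, while the 4-step gap stays null and is left for the next step.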

# STEP 5 - FILL IN DATA GAPS MORE THAN 3HRS

In [21]:
prev_day = data.copy()
next_day = data.copy()

In [22]:
# Iterate through each sensor
for sensors in data:

    # Within each time-of-day group, shift by one row to get the value from the previous / next day
    prev_day[sensors] = data.groupby([data.index.hour, data.index.minute])[sensors].shift(1)
    next_day[sensors] = data.groupby([data.index.hour, data.index.minute])[sensors].shift(-1)

Create DataFrame with (i) replacement values which will be amended, (ii) the previous day's values (data shifted by +1 day), (iii) the next day's values (data shifted by -1 day)

In [23]:
replacement = data.copy()
replacement = replacement.join(prev_day, rsuffix='_Prev')
replacement = replacement.join(next_day, rsuffix='_Next')

Set each replacement value to the average of the corresponding sensor's previous-day and next-day values

In [24]:
# Iterate through each sensor
for sensors in data:
    replacement[sensors] = replacement[[sensors + '_Prev', sensors + '_Next']].mean(axis=1)

Use the replacement values only where the mask is False (the long gaps left unfilled in Step 4)

In [25]:
data[~mask] = replacement[~mask]

In [26]:
summary_df['Step 5'] = data.mean(axis=1)
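
The previous/next-day estimate can be illustrated on a made-up series with two readings per day. The sketch assumes the index has no missing timestamps, so that one row within a time-of-day group corresponds to exactly one day; all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Three made-up days of 12-hourly readings; day 2 is missing entirely
index = pd.date_range("2020-01-01", periods=6, freq=pd.Timedelta(hours=12))
toy = pd.Series([10.0, 20.0, np.nan, np.nan, 14.0, 24.0], index=index)

# Group by time of day so that shift() moves one day at a time within each group
by_time = toy.groupby([toy.index.hour, toy.index.minute])
prev_day = by_time.shift(1)
next_day = by_time.shift(-1)

# Average of the neighbouring days; mean() skips a missing neighbour
estimate = pd.concat([prev_day, next_day], axis=1).mean(axis=1)

# Fill only the nulls with the estimate
filled = toy.fillna(estimate)
print(filled.tolist())
```

If only one of the two neighbouring days is available, the estimate falls back to that single value; if both are missing, the gap stays null.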

# STEP 6 - SAVE OUTPUT

Save summary

In [30]:
summary_df.to_csv("Tamb Clean Data - Summary.csv")

Save cleaned data

In [28]:
data.to_csv("Tamb Clean Data.csv")

Save final output

In [29]:
final_df = pd.DataFrame(summary_df['Step 5'])
final_df.rename(columns={'Step 5':'Tamb (°C)'}, inplace=True)
final_df.to_csv("Tamb Clean Data - Average.csv")