# Sertao - Tamb Data Filtering & Cleaning

**Steps in this file:**

- **Step 1 -** Repeating values are removed

- **Step 2 -** Identical values across temperature sensors are removed

- **Step 3 -** Outliers are removed (data points lying beyond both two standard deviations and 3% of the cross-sensor mean, i.e. outside the wider of the two bands)

- **Step 4 -** Gaps shorter than 3 hrs: filled by linear interpolation (check start point, end point, and fill linearly)

- **Step 5 -** Other timesteps: null data are estimated from the same timestep on the previous and next day

### Imports

Imports for timeseries

In [1]:
import pandas as pd 
import numpy as np
import datetime

### Load Data

#### Notes:
-  Setting first column as dataframe index
-  Automatically interpreting date-like values as dates through 'parse_dates=True'
-  Interpreting dates with format dd/mm/yyyy through 'dayfirst=True'
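
As a minimal sketch of what these options do, the snippet below parses a small in-memory CSV. The column name `Timestamp` and the two rows are hypothetical and are not taken from the real input file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in dd/mm/yyyy format (not from the real file)
sample_csv = io.StringIO(
    "Timestamp,MET01,MET02\n"
    "01/04/2019 00:00,10.07,10.19\n"
    "02/04/2019 00:00,10.39,10.11\n"
)

# Same options as the notebook: first column as index, parsed as day-first dates
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=True, dayfirst=True)

# With dayfirst=True, "02/04/2019" is read as 2 April rather than 4 February
print(sample.index)
```

Without 'dayfirst=True', an ambiguous date such as 02/04/2019 could be interpreted month-first, which would silently reorder the time series.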

In [2]:
Data = pd.read_csv("2 SMR - Tamb Data 10T - Timestep Cleaning.csv",index_col=0, parse_dates=True, dayfirst=True)

### Check upload

In [3]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48667 entries, 2019-04-01 00:00:00 to 2020-03-03 23:00:00
Data columns (total 18 columns):
MET01    47535 non-null float64
MET02    47535 non-null float64
MET03    47535 non-null float64
MET04    47535 non-null float64
MET05    47535 non-null float64
MET06    47535 non-null float64
MET07    47535 non-null float64
MET08    47535 non-null float64
MET09    47535 non-null float64
MET10    47535 non-null float64
MET11    47535 non-null float64
MET12    47535 non-null float64
MET13    47535 non-null float64
MET14    47535 non-null float64
MET15    47535 non-null float64
MET16    47535 non-null float64
MET17    47535 non-null float64
MET18    47535 non-null float64
dtypes: float64(18)
memory usage: 7.1 MB


In [4]:
Data.head()

Unnamed: 0,MET01,MET02,MET03,MET04,MET05,MET06,MET07,MET08,MET09,MET10,MET11,MET12,MET13,MET14,MET15,MET16,MET17,MET18
2019-04-01 00:00:00,10.07,10.19,10.15,10.13,9.98,10.61,10.36,9.7,9.89,10.53,10.11,10.69,10.27,9.81,9.73,10.5,9.24,10.5
2019-04-01 00:10:00,10.39,10.11,10.09,10.16,9.69,10.35,10.21,9.74,9.67,10.0,9.92,10.32,10.17,9.53,9.68,10.3,9.46,10.49
2019-04-01 00:20:00,9.78,9.71,10.34,9.93,9.91,10.24,10.41,10.3,10.8,9.86,9.67,10.37,10.05,9.04,9.95,9.89,9.4,10.19
2019-04-01 00:30:00,9.36,9.7,9.51,9.17,9.25,10.42,10.78,9.68,10.17,9.53,9.31,10.66,10.09,9.21,9.91,9.36,9.73,10.0
2019-04-01 00:40:00,10.17,10.18,9.12,9.21,8.83,10.05,10.24,9.43,9.82,8.91,9.19,10.34,9.95,9.84,10.09,8.56,9.48,9.77


# STEP 0 - CREATE SUMMARY DATAFRAME

This dataframe records the result of each cleaning (sub-)step, to show the impact of each step and to check that the order of the methodology is correct

In [5]:
Summary_df = pd.DataFrame()

Calculate average of initial file

In [6]:
Summary_df['Step 0'] = Data.mean(axis=1)

# STEP 1 - REMOVE REPEATING VALUES

pandas' diff() function subtracts the value at the previous timestep (n-1) from each value (n).

Values identical to the value in the previous timestep are removed, as this indicates a communication error.

Values that changed by more than **3 degrees Celsius between consecutive 10-minute timesteps** are also removed. This threshold is arbitrary and should be re-assessed if the timestep frequency is different.

In [7]:
Data_diff = Data.diff()

In [8]:
# Iterate over each MET station column
for MET_station in Data:

    # Remove identical values using diff() function 
    Data[MET_station] = np.where(Data_diff[MET_station] == 0, np.nan, Data[MET_station])
    
    # Remove values if previous timestep is erroneous 
    #Data[MET_station] = np.where(Data_diff[MET_station].shift(1) == 0, np.nan, Data[MET_station])
    
    # Remove values if following timestep is erroneous
    #Data[MET_station] = np.where(Data_diff[MET_station].shift(-1) == 0, np.nan, Data[MET_station])
    
    # Remove values if change in temperature is greater than +/-3 degrees Celsius / 10min 
    Data[MET_station] = np.where((Data_diff[MET_station] > 3) | (Data_diff[MET_station] < -3), np.nan, Data[MET_station])
    

In [9]:
Summary_df['Step 1'] = Data.mean(axis=1)
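
The two checks in this step can be illustrated on a single toy series. This is only a sketch: the readings are hypothetical, and `Series.mask` is used here in place of the notebook's `np.where` for brevity.

```python
import pandas as pd

# Hypothetical 10-minute readings: one stuck value (10.2 repeated) and one 4-degree jump
toy = pd.Series([10.0, 10.2, 10.2, 10.5, 14.5, 14.6])

step = toy.diff()          # change since the previous timestep (first entry is NaN)
repeated = step == 0       # identical to the previous reading
jump = step.abs() > 3      # changed by more than 3 degrees in one timestep

# Flagged readings become NaN; everything else is kept
cleaned = toy.mask(repeated | jump)
print(cleaned.tolist())
```

As in the notebook, the differences are computed once on the original data, so the reading that follows a jump (14.6 here) is kept because its own step is small.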

# STEP 2 - REMOVING IDENTICAL VALUES ACROSS SENSORS

Calculate the standard deviation (std) across sensors at each timestep

In [10]:
std = Data.std(axis=1)

### Remove identical values

A standard deviation of 0 means that all of the temperature sensors report the same value, which indicates a communication error.

In [11]:
# Iterate over each MET station column
for MET_station in Data:

    # Remove values where standard deviation across timestep is zero
    Data[MET_station] = np.where(std == 0, np.nan, Data[MET_station])

In [12]:
Summary_df['Step 2'] = Data.mean(axis=1)
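
A small sketch of the row-wise check, using a hypothetical three-sensor frame. The middle row holds the same value on every sensor, so its standard deviation is zero and the whole row is blanked.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the second timestep is identical on all sensors
toy = pd.DataFrame(
    {
        "MET01": [10.1, 10.0, 10.3],
        "MET02": [10.4, 10.0, 10.0],
        "MET03": [9.9, 10.0, 10.2],
    }
)

# Standard deviation across sensors for each timestep (row)
row_std = toy.std(axis=1)

# Blank every sensor at timesteps where all sensors agree exactly
cleaned = toy.copy()
cleaned.loc[row_std == 0, :] = np.nan
print(cleaned)
```

The identical value here (10.0) is exactly representable as a float. With arbitrary decimals, a row of equal readings may produce a tiny non-zero standard deviation through rounding, so a tolerance-based comparison might be worth considering.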

# STEP 3 - REMOVE OUTLIERS

Compare each temperature sensor value with a lower and an upper boundary. Each boundary is the wider of two bands around the cross-sensor mean: 2 standard deviations, or 3% of the mean.

In [13]:
pyr_accuracy = 0.03
average = Data.mean(axis=1)

In [14]:
Upper_std_boundary = average + (std * 2)
Lower_std_boundary = average - (std * 2)
Upper_accuracy_boundary = average * (1 + pyr_accuracy)
Lower_accuracy_boundary = average * (1 - pyr_accuracy)

In [15]:
Upper_boundary = np.maximum(Upper_std_boundary, Upper_accuracy_boundary)
Lower_boundary = np.minimum(Lower_std_boundary, Lower_accuracy_boundary)

Remove values that fall outside the boundaries; keep all other values.

In [16]:
# Iterate over each MET station column
for MET_station in Data:

    # Remove values above the upper boundary or below the lower boundary
    Data[MET_station] = np.where((Data[MET_station] > Upper_boundary) | (Data[MET_station] < Lower_boundary), 
                                 np.nan, Data[MET_station])

In [17]:
Summary_df['Step 3'] = Data.mean(axis=1)
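
The boundary logic can be checked on a toy frame. This is a sketch under assumptions: the eight sensor names and the readings are hypothetical, and the comparison uses `DataFrame.gt` / `DataFrame.lt` instead of a per-column loop.

```python
import numpy as np
import pandas as pd

sensors = [f"S{i}" for i in range(1, 9)]  # hypothetical sensor names
toy = pd.DataFrame(
    [
        [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 13.0],     # 13.0 is a clear outlier
        [10.0, 10.02, 9.98, 10.0, 10.01, 9.99, 10.0, 10.08],  # sensors agree closely
    ],
    columns=sensors,
)

accuracy = 0.03
mean = toy.mean(axis=1)
std = toy.std(axis=1)

# Take the wider of the two bands on each side of the mean
upper = np.maximum(mean + 2 * std, mean * (1 + accuracy))
lower = np.minimum(mean - 2 * std, mean * (1 - accuracy))

# Compare every column against the per-row boundaries and blank the outliers
outside = toy.gt(upper, axis=0) | toy.lt(lower, axis=0)
cleaned = toy.mask(outside)
print(cleaned)
```

In the first row the 2-std band is the wider one and 13.0 falls outside it. In the second row the sensors agree so closely that 10.08 lies beyond 2 standard deviations, but it is still within 3% of the mean and is therefore kept.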

# STEP 4 - FILL IN DATA GAPS LESS THAN 3HRS

Create a mask that is True for timesteps holding a valid data point, and for null data points that belong to a data gap shorter than 3 hrs (18 timesteps)

*Solution from JohnE on https://stackoverflow.com/questions/30533021/interpolate-or-extrapolate-only-small-gaps-in-pandas-dataframe*

In [18]:
mask = Data.copy()

# Iterate through each MET Station
for MET_Station in Data:
    
    # Create new DataFrame
    df = pd.DataFrame(Data[MET_Station])
    
    # This column labels each run of consecutive valid data points or consecutive nulls (the label increments whenever validity changes)
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    
    # This column filled with 1s is required for the groupby function below  
    df['ones'] = 1
    
    # Mask is True if the run is shorter than 3hrs (18 timesteps), or if the data point is valid
    mask[MET_Station] = (df.groupby('new')['ones'].transform('count') < 18) | Data[MET_Station].notnull()

Interpolate linearly across all nulls, then keep only the values allowed by the mask (nulls in longer gaps stay null)

In [19]:
Data = Data.interpolate().bfill()[mask]

In [20]:
Summary_df['Step 4'] = Data.mean(axis=1)
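
The run-length mask is easier to follow on a short series. The sketch below uses a hypothetical limit of 3 timesteps instead of 18, and `Series.where` in place of the notebook's boolean DataFrame indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a 2-step gap and a 4-step gap
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, np.nan, 9.0])
max_gap = 3  # hypothetical limit; the notebook uses 18 timesteps (3 hrs)

# Label each run of consecutive valid or null values
run_id = (s.notna() != s.shift().notna()).cumsum()

# Length of the run that each timestep belongs to
run_length = run_id.groupby(run_id).transform("count")

# Keep valid points, plus nulls that sit in a run shorter than the limit
keep = (run_length < max_gap) | s.notna()

# Interpolate everything, then blank out the long gaps again
filled = s.interpolate().where(keep)
print(filled.tolist())
```

The 2-step gap is filled linearly between 1.0 and 4.0, while the 4-step gap is left as NaN for the next step to handle.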

# STEP 5 - FILL IN DATA GAPS MORE THAN 3HRS

In [21]:
Prev = Data.copy()
Next = Data.copy()

In [22]:
# Iterate through each MET Station
for MET_Station in Data:

    Prev[MET_Station] = Data.groupby([Data.index.hour, Data.index.minute])[MET_Station].shift(1)
    Next[MET_Station] = Data.groupby([Data.index.hour, Data.index.minute])[MET_Station].shift(-1)

Create a DataFrame with (i) the replacement values, which will be amended, (ii) the data shifted by +1 day, (iii) the data shifted by -1 day

In [23]:
Replacement = Data.copy()
Replacement = Replacement.join(Prev, rsuffix='_Prev')
Replacement = Replacement.join(Next, rsuffix='_Next')

Set each replacement value to the average of the corresponding MET station's previous-day and next-day values

In [24]:
# Iterate through each MET Station
for MET_Station in Data:
    Replacement[MET_Station] = Replacement[[MET_Station + '_Prev', MET_Station + '_Next']].mean(axis=1)

Use the replacement values only where the mask is False, i.e. for the nulls that were not filled in Step 4

In [25]:
Data[~mask] = Replacement[~mask]

In [26]:
Summary_df['Step 5'] = Data.mean(axis=1)
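
A minimal sketch of the same-timestep estimate for one sensor. The 8-hour frequency and the readings are hypothetical, chosen only to keep the example short; the grouping by hour and minute follows the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical series: 3 days x 3 timesteps per day, with one missing value on day 2
idx = pd.date_range("2019-04-01", periods=9, freq=pd.Timedelta(hours=8))
s = pd.Series([10.0, 20.0, 15.0, 11.0, np.nan, 16.0, 12.0, 24.0, 17.0], index=idx)

# Group by time of day so that shift() moves along days, not along timesteps
slot = [s.index.hour, s.index.minute]
prev_day = s.groupby(slot).shift(1)    # same timestep, previous day
next_day = s.groupby(slot).shift(-1)   # same timestep, next day

# Average of the two neighbours; mean() skips NaN if only one neighbour exists
estimate = pd.concat([prev_day, next_day], axis=1).mean(axis=1)

# Only the missing points are replaced
filled = s.fillna(estimate)
print(filled)
```

The missing 08:00 value on day 2 becomes the average of 20.0 and 24.0. Note that this approach relies on the index having one entry per timestep per day: if whole timesteps are absent from the index, the shift would reach a day further away than intended.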

# STEP 6 - SAVE OUTPUT

Save summary

In [27]:
Summary_df.to_csv("3 SMR - Tamb Data Cleaning - Summary.csv")

Save cleaned data

In [28]:
Data.to_csv("3 SMR - Tamb Data 10T - Data Cleaning.csv")

Save final output (average across sensors after Step 5)

In [29]:
Final_df = pd.DataFrame(Summary_df['Step 5'])
Final_df.rename(columns={'Step 5':'Tamb (°C)'}, inplace=True)
Final_df.to_csv("3 SMR - Tamb Data 10T - Average.csv")