## Preprocessing the Frongoch Farm Magnetometer Data

The magnetometer data is stored on an SMB server with the following folder structure:
![The magnetometer data](mag_file_struct.png)
All data recorded before 2014 is stored in the folder ``2013``, as one ``.CSV`` file (extension in capitals) per day. Data from 2014 onwards is stored in one folder per year, named after the year of recording. (N.B. At the time of writing, the folders only go up to 2018.)
Each ``.csv`` file contains 3 columns: the time of the recording (timestamp), the reading (a number from 0 to 65534) and the temperature (in degrees Celsius). The first task is to combine these daily ``.csv`` files into yearly ``.csv`` files.

``.csv`` files created before 2014 only have the hour, minute and second in their timestamp.
``.csv`` files from 2014 onwards have a LabView timestamp, which is the number of seconds elapsed since $00:00:00, 1^{st} January, 1904$ (LabView zerotime, see http://www.ni.com/tutorial/7900/en/). To convert this timestamp to a Python (or numpy) datetime object, it first needs converting to POSIX time. This is done by adding the value of LabView zerotime in POSIX time (a negative number, since 1904 precedes the POSIX epoch of 1970) to the LabView timestamps.
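
As a minimal sketch of this conversion, the offset can be computed once and added to the raw values before handing them to pandas. The names `LABVIEW_EPOCH_POSIX` and `labview_to_datetime`, and the example timestamp, are illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd

# LabView zerotime (1904-01-01 00:00:00) expressed in POSIX seconds.
# The value is negative because 1904 precedes the POSIX epoch (1970).
LABVIEW_EPOCH_POSIX = (
    np.datetime64("1904-01-01T00:00:00") - np.datetime64("1970-01-01T00:00:00")
) / np.timedelta64(1, "s")


def labview_to_datetime(labview_seconds):
    """Convert a pandas Series of LabView timestamps (in seconds) to datetimes."""
    return pd.to_datetime(labview_seconds + LABVIEW_EPOCH_POSIX, unit="s")


# Hypothetical LabView timestamp, chosen to fall on midnight, 1 January 2014.
example = pd.Series([3471379200.0])
converted = labview_to_datetime(example)
print(LABVIEW_EPOCH_POSIX)
print(converted.iloc[0])
```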

1. Import the relevant libraries and set the value of LabView zerotime in POSIX time. Also set the path where the ``.csv`` files are stored.

In [None]:
import os

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

from IPython.display import display

import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
%matplotlib inline

# Set the location of the magnetometer .csv data files (the smb mount, or a copy of it).
# (The trailing '/' is necessary!)
dat_fp = "../mag_data_copy/"
# Set the filepath to save the .csvs to.
save_fp = "csvs/"

# Get the National Instruments (LabView) zero time in UNIX time.
# (No trailing 'Z': numpy datetime64 is timezone-naive, and parsing a timezone suffix is deprecated.)
NI_zerotime = np.datetime64('1904-01-01T00:00:00').astype('datetime64[s]').astype(np.float64)
# Print the value of LabView zerotime in UNIX time.
print("NI_zerotime in UNIX time is:", NI_zerotime)

2. Load daily ``.csv``s from 2014 to 2018 and save them as yearly ``.csv``s. Different years are loaded into separate pandas dataframes. The LabView timestamps will be converted into POSIX datetimes.

In [None]:
def load_data(folder):
    """Load the .csv's for a given year folder into a pandas dataframe."""
    # Initial list of dataframes, one dataframe for each csv (which will be combined later).
    dataframes = []
    
    # Walk through the folder to find .csvs.
    for root, dirs, files in os.walk(folder):
        for file in files:
            try:
                # Try reading the csv into a pandas dataframe, with the expected
                # columns: time, reading, temperature.
                day_dat = pd.read_csv(os.path.join(root, file),
                                      names=("time", "reading", "temperature"),
                                      dtype={"time": np.float64, "reading": np.float64, "temperature": np.float64})
            except Exception as e:
                # Print what could not be loaded and why.
                print("Could not load:", os.path.join(root, file))
                print(e)
            else:
                # Convert the LabView timestamps into POSIX timestamps, then load as datetimes.
                day_dat["time"] = pd.to_datetime(day_dat["time"] + NI_zerotime, unit='s')
                dataframes.append(day_dat)
    
    # Combine the dataframes for each .csv (daily dataframes) into a single dataframe.
    sensor_dat = pd.concat(dataframes, ignore_index=True)
    # Sort data by time.
    sensor_dat.sort_values("time", inplace=True)
    # Set the timestamps as the dataframe index.
    sensor_dat.set_index("time", drop=True, inplace=True)
    return sensor_dat


# Load the magnetometer data from 2014 to 2018.
for year_data in ["2014", "2015", "2016", "2017", "2018"]:
    # Load the data for a given year. The data for a given year is
    # stored in dat_fp+year, e.g. "../mag_data_copy/2014"
    sensor_dat = load_data(dat_fp+year_data)
    # Save the loaded year of data to a new csv, which will contain a whole year of data.
    # This is stored in save_fp+year+"_mag_dat.csv", e.g. "csvs/2014_mag_dat.csv"
    sensor_dat.to_csv(save_fp+year_data+"_mag_dat.csv", index=True)
    # Display the head of the sensor data.
    print("\n" + year_data + " Data")
    display(sensor_dat.head())


3. Load the data from before 2014.

This is more complex than loading the data from 2014 onwards, because the readings are not accompanied by proper timestamps. Some examples of timestamps are given below:

```
5
347
155859
```

These timestamps would correspond to the following times:

```
00:00:05
00:03:47
15:58:59
```

To make these timestamps easier to parse, the Python string method `str.zfill()` can be used to left-pad them with zeros, e.g. `"5".zfill(6)` returns `"000005"` and `"347".zfill(6)` returns `"000347"`. Another complication is that the timestamps recorded during BST are shifted by 1 hour, so 1 hour needs subtracting from these times to shift them back to GMT/UTC.
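
The padding step can be sketched on its own as follows. The helper name `parse_compact_time` is hypothetical, and the sketch uses the standard library rather than pandas to keep it self-contained:

```python
from datetime import datetime, time


def parse_compact_time(raw):
    """Parse a compact HHMMSS value with leading zeros dropped (e.g. 347)."""
    # Left-pad with zeros to exactly six digits, e.g. 347 -> "000347".
    padded = str(int(raw)).zfill(6)
    # Interpret the padded string as hours, minutes and seconds.
    return datetime.strptime(padded, "%H%M%S").time()


first = parse_compact_time(5)
second = parse_compact_time(347)
print(first, second)
```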

To get the day, month and year for the timestamp, the modification time of the ``.csv``s (when the ``.csv`` was written) can be used. It will be assumed that the file modification time is the day after the readings were taken, so 1 day will need subtracting from the generated timestamps.

Sometimes the last reading in the ``.CSV`` file is actually recorded the day after the other sensor readings. In this case, 1 day is added to the sensor reading timestamp.
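
A minimal sketch of this rollover correction, applied to a toy series of datetimes, is given below. The function name `fix_last_reading_rollover` and the sample values are made up for illustration:

```python
import pandas as pd


def fix_last_reading_rollover(times):
    """Add one day to the final timestamp if it is earlier than the one before it."""
    times = times.copy()
    # A single-row series has nothing to compare against, so it is left unchanged.
    if len(times) >= 2 and times.iloc[-1] < times.iloc[-2]:
        times.iloc[-1] += pd.Timedelta(days=1)
    return times


# Toy data: the final reading was actually taken just after midnight.
toy_times = pd.Series(pd.to_datetime([
    "2013-05-02 23:59:50",
    "2013-05-02 23:59:55",
    "2013-05-02 00:00:00",
]))
fixed = fix_last_reading_rollover(toy_times)
print(fixed.iloc[-1])
```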

Set the start and end of BST for 2011, 2012 and 2013:

In [None]:
# Manually set when the BST hour changes occur.
hour_change_2011 = {"start": datetime(year=2011, month=3, day=27, hour=1, minute=0, second=0), "end": datetime(year=2011, month=10, day=30, hour=2, minute=0, second=0)}
hour_change_2012 = {"start": datetime(year=2012, month=3, day=25, hour=1, minute=0, second=0), "end": datetime(year=2012, month=10, day=28, hour=2, minute=0, second=0)}
hour_change_2013 = {"start": datetime(year=2013, month=3, day=31, hour=1, minute=0, second=0), "end": datetime(year=2013, month=10, day=27, hour=2, minute=0, second=0)}
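
The hard-coded dates can be cross-checked with a short sketch. It assumes the usual rule that BST runs from the last Sunday in March to the last Sunday in October, and it uses the same local clock times as the dictionaries above (01:00 for the start, 02:00 for the end). The helpers `last_sunday` and `bst_window` are hypothetical names:

```python
from datetime import datetime, timedelta


def last_sunday(year, month):
    """Return midnight on the last Sunday of the given month (month < 12)."""
    # Last day of the month = first day of the next month minus one day.
    last_day = datetime(year, month + 1, 1) - timedelta(days=1)
    # weekday(): Monday is 0 and Sunday is 6, so step back to the previous Sunday.
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


def bst_window(year):
    """Return the BST start and end as local clock times for a given year."""
    return {
        "start": last_sunday(year, 3).replace(hour=1),
        "end": last_sunday(year, 10).replace(hour=2),
    }


for year in (2011, 2012, 2013):
    print(year, bst_window(year))
```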

Load the pre-2014 data. This will also save all the data from before 2014 into a ``.csv`` file called `pre2014_mag_dat.csv`:

In [None]:
def load_2013_dat():
    """Load the .CSV's from before 2014, at the filepath dat_fp."""
    dataframes = []
    
    # Walk through all the files recorded before 2014 (stored in the "2013" folder).
    for root, dirs, files in os.walk(dat_fp+"2013"):
        print("Files loaded:")
        for file in files:
            # Check whether the file being loaded is a csv file (upper or lowercase .csv name).
            if file.lower().endswith(".csv"):
                try:
                    file = os.path.join(root, file)
                    # Retrieve file modification time. This will be used to get
                    # the year, month and day of the recording.
                    date = datetime.fromtimestamp(os.path.getmtime(file))
                    day_dat = pd.read_csv(file, names=("time", "reading", "temperature"))
                    print(file)
                except Exception as e:
                    print(e)
                else:
                    # First, set the time to an integer, then to a string.
                    day_dat["time"] = day_dat["time"].astype('int').astype("str", copy=False)
                    # Make sure all strings are 6 characters long, filled with zeros
                    day_dat["time"] = day_dat["time"].apply(lambda dt : dt.zfill(6))
                    # Convert strings to datetimes, formatted as "%H%M%S".
                    day_dat["time"] = pd.to_datetime(day_dat["time"], format="%H%M%S")
                    # Include the current year, day, month in the datetimes, as retrieved from the file modification time.
                    day_dat["time"] = day_dat["time"].apply(lambda dt: dt.replace(year=date.year, month=date.month, day=date.day))
                    # Due to an error, the last recording in the csv's is for the next day.
                    # The error can be verified by checking whether the last time is
                    # earlier than the previous.
                    # We need a try-except because the csv may only contain 1 item.
                    try:
                        if day_dat.iloc[-1, 0] < day_dat.iloc[-2, 0]:
                            # if the last time is earlier than the previous, then add a day to the time.
                            day_dat.iloc[-1, 0] += timedelta(days=1)
                    except IndexError:
                        print("Warning! Only 1 reading in the csv!")
                        
                    dataframes.append(day_dat)

    sensor_dat = pd.concat(dataframes, ignore_index=True)
    # Subtract 1 day from each timestamp, as the modification time of the file
    # is actually the day after the sensor readings are taken. This is done
    # before the BST correction, so that the BST mask is applied to the true dates.
    sensor_dat["time"] -= timedelta(days=1)
    # Readings were taken in GMT and BST. Make them all GMT (UTC).
    # Find BST times.
    BST_mask = ((sensor_dat["time"] >= hour_change_2011["start"]) & (sensor_dat["time"] < hour_change_2011["end"]))
    BST_mask |= ((sensor_dat["time"] >= hour_change_2012["start"]) & (sensor_dat["time"] < hour_change_2012["end"]))
    BST_mask |= ((sensor_dat["time"] >= hour_change_2013["start"]) & (sensor_dat["time"] < hour_change_2013["end"]))
    print(sensor_dat.shape, BST_mask.shape)
    # Subtract an hour from the BST times.
    sensor_dat.loc[BST_mask, "time"] -= timedelta(hours=1)
    # Set the timestamps as the index.
    sensor_dat.set_index("time", inplace=True)
    
    return sensor_dat

# Load the pre-2014 magnetometer data.
sensor_dat = load_2013_dat()
# Save pre-2014 data to a csv.
sensor_dat.to_csv(save_fp+"pre2014_mag_dat.csv")
# Display the first 5 rows of data.
sensor_dat.head()

4. Save the pre-2014 data into yearly ``.csv``s.

Load the data recorded before 2014 from the csv file `pre2014_mag_dat.csv`:

In [None]:
sensor_dat = pd.read_csv(save_fp+"pre2014_mag_dat.csv")

Convert the string timestamps into datetimes:

In [None]:
sensor_dat["time"] = pd.to_datetime(sensor_dat["time"], format="%Y-%m-%d %H:%M:%S")

Split the data by year and save each year to a separate csv called `year+"_mag_dat.csv"`, e.g. `2013_mag_dat.csv`:

In [None]:
sensor_dat_yearsplit = {}
for year in sensor_dat["time"].dt.year.unique():
    sensor_dat_yearsplit[str(year)] = sensor_dat.loc[sensor_dat["time"].dt.year == year]
    sensor_dat_yearsplit[str(year)].to_csv(save_fp+str(year)+"_mag_dat.csv", index=False)


5. Check what some of this sensor data looks like when plotted (note this is **raw** data, not magnetic readings):

In [None]:
sensor_dat = sensor_dat.loc[sensor_dat["reading"] < 0.6e7]
sensor_dat.plot(x="time", y="reading", figsize=(15,5))

# plt.savefig("test.pdf")