*Author: C Mitchell*

**This notebook wrangles the raw PSR (.sed) and SVIS (.sig) data from the TNC Basin, Pemaquid, and Bigelow Lab Days into one large data frame, which is saved as *spectral_library.csv*.**

**Note: The leaf clip data are not included in this spectral library.**

# File details

For each sample there are ~40 measurements (see `Measurement Setup.md` for more details). Each measurement has the naming convention:

`filestem_xxxx.sed`

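where `filestem` is a unique identifying descriptor, and `xxxx` is the zero-padded measurement number (e.g. `0001`). The `.sig` files follow the same convention. Note that the grouping function below expects a five-digit measurement number for `.sed` files and a four-digit number for `.sig` files.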
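
As a minimal sketch of this naming convention, the snippet below splits a file name into its filestem and measurement number. The file names and the `split_name` helper are hypothetical and only for illustration; the pattern accepts any number of digits so that it covers both the `.sed` and `.sig` files.

```python
import re

# filestem, underscore, zero-padded measurement number, then the extension
NAME_PATTERN = re.compile(r'^(?P<filestem>.+)_(?P<number>[0-9]+)\.(?P<ext>sed|sig)$')

def split_name(filename):
    """Return (filestem, measurement number) or None if the name does not fit."""
    m = NAME_PATTERN.match(filename)
    if m is None:
        return None
    return m.group('filestem'), int(m.group('number'))

# hypothetical file names, not taken from the real data directories
parsed_sed = split_name('SampleA_00001.sed')
parsed_sig = split_name('SampleB_rep2_0012.sig')
parsed_other = split_name('notes.txt')
```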

# Data wrangling details

1. For each sample, calculate the mean and standard deviation of the reflectance at each wavelength from the ~40 scans.
2. Combine the mean reflectance and standard deviation for all the samples into one data frame, with one `_mean` and one `_std` column per sample.
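
The two steps can be sketched on a toy example. The three "scans" below are made-up numbers, and the column names `sample_mean` and `sample_std` are placeholders. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# three made-up scans of one sample, each with the same two wavelengths
scans = [
    pd.DataFrame({'Wvl': [400.0, 401.0], 'Reflect. %': [10.0, 20.0]}),
    pd.DataFrame({'Wvl': [400.0, 401.0], 'Reflect. %': [12.0, 24.0]}),
    pd.DataFrame({'Wvl': [400.0, 401.0], 'Reflect. %': [14.0, 28.0]}),
]

# stack the scans, then summarise the reflectance at each wavelength
stacked = pd.concat(scans)
grouped = stacked.groupby('Wvl')['Reflect. %']
summary = pd.DataFrame({'sample_mean': grouped.mean(),
                        'sample_std': grouped.std()})
```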

# Initializing workflow

Importing modules and setting up the data directories

In [7]:
import pandas as pd
import os
import re
import numpy as np

In [21]:
datadirs = {'Basin' : '/mnt/storage/labs/mitchell/projects/meifsci-seaweed-drones/analysis/data/20210622_TNCBasin/raw/',
            'Pemaquid' : '/mnt/storage/labs/mitchell/projects/meifsci-seaweed-drones/analysis/data/20210721_Pemaquid/raw/labday/',
            'Bigelow' : '/mnt/storage/labs/mitchell/projects/meifsci-seaweed-drones/analysis/data/20211209_Bigelow/raw/'}

filetypes = {'Basin' : '.sed', 'Pemaquid' : '.sed', 'Bigelow' : '.sig'}

## Defining functions

In [22]:
def groupraw(datadir, filetype):
    # finds all the raw files of filetype in a given
    # data dir and groups them by filestem
    #
    # inputs:
    #     datadir = full path to directory with the raw files
    #     filetype = extension of raw filetypes, either '.sed' or '.sig'
    # outputs: a dictionary with the filestem as the key
    #          and the list of corresponding individual
    #          files as the value

    # measurement-number suffix: 5 digits for .sed files, 4 digits for .sig files
    suffixes = {'.sed' : r'_[0-9]{5}\.sed', '.sig' : r'_[0-9]{4}\.sig'}
    if filetype not in suffixes:
        raise ValueError("unknown file type, expect either '.sed' or '.sig'")

    filegroups = {}
    for f in os.listdir(datadir):
        # files without the expected numeric suffix are skipped
        m = re.search(suffixes[filetype], f)
        if m:
            # everything before the suffix is the filestem
            fs = f[:m.start()]
            filegroups.setdefault(fs, []).append(os.path.join(datadir, f))

    # return the groups sorted by filestem
    return dict(sorted(filegroups.items()))
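
Since each sample should have roughly 40 scans, one optional sanity check on the grouped files is to flag groups whose size is far from that number. The sketch below assumes a dictionary shaped like the output of `groupraw`; the helper name `flag_unusual_groups`, the tolerance of 5 scans, and the example file names are all hypothetical.

```python
def flag_unusual_groups(filegroups, expected=40, tolerance=5):
    """Return {filestem: number of files} for groups that differ from the
    expected scan count by more than the tolerance."""
    flagged = {}
    for stem, files in filegroups.items():
        if abs(len(files) - expected) > tolerance:
            flagged[stem] = len(files)
    return flagged

# made-up groups: one complete sample and one with only 12 scans
example_groups = {
    'SampleA': ['SampleA_%05d.sed' % i for i in range(1, 41)],
    'SampleB': ['SampleB_%05d.sed' % i for i in range(1, 13)],
}
flagged = flag_unusual_groups(example_groups)
```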

In [23]:
def combineraw(idkey, filelist):
    # for a group of .sed or .sig files, calculates the mean and the
    # standard deviation of the reflectance at each wavelength
    # inputs:
    #    idkey = an identifying string
    #    filelist = a list containing the file paths to import
    # output:
    #      a data frame with the mean and standard deviation
    #      of the reflectance for the group of files

    data = []
    for ff in filelist:
        if re.search(r'\.sed', ff):
            # tab-separated table after a 26-line header
            data += [pd.read_csv(ff, sep='\t', skiprows=26)]
        elif re.search(r'\.sig', ff):
            # whitespace-separated table after a 27-line header;
            # column 0 is read as wavelength and column 3 as reflectance
            data += [pd.read_csv(ff, sep=r'\s+', skiprows=27, header=None, usecols=[0,3], names=['Wvl','Reflect. %'])]
        else:
            print(ff)
            print('unknown file type, expected ".sed" or ".sig" files')

    # stack all the scans, then summarise by wavelength
    grouped = pd.concat(data).groupby('Wvl')
    data_mean = grouped.mean().rename(columns = {'Reflect. %' : idkey+'_mean'})
    data_std = grouped.std().rename(columns = {'Reflect. %' : idkey+'_std'})
    data_combined = pd.merge(data_mean, data_std, on='Wvl')

    return data_combined

# Processing files

Now we can loop through the files, import the data, and combine it into one data frame:

In [24]:
data_allsites = []
for sitekey,datadir in datadirs.items():
    filegroups = groupraw(datadir,filetypes[sitekey])
    
    data_combined = []
    for fg,flist in filegroups.items():
        data_combined += [combineraw(sitekey+'_'+fg, flist)]

    data_all = pd.concat(data_combined, axis=1)
    data_all.index = np.round(data_all.index)
    data_allsites += [data_all]
DATA = pd.concat(data_allsites, axis=1)
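
The rounding step matters because `pd.concat(..., axis=1)` aligns rows on the index, so wavelengths from different sites only line up if they are exactly equal. The toy example below, with made-up wavelengths and column names, sketches that behaviour. It also checks that each index is still unique after rounding, which is an assumption the column-wise concatenation relies on.

```python
import numpy as np
import pandas as pd

# made-up spectra from two sites with slightly different wavelength grids
site_a = pd.DataFrame({'siteA_mean': [1.0, 2.0]}, index=[400.2, 401.1])
site_b = pd.DataFrame({'siteB_mean': [3.0, 4.0]}, index=[399.8, 402.0])

for df in (site_a, site_b):
    # round the wavelengths to the nearest whole number
    df.index = np.round(df.index.to_numpy())
    # duplicates would appear if two wavelengths rounded to the same value
    assert df.index.is_unique

# rows are matched on the rounded wavelength; gaps are filled with NaN
combined = pd.concat([site_a, site_b], axis=1)
```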

Let's remove the columns that were measured for reference and are not part of the spectral library:

In [25]:
colprefix = ['Basin_Fucusxx', 'Pemaquid_Whiteref', 'Pemaquid_WhiteRef4', 'Pemaquid_Whiteref5', 
             'Basin_Background']

cols = [xx+'_mean' for xx in colprefix] + [xx+'_std' for xx in colprefix]

DATA_FINAL = DATA.drop(labels = cols, axis = 1)
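
By default, `DataFrame.drop` raises a `KeyError` if any of the requested labels is missing. As a hedged alternative, the sketch below lists the missing columns first and drops only the ones that exist. The data frame and the prefixes `Ref` and `Missing` are made up for illustration.

```python
import pandas as pd

# made-up library with one real sample and one reference measurement
df = pd.DataFrame({'A_mean': [1.0], 'A_std': [0.1],
                   'Ref_mean': [9.0], 'Ref_std': [0.2]})

prefixes = ['Ref', 'Missing']
wanted = [p + suffix for p in prefixes for suffix in ('_mean', '_std')]

# report requested columns that are not in the data frame
missing = [c for c in wanted if c not in df.columns]

# drop only the columns that are actually present
trimmed = df.drop(columns=[c for c in wanted if c in df.columns])
```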

Finally, we can save this reflectance data to a CSV file:

In [26]:
DATA_FINAL.to_csv('../data/spectral_library.csv')
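
To read the library back in, the wavelength index written by `to_csv` needs to be restored from the first column. The round trip below is a minimal sketch on a made-up two-row library written to a temporary directory, not the real file.

```python
import os
import tempfile
import pandas as pd

# made-up library: wavelengths as the index, one mean/std pair
library = pd.DataFrame({'sampleA_mean': [10.0, 11.0],
                        'sampleA_std': [0.5, 0.6]},
                       index=[400.0, 401.0])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spectral_library.csv')
    library.to_csv(path)
    # index_col=0 turns the first column back into the wavelength index
    reloaded = pd.read_csv(path, index_col=0)
```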