*Author: C Mitchell*

**This notebook wrangles the TNC Basin Lab Day and Pemaquid Lab Day data from the raw PSR files (`.sed`) into one large data frame, which is saved as *spectral_library.csv*.**

# File details

For each sample there are ~40 measurements (see `Measurement Setup.md` for more details). Each measurement has the naming convention:

`filestem_xxxxx.sed`

where `filestem` is a unique identifying descriptor, and `xxxxx` is the five-digit measurement number (e.g. `00001`), matching the `_[0-9]{5}\.sed` pattern used in `groupsed` below.
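
The naming convention can be taken apart with a regular expression. The following is a minimal sketch, not part of the original notebook: `split_sed_name` is a hypothetical helper, the file names are made up, and the pattern assumes only that the measurement number is the final run of digits before `.sed` (any number of digits is accepted).

```python
import re

# Hypothetical helper: split "filestem_<digits>.sed" into its two parts.
# Assumption: the measurement number is the last run of digits before ".sed".
SED_NAME = re.compile(r'^(?P<stem>.+)_(?P<num>[0-9]+)\.sed$')

def split_sed_name(filename):
    m = SED_NAME.match(filename)
    if m is None:
        return None  # not a numbered .sed measurement file
    return m.group('stem'), int(m.group('num'))

# Made-up file names, for illustration only
print(split_sed_name('Asco1_00001.sed'))
print(split_sed_name('notes.txt'))
```

Because `.+` is greedy, a filestem that itself contains underscores stays intact and only the final `_<digits>` part is treated as the measurement number.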

# Data wrangling details

1. For each sample, we need to calculate the mean and standard deviation from the ~40 scans.
2. We are going to combine the mean reflectance and standard deviation for all the samples into one data frame.
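
The first step can be sketched on made-up numbers before touching real files. This assumes each scan is a table with a `Wvl` column and a `Reflect. %` column, as used later in `combinesed`; the values and the `Sample_` column prefix are invented for illustration.

```python
import pandas as pd

# Three made-up scans of one sample on the same wavelength grid
scans = [
    pd.DataFrame({'Wvl': [350.0, 351.0], 'Reflect. %': [4.0, 5.0]}),
    pd.DataFrame({'Wvl': [350.0, 351.0], 'Reflect. %': [6.0, 5.0]}),
    pd.DataFrame({'Wvl': [350.0, 351.0], 'Reflect. %': [8.0, 5.0]}),
]

# Stack the scans, then summarise the reflectance at each wavelength
stacked = pd.concat(scans)
stats = stacked.groupby('Wvl')['Reflect. %'].agg(['mean', 'std'])
stats.columns = ['Sample_mean', 'Sample_std']
print(stats)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so at least two scans are needed for a defined value.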

# Initializing workflow

Importing modules and setting up the data directories

In [1]:
import pandas as pd
import os
import re

In [91]:
datadirs = {'Basin' : '../data/20210622_TNCBasin/raw/', 'Pemaquid' : '../data/20210721_Pemaquid/raw/labday/'}

## Defining functions

In [105]:
def groupsed(datadir):
    # finds all the .sed files from a given data dir and
    # groups them by filestem
    #
    # inputs: full path to directory with the .sed files
    # outputs: a dictionary with the filestem as the key
    #          and the list of corresponding individual 
    #          files as the value
    
    # measurement files look like filestem_NNNNN.sed
    # (raw string so that \. matches a literal dot)
    pattern = re.compile(r'(.+)_[0-9]{5}\.sed')

    filegroups = {}
    for f in os.listdir(datadir):
        m = pattern.fullmatch(f)
        if m is None:
            # skip anything that is not a numbered .sed measurement
            continue
        filegroups.setdefault(m.group(1), []).append(os.path.join(datadir, f))

    # return the groups ordered by filestem
    return dict(sorted(filegroups.items()))

In [120]:
def combinesed(idkey,filelist):
    # for a group of .sed files, calculates the mean and the
    # standard deviation
    # inputs: 
    #    idkey = an identifying string
    #    filelist = a list containing the file paths to import
    # output: 
    #      a data frame with the mean and standard deviation
    #      of the reflectance for the group of sed files
    
    data = []
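    for ff in filelist:
        try:
            # skip the 26 lines above the column names
            data.append(pd.read_csv(ff, sep='\t', skiprows=26))
        except Exception as err:
            # report the unreadable file and carry on with the rest
            print('Problem with file: ' + ff + ' (' + str(err) + ')')
            continue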
    data = pd.concat(data)
    data_mean = data.groupby('Wvl').mean().rename(columns = {'Reflect. %' : idkey+'_mean'})
    data_std = data.groupby('Wvl').std().rename(columns = {'Reflect. %' : idkey+'_std'})
    data_combined = pd.merge(data_mean,data_std,on='Wvl')
    
    return data_combined

# Processing files

Now we can loop through the files, import the data, and combine it into one data frame.

In [121]:
sample_numbers = {}
data_allsites = []
for sitekey,datadir in datadirs.items():
    filegroups = groupsed(datadir)
    
    data_combined = []
    for fg,flist in filegroups.items():
        data_combined += [combinesed(sitekey+'_'+fg, flist)]
        # record how many scan files were found for this sample
        sample_numbers[sitekey+'_'+fg] = len(flist)

    data_allsites += [pd.concat(data_combined, axis=1)]
DATA = pd.concat(data_allsites, axis=1)

In [122]:
DATA

Wvl,Basin_Asco1_mean,Basin_Asco1_std,Basin_Asco1b_mean,Basin_Asco1b_std,Basin_Asco1c_mean,Basin_Asco1c_std,Basin_Asco2_mean,Basin_Asco2_std,Basin_Asco2b_mean,Basin_Asco2b_std,...,Pemaquid_Fucus8_mean,Pemaquid_Fucus8_std,Pemaquid_Fucus9_mean,Pemaquid_Fucus9_std,Pemaquid_WhiteRef4_mean,Pemaquid_WhiteRef4_std,Pemaquid_Whiteref_mean,Pemaquid_Whiteref_std,Pemaquid_Whiteref5_mean,Pemaquid_Whiteref5_std
350.0,4.410037,1.370665,4.813855,1.473507,3.937815,1.546030,4.430382,0.913576,5.660038,1.608154,...,4.730840,1.613612,3.715712,1.159811,101.1189,,149.422100,88.794705,102.5824,
351.0,4.320250,1.340077,4.632545,1.377413,3.811673,1.438041,4.231857,0.804827,5.406262,1.493778,...,4.422935,1.458978,3.462775,1.088701,101.3008,,148.901333,87.308328,102.5433,
352.0,4.221567,1.278706,4.414810,1.258284,3.706417,1.313864,4.101953,0.826004,5.185452,1.362899,...,4.175685,1.264562,3.238738,1.032755,101.5859,,149.189400,86.630813,102.2802,
353.0,3.987295,1.177080,4.282783,1.130131,3.579575,1.198622,3.920847,0.925106,4.927162,1.297431,...,3.930332,1.075472,3.096600,1.075743,100.7329,,151.170133,88.947975,102.1731,
354.0,3.831272,1.146005,4.189085,1.020070,3.514777,1.106307,3.739163,0.980105,4.785085,1.269270,...,3.732233,0.992225,3.006300,0.996552,99.5602,,151.305200,89.066045,101.5550,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2496.0,1.724858,1.456327,1.667245,1.286970,1.193300,1.132628,0.932055,0.983644,0.813022,0.890796,...,0.652608,0.578782,0.487625,0.642961,99.1025,,146.439867,78.913113,97.8607,
2497.0,1.808115,1.459203,1.770877,1.248552,1.336103,1.140618,1.020415,0.993102,0.977452,0.988768,...,0.743765,0.620235,0.539813,0.661102,98.7176,,147.425367,79.300498,97.5523,
2498.0,1.885157,1.519443,1.851097,1.241340,1.473483,1.245092,1.116702,1.053866,1.209298,1.181991,...,0.819505,0.724708,0.605250,0.729082,98.1381,,149.124167,80.925189,97.5284,
2499.0,1.962638,1.619380,1.921982,1.322684,1.613933,1.487410,1.211338,1.181539,1.452517,1.494643,...,0.902060,0.898647,0.681462,0.865176,97.5479,,150.929700,82.651456,97.6032,
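
Some `_std` columns in the output above are empty. pandas uses the sample standard deviation (`ddof=1`) by default, which is undefined when only one scan contributes to a group. The following sketch, on a made-up frame laid out like `DATA`, shows one way to list such groups. The column names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame in the same layout as DATA (Wvl index, *_mean / *_std columns)
demo = pd.DataFrame(
    {
        'SiteA_Sample1_mean': [4.4, 4.3],
        'SiteA_Sample1_std': [1.4, 1.3],
        'SiteA_Ref1_mean': [101.1, 101.3],
        'SiteA_Ref1_std': [np.nan, np.nan],
    },
    index=pd.Index([350.0, 351.0], name='Wvl'),
)

# A *_std column that is entirely NaN has no defined spread for its group
std_cols = [c for c in demo.columns if c.endswith('_std')]
single_scan = [c[:-len('_std')] for c in std_cols if demo[c].isna().all()]
print(single_scan)
```

The same check could be run on `DATA` by replacing `demo` with it.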


Finally, we can save this reflectance data to a CSV file.

In [123]:
DATA.to_csv('spectral_library.csv')
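
To load the library again later, the wavelength column needs to be restored as the index. This is a minimal round-trip sketch on a made-up frame written to a temporary directory, so it does not touch the real `spectral_library.csv`. The demo file name and column names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for DATA: Wvl index with one *_mean / *_std pair
demo = pd.DataFrame(
    {'SiteA_Sample1_mean': [4.4, 4.3], 'SiteA_Sample1_std': [1.4, 1.3]},
    index=pd.Index([350.0, 351.0], name='Wvl'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spectral_library_demo.csv')
    demo.to_csv(path)
    # index_col restores the wavelength index that to_csv wrote as a column
    restored = pd.read_csv(path, index_col='Wvl')

print(restored)
```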