*Author: C Mitchell*

The `sample_list.csv` was initially created from the notes in the lab book. 

In this notebook, we add the time columns and the number of measurements for each sample into the `sample_list.csv` file.

**The output of this notebook is a new file, `sample_list_v2.csv`**

# Time details

For some of the lab experiments, we were investigating the effect of desiccation, i.e. time to dry. Each of the raw PSR .sed files has a time stamp in its metadata/header information. We're going to use these time stamps to add time columns to the `sample_list.csv` file.

Most samples have 40 measurements, and therefore 40 time stamps (a few samples and reference targets have fewer). We are going to use the time stamp from the *first measurement* of each sample.


# Initializing workflow

Importing modules and setting up the data directories

In [26]:
import pandas as pd
import os
import re
import datetime as dt

In [5]:
datadirs = {'Basin' : '../data/20210622_TNCBasin/raw/', 'Pemaquid' : '../data/20210721_Pemaquid/raw/labday/'}

## Defining functions

In [6]:
def groupsed(datadir):
    # finds all the .sed files from a given data dir and
    # groups them by filestem
    #
    # inputs: full path to directory with the .sed files
    # outputs: a dictionary with the filestem as the key
    #          and the list of corresponding individual 
    #          files as the value
    
    allfiles = os.listdir(datadir)
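    # keep only the numbered .sed files, i.e. <filestem>_<5 digits>.sed
    datafiles = [f for f in allfiles if re.search(r'_[0-9]{5}\.sed$',f)]

    ind = [re.search(r'_[0-9]{5}\.sed$',f).start() for f in datafiles]
    fstem = sorted(set([f[:ii] for f,ii in zip(datafiles,ind)]))
    filegroups = {}
    for fs in fstem:
        # os.listdir returns files in arbitrary order, so sort each group;
        # the first entry is then the lowest-numbered file of the sample
        filegroups[fs] = sorted(datadir+f for f in datafiles
                                if re.match(re.escape(fs)+r'_[0-9]{5}\.sed$',f))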
    
    return filegroups
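
To illustrate the grouping that `groupsed` performs, here is a minimal sketch that applies the same filestem pattern to a list of file names instead of a directory. The file names are hypothetical, and the sketch assumes that every measurement file follows the `<filestem>_<5 digits>.sed` naming used in the function above.

```python
import re
from collections import defaultdict

# Hypothetical file names (not taken from the data directories)
names = ['Asco1_00001.sed', 'Asco1_00000.sed', 'Asco1b_00000.sed', 'notes.txt']

# Everything before the trailing _<5 digits>.sed is treated as the filestem
pattern = re.compile(r'(?P<stem>.+)_[0-9]{5}\.sed$')

groups = defaultdict(list)
for name in sorted(names):
    m = pattern.match(name)
    if m:  # skip anything that is not a numbered .sed file
        groups[m.group('stem')].append(name)

groups = dict(groups)
print(groups)
```

Note that `Asco1` and `Asco1b` end up in separate groups, because the underscore and digits must follow the filestem directly.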

In [133]:
def sedtimestamp(filename):
    # finds the time information in the header info and
    # pulls out the hours, minutes and seconds of the measurement
    # inputs: full path to the .sed file
    # outputs: hour, minutes and seconds as integers
    
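    with open(filename) as fp:
        for line in fp:
            if line.startswith('Time: '):
                # the indexing below assumes that the line holds two
                # comma-separated HH:MM:SS values and reads the second one
                parsed = line.split(':')
                hr = int(parsed[3].split(',')[1])
                mins = int(parsed[4])
                secs = int(float(parsed[5]))
                return hr, mins, secs

    # fail with a clear message rather than an undefined-variable error
    raise ValueError('no "Time: " line found in ' + filename)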
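
The indexing in `sedtimestamp` is easier to follow on a concrete line. The sketch below walks through the same splitting steps on a hypothetical header line; the layout shown (two comma-separated times after `Time: `) is an assumption inferred from the indices used in the function, not a statement of the .sed file specification.

```python
# Hypothetical header line; the real .sed header layout may differ
line = 'Time: 10:53:25.50,10:53:31.20\n'

parsed = line.split(':')
# parsed is ['Time', ' 10', '53', '25.50,10', '53', '31.20\n']

hr = int(parsed[3].split(',')[1])   # '25.50,10' -> '10' -> 10
mins = int(parsed[4])               # '53' -> 53
secs = int(float(parsed[5]))        # '31.20\n' -> 31.2 -> 31 (fraction dropped)

print(hr, mins, secs)
```

Under this assumed layout, the function returns the second of the two times on the line and truncates fractional seconds.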

# Processing files

Now we can loop through our different samples and pull out the time stamps and count how many measurements there were for each sample.

In [148]:
sample_info = {}
for sitekey,datadir in datadirs.items():
    filegroups = groupsed(datadir)
    
    for fg,flist in filegroups.items():
        hrs, mins, secs  = sedtimestamp(flist[0])
        sample_info[sitekey+'_'+fg]  = [hrs, mins, secs, len(flist)]

And finally, we can put this together into a data frame:

In [175]:
new_sample_info = pd.DataFrame(sample_info,index =['hours','minutes','seconds','n']).T.rename_axis('idkey')

In [176]:
new_sample_info

idkey,hours,minutes,seconds,n
Basin_Asco1,10,53,31,40
Basin_Asco1b,13,51,31,40
Basin_Asco1c,15,19,45,40
Basin_Asco2,11,0,12,40
Basin_Asco2b,13,58,50,40
...,...,...,...,...
Pemaquid_Fucus8,15,37,41,40
Pemaquid_Fucus9,16,17,41,8
Pemaquid_WhiteRef4,13,0,13,1
Pemaquid_Whiteref,9,53,21,3
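
Since the time stamps are intended for the desiccation experiments, it may help to see how the three columns could be turned into an elapsed drying time. The sketch below uses three rows copied from the table above. It assumes that `Asco1`, `Asco1b` and `Asco1c` are successive measurements of the same sample and that the earliest one is the reference; the column names `tod_s` and `elapsed_min` are hypothetical.

```python
import pandas as pd

# Three rows copied from the table above
times = pd.DataFrame(
    {'hours': [10, 13, 15], 'minutes': [53, 51, 19], 'seconds': [31, 31, 45]},
    index=['Basin_Asco1', 'Basin_Asco1b', 'Basin_Asco1c'],
)

# Seconds since midnight for each first measurement
times['tod_s'] = times['hours'] * 3600 + times['minutes'] * 60 + times['seconds']

# Minutes elapsed since the earliest of the three (assumed reference)
times['elapsed_min'] = (times['tod_s'] - times['tod_s'].min()) / 60

print(times['elapsed_min'].round(1))
```

This only works within a single day, because the header parsing keeps the time of day and not the date.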


# Combining the new sample information with `sample_list.csv`

We need to add the time stamps and number of measurements to the existing sample information in the `sample_list.csv` file.

First, we need to import `sample_list.csv`

In [153]:
sampledf = pd.read_csv('sample_list.csv')

In [154]:
sampledf

,Site,Sample,Wet_weight,Dry_weight,Canopy_Depth
0,Basin,Asco1b,39.0,13.3,1.0
1,Basin,Asco1c,39.0,13.3,1.0
2,Basin,Asco1,39.0,13.3,1.0
3,Basin,Asco2b,83.3,28.7,2.8
4,Basin,Asco2c,83.3,28.7,2.8
...,...,...,...,...,...
60,Pemaquid,DFucus6,184.1,,2.4
61,Pemaquid,Chondrus1,37.2,,2.5
62,Pemaquid,Chondrus2,66.4,,4.3
63,Pemaquid,Chondrus3,77.5,,3.3


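To merge `sampledf` with `new_sample_info`, we need a unique, identifying key to merge on. The key, `idkey`, has the form `Site_Sample` (for example, `Basin_Asco1b`). `new_sample_info` already has this key as its index, but we need to add it to `sampledf`. Note that `set_index` below returns a new data frame for display only; `sampledf` itself keeps `idkey` as a regular column.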

In [172]:
sampledf['idkey'] = sampledf['Site'] + '_' + sampledf['Sample']
sampledf.set_index('idkey')

idkey,Site,Sample,Wet_weight,Dry_weight,Canopy_Depth
Basin_Asco1b,Basin,Asco1b,39.0,13.3,1.0
Basin_Asco1c,Basin,Asco1c,39.0,13.3,1.0
Basin_Asco1,Basin,Asco1,39.0,13.3,1.0
Basin_Asco2b,Basin,Asco2b,83.3,28.7,2.8
Basin_Asco2c,Basin,Asco2c,83.3,28.7,2.8
...,...,...,...,...,...
Pemaquid_DFucus6,Pemaquid,DFucus6,184.1,,2.4
Pemaquid_Chondrus1,Pemaquid,Chondrus1,37.2,,2.5
Pemaquid_Chondrus2,Pemaquid,Chondrus2,66.4,,4.3
Pemaquid_Chondrus3,Pemaquid,Chondrus3,77.5,,3.3
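
Because the merge relies on the key identifying each row exactly once, a quick uniqueness check before merging can be useful. This is a minimal sketch on a toy stand-in for `sampledf` (the data frame `toy` is hypothetical; its values are copied from the first rows shown above).

```python
import pandas as pd

# Toy stand-in for sampledf, using the first three rows shown above
toy = pd.DataFrame({'Site': ['Basin', 'Basin', 'Basin'],
                    'Sample': ['Asco1b', 'Asco1c', 'Asco1']})

# Same key construction as in the notebook
toy['idkey'] = toy['Site'] + '_' + toy['Sample']

# True only if no key appears more than once
keys_are_unique = toy['idkey'].is_unique
print(toy['idkey'].tolist(), keys_are_unique)
```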


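And finally, merge these two dataframes together. An outer merge keeps keys that appear in only one of the two tables, so entries such as `Basin_Background` show up with empty sample details: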

In [181]:
final_sample_list = pd.merge(sampledf,new_sample_info,on='idkey',how='outer').set_index('idkey')

In [182]:
final_sample_list

idkey,Site,Sample,Wet_weight,Dry_weight,Canopy_Depth,hours,minutes,seconds,n
Basin_Asco1b,Basin,Asco1b,39.0,13.3,1.0,13,51,31,40
Basin_Asco1c,Basin,Asco1c,39.0,13.3,1.0,15,19,45,40
Basin_Asco1,Basin,Asco1,39.0,13.3,1.0,10,53,31,40
Basin_Asco2b,Basin,Asco2b,83.3,28.7,2.8,13,58,50,40
Basin_Asco2c,Basin,Asco2c,83.3,28.7,2.8,15,27,48,40
...,...,...,...,...,...,...,...,...,...
Basin_Background,,,,,,10,34,49,4
Basin_Fucusxx,,,,,,11,47,40,1
Pemaquid_WhiteRef4,,,,,,13,0,13,1
Pemaquid_Whiteref,,,,,,9,53,21,3
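
The rows with empty sample details come from keys that exist in only one of the two tables. One way to list such keys, sketched here on toy data rather than the real files, is the `indicator` option of `pd.merge`, which adds a `_merge` column saying where each row came from. The data frames `left` and `right` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for sampledf and new_sample_info (hypothetical values)
left = pd.DataFrame({'idkey': ['Basin_Asco1b', 'Basin_Toy'],
                     'Wet_weight': [39.0, 50.0]})
right = pd.DataFrame({'idkey': ['Basin_Asco1b', 'Basin_Background'],
                      'n': [40, 4]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, on='idkey', how='outer', indicator=True)

# Keys that were found in only one of the two tables
unmatched = merged.loc[merged['_merge'] != 'both', ['idkey', '_merge']]
print(unmatched)
```

Keys that differ only in spelling or letter case would also appear in this list, which makes it a convenient place to spot naming mismatches between the lab book and the file names.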


In [184]:
final_sample_list.to_csv('sample_list_v2.csv')