# GPR Preprocessing

Before working with the GPR data itself, we check the metadata and other associated data (e.g., GPS data) to confirm that the data we collected are of high quality and that the dataset holds no surprises (for example, a GPS receiver that was not functioning properly, or profile lengths that differ from what we expected).

These scripts are specific to data collected by the Sensors and Software Noggin250+ GPR instrument, which is the data we will use in this course.

# Notebook Setup

Uncomment the first line of code in the next cell if you are using Google Colab

In [1]:
import pathlib #To read and manipulate filepaths
import csv #To read delimited text files
import datetime #For manipulating dates and times

import pandas as pd #To organize data. Not part of the Python standard library; must be installed separately
import numpy as np #For numerical manipulation. Not part of the Python standard library; must be installed separately

## Read, Convert, and Export GPS Data

Now, we will convert our GPS data to a .csv file that can be read and manipulated by various programs.

Let's look at the raw file contents first (you can also just open the .gps file).

We'll print the first 11 lines of each file.

In [11]:
dataDirectory = "../GPRSampleData/"
#Find only the .gps files in the data directory
GPSFiles = pathlib.Path(dataDirectory).glob('*.GPS')
for f in GPSFiles:
    with open(f) as cf:
        for row_number, row in enumerate(cf):
            if row_number < 11: #Print only the first 11 lines of each file
                print(row) #Each row keeps its own newline, so the output is double-spaced

Trace #1 at position 0.000000

$GPVTG,118.3,T,,,000.02,N,000.04,K,A*4D

$GPGGA,164010.60,4018.7801826,N,08819.6413086,W,1,10,0.9,217.53,M,-33.93,M,,*5E

Trace #11 at position 0.250000

$GPVTG,84.7,T,,,000.46,N,000.86,K,A*77

$GPGGA,164013.60,4018.7801853,N,08819.6412256,W,1,10,0.9,217.54,M,-33.93,M,,*56

Trace #21 at position 0.500000

$GPVTG,96.3,T,,,000.85,N,001.58,K,A*7D

$GPGGA,164014.40,4018.7801693,N,08819.6409945,W,1,10,0.9,217.54,M,-33.93,M,,*52

Trace #31 at position 0.750000

$GPVTG,85.3,T,,,000.82,N,001.51,K,A*71



We are interested in the $GPGGA sentences (which give the GPS position fix), not the $GPVTG sentences (which give the course and speed over ground).

Find more information about these "sentence" formats [here](https://www.rfwireless-world.com/Terminology/GPS-sentences-or-NMEA-sentences.html)
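
To make the comma-separated layout of a GGA sentence concrete, here is a minimal sketch that splits one of the `$GPGGA` sentences printed above into named values and converts its coordinates to decimal degrees. The function name `parse_gga` and the dictionary keys are illustrative choices, not part of any library, and the sketch does not verify the checksum at the end of the sentence.

```python
def parse_gga(sentence):
    """Split one $GPGGA sentence into a dictionary of named values (hypothetical helper)."""
    fields = sentence.strip().split(',')
    if fields[0] != '$GPGGA':
        raise ValueError('Not a GPGGA sentence')

    def to_decimal_degrees(value, direction):
        # NMEA packs coordinates as (d)ddmm.mmmm: degrees * 100 + minutes
        raw = float(value)
        degrees = raw // 100
        minutes = raw - degrees * 100
        decimal = degrees + minutes / 60
        # South and west are negative in decimal degrees
        return -decimal if direction in ('S', 'W') else decimal

    return {
        'utc_time': fields[1],             # hhmmss.ss, UTC
        'latitude': to_decimal_degrees(fields[2], fields[3]),
        'longitude': to_decimal_degrees(fields[4], fields[5]),
        'quality': int(fields[6]),         # 0 = no fix, 1 = GPS fix, 2 = differential fix
        'satellites': int(fields[7]),
        'hdop': float(fields[8]),
        'elev_m': float(fields[9]),        # antenna altitude above mean sea level
        'geoid_sep_m': float(fields[11]),  # geoid height relative to the WGS84 ellipsoid
    }

# First GPGGA sentence from the sample output above
sample = '$GPGGA,164010.60,4018.7801826,N,08819.6413086,W,1,10,0.9,217.53,M,-33.93,M,,*5E'
fix = parse_gga(sample)
print(fix['latitude'], fix['longitude'], fix['elev_m'])
```

The latitude and longitude this produces should agree, to six decimal places, with the first row of the table exported later in this notebook.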

The following cell will read that data, format it into a table (or dataframe), then export that table to a .csv file.

The .csv file is written to the same directory as the .gps files, where you can find it for the next steps of your preprocessing.
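
As one possible next step, the sketch below shows how the exported table might be checked for the problems mentioned at the top of this notebook: missing GPS fixes and unexpected profile lengths. It is an illustration under stated assumptions rather than part of the course scripts: the helper names `haversine_m` and `profile_summary` are hypothetical, the Earth is treated as a sphere of radius 6,371 km, and a two-row stand-in table built from values in the sample output is used in place of the .csv file.

```python
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_m * np.arcsin(np.sqrt(a))

def profile_summary(gps):
    """Summarize each profile (hypothetical helper): point count, points without a fix, GPS-derived length."""
    rows = []
    for name, grp in gps.groupby('File'):
        grp = grp.sort_values('ScanNo')
        lat = grp['Latitude'].to_numpy()
        lon = grp['Longitude'].to_numpy()
        # Sum the distances between consecutive GPS points along the profile
        length = haversine_m(lat[:-1], lon[:-1], lat[1:], lon[1:]).sum()
        rows.append({'File': name,
                     'Points': len(grp),
                     'NoFix': int((grp['Quality'] == 0).sum()),  # GGA quality 0 means no fix
                     'Length_m': float(length)})
    return pd.DataFrame(rows)

# Stand-in for reading the exported .csv: two points of LINE0 taken from the sample output
gps = pd.DataFrame({'ScanNo': [1, 4641],
                    'File': ['LINE0', 'LINE0'],
                    'Latitude': [40.313003, 40.313026],
                    'Longitude': [-88.327355, -88.325976],
                    'Quality': [1, 1]})
summary = profile_summary(gps)
print(summary)
```

Comparing `Length_m` with the distance recorded in the `Trace #... at position ...` lines of the .gps file is one way to spot a GPS problem, since the two should roughly agree for a straight profile.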

In [None]:
#This script reads in the .gps files collected with the Sensors and Software Noggin+ 250
#  These .gps files contain GPGGA sentences. This script converts those into a .csv file with lat/lon that can be read more easily by other software, including GIS and Python
dataDirectory = "../GPRSampleData/"
#Find only the .gps files in the data directory
GPSFiles = pathlib.Path(dataDirectory).glob('*.GPS')

allLines=[] #Create an empty list for appending information from each GPGGA line in the .gps file. Will be converted to dataframe

#Go through each .gps file and collect the GPGGA data
for f in GPSFiles:
    lines = [] #Create an empty, reusable list with one entry per trace in this file
    mdfTime = datetime.datetime.fromtimestamp(f.stat().st_mtime) #File modification time (not necessarily the creation time), from the file metadata

    with open(f) as cf:
        inData = csv.reader(cf, delimiter=',') #Open file and read it into inData
        for row in inData: #Go through one row at a time
            if not row: #Skip blank lines
                continue
            if row[0] == '$GPGGA': #For the rows that contain the GPGGA sentence (i.e., the GPS data)...
                if lines: #...use the sentence only if a trace line came before it
                    row.append(f.stem) #Add the line name (the filename without its extension) to our list
                    row.append(mdfTime) #Add the file modification time to our list
                    lines[-1] = lines[-1] + row #Attach the GPGGA fields to the most recent trace entry
            elif '#' in row[0]:
                #If it is not a line with a GPGGA sentence, it might be a line containing information about the trace the GPS point is associated with
                currRow = row[0]
                startInd = currRow.find('#')+1
                newStr = currRow[startInd:]
                endInd = newStr.find(' ')
                traceNo = int(newStr[:endInd])
                lines.append([traceNo]) #Add that trace information to our list
            else:
                pass #Do not collect information from the other lines
    allLines.extend(lines) #Add each GPS point as a separate "row" in the allLines list

inDF = pd.DataFrame(allLines) #Convert allLines list to dataframe

#Create new dataframe for data manipulation and formatting
df = inDF.copy() #So we can preserve the original data as we manipulate it
df.dropna(axis=0, inplace=True, how='any') #Drop any GPS data (or errant row) that does not contain all relevant information
cols = ["Trace","StrType", "Time", "Lat_Unformat", "LatDir", "Lon_Unformat", "LonDir", "Quality", "Satellites", "HDOP", "Elev_m", "ElevUnit", "GeoidOffset", "GeoidOffsetUnit", "Col", "STID", "Filename","FileCreateTime"]
df.columns = cols #Assign column labels
floatCols = ["Lat_Unformat", "Lon_Unformat","Quality","Satellites","HDOP","Elev_m","GeoidOffset"] #Note columns to be converted to float for easier manipulation later

for c in floatCols: #Convert some of the data (read in as strings) to a numeric datatype
    df[c] = pd.to_numeric(df[c],errors='coerce') #Values that cannot be converted become NaN

df.dropna(axis=0, inplace=True, how='any') #Again, drop any errant rows
df.reset_index(drop=True, inplace=True) #Reset the index with our new data

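#Add new columns for time. The GPGGA sentence carries only a UTC time of day, so the date is taken from the file modification time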
df["Time"] = pd.to_datetime(df["Time"],format="%H%M%S.%f")
df["year"] = df["FileCreateTime"].dt.year
df["month"] = df["FileCreateTime"].dt.month
df["day"] = df["FileCreateTime"].dt.day
df["hour"] = df["Time"].dt.hour
df["minute"] = df["Time"].dt.minute
df["second"] = df["Time"].dt.second

df["CollectTime"] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])

#Reformat GGA sentence coordinates (ddmm.mmmm for latitude, dddmm.mmmm for longitude) to decimal degrees
df["LatDeg"] = np.floor(df["Lat_Unformat"]/100) #Whole degrees
df["LonDeg"] = np.floor(df["Lon_Unformat"]/100)

#Minutes are what remains after removing the degrees. Arithmetic is used instead of slicing the string so that coordinates with one or three degree digits are also handled correctly
df["LatMin"] = df["Lat_Unformat"] - df["LatDeg"]*100
df["LonMin"] = df["Lon_Unformat"] - df["LonDeg"]*100

df["LatMinDec"] = df["LatMin"]/60 #Convert minutes to a fraction of a degree
df["LonMinDec"] = df["LonMin"]/60

df["Latitude"] = df["LatDeg"]+df["LatMinDec"]
df["Longitude"] = df["LonDeg"]+df["LonMinDec"]

#Southern latitudes and western longitudes are negative in decimal degrees
df.loc[df["LatDir"] == 'S', "Latitude"] *= -1
df.loc[df["LonDir"] == 'W', "Longitude"] *= -1

#Create dataframe for output, with just the most important columns
outDF = pd.DataFrame()

outDF.index.name="ID"
outDF["ScanNo"] = df['Trace']
outDF["Time"] = df["CollectTime"]
outDF["File"] = df["Filename"]
outDF["Latitude"] = df["Latitude"]
outDF["Longitude"] = df["Longitude"]
outDF["Elev_m"] = df["Elev_m"]
outDF["Quality"] = df["Quality"]
outDF["Satellites"] = df["Satellites"]
outDF["HDOP"] = df["HDOP"]
outDF["GeoidSep"] = df["GeoidOffset"]

#Preview, Export
print(outDF)
outGPSFilePath = pathlib.Path(dataDirectory).joinpath('AllGPSPts_GPR_GEOL451Data.csv')
outDF.to_csv(outGPSFilePath)

     ScanNo                Time   File   Latitude  Longitude  Elev_m  Quality  \
ID                                                                              
0         1 2022-07-21 16:40:10  LINE0  40.313003 -88.327355  217.53        1   
1        11 2022-07-21 16:40:13  LINE0  40.313003 -88.327354  217.54        1   
2        21 2022-07-21 16:40:14  LINE0  40.313003 -88.327350  217.54        1   
3        31 2022-07-21 16:40:15  LINE0  40.313003 -88.327347  217.55        1   
4        41 2022-07-21 16:40:15  LINE0  40.313003 -88.327345  217.55        1   
..      ...                 ...    ...        ...        ...     ...      ...   
464    4641 2022-07-21 16:43:12  LINE0  40.313026 -88.325976  215.38        1   
465    4651 2022-07-21 16:43:12  LINE0  40.313027 -88.325971  215.40        1   
466    4661 2022-07-21 16:43:13  LINE0  40.313027 -88.325968  215.38        1   
467    4671 2022-07-21 16:43:13  LINE0  40.313027 -88.325965  215.35        1   
468    4681 2022-07-21 16:43