# This notebook examines the GPS and GLONASS data collected on board the Antarctic Circumnavigation Expedition, which together form the cruise track.

Data were collected simultaneously from a GPS (Trimble) and GLONASS (based on the bridge) to monitor the track of the ship during the expedition.

Both data streams were collected in their raw format with a 1-second resolution. In addition, the Trimble GPS data were fed through a number of other on-board instruments, meaning we also have a "pre-processed" version of the data. 

These data sets were pre-processed on board to convert them from the raw format, a set of NMEA strings, to a more usable CSV format. Both data streams were combined and the CSV files were output at a number of resolutions (1 second, 1 minute, 5 minutes and 1 hour) to meet the needs of different projects.

This notebook has the following aims: 

1 - check the conversion of the raw data to pre-processed data was correct for the GPS and GLONASS data streams

2 - check the integrity of the data itself by doing some basic quality checking for the GPS and GLONASS data streams

3 - compare the pre-processed, quality-checked data with the pre-processed data contained in the other data streams (e.g. motion data)

4 - highlight any areas where the data look to be incorrect for the GPS and GLONASS data streams

5 - compare the GPS and GLONASS data streams
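One basic integrity check that could support aim 2 is validating the checksum at the end of each raw NMEA sentence: the two hex digits after the `*` are the XOR of every character between the `$` and the `*`. The sketch below is only an illustration; the function names `nmea_checksum` and `has_valid_checksum` are hypothetical and the sentence body is made up. It would need to run on the raw lines, since splitting on commas leaves the checksum fused to the last field.

```python
from functools import reduce


def nmea_checksum(sentence):
    """Return the two-digit hex XOR of the characters between '$' and '*'."""
    body = sentence.strip().lstrip('$').split('*')[0]
    return format(reduce(lambda acc, char: acc ^ ord(char), body, 0), '02X')


def has_valid_checksum(sentence):
    """True if the sentence ends in '*hh' and hh matches the computed checksum."""
    sentence = sentence.strip()
    if '*' not in sentence:
        return False
    stated = sentence.rsplit('*', 1)[1]
    return stated.upper() == nmea_checksum(sentence)


# Round trip on a hypothetical sentence body (values are illustrative only).
body = "GPZDA,000000.00,04,01,2017,00,00"
sentence = "$" + body + "*" + nmea_checksum("$" + body)

print(has_valid_checksum(sentence))              # the checksum we just computed
print(has_valid_checksum(sentence[:-1] + "G"))   # corrupted final character
```

Sentences that fail the check could be counted per file to flag days with transmission problems.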

### Set up python and pandas

In [3]:
import csv
import datetime
import os  # used by get_input_files

import MySQLdb
import numpy as np  # used by reduce_memory_usage
import pandas

pandas.set_option('display.max_columns', 100)
pandas.set_option('display.max_rows', 20000)



### Importing data

Import data files from a folder

In [None]:
def get_input_files(input_data_folder):
    """Return the full paths of all daily GPS log files in a folder."""

    list_data_files = []

    # Build absolute paths without changing the working directory.
    directory_path = os.path.abspath(input_data_folder)

    for filename in sorted(os.listdir(directory_path)):
        if filename.startswith("gpsdata_201"):
            list_data_files.append(os.path.join(directory_path, filename))

    return list_data_files

Import a single test file

In [None]:
def get_input_file(input_data_folder, filename):
    
    list_data_files = []
    
    full_filepath = os.path.join(input_data_folder, filename)  # works with or without a trailing slash on the folder
    list_data_files.append(full_filepath)
    
    return list_data_files

Import data from a database table into a dataframe

In [None]:
def get_data_from_database(query, db_connection):
    
    dataframe = pandas.read_sql(query, con=db_connection)

    return dataframe

### Dataframe utils

Read list of files into a single dataframe. 

Note that the number of rows is likely to be large. Each Trimble GPS daily file has ~430,000 rows, so 105 files amount to roughly 45,000,000 rows.

A note on the columns: each row within the raw data file is preceded by an NMEA string name (e.g. GPGGA), which denotes the variables it contains. As each NMEA string contains a different number of variables, each row in the dataframe will contain a variable number of columns. When loading the data into the dataframe, the columns therefore need explicit names to overcome this problem (see names = list('abcdefghijklmno') in the code, where the number of letters in the list is the same as the maximum number of variables in an NMEA string).

Pandas will be used to get the data from different NMEA strings into different dataframes.

In [None]:
def read_files(list_data_files):
    
    df_from_each_file = (pandas.read_csv(file, names = list('abcdefghijklmno')) for file in list_data_files) # columns are named as letters at the moment. required because the data has irregular numbers of columns in each row.
    concatenated_df = pandas.concat(df_from_each_file, ignore_index=True)
    
    return concatenated_df
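
The effect of the names argument can be seen on a small in-memory sample. This is a minimal sketch: the two sentences are made up (their checksums are placeholders), and reading with dtype=str is an assumption added here so that values such as 000000.00 keep their leading zeros rather than being parsed as numbers.

```python
import io

import pandas

# Hypothetical sample: a 7-field ZDA sentence followed by a 15-field GGA sentence.
sample = io.StringIO(
    "$GPZDA,000000.00,04,01,2017,00,00*00\n"
    "$GPGGA,000000.00,4807.038,S,01131.000,E,1,08,0.9,10.0,M,20.0,M,,*00\n"
)

# Fifteen names give every row the same width; short rows are padded with NaN.
sample_df = pandas.read_csv(sample, names=list('abcdefghijklmno'), dtype=str)

print(sample_df.shape)
print(sample_df.loc[0, 'h':'o'].isna().all())  # the ZDA row has no values beyond column g
```

Keeping everything as strings at this stage defers the numeric conversion until each NMEA string has been split into its own dataframe.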

### Specific NMEA strings utils

Create a dataframe from a specific NMEA string.

In [None]:
def get_nmea_string_data(nmea_string, dataframe, header):
    
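    # Take a copy so that later column assignments do not act on a view of the source dataframe.
    nmea_dataframe = dataframe.loc[dataframe['nmea_reference'] == nmea_string].copy()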
    
    print("Header length:",len(header))
    print("Number columns:", len(nmea_dataframe.columns))
    
    if len(nmea_dataframe.columns) > len(header):
        nmea_dataframe = nmea_dataframe.iloc[:,0:len(header)]
        nmea_dataframe.columns = header
    elif len(nmea_dataframe.columns) == len(header):
        nmea_dataframe.columns = header
            
    # TODO convert the time and date columns. Note that the format must be passed by keyword:
    # the second positional argument of pandas.to_datetime is errors, not format.
#    if nmea_string in ('$GPGGA', '$GPRMC'):
#        nmea_dataframe['fix_time'] = pandas.to_datetime(nmea_dataframe['fix_time'], format='%H%M%S')
#    elif nmea_string == '$GPZDA':
#        nmea_dataframe['record_time'] = pandas.to_datetime(nmea_dataframe['record_time'], format='%H%M%S')
#    if nmea_string == '$GPRMC':
#        nmea_dataframe['fix_date'] = pandas.to_datetime(nmea_dataframe['fix_date'], format='%d%m%y')

    return nmea_dataframe

Define GPGGA header

In [None]:
gpgga_header = ['nmea_reference', 'fix_time', 'latitude', 'latitude_ns', 'longitude', 'longitude_ew',
             'fix_quality', 'number_satellites', 'horiz_dilution_of_position','altitude', 'altitude_units', 'geoid_height', 'geoid_height_units',
             'unknown', 'checksum']

Define GPZDA header

In [None]:
gpzda_header = ['nmea_reference', 'record_time', 'day', 'month', 'year', 'local_time_zone_hours', 'min_checksum']

Define GPRMC header

In [None]:
gprmc_header = ['nmea_reference', 'fix_time', 'status', 'latitude', 'latitude_ns', 'longitude', 'longitude_ew', 'speed_over_gound_kts', 'track_angle_degs', 'fix_date', 'magnetic_variation', 'magnetic_variation_ew', 'checksum']

### Optimisations

Convert numbers to floats to optimise memory usage.

In [None]:
def optimise_line(line):
    """Convert the values in a list that look like numbers to floats, in place, to reduce memory usage and make the next stage more efficient. Values that are not numbers are left in their original format."""
    for i, value in enumerate(line):
        try:
            line[i] = float(value)
        except ValueError:
            pass
    return line

In [None]:
def optimise_dataframe(dataframe):    
    
    dataframe.info()
    print(dataframe[:5])
    
    cols_float64 = ['latitude', 'longitude', 'record_time']
    cols_float32 = ['horiz_dilution_of_position', 'altitude', 'geoid_height']
    cols_int = ['id', 'fix_quality', 'number_satellites', 'device_id', 'measureland_qualifier_flags_id', 'day', 'month', 'year']
    
    #for col in cols_float64: 
    #    if col in dataframe.columns:
    #        dataframe[cols_float64] = dataframe[cols_float64].apply(pandas.to_numeric, errors='ignore')
            
    for col in cols_float32:
        if col in dataframe.columns:
            # Convert one column at a time so that a missing column cannot raise a KeyError.
            dataframe[col] = pandas.to_numeric(dataframe[col], errors='ignore', downcast='float')

    for col in cols_int:
        if col in dataframe.columns:
            # Pass the whole column to to_numeric so that the downcast applies to the column dtype.
            dataframe[col] = pandas.to_numeric(dataframe[col], errors='ignore', downcast='integer')
    
    dataframe.info()
    print(dataframe[:5])
    
    return dataframe

Optimise memory usage in the dataframe by converting float64 to float32 (uses fewer bytes per value).

In [None]:
# The code below was taken from https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65 and is used to convert the datatype to one that uses less memory.

def reduce_memory_usage(props):
    """Takes a dataframe and converts the data type of each float to float32, reducing the memory usage."""
    
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            
            # Print current column type
            #print("******************************")
            #print("Column: ",col)
            #print("dtype before: ",props[col].dtype)
            
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all(): 
                NAlist.append(col)
                props[col] = props[col].fillna(mn - 1)  # assign back rather than filling a possible copy in place
                   
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)    
            
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
            
            # Print new column type
            #print("dtype after: ",props[col].dtype)
            #print("******************************")
            
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props, NAlist
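
The trade-off behind this conversion can be sketched on a toy dataframe. The values below are made up; the point is that float32 halves the memory but carries only about seven significant digits, so two latitudes in NMEA ddmm.mmmm form that differ by 0.0001 arc-minutes can collapse to the same value. This is one reason to keep latitude and longitude as float64, as the cols_float64 list in optimise_dataframe suggests.

```python
import numpy as np
import pandas

# Hypothetical values in NMEA ddmm.mmmm form, 0.0001 arc-minutes apart.
demo_df = pandas.DataFrame({
    'latitude': [4807.0381, 4807.0382],
    'altitude': [10.0, 12.5],
})

bytes_before = demo_df.memory_usage(index=False).sum()
demo_small = demo_df.astype(np.float32)
bytes_after = demo_small.memory_usage(index=False).sum()

print(bytes_before, bytes_after)  # float32 uses 4 bytes per value instead of 8

# The two latitudes are distinct as float64 but identical as float32.
print(demo_df['latitude'].iloc[0] == demo_df['latitude'].iloc[1])
print(demo_small['latitude'].iloc[0] == demo_small['latitude'].iloc[1])
```

Near 4807 the spacing between adjacent float32 values is about 0.0005 arc-minutes, which is roughly 0.9 m of latitude, so the downcast is better suited to columns such as altitude or dilution of position than to coordinates.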

## TODO Join the ZDA and GGA NMEA sentences together - how?

Read the ZDA lines of the file line by line, then read the following line and append it to the ZDA line.

In [11]:
def data_to_list(filename, nmea_string):
    """Read a single file and return a list of the rows whose first field matches the given NMEA string."""

    row_of_data = []
    with open(filename, 'r') as data_file:
        contents = csv.reader(data_file, delimiter=',')
        for line in contents:
            if line and line[0] == nmea_string:  # skip blank lines
                row_of_data.append(line)
                
    return row_of_data

In [12]:
nmea_string = '$GPZDA'

list_files = ['/home/jen/projects/ace_data_management/ship_data/gps_trimble/gpsdata_20170104.log', '/home/jen/projects/ace_data_management/ship_data/gps_trimble/gpsdata_20170105.log']

rows_of_data = []
for filename in list_files:
    # extend (rather than append) keeps a flat list of rows, one list per NMEA sentence
    rows_of_data.extend(data_to_list(filename, nmea_string))

In [None]:
print(rows_of_data[:5])

Optimise this list

Read the list to a dataframe

Optimise the dataframe

# 1 - Check conversion of raw data to pre-processed data

### Trimble GPS

Import an example raw data file

In [None]:
input_data_folder_trimble = '/home/jen/projects/ace_data_management/ship_data/gps_trimble/'
trimble_filename = 'gpsdata_20170104.log'

#test_list_trimble_data_files = get_input_file(input_data_folder_trimble, trimble_filename)
test_list_trimble_data_files = ['/home/jen/projects/ace_data_management/ship_data/gps_trimble/gpsdata_20170104.log', '/home/jen/projects/ace_data_management/ship_data/gps_trimble/gpsdata_20170105.log']

Read raw data into dataframe

In [None]:
trimble_raw_df = read_files(test_list_trimble_data_files)
trimble_raw_df = trimble_raw_df.rename(columns = {'a': 'nmea_reference'})
len(trimble_raw_df)

Preview the start of the dataframe.

In [None]:
trimble_raw_df.iloc[:10]

Put GPGGA data into a separate dataframe

In [None]:
nmea_string = '$GPGGA'

gpgga_trimble_raw_df = get_nmea_string_data(nmea_string, trimble_raw_df, gpgga_header)
gpgga_trimble_raw_df.info()

In [None]:
gpgga_trimble_raw_df.iloc[:10]

Optimise the dataframe

In [None]:
gpgga_trimble_raw_df_opt = optimise_dataframe(gpgga_trimble_raw_df)

Put GPZDA data into a separate dataframe.

In [None]:
nmea_string = '$GPZDA'

gpzda_trimble_raw_df = get_nmea_string_data(nmea_string, trimble_raw_df, gpzda_header)

In [None]:
gpzda_trimble_raw_df.iloc[:10]

In [None]:
gpzda_trimble_raw_df_opt = optimise_dataframe(gpzda_trimble_raw_df)

Combine the GPZDA and GPGGA rows into another dataframe so that we have a date/timestamp with each latitude and longitude. 
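
One possible way to do this is sketched below. It assumes that, in the raw file, each GPGGA sentence follows the GPZDA sentence for the same second, so the date on a ZDA row can be carried forward onto the rows after it. That ordering is an assumption about the log files; if GGA came first, fixes at midnight would pick up the previous day's date. The sample sentences are made up, their checksums are placeholders, and dtype=str is used so that times such as 000000.00 keep their leading zeros.

```python
import io

import pandas

# Hypothetical sample: interleaved ZDA and GGA sentences across midnight.
raw = io.StringIO(
    "$GPZDA,235959.00,04,01,2017,00,00*00\n"
    "$GPGGA,235959.00,4807.038,S,01131.000,E,1,08,0.9,10.0,M,20.0,M,,*00\n"
    "$GPZDA,000000.00,05,01,2017,00,00*00\n"
    "$GPGGA,000000.00,4807.040,S,01131.002,E,1,08,0.9,10.0,M,20.0,M,,*00\n"
)
df = pandas.read_csv(raw, names=list('abcdefghijklmno'), dtype=str)

# On ZDA rows, columns c, d and e hold day, month and year.
is_zda = df['a'] == '$GPZDA'
date = df['e'] + '-' + df['d'] + '-' + df['c']

# Keep the date on ZDA rows only, then carry it forward onto the rows that follow.
df['date'] = date.where(is_zda).ffill()

# Select the GGA rows and combine the carried date with their own fix time (HHMMSS).
gga = df[df['a'] == '$GPGGA'].copy()
gga['timestamp'] = pandas.to_datetime(
    gga['date'] + ' ' + gga['b'].str[:6], format='%Y-%m-%d %H%M%S'
)

print(gga[['timestamp', 'c', 'd', 'e', 'f']])
```

A stricter alternative would be to merge the two dataframes on the time column within each daily file, which does not depend on the order of the sentences.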

Get the data from the database

In [None]:
query_trimble = 'select * from ship_data_gpggagpsfix where device_id=63;'

db_connection = MySQLdb.connect(host='localhost', user='ace', passwd='ace', db='ace2016', port=3306)

gpsdb_df = get_data_from_database(query_trimble, db_connection)
gpsdb_df_opt = optimise_dataframe(gpsdb_df)

Compare the raw data and database data.
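
A comparison along these lines could be sketched as follows. The helper names, the timestamp column and the tolerance are all hypothetical, and the sketch assumes that the database stores signed decimal degrees while the raw files hold NMEA ddmm.mmmm values with a hemisphere letter; the actual table layout should be checked first.

```python
import pandas


def nmea_to_decimal_degrees(value, hemisphere):
    """Convert an NMEA ddmm.mmmm (or dddmm.mmmm) value to signed decimal degrees."""
    value = float(value)
    degrees = int(value // 100)
    minutes = value - degrees * 100
    decimal = degrees + minutes / 60
    return -decimal if hemisphere in ('S', 'W') else decimal


def compare_positions(raw_df, db_df, tolerance=1e-6):
    """Return rows present in only one source, and rows whose positions disagree."""
    merged = raw_df.merge(db_df, on='timestamp', how='outer',
                          suffixes=('_raw', '_db'), indicator=True)
    unmatched = merged[merged['_merge'] != 'both']
    matched = merged[merged['_merge'] == 'both']
    lat_diff = (matched['latitude_raw'] - matched['latitude_db']).abs()
    lon_diff = (matched['longitude_raw'] - matched['longitude_db']).abs()
    mismatched = matched[(lat_diff > tolerance) | (lon_diff > tolerance)]
    return unmatched, mismatched


# Made-up data: the third fix is missing from the database and the second disagrees.
times = pandas.to_datetime(['2017-01-04 00:00:00', '2017-01-04 00:00:01', '2017-01-04 00:00:02'])
raw_demo = pandas.DataFrame({'timestamp': times,
                             'latitude': [-48.1173, -48.1174, -48.1175],
                             'longitude': [11.5, 11.5, 11.5]})
db_demo = pandas.DataFrame({'timestamp': times[:2],
                            'latitude': [-48.1173, -48.2000],
                            'longitude': [11.5, 11.5]})

unmatched, mismatched = compare_positions(raw_demo, db_demo)
print(len(unmatched), len(mismatched))
```

Comparing with a tolerance rather than exact equality avoids flagging differences that come only from floating-point rounding during the conversion.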

### GLONASS

Import an example raw data file. Note that one of the NMEA strings, GPRMC, has 30 columns, which would need to be included if the full data set is required.

In [None]:
input_data_folder_glonass = '/home/jen/projects/ace_data_management/ship_data/gps_bridge1/'
glonass_filename = 'gpsdata-20170104.log'

#test_list_glonass_data_files = get_input_file(input_data_folder_glonass, glonass_filename)
test_list_glonass_data_files = ['/home/jen/projects/ace_data_management/ship_data/gps_bridge1/gpsdata-20170104.log', '/home/jen/projects/ace_data_management/ship_data/gps_bridge1/gpsdata-20170105.log']

glonass_df = read_files(test_list_glonass_data_files)
glonass_df = glonass_df.rename(columns = {'a': 'nmea_reference'})

len(glonass_df)

In [None]:
glonass_df.iloc[:10]

Put GPGGA data into a separate dataframe.

In [None]:
nmea_string = '$GPGGA'

gpgga_glonass_df = get_nmea_string_data(nmea_string, glonass_df, gpgga_header)

In [None]:
gpgga_glonass_df.iloc[:10]

Put GPZDA data into a separate dataframe.

In [None]:
nmea_string = '$GPZDA'

gpzda_glonass_df = get_nmea_string_data(nmea_string, glonass_df, gpzda_header)

In [None]:
gpzda_glonass_df.iloc[:10]

Put GPRMC data into a separate dataframe.

In [None]:
nmea_string = '$GPRMC'

gprmc_glonass_df = get_nmea_string_data(nmea_string, glonass_df, gprmc_header)

In [None]:
gprmc_glonass_df.iloc[:100]

In [None]:
# Parse fix_time (HHMMSS) and keep only the time of day.
gprmc_glonass_df['fix_time'] = pandas.to_datetime(gprmc_glonass_df['fix_time'], format='%H%M%S').dt.time

In [None]:
gprmc_glonass_df.iloc[:5]
#gprmc_glonass_df.info()

Get the GLONASS data from the database.

In [None]:
query_glonass = 'select * from ship_data_gpggagpsfix where device_id=64;'

db_connection = MySQLdb.connect(host='localhost', user='ace', passwd='ace', db='ace2016', port=3306)

glonassdb_df = get_data_from_database(query_glonass, db_connection)