# Find most relevant saildrone data to compare to nearby CTD data

## Basic workflow
- Choose a CTD location.
- Determine a time window around the CTD deployment time to use for matching against Saildrone data.
- Find the Saildrone data in that time window and determine its distance to the CTD location (use subsampled data to speed this up).
- Determine whether the two platforms are close enough for a comparison.
- If data exist within both the time range and the distance range, compare the instruments.
- Repeat for all CTD casts and all Saildrones.

In [35]:
#standard python import of libraries needed (pandas and datetime)
import pandas as pd
import datetime

The following code is a generic, simple *great circle* calculator. It can be found at [https://github.com/CodeDrome/great-circle-distances-python](https://github.com/CodeDrome/great-circle-distances-python)

In [8]:
import math

DEGREES_IN_RADIAN = 57.29577951
MEAN_EARTH_RADIUS_KM = 6371
KILOMETRES_IN_MILE = 1.60934


class GreatCircle(object):

    """
    This class has attributes for the names and locations of a pair of cities,
    and for their distances.
    After the city attributes are set, call calculate to set the distance attributes.
    Validation is carried out and the valid attributes set. This should be checked before using
    calculated attributes.
    """

    def __init__(self):

        """
        Create a set of attributes with default values.
        """

        self.name1 = None
        self.latitude1_degrees = 0
        self.longitude1_degrees = 0
        self.latitude1_radians = 0
        self.longitude1_radians = 0

        self.name2 = None
        self.latitude2_degrees = 0
        self.longitude2_degrees = 0
        self.latitude2_radians = 0
        self.longitude2_radians = 0

        self.central_angle_radians = 0
        self.central_angle_degrees = 0
        self.distance_kilometres = 0
        self.distance_miles = 0
        self.valid = False

    def calculate(self):

        """
        Central method to set calculated attributes, which it
        does by calling other private functions.
        """

        self.__validate_degrees()

        if self.valid:
            self.__calculate_radians()
            self.__calculate_central_angle()
            self.__calculate_distance()

    def __validate_degrees(self):

        """
        Check latitudes and longitudes are within valid ranges,
        setting the valid attribute accordingly.
        """

        self.valid = True

        if self.latitude1_degrees < -90.0 or self.latitude1_degrees > 90.0:
            self.valid = False

        if self.longitude1_degrees < -180.0 or self.longitude1_degrees > 180.0:
            self.valid = False

        if self.latitude2_degrees < -90.0 or self.latitude2_degrees > 90.0:
            self.valid = False

        if self.longitude2_degrees < -180.0 or self.longitude2_degrees > 180.0:
            self.valid = False

    def __calculate_radians(self):

        """
        Calculate radians from degrees by dividing by constant.
        """

        self.latitude1_radians = self.latitude1_degrees / DEGREES_IN_RADIAN
        self.longitude1_radians = self.longitude1_degrees / DEGREES_IN_RADIAN

        self.latitude2_radians = self.latitude2_degrees / DEGREES_IN_RADIAN
        self.longitude2_radians = self.longitude2_degrees / DEGREES_IN_RADIAN

    def __calculate_central_angle(self):

        """
        Slightly complex formula for calculating the central angle
        between two points on the surface of a sphere.
        """

        if self.longitude1_radians > self.longitude2_radians:
            longitudes_abs_diff = self.longitude1_radians - self.longitude2_radians
        else:
            longitudes_abs_diff = self.longitude2_radians - self.longitude1_radians

        self.central_angle_radians = math.acos( math.sin(self.latitude1_radians)
                                         * math.sin(self.latitude2_radians)
                                         + math.cos(self.latitude1_radians)
                                         * math.cos(self.latitude2_radians)
                                         * math.cos(longitudes_abs_diff))

        self.central_angle_degrees = self.central_angle_radians * DEGREES_IN_RADIAN

    def __calculate_distance(self):

        """
        Because we are using radians, this is a simple formula multiplying the radius
        by the angle, the actual units used being irrelevant.
        Also the distance in miles is calculated from kilometres.
        """

        self.distance_kilometres = MEAN_EARTH_RADIUS_KM * self.central_angle_radians

        self.distance_miles = self.distance_kilometres / KILOMETRES_IN_MILE
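
The class above uses the spherical law of cosines. For two points that are nearly identical, rounding can push the argument of `math.acos` slightly outside [-1, 1], which raises a `ValueError`. As a hedged alternative, here is a minimal sketch of the haversine formula, which is better behaved at short range. The function name `haversine_km` is hypothetical and is not part of the original notebook or the CodeDrome repository.

```python
import math

MEAN_EARTH_RADIUS_KM = 6371.0  # same mean radius as the class above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)

    # haversine of the central angle
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    a = min(1.0, a)  # guard against rounding just above 1

    return 2 * MEAN_EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# CTD001 position vs. the first time-matched SD-1033 position used later on
print(haversine_km(62.0048, -175.0692, 71.45203, -156.8688))
```

On a sphere of the same radius this should agree with the class-based distance to well within a kilometre, so either approach can be used for the comparisons that follow.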

In [32]:
#read in saildrone data via pandas - use subsampled data to speed up distance calculations
sd_df = pd.read_csv('data/sd-1033_data_6hr.csv',parse_dates=True,index_col='TIM')
sd_df.sample()

TIM,LATITUDE,LONGITUDE,TEMP_AIR_MEAN,TEMP_AIR_STDDEV,PAR_AIR_MEAN,PAR_AIR_STDDEV,TEMP_SBE37_MEAN,TEMP_SBE37_STDDEV,SAL_SBE37_MEAN,SAL_SBE37_STDDEV,...,TEMP_SBE37_MEAN_QC,SAL_SBE37_MEAN_QC,CHLOR_WETLABS_MEAN_QC,CHLOR_RBR_MEAN_QC,O2_CONC_AANDERAA_MEAN_QC,O2_SAT_AANDERAA_MEAN_QC,O2_CONC_SBE37_MEAN_QC,O2_SAT_SBE37_MEAN_QC,O2_CONC_RBR_MEAN_QC,O2_SAT_RBR_MEAN_QC
2019-06-23 12:00:00,69.09593,-166.8471,5.0,0.01,22.0,1.0,7.3718,0.0008,30.693,0.0004,...,7.3718,30.693,0.48,0.560817,313.95,102.47,311.06,101.15,317.56,103.31


In [50]:
#read in healy data via pandas
he1901_df = pd.read_excel('data/he1901_merged_final.xlsx',sheet_name=0,skiprows=34,parse_dates=True)

***ONCE YOU KNOW WHAT'S GOING ON, YOU CAN DELETE THE CELLS FROM HERE DOWN TO THE "KEEP ALL CELLS BELOW" MARKER***

Let's just start with *CTD001* from the healy1901 cruise. I don't know whether there is a valid Saildrone point nearby, but it will show the math.

We can use pandas to retrieve only the CTD of interest by subsetting on name/value pairs. It's a pandas thing and pretty easy to pick up.

In [51]:
he1901_df[he1901_df['Cast'] == 1] #boolean mask: keep only the rows where Cast equals 1 (this is just one of several ways to do it in pandas)

Unnamed: 0,Cruise,Cast,Latitude,Longitude,Date Time,P_4,S_42,S_41,Fch_906,T_28,...,PO4_186,NO2_184,NO3_182,NH4_189,Notes,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32
0,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,4,32.19231,32.19264,3.0667,9.6986,...,,,,,,,,,,
1,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,5,32.19261,32.19266,3.0544,9.6969,...,0.27,0.0,0.0,0.04,,,,,,
2,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,5,32.19261,32.19266,3.0544,9.6969,...,,,,,,,,,,
3,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,6,32.19276,32.19305,3.1207,9.6947,...,,,,,,,,,,
4,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,7,32.19300,32.19321,3.0869,9.6950,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,73,32.26394,32.26368,0.7219,0.1195,...,,,,,,,,,,
71,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,74,32.26433,32.26402,0.7172,0.1213,...,,,,,,,,,,
72,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,75,32.26443,32.26432,0.7080,0.1213,...,,,,,,,,,,
73,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,76,32.26466,32.26416,0.7371,0.1237,...,,,,,,,,,,


The results above are the entire profile. You are only interested in the near-surface values, so you could also require that P_4 equal some value, as in the following cell.

In [52]:
he1901_df[he1901_df['Cast'] == 1][he1901_df['P_4'] == 4]


Unnamed: 0,Cruise,Cast,Latitude,Longitude,Date Time,P_4,S_42,S_41,Fch_906,T_28,...,PO4_186,NO2_184,NO3_182,NH4_189,Notes,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32
0,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,4,32.19231,32.19264,3.0667,9.6986,...,,,,,,,,,,
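
The chained selection above filters on `Cast` first and then applies a second mask built from the full dataframe, which is the likely source of the "boolean indexing" warnings. A minimal sketch of combining both conditions into a single mask with `&` follows. The dataframe `demo_df` and its values are invented for illustration and are not taken from the cruise file.

```python
import pandas as pd

# Tiny stand-in for he1901_df; the values are made up.
demo_df = pd.DataFrame({
    "Cast": [1, 1, 1, 2, 2],
    "P_4": [4, 5, 6, 3, 4],
    "T_28": [9.70, 9.69, 9.68, 8.10, 8.05],
})

# Both conditions are evaluated on the same dataframe and combined with &,
# so one mask selects the rows. The parentheses around each condition are required.
surface_row = demo_df[(demo_df["Cast"] == 1) & (demo_df["P_4"] == 4)]
print(surface_row)
```

Because the mask and the dataframe share the same index, pandas has nothing to realign and the selection stays quiet.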


Or, without knowing the depth and just wanting the shallowest sample (the minimum pressure, which the .min() function on the 'P_4' column returns), you could say:

In [53]:
he1901_df[he1901_df['Cast'] == 1][he1901_df['P_4'] == he1901_df[he1901_df['Cast'] == 1]['P_4'].min()] #I'm stringing together index calls here; if this is confusing, I strongly suggest looking at pandas/python examples of how to index dataframes


Unnamed: 0,Cruise,Cast,Latitude,Longitude,Date Time,P_4,S_42,S_41,Fch_906,T_28,...,PO4_186,NO2_184,NO3_182,NH4_189,Notes,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32
0,he1901,1,62.0048,-175.0692,2019-08-06 04:44:00,4,32.19231,32.19264,3.0667,9.6986,...,,,,,,,,,,


If you do the above while looping over all CTD casts, you will get a new list containing only the shallowest data point in each cast. Perhaps you already did this in another routine, so let's move on to calculating distance.

Taking the CTD001 example above, you can see that it has latitude and longitude, and the same is true of the Saildrone data. Let's choose a 24-hour window (+/- 12 hours) and see whether there is any Saildrone data in that time period.
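
As a sketch of getting the shallowest row of every cast without an explicit loop, `groupby` can be combined with `idxmin`. The dataframe below is invented for illustration. The sketch assumes every cast has at least one non-missing `P_4` value, and when several rows tie for the minimum it keeps the first one.

```python
import pandas as pd

# Made-up stand-in for he1901_df with two casts.
demo_df = pd.DataFrame({
    "Cast": [1, 1, 1, 2, 2],
    "P_4": [4, 5, 6, 3, 7],
    "Latitude": [62.0, 62.0, 62.0, 62.5, 62.5],
})

# idxmin() returns, for each cast, the index label of the minimum pressure;
# .loc then pulls those rows out of the original dataframe.
shallowest = demo_df.loc[demo_df.groupby("Cast")["P_4"].idxmin()]
print(shallowest)
```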

In [86]:
ctdsfc = he1901_df[he1901_df['Cast'] == 1][he1901_df['P_4'] == he1901_df[he1901_df['Cast'] == 1]['P_4'].min()] #save a variable for ease of calculations to follow

# this reports the datetime of the cast minus 12 hours, i.e. the start of the time window centered on the cast
# the [0] at the end picks out the single value for further use; datetime.timedelta is used to subtract (and later add) 12 hours from the central time
(ctdsfc['Date Time']-datetime.timedelta(hours=12))[0]


Timestamp('2019-08-05 16:44:00')

In [87]:
ctd_sd_match = sd_df[(pd.to_datetime(ctdsfc['Date Time'])-datetime.timedelta(hours=12))[0]:(pd.to_datetime(ctdsfc['Date Time'])+datetime.timedelta(hours=12))[0]]
ctd_sd_match

TIM,LATITUDE,LONGITUDE,TEMP_AIR_MEAN,TEMP_AIR_STDDEV,PAR_AIR_MEAN,PAR_AIR_STDDEV,TEMP_SBE37_MEAN,TEMP_SBE37_STDDEV,SAL_SBE37_MEAN,SAL_SBE37_STDDEV,...,TEMP_SBE37_MEAN_QC,SAL_SBE37_MEAN_QC,CHLOR_WETLABS_MEAN_QC,CHLOR_RBR_MEAN_QC,O2_CONC_AANDERAA_MEAN_QC,O2_SAT_AANDERAA_MEAN_QC,O2_CONC_SBE37_MEAN_QC,O2_SAT_SBE37_MEAN_QC,O2_CONC_RBR_MEAN_QC,O2_SAT_RBR_MEAN_QC
2019-08-05 18:00:00,71.45203,-156.8688,5.28,0.02,158.0,1.0,8.6723,0.0005,30.2521,0.0004,...,8.6723,30.2521,0.31,1.423872,288.34,98.01,283.67,94.74,290.79,97.17
2019-08-06 00:00:00,71.48311,-156.9229,5.59,0.03,535.0,5.0,8.5169,0.0015,30.0989,0.0003,...,8.5169,30.0989,0.21,1.351458,-1e+34,98.44,286.12,95.13,293.52,97.64
2019-08-06 06:00:00,71.56493,-157.5024,4.66,0.02,77.0,1.0,6.5538,0.0003,29.4707,0.0004,...,6.5538,29.4707,0.23,1.486134,307.46,99.07,302.28,95.68,310.63,98.35
2019-08-06 12:00:00,71.55433,-157.771,4.08,0.02,4.0,1.0,6.483,0.0003,29.6651,0.0003,...,6.483,29.6651,0.25,1.492911,307.85,98.58,301.45,95.38,309.63,98.0
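
The time-window selection can be sketched in isolation. The example below uses `pd.Timedelta` and `.loc` slicing on a sorted datetime index. The dataframe `demo_sd` and its latitudes are invented stand-ins for the 6-hourly Saildrone file, and the cast time is the one reported for CTD001.

```python
import pandas as pd

# Invented 6-hourly records standing in for sd_df (index = time, sorted).
demo_sd = pd.DataFrame(
    {"LATITUDE": [71.40, 71.42, 71.45, 71.48, 71.56, 71.55, 71.60]},
    index=pd.to_datetime([
        "2019-08-05 06:00", "2019-08-05 12:00", "2019-08-05 18:00",
        "2019-08-06 00:00", "2019-08-06 06:00", "2019-08-06 12:00",
        "2019-08-06 18:00",
    ]),
)

cast_time = pd.Timestamp("2019-08-06 04:44:00")
half_window = pd.Timedelta(hours=12)

# Label-based slicing on a datetime index includes both end points.
window = demo_sd.loc[cast_time - half_window : cast_time + half_window]
print(window)
```

Widening or narrowing the comparison only requires changing `half_window`.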


So the 4 rows shown above are the Saildrone records in the relevant time window, but are they close enough? Let's use the great circle calculator between each of these four points and the CTD location.

In [94]:
#GreatCircle is the class defined at the beginning of this notebook: set the attributes for both points, then call calculate()
gc = GreatCircle()

#pt 1
gc.name1 = 'HLY Cast'
gc.latitude1_degrees = ctdsfc['Latitude'].iloc[0]
gc.longitude1_degrees = ctdsfc['Longitude'].iloc[0]

#pt 2: choose just the first SD location matched in time for now (python is zero indexed)
gc.name2 = 'SD Location'
gc.latitude2_degrees = ctd_sd_match['LATITUDE'].iloc[0]
gc.longitude2_degrees = ctd_sd_match['LONGITUDE'].iloc[0]

gc.calculate()

In [98]:
print(f'Distance between points are {gc.distance_kilometres} kilometers')

Distance between points are 1308.9943672921042 kilometers


In [121]:
#loop over all saildrone points in the time window and compute the distance from each one to the CTD cast
#GreatCircle is the class defined at the beginning of this notebook
gc = GreatCircle()

#pt 1
gc.name1 = 'HLY Cast'
gc.latitude1_degrees = ctdsfc['Latitude'][0]
gc.longitude1_degrees = ctdsfc['Longitude'][0]

for i,rows in ctd_sd_match.iterrows():
    gc.name2 = 'SD Location'
    gc.latitude2_degrees = rows['LATITUDE']
    gc.longitude2_degrees = rows['LONGITUDE']

    gc.calculate()
    print(f'Distance between points are {gc.distance_kilometres} kilometers for sd at time {i}')

Distance between points are 1308.9943672921042 kilometers for sd at time 2019-08-05 18:00:00
Distance between points are 1310.030787552367 kilometers for sd at time 2019-08-06 00:00:00
Distance between points are 1302.02454701064 kilometers for sd at time 2019-08-06 06:00:00
Distance between points are 1294.6123417742713 kilometers for sd at time 2019-08-06 12:00:00


***KEEP ALL CELLS BELOW***


So for the example of Saildrone 1033 and CTD001 on HLY1901, the Saildrone was ~1300 km away from the CTD throughout the 1-day window around the cast.

So now the goal is to loop over all casts for this one Saildrone and see if there are any points of interest. I'm going to do it in one cell, which means that if you want to perform the analysis on other Saildrones or cruises, you ought to be able to just change the data files and then run this cell. **Warning** Some previous cells that are SD-1033 specific will break if you do this blindly...

**NOTE** The Saildrone data looks to have some information about the distance between the drones. You may or may not be able to use this information when looking at all drones; it's up to you.


In [127]:
#we are going to use another pandas trick called groupby to associate unique casts together and then loop through each cast
he1901_df.groupby('Cast').groups.keys() #will show all the unique 'Cast' numbers

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124])
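
A groupby object can also be iterated directly, which yields each cast number together with its rows and avoids a separate `get_group()` call per cast. A minimal sketch on an invented dataframe:

```python
import pandas as pd

# Made-up stand-in for he1901_df.
demo_df = pd.DataFrame({
    "Cast": [1, 1, 2, 2, 2],
    "P_4": [4, 5, 3, 4, 5],
})

# Each iteration yields (group key, sub-dataframe for that key).
rows_per_cast = {}
for cast_number, cast_rows in demo_df.groupby("Cast"):
    rows_per_cast[int(cast_number)] = len(cast_rows)
    print(f"Cast {cast_number}: {len(cast_rows)} rows, "
          f"min pressure {cast_rows['P_4'].min()}")
```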

In [161]:
for groups in he1901_df.groupby('Cast').groups.keys():
    cast = he1901_df.groupby('Cast').get_group(groups)
    print(f'Working on CTD {groups}')
    #get relevant CTD info
    ctdsfc = he1901_df[he1901_df['Cast'] == groups][he1901_df['P_4'] == he1901_df[he1901_df['Cast'] == groups]['P_4'].min()]
    #get matching saildrone info
    ctd_sd_match = sd_df[(pd.to_datetime(ctdsfc['Date Time'])-datetime.timedelta(hours=12)).iloc[0]:(pd.to_datetime(ctdsfc['Date Time'])+datetime.timedelta(hours=12)).iloc[0]]

    #calc great circle distance
    gc = GreatCircle()

    #pt 1
    gc.name1 = 'HLY Cast'
    gc.latitude1_degrees = ctdsfc['Latitude'].iloc[0]
    gc.longitude1_degrees = ctdsfc['Longitude'].iloc[0]

    for i,rows in ctd_sd_match.iterrows():
        gc.name2 = 'SD Location'
        gc.latitude2_degrees = rows['LATITUDE']
        gc.longitude2_degrees = rows['LONGITUDE']

        gc.calculate()
        print(f'Distance between points are {gc.distance_kilometres} kilometers for sd at time {i}')


Working on CTD 1
Distance between points are 1308.9943672921042 kilometers for sd at time 2019-08-05 18:00:00
Distance between points are 1310.030787552367 kilometers for sd at time 2019-08-06 00:00:00
Distance between points are 1302.02454701064 kilometers for sd at time 2019-08-06 06:00:00
Distance between points are 1294.6123417742713 kilometers for sd at time 2019-08-06 12:00:00
Working on CTD 2
Distance between points are 1309.9089507683927 kilometers for sd at time 2019-08-06 00:00:00
Distance between points are 1301.7614441050487 kilometers for sd at time 2019-08-06 06:00:00
Distance between points are 1294.3086916567295 kilometers for sd at time 2019-08-06 12:00:00
Distance between points are 1291.5239550855442 kilometers for sd at time 2019-08-06 18:00:00
Working on CTD 3
Distance between points are 1276.541748549079 kilometers for sd at time 2019-08-06 06:00:00
Distance between points are 1269.119471961215 kilometers for sd at time 2019-08-06 12:00:00
Distance between points 

The output above presents the distance and time for each CTD and Saildrone point. VS Code may give you an error about performance on the output; just click the link, add a zero to the number it provides, and then rerun the cell. Also, you are likely to see some warnings. In this case, I don't think you need to worry about them, as they all have to do with "boolean indexing"; any others may need to be explored.


For the example of HE1901 CTD data and SD-1033: CTDs 35-40 look to be the promising ones for your comparison.
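
Rather than reading candidate casts off the printed output, the distances could be collected into a dataframe and filtered against a threshold. The following is only a sketch under stated assumptions: the positions and times are invented, `great_circle_km` is a hypothetical haversine helper, and `MAX_DISTANCE_KM` is an arbitrary placeholder, not a recommended value.

```python
import math

import pandas as pd


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance in km (inputs in decimal degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


# Invented cast and time-matched Saildrone points (stand-ins for ctdsfc / ctd_sd_match).
ctd = {"cast": 1, "time": pd.Timestamp("2019-08-06 04:44"), "lat": 62.0, "lon": -175.0}
sd_points = pd.DataFrame(
    {"LATITUDE": [62.0, 62.1, 71.45], "LONGITUDE": [-175.0, -175.0, -156.87]},
    index=pd.to_datetime(["2019-08-06 00:00", "2019-08-06 06:00", "2019-08-06 12:00"]),
)

MAX_DISTANCE_KM = 25.0  # hypothetical threshold; choose one that suits the comparison

records = []
for sd_time, row in sd_points.iterrows():
    records.append({
        "cast": ctd["cast"],
        "sd_time": sd_time,
        "hours_apart": abs((sd_time - ctd["time"]).total_seconds()) / 3600.0,
        "distance_km": great_circle_km(ctd["lat"], ctd["lon"],
                                       row["LATITUDE"], row["LONGITUDE"]),
    })

matches = pd.DataFrame(records)
close_matches = matches[matches["distance_km"] <= MAX_DISTANCE_KM]
print(close_matches)
```

Appending the records for every cast to the same list before building the dataframe would give one table to sort by distance or time difference.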