duplicates_accounting_stats.txt

H. Hourston July 16, 2021

Duplicates checking
Accounting statistics

#####################
This section in new (Aug 6, 2021)
IOS WP CTD data had not been filtered to include only profiles having oxygen data. Now it is.

print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
106
72303

print(len(df_all))
print(len(df_all.iloc[df_all['Exact_duplicate_row'].values]))
print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
72409
106
12183

print(len(df_copy.Inexact_duplicate_row))
60120
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
3659
# Print the number of non-first occurrences of nonexact duplicates
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
2373

Accounting statistics:
Subset length 3719
Number of verified inexact duplicates: 60
Number of inexact duplicates that failed the check (i.e. aren't actually duplicates): 3659

#####################
This section is old (Aug 4, 2021).
This run includes data from IOS Water Properties that was missing from the CIOOS download.
5 bottle files and 738 ctd profile files make up the missing files.

>print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
106
>print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
72459
>print(len(df_all))
72565

106 exact duplicates (identical instrument type, lat, lon, and date string). 72459 profiles
pass to the next check:

>print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
12189

72565-106-12189=60270 rows to check for inexact duplicates:

# Accounting statistics
print(len(df_copy.Inexact_duplicate_row))
60270
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
3851
# Print the number of non-first occurrences of nonexact duplicates
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
2519


#####################
This section is old (July 26, 2021).
I had accidentally included NODC CTD oxygen data which we had decided not to use
over concerns about sensor errors and no calibration against titration data.

print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
106
71716
len(df_all)
71822

106 exact duplicate rows out of 71822 rows (profiles).

Proceed to CTD-bottle duplicate checking:

print(len(df_all))
print(len(df_all.iloc[df_all['Exact_duplicate_row'].values]))
print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
71822
106
12195

71822-106-12195=59521 rows that are not exact duplicates or first occurrences
of exact duplicates. We will do inexact duplicate checking on these 59521 rows.

print(len(df_copy.Inexact_duplicate_row))
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
59521
3432

# Print the number of non-first occurrences of nonexact duplicates
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
2245

So there are 2245 inexact duplicate rows. 59521-2245=57276 unique rows (profiles).

We will do another round of inexact duplicate checking on the rows flagged by the
first inexact duplicate check. For this second check, we will compare the raw
profile data from the source data files.

print(len(df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values]))
1

One actual inexact duplicate row/profile, but it contains only nans:

df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Latitude
3339   NaN
Name: Latitude, dtype: float64

df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Longitude
3339   NaN
Name: Longitude, dtype: float64

df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Date_string
3339   NaN
Name: Date_string, dtype: float64

df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Institute
3339    NaN
Name: Institute, dtype: object

#####################
This section is OLD (July 22, 2021).
I had mistakes in the profile data table latitude and longitude:
- MEDS longitude data *(-1)
- IOS lat and lon data switched

print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
578
83105
len(df_all)
83683

So, there are 578 exact duplicate rows out of 83105 total rows.
83105 rows will move to the CTD-bottle duplicate checking:

print(len(df_all))
83683
print(len(df_all.iloc[df_all['Exact_duplicate_row'].values]))
578
print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
12843

So, 83683 - 578 - 12843 = 70262 rows that will move to the inexact duplicate checking.
This number is comparable to the number of rows from the earlier wrong check of 70290 rows.
No inexact duplicates before row 5844.

print(len(df_copy.Inexact_duplicate_row))
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
70262
9013  #18694 without checking same instrument
5116  #10207 without checking same instrument

The Partner_index equals -1 for non-duplicates, including first occurrences of inexact
duplicate rows. For inexact duplicate rows that are not the first occurrence, the
Partner_index equals the row number of its first occurrence (its "partner").
So, there are 10207 duplicate rows, excluding the first occurrences. This leaves
70262 - 10207 = 60055 non-duplicate rows.

This is less than from the first wrong check value of 62710 non-duplicate rows.

##################### WRONG and old
print(len(df_copy))
83683

print(len(df_copy.iloc[df_copy['Exact_duplicate_row'].values]))
564

# The following excludes exact duplicates already counted by 'Exact_duplicate_row'
print(len(df_copy.iloc[df_copy['CTD_BOT_duplicate_row'].values]))
12829

###############
Based off of the above statistics, subsetting non-duplicates for
inexact duplicate checking yields:

# How many rows subsetter removes
print(len(subsetter[subsetter]), len(subsetter[~subsetter]))
13393 70290

So, inexact duplicate checking should cover 70290 rows.

###############
Inexact duplicate checking accounting statistics:

LIMITS: 0.2 degrees for lat/lon and 1 hour for time.

print(len(df_copy.Inexact_duplicate_row))
70290
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
21233

Hence, we are left with 49057 non-duplicate profiles.

NEW LIMITS: 0.01 degrees for lat/lon and 1 hour for time.
(in order to try to retain more data through this check)

print(len(df_copy.Inexact_duplicate_row))
70290
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
7580

Hence, we are left with 70290-7580=62710 non-duplicate rows (profiles).

There are 13959 total duplicates including the first occurrence of each duplicate.
13959-7580=6379 < 7580, so there are sometimes multiple duplicates for a certain row.