-
Notifications
You must be signed in to change notification settings - Fork 1
/
duplicates_accounting_stats.txt
213 lines (162 loc) · 6.95 KB
/
duplicates_accounting_stats.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
H. Hourston July 16, 2021
Duplicates checking
Accounting statistics
#####################
This section in new (Aug 6, 2021)
IOS WP CTD data had not been filtered to include only profiles having oxygen data. Now it is.
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
106
72303
print(len(df_all))
print(len(df_all.iloc[df_all['Exact_duplicate_row'].values]))
print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
72409
106
12183
print(len(df_copy.Inexact_duplicate_row))
60120
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
3659
# Print the number of non-first occurrences of nonexact duplicates
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
2373
Accounting statistics:
Subset length 3719
Number of verified inexact duplicates: 60
Number of inexact duplicates that failed the check (i.e. aren't actually duplicates): 3659
#####################
This section is old (Aug 4, 2021).
This run includes data from IOS Water Properties that was missing from the CIOOS download.
5 bottle files and 738 ctd profile files make up the missing files.
>print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
106
>print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
72459
>print(len(df_all))
72565
106 exact duplicates (identical instrument type, lat, lon, and date string). 72459 profiles
pass to the next check:
>print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
12189
72565-106-12189=60270 rows to check for inexact duplicates:
# Accounting statistics
print(len(df_copy.Inexact_duplicate_row))
60270
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
3851
# Print the number of non-first occurrences of nonexact duplicates
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
2519
#####################
This section is old (July 26, 2021).
I had accidentally included NODC CTD oxygen data which we had decided not to use
over concerns about sensor errors and no calibration against titration data.
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
106
71716
len(df_all)
71822
106 exact duplicate rows out of 71822 rows (profiles).
Proceed to CTD-bottle duplicate checking:
print(len(df_all))
print(len(df_all.iloc[df_all['Exact_duplicate_row'].values]))
print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
71822
106
12195
71822-106-12195=59521 rows that are not exact duplicates or first occurrences
of exact duplicates. We will do inexact duplicate checking on these 59521 rows.
print(len(df_copy.Inexact_duplicate_row))
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
59521
3432
# Print the number of non-first occurrences of nonexact duplicates
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
2245
So there are 2245 inexact duplicate rows. 59521-2245=57276 unique rows (profiles).
We will do another round of inexact duplicate checking on the rows flagged by the
first inexact duplicate check. For this second check, we will compare the raw
profile data from the source data files.
print(len(df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values]))
1
One actual inexact duplicate row/profile, but it contains only nans:
df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Latitude
3339 NaN
Name: Latitude, dtype: float64
df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Longitude
3339 NaN
Name: Longitude, dtype: float64
df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Date_string
3339 NaN
Name: Date_string, dtype: float64
df_subset.loc[(df_subset.Inexact_duplicate_check2 == True).values].Institute
3339 NaN
Name: Institute, dtype: object
#####################
This section is OLD (July 22, 2021).
I had mistakes in the profile data table latitude and longitude:
- MEDS longitude data *(-1)
- IOS lat and lon data switched
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == True).values]))
print(len(df_all['Exact_duplicate_row'].iloc[(df_all['Exact_duplicate_row'] == False).values]))
578
83105
len(df_all)
83683
So, there are 578 exact duplicate rows out of 83105 total rows.
83105 rows will move to the CTD-bottle duplicate checking:
print(len(df_all))
83683
print(len(df_all.iloc[df_all['Exact_duplicate_row'].values]))
578
print(len(df_all.iloc[df_all['CTD_BOT_duplicate_row'].values]))
12843
So, 83683 - 578 - 12843 = 70262 rows that will move to the inexact duplicate checking.
This number is comparable to the number of rows from the earlier wrong check of 70290 rows.
No inexact duplicates before row 5844.
print(len(df_copy.Inexact_duplicate_row))
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
print(len(df_copy.iloc[(df_copy.Partner_index != -1).values]))
70262
9013 #18694 without checking same instrument
5116 #10207 without checking same instrument
The Partner_index equals -1 for non-duplicates, including first occurrences of inexact
duplicate rows. For inexact duplicate rows that are not the first occurrence, the
Partner_index equals the row number of its first occurrence (its "partner").
So, there are 10207 duplicate rows, excluding the first occurrences. This leaves
70262 - 10207 = 60055 non-duplicate rows.
This is less than from the first wrong check value of 62710 non-duplicate rows.
##################### WRONG and old
print(len(df_copy))
83683
print(len(df_copy.iloc[df_copy['Exact_duplicate_row'].values]))
564
# The following excludes exact duplicates already counted by 'Exact_duplicate_row'
print(len(df_copy.iloc[df_copy['CTD_BOT_duplicate_row'].values]))
12829
###############
Based off of the above statistics, subsetting non-duplicates for
inexact duplicate checking yields:
# How many rows subsetter removes
print(len(subsetter[subsetter]), len(subsetter[~subsetter]))
13393 70290
So, inexact duplicate checking should cover 70290 rows.
###############
Inexact duplicate checking accounting statistics:
LIMITS: 0.2 degrees for lat/lon and 1 hour for time.
print(len(df_copy.Inexact_duplicate_row))
70290
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
21233
Hence, we are left with 49057 non-duplicate profiles.
NEW LIMITS: 0.01 degrees for lat/lon and 1 hour for time.
(in order to try to retain more data through this check)
print(len(df_copy.Inexact_duplicate_row))
70290
print(len(df_copy.Inexact_duplicate_row.iloc[df_copy.Inexact_duplicate_row.values]))
7580
Hence, we are left with 70290-7580=62710 non-duplicate rows (profiles).
There are 13959 total duplicates including the first occurrence of each duplicate.
13959-7580=6379 < 7580, so there are sometimes multiple duplicates for a certain row.