Error in provided chipping/labeling scheme #15

Open
laserstonewall opened this issue Nov 30, 2021 · 0 comments

I believe there may be an error in the reference implementation, in the __init__() method of the XView3Dataset class in dataloader.py, that causes a significant number of samples to be incorrectly labeled as class 1/FISHING.

The results hold for both the tiny and full datasets. If we look at the chipping annotations csv generated for the tiny validation set:

import numpy as np
import pandas as pd

tiny_valid_chips = pd.read_csv('/home/ubuntu/xview3/process_uuid/validation/val_chip_annotations.csv')

print(tiny_valid_chips[['scene_id', 'chip_index', 'is_vessel', 'is_fishing', 'confidence', 'vessel_class']].head())

# Output:

            scene_id  chip_index is_vessel is_fishing confidence  vessel_class
0  b1844cde847a3942v         333      True        NaN     MEDIUM             1
1  b1844cde847a3942v         303      True        NaN        LOW             1
2  b1844cde847a3942v         824     False        NaN       HIGH             3
3  b1844cde847a3942v         404      True        NaN     MEDIUM             1
4  b1844cde847a3942v         404     False        NaN       HIGH             3

We can see that in the first row, is_vessel is True and is_fishing is NaN. This should result in a label of 2/NONFISHING, but the label ends up as 1/FISHING. If we group by these columns (after filling in the NaN values, since pandas drops rows whose groupby keys contain NaN), we get:

groupby_cols = ['is_vessel', 'is_fishing', 'confidence', 'vessel_class']

tst = tiny_valid_chips[groupby_cols]
tst = tst.fillna('Null')

print(tst.groupby(groupby_cols).size())

# Output

is_vessel  is_fishing  confidence  vessel_class
False      Null        HIGH        3               400
                       MEDIUM      3                26
True       False       HIGH        2               272
                       MEDIUM      2                 3
           True        HIGH        1                19
           Null        LOW         1               329
                       MEDIUM      1               303
Null       Null        LOW         1                10

So we can see that cases where is_vessel is True and is_fishing is NaN are always labeled as 1/FISHING. This occurs for both LOW and MEDIUM confidence labels in the tiny set. Additionally, cases where both is_vessel and is_fishing are NaN are also labeled as 1/FISHING.
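
As an aside on the grouping step: instead of filling NaN with a placeholder string, pandas 1.1 and later can keep NaN keys directly via the `dropna=False` argument of `groupby`. The sketch below uses a made-up four-row frame (`toy` is hypothetical data, not taken from the annotations csv) to show the difference in row counts.

```python
import pandas as pd

nan = float("nan")  # pandas treats a float NaN as a missing value

# Hypothetical toy data that mimics the column layout of the chip annotations.
toy = pd.DataFrame({
    "is_vessel": [True, True, nan, False],
    "is_fishing": [nan, True, nan, nan],
    "vessel_class": [1, 1, 1, 3],
})
keys = ["is_vessel", "is_fishing"]

# Default behaviour: any row with NaN in a groupby key is silently dropped.
without_nan = toy.groupby(keys).size()

# dropna=False keeps NaN as its own group, so every row is counted.
with_nan = toy.groupby(keys, dropna=False).size()

print(without_nan.sum(), "rows counted by default")
print(with_nan.sum(), "rows counted with dropna=False")
```

Either approach gives the same counts; `fillna('Null')` just makes the missing groups easier to read in the printed index.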

If we do the same analysis for the chipping annotations csv generated for the tiny training set:

tiny_train_chips = pd.read_csv('/home/ubuntu/xview3/process_uuid/train/train_chip_annotations.csv')

groupby_cols = ['is_vessel', 'is_fishing', 'confidence', 'vessel_class']

tst = tiny_train_chips[groupby_cols]
tst = tst.fillna('Null')

print(tst.groupby(groupby_cols).size())

# Output

is_vessel  is_fishing  confidence  vessel_class
-1         -1          -1          0               260
False      Null        HIGH        3                34
Null       Null        LOW         1               111
True       False       MEDIUM      2               184
           True        MEDIUM      1               194

We can see that cases where is_vessel is True and is_fishing is NaN don't occur in this set. However, cases where both columns are NaN do, and they are labeled as 1/FISHING.

The issue seems to be in the loop in lines 271 - 277:

self.detections = pd.read_csv(detect_file, low_memory=False)
vessel_class = []
for ii, row in self.detections.iterrows():
    if row.is_vessel and row.is_fishing:
        vessel_class.append(FISHING)
    elif row.is_vessel and not row.is_fishing:
        vessel_class.append(NONFISHING)
    elif not row.is_vessel:
        vessel_class.append(NONVESSEL)

The first conditional statement,

if row.is_vessel and row.is_fishing:
    vessel_class.append(FISHING)

is meant to test whether both is_vessel and is_fishing are True. However, a pandas NaN is a float NaN, and Python treats every nonzero float, NaN included, as truthy, so a missing value also passes this test. Here is an example data point:

tst = tiny_valid_chips[(tiny_valid_chips['is_vessel']==True) & (tiny_valid_chips['is_fishing'].isnull())]
example = tst.iloc[0]

print(f"Detect ID: {example['detect_id']}")
print(example[['is_vessel', 'is_fishing', 'vessel_class']])

if example['is_vessel']:
    print('Test 1')

if example['is_fishing']:
    print('Test 2')
    
if not np.isnan(example['is_fishing']):
    print('Test 3')
    
# Output

Detect ID: b1844cde847a3942v_006.46879283500000035190_003.47593584100000008164
is_vessel       True
is_fishing       NaN
vessel_class       1
Name: 0, dtype: object
Test 1
Test 2

So the first conditional statement will classify is_vessel/is_fishing combinations of True/True, NaN/NaN, True/NaN all as class 1/FISHING.
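
The same behaviour can be checked without pandas or the dataset. The sketch below re-creates the branch logic of the quoted loop in plain Python; `original_branch` is a hypothetical helper written for this illustration, and it returns branch names as strings rather than the FISHING/NONFISHING/NONVESSEL constants.

```python
# Plain-Python re-creation of the branch logic quoted above (no pandas needed).
nan = float("nan")  # the value pandas uses for a missing cell


def original_branch(is_vessel, is_fishing):
    """Return the name of the branch the quoted loop would take."""
    if is_vessel and is_fishing:
        return "FISHING"
    elif is_vessel and not is_fishing:
        return "NONFISHING"
    elif not is_vessel:
        return "NONVESSEL"


# is_vessel/is_fishing combinations seen in the annotation files
for pair in [(True, True), (True, nan), (nan, nan), (True, False), (False, nan)]:
    print(pair, "->", original_branch(*pair))
```

Because `bool(nan)` is True, the True/NaN and NaN/NaN pairs fall into the first branch together with True/True; `not nan` is False, so a missing is_fishing can never reach the NONFISHING branch.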

For a dataset with a random subset of 300 of the training scenes chipped, the label distribution looks like:

is_vessel  is_fishing  confidence  vessel_class
False      Null        HIGH        3               5418
                       MEDIUM      3               1730
True       False       HIGH        2               2868
                       MEDIUM      2                 62
           True        HIGH        1                933
                       MEDIUM      1                 28
           Null        LOW         1               3315
                       MEDIUM      1               4751
Null       Null        LOW         1                119

So (4751 + 3315 + 119) = 8185 of the 19224 detections, roughly 43%, end up labeled as 1, seemingly incorrectly.
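
One possible way to avoid the truthiness trap is to test for missing values explicitly before branching. The sketch below is only a suggestion, not the maintainers' fix: `is_missing`, `classify`, and the `unknown` parameter are hypothetical names, the constants mirror the 1/2/3 labels shown in the outputs above, and the label a missing value should receive is left to the caller.

```python
import math

# Label values as they appear in the outputs above (assumed to match dataloader.py).
FISHING, NONFISHING, NONVESSEL = 1, 2, 3


def is_missing(value):
    """True for None or a float NaN, which is how pandas stores a blank cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify(is_vessel, is_fishing, unknown=None):
    """Map one (is_vessel, is_fishing) pair to a label, never treating NaN as True."""
    if is_missing(is_vessel):
        return unknown  # vessel status not known
    if not is_vessel:
        return NONVESSEL
    if is_missing(is_fishing):
        return unknown  # a vessel, but fishing status not known
    return FISHING if is_fishing else NONFISHING


nan = float("nan")
print(classify(True, True))                     # known fishing vessel
print(classify(True, nan))                      # falls back to `unknown`
print(classify(True, nan, unknown=NONFISHING))  # caller-chosen fallback
```

In the loop, this could be applied per row, for example `vessel_class.append(classify(row.is_vessel, row.is_fishing, unknown=NONFISHING))` if True/NaN rows are meant to be 2/NONFISHING as suggested above. How NaN/NaN rows should be labeled is a separate decision.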
