Bug: Duplicate JPEGs are not being removed #1

Open · 3 tasks done · 2320sharon opened this issue Sep 28, 2023 · 11 comments
Labels: bug (Something isn't working)
2320sharon (Contributor) commented Sep 28, 2023

Description:

Users have found that duplicate images are being downloaded and then used to extract shorelines. Downloading duplicate images has always been an issue with coastsat's download workflow, but the open question is whether the duplicate-removal process only applies to the TIFFs, leaving the duplicate JPEGs in place. Users are also wondering whether the download workflow could be modified to detect duplicates before they are downloaded, so that duplicate images do not further slow down the downloads. The problem is most prevalent with S2 imagery and has led to significant delays in download times, impacting user experience and overall workflow efficiency.

Concerns:

  • The removal of duplicate images seems to be limited to TIFFs only, leaving behind duplicate JPEGs.
  • Users have expressed a need for refining the download workflow to identify and exclude duplicate images before the download process begins, ideally improving workflow speed.

Tasks:

  • Task 1: Investigate whether the “remove duplicates” function is currently excluding JPEGs and ensure all duplicate JPEGs are removed in the process.
  • Task 2: Validate the sequence of the workflow, ensuring the “remove duplicates” function is executed before the shoreline extraction begins.
  • Task 3: Explore the feasibility of identifying duplicate images in the dataset prior to initiating downloads from Google Earth Engine, possibly through advanced filtering techniques.

Acceptance Criteria:

  • The “remove duplicates” function should effectively remove both TIFF and JPEG duplicates.
2320sharon added the bug label and self-assigned this issue on Sep 28, 2023
2320sharon (Contributor, Author) commented Oct 4, 2023

Relevant Research Findings

  • The remove_duplicates function defined in coastsat removes duplicate shorelines from the shoreline dictionary. It does not remove duplicate TIFFs or JPEGs.
  • Duplicate TIFFs are typically S2 imagery.
  • handle_duplicate_image_names, defined in coastsat, renames duplicate TIFFs by adding dup_X to the file name. These duplicate files are images from the same satellite collection with the same image timestamp. They are temporary TIFFs that are later converted into the real TIFFs that get saved.
  • The filtering of the S2 collection begins before the download ever starts. This is performed in im_dict_T1["S2"] = filter_S2_collection(im_dict_T1["S2"])
    • In essence, filter_S2_collection deletes all the S2 imagery with the same timestamp but different UTM zones; among images with the same timestamp and the same UTM zone, only the first is kept. A minimal sketch of this logic follows this list.
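
To make that filtering behavior concrete, here is a minimal sketch of timestamp-based deduplication, assuming each image is represented as a dict with timestamp and utm_zone keys. The field names and data layout are illustrative assumptions, not coastsat's actual metadata structure.

from collections import OrderedDict

def filter_by_timestamp(images):
    # Keep only the first image seen for each timestamp, so later
    # images with the same timestamp (e.g. the same scene projected
    # into a neighboring UTM zone) are dropped as duplicates.
    kept = OrderedDict()
    for im in images:
        kept.setdefault(im["timestamp"], im)
    return list(kept.values())

# Example: the first two images share a timestamp, so only one survives.
images = [
    {"timestamp": "2022-11-18-15-52-02", "utm_zone": "18N"},
    {"timestamp": "2022-11-18-15-52-02", "utm_zone": "17N"},
    {"timestamp": "2022-11-18-15-51-47", "utm_zone": "18N"},
]
print(filter_by_timestamp(images))  # 2 of the 3 images are kept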

Conclusions

Given that the S2 collection has its duplicates filtered out before the download ever begins, I'm confused about how duplicate JPEGs for the S2 collection are even being generated. I think it will take some testing to figure out when this happens. It's also possible that each person has a different definition of what a duplicate image is. The coastsat implementation classifies an image as a duplicate when it comes from the same satellite collection and its timestamp is less than 24 hours apart from another image's.

As I write this, I realize the issue might not be the filtering technique but the fact that the collections are in two different tiers. It's possible that some timestamps are the same across both tiers for the same satellite, which would produce duplicate imagery even though it should have been filtered out. While the S2 collection does not have two tiers, the other satellites do, so I'm going to do some testing and see if duplicates arise because of this.
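
One quick way to test this hypothesis would be to intersect the timestamps of the two tiers: any timestamp present in both would survive per-tier filtering yet still produce two downloads of the same scene. A rough sketch, with hypothetical variable names and example values:

# Hypothetical timestamp lists for the same satellite's Tier 1 and
# Tier 2 collections -- the names and values here are illustrative.
tier1_timestamps = ["2018-12-03-18-39-48", "2018-12-04-18-40-12"]
tier2_timestamps = ["2018-12-03-18-39-48", "2018-12-05-18-41-03"]

# A timestamp appearing in both tiers is a cross-tier duplicate that
# per-tier filtering would never catch.
cross_tier_duplicates = set(tier1_timestamps) & set(tier2_timestamps)
print(cross_tier_duplicates)  # {'2018-12-03-18-39-48'}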

2320sharon (Contributor, Author) commented Oct 4, 2023

These two JPEGs, 2018-12-03-18-39-48_RGB_L8.jpg and 2018-12-03-18-40-12_RGB_L8.jpg, were captured on the same day (2018-12-03) and at almost the same time (18-39-48 and 18-40-12). Would these images be considered duplicates, since they are less than 24 hours apart? @dbuscombe-usgs
(attached images: 2018-12-03-18-39-48_RGB_L8 and 2018-12-03-18-40-12_RGB_L8)
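
For reference, the gap between the two captures can be computed directly from the filename prefixes; this snippet is purely illustrative and not part of CoastSeg:

from datetime import datetime

def parse_timestamp(filename):
    # The first 19 characters of names like 2018-12-03-18-39-48_RGB_L8.jpg
    # encode the capture time as year-month-day-hour-minute-second.
    return datetime.strptime(filename[:19], "%Y-%m-%d-%H-%M-%S")

t1 = parse_timestamp("2018-12-03-18-39-48_RGB_L8.jpg")
t2 = parse_timestamp("2018-12-03-18-40-12_RGB_L8.jpg")
print(t2 - t1)  # 0:00:24 -- the captures are 24 seconds apart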

dbuscombe-usgs (Member) commented:
We'd want to keep both images. That's super valuable having images on consecutive days!

Duplicates are only when images have identical times.

2320sharon (Contributor, Author) commented:
> We'd want to keep both images. That's super valuable having images on consecutive days!
>
> Duplicates are only when images have identical times.

Ah, good to know. Thanks for helping me double-check that.

When I ran the script below on 700 S2 images I downloaded, I didn't find any duplicate imagery. Sometimes the images are only a few minutes apart, but other than that I'm not finding duplicates.

import os
from collections import Counter

# Count occurrences of each filename in one ROI's RGB jpeg directory.
file_list = os.listdir('/home/sha23/development/coastseg/CoastSeg/data/ID_kyg1_datetime10-02-23__03_11_52/jpg_files/preprocessed/RGB')

counter = Counter(file_list)
duplicates = {file: count for file, count in counter.items() if count > 1}

# Print any filenames that appear more than once.
for duplicate, count in duplicates.items():
    print(f"Filename: {duplicate} - Count: {count}")

2320sharon (Contributor, Author) commented:
I ran this script across all the data I've downloaded and I didn't find any duplicates

import os
from collections import Counter

data_directory = r'C:\development\doodleverse\coastseg\CoastSeg\data'
roi_dirs = os.listdir(data_directory)
for roi_dir in roi_dirs:
    # Build the full path to each ROI's preprocessed RGB jpeg directory.
    jpeg_directory = os.path.join(data_directory, roi_dir, "jpg_files", "preprocessed", "RGB")
    if os.path.exists(jpeg_directory):
        file_list = os.listdir(jpeg_directory)
        counter = Counter(file_list)
        duplicates = {file: count for file, count in counter.items() if count > 1}

        # Print any filenames that appear more than once.
        for duplicate, count in duplicates.items():
            print(f"Filename: {duplicate} - Count: {count}")

2320sharon (Contributor, Author) commented:
@dbuscombe-usgs have you found duplicate imagery in any of the downloads you've performed?

2320sharon (Contributor, Author) commented:
I heard back from Catherine on the duplicate images issue and here is what she said:

> I'm going through the images currently. It appears that the images may actually have unique IDs, but they were extremely similar, with the exact same date, hour, and even minutes for some. There are some days with S2 that have 2-3 images for the same day at extremely similar times, hence I thought they were identical. Here is an example pair only 15 mins away from each other. I haven't seen it with Landsat yet. This happens with nearly 75% of S2 images after 2018.

(attached images: 2022-11-18-15-52-02_RGB_S2 and 2022-11-18-15-51-47_RGB_S2)

So it seems there aren't identical images being generated, just multiple images that are sometimes only seconds apart.
@dbuscombe-usgs do we want to keep these images that are minutes/seconds apart?

2320sharon (Contributor, Author) commented:
During the meeting today we addressed the confusion about "duplicate" images, which are more accurately images captured within a few minutes (or less) of each other. We determined that it would be easiest to handle this with a post-processing script that removes images captured less than a few minutes apart. It was suggested that this script live in the scripts directory.

2320sharon (Contributor, Author) commented Oct 5, 2023

Write a script that removes all images within a designated time frame of other imagery.
@dbuscombe-usgs maybe images within 5-10 minutes of each other should be removed?

dbuscombe-usgs (Member) commented Oct 5, 2023

I think the point of the script would be for the user to specify what time period they like, and it should go in SDS-tools. It wouldn't filter out images, but shorelines

dbuscombe-usgs (Member) commented:
And no, we don't want to remove any imagery. The SDS-tools script will remove duplicate shorelines: it will look at all the shorelines within X minutes (hours, days, whatever) of another one, and remove them.
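
As a starting point, here is a minimal sketch of that time-window filtering, assuming shorelines are keyed by their capture datetimes. The function name and interface are illustrative, not the actual SDS-tools API:

from datetime import datetime, timedelta

def filter_within_window(timestamps, minutes=10):
    # Sort chronologically, then keep a timestamp only if it is at
    # least `minutes` away from the last one kept; the earliest
    # capture in each cluster survives.
    window = timedelta(minutes=minutes)
    kept = []
    for ts in sorted(timestamps):
        if not kept or ts - kept[-1] >= window:
            kept.append(ts)
    return kept

# Example with the two S2 captures discussed above, 15 seconds apart:
times = [
    datetime(2022, 11, 18, 15, 52, 2),
    datetime(2022, 11, 18, 15, 51, 47),
]
print(filter_within_window(times, minutes=5))  # only the earlier capture is kept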

2320sharon transferred this issue from SatelliteShorelines/CoastSeg on Oct 10, 2023