# Satellite image list management
This notebook
* reads a file that lists all satellite images to be used for generating training images
* loads the contents into a dataframe and displays it
* runs a few sanity checks on the list
* serves as the basis for other notebooks that actually do something with the data (e.g. searching for them)

In [47]:
import pandas as pd

In [48]:
# name of the file containing the list (as json; one entry per line)
# fn = "/media/hh/hd_internal/_data_DS/DSR/satelliteImages/list_satellite_images_training.txt"
fn = "../../list_satellite_images_training.json"
img_list = pd.read_json(fn)
img_list

Unnamed: 0,AOIName,analyticImgName,comment,directory,doUse,labelFileNames
0,Harz3,20180419_074323_0c43_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz3_a.geojson
1,Harz3,20180419_074324_0c43_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz3_a.geojson
2,Harz3,20180419_074324_1_0c43_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz3_a.geojson
3,Harz3,20180419_074325_0c43_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz3_a.geojson
4,Harz3,20180419_074326_0c43_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz3_a.geojson
5,Harz1,20180504_094435_0e19_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz1_a.geojson
6,Harz1,20180724_094554_0e19_3B_AnalyticMS.tif,curated OpenStreetMap labels,Harz/,1,RoadLabels_RDT_Harz1_a.geojson
7,3093,20180427_020503_103c_3B_AnalyticMS.tif,RDT manual labels,Borneo/3093/,0,RoadLabels_RDT_3093_a.kml
8,3093,20180427_020504_103c_3B_AnalyticMS.tif,RDT manual labels,Borneo/3093/,0,RoadLabels_RDT_3093_a.kml
9,3093,20180606_020625_0f1b_3B_AnalyticMS.tif,RDT manual labels,Borneo/3093/,0,RoadLabels_RDT_3093_a.kml
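
A basic check on the loaded list could look like the sketch below. It is not part of the original notebook. The function name `select_usable` and the list `EXPECTED_COLUMNS` are made up for illustration, and it assumes that `doUse` is a 0/1 flag marking rows to include, which is only inferred from the values shown in the table.

```python
import pandas as pd

# Columns seen in the table above; treating all of them as required is an assumption.
EXPECTED_COLUMNS = ["AOIName", "analyticImgName", "directory", "doUse", "labelFileNames"]


def select_usable(img_list: pd.DataFrame) -> pd.DataFrame:
    """Check the basic shape of the list and return only rows flagged for use."""
    missing = [c for c in EXPECTED_COLUMNS if c not in img_list.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Assumption: doUse is a 0/1 flag (only these two values appear in the table).
    if not img_list["doUse"].isin([0, 1]).all():
        raise ValueError("doUse contains values other than 0 and 1")
    return img_list[img_list["doUse"] == 1]
```

Calling `select_usable(img_list)` on the table above would keep the Harz rows and drop the Borneo rows whose `doUse` is 0.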


## Some sanity checks of the list


In [49]:
# alert to duplicate image files which have identical label file names - these are likely inadvertent duplicates
assert not img_list.duplicated(["analyticImgName", "labelFileNames"]).any(), "Duplicate entries in list"

# display image files that are listed more than once (which is fine if they have different labelFileNames)
duplicate_img_ix = img_list.duplicated(["analyticImgName"])
if duplicate_img_ix.any():
    print("The following image files are listed more than once, but with different label filenames:")
    print(img_list.loc[duplicate_img_ix, "analyticImgName"])

The following image files are listed more than once, but with different label filenames:
10    20180427_020503_103c_3B_AnalyticMS.tif
11    20180427_020504_103c_3B_AnalyticMS.tif
12    20180606_020625_0f1b_3B_AnalyticMS.tif
Name: analyticImgName, dtype: object
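
By default, `duplicated` marks only the second and later occurrences, so the output above omits the first entry of each image. To see every copy together with its label file, `keep=False` can be used. This is a minimal sketch on a toy dataframe with made-up file names, not the real list.

```python
import pandas as pd

# Toy stand-in for img_list; the file names are invented for illustration.
toy = pd.DataFrame({
    "analyticImgName": ["a.tif", "b.tif", "a.tif"],
    "labelFileNames": ["labels_1.geojson", "labels_1.geojson", "labels_2.kml"],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = toy[toy.duplicated(["analyticImgName"], keep=False)]

# sort so that the copies of each image appear next to each other
grouped = all_copies.sort_values("analyticImgName", kind="stable")
print(grouped[["analyticImgName", "labelFileNames"]])
```

Applied to `img_list`, the same two lines would list each repeated image alongside all of its label files.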


## Check that all files listed are in directories and vice versa

In [50]:
# to come
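
One possible way to do this check is sketched below. It is not the author's implementation. It assumes that the `directory` column is relative to some root folder (`root_dir`, a hypothetical parameter), that image files sit directly inside that directory, and that image names end in `_AnalyticMS.tif` as in the table above. The function names are made up for illustration.

```python
from pathlib import Path

import pandas as pd


def find_missing_files(img_list: pd.DataFrame, root_dir: str) -> pd.DataFrame:
    """Return the rows of img_list whose image file is not found under root_dir."""
    root = Path(root_dir)
    is_missing = [
        not (root / d / name).is_file()
        for d, name in zip(img_list["directory"], img_list["analyticImgName"])
    ]
    return img_list.loc[is_missing]


def find_unlisted_files(img_list: pd.DataFrame, root_dir: str,
                        pattern: str = "*_AnalyticMS.tif") -> list:
    """Return image files on disk, in the listed directories, that are not in img_list."""
    root = Path(root_dir)
    listed = set(img_list["analyticImgName"])
    on_disk = set()
    for d in img_list["directory"].unique():
        # only looks directly inside each listed directory, not in subfolders
        on_disk.update(p.name for p in (root / d).glob(pattern))
    return sorted(on_disk - listed)
```

Both functions return an empty result when the list and the directories agree, so they could be wrapped in `assert` statements in the same style as the checks above. Label files are not covered by this sketch.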

In [None]:
# img_list.loc[img_list.directory.str.contains('borneo', case=False)]