This notebook scrapes satellite images for each leak repair. For each location it gets an NxM rectangle around the leak before and after it was repaired. Then it collates all the data into h5 files and all the metadata into JSON files.

It takes days to run because of rate limiting on the Google Earth Engine API. Because of limited satellite coverage, you might find matches for only 10% of the leaks.
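Since rate limiting dominates the run time, one option is to wrap each request in a retry with exponential backoff. This is only a sketch: `with_backoff`, `RateLimitError`, and the delay values are hypothetical and are not part of this notebook or of the Earth Engine API.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def with_backoff(fn, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_tries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_tries - 1:
                raise  # out of retries, let the caller see the error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the waiting can be replaced in a dry run.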

## Modifying

- make sure Google Earth Engine is set up
- load the leaks so they pass the asserts
- change the params
- run the rest of the cells

In [1]:
from path import Path
import arrow
import json
import pytz
import time
from pprint import pprint
from tqdm import tqdm_notebook as tqdm
import re, os, collections, itertools, uuid, logging
import tempfile
import tables

import zipfile
import urllib

import ee
import pyproj
import numpy as np
import scipy as sp
import pandas as pd
import geopandas as gpd
from matplotlib import pyplot as plt
import seaborn as sns
import shapely

plt.rcParams['figure.figsize'] = (15, 5) # bigger plots
plt.style.use('fivethirtyeight')
%matplotlib inline
%precision 4

'%.4f'

In [2]:
# %load_ext autoreload
# %autoreload 2

In [3]:
helper_dir = str(Path('..').abspath())
if helper_dir not in os.sys.path:
    os.sys.path.append(helper_dir)
    
from leak_helpers.earth_engine import display_ee, get_boundary, tifs2np, download_image, bands_s2, bands_l7, bands_l8, bands_NAIP
from leak_helpers.visualization import imshow_bands

# Load leaks

Load the leaks from a geojson file and make sure they have a `REPO_Date` field and a unique `leak_id` field (see asserts below)
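The asserts further down do not check that ids are unique, so here is a plain-Python sketch of the checks on a list of records (for example from `df.to_dict('records')`). `check_leak_records` and the toy rows are hypothetical, and the date format is assumed from the sample rows in this notebook.

```python
from collections import Counter
from datetime import datetime


def check_leak_records(records):
    """Return a list of problems found in leak records (empty if none)."""
    problems = []
    ids = [str(r.get('leak_id', '')) for r in records]
    dupes = sorted(i for i, n in Counter(ids).items() if n > 1)
    if dupes:
        problems.append('duplicate leak_id: ' + ', '.join(dupes))
    # the asserts in this notebook also require ids without underscores
    if any('_' in i for i in ids):
        problems.append('underscore in leak_id')
    for r in records:
        try:
            datetime.strptime(r.get('REPO_Date', ''), '%Y-%m-%dT%H:%M:%S')
        except (TypeError, ValueError):
            problems.append('unparsable REPO_Date for ' + str(r.get('leak_id')))
    return problems


# toy rows, not real data
toy = [
    {'leak_id': 'ATX-1', 'REPO_Date': '2009-08-24T19:56:00'},
    {'leak_id': 'ATX-1', 'REPO_Date': '2009-08-24T20:00:00'},
]
print(check_leak_records(toy))
```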

In [4]:
# load 

leaks1 = gpd.read_file('../../data/leak_datasets/austin_leaks/derived/austin_leaks-repairs.geojson')
leaks_datas = [leaks1]

In [5]:
# join them all, with primary columns and random metadata
primary_cols = ['leak_id','REPO_Date','geometry']
leaks = gpd.GeoDataFrame(pd.concat([leaks_data[primary_cols] for leaks_data in leaks_datas]), crs={'init': 'epsg:4326'})
leaks['metadata'] = np.concatenate([leaks_data.drop(columns=primary_cols).to_dict('records') for leaks_data in leaks_datas])
leaks.index = leaks.leak_id
leaks

leak_id (index),leak_id,REPO_Date,geometry,metadata
ATX-47486,ATX-47486,2009-08-24T19:56:00,POINT (-97.82460426524551 30.24299059253743),"{'INITDTTM': '2009-08-23T00:20:00', 'PREDIR': ..."
ATX-47487,ATX-47487,2009-08-24T20:00:00,POINT (-97.82675029997988 30.24219926503761),"{'INITDTTM': '2009-08-23T00:50:00', 'PREDIR': ..."
ATX-47488,ATX-47488,2009-08-22T06:05:00,POINT (-97.90543523261422 30.21938883975655),"{'INITDTTM': '2009-08-22T03:32:00', 'PREDIR': ..."
ATX-47489,ATX-47489,2008-12-11T15:47:00,POINT (-97.74688985035503 30.27942444961227),"{'INITDTTM': '2008-12-08T16:54:00', 'PREDIR': ..."
ATX-47490,ATX-47490,2008-12-12T03:09:00,POINT (-97.75148894169132 30.32082458189908),"{'INITDTTM': '2008-12-05T21:58:00', 'PREDIR': ..."
ATX-47491,ATX-47491,2008-12-16T16:24:00,POINT (-97.76771015540359 30.2064011266074),"{'INITDTTM': '2008-12-12T16:54:00', 'PREDIR': ..."
ATX-47492,ATX-47492,2008-12-04T05:52:00,POINT (-97.75918111620598 30.29698530319657),"{'INITDTTM': '2008-12-03T23:54:00', 'PREDIR': ..."
ATX-47493,ATX-47493,2008-12-18T11:41:00,POINT (-97.7354500990549 30.30641876306587),"{'INITDTTM': '2008-12-03T23:15:00', 'PREDIR': ..."
ATX-47494,ATX-47494,2010-03-08T00:30:00,POINT (-97.82803137965742 30.24776340772963),"{'INITDTTM': '2010-03-07T16:11:00', 'PREDIR': ..."
ATX-47495,ATX-47495,2008-12-08T03:05:00,POINT (-97.71051676533885 30.26298823681678),"{'INITDTTM': '2008-12-07T15:12:00', 'PREDIR': ..."


In [49]:
# limit leaks to the time and space where satellite data exists
print('before',len(leaks))

# limit it to after the satellite came into service
leaks = leaks[pd.to_datetime(leaks.REPO_Date)>pd.Timestamp('1 Jan 2003')]

# also limit them by location
print('middle',len(leaks))
satellite_bounds = shapely.geometry.box(
    # continental US
    minx = -124.7844079, # west long
    miny =  24.7433195, # south lat
    maxx = -66.9513812, # east long
    maxy = 49.3457868, # north lat   
)
leaks = leaks[leaks.intersects(satellite_bounds)]
print('after',len(leaks))

before 29675
middle 29675
after 29675


In [50]:
assert 'REPO_Date' in leaks.columns, 'should have REPO_Date columns with the leak repair date'
assert leaks.REPO_Date.apply(lambda x:arrow.get(x)).all(), 'should be parsable via arrow'
assert 'leak_id' in leaks.columns, 'should have unique leak_id column'
assert leaks.leak_id.apply(lambda x:'_' not in x).all(), 'should have no underscore in id'

How many matches should this notebook have?

For the Austin data we have a leak every 4 hours starting from 2013-04. Texas has full coverage in 2014 and 2016, so I should have a few matches...

Google Earth Engine has data from 2003 onwards

CA also had full coverage in 2012, 2014, and 2016, and LA goes to 2005?

Yet I'm only getting 15 within a month? Should get ~200 just for AUTX
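A rough back-of-envelope check of the ~200 figure. It assumes repairs are spread evenly in time (an assumption, not something measured here) and counts how many repairs fall inside one 28-day search window (`time_bin_delta` in the params) after a single image date. The mean gap is the value printed by the next cell.

```python
from datetime import timedelta

mean_gap = timedelta(hours=3, minutes=34, seconds=52)  # mean time between repairs
window = timedelta(days=28)                            # time_bin_delta

# repairs that fall within `window` after one image acquisition date
leaks_per_image_date = window / mean_gap
print(round(leaks_per_image_date))  # prints 188
```

So each image date over Austin should match roughly 190 repairs, as long as there is also a later image inside the after-repair window.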

In [51]:
# temp
dt=pd.to_datetime(leaks.REPO_Date)
dt=dt.sort_values()
dtd=dt.diff()[1:]
dtd2=dtd.astype(int)/(60*60)*1e-9 # to hours
dtd2.mean()
# dtd
(dt.max()-dt.min())/len(dt), dt.min()

(Timedelta('0 days 03:34:52.531086'), Timestamp('2005-01-03 01:45:00'))

## Params

Customise the values in the cell below

In [52]:
# params

# change notebook name when you want to start a new dataset

bands = bands_NAIP # list of satellite band names from earth engine
satellite = 'USDA/NAIP/DOQQ' # satellite name from earth engine
resolution_min = 1.0 # the lowest resolution on earth engine

# how many pixels high and wide your image will be (centered on leak), should be odd, e.g. 25
pixel_length = 129.0

fudge_distance_factor = -0.5

## Init

In [58]:
%%javascript
// get notebook name
var command = "notebook_name = '" + IPython.notebook.notebook_path.replace('.ipynb','') + "'";
IPython.notebook.kernel.execute(command);

<IPython.core.display.Javascript object>

In [59]:
notebook_name

'scraping_earth_engine_NAIP_all_v2'

In [60]:
# constant params, probably don't change
time_bin_delta = 60*60*24*28 # how long before a repair to look (in seconds); the search after the repair uses 6x this
crs_grid = 3857 # keep this as Web Mercator (auxiliary sphere), this is the CRS the downloaded images will be in

# init
## init directories
ts=arrow.utcnow().format('YYYYMMDD-HH-mm-ss')
temp_dir = Path('/tmp/{}'.format(notebook_name))
output_dir = Path('../../data/scraped_satellite_images/downloaded_images-{}-{}'.format(notebook_name,satellite.replace('/','-')))
cache_dir = output_dir.joinpath('cache')
output_dir.makedirs_p()
temp_dir.makedirs_p()
cache_dir.makedirs_p()

## init logger
logger = logging.getLogger(notebook_name)
# logger.setLevel(logging.WARN)

temp_dir, output_dir, cache_dir

(Path('/tmp/scraping_earth_engine_NAIP_all_v2'),
 Path('../data/downloaded_images-scraping_earth_engine_NAIP_all_v2-USDA-NAIP-DOQQ'),
 Path('../data/downloaded_images-scraping_earth_engine_NAIP_all_v2-USDA-NAIP-DOQQ/cache'))

In [61]:
# record config in a json file
metadata = dict(
    notebook_name=notebook_name,
    satellite=satellite,
    time_bin_delta=time_bin_delta,
    pixel_length=pixel_length,
    resolution_min=resolution_min,
    bands=bands,
    ts=ts,
    crs_grid=crs_grid,
    cache_dir=str(cache_dir),
    temp_dir=str(temp_dir),
    output_dir=str(output_dir),
)
metadata_file = output_dir.joinpath('script_metadata.json')
with open(metadata_file, 'w') as fo:
    json.dump(metadata, fo)

# earth engine

## Steps:
- first apply for an account and wait ~1 day
- Setup instructions [here](https://developers.google.com/earth-engine/python_install#setting-up-authentication-credentials)

## Refs/examples:
- api https://developers.google.com/earth-engine/
- code examples https://code.earthengine.google.com/
- sentinel1 https://developers.google.com/earth-engine/sentinel1
    - `ee.ImageCollection('satellite');`
    - `ee.ImageCollection('COPERNICUS/S1_GRD');`
- keras and google earth https://github.com/patrick-dd/landsat-landstats

In [62]:
# test earth-engine setup
from oauth2client import crypt # should import without error
import ee
ee.Initialize() # should give no errors; if it does, follow the setup instructions above


# test
image = ee.Image('srtm90_v4')
assert image.getInfo()=={'type': 'Image', 'properties': {'system:time_start': 950227200000, 'system:asset_size': 18827626666, 'system:time_end': 951177600000}, 'bands': [{'data_type': {'type': 'PixelType', 'max': 32767, 'min': -32768, 'precision': 'int'}, 'crs': 'EPSG:4326', 'id': 'elevation', 'dimensions': [432000, 144000], 'crs_transform': [0.000833333333333, 0.0, -180.0, 0.0, -0.000833333333333, 60.0]}], 'id': 'srtm90_v4', 'version': 1463778555689000}
print('ok')

ok


In [63]:
# test satellite
image_collection = ee.ImageCollection(satellite)
image = ee.Image(image_collection.first())
info = image.getInfo()
info

{'bands': [{'crs': 'EPSG:26914',
   'crs_transform': [2.0000,
    0.0000,
    631040.0000,
    0.0000,
    -2.0000,
    2876720.0000],
   'data_type': {'max': 255,
    'min': 0,
    'precision': 'int',
    'type': 'PixelType'},
   'dimensions': [3480, 3805],
   'id': 'N'},
  {'crs': 'EPSG:26914',
   'crs_transform': [2.0000,
    0.0000,
    631040.0000,
    0.0000,
    -2.0000,
    2876720.0000],
   'data_type': {'max': 255,
    'min': 0,
    'precision': 'int',
    'type': 'PixelType'},
   'dimensions': [3480, 3805],
   'id': 'R'},
  {'crs': 'EPSG:26914',
   'crs_transform': [2.0000,
    0.0000,
    631040.0000,
    0.0000,
    -2.0000,
    2876720.0000],
   'data_type': {'max': 255,
    'min': 0,
    'precision': 'int',
    'type': 'PixelType'},
   'dimensions': [3480, 3805],
   'id': 'G'}],
 'id': 'USDA/NAIP/DOQQ/c_2509703_ne_14_2_20060519',
 'properties': {'system:asset_size': 34233523,
  'system:footprint': {'coordinates': [[-97.6914, 26.0028],
    [-97.6914, 25.9347],
    [-97.62

In [64]:
info_bands = [i['id'] for i in info['bands']]
assert np.array([band in bands for band in info_bands]).all(), 'bands should contain the name of each downloaded band'

# Fetching images

For each point
- find the nearest image before the repair
- and the soonest image after repair
- save a part of each with metadata

Later we can filter, interpolate, read into numpy arrays, and save to an hdf file
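The selection logic above can be sketched without Earth Engine, on plain timestamps in seconds. `pick_before_after` and the toy values are hypothetical. The half-open windows mirror how `filterDate` is understood to behave (start inclusive, end exclusive), and the 28-day and 168-day windows follow `time_bin_delta` and `time_bin_delta*6` from the scraping cell.

```python
DAY = 24 * 60 * 60


def pick_before_after(image_times, repair_ts, before_window, after_window):
    """Return (latest image before the repair, earliest image after it).

    Returns None when either window contains no image, which is the case
    the scraping cell records as "no results".
    """
    before = [t for t in image_times if repair_ts - before_window <= t < repair_ts]
    after = [t for t in image_times if repair_ts <= t < repair_ts + after_window]
    if not before or not after:
        return None
    return max(before), min(after)


# toy acquisition times, in seconds
images = [0, 10 * DAY, 40 * DAY, 300 * DAY]
print(pick_before_after(images, 20 * DAY, 28 * DAY, 168 * DAY))
print(pick_before_after(images, 250 * DAY, 28 * DAY, 168 * DAY))
```

The first call finds a pair; the second has no image in the 28 days before the repair, so it returns `None`.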

In [65]:
import dataset
cache_file = 'sqlite:///{}'.format(cache_dir.dirname().joinpath('cache.db'))
db = dataset.connect(cache_file)
cache_table = db.get_table('cached_ids', primary_id='leak_id', primary_type='String')

def get_cached_ids():
    return set(row['leak_id'] for row in cache_table.distinct('leak_id'))

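def init_cache(leak_id):
    """Record leak_id in the cache table so it is skipped on later runs, then return all cached ids."""
    if leak_id:
        try:
            cache_table.insert(dict(leak_id=leak_id))
        except Exception:
            # e.g. the id is already recorded; undo the failed insert
            db.rollback()
        else:
            db.commit()
    return get_cached_ids()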
# init_cache(1)
cache_file

'sqlite:///../data/downloaded_images-scraping_earth_engine_NAIP_all_v2-USDA-NAIP-DOQQ/cache.db'

In [66]:
# def get_cached_ids():
#     cache_dirs = [str(f.relpath(cache_dir)).split('_')[0] for f in cache_dir.listdir()]
#     return cache_dirs

# def init_cache(leak_id):
#     """We will cache downloads in folders like 'id_after'"""
#     if leak_id:
#         cache_subdir = cache_dir.joinpath(leak_id+'_after')
#         cache_subdir.makedirs_p()
#         cache_subdir = cache_dir.joinpath(leak_id+'_before')
#         cache_subdir.makedirs_p()
#     return get_cached_ids()

### Test the distance need to get your rectangle

Here we need to tweak `fudge_distance_factor` so that we get the image size of our choice. Start with zero and try -1, -0.5, -0.25, 0, 0.25, 0.5, 0.75. This is to deal with rounding, projecting between CRSs, etc. Don't worry, the asserts below will let you know when it's right.
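To see what each candidate value does to the requested rectangle, here is the same formula as the next cell evaluated for every suggested fudge factor. `half_width` is a hypothetical helper, and it assumes `resolution_min` is in metres per pixel.

```python
def half_width(resolution, pixel_length, fudge):
    """Half the side of the square to request around the leak."""
    return resolution * (pixel_length / 2.0 + fudge)


for fudge in (-1, -0.5, -0.25, 0, 0.25, 0.5, 0.75):
    d = half_width(1.0, 129.0, fudge)
    # nominal side in pixels; the downloaded size also depends on how the
    # rectangle snaps to the pixel grid, hence the asserts
    print(fudge, d, 2 * d / 1.0)
```

With the params used in this notebook (resolution 1.0, 129 pixels, fudge -0.5) the half width is 64.0.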

In [67]:
distance = resolution_min*(pixel_length/2.0+fudge_distance_factor)

In [142]:
# test with one image

for i in np.random.choice(leaks.index,5):
    leak=leaks.iloc[[i]]
    leak_id = str(leak['leak_id'].values[0])

    repo_date_ts = arrow.get(leak.REPO_Date.values[0]).timestamp
    boundary = get_boundary(leak, distance=distance)
    sentinel2_before = ee.ImageCollection(satellite)\
        .filterBounds(boundary)\
        .filterDate(933828614605,1488776737937)
    image = ee.Image(sentinel2_before.first()).clip(boundary)
    info = image.getInfo()
    name=leak_id+'_after'
    path,files=download_image(
        image, 
        scale=resolution_min, 
        crs=crs_grid, 
        name=name,
        cache_dir=Path('/tmp')
    )
    data = tifs2np(path,files,bands=bands)
    print(i,[(d.shape,d.sum()) for d in data])
    for d in data:
        assert d.shape[0]==pixel_length, 'the downloaded image is the wrong size, tweak distance'
        assert d.shape[1]==pixel_length
    assert np.sum(data)!=0, 'should not be empty (make sure you are using the right bands)'

18458 [((129, 129), 1944079.0), ((129, 129), 2050014.0), ((129, 129), 1857304.0), ((129, 129), 2168330.0)]
7075 [((129, 129), 2228830.0), ((129, 129), 2234832.0), ((129, 129), 2081836.0), ((129, 129), 1848594.0)]
13565 [((129, 129), 2025938.0), ((129, 129), 2132676.0), ((129, 129), 1961430.0), ((129, 129), 2327725.0)]
14965 [((129, 129), 0.0), ((129, 129), 0.0), ((129, 129), 0.0), ((129, 129), 0.0)]


AssertionError: should not be empty (make sure you are using the right bands)

In [129]:
cached_ids = get_cached_ids()
import time, traceback


def get_image_for_leak(i, cached_ids=cached_ids):    
    leak = leaks.loc[[i]]
    repo_date_ts = arrow.get(leak.REPO_Date.values[0]).timestamp
    
    # crappy way of recording that we tried this one
    leak_id = leak.leak_id.values[0]
    if leak_id in cached_ids:
        logger.info('Skipping cached download for leak id %s ',leak_id)
        return
    
    boundary = get_boundary(leak, distance=distance)
    boundary_small = get_boundary(leak, distance=distance/10)
    
    # get the latest image in the window before the repair
    sentinel2_before = ee.ImageCollection(satellite)\
        .filterBounds(boundary_small)\
        .filterDate((repo_date_ts-time_bin_delta)*1000,(repo_date_ts)*1000)\
        .sort('system:time_start', opt_ascending=False) # first will be latest
    
    results = sentinel2_before.size().getInfo()
    if results<1:
        logger.info('Error no results for day before %s',leak_id)
        cached_ids = init_cache(leak_id) # so we know there were no results
        return
        
    # get the earliest image in the window after the repair
    sentinel2_after = ee.ImageCollection(satellite)\
        .filterBounds(boundary_small)\
        .filterDate((repo_date_ts)*1000,(repo_date_ts+time_bin_delta*6)*1000)\
        .sort('system:time_start', opt_ascending=True) # first will be earliest
        
    results = sentinel2_after.size().getInfo()
    if results<1:
        logger.info('Error no results for day after, id %s',leak_id)
        cached_ids = init_cache(leak_id) # so we know there were no results
        return
        
    # download and save images
    logger.info('results for %s', leak_id)
    image = ee.Image(sentinel2_before.first()).clip(boundary)
    name=leak_id+'_before'
    path,files=download_image(
        image, 
        scale=resolution_min, 
        crs=crs_grid, 
        name=name,
        cache_dir=cache_dir
    )
    # also save metadata so we can filter by date
    with open(path.joinpath('metadata.json'), 'w') as fo:
        metadata = dict(
            image=image.getInfo(),
            scale=resolution_min,
            crs=crs_grid,
            name=name,
            distance=distance,
            leak=json.loads(leak.to_json())
        )
        json.dump(metadata, fo)

    image = ee.Image(sentinel2_after.first()).clip(boundary)
    name=leak_id+'_after'
    path,files=download_image(
        image, 
        scale=resolution_min, 
        crs=crs_grid, 
        name=name,
        cache_dir=cache_dir
    )
    with open(path.joinpath('metadata.json'), 'w') as fo:
        metadata = dict(
            image=image.getInfo(),
            scale=resolution_min,
            crs=crs_grid,
            name=name,
            distance=distance,
            leak=json.loads(leak.to_json())
        )
        json.dump(metadata, fo)
    cached_ids = init_cache(leak_id) # record that this leak has been downloaded
        
    return

# could take 27 hours
leak_to_scrape = set(leaks.leak_id).difference(set(cached_ids))
for i in tqdm(leak_to_scrape):
    try:
        get_image_for_leak(i)
    except urllib.error.HTTPError as e:
        print(i,e) # "HTTP Error 429: unknown"
        traceback.print_exc()
        if e.code == 429:
            print('sleep for 13s')
            time.sleep(13)
    except ee.ee_exception.EEException as e:
        print(i,e) # "Earth Engine memory capacity exceeded."
        traceback.print_exc()
        ee.Initialize()
    except zipfile.BadZipFile as e:
        print(i,e) # "File is not a zip file"
        traceback.print_exc()
    except Exception as e:
        print(i,e)
        # e.g. "An internal server error has occurred (216bc442fe171620592bc53fb578bceb)."
        traceback.print_exc()




# load tiffs to arrays

When there are errors, e.g. no metadata.json in the directory, delete the directory and go back to the scraping step
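A small sketch for finding the directories that need deleting. It uses the standard-library `pathlib` rather than the `path` package used elsewhere in this notebook, and `incomplete_dirs` is a hypothetical helper. It only lists candidates; removing them (and their rows in the cache table) is left as a manual step.

```python
from pathlib import Path


def incomplete_dirs(cache_dir):
    """List subdirectories that hold .tif files but no metadata.json."""
    bad = []
    for d in sorted(Path(cache_dir).iterdir()):
        if not d.is_dir():
            continue
        has_tif = any(d.glob('*.tif'))
        if has_tif and not (d / 'metadata.json').is_file():
            bad.append(d)
    return bad

# usage (path is an assumption):
# for d in incomplete_dirs('../../data/scraped_satellite_images/.../cache'):
#     print(d)
```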


In [130]:
# This loads it as X and y for machine learning, and also time and metadata so we can filter
import shapely
from IPython.display import display
X = []
y = []
t = []
m = []
discarded=[]
cdirs = [cdir for cdir in cache_dir.listdir() if ('_after_' in cdir) or ('_before_' in cdir)]
for path in tqdm(cdirs):
    files = [file.relpath(path) for file in path.listdir() if file.endswith('.tif')]
    if files:
        # check metadata
        try:
            metadata = json.load(open(path.joinpath('metadata.json')))
        except (FileNotFoundError, ValueError) as e:
#             path.move(path.dirname().dirname().joinpath('.deleteme-'+str(uuid.uuid4())))
#             if '_after_' in path: # also delete the before path                
#                 path_after = Path(path.replace('_after_','_before_'))
#                 if path_after.isdir():
#                     path_after.move(path.dirname().dirname().joinpath('.deleteme-'+str(uuid.uuid4())))
            logger.error('Invalid or missing metadata.json in %s, please delete the folder and rerun the scraping cell to rescrape this image', path)
            continue
        
        # label: True for a "before" image, False for an "after" image
        yy = '_before_' in path.basename()
        
        # work out time gap too
        t1 = arrow.get(metadata['image']['properties']['system:time_end']/1000)
        t0 = arrow.get(metadata['leak']['features'][0]['properties']['REPO_Date'])
        td=t1-t0
        tt = td.total_seconds()
        
        # load data
        data = tifs2np(path,files,bands=bands, pixel_length=pixel_length)
             
        # flag bands that are entirely zero
        empty_bands = np.array([d.sum() for d in data])==0
        
        # lets check we didn't get the edge of an image
        bbox = np.array(metadata['image']['properties']['system:footprint']['coordinates'][0])
        loc = metadata['leak']['features'][0]['geometry']['coordinates']
        minx=bbox[:,0].min()
        maxx=bbox[:,0].max()
        miny=bbox[:,1].min()
        maxy=bbox[:,1].max()
        bbox_shp = shapely.geometry.box(
            minx=minx,
            maxx=maxx,
            miny=miny,
            maxy=maxy
        )
        loc_shp = shapely.geometry.Point(loc[0],loc[1])
        
        try:
            assert loc_shp.intersects(bbox_shp), 'leak location should be inside image'
            assert bbox_shp.centroid.almost_equals(loc_shp, decimal=5), 'leak should be near center of image'
            assert (np.array([d.shape for d in data])==pixel_length).all(), 'image area should be the right amount of pixels'
            assert (maxx-minx)/(maxy-miny)<1.3, 'should be roughly square'
            assert (maxx-minx)/(maxy-miny)>0.7, 'should be roughly square'
            assert not empty_bands.all(), 'should not have all bands empty'
        except Exception as exc:
            print(path, exc)
            display(shapely.geometry.GeometryCollection([bbox_shp, loc_shp]))
            discarded.append(path)
        else:
            X.append(data)
            y.append(yy)
            t.append(tt)
            m.append(metadata)
        

len(X), len(discarded)




(34, 0)

In [131]:
# for p in discarded[1:]:
#     leak_id = p.basename().split('_')[0]
#     cache_table.delete(leak_id=leak_id)
#     p.rmtree()

True
False


In [132]:
np.array(t)/(60*60*24)

array([ 46.8236,  54.9833,  -7.0167,  50.1812, -11.8187,  61.3556,
        -0.6444,  52.1514,  -9.8486,  51.9167, -10.0833,   3.1743,
        -5.8257,  50.15  , -11.85  ,  61.1201,  38.9896, -23.0104,
        49.9799, -12.0201, -15.1764,  50.2917, -11.7083,  50.2917,
       -11.7083,  50.1382, -11.8618,  43.9375, -18.0625, -22.0701,
        39.0312, -22.9688,  35.4806,  -3.    ])

In [133]:
# shuffle
from sklearn.utils import shuffle
X,y,m,t = shuffle(X,y,m,t,random_state=1337)

In [134]:
# save using hdf5 (so keras can easily load it) and json 
import h5py
h5file = output_dir.joinpath('data.h5')
with h5py.File(h5file, 'w') as h5f:
    h5f.create_dataset('X', data=X)
    h5f.create_dataset('y', data=y)
    h5f.create_dataset('t', data=t)

json.dump(m,open(output_dir.joinpath('data_metadata.json'),'w'))

with open(output_dir.joinpath('readme.md'),'w') as fo:
    fo.write("""
Files:
- cache - cached tiff files
- script_metadata.json - information on scraping script
- data.h5 contains X, y, and t.
    - X: tiff files for each band loaded into an array of shape (Leak, Bands, width, length)
    - y: True for an image taken before the repair, False for after
    - t: image time minus repair time in seconds (negative means the image was taken before the repair)
- data_metadata.json: array of metadata for each leak in X. Each contains info on leak, image, and image search
    
Loading: 
```py
# load
import json
import h5py
metadatas = json.load(open('data_metadata.json'))
with h5py.File('data.h5','r') as h5f:
    X2 = h5f['X'][:]
    y2 = h5f['y'][:]
    t2 = h5f['t'][:]
```
    """)

In [79]:
# test load
metadatas = json.load(open(output_dir.joinpath('data_metadata.json')))
with h5py.File(output_dir.joinpath('data.h5'),'r') as h5f:
    X2 = h5f['X'][:]
    y2 = h5f['y'][:]
    t2 = h5f['t'][:]
X2.shape, y2, t2, metadatas[0].keys()

((6, 4, 129, 129),
 array([ True, False,  True, False, False,  True], dtype=bool),
 array([-1311240.,  4045560.,  -606240.,  4750560.,  4345200., -1011600.]),
 dict_keys(['scale', 'leak', 'crs', 'name', 'distance', 'image']))
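Since `t` is stored alongside `X` and `y`, samples can be filtered by how far the image is from the repair. A minimal sketch with NumPy, assuming the array layout printed above; `select_by_gap` and the toy arrays are hypothetical.

```python
import numpy as np


def select_by_gap(X, y, t, max_days):
    """Keep samples whose image is within max_days of the repair."""
    t = np.asarray(t, dtype=float)
    keep = np.abs(t) <= max_days * 24 * 60 * 60
    return np.asarray(X)[keep], np.asarray(y)[keep], t[keep]


# toy arrays in the same layout as data.h5
X_toy = np.zeros((3, 4, 5, 5))
y_toy = np.array([True, False, True])
t_toy = np.array([-86400.0, 4045560.0, -606240.0])  # seconds

Xs, ys, ts = select_by_gap(X_toy, y_toy, t_toy, max_days=10)
print(Xs.shape, ys, ts)
```

Here the middle sample is about 47 days after the repair, so a 10-day limit drops it.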

In [80]:
output_dir

Path('../data/downloaded_images-scraping_earth_engine_NAIP_all_v2-USDA-NAIP-DOQQ')

# NOTES

In [168]:
leak

index,leak_id,REPO_Date,geometry,metadata
8421,ATX-55907,2009-10-13T01:36:00,POINT (-97.68137075836523 30.29424927604158),"{'LOC': 'E 51ST ST', 'ZIP': '78723- ', 'STN..."


In [174]:
# check what times have image in AUTX

i=0
leak=leaks2.iloc[[i]]
leak_id = str(leak['leak_id'].values[0])
print(leak_id)

repo_date_ts = arrow.get(leak.REPO_Date.values[0]).timestamp
boundary = get_boundary(leak, distance=distance)
sentinel2_before = ee.ImageCollection(satellite)\
    .filterBounds(boundary)
info2=sentinel2_before.getInfo()


ATX-47486


In [177]:
info2

dict_keys(['version', 'features', 'bands', 'properties', 'type', 'id'])

In [37]:
# check what times have image in AUTX

i=0
leak=leaks1.iloc[[i]]
leak_id = str(leak['leak_id'].values[0])
print(leak_id)

repo_date_ts = arrow.get(leak.REPO_Date.values[0]).timestamp
boundary = get_boundary(leak, distance=distance)
collection = ee.ImageCollection(satellite)\
    .filterBounds(boundary)
info3=collection.getInfo()


39928


In [38]:
size = collection.size().getInfo()
collection_list=collection.toList(size)
info=collection_list.getInfo()

In [39]:


df_info = pd.concat([
    pd.DataFrame([i for i in info]),
    pd.DataFrame([i['properties'] for i in info]),
    pd.DataFrame([i['properties']['system:footprint'] for i in info]),
], axis=1).drop(columns=['properties','system:footprint'])
df_info['system:time_start']=pd.to_datetime(df_info['system:time_start'],unit='ms')
df_info['system:time_end']=pd.to_datetime(df_info['system:time_end'],unit='ms')
df_info

index,bands,id,type,version,system:asset_size,system:index,system:time_end,system:time_start,coordinates,type.1
0,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/m_3411845_se_11_1_20090626,Image,1405650094950000,214296777,m_3411845_se_11_1_20090626,2009-06-27,2009-06-26,"[[-118.4407531163593, 34.31612598038248], [-11...",LinearRing
1,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/m_3411845_se_11_1_20100504,Image,1405650977308000,219715496,m_3411845_se_11_1_20100504,2010-05-05,2010-05-04,"[[-118.44075265536497, 34.31599817610279], [-1...",LinearRing
2,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/m_3411845_se_11_1_20120428,Image,1397774489835000,235582213,m_3411845_se_11_1_20120428,2012-04-29,2012-04-28,"[[-118.44085464097876, 34.316121533301896], [-...",LinearRing
3,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/m_3411845_se_11_1_20140515,Image,1456640865281000,238401022,m_3411845_se_11_1_20140515,2014-05-16,2014-05-15,"[[-118.44069883308718, 34.31596199093854], [-1...",LinearRing
4,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/n_2909201_ne_15_2_20030825,Image,1445818646739000,32460465,n_2909201_ne_15_2_20030825,2003-08-26,2003-08-25,"[[-92.99999999999997, -89.99999999999997], [-9...",LinearRing
5,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/n_2909201_se_15_2_20030825,Image,1445834220417000,29341708,n_2909201_se_15_2_20030825,2003-08-26,2003-08-25,"[[-92.99999999999997, -89.99999999999997], [-9...",LinearRing
6,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/n_2909202_ne_15_2_20030823,Image,1445834490113000,25777403,n_2909202_ne_15_2_20030823,2003-08-24,2003-08-23,"[[-92.99999999999997, -89.99999999999997], [-9...",LinearRing
7,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/n_2909202_nw_15_2_20030825,Image,1445824891205000,28084081,n_2909202_nw_15_2_20030825,2003-08-26,2003-08-25,"[[-92.99999999999997, -89.99999999999997], [-9...",LinearRing
8,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/n_2909202_se_15_2_20030823,Image,1445825037012000,22661968,n_2909202_se_15_2_20030823,2003-08-24,2003-08-23,"[[-92.99999999999997, -89.99999999999997], [-9...",LinearRing
9,"[{'id': 'R', 'data_type': {'min': 0, 'precisio...",USDA/NAIP/DOQQ/n_2909202_sw_15_2_20030825,Image,1445825389222000,28568063,n_2909202_sw_15_2_20030825,2003-08-26,2003-08-25,"[[-92.99999999999997, -89.99999999999997], [-9...",LinearRing


In [40]:
def difft(x):
    a=(x-df_info['system:time_start'])
    b=a[a>'0 days']
    return b.min()
dft=pd.to_datetime(leaks1.REPO_Date).apply(difft)
dft[dft<'30 days']


393     1 days
394     1 days
395     1 days
396     2 days
397     3 days
398     5 days
399     6 days
400     6 days
401     6 days
402     7 days
403     7 days
404     7 days
405     8 days
406     8 days
407     8 days
408     8 days
409     8 days
410     9 days
411    10 days
412    11 days
413    12 days
414    12 days
415    12 days
416    13 days
417    14 days
418    14 days
419    14 days
420    15 days
421    15 days
422    16 days
         ...  
4955   18 days
4956   18 days
4957   18 days
4958   19 days
4959   19 days
4960   19 days
4961   19 days
4962   19 days
4963   19 days
4964   20 days
4965   20 days
4966   20 days
4967   21 days
4968   21 days
4969   22 days
4970   23 days
4971   24 days
4972   24 days
4973   24 days
4974   25 days
4975   25 days
4976   26 days
4977   26 days
4978   26 days
4979   27 days
4980   27 days
4981   28 days
4982   28 days
4983   28 days
4984   29 days
Name: REPO_Date, dtype: timedelta64[ns]

In [43]:
len(dft[dft<'30 days'])

164

In [44]:
dft[dft<'30 days'].sort_values()

393     1 days
394     1 days
395     1 days
2752    1 days
2755    2 days
2754    2 days
2753    2 days
4927    2 days
4926    2 days
396     2 days
2756    3 days
4928    3 days
2758    3 days
2759    3 days
397     3 days
2757    3 days
4929    4 days
4930    4 days
4931    4 days
4932    4 days
2760    4 days
2762    4 days
2761    4 days
2763    5 days
398     5 days
4933    5 days
2764    5 days
2765    5 days
399     6 days
2766    6 days
         ...  
2795   24 days
4975   25 days
4974   25 days
2796   25 days
2797   25 days
436    25 days
4978   26 days
4977   26 days
4976   26 days
2800   26 days
2799   26 days
2798   26 days
2801   27 days
4980   27 days
4979   27 days
2804   27 days
2803   27 days
2802   27 days
4982   28 days
4981   28 days
437    28 days
2806   28 days
4983   28 days
2805   28 days
442    29 days
441    29 days
440    29 days
439    29 days
438    29 days
4984   29 days
Name: REPO_Date, dtype: timedelta64[ns]