In [1]:
import matplotlib.pyplot as plt
import numpy as np
import dask_geopandas as dg

## Purpose of Notebook:
This notebook shows how to download only part of a dataset, how to download a different dataset from source.coop (using S3), and offers advice for users who want the same dataset for a different location while still downloading only part of it.

To download only some of the geoparquet files in a dataset, use the "..._specified()" functions in utils.py.

If you are able to, or prefer to, download all of a dataset's geoparquet files, feel free to follow the approach in 'worcester_analysis.ipynb'. If you are not familiar with source.coop datasets, note that this notebook also details how to alter the code to download a different dataset from the website.

### ATTENTION:
You must follow the instructions below and replace the AWS access key and secret access key with your own. The analysis will not work without them.

##### Utilizing data from: https://source.coop/repositories/wherobots/usa-structures/description

All data on Source Cooperative is hosted in AWS S3 buckets. To access it, you need credentials that you can generate on the Source Cooperative website. After logging in, click on your name in the top right corner, and then click on your username. Then navigate to the "Manage" page on the left side. At the bottom of this page you will find a section called "API Keys". If no key has been generated before, generate a new one, then copy the access key and secret access key values and paste them into the following cell.

source.coop website: https://source.coop/

###### Source: https://github.com/github.com/HamedAlemo/vector-data-tutorial/scalable_vector_analysis.ipynb

In [2]:
##################################
#   Read Above 'ATTENTION' Note  #
##################################

AWS_ACCESS_KEY_ID = "<YOUR ACCESS KEY>"
AWS_SECRET_ACCESS_KEY = "<YOUR SECRET ACCESS KEY>"
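
Pasting keys directly into a notebook makes it easy to share them by accident. As an alternative, here is a minimal sketch that reads them from environment variables instead. The helper name `load_source_coop_credentials` and the variable names `SOURCE_COOP_ACCESS_KEY_ID` and `SOURCE_COOP_SECRET_ACCESS_KEY` are hypothetical; use whatever names you exported in your own shell or container.

```python
import os


def load_source_coop_credentials(env=os.environ):
    """Return (access_key, secret_key) read from environment variables.

    The variable names are an assumption made for this example only.
    """
    access_key = env.get("SOURCE_COOP_ACCESS_KEY_ID")
    secret_key = env.get("SOURCE_COOP_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        raise RuntimeError(
            "Set SOURCE_COOP_ACCESS_KEY_ID and SOURCE_COOP_SECRET_ACCESS_KEY "
            "before running the notebook."
        )
    return access_key, secret_key


# Usage once both variables are set:
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY = load_source_coop_credentials()
```

The returned values can then be passed to `boto3.client` exactly as in the next cell.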

In [3]:
import boto3
s3_client = boto3.client('s3',
                         aws_access_key_id = AWS_ACCESS_KEY_ID, 
                         aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
                         endpoint_url='https://data.source.coop'
                        )

In [4]:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)

http://127.0.0.1:8787/status


Local path for downloading the data

In [5]:
# If you are running this notebook from within the saved/mounted folder and do NOT wish to keep the raw data,
# delete the first local_path and uncomment (remove the #) the second one.

local_path = "./data/" # saves data to your machine
# local_path ="/data/" # data is deleted when the container is removed

In [6]:
import utils as ut

##### This approach requires a list of parquet links as input. If you only want to download 1 parquet file, the following example shows how to slice a list down to a single file; you can apply the same slicing later on:

In [7]:
example_list_of_parquets = ['parquet1.parquet',
                    'parquet2.parquet',
                    'parquet3.parquet',
                    'parquet4.parquet',
                    'parquet5.parquet']

print(example_list_of_parquets[0:1]) # returns a one-item list holding the first element (index 0). Indices start at 0 in Python.

print(example_list_of_parquets[2:3]) # returns a one-item list holding the third element (index 2)

print('\n\nOnce you get to the section for manually inputting the parquet links you want, you can do the same index operations to achieve just 1 parquet.')

['parquet1.parquet']
['parquet3.parquet']


Once you get to the section for manually inputting the parquet links you want, you can do the same index operations to achieve just 1 parquet.
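
The slice form `[0:1]` matters because it keeps the result as a list, while plain indexing `[0]` returns a bare string. A small sketch of the difference, using made-up file names:

```python
parquets = ["parquet1.parquet", "parquet2.parquet", "parquet3.parquet"]

single_item = parquets[0]    # indexing returns the element itself (a str)
single_list = parquets[0:1]  # slicing returns a list holding that element

print(type(single_item).__name__)  # str
print(type(single_list).__name__)  # list

# Looping over a bare string walks through its characters, not file names,
# so code that expects a list of links would misbehave.
print(len(list(single_item)))  # 16 (one entry per character)
print(len(single_list))        # 1 (one entry per file)
```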


### Walkthrough

I will walk you through an example of using the "..._specified()" code, which automatically grabs the parquet files with the unique number identifiers 00000 through 00004. NOTE: the 'endfix' variable matches the 12/6/2024 update of the dataset. It is very likely to be different by the time you run this package.

As of the 12/6/2024 download, you can go to the following page to see all of the parquet files:

https://source.coop/wherobots/usa-structures

The size of each .parquet file is shown to its right.
Click on one of the .parquet file links; it should bring you to a page that details the file, for instance:

#### File Name

part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet

#### File Size

2,058,402,104 Bytes (1.9 GiB)

#### Content Type

application/octet-stream

#### Last Modified

Fri Dec 06 2024 15:29:54 GMT-0500 (Eastern Standard Time)

#### Data URL

https://data.source.coop/wherobots/usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet

#### S3 URI

s3://wherobots/usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet

# We are using an S3 client approach.
### Therefore, focus on the S3 URI; this is what we use to download the data

The bucket_name is 'wherobots'

The prefix is everything after the bucket_name AND its slash (/), up to (but not including) the unique number identifier, 00000.

The endfix is everything after the unique number identifier, including the leading hyphen.
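
To make the split concrete, here is a minimal sketch that pulls the pieces out of an S3 URI with a regular expression. The helper name `split_s3_uri` is hypothetical, and the pattern assumes the layout shown above: a five-digit identifier that directly follows `part-`.

```python
import re


def split_s3_uri(uri):
    """Split s3://<bucket>/<prefix><5 digits><endfix> into its four pieces."""
    match = re.fullmatch(r"s3://([^/]+)/(.*?part-)(\d{5})(-.*)", uri)
    if match is None:
        raise ValueError(f"Unexpected URI layout: {uri}")
    bucket_name, prefix, part_number, endfix = match.groups()
    return bucket_name, prefix, part_number, endfix


uri = ("s3://wherobots/usa-structures/geoparquet/"
       "part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet")
bucket_name, prefix, part_number, endfix = split_s3_uri(uri)
print(bucket_name)  # wherobots
print(prefix)       # usa-structures/geoparquet/part-
print(part_number)  # 00000
print(endfix)       # -170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet
```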

#### If the file link has changed since this update, copy and paste the new values into the code cell below. It is very likely the bucket_name and prefix will be the same.
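
Instead of copying the endfix by hand, it may be possible to ask the bucket which keys currently exist. The sketch below uses the boto3 client method `list_objects_v2`; it assumes the source.coop endpoint allows listing with your credentials, which has not been verified here. The helper name `discover_part_keys` is hypothetical, and a single call returns at most 1,000 keys.

```python
import re


def discover_part_keys(s3_client, bucket_name, prefix):
    """Return {part_number: key} for objects whose key starts with prefix.

    Assumes listing is permitted on the endpoint and that one page of
    results (up to 1,000 keys) is enough.
    """
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    keys_by_part = {}
    for obj in response.get("Contents", []):
        key = obj["Key"]
        # Keep only keys shaped like <prefix><5 digits>-<anything>
        match = re.match(re.escape(prefix) + r"(\d{5})-", key)
        if match:
            keys_by_part[int(match.group(1))] = key
    return keys_by_part


# Usage with the client created earlier in this notebook:
# keys = discover_part_keys(s3_client, "wherobots", "usa-structures/geoparquet/part-")
# links = [keys[i] for i in range(0, 5)]
```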

#### In the code cell below, if you wish to automatically grab a different number of files, change the 'range(0, 5)' portion. The files are generated in numeric order:

00000, 00001, 00002, 00003, 00004

#### You can start at a number other than 0 and end wherever you like, but remember that the end number is NOT inclusive. To go from 0 through 4, you must use 5 as the end number for the range function.

In [8]:
bucket_name = 'wherobots'  # the bucket name appears to be the account that posted the dataset
prefix = "usa-structures/geoparquet/part-"

endfix = "-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet"

links = [f"{prefix}{str(i).zfill(5)}{endfix}" for i in range(0, 5)] # originally 0-200; the dataset was updated mid-project and now goes to 00009
links

['usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00001-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00002-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00003-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00004-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet']
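
If the files you want are not consecutive, `range` is not the only option. A minimal sketch, assuming the same prefix and endfix as above; the helper name `build_links` is hypothetical and accepts any iterable of part numbers:

```python
def build_links(prefix, endfix, part_numbers):
    """Build one key per part number, zero-padded to five digits."""
    return [f"{prefix}{part:05d}{endfix}" for part in part_numbers]


prefix = "usa-structures/geoparquet/part-"
endfix = "-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet"

# Pick parts 00000, 00003 and 00007 only
selected = build_links(prefix, endfix, [0, 3, 7])
for link in selected:
    print(link)
```

Passing `range(0, 5)` as `part_numbers` reproduces the list from the cell above.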

As in the slicing example at the beginning, you can keep only 1 of the links:

In [9]:
one_link = links[0:1]
one_link

['usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet']

Or use the entire list:

In [10]:
link_list = links
link_list

['usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00001-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00002-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00003-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet',
 'usa-structures/geoparquet/part-00004-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet']

Finally, we create the dask dataframe from your desired links, after which you can start your analysis. Use the code from 'worcester_analysis' as an example, but remember the warning in the README.md file. If you are targeting a specific location, you will likely need to test several parquet files before you find the one that contains it, and 2 or more parquet files may cover your desired area. Proceed with caution, as this could be the start of a LONG testing process. Good luck!
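
One way to shorten that testing process is to compare each file's bounding box against your point of interest before running a full analysis. The sketch below is plain Python; the extents in `example_bounds` are invented for illustration and are NOT real values from the dataset. In practice the bounds could come from something like geopandas' `GeoDataFrame.total_bounds`, which returns (minx, miny, maxx, maxy). Both helper names are hypothetical.

```python
def bounds_contain_point(bounds, lon, lat):
    """bounds is (xmin, ymin, xmax, ymax) in the same CRS as lon/lat."""
    xmin, ymin, xmax, ymax = bounds
    return xmin <= lon <= xmax and ymin <= lat <= ymax


def files_covering_point(bounds_by_file, lon, lat):
    """Return the names of all files whose bounding box contains the point."""
    return [name for name, bounds in bounds_by_file.items()
            if bounds_contain_point(bounds, lon, lat)]


# Made-up extents for illustration only
example_bounds = {
    "part-00000": (-171.0, -15.0, -168.0, -13.0),
    "part-00001": (-75.0, 40.0, -70.0, 44.0),
    "part-00002": (-72.5, 41.5, -69.0, 45.0),
}

candidates = files_covering_point(example_bounds, lon=-71.80, lat=42.26)
print(candidates)  # ['part-00001', 'part-00002']
```

A bounding-box hit only means the file might contain your area, so treat the result as a shortlist. Note that the example point falls inside two boxes, which mirrors the situation where 2 or more parquet files cover the same area.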

In [11]:
structure_ddf = ut.get_US_structures_specified(bucket_name, link_list, s3_client, local_path, blocksize = "16M")

# This uses 'link_list'. To use 'one_link' instead, swap it into the function call above.

# A note about the function: a suitable blocksize varies by computer. 256 MiB is a common block size, but my computer cannot handle that, so I use 16M (M = MiB).
# The blocksize helps handle large datasets.

# More details can be found here: https://coderzcolumn.com/tutorials/python/dask-dataframes-guide-to-work-with-large-tabular-datasets
# Use ctrl + f to search for 'blocksize'

File already exists locally. No download needed.
File already exists locally. No download needed.
File already exists locally. No download needed.
File already exists locally. No download needed.
File already exists locally. No download needed.


##### If you downloaded the data into your mounted folder, each later run of the function produces the above output, confirming that you already have the data downloaded.
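
The code in utils.py is not shown in this notebook, so the following is only a hypothetical sketch of the download-if-missing pattern that the message above suggests, not the actual implementation. It relies on the boto3 client method `download_file(Bucket, Key, Filename)`; the helper name `download_if_missing` is made up for this example.

```python
import os


def download_if_missing(s3_client, bucket_name, key, local_dir):
    """Download key into local_dir unless a file with that name already exists."""
    os.makedirs(local_dir, exist_ok=True)
    local_file = os.path.join(local_dir, os.path.basename(key))
    if os.path.exists(local_file):
        print("File already exists locally. No download needed.")
    else:
        s3_client.download_file(bucket_name, key, local_file)
    return local_file


# Usage with the variables defined earlier in this notebook:
# paths = [download_if_missing(s3_client, bucket_name, key, local_path)
#          for key in link_list]
```

Checking for the file first is what makes re-running the notebook cheap: a 1.9 GiB file is only transferred once.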

In [12]:
structure_ddf.head(5)

Unnamed: 0,geometry,BUILD_ID,OCC_CLS,PRIM_OCC,SEC_OCC,PROP_ADDR,PROP_CITY,PROP_ST,PROP_ZIP,OUTBLDG,...,USNG,LONGITUDE,LATITUDE,IMAGE_NAME,IMAGE_DATE,VAL_METHOD,REMARKS,UUID,bbox,geohash
0,"MULTIPOLYGON (((-170.8292 -14.32645, -170.8292...",12059,Unclassified,Unclassified,,,,American Samoa,,,...,02L NK 18412 16164,-170.829258,-14.326434,104001005B4C3F00,2020-05-14,Automated,,{245aca01-8d02-4340-810e-3872ee35e7ef},"{'xmin': -170.82931163799998, 'ymin': -14.3264...",2jqw2xvymz0vch1
1,"MULTIPOLYGON (((-170.82798 -14.32867, -170.827...",29894,Unclassified,Unclassified,,,,American Samoa,,,...,02L NK 18540 15909,-170.828069,-14.328742,NOAA Topographic LiDAR,2013-01-01,Unverified,,{a46fac03-1829-47a2-b63f-86d4e52700a8},"{'xmin': -170.82815531799997, 'ymin': -14.3288...",2jqw2xwc12zkvd0
2,"MULTIPOLYGON (((-170.82786 -14.3283, -170.8278...",29886,Unclassified,Unclassified,,,,American Samoa,,,...,02L NK 18569 15948,-170.827797,-14.328382,NOAA Topographic LiDAR,2013-01-01,Unverified,,{4c8bb4a4-86e1-4cb2-908c-a367e7dfa525},"{'xmin': -170.82790860099996, 'ymin': -14.3284...",2jqw2xwgpg8y3ej
3,"MULTIPOLYGON (((-170.82747 -14.32844, -170.827...",29879,Unclassified,Unclassified,,,,American Samoa,,,...,02L NK 18594 15936,-170.827566,-14.328497,NOAA Topographic LiDAR,2013-01-01,Unverified,,{22106c1f-562e-4468-a496-32d407fd8b4f},"{'xmin': -170.82767668499997, 'ymin': -14.3286...",2jqw2xx4mn23x0c
4,"MULTIPOLYGON (((-170.82714 -14.32846, -170.827...",29885,Unclassified,Unclassified,,,,American Samoa,,,...,02L NK 18636 15935,-170.827176,-14.328503,NOAA Topographic LiDAR,2013-01-01,Unverified,,{bc01e591-8516-49bf-b66e-dd21667de2f7},"{'xmin': -170.82721784799998, 'ymin': -14.3285...",2jqw2xx6qm2znys
