In [1]:
import matplotlib.pyplot as plt
import numpy as np
import dask_geopandas as dg

# ATTENTION:
You must follow the instructions below and replace the AWS access key and secret access key placeholders with your own values. The analysis will not work without them.

##### Utilizing data from: https://source.coop/repositories/wherobots/usa-structures/description

All data on Source Cooperative is hosted in an AWS S3 bucket. To access it, you need credentials, which you can generate on the Source Cooperative website. After logging in, click on your name at the top right corner, then click on your username. Navigate to the "Manage" page on the left side. At the bottom of that page you will find a section called "API Keys". If no key has been generated before, generate a new one, then copy the access key and secret access key values and paste them into the cell below.

source.coop website: https://source.coop/

###### Source: https://github.com/github.com/HamedAlemo/vector-data-tutorial/scalable_vector_analysis.ipynb

In [2]:
##################################
#   Read Above 'ATTENTION' Note  #
##################################

AWS_ACCESS_KEY_ID = "<YOUR ACCESS KEY>"
AWS_SECRET_ACCESS_KEY = "<YOUR SECRET ACCESS KEY>"
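
As an alternative to pasting the keys into the notebook, you could read them from environment variables so they are never saved with the file. This is only a minimal sketch: it assumes you have exported `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in your shell, and the `read_credential` helper is hypothetical (it is not part of this repository).

```python
import os

def read_credential(name, placeholder):
    """Return the environment variable `name`, or `placeholder` if it is unset."""
    return os.environ.get(name, placeholder)

# Falls back to the same placeholders used in the cell above.
AWS_ACCESS_KEY_ID = read_credential("AWS_ACCESS_KEY_ID", "<YOUR ACCESS KEY>")
AWS_SECRET_ACCESS_KEY = read_credential("AWS_SECRET_ACCESS_KEY", "<YOUR SECRET ACCESS KEY>")

if AWS_ACCESS_KEY_ID.startswith("<") or AWS_SECRET_ACCESS_KEY.startswith("<"):
    print("Credentials not set: replace the placeholders or export the variables.")
```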

In [3]:
import boto3
s3_client = boto3.client('s3',
                         aws_access_key_id = AWS_ACCESS_KEY_ID, 
                         aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
                         endpoint_url='https://data.source.coop'
                        )

In [4]:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)

http://127.0.0.1:8787/status


Local path for downloading the data

In [5]:
# If you are running this from the analysis file inside the 'saved' (mounted) folder and do NOT want to keep the raw data,
# delete the first local_path line and uncomment (remove the #) the second one

local_path = "./data/"
# local_path ="/data/"

In [6]:
import utils as ut

## Understanding the SPECIFIED Edition

This approach requires you to provide a list of parquet links. If you only want to download 1 of the parquet files, you can use list slicing, as in the following example, to select a single file later on:

In [8]:
example_list_of_parquets = ['parquet1.parquet',
                    'parquet2.parquet',
                    'parquet3.parquet',
                    'parquet4.parquet',
                    'parquet5.parquet']

print(example_list_of_parquets[0:1]) # returns a one-item list holding the first element (index 0); indices start at 0 in Python

print(example_list_of_parquets[2:3]) # returns a one-item list holding the third element (index 2)

print('\n\nOnce you get to the section for manually inputting the parquet links you want, you can do the same index operations to achieve just 1 parquet.')

['parquet1.parquet']
['parquet3.parquet']


Once you get to the section for manually inputting the parquet links you want, you can do the same index operations to achieve just 1 parquet.


### Walkthrough

This walkthrough uses the specified code to automatically grab the parquet files with the unique number identifiers 00000 through 00004. NOTE: the value of the 'endfix' variable comes from the 12/6/2024 update. It will very likely be different by the time you run this package.

In [None]:
# If you are running this from the analysis file inside the 'saved' (mounted) folder and do NOT want to save the raw data to your machine,
# delete the first local_path line and uncomment (remove the #) the second one

local_path = "./data/" # saves data to machine
# local_path ="/data/" # deletes data after removing the container

As of the 12/6/2024 download, you can go to the following page to see all of the parquet files:

https://source.coop/wherobots/usa-structures

The size of each .parquet file is shown to its right.
Click on one of the .parquet file links. It should bring you to a page that details the file, for instance:

#### File Name

part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet

#### File Size

2,058,402,104 Bytes (1.9 GiB)

#### Content Type

application/octet-stream

#### Last Modified

Fri Dec 06 2024 15:29:54 GMT-0500 (Eastern Standard Time)

#### Data URL

https://data.source.coop/wherobots/usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet

#### S3 URI

s3://wherobots/usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet

#### We are using an S3 client approach.
#### Therefore, focus on the S3 URI; this is what we use to download the data

The bucket_name is 'wherobots'

The prefix is everything after the bucket_name AND its slash (/), up to but not including the unique #-identifier, 00000.

The endfix is everything after the unique #-identifier, including the hyphen.
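
The three pieces described above can also be pulled out of the S3 URI with a few string operations. This is a sketch, not part of `utils`: the `split_s3_uri` helper is hypothetical, and it assumes the unique #-identifier first appears in the URI at the spot where the prefix ends.

```python
def split_s3_uri(uri, part_id):
    """Split an S3 URI into (bucket_name, prefix, endfix) around `part_id`."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError("expected a URI starting with s3://")
    # The bucket is everything between the scheme and the first slash.
    bucket_name, _, key = uri[len(scheme):].partition("/")
    # The prefix ends where the identifier starts; the endfix follows it.
    prefix, found, endfix = key.partition(part_id)
    if not found:
        raise ValueError(f"identifier {part_id!r} not found in the URI")
    return bucket_name, prefix, endfix

uri = "s3://wherobots/usa-structures/geoparquet/part-00000-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet"
bucket_name, prefix, endfix = split_s3_uri(uri, "00000")
print(bucket_name)  # wherobots
print(prefix)       # usa-structures/geoparquet/part-
print(endfix)       # -170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet
```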

#### Copy and paste any adjustments to the file link that have occurred since this update. It is very likely the bucket_name and prefix will be the same.
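
If the file names have changed since this update, one way to find the current ones is to ask S3 for the keys under the prefix. The sketch below uses boto3's `list_objects_v2` call and follows its continuation tokens. The `list_parquet_keys` helper is hypothetical, and it is an assumption that the Source Cooperative endpoint allows listing with your credentials; if it does not, copy the names from the website instead.

```python
def list_parquet_keys(client, bucket, prefix):
    """Return every object key under `prefix` that ends in '.parquet'."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = client.list_objects_v2(**kwargs)
        for obj in response.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                keys.append(obj["Key"])
        # Results come in pages; stop once S3 reports no more pages.
        if not response.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = response["NextContinuationToken"]

# Hypothetical usage with the client created earlier in this notebook:
# list_parquet_keys(s3_client, "wherobots", "usa-structures/geoparquet/")
```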

#### In the code cell below, if you wish to automatically grab 'x' number of files, change the 'range(0, 5)' portion. This goes in numeric order:

00000, 00001, 00002, 00003, 00004

#### You can change the starting number to something other than 0 and choose whichever end number you want. Remember that the end number is NOT inclusive: to go from 0-4, you must use 5 as the end number for the range function.
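
To see how the start and end numbers behave before building real links, you can print the zero-padded identifiers on their own. The `part_ids` helper below is hypothetical and only illustrates `range` together with `zfill`:

```python
def part_ids(start, stop):
    """Zero-padded part numbers from `start` up to, but not including, `stop`."""
    return [str(i).zfill(5) for i in range(start, stop)]

print(part_ids(0, 5))  # ['00000', '00001', '00002', '00003', '00004']
print(part_ids(3, 6))  # ['00003', '00004', '00005']
```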

In [None]:
bucket_name = 'wherobots'  # (The bucket name is the account that posted the dataset)?
prefix = "usa-structures/geoparquet/part-"

endfix = "-170892fe-0fe0-43c1-999d-f0911ce43365-c000.zstd.parquet"

links = [f"{prefix}{str(i).zfill(5)}{endfix}" for i in range(0, 5)] # originally 0-200; the dataset was updated mid-project and now goes to 00009
links

As shown at the beginning, you can select just 1 of the links:

In [None]:
link_list = links[0:1]

Or use the entire list:

In [None]:
link_list = links

Make sure to only run the cell block you desire. If you aren't sure, we can print the 'link_list' variable:

In [None]:
link_list

Finally, we will create the dask dataframe from your desired links, after which you may start your analysis. Use the code from 'basic_analysis_ALL' as an example, but remember the warning from the README.md file. If you are interested in a specific location, you will likely need to test several parquet files before you find it, and two or more parquet files may cover your desired area. Proceed with caution, as this could be the start of a LONG testing process. Good luck!

In [None]:
structure_ddf = ut.get_US_structures_specified(bucket_name, link_list, s3_client, local_path, blocksize = "16M") #256M is regular block size
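
The implementation of `ut.get_US_structures_specified` is not shown in this notebook. As a rough idea of what the download step could look like, here is a sketch built on boto3's `download_file`. The `download_parquets` helper is hypothetical and is not the code in `utils`; how the real helper reads the files or applies `blocksize` is not covered here.

```python
import os

def download_parquets(client, bucket, keys, local_path):
    """Download each key into `local_path`, skipping files already present."""
    os.makedirs(local_path, exist_ok=True)
    local_files = []
    for key in keys:
        destination = os.path.join(local_path, os.path.basename(key))
        if not os.path.exists(destination):
            client.download_file(bucket, key, destination)
        local_files.append(destination)
    return local_files

# Hypothetical usage with the variables defined in this notebook:
# files = download_parquets(s3_client, bucket_name, link_list, local_path)
# structure_ddf = dg.read_parquet(files)
```

Skipping files that already exist means re-running the cell does not download the large parquet files a second time.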

In [None]:
structure_ddf.head(5)
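
When you are hunting for a specific location, comparing bounding boxes can narrow down which parquet files are worth a closer look. The sketch below is plain Python. The `boxes_intersect` and `files_covering` helpers are hypothetical, and the coordinates are made-up placeholders, not the real extents of any file. In practice you could fill in the bounds from the `total_bounds` of each loaded file.

```python
def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def files_covering(target, file_bounds):
    """Names of the files whose bounds intersect the target box."""
    return [name for name, bounds in file_bounds.items()
            if boxes_intersect(target, bounds)]

# Placeholder values for illustration only (NOT real file extents).
file_bounds = {
    "part-00000": (-80.0, 37.0, -75.0, 40.0),
    "part-00001": (-125.0, 32.0, -114.0, 42.0),
}
target = (-77.2, 38.8, -76.9, 39.0)
print(files_covering(target, file_bounds))  # ['part-00000']
```

More than one name can come back, which matches the case where 2 or more parquet files cover your desired area.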