# Using CDP File Stores

Methods for retrieving open-access data from a CDP file store.

### Connecting to the file store

CDP Seattle uses Google Cloud Storage, specifically a storage bucket tied to our database, to store our files. However, any properly set up file storage host with an associated file store module _should_ offer the same functionality.

Here is how to connect to the Seattle file store for **read-only** operations.

**Note:** This notebook connects to the staging instance of Seattle's GCS storage. To use production data, connect to `cdp-seattle.appspot.com`.

In [1]:
from cdptools.file_stores.gcs_file_store import GCSFileStore

fs = GCSFileStore("stg-cdp-seattle.appspot.com")
fs

<GCSFileStore [stg-cdp-seattle.appspot.com]>

### Getting file URIs

If you know which file you are looking for, simply pass the filename. Files are tagged with a `SHA256` hash of the video URI used to create all downstream artifacts. You can either recreate the hash for the event you are looking for or query the database for the file linkage. For details on database usage, refer to the [database basics notebook](./database.ipynb).

In [2]:
# Using a hash
import hashlib
key = hashlib.sha256("http://video.seattle.gov:8080/media/council/gen_062717V.mp4".encode("utf8")).hexdigest()
fs.get_file_uri(f"{key}_audio.wav")

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/52d797171b74b68246b4ad0f2c4131c125c3a9338688eaf83109ae719fff2bee_audio.wav'
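
Recreating the hash by hand for every artifact gets repetitive, so a small helper can wrap it. The following is a minimal sketch: `artifact_filename` is a hypothetical name, not part of `cdptools`, and it assumes the `{hash}_{suffix}` naming pattern seen in the output above.

```python
import hashlib


def artifact_filename(video_uri: str, suffix: str) -> str:
    """Build a file store filename from a video URI (hypothetical helper)."""
    # Same hashing as the cell above: SHA256 over the UTF-8 bytes of the URI
    key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
    # Assumed naming pattern: "<hash>_<suffix>"
    return f"{key}_{suffix}"


video_uri = "http://video.seattle.gov:8080/media/council/gen_062717V.mp4"
audio_name = artifact_filename(video_uri, "audio.wav")
```

The resulting `audio_name` can be passed to `fs.get_file_uri` exactly like the f-string in the cell above.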

In [3]:
# Using the database
from cdptools.databases.cloud_firestore_database import CloudFirestoreDatabase
import pandas as pd

db = CloudFirestoreDatabase("stg-cdp-seattle")
files = pd.DataFrame(db.select_rows_as_list("file"))
fs.get_file_uri(files.loc[0]["filename"])

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/165f1f333b3607748d8beca97e59a6b273fde4e2cf7b577f0f1b2fab631952a2_transcript_0.txt'
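
If you already have the hash for an event, the rows returned by the database can be narrowed to that event's files. This is a minimal sketch, assuming each row is a dict with a `filename` key (which the DataFrame column used above suggests). `filenames_for_key` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from typing import Dict, List


def filenames_for_key(rows: List[Dict[str, str]], key: str) -> List[str]:
    """Return filenames that start with the given hash key (hypothetical helper)."""
    prefix = f"{key}_"
    return [row["filename"] for row in rows if row["filename"].startswith(prefix)]


# Made-up rows standing in for db.select_rows_as_list("file")
sample_rows = [
    {"filename": "aaa_audio.wav"},
    {"filename": "aaa_transcript_0.txt"},
    {"filename": "bbb_transcript_0.txt"},
]
matches = filenames_for_key(sample_rows, "aaa")
```

Filtering the plain list avoids building a DataFrame when all you need is the filenames for a single event.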

### Downloading files

Unless you want to stream the data from the returned URI, we recommend downloading the file locally before using it.
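
For the streaming case, here is a minimal sketch that uses only the standard library. It assumes the URI is readable without credentials, which may not hold for every file store. `read_head` is a hypothetical helper, not a `cdptools` function.

```python
from urllib.request import urlopen


def read_head(uri: str, n_bytes: int = 500) -> bytes:
    """Read only the first n_bytes from a URI without saving it (hypothetical helper)."""
    # The response is closed as soon as the requested bytes have been read
    with urlopen(uri) as response:
        return response.read(n_bytes)
```

For example, `read_head(fs.get_file_uri(filename))` would fetch just the start of a file, and the returned bytes can be decoded with `.decode("utf8")` for text files.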

**Note:** You can optionally provide a `save_path` parameter if you don't want the local file to have the same name it has in the bucket.

In [4]:
# Download to local
save_path = fs.download_file(files.loc[21]["filename"])
save_path

[INFO: file_store:  54 2019-04-24 18:52:49,219] Stored external resource copy: /Users/jacksonb/Desktop/active/cdp/cdptools/examples/eb448680c9c0cd7132d91b70a36d1a1c9f6bf7bc5eb14811cc5abc68f78d2894_transcript_0.txt


PosixPath('/Users/jacksonb/Desktop/active/cdp/cdptools/examples/eb448680c9c0cd7132d91b70a36d1a1c9f6bf7bc5eb14811cc5abc68f78d2894_transcript_0.txt')
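
The hashed filenames are long, so you may prefer a shorter local name via `save_path`. This is a minimal sketch, assuming the `{hash}_{suffix}` naming seen above. `short_save_path` is a hypothetical helper, the example filename is made up, and the commented call assumes `save_path` is accepted as a keyword argument.

```python
from pathlib import Path


def short_save_path(filename: str, directory: str = ".") -> Path:
    """Drop the leading hash from a bucket filename (hypothetical helper)."""
    # Split once on the first underscore: "<hash>_<suffix>" -> "<suffix>"
    _, _, suffix = filename.partition("_")
    # Fall back to the full name if there is no underscore to split on
    return Path(directory) / (suffix or filename)


# Made-up filename with a shortened stand-in for the hash
local_path = short_save_path("abc123_transcript_0.txt", "downloads")
# fs.download_file(filename, save_path=local_path)  # assumed keyword usage
```

Dropping the hash means files from different events map to the same local name, so use a separate directory per event if you download more than one.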

In [5]:
# Read the transcript
with open(save_path, "r") as read_in:
    print(read_in.read()[:500])

Greetings, welcome to the Civil Rights utilities economic development in arts committee. This is our December 13th meeting and we are getting started at 9:37 a.m. On today's agenda will be starting out as we do each month with a poetry reading with Wordsworth followed by public comment. Then we have several appointments to the Seattle music commission followed by appointments as well to the Seattle Human Rights Commission, and the people with disabilities commission and then an ordinance changin
