# Using CDP File Stores

Methods for retrieving open access data.

### Connecting to the file store

CDP Seattle stores its files in Google Cloud Storage, specifically in a storage bucket tied to our database. However, any properly set up file storage host with an associated file store module _should_ provide the same functionality.

**Note:** This notebook connects to the staging instance of Seattle's GCS storage. To use production data, connect to `cdp-seattle.appspot.com` instead.

In [1]:
from cdptools.file_stores.gcs_file_store import GCSFileStore

fs = GCSFileStore("stg-cdp-seattle.appspot.com")
fs

<GCSFileStore [stg-cdp-seattle.appspot.com]>

### Getting file URIs

If you know which file you are looking for, pass its filename. Files are tagged with the `SHA256` hash of the video URI used to create all downstream artifacts. You can either recreate the hash for the event you are looking for, or query the database for the file linkage. For details on database usage, refer to the [database basics notebook](./database.ipynb).

In [2]:
# Using a hash
import hashlib
key = hashlib.sha256("http://video.seattle.gov:8080/media/council/plan_071719_2511923V.mp4".encode("utf8")).hexdigest()
fs.get_file_uri(f"{key}_audio.wav")

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/abbd14724d2a7afc8d292c3879ff9d791ac0907560ce6b4c416e45f8e54f65cb_audio.wav'
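The hash step above can be wrapped in a small helper. This is a minimal sketch, not part of `cdptools`: the function name `artifact_filename` is hypothetical, and it assumes the `<hash>_<suffix>` naming pattern seen in the example (such as `_audio.wav`) holds for the artifact you want.

```python
import hashlib


def artifact_filename(video_uri: str, suffix: str) -> str:
    """Build a file store filename from a source video URI.

    Hypothetical helper. Assumes the naming pattern shown above:
    "<sha256 hex digest of the video URI>_<suffix>".
    """
    key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
    return f"{key}_{suffix}"


name = artifact_filename(
    "http://video.seattle.gov:8080/media/council/plan_071719_2511923V.mp4",
    "audio.wav",
)
print(name)
```

The resulting string can then be passed to `fs.get_file_uri(name)` exactly as in the cell above.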

In [3]:
# Using the database
from cdptools.databases.cloud_firestore_database import CloudFirestoreDatabase
import pandas as pd

db = CloudFirestoreDatabase("stg-cdp-seattle")
files = pd.DataFrame(db.select_rows_as_list("file"))
files.head()

|   | content_type | created | description | file_id | filename | uri |
| --- | --- | --- | --- | --- | --- | --- |
| 0 |  | 2019-07-20 23:48:12.019094 |  | 0024f4a7-2b68-4c8e-89a1-4e4eaae3724c | e0660611bf4a191296218466f64a589e054409fa40d7bc... | gs://stg-cdp-seattle.appspot.com/e0660611bf4a1... |
| 1 |  | 2019-07-20 23:06:06.017226 |  | 01ef139d-b607-405e-9496-77c71b7b4a55 | 7cf917c2bc3ef9b3ec674776f87d4e55b83d4782c24d88... | gs://stg-cdp-seattle.appspot.com/7cf917c2bc3ef... |
| 2 |  | 2019-07-21 00:07:41.687798 |  | 033bd572-1d4c-4093-853d-3ea5c4a538ef | d59ab2d3435b9c147317c1dfa4c2909068765c47b7255d... | gs://stg-cdp-seattle.appspot.com/d59ab2d3435b9... |
| 3 |  | 2019-07-20 23:14:01.180502 |  | 071f6392-b763-43b2-95b7-8e57165759f5 | fc52ca9f9febd50ece14f46170014936f76f3d0227688f... | gs://stg-cdp-seattle.appspot.com/fc52ca9f9febd... |
| 4 |  | 2019-07-20 23:35:10.197880 |  | 097aa050-8718-43b3-b347-6d033b4a8375 | 528af4d034464168dc80bc7275413e6820a6d46c06e1c5... | gs://stg-cdp-seattle.appspot.com/528af4d034464... |


In [4]:
fs.get_file_uri(files.loc[0]["filename"])

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/e0660611bf4a191296218466f64a589e054409fa40d7bc73471b2d8e431aa131_audio.out'
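The database stores `gs://` URIs, while `get_file_uri` returns `https://storage.googleapis.com/...` URLs. For publicly readable objects the two forms map onto each other directly. The sketch below illustrates that mapping with plain string handling; `gs_to_https` is a hypothetical helper, the object name `example_audio.wav` is made up for illustration, and this is not necessarily how `GCSFileStore` builds its URLs internally.

```python
def gs_to_https(gs_uri: str) -> str:
    """Convert a gs://bucket/object URI to its public HTTPS form.

    Hypothetical helper. Only meaningful for publicly readable objects,
    and it does not URL-encode special characters in the object name.
    """
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {gs_uri!r}")

    # Everything before the first "/" is the bucket, the rest is the object.
    bucket, _, object_name = gs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"Missing bucket or object name: {gs_uri!r}")

    return f"https://storage.googleapis.com/{bucket}/{object_name}"


# "example_audio.wav" is an illustrative object name, not a real file.
url = gs_to_https("gs://stg-cdp-seattle.appspot.com/example_audio.wav")
print(url)
```

In practice, prefer `fs.get_file_uri` as shown above; the helper is only meant to make the relationship between the two URI forms explicit.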

### Downloading files

Unless you want to stream the data from the returned URI, download the file locally before using it.

**Note:** By default, the file is saved under the same name it has in the bucket. Pass the optional `save_path` parameter to save it under a different name.

In [5]:
# Download to local
save_path = fs.download_file(files.loc[files["file_id"] == "0fb72a18-a43c-4820-b3a0-2ee858728981"]["uri"].values[0])
save_path

PosixPath('/home/maxfield/active/cdp/cdptools/examples/980177ce22091454a5ed25e8d9ab23c93b6bf19b7d8b946961e18fdbb4e5e394_ts_sentences_transcript_0.json')

The following snippet reads in a transcript. CDP-produced transcripts follow a specific JSON schema, which is described in the [transcript_formats](../docs/transcript_formats.md) documentation.

In [6]:
# Read the transcript
import json
with open(save_path, "r") as read_in:
    transcript = json.load(read_in)
    for s in transcript["data"][:3]:
        print(s)

{'text': 'Good afternoon, everyone.', 'start_time': 19.4, 'end_time': 20.4}
{'text': 'Is July the July 1st 2019 city council meeting of the Seattle City council will come to order.', 'start_time': 21.8, 'end_time': 28.5}
{'text': "It's 2:04 p.m.", 'start_time': 28.5, 'end_time': 31.0}
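Once the sentences are loaded, they are plain dictionaries and can be processed with ordinary Python. The sketch below reuses the three sentences printed above so that it runs without downloading anything. It assumes every entry carries `text`, `start_time`, and `end_time` keys as shown, and that the times are in seconds, which the output suggests but this notebook does not confirm. The helper names are hypothetical.

```python
# The three sentences printed above, copied verbatim.
sentences = [
    {"text": "Good afternoon, everyone.", "start_time": 19.4, "end_time": 20.4},
    {
        "text": "Is July the July 1st 2019 city council meeting of the "
                "Seattle City council will come to order.",
        "start_time": 21.8,
        "end_time": 28.5,
    },
    {"text": "It's 2:04 p.m.", "start_time": 28.5, "end_time": 31.0},
]


def sentence_durations(items):
    """Return the length of each sentence (assumed to be in seconds)."""
    return [s["end_time"] - s["start_time"] for s in items]


def full_text(items):
    """Join all sentence texts into a single string."""
    return " ".join(s["text"] for s in items)


durations = sentence_durations(sentences)
total = sum(durations)
print(f"{total:.1f} seconds of speech across {len(sentences)} sentences")
print(full_text(sentences))
```

With a downloaded transcript, replace `sentences` with `transcript["data"]` from the cell above.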
