# Using CDP File Stores

Methods for retrieving open access data.

### Connecting to the file store

CDP Seattle uses Google Cloud Storage, specifically a storage bucket tied to our database, to store our files. However, any properly set up file storage host with an associated file store module _should_ offer the same functionality.

Here is how to connect to the Seattle file store for **read-only** operations.

**Note:** This notebook connects to the staging instance of Seattle's GCS storage. To use production data, connect to `cdp-seattle.appspot.com`.

In [1]:
from cdptools.file_stores.gcs_file_store import GCSFileStore

fs = GCSFileStore("stg-cdp-seattle.appspot.com")
fs

<GCSFileStore [stg-cdp-seattle.appspot.com]>

### Getting file URIs

If you know which file you are looking for, simply pass the filename. Each filename is prefixed with the `SHA256` hash of the video URI that was used to create all downstream artifacts. You can either recreate the hash for the event you are looking for, or query the database for the files linked to it. For details on database usage, refer to the notebook example on database basics [here](./database.ipynb).

In [2]:
# Using a hash
import hashlib
key = hashlib.sha256("http://video.seattle.gov:8080/media/council/hum_061119_2591921V.mp4".encode("utf8")).hexdigest()
fs.get_file_uri(f"{key}_audio.wav")

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/194ba87bfce105fce73ad52e2d29b1b399d0a12079058133ebb5576daa9254c8_audio.wav'
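The naming convention can be wrapped in a small helper. This is a minimal sketch, assuming the `<hash>_<suffix>` pattern seen in the outputs of this notebook holds for other artifacts as well; `artifact_filename` is a hypothetical name and not part of `cdptools`.

```python
import hashlib


def artifact_filename(video_uri: str, suffix: str) -> str:
    """Build a file store filename from the source video URI.

    Hypothetical helper (not part of cdptools). Assumes files are named
    "<sha256 hex digest of the video URI>_<suffix>", as in this notebook.
    """
    key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
    return f"{key}_{suffix}"


video_uri = "http://video.seattle.gov:8080/media/council/hum_061119_2591921V.mp4"
audio_name = artifact_filename(video_uri, "audio.wav")
print(audio_name)
```

The resulting name can then be passed to `fs.get_file_uri` exactly as in the cell above.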

In [3]:
# Using the database
from cdptools.databases.cloud_firestore_database import CloudFirestoreDatabase
import pandas as pd

db = CloudFirestoreDatabase("stg-cdp-seattle")
files = pd.DataFrame(db.select_rows_as_list("file"))
fs.get_file_uri(files.loc[0]["filename"])

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/3dc758dd6438882ca2e4cbe51a168b69a0ac58b002dbfe30924cc0387d44c83f_ts_words_transcript_0.txt'

### Downloading files

Unless you want to stream the data from the returned URI, we recommend downloading the file locally before using it.

**Note:** You can optionally provide a `save_path` parameter if you want the file saved under a different name than the one it has in the bucket.
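If you would rather stream than download, the URI returned by `get_file_uri` can be read with any HTTP client. The following is a rough sketch using only the standard library; `iter_chunks` is a hypothetical helper, not part of `cdptools`, and it assumes the URI is publicly readable.

```python
import urllib.request
from typing import Iterator


def iter_chunks(uri: str, chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield the content at `uri` in chunks without saving it to disk.

    Hypothetical helper (not part of cdptools).
    """
    with urllib.request.urlopen(uri) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                # An empty read means the end of the stream
                break
            yield chunk
```

For example, `b"".join(iter_chunks(uri))` would collect the whole file in memory, while iterating chunk by chunk keeps memory use bounded for large audio files.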

In [4]:
# Download to local
save_path = fs.download_file(files.loc[files["file_id"] == "03e9b46f-d7d4-4bd0-888e-4d2d0cbc2ce3"]["uri"].values[0])
save_path

PosixPath('/home/maxfield/active/cdp/cdptools/examples/cee27aca34b4e27f66dde5e7e0c901b600f35a3e209b61fd35e04bd8cc4f311b_ts_sentences_transcript_0.txt')

In [5]:
# Read the transcript
import json
with open(save_path, "r") as read_in:
    sentences = json.load(read_in)

# Print the first three sentences
for s in sentences[:3]:
    print(s)

{'sentence': 'Good morning, and welcome to the June 11th meeting of the Civil Rights utilities economic development in our committee.', 'start_time': 15.7, 'end_time': 22.3}
{'sentence': 'I am Lisa herbold the chair of the committee and council member representing West Seattle and South Park district one.', 'start_time': 22.3, 'end_time': 28.8}
{'sentence': 'It is 9:37 a.m.', 'start_time': 28.8, 'end_time': 31.0}
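Because each entry carries `start_time` and `end_time` in seconds, the transcript can be sliced by time. Below is a small sketch that uses the three sentences printed above as sample data; `sentences_between` is a hypothetical helper and the overlap rule is one reasonable choice among several.

```python
from typing import Dict, List, Union

Sentence = Dict[str, Union[str, float]]


def sentences_between(
    sentences: List[Sentence], start: float, end: float
) -> List[Sentence]:
    """Return sentences whose time span overlaps the window (start, end)."""
    return [
        s for s in sentences
        if s["start_time"] < end and s["end_time"] > start
    ]


# Sample data copied from the transcript output above
sample = [
    {"sentence": "Good morning, and welcome to the June 11th meeting of the Civil Rights utilities economic development in our committee.", "start_time": 15.7, "end_time": 22.3},
    {"sentence": "I am Lisa herbold the chair of the committee and council member representing West Seattle and South Park district one.", "start_time": 22.3, "end_time": 28.8},
    {"sentence": "It is 9:37 a.m.", "start_time": 28.8, "end_time": 31.0},
]

hits = sentences_between(sample, 20.0, 25.0)
for s in hits:
    print(s["start_time"], s["sentence"])
```

With the full `sentences` list loaded in the previous cell, the same call works unchanged.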
