# Using CDP File Stores

Methods for retrieving open access data.

In [1]:
from cdptools import CDPInstance, configs

seattle = CDPInstance(configs.SEATTLE)
seattle.file_store

<GCSFileStore [stg-cdp-seattle.appspot.com]>

### Getting file URIs

If you know which file you are looking for, pass its filename to `get_file_uri`. Each filename is prefixed with the `SHA256` hash of the video URI used to create all downstream artifacts. You can either recreate the hash for the event you are looking for, or query the database for file linkage. For details on database usage, refer to the [database basics notebook](./database.ipynb).

In [2]:
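# Using a hash
import hashlib

video_uri = "https://video.seattle.gov/media/council/gov_080619_2571925V.mp4"
# Filenames follow the pattern "{SHA256 hash of the video URI}_{artifact suffix}"
key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
seattle.file_store.get_file_uri(f"{key}_audio.wav")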

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/b0d394f1837ee56aae3664e8e665e806564b52953bfce0e98045a7dcc9c24f69_audio.wav'
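
The hash-plus-suffix naming used above can be wrapped in a small helper. This is a minimal sketch: `artifact_filename` is a hypothetical name, not part of cdptools, and only the `audio.wav` suffix comes from the example above; other suffixes are assumed to follow the same `{hash}_{suffix}` pattern.

```python
import hashlib


def artifact_filename(video_uri: str, suffix: str) -> str:
    """Build a file store filename from a video URI and an artifact suffix.

    Hypothetical helper (not part of cdptools); assumes the
    "{SHA256 hash of the video URI}_{suffix}" naming shown above.
    """
    key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
    return f"{key}_{suffix}"


filename = artifact_filename(
    "https://video.seattle.gov/media/council/gov_080619_2571925V.mp4",
    "audio.wav",
)
# The result can then be passed to seattle.file_store.get_file_uri(filename).
```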

In [3]:
# Using the database
import pandas as pd

files = pd.DataFrame(seattle.database.select_rows_as_list("file"))
files.head()

,file_id,uri,content_type,filename,description,created
0,06fa7dcb-387a-421c-86c6-8bb8f47c8374,gs://stg-cdp-seattle.appspot.com/b0d394f1837ee...,,b0d394f1837ee56aae3664e8e665e806564b52953bfce0...,,2019-08-07 04:25:24.260846
1,12b71eb9-8c4d-4205-8dc2-cdb6d8b7fb52,gs://stg-cdp-seattle.appspot.com/b0d394f1837ee...,,b0d394f1837ee56aae3664e8e665e806564b52953bfce0...,,2019-08-07 04:25:23.385379
2,263d6c5e-51bf-4c3f-a9f2-f47dee443996,gs://stg-cdp-seattle.appspot.com/49fd94d68ee70...,,49fd94d68ee7072972609e653a0ea180483d57de78b34d...,,2019-08-07 05:17:28.994852
3,344eb89a-bdba-4e1a-899e-7ee1e5fe4b5e,gs://stg-cdp-seattle.appspot.com/49fd94d68ee70...,,49fd94d68ee7072972609e653a0ea180483d57de78b34d...,,2019-08-07 04:14:56.381728
4,5ed09a36-8783-40ff-8078-1c9079728d68,gs://stg-cdp-seattle.appspot.com/b0d394f1837ee...,,b0d394f1837ee56aae3664e8e665e806564b52953bfce0...,,2019-08-07 04:25:22.463978


In [4]:
seattle.file_store.get_file_uri(files.loc[0]["filename"])

'https://storage.googleapis.com/stg-cdp-seattle.appspot.com/b0d394f1837ee56aae3664e8e665e806564b52953bfce0e98045a7dcc9c24f69_ts_sentences_transcript_0.json'
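
Because every artifact for an event shares the same hash prefix, that prefix can also be used to pick out all of an event's files from the database rows. A minimal sketch, assuming `select_rows_as_list("file")` returns a list of dicts with a `filename` key (as the DataFrame columns above suggest); `files_for_video` is a hypothetical helper, and the URI and rows below are placeholders, not real records.

```python
import hashlib
from typing import Dict, List


def files_for_video(rows: List[Dict], video_uri: str) -> List[Dict]:
    """Return the rows whose filename starts with the SHA256 hash of video_uri."""
    key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
    return [row for row in rows if row["filename"].startswith(key)]


# Placeholder data for illustration; real rows would come from
# seattle.database.select_rows_as_list("file").
video_uri = "https://example.com/meeting.mp4"  # hypothetical URI
key = hashlib.sha256(video_uri.encode("utf8")).hexdigest()
rows = [
    {"filename": f"{key}_audio.wav"},
    {"filename": f"{key}_ts_sentences_transcript_0.json"},
    {"filename": "0" * 64 + "_audio.wav"},  # belongs to a different event
]

matches = files_for_video(rows, video_uri)
# Strip the 64-character hash and the underscore to show only the suffixes
print([row["filename"][65:] for row in matches])
```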

### Downloading files

Unless you want to stream the data from the returned URI, download the file locally before using it.

**Note:** By default, the file is saved under the same name it has in the bucket. Pass the optional `save_path` parameter to store it under a different name or location.

In [5]:
# Download to local
file_id = "06fa7dcb-387a-421c-86c6-8bb8f47c8374"
uri = files.loc[files["file_id"] == file_id, "uri"].values[0]
save_path = seattle.file_store.download_file(uri)
save_path

PosixPath('/home/maxfield/active/cdp/cdptools/examples/b0d394f1837ee56aae3664e8e665e806564b52953bfce0e98045a7dcc9c24f69_ts_sentences_transcript_0.json')
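
If the hashed bucket name is unwieldy, a shorter local name can be derived before downloading. A minimal sketch: `friendly_save_path` is a hypothetical helper, the hash in the sample URI is a placeholder, and passing the result as `save_path` assumes `download_file` accepts that keyword as described in the note above.

```python
from pathlib import Path


def friendly_save_path(uri: str, prefix: str, directory: str = ".") -> Path:
    """Swap the leading 64-character hash of a stored filename for a readable prefix.

    Hypothetical helper; assumes filenames follow "{SHA256 hash}_{suffix}".
    """
    name = uri.rsplit("/", 1)[-1]  # final component of the URI
    suffix = name[65:]             # drop the 64 hex characters and the underscore
    return Path(directory) / f"{prefix}_{suffix}"


# Placeholder hash ("a" * 64) standing in for a real one
uri = "gs://stg-cdp-seattle.appspot.com/" + "a" * 64 + "_ts_sentences_transcript_0.json"
local_path = friendly_save_path(uri, "committee_meeting")
# Assumed usage: seattle.file_store.download_file(uri, save_path=local_path)
```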

The following code snippet reads in a transcript. CDP-produced transcripts follow a specific JSON schema, described in the [transcript_formats](../docs/transcript_formats.md) documentation.

In [6]:
# Read the transcript
import json

with open(save_path, "r") as read_in:
    transcript = json.load(read_in)

# Print the first three sentences
for s in transcript["data"][:3]:
    print(s)

{'text': 'Good morning.', 'start_time': 18.1, 'end_time': 18.9}
{'text': 'Thank you for being here for a regular schedules govern a second and Technology committee.', 'start_time': 18.9, 'end_time': 22.3}
{'text': "I'm doing Bank comes from Jennifer Samuelson just told them that we have three agenda items will cover today with the first one in Ordnance and will actually go down when creating a a code reviser position for the reasons for the Scribe during our discussion of that region.", 'start_time': 22.3, 'end_time': 41.7}
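
Once loaded, the sentence list is easy to work with in plain Python. A minimal sketch, assuming every entry in `data` carries the `text`, `start_time`, and `end_time` keys seen in the output above; the helper names are hypothetical, and the sample reuses the first two sentences printed above.

```python
from typing import Dict, List


def full_text(transcript: Dict) -> str:
    """Join every sentence in the transcript into one string."""
    return " ".join(s["text"] for s in transcript["data"])


def sentences_between(transcript: Dict, start: float, end: float) -> List[Dict]:
    """Return the sentences that fall entirely within [start, end] seconds."""
    return [
        s for s in transcript["data"]
        if s["start_time"] >= start and s["end_time"] <= end
    ]


sample = {
    "data": [
        {"text": "Good morning.", "start_time": 18.1, "end_time": 18.9},
        {
            "text": "Thank you for being here for a regular schedules govern a second and Technology committee.",
            "start_time": 18.9,
            "end_time": 22.3,
        },
    ]
}

print(full_text(sample))
print(len(sentences_between(sample, 0.0, 20.0)))
```

The same functions should work unchanged on the `transcript` dict loaded above, provided it follows the same schema.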
