# Working with CDP Transcripts

Methods for finding, downloading, and reading CDP transcripts, which are open access data.

A database schema diagram for production instances of CDP may be found [here](https://github.com/CouncilDataProject/cdptools/blob/master/docs/resources/database_diagram.pdf).

### Connecting to resources

Having access to both the CDP instance's database and file store will make accessing and using the transcripts easiest. It is recommended to read the database and file store usage notebooks prior to working through this one.

For details on database usage, refer to the notebook example on database basics [here](./database.ipynb).

For details on file store usage, refer to the notebook example on file store basics [here](./file_store.ipynb).

**Note:** This notebook connects to the staging instance of Seattle's Firestore database and file store. To use production data, connect to the Cloud Firestore instance: `cdp-seattle`. To use production files, connect to the GCS instance: `cdp-seattle.appspot.com`.

In [1]:
from cdptools.databases.cloud_firestore_database import CloudFirestoreDatabase
from cdptools.file_stores.gcs_file_store import GCSFileStore
import pandas as pd

db = CloudFirestoreDatabase("stg-cdp-seattle")
fs = GCSFileStore("stg-cdp-seattle.appspot.com")
db, fs

(<CloudFirestoreDatabase [stg-cdp-seattle]>,
 <GCSFileStore [stg-cdp-seattle.appspot.com]>)

### Find all transcripts

Simply query the `transcript` table!

In [2]:
transcripts = pd.DataFrame(db.select_rows_as_list("transcript"))
transcripts.head()

Unnamed: 0,confidence,created,event_id,file_id,transcript_id
0,0.934595,2019-05-16 18:49:44.387876,43437841-5be5-45b0-9e38-f3fa94c10515,72d3e26c-93f8-4e5d-8cbf-43add1754443,13d1578d-967d-48fe-b861-32a281a31d34
1,0.953214,2019-05-16 18:34:06.637447,ec5f6b47-3ddf-4cdd-8079-4e683c3f1e56,fbbafda3-4256-4f41-9383-89ad4ad6b088,2d83cd7f-8277-438f-978e-ff9484c45c65
2,0.939957,2019-05-16 18:46:27.774580,435d6632-dad9-487b-a02e-2f05eb7735ee,09b692dc-80b5-403b-9a32-3b278500d5ab,6b36e024-2751-48e2-9da8-8db2ae6f7fc0
3,0.930869,2019-05-16 18:39:11.784250,c82b1615-8c0e-4505-8b56-98fa45d4d1cd,1de6d753-0424-46cd-b306-e825b8e514a7,7ebf40c2-c103-4d23-aace-746a7e4c0550


### Join file, event, and body information

While the above results are somewhat useful, we should probably merge information from the other tables too...

In [3]:
# Get the other tables
events = pd.DataFrame(db.select_rows_as_list("event"))
bodies = pd.DataFrame(db.select_rows_as_list("body"))
files = pd.DataFrame(db.select_rows_as_list("file"))

# Merge the tables
events = events.merge(bodies, left_on="body_id", right_on="body_id", suffixes=("_event", "_body"))
transcripts = transcripts.merge(files, left_on="file_id", right_on="file_id", suffixes=("_transcript", "_file"))
transcripts = transcripts.merge(events, left_on="event_id", right_on="event_id", suffixes=("_transcript", "_event"))
transcripts.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,created_file,filename,uri,body_id,created_event,event_datetime,source_uri,video_uri,created_body,description,name
0,0.934595,2019-05-16 18:49:44.387876,43437841-5be5-45b0-9e38-f3fa94c10515,72d3e26c-93f8-4e5d-8cbf-43add1754443,13d1578d-967d-48fe-b861-32a281a31d34,2019-05-16 18:49:41.134503,0292eaf436dc30f91c474ef2ac19e46f715d53f02fc511...,gs://stg-cdp-seattle.appspot.com/0292eaf436dc3...,e7b78f89-8053-42bc-996b-fe569042b8bc,2019-05-16 18:49:44.164436,2014-08-12,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/sp...,2019-05-16 18:49:43.937359,,Seattle Public Utilities and Neighborhoods Com...
1,0.953214,2019-05-16 18:34:06.637447,ec5f6b47-3ddf-4cdd-8079-4e683c3f1e56,fbbafda3-4256-4f41-9383-89ad4ad6b088,2d83cd7f-8277-438f-978e-ff9484c45c65,2019-05-16 18:34:04.036932,41960b32acc7f79d6bbe41366fb7fac495b81246378bdf...,gs://stg-cdp-seattle.appspot.com/41960b32acc7f...,8f23cb96-200c-4fb0-ad5d-4a0213f7f4ac,2019-05-16 18:34:06.393230,2015-08-19,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ed...,2019-05-16 18:34:06.156169,,Education and Governance Committee
2,0.939957,2019-05-16 18:46:27.774580,435d6632-dad9-487b-a02e-2f05eb7735ee,09b692dc-80b5-403b-9a32-3b278500d5ab,6b36e024-2751-48e2-9da8-8db2ae6f7fc0,2019-05-16 18:46:24.272457,57848045b7f7b12cf9f38d38e59262233ae386592dd86b...,gs://stg-cdp-seattle.appspot.com/57848045b7f7b...,b7c392fd-7580-41cf-a25b-6b9419ad9266,2019-05-16 18:46:27.453903,2018-01-03,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/pl...,2019-05-16 18:46:27.182217,,"Planning, Land Use, and Zoning Committee"
3,0.930869,2019-05-16 18:39:11.784250,c82b1615-8c0e-4505-8b56-98fa45d4d1cd,1de6d753-0424-46cd-b306-e825b8e514a7,7ebf40c2-c103-4d23-aace-746a7e4c0550,2019-05-16 18:39:08.573363,6f5f3136eb79cd4676ae3d95a7ada31bf60b56921c2465...,gs://stg-cdp-seattle.appspot.com/6f5f3136eb79c...,42a6b9df-aa42-45c5-9991-9cf445517bb2,2019-05-16 18:39:11.530673,2013-10-10,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/aw...,2019-05-16 18:39:11.235766,,"Central Waterfront, Seawall, and Alaskan Way V..."
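
The suffixed column names in the output above (`created_transcript`, `created_file`, and so on) come from how `DataFrame.merge` handles column names that exist in both frames. The following is a minimal sketch on made-up data, not CDP data; the `toy_*` names and all values are assumptions introduced only for illustration.

```python
import pandas as pd

# Toy stand-ins for the `transcript` and `file` tables (all values are made up).
toy_transcripts = pd.DataFrame({
    "transcript_id": ["t1", "t2"],
    "file_id": ["f1", "f2"],
    "created": ["2019-05-16 18:49:44", "2019-05-16 18:34:06"],
})
toy_files = pd.DataFrame({
    "file_id": ["f1", "f2"],
    "filename": ["a.txt", "b.txt"],
    "created": ["2019-05-16 18:49:41", "2019-05-16 18:34:04"],
})

# Both frames have a `created` column, so pandas appends the suffixes to it.
# The join key and columns that exist in only one frame keep their names.
toy_merged = toy_transcripts.merge(
    toy_files, on="file_id", suffixes=("_transcript", "_file")
)
print(list(toy_merged.columns))
```

Because suffixes are only applied to overlapping names, a column such as `filename` is never renamed, which is why it can be used directly in the cells that follow.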


### Downloading and reading a transcript

A manifest is great, but you probably want to work with the transcripts locally.

In [4]:
# Download to local
save_path = fs.download_file(transcripts.loc[3]["filename"])
save_path

PosixPath('/Users/maxfield/Desktop/active/cdp/cdptools/examples/6f5f3136eb79cd4676ae3d95a7ada31bf60b56921c2465582c8aa525b65d8801_ts_sentences_transcript_0.txt')

In [5]:
# Read the transcript
import json
with open(save_path, "r") as read_in:
    sentences = json.load(read_in)
    for s in sentences[:3]:
        print(s)

{'sentence': 'Good evening.', 'start_time': 16.9, 'end_time': 17.7}
{'sentence': 'This is October 10th 2013.', 'start_time': 17.7, 'end_time': 23.6}
{'sentence': 'This is the special committee on the central Waterfront Seawall and Alaskan Way Viaduct replacement program.', 'start_time': 23.6, 'end_time': 31.4}
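
Since each transcript is a list of dictionaries with `sentence`, `start_time`, and `end_time` keys, ordinary list operations are enough for most tasks. The sketch below copies the three sentences printed above so that it runs on its own; the helper names `transcript_text` and `sentences_between` are hypothetical and are not part of `cdptools`, and the times are assumed to be seconds from the start of the video.

```python
from typing import Dict, List, Union

Sentence = Dict[str, Union[str, float]]

# The first three sentences printed above, copied so the sketch is self-contained.
sample_sentences: List[Sentence] = [
    {"sentence": "Good evening.", "start_time": 16.9, "end_time": 17.7},
    {"sentence": "This is October 10th 2013.", "start_time": 17.7, "end_time": 23.6},
    {
        "sentence": "This is the special committee on the central Waterfront "
                    "Seawall and Alaskan Way Viaduct replacement program.",
        "start_time": 23.6,
        "end_time": 31.4,
    },
]


def transcript_text(sentences: List[Sentence]) -> str:
    """Join every sentence into a single string (hypothetical helper)."""
    return " ".join(str(s["sentence"]) for s in sentences)


def sentences_between(sentences: List[Sentence], start: float, end: float) -> List[Sentence]:
    """Keep sentences that fall entirely inside [start, end] (hypothetical helper)."""
    return [s for s in sentences if s["start_time"] >= start and s["end_time"] <= end]


full_text = transcript_text(sample_sentences)
opening = sentences_between(sample_sentences, 0, 24)
print(full_text)
print(len(opening))
```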


### Handling multiple transcripts for a single event

A single event may have multiple versions of a transcript. How you filter them down is up to you, but a decent general solution is to group by `event_id` and then choose the most recently created transcript in each group. You can do this with `pandas.Series.idxmax()`.

For background on `idxmax()`, see this [Stack Overflow question](https://stackoverflow.com/questions/10202570/pandas-dataframe-find-row-where-values-for-column-is-maximal).

In [6]:
# Group rows
most_recent_transcripts = []
grouped = transcripts.groupby("event_id")
for name, group in grouped:
    most_recent = group.loc[group["created_transcript"].idxmax()]
    most_recent_transcripts.append(most_recent)
    
most_recent = pd.DataFrame(most_recent_transcripts)
most_recent.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,created_file,filename,uri,body_id,created_event,event_datetime,source_uri,video_uri,created_body,description,name
0,0.934595,2019-05-16 18:49:44.387876,43437841-5be5-45b0-9e38-f3fa94c10515,72d3e26c-93f8-4e5d-8cbf-43add1754443,13d1578d-967d-48fe-b861-32a281a31d34,2019-05-16 18:49:41.134503,0292eaf436dc30f91c474ef2ac19e46f715d53f02fc511...,gs://stg-cdp-seattle.appspot.com/0292eaf436dc3...,e7b78f89-8053-42bc-996b-fe569042b8bc,2019-05-16 18:49:44.164436,2014-08-12,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/sp...,2019-05-16 18:49:43.937359,,Seattle Public Utilities and Neighborhoods Com...
2,0.939957,2019-05-16 18:46:27.774580,435d6632-dad9-487b-a02e-2f05eb7735ee,09b692dc-80b5-403b-9a32-3b278500d5ab,6b36e024-2751-48e2-9da8-8db2ae6f7fc0,2019-05-16 18:46:24.272457,57848045b7f7b12cf9f38d38e59262233ae386592dd86b...,gs://stg-cdp-seattle.appspot.com/57848045b7f7b...,b7c392fd-7580-41cf-a25b-6b9419ad9266,2019-05-16 18:46:27.453903,2018-01-03,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/pl...,2019-05-16 18:46:27.182217,,"Planning, Land Use, and Zoning Committee"
3,0.930869,2019-05-16 18:39:11.784250,c82b1615-8c0e-4505-8b56-98fa45d4d1cd,1de6d753-0424-46cd-b306-e825b8e514a7,7ebf40c2-c103-4d23-aace-746a7e4c0550,2019-05-16 18:39:08.573363,6f5f3136eb79cd4676ae3d95a7ada31bf60b56921c2465...,gs://stg-cdp-seattle.appspot.com/6f5f3136eb79c...,42a6b9df-aa42-45c5-9991-9cf445517bb2,2019-05-16 18:39:11.530673,2013-10-10,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/aw...,2019-05-16 18:39:11.235766,,"Central Waterfront, Seawall, and Alaskan Way V..."
1,0.953214,2019-05-16 18:34:06.637447,ec5f6b47-3ddf-4cdd-8079-4e683c3f1e56,fbbafda3-4256-4f41-9383-89ad4ad6b088,2d83cd7f-8277-438f-978e-ff9484c45c65,2019-05-16 18:34:04.036932,41960b32acc7f79d6bbe41366fb7fac495b81246378bdf...,gs://stg-cdp-seattle.appspot.com/41960b32acc7f...,8f23cb96-200c-4fb0-ad5d-4a0213f7f4ac,2019-05-16 18:34:06.393230,2015-08-19,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ed...,2019-05-16 18:34:06.156169,,Education and Governance Committee
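
The grouping loop above can also be written without an explicit loop by calling `idxmax()` on the grouped column and passing the resulting index labels to `.loc`. This is a sketch on made-up data rather than a replacement for the cell above; it assumes the frame has a unique index and that `created_transcript` holds datetime values, and the `toy` frame is an assumption introduced only for illustration.

```python
import pandas as pd

# Toy data: event "e1" has two transcript versions, event "e2" has one.
toy = pd.DataFrame({
    "event_id": ["e1", "e1", "e2"],
    "transcript_id": ["t1", "t2", "t3"],
    "created_transcript": pd.to_datetime([
        "2019-05-16 18:00:00",
        "2019-05-17 09:00:00",
        "2019-05-16 18:30:00",
    ]),
})

# For each event, find the index label of the row with the latest creation time.
latest_idx = toy.groupby("event_id")["created_transcript"].idxmax()

# Select those rows; each event now appears exactly once.
toy_most_recent = toy.loc[latest_idx]
print(toy_most_recent)
```

Both forms select the same rows; the loop is easier to step through, while this form tends to be shorter and faster on large tables.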


### A function to download the most recent transcripts

To make gathering the most recent transcripts easier, let's wrap this example up with a function to do all this for us.

**Note:** The function defined below is available in the `cdptools.utils` module.

`from cdptools.utils import download_most_recent_transcripts`

In [7]:
from pathlib import Path
from typing import Optional, Union

from cdptools.databases import Database
from cdptools.file_stores import FileStore


def download_most_recent_transcripts(db: Database, fs: FileStore, save_dir: Optional[Union[str, Path]] = None) -> Path:
    """
    Download the most recent versions of event transcripts.

    :param db: An already initialized `Database` object to query against.
    :param fs: An already initialized `FileStore` object to query against.
    :param save_dir: Path to a directory to save the transcripts and the dataset manifest.
    :return: Fully resolved path where transcripts and dataset manifest were stored.
    """
    # Use the current directory if none is provided
    if save_dir is None:
        save_dir = "."
    
    # Resolve save directory
    save_dir = Path(save_dir).resolve()
    
    # Create the save directory if it does not already exist
    save_dir.mkdir(parents=True, exist_ok=True)
    
    # Get transcript dataset
    transcripts = pd.DataFrame(db.select_rows_as_list("transcript"))
    events = pd.DataFrame(db.select_rows_as_list("event"))
    bodies = pd.DataFrame(db.select_rows_as_list("body"))
    files = pd.DataFrame(db.select_rows_as_list("file"))
    events = events.merge(bodies, left_on="body_id", right_on="body_id", suffixes=("_event", "_body"))
    transcripts = transcripts.merge(files, left_on="file_id", right_on="file_id", suffixes=("_transcript", "_file"))
    transcripts = transcripts.merge(events, left_on="event_id", right_on="event_id", suffixes=("_transcript", "_event"))
    
    # Group
    most_recent_transcripts = []
    grouped = transcripts.groupby("event_id")
    for name, group in grouped:
        most_recent = group.loc[group["created_transcript"].idxmax()]
        most_recent_transcripts.append(most_recent)
    
    most_recent = pd.DataFrame(most_recent_transcripts)
    
    # Begin storage
    most_recent.apply(
        lambda r: fs.download_file(r["filename"], save_dir / r["filename"], overwrite=True), 
        axis=1
    )
    
    # Write manifest
    most_recent.to_csv(save_dir / "transcript_manifest.csv", index=False)
    
    return save_dir

In [8]:
save_dir = download_most_recent_transcripts(db, fs, "most_recent_seattle_transcripts")
save_dir

PosixPath('/Users/maxfield/Desktop/active/cdp/cdptools/examples/most_recent_seattle_transcripts')
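
Once the transcripts and `transcript_manifest.csv` are on disk, the manifest can be used to load them back without touching the database. The sketch below assumes the layout produced by the function above (JSON transcript files stored next to the manifest, with `event_id` and `filename` columns); `load_transcripts_from_manifest` is a hypothetical helper, not part of `cdptools`, and the demo writes its own tiny manifest into a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union

import pandas as pd


def load_transcripts_from_manifest(save_dir: Union[str, Path]) -> Dict[str, List[dict]]:
    """Map each event_id in the manifest to its list of sentence dictionaries."""
    save_dir = Path(save_dir)
    manifest = pd.read_csv(save_dir / "transcript_manifest.csv")

    loaded = {}
    for row in manifest.itertuples(index=False):
        # Each transcript file is JSON, despite the .txt extension.
        with open(save_dir / row.filename, "r") as read_in:
            loaded[row.event_id] = json.load(read_in)
    return loaded


# Demo on a throwaway directory with one made-up transcript.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    example_sentences = [{"sentence": "Good evening.", "start_time": 16.9, "end_time": 17.7}]
    (tmp_dir / "example_transcript.txt").write_text(json.dumps(example_sentences))
    pd.DataFrame(
        [{"event_id": "example-event", "filename": "example_transcript.txt"}]
    ).to_csv(tmp_dir / "transcript_manifest.csv", index=False)

    demo = load_transcripts_from_manifest(tmp_dir)

print(demo["example-event"][0]["sentence"])
```

With real data, passing the `save_dir` returned above in place of the temporary directory should work the same way, provided the manifest and transcript files have not been moved.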