# Working with CDP Transcripts

Methods for retrieving open access data.

A database schema diagram for production instances of CDP may be found [here](https://github.com/CouncilDataProject/cdptools/blob/master/docs/resources/database_diagram.pdf).

### Connecting to resources

Having access to both the CDP instance's database and file store will make accessing and using the transcripts easiest. It is recommended to read the database and file store usage notebooks before working through this one.

For details on database usage, refer to the notebook example on database basics [here](./database.ipynb).

For details on file store usage, refer to the notebook example on file store basics [here](./file_store.ipynb).

**Note:** This notebook connects to the staging instance of Seattle's Firestore database and file store. To use production data, connect to the Cloud Firestore instance: `cdp-seattle`. To use production files, connect to the GCS instance: `cdp-seattle.appspot.com`.

In [1]:
from cdptools.databases.cloud_firestore_database import CloudFirestoreDatabase
from cdptools.file_stores.gcs_file_store import GCSFileStore
import pandas as pd

db = CloudFirestoreDatabase("stg-cdp-seattle")
fs = GCSFileStore("stg-cdp-seattle.appspot.com")
db, fs

(<CloudFirestoreDatabase [stg-cdp-seattle]>,
 <GCSFileStore [stg-cdp-seattle.appspot.com]>)

### Find all transcripts

Simply query the transcript table!

In [2]:
transcripts = pd.DataFrame(db.select_rows_as_list("transcript"))
transcripts.head()

Unnamed: 0,confidence,created,event_id,file_id,transcript_id
0,0.94473,2019-04-26 06:45:32.822767,2bb1663d-3437-4a3d-98b0-d14862599227,746ed63d-bc65-4d61-9f39-b0fafbbb2e86,102e11db-f444-44db-b788-8d3d3f32f525
1,0.881332,2019-04-22 06:56:27.283578,3807e904-a7f6-44a9-8116-667aac02ec93,db6e8e0e-338e-4ff3-b7ac-15bc9aa6d7ec,3563074c-c791-4e2c-83e1-2172977be487
2,0.923538,2019-04-26 06:25:43.313380,88f7db80-ed8c-46f5-8c0b-3b32ef884150,8169c5ca-8ae4-406d-ae5f-3d10597dcaa2,41e63a2c-855b-4ea2-b237-55a602a26127
3,0.947337,2019-04-21 23:31:38.809855,226d8033-666c-49aa-831d-37d04d693106,43b2d231-5a0e-4c5b-876e-51859e86f0da,658bfe6b-6efc-4efc-b7c9-de53a1a98651
4,0.947802,2019-04-26 08:28:43.719303,d051c064-cdac-4bc5-974b-eae7c1a8702b,98dfbf6d-5db7-4387-9378-4373b9d0cde5,66cc98fe-05ef-480c-a79e-2ba057ee338a


### Join file, event, and body information

While the above results are somewhat useful, we should probably merge information from the other tables too...

In [3]:
# Get the other tables
events = pd.DataFrame(db.select_rows_as_list("event"))
bodies = pd.DataFrame(db.select_rows_as_list("body"))
files = pd.DataFrame(db.select_rows_as_list("file"))

# Merge the tables
events = events.merge(bodies, left_on="body_id", right_on="body_id", suffixes=("_event", "_body"))
transcripts = transcripts.merge(files, left_on="file_id", right_on="file_id", suffixes=("_transcript", "_file"))
transcripts = transcripts.merge(events, left_on="event_id", right_on="event_id", suffixes=("_transcript", "_event"))
transcripts.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,created_file,filename,uri,body_id,created_event,event_datetime,source_uri,video_uri,created_body,description,name
0,0.94473,2019-04-26 06:45:32.822767,2bb1663d-3437-4a3d-98b0-d14862599227,746ed63d-bc65-4d61-9f39-b0fafbbb2e86,102e11db-f444-44db-b788-8d3d3f32f525,2019-04-26 06:45:28.195322,efb7914150e213ef70b7506050b988e4742fe5434fa7e4...,gs://stg-cdp-seattle.appspot.com/efb7914150e21...,ff970f53-9ebb-4741-8342-defb35893dba,2019-04-26 06:45:32.445272,2017-03-17T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ge...,2019-04-26 06:45:32.020269,,"Gender Equity, Safe Communities & New American..."
1,0.881332,2019-04-22 06:56:27.283578,3807e904-a7f6-44a9-8116-667aac02ec93,db6e8e0e-338e-4ff3-b7ac-15bc9aa6d7ec,3563074c-c791-4e2c-83e1-2172977be487,2019-04-22 06:56:25.586170,eb448680c9c0cd7132d91b70a36d1a1c9f6bf7bc5eb148...,gs://stg-cdp-seattle.appspot.com/eb448680c9c0c...,8309112f-85c6-458a-8ef8-879907068177,2019-04-22 06:56:26.878303,2016-12-13T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ci...,2019-04-22 06:56:26.298234,,"Civil Rights, Utilities, Economic Development ..."
2,0.923538,2019-04-26 06:25:43.313380,88f7db80-ed8c-46f5-8c0b-3b32ef884150,8169c5ca-8ae4-406d-ae5f-3d10597dcaa2,41e63a2c-855b-4ea2-b237-55a602a26127,2019-04-26 06:25:39.620593,cffaf38aa33b03c977bf1eace4e9c40eb65ebbb702bcbd...,gs://stg-cdp-seattle.appspot.com/cffaf38aa33b0...,318d0a2a-93d1-417b-aa26-e37ad61b81e8,2019-04-26 06:25:42.903491,2015-05-13T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/fi...,2019-04-21 23:24:47.435263,,Finance and Culture Committee
3,0.947337,2019-04-21 23:31:38.809855,226d8033-666c-49aa-831d-37d04d693106,43b2d231-5a0e-4c5b-876e-51859e86f0da,658bfe6b-6efc-4efc-b7c9-de53a1a98651,2019-04-21 23:31:36.233402,be4a521638e7f738bb2f8441ebac43aa2c4ae8d461511d...,gs://stg-cdp-seattle.appspot.com/be4a521638e7f...,c28e1141-60f2-421d-9c17-e629b57e8890,2019-04-21 23:31:38.209946,2019-04-11T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/li...,2019-04-21 23:31:37.572810,,Select Committee on the Library Levy
4,0.947802,2019-04-26 08:28:43.719303,d051c064-cdac-4bc5-974b-eae7c1a8702b,98dfbf6d-5db7-4387-9378-4373b9d0cde5,66cc98fe-05ef-480c-a79e-2ba057ee338a,2019-04-26 08:28:39.435215,ff6445e0547ec1499a9d1bf85fe32ced9e7b061fd0bc89...,gs://stg-cdp-seattle.appspot.com/ff6445e0547ec...,0b1ae9bd-53c3-4882-900c-fed1b59e2442,2019-04-26 08:28:43.331152,2018-04-11T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ge...,2019-04-26 08:28:42.919760,,"Gender Equity, Safe Communities, New Americans..."
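
The `suffixes` argument matters here because several tables share a `created` column. The sketch below uses two small made-up tables (not real CDP data) to show how pandas renames the overlapping column. Note that `merge` performs an inner join by default, so any row without a match in the other table is dropped.

```python
import pandas as pd

# Toy stand-ins for the "transcript" and "file" tables (all values are made up)
toy_transcripts = pd.DataFrame({
    "transcript_id": ["t1", "t2"],
    "file_id": ["f1", "f2"],
    "created": ["2019-04-21", "2019-04-26"],
})
toy_files = pd.DataFrame({
    "file_id": ["f1", "f2"],
    "filename": ["a_transcript_0.txt", "b_transcript_0.txt"],
    "created": ["2019-04-20", "2019-04-25"],
})

# Both tables carry a "created" column, so pandas appends the given suffixes
merged = toy_transcripts.merge(toy_files, on="file_id", suffixes=("_transcript", "_file"))
print(list(merged.columns))
```

Since the key column has the same name in both tables, `on="file_id"` is equivalent to passing the same value as `left_on` and `right_on`.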


### Downloading and reading a transcript

A manifest is great, but you probably want to work with the transcripts locally.

In [4]:
# Download to local
save_path = fs.download_file(transcripts.loc[12]["filename"])
save_path

[INFO: file_store:  54 2019-04-27 16:38:01,330] Stored external resource copy: /Users/jacksonb/Desktop/active/cdp/cdptools/examples/62e909eddf50bf8aa7626a3629558be42074a6b0a37c1c19c7d90d7340cbcaf7_transcript_0.txt


PosixPath('/Users/jacksonb/Desktop/active/cdp/cdptools/examples/62e909eddf50bf8aa7626a3629558be42074a6b0a37c1c19c7d90d7340cbcaf7_transcript_0.txt')

In [5]:
# Read the transcript
with open(save_path, "r") as read_in:
    print(read_in.read()[:101])

Yeah, and are you going to give it time for 10 minutes? Good afternoon. Today is September 15th, 2014


### Handling multiple transcripts for a single event

A single event may have multiple versions of a transcript. How you filter them is up to you, but a reasonable default is to group by `event_id` and keep the most recently created transcript in each group. You can do this with `pandas.Series.idxmax()`, which returns the index label of the row with the largest (here, latest) value.

For background, see this [Stack Overflow question](https://stackoverflow.com/questions/10202570/pandas-dataframe-find-row-where-values-for-column-is-maximal).

In [6]:
# Group rows
most_recent_transcripts = []
grouped = transcripts.groupby("event_id")
for name, group in grouped:
    most_recent = group.loc[group["created_transcript"].idxmax()]
    most_recent_transcripts.append(most_recent)
    
most_recent = pd.DataFrame(most_recent_transcripts)
most_recent.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,created_file,filename,uri,body_id,created_event,event_datetime,source_uri,video_uri,created_body,description,name
7,0.930153,2019-04-21 23:58:05.245933,0e3bd59c-3f07-452c-83cf-e9eebeb73af2,ebbd9727-d3ef-41ea-82b5-1cda7d1ca050,bb31c1eb-021d-4eb4-8c34-ec97e8871828,2019-04-21 23:58:03.624545,52d797171b74b68246b4ad0f2c4131c125c3a9338688ea...,gs://stg-cdp-seattle.appspot.com/52d797171b74b...,6f38a688-2e96-4e33-841c-883738f9f03d,2019-04-21 23:58:04.832481,2017-06-27T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ge...,2019-04-21 23:58:04.378827,,"Gender Equity, Safe Communities & New Americans"
8,0.891255,2019-04-26 05:07:41.610656,157aabee-5b7d-4821-9b5d-31f96457b67a,e40e9af3-1732-44fd-9701-ec493775868a,ca5bc859-f623-4f26-a0e0-88cc39b2ee9d,2019-04-26 05:07:37.487206,bed1f27c82b05feae4b55a0f7d924fdb9b2323a010dc2c...,gs://stg-cdp-seattle.appspot.com/bed1f27c82b05...,ebf7278b-debd-4d19-b0fd-ab1bdeb0f595,2019-04-26 05:07:41.088262,2018-08-08T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/fi...,2019-04-26 05:07:40.696253,,Finance & Neighborhoods Committee
5,0.943861,2019-04-26 15:06:34.578380,1802e1eb-5225-4219-b2a9-c5336259ce76,aabdb800-22cf-49f6-92e7-337d1c8294bd,9e72a19e-d66e-49df-b71a-4c1ed31711a5,2019-04-26 15:06:31.968155,7188dee9e2b8b77c71bf144c4457eadc50739e997128a3...,gs://stg-cdp-seattle.appspot.com/7188dee9e2b8b...,f92ad6c2-3994-465e-bc8c-9abeb777ebd2,2019-04-26 15:06:34.285370,2012-06-11T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/Tr...,2019-04-26 15:06:34.021277,,Seattle Transportation Benefit District Board
11,0.92734,2019-04-21 23:23:31.532067,1ffb5920-3c23-4084-b287-cef74c9c56c8,0099dfe6-ee18-4a16-a9b5-e7a053d9582d,d4e6bb99-6624-44af-a25f-dfde9f69bf29,2019-04-21 23:23:29.888746,165f1f333b3607748d8beca97e59a6b273fde4e2cf7b57...,gs://stg-cdp-seattle.appspot.com/165f1f333b360...,44a794de-6e1d-43dd-ac9f-317924345bdb,2019-04-21 23:23:30.958242,2017-12-06T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ed...,2019-04-21 23:23:30.575342,,"Education, Equity, and Governance Committee"
3,0.947337,2019-04-21 23:31:38.809855,226d8033-666c-49aa-831d-37d04d693106,43b2d231-5a0e-4c5b-876e-51859e86f0da,658bfe6b-6efc-4efc-b7c9-de53a1a98651,2019-04-21 23:31:36.233402,be4a521638e7f738bb2f8441ebac43aa2c4ae8d461511d...,gs://stg-cdp-seattle.appspot.com/be4a521638e7f...,c28e1141-60f2-421d-9c17-e629b57e8890,2019-04-21 23:31:38.209946,2019-04-11T00:00:00,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/li...,2019-04-21 23:31:37.572810,,Select Committee on the Library Levy
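
The same selection can also be written without an explicit loop. This is a minimal sketch on made-up data, assuming `created_transcript` holds datetime values (convert with `pd.to_datetime` first if it holds strings): calling `idxmax()` on the grouped column returns one index label per event, and `.loc` then pulls those rows.

```python
import pandas as pd

# Toy data: event "e1" has two transcript versions, "e2" has one (values are made up)
toy = pd.DataFrame({
    "event_id": ["e1", "e1", "e2"],
    "transcript_id": ["t1", "t2", "t3"],
    "created_transcript": pd.to_datetime(
        ["2019-04-21 10:00", "2019-04-26 09:00", "2019-04-22 08:00"]
    ),
})

# One index label per event_id: the row with the latest creation time
latest_idx = toy.groupby("event_id")["created_transcript"].idxmax()

# Select those rows from the original frame
latest = toy.loc[latest_idx]
print(latest["transcript_id"].tolist())
```

For `e1`, only the newer version (`t2`) is kept; `e2` keeps its single transcript.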


### A function to download the most recent transcripts

To make gathering the most recent transcripts easier, let's wrap this example up with a function to do all this for us.

**Note:** The function defined below is available in the `cdptools.utils` module.

`from cdptools.utils import download_most_recent_transcripts`

In [7]:
from pathlib import Path
from typing import Optional, Union

from cdptools.databases import Database
from cdptools.file_stores import FileStore


def download_most_recent_transcripts(db: Database, fs: FileStore, save_dir: Optional[Union[str, Path]] = None) -> Path:
    """
    Download the most recent versions of event transcripts.

    :param db: An already initialized `Database` object to query against.
    :param fs: An already initialized `FileStore` object to query against.
    :param save_dir: Path to a directory to save the transcripts and the dataset manifest.
    :return: Fully resolved path where transcripts and dataset manifest were stored.
    """
    # Use the current directory if none is provided
    if save_dir is None:
        save_dir = "."
    
    # Resolve save directory
    save_dir = Path(save_dir).resolve()
    
    # Make the save directory if it does not already exist
    save_dir.mkdir(parents=True, exist_ok=True)
    
    # Get transcript dataset
    transcripts = pd.DataFrame(db.select_rows_as_list("transcript"))
    events = pd.DataFrame(db.select_rows_as_list("event"))
    bodies = pd.DataFrame(db.select_rows_as_list("body"))
    files = pd.DataFrame(db.select_rows_as_list("file"))
    events = events.merge(bodies, left_on="body_id", right_on="body_id", suffixes=("_event", "_body"))
    transcripts = transcripts.merge(files, left_on="file_id", right_on="file_id", suffixes=("_transcript", "_file"))
    transcripts = transcripts.merge(events, left_on="event_id", right_on="event_id", suffixes=("_transcript", "_event"))
    
    # Group
    most_recent_transcripts = []
    grouped = transcripts.groupby("event_id")
    for name, group in grouped:
        most_recent = group.loc[group["created_transcript"].idxmax()]
        most_recent_transcripts.append(most_recent)
    
    most_recent = pd.DataFrame(most_recent_transcripts)
    
    # Begin storage
    most_recent.apply(
        lambda r: fs.download_file(r["filename"], save_dir / r["filename"], overwrite=True), 
        axis=1
    )
    
    # Write manifest
    most_recent.to_csv(save_dir / "transcript_manifest.csv", index=False)
    
    return save_dir

In [8]:
save_dir = download_most_recent_transcripts(db, fs, "most_recent_seattle_transcripts")
save_dir

[INFO: file_store:  54 2019-04-27 16:38:03,398] Stored external resource copy: /Users/jacksonb/Desktop/active/cdp/cdptools/examples/most_recent_seattle_transcripts/52d797171b74b68246b4ad0f2c4131c125c3a9338688eaf83109ae719fff2bee_transcript_0.txt
[INFO: file_store:  54 2019-04-27 16:38:03,645] Stored external resource copy: /Users/jacksonb/Desktop/active/cdp/cdptools/examples/most_recent_seattle_transcripts/bed1f27c82b05feae4b55a0f7d924fdb9b2323a010dc2cdcb0d264988377ab68_transcript_0.txt
[INFO: file_store:  54 2019-04-27 16:38:03,868] Stored external resource copy: /Users/jacksonb/Desktop/active/cdp/cdptools/examples/most_recent_seattle_transcripts/7188dee9e2b8b77c71bf144c4457eadc50739e997128a396b51dae90c377e949_transcript_0.txt
[INFO: file_store:  54 2019-04-27 16:38:04,096] Stored external resource copy: /Users/jacksonb/Desktop/active/cdp/cdptools/examples/most_recent_seattle_transcripts/165f1f333b3607748d8beca97e59a6b273fde4e2cf7b577f0f1b2fab631952a2_transcript_0.txt
[INFO: file_stor

PosixPath('/Users/jacksonb/Desktop/active/cdp/cdptools/examples/most_recent_seattle_transcripts')
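
Once the transcripts are on disk, the manifest can be used to load them back. The helper below is a hypothetical sketch (`load_transcript_texts` is not part of `cdptools`); it assumes the layout produced by the function above, with `transcript_manifest.csv` sitting next to the transcript files named in its `filename` column. The demo runs against a throwaway directory with made-up contents rather than real downloads.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_transcript_texts(save_dir):
    """Map each filename listed in the manifest to the text of that transcript."""
    save_dir = Path(save_dir)
    manifest = pd.read_csv(save_dir / "transcript_manifest.csv")
    return {name: (save_dir / name).read_text() for name in manifest["filename"]}


# Demo: build a tiny directory that mimics the expected layout (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "abc_transcript_0.txt").write_text("Good afternoon.")
    pd.DataFrame(
        {"event_id": ["e1"], "filename": ["abc_transcript_0.txt"]}
    ).to_csv(tmp / "transcript_manifest.csv", index=False)

    texts = load_transcript_texts(tmp)

print(texts)
```

With real data, passing the `save_dir` returned by `download_most_recent_transcripts` should work the same way, provided the files were stored under the names recorded in the manifest.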