# Working with CDP Transcripts

Methods for retrieving open-access transcript data from a CDP instance.

Before working through this example, read the [transcript_formats](../docs/transcript_formats.md) documentation.

A database schema diagram for production instances of CDP may be found [here](https://github.com/CouncilDataProject/cdptools/blob/master/docs/resources/database_diagram.pdf).

# Connecting to resources

Having access to both the CDP instance's database and file store will make accessing and using the transcripts easiest. We recommend reading the database and file store usage notebooks before working through this one.

For details on database usage, refer to the notebook example on database basics [here](./database.ipynb).

For details on file store usage, refer to the notebook example on file store basics [here](./file_store.ipynb).

**Note:** This notebook connects to the staging instance of Seattle's Firestore database and file store. To use production data, connect to the Cloud Firestore instance: `cdp-seattle`. To use production files, connect to the GCS instance: `cdp-seattle.appspot.com`.

In [1]:
from cdptools.databases.cloud_firestore_database import CloudFirestoreDatabase
from cdptools.file_stores.gcs_file_store import GCSFileStore
import pandas as pd

db = CloudFirestoreDatabase("stg-cdp-seattle")
fs = GCSFileStore("stg-cdp-seattle.appspot.com")
db, fs

(<CloudFirestoreDatabase [stg-cdp-seattle]>,
 <GCSFileStore [stg-cdp-seattle.appspot.com]>)
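
The note above lists a database name and a file store name for both staging and production. A minimal sketch of keeping those pairs in one place follows. The `CDP_SEATTLE_INSTANCES` mapping and the `get_instance_names` helper are hypothetical names introduced for illustration and are not part of `cdptools`. Only the four instance names come from this notebook.

```python
# Hypothetical lookup of the instance names mentioned in the note above.
CDP_SEATTLE_INSTANCES = {
    "staging": ("stg-cdp-seattle", "stg-cdp-seattle.appspot.com"),
    "production": ("cdp-seattle", "cdp-seattle.appspot.com"),
}


def get_instance_names(environment: str = "staging") -> tuple:
    """Return (database_name, file_store_name) for the given environment."""
    try:
        return CDP_SEATTLE_INSTANCES[environment]
    except KeyError:
        valid = ", ".join(sorted(CDP_SEATTLE_INSTANCES))
        raise ValueError(
            f"Unknown environment {environment!r}; expected one of: {valid}"
        ) from None


database_name, file_store_name = get_instance_names("production")
print(database_name, file_store_name)
```

With `cdptools` installed, the two names would then be passed to `CloudFirestoreDatabase` and `GCSFileStore` in the same way as in the cell above.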

### Find all transcripts

Simply query the `transcript` table and load the rows into a DataFrame.

In [2]:
transcripts = pd.DataFrame(db.select_rows_as_list("transcript"))
transcripts.head()

Unnamed: 0,confidence,created,event_id,file_id,transcript_id
0,0.92438,2019-07-20 23:34:22.276948,846f548a-23b4-4af8-8889-0ce04071bfe7,6d6dc323-44ee-419c-a156-24420eb959f3,0dfbdbee-f858-4878-8073-a59585f1d708
1,0.955153,2019-07-20 23:12:34.389277,886359ee-3ef1-4ae1-a0d7-5bed9c5a26e8,eea51e15-4bd8-40f8-8122-1d4f43ae77d1,1366c1f0-5c82-4520-aa79-f6d4f64166cc
2,0.941176,2019-07-20 23:36:07.383284,b2393ed3-cfa6-4941-a502-725e1c71aec4,a092bdf1-0bec-49b2-9ddc-70201bcdd1f4,190ee670-0437-4b03-a5a5-18556b5346c0
3,0.956912,2019-07-20 23:52:32.172165,9338fedb-ff60-4b8b-9467-dde1dd6a4afb,3506e8d3-0377-484a-9e1b-5fdbc8b44051,1cf397da-cd77-426d-b6ca-789fec62b6c5
4,0.936947,2019-07-20 23:27:48.141518,ea27454a-d68c-4ab6-8265-67788f23d3c1,7155077d-325d-4dd7-b334-1300d8eacf10,1dcbc26c-f837-4f9d-b5ea-e71d003046b0
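
Once the rows are in a DataFrame, ordinary pandas filtering applies. The sketch below uses a small hand-made stand-in for the `transcripts` DataFrame, reusing the `confidence` and `created` column names shown above. The ids and values are illustrative, and the `MIN_CONFIDENCE` threshold is an arbitrary assumption rather than a CDP recommendation.

```python
import pandas as pd

# Toy stand-in for the `transcripts` DataFrame; ids and values are made up.
toy_transcripts = pd.DataFrame(
    {
        "transcript_id": ["t-1", "t-2", "t-3"],
        "event_id": ["e-1", "e-2", "e-3"],
        "confidence": [0.92, 0.96, 0.94],
        "created": pd.to_datetime(
            [
                "2019-07-20 23:34:22",
                "2019-07-20 23:12:34",
                "2019-07-20 23:36:07",
            ]
        ),
    }
)

MIN_CONFIDENCE = 0.93  # hypothetical threshold, chosen only for this example

# Keep rows at or above the threshold, newest first.
high_confidence = (
    toy_transcripts[toy_transcripts["confidence"] >= MIN_CONFIDENCE]
    .sort_values("created", ascending=False)
    .reset_index(drop=True)
)
print(high_confidence[["transcript_id", "confidence", "created"]])
```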


### Join file, event, and body information

The transcript rows alone only carry ids, so merge in the file, event, and body tables to get filenames, event dates, and body names alongside each transcript.

In [3]:
# Get the other tables
events = pd.DataFrame(db.select_rows_as_list("event"))
bodies = pd.DataFrame(db.select_rows_as_list("body"))
files = pd.DataFrame(db.select_rows_as_list("file"))

# Merge the tables
events = events.merge(bodies, on="body_id", suffixes=("_event", "_body"))
transcripts = transcripts.merge(files, on="file_id", suffixes=("_transcript", "_file"))
transcripts = transcripts.merge(events, on="event_id", suffixes=("_transcript", "_event"))
transcripts.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,content_type,created_file,description_transcript,filename,uri,...,created_event,event_datetime,legistar_event_id,legistar_event_link,minutes_file_uri,source_uri,video_uri,created_body,description_event,name
0,0.92438,2019-07-20 23:34:22.276948,846f548a-23b4-4af8-8889-0ce04071bfe7,6d6dc323-44ee-419c-a156-24420eb959f3,0dfbdbee-f858-4878-8073-a59585f1d708,,2019-07-20 23:33:38.096775,,67f7625bb7f9e59c20dc3f98089f13f48e7a71e4a3afbe...,gs://stg-cdp-seattle.appspot.com/67f7625bb7f9e...,...,2019-07-20 23:33:41.354756,2019-06-24 14:00:00,4007,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/FullCouncil?vide...,http://video.seattle.gov:8080/media/council/co...,2019-07-20 23:11:01.877009,,City Council
1,0.955153,2019-07-20 23:12:34.389277,886359ee-3ef1-4ae1-a0d7-5bed9c5a26e8,eea51e15-4bd8-40f8-8122-1d4f43ae77d1,1366c1f0-5c82-4520-aa79-f6d4f64166cc,,2019-07-20 23:12:27.475829,,a7eb608ceb4ac7e35875b006ad9afbe69bfbd07cbb8be8...,gs://stg-cdp-seattle.appspot.com/a7eb608ceb4ac...,...,2019-07-20 23:12:29.670998,2019-07-08 09:30:00,4024,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/CouncilBriefings...,http://video.seattle.gov:8080/media/council/br...,2019-07-20 23:12:29.468004,,Council Briefing
2,0.941176,2019-07-20 23:36:07.383284,b2393ed3-cfa6-4941-a502-725e1c71aec4,a092bdf1-0bec-49b2-9ddc-70201bcdd1f4,190ee670-0437-4b03-a5a5-18556b5346c0,,2019-07-20 23:36:00.489374,,1c765c4c84649938e36fa141903992bd066dc744fb66c0...,gs://stg-cdp-seattle.appspot.com/1c765c4c84649...,...,2019-07-20 23:36:02.527396,2019-06-25 09:30:00,4009,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ci...,2019-07-20 23:27:44.607321,,"Civil Rights, Utilities, Economic Development,..."
3,0.956912,2019-07-20 23:52:32.172165,9338fedb-ff60-4b8b-9467-dde1dd6a4afb,3506e8d3-0377-484a-9e1b-5fdbc8b44051,1cf397da-cd77-426d-b6ca-789fec62b6c5,,2019-07-20 23:52:27.442038,,190c092c1f7406a15f6f3a7ce62f5bdd1cf42d303b9a61...,gs://stg-cdp-seattle.appspot.com/190c092c1f740...,...,2019-07-20 23:52:29.762367,2019-07-16 12:00:00,4032,https://seattle.legistar.com/MeetingDetail.asp...,,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ge...,2019-07-20 23:52:29.558243,,"Gender Equity, Safe Communities, New Americans..."
4,0.936947,2019-07-20 23:27:48.141518,ea27454a-d68c-4ab6-8265-67788f23d3c1,7155077d-325d-4dd7-b334-1300d8eacf10,1dcbc26c-f837-4f9d-b5ea-e71d003046b0,,2019-07-20 23:27:42.052487,,0305a78668f2798b1e11e713153811431335247d7781de...,gs://stg-cdp-seattle.appspot.com/0305a78668f27...,...,2019-07-20 23:27:44.848637,2019-07-09 09:30:00,4016,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ci...,2019-07-20 23:27:44.607321,,"Civil Rights, Utilities, Economic Development,..."
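
Two details of the merges above are easy to miss: the `suffixes` argument is what produces column names such as `created_transcript` and `created_file`, and the default inner join silently drops transcripts whose `file_id` has no match. The sketch below shows both on toy data. The ids and filename are made up, and checking for unmatched rows with `indicator=True` is a suggestion rather than part of the original workflow.

```python
import pandas as pd

# Toy tables that share a `created` column, as the real tables do.
toy_transcripts = pd.DataFrame(
    {
        "transcript_id": ["t-1", "t-2"],
        "file_id": ["f-1", "f-missing"],
        "created": ["2019-07-20 23:34:22", "2019-07-20 23:12:34"],
    }
)
toy_files = pd.DataFrame(
    {
        "file_id": ["f-1"],
        "filename": ["example_transcript.json"],
        "created": ["2019-07-20 23:33:38"],
    }
)

# Default inner join: overlapping columns get the suffixes, and
# the transcript without a matching file is dropped.
inner = toy_transcripts.merge(
    toy_files, on="file_id", suffixes=("_transcript", "_file")
)
print(list(inner.columns))

# Left join with an indicator column to list transcripts lacking a file row.
checked = toy_transcripts.merge(
    toy_files,
    on="file_id",
    how="left",
    suffixes=("_transcript", "_file"),
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "transcript_id"].tolist()
print(unmatched)
```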


### Downloading and reading a transcript

A manifest is great, but you probably want to work with the transcripts locally.

In [4]:
# Download to local
save_path = fs.download_file(transcripts.loc[3]["filename"])
save_path

PosixPath('/home/maxfield/active/cdp/cdptools/examples/190c092c1f7406a15f6f3a7ce62f5bdd1cf42d303b9a610463d03ebb1e619285_ts_sentences_transcript_0.json')

In [5]:
# Read the transcript and print the first three sentences
import json

with open(save_path, "r") as read_in:
    transcript = json.load(read_in)

for s in transcript["data"][:3]:
    print(s)

{'text': 'Good afternoon.', 'start_time': 16.3, 'end_time': 18.2}
{'text': 'Today is Tuesday, July 16th.', 'start_time': 18.2, 'end_time': 20.7}
{'text': '2008 is 12:05 p.m.', 'start_time': 20.7, 'end_time': 25.4}
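
Each entry under `"data"` is a dictionary with `text`, `start_time`, and `end_time`, so common tasks reduce to list comprehensions. The sketch below rebuilds the three sentences printed above as an in-memory JSON string instead of reading the downloaded file. Any other top-level keys the real file may contain are omitted, and `sentences_between` is a hypothetical helper written for this example.

```python
import json

# In-memory stand-in for the downloaded file, using the sentences shown above.
raw = json.dumps(
    {
        "data": [
            {"text": "Good afternoon.", "start_time": 16.3, "end_time": 18.2},
            {"text": "Today is Tuesday, July 16th.", "start_time": 18.2, "end_time": 20.7},
            {"text": "2008 is 12:05 p.m.", "start_time": 20.7, "end_time": 25.4},
        ]
    }
)
transcript = json.loads(raw)
sentences = transcript["data"]

# Join every sentence into one block of text.
full_text = " ".join(s["text"] for s in sentences)

# Seconds covered, from the first sentence's start to the last one's end.
duration_seconds = sentences[-1]["end_time"] - sentences[0]["start_time"]


def sentences_between(sentences, start, end):
    """Return sentences that fall entirely inside [start, end] seconds."""
    return [
        s for s in sentences
        if s["start_time"] >= start and s["end_time"] <= end
    ]


window = sentences_between(sentences, 18.0, 21.0)
print(full_text)
print(round(duration_seconds, 1))
print([s["text"] for s in window])
```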


### Handling multiple transcripts for a single event

A single event may have multiple versions of a transcript. How you filter them down is up to you, but a reasonable default is to group by `event_id` and keep the most recently created transcript in each group. `pandas.Series.idxmax()` returns the index label of the maximum value, which makes this straightforward.

For background on finding the row with the maximal column value, see this [Stack Overflow question](https://stackoverflow.com/questions/10202570/pandas-dataframe-find-row-where-values-for-column-is-maximal).

In [6]:
# Group rows
most_recent_transcripts = []
grouped = transcripts.groupby("event_id")
for name, group in grouped:
    most_recent = group.loc[group["created_transcript"].idxmax()]
    most_recent_transcripts.append(most_recent)
    
most_recent = pd.DataFrame(most_recent_transcripts)
most_recent.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,content_type,created_file,description_transcript,filename,uri,...,created_event,event_datetime,legistar_event_id,legistar_event_link,minutes_file_uri,source_uri,video_uri,created_body,description_event,name
7,0.937676,2019-07-21 00:29:29.463397,1408bf08-d6c0-4ab7-96de-e1ef2294f075,bddb0ef5-6d2c-4b4f-82b8-b326123c6ecb,35e6b0e6-0842-4a11-86a9-59f8edec161e,,2019-07-21 00:29:21.508901,,abbd14724d2a7afc8d292c3879ff9d791ac0907560ce6b...,gs://stg-cdp-seattle.appspot.com/abbd14724d2a7...,...,2019-07-21 00:29:25.001338,2019-07-17 09:30:00,4033,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/pl...,2019-07-21 00:29:24.783820,,"Planning, Land Use, and Zoning Committee"
19,0.950578,2019-07-20 23:44:44.957399,1d0c6214-1343-463d-b0ab-03a3d5e36c5d,75ba93e0-d42f-4790-9d3f-9d80e5168d28,bf6072ca-f7dc-4da0-be03-3e95358f149b,,2019-07-20 23:44:38.215966,,e8fdcb53624e9f409ae630dd23db5c338c10b1b70f65e7...,gs://stg-cdp-seattle.appspot.com/e8fdcb53624e9...,...,2019-07-20 23:44:41.348645,2019-06-24 10:30:00,4008,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ho...,2019-07-20 23:44:41.123429,,Select Committee on Homelessness and Housing A...
17,0.942756,2019-07-21 00:16:08.079434,276c6722-b35c-4341-b7fe-304d874a17ef,52de772e-0c2c-4778-8c90-3e791b6227c0,9da73a31-95ab-4a58-97ac-6ca40ad7cae4,,2019-07-21 00:16:01.613492,,011c9aaaf743549497c685cbcd52afc9d3a55226eff359...,gs://stg-cdp-seattle.appspot.com/011c9aaaf7435...,...,2019-07-21 00:16:04.078408,2019-06-25 14:00:00,4012,https://seattle.legistar.com/MeetingDetail.asp...,,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/hu...,2019-07-21 00:16:03.860927,,"Human Services, Equitable Development, and Ren..."
9,0.944773,2019-07-21 00:21:34.778762,294015e9-ee34-4de3-80d3-27175a127275,3c74995b-8241-4617-af11-bd635882b5f2,58264323-6ce1-464a-8f2d-579b671c3031,,2019-07-21 00:21:28.790639,,294bc99e4e9eccf906b59538de5a074c2b2deaaf6ca5c4...,gs://stg-cdp-seattle.appspot.com/294bc99e4e9ec...,...,2019-07-21 00:21:31.105103,2019-06-27 12:00:00,4013,https://seattle.legistar.com/MeetingDetail.asp...,,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ho...,2019-07-21 00:13:11.547208,,"Housing, Health, Energy, and Workers’ Rights C..."
10,0.938222,2019-07-20 23:28:38.729629,2cbb7f4b-2e19-4f44-a99e-a0dcebdb3ac3,8a9d3837-ce65-4744-b0d6-0c9f44ca14fe,68968c69-d8de-4912-bfa0-d4271e6646c7,,2019-07-20 23:28:24.950920,,c93c28f3ac6e5bdbcf23dcdff2434ac2cd9bf36b8d44f2...,gs://stg-cdp-seattle.appspot.com/c93c28f3ac6e5...,...,2019-07-20 23:28:27.305315,2019-07-15 10:30:00,4029,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ar...,2019-07-20 23:28:27.107855,,Select Committee on Civic Arenas
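
The loop above can also be written as a single indexing step by calling `idxmax()` on the grouped column. This is an alternative sketch on toy data, not the approach the helper functions are known to use. It assumes `created_transcript` holds datetimes, so string timestamps would first need `pd.to_datetime`.

```python
import pandas as pd

# Toy data: event e-1 has two transcript versions, event e-2 has one.
toy = pd.DataFrame(
    {
        "event_id": ["e-1", "e-1", "e-2"],
        "transcript_id": ["t-old", "t-new", "t-only"],
        "created_transcript": pd.to_datetime(
            [
                "2019-07-20 23:00:00",
                "2019-07-21 00:30:00",
                "2019-07-20 23:15:00",
            ]
        ),
    }
)

# For each event, find the index label of the newest transcript...
latest_idx = toy.groupby("event_id")["created_transcript"].idxmax()

# ...then select those rows in one step.
most_recent_toy = toy.loc[latest_idx]
print(most_recent_toy)
```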


### A function to download the most recent transcripts

To make gathering the most recent transcripts easier, `cdptools.research_utils.transcripts` provides two helper functions that wrap the steps above: one returns the manifest and the other downloads the files.

In [7]:
from pathlib import Path
from typing import Optional, Union

# Note: this import rebinds the name `transcripts`, replacing the DataFrame defined earlier
from cdptools.research_utils import transcripts

transcript_manifest = transcripts.get_most_recent_transcript_manifest(db)
transcript_manifest.head()

Unnamed: 0,confidence,created_transcript,event_id,file_id,transcript_id,content_type,created_file,description_transcript,filename,uri,...,created_event,event_datetime,legistar_event_id,legistar_event_link,minutes_file_uri,source_uri,video_uri,created_body,description_event,name
7,0.937676,2019-07-21 00:29:29.463397,1408bf08-d6c0-4ab7-96de-e1ef2294f075,bddb0ef5-6d2c-4b4f-82b8-b326123c6ecb,35e6b0e6-0842-4a11-86a9-59f8edec161e,,2019-07-21 00:29:21.508901,,abbd14724d2a7afc8d292c3879ff9d791ac0907560ce6b...,gs://stg-cdp-seattle.appspot.com/abbd14724d2a7...,...,2019-07-21 00:29:25.001338,2019-07-17 09:30:00,4033,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/pl...,2019-07-21 00:29:24.783820,,"Planning, Land Use, and Zoning Committee"
19,0.950578,2019-07-20 23:44:44.957399,1d0c6214-1343-463d-b0ab-03a3d5e36c5d,75ba93e0-d42f-4790-9d3f-9d80e5168d28,bf6072ca-f7dc-4da0-be03-3e95358f149b,,2019-07-20 23:44:38.215966,,e8fdcb53624e9f409ae630dd23db5c338c10b1b70f65e7...,gs://stg-cdp-seattle.appspot.com/e8fdcb53624e9...,...,2019-07-20 23:44:41.348645,2019-06-24 10:30:00,4008,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ho...,2019-07-20 23:44:41.123429,,Select Committee on Homelessness and Housing A...
17,0.942756,2019-07-21 00:16:08.079434,276c6722-b35c-4341-b7fe-304d874a17ef,52de772e-0c2c-4778-8c90-3e791b6227c0,9da73a31-95ab-4a58-97ac-6ca40ad7cae4,,2019-07-21 00:16:01.613492,,011c9aaaf743549497c685cbcd52afc9d3a55226eff359...,gs://stg-cdp-seattle.appspot.com/011c9aaaf7435...,...,2019-07-21 00:16:04.078408,2019-06-25 14:00:00,4012,https://seattle.legistar.com/MeetingDetail.asp...,,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/hu...,2019-07-21 00:16:03.860927,,"Human Services, Equitable Development, and Ren..."
9,0.944773,2019-07-21 00:21:34.778762,294015e9-ee34-4de3-80d3-27175a127275,3c74995b-8241-4617-af11-bd635882b5f2,58264323-6ce1-464a-8f2d-579b671c3031,,2019-07-21 00:21:28.790639,,294bc99e4e9eccf906b59538de5a074c2b2deaaf6ca5c4...,gs://stg-cdp-seattle.appspot.com/294bc99e4e9ec...,...,2019-07-21 00:21:31.105103,2019-06-27 12:00:00,4013,https://seattle.legistar.com/MeetingDetail.asp...,,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ho...,2019-07-21 00:13:11.547208,,"Housing, Health, Energy, and Workers’ Rights C..."
10,0.938222,2019-07-20 23:28:38.729629,2cbb7f4b-2e19-4f44-a99e-a0dcebdb3ac3,8a9d3837-ce65-4744-b0d6-0c9f44ca14fe,68968c69-d8de-4912-bfa0-d4271e6646c7,,2019-07-20 23:28:24.950920,,c93c28f3ac6e5bdbcf23dcdff2434ac2cd9bf36b8d44f2...,gs://stg-cdp-seattle.appspot.com/c93c28f3ac6e5...,...,2019-07-20 23:28:27.305315,2019-07-15 10:30:00,4029,https://seattle.legistar.com/MeetingDetail.asp...,http://legistar2.granicus.com/seattle/meetings...,http://www.seattlechannel.org/mayor-and-counci...,http://video.seattle.gov:8080/media/council/ar...,2019-07-20 23:28:27.107855,,Select Committee on Civic Arenas


In [8]:
event_corpus_map = transcripts.download_most_recent_transcripts(db, fs)
event_corpus_map

{'1408bf08-d6c0-4ab7-96de-e1ef2294f075': PosixPath('/home/maxfield/active/cdp/cdptools/examples/abbd14724d2a7afc8d292c3879ff9d791ac0907560ce6b4c416e45f8e54f65cb_ts_sentences_transcript_0.json'),
 '1d0c6214-1343-463d-b0ab-03a3d5e36c5d': PosixPath('/home/maxfield/active/cdp/cdptools/examples/e8fdcb53624e9f409ae630dd23db5c338c10b1b70f65e729c2e6696128aacd04_ts_sentences_transcript_0.json'),
 '276c6722-b35c-4341-b7fe-304d874a17ef': PosixPath('/home/maxfield/active/cdp/cdptools/examples/011c9aaaf743549497c685cbcd52afc9d3a55226eff3590a865d9420422249da_ts_sentences_transcript_0.json'),
 '294015e9-ee34-4de3-80d3-27175a127275': PosixPath('/home/maxfield/active/cdp/cdptools/examples/294bc99e4e9eccf906b59538de5a074c2b2deaaf6ca5c4b82250d652102126fc_ts_sentences_transcript_0.json'),
 '2cbb7f4b-2e19-4f44-a99e-a0dcebdb3ac3': PosixPath('/home/maxfield/active/cdp/cdptools/examples/c93c28f3ac6e5bdbcf23dcdff2434ac2cd9bf36b8d44f2ba8e145a94b039f545_ts_sentences_transcript_0.json'),
 '2cf94f65-6332-4b3b-9481