# Working with CDP Transcripts

This example shows how to find, download, and read the transcripts that a CDP instance publishes as open-access data.

Before working through this example, we recommend reading the [transcript_formats](../docs/transcript_formats.md) documentation.

A database schema diagram for production instances of CDP may be found [here](https://github.com/CouncilDataProject/cdptools/blob/master/docs/resources/database_diagram.pdf).

In [1]:
from cdptools import CDPInstance, configs

seattle = CDPInstance(configs.SEATTLE)
seattle

<CDPInstance [database: <CloudFirestoreDatabase [cdp-seattle]>, file_store: <GCSFileStore [cdp-seattle.appspot.com]>]>

### Find all transcripts

Simply query the transcript table.

In [2]:
import pandas as pd
transcripts = pd.DataFrame(seattle.database.select_rows_as_list("transcript"))
transcripts.head()

| | transcript_id | confidence | event_id | file_id | created |
|---|---|---|---|---|---|
| 0 | 0008c494-66e4-4459-894c-52935580f776 | 0.950079 | 56eeb4fc-079f-425e-bd76-8727b78365dd | 445896ff-8eca-4594-a49b-7b088244cb6d | 2019-08-08 08:55:27.733798 |
| 1 | 00fd1e9f-9718-4ba9-bc24-01ddaad4fff2 | 0.946712 | cd2b1685-e6ef-4084-a9d0-d8ee38e2a1da | 29e15ed4-a03b-4d64-9078-c7a13101df6e | 2019-08-08 17:00:39.151725 |
| 2 | 021b763c-437f-4bcb-b458-4eec1b9c9058 | 0.919946 | 33ab51f9-911b-4965-82ba-e7a62ab26221 | 5c91e7bd-4e76-4ff3-b16e-25e4ff8bd6a6 | 2019-10-22 01:55:25.680446 |
| 3 | 02b4e049-07a6-4355-a16c-2b3066ec75a8 | 0.925703 | e1064f64-da37-480f-9de3-b4d622dd08d9 | 99f1a4f2-0ea2-449a-99af-d6163550f024 | 2019-08-08 16:09:00.554294 |
| 4 | 069f42b1-94de-4edd-b197-8d4d8ca7563d | 0.946464 | 3932cb65-bc2f-48d6-8604-e0ddd8896491 | 6527b5c6-4ff8-468b-b356-480569762212 | 2019-08-08 07:13:36.980601 |
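Once the table is in a DataFrame, ordinary pandas operations apply. The sketch below is only an illustration: it rebuilds three rows from the output above by hand, and the `0.93` threshold is an arbitrary value chosen for the example, not a cutoff defined by CDP. Calling `pd.to_datetime` is harmless if the `created` column already holds datetimes.

```python
import pandas as pd

# Three rows copied from the output above; in practice, use the full
# `transcripts` DataFrame returned by the query instead.
sample = pd.DataFrame({
    "transcript_id": [
        "0008c494-66e4-4459-894c-52935580f776",
        "00fd1e9f-9718-4ba9-bc24-01ddaad4fff2",
        "021b763c-437f-4bcb-b458-4eec1b9c9058",
    ],
    "confidence": [0.950079, 0.946712, 0.919946],
    "created": [
        "2019-08-08 08:55:27.733798",
        "2019-08-08 17:00:39.151725",
        "2019-10-22 01:55:25.680446",
    ],
})

# Make sure creation times compare as timestamps rather than as strings.
sample["created"] = pd.to_datetime(sample["created"])

# Transcripts below an (arbitrary) confidence threshold.
low_confidence = sample[sample["confidence"] < 0.93]

# Newest transcripts first.
newest_first = sample.sort_values("created", ascending=False)
```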


### Join file, event, and body information

The transcript table on its own holds little more than IDs, so merge in the file, event, and body tables to get filenames, meeting dates, and body names.

In [3]:
# Get the other tables
events = pd.DataFrame(seattle.database.select_rows_as_list("event"))
bodies = pd.DataFrame(seattle.database.select_rows_as_list("body"))
files = pd.DataFrame(seattle.database.select_rows_as_list("file"))

# Merge the tables
events = events.merge(bodies, on="body_id", suffixes=("_event", "_body"))
transcripts = transcripts.merge(files, on="file_id", suffixes=("_transcript", "_file"))
transcripts = transcripts.merge(events, on="event_id", suffixes=("_transcript", "_event"))
transcripts.head()

Unnamed: 0,transcript_id,confidence,event_id,file_id,created_transcript,content_type,filename,description_transcript,created_file,uri,...,created_event,body_id,legistar_event_link,source_uri,legistar_event_id,event_datetime,agenda_file_uri,name,description_event,created_body
0,0008c494-66e4-4459-894c-52935580f776,0.950079,56eeb4fc-079f-425e-bd76-8727b78365dd,445896ff-8eca-4594-a49b-7b088244cb6d,2019-08-08 08:55:27.733798,,0985c2571e2da7fd37cf2e5fc3535dc280ce9d42c886d3...,,2019-08-08 08:55:17.193110,gs://cdp-seattle.appspot.com/0985c2571e2da7fd3...,...,2019-08-08 08:55:19.421827,758a36d7-c5aa-4cba-af00-0a1ce44ccb39,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,3915,2019-05-02 09:30:00,http://legistar2.granicus.com/seattle/meetings...,"Housing, Health, Energy, and Workers’ Rights C...",,2019-08-08 08:02:05.625714
1,00fd1e9f-9718-4ba9-bc24-01ddaad4fff2,0.946712,cd2b1685-e6ef-4084-a9d0-d8ee38e2a1da,29e15ed4-a03b-4d64-9078-c7a13101df6e,2019-08-08 17:00:39.151725,,7e729d2068c5babe6672565664e2c01176a51b759f913d...,,2019-08-08 17:00:27.905998,gs://cdp-seattle.appspot.com/7e729d2068c5babe6...,...,2019-08-08 17:00:30.978199,3f69227d-5cdf-4e48-8c77-2526a6ca190d,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,3900,2019-04-24 14:00:00,http://legistar2.granicus.com/seattle/meetings...,Finance and Neighborhoods Committee,,2019-08-08 06:45:28.462804
2,021b763c-437f-4bcb-b458-4eec1b9c9058,0.919946,33ab51f9-911b-4965-82ba-e7a62ab26221,5c91e7bd-4e76-4ff3-b16e-25e4ff8bd6a6,2019-10-22 01:55:25.680446,,ed3d2c9e9cc745be6a9db227dce5f6aacd44cef643bbc1...,,2019-10-22 01:55:03.957929,gs://cdp-seattle.appspot.com/ed3d2c9e9cc745be6...,...,2019-10-22 01:55:07.861191,6385675d-e4ce-484e-b35d-3461b52edd22,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/FullCouncil?vide...,4165,2019-10-21 14:00:00,http://legistar2.granicus.com/seattle/meetings...,City Council,,2019-08-08 04:57:38.039158
3,069f42b1-94de-4edd-b197-8d4d8ca7563d,0.946464,3932cb65-bc2f-48d6-8604-e0ddd8896491,6527b5c6-4ff8-468b-b356-480569762212,2019-08-08 07:13:36.980601,,ce90894107a4e81431a190e2236ea213f07eb5622a15ce...,,2019-08-08 07:13:27.296433,gs://cdp-seattle.appspot.com/ce90894107a4e8143...,...,2019-08-08 07:13:29.580029,d1be6b39-22e1-4ca9-8644-9128e74c2248,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4045,2019-07-31 09:30:00,http://legistar2.granicus.com/seattle/meetings...,"Gender Equity, Safe Communities, New Americans...",,2019-08-08 07:10:43.521005
4,06adbfed-e0d8-4884-bc8a-267868bacf89,0.937053,d07291a5-4ec0-4ebf-9736-740d35fbab50,68fa2d4f-2526-4799-a07e-0419036e80fc,2019-08-08 06:20:26.667629,,cb5428b8f63b30578fcc78ed83e2e13b9186dfffe2367a...,,2019-08-08 06:20:21.773707,gs://cdp-seattle.appspot.com/cb5428b8f63b30578...,...,2019-08-08 06:20:24.110208,f6986d6f-4b0b-407c-8e13-1ec873158a66,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4016,2019-07-09 09:30:00,http://legistar2.granicus.com/seattle/meetings...,"Civil Rights, Utilities, Economic Development,...",,2019-08-08 06:20:23.903175
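Two details of `DataFrame.merge` explain the shape of this output. Columns that exist in both tables, such as `created`, are renamed with the given suffixes, and the default join is an inner join, so any row without a match in the other table is dropped. The sketch below demonstrates both on made-up toy tables; the IDs and filenames are hypothetical and do not come from the CDP database.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up).
toy_transcripts = pd.DataFrame({
    "transcript_id": ["t1", "t2", "t3"],
    "file_id": ["f1", "f2", "f9"],  # "f9" has no matching file row
    "created": ["2019-08-08", "2019-08-09", "2019-08-10"],
})
toy_files = pd.DataFrame({
    "file_id": ["f1", "f2"],
    "filename": ["a.json", "b.json"],
    "created": ["2019-08-07", "2019-08-08"],
})

# Default inner join: "t3" disappears, and the shared "created" column
# becomes "created_transcript" and "created_file".
inner = toy_transcripts.merge(
    toy_files, on="file_id", suffixes=("_transcript", "_file")
)

# A left join with indicator=True keeps every transcript and adds a
# "_merge" column that flags rows with no match on the right.
checked = toy_transcripts.merge(
    toy_files,
    on="file_id",
    how="left",
    suffixes=("_transcript", "_file"),
    indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
```

Checking `unmatched` before relying on an inner join is a cheap way to notice rows that silently dropped out of the result.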


### Downloading and reading a transcript

A manifest is great, but you probably want to work with the transcripts locally.

In [4]:
# Download to local
save_path = seattle.file_store.download_file(transcripts.loc[1]["filename"])
save_path

PosixPath('/home/maxfield/active/cdptools/examples/7e729d2068c5babe6672565664e2c01176a51b759f913d30aab13bcb2cb771f6_ts_sentences_transcript_0.json')

In [5]:
# Read the transcript
import json
with open(save_path, "r") as read_in:
    transcript = json.load(read_in)
    for s in transcript["data"][:3]:
        print(s)

{'text': 'Hi, good afternoon, everybody and welcome to our finance and neighborhoods committee.', 'start_time': 16.0, 'end_time': 20.7}
{'text': 'This is our April 24th.', 'start_time': 20.7, 'end_time': 23.5}
{'text': '2019 meeting.', 'start_time': 23.5, 'end_time': 25.0}
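Each entry in `transcript["data"]` is a dictionary with `text`, `start_time`, and `end_time` keys, so the usual list and dictionary tools are enough to work with it. The sketch below uses the three sentences printed above as in-memory sample data. It assumes the timestamps are seconds from the start of the recording, and `sentences_between` is a hypothetical helper written for this example, not part of `cdptools`.

```python
# The three sentences printed above, as in-memory sample data.
sample_transcript = {
    "data": [
        {
            "text": "Hi, good afternoon, everybody and welcome to our "
                    "finance and neighborhoods committee.",
            "start_time": 16.0,
            "end_time": 20.7,
        },
        {"text": "This is our April 24th.", "start_time": 20.7, "end_time": 23.5},
        {"text": "2019 meeting.", "start_time": 23.5, "end_time": 25.0},
    ]
}
sentences = sample_transcript["data"]

# Join the sentences into one block of text.
full_text = " ".join(s["text"] for s in sentences)

# Time covered by the sample, assuming the timestamps are in seconds.
covered_seconds = sentences[-1]["end_time"] - sentences[0]["start_time"]


def sentences_between(sentences, start, end):
    """Return the sentences that overlap the window from start to end."""
    return [
        s for s in sentences
        if s["start_time"] < end and s["end_time"] > start
    ]


window = sentences_between(sentences, 21.0, 24.0)
```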


### Handling multiple transcripts for a single event

A single event may have multiple versions of a transcript. How you filter them is up to you, but a reasonable default is to group by `event_id` and keep the most recently created transcript in each group. `pandas.Series.idxmax()` returns the index label of the maximum value, which makes this straightforward.

See this [Stack Overflow question](https://stackoverflow.com/questions/10202570/pandas-dataframe-find-row-where-values-for-column-is-maximal) for more on finding the row where a column is maximal.

In [6]:
# Group rows
most_recent_transcripts = []
grouped = transcripts.groupby("event_id")
for name, group in grouped:
    most_recent = group.loc[group["created_transcript"].idxmax()]
    most_recent_transcripts.append(most_recent)
    
most_recent = pd.DataFrame(most_recent_transcripts)
most_recent.head()

Unnamed: 0,transcript_id,confidence,event_id,file_id,created_transcript,content_type,filename,description_transcript,created_file,uri,...,created_event,body_id,legistar_event_link,source_uri,legistar_event_id,event_datetime,agenda_file_uri,name,description_event,created_body
6,0992b12d-3062-442c-9852-8d0db89c56f5,0.945244,0684e33e-0582-49b1-b24a-fb94666af072,3cbb94de-ff70-4a56-b048-ccf089914a68,2019-08-08 15:16:43.574340,,20b624e641163c7c808548ccd81b1035abc3f8f9275eb3...,,2019-08-08 15:16:38.656239,gs://cdp-seattle.appspot.com/20b624e641163c7c8...,...,2019-08-08 15:16:40.904996,fb71fff7-0a11-4dab-a45b-4c65d7098a83,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/CouncilBriefings...,3834,2019-03-11 09:30:00,http://legistar2.granicus.com/seattle/meetings...,Council Briefing,,2019-08-08 04:46:11.549790
122,d0b8fb90-7a82-4024-ab50-4159641f7b30,0.9457,06ce7221-4899-4129-8361-d82fc66992ae,466120db-246a-4218-a070-9123d4427f27,2019-08-08 04:46:41.202181,,e8d80518f82842c149de46fba8a703bf2cf9c6e4eb3c1b...,,2019-08-08 04:46:35.825902,gs://cdp-seattle.appspot.com/e8d80518f82842c14...,...,2019-08-08 04:46:37.974962,fb71fff7-0a11-4dab-a45b-4c65d7098a83,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/CouncilBriefings...,4041,2019-07-22 09:30:00,http://legistar2.granicus.com/seattle/meetings...,Council Briefing,,2019-08-08 04:46:11.549790
72,72c21e59-5100-4bed-a54c-beea836ba355,0.945288,07f936f4-e454-4712-8813-515c0ceb56af,370839da-3e73-4317-8d9f-d302a51cc623,2019-09-28 00:27:41.417738,,e4c97bd4a477f5416a98c4a94093ce3cf57463b90ac4da...,,2019-09-28 00:27:33.302149,gs://cdp-seattle.appspot.com/e4c97bd4a477f5416...,...,2019-09-28 00:27:35.626002,07d342d1-86f6-4074-9206-bcb45aab017e,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/BudgetCommittee?...,4134,2019-09-26 09:30:00,http://legistar2.granicus.com/seattle/meetings...,Select Budget Committee,,2019-08-08 05:53:34.743281
78,7efebc6f-2d30-458a-83a4-3c65bed4057a,0.940499,0b178831-dfa9-468f-80b8-881470e838f0,10fdb1cc-c19c-4340-8869-a3d88933a40d,2019-10-07 22:13:45.913262,,b7460175bc7e51272b4f66bff0bb3f0eddacc0cb3b5ed4...,,2019-10-07 22:13:35.999331,gs://cdp-seattle.appspot.com/b7460175bc7e51272...,...,2019-10-07 22:13:39.400247,fb71fff7-0a11-4dab-a45b-4c65d7098a83,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/CouncilBriefings...,4146,2019-10-07 09:30:00,http://legistar2.granicus.com/seattle/meetings...,Council Briefing,,2019-08-08 04:46:11.549790
35,33c55fcf-6ce0-4f35-9b94-f967ad58101b,0.919899,0d36e2b2-5505-401e-81ea-0a459b006732,3316fa96-e6bc-4205-8a20-c426de582d40,2019-08-13 07:20:11.869321,,675782da8d98c8531f594ac3d8db1236d025063bb2b3cc...,,2019-08-13 07:19:10.005527,gs://cdp-seattle.appspot.com/675782da8d98c8531...,...,2019-08-13 07:19:13.691073,6385675d-e4ce-484e-b35d-3461b52edd22,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/FullCouncil?vide...,4069,2019-08-12 14:00:00,http://legistar2.granicus.com/seattle/meetings...,City Council,,2019-08-08 04:57:38.039158
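The explicit loop can also be written without iterating over the groups, because `idxmax` works on a grouped column as well. The sketch below is one possible alternative on made-up toy data (the IDs are hypothetical). It assumes `created_transcript` holds real datetimes, which is why the toy column is built with `pd.to_datetime`.

```python
import pandas as pd

# Toy data: event "e1" has two transcript versions, "e2" has one.
toy = pd.DataFrame({
    "event_id": ["e1", "e1", "e2"],
    "transcript_id": ["t1", "t2", "t3"],
    "created_transcript": pd.to_datetime([
        "2019-08-08 08:00:00",
        "2019-10-22 01:55:00",
        "2019-08-08 17:00:00",
    ]),
})

# For each event, the index label of its newest transcript.
newest_idx = toy.groupby("event_id")["created_transcript"].idxmax()

# Select those rows in a single step.
most_recent_toy = toy.loc[newest_idx]
```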


### A function to download the most recent transcripts

To make gathering transcripts easier, `CDPInstance` provides two helper functions that wrap up the steps in this example: `get_transcript_manifest` and `download_transcripts`.

_**Note:** The utility functions shown below use more complex queries to achieve the same result, but the queries run server-side, which makes the process faster. They also select the transcript with the highest confidence for each event rather than the most recent one._
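The two selection rules can disagree when a newer transcript has a lower confidence than an older one. The sketch below shows that difference client-side on made-up toy data. It is only an illustration of the two criteria and says nothing about how the helper functions are implemented.

```python
import pandas as pd

# Toy data: for event "e1" the older transcript has the higher confidence.
toy = pd.DataFrame({
    "event_id": ["e1", "e1", "e2"],
    "transcript_id": ["t1", "t2", "t3"],
    "confidence": [0.95, 0.92, 0.94],
    "created_transcript": pd.to_datetime([
        "2019-08-08 08:00:00",
        "2019-10-22 01:55:00",
        "2019-08-08 17:00:00",
    ]),
})


def pick_per_event(frame, column):
    """Keep one row per event: the one with the largest value in `column`."""
    return (
        frame.sort_values(column, ascending=False)
        .drop_duplicates("event_id")  # keeps the first, i.e. largest, row
        .sort_values("event_id")
    )


by_confidence = pick_per_event(toy, "confidence")
by_recency = pick_per_event(toy, "created_transcript")
```

Here `by_confidence` keeps `t1` for event `e1`, while `by_recency` keeps `t2`.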

In [7]:
transcript_manifest = seattle.get_transcript_manifest()
transcript_manifest.head()

Unnamed: 0,event_id,source_uri,legistar_event_id,event_datetime,agenda_file_uri,minutes_file_uri,video_uri,created_event,body_id,legistar_event_link,...,file_id,created_transcript,content_type,filename,created_event.1,description_event,uri,name,description_body,created_body
0,047109ac-c502-4c0a-98d1-7612a104d244,http://www.seattlechannel.org/BudgetCommittee?...,4169,2019-11-01 09:30:00,http://legistar2.granicus.com/seattle/meetings...,,https://video.seattle.gov/media/council/budget...,2019-11-02 04:48:08.980073,07d342d1-86f6-4074-9206-bcb45aab017e,https://seattle.legistar.com/MeetingDetail.asp...,...,e066cb24-f6bf-4a0c-9ac3-fbf38645cc10,2019-11-02 04:48:47.499943,,1a3757f22eafdfdc0ed389b2cbda5457c1c66f963e8bf1...,2019-11-02 04:48:05.442495,,gs://cdp-seattle.appspot.com/1a3757f22eafdfdc0...,Select Budget Committee,,2019-08-08 05:53:34.743281
1,0492402b-8dec-43fd-bdc9-55f0744c05db,http://www.seattlechannel.org/BudgetCommittee?...,4157,2019-10-18 09:30:00,http://legistar2.granicus.com/seattle/meetings...,,https://video.seattle.gov/media/council/budget...,2019-10-19 19:19:25.649458,07d342d1-86f6-4074-9206-bcb45aab017e,https://seattle.legistar.com/MeetingDetail.asp...,...,be892865-3c35-46f4-a01f-525890918d1a,2019-10-19 19:19:30.590282,,a20233822ca8e28dd95763e60b11834fc43346b3a8771a...,2019-10-19 19:19:22.103761,,gs://cdp-seattle.appspot.com/a20233822ca8e28dd...,Select Budget Committee,,2019-08-08 05:53:34.743281
2,07f936f4-e454-4712-8813-515c0ceb56af,http://www.seattlechannel.org/BudgetCommittee?...,4134,2019-09-26 09:30:00,http://legistar2.granicus.com/seattle/meetings...,,https://video.seattle.gov/media/council/budget...,2019-09-28 00:27:35.626002,07d342d1-86f6-4074-9206-bcb45aab017e,https://seattle.legistar.com/MeetingDetail.asp...,...,370839da-3e73-4317-8d9f-d302a51cc623,2019-09-28 00:27:41.417738,,e4c97bd4a477f5416a98c4a94093ce3cf57463b90ac4da...,2019-09-28 00:27:33.302149,,gs://cdp-seattle.appspot.com/e4c97bd4a477f5416...,Select Budget Committee,,2019-08-08 05:53:34.743281
3,1101bdea-009c-48aa-a618-546e8204e0e7,http://www.seattlechannel.org/BudgetCommittee?...,4158,2019-10-21 14:30:00,http://legistar2.granicus.com/seattle/meetings...,,https://video.seattle.gov/media/council/budget...,2019-10-22 17:50:08.769328,07d342d1-86f6-4074-9206-bcb45aab017e,https://seattle.legistar.com/MeetingDetail.asp...,...,7c8007e2-3afd-4e6e-86cd-1ff6405de290,2019-10-22 17:50:13.616808,,18d02fab377743108df48535f14f64b6f7dd1682f48f96...,2019-10-22 17:50:05.013769,,gs://cdp-seattle.appspot.com/18d02fab377743108...,Select Budget Committee,,2019-08-08 05:53:34.743281
4,1e085d19-f880-4f55-9127-3e0006dc3b41,http://www.seattlechannel.org/BudgetCommittee?...,4159,2019-10-22 09:30:00,http://legistar2.granicus.com/seattle/meetings...,,https://video.seattle.gov/media/council/budget...,2019-10-23 20:57:57.981081,07d342d1-86f6-4074-9206-bcb45aab017e,https://seattle.legistar.com/MeetingDetail.asp...,...,90d25712-d96c-498a-aff7-0ea1d1db14e3,2019-10-23 20:58:01.704461,,85622ec63f60e791dbb47510daf159bd9668abdb8e68b8...,2019-10-23 20:57:54.087763,,gs://cdp-seattle.appspot.com/85622ec63f60e791d...,Select Budget Committee,,2019-08-08 05:53:34.743281


In [8]:
event_corpus_map = seattle.download_transcripts()
for event_id, path in event_corpus_map.items():
    print(event_id, path)
    break

047109ac-c502-4c0a-98d1-7612a104d244 /home/maxfield/active/cdptools/examples/1a3757f22eafdfdc0ed389b2cbda5457c1c66f963e8bf15545d01dc1b17fdf76_ts_sentences_transcript_0.json
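A mapping of event ID to local path is a convenient starting point for building a text corpus. The sketch below assumes every downloaded file uses the sentence-level layout shown earlier (a `data` list of entries with a `text` key). `load_corpus` is a hypothetical helper written for this example, and the temporary file stands in for a real download so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_corpus(event_path_map):
    """Map each event ID to the full text of its transcript file."""
    corpus = {}
    for event_id, path in event_path_map.items():
        with open(path, "r") as read_in:
            transcript = json.load(read_in)
        # Assumes the sentence-level layout: {"data": [{"text": ...}, ...]}
        corpus[event_id] = " ".join(s["text"] for s in transcript["data"])
    return corpus


# Stand-in for a downloaded transcript (made-up content and event ID).
with tempfile.TemporaryDirectory() as tmp:
    fake_path = Path(tmp) / "example_transcript.json"
    fake_path.write_text(json.dumps({
        "data": [
            {"text": "Hello.", "start_time": 0.0, "end_time": 1.0},
            {"text": "Welcome.", "start_time": 1.0, "end_time": 2.0},
        ]
    }))
    corpus = load_corpus({"example-event-id": fake_path})
```

With the mapping from the cell above, the call would be `load_corpus(event_corpus_map)`.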


