# Working with CDP Transcripts

Methods for retrieving open access data.

Before working through this example, we recommend reading the [transcript_formats](../docs/transcript_formats.md) documentation.

A database schema diagram for production instances of CDP may be found [here](https://github.com/CouncilDataProject/cdptools/blob/master/docs/resources/database_diagram.pdf).

In [1]:
from cdptools import CDPInstance, configs

seattle = CDPInstance(configs.SEATTLE)
seattle

<CDPInstance [database: <CloudFirestoreDatabase [stg-cdp-seattle]>, file_store: <GCSFileStore [stg-cdp-seattle.appspot.com]>]>

### Find all transcripts

Simply query the transcript table.

In [2]:
import pandas as pd
transcripts = pd.DataFrame(seattle.database.select_rows_as_list("transcript"))
transcripts.head()

Unnamed: 0,transcript_id,confidence,event_id,file_id,created
0,1b6e3436-d7ff-45e5-b595-001b646274ed,0.936318,1d57a6d9-965e-4e37-b0fb-a0bcf84fa22e,06fa7dcb-387a-421c-86c6-8bb8f47c8374,2019-08-07 04:25:33.513951
1,9cad6ca0-1c3e-4b7b-91e0-66de22314385,0.941259,36cbb43b-faf0-48aa-96b3-7f201d51b114,675c488b-6573-4da0-9376-16bfdfcbb5a0,2019-08-07 05:17:44.558042


### Join file, event, and body information

The transcript table on its own mostly holds IDs, so the next step is to merge in the file, event, and body tables to get filenames, event dates, and body names alongside each transcript.

In [3]:
# Get the other tables
events = pd.DataFrame(seattle.database.select_rows_as_list("event"))
bodies = pd.DataFrame(seattle.database.select_rows_as_list("body"))
files = pd.DataFrame(seattle.database.select_rows_as_list("file"))

# Merge the tables
events = events.merge(bodies, on="body_id", suffixes=("_event", "_body"))
transcripts = transcripts.merge(files, on="file_id", suffixes=("_transcript", "_file"))
transcripts = transcripts.merge(events, on="event_id", suffixes=("_transcript", "_event"))
transcripts.head()

Unnamed: 0,transcript_id,confidence,event_id,file_id,created_transcript,description_transcript,created_file,uri,content_type,filename,...,video_uri,created_event,body_id,legistar_event_link,source_uri,legistar_event_id,event_datetime,name,description_event,created_body
0,1b6e3436-d7ff-45e5-b595-001b646274ed,0.936318,1d57a6d9-965e-4e37-b0fb-a0bcf84fa22e,06fa7dcb-387a-421c-86c6-8bb8f47c8374,2019-08-07 04:25:33.513951,,2019-08-07 04:25:24.260846,gs://stg-cdp-seattle.appspot.com/b0d394f1837ee...,,b0d394f1837ee56aae3664e8e665e806564b52953bfce0...,...,https://video.seattle.gov/media/council/gov_08...,2019-08-07 04:25:27.506080,2d74aeb0-71dd-47bb-a534-df6db760de17,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4055,2019-08-06 09:30:00,"Governance, Equity, and Technology Committee",,2019-08-07 04:25:27.279863
1,9cad6ca0-1c3e-4b7b-91e0-66de22314385,0.941259,36cbb43b-faf0-48aa-96b3-7f201d51b114,675c488b-6573-4da0-9376-16bfdfcbb5a0,2019-08-07 05:17:44.558042,,2019-08-07 05:17:29.712970,gs://stg-cdp-seattle.appspot.com/49fd94d68ee70...,,49fd94d68ee7072972609e653a0ea180483d57de78b34d...,...,https://video.seattle.gov/media/council/sus_08...,2019-08-07 05:17:33.183874,887c08bd-ae3b-455a-85bd-3c17502b3121,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4056,2019-08-06 14:00:00,Sustainability and Transportation Committee,,2019-08-07 05:17:32.926111
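
To make the role of `suffixes` in the merge concrete, here is a minimal sketch on toy data. The frames `toy_transcripts` and `toy_files` are made up for illustration; only the column names follow the output above. Both frames have a `created` column, so the merge renames them to `created_transcript` and `created_file`. The `validate` argument is an optional guard (not used in the original cell) that raises an error if `file_id` is not unique in the right-hand table.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are invented for illustration.
toy_transcripts = pd.DataFrame({
    "transcript_id": ["t1", "t2"],
    "event_id": ["e1", "e2"],
    "file_id": ["f1", "f2"],
    "created": pd.to_datetime(["2019-08-07 04:25:33", "2019-08-07 05:17:44"]),
})
toy_files = pd.DataFrame({
    "file_id": ["f1", "f2"],
    "filename": ["a.json", "b.json"],
    "created": pd.to_datetime(["2019-08-07 04:25:24", "2019-08-07 05:17:29"]),
})

# Both frames carry a "created" column; the suffixes disambiguate them.
# validate="many_to_one" checks that each file_id appears once in toy_files.
merged = toy_transcripts.merge(
    toy_files,
    on="file_id",
    suffixes=("_transcript", "_file"),
    validate="many_to_one",
)
print(sorted(merged.columns))
```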


### Downloading and reading a transcript

A manifest is great, but you probably want to work with the transcripts locally.

In [4]:
# Download to local
save_path = seattle.file_store.download_file(transcripts.loc[1, "filename"])
save_path

PosixPath('/home/maxfield/active/cdp/cdptools/examples/49fd94d68ee7072972609e653a0ea180483d57de78b34d856203cc805a0fc550_ts_sentences_transcript_0.json')

In [5]:
# Read the transcript
import json
with open(save_path, "r") as read_in:
    transcript = json.load(read_in)
    for s in transcript["data"][:3]:
        print(s)

{'text': 'Good afternoon, everybody Welcome to the sustainability and transportation committee.', 'start_time': 18.2, 'end_time': 22.1}
{'text': 'Thank you all so much for being here.', 'start_time': 22.1, 'end_time': 23.8}
{'text': "It's so nice to have big crowds here for the last couple meetings.", 'start_time': 23.8, 'end_time': 26.6}
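
Once a transcript is loaded, the sentence list can be handled with plain Python. The sketch below assumes the structure printed above (a `"data"` list of dicts with `text`, `start_time`, and `end_time`). The helper names `transcript_to_text` and `sentences_between` are hypothetical and not part of cdptools.

```python
# A small sample in the same shape as the sentences printed above.
sample_transcript = {
    "data": [
        {"text": "Good afternoon, everybody Welcome to the sustainability and transportation committee.",
         "start_time": 18.2, "end_time": 22.1},
        {"text": "Thank you all so much for being here.",
         "start_time": 22.1, "end_time": 23.8},
        {"text": "It's so nice to have big crowds here for the last couple meetings.",
         "start_time": 23.8, "end_time": 26.6},
    ]
}


def transcript_to_text(transcript: dict) -> str:
    """Join every sentence into a single string (hypothetical helper)."""
    return " ".join(sentence["text"] for sentence in transcript["data"])


def sentences_between(transcript: dict, start: float, end: float) -> list:
    """Return sentences that fall entirely within [start, end] seconds (hypothetical helper)."""
    return [
        sentence
        for sentence in transcript["data"]
        if sentence["start_time"] >= start and sentence["end_time"] <= end
    ]


full_text = transcript_to_text(sample_transcript)
window = sentences_between(sample_transcript, 22.0, 27.0)
print(len(full_text.split()), "words;", len(window), "sentences in window")
```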


### Handling multiple transcripts for a single event

A single event may have multiple versions of a transcript. How you filter them is up to you, but a reasonable default is to group by `event_id` and keep the most recently created transcript in each group. `pandas.Series.idxmax()` returns the index label of the maximum value, which makes this straightforward.

For more on finding the row where a column is maximal, see this [Stack Overflow question](https://stackoverflow.com/questions/10202570/pandas-dataframe-find-row-where-values-for-column-is-maximal).

In [6]:
# Group rows
most_recent_transcripts = []
grouped = transcripts.groupby("event_id")
for _, group in grouped:
    # Keep the row with the latest creation time within this event
    most_recent = group.loc[group["created_transcript"].idxmax()]
    most_recent_transcripts.append(most_recent)
    
most_recent = pd.DataFrame(most_recent_transcripts)
most_recent.head()

Unnamed: 0,transcript_id,confidence,event_id,file_id,created_transcript,description_transcript,created_file,uri,content_type,filename,...,video_uri,created_event,body_id,legistar_event_link,source_uri,legistar_event_id,event_datetime,name,description_event,created_body
0,1b6e3436-d7ff-45e5-b595-001b646274ed,0.936318,1d57a6d9-965e-4e37-b0fb-a0bcf84fa22e,06fa7dcb-387a-421c-86c6-8bb8f47c8374,2019-08-07 04:25:33.513951,,2019-08-07 04:25:24.260846,gs://stg-cdp-seattle.appspot.com/b0d394f1837ee...,,b0d394f1837ee56aae3664e8e665e806564b52953bfce0...,...,https://video.seattle.gov/media/council/gov_08...,2019-08-07 04:25:27.506080,2d74aeb0-71dd-47bb-a534-df6db760de17,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4055,2019-08-06 09:30:00,"Governance, Equity, and Technology Committee",,2019-08-07 04:25:27.279863
1,9cad6ca0-1c3e-4b7b-91e0-66de22314385,0.941259,36cbb43b-faf0-48aa-96b3-7f201d51b114,675c488b-6573-4da0-9376-16bfdfcbb5a0,2019-08-07 05:17:44.558042,,2019-08-07 05:17:29.712970,gs://stg-cdp-seattle.appspot.com/49fd94d68ee70...,,49fd94d68ee7072972609e653a0ea180483d57de78b34d...,...,https://video.seattle.gov/media/council/sus_08...,2019-08-07 05:17:33.183874,887c08bd-ae3b-455a-85bd-3c17502b3121,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4056,2019-08-06 14:00:00,Sustainability and Transportation Committee,,2019-08-07 05:17:32.926111
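
The same selection can also be written without an explicit loop. This is a sketch on made-up data, assuming `created_transcript` is a datetime column. `groupby(...).idxmax()` returns one index label per event, and `.loc` then pulls those rows out in a single step.

```python
import pandas as pd

# Toy data: event "e1" has two transcript versions, "e2" has one.
toy = pd.DataFrame({
    "transcript_id": ["t1", "t2", "t3"],
    "event_id": ["e1", "e1", "e2"],
    "created_transcript": pd.to_datetime([
        "2019-08-07 04:00:00",
        "2019-08-07 05:00:00",
        "2019-08-07 06:00:00",
    ]),
})

# One index label per event_id, pointing at the latest created_transcript.
latest_idx = toy.groupby("event_id")["created_transcript"].idxmax()

# Select those rows; reset the index for a clean 0..n-1 result.
toy_most_recent = toy.loc[latest_idx].reset_index(drop=True)
print(toy_most_recent)
```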


### A function to download the most recent transcripts

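To make gathering the most recent transcripts easier, `CDPInstance` has two helper functions that wrap the steps above: `get_most_recent_transcript_manifest` returns the merged manifest, and `download_most_recent_transcripts` downloads the most recent transcript for each event and returns a mapping of event ID to local file path.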

In [7]:
transcript_manifest = seattle.get_most_recent_transcript_manifest()
transcript_manifest.head()

Unnamed: 0,transcript_id,file_id,created_transcript,confidence,event_id,description_transcript,created_file,uri,content_type,filename,...,created_event,body_id,legistar_event_link,source_uri,legistar_event_id,event_datetime,agenda_file_uri,name,description_event,created_body
0,1b6e3436-d7ff-45e5-b595-001b646274ed,06fa7dcb-387a-421c-86c6-8bb8f47c8374,2019-08-07 04:25:33.513951,0.936318,1d57a6d9-965e-4e37-b0fb-a0bcf84fa22e,,2019-08-07 04:25:24.260846,gs://stg-cdp-seattle.appspot.com/b0d394f1837ee...,,b0d394f1837ee56aae3664e8e665e806564b52953bfce0...,...,2019-08-07 04:25:27.506080,2d74aeb0-71dd-47bb-a534-df6db760de17,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4055,2019-08-06 09:30:00,http://legistar2.granicus.com/seattle/meetings...,"Governance, Equity, and Technology Committee",,2019-08-07 04:25:27.279863
1,9cad6ca0-1c3e-4b7b-91e0-66de22314385,675c488b-6573-4da0-9376-16bfdfcbb5a0,2019-08-07 05:17:44.558042,0.941259,36cbb43b-faf0-48aa-96b3-7f201d51b114,,2019-08-07 05:17:29.712970,gs://stg-cdp-seattle.appspot.com/49fd94d68ee70...,,49fd94d68ee7072972609e653a0ea180483d57de78b34d...,...,2019-08-07 05:17:33.183874,887c08bd-ae3b-455a-85bd-3c17502b3121,https://seattle.legistar.com/MeetingDetail.asp...,http://www.seattlechannel.org/mayor-and-counci...,4056,2019-08-06 14:00:00,http://legistar2.granicus.com/seattle/meetings...,Sustainability and Transportation Committee,,2019-08-07 05:17:32.926111


In [8]:
event_corpus_map = seattle.download_most_recent_transcripts()
event_corpus_map

{'1d57a6d9-965e-4e37-b0fb-a0bcf84fa22e': PosixPath('/home/maxfield/active/cdp/cdptools/examples/b0d394f1837ee56aae3664e8e665e806564b52953bfce0e98045a7dcc9c24f69_ts_sentences_transcript_0.json'),
 '36cbb43b-faf0-48aa-96b3-7f201d51b114': PosixPath('/home/maxfield/active/cdp/cdptools/examples/49fd94d68ee7072972609e653a0ea180483d57de78b34d856203cc805a0fc550_ts_sentences_transcript_0.json')}
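
A mapping like this can be turned into an in-memory corpus by reading each file in turn. The sketch below assumes every path points to a JSON transcript with the `"data"` sentence list shown earlier. `load_corpus` is a hypothetical helper, not part of cdptools, and the demo uses a temporary toy file rather than real downloads.

```python
import json
import tempfile
from pathlib import Path


def load_corpus(event_corpus_map: dict) -> dict:
    """Map each event ID to its transcript text (hypothetical helper)."""
    corpus = {}
    for event_id, path in event_corpus_map.items():
        with open(path, "r") as read_in:
            transcript = json.load(read_in)
        # Join the sentence texts into one string per event
        corpus[event_id] = " ".join(s["text"] for s in transcript["data"])
    return corpus


# Demo with a temporary toy transcript instead of real downloaded files.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_transcript.json"
    toy_path.write_text(json.dumps({
        "data": [
            {"text": "Hello.", "start_time": 0.0, "end_time": 1.0},
            {"text": "Welcome.", "start_time": 1.0, "end_time": 2.0},
        ]
    }))
    toy_corpus = load_corpus({"toy-event-id": toy_path})

print(toy_corpus)
```

With the real data, the call would be `load_corpus(event_corpus_map)`.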