# Getting Set Up

In this notebook we will make sure that all packages are installed, pull some data, and answer a few questions about what that data contains.

## Checking Packages

In [1]:
import cdp_data
import seaborn as sns

cdp_data.__version__, sns.__version__

('0.0.9', '0.12.2')
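
If one of the imports above fails, it can help to check what is installed without importing anything. The following is a minimal sketch using only the standard library. The helper `installed_version` is a hypothetical name, and the distribution names `cdp-data` and `seaborn` are assumed to match what you would `pip install`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Assumption: these distribution names match the names used with `pip install`.
for name in ("cdp-data", "seaborn"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'NOT INSTALLED'}")
```

A missing package shows up as `NOT INSTALLED` instead of raising an `ImportError` halfway through the notebook.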

## Pull Some Data

In [2]:
from cdp_data import CDPInstances, datasets

# Get a dataset of "city council sessions"
seattle_transcripts_oct_2022_to_nov_2022 = datasets.get_session_dataset(
    CDPInstances.Seattle,  # specify the city (or county) council we want data from
    start_datetime="2022-10-01",  # YYYY-MM-DD format
    end_datetime="2022-11-01",  # YYYY-MM-DD format
    store_transcript=True,  # store transcripts locally for fast file reading
    store_transcript_as_csv=True,  # store transcripts as CSVs for easy pandas reading
)
seattle_transcripts_oct_2022_to_nov_2022

Fetching each model attached to event_ref:   0%|          | 0/5 [00:00<?, ?it/s]

Fetching transcripts:   0%|          | 0/5 [00:00<?, ?it/s]

Converting and storing each transcript as a CSV: 5it [00:06,  1.31s/it]


Unnamed: 0,session_datetime,session_index,session_content_hash,video_uri,video_start_time,video_end_time,caption_uri,external_source_id,id,key,event,transcript,transcript_path,transcript_as_csv_path
0,2022-10-03 21:00:00+00:00,0,ae7cc8bd62900cb81350b0097db959fe6aa54bf94cbdc6...,https://video.seattle.gov/media/council/brief_...,,,https://www.seattlechannel.org/documents/seatt...,,e9bdfa228805,session/e9bdfa228805,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...
1,2022-10-04 21:00:00+00:00,0,2bb8d74511e9806f675679490fb437bae46112ea6b936e...,https://video.seattle.gov/media/council/counci...,,,https://www.seattlechannel.org/documents/seatt...,,9a9ac561e588,session/9a9ac561e588,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...
2,2022-10-13 16:30:00+00:00,0,71434085f1fed70cca131099ecc6da33743828d50f9949...,https://video.seattle.gov/media/council/budget...,,,https://www.seattlechannel.org/documents/seatt...,,51141742fef8,session/51141742fef8,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...
3,2022-10-13 16:30:00+00:00,1,2a743c6bad095b4c33035c71ded456cbd637362da42983...,https://video.seattle.gov/media/council/budget...,,,https://www.seattlechannel.org/documents/seatt...,,8e422a197c62,session/8e422a197c62,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...
4,2022-10-18 21:00:00+00:00,0,2b990b7ac5f00e31c33b0683badb54923d7ad9765d33c8...,https://video.seattle.gov/media/council/counci...,,,https://www.seattlechannel.org/documents/seatt...,,2f7a3e667ed9,session/2f7a3e667ed9,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...,/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...


**Note:** All of the files you just downloaded are in the `cdp-datasets` directory next to this notebook. If at any time you want to remove old, unused data and free up some space on your machine, just delete the `cdp-datasets` folder!
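
If you would rather check how much space the data takes before deleting it, here is a small sketch using only the standard library. The helpers `directory_size_bytes` and `remove_directory` are hypothetical names, and the sketch assumes the notebook's working directory is the folder that contains `cdp-datasets`.

```python
import shutil
from pathlib import Path


def directory_size_bytes(directory):
    """Sum the sizes of all regular files under `directory` (0 if it does not exist)."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def remove_directory(directory):
    """Delete `directory` and everything in it; return True if something was removed."""
    root = Path(directory)
    if not root.is_dir():
        return False
    shutil.rmtree(root)
    return True


# Assumption: the working directory is the folder that contains `cdp-datasets`.
dataset_dir = Path("cdp-datasets")
print(f"{dataset_dir} currently uses {directory_size_bytes(dataset_dir) / 1e6:.1f} MB")
# remove_directory(dataset_dir)  # uncomment to actually delete the downloaded data
```

The delete call is left commented out on purpose, since `shutil.rmtree` cannot be undone.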

## Data Details

What is in this dataframe? Each row of the dataframe is a "city council session".
Taken together, this is a dataframe containing metadata (the date and time,
our database id for this session, and references to other database objects), and a
path to a local file of the transcript for that session.

We won't go over all the columns / fields of the dataset but will highlight a few.

In [3]:
example_session = seattle_transcripts_oct_2022_to_nov_2022.iloc[0]
example_session

session_datetime                                  2022-10-03 21:00:00+00:00
session_index                                                             0
session_content_hash      ae7cc8bd62900cb81350b0097db959fe6aa54bf94cbdc6...
video_uri                 https://video.seattle.gov/media/council/brief_...
video_start_time                                                       None
video_end_time                                                         None
caption_uri               https://www.seattlechannel.org/documents/seatt...
external_source_id                                                     None
id                                                             e9bdfa228805
key                                                    session/e9bdfa228805
event                     <cdp_backend.database.models.Event object at 0...
transcript                <cdp_backend.database.models.Transcript object...
transcript_path           /Users/eva/active/pit/sig-cdp/cdp-datasets/cdp...
transcript_a

### `session_datetime`

The date and time the session occurred.

In [4]:
example_session.session_datetime

Timestamp('2022-10-03 21:00:00+0000', tz='UTC')
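
The timestamp is stored in UTC, so 21:00 here is not the local meeting time. As a rough sketch of the conversion, the snippet below uses only the standard library and a fixed UTC-7 offset. The fixed offset is an assumption that holds for Seattle on this date (daylight time) but not all year.

```python
from datetime import datetime, timedelta, timezone

# The value shown above, written as an ISO 8601 string.
session_utc = datetime.fromisoformat("2022-10-03T21:00:00+00:00")

# Assumption: a fixed UTC-7 offset (Pacific Daylight Time on this date).
# A real timezone database (e.g. `zoneinfo`) would handle daylight saving for you.
pacific_daylight = timezone(timedelta(hours=-7), name="PDT")

session_local = session_utc.astimezone(pacific_daylight)
print(session_local.isoformat())  # 2022-10-03T14:00:00-07:00
```

On the pandas `Timestamp` stored in the dataframe, `tz_convert` plays the same role as `astimezone` does here.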

### `video_uri`

A link to the original video of the council meeting. Try opening that URI in your browser; it _should_ play the video. 🤞

In [5]:
example_session.video_uri

'https://video.seattle.gov/media/council/brief_100322_2012261V.mp4'

### `id` and `key`

The database `id` for this session. Each session has a unique ID. The `key` is just the long form of the `id`: the same `id` prefixed with `session/`.

In [6]:
example_session.id, example_session.key

('e9bdfa228805', 'session/e9bdfa228805')

### `transcript_path` and `transcript_as_csv_path`

The local path to the transcript, either in its raw JSON file format or
converted to a CSV file for easy reading with `pandas`.

Note: do not get `transcript` confused with `transcript_path`.
`transcript` holds very basic metadata about the transcript
while `transcript_path` is a path to the whole transcript file itself.

In [7]:
example_session.transcript_path

PosixPath('/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp-seattle-21723dcf/event-c5a6c08c7135/session-e9bdfa228805/transcript.json')

In [8]:
example_session.transcript_as_csv_path

PosixPath('/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp-seattle-21723dcf/event-c5a6c08c7135/session-e9bdfa228805/transcript.csv')
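
The two paths differ only in their file extension, and the folder names carry the event and session IDs. Here is a short sketch of working with that layout using `pathlib`. It assumes the `event-<id>` / `session-<id>` folder naming seen in the output above holds in general, which the source does not confirm.

```python
from pathlib import PurePosixPath

# The JSON path printed above (PurePosixPath so the sketch behaves the same on any OS).
transcript_path = PurePosixPath(
    "/Users/eva/active/pit/sig-cdp/cdp-datasets/cdp-seattle-21723dcf/"
    "event-c5a6c08c7135/session-e9bdfa228805/transcript.json"
)

# The CSV lives next to the JSON file, so only the suffix changes.
csv_path = transcript_path.with_suffix(".csv")

# Assumption: folder names follow the "event-<id>" / "session-<id>" pattern seen above.
session_id = transcript_path.parent.name.split("-", 1)[1]
event_id = transcript_path.parent.parent.name.split("-", 1)[1]

print(csv_path.name, session_id, event_id)  # transcript.csv e9bdfa228805 c5a6c08c7135
```

The recovered session ID matches the `id` column of the dataframe row, which can be handy when you only have a file path in hand.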

## Reading a Transcript's Sentence Data (CSV)

In [9]:
import pandas as pd

example_transcript_df = pd.read_csv(example_session.transcript_as_csv_path)
example_transcript_df

Unnamed: 0,index,confidence,start_time,end_time,text,speaker_index,speaker_name,annotations
0,0,0.97,14.214,14.280,& GT;,0,,
1,1,0.97,14.280,30.930,"With that, good afternoon, everyone . today is...",1,,
2,2,0.97,30.930,34.634,Councilmember Nelson? Councir Nelson? Councilm...,2,,
3,3,0.97,34.634,36.703,Present.,3,,
4,4,0.97,36.703,38.071,Councilmember Suwant?,4,,
...,...,...,...,...,...,...,...,...
123,123,0.97,4750.645,4751.379,Session . colleagues don't go an.,100,,
124,124,0.97,4751.379,4752.114,Session . colleagues don't go an.,100,,
125,125,0.97,4752.114,4752.848,Session . colleagues don't go an.,100,,
126,126,0.97,4752.848,4753.582,Session . colleagues don't go an.,100,,


### Sentence Data Details

We will only cover a few of the columns here, not all of them.

#### `start_time`

The start time of the sentence, in seconds from the start of the recording. For example, the first sentence in the session started 14.214 seconds into the recording.

#### `end_time`

The end time of the sentence, in seconds from the start of the recording. For example, the first sentence in the session ended 14.280 seconds into the recording.
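
Since both columns are plain seconds, it is easy to compute how long a sentence lasted or to turn a start time into a timestamp you can seek to in the video. A minimal sketch follows; `seconds_to_timestamp` is a hypothetical helper, and the values are copied from the first three rows of the table above.

```python
def seconds_to_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions of a second are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# (start_time, end_time) pairs copied from the first three rows of the table above.
sentence_times = [(14.214, 14.280), (14.280, 30.930), (30.930, 34.634)]

for start, end in sentence_times:
    duration = round(end - start, 3)
    print(f"starts at {seconds_to_timestamp(start)}, lasts {duration} s")
```

On the dataframe itself, `example_transcript_df.end_time - example_transcript_df.start_time` gives the same durations for every row at once.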

#### `text`

The transcribed text! There may be a few mistakes here, but as a rough guess it is generally about 85% accurate.

In [10]:
# let's look at the first three sentences
for _, row in example_transcript_df[:3].iterrows():
    print(f"{row.start_time} - {row.end_time} -- '{row.text}'")

14.214 - 14.28 -- '& GT;'
14.28 - 30.93 -- 'With that, good afternoon, everyone . today is Monday, October 3RD . The Council briefing meeting will come to order . It is 2:01 and I am Debora Juarez . will the clerk please call the roll?'
30.93 - 34.634 -- 'Councilmember Nelson? Councir Nelson? Councilmember Pedersen?'


It is a bit messy: there are obvious punctuation mistakes, and these sentences should have been split up correctly (this is a bug and we are working on it). But generally the data is all there. It's just text data!
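
Until that bug is fixed, a little cleanup can go a long way. The sketch below removes the stray spaces before punctuation and then re-splits the text on sentence-ending punctuation. Both helpers are hypothetical and deliberately naive (the split will break on abbreviations such as "Dr."), so treat this as a starting point and not as how the transcripts are actually processed.

```python
import re


def tidy_punctuation(text):
    """Remove the stray whitespace that appears before punctuation marks."""
    return re.sub(r"\s+([.,?!])", r"\1", text)


def naive_sentence_split(text):
    """Split on whitespace that follows '.', '?' or '!' (breaks on abbreviations)."""
    return re.split(r"(?<=[.?!])\s+", tidy_punctuation(text).strip())


# The second sentence from the output above.
raw = (
    "With that, good afternoon, everyone . today is Monday, October 3RD . "
    "The Council briefing meeting will come to order . It is 2:01 and I am "
    "Debora Juarez . will the clerk please call the roll?"
)

for piece in naive_sentence_split(raw):
    print(piece)
```

For this example the single row becomes five shorter sentences. Note that the split pieces no longer have their own start and end times.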

## (Optional) Reading a Transcript with Full Metadata and Deeper Access (JSON)

In [11]:
from cdp_backend.pipeline.transcript_model import Transcript

# open the file
with open(example_session.transcript_path, "r") as open_file:
    # read the contents into a transcript object
    example_transcript = Transcript.from_json(open_file.read())

example_transcript

Transcript(generator='CDP WebVTT Conversion -- CDP v3.2.3', confidence=0.9699999999999991, session_datetime='2022-10-03T14:00:00-07:00', created_datetime='2022-10-04T04:39:14.131557', sentences=[...] (n=128), annotations=None)

We can see some basic metadata about the transcript from the printout.

* We can see the algorithm that generated the transcript: `generator=...`
* We can see how confident we are that the transcript is accurate: `confidence=...`
* We can see the session datetime: `session_datetime=...` (this should always be the same moment as in the dataframe, even though it is written here with a local UTC offset instead of in UTC)
* We can see when this transcript was created (it may be different from the session datetime): `created_datetime=...`
* We can see the number of sentences: `sentences=[...] (n=128)` (128 sentences in the transcript)
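
To make the two datetime fields concrete, here is a small sketch that parses the values from the printout with the standard library. It checks that `session_datetime` is the same moment as the UTC value in the dataframe, then measures how long after the session started the transcript was created. The `created_datetime` string has no UTC offset, so treating it as UTC is an assumption.

```python
from datetime import datetime, timezone

# Values copied from the printout above.
session_datetime = datetime.fromisoformat("2022-10-03T14:00:00-07:00")
created_datetime = datetime.fromisoformat("2022-10-04T04:39:14.131557")

# Assumption: `created_datetime` has no offset, so it is treated as UTC here.
created_datetime = created_datetime.replace(tzinfo=timezone.utc)

# Same moment as the dataframe's `session_datetime`, just expressed in UTC.
print(session_datetime.astimezone(timezone.utc).isoformat())

# Time between the start of the session and the creation of the transcript.
print(created_datetime - session_datetime)
```

If the UTC assumption holds, this transcript was created a little under eight hours after the session began.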

This object has a lot more data than what fits in the "dataframe version" of the transcript.

In [12]:
# Let's look at the first three sentences
for i in range(3):
    # Get the start time, the end time, and the text
    example_sentence = example_transcript.sentences[i]
    start_time = example_sentence.start_time
    end_time = example_sentence.end_time
    text = example_sentence.text
    print(f"{start_time} - {end_time} -- {text}")

14.214 - 14.28 -- & GT;
14.28 - 30.93 -- With that, good afternoon, everyone . today is Monday, October 3RD . The Council briefing meeting will come to order . It is 2:01 and I am Debora Juarez . will the clerk please call the roll?
30.93 - 34.634 -- Councilmember Nelson? Councir Nelson? Councilmember Pedersen?


There are two reasons to use the `Transcript` object instead of a `pandas.DataFrame`:

1. if you prefer that data structure
2. if you want access to word-level timestamps

In [13]:
# Get the third sentence
single_example_sentence = example_transcript.sentences[2]

for word in single_example_sentence.words:
    print(f"{word.start_time} - {word.end_time} -- '{word.text}'")

33.366 - 33.577333333333335 -- 'councilmember'
33.577333333333335 - 33.788666666666664 -- 'nelson'
33.788666666666664 - 34.0 -- 'councir'
34.0 - 34.211333333333336 -- 'nelson'
34.211333333333336 - 34.422666666666665 -- 'councilmember'
34.422666666666665 - 34.634 -- 'pedersen'
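
One thing word-level timestamps make possible is looking up which word was being spoken at a given moment in the video. Here is a minimal sketch. `word_at` is a hypothetical helper, and the tuples are copied from the output above instead of being read from `single_example_sentence.words`, so that the example runs on its own.

```python
# (start_time, end_time, text) for each word, copied from the output above.
words = [
    (33.366, 33.577333333333335, "councilmember"),
    (33.577333333333335, 33.788666666666664, "nelson"),
    (33.788666666666664, 34.0, "councir"),
    (34.0, 34.211333333333336, "nelson"),
    (34.211333333333336, 34.422666666666665, "councilmember"),
    (34.422666666666665, 34.634, "pedersen"),
]


def word_at(time_in_seconds, timed_words):
    """Return the text of the word being spoken at the given time, or None."""
    for start, end, text in timed_words:
        if start <= time_in_seconds < end:
            return text
    return None


print(word_at(33.9, words))  # councir
print(word_at(40.0, words))  # None
```

The word timestamps in this sentence are evenly spaced, which hints that they may be interpolated across a time span and not measured per word. That is only a guess from this one example, so be careful about reading too much precision into them.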
