# CDP Ngram Viewer

One of the interesting things we can do with CDP data is look at trends in discussion by keyword.
Much like [Google's Ngram Viewer](https://books.google.com/ngrams) we can plot these trends over time.

In [1]:
# Import pandas and altair because they are used multiple times in the notebook
import pandas as pd
import altair as alt
from altair.expr import datum

## Ngram Usage Over Time

To generate a plot using the same process as Google's Ngram Viewer, we must first download and process the transcripts for an instance, because the instance itself has no stored usage data for us to use.

In [2]:
from cdp_data.datasets import get_session_dataset
from cdp_data.keywords import compute_ngram_usage_history

session_ds = get_session_dataset("cdp-seattle-21723dcf", store_transcript=True)
session_ds.head()

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:02<00:00, 130.19it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [00:19<00:00, 13.98it/s]


Unnamed: 0,session_datetime,session_index,session_content_hash,video_uri,caption_uri,external_source_id,id,key,event,transcript,transcript_path
0,2021-05-26 16:30:00+00:00,0,47bd7bd47f1f6c6e387ca27af10bc6196e9e349c3a08ff...,https://video.seattle.gov/media/council/land_0...,https://www.seattlechannel.org/documents/seatt...,,008f4e8d253c,session/008f4e8d253c,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/home/maxfield/active/cdp/cdp-data/notebooks/c...
1,2022-03-09 22:00:00+00:00,0,3a28c8fb6e43cbc2b1e67daa98d7c363f3c6a311fc6bca...,https://video.seattle.gov/media/council/land_0...,https://www.seattlechannel.org/documents/seatt...,,015dd602acce,session/015dd602acce,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/home/maxfield/active/cdp/cdp-data/notebooks/c...
2,2021-07-12 16:30:00+00:00,0,ba927581999043a7c4b119785bb7a24a12b88caa6cfbde...,https://video.seattle.gov/media/council/brief_...,https://www.seattlechannel.org/documents/seatt...,,01a6d09dd442,session/01a6d09dd442,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/home/maxfield/active/cdp/cdp-data/notebooks/c...
3,2022-01-31 22:00:00+00:00,0,fb6014b9d11fe01b63312f28ad7f16989efedd48e9aa0c...,https://video.seattle.gov/media/council/brief_...,https://www.seattlechannel.org/documents/seatt...,,01e75165fea1,session/01e75165fea1,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/home/maxfield/active/cdp/cdp-data/notebooks/c...
4,2021-09-20 16:30:00+00:00,0,a5ba330d5e02eeaf8d8338fd2766f800edaea5bd4d4d8d...,https://video.seattle.gov/media/council/brief_...,https://www.seattlechannel.org/documents/seatt...,,0208ee2103f7,session/0208ee2103f7,<cdp_backend.database.models.Event object at 0...,<cdp_backend.database.models.Transcript object...,/home/maxfield/active/cdp/cdp-data/notebooks/c...


Once the session dataset is downloaded and cached, we can process it to create a DataFrame of ngram counts for each day in the dataset. We also store this counts DataFrame to disk because this step takes a while to complete. If we want to pick up where we left off, reading the CSV is faster than recomputing the count data.
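The read-instead-of-recompute idea can be wrapped in a small helper. This is only a minimal sketch: `load_or_compute` is a hypothetical name, not part of `cdp_data`, and it assumes the counts were written with `to_csv(..., index=False)` as in the next cell.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_compute(cache_path: str, compute: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical helper: read the cached CSV if present, else compute and cache."""
    path = Path(cache_path)
    if path.exists():
        # Fast path: the expensive step already ran in an earlier session.
        return pd.read_csv(path)
    # Slow path: run the expensive computation once and store the result.
    df = compute()
    df.to_csv(path, index=False)
    return df


# Possible usage with the names from this notebook:
# ngram_usage_counts = load_or_compute(
#     "keyword_usage.csv",
#     lambda: compute_ngram_usage_history(session_ds),
# )
```

One caveat: `pd.read_csv` returns date columns as plain strings unless `parse_dates` is passed, so a reloaded DataFrame may need its date columns converted before further processing.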

In [3]:
ngram_usage_counts = compute_ngram_usage_history(session_ds)
ngram_usage_counts.to_csv("keyword_usage.csv", index=False)
ngram_usage_counts.head()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 267/267 [01:31<00:00,  2.92it/s]


Unnamed: 0,ngram,count,session_id,session_datetime,session_date,day_ngram_count_sum,day_words_count_sum,day_ngram_percent_usage
0,good,26,008f4e8d253c,2021-05-26 16:30:00+00:00,2021-05-26,54,19844,0.002721
1,morn,12,008f4e8d253c,2021-05-26 16:30:00+00:00,2021-05-26,12,19844,0.000605
2,everyon,7,008f4e8d253c,2021-05-26 16:30:00+00:00,2021-05-26,19,19844,0.000957
3,apolog,6,008f4e8d253c,2021-05-26 16:30:00+00:00,2021-05-26,7,19844,0.000353
4,hit,1,008f4e8d253c,2021-05-26 16:30:00+00:00,2021-05-26,2,19844,0.000101
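Judging from the rows shown above, `day_ngram_percent_usage` appears to be `day_ngram_count_sum` divided by `day_words_count_sum`, so it is a fraction between 0 and 1 rather than a 0-100 percentage. The sketch below re-derives the column from the five displayed rows. The formula is inferred from the output, not taken from the library source, and `usage_fraction` is a hypothetical helper.

```python
# Values copied from the table above:
# (ngram, day_ngram_count_sum, day_words_count_sum, day_ngram_percent_usage)
rows = [
    ("good", 54, 19844, 0.002721),
    ("morn", 12, 19844, 0.000605),
    ("everyon", 19, 19844, 0.000957),
    ("apolog", 7, 19844, 0.000353),
    ("hit", 2, 19844, 0.000101),
]


def usage_fraction(day_ngram_count_sum: int, day_words_count_sum: int) -> float:
    """Assumed formula: share of the day's words accounted for by one ngram."""
    return day_ngram_count_sum / day_words_count_sum


for ngram, gram_sum, word_sum, reported in rows:
    computed = round(usage_fraction(gram_sum, word_sum), 6)
    print(f"{ngram:>8}: computed={computed:.6f} reported={reported:.6f}")
```

Note that `count` is per session while `day_ngram_count_sum` is per day, which would explain why "good" has a count of 26 in this session but a day sum of 54.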


Note how this DataFrame stores the "stemmed" ngram. If we wanted to be strict about the ngram structure and use the full English spelling instead, we could provide `strict=True` to the `compute_ngram_usage_history` function.

In [4]:
from cdp_data.keywords import prepare_ngram_usage_history_plotting_data

prepped_police_data = prepare_ngram_usage_history_plotting_data("police", ngram_usage_counts)
prepped_police_data

Unnamed: 0,ngram,session_date,day_ngram_percent_usage
0,polic,2021-01-04,0.005360
1,polic,2021-01-05,0.000000
2,polic,2021-01-06,0.000000
3,polic,2021-01-07,0.000000
4,polic,2021-01-08,0.000000
...,...,...,...
453,polic,2022-04-02,0.000000
454,polic,2022-04-03,0.000000
455,polic,2022-04-04,0.001012
456,polic,2022-04-05,0.000000


To clean this up and prepare the data specifically for plotting, we import and use `prepare_ngram_usage_history_plotting_data`.

This function stems the provided ngram of interest (with similar `strict=True` behavior if you don't want stemmed grams), adds any dates missing from the date range found in the provided DataFrame, and sets the percent usage for those dates to 0.

Additionally it will subset the data to just the columns we need for plotting.
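The date-filling step can be pictured with plain pandas. The following is a rough sketch of the idea on made-up values, not the library's actual implementation: it reindexes over every calendar day in the observed range and fills the gaps with 0.

```python
import pandas as pd

# Toy input (made-up values): usage recorded on only two of four calendar days.
sparse = pd.DataFrame({
    "ngram": ["polic", "polic"],
    "session_date": pd.to_datetime(["2021-01-04", "2021-01-07"]),
    "day_ngram_percent_usage": [0.005360, 0.001200],
})

# Every calendar day between the first and last observed date.
full_range = pd.date_range(
    sparse["session_date"].min(), sparse["session_date"].max(), freq="D"
)

# Days without a meeting become NaN rows, which are then set to 0.
filled = sparse.set_index("session_date").reindex(full_range)
filled["day_ngram_percent_usage"] = filled["day_ngram_percent_usage"].fillna(0.0)
filled["ngram"] = "polic"

# Restore the date column and keep only the columns needed for plotting.
filled = filled.rename_axis("session_date").reset_index()
filled = filled[["ngram", "session_date", "day_ngram_percent_usage"]]
print(filled)
```

Filling with explicit zeros matters for the line charts below: without them, the line would interpolate straight between meeting days instead of dropping back to zero.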

In [5]:
alt.Chart(prepped_police_data).mark_line(interpolate="basis").encode(
    x="session_date:T",
    y="day_ngram_percent_usage:Q",
    color="ngram:N",
)

We can also prepare the usage of multiple ngrams over time and compare them against each other.

In [6]:
gram_history = pd.concat([
    prepped_police_data,
    prepare_ngram_usage_history_plotting_data("housing", ngram_usage_counts),
    prepare_ngram_usage_history_plotting_data("transportation", ngram_usage_counts),
])
gram_history

Unnamed: 0,ngram,session_date,day_ngram_percent_usage
0,polic,2021-01-04,0.005360
1,polic,2021-01-05,0.000000
2,polic,2021-01-06,0.000000
3,polic,2021-01-07,0.000000
4,polic,2021-01-08,0.000000
...,...,...,...
453,transport,2022-04-02,0.000000
454,transport,2022-04-03,0.000000
455,transport,2022-04-04,0.001012
456,transport,2022-04-05,0.000588


In [7]:
base = alt.Chart(gram_history).mark_line(interpolate="basis").encode(
    x="session_date:T",
    y="day_ngram_percent_usage:Q",
    color="ngram:N",
)

chart = alt.hconcat()
for ngram in gram_history.ngram.unique():
    chart |= base.transform_filter(datum.ngram == ngram)

chart.resolve_scale(
    x="shared",
    y="shared",
)

## Ngram Relevancy Over Time

In addition to simple "percent of total" Ngram trends, we can also plot how an Ngram is deemed relevant or not over time.
This is useful to see where spikes in activity occur.

While the "Ngram Usage Over Time" section detailed how an Ngram may be used in every meeting, this function and plot will normalize such behaviors and help us narrow in on when major activity and discussion occurred around the topic.

In [8]:
from cdp_data.keywords import get_ngram_relevancy_history

police = get_ngram_relevancy_history("police", infrastructure_slug="cdp-seattle-21723dcf")
police.head()

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 188/188 [00:01<00:00, 108.23it/s]


Unnamed: 0,unstemmed_gram,stemmed_gram,context_span,value,datetime_weighted_value,id,key,query_gram,event,event_datetime
0,police,polic,... calendar--there is no item on the agenda f...,1.203569,0.614036,00bc111ce558,indexed_event_gram/00bc111ce558,police,<cdp_backend.database.models.Event object at 0...,2021-11-08 22:00:00+00:00
1,Police,polic,That's the Seattle Police Department's communi...,0.300892,0.109354,00ec58f23fce,indexed_event_gram/00ec58f23fce,police,<cdp_backend.database.models.Event object at 0...,2021-02-24 22:00:00+00:00
2,police,polic,... their peers being killed at the hands of p...,0.300892,0.115281,02206e5a2e1e,indexed_event_gram/02206e5a2e1e,police,<cdp_backend.database.models.Event object at 0...,2021-04-27 21:00:00+00:00
3,police,polic,... of the bill before you to end Seattle poli...,32.496369,14.994999,026b102636c0,indexed_event_gram/026b102636c0,police,<cdp_backend.database.models.Event object at 0...,2021-09-20 21:00:00+00:00
4,police,polic,... what has been happening and that's why the...,0.902677,0.326806,0301372cc93b,indexed_event_gram/0301372cc93b,police,<cdp_backend.database.models.Event object at 0...,2021-02-19 17:30:00+00:00


If we want to clean this up and prepare the data specifically for plotting we can import and use a function to do just that.

This function will add missing dates for each ngram present in the provided DataFrame and set the values for those dates to 0. In the case that there were multiple meetings on the same day which both utilized the query gram, the meeting with the max value is chosen for the date.

Additionally it will subset the data to just the columns we need for plotting.
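The two steps described above (keep the max value per day, then zero-fill the missing days) can be sketched with plain pandas on made-up values. This illustrates the described behavior rather than the library's code, and it assumes one date range shared by all grams.

```python
import pandas as pd

# Toy input (made-up values): two meetings on the same day plus one later meeting.
events = pd.DataFrame({
    "query_gram": ["police", "police", "police"],
    "event_datetime": pd.to_datetime(
        ["2021-01-04 17:30", "2021-01-04 22:00", "2021-01-06 22:00"], utc=True
    ),
    "value": [1.2, 4.8, 0.3],
})

# Collapse each timestamp to its calendar day, then keep the max value per gram and day.
events["event_datetime"] = events["event_datetime"].dt.floor("D")
daily_max = events.groupby(["query_gram", "event_datetime"], as_index=False)["value"].max()

# Assumption: a single date range spanning the whole DataFrame is used for every gram.
days = pd.date_range(
    daily_max["event_datetime"].min(), daily_max["event_datetime"].max(), freq="D"
)

# Add the days with no matching meeting and give them a value of 0.
pieces = []
for gram, group in daily_max.groupby("query_gram"):
    series = group.set_index("event_datetime")["value"].reindex(days, fill_value=0.0)
    pieces.append(
        pd.DataFrame({"query_gram": gram, "event_datetime": days, "value": series.to_numpy()})
    )

prepped = pd.concat(pieces, ignore_index=True)
print(prepped)
```

In this toy case the two meetings on 2021-01-04 collapse to the single value 4.8, and 2021-01-05 is added with a value of 0.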

In [9]:
from cdp_data.keywords import prepare_ngram_relevancy_history_plotting_data

prepped_police_data = prepare_ngram_relevancy_history_plotting_data(police)
prepped_police_data.head()

Unnamed: 0,query_gram,event_datetime,value
0,police,2021-01-04 00:00:00+00:00,4.814277
1,police,2021-01-05 00:00:00+00:00,0.0
2,police,2021-01-06 00:00:00+00:00,0.0
3,police,2021-01-07 00:00:00+00:00,0.0
4,police,2021-01-08 00:00:00+00:00,0.0


In [10]:
alt.Chart(prepped_police_data).mark_line(interpolate="basis").encode(
    x="event_datetime:T",
    y="value:Q",
    color="query_gram:N",
)

We can also plot multiple ngrams to compare when their activity spikes occur relative to one another.

In [11]:
gram_history = pd.concat([
    police,
    get_ngram_relevancy_history("housing", infrastructure_slug="cdp-seattle-21723dcf"),
    get_ngram_relevancy_history("transportation", infrastructure_slug="cdp-seattle-21723dcf"),
])
gram_history.head()

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 235/235 [00:01<00:00, 177.72it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:01<00:00, 111.84it/s]


Unnamed: 0,unstemmed_gram,stemmed_gram,context_span,value,datetime_weighted_value,id,key,query_gram,event,event_datetime
0,police,polic,... calendar--there is no item on the agenda f...,1.203569,0.614036,00bc111ce558,indexed_event_gram/00bc111ce558,police,<cdp_backend.database.models.Event object at 0...,2021-11-08 22:00:00+00:00
1,Police,polic,That's the Seattle Police Department's communi...,0.300892,0.109354,00ec58f23fce,indexed_event_gram/00ec58f23fce,police,<cdp_backend.database.models.Event object at 0...,2021-02-24 22:00:00+00:00
2,police,polic,... their peers being killed at the hands of p...,0.300892,0.115281,02206e5a2e1e,indexed_event_gram/02206e5a2e1e,police,<cdp_backend.database.models.Event object at 0...,2021-04-27 21:00:00+00:00
3,police,polic,... of the bill before you to end Seattle poli...,32.496369,14.994999,026b102636c0,indexed_event_gram/026b102636c0,police,<cdp_backend.database.models.Event object at 0...,2021-09-20 21:00:00+00:00
4,police,polic,... what has been happening and that's why the...,0.902677,0.326806,0301372cc93b,indexed_event_gram/0301372cc93b,police,<cdp_backend.database.models.Event object at 0...,2021-02-19 17:30:00+00:00


In [12]:
# Prepare all for plotting
police_housing_transpo = prepare_ngram_relevancy_history_plotting_data(gram_history)
police_housing_transpo.head()

Unnamed: 0,query_gram,event_datetime,value
0,housing,2021-01-04 00:00:00+00:00,0.62199
1,housing,2021-01-05 00:00:00+00:00,0.0
2,housing,2021-01-06 00:00:00+00:00,0.0
3,housing,2021-01-07 00:00:00+00:00,0.0
4,housing,2021-01-08 00:00:00+00:00,0.0


In [13]:
base = alt.Chart(police_housing_transpo).mark_line(interpolate="basis").encode(
    x="event_datetime:T",
    y="value:Q",
    color="query_gram:N",
)

chart = alt.hconcat()
for query_gram in gram_history.query_gram.unique():
    chart |= base.transform_filter(datum.query_gram == query_gram)

chart.resolve_scale(
    x="shared",
    y="shared",
)