# CDP Ngram Viewer

One of the interesting things we can do with CDP data is look at trends in discussion by keyword.
Much like [Google's Ngram Viewer](https://books.google.com/ngrams) we can plot these trends over time.

## Ngram Usage Over Time

To generate a plot using the same process as Google's Ngram Viewer, we must download and then process the transcripts for an instance, because the instance itself stores no precomputed usage data for us to use.

In [1]:
# TODO
# from cdp_data.keywords import get_ngram_usage_history
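Until that function is available, the "percent of total" idea can be sketched by hand. The following is a minimal illustration, not the `cdp_data` implementation: the `transcripts` list and the `ngram_percent_of_total` helper are hypothetical, and tokenization is assumed to be a plain lowercase whitespace split.

```python
from collections import Counter
from datetime import date

# Hypothetical stand-in for downloaded transcripts: (meeting date, transcript text).
transcripts = [
    (date(2021, 1, 4), "the police budget and the housing levy"),
    (date(2021, 1, 11), "housing housing and transportation"),
    (date(2021, 2, 1), "police oversight of police conduct"),
]


def ngram_percent_of_total(text, ngram):
    """Return the share of same-length ngrams in `text` that equal `ngram`."""
    tokens = text.lower().split()
    n = len(ngram.split())
    # Build every contiguous run of n tokens in the transcript.
    grams = [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return Counter(grams)[ngram.lower()] / len(grams)


# One record per meeting, shaped like the columns plotted later in this notebook.
usage = [
    {
        "event_date": day,
        "query_gram": "police",
        "value": ngram_percent_of_total(text, "police"),
    }
    for day, text in transcripts
]
```

Each record holds the fraction of a meeting's unigrams that match the query, which is the quantity a usage-over-time plot would chart.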

## Ngram Relevancy Over Time

In addition to simple "percent of total" Ngram trends, we can also plot how relevant an Ngram is deemed to be over time.
This is useful to see where spikes in activity occur.

While the "Ngram Usage Over Time" section covers raw usage, where an Ngram may simply appear in every meeting, this function and plot normalize for that behavior and help us narrow in on when major activity and discussion occurred around the topic.
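To see why normalization matters, consider a TF-IDF style weighting. This is an assumption chosen for illustration only, and it is not necessarily the score that `get_ngram_relevancy_history` returns; the `meetings` data and the `tf_idf` helper below are hypothetical.

```python
import math

# Hypothetical tokenized meetings: "council" appears in every meeting,
# while "police" appears in only some of them.
meetings = {
    "2021-02-19": ["council", "police", "budget"],
    "2021-04-27": ["council", "housing", "levy"],
    "2021-09-20": ["council", "police", "police", "police"],
    "2021-11-08": ["council", "transportation", "plan"],
}


def tf_idf(term, tokens, all_meetings):
    """Term frequency in one meeting, scaled by how rare the term is overall."""
    tf = tokens.count(term) / len(tokens)
    containing = sum(1 for other in all_meetings.values() if term in other)
    # A term found in every meeting gets log(1) = 0, so it never spikes.
    idf = math.log(len(all_meetings) / containing) if containing else 0.0
    return tf * idf


scores = {
    term: {day: tf_idf(term, tokens, meetings) for day, tokens in meetings.items()}
    for term in ("council", "police")
}
```

Under this weighting, a word used in every meeting scores zero everywhere, while a word concentrated in a few meetings scores highest on the days it dominates the discussion.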

In [2]:
from cdp_data.keywords import get_ngram_relevancy_history
import pandas as pd

police = get_ngram_relevancy_history("police", infrastructure_slug="cdp-seattle-21723dcf")
police[["event_datetime", "value", "query_gram"]].head()

Event attachment: 100%|██████████| 188/188 [00:21<00:00,  8.61it/s]


|   | event_datetime | value | query_gram |
|---|---|---|---|
| 0 | 2021-11-08 22:00:00+00:00 | 1.203569 | police |
| 1 | 2021-02-24 22:00:00+00:00 | 0.300892 | police |
| 2 | 2021-04-27 21:00:00+00:00 | 0.300892 | police |
| 3 | 2021-09-20 21:00:00+00:00 | 32.496369 | police |
| 4 | 2021-02-19 17:30:00+00:00 | 0.902677 | police |


In [3]:
import altair as alt

alt.Chart(police[["event_datetime", "value", "query_gram"]]).mark_line(interpolate="step").encode(
    x="event_datetime:T",
    y="value:Q",
    color="query_gram:N",
)
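Alongside the chart, it can help to list the highest-scoring events directly. The sketch below uses a toy frame, `sample`, built from the five rows shown in the output above so that it runs on its own; with real data, the same `nlargest` call could be applied to `police`.

```python
import pandas as pd

# Toy frame with the same columns as `police`, using the five rows shown above.
sample = pd.DataFrame(
    {
        "event_datetime": pd.to_datetime(
            [
                "2021-11-08 22:00:00+00:00",
                "2021-02-24 22:00:00+00:00",
                "2021-04-27 21:00:00+00:00",
                "2021-09-20 21:00:00+00:00",
                "2021-02-19 17:30:00+00:00",
            ]
        ),
        "value": [1.203569, 0.300892, 0.300892, 32.496369, 0.902677],
        "query_gram": "police",
    }
)

# Keep the three events with the highest relevancy value, largest first.
top_spikes = sample.nlargest(3, "value")
```

Sorting by `value` this way surfaces the meeting dates behind the tallest steps in the line chart.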

We can also plot multiple ngrams to compare when each one's spikes in activity occur.

In [4]:
gram_history = pd.concat([
    police,
    get_ngram_relevancy_history("housing", infrastructure_slug="cdp-seattle-21723dcf"),
    get_ngram_relevancy_history("transportation", infrastructure_slug="cdp-seattle-21723dcf"),
])
gram_history[["event_datetime", "value", "query_gram"]].head()

Event attachment: 100%|██████████| 235/235 [00:24<00:00,  9.76it/s]
Event attachment: 100%|██████████| 184/184 [00:11<00:00, 15.44it/s]


|   | event_datetime | value | query_gram |
|---|---|---|---|
| 0 | 2021-11-08 22:00:00+00:00 | 1.203569 | police |
| 1 | 2021-02-24 22:00:00+00:00 | 0.300892 | police |
| 2 | 2021-04-27 21:00:00+00:00 | 0.300892 | police |
| 3 | 2021-09-20 21:00:00+00:00 | 32.496369 | police |
| 4 | 2021-02-19 17:30:00+00:00 | 0.902677 | police |


In [5]:
from altair.expr import datum

base = alt.Chart(gram_history[["event_datetime", "value", "query_gram"]]).mark_line(interpolate="step").encode(
    x="event_datetime:T",
    y="value:Q",
    color="query_gram:N",
)

chart = alt.hconcat()
for query_gram in gram_history.query_gram.unique():
    chart |= base.transform_filter(datum.query_gram == query_gram)

chart.resolve_scale(
    x="shared",
    y="shared",
)