# CDP Ngram Viewer

One of the interesting things we can do with CDP data is look at trends in discussion by keyword.
Much like [Google's Ngram Viewer](https://books.google.com/ngrams) we can plot these trends over time.

## Ngram Usage Over Time

To generate a plot using the same process as Google's Ngram Viewer, we must download and then process the transcripts for an instance, because the instance itself does not store precomputed usage data for us to use.
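Google's Ngram Viewer reports, for each time period, the share of all ngrams that match the query. The sketch below illustrates that "percent of total" idea in plain Python. It is not the `cdp_data` implementation: the helper name `ngram_usage_by_day`, the toy transcripts, and the whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter


def ngram_usage_by_day(transcripts, query):
    """Return {day: percent of that day's tokens equal to `query`}.

    `transcripts` is a hypothetical mapping of day -> list of tokens.
    """
    usage = {}
    for day, tokens in transcripts.items():
        # Count case-insensitively so "Police" and "police" are the same gram
        counts = Counter(token.lower() for token in tokens)
        total = sum(counts.values())
        usage[day] = 100 * counts[query.lower()] / total if total else 0.0
    return usage


# Toy data standing in for downloaded and tokenized transcripts
toy_transcripts = {
    "2021-01-04": "the police budget and the police contract".split(),
    "2021-01-05": "housing levy renewal".split(),
}
usage = ngram_usage_by_day(toy_transcripts, "police")
print(usage)
```

A real pipeline would also need stemming and multi-word grams, which this sketch leaves out.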

In [1]:
# TODO
# from cdp_data.keywords import get_ngram_usage_history

## Ngram Relevancy Over Time

In addition to simple "percent of total" Ngram trends, we can also plot how relevant an Ngram is over time.
This is useful for seeing where spikes in activity occur.

While the "Ngram Usage Over Time" section shows how often an Ngram is used, an Ngram may appear in nearly every meeting. The relevancy function and plot normalize that baseline usage and help us narrow in on when major activity and discussion occurred around the topic.

In [2]:
from cdp_data.keywords import get_ngram_relevancy_history
import pandas as pd

police = get_ngram_relevancy_history("police", infrastructure_slug="cdp-seattle-21723dcf")
police.head()


| | unstemmed_gram | stemmed_gram | context_span | value | datetime_weighted_value | id | key | query_gram | event | event_datetime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | police | polic | ... calendar--there is no item on the agenda f... | 1.203569 | 0.631109 | 00bc111ce558 | indexed_event_gram/00bc111ce558 | police | `<cdp_backend.database.models.Event object at 0...` | 2021-11-08 22:00:00+00:00 |
| 1 | Police | polic | That's the Seattle Police Department's communi... | 0.300892 | 0.110304 | 00ec58f23fce | indexed_event_gram/00ec58f23fce | police | `<cdp_backend.database.models.Event object at 0...` | 2021-02-24 22:00:00+00:00 |
| 2 | police | polic | ... their peers being killed at the hands of p... | 0.300892 | 0.116501 | 02206e5a2e1e | indexed_event_gram/02206e5a2e1e | police | `<cdp_backend.database.models.Event object at 0...` | 2021-04-27 21:00:00+00:00 |
| 3 | police | polic | ... of the bill before you to end Seattle poli... | 32.496369 | 15.297768 | 026b102636c0 | indexed_event_gram/026b102636c0 | police | `<cdp_backend.database.models.Event object at 0...` | 2021-09-20 21:00:00+00:00 |
| 4 | police | polic | ... what has been happening and that's why the... | 0.902677 | 0.329602 | 0301372cc93b | indexed_event_gram/0301372cc93b | police | `<cdp_backend.database.models.Event object at 0...` | 2021-02-19 17:30:00+00:00 |


To clean this up and prepare the data specifically for plotting, we can import and use a function that does just that.

This function adds any missing dates for each ngram present in the provided DataFrame and sets the values for those dates to 0. If multiple meetings on the same day used the query gram, the meeting with the highest value is used for that date.

Additionally, it subsets the data to just the columns we need for plotting.
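To make those steps concrete, here is a minimal pandas sketch of the described behavior. It is an illustration under stated assumptions, not the source of `prepare_ngram_history_plotting_data`: the helper name `fill_missing_dates` is hypothetical, and the choice to span the date range from the earliest to the latest event in the whole DataFrame is a guess about how the gaps are filled.

```python
import pandas as pd


def fill_missing_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: max value per gram per day, with absent days set to 0."""
    # Keep only the columns needed for plotting
    df = df[["query_gram", "event_datetime", "value"]].copy()
    # Collapse each timestamp to midnight of its calendar day
    df["event_datetime"] = df["event_datetime"].dt.normalize()
    # When several meetings fall on one day, keep the highest value
    daily = df.groupby(["query_gram", "event_datetime"])["value"].max()
    # Assumption: one shared daily range from the first to the last event
    all_days = pd.date_range(
        df["event_datetime"].min(), df["event_datetime"].max(), freq="D"
    )
    full_index = pd.MultiIndex.from_product(
        [daily.index.get_level_values("query_gram").unique(), all_days],
        names=["query_gram", "event_datetime"],
    )
    # Days without a meeting get a value of 0
    return daily.reindex(full_index, fill_value=0.0).reset_index()


# Toy data: two meetings on the same day, then a gap of two days
example = pd.DataFrame({
    "query_gram": ["police", "police", "police"],
    "event_datetime": pd.to_datetime(
        ["2021-01-04 17:00", "2021-01-04 22:00", "2021-01-07 22:00"], utc=True
    ),
    "value": [1.5, 4.0, 2.0],
})
filled = fill_missing_dates(example)
print(filled)
```

Filling the gaps with zeros matters for the line charts that follow: without them, the line would interpolate directly between meeting days and hide the quiet periods.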

In [3]:
from cdp_data.keywords import prepare_ngram_history_plotting_data

prepped_police_data = prepare_ngram_history_plotting_data(police)
prepped_police_data.head()

| | query_gram | event_datetime | value |
|---|---|---|---|
| 0 | police | 2021-01-04 00:00:00+00:00 | 4.814277 |
| 1 | police | 2021-01-05 00:00:00+00:00 | 0.0 |
| 2 | police | 2021-01-06 00:00:00+00:00 | 0.0 |
| 3 | police | 2021-01-07 00:00:00+00:00 | 0.0 |
| 4 | police | 2021-01-08 00:00:00+00:00 | 0.0 |


In [4]:
import altair as alt

alt.Chart(prepped_police_data).mark_line(interpolate="basis").encode(
    x="event_datetime:T",
    y="value:Q",
    color="query_gram:N",
)

We can also plot multiple ngrams to compare when their spikes in activity occur.

In [5]:
gram_history = pd.concat([
    police,
    get_ngram_relevancy_history("housing", infrastructure_slug="cdp-seattle-21723dcf"),
    get_ngram_relevancy_history("transportation", infrastructure_slug="cdp-seattle-21723dcf"),
])
gram_history.head()


| | unstemmed_gram | stemmed_gram | context_span | value | datetime_weighted_value | id | key | query_gram | event | event_datetime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | police | polic | ... calendar--there is no item on the agenda f... | 1.203569 | 0.631109 | 00bc111ce558 | indexed_event_gram/00bc111ce558 | police | `<cdp_backend.database.models.Event object at 0...` | 2021-11-08 22:00:00+00:00 |
| 1 | Police | polic | That's the Seattle Police Department's communi... | 0.300892 | 0.110304 | 00ec58f23fce | indexed_event_gram/00ec58f23fce | police | `<cdp_backend.database.models.Event object at 0...` | 2021-02-24 22:00:00+00:00 |
| 2 | police | polic | ... their peers being killed at the hands of p... | 0.300892 | 0.116501 | 02206e5a2e1e | indexed_event_gram/02206e5a2e1e | police | `<cdp_backend.database.models.Event object at 0...` | 2021-04-27 21:00:00+00:00 |
| 3 | police | polic | ... of the bill before you to end Seattle poli... | 32.496369 | 15.297768 | 026b102636c0 | indexed_event_gram/026b102636c0 | police | `<cdp_backend.database.models.Event object at 0...` | 2021-09-20 21:00:00+00:00 |
| 4 | police | polic | ... what has been happening and that's why the... | 0.902677 | 0.329602 | 0301372cc93b | indexed_event_gram/0301372cc93b | police | `<cdp_backend.database.models.Event object at 0...` | 2021-02-19 17:30:00+00:00 |


In [6]:
# Prepare all for plotting
police_housing_transpo = prepare_ngram_history_plotting_data(gram_history)
police_housing_transpo.head()

| | query_gram | event_datetime | value |
|---|---|---|---|
| 0 | housing | 2021-01-04 00:00:00+00:00 | 0.62199 |
| 1 | housing | 2021-01-05 00:00:00+00:00 | 0.0 |
| 2 | housing | 2021-01-06 00:00:00+00:00 | 0.0 |
| 3 | housing | 2021-01-07 00:00:00+00:00 | 0.0 |
| 4 | housing | 2021-01-08 00:00:00+00:00 | 0.0 |


In [7]:
from altair.expr import datum

base = alt.Chart(police_housing_transpo).mark_line(interpolate="basis").encode(
    x="event_datetime:T",
    y="value:Q",
    color="query_gram:N",
)

chart = alt.hconcat()
for query_gram in gram_history.query_gram.unique():
    chart |= base.transform_filter(datum.query_gram == query_gram)

chart.resolve_scale(
    x="shared",
    y="shared",
)