# Lecture 8 -- Analysis and Visualization

## Outline

* We have learned how many different tools work, how to apply them, and how to train and build some of our own models.

* However, we have not yet combined these tools in a meaningful way.

* In this chapter, we are going to do just that.

* We will use the tools we have learned to answer a question by analyzing a dataset, visualizing the results, and reporting what we find.

## Coming Up With Questions

* There are many questions we can ask about this data, but a good place to start is generally with a simple question that may reveal a larger pattern worth further investigation.

* For example, let's try to answer the question: "How does discussion about housing and homelessness take place across the country? What is the tone of these discussions?"

## Planning

* Our first step is to break down how we can answer our question.

* We will want to identify meetings in which housing and homelessness come up. When you are working on identification problems, **classification** is usually a good place to start.

* We will also want to look into the "tone" of each meeting: are these meetings generally positive, generally negative, or mixed? Which meetings are mixed, and why? These types of questions require a sentiment model.

* Before we do any of this, we will probably want to work with a small dataset and only then scale up to the entire corpus.

* Let's start by looking for housing and homelessness in meetings of the Seattle City Council held between the start of 2021 and the start of 2022.
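
The two steps in this plan can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the approach used later in the lecture: the keyword list, the function names, and the 60% majority threshold are all hypothetical, and the keyword match is a stand-in for a trained classifier. The tone summary assumes some sentiment model has already labeled each chunk of a meeting as "positive", "negative", or "neutral".

```python
from collections import Counter

# Hypothetical keyword list; a trained classifier would replace this stand-in.
HOUSING_KEYWORDS = ("housing", "homeless", "shelter", "eviction")


def mentions_housing(text, keywords=HOUSING_KEYWORDS):
    """Flag text containing any keyword (case-insensitive substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def summarize_tone(chunk_labels, majority=0.6):
    """Collapse per-chunk sentiment labels into one meeting-level tone.

    `chunk_labels` is assumed to be a list of "positive", "negative", or
    "neutral" strings produced by some sentiment model.
    """
    counts = Counter(chunk_labels)
    opinionated = counts["positive"] + counts["negative"]
    if opinionated == 0:
        # No chunk expressed a clear opinion either way.
        return "neutral"
    if counts["positive"] / opinionated >= majority:
        return "positive"
    if counts["negative"] / opinionated >= majority:
        return "negative"
    # Neither side reaches the majority threshold.
    return "mixed"
```

Meetings that come back as "mixed" are the ones worth reading more closely, since neither tone dominates among their chunks.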

## Identifying Housing and Homelessness Meetings in Seattle

* Import and pull down the data, create chunks, embed them, reduce the embeddings with UMAP, cluster, and inspect the clusters.

* We will use a new embedding model with a larger context length so that as much of each meeting transcript as possible can be embedded at once:
    * https://huggingface.co/jinaai/jina-embeddings-v2-base-en
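
The chunking step in the list above can be sketched as follows. This is a minimal illustration, assuming each transcript is available as an ordered list of sentence strings. The function names and the default chunk size of 20 sentences are hypothetical choices, not values taken from the lecture. Each chunk is keyed by a session id and a chunk index so that it can later be matched to its embedding.

```python
def chunk_sentences(sentences, chunk_size=20):
    """Group consecutive sentences into chunks of at most `chunk_size`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[start : start + chunk_size])
        for start in range(0, len(sentences), chunk_size)
    ]


def key_chunks(session_id, sentences, chunk_size=20):
    """Map (session_id, chunk_index) to the text of each chunk."""
    return {
        (session_id, index): chunk
        for index, chunk in enumerate(chunk_sentences(sentences, chunk_size))
    }
```

Larger chunks give the embedding model more context per vector, while smaller chunks make it easier to locate where in a meeting a topic comes up. The right size depends on the model's context length.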

In [2]:
from cdp_data import CDPInstances, datasets

sessions = datasets.get_session_dataset(
    CDPInstances.Seattle,
    start_datetime="2021-01-01",
    end_datetime="2022-01-01",
    store_transcript=True,
    store_transcript_as_csv=True,
    raise_on_error=False,
)
sessions

| | session_datetime | session_index | session_content_hash | video_uri | video_start_time | video_end_time | caption_uri | external_source_id | id | key | event | transcript | transcript_path | transcript_as_csv_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-04 17:30:00+00:00 | 0 | afb06065232167ccc35ad4cc7bb24c46a67357846c6acd... | https://video.seattle.gov/media/council/brief_... | | | https://www.seattlechannel.org/documents/seatt... | | de9d821a3aba | session/de9d821a3aba | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 1 | 2021-01-04 22:00:00+00:00 | 0 | 81a736cb5605f776712765dd770df7a28f554987c893ea... | https://video.seattle.gov/media/council/counci... | | | https://www.seattlechannel.org/documents/seatt... | | f1434b7a9fa2 | session/f1434b7a9fa2 | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 2 | 2021-01-11 17:30:00+00:00 | 0 | 8265168b36b383e19509244766c0d4e789230ee839371b... | https://video.seattle.gov/media/council/brief_... | | | https://www.seattlechannel.org/documents/seatt... | | fe7c8aa0dd58 | session/fe7c8aa0dd58 | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 3 | 2021-01-11 22:00:00+00:00 | 0 | 386d37689ea482446c8837f1a950c66af9d0bbf7f04ba0... | https://video.seattle.gov/media/council/counci... | | | https://www.seattlechannel.org/documents/seatt... | | 2b30cc0f5847 | session/2b30cc0f5847 | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 4 | 2021-01-12 17:30:00+00:00 | 0 | 91118e8a210e92600a869df3e4e4b7c27b2785d025c2f0... | https://video.seattle.gov/media/council/safe_0... | | | https://www.seattlechannel.org/documents/seatt... | | eff5db85156a | session/eff5db85156a | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | 2021-12-10 17:30:00+00:00 | 0 | 30fa17ba9b2eab53bf435b47a16dc97f6130dc1ac31dca... | https://video.seattle.gov/media/council/econ_1... | | | https://www.seattlechannel.org/documents/seatt... | | 5930000b2d24 | session/5930000b2d24 | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 222 | 2021-12-13 17:30:00+00:00 | 0 | 0c190b87881c4d1c598ec67eb267129170ee2e773550df... | https://video.seattle.gov/media/council/brief_... | | | https://www.seattlechannel.org/documents/seatt... | | 62504aeb2c9e | session/62504aeb2c9e | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 223 | 2021-12-13 22:00:00+00:00 | 0 | ce27b141953c0e4e281ff6d99bb98fae719b0447a8690a... | https://video.seattle.gov/media/council/counci... | | | https://www.seattlechannel.org/documents/seatt... | | 726ec414e79f | session/726ec414e79f | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |
| 224 | 2021-12-14 17:30:00+00:00 | 0 | 1d5d9034b132cdc8702ce7399729a9a01526639cc8e4fb... | https://video.seattle.gov/media/council/safe_1... | | | https://www.seattlechannel.org/documents/seatt... | | 42c7d47dd1b9 | session/42c7d47dd1b9 | `<cdp_backend.database.models.Event object at 0...` | `<cdp_backend.database.models.Transcript object...` | /Users/eva/active/research/ml-for-pit/content/... | /Users/eva/active/research/ml-for-pit/content/... |


In [None]:
# load sentence transformers to embed
# chunks of meetings at a time and
# store in a dict with the session id
# and a chunk index as the key
import pandas as pd
from sentence_transformers import SentenceTransformer

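# the jina v2 models ship custom modeling code, so the model card
# asks for trust_remote_code=True (needs a recent sentence-transformers)
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en",
    trust_remote_code=True,
)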

session_chunk_to_embedding = {}
for _, row in sessions.iterrows():
    transcript_df = pd.read_csv(row.transcript_as_csv_path)

    # split transcript into chunks of 