# Analysis of the Transcripts from C-SPAN

This notebook is a first-pass analysis of the initial transcripts we obtained from C-SPAN.

To start, import the modules used in this notebook. If an import fails, you may need to install the missing module with "conda install <module>".

spaCy also needs its English language model. After installing spaCy with "conda install spacy", run "python -m spacy download en_core_web_sm" to download it.

In [91]:
# Imports
import plotly as py
import plotly.graph_objects as go
import pandas as pd
import json
import spacy

In [92]:
# Get data

with open("../data/Transcripts/c-spanTranscript1.json", "r") as f:
    transcript_df = pd.json_normalize(json.loads(f.read()), "messages")

with open("../data/Transcripts/c-spanTranscript-ActionItems.json", "r") as f:
    action_items_df = pd.json_normalize(json.loads(f.read()), "actionItems")

with open("../data/Transcripts/c-spanTranscript-Analytics.json", "r") as f:
    json_data = json.loads(f.read())
    analytics_metrics_df = pd.json_normalize(json_data, "metrics")
    analytics_members_df = pd.json_normalize(json_data, "members")

with open("../data/Transcripts/c-spanTranscript-Questions.json", "r") as f:
    questions_df = pd.json_normalize(json.loads(f.read()), "questions")

with open("../data/Transcripts/c-spanTranscript-Topics.json", "r") as f:
    topics_df = pd.json_normalize(json.loads(f.read()), "topics")

with open("../data/Transcripts/c-spanTranscript1Speaker.json", "r") as f:
    speaker_transcript_df = pd.json_normalize(json.loads(f.read()), "messages")
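The loads above all follow one pattern: open a file, parse the JSON, and pick out one list of records. A minimal sketch of a helper for that pattern follows. `load_records` is a hypothetical name that does not appear in the notebook, and the demo file is a throwaway stand-in with made-up content so the snippet runs without the real transcripts.

```python
import json
import tempfile
from pathlib import Path


def load_records(path, key):
    """Return the list stored under `key` in the JSON file at `path`."""
    with open(path, "r") as f:
        return json.load(f)[key]


# Throwaway file standing in for a real transcript export (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json"
    demo_path.write_text(json.dumps({"topics": [{"text": "debt limit", "score": 0.717}]}))
    demo_topics = load_records(demo_path, "topics")

print(demo_topics)
```

With the real files, passing the returned list to `pd.json_normalize` should produce the same tables as the cell above.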



# Topics Analysis

First, a quick look with the .head() function, which shows the first five rows of a table (called a DataFrame by pandas).

In [93]:
topics_df.head()

| | id | text | type | score | messageIds | parentRefs |
|---|---|---|---|---|---|---|
| 0 | 6429454378205184 | measure | topic | 0.35 | [4953935262515200, 5538147195682816, 512398529... | [] |
| 1 | 5923067520876544 | debt limit | topic | 0.717 | [5550821375737856, 6309678263828480, 541620111... | [] |
| 2 | 5882140924313600 | debt ceiling | topic | 0.853 | [4610345260810240, 6561363561283584, 483419694... | [] |
| 3 | 4856520035532800 | senate amendment | topic | 0.847 | [5765923391668224, 6363913177268224, 600527473... | [] |
| 4 | 5726876636020736 | minority member | topic | 0.905 | [5654814449991680, 5744991130353664, 462335352... | [] |


Looking at this data, a few questions come to mind:
1. How useful is the score? Can we judge its quality by reading the messages each topic references?
2. Is the number of messages that reference a topic useful?
3. Is the text itself useful? Could aggregating the results from many videos surface recurring topics? Are more general or more specific topics more useful?

We can do a quick visualization for questions one and two.

In [94]:
data = go.Bar(x=topics_df["text"], y=topics_df["score"])

layout = go.Layout(title="Topic Confidence",
                   xaxis={"title":"Topics", "categoryorder": "total descending"},
                   yaxis={"title":"Confidence Score"})

go.Figure(data=data, layout=layout)

In [95]:
topics_data_df = topics_df.copy()
topics_data_df['messageCount'] = [len(mess) for mess in topics_data_df["messageIds"]]

data = go.Bar(x=topics_data_df["text"], y=topics_data_df["messageCount"])

layout = go.Layout(title="Topics Referenced in Transcript",
                   xaxis={"title":"Topics", "categoryorder": "total descending"},
                   yaxis={"title":"Number of times referenced"})

go.Figure(data=data, layout=layout)

A couple of things to note:
1. minority member was referenced only 6 times, yet had the highest score of .905. In comparison, senate amendment had 21 references, with a score of .847.
2. coal amendment was referenced only once, but had a score of .592.
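One way to follow up on these notes, and on question 1 above, is to read the messages behind a topic and judge by eye whether the score is deserved. The sketch below assumes the topic and message records keep the `messageIds`, `id`, and `text` fields seen in the tables. `messages_for_topic` is a hypothetical helper, and the sample records are made up for illustration.

```python
def messages_for_topic(topic, messages):
    """Return the text of every message whose id is listed in topic["messageIds"]."""
    text_by_id = {msg["id"]: msg["text"] for msg in messages}
    # Skip ids that are missing from the transcript instead of raising
    return [text_by_id[mid] for mid in topic["messageIds"] if mid in text_by_id]


# Made-up sample records that only mirror the field names used in this notebook
sample_messages = [
    {"id": 1, "text": "We need to raise the debt ceiling."},
    {"id": 2, "text": "I yield to the minority member."},
    {"id": 3, "text": "The debt ceiling vote is tomorrow."},
]
sample_topic = {"text": "debt ceiling", "score": 0.853, "messageIds": [1, 3, 99]}

for line in messages_for_topic(sample_topic, sample_messages):
    print(line)
```

On the real data the ids in the two files may differ in type (string versus integer), so it is worth checking that before relying on the lookup.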

# Speaker Analysis

In [97]:
speaker_transcript_df.head()

| | id | text | startTime | endTime | conversationId | phrases | from.id | from.name |
|---|---|---|---|---|---|---|---|---|
| 0 | 5217668870176768 | I can hit the gym does so. | 2021-10-17T00:38:59.926Z | 2021-10-17T00:39:01.526Z | 5611763203571712 | [] | 66d8689e-8606-40e2-aa4b-28058f2067db | Speaker 11 |
| 1 | 5754971829043200 | Because of the bureaucracy, and right. | 2021-10-17T00:39:03.726Z | 2021-10-17T00:39:07.326Z | 5611763203571712 | [] | 9ab722b4-7097-4fca-9f19-043ba94c70a8 | Speaker 1 |
| 2 | 6079902680875008 | Army. | 2021-10-17T00:39:08.426Z | 2021-10-17T00:39:08.726Z | 5611763203571712 | [] | cd91310e-b3ea-47f0-8480-d699b476a400 | Speaker 12 |
| 3 | 5922707481821184 | Okay, I think we will get a. | 2021-10-17T00:39:11.126Z | 2021-10-17T00:39:12.426Z | 5611763203571712 | [] | cd91310e-b3ea-47f0-8480-d699b476a400 | Speaker 12 |
| 4 | 4934946956247040 | Standard which. | 2021-10-17T00:39:12.426Z | 2021-10-17T00:39:14.426Z | 5611763203571712 | [] | 66d8689e-8606-40e2-aa4b-28058f2067db | Speaker 11 |


In [165]:
# Unique speakers
speaker_transcript_df["from.name"].unique()


array(['Speaker 11', 'Speaker 1', 'Speaker 12', 'Speaker 9', 'Speaker 5',
       'Speaker 13', 'Speaker 6', 'Speaker 8', 'Speaker 7', 'Speaker 10',
       'Speaker 2', 'Speaker 3', 'Speaker 4'], dtype=object)

In [163]:
gb = speaker_transcript_df.groupby(by="from.name", as_index=False)

counts = gb.count()

data = go.Bar(x=counts["from.name"], y=counts["id"])

layout = go.Layout(title="Speaker segment counts",
                   xaxis={"title":"Speakers", "categoryorder": "total descending"},
                   yaxis={"title":"Segment count"})

go.Figure(data=data, layout=layout)
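Segment counts weigh a one-word interjection the same as a long statement. A complementary measure is talk time, the sum of `endTime - startTime` per speaker. The sketch below uses only the standard library and the five segments from the .head() output above. `talk_seconds` and `TIME_FORMAT` are hypothetical names, and the format string assumes every timestamp looks like the ones shown.

```python
from collections import defaultdict
from datetime import datetime

# Assumed format, matching values such as 2021-10-17T00:38:59.926Z
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def talk_seconds(segments):
    """Sum (endTime - startTime) in seconds for each speaker name."""
    totals = defaultdict(float)
    for seg in segments:
        start = datetime.strptime(seg["startTime"], TIME_FORMAT)
        end = datetime.strptime(seg["endTime"], TIME_FORMAT)
        totals[seg["from.name"]] += (end - start).total_seconds()
    return dict(totals)


# The five segments from the .head() output
segments = [
    {"from.name": "Speaker 11", "startTime": "2021-10-17T00:38:59.926Z", "endTime": "2021-10-17T00:39:01.526Z"},
    {"from.name": "Speaker 1", "startTime": "2021-10-17T00:39:03.726Z", "endTime": "2021-10-17T00:39:07.326Z"},
    {"from.name": "Speaker 12", "startTime": "2021-10-17T00:39:08.426Z", "endTime": "2021-10-17T00:39:08.726Z"},
    {"from.name": "Speaker 12", "startTime": "2021-10-17T00:39:11.126Z", "endTime": "2021-10-17T00:39:12.426Z"},
    {"from.name": "Speaker 11", "startTime": "2021-10-17T00:39:12.426Z", "endTime": "2021-10-17T00:39:14.426Z"},
]

totals = talk_seconds(segments)
for name, seconds in sorted(totals.items()):
    print(name, round(seconds, 1))
```

For the full transcript, `speaker_transcript_df.to_dict("records")` should yield rows in the shape this function expects.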



In [195]:
nlp = spacy.load("en_core_web_sm")

speaker_data_df = speaker_transcript_df.copy()

# Run the spaCy pipeline once per segment and reuse the resulting Doc objects
docs = list(nlp.pipe(speaker_data_df["text"]))

# Set the wordCount column: number of tokens per segment, ignoring punctuation tokens
speaker_data_df["wordCount"] = [sum(not tok.is_punct for tok in doc) for doc in docs]

# Also keep the parsed Doc objects as a column (named entities are read from them later)
speaker_data_df["nlp"] = docs

speaker_data_df.head()

| | id | text | startTime | endTime | conversationId | phrases | from.id | from.name | wordCount | nlp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5217668870176768 | I can hit the gym does so. | 2021-10-17T00:38:59.926Z | 2021-10-17T00:39:01.526Z | 5611763203571712 | [] | 66d8689e-8606-40e2-aa4b-28058f2067db | Speaker 11 | 7 | (I, can, hit, the, gym, does, so, .) |
| 1 | 5754971829043200 | Because of the bureaucracy, and right. | 2021-10-17T00:39:03.726Z | 2021-10-17T00:39:07.326Z | 5611763203571712 | [] | 9ab722b4-7097-4fca-9f19-043ba94c70a8 | Speaker 1 | 6 | (Because, of, the, bureaucracy, ,, and, right, .) |
| 2 | 6079902680875008 | Army. | 2021-10-17T00:39:08.426Z | 2021-10-17T00:39:08.726Z | 5611763203571712 | [] | cd91310e-b3ea-47f0-8480-d699b476a400 | Speaker 12 | 1 | (Army, .) |
| 3 | 5922707481821184 | Okay, I think we will get a. | 2021-10-17T00:39:11.126Z | 2021-10-17T00:39:12.426Z | 5611763203571712 | [] | cd91310e-b3ea-47f0-8480-d699b476a400 | Speaker 12 | 7 | (Okay, ,, I, think, we, will, get, a.) |
| 4 | 4934946956247040 | Standard which. | 2021-10-17T00:39:12.426Z | 2021-10-17T00:39:14.426Z | 5611763203571712 | [] | 66d8689e-8606-40e2-aa4b-28058f2067db | Speaker 11 | 2 | (Standard, which, .) |
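As a sanity check on the wordCount column that does not need the spaCy model, the counts can be approximated with a regular expression. This is a rough sketch, not the notebook's method: `count_words` and `WORD_RE` are hypothetical names, and the pattern simply treats runs of word characters (optionally joined by apostrophes) as words.

```python
import re

# A "word" here is a run of word characters, optionally joined by apostrophes
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def count_words(text):
    """Count word-like runs in text, ignoring punctuation."""
    return len(WORD_RE.findall(text))


# The five segment texts from the table above
samples = [
    "I can hit the gym does so.",
    "Because of the bureaucracy, and right.",
    "Army.",
    "Okay, I think we will get a.",
    "Standard which.",
]

counts = [count_words(text) for text in samples]
print(counts)  # [7, 6, 1, 7, 2]
```

These match the wordCount values shown for the same rows. The two approaches can diverge on contractions, which spaCy splits into separate tokens.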


In [182]:
gb = speaker_data_df.groupby(by="from.name", as_index=False)

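# Sum only the numeric wordCount column; text and Doc columns cannot be summed meaningfully
sums = gb["wordCount"].sum()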

data = go.Bar(x=sums["from.name"], y=sums["wordCount"])

layout = go.Layout(title="Speaker word counts",
                   xaxis={"title":"Speakers", "categoryorder": "total descending"},
                   yaxis={"title":"Words"})

go.Figure(data=data, layout=layout)


In [199]:
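# Doc.sentiment defaults to 0.0, and the stock en_core_web_sm pipeline has no component that sets it,
# so this column stays at 0.0 unless a sentiment component is added to the pipeline
speaker_data_df["sentiment"] = [doc.sentiment for doc in speaker_data_df["nlp"]]
# Named entities that spaCy found in each segment
speaker_data_df["ents"] = [doc.ents for doc in speaker_data_df["nlp"]]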

speaker_data_df.head()


| | id | text | startTime | endTime | conversationId | phrases | from.id | from.name | wordCount | nlp | sentiment | ents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5217668870176768 | I can hit the gym does so. | 2021-10-17T00:38:59.926Z | 2021-10-17T00:39:01.526Z | 5611763203571712 | [] | 66d8689e-8606-40e2-aa4b-28058f2067db | Speaker 11 | 7 | (I, can, hit, the, gym, does, so, .) | 0.0 | () |
| 1 | 5754971829043200 | Because of the bureaucracy, and right. | 2021-10-17T00:39:03.726Z | 2021-10-17T00:39:07.326Z | 5611763203571712 | [] | 9ab722b4-7097-4fca-9f19-043ba94c70a8 | Speaker 1 | 6 | (Because, of, the, bureaucracy, ,, and, right, .) | 0.0 | () |
| 2 | 6079902680875008 | Army. | 2021-10-17T00:39:08.426Z | 2021-10-17T00:39:08.726Z | 5611763203571712 | [] | cd91310e-b3ea-47f0-8480-d699b476a400 | Speaker 12 | 1 | (Army, .) | 0.0 | ((Army),) |
| 3 | 5922707481821184 | Okay, I think we will get a. | 2021-10-17T00:39:11.126Z | 2021-10-17T00:39:12.426Z | 5611763203571712 | [] | cd91310e-b3ea-47f0-8480-d699b476a400 | Speaker 12 | 7 | (Okay, ,, I, think, we, will, get, a.) | 0.0 | () |
| 4 | 4934946956247040 | Standard which. | 2021-10-17T00:39:12.426Z | 2021-10-17T00:39:14.426Z | 5611763203571712 | [] | 66d8689e-8606-40e2-aa4b-28058f2067db | Speaker 11 | 2 | (Standard, which, .) | 0.0 | () |


In [219]:
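# Word count of each segment, plotted against its start time (one marker per segment)
data = go.Scatter(x=speaker_data_df["startTime"], y=speaker_data_df["wordCount"], mode="markers")

go.Figure(data)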
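A per-segment plot can be noisy because most segments are only a few words long. One possible smoothing, sketched here under the assumption that every startTime follows the format shown earlier, is to sum word counts per minute. `words_per_minute` is a hypothetical helper, and the rows are the start times and word counts of the five segments shown in the tables.

```python
from collections import Counter
from datetime import datetime

# Assumed format, matching values such as 2021-10-17T00:38:59.926Z
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def words_per_minute(rows):
    """Sum word counts into buckets keyed by the date and minute of each start time."""
    totals = Counter()
    for start_time, word_count in rows:
        minute = datetime.strptime(start_time, TIME_FORMAT).strftime("%Y-%m-%d %H:%M")
        totals[minute] += word_count
    return dict(totals)


# (startTime, wordCount) pairs for the five segments shown in the tables
rows = [
    ("2021-10-17T00:38:59.926Z", 7),
    ("2021-10-17T00:39:03.726Z", 6),
    ("2021-10-17T00:39:08.426Z", 1),
    ("2021-10-17T00:39:11.126Z", 7),
    ("2021-10-17T00:39:12.426Z", 2),
]

per_minute = words_per_minute(rows)
print(per_minute)  # {'2021-10-17 00:38': 7, '2021-10-17 00:39': 16}
```

The resulting dictionary can be passed to a bar chart in the same way as the earlier figures, with the keys on the x axis and the totals on the y axis.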