# Analysis of the Transcripts from C-SPAN

This notebook analyzes the initial transcripts we got from C-SPAN.

To start, import the modules used in this notebook. If this cell fails, you may need to install the missing module, for example with conda install <module>.
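If it is unclear which module is missing, a quick check like the one below can list them before running the imports. This is only a sketch: the `required` list is an assumption based on the imports used in this notebook, and `missing` is an illustrative name.

```python
import importlib.util

# Top-level modules this notebook imports (assumed from the import cell).
required = ["plotly", "pandas", "json"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any name printed as missing is a candidate for conda install.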

In [91]:
# Imports
import plotly as py
import plotly.graph_objects as go
import pandas as pd
import json

In [92]:
# Get data

with open("../data/Transcripts/c-spanTranscript1.json", "r") as f:
    transcript_df = pd.json_normalize(json.loads(f.read()), "messages")

with open("../data/Transcripts/c-spanTranscript-ActionItems.json", "r") as f:
    action_items_df = pd.json_normalize(json.loads(f.read()), "actionItems")

with open("../data/Transcripts/c-spanTranscript-Analytics.json", "r") as f:
    json_data = json.loads(f.read())
    analytics_metrics_df = pd.json_normalize(json_data, "metrics")
    analytics_members_df = pd.json_normalize(json_data, "members")

with open("../data/Transcripts/c-spanTranscript-Questions.json", "r") as f:
    questions_df = pd.json_normalize(json.loads(f.read()), "questions")

with open("../data/Transcripts/c-spanTranscript-Topics.json", "r") as f:
    topics_df = pd.json_normalize(json.loads(f.read()), "topics")

with open("../data/Transcripts/c-spanTranscript1Speaker.json", "r") as f:
    speaker_transcript_df = pd.json_normalize(json.loads(f.read()), "messages")



# Topics Analysis

First, a quick overview with the .head() method, which shows the first few rows of a table (called a DataFrame in pandas).

In [93]:
topics_df.head()

Unnamed: 0,id,text,type,score,messageIds,parentRefs
0,6429454378205184,measure,topic,0.35,"[4953935262515200, 5538147195682816, 512398529...",[]
1,5923067520876544,debt limit,topic,0.717,"[5550821375737856, 6309678263828480, 541620111...",[]
2,5882140924313600,debt ceiling,topic,0.853,"[4610345260810240, 6561363561283584, 483419694...",[]
3,4856520035532800,senate amendment,topic,0.847,"[5765923391668224, 6363913177268224, 600527473...",[]
4,5726876636020736,minority member,topic,0.905,"[5654814449991680, 5744991130353664, 462335352...",[]


Looking at this data, a few questions come to mind:
1. How useful is the score? Can we check its quality against the messages each topic references?
2. Is the number of messages that reference a topic useful?
3. Is the text itself useful? Could aggregating the results from many videos reveal recurring topics? Are more general or more specific topics more useful?
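For the first question, one way to check a score against the underlying messages is to expand each topic's messageIds and join them to the transcript. The sketch below uses small made-up frames in place of topics_df and transcript_df. It assumes the transcript table has id and text columns and that the ids share a type with the entries in messageIds, neither of which has been verified against the real data. The helper name messages_for_topics is hypothetical.

```python
import pandas as pd

# Toy stand-ins for topics_df and transcript_df (all values are made up).
# Assumption: the transcript frame exposes "id" and "text" columns.
sample_topics = pd.DataFrame({
    "text": ["debt ceiling", "measure"],
    "score": [0.853, 0.35],
    "messageIds": [["m1", "m3"], ["m2"]],
})
sample_messages = pd.DataFrame({
    "id": ["m1", "m2", "m3"],
    "text": [
        "We must raise the debt ceiling.",
        "This measure is before us.",
        "The debt ceiling vote is tomorrow.",
    ],
})


def messages_for_topics(topics, messages):
    """Return one row per (topic, message) pair with the message text attached."""
    # explode turns each list of message ids into one row per id.
    exploded = topics.explode("messageIds").rename(
        columns={"text": "topic", "messageIds": "messageId"}
    )
    # A left join keeps topics even if a message id has no matching row.
    return exploded.merge(
        messages.rename(columns={"id": "messageId", "text": "message"}),
        on="messageId",
        how="left",
    )


topic_messages = messages_for_topics(sample_topics, sample_messages)
print(topic_messages[["topic", "score", "message"]])
```

Reading the messages next to a topic's score gives a manual sense of whether a high score matches a topic that is actually discussed.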

We can do a quick visualization for questions one and two.

In [94]:
data = go.Bar(x=topics_df["text"], y=topics_df["score"])

layout = go.Layout(title="Topic Confidence",
                   xaxis={"title":"Topics", "categoryorder": "total descending"},
                   yaxis={"title":"Confidence Score"})

go.Figure(data=data, layout=layout)

In [95]:
topics_data_df = topics_df.copy()
topics_data_df['messageCount'] = [len(mess) for mess in topics_data_df["messageIds"]]

data = go.Bar(x=topics_data_df["text"], y=topics_data_df["messageCount"])

layout = go.Layout(title="Topics Referenced in Transcript",
                   xaxis={"title":"Topics", "categoryorder": "total descending"},
                   yaxis={"title":"Number of times referenced"})

go.Figure(data=data, layout=layout)

A couple of things to note:
1. minority member was referenced only 6 times, yet had the highest score, 0.905. In comparison, senate amendment had 21 references and a score of 0.847.
2. coal amendment was referenced only once, but had a score of 0.592.
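To put these observations side by side, the score and the reference count can be ranked and correlated. The sketch below uses only the three topics and figures quoted above, so the result says nothing about the full data set. The frame name noted and the rank columns are illustrative.

```python
import pandas as pd

# The three topics called out above, with the counts and scores quoted there.
noted = pd.DataFrame({
    "text": ["minority member", "senate amendment", "coal amendment"],
    "messageCount": [6, 21, 1],
    "score": [0.905, 0.847, 0.592],
})

# Rank each topic by both measures so disagreements are easy to spot.
noted["scoreRank"] = noted["score"].rank(ascending=False).astype(int)
noted["countRank"] = noted["messageCount"].rank(ascending=False).astype(int)

# Pearson correlation between reference count and score.
correlation = noted["messageCount"].corr(noted["score"])

print(noted.sort_values("score", ascending=False))
print(f"Pearson correlation: {correlation:.3f}")
```

On the full data, the same comparison could be run as topics_data_df["messageCount"].corr(topics_data_df["score"]), which would give a rough indication of whether frequently referenced topics tend to score higher.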