## Exploring Speech Data


In [1]:
import numpy as np
import pandas as pd
import os
import re

In [2]:
os.chdir("../../scripts/assembly")
from session_speaker_assembly import *

In [3]:
SPEECHES = "speeches_%s.txt"

In [17]:
# finally, it works.
speeches = pd.read_csv(os.path.join(HB_PATH, SPEECHES % '111'), sep = "|")

In [22]:
speeches

Unnamed: 0,speech_id,speech
0,1110000001,The Representativeselect and their guests will...
1,1110000002,As directed by law. the Clerk of the House has...
2,1110000003,The quor closes that 428 Represer have respond...
3,1110000004,Credentials. regular in form. have been receiv...
4,1110000005,The Clerk is in receipt of a letter of resigna...
...,...,...
179263,1110179264,Madam Speaker. on rollcall Nos. 662 and 661. I...
179264,1110179265,Madam Speaker. as I leave Congress as the peop...
179265,1110179266,Madam Speaker. on rolicall No. 658. I was unav...
179266,1110179267,Madam Speaker. on rollcall No. 658 my flight w...


In [23]:
speech_dict = speeches.to_dict()

## Preprocessing

### What did Gentzkow et al. do?

#### How the speech data came to be:

The text of each speech is processed by:
1. removing non-speech text,
2. removing apostrophes and replacing commas and semicolons with periods,
3. replacing repeated whitespace characters with a single space,
4. removing punctuation (hyphens, periods, and asterisks) that separates the speaker's demarcation from the speech,
5. removing leading and trailing whitespace from the speech.
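
The purely textual steps above (2, 3, and 5) can be sketched as follows. This is a minimal illustration, not the original authors' code; the function name `clean_speech_text` is made up for this example. Steps 1 and 4 depend on the layout of the raw record and are not attempted here.

```python
import re

def clean_speech_text(text):
    """Approximate steps 2, 3, and 5 of the list above (hypothetical helper)."""
    # Step 2: drop apostrophes (straight and curly) ...
    text = text.replace("'", "").replace("\u2019", "")
    # ... and turn commas and semicolons into periods.
    text = re.sub(r"[,;]", ".", text)
    # Step 3: collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Step 5: strip leading and trailing whitespace.
    return text.strip()
```

This would explain why the speeches loaded above contain phrases such as "As directed by law. the Clerk" and "Speakers lobby".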

#### Processing steps for the bigram files:

1. The number of characters and the number of space-delimited words are computed.
2. The speech is coerced to lowercase.
3. The speech is broken into separate words, treating all non-alphanumeric characters as delimiters.
4. General English-language stopwords are removed.
5. The remaining words are reduced to their stems using the Porter2 (English) stemming algorithm.
6. The stemmed words are converted to bigrams following their order in the speech.
7. The bigrams of the speech are converted into bigram counts, which discards the ordering.
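
A rough sketch of these seven steps, using only the standard library. The names `bigram_counts` and `EXAMPLE_STOPWORDS` are hypothetical, the stopword list is a tiny stand-in, and the `stem` argument defaults to the identity function, so a real Porter2 stemmer would have to be passed in.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
EXAMPLE_STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}

def bigram_counts(speech, stopwords=EXAMPLE_STOPWORDS, stem=lambda w: w):
    # Step 1: character count and whitespace-delimited word count.
    n_chars = len(speech)
    n_words = len(speech.split())
    # Steps 2-3: lowercase, then split on runs of non-alphanumeric characters.
    tokens = [t for t in re.split(r"[^a-z0-9]+", speech.lower()) if t]
    # Steps 4-5: drop stopwords, then stem what remains.
    tokens = [stem(t) for t in tokens if t not in stopwords]
    # Steps 6-7: adjacent pairs in order, then unordered counts.
    counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return n_chars, n_words, counts
```

Because stopwords are removed before the bigrams are formed, words that were not adjacent in the original speech can end up in the same bigram.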

#### Processing the Vocabulary:
...


### What should we do?

Preprocessing steps for DL:
* lowercasing
* punctuation removal (already done)
* normalization
    * dates
    * numbers
    * currency and percent signs
    * abbreviation expansion
    * spelling correction
* tokenization
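
As a rough sketch of the normalization step, dates, currency amounts, percentages, and remaining numbers can each be mapped to a placeholder token. The patterns and the placeholder names (`DATE`, `MONEY`, `PERCENT`, `NUMBER`) are assumptions made for this example; abbreviation expansion and spelling correction are not covered.

```python
import re

MONTHS = (r"\b(?:january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")
# Dates such as "january 6. 2009" (commas in this corpus are already periods).
DATE = re.compile(MONTHS + r"\s+\d{1,2}[.,]?\s+\d{4}\b")
# Amounts such as "$1,200" or "$3.50".
CURRENCY = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?")
# Percentages such as "15%" or "15 percent".
PERCENT = re.compile(r"\d+(?:\.\d+)?\s?(?:%|percent\b)")
# Any remaining digit run, absorbing an ordinal suffix such as "111th".
NUMBER = re.compile(r"\d+(?:st|nd|rd|th)?")

def normalize(text):
    """Lowercase, then replace dates, money, percentages, and numbers."""
    text = text.lower()
    # Order matters: the more specific patterns run before the generic one.
    text = DATE.sub("DATE", text)
    text = CURRENCY.sub("MONEY", text)
    text = PERCENT.sub("PERCENT", text)
    text = NUMBER.sub("NUMBER", text)
    return re.sub(r"\s+", " ", text).strip()
```

The uppercase placeholders stay distinguishable from ordinary words only if lowercasing happens before the substitutions, as it does here.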

Choices:
* keep speaker names?
    * canonicalize these as MEMBER_NAME
    * canonicalize with party affiliation, MEMBER_NAME_REPUB, MEMBER_NAME_DEM
* keep stop words?
* do some named entity recognition?
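
If speaker names are canonicalized, one possible approach is a lookup from surname to party. The sketch below assumes such a mapping is available; `canonicalize_members` and `member_party` are hypothetical names, and the surnames in the usage note are fictional. Matching on surnames alone will also hit ordinary words that happen to be surnames, so this is only a starting point.

```python
import re

def canonicalize_members(text, member_party, with_party=False):
    """Replace known member surnames with MEMBER_NAME or MEMBER_NAME_<PARTY>."""
    if not member_party:
        return text
    # Case-insensitive lookup table (hypothetical input: surname -> party tag).
    lookup = {name.lower(): party for name, party in member_party.items()}
    # Longest names first so that a longer surname wins over a shorter one.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )

    def replace(match):
        if with_party:
            return "MEMBER_NAME_" + lookup[match.group(0).lower()]
        return "MEMBER_NAME"

    return pattern.sub(replace, text)
```

For example, with `{"Doe": "REPUB", "Bloggs": "DEM"}`, the text "I thank Mr. DOE and Ms. Bloggs." becomes "I thank Mr. MEMBER_NAME and Ms. MEMBER_NAME.", or carries the party suffix when `with_party=True`.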


Tools:
* nltk
* keras preprocessing (https://keras.io/preprocessing/text/)

Resources:
* text preprocessing for ML: https://towardsdatascience.com/text-preprocessing-steps-and-universal-pipeline-94233cb6725a
* text preprocessing for DL: https://towardsdatascience.com/nlp-learning-series-part-1-text-preprocessing-methods-for-deep-learning-20085601684b
    

In [70]:
type(list(speech_dict["speech"].values()))

list

In [71]:
speech_values = list(speech_dict["speech"].values())

Baseline preprocessor:
* lowercase
* remove punctuation
* normalize:
    * dates
    * numbers
    * currency and percent signs
    * abbreviation expansion
    * spelling correction
* tokenize

In [29]:
import nltk

In [38]:
d = {0: 'The Representativeselect and their guests will please remain standing and join in the Pledge of Allegiance.',
 1: 'As directed by law. the Clerk of the House has prepared the official roll of the Representativeselect. Certificates of election covering 435 seats in the 111th Congress have been received by the Clerk of the House. and the names of those persons whose credentials show that they were regularly elected as Representatives in accordance with the laws of their respective States or of the United States will be called. The Representativeselect will record their presence by electronic device and their names will be recorded in alphabetical order by State. beginning with the State of Alabama. to determine whether a quorum is present. Representativeselect will have a minimum of 15 minutes to record their presence by electronic device. Representativeselect who have not obtained their voting ID cards may do so now in the Speakers lobby.',
 2: 'The quor closes that 428 Represer have responded to the quorum is present. Roe Tanner Wamp Neugebauer Olson Ortiz Paul Poe Reyes Rodriguez Sessions Smith Thornberry Matheson Scott Wittma Wolf n 55. Pt. 1 January 6. 2009',
 3: 'Credentials. regular in form. have been received showing the election of: The Honorable PEDRO R. PIERLUISI as Resident Commissioner from the Commonwealth of Puerto Rico for a term of 4 years beginning January 3. 2009. The Honorable ELEANOR HOLMES NORTON as Delegate from the District of Columbia. The Honorable MADELEINE Z. BORDALLO as Delegate from Guam. The Honorable DONNA M. CHRISTENSEN as Delegate from the Virgin Islands. The Honorable ENI F. H. FALEOMAVAEGA as Delegate from American Samoa. and The Honorable GREGORIO SABLAN. Delegate from the Commonwealth of the Northern Mariana Islands.',}

d_vals = list(d.values())

In [61]:
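# Matches any character that is NOT a letter, digit, or whitespace.
# (The range must be A-Z; A-z would also cover [ \ ] ^ _ and the backtick.)
ALPHA_NUM = r"[^a-zA-Z0-9\s]"
# Matches a run of one or more digits.
NUM = r"\d+"

def basic_preprocess(text):
    """Lowercase, strip punctuation, and replace each digit run with 'number'."""
    text = text.lower()
    text = re.sub(ALPHA_NUM, '', text)
    # re.sub leaves the text unchanged when nothing matches,
    # so no separate search for digits is needed first.
    text = re.sub(NUM, "number", text)

    return text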
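
A note on the character class used for punctuation stripping: in a regular expression, the range `A-z` spans ASCII 65 to 122, which also covers `[`, `\`, `]`, `^`, `_`, and the backtick, so those characters survive the substitution. The short comparison below is an illustrative sketch; `loose`, `strict`, and `sample` are names made up for this example.

```python
import re

loose = re.compile(r"[^a-zA-z0-9\s]")   # A-z: also keeps [ \ ] ^ _ `
strict = re.compile(r"[^a-zA-Z0-9\s]")  # A-Z: keeps letters, digits, whitespace only

sample = "roll_call [no. 658]"
loose_result = loose.sub("", sample)    # 'roll_call [no 658]'
strict_result = strict.sub("", sample)  # 'rollcall no 658'
```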

In [128]:
import time
start = time.time()

clean_speeches = [basic_preprocess(s) for s in speech_values]

end = time.time()
elapsed = end - start
print(round(elapsed, 3), " seconds")

9.158  seconds


In [129]:
# Clean Speeches after basic preprocessing
clean_speeches[:10]

['the representativeselect and their guests will please remain standing and join in the pledge of allegiance',
 'as directed by law the clerk of the house has prepared the official roll of the representativeselect certificates of election covering number seats in the numberth congress have been received by the clerk of the house and the names of those persons whose credentials show that they were regularly elected as representatives in accordance with the laws of their respective states or of the united states will be called the representativeselect will record their presence by electronic device and their names will be recorded in alphabetical order by state beginning with the state of alabama to determine whether a quorum is present representativeselect will have a minimum of number minutes to record their presence by electronic device representativeselect who have not obtained their voting id cards may do so now in the speakers lobby',
 'the quor closes that number represer have res

In [78]:
len(clean_speeches)

179268

Further preprocessing for LDA

* stopword removal
* Porter2 (Snowball English) stemming

In [121]:
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [122]:
# create stemmer
stemmer = SnowballStemmer("english")
# create tokenizer
tokenizer = Tokenizer()

In [117]:
from nltk.corpus import stopwords
# nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rocassius/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [135]:
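def lda_preprocess(text):
    """Basic preprocessing, then stemming, tokenizing, and stopword removal."""

    text = basic_preprocess(text)
    # NOTE: SnowballStemmer.stem expects a single word. Called on a whole
    # string, it effectively stems only the final word (e.g. 'allegiance'
    # becomes 'allegi' while the other words are left unstemmed).
    text = stemmer.stem(text)
    # Tokenize on whitespace.
    text = text.split()
    # Drop stopwords.
    text = list(filter(lambda w: w not in stop_words, text))

    return text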
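
Since a Snowball stemmer works on one word at a time, a per-token variant may be closer to what is intended here. The sketch below takes the stemming function and the stopword collection as arguments; `stem_tokens` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in so the example runs without NLTK.

```python
def stem_tokens(text, stem, stop_words):
    """Tokenize on whitespace, drop stopwords, then stem each remaining token."""
    stop_words = set(stop_words)  # set membership is faster than list membership
    return [stem(tok) for tok in text.split() if tok not in stop_words]

# Toy stand-in stemmer for demonstration only: strips one trailing "s".
def toy_stem(word):
    return word[:-1] if word.endswith("s") else word

tokens = stem_tokens("the guests will join the pledge", toy_stem, ["the", "will"])
# tokens == ['guest', 'join', 'pledge']
```

With the objects defined in this notebook, the call would presumably look like `stem_tokens(basic_preprocess(s), stemmer.stem, stop_words)`. Outputs produced this way would differ from those shown in this notebook.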


In [136]:
import time
start = time.time()

lda_speeches = [lda_preprocess(s) for s in speech_values]

end = time.time()
elapsed = end - start
print(round(elapsed, 3), " seconds")

117.679  seconds


In [137]:
lda_speeches[:10]

[['representativeselect',
  'guests',
  'please',
  'remain',
  'standing',
  'join',
  'pledge',
  'allegi'],
 ['directed',
  'law',
  'clerk',
  'house',
  'prepared',
  'official',
  'roll',
  'representativeselect',
  'certificates',
  'election',
  'covering',
  'number',
  'seats',
  'numberth',
  'congress',
  'received',
  'clerk',
  'house',
  'names',
  'persons',
  'whose',
  'credentials',
  'show',
  'regularly',
  'elected',
  'representatives',
  'accordance',
  'laws',
  'respective',
  'states',
  'united',
  'states',
  'called',
  'representativeselect',
  'record',
  'presence',
  'electronic',
  'device',
  'names',
  'recorded',
  'alphabetical',
  'order',
  'state',
  'beginning',
  'state',
  'alabama',
  'determine',
  'whether',
  'quorum',
  'present',
  'representativeselect',
  'minimum',
  'number',
  'minutes',
  'record',
  'presence',
  'electronic',
  'device',
  'representativeselect',
  'obtained',
  'voting',
  'id',
  'cards',
  'may',
  'speakers

In [138]:
import gensim

In [139]:
dictionary = gensim.corpora.Dictionary(lda_speeches)

In [140]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [142]:
bow_corpus = [dictionary.doc2bow(doc) for doc in lda_speeches]
bow_corpus[4310]

[(27, 1),
 (136, 2),
 (405, 1),
 (461, 1),
 (752, 1),
 (1229, 1),
 (1274, 1),
 (3665, 1)]

In [143]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In [144]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=20, id2word=dictionary, passes=2, workers=7)

In [145]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.010*"service" + 0.008*"speaker" + 0.006*"mr" + 0.006*"women" + 0.006*"time" + 0.006*"today" + 0.005*"honor" + 0.005*"us" + 0.005*"war" + 0.005*"country"
Topic: 1 
Words: 0.012*"community" + 0.010*"years" + 0.008*"speaker" + 0.008*"madam" + 0.006*"today" + 0.006*"county" + 0.006*"family" + 0.005*"service" + 0.005*"many" + 0.005*"life"
Topic: 2 
Words: 0.041*"health" + 0.034*"care" + 0.012*"insurance" + 0.012*"medicare" + 0.010*"would" + 0.009*"senator" + 0.008*"reform" + 0.008*"bill" + 0.007*"people" + 0.006*"costs"
Topic: 3 
Words: 0.022*"veterans" + 0.010*"care" + 0.009*"bill" + 0.008*"motion" + 0.008*"senate" + 0.008*"health" + 0.007*"upon" + 0.007*"time" + 0.006*"president" + 0.006*"table"
Topic: 4 
Words: 0.015*"court" + 0.013*"judge" + 0.010*"law" + 0.008*"would" + 0.007*"supreme" + 0.007*"federal" + 0.006*"case" + 0.006*"justice" + 0.005*"bill" + 0.005*"act"
Topic: 5 
Words: 0.015*"jobs" + 0.015*"small" + 0.011*"business" + 0.010*"businesses" + 0.009*"would" + 