### *Data Collection - US Congress*
## Preparing Raw Data
---
**Sample Text 10**
Title: National Defense Authorization Act <br>
US House of Representatives // Date: September 21 and 22, 2021; US Senate // Date: October 25, November 18, December 2 and 15, 2021 - Washington, D.C.

In [79]:
# Import necessary libraries
from __future__ import division  # kept first so the cell also runs as a plain script; a no-op on Python 3

import collections
import os.path
import pprint
import re
import time
import urllib.request
from collections import defaultdict
from urllib import request

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk import FreqDist, word_tokenize
from requests_html import HTMLSession

---
### Process: HTML to ASCII to Text (tokens, not yet nltk text)
- Download web page, strip html if necessary, trim to desired content
- Tokenize the text, select tokens of interest
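
The two steps above can be sketched without any network access. This is a minimal, standard-library-only illustration, not the code used in this notebook: `TextExtractor` and `html_to_tokens` are hypothetical names, `html.parser` stands in for BeautifulSoup, and a simple regular expression stands in for `nltk.word_tokenize`.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_tokens(html_text):
    """Strip markup, then split the text into word and punctuation tokens."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    text = " ".join(extractor.parts)
    # Rough stand-in for nltk.word_tokenize: runs of word characters,
    # or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


sample_html = "<html><body><p>Mr. SMITH of Washington. Madam Speaker, I yield.</p></body></html>"
sample_tokens = html_to_tokens(sample_html)
print(sample_tokens)
```

The regular expression splits more aggressively than `word_tokenize` (for example, it separates `Mr` from the following period), so token positions from this sketch are not comparable with the indices used in the cells below.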

In [80]:
# Add a headers dictionary with a user-agent string, to bypass the 404 response error and ensure that the US Congress website server allows scraping
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}

---
### Prep first debate on the matter

In [81]:
url = "https://www.congress.gov/congressional-record/2021/9/21/house-section/article/h4596-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [82]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'House',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 'Reset',
 'Words',
 '&',
 'Phrases',
 'Include',
 'full',
 'te

In [83]:
# Shorten the raw text down to the actual debate
tokens.index("extraneous") #323685
tokens.index("enthusiastically") #324676



324676

In [84]:
tokens1 = tokens[323655:324688]
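
The slice bounds above are fixed numbers, so they only hold for the exact page content that was downloaded. As a hedged alternative, the bounds can be derived from the anchor words themselves. `slice_between`, `back`, and `forward` are hypothetical names introduced for this sketch, and the toy token list is made up for illustration.

```python
def slice_between(tokens, start_word, end_word, back=0, forward=0):
    """Return tokens from `back` positions before the first `start_word`
    up to `forward` positions after the first later `end_word`."""
    start = tokens.index(start_word)
    end = tokens.index(end_word, start)  # search only from the start anchor on
    return tokens[max(start - back, 0):end + 1 + forward]


toy = ["Navigation", "Help", "Mr.", "SMITH", "asks", "for", "extraneous",
       "material", "and", "supports", "it", "enthusiastically", ".", "Footer"]
segment = slice_between(toy, "extraneous", "enthusiastically", back=4, forward=1)
print(segment)
```

With the indices noted in the comments above (323685 and 324676), the slice `tokens[323655:324688]` corresponds to `back=30` and `forward=11` in this helper.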

In [85]:
# Add second relevant portion
tokens2 = tokens[324688:]

In [86]:
tokens2.index("ROGERS") #6665
tokens2.index("Perlmutter") #17354

17354

In [87]:
tokens2 = tokens2[6664:17347]

In [88]:
tokens1 = tokens1 + tokens2

In [89]:
for word in tokens1:
    print(word, end=' ')

Mr. SMITH of Washington . Madam Speaker , I ask unanimous consent that all Members may have 5 legislative days in which to revise and extend their remarks and include extraneous material on H.R . 4350 . The SPEAKER pro tempore . Is there objection to the request of the gentleman from Washington ? There was no objection . Mr. SMITH of Washington . Madam Speaker , I yield myself 4 minutes . Madam Speaker , we have before us the National Defense Authorization Act for Fiscal Year 2022 , and I highly recommend it to the Members of the House and urge everybody to support this incredibly important and very well put together piece of legislation . [ [ Page H4794 ] ] The first thing I will say is thank you to all of the staff , certainly on the Armed Services Committee , but the Rules Committee as well , and the leadership staff . This has been a truly bipartisan legislative process . We had our markup in committee which lasted , I will do a little quick math in my head , something like 16 hour

---
### Prep second debate on the matter

In [90]:
url = "https://www.congress.gov/congressional-record/2021/9/22/extensions-of-remarks-section/article/e1005-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [91]:
tokens = word_tokenize(raw)
tokens


In [92]:
# Shorten the raw text down to the actual debate
tokens.index("PELOSI") #10708
tokens = tokens[10709:]


In [93]:
tokens.index("PELOSI") #14
tokens.index("____________________") #481
tokens2 = tokens[13:481]

In [94]:
for word in tokens2:
    print(word, end=' ')

Ms. PELOSI . Madam Speaker , I rise to support this bipartisan National Defense Authorization Act : which honors our values , supports our servicemembers and families , strengthens our security and advances our leadership in the world . Thank you to Chairman Smith , the Committee and staff for their patriotic , persistent leadership on this legislation , which reflects the brilliance and diversity of the whole House . This bill honors our responsibility to meet the needs of the servicemembers who sacrifice for our freedoms , and their families . It does so by : Providing a pay raise for our men and women in uniform , Strengthening parental leave for caregivers , Expanding access to child care , Improving the financial security of military members , including those who are low-income and have family members living with disabilities , Protecting military communities from PFAS `` forever chemicals . '' This NDAA also makes historic changes to better combat sexual assault in the military ,

---
### Prep third debate on the matter

In [95]:
url = "https://www.congress.gov/congressional-record/2021/9/22/house-section/article/h4880-4"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [96]:
tokens = word_tokenize(raw)
tokens


In [97]:
# Shorten the raw text down to the actual debate
tokens.index("fortify") #142650


142650

In [98]:
tokens3_1 = tokens[142614:]

In [99]:
tokens3_1.index("117-125")
tokens3_1 = tokens3_1[:2050]

In [100]:
for word in tokens3_1:
    print(word, end=' ')

Mr. LANGEVIN . Mr. Speaker , I yield myself such time as I may consume . Mr. Speaker , the United States attracts and develops some of the brightest minds in the world . They can fortify national security and boost economic competitiveness . Unfortunately , much of that talent leaves because there are few options to remain . My amendment provides a pathway to citizenship for the best foreign talent to work in the U.S. in support of our National Security Innovation Base . Great power competition is a race for talent to maintain our military and technological superiority . We want the brightest minds in the world working for us , not the Chinese Communist Party . The U.S. has less than 5 percent of world 's population , so it is no surprise that many great scientific minds are born outside U.S. borders . So then how have we maintained our technological superiority over the last 70 years , by way of example ? Well , our world-class universities and innovative private sector attract future

In [101]:
# Add second relevant portion of debate

In [102]:
idx = defaultdict(list)

In [103]:
# Record every position at which each token occurs (list.index only returns the first)
for i, j in enumerate(tokens):
    idx[j].append(i)

In [104]:
idx['LANGEVIN']
idx['Buchanan']

[677, 6432, 11279, 12399, 13410, 14503, 15555, 16598, 203254, 203329]
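
The position index built above is useful because `list.index` only ever reports the first occurrence of a token, while a speaker's name usually appears many times on a page. A small self-contained sketch on toy data (the function name `build_position_index` and the toy list are illustrative assumptions):

```python
from collections import defaultdict


def build_position_index(tokens):
    """Map each token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for position, token in enumerate(tokens):
        positions[token].append(position)
    return positions


toy = ["Mr.", "LANGEVIN", ".", "I", "yield", ".", "Mr.", "LANGEVIN", "."]
index = build_position_index(toy)
print(index["LANGEVIN"])      # every position of the token
print(toy.index("LANGEVIN"))  # first position only
```

Looking up a missing token with `index["..."]` silently adds an empty list to a `defaultdict`; `index.get("...", [])` avoids that side effect.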

In [105]:
tokens3_2 = tokens[197094:203368]

In [106]:
tokens3 = tokens3_1 + tokens3_2

In [107]:
for word in tokens3:
    print(word, end=' ')



---
### Prep fourth debate on the matter

In [108]:
url = "https://www.congress.gov/congressional-record/2021/10/25/senate-section/article/s7330-1"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [109]:
tokens = word_tokenize(raw)
tokens


In [110]:
# Shorten the raw text down to the actual debate
tokens.index("TUBERVILLE") #10687
tokens = tokens[10686:]

In [111]:
tokens.index("yield") #1987
tokens = tokens[:2000]

In [112]:
tokens4 = tokens 

In [113]:
for word in tokens4:
    print(word, end=' ')

Mr. TUBERVILLE . Madam President , after being in Washington , DC , for 10 months , I have seen this town jump from one issue to another . Sadly , many of the issues we face are self-inflicted -- illegal immigrants on the southern border , Americans who remain trapped in Afghanistan , and rampant inflation , just to name three . But we face a more serious threat in this Nation , an issue larger than left or right , a threat that goes beyond conservative and liberal -- China . China seeks to shackle the United States economically , technologically , and militarily . The Communist leaders of China are employing every instrument of national power to diminish our standing and influence in the world . Last month , President Biden told world leaders during his maiden U.N. General Assembly speech that the United States `` is not seeking a cold war . '' Well , the United States may not be seeking out a new Cold War , but China is , so we should n't give them the shovel to bury us . When asked 

---
### Prep fifth debate on the matter

In [114]:
url = "https://www.congress.gov/congressional-record/2021/11/18/senate-section/article/s8407-7"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [115]:
tokens = word_tokenize(raw)
tokens


In [116]:
# Shorten the raw text down to the actual debate
tokens.index("REED") #16179
tokens.index("CORNYN") #28522
tokens5 = tokens[16214:28491]

In [117]:
for word in tokens5:
    print(word, end=' ')

Mr. REED . Mr. President , I rise to discuss the fiscal year 2022 National Defense Authorization Act . Over the coming days , the Senate will consider this bill , which the Armed Services Committee passed by a broad bipartisan margin of 23 to 3 in July . I look forward to debating and improving this bill , as we all work toward ensuring our military has the right tools and capabilities to combat threats around the globe and keep Americans safe . First , I would like to acknowledge Ranking Member Inhofe , whose leadership on this committee and this body has been invaluable . His commitment to our men and women in uniform is unwavering , and he was instrumental in helping to produce this bipartisan legislation . As we debate the NDAA , we must keep in mind that the United States is engaged in a strategic competition with China and Russia . These near- peer rivals do not accept U.S. global leadership or the international norms that have helped keep the peace for the better part of a centu

---
### Prep sixth debate on the matter

In [118]:
url = "https://www.congress.gov/congressional-record/2021/12/2/senate-section/article/s8876-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [119]:
tokens = word_tokenize(raw)
tokens


In [120]:
# Shorten the raw text down to the actual debate
tokens.index("CORNYN") #13349
tokens6 = tokens[13348:]

In [121]:
tokens6.index("yield") #2205

2205

In [122]:
tokens6 = tokens6[:2209]

In [123]:
for word in tokens6:
    print(word, end=' ')

Mr. CORNYN . Mr. President , in my lifetime , the People 's Republic of China has gone from a poor and isolated country to one that now accounts for 20 percent of global domestic product . There is no question that the driving force behind this dramatic shift is the ruthlessness of the Chinese Communist Party led by President Xi . The CCP 's ruling strategy can best be described as win at all costs , which means that China never thinks twice about disregarding basic values and international norms . But there is no question that the most immediate and grave threats are against countries close to China 's borders . Last month , I led a congressional delegation to visit the Indo- Pacific to learn more from the people on the ground doing the hard work about the challenges they face and that we face in the Indo-Pacific . In my conversations with leaders in the Philippines , Taiwan , and India , I noticed they used a different vocabulary to describe China 's behavior than what we hear in Was

---
### Prep seventh debate on the matter

In [124]:
url = "https://www.congress.gov/congressional-record/2021/12/15/senate-section/article/s9167-7"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [125]:
tokens = word_tokenize(raw)
tokens


In [126]:
# Shorten the raw text down to the actual debate
tokens.index("THUNE") #17138
tokens = tokens[17140:]

In [127]:
tokens.index("THUNE") #30
tokens = tokens[35:]

In [128]:
tokens.index("THUNE") #44
tokens7 = tokens[43:]
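
The cells above step past earlier mentions of the speaker by slicing `tokens` and searching again. The same position can be found in one pass by asking for the n-th occurrence directly. This is a sketch on made-up data; `nth_index` is a hypothetical helper, not part of NLTK.

```python
def nth_index(tokens, word, n):
    """Return the position of the n-th occurrence of `word` (n starts at 1)."""
    position = -1
    for _ in range(n):
        # list.index raises ValueError if fewer than n occurrences exist
        position = tokens.index(word, position + 1)
    return position


toy = ["THUNE", "index", "Mr.", "THUNE", "title", "Mr.", "THUNE", ".", "Mr.", "President"]
third = nth_index(toy, "THUNE", 3)
speech = toy[third - 1:]  # start one token earlier to keep the "Mr." prefix
print(third, speech)
```

Because the original list is never overwritten, the resulting positions stay valid for later slices of the same page.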

In [129]:
tokens7.index("4880")

5255

In [130]:
tokens7 = tokens7[:5238]

In [131]:
for word in tokens7:
    print(word, end=' ')



In [132]:
tokens = tokens1 + tokens2 + tokens3 + tokens4 + tokens5 + tokens6 + tokens7 

---
### Normalize the words 

In [133]:
type(tokens)
ustext10 = [w.lower() for w in tokens]

---
**Save Output**

In [134]:
save_path = '/Users/charlottekaiser/Documents/uni/Hertie/master_thesis/00_data/20_intermediate_files'
file_name = "US10_National Defense Authorization Act.txt"
completeName = os.path.join(save_path, file_name)
# Use a context manager so the file is flushed and closed after writing
with open(completeName, 'w') as output:
    print(ustext10, file=output)
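
Because the save cell prints the list itself, the file contains the Python list representation (brackets, quotes, commas) rather than plain text. A hedged sketch of how such a file could be read back into a list, using a temporary directory and a made-up token list (`ustext_demo` and `demo_tokens.txt` are stand-ins, not the thesis data or path):

```python
import ast
import os
import tempfile

ustext_demo = ["mr.", "smith", "of", "washington", "."]  # hypothetical stand-in for ustext10

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_tokens.txt")
    # Write the list representation, as the save cell does
    with open(demo_path, "w", encoding="utf-8") as output:
        print(ustext_demo, file=output)
    # ast.literal_eval safely parses the literal back into a list of strings
    with open(demo_path, encoding="utf-8") as saved:
        restored = ast.literal_eval(saved.read())

print(restored == ustext_demo)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for files of this kind.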