### *Data Collection - US Congress*
## Preparing Raw Data
---
**Sample Text 37**
Title: ENDLESS FRONTIER ACT <br>
US Senate // Dates: May 12, 17, 18, 19, 20, 26, 27, 28 and June 8, 2021 - Washington, D.C.

In [2]:
# import necessary libraries
# The __future__ import is kept first so the cell also runs as a plain script; it has no effect on Python 3
from __future__ import division

# standard library
import os.path
import pprint
import re
import time
import urllib.request
from urllib import request

# third-party packages
import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk import FreqDist, word_tokenize
from requests_html import HTMLSession

---
### Process: HTML to ASCII to Text (tokens, not yet nltk text)
- Download web page, strip HTML if necessary, trim to desired content
- Tokenize the text, select tokens of interest
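
The two steps above can be sketched with the standard library alone. This is only an illustration of the shape of the process, not the pipeline used in this notebook, which relies on `BeautifulSoup(...).get_text()` and `nltk.word_tokenize`. The names `TextExtractor` and `html_to_tokens` are hypothetical, and the regular expression is a crude stand-in for NLTK's tokenizer.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_tokens(html):
    """Strip markup, then split the text into word and punctuation tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Stand-in for nltk.word_tokenize: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


sample = "<html><body><script>var x = 1;</script><p>Mr. SCHUMER. Mr. President, now.</p></body></html>"
print(html_to_tokens(sample))
# ['Mr', '.', 'SCHUMER', '.', 'Mr', '.', 'President', ',', 'now', '.']
```

Unlike `word_tokenize`, which keeps `Mr.` as one token in the outputs below, this stand-in splits it into `Mr` and `.`, so marker positions found with it would not match the indices recorded in this notebook.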

In [4]:
# Dictionary of request headers with a browser User-Agent string, used to avoid a 404 error response so that the US Congress website server accepts the scraping requests
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}

---
### Prep first debate on the matter

In [777]:
url = "https://www.congress.gov/congressional-record/2021/5/12/senate-section/article/s2464-1"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [778]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'Senate',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 'Reset',
 'Words',
 '&',
 'Phrases',
 'Include',
 'full',
 't

In [779]:
# Shorten the raw text down to the actual debate
tokens.index("SCHUMER") #10681
tokens.index("____________________") #11384

11384

In [780]:
tokens1 = tokens[10680:11384]

In [781]:
for word in tokens1:
    print(word, end=' ')

Mr. SCHUMER . Mr. President , now , speaking of bipartisan legislation , today the Commerce Committee will begin marking up the Endless Frontier Act , one of the most significant investments in American innovation in generations . The bill will be at the core of comprehensive legislation to address American competitiveness and security in the 21st century . Once again , for the information of the Senate , it is my intention for the Senate to consider and finish competitive legislation before the end of the month . There have been productive bipartisan talks over the last week to improve the Endless Frontier Act . This is an issue I have worked on with my friend the Republican Senator from Indiana for the past few years . He has been a great help , a great partner , and I appreciate his work . And , of course , Senator Cantwell , our chairman of the Commerce Committee , and Senator Wicker , our ranking member , have come together . And everyone had to give a bit here , there , and every
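
The trimming step above hard-codes the positions returned by `tokens.index`. A minimal sketch of the same idea, looking the positions up at slice time, might read as follows. The helper name `slice_between` and its `lead` parameter are hypothetical, and the sketch assumes that the first occurrence of the start marker is the right one, which the hand-adjusted offsets in some later cells suggest is not always the case.

```python
def slice_between(tokens, start_marker, end_marker, lead=1):
    """Return tokens from `lead` positions before start_marker up to, not including, end_marker."""
    start = tokens.index(start_marker)      # first occurrence; raises ValueError if absent
    end = tokens.index(end_marker, start)   # first end marker at or after the start marker
    return tokens[max(start - lead, 0):end]


# Toy token list standing in for a tokenized Congressional Record page
demo = ["Search", "GO", "Mr.", "SCHUMER", ".", "Mr.", "President", ".",
        "____________________", "Footer"]
print(slice_between(demo, "SCHUMER", "____________________"))
# ['Mr.', 'SCHUMER', '.', 'Mr.', 'President', '.']
```

With `lead=1` the slice begins at the `Mr.` token that precedes the speaker's name, matching the `tokens[10680:11384]` slice taken for a marker found at position 10681.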

---
### Prep second debate on the matter

In [782]:
url = "https://www.congress.gov/congressional-record/2021/5/17/senate-section/article/s2535-4"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [783]:
tokens = word_tokenize(raw)
tokens


In [784]:
# Shorten the raw text down to the actual debate
tokens.index("PRESIDING") #10693
tokens.index("____________________") #19528

19528

In [785]:
tokens2 = tokens[10692:19528]

In [786]:
for word in tokens2:
    print(word, end=' ')



--- 
### Prep third debate on the matter

In [787]:
url = "https://www.congress.gov/congressional-record/2021/5/18/senate-section/article/s2555-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [788]:
tokens = word_tokenize(raw)
tokens


In [789]:
# Shorten the raw text down to the actual debate
tokens.index("McCONNELL") #10681
tokens.index("____________________") #10992

10992

In [790]:
tokens3 = tokens[10680:10992]

In [791]:
for word in tokens3:
    print(word, end=' ')

Mr. McCONNELL . Now , Mr. President , on an entirely different matter , yesterday , the Senate took a step toward considering wide-ranging legislation that would touch on multiple parts of the U.S. economy in the name of increasing innovation and competitiveness . A secure , productive , and innovative America that can outcompete China is something that all 100 Senators want . Of course , in a place like the Senate , you are guaranteed to find a wide variety of different ideas about the best ways to encourage that . A number of our colleagues have assembled a proposal that touches on a long list of subjects -- everything from funding universities , to regional economic development , to Indo-Pacific geopolitics , to artificial intelligence , to cyber security , and beyond . Legislation this broad needs a thorough , robust , and bipartisan floor process , including a healthy series of amendment votes . As one of my Republican colleagues -- the ranking member on the Commerce Committee -- 

--- 
### Prep fourth debate on the matter

In [792]:
url = "https://www.congress.gov/congressional-record/2021/5/19/senate-section/article/s2755-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [793]:
tokens = word_tokenize(raw)
tokens


In [794]:
# Shorten the raw text down to the actual debate
tokens.index("assumed") #38660
tokens.index("____________________") #41646


41646

In [795]:
tokens4 = tokens[38666:41646]

In [796]:
for word in tokens4:
    print(word, end=' ')

Mr. President , on one final matter , we know the Endless Frontier legislation , which is on the floor today , is part of our response to the competition caused by an increasingly belligerent and aggressive China , and I am glad the Senate has taken up consideration of this legislation . In coming days , I expect both sides to offer amendments to strengthen this legislation and to ensure that it addresses a broad range of strategic threats . As Leader McConnell has said , a robust amendment process is critical to the success of this legislation . One of the most pressing needs , though , is to bolster our domestic semiconductor manufacturing , which will be addressed and is addressed by the underlying bill . We rely on these microelectronic circuits , or semiconductors , for everything from our telephones that we have in our pockets to the cars in our driveways , to the missile defense systems that are right now knocking down Hamas rockets raining down over Israel . Over the past coupl

--- 
### Prep fifth debate on the matter

In [797]:
url = "https://www.congress.gov/congressional-record/2021/5/19/senate-section/article/s2752-1"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [798]:
tokens = word_tokenize(raw)
tokens


In [799]:
# Shorten the raw text down to the actual debate
tokens.index("CANTWELL") #10789
tokens.index("____________________") #15158

15158

In [800]:
tokens5 = tokens[10788:15158]

In [801]:
for word in tokens5:
    print(word, end=' ')

Ms. CANTWELL . Mr. President , I call up amendment No . 1527 . The PRESIDING OFFICER . The clerk will report the amendment . The senior assistant legislative clerk read as follows : The Senator from Washington [ Ms. Cantwell ] proposes an amendment numbered 1527 to amendment No . 1502 . The amendment is as follows AMENDMENT NO . 1527 ( Purpose : To improve the bill ) On page 304 , line 18 , strike `` 3 '' and insert `` 4 '' . Ms. CANTWELL . Mr. President , we come to the floor today after a lot of hard work by the Commerce Committee to pass out the Endless Frontier bill last week -- 24 to 4 . I know my colleagues from the committee will be out here to speak on this important legislation , as will the majority leader , Senator Schumer , who authored this important legislation , and our colleague from Indiana , Senator Young . We thank them for kick-starting what is a very important national discussion about how much we should be investing in research and development or what I would say 

--- 
### Prep sixth debate on the matter

In [802]:
url = "https://www.congress.gov/congressional-record/2021/5/20/senate-section/article/s3314-3"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [803]:
tokens = word_tokenize(raw)
tokens


In [804]:
# Shorten the raw text down to the actual debate
tokens.index("CORNYN") #10679
tokens.index("____________________") #12998

12998

In [805]:
tokens6 = tokens[10678:12998]

In [806]:
for word in tokens6:
    print(word, end=' ')

Mr. CORNYN . Mr. President , in my lifetime , China has gone from a poor and isolated country to now accounting for nearly 20 percent of global gross domestic product . There is no doubt that the ingenuity of the Chinese people has contributed to this success , but we know the driving force behind this dramatic rise is the aggressiveness of the Chinese Communist Party . Its aims can be summed up with four R 's : resist , reduce , replace , and reorder . China resists American economic influence by manipulating American businesses and industries and stealing intellectual property . It reduces internal dissent and free expression of ideas through mass surveillance and censorship of its own people , and it seeks to exert its power and influence in the United States . The Chinese Communist Party intends to replace America as the world 's technology leader through the Made in China 2025 initiative , which seeks to achieve Chinese dominance in high-tech manufacturing . Finally , it hopes to 

--- 
### Prep seventh debate on the matter

In [807]:
url = "https://www.congress.gov/congressional-record/2021/5/26/senate-section/article/s3477-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [808]:
tokens = word_tokenize(raw)
tokens


In [809]:
# Shorten the raw text down to the actual debate
tokens.index("LUMMIS") #15425
tokens.index("Utah") #16815


16815

In [810]:
tokens7 = tokens[15463:16817]

In [811]:
for word in tokens7:
    print(word, end=' ')

Ms. LUMMIS . Mr. President , I am excited to announce the founding of the Senate Financial Innovation Caucus with my friend and cochair , the Senator from Arizona . I am delighted that you also have joined our caucus . We are grateful for your participation and look forward to working with you . One of my top priorities and a legacy I hope to leave in this Chamber is to ensure the United States remains a global leader in financial services for future generations . The U.S. dollar is the world 's unquestioned reserve currency . Since the Second World War , this leadership role has given our country enormous advantages , including affordable credit and trade finance . China is not hiding its ambition to knock the U.S. dollar down a peg by offering a competitor payment system that sidesteps the United States . This year , the Chinese Government launched a pilot program for their digital yuan in multiple cities around China . They expect to completely release the central bank digital curre

--- 
### Prep eighth debate on the matter

In [812]:
url = "https://www.congress.gov/congressional-record/2021/5/27/senate-section/article/s3549-5"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [813]:
tokens = word_tokenize(raw)
tokens


In [814]:
# Shorten the raw text down to the actual debate
tokens.index("Vehicles") #10892
tokens.index("Blind") #11308

11308

In [815]:
tokens8_1 = tokens[10891:11326]

In [816]:
for word in tokens8_1:
    print(word, end=' ')

Automated Vehicles Mr. THUNE . Madam President , from the beginning , the story of the U.S. auto industry has been one of ingenuity , of taking risks , and of pushing forward . At the dawn of the 20th century , most Americans could hardly comprehend the idea of the automobile . Yet , 20 years later , they had become nearly ubiquitous in American life , thanks to the insistence of entrepreneurs like Henry Ford on making the automobile affordable for the majority of Americans . The democratization of the automobile , rather than the invention of the automobile itself , is , in my opinion , one of most remarkable and uniquely American success stories . Automobiles allowed Americans to capitalize on the economic dynamism of the roaring twenties and helped Americans move and adapt during the Great Depression . They contributed greatly to the American industrial base and the know-how needed to fight and win the Second World War and help propel the United States to its current status as a pre

In [817]:
# Add second relevant portion of debate

In [818]:
tokens8_2 = tokens[10895:]
tokens8_2.index("THUNE") #900
tokens8_2.index("Cloture") #1789

1789

In [819]:
tokens8_2 = tokens8_2[899:1789]

In [820]:
for word in tokens8_2:
    print(word, end=' ')

Mr. THUNE . Madam President , imagine a farmer in rural South Dakota who can no longer drive to get to town for appointments , prescriptions , or groceries -- enter the automated vehicle . This technology has potential to keep people in their homes and communities longer . Moreover , AVs have potential to greatly increase roadway safety . Currently , there are an average of more than 35,000 traffic fatalities on our Nation 's roadways each year , including pedestrian , motorcycle , and bicycle fatalities . Automated vehicles could dramatically -- dramatically -- reduce that number . Distracted driving , driving while impaired -- automated vehicles could eliminate those dangers . For automated vehicle technology to advance , it is imperative that the regulatory framework catch up with private-sector innovation . That is why I have pushed for the enactment of AV legislation over the years and why I had hoped -- I had hoped -- that we would be voting to add my automated vehicles amendment
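
The tail slice used for this portion yields indices relative to position 10895. `list.index` also accepts a start position, so the same later occurrence can be located in absolute terms without copying the list. The following is a small sketch on a toy list; `find_after` is a hypothetical helper name.

```python
def find_after(tokens, marker, offset):
    """Absolute index of the first `marker` at or after position `offset`."""
    return tokens.index(marker, offset)


# Toy list with two occurrences of the speaker's name
demo = ["Mr.", "THUNE", ".", "x", "Mr.", "THUNE", ".", "y", "Cloture"]
offset = 2

relative = demo[offset:].index("THUNE")        # index inside the tail slice
absolute = find_after(demo, "THUNE", offset)   # index inside the full list
assert absolute == offset + relative

# Start one token early to keep the "Mr." before the name, stop at the end marker
segment = demo[absolute - 1:demo.index("Cloture", absolute)]
print(segment)
# ['Mr.', 'THUNE', '.', 'y']
```

Applied to the values recorded above, a relative index of 900 in `tokens[10895:]` corresponds to absolute position 10895 + 900 = 11795 in `tokens`.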

In [821]:
# Add third relevant portion of debate

In [822]:
tokens.index("far-reaching") #236300
tokens.index("well-positioned") #240582

240582

In [823]:
tokens8_3 = tokens[236283:240623]

In [824]:
for word in tokens8_3:
    print(word, end=' ')

Mr. President , the Senate is moving quickly , I hope , toward a vote on a far-reaching proposal to confront threats from China . Based on everything we know about the might and the ambitions of the Chinese Communist Party , there is a clear and urgent need for us to take action . Every year , the U.S. intelligence community issues a threat assessment report outlining the greatest challenges confronting our country on the horizon . Topping the latest report , which was released last month , is China 's push for global power . The report outlines China 's efforts to strengthen its military power , diversify its nuclear arsenal , and fine-tune its cyber espionage skills , which are already quite considerable . One major area that can not be overlooked is China 's industrial policy . Through the CCP 's Made in China 2025 initiative , it seeks to achieve China 's dominance in high-tech manufacturing . For everything from electric cars to advanced robotics , to artificial [ [ Page S3846 ] ]

In [825]:
# Add fourth relevant portion of the debate
tokens.index("BENNET") #241830
tokens.index("dysfunctional") #241920

241920

In [826]:
tokens8_4 = tokens[241829:241952]

In [827]:
for word in tokens8_4:
    print(word, end=' ')

Mr. BENNET . Mr. President , I am thrilled that we are passing this legislation . It is amazing to me -- and I am sure the Presiding Officer would agree -- that the Senate has actually returned to regular order . We are passing amendments on both sides of the aisle , and I think we are going to have a big bipartisan vote here in the Senate . In having been here for a number of years when the Senate did n't operate that way -- it was incredibly dysfunctional -- it is a great , great privilege to be here at a moment when it is working . So I want to express my sense of gratitude for that . 

In [828]:
# Combine all parts
tokens8 = tokens8_1 + tokens8_2 + tokens8_3 + tokens8_4

In [829]:
for word in tokens8:
    print(word, end=' ')

Automated Vehicles Mr. THUNE . Madam President , from the beginning , the story of the U.S. auto industry has been one of ingenuity , of taking risks , and of pushing forward . At the dawn of the 20th century , most Americans could hardly comprehend the idea of the automobile . Yet , 20 years later , they had become nearly ubiquitous in American life , thanks to the insistence of entrepreneurs like Henry Ford on making the automobile affordable for the majority of Americans . The democratization of the automobile , rather than the invention of the automobile itself , is , in my opinion , one of most remarkable and uniquely American success stories . Automobiles allowed Americans to capitalize on the economic dynamism of the roaring twenties and helped Americans move and adapt during the Great Depression . They contributed greatly to the American industrial base and the know-how needed to fight and win the Second World War and help propel the United States to its current status as a pre
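
Because the four excerpts are cut with a mix of absolute and relative indices, it can help to write them down as absolute spans and confirm that they are in order and do not overlap. The sketch below uses the slice bounds from the cells above; `check_spans` and `gather` are hypothetical helper names, and the second span assumes the relative bounds 899 and 1789 refer to `tokens[10895:]` as in the original cell.

```python
# Absolute (start, end) spans for the four excerpts of the eighth debate
OFFSET = 10895  # start of the tail slice that tokens8_2 was cut from
spans = [
    (10891, 11326),                  # tokens8_1
    (OFFSET + 899, OFFSET + 1789),   # tokens8_2, relative bounds shifted by OFFSET
    (236283, 240623),                # tokens8_3
    (241829, 241952),                # tokens8_4
]


def check_spans(spans):
    """Return True if each span ends no later than the next one starts."""
    return all(a_end <= b_start
               for (_, a_end), (b_start, _) in zip(spans, spans[1:]))


def gather(tokens, spans):
    """Concatenate the token slices named by the spans, in order."""
    out = []
    for start, end in spans:
        out.extend(tokens[start:end])
    return out


print(spans[1])                                   # (11794, 12684)
print(check_spans(spans))                         # True
print(sum(end - start for start, end in spans))   # 5788
```

With the page's token list in hand, `gather(tokens, spans)` would build the same list as `tokens8_1 + tokens8_2 + tokens8_3 + tokens8_4`, under the assumption stated above.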

--- 
### Prep ninth debate on the matter

In [5]:
url = "https://www.congress.gov/congressional-record/2021/5/27/senate-section/article/s3851-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [6]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'Senate',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'skip',
 'to',
 'main',
 'content',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 'Reset',
 'Words',
 '&',

In [11]:
# Shorten the raw text down to the actual debate
tokens.index("STABENOW") #10652
tokens.index("yield") #12138

12138

In [12]:
tokens9_1 = tokens[10651:12142]

In [13]:
for word in tokens9_1:
    print(word, end=' ')

Ms. STABENOW . Mr. President , it will come as no surprise to anyone in this Chamber that I am extremely proud to be born and raised in Michigan . Our State leads the world in innovation . We are the leaders in making things -- furniture , appliances , wind turbines and solar components and so much more , and , of course , we are the home of the automobile and the automotive assembly line and the middle class of America . Our workers put the world on four wheels . They built an economy strong enough that those same workers could afford to buy one or two or more cars and trucks that they made . Yet our Nation faces a stark choice right now , and that is why the bill in front of us tonight is so very important . We can continue to invest in making things in America or we can decide that it is not really worth the trouble anymore . We can continue to lead the world in the research and development of breakthrough technologies or we can allow other countries to surge ahead while we tread wa

In [835]:
# Add second relevant portion of the debate 

In [14]:
tokens.index("revitalize") #18618


18618

In [15]:
tokens9_2 = tokens[18598:]

In [16]:
tokens9_2.index("MURKOWSKI") #959

959

In [17]:
tokens9_2 = tokens9_2[:990]

In [18]:
for word in tokens9_2:
    print(word, end=' ')

Mr. PETERS . Mr. President , as we emerge from the coronavirus pandemic , we have a real opportunity to revitalize American manufacturing and harness American leadership in scientific and technological advancement . Today I urge my colleagues to support critical , bipartisan legislation that will do just that . The United States Innovation and Competition Act will help keep our country on the cutting edge of technology , strengthen American competitiveness on a global stage , and protect our national security . International competitors like the Chinese Government are aggressively investing in manufacturing , science , and technology in an attempt to gain a competitive advantage over the United States , and we can not let that happen . In order to maintain our edge , we must make serious investments in domestic research and development , technology , and manufacturing . We know that a strong manufacturing sector is the backbone of any economy . I have long believed that you can not be 

In [840]:
# Add third relevant portion of the debate
tokens.index("high-tech") #47739

47739

In [841]:
tokens9_3 = tokens[47539:]

In [842]:
tokens9_3.index("18632")

2026

In [843]:
tokens9_3 = tokens9_3[:2053]

In [844]:
for word in tokens9_3:
    print(word, end=' ')

Mr. SULLIVAN . Mr. President , I want to commend my colleagues for the important work that everybody is doing down here on the Senate floor , bipartisan work , addressing one of the most important challenges we have as a nation . Not just today but for years this challenge is going to be with us , and that is the challenge of dealing with the rise of the Communist Party of China . That is going to be more and more of a challenge and focus of the efforts of all elements of America 's economy , military , society . And here is the good news . As you are seeing here , there is a lot of focus , a lot of effort , and a lot of bipartisan work . It is a democracy , a Republic , right ? It is messy . It is not going to be perfect . But , for the Chinese , I think the worst nightmare of the Chinese Communist Party is to see Americans coming together and recognizing that this is something we all need to work on together . China 's economy is growing . Their high-tech capability is growing . Thei

In [845]:
tokens9 = tokens9_1 + tokens9_2 + tokens9_3

--- 
### Prep tenth debate on the matter

In [846]:
url = "https://www.congress.gov/congressional-record/2021/6/8/senate-section/article/s3974-3"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [847]:
tokens = word_tokenize(raw)
tokens


In [848]:
# Shorten the raw text down to the actual debate
tokens.index("CANTWELL") #10685
tokens.index("GDP") #12039

12039

In [849]:
tokens10 = tokens[10684:12076]

In [850]:
for word in tokens10:
    print(word, end=' ')

Ms. CANTWELL . Madam President , I come to the floor , hopefully today will be the day we wrap up debate on the America Competes-Endless Frontier legislation now known as the USICA , United States Innovation and Competition Act of 2021 . We come to talk about this now , primarily because we know that the research dollars invested today are going to decide the jobs of the future . And we know that we all believe a significant increase in the investment in research and development dollars will help us spur innovation , continue to help us compete , and continue to be competitive in key sectors of our economy that are so important to us . We know that we have been having this debate literally now for more than a decade , starting with President Bush 's 2006 report saying America needed to invest more in the National Science Foundation . And at the time , I am pretty sure we thought we were in a track meet where our competitor was maybe half a lap behind us I am pretty sure now , as the de

---
### Prep eleventh debate on the matter

In [851]:
url = "https://www.congress.gov/congressional-record/2021/5/20/senate-section/article/s3182-7"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [852]:
tokens = word_tokenize(raw)
tokens

In [853]:
tokens.index("CANTWELL") #11357
tokens.index("____________________") #17644

17644

In [854]:
tokens11 = tokens[11356:17644]

In [855]:
for word in tokens11:
    print(word, end=' ')

Ms. CANTWELL . Mr. President , I come to the floor today to continue our discussion about the Endless Frontier Act and why America needs to make more investment in the areas of research and development for our Nation . This is critically important as we have gone through this debate with some of our colleagues , to talk about why this is important for the United States . I spent my time yesterday -- maybe somebody from the staff can come over and help me with the charts but , thank you -- the biggest reason we are doing this is because we believe in American know-how , that is we believe in American ingenuity and we believe in American know-how and we have discussed already how that has helped to build our country over and over and over again , that we are a nation of , if you will , explorers , of pioneers , and by necessity , inventors , and that has continued throughout the history of our country . So we are so proud to continue to make these investments in all areas of science , ce

---
### Prep twelfth debate on the matter

In [856]:
url = "https://www.congress.gov/congressional-record/2021/6/8/senate-section/article/s3975-2"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [857]:
tokens = word_tokenize(raw)
tokens

In [858]:
tokens.index("DURBIN") #13587
tokens.index("meritocracies") #16448

16448

In [859]:
tokens12 = tokens[13586:16544]

In [860]:
for word in tokens12:
    print(word, end=' ')

Mr. DURBIN . Mr. President , last week , China announced that it would now allow families to have three children -- a profound shift from their previous one- and two-child policies . Why the change ? China looked to the future and realized that its population policies would hamper economic growth . Now , the U.S. Government will never tell families how many children to have . That choice is profoundly personal . Yet we must ask ourselves the same questions China is asking : What kind of changes will lead or deter the United States from a future of economic growth and prosperity ? How can we enhance America 's competitiveness ? And more than just compete , how can we make sure America comes in first ? [ [ Page S3978 ] ] The answer is obvious : Invest in American creativity . China is investing heavily in electric vehicles , critical minerals , energy production , computer chips -- the list goes on . In all of these areas , China is beginning to pull ahead of the pack . They are aiming f
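
The printed speech still contains Congressional Record page markers such as `[ [ Page S3978 ] ]`. If those markers are unwanted in the corpus, they could be filtered out at the token level. The sketch below is an assumption-laden illustration, not a step from the original notebook: `drop_page_markers` is a hypothetical helper, and it assumes the tokenizer always splits a marker into exactly six tokens, as in the output above.

```python
def drop_page_markers(tokens):
    """Remove every '[ [ Page <id> ] ]' run (six tokens) from a token list."""
    cleaned = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 6]
        is_marker = (
            len(window) == 6
            and window[:3] == ["[", "[", "Page"]
            and window[4:] == ["]", "]"]
        )
        if is_marker:
            i += 6                      # skip the whole marker
        else:
            cleaned.append(tokens[i])   # keep ordinary tokens unchanged
            i += 1
    return cleaned


demo = ["comes", "in", "first", "?", "[", "[", "Page", "S3978", "]", "]",
        "The", "answer"]
print(drop_page_markers(demo))
```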

---
### Prep thirteenth debate on the matter

In [861]:
url = "https://www.congress.gov/congressional-record/2021/5/18/senate-section/article/s2555-5"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [862]:
tokens = word_tokenize(raw)
tokens

In [863]:
tokens.index("SASSE") #14390

14390

In [864]:
tokens13 = tokens[14389:]

In [865]:
tokens13.index("Texas") #2936

2936

In [866]:
tokens13 = tokens13[:2938]

In [867]:
for word in tokens13:
    print(word, end=' ')

Mr. SASSE . Mr. President , I ask unanimous consent that the order for the quorum call be rescinded . The PRESIDING OFFICER . Without objection , it is so ordered . S. 1260 Mr. SASSE . Mr. President , Winston Churchill is often credited with the apocryphal quote that `` we sleep soundly in our beds because rough men stand ready in the night to visit violence on those who would do us harm . '' This is still true , but the 21st century has gotten more complicated . We live in an era of hybrid wars . There are fewer D-days on enemy beaches and more zero-day exploits in enemy servers . Americans sleep soundly at night because , in addition to these rough men at the ready , brilliant men and women work around the clock to develop national security technology that defends our interests and undermines our enemies . DARPA -- the Defense Advanced Research Projects Agency -- is on the frontlines of that work . They are racing against our adversaries . Our technology struggle against the Chinese 

---
### Prep fourteenth debate on the matter

In [868]:
url = "https://www.congress.gov/congressional-record/2021/5/28/senate-section/article/s3915-6"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [869]:
tokens = word_tokenize(raw)
tokens

In [870]:
tokens.index("PAUL") #10910

10910

In [871]:
tokens = tokens[10911:]

In [872]:
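# Jump to the next "PAUL" (start of the speech), keeping the "Mr." in front of it
tokens.index("PAUL") #140
tokens = tokens[139:]
# End the first relevant portion shortly before "2019"
tokens.index("2019") #7482
tokens14_1 = tokens[:7480]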

In [873]:
for word in tokens14_1:
    print(word, end=' ')

Mr. PAUL . Mr. President , we are currently $ 28 trillion in debt . Whose fault is it -- Republicans ? Democrats ? The answer is yes , yes on both fronts . Both parties are responsible for the debt , and one side is honest about it . One side will tell you they do n't give a fig about the debt : The debt be damned . We are for new monetary theories . Spend as much as you have got ; borrow as much as you can ; and somehow we are going to combat the influence of China by borrowing more money from China . It does n't really seem to make a lot of sense , but that is where we are . So we have before us a bill that will simply add to the debt . We will go further in debt . You might make the argument that we are actually less strong as a nation the more in debt we are . Where is the opposition ? Now , there is no opposition on one side of the [ [ Page S3916 ] ] aisle , and on the other side , there is feigned opposition . The Republicans will feign opposition to the debt . They will say : We

In [874]:
# Add second relevant portion
tokens14_2 = tokens[7480:]
tokens14_2.index("proposal") #1908
tokens14_2.index("colleagues") #2761

2761

In [875]:
tokens14_2 = tokens14_2[1871:2823]

In [876]:
for word in tokens14_2:
    print(word, end=' ')



In [877]:
tokens14 = tokens14_1 + tokens14_2
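
In this debate the speech starts at the second occurrence of "PAUL", which the cells above reach by slicing the list and calling `index` again. Since `list.index` accepts a start position, the same lookup can be written without re-slicing. This is a minimal sketch; `nth_index` and the toy list are hypothetical names used for illustration only.

```python
def nth_index(tokens, word, n):
    """Return the index of the n-th occurrence (1-based) of `word`."""
    position = -1
    for _ in range(n):
        # Search only to the right of the previous hit;
        # raises ValueError if there are fewer than n occurrences.
        position = tokens.index(word, position + 1)
    return position


demo = ["Mr.", "PAUL", "asked", ".", "Mr.", "PAUL", ".", "Mr.", "President"]
second = nth_index(demo, "PAUL", 2)
print(second)
print(demo[second - 1:])   # keep the "Mr." in front of the name
```

One advantage of this variant is that the indices stay relative to the original token list, so the numbers noted in the comments remain comparable across cells.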

---
### Prep fifteenth debate on the matter


In [878]:
url = "https://www.congress.gov/congressional-record/2021/5/19/senate-section/article/s2745-7"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [879]:
tokens = word_tokenize(raw)
tokens

In [880]:
tokens.index("Endless") #17498
tokens = tokens[17501:]


In [881]:
tokens.index("Wyoming") #1857
tokens = tokens[:1856]

In [882]:
tokens15 = tokens

In [883]:
for word in tokens15:
    print(word, end=' ')

Mr. President , years ago , I traveled to Israel with then-minority leader Senator Harry Reid . We met with Shimon Peres , and something happened that I have never forgotten . [ [ Page S2750 ] ] Senator Reid asked : `` What do you see as the greatest threat in the world to the United States ? '' That was just a couple of years after 9/ 11 when that question was posed . I thought Peres might cite terrorism or loose nukes . Instead , he said without hesitation : `` China . Do n't you see that ? '' Economically , strategically , diplomatically , China was already focused like a laser on advancing its position as the world 's most powerful nation , and that was 16 years ago . Five days ago , China launched a spacecraft safely and landed it on Mars , becoming only the second nation in history , after the United States , to land on the Red Planet . In 1957 , at the height of the Cold War , the Soviet Union became the first nation to launch an Earth-orbiting satellite into space . That launch

---
### Normalize the words 

In [884]:
tokens = tokens1 + tokens2 + tokens3 + tokens13 + tokens15 + tokens4 + tokens5 + tokens6 + tokens11 + tokens7 + tokens8 + tokens9 + tokens14 + tokens10 + tokens12

In [885]:
type(tokens)
ustext37 = [w.lower() for w in tokens]
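
The normalization above only lowercases the tokens, so punctuation tokens such as `,` and `--` remain in `ustext37`. Whether they should be dropped depends on the later analysis. The sketch below shows both options side by side; `normalize` and its `keep_punctuation` flag are hypothetical and are not part of the original notebook.

```python
def normalize(tokens, keep_punctuation=True):
    """Lowercase tokens; optionally drop tokens without any letter or digit."""
    lowered = [w.lower() for w in tokens]
    if keep_punctuation:
        return lowered
    # Keep a token only if at least one character is alphanumeric.
    return [w for w in lowered if any(ch.isalnum() for ch in w)]


demo = ["Mr.", "President", ",", "China", "--", "the", "21st", "century", "."]
print(normalize(demo))
print(normalize(demo, keep_punctuation=False))
```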

---
**Save Output**

In [886]:
save_path = '/Users/charlottekaiser/Documents/uni/Hertie/master_thesis/00_data/20_intermediate_files'
file_name = "US37_ENDLESS FRONTIER ACT.txt"
completeName = os.path.join(save_path, file_name)
# Use a context manager so the file is flushed and closed after writing
with open(completeName, 'w') as output:
    print(ustext37, file=output)
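
Note that printing a list to a file stores its Python representation (brackets, quotes and commas), which has to be parsed again before reuse. If a plain-text file is preferred, one possible alternative is to join the tokens with spaces and split on whitespace when reading back. This is a sketch only: `save_tokens` and `load_tokens` are hypothetical helpers, the temporary directory stands in for the real `save_path`, and the round trip assumes no token contains whitespace.

```python
import os
import tempfile


def save_tokens(tokens, path):
    """Write tokens as one space-separated line of plain text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(tokens))


def load_tokens(path):
    """Read the file back and split it into tokens on whitespace."""
    with open(path, encoding="utf-8") as f:
        return f.read().split()


demo = ["mr.", "president", ",", "we", "invest"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_tokens.txt")
    save_tokens(demo, path)
    restored = load_tokens(path)
print(restored == demo)
```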