### *Data Collection - US Congress*
## Preparing Raw Data
---
**Sample Text 5**
Title: NATIONAL SCIENCE FOUNDATION FOR THE FUTURE ACT <br>
US House // Date: March 26, June 28/29 2021 - Washington D.C.

In [84]:
# Import the necessary libraries
# (`from __future__ import division` is dropped: it has no effect in Python 3)
import os.path
import pprint
import re
import time
import urllib.request
from urllib import request

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk import FreqDist, word_tokenize
from requests_html import HTMLSession

---
### Process: HTML to ASCII to Text (tokens, not yet nltk text)
- Download web page, strip html if necessary, trim to desired content
- Tokenize the text, select tokens of interest

In [85]:
# Add a headers dictionary with a browser user-agent string to avoid the 404 response error and so that the US Congress web server accepts the request
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
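To check which header the server will receive without sending anything, the request can be built and inspected first. The sketch below uses the standard library's `urllib.request.Request` as a stand-in for `requests`; the shortened user-agent string and the `demo_` names are illustrative assumptions, not part of this notebook.

```python
from urllib.request import Request

# Illustrative, shortened user-agent string (an assumption; the notebook uses a full Chrome string)
demo_headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}
demo_url = "https://www.congress.gov/congressional-record/2021/3/26/extensions-of-remarks-section/article/e305-1"

# Building the Request object does not contact the server
demo_request = Request(demo_url, headers=demo_headers)

# urllib stores header names in capitalized form, hence "User-agent"
print(demo_request.get_header("User-agent"))
```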

---
#### Prep first debate on matter

In [86]:
url = "https://www.congress.gov/congressional-record/2021/3/26/extensions-of-remarks-section/article/e305-1"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [87]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'Extensions',
 'of',
 'Remarks',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 'Reset',
 'Words',
 '&',
 'Phrases',
 

In [88]:
# Shorten the raw text down to the actual debate
start = tokens.index("Ms.")  # 10719 when this page was scraped
tokens = tokens[start:]

In [89]:
end = tokens.index("____________________")  # 1204 when this page was scraped
tokens = tokens[:end]
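The trimming in the two cells above can be folded into one reusable step that looks up both markers itself, so the positions do not have to be copied by hand. This is a minimal sketch on a toy token list; the helper name `slice_between` and the sample tokens are hypothetical and not part of this notebook.

```python
def slice_between(tokens, start_marker, end_marker):
    """Return the tokens from the first start_marker up to, but not including, the next end_marker."""
    start = tokens.index(start_marker)     # raises ValueError if the marker is missing
    end = tokens.index(end_marker, start)  # search only from the start marker onwards
    return tokens[start:end]


# Toy stand-in for a tokenized page: navigation, speech, separator line, footer
page = ["Navigation", "Search", "Ms.", "JOHNSON", "of", "Texas", ".",
        "____________________", "Footer"]

speech = slice_between(page, "Ms.", "____________________")
print(speech)
```

A missing marker raises `ValueError` instead of silently slicing at a stale position, which may be useful if the page layout changes between scrapes.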

In [90]:
for word in tokens:
    print(word, end=' ')

Ms. JOHNSON of Texas . Madam Speaker , today I am pleased to be joined by my colleagues on the Committee on Science , Space , and Technology , Ranking Member Frank Lucas , and the Research and Technology Subcommittee Chairwoman and Ranking Member , Haley Stevens and Michael Waltz , in introducing the National Science Foundation for the Future Act . Established in 1950 , the National Science Foundation ( NSF ) was born out of hard-earned lessons about the powerful role of science in securing an allied victory in World War II . Propelled by his wartime experience leading the Office of Scientific Research and Development , Vannevar Bush championed the creation of NSF and postwar federal support for science , making the argument that `` advances in science when put to practical use mean more jobs , higher wages , shorter hours , more abundant crops , more leisure for recreation , for study , for learning how to live without the deadening drudgery which has been the burden of the common man

In [91]:
tokens1 = tokens 

---
#### Prep second debate on matter

In [92]:
url = "https://www.congress.gov/congressional-record/2021/6/28/house-section/article/h3187-1"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [93]:
tokens = word_tokenize(raw)
tokens


In [94]:
start = tokens.index("JOHNSON") - 1  # index 10691 when scraped; step back one token to keep the title before the name
tokens01 = tokens[start:]

In [95]:
end = tokens01.index("amended") + 2  # index 51 when scraped; keep "amended" and the token that follows it
tokens01 = tokens01[:end]

In [96]:
# Add general leave part 

In [97]:
start = tokens.index("Leave") - 1  # index 36697 when scraped; step back one token to keep "General"
tokens02 = tokens[start:]


In [98]:
end = tokens02.index("____________________")  # 6242 when this page was scraped
tokens02 = tokens02[:end]

In [99]:
for word in tokens02:
    print(word, end=' ')

General Leave Ms. JOHNSON of Texas . Mr. Speaker , I ask unanimous consent that all Members may have 5 legislative days to revise and extend their remarks and to include extraneous material on H.R . 2025 , the bill now under consideration . The SPEAKER pro tempore . Is there objection to the request of the gentlewoman from Texas ? There was no objection . Ms. JOHNSON of Texas . Mr. Speaker , I yield myself such time as I may consume . I rise today in strong support of H.R . 2225 , the National Science Foundation for the Future Act . The United States has long been a beacon of excellence in science and engineering . We are at a time of markedly increased global competition in research and development . However , while we should be cognizant of our increasing global competition , we must not be constrained by it . To continue to lead , we must chart our own course . First and foremost , we must significantly boost funding for science . For years , we have allowed billions of dollars of e

In [100]:
tokens2 = tokens01 + tokens02

---
#### Prep third debate on matter


In [101]:
url = "https://www.congress.gov/congressional-record/2021/6/29/extensions-of-remarks-section/article/e711-4"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [102]:
tokens = word_tokenize(raw)
tokens


In [103]:
start = tokens.index("Ms.")  # 10720 when this page was scraped
tokens = tokens[start:]

In [104]:
end = tokens.index("____________________")  # 581 when this page was scraped
tokens = tokens[:end]

In [105]:
for word in tokens:
    print(word, end=' ')

Ms. MOORE of Wisconsin . Mr. Speaker , I rise today in support of legislation reauthorizing funding for the National Science Foundation , HR 2225 , the National Science Foundation for the Future Act . I thank the Chair and Ranking Member for their leadership and hard work to put together this package . I 'm proud to be a cosponsor of this bill to increase much-needed NSF funding while implementing new and creative strategies to improve STEM education and bolster our research and engineering infrastructure . The NSF is a funder of more than 1800 institutions in the US . Given that a quarter of all federal funding is contributed through the Foundation , nearly every student in STEM has been in one way or another supported by the NSF through grants , fellowships , and research opportunities . Despite continuing support from Congress , the reality is that the majority of grant proposals NSF receives are rejected and valuable research projects are never carried out . During an era of tough 

In [106]:
tokens3 = tokens

---
### Merge all three debates

In [107]:
tokens = tokens1 + tokens2 + tokens3

---
### Normalize the words 

In [108]:
# Lowercase every token
ustext5 = [w.lower() for w in tokens]
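The effect of lowercasing can be seen on a small frequency count: differently cased spellings of the same word collapse into one entry. The sketch below uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist`; the toy tokens and the `demo_`/`raw_counts`/`normalized_counts` names are illustrative assumptions.

```python
from collections import Counter

# Toy tokens (an assumption, not taken from the scraped debates)
demo_tokens = ["The", "NSF", "funds", "the", "research", "."]

raw_counts = Counter(demo_tokens)                              # "The" and "the" are counted separately
normalized_counts = Counter(w.lower() for w in demo_tokens)    # both collapse into "the"

print(raw_counts["the"], normalized_counts["the"])
```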

---
**Save Output**

In [109]:
save_path = '/Users/charlottekaiser/Documents/uni/Hertie/master_thesis/00_data/20_intermediate_files'
file_name = "US05_NATIONAL SCIENCE FOUNDATION FOR THE FUTURE ACT.txt"
completeName = os.path.join(save_path, file_name)
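# Write the token list to disk; the with-block closes the file so that the output is flushed
with open(completeName, 'w') as output:
    print(ustext5, file=output)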
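Because `print(ustext5, file=output)` writes the list's Python representation rather than one token per line, reading the file back needs a parser for that representation. A minimal sketch of the round trip, assuming a temporary directory and a toy list in place of the real path and `ustext5` (`demo_tokens.txt` and `ustext_demo` are hypothetical names):

```python
import ast
import os
import tempfile

ustext_demo = ["ms.", "johnson", "of", "texas", "."]  # toy stand-in for ustext5

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_tokens.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fh:
        print(ustext_demo, file=fh)  # same format as the save cell: the list's repr
    with open(path, encoding="utf-8") as fh:
        restored = ast.literal_eval(fh.read())  # parse the repr back into a list of strings

print(restored == ustext_demo)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for files of this kind.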