### *Data Collection - US Congress*
## Preparing Raw Data
---
**Sample Text 6**
Title: RECOGNIZING THE 75TH ANNIVERSARY OF THE OFFICE OF NAVAL RESEARCH <br>
US House of Rep // Date: July 30, 2021 - Washington D.C.

In [1]:
# Import the necessary libraries
import os.path
import re
import pprint
import time
import urllib.request

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk import FreqDist, word_tokenize
from requests_html import HTMLSession

---
### Process: HTML to ASCII to Text (tokens, not yet nltk text)
- Download the web page, strip the HTML if necessary, and trim to the desired content
- Tokenize the text, select tokens of interest
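
The two steps above can be sketched on a small inline page, without any network access. This is only an illustration under stated assumptions: it swaps in the standard library's `html.parser` for BeautifulSoup and a simple regular expression for `word_tokenize`, so the tokens will not match NLTK's output exactly. The names `TextExtractor`, `html_to_tokens`, and `sample_html` are hypothetical and do not appear elsewhere in this notebook.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip the markup, then split the text into word and punctuation tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # A word (optionally joined by '-', "'" or '.') or a single punctuation mark.
    # Stand-in for nltk.word_tokenize, not an equivalent.
    return re.findall(r"\w+(?:[-'.]\w+)*|[^\w\s]", text)


# Hypothetical miniature page: navigation noise followed by the speech text
sample_html = (
    "<html><body><nav>Search</nav>"
    "<p>Mr. LANGEVIN. Madam Speaker, today it is an honor.</p>"
    "</body></html>"
)
sample_tokens = html_to_tokens(sample_html)
print(sample_tokens)
```

As in the real page, the navigation text ("Search") ends up in the token list alongside the speech, which is why the tokens are trimmed to the desired content afterwards.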

In [2]:
# Add a headers dictionary with a browser user-agent string so that the Congress.gov server does not reject the request with an error response
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}

In [3]:
url = "https://www.congress.gov/congressional-record/2021/7/30/extensions-of-remarks-section/article/e856-4"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [4]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'Extensions',
 'of',
 'Remarks',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'skip',
 'to',
 'main',
 'content',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 ...]

In [7]:
# Shorten the raw text down to the actual debate
tokens.index("LANGEVIN")  # 10671
tokens.index("____________________")  # 11382

11382

In [8]:
tokens1 = tokens[10670:11382]
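
The slice bounds here are copied by hand from the `index` calls, which breaks as soon as the page layout shifts by a single token. A minimal sketch of computing them instead, assuming the speech runs from the first occurrence of the speaker's name to the next separator line. The helper `slice_between` and its `lead` parameter are hypothetical names introduced for illustration.

```python
def slice_between(tokens, start_marker, end_marker, lead=0):
    """Return the tokens from `lead` positions before the first start_marker
    up to, but not including, the first end_marker that follows it.

    Raises ValueError if either marker is missing.
    """
    start = tokens.index(start_marker)
    end = tokens.index(end_marker, start)
    return tokens[max(start - lead, 0):end]


# Hypothetical miniature token list standing in for the full page
demo = ["Navigation", "Search", "R.", "LANGEVIN", "of", "rhode", "island",
        "____________________", "Footer"]
speech = slice_between(demo, "LANGEVIN", "____________________", lead=1)
print(speech)
```

With `lead=1` the slice starts one token before the name, mirroring the `tokens[10670:11382]` slice, which starts at index 10670 although `"LANGEVIN"` was found at 10671.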

In [9]:
for word in tokens1:
    print(word, end=' ')

R. LANGEVIN of rhode island in the house of representatives Friday , July 30 , 2021 Mr. LANGEVIN . Madam Speaker , today it is an honor for me to pay tribute to the Office of Naval Research and its contributions to our Sea Services , the Nation , and the pursuit of scientific and technological discovery on the occasion of its seventy-fifth anniversary . The Office of Naval Research was established by act of Congress on August 1 , 1946 , in the aftermath of World War II to `` plan , foster , and encourage scientific research in recognition of its paramount importance as related to the maintenance of future naval power , and the preservation of national security . '' A product of wartime necessities that brought together government and military planners , academia , and industry to help make science and technology an essential tool [ [ Page E857 ] ] for victory , the Office of Naval Research grew into a vital organization dedicated to the enduring warfighting requirements of the Navy and

---
### Prepare the second debate on the matter

In [10]:
url = "https://www.congress.gov/congressional-record/2021/8/3/senate-section/article/s5708-4"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()

In [11]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'Senate',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'skip',
 'to',
 'main',
 'content',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 'Reset',
 'Words',
 '&',
 ...]

In [13]:
# Locate the start and end of the second debate
tokens.index("REED")  # 10651
tokens.index("____________________")  # 11179

10651

In [14]:
tokens2 = tokens[10650:11179]

In [15]:
for word in tokens2:
    print(word, end=' ')

Mr. REED . Madam President , on behalf of Senator Inhofe and myself , as the ranking member and chairman of the Senate Armed Services Committee , we rise to commemorate and celebrate the Office of Naval Research and its contributions to our Sea Services , national defense , and the advancement of scientific and technological discovery on the occasion of its 75th anniversary . World War II underscored how science and technology could determine winners and losers on the battlefield . In the aftermath of the war , Congress established the Office of Naval Research on August 1 , 1946 , to `` plan , foster , and encourage scientific research in recognition of its paramount importance as related to the maintenance of future naval power , and the preservation of national security . '' Since then , the Office of Naval Research has been at the forefront of groundbreaking research that has resulted in lasting military supremacy not only on and in the seas , but also in the skies , on land , and i

In [16]:
tokens = tokens1 + tokens2

---
### Normalize the words 

In [17]:
# Lowercase every token
ustext6 = [w.lower() for w in tokens]
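
Lowercasing leaves punctuation tokens such as `,` and the quotation marks in the list. If a later analysis should ignore them, one possible extension is sketched below. The function `normalize` and its `keep_punctuation` flag are hypothetical, and whether to drop punctuation is an assumption about the downstream analysis rather than something this notebook does.

```python
def normalize(tokens, keep_punctuation=True):
    """Lowercase all tokens; optionally drop tokens with no letter or digit."""
    lowered = [t.lower() for t in tokens]
    if keep_punctuation:
        return lowered
    return [t for t in lowered if any(ch.isalnum() for ch in t)]


# Hypothetical sample in the style of the tokenized speeches
demo = ["Mr.", "LANGEVIN", ".", "Madam", "Speaker", ",", "``", "plan"]
print(normalize(demo))
print(normalize(demo, keep_punctuation=False))
```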

---
**Save Output**

In [18]:
save_path = '/Users/charlottekaiser/Documents/uni/Hertie/master_thesis/00_data/20_intermediate_files'
file_name = "US06_75th ANNIVERSARY OF THE OFFICE OF NAVAL RESEARCH.txt"
completeName = os.path.join(save_path, file_name)
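# Write the token list to disk; the context manager closes the file so the output is flushed
with open(completeName, 'w', encoding='utf-8') as output:
    print(ustext6, file=output)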
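
Because `print` writes the list in its Python literal form (brackets, quotes, commas), the file has to be parsed rather than simply read when the tokens are loaded again. A minimal round-trip sketch using `ast.literal_eval`, with a temporary directory in place of the real `save_path`. The names `save_tokens`, `load_tokens`, and `demo_tokens.txt` are hypothetical.

```python
import ast
import os
import tempfile


def save_tokens(tokens, path):
    """Write the token list as a Python list literal, as print() does."""
    with open(path, "w", encoding="utf-8") as f:
        print(tokens, file=f)


def load_tokens(path):
    """Parse the list literal back into a list of strings."""
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read().strip())


# Hypothetical sample tokens; the real notebook would pass ustext6
demo_tokens = ["mr.", "reed", ".", "madam", "president", "``", "plan"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_tokens.txt")
    save_tokens(demo_tokens, demo_path)
    restored = load_tokens(demo_path)

print(restored == demo_tokens)
```

Writing one token per line or using `json.dump` would be alternatives if the file needs to be read by tools other than Python.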