### *Data Collection - US Congress*
## Preparing Raw Data
---
**Sample Text 14**
Title: RECOGNIZING PARTNERSHIP BETWEEN FLORIDA INTERNATIONAL UNIVERSITY AND THE JOHN S. AND JAMES L. KNIGHT FOUNDATION <br>
US House // Date: February 25, 2021 - Washington D.C.

In [22]:
# Import the necessary libraries
# (the `from __future__ import division` line was removed: it is a no-op on
# Python 3 and raises a SyntaxError when it is not the first statement)
import os.path
import pprint
import re
import time
import urllib.request
from urllib import request

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk import FreqDist, word_tokenize
from requests_html import HTMLSession

---
### Process: HTML to ASCII to Text (tokens, not yet an nltk text)
- Download web page, strip html if necessary, trim to desired content
- Tokenize the text, select tokens of interest
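
The two steps above can be tried out on a small inline HTML snippet without any network access. The following is only a minimal sketch: it uses the standard library's `html.parser` and a rough regular-expression tokenizer as stand-ins for `BeautifulSoup` and `word_tokenize`, and the names `TextExtractor`, `html_to_tokens` and `sample_html` are hypothetical. The regex splits `Mr.` into two tokens, whereas `word_tokenize` keeps it as one, so the token boundaries are not identical to those in this notebook.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip the markup, then split the text into word and punctuation tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Rough tokenizer: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical sample page, not taken from congress.gov
sample_html = "<html><body><h1>Record</h1><p>Mr. Speaker, I rise today.</p></body></html>"
sample_tokens = html_to_tokens(sample_html)
print(sample_tokens)
```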

In [23]:
# Add a headers dictionary with a user-agent string to bypass the 404 response error and ensure that the US Congress website server allows scraping
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}

In [24]:
url = "https://www.congress.gov/congressional-record/2021/2/25/house-section/article/h624-1"
html = requests.get(url, headers=headers)
raw = BeautifulSoup(html.content, 'html.parser').get_text()


In [25]:
tokens = word_tokenize(raw)
tokens

['Congressional',
 'Record',
 'House',
 'Articles',
 '|',
 'Congress.gov',
 '|',
 'Library',
 'of',
 'Congress',
 'Alert',
 ':',
 'For',
 'a',
 'better',
 'experience',
 'on',
 'Congress.gov',
 ',',
 'please',
 'enable',
 'JavaScript',
 'in',
 'your',
 'browser',
 '.',
 'Navigation',
 'Advanced',
 'Searches',
 'Browse',
 'Legislation',
 'Congressional',
 'Record',
 'Committees',
 'Members',
 'Search',
 'Tools',
 'Support',
 'Glossary',
 'Help',
 'Contact',
 'Close',
 'Support',
 'Search',
 'the',
 'Help',
 'Center',
 'GO',
 'Browse',
 'the',
 'Help',
 'Center',
 'Glossary',
 'Contact',
 'Current',
 'CongressAll',
 'CongressesLegislationCommittee',
 'MaterialsCongressional',
 'RecordMembersNominations',
 'Search',
 'Within',
 'GO',
 'Legislation',
 'Legislation',
 'Text',
 'Committee',
 'Reports',
 'Congressional',
 'Record',
 'Nominations',
 'House',
 'Communications',
 'Senate',
 'Communications',
 'Treaty',
 'Documents',
 'Reset',
 'Words',
 '&',
 'Phrases',
 'Include',
 'full',
 'te

In [26]:
# Shorten the raw text down to the actual debate
start = tokens.index("minutes")  # 10724 for this page
tokens = tokens[start + 2:]      # the speech begins two tokens after "minutes"

In [27]:
note_idx = tokens.index("NOTE")  # 32
tokens1 = tokens[:note_idx - 1]  # first part of the speech, ending one token before "NOTE"

In [28]:
for word in tokens1:
    print(word, end=' ')

Mr. GIMENEZ . Mr. Speaker , I rise today to recognize a transformational partnership announced in my districtbetween Florida International University and the John S. and James L. Knight Foundation . 
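
The index-and-slice trimming used in the cells above can also be wrapped in a small helper, so the marker tokens are stated once and the positions are looked up rather than typed in. This is a sketch under assumptions: the helper name `trim_between` and the toy token list are hypothetical, and the helper always cuts directly after the start marker and directly before the end marker, so the extra one- or two-token offsets used in this notebook would still need to be applied by hand.

```python
def trim_between(tokens, start_marker=None, end_marker=None):
    """Return the tokens after start_marker and before end_marker.

    Either marker may be omitted. A ValueError is raised if a given
    marker does not occur, which makes a changed page layout visible.
    """
    start = 0
    if start_marker is not None:
        start = tokens.index(start_marker) + 1
    end = len(tokens)
    if end_marker is not None:
        # Search for the end marker only after the start position
        end = tokens.index(end_marker, start)
    return tokens[start:end]


# Toy example, not the real page
toy = ["Navigation", "minutes", "Mr.", "Speaker", ".", "NOTE", "footer"]
speech = trim_between(toy, "minutes", "NOTE")
print(speech)
```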

In [29]:
# Add the second part of the debate: cut at the separator line first; the procedural note
# in the mid-section (added to the record for formalities) is removed in the next step
end = tokens.index("____________________")  # 343
tokens2 = tokens[:end]


In [30]:
for word in tokens2:
    print(word, end=' ')



In [31]:
start = tokens2.index("In")  # first token of the second part of the speech
tokens2 = tokens2[start:]

In [32]:
for word in tokens2:
    print(word, end=' ')

In what will become a national model for public-private collaboration to meet industry needs and further fuel the current momentum for technology and entrepreneurship in South Florida , the Knight Foundation has made a $ 10 million gift and FIU a 10-year commitment of $ 106 million to catalyze the development of the local tech ecosystem . Today , FIU is Miami-Dade 's top 50 public research university and is a top producer of minority graduates in STEM fields . This partnership will strengthen FIU 's standing , allowing for the doubling of computer science graduates , researchers and making FIU a hub for research in artificial intelligence , smart robotics , bioinformatics , biodevices , and digital forensics . I am proud of the work we did at the county level when I was the mayor of Miami-Dade County to make Miami and our South Florida communities a world-class destination for tech entrepreneurs . The resources and support for accelerators , incubators , our colleges and universities ,

In [33]:
# Combine both parts
tokens3 = tokens1 + tokens2
for word in tokens3:
    print(word, end=' ')

Mr. GIMENEZ . Mr. Speaker , I rise today to recognize a transformational partnership announced in my districtbetween Florida International University and the John S. and James L. Knight Foundation . In what will become a national model for public-private collaboration to meet industry needs and further fuel the current momentum for technology and entrepreneurship in South Florida , the Knight Foundation has made a $ 10 million gift and FIU a 10-year commitment of $ 106 million to catalyze the development of the local tech ecosystem . Today , FIU is Miami-Dade 's top 50 public research university and is a top producer of minority graduates in STEM fields . This partnership will strengthen FIU 's standing , allowing for the doubling of computer science graduates , researchers and making FIU a hub for research in artificial intelligence , smart robotics , bioinformatics , biodevices , and digital forensics . I am proud of the work we did at the county level when I was the mayor of Miami-D

In [34]:
tokens3

['Mr.',
 'GIMENEZ',
 '.',
 'Mr.',
 'Speaker',
 ',',
 'I',
 'rise',
 'today',
 'to',
 'recognize',
 'a',
 'transformational',
 'partnership',
 'announced',
 'in',
 'my',
 'districtbetween',
 'Florida',
 'International',
 'University',
 'and',
 'the',
 'John',
 'S.',
 'and',
 'James',
 'L.',
 'Knight',
 'Foundation',
 '.',
 'In',
 'what',
 'will',
 'become',
 'a',
 'national',
 'model',
 'for',
 'public-private',
 'collaboration',
 'to',
 'meet',
 'industry',
 'needs',
 'and',
 'further',
 'fuel',
 'the',
 'current',
 'momentum',
 'for',
 'technology',
 'and',
 'entrepreneurship',
 'in',
 'South',
 'Florida',
 ',',
 'the',
 'Knight',
 'Foundation',
 'has',
 'made',
 'a',
 '$',
 '10',
 'million',
 'gift',
 'and',
 'FIU',
 'a',
 '10-year',
 'commitment',
 'of',
 '$',
 '106',
 'million',
 'to',
 'catalyze',
 'the',
 'development',
 'of',
 'the',
 'local',
 'tech',
 'ecosystem',
 '.',
 'Today',
 ',',
 'FIU',
 'is',
 'Miami-Dade',
 "'s",
 'top',
 '50',
 'public',
 'research',
 'university',
 

---
### Normalize the words 

In [35]:
# Lowercase every token so that later counts are case-insensitive
ustext14 = [w.lower() for w in tokens3]
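
To see what the lowercasing does to individual tokens, the same list comprehension can be run on a few tokens copied from the output above. The second step, which drops tokens that contain no letters, is not part of this notebook; it is shown only as an optional assumption, and the variable names are hypothetical.

```python
# A few tokens copied from the speech
sample = ["Mr.", "GIMENEZ", ".", "FIU", "is", "Miami-Dade", "'s"]

# Same normalization as in the notebook: lowercase every token
lowered = [w.lower() for w in sample]

# Optional extra step (an assumption, not done in the notebook):
# keep only tokens that contain at least one letter
words_only = [w for w in lowered if any(ch.isalpha() for ch in w)]

print(lowered)
print(words_only)
```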

---
**Save Output**

In [36]:
save_path = '/Users/charlottekaiser/Documents/uni/Hertie/master_thesis/00_data/20_intermediate_files'
file_name = "US14_RECOGNIZING PARTNERSHIP BETWEEN FLORIDA INTERNATIONAL UNIVERSITY AND THE JOHN S. AND JAMES L. KNIGHT FOUNDATION.txt"
completeName = os.path.join(save_path, file_name)
# Write the token list to disk; the context manager closes the file afterwards
with open(completeName, 'w', encoding='utf-8') as output:
    print(ustext14, file=output)
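
Because `print` writes the list in its Python literal form, the saved file can later be turned back into a list with `ast.literal_eval`. The round trip below is a minimal sketch that uses a temporary directory and a hypothetical file name (`sample_tokens.txt`) instead of the real output path.

```python
import ast
import os
import tempfile

# Short stand-in for the normalized token list
sample_tokens = ["mr.", "speaker", ",", "i", "rise", "today", "."]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_tokens.txt")

    # Save the list the same way as above: as its printed representation
    with open(sample_path, "w", encoding="utf-8") as f:
        print(sample_tokens, file=f)

    # Read the text back and parse it as a Python literal
    with open(sample_path, encoding="utf-8") as f:
        restored = ast.literal_eval(f.read().strip())

print(restored == sample_tokens)
```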