# Department of Education Press Release Scraping

### Author: Doug Hummel-Price | Created: February 2020 | Updated: 06.09.20

This notebook scrapes U.S. Department of Education (ED) press releases. One set of cells collects the current Trump administration releases, and a second set collects the archived Obama administration releases.

In [3]:
## Import required libraries 
import requests
import bs4
import re
import pandas as pd
from time import sleep

## Create List of links to the individual press releases from the press release pages that contain 10 releases each.  

root = "https://www.ed.gov/news/press-releases?page="
pages = ["https://www.ed.gov/news/press-releases?"]

## Create a list of all 38 pages of Trump press releases
for item in range(1,38):
    item = str(item)
    item = root + item
    pages.append(item)

## Create an empty list to append links to Trump press releases    
trumplinks = []

## Iterates over the 38 pages, snagging the individual press release links
for page in pages:

    sleep(.4)                                        ## Pauses for .4 seconds in between requests to not overwhelm the server    
    server_response = requests.get(page)             ## Creates object with the code response 
    print(f'The server response for page {page} is: {server_response}\n') ## Displays the HTTP response code

    ## Parse using BeautifulSoup
    soup = bs4.BeautifulSoup(server_response.text, features="html.parser")
    
    ## Snags all the links on the page
    links = []
    for item in soup.find_all('a'):
        link = item.get('href')
        links.append(link)

    ## Creates a blank list to append the individual press release links to
    presslinks = []
    
    ## Iterates over the links, appending only those that have the proper stem for the individual release pages
    for item in links:
        ## Skips <a> tags without an href (item is None) rather than relying on a bare except
        if item is None:
            continue
        if item[:21] == '/news/press-releases/' and len(item) > 29:
            if item[21:29] != 'monthly/':
                presslinks.append(item)
    
    ## Takes the press links and appends the necessary stem to scrape the page 
    for link in presslinks:        
        newlink = "https://www.ed.gov" + link
        trumplinks.append(newlink)

len(trumplinks)

The server response for page https://www.ed.gov/news/press-releases? is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=1 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=2 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=3 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=4 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=5 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=6 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=7 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=8 is: <Response [200]>

The server response for page https://www.ed.gov/news/press-releases?page=9 is: <Response [200]>

The server response for page https:/

376
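The slice comparisons in the cell above can also be written with `str.startswith`. The sketch below is a minimal, network-free illustration of that filter. `is_release_link` and the sample hrefs are hypothetical names introduced here, and the length check mirrors the original `length > 29` condition.

```python
STEM = "/news/press-releases/"   # stem used by the cell above (21 characters)

def is_release_link(href):
    """Return True for hrefs that look like individual press release pages."""
    if href is None:                        # <a> tags without an href
        return False
    if not href.startswith(STEM):
        return False
    if len(href) <= 29:                     # same cutoff as `length > 29` above
        return False
    slug = href[len(STEM):]
    return not slug.startswith("monthly/")  # skip the monthly archive pages

# Hypothetical hrefs, for illustration only
sample = [
    None,
    "/news/press-releases/",
    "/news/press-releases/monthly/202005",
    "/news/press-releases/department-education-awards-grants-modernize-workforce-training",
    "/about/contact",
]
release_links = ["https://www.ed.gov" + h for h in sample if is_release_link(h)]
print(release_links)
```

Only the fourth sample href passes the filter, because it is the only one with the release stem, enough length, and no `monthly/` prefix.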

#### Note: Having started this NLP project prior to the pandemic, I originally aimed to compare releases under Trump to those under Obama. The section below pulls the Obama releases, similar to the above. 

In [None]:
## This section pulls the Obama Ed archives

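## Create List of links to the individual press releases from the press release pages that contain 10 releases each. 
## Note: the stems on the next two lines differ ("press-release-archive" vs. "press-releases-archive"); check which one the site actually serves
root = "https://www.ed.gov/news/press-release-archive?page="
pages = ["https://www.ed.gov/news/press-releases-archive?"]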

## Create a list of all 171 pages of archived Obama press releases
for item in range(1,171):
    item = str(item)
    item = root + item
    pages.append(item)

## Create an empty list to append links to Obama press releases  
obamalinks = []

## Iterates over the pages, snagging the individual press release links
for page in pages:

    server_response = requests.get(page)             ## Creates object with the code response 
    print(f'The server response is: {server_response}\n') ## Displays the HTTP response code 

    ## Parse using BeautifulSoup
    soup = bs4.BeautifulSoup(server_response.text, features="html.parser")

    links = []
    for item in soup.find_all('a'):
        link = item.get('href')
        links.append(link)


    presslinks = []

    for item in links:
        ## Skips <a> tags without an href (item is None) rather than relying on a bare except
        if item is None:
            continue
        if item[:21] == '/news/press-releases/' and len(item) > 29:
            if item[21:29] != 'monthly/':
                presslinks.append(item)

    for link in presslinks:        
        newlink = "https://www.ed.gov" + link
        obamalinks.append(newlink)

len(obamalinks)
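The Trump and Obama link-collection cells differ only in the page list and the target list, so the shared steps could live in one function. What follows is a rough sketch under stated assumptions, not the notebook's own method. `HrefParser`, `collect_release_links`, and the `fetch` parameter are hypothetical names. The standard-library `html.parser` stands in for BeautifulSoup so the block runs without extra installs. `fetch` is any callable that returns a page's HTML, for example `lambda url: requests.get(url).text`.

```python
from html.parser import HTMLParser
from time import sleep

class HrefParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def collect_release_links(pages, fetch, delay=0.4,
                          stem="/news/press-releases/",
                          site="https://www.ed.gov"):
    """Fetch each listing page and return absolute links to individual releases."""
    found = []
    for page in pages:
        sleep(delay)                        # pause between requests
        parser = HrefParser()
        parser.feed(fetch(page))
        for href in parser.hrefs:
            slug = href[len(stem):]
            # Same three conditions as the cells above: stem, length, not monthly
            if href.startswith(stem) and len(href) > 29 and not slug.startswith("monthly/"):
                found.append(site + href)
    return found

# Offline demonstration with a made-up listing page
fake_html = (
    '<a href="/news/press-releases/monthly/202001">January</a>'
    '<a href="/news/press-releases/example-release-title">Example</a>'
    '<a>no href</a>'
)
demo_links = collect_release_links(["listing-page"], fetch=lambda url: fake_html, delay=0)
print(demo_links)
```

Passing `fetch` in as an argument keeps the filtering logic testable without touching the network.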

In [None]:
## This cell contains a list of a few sample press releases I used to test and verify my code. 
webpages= ['https://www.ed.gov/news/press-releases/schoolsafetygov-launches-help-educators-administrators-parents-and-law-enforcement-prepare-threats', 
"https://www.ed.gov/news/press-releases/president-trump-proposes-transformative-student-first-budget-return-power-states-limit-federal-control-education",
"https://www.ed.gov/news/press-releases/department-education-awards-grants-modernize-workforce-training",
"https://www.ed.gov/news/press-releases/secretary-devos-approves-new-methodology-providing-student-loan-relief-borrower-defense-applicants",
"https://www.ed.gov/news/press-releases/secretary-devos-joins-parents-students-members-congress-celebrate-15th-anniversary-dcosp",
"https://www.ed.gov/news/press-releases/us-secretary-education-arne-duncan-issues-statement-2008-national-assessment-educational-progress-trend-report"]


In [6]:
## This cell iterates over the list of individual Trump releases, and snags the text
data = []
count = 0

for webpage in trumplinks:
    
    ## Creates base page 
    sleep(.6)
    server_response = requests.get(webpage)             ## Creates object with the code response 
    count += 1
    print(f'Checking Trump release number: {count}\n')
    print(f'The server response is: {server_response}\n') ## Displays the HTTP response code 

    ## Parse using BeautifulSoup
    soup = bs4.BeautifulSoup(server_response.text, features="html.parser")

    ## Snag the title
    title = soup.title.string[:-31]

    subtest = '<div class="field field-name-field-subtitle field-type-text field-label-hidden">'
    datetest = '<div class="field field-name-field-release-date field-type-datetime field-label-hidden">'

    subtitle = []
    date = []

    for item in soup.find_all('div'):
        item = str(item)
        if item[:80] == subtest:
            item = item[134:]
            item = item[:-18]
            subtitle = item          ## Snags the subtitle 
            subtitle = subtitle.replace('<br/><br/>',". ")
        elif item[:88] == datetest:
            date = item[185:195]      ## Snags the date

    def remove_tags(text):
        RE = re.compile(r'<[^>]+>')
        return RE.sub('', text)
    
    image = 0
    
    text = str(soup.find_all(property="content:encoded"))
    
    ## Creates a dichotomous variable showing whether the release includes an image
    if "img " in text:
        image = 1
    text = text[60:]
    text = remove_tags(text)        
    
    metadata = ("Trump",date, image, title, subtitle, text)
    data.append(metadata) 

Checking Trump release number: 1

The server response is: <Response [200]>

Checking Trump release number: 2

The server response is: <Response [200]>

Checking Trump release number: 3

The server response is: <Response [200]>

Checking Trump release number: 4

The server response is: <Response [200]>

Checking Trump release number: 5

The server response is: <Response [200]>

Checking Trump release number: 6

The server response is: <Response [200]>

Checking Trump release number: 7

The server response is: <Response [200]>

Checking Trump release number: 8

The server response is: <Response [200]>

Checking Trump release number: 9

The server response is: <Response [200]>

Checking Trump release number: 10

The server response is: <Response [200]>

Checking Trump release number: 11

The server response is: <Response [200]>

Checking Trump release number: 12

The server response is: <Response [200]>

Checking Trump release number: 13

The server response is: <Response [200]>

Checking

Checking Trump release number: 108

The server response is: <Response [200]>

Checking Trump release number: 109

The server response is: <Response [200]>

Checking Trump release number: 110

The server response is: <Response [200]>

Checking Trump release number: 111

The server response is: <Response [200]>

Checking Trump release number: 112

The server response is: <Response [200]>

Checking Trump release number: 113

The server response is: <Response [200]>

Checking Trump release number: 114

The server response is: <Response [200]>

Checking Trump release number: 115

The server response is: <Response [200]>

Checking Trump release number: 116

The server response is: <Response [200]>

Checking Trump release number: 117

The server response is: <Response [200]>

Checking Trump release number: 118

The server response is: <Response [200]>

Checking Trump release number: 119

The server response is: <Response [200]>

Checking Trump release number: 120

The server response is: <Res

Checking Trump release number: 214

The server response is: <Response [200]>

Checking Trump release number: 215

The server response is: <Response [200]>

Checking Trump release number: 216

The server response is: <Response [200]>

Checking Trump release number: 217

The server response is: <Response [200]>

Checking Trump release number: 218

The server response is: <Response [200]>

Checking Trump release number: 219

The server response is: <Response [200]>

Checking Trump release number: 220

The server response is: <Response [200]>

Checking Trump release number: 221

The server response is: <Response [200]>

Checking Trump release number: 222

The server response is: <Response [200]>

Checking Trump release number: 223

The server response is: <Response [200]>

Checking Trump release number: 224

The server response is: <Response [200]>

Checking Trump release number: 225

The server response is: <Response [200]>

Checking Trump release number: 226

The server response is: <Res

Checking Trump release number: 320

The server response is: <Response [200]>

Checking Trump release number: 321

The server response is: <Response [200]>

Checking Trump release number: 322

The server response is: <Response [200]>

Checking Trump release number: 323

The server response is: <Response [200]>

Checking Trump release number: 324

The server response is: <Response [200]>

Checking Trump release number: 325

The server response is: <Response [200]>

Checking Trump release number: 326

The server response is: <Response [200]>

Checking Trump release number: 327

The server response is: <Response [200]>

Checking Trump release number: 328

The server response is: <Response [200]>

Checking Trump release number: 329

The server response is: <Response [200]>

Checking Trump release number: 330

The server response is: <Response [200]>

Checking Trump release number: 331

The server response is: <Response [200]>

Checking Trump release number: 332

The server response is: <Res
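The `remove_tags` helper in the cell above drops markup with a regular expression but leaves HTML entities such as `&amp;` in place. Below is a small sketch of one possible extension, using only the standard library. `clean_html` and the sample string are hypothetical and not part of the original pipeline.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')   # same pattern as remove_tags

def clean_html(fragment):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    no_tags = TAG_RE.sub('', fragment)
    decoded = html.unescape(no_tags)           # e.g. "&amp;" becomes "&"
    return re.sub(r'\s+', ' ', decoded).strip()

# Made-up fragment, for illustration only
sample = '<p>Students &amp; families\n<strong>first</strong></p>'
cleaned = clean_html(sample)
print(cleaned)
```

Decoding entities at this stage could reduce how much later cleanup has to trim leftovers such as `amp;`.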

In [None]:
## This cell does the same as the above, but for the Obama press releases instead of Trump releases

count = 0

for webpage in obamalinks:
    ## Creates base page 
    server_response = requests.get(webpage)             ## Creates object with the code response
    count += 1
    print(f'Checking Obama release number: {count}\n')
    print(f'The server response is: {server_response}\n') ## Displays the HTTP response code

    ## Parse using BeautifulSoup
    soup = bs4.BeautifulSoup(server_response.text, features="html.parser")

    ## Snag the title
    title = soup.title.string[:-31]

    ## Snag any subtitle

    subtest = '<div class="field field-name-field-subtitle field-type-text field-label-hidden">'
    datetest = '<div class="field field-name-field-release-date field-type-datetime field-label-hidden">'

    subtitle = []
    date = []

    for item in soup.find_all('div'):
        item = str(item)
        if item[:80] == subtest:
            item = item[134:]
            item = item[:-18]
            subtitle = item
            subtitle = subtitle.replace('<br/><br/>',". ")
        elif item[:88] == datetest:
            date = item[185:195]      

    def remove_tags(text):
        RE = re.compile(r'<[^>]+>')
        return RE.sub('', text)
    
    image = 0
    
    text = str(soup.find_all(property="content:encoded"))
    
    if "img " in text:
        image = 1
    text = text[60:]
    text = remove_tags(text)        

    metadata = ("Obama",date, image, title, subtitle, text)
    data.append(metadata) 

In [None]:
## Turns the list of tuples into a dataframe
datadf = pd.DataFrame(data, columns=["President","Date","Has_Image","Title","Subtitle","Text"])

datadf

In [None]:
## Count how many images are in all releases
datadf.Has_Image.sum()

In [None]:
## Saves the Raw HTML corpus
datadf.to_excel("RawHTMLCorpus.xlsx")

### The cells below test and apply various ways to clean the HTML text and turn it into text that can then be preprocessed for NLP

In [None]:
test9 =">25d"

tostrip = ['>','class="note">','\n','\\n',"ockquote>","\'v class=\"alert alert-info\" style=\"width: 66%;\""]
for item in tostrip:
    test9 = test9.strip(item)
    print(item)
    print(test9)
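One caveat when reading the test above: `str.strip` interprets its argument as a set of characters to trim from both ends, not as a substring. The sketch below contrasts that with explicit prefix removal. `remove_prefix` is a hypothetical helper, written out by hand so the example does not depend on Python 3.9's `str.removeprefix`, and the sample string is made up.

```python
def remove_prefix(text, prefix):
    """Remove prefix once if text starts with it; otherwise return text unchanged."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text

raw = 'class="note">classes resume'

# strip() trims any of the characters c, l, a, s, =, ", n, o, t, e, > from both ends,
# so it keeps eating into the real text until it hits a character outside that set
stripped = raw.strip('class="note">')

# remove_prefix() removes only the exact leading substring
prefixed = remove_prefix(raw, 'class="note">')

print(repr(stripped))
print(repr(prefixed))
```

Here `strip` also consumes the word "classes" and the final "e" of "resume", while the prefix version leaves the sentence intact.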

In [None]:
datadf1 = datadf

In [None]:
datadf1 = pd.read_excel("CompleteCorpus.xlsx")

In [None]:
from datetime import date
import numpy as np
import datetime

#### The datetime cells below take the date and create a variable for the number of days since the respective president's (first) inauguration 

In [None]:
datadf1["Date"] = [datetime.datetime.strptime(item,'%Y-%m-%d') for item in datadf1.Date]

In [None]:
oaug = datetime.datetime.strptime("2009-01-20",'%Y-%m-%d')
taug = datetime.datetime.strptime("2017-01-20",'%Y-%m-%d')

diff = taug - oaug

In [None]:
datadf1["Date"] = [str(item)[0:11] for item in datadf1.Date]
datadf1

In [None]:
datadf1["Date"] = pd.to_datetime(datadf1["Date"])

In [None]:
dayseries = []

for Date,Pres in zip(datadf1.Date,datadf1.President):
    if Pres == "Trump":
        dayssince = Date - taug
    else:
        dayssince = Date - oaug
    dayseries.append(int(dayssince.days))    
    

datadf1["Days_since_inaug"] = dayseries
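The loop above reduces to one date subtraction per row. Here is a minimal standalone sketch using only `datetime`. `INAUGURATIONS` and `days_since_inauguration` are hypothetical names, and the two dates are the ones assigned to `oaug` and `taug` above.

```python
from datetime import date

# First-inauguration dates, matching oaug and taug above
INAUGURATIONS = {
    "Obama": date(2009, 1, 20),
    "Trump": date(2017, 1, 20),
}

def days_since_inauguration(release_date, president):
    """Return whole days between a 'YYYY-MM-DD' release date and the inauguration."""
    released = date.fromisoformat(release_date)
    return (released - INAUGURATIONS[president]).days

trump_example = days_since_inauguration("2017-01-21", "Trump")
obama_example = days_since_inauguration("2009-02-19", "Obama")
print(trump_example, obama_example)
```

A dictionary lookup raises a `KeyError` for an unexpected president label, whereas the `if`/`else` above silently treats every non-Trump row as Obama.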

#### The cells below create binary variables indicating whether a given release was before or after each of the five milestones I chose for my COVID analysis. 

In [None]:
models = ["2020-01-05", "2020-02-02", "2020-02-28", "2020-03-16","2020-04-07"]
modelnames = ["Model_A","Model_B","Model_C","Model_D","Model_E"]

for dat,modnam in zip(models,modelnames):
    moddate = datetime.datetime.strptime(dat,'%Y-%m-%d')
    diff = moddate - taug
    print(f"The dividing line for {modnam} is all days greater than {diff.days}")

In [None]:
datadf1["Post_Model_A"] = [1 if x > 1080 else 0 for x in datadf1.Days_since_inaug]
datadf1["Post_Model_B"] = [1 if x > 1108 else 0 for x in datadf1.Days_since_inaug]
datadf1["Post_Model_C"] = [1 if x > 1134 else 0 for x in datadf1.Days_since_inaug]
datadf1["Post_Model_D"] = [1 if x > 1151 else 0 for x in datadf1.Days_since_inaug]
datadf1["Post_Model_E"] = [1 if x > 1173 else 0 for x in datadf1.Days_since_inaug]
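The cutoffs 1080 through 1173 in the cell above are the day counts printed by the preceding cell. As an illustrative alternative, they can be derived from the milestone dates directly so the two cells cannot drift apart. `MILESTONES`, `thresholds`, and `post_flags` are hypothetical names.

```python
from datetime import date

TRUMP_INAUGURATION = date(2017, 1, 20)

# Milestone dates copied from the `models` list above
MILESTONES = {
    "Model_A": date(2020, 1, 5),
    "Model_B": date(2020, 2, 2),
    "Model_C": date(2020, 2, 28),
    "Model_D": date(2020, 3, 16),
    "Model_E": date(2020, 4, 7),
}

# Days from inauguration to each milestone
thresholds = {name: (d - TRUMP_INAUGURATION).days for name, d in MILESTONES.items()}

def post_flags(days_since_inaug):
    """Return {'Post_Model_A': 0 or 1, ...} for one release."""
    return {f"Post_{name}": int(days_since_inaug > cutoff)
            for name, cutoff in thresholds.items()}

print(thresholds)
print(post_flags(1140))
```

With pandas, the same thresholds could feed a column assignment such as `datadf1["Post_Model_A"] = (datadf1.Days_since_inaug > thresholds["Model_A"]).astype(int)`.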

In [None]:
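## Reorder columns (keeps the Post_Model flags, which a later cell selects again)
datadf1 = datadf1[["President","Date","Days_since_inaug","Post_Model_A","Post_Model_B","Post_Model_C",
       "Post_Model_D","Post_Model_E","Has_Image","Title","Subtitle","Text"]]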
datadf1

In [None]:
datadf1["Subtitle"] = [str(x) for x in datadf1.Subtitle]

In [None]:
textstrip = ['class="note">',']','>',"<",]
replacelist = [("$"," $")]

In [None]:
datadf1["Text"] = [text.replace("$"," $") for text in datadf1.Text]

In [None]:
striplist= ["<p>","<em>","</em>","</p>","<strong>","</strong>","<h3>","</h3>",
              "<br/>","<br>","</br>","<cite>","</cite>","amp;","[]","[","]"]

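## Note: str.strip(char) treats char as a set of characters to trim from both ends, not as a substring
for char in striplist:
    datadf1["Subtitle"] = [text.strip(char) for text in datadf1.Subtitle]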

In [None]:
textstrip = ['class="note">',']','>',"<",]
for char in textstrip:
    datadf1["Text"] = [text.strip(char) for text in datadf1.Text]

In [None]:
datadf1["All_Text"] = datadf1.Title+" "+datadf1.Subtitle+" "+datadf1.Text
datadf1

In [None]:
datadf1.columns

In [None]:
datadf1 = datadf1[["Date","Days_since_inaug",'Post_Model_A', 'Post_Model_B', 'Post_Model_C',
       'Post_Model_D', 'Post_Model_E',"Has_Image","All_Text"]]


In [None]:
datadf1.to_excel("CompleteCorpus_NOT_Preprocessed.xlsx")