# Corpus Building Journal

This is a journal for the DIGI405 corpus building project. Here, I will aim to build a web scraping bot to develop the three corpora for this project. To start, I will experiment with developing a web scraper using Scrapy: a Python library for crawling websites and extracting data. 

To start, I need to import the scrapy library.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

To get to know how to use this library, I am importing the basic example from the Scrapy documentation: https://docs.scrapy.org/en/latest/intro/overview.html


In [15]:
# Example code from Scrapy
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

In [17]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()

2021-08-04 14:44:58 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-08-04 14:44:58 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.19041-SP0
2021-08-04 14:44:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-04 14:44:58 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2021-08-04 14:44:58 [scrapy.extensions.telnet] INFO: Telnet Password: 5d514fa39001bde0
2021-08-04 14:44:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-08-04 14:44:58 [scrapy.middleware] INFO: Enabled downloader middlewares

ReactorNotRestartable: 

Based on the output text above, the scraper has done the work that needed to be performed. Because we are running in a Jupyter notebook, we needed to import "CrawlerProcess", then execute process.crawl(QuotesSpider) and process.start(). This defines the process as a crawler process, assigns the class to be crawled as the one we defined ("QuotesSpider"), and then starts the crawler process. 

This has crawled every page of the http://quotes.toscrape.com/tag/humor/ website (only 2 pages) and parsed the quotes using the defined parse method. 

For this project, the first step is to set up a new scraper class that will extract information from the websites of interest to form the three corpora. This will include defining a parse method and setting up a process to crawl pages (assigning the next_page value). 

# COVID-19 and Vaccine Information

For this project, I am interested in how information about vaccines is represented in news media. I would like to compare how vaccines are discussed by outlets across the political spectrum. To determine which news sites to use, I will use the Ad Fontes Media Bias Chart (https://www.adfontesmedia.com/interactive-media-bias-chart/). I would like to look at three regions of the chart: hyper left and less reliable, hyper right and less reliable, and center and more reliable. 

For the Hyper Left and less reliable I plan to look at:

1.) Truth Out: https://truthout.org/

For the Hyper Right and less reliable I will look at:

1.) The Federalist: https://thefederalist.com/


For the center, reliable media I will use one outlet:

1.) The Independent Journal Review: https://ijr.com

From these I would like to get a broad spectrum of information and opinions to compare with one another. I have chosen these sites because they do not have paywalls, they lie in the areas of the media bias chart I would like to investigate, and they have easy-to-use search functions. For each, I will search for the phrase "Vaccine COVID". This brings me to a page of links to all of the articles that include the term Vaccine. 

Note: If there are not enough articles on each site, I will supplement my corpus from news media outlets on similar areas of the spectrum. 

Now, I need to define three different scraping bots, one for each website. Each bot should go down the list of news articles, enter each article, scrape the body text from the article and save it to a text file. The format for the file name will be: ArticleTitle_AuthorName_Date.txt. 

Note: If an article is dated before 2020, it should not be saved, as it is likely about vaccine news from before COVID. Given that the articles are usually listed by publication date, the crawler will stop when it reaches an article dated 2019 or earlier, or when it reaches the end of the search results.
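
The date rule in the note above could be checked with a small helper. This is only a minimal sketch: it assumes the article date is available as an ISO-style `YYYY-MM-DD` string, and the names `is_covid_era` and `CUTOFF_YEAR` are hypothetical, not part of any library.

```python
from datetime import date

# Hypothetical cutoff: articles published before this year are not saved.
CUTOFF_YEAR = 2020

def is_covid_era(date_string):
    """Return True if an ISO date string (YYYY-MM-DD) falls in 2020 or later."""
    published = date.fromisoformat(date_string)
    return published.year >= CUTOFF_YEAR
```

A crawler could call this on each article's date and stop as soon as it returns False, provided the search results really are sorted newest first.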


# Initial Testing 
To start, I want to get a scraper running that will simply extract the text data from a single webpage and save it to a text file. Then I will worry about crawling multiple pages. 

In [1]:
# Code for Independent Journal Review Scraper.
import scrapy
import scrapy.crawler as crawler
from scrapy.crawler import CrawlerProcess
import html2text
        
class ijr_scraper(scrapy.Spider):
    name = 'ijr_spider'
    start_urls = [
        'https://ijr.com/cuomo-says-we-have-to-knock-on-doors/',
    ]

    def parse(self, response):
        """Parses a single webpage on the Independent Journal Review. Attempts to get rid of 
        hyperlinks and extraneous text."""
        
        html_parser = html2text.HTML2Text()
        html_parser.ignore_links = True
        html_parser.BODY_WIDTH = 0
        # Get the author's name, title of article, date, etc.
        title = response.xpath('//h1/text()').get()
        print(title)
        author = response.xpath('/html/body/div/main/article/header/span/a/span/text()').get()
        print(author)
        date = response.xpath('/html/body/div/main/article/header/span/span/text()').get()
        print(date)
        
        # Now we take the main body of the text. 
        text_list = []
        section = response.xpath("//section")
        for p in section.xpath('.//p'):
            text = html_parser.handle(p.get())
            text = text.replace('\n\n','')
            text = text.replace('\n',' ')
            # Need to remove any lines of text accidentally picked up from Twitter feeds and text printed at
            # the end of every article.
            if '@' in text or '.twitter.' in text:
                print("Found Garbage")
            elif 'This article appeared originally' in text or 'We are committed to truth and accuracy' in text:
                print("End Early")
                break
            else:
                text_list.append(text)
        print(text_list)

In [2]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ijr_scraper)
process.start()

2021-08-04 16:09:57 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-08-04 16:09:57 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.19041-SP0
2021-08-04 16:09:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-04 16:09:57 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2021-08-04 16:09:57 [scrapy.extensions.telnet] INFO: Telnet Password: 1096b29e3e0540cf
2021-08-04 16:09:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-08-04 16:09:58 [scrapy.middleware] INFO: Enabled downloader middlewares

Cuomo Says 'We Have To' Knock on Doors, Put People in Cars and Drive Them To Get COVID Vaccine
Isa Cox, The Western Journal

			July 27, 2021 at 1:49am			
Found Garbage
Found Garbage
End Early
['New York Gov. Andrew Cuomo seems intent on dispelling any myths surrounding the government’s efforts to go door-to-door to make sure everyone gets vaccinated.', 'No, he’s made things perfectly clear that it is his “mission” to make sure that the government is sending people door-to-door to talk them into getting the jab.', 'Just so we’re clear that this was the plan all along.', 'Cuomo announced a new campaign to cleanse the unwashed, unvaccinated masses of the Empire State during a media briefing at Yankee Stadium in New York City on Monday, and it sounds a bit, uh, aggressive.', 'During his remarks, in which he announced the allocation of $15 million in funds to promote vaccination among the 3.5 million New Yorkers who have yet to get the shot, the governor rather bluntly declared the idea is

OK, now I can scrape the desired text off of a page. Dumping this to a text file should be quite easy, so I will add that later. For now, I will summarize my findings.

1.) The HTML parser adds \n characters in seemingly random places. This can be dealt with, but it would be nice to handle it in a cleaner way.

2.) Every time a spider is run, the kernel must be restarted before running another one. The Twisted reactor that Scrapy runs on cannot be started twice in the same process, which is what the ReactorNotRestartable error refers to.

3.) This source was helpful to get the spider running in notebook: https://www.jitsejan.com/using-scrapy-in-jupyter-notebook
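
For the first finding, one cleaner option might be to collapse all whitespace in a single step instead of chaining `replace` calls. A minimal sketch, where `normalize_whitespace` is a hypothetical helper name:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())

cleaned = normalize_whitespace("New York\nGov.\n\nAndrew  Cuomo \n")
print(cleaned)
```

Unlike replacing '\n\n' with an empty string, this never glues two words together, because every run of whitespace becomes exactly one space.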

Now, I need to determine how to build a workflow for scraping the site from a starting position. I will also need to throttle my scraper so I do not get banned from the website. 
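
As a rough idea of what throttling could look like in Scrapy, the settings dictionary passed to the crawler process can carry delay options. The keys below are standard Scrapy setting names, but the values and the variable name `polite_settings` are my own assumptions rather than tested choices for these sites.

```python
# Hypothetical settings dictionary; the values are guesses, not tuned for any site.
polite_settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'DOWNLOAD_DELAY': 2,                  # seconds to wait between requests to the same site
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # only one request in flight per domain
    'AUTOTHROTTLE_ENABLED': True,         # let Scrapy adapt the delay to server response times
    'ROBOTSTXT_OBEY': True,               # respect the site's robots.txt rules
}

# The dictionary would replace the one used earlier, for example:
# process = CrawlerProcess(polite_settings)
```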

# Different Direction (Moving to Beautiful Soup and Requests)

While I had originally planned to use Scrapy, I have run into a number of issues with it and, frankly, working with Beautiful Soup and Requests seems a lot easier. So, I will pivot to these instead. 

The plan for the algorithm is as follows:

1.) Start a request on an initial page

2.) Make soup from page.

3.) Extract all of the links from the page and add them to a list of links.

4.) If the next page is available, go to it and continue to gather links to articles. Else...

5.) Start iterating through the list of links.

6.) Extract the text, title, date and author from the page.

7.) Save each visited page as a text file with the format: (ArticleSource_Date_AuthorName.txt)

To start, I need to import all of the libraries mentioned in the lab for week 3.

# Building The Independent Journal Review Corpus

In [1]:
# Import requests for requesting a page from a website.
import requests

# Import Beautiful Soup for HTML parsing
from bs4 import BeautifulSoup

# Import time for delays and to not get banned from sites.
import time

# Import os for file IO
import os

# Import zipfile for creation of final, Zipped Doc.
import zipfile



In [34]:
# Now, the initial request needs to be made. I will start with the IJR.
url = 'https://ijr.com/page/2/?s=covid+vaccine'

#Get the request from the website
response = requests.get(url)

#Make the initial soup
soup = BeautifulSoup(response.text, "html.parser") 

<Response [200]>


In [18]:
# Now to get the links (test code)

article_body = soup.find(class_='ff-main-content')
titles = article_body.find_all(class_='article-title')
for title in titles:
    link = title.find('a')
    new_url = link['href'].strip() #Need to strip away the white space around the links. 
    if 'vaccine' in new_url: # Ensure only the articles with "vaccine" in the title are used to narrow the scope. 
        print(new_url)

https://ijr.com/trump-terrible-disservice-pause-johnson-johnson-vaccine/
https://ijr.com/buttigieg-hesitant-evangelicals-vaccine-gods-plan/
https://ijr.com/rich-mexicans-jet-us-skip-lines-covid-vaccines/
https://ijr.com/one-shot-covid-vaccine-approved-for-us-use-doesnt-need-complicated-storage/
https://ijr.com/2-women-busted-trying-skip-covid-vaccine-line-dressed-grannies/
https://ijr.com/biden-slams-trump-for-vaccine-rollout/
https://ijr.com/trump-speaks-bidens-false-covid-vaccine-not-telling-truth-mentally-gone/
https://ijr.com/congresswoman-biden-involve-more-distributors-covid-vaccine/
https://ijr.com/fauci-why-people-precautions-after-covid-vaccine/
https://ijr.com/press-secretary-every-american-not-eligible-get-covid-vaccine-spring/
https://ijr.com/chief-staff-plan-distribute-covid-vaccine-into-community-not-really-exist/
https://ijr.com/christmas-comfort-covid-vaccines-collides/
https://ijr.com/who-strong-commitment-pfizer-affordable-covid-vaccine/
https://ijr.com/redfield-cdc-p

Clearly, I can now scrape all of the article links off of one page. I now need code to get to the next page. The link to it is found in the HTML under the class "ff-pagination".

In [19]:
next_page = soup.find(class_='ff-pagination')
paginations = next_page.find_all('a')
print(paginations)
if len(paginations) == 1:
    next_url = paginations.pop()
    print(next_url['href'])  # next_url is a single tag, so read its href directly
else: 
    next_url = paginations.pop()
    print(next_url['href'])
    print(next_url.get_text())

[<a href="https://ijr.com/?s=covid+vaccine">« Previous Page</a>, <a href="https://ijr.com/page/3/?s=covid+vaccine">Next Page »</a>]
https://ijr.com/page/3/?s=covid+vaccine
Next Page »


Now that I can get the address of the next page, I can write a recursive function to get all of the links to all of the articles I wish to scrape.

In [2]:
def get_ijr_urls(seed_url, urls):
    """Function to get a list of URLs from the Independent Journal Review. Is called recursively to
    visit all pages using the next page button and will stop when all of the pages have been searched.
    Only adds articles with 'vaccine' in the title to narrow scope of links."""
    
    # Wait for a few seconds as to not bombard the site with requests.
    sleep_seconds = 2
    time.sleep(sleep_seconds)
    
    # Get the response from the seed URL
    response = requests.get(seed_url)
    
    #Create the beautiful soup
    soup = BeautifulSoup(response.text, "html.parser") 
    
    #Get the main body of the page
    article_body = soup.find(class_='ff-main-content')
    
    #Get all of the article titles in the main body.
    titles = article_body.find_all(class_='article-title')
    
    #Extract the link in each title
    for title in titles:
        link = title.find('a')
        new_url = link['href'].strip() #Need to strip away the white space around the links. 
        if 'vaccine' in new_url and new_url not in urls: # Ensure only the articles with "vaccine" in the title are used to narrow the scope. 
            urls.append(new_url)
    
    #Get the next page
    next_page = soup.find(class_='ff-pagination')
    paginations = next_page.find_all('a')
    next_url = paginations.pop()   
    search_url = next_url['href']
    if next_url.get_text() != '« Previous Page':
        print(search_url)
        get_ijr_urls(search_url, urls)
    else:
        print('Done!')
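
The recursive function makes one nested call per results page. The same idea can be written as a loop, which avoids deep recursion on long result lists. The sketch below is an assumption-laden illustration, not the code used for the corpus: `collect_urls` and `fetch_page` are hypothetical names, and `fetch_page` stands in for the requests and Beautiful Soup steps by returning a list of article links plus the next page URL (or None on the last page).

```python
def collect_urls(seed_url, fetch_page, keyword='vaccine'):
    """Follow 'next page' links in a loop, collecting article URLs that contain keyword.

    fetch_page is a hypothetical callable that takes a page URL and returns
    (list_of_article_links, next_page_url_or_None).
    """
    urls = []
    visited = set()
    page_url = seed_url
    # Stop when there is no next page or when a page would be visited twice.
    while page_url is not None and page_url not in visited:
        visited.add(page_url)
        links, next_url = fetch_page(page_url)
        for link in links:
            link = link.strip()
            if keyword in link and link not in urls:
                urls.append(link)
        page_url = next_url
    return urls

# Usage with an in-memory stand-in for the site (made-up example pages).
fake_site = {
    'page1': ([' https://example.com/covid-vaccine-a/ ', 'https://example.com/other/'], 'page2'),
    'page2': (['https://example.com/vaccine-b/', 'https://example.com/covid-vaccine-a/'], None),
}
found = collect_urls('page1', lambda url: fake_site[url])
print(found)
```

The `visited` set is a small safeguard: if a site's last page ever linked back to an earlier page, the loop would stop instead of running forever.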

In [3]:
# Code to start the function.
seed_url = 'https://ijr.com/?s=covid+vaccine'
urls = []
get_ijr_urls(seed_url, urls)

https://ijr.com/page/2/?s=covid+vaccine
https://ijr.com/page/3/?s=covid+vaccine
https://ijr.com/page/4/?s=covid+vaccine
https://ijr.com/page/5/?s=covid+vaccine
https://ijr.com/page/6/?s=covid+vaccine


KeyboardInterrupt: 

Check to see how many articles I have found: when I was not checking for duplicates there were 326. After adding the check there were still 326, so the site does not duplicate articles. 

In [23]:
print(len(urls))

241

In [24]:
print(urls)

['https://ijr.com/greene-opposition-covid-vaccine-mandates-cant-live-forever/', 'https://ijr.com/doocy-psaki-bidens-past-skepticism-covid-vaccine-developed-trump/', 'https://ijr.com/greene-1-week-suspension-twitter-claim-covid-vaccines/', 'https://ijr.com/biden-mandatory-covid-vaccine-service-members/', 'https://ijr.com/mcconnell-americans-ignore-giving-demonstrably-bad-advice-covid-vaccines/', 'https://ijr.com/fda-warn-covid-vaccine-link-rare-nerve-disorder-paralysis/', 'https://ijr.com/adams-questions-covid-vaccine-incentives/', 'https://ijr.com/parents-call-honor-8-month-old-receive-covid-vaccine/', 'https://ijr.com/biden-us-send-20-million-vaccine-doses-other-countries/', 'https://ijr.com/trump-asks-for-mention-please-covid-vaccine-rollout/', 'https://ijr.com/biden-pressed-whether-order-military-get-covid-vaccine/', 'https://ijr.com/trump-accuses-biden-admin-refusing-give-credit-vaccine/', 'https://ijr.com/rogan-claims-healthy-people-should-not-covid-vaccine/', 'https://ijr.com/ron
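
The duplicate check mentioned earlier can also be done directly on the finished list, without re-running the crawl. A minimal sketch with hypothetical helper names:

```python
def count_duplicates(items):
    """Return how many entries in the list repeat an earlier entry."""
    return len(items) - len(set(items))

def drop_duplicates(items):
    """Remove repeats while keeping the first occurrence of each item in order."""
    return list(dict.fromkeys(items))

sample = ['a', 'b', 'a', 'c', 'b']
print(count_duplicates(sample))
print(drop_duplicates(sample))
```

Calling `count_duplicates(urls)` on the real list would return 0 if the site never lists the same article twice.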

Now I have a list of articles that can be iterated through to build the first corpus. The code that follows will be used to test the building of text files for the first corpus. For this I will look at the first page in the list. 

In [157]:
# Get the response from the first page of data
response = requests.get(urls[0])
    
#Create the beautiful soup
soup = BeautifulSoup(response.text, "html.parser") 

#Get the title, author and date
#main = soup.find(class_='single-post')
title = soup.title.get_text()
author = soup.select('span.article-authors:nth-child(2) > a:nth-child(1) > span:nth-child(1)')[0].get_text()
date = soup.find("meta", {"property":"article:published_time"})['content'].split('T')[0]
print(title)
print(author)
print(date)

#Get all of the data from the article
main = soup.find(class_='ff-main-content')
article = main.find('article')
paragraphs = article.find_all('p')

for paragraph in paragraphs:
    text = paragraph.get_text()
    if '@' in text or '.twitter.' in text:
        print("Found Garbage")
    elif 'This article appeared originally' in text or 'We are committed to truth and accuracy' in text:
        print("End Early")
        break
    else:
        print(paragraph.get_text())

<class 'str'>
Peter-Doocy-Grills-Psaki-on-Biden's-Past-Skepticism-of-COVID-Vaccine-Developed-Under-Trump
Peter Doocy Grills Psaki on Biden's Past Skepticism of COVID Vaccine Developed Under Trump
Bradley Cortright
2021-08-11
White House Press Secretary Jen Psaki says the data has not shown that President Joe Biden created vaccine hesitancy by raising questions about its development while he was a candidate. 
During a White House press briefing on Wednesday, Psaki was asked by Fox News’ Peter Doocy, “As the president tries to reach unvaccinated Americans, has there been any thought given, looking back, to the possibility that he may have created some vaccine hesitancy when last year around this time the previous administration was rushing to get a vaccine authorized?”
He noted that then-candidate Biden said, “I trust vaccines, I trust scientists, but I don’t trust Donald Trump, and at this moment, the American people can’t either.”
“I think it’s safe to say he still doesn’t trust Donald

Ok, the above looks good. Now a parser can be made to start saving individual files!

In [124]:
def save_text_data(title, author, date, texts, corpus_path):
    """Write one article to a text file in the corpus directory."""
    
    # Make the directory if it does not exist.
    os.makedirs(corpus_path, exist_ok=True)
    
    # Make the file name: Date_Title_Author.txt
    file_name = date + '_' + title + '_' + author + '.txt'
    
    # Clean up characters that stop the file from saving properly.
    file_name = file_name.replace(":", "").replace('"', "").replace(" ", "-")
    file_path = os.path.join(corpus_path, file_name)
    
    try:
        # The with block closes the file even if a write fails.
        with open(file_path, 'w', encoding='utf8') as file:
            # Write the text, one paragraph per line.
            for text in texts:
                file.write(text + '\n')
        print('Saved file: ' + file_path)
    except OSError:
        # Raised, for example, when the name still contains an invalid character.
        print("Error: Text file not added.")
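
Colons and double quotes are not the only characters that can stop a file from saving on Windows. A more general clean-up could strip every character Windows rejects in file names. This is a sketch under assumptions: `make_safe_file_name` and `INVALID_CHARS` are hypothetical names, and the title and author in the example are made up.

```python
import re

# Characters that Windows does not allow in file names.
INVALID_CHARS = r'[<>:"/\\|?*]'

def make_safe_file_name(date, title, author):
    """Build Date_Title_Author.txt with characters invalid on Windows removed."""
    raw_name = date + '_' + title + '_' + author
    cleaned = re.sub(INVALID_CHARS, '', raw_name)
    # Replace each run of whitespace with a single hyphen.
    cleaned = re.sub(r'\s+', '-', cleaned.strip())
    return cleaned + '.txt'

# Made-up title and author, chosen to contain several problem characters.
print(make_safe_file_name('2021-08-11', 'Why Wait? "Vaccine": A Q/A', 'Jane Doe'))
```

The example prints `2021-08-11_Why-Wait-Vaccine-A-QA_Jane-Doe.txt`. Removing the slash also matters on other systems, since a slash in a title would otherwise be read as a directory separator.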

In [171]:
def get_ijr_data(urls):
    
    corpus_path = 'ijr_corpus/'
    
    #For every url, get the data and save it.
    for url in urls:
        
        # Make an empty list to contain all of the text data.
        texts = []
        
        # Wait for a few seconds as to not bombard the site with requests.
        sleep_seconds = 2
        time.sleep(sleep_seconds)
        
        # Get the response from the article page
        print('Opening url: ' + url)
        response = requests.get(url)
    
        #Create the beautiful soup
        soup = BeautifulSoup(response.text, "html.parser") 

        #Get the title, author and date
        #main = soup.find(class_='single-post')
        title = soup.title.get_text()
        author = soup.select('span.article-authors:nth-child(2) > a:nth-child(1) > span:nth-child(1)')[0].get_text()
        date = soup.find("meta", {"property":"article:published_time"})['content'].split('T')[0]

        #Get all of the data from the article
        main = soup.find(class_='ff-main-content')
        article = main.find('article')
        paragraphs = article.find_all('p')

        for paragraph in paragraphs:
            text = paragraph.get_text()
            if '@' in text or '.twitter.' in text:
                print("Skipping")
            elif 'This article appeared originally' in text or 'We are committed to truth and accuracy' in text:
                print('End of useful text.')
                #End as we do not care about this text at the end of the article 
                break 
            elif 'Reporting by' in text or '(By' in text:
                print('End of useful text.')
                #End as we do not care about this text at the end of the article 
                break 
            else:
                texts.append(text)
        
        save_text_data(title, author, date, texts, corpus_path)
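
The same skip and stop checks appear in each version of the scraper, so they could be pulled out into one helper that works on plain strings. A minimal sketch: the marker strings are taken from the checks above, while `filter_paragraphs`, `SKIP_MARKERS` and `STOP_MARKERS` are hypothetical names.

```python
# Marker strings copied from the checks used in the scraper above.
SKIP_MARKERS = ('@', '.twitter.')
STOP_MARKERS = (
    'This article appeared originally',
    'We are committed to truth and accuracy',
    'Reporting by',
    '(By',
)

def filter_paragraphs(paragraph_texts):
    """Keep article paragraphs, skipping embedded tweets and stopping at footer text."""
    kept = []
    for text in paragraph_texts:
        # Same order as the scraper: the skip check runs before the stop check.
        if any(marker in text for marker in SKIP_MARKERS):
            continue
        if any(marker in text for marker in STOP_MARKERS):
            break
        kept.append(text)
    return kept

sample = ['First paragraph.', 'Tweet by @someone', 'Second paragraph.',
          'Reporting by A. Writer', 'Footer text']
print(filter_paragraphs(sample))
```

Keeping the markers in two tuples means a new footer phrase for another site only needs one extra entry rather than another elif branch.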

In [172]:
get_ijr_data(urls)

Opening url: https://ijr.com/doocy-psaki-bidens-past-skepticism-covid-vaccine-developed-trump/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-08-11_Peter-Doocy-Grills-Psaki-on-Biden's-Past-Skepticism-of-COVID-Vaccine-Developed-Under-Trump_Bradley-Cortright.txt
Opening url: https://ijr.com/greene-1-week-suspension-twitter-claim-covid-vaccines/
Skipping
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-08-10_Marjorie-Taylor-Greene-Receives-1-Week-Suspension-From-Twitter-for-Claim-About-COVID-Vaccines_Bradley-Cortright.txt
Opening url: https://ijr.com/biden-mandatory-covid-vaccine-service-members/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-08-09_Biden-Backs-Mandatory-COVID-Vaccine-for-Service-Members_Bradley-Cortright.txt
Opening url: https://ijr.com/mcconnell-americans-ignore-giving-demonstrably-bad-advice-covid-vaccines/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-07-20_McConnell-Urges-Americans-To-'Ignore'-

Opening url: https://ijr.com/redfield-cdc-proud-sign-recommendation-covid-vaccine/
End of useful text.
Saved file: ijr_corpus/2020-12-14_US-CDC-Director-'Proud'-To-Sign-Advisory-Panel-Recommendation-of-COVID-Vaccine_Reuters.txt
Opening url: https://ijr.com/historic-covid-vaccine-campaign-launches-convoy-trucks/
End of useful text.
Saved file: ijr_corpus/2020-12-13_Historic-US-COVID-Vaccine-Campaign-Launches-With-Convoy-of-Trucks_Reuters.txt
Opening url: https://ijr.com/more-women-nervous-fast-rollout-covid-vaccine/
End of useful text.
Error: Text file not added. Moving to error list.
Opening url: https://ijr.com/operation-warp-speed-chief-adviser-trump-executive-order-vaccine/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-12-08_'Operation-Warp-Speed'-Chief-Adviser-Pressed-on-Trump's-Executive-Order-To-Prioritize-COVID-Vaccines_Madison-Summers.txt
Opening url: https://ijr.com/operation-warp-speed-adviser-covid-vaccine/
Skipping
Skipping
End of useful text.
Saved file

Opening url: https://ijr.com/mexico-to-tighten-borders-covid-us-offers-vaccine-help/
End of useful text.
Saved file: ijr_corpus/2021-03-19_Mexico-To-Tighten-Borders-Against-COVID-19-as-US-Offers-Vaccine-Help_Reuters.txt
Opening url: https://ijr.com/trump-recommends-covid-19-vaccine/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-03-17_Trump-Advocates-for-People-To-Get-COVID-19-Vaccine-—-Even-if-They-'Don't-Want-To'_Madison-Summers.txt
Opening url: https://ijr.com/fauci-hopes-trump-tell-supporters-covid-vaccine/
End of useful text.
Saved file: ijr_corpus/2021-03-14_Fauci-Hopes-Trump-Will-Tell-His-Supporters-To-Get-COVID-19-Vaccine_Reuters.txt
Opening url: https://ijr.com/former-presidents-ad-campaign-covid-19-vaccine/
End of useful text.
Saved file: ijr_corpus/2021-03-11_Former-Presidents,-First-Ladies-Encourage-Americans-To-Get-the-COVID-19-Vaccine_Savannah-Rychcik.txt
Opening url: https://ijr.com/trump-hopes-everyone-remembers-covid-19-vaccine/
End of useful text.
S

Opening url: https://ijr.com/uk-man-covid-19-vaccine-second-dose/
End of useful text.
Saved file: ijr_corpus/2021-01-05_UK-Man-Who-Went-Viral-After-Getting-the-COVID-19-Vaccine-Receives-Second-Dose_Savannah-Rychcik.txt
Opening url: https://ijr.com/adams-states-move-next-level-priority-covid-vaccine/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-01-05_Surgeon-General-Urges-States-To-Move-To-Next-Priority-Group-With-COVID-19-Vaccine-'if-the-Demand-Isn't-There'_Madison-Summers.txt
Opening url: https://ijr.com/premature-change-covid-vaccines-dosing-fda/
End of useful text.
Saved file: ijr_corpus/2021-01-05_'Premature'-To-Change-Authorized-COVID-19-Vaccines-Dosing,-Schedules,-FDA-Says_Reuters.txt
Opening url: https://ijr.com/new-york-florida-hospitals-dispense-covid-vaccines-quicker-lose-supply/
End of useful text.
Saved file: ijr_corpus/2021-01-04_New-York,-Florida-Tell-Hospitals-To-Dispense-COVID-19-Vaccines-Quicker-or-Lose-Supply_Reuters.txt
Opening url: https://ijr.c

Opening url: https://ijr.com/fauci-possible-covid-vaccine-could-spread-virus/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-12-07_Fauci-Says-It-Is-'Possible'-Someone-Who-Gets-COVID-19-Vaccine-Could-Spread-the-Virus_Madison-Summers.txt
Opening url: https://ijr.com/biden-covid-19-vaccine-mandatory/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-12-05_Biden-Does-Not-Believe-the-COVID-19-Vaccine-Should-Be-Mandatory_Savannah-Rychcik.txt
Opening url: https://ijr.com/biden-will-take-covid-19-vaccine-once-its-declared-safe/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-12-03_Biden-Says-He-Will-Take-a-COVID-19-Vaccine-'Once-It's-Declared-To-Be-Safe'_Bradley-Cortright.txt
Opening url: https://ijr.com/hogan-worst-part-virus-still-coming-amid-covid-vaccine/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-12-03_Gov.-Larry-Hogan-Says-'the-Worst-Part-of-This-Virus-Is-Still-Coming'-as-States-Await-COVID-19-Vaccine_Madison-Sum

Opening url: https://ijr.com/pence-family-wouldnt-hesitate-covid-19-vaccine/
End of useful text.
Saved file: ijr_corpus/2020-09-16_Pence-Weighs-in-on-Whether-He-and-His-Family-Would-Get-the-COVID-19-Vaccine_Meaghan-Ellis.txt
Opening url: https://ijr.com/trump-covid-19-disappear-even-without-vaccine/
Skipping
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-09-16_Trump-Claims-COVID-19-With-'Disappear'-Even-Without-a-Vaccine_Bradley-Cortright.txt
Opening url: https://ijr.com/astrazeneca-resumes-trials-covid-19-vaccine-halted-illness/
End of useful text.
Saved file: ijr_corpus/2020-09-13_AstraZeneca-Resumes-UK-Trials-of-COVID-19-Vaccine-Halted-by-Patient-Illness_Reuters.txt
Opening url: https://ijr.com/biden-even-cost-election-would-take-covid-vaccine/
End of useful text.
Saved file: ijr_corpus/2020-09-08_Biden-Even-'If-It-Cost-Me-the-Election,'-I-Would-Get-the-COVID-19-Vaccine_Madison-Summers.txt
Opening url: https://ijr.com/biden-campaign-weighs-in-on-take-covid-vaccine

Opening url: https://ijr.com/missouri-gov-biden-challenges-door-to-door-vaccine/
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-07-08_Missouri-Gov.-Stands-Up-To-Biden,-Challenges-Door-to-Door-Vaccine-Plan-With-Order-To-Health-Dept._Jack-Davis,-The-Western-Journal.txt
Opening url: https://ijr.com/ex-pharmacist-deliberately-sabotaged-vaccine-doses-sentenced-prison/
End of useful text.
Saved file: ijr_corpus/2021-06-09_Ex-Pharmacist-Who-Deliberately-Sabotaged-Hundreds-of-Vaccine-Doses-Sentenced-To-Prison_Jack-Davis,-The-Western-Journal.txt
Opening url: https://ijr.com/angry-protesters-jill-biden-fauci-they-tour-vaccine-facility-harlem/
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2021-06-08_Angry-Protesters-Greet-Jill-Biden,-Fauci-As-They-Tour-Vaccine-Facility-in-Harlem_Isa-Cox,-The-Western-Journal.txt
Opening url: https://ijr.com/texas-governor-will-sign-bill-banning-vaccine-passports/
Skipping
End of useful 

Opening url: https://ijr.com/oklahoma-diner-customers-will-not-get-vaccine/
End of useful text.
Saved file: ijr_corpus/2021-03-18_Oklahoma-Diner-Customers-Say-They-Will-Not-Get-the-Vaccine-Even-With-Trump's-Endorsement-of-It_Savannah-Rychcik.txt
Opening url: https://ijr.com/expert-committee-review-astrazeneca-vaccine-side-effects/
End of useful text.
Saved file: ijr_corpus/2021-03-14_Expert-WHO-Committee-To-Review-AstraZeneca-Vaccine-After-Worrying-Side-Effects-Appear_Jack-Davis,-The-Western-Journal.txt
Opening url: https://ijr.com/biden-direct-states-make-adults-eligible-vaccine-may-1/
End of useful text.
Saved file: ijr_corpus/2021-03-12_Biden-To-Direct-States-To-Make-All-Adults-Eligible-for-Vaccine-by-May-1_Reuters.txt
Opening url: https://ijr.com/biden-says-americans-will-be-first-to-get-vaccines-any-surplus-to-be-shared/
End of useful text.
Saved file: ijr_corpus/2021-03-10_Biden-Says-Americans-Will-Be-First-To-Get-Vaccines;-Any-Surplus-To-Be-Shared_Reuters.txt
Opening url: https:

End of useful text.
Saved file: ijr_corpus/2020-12-13_Historic-US-Vaccine-Campaign-Begins-With-First-Shipments-'Delivering-Hope'-To-Millions_Reuters.txt
Opening url: https://ijr.com/biden-hails-vote-pfizer-vaccine-bright-light-needlessly-dark-time/
End of useful text.
Saved file: ijr_corpus/2020-12-11_Biden-Hails-FDA-Panel-Vote-on-Pfizer-Vaccine-'A-Bright-Light-in-a-Needlessly-Dark-Time'_Bradley-Cortright.txt
Opening url: https://ijr.com/vaccine-trump-may-invoke-defense-production-act/
End of useful text.
Saved file: ijr_corpus/2020-12-09_While-Seeking-Credit-for-Vaccine,-Trump-Says-He-May-Invoke-Defense-Production-Act_Reuters.txt
Opening url: https://ijr.com/91-year-old-british-man-covid-vaccine/
Skipping
Skipping
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-12-08_91-Year-Old-British-Man-Speaks-After-COVID-19-Vaccination-'Well-There's-No-Point-in-Dying-Now'_Alex-Thomas.txt
Opening url: https://ijr.com/biden-transition-coronavirus-vaccine-teams-meet/
End of useful 

Opening url: https://ijr.com/russia-announces-vaccine-will-be-tested/
End of useful text.
Saved file: ijr_corpus/2020-08-20_Russia-Announces-Its-Coronavirus-Vaccine-Will-Be-Tested-on-More-Than-40,000-People_Savannah-Rychcik.txt
Opening url: https://ijr.com/pope-warns-rich-countries-against-vaccine-nationalism/
End of useful text.
Saved file: ijr_corpus/2020-08-19_Pope-Warns-Rich-Countries-Against-Coronavirus-Vaccine-Nationalism_Reuters.txt
Opening url: https://ijr.com/fauci-does-not-think-mandating-vaccine/
End of useful text.
Saved file: ijr_corpus/2020-08-18_Fauci-Says-He-Does-Not-'Think-You'll-Ever-See-a-Mandating'-of-a-Coronavirus-Vaccine_Savannah-Rychcik.txt
Opening url: https://ijr.com/fauci-doubts-russias-coronavirus-vaccine-effective-safe/
Skipping
Skipping
Skipping
End of useful text.
Saved file: ijr_corpus/2020-08-12_Fauci-'Seriously-Doubts'-Russia's-Touted-Coronavirus-Vaccine-Is-Effective-And-Safe_Meaghan-Ellis.txt
Opening url: https://ijr.com/trump-says-coronavirus-vaccine-

['https://ijr.com/page/2/?s=covid+vaccine',
 'https://ijr.com/page/2/?s=covid+vaccine',
 'https://ijr.com/page/2/?s=covid+vaccine',
 'https://ijr.com/page/2/?s=covid+vaccine',
 'https://ijr.com/page/2/?s=covid+vaccine',
 'https://ijr.com/page/2/?s=covid+vaccine']

Ran into an error with the ":" character in file names, which needs to be replaced. Got 319 out of 326 files, with 136,913 words.
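
A minimal sketch of one way to deal with the ":" problem, assuming the file name is built from the title, author and date before the file is opened. The helper name `sanitize_filename` and the example title are made up for illustration; the character list is the set of characters Windows does not allow in file names, which matters here because the notebook is running on Windows.

```python
import re

# Characters that Windows does not allow in file names.
INVALID_FILENAME_CHARS = r'[<>:"/\\|?*]'

def sanitize_filename(name, replacement="-"):
    """Hypothetical helper: swap characters that are invalid in file names for a safe one."""
    return re.sub(INVALID_FILENAME_CHARS, replacement, name)

# Made-up file name, only to show the effect of the substitution.
print(sanitize_filename("2021-03-18_Report: Vaccine-Update_Some-Author.txt"))
# prints: 2021-03-18_Report- Vaccine-Update_Some-Author.txt
```

Calling something like this on the file name just before saving would probably let the remaining files through, although I have not re-run the scrape to confirm it.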

# The Federalist Corpus
Now, the process needs to be repeated for the other two websites. First, I will start with The Federalist. Using the inspector tool, I will look at the data coming from the site and write the function to parse it.

In [52]:
def get_federalist_urls(seed_url, urls):
    """Function to get a list of URLs from The Federalist. Is called recursively to
    visit all pages using the next page button and will stop when all of the pages have been searched.
    Only adds articles with 'vaccine' in the URL to narrow the scope of links."""
    
    # Wait for a few seconds so as not to bombard the site with requests.
    sleep_seconds = 2
    time.sleep(sleep_seconds)
    
    # Get the response from the seed URL
    response = requests.get(seed_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})
    
    #Create the beautiful soup
    soup = BeautifulSoup(response.text, "html.parser") 
    
    #Get the main body of the page
    article_body = soup.find(class_='listing-container')
    
    #Get all of the article titles in the main body.
    titles = article_body.find_all(class_='entry-title')
    
    #Extract the link in each title
    for title in titles:
        link = title.find('a')
        new_url = link['href'].strip() #Need to strip away the white space around the links. 
        if 'vaccine' in new_url and new_url not in urls: # Only keep articles with "vaccine" in the URL (which mirrors the title) to narrow the scope. 
            urls.append(new_url)
    
    #Get the next page
    next_page = soup.find(class_='search-paging')
    paginations = next_page.find_all('a')
    next_url = paginations.pop()
    search_url = next_url['href']
    if next_url.get_text() == 'Next »':
        print(search_url)
        get_federalist_urls(search_url, urls)
    else:
        print('Done!')
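
The recursive approach works for a few dozen result pages, but Python caps recursion depth (1000 frames by default), so a loop is a safer shape for longer searches. The sketch below shows the same crawl written iteratively. `collect_urls` and `fetch_page` are hypothetical names: `fetch_page` stands in for the requests and BeautifulSoup code above and is assumed to return the article links on a page plus the next page's URL (or `None` on the last page). The `visited_pages` set is an extra guard, not something in the original function, that stops the loop if a site ever points back to a page already seen.

```python
import time

def collect_urls(seed_url, fetch_page, keyword="vaccine", delay_seconds=0):
    """Walk the search-result pages in a loop instead of recursing.

    fetch_page is a hypothetical callable: it takes a page URL and returns
    (list_of_article_links, next_page_url_or_None).
    """
    urls = []
    visited_pages = set()
    page_url = seed_url
    while page_url is not None and page_url not in visited_pages:
        visited_pages.add(page_url)
        links, next_url = fetch_page(page_url)
        for link in links:
            link = link.strip()  # Strip the white space around the links.
            if keyword in link and link not in urls:
                urls.append(link)
        page_url = next_url
        if page_url is not None:
            time.sleep(delay_seconds)  # Pause between requests.
    return urls

# Stand-in for a real site: two fake result pages held in a dictionary.
fake_pages = {
    "page-1": ([" https://example.com/covid-vaccine-a/ ", "https://example.com/other/"], "page-2"),
    "page-2": (["https://example.com/covid-vaccine-a/", "https://example.com/vaccine-b/"], None),
}
print(collect_urls("page-1", fake_pages.get))
```

In the notebook, `delay_seconds` would be set to 2 to match the pause used above.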

In [53]:
seed_url = 'https://thefederalist.com/?s=covid+vaccine'
urls = []
get_federalist_urls(seed_url, urls)

https://thefederalist.com/page/2/?s=covid+vaccine
https://thefederalist.com/page/3/?s=covid+vaccine
https://thefederalist.com/page/4/?s=covid+vaccine
https://thefederalist.com/page/5/?s=covid+vaccine
https://thefederalist.com/page/6/?s=covid+vaccine
https://thefederalist.com/page/7/?s=covid+vaccine
https://thefederalist.com/page/8/?s=covid+vaccine
https://thefederalist.com/page/9/?s=covid+vaccine
https://thefederalist.com/page/10/?s=covid+vaccine
https://thefederalist.com/page/11/?s=covid+vaccine
https://thefederalist.com/page/12/?s=covid+vaccine
https://thefederalist.com/page/13/?s=covid+vaccine
https://thefederalist.com/page/14/?s=covid+vaccine
https://thefederalist.com/page/15/?s=covid+vaccine
https://thefederalist.com/page/16/?s=covid+vaccine
https://thefederalist.com/page/17/?s=covid+vaccine
https://thefederalist.com/page/18/?s=covid+vaccine
https://thefederalist.com/page/19/?s=covid+vaccine
https://thefederalist.com/page/20/?s=covid+vaccine
https://thefederalist.com/page/21/?s=co

In [54]:
len(urls)

134

In [55]:
print(urls)

['https://thefederalist.com/2021/08/06/biden-considers-coercing-institutions-to-mandate-covid-vaccine-by-withholding-federal-funds/', 'https://thefederalist.com/2021/07/29/postal-union-that-endorsed-biden-says-covid-vaccine-mandates-not-the-role-of-the-federal-government/', 'https://thefederalist.com/2021/07/27/cdc-sows-doubts-about-covid-vaccines-by-nagging-americans-who-got-the-shot-to-mask-up-again/', 'https://thefederalist.com/2021/07/21/majority-of-voters-reject-covid-vaccine-mandates-new-poll-finds/', 'https://thefederalist.com/2021/07/06/leaked-document-shows-army-plans-to-mandate-covid-vaccines-for-service-members/', 'https://thefederalist.com/2021/07/05/how-college-covid-vaccine-mandates-put-students-in-danger/', 'https://thefederalist.com/2021/06/30/ohio-legislature-bans-covid-vaccine-mandates-in-government-schools/', 'https://thefederalist.com/2021/06/24/by-the-lefts-standards-covid-vaccine-mandates-clearly-institutionalize-racism/', 'https://thefederalist.com/2021/06/23/if-

It seems that The Federalist uses two different formats for its articles. These are exemplified by the following URLs: 'https://thefederalist.com/2021/06/03/vaccine-mandates-shouldnt-be-the-next-covid-policy-disaster/' and 'https://thefederalist.com/2020/12/18/vice-president-mike-pence-receives-covid-vaccine-in-televised-event/'

These cases will need to be handled individually. 

In [78]:
# Code for the first article type:
# Get the response from the first page of data
response = requests.get('https://thefederalist.com/2020/12/18/vice-president-mike-pence-receives-covid-vaccine-in-televised-event/',headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})

#Create the beautiful soup
soup = BeautifulSoup(response.text, "html.parser") 

#Get the title, author and date
#main = soup.find(class_='single-post')
title = soup.title.get_text().strip()
author = soup.find(class_='byline-standard').find('a').get_text()
date = soup.find("meta", {"property":"article:published_time"})['content'].split('T')[0]

#Get all of the data from the article
article = soup.find(class_='entry-content standard clearfix')
paragraphs = article.find_all('p')

for paragraph in paragraphs:
    text = paragraph.get_text()
    if '@' in text or '.twitter.' in text:
        print("Found Garbage")
    elif 'This article appeared originally' in text or 'We are committed to truth and accuracy' in text:
        print("End Early")
        break
    else:
        print(paragraph.get_text())
        
#Now for the second article type.
print()
response = requests.get('https://thefederalist.com/2021/06/03/vaccine-mandates-shouldnt-be-the-next-covid-policy-disaster/',headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})

#Create the beautiful soup
soup = BeautifulSoup(response.text, "html.parser") 

#Get the title, author and date
#main = soup.find(class_='single-post')
title = soup.title.get_text().strip()
author = soup.select('.aside-left > a:nth-child(2)')[0].get_text()
date = soup.find("meta", {"property":"article:published_time"})['content'].split('T')[0]


#Get all of the data from the article
article = soup.find(class_='entry-content long clearfix')
paragraphs = article.find_all('p')

for paragraph in paragraphs:
    text = paragraph.get_text()
    if '@' in text or '.twitter.' in text:
        print("Found Garbage")
    elif 'This article appeared originally' in text or 'We are committed to truth and accuracy' in text:
        print("End Early")
        break
    else:
        print(paragraph.get_text())

Vice President Mike Pence received a coronavirus vaccine dose in a televised White House event Friday morning, calling it “a medical miracle.”
Pence is the highest-ranking U.S. official to receive a dose of Pfizer-BioNTech’s vaccine, which was found to 95 percent effective in trials.
“I didn’t feel a thing. Well done,” Pence told the medical staff at Walter Reed National Military Medical Center.
Found Garbage
Found Garbage

 
While journalists, so-called experts, and verified Twitter users cast doubt on President Donald Trump and Operation Warp Speed’s efforts to ensure the creation, production, and distribution of a safe and effective vaccine, the Food and Drug Administration (FDA) voted last week to approve Pfizer’s vaccine for emergency use and mass distribution.
In his address on Friday morning, Pence reassured the public that the vaccine is safe and effective.
“Hope is on the way,” Pence said. “The American people can be confident.”
“We have one, and perhaps within hours, two safe

Now, both article types can be parsed.

In [85]:
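def get_federalist_data(urls):
    """Download each Federalist article in urls, skip paragraphs containing '@' or
    '.twitter.', and save the remaining text to the corpus folder."""
    
    corpus_path = 'federalist_corpus/'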
    
    #For every url, get the data and save it.
    for url in urls:
        
        # Make an empty list to contain all of the text data.
        texts = []
        
        # Wait for a few seconds so as not to bombard the site with requests.
        sleep_seconds = 2
        time.sleep(sleep_seconds)
        
        # Get the response for this article
        print('Opening url: ' + url)
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})
    
        #Create the beautiful soup
        soup = BeautifulSoup(response.text, "html.parser") 
        
        
        #Get the title, author and date
        #main = soup.find(class_='single-post')
        title = soup.title.get_text().strip()
        
        #Now need code to handle the two different cases.
        # Case 1:
        if soup.find(class_='byline-standard') is not None:
            author = soup.find(class_='byline-standard').find('a').get_text()
            date = soup.find("meta", {"property":"article:published_time"})['content'].split('T')[0]
            article = soup.find(class_='entry-content standard clearfix')
        else:
            author = soup.select('.aside-left > a:nth-child(2)')[0].get_text()
            date = soup.find("meta", {"property":"article:published_time"})['content'].split('T')[0]
            article = soup.find(class_='entry-content long clearfix')
            
        #Get all of the data from the article
        paragraphs = article.find_all('p')

        for paragraph in paragraphs:
            text = paragraph.get_text()
            if '@' in text or '.twitter.' in text:
                print("Skipping")
            else:
                texts.append(text)
        
        save_text_data(title, author, date, texts, corpus_path)
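
The paragraph test is repeated in every site's loop with slightly different markers, so it could be pulled into one small helper. This is only a sketch: `filter_paragraphs` is a hypothetical name, it works on plain strings (the result of `paragraph.get_text()`), and it keeps the same order of checks as the test cell above, where the skip test runs before the early-stop test.

```python
def filter_paragraphs(paragraph_texts, skip_markers=("@", ".twitter."), stop_markers=()):
    """Hypothetical helper: drop unwanted paragraphs and stop at an end-of-article marker."""
    kept = []
    for text in paragraph_texts:
        if any(marker in text for marker in skip_markers):
            continue  # Same effect as printing "Skipping" and moving on.
        if any(marker in text for marker in stop_markers):
            break  # Everything after this point is boilerplate.
        kept.append(text)
    return kept

# Made-up paragraphs, only to show the behaviour.
sample = [
    "First paragraph.",
    "Follow @someone for updates",
    "Second paragraph.",
    "We are committed to truth and accuracy in all of our journalism.",
    "Footer text.",
]
print(filter_paragraphs(sample, stop_markers=("We are committed to truth and accuracy",)))
# prints: ['First paragraph.', 'Second paragraph.']
```

A site that also needs to skip bare links could pass `skip_markers=("@", ".twitter.", "https")` instead.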

In [86]:
get_federalist_data(urls)

Opening url: https://thefederalist.com/2021/08/06/biden-considers-coercing-institutions-to-mandate-covid-vaccine-by-withholding-federal-funds/
Saved file: federalist_corpus/2021-08-06_Biden-Considers-Coercing-Institutions-To-Mandate-COVID-Vaccine_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2021/07/29/postal-union-that-endorsed-biden-says-covid-vaccine-mandates-not-the-role-of-the-federal-government/
Skipping
Skipping
Skipping
Skipping
Saved file: federalist_corpus/2021-07-29_Postal-Union-That-Endorsed-Biden-Says-No-To-Federal-Vaccine-Mandates_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2021/07/27/cdc-sows-doubts-about-covid-vaccines-by-nagging-americans-who-got-the-shot-to-mask-up-again/
Saved file: federalist_corpus/2021-07-27_CDC-Sows-Vaccine-Doubts-By-Nagging-Those-Who-Got-The-Shot-To-Mask_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2021/07/21/majority-of-voters-reject-covid-vaccine-mandates-new-poll-finds/
Saved file: federalist_corpus/2

Opening url: https://thefederalist.com/2021/06/21/students-parents-sue-indiana-university-over-mandatory-covid-19-vaccine-policy/
Saved file: federalist_corpus/2021-06-21_Students-Sue-Indiana-University-Over-Mandatory-COVID-19-Vaccine-Policy_Spencer-Lindquist.txt
Opening url: https://thefederalist.com/2021/06/03/vaccine-mandates-shouldnt-be-the-next-covid-policy-disaster/
Saved file: federalist_corpus/2021-06-03_Vaccine-Mandates-Shouldn’t-Be-The-Next-COVID-Policy-Disaster_Ron-Johnson.txt
Opening url: https://thefederalist.com/2021/05/28/report-facebook-is-censoring-covid-19-vaccine-hesitancy/
Saved file: federalist_corpus/2021-05-28_Report-Facebook-Is-Censoring-COVID-19-Vaccine-Hesitancy_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2021/05/27/as-students-leave-campus-for-summer-colleges-are-requiring-the-covid-19-vaccine-if-they-want-to-return-this-fall/
Saved file: federalist_corpus/2021-05-27_More-Than-100-Colleges-Are-Requiring-The-COVID-19-Vaccine-For-Students-On-Camp

Saved file: federalist_corpus/2021-06-24_Pennsylvania-House-Republicans-Advance-Bill-To-Ban-Vaccine-Passports_Shawn-Fleetwood.txt
Opening url: https://thefederalist.com/2021/06/08/vaccines-for-kids-amid-fauci-email-scandal-poll-shows-americans-trust-parents-over-feds/
Error: Text file not added.
Opening url: https://thefederalist.com/2021/06/07/texas-governor-pledges-to-sign-law-prohibiting-businesses-from-requiring-vaccine-passports/
Skipping
Saved file: federalist_corpus/2021-06-07_Texas-Governor-Pledges-To-Sign-Law-Prohibiting-Businesses-From-Requiring-Vaccine-Passports_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2021/06/04/sen-rick-scott-introduces-bill-to-ban-vaccine-passports-for-domestic-flights/
Saved file: federalist_corpus/2021-06-04_Sen.-Rick-Scott-Introduces-Bill-To-Ban-Vaccine-Passports-For-Domestic-Flights_Shawn-Fleetwood.txt
Opening url: https://thefederalist.com/2021/06/04/indiana-lawmakers-must-protect-college-students-from-de-facto-vaccine-passports/
Sa

Saved file: federalist_corpus/2021-04-13_The-New-York-Times-Can-Tell-The-Difference-Between-Men-And-Women-With-Vaccines-But-Not-Pronouns---The-Federalist_Kylee-Zempel.txt
Opening url: https://thefederalist.com/2021/04/09/report-new-yorks-pilot-program-vaccine-passports-are-easy-to-fake/
Saved file: federalist_corpus/2021-04-09_Report-New-York's-Pilot-Program-Vaccine-Passports-Are-Easy-To-Fake_Gabe-Kaminsky.txt
Opening url: https://thefederalist.com/2021/04/08/tennessee-gov-bill-lee-supports-allowing-big-business-to-require-vaccine-passports/
Saved file: federalist_corpus/2021-04-08_Tennessee-Gov.-Bill-Lee-Supports-Allowing-Big-Business-To-Require-Vaccine-Passports_Tristan-Justice.txt
Opening url: https://thefederalist.com/2021/04/08/iowa-governor-says-shell-take-executive-action-against-vaccine-passports/
Saved file: federalist_corpus/2021-04-08_Iowa-Governor-Commits-To-Executive-Action-Against-Vaccine-Passports_Gabe-Kaminsky.txt
Opening url: https://thefederalist.com/2021/04/06/gov-gr

Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Saved file: federalist_corpus/2020-12-14_Mainstream-Media-And-Twitter-Journalists-Said-A-Vaccine-By-Year-End-Was-Impossible.-Here-Are-The-Receipts_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2020/12/10/fda-votes-to-recommend-pfizer-vaccine-for-emergency-use/
Saved file: federalist_corpus/2020-12-11_FDA-Votes-To-Recommend-Pfizer-Vaccine-For-Emergency-Use_Tristan-Justice.txt
Opening url: https://thefederalist.com/2020/12/09/new-york-times-you-must-wear-a-mask-even-after-getting-a-vaccine/
Saved file: federalist_corpus/2020-12-09_New-York-Times-You-Must-Wear-A-Mask-Even-After-A-Vaccine_Jordan-Davidson.txt
Opening url: https://thefederalist.com/2020/12/08/washington-post-spouts-false-iranian-talking-points-on-sanctions-allegedly-blocking-vaccine/
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Saved file: federalist_corpus/2020

This produced a corpus of 131 documents with a total word count of 92,205 words.
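
The journal does not record how the document and word counts were obtained. One way to check them from inside the notebook is sketched below. `corpus_stats` is a hypothetical helper, it assumes the files are UTF-8 text, and it counts words by splitting on white space, so its totals may differ slightly from a tool that tokenises differently.

```python
from pathlib import Path

def corpus_stats(corpus_path):
    """Hypothetical helper: return (number of .txt files, total whitespace-separated words)."""
    files = sorted(Path(corpus_path).glob("*.txt"))
    word_count = 0
    for file in files:
        text = file.read_text(encoding="utf-8")  # Assumes UTF-8 encoded files.
        word_count += len(text.split())
    return len(files), word_count
```

For example, `corpus_stats('federalist_corpus/')` would return the two figures for the corpus built above.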

# Building the Truth Out Corpus

In [98]:
def get_truthout_urls(seed_url, urls):
    """Function to get a list of URLs from Truthout. Is called recursively to
    visit all pages using the next page button and will stop when all of the pages have been searched.
    Only adds articles with 'vaccine' in the URL to narrow the scope of links."""
    
    # Wait for a few seconds so as not to bombard the site with requests.
    sleep_seconds = 2
    time.sleep(sleep_seconds)
    
    # Get the response from the seed URL
    response = requests.get(seed_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})
    
    #Create the beautiful soup
    soup = BeautifulSoup(response.text, "html.parser") 
    
    #Get the main body of the page
    article_body = soup.find(class_='inner-content')
    
    #Get all of the article titles in the main body.
    titles = article_body.find_all(class_='entry-title')
    
    #Extract the link in each title
    for title in titles:
        link = title.find('a')
        new_url = link['href'].strip() #Need to strip away the white space around the links. 
        if 'vaccine' in new_url and new_url not in urls: # Only keep articles with "vaccine" in the URL (which mirrors the title) to narrow the scope. 
            urls.append(new_url)
    
    #Get the next page
    next_page = soup.find(class_='pagination text-center')
    paginations = next_page.find_all('a')
    next_url = paginations.pop()
    search_url = next_url['href']
    print(search_url)
    print(next_url.get_text())
    if next_url.get_text() == 'Next >':
        print(search_url)
        get_truthout_urls(search_url, urls)
    else:
        print('Done!')

In [99]:
seed_url = 'https://truthout.org/?s=covid+vaccine&post_type=all'
urls = []
get_truthout_urls(seed_url, urls)
print(len(urls))

https://truthout.org/page/2/?s=covid+vaccine&post_type=all#038;post_type=all
Next >
https://truthout.org/page/2/?s=covid+vaccine&post_type=all#038;post_type=all
https://truthout.org/page/3/?s=covid+vaccine&post_type=all#038;post_type=all
Next >
https://truthout.org/page/3/?s=covid+vaccine&post_type=all#038;post_type=all
https://truthout.org/page/4/?s=covid+vaccine&post_type=all#038;post_type=all
Next >
https://truthout.org/page/4/?s=covid+vaccine&post_type=all#038;post_type=all
https://truthout.org/page/5/?s=covid+vaccine&post_type=all#038;post_type=all
Next >
https://truthout.org/page/5/?s=covid+vaccine&post_type=all#038;post_type=all
https://truthout.org/page/6/?s=covid+vaccine&post_type=all#038;post_type=all
Next >
https://truthout.org/page/6/?s=covid+vaccine&post_type=all#038;post_type=all
https://truthout.org/page/5/?s=covid+vaccine&post_type=all#038;post_type=all
5
Done!
54
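
The next-page links printed above all end in `#038;post_type=all`. My reading, which I have not confirmed against the site's markup, is that this is a leftover from an HTML-escaped ampersand (`&#038;`) in the link. It is harmless for the crawl because the part of a URL after `#` is a fragment and is not sent to the server, but it can be stripped if cleaner URLs are wanted. A small sketch using the standard library, with `strip_fragment` as a hypothetical helper name:

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    """Hypothetical helper: return the URL without its '#...' fragment."""
    return urldefrag(url).url

print(strip_fragment("https://truthout.org/page/2/?s=covid+vaccine&post_type=all#038;post_type=all"))
# prints: https://truthout.org/page/2/?s=covid+vaccine&post_type=all
```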


Now to make sure the code is parsing properly:

In [116]:
# Get the response for one of the collected articles to test the parsing
response = requests.get(urls[3], headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})
print(response)
#Create the beautiful soup
soup = BeautifulSoup(response.text, "html.parser") 

#Get the title, author and date
#main = soup.find(class_='single-post')
title = soup.title.get_text()
author = soup.find(class_='author vcard').find('a').get_text()
date = soup.find(class_="dateline").find('time')['datetime'].split('T')[0]
print(title)
print(author)
print(date)

#Get all of the data from the article
article = soup.find(id='article-content')
paragraphs = article.find_all('p')
paragraphs.pop() # Remove the last paragraph, which is always info about the author.

for paragraph in paragraphs:
    text = paragraph.get_text()
    if '@' in text or '.twitter.' in text or 'https' in text:
        print("Found Garbage")
    else:
        print(paragraph.get_text())

<Response [200]>
Only 0.2 Percent of COVID Vaccines Have Gone to Poor Countries
Jake Johnson
2021-04-12
The head of the World Health Organization estimated in a recent address that of the more than 700 million coronavirus vaccine doses that have been administered across the globe, just 0.2% have gone to people in low-income nations — inequity that experts warn will persist unless rich countries end their obstruction of an international effort to suspend vaccine patents.
Speaking to the media on Friday, WHO Director General Tedros Adhanom Ghebreysus warned that “there remains a shocking imbalance in the global distribution of vaccines” as pharmaceutical companies cling to their monopoly control over technology that was developed with large infusions of public money.
“On average in high-income countries, almost one in four people has received a vaccine. In low-income countries, it’s one in more than 500,” said Tedros. “Let me repeat that: one in four versus one in 500.”
Tedros went on to

In [117]:
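def get_truthout_data(urls):
    """Download each Truthout article in urls, skip paragraphs containing '@',
    '.twitter.' or 'https', and save the remaining text to the corpus folder."""
    
    corpus_path = 'truthout_corpus/'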
    
    #For every url, get the data and save it.
    for url in urls:
        
        # Make an empty list to contain all of the text data.
        texts = []
        
        # Wait for a few seconds so as not to bombard the site with requests.
        sleep_seconds = 2
        time.sleep(sleep_seconds)
        
        # Get the response for this article
        print('Opening url: ' + url)
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'})
    
        #Create the beautiful soup
        soup = BeautifulSoup(response.text, "html.parser") 
        
        
        #Get the title, author and date
        #main = soup.find(class_='single-post')
        title = soup.title.get_text().strip()
        author = soup.find(class_='author vcard').find('a').get_text()
        date = soup.find(class_="dateline").find('time')['datetime'].split('T')[0]
        
        article = soup.find(id='article-content')
            
        #Get all of the data from the article
        paragraphs = article.find_all('p')
        paragraphs.pop() #Remove last paragraph which is always info about the author.
        for paragraph in paragraphs:
            text = paragraph.get_text()
            if '@' in text or '.twitter.' in text or 'https' in text:
                print("Skipping")
            else:
                texts.append(text)
        
        save_text_data(title, author, date, texts, corpus_path)

In [126]:
get_truthout_data(urls)

Opening url: https://truthout.org/articles/ohio-gops-expert-witness-bizarrely-claims-covid-vaccines-magnetize-people/
Saved file: truthout_corpus/2021-06-09_Ohio-GOP's-Expert-Witness-Bizarrely-Claims-COVID-Vaccines-Magnetize-People_Sharon-Zhang.txt
Opening url: https://truthout.org/articles/top-republican-raises-prospect-of-congress-blocking-covid-vaccine-patent-waivers/
Saved file: truthout_corpus/2021-05-13_Top-Republican-Raises-Prospect-of-Congress-Blocking-COVID-Vaccine-Patent-Waivers_Sam-Knight.txt
Opening url: https://truthout.org/articles/democrats-funded-by-big-pharma-refuse-to-back-covid-vaccine-patent-waiver/
Skipping
Skipping
Skipping
Skipping
Saved file: truthout_corpus/2021-05-04_Democrats-Funded-by-Big-Pharma-Refuse-to-Back-COVID-Vaccine-Patent-Waiver_Jake-Johnson.txt
Opening url: https://truthout.org/articles/only-0-2-percent-of-covid-vaccines-have-gone-to-poor-countries/
Skipping
Skipping
Skipping
Skipping
Saved file: truthout_corpus/2021-04-12_Only-0.2-Percent-of-COVID

Opening url: https://truthout.org/articles/covid-vaccine-billionaires-strike-it-rich-as-poor-nations-struggle-for-access/
Saved file: truthout_corpus/2021-05-21_COVID-Vaccine-Billionaires-Strike-It-Rich-as-Poor-Nations-Struggle-for-Access_Mike-Ludwig.txt
Opening url: https://truthout.org/articles/as-covid-cases-soar-in-india-advocates-urge-biden-to-end-vaccine-apartheid/
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Saved file: truthout_corpus/2021-05-02_AS-COVID-Cases-Soar-in-India,-Advocates-Urge-Biden-to-End-Vaccine-Apartheid_Andrea-Germanos.txt
Opening url: https://truthout.org/articles/prioritizing-incarcerated-people-for-vaccine-quickly-reduced-covid-in-il-prisons/
Saved file: truthout_corpus/2021-04-24_Prioritizing-Incarcerated-People-for-Vaccine-Quickly-Reduced-COVID-in-IL-Prisons_Brian-Dolinar.txt
Opening url: https://truthout.org/articles/the-wto-stopped-millions-of-people-from-receiving-a-covid-19-vaccine/
Saved file: truthout_corpus/2021-04-19_The-WTO-Stopped-Millio

This produced a corpus of 54 files with a total of 71,375 words.