I scraped nearly 3000 books from the sixteenth and seventeenth centuries from Early English Books Online. Some of the centuries did not contain very much text, so I cut those out of my data set. Cleaning the data set was simple: I only had to remove a few characters that show up every once in a while, such as the pipe and slash characters. I also shifted everything to lowercase to cut down on the number of features I would be testing over, so I will mainly be looking at spelling and common phrases.

In terms of the reliability of the data, I believe it all to be accurate, given that it was put together by several universities that were trying to build an easily searchable corpus for analysis of Early Modern English. I do not foresee any "hidden agendas" that might skew the data with which I am concerned. Similarly, I do not believe there are values that should concern me besides the occasional illustration or narration tag that shows up in the text.

This data set was created specifically for analyses like this one and is well suited to it.

Below is the code that I used to scrape and clean the data in executable files on one of the lab computers.  I then used the command line to move the files into folders labeled by the publication decade of the books contained therein.

In [None]:
import re
import time
import requests
from bs4 import BeautifulSoup as b

#try to avoid copyright infringement
copyright=re.compile('[pP]rohibit(ed|s)')
full_text=re.compile('View entire text')
#start off at the beginning browse by title
base_url = 'https://quod.lib.umich.edu/e/eebo?cginame=text-idx;id=navbarbrowselink;key=title;page=browse'
base_soup = b(requests.get(base_url).text,'html.parser')
#navigate to the row of letters
first_hierarchy = base_soup.find_all(href=True,class_='browsenav_r1')
for i in first_hierarchy[:26]:
    #navigate to the row of subletters
    next_page = b(requests.get(i.attrs['href']).text,'html.parser')
    titlestart=next_page.find_all(href=True,class_='browsenav_r2')
    for j,k in enumerate(titlestart):
        if j>10:
            break
        time.sleep(1)
        current_page=b(requests.get(k.attrs['href']).text,'html.parser')
        #grab the list of browselistitem
        
        titles = current_page.find_all(class_='browsecell')
        for ii,jj in enumerate(titles):
            if ii > 20:
                break
            if ii%2==0:
                #alternates between year and link to book
                author_year = jj.text
            else:
                book = b(requests.get(jj.contents[0].attrs['href']).text,'html.parser')
                text = []
                if book.find(name='p',string=copyright) is not None:  #check the copyright area.
                    continue  #skip books whose text is marked as prohibited
                contents = book.find_all(class_='buttonlink')[-1].contents[0]
                time.sleep(1)
                page = b(requests.get(contents.attrs['href']).text,'html.parser')
                accept = page.find(name='a')
                actual = b(requests.get(accept.attrs['href']).text,'html.parser')
                for ss in actual.find_all(name='p',limit=4000):
                    text.append(ss.string)
                #write the text to a file
                with open(author_year.replace('.','').replace('/','')+'.txt','a+') as outfile:
                    for aa in text:
                        if aa is not None:
                            outfile.write(aa)
                            outfile.write('\n')


In [None]:
from glob import iglob
for i in iglob('*.txt'):
    with open(i,'r') as infile:
        contents = infile.readlines()
    cleaned = []
    for line in contents:
        #there are some of these scattered throughout the data.
        #strip() so the trailing newline does not prevent a match
        if line.strip() in ['[illustration]','[narration]']:
            continue
        #there are some odd characters in the data.
        cleaned.append(line.lower().replace('/','').replace('|',''))
    #write once per file; each line already ends with its own newline
    with open(i,'w') as outfile:
        outfile.writelines(cleaned)
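
The sorting into decade folders was done by hand on the command line, so it is not part of the code above. A minimal Python sketch of the same step might look like the following. It assumes that each file name contains a four-digit publication year, which is a guess about what the scraped author/year cell holds; `decade_of` and `sort_into_decades` are hypothetical helper names, not part of the original workflow.

```python
import re
import shutil
from pathlib import Path

# Assumption: a four-digit year between 1400 and 1799 appears in the file name.
YEAR = re.compile(r'(?<!\d)1[4-7]\d\d(?!\d)')

def decade_of(filename):
    """Return a decade label such as '1590s', or None if no year is found."""
    match = YEAR.search(filename)
    if match is None:
        return None
    year = int(match.group())
    return f'{year // 10 * 10}s'

def sort_into_decades(folder='.'):
    """Move every .txt file in `folder` into a subfolder named for its decade."""
    folder = Path(folder)
    for path in sorted(folder.glob('*.txt')):
        decade = decade_of(path.name)
        if decade is None:
            continue  # leave undated files where they are
        target = folder / decade
        target.mkdir(exist_ok=True)
        shutil.move(str(path), str(target / path.name))
```

Calling `sort_into_decades()` in the directory that holds the cleaned files would leave any file without a recognizable year in place, so those could still be sorted manually.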


Examining the data leads me to believe that I will be able to use the features I originally intended: words, spellings, and phrases that occur more frequently in some decades than in others. This would matter for machine learning and archaeology because it would mean there are features that can be used to train models that predict how old a document is from the vocabulary of its text.
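
To make the idea of decade-specific vocabulary concrete, here is a rough sketch of one way to compare word frequencies across decades. It is not the analysis itself, only an illustration: the function names are hypothetical, the decade labels `decade_a` and `decade_b` are placeholders, and the two tiny "decades" are toy strings made up for the example rather than taken from real files.

```python
from collections import Counter

def relative_frequencies(texts_by_decade):
    """Map each decade to {word: share of that decade's tokens}."""
    result = {}
    for decade, texts in texts_by_decade.items():
        counts = Counter(word for text in texts for word in text.split())
        total = sum(counts.values())
        result[decade] = {w: c / total for w, c in counts.items()} if total else {}
    return result

def distinctive_words(freqs, decade, top=5):
    """Rank a decade's words by how far their share exceeds the mean share elsewhere."""
    others = [d for d in freqs if d != decade]

    def excess(word):
        elsewhere = sum(freqs[d].get(word, 0.0) for d in others) / max(len(others), 1)
        return freqs[decade][word] - elsewhere

    return sorted(freqs[decade], key=excess, reverse=True)[:top]

# Toy data (made up for illustration): the same phrases in two spelling styles.
sample = {
    'decade_a': ['the begynnynge of all good werkes', 'worshyp be to god'],
    'decade_b': ['the beginning of all good works', 'worship be to god'],
}
freqs = relative_frequencies(sample)
print(distinctive_words(freqs, 'decade_a', top=3))
```

Words shared by both toy decades cancel out, so only the spellings unique to one of them rank highly. On the real corpus, the same comparison would presumably need a minimum-count threshold so that one-off spellings do not dominate the ranking.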

Here I include a sample of the data after scraping and cleaning. Note the odd spellings that may be markers of specific eras.

"in the begynnynge and endynge of all good werkes worshyp & thankynge be to almyghty god  maker & byer of all mākynde  begynner and ender of all goodnes  without whose gyfte & helpe no maner vertue is ne may be  whether it be in thought  wyll  or dede  than what euer we synfull creatures thynke or do speke or wryte  that may tourne in to proufyte of mannes soule  to god onely be the worshyp that sente al grace  to vs no praysynge  for of vs without hym cometh no thynge but fylthe & synne. now than good god of his endeles myght & plenteuous goodnes graūte me grace to thynke somwhat of his dere loue & how he sholde be loued  of that same loue some wordes to wryte whiche may to hym be worshyp  to the wryter mede  and proufytable to the reder. amen."