After building the data set from the initial Gutenberg download, we clean and prepare the data for machine learning. This includes:
 - Removal of unusual characters
 - Removal of stop words in attributes (the text)
 - Reduction of columns
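
As one possible take on the "removal of unusual characters" step, the sketch below maps typographic quotes and dashes to ASCII and optionally strips accents. The names `TYPOGRAPHIC_MAP` and `normalize_characters` are illustrative and not part of this notebook, and which characters count as "unusual" is an assumption to tune against the corpus. It is relevant because `string.punctuation` covers ASCII only, so curly quotes would otherwise survive a punctuation-stripping step.

```python
import re
import unicodedata

# Hypothetical mapping of typographic characters to ASCII equivalents.
TYPOGRAPHIC_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": " ",   # en dash -> space, so neighboring words stay separate
    "\u2014": " ",   # em dash -> space
})

def normalize_characters(text, strip_accents=False):
    """Replace typographic characters and collapse runs of whitespace."""
    text = text.translate(TYPOGRAPHIC_MAP)
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip()

print(normalize_characters("fig 1\u2014the map"))
```

Accent stripping is off by default because the data set also contains non-English texts, where removing accents would change the words.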
 
 
The first pass does not use a pipeline; later versions will wrap these steps in a pipeline so that they are applied consistently to every text.

We should also consider adding Project Gutenberg (PG) specific cleaning, such as the removal of copyright info.
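
A minimal sketch of what PG-specific cleaning could look like, assuming the raw texts still carry the `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines that many PG files use. The function name `strip_gutenberg_boilerplate` is hypothetical, and older files use other wording, so texts without these markers are returned unchanged.

```python
import re

# Marker lines found in many (not all) Project Gutenberg files.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)

def strip_gutenberg_boilerplate(text):
    """Keep only the text between the PG start and end markers, if present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]   # drop the header and the start marker
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]   # drop the end marker and the license
    return text.strip()

sample = (
    "License info\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "Chapter 1\nText body.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "More license text"
)
print(strip_gutenberg_boilerplate(sample))
```

Credit lines such as "Produced by ... Online Distributed Proofreading Team" usually sit after the start marker, so they would need a separate rule.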

In [30]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.7.1-py3-none-any.whl (2.5 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.7.1


In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
from string import punctuation
from spellchecker import SpellChecker

import nltk
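# nltk.download()  # run once; the cleaning code below needs the punkt, stopwords, averaged_perceptron_tagger and wordnet resources (names can vary by NLTK version)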

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

In [108]:
path_to_data = "Data/bookshelf_data.csv" # starter data, should be replaced in later versions
df = pd.read_csv(path_to_data, index_col=0)
df.head()

Unnamed: 0,Title,Author,Link,ID,Bookshelf,Text
0,The Extermination of the American Bison,William T. Hornaday,http://www.gutenberg.org/ebooks/17748,17748,Animal,[Illustration: (Inscription) Mr. Theodore Roos...
1,Deadfalls and Snares,A. R. Harding,http://www.gutenberg.org/ebooks/34110,34110,Animal,DEADFALLS AND SNARES [Frontispiece: A GOOD DEA...
2,Artistic Anatomy of Animals,Édouard Cuyer,http://www.gutenberg.org/ebooks/38315,38315,Animal,+---------------------------------------------...
3,"Birds, Illustrated","Color Photography, Vol. 1, No. 1 Various",http://www.gutenberg.org/ebooks/30221,30221,Animal,FROM: THE PRESIDENT OF THE NATIONAL TEACHERS' ...
4,On Snake-Poison: Its Action and Its Antidote,A. Mueller,http://www.gutenberg.org/ebooks/32947,32947,Animal,[Illustration] ON SNAKE-POISON. ITS ACTION AND...


In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2731
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      2355 non-null   object
 1   Author     2119 non-null   object
 2   Link       2355 non-null   object
 3   ID         2355 non-null   int64 
 4   Bookshelf  2355 non-null   object
 5   Text       2355 non-null   object
dtypes: int64(1), object(5)
memory usage: 208.8+ KB


Define a function for cleaning text. It splits an entire text into sentences, lowercases each sentence, removes all punctuation, tokenizes the sentence, removes stop words, tags the parts of speech, and lemmatizes the individual words. Finally, it joins the lemmatized words into a new string representing the cleaned text.

In [106]:
en_stopwords = set(stopwords.words('english')) # a set makes the membership test in clean_text fast
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag (as returned by pos_tag) to a WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # default pos

def clean_text(text):
    results = []
    # split text into sentences
    sentences = sent_tokenize(text)
    # clean and lemmatize each sentence
    for sentence in sentences:
        words = sentence.lower() # shift to lowercase
        words = re.sub(f"[{re.escape(punctuation)}]", "", words) # remove punctuation
        words = word_tokenize(words) # split sentence into individual words
        words = [word for word in words if word not in en_stopwords] # remove stopwords
        parts = pos_tag(words)
        for word, part in parts:
            lemma = wnl.lemmatize(word, pos=get_wordnet_pos(part))
            results.append(lemma)
    return " ".join(results)

In [105]:
clean_text(df.iloc[0].Text)

'illustration inscription mr theodore roosevelt author hunt trip ranchman compliment author wt hornaday smithsonian institution united state national museum extermination american bison william hornaday superintendent national zoological park report national museum 188687 page 369548 plate ixxii washington government printing office 1889 illustration group american bison national museum collect mount w hornaday content prefatory note part ithe life history bison discovery specie ii geographical distribution iii abundance iv character specie 1 buffalo rank amongst ruminant 2 change form captivity 3 mount specimen museum 4 calf 5 yearling 6 spike bull 7 adult bull 8 cow third year 9 adult cow 10 wood mountain buffalo 11 shed winter pelage v habit buffalo vi food buffalo vii mental capacity disposition buffalo viii value mankind ix economic value bison western cattlegrowers 1 bison captivity domestication 2 need improvement range cattle 3 character buffalodomestic hybrid 4 bison beast bur
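
The output above contains fused tokens such as `188687`, `369548`, and `cattlegrowers`, which appear when punctuation is deleted outright and the characters on either side are joined. A hedged alternative is to replace punctuation with a space instead; `strip_punctuation` below is an illustrative helper, not part of the notebook.

```python
import re
from string import punctuation

PUNCT_RE = re.compile(f"[{re.escape(punctuation)}]")

def strip_punctuation(text, replacement=" "):
    """Replace ASCII punctuation, then collapse the resulting whitespace."""
    spaced = PUNCT_RE.sub(replacement, text)
    return re.sub(r"\s+", " ", spaced).strip()

print(strip_punctuation("1886-'87, pages 369-548"))                  # space replacement
print(strip_punctuation("1886-'87, pages 369-548", replacement=""))  # deletion, as in clean_text
```

The trade-off is that contractions are split as well ("don't" becomes "don t"), so apostrophes may deserve separate handling.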

In [118]:
df['Text'][0:5].apply(clean_text)

0    illustration inscription mr theodore roosevelt...
1    deadfalls snare frontispiece good deadfall dea...
2    transcriber note transcription use etext texts...
3    president national teacher association state n...
4    illustration snakepoison action antidote muell...
Name: Text, dtype: object

In [137]:
from datetime import datetime
start = datetime.now()
df['Text'][0:5].apply(clean_text)
stop = datetime.now()
print("Total Time: {}".format((stop-start).total_seconds()))

Total Time: 17.42791
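
The five-text timing can be turned into a rough estimate for the full data set. The sketch below assumes run time scales linearly with the number of texts, which is only approximately true because book lengths vary widely; `time_call` and `estimate_total_seconds` are hypothetical helper names.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def estimate_total_seconds(sample_seconds, sample_size, total_size):
    """Linearly extrapolate a sample timing to the full data set."""
    return sample_seconds / sample_size * total_size

# Figures from this notebook: 17.42791 s for 5 texts, 2355 texts in total.
estimated = estimate_total_seconds(17.42791, 5, 2355)
print(f"Estimated total: {estimated / 3600:.1f} hours")
```

With these figures the estimate is about 2.3 hours. A five-book sample is a weak guide, so treat the result as a lower bound and not as a prediction.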


In [138]:
# prepare df for processing
# note that this cell took ~6 hours to run on my machine

from IPython.display import clear_output
from datetime import datetime

start = datetime.now()

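df = df.drop(columns=['Title', 'Author', 'Link', 'ID']) # these are not needed for our analysis; drop() returns a new frame
# clean the first text on its own, keeping its original index label
clean_series = pd.Series([clean_text(df['Text'].iloc[0])], index=df.index[:1])
increment = 10
for i in range(1, len(df), increment):
    print("Processing {}-{} out of {}...".format(i, i + increment, len(df)))
    # .iloc slices by position; the index labels have gaps (0 to 2731 for 2355 rows)
    cleaned_text = df['Text'].iloc[i:i+increment].apply(clean_text)
    clean_series = pd.concat([clean_series, cleaned_text])
    print("Elapsed time: {} seconds.".format((datetime.now() - start).total_seconds()))
    # note: [:5] shows the first five rows of the batch, despite the label
    print("Last 5:\n{}".format(cleaned_text[:5]))
print("Complete.")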

Processing 1-11 out of 2355...
Elapsed time: 51.746393 seconds.
Last 5:
1    deadfalls snare frontispiece good deadfall dea...
2    transcriber note transcription use etext texts...
3    president national teacher association state n...
4    illustration snakepoison action antidote muell...
5    fifty year hunter trapper frontispiece e n woo...
Name: Text, dtype: object
Processing 11-21 out of 2355...
Elapsed time: 129.073485 seconds.
Last 5:
11    bird man author bird village adventure among b...
12    tozier online distribute proofread team httpsw...
13    produce image generously make available biodiv...
14    proofread team httpwwwpgdpnet page image gener...
15    distribute proofread canada team httpwwwpgdpca...
Name: Text, dtype: object
Processing 21-31 out of 2355...
Elapsed time: 157.794352 seconds.
Last 5:
21    illustration draw shimotori see page 120 regal...
22    animal past illustration phororhacos patagonia...
23    science trap describes fur bearing animal natu...
24   

Elapsed time: 910.719245 seconds.
Last 5:
252    story treasure seeker e nesbit adventure basta...
253    book cover coverjpg coral island tale pacific ...
254    rag dick street life new york bootblack horati...
255    story amulet e nesbit content chapter psammead...
256    little book christmas illustration author make...
Name: Text, dtype: object
Processing 231-241 out of 2355...
Elapsed time: 972.723057 seconds.
Last 5:
262    portion header copyright c 2001 michael hart m...
263    freedom cause g henty content glen cairn ii le...
264    coral island rm ballantyne chapter one beginni...
265    dragon raven day king alfred g henty c n e n p...
266    online distribute proofread team httpswwwpgdpn...
Name: Text, dtype: object
Processing 241-251 out of 2355...
Elapsed time: 1130.693178 seconds.
Last 5:
272    gorilla hunter rm ballantyne chapter one hunte...
274    online distribute proofread team httpwwwpgdpne...
275    personal memoir u grant complete u grant prefa...
276    under

Elapsed time: 2134.032435 seconds.
Last 5:
485    several edition ebook project gutenberg collec...
486    illustration argonautica apollonius rhodius or...
487    treatise government aristotle translate greek ...
488    proofread team cicero tusculan disputation als...
489    illustration hesiod homeric hymn homerica home...
Name: Text, dtype: object
Processing 451-461 out of 2355...
Elapsed time: 2207.10752 seconds.
Last 5:
495    team handy literal translation work horace tra...
496    transcriber note number bracket refer line num...
497    iliad homer literally translate explanatory no...
498    treatise friendship old age marcus tullius cic...
499    art poetry aristotle translate ingram bywater ...
Name: Text, dtype: object
Processing 461-471 out of 2355...
Elapsed time: 2325.94967 seconds.
Last 5:
506    fortunata jacinta do historias de casadas por ...
507    copyright 1998 r rudder la celestina por ferna...
508    la regenta por leopoldo ala « clarín » librerí...
509    also 

Elapsed time: 3659.090262 seconds.
Last 5:
729    adventure lavington george manville fenn lindo...
730    transcribed 1914 c fifield edition david price...
731    letter guardian australia new zealand shoghi e...
732    online distribute proofread team httpwwwpgdpne...
733    university wellington college education gender...
Name: Text, dtype: object
Processing 671-681 out of 2355...
Elapsed time: 3765.321264 seconds.
Last 5:
739    recollection private life napoleon complete co...
740    proofreader boy life napoleon afterwards emper...
741    history france guizot volume iv content xxviii...
742    napoleon bonaparte john sc abbott napoleon fin...
743    eugène sue les mystères de paris tome 18421843...
Name: Text, dtype: object
Processing 681-691 out of 2355...
Elapsed time: 3867.213115 seconds.
Last 5:
749    online distribute proofread team httpwwwpgdpne...
750    brittany byway fanny bury palliser edition 02 ...
751    online distribute proofreader europe httpdpras...
752    dis

Elapsed time: 5519.740576 seconds.
Last 5:
973    quest sacred slipper sax rohmer content chapte...
975    illustration recoil narrow hall drive uncontro...
976    bat wing sax rohmer illustration “ woman raise...
977    yellow claw sax rohmer content chapter lady ci...
978    dracula bram stoker illustration colophon new ...
Name: Text, dtype: object
Processing 891-901 out of 2355...
Elapsed time: 5590.823923 seconds.
Last 5:
984    secret agent simple tale joseph conrad second ...
985    mystery edwin drood charles dickens content ch...
986    work edgar allan poe edgar allan poe raven edi...
987    work edgar allan poe edgar allan poe raven edi...
988    work edgar allan poe edgar allan poe raven edi...
Name: Text, dtype: object
Processing 901-911 out of 2355...
Elapsed time: 5654.073047 seconds.
Last 5:
994     cover sign four arthur conan doyle content cha...
995     mysterious affair style agatha christie conten...
997     poirot investigate author mysterious affair st...
998    

Elapsed time: 7168.370775 seconds.
Last 5:
1202    transcriber ’ note text produce photoreprint 1...
1203    transcriber note story publish fantastic unive...
1204    online distribute proofread team httpswwwpgdpn...
1205    transcriber note initial ad move main text bee...
1207    time trader andre norton science fiction star ...
Name: Text, dtype: object
Processing 1101-1111 out of 2355...
Elapsed time: 7195.046424 seconds.
Last 5:
1213    black amazon mar novel leigh brackett transcri...
1214    key time andre norton publish world publish co...
1215    online distribute proofread team httpswwwpgdpn...
1216    illustration cover duel cosmic magician voodoo...
1217    astound story superscience sale first thursday...
Name: Text, dtype: object
Processing 1111-1121 out of 2355...
Elapsed time: 7227.073915 seconds.
Last 5:
1226    connecticut yankee king arthur court mark twai...
1227    illustration strange case dr jekyll mr hyde ro...
1230    image generously make available internet ar

Elapsed time: 8853.78742 seconds.
Last 5:
1454    illustration fig 1—the leardo map world 1452 1...
1455    transcriber note italic render italic woodcraf...
1456    camp trail illustration paint fernand lungren ...
1457    note project gutenberg also html version file ...
1459    sutherland project gutenberg online distribute...
Name: Text, dtype: object
Processing 1311-1321 out of 2355...
Elapsed time: 9670.407814 seconds.
Last 5:
1465    world factbook 1990 electronic version world f...
1466    cia world factbook 2009 content whats new know...
1467    cia world factbook 2001 content country locati...
1468    cia world factbook 2006 content country locati...
1469    cia world factbook 2005 content country locati...
Name: Text, dtype: object
Processing 1321-1331 out of 2355...
Elapsed time: 10011.446038 seconds.
Last 5:
1475    preliminary edition final first edition file a...
1476    cia online version book publish address httpww...
1477    edition project gutenberg edition plain van

Elapsed time: 11617.084187 seconds.
Last 5:
1695    charles fourier sein leben und seine theorien ...
1696    europe online distribute proofread team httpdp...
1697    die organisation der rohstoffversorgung vortra...
1698    proofreader team illustration portrait late si...
1699    online distribute proofread team mirror litera...
Name: Text, dtype: object
Processing 1521-1531 out of 2355...
Elapsed time: 11625.936549 seconds.
Last 5:
1705    proofread team mirror literature amusement ins...
1706    mirror literature amusement instruction vol xi...
1707    mirror literature amusement instruction vol 13...
1708    mirror literature amusement instruction vol 12...
1709    proofreader mirror literature amusement instru...
Name: Text, dtype: object
Processing 1531-1541 out of 2355...
Elapsed time: 11659.570041 seconds.
Last 5:
1715    mirror literature amusement instruction vol 20...
1716    mirror literature amusement instruction vol 10...
1717    mirror literature amusement instruction 

Elapsed time: 13002.553905 seconds.
Last 5:
1931    christian foundation scientific religious jour...
1932    transcriber note spell punctuation inconsisten...
1933    christian foundation scientific religious jour...
1934    scientific religious journal vol september 188...
1935    scientific religious journal vol 1 october 188...
Name: Text, dtype: object
Processing 1731-1741 out of 2355...
Elapsed time: 13047.557051 seconds.
Last 5:
1941    tablet divine plan ‘ abdu ’ lbahá edition 1 se...
1942    kitábiaqdas bahá ’ u ’ lláh edition 1 june 21 ...
1943    gem divine mystery bahá ’ u ’ lláh edition 1 j...
1944    hidden word bahá ’ u ’ lláh bahá ’ u ’ lláh ed...
1945    answered question ‘ abdu ’ lbahá edition 1 sep...
Name: Text, dtype: object
Processing 1741-1751 out of 2355...
Elapsed time: 13108.699599 seconds.
Last 5:
1951    bahá ’ í prayer selection prayer reveal bahá ’...
1952    promulgation universal peace ‘ abdu ’ lbahá ed...
1953    god pass shoghi effendi edition 1 septem

Elapsed time: 14610.586642 seconds.
Last 5:
2171    microscope something science together many cur...
2172    microscope article contribute andrew ross “ pe...
2174    volume eleven number four journal entomology z...
2175    proofread team httpswwwpgdpnet volume eleven n...
2176    funghi mangerecci e velenosi delleuropa medium...
Name: Text, dtype: object
Processing 1941-1951 out of 2355...
Elapsed time: 14662.788184 seconds.
Last 5:
2182    online distribute proofread team httpswwwpgdpn...
2183    aganetha dyck online distribute proofread team...
2184    anmerkungen zur transkription im original gesp...
2185    online distribute proofread team zoonomia law ...
2186    treatise anatomy physiology hygiene design col...
Name: Text, dtype: object
Processing 1951-1961 out of 2355...
Elapsed time: 14704.20648 seconds.
Last 5:
2192    brain voice speech song fw mott fr md frcp 191...
2193    lockyer ’ astronomy element astronomy accompan...
2194    typographical error whether correct liste

Elapsed time: 15645.702116 seconds.
Last 5:
2435    dp team illustration scientific american suppl...
2436    pg distribute proofreader illustration scienti...
2437    online distribute proofread team httpwwwpgdpne...
2438    charles frank dp team illustration scientific ...
2439    frank online distribute proofreader team illus...
Name: Text, dtype: object
Processing 2151-2161 out of 2355...
Elapsed time: 15732.075662 seconds.
Last 5:
2445    illustration scientific american supplement 46...
2447    several edition ebook project gutenberg collec...
2448    transcriber note typographical error correct l...
2456    vertebrata etext prepared teary eye anderson d...
2458    public domain work university michigan digital...
Name: Text, dtype: object
Processing 2161-2171 out of 2355...
Elapsed time: 15794.838238 seconds.
Last 5:
2466    form function contribution history animal morp...
2469    illustration evolution man scientifically disp...
2470    specie variety origin mutation lecture d

Elapsed time: 16891.806966 seconds.
Last 5:
2728    transcriber note obvious printer error correct...
2729    note project gutenberg also html version file ...
2730    note project gutenberg also html version file ...
2731    online distribute proofread team httpwwwpgdpne...
Name: Text, dtype: object
Complete.
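
Because a run this long loses everything if it is interrupted, one option is to write each batch to disk as it finishes and resume from the last saved row. The sketch below is an assumption-laden illustration, not the method used above: `clean_in_batches`, `checkpoint_path`, and the JSON Lines checkpoint format are all hypothetical choices, and `texts` is expected to be a plain list (for example `df['Text'].tolist()`).

```python
import json
import os

def clean_in_batches(texts, clean_fn, checkpoint_path, batch_size=10):
    """Clean texts in batches, appending one JSON record per text to a checkpoint.

    Returns the number of texts skipped because they were already saved.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        # One line per cleaned text, so the line count is the resume position.
        with open(checkpoint_path, encoding="utf-8") as f:
            done = sum(1 for _ in f)
    for start in range(done, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        with open(checkpoint_path, "a", encoding="utf-8") as f:
            for offset, text in enumerate(batch):
                record = {"position": start + offset, "text": clean_fn(text)}
                # json.dumps escapes newlines, keeping each record on one line.
                f.write(json.dumps(record) + "\n")
    return done
```

A call such as `clean_in_batches(df['Text'].tolist(), clean_text, "Data/checkpoint.jsonl")` (the path is a placeholder) could then be rerun after a crash without repeating finished work. The stored `position` is the row position, not the DataFrame index label.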


In [142]:
clean_series.to_csv("Data/cleaned_texts.csv") # store this away for later