Initial data examination and processing

Having collected the book texts from Project Gutenberg, we take a first pass over the data here and pre-process from there.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
path_to_data = "Data/gutenberg_data.csv"
df_data = pd.read_csv(path_to_data)
df_data.head()

Unnamed: 0,Title,Author,Link,ID,Bookshelf,Text
0,The Extermination of the American Bison,William T. Hornaday,http://www.gutenberg.org/ebooks/17748,17748,Animal,[Illustration: (Inscription) Mr. Theodore Roos...
1,Deadfalls and Snares,A. R. Harding,http://www.gutenberg.org/ebooks/34110,34110,Animal,DEADFALLS AND SNARES [Frontispiece: A GOOD DEA...
2,Artistic Anatomy of Animals,Édouard Cuyer,http://www.gutenberg.org/ebooks/38315,38315,Animal,+---------------------------------------------...
3,"Birds, Illustrated","Color Photography, Vol. 1, No. 1 Various",http://www.gutenberg.org/ebooks/30221,30221,Animal,FROM: THE PRESIDENT OF THE NATIONAL TEACHERS' ...
4,On Snake-Poison: Its Action and Its Antidote,A. Mueller,http://www.gutenberg.org/ebooks/32947,32947,Animal,[Illustration] ON SNAKE-POISON. ITS ACTION AND...


In [5]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15331 entries, 0 to 15330
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      15331 non-null  object
 1   Author     14135 non-null  object
 2   Link       15331 non-null  object
 3   ID         15331 non-null  int64 
 4   Bookshelf  2732 non-null   object
 5   Text       14891 non-null  object
dtypes: int64(1), object(5)
memory usage: 718.8+ KB


Using the earlier script, we imported 15331 titles from Project Gutenberg and collected text for 14891 of them. Books missing text are not useful for this project and will be dropped. Only 2732 titles kept a Bookshelf value, so we will need to add a genre column as our target for this analysis. There are also 1196 null authors, but since author is not used in the analysis these nulls can be ignored.
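
The null counts quoted above can also be read off directly instead of being derived from `info()`. A minimal sketch, using a small made-up frame in place of `df_data` (the rows are invented for illustration):

```python
import pandas as pd

# Toy stand-in for df_data; values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Author": ["X", None, "Z", None],
    "Bookshelf": ["Animal", None, None, None],
    "Text": ["some text", "more text", None, "other text"],
})

null_counts = toy.isna().sum()    # missing values per column
null_share = toy.isna().mean()    # fraction of rows missing each column

print(null_counts)
print(null_share.round(2))
```

Running the same two calls on `df_data` should give the per-column gaps without subtracting non-null counts by hand.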

In [12]:
df_data.Bookshelf.value_counts()

Children's    213
FR             84
The            77
United         61
World          50
             ... 
General         2
Journal         2
Bulgaria        2
Maps            1
Norway          1
Name: Bookshelf, Length: 118, dtype: int64

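As suspected, the Bookshelf attribute is not a valid target for genre prediction: it is missing for most titles, and values such as "The", "United", and "World" look like truncated shelf names rather than genres. We will need to fill in the genres from an external source.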
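
One way to put numbers on how weak a label column is: measure how many rows carry a label at all and how many labels occur only once. This is a sketch on an invented series standing in for `df_data.Bookshelf`.

```python
import pandas as pd

# Toy stand-in for df_data.Bookshelf; labels are invented for illustration.
shelf = pd.Series(
    ["Children's", "Children's", "The", "United", None, None, None, "Maps"]
)

coverage = shelf.notna().mean()    # share of rows that have any label
counts = shelf.value_counts()      # NaN is excluded by default
rare = counts[counts < 2]          # labels seen only once

print(f"labelled rows: {coverage:.0%}")
print(f"distinct labels: {counts.size}, seen once: {rare.size}")
```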

In [6]:
df_data.iloc[0]

Title                  The Extermination of the American Bison
Author                                     William T. Hornaday
Link                     http://www.gutenberg.org/ebooks/17748
ID                                                       17748
Bookshelf                                               Animal
Text         [Illustration: (Inscription) Mr. Theodore Roos...
Name: 0, dtype: object

In [8]:
len(df_data.iloc[0].Text)

550092

In [9]:
df_data.iloc[0].Text

'[Illustration: (Inscription) Mr. Theodore Roosevelt. Author of "Hunting Trips of a Ranchman," With the compliments of The Author, W.T. Hornaday.] SMITHSONIAN INSTITUTION. UNITED STATES NATIONAL MUSEUM. * * * * * THE EXTERMINATION OF THE AMERICAN BISON. BY WILLIAM T. HORNADAY, _Superintendent of the National Zoological Park._ * * * * * From the Report of the National Museum, 1886-\'87, pages 369-548, and plates I-XXII. * * * * * WASHINGTON GOVERNMENT PRINTING OFFICE. 1889. [Illustration: GROUP OF AMERICAN BISONS IN THE NATIONAL MUSEUM. Collected and mounted by W. T. Hornaday.] CONTENTS. PREFATORY NOTE PART I.--THE LIFE HISTORY OF THE BISON I. Discovery of the species II. Geographical distribution III. Abundance IV. Character of the species 1. The buffalo\'s rank amongst ruminants 2. Change of form in captivity 3. Mounted specimens in museums 4. The calf 5. The yearling 6. The spike bull 7. The adult bull 8. The cow in the third year 9. The adult cow 10. The "Wood" or "Mountain Buffalo"

A look at the text value for the first book in our set ("The Extermination of the American Bison" by William T. Hornaday) shows that we will need to do a thorough cleaning of the set before any statistical processing can be done, but that the book did import properly.
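
To give a sense of what that cleaning might involve, here is a rough first-pass cleaner. The function name `rough_clean` and its three rules are assumptions for illustration, not the final pipeline; the sample string is abbreviated from the output above.

```python
import re

def rough_clean(text: str) -> str:
    """Hypothetical first-pass cleaner; the rules are illustrative assumptions."""
    text = re.sub(r"\[Illustration[^\]]*\]", " ", text)  # drop illustration captions
    text = re.sub(r"(\*\s*){2,}", " ", text)              # drop "* * * * *" separators
    text = re.sub(r"\s+", " ", text)                      # collapse runs of whitespace
    return text.strip()

sample = "[Illustration: GROUP OF AMERICAN BISONS.] CONTENTS. * * * * * PREFATORY NOTE"
print(rough_clean(sample))
```

Each rule is a plain regular expression, so further patterns (transcriber's notes, proofreading credits) could be added the same way once we know which ones recur.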

Now we'll start trimming down before engineering the genres. First we'll drop all titles without an associated text.

In [25]:
df_trimmed = df_data.dropna(axis=0, subset=['Text']).copy()
df_trimmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14891 entries, 0 to 15330
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      14891 non-null  object
 1   Author     13729 non-null  object
 2   Link       14891 non-null  object
 3   ID         14891 non-null  int64 
 4   Bookshelf  2665 non-null   object
 5   Text       14891 non-null  object
dtypes: int64(1), object(5)
memory usage: 814.4+ KB


Now let's have a look at the text column and see how the numbers stack up.

In [26]:
df_trimmed.Text.describe()

count                                                 14891
unique                                                 8781
top       501(c)(3) educational corporation organized un...
freq                                                     24
Name: Text, dtype: object

In [27]:
df_trimmed.Text.value_counts().values.tolist()

[24, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, ..., 4, 4, 4, ...
(output abbreviated: one count of 24, one of 7, twelve of 6, twenty-two of 5, then 4s for the rest of the displayed output)

Those numbers look suspicious. Perhaps there are duplicates of the same texts?

In [28]:
len(df_trimmed.Title.unique())

8571

That's a yes: there are only 8571 unique titles across 14891 rows. Let's drop the rows that repeat the same title and text.

In [30]:
df_trimmed.drop_duplicates(inplace=True, subset=['Title', 'Text'])
df_trimmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8790 entries, 0 to 15305
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      8790 non-null   object
 1   Author     7982 non-null   object
 2   Link       8790 non-null   object
 3   ID         8790 non-null   int64 
 4   Bookshelf  2355 non-null   object
 5   Text       8790 non-null   object
dtypes: int64(1), object(5)
memory usage: 480.7+ KB
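
The choice of `subset` matters here. A small sketch on invented rows shows why deduplicating on the title/text pair keeps multi-volume works that deduplicating on title alone would lose:

```python
import pandas as pd

# Toy frame: invented rows with an exact repeat and a second volume.
toy = pd.DataFrame({
    "Title": ["Birds", "Birds", "Birds", "Oz"],
    "Text":  ["vol 1", "vol 1", "vol 2", "oz text"],
})

by_pair = toy.drop_duplicates(subset=["Title", "Text"])  # removes exact repeats only
by_title = toy.drop_duplicates(subset=["Title"])         # would also drop "vol 2"

print(len(by_pair), len(by_title))
```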


In [31]:
df_trimmed.Title.value_counts()

Slave Narratives: A Folk History of Slavery in the United States from Interviews    33
Birds, Illustrated                                                                  12
Mémoires du duc de Rovigo, pour servir à l'histoire de l'empereur Napoléon, Tome     7
Trial of the Major War Criminals Before the International Military Tribunal,         7
A Child's Garden of Verses                                                           5
                                                                                    ..
Instincts of the Herd in Peace and War                                               1
The Reign of Tiberius, Out of the First Six Annals of Tacitus;                       1
Diane de Poitiers                                                                    1
Security                                                                             1
Frank Merriwell's Races                                                              1
Name: Title, Length: 8571, dtype: int64

In [44]:
df_trimmed[df_trimmed.Title == "Slave Narratives: A Folk History of Slavery in the United States from Interviews"].iloc[3].Text

'https://www.pgdp.net (This file was produced from images generously made available by the Library of Congress, Manuscript Division) +--------------------------------------------------------------+ | | | Transcriber\'s Note: | | | | I. Inconsistent punctuation has been silently corrected | | throughout the book. | | | | II. Clear spelling mistakes have been corrected however, | | inconsistent language usage (such as \'day\' and \'dey\') | | has been maintained. A list of corrections is included | | at the end of the book. | | | | III. The numbers at the start of each interview were stamped | | into the original work and refer to the number of the | | published interview in the context of the entire Slave | | Narratives project. | | | | IV. Two handwritten notes have been retained and are | | annotated as such. | | | | | +--------------------------------------------------------------+ SLAVE NARRATIVES _A Folk History of Slavery in the United States From Interviews with Former Slaves_ TY

While there are still repeated titles, they appear to be distinct texts. For the moment I will keep them, though we may have to re-evaluate if the genre collection fails because of this.
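
Rather than spot-checking one row, the "distinct texts" claim could be checked for every repeated title at once. A sketch on invented rows, assuming the same `Title` and `Text` column names:

```python
import pandas as pd

# Toy frame; titles and texts are invented for illustration.
toy = pd.DataFrame({
    "Title": ["Slave Narratives", "Slave Narratives", "Oz"],
    "Text":  ["volume one ...", "volume two ...", "oz text"],
})

# Rows and distinct texts per title.
per_title = toy.groupby("Title")["Text"].agg(rows="size", distinct_texts="nunique")

# Repeated titles whose rows are all different texts (e.g. separate volumes).
all_distinct = per_title[
    (per_title["rows"] > 1) & (per_title["rows"] == per_title["distinct_texts"])
]
print(all_distinct)
```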

In [63]:
df_trimmed.Text.value_counts().values.tolist()[:5]

[10, 1, 1, 1, 1]

There are still some repeated texts, so let's look into those.

In [53]:
df_trimmed.Text.value_counts().index[0]

'501(c)(3) educational corporation organized under the laws of the state of Mississippi and granted tax exempt status by the Internal Revenue Service. The Foundation’s EIN or federal tax identification number is 64-6221541. Its 501(c)(3) letter is posted at http://www.gutenberg.org/fundraising/pglaf. Contributions to the Project Gutenberg Literary Archive Foundation are tax deductible to the full extent permitted by U.S. federal laws and your state’s laws. The Foundation’s principal office is located at 4557 Melan Dr. S. Fairbanks, AK, 99712., but its volunteers and employees are scattered throughout numerous locations. Its business office is located at 809 North 1500 West, Salt Lake City, UT 84116, (801) 596-1887, email business@pglaf.org. Email contact links and up to date contact information can be found at the Foundation’s web site and official page at http://www.pglaf.org For additional contact information: Dr. Gregory B. Newby Chief Executive and Director gbnewby@pglaf.org Sectio

In [66]:
mask = df_trimmed.duplicated(subset=['Text'], keep=False)
df_trimmed[mask]

Unnamed: 0,Title,Author,Link,ID,Bookshelf,Text
3409,Declaration of Independence of the United Stat...,Thomas Jefferson,http://www.gutenberg.org/ebooks/19785,19785,,501(c)(3) educational corporation organized un...
5590,American Indian Fairy Tales,W. T. Larned,http://www.gutenberg.org/ebooks/19579,19579,,501(c)(3) educational corporation organized un...
5717,A Christmas Carol,Charles Dickens,http://www.gutenberg.org/ebooks/19505,19505,,501(c)(3) educational corporation organized un...
5882,Amendments to the United States Constitution,United States,http://www.gutenberg.org/ebooks/19581,19581,,501(c)(3) educational corporation organized un...
5957,Dorothy and the Wizard in Oz,L. Frank Baum,http://www.gutenberg.org/ebooks/19450,19450,,501(c)(3) educational corporation organized un...
6901,Évangile selon Jean,Anonymous,http://www.gutenberg.org/ebooks/19842,19842,,501(c)(3) educational corporation organized un...
7017,Biblia Sacra Vulgata - Psalmi XXII,Anonymous,http://www.gutenberg.org/ebooks/19635,19635,,501(c)(3) educational corporation organized un...
7740,Little Wizard Stories of Oz,L. Frank Baum,http://www.gutenberg.org/ebooks/19467,19467,,501(c)(3) educational corporation organized un...
9894,Der Schimmelreiter,Theodor Storm,http://www.gutenberg.org/ebooks/19790,19790,,501(c)(3) educational corporation organized un...
11232,Denslow's Three Bears,W. W. Denslow,http://www.gutenberg.org/ebooks/19788,19788,,501(c)(3) educational corporation organized un...


These all appear to be audiobooks with no associated body text; the only text captured is the Project Gutenberg Literary Archive Foundation boilerplate. They can be dropped from our data set.
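
Exact-duplicate matching only finds these rows because their boilerplate is byte-for-byte identical. A prefix check is one possible alternative that would also catch a single stray copy. This is a sketch on invented rows; the prefix is copied from the repeated text shown above.

```python
import pandas as pd

# Prefix copied from the repeated text shown above.
BOILERPLATE_PREFIX = "501(c)(3) educational corporation"

# Toy frame standing in for df_trimmed; the text bodies are abbreviated.
toy = pd.DataFrame({
    "Title": ["A Christmas Carol", "Deadfalls and Snares"],
    "Text":  [
        "501(c)(3) educational corporation organized under ...",
        "DEADFALLS AND SNARES ...",
    ],
})

is_boilerplate = toy["Text"].str.startswith(BOILERPLATE_PREFIX)
kept = toy[~is_boilerplate]
print(kept["Title"].tolist())
```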

In [67]:
df_trimmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8790 entries, 0 to 15305
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      8790 non-null   object
 1   Author     7982 non-null   object
 2   Link       8790 non-null   object
 3   ID         8790 non-null   int64 
 4   Bookshelf  2355 non-null   object
 5   Text       8790 non-null   object
dtypes: int64(1), object(5)
memory usage: 480.7+ KB


In [68]:
df_trimmed.drop_duplicates(subset=['Text'], keep=False, inplace=True)
df_trimmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8780 entries, 0 to 15305
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      8780 non-null   object
 1   Author     7972 non-null   object
 2   Link       8780 non-null   object
 3   ID         8780 non-null   int64 
 4   Bookshelf  2355 non-null   object
 5   Text       8780 non-null   object
dtypes: int64(1), object(5)
memory usage: 480.2+ KB
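
Note that `keep=False` is what removes every copy of a repeated text. With the default `keep='first'`, one of the boilerplate rows would survive. A small sketch on invented data:

```python
import pandas as pd

# Toy frame; "license" stands in for the repeated boilerplate text.
toy = pd.DataFrame({"Text": ["license", "license", "book a", "book b"]})

keep_first = toy.drop_duplicates(subset=["Text"])              # one "license" row remains
keep_none = toy.drop_duplicates(subset=["Text"], keep=False)   # both "license" rows go

print(len(keep_first), len(keep_none))
```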


With those repeats removed, let's look at the texts again.

In [72]:
df_trimmed.Text.str.len().describe()

count    8.780000e+03
mean     3.637440e+05
std      4.921112e+05
min      4.930000e+02
25%      8.955800e+04
50%      2.528160e+05
75%      4.765695e+05
max      1.090011e+07
Name: Text, dtype: float64
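
With a minimum of only 493 characters, it may be worth listing the shortest texts and flagging anything under a cut-off. The sketch below uses invented rows, and `MIN_CHARS` is a hypothetical threshold, not a value chosen in this analysis.

```python
import pandas as pd

# Toy frame; lengths are chosen to mimic a very short and a long text.
toy = pd.DataFrame({
    "Title": ["Carol sheet", "Long book", "Pamphlet"],
    "Text":  ["x" * 493, "x" * 250_000, "x" * 2_000],
})

lengths = toy["Text"].str.len()
MIN_CHARS = 1_000  # hypothetical cut-off

# The two shortest texts, with their character counts alongside.
shortest = toy.loc[lengths.nsmallest(2).index, ["Title"]].assign(chars=lengths)
too_short = toy[lengths < MIN_CHARS]

print(shortest)
print(too_short["Title"].tolist())
```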

In [83]:
# The shortest text in the set
text_lengths = df_trimmed.Text.str.len()
df_trimmed[text_lengths == text_lengths.min()].Text.tolist()

["CHRISTMAS*** Provided by McGuinn's Folk Den (http://www.ibiblio.org/jimmy/folkden) We Wish You a Merry Christmas (English Traditional, 16th Century) We wish you a merry Christmas We wish you a merry Christmas We wish you a merry Christmas And a happy New Year. Oh bring us some figgy pudding (3x) With a cup of good cheer. We won't go until we get it (3x) We all love our figgy pudding (3x) With a cup of good cheer So bring it out here! We wish you a merry Christmas (3x) And a happy New Year"]

In [87]:
# The longest text in the set (first 1000 characters only)
text_lengths = df_trimmed.Text.str.len()
df_trimmed[text_lengths == text_lengths.max()].Text.tolist()[0][:1000]



The texts are now non-null and free of exact duplicates, though they still carry front matter and license residue that will need cleaning later. Now let's add some genres so we can start processing things.

In [89]:
df_trimmed.head(10)

Unnamed: 0,Title,Author,Link,ID,Bookshelf,Text
0,The Extermination of the American Bison,William T. Hornaday,http://www.gutenberg.org/ebooks/17748,17748,Animal,[Illustration: (Inscription) Mr. Theodore Roos...
1,Deadfalls and Snares,A. R. Harding,http://www.gutenberg.org/ebooks/34110,34110,Animal,DEADFALLS AND SNARES [Frontispiece: A GOOD DEA...
2,Artistic Anatomy of Animals,Édouard Cuyer,http://www.gutenberg.org/ebooks/38315,38315,Animal,+---------------------------------------------...
3,"Birds, Illustrated","Color Photography, Vol. 1, No. 1 Various",http://www.gutenberg.org/ebooks/30221,30221,Animal,FROM: THE PRESIDENT OF THE NATIONAL TEACHERS' ...
4,On Snake-Poison: Its Action and Its Antidote,A. Mueller,http://www.gutenberg.org/ebooks/32947,32947,Animal,[Illustration] ON SNAKE-POISON. ITS ACTION AND...
5,Fifty Years a Hunter and Trapper,E. N. Woodcock,http://www.gutenberg.org/ebooks/34063,34063,Animal,FIFTY YEARS A HUNTER AND TRAPPER [Frontispiece...
6,What Bird is That?,Frank M. Chapman,http://www.gutenberg.org/ebooks/31751,31751,Animal,Online Distributed Proofreading Team at https:...
7,Fox Trapping: A Book of Instruction Telling Ho...,,http://www.gutenberg.org/ebooks/34076,34076,Animal,FOX TRAPPING [Frontispiece: FALL CATCH] FOX TR...
8,A Guide for the Study of Animals,"Lucas, Shinn, Smallwood, and Whitney",http://www.gutenberg.org/ebooks/34984,34984,Animal,Proofreading Team at https://www.pgdp.net Tran...
9,Our Vanishing Wild Life: Its Extermination and...,William T. Hornaday,http://www.gutenberg.org/ebooks/13249,13249,Animal,"""_I know no way of judging of the Future but b..."


Browsing the web for the first few books, I am not sure how best to add the genres. Goodreads was my initial plan, but many of the texts do not have an entry on the site. Some have a Wikipedia page, whose categories could serve as reasonable genres. Others have a Google Books result, some of which carry genre tags, and others can be found on Amazon or Barnes & Noble, which may have genre tags. There is little consistency in genre availability across these sources, however.
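
If several sources are eventually combined, one possible shape is a simple fallback chain that takes the first source with an answer and records where it came from. Everything in this sketch is hypothetical: the dictionaries stand in for real lookups, the genre values are invented, and no real site API is being called.

```python
from typing import Optional, Tuple

# Hypothetical per-source results; plain dicts stand in for real lookups,
# and the genre values are invented for illustration.
goodreads_genres = {"Dorothy and the Wizard in Oz": "Fantasy"}
wikipedia_genres = {"A Christmas Carol": "Novella"}

# Sources in order of preference.
sources = [
    ("goodreads", goodreads_genres.get),
    ("wikipedia", wikipedia_genres.get),
]

def first_genre(title: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (genre, source) from the first source with an answer, else (None, None)."""
    for name, lookup in sources:
        genre = lookup(title)
        if genre is not None:
            return genre, name
    return None, None

print(first_genre("A Christmas Carol"))
print(first_genre("Deadfalls and Snares"))
```

Keeping the source name next to each genre would make it possible to audit or re-weight inconsistent sources later.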

In [90]:
df_trimmed.Bookshelf.value_counts()

Children's     199
FR              82
The             77
United          48
Crime           48
              ... 
Mathematics      2
Maps             1
Bulgaria         1
American         1
Norway           1
Name: Bookshelf, Length: 118, dtype: int64

In [92]:
df_trimmed[df_trimmed['Bookshelf'] == "The"]

Unnamed: 0,Title,Author,Link,ID,Bookshelf,Text
1348,The Brochure Series of Architectural Illustrat...,,http://www.gutenberg.org/ebooks/15020,15020,The,"[Illustration: LXXXI. Ferme la Vallauine, Norm..."
1349,The Brochure Series of Architectural Illustrat...,,http://www.gutenberg.org/ebooks/14987,14987,The,"[Illustration: LXXIII. Ferme de Turpe, Normand..."
1350,The Brochure Series of Architectural Illustrat...,Various,http://www.gutenberg.org/ebooks/25735,25735,The,THE BROCHURE SERIES OF ARCHITECTURAL ILLUSTRAT...
1351,The Brochure Series of Architectural Illustrat...,,http://www.gutenberg.org/ebooks/13489,13489,The,the Online Distributed Proofreading Team THE B...
1352,The Brochure Series of Architectural Illustrat...,,http://www.gutenberg.org/ebooks/15091,15091,The,[Illustration: IX. The Principal Doorway to th...
...,...,...,...,...,...,...
1937,"The Christian Foundation, Or, Scientific and R...",,http://www.gutenberg.org/ebooks/28678,28678,The,Scientific and Religious Journal. VOL. I. DECE...
1938,"The Christian Foundation, Or, Scientific and R...",,http://www.gutenberg.org/ebooks/28668,28668,The,Scientific and Religious Journal. VOL. I. JULY...
1939,"The Christian Foundation, Or, Scientific and R...",,http://www.gutenberg.org/ebooks/28669,28669,The,Scientific and Religious Journal. VOL. I. AUGU...
2488,"The Illustrated War News, Number 21, Dec. 30, ...",Various,http://www.gutenberg.org/ebooks/18334,18334,The,N.B.--REMOVE INSETTED LEAFLET EACH NUMBER THE ...


Due to current time constraints, I will proceed with Bookshelf as the target, with the intent to revisit later and engineer a more valid genre. The remainder of the processing and modeling should follow the same flow regardless, so this will be an easy substitution in later work.

In [93]:
df_bookshelf = df_trimmed.dropna(axis = 0, subset=['Bookshelf']).copy()
df_bookshelf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2731
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      2355 non-null   object
 1   Author     2119 non-null   object
 2   Link       2355 non-null   object
 3   ID         2355 non-null   int64 
 4   Bookshelf  2355 non-null   object
 5   Text       2355 non-null   object
dtypes: int64(1), object(5)
memory usage: 128.8+ KB


In [95]:
df_bookshelf.Bookshelf.describe()

count           2355
unique           118
top       Children's
freq             199
Name: Bookshelf, dtype: object
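
With 118 shelves and some of them holding a single book, very rare classes may need to be filtered out before modeling. A sketch on invented rows; `MIN_PER_CLASS` is a hypothetical threshold, not one chosen in this analysis.

```python
import pandas as pd

# Toy frame standing in for df_bookshelf; rows are invented.
toy = pd.DataFrame({
    "Bookshelf": ["Children's"] * 3 + ["Crime"] * 2 + ["Maps"],
    "Text": ["t"] * 6,
})

MIN_PER_CLASS = 2  # hypothetical threshold

counts = toy["Bookshelf"].value_counts()
keep_labels = counts[counts >= MIN_PER_CLASS].index
filtered = toy[toy["Bookshelf"].isin(keep_labels)]

print(filtered["Bookshelf"].value_counts())
```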

In [94]:
df_trimmed.to_csv("Data/trimmed_data.csv") # Saving for later use
df_bookshelf.to_csv("Data/bookshelf_data.csv") # Saving for first use
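
One thing to keep in mind when these files are loaded again: `to_csv` writes the index by default, so a plain `read_csv` brings it back as an extra `Unnamed: 0` column. The sketch below shows the round trip on an invented frame, using in-memory buffers instead of files.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as df_trimmed has after dropping rows.
toy = pd.DataFrame({"Title": ["A", "B"], "ID": [17748, 34110]}, index=[0, 5])

with_index = io.StringIO()
toy.to_csv(with_index)                  # default: index written as an unnamed first column
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # index left out of the file
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Either passing `index=False` when saving or `index_col=0` when loading avoids the extra column. Which one fits depends on whether the original row numbers are still needed.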