# Get text from Project Gutenberg

[Project Gutenberg](https://www.gutenberg.org/) is an excellent source of text data from books that are out of copyright. Project Gutenberg posts plain text files of ~70,000 different texts. In the following blocks of code we will walk through a process of scraping, organizing, and cleaning up some of [Charles Dickens's](https://en.wikipedia.org/wiki/Charles_Dickens) novels.

## Set up

In [1]:
# imports
import requests
import pandas as pd
import spacy

In [2]:
# set up nlp pipeline
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

### Example: Tale of Two Cities

We will start by searching for the text we want in [Project Gutenberg](https://www.gutenberg.org/). In this first example, we will search for _A Tale of Two Cities_. From this search result we can get a url to the plain text file.

In [3]:
# use requests library to load text
response= requests.get("https://gutenberg.org/cache/epub/98/pg98.txt")
text= response.text

In [4]:
text[:250]

'\ufeffThe Project Gutenberg eBook of A Tale of Two Cities\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or '
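The `\ufeff` at the start of the output above is a Unicode byte order mark (BOM) that comes from the file's UTF-8 encoding. It does no harm here because we trim the front matter later, but it can be dropped while decoding. A minimal sketch, assuming you decode the raw bytes yourself; the sample bytes are made up for illustration and stand in for `response.content`:

```python
# Hypothetical sample standing in for response.content (raw bytes from the server)
raw_bytes = b'\xef\xbb\xbfThe Project Gutenberg eBook of A Tale of Two Cities\r\n'

# 'utf-8' keeps the byte order mark as the character '\ufeff'
with_bom = raw_bytes.decode('utf-8')

# 'utf-8-sig' decodes the same bytes but drops a leading byte order mark
without_bom = raw_bytes.decode('utf-8-sig')

print(repr(with_bom[:12]))
print(repr(without_bom[:11]))
```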

### Isolate the text we are interested in

If you look closely at the text, you will notice that there is a lot of front matter and back matter that is superfluous to the novel. We are not interested in that material, so we need to find a way to cut it.

We can do this by finding the index to the starting text and then the index to the ending text. We can use the start and end index numbers to isolate the text we are interested in.

In [5]:
# find start index
text.find('It was the best of times, it was the worst of times, it was the age of')

2731

In [6]:
text[2731]

'I'

In [7]:
# find end index
text.find('*** END OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***')

774307

In [8]:
# set start and end variables
start=2731
end=774307 -1

In [9]:
# trim text
tale= text[start:end]
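The find-then-slice steps above can be wrapped in a small helper that fails loudly when a marker is missing, rather than silently slicing from `-1`. This is only a sketch: `slice_between` is a hypothetical name, not part of the notebook, and the sample string is made up.

```python
def slice_between(text, start_marker, end_marker):
    """Return the part of text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f'start marker not found: {start_marker!r}')
    # search for the end marker only after the start position
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f'end marker not found: {end_marker!r}')
    return text[start:end]

# tiny made-up sample in place of the full Gutenberg file
sample = 'front matter\r\nIt was the best of times.\r\n*** END OF BOOK ***\r\nlicense'
body = slice_between(sample, 'It was the best', '*** END')
print(repr(body))
```

Because Python slices exclude the end index, `text[start:end]` already stops just before the end marker.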

### Split text into paragraphs for easier processing

We now have a very long text string containing the entire text of _A Tale of Two Cities_. For ease of processing, we will cut the text up into paragraphs (you could also split by chapter if you wanted). Project Gutenberg uses a sequence of carriage returns and new lines to separate the paragraphs. In the raw text this sequence looks like this: `\r\n\r\n`. We can use this sequence to split the text into paragraphs.

In [10]:
# split into paragraphs
tale_paras= tale.split('\r\n\r\n')

In [11]:
type(tale_paras)

list

In [12]:
len(tale_paras)

3364
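Splitting on `\r\n\r\n` can produce empty strings wherever the text has more than one blank line in a row. The sketch below uses a made-up sample string to show the effect and one possible way to drop the empty entries; whether you want to drop them at this stage is a judgment call.

```python
# made-up sample: the gap before 'Third.' contains two paragraph breaks in a row
sample = 'First paragraph,\r\nstill first.\r\n\r\nSecond paragraph.\r\n\r\n\r\n\r\nThird.'

# same split as in the notebook
paras = sample.split('\r\n\r\n')

# keep only entries that contain something other than whitespace
non_empty = [para for para in paras if para.strip()]

print(len(paras), len(non_empty))
```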

### Organize paragraphs into a DataFrame for processing and storage

As we have seen, a Pandas DataFrame is a handy way to organize data for processing and storage. Here, we will create a DataFrame with columns for author, title, and text where each row represents a paragraph. Of course, in this example, the author and title will be the same for each row.

In [13]:
# creating empty lists for author and title will be handy for building our dataframe
author = []
title = []

In [14]:
# this little for-loop will populate the lists we just created
for para in tale_paras:
    author.append('Dickens')
    title.append('Tale')

In [15]:
len(author) == len(tale_paras)

True

We now have three lists of equal length: `author`, `title`, and `tale_paras`. We can use the `zip` function to combine the lists row by row and then turn the result into a DataFrame.

In [16]:
tale_df = pd.DataFrame(list(zip(author, title, tale_paras)), columns=['author', 'title', 'text'])

In [17]:
# Sanity check
tale_df.head()

Unnamed: 0,author,title,text
0,Dickens,Tale,"It was the best of times, it was the worst of ..."
1,Dickens,Tale,There were a king with a large jaw and a queen...
2,Dickens,Tale,It was the year of Our Lord one thousand seven...
3,Dickens,Tale,"France, less favoured on the whole as to matte..."
4,Dickens,Tale,"In England, there was scarcely an amount of or..."
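As an alternative to building the `author` and `title` lists by hand, pandas can repeat a single value down a column when the DataFrame is built from a dictionary that also contains a list. A minimal sketch with made-up paragraphs:

```python
import pandas as pd

# made-up paragraphs standing in for tale_paras
paras = ['First paragraph.', 'Second paragraph.', 'Third paragraph.']

# the scalar values are repeated for every row; the list sets the number of rows
df = pd.DataFrame({'author': 'Dickens', 'title': 'Tale', 'text': paras})

print(df)
```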


### Lemmatize text
Now that we have our text divided into paragraphs and stored in a DataFrame, let's apply some NLP to extract the lemmas. We will then place the lemmas into a new column in our DataFrame.

In [18]:
# extract lemmas
def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

In [19]:
# apply process_text to text column
tale_df['lemmas']= tale_df['text'].apply(process_text)

In [20]:
# sanity check
tale_df.head()

Unnamed: 0,author,title,text,lemmas
0,Dickens,Tale,"It was the best of times, it was the worst of ...",good time bad time age wisdom age foolishness ...
1,Dickens,Tale,There were a king with a large jaw and a queen...,king large jaw queen plain face throne england...
2,Dickens,Tale,It was the year of Our Lord one thousand seven...,year lord thousand seven seventy spiritual rev...
3,Dickens,Tale,"France, less favoured on the whole as to matte...",france favour matter spiritual sister shield t...
4,Dickens,Tale,"In England, there was scarcely an amount of or...",england scarcely order protection justify nati...


### Making our code more efficient with functions

The code above worked well, but if we want to get more texts, it will be useful to make our code more efficient by using functions.

In [21]:
# make a function for all the above
def get_text(url):
    response= requests.get(url)
    text= response.text
    return text


In [22]:
def divide_paras(text, start, end, para_break):
    text= text[start:end]
    paras= text.split(para_break)
    return paras


In [23]:
def make_df(author, title, paras):
    df = pd.DataFrame(paras, columns=['text'])
    df.insert(0, 'author', author)
    df.insert(1, 'title', title)
    return df


Now that we have written our functions, we can get our next text: _Great Expectations_.

In [24]:
# get text
text= get_text('https://gutenberg.org/cache/epub/1400/pg1400.txt')

In [25]:
text[:250]

'\ufeffThe Project Gutenberg eBook of Great Expectations\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re'

Now that we have our text, we need to cut out the beginning and ending material we are not interested in. However, we can run into an error with _Great Expectations_.

In [26]:
# set start index
start= text.find('My father’s family name being Pirrip')

In [27]:
# sanity check
start

1874

### Error!

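If `find` returns `-1`, the substring was not found in the text. The search above returned `1874` because the search string uses the same typographic apostrophe (`’`) as the text; typing a straight apostrophe (`'`) instead would most likely return `-1`. When a search fails like this, the characters in the file are probably not the ones you typed. We can investigate the text if we write it to a file.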

In [None]:
with open('great_expectations.txt', encoding='utf8', mode='w') as f:
    f.write(text)

If you look closely at the text file, you will see that the line "My father’s family name being Pirrip" is actually "My fatherâs family name being Pirrip..." There are some funny characters in this text!
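One plausible explanation for the funny characters is that a UTF-8 file was displayed with a different encoding. This is an assumption, since the notebook does not say which program was used to view the file. The sketch below reproduces the effect with Latin-1 and also shows why a straight apostrophe does not match the typographic one:

```python
line = 'My father’s family name being Pirrip'

# a straight apostrophe is a different character from the typographic one,
# so searching for it fails and find returns -1
straight = "My father's family name being Pirrip"
print(line.find(straight))

# the typographic apostrophe takes three bytes in UTF-8; reading those bytes
# as Latin-1 turns them into 'â' followed by two invisible control characters
garbled = line.encode('utf-8').decode('latin-1')
print(repr(garbled))

# reversing the two steps recovers the original line
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == line)
```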

In [None]:
# set start index (again)


In [None]:
#sanity check:


In [28]:
# set end index
end= text.find('*** END OF THE PROJECT GUTENBERG EBOOK GREAT EXPECTATIONS ***')

In [29]:
#sanity check
end

1015861

In [30]:
# set paragraph break
para_break = '\r\n\r\n'

In [31]:
# divide text into paragraphs
expectations_paras= divide_paras(text=text, start=start, end=end, para_break= para_break)

In [32]:
# make DataFrame
expectations_df= make_df(author= 'Dickens', title= 'Expectations', paras= expectations_paras)

In [33]:
# sanity check
expectations_df.head()

Unnamed: 0,author,title,text
0,Dickens,Expectations,"My father’s family name being Pirrip, and my C..."
1,Dickens,Expectations,"I give Pirrip as my father’s family name, on t..."
2,Dickens,Expectations,"Ours was the marsh country, down by the river,..."
3,Dickens,Expectations,"“Hold your noise!” cried a terrible voice, as ..."
4,Dickens,Expectations,"A fearful man, all in coarse grey, with a grea..."


In [34]:
# extract lemmas
expectations_df['lemmas']= expectations_df['text'].apply(process_text)

In [35]:
expectations_df.head()

Unnamed: 0,author,title,text,lemmas
0,Dickens,Expectations,"My father’s family name being Pirrip, and my C...",father family pirrip christian philip infant t...
1,Dickens,Expectations,"I give Pirrip as my father’s family name, on t...",pirrip father family authority tombstone joe g...
2,Dickens,Expectations,"Ours was the marsh country, down by the river,...",marsh country river river wound mile sea vivid...
3,Dickens,Expectations,"“Hold your noise!” cried a terrible voice, as ...",hold noise cry terrible voice man start grave ...
4,Dickens,Expectations,"A fearful man, all in coarse grey, with a grea...",fearful man coarse grey great iron leg man hat...


### On your own: Christmas Carol

Based on the code above, can you create a dataset from _A Christmas Carol_?

Here is a url to get you started: `https://www.gutenberg.org/cache/epub/46/pg46.txt`

In [37]:
# get text
text= get_text('https://www.gutenberg.org/cache/epub/46/pg46.txt')

In [41]:
# set start index
start= text.find('MARLEY was dead: to begin with')

In [42]:
start

1607

In [40]:
with open('christmas_carol.txt', encoding='utf8', mode='w') as f:
    f.write(text)

In [43]:
# set end index
end= text.find('*** END OF THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL IN PROSE; BEING A GHOST STORY OF CHRISTMAS ***')

In [44]:
end

163208

In [45]:
# set paragraph break
para_break = '\r\n\r\n'

In [46]:
# divide text into paragraphs
carol_paras= divide_paras(text=text, start=start, end=end, para_break= para_break)

In [47]:
# make DataFrame
carol_df= make_df(author= 'Dickens', title= 'Carol', paras= carol_paras)

In [50]:
# extract lemmas
carol_df['lemmas']= carol_df['text'].apply(process_text)

In [51]:
carol_df.head()


## Combine DataFrames

In [52]:
# combine DataFrames
final_df = pd.concat([tale_df, expectations_df, carol_df], axis=0)
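Note that `pd.concat` keeps each DataFrame's original row labels, so the combined DataFrame contains repeated index values (each book starts again at 0). If you would rather have one continuous index, `ignore_index=True` renumbers the rows. A small sketch with made-up data:

```python
import pandas as pd

# two tiny made-up DataFrames standing in for the per-book DataFrames
a = pd.DataFrame({'title': ['Tale', 'Tale'], 'text': ['p1', 'p2']})
b = pd.DataFrame({'title': ['Carol'], 'text': ['p3']})

# default behaviour: original row labels are kept, so 0 appears twice
stacked = pd.concat([a, b], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```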

In [53]:
# sanity check
final_df.sample(50)

Unnamed: 0,author,title,text,lemmas
1100,Dickens,Tale,All these trivial incidents belonged to the ro...,trivial incident belong routine life return mo...
199,Dickens,Expectations,I think the Romans must have aggravated one an...,think romans aggravate nose restless people co...
2144,Dickens,Expectations,"“Why then,” said the turnkey, grinning again, ...",say turnkey grin know jaggers
1850,Dickens,Tale,“I forgot it long ago.”,forget long ago
2606,Dickens,Tale,"“I’ll run them over. I’ll see what I hold,--Mr...",run lorry know brute wish little brandy


## Filter DataFrame
Some of our rows likely contain only something like `CHAPTER III`. We can remove these "junk rows" by creating a filter that keeps only the rows whose `lemmas` string is longer than 25 characters.

In [54]:
# keep rows whose lemmas string is longer than 25 characters
length_filter = final_df['lemmas'].str.len() > 25

In [55]:
filter_df = final_df[length_filter]

In [56]:
filter_df.head()

Unnamed: 0,author,title,text,lemmas
0,Dickens,Tale,"It was the best of times, it was the worst of ...",good time bad time age wisdom age foolishness ...
1,Dickens,Tale,There were a king with a large jaw and a queen...,king large jaw queen plain face throne england...
2,Dickens,Tale,It was the year of Our Lord one thousand seven...,year lord thousand seven seventy spiritual rev...
3,Dickens,Tale,"France, less favoured on the whole as to matte...",france favour matter spiritual sister shield t...
4,Dickens,Tale,"In England, there was scarcely an amount of or...",england scarcely order protection justify nati...
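To see what the length filter does, here is a small sketch on a made-up two-row DataFrame. The comparison produces a Boolean Series, and indexing with it keeps only the rows marked `True`:

```python
import pandas as pd

# made-up rows: a chapter heading and a real paragraph
df = pd.DataFrame({
    'text': ['CHAPTER III', 'It was the best of times, it was the worst of times'],
    'lemmas': ['chapter iii', 'good time bad time age wisdom age foolishness'],
})

# True where the lemmas string has more than 25 characters
mask = df['lemmas'].str.len() > 25

# Boolean indexing keeps only the rows where mask is True
kept = df[mask]

print(mask.tolist())
print(kept['text'].tolist())
```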


In [57]:
# remove \n and \r characters from the text
def remove_new_lines(text):
    text = text.replace('\n', ' ')
    text = text.replace('\r', ' ')
    return text
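Because each line break in these files is the two-character sequence `\r\n`, replacing both characters with a space leaves two spaces where the break was. If that matters for your analysis, one option is to collapse every run of whitespace into a single space. `collapse_whitespace` is a hypothetical helper, not part of the notebook:

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    # str.split() with no arguments splits on any whitespace, including
    # carriage returns, new lines, and tabs, and discards empty pieces
    return ' '.join(text.split())

# made-up sample with Gutenberg-style line breaks
sample = 'It was the best of times,\r\nit was the worst of times,\r\n'
cleaned = collapse_whitespace(sample)
print(cleaned)
```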

In [58]:
# apply the function above; assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
filter_df = filter_df.assign(text=filter_df['text'].apply(remove_new_lines))


In [59]:
# save our work
filter_df.to_csv('dickens_novels.csv', index=False)