In [18]:
'''
What are the most frequent words in Mark Twain's novel Tom Sawyer, and how often do they occur?

In this notebook, we'll scrape the novel Tom Sawyer from the website Project Gutenberg (which contains a large corpus of books) using the Python package requests. 
Then we'll extract words from this web data using BeautifulSoup. Finally, we'll analyze the distribution of words using the Natural Language Toolkit (nltk) and Counter.

The Data Science pipeline we'll build in this notebook can be used to visualize the word frequency distributions of any novel that you can find on Project Gutenberg. 
The natural language processing tools used here apply to much of the data that data scientists encounter, since a vast proportion of the world's data is unstructured and includes a great deal of text.

Let's start by loading in the main Python packages we are going to use.

'''

# Importing requests, BeautifulSoup, nltk, and Counter
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
from nltk.corpus import stopwords

# Getting the Tom Sawyer HTML, stopping early if the download failed
r = requests.get('https://www.gutenberg.org/files/74/74-h/74-h.htm')
r.raise_for_status()

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Creating a BeautifulSoup object from the HTML, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

# Getting the text out of the soup
text = soup.get_text()

# nltk – the Natural Language Toolkit
# Creating a tokenizer that matches runs of word characters
# (raw string so that \w is not treated as a string escape)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Create a list called words containing all tokens transformed to lower-case
words = [w.lower() for w in tokens]

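# Getting the English stop words from nltk as a set for fast membership tests
# (requires a one-time nltk.download('stopwords') if the corpus is not installed)
sw = set(nltk.corpus.stopwords.words('english'))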

# Create a list words_ns containing all words that are in words but not in sw
words_ns = [w for w in words if w not in sw]

# Initialize a Counter object from our processed list of words
count = Counter(words_ns)

# Store the 20 most common words and their counts as top_twenty
top_twenty = count.most_common(20)

print(f'The most frequent word used in Tom Sawyer is {top_twenty[0][0]}.')

The most frequent word used in Tom Sawyer is tom.
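
The text-processing steps of the cell above (tokenize, lower-case, drop stop words, count) can be tried offline on a short sample. This is a minimal sketch, not the notebook's own code: it swaps nltk's RegexpTokenizer for the standard library's `re.findall` with the same `\w+` pattern, and `SAMPLE_STOPWORDS`, `top_words`, and the sample sentence are hypothetical names and data made up for illustration.

```python
import re
from collections import Counter

# Hypothetical mini stop-word set for illustration only;
# nltk's English stop-word list is much longer.
SAMPLE_STOPWORDS = {"the", "and", "a", "to", "of", "was", "he", "his", "in"}


def top_words(text, stopwords, n=2):
    """Return the n most common non-stop words in text as (word, count) pairs."""
    # re.findall with \w+ stands in for nltk's RegexpTokenizer here
    tokens = re.findall(r"\w+", text)
    # Lower-case every token so that "The" and "the" count as one word
    words = [t.lower() for t in tokens]
    # Keep only the words that are not stop words
    kept = [w for w in words if w not in stopwords]
    return Counter(kept).most_common(n)


# Made-up sample text, not a quotation from the novel
sample = "Tom looked at the fence. The fence was long, and Tom sighed. Tom began to paint."
result = top_words(sample, SAMPLE_STOPWORDS)
print(result)  # [('tom', 3), ('fence', 2)]
```

Each entry is a `(word, count)` pair, which is why the cell above reads the most frequent word as `top_twenty[0][0]`; the count for that word would be `top_twenty[0][1]`.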
