![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel Moby Dick from the website [Project Gutenberg](https://www.gutenberg.org/) (which contains a large corpus of books) using the Python `requests` package. You'll extract the text from this web data using `BeautifulSoup`, then analyze the distribution of words using the Natural Language Toolkit (`nltk`) and `Counter`.

The Data Science pipeline you'll build in this workspace can be used to visualize the word frequency distributions of any novel you can find on Project Gutenberg.

Project Instructions

What are the most frequent words in Herman Melville's novel Moby Dick, and how often do they occur?

Note that the HTML file you are asked to request is a cached version of this file from Project Gutenberg.

Your project will follow these steps:

1. First, you'll request the Moby Dick HTML file using `requests` and set its encoding to utf-8. Here is the URL to scrape from: https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm
2. Next, you'll extract the HTML and create a `BeautifulSoup` object using an HTML parser to get the text.
3. Following that, you'll initialize a regex tokenizer object `tokenizer` using `nltk.tokenize.RegexpTokenizer` to keep only alphanumeric text, assigning the results to `tokens`.
4. You'll then transform the tokens into lowercase and remove English stop words, saving the results to `words_no_stop`.
5. Finally, you'll initialize a `Counter` object and find the ten most common words, saving the result to `top_ten` and printing it to see what they are.
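Before running these steps on the whole novel, it can help to try the tokenize, lowercase, filter, and count sequence on a single sentence. The following is a minimal sketch that uses only the standard library: `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, and `sample_text` and `sample_stop_words` are illustrative assumptions, not the NLTK stop word list used in the project.

```python
import re
from collections import Counter

# Illustrative sample sentence and a tiny hand-picked stop word set
# (assumptions for this sketch, not the NLTK English stop word list).
sample_text = "Call me Ishmael. The whale, the white whale!"
sample_stop_words = {"the", "me"}

# Keep runs of word characters, which drops punctuation and whitespace.
sample_tokens = re.findall(r"\w+", sample_text)

# Lowercase every token, then drop the stop words.
sample_words = [token.lower() for token in sample_tokens]
sample_no_stop = [word for word in sample_words if word not in sample_stop_words]

# Count the remaining words and look at the most frequent ones.
sample_counts = Counter(sample_no_stop)
sample_top = sample_counts.most_common(2)
print(sample_top)
```

Lowercasing before filtering matters here: "The" and "the" would otherwise be treated as different tokens, and the capitalized form would slip past the stop word check.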

In [67]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

# Start coding here... 

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [68]:
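# Request the cached Moby Dick HTML file and decode the response as UTF-8
url = "https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm"
r = requests.get(url)
r.encoding = 'utf-8'

# Extract the HTML and parse it to get the plain text
html = r.text
html_soup = BeautifulSoup(html, "html.parser")
moby_text = html_soup.get_text()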

In [69]:
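from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Tokenize the text, keeping only runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(moby_text)

# Convert every token to lowercase
words = [word.lower() for word in tokens]

# Load the English stop words into a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Drop the stop words
words_no_stop = [word for word in words if word not in stop_words]

# Count the remaining words and keep the ten most common
count = Counter(words_no_stop)
top_ten = count.most_common(10)

top_ten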

[('whale', 1246),
 ('one', 925),
 ('like', 647),
 ('upon', 568),
 ('man', 527),
 ('ship', 519),
 ('ahab', 517),
 ('ye', 473),
 ('sea', 455),
 ('old', 452)]
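
To get a quick visual sense of these frequencies without a plotting library, one option is a text bar chart. The sketch below is only an illustration: the helper `text_bar_chart` and its `width` parameter are hypothetical names introduced here, and the counts are copied from the `top_ten` output above.

```python
# Counts copied from the top_ten output above.
top_ten = [('whale', 1246), ('one', 925), ('like', 647), ('upon', 568),
           ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473),
           ('sea', 455), ('old', 452)]

def text_bar_chart(pairs, width=40):
    """Return one line per (word, count) pair, with bars scaled to the largest count.

    Hypothetical helper for illustration; assumes `pairs` is non-empty.
    """
    largest = max(count for _, count in pairs)
    label_width = max(len(word) for word, _ in pairs)
    lines = []
    for word, count in pairs:
        # Scale each bar so the most frequent word fills the full width.
        bar = "#" * round(width * count / largest)
        lines.append(f"{word:<{label_width}} {bar} {count}")
    return lines

for line in text_bar_chart(top_ten):
    print(line)
```

Because the bars are scaled relative to the largest count, the same helper can be reused with the `most_common` output for any other novel.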