![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel Moby Dick from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books, using the Python `requests` package. You'll extract the text from this web data using `BeautifulSoup`, then split it into words and analyze their distribution using the Natural Language Toolkit (`nltk`) and `Counter`.

The Data Science pipeline you'll build in this workspace can be used to visualize the word frequency distributions of any novel you can find on Project Gutenberg.

# Request Moby Dick
The first step is to request the Moby Dick HTML file using `requests` and set its text encoding to UTF-8. Here is the URL to scrape: https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm

In [89]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

# Getting the Moby Dick HTML 
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text


[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
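
The cell above assumes the request succeeds. As a hedged sketch that is not part of the original workspace, the same step could be wrapped in a small helper that sets a timeout and raises on HTTP errors. The helper name `fetch_html` and the 30-second default are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=30):
    """Download a page and return its HTML decoded as UTF-8.

    The name `fetch_html` and the 30-second default are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    # Same encoding step as in the cell above
    response.encoding = 'utf-8'
    return response.text
```

Called with the URL above, `fetch_html` should return the same string as `r.text`.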


# Get the text from the HTML
`Beautiful Soup` is a Python library for extracting data from HTML and XML documents. It parses the page source into a tree that you can navigate and search, which makes it a common choice for web scraping.

In [90]:
# Creating a BeautifulSoup object from the HTML
html_soup = BeautifulSoup(html, 'html.parser')

# Getting the text out of the soup
moby_text = html_soup.get_text()
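
To see what `get_text()` does, here is a minimal sketch on a tiny HTML string. The snippet and the `sample_*` names are made up for illustration and are not taken from the Moby Dick file.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for the real page
sample_html = "<html><body><h1>Loomings</h1><p>Call me <b>Ishmael</b>.</p></body></html>"

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get_text() drops the tags and concatenates the text nodes
sample_text = sample_soup.get_text()
print(sample_text)  # LoomingsCall me Ishmael.

# Passing a separator keeps text from adjacent elements apart
sample_text_spaced = sample_soup.get_text(separator=' ')
```

Note that without a separator, text from adjacent elements is joined directly ("LoomingsCall"). Real pages usually contain line breaks between tags, but passing `separator=' '` is a cheap safeguard when the markup is compact.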

# Extract the words
A `tokenizer` is a tool used in natural language processing (NLP) that splits text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the type of tokenizer used. Tokenization is an essential first step in most NLP tasks, such as text classification, machine translation, and sentiment analysis. Here, the pattern `\w+` matches runs of letters, digits, and underscores, so punctuation and whitespace are discarded.

In [91]:
# Creating a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(moby_text)

# Create a list called words containing all tokens transformed to lowercase
words = [token.lower() for token in tokens]
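
The effect of the `\w+` pattern can be checked on a short string. This sketch uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that both return every match of the pattern; the sample sentence is made up.

```python
import re

# A made-up sentence, not taken from the novel
sample = "Call me Ishmael. Don't give up the ship!"

# Every run of word characters becomes one token
sample_tokens = re.findall(r'\w+', sample)

# Lowercase so that "Call" and "call" count as the same word
sample_words = [token.lower() for token in sample_tokens]
print(sample_words)
# ['call', 'me', 'ishmael', 'don', 't', 'give', 'up', 'the', 'ship']
```

The apostrophe is not a word character, so "Don't" is split into `don` and `t`. Fragments like these appear in the token list for the full novel as well.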

# Remove stop words in Moby Dick
`Stop words` are common words (such as "the", "is", "in", "at", and "on") that are often filtered out during text preprocessing because they carry little meaningful information in many natural language processing (NLP) tasks. They are frequent in almost any text, so they would dominate a word count while contributing little to tasks like text classification, sentiment analysis, or topic modeling.

In [92]:
# Getting the English stop words from nltk (as a set, for fast membership tests)
stop_words = set(nltk.corpus.stopwords.words('english'))

# Create a list words_no_stop containing all words that are in words but not in stop_words
words_no_stop = [word for word in words if word not in stop_words]
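
The filtering step can be illustrated on a toy list. The stop-word set below is a hypothetical four-word stand-in, not nltk's English list.

```python
sample_words = ['call', 'me', 'ishmael', 'the', 'whale', 'is', 'in', 'the', 'sea']

# Hypothetical mini stop list; nltk's English list is much longer
sample_stop_words = {'me', 'the', 'is', 'in'}

# Keep only the words that are not stop words, preserving order
sample_no_stop = [word for word in sample_words if word not in sample_stop_words]
print(sample_no_stop)  # ['call', 'ishmael', 'whale', 'sea']
```

A set is used for the stop words because a membership test on a set takes constant time on average, whereas a list is scanned element by element. The difference adds up when every word of a novel is checked.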


# Top ten most common words

In [93]:
# Initialize a Counter object from our processed list of words
count = Counter(words_no_stop)

# Store ten most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
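
`most_common` returns (word, count) pairs sorted from most to least frequent. As a minimal, text-only sketch of visualizing that distribution, the counts printed above can be drawn as horizontal bars. The 40-character bar width and the `#` glyph are arbitrary choices.

```python
# Counts copied from the output above
top_ten = [('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527),
           ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]

bar_width = 40                # length of the longest bar, in characters
max_count = top_ten[0][1]     # the list is sorted, so the first count is the largest

lines = []
for word, n in top_ten:
    # Scale each count relative to the most frequent word
    bar = '#' * round(n / max_count * bar_width)
    lines.append(f"{word:<6}{bar} {n}")

print("\n".join(lines))
```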
