# 1. Tools for text processing

We will find the most frequent words in Herman Melville's Moby Dick using Python. We'll download the novel from Project Gutenberg with the requests library, extract the text with BeautifulSoup, tokenize it with nltk, and count word frequencies with Counter. The same pipeline can be adapted to any text on Project Gutenberg, and it shows how natural language processing tools apply to unstructured text. Let's start by loading the necessary Python libraries.

In [1]:
# Importing requests, BeautifulSoup, nltk, and Counter
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from collections import Counter

# 2. Request Moby Dick
To analyze Moby Dick, we'll fetch its HTML from Project Gutenberg, where it is freely available at https://www.gutenberg.org/files/2701/2701-h/2701-h.htm.

HTML (Hypertext Markup Language) is the standard format for web pages, and we can extract its content programmatically. We'll use the requests library to send a GET request, which retrieves the webpage content directly into Python for further processing. This approach mirrors what happens when you visit a webpage in a browser, but here, we automate the process.
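Before sending anything over the network, it can help to see what a GET request looks like and how to guard against a failed download. The following is a minimal sketch rather than part of the original analysis: it builds the request without sending it, and defines a hypothetical helper, `fetch_html` (the name and the 30-second timeout are assumptions), that raises an error on a bad HTTP status instead of silently returning an error page.

```python
import requests

URL = "https://www.gutenberg.org/files/2701/2701-h/2701-h.htm"

# Build the GET request without sending it, to inspect what would go out.
prepared = requests.Request("GET", URL).prepare()
print(prepared.method, prepared.url)


def fetch_html(url, timeout=30):
    """Hypothetical helper: download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # Decode the response body as UTF-8.
    response.encoding = "utf-8"
    return response.text
```

Calling `fetch_html(URL)` would perform the actual download; the helper is only defined here so the sketch runs without network access.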

In [2]:
# Getting the Moby Dick HTML 
r = requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm")

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[:2000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of Moby Dick; Or the Whale, by Herman Melville</title>

<style type="text/css" xml:space="preserve">

    body {margin-left:15%; margin-right:15%; text-align:justify }
    p { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; margin-bottom: .75em;

# 3. Get the text from the HTML
The HTML from Project Gutenberg contains the text of Moby Dick, but it needs cleaning and extraction before we can use it. We'll use the BeautifulSoup library for this task.

The name "Beautiful Soup" refers to the library's ability to parse "tag soup", that is, HTML that may be messy or non-standard. We'll create a BeautifulSoup object to parse the HTML and then call get_text() to strip the markup. The result is the novel's text plus a small amount of surrounding page text, such as the page title and the Project Gutenberg header and license.
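To make the behavior of `get_text()` concrete, here is a small sketch on a made-up HTML snippet (`sample_html` is a hypothetical input, not the Gutenberg page). It illustrates that `get_text()` drops the tags but keeps every text string in the document, including the title, and that restricting the call to `soup.body` is one way to skip text in the head.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document for illustration only.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Call me <b>Ishmael</b>.</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# Text of the whole document: tags are removed, but the title text remains.
full_text = soup.get_text()
print(full_text)

# Text of the body only: the title is no longer included.
body_text = soup.body.get_text()
print(body_text)
```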

In [3]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html,'html.parser')

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
print(text[32000:34000])

inging up the rear
      of every funeral I meet; and especially whenever my hypos get such an
      upper hand of me, that it requires a strong moral principle to prevent me
      from deliberately stepping into the street, and methodically knocking
      people’s hats off—then, I account it high time to get to sea as soon
      as I can. This is my substitute for pistol and ball. With a philosophical
      flourish Cato throws himself upon his sword; I quietly take to the ship.
      There is nothing surprising in this. If they but knew it, almost all men
      in their degree, some time or other, cherish very nearly the same feelings
      towards the ocean with me.
    

      There now is your insular city of the Manhattoes, belted round by wharves
      as Indian isles by coral reefs—commerce surrounds it with her surf.
      Right and left, the streets take you waterward. Its extreme downtown is
      the battery, where that noble mole is washed by waves, and coole

# 4. Extract the words
Now that we have the text of Moby Dick, we can move on to analyzing the word frequencies. Although there’s some extraneous content at the beginning and end, it’s negligible compared to the bulk of the novel and can be ignored for now.

To count word occurrences, we’ll use the Natural Language Toolkit (nltk). The first step is tokenization: splitting the text into individual words. We’ll use the regular expression \w+, which matches runs of letters, digits, and underscores, so whitespace and punctuation are discarded. The result is a clean list of words, ready for analysis.
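To see what the `\w+` pattern keeps and discards, here is a minimal sketch on a short sample sentence. It uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that the two behave the same for this simple pattern; `sample` is a hypothetical input.

```python
import re

# A short sample sentence with punctuation and a contraction.
sample = "It's high time to get to sea, as soon as I can."

# Keep every run of word characters; everything else acts as a separator.
sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
# ['It', 's', 'high', 'time', 'to', 'get', 'to', 'sea', 'as', 'soon', 'as', 'I', 'can']
```

Note that the apostrophe splits "It's" into two tokens, 'It' and 's', so contractions and possessives leave short fragments in the token list.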

In [4]:
# Creating a tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens 
print(tokens[:8])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Moby', 'Dick', 'Or']


# 5. Make the words lowercase
To ensure accurate word counts, we need to treat words like "Or" and "or" as the same. The solution is to convert all words to lowercase before counting their occurrences. This normalization step avoids case sensitivity issues and ensures consistent word representation.

We'll modify our list of words from Moby Dick by applying the .lower() method to each word. This will create a uniform, lowercase representation of all words in the text, preparing them for accurate frequency analysis. 
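The effect of lowercasing on counts can be illustrated with a small made-up token list (`demo_tokens` is hypothetical). Without normalization, each capitalization is counted as a separate word; after `.lower()`, the counts merge.

```python
from collections import Counter

# Made-up tokens with mixed capitalization.
demo_tokens = ["The", "whale", "the", "Whale", "THE"]

# Counting the raw tokens treats each capitalization as a different word.
raw_counts = Counter(demo_tokens)
print(raw_counts["the"])    # 1

# Lowercasing first merges the variants into a single entry per word.
demo_words = [token.lower() for token in demo_tokens]
lower_counts = Counter(demo_words)
print(lower_counts["the"])    # 3
print(lower_counts["whale"])  # 2
```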

In [5]:
# Create a list called words containing all tokens transformed to lower-case
words = [token.lower() for token in tokens]
# Printing out the first 8 words / tokens 
print(words[:8])

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or']


# 6. Load in stop words
To focus on meaningful and interesting words, we commonly remove high-frequency but less informative words, known as stop words (e.g., "the," "of," "a"). The nltk library provides a built-in list of English stop words that we can use for this purpose.

We’ll filter out these stop words from the tokenized and normalized word list. This step helps highlight the unique vocabulary and themes of Moby Dick by focusing on the content-rich words.

In [6]:
# Ensure the stopwords are downloaded
nltk.download('stopwords')

# Getting the English stop words
sw = nltk.corpus.stopwords.words('english')

# Printing out the first eight stop words
print(sw[:8])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']




# 7. Remove stop words in Moby Dick
To exclude stop words from the Moby Dick word list, we use a list comprehension that keeps only the words not found in sw.
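As a sketch of the filtering step, the example below uses a tiny hand-written stop word set (`demo_sw` and `demo_words` are made up for illustration, not the nltk list). It also shows one optional design choice: storing the stop words in a `set`, which makes each membership test a constant-time lookup rather than a scan through a list. On a text the size of a novel, this can noticeably speed up the filter.

```python
# A tiny, made-up stop word collection; the real list comes from nltk.
demo_sw = {"the", "of", "a", "and"}

# Made-up, already lowercased words.
demo_words = ["the", "whale", "of", "a", "ship", "and", "the", "sea"]

# Keep only the words that are not stop words.
demo_words_ns = [word for word in demo_words if word not in demo_sw]
print(demo_words_ns)
# ['whale', 'ship', 'sea']
```

With the nltk list, the equivalent conversion would be `set(sw)` before filtering.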

In [7]:
# Create a list words_ns containing all words that are in words but not in sw
words_ns = [word for word in words if word not in sw]
# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[:5])

['project', 'gutenberg', 'ebook', 'moby', 'dick']


# 8. We have the answer
To identify the most frequent words in Moby Dick, we’ll use the Counter class from Python's collections module. Here's the process:

1. Pass the words_ns list to Counter to create a dictionary-like object where keys are words and values are their counts.
2. Use the .most_common(n) method to retrieve the top n most frequent words along with their counts.
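The two steps above can be sketched on a small made-up word list (`demo_words` is hypothetical) to show the shape of the result: `most_common(n)` returns a list of `(word, count)` tuples sorted from most to least frequent.

```python
from collections import Counter

# A small, made-up list of already filtered words.
demo_words = ["whale", "ship", "whale", "sea", "ship", "whale"]

# Step 1: build the counter; it maps each word to its number of occurrences.
demo_count = Counter(demo_words)
print(demo_count["whale"])  # 3

# Step 2: retrieve the two most frequent words with their counts.
top_two = demo_count.most_common(2)
print(top_two)
# [('whale', 3), ('ship', 2)]
```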

In [8]:
# Initialize a Counter object from our processed list of words
count = Counter(words_ns)
# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)
# Print the top ten words and their counts
print(top_ten)

[('whale', 1244), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]


# 9. The most common word
Once stop words are removed, the most common word in Moby Dick is "whale", with 1,244 occurrences in this edition, which fits the novel's central subject. This analysis shows how natural language processing can extract insights from unstructured text, a key skill for working with many kinds of data.

In [9]:
# What's the most common word in Moby Dick?
most_common_word = count.most_common(1)[0][0]
print(most_common_word)

whale
