# Web Scraping & NLP for Novel Word Frequency

<p>In this project, I find out which words occur most frequently in Herman Melville's novel Moby Dick, and how often they occur. First, I use requests and BeautifulSoup to scrape the novel from the Project Gutenberg website. After scraping and cleaning the text data, I use NLP (via nltk) to find the most frequent words in Moby Dick.</p>


## 1. Import packages

In [38]:
# Importing requests, BeautifulSoup, nltk, and Counter
from collections import Counter

import requests
from bs4 import BeautifulSoup
import nltk

# Download the stop word list used in section 3
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Web Scraping for text in Moby Dick 

In [39]:
# Requesting the Moby Dick HTML from Project Gutenberg
r = requests.get('https://www.gutenberg.org/files/2701/2701-h/2701-h.htm')
r.raise_for_status()  # stop early if the download failed
r.encoding = 'utf-8'
html = r.text
print(html[0:2000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of Moby Dick; Or the Whale, by Herman Melville</title>

<style type="text/css" xml:space="preserve">

    body {margin-left:15%; margin-right:15%; text-align:justify }
    p { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; margin-bottom: .75em;}
    pre       

In [40]:
# Getting the text out of the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text[32000:34000])

inging up the rear
      of every funeral I meet; and especially whenever my hypos get such an
      upper hand of me, that it requires a strong moral principle to prevent me
      from deliberately stepping into the street, and methodically knocking
      people’s hats off—then, I account it high time to get to sea as soon
      as I can. This is my substitute for pistol and ball. With a philosophical
      flourish Cato throws himself upon his sword; I quietly take to the ship.
      There is nothing surprising in this. If they but knew it, almost all men
      in their degree, some time or other, cherish very nearly the same feelings
      towards the ocean with me.
    

      There now is your insular city of the Manhattoes, belted round by wharves
      as Indian isles by coral reefs—commerce surrounds it with her surf.
      Right and left, the streets take you waterward. Its extreme downtown is
      the battery, where that noble mole is washed by waves, and cooled by
      bre
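
The text returned by `get_text()` covers the whole page, so it likely also contains Project Gutenberg's own front and back matter (title line, license text), which would be counted along with the novel. The sketch below shows one way to trim it. It assumes the page marks the body of the book with lines starting `*** START OF` and `*** END OF`; those marker strings and the helper name `strip_gutenberg_boilerplate` are assumptions, not something this notebook verifies, so check them against the actual `text` before relying on the result.

```python
def strip_gutenberg_boilerplate(text,
                                start_marker="*** START OF",
                                end_marker="*** END OF"):
    """Return the text between the (assumed) Gutenberg start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = text.find(start_marker)
    if start != -1:
        # Skip the rest of the marker line itself
        newline = text.find("\n", start)
        body_start = newline + 1 if newline != -1 else len(text)
    else:
        body_start = 0

    end = text.find(end_marker, body_start)
    body_end = end if end != -1 else len(text)
    return text[body_start:body_end].strip()


# Hypothetical miniature page, used only to illustrate the helper
sample_page = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer"
)
print(strip_gutenberg_boilerplate(sample_page))  # prints: Call me Ishmael.
```

In this notebook the helper would be applied as `text = strip_gutenberg_boilerplate(text)` before tokenizing. The word counts further down were produced without this step.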

## 3. Extract and process words via Natural Language Toolkit

In [41]:
# Tokenizing the text into runs of word characters
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
tokens[0:8]

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Moby', 'Dick', 'Or']
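
To see what the `\w+` pattern does to punctuation, here is a small standard-library sketch. It uses `re.findall` as a stand-in for the tokenizer, on the assumption that the two behave the same for this simple pattern. The helper name `simple_tokenize` and the sample phrase are illustrative only.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


# The typographic apostrophe is not a word character, so it splits the word
print(simple_tokenize("knocking people’s hats off"))
# ['knocking', 'people', 's', 'hats', 'off']
```

Possessives and contractions are therefore split into two tokens, which is worth keeping in mind when reading the counts later on.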

In [42]:
# Make the words lowercase
words = [token.lower() for token in tokens]
words[:8]

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or']

In [43]:
# Getting the English stop words from nltk
sw = nltk.corpus.stopwords.words('english')

# Printing out the first eight stop words
sw[:8]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']

## 4. Remove stop words in Moby Dick
<p>We now want to create a new list with all <code>words</code> in Moby Dick, except those that are stop words (that is, those words listed in <code>sw</code>).</p>

In [44]:
# Remove stop words in Moby Dick (a set makes each membership check fast)
sw_set = set(sw)
words_ns = [word for word in words if word not in sw_set]
words_ns[:5]

['project', 'gutenberg', 'ebook', 'moby', 'dick']

## 5. The most common words

In [45]:
# Count words and store 10 most common words
count = Counter(words_ns)
top_ten = count.most_common(10)
print(top_ten)

[('whale', 1244), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
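
Raw counts answer "how often", but a share of all counted words can be easier to compare. The following is a minimal sketch of that idea on a toy word list; the helper name `top_words_with_share` and the sample data are hypothetical and not part of the analysis above.

```python
from collections import Counter


def top_words_with_share(words, n=10):
    """Return (word, count, share of all words) for the n most common words."""
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, freq, freq / total) for word, freq in counts.most_common(n)]


# Toy data for illustration only
sample_words = ['whale', 'sea', 'whale', 'ship', 'whale', 'sea']
for word, freq, share in top_words_with_share(sample_words, n=2):
    print(f"{word:<8}{freq:>4}{share:>8.1%}")
```

Calling `top_words_with_share(words_ns)` would give the same ten words as `top_ten`, each with its fraction of the non-stop-word tokens.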
