With this code, the text of any web page can be analysed. Specifically, we can find the most frequent words in the text downloaded in cell "[40]". Replacing the Moby Dick link with a link to any other text shows the most frequent words in that text instead.

In [39]:
# We start, as always, by importing the necessary libraries

import requests #package that lets us fetch web pages (e.g. I request the online Moby Dick book without saving it to disk)
from bs4 import BeautifulSoup #BeautifulSoup is a class from the bs4 package (I don't need the entire package)
import nltk #Natural Language Toolkit, for natural language processing (NLP)
from collections import Counter #dictionary subclass that counts how often each element occurs


In [40]:
# We can now download the book that we want
#https://www.gutenberg.org/files/2701/2701-h/2701-h.htm

r = requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm") #fetch the book without saving it to disk; the response is stored in the variable r
r.encoding = 'utf-8' #decode the response as UTF-8, so that characters outside the basic English (ASCII) alphabet come out correctly
html = r.text #take the decoded text of r and save it as html
print(html[0:2000]) #print the first 2000 characters of that html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of Moby Dick; Or the Whale, by Herman Melville</title>

<style type="text/css" xml:space="preserve">

    body {margin-left:15%; margin-right:15%; text-align:justify }
    p { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; margin-bottom: .75em;
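
A small sketch of why the encoding matters, using only the standard library. The word "café" is a made-up example, not taken from the book. The same bytes give different text depending on the encoding used to decode them, which is why the notebook sets r.encoding before reading r.text.

```python
# The same bytes decoded with two different encodings.
raw = "café".encode("utf-8")   # 5 bytes: the letter é takes two bytes in UTF-8

right = raw.decode("utf-8")    # decoded with the encoding it was written in
wrong = raw.decode("latin-1")  # decoded with the wrong encoding

print(right)
print(wrong)
```

The second print shows garbled characters in place of é, which is the kind of damage a wrong encoding guess would cause in the book text.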

In [41]:
# We can now extract the useful data from it

soup = BeautifulSoup(html, "html.parser") #create a BeautifulSoup instance; "html.parser" is Python's built-in HTML parser, which breaks the document down into its tag structure
text = soup.get_text() #extract the text from the soup, dropping the HTML tags
print(text[0:2000]) #print the first 2000 characters of text






The Project Gutenberg eBook of Moby Dick; Or the Whale, by Herman Melville



The Project Gutenberg eBook of Moby-Dick; or The Whale, by Herman Melville

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online
at www.gutenberg.org. If you
are not located in the United States, you will have to check the laws of the
country where you are located before using this eBook.

Title: Moby-Dick; or The Whale
Author: Herman Melville
Release Date: June, 2001 [eBook #2701]
[Most recently updated: August 18, 2021]
Language: English
Character set encoding: UTF-8
Produced by: Daniel Lazarus, Jonesey, and David Widger
*** START OF THE PROJECT GUTENBERG EBOOK MOBY-DICK; OR THE WHALE ***

      MOBY-DICK;or, THE WHALE.
    




      By Herman Melville
    

 





In [42]:
# The document needs to be tokenized before it can be analyzed
#later cells lowercase the tokens and remove uninformative words (e.g. "the")
#tokenizing on \w+ already drops punctuation and spaces

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+') #create a tokenizer driven by a regular expression (= a pattern that describes text); \w+ matches runs of letters, digits and underscores
tokens = tokenizer.tokenize(text) #split the text into tokens; punctuation and spaces match nothing, so they are dropped
print(tokens[0:8]) #print the first eight tokens (i.e. the first eight words of the text)

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Moby', 'Dick', 'Or']
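
To see what the \w+ pattern does on its own, here is a minimal sketch with the standard re module instead of nltk. As far as I know, re.findall gives the same tokens as RegexpTokenizer for this pattern, but treat that as an assumption. The two sample sentences are made up for illustration.

```python
import re

# \w+ matches runs of letters, digits and underscores; everything else is a separator.
tokens_a = re.findall(r"\w+", "Call me Ishmael. Some years ago--never mind how long.")
print(tokens_a)

# An apostrophe is not a word character, so contractions are split in two.
tokens_b = re.findall(r"\w+", "Don't panic!")
print(tokens_b)  # ['Don', 't', 'panic']
```

The second example shows a side effect worth knowing: fragments such as 't' or 's' end up as their own tokens.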


In [43]:

# Some additional pre-processing is needed
words = [token.lower() for token in tokens] #convert every token to lowercase, so that 'The' and 'the' count as the same word
print(words[0:8])

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or']


In [44]:
sw = nltk.corpus.stopwords.words('english') #nltk.corpus bundles word lists and text collections; here we load the list of English stopwords (needs a one-time nltk.download('stopwords')). In the next cell we remove these words from our text.
print(sw[0:8])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']


In [45]:
words_ns = [word for word in words if word not in sw] #go through every element of words (calling each one 'word') and keep it only if it is not a stopword
print(words_ns[:5])

['project', 'gutenberg', 'ebook', 'moby', 'dick']
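
The same filtering step on a tiny scale, as a sketch. The three-word stopword set here is a hypothetical stand-in for the NLTK list. Storing the stopwords in a set instead of a list is my own suggestion: membership checks on a set are faster, which helps on a book-length list of words.

```python
# Hypothetical mini stopword set (the real NLTK list is much longer).
stopwords = {"the", "of", "or"}

words = ['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or']

# Keep a word only if it is not a stopword.
filtered = [w for w in words if w not in stopwords]
print(filtered)  # ['project', 'gutenberg', 'ebook', 'moby', 'dick']
```

In the notebook, the equivalent change would be sw = set(sw) before building words_ns; the result stays the same.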


In [46]:
# We can now analyze the data and determine the most common words
count = Counter(words_ns) #count how many times each word occurs
top_ten = count.most_common(10) #the ten most frequent words, each with its count
print(top_ten) #the most common word is 'whale', then 'one', etc.

[('whale', 1244), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
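
A short sketch of how Counter behaves, on a made-up list of six words. It works like a dictionary that maps each word to its count, and most_common(n) returns the n highest counts as (word, count) pairs.

```python
from collections import Counter

sample = ["whale", "sea", "whale", "ship", "sea", "whale"]  # made-up data
counts = Counter(sample)

top_two = counts.most_common(2)
print(top_two)            # [('whale', 3), ('sea', 2)]
print(counts["ship"])     # 1
print(counts["harpoon"])  # 0, a missing word counts as zero instead of raising an error
```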


In [47]:
#To analyse a different text, I can replace the link in requests.get with another link and rerun the cells to see the most common words in that text.
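
To make it easier to reuse the steps on another text, they could be wrapped in one function. This is only a sketch under some assumptions: top_words is a hypothetical helper name, it takes plain text that has already been downloaded and stripped of HTML, and it uses re.findall in place of the nltk tokenizer so that it runs with the standard library alone.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=10):
    """Return the n most common non-stopword tokens in text as (word, count) pairs."""
    tokens = re.findall(r"\w+", text.lower())          # tokenize and lowercase
    kept = [t for t in tokens if t not in stopwords]   # drop stopwords
    return Counter(kept).most_common(n)                # count and rank

# Made-up sample text and stopword set, just to show the call.
sample = "The whale and the ship. The whale!"
result = top_words(sample, {"the", "and"}, n=2)
print(result)  # [('whale', 2), ('ship', 1)]
```

With the notebook's variables, the call would look like top_words(text, set(sw)), where text comes from soup.get_text() for whichever link was requested.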