
# **Text Processing for a Novel**
In this analysis, we will find out which words are most frequent in **Herman Melville's novel** *Moby Dick*, and how often they occur.

In this notebook, we'll scrape the novel *Moby Dick* from the website [Project Gutenberg](https://www.gutenberg.org/) (which contains a large corpus of books) using the Python package requests. Then we'll extract words from this web data using Beautiful Soup. Finally, we'll dive into analyzing the distribution of words using nltk (the Natural Language Toolkit) and Counter.

The *data science pipeline* we'll build in this notebook can be used to visualise the word frequencies of any novel you can find on Project Gutenberg. The NLP tools used here apply to much of the data that data scientists encounter, as a vast proportion of the world's data is unstructured and includes a great deal of text.


In [22]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import nltk  # Natural Language Toolkit: tokenizing and stop words
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize

In [2]:
import matplotlib.pyplot as plt
try:
  import seaborn as sns
except ImportError:
  # Install seaborn only if the import failed because it is missing
  !pip install seaborn --user
  import seaborn as sns

In [3]:
%matplotlib inline
sns.set()

# Extracting Moby Dick

In [4]:
data = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')
print(type(data))


<class 'requests.models.Response'>


In [5]:
data

<Response [200]>

Here we can see that the output is **<Response [200]>**, so the data was fetched successfully.
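
A status code of 200 belongs to the 2xx range that HTTP reserves for success. As a small illustration, the helper below (`describe_status` is a name made up for this sketch) uses the standard library's `http.HTTPStatus` to turn a code into a readable label. With requests itself, calling `data.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK (success)' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```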

In [6]:
# Set the text encoding of the HTML page
data.encoding = 'utf-8'

# Extract the HTML from the response object
html = data.text
print(html[0:2000])

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <title>
      Moby Dick; Or the Whale, by Herman Melville
    </title>
    <style type="text/css" xml:space="preserve">

    body { background:#faebd0; color:black; margin-left:15%; margin-right:15%; text-align:justify }
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       {
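
Setting `data.encoding` matters because requests decodes `data.text` with whatever encoding is set on the response. If the server's headers do not name a charset, requests may fall back to a different one, and non-ASCII characters such as the curly apostrophe in the novel come out garbled. The sketch below shows the effect with plain byte strings; the sample phrase is chosen only for illustration.

```python
# Bytes as they would travel over the network, encoded as UTF-8
raw = "people’s hats".encode("utf-8")

# Decoding with the wrong codec turns each byte into its own character
wrong = raw.decode("iso-8859-1")
# Decoding with the right codec recombines the multi-byte apostrophe
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```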

In [7]:
# Create a BeautifulSoup object from the HTML
sp = BeautifulSoup(html, 'html.parser')
# Get the text of the document, with the HTML tags stripped
text = sp.get_text()

# Print the text between characters 32000 and 34000
print(text[32000:34000])
# Get the hyperlinks from the soup and check out the first several
# print(sp.find_all('a')[:8])

ent me
      from deliberately stepping into the street, and methodically knocking
      people’s hats off—then, I account it high time to get to sea as soon
      as I can. This is my substitute for pistol and ball. With a philosophical
      flourish Cato throws himself upon his sword; I quietly take to the ship.
      There is nothing surprising in this. If they but knew it, almost all men
      in their degree, some time or other, cherish very nearly the same feelings
      towards the ocean with me.
    

      There now is your insular city of the Manhattoes, belted round by wharves
      as Indian isles by coral reefs—commerce surrounds it with her surf.
      Right and left, the streets take you waterward. Its extreme downtown is
      the battery, where that noble mole is washed by waves, and cooled by
      breezes, which a few hours previous were out of sight of land. Look at the
      crowds of water-gazers there.
    

      Circumambulate the city of a dre

This gives us the text of the page with the HTML tags stripped out.
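
To make concrete what extracting text from HTML means, here is a minimal sketch built on the standard library's `html.parser` module. It is not how Beautiful Soup is implemented, and the names `TextExtractor` and `strip_tags` are made up for this example; it only shows the idea of keeping the text nodes and ignoring the tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = "<h1>Moby Dick</h1><p>Call me <a href='#'>Ishmael</a>.</p>"
print(strip_tags(sample))
```

Note that the heading and the paragraph run together in the result, which is one reason the extracted text of a real page needs tokenizing before it can be counted.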

In [8]:
# Tokenizer that keeps runs of word characters and drops whitespace and punctuation
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
tokens[:10]

['Moby',
 'Dick',
 'Or',
 'the',
 'Whale',
 'by',
 'Herman',
 'Melville',
 'The',
 'Project']

We now have the text of the novel as a list of tokens! There is some unwanted material at the start and at the end. We could remove it, but it is so small compared with the text of Moby Dick that, to a first approximation, it is okay to leave it in.

Now that we have the text of interest, it's time to count how many times each word appears, and for this we'll use nltk, the Natural Language Toolkit. The first step, done above, is tokenizing the text: removing everything that isn't a word (whitespace, punctuation, etc.) and splitting the text into a list of words. Next we make every word lower case, so that 'The' and 'the' are counted as the same word.
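
As a rough standard-library analogue of tokenizing with the pattern `\w+`, the sketch below uses `re.findall`. The function name `tokenize` and the sample sentence are assumptions made for illustration, not part of the notebook.

```python
import re


def tokenize(text):
    """Return every run of word characters (letters, digits, underscores)."""
    return re.findall(r"\w+", text)


sample = "Call me Ishmael. People's hats!"
tokens = tokenize(sample)
lowered = [token.lower() for token in tokens]

print(tokens)
print(lowered)
```

One consequence of this pattern is that an apostrophe splits a word in two, so "People's" becomes `People` and `s`. For a first look at word frequencies that is usually acceptable.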

In [9]:
# making the words lower case
words = [token.lower() for token in tokens]
words[:10]

['moby',
 'dick',
 'or',
 'the',
 'whale',
 'by',
 'herman',
 'melville',
 'the',
 'project']

In [10]:
import nltk
# Opens the interactive downloader; nltk.download('stopwords') fetches the corpus directly
nltk.download()
# from nltk.corpus import stopwords
# sw = stopwords.words('english')


NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> stopwords


    Downloading package stopwords to /root/nltk_data...
      Unzipping corpora/stopwords.zip.



---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> stopwords


    Downloading package stopwords to /root/nltk_data...
      Package stopwords is already up-to-date!



---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

Here we download the stopwords corpus for nltk. Once the list of stop words is downloaded, we can import it and use it.

In [11]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
sw[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

# Removing the stop words from Moby Dick

In [12]:
# Keep each word that is not in the stop-word list
not_words = [word for word in words if word not in sw]
not_words[:5]

['moby', 'dick', 'whale', 'herman', 'melville']

These are the first few words of Moby Dick that are not stop words.
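
The filtering step can be checked on a small scale. The sketch below uses the ten lower-cased tokens shown earlier and a deliberately tiny, hypothetical stop-word set (NLTK's English list is much longer). Each individual word is tested for membership, and a set is used because membership tests on a set are faster than on a list.

```python
# Hypothetical, deliberately tiny stop-word set for illustration only
stop_words = {"or", "the", "by", "of", "and", "a", "to"}

words = ["moby", "dick", "or", "the", "whale", "by",
         "herman", "melville", "the", "project"]

# Test each individual word (not the whole list) against the stop words
kept = [word for word in words if word not in stop_words]
print(kept)
```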

In [20]:
from collections import Counter

In [27]:
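# Count the remaining words; most_common() lists them from most to least frequent
Count = Counter(not_words)
top_ten = Count.most_common(10)
# The first entry is the most frequent word (max() would compare the words alphabetically)
top_ten[0]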

The natural language processing skills we used in this notebook are also applicable to much of the data that data scientists encounter, as a vast proportion of the world's data is unstructured and includes a great deal of text.

So, what word turned out to (not surprisingly) be the most common word in Moby Dick?
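
One way to read the answer off a `Counter` is sketched below on toy data (the words and counts are made up, not taken from the novel). `most_common()` returns pairs sorted by count, so its first entry is the most frequent word. Calling `max()` on those pairs compares the words alphabetically instead, which gives a different and misleading result.

```python
from collections import Counter

# Toy data for illustration only
counts = Counter(["sea", "sea", "sea", "whale", "whale", "ship"])

top = counts.most_common(3)      # sorted from most to least frequent
most_frequent = top[0]           # highest count comes first
alphabetical_max = max(top)      # tuples compare by word first, then by count
by_count = max(counts.items(), key=lambda item: item[1])  # explicit key on the count

print(most_frequent)
print(alphabetical_max)
print(by_count)
```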