## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze Peter Pan's text to answer the question in the instruction pane below. You are working with the text version available here at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

<h1>Counting Word Frequency</h1>
<h3 id="a_problem_to_solve">The Problem</h3>
<p>Among the top ten most common meaningful words in a text, which ones are character names? In this Jupyter Notebook, I will offer my step-by-step solution to this problem by identifying the most common names used in J. M. Barrie's novel <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Peter and Wendy</a>. To do so, I will need a few tools that are imported below. These tools include NLTK's word_tokenize and stopwords, BeautifulSoup and requests, spaCy's en_core_web_sm package, and the Counter tool from collections.</p>

In [1]:
# Import the appropriate packages
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from bs4 import BeautifulSoup as bs
import requests, re, nltk, os, string, en_core_web_sm
# Alias pprint as print so that long output is pretty-printed (this shadows the built-in print)
from pprint import pprint as print

nlp = en_core_web_sm.load()

<h2>Scraping the Data</h2>

To identify the meaningful words in a text, I first need the text itself. Here that is <i>Peter and Wendy</i> by J. M. Barrie, available at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. I fetch the page with the requests module and parse the returned HTML into a BeautifulSoup object.

In [2]:
url = "https://www.gutenberg.org/files/16/16-h/16-h.htm"

request = requests.get(url)
text = request.text
soup = bs(text, 'html.parser')

<p>In this next step, I take a look at the HTML and identify which pattern contains only the text of the book.</p>

In [3]:
# Exploring the HTML to identify meaningful data

print(soup.prettify())

('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\r\n'
 '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
 '<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">\n'
 ' <head>\n'
 '  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>\n'
 '  <meta content="text/css" http-equiv="Content-Style-Type"/>\n'
 '  <title>\n'
 '   The Project Gutenberg eBook of Peter Pan, by James M. Barrie\n'
 '  </title>\n'
 '  <style type="text/css">\n'
 '   body { margin-left: 20%;\r\n'
 '       margin-right: 20%;\r\n'
 '       text-align: justify; }\r\n'
 '\r\n'
 'h1, h2, h3, h4, h5 {text-align: center; font-style: normal; font-weight:\r\n'
 'normal; line-height: 1.5; margin-top: .5em; margin-bottom: .5em;}\r\n'
 '\r\n'
 'h1 {font-size: 300%;\r\n'
 '    margin-top: 0.6em;\r\n'
 '    margin-bottom: 0.6em;\r\n'
 '    letter-spacing: 0.12em;\r\n'
 '    word-spacing: 0.2em;\r\n'
 '    text-indent: 0em;}\r\n'
 'h2 {font-size: 150%; margin-top: 2em; margin-bottom: 1em

 '   </p>\n'
 '   <p>\n'
 '    “What a funny address!”\n'
 '   </p>\n'
 '   <p>\n'
 '    Peter had a sinking. For the first time he felt that perhaps it was a '
 'funny\r\n'
 'address.\n'
 '   </p>\n'
 '   <p>\n'
 '    “No, it isn’t,” he said.\n'
 '   </p>\n'
 '   <p>\n'
 '    “I mean,” Wendy said nicely, remembering that she was hostess,\r\n'
 '“is that what they put on the letters?”\n'
 '   </p>\n'
 '   <p>\n'
 '    He wished she had not mentioned letters.\n'
 '   </p>\n'
 '   <p>\n'
 '    “Don’t get any letters,” he said contemptuously.\n'
 '   </p>\n'
 '   <p>\n'
 '    “But your mother gets letters?”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Don’t have a mother,” he said. Not only had he no mother,\r\n'
 'but he had not the slightest desire to have one. He thought them very\r\n'
 'over-rated persons. Wendy, however, felt at once that she was in the '
 'presence\r\n'
 'of a tragedy.\n'
 '   </p>\n'
 '   <p>\n'
 '    “O Peter, no wonder you were crying,” she said, and got out of bed\r\n'
 '

 '    Of course this was rather unsatisfactory. However, to make amends he '
 'showed\r\n'
 'them how to lie out flat on a strong wind that was going their way, and '
 'this\r\n'
 'was such a pleasant change that they tried it several times and found that '
 'they\r\n'
 'could sleep thus with security. Indeed they would have slept longer, but '
 'Peter\r\n'
 'tired quickly of sleeping, and soon he would cry in his captain voice,\r\n'
 '“We get off here.” So with occasional tiffs, but on the whole\r\n'
 'rollicking, they drew near the Neverland; for after many moons they did '
 'reach\r\n'
 'it, and, what is more, they had been going pretty straight all the time, '
 'not\r\n'
 'perhaps so much owing to the guidance of Peter or Tink as because the '
 'island\r\n'
 'was looking for them. It is only thus that any one may sight those magic\r\n'
 'shores.\n'
 '   </p>\n'
 '   <p>\n'
 '    “There it is,” said Peter calmly.\n'
 '   </p>\n'
 '   <p>\n'
 '    “Where, where?”\n'
 '   </p>\n'
 '  

 '   </p>\n'
 '   <p>\n'
 '    “I am glad!” Peter cried.\n'
 '   </p>\n'
 '   <p>\n'
 '    “I will call again in the evening,” Slightly said; “give her\r\n'
 'beef tea out of a cup with a spout to it;” but after he had returned the\r\n'
 'hat to John he blew big breaths, which was his habit on escaping from a\r\n'
 'difficulty.\n'
 '   </p>\n'
 '   <p>\n'
 '    In the meantime the wood had been alive with the sound of axes; '
 'almost\r\n'
 'everything needed for a cosy dwelling already lay at Wendy’s feet.\n'
 '   </p>\n'
 '   <p>\n'
 '    “If only we knew,” said one, “the kind of house she likes\r\n'
 'best.”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Peter,” shouted another, “she is moving in her sleep.”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Her mouth opens,” cried a third, looking respectfully into it.\r\n'
 '“Oh, lovely!”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Perhaps she is going to sing in her sleep,” said Peter.\r\n'
 '“Wendy, sing the kind of house you would like to have.”\n'
 '   </p>\n'
 '

 'was doing there; but of course neither of them understood the other’s\r\n'
 'language. In fanciful stories people can talk to the birds freely, and I '
 'wish\r\n'
 'for the moment I could pretend that this were such a story, and say that '
 'Peter\r\n'
 'replied intelligently to the Never bird; but truth is best, and I want to '
 'tell\r\n'
 'you only what really happened. Well, not only could they not understand '
 'each\r\n'
 'other, but they forgot their manners.\n'
 '   </p>\n'
 '   <p>\n'
 '    “I—want—you—to—get—into—the—nest,”\r\n'
 'the bird called, speaking as slowly and distinctly as possible,\r\n'
 '“and—then—you—can—drift—ashore,\r\n'
 'but—I—am—too—tired—to—bring—it—any—nearer—so—you—must—try\r\n'
 'to—swim—to—it.”\n'
 '   </p>\n'
 '   <p>\n'
 '    “What are you quacking about?” Peter answered. “Why\r\n'
 'don’t you let the nest drift as usual?”\n'
 '   </p>\n'
 '   <p>\n'
 '    “I—want—you—” the bird said, and repeated it all\r\n'
 'over.\n'
 '   </p>\n'
 '   <p>\n'
 '

 'the string); and strange to say it was Hook who told them to belay their\r\n'
 'violence. His lip was curled with malicious triumph. While his dogs were '
 'merely\r\n'
 'sweating because every time they tried to pack the unhappy lad tight in '
 'one\r\n'
 'part he bulged out in another, Hook’s master mind had gone far beneath\r\n'
 'Slightly’s surface, probing not for effects but for causes; and his\r\n'
 'exultation showed that he had found them. Slightly, white to the gills, '
 'knew\r\n'
 'that Hook had surprised his secret, which was this, that no boy so blown '
 'out\r\n'
 'could use a tree wherein an average man need stick. Poor Slightly, most\r\n'
 'wretched of all the children now, for he was in a panic about Peter, '
 'bitterly\r\n'
 'regretted what he had done. Madly addicted to the drinking of water when he '
 'was\r\n'
 'hot, he had swelled in consequence to his present girth, and instead of\r\n'
 'reducing himself to fit his tree he had, unknown to the others, whittled 

 'the\r\n'
 'forecastle and came along the deck. Now, reader, time what happened by '
 'your\r\n'
 'watch. Peter struck true and deep. John clapped his hands on the '
 'ill-fated\r\n'
 'pirate’s mouth to stifle the dying groan. He fell forward. Four boys\r\n'
 'caught him to prevent the thud. Peter gave the signal, and the carrion was '
 'cast\r\n'
 'overboard. There was a splash, and then silence. How long has it taken?\n'
 '   </p>\n'
 '   <p>\n'
 '    “One!” (Slightly had begun to count.)\n'
 '   </p>\n'
 '   <p>\n'
 '    None too soon, Peter, every inch of him on tiptoe, vanished into the '
 'cabin; for\r\n'
 'more than one pirate was screwing up his courage to look round. They could '
 'hear\r\n'
 'each other’s distressed breathing now, which showed them that the more\r\n'
 'terrible sound had passed.\n'
 '   </p>\n'
 '   <p>\n'
 '    “It’s gone, captain,” Smee said, wiping off his spectacles.\r\n'
 '“All’s still again.”\n'
 '   </p>\n'
 '   <p>\n'
 '    Slowly Hook let his head e

 '   </p>\n'
 '   <p>\n'
 '    “What do we see now?”\n'
 '   </p>\n'
 '   <p>\n'
 '    “I don’t think I see anything to-night,” says Wendy, with a\r\n'
 'feeling that if Nana were here she would object to further conversation.\n'
 '   </p>\n'
 '   <p>\n'
 '    “Yes, you do,” says Jane, “you see when you were a little\r\n'
 'girl.”\n'
 '   </p>\n'
 '   <p>\n'
 '    “That is a long time ago, sweetheart,” says Wendy. “Ah me,\r\n'
 'how time flies!”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Does it fly,” asks the artful child, “the way you flew when\r\n'
 'you were a little girl?”\n'
 '   </p>\n'
 '   <p>\n'
 '    “The way I flew? Do you know, Jane, I sometimes wonder whether I ever '
 'did\r\n'
 'really fly.”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Yes, you did.”\n'
 '   </p>\n'
 '   <p>\n'
 '    “The dear old days when I could fly!”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Why can’t you fly now, mother?”\n'
 '   </p>\n'
 '   <p>\n'
 '    “Because I am grown up, dearest. When people grow up they forget the

<p>Since I am not interested in chapter titles or the copyright information for this task, I extract only the <code>div</code> tags whose <code>class</code> attribute contains "chapter".</p>

In [4]:
# Extracting the text of each chapter from the book

def div_has_class(tag):
    return tag.name == 'div' and tag.has_attr('class')

def find_chapters(cls):
    return cls and re.search('chapter', cls)

chapters = soup.find_all(div_has_class, class_=find_chapters)
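<p>To make the class filter concrete, here is a minimal sketch of how <code>find_chapters</code> behaves on a few class values. The values in the list are hypothetical and chosen only for illustration; the sketch assumes the function receives a class string, or <code>None</code> when a tag has no class.</p>

```python
import re

def find_chapters(cls):
    # Same filter as above: truthy when the class string contains "chapter"
    return cls and re.search('chapter', cls)

# Hypothetical class values, for illustration only
for cls in ['chapter', 'chapters', 'toc', None]:
    print(cls, bool(find_chapters(cls)))
```

<p>Because <code>re.search</code> matches anywhere in the string, a class such as "chapters" would also pass the filter. If an exact match is needed, <code>cls == 'chapter'</code> would be a stricter alternative.</p>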

<h2>Identifying Meaningful Words</h2>

<p>In this next step, I define a few functions that tokenize the text into words and decide whether a token is meaningful. This is done with the help of NLTK's English stopwords list and the <code>string.punctuation</code> constant. While I am at it, I also define a function that creates a spaCy doc object to perform a simple Named Entity Recognition (NER) task by checking whether an entity's label is PERSON. This lets me keep a name's capital letter in the final output, whereas all other words are lowercased.</p>

In [5]:
# Build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words('english'))
STOPWORDS.update({'could', 'would', 'should', 'can', 'will', 'might'})

# Check for stopwords
def check_stopwords(token):
    return token in STOPWORDS

# Check for punctuation
# Note: `in` on a string is a substring test, so only tokens that appear
# verbatim inside the punctuation string are treated as punctuation
def check_punctuation(token):
    punctuation = string.punctuation + '’“”'
    return token in punctuation

# Naive Named Entity Recognition
def check_name(token, names):
    return token in names

# A list of the names in the text
def find_names(text):
    doc = nlp(text)
    names = []
    # doc.ents holds entity spans, which may cover more than one token
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            names.append(ent.text)
    return names

# Tokenizing the text
def get_tokens(text):
    tokens = []
    names = find_names(text)
    text = word_tokenize(text)
    for token in text:
        if not check_stopwords(token.lower()) and not check_punctuation(token.lower()):
            if check_name(token, names):
                tokens.append(token)
            else:
                tokens.append(token.lower())
    return tokens
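<p>The filtering logic can be tried without NLTK or spaCy. The sketch below is a simplified stand-in, not the pipeline above: <code>STOPWORDS</code>, <code>NAMES</code>, and <code>simple_tokens</code> are hypothetical, the stopword set is tiny, the name list is hard-coded instead of coming from NER, and a regular expression replaces <code>word_tokenize</code>.</p>

```python
import re
import string

# Hypothetical stand-ins: a tiny stopword set and a fixed name list
STOPWORDS = {'the', 'a', 'and', 'was', 'to', 'he', 'she', 'it', 'is', 'there'}
NAMES = ['Peter', 'Wendy']
PUNCTUATION = string.punctuation + '’“”'

def simple_tokens(text):
    # Split into runs of word characters or single non-space symbols
    raw = re.findall(r"\w+|[^\w\s]", text)
    kept = []
    for token in raw:
        lowered = token.lower()
        # Drop stopwords and punctuation, as in get_tokens
        if lowered in STOPWORDS or lowered in PUNCTUATION:
            continue
        # Keep the capital letter only for known names
        kept.append(token if token in NAMES else lowered)
    return kept

sample = '“There it is,” said Peter calmly.'
print(simple_tokens(sample))  # ['said', 'Peter', 'calmly']
```

<p>The real functions differ mainly in where the stopwords and names come from; the keep-or-lowercase decision is the same.</p>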

<p>Since I am only interested in the frequency of words, I use a simple Counter object to find this. Looping through the chapters of the book, I use my functions to check for the meaningful tokens and update the Counter object.</p>

In [6]:
# Initializing the counter object
meaningful_tokens = Counter()

for chapter in chapters:
    for paragraph in chapter.find_all('p'):
        paragraph = get_tokens(paragraph.get_text())
        meaningful_tokens.update(paragraph)
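<p>As a small illustration of why a single Counter works across paragraphs, the sketch below feeds it a few hypothetical token lists (not taken from the book). Each call to <code>update</code> adds to the running totals instead of replacing them.</p>

```python
from collections import Counter

# Hypothetical token lists, one per paragraph
paragraphs = [
    ['Peter', 'cried', 'glad'],
    ['said', 'Peter'],
    ['Wendy', 'said', 'said'],
]

counts = Counter()
for tokens in paragraphs:
    # update() adds to the existing counts rather than overwriting them
    counts.update(tokens)

print(counts['said'], counts['Peter'], counts['Wendy'])  # 3 2 1
```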

<p>Finally, with the Counter object completed, I can sort the words by frequency and extract the ten most frequent ones! These words are split into two lists, separating the names from the other frequent words.</p>

In [10]:
# Convert the Counter to a dict
meaningful_tokens = dict(meaningful_tokens)

# Sort the (word, count) pairs by count, highest first
sorted_meaningful_tokens = sorted(meaningful_tokens.items(), key=lambda x: x[1], reverse=True)

top_names = [word for word, count in sorted_meaningful_tokens[:10] if word[0].isupper()]
other_top_words = [word for word, count in sorted_meaningful_tokens[:10] if not word[0].isupper()]

print(top_names)
print(other_top_words)

['Peter', 'Wendy', 'John', 'Darling']
['said', 'one', 'hook', 'cried', 'time', 'see']
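<p>The ranking and splitting step can also be sketched with <code>Counter.most_common</code>, which returns the highest counts directly. The counts below are hypothetical and only illustrate the capital-letter split; this is an alternative formulation, not the code used above.</p>

```python
from collections import Counter

# Hypothetical counts, not taken from the book
counts = Counter({'Peter': 5, 'said': 4, 'Wendy': 3, 'time': 2, 'see': 1})

# most_common(n) returns the n (word, count) pairs with the highest counts
top = counts.most_common(3)

# Names are the only tokens that kept their capital letter
names = [word for word, _ in top if word[0].isupper()]
others = [word for word, _ in top if not word[0].isupper()]

print(names, others)  # ['Peter', 'Wendy'] ['said']
```

<p>The split relies entirely on the earlier tokenizing step having lowercased every token that was not recognized as a name.</p>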


<h2>The Top Names</h2>
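<p>As the output shows, the names among the ten most frequent meaningful words are <strong>Peter</strong>, <strong>Wendy</strong>, <strong>John</strong>, and <strong>Darling</strong>. The other frequent words are <strong>said</strong>, <strong>one</strong>, <strong>hook</strong>, <strong>cried</strong>, <strong>time</strong>, and <strong>see</strong>. Note that <strong>hook</strong> most likely refers to Captain Hook; it appears in lowercase because the naive name check keeps a token's capital letter only when spaCy labels it as PERSON in that paragraph.</p>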