## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze Peter Pan's text to answer the question in the instruction pane below. You are working with the text version available here at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

## 1. Importing modules

In [1]:
# Use this cell to begin your analysis, and add as many as you would like!
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter


## 2. Gathering data

The data comes from an HTTP request to the Project Gutenberg URL. I used the requests package to download the page and the Python package BeautifulSoup, a web scraping library, to parse the HTML and extract its text.

In [13]:
# Creating a variable for the URL
URL = "https://www.gutenberg.org/files/16/16-h/16-h.htm"

# Requesting the page using requests.get
page = requests.get(URL)

# Using BeautifulSoup to parse the page content
soup = BeautifulSoup(page.content, 'html.parser')

# Getting only the text using soup.text
texts = soup.text
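
To make the text-extraction step concrete, here is a minimal sketch of the idea on a tiny inline HTML snippet, so it runs without a network request. It uses the standard-library `html.parser` module instead of BeautifulSoup, so it only illustrates the principle of keeping the text between tags and is not how BeautifulSoup works internally. The names `TextExtractor`, `sample_html`, and `sample_text` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text that sits between two tags
        self.chunks.append(data)


# A tiny stand-in for the downloaded page (assumption: not the real page)
sample_html = (
    "<html><body><h1>Peter Pan</h1>"
    "<p>All children, except one, grow up.</p></body></html>"
)

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()

# Joining the chunks gives plain text with no markup left in it
sample_text = " ".join(extractor.chunks)
print(sample_text)
```

The printed string contains only the heading and paragraph text, which is the kind of input the tokenizer expects in the next step.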

## 3. Tokenizing the text
In this step, we tokenize the text with a regular expression that matches words only. We then convert every token to lowercase so that the same word in different cases, like "One" and "one", is not counted separately.

In [14]:
# Creating a regex tokenizer that keeps words only
# (raw string so that "\w" is not treated as a string escape sequence)
tokenizer = nltk.tokenize.RegexpTokenizer(r"\w+")

# Tokenizing text
txt_tokenized = tokenizer.tokenize(texts)

# Converting words to lowercase
text_lower = [word.lower() for word in txt_tokenized]

# Printing the number of tokens
len(text_lower)

51893

In [22]:
# Counting the ten most common tokens
top_ten = Counter(text_lower).most_common(10)
top_ten

[('the', 2546),
 ('and', 1491),
 ('to', 1285),
 ('he', 1060),
 ('a', 996),
 ('of', 989),
 ('was', 928),
 ('it', 840),
 ('in', 743),
 ('that', 645)]
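
The effect of the word-only pattern and of lowercasing can be checked on a short sample sentence. This is a sketch under stated assumptions: it uses the standard-library `re.findall` with the same `\w+` pattern rather than `nltk`, which should give the same tokens for this pattern, and the names `sample`, `tokens`, `lowered`, and `contraction_tokens` are made up for the example.

```python
import re
from collections import Counter

# A short sample sentence (assumption: stands in for the full book text)
sample = 'All children, except one, grow up. "One" is not "one"!'

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
tokens = re.findall(r"\w+", sample)

# Without lowercasing, "One" and "one" are counted as different words
counts_raw = Counter(tokens)

# After lowercasing, all three occurrences fall under the same key
lowered = [token.lower() for token in tokens]
counts_lower = Counter(lowered)

print(counts_raw["one"], counts_raw["One"])   # 2 1
print(counts_lower["one"])                    # 3

# Side effect of the pattern: an apostrophe splits a contraction in two
contraction_tokens = re.findall(r"\w+", "Wouldn't")
print(contraction_tokens)                     # ['Wouldn', 't']
```

The last line shows one trade-off of this pattern: contractions are split at the apostrophe, so fragments such as "t" can appear among the tokens.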

## 4. Removing stopwords
The top 10 most common words are mostly English stopwords like "the", "and", and "to", which say nothing about the content of the book. The next step is to remove these words using the list returned by stopwords.words('english').

In [23]:
# Defining the stop words
stop_words = set(stopwords.words('english'))

# Filtered words
filtered_words = [word for word in text_lower if word not in stop_words]

len(filtered_words)

24291

In [18]:
top_ten = Counter(filtered_words).most_common(10)
top_ten

[('peter', 410),
 ('wendy', 362),
 ('said', 358),
 ('would', 219),
 ('one', 214),
 ('hook', 175),
 ('could', 142),
 ('cried', 136),
 ('john', 133),
 ('time', 126)]
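
The filtering step can be sketched on a handful of tokens. This example makes two assumptions: `sample_stop_words` is a small hand-written subset, not the full `nltk` English list, and `sample_tokens` is an invented token list rather than text from the book.

```python
from collections import Counter

# Hand-written subset of stop words (assumption: not the full nltk list)
sample_stop_words = {"the", "and", "to", "he", "a", "of", "was", "it", "in", "that"}

# Invented, already lowercased tokens
sample_tokens = [
    "peter", "and", "wendy", "flew", "to", "the",
    "island", "and", "peter", "was", "happy",
]

# Keep only the tokens that are not stop words
filtered = [word for word in sample_tokens if word not in sample_stop_words]

# Count what is left and look at the most frequent word
most_common_word = Counter(filtered).most_common(1)

print(filtered)
print(most_common_word)   # [('peter', 2)]
```

Storing the stop words in a set, as the cell above does with `set(stopwords.words('english'))`, keeps each membership test fast even when the token list has tens of thousands of entries.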

In [19]:
# Character names that appear among the ten most common words
protagonists = ['peter', 'wendy', 'hook', 'john']
