# Welcome to Python for Data Scientists: An Introduction!

For this module, we will use the Natural Language Toolkit (NLTK) along with several other popular Python packages to build a data science pipeline that plots frequency histograms of the words in HTML novels.

To get started, you will need a Python installation (3.6.3 or later is recommended).
```
$ python --version
Python 3.6.3
```


Clone or download the repository https://github.com/LSU-Analytics/activate_2018.git
```
$ git clone https://github.com/LSU-Analytics/activate_2018.git
```

Run this command to install the packages: 
```
$ pip install -r requirements.txt
```

Or you can install the packages individually.
```
$ pip install beautifulsoup4
$ pip install jupyter
$ pip install matplotlib
$ pip install nltk
$ pip install pandas
$ pip install requests
$ pip install scipy
$ pip install seaborn
```

## Get some data
Where do we get data?  That's easy: data is everywhere.  We can import files (csv, xlsx, txt), pull from APIs (usually as JSON), or obtain raw HTML.  For this example, we will use the novels that are freely available online at Project Gutenberg.

Here are several links to well-known HTML books:
- 'https://www.gutenberg.org/files/514/514-h/514-h.htm' # Little Women
- 'https://www.gutenberg.org/files/42671/42671-h/42671-h.htm' # Pride & Prejudice
- 'https://www.gutenberg.org/files/203/203-h/203-h.htm' # Uncle Tom's Cabin
- 'https://www.gutenberg.org/files/205/205-h/205-h.htm' # Walden

In [1]:
# Store url


Next, we need to fetch the HTML file.  To do this, we will use a popular package known as ```requests```.  If you are familiar with HTTP requests, we will be submitting a ```GET``` request.

In [2]:
# Import `requests`


# Make the request and check object type


The ```type``` function returns the datatype of an object.  Here we are getting a ```Response``` object.

The following commands extract and print the raw HTML.

In [3]:
# Extract HTML from Response object and print
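One way to fill in the three cells above is sketched below. The function name `fetch` and the 30-second `timeout` are illustrative choices rather than part of the workshop material, and the request is wrapped in a function so that nothing is downloaded until you call it.

```python
import requests

# Little Women, taken from the list of links above
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'


def fetch(url, timeout=30):
    """Submit a GET request and return the Response object."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # stop early on a 4xx or 5xx status code
    return r
```

In the notebook, `r = fetch(url)` followed by `type(r)` should report a `Response` object, and `html = r.text` then holds the raw HTML as a string that you can print.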


## Wrangle the data

**Tag soup** refers to unstructured (or malformed) HTML code.  The package ```BeautifulSoup``` allows you to easily interact with this code.

Because we are in Louisiana, let's refer to our HTML soup as 'gumbo'.

In [4]:
# Import BeautifulSoup from bs4


# Create a BeautifulSoup object from the HTML
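A minimal sketch of this cell follows. The short `html` string is a hypothetical stand-in for the page downloaded earlier, and `'html.parser'` (Python's built-in parser) is an assumption; the workshop may use a different parser.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded HTML; the unclosed <p> tags
# make it a small bowl of tag soup
html = '<html><head><title>Little Women</title></head><body><p>Playing<p>Pilgrims</body></html>'

# Parse the HTML with Python's built-in parser
gumbo = BeautifulSoup(html, 'html.parser')
print(type(gumbo))
```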


From our ```gumbo``` object, we can extract some information such as the title.

In [5]:
# Get title as string
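As a rough illustration, the title can be read through the `.title` attribute. The one-line page below is a hypothetical stand-in for the real novel.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
gumbo = BeautifulSoup('<html><head><title>Little Women</title></head></html>', 'html.parser')

# .title finds the <title> tag; .string returns the text inside it
title = gumbo.title.string
print(title)
```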


We can also find the hyperlinks within a page (```<a>``` tags):

In [6]:
# Get hyperlinks from gumbo and check out first several
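One possible version of this cell uses `find_all`. The two-link snippet is a made-up stand-in, and collecting the `href` values is an optional extra step not mentioned in the text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = '<body><a href="#chap01">Chapter 1</a> <a href="#chap02">Chapter 2</a></body>'
gumbo = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; slice to look at the first several
links = gumbo.find_all('a')
print(links[:8])

# Optional: keep only the link targets
hrefs = [a.get('href') for a in links]
print(hrefs)
```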


For this project, we want the text from the ```gumbo``` object.  Luckily, there is a ```.get_text()``` method for doing this.

In [7]:
# Get the text out of the gumbo and print it
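A short sketch of the text extraction, again on a hypothetical stand-in page rather than the full novel:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = """<body><h2>CHAPTER ONE</h2>
<p>"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.</p></body>"""
gumbo = BeautifulSoup(html, 'html.parser')

# get_text() drops the tags and keeps only the text between them
text = gumbo.get_text()
print(text)
```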


Almost there!  While we have the text of the novel, it still contains some metadata.  Since the metadata is minimal and will not influence our findings, let's move forward with the project.

## Extract Words
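Next, we will split the text into words (tokens) and remove the stop words.  We start with regular expressions (regex) and then do the same with ```nltk```.

First, let's see a regex in use on a single sentence.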

In [8]:
# Import regex package


# Define sentence


# Define regex


# Find all words in sentence that match the regex and print them


In [9]:
# Find all words and print them
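The two regex cells above could be filled in along these lines. The example sentence and the variable names are assumptions chosen for illustration.

```python
import re

sentence = 'peter piper picked a peck of pickled peppers'

# A 'p' followed by one or more word characters
ps = r'p\w+'
p_words = re.findall(ps, sentence)
print(p_words)

# Any run of one or more word characters, which captures every word
all_words = re.findall(r'\w+', sentence)
print(all_words)
```

Note that `p\w+` matches a 'p' anywhere in a word, not only at the start; here every match happens to begin a word.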


Let's do something similar with the ```text``` object.

In [10]:
# Find all words in the novel and print several
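Applied to the novel, the same pattern might look like the sketch below. The one-sentence `text` is a stand-in for the full string extracted earlier.

```python
import re

# Hypothetical stand-in for the `text` string extracted from the page
text = '"Christmas won\'t be Christmas without any presents," grumbled Jo, lying on the rug.'

# Split the text into runs of word characters
tokens = re.findall(r'\w+', text)
print(tokens[:8])
```

Because an apostrophe is not a word character, a contraction such as "won't" is split into two tokens, which is a known limitation of this simple pattern.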


**Note** that there is also a way to do this with ```nltk```, the Natural Language Toolkit:

In [11]:
# Import RegexpTokenizer from nltk.tokenize

# Create tokenizer


# Create tokens
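A minimal sketch with `RegexpTokenizer`, assuming the same `\w+` pattern as before and a one-sentence stand-in for the novel's text:

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for the `text` string extracted from the page
text = 'Jo was the first to wake in the gray dawn of Christmas morning.'

# Create a tokenizer that keeps runs of word characters
tokenizer = RegexpTokenizer(r'\w+')

# Create the list of tokens
tokens = tokenizer.tokenize(text)
print(tokens[:8])
```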


Almost there!  At this point, words that start with a capital letter will be counted as separate words: 'The' and 'the' would be tallied separately.  To handle this issue, make all of the words lowercase.

In [12]:
# Initialize new list


# Loop through list tokens and make lower case


# Print several items from list as sanity check
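The loop described in this cell can be sketched as follows, using a short hypothetical `tokens` list in place of the novel's tokens:

```python
# Hypothetical stand-in for the tokens created earlier
tokens = ['Christmas', 'won', 't', 'be', 'Christmas', 'without', 'any', 'presents']

# Initialize new list
words = []

# Loop through the tokens and make each one lowercase
for word in tokens:
    words.append(word.lower())

# Print several items as a sanity check
print(words[:8])
```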


Stop words such as 'the', 'and', and 'of' appear very often but provide no real insights, so let's remove them.

In [13]:
# Import nltk


# Get English stopwords and print some of them


# If you encounter an error, run the command below.
# nltk.download('stopwords')
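A sketch of this cell that also covers the download step is shown below. The helper name `load_stopwords` is hypothetical; it falls back to `nltk.download('stopwords')` only when the corpus is missing, which requires an internet connection.

```python
import nltk
from nltk.corpus import stopwords


def load_stopwords():
    """Return the English stop words, downloading the corpus if needed."""
    try:
        return stopwords.words('english')
    except LookupError:
        # The corpus is not installed yet, so fetch it once and retry
        nltk.download('stopwords')
        return stopwords.words('english')
```

In the notebook, `sw = load_stopwords()` followed by `print(sw[:5])` shows the first few entries.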

In [14]:
# Initialize new list


# Add to words_ns all words that are in words but not in sw


# Print several list items as sanity check
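The filtering step might look like the sketch below. Both lists are tiny hand-made stand-ins: `sw` here is not the real NLTK stop word list, only a few entries picked for illustration.

```python
# Hypothetical stand-ins for the lowercase words and the stop words
words = ['christmas', 'won', 't', 'be', 'christmas', 'without', 'any', 'presents']
sw = {'won', 't', 'be', 'any'}

# Initialize new list
words_ns = []

# Add to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)

# Print several list items as a sanity check
print(words_ns[:5])
```

With the full novel, converting the stop words to a `set` as done here keeps each membership check fast.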



## Answering the question
We started this project wanting to know the most frequently used words in a novel.  An easy way to answer this question is to create a graph.

In [15]:
# Import data visualization libraries


# Figures inline and set visualization style


# Create freq dist and plot
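One way to build the plot is with `nltk.FreqDist`, sketched here on a small hypothetical word list. The choice of `sns.set()` for the style and of 25 words for the plot are assumptions.

```python
import matplotlib.pyplot as plt
import nltk
import seaborn as sns

# In a Jupyter notebook, run the %matplotlib inline magic first so that
# figures render inside the page
sns.set()  # apply seaborn's default visual style

# Hypothetical stand-in for the filtered word list
words_ns = ['christmas', 'presents', 'christmas', 'jo', 'rug', 'jo', 'christmas']

# Count how often each word occurs, then plot the most common ones
freqdist = nltk.FreqDist(words_ns)
print(freqdist.most_common(3))
freqdist.plot(25)
```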


## Bonus: Create a reusable function

There are hundreds of novels on Project Gutenberg, so it makes sense to write a function that reuses our code from above.

In [16]:
def plot_word_freq(url, num=25):
    """Takes a url and a number of words and plots the word frequency distribution"""
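A possible body for this function, assembled from the steps above, is sketched below. The `'html.parser'` parser and the `raise_for_status()` check are assumptions, and the stop word corpus must already be downloaded with `nltk.download('stopwords')`.

```python
import nltk
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer


def plot_word_freq(url, num=25):
    """Fetch an HTML page and plot its `num` most frequent non-stop words."""
    # Get the data
    r = requests.get(url)
    r.raise_for_status()

    # Wrangle the data: strip the tags and keep the text
    text = BeautifulSoup(r.text, 'html.parser').get_text()

    # Extract lowercase words and drop the stop words
    tokens = RegexpTokenizer(r'\w+').tokenize(text)
    words = [token.lower() for token in tokens]
    sw = set(stopwords.words('english'))
    words_ns = [word for word in words if word not in sw]

    # Answer the question: plot the most common words
    nltk.FreqDist(words_ns).plot(num)
```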
    
   

In [17]:
plot_word_freq('https://www.gutenberg.org/files/521/521-h/521-h.htm', 10)

## Conclusion

What have we learned?  You now have the foundation for 'scraping' HTML data from a website, extracting data, manipulating text, and plotting output.