# <font face="times"><font size="6pt"><p style = 'text-align: center;'> The City University of New York, Queens College

<font face="times"><font size="6pt"><p style = 'text-align: center;'><b>Introduction to Computational Social Science</b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>Lesson 10 | Natural Language Processing II: Web Scraping and Text Analysis </b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>5 Checkpoints</b><br/><br/>



***
***

# Begin Lesson 10
# Using Text as Data

Now that we've covered the basics, let's see what we can really do. In this notebook, we're going to learn how to:

- Extract data from the web
- Run sentiment analysis on text and identify its parts of speech




***
***

## Extracting Text from HTML

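We'll start with a basic and common task: extracting text content from an HTML page. Python's standard-library `urllib.request` module gives us the tools we need to fetch a web page from a given URL, but the output is full of HTML markup that we don't want to deal with.

Note that `urllib.request` ships with Python and needs no installation. The install cell below fetches `urllib3`, a separate third-party package that this lesson doesn't import, so you can skip it.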

In [None]:
!pip3.6 install --user urllib3

In [None]:
from urllib.request import urlopen

Let's test it out on an article from a news site called "venturebeat.com" (although this ought to work with many different websites).

In [None]:
url = "http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/"
html = urlopen(url).read()

Let's see what this looks like (just a sneak peek, so we'll only look at the first 500 characters).

In [None]:
html[:500]

That doesn't make much sense unless you know `HTML`. Thankfully, `Python` has tools that will help us extract the useful information. 
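
One detail that is easy to miss: `urlopen(url).read()` returns `bytes`, not a `str`, which is why the preview above starts with `b'`. Here is a minimal sketch of decoding it, assuming the page is UTF-8 encoded (common, but not guaranteed). The sample bytes and the helper name `to_text` are made up for illustration:

```python
# Made-up stand-in for the bytes that urlopen(url).read() returns.
raw = b"<html><head><title>Caf\xc3\xa9 news</title></head><body><p>Hello</p></body></html>"


def to_text(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes into a str (hypothetical helper)."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw_bytes.decode(encoding, errors="replace")


page = to_text(raw)
print(type(raw).__name__, "->", type(page).__name__)
print(page[:40])
```

The libraries used later in this lesson accept either form, so decoding is mostly useful when you want to search or slice the markup yourself.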

***
***

# Checkpoint 1 of 5
## Now you try!

### Pick any news article (or any website with a lot of text). Use `urlopen` to read in the page's `HTML` and look at the first 500 characters. 

### **Note:** You'll need a stable internet connection to do this if you're not working in the cloud. 

### What do you see?


***
***

***
## Stripping-out HTML formatting

Fortunately, we can use the `BeautifulSoup` class to get the raw text out of an HTML-formatted string. BeautifulSoup is a Python library for pulling data out of HTML and XML files. It parses HTML content into an easily navigable nested data structure.

It's still not perfect, though, since the output will contain page navigation and all kinds of other junk that we don't want, especially if our goal is to focus on the body content from a news article, for example.
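
To make the idea concrete, here is a minimal sketch of tag stripping that uses only the standard library's `html.parser` (a deliberately simpler stand-in for `BeautifulSoup`, not how that library works internally). The class name `TextOnly` and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and ignore tags (stdlib stand-in, not BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # how many <script>/<style> elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style block.
        if not self.skip and data.strip():
            self.chunks.append(data.strip())


sample = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<script>var x = 1;</script>
<p>The <b>actual</b> story text.</p>
</body></html>"""

parser = TextOnly()
parser.feed(sample)
plain = " ".join(parser.chunks)
print(plain)
```

The navigation labels ("Home", "About") survive alongside the story text, which is exactly the kind of junk described above.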

In [None]:
import bs4
from bs4 import BeautifulSoup

For `BeautifulSoup` to work, just pass in the string (in this case, `html`) and name the parser to use (in this case, `'xml'`, which relies on the `lxml` package being installed).

In [None]:
text = BeautifulSoup(html,'xml')

Now, take a look at how nicely it converted the raw text into actual `XML`. It still looks overwhelming and is hard for us to interpret, but it's a step in the right direction. 

**Note:** It's going to be long!

In [None]:
text

***
***

# Checkpoint 2 of 5
## Now you try!

### With the `HTML` text you extracted from the previous checkpoint, now apply `BeautifulSoup` and convert it to `xml`. 

### How does it compare to the `HTML` you saw before?

***
***

***

## Identifying the Main Content
If we just want the body content from the article, we'll need one more package alongside `BeautifulSoup`. It's called `readability`, and it pulls the main body content out of an HTML document and subsequently "cleans it up." 

Using Readability and BeautifulSoup together, we can quickly get exactly the text we're looking for out of the HTML, (*mostly*) free of page navigation, comments, ads, etc. Now we're ready to start analyzing this text content.

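***NOTE***: In order for `readability` to work, we'll need to download the module using `pip`. This may take a few minutes. The `Document` class used below comes from the `readability-lxml` package (the second install cell).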

In [None]:
!pip3.6 install --user readability

In [None]:
!pip3.6 install --user readability-lxml

In [None]:
import readability   

In [None]:
from readability.readability import Document #Note, we need to call it from readability.readability, a strange quirk to how this module was originally named. 

Let's use the function `Document()` and pass in our string `html` and extract the summary and title of the article, using the `summary()` and `title()` methods respectively. 

And let's store it as `readable_article` and `readable_title`. 

In [None]:
readable_article = Document(html).summary()
readable_title = Document(html).title()

Let's take a peek at the summary, but let's only look at the first 500 characters.

In [None]:
readable_article[0:500]

It still has a lot of `HTML` markup embedded in it, including plenty of unhelpful tags. Here, `BeautifulSoup` can clear much of this out of the string.

Since the summary is still an `HTML` string, we tell `BeautifulSoup` to parse it with the `lxml` parser by passing `'lxml'` as the second argument.

In [None]:
soup = BeautifulSoup(readable_article,'lxml')

Now, let's print out the title and the summary of the article that we cleaned up with `BeautifulSoup`. Here, we can use the `.text` attribute, which holds the readable text from `soup`. Let's restrict it to the first 500 characters. 

In [None]:
print('*** The Title is *** \n\"' + readable_title + '\"\n')
# Close the quotation mark that opens before the content excerpt.
print('*** The Content is *** \n\"' + soup.text[:500] + '\"')

As you can see, it's now much easier on the eyes!
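
For intuition about how a tool can locate the main body at all, here is a toy heuristic: pick the container that holds the most paragraph text. This is an assumption-laden sketch for illustration only, not the algorithm that `readability` actually implements. The names `ParagraphScorer` and `main_content` and the sample page are hypothetical:

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "article", "section", "nav", "footer", "header", "aside"}


class ParagraphScorer(HTMLParser):
    """Record paragraph text under its nearest enclosing container (toy heuristic)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # labels of the containers currently open
        self.in_p = False
        self.text = {}    # container label -> list of paragraph strings
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.counter += 1
            # Label by id when present, otherwise by order of appearance.
            self.stack.append(f"{tag}#{dict(attrs).get('id', self.counter)}")
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and self.stack and data.strip():
            self.text.setdefault(self.stack[-1], []).append(data.strip())


def main_content(html):
    """Return (label, text) for the container with the most paragraph characters."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    best = max(scorer.text, key=lambda k: sum(len(t) for t in scorer.text[k]))
    return best, " ".join(scorer.text[best])


sample = """
<div id="nav"><p>Home</p><p>About</p></div>
<div id="story"><p>Researchers ran a large experiment on news feeds.</p>
<p>Critics questioned whether users had consented.</p></div>
<div id="footer"><p>Copyright</p></div>
"""

label, body = main_content(sample)
print(label)
print(body)
```

Real pages are much messier than this sample, which is why a dedicated package is worth using in practice.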

***
***

# Checkpoint 3 of 5 
## Now you try! 

### Apply `readability` to the **original** `HTML` text you extracted in checkpoint 1. Print out the title and content of the webpage. 

### How does this compare to what you read in checkpoint 2?

***
***

***
***

## Part of Speech (PoS) Tagging 

We now look at an example of part-of-speech tagging using NLTK. Knowing the part of speech of each term is helpful for a variety of purposes. For instance, in sentiment analysis (which looks at whether a term is used in a positive or negative way), understanding whether the term is used as a noun, a modifier (adjective), or a verb can help us understand the rhetorical style of a particular text. 

Here, we're not going to rehash (or try to recall) our grade-school grammar classes. Essentially, the `nltk` PoS tagger processes a tokenized sentence and reports what it believes the part of speech is for each term. 

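Below, I've outlined a general list of grammatical categories for orientation. Hopefully, a lot of these seem familiar (e.g., noun, adjective, verb, pronoun). Note that `nltk`'s default tagger reports Penn Treebank codes instead (for example, `NN` for a noun or `NNP` for a proper noun), which you can look up with `nltk.help.upenn_tagset()`.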

### Tagset

    N = noun
    NP = noun phrase
    Adj = adjective
    AdjP = adjective phrase
    Adv = adverb
    Prep = preposition
    PP = prepositional phrase
    Quant = quantifier
    Ord = ordinal numeral
    Card = cardinal numeral
    Rel-Cl = relative clause
    Rel-Pro = relative pronoun
    V = verb
    S = sentence
    Det = determiner
    Dem-Det = demonstrative determiner
    Wh-Det = wh-determiner
    PPron = personal pronoun
    PoPron = possessive pronoun

So, let's test it out on a sample sentence: 

"WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

Let's import `nltk` and save this sentence as a string.

In [None]:
import nltk 

In [None]:
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

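We first need to split this sentence into tokens (words and punctuation). Let's use `nltk`'s `word_tokenize()` function. Once that's done, we can use `nltk.pos_tag()` to tag the part of speech of each token. (If `nltk` reports a missing resource, its error message names the data package to fetch with `nltk.download()`.)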

In [None]:
pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

Let's see what this looks like. `pos_sentence` is a list of `tuples`, so let's use a for loop and print out each term and its PoS. 

In [None]:
for term, part_of_speech in pos_sentence:
    print("Term: " + term+ " | Part of Speech: " + part_of_speech)

If you are unsure of what each of these tags means, you can always use `nltk` to return what it is and examples, as shown below:

In [None]:
nltk.help.upenn_tagset('NNP')
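
Once you have `(term, tag)` pairs, a common next step is to group the detailed Penn Treebank codes into coarse categories. The sketch below assumes a partial mapping based on tag prefixes (`NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives, `RB*` for adverbs). The tagged list was written by hand for illustration and is not actual tagger output, and `coarse_category` is a hypothetical helper:

```python
from collections import Counter

# Hand-tagged pairs in the same (term, tag) shape that nltk.pos_tag returns;
# written by hand for illustration, not produced by the tagger.
tagged = [
    ("Lynch", "NNP"), ("spoke", "VBD"), ("forcefully", "RB"),
    ("about", "IN"), ("the", "DT"), ("pain", "NN"),
    ("of", "IN"), ("a", "DT"), ("broken", "JJ"), ("trust", "NN"),
]

# Partial mapping from Penn Treebank tag prefixes to coarse categories.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category, or 'other' if unmapped."""
    return COARSE.get(tag[:2], "other")


counts = Counter(coarse_category(tag) for _, tag in tagged)
nouns = [term for term, tag in tagged if coarse_category(tag) == "noun"]
print(counts)
print(nouns)
```

Swapping `tagged` for `pos_sentence` applies the same grouping to the real tagger output.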

***
***

# Checkpoint 4 of 5
## Now you try!

### Read in some sentence. It can come from anywhere! Use the part-of-speech tagger from `nltk` and a for loop to identify every term's part of speech. 

### You can use `nltk.help.upenn_tagset()` function to identify what the part-of-speech code means (e.g., verb, adverb, pronoun, etc.). 

### How well did it do?

***
***

***
***

## Sentiment Analysis

Now that we know how to determine the PoS of each term in a sentence, let's turn to sentiment analysis using Empath (empath.stanford.edu), a dictionary tool that counts words in various categories (e.g., positive sentiment, negative sentiment). 

First, we need to import the library and create a lexicon. 

(**Note:** This module isn't included with Anaconda. If you use Anaconda for future work, install it with `pip` as shown below, or import it from a file in this directory, i.e., folder.)

You can actually play around with it here: empath.stanford.edu

The authors describe it as follows: "We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter."

Empath can generate new lexical categories and analyze text across 200 built-in, human-validated categories. 

Let's install `empath` and then import it. 

In [None]:
!pip3 install --user empath

In [None]:
import empath

In [None]:
from empath import Empath
lexicon = Empath()

Let's start analyzing a sentence. 

Setting `normalize=True` normalizes the counts by the length of the sentence. Here, let's tokenize the sentence as we did last week, using `nltk`'s `word_tokenize()` function.

Let's test it out with this sentence:

"Bullshit, you can't even post FACTS on this sub- like Clinton lying about sniper fire."

In [None]:
sentiment_dictionary = lexicon.analyze(nltk.word_tokenize("Bullshit, you can't even post FACTS on this sub- like Clinton lying about sniper fire."), normalize=True)

Now, let's go through `sentiment_dictionary` and look only at the values that are greater than zero. We can use a list comprehension to extract them, as shown here:

In [None]:
[(k,v) for k,v in sentiment_dictionary.items() if v > 0]

Not bad. It picked up on words like "fire," "sub," and "lying" and associated them with categories such as "weapon," "social media," and "deception." 
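
To see the counting mechanics behind a dictionary tool, here is a minimal sketch with a tiny lexicon invented for this example (Empath's real categories are far larger and were built differently). It assumes that normalizing means dividing each count by the number of tokens. The function `analyze` here is a hypothetical stand-in, not Empath's own code:

```python
# Toy lexicon invented for this sketch; it is not taken from Empath.
TOY_LEXICON = {
    "deception": {"lying", "lie", "cheat"},
    "weapon": {"sniper", "fire", "gun"},
    "politics": {"clinton", "election", "vote"},
}


def analyze(tokens, lexicon=TOY_LEXICON, normalize=False):
    """Count how many tokens fall into each category (hypothetical helper)."""
    counts = {category: 0.0 for category in lexicon}
    for token in tokens:
        word = token.lower()
        for category, words in lexicon.items():
            if word in words:
                counts[category] += 1.0
    if normalize and tokens:
        # Assumed normalization: divide each count by the number of tokens.
        counts = {c: n / len(tokens) for c, n in counts.items()}
    return counts


tokens = "Clinton lying about sniper fire".split()
raw_counts = analyze(tokens)
scores = analyze(tokens, normalize=True)

# Rank categories from highest to lowest score.
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(raw_counts)
print(top)
```

Because the counter only matches individual words, it has no notion of negation or sarcasm: "not lying" would still add to the deception count.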

Let's try it out with another sentence:

"Totally agree. Planning to beat your opponent is not a sign of corruption. That's politics. "

In [None]:
sentiment_dictionary = lexicon.analyze(nltk.word_tokenize("Totally agree. Planning to beat your opponent is not a sign of corruption. That's politics. "), normalize=True)

In [None]:
[(k,v) for k,v in sentiment_dictionary.items() if v > 0]

***
***

# Checkpoint 5 of 5

## Now you try!

### Let's explore the tool with some more examples. What happens in cases of sarcasm, negation, or very informal text?

### Identify three sentences: One that is sarcastic, one that negates, and an informal text with slang. 

### Repeat the same steps as above with your three sentences. 

### What categories do you pick up? Which are the top categories for each sentence?