In [80]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 10;


# Making sense of the news

How can we get information out of a news story on the web?

### Getting started

What we'll be using...
  1. Our news source for this demo:
      - [The Guardian](https://www.theguardian.com/world)
  2. Libraries for retrieving and processing html:
      - [`requests`](http://docs.python-requests.org/en/master/) and [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
  3. NLP:
      - [`py-processors`](http://py-processors.readthedocs.io/en/latest/)

On macOS, press `fn`+down/up to navigate sub-slides.

In [70]:
import json  # used later to pretty-print the named entities

from bs4 import BeautifulSoup
from processors import *
import requests

## Initialize the NLP server

In [81]:
# We'll be using the server in several examples
# NOTE: you can stop it manually with API.stop_server()
API = ProcessorsAPI(port=8886, keep_alive=True)

Using default
Connection with server established!


# Retrieving a news story

- Let's take a look at the Guardian story that identified Edward Snowden as the source of the NSA leaks

In [72]:
url = "https://www.theguardian.com/world/2013/jun/09/edward-snowden-nsa-whistleblower-surveillance"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
#article_title = soup.title.text

# View the extracted text

In [73]:
# collect the text of every <p> element on the page
article_text = [p.text for p in soup.find_all("p")]
# only look at the first few paragraphs
limit = 3
for text in article_text[:limit]:
    print(text)


Glenn Greenwald, 
Ewen MacAskill and 
Laura Poitras in Hong Kong


Tuesday 11 June 2013 09.00 EDT


The individual responsible for one of the most significant leaks in US political history is Edward Snowden, a 29-year-old former technical assistant for the CIA and current employee of the defence contractor Booz Allen Hamilton. Snowden has been working at the National Security Agency for the last four years as an employee of various outside contractors, including Booz Allen and Dell.
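The extracted paragraphs above carry stray newlines and blank lines from the page markup. As a minimal sketch, assuming we only want to tidy whitespace before annotation, a small helper could normalize each paragraph. The name `clean_paragraphs` and the sample strings are hypothetical and only mimic the byline and dateline shown above.

```python
import re


def clean_paragraphs(paragraphs):
    """Collapse runs of whitespace and drop paragraphs that end up empty."""
    cleaned = []
    for text in paragraphs:
        normalized = re.sub(r"\s+", " ", text).strip()
        if normalized:
            cleaned.append(normalized)
    return cleaned


# Hypothetical raw strings resembling the byline and dateline above
raw = [
    "\nGlenn Greenwald, \nEwen MacAskill and \nLaura Poitras in Hong Kong\n",
    "\n",
    "\nTuesday 11 June 2013 09.00 EDT\n",
]
cleaned = clean_paragraphs(raw)
for paragraph in cleaned:
    print(paragraph)
```

In the notebook this could be applied as `clean_paragraphs(article_text)` before the annotation step, in place of calling `.strip()` on each paragraph.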


# Do we have any Named Entities?

In [74]:
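for text in article_text[:limit]:
    # annotate the paragraph and collect its named entities, grouped by label
    doc = API.fastnlp.annotate(text.strip())
    # pretty-print the label -> mentions dictionary
    formatted_dict = json.dumps(doc.nes, sort_keys=True, indent=4)
    print("{}\n".format(formatted_dict))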

{
    "PERSON": [
        "Glenn Greenwald",
        "Ewen MacAskill",
        "Laura Poitras"
    ]
}

{
    "DATE": [
        "Tuesday 11 June 2013"
    ],
    "NUMBER": [
        "09.00"
    ]
}

{
    "DATE": [
        "current"
    ],
    "DURATION": [
        "29-year-old",
        "the last four years"
    ],
    "LOCATION": [
        "US"
    ],
    "NUMBER": [
        "one"
    ],
    "ORGANIZATION": [
        "CIA",
        "National Security Agency",
        "Dell"
    ],
    "PERSON": [
        "Edward Snowden",
        "Booz Allen Hamilton",
        "Snowden",
        "Booz Allen"
    ]
}
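The dictionaries above are printed one paragraph at a time. A sketch of how they might be combined into a single view of the article follows, assuming each one maps a label to a list of mentions as in the output above. The helper `merge_entities` is hypothetical and not part of `py-processors`; the sample data is copied by hand from the first and third dictionaries (abridged).

```python
from collections import defaultdict


def merge_entities(per_paragraph):
    """Merge label -> mentions dicts, keeping first-seen order and dropping duplicates."""
    merged = defaultdict(list)
    for entities in per_paragraph:
        for label, mentions in entities.items():
            for mention in mentions:
                if mention not in merged[label]:
                    merged[label].append(mention)
    return dict(merged)


# Copied by hand from the output above (abridged)
paragraph_entities = [
    {"PERSON": ["Glenn Greenwald", "Ewen MacAskill", "Laura Poitras"]},
    {
        "ORGANIZATION": ["CIA", "National Security Agency", "Dell"],
        "PERSON": ["Edward Snowden", "Booz Allen Hamilton", "Snowden", "Booz Allen"],
    },
]
merged = merge_entities(paragraph_entities)
for label in sorted(merged):
    print(label, merged[label])
```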



# Where did these labels come from?


In [75]:
# annotate the text of the first paragraph
doc = API.fastnlp.annotate(article_text[0].strip())
sentence = doc.sentences[0]
# print each word and its NE label ("O" marks a word outside any entity)
for word, entity_label in zip(sentence.words, sentence._entities):
    print("{:<15} {}".format(word, entity_label))

Glenn           PERSON
Greenwald       PERSON
,               O
Ewen            PERSON
MacAskill       PERSON
and             O
Laura           PERSON
Poitras         PERSON
in              O
Hong            LOCATION
Kong            LOCATION
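The per-word labels above suggest how multi-word mentions can be assembled: consecutive words that share a label are joined, and `O` words separate them. The sketch below illustrates that idea only; it is not necessarily how `py-processors` builds `doc.nes`, and the helper `group_entities` is hypothetical. One known limitation: two distinct entities with the same label that sit directly next to each other would be merged into one mention.

```python
from itertools import groupby


def group_entities(words, labels, outside="O"):
    """Join consecutive words that share a non-"O" label into one mention."""
    entities = {}
    for label, run in groupby(zip(words, labels), key=lambda pair: pair[1]):
        if label == outside:
            continue
        mention = " ".join(word for word, _ in run)
        entities.setdefault(label, []).append(mention)
    return entities


# Words and labels copied from the output above
words = ["Glenn", "Greenwald", ",", "Ewen", "MacAskill", "and",
         "Laura", "Poitras", "in", "Hong", "Kong"]
labels = ["PERSON", "PERSON", "O", "PERSON", "PERSON", "O",
          "PERSON", "PERSON", "O", "LOCATION", "LOCATION"]

grouped = group_entities(words, labels)
print(grouped)
```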


## Challenge

<code>'PERSON': [<span style="color:gray">'Edward Snowden', <strong><span style="color:red">'Booz Allen Hamilton'</span></strong>, 'Snowden', <strong><span style="color:red">'Booz Allen'</span></strong></span>]</code>

`Booz Allen Hamilton` (and its short form `Booz Allen`) should be labeled `ORGANIZATION`, not `PERSON`
  - **Challenge**: How can we fix this?
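
One possible direction, offered as a sketch and not as the intended answer: post-process the entity dictionary with a small hand-built gazetteer (a lookup table of known names) that overrides the tagger's label. The helper `relabel_with_gazetteer` and the table `KNOWN_ORGANIZATIONS` are hypothetical, and the input is copied from the output shown earlier.

```python
def relabel_with_gazetteer(entities, gazetteer):
    """Move any mention found in the gazetteer to the label the gazetteer assigns."""
    corrected = {}
    for label, mentions in entities.items():
        for mention in mentions:
            # fall back to the tagger's label when the mention is unknown
            target = gazetteer.get(mention, label)
            corrected.setdefault(target, []).append(mention)
    return corrected


# Hypothetical lookup table; a real one would be much larger
KNOWN_ORGANIZATIONS = {
    "Booz Allen Hamilton": "ORGANIZATION",
    "Booz Allen": "ORGANIZATION",
}

# Copied from the third paragraph's output (abridged)
entities = {
    "ORGANIZATION": ["CIA", "National Security Agency", "Dell"],
    "PERSON": ["Edward Snowden", "Booz Allen Hamilton", "Snowden", "Booz Allen"],
}
fixed = relabel_with_gazetteer(entities, KNOWN_ORGANIZATIONS)
print(fixed["PERSON"])
print(fixed["ORGANIZATION"])
```

A lookup table only covers names we already know about, so it trades coverage for precision; retraining or swapping the tagger would be a different route to the same goal.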