## Assignment 4
by Charlie Mei cm3947

In [1]:
from urllib import request
from bs4 import BeautifulSoup
from bs4.element import Comment
import spacy
import pandas as pd

from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

In [2]:
# Extract all text from the url, using code provided by lecturer in class 3 exercise
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)
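
The visible-text filter above can also be sketched without BeautifulSoup. The block below is an illustrative alternative, not the lecturer's code: it uses the standard library's `html.parser` and a hypothetical `VisibleTextParser` class, and it drops empty strings instead of joining them, so whitespace may differ slightly from `text_from_html`. The `sample` HTML is invented for the demonstration.

```python
from html.parser import HTMLParser

# Tags whose text is not shown to the reader. `meta` is omitted because it is
# a void tag with no closing tag and no text content of its own.
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text outside hidden tags; comments are ignored by default."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # greater than 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self.hidden_depth == 0 and stripped:
            self.chunks.append(stripped)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample page, used only to exercise the filter
sample = (
    "<html><head><title>Hidden title</title><style>p {color: red}</style></head>"
    "<body><!-- hidden comment --><p>Netflix confirmed</p>"
    "<script>var x = 1;</script><p>a final season.</p></body></html>"
)
print(visible_text(sample))
```

Only the two paragraph texts survive; the title, stylesheet, script, and comment are filtered out.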

### 1. Pick a random news article (preferably with many entity mentions) from your Webhose dataset 

In [3]:
# Take the first article from the Netflix Webhose dataset provided in Assignment 2
url = 'https://www.stuff.co.nz/entertainment/tv-radio/300026661/13-reasons-why-the-popular-netflix-shows-creator-teases-chance-of-a-hopeful-ending'

In [4]:
# Download the raw HTML of the page
html = request.urlopen(url).read()
# Parsed here for inspection only; text_from_html() re-parses `html` itself
soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all(string=True)

In [5]:
text = text_from_html(html)
print(text[:1000])

National World Business Climate Change Sport Entertainment Life & Style Homed Travel Motoring Stuff Nation Play Stuff Quizzes Politics Premium Well & Good Food & Wine Parenting Rugby Farming Technology Opinion Auckland Wellington Canterbury Waikato Bay of Plenty Taranaki Manawatu Nelson Marlborough Timaru Otago Southland Careers Advertising Contact Privacy © 2020 Stuff Limited Entertainment TV & Radio 13 Reasons Why: The popular Netflix show's creator teases chance of a hopeful ending 14:49, Jun 03 2020 Facebook Twitter Whats App Reddit Email NETFLIX The final season of 13 Reasons Why is out. The controversial 13  Reasons Why is returning for its fourth and final season on Netflix from Friday and creator Brian Yorkey has indicated there will be a hopeful ending. Adapted from Jay Asher's 2007 novel, the show was released on Netflix in 2017 and began with the first season focused on the death of Hannah Baker, a 17-year-old American high school student who


### 2. Follow directions to set up one of the Information Extraction services below, and write a Python program implementing API calls to extract Company/Organization and Geo entities from the article chosen in Step 1:

I have chosen to use `spaCy`.

In [6]:
nlp = spacy.load("en_core_web_sm")

# Run the spaCy pipeline over the page text to get a Doc with named entities
page = nlp(text)

In [7]:
# spaCy entity labels: ORG (companies/organizations) and GPE (countries, cities, states)
entity_labels = ['ORG', 'GPE']

# Extract companies and geo entities from the article
orgs = []
geos = []
for entity in page.ents:
    if entity.label_ == 'ORG':
        orgs.append(entity.text)
    elif entity.label_ == 'GPE':
        geos.append(entity.text)

print("Here is a list of companies/organizations referenced in the article: \n {}".format(set(orgs)))
print("Here is a list of geographies referenced in the article: \n {}".format(set(geos)))

Here is a list of companies/organizations referenced in the article: 
 {'Health Ministry', 'Stuff Limited Entertainment TV & Radio', 'Premium Well & Good Food & Wine', 'Nelson Marlborough Timaru', 'Netflix', 'Ford', 'Mental Health Foundation', 'Entertainment Weekly', 'Yorkey'}
Here is a list of geographies referenced in the article: 
 {'US', 'Auckland', 'Australia', 'Netflix', 'Manawatu', 'North Star', 'Yorkey'}
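
The extraction loop can be generalized so that any set of labels is collected and de-duplicated in one pass. This is a minimal sketch: `group_entities` is a hypothetical helper, and `sample_entities` is a hand-written stand-in for `[(ent.text, ent.label_) for ent in page.ents]` so that the block runs without loading a spaCy model.

```python
from collections import defaultdict


def group_entities(entities, wanted_labels=("ORG", "GPE")):
    """Group (text, label) pairs by label, keeping each text once."""
    grouped = defaultdict(set)
    for text, label in entities:
        if label in wanted_labels:
            grouped[label].add(text.strip())
    # Sort for stable, readable output
    return {label: sorted(texts) for label, texts in grouped.items()}


# Stand-in for [(ent.text, ent.label_) for ent in page.ents]
sample_entities = [
    ("Netflix", "ORG"),
    ("Netflix", "ORG"),
    ("Brian Yorkey", "PERSON"),
    ("Auckland", "GPE"),
    ("Australia", "GPE"),
]
print(group_entities(sample_entities))
```

Keeping the label as the dictionary key also makes it easy to spot strings that the model assigns to more than one label, as happens with 'Netflix' and 'Yorkey' in the output above.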


### 3. Download Crunchbase Open Data Map CSV file and store it in a directory on your computer

### 4. Use the Class Exercise Jupyter Notebook as a reference to:
- !pip install pyspark
- load Crunchbase Open Data Map into notebook by modifying the path .csv(".../...") to the file on your computer where you stored the downloaded CSV file from Step 3.
- find matches of Company or Organization entities identified in Step 2 using rlike function and print results

In [8]:
# Initialize Spark to load the Crunchbase Open Data Map
sc = SparkContext()
config = sc.getConf()
sqlContext = SQLContext(sc)

In [9]:
# Load the Crunchbase Open Data Map CSV into a Spark DataFrame
df = sqlContext.read.option('header', 'true').option('delimiter', ',').option('inferSchema', 'true').csv('cb_odm_092419.csv')
df.count()

687755

In [10]:
for org in set(orgs):
    print('Matches in Crunchbase for {}:'.format(org))
    match_df = df[df['name'].rlike(org)]
    match_df['crunchbase_uuid', 'name', 'homepage_domain'].show()
    print()

Matches in Crunchbase for Health Ministry:
+---------------+----+---------------+
|crunchbase_uuid|name|homepage_domain|
+---------------+----+---------------+
+---------------+----+---------------+


Matches in Crunchbase for Stuff Limited Entertainment TV & Radio:
+---------------+----+---------------+
|crunchbase_uuid|name|homepage_domain|
+---------------+----+---------------+
+---------------+----+---------------+


Matches in Crunchbase for Premium Well & Good Food & Wine:
+---------------+----+---------------+
|crunchbase_uuid|name|homepage_domain|
+---------------+----+---------------+
+---------------+----+---------------+


Matches in Crunchbase for Nelson Marlborough Timaru:
+---------------+----+---------------+
|crunchbase_uuid|name|homepage_domain|
+---------------+----+---------------+
+---------------+----+---------------+


Matches in Crunchbase for Netflix:
+--------------------+-------------------+--------------------+
|     crunchbase_uuid|               n
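
Because `rlike` treats its argument as a regular expression, an entity string is matched anywhere inside `name`, and characters such as `.` or `(` in an entity would be read as regex syntax. The sketch below shows one way to build a stricter pattern, demonstrated with Python's `re` module on invented company names. `exact_name_pattern` is a hypothetical helper, and passing its result to Spark's `rlike` (which uses Java regular expressions) is an assumption that should be checked against the actual data.

```python
import re


def exact_name_pattern(name):
    """Escape regex metacharacters and anchor the pattern to the whole string."""
    return "^" + re.escape(name) + "$"


# Invented company names, used only to illustrate the difference
names = ["Netflix", "Netflix Fan Club", "A.B Ventures", "AXB Ventures"]

# Unanchored search: matches any name that contains the string
loose = [n for n in names if re.search("Netflix", n)]

# Anchored and escaped: matches only the exact name
strict = [n for n in names if re.search(exact_name_pattern("Netflix"), n)]

# Escaping stops "." from acting as a wildcard
dotted = [n for n in names if re.search(exact_name_pattern("A.B Ventures"), n)]

print(loose)
print(strict)
print(dotted)
```

In the notebook this would correspond to something like `df['name'].rlike(exact_name_pattern(org))`, which trades recall (no partial matches) for fewer unrelated rows.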

### BONUS

Use the Class Exercise Jupyter Notebook as a reference to:
- !pip install spacy 
- update TRAIN_DATA with annotations of entities (PERSON, LOCATION, or ORGANIZATION) from each sentence in the article selected in step 1
- run spaCy_NER function to generate trained_nlp model
- use trained_nlp to test entity recognition on another random news article from Webhose and print results to output
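
Annotating `TRAIN_DATA` by hand means counting character offsets, which is easy to get wrong. The sketch below computes them instead. It assumes the class notebook's `spaCy_NER` function expects the spaCy v2-style format `(text, {"entities": [(start, end, label), ...]})`; `annotate` is a hypothetical helper, and the example sentence is paraphrased from the article for illustration only.

```python
def annotate(sentence, entities):
    """Build one training example from (surface_text, label) pairs.

    Only the first occurrence of each surface text is annotated.
    """
    spans = []
    for surface, label in entities:
        start = sentence.find(surface)
        if start == -1:
            raise ValueError("{!r} not found in sentence".format(surface))
        spans.append((start, start + len(surface), label))
    return (sentence, {"entities": spans})


# Illustrative sentence; labels follow the assignment's PERSON/LOCATION/ORGANIZATION scheme
sentence = "The show was released on Netflix in 2017 by creator Brian Yorkey."
TRAIN_DATA = [
    annotate(sentence, [("Netflix", "ORGANIZATION"), ("Brian Yorkey", "PERSON")]),
]

for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Slicing the text with the stored offsets, as the final loop does, is a quick check that every span covers exactly the intended entity before training.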

In [None]:
# Repeat the ORG/GPE entity extraction from In [7]
entity_labels = ['ORG', 'GPE']

# Extract companies and geo entities from the article
orgs = []
geos = []
for entity in page.ents:
    if entity.label_ == 'ORG':
        orgs.append(entity.text)
    elif entity.label_ == 'GPE':
        geos.append(entity.text)