**NAMED ENTITY RECOGNITION WITH WEB SCRAPING**
===========================

In [1]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [2]:
# requests for fetching html of website
import requests

# Make the request to a url

r = requests.get('https://www.wired.com/story/guide-emoji/')
#r = requests.get('http://www.cleveland.com/metro/index.ssf/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html')

# Create soup from content of request
c = r.content

from bs4 import BeautifulSoup

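# Name the parser explicitly; omitting it triggers a warning and the parser used then depends on what is installed
soup = BeautifulSoup(c, 'html.parser')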

In [3]:
# Find the element on the webpage
main_content = soup.find('div', attrs = {'class': 'grid--item body body__container article__body grid-layout__content'})
main_content

<div class="grid--item body body__container article__body grid-layout__content"><p><span class="lead-in-text-callout">Emoji are more</span> than a millennial messaging fad. Think of them more like a primitive language. The tiny, emotive characters—from 😜  to 🎉  to 💩—represent the first language born of the digital world, designed to add emotional nuance to otherwise flat text. Emoji have been popular since they first appeared on Japanese mobile phones in the late ’90s, and in the past few years they have become a hallmark of the way people communicate. They show up in press releases and corporate emails. The White House once issued an  <a class="external-link" data-event-click='{"element":"ExternalLink","outgoingURL":"https://www.theatlantic.com/technology/archive/2014/10/why-the-white-house-is-using-emojis/381307/"}' href="https://www.theatlantic.com/technology/archive/2014/10/why-the-white-house-is-using-emojis/381307/" rel="nofollow noopener" target="_blank">economic report</a> illu

In [4]:
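# Extract the text of the first paragraph only
content = main_content.find('p').text

import pprint
# Note: pprint.pprint() prints its argument and returns None, so rawtext ends up as the string 'None'
rawtext=str(pprint.pprint(content))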

('Emoji are more than a millennial messaging fad. Think of them more like a '
 'primitive language. The tiny, emotive characters—from 😜  to 🎉  to '
 '💩—represent the first language born of the digital world, designed to add '
 'emotional nuance to otherwise flat text. Emoji have been popular since they '
 'first appeared on Japanese mobile phones in the late ’90s, and in the past '
 'few years they have become a hallmark of the way people communicate. They '
 'show up in press releases and corporate emails. The White House once issued '
 'an  economic report illustrated with emoji. In 2015, 😂  became Oxford '
 'Dictionaries’ “Word” of the Year. Emoji aren’t just for people who say '
 'things like “lmao smh tbh fam.” Emoji are for everyone.')


In [5]:
content

'Emoji are more than a millennial messaging fad. Think of them more like a primitive language. The tiny, emotive characters—from 😜  to 🎉  to 💩—represent the first language born of the digital world, designed to add emotional nuance to otherwise flat text. Emoji have been popular since they first appeared on Japanese mobile phones in the late ’90s, and in the past few years they have become a hallmark of the way people communicate. They show up in press releases and corporate emails. The White House once issued an  economic report illustrated with emoji. In 2015, 😂  became Oxford Dictionaries’ “Word” of the Year. Emoji aren’t just for people who say things like “lmao smh tbh fam.” Emoji are for everyone.'

In [6]:
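import re

def remove_emojis(data):
    """Return str(data) with emoji and related symbol characters removed."""
    print(data)  # echo the input so the text before and after can be compared
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad: also spans CJK and other non-emoji scripts
        u"\U0001f926-\U0001f937"  # gesture and person emoji
        u"\U00010000-\U0010ffff"  # every supplementary-plane code point
        u"\u2640-\u2642"  # female, earth, and male signs
        u"\u2600-\u2B55"  # misc symbols through U+2B55
        u"\u200d"  # zero-width joiner
        u"\u23cf"  # eject symbol
        u"\u23e9"  # fast-forward symbol
        u"\u231a"  # watch
        u"\ufe0f"  # variation selector-16
        u"\u3030"  # wavy dash
                      "]+", re.UNICODE)
    return emoj.sub('', str(data))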
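
The character class in `remove_emojis` is very wide: the range `\U000024C2-\U0001F251` alone covers CJK ideographs, so text in other scripts would be stripped along with the emoji. A narrower sketch follows, assuming only the common emoji blocks matter. `EMOJI_RE` and `strip_emoji` are hypothetical names, and the ranges are illustrative rather than an exhaustive emoji definition.

```python
import re

# Narrower emoji pattern (assumed ranges, not a complete emoji list).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

def strip_emoji(text):
    """Remove characters matched by EMOJI_RE and leave everything else."""
    return EMOJI_RE.sub("", text)

# U+1F61C and U+1F389 are emoji; the Japanese word in between is kept.
sample = "Emoji \U0001F61C and \u65e5\u672c\u8a9e \U0001F389"
print(strip_emoji(sample))
```

Unlike the wide pattern, this one leaves the Japanese characters in place and removes only the two emoji.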

In [7]:
rawtext

'None'
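
The `'None'` above appears because `pprint.pprint()` writes to stdout and returns `None`, which `str()` then turns into the text `'None'`. Below is a minimal sketch of keeping the wrapped text instead, assuming the intent was to store the formatted string. `sample` is a stand-in for `content`, and `pprint.pformat()` is the standard-library call that returns the string rather than printing it.

```python
import pprint

# Stand-in for the scraped paragraph (hypothetical sample, not the real page).
sample = "Emoji are more than a millennial messaging fad. " * 3

# pprint.pprint() prints and returns None.
printed = pprint.pprint(sample)

# pprint.pformat() returns the formatted representation as a string.
rawtext = pprint.pformat(sample)
print(printed is None)
print(rawtext)
```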

In [8]:
import re
dat=remove_emojis(content)

Emoji are more than a millennial messaging fad. Think of them more like a primitive language. The tiny, emotive characters—from 😜  to 🎉  to 💩—represent the first language born of the digital world, designed to add emotional nuance to otherwise flat text. Emoji have been popular since they first appeared on Japanese mobile phones in the late ’90s, and in the past few years they have become a hallmark of the way people communicate. They show up in press releases and corporate emails. The White House once issued an  economic report illustrated with emoji. In 2015, 😂  became Oxford Dictionaries’ “Word” of the Year. Emoji aren’t just for people who say things like “lmao smh tbh fam.” Emoji are for everyone.


In [9]:
#after removing emojis
dat

'Emoji are more than a millennial messaging fad. Think of them more like a primitive language. The tiny, emotive characters—from   to   to —represent the first language born of the digital world, designed to add emotional nuance to otherwise flat text. Emoji have been popular since they first appeared on Japanese mobile phones in the late ’90s, and in the past few years they have become a hallmark of the way people communicate. They show up in press releases and corporate emails. The White House once issued an  economic report illustrated with emoji. In 2015,   became Oxford Dictionaries’ “Word” of the Year. Emoji aren’t just for people who say things like “lmao smh tbh fam.” Emoji are for everyone.'
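
Removing the emoji leaves runs of spaces behind (for example `In 2015,   became`), and spaCy later reports these as `SPACE` tokens. Below is a small sketch of collapsing them before parsing, assuming the extra whitespace is not wanted in the analysis. `normalize_spaces` is a hypothetical helper, not part of the notebook.

```python
import re

def normalize_spaces(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Short excerpt in the same shape as the cleaned text above.
sample = "The White House once issued an  economic report. In 2015,   became Oxford"
print(normalize_spaces(sample))
```

Applied to `dat` before calling `nlp`, this should keep whitespace-only tokens out of the token list and the POS output.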

In [10]:
doc=nlp(dat)

In [11]:
#create list of word tokens
token_list=[token.text for token in doc]
print(token_list)

['Emoji', 'are', 'more', 'than', 'a', 'millennial', 'messaging', 'fad', '.', 'Think', 'of', 'them', 'more', 'like', 'a', 'primitive', 'language', '.', 'The', 'tiny', ',', 'emotive', 'characters', '—', 'from', '  ', 'to', '  ', 'to', '—', 'represent', 'the', 'first', 'language', 'born', 'of', 'the', 'digital', 'world', ',', 'designed', 'to', 'add', 'emotional', 'nuance', 'to', 'otherwise', 'flat', 'text', '.', 'Emoji', 'have', 'been', 'popular', 'since', 'they', 'first', 'appeared', 'on', 'Japanese', 'mobile', 'phones', 'in', 'the', 'late', '’', '90s', ',', 'and', 'in', 'the', 'past', 'few', 'years', 'they', 'have', 'become', 'a', 'hallmark', 'of', 'the', 'way', 'people', 'communicate', '.', 'They', 'show', 'up', 'in', 'press', 'releases', 'and', 'corporate', 'emails', '.', 'The', 'White', 'House', 'once', 'issued', 'an', ' ', 'economic', 'report', 'illustrated', 'with', 'emoji', '.', 'In', '2015', ',', '  ', 'became', 'Oxford', 'Dictionaries', '’', '“', 'Word', '”', 'of', 'the', 'Year'

In [12]:
#pos tagging
for token in doc:
    print(token.text,token.pos_)

Emoji PROPN
are AUX
more ADJ
than SCONJ
a DET
millennial ADJ
messaging NOUN
fad NOUN
. PUNCT
Think VERB
of ADP
them PRON
more ADV
like SCONJ
a DET
primitive ADJ
language NOUN
. PUNCT
The DET
tiny ADJ
, PUNCT
emotive ADJ
characters NOUN
— PUNCT
from ADP
   SPACE
to PART
   SPACE
to PART
— PUNCT
represent VERB
the DET
first ADJ
language NOUN
born VERB
of ADP
the DET
digital ADJ
world NOUN
, PUNCT
designed VERB
to PART
add VERB
emotional ADJ
nuance NOUN
to PART
otherwise ADV
flat ADJ
text NOUN
. PUNCT
Emoji PROPN
have AUX
been AUX
popular ADJ
since SCONJ
they PRON
first ADV
appeared VERB
on ADP
Japanese ADJ
mobile ADJ
phones NOUN
in ADP
the DET
late ADJ
’ NOUN
90s NOUN
, PUNCT
and CCONJ
in ADP
the DET
past ADJ
few ADJ
years NOUN
they PRON
have AUX
become VERB
a DET
hallmark NOUN
of ADP
the DET
way NOUN
people NOUN
communicate VERB
. PUNCT
They PRON
show VERB
up ADP
in ADP
press NOUN
releases NOUN
and CCONJ
corporate ADJ
emails NOUN
. PUNCT
The DET
White PROPN
House PROPN
once ADV
issued V

In [13]:
#named entities
# ent.label is spaCy's integer ID for the label; ent.label_ gives the string name (e.g. 'ORG')
for ent in doc.ents:
    print(ent.text,ent.label)

Emoji 380
first 396
Emoji 380
first 396
Japanese 381
the late ’90s 391
the past few years 391
The White House 383
2015 391
Word” of the Year 388
Emoji 380
Emoji 380


In [14]:

spacy.explain("RB")

'adverb'

In [15]:
#frequency of entity labels in the text
from collections import Counter
labels=[x.label_ for x in doc.ents]
Counter(labels)

Counter({'DATE': 3,
         'NORP': 1,
         'ORDINAL': 2,
         'ORG': 1,
         'PERSON': 4,
         'WORK_OF_ART': 1})
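
`Counter` gives only the totals per label. To see which strings sit under each label, the `(text, label)` pairs can be grouped. The sketch below uses hand-typed pairs; the pairing of texts to labels is an assumption inferred from the counts above. With a real `doc`, the list would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Hand-typed pairs (assumed pairing, inferred from the counts above).
pairs = [
    ("Emoji", "PERSON"), ("first", "ORDINAL"), ("Emoji", "PERSON"),
    ("first", "ORDINAL"), ("Japanese", "NORP"), ("The White House", "ORG"),
    ("2015", "DATE"),
]

# Collect every entity text under its label.
by_label = defaultdict(list)
for text, label in pairs:
    by_label[label].append(text)

# Print the distinct texts per label in a stable order.
for label, texts in sorted(by_label.items()):
    print(label, sorted(set(texts)))
```

Grouping this way makes questionable tags easier to spot, such as `Emoji` being counted under `PERSON`.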

In [16]:
#10 most common entity labels
Counter(labels).most_common(10)

[('PERSON', 4),
 ('DATE', 3),
 ('ORDINAL', 2),
 ('NORP', 1),
 ('ORG', 1),
 ('WORK_OF_ART', 1)]

In [None]:
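from spacy import displacy
# displacy.serve() starts a blocking web server; inside Jupyter, displacy.render(doc, style="ent", jupyter=True) draws the markup inline instead
displacy.serve(doc, style="ent")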


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

