## Named Entity Recognition using Python

**Named Entity:**

A named entity is a real-world object that can be identified by a proper name; it can be a person, place, object, etc.

**Named Entity Recognition:**

Named Entity Recognition (NER) is the NLP task of identifying and classifying named entities. Raw or structured text is taken as input, and the named entities in it are segmented and assigned to predefined classes such as persons, organizations, places, money, and time.
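
To make the idea of "identify, then classify" concrete, here is a minimal sketch that tags entities by looking them up in a hand-written table. The `GAZETTEER` table and the `tag_entities` helper are hypothetical names invented for this illustration; real NER systems such as the NLTK and spaCy pipelines used later rely on statistical models rather than a fixed lookup.

```python
import re

# Hypothetical gazetteer: a hand-written lookup table, not a trained model.
GAZETTEER = {
    "Facebook": "ORG",
    "New Delhi": "GPE",
    "Neha Alawadhi": "PERSON",
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (entity, label, start) tuples for every gazetteer hit in text."""
    hits = []
    for name, label in gazetteer.items():
        # \b keeps "Facebook" from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((match.group(), label, match.start()))
    # Report entities in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])

sample = "Facebook opened an office in New Delhi."
print(tag_entities(sample))
```

A lookup like this only finds names it already knows, which is the main reason model-based recognizers are preferred for open text.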
    

# Data Import by Web Scraping 

## Named Entity Recognition using NLTK

In [2]:
#Importing libraries for web scraping
import bs4 as bs
import urllib.request
import re 

In [3]:
# URL of the article to fetch
scraped_data = urllib.request.urlopen('https://www.business-standard.com/article/technology/how-will-metaverse-digital-technology-change-the-future-of-our-world-121111400659_1.html')
# Alternative URL: 'https://www.business-standard.com/technology'
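
Some servers reject requests that carry urllib's default User-Agent, and a call without a timeout can hang indefinitely. The sketch below shows one possible way to guard against both; the helper names `build_request` and `fetch_html`, the User-Agent string, and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0 (compatible; ner-notebook)"):
    """Create a Request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes; raises URLError on failure."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

With these helpers, `fetch_html(url)` would return the same kind of bytes object that `scraped_data.read()` produces in the next cell.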

In [4]:
# Read the response and parse the HTML
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraph = parsed_article.find_all('p')

# Concatenate the text of every <p> element into one string
article_text = ""
for p in paragraph:
    article_text += p.text

In [5]:
# article_text is already a string, so str() is a no-op kept as a safeguard
text = str(article_text)

In [6]:
text

'Topics\nDigital technology\xa0|\xa0Facebook\xa0|\xa0virtual reality\n\nNeha Alawadhi\xa0\t\t\t\t\t\t\t\t  \t\t\t\t\t\t\t\t  |\xa0\r\n\t\t\t\t\t\t\t\t  \t\t\t\t\t\t\t\t  \t\t\t\t\t\t\t\t  \t\t\t\t\t\t\t\t  New Delhi\xa0\t\t\t\t\t\t\t  \t  \t\t\t\t\t\t\t  \n\n\t\t\t\t\t\t\t  Last Updated at November 14, 2021 23:11 IST\nhttps://mybs.in/2Zj0tiJ\t\t\t\t\t    \t\t\t\t\t\t    \t\t\t\t\t\t    \t\t\t\t\t\t    \tEver since Facebook and its companies embraced a new brand identity called “Meta” last month, there has been a slew of opinions on what a connected digital world means, what it is going to look like, and who will build it. The concept seems similar to Second Life, a virtual world that allows people to create an online version of themselves and have a second life in the online space. Launched in 2003, Second Life is still around and claims to have about a million users. For a closer reference, it is like the ideal virtual universe set out at the end of the recent Ryan ...Key stories on b
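
The scraped string is full of tabs, line breaks and non-breaking spaces (`\xa0`) left over from the page layout. As a rough sketch of one way to tidy it before tagging, the hypothetical helper `normalize_whitespace` below collapses every run of whitespace into a single space; the `raw` sample is a shortened excerpt of the output above.

```python
import re

def normalize_whitespace(raw):
    """Collapse tabs, newlines and non-breaking spaces into single spaces."""
    # For str patterns in Python 3, \s also matches the non-breaking space \xa0.
    return re.sub(r"\s+", " ", raw).strip()

raw = ("Topics\nDigital technology\xa0|\xa0Facebook\xa0|\xa0virtual reality"
       "\n\nNeha Alawadhi\xa0\t\t  |\xa0\r\n\t\t  New Delhi\xa0\t\t")
print(normalize_whitespace(raw))
```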

In [7]:
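# Importing required libraries
import re
import string
import nltk
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
# The stopword list and WordNet data are also used below; quiet=True hides their download log
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

ps = nltk.PorterStemmer()
lem = WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english')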

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\rajse\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [8]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
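
`string.punctuation` is often combined with `str.translate` to delete punctuation in a single pass. The snippet below is a small sketch of that idiom; `STRIP_PUNCT` and `remove_punctuation` are names chosen here for illustration. Note that the constant holds ASCII characters only, so the curly quotes around “Meta” in the scraped article would not be removed by it.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return text with all ASCII punctuation characters deleted."""
    return text.translate(STRIP_PUNCT)

print(remove_punctuation("Launched in 2003, Second Life is still around!"))
```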

In [9]:
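# Function for cleaning the input text and POS-tagging its tokens
def clean_text(text):
    # Remove apostrophes and commas, then lowercase
    text = text.replace("'", "").replace(",", "").lower()
    # Split on runs of non-word characters and drop the empty strings re.split can leave
    tokens = [tok for tok in re.split(r'\W+', text) if tok]
    # Lemmatize every non-stopword token (stemming is the commented-out alternative)
    # tokens = [ps.stem(word) for word in tokens if word not in stopwords]
    tokens = [lem.lemmatize(word) for word in tokens if word not in stopwords]
    # Tag each remaining token with its part of speech
    return nltk.pos_tag(tokens)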
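
One detail of the tokenizing step is easy to miss: `re.split` on `\W+` leaves an empty string whenever the text starts or ends with punctuation. The sketch below illustrates this using only the standard library; `DEMO_STOPWORDS` is a tiny stand-in list assumed for the example (the notebook itself uses NLTK's English stopword list), and `tokenize` is a hypothetical helper.

```python
import re

# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's English list).
DEMO_STOPWORDS = {"a", "is", "the", "of"}

def tokenize(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, split on non-word characters, drop empties and stopwords."""
    tokens = re.split(r"\W+", text.lower())
    return [tok for tok in tokens if tok and tok not in stopwords]

# The raw split keeps a trailing empty string because the text ends with "!".
print(re.split(r"\W+", "The future of the metaverse!"))
# The helper filters that empty string out along with the stopwords.
print(tokenize("The future of the metaverse!"))
```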

In [10]:
NER = clean_text(text)[0:10]

In [11]:
# Chunk the tagged tokens into noun phrases and prepositions
from nltk import RegexpParser
chunker = RegexpParser("""
    NP: {<DT>?<JJ>*<NN>}   # Noun phrases: optional determiner, any adjectives, then a noun
    P: {<IN>}              # Prepositions
""")
# Further rules that could be added to the grammar:
#   V: {<V.*>}             # Verbs
#   PP: {<P> <NP>}         # Prepositional phrases
#   VP: {<V> <NP|PP>*}     # Verb phrases

# Parse the tagged tokens and print the resulting chunk tree
output = chunker.parse(NER)
print("After Extracting\n", output)

After Extracting
 (S
  (NP topic/NN)
  (NP digital/JJ technology/NN)
  (NP facebook/NN)
  (NP virtual/JJ reality/NN)
  neha/RB
  (NP alawadhi/JJ new/JJ delhi/NN))
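
The `NP` rule reads as "an optional determiner, any number of adjectives, then one singular noun". To show how that pattern produces the chunks above, here is a plain-Python sketch that walks the tag sequence by hand. It is an illustration of the pattern only, not NLTK's implementation, and `chunk_noun_phrases` is a hypothetical helper; the `tagged` list repeats the ten tokens from the tree.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs matching <DT>?<JJ>*<NN> into phrase strings."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":       # the noun that closes the phrase
            chunks.append(" ".join(word for word, _ in tagged[i:j + 1]))
            i = j + 1
        else:                                    # no phrase starts at position i
            i += 1
    return chunks

tagged = [("topic", "NN"), ("digital", "JJ"), ("technology", "NN"),
          ("facebook", "NN"), ("virtual", "JJ"), ("reality", "NN"),
          ("neha", "RB"), ("alawadhi", "JJ"), ("new", "JJ"), ("delhi", "NN")]
print(chunk_noun_phrases(tagged))
```

Because the text was lowercased before tagging, "alawadhi" and "new" are tagged as adjectives and absorbed into one phrase with "delhi", which is why the author's name and the city end up in the same chunk.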


In [12]:
%matplotlib inline
# Draw the parse tree (NLTK opens it in a separate Tk window, so a display is required)
output.draw()

In [20]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(output)
pprint(iob_tagged)

[('price', 'NN', 'B-NP'),
 ('available', 'JJ', 'O'),
 ('gaming', 'VBG', 'O'),
 ('smartphone', 'NN', 'B-NP'),
 ('could', 'MD', 'O'),
 ('offered', 'VB', 'O'),
 ('something', 'NN', 'B-NP'),
 ('kind', 'NN', 'B-NP'),
 ('value', 'NN', 'B-NP'),
 ('predecessor', 'NN', 'B-NP')]
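
In the IOB scheme, the first token of a chunk is marked `B-<label>`, later tokens of the same chunk `I-<label>`, and tokens outside any chunk `O`. The sketch below shows that flattening on a simplified stand-in for a chunk tree; the list-based `parse` structure and the `chunks_to_iob` helper are assumptions made for this example and are not NLTK's `Tree` type.

```python
def chunks_to_iob(items):
    """Flatten a shallow chunk structure into (word, pos, iob) triples.

    Each item is either a (word, pos) pair outside any chunk, or a
    (label, [(word, pos), ...]) pair describing one chunk.
    """
    triples = []
    for first, second in items:
        if isinstance(second, list):      # a chunk: `first` is its label
            for position, (word, pos) in enumerate(second):
                prefix = "B" if position == 0 else "I"
                triples.append((word, pos, f"{prefix}-{first}"))
        else:                             # a bare token outside any chunk
            triples.append((first, second, "O"))
    return triples

parse = [
    ("NP", [("digital", "JJ"), ("technology", "NN")]),
    ("NP", [("facebook", "NN")]),
    ("neha", "RB"),
]
for triple in chunks_to_iob(parse):
    print(triple)
```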


## Named Entity Recognition using spaCy

In [101]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [108]:
article = str(text)
def preprocess(text):
    # Keep alphabetic tokens only; case is preserved because the NER model relies on capitalization
    words = re.findall(r'[A-Za-z]+', text)
    # Drop English stopwords (the match is case-sensitive, so "The" is kept)
    return " ".join(word for word in words if word not in stopwords)

In [109]:
article = preprocess(article)


In [112]:
article = nlp(article)
pprint([(X.text, X.label_) for X in article.ents])

[('Topics', 'ORG'),
 ('New Delhi', 'GPE'),
 ('Last Updated', 'WORK_OF_ART'),
 ('November 14, 2021', 'DATE'),
 ('IST', 'ORG'),
 ('Facebook', 'WORK_OF_ART'),
 ('Meta', 'WORK_OF_ART'),
 ('last month', 'DATE'),
 ('Second Life', 'ORG'),
 ('second', 'ORDINAL'),
 ('2003', 'DATE'),
 ('Second Life', 'ORG'),
 ('about a million', 'CARDINAL'),
 ('Ryan', 'PERSON'),
 ('Business Standard', 'ORG'),
 ('Digital', 'ORG'),
 ('PREVIOUS STORYNEXT', 'PERSON'),
 ('2021', 'CARDINAL'),
 ('Upgrade To Premium Services Business Standard', 'ORG'),
 ('Premium Services In Partnership with Dear Guest', 'ORG'),
 ('Business Standard', 'ORG'),
 ('FIS', 'ORG'),
 ('Manage', 'ORG'),
 ('Enjoy Reading', 'PERSON')]


In [53]:
#pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

In [113]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 11,
         'GPE': 1,
         'WORK_OF_ART': 3,
         'DATE': 3,
         'ORDINAL': 1,
         'CARDINAL': 2,
         'PERSON': 3})
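
Counts per label say how many entities of each type were found, but not which ones. As a small sketch of a complementary view, the hypothetical helper `group_by_label` below collects the distinct entity texts under each label; the `entities` list is a handful of (text, label) pairs copied from the entity listing above.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity listing above.
entities = [
    ("New Delhi", "GPE"),
    ("November 14, 2021", "DATE"),
    ("Second Life", "ORG"),
    ("2003", "DATE"),
    ("Second Life", "ORG"),
    ("Business Standard", "ORG"),
]

def group_by_label(pairs):
    """Map each label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

print(group_by_label(entities))
```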

In [114]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Second Life', 2), ('Business Standard', 2), ('Topics', 1)]

In [116]:
displacy.render(article, jupyter=True, style='ent')

In [117]:
displacy.render(nlp(str(sentences)), style='dep', jupyter = True, options = {'distance': 120})

In [118]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences))
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[('[', 'X', '['),
 ('price', 'NOUN', 'price'),
 ('available', 'ADJ', 'available'),
 ('gaming', 'NOUN', 'gaming'),
 ('smartphone', 'NOUN', 'smartphone'),
 ('offered', 'VERB', 'offer'),
 ('kind', 'NOUN', 'kind'),
 ('value', 'NOUN', 'value'),
 ('predecessor', 'NOUN', 'predecessor'),
 ('iQoo', 'PROPN', 'iQoo'),
 ('Z3', 'PROPN', 'Z3'),
 ('provided', 'VERB', 'provide'),
 ('Based', 'VERB', 'base'),
 ('Google', 'PROPN', 'Google'),
 ('TV', 'NOUN', 'tv'),
 ('platform', 'NOUN', 'platform'),
 ('Realme', 'PROPN', 'Realme'),
 ('4', 'NUM', '4'),
 ('K', 'PROPN', 'K'),
 ('TV', 'PROPN', 'TV'),
 ('Stick', 'PROPN', 'Stick'),
 ('shares', 'NOUN', 'share'),
 ('fundamentals', 'NOUN', 'fundamental'),
 ('Android', 'PROPN', 'Android'),
 ('TV', 'NOUN', 'tv'),
 ('platform', 'NOUN', 'platform'),
 ('touch', 'NOUN', 'touch'),
 ('novelty', 'NOUN', 'novelty'),
 ('Home', 'NOUN', 'home'),
 ('grown', 'VERB', 'grow'),
 ('mobile', 'ADJ', 'mobile'),
 ('phone', 'NOUN', 'phone'),
 ('company', 'NOUN', 'company'),
 ('Lava', 'PRO

In [119]:
dict([(str(x), x.label_) for x in nlp(str(sentences)).ents])

{'Stick': 'FAC',
 'Android': 'ORG',
 'Lava International': 'ORG',
 'first': 'ORDINAL',
 'Indian': 'NORP',
 '5': 'CARDINAL',
 'GSM Arena': 'PRODUCT',
 '12L': 'DATE',
 'early next year': 'DATE',
 '12': 'CARDINAL',
 'India': 'GPE',
 'daily': 'DATE',
 'DAU': 'ORG',
 'the past three': 'DATE',
 'Apple': 'ORG',
 'Messages': 'GPE',
 'Samsung': 'ORG',
 '2021': 'CARDINAL',
 'Upgrade To Premium Services Business Standard': 'ORG',
 'Premium Services In Partnership with Dear Guest': 'ORG',
 'Business Standard': 'ORG',
 'FIS': 'ORG',
 'Manage': 'ORG',
 'Enjoy Reading': 'PERSON'}
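
Building a `dict` from (text, label) pairs keeps only the last label seen for each text, so repeated mentions collapse and any conflicting labels are silently lost. The sketch below contrasts that with collecting every label per entity. The `pairs` list is illustrative: the second "Meta" label is hypothetical and was added only to show a conflict, and `labels_per_entity` is a helper name invented for this example.

```python
from collections import defaultdict

# Illustrative pairs; the second "Meta" label is hypothetical, added to show a conflict.
pairs = [
    ("Second Life", "ORG"),
    ("Meta", "WORK_OF_ART"),
    ("Second Life", "ORG"),
    ("Meta", "ORG"),
]

def labels_per_entity(pairs):
    """Collect every label seen for each entity text."""
    seen = defaultdict(set)
    for text, label in pairs:
        seen[text].add(label)
    return {text: sorted(labels) for text, labels in seen.items()}

print(dict(pairs))               # the later label overwrites the earlier one
print(labels_per_entity(pairs))  # every label is kept
```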