**CLASSWORK- / NAMED ENTITY RECOGNITION**

Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and categorizing named entities within a text into predefined categories such as the names of persons, organizations, locations, dates, numerical expressions, and more.
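
To make the task concrete before turning to a trained model, here is a minimal sketch of entity lookup using a hand-made table (a gazetteer). The `GAZETTEER` dictionary, the `find_entities` helper, and the sample sentence are all hypothetical and exist only for illustration; a statistical recognizer such as spaCy's does not work this way, since it predicts labels from context rather than matching fixed strings.

```python
# Toy illustration only: a hand-made lookup table (gazetteer), not a trained model.
# The names and labels below are hypothetical examples.
GAZETTEER = {
    "Ryan Peters": "PERSON",
    "World Cargo Inc": "ORG",
    "San Francisco": "GPE",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (entity_text, label, start_offset) for every gazetteer hit."""
    hits = []
    for name, label in gazetteer.items():
        start = text.find(name)
        while start != -1:
            hits.append((name, label, start))
            # Continue searching after the current match.
            start = text.find(name, start + len(name))
    # Report entities in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])

sample = "Ryan Peters flew to San Francisco to visit World Cargo Inc."
entities = find_entities(sample)
for name, label, start in entities:
    print(f"({name}, {label}) at offset {start}")
```

A lookup like this only finds names it already knows, which is the main reason to use a trained recognizer for real text.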

**spaCy**

spaCy is a powerful library for linguistic data processing. It provides a pipeline of processing components: a tokenizer, a part-of-speech tagger, a dependency parser, and a named-entity recognizer.

**IMPORTING REQUIRED LIBRARIES**

In [None]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 2.7 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
nlp = spacy.load("en_core_web_lg")

**TOKENIZATION AND PRINTING USING SPACY**

In [None]:
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)
for token in doc:
  print(token, end=" | ")

My | best | friend | Ryan | Peters | likes | fancy | adventure | games | . | 

**VISUALISATION OF SPACY TOKENS**

In [None]:
import pandas as pd

def display_nlp(doc, include_punct=False):
  """Generate data frame for visualization of spaCy tokens."""
  rows = []
  for i, t in enumerate(doc):
    if not t.is_punct or include_punct:
      row = {'token': i, 'text': t.text, 'lemma_': t.lemma_,
             'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
             'pos_': t.pos_, 'dep_': t.dep_,
             'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
      rows.append(row)
  df = pd.DataFrame(rows).set_index('token')
  df.index.name = None
  return df

display_nlp(doc)

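|    | text    | lemma_  | is_stop | is_alpha | pos_  | dep_     | ent_type_ | ent_iob_ |
|----|---------|---------|---------|----------|-------|----------|-----------|----------|
| 0  | Dear    | Dear    | False   | True     | PROPN | compound |           | O        |
| 1  | Ryan    | Ryan    | False   | True     | PROPN | npadvmod | PERSON    | B        |
| 3  | we      | we      | True    | True     | PRON  | nsubj    |           | O        |
| 4  | need    | need    | False   | True     | VERB  | ROOT     |           | O        |
| 5  | to      | to      | True    | True     | PART  | aux      |           | O        |
| 6  | sit     | sit     | False   | True     | VERB  | xcomp    |           | O        |
| 7  | down    | down    | True    | True     | ADP   | prt      |           | O        |
| 8  | and     | and     | True    | True     | CCONJ | cc       |           | O        |
| 9  | talk    | talk    | False   | True     | VERB  | conj     |           | O        |
| 11 | Regards | Regards | False   | True     | PROPN | ROOT     | PERSON    | B        |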
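
The `ent_iob_` and `ent_type_` columns follow the IOB scheme: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. The sketch below shows how such per-token tags can be merged back into entity spans. The `group_iob` helper and the hand-written tag list are hypothetical, and joining words with single spaces is a simplification (spaCy itself preserves the original whitespace).

```python
def group_iob(tokens):
    """Merge (text, iob, ent_type) triples into (entity_text, ent_type) spans."""
    spans = []
    words, label = [], None
    for text, iob, ent_type in tokens:
        starts_entity = iob == "B" or (iob == "I" and not words)
        if starts_entity or iob != "I":
            # Close the entity collected so far, if any.
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
        if starts_entity:
            words, label = [text], ent_type
        elif iob == "I":
            words.append(text)
    # Flush an entity that runs to the end of the input.
    if words:
        spans.append((" ".join(words), label))
    return spans

# Hand-written tags for illustration, not produced by running a model here.
tagged = [
    ("My", "O", ""), ("best", "O", ""), ("friend", "O", ""),
    ("Ryan", "B", "PERSON"), ("Peters", "I", "PERSON"),
    ("likes", "O", ""), ("games", "O", ""),
]
spans = group_iob(tagged)
print(spans)
```

In practice `doc.ents` already provides these spans; the sketch only shows what the token-level columns encode.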


**REMOVING STOP WORDS USING SPACY**

In [None]:
text = "Dear Ryan, we need to sit down and talk. Regards, Pete"
doc = nlp(text)

In [None]:
non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)

[Dear, Ryan, need, sit, talk, Regards, Pete]


**FINDING NOUNS USING SPACY**

In [None]:
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

In [None]:
nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']]
print(nouns)

[friend, Ryan, Peters, adventure, games]


**NAMED ENTITY RECOGNITION**

In [None]:
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

In [None]:
for ent in doc.ents:
  print(f"({ent.text}, {ent.label_})", end=" ")

(Ryan Peters, PERSON) 

**A harder example:**

In [None]:
text = "James O'Neill, chairman of World Cargo Inc, lives in SanFrancisco."
doc = nlp(text)

In [None]:
for ent in doc.ents:
  print(f"({ent.text}, {ent.label_})", end=" ")

(James O'Neill, PERSON) (World Cargo Inc, ORG) (SanFrancisco, ORG) 

**VISUALIZE NAMED ENTITIES**

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style='ent', jupyter=True)

**TRYING WITH REAL DATASET**

In [None]:
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
  """Download a page and return its visible text as a single string."""
  res = requests.get(url)
  html = res.text
  soup = BeautifulSoup(html, 'html.parser')
  # Drop elements whose content is not part of the readable article text.
  for script in soup(["script", "style", 'aside']):
    script.extract()
  # Collapse runs of newlines and tabs into single spaces.
  return " ".join(re.split(r'[\n\t]+', soup.get_text()))

ny_bb = url_to_string('https://www.reuters.com/world/europe/ukrainian-infrastructure-pounded-again-saturday-2022-10-22/')
article = nlp(ny_bb)
len(article.ents)


1

In [None]:
displacy.render(article, style='ent', jupyter=True)

In [None]:
from collections import Counter
items = [x.text for x in article.ents]
Counter(items).most_common(5)

[('JS', 1)]

**POPULAR NER TYPES:**

In [None]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'GPE': 1})

In [None]:
sentences = [x for x in article.sents]
print(sentences[0])

reuters.comPlease enable JS and disable any ad blocker
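
The first sentence printed here is a site notice asking the reader to enable JavaScript, not article text, which suggests the request returned a placeholder page instead of the story. A minimal sketch of a sanity check that could run before passing fetched text to `nlp` follows. The `looks_like_placeholder` helper, its word threshold, and its marker phrases are assumptions chosen for illustration, not part of spaCy or requests.

```python
def looks_like_placeholder(text, min_words=50, markers=("enable js", "ad blocker")):
    """Heuristic check (assumed thresholds): is this text a notice page rather than an article?"""
    lowered = text.lower()
    # Real articles are usually much longer than a one-line notice.
    too_short = len(text.split()) < min_words
    # Phrases typical of "please enable JavaScript" pages.
    has_marker = any(marker in lowered for marker in markers)
    return too_short or has_marker

page_text = "reuters.comPlease enable JS and disable any ad blocker"
is_placeholder = looks_like_placeholder(page_text)
print(is_placeholder)
```

A check like this makes it obvious when entity counts are low because the download failed rather than because the text has few entities.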


**NER tags**

In [None]:
displacy.render(nlp(str(sentences[0])), jupyter=True, style='ent')


**TYPES OF WORDS IN THE SENTENCE**

In [None]:
[(x.orth_, x.pos_, x.lemma_) for x in nlp(str(sentences[0]))
 if not x.is_stop and x.pos_ != 'PUNCT']

[('reuters.comPlease', 'INTJ', 'reuters.complease'),
 ('enable', 'VERB', 'enable'),
 ('JS', 'PROPN', 'JS'),
 ('disable', 'VERB', 'disable'),
 ('ad', 'NOUN', 'ad'),
 ('blocker', 'NOUN', 'blocker')]

**SENTENCE DEPENDENCY TREE**

In [None]:
displacy.render(nlp(str(sentences[0])), style='dep', jupyter=True,
                options={'distance': 120})