**CLASSWORK / NAMED ENTITY RECOGNITION**

**spaCy**

spaCy is a powerful library for linguistic data processing. It provides a pipeline of processing components: a tokenizer, a part-of-speech tagger, a dependency parser and a named-entity recognizer.

**DOWNLOADING AND LOADING THE LANGUAGE MODELS**

In [2]:
!python -m spacy download en_core_web_lg
!python -m spacy download de_core_news_sm

In [3]:
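import spacy
# Load the English language model
nlp_en = spacy.load("en_core_web_lg")
# Load the German language model
nlp_de = spacy.load("de_core_news_sm")
# The original cell assigned both models to `nlp`, so the German model
# replaced the English one. The outputs below (PER/LOC/ORG/MISC labels)
# were produced with the German model; use `nlp = nlp_en` for English text.
nlp = nlp_de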
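
When more than one language model is in play, one option is to keep the pipelines in a dictionary keyed by language code, so that it is always explicit which model processes which text. The sketch below is only an illustration: `load_pipelines`, `MODEL_NAMES` and the `loader` parameter are hypothetical names introduced here, not part of spaCy. By default the helper falls back to `spacy.load`.

```python
from typing import Callable, Dict, Optional


def load_pipelines(
    model_names: Dict[str, str],
    loader: Optional[Callable[[str], object]] = None,
) -> Dict[str, object]:
    """Load one pipeline per language code and return them in a dict."""
    if loader is None:
        # Imported lazily so the helper itself does not require spaCy
        # until a real model is actually loaded.
        import spacy
        loader = spacy.load
    return {lang: loader(name) for lang, name in model_names.items()}


# Model names as downloaded earlier in this notebook.
MODEL_NAMES = {"en": "en_core_web_lg", "de": "de_core_news_sm"}

# Usage (assumes spaCy and both models are installed):
# pipelines = load_pipelines(MODEL_NAMES)
# doc = pipelines["en"]("This time india is going to win the world cup.")
```

With this layout, English text would go through `pipelines["en"]` and German text through `pipelines["de"]`, and neither model replaces the other.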

**TRYING ON SAMPLE TEXT**

In [4]:
text = "This time india is going to win the world cup."
doc = nlp(text)
for token in doc:
  print(token, end=" | ")

This | time | india | is | going | to | win | the | world | cup | . | 

In [5]:
import pandas as pd
def display_nlp(doc, include_punct=False):
  """Generate data frame for visualization of spaCy tokens."""
  rows = []
  for i, t in enumerate(doc):
    if not t.is_punct or include_punct:
      row = {'token': i, 'text': t.text, 'lemma_': t.lemma_,'is_stop': t.is_stop, 'is_alpha': t.is_alpha,'pos_': t.pos_, 'dep_': t.dep_,'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
      rows.append(row)
  df = pd.DataFrame(rows).set_index('token')
  df.index.name = None
  return df
display_nlp(doc)

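| | text | lemma_ | is_stop | is_alpha | pos_ | dep_ | ent_type_ | ent_iob_ |
|---|---|---|---|---|---|---|---|---|
| 0 | This | This | False | True | PROPN | sb | MISC | B |
| 1 | time | time | False | True | X | ROOT | MISC | I |
| 2 | india | india | False | True | X | mo | MISC | I |
| 3 | is | -- | False | True | X | uc | MISC | I |
| 4 | going | going | False | True | X | uc | MISC | I |
| 5 | to | to | False | True | X | uc | MISC | I |
| 6 | win | win | False | True | X | pnc | MISC | I |
| 7 | the | The | False | True | PROPN | pnc | MISC | I |
| 8 | world | World | False | True | PROPN | pnc | MISC | I |
| 9 | cup | cup | False | True | PROPN | uc | MISC | I |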
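
The `ent_iob_` column above uses the IOB scheme: `B` marks the first token of an entity, `I` a token inside an entity, and `O` (or an empty string) a token outside any entity. As a rough illustration of how such tags map back to entity spans, here is a small pure-Python sketch. `group_iob` is a hypothetical helper written for this note, not a spaCy function, and the sample rows are copied from the table above.

```python
from typing import Iterable, List, Tuple


def group_iob(tokens: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Collapse (text, iob, ent_type) triples into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label = ""
    for text, iob, ent_type in tokens:
        if iob == "B" or (iob == "I" and not words):
            # A new entity starts; close the previous one if it is still open.
            if words:
                entities.append((" ".join(words), label))
            words, label = [text], ent_type
        elif iob == "I":
            # Continuation of the currently open entity.
            words.append(text)
        else:
            # "O" or "": the token is outside any entity.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], ""
    if words:
        entities.append((" ".join(words), label))
    return entities


# (text, ent_iob_, ent_type_) rows taken from the table above.
rows = [
    ("This", "B", "MISC"), ("time", "I", "MISC"), ("india", "I", "MISC"),
    ("is", "I", "MISC"), ("going", "I", "MISC"), ("to", "I", "MISC"),
    ("win", "I", "MISC"), ("the", "I", "MISC"), ("world", "I", "MISC"),
    ("cup", "I", "MISC"),
]
print(group_iob(rows))
```

For these rows the whole sentence collapses into a single `MISC` span, which makes it easy to see that the tagger treated the entire English sentence as one entity.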


**REMOVING STOP WORDS USING SPACY**

In [6]:
text = "India is the best cricket team in the world"
doc = nlp(text)

In [7]:
non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)

[India, is, the, best, cricket, team, the, world]
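
The output above still contains "is" and "the". The most likely reason is that the pipeline bound to `nlp` at this point is the German `de_core_news_sm`, so `is_stop` consults a German stop list. To show what stop-word filtering is meant to do, here is a minimal sketch with an explicit stop list. `ILLUSTRATIVE_STOP_WORDS` is a deliberately tiny, hypothetical list made up for this example, and splitting on whitespace stands in for spaCy's tokenizer.

```python
# Hypothetical, deliberately tiny English stop list (illustration only).
ILLUSTRATIVE_STOP_WORDS = frozenset({"is", "the", "in", "of", "a", "an"})


def remove_stop_words(text, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Return the whitespace-separated words of `text` that are not stop words."""
    return [word for word in text.split() if word.lower() not in stop_words]


print(remove_stop_words("India is the best cricket team in the world"))
# -> ['India', 'best', 'cricket', 'team', 'world']
```

With an English spaCy pipeline, the list comprehension from the cell above should behave along the same lines, using spaCy's own English stop list instead of this toy one.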


**FINDING NOUNS USING SPACY**

In [8]:
text = "Kohli is the king of the cricket world"
doc = nlp(text)

In [9]:
nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']]
print(nouns)

[Kohli, the, the, cricket, world]


**NAMED ENTITY RECOGNITION**

In [10]:
text = "Kohli is the king of the cricket world."
doc = nlp(text)

In [12]:
for ent in doc.ents:
  print(f"({ent.text}, {ent.label_})", end=" ")

(Kohli is the king of the, MISC) 

In [14]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

**A HARDER EXAMPLE**

In [15]:
text = "James O'Neill, chairman of World Cargo Inc, lives in SanFrancisco."
doc = nlp(text)

In [16]:
for ent in doc.ents:
  print(f"({ent.text}, {ent.label_})", end=" ")

(James O'Neill, PER) (chairman of World Cargo Inc, MISC) (SanFrancisco, MISC) 
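
Once the `(text, label)` pairs are available, it can help to group them by label before inspecting them. The following is a small sketch on the pairs printed above. `entities_by_label` is a hypothetical helper name; with spaCy the input could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict


def entities_by_label(pairs):
    """Group (entity_text, label) pairs into {label: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the output above.
pairs = [
    ("James O'Neill", "PER"),
    ("chairman of World Cargo Inc", "MISC"),
    ("SanFrancisco", "MISC"),
]
print(entities_by_label(pairs))
```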

**VISUALIZE NAMED ENTITIES**

In [17]:
from spacy import displacy

In [18]:
displacy.render(doc, style='ent', jupyter=True)

**TRYING WITH REAL DATASET**

In [19]:
from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
  res = requests.get(url)
  html = res.text
  soup = BeautifulSoup(html, 'html.parser')
  for script in soup(["script", "style", 'aside']):
    script.extract()
  return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://blog.google/technology/health/google-ai-health-information/')
article=nlp(ny_bb)
len(article.ents)


147
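
`url_to_string` relies on `requests` and BeautifulSoup. To make the extraction step itself easier to follow, here is a rough standard-library sketch of the same idea applied to an HTML string: skip the content of `script`, `style` and `aside` elements, then collapse whitespace. `VisibleTextParser` and `html_to_text` are hypothetical names introduced for this example. Unlike the cell above, the sketch collapses every run of whitespace, not only newlines and tabs.

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside <script>, <style> or <aside>."""

    SKIPPED_TAGS = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of `html` with whitespace runs collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><p>Hello\n\tworld</p><aside>ads</aside><p>again</p></body></html>"
)
print(html_to_text(sample))  # -> Hello world again
```

Collapsing all whitespace may also reduce the long runs of spaces that show up when individual sentences of a scraped page are printed.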

In [20]:
displacy.render(article, style='ent', jupyter=True)

In [21]:
from collections import Counter
items = [x.text for x in article.ents]
Counter(items).most_common(5)

[('Google', 3),
 ('SVP', 3),
 ('English', 3),
 ('Twitter Facebook LinkedIn Mail Copy', 2),
 ('Android', 2)]

**POPULAR NER TYPES:**

In [22]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'MISC': 62, 'ORG': 30, 'PER': 30, 'LOC': 25})
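
Raw counts are easier to compare across pages when they are expressed as shares of all entities. Here is a minimal sketch using the counts printed above; `label_shares` is a hypothetical helper name.

```python
from collections import Counter


def label_shares(counts):
    """Return {label: fraction of all entities}, most frequent label first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.most_common()}


# Counts copied from the output above; they sum to the 147 entities found.
counts = Counter({"MISC": 62, "ORG": 30, "PER": 30, "LOC": 25})
for label, share in label_shares(counts).items():
    print(f"{label}: {share:.1%}")
```

The large share of `MISC` is worth noting: a dominant catch-all label can be a hint that the model does not fit the text well.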

In [23]:
sentences = [x for x in article.sents]
print(sentences[9])

link                Latest stories                            Product updates                    Product updates                Android, Chrome & Play                                


**NER TAGS**

In [24]:
displacy.render(nlp(str(sentences[0])), jupyter=True, style='ent')




**TYPES OF WORDS IN THE SENTENCE**

In [25]:
[(x.orth_,x.pos_, x.lemma_) for x in [y for y in nlp(str(sentences[0]))
if not y.is_stop and y.pos_ != 'PUNCT']]

[(' ', 'SPACE', ' ')]
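
The result above consists of a single `SPACE` token, which suggests that `sentences[0]` contains only whitespace. One way to avoid this is to drop blank sentences before indexing. The sketch below works on anything that converts to a string, so it should also apply to spaCy sentence spans. `non_blank` is a hypothetical helper name.

```python
def non_blank(items):
    """Return the items whose string form contains non-whitespace characters."""
    return [item for item in items if str(item).strip()]


# Plain strings stand in for spaCy sentence spans here.
demo = [" ", "\n\t", "Hello there.", "", "Bye."]
print(non_blank(demo))  # -> ['Hello there.', 'Bye.']
```

In the notebook, `sentences = non_blank(article.sents)` would then make `sentences[0]` the first sentence that actually contains text.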

**SENTENCE DEPENDENCY TREE**

In [26]:
displacy.render(nlp(str(sentences[0])), style='dep', jupyter = True,
options = {'distance': 120})

**2 CONVERT URL TO TEXT AND COUNT ENTITIES**

In [27]:
# Convert the content of a given URL into text and count the identified entities

from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://www.tvtoday.de/tv-programm/')
article = nlp(ny_bb)
len(article.ents)

1250

**VISUALIZE ENTITIES IN TEXT**

In [28]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

**COUNT ENTITY LABELS**

In [36]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'MISC': 528, 'LOC': 325, 'ORG': 260, 'PER': 137})

**COUNT MOST COMMON ENTITIES**

In [37]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('USA', 116),
 ('D', 84),
 ('ReportDokusoap', 14),
 ('ZDF', 11),
 ('Kan', 11),
 ('ProSieben', 10),
 ('VOX', 9),
 ('VisierKrimiserie', 9),
 ('ARTE', 7),
 ('ZDFneo', 7),
 ('F', 7),
 ('GrenzeDoku', 7),
 ('VersicherungsdetektiveDokusoap', 7),
 ('TV-Programm', 6),
 ('RTLZWEI', 6),
 ('HundeflüstererCoachingdoku', 6),
 ('RTLup', 6),
 ('Italiens', 6),
 ('Patrol New Zealand', 6),
 ('GerichtsmedizinDoku', 6),
 ('ONE', 5),
 ('D 2024Mit', 5),
 ('TELE', 5),
 ('HobbyKrimiserie', 5),
 ('NITRO', 5)]
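
Several of the most frequent "entities" above are single letters such as `D` and `F`, which look like page residue more than names. A simple length filter removes them. This is only a sketch: `filter_entity_counts` and the `min_chars` threshold are assumptions made for this example, and the sample rows are copied from the output above.

```python
def filter_entity_counts(counts, min_chars=2):
    """Keep (entity_text, count) pairs whose text has at least `min_chars` characters."""
    return [(text, n) for text, n in counts if len(text.strip()) >= min_chars]


# A few rows copied from the output above.
top_entities = [
    ("USA", 116),
    ("D", 84),
    ("ReportDokusoap", 14),
    ("ZDF", 11),
    ("Kan", 11),
    ("F", 7),
]
print(filter_entity_counts(top_entities))
```

A length threshold does not catch glued-together tokens such as `ReportDokusoap`; those come from the text extraction step and would need to be fixed there.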

**PRINT SPECIFIC SENTENCE**

In [31]:
# Print the 4th sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[3])

Bundesli...


**VISUALIZE ENTITIES IN SPECIFIC SENTENCE**

In [32]:
# Render a visualization of the identified entities in the 4th sentence of the extracted article text.

displacy.render(nlp(str(sentences[3])), jupyter=True, style='ent')


**EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS**

In [33]:
# Extract words along with their parts of speech and lemmas from the 4th sentence of the extracted article text, excluding stop words and punctuation.

[(x.orth_,x.pos_, x.lemma_) for x in [y 
                                      for y
                                      in nlp(str(sentences[3])) 
                                      if not y.is_stop and y.pos_ != 'PUNCT']]

[('Bundesli', 'PROPN', 'Bundesli')]

**VISUALIZE DEPENDENCY PARSING**

In [35]:
# Render a visualization of the dependency parse for the 2nd sentence (index 1) of the extracted article text, with adjusted distance between words.

displacy.render(nlp(str(sentences[1])), style='dep', jupyter = True, options = {'distance': 120})

**3 CONVERT URL TO TEXT AND COUNT ENTITIES**

In [38]:
# Convert the content of a given URL into text and count the identified entities.

from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://www.bbc.com/zhongwen/simp')
article = nlp(ny_bb)
len(article.ents)

40