# CLASSWORK - 8

## NAMED ENTITY RECOGNITION

## LOAD SPACY MODEL

spaCy is an open-source Python library for natural language processing (NLP). It is a popular choice for NLP tasks because it provides pre-trained models along with efficient tokenization, named entity recognition, part-of-speech tagging, and dependency parsing.

In [1]:
# Import spaCy and load the large English language model
# (install the model once with: python -m spacy download en_core_web_lg)
import spacy
nlp = spacy.load("en_core_web_lg")


## TOKENIZE TEXT

In [2]:
# Tokenize the text and print each token
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)
for token in doc:
    print(token, end="| ")


My| best| friend| Ryan| Peters| likes| fancy| adventure| games| .| 

## DISPLAY NLP TOKENS

Tokens are the basic units of text in NLP: individual words, punctuation marks, or subword units. Each spaCy token carries linguistic attributes such as its lemma, part of speech, dependency label, and entity type, so tokens serve as the building blocks for the processing steps that follow.

In [3]:
# Function to generate a DataFrame for visualization of spaCy tokens

import pandas as pd

def display_nlp(doc, include_punct=False):
    """Generate data frame for visualization of spaCy tokens."""
    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {'token': i,  'text': t.text, 'lemma_': t.lemma_, 
                   'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
                   'pos_': t.pos_, 'dep_': t.dep_, 
                   'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
            rows.append(row)
    
    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    return df


display_nlp(doc)




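|   | text | lemma_ | is_stop | is_alpha | pos_ | dep_ | ent_type_ | ent_iob_ |
|---|------|--------|---------|----------|------|------|-----------|----------|
| 0 | My | my | True | True | PRON | poss | | O |
| 1 | best | good | False | True | ADJ | amod | | O |
| 2 | friend | friend | False | True | NOUN | nsubj | | O |
| 3 | Ryan | Ryan | False | True | PROPN | compound | PERSON | B |
| 4 | Peters | Peters | False | True | PROPN | appos | PERSON | I |
| 5 | likes | like | False | True | VERB | ROOT | | O |
| 6 | fancy | fancy | False | True | ADJ | amod | | O |
| 7 | adventure | adventure | False | True | NOUN | compound | | O |
| 8 | games | game | False | True | NOUN | dobj | | O |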
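
The `ent_iob_` column uses the IOB scheme: `B` marks the first token of an entity, `I` marks a token inside an entity, and `O` marks a token outside any entity. As a minimal sketch of how these tags combine into entity spans, the snippet below groups the rows from the table above in plain Python. The helper name `group_entities` is hypothetical, and joining tokens with a single space is a simplification, since spaCy itself builds `doc.ents` from the original text.

```python
# (text, ent_type_, ent_iob_) triples copied from the table above
rows = [
    ("My", "", "O"), ("best", "", "O"), ("friend", "", "O"),
    ("Ryan", "PERSON", "B"), ("Peters", "PERSON", "I"),
    ("likes", "", "O"), ("fancy", "", "O"),
    ("adventure", "", "O"), ("games", "", "O"),
]


def group_entities(rows):
    """Merge consecutive B/I tokens into (entity_text, label) pairs."""
    spans = []
    words, label = [], None
    for text, ent_type, iob in rows:
        if iob == "B":
            # A new entity starts; close any entity still open
            if words:
                spans.append((" ".join(words), label))
            words, label = [text], ent_type
        elif iob == "I" and words:
            # Continue the entity that is currently open
            words.append(text)
        else:
            # Token outside any entity; close the open one, if any
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
    if words:
        spans.append((" ".join(words), label))
    return spans


entities = group_entities(rows)
print(entities)
```

For this sentence the grouping yields a single `PERSON` span covering "Ryan Peters".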


## FILTERING STOPWORDS AND PUNCTUATION

In [4]:
# Removing stopwords and punctuation from the given text using spaCy.
text = "Dear Ryan, we need to sit down and talk. Regards, Pete"
doc = nlp(text)

non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)


[Dear, Ryan, need, sit, talk, Regards, Pete]


## EXTRACT NOUNS FROM TEXT

In [5]:
# Extracts nouns and proper nouns from a given text and prints them.
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']]
print(nouns)


[friend, Ryan, Peters, adventure, games]


## IDENTIFY ENTITIES IN TEXT

In [6]:
# Print identified entities along with their labels.
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")


(Ryan Peters, PERSON) 

## IDENTIFY ENTITIES IN TEXT

In [7]:
# Print identified entities along with their labels.
text = "James O'Neill, chairman of World Cargo Inc, lives in San Francisco." 
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")


(James O'Neill, PERSON) (World Cargo Inc, ORG) (San Francisco, GPE) 

## VISUALIZE ENTITIES IN TEXT

In [8]:
# Render a visualization of the identified entities in the text.
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)


## CONVERT URL TO TEXT AND COUNT ENTITIES

In [9]:
# Convert the content of a given URL into text and count the identified entities.
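from bs4 import BeautifulSoup
import requests
import re


def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # stop early on HTTP errors
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')  # requires the html5lib package
    # Drop elements that carry no article text
    for element in soup(["script", "style", 'aside']):
        element.extract()
    # Replace runs of newlines and tabs with single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))


ny_bb = url_to_string('https://www.nytimes.com/news-event/coronavirus')
article = nlp(ny_bb)
len(article.ents)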


144

In [10]:
ny_bb

"          The Covid-19 Pandemic - The New York Times                                                                                                                                           Skip to contentSkip to site indexThe Covid-19 Pandemic\xa0Today’s PaperCovid-19 GuidanceSymptoms and TreatmentJN.1 VariantNew ShotsLong Covid in KidsPaxlovidAt-Home TestsMasksThe Covid-19 PandemicWith the acute phase of the Covid-19 pandemic fading even as the coronavirus persists and evolves, a new normal is taking shape around the world.Four Years of CovidCard 1 of 6We asked readers\xa0how Covid has changed their attitudes towards life. Here is what they said:“I'm a much more grateful person. Life is precious, and I see the beauty in all the little miracles that happen all around me. I'm a humbled human being now. I have more empathy and compassion towards everyone.” —\xa0Gil Gallegos, 59, Las Vegas, N.M.“The pandemic has completely changed my approach to educating my child. My spouse and I had 

## VISUALIZE ENTITIES IN TEXT

In [11]:
# Render a visualization of the identified entities in the extracted article text.
displacy.render(article, style='ent', jupyter=True)



## COUNT ENTITY LABELS

In [12]:
# Count the occurrence of each entity label in the extracted article text.
from collections import Counter

# Get labels of named entities
labels = [x.label_ for x in article.ents]

# Count occurrences of each label
label_counts = Counter(labels)

# Print label counts
print(label_counts)


Counter({'PERSON': 44, 'ORG': 28, 'GPE': 27, 'DATE': 20, 'NORP': 6, 'PRODUCT': 5, 'FAC': 4, 'CARDINAL': 4, 'WORK_OF_ART': 3, 'TIME': 1, 'EVENT': 1, 'LANGUAGE': 1})
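
Raw counts are easier to compare as shares of the total. The sketch below reuses the counts printed above (hard-coded here so it runs without the article) and converts them to percentages; the variable names `shares` and `top3_total` are illustrative only.

```python
from collections import Counter

# Label counts copied from the output above
label_counts = Counter({'PERSON': 44, 'ORG': 28, 'GPE': 27, 'DATE': 20,
                        'NORP': 6, 'PRODUCT': 5, 'FAC': 4, 'CARDINAL': 4,
                        'WORK_OF_ART': 3, 'TIME': 1, 'EVENT': 1, 'LANGUAGE': 1})

# The total should match len(article.ents) from the earlier cell
total = sum(label_counts.values())

# Percentage share of each label, most frequent first
shares = {label: round(100 * n / total, 1)
          for label, n in label_counts.most_common()}

# Number of entities covered by the three most frequent labels
top3_total = sum(n for _, n in label_counts.most_common(3))

for label, pct in shares.items():
    print(f"{label:<12}{pct:>5.1f}%")
print(f"Top 3 labels cover {top3_total} of {total} entities")
```

The counts sum to 144, the same number of entities reported earlier, and the three most frequent labels (`PERSON`, `ORG`, `GPE`) account for 99 of them.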


In [13]:
article

          The Covid-19 Pandemic - The New York Times                                                                                                                                           Skip to contentSkip to site indexThe Covid-19 Pandemic Today’s PaperCovid-19 GuidanceSymptoms and TreatmentJN.1 VariantNew ShotsLong Covid in KidsPaxlovidAt-Home TestsMasksThe Covid-19 PandemicWith the acute phase of the Covid-19 pandemic fading even as the coronavirus persists and evolves, a new normal is taking shape around the world.Four Years of CovidCard 1 of 6We asked readers how Covid has changed their attitudes towards life. Here is what they said:“I'm a much more grateful person. Life is precious, and I see the beauty in all the little miracles that happen all around me. I'm a humbled human being now. I have more empathy and compassion towards everyone.” — Gil Gallegos, 59, Las Vegas, N.M.“The pandemic has completely changed my approach to educating my child. My spouse and I had never seri

## COUNT MOST COMMON ENTITIES

In [14]:
# Count the most common entities in the extracted article text.
items = [x.text for x in article.ents]
Counter(items).most_common(5)


[('New York', 9),
 ('Americans', 4),
 ('Four Years', 3),
 ('TreatmentJN.1 VariantNew ShotsLong', 2),
 ('2024', 2)]
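
The entity `'TreatmentJN.1 VariantNew ShotsLong'` suggests that text from adjacent page elements was joined without spaces during extraction, so the model sees fused words. One possible remedy is to insert a separator between text fragments. The following is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The class `TextExtractor`, the function `html_to_text`, and the sample HTML are all hypothetical and only illustrate the idea.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style>, and <aside> content."""

    SKIP = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return visible text with one space between fragments."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space, then collapse any whitespace runs
    return re.sub(r"\s+", " ", " ".join(parser.chunks))


# Made-up sample page, not the real article markup
sample_html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<nav><a>Symptoms and Treatment</a><a>JN.1 Variant</a><a>New Shots</a></nav>"
    "<script>var x = 1;</script>"
    "<p>A new normal is taking shape.</p></body></html>"
)
clean_text = html_to_text(sample_html)
print(clean_text)
```

With a space between fragments, the link labels stay separate words, which should give the tokenizer and the entity recognizer cleaner input.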

## PRINT FIRST SENTENCE

In [15]:
# Print the first sentence from the extracted article text.
sentences = list(article.sents)
print(sentences[0])


          The Covid-19 Pandemic - The New York Times                                                                                                                                           Skip to contentSkip to site indexThe Covid-19 Pandemic Today’s PaperCovid-19 GuidanceSymptoms and TreatmentJN.1 VariantNew ShotsLong Covid in KidsPaxlovidAt-Home TestsMasksThe Covid-19 PandemicWith the acute phase of the Covid-19 pandemic fading even as the coronavirus persists and evolves, a new normal is taking shape around the world.


## VISUALIZE ENTITIES IN FIRST SENTENCE

In [16]:
# Visualize entities in the first sentence of the extracted article text.
displacy.render(nlp(str(sentences[0])), jupyter=True, style='ent')


## EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [17]:
# Extract words along with their parts of speech and lemmas from the first sentence of the extracted article text 
# excluding stop words and punctuation.


[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[0]))
 if not t.is_stop and t.pos_ != 'PUNCT']


[('          ', 'SPACE', '          '),
 ('Covid-19', 'PROPN', 'Covid-19'),
 ('Pandemic', 'PROPN', 'Pandemic'),
 ('New', 'PROPN', 'New'),
 ('York', 'PROPN', 'York'),
 ('Times', 'PROPN', 'Times'),
 ('                                                                                                                                          ',
  'SPACE',
  '                                                                                                                                          '),
 ('Skip', 'PROPN', 'Skip'),
 ('contentSkip', 'NOUN', 'contentskip'),
 ('site', 'VERB', 'site'),
 ('indexThe', 'PRON', 'indexthe'),
 ('Covid-19', 'PROPN', 'Covid-19'),
 ('Pandemic', 'PROPN', 'Pandemic'),
 ('\xa0', 'SPACE', '\xa0'),
 ('Today', 'NOUN', 'today'),
 ('PaperCovid-19', 'PROPN', 'PaperCovid-19'),
 ('GuidanceSymptoms', 'PROPN', 'GuidanceSymptoms'),
 ('TreatmentJN.1', 'VERB', 'TreatmentJN.1'),
 ('VariantNew', 'PROPN', 'VariantNew'),
 ('ShotsLong', 'PROPN', 'ShotsLong'),
 ('Covid', 'PROPN', 'Co

## VISUALIZE DEPENDENCY PARSING

Dependency parsing analyzes the grammatical structure of a sentence by identifying relationships between its words. Each relationship is a directed edge from a head word to a dependent word, labeled with the type of relation (for example, `nsubj` for a nominal subject).
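
To make the idea of directed edges concrete, here is a small plain-Python sketch for the sentence "My best friend Ryan Peters likes fancy adventure games." The dependency labels are the `dep_` values from the token table earlier. The head indices are an assumption, hand-assigned for illustration rather than taken from parser output. Following spaCy's convention, the root token is its own head.

```python
# (text, dep label, index of head token); head indices are hand-assigned
tokens = [
    ("My", "poss", 2),
    ("best", "amod", 2),
    ("friend", "nsubj", 5),
    ("Ryan", "compound", 4),
    ("Peters", "appos", 2),
    ("likes", "ROOT", 5),   # the root points to itself
    ("fancy", "amod", 8),
    ("adventure", "compound", 8),
    ("games", "dobj", 5),
]


def children_of(index):
    """Return the texts of all tokens whose head is the token at `index`."""
    return [text for i, (text, dep, head) in enumerate(tokens)
            if head == index and i != index]


# The root is the only token labeled ROOT
root = next(text for text, dep, head in tokens if dep == "ROOT")
print("root:", root)

# Print every edge as: head --label--> dependent
for text, dep, head in tokens:
    if dep != "ROOT":
        print(f"{tokens[head][0]} --{dep}--> {text}")
```

Each printed line corresponds to one arrow in a displaCy dependency diagram, pointing from the head to its dependent.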

In [39]:
# Render a visualization of the dependency parsing for the first sentence of the extracted article text.
displacy.render(nlp(str(sentences[0])), style='dep', jupyter=True, options={'distance': 120})
