Named Entity Recognition with spaCy & RoBERTa

Documentation: https://spacy.io/models/en#en_core_web_trf

<div style="color:white">

- **Description**: Named entity recognition shows how often named entities appear in the articles that fall into a specific topic.
- **Purpose**: The entity frequencies support the conclusion that an entity-based approach to sentiment analysis will be the most beneficial for the project.
- **Deployment**: Sentiment analysis will be performed on sentences that contain direct or indirect references to pre-selected entities. These sentences will be extracted from the article body and analyzed separately. The analysis will distinguish explicit mentions from implicit ones, so that indirect references (e.g., pronouns or related phrases) are also covered. Each extracted sentence will be classified as positive, negative, or neutral, and the tone of the entire article will then be adjusted accordingly. This way the overall sentiment of the article reflects the impact of the key entities while preserving context.

</div>
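
The extraction step described under Deployment can be sketched roughly as follows. This is only a simplified illustration: it uses a naive regex sentence splitter instead of spaCy's sentence segmentation, and a hypothetical alias table (`ENTITY_ALIASES`, with a made-up example entity) in place of real coreference resolution, so pronouns are not resolved. The function names are assumptions, not part of the project code.

```python
import re

# Hypothetical alias table: each pre-selected entity maps to the surface
# forms that count as a mention. Pronouns are NOT resolved in this sketch.
ENTITY_ALIASES = {
    "European Central Bank": ["European Central Bank", "ECB", "the central bank"],
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentences_mentioning(text, aliases=ENTITY_ALIASES):
    """Return {entity: [sentences]} for sentences containing any alias."""
    hits = {entity: [] for entity in aliases}
    for sentence in split_sentences(text):
        for entity, forms in aliases.items():
            pattern = r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                hits[entity].append(sentence)
    return hits

# Made-up article text for illustration
article = (
    "The ECB raised rates on Thursday. Markets fell sharply. "
    "Analysts said the central bank acted too late."
)
selected = sentences_mentioning(article)
```

The selected sentences would then be passed to whichever sentiment model the project settles on; that step is left out here because the notebook does not specify it.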

Named Entity Types in en_core_web_trf

<span style="font-size: 8px;">

    PERSON ('PERSON'): People, including fictional characters, and sometimes groups of people.
    NORP ('NORP'): Nationalities or religious/political groups.
    FAC ('FAC'): Buildings, airports, highways, bridges, etc.
    ORG ('ORG'): Organizations, including companies, institutions, and governmental bodies.
    GPE ('GPE'): Geopolitical entities (countries, cities, states).
    LOC ('LOC'): Non-GPE locations, mountain ranges, bodies of water, etc.
    PRODUCT ('PRODUCT'): Products, including software, vehicles, gadgets, etc.
    EVENT ('EVENT'): Named events, including sports events, festivals, wars, and other events.
    WORK_OF_ART ('WORK_OF_ART'): Titles of books, movies, paintings, etc.
    LAW ('LAW'): Laws, regulations, rules, and legal documents.
    LANGUAGE ('LANGUAGE'): Languages.
    DATE ('DATE'): Absolute or relative dates or periods.
    TIME ('TIME'): Times within a day.
    PERCENT ('PERCENT'): Percentage values.
    MONEY ('MONEY'): Monetary values, including unit currencies.
    QUANTITY ('QUANTITY'): Measurements of physical quantities.
    ORDINAL ('ORDINAL'): Ordinal numbers.
    CARDINAL ('CARDINAL'): Non-ordinal numbers.
</span>

In [None]:
# Import libraries
import spacy
from spacy import displacy
import pandas as pd
from collections import Counter 
import matplotlib.pyplot as plt

# local imports

# Settings
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
%load_ext autoreload
%autoreload 2

In [None]:
# Load the transformer model (install it first with: python -m spacy download en_core_web_trf)
trf = spacy.load('en_core_web_trf')

In [None]:
# Load the data
df = pd.read_csv('../data/topics.csv')

In [None]:
# Keep only the articles assigned to topic 0
data_filtered = df[df['topic'] == 0].reset_index()
data_filtered.info()

In [None]:
data_filtered.head()

In [None]:
data_filtered['text'][1]

In [None]:
# Visualize the entity recognizer for one article
one_text = data_filtered['text'].iloc[80]
doc = trf(one_text)
displacy.render(doc, style='ent')

In [None]:
# Return the organisations and people mentioned in a text
def get_entities(text):
    # Process the text with the spaCy model to get named entities
    doc = trf(text)
    # Initialize lists to store the identified organisations and people
    org_list = []
    people_list = []
    # Loop through the identified entities and append them to the matching list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
        elif entity.label_ == 'PERSON':
            people_list.append(entity.text)

    return org_list, people_list
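
If more entity types are needed later (for example NORP), the same loop could be generalized to group entities by label. The sketch below is one possible shape, not the notebook's method: `entities_by_label` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a spaCy `Doc` so the example runs without the model. It relies only on `doc.ents`, `ent.label_` and `ent.text`, which real spaCy documents provide.

```python
from collections import defaultdict
from types import SimpleNamespace

def entities_by_label(doc, labels=("ORG", "PERSON")):
    """Group entity texts by label, keeping only the requested labels."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ in labels:
            grouped[ent.label_].append(ent.text)
    # Return a plain dict with every requested label present, even if empty
    return {label: grouped[label] for label in labels}

# Stand-in for a spaCy Doc (made-up entities) so the sketch runs without the model
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Reuters", label_="ORG"),
    SimpleNamespace(text="Jane Doe", label_="PERSON"),
    SimpleNamespace(text="London", label_="GPE"),
])
result = entities_by_label(fake_doc)
```

With the real model, the call would be `entities_by_label(trf(text))`.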

In [None]:
# Fetching entities (run time - about 3 mins)
data_filtered[['ORG', 'people']] = data_filtered['body'].apply(lambda x: pd.Series(get_entities(x)))
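
Calling the model once per row is simple, but spaCy also offers `nlp.pipe`, which processes texts in batches. A minimal sketch of a batched variant is shown below; `extract_entities_batched` and the `batch_size` value are assumptions for illustration, and any speed-up has not been measured on this dataset.

```python
def extract_entities_batched(nlp, texts, batch_size=16):
    """Return one (org_list, people_list) pair per text, using nlp.pipe."""
    results = []
    # nlp.pipe yields one Doc per input text, in the same order as the input
    for doc in nlp.pipe(texts, batch_size=batch_size):
        orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
        people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        results.append((orgs, people))
    return results

# Possible usage with the objects defined in this notebook:
# pairs = extract_entities_batched(trf, data_filtered['body'].tolist())
# data_filtered['ORG'] = [orgs for orgs, _ in pairs]
# data_filtered['people'] = [people for _, people in pairs]
```

The pipeline object is passed in as a parameter so the function can be tried with a smaller model before running it with `en_core_web_trf`.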

In [None]:
# Check the DataFrame
data_filtered.head(10)

In [None]:
# Check a single document for an entity
print(data_filtered['people'].iloc[10])

In [None]:
# Convert each entity column into a list of per-document lists
org_list = data_filtered['ORG'].to_list()
people_list = data_filtered['people'].to_list()

# Flatten the lists (combine all rows into one list per entity type)
org_list = [org for sublist in org_list for org in sublist]
people_list = [person for sublist in people_list for person in sublist]

In [None]:
# Count how often each organisation is mentioned and show the 30 most frequent
org_freq = Counter(org_list)
org_freq.most_common(30)
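
The flatten-and-count steps can also be written in a single pass with `itertools.chain.from_iterable`. The example below uses made-up entity lists purely to show the mechanics.

```python
from collections import Counter
from itertools import chain

# Illustrative per-document entity lists (made-up values)
org_lists = [["Reuters", "NASA"], [], ["NASA", "Reuters", "NASA"]]

# Flatten all rows into one iterable and count the mentions
org_freq_demo = Counter(chain.from_iterable(org_lists))
top_two = org_freq_demo.most_common(2)
```

Note that `Counter` treats different spellings of the same entity as separate keys, so the counts are per surface form.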

In [None]:
# Visualize entity frequencies
common_orgs = org_freq.most_common(20)
org_names, org_counts = zip(*common_orgs)
plt.figure(figsize=(8, 4))
plt.barh(org_names, org_counts, color='green', label='Overall')
plt.xlabel('Frequency')
plt.ylabel('Entities')
plt.title('Top 20 Most Common Organisations')
plt.gca().invert_yaxis()
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Count how often each person is mentioned and show the 30 most frequent
people_freq = Counter(people_list)
people_freq.most_common(30)

In [None]:
# Combine frequencies into a DataFrame
entity_freq_df = pd.DataFrame({
    'ORG': pd.Series(org_freq),
    'PERSON': pd.Series(people_freq)
}).fillna(0)  # Fill NaN values with 0

In [None]:
# Count the entities per document (row) across both entity columns
# (Series.map is used because DataFrame.applymap is deprecated in pandas 2.1+)
entity_counts_per_doc = data_filtered[['ORG', 'people']].apply(lambda col: col.map(len)).sum(axis=1)

In [None]:
# Plot the total number of mentions per entity type (one color per type)
entity_freq_df.sum().plot(kind='bar', color=['#1f77b4', '#ff7f0e'])
plt.title("Comparative Frequency of Entity Types")
plt.xlabel("Entity Type")
plt.ylabel("Frequency")
plt.show()

In [None]:
# Save the entities
data_filtered.to_csv('../data/topics-enities.csv', index=False)