# Named Entity Recognition (NER)

Named entity recognition (NER) is a sub-task of natural language processing (NLP) that identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

NER is one of the most common text preprocessing tasks. It is widely used to pull key information out of unstructured text.

In [None]:
# !pip install spacy

In [6]:
# !python -m spacy download en_core_web_sm

In [39]:
import spacy
import pandas as pd
from collections import Counter
pd.set_option('display.max_rows', 1000)
nlp = spacy.load('en_core_web_sm')

`en_core_web_sm` is a small English pipeline trained on written web text (blogs, news, comments). It includes vocabulary, syntax, and entities.

In [31]:
# Entity labels to keep. A label that is not in the loaded model's label set
# simply never matches, so it is harmless here but has no effect.
ENTITY_LABELS = {'PERSON', 'ORG', 'GPE', 'NORP', 'EVENT', 'LAW', 'FAC', 'LOC',
                 'WEAPON', 'VICTIM', 'INJURY', 'POLICE_FORCE', 'LOCATION',
                 'CRIME', 'ARREST', 'DEATH', 'AGENCY'}

def get_entity(text):
    # process the text with our spaCy model to get named entities
    doc = nlp(text)
    # keep the text of every entity whose label is in ENTITY_LABELS;
    # the set comprehension removes duplicates within a single text
    entities = {entity.text for entity in doc.ents if entity.label_ in ENTITY_LABELS}
    # convert back to a list (order is not guaranteed)
    return list(entities)

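### Named entity types:

The first eight types below (PERSON through LOC) are produced by `en_core_web_sm`. The remaining types (WEAPON through AGENCY) are incident-specific labels that the stock model does not emit; extracting them would require a custom-trained NER model.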

**PERSON**: Represents names of people, such as "John Smith," "Mary," or "Dr. Watson."

**ORG**: Stands for organizations, which can be companies, institutions, or other entities, such as "Google," "United Nations," or "NASA."

**GPE**: Geopolitical entities refer to places like countries, cities, and states. Examples include "United States," "New York," or "Paris."

**NORP**: Represents nationalities or religious or political groups, such as "American," "Christian," or "Republican."

**EVENT**: Represents events or incidents, such as "Olympics," "Thanksgiving," or "World War II."

**LAW**: Represents legal references, like "Constitution," "copyright law," or "Section 230."

**FAC**: Stands for facilities, including buildings, airports, or stadiums, like "Eiffel Tower" or "JFK Airport."

**LOC**: Represents non-geopolitical locations or natural features, such as "Mount Everest" or "Amazon River."

**WEAPON**: This entity type represents the names of firearms, weapons, or explosive devices involved in incidents. Examples include "AK-47," "handgun," "grenade," or "IED" (Improvised Explosive Device).

**VICTIM**: Denotes individuals or groups who were injured, harmed, or killed in the incident. Examples include "victims," "casualties," or specific names of individuals affected.

**INJURY**: Denotes the type and severity of injuries sustained in the incident, such as "gunshot wounds," "stabbing injuries," or "minor injuries."

**POLICE_FORCE**: Represents law enforcement agencies or police departments involved in responding to the incident. Examples include "NYPD" (New York Police Department) or "FBI" (Federal Bureau of Investigation).

**LOCATION**: Refers to specific locations or venues where the incident occurred, such as "school," "shopping mall," "concert venue," or "crime scene."

**CRIME**: Denotes the type of crime or incident, such as "shooting," "terrorist attack," "robbery," or "homicide."

**ARREST**: Denotes the arrest of individuals or suspects related to the incident.

**DEATH**: Represents fatalities or deaths resulting from the incident, including the names of deceased individuals.

**AGENCY**: Refers to government agencies or organizations involved in responding to the incident, such as "Homeland Security" or "Emergency Services."
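Before filtering on a label list, it can help to check which of the requested labels the model can actually return. The sketch below hard-codes the 18 OntoNotes-style labels that `en_core_web_sm` is documented to use; that set is an assumption here, and `split_labels` is a hypothetical helper. With the model loaded, `nlp.get_pipe("ner").labels` reports the real set.

```python
# Assumption: the OntoNotes-style label set used by en_core_web_sm's NER.
# With the model loaded, compare against nlp.get_pipe("ner").labels instead.
ONTONOTES_LABELS = {
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
}

# The labels requested in get_entity, in the order they are described above.
WANTED_LABELS = [
    "PERSON", "ORG", "GPE", "NORP", "EVENT", "LAW", "FAC", "LOC",
    "WEAPON", "VICTIM", "INJURY", "POLICE_FORCE", "LOCATION",
    "CRIME", "ARREST", "DEATH", "AGENCY",
]

def split_labels(wanted, available):
    """Split wanted labels into (available, missing), keeping their order."""
    ok = [label for label in wanted if label in available]
    missing = [label for label in wanted if label not in available]
    return ok, missing

supported, unsupported = split_labels(WANTED_LABELS, ONTONOTES_LABELS)
print("can be returned by the model:", supported)
print("never matched:", unsupported)
```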

In [34]:
df = pd.read_csv('School_Shooting_Data.csv')
df.head()

Unnamed: 0,Created At,Text,Source,User Name,Location,Description,Followers Count,Quote Count,Reply Count,Retweet Count,Favorite Count
0,Tue Apr 30 23:57:07 +0000 2019,This week: \n• Baltimore: 1 dead\n• Birmingham...,"<a href=""http://twitter.com/download/iphone"" r...",Dante Vic,"Barcelona, Spain",rhythm & blues 🎶 #UNCC17,485,113,112.0,2441.0,4605.0
1,Tue Apr 30 23:29:19 +0000 2019,Two people dead and several injured at the Uni...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Shannon Watts,,"Founder of @MomsDemand, grassroots army of @Ev...",310019,351,507.0,6612.0,16246.0
2,Wed May 01 00:36:25 +0000 2019,Saddened to hear about the news at UNC Charlot...,"<a href=""http://twitter.com/download/iphone"" r...",President Parker 🇺🇸,"Charlotte, NC",Excellence is the Only Standard|#NCCU19 Studen...,7440,0,0.0,44.0,85.0
3,Wed May 01 06:48:12 +0000 2019,It’s a sad reality when there’s been 106 schoo...,"<a href=""http://twitter.com/download/iphone"" r...",Mike Kelleher,"Hazlet, New Jersey",•Part time owner of salernos pizzeria. •GQ Mag...,546,0,0.0,0.0,0.0
4,Wed May 01 02:37:28 +0000 2019,I'm heartsick for the victims of the #UNCC sho...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Elizabeth Warren,Massachusetts,"US Senator, MA. Former teacher & law professor...",2452166,27,123.0,971.0,5327.0


In [35]:
df['entities'] = df['Text'].apply(get_entity)

In [36]:
df.head()

Unnamed: 0,Created At,Text,Source,User Name,Location,Description,Followers Count,Quote Count,Reply Count,Retweet Count,Favorite Count,entities
0,Tue Apr 30 23:57:07 +0000 2019,This week: \n• Baltimore: 1 dead\n• Birmingham...,"<a href=""http://twitter.com/download/iphone"" r...",Dante Vic,"Barcelona, Spain",rhythm & blues 🎶 #UNCC17,485,113,112.0,2441.0,4605.0,[• West Chester]
1,Tue Apr 30 23:29:19 +0000 2019,Two people dead and several injured at the Uni...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Shannon Watts,,"Founder of @MomsDemand, grassroots army of @Ev...",310019,351,507.0,6612.0,16246.0,"[the University of North Carolina, Charlotte]"
2,Wed May 01 00:36:25 +0000 2019,Saddened to hear about the news at UNC Charlot...,"<a href=""http://twitter.com/download/iphone"" r...",President Parker 🇺🇸,"Charlotte, NC",Excellence is the Only Standard|#NCCU19 Studen...,7440,0,0.0,44.0,85.0,"[🙏, UNC Charlotte]"
3,Wed May 01 06:48:12 +0000 2019,It’s a sad reality when there’s been 106 schoo...,"<a href=""http://twitter.com/download/iphone"" r...",Mike Kelleher,"Hazlet, New Jersey",•Part time owner of salernos pizzeria. •GQ Mag...,546,0,0.0,0.0,0.0,[US]
4,Wed May 01 02:37:28 +0000 2019,I'm heartsick for the victims of the #UNCC sho...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Elizabeth Warren,Massachusetts,"US Senator, MA. Former teacher & law professor...",2452166,27,123.0,971.0,5327.0,[]


In [40]:
# merge entities column into one big list
orgs = df['entities'].to_list()
orgs = [org for sublist in orgs for org in sublist]
orgs[:10]

['• West Chester',
 'the University of North Carolina',
 'Charlotte',
 '🙏',
 'UNC Charlotte',
 'US',
 'https://t.co/Kodra668',
 'Suspect',
 'UNC Charlotte',
 'The University of North Carolina']

In [41]:
# count how many tweets mention each entity (entities are de-duplicated per tweet)
org_freq = Counter(orgs)
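Because `get_entity` removes duplicates within each tweet, this counter measures how many tweets mention an entity rather than how often the entity occurs overall. A small sketch with invented entity lists (not taken from the dataset) shows the difference between the two counts:

```python
from collections import Counter

# Hypothetical per-tweet entity lists *before* de-duplication.
raw_entities = [
    ["Colorado", "Colorado", "Denver"],
    ["Colorado"],
    ["Denver", "NRA", "NRA", "NRA"],
]

# Total mentions: every occurrence counts.
mention_freq = Counter(ent for ents in raw_entities for ent in ents)

# Tweet-level frequency: an entity counts at most once per tweet,
# which is the effect of list(set(...)) in get_entity.
tweet_freq = Counter(ent for ents in raw_entities for ent in set(ents))

print(mention_freq.most_common())
print(tweet_freq.most_common())
```

Which count is appropriate depends on the question: tweet-level frequency is less sensitive to a single tweet repeating one name many times.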

In [49]:
org_freq.most_common(20)

[('Colorado', 1496),
 ('Kendrick Castillo', 915),
 ('Republicans', 734),
 ('Riley Howell', 487),
 ('STEM School Highlands Ranch', 433),
 ('Highlands Ranch', 386),
 ('STEMshooting', 347),
 ('UNC Charlotte', 341),
 ('Charlotte', 317),
 ('Virginia Beach', 280),
 ('shot &amp', 278),
 ('the STEM School Highlands Ranch', 262),
 ('American', 238),
 ('the University of North Carolina', 232),
 ('Arabs', 218),
 ('Arab', 216),
 ('UNCC', 214),
 ('NRA', 208),
 ('Mexico', 176),
 ('Denver', 168)]

The most frequent entity is 'Colorado', found in 1,496 tweets. Because `get_entity` removes duplicates within each tweet, these counts are the number of tweets that mention an entity, not the total number of mentions.

'Kendrick Castillo' is second with 915, followed by 'Republicans' (734) and 'Riley Howell' (487).
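`Counter` treats entity strings that differ only by a leading article, a bullet, or HTML residue as separate keys, so 'STEM School Highlands Ranch' and 'the STEM School Highlands Ranch' are counted apart, and strings such as 'shot &amp' or a bare URL appear as entities. One possible cleanup step before counting is sketched below. `normalize_entity` and its rules are illustrative assumptions, not part of the original notebook, and merging aliases such as 'UNCC' and 'UNC Charlotte' would still need an explicit mapping.

```python
import html
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def normalize_entity(text):
    """Return a cleaned entity string, or None if nothing usable remains."""
    text = html.unescape(text)        # turn '&amp' / '&amp;' back into '&'
    text = URL_RE.sub("", text)       # drop links such as t.co URLs
    text = text.strip(" \t\n•&;#@")   # strip bullets and stray symbols at the ends
    text = re.sub(r"^the\s+", "", text, flags=re.IGNORECASE)  # drop a leading article
    # Keep only strings that still contain an ASCII letter or digit.
    # This removes emoji-only strings, but would also remove names
    # written entirely in a non-Latin script.
    if not re.search(r"[A-Za-z0-9]", text):
        return None
    return text

sample = [
    "• West Chester",
    "the STEM School Highlands Ranch",
    "STEM School Highlands Ranch",
    "🙏",
    "https://t.co/Kodra668",
    "shot &amp",
]
cleaned = [name for name in map(normalize_entity, sample) if name is not None]
print(Counter(cleaned).most_common())
```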

#### School Shootings:

'STEM School Highlands Ranch' and 'UNC Charlotte' (University of North Carolina at Charlotte) both rank among the top entities. Given the subject of the dataset, these mentions most likely refer to the shootings that occurred at the two schools.

#### STEM School Highlands Ranch Shooting:

'STEM School Highlands Ranch' appears in 433 tweets, and a further 262 contain the variant 'the STEM School Highlands Ranch'. Related entries such as 'Highlands Ranch' (386) and the hashtag 'STEMshooting' (347) likely relate to the same incident.

Examining the text around these mentions would show what was being discussed about the incident.

#### UNC Charlotte Shooting:

'UNC Charlotte' appears in 341 tweets, alongside related entries such as 'Charlotte' (317), 'the University of North Carolina' (232), and 'UNCC' (214).

As with the STEM School entity, 'UNC Charlotte' most likely refers to the UNC Charlotte shooting, and the surrounding text can be examined in the same way.
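A simple starting point for looking at the text around an entity is a keyword-in-context lookup. The helper below is a minimal sketch: `context_windows`, the window width, and the sample sentences are assumptions made for illustration, and the sample texts are invented rather than taken from the dataset.

```python
import re

def context_windows(texts, entity, width=30):
    """Yield up to `width` characters on each side of every match of `entity`."""
    pattern = re.compile(re.escape(entity), flags=re.IGNORECASE)
    for text in texts:
        for match in pattern.finditer(text):
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(text))
            yield text[start:end].strip()

tweets = [
    "Thinking of everyone at UNC Charlotte tonight.",
    "No entity of interest in this one.",
]
snippets = list(context_windows(tweets, "UNC Charlotte", width=12))
print(snippets)
```

On the notebook's data the same helper could be called as `context_windows(df['Text'].dropna(), 'UNC Charlotte')`, assuming the `Text` column holds the tweet strings.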

In [44]:
# df.to_csv('School_Shooting_Data_NER.csv', index=False)

**********************


In [46]:
# Function to extract organizations, persons, and locations from text using spaCy
def extract_information(text):
    # Process the text with spaCy
    doc = nlp(text)

    # Lists that collect entity texts by type
    organizations = []
    persons = []
    locations = []

    # Sort each entity into a list based on its label
    # (only GPE is treated as a location here; LOC and FAC are ignored)
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            organizations.append(entity.text)
        elif entity.label_ == 'PERSON':
            persons.append(entity.text)
        elif entity.label_ == 'GPE':
            locations.append(entity.text)

    return organizations, persons, locations

# Apply 'extract_information' to the 'Text' column and store the three lists in new columns.
# Building one DataFrame from the tuples avoids the slow row-wise .apply(pd.Series).
extracted = df['Text'].apply(extract_information)
df[['org_extracted', 'person_extracted', 'location_extracted']] = pd.DataFrame(extracted.tolist(), index=df.index)

                            org_extracted           person_extracted  \
0                                      []                         []   
1      [the University of North Carolina]                         []   
2                                     [🙏]                         []   
3                                      []                         []   
4                                      []                         []   
...                                   ...                        ...   
12612                                  []                         []   
12613                                  []  [https://t.co/XOxtLcRkhF]   
12614                              [UNCC]                         []   
12615                                  []                         []   
12616                     [ChabadofPoway]                         []   

         location_extracted  
0                        []  
1               [Charlotte]  
2                        []  
3              
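The if/elif chain in `extract_information` grows with every label that is added. One alternative is a label-to-column mapping. The sketch below works on plain `(text, label)` pairs that stand in for `(entity.text, entity.label_)`, so it runs without a model; `LABEL_TO_COLUMN` and `group_entities` are hypothetical names.

```python
# Map each spaCy label of interest to the output column it feeds.
LABEL_TO_COLUMN = {
    "ORG": "org_extracted",
    "PERSON": "person_extracted",
    "GPE": "location_extracted",
}

def group_entities(pairs, label_to_column=LABEL_TO_COLUMN):
    """Group (text, label) pairs into one list per output column."""
    grouped = {column: [] for column in label_to_column.values()}
    for text, label in pairs:
        column = label_to_column.get(label)
        if column is not None:  # skip labels that are not being collected
            grouped[column].append(text)
    return grouped

# Stand-in for [(ent.text, ent.label_) for ent in doc.ents]
pairs = [
    ("the University of North Carolina", "ORG"),
    ("Charlotte", "GPE"),
    ("2019", "DATE"),
]
result = group_entities(pairs)
print(result)
```

Adding another entity type, or routing `LOC` and `FAC` into the location column, then only requires another entry in the mapping.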

In [48]:
df[['org_extracted', 'person_extracted', 'location_extracted']]

Unnamed: 0,org_extracted,person_extracted,location_extracted
0,[],[],[]
1,[the University of North Carolina],[],[Charlotte]
2,[🙏],[],[]
3,[],[],[US]
4,[],[],[]
...,...,...,...
12612,[],[],[]
12613,[],[https://t.co/XOxtLcRkhF],[]
12614,[UNCC],[],[]
12615,[],[],[]
