
# Named Entity Recognition (NER)
Named entity recognition (NER) is a natural language processing (NLP) task in which certain entities are located in a body of text. Entities, in general, are pieces of raw data; named entities are entities that typically refer to real-world objects, such as people's names, locations, companies, and titles. NER finds these entities and categorizes each one according to the type of object it names.

### Step 1:
Import spaCy and download its English language model.

*Note: we are downloading version 3.6.2 for this tutorial. The default is usually version 2.2.4.

In [None]:
import spacy

# Download the small English model (this tutorial uses version 3.6.2)
!python -m spacy download en_core_web_sm

In [None]:
# Load the spaCy language model

nlp = spacy.load('en_core_web_sm')

### Let's get a run-down of some basic named-entity categories. 

In [None]:
#@markdown Use the `for` loop shown below to iterate through each named-entity label and its description
for label in nlp.get_pipe('ner').labels:
    print(label, spacy.explain(label), '\n')

#@markdown *Note: `'\n'` adds a blank line after each entry so the output is easier to read. Disregard it for now.

CARDINAL Numerals that do not fall under another type 

DATE Absolute or relative dates or periods 

EVENT Named hurricanes, battles, wars, sports events, etc. 

FAC Buildings, airports, highways, bridges, etc. 

GPE Countries, cities, states 

LANGUAGE Any named language 

LAW Named documents made into laws. 

LOC Non-GPE locations, mountain ranges, bodies of water 

MONEY Monetary values, including unit 

NORP Nationalities or religious or political groups 

ORDINAL "first", "second", etc. 

ORG Companies, agencies, institutions, etc. 

PERCENT Percentage, including "%" 

PERSON People, including fictional 

PRODUCT Objects, vehicles, foods, etc. (not services) 

QUANTITY Measurements, as of weight or distance 

TIME Times smaller than a day 

WORK_OF_ART Titles of books, songs, etc. 
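
To look up a single label rather than the whole list, `spacy.explain` can be called directly with the label string. The following is a minimal sketch, assuming spaCy is installed (no language model is needed for this lookup). The string `NOT_A_LABEL` is made up and is used only to show what happens with an unknown term.

```python
import spacy

# Look up the description of one label from the list above.
gpe_description = spacy.explain("GPE")
print("GPE:", gpe_description)

# Unknown strings have no glossary entry, so the lookup returns None.
# "NOT_A_LABEL" is a hypothetical label used only for illustration.
unknown_description = spacy.explain("NOT_A_LABEL")
print("NOT_A_LABEL:", unknown_description)
```

Checking for `None` is a simple way to notice a misspelled label before relying on it elsewhere.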



### Step 2: 
Import the data (text).

*Note: for this tutorial, we will use data from scikit-learn's 20 Newsgroups dataset, loading only the posts from the `comp.sys.mac.hardware` newsgroup.


In [None]:
from sklearn.datasets import fetch_20newsgroups

# Load only the training posts from the comp.sys.mac.hardware newsgroup
news = fetch_20newsgroups(subset='train', categories=['comp.sys.mac.hardware'])


In [None]:
# Here is a preview of the data
news.data[0]

'From: Sammons@mailer.acns.fsu.edu (David Sammons)\nSubject: Re: Monitor turning off on its own\nOrganization: FSUACNS\nLines: 29\n\nIn article <gcohen.164.734712474@mailer.acns.fsu.edu>,\ngcohen@mailer.acns.fsu.edu (Gregory Cohen) wrote:\n> \n> In article <1993Apr13.142129.9491@rhrk.uni-kl.de> staudt@physik.uni-kl.de (Willi Staudt AG-Linder) writes:\n> >From: staudt@physik.uni-kl.de (Willi Staudt AG-Linder)\n> >Subject: Re: Monitor turning off on its own\n> >Date: Tue, 13 Apr 1993 14:21:29 GMT\n> >kayc@leland.Stanford.EDU (K C Ku) writes:\n> >|>\n> >|>I have a strange problem with my Apple 13" monitor which hopefully\n> >|>someone can shed some light on. \n> >|>\n> >|>I would be using my computer for 5 minutes and then the screen would\n> >|>go blank as if someone has switch the monitor off. After the screen\n> >|>went off, I would not be able to turn the monitor off even if I turn\n> >|>the power off and back on. I will have to let the monitor sit over\n> >|>night and it usually turn
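
As the preview shows, each post is a single string that begins with email-style header lines (`From:`, `Subject:`, and so on), followed by a blank line and then the message body. The sketch below separates the two using only the Python standard library. The `sample` string is a shortened excerpt of the preview above, and the variable names are illustrative rather than part of the tutorial's code.

```python
# A shortened excerpt of the post previewed above (header plus the first body line).
sample = (
    "From: Sammons@mailer.acns.fsu.edu (David Sammons)\n"
    "Subject: Re: Monitor turning off on its own\n"
    "Organization: FSUACNS\n"
    "Lines: 29\n"
    "\n"
    "In article <gcohen.164.734712474@mailer.acns.fsu.edu>,\n"
)

# The first blank line separates the header block from the message body.
header_block, _, body = sample.partition("\n\n")

# Turn each "Name: value" header line into a dictionary entry.
headers = {}
for line in header_block.split("\n"):
    name, _, value = line.partition(": ")
    headers[name] = value

print(headers["Subject"])
print(body)
```

If the headers are not wanted at all, `fetch_20newsgroups` also accepts a `remove` argument, for example `remove=('headers', 'footers', 'quotes')`, which strips them when the data is loaded.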

### Step 3:
Prepare the data and/or text as a spaCy doc.

*Note: passing the text to `nlp` runs it through spaCy's NLP processes and returns a doc, which stores the results, including the named entities.


In [None]:
# Process one post (index 500, a different post from the preview above) with the spaCy pipeline
doc = nlp(news.data[500])


In [None]:
#@markdown This prints the named entities found in the data (without their labels, for now)
print(doc.ents)


(Michael A. McGuire, 2, University of Tennessee Computing Center, VersaTerm Link, 27, Dave Hollinsworth, two, >1, SIMMS, 80ns, RAM, two, 132, 136, megs, 8, meg, 8, 4 megs, 4, 2, 4mb & 8mb, 4, 132mb, 650, Michael A. McGuire, UTCC - User Services)
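
Each item in `doc.ents` is a spaCy `Span`, which records the entity's text and where it sits in the original string. The sketch below illustrates this on a made-up sentence. To keep the result predictable and avoid depending on a downloaded model, it uses a blank English pipeline and assigns two entities by hand; the sentence, the variable names, and the hand-assigned entities are assumptions for illustration, not model output.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline only tokenizes; it has no trained NER component.
nlp_blank = spacy.blank("en")
example_doc = nlp_blank("Michael works at the University of Tennessee")

# Assign entities by hand (token indices: start inclusive, end exclusive).
# A label is supplied when creating each entity; reading labels is covered in Step 4.
example_doc.ents = [
    Span(example_doc, 0, 1, label="PERSON"),  # "Michael"
    Span(example_doc, 4, 7, label="ORG"),     # "University of Tennessee"
]

# Each entity exposes its text and its character offsets in the original string.
for ent in example_doc.ents:
    print(ent.text, ent.start_char, ent.end_char)
```

The character offsets make it possible to highlight or extract the entity from the raw text later on.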


### Step 4: 
To find out the type of each named entity in the doc, use the `.label_` attribute.


In [None]:
# Print each entity's text alongside its label.
for entity in doc.ents:
    print(entity.text, '--- ', entity.label_)

Michael A. McGuire ---  PERSON
2 ---  CARDINAL
University of Tennessee Computing Center ---  ORG
VersaTerm Link ---  PERSON
27 ---  CARDINAL
Dave Hollinsworth ---  PERSON
two ---  CARDINAL
>1 ---  DATE
SIMMS ---  ORG
80ns ---  ORDINAL
RAM ---  ORG
two ---  CARDINAL
132 ---  CARDINAL
136 ---  CARDINAL
megs ---  PERSON
8 ---  CARDINAL
meg ---  ORG
8 ---  CARDINAL
4 megs ---  MONEY
4 ---  CARDINAL
2 ---  CARDINAL
4mb & 8mb ---  QUANTITY
4 ---  CARDINAL
132mb ---  QUANTITY
650 ---  CARDINAL
Michael A. McGuire ---  PERSON
UTCC - User Services ---  ORG
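
To summarize a listing like the one above, the entities can be tallied by label or filtered down to the labels of interest. The sketch below uses only the standard library and a hard-coded list copied from the first ten lines of the output above, so it runs without spaCy; with a real doc, the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The variable names are illustrative.

```python
from collections import Counter

# The first ten (text, label) pairs from the output above.
pairs = [
    ("Michael A. McGuire", "PERSON"),
    ("2", "CARDINAL"),
    ("University of Tennessee Computing Center", "ORG"),
    ("VersaTerm Link", "PERSON"),
    ("27", "CARDINAL"),
    ("Dave Hollinsworth", "PERSON"),
    ("two", "CARDINAL"),
    (">1", "DATE"),
    ("SIMMS", "ORG"),
    ("80ns", "ORDINAL"),
]

# Count how many entities carry each label.
label_counts = Counter(label for _, label in pairs)
for label, count in label_counts.most_common():
    print(label, count)

# Keep only the entities tagged as people or organizations.
people_and_orgs = [text for text, label in pairs if label in {"PERSON", "ORG"}]
print(people_and_orgs)
```

Filtering by label does not correct the model's mistakes. In the output above, `80ns` is tagged ORDINAL and `megs` is tagged PERSON, so results from the small model are worth spot-checking.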
