# NER Workshop Exercise 1: Looking for Relations

**In this exercise we will use spaCy's named entity recognition (NER) algorithm to find relations between different entities in the Brown corpus.**

## Part 1: Basic entity extraction

**The Brown corpus is a well-known corpus of English developed at Brown University, containing text from many different sources. We will use entity extraction on a subset of the Brown corpus covering a few categories.**

**We can use spaCy to find entities in a basic sentence as follows:**

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
sample_sentence = "The White House is located in Washington D.C."
sample_doc = nlp(sample_sentence)
print([(ent.text, ent.label_) for ent in sample_doc.ents])

[('The White House', 'ORG'), ('Washington D.C.', 'GPE')]


**To see what an entity label means:**

In [2]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

**And to display the entities in a document using displaCy:**

In [3]:
from spacy import displacy
displacy.render(sample_doc, style='ent', jupyter = True)

**Now let's load sentences from the Brown corpus for a few categories:**

In [4]:
import nltk
import pandas as pd
from collections import Counter
from itertools import chain


pd.set_option('display.max_colwidth', None)
nltk.download('brown')
from nltk.corpus import brown
sentences = brown.sents(categories = ['news', 'editorial', 'reviews'])

[nltk_data] Downloading package brown to /home/gal/nltk_data...
[nltk_data]   Package brown is already up-to-date!


### Questions:
####  1. Use displaCy to display the entities in the first three sentences of this corpus.

**source link:**<br>https://spacy.io/usage/visualizers#ent

In [5]:
# Each Brown sentence is a list of tokens; join them back into one string for spaCy.
sent = [' '.join(tokens) for tokens in sentences]

for s in sent[:3]:
    doc = nlp(s)
    displacy.render(doc, style='ent', jupyter = True)

####  What are some entities that are tagged, and what do their entity labels mean?


**source link:**<br>https://spacy.io/api/annotation#named-entities

`Ivan Allen Jr. PERSON ` - People, including fictional.
<br>`September-October DATE` - Absolute or relative dates or periods.
<br>`the City Executive Committee ORG` - Companies, agencies, institutions, etc.
<br>`the City of Atlanta '' GPE` - Countries, cities, states.


####  2. What are the five most common people mentioned in the corpus for these categories? <br>(Hint: See [this page](https://spacy.io/api/annotation#named-entities) under "Named Entity Recognition")

In [6]:
persons = []
buildings = []

for s in sent:
    doc = nlp(s)
    persons.append([X.text for X in doc.ents if X.label_ == 'PERSON'])
    buildings.append([X.text for X in doc.ents if X.label_ == 'FAC'])

Counter(chain.from_iterable(persons)).most_common(5)

[('Kennedy', 113),
 ('Khrushchev', 69),
 ('Maris', 29),
 ('Eisenhower', 27),
 ('Podger', 22)]

#### What are the five most common buildings? 

In [7]:
Counter(chain.from_iterable(buildings)).most_common(5)

[('Broadway', 11),
 ('the White House', 6),
 ('Pennsylvania Avenue', 4),
 ('Capitol', 4),
 ('Lewisohn Stadium', 4)]
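
**The counting idiom used for people and buildings above can be seen in isolation on a small input. This is a minimal sketch: the `toy_persons` list is made-up data for illustration, not output from the Brown corpus.**

```python
from collections import Counter
from itertools import chain

# Hypothetical per-sentence entity lists (made-up data, one inner list per sentence).
toy_persons = [
    ["Kennedy", "Khrushchev"],
    [],                                   # a sentence with no PERSON entities
    ["Kennedy"],
    ["Khrushchev", "Maris", "Kennedy"],
]

# chain.from_iterable flattens the list of lists into a single sequence of names.
flat_names = list(chain.from_iterable(toy_persons))

# Counter tallies each name; most_common(n) returns the n names with the highest counts.
top_two = Counter(flat_names).most_common(2)
print(top_two)  # [('Kennedy', 3), ('Khrushchev', 2)]
```

Sentences with no matching entities contribute an empty list, which the flattening step simply skips.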

## Part 2: Finding relations

**Now we will look at pairs of entities in sentences in the corpus and try to identify relations between them.**

### Questions:
#### 3. We would like to know where organizations are located.<br> Try to find all occurrences of organization-location pairs where the organization (ORG) comes before the location (GPE) in the sentence, with no other entity in between, and the word "in" appears somewhere between them. <br>Put these in a pandas DataFrame with three columns: ORG (organization name), GPE (location name), and context (the words between the organization and the location).
####  Hint: use entity.start and entity.end to get the starting and ending indices for an entity in the sentence.
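
**The hint relies on spaCy entity spans exposing `start` and `end` as token indices, with `end` exclusive. The sketch below mimics that with a plain-Python stand-in: `ToySpan` is a hypothetical class, not part of spaCy, and the token list is a made-up sentence.**

```python
from dataclasses import dataclass


@dataclass
class ToySpan:
    """Hypothetical stand-in for a spaCy entity span (not a spaCy class)."""
    text: str
    label: str
    start: int  # index of the first token of the entity
    end: int    # index one past the last token of the entity (exclusive)


# Made-up tokenized sentence for illustration.
tokens = ["The", "White", "House", "aids", "in", "Washington", "."]
org = ToySpan("The White House", "ORG", 0, 3)
gpe = ToySpan("Washington", "GPE", 5, 6)

# Tokens strictly between the two entities run from org.end up to gpe.start.
between = tokens[org.end:gpe.start]
print(between)          # ['aids', 'in']
print("in" in between)  # True
```

Because `end` is exclusive, the slice needs no `+ 1` or `- 1` adjustment on either side.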


In [8]:
def between_2_entities(lab_1, lab_2, prep):
    """Find adjacent entity pairs (lab_1 followed by lab_2) with `prep` between them."""
    first, second, context = [], [], []

    for s in sent:
        doc = nlp(s)
        tokens = [w.text for w in doc]

        # Consecutive entries in doc.ents have no other entity between them.
        for ent_1, ent_2 in zip(doc.ents, doc.ents[1:]):
            if ent_1.label_ == lab_1 and ent_2.label_ == lab_2:
                # ent.end is exclusive, so this slice holds the tokens strictly in between.
                between = tokens[ent_1.end:ent_2.start]
                if prep in between:
                    first.append(ent_1.text)
                    second.append(ent_2.text)
                    context.append(' '.join(between))

    # Column names follow the labels that were passed in.
    return pd.DataFrame(list(zip(first, second, context)),
                        columns=[lab_1, lab_2, 'context'])


In [9]:
df = between_2_entities('ORG', 'GPE', 'in')
df

| | ORG | GPE | context |
|---|---|---|---|
| 0 | the State Welfare Department | Fulton County | \` \` has seen fit to distribute these funds through the welfare departments of all the counties in the state with the exception of |
| 1 | ADC | Cook county | program in |
| 2 | White House | Washington | aids in |
| 3 | NATO | Angola | committee has been set up so that in the future such topics as |
| 4 | State Department | Laos | officials explain , now is mainly interested in setting up an international inspection system which will prevent |
| ... | ... | ... | ... |
| 85 | Mijbil | Iraq | , of whom there are a fine series of photographs and drawings in the book , but to the author who has catalogued the saga of a frightened otter cub 's journey by plane from |
| 86 | Negro | the United States | listeners -- an audience which , in |
| 87 | Negro | America | news staffs in |
| 88 | St. Torpetius | St. Tropez | that still persists in |


####  How many of these are there?

In [10]:
len(df)

90

  
####  4. How much does this data tell us about what organizations are located where? 

It tells us which country, city, or area within a city an organization is associated with, and sometimes about a relocation of the organization or its staff, or about events involving the organization in that place.

#### In what cases can we be more or less certain?


We can be more certain when the context contains a verb that relates the ORG entity to the place, and less certain when it contains only other parts of speech.
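
One rough way to separate the more certain rows from the less certain ones is to check whether "in" is the last word before the location. This is an assumed heuristic added for illustration, not part of the exercise's intended answer; `ends_with_prep` is a hypothetical helper, and the rows are copied from the ORG/GPE table above.

```python
# Rows copied from the ORG/GPE table above: (ORG, GPE, context).
rows = [
    ("ADC", "Cook county", "program in"),
    ("White House", "Washington", "aids in"),
    ("NATO", "Angola", "committee has been set up so that in the future such topics as"),
    ("Negro", "America", "news staffs in"),
    ("St. Torpetius", "St. Tropez", "that still persists in"),
]


def ends_with_prep(context, prep="in"):
    """Assumed heuristic: the preposition is the last word before the second entity."""
    words = context.split()
    return bool(words) and words[-1] == prep


# Keep only the pairs where "in" directly precedes the location.
likely = [(org, gpe) for org, gpe, context in rows if ends_with_prep(context)]
print(likely)
```

The filter drops the NATO/Angola row, where "in" belongs to "in the future", but it still keeps pairs such as Negro/America whose first entity is not really an organization, so the result would still need manual review.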

  
####  5. What is another example of a pair of entity labels and context word that would give us useful information?

It can be `'DATE'-'EVENT'-'in'`, `'PERSON'-'ORG'-'in'`, `'ORG'-'EVENT'-'in'`, `'DATE'-'ORG'-'on'` etc.

 #### Try running your code to find this new relation.

In [11]:
df2 = between_2_entities('DATE', 'EVENT', 'in')
df2

| | DATE | EVENT | context |
|---|---|---|---|
| 0 | today | the Lyle Elliott Funeral Home , 31730 Mound | in closed caskets at |
| 1 | annual | New Year '' | report in the form of a \` \` happy , warless |


In [12]:
df3 = between_2_entities('PERSON', 'ORG', 'in')
df3

| | PERSON | ORG | context |
|---|---|---|---|
| 0 | Pearl Williams Hartsfield | Fulton Superior Court | , in |
| 1 | Kennedy | House | program \` \` a mighty fine thing '' , but made no prediction on its fate in the |
| 2 | -- William J. Seidel | the Department of Conservation and Economic Development | , state fire warden in |
| 3 | Sam Rayburn's | the Rules Committee | forces in |
| 4 | Harold V. Varani | the Department of Public Property | , former director of architecture and engineering in |
| 5 | Jack Fisher | Oriole | , the big righthander who figures to be in the middle of |
| 6 | Whitey Herzog | Orioles | , performing in right as the |
| 7 | Pete Ward | House | was sent in for |
| 8 | Richard Stafford | Raiders | , who is undergoing treatment for a leg injury suffered in the |
| 9 | End Gene Raesz | Owl | , who broke a hand in the |


In [13]:
df4 = between_2_entities('ORG', 'EVENT', 'in')
df4

| | ORG | EVENT | context |
|---|---|---|---|
| 0 | Snead | the Canada Cup | will be saluted as the winning team in |
| 1 | the State Department | World War 2 | in |


In [14]:
df5 = between_2_entities('DATE', 'ORG', 'on')
df5

| | DATE | ORG | context |
|---|---|---|---|
| 0 | last November | Legislature | rejected a constitutional amendment to allow legislators to vote on pay raises for future |
| 1 | last year | Capitol hill | as a senator , a fight on |
| 2 | four-year | the national committee | term on |
| 3 | nearly eighteen months | the United Nations | of work on the question of the organization of |
| 4 | Tuesday | Frankford | that the bids on the |
| 5 | 1 | Anne Arundel General Hospital | , was pronounced dead on arrival at |
| 6 | yesterday | the First Federal Savings and Loan association | on a suppressed federal warrant charging him with embezzling an undetermined amount of money from |
| 7 | 1917 | the State Board of Medical Examiners | , and served on |
| 8 | 25 years ago | the House of Representatives | and now on the President 's staff as liaison representative with |
| 9 | 20 years | the Rules Committee | he has enjoyed his power on |
