# Homework 2

In this homework you will be performing some analysis with entity extraction. In particular, you will be looking at the Reuters corpus and trying to construct entity profiles of persons, organizations, and locations. This will require you to iterate through the documents in the Reuters corpus, parse them appropriately, extract entities, and then store the entities along with some surrounding text. Additionally, you will be looking for mechanisms to identify potential relationships between persons and locations.

Throughout, you will need NLTK to access the corpus. You will also need an entity extraction system, which can be either NLTK or spaCy. I strongly suggest using spaCy for the entity extraction portion of this assignment.

The basic idea is to build a knowledge base around the entities you extract from the Reuters corpus. Normally, this would be a first step toward modeling things such as entity resolution across documents. You could also use it as a first step toward analyzing the sentiment expressed about particular entities, for example people expressing dissatisfaction with a restaurant or brand.

Follow the steps below and read the comments carefully; they describe the tasks your code will need to perform.

I would expect that some of you might be able to reuse parts of this code for your project...

## Step 1) Import necessary libraries 

In [127]:
# This will be the corpus we work from
from nltk.corpus import reuters, stopwords
from nltk.tokenize import TreebankWordTokenizer
stop_words = stopwords.words('english')
treebank_tokenizer = TreebankWordTokenizer()


In [2]:
# I will assume you are using spaCy as the default entity recognizer.
import spacy
# Note: the model name depends on your spaCy version. Older installs accept the short name "en";
# spaCy 3 and later require the full package name (for example "en_core_web_sm").
# If you run into issues here, check the spaCy model page at https://spacy.io/usage/models
nlp = spacy.load("en")

## Step 2) Fill in the following function to extract the entity, document id, and relevant sentence text from the input

In [74]:
def extract_entities(doc_id, doc_text):
    analyzed_doc = nlp(doc_text)
    
    # These three dictionaries will hold all the persons, organizations, and locations found in a document.
    # Add each entity you encounter to the matching dictionary: use the entity text as the key and
    # the document id plus the sentence text as the value. Because an entity can occur multiple times
    # in a document, the value should really be a list of (document id, sentence text) pairs
    # (or something similar).
    doc_persons = {}
    doc_organizations = {}
    doc_locations = {}
    
    for entity in analyzed_doc.ents:
        if entity.text.strip() != "":
#             # The .label_ property will provide information on the type of entity tagged
#             print(" -> ", entity.label_)
#             # The .text property will display the actual text of the entity in the text
#             print("->", entity.text.strip(), "<-")
#             # You can also access the sentence that the entity is contained in by using the .sent property
#             # inside the sentence you can then use the .text property
#             print("->", entity.sent.text, "<-")
            
            
            # one way to represent the document id and the sentence text would be with a tuple
            # thus, you could do:
            relevant_sentence = (doc_id, entity.sent.text)
#             print('relevant:', relevant_sentence)
            
            # add the relevant document id and sentence to the entity record
            if entity.label_ == 'PERSON':
                doc_persons.setdefault(entity.text.strip(), []).append(relevant_sentence)
            elif entity.label_ == 'ORG':
                doc_organizations.setdefault(entity.text.strip(), []).append(relevant_sentence)
            elif entity.label_ == 'LOC':
                doc_locations.setdefault(entity.text.strip(), []).append(relevant_sentence)
            
            
    return doc_persons, doc_organizations, doc_locations
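
The record structure described in the comments above can be tried out without running spaCy. The following is a minimal sketch, assuming hypothetical `(entity text, label, sentence text)` triples stand in for what `entity.text`, `entity.label_`, and `entity.sent.text` would provide. The function `group_mentions`, the sample sentences, and the document id are illustrative and not part of the assignment.

```python
# Hypothetical stand-in for spaCy output: (entity text, label, sentence text).
sample_mentions = [
    ("Reagan", "PERSON", "Reagan said trade talks would continue."),
    ("OPEC", "ORG", "OPEC said output would be cut."),
    ("Reagan", "PERSON", "The bill was sent to Reagan."),
    ("North Sea", "LOC", "North Sea oil output rose."),
]


def group_mentions(doc_id, mentions):
    """Group mentions by label, then by entity text, as lists of (doc_id, sentence)."""
    records = {"PERSON": {}, "ORG": {}, "LOC": {}}
    for text, label, sentence in mentions:
        text = text.strip()
        # Skip empty entity text and labels we do not track.
        if text and label in records:
            records[label].setdefault(text, []).append((doc_id, sentence))
    return records["PERSON"], records["ORG"], records["LOC"]


# "test/14826" is used here only as an illustrative document id.
persons, organizations, locations = group_mentions("test/14826", sample_mentions)
print(persons["Reagan"])
```

Each value is a list, so an entity mentioned twice in one document simply gets two `(doc_id, sentence)` tuples under the same key.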
        

## Step 3) Adjust the following code to run the document entity extraction function
## Also, add the entity records you are constructing to your master list of entities
## Note: for the full submission, run across all the Reuters documents

In [75]:
num_docs = len(reuters.fileids())
#  this has a large number of files... 
# you might wish to limit the number of documents you use while developing your technique 
# ex. reuters.fileids()[0:25]

# these three dictionaries will collect all the references to each entity across documents
combined_persons = {}
combined_organizations = {}
combined_locations = {}

# this iterates over every document, as required for the real submission;
# slice the list (ex. reuters.fileids()[0:25]) while developing
for doc_id in reuters.fileids(): 
    # reuters.open(doc_id).read() returns the raw text of the news article, which is passed to the extractor
    persons, organizations, locations = extract_entities(doc_id, reuters.open(doc_id).read())
    
    # you will need to write something here to put the persons and locations found in a document into the 
    # combined_persons, combined_organizations, and combined_locations dictionaries.
    # here you will need to consider how to extend the values already in the dictionaries
    # maybe something like:
    # for person in persons.keys():
    #     if person not in combined_persons.keys():
    #         --- add a person key to the combined persons list
    #     now here you can add the person's document ids and sentence texts to the dictionary value
    
    for person in persons.keys():
        if person not in combined_persons.keys():
            combined_persons[person] = persons[person]
        else:
            combined_persons[person] += persons[person]
            
    for org in organizations.keys():
        if org not in combined_organizations.keys():
            combined_organizations[org] = organizations[org]
        else:
            combined_organizations[org] += organizations[org]
            
    for loc in locations.keys():
        if loc not in combined_locations.keys():
            combined_locations[loc] = locations[loc]
        else:
            combined_locations[loc] += locations[loc]
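
The three merge loops above follow the same pattern, so they could be folded into one helper. This is a sketch of one possible approach, assuming the combined dictionaries are created as `collections.defaultdict(list)`; `merge_records` and the sample records are hypothetical names used only for illustration.

```python
from collections import defaultdict


def merge_records(combined, per_document):
    """Append each entity's (doc_id, sentence) records onto the combined dictionary."""
    for entity, records in per_document.items():
        # extend() copies the items, so the per-document list is never shared.
        combined[entity].extend(records)


combined_demo = defaultdict(list)  # missing keys start as an empty list
merge_records(combined_demo, {"Reagan": [("doc1", "Reagan said no.")]})
merge_records(combined_demo, {"Reagan": [("doc2", "Reagan agreed.")],
                              "Baker": [("doc2", "Baker said yes.")]})
print(dict(combined_demo))
```

With a `defaultdict`, the "is this key already present?" check disappears, and the same helper can be called once each for persons, organizations, and locations.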
    

## Step 4) Fill in the following method to look through the content of an entity dictionary to determine the most popular based on number of mentions

In [90]:
# now that we have the text associated with the entities, 
# you will want to focus on the 500 top entities in each category
# Identify the top 500 entities by the count of their occurrences
def find_most_popular_entities(entity_dictionary):
    # sort through the entities in the dictionary by the number of sentences
    list_of_dictionary_keys_with_most_mentions = sorted(entity_dictionary, 
                                                        key=lambda x: len(entity_dictionary[x]), 
                                                        reverse=True)
    
    return list_of_dictionary_keys_with_most_mentions[:500]




In [76]:
dic = {'A': [1,2], 'B': [2,3,4], 'C': [0], 'D': [1,2,3,4,4]}

In [134]:
sorted(dic, key=lambda x: len(dic[x]), reverse=True)

['D', 'B', 'A', 'C']
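
When only the top few entries are needed, sorting the whole dictionary is not strictly required. As a hedged alternative to the `sorted(...)[:500]` approach, the sketch below uses the standard library's `heapq.nlargest`; `top_entities` is a hypothetical name, and the toy dictionary mirrors the one in the cell above.

```python
import heapq


def top_entities(entity_dictionary, n=500):
    """Return the n keys with the most mention records, most mentioned first."""
    return heapq.nlargest(n, entity_dictionary,
                          key=lambda name: len(entity_dictionary[name]))


demo = {"A": [1, 2], "B": [2, 3, 4], "C": [0], "D": [1, 2, 3, 4, 4]}
print(top_entities(demo, n=2))
```

For a dictionary of this size either approach is fine; the difference only matters when the dictionary is much larger than `n`.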

## Step 5) Now invoke your top entity mention finder

In [101]:
# simply get the top persons, locations, and organizations
top_persons = find_most_popular_entities(combined_persons)
top_locations = find_most_popular_entities(combined_locations)
top_organizations = find_most_popular_entities(combined_organizations)

In [92]:
len(top_persons)

500

In [93]:
len(top_locations)

500

In [102]:
len(top_organizations)

500

## Step 6) Analyze the most popular entities to determine what words they most frequently occur with

In [150]:
# use these dictionaries to store the most frequent terms associated with the entities
person_most_popular_terms = {}
organization_most_popular_terms = {}
location_most_popular_terms = {}

def find_most_frequent_term(top_persons, combined_persons):
    person_most_popular_terms = {}
    # finally, now find the most frequent tokens associated with the entities
    for person in top_persons:
        # fill this dictionary with all the words in the context of the person entity
        person_token_dictionary = {}
        sentences_words = [treebank_tokenizer.tokenize(sentence[1]) for sentence in combined_persons[person]]
        all_tokens = [word for sentence in sentences_words for word in sentence]
        for word in all_tokens:
            # keep alphabetic tokens only (drops numbers and punctuation), and skip stop words
            # and any token that appears as a substring of the entity text
            if word.isalpha() and (word.lower() not in stop_words) and (word not in person):
                person_token_dictionary.setdefault(word, 0)
                person_token_dictionary[word] += 1
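        # max() raises ValueError when no context words survive the filters; skip those entities
        try:
            person_most_popular_terms[person] = max(person_token_dictionary, 
                                                    key=lambda x: person_token_dictionary[x])
        except ValueError:
            pass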
        
    return person_most_popular_terms

person_most_popular_terms = find_most_frequent_term(top_persons, combined_persons)

# finally, now find the most frequent tokens associated with the entities
organization_most_popular_terms = find_most_frequent_term(top_organizations, combined_organizations)
    
location_most_popular_terms = find_most_frequent_term(top_locations, combined_locations)
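
The counting step can also be written with `collections.Counter`. The sketch below is one possible variant under stated assumptions, not the notebook's method: it uses a simple regular expression instead of the Treebank tokenizer, a tiny hypothetical stop word set instead of NLTK's list, and it excludes the entity's own words by whole-word, case-insensitive comparison rather than the substring test used above, so its results may differ.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"the", "a", "of", "to", "that", "would", "in"}


def most_frequent_term(entity, records, stop_words=DEMO_STOP_WORDS):
    """Return the most common context word for an entity, or None if there is none."""
    entity_words = {w.lower() for w in entity.split()}
    counts = Counter()
    for _doc_id, sentence in records:
        # Alphabetic runs only, which drops numbers and punctuation.
        for word in re.findall(r"[A-Za-z]+", sentence):
            lowered = word.lower()
            if lowered not in stop_words and lowered not in entity_words:
                counts[word] += 1
    return counts.most_common(1)[0][0] if counts else None


demo_records = [
    ("doc1", "James Baker said the Treasury would act."),
    ("doc2", "Baker said rates were stable."),
]
print(most_frequent_term("James Baker", demo_records))
```

Returning `None` for an entity with no surviving context words makes the empty case explicit instead of relying on an exception.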

In [151]:
person_most_popular_terms

{'Avg': 'vs',
 'Reagan': 'said',
 'Baker': 'said',
 'Lawson': 'said',
 'Yeutter': 'said',
 'James Baker': 'Treasury',
 'Poehl': 'said',
 'Ecus': 'European',
 'Stoltenberg': 'said',
 'Baldrige': 'said',
 'Purolator': 'Hutton',
 'Volcker': 'said',
 'Clayton Yeutter': 'Trade',
 'Brown': 'AFG',
 'Louvre': 'said',
 'Kiichi Miyazawa': 'Finance',
 'Johnson': 'said',
 'Herrington': 'said',
 'Yasuhiro Nakasone': 'Prime',
 'Rotterdam': 'said',
 '1986/87': 'tonnes',
 'Dart': 'said',
 'Lyng': 'said',
 'Bass': 'said',
 'Sosnoff': 'said',
 'Gencorp': 'said',
 'Richard Lyng': 'Agriculture',
 'Satoshi Sumita': 'Japan',
 'Prev Wk': 'WK',
 'Nazer': 'said',
 'Banks': 'billion',
 'Williams': 'mln',
 'Clayton': 'Yeutter',
 'Subroto': 'said',
 'Nigel Lawson': 'Chancellor',
 'Redstone': 'said',
 'Karl Otto Poehl': 'Bundesbank',
 'Wagner': 'AFG',
 'Wendy': 'said',
 'Icahn': 'USAir',
 'Heller': 'said',
 'Edouard Balladur': 'Minister',
 'Dome': 'said',
 'REAGAN': 'SAYS',
 'Caspar Weinberger': 'Secretary',
 'Chi

In [152]:
organization_most_popular_terms

{'mln': 'dlrs',
 'pct': 'said',
 'cts': 'vs',
 'QTR': 'NET',
 'EC': 'said',
 'Reuters': 'told',
 'OPEC': 'said',
 'USDA': 'said',
 'MLN': 'DLRS',
 'Fed': 'said',
 'QTR NET': 'lt',
 'Bundesbank': 'said',
 'FED': 'SAYS',
 'PCT': 'RATE',
 'Treasury': 'said',
 'GATT': 'trade',
 'CTS': 'VS',
 'Congress': 'said',
 'Bank': 'stg',
 'USAir': 'said',
 'the Securities and Exchange Commission': 'said',
 'ICO': 'said',
 'TWA': 'USAir',
 'OECD': 'said',
 'The Bank of England': 'said',
 'SEC': 'said',
 'GenCorp': 'said',
 'CCC': 'said',
 'qtr': 'dlrs',
 'House': 'trade',
 'Co': 'said',
 'Senate': 'said',
 'BP': 'said',
 'European Community': 'EC',
 'treasury': 'bills',
 'Chrysler': 'Renault',
 'Oper': 'shr',
 'Bank of Japan': 'yen',
 'CSR': 'said',
 'The U.S. Agriculture Department': 'tonnes',
 'Lyng': 'said',
 'Borg-Warner': 'said',
 'GAF': 'said',
 'Shearson': 'said',
 'MITI': 'said',
 'EMS': 'said',
 'Texaco': 'said',
 'Nakasone': 'said',
 'NET': 'QTR',
 'The Federal Reserve': 'said',
 'Commission

In [153]:
location_most_popular_terms

{'Gulf': 'said',
 'Europe': 'said',
 'Africa': 'said',
 'Asia': 'said',
 'North Sea': 'oil',
 'West': 'said',
 'North America': 'said',
 'the Middle East': 'said',
 'the\n  Gulf': 'said',
 'Middle East': 'oil',
 'Western Europe': 'mln',
 'Mediterranean': 'said',
 'the Strait of Hormuz': 'said',
 'Midwest': 'corn',
 'the Gulf of Mexico': 'said',
 'New England': 'dlrs',
 '1986/87': 'mln',
 'Mideast': 'oil',
 'the North Sea': 'said',
 'Latin America': 'dlrs',
 'South America': 'said',
 'the Far East': 'mln',
 'Atlantic': 'said',
 'Pacific': 'pipeline',
 'West Texas': 'said',
 'Western': 'said',
 'the U.S. Gulf': 'dlrs',
 'Far East': 'dlrs',
 'FOB Gulf': 'dlrs',
 'West Texas Sour': 'dlrs',
 'South': 'said',
 'the Aegean Sea': 'Turkish',
 'East': 'West',
 'Southeast Asia': 'said',
 'Southern Pacific': 'Santa',
 'Southwest': 'said',
 'Prudhoe Bay': 'said',
 'Persian Gulf': 'said',
 'Eastern Europe': 'mln',
 'Northeast': 'said',
 'EUROPE': 'JAPAN',
 'the Mideast Gulf': 'said',
 'the Red River

## Step 7) Present your results of the most popular entities and their associated terms

In [None]:
# present your results

In [172]:
import pandas as pd
def show_term_freqency(terms):
    a = pd.DataFrame(terms.values())
    a['count'] = 1
    a_sorted = a.groupby(by=0).count().sort_values(by='count', ascending=False)
    return a_sorted
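
The same tally can be produced without pandas. This is a minimal sketch using `collections.Counter`; `term_frequency` and the three-entry dictionary are hypothetical and only mimic the shape of the `*_most_popular_terms` dictionaries.

```python
from collections import Counter


def term_frequency(terms, n=10):
    """Count how many entities share each top term; return the n most common."""
    return Counter(terms.values()).most_common(n)


demo_terms = {"Reagan": "said", "Baker": "said", "James Baker": "Treasury"}
print(term_frequency(demo_terms))
```

The result is a list of `(term, count)` pairs, which is convenient for printing or for building a DataFrame afterwards if a table is preferred.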

In [173]:
show_term_freqency(person_most_popular_terms).head(10)

| term | count |
|---|---|
| said | 238 |
| dlrs | 11 |
| pct | 10 |
| mln | 9 |
| President | 5 |
| trade | 5 |
| Prime | 5 |
| chairman | 5 |
| Minister | 4 |
| Secretary | 4 |


In [174]:
show_term_freqency(organization_most_popular_terms).head(10)

| term | count |
|---|---|
| said | 247 |
| pct | 18 |
| dlrs | 17 |
| mln | 16 |
| lt | 9 |
| billion | 8 |
| EC | 6 |
| shares | 6 |
| Secretary | 4 |
| told | 4 |


In [175]:
show_term_freqency(location_most_popular_terms).head(10)

| term | count |
|---|---|
| said | 49 |
| mln | 21 |
| dlrs | 19 |
| company | 12 |
| oil | 11 |
| pct | 6 |
| pipeline | 5 |
| countries | 3 |
| crude | 3 |
| lt | 3 |


### My Findings
The verb "said" is the most frequent term for person, organization, and location entities alike. It dominates for persons and organizations (it is the top term for 238 and 247 of the 500 entities, respectively) and is far less dominant for locations (49), which suggests that a sentence containing "said" most likely mentions a person or an organization. For person entities, frequent terms also include titles such as "President", "chairman", "Minister", and "Secretary". For organization entities, there are frequent business words such as "shares". For location entities, frequent words include "countries" and "oil".

## Extra Credit

There are several extra credit options for this assignment. 
* The first would be to determine which persons, organizations, and locations most frequently occur in the same sentences.
* Another task would be to attempt to resolve different forms of the same name for each person and location, for example "George Bush" and "Bush" inside the same document.
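
Both extra credit options can be prototyped on top of the `(doc_id, sentence)` records built earlier. The sketch below is a naive starting point under stated assumptions, not a reference solution: `cooccurrence_counts` treats two entities as co-occurring when they share an identical `(doc_id, sentence)` tuple, and `resolve_short_names` maps a name to the longest name in the same list that ends with it. Both function names and all sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(*entity_dictionaries):
    """Count entity pairs whose records share the same (doc_id, sentence) tuple."""
    entities_by_sentence = {}
    for dictionary in entity_dictionaries:
        for entity, records in dictionary.items():
            for record in records:
                entities_by_sentence.setdefault(record, set()).add(entity)
    pairs = Counter()
    for entities in entities_by_sentence.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(entities), 2):
            pairs[pair] += 1
    return pairs


def resolve_short_names(names):
    """Map each name to the longest name in `names` that ends with it as whole words."""
    resolved = {}
    for name in names:
        words = name.split()
        candidates = [other for other in names
                      if other.split()[-len(words):] == words]
        resolved[name] = max(candidates, key=len)
    return resolved


# Hypothetical records in the same shape as the combined dictionaries.
sentence = ("doc1", "Reagan met OPEC ministers in Europe.")
persons_demo = {"Reagan": [sentence, ("doc2", "Reagan spoke to Congress.")]}
organizations_demo = {"OPEC": [sentence],
                      "Congress": [("doc2", "Reagan spoke to Congress.")]}
locations_demo = {"Europe": [sentence]}

pair_counts = cooccurrence_counts(persons_demo, organizations_demo, locations_demo)
print(pair_counts.most_common(3))
print(resolve_short_names(["George Bush", "Bush", "Baker"]))
```

The name heuristic is deliberately simple: it would wrongly merge two different people who share a surname within one document, so it is best applied per document and checked by hand.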