# Test Task

# Analyze the organization name recognition system and dataset annotations



### The Road Ahead

Here, break the notebook into separate steps.

* [Step 1](#step1): Load Enron Dataset
* [Step 2](#step2): Analyze spacy model performace on Enron file
* [Step 3](#step3): Load OntoNotes Dataset
* [Step 4](#step4): Analyze spacy model performace on OntoSentences file

---
<a id='step1'></a>
## Step 1: Load Enron Dataset

### Load Enron Dataset
Open and load EnronSentences.json file
The type of the data is list and length = 310

In [1]:
import json

# Open JSON file
with open('EnronSentences.json') as json_file:
    Enron_data = json.load(json_file)
    
    print(type(Enron_data))
    print(len(Enron_data))

<class 'list'>
310


### Take the first email as an example.
#### Email is read in dict. 

In [2]:
# print an email and check content
print(Enron_data[4])
print("Enron_data[4] type:",type(Enron_data[4]))

{'text': "I have made this correction in EPMI's CEC filing and resubmitted it.", 'machine_entities': [[31, 35, 'ORG', 'machine'], [38, 41, 'ORG', 'machine']], 'human_entities': {'44357992': [[31, 35, 'ORG', 'machine']]}}
Enron_data[4] type: <class 'dict'>


#### Check keys and its value.

In [3]:
# check each dictionary keys and value
print(Enron_data[4].keys())
input_text = Enron_data[4]['text']
output = Enron_data[4]['machine_entities']
print(input_text)
print(type(output), " ", output)
print(output[0][:3])
print(output[0][-1])

dict_keys(['text', 'machine_entities', 'human_entities'])
I have made this correction in EPMI's CEC filing and resubmitted it.
<class 'list'>   [[31, 35, 'ORG', 'machine'], [38, 41, 'ORG', 'machine']]
[31, 35, 'ORG']
machine


In [4]:
machine_output = Enron_data[4]['machine_entities']
human_output = Enron_data[0]['human_entities']
print(machine_output)
print(human_output)
#print(human_output.values())
#list(human_output.values())[0]

[[31, 35, 'ORG', 'machine'], [38, 41, 'ORG', 'machine']]
{}


<a id='step2'></a>
## Step 2: Analyze spacy model performace on Enron file


### Compare the annotation by machine and human

#### First to figur out how many emails were edited by human

In [5]:
def read_each_email(n):
    """
    Read each email and return:
        - machine annotated output
        - human annotated output
    """
    output_machine = Enron_data[n]['machine_entities']
    output_human = Enron_data[n]['human_entities']
    result_human = []
    
    if output_human != {}:
        
        output_human_dict = output_human
        result_human = list(output_human_dict.values())[0]

    return output_machine, result_human

In [6]:
# Figur out how many emails are edited by human.
no_change_email = 0
correct_index = []
edited_index = []

for i in range(len(Enron_data)):
    # get machine output and human output
    output_machine, result_human = read_each_email(i)
    if result_human == []:
        no_change_email += 1
        correct_index.append(i)
    else:
        edited_index.append(i)
        
edited_number = len(Enron_data) - no_change_email
print("There are: ", edited_number, " are corrected by human.\n")
print(edited_index)

There are:  59  are corrected by human.

[4, 8, 24, 29, 30, 33, 40, 52, 65, 72, 82, 104, 105, 118, 119, 121, 125, 131, 132, 135, 139, 140, 153, 154, 157, 165, 168, 169, 170, 172, 173, 174, 176, 177, 178, 181, 188, 196, 197, 218, 230, 231, 244, 245, 252, 256, 269, 272, 287, 293, 297, 299, 300, 301, 302, 303, 307, 308, 309]


#### Focused on these email corrected by human

Figur out what were considered as false negative name.

It is important to know which names spaCy considered it not a ORG name but human annotated as ORG.

It is also good to know:
    * names that spaCy correctly annotated
    * names that spaCy annotated as ORG but human deleted
    * names that spaCy missed

In [7]:
def find_word(value, text):
    """
    According to the start and end position to find the ORG name
    """
    start = value[0]
    end = value[1]
    word = text[start:end]
    return word


In [8]:
name_found_correct = []

for n in correct_index:

    input_text = Enron_data[n]['text']
    output_machine = Enron_data[n]['machine_entities']
    
    for value in output_machine:
        name_correct = find_word(value, input_text)
        name_found_correct.append(name_correct)


from collections import Counter

counts_name_found = Counter(name_found_correct)
print(counts_name_found)
#vocab_found = sorted(counts_name_found, key=counts_name_found.get, reverse=True)

Counter({'Enron': 17, 'ENA': 9, 'ISDA': 5, 'Nicor': 4, 'Texas': 3, 'Cinergy': 3, 'FYI': 3, 'VAR': 3, 'CEC': 2, 'Sierra': 2, 'GISB': 2, 'Central': 2, 'HPL': 2, 'MILLS': 2, 'Yahoo': 2, 'The': 2, 'EOTT': 2, 'NDA': 2, 'Harvard': 2, 'CPUC': 2, 'Amazon.com': 2, 'Master': 2, 'ProCaribe': 2, 'Socal': 2, 'Waddington': 2, 'Containerisation': 2, 'Model': 2, 'PG&E': 2, 'Legal': 2, 'VMAC': 2, 'ISO': 2, 'Real': 2, 'CNN': 2, 'NNG': 1, 'Westheimer': 1, 'Gingerbread': 1, 'Oxley': 1, 'East': 1, 'ICAP': 1, 'Stinson/Vince': 1, 'Client': 1, 'Confirmation': 1, 'Ken': 1, 'Principal-Protected*': 1, 'Trust': 1, 'Nasdaq-100': 1, 'Ambac': 1, "Moody's": 1, 'AAA': 1, 'Standard': 1, 'Enerfax': 1, 'GMT-06:00': 1, 'Global': 1, 'Seller': 1, 'PEPL': 1, 'TE': 1, 'Submit': 1, 'Christi': 1, 'Dominion': 1, 'ESA': 1, 'RAC': 1, 'Eastern': 1, 'PPT': 1, 'Supervisor': 1, 'SHRM': 1, 'CRRA': 1, 'TransPecos': 1, 'FBI': 1, 'CommodityLogic': 1, 'ASAP': 1, 'Peoples': 1, 'EFF_DT': 1, 'Jesse': 1, 'Specialist': 1, 'Logistics': 1, 'Assoc

In [9]:
name_found_wrong = []
name_missed = []

for n in edited_index:
    input_text = Enron_data[n]['text']
    output_machine, output_human = read_each_email(n)
    
    for value in output_machine:
        # find the name
        name = find_word(value, input_text)
        
        if value in output_human:
            name_found_correct.append(name)
        else:
            name_found_wrong.append(name)
            
    for result in output_human:
        if not result in output_machine:
            name = find_word(value, input_text)
            name_missed.append(name)

counts_name_wrong = Counter(name_found_wrong)
print("Annotated name by machine is deleted by human: \n")
print(counts_name_wrong, "\n")

counts_name_missed = Counter(name_missed)
print("spaCy missed names: \n")
print(counts_name_missed)

Annotated name by machine is deleted by human: 

Counter({'FERC': 2, 'ROFR': 2, 'CEC': 1, 'ABB': 1, 'GE': 1, 'ISDA': 1, 'PEPL': 1, 'BNP/EML': 1, 'Mariella': 1, 'LPG': 1, 'PEP': 1, 'Supervisor': 1, 'January': 1, 'DPC': 1, 'LNG': 1, 'K#66940': 1, 'Interstate': 1, 'California': 1, 'Central': 1, 'LOI': 1, "L/C's": 1, 'L/C': 1, 'Harry': 1, '6/13/01': 1, 'MGI': 1, 'FYI': 1, 'Master': 1, 'Weekly': 1, 'NBSK': 1, 'CLE': 1, 'Underwriting': 1, 'IV': 1, 'FX': 1, 'GPG': 1, 'NBPL': 1, 'Veroinca': 1, 'GISB': 1, 'GMT-06:00': 1, 'INGAA': 1, 'BPA': 1, 'Client': 1, 'Outages': 1, 'BB': 1, 'NGX': 1, 'ENA': 1, 'The': 1}) 

spaCy missed names: 

Counter({'PEPL': 3, 'PEP': 3, '6/13/01': 3, 'Client': 3, 'GE': 2, 'FERC': 2, 'Supervisor': 2, 'Enron': 2, 'K#66940': 2, 'California': 2, 'UBS': 2, 'Master': 2, 'ROFR': 2, 'ISO': 2, 'The': 2, 'EnronOnline': 1, 'ISDA': 1, 'CO2': 1, 'Henry': 1, 'BNP/EML': 1, 'Hector': 1, 'LPG': 1, '2600MT': 1, 'LNG': 1, 'LOI': 1, "L/C's": 1, 'L/C': 1, 'AEP': 1, 'MGI': 1, 'FYI': 1, 'Week

### Compared Names

Based on the above three word lists, we can get some results.
* In name_found_correct list, some words need to be deleted.
>- Like: 'FYI', 'Central', 'Jesse', 'Dennis', etc

* There are some words appeared in both name_found_correct list and name_found_wrong list, like 'ENA', 'CEC', 'PBA', 'Morton', 'EOT', 'PG&E'. 
>- When human editored the annotation, human made mistakes. Take 'CEC' as an example, CEC are annotated as ORG name for twice, however, it is deleted by human for one time.
    
* There are some words appeared in name_found_wrong list and name_missed list, like 'PEP', 'LNG', 'INGAA', 'NBSK'
>- Sometimes, human recognized these words as ORG, sometimes, not
>- ABB is a Company, however, it is deleted by editors.

* There are some words appeared in all three list, like 'PEPL', 'LPG'
>- This situation would make the result confusing.

Human editors are also helpful.
* From name_found_wrong list, editors really removed words correctly, such as
>- words means name, location, date, time
>- words meaningless (like K#66940)

<a id='step3'></a>
## Step 3: Load OntoNotes Dataset

### Load OntoNotes.json file

In [10]:
# Open JSON file
with open('OntoNotes.json') as json_file:
    Onto_data = json.load(json_file)
    
    print(type(Onto_data))
    print(len(Onto_data))

<class 'list'>
5000


In [11]:
input_text = str(Onto_data[0][0])
print(input_text)

The head of the Palestinian Television and Radio has been shot dead in the Gaza Strip .


In [12]:
print(Onto_data[0])
print("Onto_data[0] type:",type(Onto_data[0]))
print(Onto_data[0][0])
output_Onto = Onto_data[0][1]['entities']
print(output_Onto)

output_test = []
for value in output_Onto:
    if value[2] == 'ORG':
        output_test.append(value)
output_test

['The head of the Palestinian Television and Radio has been shot dead in the Gaza Strip .', {'entities': [[12, 48, 'ORG'], [71, 85, 'GPE']]}]
Onto_data[0] type: <class 'list'>
The head of the Palestinian Television and Radio has been shot dead in the Gaza Strip .
[[12, 48, 'ORG'], [71, 85, 'GPE']]


[[12, 48, 'ORG']]

### Import spacy and analyze the first email.

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")    

In [14]:
doc = nlp(input_text)
output_spacy = []
for ent in doc.ents:
    if ent.label_ == "ORG":
        output_spacy.append([ent.start_char,ent.end_char,ent.label_])
print(output_spacy)

[[12, 38, 'ORG'], [43, 48, 'ORG']]


In [15]:
result = output[0][:3] == output_spacy[0]
print(result)

False


In [16]:
def get_predict_from_spacy(input_text):
    """
    Spacy is used to annotate ORG name from email content
    Return annotated result
    """
    doc = nlp(input_text)
    output_spacy = []
    for ent in doc.ents:
        if ent.label_ == "ORG":
            output_spacy.append([ent.start_char,ent.end_char,ent.label_])
    
    return output_spacy

In [17]:
# test get_predict_from_spacy function
output_spacy = get_predict_from_spacy(input_text)
print(output_spacy)

[[12, 38, 'ORG'], [43, 48, 'ORG']]


<a id='step4'></a>
## Step 4: Analyze spacy model performace on OntoSentences file

#### Compare the value got from spacy and file

In [18]:
def complete_match(output, predict):
    """
    Check weather Spacy's predict has the same result of target
    output - Enron result
    predict - Spacy's predict
    """
    if predict == output:
        return 1
    else:
        return 0

def correct_find(output, predict):
    """
    find whether the names annotated by spacy are in the range of target
    output - Enron result
    predict - Spacy's predict
    
    The name annotated by spacy sometime starts or ends with different position from target.
    However, sometimes, a part of the name is annotated by spacy.
    This situation is considered as spacy found name correctly.
    """
    find_name = 0
    for value_predict in predict:
        for value_output in output:
            if value_predict[0] > value_output[1] or value_predict[1] < value_output[0]:
                find_name += 0
            else:
                find_name += 1
    return find_name


In [19]:
def read_output_from_OntoNotes(n):
    info = Onto_data[n]
    input_text = str(Onto_data[n][0])
    output = Onto_data[n][1]['entities']
    output_Onto = []
    for value in output:
        if value[2] == 'ORG':
            output_Onto.append(value)
    
    return input_text, output_Onto

In [20]:
complete_correct_number = 0
position_find_number = 0
false_positive_number = 0
false_negative_number = 0
ture_negative_number = 0

for n in range(len(Onto_data)):
    input_text, output_Onto = read_output_from_OntoNotes(n)
    len_output = len(output_Onto)
    output_spacy = get_predict_from_spacy(input_text)
    len_spacy = len(output_spacy)
    
    if complete_match(output_Onto, output_spacy) == 1:

        if output_spacy == []:
            ture_negative_number += 1
        else:
            complete_correct_number += 1
        #print(i, " Completely Match OntoSentences: ", output_Onto, " spacy: ", output_spacy)
    elif correct_find(output_Onto, output_spacy) == len(output_Onto):
        position_find_number += 1     
    
    else:
        if len_spacy >= len_output:  # spacy has false positive
            false_positive_number += 1
        
        elif len_spacy < len_output:  # spacy has false negative
            false_negative_number += 1
            #print(input_text)
            #print(n, " OntoSentences: ", output_Onto, " spacy: ", output_spacy)
                
print("complete_correct_number Positive: ", complete_correct_number)
print("All position found: ", position_find_number)
print("True Positive: ", complete_correct_number + position_find_number)
print("True Negative: ", ture_negative_number)
print("with false positive number: ", false_positive_number)
print("with false negative number: ", false_negative_number)


complete_correct_number Positive:  1397
All position found:  190
True Positive:  1587
True Negative:  3341
with false positive number:  11
with false negative number:  61


### Here use the following matrix to analyze results

|               |Predict Positive        | Predict Negative           |
| ------------- |:-------------:| -----:|
| Positive      | True Positive | False Negative |
| Negative      | False Positive| True Negative  |

Based on the result from above step, it is easy to find:

there are 1397 emails correctly annotate ORG by spacy
there are 190 emaild can include then names needed to be annotated by spacy
there are 3341 emails don't have ORG name checked by spacy, the same as Enton file
there are 11 emails annotated with wrong ORG name by spacy
there are 61 emails which can not be annotated ORG name by spacy
hence,
    - True Positive = 1397 + 190 = 1587
    - True Negative = 3341
    - False Positive = 11
    - False Negative = 61
When only consider correct results, then:

    - Accuracy = (True Positive + True Negative)/ Total = (1587 + 3341) / 5000 = 0.99
    - Recall = True Positive/(True Positive + False Negative) = 1587/(1587+61) = 0.96
