# Test Task

# Analyze the organization name recognition system and dataset annotations



### The Road Ahead

The notebook is organized into the following steps.

* [Step 1](#step1): Load Enron Dataset
* [Step 2](#step2): Analyze spaCy model performance on Enron file
* [Step 3](#step3): Load OntoNotes Dataset
* [Step 4](#step4): Analyze spaCy model performance on OntoSentences file

---
<a id='step1'></a>
## Step 1: Load Enron Dataset

### Load Enron Dataset
Open and load the `EnronSentences.json` file.
The data is a list of length 310.

In [1]:
import json

# Open JSON file
with open('EnronSentences.json') as json_file:
    Enron_data = json.load(json_file)
    
    print(type(Enron_data))
    print(len(Enron_data))

<class 'list'>
310


### Take the first email as an example.
#### Each email is read as a dict.

In [2]:
# print an email and check content
print(Enron_data[0])
print("Enron_data[0] type:",type(Enron_data[0]))

{'text': 'Shelley: Do you have any information on Project Max that we could use in the business plan for NNG?', 'machine_entities': [[95, 98, 'ORG', 'machine']], 'human_entities': {}}
Enron_data[0] type: <class 'dict'>


#### Check the keys and their values.

In [3]:
# check each dictionary keys and value
print(Enron_data[0].keys())
input_text = Enron_data[0]['text']
output = Enron_data[0]['machine_entities']
print(input_text)
print(type(output), " ", output)
print(output[0][:3])
print(output[0][-1])

dict_keys(['text', 'machine_entities', 'human_entities'])
Shelley: Do you have any information on Project Max that we could use in the business plan for NNG?
<class 'list'>   [[95, 98, 'ORG', 'machine']]
[95, 98, 'ORG']
machine


### Import spaCy and analyze the first email.

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")    

In [5]:
doc = nlp(input_text)
output_spacy = []
for ent in doc.ents:
    if ent.label_ == "ORG":
        output_spacy.append([ent.start_char,ent.end_char,ent.label_])
print(output_spacy)

[[95, 98, 'ORG']]


#### Compare the spaCy prediction with the annotation in the file

In [6]:
result = output[0][:3] == output_spacy[0]
print(result)

True


<a id='step2'></a>
## Step 2: Analyze spaCy model performance on Enron file


In [7]:
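def read_each_email(n):
    """Return the text, the reference entities, and their count for email n.

    Human annotations are used when present, otherwise the machine annotations.
    """
    input_text = Enron_data[n]['text']
    output_machine = Enron_data[n]['machine_entities']
    len_output = 0
    output = []
    
    if Enron_data[n]['human_entities'] != {}:
        # Take the entity list stored under the first key of the human annotations
        output_human_dict = Enron_data[n]['human_entities']
        output = list(output_human_dict.values())[0]
        # Caution: len_output is not updated in this branch and stays 0, so a
        # caller looping over range(len_output) skips the human annotations.
        # The results reported in this notebook were produced with this behavior.
    else:
        output = output_machine
        len_output = len(output_machine)
    
    return input_text, output, len_output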


In [8]:
def get_predict_from_spacy(input_text):
    
    doc = nlp(input_text)
    output_spacy = []
    for ent in doc.ents:
        if ent.label_ == "ORG":
            output_spacy.append([ent.start_char,ent.end_char,ent.label_])
    
    return output_spacy

In [9]:
def complete_match(output, predict):
    """Return 1 if the spaCy prediction is identical to the target, else 0."""
    return 1 if predict == output else 0

def correct_find(output, predict):
    """Count the (predicted, target) span pairs whose character ranges overlap."""
    find_name = 0
    for value_predict in predict:
        for value_output in output:
            # Two spans overlap unless one starts after the other ends
            if not (value_predict[0] > value_output[1] or value_predict[1] < value_output[0]):
                find_name += 1
    return find_name

def TrueP_with_FalseP(output, predict): 
    """
    Return 1 if every target span appears in the prediction, else 0.
    Used when spaCy also returns extra spans (False Positive).
    """
    matched = 0
    for value in output:
        if value in predict:
            matched += 1
    return 1 if matched == len(output) else 0

def TrueP_with_FalseN(output, predict):
    """
    Return 1 if at least one predicted span appears in the target, else 0.
    Used when spaCy misses some target spans (False Negative).
    """
    matched = 0
    for value in predict:
        if value in output:
            matched += 1
    return 1 if matched > 0 else 0
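
The overlap test in `correct_find` treats two character spans as overlapping unless one starts strictly after the other ends. The following is a minimal sketch of that check on made-up spans; the helper name `spans_overlap` is hypothetical and not part of the notebook.

```python
def spans_overlap(a, b):
    """Return True unless one span starts strictly after the other ends.

    Mirrors the comparison used in correct_find; a and b are
    [start, end, label] lists.
    """
    return not (a[0] > b[1] or a[1] < b[0])


# Toy reference span, invented for illustration
reference = [10, 20, 'ORG']

print(spans_overlap([12, 18, 'ORG'], reference))  # True: nested span
print(spans_overlap([20, 25, 'ORG'], reference))  # True: start equals the reference end
print(spans_overlap([21, 30, 'ORG'], reference))  # False: starts after the reference end
```

Since spaCy's `end_char` is an exclusive offset, the second case describes two spans that touch without sharing a character, yet the comparison still counts them as overlapping. Using `>=` and `<=` would be a stricter alternative, at the cost of changing the counts.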

In [10]:
complete_correct_number = 0
position_find_number = 0
false_positive_number = 0
false_negative_number = 0
true_negative_number = 0

for i in range(len(Enron_data)):
    input_text, output_Enron, len_output = read_each_email(i)
    output_spacy = get_predict_from_spacy(input_text)
    len_spacy = len(output_spacy)
    
    # Keep only annotations with a non-empty label, reduced to [start, end, label]
    output_corrected_by_human = []
    for n in range(len_output):
        if output_Enron[n][2] != '':
            output_corrected_by_human.append(output_Enron[n][:3])
    
    if complete_match(output_corrected_by_human, output_spacy) == 1:
        if output_spacy == []:
            true_negative_number += 1
        else:
            complete_correct_number += 1
        #print(i, " Completely Match enron human: ", output_corrected_by_human, " spacy: ", output_spacy)
        
    elif correct_find(output_corrected_by_human, output_spacy) == len(output_corrected_by_human):
        position_find_number += 1
        #print(i, " Position Match enron human: ", output_corrected_by_human, " spacy: ", output_spacy)

    else:
        # Note: emails where len_spacy == len_output reach neither branch below
        # and are not counted.
        if len_spacy > len_output:  # spacy has false positive
            if TrueP_with_FalseP(output_corrected_by_human, output_spacy) == 1:
                false_positive_number += 1
                #print(i, " FP enron human: ", output_corrected_by_human, " spacy: ", output_spacy)
            elif correct_find(output_corrected_by_human, output_spacy) >= 1:
                false_positive_number += 1
            else:
                print(i, " enron human: ", output_corrected_by_human, " spacy: ", output_spacy)
        elif len_spacy < len_output:  # spacy has false negative
            
            if TrueP_with_FalseN(output_corrected_by_human, output_spacy) == 1:
                false_negative_number += 1
                #print(i, " FN enron human: ", output_corrected_by_human, " spacy: ", output_spacy)
            elif correct_find(output_corrected_by_human, output_spacy) >= 1:
                false_negative_number += 1
            elif len_spacy == 0 and len_output > 0:
                false_negative_number += 1
                #print(input_text)
                #print(i, " enron human: ", output_corrected_by_human, " spacy: ", output_spacy)
            else:
                print(i, " enron human: ", output_corrected_by_human, " spacy: ", output_spacy)

    
print("complete_correct_number Positive: ", complete_correct_number)
print("All position found: ", position_find_number)
print("True Positive: ", complete_correct_number + position_find_number)
print("True Negative: ", true_negative_number)
print("with false positive number: ", false_positive_number)
print("with false negative number: ", false_negative_number)

5  enron human:  [[178, 188, 'ORG']]  spacy:  [[17, 49, 'ORG'], [102, 110, 'ORG']]
36  enron human:  [[0, 5, 'ORG'], [15, 17, 'ORG']]  spacy:  [[35, 38, 'ORG']]
186  enron human:  [[0, 9, 'ORG']]  spacy:  [[88, 114, 'ORG'], [139, 179, 'ORG']]
211  enron human:  [[114, 125, 'ORG'], [130, 134, 'ORG']]  spacy:  [[47, 74, 'ORG']]
complete_correct_number Positive:  73
All position found:  115
True Positive:  188
True Negative:  26
with false positive number:  0
with false negative number:  81


The following confusion matrix is used to analyze the results:

|               |Predicted Positive        | Predicted Negative           |
| ------------- |:-------------:| -----:|
| Actual Positive      | True Positive | False Negative |
| Actual Negative      | False Positive| True Negative  |

From the results of the step above:
- Complete matches, true negatives, and position matches together: 73 + 26 + 115 = 214.
- In 11 emails spaCy found at least one name but missed at least one other.
- In 70 emails spaCy did not return any name.
- In 4 emails spaCy returned false names.

Considering only the correct results:
- Accuracy = (True Positive + True Negative) / Total = (188 + 26) / 310 = 0.69
- Recall = True Positive / (True Positive + False Negative) = 188 / (188 + 81) = 0.70
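
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that only plugs in the counts reported for the Enron run; the function name `accuracy_and_recall` is hypothetical.

```python
def accuracy_and_recall(tp, tn, fn, total):
    """Compute accuracy and recall from raw counts."""
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    return accuracy, recall


# Counts reported for the Enron run
tp = 73 + 115   # complete matches + position matches
tn = 26         # emails where neither source contains an ORG
fn = 81         # emails counted as false negatives
total = 310     # number of emails in the file

accuracy, recall = accuracy_and_recall(tp, tn, fn, total)
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.69
print(f"Recall:   {recall:.2f}")    # Recall:   0.70
```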


<a id='step3'></a>
## Step 3: Load OntoNotes Dataset

### Load OntoNotes.json file

In [11]:
# Open JSON file
with open('OntoNotes.json') as json_file:
    Onto_data = json.load(json_file)
    
    print(type(Onto_data))
    print(len(Onto_data))

<class 'list'>
5000


In [12]:
print(Onto_data[0])
print("Onto_data[0] type:",type(Onto_data[0]))
print(Onto_data[0][0])
output_Onto = Onto_data[0][1]['entities']
print(output_Onto)

output_test = []
for value in output_Onto:
    if value[2] == 'ORG':
        output_test.append(value)
output_test

['The head of the Palestinian Television and Radio has been shot dead in the Gaza Strip .', {'entities': [[12, 48, 'ORG'], [71, 85, 'GPE']]}]
Onto_data[0] type: <class 'list'>
The head of the Palestinian Television and Radio has been shot dead in the Gaza Strip .
[[12, 48, 'ORG'], [71, 85, 'GPE']]


[[12, 48, 'ORG']]

In [13]:
input_text = str(Onto_data[0][0])
print(input_text)
output_spacy = get_predict_from_spacy(input_text)
print(output_spacy)

The head of the Palestinian Television and Radio has been shot dead in the Gaza Strip .
[[12, 38, 'ORG'], [43, 48, 'ORG']]
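
In this sentence spaCy splits the single OntoNotes span `[12, 48]` into two ORG spans. The sketch below re-implements the pairwise overlap count from `correct_find` under the hypothetical name `count_overlapping_pairs` to show how such a split compares with the reference annotation.

```python
def count_overlapping_pairs(output, predict):
    """Count (predicted, reference) pairs with overlapping character ranges."""
    count = 0
    for p in predict:
        for o in output:
            if not (p[0] > o[1] or p[1] < o[0]):
                count += 1
    return count


reference = [[12, 48, 'ORG']]                    # OntoNotes annotation
predicted = [[12, 38, 'ORG'], [43, 48, 'ORG']]   # spaCy output for the same sentence

print(predicted == reference)                          # False: not a complete match
print(count_overlapping_pairs(reference, predicted))   # 2
print(len(reference))                                  # 1
```

Both predicted spans overlap the reference, so the pair count (2) differs from the number of reference spans (1). A check of the form `count == len(reference)` therefore does not treat this sentence as a position match, even though the organization was located.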


In [14]:
def read_output_from_OntoNotes(n):
    """Return the text of sentence n and its ORG entities from OntoNotes."""
    input_text = str(Onto_data[n][0])
    output = Onto_data[n][1]['entities']
    # Keep only the organization entities
    output_Onto = []
    for value in output:
        if value[2] == 'ORG':
            output_Onto.append(value)
    
    return input_text, output_Onto

<a id='step4'></a>
## Step 4: Analyze spaCy model performance on OntoSentences file

In [15]:
complete_correct_number = 0
position_find_number = 0
false_positive_number = 0
false_negative_number = 0
true_negative_number = 0

for n in range(len(Onto_data)):
    input_text, output_Onto = read_output_from_OntoNotes(n)
    len_output = len(output_Onto)
    output_spacy = get_predict_from_spacy(input_text)
    len_spacy = len(output_spacy)
    
    if complete_match(output_Onto, output_spacy) == 1:
        if output_spacy == []:
            true_negative_number += 1
        else:
            complete_correct_number += 1
        #print(n, " Completely Match OntoSentences: ", output_Onto, " spacy: ", output_spacy)
    elif correct_find(output_Onto, output_spacy) == len(output_Onto):
        position_find_number += 1     
    
    else:
        # Note: sentences where len_spacy == len_output reach neither branch
        # below and are not counted.
        if len_spacy > len_output:  # spacy has false positive
            if TrueP_with_FalseP(output_Onto, output_spacy) == 1:
                false_positive_number += 1
                #print(n, " FP OntoSentences: ", output_Onto, " spacy: ", output_spacy)
            elif correct_find(output_Onto, output_spacy) >= 1:
                #print(n, " FP OntoSentences: ", output_Onto, " spacy: ", output_spacy)
                false_positive_number += 1
            else:
                print(n, " OntoSentences: ", output_Onto, " spacy: ", output_spacy)
        
        elif len_spacy < len_output:  # spacy has false negative
            
            if TrueP_with_FalseN(output_Onto, output_spacy) == 1:
                false_negative_number += 1
                #print(n, " FN OntoSentences: ", output_Onto, " spacy: ", output_spacy)
            elif correct_find(output_Onto, output_spacy) >= 1:
                false_negative_number += 1
            elif len_spacy == 0 and len_output > 0:
                false_negative_number += 1
                #print(input_text)
                #print(n, " OntoSentences: ", output_Onto, " spacy: ", output_spacy)
            else:
                print(n, " OntoSentences: ", output_Onto, " spacy: ", output_spacy)
                
print("complete_correct_number Positive: ", complete_correct_number)
print("All position found: ", position_find_number)
print("True Positive: ", complete_correct_number + position_find_number)
print("True Negative: ", true_negative_number)
print("with false positive number: ", false_positive_number)
print("with false negative number: ", false_negative_number)


complete_correct_number Positive:  1397
All position found:  190
True Positive:  1587
True Negative:  3341
with false positive number:  7
with false negative number:  61


From the results of the step above:
- Complete matches, position matches, and true negatives together: 1397 + 190 + 3341 = 4928.
- In 14 sentences spaCy found at least one name but missed at least one other.
- In 47 sentences spaCy did not return any name.
- In 7 sentences spaCy returned a false name.

Considering only the correct results:
- Accuracy = (True Positive + True Negative) / Total = (1587 + 3341) / 5000 = 0.99
- Recall = True Positive / (True Positive + False Negative) = 1587 / (1587 + 61) = 0.96
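
As a sanity check on these figures, the counters can be summed and compared with the number of sentences. The sketch below is only bookkeeping over the reported numbers; the dictionary keys are labels chosen here for illustration.

```python
# Counters reported for the OntoNotes run
counts = {
    "complete_match": 1397,
    "position_match": 190,
    "true_negative": 3341,
    "false_positive": 7,
    "false_negative": 61,
}
total = 5000  # number of sentences in OntoNotes.json

counted = sum(counts.values())
print(counted)          # 4996
print(total - counted)  # 4
```

Four sentences fall into none of the counters. A plausible explanation, given the evaluation loop, is that no branch handles the case where spaCy and the annotation return the same number of spans without matching, so those sentences are dropped silently while still counting toward the denominator of the accuracy.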
