# Detecting the Ingroup and Outgroup of a Text
---

In this first experiment we use Bush’s (2001) and bin Laden’s (19XX) narratives as a dataset to detect the ingroup and outgroup of each orator. As shown in Figure 1, a manual annotation of the two texts reveals the ingroups and outgroups for each orator. Bush’s text is his “Address to Joint Session of Congress Following 9/11 Attacks”, delivered on 20 September 2001; bin Laden’s text is his “Declaration of Jihad Against the Americans Occupying the Land of the Two Holy Places”, published on 23 August 1996. These texts were chosen because they are the first in the dataset in which each orator declares war against the other. Bush identifies his ingroup as ‘fellow Americans’ and, after identifying al Qaeda as a terrorist organisation, declares, “Our war on terror begins with al Qaeda”. Bin Laden similarly identifies his ingroup as “Muslim brethren” and declares Jihad against his outgroup, the Americans, by stating that “driving back the American occupier enemy is the most essential duty after faith”. Each orator clearly defines his ingroup and outgroup, as well as his intention to legitimise warfare.

## Set Up the NLP Pipeline and Data

In [6]:
%%time
import os
import pandas as pd
import spacy
from cndlib.visuals import display_side_by_side
from cndlib.pipeline import add_hard_coded_entities, merge_compounds, custom_tokenizer

nlp = spacy.load("en_core_web_md")

nlp.tokenizer = custom_tokenizer(nlp)

dirname = "C:\\Users\\spa1e17\\OneDrive - University of Southampton\\Hostile-Narrative-Analysis\\dataset"
filename = "named_entity_corrections.json"
filepath = os.path.join(dirname, filename)

add_hard_coded_entities(nlp, filepath)

merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents, after = "entity_ruler")

nlp.add_pipe(merge_compounds, last = True)

print([pipe for pipe in nlp.pipe_names])

['tagger', 'parser', 'ner', 'entity_ruler', 'merge_entities', 'merge_compounds']
Wall time: 3.66 s


### Create the Dataset

The dataset is created by extracting the named entities relating to people or groups, which are then annotated as either ingroup or outgroup in relation to each orator. Annotations were based either on words signifying group membership, such as "terrorist organisation known as al Qaeda", or on inference using annotator judgement. The annotations are saved in a .csv file and can be reviewed as required.

In [14]:
import os

docs = None
docs = {"bush" : {"name" : "George Bush", "text" : dict()},
       "binladen" : {"name" : "Osama bin Laden", "text" : dict()}}

docs['bush']['dirpath'] = u"C:\\Users\\spa1e17\\OneDrive - University of Southampton\\Hostile-Narrative-Analysis\\dataset\\George Bush"
docs['bush']['text']['filename'] = "20010920-Address to Joint Session of Congress Following 911 Attacks.txt"

docs['binladen']['dirpath'] = u"C:\\Users\\spa1e17\\OneDrive - University of Southampton\\Hostile-Narrative-Analysis\\dataset\\Osama bin Laden"
docs['binladen']['text']['filename'] = "19960823-Declaration of Jihad Against the Americans Occupying the Land of the Two Holiest Sites.txt"
                                                
# lambda function to capture the noun chunks of a text whose root is a GPE, NORP, ORG or PERSON entity
entity_list = lambda ents: [ent for ent in ents.noun_chunks 
                            if all((ent.root.pos_ != "ADJ",
                                   ent.root.ent_type_ in ["PERSON", "ORG", "NORP","GPE"]))
                            ]

# lambda function to process doc and extract entities
get_entities = lambda orator : entity_list(nlp(orator["text"]["rawtext"]))

for orator in docs.values():
    
    with open(os.path.join(orator['dirpath'], orator['text']['filename']), 'r') as fp:
    
        # read the raw text and extract the entities for this orator
        orator["text"]["rawtext"] = fp.read()
        orator["text"]["entities"] = get_entities(orator)
        orator['text']['analytics'] = dict()

n = 20
captions = [f"First {n} of {len(orator['text']['entities'])} Entities for {orator['name']}" for orator in docs.values()]

# show the first n entities for each orator: phrase, root, entity type and part of speech
display_side_by_side([pd.DataFrame([(ent.text, ent.root.text, ent.root.ent_type_, ent.root.pos_) 
                                    for ent in docs[orator]["text"]["entities"]]).head(n) 
                                   for orator in docs], captions)

Unnamed: 0,0,1,2,3
0,Mr. President Pro Tempore,President Pro Tempore,PERSON,PROPN
1,Congress,Congress,ORG,PROPN
2,fellow Americans,Americans,NORP,PROPN
3,the Union,Union,ORG,PROPN
4,Todd Beamer,Todd Beamer,PERSON,PROPN
5,Lisa Beamer,Lisa Beamer,PERSON,PROPN
6,the Congress,Congress,ORG,PROPN
7,America,America,GPE,PROPN
8,Republicans,Republicans,ORG,PROPN
9,Democrats,Democrats,ORG,PROPN

Unnamed: 0,0,1,2,3
0,Muhammad,Muhammad,PERSON,PROPN
1,Ye,Ye,PERSON,PROPN
2,his Prophet,Prophet,PERSON,PROPN
3,Muslims,Muslims,NORP,PROPN
4,Jews,Jews,NORP,PROPN
5,Christians,Christians,NORP,PROPN
6,Palestine,Palestine,GPE,PROPN
7,Iraq,Iraq,GPE,PROPN
8,Qana,Qana,GPE,PROPN
9,Lebanon,Lebanon,GPE,PROPN


### Assigning the Ingroup and Outgroup of a Text

Export the entities to an output csv file for manual annotation.

Annotation was based on three methods. Firstly, if a named entity had an associated seed term of either elevation or othering, it was annotated as ingroup or outgroup respectively. For example, any entity associated with the term “enemy” would be considered an outgroup, while any entity preceded by the phrase “my fellow” would be annotated as an ingroup. Secondly, any named entity whose grouping was identified elsewhere in the text would be annotated as “linked”. For example, where a single instance of an entity had been annotated as an outgroup through the term “enemy”, all other instances of the same entity would be annotated with the same label. The final method is inference, which draws upon real-world knowledge to make the annotation. For review, the annotated dataset is available online and the first 12 annotation results are shown in Figure 1.
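
The three annotation methods can be illustrated with a minimal sketch. It is not the procedure used to produce the gold annotations, which were made by hand; the seed lists, the `annotate` helper and the example mentions are assumptions chosen to mirror the examples in the paragraph above.

```python
# Hypothetical seed terms for illustration only; the study's full seed lists are not reproduced here.
ELEVATION_SEEDS = ("my fellow",)
OTHERING_SEEDS = ("enemy", "terrorist")


def seed_label(phrase):
    """Method 1: return (grouping, seed term) if the phrase contains a seed term."""
    lowered = phrase.lower()
    for seed in ELEVATION_SEEDS:
        if seed in lowered:
            return "ingroup", seed
    for seed in OTHERING_SEEDS:
        if seed in lowered:
            return "outgroup", seed
    return None, None


def annotate(mentions):
    """Annotate (entity, phrase) pairs by seed term, then by linking; the rest need inference."""
    rows, known = [], {}

    # Method 1: label mentions whose phrase contains a seed term
    for entity, phrase in mentions:
        grouping, seed = seed_label(phrase)
        if grouping:
            known.setdefault(entity.lower(), grouping)
        rows.append({"entity": entity, "phrase": phrase, "grouping": grouping,
                     "seed": seed, "method": "seed" if grouping else None})

    for row in rows:
        if row["grouping"] is None:
            if row["entity"].lower() in known:
                # Method 2: copy the label from another mention of the same entity
                row["grouping"] = known[row["entity"].lower()]
                row["method"] = "linked"
            else:
                # Method 3: left for the annotator's real-world knowledge
                row["method"] = "inferred"
    return rows


mentions = [
    ("Americans", "my fellow Americans"),
    ("al Qaeda", "terrorist organisation known as al Qaeda"),
    ("al Qaeda", "al Qaeda"),
    ("Egyptian Islamic Jihad", "the Egyptian Islamic Jihad"),
]
annotated = annotate(mentions)
for row in annotated:
    print(row["entity"], row["grouping"], row["method"])
```

In this sketch the last mention is left without a grouping, which reflects the point that inferred annotations cannot be derived from the text alone.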

In [8]:
%%time
import csv
import pandas as pd

# Field headers for the csv
fields = ["Orator", "Entity Type", "Part of Speech", "Entity Phrase", "Entity Root", "Grouping", "Seed Term", "Sentence"]
entities = []

for orator in docs:
    for entity in docs[orator]["text"]["entities"]:
        # store the root as text rather than as a spaCy Token so the DataFrame holds plain strings
        entities.append([orator, entity.root.ent_type_, entity.root.pos_, entity.text, entity.root.text, '', '', str(entity.sent).replace('\n', ' ').strip()])

dirpath = os.getcwd()
filename = "entity_list.csv"
filepath = os.path.join(dirpath, filename)

df = pd.DataFrame(entities, columns = fields)
df.to_csv(filepath, sep=',',index=False)
df

Wall time: 0 ns


Unnamed: 0,Orator,Entity Type,Part of Speech,Entity Phrase,Entity Root,Grouping,Seed Term,Sentence
0,bush,PERSON,PROPN,Mr. President Pro Tempore,President Pro Tempore,,,"Mr. Speaker, Mr. President Pro Tempore, member..."
1,bush,ORG,PROPN,Congress,Congress,,,"Mr. Speaker, Mr. President Pro Tempore, member..."
2,bush,NORP,PROPN,fellow Americans,Americans,,,"Mr. Speaker, Mr. President Pro Tempore, member..."
3,bush,ORG,PROPN,the Union,Union,,,"Mr. Speaker, Mr. President Pro Tempore, member..."
4,bush,PERSON,PROPN,Todd Beamer,Todd Beamer,,,"We have seen it in the courage of passengers, ..."
...,...,...,...,...,...,...,...,...
347,binladen,NORP,PROPN,Brother Muslims,Brother Muslims,,,Brother Muslims worldwide:
348,binladen,GPE,PROPN,Palestine,Palestine,,,Your brothers in the land of the two holy mosq...
349,binladen,NORP,PROPN,the Jews,Jews,,,"O God, the people of the cross have come with ..."
350,binladen,NORP,PROPN,Muslims,Muslims,,,"O God, strengthen the youth of Islam, guide th..."


### Import the Annotations as a Tab-Delimited File

The annotation file is saved as a .txt file and imported using tab delimiters to avoid any clashes with commas in the sentences.
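
A short, self-contained sketch shows why a tab delimiter avoids the clash. The field names follow the CSV headers used earlier, but the row itself, including the sentence, is invented for illustration, and an in-memory buffer stands in for the annotation file.

```python
import csv
import io

fields = ["Orator", "Entity Root", "Grouping", "Sentence"]

# Illustrative row only; the sentence is invented and deliberately contains commas.
rows = [{"Orator": "bush", "Entity Root": "Americans", "Grouping": "ingroup",
         "Sentence": "An example sentence, with commas, for illustration."}]

# Write the row as tab-delimited text to an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

raw_lines = buffer.getvalue().splitlines()

# Read it back with the same delimiter used in the notebook cell
buffer.seek(0)
parsed = list(csv.DictReader(buffer, delimiter="\t"))

print(len(raw_lines[1].split("\t")))  # one field per column
print(len(raw_lines[1].split(",")))   # a naive comma split cuts the sentence apart
print(parsed[0]["Sentence"])
```

The `csv` module can also protect commas by quoting fields in a comma-delimited file; tab delimiting simply removes the dependence on quoting when the file is edited by hand.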

In [15]:
import os
import csv
import pandas as pd
from cndlib.visuals import display_side_by_side

filename = "entity_list_gold.txt"
filename = os.path.join(os.getcwd(), filename)

for orator in docs:
    docs[orator]['text']['groups'] = {"ingroup" : set(), "outgroup" : set()}

with open(filename, newline = "") as fp:
    data = csv.DictReader(fp, delimiter = '\t')       
        
    for row in data:

        if row["Grouping"].lower().strip() == "ingroup":
            docs[row["Orator"]]['text']['groups']["ingroup"].add(row["Entity Root"].lower().strip())

        if row["Grouping"].lower().strip() == "outgroup":
            docs[row["Orator"]]['text']['groups']["outgroup"].add(row["Entity Root"].lower().strip())
            
dfs = []
captions = []
for orator in docs.values():
    data = orator['text']['groups']
    
    for grouping, group_set in data.items():
        
        # convert set() to list() so the groups can be concatenated and saved as JSON later
        group_list = list(group_set)
        data[grouping] = group_list
        
        dfs.append(pd.DataFrame(group_list, columns = [f"{orator['name']}'s {grouping.title()}s"]).head(12))
        captions.append(f"{orator['name']}'s text has {len(group_list)} annotated {grouping} terms")
        
display_side_by_side(dfs, captions)

Unnamed: 0,George Bush's Ingroups
0,american
1,arlene
2,todd beamer
3,democrats
4,great britain
5,pentagon
6,tom ridge
7,fbi agents
8,governor george pataki
9,the united states of america

Unnamed: 0,George Bush's Outgroups
0,al qaeda
1,islamic movement of uzbekistan
2,egyptian islamic jihad
3,taliban regime
4,usama bin laden
5,taliban

Unnamed: 0,Osama bin Laden's Ingroups
0,brother muslims
1,muhammad
2,abd-al rahman
3,gabriel
4,fatimah bint-al-khattab
5,al-muthanna bin-harithah al-shibani
6,army
7,bin-'amr al-tamimi
8,the armed forces
9,messenger muhammad

Unnamed: 0,Osama bin Laden's Outgroups
0,serbs
1,britain
2,marines
3,united states
4,us defense secretary
5,russians
6,jew
7,secretary william perry
8,us troops
9,king fahd


## Test 1: Testing IBM Watson Sentiment Analysis

IBM's sentiment analyser has a feature to "analyse target phrases in context of the surrounding text for focused sentiment and emotion results". For the first test, therefore, the annotated named entities shown in Figure 1 for each orator were passed to the API as target phrases to get a sentiment score for each, and the results were assessed against the annotations from Figure 1. A positive score towards a named entity should imply ingroup membership, while a negative score should imply outgroup membership. If a positive sentiment score coincided with an ingroup annotation, or a negative score with an outgroup annotation, the test was a pass; the test was a fail for a positive score coinciding with an outgroup annotation, or a negative score coinciding with an ingroup annotation.

API Documentation: https://cloud.ibm.com/docs/natural-language-understanding?topic=natural-language-understanding-getting-started
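
The pass/fail rule described above can be written as a small truth table. This is a sketch rather than the notebook's own function: the `assess` helper and its `neutral_counts_as` parameter are hypothetical, and treating a neutral label as positive is an assumption, since the paragraph above only covers positive and negative scores.

```python
def assess(label, grouping, neutral_counts_as="positive"):
    """Compare a sentiment label with an annotated grouping.

    Positive sentiment is expected for an ingroup and negative sentiment for an
    outgroup. `neutral_counts_as` is an assumed convention for neutral labels.
    """
    if label == "neutral":
        label = neutral_counts_as
    expected = {"positive": "ingroup", "negative": "outgroup"}.get(label)
    if expected is None or grouping not in ("ingroup", "outgroup"):
        return "not assessed"
    return "pass" if grouping == expected else "fail"


examples = [("positive", "ingroup"), ("positive", "outgroup"),
            ("negative", "ingroup"), ("negative", "outgroup")]
results_table = {pair: assess(*pair) for pair in examples}
for pair, outcome in results_table.items():
    print(pair, outcome)
```

Looking up the expected grouping from the label, rather than chaining `or` and `and` conditions, avoids any reliance on operator precedence.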

### Initiate Watson API

In [430]:
%%time
import os
import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import EmotionOptions, Features, EntitiesOptions, KeywordsOptions, SentimentOptions

# read the API key from the environment rather than hard-coding the credential;
# WATSON_APIKEY is an assumed variable name, set it to your own key
apikey = os.environ["WATSON_APIKEY"]
url = 'https://api.eu-gb.natural-language-understanding.watson.cloud.ibm.com/instances/204e6ba7-952c-41ae-99e9-fe4e8208bfde'

authenticator = IAMAuthenticator(apikey)
service = NaturalLanguageUnderstandingV1(version='2019-07-12', authenticator=authenticator)
service.set_service_url(url)

# print(json.dumps(response, indent=2))

Wall time: 3.39 ms


### Get Responses from Watson API

In [17]:
%%time
import json

for orator in docs.values():
    print(f"getting results for {orator['name']}")
    
    text = orator['text']['rawtext']
    targets = orator['text']['groups']['ingroup'] + orator['text']['groups']['outgroup']
    
    orator['text']['analytics'].update(service.analyze(text=text, features = Features(
                                                        emotion = EmotionOptions(targets = targets),
                                                        entities = EntitiesOptions(sentiment = True, emotion = True),
                                                        sentiment = SentimentOptions(targets = targets, document = True),
                                                        keywords = KeywordsOptions(sentiment = True, emotion = True),             
                                                      )).get_result())
    
    #empty the list of entity Spans to enable saving file as a json object
    orator['text']['entities'] = None
    
    print(f"{orator['name']} complete with {len(orator['text']['analytics']['sentiment']['targets'])} target entities scored for sentiment")


with open(os.path.join(os.getcwd(), f"entity_sentiment_scores.json"), "wb") as f:
    f.write(json.dumps(docs).encode("utf-8"))

getting results for George Bush
George Bush complete with 33 target entities scored for sentiment
getting results for Osama bin Laden
Osama bin Laden complete with 41 target entities scored for sentiment
Wall time: 11.2 s


### Overall Document Sentiment Score

In [2]:
import os
import json

docs = None

with open(os.path.join(os.getcwd(), f"entity_sentiment_scores.json"), "r") as f:
    docs = json.load(f)

for orator in docs.values():
    response = orator['text']['analytics']
    print(f"document sentiment {orator['name']}: {response['sentiment']['document']['label']}")
    print(f"document sentiment score for {orator['name']}: {response['sentiment']['document']['score']}")
    print()

document sentiment George Bush: negative
document sentiment score for George Bush: -0.331922

document sentiment Osama bin Laden: negative
document sentiment score for Osama bin Laden: -0.475387



### Scores for Annotated Entities

In [16]:
import pandas as pd
from cndlib.visuals import display_side_by_side

def get_group(orator, entity):

    """
    function to get the grouping of an entity from the orator's groupings
    """
    if entity in docs[orator]['text']['groups']['ingroup']:
        return "ingroup"
    if entity in docs[orator]['text']['groups']['outgroup']:
        return "outgroup"
    return "not found"

def assessment_test(col1, col2):

    """
    function to test whether a sentiment scores matches ingroup/outgroup
    """

    # `and` binds more tightly than `or`, so the sentiment labels are grouped explicitly
    if col1 in ("positive", "neutral") and col2 == "ingroup":
        return "pass"
    if col1 == "negative" and col2 == "ingroup":
        return "fail"
    if col1 == "negative" and col2 == "outgroup":
        return "pass"
    if col1 in ("positive", "neutral") and col2 == "outgroup":
        return "fail"
    
# create new dataframe based on filtered columns
scores = lambda table, labels: table[table.label.isin(labels)].sort_values("score", ascending = 'negative' in labels, ignore_index = True)

## iterate through the docs
for orator in docs:
    
    # capture results
    results = pd.DataFrame(docs[orator]["text"]['analytics']["sentiment"]["targets"])
    
    ## create a dataframe for positive and negative results
    dfs = dict()
    dfs = {"ingroup" : {"result" : None, "df" : scores(results, ['neutral', 'positive'])}, 
           "outgroup" : {"result" : None, "df" : scores(results, ['negative'])}}

    for obj in dfs.values():
        
        df = obj["df"]
            
        # get the grouping for each entity
        df["grouping"] = df.apply(lambda x: get_group(orator, x["text"]), axis = 1)
        
        # test whether sentiment score matches ingroup/outgroup        
        df["test result"] = df.apply(lambda x: assessment_test(x["label"], x["grouping"]), axis=1)
        
        # get the success scores for ingroup and outgroup (0% if no entity passes)
        obj["result"] = format(df["test result"].value_counts(normalize = True).get("pass", 0), '.0%')
        
        # format dataframe
        df.drop('mixed', axis = 1, inplace = True)
        df['text'] = df['text'].str.title()
        df.rename(columns = {"score" : "sentiment score", "text" : "entity text"}, inplace = True)
        df.columns = df.columns.str.title()

    docs[orator]['text']['analytics']['sentiment']['dfs'] = dfs
    
    # display the outputs
    display_side_by_side([output["df"] for output in dfs.values()],
                         [f"{key.title()} scores for {docs[orator]['name']} has a True Positive Score of {obj['result']} from a total of {len(obj['df'])} Entities"
                         for key, obj in dfs.items()])
    print()

# dfs = []
# captions = []
# for orator in docs.values():
#     for group, df in orator['text']['analytics']['sentiment']['dfs'].items():
#         dfs.append(df['df'].head(13)
#         captions.append(f"{group.title()} scores for {orator['name']} has a Success of {df['result']} from a total of {len(df['df'])} Entities")
        
# display_side_by_side(dfs, captions)

Unnamed: 0,Entity Text,Sentiment Score,Label,Grouping,Test Result
0,Pentagon,0.9793,positive,ingroup,pass
1,The Armed Forces,0.9793,positive,ingroup,pass
2,Great Britain,0.965023,positive,ingroup,pass
3,The Office Of Homeland Security,0.897459,positive,ingroup,pass
4,Muslims,0.759801,positive,ingroup,pass
5,New Yorkers,0.755239,positive,ingroup,pass
6,Mayor Rudolph Giuliani,0.755239,positive,ingroup,pass
7,Governor George Pataki,0.755239,positive,ingroup,pass
8,American,0.680431,positive,ingroup,pass
9,Lisa Beamer,0.643663,positive,ingroup,pass

Unnamed: 0,Entity Text,Sentiment Score,Label,Grouping,Test Result
0,Arlene,-0.866308,negative,ingroup,fail
1,Muslim,-0.832574,negative,ingroup,fail
2,Christians,-0.818949,negative,ingroup,fail
3,Jews,-0.818949,negative,ingroup,fail
4,Islamic Movement Of Uzbekistan,-0.818724,negative,outgroup,pass
5,Al Qaeda,-0.738021,negative,outgroup,pass
6,United States Authorities,-0.648871,negative,ingroup,fail
7,Taliban Regime,-0.573796,negative,outgroup,pass
8,Taliban,-0.555951,negative,outgroup,pass
9,The United States Of America,-0.535138,negative,ingroup,fail





Unnamed: 0,Entity Text,Sentiment Score,Label,Grouping,Test Result
0,Clinton,0.835665,positive,outgroup,fail
1,Brother Muslims,0.642764,positive,ingroup,pass
2,Israel,0.443557,positive,outgroup,fail
3,Us Enemy,0.427792,positive,outgroup,fail
4,Afghanistan,0.427279,positive,outgroup,fail
5,Gabriel,0.387794,positive,ingroup,pass
6,Messenger Muhammad,0.0,neutral,ingroup,pass
7,Mujahidin Leaders,0.0,neutral,ingroup,pass
8,Secretary William Perry,0.0,neutral,outgroup,fail
9,Us Troops,0.0,neutral,outgroup,fail

Unnamed: 0,Entity Text,Sentiment Score,Label,Grouping,Test Result
0,United Nations,-0.935647,negative,outgroup,pass
1,Jew,-0.883336,negative,outgroup,pass
2,Jews,-0.803315,negative,outgroup,pass
3,United States,-0.784973,negative,outgroup,pass
4,Christians,-0.783726,negative,outgroup,pass
5,Serbs,-0.778184,negative,outgroup,pass
6,The United States,-0.765797,negative,outgroup,pass
7,King Fahd,-0.726643,negative,outgroup,pass
8,Russians,-0.718231,negative,outgroup,pass
9,Jewish-Crusade Alliance,-0.672888,negative,outgroup,pass





In relation to the annotation methods, these results are somewhat counter-intuitive. For Bush’s outgroups, the phrases ‘al Qaeda’, ‘Taliban’, ‘the Taliban Regime’ and ‘Islamic Movement of Uzbekistan’ are annotated as outgroups and generate negative scores between -0.56 and -0.82, as expected. The phrases ‘the United States’, ‘America’, ‘Americans’, ‘the United States of America’ and ‘United States Authorities’, however, are annotated as ingroups but generate negative scores between -0.31 and -0.65. This overlapping range of scores does not correlate with how a President would refer to his country in a time of national mourning.

The phrases ‘Christians’, ‘Jews’, ‘Muslim’ and ‘Arlene’ generate the most negative results for Bush despite being annotated as ingroups. Of these, ‘Arlene’ is the most negative with a score of -0.87 and occurs in the sentence, ‘It was given to me by his mom, Arlene, as a proud memorial to her son’. This mention refers to a police shield given to George Bush by Arlene Howard in memory of her son George, who was killed in the attacks. The context in which ‘Arlene’ is mentioned is entirely positive.

There are 37 annotations for entities associated with elevation or othering seed terms. Any phrase containing the word ‘enemy’ would reasonably be scored as negative. The phrase ‘US Enemy’, nevertheless, generates a score of +0.43, which is higher than ‘Gabriel’ at +0.39, a reference to the Angel Gabriel, whom bin Laden repeatedly reveres. Equally, bin Laden refers to the ‘mujahidin’ as ‘our brothers the people’, yet the term generates a score of -0.47, which is more negative than both ‘Americans’ and ‘Marines’ at -0.28 and -0.36 respectively. Bush’s phrase, “The enemy of America is not our many Muslim friends”, establishes Muslims as an ingroup, yet the term ‘Muslim’ generates the second most negative score of -0.83.

There is also a problem with how different mentions of the same entity are linked across a narrative. Despite being an outgroup of bin Laden, ‘Israel’ receives the third highest score of +0.44, behind a reference to President Clinton at +0.84 and ‘brother Muslims’ at +0.64. ‘Israel’ is mentioned once in bin Laden’s text, but there are two mentions of the ‘Israeli-American alliance’, one mention of the ‘Israeli-American enemy alliance’ and one mention of ‘Israelis’ in the phrase, “their Jihad against their enemies and yours, the Israelis and Americans”. Whereas bin Laden’s text counter-intuitively generates a positive score for ‘Israel’, the phrases ‘Jewish-Crusade Alliance’, ‘Jew’ and ‘Jews’ each generate negative scores as expected. Against expectations, however, the phrases ‘Muslim’, ‘Muslims’ and ‘Ulema’ generate negative scores of -0.60, -0.62 and -0.64 respectively, despite the phrase ‘brother Muslims’ receiving the second highest score for positivity. There are 43 annotations that rely upon linking entities to specific clauses of elevation or othering; resolving these different mentions of the same entity might produce more intuitive results.
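
One way to picture the linking of mentions is to map each surface form to a canonical entity and pool the target scores. The sketch below uses scores from bin Laden's results table above; the alias map and the choice of a simple mean are assumptions made for illustration, not a method used in this experiment.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical alias map from surface forms to a canonical entity name
ALIASES = {
    "jew": "jews",
    "the united states": "united states",
}

# Target sentiment scores taken from bin Laden's results table above
target_scores = [
    ("jew", -0.883336),
    ("jews", -0.803315),
    ("united states", -0.784973),
    ("the united states", -0.765797),
    ("serbs", -0.778184),
]


def aggregate_by_entity(scores, aliases):
    """Group mention scores under their canonical entity and average them."""
    grouped = defaultdict(list)
    for mention, score in scores:
        canonical = aliases.get(mention.lower(), mention.lower())
        grouped[canonical].append(score)
    return {entity: round(mean(values), 4) for entity, values in grouped.items()}


linked = aggregate_by_entity(target_scores, ALIASES)
for entity, score in linked.items():
    print(entity, score)
```

A hand-written alias map only covers spelling variants; phrases such as ‘Israeli-American alliance’ would need a decision about which entity, or entities, the score should be attributed to.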

In addition to linking entities, there is also a problem with linking them to specific noun phrases. In Bush’s text, “Osama bin Laden” and “Egyptian Islamic Jihad” generate neutral scores, although they are annotated as his outgroups. They occur in the sentence, “This group and its leader -- a person named Usama bin Laden -- are linked to many other organizations in different countries, including the Egyptian Islamic Jihad and the Islamic Movement of Uzbekistan.” The noun phrase ‘this group’ refers to a mention of al Qaeda in the previous paragraph, whom Bush variously others as “terrorists” and “murderers”. Nevertheless, there is no obvious way to resolve ‘this group’ to ‘al Qaeda’ in order to establish the group status of bin Laden or the Egyptian Islamic Jihad. Their annotation relies upon real-world knowledge, for which there are 57 annotations in the dataset. Given that these entities are only mentioned once, real-world knowledge is the only way to identify their group status.

Beyond linking seed terms to entities, there is also a problem with linking entities to functional narrative clauses. For bin Laden, the phrase ‘Prophet’, a reference to Muhammad, generates a negative score of -0.53 despite referring to a religious figure of bin Laden’s ingroup, with ‘America’ generating a less negative score. Twelve out of 28 mentions of ‘Prophet’ are followed by the phrase, ‘may God's prayers and blessings be upon him’, which is used as a religious narrative clause to elevate the entity associated with the pronoun ‘him’. This clause is functionally similar to ‘God bless America’, which sanctifies the object of the clause, in this case ‘America’. Such a use of religion should attract high positive sentiment scores, which does not appear to be the case for this algorithm.
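
Counting how often an entity is followed by an elevating clause can be sketched with a regular expression. The clause wording comes from the paragraph above; the `count_elevated_mentions` helper and the sample text are invented for illustration and are not drawn from bin Laden's declaration.

```python
import re

# Clause wording from the paragraph above; accepts a straight or curly apostrophe
CLAUSE = r"may God['’]s prayers and blessings be upon him"


def count_elevated_mentions(text, entity):
    """Return (mentions followed by the clause, total mentions) for an entity."""
    pattern = re.compile(
        re.escape(entity) + r"(?P<clause>[\s,(]*" + CLAUSE + r")?",
        re.IGNORECASE,
    )
    total = elevated = 0
    for match in pattern.finditer(text):
        total += 1
        if match.group("clause"):
            elevated += 1
    return elevated, total


# Invented sample text for illustration only
sample = ("The Prophet, may God's prayers and blessings be upon him, said this. "
          "The Prophet also said that.")
counts = count_elevated_mentions(sample, "Prophet")
print(counts)
```

A count of this kind only detects the clause; attributing its elevating function to the entity would still be a separate step from whatever score a sentiment analyser returns.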

### Scores for Watson Defined Entities

In [63]:
# get emotion scores
# entry["emotion"]["sadness"], entry["emotion"]["joy"], entry["emotion"]["fear"], entry["emotion"]["disgust"], entry["emotion"]["anger"]) 

columns = ["Entity", "Sentiment Score", "Sentiment Label"]

scores = lambda labels, table: pd.DataFrame( # get sentiment scores
                                            [(entry["text"], entry["sentiment"]["score"], entry["sentiment"]["label"]) 
                                                                                  
                                             # iterate through table if positive/negative
                                             for entry in table if entry["sentiment"]["label"] in labels], 
                                            
                                            # set column names
                                            columns = columns) \
                                            .sort_values("Sentiment Score", ascending = 'negative' in labels, ignore_index = True) 
                                            # sorted most positive first, or most negative first for negative labels

for orator in docs.values():
    results = orator['text']['analytics']['entities']
    n = 10
    display_side_by_side([scores(['positive', 'neutral'], results).head(n), scores(['negative'], results).head(n)], 
                         [f"Top {n} Positive Scores for API Defined Entities in {orator['name']}'s Dataset", 
                          f"Top {n} Negative Scores for API Defined Entities in {orator['name']}'s Dataset"])
    print()

Unnamed: 0,Entity,Sentiment Score,Sentiment Label
0,Pentagon,0.9793,positive
1,Great Britain,0.965023,positive
2,Speaker Hastert,0.95322,positive
3,Minority Leader Gephardt,0.95322,positive
4,Majority Leader Daschle,0.95322,positive
5,Senator Lott,0.95322,positive
6,Office of Homeland Security,0.897459,positive
7,Cabinet,0.897459,positive
8,40 billion dollars,0.87996,positive
9,Governor George Pataki,0.755239,positive

Unnamed: 0,Entity,Sentiment Score,Sentiment Label
0,India,-0.898932,negative
1,Japan,-0.898932,negative
2,El Salvador,-0.898932,negative
3,Mexico,-0.898932,negative
4,Iran,-0.898932,negative
5,Islamic Movement,-0.818724,negative
6,Uzbekistan,-0.818724,negative
7,Latin America,-0.801347,negative
8,Al Qaeda,-0.738021,negative
9,Asia,-0.663322,negative





Unnamed: 0,Entity,Sentiment Score,Sentiment Label
0,Khorasan,0.777557,positive
1,American enemy alliance,0.56739,positive
2,Shaykh Safar al-Hawali,0.562657,positive
3,Yemen,0.476234,positive
4,Ibn-Taymiyah,0.473677,positive
5,Afghanistan,0.427279,positive
6,Jahl,0.424174,positive
7,Shaykh Salman al,0.0,neutral
8,Awdah,0.0,neutral

Unnamed: 0,Entity,Sentiment Score,Sentiment Label
0,Fatani,-0.98102,negative
1,Ogaden,-0.98102,negative
2,Qana,-0.98102,negative
3,Khubar,-0.960076,negative
4,"alliance of Jews, Christians",-0.925662,negative
5,Israeli-American alliance,-0.858944,negative
6,Tajikistan,-0.847741,negative
7,Somalia,-0.805516,negative
8,Aziz,-0.771678,negative
9,Bosnia,-0.770552,negative





## Test 2: Proximity of Named Entities to Seed Terms

In [655]:
import os
import json
import tqdm
import pickle

group = ["the United States of America", "Americans", "America", "The United States"]

# assumption: `doc` is Bush's text processed by the pipeline above; it is not defined elsewhere in the notebook
doc = nlp(docs['bush']['text']['rawtext'])

# collect the nouns and verbs of every sentence that mentions one of the group terms
results = dict()
for sentence in doc.sents:
    matches = set(group).intersection(token.text.strip() for token in sentence)
    if matches:
        term = list(matches)[0]
        terms = [token.text for token in sentence if token.pos_ in ["NOUN", "VERB"]]
        results.setdefault(term, []).extend(terms)
            
ibm_df = dict()
for entity, terms in tqdm.tqdm(results.items()):
    
    ibm_df[entity] = {"positive" : list(), "negative" : list(), "neutral" : list()}
    for term in terms:
        analytics = service.analyze(text=term, features=Features(
                                    sentiment=SentimentOptions()),
                                    language = "en").get_result()
        sentiment = analytics['sentiment']['document']
        score = {term : round(sentiment['score'], 2)}
                                    
        if sentiment['label'] == "positive":
#             print(f"appending {score} to {entity}['positive']")
            ibm_df[entity]['positive'].append(score)
        
        elif sentiment['label'] == "negative":
#             print(f"appending {score} to {entity}['negative']")
            ibm_df[entity]['negative'].append(score)
        
        elif sentiment['label'] == "neutral":
#             print(f"appending {score} to {entity}['neutral']")
            ibm_df[entity]['neutral'].append(score)

with open(os.path.join(os.getcwd(), f"manual_cooccurring_scores.json"), "wb") as f:
    f.write(json.dumps(ibm_df).encode("utf-8"))
    
filepath = os.getcwd()
pickle_filename = "ibm_cooccuring_scores.pkl"
with open(os.path.join(filepath, pickle_filename), 'wb') as file:
    pickle.dump(ibm_df, file)

100%|██████████| 4/4 [01:16<00:00, 19.08s/it]


### Load and Display the Results

In [10]:
import os
import pickle
import pandas as pd
from statistics import mean
from cndlib.visuals import display_side_by_side

# load file from disc
filepath = os.getcwd()
pickle_filename = "ibm_cooccuring_scores.pkl"
with open(os.path.join(filepath, pickle_filename), 'rb') as file:
    ibm_df = pickle.load(file)

def get_sentiment(entity):
    
    """
    function to get the sentiment score for the entity being assessed
    """
    
    for target in docs['bush']["text"]['analytics']["sentiment"]["targets"]:
        if target['text'] == entity.lower():
            return target['score']
        
def get_averages_row(df):
    
    """
    function to get the average scores for each sentiment polarity
    """
    
    averages = list()
    for result in df.columns:
        averages.append(f"Average ({round(mean([score.get(list(score.keys())[0]) for score in df[result].tolist() if isinstance(score, dict)]), 2)})")
    return pd.DataFrame(dict(zip(df.columns, averages)), index=[0])

## create DataFrames from the results
dfs = [pd.DataFrame(dict([(k, pd.Series(v, dtype='object')) 
                          for k,v in ibm_df[d].items() 
                          if k in ['positive', 'negative', 'neutral']])) 
       for d in ibm_df]

## function to get the length of the greater number of positive or negative results
get_table_size = lambda df: max([value.count() for key, value in df.items() if key in ['positive', 'negative']])

## cell formatting function to convert the dictionary results to a string
format_cell = lambda x: f"{list(x.items())[0][0]} ({list(x.items())[0][1]})"

## get DataFrame caption
captions = [f"Entity: '{d}' - sentiment {round(get_sentiment(d), 2)}" for d in ibm_df]
    
## display DataFrames
display_side_by_side([df
                      .head(get_table_size(df)) # shrink table to longest of either positivity or negativity
                      .applymap(format_cell, na_action='ignore') # reformat dictionary results to strings
                      .fillna('').append(get_averages_row(df), ignore_index = True) # append the average scores for each columns
                      .rename(columns={key : f"{key.title()}, ({value.count()} Terms)" for key, value in df.items()}) # rename columns to include number of entities
                      for df in dfs], captions)

Unnamed: 0,"Positive, (11 Terms)","Negative, (14 Terms)","Neutral, (38 Terms)"
0,course (0.68),casualties (-0.7),members (0)
1,state (0.31),war (-0.93),events (0)
2,thousands (0.53),surprise attacks (-0.88),Presidents (0)
3,directive (0.68),civilians (-0.84),come (0)
4,commands (0.4),attacked (-0.78),chamber (0)
5,win (0.64),terrorists (-0.98),report (0)
6,expect (0.51),kill (-0.91),known (0)
7,measures (0.66),kill (-0.91),wars (0)
8,protect (0.49),civilians (-0.84),the past 136 years (0)
9,thank (0.99),hate (-0.98),wars (0)

Unnamed: 0,"Positive, (16 Terms)","Negative, (11 Terms)","Neutral, (56 Terms)"
0,sounds (0.56),tragedy (-0.86),touched (0)
1,honored (0.99),forget (-0.66),evening (0)
2,unity (0.57),streets (-0.43),see (0)
3,practiced (0.8),enemy (-0.92),joined (0)
4,counts (0.37),atrocity (-0.98),steps (0)
5,hope (0.41),retreating (-0.9),singing (0)
6,freedom (0.66),forsaking (-0.84),will (0)
7,uphold (0.96),fight (-0.71),playing (0)
8,values (0.87),resolve (-0.25),friend (0)
9,creativity (0.64),died (-0.97),crossed (0)

Unnamed: 0,"Positive, (2 Terms)","Negative, (1 Terms)","Neutral, (4 Terms)"
0,respects (0.99),sympathy (-0.62),people (0)
1,support (0.63),,nations (0)
2,Average (0.81),Average (-0.62),Average (0)

Unnamed: 0,"Positive, (7 Terms)","Negative, (2 Terms)","Neutral, (14 Terms)"
0,makes (0.43),terror (-0.96),tonight (0)
1,demands (0.8),lies (-0.81),following (0)
2,Deliver (0.37),,United States authorities (0)
3,leaders (0.56),,hide (0)
4,determined (0.73),,land (0)
5,grant (0.84),,will (0)
6,wisdom (0.51),,age (0)
7,Average (0.61),Average (-0.89),Average (0)
