# Creating Benchmark Data
---
 
In this notebook the benchmark data is creating for testing the NLP technologies in each experiment.

## InGroup and Outgroup For Each Orator

In this cell a JSON object is created containing the ingroups and outgroups of each orator. These groups are noun phrases identifying the groups and are taken from the speech in which each orator identified their outgroup. For bin Laden, this was his first speech published on 23/08/1996; for Bush he first identified hi outgroup in his State of the Union address on 20/09/2001

In [5]:
import os
import json
from datetime import datetime

groups_benchmark = {
    
    "bush" : {
        "ingroup" : ["america", "american people", "americans", "united states", "united states of America", "my fellow americans", "fellow americans"],
        "outgroup" : ["al qaeda", "taliban regime", "taliban", "egyptian islamic jihad", "islamic movement of uzbekistan"]
    },
    
    "binladen" : {
        "ingroup" : ["people of islam", "islamic world", "ummah of islam", "muslims", "muslim people", "muslim nation"],
        "outgroup" : ["zionist-crusaders alliance", "american crusaders", "american zionist alliance", "american-israeli alliance", \
                      "Jewish-crusade alliance", "saudi regime", "american enemy", "zionist-crusaders", "Christian armies of the Americans", 
                     "american people", "american army", "the bush administration"]
    }
}

filepath = "C:/Users/Steve/OneDrive - University of Southampton/CulturalViolence/KnowledgeBases/Data/"

with open(os.path.join(filepath, "groups_benchmark.json"), "wb") as f:
    f.write(json.dumps(groups_benchmark).encode("utf-8"))
    
print("complete at: ", datetime.now().strftime("%d/%m/%Y - %H:%M:%S"))    

complete at:  19/06/2020 - 15:22:57


In [6]:
# https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths

import pandas as pd

keys = list(groups_benchmark.keys())
print(keys)

frames = []
for value in groups_benchmark.values():
    frames.append(pd.DataFrame(dict([ (k, pd.Series(v)) for k, v in value.items() ]), index = None).fillna(""))

display(pd.concat(frames , keys = keys))

['bush', 'binladen']


Unnamed: 0,Unnamed: 1,ingroup,outgroup
bush,0,america,al qaeda
bush,1,american people,taliban regime
bush,2,americans,taliban
bush,3,united states,egyptian islamic jihad
bush,4,united states of America,islamic movement of uzbekistan
bush,5,my fellow americans,
bush,6,fellow americans,
binladen,0,people of islam,zionist-crusaders alliance
binladen,1,islamic world,american crusaders
binladen,2,ummah of islam,american zionist alliance


## Instantiate the Pipeline

In [1]:
%%time
import importlib
import pipeline
importlib.reload(pipeline)
cnd = pipeline.CND(extended = False)

print(cnd.nlp.meta['name'])
print([pipe for pipe in cnd.nlp.pipe_names])

core_web_md
['tagger', 'parser', 'ner', 'Named Entity Matcher', 'merge_entities', 'Concept Matcher']
Wall time: 35 s


## Selecting relevant sentences

In [None]:
%%time

filepath = r"C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset\Osama bin Laden\fulltext.txt"

with open(filepath, "r") as f:
    fulltext = f.read()
    
doc = cnd(fulltext)

sents_dict = dict()

for sent in doc.sents:
    if doc[sent.end -1].text == '\n':
        sents_dict[len(sents_dict)] = str(sent)
    else:
        sents_dict[len(sents_dict)] = str(sent)
        
print(len(sents_dict))

## Capturing Sentences Relating to Ingroup and Outgroup

In this notebook we iterate over all the sentence in the speech if appropriate manually classify each sentence as either ingroup elevation or outgroup othering.

In [8]:
from datetime import datetime
import os
import json
from IPython.display import clear_output
from spacy import displacy
from visuals import sent_frame

sents_dict = dict()

dirpath = os.getcwd()
ingroup = dict()
outgroup = dict()
index = dict()
ingroup_file = "ingroup_sents.json"
ingroup_filepath = os.path.join(dirpath, ingroup_file)
outgroup_file = "outgroup_sents.json"
outgroup_filepath = os.path.join(dirpath, outgroup_file)
index_filepath = os.path.join(dirpath, "index.json")

# open previous file and progress index

try:
    with open(ingroup_filepath, 'r') as fp:
        ingroup = json.load(fp)
except:
    pass

try:
    with open(outgroup_filepath, 'r') as fp:
        outgroup = json.load(fp)
except:
    pass

try:
    with open(index_filepath, 'r') as fp:
        index = json.load(fp)
except:
    index = 0

#iterate over each sentence dictionary for classification of ingroup or outgroup
while index < len(sents_dict):

    # record progress  through dictionary object
    with open(index_filepath, "wb") as f:
            f.write(json.dumps(index).encode("utf-8"))

    # clear screen
    clear_output(wait=True)# get text
    
    # show progress through input_dict
    print(f'{index} / {len(sents_dict)}')
    
    # get sentence text
    text = sents_dict[index]

    # parse text
    doc = cnd(text)

    # if the option to show the dependency parse is passed display it
#     displacy.render(doc, style="dep")

    # display the sentence frame in compact form
    display(sent_frame(doc))

    entry = input('ingroup(i) / outgroup(o) / delete (d) / back(b)').lower()
    
    # ask if sentence is refering to an ingroup or outgroup
    if entry in ['i', 'o']:        
        if entry == 'i': # add sentence to ingroup dictionary if user selects ingroup
            print(len(ingroup), ' => ingroup add: ', text)
            ingroup[len(ingroup)] = text
            
            # write dictionary to file
            with open(ingroup_filepath, "wb") as f:
                f.write(json.dumps(ingroup).encode("utf-8"))
            
        else: # else add sentence to outgroup dictionary
            print(len(outgroup), ' => outgroup add: ', text)
            outgroup[len(outgroup)] = text
            
            # write dictionary to file
            with open(outgroup_filepath, "wb") as f:
                f.write(json.dumps(outgroup).encode("utf-8"))
                
        # increase index by 1
        index += 1
    
    # if user enters 'd' then go back by 1 in the dictionary and delete
    elif entry == 'd': 
        if index != 0:
            
            # test whether the previous sentence was ingroup or outgroup and delete from respective dictionary
            
            if index >= 0 and len(ingroup) - 1 >= 0 and sents_dict[index-1] == ingroup[len(ingroup) - 1]:
                print('deleting from ingroup: ', ingroup.pop())

                with open(ingroup_filepath, "wb") as f:
                    f.write(json.dumps(ingroup).encode("utf-8"))

            elif index >= 0 and len(outgroup) - 1 >= 0 and sents_dict[index-1] == outgroup[len(outgroup) - 1]:
                print('deleting from outgroup: ', outgroup.pop())

                with open(outgroup_filepath, "wb") as f:
                    f.write(json.dumps(outgroup).encode("utf-8"))

            index -= 1
        
        else:
            print('iterating backwards by one sentence')
            pass
        
    # quit    
    elif entry == 'q':
        break
        
    else:
        index += 1

print("complete at: ", datetime.now().strftime("%d/%m/%Y - %H:%M:%S"))  #1220

complete at:  19/06/2020 - 15:23:36


## Create the Gold Dataset

In [None]:
import os
import json
import jsonlines
from tqdm import tqdm
import pandas as pd
from IPython.display import clear_output
from spacy import displacy
import importlib
import visuals
importlib.reload(visuals)

filenames = ["bush_ingroup_sents.jsonl",
             "bush_outgroup_sents.jsonl",
             "laden_ingroup_sents.jsonl",
             "laden_outgroup_sents.jsonl"]

path = os.getcwd()
index_filename = "index.json"
index_filepath = os.path.join(path, index_filename)

gold_ents_filename = "gold_ents.jsonl"
gold_ents_filepath = os.path.join(path, gold_ents_filename)

gold_ents = []

try:  
    with jsonlines.open(gold_ents_filepath) as f:
        gold_ents = list(f.iter())
    if len(test_chunks) == 0:
        for filename in filenames:
            with jsonlines.open(os.path.join(path, filename)) as f:
                lines = list(f.iter())
                for i, line in enumerate(lines): 
                    gold_ents.append({len(gold_ents) : line[str(i)]})
except:
    pass
            
try:
    with open(index_filepath, "r") as index_json:
        ref = json.load(index_json)
        
except:
    ref = 0
    
print(ref)

while ref < len(gold_ents):
    line = gold_ents[ref]
    
    with open(index_filepath, "wb") as f:
        f.write(json.dumps(ref).encode("utf-8"))
    
    while True:
        doc = cnd(line[str(ref)])
        line["gold_chunks"] = visuals.chunk_custom_attrs(list(doc._.custom_chunks), json = True)
        index = 0

        # iterate through each noun chunk to record whether ingroup/outgroup
        while index < len(line["gold_chunks"]):
            
            clear_output(wait=True)
            options = {"compact": True}
            displacy.render(doc, style = "dep", options = options)
            pd.set_option('display.max_colwidth', None)
            pd.set_option('display.max_columns', None)
            display(pd.DataFrame([line[str(ref)]]))
            df = [["spans:"] + [ent["text"] for ent in line["gold_chunks"]], \
                  ["entity:"] + [ent["entity"] for ent in line["gold_chunks"]], \
                  ["modifier:"] + [ent["modifier"] for ent in line ["gold_chunks"]], \
                  ["span_type:"] + [ent["span_type"] for ent in line["gold_chunks"]], \
                  ["ATTRIBUTE:"] + [ent["ATTRIBUTE"] for ent in line["gold_chunks"]]]
            display(pd.DataFrame(df))

            
            ent = line["gold_chunks"][index]
            print(ref, '/', len(gold_ents))
            print(f'Named Entity ({ent["span_type"]}) ({ent["ATTRIBUTE"]}): {ent["text"]}')

            group = None
            answers = {"i" : "ingroup", "o" : "outgroup", "y" : True, "n" : False} 
            
            while group not in ["i", "o", "q", "", "b"]:
                group = input("grouping? (i/o/q/b/q)")
            if group == "":
                ent["grouping"] = ""
                ent["detectable"] = ""
                index += 1
                continue
            if group == "q":
                raise SystemExit("Stop right there")
            if group == "b":
                if index != 0:
                    index -= 1
                continue

            detectable = None
            while detectable not in ["y", "n"]:
                detectable = input("detectable (y/n)")

            ent["grouping"] = answers[group]
            ent["detectable"] = answers[detectable]

            index += 1

        # check results
        attrs = ["text", "entity", "span_type", "ATTRIBUTE", "grouping", "detectable"]
        results = [[result[attr] for attr in attrs] for result in line["gold_chunks"]]
        display(pd.DataFrame(results, columns = attrs))
        
        satisfied = None
        while satisfied not in ["y", "n", "q"]:
            satisfied = input("safisfied? (y/n/q)")
        if satisfied == "q":
            raise SystemExit("Stop right there")
        if satisfied == "y":
            with jsonlines.open(os.path.join(path, gold_ents_filepath), 'w') as writer:
                writer.write_all(gold_ents)
            ref += 1
            break

## Code for making corrections

In [None]:
import os
import jsonlines

path = os.getcwd()
gold_ents_filename = ""
gold_ents_filepath = os.path.join(path, gold_ents_filename)

gold_ents = []
new_gold_ents = []

with jsonlines.open(gold_ents_filepath) as f:
    gold_ents = list(f.iter())

attrs = ["text", "grouping", "detectable"]

for i, gold_ent in enumerate(gold_ents):
    for ent in gold_ent["gold_chunks"]:
        if ent["text"].lower() == "palestine":
            print(f'text: {gold_ent[str(i)]}')
            for attr in attrs:
                print(f'{attr}: {ent[attr]}')
            ent["grouping"] = "ingroup"
            ent["detectable"] = "False"
            for attr in attrs:
                print(f'{attr}: {ent[attr]}')
            print('----')
            
# with jsonlines.open(os.path.join(path, gold_ents_filepath), 'w') as writer:
#     writer.write_all(gold_ents)

## Get the summary of the dataset

In [4]:
import os
import jsonlines

path = os.getcwd()
gold_ents_filename = "gold_ents.jsonl"
gold_ents_filepath = os.path.join(path, gold_ents_filename)

gold_ents = []
new_gold_ents = []

with jsonlines.open(gold_ents_filepath) as f:
    gold_ents = list(f.iter())
    
ents_total = 0
detectable_count = 0
non_detectable_count = 0

for gold_ent in gold_ents:
    for ent in gold_ent["gold_chunks"]:
        if ent["grouping"]: 
            ents_total += 1
        if ent["detectable"] == True:
            detectable_count += 1
        if ent["detectable"] == False:
            non_detectable_count += 1
            
print("total number of named entities:", ents_total)
print("total number of detectable named entities:", detectable_count)
print("total number of non-detectable named entities:", non_detectable_count)


total number of named entities: 529
total number of detectable named entities: 97
total number of non-detectable named entities: 359


## Annotation Notes

In this sentence "America" refers to Territory, whereas in others "America" will refer to the ingroup. Should GPE refer to a group asset rather than group?
- "In the past week, we have seen the American people at their very best everywhere in America."

"Muslims in nations" should be a noun chunk, and group.
- "Both Americans and Muslim friends and citizens, tax-paying citizens, and Muslims in nations were just appalled and could not believe what -- what we saw on our TV screens."

"behalf of the American people", should this be chunked? add to noun chunk gold list.
- "	And on behalf of the American people, I thank they world for its outpouring of support."

Need to make a decision about whether to create span_chunks in the ner component.

Should this become a hypernymic phrase
- "America has no truer friend than Great Britain"

Chunk as "many millions of Americans" - add to noun chunk gold list.
- "It's practiced freely by many millions of Americans, and by millions more in countries that America counts as friends."

"GPE" elements look like they become assets rather than a group.
for now, mark assets as ingroup and figure out how to alter later.
- "And what is at stake is not just America's freedom."

appositional modifier (appos) phrases
- "We are joined in this operation by our staunch friend, Great Britain."
- "Our staunch friends, Great Britain, our neighbors Canada and Mexico, our NATO allies, our allies in Asia, Russia and nations from every continent on the Earth have offered help of one kind or another -- from military assistance to intelligence information, to crack down on terrorists' financial networks."
- "At the same time, we are showing the compassion of America by delivering food and medicine to the Afghan people who are, themselves, the victims of a repressive regime."
- "I even had nice things to say about my friend, Ted Kennedy."
- "O protectors of monotheism and guardians of the faith; O successors of those who spread the light of guidance in the world; O grandsons of Sa'd Bin-Abi-Waqqas, al-Muthanna Bin-Harithah al-Shibani, al-Qa'qa' Bin-'Amr al-Tamimi, and the companions who fought alongside them: You rushed to join the Army and the Guard merely to join the jihad for the cause of God in order to spread the word of God and to defend Islam and the land of the two holy mosques against invaders and occupiers, which is the highest degree of belief in religion."
- "That was the only door left open to the public for ending injustice and upholding right and justice, and in whose interests do Prince Sultan and Prince Nayif plunge the country and the people into an internal war that would destroy everything, enlisting the aid and advice of those who fomented internal sedition in their country and using the people's police force to put down the reform movement there and pit members of the public one against the other—leaving the main enemy in the region, namely the Jewish-American alliance, safe and secure, having found such traitors to implement its policies aimed at exhausting the nation's human and financial resources internally."
- "But, thank God, the vast majority of the people, civilians and military, are aware of that sinister plan and will not allow themselves to be an instrument for strikes against one another in implementation of the policy of the main enemy, namely the Israeli-American alliance, through the Saudi regime, its agent in the country."

pronoun modifer denoting group
- "Our Islamic nation has been tasting the same for more than 80 years of humiliation and disgrace, its sons killed and their blood spilled, its sanctities desecrated."

interesting phrase, the Afghan people is only detectable as the ingroup if USA is also marked as ingroup.
an extended "billion Afghan people" can be detected as ingroup from "we are the friends of"
- "The United States of America is a friend to the Afghan people, and we are the friends of almost a billion worldwide who practice the Islamic faith."

the conjunctions from the head, "diligent and determined work" are split between dependency trees.
- "We may never know what horrors our country was spared by the diligent and determined work of our police forces, the FBI, ATF agents, federal marshals, Custom officers, Secret Service, intelligence professionals and local law enforcement officials, under the most trying conditions."

the named entities "Africa" and "Latin America" are split from the conjunction with the head "friends and allies"
- "Together with friends and allies from Europe to Asia, and Africa to Latin America, we will demonstrate that the forces of terror cannot stop the momentum of freedom."

should expand to "brave men of the United States military" and "brave women of the United States military"
- "And we have one more great asset in this cause: The brave men and women of the United States military."

should expand so that "Representative" applies to each surname of the conjunction
- "I also want to thank Representative Porter Goss, LaFalce, Oxley, and Sensenbrenner for their hard work."

should expand to "America is ally against terror" and "Afghanistan is ally against terrorism"
- "America and Afghanistan are now allies against terror."

"terror" is annotated as a verb when it should be a noun
- "Al Qaeda is to terror what the mafia is to crime."

"we" is a hypernym of the hyponym "largest source of humanitarian aid"
- "After all, we are currently its largest source of humanitarian aid; but we condemn the Taliban regime."

how to mark "military capability of the Taliban regime." as an outgroup asset when the root and modifier don't refer to an outgroup term.
- "These carefully targeted actions are designed to disrupt the use of Afghanistan as a terrorist base of operations and to attack the military capability of the Taliban regime."

the verb "close" is marked as an ADJ (as in reference to distance) rather then a VERB
- "More than two weeks ago, I gave Taliban leaders a series of clear and specific demands: Close terrorist training camps; hand over leaders of the Al Qaeda network; and return all foreign nationals, including American citizens, unjustly detained in your country."

adpositional phrase "indifference of governments" is split across dependency tree
- add to custom chunks test data set
- "Terrorist groups like al Qaeda depend upon the aid or indifference of governments."

The phrase "the suffering the Taliban have brought upon Afghanistan" is a hyponym of the hyernym, "the terrible burden of war."
(interesting story phrase, Bush shares the responsibility for the "burden of war")
- "And my country grieves for all the suffering the Taliban have brought upon Afghanistan, including the terrible burden of war."

noun chunk should be "Abandoned al Qaeda houses in Kabul"
- add to custom chunks test data set
- how to markup the noun phrase?
- "Abandoned al Qaeda houses in Kabul contained diagrams for crude weapons of mass destruction."

split dependency for "weapons of mass destruction"
- add to custom chunks test data set
- "North Korea is a regime arming with missiles and weapons of mass destruction, while starving its citizens."

split dependency for "people in industry"
- add to custom chunks test data set
- good test phrase for adpositional phrases
- "The same thing has befallen the people in industry and in agriculture, the cities and villages, and the people in the desert and the rural areas."

noun chunk should be "brothers in Palestine"
- add to custom chunks test data set
- "The money paid for US goods is turning into bullets [fired at] the chests of our brothers in Palestine, and tomorrow the chests of the sons of the country of the two holy mosques; by buying their goods we are strengthening their economy while we continue to become poorer."

noun chunks should be "Russians in Afghanistan" and "Serbs in Bosnia-Herzegovina"
- add to custom chunks test data set
- "I say: If the sons of the country of the two holy mosques—who went to fight the Russians in Afghanistan, the Serbs in Bosnia-Herzegovina, and who are now fighting in Chechnya, and God has granted them victory over the Russians, who are allying with you—and they are also fighting in Tajikistan—believe in the need to fight against atheism everywhere, they have the strength and enthusiasm in the land in which they were born to defend their greatest holy sites, the holy Ka'bah, the qiblah of all Muslims."

split dependency for "[that enemy], namely the Israeli-American alliance"
- "There is no greater duty after faith than warding [daf'] off [that enemy], namely the Israeli-American alliance occupying the land of the two holy mosques and the land of the ascension of the Prophet, may God's prayers and blessings be upon him."

should read "sons of Islam" and "daughters of Islam"
- "Sons and daughters of Islam!"

should read, "leaders of atheism in the United States"?
- add to noun chunk gold data.
- "Besides, this claim is no longer valid following the statements made by the leaders of atheism in the United States, the most recent being that of US Defense Secretary William Perry after the al-Khubar blast targeting US troops."

merged dependency, "Armed Forces, Guard" should be split as a conjunction.
- "We alert you to the fact that the regime might carry out operations against members of the Armed Forces, Guard, or security forces and try to attribute them to the mujahidin with a view to driving a wedge between them and you."

merged dependency, should read, "our holy sites" of "the Jews" and "Christians"
- "By doing that we will have contributed to ridding our holy sites of the Jews and Christians and forced them to leave our land, defeated, God willing."

merged conjunction, should read, "roles of Pharoah", "roles of Ceasar" and "roles of Chosroes".
- "Today, the roles of Pharaoh, Caesar, and Chosroes have been taken up by Israel and United States, who first occupied our Aqsa Mosque, in the direction of which our Holy Prophet performed his prayers."

missing entity, "Israeli Tanks"
- In these days, Israeli tanks rampage across Palestine, in Ramallah, Rafah and Beit Jala and many other parts of the land of Islam

very good hypernym phrase
- "The creation of Israel is a crime which must be erased."

split apositional phrase, "usurpation of their land"
- "Thus the American people have chosen, consented to, and affirmed their support for the Israeli oppression of the Palestinians, the occupation and usurpation of their land, and its continuous killing, torture, punishment and expulsion of the Palestinians."

strong Anti-semetic statement
- "As a result of this, in all its different forms and guises, the Jews have taken control of your economy, through which they have then taken control of your media, and now control all aspects of your life making you their servants and achieving their aims at your expense; precisely what Benjamin Franklin warned you against."

missing entity, "American Friends"
- "The freedom and democracy that you call to is for yourselves and for white race only; as for the rest of the world, you impose upon them your monstrous, destructive policies and Governments, which you call the 'American friends'."

noun chunk should read, "Indians in Kashmir" and "Muslims in Southern Philippines."
- add to chunker gold data
- "(4) We also advise you to stop supporting Israel, and to end your support of the Indians in Kashmir, the Russians against the Chechens and to also cease supporting the Manila Government against the Muslims in Southern Philippines."


merged conjunction, should read, "policies of sub dual", "theft", "occupation" and "policy of supporting the Jews"
- add to noun chunk gold data
- "(7) We also call you to deal with us and interact with us on the basis of mutual interests and benefits, rather than the policies of sub dual, theft and occupation, and not to continue your policy of supporting the Jews because this will result in more disasters for you."

should read, "people in Palestine" and "people in Lebanon"
- "But after it became unbearable and we witnessed the oppression and tyranny of the American/Israeli coalition against our people in Palestine and Lebanon, it came to my mind."

good hypernym phrase
- "And that day, it was confirmed to me that oppression and the intentional killing of innocent women and children is a deliberate American policy."