# A Schema of Groups
---

In this notebook we present a typology for modelling groups from natural language.

The schema is structured using the "five group problems" identified by Martin, which any group must resolve to be successful. As group attributes of the schema, they are applied across eight different classifications of ideology. At the highest level of abstraction these features are:

1. Identity: identifying who is the ingroup and who is an outgroup, to determine who should, and who should not, benefit from the advantages of group living.
  * identity: named groups
  * ingroup: groups identifying an ingroup
  * outgroup: groups identifying an outgroup
  * entities: entities identifying the group context
2. Hierarchy: systems of governance to establish group leadership to resolve problems caused by status-seeking and enable resource distribution.
  * Title: titles given to people within a particular context
  * People: who are the figures a group reveres
3. Disease control: the control of diseases for the maintenance of group health.
  * this feature is considered as a social category rather than a feature
4. Trade: independent of hierarchy, systems constituting the fair terms of exchange between individuals towards developing an underpinning concept of altruism in a society.
  * Good: terms of exchange considered to be good
  * Bad: terms of exchange considered to be bad
5. Punishment: the group jurisprudence for moderating the systems seeking to resolve the other group problems.
  * Right: considered to be just
  * Wrong: considered to be unjust and worthy of punishment
  
We identified sets of seed words as linguistic representations of religion and ideology. Each set was classified under a named concept and then placed in the schema according to its attribute and group ideology.

This process identified eight classifications for group ideologies referred to by each orator: social, academia, medical, geopolitics, religion, economic, justice and military.
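
The nesting used by the schema (ideology, then attribute, then named concept, then seed terms) can be sketched with a toy example. The two-ideology `toy_schema` and the helper `iter_terms` are illustrative assumptions and not part of the notebook; only the nesting mirrors the full schema.

```python
# Toy schema with the same nesting as the full schema:
# ideology -> attribute -> named concept -> list of seed terms.
# The contents are a small illustrative subset, not the full typology.
toy_schema = {
    "social": {
        "ingroup": {"SELF": ["we", "our", "i", "us"]},
        "outgroup": {"OUTCAST": ["fool", "hypocrite"]},
    },
    "military": {
        "identity": {"ARMEDGROUP": ["army", "navy"]},
        "outgroup": {"ADVERSARY": ["enemy", "traitor"]},
    },
}


def iter_terms(schema):
    """Yield one (ideology, attribute, concept, term) tuple per seed term."""
    for ideology, attributes in schema.items():
        for attribute, concepts in attributes.items():
            for concept, terms in concepts.items():
                for term in terms:
                    yield ideology, attribute, concept, term


rows = list(iter_terms(toy_schema))
print(len(rows))  # number of seed terms in the toy schema
print(rows[0])    # first (ideology, attribute, concept, term) tuple
```

Flattening the schema this way makes it easy to see that every seed term carries exactly one concept, one attribute and one ideology label per entry.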

## The Schema

The next cell contains the schema and the seed terms classified by named concept.

In [1]:
# Create a json object of the group typology

import json
import os
from datetime import datetime

group_schema = {
    "social" : {
        "identity" : {
            "SOCIALGROUP" : ["ladies", "gentlemen", "men", "women", "Women", "boys", "girls", "youth", "society", "people", "children", "minority", \
                          "mankind", "passengers", "stranger", "group", "community", "organization", "organisation", "member", \
                          "tribe", "population", "ummah", "Ummah", "human", "personnel", "person", "man", "Woman", "woman", "boy", "girl", "child", \
                          "humankind", "countryman", "volunteer", "individual", "freeman", "humanity", "society", "public", "movement", \
                          "predecessors", "successor"]},
        "ingroup" : {
            "SELF" : ["we", "our", "i", "us"], 
            "FAMILY" : ["family", "parent", "children", "forefather", "spouse", "mother", "father", "husband", "wife", "mom", "mum", "dad", "son", \
                        "daughter", "brother", "brethren", "sister", "grandson", "granddaughter", "descendent", "ancestor", "relative", "brotherhood"],
            "AFFILIATE" : ["ourselves", "collaborator", "friend", "ally", "associate", "partner", "companion", "fellow", "kinship", "neighbor", \
                          "advocate"]},
        "outgroup" : {
            "OUTCAST":  ["imbecile", "critic", "wolf", "snake", "dog", "hypocrite", "fool"]},
        "entity" : {
            "CAUSE" : ["goal", "cause", "struggle", "action", "commitment", "effort", "awakening"],
            "CREDO" : ["philosophy", "Philosophy", "ideology", "Belief", "belief", "creed", "atheism"],
            "PLACE" : ["world", "place", "ground", "homeland", "sanctuary", "safe haven", "land", "sea", "site", "underworld", "north", "south", "east", \
                       "west", "jungle", "desert", "street", "road"],
            "SOCFAC" : ["installation", "camp", "shelter", "facility", "infrastructure", "refuge", "tower", "house", "home"],
            "SOCWORKOFART" : ["poetry", "song", "picture", "art", "poem", "entertainment", "symbol", "banner"]},
        "hierarchy" : {
            "TITLE" : ["mr", "mrs", "miss", "ms"],
            "LEADER" : ["leader", "hero"],
            "LEADERSHIP" : ["leadership", "protagonist"]},
        "trade" : {
            "BENEVOLANCE" : ["love", "kinship", "honesty", "tolerance", "patience", "decency", "sympathy", "peace", "good", "best", "great", \
                             "goodness", "hope", "courage", "resolve", "friendship", "loving", "peaceful", "rightness", "brave", \
                             "strong", "peaceful", "fierce", "honesty", "kind", "generous", "resourceful", "truth", "truthful", "pride", "defiance", "strength", \
                             "comfort", "solace", "respect", "dignity", "honor", "honour", "danger", "freedom", "honorable", "grateful", "compassion", "condolence", \
                             "sympathy", "fulfillment", "dedication", "dignity", "noble", "truthfulness", "happy", "enthusiasm", "perseverance", \
                             "persistence", "toughness", "beauty", "beautiful", "commendable", "praiseworthy", "destiny", "generosity", "supremecy", \
                             "obedience", "superior", "success", "cooperation", "laugh", "bravery", "loyalty", "steadfastness", "creativity"],
                            # help
            
            "MALEVOLANCE" : ["grief", "grievous", "sorrow", "tradegy", "damage", "bad", "misinformation", "confusion", "falsehood", "humiliation", "catastrophe", \
                             "terror", "fear", "threat", "cruelty", "danger", "anger", "harm", "suffering", "harrassment", "deceit", "death", "anger", \
                             "hate", "hatred", "adversity", "chaos", "loneliness", "sadness", "misery", "prejudice", "horrifying", "cynicism", "despair", \
                             "rogue", "hostile", "dangerous", "tears", "peril", "unfavourable", "vile", "sad", "cowardly", "grieve", "shame", "ugly", \
                             "insanity", "arrogance", "hypocrisy", "horror", "monstrous", "suspicious", "disaster", "malice", "menace", "repressive", \
                             "malicious", "nightmare", "isolation", "debauchery", "greed", "disgrace", "calamity", "rejection", "disregard", "ridicule", \
                             "marginalize", "neutralize", "harass", "inefficiency", "terrible", "frightful", "unreasonable", "sinister", "brutal", "false", \
                            "impotence", "weakness", "foolish", "destructive", "haughtiness", "distortion", "deception", "materialistic", "selfishness", \
                            "failure", "plight", "havoc", "dire", "shady"]},
        "punishment" : {
            "PERMITTED" : ["fair", "acceptable", "right", "deed"],
            "FORBIDDEN" : ["unfair", "unacceptable", "wrong", "wrongdoing", "catastrophic", "disastrous", "catastrophe", "mischief", "disappoint", "humiliate", "deceive", "lie"]}
    },
    
    "academia" : {
        "identity" : {
            "ACADEMICGROUP" : ["student", "graduates", "graduate", "scholar", "analyst"]},
        "ingroup" : {},
        "outgroup" : {},
        "entity" : {
            "ACADEMICENTITY" : ["knowledge", "intelligence", "wisdom", "information", "lecture"],
            "ACADEMICFACILITY" : ["school", "university"],
        },
        "hierarchy" : {
            "ACADEMICTITLE" : ["teacher", "dr", "professor", "intellectual"]
        },
        "trade" : {
            "PHILOSOPHY" : ["scientific"]
        },
        "punishment" : {}
        
    },
    
    "medical" : {
        "identity" : {
            "MEDICALGROUP" : ["blood donors", "disabled", "injured"]},
        "ingroup" : {},
        "outgroup" : {
            "VERMIN" : ["vermin", "parasite"]},
        "entity" : {
            "MEDICALENT" : ["heart", "soul", "skin graft", "tomb", "palm", "limb", "drug", "chemical", "biological", "heroin", "vaccine", "health", \
                            "blood", "body", "medicine", "remedy", "organ", "intestine", "tongue"],
            "NOURISHMENT" : ["food"],
            "SEXUALITY" : ["fornication", "homosexuality", "sex"],
            "MEDICALFAC" : ["hospital"],
            "INTOXICANT" : ["intoxicant", "sarin", "nerve agent", "nerve gas"],
            "DISEASE" : ["AIDS", "anthrax"]},
        "hierarchy" : {
            "MEDTITLE" : ["doctor", "nurse"]},
        "trade" : {
            "HEALTHY" : ["life"],
            "UNHEALTHY" : ["death", "unhealed", "illness", "disease", "filthy", "wound", "injury", "scar", "suffocation", "weak", "injure", "unhealthy"]},
        "punishment" : {
            "CLEANSE" : ["cure", "remedy", "cleanse"],
            "POISON" : ["poison",  "pollute", "bleed"]}
    },
    
    "geopolitics" : {
        "identity" : {
            "GPEGROUP" : ["ministry", "Ministry", "government", "Government", "civilian", "nation", "Union", "civilization", "congress", "Congress", "alliance", \
                          "patriot", "citizen", "journalist", "diplomat", "agency", "delegate", "coalition", "axis", "compatriots", "administration", \
                          "monarchy", "political party", "communist", "statelet", "emigrant", "oppressed", "persecuted", "Empires", "empire", \
                         "compatriot", "refugee"]},
        "ingroup" : {},
        "outgroup" : {
            "GPEOUTGROUP" : ["Regime", "regime", "opponent", "dictatorship"]},
        "entity" : {
            "GPEENTITY" : ["human rights", "unity", "diplomatic", "citizenship", "legislation", "senate", "secretary", "elect", "election", "reign", \
                           "embassy", "policy", "diplomacy", "media", "power", "edict", "institution", "petition", "memorandum", "superpower", "opposition", \
                           "capital", "humanitarian aid"],
            "TERRITORY" : ["territory", "planet", "peninsula", "city", "country", "neighborhood", "region", \
                           "area", "peninsula", "continent", "kingdom", "empire"],
            "CAMPAIGN" : ["campaign"],
            "GPEFAC" : [],
            "GPEWORKOFART" : []},
        "hierarchy" : {
            "GPETITLE" : ["president", "minister", "speaker", "prime minister", "senator", "mayor", "governor", "President", "Minister", "Prime Minister", \
                          "Speaker", "Senator", "Mayor", "Governor", "King", "king", "prince", "Leader", "commander-in-chief", "ruler", "chairman", \
                          "congressman", "amir", "pharaoh", "dignitary", "reformer", "caliph"],
            "GOVERNANCE" : ["rule"]},
        "trade" : {
            "GPEIDEOLOGY" : ["liberty", "sovereignty", "pluralism", "patriotism", "democracy", "communism", "bipartisanship"],
            "AUTHORITARIANISM" : ["nationalism", "fascism", "nazism", "totalitarianism", "nazi", "tyranny", "sectarianism", "anti-semitism"],
            "CONFRONTATION" : ["jihad", "confrontation", "feud", "partisanship", "division", "dissociation", "boycott", "rivalry", "crisis"]},
        "punishment" : {
            "JUST" : ["reform", "reformist", "equality"],
            "UNJUST" : ["unjust", "inequality", "usury", "poverty", "slavery", "injustice", "oppression", "repression", "subdual", "misrule", "occupation", "usurpation", \
                        "starvation", "starving", "servitude", "surpression", "oppress", "persecute", "sedition", "unjustified"]} 
    },
    
    "religion" : {
        "identity" : {
            "RELGROUP" : ["ulamah", "Ulamah", "ulema", "moguls", "Moguls", "Sunnah", "Seerah", "Christian", "sunnah", "christian", "muslims", "islamic", \
                          "sunnah", "seerah", "polytheist", "houries", "the people of the book", "merciful", "religionist"]},
        "ingroup" : {
            "BELIEVER" : ["believer"]},
        "outgroup" : {
            "APOSTATE" : ["kufr", "Kufr", "kuffaar", "infidel", "evildoer", "cult", "mushrik", "unbeliever", "disbeliever", "mushrikeen", "pagan", \
                          "idolater", "apostate", "heretic", "atheist"]},
        "entity" : {
            "RELENTITY" : ["faith", "pray", "prayer", "mourn", "vigil", "prayer", "remembrance", "praise", "bless", "last rites", "angel", \
                           "memorial", "revelation", "sanctify", "grace", "religion", "repentance", "exalted", "repent", "seerah", "confession", \
                           "exaltation", "praise", "commandment", "wonderment", "supplication", "worship", "testament", "blessing", "ascension"],
            "CALLING" : ["calling"],
            "RELIGIOUSLAW" : ["shari'ah", "Shari'ah", "shari'a", "Shari'a", "fatawa", "Fatawa", "fatwa", "Fatwas"],
            "FAITH" : ["da'ees", "piety", "creationism", "monotheism"],
            "RELFAC" : ["Mosque", "mosque", "sanctity", "cathedral", "Cathedral"],
            "RELWORKOFART" : ["sheerah"],
            "RELPLACE" : ["heaven", "paradise"]}, 
        "hierarchy" : {
            "RELFIGURE" : ["Apostle", "Prophet", "apostle", "prophet", "lord", "priest", "all-mighty"],
            "RELTITLE" : ["priest", "cleric", "Immam", "immam", "saint", "st.", "sheikh", "shaykh", "preacher"]},
        "trade" : {
            "HOLY" : ["religious", "orthodox", "pious", "devout", "holy", "righteous", "serve", "sacrifice", "forgive", "martyrdom", "piety", "polytheism", "divine" \
                      , "miracle", "eulogize"],
            "UNHOLY" : ["hell", "hellish", "unholy", "unrighteous", "evil", "devil", "devilish", "demon", "demonic", "evildoe", "satan", "immoral", \
                        "immorality", "non-righteous", "blasphemy", "misguidance", "heresy", "abyss", "paganism"]},
        "punishment" : {
            "VIRTUE" : ["Grace", "grace", "halal", "forgiveness", "mercy", "righteous", "mercy", "righteousness", "purity"],
            "SIN" : ["iniquity", "haram", "adultery", "sin", "sinful", "blashpeme", "misdeed", "infidelity"]}
    },
    
    "economic" : {
        "identity" : {
            "ECONGROUP" : ["merchant", "employee", "economist", "worker", "entrepreneur", "shopkeeper", "servant", "company", "shareholder", \
                           "contractor", "merchant", "consumer", "rich", "passenger", "corporation", "customer"]},
        "ingroup" : {},
        "outgroup" : {           
            "COMPETITOR" : ["competitor"]},
        "entity" : {
            "ECONENTITY" : ["economic", "trade", "work", "currency", "bank", "business", "economy", "asset", "fund", "sponsor", "shop", \
                            "financing", "innovation", "micromanage", "export", "job", "budget", "spending", "paycheck", "market", "growth", \
                            "investment", "factory", "welfare", "pension", "accounting", "retirement", "industry", "agriculture", "income", \
                            "spending", "expensive", "purchase", "wealth", "economical", "booty", "buy", "sell", "trading", "profit", "resource", \
                            "price", "monie", "gambling", "production", "advertising", "tool", "cargo","reconstruction", "contract", "tax", \
                            "financial", "wealth", "power", "prosperity", "excise", "livelihood", "training", "enterprise", "expenditure", "deal", \
                            "property"],
            "COMMODITY" : ["money", "oil", "water", "energy", "currency", "jewel", "jewellery", "gold", "industry", "agriculture", "inflation", \
                           "debt", "income"],
            "CURRENCY" : ["dollar"],
            "ECONFAC" : ["airport", "subway", "farm", "charity", "skyscraper"],
            "OPERATION" : ["operation"],
            "ECONWORKOFART" : []},
        "hierarchy" : {
            "ECONTITLE" : ["ceo", "director"],
            "ECONFIGURE" : []},
        "trade" : {
            "ECONOMICAL" : ["capitalism", "communism", "economical", "economic", "tourism", "commercial"],
            "UNECONOMICAL" : ["uncommercial", "boycott", "bankruptcy", "exorbitant"]},
        "punishment" : {
            "EQUITABLE" : ["reward"],
            "UNEQUITABLE" : ["recession", "unemployment"]}            
    },
        
    "security" : {
        "identity" : {
            "SECURITYGROUP" : ["police", "officers", "policeman", "law enforcement", "law-enforcement", "firefighter", "rescuer", "lawyer", "agent", "authority", \
                          "protector", "guardian", "captive", "marshal", "innocent", "guard"],
            "VICTIM" : ["victim", "dead", "casualty", "persecuted", "slave"]},
        "ingroup" : {},
        "outgroup" : {
            "CRIMEGROUP" : ["criminal", "mafia", "prisoner", "murderer", "terrorist", "hijacker", "outlaw", "violator", "killer", "executioner", "thief"]},
        "entity": {
            "LEGALENTITY" : ["trial", "security", "law", "decree",  "sailor", "surveillance", "warrant", \
                           "penalty", "statute", "attorney", "treaty", "duty", "imprisonment", "justification", \
                           "judge", "custody", "justice", "denunciation", "agreement"],
            "CRIMFAC" : [],
            "LAW" : [],
            "LEGALFAC" : ["prison", "sanctuary", "court", "jail"],
            "LEGALWORKOFART" : []},
        "hierarchy" : {
            "LEGALPERSON" : [],
            "LEGALTITLE" : ["beneficiary"]},
        "trade" : {
            "LAWFUL" : ["duty", "morality", "innocence", "legitimacy"],
            "UNLAWFUL" : ["crime", "terrorism", "extremism", "murderous", "imtimidation", "harrassment", "trafficking", "criminal", "suicide", "plot", \
                          "brutality", "coercion", "subversion", "bioterrorism", "propaganda", "corrupt", "corruption", "scandal", "betray", "betrayal", \
                          "misappropriation", "violation", "offence"],
            "LEGALACTION" : ["confiscation", "arrest", "investigate", "enforce", "prohibit", "shield", "punish", "imprison", "jailing"]},
        "punishment" : {
            "LEGAL" : ["legal", "protect", "legitimate", "liberation", "liberate"],
            "ILLEGAL" : ["illegal", "counterfeit", "money-laundering", "guilty", "blackmail", "threaten", "punishment", "conspiracy", "illegitimate", \
                         "infringement", "adultery", "wicked", "tricked", "harm", "incest", "exploit", "embezzle", "pilfering"],
            "PHYSICALVIOLENCE" : ["murder", "hijack", "kill", "terrorise", "killing"]}
    },
    
    "military" : {
        "identity" : {
            "ARMEDGROUP" : ["commander", "vetran", "Vetran", "occupier", "invader", "military", "Mujahideen", "mujahideen", "army", "Army", "navy", \
                            "air force", "troops", "defender", "recruit", "guerrilla", "knight", "special forces", "fatality", "martyr", "vanguard"],
            "BELIGERENT" : ["aggressor", "troop", "fighter", "soldier", "warrior", "Mujahid", "mujahid", "soldier", "mujahidin", "victorious"]},
        "ingroup" : {},
        "outgroup" : {
            "ADVERSARY" : ["traitor", "oppressor", "enemy", "crusader", "aggressor", "invader", "occupier"]},
        "entity" : {
            "MILENTITY" : ["battlefield", "beachead", "training camp", "armed", "force", "uniform", "chain of command", "target", "materiel", \
                           "defense", "battleship", "aircraft carriers", "infantry", "tank", "air power", "shooter", "cavalry", \
                           "arming", "enlist", "conquest", "base", "buy", "surrender", "soldier", "airman", "marine", "biodefense"],
            "MISSION" : ["mission"],
            "WEAPON" : ["weapon", "weaponry", "bomb", "missile", "munition", "explosive", "arms", "bullet", "sword", "spear", "gun", "rocket"],
            "MILFAC" : ["fortress", "fortification"],
            "MILWORKOFART" : []},
        "hierarchy" : {
            "MILRANK" : ["lieutenant", "commander", "adjutant", "mujahid"],
            "MILFIGURE" : []},
        "trade" : {
            "WARFARE" : ["victory", "war", "warfare", "battle", "blockade"],
            "MILACTION" : ["destruction", "violence", "conflict", "slain", "besiege", "massacre", "atrocity", "aggression", "attack", "assault", "fight", \
                           "explosion", "combat", "invasion", "ruin", "bombardment", "expel", "fighting", "defeat", "ambush", "overthrow", "destabilize", 
                          "destroy", "strike", "infighting", "invade", "expulsion", "hostility", "blast"]},
        "punishment" : {
            "BARBARY" : ["genocide", "brutalize", "holocaust", "torture", "slaughter", "bloodshed"]}
    }
}

filepath = r'C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset'

with open(os.path.join(filepath, "group_schema.json"), "wb") as f:
    f.write(json.dumps(group_schema).encode("utf-8"))
    
print("complete at: ", datetime.now().strftime("%d/%m/%Y - %H:%M:%S"))

complete at:  19/08/2020 - 19:20:37
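
Before the saved file is reused, it may be worth checking that every ideology carries the same attribute keys and that no seed list repeats a term (for example, "peaceful" appears twice in BENEVOLANCE above). The following is a minimal sketch under stated assumptions: `check_schema`, `EXPECTED_ATTRIBUTES` and `toy_schema` are hypothetical names, and a toy schema stands in for `group_schema`.

```python
import json

# Attribute keys used by every ideology in the schema above.
EXPECTED_ATTRIBUTES = {"identity", "ingroup", "outgroup", "entity",
                       "hierarchy", "trade", "punishment"}


def check_schema(schema, expected=EXPECTED_ATTRIBUTES):
    """Return a list of human-readable problems found in a schema dict."""
    problems = []
    for ideology, attributes in schema.items():
        missing = expected - set(attributes)
        if missing:
            problems.append(f"{ideology}: missing attributes {sorted(missing)}")
        for attribute, concepts in attributes.items():
            for concept, terms in concepts.items():
                dupes = sorted({t for t in terms if terms.count(t) > 1})
                if dupes:
                    problems.append(
                        f"{ideology}/{attribute}/{concept}: duplicate terms {dupes}")
    return problems


# Toy schema (hypothetical): one duplicated term and one missing attribute.
toy_schema = {
    "social": {
        "identity": {}, "ingroup": {}, "outgroup": {}, "entity": {},
        "hierarchy": {},
        "trade": {"BENEVOLANCE": ["peace", "peaceful", "peaceful"]},
        "punishment": {},
    },
    "academia": {
        "identity": {}, "ingroup": {}, "outgroup": {}, "entity": {},
        "hierarchy": {}, "trade": {},
        # "punishment" deliberately left out
    },
}

# Round-trip through JSON text, as the cell above does via a file.
restored = json.loads(json.dumps(toy_schema))
problems = check_schema(restored)
for problem in problems:
    print(problem)
```

Duplicates are harmless for matching but make the seed lists harder to maintain, so the check only reports them and does not modify the schema.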


## Structure of the Schema

The following cell displays the structure for the schema and how each named concept has been classified.

In [87]:
## https://www.datacamp.com/community/tutorials/joining-dataframes-pandas 
## Display a DataFrame of the Typology

import pandas as pd

## map each ideology to {attribute: [concept names]}
schema = {ideology: {attribute: list(concepts.keys()) for (attribute, concepts) in value.items()}
             for (ideology, value) in group_schema.items()}

## create a list of attribute keys (every ideology shares the same attributes)
keys = list(next(iter(schema.values())).keys())

## Create frames for table: one frame per attribute, with ideologies as columns
frames = []
for attribute in keys:
    frames.append(pd.DataFrame.from_dict({k : v[attribute] for k, v in schema.items()}, orient = 'index').fillna("").T)

# display table
display(pd.concat(frames, keys = keys))

| attribute | # | social | academia | medical | geopolitics | religion | economic | justice | military |
|---|---|---|---|---|---|---|---|---|---|
| identity | 0 | SOCGROUP | ACADEMICGROUP | MEDICALGROUP | GPEGROUP | RELGROUP | ECONGROUP | SECGROUP | ARMEDGROUP |
| identity | 1 |  |  |  |  |  |  | VICTIM | BELIGERENT |
| ingroup | 0 | SELF |  |  |  | BELIEVER |  |  |  |
| ingroup | 1 | FAMILY |  |  |  |  |  |  |  |
| ingroup | 2 | AFFILIATE |  |  |  |  |  |  |  |
| outgroup | 0 | OUTCAST |  | VERMIN | GPEOUTGROUP | APOSTATE | COMPETITOR | CRIMEGROUP | ENEMY |
| entity | 0 | CAUSE | ACADEMICENTITY | MEDICALENT | GPEENTITY | RELENTITY | ECONENTITY | SECENTITY | MILENTITY |
| entity | 1 | CREDO |  | SEXUALITY | TERRITORY | RELIGIOUSLAW | COMMODITY | LAW | WEAPON |
| entity | 2 | LOCATION |  | MEDICALFAC | GPEFAC | FAITH | ECONFAC | SECFAC | MILFAC |
| entity | 3 | SOCFAC |  | INTOXICANT | GPEWORKOFART | RELFAC | ECONWORKOFART | LEGALWORKOFART | MILWORKOFART |


## Create lookup tables and getter functions for each category of the group_schema

The categories of the schema are as follows:
- Concept - the classification term for each group of synonyms
- Attribute - the term referring to each group problem in which each concept is contained
- Ideology - the group context
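
The relationship between the three categories can be sketched as a single inverted index from each seed term to its (concept, attribute, ideology) triples. This is an illustrative alternative to the three separate lookup tables built in the next cell, not the notebook's method; `build_term_index` and `toy_schema` are hypothetical names. Because a term can sit under more than one concept (in the schema above, "occupier" is listed under both ARMEDGROUP and ADVERSARY), each term maps to a list.

```python
from collections import defaultdict


def build_term_index(schema):
    """Map each lowercased seed term to a list of (concept, attribute, ideology)."""
    index = defaultdict(list)
    for ideology, attributes in schema.items():
        for attribute, concepts in attributes.items():
            for concept, terms in concepts.items():
                for term in terms:
                    entry = (concept, attribute, ideology)
                    key = term.lower()
                    # case variants such as "army"/"Army" collapse to one entry
                    if entry not in index[key]:
                        index[key].append(entry)
    return dict(index)


# Toy schema (hypothetical subset of the full schema)
toy_schema = {
    "military": {
        "identity": {"ARMEDGROUP": ["army", "Army", "occupier"]},
        "outgroup": {"ADVERSARY": ["enemy", "occupier"]},
    },
    "geopolitics": {
        "outgroup": {"GPEOUTGROUP": ["regime", "Regime"]},
    },
}

term_index = build_term_index(toy_schema)
print(term_index["occupier"])
```

A term with more than one triple is ambiguous between concepts, which is worth knowing before choosing which label a matcher should assign.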

## Create Lookup tables for each concept, attribute and ideology

In [99]:
%%time

# create a lookup table mapping each concept to its seed terms
concept_lookup = dict()
for attributes in group_schema.values():
    for concepts in attributes.values():
        concept_lookup.update(concepts)
        
# print(concept_lookup)

## create a lookup table mapping each attribute to its concepts across all ideologies
attribute_lookup = dict()
for attributes in group_schema.values():
    for attribute, concepts in attributes.items():
        attribute_lookup.setdefault(attribute, []).extend(concepts.keys())

# print(attribute_lookup)

# create a lookup table mapping each ideology to its concepts
ideology_lookup = dict()     
for ideology, attributes in group_schema.items():
    labels = []
    for concepts in attributes.values():
        labels += list(concepts.keys())
    ideology_lookup[ideology] = labels

# print(ideology_lookup)

# logic for accessing concept_lookup for matcher rules
# for rule, pattern in concept_lookup.items():
#     print(rule, None, [{"LEMMA" : {"IN" : pattern}}])

Wall time: 997 µs


## Define getter functions for each concept, attribute and ideology

In [100]:
%%time

from spacy.tokens import Token
import pandas as pd

def get_concept(token):
    
    """
    getter function returning the concept related to token text
    """
    
    for concept, terms in concept_lookup.items():
        if token.lemma_.lower() in [pattern.lower() for pattern in terms]:
            return concept
    return ''

def get_attribute(token):

    """
    getter function returning the group attribute related to the token concept
    input: token._.CONCEPT
    output: related attribute
    """

    for attribute, concepts in attribute_lookup.items():
        if token._.CONCEPT.lower() in [pattern.lower() for pattern in concepts]:
            return attribute
    return ''

def get_ideology(token):

    """
    getter function returning the ideology related to the token concept
    input: token._.CONCEPT
    output: related ideology
    """

    for ideology, concepts in ideology_lookup.items():
        if token._.CONCEPT.lower() in [pattern.lower() for pattern in concepts]:
            return ideology
    return ''

Wall time: 0 ns


## Set Up the Pipeline

In [101]:
%%time

import spacy

nlp = spacy.load("en_core_web_md")

merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents, after = "ner")



Wall time: 25.6 s


## Add Token extensions for each custom attribute using the getter functions

In [102]:
%%time
Token.set_extension("CONCEPT", getter=get_concept, force = True)
Token.set_extension("ATTRIBUTE", getter=get_attribute, force = True)
Token.set_extension("IDEOLOGY", getter=get_ideology, force = True)

Wall time: 0 ns


## Display the custom attributes for a test sentence

In [103]:
from visuals import sent_frame
text = "On my orders, the United States military has begun strikes against Al Qaeda terrorist training camps and military installations of the Taliban regime in Afghanistan."
doc = nlp(text)

display(sent_frame(doc))

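|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text | On | my | orders | , | the | United States | military | has | begun | strikes | against | Al Qaeda | terrorist | training | camps | and | military | installations | of | the | Taliban | regime | in | Afghanistan | . |
| lemma | on | -PRON- | order | , | the | United States | military | have | begin | strike | against | Al Qaeda | terrorist | training | camp | and | military | installation | of | the | Taliban | regime | in | Afghanistan | . |
| ent_type |  |  |  |  |  | GPE |  |  |  |  |  | ORG |  |  |  |  |  |  |  |  | ORG |  |  | GPE |  |
| pos | ADP | DET | NOUN | PUNCT | DET | PROPN | NOUN | AUX | VERB | NOUN | ADP | PROPN | NOUN | NOUN | NOUN | CCONJ | ADJ | NOUN | ADP | DET | PROPN | NOUN | ADP | PROPN | PUNCT |
| tag | IN | PRP$ | NNS | , | DT | NNP | NN | VBZ | VBN | NNS | IN | NNP | NN | NN | NNS | CC | JJ | NNS | IN | DT | NNP | NN | IN | NNP | . |
| dep | prep | poss | pobj | punct | det | compound | nsubj | aux | ROOT | dobj | prep | nmod | amod | compound | pobj | cc | amod | conj | prep | det | compound | pobj | prep | pobj | punct |
| concept |  |  |  |  |  |  | ARMEDGROUP |  |  |  |  |  | CRIMEGROUP |  | SOCFAC |  | ARMEDGROUP | SOCFAC |  |  |  | GPEOUTGROUP |  |  |  |
| attribute |  |  |  |  |  |  | identity |  |  |  |  |  | outgroup |  | entity |  | identity | entity |  |  |  | outgroup |  |  |  |
| ideology |  |  |  |  |  |  | military |  |  |  |  |  | justice |  | social |  | military | social |  |  |  | geopolitics |  |  |  |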


## Create a custom pipeline component for named concept recognition

While the getter functions work for short texts, processing a large document with them takes too long; a matcher pipeline component is therefore required.
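
One way to see the cost of the getter approach: each attribute access scans the term lists until it finds a match, and a token with no match scans all of them. The sketch below compares a simplified mirror of that scan with a dictionary built once in advance. It is plain Python, not the spaCy `Matcher` used in the next cell, and `scan_concept`, `build_lemma_index` and `toy_lookup` are hypothetical names.

```python
def scan_concept(lemma, concept_lookup):
    """Simplified mirror of the getter logic: scan term lists until a match.

    Returns (concept, number of string comparisons made)."""
    comparisons = 0
    target = lemma.lower()
    for concept, terms in concept_lookup.items():
        for term in terms:
            comparisons += 1
            if term.lower() == target:
                return concept, comparisons
    return "", comparisons


def build_lemma_index(concept_lookup):
    """Build a lowercase term -> concept dict once.

    The first concept listing a term wins, matching the scan order."""
    index = {}
    for concept, terms in concept_lookup.items():
        for term in terms:
            index.setdefault(term.lower(), concept)
    return index


# Toy lookup table (hypothetical subset of concept_lookup)
toy_lookup = {
    "ARMEDGROUP": ["military", "army"],
    "CRIMEGROUP": ["terrorist", "criminal"],
    "GPEOUTGROUP": ["regime", "Regime"],
}
lemmas = ["on", "military", "terrorist", "regime"]

# Scan per lemma, as a getter would on every attribute access
scanned = [scan_concept(lemma, toy_lookup) for lemma in lemmas]
total_comparisons = sum(count for _, count in scanned)

# Build the index once, then use one dictionary lookup per lemma
lemma_index = build_lemma_index(toy_lookup)
indexed = [lemma_index.get(lemma, "") for lemma in lemmas]

print([concept for concept, _ in scanned] == indexed)
print(total_comparisons)
```

Both routes give the same labels, but the scan's work grows with the size of the schema for every token, whereas the index pays that cost once.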

In [121]:
%%time

from spacy.tokens import Doc, Span, Token
from spacy.matcher import Matcher

import schemautils as mk

class ConceptMatcher(object):
    
    """A pipeline component for detecting concepts in a text."""

    name = "Concept Matcher"  # component name, will show up in the pipeline

    dataset_dir = r'C:\Users\Steve\OneDrive - University of Southampton\CNDPipeline\dataset'
    group_markup = "group_schema.json"

    ideologies = None
    group_schema = None
    concept_lookup = None
    attribute_lookup = None
    ideology_lookup = None
    
    def __init__(self, nlp):
        
        """Initialise the pipeline component. The shared nlp instance is used to initialise the matcher
        with the shared vocab, get the label ID and generate Doc objects as phrase match patterns.
        """

        self.nlp = nlp
        
        #####
        # initiate group schema attributes to ConceptMatcher()
        #####

        # load group schema from disc
        with open(os.path.join(ConceptMatcher.dataset_dir, ConceptMatcher.group_markup), 'r') as fp:
            ConceptMatcher.group_schema = json.load(fp)
        
        # initiate a json object structure for group ideologies
        ConceptMatcher.ideologies = {key : 0 for key in ConceptMatcher.group_schema.keys()}

        # initiate lookup tables
        ConceptMatcher.concept_lookup = mk.get_concept_lookup(ConceptMatcher.group_schema)
        ConceptMatcher.attribute_lookup = mk.get_attribute_lookup(ConceptMatcher.group_schema)
        ConceptMatcher.ideology_lookup = mk.get_ideology_lookup(ConceptMatcher.group_schema)

        #####
        # set up the Matcher using concepts as the rule name and terms in the pattern 
        #####
        
        self.matcher = Matcher(self.nlp.vocab)
        for concept, terms in ConceptMatcher.concept_lookup.items():
                self.matcher.add(concept, None, [{"LEMMA" : {"IN" : terms}}])
        
        #####
        # set up token and span extensions
        #####
        
        Doc.set_extension("concepts", default = [], force = True)
        
        Span.set_extension("CONCEPT", default = '', force = True)
        Token.set_extension("CONCEPT", default = '', force = True)
        
        Span.set_extension("ATTRIBUTE", default = '', force = True)
        Token.set_extension("ATTRIBUTE", default = '', force = True)
        
        Span.set_extension("IDEOLOGY", default = '', force = True)
        Token.set_extension("IDEOLOGY", default = '', force = True)        

    def __call__(self, doc):
        
        """Apply the pipeline component on a Doc object and modify it if matches are found. 
        Return the Doc, so it can be processed by the next component in the pipeline, if available.
        
        merge entities code: https://github.com/explosion/spaCy/issues/4107
        filter code: https://github.com/explosion/spaCy/issues/4056
        """
        with doc.retokenize() as retokenizer:

            matches = self.matcher(doc)
            for match_id, start, end in matches:
                span = Span(doc, start, end)
                
                concept_id = self.nlp.vocab.strings[match_id]
                
                span._.CONCEPT = concept_id
                span._.IDEOLOGY = self.get_ideology(concept_id)
                span._.ATTRIBUTE = self.get_attribute(concept_id)
                
                for token in span:
                    token._.CONCEPT = span._.CONCEPT
                    token._.IDEOLOGY = span._.IDEOLOGY
                    token._.ATTRIBUTE = span._.ATTRIBUTE
                try:
                    if len(span) > 1:
                        retokenizer.merge(span)
                except ValueError:
                    pass
                doc._.concepts.append(span)

        return doc

    def get_ideology(self, concept_id):

        """
        Return the ideology related to a concept.
        input: concept_id (the concept name, as stored in token._.CONCEPT)
        output: related ideology, or '' if none is found
        """

        for ideology, concepts in ConceptMatcher.ideology_lookup.items():
            if concept_id.lower() in [pattern.lower() for pattern in concepts]:
                return ideology
        return ''

    def get_attribute(self, concept_id):

        """
        Return the group attribute related to a concept.
        input: concept_id (the concept name, as stored in token._.CONCEPT)
        output: related attribute, or '' if none is found
        """

        for attribute, concepts in ConceptMatcher.attribute_lookup.items():
            if concept_id.lower() in [pattern.lower() for pattern in concepts]:
                return attribute
        return ''

Wall time: 0 ns
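
The two getters above scan every list in the lookup on each call. As a minimal sketch of an alternative, assuming the lookup is a dictionary mapping each ideology to a list of concept names (the four entries below are taken from the example sentence later in this notebook and are not the full schema), the lookup can be inverted once so that each call becomes a single case-insensitive dictionary access. The names `build_reverse_lookup` and `concept_to_ideology` are hypothetical.

```python
# Illustrative subset of an ideology -> concepts lookup (not the full schema).
ideology_lookup = {
    "military": ["ARMEDGROUP"],
    "justice": ["CRIMEGROUP"],
    "social": ["SOCFAC"],
    "geopolitics": ["GPEOUTGROUP"],
}


def build_reverse_lookup(lookup):
    """Invert {category: [concepts]} into {lower-cased concept: category}."""
    reverse = {}
    for category, concepts in lookup.items():
        for concept in concepts:
            # setdefault keeps the first category found, mirroring the
            # first-match behaviour of the original loop
            reverse.setdefault(concept.lower(), category)
    return reverse


# Hypothetical module-level cache, built once rather than on every call.
concept_to_ideology = build_reverse_lookup(ideology_lookup)


def get_ideology(concept_id):
    """Return the ideology for a concept name, or '' if it is unknown."""
    return concept_to_ideology.get(concept_id.lower(), '')


print(get_ideology("ArmedGroup"))
print(repr(get_ideology("UNKNOWN")))
```

The same inversion would apply to the attribute lookup. Behaviour matches the loop-based getters, including the empty-string fallback.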


## Create new pipeline

## Add pipeline components

In [122]:
%%time

for component in nlp.pipe_names:
    if component not in ['tagger', "parser", "ner"]:
        nlp.remove_pipe(component)

# merge entities
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents, after = "ner")

# add concept matcher component to pipeline
nlp.add_pipe(ConceptMatcher(nlp), after = "merge_entities") # add concepts

print([name for name in nlp.pipe_names])

['tagger', 'parser', 'ner', 'merge_entities', 'Concept Matcher']
Wall time: 8.97 ms


## Visualise a sentence showing custom attributes

In [123]:
import importlib
import visuals
importlib.reload(visuals)

text = "On my orders, the United States military has begun strikes against Al Qaeda terrorist training camps and military installations of the Taliban regime in Afghanistan."
doc = nlp(text)

display(visuals.sent_frame(doc))

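| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text | On | my | orders | , | the | United States | military | has | begun | strikes | against | Al Qaeda | terrorist | training | camps | and | military | installations | of | the | Taliban | regime | in | Afghanistan | . |
| lemma | on | -PRON- | order | , | the | United States | military | have | begin | strike | against | Al Qaeda | terrorist | training | camp | and | military | installation | of | the | Taliban | regime | in | Afghanistan | . |
| ent_type |  |  |  |  |  | GPE |  |  |  |  |  | ORG |  |  |  |  |  |  |  |  | ORG |  |  | GPE |  |
| pos | ADP | DET | NOUN | PUNCT | DET | PROPN | NOUN | AUX | VERB | NOUN | ADP | PROPN | NOUN | NOUN | NOUN | CCONJ | ADJ | NOUN | ADP | DET | PROPN | NOUN | ADP | PROPN | PUNCT |
| tag | IN | PRP$ | NNS | , | DT | NNP | NN | VBZ | VBN | NNS | IN | NNP | NN | NN | NNS | CC | JJ | NNS | IN | DT | NNP | NN | IN | NNP | . |
| dep | prep | poss | pobj | punct | det | compound | nsubj | aux | ROOT | dobj | prep | nmod | amod | compound | pobj | cc | amod | conj | prep | det | compound | pobj | prep | pobj | punct |
| concept |  |  |  |  |  |  | ARMEDGROUP |  |  |  |  |  | CRIMEGROUP |  | SOCFAC |  | ARMEDGROUP | SOCFAC |  |  |  | GPEOUTGROUP |  |  |  |
| attribute |  |  |  |  |  |  | identity |  |  |  |  |  | outgroup |  | entity |  | identity | entity |  |  |  | outgroup |  |  |  |
| ideology |  |  |  |  |  |  | military |  |  |  |  |  | justice |  | social |  | military | social |  |  |  | geopolitics |  |  |  |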


## Summarise the instances of each concept for a document

## Incorporating each concept-related function into pipeline objects and applying them to the dataset

In [124]:
from collections import Counter

def get_doc_ideologies(doc):
    
    """
    Returns a dictionary giving the share of each ideology mentioned within the document.
    Each value is a proportion: ideology instances / total number of ideology instances.
    """
    
    ## list the ideology of every named concept in the doc that has one
    ideology_list = [concept._.IDEOLOGY for concept in doc._.concepts if concept._.IDEOLOGY]
    
    ## copy the dictionary of ideologies so the class-level original is not modified
    doc_ideologies = ConceptMatcher.ideologies.copy()
        
    ## replace each count with its share of all ideology instances in the doc
    for k, v in dict(Counter(ideology_list)).items():
        doc_ideologies[k] = v / len(ideology_list)
        
    return doc_ideologies

get_doc_ideologies(doc)

{'social': 0.3333333333333333,
 'academia': 0,
 'medical': 0,
 'geopolitics': 0.16666666666666666,
 'religion': 0,
 'economic': 0,
 'justice': 0.16666666666666666,
 'military': 0.3333333333333333}
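
The output above can be reproduced by hand from the example sentence, which carries six annotated concepts: two military, two social, one justice and one geopolitics. The following is a minimal standalone sketch of the same normalisation without spaCy. The names `IDEOLOGIES` and `ideology_shares` are hypothetical stand-ins for `ConceptMatcher.ideologies` and `get_doc_ideologies`, and the explicit empty-list guard is an addition for clarity.

```python
from collections import Counter

# The eight ideology classifications, in the order shown in the output above.
IDEOLOGIES = ["social", "academia", "medical", "geopolitics",
              "religion", "economic", "justice", "military"]


def ideology_shares(ideology_list):
    """Return each ideology's share of all ideology instances in a document."""
    shares = {name: 0 for name in IDEOLOGIES}
    total = len(ideology_list)
    if total == 0:
        # no annotated concepts: every share stays at zero
        return shares
    for name, count in Counter(ideology_list).items():
        shares[name] = count / total
    return shares


# Ideologies of the six concepts matched in the example sentence, in token order.
example = ["military", "justice", "social", "military", "social", "geopolitics"]
shares = ideology_shares(example)
print(shares)
```

Here `social` and `military` each account for 2 of 6 instances (one third), while `geopolitics` and `justice` each account for 1 of 6, which agrees with the dictionary returned for the example sentence.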

In [134]:
%%time
import importlib
import pipeline
importlib.reload(pipeline)

dirpath = r'C:\\Users\\Steve\\OneDrive - University of Southampton\\CNDPipeline\\dataset'
print("initiating custom pipeline")
cnd = pipeline.CND()

initiating custom pipeline
Wall time: 40.8 s


## Create the dataset

In [135]:
%%time
import cndobjects
importlib.reload(cndobjects)
orators = cndobjects.Dataset(cnd, dirpath)

parsing:  bush (2001-09-11) 911 Address to the Nation
parsing:  bush (2001-09-14) Remarks at the National Day of Prayer & Remembrance Service
parsing:  bush (2001-09-15) First Radio Address following 911
parsing:  bush (2001-09-17) Address at Islamic Center of Washington, D.C.
parsing:  bush (2001-09-20) Address to Joint Session of Congress Following 911 Attacks
parsing:  bush (2001-10-07) Operation Enduring Freedom in Afghanistan Address to the Nation
parsing:  bush (2001-10-11) 911 Pentagon Remembrance Address
parsing:  bush (2001-10-11) Prime Time News Conference on War on Terror
parsing:  bush (2001-10-11) Prime Time News Conference Q&A
parsing:  bush (2001-10-26) Address on Signing the USA Patriot Act of 2001
parsing:  bush (2001-11-10) First Address to the United Nations General Assembly
parsing:  bush (2001-12-11) Address to Citadel Cadets
parsing:  bush (2001-12-11) The World Will Always Remember 911
parsing:  bush (2002-01-29) First (Official) Presidential State of the Union A

## Results for Each Orator

The following cell shows how each ideology is represented across each orator's texts.

For each speech, the heatmaps show the percentage of that speech's concepts that fall under each ideology.

In most speeches the orators are addressing "the people", which is why "social" generally scores most highly.

"Justice" scores most highly (41%) in Bush's speech of 26 October 2001, in which he addresses the signing of the USA Patriot Act. As might be expected, "religion" reaches its highest share among his speeches (18%) in his address of 14 September 2001 at the Episcopal National Cathedral for the National Day of Prayer and Remembrance.

For bin Laden, religion, military and geopolitics feature highly in his second and third speeches. In his speech following 9/11, religion reaches its highest share of all his speeches (26%), reflecting how he confers a divine legitimacy on the attacks.

Using these terms, this annotation framework could be extended to create a new topic modelling schema specific for cultural violence.

## Displaying a heatmap of ideologies for each orator across all texts

In [137]:
import visuals
importlib.reload(visuals)

for orator in orators:
    print(orator.name)
    display(visuals.heatmap(orator.ideologies))

George Bush


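| ideology | 2001-09-11 | 2001-09-14 | 2001-09-15 | 2001-09-17 | 2001-09-20 | 2001-10-07 | 2001-10-11 | 2001-10-26 | 2001-11-10 | 2001-12-11 | 2002-01-29 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| social | 43% | 44% | 36% | 56% | 38% | 37% | 35% | 15% | 35% | 51% | 32% |
| academia | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
| medical | 5% | 9% | 3% | 1% | 3% | 2% | 3% | 2% | 4% | 3% | 5% |
| geopolitics | 12% | 12% | 14% | 14% | 21% | 13% | 24% | 22% | 20% | 13% | 17% |
| religion | 7% | 18% | 5% | 11% | 5% | 6% | 5% | 2% | 5% | 7% | 3% |
| economic | 9% | 4% | 3% | 6% | 4% | 5% | 5% | 8% | 4% | 1% | 17% |
| justice | 14% | 7% | 12% | 6% | 16% | 17% | 15% | 41% | 21% | 10% | 13% |
| military | 11% | 5% | 27% | 6% | 13% | 20% | 13% | 9% | 10% | 14% | 12% |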


Martin Luther King


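| ideology | 1963-04-03 | 1963-08-28 | 1965-03-25 | 1967-04-04 | 1967-04-14 |
|---|---|---|---|---|---|
| social | 56% | 51% | 51% | 42% | 53% |
| academia | 0% | 0% | 0% | 0% | 1% |
| medical | 3% | 2% | 5% | 7% | 5% |
| geopolitics | 13% | 20% | 15% | 20% | 19% |
| religion | 3% | 5% | 5% | 3% | 2% |
| economic | 14% | 5% | 6% | 5% | 9% |
| justice | 5% | 15% | 9% | 5% | 7% |
| military | 7% | 2% | 8% | 19% | 5% |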


Osama bin Laden


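| ideology | 1996-08-23 | 2001-01-07 | 2001-10-07 | 2001-11-09 | 2002-11-24 | 2004-11-01 |
|---|---|---|---|---|---|---|
| social | 29% | 38% | 34% | 38% | 27% | 35% |
| academia | 1% | 0% | 0% | 0% | 0% | 0% |
| medical | 3% | 0% | 3% | 4% | 4% | 4% |
| geopolitics | 18% | 13% | 11% | 13% | 18% | 18% |
| religion | 18% | 21% | 26% | 18% | 12% | 5% |
| economic | 7% | 10% | 1% | 1% | 10% | 12% |
| justice | 9% | 5% | 15% | 14% | 11% | 11% |
| military | 15% | 13% | 11% | 11% | 18% | 16% |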


## Having created custom attributes for each Token object, the next step is to create the same for noun_chunks Span objects that contain multiple spans with different group schema attributes

The following noun_chunks apply (from list(doc.noun_chunks)):

- my orders
- the United States military
  * what should be done with a noun_chunk containing both a named entity and a named concept: does it become a named entity?
  * "the United States Military" resolves to a det for "en_core_web_sm"
- strikes
- Al Qaeda terrorist training camps
  * the ent and noun phrase would ideally be resolved separately
  * "training camps" would ideally be annotated as an asset of the associated entity
- military installations
  * a social facility is modified to become a military facility
- the Taliban regime
  * should be merged and annotated as an outgroup
- Afghanistan
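
One possible way to summarise a noun chunk from its token-level annotations is sketched below in plain Python. This is an illustration only and not the notebook's implementation: the `Tok` tuple and `summarise_chunk` are hypothetical, and the rule that the right-most annotated token (usually the head noun) decides the chunk's attribute is an assumption that the questions above leave open.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a token with its entity type and schema attribute."""
    text: str
    ent_type: str
    attribute: str


def summarise_chunk(tokens):
    """Collect the entities and attributes found in a chunk of tokens."""
    entities = [t.text for t in tokens if t.ent_type]
    attributes = [t.attribute for t in tokens if t.attribute]
    return {
        "text": " ".join(t.text for t in tokens),
        "entities": entities,
        "attributes": attributes,
        # assumption: the right-most annotated token decides the chunk attribute
        "attribute": attributes[-1] if attributes else "",
    }


# Token annotations copied from the example sentence table.
taliban = [Tok("the", "", ""), Tok("Taliban", "ORG", ""), Tok("regime", "", "outgroup")]
camps = [Tok("Al Qaeda", "ORG", ""), Tok("terrorist", "", "outgroup"),
         Tok("training", "", ""), Tok("camps", "", "entity")]

print(summarise_chunk(taliban))
print(summarise_chunk(camps))
```

Under this rule "the Taliban regime" is annotated as an outgroup with "Taliban" as its entity. "Al Qaeda terrorist training camps" keeps both of its attributes in the `attributes` list, so a later step could still resolve the entity and the noun phrase separately.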