# Subject-Verb-Object Triplets 

In this notebook, I create a function that extracts subject-verb-object triplets from the articles and then run it over the entire corpus. The end result is a dataset of all identifiable triplets with some important features, such as starting position, dependency labels, and part-of-speech tags.

In [1]:
import spacy 
from spacy.matcher import Matcher
import textacy
import pandas as pd 
import numpy as np 

nlp = spacy.load('en_core_web_sm')

from spacy.symbols import NOUN, PROPN, VERB
from spacy.tokens import Doc, Span, Token

I begin by importing the dataset and dropping the rows with missing text.

In [6]:
# First let's import vox

vox = pd.read_csv("C:/Users/nro04/Documents/moral_templates/Data/vox_articles.csv")

# Drop articles for which I don't have the text 

vox = vox.dropna(subset=['clean_strings'])



I now create a function that extracts the desired triplets along with additional important information. The code uses, and is heavily based on, the textacy library.

In [7]:
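def subject_verb_object_triples(doc):
    """Yield one tuple per subject-verb-object combination found in doc.

    Each tuple holds the index of the sentence's first token, the subject span
    with its dependency label and tag, the verb span, the object span with its
    dependency label and tag, and the index of the object's last token.
    """

    sents = doc.sents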

    for sent in sents:
        start_i = sent[0].i

        verbs = textacy.spacier.utils.get_main_verbs_of_sent(sent)
        for verb in verbs:
            subjs = textacy.spacier.utils.get_subjects_of_verb(verb)
            if not subjs:
                continue
            objs = textacy.spacier.utils.get_objects_of_verb(verb)
            if not objs:
                continue

            # add adjacent auxiliaries to verbs, for context
            # and add compounds to compound nouns
            verb_span = textacy.spacier.utils.get_span_for_verb_auxiliaries(verb)
            verb = sent[verb_span[0] - start_i : verb_span[1] - start_i + 1]
            for subj in subjs:
                subj_dep = subj.dep_
                subj_tag = subj.tag_
                subj = sent[
                    textacy.spacier.utils.get_span_for_compound_noun(subj)[0]
                    - start_i : subj.i
                    - start_i
                    + 1
                ]
                for obj in objs:
                    if obj.pos == NOUN:
                        span = textacy.spacier.utils.get_span_for_compound_noun(obj)
                    elif obj.pos == VERB:
                        span = textacy.spacier.utils.get_span_for_verb_auxiliaries(obj)
                    else:
                        span = (obj.i, obj.i)
                    obj_dep = obj.dep_
                    obj_tag = obj.tag_
                    obj = sent[span[0] - start_i : span[1] - start_i + 1]
                    end_pos = span[1]

                    yield (start_i, subj, subj_dep, subj_tag, verb, obj, obj_dep, obj_tag, end_pos)
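The slicing arithmetic in the function converts document-level token indices into sentence-relative ones: the code treats each helper's result as an inclusive (first, last) pair of document indices, while a sentence is sliced relative to its own first token. The sketch below reproduces that conversion on plain Python lists. The `sentence_slice` helper and the toy tokens are hypothetical and are not part of textacy or spaCy.

```python
def sentence_slice(sent_tokens, sent_start, span):
    """Return the tokens covered by an inclusive document-level span.

    sent_tokens: tokens of one sentence (hypothetical stand-in for a spaCy Span)
    sent_start:  document-level index of the sentence's first token
    span:        inclusive (first, last) document-level indices
    """
    first, last = span
    # Shift to sentence-relative positions; +1 because the span end is inclusive.
    return sent_tokens[first - sent_start : last - sent_start + 1]


# Toy document: the second sentence starts at document index 4.
doc_tokens = ["It", "rained", "today", ".",
              "The", "director", "has", "made", "a", "movie", "."]
sent_start = 4
sent_tokens = doc_tokens[sent_start:]

verb = sentence_slice(sent_tokens, sent_start, (6, 7))
obj = sentence_slice(sent_tokens, sent_start, (9, 9))
print(verb, obj)
```

Dropping the `+ 1` would silently cut the last token from every multi-token span, which is why the verb and object slices in the function both carry it.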

I now wrap the function above so that the result is a pandas dataframe.

In [8]:
# Create a function for creating the dataframe 

def create_svo_dataframe(doc): 

    start_list = []
    subject_list = []
    sdep_list = []
    stag_list = []
    verb_list = []
    object_list = []
    odep_list = []
    otag_list = []
    end_list = []

    triplets = subject_verb_object_triples(doc)

    for triplet in triplets:
        start = triplet[0]
        subj = triplet[1]
        subj_dep = triplet[2]
        subj_tag = triplet[3]
        verb = triplet[4]
        obj = triplet[5]
        obj_dep = triplet[6]
        obj_tag = triplet[7]
        end = triplet[8]

        start_list.append(start)
        subject_list.append(subj)
        sdep_list.append(subj_dep)
        stag_list.append(subj_tag)
        verb_list.append(verb)
        object_list.append(obj)
        odep_list.append(obj_dep)
        otag_list.append(obj_tag)
        end_list.append(end)
    
    # Named `data` rather than `dict` to avoid shadowing the built-in type
    data = {'subject': subject_list,
        'verb': verb_list,
        'object': object_list,
        'start': start_list,
        'end': end_list,
        'subj_dep': sdep_list,
        'subj_tag': stag_list,
        'obj_dep': odep_list,
        'obj_tag': otag_list}

    svo_df = pd.DataFrame(data)

    return svo_df
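Because each triplet is a fixed-length tuple, the nine parallel lists could be avoided by naming the fields once and zipping them with each tuple. The sketch below shows the idea on hand-written tuples. `FIELDS` and `triples_to_records` are hypothetical names, and the string values stand in for the spaCy spans that the real generator yields.

```python
# Field names in the order the generator yields them.
FIELDS = ("start", "subject", "subj_dep", "subj_tag",
          "verb", "object", "obj_dep", "obj_tag", "end")


def triples_to_records(triples):
    """Turn each 9-tuple into a dict keyed by field name."""
    return [dict(zip(FIELDS, triple)) for triple in triples]


# Hand-written example tuples; plain strings replace the spaCy spans.
toy_triples = [
    (7486, "Zootopia", "nsubj", "NNP", "uses", "kingdom divide", "dobj", "NN", 7502),
    (7643, "film", "nsubj", "NN", "reinforces", "feel", "dobj", "NN", 7661),
]

records = triples_to_records(toy_triples)
print(records[0]["subject"], records[0]["verb"], records[0]["object"])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, which uses the keys as column names.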


        

Run the function on every row of vox and collect the results in a single dataframe.

This takes a long time.

In [19]:
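frames = []

for i in range(len(vox)):
    doc = nlp(vox.iloc[i]['clean_strings'])
    df = create_svo_dataframe(doc)
    df['Document'] = i
    frames.append(df)
    if i % 100 == 0:
        print(f'working on article {i}')

# DataFrame.append was removed in pandas 2.0; concatenating once at the end
# also avoids copying the growing dataframe on every iteration
svodf = pd.concat(frames, ignore_index=True)

svodf.tail(25)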

working on article 0
working on article 100
working on article 200
working on article 300
working on article 400
working on article 500
working on article 600
working on article 700
working on article 800
working on article 900
working on article 1000
working on article 1100
working on article 1200
working on article 1300
working on article 1400
working on article 1500
working on article 1600
working on article 1700
working on article 1800
working on article 1900
working on article 2000
working on article 2100
working on article 2200
working on article 2300
working on article 2400
working on article 2500
working on article 2600
working on article 2700
working on article 2800
working on article 2900
working on article 3000
working on article 3100
working on article 3200
working on article 3300
working on article 3400
working on article 3500
working on article 3600
working on article 3700
working on article 3800
working on article 3900
working on article 4000
working on article 4100
work

Unnamed: 0,subject,verb,object,start,end,subj_dep,subj_tag,obj_dep,obj_tag,Document
913066,(which),(includes),(Smith),7379.0,7394.0,nsubj,WDT,dobj,NNP,22907
913067,(which),(includes),(Davis),7379.0,7397.0,nsubj,WDT,conj,NNP,22907
913068,(which),(includes),(Leto),7379.0,7400.0,nsubj,WDT,conj,NNP,22907
913069,(which),(includes),(Robbie),7379.0,7404.0,nsubj,WDT,conj,NNP,22907
913070,(any),(managed),(have),7409.0,7425.0,nsubj,DT,xcomp,VB,22907
913071,(Zootopia),(uses),"(kingdom, divide)",7486.0,7502.0,nsubj,NNP,dobj,NN,22907
913072,(Zootopia),(uses),"(to, craft)",7486.0,7508.0,nsubj,NNP,xcomp,VB,22907
913073,(It),"(’s, not)",(metaphor),7518.0,7524.0,nsubj,PRP,attr,NN,22907
913074,(film),(reinforces),(feel),7643.0,7661.0,nsubj,NN,dobj,NN,22907
913075,(Nominated),(portrays),(friendship),7739.0,7763.0,nsubj,VBN,dobj,NN,22907


We now have the dataset, but some filtering is useful at this point. I select triplets in which the object is a direct object and a common or proper noun (tags NN, NNS, NNP, or NNPS). This makes the triplets fit better with the structure used by Affect Control Theory.

In [20]:
svodf_filtered = svodf[(svodf.obj_dep=='dobj')]
svodf_final = svodf_filtered[svodf_filtered.obj_tag.isin(["NN", "NNS", "NNP", "NNPS"])]

svodf_final.tail(50)


Unnamed: 0,subject,verb,object,start,end,subj_dep,subj_tag,obj_dep,obj_tag,Document
913010,(it),(’s),(opus),4620.0,4637.0,nsubj,PRP,dobj,NN,22907
913011,(that),(turns),(life),4620.0,4641.0,nsubj,WDT,dobj,NN,22907
913013,"(director, Ezra, Edelman)","(has, made)",(movie),4669.0,4676.0,nsubj,NNP,dobj,NN,22907
913014,(cast),(carry),(day),4775.0,4785.0,nsubj,NN,dobj,NN,22907
913015,(feeling),(carry),(day),4775.0,4785.0,conj,NN,dobj,NN,22907
913017,(it),"(’s, filled)",(history),5005.0,5035.0,nsubj,PRP,dobj,NN,22907
913020,(it),(’s),(nightmare),5109.0,5137.0,nsubj,PRP,dobj,NN,22907
913022,(them),(reenact),"(love, story)",5186.0,5215.0,nsubj,PRP,dobj,NN,22907
913025,(government),(turns),(screws),5312.0,5341.0,nsubj,NN,dobj,NNS,22907
913026,(he),"(’ll, give)",(names),5312.0,5357.0,nsubj,PRP,dobj,NNS,22907
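
The selection rule can also be stated as a plain predicate, which makes it easy to check against a few rows by hand. The sketch below is an illustration only: `NOUN_TAGS` and `keep_triplet` are hypothetical names, and the rows are copied from the unfiltered output shown earlier.

```python
# Penn Treebank noun tags: singular and plural common nouns,
# singular and plural proper nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_triplet(obj_dep, obj_tag):
    """Keep a triplet only if its object is a direct object tagged as a noun."""
    return obj_dep == "dobj" and obj_tag in NOUN_TAGS


# (subject, verb, object, obj_dep, obj_tag)
rows = [
    ("which", "includes", "Smith", "dobj", "NNP"),
    ("which", "includes", "Davis", "conj", "NNP"),
    ("any", "managed", "have", "xcomp", "VB"),
    ("film", "reinforces", "feel", "dobj", "NN"),
]

kept = [row[:3] for row in rows if keep_triplet(row[3], row[4])]
print(kept)
```

Conjoined objects such as "Davis" are dropped even though they are proper nouns, because their dependency label is `conj` rather than `dobj`.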


The last step is to save the dataframe in our Data folder.

In [21]:
svodf_final.to_csv('C:/Users/nro04/Documents/moral_templates/Data/vox_triplets_dataset.csv')