# Homework and bake-off: Relation extraction using distant supervision

In [1]:
__author__ = "Bill MacCartney and Christopher Potts"
__version__ = "CS224u, Stanford, Fall 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Baselines](#Baselines)
  1. [Hand-build feature functions](#Hand-build-feature-functions)
  1. [Distributed representations](#Distributed-representations)
1. [Homework questions](#Homework-questions)
  1. [Different model factory [1 points]](#Different-model-factory-[1-points])
  1. [Directional unigram features [1.5 points]](#Directional-unigram-features-[1.5-points])
  1. [The part-of-speech tags of the "middle" words [1.5 points]](#The-part-of-speech-tags-of-the-"middle"-words-[1.5-points])
  1. [Bag of Synsets [2 points]](#Bag-of-Synsets-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to developing effective relation extraction systems using distant supervision.

As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system to enter into the bake-off.

## Set-up

See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions.

In [2]:
%load_ext autoreload
%autoreload 2
import os

import numpy as np
from nltk import bigrams
from nltk.corpus import wordnet as wn
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

import rel_ext
import utils

As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:

In [3]:
rel_ext_data_home = os.getcwd()
rel_ext_data_home

'/home/teja/relation_extraction'

In [4]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))

In [5]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))

In [6]:
dataset = rel_ext.Dataset(corpus, kb)

In [None]:
# Directory holding the GloVe files; this path is specific to the
# author's machine, so adjust it for your own set-up.
GLOVE_HOME = '/home/teja'

glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

You are not wedded to this set-up for splits. The bake-off will be conducted on a previously unseen test set, so all of the data in `dataset` is fair game:

In [9]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)

In [10]:
splits

{'tiny': Corpus with 3,474 examples; KB with 445 triples,
 'train': Corpus with 263,285 examples; KB with 36,191 triples,
 'dev': Corpus with 64,937 examples; KB with 9,248 triples,
 'all': Corpus with 331,696 examples; KB with 45,884 triples}
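
To make the role of `split_fracs` concrete, here is a minimal, dependency-free sketch of cutting a shuffled collection into named chunks by fraction. The helper `split_by_fracs` is hypothetical and is not how `rel_ext.Dataset.build_splits` is implemented (the real method also has to divide both the corpus and the KB); it only illustrates proportion-based splitting with a fixed seed.

```python
import random


def split_by_fracs(items, split_names, split_fracs, seed=1):
    """Hypothetical helper: shuffle `items` and cut them into named
    chunks whose sizes follow `split_fracs`."""
    if len(split_names) != len(split_fracs):
        raise ValueError("Need one fraction per split name.")
    if abs(sum(split_fracs) - 1.0) > 1e-9:
        raise ValueError("Split fractions should sum to 1.")
    items = list(items)
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(items)
    splits = {}
    start = 0
    cumulative = 0.0
    for name, frac in zip(split_names, split_fracs):
        cumulative += frac
        # Cut at the cumulative boundary so no item is dropped.
        end = round(cumulative * len(items))
        splits[name] = items[start:end]
        start = end
    return splits


toy_splits = split_by_fracs(
    range(1000),
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)
print({name: len(chunk) for name, chunk in toy_splits.items()})
```

Changing `split_fracs` (for example, to give more data to `train`) only moves the cut points; every item still lands in exactly one split.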

## Baselines

### Hand-build feature functions

In [11]:
def middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in get_tag_bigrams(ex.middle_POS):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in get_tag_bigrams(ex.middle_POS):
            feature_counter[word] += 1
    return feature_counter


def get_tag_bigrams(s):
    """Helper for `middle_bigram_pos_tag_featurizer`. Returns a list of
    str containing each POS bigram of `s`, followed by the padded
    unigram tags themselves."""
    # Boundary symbols mark the start and end of the tag sequence.
    start_symbol = "<s>"
    end_symbol = "</s>"
    tags = [start_symbol] + get_tags(s) + [end_symbol]
    # `bigrams` yields pairs of adjacent tags; join each pair into one str.
    toks = [" ".join(pair) for pair in bigrams(tags)]
    # Also include the unigram tags (and boundary symbols) as features.
    toks.extend(tags)
    return toks

def get_tags(s):
    """Given a sequence of word/POS elements (lemmas), this function
    returns a list containing just the POS elements, in order.
    """
    return [parse_lem(lem)[1] for lem in s.strip().split(' ') if lem]


def parse_lem(lem):
    """Helper method for parsing word/POS elements. It just splits
    on the rightmost / and returns (word, POS) as a tuple of str."""
    return lem.strip().rsplit('/', 1)
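
As a quick illustration of the padded POS-bigram idea, the sketch below processes a made-up `word/POS` string without NLTK. The function name `pos_bigrams` and the sample string are assumptions for illustration only; unlike the notebook's `get_tag_bigrams`, this version returns the bigrams alone and does not append the unigram tags.

```python
def pos_bigrams(middle_pos, start_symbol="<s>", end_symbol="</s>"):
    """Return padded POS-tag bigrams for a string of word/POS tokens."""
    # Split each token on its rightmost "/" and keep only the tag.
    tags = [tok.rsplit('/', 1)[1]
            for tok in middle_pos.strip().split(' ') if tok]
    padded = [start_symbol] + tags + [end_symbol]
    # Pair each tag with its successor.
    return [' '.join(pair) for pair in zip(padded, padded[1:])]


print(pos_bigrams("The/DT dog/N napped/V"))
# ['<s> DT', 'DT N', 'N V', 'V </s>']
```

The boundary symbols let the model see which tags open and close the middle span, and an empty middle still produces the single feature `'<s> </s>'`.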

In [12]:
def getBeforePOS(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in getBeforetokensPOS(ex.left_POS):
            feature_counter[word+"_before"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in getBeforetokensPOS(ex.left_POS):
            feature_counter[word+"_before"] += 1
    return feature_counter

def getAfterPOS(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in getAftertokensPOS(ex.right_POS):
            feature_counter[word+"_after"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in getAftertokensPOS(ex.right_POS):
            feature_counter[word+"_after"] += 1
    return feature_counter

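def getBeforetokensPOS(x):
    """Return the POS tags of the last three word/POS tokens of `x`,
    or ["NOPOS"] if any of them cannot be parsed."""
    tokens = x.split(" ")[-3:]
    try:
        # Split on the rightmost "/", as `parse_lem` does.
        return [tok.rsplit("/", 1)[1] for tok in tokens]
    except IndexError:
        return ["NOPOS"]


def getAftertokensPOS(x):
    """Return the POS tags of the first three word/POS tokens of `x`,
    or ["NOPOS"] if any of them cannot be parsed."""
    tokens = x.split(" ")[0:3]
    try:
        # Split on the rightmost "/", as `parse_lem` does.
        return [tok.rsplit("/", 1)[1] for tok in tokens]
    except IndexError:
        return ["NOPOS"]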

In [13]:
def directional_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word+"_SO"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word+"_OS"] += 1
    return feature_counter


def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter


def left_bag_of_words(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.left.split(' '):
            feature_counter[word+"_left"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.left.split(' '):
            feature_counter[word+"_left"] += 1
    return feature_counter


def right_bag_of_words(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.right.split(' '):
            feature_counter[word+"_right"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.right.split(' '):
            feature_counter[word+"_right"] += 1
    return feature_counter


def SorO(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        feature_counter["subject"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        feature_counter["object"] += 1
    return feature_counter
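
The difference between the simple and directional bag-of-words featurizers is easiest to see on toy data. The sketch below is self-contained and uses made-up middle strings; `toy_directional_features` is a hypothetical stand-in that takes plain lists of strings rather than a `rel_ext` corpus.

```python
from collections import Counter


def toy_directional_features(forward_middles, reverse_middles):
    """Count middle words, suffixing each with the direction of its example."""
    counts = Counter()
    for middle in forward_middles:   # subject mentioned before object
        for word in middle.split(' '):
            counts[word + "_SO"] += 1
    for middle in reverse_middles:   # object mentioned before subject
        for word in middle.split(' '):
            counts[word + "_OS"] += 1
    return counts


features = toy_directional_features(
    forward_middles=["is the author of"],
    reverse_middles=["was written by"])
print(features["author_SO"], features["by_OS"], features["by_SO"])
# 1 1 0
```

With the suffixes, a word such as "by" contributes evidence only for the order in which it was observed, which an undirected bag of words cannot express.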

In [14]:
from nltk.corpus import wordnet as wn

def get_synsets(s):
    """Helper for `synset_featurizer`. Returns a list of stringified
    Synsets associated with the word/POS elements of `s`.
    """
    # Parse each element into a (word, POS) pair with `parse_lem`, then
    # convert the POS to WordNet's tag set.
    wt = [parse_lem(lem) for lem in s.strip().split(' ') if lem]
    wt = [(word, convert_tag(tag)) for word, tag in wt]
    synsets = []
    for word, tag in wt:
        synsets.extend(str(syn) for syn in wn.synsets(word, pos=tag))
    return synsets


def synset_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in get_synsets(ex.middle_POS):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in get_synsets(ex.middle_POS):
            feature_counter[word] += 1
    return feature_counter


def convert_tag(t):
    """Converts tags so that they can be used by WordNet:

    | Tag begins with | WordNet tag |
    |-----------------|-------------|
    | `N`             | `n`         |
    | `V`             | `v`         |
    | `J`             | `a`         |
    | `R`             | `r`         |
    | Otherwise       | `None`      |
    """
    if t[0].lower() in {'n', 'v', 'r'}:
        return t[0].lower()
    elif t[0].lower() == 'j':
        return 'a'
    else:
        return None
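
The tag conversion in the table above can be checked in isolation. The following sketch restates the mapping as a dictionary lookup; the name `penn_to_wordnet` and the sample tags are illustrative assumptions, and no WordNet data is needed to run it.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank-style tag to a WordNet POS letter, or None."""
    # Only the first character of the tag decides the WordNet category.
    first = tag[:1].upper()
    return {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}.get(first)


for tag in ['NNS', 'VBD', 'JJ', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

One consequence worth keeping in mind: as far as I know, NLTK's `wn.synsets(word, pos=None)` does not filter by part of speech, so words whose tags map to `None` are looked up across all WordNet categories rather than being skipped.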


In [15]:
def glove_middle_featurizer(kbt, corpus, np_func=np.sum):
    reps = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split():
            rep = glove_lookup.get(word)
            if rep is not None:
                reps.append(rep)
    if len(reps) == 0:
        dim = len(next(iter(glove_lookup.values())))
        return utils.randvec(n=dim)
    else:
        return np_func(reps, axis=0)


def glove_left_featurizer(kbt, corpus, np_func=np.sum):
    reps = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.left.split()[-3:]:
            rep = glove_lookup.get(word)
            if rep is not None:
                reps.append(rep)
    if len(reps) == 0:
        dim = len(next(iter(glove_lookup.values())))
        return utils.randvec(n=dim)
    else:
        return np_func(reps, axis=0)


def glove_right_featurizer(kbt, corpus, np_func=np.sum):
    reps = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.right.split()[0:3]:
            rep = glove_lookup.get(word)
            if rep is not None:
                reps.append(rep)
    if len(reps) == 0:
        dim = len(next(iter(glove_lookup.values())))
        return utils.randvec(n=dim)
    else:
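        return np_func(reps, axis=0)


def combineGloveFeatures(kbt, corpus, np_func=np.sum):
    # Concatenate the left, middle, and right representations into one
    # vector three times the GloVe dimensionality. The sub-featurizers
    # are called with their default `np_func` (`np.sum`).
    left_features = glove_left_featurizer(kbt, corpus)
    middle_features = glove_middle_featurizer(kbt, corpus)
    right_features = glove_right_featurizer(kbt, corpus)
    final_features = np.concatenate(
        [left_features, middle_features, right_features], 0)
    return final_features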
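
The pool-then-concatenate pattern used by the GloVe featurizers can be sketched with tiny hand-made vectors. This version is deliberately dependency-free: it uses Python lists in place of NumPy arrays, a made-up two-dimensional `toy_lookup`, and a zero vector when no word is known (the notebook instead falls back to `utils.randvec`). All names here are hypothetical.

```python
def sum_pool(words, lookup, dim):
    """Sum the vectors of known words; return zeros if none are known."""
    total = [0.0] * dim
    for word in words:
        vec = lookup.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


toy_lookup = {"born": [1.0, 0.0], "in": [0.0, 2.0], "the": [0.5, 0.5]}
left, middle, right = ["the"], ["born", "in"], ["unknownword"]

# List concatenation plays the role of `np.concatenate` here.
combined = (sum_pool(left, toy_lookup, 2)
            + sum_pool(middle, toy_lookup, 2)
            + sum_pool(right, toy_lookup, 2))
print(combined)
# [0.5, 0.5, 1.0, 2.0, 0.0, 0.0]
```

Because each context keeps its own block of dimensions, the classifier can weight a word differently depending on whether it appeared to the left, in the middle, or to the right of the entity pair.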

In [16]:
featurizers = [
    simple_bag_of_words_featurizer,
    left_bag_of_words,
    right_bag_of_words,
    SorO,
    directional_bag_of_words_featurizer,
    middle_bigram_pos_tag_featurizer,
    getBeforePOS,
    getAfterPOS,
    synset_featurizer,
    combineGloveFeatures]

In [19]:
# model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear',n_jobs=-1)
# model_factory = lambda: RandomForestClassifier(n_estimators=100,random_state=42,
#                                                class_weight="balanced_subsample",n_jobs=-1)

estimators = [
    ('lr', LogisticRegression(
        fit_intercept=True, solver='liblinear', n_jobs=-1, random_state=42)),
    ('rf', RandomForestClassifier(
        n_estimators=100, random_state=42,
        class_weight="balanced_subsample", n_jobs=-1))]

model_factory = lambda: StackingClassifier(
    estimators=estimators,
    n_jobs=-1,
    final_estimator=LogisticRegression(
        fit_intercept=True, solver='liblinear', n_jobs=-1, random_state=42),
    verbose=2)
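
A note on why `model_factory` is a zero-argument callable rather than a model instance: assuming the experiment code builds a separate binary classifier for each relation, it needs a fresh, unfitted model every time. The sketch below illustrates that pattern with a stand-in class; `ToyModel` and `toy_model_factory` are hypothetical and unrelated to scikit-learn or `rel_ext`.

```python
class ToyModel:
    """Stand-in for an estimator; it just remembers what it was fit on."""

    def __init__(self):
        self.fitted_on = None

    def fit(self, relation):
        self.fitted_on = relation
        return self


toy_model_factory = lambda: ToyModel()

# Each call to the factory yields an independent model object.
toy_models = {rel: toy_model_factory().fit(rel)
              for rel in ['adjoins', 'author', 'capital']}
print({rel: model.fitted_on for rel, model in toy_models.items()})
```

Had a single shared instance been passed instead, each new `fit` call would overwrite the state left by the previous relation.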

In [20]:
estimators

[('lr',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='auto', n_jobs=-1, penalty='l2', random_state=42,
                     solver='liblinear', tol=0.0001, verbose=0, warm_start=False)),
 ('rf',
  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                         class_weight='balanced_subsample', criterion='gini',
                         max_depth=None, max_features='auto', max_leaf_nodes=None,
                         max_samples=None, min_impurity_decrease=0.0,
                         min_impurity_split=None, min_samples_leaf=1,
                         min_samples_split=2, min_weight_fraction_leaf=0.0,
                         n_estimators=100, n_jobs=-1, oob_score=False,
                         random_state=42, verbose=0, warm_start=False))]

In [21]:
baseline_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=model_factory,
    train_sampling_rate=0.2,
    verbose=True)

  0%|          | 0/16 [00:00<?, ?it/s]

Hurray completed all the assertions

100%|██████████| 16/16 [1:07:14<00:00, 252.15s/it]


Hurray completed all the assertions
relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.920      0.471      0.772        340       5716
author                    0.899      0.631      0.829        509       5885
capital                   0.778      0.221      0.517         95       5471
contains                  0.896      0.733      0.858       3904       9280
film_performance          0.910      0.661      0.846        766       6142
founders                  0.890      0.342      0.674        380       5756
genre                     0.852      0.306      0.628        170       5546
has_sibling               0.898      0.513      0.781        499       5875
has_spouse                0.908      0.466      0.764        594       5970
is_a                      0.904      0.302      0.646        497       5873
nationality               0.851      0.322      0.64
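
When reading the table, note that the f-score column appears to weight precision more heavily than recall: the reported values are consistent with $F_{0.5}$ rather than $F_1$. The sketch below recomputes the score from the rounded precision and recall of three rows; `f_beta` is a small helper written for this check, not a function from `rel_ext`.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


rows = [('author', 0.899, 0.631),
        ('capital', 0.778, 0.221),
        ('contains', 0.896, 0.733)]
for relation, p, r in rows:
    print(relation, round(f_beta(p, r), 3))
# author 0.829
# capital 0.517
# contains 0.858
```

Since the inputs are already rounded to three decimals, a recomputed value can differ from the table in the last digit for some rows.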

Studying model weights might yield insights: