*David Schlangen, 2019-03-24*

# Task: Predicting Entailments

This notebook gives an overview of tasks that make use of images as an implicit link between utterances. What follows from the fact that two expressions were provided for the same image (object)?

In the perspective established above, where we see images as models of sentences, the question would be "what follows from being true in the same model"? In logic, the answer would be "not much", as there models are supposed to be stand-ins for the world as a whole (or rather, for one possible world among infinitely many), and many things can be true at the same time without there being a logical connection between these facts.

But our models are perhaps better described as *situations* with some internal coherence stemming from the fact that they are individual slices of the world. (Hence, *situation semantics* \cite{barwiseperry:sitatt} might have been the more appropriate formalisation to use in the background, but it is a bit more involved than first-order logic, which I used here.) 

We can thus reformulate our question to: "what follows from being true in the same situation?", and investigate whether we can derive a notion of *situational entailment*. This we do in this notebook. The strategy will be to go through the various possible combinations of anchors (image objects) and expression types (referring expressions, captions, etc.) and to inspect the resulting expression pairs. We will also create "negative" examples of expressions that might have been taken from the same situation, but weren't. If there is any regularity to the phenomenon, a model of it should be able to distinguish between same-situation pairs and different-situation ones.

In the literature, the relation of interest here is typically called *entailment* or *implication*, which is a more general relation than the *logical* entailment studied in formal semantics. In the most general formulation, a sentence (let's call it the *hypothesis*) is *implied* by another sentence (or set of sentences; let's call this the *premise*), if accepting the premise makes one (more likely to) accept the hypothesis as well \cite{chierchi:meaning}. (This is also how the influential "recognising textual entailment" challenge \cite{Dagan:rte} later introduced the relation.)

This could be called a pragmatic view on the relation, as it revolves around *accepting* a statement. In formal semantics, one abstracts away from this to yield a universally valid relation (independent of what anyone may or may not choose to accept). Interestingly, there are typically two ways in which the relation can then be explicated. Going the semantic route (and then typically calling the relation *semantic consequence* and using $\models$ as relation symbol), the notion of truth as introduced above is harnessed, and the definition becomes "all models that make the premises true also make the hypothesis true". Going the syntactic route (and then typically calling the relation *syntactic consequence* and using $\vdash$ as symbol), the relation is assumed to hold if there is a sequence of applications of syntactic rules that transform the premises into the hypothesis. As the rules are seen as truth-preserving, the idea is that both paths, semantic and syntactic, actually describe the same relation (that is, cover the same pairs of premises and hypotheses). This is the case for some logics (those with a sound and complete proof system), but not all.
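
To make the semantic route concrete, here is a minimal sketch for propositional logic, where a model is simply an assignment of truth values to atoms. The function name `entails`, the atoms `d` and `a`, and the encoding of formulas as Python callables are assumptions made for illustration only; none of this is part of the notebook's code base.

```python
from itertools import product

def entails(premises, hypothesis, atoms):
    """Check semantic consequence by brute force over all models.

    premises: list of functions mapping a model (dict atom -> bool) to bool
    hypothesis: a function of the same kind
    atoms: names of the propositional atoms to enumerate
    """
    for values in product([True, False], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        # a counter-model makes all premises true and the hypothesis false
        if all(p(model) for p in premises) and not hypothesis(model):
            return False
    return True

# hypothetical atoms: d = "there is a dog", a = "there is an animal"
rule = lambda m: (not m['d']) or m['a']   # d -> a
dog = lambda m: m['d']
animal = lambda m: m['a']

print(entails([rule, dog], animal, ['d', 'a']))   # premises: d -> a, d
print(entails([animal], dog, ['d', 'a']))         # premise: a only
```

The first call finds no counter-model, the second one does (a model with an animal but no dog). Enumeration of this kind is only feasible for tiny vocabularies, which is one reason why the syntactic route is attractive in practice.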

An interesting task could be to try to set up both "paths" (via models / truth and via syntactic transformations) for the tasks described here.

Before we launch into the investigation of the data, a few words on related work. The influential "recognising textual entailments" challenge was already mentioned above. Under the name "natural language inference", the task has recently seen enormous renewed interest. Interestingly, the paper starting this revival, \cite{snli:emnlp2015}, used image captions as a starting point. However, instead of making use of the linked image, they only used the caption as a trigger, and asked annotators to *imagine* what must, can, or cannot also be true about the described situation. We use the image to skip the imagination part, having ground truth about whether the situation that makes the hypothesis true is the same or not. But, as we will see, this comes at the cost of what is perhaps noisier data. (We have put some of this data to crowdworkers to let them judge similarity; the experiment has been published in \cite{schlangen:iwcs19} and is documented in another notebook here.)

\cite{youngetal:flickr30k} pioneered the idea of making use of the image / expression relation, defining a notion of *approximate entailment* and testing it via (sets of) images and partially constructed captions.

Finally, one might ask why this is an important notion to model. Here the answer would be that being able to recognise entailment (or weaker forms of it) is crucial for being able to understand discourses, as they are structured by relations between their constituent expressions. While the presentation below will mostly be organised by the types of expressions that are related (where we extend the discussion to relations between expressions of types other than sentence as well), we will also point out where such relations might be found in real discourses, and what the use of being able to recognise them would be.

In [2]:
# imports

import configparser
import os
import random
from textwrap import fill
import sys
from copy import deepcopy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import Latex, display

pd.set_option('display.max_colwidth', 250)

In [3]:
# Load up config file (needs path; adapt env var if necessary); local imports

# load config file, set up paths, make project-specific imports
config_path = os.environ.get('VISCONF')
if not config_path:
    # try default location, if not in environment
    default_path_to_config = '../../clp-vision/Config/default.cfg'
    if os.path.isfile(default_path_to_config):
        config_path = default_path_to_config

# also catches an empty VISCONF value, not only a missing one
assert config_path, 'You need to specify the path to the config file via environment variable VISCONF.'

config = configparser.ConfigParser()
with open(config_path, 'r', encoding='utf-8') as f:
    config.read_file(f)

corpora_base = config.get('DEFAULT', 'corpora_base')
preproc_path = config.get('DSGV-PATHS', 'preproc_path')
dsgv_home = config.get('DSGV-PATHS', 'dsgv_home')


sys.path.append(dsgv_home + '/Utils')
from utils import icorpus_code, plot_labelled_bb, get_image_filename, query_by_id
from utils import plot_img_cropped, plot_img_ax, invert_dict, get_a_by_b
sys.path.append(dsgv_home + '/WACs/WAC_Utils')
from wac_utils import create_word2den, is_relational
sys.path.append(dsgv_home + '/Preproc')
from sim_preproc import load_imsim, n_most_sim

sys.path.append('../Common')
from data_utils import load_dfs, plot_rel_by_relid, get_obj_bb, compute_distance_objs
from data_utils import get_obj_key, compute_relpos_relargs_row, get_all_predicate
from data_utils import compute_distance_relargs_row, get_rel_type, get_rel_instances
from data_utils import compute_obj_sizes_row

In [4]:
# Load up preprocessed DataFrames. Slow!
# These DataFrames are the result of pre-processing the original corpus data,
# as per dsg-vision/Preprocessing/preproc.py

df_names = [#'saiapr_bbdf', 'saiapr_refdf',
            'mscoco_bbdf', 'refcoco_refdf', 'refcocoplus_refdf', 'grex_refdf',
            'vgregdf', 'vgimgdf', 'vgobjdf', 'vgreldf',
            'vgpardf', 'cococapdf']
            # 'flickr_bbdf', 'flickr_capdf', 'flickr_objdf']
df = load_dfs(preproc_path, df_names)

# a derived DF, containing only those region descriptions which I was able to resolve
df['vgpregdf'] = df['vgregdf'][df['vgregdf']['pphrase'].notnull() & 
                               (df['vgregdf']['pphrase'] != '')]

# load up pre-computed similarities
coco_sem_sim, coco_sem_map = load_imsim(os.path.join(preproc_path, 'mscoco_sim.npz'))
visg_sem_sim, visg_sem_map = load_imsim(os.path.join(preproc_path, 'visgen_sim.npz'))
coco_id2semsim = invert_dict(coco_sem_map)
visg_id2semsim = invert_dict(visg_sem_map)

coco_vis_sim, coco_vis_map = load_imsim(os.path.join(preproc_path, 'mscoco_vis_sim.npz'))
visg_vis_sim, visg_vis_map = load_imsim(os.path.join(preproc_path, 'visgen_vis_sim.npz'))
coco_id2vissim = invert_dict(coco_vis_map)
visg_id2vissim = invert_dict(visg_vis_map)

Before we fully delve into this, however, we first look at how the data might be used to learn representations for word meanings that might support the task of inferring entailment relations.

## Learning Word Representations from Referential Uses

There is a long tradition of work on word meaning where it is modelled not via denotations (as in the *denotations* part of this notebook), but rather via contexts of use, as recorded in corpora (see survey in \cite{turney-pantel:10}), and where the central notion for which it is put to work is not *truth* relative to a model, but rather the (somewhat vaguer) notion of semantic *similarity*. This tradition has recently been refreshed by the advent of more powerful methods for learning these representations from corpora \cite{Mikolov2013:embeddings}.

The contexts of use that form the basis of these approaches are typically only linguistic contexts as found in text corpora. The image corpora discussed here open up the possibility to structure the context further, by the referential uses that were made of the expressions. For example, we can look at which words occur together in references to an object (*within context* use), and distinguish them from words that don't (*outside* words). (We report results for such an approach in \cite{zaschla:contground}, showing that for visual referential similarity, it outperforms purely textually trained representations.)

We show some examples here:

In [5]:
# referential context
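def get_all_refexps(corps, id_triple):
    """Collect the referring expressions for one object across corpora."""
    this_refexp = []
    # iterate over the corpora passed in, not over the global list
    for this_corp in corps:
        this_refexp.extend(query_by_id(df[this_corp], id_triple, column='refexp'))
    return this_refexp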

coco_ref_corps = ['refcoco_refdf', 'refcocoplus_refdf', 'grex_refdf']

min_length_target = 5

# Two examples: a random target object, a target word from its referring
# expressions, and the words of five randomly chosen other expressions
for example_no in range(2):
    if example_no > 0:
        print('-' * 40)

    targt_triple = df['refcoco_refdf'].sample()['i_corpus image_id region_id'.split()].values[0]
    targt_refexp = get_all_refexps(coco_ref_corps, targt_triple)

    distr_refexps = df['refcoco_refdf'].sample(5)['refexp'].tolist()

    listA = list(set(' '.join(targt_refexp).split()))
    listB = list(set(' '.join(distr_refexps).split()))

    # prefer a target word of at least min_length_target characters; fall
    # back to any word so that the selection cannot loop forever
    candidates = [w for w in listA if len(w) >= min_length_target]
    target_word = random.choice(candidates or listA)
    listA.remove(target_word)

    print("target word:", target_word)
    print("belonging to same context:")
    print('   ', fill(', '.join(listA), 70))
    print("belonging to different context:")
    print('   ', fill(', '.join(listB), 70))

#pos = [tuple(np.random.choice(listA, 2)) for _ in range(10)]
#neg = [(np.random.choice(listA), np.random.choice(listB)) for _ in range(10)]
#
#print("From same context:")
#for pair in pos:
#    print('  {:>10} , {:<10}'.format(pair[0], pair[1]))
#print("")
#print("From different contexts:")
#for pair in neg:
#    print('  {:>10} , {:<10}'.format(pair[0], pair[1]))

target word: painted
belonging to same context:
    right, meter, parked, blue, lights, pay, machine, to, tail, station,
plate, accord, showing, honda, dark, with, rear, a, on, license, car,
near, of, the, side, road
belonging to different context:
    pink, right, center, of, between, bowl, middle, giraffe, woman, pic,
in, soldiers, chair, white, empty, kid
----------------------------------------
target word: attached
belonging to same context:
    smaller, is, one, down, second, in, sandwiched, from, top, two, black,
largest, wheels, suitcases, sided, under, with, between, a, on,
stacked, luggage, grey, cat, large, bag, suitcase, the, soft
belonging to different context:
    wheel, on, picture, right, boy, of, bottom, young, side, suv, parts,
black, motorcycle, girl, other, kid


As this indicates, this method is probably more likely to result in useful representations for nouns and adjectives than for other parts of speech (as can be expected, since the referential function is crucial here).

Also, readers who are familiar with this approach will have already seen that this is likely to push terms apart that in other respects would be seen as semantically very close (e.g., different colours), if they are incompatible on the level of instances of reference (since, for example, in the corpora it will be rare for the same object to be called both "black" and "red", or "large" and "small", or "man" and "woman"). (For more details, see \cite{zaschla:contground}.)

We again summarise properties of the dataset that could be derived from the available data in this way:

* **Dataset:** words paired with contexts (words from the same referring expression)
* **Negative Instances:** words from different referring expressions
* **Source:** referring expression corpora
* **Uses:** learning word representations optimized for *referential* similarity
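
As a rough sketch of how such a dataset could be assembled, the following builds labelled word pairs from a toy list of (object, expression) tuples. The toy data, the function name `word_pairs` and its parameters are hypothetical; the actual corpora would be read from the DataFrames loaded above, and the training setup in \cite{zaschla:contground} may differ.

```python
import random
from collections import defaultdict

# toy stand-in for a referring expression corpus: (object id, expression)
refexps = [
    ('obj1', 'black dog right'),
    ('obj1', 'black dog purple collar'),
    ('obj2', 'white horse'),
    ('obj2', 'horse on the left'),
]

def word_pairs(refexps, n_neg_per_pos=1, seed=0):
    """Return (word, context_word, label) triples.

    label 1: both words were used to refer to the same object
    label 0: the context word was only used for other objects
    """
    rng = random.Random(seed)
    vocab_by_obj = defaultdict(set)
    for obj_id, expr in refexps:
        vocab_by_obj[obj_id].update(expr.split())

    pairs = []
    for obj_id, words in vocab_by_obj.items():
        # words used for other objects but never for this one
        others = (v for k, v in vocab_by_obj.items() if k != obj_id)
        outside = sorted(set().union(*others) - words)
        for w in sorted(words):
            for c in sorted(words - {w}):
                pairs.append((w, c, 1))
                for _ in range(n_neg_per_pos):
                    if outside:
                        pairs.append((w, rng.choice(outside), 0))
    return pairs

pairs = word_pairs(refexps)
print(len(pairs), pairs[:4])
```

Sampling one negative per positive keeps the two classes balanced; whether that ratio is a good choice for training is an open design question.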

## Implicature / Approximate Entailment

We extend this approach of using the non-linguistic context to create pairings now to larger expressions, and move from semantic *similarity* to *implication* (or *entailment*, or, to introduce yet another term, to *approximate entailment*, which is how \cite{youngetal:flickr30k} introduced this task [for captions]; where the qualifier "approximate" is added presumably to express the fact that this isn't quite *logical* entailment).

The general approach here will be to take positive examples from the set of annotations for the same object (that is, expressions that are related to the same image); for example, two expressions referring to the same image object, or two captions describing the same image. As we will see, this will indeed only yield pairs that are likely to be evaluated similarly in *all* contexts (more likely in any case than the negative pairs); but generalising this from the fact that this is the case for *one* context is of course potentially fallible. Similarly, an expression taken from another image is only likely not to accidentally apply to the same situation.

### ... between Referring Expressions

The following shows some example pairings of referring expressions taken from the same object (premise + p-hyp) or from a different object (premise + n-hyp).

In [6]:
# pairing premise and hypotheses
n = 10 # how many to do
triples = []

this_df = df['refcoco_refdf']

for _ in range(n):
    # seed image
    ic, ii, ri, rexi = this_df.sample()['i_corpus image_id region_id rex_id'.split()].values[0]
    premise, phyp =  np.random.choice(query_by_id(this_df, (ic, ii, ri), 'refexp'), 2, replace=False)

    # negative hypothesis
    nhyp = this_df.sample()['refexp'].values[0]
    triples.append((premise, phyp, nhyp))

colnames = 'premise p-hyp n-hyp'.split()
#pd.DataFrame(triples, columns=colnames)

for prem, phyp, nhyp in triples:
    print("-" * 40)
    print("premise: ", prem)
    print("p-hyp:   ", phyp)
    print("n-hyp:   ", nhyp)

----------------------------------------
premise:  man beer
p-hyp:    person in entire picture
n-hyp:    chair
----------------------------------------
premise:  kid red tie
p-hyp:    red tie
n-hyp:    left donkey
----------------------------------------
premise:  black dog right
p-hyp:    black dog purple collar front
n-hyp:    guy on left side
----------------------------------------
premise:  cat left
p-hyp:    left cat
n-hyp:    red veh top right
----------------------------------------
premise:  white horse
p-hyp:    white horse
n-hyp:    the boy second from right red tie
----------------------------------------
premise:  skiier
p-hyp:    skiier
n-hyp:    green stripe board
----------------------------------------
premise:  left picture
p-hyp:    left tie
n-hyp:    left skier
----------------------------------------
premise:  fridge behind man
p-hyp:    the refrigerator on the right
n-hyp:    partially visible head to the right of red hat in middle
--------------------------------

Intuitively, distinguishing between these pairs seems to be a rather easy task, for which attention to lexical items might even be enough. *Explaining* the decision, however, would not be trivial and require knowledge about how speakers are likely to refer, or about what properties are unlikely to co-occur in the same entity. 
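
To illustrate what "attention to lexical items" could mean in its simplest form, here is a sketch of a word-overlap score applied to three of the triples printed above. The function name `jaccard` and the choice of Jaccard overlap are assumptions made for this illustration; this is not a model that was evaluated in this notebook.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two expressions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

# (premise, p-hyp, n-hyp), copied from the output above
examples = [
    ('kid red tie', 'red tie', 'left donkey'),
    ('black dog right', 'black dog purple collar front', 'guy on left side'),
    ('man beer', 'person in entire picture', 'chair'),
]

for prem, phyp, nhyp in examples:
    print('{:<18} p-hyp: {:.2f}  n-hyp: {:.2f}'.format(
        prem, jaccard(prem, phyp), jaccard(prem, nhyp)))
```

The first two triples are separated by overlap alone, while the third ("man beer" vs. "person in entire picture") shares no word with either hypothesis, so pure string overlap cannot decide it and some lexical knowledge (here, that a man is a person) is required.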

Relating this abstract task to a real(er) discourse task, we can phrase this as recognising whether something can serve as an answer to a clarification request. The positive hypotheses shown here would work as elaborations or reformulations of the description that the premise gives; it is harder to understand the negative hypotheses in that way. 

Here are some examples put into this kind of context:

+ A: black car behind dorks holding signs  
  B: what?  
  A: the car behind the three people

vs 

+ A: black car behind dorks holding signs  
  B: what?  
  A: #closer girl

### ... between Captions and "There is"-Sentences

Where the pairs above combined expressions whose denotations are on the same level, as it were (objects and objects), we can also create unequal pairings. Using a caption as premise (a description of the situation as a whole), we can ask whether it entails the presence of specific objects.

Here are some examples of captions paired with an object referred to via a name (slotted into the "there is __" frame):

In [7]:
# intersecting visual genome and coco captions. Slow-ish.
caption_coco_iids = list(set(df['cococapdf']['image_id'].tolist()))
# regions for only those image for which we also have coco captions
visgencocap_regdf = df['vgregdf'].merge(pd.DataFrame(caption_coco_iids, columns=['coco_id']))
# coco_image_ids for images with both caption and region
vgcap_coco_iids = list(set(visgencocap_regdf['coco_id'].tolist()))
# visgen_image_ids for images with both caption and region
vgcap_vg_iids = list(set(visgencocap_regdf['image_id'].tolist()))

# map coco_ids to visgen_ids, and back
coco2vg = dict(visgencocap_regdf[['coco_id', 'image_id']].values)
vg2coco = {v: k for k, v in coco2vg.items()}

df['vgpardf']['coco_image_id'] = df['vgpardf']['image_id'].apply(lambda x: vg2coco.get(x, None))
df['cocoparcapdf'] = df['cococapdf'].merge(df['vgpardf'],
                                           left_on='image_id', right_on='coco_image_id')

In [8]:
# captions and objects (slotted into "there is __" frame)
tuples = []

for _ in range(10):
    try:
        vgii, cocoii = visgencocap_regdf.sample()['image_id coco_id'.split()].values[0]
        prem = df['cococapdf'][df['cococapdf']['image_id'] == cocoii].sample()['caption'].values[0]
        phyp = df['vgobjdf'][df['vgobjdf']['image_id'] == vgii].sample()['name'].values[0]
        nhyp = df['vgobjdf'].sample()['name'].values[0]
        tuples.append((prem, phyp, nhyp))
    except (ValueError, IndexError):
        # sampling from an empty selection (no caption or no object); skip
        continue

for prem, phyp, nhyp in tuples:
    print("-" * 40)
    print("premise: ", fill(prem, 70))
    print("p-hyp:    there is a", phyp)
    print("n-hyp:    there is a", nhyp)

----------------------------------------
premise:  A group of people are out taking a baby for a walk.
p-hyp:    there is a head
n-hyp:    there is a fence
----------------------------------------
premise:  A young person with outstretched arms in a driveway
p-hyp:    there is a shirt
n-hyp:    there is a foot
----------------------------------------
premise:  a green fire hydrant standing in the grass close to the street
p-hyp:    there is a buillding
n-hyp:    there is a match
----------------------------------------
premise:  A clay rendition of roses in a pot are displayed.
p-hyp:    there is a tube
n-hyp:    there is a concrete planters
----------------------------------------
premise:  three people are waiting on the street with an umbrella
p-hyp:    there is a rollers
n-hyp:    there is a letters
----------------------------------------
premise:  A large jetliner flying through a blue sky.
p-hyp:    there is a wheels
n-hyp:    there is a clock
-----------------------------------

And here is the same configuration, but with whole region descriptions.

In [9]:
# caption + region description
for _ in range(10):
    try:
        p_caps = []
        while len(p_caps) == 0:
            coco_ii = np.nan
            while np.isnan(coco_ii):
                ic, vg_ii, coco_ii = df['vgimgdf'].sample()[['i_corpus', 'image_id', 'coco_id']].values[0]
            p_caps = query_by_id(df['cococapdf'], (icorpus_code['mscoco'], coco_ii))
        p_cap_ind = random.choice(range(len(p_caps)))
        p_cap = p_caps.iloc[p_cap_ind]['caption']
        p_row = p_caps.index[p_cap_ind]

        p_hyp_regions = query_by_id(df['vgpregdf'], (ic, vg_ii))
        p_hyp_regions = p_hyp_regions[~p_hyp_regions['rels'].isnull()]
        p_hyp_reg, p_hyp_relids, p_hyp_rels, p_hyp_pphrase = \
            p_hyp_regions.sample()['phrase rel_ids rels pphrase'.split()].values[0]

        n_hyp_reg, n_hyp_relids, n_hyp_rels, n_hyp_pphrase = \
            df['vgpregdf'].sample()['phrase rel_ids rels pphrase'.split()].values[0]

        print("=" * 40)
        print("premise: ", fill(p_cap, 70))
        print("p-hyp:    there is/are (a)", p_hyp_reg) #, '||', p_hyp_pphrase
        print("n-hyp:    there is/are (a)", n_hyp_reg) #, '||', n_hyp_pphrase
    except Exception:
        # skip images for which no suitable caption or region is found
        pass

premise:  a number of people siting at a table with wine glasses
p-hyp:    there is/are (a) young woman with long wavy brown hair
n-hyp:    there is/are (a) the hand of a boy
premise:  A woman and a black cat outside a building.
p-hyp:    there is/are (a) wire running along the wall
n-hyp:    there is/are (a) table the cake is on
premise:  Group of people watching someone play a video game.
p-hyp:    there is/are (a) red flowers behind a couch
n-hyp:    there is/are (a) house with a light on
premise:  a man with his dog catching a red frisbee
p-hyp:    there is/are (a) a man wearing glasses
n-hyp:    there is/are (a) man wearing white headband
premise:  Large bird flying and gliding over the ocean.
p-hyp:    there is/are (a) specks on bird 
n-hyp:    there is/are (a) The left handle brake on the motorcycle.
premise:  A skateboarder touching the ground with his hands
p-hyp:    there is/are (a) The man has khaki pants
n-hyp:    there is/are (a) white horse with purple outfit looking at t

As the examples indicate, this configuration seems to create examples that bring out a rather clear version of (approximate, commonsense) entailment (perhaps better called *situational implication*): Does a situation of the type described by the caption typically entail / imply the presence of an object of the type described by the hypothesis?

### ... between Captions

Stepping up the complexity of the expressions, here are pairings based on captions and full images. (Note that this, as mentioned above, was also the starting point of the now popular "natural language inference" datasets \cite{snli:emnlp2015}, where the premises were taken from image captions. The hypotheses for all classes (entailment, contradiction, and "neutral"), however, were written by annotators rather than derived from the images as is done here.)

In [10]:
# pairing premise and hypotheses
n = 10 # how many to do
triples = []

this_df = df['cococapdf']

for _ in range(n):
    # seed image
    ic, ii, rexi = this_df.sample()['i_corpus image_id id'.split()].values[0]
    premise, phyp =  np.random.choice(query_by_id(this_df, (ic, ii), 'caption'), 2, replace=False)

    # negative hypothesis
    nhyp = this_df.sample()['caption'].values[0]
    triples.append((premise, phyp, nhyp))

#colnames = 'premise p-hyp n-hyp'.split()
#pd.DataFrame(triples, columns=colnames)

for prem, phyp, nhyp in triples:
    print("-" * 40)
    print("premise: ", fill(prem, 70))
    print("p-hyp:   ", fill(phyp, 70))
    print("n-hyp:   ", fill(nhyp, 70))

----------------------------------------
premise:  A large and a small giraffe standing in the grass
p-hyp:    An adult giraffe and a baby giraffe standing next to each other in the
woods.
n-hyp:    A boat on a body of water with trees in the background.
----------------------------------------
premise:  Two tennis players are shaking hands over the net.
p-hyp:    Two men are shaking hands at a tennis game
n-hyp:    A young boy with an adult umbrella stands near a stroller.
----------------------------------------
premise:  A dog looking out the window of a car.
p-hyp:    A dog with its face handing out a car window.
n-hyp:    some kind of white and brown desert on a table
----------------------------------------
premise:  This shows a young man's jump as continuous sequence of actions.
p-hyp:    A picture of a snowboarder jumping right into the air.
n-hyp:    A woman standing on a street with a umbrella.
----------------------------------------
premise:  A woman takes a bite of the wh

In these examples, the negative hypotheses were selected simply by sampling captions of images other than that from which the premise was taken. Again, to a human reader, the negative hypotheses seem to jump out, and it is not unlikely that a rather shallow model (looking only at semantic similarity of the words and ignoring compositionality) could perform well on this task. But again, *explaining* why the descriptions could be of the same situation (or not), seems like a challenging task, requiring knowledge about event types as well as knowledge about entity types.

In any case, we can make the task more challenging, by using our similarity relation between images to select the distractor image from which the negative hypothesis caption is to be taken. The assumption here would be that descriptions of more similar situations should also be more similar, and that to distinguish between them, a deeper understanding of the expression itself is needed.

Here are examples using the semantic similarity relation described above:

In [11]:
# pairing premise and hypotheses
n = 10 # how many to do
tuples = []

this_df = df['cococapdf']
ic = icorpus_code['mscoco']

for _ in range(n):
    # seed image
    ii = np.random.choice(list(coco_id2semsim.keys()))  # dict views need a list in Python 3
    premise, phyp =  np.random.choice(query_by_id(this_df, (ic, ii), 'caption'), 2, replace=False)

    # negative hypothesis
    sim_ids = n_most_sim(coco_sem_sim, coco_sem_map, coco_id2semsim[ii], n=5)
    n_ii = np.random.choice(sim_ids[1:])
    nhyp = this_df[this_df['image_id'] == n_ii]['caption'].values[0]
    
    tuples.append((premise, phyp, nhyp))

# colnames = 'premise p-hyp n-hyp'.split()
# pd.DataFrame(tuples, columns=colnames)

for prem, phyp, nhyp in tuples:
    print("-" * 40)
    print("premise: ", prem)
    print("p-hyp:   ", phyp)
    print("n-hyp:   ", nhyp)

----------------------------------------
premise:  a couple of people on some skies on a snowy field
p-hyp:    an Olympic snow skier a judge and a spectator 
n-hyp:    Two people in orange jackets smile as they ski up a road.
----------------------------------------
premise:  A couple of zebras are standing in a field
p-hyp:    an image of two zebras in the wild
n-hyp:    A group of zebras are near the side of the road, with one on its back. 
----------------------------------------
premise:  Two young men playing frisbee in a park.
p-hyp:    this is two men in cleats on the grass
n-hyp:    Several people in black red and green are on a field as one person wears white shorts and two people have their arms up toward a white Frisbee.
----------------------------------------
premise:  The skateboarder jumps over the ramp on his board.
p-hyp:    A boy doing a trick on a skateboard from a ramp.
n-hyp:    There are men who are skateboarding down the trail.
-----------------------------------

The same with the visual similarity relation:

In [12]:
# pairing premise and hypotheses
n = 10 # how many to do
tuples = []

this_df = df['cococapdf']
ic = icorpus_code['mscoco']

for _ in range(n):
    # seed image
    ii = np.random.choice(list(coco_id2vissim.keys()))  # sample from the visual similarity map
    premise, phyp =  np.random.choice(query_by_id(this_df, (ic, ii), 'caption'), 2, replace=False)

    # negative hypothesis
    sim_ids = n_most_sim(coco_vis_sim, coco_vis_map, coco_id2vissim[ii], n=5)
    n_ii = np.random.choice(sim_ids[1:])
    nhyp = this_df[this_df['image_id'] == n_ii]['caption'].values[0]
    
    tuples.append((premise, phyp, nhyp))

#colnames = 'premise p-hyp n-hyp'.split()
#pd.DataFrame(tuples, columns=colnames)

for prem, phyp, nhyp in tuples:
    print("-" * 40)
    print("premise: ", prem)
    print("p-hyp:   ", phyp)
    print("n-hyp:   ", nhyp)

----------------------------------------
premise:  A man sitting at a table as he stares at his mobile phone.
p-hyp:    He is eating lunch with his mobile phone on the tray.
n-hyp:    A woman that is standing near a counter.
----------------------------------------
premise:  THERE ARE SEVERAL MOTOR BIKES PARKED ON THE STREET 
p-hyp:    The back ends of a line of parked motorcycles.
n-hyp:    Some motorcycles and scooters parked in front of some tents. 
----------------------------------------
premise:  Two toddlers are playing with an electric organ.
p-hyp:    A person sitting at a computer with two kids and one is pushing keys.
n-hyp:    This little girl can't wait to get a piece of this cake
----------------------------------------
premise:  Costumed men riding in a horse drawn carriage at a fair
p-hyp:    The carriage was being pulled by two horses. 
n-hyp:    A man sitting on a wagon seat driving two draft horses
----------------------------------------
premise:  Oatmeal sitting on

As these examples illustrate, the similarity might sometimes even be too great, so that the assumption that the negative hypothesis is less likely to also apply breaks down. It looks like some fine-tuning (and testing with human raters) would be needed to find the right degree of (dis)similarity to produce a challenging but promising dataset.
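
One possible way to tune the degree of (dis)similarity is to sample the distractor from a band of similarity ranks rather than always from the top few neighbours. The sketch below assumes a square similarity matrix indexed by image position; the function name `sample_distractor` and its parameters are hypothetical, and the notebook's own `n_most_sim` helper may be organised differently.

```python
import numpy as np

def sample_distractor(sim, idx, lo=0, hi=5, rng=None):
    """Sample a distractor index for item `idx` from a similarity rank band.

    sim: square matrix, sim[i, j] = similarity between items i and j
    lo, hi: ranks delimiting the band [lo, hi); rank 0 is the most
            similar *other* item
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(-sim[idx])      # most similar first
    order = order[order != idx]        # never return the item itself
    band = order[lo:hi]
    if len(band) == 0:
        raise ValueError('empty rank band')
    return int(rng.choice(band))

# toy similarity matrix for four images
sim = np.array([[1.0, 0.9, 0.5, 0.1],
                [0.9, 1.0, 0.4, 0.2],
                [0.5, 0.4, 1.0, 0.3],
                [0.1, 0.2, 0.3, 1.0]])

near = sample_distractor(sim, 0, lo=0, hi=1)   # the closest other image
far = sample_distractor(sim, 0, lo=2, hi=3)    # the least similar image
print(near, far)
```

Moving the band away from rank 0 trades difficulty against the risk that the negative hypothesis accidentally fits the premise image; where the right band lies would have to be established empirically.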

### ... between Captions and Paragraphs

The expressions do not need to be of the same type; the more important requirement is that they relate to the same type of object. Here we show captions as premise, and image paragraphs as hypotheses (with the negative instances coming from similar images). The hypotheses thus could be seen as elaborations of the situation described by the caption.

In [13]:
# caption, paragraph for same image, paragraph for different but similar image
n = 5

available_iis_cappar = df['cocoparcapdf']['image_id_x']
available_iis_sim = coco_id2semsim.keys()
available_iis = set(available_iis_cappar).intersection(available_iis_sim)
# len(available_iis)    # Only 1503 available...

for _ in range(n):
    ii = np.random.choice(list(available_iis))
    cap, ppar = df['cocoparcapdf'][df['cocoparcapdf']['image_id_x'] == ii]['caption paragraph'.split()].values[0]
    all_sim = n_most_sim(coco_sem_sim, coco_sem_map, coco_id2semsim[ii], n=200)
    all_neg = set(available_iis).intersection(all_sim) - {ii}  # never pick the seed image itself
    nii = np.random.choice(list(all_neg))
    npar = df['cocoparcapdf'][df['cocoparcapdf']['image_id_x'] == nii]['paragraph'].values[0]

    print('=' * 40)
    print(cap)
    print('-' * 40)
    print(ppar)
    print('-' * 40)
    print(npar)


========================================
A hotel wall with a white substance made to look like a penis.
----------------------------------------
The room has two unmade beds in it. The beds share a headboard that appears to have an unidentifiable mark or scratches. Both beds have white sheets and a pillow. The bed on the left has a red and tan comforter. The white wall behind the beds has no decoration or hanging picture on it.
----------------------------------------
The bedroom consists of two twin beds, lying next to each other. Made of wooden material, the beds contain three pillows each and there is a small dresser lying between them. A telephone and double-head lamp is lying on the table. On the other side of the room, there are two chairs lying between a table. They're arranged behind a curtain.
========================================
Four toy characters sitting next to some pizza slices.
----------------------------------------
The image is of plastic figurines on a chopping block surrounded by a partially cut pizza. The pizza has cheese and pepperoni on it

### ... between Discursive Scene Descriptions and Follow-Ups

Following the same recipe, we can create other, related tasks as well, such as the following: "Given a sequence of region descriptions from one scene, predict whether an additional single region description comes from the same scene or not." 

This is a variant of the Caption / There is task from above, except that here the premise is not assumed to fully describe the base situation; the particular relation that links the hypothesis to the premise could be called "continuation" or "thematic coherence". 

First, with a randomly selected distractor scene for the wrong hypothesis:

In [14]:
# deep caption + follow up: plausible or not? Randomly selected neg hyp.

n_egs = 3

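for _ in range(n_egs):
    # Base scene: the image of a randomly sampled row of the region table
    ic, ii = df['vgregdf'].sample()[['i_corpus', 'image_id']].values[0]

    # Up to 10 distinct region phrases from that scene; the last one is held out
    # as the positive hypothesis, the remaining ones form the premise
    prem_set_all = list(set(query_by_id(df['vgregdf'], (ic, ii), 'phrase')))
    prem_set = np.random.choice(prem_set_all, min(10, len(prem_set_all)), replace=False)
    np.random.shuffle(prem_set)
    phyp = prem_set[-1]
    prem_set = prem_set[:-1]

    # Negative hypothesis: a random phrase from a randomly sampled image
    nii = df['vgregdf'].sample()['image_id'].values[0]
    nhyp = np.random.choice(query_by_id(df['vgregdf'], (ic, nii), 'phrase'))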

    print("=" * 40)
    print("The scene:")
    for rg in prem_set:
        print(' ', rg)
    print("Which of the following belongs to the same scene?")
    print(" A:", phyp)
    print(" B:", nhyp)

========================================
The scene:
  seagulls flocked together at the shore
  lake is frozen over
  house around the reservior
  wet sand at a beach
  animal in the water
  the dog is black
  a dog on the sand
  person standing on beach in the distance
  dark blue patches on icy water
Which of the following belongs to the same scene?
 A: a person casting a shadow on the sand
 B: Patch of bright green grass
========================================
The scene:
  an arched entryway to the church
  red roof on building
  Clock on the side of the building
  a path going to the building
  Arched black and yellow door
  red and black shingle slanted roof
  Window on the building
  bright blue sky with white clouds
  Gravel pathway through green grass
Which of the following belongs to the same scene?
 A: Wooden door way
 B: Black tire of a car
========================================
The scene:
  The chairs are green.
  man at a cafe
  green chairs and table outside
  he is wearing tan pants.
  windows in front of a restaurant
  green plastic patio tables
  reflection of the street
  name inscribed

With the distractor scene selected for similarity:

In [15]:
# deep caption + follow up: plausible or not? Neg hyp from similar image.
ic = icorpus_code['visual_genome']
reg_sim_iis = list(set(df['vgregdf']['image_id']).intersection(set(visg_id2semsim.keys())))

n_egs = 3

for _ in range(n_egs):
    # Base scene: a random image that has region descriptions and a similarity entry
    ii = np.random.choice(reg_sim_iis)
    # ic, ii = df['vgregdf'].sample()[['i_corpus', 'image_id']].values[0]

    # Up to 10 distinct region phrases from that scene; the last one is held out
    # as the positive hypothesis, the remaining ones form the premise
    prem_set_all = list(set(query_by_id(df['vgregdf'], (ic, ii), 'phrase')))
    prem_set = np.random.choice(prem_set_all, min(10, len(prem_set_all)), replace=False)
    np.random.shuffle(prem_set)
    phyp = prem_set[-1]
    prem_set = prem_set[:-1]

    # Negative hypothesis: a random phrase from a similar image with region descriptions
    nm = n_most_sim(visg_sem_sim, visg_sem_map, visg_id2semsim[ii], n=100)
    #nm = n_most_sim(visg_vis_sim, visg_vis_map, visg_id2vissim[ii], n=100)
    nm = set(df['vgregdf']['image_id']).intersection(set(nm))
    nii = np.random.choice(list(nm))
    nhyp = np.random.choice(query_by_id(df['vgregdf'], (ic, nii), 'phrase'))

    print("=" * 40)
    print("The scene:")
    for rg in prem_set:
        print(' ', rg)
    print("Which of the following belongs to the same scene?")
    print(" A:", phyp)
    print(" B:", nhyp)

========================================
The scene:
  vegetables inside of soup
  brown of toasted bread
  Golden brown toast
  green lettuce of sandwich
  a piece of an orange
  Dark soup in bowl
  a piece of a sandwich
  edge of a soup spoon
  soup in a bowl on a plate
Which of the following belongs to the same scene?
 A: surface of gray marble
 B: Food on a tray
========================================
The scene:
  colorful plant behind gate
  A little yellow train.
  the sky is white.
  yellow and black front of train
  large rosebush behind the fence
  the grass is green.
  A building in the distance.
  small yellow and red train for tourists
  the sign is green.
Which of the following belongs to the same scene?
 A: train is yellow and black
 B: Backhoe construction equipment
========================================
The scene:
  roof of building with columns
  small buildings attached to railings
  entrance to the building
  a couple of lights
  steps leading to door
  solar panels on roof
  leafless tree on the grass
  entryway to the building with columns
  multi floored building between structures


To summarise this section: the general recipe is to use the external grounding of the expressions (in images) to construct pairs of expressions that are semantically closely related, with the second member of the pair in some sense following from the first. Negative pairs, in which the second member does not follow, are constructed by taking the two expressions from different images.

* **Dataset:** pairs of expressions related via the same image
* **Negative Instances:** expressions taken from other images
* **Source:** Visual Genome, COCO; derived
* **Uses:** learn to predict whether a given pair is semantically related or not; learn *(common sense) entailment* relation
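The recipe summarised above can be sketched in a few lines. This is only an illustration under assumptions: `build_pairs` and the `toy_scenes` mapping are hypothetical stand-ins for the corpus data frames used in this notebook, and the negative image is drawn at random rather than by similarity.

```python
import random


def build_pairs(expressions_by_image, rng):
    """Return (premise, hypothesis, label) triples.

    label 1: both expressions come from the same image;
    label 0: the hypothesis comes from a different image.
    """
    image_ids = sorted(expressions_by_image)
    if len(image_ids) < 2:
        raise ValueError("need at least two images to build negative pairs")
    triples = []
    for ii in image_ids:
        exprs = expressions_by_image[ii]
        if len(exprs) < 2:
            continue  # no positive pair can be formed from a single expression
        premise, pos_hyp = rng.sample(exprs, 2)
        other = rng.choice([j for j in image_ids if j != ii])
        neg_hyp = rng.choice(expressions_by_image[other])
        triples.append((premise, pos_hyp, 1))
        triples.append((premise, neg_hyp, 0))
    return triples


# Toy data for illustration; the phrases are taken from the example outputs above
toy_scenes = {
    "beach": ["a dog on the sand", "the dog is black", "wet sand at a beach"],
    "lunch": ["Golden brown toast", "soup in a bowl on a plate", "a piece of a sandwich"],
}

pairs = build_pairs(toy_scenes, random.Random(0))
for premise, hypothesis, label in pairs:
    print(label, "|", premise, "->", hypothesis)
```

Replacing the random choice of `other` with a similarity-based choice gives the harder variant of each task shown above.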

## What Would a Model have to Learn, and What Might it Look Like?

Having looked at various data sets that can be created, which all bring out different aspects of the general *implies* relation, we can pause briefly to ask what a model that learns this relation from that data would have to learn.

The first thing to note is that what is to be learned cannot just be logical rules (or rather, the meaning of the logical constants). That might be enough for pairs such as "all girls are coding / the girl on the left is coding", but as the expressions used here come from actual use contexts (albeit in annotations), such textbook examples are unlikely to occur. In many cases, at the very least additional *linguistic* knowledge is required (e.g., to relate "a woman is reading" and "a person is reading"). Beyond that, many of the examples seem to require knowledge that goes beyond the linguistic and into the common sense domain (e.g., to know that in a situation described as "The skate boarder is riding the skateboard down a slope.", it is likely that "there is a helmet" is also true).

A straightforward modern approach would be to train a high-capacity classifier (most likely a neural network) that takes a pair of expressions and predicts whether the relation holds or not. (In that sense, the classifier both learns and defines the relation.) This is indeed the approach typically taken to the "Natural Language Inference" task \cite{snli:emnlp2015}, and with good success.

We just note that in the settings described here, other approaches also seem possible. We said above that the semantic view on the relation is that it holds in case every model of the premise is also a model of the hypothesis. Given a way to evaluate an expression relative to a model, as sketched in Section [Expressions and Denotations](#Expressions-and-Denotations), this quantification over models could indeed be realised, as quantification over all available images.

This would, however, require the availability of a reasonably large set of reference models -- which perhaps means that the approach loses in cognitive plausibility. (If that is a goal.) An approach that sits somewhere between the direct prediction and the model exemplar checking would be one where a model (or set of models) is *predicted* from the premise, against which the hypothesis is then evaluated. There are methods in the literature that aim to go from natural language expressions (typically, captions) to images or semi-symbolic representations such as image layouts \cite{Hong2018}, \cite{Zhao2018}, which could perhaps be used for this. The advantage of such a model would be that it is inspectable and hence lends itself to deriving *explanations* as well.
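The exemplar-checking variant, i.e. quantification over all available images, can be sketched as follows. This is a minimal illustration under assumptions: `holds_in` is a hypothetical evaluator that decides whether an expression is true of an image, and the toy "images" are just sets of labels; a real evaluator would be along the lines of the denotation functions discussed earlier.

```python
def premise_models(premise, images, holds_in):
    """All available images in which the premise is true."""
    return [img for img in images if holds_in(premise, img)]


def situational_entailment(premise, hypothesis, images, holds_in):
    """Return (entailed, support).

    entailed: True if every image that is a model of the premise is also a
              model of the hypothesis (vacuously True if there is none).
    support:  number of premise models that were checked.
    """
    models = premise_models(premise, images, holds_in)
    return all(holds_in(hypothesis, img) for img in models), len(models)


def entailment_score(premise, hypothesis, images, holds_in):
    """Graded variant: fraction of premise models that satisfy the hypothesis.

    Returns None if no available image is a model of the premise.
    """
    models = premise_models(premise, images, holds_in)
    if not models:
        return None
    return sum(holds_in(hypothesis, img) for img in models) / len(models)


# Toy reference models (hypothetical): each image is reduced to a set of labels
toy_images = {
    "img1": {"woman", "person", "book"},
    "img2": {"woman", "person", "dog"},
    "img3": {"man", "person", "book"},
}


def toy_holds_in(expression, image_id):
    """Hypothetical evaluator: an expression is true if its label is present."""
    return expression in toy_images[image_id]


print(situational_entailment("woman", "person", toy_images, toy_holds_in))  # (True, 2)
print(entailment_score("person", "book", toy_images, toy_holds_in))  # 2 of 3 premise models
```

The graded score is perhaps closer to the "more likely to accept" reading of implication than the strict universal check, and the `support` count makes explicit how much the verdict depends on the size of the reference set.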

# References

[<a id="cit-barwiseperry:sitatt" href="#call-barwiseperry:sitatt">1</a>] Jon Barwise and John Perry, ``_Situations and Attitudes_'',  1983.

[<a id="cit-chierchi:meaning" href="#call-chierchi:meaning">2</a>] Gennaro Chierchia and Sally McConnell-Ginet, ``_Meaning and Grammar: An Introduction to Semantics_'',  1990.

[<a id="cit-Dagan:rte" href="#call-Dagan:rte">3</a>] I. Dagan, O. Glickman and B. Magnini, ``_The PASCAL Recognising Textual Entailment Challenge_'', Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment,  2006.  [online](http://dx.doi.org/10.1007/11736790_9)

[<a id="cit-snli:emnlp2015" href="#call-snli:emnlp2015">4</a>] S.R. Bowman, G. Angeli, C. Potts <em>et al.</em>, ``_A large annotated corpus for learning natural language inference_'', Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP),  2015.

[<a id="cit-schlangen:iwcs19" href="#call-schlangen:iwcs19">5</a>] D. Schlangen, ``_Natural Language Semantics With Pictures: Some Language & Vision Datasets and Potential Uses for Computational Semantics_'', Proceedings of the International Conference on Computational Semantics (IWCS), May 2019.

[<a id="cit-youngetal:flickr30k" href="#call-youngetal:flickr30k">6</a>] P. Young, A. Lai, M. Hodosh <em>et al.</em>, ``_From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions_'', Transactions of the Association for Computational Linguistics, vol. 2,  2014.

[<a id="cit-turney-pantel:10" href="#call-turney-pantel:10">7</a>] P. D. Turney and P. Pantel, ``_From Frequency to Meaning: Vector Space Models of Semantics_'', Journal of Artificial Intelligence Research, vol. 37, pp. 141--188,  2010.

[<a id="cit-Mikolov2013:embeddings" href="#call-Mikolov2013:embeddings">8</a>] T. Mikolov, K. Chen, G. Corrado <em>et al.</em>, ``_Distributed Representations of Words and Phrases and their Compositionality_'', Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013 (NIPS 2013),  2013.

[<a id="cit-zaschla:contground" href="#call-zaschla:contground">9</a>] S. Zarrieß and D. Schlangen, ``_Deriving continous grounded meaning representations from referentially structured multimodal contexts_'', Proceedings of EMNLP 2017 -- Short Papers, September 2017.

[<a id="cit-Hong2018" href="#call-Hong2018">10</a>] S. Hong, D. Yang, J. Choi <em>et al.</em>, ``_Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis_'', ArXiv,  2018.  [online](http://arxiv.org/abs/1801.05091)

[<a id="cit-Zhao2018" href="#call-Zhao2018">11</a>] B. Zhao, L. Meng, W. Yin <em>et al.</em>, ``_Image Generation from Layout_'', ArXiv,  2018.

