In [1]:
import os
import pickle
import csv
import pandas
from datetime import datetime as odt
from time import mktime, time
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, ID, TEXT, DATETIME
from whoosh.qparser import QueryParser
from IPython.display import display, Markdown, Latex

pandas.options.display.max_columns = None
pandas.options.display.max_colwidth = 2000

def get_text_filename(path):
    return path.split('/')[-1]

def get_paper_id(text_filename):
    # Strip the extension first, then drop the version suffix (e.g. "v1").
    pid = text_filename.replace(".pdf.txt", "")
    pid = pid.split('v')[0]
    return pid

def clean_text(text):
    return text.replace("\n", " ").replace("  ", " ").strip()

def create_results_csv(db, query, results):
    csv_filename = "q-results-{}.csv".format(int(time()))
    outpath = os.path.join("results", csv_filename)
    # newline='' lets the csv module control line endings itself.
    with open(outpath, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow([query])
        writer.writerow(["Link", "Title", "Summary"])
        for result in results:
            path = result['path']
            text_filename = get_text_filename(path)
            pid = get_paper_id(text_filename)
            data = db[pid]
            title = clean_text(data['title'])
            summary = clean_text(data['summary'])
            link = data['link']
            writer.writerow([link, title, summary])
    
    return outpath

def run_query(db, ix, query_str):
    print("Running Query ...")
    with ix.searcher() as searcher:
        # Parse the raw query string against the index's "content" field.
        query = QueryParser("content", ix.schema).parse(query_str)
        results = searcher.search(query, limit=None)
        print("Found {} results.".format(len(results)))
        # Write the CSV inside the block, while the searcher is still open.
        results_filename = create_results_csv(db, query_str, results)

    link_tmpl = '[Right-click "Save Link as ..." to Download CSV]({})'
    display(Markdown(link_tmpl.format(results_filename)))
    display(pandas.read_csv(results_filename))
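
The `get_paper_id` helper above cuts the filename at the first `v`, which works for IDs such as `1809.04344v1` but would truncate any ID that itself contains a `v`. A minimal sketch of a stricter variant follows, assuming filenames always end in `.pdf.txt` and versions always look like `v<digits>`. The name `paper_id_from_filename` and the second sample filename are hypothetical and not part of the notebook.

```python
import re

def paper_id_from_filename(text_filename):
    """Return the paper ID without the '.pdf.txt' extension or 'vN' suffix."""
    stem = text_filename
    if stem.endswith(".pdf.txt"):
        stem = stem[: -len(".pdf.txt")]
    # Remove only a trailing version marker such as "v1" or "v12".
    return re.sub(r"v\d+$", "", stem)

# Assumed filename layout, matching the links in the results.
new_style = paper_id_from_filename("1809.04344v1.pdf.txt")
# Hypothetical old-style filename whose ID contains a "v".
old_style = paper_id_from_filename("solv-int9901001v2.pdf.txt")

print(new_style)
print(old_style)
```

Anchoring the pattern at the end of the string leaves any other `v` in the ID untouched.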
    


In [2]:
# Open our paper database
print("Loading DB")
with open('db.p', 'rb') as fh:
    db = pickle.load(fh)

# Open our index
print("Opening Index")
ix = open_dir("indexdir")

print("Done.")

Loading DB
Opening Index
Done.


In [3]:
query_str = "'Making the V in VQA matter' published:201809"

run_query(db, ix, query_str)

Running Query ...
Found 7 results.


[Right-click "Save Link as ..." to Download CSV](results/q-results-1541546146.csv)

Unnamed: 0,Unnamed: 1,'Making the V in VQA matter' published:201809
Link,Title,Summary
http://arxiv.org/abs/1809.04344v1,"The Wisdom of MaSSeS: Majority, Subjectivity, and Semantic Similarity in the Evaluation of VQA","We introduce MASSES, a simple evaluation metric for the task of Visual Question Answering (VQA). In its standard form, the VQA task is operationalized as follows: Given an image and an open-ended question in natural language, systems are required to provide a suitable answer. Currently, model performance is evaluated by means of a somehow simplistic metric: If the predicted answer is chosen by at least 3 human annotators out of 10, then it is 100% correct. Though intuitively valuable, this metric has some important limitations. First, it ignores whether the predicted answer is the one selected by the Majority (MA) of annotators. Second, it does not account for the quantitative Subjectivity (S) of the answers in the sample (and dataset). Third, information about the Semantic Similarity (SES) of the responses is completely neglected. Based on such limitations, we propose a multi-component metric that accounts for all these issues. We show that our metric is effective in providing a more fine-grained evaluation both on the quantitative and qualitative level."
http://arxiv.org/abs/1809.01124v1,Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering,"Question answering is an important task for autonomous agents and virtual assistants alike and was shown to support the disabled in efficiently navigating an overwhelming environment. Many existing methods focus on observation-based questions, ignoring our ability to seamlessly combine observed content with general knowledge. To understand interactions with a knowledge base, a dataset has been introduced recently and keyword matching techniques were shown to yield compelling results despite being vulnerable to misconceptions due to synonyms and homographs. To address this issue, we develop a learning-based approach which goes straight to the facts via a learned embedding space. We demonstrate state-of-the-art results on the challenging recently introduced fact-based visual question answering dataset, outperforming competing methods by more than 5%."
http://arxiv.org/abs/1809.03044v1,"How clever is the FiLM model, and how clever can it be?","The FiLM model achieves close-to-perfect performance on the diagnostic CLEVR dataset and is distinguished from other such models by having a comparatively simple and easily transferable architecture. In this paper, we investigate in more detail the ability of FiLM to learn various linguistic constructions. Our main results show that (a) FiLM is not able to learn relational statements straight away except for very simple instances, (b) training on a broader set of instances as well as pretraining on simpler instance types can help alleviate these learning difficulties, (c) mixing is less robust than pretraining and very sensitive to the compositional structure of the dataset. Overall, our results suggest that the approach of big all-encompassing datasets and the paradigm of ""the effectiveness of data"" may have fundamental limitations."
http://arxiv.org/abs/1809.01816v1,Visual Coreference Resolution in Visual Dialog using Neural Module Networks,"Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., `it'), as the dialog agent must first link it to a previous coreference (e.g., `boat'), and only then can rely on the visual grounding of the coreference `boat' to reason about the pronoun `it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules - Refer and Exclude - that perform explicit, grounded, coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches, and is more interpretable, grounded, and consistent qualitatively."
http://arxiv.org/abs/1809.01810v1,Interpretable Visual Question Answering by Reasoning on Dependency Trees,"Collaborative reasoning for understanding each image-question pair is very critical but underexplored for an interpretable visual question answering system. Although very recent works also attempted to use explicit compositional processes to assemble multiple subtasks embedded in the questions, their models heavily rely on annotations or handcrafted rules to obtain valid reasoning processes, leading to either heavy workloads or poor performance on composition reasoning. In this paper, to better align image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question, and we thus phrase our model as parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module to exploit the local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. Our PTGRN is thus capable of building an interpretable VQA system that gradually derives the image cues following a question-driven parse-tree reasoning route. Experiments on relational datasets demonstrate the superiority of our PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system."
http://arxiv.org/abs/1809.00812v1,RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes,"Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises of approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned set of images. With over 36K automatically generated question-answer pairs, we design a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events and making sense of procedural knowledge. Our preliminary results indicate that RecipeQA will serve as a challenging test bed and an ideal benchmark for evaluating machine comprehension systems. The data and leaderboard are available at http://hucvl.github.io/recipeqa."
http://arxiv.org/abs/1809.02719v1,What If We Simply Swap the Two Text Fragments? A Straightforward yet Effective Way to Test the Robustness of Methods to Confounding Signals in Nature Language Inference Tasks,"Nature language inference (NLI) task is a predictive task of determining the inference relationship of a pair of natural language sentences. With the increasing popularity of NLI, many state-of-the-art predictive models have been proposed with impressive performances. However, several works have noticed the statistical irregularities in the collected NLI data set that may result in an over-estimated performance of these models and proposed remedies. In this paper, we further investigate the statistical irregularities, what we refer as confounding factors, of the NLI data sets. With the belief that some NLI labels should preserve under swapping operations, we propose a simple yet effective way (swapping the two text fragments) of evaluating the NLI predictive models that naturally mitigate the observed problems. Further, we continue to train the predictive models with our swapping manner and propose to use the deviation of the model's evaluation performances under different percentages of training text fragments to be swapped to describe the robustness of a predictive model. Our evaluation metrics leads to some interesting understandings of recent published NLI methods. Finally, we also apply the swapping operation on NLI models to see the effectiveness of this straightforward method in mitigating the confounding factor problems in training generic sentence embeddings for other NLP transfer tasks."
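
The first row of each results file holds only the query string, and the real header (`Link`, `Title`, `Summary`) comes second. A reader that treats the first row as the header will mislabel the columns, which is probably why the table above shows `Unnamed:` headings. The sketch below reads a file in that layout with the standard `csv` module. The function name `read_results_csv` and the sample row are made up for illustration.

```python
import csv
import io

def read_results_csv(fh):
    """Return (query, rows) from a file laid out like the results CSV."""
    reader = csv.reader(fh)
    query_row = next(reader)   # first row: the query string on its own
    header = next(reader)      # second row: Link, Title, Summary
    rows = [dict(zip(header, row)) for row in reader]
    return query_row[0], rows

# Build a small in-memory file in the same layout (illustrative data only).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["vqa published:201809"])
writer.writerow(["Link", "Title", "Summary"])
writer.writerow(["http://example.org/abs/0000.00000v1",
                 "A title, with a comma", "A summary."])
sample.seek(0)

query, rows = read_results_csv(sample)
print(query)
print(rows[0]["Title"])
```

With pandas, passing `skiprows=1` to `pandas.read_csv` should have a similar effect, at the cost of dropping the query string.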
