# Question and Answer assignment

## Task overview

_(Adapted from the assignment's description.)_

In this project, we will be focusing on one of the main features of Assist AI's technology: __macro suggestions__. Macro suggestions consist of recommending potential answers for a question. Throughout this report, the terms *__macro__* and *__template__* will be used as synonyms unless noted otherwise.

Assist AI currently approaches this problem as a prediction task. More specifically, answers to questions are extracted from historical data consisting of Customer Support tickets. In its simplest implementation, this kind of dataset will contain a large number of pairings of a question and the correct answer to that question according to some human annotator. Let's call this dataset _D_, such that _D = List: (question, answer)_.

We also need to assume the use of some similarity metric that allows the system to judge how similar two linguistic expressions are, so that e.g. a previously answered question can be matched to a new question and the former's answer can be re-used as a reply to the latter. For illustration, consider dataset _D_ and input question _q_ below:

In [5]:
D = [
    ('when was America discovered?', 'America was discovered in 1492'),
    ('what is the color of the sky?', 'The sky is blue'),
]

q = 'what year was America discovered?'

The system should be able to recognize that _D_ contains a pair whose first element has a high similarity with _q_ and its answer may therefore be used as the answer for _q_, too.
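
As a minimal sketch of this idea, the snippet below matches _q_ against the questions in _D_ using the Jaccard coefficient over word sets. The metric and the helper names `tokenize` and `jaccard` are assumptions chosen for illustration only; they are not the similarity metric used later in this report.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def jaccard(a, b):
    """Ratio of distinct words shared by two expressions to all their distinct words."""
    words_a, words_b = set(tokenize(a)), set(tokenize(b))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

D = [
    ('when was America discovered?', 'America was discovered in 1492'),
    ('what is the color of the sky?', 'The sky is blue'),
]
q = 'what year was America discovered?'

# Pick the pair whose question is most similar to q and reuse its answer.
best_question, best_answer = max(D, key=lambda pair: jaccard(q, pair[0]))
print(best_question, '->', best_answer)
```

Here the first pair wins because its question shares three of six distinct words with _q_, while the second shares only one of ten.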

Besides answering previously unseen questions with information from observed similar questions, this type of predictive model can also be used to determine the similarity between a new answer from the ticket history and previous answers, in order to assess e.g. if we are in front of a new answer that requires building a new template. For illustration, let's assume we wanted to expand dataset _D_ with the candidate _(question, answer)_ pair below:

In [6]:
D = [
    ('when was America discovered?', 'America was discovered in 1492'),
    ('what is the color of the sky?', 'The sky is blue')
]

#    Below is a hypothetical new ticket generated by a customer support representative.
#    The new ticket provides a new potential <question, answer> template to be learned
#    by the system:
new_ticket = (                                     
    'when was the discovery of America?',          # new candidate question
    'Europeans first set foot on America in 1492'  # new candidate answer
)

In this case, we would like the similarity metric to reflect the fact that, from a semantic point of view, the answer in the new ticket is a paraphrase of the answer in the first item of _D_ and, therefore, that no new template should be added for the question 'when was the discovery of America?'. Instead, both the old and the new question should simply be mapped to the same answer/template.

In this type of framework, the choice of similarity metric is very likely to be a key factor in the overall performance of the model.
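
The decision described above can be sketched as follows. The overlap coefficient, the `0.5` threshold and the names `templates`, `question_to_template` and `learn` are all hypothetical choices made for this illustration; a different metric or threshold could easily lead to a different decision, which is precisely why the choice of metric matters.

```python
import re

def words(text):
    """Set of lowercase alphanumeric word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    """Shared words divided by the size of the smaller word set (overlap coefficient)."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

templates = ['America was discovered in 1492', 'The sky is blue']
question_to_template = {
    'when was America discovered?': 0,
    'what is the color of the sky?': 1,
}

def learn(ticket, threshold=0.5):
    """Map the ticket's question to an existing template, or add a new template."""
    question, answer = ticket
    scores = [overlap(answer, t) for t in templates]
    best = max(range(len(templates)), key=scores.__getitem__)
    if scores[best] < threshold:
        # No known answer is similar enough: the new answer becomes a template.
        templates.append(answer)
        best = len(templates) - 1
    question_to_template[question] = best
    return best

new_ticket = (
    'when was the discovery of America?',
    'Europeans first set foot on America in 1492',
)
print(learn(new_ticket))
```

With these assumptions the new question is mapped to the existing first template, and the list of templates does not grow.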

### 1st task
On the basis of the problem description in the previous section, we will address the second issue first, namely, how to determine whether an input answer is similar to any of the answers in the templates already known by the system and, if so, which of those templates is the best match for the new answer.

### 2nd task
In the second task, we will address the issue of assigning a relevant answer to a new question not previously observed in the customer support tickets.

For this second part of the project, the task specifications state that we can assume the training set to contain:
1. between 1k and 100k instances, to be denoted as _|S|_.
2. and between 10 and 1k classes, to be denoted as _|C|_.

In our implementation:
* _|S|_ is set to 174 (all the instances in our dataset of questions and answers).
* _|C|_ is set to 174 (ibidem).

## Challenges
#### The problem of thresholding <a name="thresholding"></a>
In order to assess the similarity between a new expression and previously observed expressions, one of the main challenges is choosing the right similarity threshold, below which any candidate pair of expressions is considered dissimilar and should therefore be discarded as a match.

In this case, the problem lies in the fact that potentially many pairs of expressions will be similar based on some of the features used by the system to measure the similarity. This means that a simple heuristic such as always picking the highest-scoring match will usually result in all correct matches being selected, but also in many incorrect matches being selected (since there is always a highest-scoring match, even when the actual score is not objectively high). That is, given that the similarity scale expresses relative similarity rather than absolute similarity, we need some other way for the system to be able to detect __true negatives__ with some degree of success.
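
The toy example below illustrates the issue. It is only a sketch: the score (number of shared words), the threshold value and the names `shared_words` and `best_match` are assumptions made for the illustration.

```python
def shared_words(a, b):
    """Number of distinct lowercase words two expressions have in common."""
    return len(set(a.lower().split()) & set(b.lower().split()))

known_questions = [
    'when was america discovered',
    'what is the color of the sky',
]

def best_match(query, threshold=None):
    """Return the highest-scoring known question, or None if it scores below the threshold."""
    scored = [(shared_words(query, known), known) for known in known_questions]
    score, match = max(scored)
    if threshold is not None and score < threshold:
        return None
    return match

unrelated = 'how tall is mount everest'
print(best_match(unrelated))               # a match is returned although nothing is related
print(best_match(unrelated, threshold=2))  # the weak match is rejected
```

Without a threshold, the unrelated query is matched to the sky question on the strength of a single shared word (_is_); with a threshold it is correctly left unanswered.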

In [a later section](#relev_filter), we will discuss in detail our solution for this problem.

#### The problem of partial matches <a name="partial_match"></a>
Another issue, closely related to the problem of thresholding introduced in the previous point, is the fact that the system may miss a legitimate match between two instances due to the match existing between some of the instances' subunits rather than the full instances themselves.

That is, in order to measure the similarity between a pair of instances, the system needs to take into account some ratio of the number of features that they share with respect to their total number of features (by using e.g. the Euclidean distance or some similar metric). By definition, partial matches contain all the relevant information for triggering a match, plus some amount of irrelevant information that should not trigger it. If the features activated by the irrelevant information outweigh the features activated by the relevant information, the similarity between the two may fall below the chosen threshold, resulting in a __false negative__.
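
A small sketch of how a ratio-based similarity can produce such a false negative is shown below. The function `overlap_ratio`, the example sentences and the `0.5` threshold are hypothetical and serve only to illustrate the effect.

```python
def overlap_ratio(a, b):
    """Shared distinct words divided by all distinct words of both expressions."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

query = 'america discovered 1492'
exact = 'america was discovered in 1492'
# Same relevant content as `exact`, plus extra context that matches nothing in the query.
padded = exact + ' by an expedition that sailed from spain under columbus'

threshold = 0.5
for candidate in (exact, padded):
    ratio = overlap_ratio(query, candidate)
    print(round(ratio, 2), ratio >= threshold)
```

Both candidates contain every word of the query, yet only the short one clears the threshold: the extra context in the padded candidate dilutes the ratio from 3/5 to 3/14.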

Details on a quick fix for this problem are given in [this section](#solution_partial_matches).

#### The problem of linguistic variation <a name="lingvar"></a>
Ultimately, linguistic variation is the key linguistic challenge underlying both tasks. That is, given that natural languages allow the same thought to be expressed in a number of different ways, in order to achieve optimal performance any similarity metric aimed at measuring the semantic relatedness between two expressions must ideally provide robustness across all or as many of their variants as possible.

This type of robustness is determined by the choice of features, which must be consistent and must remain the same across variants of the same idea. In other words, feature extraction must be performed in such a way that the features extracted from two variants expressing the same overall meaning reflect none of the linguistic variation that makes them two different expressions.

The details of our approach for dealing with this phenomenon are discussed [here](#solution_inguistic_variation).

## Dataset
For our experiments, we will be using a subset of Microsoft's [WikiQA corpus](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/).

More specifically, we will be using all instances corresponding to questions with at least 2 manually-assigned correct answers (a total of 174). In addition, and as part of the test data, we will also be using a variable number of instances with a single manually-assigned correct answer (between 68 and 140 based on the experimental conditions).

This subset will provide us with the data we need for tasks 1 and 2:
* Given that the dataset contains many instances with two answers for each question, we can use one of the answers to train the system, then have the algorithm predict the label for the second answer (thus fulfilling the requirement that the prediction is tested on previously unseen input). Since we know in advance that both are valid answers to the same question, we know which prediction to expect from the system (as a matter of fact, this amounts in practice to having a total of 348 training instances available to us).
* Given that the dataset also contains the question that triggered each answer, it also provides us with the necessary data for our second prediction task: suggesting a potential answer for a previously unseen question. Since the questions are never used for training in our implementation, we can actually use again all instances for testing. For simplicity, we will use the same training and testing data size in both tasks:
  * 174 training instances,
  * plus between 34 and 70 positive test instances as true positives (from the subset of double-answer instances),
  * plus the same number of negative test instances as true negatives (from the subset of single-answer instances).
 
 
#### Effect of the composition of the corpus on the partial matches problem <a name="effect_partial_matches"></a>
In Microsoft's WikiQA dataset, the pairing of questions and answers is generally precise, in the sense that each answer tends to carry all the necessary information to answer its associated question, and very little additional information in the form of context. This constitutes a safeguard by design against the problem of [partial matches](#partial_match): since the information to be matched tends to be exactly the information contained in the question and its answer, there is little information left that can mislead the system into a false negative or, probably less likely, a false positive.

## Model

### Architecture
Our model is built around a simple pipeline:
1. A main method from which all other methods and classes are called, and where all the different system [parameterizations](#Parameters) are dynamically generated.
2. The _Dataset_ class, which is responsible for sampling the data, splitting the training and test sets, and storing the full [corpus](#Dataset) used in our experiments.
3. The _FeatureEngine_ class, which takes care of the [feature extraction](#Features).
4. Two wrapper functions, each with the set of instructions for running one of the __tasks__: _FirstTask_ and _SecondTask_. Both wrappers are essentially identical except for the fact that the 2nd task's wrapper creates the test dataset from instance questions (as opposed to instance answers in the case of the 1st task's wrapper). Therefore, in what follows we will only be looking at the [classifier](#Classifier) implemented in the 1st task wrapper.

In the next sections we will cover in more detail points 1, 3 and 4 above. Point 2 has already been addressed in the previous section.

### Parameters

| Parameter | Description | Values tested |
|---|---|---|
| Ratio of test data | Defines a ratio over the total number of dataset records. The result denotes some quantity of instances that will be added as test instances. For the first task, this additional data will consist of a list with the __second correct answer__ of all dataset records (after randomization), up to the number specified by this parameter. For the second task, the list will consist of the __questions__ in the dataset records, up to the number specified by this parameter.| 0.2, 0.4 |
| Ratio of confidence | Defines a ratio over the summation of the inverse probability of all the features in a bag of features. This ratio will then be used as threshold _t_ during classification: given [1] a candidate match between two bags of features sharing at least some feature, [2] the sum of the inverse probability of their shared features, and [3] the sum of the inverse probability of all their features, there is a positive match if _[2] / [3] >= t._  | 0.5, 0.9 |
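
The test described for the __ratio of confidence__ can be sketched as below. The report does not spell out how inverse probabilities are computed, so the weights in `inv_prob` are hypothetical values, and the reading of "all their features" as the union of both bags is likewise an assumption.

```python
def is_positive_match(bag_a, bag_b, inv_prob, t):
    """Return True if the weight of the shared features is at least t times the total weight."""
    shared = bag_a & bag_b
    if not shared:
        return False  # only bags sharing at least one feature are candidate matches
    shared_weight = sum(inv_prob[f] for f in shared)         # quantity [2]
    total_weight = sum(inv_prob[f] for f in bag_a | bag_b)   # quantity [3]
    return shared_weight / total_weight >= t

# Hypothetical inverse probabilities: rarer features carry more weight.
inv_prob = {
    'america': 4.0, 'discovered': 4.0,
    'when': 2.0, 'year': 2.0,
    'was': 1.0, 'what': 1.0,
}
bag_a = {'when', 'was', 'america', 'discovered'}
bag_b = {'what', 'year', 'was', 'america', 'discovered'}

print(is_positive_match(bag_a, bag_b, inv_prob, t=0.5))
print(is_positive_match(bag_a, bag_b, inv_prob, t=0.9))
```

With these weights the shared features account for 9 of 14 weight units, so the pair is accepted at _t = 0.5_ and rejected at _t = 0.9_.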

### Features
In our model, feature extraction will be based on a relatively simple set of features.

1) word [n-gram](https://en.wikipedia.org/wiki/N-gram)s for _n: 1 ≤ n ≤ 3_, using the conventional definition of n-grams as
   * all possible sequences of _n_ adjacent elements in the __space__,
   * where the __space__ is defined as all the words in each sentence of the corpus
   * and where each sequence starts at the _i_-th element in the __space__ and ends at the _i + (n - 1)_-th element.

In [10]:
#   Example of feature extraction using n-grams:
from FeatureEngine import FeatureEngine

fe1 = FeatureEngine(
    ngrams=[1]
)

fe2 = FeatureEngine(
    ngrams=[2]
)

fe12 = FeatureEngine(
    ngrams=[1, 2]
)

example = 'this sentence will be used as an n-gram extraction example'

print 'using 1-grams:', fe1(example)
print
print 'using 2-grams:', fe2(example)
print
print 'using 1- and 2-grams:', fe12(example)

using 1-grams: ['this', 'sentence', 'will', 'be', 'used', 'as', 'an', 'n', 'gram', 'extraction', 'example']

using 2-grams: ['this sentence', 'sentence will', 'will be', 'be used', 'used as', 'as an', 'an n', 'n gram', 'gram extraction', 'extraction example']

using 1- and 2-grams: ['this', 'sentence', 'will', 'be', 'used', 'as', 'an', 'n', 'gram', 'extraction', 'example', 'this sentence', 'sentence will', 'will be', 'be used', 'used as', 'as an', 'an n', 'n gram', 'gram extraction', 'extraction example']
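
Since the source of _FeatureEngine_ is not reproduced in this report, the function below is only a stand-alone sketch of word n-gram extraction that follows the definition above. The name `word_ngrams` and the tokenization rule (lowercasing and splitting on non-word characters, so that _n-gram_ yields _n_ and _gram_) are assumptions inferred from the printed output.

```python
import re

def word_ngrams(sentence, orders):
    """Return the word n-grams of the sentence for every n in `orders`, lowest order first."""
    tokens = re.findall(r"\w+", sentence.lower())
    features = []
    for n in orders:
        # Each n-gram starts at position i and ends at position i + (n - 1).
        for i in range(len(tokens) - n + 1):
            features.append(' '.join(tokens[i:i + n]))
    return features

print(word_ngrams('this sentence will be used', [1, 2]))
```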




2) character n-grams for _n: 3 ≤ n ≤ 5_, under the same n-gram definition as word n-grams above, but with the __space__ now defined as all the characters in each word of every sentence in the dataset.




In [11]:
#   Example of feature extraction using character n-grams:
from FeatureEngine import FeatureEngine

fe1 = FeatureEngine(
    ch_ngrams=[3]
)

fe2 = FeatureEngine(
    ch_ngrams=[4]
)

fe12 = FeatureEngine(
    ch_ngrams=[3, 4]
)

example = 'this sentence will be used as an n-gram extraction example'

print 'using 3-grams:', fe1(example)
print
print 'using 4-grams:', fe2(example)
print
print 'using 3- and 4-grams:', fe12(example)

using 3-grams: ['thi', 'his', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'wil', 'ill', 'be', 'use', 'sed', 'as', 'an', 'n', 'gra', 'ram', 'ext', 'xtr', 'tra', 'rac', 'act', 'cti', 'tio', 'ion', 'exa', 'xam', 'amp', 'mpl', 'ple']

using 4-grams: ['this', 'sent', 'ente', 'nten', 'tenc', 'ence', 'will', 'be', 'used', 'as', 'an', 'n', 'gram', 'extr', 'xtra', 'trac', 'ract', 'acti', 'ctio', 'tion', 'exam', 'xamp', 'ampl', 'mple']

using 3- and 4-grams: ['thi', 'his', 'this', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'wil', 'ill', 'will', 'be', 'be', 'use', 'sed', 'used', 'as', 'as', 'an', 'an', 'n', 'n', 'gra', 'ram', 'gram', 'ext', 'xtr', 'tra', 'rac', 'act', 'cti', 'tio', 'ion', 'extr', 'xtra', 'trac', 'ract', 'acti', 'ctio', 'tion', 'exa', 'xam', 'amp', 'mpl', 'ple', 'exam', 'xamp', 'ampl', 'mple']
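
The behaviour visible in the output above can be approximated with the stand-alone sketch below. It is not the _FeatureEngine_ implementation: the name `char_ngrams`, the tokenization, and the rule that words no longer than _n_ are emitted whole are assumptions inferred from the printed lists.

```python
import re

def char_ngrams(sentence, orders):
    """Return character n-grams word by word, for every n in `orders`."""
    features = []
    for word in re.findall(r"\w+", sentence.lower()):
        for n in orders:
            if len(word) <= n:
                # Words that are not longer than n are kept whole.
                features.append(word)
            else:
                features.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return features

print(char_ngrams('this will be', [3, 4]))
```

Note that, as in the output above, a short word such as _be_ is emitted once per order, which gives it a duplicate entry in the combined setting.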



3) word s-skip-n-grams, for _n: 3 ≤ n ≤ 4_, _S = {1, 2}_, and _s: s ∈ S_, where the final features are defined as
   * n-grams of order _n_ (also as it pertains to the __space__ from which they are extracted),
   * in which any constituents in a position _p_ of the n-gram are removed for _p = 1 + s_, as long as _p < n_.



In [12]:
#   Example of feature extraction using word s-skip-n-grams:
from FeatureEngine import FeatureEngine

fe1 = FeatureEngine(
    skip_ngrams=[3]
)

fe2 = FeatureEngine(
    skip_ngrams=[4]
)

fe12 = FeatureEngine(
    skip_ngrams=[3, 4]
)

example = 'this sentence will be used as an n-gram extraction example'

print 'using s-grams (n=3):', fe1(example)
print
print 'using s-grams (n=4):', fe2(example)
print
print 'using s-grams (n=3 and n=4):', fe12(example)

using s-grams (n=3): ['this * will', 'sentence * be', 'will * used', 'be * as', 'used * an', 'as * n', 'an * gram', 'n * extraction', 'gram * example']

using s-grams (n=4): ['this * be', 'sentence * used', 'will * as', 'be * an', 'used * n', 'as * gram', 'an * extraction', 'n * example']

using s-grams (n=3 and n=4): ['this * will', 'sentence * be', 'will * used', 'be * as', 'used * an', 'as * n', 'an * gram', 'n * extraction', 'gram * example', 'this * be', 'sentence * used', 'will * as', 'be * an', 'used * n', 'as * gram', 'an * extraction', 'n * example']
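
The printed features keep the first and last word of each n-gram and collapse everything in between into a single `*`. The function below is a stand-alone sketch that mirrors this output; the name `skip_ngrams` and the tokenization are assumptions, and the actual _FeatureEngine_ code may differ.

```python
import re

def skip_ngrams(sentence, orders):
    """Return 'first * last' features for every word n-gram of each order in `orders`."""
    tokens = re.findall(r"\w+", sentence.lower())
    features = []
    for n in orders:
        for i in range(len(tokens) - n + 1):
            # Keep the outermost words of the n-gram and skip the inner ones.
            features.append(tokens[i] + ' * ' + tokens[i + n - 1])
    return features

print(skip_ngrams('this sentence will be used', [3, 4]))
```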


We have run experiments both using different combinations of these features as well as settings where they were each disabled, in order to better assess their specific contributions. In the Results section we provide a detailed summary of the metrics.

#### Theoretical foundation

The use of character n-grams and s-skip-n-grams (_s-grams_ henceforth for short) as features has the same goal: to relax the constraints for matching in order to maximize the number of feature coincidences detected by the system.

__Character n-grams.__ In many tasks, character n-grams have been shown to be a suitable replacement for preprocessing steps such as stemming or lemmatization, which are similarly aimed at decreasing the linguistic variation in the data and thus increase the number of matches.

Character n-grams and stemming algorithms provide a comparable boost in recall. However, character n-grams are simpler overall, and stemming does not provide a significant improvement in precision, which is why character n-grams were chosen as features in our system.

Similarly, although lemmatization would provide a nearly-optimal feature representation, a reliable implementation would presuppose a full NLP-pipeline with solved part-of-speech tagging and some level of word sense disambiguation functionalities. Given the additional implementation cost, a decision was made to leave as future work the possibility of incorporating lemmatization into our system.

The logic behind the use of character n-grams is to decompose words into sequences made up of their constituent characters, so that minor morphological variations are abstracted away and do not prevent the words' constant component[1] from being matched as features, allowing for a more accurate measurement of the similarity between linguistic expressions.

__S-grams.__ S-grams, on the other hand, allow the model to represent discontinuous dependencies much more reliably than n-grams, somewhat mitigating the impact of data sparsity with growing values of _n_ and allowing s-grams to capture relevant long-range semantic relations. This makes the system robust to minor variations in wording between different linguistic paraphrases of the same overall meaning.

[1] Their roots, as we would expect them to be returned by an ideal lemmatizer.

#### Solution to the problem of linguistic variation <a name="solution_inguistic_variation"></a>
This choice of features is directly related to the issue of __linguistic variation__ introduced in our overview of the tasks. More specifically:


* By using character n-grams, many legitimate linguistic variants of the same lemma that would not have been matched otherwise, will now be matched through their component parts. That is, character n-grams play the same role in feature extraction as features do in classification: in a classification task, we want to reduce two instances to their respective sets of features and we then expect the latter to intersect if the two instances are similar in any relevant way. When using character n-grams, words are also reduced to their respective sets of "features" (character sequences), which we then expect to intersect if the two words are variants of the same word and, therefore, share the root or some other majority subset of their characters.


* By using s-grams and extracting high-order n-grams, the system abstracts the innermost tokens in high-order n-grams and can collapse several legitimate variants of the same n-gram, e.g. _he is tall_ and _he is now very tall_ are both transformed into the same feature: __he * tall__, increasing the potential matches accordingly, i.e., the same meaning, initially split between at least two different n-gram features, is now correctly represented as a common single s-gram feature. Standard n-gram models lack the expressive power to represent this type of information but s-gram models do not have that limitation.
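
Both effects can be checked with a short sketch. The helpers `char_trigrams` and `edge_skipgrams` are hypothetical names, and the orders `(3, 5)` are chosen here only so that the five-word variant of the example is covered.

```python
def char_trigrams(word):
    """Set of character 3-grams of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def edge_skipgrams(sentence, orders):
    """Set of 'first * last' features for every word n-gram of each order in `orders`."""
    tokens = sentence.lower().split()
    return {
        tokens[i] + ' * ' + tokens[i + n - 1]
        for n in orders
        for i in range(len(tokens) - n + 1)
    }

# Two inflected variants of the same root share most of their character 3-grams.
print(sorted(char_trigrams('discovered') & char_trigrams('discovery')))

# Two wordings of the same idea share one s-gram feature.
shared = edge_skipgrams('he is tall', (3, 5)) & edge_skipgrams('he is now very tall', (3, 5))
print(shared)
```

The two sentences have no word 3-gram in common, but both produce the s-gram _he * tall_, which is what allows them to be matched.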

### Classifier

#### Relevance filter _versus_ Logistic Regression
For the initial experiments of the first task, a [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) (LR) classifier was implemented (using the same features we described in previous sections). However, the performance of this model was rather poor overall. After performing the error analysis, some trivial issues became apparent that could be fixed _post hoc_ with an inverse probability relevance filter, i.e., a process that only accepts a bag-of-features-based match between two items if the features in the intersection of the items' bags also happen to be the highest-relevance (~highest inverse-probability) features in their respective bags.

However, if that kind of filter were implemented, it would already be able to handle the classification task itself (as a mere lookup over the [inverted index](https://en.wikipedia.org/wiki/Inverted_index)), making the LR classifier redundant to a certain extent. On grounds of both implementation ease and theoretical elegance, a decision was finally made not to use the classifier and to rely solely on the relevance filter for the time being (where the relevance filter can be seen as an inverted index mapping features to answers containing those features, and their inverse probabilities as a [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)-like feature weighting).

#### Approach for relevance filtering
Based on the remarks in the previous section, an inverted index lookup was implemented as the classifier for both tasks.

During __training__, the system indexes all instances by their features (storing also each instance's label) and calculates all the features' inverse probability as a proxy for their relevance (to be later used as the weight for computing the similarity between pairs of instances).

During __testing__, the system:
1. first extracts the features for a given input instance,
2. then looks up the resulting features in the inverted index and activates all training instances indexed by the same features,
3. for each activated instance, it calculates a __score__ by summing up the inverse probability of all features that activated that instance, without [length normalization](#partial_match),
4. returns as the prediction the candidate instance with the highest __score__,
5. finally, keeps the prediction if and only if the prediction's score is higher than a threshold, defined above as the __ratio of confidence__ parameter in [Parameters](#Parameters).
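
The training and testing steps above can be sketched as a small class. This is an illustration under stated assumptions, not the code used in our experiments: the class name `RelevanceFilter` is hypothetical, the inverse probability of a feature is assumed to be the number of training instances divided by the number of instances containing the feature, and the cutoff is passed in directly as an absolute score (`min_score`) instead of being derived from the ratio of confidence.

```python
from collections import defaultdict

class RelevanceFilter:
    """Inverted-index lookup that scores candidates by summed feature weights."""

    def __init__(self, min_score):
        self.min_score = min_score
        self.index = defaultdict(set)  # feature -> labels of the instances containing it
        self.weight = {}               # feature -> assumed inverse probability

    def train(self, instances):
        """Index (label, features) pairs and weight every feature by its rarity."""
        for label, features in instances:
            for feature in set(features):
                self.index[feature].add(label)
        for feature, labels in self.index.items():
            self.weight[feature] = len(instances) / len(labels)

    def predict(self, features):
        """Return the best-scoring label, or None if its score is not above the cutoff."""
        scores = defaultdict(float)
        for feature in set(features):
            for label in self.index.get(feature, ()):
                scores[label] += self.weight[feature]  # no length normalization
        if not scores:
            return None
        label, score = max(scores.items(), key=lambda item: item[1])
        return label if score > self.min_score else None

training = [
    ('t1', ['america', 'was', 'discovered', '1492']),
    ('t2', ['sky', 'is', 'blue']),
    ('t3', ['water', 'is', 'wet']),
]
model = RelevanceFilter(min_score=2.0)
model.train(training)

print(model.predict(['america', 'discovered', 'when']))
print(model.predict(['mount', 'everest', 'is', 'tall']))
```

The second query only activates instances through the frequent feature _is_, whose weight is too low to clear the cutoff, so no prediction is returned.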


#### Relevance filtering as the solution for the problem of true negatives <a name="relev_filter"></a> 

The main strength of our implementation is the system's direct access to the inverse probabilities and to the features shared by two instances (both matching and total features), which allows us to set explicitly a certain __ratio of confidence__ that can be used as a threshold for filtering out poor matches.

More specifically, matches whose score fails to equal or exceed this threshold can be assumed to be based on features carrying little information and, therefore, likely to be false positives. Such matches can then be discarded so that they do not hurt the algorithm's prediction performance, thus addressing the problem of [thresholding](#thresholding) we introduced at the beginning.

#### Relevance filtering as the solution to the problem of partial matches <a name="solution_partial_matches"></a>
Another strength of our model resides in the fact that the scores are not length-normalized, which provides an unintended solution for the problem of [partial matches](#partial_match).

In our approach, the system does not apply length normalization, i.e., the final weight of a candidate match is the total sum of the weights of the features shared by two instances, and that number is the same regardless of the ratio of matching features over the total number of features (that is, a match based on three features has the same total weight if the instances involved have five or ten features each).

This implies that a match is not disregarded if it is partial, but only if it contains too little information to be considered a reliable match, which solves the partial matches problem. When the instance containing the right answer is long and includes additional information besides the answer, the extra context results in additional features that remain unmatched. If taken into account for measuring the similarity, these features would lower the match's overall similarity score, resulting in a potential false negative. Our system solves this by ignoring altogether the total number of features and using an independent criterion (the threshold) to assess the quality of the match instead. It thus minimizes the risk that a correct match is missed when the similarity score for the whole answer is low (the independently-calculated relevance threshold tells us that the score for the specific part containing the actual answer would have been higher had it not been for the interference of the extra information in that instance).

Although this can be expected to result in some number of false positives and lower precision, it can also be expected to help boost recall.
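
The following sketch illustrates why the unnormalized score is insensitive to extra context. The function name `summed_weight`, the feature weights and the default weight of `1.0` for unlisted features are hypothetical.

```python
def summed_weight(query, candidate, weight):
    """Sum the weights of the shared features, ignoring how many features are unshared."""
    return sum(weight.get(feature, 1.0) for feature in query & candidate)

weight = {'america': 3.0, 'discovered': 3.0, '1492': 3.0}  # hypothetical weights
query = {'america', 'discovered', '1492'}
short_answer = {'america', 'was', 'discovered', 'in', '1492'}
# Same answer with extra context that the query does not mention.
long_answer = short_answer | {'by', 'an', 'expedition', 'that', 'sailed', 'from', 'spain'}

print(summed_weight(query, short_answer, weight))
print(summed_weight(query, long_answer, weight))
```

Both candidates receive the same score, so the longer answer is not penalized for the additional, unmatched features.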

## Evaluation

### Metrics

The algorithm's performance is measured using standard [Information Retrieval](https://en.wikipedia.org/wiki/Information_retrieval) metrics:
* [Precision](https://en.wikipedia.org/wiki/Precision_and_recall),
* [Recall](https://en.wikipedia.org/wiki/Precision_and_recall)
* and [F-1 measure](https://en.wikipedia.org/wiki/F1_score).

We also provide the total number of true positives, true negatives, false positives and false negatives for additional context.

The results for all experimental settings in the tables below are macro-averaged across 40 runs.
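
For reference, the metrics and their macro-average can be computed as in the sketch below. The function names and the counts in `runs` are made up for illustration and are not taken from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-1 from the counts of a single run."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def macro_average(runs):
    """Average each metric over runs, computing the metrics per run first."""
    per_run = [precision_recall_f1(*run) for run in runs]
    return tuple(sum(metrics[i] for metrics in per_run) / len(per_run) for i in range(3))

# Hypothetical (true positives, false positives, false negatives) for two runs.
runs = [(8, 2, 8), (6, 4, 4)]
print(macro_average(runs))
```

Because the metrics are averaged per run, the averaged precision and recall in the tables generally differ slightly from the values obtained by dividing the averaged counts.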

### Results

__Summary of results for tasks 1 and 2__ <a name="summary"></a>

Task | Configuration | Avg. Precision | Avg. Recall | Max. Precision | Max. Recall
:---:|:---:|:---:|:---:|:---:|:---:
1| all | 41%| 52% | 79% | 92%
1| confidence ratio = 0.5 | 38% | 63% | 67% | 92%
1| confidence ratio = 0.9 | 44% | 41% | 79% | 62%
2| all| 75% | 25% | 89% | 66%
2| confidence ratio = 0.5 | 70% | 37% | 87% | 66%
2| confidence ratio = 0.9 | 82% | 14% | 89% | 24%

#### Overview of the results for task 1
In task 1 (assign a preexisting template to an unseen answer, or none at all), the best precision across all configurations is consistently achieved by word n-gram features alone. Using character n-grams or s-grams as additional features usually harms precision in a significant way. Despite great improvements in recall in most of those cases, F1 scores remain lower than for configurations with higher precision.

#### Overview of the results for task 2
In task 2 (assign a preexisting template to an unseen question, or none at all), the best precision across all configurations is again achieved by word n-gram features alone, but the highest performance (i.e., highest F1 score) is actually achieved by settings including some other features, either s-grams, character n-grams, or both. The best performing run sees an 8% increase in recall with respect to the highest-precision run, while its precision only dropped by 4%.

#### Comparison of the results for task 1 and task 2
As can be seen in the [summary](#summary), the results for the second task were significantly better than for the first task. Predicting an answer given a question seems substantially easier than predicting whether an answer belongs to a preexisting template (i.e., matches a previously observed question). The difference is extremely large, with a 35-point increase in average precision at the cost of a 26-point decrease in average recall. This suggests that our choice of features is particularly suited for the second task, whereas there is still room for improvement in answer prediction. By looking at some examples from our dataset, it becomes apparent that, although the composition of our corpus [allows us to address the issue of partial matches](#effect_partial_matches) to some extent, the Wikipedia answers are on average rather wordy and provide a substantial amount of highly specific linguistic information that pushes up the relevance threshold. In the [error analysis](#errors) we will see some examples of this.

#### Effect of increasing the confidence ratio
The data in the [summary](#summary) above suggests that higher values for the confidence ratio result in substantially higher precision (6% and 12% increase for tasks 1 and 2, respectively), although they also cause recall to drop by an even larger amount (22% and 23% decrease for tasks 1 and 2, respectively).

#### Effect of increasing the size of the test data
As can be seen in the following table, increasing the __ratio of the test data__ does not have a significant impact on performance: 

Task | Confidence ratio | _Ratio of test data_ | Precision | Recall
:---:|:---:|:---:|:---:|:---:
1 | 0.5 | 0.2 | 39% | 64%
1 | 0.5 | 0.4 | 38% | 63%
1 | 0.9 | 0.2 | 44% | 41%
1 | 0.9 | 0.4 | 44% | 41%

This suggests that our implementation is robust to overfitting: both precision and recall remain essentially constant after doubling the size of the test data, indicating that our model generalizes reasonably well to new data.

#### All results
The tables below display the full breakdown of the results for all experimental settings carried out for [task 1](#all_a) and [task 2](#all_b). The _Setting_ column describes the parameters used for each run. The legend is as follows:

Particle in setting name | Meaning
--- | ---
uni | The setting uses word 1-grams as features.
uni-bi-tri | The setting uses as features word n-grams for _n: n ∈ {1, 2, 3}_.
chgrams | The setting uses character n-grams.
sgrams | The setting uses s-skip-n-grams.

Results are sorted by F-1 Measure, with the highest precision in each table highlighted in bold for convenience.

##### Task 1 <a name="all_a"></a>

__Confidence ratio = 0.5, Ratio of test data = 0.2__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
__uni__|__53__|__123__|__27__|__73__|__0.6594__|__0.4186__|__0.5118__|__0.5776__
uni.sgrams|47|119|33|76|0.5838|0.3836|0.4625|0.655
uni-bi-tri.chgrams.sgrams|65|22|183|6|0.2615|0.9046|0.4056|3.7893
uni.chgrams.sgrams|65|18|188|5|0.2581|0.9223|0.4032|30.7164
uni-bi-tri.chgrams|64|20|186|6|0.2556|0.9028|0.3982|3.796
chgrams.sgrams|63|15|192|5|0.249|0.9165|0.3914|3.8517
uni.chgrams|63|14|193|5|0.2461|0.9141|0.3877|3.7329
chgrams|63|13|195|5|0.2448|0.9206|0.3865|3.574
uni-bi-tri.sgrams|35|119|32|91|0.5186|0.2784|0.3619|0.7374
uni-bi-tri|31|121|28|96|0.5237|0.2479|0.3362|0.64
sgrams|28|120|31|98|0.4782|0.2256|0.3059|0.589

__Confidence ratio = 0.5, Ratio of test data = 0.4__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
__uni__|__68__|__154__|__33__|__91__|__0.6713__|__0.4269__|__0.5216__|__0.5913__
uni.sgrams|58|150|42|97|0.5798|0.3742|0.4547|0.6854
uni-bi-tri.chgrams.sgrams|79|25|231|11|0.2564|0.8787|0.3967|4.9884
uni-bi-tri.chgrams|78|23|235|9|0.2504|0.8895|0.3907|4.392
uni.chgrams.sgrams|79|20|240|8|0.2487|0.9047|0.39|4.6961
chgrams.sgrams|78|18|242|8|0.2449|0.9035|0.3852|4.5017
uni.chgrams|78|17|244|8|0.2424|0.898|0.3815|4.4083
chgrams|77|14|248|7|0.2369|0.9101|0.3759|4.5565
uni-bi-tri.sgrams|43|148|41|114|0.5101|0.274|0.3562|0.7873
uni-bi-tri|39|151|36|119|0.5219|0.2495|0.3374|0.7721
sgrams|33|149|40|123|0.454|0.2156|0.2921|0.5983


__Confidence ratio = 0.9, Ratio of test data = 0.2__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
uni.chgrams|52|57|135|33|0.278|0.613|0.3822|3.725
chgrams|52|54|138|31|0.2748|0.6252|0.3814|3.5066
uni.chgrams.sgrams|50|61|127|38|0.2859|0.5719|0.3811|3.7596
uni-bi-tri.chgrams.sgrams|49|65|118|44|0.2966|0.5296|0.3801|36.0064
uni-bi-tri.chgrams|49|64|123|40|0.2879|0.552|0.3782|3.9734
chgrams.sgrams|51|57|132|36|0.2781|0.5844|0.3766|3.6815
__uni__|__32__|__134__|__8__|__102__|__0.7895__|__0.2402__|__0.368__|__0.5651__
uni.sgrams|32|128|16|99|0.6615|0.2472|0.3594|0.6518
sgrams|27|119|31|100|0.4666|0.2127|0.2917|0.5818
uni-bi-tri.sgrams|23|131|14|108|0.6179|0.1754|0.273|0.7617
uni-bi-tri|20|134|11|112|0.6452|0.1509|0.2443|0.6657

__Confidence ratio = 0.9, Ratio of test data = 0.4__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
uni.chgrams|67|69|169|42|0.2838|0.6127|0.3877|4.4876
uni-bi-tri.chgrams.sgrams|63|81|147|55|0.299|0.532|0.3826|4.4874
uni.chgrams.sgrams|64|74|161|48|0.2853|0.5713|0.3804|4.8892
uni-bi-tri.chgrams|62|78|155|51|0.2887|0.5518|0.3789|4.4017
chgrams|65|64|178|39|0.2673|0.6216|0.3735|4.4794
chgrams.sgrams|63|70|168|45|0.2729|0.5788|0.3708|4.4514
__uni__|__40__|__168__|__10__|__128__|__0.7863__|__0.2375__|__0.3646__|__0.5773__
uni.sgrams|40|162|21|123|0.6492|0.2447|0.3552|0.7305
sgrams|34|149|39|124|0.4709|0.2179|0.2975|0.627
uni-bi-tri.sgrams|28|163|18|136|0.6075|0.175|0.2714|0.7735
uni-bi-tri|24|166|14|142|0.6278|0.1465|0.2373|0.6789

##### Task 2 <a name="all_b"></a>

__Confidence ratio = 0.5, Ratio of test data = 0.2__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
uni.chgrams|77|101|57|41|0.5733|0.6519|0.6096|0.9976
chgrams|78|99|61|39|0.5618|0.6659|0.6092|0.9742
chgrams.sgrams|75|103|52|45|0.5911|0.6237|0.6067|1.0291
uni-bi-tri.chgrams|73|109|45|50|0.6168|0.5914|0.6035|1.0733
uni.chgrams.sgrams|74|105|50|47|0.5959|0.6109|0.6032|1.0329
uni-bi-tri.chgrams.sgrams|70|113|40|52|0.634|0.574|0.6022|1.065
uni|38|135|8|95|0.8125|0.2856|0.4222|0.31
uni.sgrams|24|136|6|110|0.8059|0.1847|0.3001|0.3201
__uni-bi-tri__|__7__|__138__|__1__|__130__|__0.8272__|__0.0528__|__0.099__|__0.3215__
uni-bi-tri.sgrams|5|138|1|131|0.7642|0.0436|0.0823|0.3716
sgrams|0|138|0|138|0.5437|0.0272|0.0451|0.1811

__Confidence ratio = 0.5, Ratio of test data = 0.4__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
chgrams|97|124|75|51|0.5657|0.6567|0.6075|1.2752
uni.chgrams.sgrams|93|132|62|59|0.5973|0.6085|0.6027|1.2662
uni.chgrams|95|126|72|53|0.5669|0.641|0.6014|1.2228
chgrams.sgrams|93|130|64|59|0.5906|0.6116|0.6007|1.1655
uni-bi-tri.chgrams.sgrams|87|142|49|68|0.6413|0.5593|0.5973|1.4716
uni-bi-tri.chgrams|89|137|55|65|0.6185|0.5773|0.597|1.1781
uni|48|167|12|119|0.7911|0.2893|0.4232|0.3237
uni.sgrams|31|171|7|137|0.8135|0.1859|0.3022|0.3708
__uni-bi-tri__|__10__|__173__|__1__|__162__|__0.8526__|__0.0631__|__0.1173__|__0.3463__
uni-bi-tri.sgrams|8|173|2|164|0.7864|0.0503|0.0944|0.3881
sgrams|1|174|0|172|0.85|0.0276|0.046|0.259

__Confidence ratio = 0.9, Ratio of test data = 0.2__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
uni.chgrams|31|137|4|103|0.8689|0.2356|0.3701|0.9271
chgrams|31|137|5|103|0.8592|0.2337|0.3672|0.9296
uni.chgrams.sgrams|28|137|4|106|0.8578|0.2137|0.3417|1.009
chgrams.sgrams|28|137|4|106|0.8557|0.2119|0.3394|0.9026
__uni__|__22__|__138__|__3__|__114__|__0.8809__|__0.162__|__0.2733__|__0.3077__
uni-bi-tri.chgrams|19|138|3|117|0.847|0.1423|0.2433|0.8082
uni.sgrams|16|138|2|120|0.865|0.123|0.2152|0.3341
uni-bi-tri.chgrams.sgrams|16|138|3|120|0.8236|0.1187|0.2073|0.9307
uni-bi-tri|5|138|1|132|0.7902|0.0408|0.0774|0.3323
uni-bi-tri.sgrams|5|138|1|132|0.7885|0.0399|0.0758|0.3829
sgrams|0|138|0|138|0.4667|0.0136|0.0261|0.1592

__Confidence ratio = 0.9, Ratio of test data = 0.4__

Setting|TruePositives|TrueNegatives|FalsePositives|FalseNegatives|Precision|Recall|F-1_Measure|Duration
---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
chgrams|41|171|7|128|0.8504|0.2422|0.3768|1.0961
uni.chgrams|40|171|6|129|0.8576|0.2377|0.3719|1.0968
chgrams.sgrams|35|171|7|133|0.8363|0.2119|0.3379|1.117
uni.chgrams.sgrams|35|171|6|134|0.8413|0.2062|0.331|1.1387
__uni__|__28__|__172__|__3__|__143__|__0.8914__|__0.164__|__0.2767__|__0.3103__
uni-bi-tri.chgrams|25|172|4|144|0.8462|0.1508|0.2557|1.0283
uni.sgrams|22|173|2|149|0.8905|0.1329|0.2312|0.3812
uni-bi-tri.chgrams.sgrams|22|172|4|148|0.837|0.1337|0.2303|1.033
uni-bi-tri|7|173|2|164|0.7822|0.0478|0.09|0.3602
uni-bi-tri.sgrams|7|173|2|164|0.7999|0.0477|0.0899|0.4271
sgrams|1|173|0|172|0.7292|0.0272|0.0452|0.2489

## Error analysis <a name="errors"></a>

### Issues
#### Runaway relevance threshold

Topic|Question |Answer
---|---|---
Financial audit	|	how do forensic auditors examine financial reporting	|	The audit opinion is intended to provide reasonable assurance, but not absolute assurance, that the financial statements are presented fairly, in all material respects, and/or give a true and fair view in accordance with the financial reporting framework.
Financial audit	|	how do forensic auditors examine financial reporting	|	The purpose of an audit is provide and objective independent examination of the financial statements, which increases the value and credibility of the financial statements produced by management, thus increase user confidence in the financial statement, reduce investor risk and consequently reduce the cost of capital of the preparer of the financial statements.
Financial audit	|	how do forensic auditors examine financial reporting	|	Financial audits are typically performed by firms of practicing accountants who are experts in financial reporting.
Dwarf planet	|	what makes a dwarf planet	|	A dwarf planet is a planetary-mass object that is neither a planet nor a satellite .
Dwarf planet	|	what makes a dwarf planet	|	More explicitly, the International Astronomical Union (IAU) defines a dwarf planet as a celestial body in direct orbit of the Sun that is massive enough for its shape to be controlled by gravitation , but that unlike a planet has not cleared its orbital region of other objects.


#### False positive feature intersection 
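Log entry `40,fp,what country is turkey in,where is basque spoken`, a false positive (`fp`) pairing _what country is turkey in_ with _where is basque spoken_: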

Topic | Question | Answer
---| --- |---
Turkey	|	what country is turkey in	|	Turkey (), officially the Republic of Turkey , is a transcontinental country , located mostly on Anatolia in Western Asia and on East Thrace in Southeastern Europe.
Turkey	|	what country is turkey in	|	Turkey is bordered by eight countries: Bulgaria to the northwest; Greece to the west; Georgia to the northeast; Armenia , Iran and the Azerbaijani exclave of Nakhchivan to the east; and Iraq and Syria to the southeast.
Basque language	|	where is basque spoken	|	Basque ( endonym : , ) is the ancestral language of the Basque people , who inhabit the Basque Country , a region spanning an area in northeastern Spain and southwestern France .

#### Coreferential expressions
Question |Topic|Answer
---|---|---
what is a medallion guarantee	|	Medallion signature guarantee	|	It is a guarantee by the transferring financial institution that the signature is genuine and the financial institution accepts liability for any forgery.
what is a medallion guarantee	|	Medallion signature guarantee	|	They also limit the liability of the transfer agent who accepts the certificates.

#### Unaccounted linguistic variation
Question |Topic|Answer
---|---|---
what is an agents job role in film	|	Talent agent	|	A talent agent, or booking agent, is a person who finds jobs for actors , authors , film directors , musicians , models , producers, professional athletes , writers and other people in various entertainment businesses.
what is an agents job role in film	|	Talent agent	|	An agent also defends, supports and promotes the interest of his/her clients.


### Limitations

#### Relativistic question
Question | Falsely matched question
---|---
what are the most known sports in america	|	who is basketball star antoine walker
what are the most known sports in america	|	what is an assist in basketball

#### Factoid question
Question |Topic|Answer
---|---|---
what cities are in the bahamas	|	List of cities in the Bahamas	|	Nassau
what cities are in the bahamas	|	List of cities in the Bahamas	|	Freeport, Bahamas

#### Conventional linguistic variation (acronyms)
Question |Topic|Answer
---|---|---
how many percent is a basis point	|	Basis point |	1 basis point = 1 permyriad = one one-hundredth percent
how many percent is a basis point	|	Basis point	|	1 bp = 1 = 0.01% = 0.1‰ = 10−4 = = 0.0001



## Next steps

### Issues

#### Runaway relevance threshold
Activate through character n-grams, then apply a relevance filter over n-gram TF-IDF to the words containing the activated character n-grams.

The inverse probability of rare tokens is too large. This is risky for collapsed features: the logic for collapsing must ensure high precision (i.e., it must not create ambiguities by neutralizing legitimate linguistic distinctions).

#### False positive feature intersection 
1. Bigger model, more training data.
2. Do not accept matches for adjectives unless heads match as well.
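The second option could be sketched as follows. This is only an illustration under strong assumptions: phrases are given as hand-built `(modifiers, head)` pairs (no parser is involved), the function name `accepted_matches` is hypothetical, and the example phrases are invented for the purpose of the sketch rather than taken from the dataset.

```python
def accepted_matches(query_phrase, candidate_phrase):
    """Return the features accepted as matching between two phrases.

    Each phrase is a (modifiers, head) pair. Shared modifiers (e.g.
    adjectives) only count when the heads are identical.
    """
    query_mods, query_head = query_phrase
    cand_mods, cand_head = candidate_phrase
    if query_head != cand_head:
        # Heads differ: discard any overlap between modifiers.
        return set()
    return {query_head} | (set(query_mods) & set(cand_mods))


# Invented examples: the adjective "financial" is shared in both cases.
print(accepted_matches((["financial"], "audit"), (["financial"], "crisis")))
print(accepted_matches((["financial"], "audit"), (["annual", "financial"], "audit")))
```

With this rule the first pair yields no accepted features, whereas the second pair keeps both the head and the shared adjective.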

#### Unsupported coreferential expressions
Find the topic of each answer and replace pronouns with that topic.
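A minimal sketch of this idea, assuming that the topic label is already available for each answer and that only a sentence-initial subject pronoun needs replacing. The pronoun list and the function name `replace_leading_pronoun` are hypothetical, and the sketch is not a substitute for real coreference resolution:

```python
import re

# Hypothetical, deliberately small list of subject pronouns.
SUBJECT_PRONOUNS = ("It", "They", "He", "She")


def replace_leading_pronoun(answer, topic):
    """Replace a sentence-initial subject pronoun with the topic label."""
    pattern = r"^(?:%s)\b" % "|".join(SUBJECT_PRONOUNS)
    return re.sub(pattern, topic, answer, count=1)


answer = "They also limit the liability of the transfer agent who accepts the certificates."
print(replace_leading_pronoun(answer, "Medallion signature guarantee"))
```

Note that the replacement does not repair number agreement (_Medallion signature guarantee also limit ..._). For feature matching this is probably acceptable, since the goal is only to make the topic tokens available in the answer.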

#### Unaccounted linguistic variation
Expand ellipsis with respect to observed full phrases, e.g. _(talent) agent_.

[From 1st task] __TO-DO: word embeddings, normalization__

[From 1st task] __TO-DO: word embeddings, concept expansion__
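The ellipsis expansion mentioned above could be sketched as a suffix lookup over previously observed phrases. The function name `expand_ellipsis` and the list of observed phrases are assumptions made for illustration only:

```python
def expand_ellipsis(phrase, observed_phrases):
    """Return observed full phrases that end with `phrase` (whole words only).

    Falls back to the lowercased phrase itself when no longer phrase
    has been observed.
    """
    tokens = phrase.lower().split()
    expansions = []
    for full in observed_phrases:
        full_tokens = full.lower().split()
        is_longer = len(full_tokens) > len(tokens)
        if is_longer and full_tokens[-len(tokens):] == tokens:
            expansions.append(full.lower())
    return expansions or [phrase.lower()]


# Topic labels seen in this report, used here as the observed phrases.
observed = ["Talent agent", "Basis point", "Dwarf planet"]
print(expand_ellipsis("agent", observed))
```

The question in the example above actually contains _agents_ rather than _agent_, so this lookup would only fire after the normalization step listed in the to-do items.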


### Limitations

#### Relativistic question
Implement shallow syntactic and semantic parsing in order to restrict predictions to observed relations. For example, given linguistic expressions 1 and 2 below:

1. _most known sports in america_
2. _sports in america_

the problem arises because the system is currently transforming both into an equivalent set of features such as _{sports, america}_. However, the specifier _most known_ crucially restricts the reference of the head noun _sports_ in 1, making its meaning considerably different from the one it has in 2. This constitutes the mirror image of the problem of [linguistic variation](#lingvar): whereas in many cases we want to abstract away linguistic variation in order to increase the number of matches, this constitutes an instance of a linguistic distinction that should be kept in order not to provide an incorrect answer.

In order to do that, features should incorporate information about the specifiers of each noun phrase: the lack of a specifier in 2 suggests that the head noun _sports_ is being used in an absolute sense, whereas the presence of a specifier in 1 suggests that the head noun _sports_ is being used in a restricted sense. Features incorporating this type of phrasal information would thus be different and would avoid the false positive.
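The following sketch illustrates the kind of feature we have in mind. It assumes the noun phrases have already been parsed by hand into a head and an optional specifier; the `NounPhrase` type, the `features` function, and the `specifier>head` feature format are all hypothetical choices made for this example:

```python
from typing import NamedTuple, Optional


class NounPhrase(NamedTuple):
    head: str
    specifier: Optional[str] = None


def features(noun_phrases):
    """Build one feature per noun phrase, keeping the specifier when present."""
    return frozenset(
        f"{phrase.specifier}>{phrase.head}" if phrase.specifier else phrase.head
        for phrase in noun_phrases
    )


# Hand-parsed versions of expressions 1 and 2 above.
expr1 = [NounPhrase("sports", "most known"), NounPhrase("america")]
expr2 = [NounPhrase("sports"), NounPhrase("america")]

print(sorted(features(expr1)))
print(sorted(features(expr2)))
print(sorted(features(expr1) & features(expr2)))
```

Under this representation the two expressions only share the feature _america_, instead of collapsing into the same set _{sports, america}_.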

#### Factoid question
1. Entity types: {Place(city), Place(Bahamas)}
2. List of X in Bahamas.
3. Recognition that every X is a city.
4. Reliance on a knowledge base.

#### Conventional linguistic variation (acronyms)
1. Dictionary lookup.
2. Limited set of heuristics (use initials).
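The initials heuristic could look roughly like the sketch below. The function names are hypothetical, and the heuristic is deliberately naive (it would over-generate on short tokens, so in practice it should probably be combined with the dictionary lookup):

```python
def initials(phrase):
    """Concatenate the lowercased first letter of each word in the phrase."""
    return "".join(word[0] for word in phrase.lower().split())


def matches_acronym(token, phrase):
    """Check whether `token` equals the initials of `phrase`, ignoring case."""
    return token.lower() == initials(phrase)


print(initials("Basis point"))
print(matches_acronym("bp", "Basis point"))
```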


### Question types and expected answers
PLACE, TIME

__Topic labels__ already handled by the system