
Watsonsim's baseline is a prototype. The first pipeline consists of only two steps:

  • Index (once) and query search engines with default settings
  • Normalize and average the scores from the search engines to create a new, ranked list (a rough sketch of this step follows the list)

To allow the teams to work in parallel, the search team sent a prebuilt list of about 8000 questions with their correct answers labeled. Some unit tests were built to calculate these results.
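As a rough sketch of the score-combination step, the per-engine scores might be min-max normalized and then averaged per candidate answer. The class and method names below are illustrative, not Watsonsim's actual API:

```java
import java.util.*;

// Illustrative sketch of the baseline score combination: normalize each
// engine's scores to [0, 1], then average them per candidate answer.
// Names are hypothetical, not the actual Watsonsim classes.
public class ScoreAverageBaseline {

    /** Min-max normalize one engine's scores (candidate answer -> score). */
    static Map<String, Double> normalize(Map<String, Double> scores) {
        double min = Collections.min(scores.values());
        double max = Collections.max(scores.values());
        double range = (max - min) == 0 ? 1 : (max - min);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet())
            out.put(e.getKey(), (e.getValue() - min) / range);
        return out;
    }

    /** Average normalized scores across engines to produce the final ranked list. */
    static List<Map.Entry<String, Double>> combine(List<Map<String, Double>> engineScores) {
        Map<String, Double> sum = new HashMap<>();
        Map<String, Integer> count = new HashMap<>();
        for (Map<String, Double> scores : engineScores) {
            for (Map.Entry<String, Double> e : normalize(scores).entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
                count.merge(e.getKey(), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (String answer : sum.keySet())
            ranked.add(Map.entry(answer, sum.get(answer) / count.get(answer)));
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }
}
```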

Results

Question Results

The score-average baseline for Watsonsim is commit 232051a2d4. You can find it in the run db. There were 8045 questions, 2992 of which had correct answers in their respective result sets (37.1% recall). For 370 questions, the correct answer was the top result in the final ranking (12.4% recall given the answer was in the result set, 4.6% recall end-to-end). The metric of choice in this project, the mean reciprocal rank, was 0.2810.
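For reference, the mean reciprocal rank averages 1/rank of the correct answer over all questions, counting 0 when the correct answer never appears in the result set. A minimal sketch of the computation, using made-up ranks for illustration:

```java
// Minimal sketch of the mean reciprocal rank computation.
// rank = 1-based position of the correct answer in a question's result list,
// or 0 if the correct answer did not appear at all. Values are illustrative.
public class MeanReciprocalRank {
    static double mrr(int[] ranks) {
        double sum = 0;
        for (int rank : ranks)
            if (rank > 0)          // questions with no correct answer contribute 0
                sum += 1.0 / rank;
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        int[] ranks = {1, 3, 0, 2};      // hypothetical per-question ranks
        System.out.println(mrr(ranks));  // (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
    }
}
```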

Normally, the results would be plotted against the percentage of questions answered. In this prototype, Watsonsim answers every question because there is no mechanism for declining a question when the predicted accuracy is too low.

Determining Answer Correctness

Telling whether an answer is correct is more difficult than one might think. A direct string comparison gives many false negatives. Since this is an early prototype, the search team implemented the check in two simple steps: lowercase both strings, then test whether the labeled correct answer for the question is a substring of the predicted answer. If it is, the predicted answer is marked as correct.
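A minimal sketch of that check (the class and method names are illustrative):

```java
// Sketch of the two-step correctness check described above:
// lowercase both strings, then test whether the labeled correct answer
// is a substring of the candidate answer.
public class AnswerChecker {
    static boolean isCorrect(String candidate, String correctAnswer) {
        return candidate.toLowerCase().contains(correctAnswer.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isCorrect("President Abraham Lincoln", "abraham lincoln")); // true
        System.out.println(isCorrect("George Washington", "Abraham Lincoln"));         // false
    }
}
```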

Next Direction

The next step was taken almost concurrently with the baseline: replacing the plain average with logistic regression to combine the search engines' scores.
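A hedged sketch of that direction: instead of a plain average, each engine's normalized score gets a learned weight, and the weighted sum is squashed through the logistic function. The weights and scores below are placeholders for illustration, not values learned by Watsonsim:

```java
// Illustrative only: combine per-engine scores with learned weights and a bias,
// then squash with the logistic (sigmoid) function. The weights shown are
// hypothetical, not values learned by Watsonsim.
public class LogisticCombiner {
    static double combine(double[] scores, double[] weights, double bias) {
        double z = bias;
        for (int i = 0; i < scores.length; i++)
            z += weights[i] * scores[i];
        return 1.0 / (1.0 + Math.exp(-z)); // sigmoid
    }

    public static void main(String[] args) {
        double[] scores  = {0.8, 0.4};   // normalized scores from two engines
        double[] weights = {2.1, 0.7};   // hypothetical learned weights
        System.out.println(combine(scores, weights, -1.0));
    }
}
```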