In [1]:
import os
import re

import pandas as pd

# MS MARCO dataset

In [40]:
class StringParser:
    def __init__(self):
        docs_str = r"""
            # e.g. 'is_selected': 0
            {(?P<quotesel>\"|')is_selected(?P=quotesel):\s(?P<sel>0|1),\s

            # e.g. 'passage_text': 'hello world'
            (?P<quotetext>\"|')passage_text(?P=quotetext):\s(?P<quotetext2>\"|')(?P<text>.*?)(?P=quotetext2),\s

            # e.g. 'url': 'http://foo.bar.com'
            (?P<quoteurl>\"|')url(?P=quoteurl):\s(?P<quoteurl2>\"|')(?P<url>.*?)(?P=quoteurl2)}
        """
        self.PASSAGE = re.compile(docs_str, re.VERBOSE)
        self.ANSWER = re.compile(r"(?P<quote>\"|')(?P<text>.*?)(?P=quote)")

    def parse_passages(self, s: str):
        return [
            {
                "is_selected": int(m.group("sel")),
                "passage_text": m.group("text"),
                "url": m.group("url"),
            }
            for m in self.PASSAGE.finditer(s)
        ]

    def parse_answers(self, s: str):
        return [m.group("text") for m in self.ANSWER.finditer(s)]


string_parser = StringParser()
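
Because the `passages` and `answers` cells look like Python literals serialized as strings, `ast.literal_eval` from the standard library may be a simpler alternative to the regular expressions above. The sketch below is not the parser used in the rest of this notebook. It assumes every cell is a valid Python literal, which has not been verified on the dataset, and `parse_literal` is a hypothetical helper name.

```python
import ast


def parse_literal(s: str):
    """Parse a stringified Python list; return [] if the string is not a valid list literal."""
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


# Toy inputs shaped like the 'answers' and 'passages' cells, not real rows.
answers_cell = "['No Answer Present.']"
passages_cell = "[{'is_selected': 0, 'passage_text': \"It's here\", 'url': 'http://foo.bar.com'}]"

parsed_answers = parse_literal(answers_cell)
parsed_passages = parse_literal(passages_cell)
print(parsed_answers)
print(parsed_passages[0]["passage_text"])
```

Unlike the regular expressions, this approach handles mixed quote styles and escaped quotes through Python's own literal syntax, but it fails on any cell that is truncated or malformed, which is why the helper falls back to an empty list.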


Read the dataset.

In [2]:
path = os.path.join(os.getcwd(), "marco_comp_all_fields.tsv")
df = pd.read_csv(path, sep="\t")

df.head()

Unnamed: 0,question,query_id,wellFormedAnswers,passages,answers,query_type
0,is the atlanta airport the busiest in the world,1174762,[],"[{'is_selected': 0, 'passage_text': 'While Chi...",['No Answer Present.'],LOCATION
1,most romantic hotels in bahamas,459200,[],"[{'is_selected': 0, 'passage_text': ""Thank You...",['No Answer Present.'],PERSON
2,benefits of a weaker dollar,50251,[],"[{'is_selected': 1, 'passage_text': 'The Benef...",['The Benefits of a Weak Dollar Despite expres...,DESCRIPTION
3,what distinguishes a macronutrient from a micr...,621519,['The difference between a macronutrient and a...,"[{'is_selected': 1, 'passage_text': ""Macronutr...","['Macronutrients mainly include carbohydrates,...",DESCRIPTION
4,united states postal service roseburg or,532464,[],"[{'is_selected': 0, 'passage_text': 'United St...",['It is located at the address 1304 Ne Cedar S...,DESCRIPTION


## Observe data

### Well-formed answers

Well-formed answers are more relevant for generative models, i.e. models that generate the answer rather than extracting it from the available texts. Our goal is a retrieval model, so we are not interested in this type of answer.

There are 7217 well-formed answers in the dataset.

In [35]:
# Each list is stored as a string, so an empty list "[]" has length 2.
df_wf = df[df["wellFormedAnswers"].map(len) > 2]

print(f"Number of rows: {df_wf.shape[0]}")

Number of rows: 7217


The example below shows the difference between a well-formed answer and a retrieved one: the former is generated rather than extracted directly from the passage text.

In [41]:
idx = 0
row = df_wf.iloc[idx]
print(f"Question: {row.question}", end="\n\n")
print(f"Well-formed answer: {string_parser.parse_answers(row.wellFormedAnswers)}", end="\n\n")
print(f"Retrieved answer: {string_parser.parse_answers(row.answers)}")

Question: what distinguishes a macronutrient from a micronutrient

Well-formed answer: ['The difference between a macronutrient and a micronutrient, macronutrients mainly include carbohydrates, proteins and fats and also water which are required in large quantities whereas, micronutrients mainly comprise vitamins and minerals which are required in minute quantities.']

Retrieved answer: ['Macronutrients mainly include carbohydrates, proteins and fats and also water which are required in large quantities and their main function being the release of energy in body.Whereas, micronutrients mainly comprise vitamins and minerals which are required in minute quantities.']


For this task, however, well-formed answers are excluded because we are interested only in retrieved answers, so that the classification is over text spans.

### Number of answers

Most questions have exactly one answer. Some questions have more than one, but we are not interested in those because handling them would make our model too complicated.

Count the answers in the 'answers' column. We expect only one answer per question.

In [42]:
answers_length = df.answers.apply(lambda x: len(string_parser.parse_answers(x)))
print(f"Number of answers for the same question: {sorted(answers_length.unique())}")

Number of answers for the same question: [1, 2, 3]


However, some questions have two answers, and two questions have three. Fortunately, these numbers are small compared to the size of the dataset: the vast majority of questions have only one answer.

- 1 answer: 39818
- 2 answers: 614
- 3 answers: 2

In [43]:
answers_length.value_counts()

1    39818
2      614
3        2
Name: answers, dtype: int64
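
To make "small compared to the size of the dataset" concrete, the counts can be turned into shares. The sketch below rebuilds the series from the numbers printed above rather than from the dataset itself. On the real data, `answers_length.value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({1: 39818, 2: 614, 3: 2}, name="answers")

# Share of questions for each number of answers.
shares = counts / counts.sum()
print(shares.round(4))

# Rows that would remain if only single-answer questions were kept.
n_single = int(counts.loc[1])
n_total = int(counts.sum())
print(f"{n_single} of {n_total} rows have exactly one answer")
```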

#### Questions with two answers

In [44]:
df_two = df[answers_length == 2]
df_two.head()

Unnamed: 0,question,query_id,wellFormedAnswers,passages,answers,query_type
71,how much water should i drink each day?,330247,[],"[{'is_selected': 1, 'passage_text': 'The Harva...",['A 132-pound women needs to drink 41 ounces o...,NUMERIC
107,cost of shiplap compared to drywall,107016,['The cost of shiplap is between $25 and $50 p...,"[{'is_selected': 1, 'passage_text': 'Shiplap s...","['$25 and $50 per hour', 'Shilpa boards cost $...",NUMERIC
129,is it ct or cst,413658,[],"[{'is_selected': 1, 'passage_text': 'The curre...",['Central Time is denote the local time in are...,DESCRIPTION
172,most advanced naval ships in the world,456321,['The Astute class submarine is the most advan...,"[{'is_selected': 1, 'passage_text': 'Astute-cl...","['Astute-class submarines', 'Astute-class subm...",ENTITY
174,what foods are good for tonsillitis,660946,"['Soft, bland foods such as pudding, applesauc...","[{'is_selected': 0, 'passage_text': 'Chilled d...","['Soft, bland foods such as pudding, applesauc...",ENTITY


In [45]:
string_parser.parse_answers(df.iloc[71].answers)

['A 132-pound women needs to drink 41 ounces of water a day.',
 '8-ounce glasses']

#### Questions with three answers

Answers are not retrieved from the same passage text. In addition, the first answer refers to two different passage texts, while the other two answers aren't even retrieved from texts but are generated.

In [46]:
df_three = df[answers_length == 3]
df_three.head()

Unnamed: 0,question,query_id,wellFormedAnswers,passages,answers,query_type
35043,benefits of cucumber and lemon water,50550,[],"[{'is_selected': 1, 'passage_text': 'As part o...","['Fights chronic diseases, such as coronary ar...",DESCRIPTION
38409,is flip or flop renewed,410309,[],"[{'is_selected': 0, 'passage_text': 'Flip or F...","['Yes, Flip or Flop Returns for Season.', 'Yes...",DESCRIPTION


In [47]:
string_parser.parse_answers(df.iloc[35043].answers)

['Fights chronic diseases, such as coronary artery disease. Supports brain cell communication, boosts collagen production to strengthen your tissues and also helps your body process cholesterol.',
 'cucumber increase your vitamin K intake and Lemon water makes for a powerful detox drink.lemon juice helps to cleanse and alkalize the body.',
 'Fights chronic diseases, increase your vitamin K,it providing a significant amount of pantothenic acid, also called vitamin B-5.']
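
One way to check whether an answer was extracted or generated is to test if it occurs verbatim in any passage. The sketch below uses made-up data. `locate_answers` is a hypothetical helper, and exact substring matching is an assumption that ignores differences in whitespace, casing, or punctuation.

```python
def locate_answers(answers, passages):
    """Map each answer to the indices of the passages that contain it verbatim."""
    return {
        answer: [i for i, p in enumerate(passages) if answer in p["passage_text"]]
        for answer in answers
    }


# Toy data, not taken from the dataset.
toy_passages = [
    {"is_selected": 1, "passage_text": "HGTV has renewed the series for a new season."},
    {"is_selected": 0, "passage_text": "The show follows a couple who flip houses."},
]
toy_answers = ["HGTV has renewed the series", "Yes"]

hits = locate_answers(toy_answers, toy_passages)
for answer, indices in hits.items():
    status = "extracted" if indices else "not found verbatim (possibly generated)"
    print(f"{answer!r}: passages {indices} -> {status}")
```

In the notebook, the inputs would be `string_parser.parse_answers(row.answers)` and `string_parser.parse_passages(row.passages)` for a given row.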

In [48]:
string_parser.parse_answers(df.iloc[38409].answers)

['Yes, Flip or Flop Returns for Season.',
 'Yes',
 "Yes,HGTV has renewed its popular 'Flip or Flop' series."]

### Number of passages selected

In [49]:
df_passages = df.passages.apply(string_parser.parse_passages)

In [50]:
sum_passages = df_passages.apply(lambda x: sum([v["is_selected"] for v in x]))
sum_passages.value_counts()

1    20879
0    16571
2     2504
3      365
4       95
5       15
6        3
7        2
Name: passages, dtype: int64

## Extract span texts

The number of rows without a well-formed answer is 33217.

In total there are $33217+7217=40434$ rows.

In [51]:
# The string "[]" has length 2, so this selects rows without a well-formed answer.
df[df["wellFormedAnswers"].map(len) == 2]

Unnamed: 0,question,query_id,wellFormedAnswers,passages,answers,query_type
0,is the atlanta airport the busiest in the world,1174762,[],"[{'is_selected': 0, 'passage_text': 'While Chi...",['No Answer Present.'],LOCATION
1,most romantic hotels in bahamas,459200,[],"[{'is_selected': 0, 'passage_text': ""Thank You...",['No Answer Present.'],PERSON
2,benefits of a weaker dollar,50251,[],"[{'is_selected': 1, 'passage_text': 'The Benef...",['The Benefits of a Weak Dollar Despite expres...,DESCRIPTION
4,united states postal service roseburg or,532464,[],"[{'is_selected': 0, 'passage_text': 'United St...",['It is located at the address 1304 Ne Cedar S...,DESCRIPTION
5,what are the most common geriatric illness,571722,[],"[{'is_selected': 0, 'passage_text': 'Often goi...","['Heart disease, stroke, cancer, and diabetes.']",ENTITY
...,...,...,...,...,...,...
40427,best wr in nfl history,52417,[],"[{'is_selected': 0, 'passage_text': 'this year...",['No Answer Present.'],DESCRIPTION
40428,best clp for gun cleaning,51663,[],"[{'is_selected': 0, 'passage_text': 'Breakfree...","['Oil,Ace Silicone Grease,Archoil AR4200,ATF –...",DESCRIPTION
40430,difference between longitudinal waves and tran...,147826,[],"[{'is_selected': 0, 'passage_text': 'In a long...",['The difference between transverse and longit...,DESCRIPTION
40431,fastest bike wheel,185255,[],"[{'is_selected': 0, 'passage_text': 'Al33, The...",['No Answer Present.'],ENTITY


In [52]:
idx = 2
row = df.iloc[idx]

print(row)

question                                   benefits of a weaker dollar
query_id                                                         50251
wellFormedAnswers                                                   []
passages             [{'is_selected': 1, 'passage_text': 'The Benef...
answers              ['The Benefits of a Weak Dollar Despite expres...
query_type                                                 DESCRIPTION
Name: 2, dtype: object


Look at the passages.

The retrieved answer is extracted from the selected passage, which is usually the first one in the list of passages.

In [55]:
answers = string_parser.parse_answers(row.answers)
passages = string_parser.parse_passages(row.passages)
selected_passage = list(filter(lambda x: x["is_selected"], passages))

display(selected_passage)
display(answers)

[{'is_selected': 1,
  'passage_text': 'The Benefits of a Weak Dollar Despite expressions to the contrary by politicos and economists, the nature of a weak dollar can actually benefit stagnated portions of our economy. For example, a weak U.S. dollar will actually help boost the manufacturing sector. A weak dollar can also help sell U.S. made goods in overseas markets at lower prices. A weak dollar will benefit a few European retailers who import Asian products bought for dollars and sell them in European markets for stronger euro payments.',
  'url': 'http://www.andygause.com/articles/international-money/the-benefits-of-a-weak-dollar/'}]

['The Benefits of a Weak Dollar Despite expressions to the contrary by politicos and economists, the nature of a weak dollar can actually benefit stagnated portions of our economy.']

To extract the start and end of the span, the retrieved answer is searched for in the text of the selected passage.

In [58]:
# Escape the answer so that it is matched as literal text, not as a pattern.
answer = string_parser.parse_answers(row.answers)[0]
prog = re.compile(re.escape(answer))
passage_text = selected_passage[0]["passage_text"]

res = prog.search(passage_text)
start, end = res.start(), res.end()

assert passage_text[start:end] == answer
print(f"Start: {start}, end: {end}")

Start: 0, end: 178
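
The cell above assumes the answer is always found. For rows where the answer does not occur verbatim in the passage, `search` returns `None` and the lookup fails. A minimal sketch of a more defensive variant follows. `find_span` is a hypothetical helper, and it uses `str.find` instead of a compiled pattern since the answer is matched literally.

```python
from typing import Optional, Tuple


def find_span(answer: str, passage_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of answer in passage_text, or None."""
    start = passage_text.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


# Toy example, not taken from the dataset.
text = "A weak dollar can help exports. It can also raise import prices."
span = find_span("It can also raise import prices.", text)
missing = find_span("No Answer Present.", text)
print(span, missing)
```

Returning `None` lets the caller decide whether to drop the row or treat the answer as generated.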


## Other

# Preprocessing pipeline

- Drop 'wellFormedAnswers' column
- Drop rows where `len(answers)`$>1$
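
The two steps above could be sketched as follows. This is an illustration on a made-up frame, not the final pipeline: `preprocess` is a hypothetical name, and parsing with `ast.literal_eval` assumes every 'answers' cell is a valid Python list literal (in the notebook, `string_parser.parse_answers` could be used instead).

```python
import ast

import pandas as pd

# Toy frame with the same column names as the dataset; the values are made up.
toy = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3"],
        "wellFormedAnswers": ["[]", "['a well-formed answer']", "[]"],
        "answers": ["['one']", "['one', 'two']", "['No Answer Present.']"],
    }
)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop 'wellFormedAnswers' and drop rows with more than one answer."""
    out = df.drop(columns=["wellFormedAnswers"])
    # Assumption: every 'answers' cell is a valid Python list literal.
    n_answers = out["answers"].map(lambda s: len(ast.literal_eval(s)))
    return out[n_answers <= 1].reset_index(drop=True)


clean = preprocess(toy)
print(clean)
```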

TODO

- [ ] "No Answer Present" elements
- [ ] Number of unique "query_type" 
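
A possible starting point for the two open items, shown on a made-up frame. It assumes the placeholder answer is always serialized exactly as `['No Answer Present.']`, which matches the rows displayed earlier but has not been checked across the whole dataset.

```python
import pandas as pd

# Toy frame; in the notebook this would be the loaded `df`.
toy = pd.DataFrame(
    {
        "answers": ["['No Answer Present.']", "['Yes']", "['No Answer Present.']"],
        "query_type": ["LOCATION", "DESCRIPTION", "ENTITY"],
    }
)

# Rows whose only answer is the "No Answer Present." placeholder.
# Assumption: the placeholder is always serialized exactly like this.
no_answer = toy["answers"] == "['No Answer Present.']"
print(f"Rows without an answer: {int(no_answer.sum())}")

# Number of distinct query types and how often each occurs.
print(f"Unique query types: {toy['query_type'].nunique()}")
print(toy["query_type"].value_counts())
```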