Purpose: analyze token-count statistics for the train, dev, and test splits of the direct-prompt and self-ask prompt datasets, plus the baseline test set.

In [3]:
!pip install pytest
!pip install sentencepiece
!pip install tokenizers
!pip install nltk
!pip install loguru

# from google.colab import drive
# drive.mount('/content/drive/')



In [4]:
# %cd drive/MyDrive/projects/compositional-reasoning-finetuning

In [5]:
import json
from token_stats import extract_token_counts, summarize_token_counts

In [6]:
path = "data/FinetuningData/"

## Train Set: Direct Prompting

In [7]:
direct_train_counts = extract_token_counts("direct", "train")
direct_train_stats = summarize_token_counts(direct_train_counts)
direct_train_stats

Extracting token counts for direct-train split: 100%|████████████████████████████| 105479/105479 [00:00<00:00, 2504733.01it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,105479.0,105479.0,105479.0
mean,95.75,2.7,98.44
std,16.86,2.0,17.07
min,41.0,1.0,44.0
25%,83.0,1.0,86.0
50%,95.0,3.0,98.0
75%,108.0,4.0,111.0
max,130.0,23.0,150.0
sum,10099174.0,284678.0,10383852.0
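
The table has the shape of a pandas `describe()` summary with an extra `sum` row. The real `summarize_token_counts` in `token_stats` is not shown in this notebook, so the following is only a minimal sketch of how such a summary could be produced. The function name `summarize_counts_sketch`, the rounding to two decimals, and the toy input are assumptions.

```python
import pandas as pd


def summarize_counts_sketch(counts: dict) -> pd.DataFrame:
    """Hypothetical stand-in for summarize_token_counts (not the real helper)."""
    df = pd.DataFrame(counts)
    stats = df.describe()        # count, mean, std, min, 25%, 50%, 75%, max
    stats.loc["sum"] = df.sum()  # append column totals as an extra row
    return stats.round(2)        # assumed: the table above looks rounded to 2 places


# Toy input using the per-example counts of the first two training records.
prompt_counts = [81, 92]
target_counts = [3, 4]
toy_counts = {
    "prompt_token_counts": prompt_counts,
    "target_token_counts": target_counts,
    "total_token_counts": [p + t for p, t in zip(prompt_counts, target_counts)],
}
print(summarize_counts_sketch(toy_counts))
```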


In [9]:
# open json file for direct prompt training examples
with open(path + "direct_train.json", "r") as f:
    direct_train = json.load(f)

In [10]:
direct_train[:2]

[{'prompt': "Facts:\nFact #0: Egan was the wife of the state of Alaska's first governor, William Allen Egan, and the mother of former Juneau Mayor and Alaska State Senator Dennis Egan.\nFact #1: William Allen Egan (October 8, 1914 – May 6, 1984) was an American Democratic politician.\n\nQuestion: When was Neva Egan's husband born?\nAnswer:",
  'target': 'October 8, 1914',
  'num_prompt_tokens': 81,
  'num_target_tokens': 3,
  'num_tokens': 84},
 {'prompt': 'Facts:\nFact #0: Alejo Mancisidor( born 31 July 1970) is a former professional tennis player from Spain.\nFact #1: Emil Leyde( 8 January 1879 in Kassel – ca. 1924) was a German film director, screenwriter, cameraman and film producer.\n\nQuestion: Who was born first, Alejo Mancisidor or Emil Leyde?\nAnswer:',
  'target': 'Emil Leyde',
  'num_prompt_tokens': 92,
  'num_target_tokens': 4,
  'num_tokens': 96}]
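
Each record stores its own token counts, and in both records shown `num_tokens` equals `num_prompt_tokens + num_target_tokens`. A small sanity check along these lines could be run over a whole split; the helper name `counts_are_consistent` is an assumption, and the two records below are copied from the output above with the text fields omitted.

```python
records = [
    {"num_prompt_tokens": 81, "num_target_tokens": 3, "num_tokens": 84},
    {"num_prompt_tokens": 92, "num_target_tokens": 4, "num_tokens": 96},
]


def counts_are_consistent(record: dict) -> bool:
    """True when the stored total equals prompt tokens plus target tokens."""
    return record["num_tokens"] == record["num_prompt_tokens"] + record["num_target_tokens"]


# Indices of records whose stored total does not match the sum of its parts.
inconsistent = [i for i, r in enumerate(records) if not counts_are_consistent(r)]
print(inconsistent)
```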

In [11]:
print(direct_train[0]["prompt"])
print(direct_train[0]["target"])

Facts:
Fact #0: Egan was the wife of the state of Alaska's first governor, William Allen Egan, and the mother of former Juneau Mayor and Alaska State Senator Dennis Egan.
Fact #1: William Allen Egan (October 8, 1914 – May 6, 1984) was an American Democratic politician.

Question: When was Neva Egan's husband born?
Answer:
October 8, 1914
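
The printed prompt follows a fixed layout: a `Facts:` header, numbered facts, a blank line, the question, and an `Answer:` cue. The code that builds these prompts is not part of this notebook, so `build_direct_prompt` below is a hypothetical helper that only reproduces the layout seen in the output.

```python
from typing import List


def build_direct_prompt(facts: List[str], question: str) -> str:
    """Hypothetical helper reproducing the direct-prompt layout printed above."""
    fact_lines = [f"Fact #{i}: {fact}" for i, fact in enumerate(facts)]
    return "Facts:\n" + "\n".join(fact_lines) + f"\n\nQuestion: {question}\nAnswer:"


prompt = build_direct_prompt(
    [
        "Egan was the wife of the state of Alaska's first governor, William Allen Egan, "
        "and the mother of former Juneau Mayor and Alaska State Senator Dennis Egan.",
        "William Allen Egan (October 8, 1914 – May 6, 1984) was an American Democratic politician.",
    ],
    "When was Neva Egan's husband born?",
)
print(prompt)
```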


In [12]:
del direct_train

## Dev Set: Direct Prompting

In [13]:
direct_dev_counts = extract_token_counts("direct", "dev")
direct_dev_stats = summarize_token_counts(direct_dev_counts)
direct_dev_stats

Extracting token counts for direct-dev split: 100%|██████████████████████████████████| 8657/8657 [00:00<00:00, 2328913.46it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,8657.0,8657.0,8657.0
mean,95.91,2.74,98.65
std,17.05,2.05,17.28
min,41.0,1.0,47.0
25%,83.0,1.0,86.0
50%,95.0,3.0,98.0
75%,109.0,4.0,112.0
max,130.0,18.0,142.0
sum,830310.0,23723.0,854033.0


In [14]:
# open json file for direct prompt dev examples
with open(path + "direct_dev.json", "r") as f:
    direct_dev = json.load(f)

In [15]:
direct_dev[:2]

[{'prompt': 'Facts:\nFact #0: Éric Deflandre( born 2 August 1973 in Rocourt) is a former Belgian football right fullback.\nFact #1: Polly Swann( born 5 June 1988) is a British rower and a member of the Great Britain Rowing Team.\n\nQuestion: Who was born earlier, Polly Swann or Éric Deflandre?\nAnswer:',
  'target': 'Éric Deflandre',
  'num_prompt_tokens': 87,
  'num_target_tokens': 7,
  'num_tokens': 94},
 {'prompt': 'Facts:\nFact #0: The film was written, adapted and directed by Russian-born Arcady Boytler.\nFact #1: Boytler was born in Moscow, Russia.\n\nQuestion: Where was the director of film Heads Or Tails (1937 Film) born?\nAnswer:',
  'target': 'Moscow',
  'num_prompt_tokens': 60,
  'num_target_tokens': 1,
  'num_tokens': 61}]

In [16]:
print(direct_dev[0]["prompt"])
print(direct_dev[0]["target"])

Facts:
Fact #0: Éric Deflandre( born 2 August 1973 in Rocourt) is a former Belgian football right fullback.
Fact #1: Polly Swann( born 5 June 1988) is a British rower and a member of the Great Britain Rowing Team.

Question: Who was born earlier, Polly Swann or Éric Deflandre?
Answer:
Éric Deflandre


In [17]:
del direct_dev

## Train Set: Self-Ask

In [18]:
self_ask_train_counts = extract_token_counts("self_ask", "train")
self_ask_train_stats = summarize_token_counts(self_ask_train_counts)
self_ask_train_stats

Extracting token counts for self_ask-train split: 100%|██████████████████████████| 102026/102026 [00:00<00:00, 2850611.25it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,102026.0,102026.0,102026.0
mean,268.64,54.03,322.66
std,16.01,8.52,19.55
min,215.0,38.0,260.0
25%,257.0,49.0,309.0
50%,268.0,52.0,322.0
75%,281.0,57.0,336.0
max,300.0,180.0,464.0
sum,27408157.0,5512024.0,32920181.0


In [19]:
import pandas as pd
# fraction of self-ask training examples whose prompt is shorter than 450 tokens
df = pd.DataFrame(self_ask_train_counts)
df.loc[df.prompt_token_counts < 450, :].shape[0] / df.shape[0]

1.0

In [20]:
# open json file for self ask prompt training examples
with open(path + "self_ask_train.json", "r") as f:
    self_ask_train = json.load(f)

In [21]:
self_ask_train[:2]

[{'prompt': "Examples:\nSTART\nQuestion: When was Neva Egan's husband born?\nAre follow up questions needed here: Yes.\nFollow up: Who is the spouse of Neva Egan?\nIntermediate answer: William Allen Egan\nFollow up: When is the date of birth of William Allen Egan?\nIntermediate answer: October 8, 1914\nSo the final answer is: October 8, 1914\nEND\n\nSTART\nQuestion: Who was born first, Alejo Mancisidor or Emil Leyde?\nAre follow up questions needed here: Yes.\nFollow up: When is the date of birth of Alejo Mancisidor?\nIntermediate answer: 31 July 1970\nFollow up: When is the date of birth of Emil Leyde?\nIntermediate answer: 8 January 1879\nSo the final answer is: Emil Leyde\nEND\n\nFacts:\nFact #0: Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.\nFact #1: Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian writer, director, actor and producer.\n\nQuestion: What is the place of birth of the director of film Solo (2006 Film)?\nA

In [22]:
print(self_ask_train[0]["prompt"])
print(self_ask_train[0]["target"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.
Fact #1: Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian writer, director, actor and producer.

Question: What is the place of birth of the director of film Solo (2006 Film)?
Are follow up questions needed here:

Ye
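
The self-ask prompt prepends an `Examples:` block of `START`/`END`-delimited exemplars to the facts and question, and ends with the cue `Are follow up questions needed here:`. The sketch below only mirrors that printed layout. `build_self_ask_prompt` is a hypothetical name, and any trailing whitespace the real prompts may carry after the cue is not reproduced.

```python
from typing import List

CUE = "Are follow up questions needed here:"


def build_self_ask_prompt(exemplars: List[str], facts: List[str], question: str) -> str:
    """Hypothetical helper; mirrors the printed layout, not the project's real builder."""
    parts = []
    if exemplars:
        # Each exemplar is wrapped in START/END markers; blocks are separated by a blank line.
        blocks = [f"START\n{ex}\nEND\n" for ex in exemplars]
        parts.append("Examples:\n" + "\n".join(blocks))
    fact_lines = "\n".join(f"Fact #{i}: {fact}" for i, fact in enumerate(facts))
    parts.append(f"Facts:\n{fact_lines}\n\nQuestion: {question}\n{CUE}")
    return "\n".join(parts)


exemplar = (
    "Question: When was Neva Egan's husband born?\n"
    "Are follow up questions needed here: Yes.\n"
    "Follow up: Who is the spouse of Neva Egan?\n"
    "Intermediate answer: William Allen Egan\n"
    "Follow up: When is the date of birth of William Allen Egan?\n"
    "Intermediate answer: October 8, 1914\n"
    "So the final answer is: October 8, 1914"
)
print(build_self_ask_prompt(
    [exemplar],
    ["Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels."],
    "What is the place of birth of the director of film Solo (2006 Film)?",
))
```

Passing an empty exemplar list yields the exemplar-free layout, which starts directly at `Facts:`.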

In [23]:
del self_ask_train

## Dev Set: Self-Ask

In [24]:
self_ask_dev_counts = extract_token_counts("self_ask", "dev")
self_ask_dev_stats = summarize_token_counts(self_ask_dev_counts)
self_ask_dev_stats

Extracting token counts for self_ask-dev split: 100%|████████████████████████████████| 8367/8367 [00:00<00:00, 2095899.52it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,8367.0,8367.0,8367.0
mean,268.78,54.22,323.0
std,16.21,8.52,19.82
min,215.0,38.0,266.0
25%,257.0,49.0,309.0
50%,269.0,53.0,322.0
75%,281.0,57.0,336.0
max,300.0,131.0,428.0
sum,2248891.0,453666.0,2702557.0


In [25]:
import pandas as pd
# fraction of self-ask dev examples whose prompt is shorter than 450 tokens
df = pd.DataFrame(self_ask_dev_counts)
df.loc[df.prompt_token_counts < 450, :].shape[0] / df.shape[0]

1.0

In [26]:
# open json file for self ask prompt dev examples
with open(path + "self_ask_dev.json", "r") as f:
    self_ask_dev = json.load(f)

In [27]:
self_ask_dev[:2]

[{'prompt': "Examples:\nSTART\nQuestion: When was Neva Egan's husband born?\nAre follow up questions needed here: Yes.\nFollow up: Who is the spouse of Neva Egan?\nIntermediate answer: William Allen Egan\nFollow up: When is the date of birth of William Allen Egan?\nIntermediate answer: October 8, 1914\nSo the final answer is: October 8, 1914\nEND\n\nSTART\nQuestion: Who was born first, Alejo Mancisidor or Emil Leyde?\nAre follow up questions needed here: Yes.\nFollow up: When is the date of birth of Alejo Mancisidor?\nIntermediate answer: 31 July 1970\nFollow up: When is the date of birth of Emil Leyde?\nIntermediate answer: 8 January 1879\nSo the final answer is: Emil Leyde\nEND\n\nFacts:\nFact #0: Éric Deflandre( born 2 August 1973 in Rocourt) is a former Belgian football right fullback.\nFact #1: Polly Swann( born 5 June 1988) is a British rower and a member of the Great Britain Rowing Team.\n\nQuestion: Who was born earlier, Polly Swann or Éric Deflandre?\nAre follow up questions n

In [28]:
print(self_ask_dev[0]["prompt"])
print(self_ask_dev[0]["target"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: Éric Deflandre( born 2 August 1973 in Rocourt) is a former Belgian football right fullback.
Fact #1: Polly Swann( born 5 June 1988) is a British rower and a member of the Great Britain Rowing Team.

Question: Who was born earlier, Polly Swann or Éric Deflandre?
Are follow up questions needed here:

Yes.
Follow up: When is th

In [29]:
del self_ask_dev

## Test Set: Direct Prompting (no exemplars)

In [30]:
direct_test_counts = extract_token_counts("direct", "test", examplars=False)
direct_test_stats = summarize_token_counts(direct_test_counts)
direct_test_stats

Extracting token counts for direct-test split: 100%|███████████████████████████████| 12576/12576 [00:00<00:00, 2691751.74it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,122.59,4.49,127.08
std,42.33,3.26,43.07
min,44.0,1.0,48.0
25%,91.0,2.0,95.0
50%,111.0,4.0,115.0
75%,147.0,6.0,152.0
max,353.0,36.0,355.0
sum,1541736.0,56452.0,1598188.0


In [31]:
with open("data/MultihopEvaluation/direct-without-examplars.json", "r") as f:
    test = json.load(f)

In [34]:
test[0]

{'prompt': 'Facts:\nFact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.\nFact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.\n\nQuestion: Who is the mother of the director of film Polish-Russian War (Film)?\nAnswer:',
 'target': 'Małgorzata Braunek',
 'answer': 'Małgorzata Braunek',
 'num_prompt_tokens': 122,
 'num_target_tokens': 10,
 'num_tokens': 132}

In [32]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

Facts:
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Answer:
------------------
Małgorzata Braunek
------------------
Małgorzata Braunek


## Test Set: Self-Ask

### With Exemplars

In [7]:
self_ask_test_counts = extract_token_counts("self-ask", "test", examplars=True)
self_ask_test_stats = summarize_token_counts(self_ask_test_counts)
self_ask_test_stats

Extracting token counts for self-ask-test split: 100%|██████████| 12576/12576 [00:00<00:00, 1086973.58it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,296.59,69.02,365.61
std,42.33,21.42,59.82
min,218.0,39.0,266.0
25%,265.0,53.0,321.0
50%,285.0,60.0,346.0
75%,321.0,86.0,403.0
max,527.0,171.0,632.0
sum,3729960.0,867989.0,4597949.0


In [8]:
with open("data/MultihopEvaluation/self-ask-with-examplars.json", "r") as f:
    test = json.load(f)

In [9]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Are fol

### Without Exemplars

In [10]:
self_ask_test_counts = extract_token_counts("self-ask", "test", examplars=False)
self_ask_test_stats = summarize_token_counts(self_ask_test_counts)
self_ask_test_stats

Extracting token counts for self-ask-test split: 100%|██████████| 12576/12576 [00:00<00:00, 1180484.01it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,127.59,69.02,196.61
std,42.33,21.42,59.82
min,49.0,39.0,97.0
25%,96.0,53.0,152.0
50%,116.0,60.0,177.0
75%,152.0,86.0,234.0
max,358.0,171.0,463.0
sum,1604616.0,867989.0,2472605.0
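
The self-ask test tables with and without exemplars can be compared directly to estimate how many tokens the exemplar block adds to each prompt. The snippet below is a quick arithmetic check on the prompt-token statistics, copied by hand from the two tables rather than recomputed from the data.

```python
# Prompt-token statistics copied from the two self-ask test tables in this notebook.
with_exemplars = {"mean": 296.59, "min": 218.0, "25%": 265.0, "50%": 285.0, "75%": 321.0, "max": 527.0}
without_exemplars = {"mean": 127.59, "min": 49.0, "25%": 96.0, "50%": 116.0, "75%": 152.0, "max": 358.0}

# Per-statistic difference, rounded to absorb floating-point noise.
overhead = {k: round(with_exemplars[k] - without_exemplars[k], 2) for k in with_exemplars}
print(overhead)
```

Every statistic differs by 169 tokens, and the standard deviation (42.33) is the same in both tables, which is consistent with a fixed exemplar prefix being prepended to every test prompt.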


In [11]:
with open("data/MultihopEvaluation/self-ask-without-examplars.json", "r") as f:
    test = json.load(f)

In [12]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

Facts:
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Are follow up questions needed here:

------------------
Yes.
Follow up: Who is the director of Polish-Russian War?
Intermediate answer: Xawery Żuławski
Follow up: Who is the mother of Xawery Żuławski?
Intermediate answer: Małgorzata Braunek
So the final answer is: Małgorzata Braunek

------------------
Małgorzata Braunek
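
The self-ask `target` holds the full reasoning chain, while `answer` holds only the final answer, which appears after `So the final answer is:`. How the project scores self-ask outputs is not shown in this notebook; the following is a minimal sketch of recovering the final answer from such a chain, with `extract_final_answer` as a hypothetical helper name.

```python
from typing import Optional

FINAL_PREFIX = "So the final answer is:"


def extract_final_answer(generation: str) -> Optional[str]:
    """Return the text after the last 'So the final answer is:' line, or None."""
    for line in reversed(generation.splitlines()):
        line = line.strip()
        if line.startswith(FINAL_PREFIX):
            return line[len(FINAL_PREFIX):].strip()
    return None


target = (
    "Yes.\n"
    "Follow up: Who is the director of Polish-Russian War?\n"
    "Intermediate answer: Xawery Żuławski\n"
    "Follow up: Who is the mother of Xawery Żuławski?\n"
    "Intermediate answer: Małgorzata Braunek\n"
    "So the final answer is: Małgorzata Braunek\n"
)
print(extract_final_answer(target))
```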


## Test Set: Baseline

### With Exemplars

In [13]:
squad_test_counts = extract_token_counts("baseline", "test", examplars=True)
squad_test_stats = summarize_token_counts(squad_test_counts)
squad_test_stats

Extracting token counts for baseline-test split: 100%|██████████| 12576/12576 [00:00<00:00, 623412.64it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,280.27,69.02,349.29
std,40.49,21.42,57.78
min,203.0,39.0,251.0
25%,250.0,53.0,306.0
50%,270.0,60.0,331.0
75%,304.0,86.0,386.0
max,506.0,171.0,611.0
sum,3524714.0,867989.0,4392703.0


In [14]:
with open("data/MultihopEvaluation/baseline-with-examplars.json", "r") as f:
    test = json.load(f)

In [15]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

question: Who is the mother of the director of film Polish-Russian War (Film)? context: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska. He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.
------------------
Yes.


### Without Exemplars

In [16]:
squad_test_counts = extract_token_counts("baseline", "test", examplars=False)
squad_test_stats = summarize_token_counts(squad_test_counts)
squad_test_stats

Extracting token counts for baseline-test split: 100%|██████████| 12576/12576 [00:00<00:00, 522253.14it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,111.27,4.49,115.76
std,40.49,3.26,41.23
min,34.0,1.0,38.0
25%,81.0,2.0,85.0
50%,101.0,4.0,105.0
75%,135.0,6.0,140.0
max,337.0,36.0,339.0
sum,1399370.0,56452.0,1455822.0


In [17]:
with open("data/MultihopEvaluation/baseline-without-examplars.json", "r") as f:
    test = json.load(f)

In [18]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

question: Who is the mother of the director of film Polish-Russian War (Film)? context: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska. He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.
------------------
Małgorzata Braunek
------------------
Małgorzata Braunek
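
The baseline prompt uses a single-line `question: ... context: ...` layout, with the facts joined by spaces and no `Fact #i:` labels. A hypothetical helper reproducing that layout (the name `build_baseline_prompt` is an assumption; the real builder is not shown in this notebook):

```python
from typing import List


def build_baseline_prompt(facts: List[str], question: str) -> str:
    """Hypothetical helper reproducing the baseline layout printed above."""
    context = " ".join(facts)  # facts are concatenated into one context string
    return f"question: {question} context: {context}"


print(build_baseline_prompt(
    [
        "The film was written, adapted and directed by Russian-born Arcady Boytler.",
        "Boytler was born in Moscow, Russia.",
    ],
    "Where was the director of film Heads Or Tails (1937 Film) born?",
))
```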
