Purpose: analyze token counts for the train and dev splits of the direct-prompt and self-ask prompt datasets, and inspect sample records from each split and from the test set.

In [5]:
import json
from token_stats import extract_token_counts, summarize_token_counts

In [6]:
path = "data/2WikiMultihopQA/"

## Train Set: Direct Prompting

In [3]:
direct_train_counts = extract_token_counts("direct", "train")
direct_train_stats = summarize_token_counts(direct_train_counts)
direct_train_stats

       prompt_token_counts  target_token_counts  total_token_counts
count                  5.0                  5.0                 5.0
mean                 153.8                  2.2               156.0
std                  71.64                 1.79               70.74
min                   78.0                  1.0                79.0
25%                   97.0                  1.0               102.0
50%                  137.0                  1.0               140.0
75%                  221.0                  3.0               222.0
max                  236.0                  5.0               237.0
sum                  769.0                 11.0               780.0
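
The rows above follow the layout of pandas' `DataFrame.describe()` with an extra `sum` row. `summarize_token_counts` lives in `token_stats` and is not shown in this notebook, so the following is only a sketch of how such a summary could be computed with the standard library; the `summarize` helper is hypothetical. Because this split has five examples, the min, quartile, and max rows give the five prompt lengths directly, assuming linear interpolation between ranks for the quantiles (the pandas default).

```python
import statistics


def summarize(counts):
    """Return describe()-style statistics plus a sum for a list of token counts."""
    ordered = sorted(counts)

    def quantile(q):
        # Linear interpolation between the two closest ranks.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return {
        "count": float(len(ordered)),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (n - 1)
        "min": float(ordered[0]),
        "25%": quantile(0.25),
        "50%": quantile(0.50),
        "75%": quantile(0.75),
        "max": float(ordered[-1]),
        "sum": float(sum(ordered)),
    }


# Prompt lengths read off the min / 25% / 50% / 75% / max rows of the table.
prompt_counts = [78, 97, 137, 221, 236]
stats = summarize(prompt_counts)
for name, value in stats.items():
    print(f"{name:>5}  {round(value, 2)}")
```

Rounded to two decimals, the mean, standard deviation, and sum from this sketch agree with the `prompt_token_counts` column.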


In [5]:
# open json file for direct prompt training examples
with open(path + "direct_train.json", "r") as f:
    direct_train = json.load(f)

In [6]:
direct_train[:2]

[{'prompt': "Fact #0: Rhescuporis I (Ancient Greek: Ραισκούπορις) was a king of the Odrysian kingdom of Thrace in 240 BC - 215 BC, succeeding his father, Cotys III.\nFact #1: 270 BC, succeeding his father, Raizdos.\n\nQuestion: Who is Rhescuporis I (Odrysian)'s paternal grandfather?\nAnswer:",
  'target': 'Raizdos',
  'num_prompt_tokens': 97,
  'num_target_tokens': 5,
  'num_tokens': 102},
 {'prompt': 'Fact #0: The Fascist  is a 1961 Italian film directed by Luciano Salce.\nFact #1: Luciano Salce (25 September 1922, in Rome – 17 December 1989, in Rome) was an Italian film director, actor and lyricist.\n\nQuestion: Where was the director of film The Fascist born?\nAnswer:',
  'target': 'Rome',
  'num_prompt_tokens': 78,
  'num_target_tokens': 1,
  'num_tokens': 79}]
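
Each record stores the prompt, the target, and three token counts. In the two records shown, `num_tokens` equals `num_prompt_tokens + num_target_tokens`. A minimal sketch of checking that relationship over a whole split follows; `check_record` is a hypothetical helper, and the sample list copies only the count fields of the two records above.

```python
def check_record(record):
    """Return True if the stored total equals prompt tokens plus target tokens."""
    expected = record["num_prompt_tokens"] + record["num_target_tokens"]
    return record["num_tokens"] == expected


# Count fields copied from the two records printed above.
sample = [
    {"num_prompt_tokens": 97, "num_target_tokens": 5, "num_tokens": 102},
    {"num_prompt_tokens": 78, "num_target_tokens": 1, "num_tokens": 79},
]

# Indices of records whose counts do not add up.
inconsistent = [i for i, record in enumerate(sample) if not check_record(record)]
print(inconsistent)  # []
```

Passing `direct_train` instead of `sample` would run the same check on the full split.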

In [7]:
print(direct_train[0]["prompt"])
print(direct_train[0]["target"])

Fact #0: Rhescuporis I (Ancient Greek: Ραισκούπορις) was a king of the Odrysian kingdom of Thrace in 240 BC - 215 BC, succeeding his father, Cotys III.
Fact #1: 270 BC, succeeding his father, Raizdos.

Question: Who is Rhescuporis I (Odrysian)'s paternal grandfather?
Answer:
Raizdos


In [None]:
del direct_train

## Dev Set: Direct Prompting

In [9]:
direct_dev_counts = extract_token_counts("direct", "dev")
direct_dev_stats = summarize_token_counts(direct_dev_counts)
direct_dev_stats

       prompt_token_counts  target_token_counts  total_token_counts
count                  3.0                  3.0                 3.0
mean                128.67                 3.33               132.0
std                  45.24                 2.52               46.18
min                   78.0                  1.0                81.0
25%                  110.5                  2.0               112.5
50%                  143.0                  3.0               144.0
75%                  154.0                  4.5               157.5
max                  165.0                  6.0               171.0
sum                  386.0                 10.0               396.0


In [8]:
# open json file for direct prompt dev examples
with open(path + "direct_dev.json", "r") as f:
    direct_dev = json.load(f)

In [10]:
direct_dev[:2]

[{'prompt': 'Fact #0: This was the last film directed by Stephen Roberts before his untimely death from a heart attack.\nFact #1: The Star of Santa Clara is a 1958 West German musical comedy film directed by Werner Jacobs and starring Vico Torriani, Gerlinde Locker and Ruth Stephan.\nFact #2: Stephen Roberts( 23 November 1895 – 17 July 1936) was an American film director.\nFact #3: Werner Jacobs (1909–1999) was a German film director. .\n\nQuestion: Do both films: The Ex-Mrs. Bradford and The Star Of Santa Clara have the directors from the same country?\nAnswer:',
  'target': 'no',
  'num_prompt_tokens': 143,
  'num_target_tokens': 1,
  'num_tokens': 144},
 {'prompt': "Fact #0: She is the daughter of Rune Gerhardsen and Tove Strand, and granddaughter of Einar Gerhardsen.\nFact #1: Rune Gerhardsen (born 13 June 1946) is a Norwegian politician, representing the Norwegian Labour Party.\n\nQuestion: What is the date of birth of Mina Gerhardsen's father?\nAnswer:",
  'target': '13 June 1946

In [11]:
print(direct_dev[0]["prompt"])
print(direct_dev[0]["target"])

Fact #0: This was the last film directed by Stephen Roberts before his untimely death from a heart attack.
Fact #1: The Star of Santa Clara is a 1958 West German musical comedy film directed by Werner Jacobs and starring Vico Torriani, Gerlinde Locker and Ruth Stephan.
Fact #2: Stephen Roberts( 23 November 1895 – 17 July 1936) was an American film director.
Fact #3: Werner Jacobs (1909–1999) was a German film director. .

Question: Do both films: The Ex-Mrs. Bradford and The Star Of Santa Clara have the directors from the same country?
Answer:
no


In [None]:
del direct_dev

## Train Set: Self-Ask

In [15]:
self_ask_train_counts = extract_token_counts("self_ask", "train")
self_ask_train_stats = summarize_token_counts(self_ask_train_counts)
self_ask_train_stats

       prompt_token_counts  target_token_counts  total_token_counts
count                  5.0                  5.0                 5.0
mean                 325.8                 78.6               404.4
std                  71.64                30.35              101.37
min                  250.0                 51.0               301.0
25%                  269.0                 58.0               327.0
50%                  309.0                 63.0               372.0
75%                  393.0                100.0               493.0
max                  408.0                121.0               529.0
sum                 1629.0                393.0              2022.0
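
Compared with the direct-prompting train table, the prompt standard deviation here is identical (71.64) and the other prompt statistics appear shifted by a constant. That is consistent with the self-ask format adding a fixed number of tokens to every prompt (the two worked examples plus the changed answer cue). The sketch below checks that arithmetic with values copied from the two train tables; the variable names are illustrative.

```python
# Prompt-token statistics copied from the two train tables.
direct_prompt = {
    "mean": 153.8, "min": 78.0, "25%": 97.0,
    "50%": 137.0, "75%": 221.0, "max": 236.0,
}
self_ask_prompt = {
    "mean": 325.8, "min": 250.0, "25%": 269.0,
    "50%": 309.0, "75%": 393.0, "max": 408.0,
}

# Per-statistic difference, rounded to absorb floating-point noise.
overhead = {
    name: round(self_ask_prompt[name] - direct_prompt[name], 2)
    for name in direct_prompt
}
print(overhead)
print(len(set(overhead.values())) == 1)  # True when the shift is constant
```

Every difference comes out to 172 tokens, and the `sum` rows differ by 860, which is 5 × 172.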


In [12]:
# open json file for self-ask prompt training examples
with open(path + "self_ask_train.json", "r") as f:
    self_ask_train = json.load(f)

In [13]:
self_ask_train[:2]

[{'prompt': "Example Response\nQuestion: Which film came out first, The Love Route or Engal Aasan?\nAre follow up questions needed here: Yes.\nFollow up: When is the publication date of The Love Route?\nIntermediate answer: 1915\nFollow up: When is the publication date of Engal Aasan?\nIntermediate answer: 2009\nSo the final answer is: The Love Route\n\nExample Response\nQuestion: When is the composer of film Sruthilayalu 's birthday?\nAre follow up questions needed here: Yes.\nFollow up: Who is the composer of Sruthilayalu?\nIntermediate answer: K. V. Mahadevan\nFollow up: When is the date of birth of K. V. Mahadevan?\nIntermediate answer: 14 March 1918\nSo the final answer is: 14 March 1918\n\nFact #0: Rhescuporis I (Ancient Greek: Ραισκούπορις) was a king of the Odrysian kingdom of Thrace in 240 BC - 215 BC, succeeding his father, Cotys III.\nFact #1: 270 BC, succeeding his father, Raizdos.\n\nQuestion: Who is Rhescuporis I (Odrysian)'s paternal grandfather?\nAre follow up questions

In [14]:
print(self_ask_train[0]["prompt"])
print(self_ask_train[0]["target"])

Example Response
Question: Which film came out first, The Love Route or Engal Aasan?
Are follow up questions needed here: Yes.
Follow up: When is the publication date of The Love Route?
Intermediate answer: 1915
Follow up: When is the publication date of Engal Aasan?
Intermediate answer: 2009
So the final answer is: The Love Route

Example Response
Question: When is the composer of film Sruthilayalu 's birthday?
Are follow up questions needed here: Yes.
Follow up: Who is the composer of Sruthilayalu?
Intermediate answer: K. V. Mahadevan
Follow up: When is the date of birth of K. V. Mahadevan?
Intermediate answer: 14 March 1918
So the final answer is: 14 March 1918

Fact #0: Rhescuporis I (Ancient Greek: Ραισκούπορις) was a king of the Odrysian kingdom of Thrace in 240 BC - 215 BC, succeeding his father, Cotys III.
Fact #1: 270 BC, succeeding his father, Raizdos.

Question: Who is Rhescuporis I (Odrysian)'s paternal grandfather?
Are follow up questions needed here:

Yes.
Follow up: Who 

In [None]:
del self_ask_train
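
The self-ask exemplars end with a line of the form `So the final answer is: ...`. To compare a completion in this format against a short gold answer, that last line has to be separated from the follow-up questions. Below is a minimal sketch of one way to do that, assuming the completion reproduces the cue verbatim; `extract_final_answer` is a hypothetical helper and not part of this notebook's code.

```python
FINAL_CUE = "So the final answer is:"


def extract_final_answer(completion):
    """Return the text after the last final-answer cue, or None if it is absent."""
    idx = completion.rfind(FINAL_CUE)
    if idx == -1:
        return None
    rest = completion[idx + len(FINAL_CUE):].strip()
    if not rest:
        return None
    # Keep only the first line after the cue.
    return rest.splitlines()[0].strip()


# Completion text taken from the first exemplar in the self-ask prompt.
example = (
    "Yes.\n"
    "Follow up: When is the publication date of The Love Route?\n"
    "Intermediate answer: 1915\n"
    "Follow up: When is the publication date of Engal Aasan?\n"
    "Intermediate answer: 2009\n"
    "So the final answer is: The Love Route\n"
)
print(extract_final_answer(example))  # The Love Route
```

Using `rfind` picks the last cue, so a completion that runs on and emits the cue more than once yields its final occurrence; whether that is the desired behavior depends on how generation is stopped.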

## Dev Set: Self-Ask

In [16]:
self_ask_dev_counts = extract_token_counts("self_ask", "dev")
self_ask_dev_stats = summarize_token_counts(self_ask_dev_counts)
self_ask_dev_stats

       prompt_token_counts  target_token_counts  total_token_counts
count                  3.0                  3.0                 3.0
mean                300.67                 83.0              383.67
std                  45.24                 21.7               66.71
min                  250.0                 58.0               308.0
25%                  282.5                 76.0               358.5
50%                  315.0                 94.0               409.0
75%                  326.0                 95.5               421.5
max                  337.0                 97.0               434.0
sum                  902.0                249.0              1151.0


In [17]:
# open json file for self-ask prompt dev examples
with open(path + "self_ask_dev.json", "r") as f:
    self_ask_dev = json.load(f)

In [18]:
self_ask_dev[:2]

[{'prompt': "Example Response\nQuestion: Which film came out first, The Love Route or Engal Aasan?\nAre follow up questions needed here: Yes.\nFollow up: When is the publication date of The Love Route?\nIntermediate answer: 1915\nFollow up: When is the publication date of Engal Aasan?\nIntermediate answer: 2009\nSo the final answer is: The Love Route\n\nExample Response\nQuestion: When is the composer of film Sruthilayalu 's birthday?\nAre follow up questions needed here: Yes.\nFollow up: Who is the composer of Sruthilayalu?\nIntermediate answer: K. V. Mahadevan\nFollow up: When is the date of birth of K. V. Mahadevan?\nIntermediate answer: 14 March 1918\nSo the final answer is: 14 March 1918\n\nFact #0: This was the last film directed by Stephen Roberts before his untimely death from a heart attack.\nFact #1: The Star of Santa Clara is a 1958 West German musical comedy film directed by Werner Jacobs and starring Vico Torriani, Gerlinde Locker and Ruth Stephan.\nFact #2: Stephen Robert

In [19]:
print(self_ask_dev[0]["prompt"])
print(self_ask_dev[0]["target"])

Example Response
Question: Which film came out first, The Love Route or Engal Aasan?
Are follow up questions needed here: Yes.
Follow up: When is the publication date of The Love Route?
Intermediate answer: 1915
Follow up: When is the publication date of Engal Aasan?
Intermediate answer: 2009
So the final answer is: The Love Route

Example Response
Question: When is the composer of film Sruthilayalu 's birthday?
Are follow up questions needed here: Yes.
Follow up: Who is the composer of Sruthilayalu?
Intermediate answer: K. V. Mahadevan
Follow up: When is the date of birth of K. V. Mahadevan?
Intermediate answer: 14 March 1918
So the final answer is: 14 March 1918

Fact #0: This was the last film directed by Stephen Roberts before his untimely death from a heart attack.
Fact #1: The Star of Santa Clara is a 1958 West German musical comedy film directed by Werner Jacobs and starring Vico Torriani, Gerlinde Locker and Ruth Stephan.
Fact #2: Stephen Roberts( 23 November 1895 – 17 July 193

In [None]:
del self_ask_dev

## Test Set

In [9]:
with open("data/MultihopEvaluation/test.json", "r") as f:
    test = json.load(f)

In [10]:
print(test[0]["self_ask_prompt_with_examplars"])
print(test[0]["self_ask_answer"])
print(test[0]["direct_prompt"])
print(test[0]["answer"])

Example Response
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
Example Response
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Are follow u