Purpose: analyze token counts across the train, dev, and test splits of the direct-prompt and self-ask datasets, plus the SQuAD-style test set.

In [1]:
!pip install pytest
!pip install sentencepiece
!pip install tokenizers
!pip install thefuzz
!pip install nltk
!pip install loguru

from google.colab import drive
drive.mount('/content/drive/')

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Collecting tokenizers
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.13.3
Collecting thefuzz
  Downloading thefuzz-0.19.0-py2.py3-none-any.whl (17 kB)
Installing collected packages: thefuzz
Successfully installed thefuzz-0.19.0
Collecting loguru
  Downloading loguru-0.7.0-py3-none-any.whl (59 kB)

In [3]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: safetensors, huggingface-hub, transformers
Successfully installed huggingface-hub-0.16.4 safetensors-0.3.1 transformers-4.31.0


In [2]:
%cd drive/MyDrive/projects/compositional-reasoning-finetuning

/content/drive/MyDrive/projects/compositional-reasoning-finetuning


In [2]:
import json
from token_stats import extract_token_counts, summarize_token_counts

In [4]:
path = "data/FinetuningData/"

## Train Set: Direct Prompting

In [None]:
direct_train_counts = extract_token_counts("direct", "train")
direct_train_stats = summarize_token_counts(direct_train_counts)
direct_train_stats

Extracting token counts for direct-train split: 100%|██████████| 3/3 [00:00<00:00, 26772.15it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,3.0,3.0,3.0
mean,177.33,3.33,180.67
std,42.77,2.52,41.4
min,140.0,1.0,143.0
25%,154.0,2.0,158.5
50%,168.0,3.0,174.0
75%,196.0,4.5,199.5
max,224.0,6.0,225.0
sum,532.0,10.0,542.0
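
The rows above look like a pandas `describe()` summary with an extra `sum` row. The real `summarize_token_counts` lives in `token_stats` and is not shown in this notebook, so the following is only a minimal sketch of how such a table could be produced. The name `summarize_counts_sketch` and the toy values are assumptions made for illustration.

```python
import pandas as pd


def summarize_counts_sketch(counts: dict) -> pd.DataFrame:
    """Hypothetical stand-in for token_stats.summarize_token_counts."""
    df = pd.DataFrame(counts)
    stats = df.describe()        # count, mean, std, min, 25%, 50%, 75%, max
    stats.loc["sum"] = df.sum()  # append column totals as an extra row
    return stats.round(2)


# Toy values for illustration only, not loaded from the dataset files.
toy_counts = {
    "prompt_token_counts": [140, 168, 224],
    "target_token_counts": [3, 6, 1],
}
toy_counts["total_token_counts"] = [
    p + t
    for p, t in zip(
        toy_counts["prompt_token_counts"], toy_counts["target_token_counts"]
    )
]

summary = summarize_counts_sketch(toy_counts)
print(summary)
```

Keeping `sum` in the same frame as the descriptive statistics makes it easy to read the total token volume of a split next to its per-example distribution.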


In [None]:
# open json file for direct prompt training examples
with open(path + "direct_train.json", "r") as f:
    direct_train = json.load(f)

In [None]:
direct_train[:2]

[{'prompt': 'Facts:\nFact #0: is a 1919 American silent comedy film directed by Roy William Neill and written by L.V. Jefferson.\nFact #1: Diabolik  is a 1968 action film directed and co-written by Mario Bava, based on the Italian comic series "Diabolik" by Angela and Luciana Giussani.\nFact #2: Roy William Neill (4 September 1887 –\nFact #3: Mario Bava (31 July 1914 – 27 April 1980) was an Italian cinematographer, director, special effects artist and screenwriter, frequently referred to as the "Master of Italian Horror" and the "Master of the Macabre".\n\nQuestion: Which film whose director is younger, Charge It To Me or Danger: Diabolik?\nAnswer:',
  'target': 'Danger: Diabolik',
  'num_prompt_tokens': 168,
  'num_target_tokens': 6,
  'num_tokens': 174},
 {'prompt': 'Facts:\nFact #0: Wedding Night in Paradise  is a 1950 West German musical comedy film directed by Géza von Bolváry and starring Johannes Heesters, Claude Farell and Gretl Schörg.\nFact #1: Géza von Bolváry (full name Géz

In [None]:
print(direct_train[0]["prompt"])
print(direct_train[0]["target"])

Facts:
Fact #0: is a 1919 American silent comedy film directed by Roy William Neill and written by L.V. Jefferson.
Fact #1: Diabolik  is a 1968 action film directed and co-written by Mario Bava, based on the Italian comic series "Diabolik" by Angela and Luciana Giussani.
Fact #2: Roy William Neill (4 September 1887 –
Fact #3: Mario Bava (31 July 1914 – 27 April 1980) was an Italian cinematographer, director, special effects artist and screenwriter, frequently referred to as the "Master of Italian Horror" and the "Master of the Macabre".

Question: Which film whose director is younger, Charge It To Me or Danger: Diabolik?
Answer:
Danger: Diabolik
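
Each record carries three length fields (`num_prompt_tokens`, `num_target_tokens`, `num_tokens`). In the first record shown above, the total equals the sum of the other two (168 + 6 = 174). A small sketch of a consistency check over a list of records follows. The helper name `find_inconsistent_records` is hypothetical, and the sample record reuses only the target and length fields printed above, with a placeholder prompt.

```python
def find_inconsistent_records(records):
    """Return indices of records whose num_tokens is not prompt + target."""
    bad = []
    for index, record in enumerate(records):
        expected = record["num_prompt_tokens"] + record["num_target_tokens"]
        if record["num_tokens"] != expected:
            bad.append(index)
    return bad


# Sample records for illustration; the prompt text is a placeholder.
sample_records = [
    {
        "prompt": "placeholder prompt",
        "target": "Danger: Diabolik",
        "num_prompt_tokens": 168,
        "num_target_tokens": 6,
        "num_tokens": 174,
    },
    {
        "prompt": "placeholder prompt",
        "target": "placeholder target",
        "num_prompt_tokens": 10,
        "num_target_tokens": 2,
        "num_tokens": 99,  # deliberately wrong, to show a failing record
    },
]

inconsistent = find_inconsistent_records(sample_records)
print(inconsistent)
```

Running the same helper on the loaded `direct_train` list, before it is deleted, would confirm whether the stored counts agree with each other.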


In [None]:
del direct_train

## Dev Set: Direct Prompting

In [None]:
direct_dev_counts = extract_token_counts("direct", "dev")
direct_dev_stats = summarize_token_counts(direct_dev_counts)
direct_dev_stats

Extracting token counts for direct-dev split: 100%|██████████| 2/2 [00:00<00:00, 19021.79it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,2.0,2.0,2.0
mean,160.0,2.0,162.0
std,111.72,1.41,110.31
min,81.0,1.0,84.0
25%,120.5,1.5,123.0
50%,160.0,2.0,162.0
75%,199.5,2.5,201.0
max,239.0,3.0,240.0
sum,320.0,4.0,324.0


In [None]:
# open json file for direct prompt dev examples
with open(path + "direct_dev.json", "r") as f:
    direct_dev = json.load(f)

In [None]:
direct_dev[:2]

[{'prompt': "Facts:\nFact #0: She is the daughter of Rune Gerhardsen and Tove Strand, and granddaughter of Einar Gerhardsen.\nFact #1: Rune Gerhardsen (born 13 June 1946) is a Norwegian politician, representing the Norwegian Labour Party.\n\nQuestion: What is the date of birth of Mina Gerhardsen's father?\nAnswer:",
  'target': '13 June 1946',
  'num_prompt_tokens': 81,
  'num_target_tokens': 3,
  'num_tokens': 84},
 {'prompt': 'Facts:\nFact #0: Banović Strahinja( Serbian Cyrillic:" Бановић Страхиња", released internationally as The Falcon) is a 1981 Yugoslavian- German adventure film written and directed by Vatroslav Mimica based on Strahinja Banović, a hero of Serbian epic poetry.\nFact #1: Valentin the Good is a 1942 Czech comedy film directed by Martin Frič.\nFact #2: Vatroslav Mimica( born 25 June 1923) is a Croatian film director and screenwriter.\nFact #3: In 1942 he joined Young Communist League of Yugoslavia( SKOJ) and in 1943 he went on to join the Yugoslav Partisans, becomin

In [None]:
print(direct_dev[0]["prompt"])
print(direct_dev[0]["target"])

Facts:
Fact #0: She is the daughter of Rune Gerhardsen and Tove Strand, and granddaughter of Einar Gerhardsen.
Fact #1: Rune Gerhardsen (born 13 June 1946) is a Norwegian politician, representing the Norwegian Labour Party.

Question: What is the date of birth of Mina Gerhardsen's father?
Answer:
13 June 1946


In [None]:
del direct_dev

## Train Set: Self-Ask

In [None]:
self_ask_train_counts = extract_token_counts("self_ask", "train")
self_ask_train_stats = summarize_token_counts(self_ask_train_counts)
self_ask_train_stats

Extracting token counts for self_ask-train split: 100%|██████████| 102026/102026 [00:00<00:00, 494089.63it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,102026.0,102026.0,102026.0
mean,268.64,54.03,322.66
std,16.01,8.52,19.55
min,215.0,38.0,260.0
25%,257.0,49.0,309.0
50%,268.0,52.0,322.0
75%,281.0,57.0,336.0
max,300.0,180.0,464.0
sum,27408157.0,5512024.0,32920181.0


In [None]:
import pandas as pd
df = pd.DataFrame(self_ask_train_counts)
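# Fraction of examples whose total length is under 450 tokens.
# The prompt column alone peaks at 300, so filtering on it would always give 1.0.
df.loc[df.total_token_counts < 450, :].shape[0] / df.shape[0]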

0.9982050156254035
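
The check above measures the share of examples that fall under a 450-token cut-off. A minimal sketch of the same check as a reusable helper is given below, so that the column being compared is always explicit. The helper name `fraction_within_budget` and the toy frame are assumptions for illustration, and treating 450 as a sequence-length budget is an inference from the code, not something the notebook states.

```python
import pandas as pd


def fraction_within_budget(df: pd.DataFrame, column: str, budget: int) -> float:
    """Share of rows whose value in `column` is strictly below `budget`."""
    return float((df[column] < budget).mean())


# Toy frame for illustration only, not the real self-ask counts.
toy_df = pd.DataFrame(
    {
        "prompt_token_counts": [250, 268, 300, 281],
        "total_token_counts": [300, 322, 464, 336],
    }
)

prompt_share = fraction_within_budget(toy_df, "prompt_token_counts", 450)
total_share = fraction_within_budget(toy_df, "total_token_counts", 450)
print(prompt_share, total_share)
```

Comparing the two columns side by side shows how a budget can look fully satisfied on prompts alone while a few examples still exceed it once the target is included.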

In [None]:
# open json file for self ask prompt training examples
with open(path + "self_ask_train.json", "r") as f:
    self_ask_train = json.load(f)

In [None]:
self_ask_train[:2]

[{'prompt': "Examples:\nSTART\nQuestion: When was Neva Egan's husband born?\nAre follow up questions needed here: Yes.\nFollow up: Who is the spouse of Neva Egan?\nIntermediate answer: William Allen Egan\nFollow up: When is the date of birth of William Allen Egan?\nIntermediate answer: October 8, 1914\nSo the final answer is: October 8, 1914\nEND\n\nSTART\nQuestion: Who was born first, Alejo Mancisidor or Emil Leyde?\nAre follow up questions needed here: Yes.\nFollow up: When is the date of birth of Alejo Mancisidor?\nIntermediate answer: 31 July 1970\nFollow up: When is the date of birth of Emil Leyde?\nIntermediate answer: 8 January 1879\nSo the final answer is: Emil Leyde\nEND\n\nFacts:\nFact #0: Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.\nFact #1: Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian writer, director, actor and producer.\n\nQuestion: What is the place of birth of the director of film Solo (2006 Film)?\nA

In [None]:
print(self_ask_train[0]["prompt"])
print(self_ask_train[0]["target"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.
Fact #1: Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian writer, director, actor and producer.

Question: What is the place of birth of the director of film Solo (2006 Film)?
Are follow up questions needed here:

Ye

In [None]:
del self_ask_train

## Dev Set: Self-Ask

In [None]:
self_ask_dev_counts = extract_token_counts("self_ask", "dev")
self_ask_dev_stats = summarize_token_counts(self_ask_dev_counts)
self_ask_dev_stats

Extracting token counts for self_ask-dev split: 100%|██████████| 8367/8367 [00:00<00:00, 767462.15it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,8367.0,8367.0,8367.0
mean,268.78,54.22,323.0
std,16.21,8.52,19.82
min,215.0,38.0,266.0
25%,257.0,49.0,309.0
50%,269.0,53.0,322.0
75%,281.0,57.0,336.0
max,300.0,131.0,428.0
sum,2248891.0,453666.0,2702557.0


In [None]:
import pandas as pd
df = pd.DataFrame(self_ask_dev_counts)
# Fraction of examples whose total length is under 450 tokens.
df.loc[df.total_token_counts < 450, :].shape[0] / df.shape[0]

1.0

In [None]:
# open json file for self ask prompt dev examples
with open(path + "self_ask_dev.json", "r") as f:
    self_ask_dev = json.load(f)

In [None]:
self_ask_dev[:2]

[{'prompt': "Examples:\nSTART\nQuestion: When was Neva Egan's husband born?\nAre follow up questions needed here: Yes.\nFollow up: Who is the spouse of Neva Egan?\nIntermediate answer: William Allen Egan\nFollow up: When is the date of birth of William Allen Egan?\nIntermediate answer: October 8, 1914\nSo the final answer is: October 8, 1914\nEND\n\nSTART\nQuestion: Who was born first, Alejo Mancisidor or Emil Leyde?\nAre follow up questions needed here: Yes.\nFollow up: When is the date of birth of Alejo Mancisidor?\nIntermediate answer: 31 July 1970\nFollow up: When is the date of birth of Emil Leyde?\nIntermediate answer: 8 January 1879\nSo the final answer is: Emil Leyde\nEND\n\nFacts:\nFact #0: Éric Deflandre( born 2 August 1973 in Rocourt) is a former Belgian football right fullback.\nFact #1: Polly Swann( born 5 June 1988) is a British rower and a member of the Great Britain Rowing Team.\n\nQuestion: Who was born earlier, Polly Swann or Éric Deflandre?\nAre follow up questions n

In [None]:
print(self_ask_dev[0]["prompt"])
print(self_ask_dev[0]["target"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: Éric Deflandre( born 2 August 1973 in Rocourt) is a former Belgian football right fullback.
Fact #1: Polly Swann( born 5 June 1988) is a British rower and a member of the Great Britain Rowing Team.

Question: Who was born earlier, Polly Swann or Éric Deflandre?
Are follow up questions needed here:

Yes.
Follow up: When is th

In [None]:
del self_ask_dev
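
The few-shot examples in the self-ask prompt end with a line of the form `So the final answer is: ...`. Assuming the targets follow the same layout (they are truncated in the output above, so this is not confirmed), a short sketch of pulling the final answer out of a self-ask completion could look like this. The helper name `extract_final_answer` is hypothetical.

```python
import re

# Matches the closing line used by the few-shot examples in the prompt.
FINAL_ANSWER_RE = re.compile(r"So the final answer is:\s*(.+)")


def extract_final_answer(completion: str):
    """Return the text after 'So the final answer is:', or None if absent."""
    match = FINAL_ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


# Completion text copied from the second few-shot example in the prompt.
example_completion = (
    "Yes.\n"
    "Follow up: When is the date of birth of Alejo Mancisidor?\n"
    "Intermediate answer: 31 July 1970\n"
    "Follow up: When is the date of birth of Emil Leyde?\n"
    "Intermediate answer: 8 January 1879\n"
    "So the final answer is: Emil Leyde\n"
    "END"
)

final_answer = extract_final_answer(example_completion)
print(final_answer)
```

Returning `None` when the marker is missing keeps malformed completions distinguishable from empty answers.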

## Test Set: Direct Prompting

In [5]:
direct_test_counts = extract_token_counts("direct", "test")
direct_test_stats = summarize_token_counts(direct_test_counts)
direct_test_stats

Extracting token counts for direct-test split: 100%|██████████| 12576/12576 [00:00<00:00, 1322027.30it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,122.59,4.49,127.08
std,42.33,3.26,43.07
min,44.0,1.0,48.0
25%,91.0,2.0,95.0
50%,111.0,4.0,115.0
75%,147.0,6.0,152.0
max,353.0,36.0,355.0
sum,1541736.0,56452.0,1598188.0


In [None]:
with open("data/MultihopEvaluation/direct_test.json", "r") as f:
    test = json.load(f)

In [None]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

Facts:
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Answer:
------------------
Małgorzata Braunek
------------------
Małgorzata Braunek


## Test Set: Self-Ask

In [7]:
self_ask_test_counts = extract_token_counts("self_ask", "test")
self_ask_test_stats = summarize_token_counts(self_ask_test_counts)
self_ask_test_stats

Extracting token counts for self_ask-test split: 100%|██████████| 12576/12576 [00:00<00:00, 553182.04it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,296.59,69.02,365.61
std,42.33,21.42,59.82
min,218.0,39.0,266.0
25%,265.0,53.0,321.0
50%,285.0,60.0,346.0
75%,321.0,86.0,403.0
max,527.0,171.0,632.0
sum,3729960.0,867989.0,4597949.0
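
The prompt statistics here have the same standard deviation (42.33) as the direct test split, which is consistent with the self-ask wrapper adding a fixed number of tokens to every prompt. The sketch below works that arithmetic through using only the summary rows printed in this notebook. It does not recompute anything from the data files, and the fixed-overhead interpretation is an inference, not a stated fact.

```python
# Prompt-token summary rows copied from the direct-test and self_ask-test tables.
direct_prompt_stats = {
    "mean": 122.59, "min": 44.0, "25%": 91.0, "50%": 111.0,
    "75%": 147.0, "max": 353.0, "sum": 1541736.0,
}
self_ask_prompt_stats = {
    "mean": 296.59, "min": 218.0, "25%": 265.0, "50%": 285.0,
    "75%": 321.0, "max": 527.0, "sum": 3729960.0,
}
num_examples = 12576  # the count row, identical for both splits

# Difference between the two splits for every location statistic.
offsets = {
    name: round(self_ask_prompt_stats[name] - direct_prompt_stats[name], 2)
    for name in ("mean", "min", "25%", "50%", "75%", "max")
}

# Average extra prompt tokens per example, derived from the sum rows.
per_example_overhead = (
    self_ask_prompt_stats["sum"] - direct_prompt_stats["sum"]
) / num_examples

print(offsets)
print(per_example_overhead)
```

Every statistic shifts by the same amount, so the net cost of the two few-shot examples and the changed answer cue appears to be constant across test prompts.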


In [None]:
with open("data/MultihopEvaluation/self_ask_test.json", "r") as f:
    test = json.load(f)

In [None]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Are fol

## Test Set: SQuAD

In [3]:
squad_test_counts = extract_token_counts("squad", "test")
squad_test_stats = summarize_token_counts(squad_test_counts)
squad_test_stats

Extracting token counts for squad-test split: 100%|██████████| 12576/12576 [00:00<00:00, 1244427.94it/s]


statistic,prompt_token_counts,target_token_counts,total_token_counts
count,12576.0,12576.0,12576.0
mean,111.27,4.49,115.76
std,40.49,3.26,41.23
min,34.0,1.0,38.0
25%,81.0,2.0,85.0
50%,101.0,4.0,105.0
75%,135.0,6.0,140.0
max,337.0,36.0,339.0
sum,1399370.0,56452.0,1455822.0


In [None]:
with open("data/MultihopEvaluation/squad_test.json", "r") as f:
    test = json.load(f)

In [None]:
print(test[0]["prompt"])
print("------------------")
print(test[0]["target"])
print("------------------")
print(test[0]["answer"])

question: Who is the mother of the director of film Polish-Russian War (Film)? context: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska. He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.
------------------
Małgorzata Braunek
------------------
Małgorzata Braunek
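
Comparing this prompt with the direct test prompt for the same question suggests a simple mapping: the `Fact #n:` labels are dropped, the facts are joined into one `context:` string, and the question moves to the front. The code that actually builds the SQuAD-style file is not shown in this notebook, so the following is only a sketch inferred from these two printed examples. The helper name `direct_to_squad_prompt` is hypothetical.

```python
import re


def direct_to_squad_prompt(direct_prompt: str) -> str:
    """Rebuild a 'question: ... context: ...' prompt from a direct prompt."""
    facts = re.findall(r"^Fact #\d+: (.*)$", direct_prompt, flags=re.MULTILINE)
    question_match = re.search(r"^Question: (.*)$", direct_prompt, flags=re.MULTILINE)
    if question_match is None:
        raise ValueError("direct prompt has no 'Question:' line")
    context = " ".join(fact.strip() for fact in facts)
    return f"question: {question_match.group(1).strip()} context: {context}"


# Direct test prompt as printed earlier in this notebook.
direct_prompt = (
    "Facts:\n"
    "Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery "
    "Żuławski based on the novel Polish-Russian War under the white-red flag "
    "by Dorota Masłowska.\n"
    "Fact #1: He is the son of actress Małgorzata Braunek and director "
    "Andrzej Żuławski.\n"
    "\n"
    "Question: Who is the mother of the director of film Polish-Russian War (Film)?\n"
    "Answer:"
)

squad_prompt = direct_to_squad_prompt(direct_prompt)
print(squad_prompt)
```

Dropping the per-fact labels would also explain why the SQuAD-style prompts are slightly shorter than the direct ones, with a gap that grows with the number of facts.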
