Purpose: analyze the train and dev splits of the direct-prompt and self-ask prompt datasets, and inspect the test set.

In [1]:
!pip install pytest transformers sentencepiece tokenizers thefuzz nltk loguru

from google.colab import drive
drive.mount('/content/drive/')


In [2]:
%cd drive/MyDrive/projects/compositional-reasoning-finetuning

/content/drive/MyDrive/projects/compositional-reasoning-finetuning


In [3]:
import json
from token_stats import extract_token_counts, summarize_token_counts

In [4]:
# directory holding the fine-tuning JSON files for each prompt style and split
path = "data/FinetuningData/"

## Train Set: Direct Prompting

In [8]:
direct_train_counts = extract_token_counts("direct", "train")
direct_train_stats = summarize_token_counts(direct_train_counts)
direct_train_stats

Extracting token counts for direct-train split: 100%|██████████| 3/3 [00:00<00:00, 26772.15it/s]


| statistic | prompt_token_counts | target_token_counts | total_token_counts |
|---|---:|---:|---:|
| count | 3.0 | 3.0 | 3.0 |
| mean | 177.33 | 3.33 | 180.67 |
| std | 42.77 | 2.52 | 41.4 |
| min | 140.0 | 1.0 | 143.0 |
| 25% | 154.0 | 2.0 | 158.5 |
| 50% | 168.0 | 3.0 | 174.0 |
| 75% | 196.0 | 4.5 | 199.5 |
| max | 224.0 | 6.0 | 225.0 |
| sum | 532.0 | 10.0 | 542.0 |
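
With only three examples in this split, the min, 50%, and max rows of the table above are the three values themselves, so the prompt column can be checked by hand. The sketch below recomputes the same statistics with the standard library. `summarize` is a hypothetical stand-in, not the `summarize_token_counts` function from `token_stats`, whose implementation is not shown in this notebook.

```python
import statistics


def summarize(counts):
    """Describe-style summary of a list of token counts (hypothetical helper)."""
    # "inclusive" quantiles interpolate linearly between the sorted data points
    q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "std": statistics.stdev(counts),  # sample standard deviation (n - 1)
        "min": min(counts),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(counts),
        "sum": sum(counts),
    }


# Prompt token counts read off the min / 50% / max rows of the table above
prompt_counts = [140, 168, 224]
for name, value in summarize(prompt_counts).items():
    print(f"{name:>5}: {value:.2f}")
```

Rounded to two decimals, these values agree with the `prompt_token_counts` column.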


In [9]:
# open json file for direct prompt training examples
with open(path + "direct_train.json", "r") as f:
    direct_train = json.load(f)

In [10]:
direct_train[:2]

[{'prompt': 'Facts:\nFact #0: is a 1919 American silent comedy film directed by Roy William Neill and written by L.V. Jefferson.\nFact #1: Diabolik  is a 1968 action film directed and co-written by Mario Bava, based on the Italian comic series "Diabolik" by Angela and Luciana Giussani.\nFact #2: Roy William Neill (4 September 1887 –\nFact #3: Mario Bava (31 July 1914 – 27 April 1980) was an Italian cinematographer, director, special effects artist and screenwriter, frequently referred to as the "Master of Italian Horror" and the "Master of the Macabre".\n\nQuestion: Which film whose director is younger, Charge It To Me or Danger: Diabolik?\nAnswer:',
  'target': 'Danger: Diabolik',
  'num_prompt_tokens': 168,
  'num_target_tokens': 6,
  'num_tokens': 174},
 {'prompt': 'Facts:\nFact #0: Wedding Night in Paradise  is a 1950 West German musical comedy film directed by Géza von Bolváry and starring Johannes Heesters, Claude Farell and Gretl Schörg.\nFact #1: Géza von Bolváry (full name Géz

In [11]:
print(direct_train[0]["prompt"])
print(direct_train[0]["target"])

Facts:
Fact #0: is a 1919 American silent comedy film directed by Roy William Neill and written by L.V. Jefferson.
Fact #1: Diabolik  is a 1968 action film directed and co-written by Mario Bava, based on the Italian comic series "Diabolik" by Angela and Luciana Giussani.
Fact #2: Roy William Neill (4 September 1887 –
Fact #3: Mario Bava (31 July 1914 – 27 April 1980) was an Italian cinematographer, director, special effects artist and screenwriter, frequently referred to as the "Master of Italian Horror" and the "Master of the Macabre".

Question: Which film whose director is younger, Charge It To Me or Danger: Diabolik?
Answer:
Danger: Diabolik
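
Each record carries three count fields, and in the first train record `num_tokens` equals `num_prompt_tokens + num_target_tokens` (168 + 6 = 174). A minimal sketch of checking that relationship over a list of records follows. `find_inconsistent` is a hypothetical helper, and whether the relationship holds for every record in the real files is an assumption to verify, not something this notebook shows.

```python
def find_inconsistent(records):
    """Return indices of records whose total differs from prompt + target."""
    return [
        i
        for i, rec in enumerate(records)
        if rec["num_tokens"] != rec["num_prompt_tokens"] + rec["num_target_tokens"]
    ]


# Count fields copied from the first train record shown above (text fields omitted)
sample = [{"num_prompt_tokens": 168, "num_target_tokens": 6, "num_tokens": 174}]
print(find_inconsistent(sample))
```

On the loaded data this could be called as `find_inconsistent(direct_train)` before the list is deleted.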


In [None]:
del direct_train

## Dev Set: Direct Prompting

In [12]:
direct_dev_counts = extract_token_counts("direct", "dev")
direct_dev_stats = summarize_token_counts(direct_dev_counts)
direct_dev_stats

Extracting token counts for direct-dev split: 100%|██████████| 2/2 [00:00<00:00, 19021.79it/s]


| statistic | prompt_token_counts | target_token_counts | total_token_counts |
|---|---:|---:|---:|
| count | 2.0 | 2.0 | 2.0 |
| mean | 160.0 | 2.0 | 162.0 |
| std | 111.72 | 1.41 | 110.31 |
| min | 81.0 | 1.0 | 84.0 |
| 25% | 120.5 | 1.5 | 123.0 |
| 50% | 160.0 | 2.0 | 162.0 |
| 75% | 199.5 | 2.5 | 201.0 |
| max | 239.0 | 3.0 | 240.0 |
| sum | 320.0 | 4.0 | 324.0 |


In [13]:
# open json file for direct prompt dev examples
with open(path + "direct_dev.json", "r") as f:
    direct_dev = json.load(f)

In [14]:
direct_dev[:2]

[{'prompt': "Facts:\nFact #0: She is the daughter of Rune Gerhardsen and Tove Strand, and granddaughter of Einar Gerhardsen.\nFact #1: Rune Gerhardsen (born 13 June 1946) is a Norwegian politician, representing the Norwegian Labour Party.\n\nQuestion: What is the date of birth of Mina Gerhardsen's father?\nAnswer:",
  'target': '13 June 1946',
  'num_prompt_tokens': 81,
  'num_target_tokens': 3,
  'num_tokens': 84},
 {'prompt': 'Facts:\nFact #0: Banović Strahinja( Serbian Cyrillic:" Бановић Страхиња", released internationally as The Falcon) is a 1981 Yugoslavian- German adventure film written and directed by Vatroslav Mimica based on Strahinja Banović, a hero of Serbian epic poetry.\nFact #1: Valentin the Good is a 1942 Czech comedy film directed by Martin Frič.\nFact #2: Vatroslav Mimica( born 25 June 1923) is a Croatian film director and screenwriter.\nFact #3: In 1942 he joined Young Communist League of Yugoslavia( SKOJ) and in 1943 he went on to join the Yugoslav Partisans, becomin

In [15]:
print(direct_dev[0]["prompt"])
print(direct_dev[0]["target"])

Facts:
Fact #0: She is the daughter of Rune Gerhardsen and Tove Strand, and granddaughter of Einar Gerhardsen.
Fact #1: Rune Gerhardsen (born 13 June 1946) is a Norwegian politician, representing the Norwegian Labour Party.

Question: What is the date of birth of Mina Gerhardsen's father?
Answer:
13 June 1946
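
The printed prompt follows a fixed layout: a `Facts:` header, one numbered `Fact #i:` line per supporting fact, a blank line, the question, and a trailing `Answer:` cue. The sketch below rebuilds the first dev prompt from its parts. `build_direct_prompt` is a hypothetical helper written to match the printed output; the project's actual prompt-construction code is not shown in this notebook.

```python
def build_direct_prompt(facts, question):
    """Assemble a direct prompt in the layout printed above (hypothetical helper)."""
    fact_lines = [f"Fact #{i}: {fact}" for i, fact in enumerate(facts)]
    return "Facts:\n" + "\n".join(fact_lines) + f"\n\nQuestion: {question}\nAnswer:"


# Facts and question copied from the first dev record shown above
facts = [
    "She is the daughter of Rune Gerhardsen and Tove Strand, "
    "and granddaughter of Einar Gerhardsen.",
    "Rune Gerhardsen (born 13 June 1946) is a Norwegian politician, "
    "representing the Norwegian Labour Party.",
]
question = "What is the date of birth of Mina Gerhardsen's father?"

prompt = build_direct_prompt(facts, question)
print(prompt)
```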


In [None]:
del direct_dev

## Train Set: Self-Ask

In [5]:
self_ask_train_counts = extract_token_counts("self_ask", "train")
self_ask_train_stats = summarize_token_counts(self_ask_train_counts)
self_ask_train_stats

Extracting token counts for self_ask-train split: 100%|██████████| 154876/154876 [00:00<00:00, 573057.21it/s]


| statistic | prompt_token_counts | target_token_counts | total_token_counts |
|---|---:|---:|---:|
| count | 154876.0 | 154876.0 | 154876.0 |
| mean | 294.13 | 63.64 | 357.77 |
| std | 41.97 | 20.2 | 58.31 |
| min | 215.0 | 38.0 | 260.0 |
| 25% | 263.0 | 50.0 | 316.0 |
| 50% | 281.0 | 55.0 | 336.0 |
| 75% | 320.0 | 68.0 | 387.0 |
| max | 644.0 | 180.0 | 695.0 |
| sum | 45553934.0 | 9856539.0 | 55410473.0 |


In [6]:
# open json file for self ask prompt training examples
with open(path + "self_ask_train.json", "r") as f:
    self_ask_train = json.load(f)

In [7]:
self_ask_train[:2]

[{'prompt': "Examples:\nSTART\nQuestion: When was Neva Egan's husband born?\nAre follow up questions needed here: Yes.\nFollow up: Who is the spouse of Neva Egan?\nIntermediate answer: William Allen Egan\nFollow up: When is the date of birth of William Allen Egan?\nIntermediate answer: October 8, 1914\nSo the final answer is: October 8, 1914\nEND\n\nSTART\nQuestion: Who was born first, Alejo Mancisidor or Emil Leyde?\nAre follow up questions needed here: Yes.\nFollow up: When is the date of birth of Alejo Mancisidor?\nIntermediate answer: 31 July 1970\nFollow up: When is the date of birth of Emil Leyde?\nIntermediate answer: 8 January 1879\nSo the final answer is: Emil Leyde\nEND\n\nFacts:\nFact #0: Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.\nFact #1: Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian writer, director, actor and producer.\n\nQuestion: What is the place of birth of the director of film Solo (2006 Film)?\nA

In [8]:
print(self_ask_train[0]["prompt"])
print(self_ask_train[0]["target"])

Examples:
START
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
END

START
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
END

Facts:
Fact #0: Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.
Fact #1: Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian writer, director, actor and producer.

Question: What is the place of birth of the director of film Solo (2006 Film)?
Are follow up questions needed here:

Ye
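
The self-ask prompt wraps each worked exemplar in `START` / `END` markers under an `Examples:` header, then appends the same `Facts:` block used by the direct prompt and ends with the cue `Are follow up questions needed here:`. The sketch below reproduces that layout. `build_self_ask_prompt` is a hypothetical helper inferred from the printed text; the exact whitespace after the final cue is an assumption, since the output above is truncated.

```python
def build_self_ask_prompt(exemplars, facts, question):
    """Assemble a self-ask prompt in the layout printed above (hypothetical helper)."""
    blocks = [f"START\n{exemplar}\nEND" for exemplar in exemplars]
    fact_lines = [f"Fact #{i}: {fact}" for i, fact in enumerate(facts)]
    return (
        "Examples:\n"
        + "\n\n".join(blocks)
        + "\n\nFacts:\n"
        + "\n".join(fact_lines)
        + f"\n\nQuestion: {question}\nAre follow up questions needed here:"
    )


# Exemplars, facts, and question copied from the first train record shown above
exemplars = [
    "Question: When was Neva Egan's husband born?\n"
    "Are follow up questions needed here: Yes.\n"
    "Follow up: Who is the spouse of Neva Egan?\n"
    "Intermediate answer: William Allen Egan\n"
    "Follow up: When is the date of birth of William Allen Egan?\n"
    "Intermediate answer: October 8, 1914\n"
    "So the final answer is: October 8, 1914",
    "Question: Who was born first, Alejo Mancisidor or Emil Leyde?\n"
    "Are follow up questions needed here: Yes.\n"
    "Follow up: When is the date of birth of Alejo Mancisidor?\n"
    "Intermediate answer: 31 July 1970\n"
    "Follow up: When is the date of birth of Emil Leyde?\n"
    "Intermediate answer: 8 January 1879\n"
    "So the final answer is: Emil Leyde",
]
facts = [
    "Solo is a 2006 Australian film directed by Morgan O'Neill and starring Colin Friels.",
    "Morgan O'Neill (born 19 April 1973 in Sydney, Australia) is an Australian "
    "writer, director, actor and producer.",
]
question = "What is the place of birth of the director of film Solo (2006 Film)?"

prompt = build_self_ask_prompt(exemplars, facts, question)
print(prompt)
```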

In [None]:
del self_ask_train

## Dev Set: Self-Ask

In [24]:
self_ask_dev_counts = extract_token_counts("self_ask", "dev")
self_ask_dev_stats = summarize_token_counts(self_ask_dev_counts)
self_ask_dev_stats

Extracting token counts for self_ask-dev split: 100%|██████████| 2/2 [00:00<00:00, 16743.73it/s]


| statistic | prompt_token_counts | target_token_counts | total_token_counts |
|---|---:|---:|---:|
| count | 2.0 | 2.0 | 2.0 |
| mean | 393.0 | 89.5 | 482.5 |
| std | 111.72 | 44.55 | 156.27 |
| min | 314.0 | 58.0 | 372.0 |
| 25% | 353.5 | 73.75 | 427.25 |
| 50% | 393.0 | 89.5 | 482.5 |
| 75% | 432.5 | 105.25 | 537.75 |
| max | 472.0 | 121.0 | 593.0 |
| sum | 786.0 | 179.0 | 965.0 |


In [25]:
# open json file for self ask prompt dev examples
with open(path + "self_ask_dev.json", "r") as f:
    self_ask_dev = json.load(f)

In [26]:
self_ask_dev[:2]

[{'prompt': 'Examples:\nSTART\nQuestion: What nationality is the director of film Wedding Night In Paradise (1950 Film)?\nAre follow up questions needed here: Yes.\nFollow up: Who is the director of Wedding Night in Paradise?\nIntermediate answer: Géza von Bolváry\nFollow up: What is the country of citizenship of Géza von Bolváry?\nIntermediate answer: Hungarian\nSo the final answer is: Hungarian\nEND\n\nSTART\nQuestion: Which film whose director is younger, Charge It To Me or Danger: Diabolik?\nAre follow up questions needed here: Yes.\nFollow up: Who is the director of Charge It to Me?\nIntermediate answer: Roy William Neill\nFollow up: Who is the director of Danger: Diabolik?\nIntermediate answer: Mario Bava\nFollow up: When is the date of birth of Roy William Neill?\nIntermediate answer: 4 September 1887\nFollow up: When is the date of birth of Mario Bava?\nIntermediate answer: 31 July 1914\nSo the final answer is: Danger: Diabolik\nEND\n\nFacts:\nFact #0: Banović Strahinja( Serbia

In [27]:
print(self_ask_dev[0]["prompt"])
print(self_ask_dev[0]["target"])

Examples:
START
Question: What nationality is the director of film Wedding Night In Paradise (1950 Film)?
Are follow up questions needed here: Yes.
Follow up: Who is the director of Wedding Night in Paradise?
Intermediate answer: Géza von Bolváry
Follow up: What is the country of citizenship of Géza von Bolváry?
Intermediate answer: Hungarian
So the final answer is: Hungarian
END

START
Question: Which film whose director is younger, Charge It To Me or Danger: Diabolik?
Are follow up questions needed here: Yes.
Follow up: Who is the director of Charge It to Me?
Intermediate answer: Roy William Neill
Follow up: Who is the director of Danger: Diabolik?
Intermediate answer: Mario Bava
Follow up: When is the date of birth of Roy William Neill?
Intermediate answer: 4 September 1887
Follow up: When is the date of birth of Mario Bava?
Intermediate answer: 31 July 1914
So the final answer is: Danger: Diabolik
END

Facts:
Fact #0: Banović Strahinja( Serbian Cyrillic:" Бановић Страхиња", release

In [None]:
del self_ask_dev

## Test Set

In [None]:
with open("data/MultihopEvaluation/test.json", "r") as f:
    test = json.load(f)

In [None]:
print(test[0]["self_ask_prompt_with_examplars"])
print(test[0]["self_ask_answer"])
print(test[0]["direct_prompt"])
print(test[0]["answer"])

Example Response
Question: When was Neva Egan's husband born?
Are follow up questions needed here: Yes.
Follow up: Who is the spouse of Neva Egan?
Intermediate answer: William Allen Egan
Follow up: When is the date of birth of William Allen Egan?
Intermediate answer: October 8, 1914
So the final answer is: October 8, 1914
Example Response
Question: Who was born first, Alejo Mancisidor or Emil Leyde?
Are follow up questions needed here: Yes.
Follow up: When is the date of birth of Alejo Mancisidor?
Intermediate answer: 31 July 1970
Follow up: When is the date of birth of Emil Leyde?
Intermediate answer: 8 January 1879
So the final answer is: Emil Leyde
Fact #0: (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.
Fact #1: He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.

Question: Who is the mother of the director of film Polish-Russian War (Film)?
Are follow u
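
Every worked self-ask example above closes with a line of the form `So the final answer is: ...`. When comparing a self-ask response against the `answer` field of a test record, that line is the part to extract. A minimal sketch under that assumption follows; `extract_final_answer` is a hypothetical helper and not part of the project's evaluation code, which is not shown in this notebook.

```python
import re

# Captures the rest of the line after the marker ("." does not match newlines)
FINAL_ANSWER_RE = re.compile(r"So the final answer is:\s*(.+)")


def extract_final_answer(text):
    """Return the text after the last final-answer marker, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None


# Response text copied from the second exemplar shown above
response = (
    "Are follow up questions needed here: Yes.\n"
    "Follow up: When is the date of birth of Alejo Mancisidor?\n"
    "Intermediate answer: 31 July 1970\n"
    "Follow up: When is the date of birth of Emil Leyde?\n"
    "Intermediate answer: 8 January 1879\n"
    "So the final answer is: Emil Leyde\n"
)
print(extract_final_answer(response))
```

Taking the last match rather than the first guards against a response that repeats an exemplar before giving its own answer.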