### Task decomposition and prompt development.

We provide here a framework for developing sequential prompts for a complex NLP task using GPT4o.

The complex tasks is decompsed into as sequence of simpler tasks that each build on the previous one.

For each task in the sequence we produce a number of examples to show GPT4o, and a number of further examples to use for automated testing of the response. This borrows ideas from unit testing of software, since iterative changes to the prompt may break functionality that was previously working.

This framework can be adapted to use with other LLMs and NLP tasks.

#### We decompose into the following tasks:
- Task 1: identify sections of direct speech, and the name of the speaker and recipient
- Task 1b: locate pre-defined sentences in these detected sections of speech (for comparison with human coding)
- Task 2: pull out the spoken words only from each section (removing e.g. 'he said' etc)
- Task 3: locate and replace the names of the speakers and recipients using a pre-defined character map  

#### Notes:
- Determinism is not guaranteed. But well structured prompts should produce near deterministic outputs, along with temperature=0, fixed seed. It is also worth storing the system finerprint for future reference, as changes to this may be the cause of differing results in the future.  
- The lack of determinism can make tests quite brittle. It is worth repeating tests several times to confirm their behvaiour. And then running the full manual validation on a single static result set.
- Where possible, the sequential tasks should b tackled as a new completion API, using formatted output from the previous task as the inupt. This is preferrable to chaining of prompts and outputs to produce a chat style conversation, but this increases the risk of conflict or confusion between prompts/instructions sets. And also increases the length of the context window.
- Need to ensure consistency between instructions, schemas and examples. Otherwise results may be inconsistent e.g. 'reproduce all punctuation all it appears' conflicted with 'remove  speech marks' example.
- Should typos be accounted for (e.g. task_3 name matching?)
- Cost: \\$1.22 left after developing prompts. Added \\$10 to run for 50 books (so ~0.25 full dataset). 

#### Automating this using the Chat GPT API:

## TODO:
 - move deifnition of input json to system prompt?
 - add character/alias mapping to prompt for each book: use primary name only
 - When in pipeline to spellcheck/ correct typos? (e.g. Hany in the Dinosaurs (book 21).
 - what to do about inconsitent sentence detection? e.g "Now Dasher!" being at the end of sentence 7 was causing GPT confusion...
 - add an instruction about how to refer to 'general audience' or 'narrator' or 'I'
 - provide example of input and what the output should look like (within the prompt)
 - should temp be close to 0 (but not exactly 0)?
 - ask for output of reaosning/thought process?
 - ask for a confidence score?
 - do we need to specify (in system prompt), not to use MD or any other formatting in the json output?
 
## Note: ideas to explore if we need performance boost...

- system message to edit assistant role
- vary temperature or top_p parameter
- fine_tuning a model with bespoke training data (how much is necessary?)
- improved instructions or prompt engineering (see e.g. paper on iterative prompting)
- compare results with gpt-3.5-turbo? - does not seem to work weel for our use case!

In [1]:
import os
import json
import pdfplumber
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import string
import spacy
from spacy import displacy
from spacy.lang.en.examples import sentences 
from openai import OpenAI
import pickle

%matplotlib inline

In [2]:
nlp = spacy.load("en_core_web_lg")

In [3]:
with open('./key.txt', 'r') as infile:
    api_key = infile.read().splitlines()[0]

In [4]:
with open('data/tempdf.pickle', 'rb') as outfile:
    df = pickle.load(outfile)

In [5]:
# Import our prompts, example data (for in-context learning), and test data for unit testing each task: 
from openai_api.examples import example_data
from openai_api.tests import build_test_data

test_data = build_test_data(df)

#### Converting the full dataset into a dataframe of sentences

In [6]:
from openai_api.utilities import spacy_extract_sentences
sentences = spacy_extract_sentences(df, nlp)

#### Check that this sample contains the same sentences that were manually coded previously.

In [7]:
coding_sample = sentences.sample(frac=0.15, axis=0, random_state=42)
manually_coded = pd.read_csv('./sentences_for_coding/sample_15pc.csv', delimiter='\t', index_col=0)

In [8]:
text_equal = [
    i == j.text
    for i,j in
    zip(manually_coded.sentence, coding_sample.sentence)
]    
assert sum(text_equal) == len(text_equal)

##### Build OpenAI API code...

In [10]:
from openai_api.schemas import build_task_1_input_schema, build_task_1_response_schema
from openai_api.utilities import *

task_1_response_schema_str = build_task_1_response_schema()
task_1_input_schema_str = build_task_1_input_schema()

In [11]:
from openai_api.tests import run_task_1_test_i

In [12]:
from openai_api.prompts import (
    get_task_1_prompt_string, get_task_1_system_prompt,
    get_task_2_prompt_string, get_task_2_system_prompt,
    get_task_3_prompt_string, get_task_3_system_prompt
)

In [13]:
def run_task_1(full_text, client, seed=42):
    
    prompt_string = get_task_1_prompt_string(
            data={'full_text': full_text}, 
        )
        
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": get_task_1_system_prompt()},
            {"role": "user", "content": r"{}".format(prompt_string)}
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
        seed=seed
    )
    
    return completion    

In [14]:
def run_task_1_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")
        
        completion = run_task_1(test_data['strings'][test_id], client=client)
        
        if run_task_1_test_i(test_id, test_data, completion, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [14]:
success_count = 0
for i in range(20):
    print(f"Running repeat {i}")
    
    success = run_task_1_tests(test_data)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 5
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 6
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 7
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 8
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 9
Running test: 0
Test 0: pass


KeyboardInterrupt: 

### Task 1b: recognising pre-defined sentences.

In [26]:
# TODO: for comparison with student speech flags.

### Task 2: pulling out spoken words only.

#### TODO:
- refactor run_test_i method
- move schemas and test and example data to files
- rename as tak 2 or rename functions and strings!

In [15]:
def run_task_2(full_text, client, task_1_completion=None, seed=42):
    
    if task_1_completion is None:
        task_1_completion = run_task_1(full_text, client)
        
    task_1_prompt_string = get_task_1_prompt_string(
        data={'full_text': full_text}
    )
    
    task_2_prompt_string = get_task_2_prompt_string(
        task_1_response=json.loads(task_1_completion.choices[0].message.content)
    )
    
    task_2_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
#            {
#                "role": "system", 
#                "content": get_task_1_system_prompt(task_1_input_schema_str, task_1_response_schema_str)
#            },
#            {
#                "role": 
#                "user", "content": r"{}".format(task_1_prompt_string)
#            },
#            {
#                "role": "assistant", 
#                "content": task_1_completion.choices[0].message.content
#            },
           {
               "role": "system", 
               "content": get_task_2_system_prompt()
           },
           {
               "role": "user", 
               "content": r"{}".format(task_2_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion

In [16]:
client = OpenAI(api_key=api_key)

In [17]:
test_id = 1

In [20]:
completion_1, completion_2 = run_task_2(
    full_text=test_data['strings'][test_id],
    client=client
)

In [21]:
completion_2.usage

CompletionUsage(completion_tokens=156, prompt_tokens=679, total_tokens=835)

In [22]:
print(completion_2.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "spoken_words_only": "That’s a rhinoceros. Triceratops has got more horns.",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "spoken_words_only": "I want to save some animals. What can I do, Mum?",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "spoken_words_only": "Tuh! What a waste of time!",
      "speech_section_id": 2
    }
  ]
}


In [17]:
def run_task_2_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_2_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print(f"Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            pass_flag = False
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'spoken_words_only': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print(f"Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag

In [18]:
def run_task_2_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, completion_2 = run_task_2(
            full_text=test_data['strings'][test_id], 
            client=client
        )
           
        if run_task_2_test_i(test_id, test_data, completion_2, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [25]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_2_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Task 3: mappnig character names.

In [19]:
import sqlite3

In [20]:
conn = sqlite3.connect('character_database.db')

In [21]:
aliases = pd.read_sql('select * from aliases', conn, index_col='index')
characters = pd.read_sql('select * from characters', conn, index_col='index')

In [22]:
meta_character_list = [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator'
    'Reindeer', 'Dinosaurs', 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]

In [28]:
def run_task_3(full_text, client, characters, aliases, task_2_completion=None, task_1_completion=None, seed=42):
    
    if task_2_completion is None:
        task_1_completion, task_2_completion = run_task_2(full_text, client)
        
    
    task_3_prompt_string = get_task_3_prompt_string(
        task_2_response=json.loads(task_2_completion.choices[0].message.content),
        characters=characters,
        aliases=aliases,
        meta_characters=meta_character_list
    )
    
    task_3_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
           {
               "role": "system", 
               "content": get_task_3_system_prompt()
           },
           {
               "role": "user", 
               "content": r"{}".format(task_3_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion, task_3_completion

In [29]:
client = OpenAI(api_key=api_key)

In [30]:
test_id = 1

In [31]:
completion_1, completion_2, completion_3 = run_task_3(
    full_text=test_data['strings'][test_id],
    client=client,
    characters=test_data['task_3_characters'][test_id],
    aliases=test_data['task_3_aliases'][test_id]
)

In [32]:
completion_3.usage

CompletionUsage(completion_tokens=155, prompt_tokens=937, total_tokens=1092)

In [33]:
print(completion_3.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "speaker_matched": "Unknown",
      "recipient_matched": "Apatosaurus",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "speaker_matched": "Harry",
      "recipient_matched": "Mum",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "speaker_matched": "Mum",
      "recipient_matched": "Harry",
      "speech_section_id": 2
    }
  ]
}


In [34]:
def run_task_3_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_3_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print(f"Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            pass_flag = False
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'speaker_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
        'recipient_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
#         'spoken_words_only': {
#             'case_sensitive': True,
#             'remove_leading_the': False
#         },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print(f"Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag

In [35]:
def run_task_3_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, _, completion_3 = run_task_3(
            full_text=test_data['strings'][test_id], 
            client=client,
            characters=test_data['task_3_characters'][test_id],
            aliases=test_data['task_3_aliases'][test_id]
        )
           
        if run_task_3_test_i(test_id, test_data, completion_3, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [36]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_3_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Running for corpus

Now that our prompts are passing all tests, we run the method for all books in the corpus and save the results to disk....

# TODO:
- add a 'self' match example to data (still usig himself)
- check Noi and 'his dad' - shouldn't it be Dad? (The Storm Whale In Winter)
- add Narrator handling/example (e.g. There's A Monster In Your Book)
- add a flag for if it is a character match or something else ('Everyone' Narrator' etc!)

In [37]:
import datetime

In [55]:
client = OpenAI(api_key=api_key)

book_df = {
    'title': [],
    'speech_section_count': 0,
    'c1_completion_tokens': [],
    'c1_prompt_tokens': [],
    'c1_total_tokens': [],
    'c1_system_fingerprint': [],
    'c2_completion_tokens': [],
    'c2_prompt_tokens': [],
    'c2_total_tokens': [],
    'c2_system_fingerprint': [],
    'c3_completion_tokens': [],
    'c3_prompt_tokens': [],
    'c3_total_tokens': [],
    'c3_system_fingerprint': [],
    'runtime_seconds': []
}
c1_results_dict = {}
c2_results_dict = {}
c3_results_dict = {}

# for book_id in range(len(df)):
for book_id in range(50):
    title = df.iloc[book_id].Title
    print("Book: ", title)
    start = datetime.datetime.now()    
    
    completion_1, completion_2, completion_3 = run_task_3(
        full_text=df.iloc[book_id].Text, 
        client=client,
        characters=characters[characters.book==title],
        aliases=aliases[aliases.book==title]
    )
    
    json_response = json.loads(completion_3.choices[0].message.content)
    
    book_df['c1_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c1_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c1_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c1_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c2_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c2_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c2_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c2_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c3_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c3_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c3_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c3_system_fingerprint'].append(completion_1.system_fingerprint)
    
    book_df['title'].append(title)
    book_df['runtime_seconds'].append((datetime.datetime.now() - start).seconds)
    book_df['speech_section_count'] += len(json_response['speech_sections'])
    
    c3_results_dict[title] = json_response
    
    c1_results_dict[title] = json.loads(completion_1.choices[0].message.content)
    c2_results_dict[title] = json.loads(completion_2.choices[0].message.content)
    
book_df = pd.DataFrame(book_df)

Book:  The Night Before Christmas
Book:  Sugarlump and the Unicorn
Book:  The Gruffalo
Book:  The Monstrous Tale of Celery Crumble
Book:  Peace at Last
Book:  Sing A Song Of Bottoms
Book:  Barry The Fish With Fingers
Book:  The Troll
Book:  The Storm Whale In Winter
Book:  There's A Monster In Your Book
Book:  Once Upon A Unicorn Horn
Book:  Mind Your Manners
Book:  The Princess and the Wizard
Book:  Kipper's Toybox
Book:  Oi Frog!
Book:  Elmer and the Lost Teddy
Book:  The Hungry Caterpillar
Book:  A Squash and a Squeeze
Book:  Keith The Cat With The Magic Hat
Book:  Santa is Coming to Devon
Book:  The Enormous Crocodile


JSONDecodeError: Unterminated string starting at: line 400 column 18 (char 14598)

In [40]:
for book_id in range(50):
    title = df.iloc[book_id].Title
#     if title == 'The Enormous Crocodile':
    if True:
        print("Book: ", title)
        print(len(df.iloc[book_id].Text))

Book:  The Night Before Christmas
2627
Book:  Sugarlump and the Unicorn
3058
Book:  The Gruffalo
3839
Book:  The Monstrous Tale of Celery Crumble
3710
Book:  Peace at Last
1949
Book:  Sing A Song Of Bottoms
1530
Book:  Barry The Fish With Fingers
1818
Book:  The Troll
7004
Book:  The Storm Whale In Winter
1957
Book:  There's A Monster In Your Book
1332
Book:  Once Upon A Unicorn Horn
2507
Book:  Mind Your Manners
2371
Book:  The Princess and the Wizard
6025
Book:  Kipper's Toybox
2293
Book:  Oi Frog!
2169
Book:  Elmer and the Lost Teddy
2786
Book:  The Hungry Caterpillar
1206
Book:  A Squash and a Squeeze
2769
Book:  Keith The Cat With The Magic Hat
2312
Book:  Santa is Coming to Devon
6065
Book:  The Enormous Crocodile
16172
Book:  Harry and the Dinosaurs Go Wild
3253
Book:  Open Very Carefully, A Book With Bite!
1786
Book:  The Most Wonderful Gift In The World
3509
Book:  Elmer and Grandpa Eldo
3198
Book:  Tabby McTat
4557
Book:  Harry and the Dinosaurs at the Museum
3336
Book:  The 

In [64]:
   completion_1 = run_task_1(
        full_text=df[df.Title=='The Enormous Crocodile'].iloc[0].Text, 
        client=client)

In [65]:
completion_1, completion_2 = run_task_2(
    full_text=df[df.Title=='The Enormous Crocodile'].iloc[0].Text, 
    client=client,
    task_1_completion=completion_1
)

In [None]:
completion_1, completion_2 = run_task_2(
    full_text=df[df.Title=='The Enormous Crocodile'].iloc[0].Text[0:10000], 
    client=client
)

In [60]:
print(df[df.Title==title].Text)

20    \nIn the biggest brownest muddiest river in Af...
Name: Text, dtype: object


In [63]:
title = 'The Enormous Crocodile'
completion_1, completion_2, completion_3 = run_task_3(
    full_text=df[df.Title==title].iloc[0].Text, 
        client=client,
        characters=characters[characters.book==title],
        aliases=aliases[aliases.book==title]
    )

JSONDecodeError: Expecting property name enclosed in double quotes: line 407 column 1 (char 14571)

In [57]:
task_1_response=json.loads(completion_1.choices[0].message.content)

In [58]:
task_1_response

{'speech_sections': [{'speaker': 'Elephant',
   'recipient': 'Crocodile',
   'speech_text': 'Good morning, Crocodile.',
   'speech_section_id': 0},
  {'speaker': 'Crocodile',
   'recipient': 'Elephant',
   'speech_text': 'Good morning, Elephant. What are you doing?',
   'speech_section_id': 1},
  {'speaker': 'Elephant',
   'recipient': 'Crocodile',
   'speech_text': 'I am going to the river to drink some water.',
   'speech_section_id': 2},
  {'speaker': 'Crocodile',
   'recipient': 'Elephant',
   'speech_text': 'Be careful, Elephant. The river is very deep.',
   'speech_section_id': 3},
  {'speaker': 'Elephant',
   'recipient': 'Crocodile',
   'speech_text': 'Thank you, Crocodile. I will be careful.',
   'speech_section_id': 4}]}

In [None]:
len(book_df)

In [None]:
# Convert results dicts to dataframe of all speech sections and save to disk:

all_speech_sections = {
    'book': [],
    'speech_section_id': [], # speech section id within book
    'speaker': [],
    'recipient': [],
    'speaker_matched': [],
    'recipient_matched': [],
    'speech_text': [],
    'spoken_words_only': [],
    'spoken_word_count': []
}

for book in c3_results_dict.keys():
    print(book)
    for si, section in enumerate(c3_results_dict[book]['speech_sections']):
        all_speech_sections['book'].append(book)
        for key in all_speech_sections.keys():
            if key not in ['book', 'spoken_word_count', 'speech_text', 'spoken_words_only']:
                all_speech_sections[key].append(section[key])
        
        all_speech_sections['speech_text'].append(c1_results_dict[book]['speech_sections'][si]['speech_text'])
        all_speech_sections['spoken_words_only'].append(c2_results_dict[book]['speech_sections'][si]['spoken_words_only'])
        all_speech_sections['spoken_word_count'].append(len(c2_results_dict[book]['speech_sections'][si]['spoken_words_only']))

In [82]:
# l = len(all_speech_sections['spoken_word_count'])

# for key in all_speech_sections.keys():
#     all_speech_sections[key] = all_speech_sections[key][0:l]
    

In [80]:
all_speech_sections = pd.DataFrame(all_speech_sections)

##### We add the metacharacter data to the character table:

In [231]:
meta_character_data = pd.DataFrame({
    'book': ['all' for c in meta_character_list],
    'name': [c for c in meta_character_list],
    'gender': ['NGS' for c in meta_character_list],
    'human': ['NH' if c in ['Dinosaurs', 'Reindeer', 'Elmer and Grandpa Eldo'] else 'H' for c in meta_character_list],
    'alias_count': [0 for c in meta_character_list]
})

In [232]:
characters = pd.concat([
    characters, meta_character_data
])

In [146]:
# Replace 'self' with character name for ease of analysis (and flag self)
all_speech_sections['self_talk_flag'] = [
    True if r == 'Self'
    else False 
    for r in all_speech_sections.recipient_matched
]
all_speech_sections['recipient_matched'] = [
    s if r == 'Self'
    else r 
    for s,r in zip(all_speech_sections.speaker_matched, all_speech_sections.recipient_matched)
]

In [166]:
# aliases # add 'Hany'

In [196]:
speakers = all_speech_sections.merge(characters, how='left', left_on=['speaker_matched', 'book'], right_on=['name', 'book'])

In [197]:
speakers = speakers.merge(characters, how='left', left_on=['recipient_matched', 'book'], right_on=['name', 'book'], suffixes=['_speaker', '_recipient'])

In [198]:
# We fill in the mssing information for the metacharacters
for c in ['People', 'Everyone', 'Reader', 'The Reader']:
    
    speakers['name_speaker'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_speaker, speakers.speaker_matched)
    ]
    speakers['gender_speaker'] = [
        characters[characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_speaker, speakers.speaker_matched)
    ]
    speakers['human_speaker'] = [
        characters[characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_speaker, speakers.speaker_matched)
    ]
    speakers['alias_count_speaker'] = [
        characters[characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_speaker, speakers.speaker_matched)
    ]
    
    speakers['name_recipient'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_recipient, speakers.recipient_matched)
    ]
    speakers['gender_recipient'] = [
        characters[characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_recipient, speakers.recipient_matched)
    ]
    speakers['human_recipient'] = [
        characters[characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_recipient, speakers.recipient_matched)
    ]
    speakers['alias_count_recipient'] = [
        characters[characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_recipient, speakers.recipient_matched)
    ]

In [199]:
speakers.head()

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,self_talk_flag,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
0,The Night Before Christmas,0,St. Nicholas,Reindeer,St. Nicholas,Unknown,"Now, Dasher! now, Dancer!\nnow, Prancer and Vi...","Now, Dasher! now, Dancer! now, Prancer and Vix...",183,False,St. Nicholas,M,H,1.0,,,,
1,The Night Before Christmas,1,St. Nicholas,Everyone,St. Nicholas,The Reader,"Happy\nChristmas to all, and to all a good\nni...","Happy Christmas to all, and to all a good night!",48,False,St. Nicholas,M,H,1.0,The Reader,NGS,H,0.0
2,Sugarlump and the Unicorn,0,Sugarlump,himself,Sugarlump,Sugarlump,"""Here in the children's bedroom\nIs where I wa...",Here in the children's bedroom Is where I want...,106,True,Sugarlump,M,NH,0.0,Sugarlump,M,NH,0.0
3,Sugarlump and the Unicorn,1,Sugarlump,himself,Sugarlump,Sugarlump,"""Oh to be out in the big wide world!\nI wish I...",Oh to be out in the big wide world! I wish I c...,65,True,Sugarlump,M,NH,0.0,Sugarlump,M,NH,0.0
4,Sugarlump and the Unicorn,2,Unicorn,Sugarlump,unicorn,Sugarlump,"""Done!"" came a voice, and there stood a beast\...","Done! came a voice, and there stood a beast Wi...",128,False,unicorn,F,NH,0.0,Sugarlump,M,NH,0.0


In [98]:
len(speakers)

478

In [99]:
len(speakers[speakers.name_speaker.isna()])

25

In [110]:
len(speakers[speakers.name_speaker.isna()]) / len(speakers)

0.05230125523012552

In [1065]:
187/811

0.23057953144266338

In [101]:
len(speakers[speakers.name_recipient.isna()])

72

In [111]:
len(speakers[speakers.name_recipient.isna()]) / len(speakers)

0.1506276150627615

In [102]:
print(len(speakers[speakers.name_recipient.isna()].book.unique()))
print(speakers[speakers.name_recipient.isna()].book.unique())

23
['The Night Before Christmas' 'Sugarlump and the Unicorn'
 'The Monstrous Tale of Celery Crumble' 'Peace at Last'
 'Sing A Song Of Bottoms' 'Barry The Fish With Fingers' 'The Troll'
 'The Storm Whale In Winter' "There's A Monster In Your Book"
 'Once Upon A Unicorn Horn' 'Mind Your Manners'
 'The Princess and the Wizard' "Kipper's Toybox"
 'Elmer and the Lost Teddy' 'Santa is Coming to Devon'
 'The Enormous Crocodile' 'Harry and the Dinosaurs Go Wild'
 'Open Very Carefully, A Book With Bite!'
 'The Most Wonderful Gift In The World' 'Elmer and Grandpa Eldo'
 'Tabby McTat' 'Harry and the Dinosaurs at the Museum'
 "The Hedgehog's Balloon"]


In [103]:
print(len(speakers[speakers.name_speaker.isna()].book.unique()))
print(speakers[speakers.name_speaker.isna()].book.unique())

14
['Sing A Song Of Bottoms' "There's A Monster In Your Book"
 'Once Upon A Unicorn Horn' 'Mind Your Manners' "Kipper's Toybox"
 'Elmer and the Lost Teddy' 'Santa is Coming to Devon'
 'The Enormous Crocodile' 'Harry and the Dinosaurs Go Wild'
 'Open Very Carefully, A Book With Bite!'
 'The Most Wonderful Gift In The World' 'Elmer and Grandpa Eldo'
 'Tabby McTat' 'Harry and the Dinosaurs at the Museum']


In [122]:
characters[characters.book == 'Harry and the Dinosaurs at the Museum']

Unnamed: 0_level_0,book,name,gender,human,alias_count
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
40,Harry and the Dinosaurs at the Museum,Harry,M,H,0
367,Harry and the Dinosaurs at the Museum,Mum,F,H,0
368,Harry and the Dinosaurs at the Museum,Sam,F,H,0
369,Harry and the Dinosaurs at the Museum,Gran,F,H,0
370,Harry and the Dinosaurs at the Museum,T-Rex,M,NH,1
371,Harry and the Dinosaurs at the Museum,Museum Guard,M,H,0
372,Harry and the Dinosaurs at the Museum,Pterodactyl,M,NH,0
373,Harry and the Dinosaurs at the Museum,Triceratops,NGS,NH,0
374,Harry and the Dinosaurs at the Museum,Anchisaurus,NGS,NH,0
375,Harry and the Dinosaurs at the Museum,Apatosaurus,NGS,NH,0


In [118]:
# aliases[aliases.character_id==196]
aliases[aliases.character=='the Enormous Crocodile']

Unnamed: 0_level_0,alias,character,character_id,book
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1


In [124]:
speakers[speakers.name_recipient.isna()]

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
0,The Night Before Christmas,0,St. Nicholas,Reindeer,St. Nicholas,Unknown,"Now, Dasher! now, Dancer!\nnow, Prancer and Vi...","Now, Dasher! now, Dancer! now, Prancer and Vix...",183,St. Nicholas,M,H,1.0,,,,
1,The Night Before Christmas,1,St. Nicholas,Everyone,St. Nicholas,The Reader,"Happy\nChristmas to all, and to all a good\nni...","Happy Christmas to all, and to all a good night!",48,St. Nicholas,M,H,1.0,,,,
2,Sugarlump and the Unicorn,0,Sugarlump,himself,Sugarlump,Self,"""Here in the children's bedroom\nIs where I wa...",Here in the children's bedroom Is where I want...,106,Sugarlump,M,NH,0.0,,,,
3,Sugarlump and the Unicorn,1,Sugarlump,himself,Sugarlump,Self,"""Oh to be out in the big wide world!\nI wish I...",Oh to be out in the big wide world! I wish I c...,65,Sugarlump,M,NH,0.0,,,,
5,Sugarlump and the Unicorn,3,Sugarlump,himself,Sugarlump,Self,"""Here in the open countryside is where\nI like...",Here in the open countryside is where I like t...,104,Sugarlump,M,NH,0.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419,The Hedgehog's Balloon,0,Percy,himself,Percy,Self,“Two red ones. . . a blue one. . . there’s\na ...,Two red ones. . . a blue one. . . there’s a ye...,85,Percy,M,H,0.0,,,,
420,The Hedgehog's Balloon,1,Percy,himself,Percy,Self,"“I wonder where they’re\ncoming from,” he said...",I wonder where they’re coming from. Somebody m...,67,Percy,M,H,0.0,,,,
421,The Hedgehog's Balloon,2,Percy,himself,Percy,Self,"“Well, if nobody wants them,” he said,\n“I thi...","Well, if nobody wants them, I think I’ll help ...",53,Percy,M,H,0.0,,,,
422,The Hedgehog's Balloon,3,Percy,himself,Percy,Self,"“Someone’s crying,” said Percy.\n“Oh dear.”",Someone’s crying. Oh dear.,26,Percy,M,H,0.0,,,,


### Preliminary analysis:

In [1013]:
female_character_count = sum(characters.gender == 'F')
male_character_count = sum(characters.gender == 'M')
ngs_character_count = sum(characters.gender == 'NGS')

In [1043]:
ngs_character_count

558

In [1044]:
male_spoken_word_count = speakers[speakers.gender_speaker == 'M'].spoken_word_count.sum()
female_spoken_word_count = speakers[speakers.gender_speaker == 'F'].spoken_word_count.sum()
ngs_spoken_word_count = speakers[speakers.gender_speaker == 'NGS'].spoken_word_count.sum()

In [1045]:
male_spoken_word_count / male_character_count

52.18846153846154

In [1046]:
female_spoken_word_count / female_character_count

20.43558282208589

In [1047]:
ngs_spoken_word_count / ngs_character_count

12.39605734767025

In [1048]:
female_character_count / male_character_count

0.6269230769230769

In [1049]:
male_received_word_count = speakers[speakers.gender_recipient == 'M'].spoken_word_count.sum()
female_received_word_count = speakers[speakers.gender_recipient == 'F'].spoken_word_count.sum()
ngs_received_word_count = speakers[speakers.gender_recipient == 'NGS'].spoken_word_count.sum()

In [1050]:
male_received_word_count / male_character_count

44.03846153846154

In [1051]:
female_received_word_count / female_character_count

14.423312883435583

In [1052]:
ngs_received_word_count / ngs_character_count

10.24731182795699

In [1053]:
male_speech_sections = len(speakers[speakers.gender_speaker == 'M'])
female_speech_sections = len(speakers[speakers.gender_speaker == 'F'])
ngs_speech_sections = len(speakers[speakers.gender_speaker == 'NGS'])

In [1054]:
male_speech_sections / male_character_count

0.9211538461538461

In [1055]:
female_speech_sections / female_character_count

0.3588957055214724

In [1056]:
ngs_speech_sections / ngs_character_count

0.22939068100358423

In [1028]:
characters[characters.gender == 'F'].groupby('name').agg('count').sort_values('book', ascending=False).head(20)

Unnamed: 0_level_0,book,gender,human,alias_count
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mum,18,18,18,18
mum,10,10,10,10
Mummy,8,8,8,8
Sam,7,7,7,7
Granny,5,5,5,5
mother,5,5,5,5
Cinderella,5,5,5,5
Mary,5,5,5,5
girl,5,5,5,5
cow,5,5,5,5


### Validation

Proceeds as follows:

#### Notes:
- For speaker matching, inspect 'unknowns' and 'The Reader' and 'Self' separately: how many instances? Binary clasification metrics?

#### Running for several books to test outputs, save format etc:

In [554]:
client = OpenAI(api_key=key)

book_df = {
    'title': [],
    'speech_section_count': 0,
    'completion_tokens': [],
    'prompt_tokens': [],
    'total_tokens': [],
    'runtime_seconds': []
}
results_dict = {
    
}

for book_id in range(10):
    
    print("Book: ", df.iloc[book_id].Title)
    start = datetime.datetime.now()    
    
    data = {
        'full_text': df.iloc[book_id].Text,
        'sentences': dict(
            zip(
                sentences[sentences.book == df.iloc[book_id].Title].sentence_index,
                [span.text for span in sentences[sentences.book == df.iloc[book_id].Title].sentence]
            )
        )
    }
    prompt = get_prompt_string(data)
    
    completion = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": f"You are a data analysis assistant, capable of accurate and precise natural language processing. Output your response in JSON format using the following schema: {json_schema_str}. When reproducing text data please preserve newline characters and punctuation."},
        {"role": "user", "content": r"{}".format(prompt)}
      ],
     temperature=0.0,
     response_format={"type": "json_object"},
    )
    
    json_response = json.loads(completion.choices[0].message.content)
    
    book_df['completion_tokens'].append(completion.usage.completion_tokens)
    book_df['prompt_tokens'].append(completion.usage.prompt_tokens)
    book_df['total_tokens'].append(completion.usage.total_tokens)
    book_df['title'].append(df.iloc[book_id].Title)
    book_df['runtime_seconds'].append((datetime.datetime.now() - start).seconds)
    book_df['speech_section_count'] += len(json_response['speech_sections'])
    
    results_dict[df.iloc[book_id].Title] = json_response
    
book_df = pd.DataFrame(book_df)

Book:  The Night Before Christmas
Book:  Sugarlump and the Unicorn
Book:  The Gruffalo
Book:  The Monstrous Tale of Celery Crumble
Book:  Peace at Last
Book:  Sing A Song Of Bottoms
Book:  Barry The Fish With Fingers
Book:  The Troll
Book:  The Storm Whale In Winter
Book:  There's A Monster In Your Book


In [555]:
book_df

Unnamed: 0,title,speech_section_count,completion_tokens,prompt_tokens,total_tokens,runtime_seconds
0,The Night Before Christmas,154,234,1949,2183,3
1,Sugarlump and the Unicorn,154,1114,2295,3409,17
2,The Gruffalo,154,2821,3055,5876,37
3,The Monstrous Tale of Celery Crumble,154,802,2631,3433,13
4,Peace at Last,154,754,1855,2609,11
5,Sing A Song Of Bottoms,154,71,1427,1498,1
6,Barry The Fish With Fingers,154,595,1530,2125,9
7,The Troll,154,2891,4854,7745,52
8,The Storm Whale In Winter,154,274,1544,1818,4
9,There's A Monster In Your Book,154,254,1351,1605,4


In [557]:
results_dict.keys()

dict_keys(['The Night Before Christmas', 'Sugarlump and the Unicorn', 'The Gruffalo', 'The Monstrous Tale of Celery Crumble', 'Peace at Last', 'Sing A Song Of Bottoms', 'Barry The Fish With Fingers', 'The Troll', 'The Storm Whale In Winter', "There's A Monster In Your Book"])

In [419]:
results_dict['Sugarlump and the Unicorn']

{'speech_sections': [{'speaker': 'Sugarlump',
   'recipient': 'children',
   'speech_text': '"Here in the children\'s bedroom\nIs where I want to be.\nHappily rocking to and fro.\nThis is the life for me!"',
   'speech_section_id': 1},
  {'speaker': 'Sugarlump',
   'recipient': 'himself',
   'speech_text': '"Oh to be out in the big wide world!\nI wish I could trot," he said.',
   'speech_section_id': 2},
  {'speaker': 'unicorn',
   'recipient': 'Sugarlump',
   'speech_text': '"Done!" came a voice, and there stood a beast\nWith a twisty silver horn.\n"I can grant horses\' wishes," Said the snow-\nwhite unicorn.',
   'speech_section_id': 3},
  {'speaker': 'Sugarlump',
   'recipient': 'himself',
   'speech_text': '"Here in the open countryside is where\nI like to be.\nClippety-dop, clippety-dop, This is the\nlife for me!"',
   'speech_section_id': 4},
  {'speaker': 'Sugarlump',
   'recipient': 'himself',
   'speech_text': '"Oh to be free of this heavy load.\nI wish I could gallop!"',
   '

In [642]:
import pickle as pk 
with open('./results/gpt4o_results_dict.pk', 'wb') as outfile:
    pk.dump(results_dict, outfile)

In [643]:
book_df.to_csv('./results/gpt4o_book_summary.csv')

#### We now extend the conversation to pull out only the spoken words only:

In [644]:
new_json_schema = {
    "speaker": "string",
    "recipient": "string",
    "spoken_words_only": "string",
    "speech_section_id": "integer"
}
    
new_json_schema_str = ', '.join([f"'{key}': {value}" for key, value in new_json_schema.items()])

In [None]:
For the speech sections that you just found, which I reproduce below, please pull out the words that are spoken
        and add them as a new field in the JSON called spoken_words_only, replacing the speech_text field.
        So you will need to remove all non-speech words such as 'she said' etc.
        
        Please use provide your response in JSON.
        Please reproduce punctuation as it is written using regular double quotes "" for speech marks.

        Data: {previous_response}

In [678]:
def get_second_prompt_string(previous_response):
    
    return f"""
        For the speech sections that you just found, please pull out the words that are spoken
        and add them as a new field in the JSON called spoken_words_only, replacing the speech_text field.
        So you will need to remove all non-speech words such as 'she said' etc.
        
        Please use provide your response in JSON.
        Please reproduce punctuation as it is written using regular double quotes "" for speech marks.
    """

In [686]:
book_id = 21
df.iloc[book_id].Title

'Harry and the Dinosaurs Go Wild'

In [687]:
data = {
    'full_text': df.iloc[book_id].Text,
    'sentences': dict(
        zip(
            sentences[sentences.book == df.iloc[book_id].Title].sentence_index,
            [span.text for span in sentences[sentences.book == df.iloc[book_id].Title].sentence]
        )
    )
}

In [688]:
client = OpenAI(api_key=key)

completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {"role": "system", "content": f"You are a data analysis assistant, capable of accurate and precise natural language processing. Output your response in JSON format using the following schema: {json_schema_str}. When reproducing text data please preserve newline characters and punctuation. Please start all indexing of lists and arrays at 0 rather than 1."},
    {"role": "user", "content": r"{}".format(get_prompt_string(data))}
  ],
 temperature=0.0,
 response_format={"type": "json_object"},
)

In [689]:
completion.usage

CompletionUsage(completion_tokens=992, prompt_tokens=2418, total_tokens=3410)

In [690]:
second_completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {"role": "system", "content": f"You are a data analysis assistant, capable of accurate and precise natural language processing. Output your response in JSON format using the following schema: {json_schema_str}. When reproducing text data please preserve newline characters and punctuation. Please start all indexing of lists and arrays at 0 rather than 1."},
    {"role": "user", "content": r"{}".format(get_prompt_string(data))},
    {"role": "assistant", "content": completion.choices[0].message.content},
    {"role": "system", "content": f"Please use the following schema for your JSON response: {new_json_schema}"},
    {"role": "user", "content": r"{}".format(get_second_prompt_string(json.loads(completion.choices[0].message.content)))}
  ],
 temperature=0.0,
 response_format={"type": "json_object"},
)

In [691]:
second_completion.usage

CompletionUsage(completion_tokens=944, prompt_tokens=3557, total_tokens=4501)

In [693]:
json.loads(completion.choices[0].message.content)

{'speech_sections': [{'speaker': 'Hany',
   'recipient': 'Apatosaurus',
   'speech_text': '“That’s a\nrhinoceros,” said Hany.',
   'speech_section_id': 0},
  {'speaker': 'Harry',
   'recipient': 'Mum',
   'speech_text': '“I\nwant to save some animals,” he said.\n“What can I do, Mum?”',
   'speech_section_id': 1},
  {'speaker': 'Sam',
   'recipient': 'Harry',
   'speech_text': '“Tuh! What a waste of time!”',
   'speech_section_id': 2},
  {'speaker': 'Harry',
   'recipient': 'Pterodactyl',
   'speech_text': '“Wait till I’ve finished my blue whale,” said Harry.\n“Blue whales are bigger than trains, bigger than\ndinosaurs, bigger than thirty-two elephants!”',
   'speech_section_id': 3},
  {'speaker': 'Triceratops',
   'recipient': 'Stegosaurus',
   'speech_text': '“Army tanks don’t need saving!” said Triceratops.\n“Do a tree frog instead.”',
   'speech_section_id': 4},
  {'speaker': 'Nan',
   'recipient': 'Harry',
   'speech_text': '“Why not talk to Mr Bopsom?\nHe might put up a poster in 

In [692]:
json.loads(second_completion.choices[0].message.content)

{'speech_sections': [{'speaker': 'Hany',
   'recipient': 'Apatosaurus',
   'spoken_words_only': '“That’s a rhinoceros,”',
   'speech_section_id': 0},
  {'speaker': 'Harry',
   'recipient': 'Mum',
   'spoken_words_only': '“I want to save some animals,” “What can I do, Mum?”',
   'speech_section_id': 1},
  {'speaker': 'Sam',
   'recipient': 'Harry',
   'spoken_words_only': '“Tuh! What a waste of time!”',
   'speech_section_id': 2},
  {'speaker': 'Harry',
   'recipient': 'Pterodactyl',
   'spoken_words_only': '“Wait till I’ve finished my blue whale,” “Blue whales are bigger than trains, bigger than dinosaurs, bigger than thirty-two elephants!”',
   'speech_section_id': 3},
  {'speaker': 'Triceratops',
   'recipient': 'Stegosaurus',
   'spoken_words_only': '“Army tanks don’t need saving!” “Do a tree frog instead.”',
   'speech_section_id': 4},
  {'speaker': 'Nan',
   'recipient': 'Harry',
   'spoken_words_only': '“Why not talk to Mr Bopsom? He might put up a poster in his shop window! Then

## Now trialling format for manual validation:

We have already used the student manual coding for validate speech detection, so now we can just focus on detected speech...

1. Select book at random, select passage of detected speech at random. 
2. Show user the passage and some of the text either side of the passage
3. Ask is it speech? Is speaker correct? Is recipient correct? [Give option to view more text]
4. Save result.

#### Note: handle case when sentence is not found in text e.g. The Troll sentence 7 is split across two setences (7 and 8) due to bad pdfplumber output.

In [393]:
from IPython.display import display, Markdown
from random import randint

In [628]:
# selection = randint(0, 1)
# book_selection = 4  # Peace at Last
book_selection = 1  # Sugarlump

In [629]:
selected_book = list(results_dict.keys())[book_selection]
selected_book

'Sugarlump and the Unicorn'

In [630]:
# speech_sections = json.loads(completion.choices[0].message.content)['speech_sections']
speech_sections = results_dict[selected_book]['speech_sections']

In [631]:
def validate(v_vec):
    pass

In [632]:
def display_section(df, res, speech_section_result, padding=200):

    speech_section = speech_section_result['speech_text']
    book_text = df[df.Title == selected_book].iloc[0].Text
    this_text = book_text[0:res] + '**' + book_text[res:res+len(speech_section)] + '**' + book_text[res+len(speech_section):]
    this_text = this_text[max(res-padding-2, 0):min(res+len(speech_section)+padding+2, len(this_text))]
    display(Markdown(this_text.replace('\n', '<br>')))

    display(Markdown('**' + 'Result:' + '**'))
    display(speech_section_result)

In [639]:
# section_selection = randint(0, len(speech_sections))
section_selection = 0

In [640]:
res = df[df.Title == selected_book].iloc[0].Text.find(speech_sections[section_selection]['speech_text']) 
res

251

In [641]:
display_section(df, res, speech_sections[section_selection])

ht and blue. And when<br>she hears a horse's wish, She can<br>make that wish come tine.<br>Sugarlump was a rocking horse.<br>He belonged to a girl and boy.<br>To and fro, to and fro,<br>They rode on their favourite toy.<br>**"Here in the children's bedroom<br>Is where I want to be.<br>Happily rocking to and fro.<br>This is the life for me!"**<br><br>But when the children were out at school Sugarlump hung his head.<br>"Oh to be out in the big wide world!<br>I wish I could trot," he said.<br><br>"Done!" came a voice, and there stood a beast<br>With a twisty s

**Result:**

{'speaker': 'Sugarlump',
 'recipient': 'himself',
 'speech_text': '"Here in the children\'s bedroom\nIs where I want to be.\nHappily rocking to and fro.\nThis is the life for me!"',
 'speech_section_id': 1}

### Please indicate with '1' which are correct: [speech, speaker, recipient]

In [440]:
validation_vector = [1, 1, 1]
validate(validation_vector)