
## TODO: add 'and' handling to task 3 prompt and examples. Then postprocess by detecting 'ands'. Also handle comma separation with 3 or more characters. Handle split titles eg. Mr and Mrs Large, Tim and Betty Box etc
## TODO: add instruction to be case insensitive in task 3 e.g. mole -> Mole. Add example of case-insentive matching (with aliases also?)

## Add 'themsevles' example to prompt

## TODO: add fixed split legnth for different books (How grinch..3794)
## TODO: SAVE book_df and sentence_df

### Task decomposition and prompt development.

We provide here a framework for developing sequential prompts for a complex NLP task using GPT4o.

The complex tasks is decompsed into as sequence of simpler tasks that each build on the previous one.

For each task in the sequence we produce a number of examples to show GPT4o, and a number of further examples to use for automated testing of the response. This borrows ideas from unit testing of software, since iterative changes to the prompt may break functionality that was previously working.

This framework can be adapted to use with other LLMs and NLP tasks.

#### We decompose into the following tasks:
- Task 1: identify sections of direct speech, and the name of the speaker and recipient
- Task 1b: locate pre-defined sentences in these detected sections of speech (for comparison with human coding)
- Task 2: pull out the spoken words only from each section (removing e.g. 'he said' etc)
- Task 3: locate and replace the names of the speakers and recipients using a pre-defined character map  

#### Notes:
- Determinism is not guaranteed. But well structured prompts should produce near deterministic outputs, along with temperature=0, fixed seed. It is also worth storing the system finerprint for future reference, as changes to this may be the cause of differing results in the future.  
- The lack of determinism can make tests quite brittle. It is worth repeating tests several times to confirm their behvaiour. And then running the full manual validation on a single static result set.
- Where possible, the sequential tasks should b tackled as a new completion API, using formatted output from the previous task as the inupt. This is preferrable to chaining of prompts and outputs to produce a chat style conversation, but this increases the risk of conflict or confusion between prompts/instructions sets. And also increases the length of the context window.
- Need to ensure consistency between instructions, schemas and examples. Otherwise results may be inconsistent e.g. 'reproduce all punctuation all it appears' conflicted with 'remove  speech marks' example.
- Should typos be accounted for (e.g. task_3 name matching?)
- Cost: \\$1.22 left after developing prompts. Added \\$10 to run for 50 books (so ~0.25 full dataset). 

### Future work:
- handle unnamed characters that are still gendered and speaf (e.g. a girl who was older -> older girl: Enormous Croc)
- Improve current mapping: speaking to 'each other' -> The Reader 
- Refactor run_task and run_test methods and move them to separate files
- Use SpaCy doc/spans to hanlde speech sections, which knowledge base of characters and aliases, and tags for who is speaking etc.
- Then can use the spans to handle splitting books with some overlap to avoid splitting speech/conversations resulting in 'unknown' speaker/recipient. 
- Add further analysis of speech sections: what is being said - commands, questions, statements, sentiment, who is being spoken about etc.
- Better handling of cases where the speech section get split in later tasks (or ideally prevent this from happening).
- Add handling for difficult ways of referring to characters by their role or identity rather than a name or alias e.g. 'his siblings' or 'the postman'
- Add logic or mapping for same character across multiple books

#### Automating this using the Chat GPT API:

## TODO:
 
 - When in pipeline to spellcheck/ correct typos? (e.g. Hany in the Dinosaurs (book 21).
 - what to do about inconsitent sentence detection? e.g "Now Dasher!" being at the end of sentence 7 was causing GPT confusion...
 - add an instruction about how to refer to 'general audience' or 'narrator' or 'I'
 - provide example of input and what the output should look like (within the prompt)
 - should temp be close to 0 (but not exactly 0)?
 - ask for output of reaosning/thought process?
 - ask for a confidence score?
 - do we need to specify (in system prompt), not to use MD or any other formatting in the json output?
 
## Note: ideas to explore if we need performance boost...

- system message to edit assistant role
- vary temperature or top_p parameter
- fine_tuning a model with bespoke training data (how much is necessary?)
- improved instructions or prompt engineering (see e.g. paper on iterative prompting)
- compare results with gpt-3.5-turbo? - does not seem to work well for our use case!

In [1]:
import os
import json
import pdfplumber
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import string
import spacy
from spacy import displacy
from spacy.lang.en.examples import sentences 
from openai import OpenAI
import pickle

%matplotlib inline

In [2]:
nlp = spacy.load("en_core_web_lg")

In [3]:
with open('./key.txt', 'r') as infile:
    api_key = infile.read().splitlines()[0]

In [4]:
with open('data/tempdf.pickle', 'rb') as outfile:
    df = pickle.load(outfile)

In [5]:
# Import our prompts, example data (for in-context learning), and test data for unit testing each task: 
from openai_api.examples import example_data
from openai_api.tests import build_test_data

test_data = build_test_data(df)

#### Converting the full dataset into a dataframe of sentences

In [6]:
from openai_api.utilities import spacy_extract_sentences
sentences = spacy_extract_sentences(df, nlp)

#### Check that this sample contains the same sentences that were manually coded previously.

In [7]:
coding_sample = sentences.sample(frac=0.15, axis=0, random_state=42)
manually_coded = pd.read_csv('./sentences_for_coding/sample_15pc.csv', delimiter='\t', index_col=0)

In [8]:
text_equal = [
    i == j.text
    for i,j in
    zip(manually_coded.sentence, coding_sample.sentence)
]    
assert sum(text_equal) == len(text_equal)

##### Build OpenAI API code...

In [9]:
from openai_api.schemas import build_task_1_input_schema, build_task_1_response_schema
from openai_api.utilities import *

task_1_response_schema_str = build_task_1_response_schema()
task_1_input_schema_str = build_task_1_input_schema()

In [10]:
from openai_api.tests import run_task_1_test_i

In [11]:
from openai_api.prompts import (
    get_task_1_prompt_string, get_task_1_system_prompt,
    get_task_2_prompt_string, get_task_2_system_prompt,
    get_task_3_prompt_string, get_task_3_system_prompt
)

In [17]:
def run_task_1(full_text, client, seed=42):
    
    prompt_string = get_task_1_prompt_string(
            data={'full_text': full_text}, 
        )
        
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": get_task_1_system_prompt()},
            {"role": "user", "content": r"{}".format(prompt_string)}
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
        seed=seed
    )
    
    return completion    

In [18]:
def run_task_1_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")
        
        completion = run_task_1(test_data['strings'][test_id], client=client)
        
        if run_task_1_test_i(test_id, test_data, completion, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [14]:
success_count = 0
for i in range(20):
    print(f"Running repeat {i}")
    
    success = run_task_1_tests(test_data)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 5
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 6
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 7
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 8
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 9
Running test: 0
Test 0: pass


KeyboardInterrupt: 

### Task 1b: recognising pre-defined sentences.

In [26]:
# TODO: for comparison with student speech flags.

### Task 2: pulling out spoken words only.

#### TODO:
- refactor run_test_i method
- move schemas and test and example data to files
- rename as tak 2 or rename functions and strings!

In [19]:
def run_task_2(full_text, client, task_1_completion=None, seed=42):
    
    if task_1_completion is None:
        task_1_completion = run_task_1(full_text, client)
        
    task_1_prompt_string = get_task_1_prompt_string(
        data={'full_text': full_text}
    )
    
    task_2_prompt_string = get_task_2_prompt_string(
        task_1_response=json.loads(task_1_completion.choices[0].message.content)
    )
    
    task_2_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
           {
               "role": "system", 
               "content": get_task_2_system_prompt()
           },
           {
               "role": "user", 
               "content": r"{}".format(task_2_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion

In [20]:
client = OpenAI(api_key=api_key)

In [21]:
test_id = 1

In [22]:
completion_1, completion_2 = run_task_2(
    full_text=test_data['strings'][test_id],
    client=client
)

In [23]:
completion_2.usage

CompletionUsage(completion_tokens=156, prompt_tokens=684, total_tokens=840)

In [24]:
print(completion_2.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "spoken_words_only": "That’s a rhinoceros. Triceratops has got more horns.",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "spoken_words_only": "I want to save some animals. What can I do, Mum?",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "spoken_words_only": "Tuh! What a waste of time!",
      "speech_section_id": 2
    }
  ]
}


In [25]:
def run_task_2_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_2_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print(f"Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            pass_flag = False
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'spoken_words_only': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print(f"Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag

In [26]:
def run_task_2_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, completion_2 = run_task_2(
            full_text=test_data['strings'][test_id], 
            client=client
        )
           
        if run_task_2_test_i(test_id, test_data, completion_2, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [25]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_2_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Task 3: mappnig character names.

In [27]:
import sqlite3

In [28]:
conn = sqlite3.connect('character_database.db')

In [29]:
aliases = pd.read_sql('select * from aliases', conn, index_col='index')
characters = pd.read_sql('select * from characters', conn, index_col='index')

In [200]:
meta_character_list = [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator',
    'Reindeer', 'Dinosaurs', 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]

In [178]:
def run_task_3(
    full_text, client, characters, aliases, 
    task_2_completion=None, 
    task_1_completion=None, 
    _meta_character_list=meta_character_list,
    seed=42
):
    
    if task_2_completion is None:
        task_1_completion, task_2_completion = run_task_2(full_text, client)
        
    
    task_3_prompt_string = get_task_3_prompt_string(
        task_2_response=json.loads(task_2_completion.choices[0].message.content),
        characters=characters,
        aliases=aliases,
        meta_characters=_meta_character_list
    )
    
    task_3_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
           {
               "role": "system", 
               "content": get_task_3_system_prompt()
           },
           {
               "role": "user", 
               "content": r"{}".format(task_3_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion, task_3_completion

In [32]:
client = OpenAI(api_key=api_key)

In [33]:
test_id = 1

In [34]:
completion_1, completion_2, completion_3 = run_task_3(
    full_text=test_data['strings'][test_id],
    client=client,
    characters=test_data['task_3_characters'][test_id],
    aliases=test_data['task_3_aliases'][test_id]
)

In [35]:
completion_3.usage

CompletionUsage(completion_tokens=155, prompt_tokens=937, total_tokens=1092)

In [36]:
print(completion_3.choices[0].message.content)

{
    "speech_sections": [
        {
            "speaker": "Hany",
            "recipient": "Apatosaurus",
            "speaker_matched": "Unknown",
            "recipient_matched": "Apatosaurus",
            "speech_section_id": 0
        },
        {
            "speaker": "Harry",
            "recipient": "Mum",
            "speaker_matched": "Harry",
            "recipient_matched": "Mum",
            "speech_section_id": 1
        },
        {
            "speaker": "Sam",
            "recipient": "Harry",
            "speaker_matched": "Mum",
            "recipient_matched": "Harry",
            "speech_section_id": 2
        }
    ]
}


In [37]:
def run_task_3_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_3_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print(f"Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            pass_flag = False
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'speaker_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
        'recipient_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
#         'spoken_words_only': {
#             'case_sensitive': True,
#             'remove_leading_the': False
#         },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print(f"Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag

In [38]:
def run_task_3_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, _, completion_3 = run_task_3(
            full_text=test_data['strings'][test_id], 
            client=client,
            characters=test_data['task_3_characters'][test_id],
            aliases=test_data['task_3_aliases'][test_id]
        )
           
        if run_task_3_test_i(test_id, test_data, completion_3, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [36]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_3_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Running for corpus

Now that our prompts are passing all tests, we run the method for all books in the corpus and save the results to disk....

# TODO:
- add a 'self' match example to data (still usig himself)
- check Noi and 'his dad' - shouldn't it be Dad? (The Storm Whale In Winter)
- add Narrator handling/example (e.g. There's A Monster In Your Book)
- add a flag for if it is a character match or something else ('Everyone' Narrator' etc!)

In [617]:
import datetime

In [618]:
# Note: that this only handles two chunks currently. And is not an optimal way of splitting since sections of speech may be separated, for example.
# max_chunk_size = 9677
multi_chunk_books = {
    'The Enormous Crocodile': 9677,
    'How The Grinch Stole Christmas': 3794, 
    'Farmer Duck': 1160
}

In [619]:
# df['book_length'] = np.array([len(t) for t in df.Text])

In [620]:
# These are not story books and are also the longest so would possibly need splitting.
remove_non_stories = [
    'All Year Round', 'All About Feelings', 'Ten in the Bed and Other Counting Rhymes', 'Why Am I An Insect'
]

In [621]:
# df.sort_values(by='book_length', ascending=False).head(15)

In [622]:
def process_chunk(
    title, chunk_name, chunk_text, client, book_df, 
    c1_results_dict, c2_results_dict, c3_results_dict
):
    
    print("Book: ", title)
    start = datetime.datetime.now()    
    
    completion_1, completion_2, completion_3 = run_task_3(
        full_text=chunk_text, 
        client=client,
        characters=characters[characters.book==title],
        aliases=aliases[aliases.book==title],
        _meta_character_list=meta_character_list
    )
    
    json_response = json.loads(completion_3.choices[0].message.content)
    
    book_df['c1_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c1_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c1_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c1_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c2_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c2_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c2_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c2_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c3_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c3_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c3_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c3_system_fingerprint'].append(completion_1.system_fingerprint)
    
    book_df['title'].append(chunk_name)
    book_df['runtime_seconds'].append((datetime.datetime.now() - start).seconds)
    book_df['speech_section_count'].append(len(json_response['speech_sections']))
    
    c3_results_dict[chunk_name] = json_response
    
    c1_results_dict[chunk_name] = json.loads(completion_1.choices[0].message.content)
    c2_results_dict[chunk_name] = json.loads(completion_2.choices[0].message.content)

In [623]:
client = OpenAI(api_key=api_key)

book_df = {
    'title': [],
    'speech_section_count': [],
    'c1_completion_tokens': [],
    'c1_prompt_tokens': [],
    'c1_total_tokens': [],
    'c1_system_fingerprint': [],
    'c2_completion_tokens': [],
    'c2_prompt_tokens': [],
    'c2_total_tokens': [],
    'c2_system_fingerprint': [],
    'c3_completion_tokens': [],
    'c3_prompt_tokens': [],
    'c3_total_tokens': [],
    'c3_system_fingerprint': [],
    'runtime_seconds': []
}
c1_results_dict = {}
c2_results_dict = {}
c3_results_dict = {}

for book_id in df.index:
    
    title = df.iloc[book_id].Title
    
    if title not in remove_non_stories:
        book_text = df.iloc[book_id].Text
#         if len(book_text) > max_chunk_size:
    
        if title in multi_chunk_books.keys():
            max_chunk_size = multi_chunk_books[title]
            last_newline = book_text[0:max_chunk_size].rfind('\n')
            chunks = {
                ''.join(['_chunk_a_', title]): book_text[0:max_chunk_size],
                ''.join(['_chunk_b_', title]): book_text[max_chunk_size:]
            }
            for chunk in chunks.keys():
                process_chunk(
                    title=title, 
                    chunk_name=chunk, 
                    chunk_text=chunks[chunk], 
                    client=client, 
                    book_df=book_df, 
                    c1_results_dict=c1_results_dict, 
                    c2_results_dict=c2_results_dict, 
                    c3_results_dict=c3_results_dict
                )

        else:
            process_chunk(
                title=title, 
                chunk_name=title, 
                chunk_text=book_text, 
                client=client, 
                book_df=book_df, 
                c1_results_dict=c1_results_dict, 
                c2_results_dict=c2_results_dict, 
                c3_results_dict=c3_results_dict
            )

book_df = pd.DataFrame(book_df)

Book:  The Night Before Christmas
Book:  Sugarlump and the Unicorn


KeyboardInterrupt: 

In [630]:
len(characters.name.unique())

878

In [220]:
# book_df = pd.DataFrame(book_df)

In [231]:
boof_df_complete = book_df.copy()
boof_df_complete.to_json('data/gpt4_output_summary_corpus.json')

In [260]:
# with open('data/gpt4_output_summary_corpus.json', 'r') as outfile:
#     temp_bdf = pd.read_json(outfile)

In [262]:
# book_df = pd.concat([temp_bdf, book_df])

In [615]:
# df[df.Title =='Farmer Duck'].iloc[0].Text.find('just before dawn')

1160

In [227]:
# df[df.Title =='How The Grinch Stole Christmas'].iloc[0].Text.find('Then he slunk')

3794

In [263]:
# Convert results dicts to dataframe of all speech sections and save to disk:

all_speech_sections = {
    'book': [],
    'speech_section_id': [], # speech section id within book
    'speaker': [],
    'recipient': [],
    'speaker_matched': [],
    'recipient_matched': [],
    'speech_text': [],
    'spoken_words_only': [],
    'spoken_word_count': []
}

for book in c3_results_dict.keys():
    print(book)
    for si, section in enumerate(c3_results_dict[book]['speech_sections']):
        all_speech_sections['book'].append(book)
        for key in all_speech_sections.keys():
            if key not in ['book', 'spoken_word_count', 'speech_text', 'spoken_words_only']:
                all_speech_sections[key].append(section[key])
        
        # Handling edge cases where a speach section is split:
        si_corrected = si
        if si >= len(c1_results_dict[book]['speech_sections']):
            si_corrected = len(c1_results_dict[book]['speech_sections']) - 1
        all_speech_sections['speech_text'].append(c1_results_dict[book]['speech_sections'][si_corrected]['speech_text'])
        
        if si >= len(c2_results_dict[book]['speech_sections']):
            si_corrected = len(c2_results_dict[book]['speech_sections']) - 1
        all_speech_sections['spoken_words_only'].append(c2_results_dict[book]['speech_sections'][si_corrected]['spoken_words_only'])
        all_speech_sections['spoken_word_count'].append(len(c2_results_dict[book]['speech_sections'][si_corrected]['spoken_words_only']))

Supertato
Who Will Sing My Puff-a-bye
Can't You Sleep Little Bear
The Duchess and Guy
Grandad's Island
Harry and the Robots
You Can't Take An Elephant On The Bus
Where's My Cuddle
Jesus' Christmas Party
Princess Mirror-Belle And The Dragon Pox
We're Going on a Bear Hunt
The Ugly Duckling
Room on the Broom
Harry and the Bucketful of Dinosaurs
Grandma Bird
We're Going On A Lion Hunt
The Wheels on the Bus
The Cave
Where's My Teddy
The Singing Mermaid
What The Ladybird Heard Next
The Pirates Next Door
Captain Duck
The Little Bully
All In One Piece
The Gruffalo's Child
Elephant Learns to Share
One Snowy Night
Each Peach Pear Plum
The Squirrels Who Squabbled
Knock Knock Pirate
Who Will Save Us
Jack Breaks The Beanstalk
The Snowiest Christmas Ever
Jack and the Beanstalk
Super Sid The Silly Sausage Dog
The Christmas Extravaganza Hotel
Cinder the Bubble-Blowing Dragon
The Animal Boogie
The Truth According to Arthur
Monkey Puzzle
Little Monkey


In [264]:
all_speech_sections = pd.DataFrame(all_speech_sections)

In [230]:
all_speech_sections.to_json('data/gpt4_all_speech_sections_corpus.json')

In [265]:
# with open('data/gpt4_all_speech_sections_corpus.json', 'r') as outfile:
#     temp_ass = pd.read_json(outfile)

In [267]:
# all_speech_sections = pd.concat([temp_ass, all_speech_sections])

##### We replace any chunked book titles with the original, adding columns to retain chunk information in case needed later.

In [268]:
all_speech_sections['chunk_titles'] = all_speech_sections['book']
all_speech_sections['book'] = [
    title
    if '_chunk_' not in title
    else title.split('_')[3]
    for title in all_speech_sections.book]

##### We add the metacharacter data to the character table:

In [None]:
meta_character_list = [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator',
    'Reindeer', 'Dinosaurs', 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]

In [269]:
meta_character_data = pd.DataFrame({
    'book': ['all' for c in meta_character_list],
    'name': [c for c in meta_character_list],
    'gender': ['NGS' for c in meta_character_list],
    'human': ['NH' if c in ['Dinosaurs', 'Reindeer', 'Elmer and Grandpa Eldo'] else 'H' for c in meta_character_list],
    'alias_count': [0 for c in meta_character_list]
})

In [270]:
characters = pd.concat([
    characters, meta_character_data
])

In [271]:
# Replace 'self' with character name for ease of analysis (and flag self)
all_speech_sections['self_talk_flag'] = [
    True if r == 'Self'
    else False 
    for r in all_speech_sections.recipient_matched
]
all_speech_sections['recipient_matched'] = [
    s if r == 'Self'
    else r 
    for s,r in zip(all_speech_sections.speaker_matched, all_speech_sections.recipient_matched)
]

In [284]:
speakers = all_speech_sections.merge(characters, how='left', left_on=['speaker_matched', 'book'], right_on=['name', 'book'])

In [285]:
speakers = speakers.merge(characters, how='left', left_on=['recipient_matched', 'book'], right_on=['name', 'book'], suffixes=['_speaker', '_recipient'])

In [286]:
# We fill in the mssing information for the metacharacters
# for c in ['People', 'Everyone', 'Reader', 'The Reader']:
for c in [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator',
    'Reindeer', 'Dinosaurs', 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]:
  
    speakers['name_speaker'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_speaker, speakers.speaker_matched)
    ]
    speakers['gender_speaker'] = [
        characters[characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_speaker, speakers.speaker_matched)
    ]
    speakers['human_speaker'] = [
        characters[characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_speaker, speakers.speaker_matched)
    ]
    speakers['alias_count_speaker'] = [
        characters[characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_speaker, speakers.speaker_matched)
    ]
    
    speakers['name_recipient'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_recipient, speakers.recipient_matched)
    ]
    speakers['gender_recipient'] = [
        characters[characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_recipient, speakers.recipient_matched)
    ]
    speakers['human_recipient'] = [
        characters[characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_recipient, speakers.recipient_matched)
    ]
    speakers['alias_count_recipient'] = [
        characters[characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_recipient, speakers.recipient_matched)
    ]

In [287]:
speakers.head()

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,chunk_titles,self_talk_flag,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
0,The Night Before Christmas,0,St. Nicholas,Reindeer,St. Nicholas,Reindeer,"""Now, Dasher! now, Dancer!\nnow, Prancer and V...","Now, Dasher! now, Dancer! now, Prancer and Vix...",183,The Night Before Christmas,False,St. Nicholas,M,H,1.0,Reindeer,NGS,NH,0.0
1,The Night Before Christmas,1,St. Nicholas,Everyone,St. Nicholas,Everyone,"""Happy\nChristmas to all, and to all a good\nn...","Happy Christmas to all, and to all a good night!",48,The Night Before Christmas,False,St. Nicholas,M,H,1.0,Everyone,NGS,H,0.0
2,Sugarlump and the Unicorn,0,Sugarlump,himself,Sugarlump,Sugarlump,"""Here in the children's bedroom\nIs where I wa...",Here in the children's bedroom Is where I want...,106,Sugarlump and the Unicorn,True,Sugarlump,M,NH,0.0,Sugarlump,M,NH,0.0
3,Sugarlump and the Unicorn,1,Sugarlump,himself,Sugarlump,Sugarlump,"""Oh to be out in the big wide world!\nI wish I...",Oh to be out in the big wide world! I wish I c...,56,Sugarlump and the Unicorn,True,Sugarlump,M,NH,0.0,Sugarlump,M,NH,0.0
4,Sugarlump and the Unicorn,2,unicorn,Sugarlump,unicorn,Sugarlump,"""Done!"" came a voice, and there stood a beast\...",Done! I can grant horses' wishes.,33,Sugarlump and the Unicorn,False,unicorn,F,NH,0.0,Sugarlump,M,NH,0.0


In [288]:
len(speakers)

2930

In [289]:
len(speakers[speakers.name_speaker.isna()])

180

In [290]:
len(speakers[speakers.name_speaker.isna()]) / len(speakers)

0.06143344709897611

In [291]:
len(speakers[speakers.name_recipient.isna()])

364

In [292]:
len(speakers[speakers.name_recipient.isna()]) / len(speakers)

0.1242320819112628

In [293]:
print(len(speakers[speakers.name_recipient.isna()].book.unique()))
print(speakers[speakers.name_recipient.isna()].book.unique())

83
['The Troll' 'The Princess and the Wizard' "Kipper's Toybox"
 'Elmer and the Lost Teddy' 'Santa is Coming to Devon'
 'The Enormous Crocodile' 'Harry and the Dinosaurs Go Wild' 'Tabby McTat'
 'Harry and the Dinosaurs at the Museum' 'The Cross Rabbit'
 'A Thing Called Snow' 'The Way Home For Wolf' 'Elmer and Wilbur'
 'I Need A Wee' "The Owl's Lesson" 'Cave Baby' 'The Rescue Party'
 'Elmer and the Stranger' 'Zog' 'Lost in Snow' 'Monkey Needs to Listen'
 'Dogger' 'Is It Betime Wibbly Pig' 'Wide-awake Hedgehog'
 'Tyrannosaurus Drip' 'Sharing A Shell' 'The Enormous Turnip'
 'Sir Charlie Stinky Socks and the Really Big Adventure' "Ruby's Worry"
 'I Am Amelia Earhart' 'She Rex' 'Mole Hill' 'Oi Dog!'
 'The Gingerbread Man' "The Lighthouse Keeper's Lunch"
 "Dogs Don't Do Ballet" 'Hippo Owns Up' 'Giraffe is Left Out'
 "The Tree That's Meant To Be" 'The Dinosaur That Pooped Christmas'
 'The Book With No Pictures' 'The First Christmas' 'Owl Babies'
 "The Jolly Postman or Other People's Letters" 

In [294]:
print(len(speakers[speakers.name_speaker.isna()].book.unique()))
print(speakers[speakers.name_speaker.isna()].book.unique())

73
['Sing A Song Of Bottoms' 'The Troll' "Kipper's Toybox"
 'Elmer and the Lost Teddy' 'Santa is Coming to Devon'
 'The Enormous Crocodile' 'Harry and the Dinosaurs Go Wild'
 'Harry and the Dinosaurs at the Museum' 'The Cross Rabbit'
 'A Thing Called Snow' 'Yoga Babies' 'Elmer and Wilbur' 'I Need A Wee'
 "The Owl's Lesson" 'Cave Baby' 'The Rescue Party'
 'Elmer and the Stranger' 'Zog' 'Lost in Snow' 'Dogger'
 'Is It Betime Wibbly Pig' 'Wide-awake Hedgehog' 'Tyrannosaurus Drip'
 'Sharing A Shell' 'The Rhyming Rabbit' 'I Am Amelia Earhart' 'She Rex'
 'Mole Hill' 'Oi Dog!' "The Lighthouse Keeper's Lunch" 'Hippo Owns Up'
 'Giraffe is Left Out' "The Tree That's Meant To Be"
 'The Dinosaur That Pooped Christmas' "Lion's in a Flap" 'Owl Babies'
 "The Jolly Postman or Other People's Letters" 'What The Ladybird Heard'
 "Ravi's Roar" 'The Runaway Pea' 'Zog and the Flying Doctors'
 'The Three Little Pigs' 'Lenny Makes A Wish' 'Knock Knock Alien'
 'Boogie Bear' 'The Bad-Tempered Ladybird' 'Snow Wh

In [541]:
# null_speaker_books = iter(speakers[speakers.name_speaker.isna()].book.unique())
# null_speaker_books = iter(speakers[speakers.name_recipient.isna()].book.unique())
null_speaker_books = iter(
    set(speakers[speakers.name_recipient.isna()].book.unique()) - set(speakers[speakers.name_speaker.isna()].book.unique())
)

In [539]:
print(list(null_speaker_books))

['Monkey Needs to Listen', 'Harry and the Bucketful of Dinosaurs', 'The Way Home For Wolf', 'Who Will Sing My Puff-a-bye', 'The First Christmas', 'Elmer and the Wind', 'Santa to the Rescue', 'The Christmas Story', 'Harry and the Dinosaurs go to School', 'Goldilocks and the Three Bears', 'The Enormous Turnip', 'Sir Charlie Stinky Socks and the Really Big Adventure', "Eleanor Won't Share", 'The Tiger Who Came To Tea', 'The Princess and the Wizard', 'Tabby McTat', 'All For One', 'Captain Duck', 'Jack and the Beanstalk', "Dogs Don't Do Ballet", "Charlie Cook's Favourite Book", "Ruby's Worry", 'The Book With No Pictures', 'The Gingerbread Man']


In [614]:
current_book = next(null_speaker_books)
print(current_book)

StopIteration: 

In [631]:
# speakers[(speakers.book == current_book) * (speakers.name_recipient.isna())]
speakers[(speakers.book == 'Sing A Song Of Bottoms') * (speakers.name_speaker.isna())]

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,chunk_titles,self_talk_flag,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
83,Sing A Song Of Bottoms,0,The judge,Audience,Unknown,The Reader,“Someone’s sitting on a\nwinner…”\nCongratulat...,Someone’s sitting on a winner… Congratulations...,72,Sing A Song Of Bottoms,False,,,,,The Reader,NGS,H,0.0


In [633]:
# characters[characters.book == current_book]
characters[characters.book == 'Sing A Song Of Bottoms']

Unnamed: 0,book,name,gender,human,alias_count
1053,Sing A Song Of Bottoms,dogs,NGS,NH,0
1054,Sing A Song Of Bottoms,bears,NGS,NH,0
1055,Sing A Song Of Bottoms,monkeys,NGS,NH,0
1056,Sing A Song Of Bottoms,mice,NGS,NH,0
1057,Sing A Song Of Bottoms,whales,NGS,NH,0
1058,Sing A Song Of Bottoms,rabbits,NGS,NH,0
1059,Sing A Song Of Bottoms,camels,NGS,NH,0
1060,Sing A Song Of Bottoms,kangeroos,NGS,NH,0
1061,Sing A Song Of Bottoms,elephants,NGS,NH,0
1062,Sing A Song Of Bottoms,peacocks,NGS,NH,0


In [609]:
compound_characters = [
    ('Ben Buckle and Percy Patch', 'The Troll'), 
]

In [367]:
# aliases[aliases.character_id==196]
aliases[aliases.character==current_book]

Unnamed: 0_level_0,alias,character,character_id,book
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1


In [368]:
aliases

Unnamed: 0_level_0,alias,character,character_id,book
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Gruff the Grump,Mr Bear,5,Gruff the Grump
1,We,I,16,The Polar Express
2,little dove,her baby,24,The Christmas Story
3,McTat,Tabby McTat,51,Tabby McTat
4,we're,we,67,We're Going On A Lion Hunt
...,...,...,...,...
164,everyone,other children,1383,Eleanor Won't Share
165,mother,mother duck,1434,The Ugly Duckling
166,the swans,white birds,1436,The Ugly Duckling
167,Granny,grandmother,1449,Little Red Riding Hood


In [299]:
speakers[speakers.recipient == 'Mummies and Daddies']

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,chunk_titles,self_talk_flag,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
2231,Bottoms Up!,0,Babies and Toddlers,Mummies and Daddies,Unknown,Mum and Dad,"Mummies and Daddies,\nwhy are we unhappy?\nWe’...","Mummies and Daddies, why are we unhappy? We’re...",1380,Bottoms Up!,False,,,,,Mum and Dad,NGS,H,0.0


### Preliminary analysis:

In [1013]:
female_character_count = sum(characters.gender == 'F')
male_character_count = sum(characters.gender == 'M')
ngs_character_count = sum(characters.gender == 'NGS')

In [1043]:
ngs_character_count

558

In [1044]:
male_spoken_word_count = speakers[speakers.gender_speaker == 'M'].spoken_word_count.sum()
female_spoken_word_count = speakers[speakers.gender_speaker == 'F'].spoken_word_count.sum()
ngs_spoken_word_count = speakers[speakers.gender_speaker == 'NGS'].spoken_word_count.sum()

In [1045]:
male_spoken_word_count / male_character_count

52.18846153846154

In [1046]:
female_spoken_word_count / female_character_count

20.43558282208589

In [1047]:
ngs_spoken_word_count / ngs_character_count

12.39605734767025

In [1048]:
female_character_count / male_character_count

0.6269230769230769

In [1049]:
male_received_word_count = speakers[speakers.gender_recipient == 'M'].spoken_word_count.sum()
female_received_word_count = speakers[speakers.gender_recipient == 'F'].spoken_word_count.sum()
ngs_received_word_count = speakers[speakers.gender_recipient == 'NGS'].spoken_word_count.sum()

In [1050]:
male_received_word_count / male_character_count

44.03846153846154

In [1051]:
female_received_word_count / female_character_count

14.423312883435583

In [1052]:
ngs_received_word_count / ngs_character_count

10.24731182795699

In [1053]:
male_speech_sections = len(speakers[speakers.gender_speaker == 'M'])
female_speech_sections = len(speakers[speakers.gender_speaker == 'F'])
ngs_speech_sections = len(speakers[speakers.gender_speaker == 'NGS'])

In [1054]:
male_speech_sections / male_character_count

0.9211538461538461

In [1055]:
female_speech_sections / female_character_count

0.3588957055214724

In [1056]:
ngs_speech_sections / ngs_character_count

0.22939068100358423

In [1028]:
characters[characters.gender == 'F'].groupby('name').agg('count').sort_values('book', ascending=False).head(20)

Unnamed: 0_level_0,book,gender,human,alias_count
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mum,18,18,18,18
mum,10,10,10,10
Mummy,8,8,8,8
Sam,7,7,7,7
Granny,5,5,5,5
mother,5,5,5,5
Cinderella,5,5,5,5
Mary,5,5,5,5
girl,5,5,5,5
cow,5,5,5,5


### Validation

Proceeds as follows:

#### Notes:
- For speaker matching, inspect 'unknowns' and 'The Reader' and 'Self' separately: how many instances? Binary clasification metrics?

#### Running for several books to test outputs, save format etc:

In [554]:
client = OpenAI(api_key=key)

book_df = {
    'title': [],
    'speech_section_count': 0,
    'completion_tokens': [],
    'prompt_tokens': [],
    'total_tokens': [],
    'runtime_seconds': []
}
results_dict = {
    
}

for book_id in range(10):
    
    print("Book: ", df.iloc[book_id].Title)
    start = datetime.datetime.now()    
    
    data = {
        'full_text': df.iloc[book_id].Text,
        'sentences': dict(
            zip(
                sentences[sentences.book == df.iloc[book_id].Title].sentence_index,
                [span.text for span in sentences[sentences.book == df.iloc[book_id].Title].sentence]
            )
        )
    }
    prompt = get_prompt_string(data)
    
    completion = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": f"You are a data analysis assistant, capable of accurate and precise natural language processing. Output your response in JSON format using the following schema: {json_schema_str}. When reproducing text data please preserve newline characters and punctuation."},
        {"role": "user", "content": r"{}".format(prompt)}
      ],
     temperature=0.0,
     response_format={"type": "json_object"},
    )
    
    json_response = json.loads(completion.choices[0].message.content)
    
    book_df['completion_tokens'].append(completion.usage.completion_tokens)
    book_df['prompt_tokens'].append(completion.usage.prompt_tokens)
    book_df['total_tokens'].append(completion.usage.total_tokens)
    book_df['title'].append(df.iloc[book_id].Title)
    book_df['runtime_seconds'].append((datetime.datetime.now() - start).seconds)
    book_df['speech_section_count'] += len(json_response['speech_sections'])
    
    results_dict[df.iloc[book_id].Title] = json_response
    
book_df = pd.DataFrame(book_df)

Book:  The Night Before Christmas
Book:  Sugarlump and the Unicorn
Book:  The Gruffalo
Book:  The Monstrous Tale of Celery Crumble
Book:  Peace at Last
Book:  Sing A Song Of Bottoms
Book:  Barry The Fish With Fingers
Book:  The Troll
Book:  The Storm Whale In Winter
Book:  There's A Monster In Your Book


In [555]:
book_df

Unnamed: 0,title,speech_section_count,completion_tokens,prompt_tokens,total_tokens,runtime_seconds
0,The Night Before Christmas,154,234,1949,2183,3
1,Sugarlump and the Unicorn,154,1114,2295,3409,17
2,The Gruffalo,154,2821,3055,5876,37
3,The Monstrous Tale of Celery Crumble,154,802,2631,3433,13
4,Peace at Last,154,754,1855,2609,11
5,Sing A Song Of Bottoms,154,71,1427,1498,1
6,Barry The Fish With Fingers,154,595,1530,2125,9
7,The Troll,154,2891,4854,7745,52
8,The Storm Whale In Winter,154,274,1544,1818,4
9,There's A Monster In Your Book,154,254,1351,1605,4


In [557]:
results_dict.keys()

dict_keys(['The Night Before Christmas', 'Sugarlump and the Unicorn', 'The Gruffalo', 'The Monstrous Tale of Celery Crumble', 'Peace at Last', 'Sing A Song Of Bottoms', 'Barry The Fish With Fingers', 'The Troll', 'The Storm Whale In Winter', "There's A Monster In Your Book"])

In [419]:
results_dict['Sugarlump and the Unicorn']

{'speech_sections': [{'speaker': 'Sugarlump',
   'recipient': 'children',
   'speech_text': '"Here in the children\'s bedroom\nIs where I want to be.\nHappily rocking to and fro.\nThis is the life for me!"',
   'speech_section_id': 1},
  {'speaker': 'Sugarlump',
   'recipient': 'himself',
   'speech_text': '"Oh to be out in the big wide world!\nI wish I could trot," he said.',
   'speech_section_id': 2},
  {'speaker': 'unicorn',
   'recipient': 'Sugarlump',
   'speech_text': '"Done!" came a voice, and there stood a beast\nWith a twisty silver horn.\n"I can grant horses\' wishes," Said the snow-\nwhite unicorn.',
   'speech_section_id': 3},
  {'speaker': 'Sugarlump',
   'recipient': 'himself',
   'speech_text': '"Here in the open countryside is where\nI like to be.\nClippety-dop, clippety-dop, This is the\nlife for me!"',
   'speech_section_id': 4},
  {'speaker': 'Sugarlump',
   'recipient': 'himself',
   'speech_text': '"Oh to be free of this heavy load.\nI wish I could gallop!"',
   '

In [642]:
import pickle as pk 
with open('./results/gpt4o_results_dict.pk', 'wb') as outfile:
    pk.dump(results_dict, outfile)

In [643]:
book_df.to_csv('./results/gpt4o_book_summary.csv')

#### We now extend the conversation to pull out only the spoken words only:

In [644]:
new_json_schema = {
    "speaker": "string",
    "recipient": "string",
    "spoken_words_only": "string",
    "speech_section_id": "integer"
}
    
new_json_schema_str = ', '.join([f"'{key}': {value}" for key, value in new_json_schema.items()])

In [None]:
For the speech sections that you just found, which I reproduce below, please pull out the words that are spoken
        and add them as a new field in the JSON called spoken_words_only, replacing the speech_text field.
        So you will need to remove all non-speech words such as 'she said' etc.
        
        Please use provide your response in JSON.
        Please reproduce punctuation as it is written using regular double quotes "" for speech marks.

        Data: {previous_response}

In [678]:
def get_second_prompt_string(previous_response):
    
    return f"""
        For the speech sections that you just found, please pull out the words that are spoken
        and add them as a new field in the JSON called spoken_words_only, replacing the speech_text field.
        So you will need to remove all non-speech words such as 'she said' etc.
        
        Please use provide your response in JSON.
        Please reproduce punctuation as it is written using regular double quotes "" for speech marks.
    """

In [686]:
book_id = 21
df.iloc[book_id].Title

'Harry and the Dinosaurs Go Wild'

In [687]:
data = {
    'full_text': df.iloc[book_id].Text,
    'sentences': dict(
        zip(
            sentences[sentences.book == df.iloc[book_id].Title].sentence_index,
            [span.text for span in sentences[sentences.book == df.iloc[book_id].Title].sentence]
        )
    )
}

In [688]:
client = OpenAI(api_key=key)

completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {"role": "system", "content": f"You are a data analysis assistant, capable of accurate and precise natural language processing. Output your response in JSON format using the following schema: {json_schema_str}. When reproducing text data please preserve newline characters and punctuation. Please start all indexing of lists and arrays at 0 rather than 1."},
    {"role": "user", "content": r"{}".format(get_prompt_string(data))}
  ],
 temperature=0.0,
 response_format={"type": "json_object"},
)

In [689]:
completion.usage

CompletionUsage(completion_tokens=992, prompt_tokens=2418, total_tokens=3410)

In [690]:
second_completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {"role": "system", "content": f"You are a data analysis assistant, capable of accurate and precise natural language processing. Output your response in JSON format using the following schema: {json_schema_str}. When reproducing text data please preserve newline characters and punctuation. Please start all indexing of lists and arrays at 0 rather than 1."},
    {"role": "user", "content": r"{}".format(get_prompt_string(data))},
    {"role": "assistant", "content": completion.choices[0].message.content},
    {"role": "system", "content": f"Please use the following schema for your JSON response: {new_json_schema}"},
    {"role": "user", "content": r"{}".format(get_second_prompt_string(json.loads(completion.choices[0].message.content)))}
  ],
 temperature=0.0,
 response_format={"type": "json_object"},
)

In [691]:
second_completion.usage

CompletionUsage(completion_tokens=944, prompt_tokens=3557, total_tokens=4501)

In [693]:
json.loads(completion.choices[0].message.content)

{'speech_sections': [{'speaker': 'Hany',
   'recipient': 'Apatosaurus',
   'speech_text': '“That’s a\nrhinoceros,” said Hany.',
   'speech_section_id': 0},
  {'speaker': 'Harry',
   'recipient': 'Mum',
   'speech_text': '“I\nwant to save some animals,” he said.\n“What can I do, Mum?”',
   'speech_section_id': 1},
  {'speaker': 'Sam',
   'recipient': 'Harry',
   'speech_text': '“Tuh! What a waste of time!”',
   'speech_section_id': 2},
  {'speaker': 'Harry',
   'recipient': 'Pterodactyl',
   'speech_text': '“Wait till I’ve finished my blue whale,” said Harry.\n“Blue whales are bigger than trains, bigger than\ndinosaurs, bigger than thirty-two elephants!”',
   'speech_section_id': 3},
  {'speaker': 'Triceratops',
   'recipient': 'Stegosaurus',
   'speech_text': '“Army tanks don’t need saving!” said Triceratops.\n“Do a tree frog instead.”',
   'speech_section_id': 4},
  {'speaker': 'Nan',
   'recipient': 'Harry',
   'speech_text': '“Why not talk to Mr Bopsom?\nHe might put up a poster in 

In [692]:
json.loads(second_completion.choices[0].message.content)

{'speech_sections': [{'speaker': 'Hany',
   'recipient': 'Apatosaurus',
   'spoken_words_only': '“That’s a rhinoceros,”',
   'speech_section_id': 0},
  {'speaker': 'Harry',
   'recipient': 'Mum',
   'spoken_words_only': '“I want to save some animals,” “What can I do, Mum?”',
   'speech_section_id': 1},
  {'speaker': 'Sam',
   'recipient': 'Harry',
   'spoken_words_only': '“Tuh! What a waste of time!”',
   'speech_section_id': 2},
  {'speaker': 'Harry',
   'recipient': 'Pterodactyl',
   'spoken_words_only': '“Wait till I’ve finished my blue whale,” “Blue whales are bigger than trains, bigger than dinosaurs, bigger than thirty-two elephants!”',
   'speech_section_id': 3},
  {'speaker': 'Triceratops',
   'recipient': 'Stegosaurus',
   'spoken_words_only': '“Army tanks don’t need saving!” “Do a tree frog instead.”',
   'speech_section_id': 4},
  {'speaker': 'Nan',
   'recipient': 'Harry',
   'spoken_words_only': '“Why not talk to Mr Bopsom? He might put up a poster in his shop window! Then

## Now trialling format for manual validation:

We have already used the student manual coding for validate speech detection, so now we can just focus on detected speech...

1. Select book at random, select passage of detected speech at random. 
2. Show user the passage and some of the text either side of the passage
3. Ask is it speech? Is speaker correct? Is recipient correct? [Give option to view more text]
4. Save result.

#### Note: handle case when sentence is not found in text e.g. The Troll sentence 7 is split across two setences (7 and 8) due to bad pdfplumber output.

In [393]:
from IPython.display import display, Markdown
from random import randint

In [628]:
# selection = randint(0, 1)
# book_selection = 4  # Peace at Last
book_selection = 1  # Sugarlump

In [629]:
selected_book = list(results_dict.keys())[book_selection]
selected_book

'Sugarlump and the Unicorn'

In [630]:
# speech_sections = json.loads(completion.choices[0].message.content)['speech_sections']
speech_sections = results_dict[selected_book]['speech_sections']

In [631]:
def validate(v_vec):
    pass

In [632]:
def display_section(df, res, speech_section_result, padding=200):

    speech_section = speech_section_result['speech_text']
    book_text = df[df.Title == selected_book].iloc[0].Text
    this_text = book_text[0:res] + '**' + book_text[res:res+len(speech_section)] + '**' + book_text[res+len(speech_section):]
    this_text = this_text[max(res-padding-2, 0):min(res+len(speech_section)+padding+2, len(this_text))]
    display(Markdown(this_text.replace('\n', '<br>')))

    display(Markdown('**' + 'Result:' + '**'))
    display(speech_section_result)

In [639]:
# section_selection = randint(0, len(speech_sections))
section_selection = 0

In [640]:
res = df[df.Title == selected_book].iloc[0].Text.find(speech_sections[section_selection]['speech_text']) 
res

251

In [641]:
display_section(df, res, speech_sections[section_selection])

ht and blue. And when<br>she hears a horse's wish, She can<br>make that wish come tine.<br>Sugarlump was a rocking horse.<br>He belonged to a girl and boy.<br>To and fro, to and fro,<br>They rode on their favourite toy.<br>**"Here in the children's bedroom<br>Is where I want to be.<br>Happily rocking to and fro.<br>This is the life for me!"**<br><br>But when the children were out at school Sugarlump hung his head.<br>"Oh to be out in the big wide world!<br>I wish I could trot," he said.<br><br>"Done!" came a voice, and there stood a beast<br>With a twisty s

**Result:**

{'speaker': 'Sugarlump',
 'recipient': 'himself',
 'speech_text': '"Here in the children\'s bedroom\nIs where I want to be.\nHappily rocking to and fro.\nThis is the life for me!"',
 'speech_section_id': 1}

### Please indicate with '1' which are correct: [speech, speaker, recipient]

In [440]:
validation_vector = [1, 1, 1]
validate(validation_vector)