### Task decomposition and prompt development.

We provide here a framework for developing sequential prompts for a complex NLP task using GPT-4o.

The complex task is decomposed into a sequence of simpler tasks, each building on the previous one.

For each task in the sequence we produce a number of examples to show GPT-4o, and a number of further examples to use for automated testing of the response. This borrows ideas from unit testing of software, since iterative changes to the prompt may break functionality that was previously working.

This framework can be adapted for use with other LLMs and NLP tasks.
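
The sequential structure described above can be sketched as follows. This is only an illustration, assuming each task is wrapped in a function that takes the previous task's JSON output and returns its own. The names `run_pipeline`, `fake_task_1` and `fake_task_2` are hypothetical, and the stand-in tasks do no real NLP; in practice each would make its own LLM call.

```python
import json


def run_pipeline(stages, initial_input):
    """Run each stage on the JSON string produced by the previous one."""
    payload = initial_input
    outputs = []
    for stage in stages:
        payload = stage(payload)
        json.loads(payload)  # fail early if a stage returns invalid JSON
        outputs.append(payload)
    return outputs


# Hypothetical stand-ins for the LLM-backed tasks.
def fake_task_1(payload):
    text = json.loads(payload)["full_text"]
    section = {"speaker": "Mum", "speech_text": text, "speech_section_id": 0}
    return json.dumps({"speech_sections": [section]})


def fake_task_2(payload):
    sections = json.loads(payload)["speech_sections"]
    for section in sections:
        # Replace the raw text with the spoken words (here: just drop the quotes).
        section["spoken_words_only"] = section.pop("speech_text").strip('"')
    return json.dumps({"speech_sections": sections})


results = run_pipeline(
    [fake_task_1, fake_task_2],
    json.dumps({"full_text": '"Watch out Rosie!"'}),
)
final = json.loads(results[-1])
```

Keeping every intermediate output in `results` makes it possible to test each task against its own expected response, in the spirit of unit testing.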

#### We decompose into the following tasks:
- Task 1: identify sections of direct speech, and the name of the speaker and recipient
- Task 2: locate pre-defined sentences in these detected sections of speech (for comparison with human coding)
- Task 3: pull out the spoken words only from each section (removing e.g. 'he said' etc)
- Task 4: locate and replace the names of the speakers and recipients using a pre-defined character map  
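
For comparison with the LLM-based name replacement, a purely rule-based lookup against a character map might look like the sketch below. It is not the method used in this notebook: `match_name`, the `cutoff` value and the fuzzy fallback via `difflib` are assumptions made for illustration, and the characters and aliases are the toy ones from the prompt examples.

```python
import difflib


def _normalise(name):
    """Lower-case a name and drop a leading 'the ' so 'The Queen' matches 'Queen'."""
    cleaned = name.strip().lower()
    if cleaned.startswith("the "):
        cleaned = cleaned[4:]
    return cleaned


def match_name(name, characters, aliases, cutoff=0.75):
    """Map a detected name onto a known character, or 'Unknown' if nothing fits."""
    lookup = {_normalise(c): c for c in characters}
    lookup.update({_normalise(a): c for a, c in aliases.items()})
    key = _normalise(name)
    if key in lookup:
        return lookup[key]
    # Fall back to fuzzy matching to absorb small typos such as 'Alace'.
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else "Unknown"


characters = ["Alice", "Bob", "Charlie"]
aliases = {"The Queen": "Alice", "Bobby": "Bob"}
```

A lookup like this cannot resolve context-dependent recipients such as 'everyone', which is one reason to leave that judgement to the model.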

#### Notes:
- Determinism is not guaranteed, but well-structured prompts combined with temperature=0 and a fixed seed should produce near-deterministic outputs. It is also worth storing the system fingerprint for future reference, as changes to it may explain differing results later.
- The lack of determinism can make tests quite brittle. It is worth repeating tests several times to confirm their behaviour, and then running the full manual validation on a single static result set.
- Where possible, each sequential task should be tackled as a new completion API call, using formatted output from the previous task as the input. This is preferable to chaining prompts and outputs into a chat-style conversation, which increases the risk of conflict or confusion between prompts/instruction sets and also increases the amount of context used.
- Instructions, schemas and examples need to be consistent with one another, otherwise results may be inconsistent, e.g. the instruction 'reproduce all punctuation as it appears' conflicted with an example that removed speech marks.
- Should typos be accounted for (e.g. in task_3 name matching)?
- Cost: \\$1.22 left after developing prompts. Added \\$10 to run for 50 books (so ~0.25 of the full dataset).
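
The first note can be sketched as below: collect the request settings in one place and keep the response metadata alongside the results. The helper names are hypothetical, the fingerprint values `fp_a` and `fp_b` are made up, and `SimpleNamespace` objects stand in for real API responses so that the example runs offline.

```python
from types import SimpleNamespace


def build_request_kwargs(messages, seed=42):
    """Collect the settings that make repeated runs as stable as possible."""
    return {
        "model": "gpt-4o",
        "messages": messages,
        "temperature": 0.0,
        "seed": seed,
        "response_format": {"type": "json_object"},
    }


def describe_run(completion, seed):
    """Keep the metadata needed to explain differing results later."""
    return {
        "model": getattr(completion, "model", None),
        "system_fingerprint": getattr(completion, "system_fingerprint", None),
        "seed": seed,
    }


def same_backend(run_a, run_b):
    """True when two runs report the same model and system fingerprint."""
    keys = ("model", "system_fingerprint")
    return all(run_a[k] == run_b[k] for k in keys)


# Hypothetical completion objects standing in for real API responses.
first = describe_run(SimpleNamespace(model="gpt-4o", system_fingerprint="fp_a"), seed=42)
later = describe_run(SimpleNamespace(model="gpt-4o", system_fingerprint="fp_b"), seed=42)
```

If `same_backend` returns `False` for two runs with identical prompts and seed, a backend change is a plausible explanation for any difference in output.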

#### Automating this using the OpenAI Chat Completions API:

#### TODO:
- move definition of input JSON to system prompt?
- add character/alias mapping to prompt for each book: use primary name only
- when in the pipeline should we spellcheck / correct typos? (e.g. Hany in the Dinosaurs (book 21))
- what to do about inconsistent sentence detection? e.g. "Now Dasher!" being at the end of sentence 7 was causing GPT confusion...
- add an instruction about how to refer to 'general audience' or 'narrator' or 'I'
- ask for output of reasoning/thought process?
- ask for a confidence score?
- do we need to specify (in system prompt) not to use Markdown or any other formatting in the JSON output?
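
On the last point, one defensive option is to tolerate a Markdown code fence when parsing the response, rather than relying on the prompt alone. The sketch below assumes a hypothetical helper `parse_json_response`. It may be unnecessary when the API's JSON response format is requested, as it is in this notebook, but it costs little.

```python
import json

FENCE = "`" * 3  # built this way so the example itself contains no literal fence


def parse_json_response(content):
    """Parse model output as JSON, tolerating a surrounding Markdown code fence."""
    text = content.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()
        lines = lines[1:]  # drop the opening fence (and any language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines)
    return json.loads(text)


plain = '{"speech_sections": []}'
fenced = FENCE + "json\n" + plain + "\n" + FENCE
```

Both `plain` and `fenced` parse to the same dictionary, so downstream tests do not need to care which form the model produced.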
 
#### Note: ideas to explore if we need a performance boost...

- system message to edit assistant role
- vary temperature or top_p parameter
- fine-tuning a model with bespoke training data (how much data is necessary?)
- improved instructions or prompt engineering (see e.g. paper on iterative prompting)
- compare results with gpt-3.5-turbo? It does not seem to work well for our use case!

In [1]:
import os
import json
import pdfplumber
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import string
import spacy
from spacy import displacy
from spacy.lang.en.examples import sentences 
from openai import OpenAI

%matplotlib inline

In [2]:
nlp = spacy.load("en_core_web_lg")

In [3]:
with open('./key.txt', 'r') as infile:
    api_key = infile.read().splitlines()[0]

In [4]:
books_to_exclude = []
labels = pd.read_excel('./Book-List-Final-NONA.xlsx', sheet_name='Sheet1')
labels = labels.rename(columns={'Author ': 'Author'})
labels = labels.loc[~labels.Title.isin(books_to_exclude)]

In [5]:
os.chdir('../text_pdfs')

In [6]:
df = pd.DataFrame()

def grab_text(title, labels):
    
    start = labels.loc[labels.Title==title]['Starting Page']
    if len(start)==0:
        print(title, "no start")
        start = 0
    else:
        start = start.values[0]
    end = labels.loc[labels.Title==title]['Ending Page']
    if len(end)==0:
        print(title, "no end")
        end = 0
    else:
        end = end.values[0]
    
    title = title + '.pdf'
    all_text = ''
    with pdfplumber.open(title) as pdf:
        for i, page in enumerate(pdf.pages):
            if i+1 >= start and i < end:
                single_page_text = page.extract_text()

                if single_page_text is not None:
                    all_text = all_text + '\n' + single_page_text
                
    return all_text

df['Title'] = [file.split('.')[0] for file in os.listdir() if file.split('.')[1]=='pdf']
df['Text'] = [grab_text(title, labels) for title in df.Title]

In [7]:
os.chdir('../code_new_version')
df.head()

Unnamed: 0,Title,Text
0,The Night Before Christmas,\n'Twas the night before Christmas\nwhen all t...
1,Sugarlump and the Unicorn,"\nThe unicorn has a silver horn, Her\neyes are..."
2,The Gruffalo,\nA mouse took a stroll through the deep dark ...
3,The Monstrous Tale of Celery Crumble,\nHave you met Celery Crumble?\nThat’s her rig...
4,Peace at Last,"\nThe hour was late.\nMr Bear was tired, Mrs B..."


In [8]:
df = df.loc[df.Text != ''].reset_index()

In [9]:
len(df)

196

#### Converting the full dataset into a dataframe of sentences
Note: using strip() here to remove trailing or leading spaces for improved performance.
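
The whitespace clean-up and word counting can be illustrated in isolation, without spaCy. The function names below are hypothetical. Note that `string.punctuation` only covers ASCII punctuation, so curly quotes are left in place.

```python
import string


def normalise_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return ' '.join(text.split())


def count_words(sentence):
    """Count whitespace-separated words after removing ASCII punctuation."""
    stripped = sentence.translate(str.maketrans('', '', string.punctuation))
    return len(stripped.split())


raw = "\nA mouse\ttook  a stroll. "
clean = normalise_whitespace(raw)
```

Calling `split()` with no argument, rather than `split(' ')`, avoids counting empty strings when punctuation removal leaves two spaces in a row.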

In [10]:
sentences = pd.DataFrame()

book_col = []
sentences_col = []
length_col = []
index_col = []

for title, text in zip(df.Title, df.Text):
    text = text.replace('\n', ' ')  # Only safe provided line breaks are not used to separate sentences without punctuation
    text = text.replace('\t', ' ') # This allows us to save as tsv (and simplifies the whitespace)
    text = ' '.join(text.split())
    
    
    doc = nlp(text)
    sentence_list = list(doc.sents)
    
    for si, sen in enumerate(sentence_list):
        book_col.append(title)
        
        doc = sen #nlp(sen.text.strip())
        sentences_col.append(doc)
        length_col.append(len(doc.text.translate(str.maketrans('', '', string.punctuation)).split(' ')))
        index_col.append(si)

    
sentences['book'] = book_col
sentences['sentence_length'] = length_col
sentences['sentence'] = sentences_col
sentences['sentence_index'] = index_col

In [11]:
coding_sample = sentences.sample(frac=0.15, axis=0, random_state=42)

#### Check that this sample contains the same sentences that were manually coded previously.
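
A minimal sketch of this kind of check, assuming the two samples are available as plain lists of strings. The helper `find_mismatches` is hypothetical; it reports where the lists disagree, which is more informative than a bare assertion when the check fails.

```python
def find_mismatches(expected, actual):
    """Return the positions at which two sentence lists disagree."""
    if len(expected) != len(actual):
        raise ValueError(f"Length mismatch: {len(expected)} != {len(actual)}")
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


coded = ["The hour was late.", "Mr Bear was tired."]
resampled = ["The hour was late.", "Mr Bear was tired"]
mismatches = find_mismatches(coded, resampled)
```

The explicit length check matters because `zip` silently stops at the shorter of the two inputs.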

In [12]:
manually_coded = pd.read_csv('./sentences_for_coding/sample_15pc.csv', delimiter='\t', index_col=0)

In [13]:
text_equal = [
    i == j.text
    for i,j in
    zip(manually_coded.sentence, coding_sample.sentence)
]    

In [14]:
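# zip() stops at the shorter input, so check the lengths explicitly as well.
assert len(manually_coded) == len(coding_sample), "Samples differ in length"
assert all(text_equal), "Sample does not match the manually coded sentences"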

In [15]:
example_data = {
    'task_1': {
        'example_input_1': {
            'full_text': '\nBob saw an animal just like a horse. “That’s a\ndonkey,” said Alice.\n“Horses have got longer legs.” \n The donkey realy was quite short and said it definitely wasn\'t a horse. \n“Cool! I\'ve never seen a donkey before.” shouted Bob.'
        },
        'example_output_1': json.dumps(
            {
                'speech_sections': [
                    {
                        "speaker": "Alice",
                        "recipient": "Bob",
                        "speech_text": 'That’s a\ndonkey,” said Alice.\n“Horses have got longer legs.”',
                        "speech_section_id": 0
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "Alice",
                        "speech_text": '\n“Cool! I\'ve never seen a donkey before.” shouted Bob.',
                        "speech_section_id": 1
                    }   
                ]
            }
        ),
        'example_input_2': {
            'full_text': '"What a strange day!" Alice said to herself. \n"What do you think of that everyone? She is talking to herself!" asked Bob.'
        },
        'example_output_2': json.dumps(
            {
                'speech_sections': [
                    {
                        "speaker": "Alice",
                        "recipient": "self",
                        "speech_text": '"What a strange day!" Alice said to herself.',
                        "speech_section_id": 0
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "reader",
                        "speech_text": '\n"What do you think of that everyone? She is talking to herself!" asked Bob.',
                        "speech_section_id": 1
                    }   
                ]
            }
        )
    },
    'task_2':  {
        'example_input_1': json.dumps(
            {
                'speech_sections': [
                    {
                        "speaker": "Alice",
                        "recipient": "Bob",
                        "speech_text": 'That’s a\ndonkey,” said Alice.\n“Horses have got longer legs.”',
                        "speech_section_id": 0
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "Alice",
                        "speech_text": '\n“Cool! I\'ve never seen a donkey before.” shouted Bob.',
                        "speech_section_id": 1
                    },
                    {
                        "speaker": "Alice",
                        "recipient": "Bob",
                        "speech_text": '\n“You can\'t be serious Bob!”\nShe said he must have seen a donkey before.',
                        "speech_section_id": 2
                    },
                    
                ]
            }
        ),
        'example_output_1': json.dumps(
            {
                'speech_sections': [
                    {
                        "speaker": "Alice",
                        "recipient": "Bob",
                        "spoken_words_only": 'That’s a donkey. Horses have got longer legs.',
                        "speech_section_id": 0
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "Alice",
                        "spoken_words_only": 'Cool! I\'ve never seen a donkey before.',
                        "speech_section_id": 1
                    },
                    {
                        "speaker": "Alice",
                        "recipient": "Bob",
                        "spoken_words_only": 'You can\'t be serious Bob!',
                        "speech_section_id": 2
                    },
                ]
            }
        )
    },
    'task_3':  {
        'example_input_1': json.dumps(
            {
                'speech_sections': [
                    {
                        "speaker": "Alace",
                        "recipient": "Bobby",
                        "spoken_words_only": 'That’s a donkey. Horses have got longer legs.',
                        "speech_section_id": 0
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "The Queen",
                        "spoken_words_only": 'Cool! I\'ve never seen a donkey before.',
                        "speech_section_id": 1
                    },
                    {
                        "speaker": "Alice",
                        "recipient": "Robert",
                        "spoken_words_only": 'You can\'t be serious Bob!',
                        "speech_section_id": 2
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "everyone",
                        "spoken_words_only": 'What do you think, everyone?',
                        "speech_section_id": 3
                    },
                ]
            }
        ),
        'example_characters_1': ['Alice', 'Bob', 'Charlie'],
        'example_aliases_1': pd.DataFrame({
            'alias': ['The Queen', 'Bobby'],
            'character': ['Alice', 'Bob']
        }).to_csv(),
        'example_output_1': json.dumps(
            {
                'speech_sections': [
                    {
                        "speaker": "Alace",
                        "recipient": "Bobby",
                        "speaker_matched": "Alice",
                        "recipient_matched": "Bob",
#                         "spoken_words_only": 'That’s a donkey. Horses have got longer legs.',
                        "speech_section_id": 0
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "The Queen",
                        "speaker_matched": "Bob",
                        "recipient_matched": "Alice",
#                         "spoken_words_only": 'Cool! I\'ve never seen a donkey before.',
                        "speech_section_id": 1
                    },
                    {
                        "speaker": "Alice",
                        "recipient": "Robert",
                        "speaker_matched": "Alice",
                        "recipient_matched": "Unknown",
#                         "spoken_words_only": 'You can\'t be serious Bob!',
                        "speech_section_id": 2
                    },
                    {
                        "speaker": "Bob",
                        "recipient": "everyone",
                        "speaker_matched": "Bob",
                        "recipient_matched": "The Reader",
#                         "spoken_words_only": 'What do you think, everyone?',
                        "speech_section_id": 3
                    },
                ]
            }
        )
    }
}

In [16]:
test_data = {
    'test_ids': [0, 1, 2],
    'strings': {
        0: '\n"Watch out Rosie!" cried Mum. \n"There\'s a monster behind you." \n"I am going to eat you!" shouted the monster.',
        1: '\nIt was a long drive to the safari park but it was worth it.\nApatosaurus saw an animal just like Triceratops. “That’s a\nrhinoceros,” said Hany.\n“Triceratops has got more horns.”\nMum liked the giraffes best and Nan-\nliked the zebras.\nThe monkeys were funny but the\nman said not to feed them.\nSam asked him if they had pandas but the man said\nno, they were endangered animals.\nHarry wanted to know what endangered meant.\nSam said he was too little to understand.\nNan helped. She bought Harry a book about endangered animals.\nShe thought it was sad about the Sumatran tigers. People kept\nhunting them so there were only a few left in the whole world.\n\nHarry really wanted to help but he had no money. “I\nwant to save some animals,” he said.\n“What can I do, Mum?”\nSam said, “Tuh! What a waste of time!”\nShe said he was miles too small to make any difference.',
        2: df.iloc[0].Text
    },
    'task_1_responses': {
        0: {
            'speech_sections': [
                {
                    'speaker': 'Mum',
                    'recipient': 'Rosie',
                    'speech_text': '\n"Watch out Rosie!" cried Mum. \n"There\'s a monster behind you."',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'The monster',
                    'recipient': 'Rosie',
                    'speech_text': '\n"I am going to eat you!" shouted the monster.',
                    'speech_section_id': 1
                },
            ]
            
        },
        1: {
            'speech_sections': [
                {
                    'speaker': 'Hany',
                    'recipient': 'Apatosaurus',
                    'speech_text': '“That’s a\nrhinoceros,” said Hany.\n“Triceratops has got more horns.”',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'Harry',
                    'recipient': 'Mum',
                    'speech_text': '“I\nwant to save some animals,” he said.\n“What can I do, Mum?”',      
                    'speech_section_id': 1
                },
                {
                    'speaker': 'Sam',
                    'recipient': 'Harry',
                    'speech_text': '“Tuh! What a waste of time!”\nShe said he was miles too small to make any difference.',
                    'speech_section_id': 2
                }
            ]
        },
        2: {
            'speech_sections': [
                {
                    'speaker': 'St. Nicholas',
                    'recipient': 'Reindeer',
                    'speech_text': 'Now, Dasher! now, Dancer!\nnow, Prancer and Vixen!\nOn Comet! on Cupid!\non Donner and Blitzen!\nTo the top of the porch, to\nthe top of the wall,\nNow, dash away! Dash\naway! Dash away all!',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'St. Nicholas',
                    'recipient': 'everyone',
                    'speech_text': 'Happy\nChristmas to all, and to all a good\nnight!',
                    'speech_section_id': 1
                }
            ]
        }
    },
    'task_2_responses': {
        0: {
            'speech_sections': [
                {
                    'speaker': 'Mum',
                    'recipient': 'Rosie',
                    'spoken_words_only': 'Watch out Rosie! There\'s a monster behind you.',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'The monster',
                    'recipient': 'Rosie',
                    'spoken_words_only': 'I am going to eat you!',
                    'speech_section_id': 1
                },
            ]
            
        },
        1: {
            'speech_sections': [
                {
                    'speaker': 'Hany',
                    'recipient': 'Apatosaurus',
                    'spoken_words_only': 'That’s a rhinoceros. Triceratops has got more horns.',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'Harry',
                    'recipient': 'Mum',
                    'spoken_words_only': 'I want to save some animals. What can I do, Mum?',      
                    'speech_section_id': 1
                },
                {
                    'speaker': 'Sam',
                    'recipient': 'Harry',
                    'spoken_words_only': 'Tuh! What a waste of time!',
                    'speech_section_id': 2
                }
            ]
        },
        2: {
            'speech_sections': [
                {
                    'speaker': 'St. Nicholas',
                    'recipient': 'Reindeer',
                    'spoken_words_only': 'Now, Dasher! now, Dancer! now, Prancer and Vixen! On Comet! on Cupid! on Donner and Blitzen! To the top of the porch, to the top of the wall, Now, dash away! Dash away! Dash away all!',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'St. Nicholas',
                    'recipient': 'everyone',
                    'spoken_words_only': 'Happy Christmas to all, and to all a good night!',
                    'speech_section_id': 1
                }
            ]
        }
    },
    'task_3_characters': {
        0: ['Rosie', 'Mum'],
        1: ['Harry', 'Apatosaurus', 'Mum'],
        2: ['Saint Nick', 'Reindeer']
    },
    'task_3_aliases': {
        0: pd.DataFrame({
            'alias': ['Monster'],
            'character': ['Mum']
        }).to_csv(),
        1: pd.DataFrame({
            'alias': ['Sam'],
            'character': ['Mum']
        }).to_csv(),
        2: pd.DataFrame({
            'alias': ['St. Nicholas'],
            'character': ['Saint Nick']
        }).to_csv(),
    },
    'task_3_responses': {
        0: {
            'speech_sections': [
                {
                    'speaker': 'Mum',
                    'recipient': 'Rosie',
                    'speaker_matched': 'Mum',
                    'recipient_matched': 'Rosie',
#                     'spoken_words_only': 'Watch out Rosie! There\'s a monster behind you.',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'The monster',
                    'recipient': 'Rosie',
                    'speaker_matched': 'Mum',
                    'recipient_matched': 'Rosie',
#                     'spoken_words_only': 'I am going to eat you!',
                    'speech_section_id': 1
                },
            ]
            
        },
        1: {
            'speech_sections': [
                {
                    'speaker': 'Hany',
                    'recipient': 'Apatosaurus',
                    'speaker_matched': 'Unknown',
                    'recipient_matched': 'Apatosaurus',
#                     'spoken_words_only': 'That’s a rhinoceros. Triceratops has got more horns.',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'Harry',
                    'recipient': 'Mum',
                    'speaker_matched': 'Harry',
                    'recipient_matched': 'Mum',
#                     'spoken_words_only': 'I want to save some animals. What can I do, Mum?',      
                    'speech_section_id': 1
                },
                {
                    'speaker': 'Sam',
                    'recipient': 'Harry',
                    'speaker_matched': 'Mum',
                    'recipient_matched': 'Harry',
#                     'spoken_words_only': 'Tuh! What a waste of time!',
                    'speech_section_id': 2
                }
            ]
        },
        2: {
            'speech_sections': [
                {
                    'speaker': 'St. Nicholas',
                    'recipient': 'Reindeer',
                    'speaker_matched': 'Saint Nick',
                    'recipient_matched': 'Reindeer',
#                     'spoken_words_only': 'Now, Dasher! now, Dancer! now, Prancer and Vixen! On Comet! on Cupid! on Donner and Blitzen! To the top of the porch, to the top of the wall, Now, dash away! Dash away! Dash away all!',
                    'speech_section_id': 0
                },
                {
                    'speaker': 'St. Nicholas',
                    'recipient': 'everyone',
                    'speaker_matched': 'Saint Nick',
                    'recipient_matched': 'The Reader',
#                     'spoken_words_only': 'Happy Christmas to all, and to all a good night!',
                    'speech_section_id': 1
                }
            ]
        }
    }
}

In [17]:
task_1_response_schema = {
    "speech_sections": {
        "speaker": "string",
        "recipient": "string",
        "speech_text": "string",
        "speech_section_id": "integer"
    }
}
    
task_1_response_schema_str = ', '.join([f"'{key}': {value}" for key, value in task_1_response_schema.items()])

In [18]:
task_1_input_schema = {
    "full_text": "string"
}
    
task_1_input_schema_str = ', '.join([f"'{key}': {value}" for key, value in task_1_input_schema.items()])

In [19]:
def replace_speech_marks(text):
    new_text = text.replace('“', '"').replace('”', '"').replace('’', '\'')
    return new_text

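def discount_speech_marks(text, marks=('“', '"', '”', '’', '\'')):
    # Drop at most one leading and one trailing speech mark; safe on empty strings.
    new_text = text
    if new_text and new_text[0] in marks:
        new_text = new_text[1:]
    if new_text and new_text[-1] in marks:
        new_text = new_text[:-1]
    return new_text

def remove_the(text):
    # Drop a leading 'the' only when it is a whole word, so names such as 'Theo' are kept.
    new_text = text
    if text[0:4].lower() == 'the ':
        new_text = text[4:].strip()
    return new_text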


def compare_strings(
    string_1, string_2, 
    _case_sensitive=True, 
    _strip_whitespace=True, 
    _replace_speech_marks=True,
    _discount_leading_trailing_marks=True,
    _remove_leading_the=False
):
    
    if _case_sensitive:
        s_1 = string_1
        s_2 = string_2
    else:
        s_1 = string_1.lower()
        s_2 = string_2.lower()
    
    if _strip_whitespace:
        s_1 = s_1.strip()
        s_2 = s_2.strip()
        
    if _replace_speech_marks:
        s_1 = replace_speech_marks(s_1)
        s_2 = replace_speech_marks(s_2)
        
    if _discount_leading_trailing_marks:
        s_1 = discount_speech_marks(s_1)
        s_2 = discount_speech_marks(s_2)
        
    if _remove_leading_the:
        s_1 = remove_the(s_1)
        s_2 = remove_the(s_2)
        
    return s_1 == s_2

In [20]:
def run_task_1_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_1_responses'][test_id]['speech_sections']

    # Track the result across the whole test, so that a failure in any section
    # (not just the last one) is reflected in the returned value.
    pass_flag = True

    if len(response['speech_sections']) != len(correct_speech_sections):
        print("Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
        pass_flag = False

    test_elements = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'speech_text': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
    }

    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):
        if section['speech_section_id'] != correct_section['speech_section_id']:
            print("Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
            pass_flag = False

        for element, options in test_elements.items():
            matches = compare_strings(
                correct_section[element],
                section[element],
                _case_sensitive=options['case_sensitive'],
                _remove_leading_the=options['remove_leading_the']
            )
            if not matches:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False

    return pass_flag

In [21]:
def get_task_1_prompt_string(data, example_data):
    
    example_input_1 = example_data['task_1']['example_input_1']
    example_output_1 = example_data['task_1']['example_output_1']
#     example_input_2 = example_data['task_1']['example_input_2']
#     example_output_2 = example_data['task_1']['example_output_2']
    
    return f"""
        I will provide you below with the following data in JSON format: {task_1_input_schema_str}
        
        The full_text is text of a children's book as a single string. 

        Using the full_text, please identify any sections of direct speech, and for each one tell me who is the speaker and who is the recipient.
        Remember that a section of direct speech can be broken up by information about who is speaking and that this break could even span 
        multiple lines in some cases. In these cases, please treat this as a single section of speech.
        
        For example, the input: {example_input_1}
        Should produce the following output: {example_output_1}
        
        Use '\\n' as the newline character and reproduce these as they occur.
        Reproduce all punctuation as it is written.    
            
        Provide the results in JSON format with the following fields: speaker, recipient, speech_text, speech_section_id
        (where speech_section_id counts the number of sections of speech in this book)

        Data: {data}
    """

In [22]:
def get_task_1_system_prompt(input_schema, output_schema):
    return f"""
            You are a data analysis assistant, capable of accurate and precise natural language processing. 
            You will receive data in JSON format with the following schema: {input_schema}
            Output your response in JSON format using the following schema: {output_schema}.
            Please start all indexing of lists and arrays at 0 rather than 1.
        """

In [23]:
def run_task_1(full_text, client, seed=42):
    
    prompt_string = get_task_1_prompt_string(
            {'full_text': full_text}, 
            example_data
        )
        
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": get_task_1_system_prompt(task_1_input_schema_str, task_1_response_schema_str)},
            {"role": "user", "content": prompt_string}
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
        seed=seed
    )
    
    return completion    

In [26]:
def run_task_1_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")
        
        completion = run_task_1(test_data['strings'][test_id], client=client)
        
        if run_task_1_test_i(test_id, test_data, completion, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [27]:
success_count = 0
for i in range(20):
    print(f"Running repeat {i}")
    
    success = run_task_1_tests(test_data)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 5
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 6
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 7
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 8
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 9
Running test: 0
Test 0: pass


### Task 2: recognising pre-defined sentences.

### Task 3: pulling out spoken words only.

#### TODO:
- refactor run_test_i method
- move schemas and test and example data to files
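
For the second item, a minimal sketch of moving test data to a JSON file is shown below. The function names and file name are hypothetical. One detail to watch is that JSON stores dictionary keys as strings, so the integer test ids have to be restored on loading.

```python
import json
import os
import tempfile


def save_test_data(test_data, path):
    """Write the test data dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(test_data, outfile, ensure_ascii=False, indent=2)


def load_test_data(path):
    """Load test data, restoring the integer test ids that JSON stores as strings."""
    with open(path, "r", encoding="utf-8") as infile:
        raw = json.load(infile)
    restored = {}
    for name, value in raw.items():
        if isinstance(value, dict):
            # Only the first level of keys holds test ids; nested keys stay as they are.
            restored[name] = {int(k): v for k, v in value.items()}
        else:
            restored[name] = value
    return restored


sample = {
    "test_ids": [0],
    "strings": {0: '\n"Watch out Rosie!" cried Mum.'},
    "task_1_responses": {0: {"speech_sections": []}},
}
path = os.path.join(tempfile.mkdtemp(), "test_data.json")
save_test_data(sample, path)
reloaded = load_test_data(path)
```

Entries that depend on the loaded books, such as a string taken from `df`, would still need to be filled in at run time.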

In [598]:
task_2_response_schema = {
    "speaker": "string",
    "recipient": "string",
    "spoken_words_only": "string",
    "speech_section_id": "integer"
}
    
task_2_response_schema_str = ', '.join([f"'{key}': {value}" for key, value in task_2_response_schema.items()])

In [773]:
# def get_task_2_prompt_string(example_data):
    
#     example_input_1 = example_data['task_2']['example_input_1']
#     example_output_1 = example_data['task_2']['example_output_1']
    
#     return f"""
#         For the speech sections that you just found, please pull out the words that are direct speech
#         and add them as a field in the JSON output called spoken_words_only.
        
#         You will need to remove all non-speech words such as 'she said' 
#         and anything else that is not direct speech. 
        
#         Do not include indirect speech.
        
#         For example, these speech sections: {example_input_1}
#         Should produce the following output: {example_output_1}
        
#         Reproduce all punctuation as it is written.    
#         Provide your response in JSON.
#     """
# def get_task_2_prompt_string(example_data):
    
#     example_input_1 = example_data['task_2']['example_input_1']
#     example_output_1 = example_data['task_2']['example_output_1']
    
#     return f"""
#         For the speech sections that you just found, please look at the speech_text fields in the JSON.
        
#         First, replace all newline characters with a single space.
#         Then, remove all words that are not direct speech. Keep only the words that are actually spoken
#         and add them as a field in the JSON output called 'spoken_words_only'.
        
#         You will need to remove all non-speech words such as 'she said' 
#         and anything else that is not direct speech. 
        
#         Do not include indirect speech.
        
#         For example, these speech sections: {example_input_1}
#         Should produce the following output: {example_output_1}
        
#         Reproduce all punctuation as it is written.    
#         Provide your response in JSON.
#     """
def get_task_2_prompt_string(example_data, task_1_response):
    
    example_input_1 = example_data['task_2']['example_input_1']
    example_output_1 = example_data['task_2']['example_output_1']
    
    return f"""
        Here are the speech sections that you just found: {task_1_response}. 
        
        Look at the speech_text fields.
        Extract only the words that are direct speech, omitting any words that are not actually spoken.
        Add these spoken words as a field in the JSON output called 'spoken_words_only'.
        
        For example, these speech sections: {example_input_1}
        Should produce the following output: {example_output_1}
        
        Remove all speech marks and add full stops where needed, otherwise produce all punctuation as it is written. Replace each newline character '\\n' with a single space.
        Provide your response in JSON.
    """

In [798]:
def get_task_2_system_prompt(_task_2_input_schema, _task_2_response_schema):
#     return f"Please use the following schema for your JSON response: {_task_2_response_schema}. Remove all newline characters in your output with a single space."
    return f"""
        You are a data analysis assistant, capable of accurate and precise natural language processing. 
        You will receive data in JSON format with the following schema: {_task_2_input_schema}
        Use the following schema for your JSON response: {_task_2_response_schema}.
        Please start all indexing of lists and arrays at 0 rather than 1.
    """

In [775]:
def run_task_2(full_text, client, task_1_completion=None, seed=42):
    
    if task_1_completion is None:
        task_1_completion = run_task_1(full_text, client)
        
    task_1_prompt_string = get_task_1_prompt_string(
        {'full_text': full_text}, 
        example_data
    )
    
    task_2_prompt_string = get_task_2_prompt_string(
        example_data,
        task_1_response=json.loads(task_1_completion.choices[0].message.content)
    )
    
    task_2_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
#            {
#                "role": "system", 
#                "content": get_task_1_system_prompt(task_1_input_schema_str, task_1_response_schema_str)
#            },
#            {
#                "role": 
#                "user", "content": r"{}".format(task_1_prompt_string)
#            },
#            {
#                "role": "assistant", 
#                "content": task_1_completion.choices[0].message.content
#            },
           {
               "role": "system", 
               "content": get_task_2_system_prompt(task_1_response_schema_str, task_2_response_schema_str)
           },
           {
               "role": "user", 
               "content": r"{}".format(task_2_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion
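
Because each task consumes the JSON produced by the previous one, it can help to check that every returned section carries exactly the fields named in the response schema before passing it on. The helper below is a hypothetical sketch (`check_section_keys` is not part of the notebook) and assumes the top-level `speech_sections` key seen in the responses.

```python
import json

def check_section_keys(content, expected_keys):
    """Return a list of (section_index, missing_keys, unexpected_keys) problems."""
    response = json.loads(content)
    problems = []
    for i, section in enumerate(response.get("speech_sections", [])):
        missing = sorted(set(expected_keys) - set(section))
        unexpected = sorted(set(section) - set(expected_keys))
        if missing or unexpected:
            problems.append((i, missing, unexpected))
    return problems

# Illustrative response content; the second section lacks 'spoken_words_only'.
example_content = json.dumps({
    "speech_sections": [
        {"speaker": "Harry", "recipient": "Mum",
         "spoken_words_only": "What can I do, Mum?", "speech_section_id": 0},
        {"speaker": "Sam", "recipient": "Harry", "speech_section_id": 1},
    ]
})
expected = ["speaker", "recipient", "spoken_words_only", "speech_section_id"]
print(check_section_keys(example_content, expected))
```

An empty list means every section matches the schema. Unexpected keys are reported as well, since a field carried over from an earlier task may signal a conflict between the instructions and the examples.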

In [776]:
client = OpenAI(api_key=key)

In [781]:
completion_1, completion_2 = run_task_2(
    full_text=test_data['strings'][test_id],
    client=client
)

In [782]:
completion_2.usage

CompletionUsage(completion_tokens=156, prompt_tokens=666, total_tokens=822)

In [783]:
print(completion_2.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "spoken_words_only": "That’s a rhinoceros. Triceratops has got more horns.",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "spoken_words_only": "I want to save some animals. What can I do, Mum?",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "spoken_words_only": "Tuh! What a waste of time!",
      "speech_section_id": 2
    }
  ]
}


In [875]:
def run_task_2_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_2_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print("Failed test: speech section lists are different lengths.")
        # Record the failure whether or not verbose output is requested.
        pass_flag = False
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'spoken_words_only': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print("Failed test: speech_section_id not equal.")
            # A mismatched id means the sections are misaligned, so count it as a failure.
            pass_flag = False
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag
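
The test above relies on `compare_strings`, which is defined elsewhere in the notebook and is not shown in this section. Purely to illustrate what the two keyword arguments might mean, here is a hypothetical stand-in (`compare_strings_sketch`); the real helper may behave differently.

```python
def compare_strings_sketch(expected, actual, _case_sensitive=True, _remove_leading_the=False):
    """Hypothetical comparison: optional case folding and removal of a leading 'the '."""
    def normalise(text):
        text = text.strip()
        if not _case_sensitive:
            text = text.lower()
        # Drop a leading article so that 'The Gruffalo' and 'Gruffalo' compare equal.
        if _remove_leading_the and text.lower().startswith("the "):
            text = text[4:]
        return text
    return normalise(expected) == normalise(actual)

print(compare_strings_sketch("The Gruffalo", "gruffalo",
                             _case_sensitive=False, _remove_leading_the=True))
print(compare_strings_sketch("Mum", "mum"))
```

Loose matching suits the `speaker` and `recipient` fields, where the model may reasonably vary capitalisation or articles, while `spoken_words_only` is compared strictly.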

In [779]:
def run_task_2_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, completion_2 = run_task_2(
            full_text=test_data['strings'][test_id], 
            client=client
        )
           
        if run_task_2_test_i(test_id, test_data, completion_2, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [883]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_2_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5
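
Since outputs are not guaranteed to be deterministic, the repeat loop above is effectively estimating a pass rate. A minimal sketch of that idea, using a stand-in test suite rather than real API calls (`pass_rate` and `flaky_suite` are hypothetical names):

```python
def pass_rate(run_tests, repeats=5):
    """Call run_tests() `repeats` times and return the fraction of fully passing runs."""
    results = [bool(run_tests()) for _ in range(repeats)]
    return sum(results) / repeats

# Stand-in for run_task_2_tests: fails on the third call to imitate a brittle prompt.
calls = {"n": 0}

def flaky_suite():
    calls["n"] += 1
    return calls["n"] != 3

print(pass_rate(flaky_suite, repeats=5))
```

A rate below 1.0 over several repeats suggests the prompt or the test is brittle, even if a single run happens to pass.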


### Task 4: mapping character names.

In [785]:
import sqlite3

In [929]:
conn = sqlite3.connect('character_database.db')

In [930]:
aliases = pd.read_sql('select * from aliases', conn, index_col='index')
characters = pd.read_sql('select * from characters', conn, index_col='index')

In [989]:
task_3_response_schema = {
    "speaker": "string",
    "recipient": "string",
    "speaker_matched": "string",
    "recipient_matched": "string",
#     "spoken_words_only": "string",
    "speech_section_id": "integer"
}
    
task_3_response_schema_str = ', '.join([f"'{key}': {value}" for key, value in task_3_response_schema.items()])

In [990]:
def get_task_3_prompt_string(example_data, task_2_response, characters, aliases):
    
    example_input_1 = example_data['task_3']['example_input_1']
    example_characters_1 = example_data['task_3']['example_characters_1']
    example_aliases_1 = example_data['task_3']['example_aliases_1']
    example_output_1 = example_data['task_3']['example_output_1']
    
    if not isinstance(characters, list):
        character_list = list(characters.name)
    else:
        character_list = characters
    if not isinstance(aliases, str):
        alias_csv = aliases[['alias', 'character']].to_csv()
    else:
        alias_csv = aliases
    
    return f"""
        Here are the speech sections that you just found: {task_2_response}. 
        
        Look at the speakers and recipients. I want you to match these to pre-defined character names,
        and to store the matched name as new fields called speaker_matched and recipient_matched in the JSON output.
        
        Here is a list of character names: {character_list}
        If you find the speaker or recipient in this list (or a close enough match, including typos), please
        use the found name as the match value.
        If there is no match in the list, look for the name in the 'alias' column of the following
        csv lookup table: {alias_csv}
        If you find the name in the 'alias' column, take the corresponding value from the 'character' column
        as the match.
        If you cannot find a name in either the characters or aliases, record the match value as 'Unknown'.
        If the recipient appears to be the reader or general audience, record the match value as 'The Reader'.
        If the speaker is talking to themself, record the match value as 'Self'.
        
        For example, these speech sections: {example_input_1}
        with this character list: {example_characters_1}
        and this alias lookup table: {example_aliases_1}
        Should produce the following output: {example_output_1}
        
        Provide your response in JSON.
        Do not change the value of the speaker, recipient fields. Do not include the spoken_words_only field.   
    """
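
The matching rules in the prompt (character list first, then the alias table, otherwise 'Unknown') can also be written as a small rule-based function, which is handy as a cross-check on the model's answers. This is a sketch under simplifying assumptions: it only does exact, case-insensitive matching, so it does not cover typos, 'The Reader' or 'Self'. The function name and the example data are hypothetical.

```python
def match_name(name, character_list, alias_to_character):
    """Return the matched character name, or 'Unknown' if nothing matches."""
    key = name.strip().lower()
    # Rule 1: look for the name in the character list.
    for character in character_list:
        if character.lower() == key:
            return character
    # Rule 2: fall back to the alias lookup table.
    for alias, character in alias_to_character.items():
        if alias.lower() == key:
            return character
    # Rule 3: no match in either source.
    return "Unknown"

# Illustrative data only; the real lists come from the character database.
characters = ["Harry", "Mum", "Tyrannosaurus rex"]
aliases = {"T-rex": "Tyrannosaurus rex"}

print(match_name("harry", characters, aliases))
print(match_name("T-rex", characters, aliases))
print(match_name("Hany", characters, aliases))
```

Disagreements between this baseline and the model are not necessarily model errors, since the prompt explicitly allows close matches, but they are a cheap way to pick out cases worth checking by hand.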

In [991]:
def get_task_3_system_prompt(_task_3_input_schema, _task_3_response_schema):
#     return f"Please use the following schema for your JSON response: {_task_2_response_schema}. Remove all newline characters in your output with a single space."
    return f"""
        You are a data analysis assistant, capable of accurate and precise natural language processing. 
        You will receive data in JSON format with the following schema: {_task_3_input_schema}
        Use the following schema for your JSON response: {_task_3_response_schema}.
        Please start all indexing of lists and arrays at 0 rather than 1.
    """

In [992]:
def run_task_3(full_text, client, characters, aliases, task_2_completion=None, task_1_completion=None, seed=42):
    
    if task_2_completion is None:
        task_1_completion, task_2_completion = run_task_2(full_text, client)
        
    
    task_3_prompt_string = get_task_3_prompt_string(
        example_data,
        task_2_response=json.loads(task_2_completion.choices[0].message.content),
        characters=characters,
        aliases=aliases
    )
    
    task_3_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
           {
               "role": "system", 
               "content": get_task_3_system_prompt(task_2_response_schema_str, task_3_response_schema_str)
           },
           {
               "role": "user", 
               "content": r"{}".format(task_3_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion, task_3_completion

In [935]:
client = OpenAI(api_key=key)

In [885]:
completion_1, completion_2, completion_3 = run_task_3(
    full_text=test_data['strings'][test_id],
    client=client,
    characters=test_data['task_3_characters'][test_id],
    aliases=test_data['task_3_aliases'][test_id]
)

In [886]:
completion_3.usage

CompletionUsage(completion_tokens=215, prompt_tokens=928, total_tokens=1143)

In [887]:
print(completion_3.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "speaker_matched": "Unknown",
      "recipient_matched": "Apatosaurus",
      "spoken_words_only": "That’s a rhinoceros. Triceratops has got more horns.",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "speaker_matched": "Harry",
      "recipient_matched": "Mum",
      "spoken_words_only": "I want to save some animals. What can I do, Mum?",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "speaker_matched": "Mum",
      "recipient_matched": "Harry",
      "spoken_words_only": "Tuh! What a waste of time!",
      "speech_section_id": 2
    }
  ]
}


In [998]:
def run_task_3_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_3_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print("Failed test: speech section lists are different lengths.")
        # Record the failure whether or not verbose output is requested.
        pass_flag = False
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'speaker_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
        'recipient_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
#         'spoken_words_only': {
#             'case_sensitive': True,
#             'remove_leading_the': False
#         },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print("Failed test: speech_section_id not equal.")
            # A mismatched id means the sections are misaligned, so count it as a failure.
            pass_flag = False
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag
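
The task 2 and task 3 test functions differ only in which fields they check and how strictly. One possible shape for the refactor mentioned in the TODO is a single function that takes the per-field comparison rules as an argument. This is a sketch, not the notebook's implementation; `compare_sections`, `exact` and `ignore_case` are hypothetical names.

```python
def compare_sections(expected_sections, actual_sections, field_checks):
    """Return a list of failure messages; an empty list means the test passed."""
    failures = []
    if len(expected_sections) != len(actual_sections):
        failures.append("speech section lists are different lengths")
    for expected, actual in zip(expected_sections, actual_sections):
        section_id = expected.get("speech_section_id")
        if actual.get("speech_section_id") != section_id:
            failures.append(f"section {section_id}: speech_section_id not equal")
        # field_checks maps a field name to a function (expected, actual) -> bool.
        for field, is_equal in field_checks.items():
            if not is_equal(expected.get(field, ""), actual.get(field, "")):
                failures.append(f"section {section_id}: {field} mismatch")
    return failures

def exact(a, b):
    return a == b

def ignore_case(a, b):
    return a.lower() == b.lower()

expected = [{"speech_section_id": 0, "speaker": "Harry", "speaker_matched": "Harry"}]
actual = [{"speech_section_id": 0, "speaker": "harry", "speaker_matched": "Unknown"}]
checks = {"speaker": ignore_case, "speaker_matched": exact}
print(compare_sections(expected, actual, checks))
```

Returning messages instead of printing them keeps the comparison logic separate from reporting, and a task-specific test reduces to choosing the `field_checks` dictionary.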

In [999]:
def run_task_3_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, _, completion_3 = run_task_3(
            full_text=test_data['strings'][test_id], 
            client=client,
            characters=test_data['task_3_characters'][test_id],
            aliases=test_data['task_3_aliases'][test_id]
        )
           
        if run_task_3_test_i(test_id, test_data, completion_3, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [1000]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_3_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Running for corpus

Now that our prompts are passing all tests, we run the method for all books in the corpus and save the results to disk.

#### TODO:
- add a 'self' match example to data (still using himself)
- check Noi and 'his dad' - shouldn't it be Dad? (The Storm Whale In Winter)
- add Narrator handling/example (e.g. There's A Monster In Your Book)
- add a flag for whether the match is a character or something else ('Everyone', 'Narrator', etc.)

In [1001]:
import datetime

In [1002]:
client = OpenAI(api_key=api_key)

book_df = {
    'title': [],
    'speech_section_count': 0,
    'c1_completion_tokens': [],
    'c1_prompt_tokens': [],
    'c1_total_tokens': [],
    'c1_system_fingerprint': [],
    'c2_completion_tokens': [],
    'c2_prompt_tokens': [],
    'c2_total_tokens': [],
    'c2_system_fingerprint': [],
    'c3_completion_tokens': [],
    'c3_prompt_tokens': [],
    'c3_total_tokens': [],
    'c3_system_fingerprint': [],
    'runtime_seconds': []
}
c1_results_dict = {}
c2_results_dict = {}
c3_results_dict = {}

# for book_id in range(len(df)):
for book_id in range(50):
    title = df.iloc[book_id].Title
    print("Book: ", title)
    start = datetime.datetime.now()    
    
    completion_1, completion_2, completion_3 = run_task_3(
        full_text=df.iloc[book_id].Text, 
        client=client,
        characters=characters[characters.book==title],
        aliases=aliases[aliases.book==title]
    )
    
    json_response = json.loads(completion_3.choices[0].message.content)
    
    book_df['c1_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c1_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c1_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c1_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c2_completion_tokens'].append(completion_2.usage.completion_tokens)
    book_df['c2_prompt_tokens'].append(completion_2.usage.prompt_tokens)
    book_df['c2_total_tokens'].append(completion_2.usage.total_tokens)
    book_df['c2_system_fingerprint'].append(completion_2.system_fingerprint)
    book_df['c3_completion_tokens'].append(completion_3.usage.completion_tokens)
    book_df['c3_prompt_tokens'].append(completion_3.usage.prompt_tokens)
    book_df['c3_total_tokens'].append(completion_3.usage.total_tokens)
    book_df['c3_system_fingerprint'].append(completion_3.system_fingerprint)
    
    book_df['title'].append(title)
    book_df['runtime_seconds'].append((datetime.datetime.now() - start).seconds)
    book_df['speech_section_count'] += len(json_response['speech_sections'])
    
    c3_results_dict[title] = json_response
    
    c1_results_dict[title] = json.loads(completion_1.choices[0].message.content)
    c2_results_dict[title] = json.loads(completion_2.choices[0].message.content)
    
book_df = pd.DataFrame(book_df)

Book:  The Night Before Christmas
Book:  Sugarlump and the Unicorn
Book:  The Gruffalo
Book:  The Monstrous Tale of Celery Crumble
Book:  Peace at Last
Book:  Sing A Song Of Bottoms
Book:  Barry The Fish With Fingers
Book:  The Troll
Book:  The Storm Whale In Winter
Book:  There's A Monster In Your Book
Book:  Once Upon A Unicorn Horn
Book:  Mind Your Manners
Book:  The Princess and the Wizard
Book:  Kipper's Toybox
Book:  Oi Frog!
Book:  Elmer and the Lost Teddy
Book:  The Hungry Caterpillar
Book:  A Squash and a Squeeze
Book:  Keith The Cat With The Magic Hat
Book:  Santa is Coming to Devon
Book:  The Enormous Crocodile
Book:  Harry and the Dinosaurs Go Wild
Book:  Open Very Carefully, A Book With Bite!
Book:  The Most Wonderful Gift In The World
Book:  Elmer and Grandpa Eldo
Book:  Tabby McTat
Book:  Harry and the Dinosaurs at the Museum
Book:  The Hedgehog's Balloon
Book:  Jasper's Jungle Journey
Book:  The Owl Who Was Afraid Of The Dark
Book:  The Cross Rabbit
Book:  Shark In The 

In [1003]:
len(book_df)

50
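
The per-completion bookkeeping in the corpus loop repeats the same four fields for each of the three completions, which is easy to get wrong when copy-pasting. A sketch of a helper that builds those columns in a loop is shown below. `usage_row` and `fake_completion` are hypothetical names, and the fake objects only imitate the `usage` and `system_fingerprint` attributes used in this notebook; the token counts and fingerprints are made-up values.

```python
from types import SimpleNamespace

def usage_row(completions):
    """Flatten usage stats from [completion_1, completion_2, ...] into one dict."""
    row = {}
    for i, completion in enumerate(completions, start=1):
        row[f"c{i}_completion_tokens"] = completion.usage.completion_tokens
        row[f"c{i}_prompt_tokens"] = completion.usage.prompt_tokens
        row[f"c{i}_total_tokens"] = completion.usage.total_tokens
        row[f"c{i}_system_fingerprint"] = completion.system_fingerprint
    return row

def fake_completion(completion_tokens, prompt_tokens, fingerprint):
    """Stand-in object exposing only the attributes read by usage_row."""
    usage = SimpleNamespace(
        completion_tokens=completion_tokens,
        prompt_tokens=prompt_tokens,
        total_tokens=completion_tokens + prompt_tokens,
    )
    return SimpleNamespace(usage=usage, system_fingerprint=fingerprint)

row = usage_row([
    fake_completion(156, 666, "fp_placeholder_a"),
    fake_completion(215, 928, "fp_placeholder_b"),
])
print(row["c1_total_tokens"], row["c2_total_tokens"])
```

Collecting one such dict per book into a list and passing the list to `pd.DataFrame` would give the same columns as `book_df`.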

In [1006]:
# Convert results dicts to dataframe of all speech sections and save to disk:

all_speech_sections = {
    'book': [],
    'speech_section_id': [], # speech section id within book
    'speaker': [],
    'recipient': [],
    'speaker_matched': [],
    'recipient_matched': [],
    'speech_text': [],
    'spoken_words_only': [],
    'spoken_word_count': []
}

for book in c3_results_dict.keys():
    print(book)
    for si, section in enumerate(c3_results_dict[book]['speech_sections']):
        all_speech_sections['book'].append(book)
        for key in all_speech_sections.keys():
            if key not in ['book', 'spoken_word_count', 'speech_text', 'spoken_words_only']:
                all_speech_sections[key].append(section[key])
        
        all_speech_sections['speech_text'].append(c1_results_dict[book]['speech_sections'][si]['speech_text'])
        all_speech_sections['spoken_words_only'].append(c2_results_dict[book]['speech_sections'][si]['spoken_words_only'])
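        # NOTE: len() on a string counts characters, not words, so despite its name this column
        # holds character counts (e.g. 'Friends!' gives 8). Use len(text.split()) for a word count.
        all_speech_sections['spoken_word_count'].append(len(c2_results_dict[book]['speech_sections'][si]['spoken_words_only']))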

The Night Before Christmas
Sugarlump and the Unicorn
The Gruffalo
The Monstrous Tale of Celery Crumble
Peace at Last
Sing A Song Of Bottoms
Barry The Fish With Fingers
The Troll
The Storm Whale In Winter
There's A Monster In Your Book
Once Upon A Unicorn Horn
Mind Your Manners
The Princess and the Wizard
Kipper's Toybox
Oi Frog!
Elmer and the Lost Teddy
The Hungry Caterpillar
A Squash and a Squeeze
Keith The Cat With The Magic Hat
Santa is Coming to Devon
The Enormous Crocodile
Harry and the Dinosaurs Go Wild
Open Very Carefully, A Book With Bite!
The Most Wonderful Gift In The World
Elmer and Grandpa Eldo
Tabby McTat
Harry and the Dinosaurs at the Museum
The Hedgehog's Balloon
Jasper's Jungle Journey
The Owl Who Was Afraid Of The Dark
The Cross Rabbit
Shark In The Dark
A Thing Called Snow
Yoga Babies
The Way Home For Wolf
Elmer and Wilbur
One Starry Night
I Need A Wee
Harry and the Dinosaurs say Raahh!
An Alphabet Of Stories
The Owl's Lesson
The Very Lazy Ladybird
The Dinosaur Departm

In [1007]:
all_speech_sections = pd.DataFrame(all_speech_sections)
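
The comment above mentions saving the results to disk, but the saving step itself is not shown in this section. One simple option is to write the raw result dictionaries to JSON, so that manual validation can be run against a single static result set. The sketch below uses a temporary directory and a hypothetical file name; the example dictionary is illustrative.

```python
import json
import os
import tempfile

def save_results(results_by_title, path):
    """Write a {title: response} dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps curly quotes and other non-ASCII text readable.
        json.dump(results_by_title, f, ensure_ascii=False, indent=2)

def load_results(path):
    """Read the dictionary back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

c3_example = {
    "Elmer and the Stranger": {
        "speech_sections": [
            {"speaker": "Elmer", "recipient": "Lion and Tiger",
             "speaker_matched": "Elmer", "recipient_matched": "Lion and Tiger",
             "speech_section_id": 37}
        ]
    }
}

path = os.path.join(tempfile.mkdtemp(), "c3_results.json")  # hypothetical file name
save_results(c3_example, path)
print(load_results(path) == c3_example)
```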

In [1008]:
all_speech_sections

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count
0,The Night Before Christmas,0,St. Nicholas,Reindeer,St. Nicholas,Unknown,"Now, Dasher! now, Dancer!\nnow, Prancer and Vi...","Now, Dasher! now, Dancer! now, Prancer and Vix...",183
1,The Night Before Christmas,1,St. Nicholas,Everyone,St. Nicholas,The Reader,"Happy\nChristmas to all, and to all a good\nni...","Happy Christmas to all, and to all a good night!",48
2,Sugarlump and the Unicorn,0,Sugarlump,himself,Sugarlump,Self,"""Here in the children's bedroom\nIs where I wa...",Here in the children's bedroom Is where I want...,106
3,Sugarlump and the Unicorn,1,Sugarlump,himself,Sugarlump,Self,"""Oh to be out in the big wide world!\nI wish I...",Oh to be out in the big wide world! I wish I c...,65
4,Sugarlump and the Unicorn,2,unicorn,Sugarlump,unicorn,Sugarlump,"""Done!"" came a voice, and there stood a beast\...","Done! came a voice, and there stood a beast Wi...",128
...,...,...,...,...,...,...,...,...,...
806,Elmer and the Stranger,37,Elmer,Lion and Tiger,Elmer,Lion and Tiger,Yes. And now we’re all... aah...,Yes. And now we’re all... aah...,32
807,Elmer and the Stranger,38,"Elmer, Lion, and Tiger",each other,"Elmer, Lion, and Tiger",each other,Friends!,Friends!,8
808,Dinosaurs Love Underpants,0,T-rex,cavemen,Tyrannosaurus rex,cavemen,"“I don’t\nwant to eat you up, I want your\nund...","I don’t want to eat you up, I want your underp...",51
809,Dinosaurs Love Underpants,1,cavemen,each other,cavemen,Self,“We’ve too few knickers to go around!” The\nca...,We’ve too few knickers to go around! The cavem...,123
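
The `spoken_word_count` values in the table above come from calling `len()` on the `spoken_words_only` string, which counts characters rather than words. If a true word count is wanted, a minimal sketch is to split on whitespace (`count_words` is a hypothetical helper; punctuation stays attached to its word):

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())

text = "Happy Christmas to all, and to all a good night!"
print(len(text))          # number of characters, as stored in the table
print(count_words(text))  # number of words
```

Any per-character averages computed from this column therefore measure characters spoken, which is still a usable proxy for volume of speech but should be labelled accordingly.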


In [1036]:
speakers = all_speech_sections.merge(characters, how='left', left_on=['speaker_matched', 'book'], right_on=['name', 'book'])

In [1061]:
len(speakers)

811

In [1065]:
187/811

0.23057953144266338

In [1063]:
len(speakers[speakers.name_recipient.isna()])

187

In [1071]:
print(len(speakers[speakers.name_recipient.isna()].book.unique()))
print(speakers[speakers.name_recipient.isna()].book.unique())

37
['The Night Before Christmas' 'Sugarlump and the Unicorn' 'The Gruffalo'
 'The Monstrous Tale of Celery Crumble' 'Peace at Last'
 'Sing A Song Of Bottoms' 'The Troll' 'The Storm Whale In Winter'
 "There's A Monster In Your Book" 'Once Upon A Unicorn Horn'
 'Mind Your Manners' 'The Princess and the Wizard' "Kipper's Toybox"
 'Elmer and the Lost Teddy' 'Santa is Coming to Devon'
 'The Enormous Crocodile' 'Harry and the Dinosaurs Go Wild'
 'Open Very Carefully, A Book With Bite!'
 'The Most Wonderful Gift In The World' 'Tabby McTat'
 'Harry and the Dinosaurs at the Museum' "The Hedgehog's Balloon"
 'The Owl Who Was Afraid Of The Dark' 'The Cross Rabbit'
 'Shark In The Dark' 'A Thing Called Snow' 'The Way Home For Wolf'
 'I Need A Wee' 'Harry and the Dinosaurs say Raahh!'
 'An Alphabet Of Stories' "The Owl's Lesson"
 'The Dinosaur Department Store' "The Fox's Hiccups" 'Cave Baby'
 'The Rescue Party' 'Elmer and the Stranger' 'Dinosaurs Love Underpants']


In [1070]:
print(len(speakers[speakers.name_speaker.isna()].book.unique()))
print(speakers[speakers.name_speaker.isna()].book.unique())

24
['Sing A Song Of Bottoms' 'The Troll' "There's A Monster In Your Book"
 'Once Upon A Unicorn Horn' 'Mind Your Manners' "Kipper's Toybox"
 'Elmer and the Lost Teddy' 'Santa is Coming to Devon'
 'The Enormous Crocodile' 'Harry and the Dinosaurs Go Wild'
 'Open Very Carefully, A Book With Bite!'
 'The Most Wonderful Gift In The World' 'Tabby McTat'
 'Harry and the Dinosaurs at the Museum' 'The Cross Rabbit'
 'A Thing Called Snow' 'I Need A Wee' 'Harry and the Dinosaurs say Raahh!'
 'An Alphabet Of Stories' "The Owl's Lesson"
 'The Dinosaur Department Store' 'Cave Baby' 'The Rescue Party'
 'Elmer and the Stranger']


In [1039]:
speakers = speakers.merge(characters, how='left', left_on=['recipient_matched', 'book'], right_on=['name', 'book'], suffixes=['_speaker', '_recipient'])

In [1040]:
speakers

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
0,The Night Before Christmas,0,St. Nicholas,Reindeer,St. Nicholas,Unknown,"Now, Dasher! now, Dancer!\nnow, Prancer and Vi...","Now, Dasher! now, Dancer! now, Prancer and Vix...",183,St. Nicholas,M,H,1.0,,,,
1,The Night Before Christmas,1,St. Nicholas,Everyone,St. Nicholas,The Reader,"Happy\nChristmas to all, and to all a good\nni...","Happy Christmas to all, and to all a good night!",48,St. Nicholas,M,H,1.0,,,,
2,Sugarlump and the Unicorn,0,Sugarlump,himself,Sugarlump,Self,"""Here in the children's bedroom\nIs where I wa...",Here in the children's bedroom Is where I want...,106,Sugarlump,M,NH,0.0,,,,
3,Sugarlump and the Unicorn,1,Sugarlump,himself,Sugarlump,Self,"""Oh to be out in the big wide world!\nI wish I...",Oh to be out in the big wide world! I wish I c...,65,Sugarlump,M,NH,0.0,,,,
4,Sugarlump and the Unicorn,2,unicorn,Sugarlump,unicorn,Sugarlump,"""Done!"" came a voice, and there stood a beast\...","Done! came a voice, and there stood a beast Wi...",128,unicorn,F,NH,0.0,Sugarlump,M,NH,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
806,Elmer and the Stranger,37,Elmer,Lion and Tiger,Elmer,Lion and Tiger,Yes. And now we’re all... aah...,Yes. And now we’re all... aah...,32,Elmer,M,NH,0.0,,,,
807,Elmer and the Stranger,38,"Elmer, Lion, and Tiger",each other,"Elmer, Lion, and Tiger",each other,Friends!,Friends!,8,,,,,,,,
808,Dinosaurs Love Underpants,0,T-rex,cavemen,Tyrannosaurus rex,cavemen,"“I don’t\nwant to eat you up, I want your\nund...","I don’t want to eat you up, I want your underp...",51,Tyrannosaurus rex,M,NH,1.0,cavemen,M,H,0.0
809,Dinosaurs Love Underpants,1,cavemen,each other,cavemen,Self,“We’ve too few knickers to go around!” The\nca...,We’ve too few knickers to go around! The cavem...,123,cavemen,M,H,0.0,,,,


### Preliminary analysis:

In [1013]:
female_character_count = sum(characters.gender == 'F')
male_character_count = sum(characters.gender == 'M')
ngs_character_count = sum(characters.gender == 'NGS')

In [1043]:
ngs_character_count

558

In [1044]:
male_spoken_word_count = speakers[speakers.gender_speaker == 'M'].spoken_word_count.sum()
female_spoken_word_count = speakers[speakers.gender_speaker == 'F'].spoken_word_count.sum()
ngs_spoken_word_count = speakers[speakers.gender_speaker == 'NGS'].spoken_word_count.sum()

In [1045]:
male_spoken_word_count / male_character_count

52.18846153846154

In [1046]:
female_spoken_word_count / female_character_count

20.43558282208589

In [1047]:
ngs_spoken_word_count / ngs_character_count

12.39605734767025

In [1048]:
female_character_count / male_character_count

0.6269230769230769

In [1049]:
male_received_word_count = speakers[speakers.gender_recipient == 'M'].spoken_word_count.sum()
female_received_word_count = speakers[speakers.gender_recipient == 'F'].spoken_word_count.sum()
ngs_received_word_count = speakers[speakers.gender_recipient == 'NGS'].spoken_word_count.sum()

In [1050]:
male_received_word_count / male_character_count

44.03846153846154

In [1051]:
female_received_word_count / female_character_count

14.423312883435583

In [1052]:
ngs_received_word_count / ngs_character_count

10.24731182795699

In [1053]:
male_speech_sections = len(speakers[speakers.gender_speaker == 'M'])
female_speech_sections = len(speakers[speakers.gender_speaker == 'F'])
ngs_speech_sections = len(speakers[speakers.gender_speaker == 'NGS'])

In [1054]:
male_speech_sections / male_character_count

0.9211538461538461

In [1055]:
female_speech_sections / female_character_count

0.3588957055214724

In [1056]:
ngs_speech_sections / ngs_character_count

0.22939068100358423

In [1028]:
characters[characters.gender == 'F'].groupby('name').agg('count').sort_values('book', ascending=False).head(20)

Unnamed: 0_level_0,book,gender,human,alias_count
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mum,18,18,18,18
mum,10,10,10,10
Mummy,8,8,8,8
Sam,7,7,7,7
Granny,5,5,5,5
mother,5,5,5,5
Cinderella,5,5,5,5
Mary,5,5,5,5
girl,5,5,5,5
cow,5,5,5,5


#### Cells beyond this point were removed - they were just exploratory. Check legacy branch for content if needed.