### Task decomposition and prompt development.

We provide here a framework for developing sequential prompts for a complex NLP task using GPT4o.

The complex tasks is decompsed into as sequence of simpler tasks that each build on the previous one.

For each task in the sequence we produce a number of examples to show GPT4o, and a number of further examples to use for automated testing of the response. This borrows ideas from unit testing of software, since iterative changes to the prompt may break functionality that was previously working.

This framework can be adapted to use with other LLMs and NLP tasks.

#### We decompose into the following tasks:
- Task 1: identify sections of direct speech, and the name of the speaker and recipient
- Task 1b: locate pre-defined sentences in these detected sections of speech (for comparison with human coding)
- Task 2: pull out the spoken words only from each section (removing e.g. 'he said' etc)
- Task 3: locate and replace the names of the speakers and recipients using a pre-defined character map  

#### On gender and character coding (for article and data entry tools):
- We code mixed plural/groups as NGS, and NH
- We ask users to include groups as a separate character (e.g. pigs in three pigs, reindeer in...)
- We ask users to code all characters that speak or are spoken to including inaimate objects (NH)

#### Notes:
- Determinism is not guaranteed. But well structured prompts should produce near deterministic outputs, along with temperature=0, fixed seed. It is also worth storing the system finerprint for future reference, as changes to this may be the cause of differing results in the future.  
- The lack of determinism can make tests quite brittle. It is worth repeating tests several times to confirm their behvaiour. And then running the full manual validation on a single static result set.
- Where possible, the sequential tasks should b tackled as a new completion API, using formatted output from the previous task as the inupt. This is preferrable to chaining of prompts and outputs to produce a chat style conversation, but this increases the risk of conflict or confusion between prompts/instructions sets. And also increases the length of the context window.
- Need to ensure consistency between instructions, schemas and examples. Otherwise results may be inconsistent e.g. 'reproduce all punctuation all it appears' conflicted with 'remove  speech marks' example.
- Should typos be accounted for (e.g. task_3 name matching?)
- Cost: \\$1.22 left after developing prompts. Added \\$10 to run for 50 books (so ~0.25 full dataset). 

#### Automating this using the Chat GPT API:

## TODO:
 
 - When in pipeline to spellcheck/ correct typos? (e.g. Hany in the Dinosaurs (book 21).
 - what to do about inconsitent sentence detection? e.g "Now Dasher!" being at the end of sentence 7 was causing GPT confusion...
 - add an instruction about how to refer to 'general audience' or 'narrator' or 'I'
 - ask for output of reasoning/thought process?
 - ask for a confidence score?
 - do we need to specify (in system prompt), not to use MD or any other formatting in the json output?
 
## Note: ideas to explore if we need performance boost...

- system message to edit assistant role
- vary temperature or top_p parameter
- fine_tuning a model with bespoke training data (how much data is necessary?)
- improved instructions or prompt engineering (see e.g. paper on iterative prompting)
- compare results with gpt-3.5-turbo? - does not seem to work well for our use case!

In [1]:
import os
import json
import pdfplumber
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import string
import spacy
from spacy import displacy
from spacy.lang.en.examples import sentences 
from openai import OpenAI
import pickle

%matplotlib inline

In [2]:
nlp = spacy.load("en_core_web_lg")

In [3]:
with open('./key.txt', 'r') as infile:
    api_key = infile.read().splitlines()[0]

In [4]:
with open('data/tempdf.pickle', 'rb') as outfile:
    df = pickle.load(outfile)

In [5]:
# Import our prompts, example data (for in-context learning), and test data for unit testing each task: 
from openai_api.examples import example_data
from openai_api.tests import build_test_data

test_data = build_test_data(df)

#### Converting the full dataset into a dataframe of sentences

In [6]:
from openai_api.utilities import spacy_extract_sentences
sentences = spacy_extract_sentences(df, nlp)

#### Check that this sample contains the same sentences that were manually coded previously.

In [7]:
coding_sample = sentences.sample(frac=0.15, axis=0, random_state=42)
manually_coded = pd.read_csv('./sentences_for_coding/sample_15pc.csv', delimiter='\t', index_col=0)

In [8]:
text_equal = [
    i == j.text
    for i,j in
    zip(manually_coded.sentence, coding_sample.sentence)
]    
assert sum(text_equal) == len(text_equal)

##### Build OpenAI API code...

In [9]:
from openai_api.schemas import build_task_1_input_schema, build_task_1_response_schema
from openai_api.utilities import *

task_1_response_schema_str = build_task_1_response_schema()
task_1_input_schema_str = build_task_1_input_schema()

In [10]:
from openai_api.tests import run_task_1_test_i

In [11]:
from openai_api.prompts import (
    get_task_1_prompt_string, get_task_1_system_prompt,
    get_task_2_prompt_string, get_task_2_system_prompt,
    get_task_3_prompt_string, get_task_3_system_prompt
)

In [12]:
def run_task_1(full_text, client, seed=42):
    
    prompt_string = get_task_1_prompt_string(
            data={'full_text': full_text}, 
        )
        
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": get_task_1_system_prompt()},
            {"role": "user", "content": r"{}".format(prompt_string)}
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
        seed=seed
    )
    
    return completion    

In [13]:
def run_task_1_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")
        
        completion = run_task_1(test_data['strings'][test_id], client=client)
        
        if run_task_1_test_i(test_id, test_data, completion, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [14]:
success_count = 0
for i in range(20):
    print(f"Running repeat {i}")
    
    success = run_task_1_tests(test_data)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 5
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 6
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 7
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 8
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 9
Running test: 0
Test 0: pass


KeyboardInterrupt: 

### Task 2: pulling out spoken words only.

#### TODO:
- refactor run_test_i method
- move schemas and test and example data to files
- rename as tak 2 or rename functions and strings!

In [14]:
def run_task_2(full_text, client, task_1_completion=None, seed=42):
    
    if task_1_completion is None:
        task_1_completion = run_task_1(full_text, client)
        
    task_1_prompt_string = get_task_1_prompt_string(
        data={'full_text': full_text}
    )
    
    task_2_prompt_string = get_task_2_prompt_string(
        task_1_response=json.loads(task_1_completion.choices[0].message.content)
    )
    
    task_2_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
           {
               "role": "system", 
               "content": get_task_2_system_prompt()
           },
           {
               "role": "user", 
               "content": r"{}".format(task_2_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion

In [15]:
client = OpenAI(api_key=api_key)

In [16]:
test_id = 1

In [17]:
completion_1, completion_2 = run_task_2(
    full_text=test_data['strings'][test_id],
    client=client
)

In [18]:
completion_2.usage

CompletionUsage(completion_tokens=156, prompt_tokens=679, total_tokens=835, completion_tokens_details={'reasoning_tokens': 0})

In [19]:
print(completion_2.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "spoken_words_only": "That’s a rhinoceros. Triceratops has got more horns.",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "spoken_words_only": "I want to save some animals. What can I do, Mum?",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "spoken_words_only": "Tuh! What a waste of time!",
      "speech_section_id": 2
    }
  ]
}


In [20]:
def run_task_2_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_2_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print(f"Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            pass_flag = False
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'spoken_words_only': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print(f"Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag

In [21]:
def run_task_2_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, completion_2 = run_task_2(
            full_text=test_data['strings'][test_id], 
            client=client
        )
           
        if run_task_2_test_i(test_id, test_data, completion_2, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [25]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_2_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Task 3: mappnig character names.

In [22]:
import sqlite3

In [23]:
conn = sqlite3.connect('character_database.db')

In [24]:
aliases = pd.read_sql('select * from aliases', conn, index_col='index')
characters = pd.read_sql('select * from characters', conn, index_col='index')

In [25]:
meta_character_list = [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator',
    'Reindeer', 'Dinosaurs', 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]

In [26]:
def run_task_3(
    full_text, client, characters, aliases, 
    task_2_completion=None, 
    task_1_completion=None, 
    _meta_character_list=meta_character_list,
    seed=42
):
    
    if task_2_completion is None:
        task_1_completion, task_2_completion = run_task_2(full_text, client)
        
    
    task_3_prompt_string = get_task_3_prompt_string(
        task_2_response=json.loads(task_2_completion.choices[0].message.content),
        characters=characters,
        aliases=aliases,
        meta_characters=_meta_character_list
    )
    
    task_3_completion = client.chat.completions.create(
       model="gpt-4o",
       messages=[
           {
               "role": "system", 
               "content": get_task_3_system_prompt()
           },
           {
               "role": "user", 
               "content": r"{}".format(task_3_prompt_string)
           }
       ],
       temperature=0.0,
       response_format={"type": "json_object"},
       seed=seed
    )
    return task_1_completion, task_2_completion, task_3_completion

In [27]:
client = OpenAI(api_key=api_key)

In [28]:
test_id = 1

In [29]:
completion_1, completion_2, completion_3 = run_task_3(
    full_text=test_data['strings'][test_id],
    client=client,
    characters=test_data['task_3_characters'][test_id],
    aliases=test_data['task_3_aliases'][test_id]
)

In [30]:
completion_3.usage

CompletionUsage(completion_tokens=155, prompt_tokens=942, total_tokens=1097, completion_tokens_details={'reasoning_tokens': 0})

In [31]:
print(completion_3.choices[0].message.content)

{
  "speech_sections": [
    {
      "speaker": "Hany",
      "recipient": "Apatosaurus",
      "speaker_matched": "Unknown",
      "recipient_matched": "Apatosaurus",
      "speech_section_id": 0
    },
    {
      "speaker": "Harry",
      "recipient": "Mum",
      "speaker_matched": "Harry",
      "recipient_matched": "Mum",
      "speech_section_id": 1
    },
    {
      "speaker": "Sam",
      "recipient": "Harry",
      "speaker_matched": "Mum",
      "recipient_matched": "Harry",
      "speech_section_id": 2
    }
  ]
}


In [32]:
def run_task_3_test_i(test_id, test_data, completion, verbose=False):
    response = json.loads(completion.choices[0].message.content)
    correct_speech_sections = test_data['task_3_responses'][test_id]['speech_sections']

    pass_flag = True

    try:
        assert len(response['speech_sections']) == len(correct_speech_sections)
    except AssertionError:
        print(f"Failed test: speech section lists are different lengths.")
        if verbose:
            print(response['speech_sections'])
            print(correct_speech_sections)
            pass_flag = False
            
    test_elemtents = {
        'speaker': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'recipient': {
            'case_sensitive': False,
            'remove_leading_the': True
        },
        'speaker_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
        'recipient_matched': {
            'case_sensitive': True,
            'remove_leading_the': False
        },
#         'spoken_words_only': {
#             'case_sensitive': True,
#             'remove_leading_the': False
#         },
    }
    
    for correct_section, section in zip(correct_speech_sections, response['speech_sections']):

        try:
            assert section['speech_section_id'] == correct_section['speech_section_id']
        except AssertionError:
            print(f"Failed test: speech section_id not equal.")
            if verbose:
                print(correct_section)
                print(section)
                
        for element in test_elemtents.keys():
            try:
                assert compare_strings(
                    correct_section[element], 
                    section[element], 
                    _case_sensitive=test_elemtents[element]['case_sensitive'],
                    _remove_leading_the=test_elemtents[element]['remove_leading_the']
                )
            except AssertionError:
                print(f"Failed {element} test for section: {section}")
                if verbose:
                    print(correct_section[element])
                pass_flag = False
                
    return pass_flag

In [33]:
def run_task_3_tests(test_data, verbose=False):
    
    client = OpenAI(api_key=api_key)

    pass_all = True
    
    for test_id in test_data['test_ids']:
        print(f"Running test: {test_id}")

        _, _, completion_3 = run_task_3(
            full_text=test_data['strings'][test_id], 
            client=client,
            characters=test_data['task_3_characters'][test_id],
            aliases=test_data['task_3_aliases'][test_id]
        )
           
        if run_task_3_test_i(test_id, test_data, completion_3, verbose=verbose):
            print(f"Test {test_id}: pass")
        else: 
            print(f"Test {test_id}: fail")
            pass_all = False
    
    return pass_all

In [36]:
success_count = 0
for i in range(5):
    print(f"Running repeat {i}")
    
    success = run_task_3_tests(test_data, verbose=True)
    success_count += success
    print('\n')

print(f"Success count: {success_count}")

Running repeat 0
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 1
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 2
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 3
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Running repeat 4
Running test: 0
Test 0: pass
Running test: 1
Test 1: pass
Running test: 2
Test 2: pass


Success count: 5


### Running for full corpus

Now that our prompts are passing all tests, we run the method for all books in the corpus and save the results to disk....

#### TODO:
- add a 'self' match example to data (still using himself)
- check Noi and 'his dad' - shouldn't it be Dad? (The Storm Whale In Winter)
- add Narrator handling/example (e.g. There's A Monster In Your Book)
- add a flag for if it is a character match or something else ('Everyone' Narrator' etc)

In [34]:
import datetime

In [61]:
# Note: that this only handles two chunks currently. And is not an optimal way of splitting since sections of speech may be separated, for example.
# max_chunk_size = 9677
multi_chunk_books = {
    'The Enormous Crocodile': 9677,
    'How The Grinch Stole Christmas': 3794, 
    'Farmer Duck': 1160,
    "Ravi's Roar": 1140
}

In [62]:
# df['book_length'] = np.array([len(t) for t in df.Text])

In [125]:
# These are not story books and are also the longest so would possibly need splitting.
remove_non_stories = [
    'All Year Round', 'All About Feelings', 'Ten in the Bed and Other Counting Rhymes', 'Why Am I An Insect'
]

In [64]:
# df.sort_values(by='book_length', ascending=False).head(40)

In [65]:
def process_chunk(
    title, chunk_name, chunk_text, client, book_df, 
    c1_results_dict, c2_results_dict, c3_results_dict
):
    
    print("Book: ", title)
    start = datetime.datetime.now()    
    
    completion_1, completion_2, completion_3 = run_task_3(
        full_text=chunk_text, 
        client=client,
        characters=characters[characters.book==title],
        aliases=aliases[aliases.book==title],
        _meta_character_list=meta_character_list
    )
    
    json_response = json.loads(completion_3.choices[0].message.content)
    
    book_df['c1_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c1_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c1_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c1_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c2_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c2_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c2_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c2_system_fingerprint'].append(completion_1.system_fingerprint)
    book_df['c3_completion_tokens'].append(completion_1.usage.completion_tokens)
    book_df['c3_prompt_tokens'].append(completion_1.usage.prompt_tokens)
    book_df['c3_total_tokens'].append(completion_1.usage.total_tokens)
    book_df['c3_system_fingerprint'].append(completion_1.system_fingerprint)
    
    book_df['title'].append(chunk_name)
    book_df['runtime_seconds'].append((datetime.datetime.now() - start).seconds)
    book_df['speech_section_count'].append(len(json_response['speech_sections']))
    
    c3_results_dict[chunk_name] = json_response
    
    c1_results_dict[chunk_name] = json.loads(completion_1.choices[0].message.content)
    c2_results_dict[chunk_name] = json.loads(completion_2.choices[0].message.content)

In [66]:
client = OpenAI(api_key=api_key)

book_df = {
    'title': [],
    'speech_section_count': [],
    'c1_completion_tokens': [],
    'c1_prompt_tokens': [],
    'c1_total_tokens': [],
    'c1_system_fingerprint': [],
    'c2_completion_tokens': [],
    'c2_prompt_tokens': [],
    'c2_total_tokens': [],
    'c2_system_fingerprint': [],
    'c3_completion_tokens': [],
    'c3_prompt_tokens': [],
    'c3_total_tokens': [],
    'c3_system_fingerprint': [],
    'runtime_seconds': []
}
c1_results_dict = {}
c2_results_dict = {}
c3_results_dict = {}

for book_id in df.index:
    
    title = df.iloc[book_id].Title
    
    if title not in remove_non_stories:
        book_text = df.iloc[book_id].Text
    
        if title in multi_chunk_books.keys():
            max_chunk_size = multi_chunk_books[title]
            last_newline = book_text[0:max_chunk_size].rfind('\n')
            chunks = {
                ''.join(['_chunk_a_', title]): book_text[0:max_chunk_size],
                ''.join(['_chunk_b_', title]): book_text[max_chunk_size:]
            }
            for chunk in chunks.keys():
                process_chunk(
                    title=title, 
                    chunk_name=chunk, 
                    chunk_text=chunks[chunk], 
                    client=client, 
                    book_df=book_df, 
                    c1_results_dict=c1_results_dict, 
                    c2_results_dict=c2_results_dict, 
                    c3_results_dict=c3_results_dict
                )

        else:
            process_chunk(
                title=title, 
                chunk_name=title, 
                chunk_text=book_text, 
                client=client, 
                book_df=book_df, 
                c1_results_dict=c1_results_dict, 
                c2_results_dict=c2_results_dict, 
                c3_results_dict=c3_results_dict
            )

book_df = pd.DataFrame(book_df)

Book:  The Night Before Christmas
Book:  Sugarlump and the Unicorn
Book:  The Gruffalo
Book:  The Monstrous Tale of Celery Crumble
Book:  Peace at Last
Book:  Sing A Song Of Bottoms
Book:  Barry The Fish With Fingers
Book:  The Troll
Book:  The Storm Whale In Winter
Book:  There's A Monster In Your Book
Book:  Once Upon A Unicorn Horn
Book:  Mind Your Manners
Book:  The Princess and the Wizard
Book:  Kipper's Toybox
Book:  Oi Frog!
Book:  Elmer and the Lost Teddy
Book:  The Hungry Caterpillar
Book:  A Squash and a Squeeze
Book:  Keith The Cat With The Magic Hat
Book:  Santa is Coming to Devon
Book:  The Enormous Crocodile
Book:  The Enormous Crocodile
Book:  Harry and the Dinosaurs Go Wild
Book:  Open Very Carefully, A Book With Bite!
Book:  The Most Wonderful Gift In The World
Book:  Elmer and Grandpa Eldo
Book:  Tabby McTat
Book:  Harry and the Dinosaurs at the Museum
Book:  The Hedgehog's Balloon
Book:  Jasper's Jungle Journey
Book:  The Owl Who Was Afraid Of The Dark
Book:  The Cro

In [67]:
len(characters.name.unique())

906

In [68]:
book_df = pd.DataFrame(book_df)

In [69]:
len(book_df)

192

In [71]:
book_df.to_json('data/gpt4_output_summary_corpus.json')

In [126]:
book_df[book_df.title == 'The Troll']

Unnamed: 0,title,speech_section_count,c1_completion_tokens,c1_prompt_tokens,c1_total_tokens,c1_system_fingerprint,c2_completion_tokens,c2_prompt_tokens,c2_total_tokens,c2_system_fingerprint,c3_completion_tokens,c3_prompt_tokens,c3_total_tokens,c3_system_fingerprint,runtime_seconds
7,The Troll,54,2483,2497,4980,fp_3537616b13,2483,2497,4980,fp_3537616b13,2483,2497,4980,fp_3537616b13,204


In [75]:
# Convert results dicts to dataframe of all speech sections and save to disk:

all_speech_sections = {
    'book': [],
    'speech_section_id': [], # speech section id within book
    'speaker': [],
    'recipient': [],
    'speaker_matched': [],
    'recipient_matched': [],
    'speech_text': [],
    'spoken_words_only': [],
    'spoken_word_count': []
}

for book in c3_results_dict.keys():
    print(book)
    for si, section in enumerate(c3_results_dict[book]['speech_sections']):
        all_speech_sections['book'].append(book)
        for key in all_speech_sections.keys():
            if key not in ['book', 'spoken_word_count', 'speech_text', 'spoken_words_only']:
                all_speech_sections[key].append(section[key])
        
        # Handling edge cases where a speach section is split:
        si_corrected = si
        if si >= len(c1_results_dict[book]['speech_sections']):
            si_corrected = len(c1_results_dict[book]['speech_sections']) - 1
        all_speech_sections['speech_text'].append(c1_results_dict[book]['speech_sections'][si_corrected]['speech_text'])
        
        if si >= len(c2_results_dict[book]['speech_sections']):
            si_corrected = len(c2_results_dict[book]['speech_sections']) - 1
        all_speech_sections['spoken_words_only'].append(c2_results_dict[book]['speech_sections'][si_corrected]['spoken_words_only'])
        all_speech_sections['spoken_word_count'].append(len(c2_results_dict[book]['speech_sections'][si_corrected]['spoken_words_only']))

The Night Before Christmas
Sugarlump and the Unicorn
The Gruffalo
The Monstrous Tale of Celery Crumble
Peace at Last
Sing A Song Of Bottoms
Barry The Fish With Fingers
The Troll
The Storm Whale In Winter
There's A Monster In Your Book
Once Upon A Unicorn Horn
Mind Your Manners
The Princess and the Wizard
Kipper's Toybox
Oi Frog!
Elmer and the Lost Teddy
The Hungry Caterpillar
A Squash and a Squeeze
Keith The Cat With The Magic Hat
Santa is Coming to Devon
_chunk_a_The Enormous Crocodile
_chunk_b_The Enormous Crocodile
Harry and the Dinosaurs Go Wild
Open Very Carefully, A Book With Bite!
The Most Wonderful Gift In The World
Elmer and Grandpa Eldo
Tabby McTat
Harry and the Dinosaurs at the Museum
The Hedgehog's Balloon
Jasper's Jungle Journey
The Owl Who Was Afraid Of The Dark
The Cross Rabbit
Shark In The Dark
A Thing Called Snow
Yoga Babies
The Way Home For Wolf
Elmer and Wilbur
One Starry Night
I Need A Wee
Harry and the Dinosaurs say Raahh!
An Alphabet Of Stories
The Owl's Lesson
Th

In [76]:
all_speech_sections = pd.DataFrame(all_speech_sections)

In [77]:
# with open('data/gpt4_all_speech_sections_corpus.json', 'r') as outfile:
#     all_speech_sections = pd.read_json(outfile)

##### We replace any chunked book titles with the original, adding columns to retain chunk information in case needed later.

In [78]:
all_speech_sections['chunk_titles'] = all_speech_sections['book']
all_speech_sections['book'] = [
    title
    if '_chunk_' not in title
    else title.split('_')[3]
    for title in all_speech_sections.book]

#### We replace compound characters ('and' and ','):

In [79]:
def flatten(xss):
    return [x for xs in xss for x in xs]

In [80]:
def replace_compound_characters(_speech_sections, characters, character_type='speaker'):
    
    compound_character_data = {
        'book': [],#'all' for c in meta_character_list],
        'name': [],#c for c in meta_character_list],
        'gender': [],#'NGS' for c in meta_character_list],
        'human': [],#'NH' if c in ['Dinosaurs', 'Reindeer', 'Elmer and Grandpa Eldo'] else 'H' for c in meta_character_list],
        'alias_count': [],#0 for c in meta_character_list]
    }

    matched_character = []
    compound_count = 0
    for ri, row in _speech_sections.iterrows():
        
        book_characters = characters[characters.book == row.book]
        
        cpts = flatten([
            s.split(',') for s in row[character_type].split('and')
        ])
                                
        if len(cpts) > 1:
            compound_count += 1
            cpts = [c.strip() for c in cpts]
            character_list = []
            for cpt in cpts:
                if cpt.lower() in list(book_characters.name.str.lower()):
                    character_list.append(cpt)

            if len(cpts) != len(character_list):
                # handles mr + mrs surname etc
#                 print(row[character_type])
                try:
                    broadcast_name = cpts[0] + ' ' + cpts[1].split(' ')[1].strip()
                    if broadcast_name.lower() in list(book_characters.name.str.lower()):
                        character_list.append(broadcast_name)
                except IndexError:
                    pass

            genders = []        
            humans = []
            for c in character_list:
                _char = book_characters[book_characters.name.str.lower() == c.lower()].iloc[0]
                genders.append(_char.gender)
                humans.append(_char.human)

            if sum([g == 'F' for g in genders]) == len(genders):
                compound_gender = 'F'
            elif sum([g == 'M' for g in genders]) == len(genders):
                compound_gender = 'M'
            else:
                compound_gender = 'NGS'

            if sum([h == 'H' for h in humans]) == len(humans):
                compound_human = 'H'
            else:
                compound_human = 'NH'

            compound_character_data['book'].append(row.book)
            compound_character_data['alias_count'].append(0)
            compound_character_data['name'].append(row[character_type])
            compound_character_data['gender'].append(compound_gender)
            compound_character_data['human'].append(compound_human)
            
            matched_character.append(row[character_type])
        else:
            matched_character.append(row[f"{character_type}_matched"])
            
    _speech_sections[f"{character_type}_matched"] = matched_character
    print(f"Compound count = {compound_count}")
    return pd.DataFrame(compound_character_data), _speech_sections

In [81]:
compound_speakers, all_speech_sections = replace_compound_characters(all_speech_sections, characters, character_type='speaker')
compound_recipients, all_speech_sections = replace_compound_characters(all_speech_sections, characters, character_type='recipient')

Compound count = 95
Compound count = 234


In [82]:
len(all_speech_sections)

2922

In [85]:
len(all_speech_sections.book.unique())

177

In [86]:
compound_characters = pd.concat([compound_speakers, compound_recipients])

In [87]:
sum(['and' in s for s in all_speech_sections.speaker])

90

In [88]:
compound_characters = compound_characters.groupby(['book', 'name']).first().reset_index()

##### We add the metacharacter data to the character table:

In [89]:
meta_character_list = [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator',
    'Reindeer', 'Dinosaurs', 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]

In [90]:
meta_character_data = pd.DataFrame({
    'book': ['all' for c in meta_character_list],
    'name': [c for c in meta_character_list],
    'gender': ['NGS' for c in meta_character_list],
    'human': ['NH' if c in ['Dinosaurs', 'Reindeer', 'Elmer and Grandpa Eldo'] else 'H' for c in meta_character_list],
    'alias_count': [0 for c in meta_character_list]
})

In [91]:
conn = sqlite3.connect('character_database.db')

In [92]:
aliases = pd.read_sql('select * from aliases', conn, index_col='index')
characters = pd.read_sql('select * from characters', conn, index_col='index')

In [93]:
characters = pd.concat([
    characters, meta_character_data#, compound_characters
])

In [94]:
# Replace 'self' with character name for ease of analysis (and flag self)
all_speech_sections['self_talk_flag'] = [
    True if r == 'Self'
    else False 
    for r in all_speech_sections.recipient_matched
]
all_speech_sections['recipient_matched'] = [
    s if r == 'Self'
    else r 
    for s,r in zip(all_speech_sections.speaker_matched, all_speech_sections.recipient_matched)
]

In [95]:
speakers = all_speech_sections.merge(characters, how='left', left_on=['speaker_matched', 'book'], right_on=['name', 'book'])

In [96]:
speakers = speakers.merge(characters, how='left', left_on=['recipient_matched', 'book'], right_on=['name', 'book'], suffixes=['_speaker', '_recipient'])

In [97]:
# We fill in the mssing information for the metacharacters
# for c in ['People', 'Everyone', 'Reader', 'The Reader']:
for c in [
    'People','Everyone', 'Reader', 'The Reader', 'Children', 'Adults', 'Narrator',
    'Reindeer', 'Dinosaurs'#, 'Mum and Dad', 'Esme and Bear', 'Elmer and Grandpa Eldo'
]:
  
    speakers['name_speaker'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_speaker, speakers.speaker_matched)
    ]
    speakers['gender_speaker'] = [
        characters[characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_speaker, speakers.speaker_matched)
    ]
    speakers['human_speaker'] = [
        characters[characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_speaker, speakers.speaker_matched)
    ]
    speakers['alias_count_speaker'] = [
        characters[characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_speaker, speakers.speaker_matched)
    ]
    
    speakers['name_recipient'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_recipient, speakers.recipient_matched)
    ]
    speakers['gender_recipient'] = [
        characters[characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_recipient, speakers.recipient_matched)
    ]
    speakers['human_recipient'] = [
        characters[characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_recipient, speakers.recipient_matched)
    ]
    speakers['alias_count_recipient'] = [
        characters[characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_recipient, speakers.recipient_matched)
    ]

In [98]:
# We fill in the mssing information for the metacharacters
# for c in ['People', 'Everyone', 'Reader', 'The Reader']:
for c in compound_characters.name:
  
    speakers['name_speaker'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_speaker, speakers.speaker_matched)
    ]
    speakers['gender_speaker'] = [
        compound_characters[compound_characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_speaker, speakers.speaker_matched)
    ]
    speakers['human_speaker'] = [
        compound_characters[compound_characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_speaker, speakers.speaker_matched)
    ]
    speakers['alias_count_speaker'] = [
        compound_characters[compound_characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_speaker, speakers.speaker_matched)
    ]
    
    speakers['name_recipient'] = [
        c if m==c 
        else n
        for n,m in zip(speakers.name_recipient, speakers.recipient_matched)
    ]
    speakers['gender_recipient'] = [
        compound_characters[compound_characters.name == c].iloc[0].gender if m==c 
        else n
        for n,m in zip(speakers.gender_recipient, speakers.recipient_matched)
    ]
    speakers['human_recipient'] = [
        compound_characters[compound_characters.name == c].iloc[0].human if m==c 
        else n
        for n,m in zip(speakers.human_recipient, speakers.recipient_matched)
    ]
    speakers['alias_count_recipient'] = [
        compound_characters[compound_characters.name == c].iloc[0].alias_count if m==c 
        else n
        for n,m in zip(speakers.alias_count_recipient, speakers.recipient_matched)
    ]

In [99]:
speakers.head()

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,chunk_titles,...,name_speaker,gender_speaker,human_speaker,alias_count_speaker,is_protagonist_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient,is_protagonist_recipient
0,The Night Before Christmas,0,St. Nicholas,Reindeer,St. Nicholas,Reindeer,"Now, Dasher! now, Dancer!\nnow, Prancer and Vi...","Now, Dasher! now, Dancer! now, Prancer and Vix...",183,The Night Before Christmas,...,St. Nicholas,M,H,1.0,0.0,Reindeer,NGS,NH,0.0,
1,The Night Before Christmas,1,St. Nicholas,Everyone,St. Nicholas,The Reader,"Happy\nChristmas to all, and to all a good\nni...","Happy Christmas to all, and to all a good night!",48,The Night Before Christmas,...,St. Nicholas,M,H,1.0,0.0,The Reader,NGS,H,0.0,
2,Sugarlump and the Unicorn,0,Sugarlump,himself,Sugarlump,Sugarlump,"""Here in the children's bedroom\nIs where I wa...",Here in the children's bedroom Is where I want...,106,Sugarlump and the Unicorn,...,Sugarlump,M,NH,0.0,1.0,Sugarlump,M,NH,0.0,1.0
3,Sugarlump and the Unicorn,1,Sugarlump,himself,Sugarlump,Sugarlump,"""Oh to be out in the big wide world!\nI wish I...",Oh to be out in the big wide world! I wish I c...,65,Sugarlump and the Unicorn,...,Sugarlump,M,NH,0.0,1.0,Sugarlump,M,NH,0.0,1.0
4,Sugarlump and the Unicorn,2,Unicorn,Sugarlump,unicorn,Sugarlump,"""Done!"" came a voice, and there stood a beast\...","Done! came a voice, and there stood a beast Wi...",128,Sugarlump and the Unicorn,...,unicorn,F,NH,0.0,0.0,Sugarlump,M,NH,0.0,1.0


In [111]:
all_speech_sections.to_json('data/gpt4_all_speech_sections_corpus.json')

In [112]:
speakers.to_json('data/gpt4_speakers_recipients_processed.json')

In [100]:
len(speakers)

2923

In [101]:
len(speakers[speakers.name_speaker.isna()])

53

In [102]:
len(speakers[speakers.name_speaker.isna()]) / len(speakers)

0.018132056106739652

In [103]:
len(speakers[speakers.name_recipient.isna()])

117

In [104]:
len(speakers[speakers.name_recipient.isna()]) / len(speakers)

0.04002736914129319

In [105]:
sum(['and' in s for s in speakers.speaker])/len(speakers)

0.030790283954840916

In [106]:
len(speakers)

2923

In [107]:
len(all_speech_sections)

2922

In [108]:
speakers[['and' in s for s in speakers.speaker]]

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,chunk_titles,...,name_speaker,gender_speaker,human_speaker,alias_count_speaker,is_protagonist_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient,is_protagonist_recipient
130,The Troll,39,Ben Buckle and Percy Patch,troll,Ben Buckle and Percy Patch,Troll,The plank! The plank! Make him walk the plank!,The plank! The plank! Make him walk the plank!,46,The Troll,...,Ben Buckle and Percy Patch,M,H,0.0,,Troll,M,NH,0.0,1.0
175,Once Upon A Unicorn Horn,3,Mum and Dad,June,Mum and Dad,June,"“Don’t worry,” Mum and Dad said.\n“We can fix ...",Don’t worry. We can fix it together!,36,Once Upon A Unicorn Horn,...,Mum and Dad,NGS,H,0.0,,June,F,H,0.0,1.0
417,The Most Wonderful Gift In The World,15,Esme and Bear,Little Bunny Boo-Boo,Esme and Bear,Little Bunny Boo-Boo,"“US?” said Esme and Bear, thinking that perhap...",US?,3,The Most Wonderful Gift In The World,...,Esme and Bear,NGS,NH,0.0,,Little Bunny Boo-Boo,F,NH,0.0,0.0
421,Elmer and Grandpa Eldo,3,Grandpa Eldo,Elmer,Grandpa Eldo,Elmer,"“What\na lovely surprise,” he said. “What’s th...",What a lovely surprise. What’s that balanced o...,58,Elmer and Grandpa Eldo,...,Grandpa Eldo,F,H,0.0,0.0,Elmer,M,NH,0.0,1.0
423,Elmer and Grandpa Eldo,5,Grandpa Eldo,Elmer,Grandpa Eldo,Elmer,"“Fancy you remembering\nthat,” said Eldo.",Fancy you remembering that.,27,Elmer and Grandpa Eldo,...,Grandpa Eldo,F,H,0.0,0.0,Elmer,M,NH,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2715,Who Will Save Us,20,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,"Ugh, yucky poohey.","Ugh, yucky poohey.",18,Who Will Save Us,...,"Flip, Flap, Waddle, Splash, and Littlest",NGS,NH,0.0,,Old Wise,M,NH,0.0,1.0
2729,Who Will Save Us,34,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,Really!,Really!,7,Who Will Save Us,...,"Flip, Flap, Waddle, Splash, and Littlest",NGS,NH,0.0,,Old Wise,M,NH,0.0,1.0
2732,Who Will Save Us,37,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,Oh yes! Yes please!,Oh yes! Yes please!,19,Who Will Save Us,...,"Flip, Flap, Waddle, Splash, and Littlest",NGS,NH,0.0,,Old Wise,M,NH,0.0,1.0
2734,Who Will Save Us,39,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,"Flip, Flap, Waddle, Splash, and Littlest",Old Wise,Thank goodness.,Thank goodness.,15,Who Will Save Us,...,"Flip, Flap, Waddle, Splash, and Littlest",NGS,NH,0.0,,Old Wise,M,NH,0.0,1.0


In [109]:
print(len(speakers[speakers.name_recipient.isna()].book.unique()))
print(speakers[speakers.name_recipient.isna()].book.unique())

31
['Peace at Last' 'The Princess and the Wizard' "Kipper's Toybox"
 'Harry and the Dinosaurs at the Museum' 'A Thing Called Snow' 'Cave Baby'
 'Lost in Snow' 'Is It Betime Wibbly Pig' 'Wide-awake Hedgehog'
 'Tyrannosaurus Drip'
 'Sir Charlie Stinky Socks and the Really Big Adventure'
 'I Am Amelia Earhart' "The Tree That's Meant To Be"
 'What The Ladybird Heard' "Charlie Cook's Favourite Book" "Ravi's Roar"
 'The Runaway Pea' 'Knock Knock Alien' 'Boogie Bear' 'Santa to the Rescue'
 'The Bad-Tempered Ladybird' 'Snow White, Star Striker'
 "Eleanor Won't Share" 'Superworm' "Jesus' Christmas Party"
 'The Ugly Duckling' "Where's My Teddy" 'What The Ladybird Heard Next'
 'Captain Duck' 'The Christmas Extravaganza Hotel' 'Little Monkey']


In [110]:
print(len(speakers[speakers.name_speaker.isna()].book.unique()))
print(speakers[speakers.name_speaker.isna()].book.unique())

24
["Kipper's Toybox" 'Elmer and the Lost Teddy' "The Owl's Lesson"
 'Lost in Snow' 'Is It Betime Wibbly Pig' 'Wide-awake Hedgehog'
 'I Am Amelia Earhart' "Dogs Don't Do Ballet" 'The First Christmas'
 'What The Ladybird Heard' 'The Runaway Pea' 'Boogie Bear'
 'Santa to the Rescue' 'Father Christmas Needs A Wee'
 "Eleanor Won't Share" 'Stick Man' 'Harry and the Robots'
 "Jesus' Christmas Party" 'The Ugly Duckling' 'The Wheels on the Bus'
 'What The Ladybird Heard Next' 'Elephant Learns to Share'
 'One Snowy Night' 'Little Monkey']


In [1073]:
null_speaker_books = iter(speakers[speakers.name_speaker.isna()].book.unique())
# null_speaker_books = iter(speakers[speakers.name_recipient.isna()].book.unique())
# null_speaker_books = iter(
#     set(speakers[speakers.name_recipient.isna()].book.unique()) - set(speakers[speakers.name_speaker.isna()].book.unique())
# )

In [1071]:
print(list(null_speaker_books))

['Sing A Song Of Bottoms', 'The Troll', "Kipper's Toybox", 'Elmer and the Lost Teddy', 'Santa is Coming to Devon', 'The Enormous Crocodile', 'Harry and the Dinosaurs Go Wild', 'Harry and the Dinosaurs at the Museum', 'The Cross Rabbit', 'Yoga Babies', 'Elmer and Wilbur', 'I Need A Wee', "The Owl's Lesson", 'Cave Baby', 'The Rescue Party', 'Zog', 'Dogger', 'Is It Betime Wibbly Pig', 'Wide-awake Hedgehog', 'Tyrannosaurus Drip', 'The Rhyming Rabbit', 'I Am Amelia Earhart', 'She Rex', 'Mole Hill', 'Oi Dog!', "The Lighthouse Keeper's Lunch", "The Tree That's Meant To Be", 'The Dinosaur That Pooped Christmas', "The Jolly Postman or Other People's Letters", 'What The Ladybird Heard', "Ravi's Roar", 'The Runaway Pea', 'Zog and the Flying Doctors', 'The Three Little Pigs', 'Lenny Makes A Wish', 'Knock Knock Alien', 'Boogie Bear', 'Father Christmas Needs A Wee', 'The Polar Express', 'Mini Beasties', 'The Smartest Giant in Town', 'Barry the Fish With Fingers and the Hairy Scary Monster', 'Stick M

In [1095]:
current_book = next(null_speaker_books)
print(current_book)

The Cross Rabbit


In [1096]:
speakers[(speakers.book == current_book) * (speakers.name_speaker.isna())]
# speakers[(speakers.book == current_book) * (speakers.name_recipient.isna())]
# speakers[(speakers.book == 'Sing A Song Of Bottoms') * (speakers.name_speaker.isna())]

Unnamed: 0,book,speech_section_id,speaker,recipient,speaker_matched,recipient_matched,speech_text,spoken_words_only,spoken_word_count,chunk_titles,self_talk_flag,name_speaker,gender_speaker,human_speaker,alias_count_speaker,name_recipient,gender_recipient,human_recipient,alias_count_recipient
556,The Cross Rabbit,9,the mice,Percy,Unknown,Percy,“Hello Percy!” they called. But suddenly they\...,Hello Percy! You’re not going to tell us to st...,58,The Cross Rabbit,False,,,,,Percy,M,H,0.0
558,The Cross Rabbit,11,the mice,Percy,Unknown,Percy,“We’ll try!” they squeaked loudly.,We’ll try!,10,The Cross Rabbit,False,,,,,Percy,M,H,0.0


In [1094]:
characters[characters.book == current_book]
# characters[characters.book == 'Sing A Song Of Bottoms']

Unnamed: 0,book,name,gender,human,alias_count
40,Harry and the Dinosaurs at the Museum,Harry,M,H,0
380,Harry and the Dinosaurs at the Museum,Mum,F,H,0
381,Harry and the Dinosaurs at the Museum,Sam,F,H,0
382,Harry and the Dinosaurs at the Museum,Gran,F,H,0
383,Harry and the Dinosaurs at the Museum,T-Rex,M,NH,1
384,Harry and the Dinosaurs at the Museum,Museum Guard,M,H,0
385,Harry and the Dinosaurs at the Museum,Pterodactyl,M,NH,0
386,Harry and the Dinosaurs at the Museum,Triceratops,NGS,NH,0
387,Harry and the Dinosaurs at the Museum,Anchisaurus,NGS,NH,0
388,Harry and the Dinosaurs at the Museum,Apatosaurus,NGS,NH,0


In [1049]:
# aliases[aliases.character_id==196]
aliases[aliases.character=='wolves']

Unnamed: 0_level_0,alias,character,character_id,book
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
147,pack,wolves,1084,The Way Home For Wolf


In [368]:
aliases

Unnamed: 0_level_0,alias,character,character_id,book
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Gruff the Grump,Mr Bear,5,Gruff the Grump
1,We,I,16,The Polar Express
2,little dove,her baby,24,The Christmas Story
3,McTat,Tabby McTat,51,Tabby McTat
4,we're,we,67,We're Going On A Lion Hunt
...,...,...,...,...
164,everyone,other children,1383,Eleanor Won't Share
165,mother,mother duck,1434,The Ugly Duckling
166,the swans,white birds,1436,The Ugly Duckling
167,Granny,grandmother,1449,Little Red Riding Hood


### Preliminary analysis:

In [113]:
female_character_count = sum(characters.gender == 'F')
male_character_count = sum(characters.gender == 'M')
ngs_character_count = sum(characters.gender == 'NGS')

In [114]:
ngs_character_count

632

In [115]:
male_spoken_word_count = speakers[speakers.gender_speaker == 'M'].spoken_word_count.sum()
female_spoken_word_count = speakers[speakers.gender_speaker == 'F'].spoken_word_count.sum()
ngs_spoken_word_count = speakers[speakers.gender_speaker == 'NGS'].spoken_word_count.sum()

In [116]:
male_spoken_word_count / male_character_count

155.65625

In [117]:
female_spoken_word_count / female_character_count

114.85217391304347

In [118]:
ngs_spoken_word_count / ngs_character_count

66.58069620253164

In [119]:
female_character_count / (male_character_count + female_character_count + ngs_character_count)

0.22682445759368836

In [120]:
male_character_count / (male_character_count + female_character_count + ngs_character_count)

0.3576594345825115

In [121]:
female_character_count / male_character_count

0.6341911764705882

In [122]:
male_received_word_count = speakers[speakers.gender_recipient == 'M'].spoken_word_count.sum()
female_received_word_count = speakers[speakers.gender_recipient == 'F'].spoken_word_count.sum()
ngs_received_word_count = speakers[speakers.gender_recipient == 'NGS'].spoken_word_count.sum()

In [1212]:
male_received_word_count / male_character_count

195.32264150943396

In [1213]:
female_received_word_count / female_character_count

106.57910447761193

In [1214]:
ngs_received_word_count / ngs_character_count

118.68092105263158

In [1215]:
male_speech_sections = len(speakers[speakers.gender_speaker == 'M'])
female_speech_sections = len(speakers[speakers.gender_speaker == 'F'])
ngs_speech_sections = len(speakers[speakers.gender_speaker == 'NGS'])

In [1216]:
male_speech_sections / male_character_count

3.6962264150943396

In [1217]:
female_speech_sections / female_character_count

2.537313432835821

In [1218]:
ngs_speech_sections / ngs_character_count

1.350328947368421

In [1219]:
male_speech_sections / (male_speech_sections + female_speech_sections + ngs_speech_sections)

0.5396694214876033

In [1220]:
female_speech_sections / (male_speech_sections + female_speech_sections + ngs_speech_sections)

0.23415977961432508

In [1221]:
ngs_speech_sections / (male_speech_sections + female_speech_sections + ngs_speech_sections)

0.22617079889807162

In [1222]:
male_spoken_word_count / (male_spoken_word_count + female_spoken_word_count + ngs_spoken_word_count)

0.5128964935670641

In [1223]:
female_spoken_word_count / (male_spoken_word_count + female_spoken_word_count + ngs_spoken_word_count)

0.22806182779235584

In [1391]:
male_spoken_words = speakers[speakers.gender_speaker == 'M'].spoken_words_only
female_spoken_words = speakers[speakers.gender_speaker == 'F'].spoken_words_only
ngs_spoken_words = speakers[speakers.gender_speaker == 'NGS'].spoken_words_only

male_received_words = speakers[speakers.gender_recipient == 'M'].spoken_words_only
female_received_words = speakers[speakers.gender_recipient == 'F'].spoken_words_only
ngs_received_words = speakers[speakers.gender_recipient == 'NGS'].spoken_words_only

In [1386]:
sum([c == '!' for c in ' '.join(male_spoken_words)]) / male_spoken_word_count

0.011435552167243802

In [1387]:
sum([c == '!' for c in ' '.join(female_spoken_words)]) / female_spoken_word_count

0.010142532534600289

In [1390]:
sum([c == '!' for c in ' '.join(ngs_spoken_words)]) / ngs_spoken_word_count

0.013821700069108501

In [1388]:
sum([c == '?' for c in ' '.join(male_spoken_words)]) / male_spoken_word_count

0.004491554224724674

In [1389]:
sum([c == '?' for c in ' '.join(female_spoken_words)]) / female_spoken_word_count

0.003966122701921091

In [1392]:
sum([c == '!' for c in ' '.join(male_received_words)]) / male_received_word_count

0.010142869562697424

In [1393]:
sum([c == '!' for c in ' '.join(female_received_words)]) / female_received_word_count

0.012995742773918888

In [1394]:
sum([c == '!' for c in ' '.join(ngs_received_words)]) / ngs_received_word_count

0.012929959256076942

In [1395]:
sum([c == '?' for c in ' '.join(male_received_words)]) / male_received_word_count

0.004742999005032796

In [1396]:
sum([c == '?' for c in ' '.join(female_received_words)]) / female_received_word_count

0.004929419672865786

In [1224]:
characters[characters.gender == 'F'].groupby('name').agg('count').sort_values('book', ascending=False).head(20)

Unnamed: 0_level_0,book,gender,human,alias_count,is_protagonist
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mum,19,19,19,19,19
mum,9,9,9,9,9
Mummy,8,8,8,8,8
Sam,7,7,7,7,7
mother,6,6,6,6,6
Nan,5,5,5,5,5
girl,5,5,5,5,5
cow,5,5,5,5,5
Mary,5,5,5,5,5
Cinderella,5,5,5,5,5


#### How much female speech is by Mum compared to male speech by Dad (option include Granny/Grandpa):

In [1339]:
mum_map = ['mum', 'mother', 'mummy', 'mom', 'mommy', 'ma', 'mama', 'mumma', 'mamma']
dad_map = ['dad', 'father', 'daddy', 'dada', 'dadda', 'da', 'pa', 'papa']
gran_map = ['gran', 'granny', 'nan', 'nana', 'nanna', 'grandma', 'gradmother', 'granmother'] 
grandpa_map = ['grandpa', 'gramps', 'grandfather', 'grandad', 'granddad']

def match_map(_map, _name, verbose=False):
    exclusions = ['christmas', 'xmas', 'nature', 'earth', 'time', 'clock']
    
    if not pd.isna(_name):
        _name = str(_name)
    else:
        _name=''
   
    is_exclusion = sum([
        m in _name.lower() for m in exclusions
    ]) > 0
    if is_exclusion:
        return False
    
    is_match = sum([
        m in _name.lower().split() for m in _map
    ]) > 0
    if is_match and verbose:
        print(_name)

    return is_match #| is_short_match

In [1340]:
match_map(mum_map, 'ma')

True

In [1341]:
sum([match_map(mum_map, s)
#     for s in characters.name]) / len(characters)
    for s in characters.name]) / sum(characters.gender=='F')

0.16716417910447762

In [1342]:
sum([match_map(dad_map, s)
#     for s in characters.name]) / len(characters)
     for s in characters.name]) / sum(characters.gender=='M')

0.062264150943396226

In [1343]:
secondary_characters = characters[characters.is_protagonist==0]

In [1345]:
sum([match_map(dad_map, s)
    for s in secondary_characters.name]) / sum(secondary_characters.gender=='M')
#     for s in secondary_characters.name]) / len(secondary_characters.gender)

0.07888040712468193

In [1346]:
sum([match_map(mum_map, s)
    for s in secondary_characters.name]) / sum(secondary_characters.gender=='F')
#     for s in secondary_characters.name]) / len(secondary_characters.gender)

0.17905405405405406

In [1359]:
speakers['speaker_is_mum'] = [
    match_map(mum_map, s)
    for s in speakers.name_speaker
]
speakers['speaker_is_dad'] = [
    match_map(dad_map, s)
    for s in speakers.name_speaker
]
speakers['speaker_is_granny'] = [
    match_map(gran_map, s)
    for s in speakers.name_speaker
]
speakers['speaker_is_grandpa'] = [
    match_map(grandpa_map, s)
    for s in speakers.name_speaker
]

speakers['recipient_is_mum'] = [
    match_map(mum_map, s)
    for s in speakers.name_recipient
]
speakers['recipient_is_dad'] = [
    match_map(dad_map, s)
    for s in speakers.name_recipient
]
speakers['recipient_is_granny'] = [
    match_map(gran_map, s)
    for s in speakers.name_recipient
]
speakers['recipient_is_grandpa'] = [
    match_map(grandpa_map, s)
    for s in speakers.name_recipient
]

In [1348]:
len(speakers[speakers.speaker_is_mum])

104

In [1349]:
len(speakers[speakers.speaker_is_dad])

44

In [1350]:
len(speakers[speakers.speaker_is_granny])

19

In [1351]:
len(speakers[speakers.speaker_is_grandpa])

57

In [1352]:
speakers[speakers.speaker_is_mum].spoken_word_count.sum() / female_spoken_word_count

0.09902912621359224

In [1353]:
speakers[speakers.speaker_is_dad].spoken_word_count.sum() / male_spoken_word_count

0.01990429039872877

In [1354]:
speakers[speakers.speaker_is_granny].spoken_word_count.sum() / female_spoken_word_count

0.02189630241685602

In [1355]:
speakers[speakers.speaker_is_grandpa].spoken_word_count.sum() / male_spoken_word_count

0.03337895307290279

In [1357]:
len(speakers[speakers.speaker_is_mum]) / female_speech_sections

0.1223529411764706

In [1358]:
len(speakers[speakers.speaker_is_dad]) / male_speech_sections

0.022460438999489536

In [1368]:
speakers[speakers.recipient_is_mum].spoken_word_count.sum() / female_received_word_count

0.15286802599148555

In [1365]:
speakers[speakers.recipient_is_dad].spoken_word_count.sum() / male_received_word_count

0.023589416640102008