# NLP - Project 1
## Rinehart Analysis
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

Check out the **[README](https://github.com/NLPaladins/rinehartAnalysis/blob/main/README.md)**


Or the current **[TODO](https://github.com/NLPaladins/rinehartAnalysis/blob/main/TODO.md)** list.

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display

In [2]:
# Clone the repository if you're on Google Colab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis.git
    print("Project cloned.")

print("Clone project?")
print("(If you do this you will overwrite local changes to other files, e.g. configs)")
print("Not needed if you're not on Google Colab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will overwrite local changes to other files, e.g. configs)
Not needed if you're not on Google Colab


Button(description='Yes, clone', style=ButtonStyle())

In [3]:
# Change into the cloned project directory if you're on Google Colab
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        os.chdir('/content/rinehartAnalysis')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Colab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Colab?


Button(description='Yes', style=ButtonStyle())

### At any point, to save changes
click **File > Save a copy in GitHub**

### Now go to files on the left, and open:
- rinehartAnalysis/confs/proj_1.yml

Press Ctrl+S to save changes.

## Load Libraries and setup

In [4]:
import traceback
import argparse
from importlib import reload as reload_lib
from pprint import pprint
import numpy as np

# Custom libs
from nlp_libs import Configuration, ColorizedLogger, ProcessedBook
# Import libs this way if you want to edit and reload them dynamically
# import nlp_libs.books.processed_book as books_lib # Comment out until Class is finalized


### Libraries Overview
All the libraries are located under *"\<project root>/nlp_libs"*
- ***ProcessedBook***: Loc: **books/processed_book.py**, Desc: *Book Pre-processor*
- ***Configuration***: Loc: **configuration/configuration.py**, Desc: *Configuration Loader*
- ***ColorizedLogger***: Loc: **fancy_logger/colorized_logger.py**, Desc: *Logger with formatted text capabilities*

In [5]:
# The path of configuration and log save path
config_path = "confs/proj_1.yml"  # Open files > confs > proj_1.yml to edit temporarily. Commit to save permanently
# !cat "$config_path"
log_path = "logs/proj_1.log"  # Open files > logs > proj_1.log to debug logs of previous runs

In [6]:
# The logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)

2021-09-25 15:28:33 FancyLogger  INFO     Logger is set. Log file path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis/logs/proj_1.log


In [7]:
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books = conf.get_config('data_loader')['config']['books']
pprint(books)  # Pretty print the books dict

2021-09-25 15:28:33 Config       INFO     Configuration file loaded successfully from path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis/confs/proj_1.yml
2021-09-25 15:28:33 Config       INFO     Configuration Tag: proj1


{'Oh,_Well,_You_Know_How_Women_Are!': {'crime_type': 'example',
                                       'detectives': ['man1', 'man2'],
                                       'suspects': ['man3', 'man4'],
                                       'url': 'https://www.gutenberg.org/cache/epub/24259/pg24259.txt'},
 'The_Breaking_Point': {'crime_type': 'example',
                        'detectives': ['man1', 'man2'],
                        'suspects': ['man3', 'man4'],
                        'url': 'https://www.gutenberg.org/files/1601/1601-0.txt'},
 'The_Circular_Staircase': {'crime_type': 'example',
                            'detectives': ['Rachel Innes'],
                            'suspects': ['Liddy',
                                         'Halsey',
                                         'Gertrude',
                                         'Paul Armstrong',
                                         'Doctor Walker',
                                         'Louise Armstrong',
    

## Exploration

In [40]:
import urllib.request
import re
from typing import Dict, List


class ProcessedBook:
    num_map = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
               (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
               (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]
    title: str
    url: str
    detectives: List[str]
    suspects: List[str]
    crime_type: str

    def __init__(self, title: str, metadata: Dict, make_lower: bool = True):
        """
        raw holds the books as a single string.
        clean holds the books as a list of lowercase lines starting
        from the first chapter and ending with the last sentence.
        """
        self.title = title
        self.url = metadata['url']
        self.detectives = metadata['detectives']
        self.suspects = metadata['suspects']
        self.crime_type = metadata['crime_type']
        self.raw = self.read_book_from_proj_gut(self.url)
        if make_lower:
            lines = self.raw.lower()
        else:
            lines = self.raw
            
        lines = re.sub(r'\r\n', r'\n', lines)
        lines = re.findall(r'.*(?=\n)',  lines)        
        
        self.lines = self.clean_lines(lines=lines)
        self.clean = self.get_clean_book(make_lower=make_lower)

    @staticmethod
    def read_book_from_proj_gut(book_url: str) -> str:
        req = urllib.request.Request(book_url)
        client = urllib.request.urlopen(req)
        page = client.read()
        return page.decode('utf-8')

    def get_clean_book(self, make_lower: bool = True) -> List[List[str]]:
        if make_lower:
            lines = self.raw.lower()
        else:
            lines = self.raw

        lines = re.sub(r'\r\n', r'\n', lines)
        lines = re.findall(r'.*(?=\n)',  lines)        

        lines = self.clean_lines(lines)
        chapters = self.lines_to_chapters(lines)
        return chapters

    def clean_lines(self, lines: List[str]) -> List[str]:
        clean_lines = []
        start = False
        for line in lines:
            if re.match(r'^chapter i\.', line, re.IGNORECASE):
                clean_lines.append(line)
                start = True
                continue
            if not start:
                continue
            if re.match(r'^\*\*\* end of the project gutenberg ebook', line, re.IGNORECASE):
                break
            if self.pass_clean_filter(line):
                clean_lines.append(line)
        return clean_lines

    @staticmethod
    def lines_to_chapters(lines: List[str]) -> List[List[str]]:
        chapters = []
        sentences = []
        current_sent = ''
        # Initialised here so the elif below never reads an unbound name
        add_chapter_title = False
        for i, line in enumerate(lines):
            # add chapter as 1st sentence
            if re.match(r'^chapter [ivxlcdm]+\.$', line, re.IGNORECASE):
                if sentences:
                    chapters.append(sentences)
                sentences = [line]
                add_chapter_title = True
                continue
            # add chapter title as 2nd sentence
            elif add_chapter_title:
                sentences.append(line)
                add_chapter_title = False
                continue
            sents = re.findall(r' *((?:mr\.|mrs\.|[^\.\?!])*)(?<!mr)(?<!mrs)[\.\?!]', line, re.IGNORECASE)
            # if no sentence end is detected
            if not sents:
                if current_sent == '':
                    current_sent = line
                else:
                    current_sent += ' ' + line
            # if at least one sentence end is detected
            else:
                for group in sents:
                    if current_sent != '':
                        current_sent += ' ' + group
                        sentences.append(current_sent)
                    else:
                        sentences.append(group)
                    current_sent = ''
                # set the next sentence to its start if there is one
                sent_end = re.search(r'(?<!mr)(?<!mrs)[\.\?!] ((?:mr\.|mrs\.|[^\.\?!])*)$', line, re.IGNORECASE)
                if sent_end is not None:
                    current_sent = sent_end.groups()[0]
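        # NOTE: the sentences collected for the last chapter are not appended here,
        # so the returned list omits the final chapter.
        return chapters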

    @staticmethod
    def pass_clean_filter(line: str) -> bool:
        # removing the illustration lines and empty lines
        # can add other filters here as needed
        if line == '' or re.match(r'illustration:|\[illustration\]', line, re.IGNORECASE):
            return False
        else:
            return True

    def get_characters_per_chapter(self, chapter): 
        found_character_list = []
        search_string = re.compile(rf'[A-Z][a-z]+(?:\s|,|.|\.\s)[A-Z][a-z][A-Z]?[a-z]+(?:\s[A-Z][a-z][A-Z]?[a-z]+)?(?:\s[A-Z][a-z][A-Z]?[a-z]+)?')
        #get characters per sentence in chapter
        for sentence in chapter:
            res = re.findall(search_string, sentence)
            found_character_list.append(res)

        unique_characters = list(np.concatenate(found_character_list))
        return found_character_list, unique_characters
      
        
    ##
    ## @Warning: Currently only works when the book is loaded with make_lower=False,
    ## because the character regex relies on capitalized words.
    ##
    def get_all_characters_per_novel(self):
        preceding_words_to_ditch = ['After', 'Although', 'And', 'As', 'At',
         'Before', 'Both', 'But', 'Did', 'For', 
         'Good', 'Had', 'Has', 'Home', 'If', 'Is',
         'Leaving', 'Like', 'No', 'Nice', 'Old', 'On', 'Or',
         'Poor', 'Send', 'So', 'That', 'Tell', 'The', 'Thank', 
         'To', 'Was', 'Whatever', 'When', 'Where', 'While', 
         'With','Your', 'View', 
          #Specific Places
         'African', 'Brewing', 'Hospital', 'Zion', 'New','Country', 'Greenwood', 'Western', 'American', 'Bar', 'Chestnut', 'Queen'
        ]
        
        book_by_chapter = self.lines_to_chapters(self.lines)
        
        totalUniqueList = []
        for chapter in book_by_chapter: 

            characterProgression, uniqueCharacters = self.get_characters_per_chapter(chapter)
            totalUniqueList = [*totalUniqueList, *uniqueCharacters]

        totalUnique = set(totalUniqueList)
        
        joined_preceding_words_to_lose = '|'.join(preceding_words_to_ditch)
        preceding_word_to_lose_regex = fr'^(?!{joined_preceding_words_to_lose}).*$'
        regex = re.compile(preceding_word_to_lose_regex)
        filtered_people = list(filter(regex.match, totalUnique))
        
        return filtered_people
        
    def get_chapter(self, chapter: int) -> List[str]:
        return self.clean[chapter - 1]

    def extract_character_names(self):
        lines_by_chapter = self.lines_to_chapters(self.lines)
        for chapter in lines_by_chapter: 
            print(chapter)
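
The sentence splitting in `lines_to_chapters` hinges on one regular expression that ends a sentence at `.`, `?` or `!` unless the period belongs to `mr.` or `mrs.`. Below is a minimal standalone sketch of that idea on a single line. The helper name `split_sentences` is hypothetical, and the sketch ignores sentences that span several lines, which the method above handles with `current_sent`.

```python
import re
from typing import List

# Same idea as the pattern in lines_to_chapters: capture text up to a
# terminator, letting "mr." / "mrs." pass as part of the sentence.
SENTENCE_RE = re.compile(
    r' *((?:mr\.|mrs\.|[^.?!])*)(?<!mr)(?<!mrs)[.?!]',
    re.IGNORECASE,
)


def split_sentences(line: str) -> List[str]:
    """Return the complete sentences found in one line, without terminators."""
    return SENTENCE_RE.findall(line)


sample = "Mr. Jamieson had gone to the village. The house was quiet!"
print(split_sentences(sample))
```

Text after the last terminator is not returned, which is why the method keeps a running `current_sent` across lines.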

In [34]:
# Load Circular Staircase
title, metadata = list(books.items())[0]
staircase = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [11]:
for i, sent in enumerate(staircase.get_chapter(33)):
  if i == 10: break
  logger.info(sent)

2021-09-25 15:28:36 Notebook     INFO     CHAPTER XXXIII.
2021-09-25 15:28:36 Notebook     INFO     AT THE FOOT OF THE STAIRS
2021-09-25 15:28:36 Notebook     INFO     As I drove rapidly up to the house from Casanova Station in the hack, I saw the detective Burns loitering across the street from the Walker place
2021-09-25 15:28:36 Notebook     INFO     So Jamieson was putting the screws on—lightly now, but ready to give them a twist or two, I felt certain, very soon
2021-09-25 15:28:36 Notebook     INFO     The house was quiet
2021-09-25 15:28:36 Notebook     INFO     Two steps of the circular staircase had been pried off, without result, and beyond a second message from Gertrude, that Halsey insisted on coming home and they would arrive that night, there was nothing new
2021-09-25 15:28:36 Notebook     INFO     Mr. Jamieson, having failed to locate the secret room, had gone to the village
2021-
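
Chapter 33 shows up in the text as `CHAPTER XXXIII.`. The `num_map` table defined on `ProcessedBook` is not used by any method shown in this notebook; one plausible use is converting a chapter number to its heading. The sketch below works under that assumption, and both `int_to_roman` and `chapter_heading` are hypothetical helpers.

```python
# Value/symbol pairs in descending order, copied from ProcessedBook.num_map.
NUM_MAP = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
           (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
           (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]


def int_to_roman(number: int) -> str:
    """Greedy conversion: take each symbol as many times as it fits."""
    if number <= 0:
        raise ValueError("only positive integers are supported")
    symbols = []
    for value, symbol in NUM_MAP:
        count, number = divmod(number, value)
        symbols.append(symbol * count)
    return ''.join(symbols)


def chapter_heading(number: int) -> str:
    """Build a heading in the style seen in the book text (assumed format)."""
    return f"CHAPTER {int_to_roman(number)}."


print(chapter_heading(33))
```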

In [12]:
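# Load the first configured book in lowercase (the variable is named lower_ten, but index 0 is The Circular Staircase)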
title, metadata = list(books.items())[0]
lower_ten = ProcessedBook(title=title, metadata=metadata)

In [13]:
logger.info(f'The raw length of this book as a string is {len(lower_ten.raw)}')
logger.info(f'This book has {len(lower_ten.clean)} chapters\n')
for i, chapter in enumerate(lower_ten.clean):
  if i == 5: break
  logger.info(f'{chapter[0]} - {chapter[1]}')
  logger.info(f'There are {len(chapter)} sentences in this chapter.')
  num_words = []
  for sent in chapter:
    num_words.append(len(sent.split(' ')))
  avg_words = np.mean(num_words)
  logger.info(f'The average sentence length in this chapter is {avg_words} words\n')

2021-09-25 15:28:37 Notebook     INFO     The raw length of this book as a string is 410135
2021-09-25 15:28:37 Notebook     INFO     This book has 33 chapters

2021-09-25 15:28:37 Notebook     INFO     chapter i. - i take a country house
2021-09-25 15:28:37 Notebook     INFO     There are 117 sentences in this chapter.
2021-09-25 15:28:37 Notebook     INFO     The average sentence length in this chapter is 21.307692307692307 words

2021-09-25 15:28:37 Notebook     INFO     chapter ii. - a link cuff-button
2021-09-25 15:28:37 Notebook     INFO     There are 142 sentences in this chapter.
2021-09-25 15:28:37 Notebook     INFO     The average sentence length in this chapter is 16.260563380281692 words

2021-09-25 15:28:37 Notebook     INFO     chapter iii. - mr. john bailey appears
2021-09-25 15:28:37 Notebook     INFO     There are 104 sentences in this chapter.
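
The averages logged above count tokens with `sent.split(' ')`, which also counts an empty string for every doubled space. A small sketch of the difference, using `str.split()` with no argument instead; the helper `avg_sentence_length` and the two sample sentences are illustrative only.

```python
from statistics import mean


def avg_sentence_length(sentences):
    """Mean number of whitespace-separated tokens per sentence."""
    return mean(len(sentence.split()) for sentence in sentences)


# The second sentence contains a doubled space on purpose.
chapter = ["the house was quiet", "two  steps had been pried off"]

naive = mean(len(sentence.split(' ')) for sentence in chapter)
robust = avg_sentence_length(chapter)
print(naive, robust)
```

Whether the difference matters depends on how often the cleaned lines contain repeated spaces.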

In [14]:
for i, sent in enumerate(lower_ten.get_chapter(15)):
  if i == 10: 
    break
  logger.info(sent)

2021-09-25 15:28:37 Notebook     INFO     chapter xv.
2021-09-25 15:28:37 Notebook     INFO     liddy gives the alarm
2021-09-25 15:28:37 Notebook     INFO     the next day, friday, gertrude broke the news of her stepfather’s death to louise
2021-09-25 15:28:37 Notebook     INFO     she did it as gently as she could, telling her first that he was very ill, and finally that he was dead
2021-09-25 15:28:37 Notebook     INFO     louise received the news in the most unexpected manner, and when gertrude came out to tell me how she had stood it, i think she was almost shocked
2021-09-25 15:28:37 Notebook     INFO     “she just lay and stared at me, aunt ray,” she said
2021-09-25 15:28:37 Notebook     INFO     “do you know, i believe she is glad, glad
2021-09-25 15:28:37 Notebook     INFO     and she is too honest to pretend anything else
2021-09-25 15:28:37 Notebook     INFO     w

In [17]:
# Example of running it on all the books
processed_books = {}
for title, metadata in books.items():
  logger.nl()
  logger.info(f"Book: {title}", color='yellow', attrs=['underline'])
  current_book = ProcessedBook(title=title, metadata=metadata)
  # Raw length
  logger.info(f'The raw length of this book as a string is {len(current_book.raw)}')
  # Number of chapters
  logger.info(f'This book has {len(current_book.clean)} chapters\n')
  # Sentences per chapter
  for i, chapter in enumerate(current_book.clean):
    if i == 5: 
      break
    logger.info(f'{chapter[0]} - {chapter[1]}')
    logger.info(f'There are {len(chapter)} sentences in this chapter.')
    num_words = []
    for sent in chapter:
      num_words.append(len(sent.split(' ')))
    avg_words = np.mean(num_words)
    logger.info(f'The average sentence length in this chapter is {avg_words} words\n')
  # Chapter 15 (read from lower_ten rather than current_book, so every book logs the same chapter)
  for i, sent in enumerate(lower_ten.get_chapter(15)):
    if i == 10: break
    logger.info(sent)


2021-09-25 15:47:14 Notebook     INFO     Book: The_Circular_Staircase
2021-09-25 15:47:15 Notebook     INFO     The raw length of this book as a string is 410135
2021-09-25 15:47:15 Notebook     INFO     This book has 33 chapters

2021-09-25 15:47:15 Notebook     INFO     chapter i. - i take a country house
2021-09-25 15:47:15 Notebook     INFO     There are 117 sentences in this chapter.
2021-09-25 15:47:15 Notebook     INFO     The average sentence length in this chapter is 21.307692307692307 words

2021-09-25 15:47:15 Notebook     INFO     chapter ii. - a link cuff-button
2021-09-25 15:47:15 Notebook     INFO     There are 142 sentences in this chapter.
2021-09-25 15:47:15 Notebook     INFO     The average sentence length in this chapter is 16.260563380281692 words

2021-09-25 15:47:15 Notebook     INFO     chapter iii. - mr. john bailey appears
2021-09

2021-09-25 15:47:18 Notebook     INFO     “she just lay and stared at me, aunt ray,” she said
2021-09-25 15:47:18 Notebook     INFO     “do you know, i believe she is glad, glad
2021-09-25 15:47:18 Notebook     INFO     and she is too honest to pretend anything else
2021-09-25 15:47:18 Notebook     INFO     what sort of man was mr. paul armstrong, anyhow
2021-09-25 15:47:18 Notebook     INFO     “he was a bully as well as a rascal, gertrude,” i said

2021-09-25 15:47:18 Notebook     INFO     Book: The_Window_at_the_White_Cat
2021-09-25 15:47:20 Notebook     INFO     The raw length of this book as a string is 87360
2021-09-25 15:47:20 Notebook     INFO     This book has 0 chapters

2021-09-25 15:47:20 Notebook     INFO     chapter xv.
2021-09-25 15:47:20 Notebook     INFO     liddy gives the alarm
2021-09-25 15:47:20 Notebook     INFO     the next da

## TEST Dictionary 

In [18]:
# Load Circular Staircase
title, metadata = list(books.items())[0]
staircaseUpper = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [19]:
inputStr = [ 'Miss Gertrude',
 'Miss Gertrude Innes',
 'Miss Innes',
 'Miss Liddy',
 'Miss Louise',
 'Arnold Armstrong',
 'Miss Rachel',
 'Mr. Armstrong',
 'Mr. Arnold',
 'Mr. Arnold Armstrong',
 'Louise Armstrong',
 'Fanny Armstrong',
 'Peter Armstrong',
 'Miss Armstrong',
 'Mrs. Armstrong',
 'Paul Armstrong',
    'Anne Endicott',
  'Anne Haswell',
  'Anne Watson',
  'Mary Anne'
]

output = {
    'Miss Gertrude Innes': [
        'Miss Gertrude',
        'Miss Gertrude Innes',
        'Miss Innes'
    ],
    'Miss Liddy': [ 'Miss Liddy'  ], 
    'Miss Louise':[ 'Miss Louise' ],
    'Miss Rachel':[ 'Miss Rachel' ],
    'Mr. Arnold Armstrong': [
        'Mr. Arnold Armstrong',
        'Arnold Armstrong',
        'Mr. Armstrong',
        'Mr. Arnold',
    ], 
    'Anne Endicott': ['Anne Endicott'],
    'Anne Haswell': ['Anne Haswell'],
    'Anne Watson': ['Anne Watson'],
    'Mary Anne': ['Mary Anne']

}
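
The `inputStr` / `output` pair above reads as a test fixture: a list of raw names and the alias dictionary a correct grouping should produce. One way to use it is to diff a candidate result against the expected dictionary. The sketch below assumes that workflow; `check_aliases` is a hypothetical helper and the `actual` value is illustrative.

```python
def check_aliases(expected, actual):
    """Return, per key, the aliases that are missing or unexpected in `actual`."""
    problems = {}
    for key in expected.keys() | actual.keys():
        want = set(expected.get(key, []))
        got = set(actual.get(key, []))
        if want != got:
            problems[key] = {
                'missing': sorted(want - got),
                'unexpected': sorted(got - want),
            }
    return problems


expected = {
    'Mr. Arnold Armstrong': ['Mr. Arnold Armstrong', 'Arnold Armstrong',
                             'Mr. Armstrong', 'Mr. Arnold'],
}
# Illustrative candidate result that lacks one alias.
actual = {
    'Mr. Arnold Armstrong': ['Mr. Arnold Armstrong', 'Arnold Armstrong',
                             'Mr. Arnold'],
}
print(check_aliases(expected, actual))
```

An empty dictionary means the candidate grouping matches the fixture exactly.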

In [20]:
title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', 'Mr. Arnold Armstrong')
print(title)
title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', 'Doctor Arnold Armstrong')
print(title)
no_title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', 'Arnold Armstrong')
print(no_title)
name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', 'Mr. Arnold Armstrong')
name_split_no_title if len(name_split_no_title) == 1 else [name_split_no_title[0]]

['Mr.']
['Doctor']
[]


['Arnold']

In [21]:
surname = re.findall(r'[A-Z][a-z]+$', 'Mr. Arnold Armstrong')
print(surname)
surname = re.findall(r'[A-Z][a-z]+$', 'Arnold Armstrong')
print(surname)
surname = re.findall(r'[A-Z][a-z]+$', 'Miss Armstrong')
print(surname)

['Armstrong']
['Armstrong']
['Armstrong']


In [22]:
regexString = r'[A-Z][a-z]+(?=\s)'
surname = re.findall(regexString, 'Mr. Arnold Armstrong')
print(surname)
surname = re.findall(regexString, 'Arnold Armstrong')
print(surname)
surname = re.findall(regexString, 'Miss Armstrong')
print(surname)

['Arnold']
['Arnold']
['Miss']
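
The regex experiments above can be folded into a single helper that returns the title and the remaining name tokens. This is a sketch only: `parse_name` is a hypothetical name, and it reuses the two patterns from the cells above unchanged.

```python
import re

TITLE_RE = re.compile(r'^(?:Mr\.|Mrs\.|Miss|Doctor)')
NAME_RE = re.compile(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+')


def parse_name(full_name):
    """Split a name into (title or None, list of capitalized name tokens)."""
    title_match = TITLE_RE.match(full_name)
    title = title_match.group(0) if title_match else None
    parts = NAME_RE.findall(full_name)
    return title, parts


for candidate in ['Mr. Arnold Armstrong', 'Arnold Armstrong', 'Miss Armstrong']:
    print(candidate, '->', parse_name(candidate))
```

Note that a lone token after a title (`Miss Armstrong`, `Miss Gertrude`) cannot be classified as first name or surname from the string alone.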


In [23]:

def createNamedDictionary(personList): 
    personList.sort(key=len, reverse=True)
    name_dictionary = {}
    hasKey=False

    for potential_name_key in personList:         
        title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', potential_name_key)
        name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', potential_name_key)
        original_length_no_title = len(name_split_no_title)
        
        if len(name_split_no_title) > 1:
            surname = name_split_no_title[1]
            name_split_no_title = [name_split_no_title[0]]
        
        print(len(name_split_no_title))
        print(name_split_no_title)
        for name in personList:
            if name in name_dictionary.keys() or (len(name_dictionary.values()) > 0 and
                                                  name in np.concatenate(list(name_dictionary.values()))):
                print("Continuing on ", name)
                continue

            if re.match(fr".*({'|'.join(name_split_no_title)}).*", name):
#                 print('\t ',re.match(fr".*({'|'.join(name_split_no_title)}).*", name))

                if original_length_no_title == 1: 
                    continue
                    

                if len(name) > len(potential_name_key): 
                    raise ValueError("This shouldn't happen with the way the sorting works")
                else:
                    if potential_name_key not in name_dictionary.keys() and name not in name_dictionary.keys(): 
                        print('\t Key:', potential_name_key, "\t\t Value: ", potential_name_key,  )
                        name_dictionary[potential_name_key] = [ potential_name_key ]
                    elif potential_name_key in name_dictionary.keys(): 
                            print('\t Key:', potential_name_key, "\t\t Value: ", name )
                            print('\tt name_split_no_title: ', name_split_no_title )
                            if re.match(fr'.*(?!{surname}).*$', name) and len(name_split_no_title) == len(re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', name)):
                                print('\t\t >>>>>>>>>>>. SURNAME DOES NOT MATCH!!!')
                                print('\t\t >>>>>>>>>>>.potential_name_key, name')

                                
                            name_dictionary[potential_name_key] = [
                                *name_dictionary[potential_name_key], 
                                name
                            ]                    
                    
#                     print('key: ', potential_name_key, '\tvalue', potential_name_key)
#                     name_dictionary[potential_name_key] = [ potential_name_key ]
                continue
        
    return name_dictionary

In [25]:
def extract_surnames(unique_person_list): 
    surname_list = []
    names_to_ignore = ['Anne']
    # First pass: collect the second name token of every multi-part name as a surname
    for name in unique_person_list:
        name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', name)
        surname = '' if len(name_split_no_title) == 1 else name_split_no_title[1]

        if surname != '' and surname not in names_to_ignore: 
            surname_list.append(surname)

    return set(surname_list)

def get_unambiguous_name_list(unique_person_list, surname_list): 
    unambiguous_name_list = []
    for name in unique_person_list:
        name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', name)
        first_name = name_split_no_title[0]

        if first_name not in surname_list:
            unambiguous_name_list.append(name)
        else: 
            print(f"Name {name} is ambiguous. Not processing")
    return unambiguous_name_list

def create_named_dictionary(unique_person_list): 
    title_regex = r'^(?:Mr\.|Mrs\.|Miss|Doctor)'
    no_title_regex = r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+'
    unique_person_list.sort(key=len, reverse=True)
    unique_person_key = {}
    name_dictionary = {}
    
    surname_list = extract_surnames(unique_person_list)
    unambiguous_person_list = get_unambiguous_name_list(unique_person_list, surname_list)
    
    alias_dictionary = createNamedDictionary(unambiguous_person_list)
    
    return alias_dictionary
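
A standalone sketch of the ambiguity check may help show why a title-plus-surname form such as `Mr. Armstrong` is filtered out: its first non-title token is a known surname. The sample names come from the test list earlier in the notebook; `known_surnames` and `is_ambiguous` are hypothetical names that mirror `extract_surnames` and `get_unambiguous_name_list` in simplified form.

```python
import re

NAME_RE = re.compile(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+')


def known_surnames(names):
    """Second token of every multi-part name, as in extract_surnames."""
    found = set()
    for name in names:
        parts = NAME_RE.findall(name)
        if len(parts) > 1:
            found.add(parts[1])
    return found


def is_ambiguous(name, surnames):
    """A name is ambiguous when its first token is also a known surname."""
    parts = NAME_RE.findall(name)
    return bool(parts) and parts[0] in surnames


names = ['Mr. Arnold Armstrong', 'Arnold Armstrong', 'Mr. Armstrong',
         'Mr. Arnold', 'Miss Gertrude Innes', 'Miss Innes']
surnames = known_surnames(names)
flagged = [name for name in names if is_ambiguous(name, surnames)]
print(flagged)
```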

In [26]:
someset = create_named_dictionary(staircase.get_all_characters_per_novel())

Name Louise Armstrong is ambiguous. Not processing
Name Thomas Johnson is ambiguous. Not processing
Name Doctor Stewart is ambiguous. Not processing
Name Miss Armstrong is ambiguous. Not processing
Name Mrs. Armstrong is ambiguous. Not processing
Name Thomas—Thomas is ambiguous. Not processing
Name Mr. Armstrong is ambiguous. Not processing
Name Mrs. Fitzhugh is ambiguous. Not processing
Name Doctor Walker is ambiguous. Not processing
Name Rachel Innes is ambiguous. Not processing
Name Mr. Trautman is ambiguous. Not processing
Name Mrs. Stewart is ambiguous. Not processing
Name Mrs. Wallace is ambiguous. Not processing
Name Miss Rachel is ambiguous. Not processing
Name Miss Louise is ambiguous. Not processing
Name Rosie—Rosie is ambiguous. Not processing
Name Mrs. Watson is ambiguous. Not processing
Name Mr. Bailey is ambiguous. Not processing
Name Miss Innes is ambiguous. Not processing
Name Mrs. Innes is ambiguous. Not processing
Name Mr. Innes is ambiguous. Not processing
1
['Intern

Continuing on  Arnold Armstrong
Continuing on  Casanova Station
Continuing on  Alexander Graham
Continuing on  Charity Hospital
Continuing on  Beatrice Fairfax
Continuing on  Nina Carrington
Continuing on  Peter Armstrong
Continuing on  Fanny Armstrong
Continuing on  Wednesday Riggs
Continuing on  Mr. Jack Bailey
Continuing on  Mr. John Bailey
Continuing on  Suppose Louise
Continuing on  Gertrude Innes
Continuing on  Aubrey Wallace
Continuing on  Lucien Wallace
Continuing on  Casanova Creek
Continuing on  Paul Armstrong
Continuing on  Spiritual Life
Continuing on  Joe Jefferson
Continuing on  White Streets
	 Key: City Hospital 		 Value:  City Hospital
Continuing on  Miss Gertrude
Continuing on  John Bailey
Continuing on  Jack Bailey
Continuing on  Mr. Arnold
Continuing on  John Innes
1
['Anne']
Continuing on  International Steamship Company
Continuing on  Pearl Brewing Company
Continuing on  Mr. Arnold Armstrong
Continuing on  Mrs. Ogden Fitzhugh
Continuing on  Miss Gertrude Innes
Cont

Continuing on  John Bailey
Continuing on  Anne Watson
Continuing on  Jack Bailey
Continuing on  Mr. Arnold
Continuing on  John Innes
Continuing on  Mary Anne
1
['Carol']
Continuing on  International Steamship Company
Continuing on  Pearl Brewing Company
Continuing on  Mr. Arnold Armstrong
Continuing on  Mrs. Ogden Fitzhugh
Continuing on  Miss Gertrude Innes
Continuing on  Mr. Jacob Trautman
Continuing on  Mr. Paul Armstrong
Continuing on  Eliza Klinefelter
Continuing on  Arnold Armstrong
Continuing on  Casanova Station
Continuing on  Alexander Graham
Continuing on  Charity Hospital
Continuing on  Beatrice Fairfax
Continuing on  Nina Carrington
Continuing on  Peter Armstrong
Continuing on  Fanny Armstrong
Continuing on  Wednesday Riggs
Continuing on  Mr. Jack Bailey
Continuing on  Mr. John Bailey
Continuing on  Suppose Louise
Continuing on  Gertrude Innes
Continuing on  Aubrey Wallace
Continuing on  Lucien Wallace
Continuing on  Casanova Creek
Continuing on  Paul Armstrong
Continuing on

Continuing on  Miss Gertrude
Continuing on  Anna Whitcomb
Continuing on  Matthew Geist
Continuing on  Weekly Ledger
Continuing on  Madame Sweeny
Continuing on  Fifth Street
Continuing on  Young Walker
Continuing on  Annie Morton
Continuing on  Lucy Haswell
Continuing on  Anne Haswell
Continuing on  Carol Street
Continuing on  Frank Walker
Continuing on  Mattie Bliss
Continuing on  Halsey Innes
Continuing on  Ella Stewart
Continuing on  Sam Bohannon
Continuing on  Liddy Allen
	 Key: Marine Bank 		 Value:  Marine Bank
Continuing on  John Bailey
Continuing on  Anne Watson
Continuing on  Jack Bailey
Continuing on  Mr. Arnold
Continuing on  Miss Liddy
Continuing on  Sam Huston
Continuing on  Mr. Halsey
Continuing on  John Innes
Continuing on  Mary Anne
1
['John']
Continuing on  International Steamship Company
Continuing on  Pearl Brewing Company
Continuing on  Mr. Arnold Armstrong
Continuing on  Mrs. Ogden Fitzhugh
Continuing on  Miss Gertrude Innes
Continuing on  Mr. Jacob Trautman
Continu

Continuing on  Nina Carrington
Continuing on  Peter Armstrong
Continuing on  Fanny Armstrong
Continuing on  Wednesday Riggs
Continuing on  Mr. Jack Bailey
Continuing on  Mr. John Bailey
Continuing on  Suppose Louise
Continuing on  Gertrude Innes
Continuing on  Aubrey Wallace
Continuing on  Lucien Wallace
Continuing on  Casanova Creek
Continuing on  Paul Armstrong
Continuing on  Spiritual Life
Continuing on  Joe Jefferson
Continuing on  White Streets
Continuing on  City Hospital
Continuing on  Anne Endicott
Continuing on  Miss Gertrude
Continuing on  Anna Whitcomb
Continuing on  Matthew Geist
Continuing on  Weekly Ledger
Continuing on  Madame Sweeny
Continuing on  Fifth Street
Continuing on  Young Walker
Continuing on  Annie Morton
Continuing on  Lucy Haswell
Continuing on  Anne Haswell
Continuing on  Carol Street
Continuing on  Frank Walker
Continuing on  Mattie Bliss
Continuing on  Halsey Innes
Continuing on  Ella Stewart
Continuing on  Sam Bohannon
Continuing on  Liddy Allen
Continui

Continuing on  Sam Bohannon
Continuing on  Liddy Allen
Continuing on  Marine Bank
Continuing on  John Bailey
Continuing on  Anne Watson
Continuing on  Valley Mill
Continuing on  Aunt Rachel
Continuing on  Jack Bailey
Continuing on  Middle West
Continuing on  Mr. Arnold
Continuing on  Dragon Fly
Continuing on  Miss Liddy
Continuing on  Sam Huston
Continuing on  Elm Street
Continuing on  Mr. Halsey
Continuing on  John Innes
	 Key: May Riggs 		 Value:  May Riggs
Continuing on  Mary Anne
Continuing on  Aunt Ray
1
['Rock']
Continuing on  International Steamship Company
Continuing on  Pearl Brewing Company
Continuing on  Mr. Arnold Armstrong
Continuing on  Mrs. Ogden Fitzhugh
Continuing on  Miss Gertrude Innes
Continuing on  Mr. Jacob Trautman
Continuing on  Mr. Paul Armstrong
Continuing on  Eliza Klinefelter
Continuing on  Arnold Armstrong
Continuing on  Casanova Station
Continuing on  Alexander Graham
Continuing on  Charity Hospital
Continuing on  Beatrice Fairfax
Continuing on  Nina Carri

In [27]:
someset

{'International Steamship Company': ['International Steamship Company'],
 'Pearl Brewing Company': ['Pearl Brewing Company'],
 'Mr. Arnold Armstrong': ['Mr. Arnold Armstrong',
  'Arnold Armstrong',
  'Mr. Arnold'],
 'Mrs. Ogden Fitzhugh': ['Mrs. Ogden Fitzhugh'],
 'Miss Gertrude Innes': ['Miss Gertrude Innes',
  'Gertrude Innes',
  'Miss Gertrude'],
 'Mr. Jacob Trautman': ['Mr. Jacob Trautman'],
 'Mr. Paul Armstrong': ['Mr. Paul Armstrong', 'Paul Armstrong'],
 'Eliza Klinefelter': ['Eliza Klinefelter'],
 'Casanova Station': ['Casanova Station', 'Casanova Creek'],
 'Alexander Graham': ['Alexander Graham'],
 'Charity Hospital': ['Charity Hospital'],
 'Beatrice Fairfax': ['Beatrice Fairfax'],
 'Nina Carrington': ['Nina Carrington'],
 'Peter Armstrong': ['Peter Armstrong'],
 'Fanny Armstrong': ['Fanny Armstrong'],
 'Wednesday Riggs': ['Wednesday Riggs'],
 'Mr. Jack Bailey': ['Mr. Jack Bailey', 'Jack Bailey'],
 'Mr. John Bailey': ['Mr. John Bailey', 'John Bailey', 'John Innes'],
 'Suppose L
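
In the result above, `John Innes` is listed under `Mr. John Bailey` and `Casanova Creek` under `Casanova Station`, because `createNamedDictionary` matches candidates on the first name token alone. The sketch below shows a stricter comparison that also requires the last tokens to agree when both names have more than one token. It is not the notebook's method; `same_person` is a hypothetical helper, and treating the last token as the surname is an assumption.

```python
import re

NAME_RE = re.compile(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+')


def same_person(name_a, name_b):
    """True when first tokens match and, if both have one, last tokens match too."""
    parts_a = NAME_RE.findall(name_a)
    parts_b = NAME_RE.findall(name_b)
    if not parts_a or not parts_b or parts_a[0] != parts_b[0]:
        return False
    if len(parts_a) > 1 and len(parts_b) > 1:
        return parts_a[-1] == parts_b[-1]
    # One side is a bare first name (e.g. "Mr. Arnold"): accept the match.
    return True


print(same_person('Mr. John Bailey', 'John Bailey'))
print(same_person('Mr. John Bailey', 'John Innes'))
```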

## Lower Ten

In [41]:
# Load Lower Ten (the second entry in books)
title, metadata = list(books.items())[1]
lower_ten = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [45]:
lower_ten.get_all_characters_per_novel()


['Alleghany Mountains',
 'Mrs. Curtis',
 'Monsieur Blakeley',
 'John Flanders',
 'Mr. Johnson',
 'Gulf Stream',
 'Harry Pinckney Sullivan',
 'Miss West',
 'Camberwell Beauty',
 'Mr. John Gilmore',
 'Alice Curtis',
 'Gentleman Andy',
 'Sam Forbeses',
 'Wood Street',
 'Mr. Lawrence',
 'Simon Harrington',
 'Edgar Allan Poe',
 'National Bank',
 'Government English',
 'Fish Commission',
 'Mr. Andrew Bronson',
 'Dorothy Browne',
 'Mr. Granger',
 'Mr. Harrington',
 'Janet Mac',
 'Mr. Gilmore',
 'Grand Rapids',
 'Little Hotchkiss',
 'Mr. Hotchkiss',
 'MacLures',
 'Pullman Company',
 'Lawrence Blakeley',
 'Mrs. Conway',
 'From Richey McKnight',
 'Westinghouse Electric',
 'Bernard Shaw',
 'Six Curtises',
 'Mrs. Klopton',
 'Allegheny County',
 'Possibly Hotchkiss',
 'Harry Sullivan',
 'Mr. Simon Harrington',
 'Mr. Peck',
 'Henry Sullivan',
 'Mr. Bronson',
 'Mr. Henry Pinckney Sullivan',
 'Suppose Harrington',
 'By Jove',
 'Timeo Danaos',
 'Does Richey',
 'Mr. McKnight',
 'Doctor Williams',
 'Cona

In [46]:
create_named_dictionary(lower_ten.get_all_characters_per_novel())

Name Mr. Harrington is ambiguous. Not processing
Name Mr. Hotchkiss is ambiguous. Not processing
Name Mrs. Sullivan is ambiguous. Not processing
Name Mr. Sullivan is ambiguous. Not processing
Name Mr. Blakeley is ambiguous. Not processing
Name Mrs. Curtis is ambiguous. Not processing
Name Mr. Johnson is ambiguous. Not processing
Name Mr. Gilmore is ambiguous. Not processing
Name Mrs. Conway is ambiguous. Not processing
Name Mr. Bronson is ambiguous. Not processing
Name Miss West is ambiguous. Not processing
Name Mrs. West is ambiguous. Not processing
Name MacLures is ambiguous. Not processing
1
['Henry']
	 Key: Mr. Henry Pinckney Sullivan 		 Value:  Mr. Henry Pinckney Sullivan
	 Key: Mr. Henry Pinckney Sullivan 		 Value:  Henry Pinckney Sullivan
	t name_split_no_title:  ['Henry']
	 Key: Mr. Henry Pinckney Sullivan 		 Value:  Henry Sullivan
	t name_split_no_title:  ['Henry']
1
['Wilson']
Continuing on  Mr. Henry Pinckney Sullivan
	 Key: Mr. Wilson Budd Hotchkiss 		 Value:  Mr. Wilson Bu

Continuing on  Henry Sullivan
Continuing on  John Flanders
Continuing on  Mr. Lawrence
Continuing on  John Gilmore
Continuing on  Van Kirk
1
['Ida']
Continuing on  Mr. Henry Pinckney Sullivan
Continuing on  Mr. Wilson Budd Hotchkiss
Continuing on  Harry Pinckney Sullivan
Continuing on  Henry Pinckney Sullivan
Continuing on  Westinghouse Electric
Continuing on  Wilson Budd Hotchkiss
Continuing on  From Richey McKnight
Continuing on  Mr. Simon Harrington
Continuing on  Mr. Justice Springer
Continuing on  Virginia Hot Springs
Continuing on  Alleghany Mountains
Continuing on  Mr. Francis Johnson
Continuing on  Government English
Continuing on  Mr. Andrew Bronson
Continuing on  Possibly Hotchkiss
Continuing on  Suppose Harrington
Continuing on  Monsieur Blakeley
Continuing on  Camberwell Beauty
Continuing on  Lawrence Blakeley
Continuing on  Mr. John Gilmore
Continuing on  Simon Harrington
Continuing on  Little Hotchkiss
Continuing on  Allegheny County
Continuing on  Edgar Allan Poe
Continu

Continuing on  Samuel Forbeses
Continuing on  Doctor Van Kirk
Continuing on  Gentleman Andy
Continuing on  Dorothy Browne
Continuing on  Harry Sullivan
Continuing on  Henry Sullivan
Continuing on  Ida Harrington
Continuing on  Great Unkissed
Continuing on  Great Unwashed
Continuing on  Blanche Conway
Continuing on  John Flanders
Continuing on  National Bank
Continuing on  All Baltimore
Continuing on  Union Station
Continuing on  Alice Curtis
Continuing on  Sam Forbeses
Continuing on  Mr. Lawrence
Continuing on  Grand Rapids
Continuing on  Bernard Shaw
	 Key: Six Curtises 		 Value:  Six Curtises
Continuing on  Sammy Forbes
Continuing on  John Gilmore
Continuing on  Van Kirk
1
['Klopton']
Continuing on  Mr. Henry Pinckney Sullivan
Continuing on  Mr. Wilson Budd Hotchkiss
Continuing on  Harry Pinckney Sullivan
Continuing on  Henry Pinckney Sullivan
Continuing on  Westinghouse Electric
Continuing on  Wilson Budd Hotchkiss
Continuing on  From Richey McKnight
Continuing on  Mr. Simon Harring

Continuing on  Fish Commission
Continuing on  Pullman Company
Continuing on  Samuel Forbeses
Continuing on  Doctor Van Kirk
Continuing on  Gentleman Andy
Continuing on  Dorothy Browne
Continuing on  Harry Sullivan
Continuing on  Henry Sullivan
Continuing on  Ida Harrington
Continuing on  Great Unkissed
Continuing on  Great Unwashed
Continuing on  Blanche Conway
Continuing on  John Flanders
Continuing on  National Bank
Continuing on  All Baltimore
Continuing on  Union Station
Continuing on  Alice Curtis
Continuing on  Sam Forbeses
Continuing on  Mr. Lawrence
Continuing on  Grand Rapids
Continuing on  Bernard Shaw
Continuing on  Six Curtises
Continuing on  Timeo Danaos
Continuing on  Mr. McKnight
Continuing on  Sammy Forbes
Continuing on  Oh—Miss West
Continuing on  John Gilmore
Continuing on  Gulf Stream
Continuing on  Wood Street
Continuing on  Does Richey
	 Key: Conan Doyle 		 Value:  Conan Doyle
Continuing on  Van Kirk
1
['Alison']
Continuing on  Mr. Henry Pinckney Sullivan
Continuin

Continuing on  Gentleman Andy
Continuing on  Dorothy Browne
Continuing on  Harry Sullivan
Continuing on  Henry Sullivan
Continuing on  Ida Harrington
Continuing on  Great Unkissed
Continuing on  Great Unwashed
Continuing on  Blanche Conway
Continuing on  John Flanders
Continuing on  National Bank
Continuing on  All Baltimore
Continuing on  Union Station
Continuing on  Alice Curtis
Continuing on  Sam Forbeses
Continuing on  Mr. Lawrence
Continuing on  Grand Rapids
Continuing on  Bernard Shaw
Continuing on  Six Curtises
Continuing on  Timeo Danaos
Continuing on  Mr. McKnight
Continuing on  Sammy Forbes
Continuing on  Oh—Miss West
Continuing on  John Gilmore
Continuing on  Gulf Stream
Continuing on  Wood Street
Continuing on  Does Richey
Continuing on  Conan Doyle
Continuing on  Alison West
Continuing on  Seal Harbor
Continuing on  Chevy Chase
Continuing on  Monte Carlo
Continuing on  Janet Mac
Continuing on  By Sunday
Continuing on  By George
Continuing on  Van Kirk
Continuing on  By Jov

{'Mr. Henry Pinckney Sullivan': ['Mr. Henry Pinckney Sullivan',
  'Henry Pinckney Sullivan',
  'Henry Sullivan'],
 'Mr. Wilson Budd Hotchkiss': ['Mr. Wilson Budd Hotchkiss',
  'Wilson Budd Hotchkiss'],
 'Harry Pinckney Sullivan': ['Harry Pinckney Sullivan', 'Harry Sullivan'],
 'Westinghouse Electric': ['Westinghouse Electric'],
 'From Richey McKnight': ['From Richey McKnight'],
 'Mr. Simon Harrington': ['Mr. Simon Harrington', 'Simon Harrington'],
 'Mr. Justice Springer': ['Mr. Justice Springer'],
 'Virginia Hot Springs': ['Virginia Hot Springs'],
 'Alleghany Mountains': ['Alleghany Mountains'],
 'Mr. Francis Johnson': ['Mr. Francis Johnson'],
 'Government English': ['Government English'],
 'Mr. Andrew Bronson': ['Mr. Andrew Bronson'],
 'Possibly Hotchkiss': ['Possibly Hotchkiss'],
 'Suppose Harrington': ['Suppose Harrington'],
 'Monsieur Blakeley': ['Monsieur Blakeley'],
 'Camberwell Beauty': ['Camberwell Beauty'],
 'Lawrence Blakeley': ['Lawrence Blakeley', 'Mr. Lawrence'],
 'Mr. Joh
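
Several keys in this output are not character names but capitalized sentence openers glued to a name, for example `Suppose Harrington`, `Possibly Hotchkiss` and `From Richey McKnight`. A minimal sketch of a post-filter follows, assuming a hand-made stop list; `LEADING_NON_NAMES`, `clean_candidate` and `clean_candidates` are hypothetical and not part of the project.

```python
# Hypothetical stop list of sentence-initial words seen in the candidates.
LEADING_NON_NAMES = {"Suppose", "Possibly", "From", "Does", "By"}


def clean_candidate(candidate):
    """Strip leading stop-list words; return None if fewer than two tokens remain."""
    tokens = candidate.split()
    while tokens and tokens[0] in LEADING_NON_NAMES:
        tokens.pop(0)
    return " ".join(tokens) if len(tokens) >= 2 else None


def clean_candidates(candidates):
    """Clean every candidate, dropping rejected ones and duplicates (order kept)."""
    cleaned = []
    for candidate in candidates:
        name = clean_candidate(candidate)
        if name is not None and name not in cleaned:
            cleaned.append(name)
    return cleaned


sample = ["From Richey McKnight", "Suppose Harrington", "Lawrence Blakeley", "By Jove"]
print(clean_candidates(sample))
# ['Richey McKnight', 'Lawrence Blakeley']
```

This only handles leading words. Places and organizations such as `Westinghouse Electric` pass through unchanged and would need separate handling.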

In [57]:
def find_murder_moment(book):
    """Print every line that mentions one of the murder-related words."""
    book_by_chapter = book.lines_to_chapters(book.lines)

    murder_words = ['murder', 'death']
    murder_string = '|'.join(murder_words)
    # Flags must be given at compile time: passing them to re.findall
    # together with a compiled pattern raises ValueError.
    murder_regex = re.compile(fr".*({murder_string}).*", re.IGNORECASE)
    for chapter in book_by_chapter:
        for line in chapter:
            found = murder_regex.findall(line)
            # The pattern matches at most once per single-line string,
            # so test for any match rather than for more than one.
            if found:
                print("FOUND: ", found)
                print(line)


In [58]:
find_murder_moment(lower_ten)
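
Functions in `re` raise `ValueError` when given both a compiled pattern and a `flags` argument, so flags such as `re.IGNORECASE` belong in `re.compile`. The sketch below applies that and returns the first hit instead of printing every one. It assumes chapters are lists of line strings; `first_mention` and the toy `chapters` data are hypothetical and not part of the project.

```python
import re


def first_mention(chapters, words=("murder", "death")):
    """Return (chapter_index, line) for the first line containing any word, else None."""
    # Flags are set once, at compile time.
    pattern = re.compile("|".join(map(re.escape, words)), re.IGNORECASE)
    for chapter_index, chapter in enumerate(chapters):
        for line in chapter:
            if pattern.search(line):
                return chapter_index, line
    return None


# Toy data standing in for the output of lines_to_chapters.
chapters = [
    ["CHAPTER I.", "Nothing happens here"],
    ["CHAPTER II.", "It was Murder, plain and simple"],
]
print(first_mention(chapters))
# (1, 'It was Murder, plain and simple')
```

The match is a plain substring search, so words like "deathly" also count. Adding `\b` word boundaries around the alternation would restrict it to whole words.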

In [53]:
# Candidate detectives:
#   McKnight
#   "I" (the first-person narrator)


lower_ten_chapters = lower_ten.lines_to_chapters(lower_ten.lines)

In [52]:
# Show the lines of the first chapter
lower_ten_chapters[0]

['CHAPTER I.',
 'I GO TO PITTSBURG',
 'McKnight is gradually taking over the criminal end of the business',
 'I never liked it, and since the strange case of the man in lower ten, I have been a bit squeamish',
 'Given a case like that, where you can build up a network of clues that absolutely incriminate three entirely different people, only one of whom can be guilty, and your faith in circumstantial evidence dies of overcrowding',
 'I never see a shivering, white-faced wretch in the prisoners’ dock that I do not hark back with shuddering horror to the strange events on the Pullman car Ontario, between Washington and Pittsburg, on the night of September ninth, last',
 'McKnight could tell the story a great deal better than I, although he can not spell three consecutive words correctly',
 'But, while he has imagination and humor, he is lazy',
 '“It didn’t happen to me, anyhow,” he protested, when I put it up to him',
 '“And nobody cares for second-hand thrills',
 'Besides, you want the 