# NLP - Project 1
## Rinehart Analysis
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

Check out the **[README](https://github.com/NLPaladins/rinehartAnalysis/blob/main/README.md)**


Or the current **[TODO](https://github.com/NLPaladins/rinehartAnalysis/blob/main/TODO.md)** list.

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
import sys


In [2]:
# Clone the repository if you're on Google Colab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis.git
    print("Project cloned.")

print("Clone project?")
print("(If you do this you will overwrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Colab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will overwrite local changes on other files e.g. configs)
Not needed if you're not on Google Colab


Button(description='Yes, clone', style=ButtonStyle())

In [3]:
# Change into the cloned project directory if you're on Google Colab
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        os.chdir('/content/rinehartAnalysis')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Colab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Colab?


Button(description='Yes', style=ButtonStyle())
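
The two buttons above ask by hand whether the notebook is running on Google Colab. A minimal sketch of detecting this automatically, assuming that the `google.colab` package is importable only inside Colab; `running_on_colab` is a hypothetical helper and not part of this project.

```python
import importlib.util


def running_on_colab() -> bool:
    """Return True when the `google.colab` package can be found (assumed to mean Colab)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package `google` is not installed at all
        return False


if __name__ == "__main__":
    print("Running on Colab:", running_on_colab())
```

The result could then decide whether to clone the project and change directory, instead of relying on the button clicks.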

### At any point, to save changes
click **File > Save a copy in GitHub**

### Now go to files on the left, and open:
- rinehartAnalysis/confs/proj_1.yml

Press Ctrl+S to save changes.

## Load Libraries and setup

In [4]:
import traceback
import argparse
from importlib import reload as reload_lib
from pprint import pprint
import numpy as np

# Custom libs
from nlp_libs import Configuration, ColorizedLogger, ProcessedBook
# Import libs this way if you want to edit and reload them dynamically
# import nlp_libs.books.processed_book as books_lib  # Commented out until the class is finalized


### Libraries Overview
All the libraries are located under *"\<project root>/nlp_libs"*
- ***ProcessedBook***: Loc: **books/processed_book.py**, Desc: *Book Pre-processor*
- ***Configuration***: Loc: **configuration/configuration.py**, Desc: *Configuration Loader*
- ***ColorizedLogger***: Loc: **fancy_logger/colorized_logger.py**, Desc: *Logger with formatted text capabilities*

In [5]:
# The path of configuration and log save path
config_path = "confs/proj_1.yml"  # Open files > confs > proj_1.yml to edit temporarily. Commit to save permanently
# !cat "$config_path"
log_path = "logs/proj_1.log"  # Open files > logs > proj_1.log to debug logs of previous runs

In [6]:
# The logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)

2021-09-26 12:47:22 FancyLogger  INFO     Logger is set. Log file path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis/logs/proj_1.log


In [7]:
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books = conf.get_config('data_loader')['config']['books']
pprint(books)  # Pretty print the books dict

2021-09-26 12:47:24 Config       INFO     Configuration file loaded successfully from path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis/confs/proj_1.yml
2021-09-26 12:47:24 Config       INFO     Configuration Tag: proj1


{'K': {'crime_type': 'example',
       'detective': ['man1', 'man2'],
       'suspects': ['man3', 'man4'],
       'url': 'https://gutenberg.org/files/9931/9931-0.txt'},
 'The_After_House': {'crime_type': 'example',
                     'detective': ['man1', 'man2'],
                     'suspects': ['man3', 'man4'],
                     'url': 'https://gutenberg.org/files/2358/2358-0.txt'},
 'The_Case_of_Jennie_Brice': {'crime_type': 'example',
                              'detective': ['man1', 'man2'],
                              'suspects': ['man3', 'man4'],
                              'url': 'https://gutenberg.org/cache/epub/11127/pg11127.txt'},
 'The_Circular_Staircase': {'crime_type': 'example',
                            'detectives': ['Rachel Innes'],
                            'suspects': ['Liddy',
                                         'Halsey',
                                         'Gertrude',
                                         'Paul Armstrong',
            

## Exploration

In [20]:
import urllib.request
import re
from typing import *


class ProcessedBook:
    num_map = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
               (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
               (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]
    title: str
    url: str
    detectives: List[str]
    suspects: List[str]
    crime_type: str

    def __init__(self, title: str, metadata: Dict, make_lower: bool = True):
        """
        raw holds the books as a single string.
        clean holds the books as a list of lowercase lines starting
        from the first chapter and ending with the last sentence.
        """
        self.title = title
        self.url = metadata['url']
        self.detectives = metadata['detectives']
        self.suspects = metadata['suspects']
        self.crime_type = metadata['crime_type']
        self.raw = self.read_book_from_proj_gut(self.url)
        if make_lower:
            lines = self.raw.lower()
        else:
            lines = self.raw
            
        lines = re.sub(r'\r\n', r'\n', lines)
        lines = re.findall(r'.*(?=\n)',  lines)        
        
        self.lines = self.clean_lines(lines=lines)
        self.clean = self.get_clean_book(make_lower=make_lower)

    @staticmethod
    def read_book_from_proj_gut(book_url: str) -> str:
        req = urllib.request.Request(book_url)
        client = urllib.request.urlopen(req)
        page = client.read()
        return page.decode('utf-8')

    def get_clean_book(self, make_lower: bool = True) -> List[List[str]]:
        if make_lower:
            lines = self.raw.lower()
        else:
            lines = self.raw

        lines = re.sub(r'\r\n', r'\n', lines)
        lines = re.findall(r'.*(?=\n)',  lines)        

        lines = self.clean_lines(lines)
        chapters = self.lines_to_chapters(lines)
        return chapters

    def clean_lines(self, lines: List[str]) -> List[str]:
        clean_lines = []
        start = False
        for line in lines:
            if re.match(r'^chapter i\.', line, re.IGNORECASE):
                clean_lines.append(line)
                start = True
                continue
            if not start:
                continue
            if re.match(r'^\*\*\* end of the project gutenberg ebook', line, re.IGNORECASE):
                break
            if self.pass_clean_filter(line):
                clean_lines.append(line)
        return clean_lines

    @staticmethod
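    def lines_to_chapters(lines: List[str]) -> List[List[str]]:
        chapters = []
        sentences = []
        current_sent = ''
        # Initialise the flag so a first line that is not a chapter header cannot hit an unbound name
        add_chapter_title = False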
        for i, line in enumerate(lines):
            # add chapter as 1st sentence
            if re.match(r'^chapter [ivxlcdm]+\.$', line, re.IGNORECASE):
                if sentences:
                    chapters.append(sentences)
                sentences = [line]
                add_chapter_title = True
                continue
            # add chapter title as 2nd sentence
            elif add_chapter_title:
                sentences.append(line)
                add_chapter_title = False
                continue
            sents = re.findall(r' *((?:mr\.|mrs\.|[^\.\?!])*)(?<!mr)(?<!mrs)[\.\?!]', line, re.IGNORECASE)
            # if no sentence end is detected
            if not sents:
                if current_sent == '':
                    current_sent = line
                else:
                    current_sent += ' ' + line
            # if at least one sentence end is detected
            else:
                for group in sents:
                    if current_sent != '':
                        current_sent += ' ' + group
                        sentences.append(current_sent)
                    else:
                        sentences.append(group)
                    current_sent = ''
                # set the next sentence to its start if there is one
                sent_end = re.search(r'(?<!mr)(?<!mrs)[\.\?!] ((?:mr\.|mrs\.|[^\.\?!])*)$', line, re.IGNORECASE)
                if sent_end is not None:
                    current_sent = sent_end.groups()[0]
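        # Append the final chapter, which the loop above never flushes
        if sentences:
            chapters.append(sentences)
        return chapters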

    @staticmethod
    def pass_clean_filter(line: str) -> bool:
        # removing the illustration lines and empty lines
        # can add other filters here as needed
        if line == '' or re.match(r'illustration:|\[illustration\]', line, re.IGNORECASE):
            return False
        else:
            return True

    def get_characters_per_chapter(self, chapter): 
        found_character_list = []
        singular_or_multiple_names = r'[A-Z][a-z][A-Z]?[a-z][A-Z]?[a-z]+(?:(?:\s|,|.|\.\s)[A-Z][a-z][A-Z]?[a-z][A-Z]?[a-z]+)?(?:\s[A-Z][a-z][A-Z]?[a-z][A-Z]?[a-z]+)?(?:\s[A-Z][a-z][A-Z]?[a-z][A-Z]?[a-z]+)?'
        # TODO: The search string deliberately skips the first word of each sentence,
        #       because the regex cannot tell a capitalised ordinary word from a name.
        #       Side effect: in The Man in Lower Ten, McKnight is captured as "Knight"
        #       when the name opens a sentence or a quotation.
        search_string = re.compile(fr'(?<!“)(?<!‘)(?<!^)({singular_or_multiple_names})')
        #get characters per sentence in chapter
        for sentence in chapter:
            res = re.findall(search_string, sentence)
            found_character_list.append(res)

        unique_characters = list(np.concatenate(found_character_list))
        return found_character_list, unique_characters
      
        
    ##
    ## @Warning: Only works when the book is loaded with make_lower=False,
    ## because the name regex relies on capital letters.
    ##
    def get_all_characters_per_novel(self):
        preceding_words_to_ditch = [
            'After', 'Although', 'And', 'As','At',
            'Before', 'Both', 'But', 'Did', 'For', 
            'Good', 'Had', 'Has', 'Home', 'If', 'Is',
            'Leaving', 'Like', 'No', 'Nice', 'Old', 'On', 'Or',
            'Poor', 'Send', 'So', 'That', 'Tell', 'The', 'Thank', 
            'To', 'Was', 'Whatever', 'When', 'Where', 'While', 
            'With','Your', 'View', 
            #Specific Places
            'African', 'Brewing', 'Hospital', 'Zion', 'New','Country', 
            'Greenwood', 'Western', 'American', 'Bar', 'Chestnut', 'Queen', 
            'Summitville', 'Union', 'City', "Japan","Europe","Company",
            "Street","Station","Bank","Weekly", "ville","Providence",
            "Creek","Brewing", 'California', 'Italian', 'London', 'French',
            'Scotland'
        ]
        

        book_by_chapter = self.lines_to_chapters(self.lines)
        
        totalUniqueList = []
        characterProgressionList = []
        for chapter in book_by_chapter: 

            characterProgression, uniqueCharacters = self.get_characters_per_chapter(chapter)

            characterProgressionList.append(characterProgression)
            totalUniqueList = [*totalUniqueList, *uniqueCharacters]

        totalUnique = set(totalUniqueList)
        
        joined_preceding_words_to_lose = '|'.join(preceding_words_to_ditch)
        #not even preceding - just ditch them if they're within the "name"
        preceding_word_to_lose_regex = fr'^(?!.*({joined_preceding_words_to_lose})).*'
        regex = re.compile(preceding_word_to_lose_regex)

        filtered_people = list(filter(regex.match, totalUnique))
        
        return filtered_people, characterProgressionList
        
    def get_chapter(self, chapter: int) -> List[str]:
        return self.clean[chapter - 1]

    def extract_character_names(self):
        lines_by_chapter = self.lines_to_chapters(self.lines)
        for chapter in lines_by_chapter: 
            print(chapter)
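
The `num_map` table in `ProcessedBook` is not used by any of the methods shown above. A minimal sketch of how such a table could turn a chapter number into the Roman numeral that appears in headers like `CHAPTER XXXIII.`; the names `int_to_roman` and `chapter_header` are hypothetical and do not exist in the project.

```python
from typing import List, Tuple

# Same value/symbol pairs as ProcessedBook.num_map, ordered from largest to smallest
NUM_MAP: List[Tuple[int, str]] = [
    (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
    (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
    (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
]


def int_to_roman(number: int) -> str:
    """Greedily subtract the largest value that still fits (hypothetical helper)."""
    if number <= 0:
        raise ValueError("only positive integers are supported")
    parts = []
    for value, symbol in NUM_MAP:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return ''.join(parts)


def chapter_header(chapter: int) -> str:
    """Build a header in the style seen in the books, e.g. 'CHAPTER XV.'."""
    return f'CHAPTER {int_to_roman(chapter)}.'


print(chapter_header(33))  # CHAPTER XXXIII.
```

A header built this way could be compared against the first sentence of each chapter as a sanity check on the chapter split.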

In [9]:
# Load Circular Staircase
title, metadata = list(books.items())[0]
staircase = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [10]:
for i, sent in enumerate(staircase.get_chapter(33)):
  if i == 10: break
  logger.info(sent)

2021-09-26 12:47:30 Notebook     INFO     CHAPTER XXXIII.
2021-09-26 12:47:30 Notebook     INFO     AT THE FOOT OF THE STAIRS
2021-09-26 12:47:30 Notebook     INFO     As I drove rapidly up to the house from Casanova Station in the hack, I saw the detective Burns loitering across the street from the Walker place
2021-09-26 12:47:30 Notebook     INFO     So Jamieson was putting the screws on—lightly now, but ready to give them a twist or two, I felt certain, very soon
2021-09-26 12:47:30 Notebook     INFO     The house was quiet
2021-09-26 12:47:30 Notebook     INFO     Two steps of the circular staircase had been pried off, without result, and beyond a second message from Gertrude, that Halsey insisted on coming home and they would arrive that night, there was nothing new
2021-09-26 12:47:30 Notebook     INFO     Mr. Jamieson, having failed to locate the secret room, had gone to the village
2021-
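
The output above shows that `Mr. Jamieson` stays inside a single sentence even though it contains a period. A minimal sketch of the same idea on a plain string, assuming only `Mr.` and `Mrs.` need protecting; `split_sentences` is a hypothetical helper, and unlike `lines_to_chapters` it keeps the closing punctuation.

```python
import re
from typing import List

# Break on whitespace that follows ., ? or !, unless the period belongs to "Mr." or "Mrs."
SENTENCE_BREAK = re.compile(r'(?<!\bmr\.)(?<!\bmrs\.)(?<=[.?!])\s+', re.IGNORECASE)


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the terminal punctuation."""
    return [part for part in SENTENCE_BREAK.split(text.strip()) if part]


sample = ("Mr. Jamieson, having failed to locate the secret room, "
          "had gone to the village. The house was quiet.")
for sentence in split_sentences(sample):
    print(sentence)
```

Other abbreviations (for example `Dr.`) would still be split; each one needs its own lookbehind in this approach.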

In [11]:
# Note: index 0 is The Circular Staircase (see the output below), so `lower_ten` holds that book here, lowercased
title, metadata = list(books.items())[0]
lower_ten = ProcessedBook(title=title, metadata=metadata)

In [12]:
logger.info(f'The raw length of this book as a string is {len(lower_ten.raw)}')
logger.info(f'This book has {len(lower_ten.clean)} chapters\n')
for i, chapter in enumerate(lower_ten.clean):
  if i == 5: break
  logger.info(f'{chapter[0]} - {chapter[1]}')
  logger.info(f'There are {len(chapter)} sentences in this chapter.')
  num_words = []
  for sent in chapter:
    num_words.append(len(sent.split(' ')))
  avg_words = np.mean(num_words)
  logger.info(f'The average sentence length in this chapter is {avg_words} words\n')

2021-09-26 12:47:32 Notebook     INFO     The raw length of this book as a string is 410135
2021-09-26 12:47:32 Notebook     INFO     This book has 33 chapters

2021-09-26 12:47:32 Notebook     INFO     chapter i. - i take a country house
2021-09-26 12:47:32 Notebook     INFO     There are 117 sentences in this chapter.
2021-09-26 12:47:32 Notebook     INFO     The average sentence length in this chapter is 21.307692307692307 words

2021-09-26 12:47:32 Notebook     INFO     chapter ii. - a link cuff-button
2021-09-26 12:47:32 Notebook     INFO     There are 142 sentences in this chapter.
2021-09-26 12:47:32 Notebook     INFO     The average sentence length in this chapter is 16.260563380281692 words

2021-09-26 12:47:32 Notebook     INFO     chapter iii. - mr. john bailey appears
2021-09-26 12:47:32 Notebook     INFO     There are 104 sentences in this chapter.

In [13]:
for i, sent in enumerate(lower_ten.get_chapter(15)):
  if i == 10: 
    break
  logger.info(sent)

2021-09-26 12:47:33 Notebook     INFO     chapter xv.
2021-09-26 12:47:33 Notebook     INFO     liddy gives the alarm
2021-09-26 12:47:33 Notebook     INFO     the next day, friday, gertrude broke the news of her stepfather’s death to louise
2021-09-26 12:47:33 Notebook     INFO     she did it as gently as she could, telling her first that he was very ill, and finally that he was dead
2021-09-26 12:47:33 Notebook     INFO     louise received the news in the most unexpected manner, and when gertrude came out to tell me how she had stood it, i think she was almost shocked
2021-09-26 12:47:33 Notebook     INFO     “she just lay and stared at me, aunt ray,” she said
2021-09-26 12:47:33 Notebook     INFO     “do you know, i believe she is glad, glad
2021-09-26 12:47:33 Notebook     INFO     and she is too honest to pretend anything else
2021-09-26 12:47:33 Notebook     INFO     w

In [14]:
# Example of running it on all the books
processed_books = {}
for title, metadata in books.items():
  logger.nl()
  logger.info(f"Book: {title}", color='yellow', attrs=['underline'])
  current_book = ProcessedBook(title=title, metadata=metadata)
  # Raw length
  logger.info(f'The raw length of this book as a string is {len(current_book.raw)}')
  # Number of chapters
  logger.info(f'This book has {len(current_book.clean)} chapters\n')
  # Sentences per chapter
  for i, chapter in enumerate(current_book.clean):
    if i == 5: 
      break
    logger.info(f'{chapter[0]} - {chapter[1]}')
    logger.info(f'There are {len(chapter)} sentences in this chapter.')
    num_words = []
    for sent in chapter:
      num_words.append(len(sent.split(' ')))
    avg_words = np.mean(num_words)
    logger.info(f'The average sentence length in this chapter is {avg_words} words\n')
  # Chapter 15 of the book being processed in this iteration
  for i, sent in enumerate(current_book.get_chapter(15)):
    if i == 10: break
    logger.info(sent)


2021-09-26 12:47:33 Notebook     INFO     Book: The_Circular_Staircase
2021-09-26 12:47:35 Notebook     INFO     The raw length of this book as a string is 410135
2021-09-26 12:47:35 Notebook     INFO     This book has 33 chapters

2021-09-26 12:47:35 Notebook     INFO     chapter i. - i take a country house
2021-09-26 12:47:35 Notebook     INFO     There are 117 sentences in this chapter.
2021-09-26 12:47:35 Notebook     INFO     The average sentence length in this chapter is 21.307692307692307 words

2021-09-26 12:47:35 Notebook     INFO     chapter ii. - a link cuff-button
2021-09-26 12:47:35 Notebook     INFO     There are 142 sentences in this chapter.
2021-09-26 12:47:35 Notebook     INFO     The average sentence length in this chapter is 16.260563380281692 words

2021-09-26 12:47:35 Notebook     INFO     chapter iii. - mr. john bailey appears
2021-09

KeyError: 'detectives'
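
The `KeyError` above is consistent with the printed `books` dict: several entries use the key `detective`, while `ProcessedBook.__init__` reads `metadata['detectives']`. A minimal sketch of one possible workaround, assuming we would rather tolerate both spellings than edit the YAML right away; `normalise_metadata` is a hypothetical helper, not part of `nlp_libs`.

```python
from typing import Any, Dict


def normalise_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metadata` that always has a `detectives` list."""
    fixed = dict(metadata)
    if 'detectives' not in fixed:
        # Fall back to the singular key seen in some config entries, else an empty list
        fixed['detectives'] = list(fixed.pop('detective', []))
    return fixed


# Entry copied from the placeholder config printed earlier in this notebook
example = {'crime_type': 'example',
           'detective': ['man1', 'man2'],
           'suspects': ['man3', 'man4'],
           'url': 'https://gutenberg.org/files/9931/9931-0.txt'}
print(normalise_metadata(example)['detectives'])  # ['man1', 'man2']
```

Renaming the key in `confs/proj_1.yml` is the cleaner long-term fix; the helper only keeps the loop over all books from stopping early.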

## TEST Dictionary 

In [None]:
# # Load Circular Staircase
# title, metadata = list(books.items())[0]
# staircaseUpper = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [None]:
# inputStr = [ 'Miss Gertrude',
#  'Miss Gertrude Innes',
#  'Miss Innes',
#  'Miss Liddy',
#  'Miss Louise',
#  'Arnold Armstrong',
#  'Miss Rachel',
#  'Mr. Armstrong',
#  'Mr. Arnold',
#  'Mr. Arnold Armstrong',
#  'Louise Armstrong',
#  'Fanny Armstrong',
#  'Peter Armstrong',
#  'Miss Armstrong',
#  'Mrs. Armstrong',
#  'Paul Armstrong',
#     'Anne Endicott',
#   'Anne Haswell',
#   'Anne Watson',
#   'Mary Anne'
# ]

# output = {
#     'Miss Gertrude Innes': [
#         'Miss Gertrude',
#         'Miss Gertrude Innes',
#         'Miss Innes'
#     ],
#     'Miss Liddy': [ 'Miss Liddy'  ], 
#     'Miss Louise':[ 'Miss Louise' ],
#     'Miss Rachel':[ 'Miss Rachel' ],
#     'Mr. Arnold Armstrong': [
#         'Mr. Arnold Armstrong',
#         'Arnold Armstrong',
#         'Mr. Armstrong',
#         'Mr. Arnold',
#     ], 
#     'Anne Endicott': ['Anne Endicott'],
#     'Anne Haswell': ['Anne Haswell'],
#     'Anne Watson': ['Anne Watson'],
#     'Mary Anne': ['Mary Anne']

# }

In [None]:
# title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', 'Mr. Arnold Armstrong')
# print(title)
# title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', 'Doctor Arnold Armstrong')
# print(title)
# no_title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', 'Arnold Armstrong')
# print(no_title)
# name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', 'Mr. Arnold Armstrong')
# name_split_no_title if len(name_split_no_title) == 1 else [name_split_no_title[0]]

In [None]:
# surname = re.findall(r'[A-Z][a-z]+$', 'Mr. Arnold Armstrong')
# print(surname)
# surname = re.findall(r'[A-Z][a-z]+$', 'Arnold Armstrong')
# print(surname)
# surname = re.findall(r'[A-Z][a-z]+$', 'Miss Armstrong')
# print(surname)

In [None]:
# regexString = r'[A-Z][a-z]+(?=\s)'
# surname = re.findall(regexString, 'Mr. Arnold Armstrong')
# print(surname)
# surname = re.findall(regexString, 'Arnold Armstrong')
# print(surname)
# surname = re.findall(regexString, 'Miss Armstrong')
# print(surname)

In [None]:
# def createNamedDictionary(personList): 
#     personList.sort(key=len, reverse=True)
#     name_dictionary = {}
#     hasKey=False

#     for potential_name_key in personList:         
#         title = re.findall(r'^(?:Mr\.|Mrs\.|Miss|Doctor)', potential_name_key)
#         name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', potential_name_key)
#         original_length_no_title = len(name_split_no_title)
        
#         if len(name_split_no_title) > 1:
#             surname = name_split_no_title[1]
#             name_split_no_title = [name_split_no_title[0]]
        
#         print(len(name_split_no_title))
#         print(name_split_no_title)
#         for name in personList:
#             if name in name_dictionary.keys() or (len(name_dictionary.values()) > 0 and
#                                                   name in np.concatenate(list(name_dictionary.values()))):
# #                 print("Continuing on ", name)
#                 continue

#             if re.match(fr".*({'|'.join(name_split_no_title)}).*", name):
# #                 print('\t ',re.match(fr".*({'|'.join(name_split_no_title)}).*", name))

#                 if original_length_no_title == 1: 
#                     continue
                    

# #                 if len(name) > len(potential_name_key): 
# #                     raise("This shouldn't happen with the way the sorting works")
#                 else:
#                     if potential_name_key not in name_dictionary.keys() and name not in name_dictionary.keys(): 
# #                         print('\t Key:', potential_name_key, "\t\t Value: ", potential_name_key,  )
#                         name_dictionary[potential_name_key] = [ potential_name_key ]
#                     elif potential_name_key in name_dictionary.keys(): 
# #                             print('\t Key:', potential_name_key, "\t\t Value: ", name )
# #                             print('\tt name_split_no_title: ', name_split_no_title )
# #                             if re.match(fr'.*(?!{surname}).*$', name) and len(name_split_no_title) == len(re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', name)):
# #                                 print('\t\t >>>>>>>>>>>. SURNAME DOES NOT MATCH!!!')
# #                                 print('\t\t >>>>>>>>>>>.potential_name_key, name')

                                
#                             name_dictionary[potential_name_key] = [
#                                 *name_dictionary[potential_name_key], 
#                                 name
#                             ]                    
                    
# #                     print('key: ', potential_name_key, '\tvalue', potential_name_key)
# #                     name_dictionary[potential_name_key] = [ potential_name_key ]
#                 continue
        
#     return name_dictionary

In [None]:

def get_unambiguous_name_list(unique_person_list, surname_list): 
    unambiguous_name_list = []
    print(surname_list)
    for name in unique_person_list:
        name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor|Aunt)[A-Z][a-z]+', name)
        if len(name_split_no_title) == 0: 
            continue
#         print(name, name_split_no_title)
        first_name = name_split_no_title[0]


        if first_name not in surname_list:
            unambiguous_name_list.append(name)
        else: 
            print(f"Name {name} is ambiguous. Not processing")
    return unambiguous_name_list

def create_named_dictionary(unique_person_list): 
    # Sort the longest names first
    unique_person_list.sort(key=len, reverse=True)

    surname_list = extract_surnames(unique_person_list)
    print(surname_list)
    unambiguous_person_list = get_unambiguous_name_list(unique_person_list, surname_list)
    # NOTE: createNamedDictionary is commented out in an earlier cell;
    # restore it before calling this function, otherwise this line raises NameError
    alias_dictionary = createNamedDictionary(unambiguous_person_list)

    return alias_dictionary


In [None]:
# surnames = extract_surnames(unique_person_list)
# names_to_remove_from_surnames = ['Anne', 'Rachel', 'Rosie', 'Thomas']
# surnames = list(filter(lambda x: x not in names_to_remove_from_surnames, surnames))

In [58]:
# unique_person_list.sort(key=len, reverse=True)
# unique_person_list

# THIS IS THE ALIASING CODE: 

In [21]:
def extract_surnames(unique_person_list): 
    surname_list = []
    # first pass: go through, get break of first / lasts
    for name in unique_person_list:
        name_split_no_title = re.findall(r'(?!Mr\.|Mrs\.|Miss|Doctor)[A-Z][a-z]+', name)
        surname = '' if len(name_split_no_title) <= 1 else name_split_no_title[1]

        if surname != '' and surname: 
            surname_list.append(surname)

    return set(surname_list)


def obtain_aliases_for_book(unique_person_list): 
    unique_person_list.sort(key=len, reverse=True)
    alias_dictionary = {}
    title_regex = r'^(?:Mr\.|Mrs\.|Miss|Doctor)'
    name_with_no_title_regex = r'(?!Mr\.|Mrs\.|Miss|Doctor|Aunt)[A-Z][a-z]+'
    surnames = extract_surnames(unique_person_list)
    
    for personIdx in range(len(unique_person_list)): 
        if len(alias_dictionary.keys()) == 0: 
            alias_dictionary[unique_person_list[personIdx]] = [unique_person_list[personIdx]]


        comparator_person = unique_person_list[personIdx]
        title_comparator_person = re.findall(title_regex, comparator_person)
        name_split_no_title_comparator_person = re.findall(name_with_no_title_regex, comparator_person)

        if len(alias_dictionary.values()) > 0 and comparator_person in list(np.concatenate(list(alias_dictionary.values()))):
            continue
#         print(comparator_person)
#         print(title_comparator_person, name_split_no_title_comparator_person)


        if len(name_split_no_title_comparator_person) == 0: 
    #         raise('ZERO? ', comparator_person, name_split_no_title_comparator_person)
            continue
        surname_comparator_person = '' if len(name_split_no_title_comparator_person) <= 1 else name_split_no_title_comparator_person[1]
        first_name_comparator_person = name_split_no_title_comparator_person[0]

        for next_person_index in range(personIdx, len(unique_person_list)):     
            next_person = unique_person_list[next_person_index]
            title_next_person = re.findall(title_regex, next_person)
            name_split_no_title_next_person = re.findall(name_with_no_title_regex, next_person)

            if len(name_split_no_title_next_person) == 0: 
#                 print(f'NEXT PERSON ZERO:{next_person} ')
                continue

            if comparator_person == next_person: 
                alias_dictionary[comparator_person] = [ comparator_person ]
#                 print("COnTINUING BECAUSE ADDED STUFF")
                continue

            surname_next_person = '' if len(name_split_no_title_next_person) <= 1 else name_split_no_title_next_person[1]
            first_name_next_person = name_split_no_title_next_person[0]

#             print('\t\t',next_person)
#             print('\t\t',title_next_person, name_split_no_title_next_person, f'SURNAME {len(name_split_no_title_next_person)} {surname_next_person}')

            if first_name_next_person in surnames: 
#                 print("\t::::::::NAME IN SURNAMES::::::::", title_next_person, name_split_no_title_next_person)
                continue

            if first_name_comparator_person == first_name_next_person and len(first_name_next_person) > 0: 

                ### SURNAME CHECK GOES HERE *** NEED TO MAKE ABSOLUTELY SURE:
                if surname_comparator_person != '' and surname_next_person == '': 
                    alias_dictionary[comparator_person] = [*alias_dictionary[comparator_person] , next_person ]
                    continue


                if surname_comparator_person != '' and surname_next_person != '' and surname_comparator_person != surname_next_person: 
                    continue


#                 print("\t\t**********WHOOOOOP**********************",len(first_name_next_person), first_name_next_person, first_name_comparator_person)
                alias_dictionary[comparator_person] = [*alias_dictionary[comparator_person] , next_person ]
                continue
    return alias_dictionary

def get_dictionary_of_named_occurrences(character_progression): 
    namecount = {}
    for chapter in character_progression: 
        for line in chapter: 
            for element in line: 
                if element not in namecount.keys():
                    namecount[element] = 1
                else: 
                    namecount[element] = namecount[element]+1
                    
    sorted_dict = {}
    for key_value in sorted(namecount.items(), key=lambda x: x[1], reverse=True): 
        sorted_dict[key_value[0]] = key_value[1]
    return sorted_dict

def get_counts_per_book_by_index(book_index): 
    title, metadata = list(books.items())[book_index]
    print(title)
    book = ProcessedBook(title=title, metadata=metadata, make_lower=False)
    unique_characters, character_progression = book.get_all_characters_per_novel()
    character_reference_counts = get_dictionary_of_named_occurrences(character_progression)
    return character_reference_counts

def create_alias_occurrence_dictionary(aliases, named_occurrences): 
    new_named_occurrences = {}
    for key in aliases.keys(): 
        for named_occurrence in named_occurrences.keys(): 
            if named_occurrence in aliases[key]: 
                key_exists = key in new_named_occurrences.keys()
                occurrences = named_occurrences[named_occurrence]
                new_named_occurrences[key] =  occurrences if not key_exists else  new_named_occurrences[key] + occurrences 

    return new_named_occurrences

## Lower Ten

In [22]:
# Load The Man in Lower Ten
title, metadata = list(books.items())[1]
lower_ten = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [23]:
unique, progression = lower_ten.get_all_characters_per_novel()
lower_ten_aliases = obtain_aliases_for_book(unique)
named_occurrences = get_dictionary_of_named_occurrences(progression)

In [24]:
lower_ten_aliases

{'Henry Pinckney Sullivan': ['Henry Pinckney Sullivan'],
 'Harry Pinckney Sullivan': ['Harry Pinckney Sullivan', 'Harry'],
 'Westinghouse Electric': ['Westinghouse Electric'],
 'Wilson Budd Hotchkiss': ['Wilson Budd Hotchkiss'],
 'Alleghany Mountains': ['Alleghany Mountains'],
 'Government English': ['Government English', 'Government'],
 'Camberwell Beauty': ['Camberwell Beauty'],
 'Lawrence Blakeley': ['Lawrence Blakeley', 'Lawrence'],
 'Monsieur Blakeley': ['Monsieur Blakeley'],
 'Purloined Letter': ['Purloined Letter'],
 'Simon Harrington': ['Simon Harrington'],
 'Justice Springer': ['Justice Springer'],
 'Doctor Williams': ['Doctor Williams'],
 'Richey McKnight': ['Richey McKnight', 'Richey'],
 'Fish Commission': ['Fish Commission'],
 'Francis Johnson': ['Francis Johnson'],
 'Great Unwashed': ['Great Unwashed'],
 'Henry Sullivan': ['Henry Sullivan'],
 'Harry Sullivan': ['Harry Sullivan', 'Harry'],
 'Great Unkissed': ['Great Unkissed'],
 'Dorothy Browne': ['Dorothy Browne', 'Dorothy
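
In the output above, the alias `Harry` is attached to both `Harry Pinckney Sullivan` and `Harry Sullivan`, so its occurrences would be counted under two keys. A minimal sketch for spotting such shared aliases, assuming the dictionary shape returned by `obtain_aliases_for_book`; `find_shared_aliases` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_aliases(alias_dictionary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each alias that appears under more than one key to those keys."""
    owners = defaultdict(list)
    for key, aliases in alias_dictionary.items():
        for alias in aliases:
            owners[alias].append(key)
    return {alias: keys for alias, keys in owners.items() if len(keys) > 1}


# Three entries copied from the alias output above
sample = {
    'Harry Pinckney Sullivan': ['Harry Pinckney Sullivan', 'Harry'],
    'Henry Sullivan': ['Henry Sullivan'],
    'Harry Sullivan': ['Harry Sullivan', 'Harry'],
}
print(find_shared_aliases(sample))
# {'Harry': ['Harry Pinckney Sullivan', 'Harry Sullivan']}
```

Any alias reported here is a candidate for manual review before the counts are merged.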

In [25]:
create_alias_occurrence_dictionary(lower_ten_aliases, named_occurrences)


{'Henry Pinckney Sullivan': 3,
 'Harry Pinckney Sullivan': 2,
 'Westinghouse Electric': 1,
 'Wilson Budd Hotchkiss': 2,
 'Alleghany Mountains': 1,
 'Government English': 2,
 'Camberwell Beauty': 1,
 'Lawrence Blakeley': 16,
 'Monsieur Blakeley': 1,
 'Purloined Letter': 1,
 'Simon Harrington': 8,
 'Justice Springer': 1,
 'Doctor Williams': 2,
 'Richey McKnight': 30,
 'Fish Commission': 1,
 'Francis Johnson': 1,
 'Great Unwashed': 1,
 'Henry Sullivan': 1,
 'Harry Sullivan': 2,
 'Great Unkissed': 2,
 'Dorothy Browne': 7,
 'Blanche Conway': 2,
 'Janet MacLure': 2,
 'John Flanders': 1,
 'Cinematograph': 1,
 'Pennsylvania': 1,
 'Bernard Shaw': 1,
 'Alice Curtis': 3,
 'Pittsburgers': 1,
 'Timeo Danaos': 1,
 'Miss Gardner': 1,
 'Grand Rapids': 1,
 'Commonwealth': 2,
 'John Gilmore': 7,
 'Gulf Stream': 1,
 'Alleghanies': 1,
 'Christmases': 1,
 'Chevy Chase': 1,
 'Monongahela': 1,
 'Monte Carlo': 1,
 'Conan Doyle': 1,
 'Seal Harbor': 6,
 'Alison West': 67,
 'Edgar Allan': 1,
 'Cannonball': 5,
 '

In [None]:
# @kostas already did this
def find_murder_moment(book): 
    book_by_chapter = book.lines_to_chapters(book.lines)

    murder_words = ['murder', 'death']
    murder_string = '|'.join(murder_words)
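    # Compile with IGNORECASE: flags cannot be passed alongside an already compiled pattern
    murder_regex = re.compile(fr".*({murder_string}).*", re.IGNORECASE)
    for chapter in book_by_chapter:
        for line in chapter: 
            found = murder_regex.findall(line)
            # This pattern matches a line at most once, so test for any match
            if found: 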
                print("FOUND: ", found)
                print(line)
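
The function above prints every matching sentence. A minimal sketch of a variant that returns only the first mention together with its position, assuming chapters are lists of sentences as produced by `lines_to_chapters`; `first_keyword_mention` is hypothetical and the sample chapters below are invented for illustration.

```python
import re
from typing import List, Optional, Tuple


def first_keyword_mention(chapters: List[List[str]],
                          keywords: List[str]) -> Optional[Tuple[int, int, str]]:
    """Return (chapter number, sentence number, sentence) of the first hit, or None."""
    # A leading word boundary only, so 'murdered' and 'murderer' also count as hits
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + ')',
                         re.IGNORECASE)
    for chapter_no, chapter in enumerate(chapters, start=1):
        for sentence_no, sentence in enumerate(chapter, start=1):
            if pattern.search(sentence):
                return chapter_no, sentence_no, sentence
    return None


# Invented toy data, not taken from any of the books
toy_chapters = [
    ['chapter i.', 'a quiet house', 'nothing happened that evening'],
    ['chapter ii.', 'the news', 'there had been a Murder in the night'],
]
print(first_keyword_mention(toy_chapters, ['murder', 'death']))
```

Both numbers are 1-based, which matches how `get_chapter` indexes chapters.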
    


## Circular Staircase: 

In [None]:
# Character reference counts for The Circular Staircase
get_counts_per_book_by_index(0)

## Man in Lower Ten

In [None]:
# Character reference counts for The Man in Lower Ten
get_counts_per_book_by_index(1)

## The Breaking Point

In [None]:
# Load The Breaking Point
book_index = 2
title, metadata = list(books.items())[book_index]
print(title)
book = ProcessedBook(title=title, metadata=metadata, make_lower=False)

In [None]:
raw = book.raw
lines = re.sub(r'\r\n', r'\n', raw)
lines = re.findall(r'.*(?=\n)',  lines)     

In [None]:
lines
book.clean_lines(lines)

## You Know how Women Are 

In [None]:
get_counts_per_book_by_index(3)

## The Window at the White Cat

In [None]:
get_counts_per_book_by_index(4)