# `692-Proj-1 : Crime Novel Plot Analysis with Regex - Agatha Christie`
The goal of this project is to conduct a plot and protagonist/antagonist analysis of famous crime novels. For this project, we will analyze five publicly available crime novels by Agatha Christie, hosted at Project Gutenberg (http://www.gutenberg.org/). The novels chosen are: 
- The Mysterious Affair at Styles 
- The Murder on the Links 
- The Secret Adversary 
- The Man in the Brown Suit 
- The Secret of Chimneys 


Note: Feel free to use any background resource for the understanding of the plot, protagonist and antagonist names, and other details. Look for spoilers, details, etc. Our goal is not to predict the crime, but to computationally analyze the structure of the plot.

#Data collection

##Background research: 

Location for Plain text UTF-8 files for novels: 
- The Mysterious Affair at Styles https://www.gutenberg.org/files/863/863-0.txt
- The Murder on the Links https://www.gutenberg.org/files/58866/58866-0.txt
- The Secret Adversary https://www.gutenberg.org/files/1155/1155-0.txt
- The Man in the Brown Suit https://www.gutenberg.org/files/61168/61168-0.txt
- The Secret of Chimneys https://www.gutenberg.org/files/65238/65238-0.txt

Note: One benefit of using the plain-text version is that the HTML version also contains page numbers that would need to be cleaned out; the text files do not.

###Helpful Links

https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3

https://docs.python.org/3/howto/urllib2.html



In [2]:
# example test run for one file 
# we will need a data structure that holds the name and links and loop through or sequentially get the data of all the novels.

import urllib.request, re
url = "https://www.gutenberg.org/files/863/863-0.txt" # utf-8 text file link for The Mysterious Affair at Styles 

response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8') # Gutenberg serves these files as UTF-8, so decode explicitly
#text  # uncomment for output if needed
# text needs to be cleaned before it can be analyzed 

# trying out words; we will need both words and sentences
#words = re.split(r'\s+', text)

# trying out sentence tokenization via re (raw strings keep the regex escapes intact)
# WIP: as the output shows, this does not catch every sentence boundary
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
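The comment at the top of this cell asks for a data structure that holds the names and links so all novels can be fetched in a loop. A minimal sketch of one way to do that, assuming a plain dict keyed by title; `BOOK_URLS`, `download` and `fetch_all` are hypothetical names that do not appear elsewhere in this notebook, and the `fetch` parameter exists only so the loop can be tried without network access.

```python
import urllib.request

# Titles and plain-text URLs exactly as listed under "Background research".
BOOK_URLS = {
    "The Mysterious Affair at Styles": "https://www.gutenberg.org/files/863/863-0.txt",
    "The Murder on the Links": "https://www.gutenberg.org/files/58866/58866-0.txt",
    "The Secret Adversary": "https://www.gutenberg.org/files/1155/1155-0.txt",
    "The Man in the Brown Suit": "https://www.gutenberg.org/files/61168/61168-0.txt",
    "The Secret of Chimneys": "https://www.gutenberg.org/files/65238/65238-0.txt",
}

def download(url):
    """Fetch one URL and decode the response body as UTF-8 text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def fetch_all(urls=BOOK_URLS, fetch=download):
    """Return {title: text}. `fetch` is injectable so the loop can be tested offline."""
    return {title: fetch(url) for title, url in urls.items()}
```

Calling `texts = fetch_all()` would download all five novels; passing a stub such as `fetch=lambda url: ""` exercises the loop without touching the network.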

In [3]:
#Author: Luke+Veronica
#Description: Functions for retrieving and cleaning corpus
import urllib.request, re

def get_index(title):
  last_reg=re.compile(r"\w+$")
  last_word=re.findall(last_reg,title)[0]
  if last_word =="Links":
    #"The Murder on the Links"
    index=1
  elif last_word=="Styles":
    #"The Mysterious Affair at Styles"
    index=2
  elif last_word=="Adversary":
    #"The Secret Adversary"
    index=3
  elif last_word=="Suit":
    #"The Man in the Brown Suit"
    index=4
  elif last_word=="Chimneys":
    #"The Secret of Chimneys"
    index=5
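  else:
    # without this branch an unknown title fails later with an unbound `index`
    raise ValueError(f"Unknown title: {title}")
  return index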

def get_text(index):
  if index==1:
    #"The Murder on the Links"
    url = "https://www.gutenberg.org/files/58866/58866-0.txt"
  elif index==2:
    #"The Mysterious Affair at Styles"
    url="https://www.gutenberg.org/files/863/863-0.txt"
  elif index==3:
    #"The Secret Adversary"
    url="https://www.gutenberg.org/files/1155/1155-0.txt"
  elif index==4:
    #"The Man in the Brown Suit"
    url="https://www.gutenberg.org/files/61168/61168-0.txt"
  elif index==5:
    #"The Secret of Chimneys"
    url="https://www.gutenberg.org/files/65238/65238-0.txt"
  response = urllib.request.urlopen(url)
  data = response.read()      # a `bytes` object
  text = data.decode('utf-8')
  return text

def get_ch_regex(index):
  if index==1:
    ch_carve=re.compile(r'\n\d\d?\s[\'\"\u201c]?[A-Z].*\n')
  elif index==2:
    ch_carve=re.compile(r'CHAPTER\s[IVX]+\.\r\n.*\r\n')
  elif index==3:
    ch_carve=re.compile(r'\r\n\r\n\r\nCHAPTER.*\r\n')
  elif index==4:
    ch_carve=re.compile(r'CHAPTER\s\w+\r\n')  
  elif index==5:
    ch_carve=re.compile(r'\d\d?\r\n\r\n[A-OQ-Z].*\r\n')
  return ch_carve

def trim_contents(ch_contents_dict,index):
  last=len(ch_contents_dict)
  if index==1:
    ch_contents_dict[last]=ch_contents_dict[last].split('\nEnd of Project Gutenberg')[0]
  elif index==2:
    ch_contents_dict[last]=ch_contents_dict[last].split('\nTHE END')[0]
  elif index==3:
    ch_contents_dict[last-1]=ch_contents_dict[last-1].split('\nEnd of the Project Gutenberg')[0]
  elif index==4:
    ch_contents_dict[last-1]=ch_contents_dict[last-1].split('THE END')[0]
  elif index==5:
    ch_contents_dict[last]=re.split(r"TRANSCRIBER",ch_contents_dict[last])[0]
  return ch_contents_dict

def remove_white(chapter):
  regex=r'[\r\n\u200a_]+'
  chapter = re.sub(regex,' ',chapter)
  return chapter

def sent_carve(chapter):
  #chapter=re.split(r'(?<![A-H|J-Z])[\.\?!](?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\s[a-z])(?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\sI said)[\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w)(?<![A-Z][a-z][a-z])(?<![A-Z][a-z])\s+',chapter,flags=re.UNICODE)
  chapter=re.split(r'(?<![^A-Z][A-H|J-Z])(?<!Mr|Ms|Dr)(?<!Mrs)(?<!Mlle)(?<!Melle)(?<!\w\.\w)[\.\?!](?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\s[a-z])[\'\"\u2018\u2019\u201c\u201d\)\]]*\s*|\u2014\u201d\s*',chapter,flags=re.UNICODE)
  chapter=chapter[:-1]
  chapter={num:contents.lower() for (num,contents) in enumerate(chapter)}
  return chapter

def ch_carve(title):

  index=get_index(title)
  text=get_text(index)
  ch_regex=get_ch_regex(index)
  if index ==3:
    text=re.split("CHAPTER XXVIII.     AND AFTER\r\n\r\n\r\n\r\nPROLOGUE",text)[1]
  if index ==4:
    text=re.split("PROLOGUE",text)[1]
  ch_titles=re.findall(ch_regex,text)
  ch_titles_dict={num+1:remove_white(title.strip()) for (num,title) in enumerate(ch_titles)}
  if index==3 or index ==4:
    ch_titles_dict.update( {0 :"PROLOGUE"} )
  chapters=re.split(ch_regex,text)
  if index==3 or index==4:
    ch_contents_dict = {num:contents for (num,contents) in enumerate(chapters)}  
  elif index ==1 or index ==2 or index==5:
    chapters=chapters[1:]
    ch_contents_dict = {num+1:contents for (num,contents) in enumerate(chapters)}
  ch_contents_dict=trim_contents(ch_contents_dict,index)
  return {"title":title,"contents":ch_contents_dict,"chapters":ch_titles_dict}

def get_corpus():
  #tentatively planning to index books from 1 to match chapters
  titles=["The Mysterious Affair at Styles","The Murder on the Links","The Secret Adversary","The Man in the Brown Suit","The Secret of Chimneys"]
  corpus={ get_index(title):ch_carve(title) for title in titles}
  return corpus
def clean_corpus(corpus):
  # replace each chapter's raw text with a dict {sentence_number: lowercased sentence}
  for book in corpus.values():
    for ch_num,ch_text in book["contents"].items():
      book["contents"][ch_num]=sent_carve(remove_white(ch_text))
  return corpus


def sent_blob(chapter):
  temp=''  
  #chapter=re.split(r'(?<![A-H|J-Z])[\.\?!](?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\s[a-z])(?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\sI said)[\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w)(?<![A-Z][a-z][a-z])(?<![A-Z][a-z])\s+',chapter,flags=re.UNICODE)
  chapter=re.split(r'(?<![^A-Z][A-H|J-Z])(?<!Mr|Ms|Dr)(?<!Mrs)(?<!Mlle)(?<!Melle)(?<!\w\.\w)[\.\?!](?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\s[a-z])[\'\"\u2018\u2019\u201c\u201d\)\]]*\s*|\u2014\u201d\s*',chapter,flags=re.UNICODE)
  chapter=chapter[:-1]
  for ch in chapter:
    temp=temp+" "+ch.lower()    
  return temp
def remove_punc(blob):
  blob=re.sub(r"[\u201c\u201d\?,;:\.!\u2018\u2019\u201a\u275b\u275c\u275f\s-]+" ,' ',blob)
  return blob
def tighten(blob):
  return re.sub(r"\s+"," ",blob)

def blob_corpus(dirty_corpus):
  # must run before clean_corpus(), while each chapter is still a raw string
  for book in dirty_corpus.values():
    blob=''
    for ch_text in book["contents"].values():
      blob=blob+" "+tighten(remove_punc(remove_white(sent_blob(ch_text))))
    book["blob"]=blob
  return dirty_corpus

In [4]:
#Author: Luke 
#Description: collect and clean corpus

dirty_corpus=get_corpus()
dirty_corpus=blob_corpus(dirty_corpus)
corpus=clean_corpus(dirty_corpus)

#print(corpus[1]["title"],len(corpus[1]["chapters"]),len(corpus[1]["contents"]))
#print(corpus[1]["contents"][28])
#print(corpus[2]["contents"][13])
#print(corpus[1]["blob"])

In [5]:
#Author: Luke 
#Description Helper functions for answering questions
def get_det(index):
  if index==1:
    det=re.compile(r"hercule|poirot|arthur|hastings")
  elif index==2:
    det=re.compile(r"someone")
  elif index==3:
    # Probably need to tweak this to capture the different versions of PTC
    det=re.compile(r"tuppence|beresford|prudence|cowley")
  else:
    det=re.compile(r"nobody")
  return det

#The Mysterious Affair at Styles
#Lead detective: Hercule Poirot, Arthur Hastings
#Other detectives/assistants:
#Victim: Emily Inglethorp
#Suspects: Alfred Inglethorp , Cavendish
#Perpetrator(s): Alfred Inglethorp, Evelyn Howard
#Other important characters: John Cavendish,
#Crime: Murder, Poisoning
#Mrs. Inglethorp

# V: Book 3 info from Addi
#The Secret Adversary (complicated, Needs to be looked at more)
#Lead detective: Tommy and Tuppence, Tommy Beresford, Tuppence Cowley, Prudence Cowley, Prudence "Tuppence" Cowley,
#Other detectives/assistants:
#Victim: Jane Finn, Mrs. Vandemeyer
#Suspects: Mr. Brown, Julius Hersheimmer
#Perpetrator: Sir James Peel Edgerton
#Other important characters: Jane Finn
#Crime: Espionage, Kidnapping
#motif: thriller focus rather than detection

def get_perp(index):
  if index==1:
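    #checked book for abbreviations Mlle,mlle,Melle,melle - none occurred in text
    perp=re.compile(r"mademoiselle marthe daubreuil|mademoiselle marthe|mademoiselle daubreuil|marthe daubreuil|marthe")
    #perp=re.compile(r"mademoiselle( marthe)? daubreuil|marthe daubreuil|marthe")
  elif index==2:
    # sentences are lowercased by sent_carve(); Python lookbehinds must be fixed-width, so use one per prefix
    perp=re.compile(r"alfred|(?<!mrs\. )(?<!emily )inglethorp")
  elif index==3:
    # optional "james peel " prefix (square brackets would make a character class instead)
    perp=re.compile(r"(?:james peel )?edgerton")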
  else:
    # V: shouldn't get here
    perp=re.compile(r"someone else")
  return perp

def get_crime(index):
  if index==1:
    crime=re.compile(r"murdered|body (was|had been) discovered")
  elif index==2:
    # FIXME
    crime=re.compile(r"some crime")
  elif index==3:
    # V: could look for spy also
    crime=re.compile(r"kidnapping|espionage")
  else:
    # V: shouldn't get here
    crime=re.compile(r"nothing happened")
  return crime

def get_sus(index):
  if index==1:
    sus=re.compile(r"jack")
  elif index==2:
    # FIXME
    sus=re.compile(r"some suspect")
  elif index==3:
    # escape the dot after "mr" and allow the middle initial ("julius p. hersheimmer")
    sus=re.compile(r"mr\. brown|julius(?: p\.)? hersheimmer")
  else:
    # V: shouldn't get here
    sus=re.compile(r"no suspect")
  return sus

def get_occur(index,regex):
  occur=[]
  for ch_index,ch_contents in corpus[index]["contents"].items():
    for sent_index,sent_contents in ch_contents.items():
      matches=re.search(regex,sent_contents)
      if matches is not None:
        occur.append([ch_index,sent_index,sent_contents])
        #print("Chapter: ",ch_index, "Sentence: ", sent_index, "Contents: ",sent_contents)
  return occur

def get_co_occur(index, det,perp):
  co_occur=[]
  for ch_index,ch_contents in corpus[index]["contents"].items():
    for sent_index,sent_contents in ch_contents.items():
      dmatches=re.search(det,sent_contents)
      pmatches=re.search(perp,sent_contents)
      if dmatches is not None and pmatches is not None:
        co_occur.append([ch_index,sent_index,sent_contents])
        #print("Chapter: ",ch_index, "Sentence: ", sent_index, "Contents: ",sent_contents)
  return co_occur
def get_3words(book,perp):
  blob=corpus[book]["blob"]
  answer=[]
  for match in re.finditer(perp,blob):
    # str.split() with no arguments drops empty strings, so no off-by-one slicing is needed
    # (the old blob[0:start-1] slice misbehaved when a match began at position 0)
    before=blob[:match.start()].split()[-3:]
    after=blob[match.end():].split()[:3]
    # pad with " " so every entry has exactly three words on each side
    before=[" "]*(3-len(before))+before
    after=after+[" "]*(3-len(after))
    answer.append(before+after)
  return answer

  #splits=[re.finditer(r"\s+",sp) for sp in splits]

def get_3sentences(book,ch,sent):
  sents=corpus[book]["contents"][ch]
  first,last=min(sents),max(sents)
  # centre the window on `sent`, shifting it so it stays inside the chapter
  start=min(max(sent-1,first),max(last-2,first))
  # chapters with fewer than three sentences return what is available instead of raising KeyError
  return [[i,sents[i]] for i in range(start,min(start+3,last+1))]


In [6]:
#Author: Luke
#Description: demo of code for answering questions
  
det=get_det(1)
perp=get_perp(1)
crime=get_crime(1)
sus=get_sus(1)
det_occur=get_occur(1,det)
perp_occur=get_occur(1,perp)
co=get_co_occur(1,det,perp)
# keep the compiled regex and its matches under separate names
crime_occur=get_occur(1,crime)
sus_occur=get_occur(1,sus)
print(det_occur)
print(perp_occur)
print(co)
print(crime_occur)
print(sus_occur)
perp_neighbors=get_3words(1,perp)
for n in perp_neighbors:
  print(n)
print(len(perp_neighbors))

[[1, 4, 'i had been transacting some business in paris and was returning by the morning service to london where i was still sharing rooms with my old friend, the belgian ex-detective, hercule poirot'], [1, 127, '“that was poirot’s first big case'], [2, 1, 'my friend poirot, exact to the minute as usual, was just tapping the shell of his second egg'], [2, 8, 'elsewhere, i have described hercule poirot'], [2, 18, '” i slipped into my seat, and remarked idly, in answer to poirot’s greeting, that an hour’s sea passage from calais to dover could hardly be dignified by the epithet “terrible'], [2, 19, 'poirot waved his egg-spoon in vigorous refutation of my remark'], [2, 41, 'poirot shook his head seriously'], [2, 59, 'poirot threw me a withering glance'], [2, 60, '“what an intelligence has my friend hastings!” he exclaimed sarcastically'], [2, 64, 'poirot shook his head with a dissatisfied air'], [2, 71, '“cheer up, poirot, the luck will change'], [2, 74, 'poirot smiled, and taking up the n

In [7]:
# Block for book 1
#Author: Luke

In [8]:
# Block for book 2
#Author: Luke

In [9]:
# Author: @verolero86

# W.I.P. - first stab takes care of white space characters
import numpy as np # to grab unique elements 

def find_white_space(book,nchars=None,debug=False):
  # nchars=None scans the whole string; a slice end of -1 would silently drop the last character
  result_book = re.findall(r'\s',book[:nchars])

#  if debug == True:
#    print(repr(book[:nchars]))

  print(f"Number of white space characters = {len(result_book)}")
  print(f"Unique types of white space characters found = {np.unique(result_book)}")

  return len(result_book)

def remove_specific_white_space(book,regex):
  result_book = re.sub(regex,' ',book)

  return result_book

def clean_data(book):
  ws_regex=r'[\r\n\u200a]'
  result_book = remove_specific_white_space(book,ws_regex).lower()
  return result_book

# After clean_corpus() each chapter is a dict {sentence_number: sentence} rather than a string,
# which is the likely cause of the TypeError this cell raised. Rebuild the chapter text first.
def chapter_text(book,ch):
  return " ".join(corpus[book]["contents"][ch].values())

# Set a subset of characters for easier parsing (None for all in chapter)
num_ws_b1c1 = find_white_space(chapter_text(1,1),None,True)
print(num_ws_b1c1)

num_ws_b1c2 = find_white_space(chapter_text(1,2),None,False)
print(num_ws_b1c2)

num_ws_b2c1 = find_white_space(chapter_text(2,1),None,False)
print(num_ws_b2c1)

# Cleaning up unwanted white space and breaking up into sentences.
# Note: sent_carve() already consumed the sentence-ending punctuation, so to test
# sentence_regex meaningfully run it on raw chapter text (before clean_corpus()).
b1c1 = chapter_text(1,1)
regex=r'[\r\n\u200a]'
b1c1_no_ws = remove_specific_white_space(b1c1,regex)
#print(f"No \\r and \\n anymore: {repr(b1c1_no_ws[0:400])}")
#result_sentences = re.findall(r'[^\.\!\?]*[\.\!\?]',b1c1_no_ws);
#print(repr(result_sentences[0]))
#print(repr(result_sentences))
sentence_regex=r'[\.\?!](?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\s[a-z])(?![\'\"\u2019\u201a\u201c\u275c\u275f\u201e\u201d\u0022\u275e]\sI said)[\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+'
result2=re.split(sentence_regex,b1c1_no_ws,flags=re.UNICODE)
#print(b1c1_no_ws)
b1c1_clean = clean_data(b1c1)
result3=re.split(sentence_regex,b1c1_clean,flags=re.UNICODE)
print(result2)
print(result3)

In [49]:
bookid=3
print(corpus[bookid]["title"],len(corpus[bookid]["chapters"]),len(corpus[bookid]["contents"]))
det=get_det(bookid)
perp=get_perp(bookid)
crime=get_crime(bookid)
sus=get_sus(bookid)
det_occur=get_occur(bookid,det)
perp_occur=get_occur(bookid,perp)
co=get_co_occur(bookid,det,perp)
crime=get_occur(bookid,crime)
sus_occur=get_occur(bookid,sus)
#print(det)
print("det_occur = ", det_occur)
#print(perp)
print("perp_occur = ", perp_occur)
print("co = ", co)
print("crime = ", crime)
print("sus_occur = ", sus_occur)
#print(get_context(1,3,116))


The Secret Adversary 29 29
det_occur =  [[1, 1, '“tuppence, old bean'], [1, 12, 'the very faint anxiety which underlay his tone did not escape the astute  ears of miss prudence cowley, known to her intimate friends for some  mysterious reason as “tuppence'], [1, 17, '“you always were a shocking liar,” said tuppence severely, “though you  did once persuade sister greenbank that the doctor had ordered you beer  as a tonic, but forgotten to write it on the chart'], [1, 24, 'tuppence sighed'], [1, 29, '“gratuity?” hinted tuppence'], [1, 34, 'the cost of  living--ordinary plain, or garden living nowadays is, i assure you, if  you do not know----”    “my dear child,” interrupted tuppence, “there is nothing i do _not_ know  about the cost of living'], [1, 37, 'and tuppence led the way upstairs'], [1, 44, 'but at that moment two elderly ladies rose and collected parcels, and  tuppence deftly ensconced herself in one of the vacant seats'], [1, 46, 'tuppence ordered tea and buttered toast'], [1,

In [None]:
#Dev Notes: will refactor fetch() to generate dict of titles and indices rather than take in index
# I.e., fetch() is backwards. should assign index based on title while fetching url


#artifacts I've spotted in data:
# I noticed an "[Illustration]" artifact in The Mysterious Affair at Styles.

#possible start for sentence splitting regex
#sentence_regex = r'([\.\?!][\'\"\u2018\u2019\u201c\u201d\)\]]*\s*(?<!\w\.\w.)(?<![A-Z][a-z][a-z]\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)\s+)'

# V: do we need to clean up contractions to be spelled out? e.g., "I'm" to "I am", "don't" to "do not".
#  

#Data Cleaning

## Background Research


- There are inconsistencies between the novel formats. Some of them start with a prologue and others don't. 
- Some books have 'START OF THE PROJECT' near the beginning and others have 'START OF THIS PROJECT'; in both cases the table of contents appears after that marker.
- Some of them use the heading 'Table of Contents'; others say 'Contents'.
- Some use Roman numerals to number chapters; others don't.
- Some use the word 'chapter'; others just list a number followed by the chapter title.
- The novel text files have license and other information at the end.

These factors above will need to be considered in data cleaning. 

Listing a few key particulars below: 

- The Mysterious Affair at Styles 
  - This phrase is present at the beginning: \*** START OF THE PROJECT
  - The novel's plot starts at the second instance of 'CHAPTER I.' (the period is important here).
  - The novel ends at 'THE END', followed by '\*** END OF THE PROJECT GUTENBERG EBOOK...'.
  - Each chapter starts with 'CHAPTER' followed by the chapter number in Roman numerals, then a new line, then the chapter title.


- The Murder on the Links 
  - This phrase is present at the beginning: \*** START OF THIS PROJECT
  - The novel's plot starts at the second instance of '1 A Fellow Traveller'.
  - The novel ends at 'End of Project Gutenberg's The Murder on the Links, by Agatha Christie', followed by '\*** END OF THIS PROJECT GUTENBERG ...'.
  - Each chapter starts with a number followed by the chapter title.


- The Secret Adversary 
  - This phrase is present at the beginning: \*** START OF THIS PROJECT
  - The novel's plot starts at the second instance of 'PROLOGUE'.
  - The novel ends at 'End of the Project Gutenberg EBook of The Secret Adversary, by Agatha Christie', followed by '\*** END OF THIS PROJECT GUTENBERG ...'.
  - Each chapter starts with 'CHAPTER' followed by the chapter number in Roman numerals, then the chapter title.


- The Man in the Brown Suit 
  - This phrase is present at the beginning: \*** START OF THIS PROJECT
  - The novel starts at the second instance of 'PROLOGUE'.
  - The novel ends at 'End of Project Gutenberg's The Man in the Brown Suit, by Agatha Christie', followed by '\*** END OF THIS PROJECT GUTENBERG ...'.
  - Each chapter starts with 'CHAPTER' followed by the chapter number in Roman numerals.


- The Secret of Chimneys 
  - This phrase is present at the beginning: \*** START OF THE PROJECT
  - The novel's plot starts at the second instance of '1' followed by a new line and 'Anthony Cade Signs on' (the number and the title are on separate lines).
  - The novel ends at 'Transcriber's Notes:', followed by '\*** END OF THE PROJECT GUTENBERG...'.
  - Each chapter starts with a number, then a new line, then the chapter title.
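The START/END markers listed above can be used to cut the license wrapper off every book with one function. The sketch below is an illustration only, not the approach the notebook's `ch_carve()` takes: `strip_gutenberg`, `START_RE` and `END_RE` are hypothetical names, the marker wording is assumed to be limited to the 'THE'/'THIS' variants noted above, and the sample text is invented. It leaves the table of contents in place, so finding the second instance of the first chapter heading is still needed afterwards.

```python
import re

# Match both "*** START OF THE PROJECT ..." and "*** START OF THIS PROJECT ..." marker lines.
START_RE = re.compile(r"\*\*\* ?START OF TH(?:E|IS) PROJECT GUTENBERG[^\n]*\n")
END_RE = re.compile(r"\*\*\* ?END OF TH(?:E|IS) PROJECT GUTENBERG")

def strip_gutenberg(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()

# Invented sample that mimics the layout of a Gutenberg plain-text file.
sample = (
    "license header\r\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
    "CHAPTER I\r\nIt was a dark night.\r\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
    "license footer\r\n"
)
print(repr(strip_gutenberg(sample)))
```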


#Data Tokenization / Prep for analysis

Note: since we are not allowed to use NLTK for tokenization, we will have to do it in plain Python. 

We could use split(), but that is very basic and does not produce tokens in a linguistic sense. The re package adds regex support, and learning regex better is the point of this project, so we recommend re.split with our own custom regex: 
https://docs.python.org/3/library/re.html
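As a starting point, here is a minimal sketch of regex-only tokenization. The helper names `word_tokens` and `naive_sentences` are hypothetical, and both patterns are deliberately simple: the word pattern only keeps ASCII letters (so accented names such as "amélie" would be broken up), and the sentence splitter does not handle abbreviations such as "Mr." the way the longer regex in `sent_carve()` tries to.

```python
import re

def word_tokens(text):
    """Lowercased word tokens; keeps internal apostrophes and hyphens (don't, egg-spoon)."""
    return re.findall(r"[a-z]+(?:['\u2019-][a-z]+)*", text.lower())

def naive_sentences(text):
    """Split on whitespace that follows '.', '?' or '!'. Abbreviations are NOT handled."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

sample = "Poirot waved his egg-spoon. Don't you agree? I do!"
print(word_tokens(sample))
print(naive_sentences(sample))
```

The lookbehind in `naive_sentences` keeps the punctuation attached to each sentence, whereas splitting on the punctuation itself (as `sent_carve()` does) discards it.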


###Helpful Links: 
- https://python.plainenglish.io/how-to-tokenize-sentences-without-using-any-nlp-library-in-python-a381b75f7d22 
- https://stackoverflow.com/questions/21361073/tokenize-words-in-a-list-of-sentences-python




#Data Analysis

The goal of this project is to analyze how often the protagonists and the perpetrator(s) occur across each novel (per chapter, and per sentence within a chapter), where the crime is mentioned, and the circumstances surrounding the antagonists. The ultimate objective is to use basic NLP tools to observe any patterns in plot structures across the works of one or all of the authors. Specifically, the analysis questions below need to be answered. 

Note: To effectively conduct this analysis, you should find resources, and read the plot summaries of each novel, so you can make your search more effective. If plot summaries are not available, use regex to search for clues, and report how well/how fast that approach worked. 

The plot-summary answers, derived from reading the book or its summary, are located below each question.
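The per-chapter frequency analysis described above can be sketched as follows, assuming chapter contents shaped like `corpus[book]["contents"]` after cleaning, i.e. `{chapter: {sentence_number: sentence}}`. The function name `mentions_per_chapter` is hypothetical and the toy sentences are invented for illustration.

```python
import re

def mentions_per_chapter(contents, pattern):
    """Count regex matches per chapter for contents shaped {chapter: {sentence_no: sentence}}."""
    return {
        chapter: sum(len(pattern.findall(sentence)) for sentence in sentences.values())
        for chapter, sentences in contents.items()
    }

# Toy data in the same shape as corpus[book]["contents"]; not taken from the novels.
toy = {
    1: {0: "poirot smiled", 1: "hastings frowned at poirot"},
    2: {0: "nobody spoke"},
}
det = re.compile(r"poirot|hastings")
print(mentions_per_chapter(toy, det))
```

This counts every match, so a sentence that names the detective twice contributes two; counting sentences instead would only need `bool(pattern.search(sentence))` in place of the `findall` length.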

##Pre-Steps

Details of each book: 

- The Mysterious Affair at Styles 
  - Lead detective: Hercule Poirot, Arthur Hastings
  - Other detectives/assistants: 
  - Victim: Emily Inglethorp
  - Suspects: Alfred Inglethorp , Cavendish
  - Perpetrator(s): Alfred Inglethorp, Evelyn Howard
  - Other important characters: John Cavendish, 
  - Crime: Murder, Poisoning
  - motif: murder mystery
https://agathachristie.fandom.com/wiki/The_Mysterious_Affair_at_Styles


- The Murder on the Links 
  - Lead detective(s): Hercule Poirot, Arthur Hastings
  - Other detectives/assistants:  Monsieur Giraud, Monsieur Hautet
  - Victim: Paul Renauld
  - Suspects: Jack Renauld
  - Perpetrator: Marthe Daubreuil.
  - Other important characters: Paul Renauld, Eloise Renauld, Jack Renauld, Madame Daubreuil, Gabriel Stonor, Georges Conneau, Madame Beroldy, Marthe Daubreuil, Bella Duveen, Dulcie Duveen (Cinderella)
  - Crime: Murder, Stabbing
  - motif: murder mystery
https://en.wikipedia.org/wiki/The_Murder_on_the_Links
https://agathachristie.fandom.com/wiki/The_Murder_on_the_Links



- The Secret Adversary  (complicated, Needs to be looked at more)
  - Lead detective: Tommy and Tuppence, Tommy Beresford, Tuppence Cowley, Prudence Cowley, Prudence "Tuppence" Cowley, 
  - Other detectives/assistants: 
  - Victim: Jane Finn, Mrs. Vandemeyer
  - Suspects: Mr. Brown,  Julius Hersheimmer
  - Perpetrator: Sir James Peel Edgerton
  - Other important characters: Jane Finn
  - Crime: Espionage, Kidnapping
  - motif: thriller focus rather than detection


- The Man in the Brown Suit (complicated, Needs to be looked at more)
  - Lead detective: Anne Beddingfeld
  - Other detectives/assistants: 
  - Victim: Nadina aka Anita Grünberg, L. B. Carton
  - Suspects: Harry
  - Perpetrator: Sir Eustace Pedler
  - Other important characters: Nadina, Count Sergius Paulovitch, the Colonel, Suzanne Blair, Colonel Race, Guy Pagett, Harry Rayburn, Rev. Chichester, Miss Pettigrew, Harry Parker
  - Crime: diamond theft, murders, kidnapping
  - motif: thriller focus rather than detection


- The Secret of Chimneys (complicated, Needs to be looked at more)
  - Lead detective: Anthony Cade aka Prince Nicholas
  - Other detectives/assistants: Superintendent Battle, Monsieur Lemoine of the Sûreté, Mr. Fish aka American agent
  - Victim: Perceived: Count Stanislaus aka Prince Michael Obolovitch
  - Suspects: Anthony Cade, Prince Nicholas, King Victor
  - Perpetrator: Mlle Brun aka Queen Varaga aka Angèle Mory, M Lemoine aka King Victor
  - Other important characters: King Nicholas IV, Queen Varaga aka Angèle Mory, Herman Isaacstein, Prince Michael Obolovitch, George Lomax, Count Stylptitch, Jimmy McGrath, Virginia Revel, Captain O'Neill, Mr Holmes, Isaacstein, Hiram P. Fish, Prince Nicholas, Mademoiselle (Mlle) Brun, Bill Eversleigh, Monsieur Lemoine of the Sûreté, Professor Wynwood, Boris Anchoukoff
   - Crime: sensitive document theft, murders, treasure hunt, espionage
  - motif: thriller focus rather than detection
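The name variants listed above can be turned into search patterns mechanically instead of hand-writing each alternation. A minimal sketch, assuming the variants are supplied as a plain list; `names_regex` is a hypothetical helper, not part of the notebook's `get_det()`/`get_perp()` functions. Sorting longest-first makes the fullest form of a name win when several variants start at the same position, and `re.escape` keeps the dot in titles such as "Mr." literal.

```python
import re

def names_regex(names):
    """Compile a case-insensitive alternation of the given names, longest variant first."""
    ordered = sorted(set(names), key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in ordered), re.IGNORECASE)

perp = names_regex(["Marthe", "Marthe Daubreuil", "Mademoiselle Daubreuil"])
print(perp.search("it was marthe daubreuil").group(0))
```

Because the pattern is compiled with `re.IGNORECASE`, it works on both the original text and the lowercased sentences.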



##1. When does the detective (or a pair) occur for the first time -  chapter #, the sentence(s) # in a chapter,

####Background research
Poirot appears in Chapter# 1, sentence # 2 for the first time

###**Murder on the Links**
####Story told from viewpoint of Arthur Hastings, who is Hercule Poirot's sidekick.
####Hercule Poirot first appears in chapter 1, sentence 4 (numbered as in the get_occur output).
####"i had been transacting some business in paris and was returning by the morning service to london where i was still sharing rooms with my old friend, the belgian ex-detective, hercule poirot"
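Questions 1 to 3 all ask for the earliest chapter and sentence where a pattern matches. A minimal sketch of that lookup, assuming the same `{chapter: {sentence_number: sentence}}` shape as `corpus[book]["contents"]`; `first_mention` is a hypothetical helper and the toy sentences are shortened for illustration.

```python
import re

def first_mention(contents, pattern):
    """Return (chapter, sentence_no, sentence) for the earliest match, or None if none is found."""
    for chapter in sorted(contents):
        for sent_no in sorted(contents[chapter]):
            sentence = contents[chapter][sent_no]
            if pattern.search(sentence):
                return chapter, sent_no, sentence
    return None

# Toy data; chapters are deliberately out of order to show that sorting matters.
toy = {
    2: {0: "poirot shook his head"},
    1: {0: "i was returning to london", 1: "my old friend, hercule poirot"},
}
print(first_mention(toy, re.compile(r"hercule|poirot")))
```

With the real corpus, the same result is available as the first element of the list returned by `get_occur()`, since that function walks chapters and sentences in insertion order.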


##2. When is the crime first mentioned - the type of the crime and the details -  chapter #, the sentence(s) # in a chapter,
###**Murder on the Links**
####The murder of Paul Renauld is revealed at the end of chapter 2 in sentence 323.
####"m. renauld was murdered this morning"

##3. When is the perpetrator first mentioned - chapter #, the sentence(s) # in a chapter,
###**Murder on the Links**
####Mademoiselle Daubreuil (Marthe) is first mentioned in chapter 7 sentence 145.
####"'mademoiselle daubreuil,' said m. hautet, sweeping off his hat, 'we regret infinitely to disturb you, but the exigencies of the law—you comprehend"

## 4. What are the 3 words that occur around the perpetrator on each mention (i.e., the three words preceding, and the three words following the mention of a perpetrator),
###**Murder on the Links**
####**First mention:** ...she was afraid! “Mademoiselle Daubreuil,” said M. Hautet, sweeping...
####**Second mention:** ...turned to her. “Marthe, dear-” But the...
####**Third mention:** ...to speak before Mademoiselle Daubreuil.” “As my daughter...
####**Fourth mention:** ...us. It was Marthe Daubreuil. “I beg your...
####**Fifth mention:** ...our Amélie,” explained Marthe, with a blush...




In [None]:
[[7, 145, '“mademoiselle daubreuil,” said m. hautet, sweeping off his hat, “we regret infinitely to disturb you, but the exigencies of the law—you comprehend'], [7, 176, '“marthe, dear'], [7, 184, '“i should prefer not to speak before mademoiselle daubreuil'], [7, 241, 'it was marthe daubreuil'], [7, 249, '“françoise told our amélie,” explained marthe, with a blush'], [7, 293, '“ah,  mon ami , do not set your heart on marthe daubreuil'], [11, 103, 'the subject of the quarrel was mademoiselle marthe daubreuil'], [11, 109, '“i love mademoiselle daubreuil, and i wish to marry her'], [11, 117, 'marthe is as good as she is beautiful'], [11, 119, '“i have nothing against mademoiselle daubreuil in any way'], [11, 125, '“when you informed your father of your intentions towards mademoiselle daubreuil,” he resumed, “he was surprised'], [11, 129, 'nettled, i demanded what he had against mademoiselle daubreuil'], [11, 131, 'i answered that i was marrying marthe, and not her antecedents, but he shouted me down with a peremptory refusal to discuss the matter in any way'], [11, 143, 'i wrote to marthe, telling her what had happened, and her reply soothed me still further'], [13, 19, '“yesterday it was mademoiselle daubreuil, today it is mademoiselle—cinderella'], [13, 23, 'mademoiselle daubreuil is a very beautiful girl, and i do admire her immensely—i don’t mind admitting it'], [13, 94, 'and thirdly, if you wish, endeavour to cut him out with mademoiselle marthe'], [13, 100, 'throw together a boy young renauld and a beautiful girl like mademoiselle marthe, and the result is almost inevitable'], [13, 108, 'that is how always think of mademoiselle daubreuil  as the girl with the anxious eyes'], [13, 194, 'a girl’s voice was speaking, a voice that i recognized as that of the beautiful marthe'], [13, 197, '“you know it, marthe,” jack renauld replied'], [15, 19, 'i do not know what put the idea into my head—possibly it was the deep anxiety underlying marthe daubreuil’s tones—but i 
asked suddenly: “young m. renauld—he did not leave by that train, did he'], [15, 28, 'that, then, was the reason of marthe’s poignant anxiety'], [15, 34, 'one thing was certain, marthe had known all along'], [17, 166, 'i came to see my fiancée, mademoiselle daubreuil'], [18, 19, '“with good fortune,” he remarked to me over his shoulder, “mademoiselle marthe may find herself in the garden'], [18, 25, '” i joined him at the moment that marthe daubreuil, looking slightly startled, came running up to the hedge at his call'], [18, 100, '“ maman ,” whispered marthe, “i must go'], [18, 107, 'unwittingly, mademoiselle marthe told us the truth on another point—and incidentally gave jack renauld the lie'], [18, 108, 'did you notice his hesitation when i asked him if he saw marthe daubreuil on the night of the crime'], [18, 111, 'it was necessary for me to see mademoiselle marthe before he could put her on her guard'], [18, 114, 'now, hastings, what was jack renauld doing here on that eventful evening, and if he did not see mademoiselle marthe whom did he see'], [20, 88, 'm. renauld quarrels with his son over latter’s wish to marry marthe daubreuil'], [20, 93, 'quarrel with tramp in garden, witnessed by marthe daubreuil'], [20, 134, '“may 23rd,” i read, “m. 
renauld quarrels with his son over latter’s wish to marry marthe daubreuil'], [22, 37, 'had jack renauld, returning to see marthe daubreuil, come face to face instead with bella duveen, the girl he had heartlessly thrown over'], [22, 42, 'did he fear for this former entanglement of his to come to the ears of marthe daubreuil'], [24, 181, '“marthe daubreuil'], [24, 187, 'marthe was at the door to meet us, and led poirot in, clinging with both hands to one of his'], [24, 200, 'marthe frowned'], [24, 219, 'marthe looked at him for a minute, then, letting her head fall forward on her arms, she burst into tears'], [24, 227, 'marthe listened spellbound'], [27, 1, 'young renauld had come to us as soon as he was liberated—before starting for merlinville to rejoin marthe and his mother'], [27, 29, 'after i met marthe, and realized i’d made a mistake, i ought to have written and told her so honestly'], [27, 30, 'but i was so terrified of a row, and of its coming to marthe’s ears, and her thinking there was more in it than there ever had been, that—well, i was a coward, and went on hoping the thing would die down of itself'], [27, 46, 'i came from cherbourg, as i told you, in order to see marthe before going to the other end of the world'], [27, 82, '“while you break it in person to mademoiselle marthe, eh?” finished poirot, with a twinkle'], [27, 118, '“here are jack and marthe daubreuil,” i exclaimed, looking out of the window'], [27, 127, '“but marthe and i'], [27, 156, '“he is overdone,” murmured poirot to marthe'], [27, 172, 'finally, having done all we could, we left him in the charge of marthe and her mother, and set out for the town'], [27, 227, 'thrown sharply on the blind was the profile of marthe daubreuil'], [27, 233, 'marthe daubreuil was embroidering by a table with a lamp on it'], [27, 260, 'poirot looked over his shoulder once at the lighted window and the profile of marthe as she bent over her work'], [27, 331, 'puzzled and uncomprehending, i knelt down, and 
lifting the fold of cloth, looked into the dead beautiful face of marthe daubreuil'], [28, 35, '“do you know, i actually dreamt that we found marthe daubreuil’s body in mrs. renauld’s room, and that you declared her to have murdered mr. renauld'], [28, 56, 'from marthe daubreuil’s own lips we have the admission that she overheard m. renauld’s quarrel with the tramp'], [28, 58, 'remember how easily you overheard marthe’s conversation with jack renauld from that spot'], [28, 59, '“but what possible motive could marthe have for murdering mr. renauld'], [28, 64, 'let us reconstruct the scene from the standpoint of marthe daubreuil'], [28, 65, '“marthe daubreuil overhears what passes between renauld and his wife'], [28, 70, 'if the latter defies his father, he will be a pauper—which is not at all to the mind of mademoiselle marthe'], [28, 79, 'and here comes in the second point which led me infallibly to marthe daubreuil—the dagger'], [28, 81, 'one he gave to his mother, one to bella duveen; was it not highly probable that he had given the third one to marthe daubreuil'], [28, 82, '“so then, to sum up, there were four points of note against marthe daubreuil: “(1) marthe daubreuil could have overheard m. renauld’s plans'], [28, 83, '“(2) marthe daubreuil had a direct interest in causing m. 
renauld’s death'], [28, 84, '“(3) marthe daubreuil was the daughter of the notorious madame beroldy who in my opinion was morally and virtually the murderess of her husband, although it may have been georges conneau’s hand which struck the actual blow'], [28, 85, '“(4) marthe daubreuil was the only person, besides jack renauld, likely to have the third dagger in her possession'], [28, 95, 'if it was  not  bella duveen, the only other person who could have committed the crime was marthe daubreuil'], [28, 99, 'but if, by any chance, it was  not  her sister’s, but the one given by jack to marthe daubreuil—why then, bella duveen’s dagger would be still intact'], [28, 102, '“in the meantime i had taken steps to force mademoiselle marthe into the open'], [28, 107, 'marthe daubreuil made a last bold bid for the renauld millions—and failed'], [28, 122, 'she had brains, that beautiful mademoiselle marthe'], [28, 127, 'on the floor by marthe daubreuil’s body, i found a pad and a little bottle of chloroform and a hypodermic syringe containing a fatal dose of morphine'], [28, 137, '“however, hastings, things did not go quite as mademoiselle marthe had planned'], [28, 141, 'there is a last chance for marthe daubreuil'], [28, 145, '“when did you first begin to suspect marthe daubreuil, poirot'], [28, 151, 'that is how i have thought of marthe daubreuil from the beginning'], [28, 169, 'so far we have looked upon bella duveen as a siren, and marthe daubreuil as the girl he really loved'], [28, 171, 'marthe daubreuil was very beautiful']]

## 5. When and how the detective/detectives and the perpetrators co-occur - chapter #, the sentence(s) # in a chapter
###**The Murder on the Links**
####Arthur Hastings (sidekick) and Marthe (perpetrator) first co-occur in chapter 18 sentence 114.
####'now, hastings, what was jack renauld doing here on that eventful evening, and if he did not see mademoiselle marthe whom did he see'
####Hercule Poirot (detective) and Marthe (perpetrator) first co-occur in chapter 24 sentence 187.
####'marthe was at the door to meet us, and led poirot in, clinging with both hands to one of his'
####Poirot and Marthe co-occur again in chapter 27 sentence 82.
####'“while you break it in person to mademoiselle marthe, eh?” finished poirot, with a twinkle'
####They co-occur again in chapter 27 sentence 156.
####'“he is overdone,” murmured poirot to marthe'
####They also co-occur in chapter 27 sentence 260.
####'poirot looked over his shoulder once at the lighted window and the profile of marthe as she bent over her work'
####Hastings and Marthe co-occur again in chapter 28 sentence 137.
####'“however, hastings, things did not go quite as mademoiselle marthe had planned'
####Finally, Poirot and Marthe co-occur in chapter 28 sentence 145.
####'“when did you first begin to suspect marthe daubreuil, poirot'
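
The co-occurrences listed above can be pulled out of the perpetrator's hit list with a second regex pass. The following is a minimal sketch, not necessarily the method used in this notebook: it assumes the hits are stored as `[chapter, sentence, text]` entries with lowercased text, as in the output above. The function name `find_cooccurrences` and the variable `marthe_hits` are hypothetical, and the list holds only a few entries copied from that output.

```python
import re

def find_cooccurrences(hits, name_pattern):
    """Keep the [chapter, sentence, text] hits whose text also matches name_pattern."""
    pattern = re.compile(name_pattern)
    return [[ch, sent, text] for ch, sent, text in hits if pattern.search(text)]

# A small sample of the Marthe hits shown in the output above (hypothetical variable name).
marthe_hits = [
    [24, 187, 'marthe was at the door to meet us, and led poirot in, clinging with both hands to one of his'],
    [24, 200, 'marthe frowned'],
    [27, 82, '“while you break it in person to mademoiselle marthe, eh?” finished poirot, with a twinkle'],
    [28, 137, '“however, hastings, things did not go quite as mademoiselle marthe had planned'],
]

# Sentences in which both Marthe and Poirot are named.
poirot_and_marthe = find_cooccurrences(marthe_hits, r'\bpoirot\b')
for chapter, sentence, text in poirot_and_marthe:
    print(chapter, sentence, text)
```

Because the text is already lowercased, the pattern is written in lowercase; the `\b` word boundaries keep the name from matching inside a longer word.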





## 6. When are other suspects first introduced - chapter #, the sentence(s) # in a chapter
###**The Murder on the Links**
####Jack Renauld is the red herring in this book.
####He first appears in chapter 3 sentence 118.
####'finally there are madame renauld and her son, m. jack renauld'
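
A first-mention lookup like the one above can be sketched as a nested loop over chapters and sentences that stops at the first regex match. This is only an illustration under stated assumptions: the book is taken to be already split into a list of chapters, each a list of lowercased sentences, and counting starts at 1. The function name `first_mention` and the toy data are hypothetical, so the numbers it returns are not the real chapter and sentence numbers.

```python
import re

def first_mention(chapters, name_pattern):
    """Return (chapter #, sentence #, text) for the first sentence matching name_pattern, else None."""
    pattern = re.compile(name_pattern)
    for ch_num, sentences in enumerate(chapters, start=1):
        for s_num, sentence in enumerate(sentences, start=1):
            if pattern.search(sentence):
                return ch_num, s_num, sentence
    return None

# Toy data for illustration only; not the real chapter split of the novel.
toy_chapters = [
    ['the train was late', 'poirot waited on the platform'],
    ['we reached the villa',
     'finally there are madame renauld and her son, m. jack renauld',
     'jack renauld was away'],
]

print(first_mention(toy_chapters, r'\bjack renauld\b'))
# → (2, 2, 'finally there are madame renauld and her son, m. jack renauld')
```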


# Additional/Extra Analysis

# Practice Section