<a href="https://colab.research.google.com/github/692-Team-1-NLP/proj-1/blob/main/Proj_1_collab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `692-Proj-1 : Crime Novel Plot Analysis with Regex - Agatha Christie`
The goal of this project is to conduct a plot and protagonist/antagonist analysis of famous crime novels. We analyze five crime novels/stories by Agatha Christie that are publicly available from Project Gutenberg (http://www.gutenberg.org/). The novels chosen are:
- The Mysterious Affair at Styles 
- The Murder on the Links 
- The Secret Adversary 
- The Man in the Brown Suit 
- The Secret of Chimneys 


Note: Feel free to use any background resource for the understanding of the plot, protagonist and antagonist names, and other details. Look for spoilers, details, etc. Our goal is not to predict the crime, but to computationally analyze the structure of the plot.

# Data collection

## Background research

Locations of the plain-text UTF-8 files for the novels:
- The Mysterious Affair at Styles https://www.gutenberg.org/files/863/863-0.txt
- The Murder on the Links https://www.gutenberg.org/files/58866/58866-0.txt
- The Secret Adversary https://www.gutenberg.org/files/1155/1155-0.txt
- The Man in the Brown Suit https://www.gutenberg.org/files/61168/61168-0.txt
- The Secret of Chimneys https://www.gutenberg.org/files/65238/65238-0.txt

Note: One benefit of the plain-text version is that it has no page numbers to clean up, whereas the HTML version does.

### Helpful Links

https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3

https://docs.python.org/3/howto/urllib2.html



In [16]:
# Example test run for one file.
# We will need a data structure that holds the names and links so we can loop through it
# (or fetch sequentially) to get the data for all the novels.

import re
import urllib.request

url = "https://www.gutenberg.org/files/863/863-0.txt"  # UTF-8 text file link for The Mysterious Affair at Styles

response = urllib.request.urlopen(url)
data = response.read()           # a `bytes` object
text = data.decode('utf-8-sig')  # Gutenberg files are UTF-8; 'utf-8-sig' also drops a leading byte-order mark if one is present
# text  # uncomment for output if needed
# The text needs to be cleaned before it can be analyzed.

# Trying out word tokenization; we will need both words and sentences.
# words = re.split(r'\s+', text)

# Trying out sentence tokenization via re (raw string so the backslashes reach the regex engine unchanged).
# This is a WIP: as the output shows, it does not catch all the sentences.
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
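
The comment in the test run above calls for a data structure that holds the novel names and links. One possible shape is sketched below, assuming a plain dictionary is enough. The names `NOVEL_URLS`, `download_novel`, and `download_all` are illustrative and not part of the original notebook.

```python
import urllib.request

# Title -> plain-text UTF-8 URL, copied from the list in the Background research section.
NOVEL_URLS = {
    "The Mysterious Affair at Styles": "https://www.gutenberg.org/files/863/863-0.txt",
    "The Murder on the Links": "https://www.gutenberg.org/files/58866/58866-0.txt",
    "The Secret Adversary": "https://www.gutenberg.org/files/1155/1155-0.txt",
    "The Man in the Brown Suit": "https://www.gutenberg.org/files/61168/61168-0.txt",
    "The Secret of Chimneys": "https://www.gutenberg.org/files/65238/65238-0.txt",
}


def download_novel(url):
    """Fetch one novel and return its text as a str."""
    with urllib.request.urlopen(url) as response:
        # 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark if present.
        return response.read().decode("utf-8-sig")


def download_all(urls=NOVEL_URLS):
    """Return a dict mapping each title to the downloaded text."""
    return {title: download_novel(url) for title, url in urls.items()}
```

Nothing is downloaded until `download_all()` is called; the result is keyed by title so later cleaning steps can treat each novel separately.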

# Data Cleaning

## Background Research


- The novel formats are inconsistent. Some of them start with a prologue and others don't.
- The start marker reads 'START OF THE PROJECT' in some books and 'START OF THIS PROJECT' in others; in both cases the table of contents appears after it.
- Some of them use the words 'table of contents'; others just say 'contents'.
- Some use Roman numerals to number chapters; others don't.
- Some use the word 'chapter'; others just list a number together with the chapter title.
- The novel text files have license and other information at the end.

The factors above need to be considered during data cleaning.
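
As a first cleaning step, the Project Gutenberg header and trailing license can be cut away by searching for the start and end markers. The sketch below assumes the markers sit on their own lines and differ only in 'THE' versus 'THIS'; `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Accept both "*** START OF THE PROJECT GUTENBERG ..." and "*** START OF THIS PROJECT GUTENBERG ...".
START_RE = re.compile(r"^\*{3} ?START OF (?:THE|THIS) PROJECT GUTENBERG.*$", re.MULTILINE)
END_RE = re.compile(r"^\*{3} ?END OF (?:THE|THIS) PROJECT GUTENBERG.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```

The result still contains the title page and table of contents, which appear after the start marker, so a second step is needed to find where the plot itself begins.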

Listing a few key particulars below: 

- The Mysterious Affair at Styles
  - This phrase is present at the beginning: '\*** START OF THE PROJECT'.
  - The novel plot starts at the second instance of 'chapter I.' (the period is important here).
  - The novel ends at 'THE END', followed by '\*** END OF THE PROJECT GUTENBERG EBOOK...'.
  - Each chapter starts with 'Chapter' followed by the chapter number in Roman numerals, then a new line, then the chapter title.

- The Murder on the Links
  - This phrase is present at the beginning: '\*** START OF THIS PROJECT'.
  - The novel plot starts at the second instance of '1 A Fellow Traveller'.
  - The novel ends at 'End of Project Gutenberg's The Murder on the Links, by Agatha Christie', followed by '\*** END OF THIS PROJECT GUTENBERG ...'.
  - Each chapter starts with a number followed by the chapter title.

- The Secret Adversary
  - This phrase is present at the beginning: '\*** START OF THIS PROJECT'.
  - The novel plot starts at the second instance of 'PROLOGUE'.
  - The novel ends at 'End of the Project Gutenberg EBook of The Secret Adversary, by Agatha Christie', followed by '\*** END OF THIS PROJECT GUTENBERG ...'.
  - Each chapter starts with 'Chapter' followed by the chapter number in Roman numerals, followed by the chapter title.

- The Man in the Brown Suit
  - This phrase is present at the beginning: '\*** START OF THIS PROJECT'.
  - The novel starts at the second instance of 'PROLOGUE'.
  - The novel ends at 'End of Project Gutenberg's The Man in the Brown Suit, by Agatha Christie', followed by '\*** END OF THIS PROJECT GUTENBERG ...'.
  - Each chapter starts with 'Chapter' followed by the chapter number in Roman numerals.

- The Secret of Chimneys
  - This phrase is present at the beginning: '\*** START OF THE PROJECT'.
  - The novel plot starts at the second instance of '1', then a new line, then 'Anthony Cade Signs on' (the new line comes before the title).
  - The novel ends at 'Transcriber's Notes:', followed by '\*** END OF THE PROJECT GUTENBERG...'.
  - Each chapter starts with a number, then a new line, then the chapter title.
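
The notes above locate the start of the plot at the second instance of a marker (the first instance is in the table of contents). A rough sketch of that idea, plus a chapter splitter for the 'Chapter' + Roman numeral format, is shown below. It assumes the headings are upper-case ('CHAPTER I.') and start a line; the numbered formats used by the other novels would need their own patterns. `body_from_nth_match` and `split_chapters` are hypothetical helper names.

```python
import re

# 'CHAPTER' + Roman numeral at the start of a line; the title may follow on the same or next line.
CHAPTER_RE = re.compile(r"^CHAPTER[ \t]+[IVXLC]+\b.*$", re.MULTILINE)


def body_from_nth_match(text, pattern, n=2):
    """Return text starting at the n-th match of pattern (or all of text if there are fewer matches)."""
    matches = list(re.finditer(pattern, text))
    if len(matches) < n:
        return text
    return text[matches[n - 1].start():]


def split_chapters(body):
    """Split body into chapters, each beginning with its heading line."""
    starts = [m.start() for m in CHAPTER_RE.finditer(body)]
    if not starts:
        return [body.strip()]
    bounds = starts + [len(body)]
    return [body[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]
```

For example, `split_chapters(body_from_nth_match(text, r"CHAPTER I\."))` skips the table of contents and returns one string per chapter. The escaped period matters: `CHAPTER I\.` does not match 'CHAPTER II.'.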


# Data Tokenization / Prep for analysis

Note: Since we are not allowed to use NLTK for tokenization, we will have to use plain Python for this as well.

We could use `split()`, but that is very basic and does not produce tokens in a linguistic sense. The `re` package adds regex support, and learning regex better is the point of this project, so we recommend `re.split` with our own custom regex:
https://docs.python.org/3/library/re.html


### Helpful Links:
- https://python.plainenglish.io/how-to-tokenize-sentences-without-using-any-nlp-library-in-python-a381b75f7d22 
- https://stackoverflow.com/questions/21361073/tokenize-words-in-a-list-of-sentences-python
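
A minimal sketch of regex-only tokenization follows. The patterns are assumptions chosen for illustration, not a finished tokenizer: the word pattern keeps apostrophes and hyphens inside words, and the sentence splitter only protects the titles 'Mr.', 'Mrs.', and 'Dr.' from being treated as sentence ends. It does not handle closing quotation marks after the punctuation.

```python
import re

# A word is a run of letters (including accented ones), optionally joined by ' or - to more letters.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

# Split on whitespace that follows . ? or !, unless the period belongs to a common title.
SENTENCE_SPLIT_RE = re.compile(r"(?<!Mr\.)(?<!Mrs\.)(?<!Dr\.)(?<=[.?!])\s+")


def tokenize_words(text):
    """Return the list of word tokens in text."""
    return WORD_RE.findall(text)


def split_sentences(text):
    """Return the list of sentences in text, in order."""
    return [s for s in SENTENCE_SPLIT_RE.split(text.strip()) if s]
```

Compared with `str.split()`, this drops punctuation from word tokens and keeps contractions such as "didn't" together.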




# Data Analysis

The goal of this project is to analyze how often the protagonists and the perpetrator(s) occur across each novel (per chapter, and per sentence within a chapter), along with mentions of the crime and other circumstances surrounding the antagonists. The ultimate objective is to use basic NLP tools to observe any patterns in plot structure across the works of one or all of the authors. Specifically, the analysis questions below need to be answered.

Note: To effectively conduct this analysis, you should find resources, and read the plot summaries of each novel, so you can make your search more effective. If plot summaries are not available, use regex to search for clues, and report how well/how fast that approach worked. 

The plot-summary answers derived from reading the book or its summary are located below each question.

## Pre-Steps

Details of each book: 

- The Mysterious Affair at Styles 
  - Lead detective: Hercule Poirot, Arthur Hastings
  - Other detectives/assistants: 
  - Victim: Emily Inglethorp
  - Suspects: Alfred Inglethorp, Cavendish
  - Perpetrator(s): Alfred Inglethorp, Evelyn Howard
  - Other important characters: John Cavendish
  - Crime: Murder, Poisoning
  - Motif: murder mystery
  - Reference: https://agathachristie.fandom.com/wiki/The_Mysterious_Affair_at_Styles


- The Murder on the Links 
  - Lead detective(s): Hercule Poirot, Arthur Hastings
  - Other detectives/assistants:  Monsieur Giraud, Monsieur Hautet
  - Victim: Paul Renauld
  - Suspects: Jack Renauld
  - Perpetrator: Marthe Daubreuil
  - Other important characters: Paul Renauld, Eloise Renauld, Jack Renauld, Madame Daubreuil, Gabriel Stonor, Georges Conneau, Madame Beroldy, Marthe Daubreuil, Bella Duveen, Dulcie Duveen (Cinderella)
  - Crime: Murder, Stabbing
  - Motif: murder mystery
  - References:
    - https://en.wikipedia.org/wiki/The_Murder_on_the_Links
    - https://agathachristie.fandom.com/wiki/The_Murder_on_the_Links



- The Secret Adversary (complicated; needs a closer look)
  - Lead detective: Tommy and Tuppence, Tommy Beresford, Tuppence Cowley, Prudence Cowley, Prudence "Tuppence" Cowley
  - Other detectives/assistants: 
  - Victim: Jane Finn, Mrs. Vandemeyer
  - Suspects: Mr. Brown, Julius Hersheimmer
  - Perpetrator: Sir James Peel Edgerton
  - Other important characters: Jane Finn
  - Crime: Espionage, Kidnapping
  - Motif: thriller focus rather than detection


- The Man in the Brown Suit (complicated; needs a closer look)
  - Lead detective: Anne Beddingfeld
  - Other detectives/assistants: 
  - Victim: Nadina aka Anita Grünberg, L. B. Carton
  - Suspects: Harry
  - Perpetrator: Sir Eustace Pedler
  - Other important characters: Nadina, Count Sergius Paulovitch, the Colonel, Suzanne Blair, Colonel Race, Guy Pagett, Harry Rayburn, Rev. Chichester, Miss Pettigrew, Harry Parker, Chichester
  - Crime: diamond theft, murders, kidnapping
  - Motif: thriller focus rather than detection


- The Secret of Chimneys (complicated; needs a closer look)
  - Lead detective: Anthony Cade aka Prince Nicholas
  - Other detectives/assistants: Superintendent Battle, Monsieur Lemoine of the Sûreté, Mr. Fish (an American agent)
  - Victim (perceived): Count Stanislaus aka Prince Michael Obolovitch
  - Suspects: Anthony Cade, Prince Nicholas, King Victor
  - Perpetrator: Mlle Brun aka Queen Varaga aka Angèle Mory, M Lemoine aka King Victor
  - Other important characters: King Nicholas IV, Queen Varaga aka Angèle Mory, Herman Isaacstein, Prince Michael Obolovitch, George Lomax, Count Stylptitch, Jimmy McGrath, Virginia Revel, Captain O'Neill, Mr Holmes, Isaacstein, Hiram P. Fish, Prince Nicholas, Mademoiselle Brun, Bill Eversleigh, Monsieur Lemoine of the Sûreté, Professor Wynwood, Boris Anchoukoff
  - Crime: sensitive document theft, murders, treasure hunt, espionage
  - Motif: thriller focus rather than detection



## 1. When does the detective (or pair of detectives) first occur: chapter # and sentence # within the chapter

#### Background research
Poirot first appears in chapter 1, sentence 2.
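
A first-mention lookup of this kind can be sketched as below, assuming the novel has already been split into a list of chapter strings. The sentence splitter here is deliberately naive, and `first_mention` is a hypothetical helper name. The same function could be pointed at crime keywords or suspect names for the later questions.

```python
import re


def split_sentences(text):
    """Naive splitter: break on whitespace after ., ? or !."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def first_mention(chapters, names):
    """Return (chapter #, sentence #, sentence) for the first sentence mentioning any name, else None.

    Chapter and sentence numbers are 1-based.
    """
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    for chap_no, chapter in enumerate(chapters, start=1):
        for sent_no, sentence in enumerate(split_sentences(chapter), start=1):
            if pattern.search(sentence):
                return chap_no, sent_no, sentence
    return None
```

Passing every alias of a character (for example both "Poirot" and "Hercule Poirot") makes the lookup less sensitive to how the narrator refers to them.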

## 2. When is the crime first mentioned (type of crime and details): chapter # and sentence # within the chapter


## 3. When is the perpetrator first mentioned: chapter # and sentence # within the chapter

## 4. What are the 3 words that occur around the perpetrator on each mention (i.e., the three words preceding and the three words following each mention of a perpetrator)
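
One way to collect these context windows is sketched below, assuming a simple regex word tokenizer and exact, case-sensitive matching of the name; `context_windows` is a hypothetical helper name. Windows are shorter than three words when the mention is near the start or end of the text.

```python
import re

# A word is a run of letters, optionally joined by ' or - to more letters.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def context_windows(text, name, width=3):
    """Return a (words_before, words_after) pair for every mention of name in text."""
    words = WORD_RE.findall(text)
    target = WORD_RE.findall(name)  # tokenize the name the same way as the text
    n = len(target)
    windows = []
    if n == 0:
        return windows
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            before = words[max(0, i - width):i]
            after = words[i + n:i + n + width]
            windows.append((before, after))
    return windows
```

Because punctuation is dropped during tokenization, a window can span a sentence boundary; restricting the search to one sentence at a time would avoid that.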

## 5. When and how do the detective(s) and the perpetrator(s) co-occur: chapter # and sentence # within the chapter
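
Co-occurrence can be approximated as "both names appear in the same sentence". The sketch below makes that assumption and takes a list of chapter strings as input; `co_occurrences` and `name_pattern` are hypothetical helper names, and the sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive splitter: break on whitespace after ., ? or !."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def name_pattern(names):
    """Compile a regex that matches any of the given names as whole words."""
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")


def co_occurrences(chapters, detectives, perpetrators):
    """Return (chapter #, sentence #, sentence) for each sentence naming both a detective and a perpetrator."""
    det_re = name_pattern(detectives)
    perp_re = name_pattern(perpetrators)
    hits = []
    for chap_no, chapter in enumerate(chapters, start=1):
        for sent_no, sentence in enumerate(split_sentences(chapter), start=1):
            if det_re.search(sentence) and perp_re.search(sentence):
                hits.append((chap_no, sent_no, sentence))
    return hits
```

Reading the returned sentences answers the "how" part of the question; a looser definition (same paragraph, or within a few sentences) would need a different window.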

## 6. When are the other suspects first introduced: chapter # and sentence # within the chapter

# Additional/Extra Analysis

# Practice Section