# NLP - Project 1
## Rinehart Analysis
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

Check out the **[README](https://github.com/NLPaladins/rinehartAnalysis/blob/main/README.md)**


Or the current **[TODO](https://github.com/NLPaladins/rinehartAnalysis/blob/main/TODO.md)** list.

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display

In [2]:
# Clone the repository if you're in Google Collab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis.git
    print("Project cloned.")
       
print("Clone project?")
print("(If you do this you will ovewrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Collab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will ovewrite local changes on other files e.g. configs)
Not needed if you're not on Google Collab


Button(description='Yes, clone', style=ButtonStyle())

In [3]:
# Clone the repository if you're in Google Collab
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        os.chdir('/content/rinehartAnalysis')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Collab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Collab?


Button(description='Yes', style=ButtonStyle())

### At any point, to save changes
click **File > Save a copy on Gihtub**

### Now go to files on the left, and open:
- rinehartAnalysis/confs/proj_1.yml

(Ctr+s) to save changes

## Load Libraries and setup

In [4]:
import traceback
import argparse
from importlib import reload as reload_lib
from pprint import pprint
import numpy as np

# Custom libs
from nlp_libs import Configuration, ColorizedLogger, ProcessedBook
# Import this way the libs you want to dynamically change and reload 
# import nlp_libs.books.processed_book as books_lib # Comment out until Class is finalized


### Libraries Overview
All the libraries are located under *"\<project root>/nlp_libs"*
- ***ProcessedBook***: Loc: **books/processed_book.py**, Desc: *Book Pre-processor*
- ***Configuration***: Loc: **configuration/configuration.py**, Desc: *Configuration Loader*
- ***ColorizedLogger***: Loc: **fancy_logger/colorized_logger.py**, Desc: *Logger with formatted text capabilities*

In [5]:
# The path of configuration and log save path
config_path = "confs/proj_1.yml"  # Open files > confs > proj_1.yml to edit temporalily. Commit to save permanently
# !cat "$config_path"
log_path = "logs/proj_1.log"  # Open files > logs > proj_1.log to debug logs of previous runs

In [6]:
# The logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)

2021-09-22 16:43:39 FancyLogger  INFO     [1m[37mLogger is set. Log file path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis/logs/proj_1.log[0m


In [7]:
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books = conf.get_config('data_loader')['config']['books']
pprint(books)  # Pretty print the books dict

2021-09-22 16:43:39 Config       INFO     [1m[37mConfiguration file loaded successfully from path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis/confs/proj_1.yml[0m
2021-09-22 16:43:39 Config       INFO     [1m[37mConfiguration Tag: proj1[0m


{'Oh,_Well,_You_Know_How_Women_Are!': {'crime_type': 'example',
                                       'detectives': ['man1', 'man2'],
                                       'suspects': ['man3', 'man4'],
                                       'url': 'https://www.gutenberg.org/cache/epub/24259/pg24259.txt'},
 'The_Breaking_Point': {'crime_type': 'example',
                        'detectives': ['man1', 'man2'],
                        'suspects': ['man3', 'man4'],
                        'url': 'https://www.gutenberg.org/files/1601/1601-0.txt'},
 'The_Circular_Staircase': {'crime_type': 'example',
                            'detectives': ['Rachel Innes'],
                            'suspects': ['Liddy',
                                         'Halsey',
                                         'Gertrude',
                                         'Paul Armstrong',
                                         'Doctor Walker',
                                         'Louise Armstrong',
    

## Exploration

In [17]:
r'someString\nanotherstring'.split(r'\n')

['someString', 'anotherstring']

In [8]:
import urllib.request
import re
from typing import *


class ProcessedBook:
    num_map = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
               (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
               (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]
    title: str
    url: str
    detectives: List[str]
    suspects: List[str]
    crime_type: str

    def __init__(self, title: str, metadata: Dict, make_lower: bool = True):
        """
        raw holds the books as a single string.
        clean holds the books as a list of lowercase lines starting
        from the first chapter and ending with the last sentence.
        """
        self.title = title
        self.url = metadata['url']
        self.detectives = metadata['detectives']
        self.suspects = metadata['suspects']
        self.crime_type = metadata['crime_type']
        self.raw = self.read_book_from_proj_gut(self.url)
        if make_lower:
            lines = self.raw.lower()
        else:
            lines = self.raw
        loins = re.sub(r'\r\n', r'\n',staircase.raw)
        loins = re.findall(r'.*\n', loins)
        self.lines = self.clean_lines(lines=lines)
        self.clean = self.get_clean_book(make_lower=make_lower)

    @staticmethod
    def read_book_from_proj_gut(book_url: str) -> str:
        req = urllib.request.Request(book_url)
        client = urllib.request.urlopen(req)
        page = client.read()
        return page.decode('utf-8')

    def get_clean_book(self, make_lower: bool = True) -> List[str]:
        if make_lower:
            lines = self.raw.lower()
        else:
            lines = self.raw
        lines = lines.replace('\r\n', '\n').split('\n')
        lines = self.clean_lines(lines)
        chapters = self.lines_to_chapters(lines)
        return chapters

    def clean_lines(self, lines: List[str]) -> List[str]:
        clean_lines = []
        start = False
        for line in lines:
            if re.match(r'^chapter i\.', line):
                clean_lines.append(line)
                start = True
                continue
            if not start:
                continue
            if re.match(r'^\*\*\* end of the project gutenberg ebook', line):
                break
            if self.pass_clean_filter(line):
                clean_lines.append(line)
        return clean_lines

    @staticmethod
    def lines_to_chapters(lines: List[str]) -> List[str]:
        chapters = []
        sentences = []
        current_sent = ''
        for i, line in enumerate(lines):
            # add chapter as 1st sentence
            if re.match(r'^chapter [ivxlcdm]+\.$', line):
                if sentences:
                    chapters.append(sentences)
                sentences = [line]
                add_chapter_title = True
                continue
            # add chapter title as 2nd sentence
            elif add_chapter_title:
                sentences.append(line)
                add_chapter_title = False
                continue
            sents = re.findall(r' *((?:mr\.|mrs.|[^\.\?!])*)(?<!mr)(?<!mrs)[\.\?!]', line)
            # if no sentence end is detected
            if not sents:
                if current_sent == '':
                    current_sent = line
                else:
                    current_sent += ' ' + line
            # if at least one sentence end is detected
            else:
                for group in sents:
                    if current_sent != '':
                        current_sent += ' ' + group
                        sentences.append(current_sent)
                    else:
                        sentences.append(group)
                    current_sent = ''
                # set the next sentence to its start if there is one
                sent_end = re.search(r'(?<!mr)(?<!mrs)[\.\?!] ((?:mr\.|mrs\.|[^\.\?!])*)$', line)
                if sent_end is not None:
                    current_sent = sent_end.groups()[0]
        return chapters

    @staticmethod
    def pass_clean_filter(line: str) -> bool:
        # removing the illustration lines and empty lines
        # can add other filters here as needed
        if line == '' or re.match(r'illustration:|\[illustration\]', line):
            return False
        else:
            return True

    def get_chapter(self, chapter: int) -> str:
        return self.clean[chapter - 1]


In [9]:
# Load Circlular Staircase
title, metadata = list(books.items())[0]
staircase = ProcessedBook(title=title, metadata=metadata)

In [10]:
logger.info(f'The raw length of this book as a string is {len(staircase.raw)}')
logger.info(f'This book has {len(staircase.clean)} chapters\n')
for i, chapter in enumerate(staircase.clean):
  if i == 5: break
  logger.info(f'{chapter[0]} - {chapter[1]}')
  logger.info(f'There are {len(chapter)} sentences in this chapter.')
  num_words = []
  for sent in chapter:
    num_words.append(len(sent.split(' ')))
  avg_words = np.mean(num_words)
  logger.info(f'The average sentence length in this chapter is {avg_words} words\n')

2021-09-22 16:43:41 Notebook     INFO     [1m[36mThe raw length of this book as a string is 410135[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mThis book has 33 chapters
[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mchapter i. - i take a country house[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mThere are 117 sentences in this chapter.[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mThe average sentence length in this chapter is 21.307692307692307 words
[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mchapter ii. - a link cuff-button[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mThere are 142 sentences in this chapter.[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mThe average sentence length in this chapter is 16.260563380281692 words
[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mchapter iii. - mr. john bailey appears[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mThere are 104 sentences in this chapter.

In [11]:
for i, sent in enumerate(staircase.get_chapter(33)):
  if i == 10: break
  logger.info(sent)

2021-09-22 16:43:41 Notebook     INFO     [1m[36mchapter xxxiii.[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mat the foot of the stairs[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mas i drove rapidly up to the house from casanova station in the hack, i saw the detective burns loitering across the street from the walker place[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mso jamieson was putting the screws on—lightly now, but ready to give them a twist or two, i felt certain, very soon[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mthe house was quiet[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mtwo steps of the circular staircase had been pried off, without result, and beyond a second message from gertrude, that halsey insisted on coming home and they would arrive that night, there was nothing new[0m
2021-09-22 16:43:41 Notebook     INFO     [1m[36mmr. jamieson, having failed to locate the secret room, had gone to the village[0m
2021-

In [12]:
# Load Lower Ten
title, metadata = list(books.items())[0]
lower_ten = ProcessedBook(title=title, metadata=metadata)

In [13]:
logger.info(f'The raw length of this book as a string is {len(lower_ten.raw)}')
logger.info(f'This book has {len(lower_ten.clean)} chapters\n')
for i, chapter in enumerate(lower_ten.clean):
  if i == 5: break
  logger.info(f'{chapter[0]} - {chapter[1]}')
  logger.info(f'There are {len(chapter)} sentences in this chapter.')
  num_words = []
  for sent in chapter:
    num_words.append(len(sent.split(' ')))
  avg_words = np.mean(num_words)
  logger.info(f'The average sentence length in this chapter is {avg_words} words\n')

2021-09-22 16:43:43 Notebook     INFO     [1m[36mThe raw length of this book as a string is 410135[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mThis book has 33 chapters
[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mchapter i. - i take a country house[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mThere are 117 sentences in this chapter.[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mThe average sentence length in this chapter is 21.307692307692307 words
[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mchapter ii. - a link cuff-button[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mThere are 142 sentences in this chapter.[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mThe average sentence length in this chapter is 16.260563380281692 words
[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mchapter iii. - mr. john bailey appears[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mThere are 104 sentences in this chapter.

In [14]:
for i, sent in enumerate(lower_ten.get_chapter(15)):
  if i == 10: 
    break
  logger.info(sent)

2021-09-22 16:43:43 Notebook     INFO     [1m[36mchapter xv.[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mliddy gives the alarm[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mthe next day, friday, gertrude broke the news of her stepfather’s death to louise[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mshe did it as gently as she could, telling her first that he was very ill, and finally that he was dead[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mlouise received the news in the most unexpected manner, and when gertrude came out to tell me how she had stood it, i think she was almost shocked[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36m“she just lay and stared at me, aunt ray,” she said[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36m“do you know, i believe she is glad, glad[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mand she is too honest to pretend anything else[0m
2021-09-22 16:43:43 Notebook     INFO     [1m[36mw

In [15]:
# Example of running it on all the books
processed_books = {}
for title, metadata in books.items():
  logger.nl()
  logger.info(f"Book: {title}", color='yellow', attrs=['underline'])
  current_book = ProcessedBook(title=title, metadata=metadata)
  # Raw length
  logger.info(f'The raw length of this book as a string is {len(current_book.raw)}')
  # Number of chapters
  logger.info(f'This book has {len(current_book.clean)} chapters\n')
  # Sententences per chapter
  for i, chapter in enumerate(current_book.clean):
    if i == 5: 
      break
    logger.info(f'{chapter[0]} - {chapter[1]}')
    logger.info(f'There are {len(chapter)} sentences in this chapter.')
    num_words = []
    for sent in chapter:
      num_words.append(len(sent.split(' ')))
    avg_words = np.mean(num_words)
    logger.info(f'The average sentence length in this chapter is {avg_words} words\n')
  # Chapter 15
  for i, sent in enumerate(lower_ten.get_chapter(15)):
    if i == 10: break
    logger.info(sent)


2021-09-22 16:43:43 Notebook     INFO     [4m[33mBook: The_Circular_Staircase[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mThe raw length of this book as a string is 410135[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mThis book has 33 chapters
[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mchapter i. - i take a country house[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mThere are 117 sentences in this chapter.[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mThe average sentence length in this chapter is 21.307692307692307 words
[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mchapter ii. - a link cuff-button[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mThere are 142 sentences in this chapter.[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mThe average sentence length in this chapter is 16.260563380281692 words
[0m
2021-09-22 16:43:44 Notebook     INFO     [1m[36mchapter iii. - mr. john bailey appears[0m
2021-09

2021-09-22 16:43:48 Notebook     INFO     [1m[36m“she just lay and stared at me, aunt ray,” she said[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36m“do you know, i believe she is glad, glad[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mand she is too honest to pretend anything else[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mwhat sort of man was mr. paul armstrong, anyhow[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36m“he was a bully as well as a rascal, gertrude,” i said[0m

2021-09-22 16:43:48 Notebook     INFO     [4m[33mBook: The_Window_at_the_White_Cat[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mThe raw length of this book as a string is 87360[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mThis book has 0 chapters
[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mchapter xv.[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mliddy gives the alarm[0m
2021-09-22 16:43:48 Notebook     INFO     [1m[36mthe next da

In [23]:
lines = staircase.raw.replace('\r\n', '\n').split('\n')

In [21]:
loins = re.sub(r'\r\n', r'\n',staircase.raw)
loins = re.findall(r'.*\n', loins)

In [31]:
for lineIdx in range(len(loins)): 
    loins[lineIdx] = re.sub(r'\n', '', loins[lineIdx])