# NLP - Project 2
## Rinehart Analysis with Word Vectors
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

## Where to put the code
- Place the preprocessing functions/classes in [nlp_libs/books/preprocessing.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/preprocessing.py)
- The custom word embeddings functions/classes (task 1) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The pretrained word embeddings functions/classes (task 2) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The functions/classes (if any) that compare the results (tasks 3, 4, 5) in [nlp_libs/books/compare_statistics.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/compare_statistics.py)
- Any plotting related functions in [nlp_libs/books/plotter.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/plotter.py)

**The code is reloaded automatically. Any class object needs to be reinitialized, though.**

## Config file
The yml/config file is located at: [confs/proj_2.yml](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/confs/proj_2.yml)<br>
To load it run:
```python
config_path='confs/proj_2.yml'
conf = Configuration(config_src=config_path)
# Get the books dictionary
books = conf.get_config('data_loader')['config']['books'] # type = Dict
print(books.keys())
print(books['The_Bat'])
```
To reload the config, re-run the second and third commands (the `Configuration(...)` call and the `get_config(...)` lookup).

## Libraries Overview:
All the libraries are located under *"\<project root>/nlp_libs"*
- nlp_libs/**books**: This project's code (imported later)
- nlp_libs/**configuration**: Class that creates config objects from yml files
- nlp_libs/**fancy_logger**: Logger that can be used instead of prints for text formatting (color, bold, underline etc)

## Project 1 Code
If you need to import anything from Project 1 just run:
```python
import proj1_nlp_libs.books.processed_book as proc
import proj1_nlp_libs.books.book_extractor as extr
import proj1_nlp_libs.books.plotter as pl
```

## For more info check out:
- the **[Project Board](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/projects/1)**
- the **[README](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/blob/main/README.md)**
- and the **[Current Issues](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/issues)**

# ------------------------------------------------------------------

## On Google Colab?
- **If yes, run the two cells and press the two buttons below:**
- Otherwise go to "***Import the base Libraries***"

In [33]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# Clone the repository if you're on Google Colab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis_wordVectors.git
    print("Project cloned.")
       
print("Clone project?")
print("(If you do this you will overwrite local changes to other files, e.g. configs)")
print("Not needed if you're not on Google Colab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will overwrite local changes to other files, e.g. configs)
Not needed if you're not on Google Colab


Button(description='Yes, clone', style=ButtonStyle())

In [34]:
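# Change into the cloned project directory and install the requirements (Google Colab only)
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        # git clone creates a directory named after the repository
        os.chdir('/content/rinehartAnalysis_wordVectors')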
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
        print("\nInstalling Requirements")
        !pip install -r requirements.txt
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Colab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Colab?


Button(description='Yes', style=ButtonStyle())

### To commit and push a Google Colab notebook to GitHub
Click **File > Save a copy in GitHub**

# ------------------------------------------------------------------

# Initializations

## Import the base Libraries

In [35]:
# Imports
%load_ext autoreload
%autoreload 2
from importlib import reload as reload_lib
from typing import *
import os
import re
from pprint import pprint
# Numpy
import numpy as np

# Import preprocessing lib
from nlp_libs.books import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load the YML file

In [36]:
from nlp_libs import Configuration

In [37]:
# The path of configuration and log save path
config_path = "confs/proj_2.yml"
# !cat "$config_path"
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books_conf = conf.get_config('data_loader')['config']['books']
# print(books_conf.keys())
# pprint(books_conf)  # Pretty print the books dict

2021-10-31 18:11:39 Config       INFO     Configuration file loaded successfully from path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/NLP-Project2/Code/confs/proj_2.yml
2021-10-31 18:11:39 Config       INFO     Configuration Tag: proj2


## Setup Logger and Example

In [38]:
log_path = "logs/proj_2.log"
# Load and setup logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)
# Examples
logger.info("Logger Examples:")
logger.nl(num_lines=1) # New lines
logger.warn("Logger Warning underlined", attrs=['underline']) 
# Attrs: bold, dark, underline, blink, reverse, concealed
logger.error("Logger Error in red&yellow", color="yellow", on_color="on_red")
# Colors: on_grey, on_red, on_green, on_yellow, on_blue, on_magenta, on_cyan, on_white

2021-10-31 18:11:40 FancyLogger  INFO     Logger is set. Log file path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/NLP-Project2/Code/logs/proj_2.log
2021-10-31 18:11:40 Notebook     INFO     Logger Examples:

2021-10-31 18:11:40 Notebook     ERROR    Logger Error in red&yellow


# ------------------------------------------------------------------

# Start of Project Code

In [58]:
from nlp_libs import books as books_lib

## Preprocessing

# The Circular Staircase

In [59]:
book_meta = books_conf['The_Circular_Staircase']
book = ProcessedBook(book_meta)

In [78]:
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()
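
The indexing in the cell above implies that `book_meta['protagonists']` is a list of single-key dictionaries whose value is a list of words. The sketch below uses a hypothetical metadata entry (the key and names are illustrative, not the contents of `proj_2.yml`) to show what the `substitution` tuple looks like, and one possible way such a substitution could be applied to a token list. `substitute_tokens` is an assumed helper for illustration; `lemmatize_by_sentence` may handle `word_subs` differently.

```python
from typing import List, Sequence, Tuple

# Hypothetical metadata entry; the real values live in confs/proj_2.yml.
toy_meta = {"protagonists": [{"narrator": ["rachel", "innes"]}]}

# Same indexing as in the notebook cell: first entry, first (only) value.
aliases = list(toy_meta["protagonists"][0].values())[0]
substitution = (aliases, "protagonist")


def substitute_tokens(tokens: Sequence[str],
                      word_subs: Tuple[Sequence[str], str]) -> List[str]:
    """Replace every token found in the alias list with the placeholder."""
    targets, replacement = word_subs
    target_set = set(targets)
    return [replacement if token in target_set else token for token in tokens]


toy_sentence = ["rachel", "hear", "a", "noise"]
print(substitute_tokens(toy_sentence, substitution))
# ['protagonist', 'hear', 'a', 'noise']
```

Collapsing all aliases into one token gives the embedding model a single, more frequent word to learn a vector for, which is presumably why both a substituted and an unsubstituted sentence list are built.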


In [159]:
protagonists_antagonists = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['protagonists'], 
                                            get_all_sub_values_1=True,
                                            keys_2=['antagonists'],
                                            get_all_sub_values_2=True,
                                            ignore_words_with_spaces=True)
antagonists_crime_weapon = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['antagonists'],
                                            get_all_sub_values_1=True,
                                            keys_2=['crime', 'crime_weapon'],
                                            get_all_sub_values_2=False,
                                            ignore_words_with_spaces=True)
antagonists_crime_objects = books_lib.word_embeddings\
                           .get_combinations(conf=book_meta,
                                             keys_1=['antagonists'],
                                             get_all_sub_values_1=True,
                                             keys_2=['crime', 'crime_objects'],
                                             get_all_sub_values_2=False,
                                             ignore_words_with_spaces=True)

print("\nprotagonists_antagonists: ")
pprint(protagonists_antagonists)
print("\nantagonists_crime_weapon: ")
pprint(antagonists_crime_weapon)
print("\nantagonists_crime_objects: ")
pprint(antagonists_crime_objects)


protagonists_antagonists: 
[('jamieson', 'watson'), ('detective', 'watson'), ('winters', 'watson')]

antagonists_crime_weapon: 
[('watson', 'revolver')]

antagonists_crime_objects: 
[('watson', 'staircase'), ('watson', 'floor'), ('watson', 'waistcoat')]
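
The printed pairs are consistent with a Cartesian product of two word lists from the config, with multi-word entries dropped. A minimal sketch of that idea follows; `pair_words` and the toy lists are hypothetical and are not the implementation of `get_combinations`.

```python
from itertools import product
from typing import List, Sequence, Tuple


def pair_words(words_1: Sequence[str],
               words_2: Sequence[str],
               ignore_words_with_spaces: bool = True) -> List[Tuple[str, str]]:
    """Return every (word_1, word_2) pair, optionally skipping multi-word entries."""
    if ignore_words_with_spaces:
        # A word-level embedding model has no vector for "mr jamieson",
        # so entries containing spaces are filtered out before pairing.
        words_1 = [w for w in words_1 if " " not in w]
        words_2 = [w for w in words_2 if " " not in w]
    return list(product(words_1, words_2))


# Toy lists for illustration only.
toy_protagonists = ["jamieson", "detective", "mr jamieson"]
toy_antagonists = ["watson"]
toy_pairs = pair_words(toy_protagonists, toy_antagonists)
print(toy_pairs)
# [('jamieson', 'watson'), ('detective', 'watson')]
```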


In [163]:
# df = books_lib.word_embeddings.calculate_differing_distances(sentences, [['jamieson', 'watson'], 
#                                                             ['revolver', 'watson'], 
#                                                             ['murder', 'watson'], 
#                                                             ['jamieson', 'murder'], 
#                                                             ['jamieson', 'revolver'],
#                                                             ['watson', 'revolver'],
#                                                             ['murder', 'revolver'],
#                                                             ['murder', 'bag'],
#                                                             ['murder', 'rachel'],                                                                  
#                                                             ['jamieson', 'detective']])

protagonists_antagonists_distances = books_lib\
                                     .word_embeddings\
                                     .calculate_differing_distances(sentences, 
                                                                    protagonists_antagonists)
antagonists_crime_weapon_distances = books_lib\
                                     .word_embeddings\
                                     .calculate_differing_distances(sentences, 
                                                                    antagonists_crime_weapon)
antagonists_crime_objects_distances = books_lib\
                                     .word_embeddings\
                                     .calculate_differing_distances(sentences, 
                                                                    antagonists_crime_objects)

KeyError: "Key 'waistcoat' not present"
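
The `KeyError` above suggests that `waistcoat` is missing from the trained model's vocabulary, for example because it never appears in the lemmatized sentences or falls below a minimum-frequency cutoff. One way to avoid the failure is to drop such pairs before computing distances. The sketch below assumes `sentences` is a list of token lists; `filter_pairs_in_vocab` and its `min_count` parameter are hypothetical and not part of `nlp_libs`.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def filter_pairs_in_vocab(sentences: Sequence[Sequence[str]],
                          pairs: Sequence[Pair],
                          min_count: int = 1) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into those fully covered by the vocabulary and those that are not."""
    counts = Counter(token for sentence in sentences for token in sentence)
    vocab = {word for word, count in counts.items() if count >= min_count}
    kept, dropped = [], []
    for word_1, word_2 in pairs:
        if word_1 in vocab and word_2 in vocab:
            kept.append((word_1, word_2))
        else:
            dropped.append((word_1, word_2))
    return kept, dropped


# Toy data for illustration only.
toy_sentences = [["watson", "climb", "the", "staircase"],
                 ["watson", "cross", "the", "floor"]]
toy_pairs = [("watson", "staircase"), ("watson", "floor"), ("watson", "waistcoat")]
kept, dropped = filter_pairs_in_vocab(toy_sentences, toy_pairs)
print(kept)
print(dropped)
```

Returning the dropped pairs as well makes it easy to log which config entries never matched, instead of silently ignoring them.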

In [None]:
display(protagonists_antagonists_distances.sort_values(['cosineSim', 'dotSim']))
display(antagonists_crime_weapon_distances.sort_values(['cosineSim', 'dotSim']))
display(antagonists_crime_objects_distances.sort_values(['cosineSim', 'dotSim']))

In [83]:
df = books_lib.word_embeddings.calculate_differing_distances(sentences, [ 
                                                            ['revolver', 'watson'], 
                                                            ['murder', 'watson'], 
                                                            ['murder', 'rachel'],                                                                  
                                                            ['jamieson', 'detective']])

In [84]:
df.sort_values(['cosineSim', 'dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 10 | murder | rachel | 50 | 3 | 0.989254 | 1.620097 |
| 2 | murder | rachel | 50 | 2 | 0.989922 | 1.625664 |
| 14 | murder | rachel | 50 | 10 | 0.990176 | 1.205099 |
| 6 | murder | rachel | 50 | 5 | 0.991233 | 1.596666 |
| 26 | murder | rachel | 100 | 3 | 0.994179 | 1.614592 |
| ... | ... | ... | ... | ... | ... | ... |
| 39 | jamieson | detective | 200 | 5 | 0.999711 | 9.922542 |
| 47 | jamieson | detective | 200 | 10 | 0.999745 | 11.988062 |
| 59 | jamieson | detective | 300 | 3 | 0.999805 | 8.208540 |
| 63 | jamieson | detective | 300 | 10 | 0.999826 | 12.204557 |
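
The table reports two measures per word pair. Cosine similarity divides the dot product by both vector norms, so it depends only on direction, while the raw dot product also grows with vector magnitude. That is one reason `dotSim` can range widely while `cosineSim` stays close to 1. A minimal standard-library sketch of both measures, with made-up vectors (how `calculate_differing_distances` computes them internally is not shown in this notebook):

```python
import math
from typing import Sequence


def dot_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product normalized by the Euclidean norms of both vectors."""
    norm_product = math.sqrt(dot_similarity(u, u)) * math.sqrt(dot_similarity(v, v))
    if norm_product == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_similarity(u, v) / norm_product


# Toy vectors: v points in the same direction as u but is twice as long.
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]
print(dot_similarity(u, v))     # 18.0
print(cosine_similarity(u, v))  # 1.0 (up to floating-point rounding)
```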


# The Man in Lower Ten

In [85]:
book_meta = books_conf['The_Man_in_Lower_Ten']
book = ProcessedBook(book_meta)  # assign to `book` (not `books`) so the cells below use this book
lemmas = book.lemmas

In [86]:
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()

In [None]:
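# NOTE: these word pairs were written for The Circular Staircase; swap in this book's characters and objects before running
df = books_lib.word_embeddings.calculate_differing_distances(sentences, [['jamieson', 'watson'], 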
                                                            ['revolver', 'watson'], 
                                                            ['murder', 'watson'], 
                                                            ['jamieson', 'murder'], 
                                                            ['jamieson', 'revolver'],
                                                            ['watson', 'revolver'],
                                                            ['murder', 'revolver'],
                                                            ['murder', 'bag'],
                                                            ['murder', 'rachel'],                                                                  
                                                            ['jamieson', 'detective']])

# The After House

In [None]:
book = ProcessedBook(books_conf['The_After_House'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

# The Window at the White Cat

In [None]:
book = ProcessedBook(books_conf['The_Window_at_the_White_Cat'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

# The Bat

In [None]:
book = ProcessedBook(books_conf['The_Bat'])
lemmas = book.lemmas

In [43]:
' '.join(lemmas[:100])

"-PRON- have get to get -PRON- boy -- get -PRON- or bust say a tired police chief pound a heavy fist on a table the detective -PRON- bellow the word at look at the floor -PRON- have do -PRON- good and fail failure mean resignation for the police chief return to the hate work of pound the pavement for -PRON- -- -PRON- know -PRON- and know -PRON- could summon no gesture of bravado to answer -PRON- chief 's gunman thug hi jacker loft robber murderer -PRON- could get -PRON- all in time -- but -PRON- could not get the man"

## Custom Word Embeddings

In [44]:
# Import word_embeddings lib
import nlp_libs.books.word_embeddings as we

In [45]:
# custom_embeddings = we.WordEmbeddingsCustom()

## Pretrained Word Embeddings

In [46]:
# pretrained_embeddings = we.WordEmbeddingsPretrained()

## Compare Vector distances and report similarities using Custom Embeddings

In [47]:
# Import compare_statistics lib
import nlp_libs.books.compare_statistics as cs

In [48]:
# cs.my_custom_embeddings_compare_function()

## Compare Vector distances and report similarities using Pretrained Embeddings

In [49]:
# cs.my_pretrained_embeddings_compare_function()

## Extra Analysis? Plots?

In [None]:
# Too much work