# NLP - Project 2
## Rinehart Analysis with Word Vectors
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

## Where to put the code
- Place the preprocessing functions/classes in [nlp_libs/books/preprocessing.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/preprocessing.py)
- The custom word embeddings functions/classes (task 1) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The pretrained word embeddings functions/classes (task 2) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The functions/classes (if any) that compare the results (tasks 3, 4, 5) in [nlp_libs/books/compare_statistics.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/compare_statistics.py)
- Any plotting related functions in [nlp_libs/books/plotter.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/plotter.py)

**The code is reloaded automatically. Any class object needs to be reinitialized, though.**

## Config file
The yml/config file is located at: [confs/proj_2.yml](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/confs/proj_2.yml)<br>
To load it run:
```python
config_path='confs/proj_2.yml'
conf = Configuration(config_src=config_path)
# Get the books dictionary
books = conf.get_config('data_loader')['config']['books'] # type = Dict
print(books.keys())
print(books['The_Bat'])
```
To reload the config, rerun only the second and third statements (`conf = ...` and `books = ...`).

## Libraries Overview:
All the libraries are located under *"\<project root>/nlp_libs"*
- nlp_libs/**books**: This project's code (imported later)
- nlp_libs/**configuration**: Class that creates config objects from yml files
- nlp_libs/**fancy_logger**: Logger that can be used instead of prints for text formatting (color, bold, underline etc)

## Project 1 Code
If you need to import anything from Project 1 just run:
```python
import proj1_nlp_libs.books.processed_book as proc
import proj1_nlp_libs.books.book_extractor as extr
import proj1_nlp_libs.books.plotter as pl
```

## For more info check out:
- the **[Project Board](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/projects/1)**
- the **[README](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/blob/main/README.md)**
- and the **[Current Issues](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/issues)**

# ------------------------------------------------------------------

## On Google Colab?
- **If yes, run the two cells and press the two buttons below:**
- Otherwise go to "***Import the base Libraries***"

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# Clone the repository if you're in Google Colab
def clone_project(_button=None):
    # ipywidgets passes the clicked Button instance as the first argument
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis_wordVectors.git
    print("Project cloned.")

print("Clone project?")
print("(If you do this you will overwrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Colab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will overwrite local changes on other files e.g. configs)
Not needed if you're not on Google Colab


Button(description='Yes, clone', style=ButtonStyle())

In [2]:
# Move into the cloned project and install its requirements (Google Colab only)
def change_dir(_button=None):
    # ipywidgets passes the clicked Button instance as the first argument
    try:
        print("Changing dir..")
        # git clone creates a folder named after the repository
        os.chdir('/content/rinehartAnalysis_wordVectors')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
        print("\nInstalling Requirements")
        !pip install -r requirements.txt
    except Exception:
        print("Error: Project not cloned")

print("Are you on Google Colab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Colab?


Button(description='Yes', style=ButtonStyle())

### To commit and push the Google Colab notebook to GitHub
Click **File > Save a copy in GitHub**

# ------------------------------------------------------------------

# Initializations

## Import the base Libraries

In [3]:
# Imports
%load_ext autoreload
%autoreload 2
from importlib import reload as reload_lib
from typing import *
import os
import re
from pprint import pprint
# Numpy
import numpy as np

# Import preprocessing lib
from nlp_libs.books import *

## Load the YML file

In [4]:
from nlp_libs import Configuration

In [5]:
# The path of configuration and log save path
config_path = "confs/proj_2.yml"
# !cat "$config_path"
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books = conf.get_config('data_loader')['config']['books']
# print(books.keys())
# pprint(books)  # Pretty print the books dict

2021-10-27 22:52:11 Config       INFO     Configuration file loaded successfully from path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis_wordVectors/confs/proj_2.yml
2021-10-27 22:52:11 Config       INFO     Configuration Tag: proj2


## Setup Logger and Example

In [6]:
log_path = "logs/proj_2.log"
# Load and setup logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)
# Examples
logger.info("Logger Examples:")
logger.nl(num_lines=1) # New lines
logger.warn("Logger Warning underlined", attrs=['underline']) 
# Attrs: bold, dark, underline, blink, reverse, concealed
logger.error("Logger Error in red&yellow", color="yellow", on_color="on_red")
# Colors: on_grey, on_red, on_green, on_yellow, on_blue, on_magenta, on_cyan, on_white

2021-10-27 22:52:11 FancyLogger  INFO     Logger is set. Log file path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis_wordVectors/logs/proj_2.log
2021-10-27 22:52:11 Notebook     INFO     Logger Examples:

2021-10-27 22:52:11 Notebook     ERROR    Logger Error in red&yellow


# ------------------------------------------------------------------

# Start of Project Code

In [7]:
from nlp_libs.books import * 

## Preprocessing

# The Circular Staircase

In [8]:
books['The_Circular_Staircase']

{'url': 'https://www.gutenberg.org/files/434/434-0.txt',
 'protagonists': [{'Mr. Jamieson': ['jamieson', 'detective', 'winters']}],
 'antagonists': [{'Anne Watson': ['anne watson', 'watson']}],
 'crime': {'crime_weapon': ['revolver'], 'crime_objects': ['tmp']}}
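The config entry above lists aliases per character, which is where hand-written word pairs such as `['jamieson', 'watson']` come from. As a minimal sketch (not part of `nlp_libs`; the helper names `aliases` and `candidate_pairs` are hypothetical), the pairs could be derived from the entry instead of typed by hand. It assumes multi-word aliases such as `'anne watson'` should be skipped because the embeddings are trained on single tokens.

```python
from itertools import product

# Entry copied from the config output shown above
book_conf = {
    'url': 'https://www.gutenberg.org/files/434/434-0.txt',
    'protagonists': [{'Mr. Jamieson': ['jamieson', 'detective', 'winters']}],
    'antagonists': [{'Anne Watson': ['anne watson', 'watson']}],
    'crime': {'crime_weapon': ['revolver'], 'crime_objects': ['tmp']},
}


def aliases(entries):
    """Flatten a list of {name: [alias, ...]} dicts, keeping single-token aliases."""
    return [alias
            for entry in entries
            for alias_list in entry.values()
            for alias in alias_list
            if ' ' not in alias]  # assumption: one vector per token, so skip phrases


def candidate_pairs(conf):
    """Hypothetical helper: pair protagonists with antagonists, antagonists with weapons."""
    protagonists = aliases(conf['protagonists'])
    antagonists = aliases(conf['antagonists'])
    weapons = conf['crime']['crime_weapon']
    pairs = list(product(protagonists, antagonists))
    pairs += list(product(antagonists, weapons))
    return [list(pair) for pair in pairs]


pairs = candidate_pairs(book_conf)
print(pairs)
```

The resulting list of two-element lists has the same shape as the pairs argument passed to `word_embeddings.calculate_differing_distances` in the cells below.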

In [9]:
book = ProcessedBook(books['The_Circular_Staircase'])
sentences = book.lemmatize_by_sentence()

In [12]:
df = word_embeddings.calculate_differing_distances(sentences, [['jamieson', 'watson'], 
                                                            ['revolver', 'watson'], 
                                                            ['murder', 'watson'], 
                                                            ['jamieson', 'murder'], 
                                                            ['jamieson', 'revolver'],
                                                            ['watson', 'revolver'],
                                                            ['murder', 'revolver'],
                                                            ['murder', 'bag'],
                                                            ['murder', 'rachel'],                                                                  
                                                            ['jamieson', 'detective']])

In [14]:
df.sort_values(['cosineSim', 'dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 38 | murder | rachel | 50 | 10 | 0.989302 | 1.401629 |
| 27 | murder | bag | 50 | 3 | 0.990234 | 0.933485 |
| 8 | murder | rachel | 50 | 2 | 0.990459 | 1.881804 |
| 18 | murder | rachel | 50 | 5 | 0.990615 | 1.557350 |
| 17 | murder | bag | 50 | 5 | 0.990925 | 0.962359 |
| ... | ... | ... | ... | ... | ... | ... |
| 154 | jamieson | revolver | 300 | 10 | 0.999670 | 4.781205 |
| 139 | jamieson | detective | 300 | 5 | 0.999700 | 6.964769 |
| 149 | jamieson | detective | 300 | 3 | 0.999705 | 5.583714 |
| 150 | jamieson | watson | 300 | 10 | 0.999722 | 6.081312 |
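In the table above, `cosineSim` is close to 1 for every pair while `dotSim` varies widely. Assuming these columns are the usual cosine similarity and dot product (the implementation in `word_embeddings.py` is not shown in this notebook), the sketch below illustrates the difference on toy vectors: cosine similarity ignores vector length, while the dot product grows with it. It uses only the standard library so it runs anywhere.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [3.0, -1.0, 0.5]   # different direction

print("a vs b:", cosine_similarity(a, b), dot(a, b))
print("a vs c:", cosine_similarity(a, c), dot(a, c))
```

Two pairs can therefore share almost the same cosine similarity and still have very different dot products, which is one reason to sort by both columns.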


In [25]:
# Sentences have different lengths, so build an object array explicitly
arr = np.array(sentences, dtype=object)


In [28]:
import pydash

In [55]:
flatarr = pydash.flatten(sentences)

In [56]:
for sentence in sentences: 
    for word in sentence: 
        if word == 'birthday': 
            print("BIRTHDAY")

BIRTHDAY


In [51]:
len(np.unique(flatarr))

4549

In [52]:
import spacy

en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

print(len(stopwords))

326


In [53]:
newwords = list(filter(lambda x: x not in stopwords, np.unique(flatarr)))


In [57]:
# Sample five random word pairs (no repeated word within a pair) as a baseline
random_pairs = [np.random.choice(newwords, 2, replace=False) for _ in range(5)]
df = word_embeddings.calculate_differing_distances(sentences, random_pairs)
df.sort_values(['cosineSim', 'dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 11 | splittingly | complaint | 50 | 3 | 0.268517 | 0.003491 |
| 19 | examine | ejaculation | 50 | 10 | 0.306649 | 0.043389 |
| 3 | unsophistication | somewhar | 50 | 2 | 0.341892 | 0.003507 |
| 31 | splittingly | complaint | 100 | 3 | 0.383376 | 0.003276 |
| 1 | splittingly | complaint | 50 | 2 | 0.458518 | 0.006561 |
| ... | ... | ... | ... | ... | ... | ... |
| 62 | crimp | nausea | 300 | 2 | 0.951243 | 0.021188 |
| 77 | crimp | nausea | 300 | 10 | 0.955773 | 0.022533 |
| 70 | scream | frantically | 300 | 3 | 0.965854 | 0.127180 |
| 40 | scream | frantically | 200 | 2 | 0.975665 | 0.152570 |


In [58]:
df.sort_values(['dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 31 | splittingly | complaint | 100 | 3 | 0.383376 | 0.003276 |
| 11 | splittingly | complaint | 50 | 3 | 0.268517 | 0.003491 |
| 3 | unsophistication | somewhar | 50 | 2 | 0.341892 | 0.003507 |
| 23 | unsophistication | somewhar | 100 | 2 | 0.606022 | 0.005213 |
| 71 | splittingly | complaint | 300 | 3 | 0.771266 | 0.005722 |
| ... | ... | ... | ... | ... | ... | ... |
| 74 | examine | ejaculation | 300 | 3 | 0.940620 | 0.141395 |
| 40 | scream | frantically | 200 | 2 | 0.975665 | 0.152570 |
| 79 | examine | ejaculation | 300 | 10 | 0.950440 | 0.158041 |
| 0 | scream | frantically | 50 | 2 | 0.913713 | 0.171738 |
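The random pairs above act as a baseline for the hand-picked character and crime pairs, but `np.random.choice` without a seed gives different pairs on every run. Below is a minimal standard-library sketch of the same idea with a fixed seed; the function name `random_word_pairs` and the toy data are made up for illustration and are not part of `nlp_libs`.

```python
import random


def random_word_pairs(sentences, stopwords, n_pairs=5, seed=0):
    """Sample reproducible pairs of distinct non-stopword tokens from tokenized sentences."""
    # Unique vocabulary without stopwords, sorted so the sampling is deterministic
    vocab = sorted({word for sentence in sentences for word in sentence} - set(stopwords))
    rng = random.Random(seed)
    # rng.sample draws without replacement, so the two words in a pair always differ
    return [rng.sample(vocab, 2) for _ in range(n_pairs)]


# Toy stand-ins for the lemmatized sentences and the spaCy stopword set
toy_sentences = [
    ['the', 'detective', 'find', 'the', 'revolver'],
    ['a', 'scream', 'in', 'the', 'night'],
    ['the', 'bag', 'be', 'empty'],
]
toy_stopwords = {'the', 'a', 'in', 'be'}

pairs = random_word_pairs(toy_sentences, toy_stopwords, n_pairs=3, seed=42)
print(pairs)
```

With a fixed seed, the baseline rows can be regenerated and compared across runs with different vector and window sizes.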


In [60]:
import time

In [None]:
start = time.time()

for i in np.arange(9_000_000): 
    thing = np.arange(10).dot(np.arange(10))
    
stop = time.time()

In [None]:
stop-start
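The two cells above time nine million small dot products with `time.time()`. As a hedged alternative, the sketch below wraps the same pattern in a helper (the name `time_call` is hypothetical) that uses `time.perf_counter()`, a clock intended for measuring short intervals, and reports the average time per call. A pure-Python dot product stands in for `np.arange(10).dot(np.arange(10))` so the example has no dependencies.

```python
import time


def time_call(fn, repeats=10_000):
    """Return the average wall-clock seconds per call of fn over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


u = list(range(10))
v = list(range(10))

# Stand-in workload: a length-10 dot product in pure Python
per_call = time_call(lambda: sum(a * b for a, b in zip(u, v)))
print(f"{per_call * 1e6:.3f} microseconds per dot product")
```

Multiplying the per-call time by the number of word pairs and parameter combinations gives a rough estimate of how much of the total runtime the similarity computation itself accounts for.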

# The Man in Lower Ten

In [None]:
book = ProcessedBook(books['The_Man_in_Lower_Ten'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

In [None]:
book.clean_lines

# The After House

In [None]:
book = ProcessedBook(books['The_After_House'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

# The Window at the White Cat

In [None]:
book = ProcessedBook(books['The_Window_at_the_White_Cat'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

# The Bat

In [None]:
book = ProcessedBook(books['The_Bat'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

## Custom Word Embeddings

In [None]:
# Import word_embeddings lib
import nlp_libs.books.word_embeddings as we

In [None]:
# custom_embeddings = we.WordEmbeddingsCustom()

## Pretrained Word Embeddings

In [None]:
# pretrained_embeddings = we.WordEmbeddingsPretrained()

## Compare Vector distances and report similarities using Custom Embeddings

In [None]:
# Import compare_statistics lib
import nlp_libs.books.compare_statistics as cs

In [None]:
# cs.my_custom_embeddings_compare_function()

## Compare Vector distances and report similarities using Pretrained Embeddings

In [None]:
# cs.my_pretrained_embeddings_compare_function()

## Extra Analysis? Plots?

In [None]:
# Too much work

In [None]:
# NOTE: lemmatize the stop words as well, then check the results again