# NLP - Project 2
## Rinehart Analysis with Word Vectors
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

## Where to put the code
- Place the preprocessing functions/classes in [nlp_libs/books/preprocessing.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/preprocessing.py)
- The custom word embeddings functions/classes (task 1) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The pretrained word embeddings functions/classes (task 2) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The functions/classes (if any) that compare the results (tasks 3, 4, 5) in [nlp_libs/books/compare_statistics.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/compare_statistics.py)
- Any plotting related functions in [nlp_libs/books/plotter.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/plotter.py)

**The code is reloaded automatically. Any class object needs to be reinitialized, though.**

## Config file
The yml/config file is located at: [confs/proj_2.yml](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/confs/proj_2.yml)<br>
To load it run:
```python
config_path='confs/proj_2.yml'
conf = Configuration(config_src=config_path)
# Get the books dictionary
books = conf.get_config('data_loader')['config']['books'] # type = Dict
print(books.keys())
print(books['The_Bat'])
```
To reload the config, rerun the second and third statements (`conf = ...` and `books = ...`).

## Libraries Overview:
All the libraries are located under *"\<project root>/nlp_libs"*
- nlp_libs/**books**: This project's code (imported later)
- nlp_libs/**configuration**: Class that creates config objects from yml files
- nlp_libs/**fancy_logger**: Logger that can be used instead of prints for text formatting (color, bold, underline etc)

## Project 1 Code
If you need to import anything from Project 1 just run:
```python
import proj1_nlp_libs.books.processed_book as proc
import proj1_nlp_libs.books.book_extractor as extr
import proj1_nlp_libs.books.plotter as pl
```

## For more info check out:
- the **[Project Board](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/projects/1)**
- the **[README](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/blob/main/README.md)**
- and the **[Current Issues](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/issues)**

# ------------------------------------------------------------------

## On Google Colab?
- **If yes, run the two cells and press the two buttons below:**
- Otherwise go to "***Import the base Libraries***"

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# Clone the repository if you're in Google Colab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis_wordVectors.git
    print("Project cloned.")
       
print("Clone project?")
print("(If you do this you will overwrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Colab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will overwrite local changes on other files e.g. configs)
Not needed if you're not on Google Colab


Button(description='Yes, clone', style=ButtonStyle())

In [2]:
# Change into the cloned repository and install its requirements (Google Colab only)
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        # git clone creates a folder named after the repository
        os.chdir('/content/rinehartAnalysis_wordVectors')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
        print("\nInstalling Requirements")
        !pip install -r requirements.txt
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Colab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Colab?


Button(description='Yes', style=ButtonStyle())

### To commit and push the Google Colab notebook to GitHub
Click **File > Save a copy in GitHub**

# ------------------------------------------------------------------

# Initializations

## Import the base Libraries

In [3]:
# Imports
%load_ext autoreload
%autoreload 2
from importlib import reload as reload_lib
from typing import *
import os
import re
from pprint import pprint
# Numpy and pandas (pd.read_pickle is used further down)
import numpy as np
import pandas as pd

# Import preprocessing lib
from nlp_libs.books import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load the YML file

In [4]:
from nlp_libs import Configuration

In [5]:
# The path of configuration and log save path
config_path = "confs/proj_2.yml"
# !cat "$config_path"
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books_conf = conf.get_config('data_loader')['config']['books']
# print(books.keys())
# pprint(books)  # Pretty print the books dict

2021-10-31 22:44:28 Config       INFO     Configuration file loaded successfully from path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/NLP-Project2/Code/confs/proj_2.yml
2021-10-31 22:44:28 Config       INFO     Configuration Tag: proj2


## Setup Logger and Example

In [6]:
log_path = "logs/proj_2.log"
# Load and setup logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)
# Examples
logger.info("Logger Examples:")
logger.nl(num_lines=1) # New lines
logger.warn("Logger Warning underlined", attrs=['underline']) 
# Attrs:  bold, dark, underline, blink, reverse, concealed
logger.error("Logger Error in red&yellow", color="yellow", on_color="on_red")
# Colors: on_grey, on_red, on_green, on_yellow, on_blue, on_magenta, on_cyan, on_white

2021-10-31 22:44:28 FancyLogger  INFO     Logger is set. Log file path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/NLP-Project2/Code/logs/proj_2.log
2021-10-31 22:44:28 Notebook     INFO     Logger Examples:

2021-10-31 22:44:28 Notebook     ERROR    Logger Error in red&yellow


# ------------------------------------------------------------------

# Start of Project Code

In [7]:
from nlp_libs import books as books_lib

## Preprocessing

# The Circular Staircase

In [8]:
# Load conf
book_meta = books_conf['The_Circular_Staircase']
book = ProcessedBook(book_meta)

In [9]:
# Lemmatize sentences
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()
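
The indexing in the cell above is easier to follow with a concrete structure in mind. The sketch below assumes that `book_meta['protagonists']` is a list of single-key dictionaries, each mapping a character name to a list of aliases. The name and aliases used here are hypothetical, and the real layout in `confs/proj_2.yml` may differ.

```python
# Hypothetical excerpt of one book's config entry (assumed layout, not the real file)
example_meta = {
    'protagonists': [
        {'Example Detective': ['detective', 'inspector']},
    ],
}

# First protagonist entry -> its only value -> the list of aliases
first_entry = example_meta['protagonists'][0]
example_subs = list(first_entry.values())[0]

# Pair the aliases with the placeholder token, mirroring the cell above
example_substitution = (example_subs, 'protagonist')
print(example_substitution)
```

Under this assumption, every alias in the list would be replaced by the single token `protagonist` before training, so the embeddings see one word per character.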

In [10]:
# Generate word combinations
protagonists_antagonists = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['protagonists'], 
                                            get_all_sub_values_1=True,
                                            keys_2=['antagonists'],
                                            get_all_sub_values_2=True,
                                            ignore_words_with_spaces=True)
antagonists_crime_weapon = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['antagonists'],
                                            get_all_sub_values_1=True,
                                            keys_2=['crime', 'crime_weapon'],
                                            get_all_sub_values_2=False,
                                            ignore_words_with_spaces=True)
antagonists_crime_objects = books_lib.word_embeddings\
                           .get_combinations(conf=book_meta,
                                             keys_1=['antagonists'],
                                             get_all_sub_values_1=True,
                                             keys_2=['crime', 'crime_objects'],
                                             get_all_sub_values_2=False,
                                             ignore_words_with_spaces=True)

print("\nprotagonists_antagonists: ")
pprint(protagonists_antagonists)
print("\nantagonists_crime_weapon: ")
pprint(antagonists_crime_weapon)
print("\nantagonists_crime_objects: ")
pprint(antagonists_crime_objects)


protagonists_antagonists: 
[('jamieson', 'watson'), ('detective', 'watson'), ('winters', 'watson')]

antagonists_crime_weapon: 
[('watson', 'revolver')]

antagonists_crime_objects: 
[('watson', 'staircase'), ('watson', 'floor'), ('watson', 'waistcoat')]
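
The pairs printed above look like a Cartesian product of two word lists with multi-word entries dropped. The following is a minimal sketch of that idea, not the actual `get_combinations` implementation: `pair_words` is a hypothetical helper, the lowercasing is an assumption, and `'Mr Winters'` is an invented multi-word entry used only to show the filter.

```python
from itertools import product
from typing import Iterable, List, Tuple


def pair_words(words_1: Iterable[str], words_2: Iterable[str],
               ignore_words_with_spaces: bool = True) -> List[Tuple[str, str]]:
    """Return every (w1, w2) pair, optionally skipping multi-word entries."""
    def keep(word: str) -> bool:
        return not (ignore_words_with_spaces and ' ' in word)

    firsts = [w.lower() for w in words_1 if keep(w)]
    seconds = [w.lower() for w in words_2 if keep(w)]
    return list(product(firsts, seconds))


example_pairs = pair_words(['Jamieson', 'detective', 'Mr Winters'], ['watson'])
print(example_pairs)  # [('jamieson', 'watson'), ('detective', 'watson')]
```

Skipping entries that contain a space makes sense when the embeddings are trained on single tokens, since a phrase such as a full name would have no vector of its own.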


In [11]:
# Calculate distances with custom word embeddings
protag_antag_dists = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(sentences=sentences, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(sentences=sentences, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(sentences=sentences, 
                                                       word_pairs=antagonists_crime_objects)
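
The result frames carry a `cosineSim` and a `dotSim` column. Assuming these hold the standard cosine similarity and dot product of the two word vectors (the implementation of `calculate_differing_distances` is not shown here), the two measures can be sketched with the standard library alone. The function names and the toy vectors are illustrative.

```python
import math
from typing import Sequence


def dot_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product; grows with the length of the vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of the length-normalized vectors, always in [-1, 1]."""
    norm_u = math.sqrt(dot_similarity(u, u))
    norm_v = math.sqrt(dot_similarity(v, v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_similarity(u, v) / (norm_u * norm_v)


u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]  # same direction as u, twice as long
print(dot_similarity(u, v), cosine_similarity(u, v))
```

Because the dot product depends on vector length while the cosine does not, the two columns need not rank the same rows in the same order.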

In [34]:
# Save the results
protag_antag_dists.to_pickle(f"data{os.sep}The_Circular_Staircase__protag_antag_dists.pkl")
antag_crime_weap_dists.to_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_weap_dists.pkl")
antag_crime_obj_dists.to_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_obj_dists.pkl")
# To load them
protag_antag_dists = pd.read_pickle(f"data{os.sep}The_Circular_Staircase__protag_antag_dists.pkl")
antag_crime_weap_dists = pd.read_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_weap_dists.pkl")
antag_crime_obj_dists = pd.read_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_obj_dists.pkl")
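
The pickle file names follow one pattern: book key, result name, and an optional `__PRETRAINED` suffix. A small helper can build them in one place so the save and load calls cannot drift apart. This is only a sketch; `result_path` and its parameters are hypothetical and not part of `nlp_libs`.

```python
from pathlib import Path


def result_path(book_key: str, result_name: str, pretrained: bool = False,
                data_dir: str = "data") -> Path:
    """Build '<data_dir>/<book_key>__<result_name>[__PRETRAINED].pkl'."""
    suffix = "__PRETRAINED" if pretrained else ""
    return Path(data_dir) / f"{book_key}__{result_name}{suffix}.pkl"


custom_path = result_path("The_Circular_Staircase", "protag_antag_dists")
pretrained_path = result_path("The_Circular_Staircase", "protag_antag_dists",
                              pretrained=True)
print(custom_path.name)
print(pretrained_path.name)
```

The returned path can be passed directly to `DataFrame.to_pickle` and `pd.read_pickle`, both of which accept path objects.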

In [35]:
display(protag_antag_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists.sort_values(['cosineSim', 'dotSim']).head())

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 3 | jamieson | watson | | | 0.177056 | 5.408132 |
| 2 | jamieson | watson | | | 0.246759 | 6.121186 |
| 1 | jamieson | watson | | | 0.378605 | 6.556598 |
| 0 | jamieson | watson | | | 0.434102 | 5.792758 |


| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 3 | watson | revolver | | | 0.076764 | 2.823694 |
| 2 | watson | revolver | | | 0.080159 | 2.572687 |
| 1 | watson | revolver | | | 0.092983 | 2.159217 |
| 0 | watson | revolver | | | 0.189547 | 3.656829 |


| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 3 | watson | staircase | | | -0.061957 | -2.265938 |
| 2 | watson | staircase | | | -0.007154 | -0.225405 |
| 1 | watson | staircase | | | 0.004739 | 0.112529 |
| 0 | watson | staircase | | | 0.040019 | 0.829024 |


In [36]:
# Calculate distances for pretrained embeddings
model_names = books_lib.word_embeddings.get_model_names()
model_names = [mn for mn in model_names if 'glove-wiki-gigaword' in mn]
print(model_names)

protag_antag_dists_pre = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(model_names=model_names, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists_pre = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(model_names=model_names, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists_pre = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(model_names=model_names, 
                                                       word_pairs=antagonists_crime_objects)

['glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300']


KeyboardInterrupt: 
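
The GloVe model names printed above end in the vector size, so the size can be read from the name when ordering or labeling results. A minimal sketch, with `embedding_dim` as a hypothetical helper that only works for names ending in `-<integer>`:

```python
def embedding_dim(model_name: str) -> int:
    """Read the trailing integer of names such as 'glove-wiki-gigaword-50'."""
    return int(model_name.rsplit('-', 1)[-1])


example_names = ['glove-wiki-gigaword-300', 'glove-wiki-gigaword-50',
                 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-100']
# Sort numerically by dimension rather than alphabetically by name
ordered_names = sorted(example_names, key=embedding_dim)
print(ordered_names)
```

Sorting by the parsed integer avoids the alphabetical order, which would place `-100` before `-50`.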

In [None]:
# Save the results
protag_antag_dists_pre.to_pickle(f"data{os.sep}The_Circular_Staircase__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre.to_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre.to_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_obj_dists__PRETRAINED.pkl")
# To load them
protag_antag_dists_pre = pd.read_pickle(f"data{os.sep}The_Circular_Staircase__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre = pd.read_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre = pd.read_pickle(f"data{os.sep}The_Circular_Staircase__antag_crime_obj_dists__PRETRAINED.pkl")


In [21]:
display(protag_antag_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists_pre.sort_values(['cosineSim', 'dotSim']).head())

| | word1 | word2 | model | cosineSim | dotSim | model_name |
|---|---|---|---|---|---|---|
| 3 | jamieson | watson | | 0.177056 | 5.408132 | glove-wiki-gigaword-300 |
| 2 | jamieson | watson | | 0.246759 | 6.121186 | glove-wiki-gigaword-200 |
| 1 | jamieson | watson | | 0.378605 | 6.556598 | glove-wiki-gigaword-100 |
| 0 | jamieson | watson | | 0.434102 | 5.792758 | glove-wiki-gigaword-50 |


| | word1 | word2 | model | cosineSim | dotSim | model_name |
|---|---|---|---|---|---|---|
| 3 | watson | revolver | | 0.076764 | 2.823694 | glove-wiki-gigaword-300 |
| 2 | watson | revolver | | 0.080159 | 2.572687 | glove-wiki-gigaword-200 |
| 1 | watson | revolver | | 0.092983 | 2.159217 | glove-wiki-gigaword-100 |
| 0 | watson | revolver | | 0.189547 | 3.656829 | glove-wiki-gigaword-50 |


| | word1 | word2 | model | cosineSim | dotSim | model_name |
|---|---|---|---|---|---|---|
| 3 | watson | staircase | | -0.061957 | -2.265938 | glove-wiki-gigaword-300 |
| 2 | watson | staircase | | -0.007154 | -0.225405 | glove-wiki-gigaword-200 |
| 1 | watson | staircase | | 0.004739 | 0.112529 | glove-wiki-gigaword-100 |
| 0 | watson | staircase | | 0.040019 | 0.829024 | glove-wiki-gigaword-50 |


# The Man in Lower Ten

In [23]:
# Load conf (refresh book_meta so the later cells use this book's characters)
book_meta = books_conf['The_Man_in_Lower_Ten']
book = ProcessedBook(book_meta)
lemmas = book.lemmas

In [24]:
# Lemmatize sentences
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()
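
The same sequence of cells is repeated for every book, which makes it easy to leave `book` or `book_meta` pointing at the previous title. One way to keep them in sync is to loop over the config and pass each book's own metadata to a single analysis function. The sketch below is only an illustration of that pattern: `run_for_each_book`, the fake config, and the stand-in analysis step are all hypothetical.

```python
from typing import Any, Callable, Dict


def run_for_each_book(books_conf: Dict[str, Dict[str, Any]],
                      analyze: Callable[[str, Dict[str, Any]], Any]) -> Dict[str, Any]:
    """Call analyze(book_key, book_meta) once per book and collect the results."""
    results = {}
    for book_key, book_meta in books_conf.items():
        # Each call receives the metadata that belongs to this book only
        results[book_key] = analyze(book_key, book_meta)
    return results


# Hypothetical stand-ins for the real config and analysis step
fake_conf = {
    'Book_A': {'antagonists': ['x']},
    'Book_B': {'antagonists': ['y', 'z']},
}
antagonist_counts = run_for_each_book(
    fake_conf, lambda key, meta: len(meta['antagonists']))
print(antagonist_counts)  # {'Book_A': 1, 'Book_B': 2}
```

In the notebook, the `analyze` callable would wrap the lemmatization, pair generation, and distance steps shown in the cells for each book.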

In [25]:
# Generate word combinations
protagonists_antagonists = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['protagonists'], 
                                            get_all_sub_values_1=True,
                                            keys_2=['antagonists'],
                                            get_all_sub_values_2=True,
                                            ignore_words_with_spaces=True)
antagonists_crime_weapon = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['antagonists'],
                                            get_all_sub_values_1=True,
                                            keys_2=['crime', 'crime_weapon'],
                                            get_all_sub_values_2=False,
                                            ignore_words_with_spaces=True)
antagonists_crime_objects = books_lib.word_embeddings\
                           .get_combinations(conf=book_meta,
                                             keys_1=['antagonists'],
                                             get_all_sub_values_1=True,
                                             keys_2=['crime', 'crime_objects'],
                                             get_all_sub_values_2=False,
                                             ignore_words_with_spaces=True)

print("\nprotagonists_antagonists: ")
pprint(protagonists_antagonists)
print("\nantagonists_crime_weapon: ")
pprint(antagonists_crime_weapon)
print("\nantagonists_crime_objects: ")
pprint(antagonists_crime_objects)




In [None]:
# Calculate distances with custom word embeddings
protag_antag_dists = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(sentences=sentences, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(sentences=sentences, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(sentences=sentences, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists.to_pickle(f"data{os.sep}The_Man_in_Lower_Ten__protag_antag_dists.pkl")
antag_crime_weap_dists.to_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_weap_dists.pkl")
antag_crime_obj_dists.to_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_obj_dists.pkl")
# To load them
protag_antag_dists = pd.read_pickle(f"data{os.sep}The_Man_in_Lower_Ten__protag_antag_dists.pkl")
antag_crime_weap_dists = pd.read_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_weap_dists.pkl")
antag_crime_obj_dists = pd.read_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_obj_dists.pkl")

In [None]:
display(protag_antag_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists.sort_values(['cosineSim', 'dotSim']).head())

In [None]:
# Calculate distances for pretrained embeddings
model_names = books_lib.word_embeddings.get_model_names()
model_names = [mn for mn in model_names if 'glove-wiki-gigaword' in mn]
print(model_names)

protag_antag_dists_pre = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(model_names=model_names, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists_pre = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(model_names=model_names, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists_pre = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(model_names=model_names, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists_pre.to_pickle(f"data{os.sep}The_Man_in_Lower_Ten__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre.to_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre.to_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_obj_dists__PRETRAINED.pkl")
# To load them
protag_antag_dists_pre = pd.read_pickle(f"data{os.sep}The_Man_in_Lower_Ten__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre = pd.read_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre = pd.read_pickle(f"data{os.sep}The_Man_in_Lower_Ten__antag_crime_obj_dists__PRETRAINED.pkl")

In [None]:
display(protag_antag_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists_pre.sort_values(['cosineSim', 'dotSim']).head())

# The After House

In [None]:
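# Load conf (refresh book_meta so the later cells use this book's characters)
book_meta = books_conf['The_After_House']
book = ProcessedBook(book_meta)
lemmas = book.lemmas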

In [26]:
# Lemmatize sentences
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()

In [None]:
# Generate word combinations
protagonists_antagonists = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['protagonists'], 
                                            get_all_sub_values_1=True,
                                            keys_2=['antagonists'],
                                            get_all_sub_values_2=True,
                                            ignore_words_with_spaces=True)
antagonists_crime_weapon = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['antagonists'],
                                            get_all_sub_values_1=True,
                                            keys_2=['crime', 'crime_weapon'],
                                            get_all_sub_values_2=False,
                                            ignore_words_with_spaces=True)
antagonists_crime_objects = books_lib.word_embeddings\
                           .get_combinations(conf=book_meta,
                                             keys_1=['antagonists'],
                                             get_all_sub_values_1=True,
                                             keys_2=['crime', 'crime_objects'],
                                             get_all_sub_values_2=False,
                                             ignore_words_with_spaces=True)

print("\nprotagonists_antagonists: ")
pprint(protagonists_antagonists)
print("\nantagonists_crime_weapon: ")
pprint(antagonists_crime_weapon)
print("\nantagonists_crime_objects: ")
pprint(antagonists_crime_objects)

In [None]:
# Calculate distances with custom word embeddings
protag_antag_dists = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(sentences=sentences, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(sentences=sentences, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(sentences=sentences, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists.to_pickle(f"data{os.sep}The_After_House__protag_antag_dists.pkl")
antag_crime_weap_dists.to_pickle(f"data{os.sep}The_After_House__antag_crime_weap_dists.pkl")
antag_crime_obj_dists.to_pickle(f"data{os.sep}The_After_House__antag_crime_obj_dists.pkl")
# To load them
protag_antag_dists = pd.read_pickle(f"data{os.sep}The_After_House__protag_antag_dists.pkl")
antag_crime_weap_dists = pd.read_pickle(f"data{os.sep}The_After_House__antag_crime_weap_dists.pkl")
antag_crime_obj_dists = pd.read_pickle(f"data{os.sep}The_After_House__antag_crime_obj_dists.pkl")

In [None]:
display(protag_antag_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists.sort_values(['cosineSim', 'dotSim']).head())

In [None]:
# Calculate distances for pretrained embeddings
model_names = books_lib.word_embeddings.get_model_names()
model_names = [mn for mn in model_names if 'glove-wiki-gigaword' in mn]
print(model_names)

protag_antag_dists_pre = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(model_names=model_names, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists_pre = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(model_names=model_names, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists_pre = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(model_names=model_names, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists_pre.to_pickle(f"data{os.sep}The_After_House__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre.to_pickle(f"data{os.sep}The_After_House__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre.to_pickle(f"data{os.sep}The_After_House__antag_crime_obj_dists__PRETRAINED.pkl")
# To load them
protag_antag_dists_pre = pd.read_pickle(f"data{os.sep}The_After_House__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre = pd.read_pickle(f"data{os.sep}The_After_House__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre = pd.read_pickle(f"data{os.sep}The_After_House__antag_crime_obj_dists__PRETRAINED.pkl")

In [None]:
display(protag_antag_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists_pre.sort_values(['cosineSim', 'dotSim']).head())

# The Window at the White Cat

In [None]:
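# Load conf (refresh book_meta so the later cells use this book's characters)
book_meta = books_conf['The_Window_at_the_White_Cat']
book = ProcessedBook(book_meta)
lemmas = book.lemmas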

In [27]:
# Lemmatize sentences
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()

In [None]:
# Generate word combinations
protagonists_antagonists = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['protagonists'], 
                                            get_all_sub_values_1=True,
                                            keys_2=['antagonists'],
                                            get_all_sub_values_2=True,
                                            ignore_words_with_spaces=True)
antagonists_crime_weapon = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['antagonists'],
                                            get_all_sub_values_1=True,
                                            keys_2=['crime', 'crime_weapon'],
                                            get_all_sub_values_2=False,
                                            ignore_words_with_spaces=True)
antagonists_crime_objects = books_lib.word_embeddings\
                           .get_combinations(conf=book_meta,
                                             keys_1=['antagonists'],
                                             get_all_sub_values_1=True,
                                             keys_2=['crime', 'crime_objects'],
                                             get_all_sub_values_2=False,
                                             ignore_words_with_spaces=True)

print("\nprotagonists_antagonists: ")
pprint(protagonists_antagonists)
print("\nantagonists_crime_weapon: ")
pprint(antagonists_crime_weapon)
print("\nantagonists_crime_objects: ")
pprint(antagonists_crime_objects)

In [None]:
# Calculate distances with custom word embeddings
protag_antag_dists = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(sentences=sentences, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(sentences=sentences, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(sentences=sentences, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists.to_pickle(f"data{os.sep}The_Window_at_the_White_Cat__protag_antag_dists.pkl")
antag_crime_weap_dists.to_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_weap_dists.pkl")
antag_crime_obj_dists.to_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_obj_dists.pkl")
# To load them
protag_antag_dists = pd.read_pickle(f"data{os.sep}The_Window_at_the_White_Cat__protag_antag_dists.pkl")
antag_crime_weap_dists = pd.read_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_weap_dists.pkl")
antag_crime_obj_dists = pd.read_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_obj_dists.pkl")

In [None]:
display(protag_antag_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists.sort_values(['cosineSim', 'dotSim']).head())

In [None]:
# Calculate distances for pretrained embeddings
model_names = books_lib.word_embeddings.get_model_names()
model_names = [mn for mn in model_names if 'glove-wiki-gigaword' in mn]
print(model_names)

protag_antag_dists_pre = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(model_names=model_names, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists_pre = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(model_names=model_names, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists_pre = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(model_names=model_names, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists_pre.to_pickle(f"data{os.sep}The_Window_at_the_White_Cat__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre.to_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre.to_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_obj_dists__PRETRAINED.pkl")
# To load them
protag_antag_dists_pre = pd.read_pickle(f"data{os.sep}The_Window_at_the_White_Cat__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre = pd.read_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre = pd.read_pickle(f"data{os.sep}The_Window_at_the_White_Cat__antag_crime_obj_dists__PRETRAINED.pkl")

In [None]:
display(protag_antag_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists_pre.sort_values(['cosineSim', 'dotSim']).head())

# The Bat

In [None]:
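# Load conf (refresh book_meta so the later cells use this book's characters)
book_meta = books_conf['The_Bat']
book = ProcessedBook(book_meta)
lemmas = book.lemmas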

In [None]:
# Lemmatize sentences
protagonist_subs = list(book_meta['protagonists'][0].values())[0]
substitution = (protagonist_subs, 'protagonist')
sentences_substituted = book.lemmatize_by_sentence(word_subs=substitution)
sentences = book.lemmatize_by_sentence()

In [None]:
# Generate word combinations
protagonists_antagonists = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['protagonists'], 
                                            get_all_sub_values_1=True,
                                            keys_2=['antagonists'],
                                            get_all_sub_values_2=True,
                                            ignore_words_with_spaces=True)
antagonists_crime_weapon = books_lib.word_embeddings\
                          .get_combinations(conf=book_meta, 
                                            keys_1=['antagonists'],
                                            get_all_sub_values_1=True,
                                            keys_2=['crime', 'crime_weapon'],
                                            get_all_sub_values_2=False,
                                            ignore_words_with_spaces=True)
antagonists_crime_objects = books_lib.word_embeddings\
                           .get_combinations(conf=book_meta,
                                             keys_1=['antagonists'],
                                             get_all_sub_values_1=True,
                                             keys_2=['crime', 'crime_objects'],
                                             get_all_sub_values_2=False,
                                             ignore_words_with_spaces=True)

print("\nprotagonists_antagonists: ")
pprint(protagonists_antagonists)
print("\nantagonists_crime_weapon: ")
pprint(antagonists_crime_weapon)
print("\nantagonists_crime_objects: ")
pprint(antagonists_crime_objects)

In [None]:
# Calculate distances with custom word embeddings
protag_antag_dists = books_lib\
                     .word_embeddings\
                     .calculate_differing_distances(sentences=sentences, 
                                                    word_pairs=protagonists_antagonists)
antag_crime_weap_dists = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(sentences=sentences, 
                                                        word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists = books_lib\
                        .word_embeddings\
                        .calculate_differing_distances(sentences=sentences, 
                                                       word_pairs=antagonists_crime_objects)

In [None]:
# Save the results
protag_antag_dists.to_pickle(f"data{os.sep}The_Bat__protag_antag_dists.pkl")
antag_crime_weap_dists.to_pickle(f"data{os.sep}The_Bat__antag_crime_weap_dists.pkl")
antag_crime_obj_dists.to_pickle(f"data{os.sep}The_Bat__antag_crime_obj_dists.pkl")
# To load them
protag_antag_dists = pd.read_pickle(f"data{os.sep}The_Bat__protag_antag_dists.pkl")
antag_crime_weap_dists = pd.read_pickle(f"data{os.sep}The_Bat__antag_crime_weap_dists.pkl")
antag_crime_obj_dists = pd.read_pickle(f"data{os.sep}The_Bat__antag_crime_obj_dists.pkl")

In [None]:
display(protag_antag_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists.sort_values(['cosineSim', 'dotSim']).head())

In [None]:
# Calculate distances with pretrained word embeddings
protag_antag_dists_pre = books_lib\
                         .word_embeddings\
                         .calculate_differing_distances(model_names=model_names,
                                                        word_pairs=protagonists_antagonists)
antag_crime_weap_dists_pre = books_lib\
                             .word_embeddings\
                             .calculate_differing_distances(model_names=model_names,
                                                            word_pairs=antagonists_crime_weapon)
antag_crime_obj_dists_pre = books_lib\
                            .word_embeddings\
                            .calculate_differing_distances(model_names=model_names,
                                                           word_pairs=antagonists_crime_objects)

In [None]:
# Save the pretrained results
protag_antag_dists_pre.to_pickle(f"data{os.sep}The_Bat__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre.to_pickle(f"data{os.sep}The_Bat__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre.to_pickle(f"data{os.sep}The_Bat__antag_crime_obj_dists__PRETRAINED.pkl")
# To load them
protag_antag_dists_pre = pd.read_pickle(f"data{os.sep}The_Bat__protag_antag_dists__PRETRAINED.pkl")
antag_crime_weap_dists_pre = pd.read_pickle(f"data{os.sep}The_Bat__antag_crime_weap_dists__PRETRAINED.pkl")
antag_crime_obj_dists_pre = pd.read_pickle(f"data{os.sep}The_Bat__antag_crime_obj_dists__PRETRAINED.pkl")

In [None]:
display(protag_antag_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_weap_dists_pre.sort_values(['cosineSim', 'dotSim']).head())
display(antag_crime_obj_dists_pre.sort_values(['cosineSim', 'dotSim']).head())

## Compare Vector distances and report similarities using Pretrained Embeddings

In [None]:
# cs.my_pretrained_embeddings_compare_function()
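
The tables above are sorted by the `cosineSim` and `dotSim` columns. As a reminder of how those two measures differ when comparing results, here is a minimal sketch in plain Python. The helper names `dot_sim` and `cosine_sim` are hypothetical and are not the functions in `nlp_libs`. The sketch also assumes the columns hold the standard dot product and cosine similarity, which this notebook does not confirm.

```python
import math


def dot_sim(u, v):
    """Dot product of two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def cosine_sim(u, v):
    """Cosine similarity: the dot product divided by both vector norms."""
    norm_u = math.sqrt(dot_sim(u, u))
    norm_v = math.sqrt(dot_sim(v, v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_sim(u, v) / (norm_u * norm_v)


# Two toy vectors pointing in the same direction but with different lengths
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print("dot:", dot_sim(a, b))
print("cosine:", cosine_sim(a, b))
```

Because cosine similarity is normalized, it depends only on the angle between the vectors, while the dot product also grows with their magnitudes. This is one reason the two columns can rank the same word pairs differently.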

## Extra Analysis? Plots?

In [None]:
# Too much work