# NLP - Project 2
## Rinehart Analysis with Word Vectors
**Team**: *Jean Merlet, Konstantinos Georgiou, Matt Lane*

## Where to put the code
- Place the preprocessing functions/classes in [nlp_libs/books/preprocessing.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/preprocessing.py)
- The custom word embeddings functions/classes (task 1) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The pretrained word embeddings functions/classes (task 2) in [nlp_libs/books/word_embeddings.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/word_embeddings.py) (separate class)
- The functions/classes (if any) that compare the results (tasks 3, 4, 5) in [nlp_libs/books/compare_statistics.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/compare_statistics.py)
- Any plotting related functions in [nlp_libs/books/plotter.py](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/nlp_libs/books/plotter.py)

**The code is reloaded automatically. Any class object needs to be reinitialized, though.**

## Config file
The yml/config file is located at: [confs/proj_2.yml](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/confs/proj_2.yml)<br>
To load it run:
```python
config_path='confs/proj_2.yml'
conf = Configuration(config_src=config_path)
# Get the books dictionary
books = conf.get_config('data_loader')['config']['books'] # type = Dict
print(books.keys())
print(books['The_Bat'])
```
To reload the config, rerun the second and third commands (the `Configuration(...)` and `get_config(...)` lines).

## Libraries Overview:
All the libraries are located under *"\<project root>/nlp_libs"*
- nlp_libs/**books**: This project's code (imported later)
- nlp_libs/**configuration**: Class that creates config objects from yml files
- nlp_libs/**fancy_logger**: Logger that can be used instead of prints for text formatting (color, bold, underline etc)

## Project 1 Code
If you need to import anything from Project 1 just run:
```python
import proj1_nlp_libs.books.processed_book as proc
import proj1_nlp_libs.books.book_extractor as extr
import proj1_nlp_libs.books.plotter as pl
```

## For more info check out:
- the **[Project Board](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/projects/1)**
- the **[README](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/blob/main/README.md)**
- and the **[Current Issues](https://github.com/NLPaladins/rinehartAnalysis_wordVectors/issues)**

# ------------------------------------------------------------------

## On Google Colab?
- **If yes, run the two cells and press the two buttons below:**
- Otherwise go to "***Import the base Libraries***"

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# Clone the repository if you're in Google Colab
def clone_project(_button=None):
    # on_click passes the clicked Button instance as the first argument
    print("Cloning Project..")
    !git clone https://github.com/NLPaladins/rinehartAnalysis_wordVectors.git
    print("Project cloned.")

print("Clone project?")
print("(If you do this you will overwrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Colab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will overwrite local changes on other files e.g. configs)
Not needed if you're not on Google Colab


Button(description='Yes, clone', style=ButtonStyle())

In [2]:
# Change into the cloned project directory and install its requirements
def change_dir(_button=None):
    # on_click passes the clicked Button instance as the first argument
    try:
        print("Changing dir..")
        # git clone creates a directory named after the repository
        os.chdir('/content/rinehartAnalysis_wordVectors')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
        print("\nInstalling Requirements")
        !pip install -r requirements.txt
    except Exception:
        print("Error: Project not cloned")

print("Are you on Google Colab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Colab?


Button(description='Yes', style=ButtonStyle())

### To commit and push the Google Colab notebook to GitHub
Click **File > Save a copy in GitHub**

# ------------------------------------------------------------------

# Initializations

## Import the base Libraries

In [3]:
# Imports
%load_ext autoreload
%autoreload 2
from importlib import reload as reload_lib
from typing import *
import os
import re
from pprint import pprint
# Numpy
import numpy as np

# Import preprocessing lib
from nlp_libs.books import *

## Load the YML file

In [4]:
from nlp_libs import Configuration

In [274]:
# The path of configuration and log save path
config_path = "confs/proj_2.yml"
# !cat "$config_path"
# Load the configuration
conf = Configuration(config_src=config_path)
# Get the books dict
books = conf.get_config('data_loader')['config']['books']
# print(books.keys())
# pprint(books)  # Pretty print the books dict

2021-10-28 16:08:04 Config       INFO     Configuration file loaded successfully from path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis_wordVectors/confs/proj_2.yml
2021-10-28 16:08:04 Config       INFO     Configuration Tag: proj2


## Setup Logger and Example

In [275]:
log_path = "logs/proj_2.log"
# Load and setup logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)
# Examples
logger.info("Logger Examples:")
logger.nl(num_lines=1) # New lines
logger.warn("Logger Warning underlined", attrs=['underline']) 
# Attrs: bold, dark, underline, blink, reverse, concealed
logger.error("Logger Error in red&yellow", color="yellow", on_color="on_red")
# Colors: on_grey, on_red, on_green, on_yellow, on_blue, on_magenta, on_cyan, on_white

2021-10-28 16:08:05 FancyLogger  INFO     Logger is set. Log file path: /Users/96v/Documents/DSE/nlp/rinehartAnalysis_wordVectors/logs/proj_2.log
2021-10-28 16:08:05 Notebook     INFO     Logger Examples:

2021-10-28 16:08:05 Notebook     ERROR    Logger Error in red&yellow


# ------------------------------------------------------------------

# Start of Project Code

In [276]:
from nlp_libs.books import * 

## Preprocessing

# The Circular Staircase

In [277]:
books['The_Circular_Staircase']

{'url': 'https://www.gutenberg.org/files/434/434-0.txt',
 'protagonists': {'Mr. Jamieson': ['jamieson', 'detective', 'winters']},
 'suspects': {'John Bailey': ['john', 'jack'],
  'Gertrude Innes': ['gertrude'],
  'Halsey Innes': ['halsey']},
 'antagonists': {'Anne Watson': ['anne', 'watson']},
 'crime': {'crime_weapon': ['revolver'], 'crime_objects': ['tmp']}}

In [278]:
book = ProcessedBook(books['The_Circular_Staircase'])
sentences = book.lemmatize_by_sentence()

In [279]:
df = word_embeddings.calculate_differing_distances(sentences, [['jamieson', 'watson'], 
                                                            ['revolver', 'watson'], 
                                                            ['murder', 'watson'], 
                                                            ['murder', 'rachel'],                                                                  
                                                            ['jamieson', 'detective']])

In [263]:
df.sort_values(['cosineSim', 'dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 3 | murder | rachel | 50 | 10 | 0.989393 | 1.438346 |
| 3 | murder | rachel | 50 | 2 | 0.989715 | 2.006001 |
| 3 | murder | rachel | 50 | 3 | 0.990726 | 1.907275 |
| 3 | murder | rachel | 50 | 5 | 0.992565 | 1.951108 |
| 1 | revolver | watson | 50 | 2 | 0.995201 | 1.478831 |
| ... | ... | ... | ... | ... | ... | ... |
| 4 | jamieson | detective | 200 | 3 | 0.999664 | 8.111022 |
| 4 | jamieson | detective | 200 | 10 | 0.999702 | 10.791206 |
| 4 | jamieson | detective | 300 | 5 | 0.999785 | 8.911703 |
| 4 | jamieson | detective | 300 | 3 | 0.999795 | 8.121413 |
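
The two similarity columns measure different things: `cosineSim` depends only on the angle between two vectors, while `dotSim` also grows with their lengths. That may be why the `jamieson`/`detective` rows have an almost constant cosine similarity while their dot product ranges from roughly 8 to 11. A minimal sketch with made-up vectors (the helper names `dot_sim` and `cosine_sim` are illustrative and are not part of `word_embeddings`):

```python
import math

def dot_sim(u, v):
    """Plain dot product: sensitive to both direction and magnitude."""
    return sum(a * b for a, b in zip(u, v))

def cosine_sim(u, v):
    """Dot product of the length-normalized vectors: direction only."""
    norm_u = math.sqrt(dot_sim(u, u))
    norm_v = math.sqrt(dot_sim(v, v))
    return dot_sim(u, v) / (norm_u * norm_v)

# Toy vectors (not real embeddings): v points the same way as u but is 3x longer
u = [1.0, 2.0, 2.0]
v = [3.0, 6.0, 6.0]

print(cosine_sim(u, v))  # unaffected by the difference in length
print(dot_sim(u, v))     # scales with the vector lengths
```

Doubling the length of `v` would double `dot_sim(u, v)` and leave `cosine_sim(u, v)` unchanged, so thresholds on the two measures are not interchangeable.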


In [264]:
arr = np.array(sentences)



In [265]:
import pydash

In [266]:
flatarr = pydash.flatten(sentences)

In [267]:
import spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

print(len(stopwords))

326


In [205]:
words, counts = np.unique(flatarr, return_counts=True)

In [206]:
# Keep words that occur at least twice (count > 1)
moreThan2Instances = [word for word, count in zip(words, counts) if count > 1]

# Drop stop words from the remaining vocabulary
non_stopwords = [word for word in moreThan2Instances if word not in stopwords]

In [207]:
len(non_stopwords)

2286
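
The filtering above (keep words seen at least twice, then drop stop words) can also be sketched without NumPy, using `collections.Counter`. The sentences and the tiny stop-word set below are invented for illustration; the notebook uses spaCy's stop-word list instead.

```python
from collections import Counter
from itertools import chain

# Toy lemmatized sentences (hypothetical data, not taken from the book)
toy_sentences = [
    ["the", "detective", "find", "the", "revolver"],
    ["the", "detective", "leave", "the", "house"],
    ["a", "revolver", "lie", "in", "the", "house"],
]
# Tiny stand-in for spaCy's stop-word set
toy_stopwords = {"the", "a", "in"}

# Count every token across all sentences
counts = Counter(chain.from_iterable(toy_sentences))

# Keep words that occur at least twice and are not stop words
vocab = sorted(w for w, c in counts.items() if c > 1 and w not in toy_stopwords)
print(vocab)
```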

In [209]:
# Five random word pairs drawn from the filtered vocabulary
random_pairs = [np.random.choice(non_stopwords, 2, replace=False) for _ in range(5)]
moreThan2 = word_embeddings.calculate_differing_distances(sentences, random_pairs)

In [210]:
moreThan2.sort_values(['cosineSim', 'dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 2 | success | nervous | 50 | 2 | 0.427341 | 0.025499 |
| 2 | success | nervous | 50 | 5 | 0.489458 | 0.026372 |
| 2 | success | nervous | 50 | 10 | 0.566486 | 0.044190 |
| 3 | surprised | volubly | 50 | 5 | 0.631803 | 0.034098 |
| 2 | success | nervous | 100 | 10 | 0.659963 | 0.030586 |
| ... | ... | ... | ... | ... | ... | ... |
| 0 | shoe | widow | 200 | 10 | 0.986582 | 0.181159 |
| 4 | deposit | uneasily | 300 | 5 | 0.987511 | 0.092233 |
| 4 | deposit | uneasily | 200 | 10 | 0.987676 | 0.139575 |
| 4 | deposit | uneasily | 300 | 3 | 0.987823 | 0.093743 |


In [211]:
from itertools import combinations
combinationsOfNewWords = list(combinations(non_stopwords, 2))

In [212]:
len(combinationsOfNewWords)

2611755
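
The pair count follows directly from the vocabulary size: choosing 2 out of 2286 words gives n(n-1)/2 unordered pairs. A quick check with the standard library (the short word list is just an example):

```python
import math
from itertools import combinations

n_words = 2286  # len(non_stopwords) reported above
n_pairs = math.comb(n_words, 2)
print(n_pairs)  # 2611755

# Same idea on a tiny list, enumerating the pairs explicitly
small = ["door", "house", "night", "room"]
pairs = list(combinations(small, 2))
print(len(pairs))
```

The count grows quadratically, so doubling the vocabulary roughly quadruples the number of pairs to score.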

In [214]:
embeddings = word_embeddings.calculate_differing_distances(sentences, combinationsOfNewWords, vector_dimensions=[200], window_dimensions=[5])

In [215]:
embeddings.columns

Index(['word1', 'word2', 'vectorSize', 'windowSize', 'cosineSim', 'dotSim'], dtype='object')

In [218]:
embeddings.sort_values(['dotSim'])

| | word1 | word2 | vectorSize | windowSize | cosineSim | dotSim |
|---|---|---|---|---|---|---|
| 622 | I | eagerness | 200 | 5 | -0.338757 | -0.065741 |
| 234982 | armstrong | eagerness | 200 | 5 | -0.328867 | -0.060188 |
| 1230559 | eagerness | nt | 200 | 5 | -0.347839 | -0.059893 |
| 1230878 | eagerness | room | 200 | 5 | -0.314527 | -0.059508 |
| 1230048 | eagerness | gertrude | 200 | 5 | -0.330936 | -0.059488 |
| ... | ... | ... | ... | ... | ... | ... |
| 236059 | armstrong | room | 200 | 5 | 0.996301 | 17.813147 |
| 869 | I | gertrude | 200 | 5 | 0.998688 | 17.989231 |
| 581 | I | doctor | 200 | 5 | 0.997822 | 18.020947 |
| 104 | I | armstrong | 200 | 5 | 0.998007 | 18.302742 |


In [375]:
import networkx as nx
from networkx.readwrite import json_graph
import json 

In [234]:
dotsimDF = embeddings[['word1', 'word2', 'dotSim']]
dotsimDF.columns = ['source', 'target', 'weight']
dotsimNetwork = nx.from_pandas_edgelist(dotsimDF,edge_attr=True)

In [235]:
cosineSimDF = embeddings[['word1', 'word2', 'cosineSim']]
cosineSimDF.columns = ['source', 'target', 'weight']
cosSimNetwork = nx.from_pandas_edgelist(cosineSimDF,edge_attr=True)

In [367]:
tdsdf = dotsimDF[dotsimDF['weight'] > 7 ]
dotsimNetwork = nx.from_pandas_edgelist(tdsdf, edge_attr=True)
print(f"df: {len(tdsdf)} nodes: {len(dotsimNetwork.nodes)}")

df: 4080 nodes: 162


In [368]:
cosimThresh= cosineSimDF[cosineSimDF['weight'] > 0.9995]
cosimNetwork = nx.from_pandas_edgelist(cosimThresh, edge_attr=True)
print(f"df: {len(cosimThresh)} nodes: {len(cosimNetwork.nodes)}")

df: 4355 nodes: 194
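
Thresholding keeps only the strongest edges, and the node count is the number of distinct words that still appear in at least one surviving edge. A dependency-free sketch of that bookkeeping on invented weights (the notebook itself relies on `nx.from_pandas_edgelist`):

```python
# Toy weighted edge list: (source, target, weight); the values are made up
edges = [
    ("armstrong", "room", 17.8),
    ("doctor", "room", 9.1),
    ("eagerness", "room", -0.06),
    ("gertrude", "halsey", 7.5),
    ("shoe", "widow", 0.18),
]
threshold = 7

# Keep edges strictly above the threshold, as in the cells above
kept = [(s, t, w) for s, t, w in edges if w > threshold]

# Nodes are the distinct endpoints of the surviving edges
nodes = {word for s, t, _ in kept for word in (s, t)}

print(f"edges: {len(kept)} nodes: {len(nodes)}")
```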


In [388]:
def generate_word_attributes(book):
    attribute_list = {}
    for protagonist, pseudonyms in book.protagonists.items(): 
        for name in pseudonyms: 
            attribute_list[name] = {'type': 'protagonist'}
        
    for suspect, pseudonyms in book.suspects.items(): 
        for name in pseudonyms: 
            attribute_list[name] = {'type': 'suspect'}
        
    for antagonist, pseudonyms in book.antagonists.items(): 
        for name in pseudonyms: 
            attribute_list[name] = {'type': 'antagonist'}
        
    return attribute_list

def createNetwork(dataframe, book, typeofweight='cosineSim', weightThreshold=0.99925):
    # Build a source/target/weight edge list from the chosen similarity column
    edgelistDF = dataframe[['word1', 'word2', typeofweight]]
    edgelistDF.columns = ['source', 'target', 'weight']
    # Copy the filtered rows so the rescaling below does not raise SettingWithCopyWarning
    thresholdedEdgelist = edgelistDF[edgelistDF['weight'] > weightThreshold].copy()

    minWeight = thresholdedEdgelist['weight'].min()
    maxWeight = thresholdedEdgelist['weight'].max()

    # Min-max rescale the surviving weights to the range [0, 10]
    thresholdedEdgelist['weight'] = (
        (thresholdedEdgelist['weight'] - minWeight) / (maxWeight - minWeight) * 10
    )

    similarityNetwork = nx.from_pandas_edgelist(thresholdedEdgelist, edge_attr=True)
    nodeAttributes = generate_word_attributes(book)
    nx.set_node_attributes(similarityNetwork, nodeAttributes)

    jsonGraph = json_graph.node_link_data(similarityNetwork)

    # Write the graph in node-link JSON format
    with open(f"{typeofweight}_{weightThreshold}.json", 'w') as f:
        json.dump(jsonGraph, f)


In [389]:
createNetwork(embeddings, book, 'cosineSim',  0.9995)
createNetwork(embeddings, book, 'dotSim', 7)


In [376]:

eh = (cosimNetwork)

In [373]:
minn = dotsimDF['weight'].min()
maxx = dotsimDF['weight'].max()
# Min-max rescale to [0, 10]; assign() returns a new frame and avoids SettingWithCopyWarning
dotsimDF = dotsimDF.assign(weight=(dotsimDF['weight'] - minn) / (maxx - minn) * 10)


In [374]:
dotsimDF['weight'].max()


10.0
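
The rescaling used here is a min-max normalization onto [0, 10]: the smallest weight maps to 0 and the largest to 10, which matches the `10.0` maximum above. A small sketch with invented weights; the helper name `min_max_scale` and the guard for a zero range are additions for illustration and are not in the notebook code:

```python
def min_max_scale(values, upper=10.0):
    """Linearly map values onto [0, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All weights equal: avoid dividing by zero (illustrative guard)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * upper for v in values]

weights = [7.5, 10.0, 12.5, 17.5]  # made-up dot-product weights
scaled = min_max_scale(weights)
print(scaled)
```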

In [283]:
book.suspects

{'John Bailey': ['john', 'jack'],
 'Gertrude Innes': ['gertrude'],
 'Halsey Innes': ['halsey']}

In [299]:
book.antagonists.items()

dict_items([('Anne Watson', ['anne', 'watson'])])

In [311]:

summary = generate_word_attributes(book)

summary

{'jamieson': {'type': 'protagonist'},
 'detective': {'type': 'protagonist'},
 'winters': {'type': 'protagonist'},
 'john': {'type': 'suspect'},
 'jack': {'type': 'suspect'},
 'gertrude': {'type': 'suspect'},
 'halsey': {'type': 'suspect'},
 'anne': {'type': 'antagonist'},
 'watson': {'type': 'antagonist'}}

In [325]:
nx.set_node_attributes(dotsimNetwork, summary)


In [327]:
dotsimNetwork.nodes

NodeView(('I', 'afternoon', 'air', 'alex', 'anne', 'armstrong', 'arnold', 'ask', 'aunt', 'away', 'bad', 'bailey', 'bank', 'bed', 'begin', 'believe', 'black', 'body', 'boy', 'break', 'bring', 'car', 'carrington', 'casanova', 'chair', 'child', 'circular', 'close', 'clothe', 'club', 'come', 'country', 'course', 'd', 'day', 'dead', 'death', 'detective', 'doctor', 'door', 'drive', 'drop', 'east', 'end', 'evening', 'eye', 'face', 'fall', 'family', 'far', 'feel', 'find', 'fire', 'floor', 'foot', 'gertrude', 'girl', 'good', 'grow', 'half', 'hall', 'halsey', 'hand', 'happen', 'head', 'hear', 'hold', 'home', 'house', 'ill', 'inne', 'jack', 'jamieson', 'know', 'leave', 'let', 'liddy', 'lie', 'light', 'like', 'link', 'little', 'lock', 'lodge', 'long', 'look', 'louise', 'low', 'lucien', 'man', 'minute', 'miss', 'moment', 'money', 'morning', 'mother', 'mr', 'mrs', 'new', 'night', 'nt', 'oclock', 'old', 'open', 'paper', 'paul', 'people', 'place', 'probably', 'quiet', 'ray', 'read', 'right', 'road', '

# The Man in Lower Ten

In [None]:
book = ProcessedBook(books['The_Man_in_Lower_Ten'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

In [None]:
book.clean_lines

# The After House

In [None]:
book = ProcessedBook(books['The_After_House'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

# The Window at the White Cat

In [None]:
book = ProcessedBook(books['The_Window_at_the_White_Cat'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

# The Bat

In [None]:
book = ProcessedBook(books['The_Bat'])
lemmas = book.lemmas

In [None]:
' '.join(lemmas[:100])

## Custom Word Embeddings

In [None]:
# Import word_embeddings lib
import nlp_libs.books.word_embeddings as we

In [None]:
# custom_embeddings = we.WordEmbeddingsCustom()

## Pretrained Word Embeddings

In [None]:
# pretrained_embeddings = we.WordEmbeddingsPretrained()

## Compare Vector distances and report similarities using Custom Embeddings

In [None]:
# Import compare_statistics lib
import nlp_libs.books.compare_statistics as cs

In [None]:
# cs.my_custom_embeddings_compare_function()

## Compare Vector distances and report similarities using Pretrained Embeddings

In [None]:
# cs.my_pretrained_embeddings_compare_function()

## Extra Analysis? Plots?

In [None]:
# Too much work

In [None]:
# NOTE: try lemmatizing the stop words before filtering, then see how the results change