## Keyword Generation
This notebook selects words from a given set of documents that closely resemble a given set of user-supplied input words. The aim is to supplement our current pool of ESG keywords, which will ultimately be used to filter sentences from our ESG reports.

The image below illustrates how the ESG pillars break down into sub-pillars.

![esgmsci](https://www.visualcapitalist.com/wp-content/uploads/2021/03/shareable-5.jpg)

*Image Credit: www.visualcapitalist.com*

For each sub-pillar, we will pick out the keywords and use NLP to derive a vector for each word so as to find its nearest neighbours. This gives us a series of words that are similar or closely related to the features.

In [1]:
# Import the libraries 
import glob
import re
import os
import fitz
import math
import json
import pprint
import gensim
import collections
import spacy
import nltk
import tqdm
import time
import numpy as np
import pandas as pd
import gensim.corpora as corpora

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from gensim.utils import simple_preprocess
from textblob import TextBlob
from scipy import spatial
from gensim.models import Word2Vec
from numba import jit

## Initial set of keywords
This JSON file contains our preliminary set of keywords, retrieved manually from articles and reports.
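The later cells index the file as `keywordBank[pillar][component]`, so it is presumably a two-level mapping from pillar to component to a list of keywords. A minimal sketch of that layout follows; the component names and keywords are illustrative assumptions, not the actual file contents.

```python
import json

# Hypothetical excerpt of keywords.json: pillar -> component -> list of keywords.
# The component names and keywords below are illustrative assumptions only.
sample_json = """
{
  "Environment": {"Climate Change": ["carbon", "emissions"]},
  "Social": {"Human Capital": ["training", "diversity"]},
  "Governance": {"Corporate Governance": ["board", "audit"]}
}
"""

sample_bank = json.loads(sample_json)

# Flatten to one keyword list per component, in pillar order
sample_components = [
    keywords
    for pillar in ("Environment", "Social", "Governance")
    for keywords in sample_bank[pillar].values()
]
print(sample_components)
```

Keeping one list per component is what lets the expanded lists be written back to the same slots later in the notebook.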

In [2]:
# Read the keywords from our JSON file
with open('keywords.json') as f:
    keywordBank = json.load(f)

## Word Embedding and Similarity scoring
Manually searching for keywords under each component of the 3 pillars will not yield a full representation of the topic at hand. The aim here is therefore to use word embeddings, an NLP technique, to pick out words that closely resemble a given word in terms of cosine similarity.
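To make the similarity measure concrete, here is a minimal sketch of cosine similarity and a nearest-neighbour lookup in plain Python. The helper names and the 3-dimensional toy vectors are assumptions made up for illustration; the notebook itself relies on gensim's `most_similar` over vectors learned from the reports.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(query, vectors, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, made up for illustration (not taken from the trained model)
toy_vectors = {
    "carbon": [0.9, 0.1, 0.0],
    "emissions": [0.8, 0.2, 0.1],
    "board": [0.0, 0.1, 0.9],
}

print(nearest_neighbours("carbon", toy_vectors, topn=2))
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated words score close to 0, which is why a score cut-off can be used to decide which neighbours to keep.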

In [3]:
# --------------------------- Read a pdf into a large string of text ---------------------------
def read_pdf(file_path):
    pymupdf_text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            pymupdf_text += page.get_text()
    return pymupdf_text


# --------------------------- Break a report's text into individual sentences ---------------------------
def convert_pdf_into_sentences(text):
    # Remove unnecessary spaces and line breaks
    text = re.sub(r'\x0c\x0c|\x0c', "", str(text))
    text = re.sub('\n ', '', str(text))
    text = re.sub('\n', ' ', str(text))
    text = ' '.join(text.split())
    text = " " + text + "  "
    text = text.replace("\n", " ")
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]

    # Filter for sentences with more than 100 characters
    sentences = [s.strip() for s in sentences if len(s) > 100]
    return sentences

In [4]:
word_bank = []

# Read our database of ESG reports
path = 'Reports 2.0'
esg_reports = glob.glob(path + '/*.pdf')
for report in tqdm.tqdm(esg_reports):
    word_bank.append(convert_pdf_into_sentences(read_pdf(report)))

100%|█████████████████████████████████████████████████████████████████████████████████████| 468/468 [02:04<00:00,  3.77it/s]


In [5]:
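# Combine each report's sentences into a list of word tokens
cleaned = []
for report_sentences in tqdm.tqdm(word_bank):
    combined = ' '.join(report_sentences)
    # Replace every non-alphanumeric character with a space
    new_string = re.sub(r"[^a-zA-Z0-9]", " ", combined)
    # split() with no argument drops the empty tokens left by consecutive spaces
    cleaned.append(new_string.split())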

100%|████████████████████████████████████████████████████████████████████████████████████| 468/468 [00:04<00:00, 105.18it/s]


## Training the Word2Vec Model

In [6]:
model = Word2Vec(
    sentences=cleaned,
    size=100,  # this parameter is named vector_size in gensim 4.0 and later
    alpha=0.025,
    window=5,
    min_count=5,
    workers=4
)

model.save("word2vec.model")

In [7]:
# Import the model trained on the corpus of ESG reports
model = Word2Vec.load("word2vec.model")

In [8]:
# Test out the model with some basic inputs
test_words = ['data', 'security', 'governance', 'carbon']

for word in test_words:
    sims = model.wv.most_similar(word, topn=10) # Get other similar words
    print(f">>> Top 10 words that are similar to: {word}")
    pprint.pprint(sims)
    print('\n')

>>> Top 10 words that are similar to: data
[('information', 0.7035861015319824),
 ('Data', 0.6439417600631714),
 ('endpoint', 0.5328998565673828),
 ('privacy', 0.5268815755844116),
 ('cybersecurity', 0.49566537141799927),
 ('posture', 0.48101022839546204),
 ('authentication', 0.47874704003334045),
 ('cyber', 0.475664347410202),
 ('vulnerabilities', 0.4756225347518921),
 ('Redundant', 0.4716993272304535)]


>>> Top 10 words that are similar to: security
[('cybersecurity', 0.7609637379646301),
 ('privacy', 0.724395215511322),
 ('cyber', 0.6816527247428894),
 ('protection', 0.6676135659217834),
 ('safety', 0.6309021711349487),
 ('Security', 0.6017462015151978),
 ('systems', 0.5922068953514099),
 ('vulnerability', 0.5906700491905212),
 ('protocols', 0.5839361548423767),
 ('reliability', 0.5838332176208496)]


>>> Top 10 words that are similar to: governance
[('responsibility', 0.66522216796875),
 ('Governance', 0.6651098132133484),
 ('structure', 0.6410582661628723),
 ('citizenship', 0.637

In [9]:
# Parse out all our manually acquired keywords
components_words = []

for component, keywords in keywordBank['Environment'].items():
    components_words.append(keywords)

for component, keywords in keywordBank['Social'].items():
    components_words.append(keywords)

for component, keywords in keywordBank['Governance'].items():
    components_words.append(keywords)

## Keyword generation

To retrieve keywords from our model that closely resemble a given input word, we will pass the entire collection of manual keywords through the model.

To avoid the over-retrieval of keywords that may lead to overlaps, we will be limiting the retrieval to the top **3 words.**

Based on our *preliminary* testing, we concluded that a similarity score threshold of **70%** (a cosine similarity of 0.7) gives us the keywords that best resemble the input.
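The selection rule described here can be sketched in isolation from the model. The helper name `select_new_keywords` and the existing keyword list are hypothetical; the candidate scores are the top 3 neighbours of `security` from the test output above, rounded to three decimals.

```python
def select_new_keywords(existing, candidates, threshold=0.7):
    """Return candidate words scoring above `threshold` that are not already known."""
    selected = []
    for word, score in candidates:
        if score > threshold and word not in existing and word not in selected:
            selected.append(word)
    return selected

# Top 3 neighbours of 'security' from the earlier test output (scores rounded)
candidates = [("cybersecurity", 0.761), ("privacy", 0.724), ("cyber", 0.682)]

# Hypothetical existing keyword list for one component
existing_keywords = ["security", "privacy"]

print(select_new_keywords(existing_keywords, candidates))
# -> ['cybersecurity']
```

Here `privacy` is skipped because it is already in the list, and `cyber` is skipped because its score falls below the threshold.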

In [10]:
# Building on the current list of keywords
position = 0
count = 0
temp = components_words.copy()
for wordList in tqdm.tqdm(temp):
    newKeywords = []
    newKeywords.extend(wordList)
    for keyw in wordList:
        try:
            sims = model.wv.most_similar(keyw, topn=3)
            for newWord in sims:
                if (newWord[0] not in newKeywords) and (newWord[1] > 0.7):
                    count += 1
                    newKeywords.append(newWord[0])
        except KeyError:
            continue
    components_words[position] = newKeywords
    position += 1
print(f"Number of words added: {count}")

100%|████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 144.09it/s]

Number of words added: 116





In [11]:
# Write the expanded keyword lists back into our dictionary
pointer = 0
for pillar, comps in keywordBank.items():
    if pillar in ['Environment', 'Social', 'Governance']:
        for component, keywords in keywordBank[pillar].items():
            keywordBank[pillar][component] = components_words[pointer]
            pointer += 1
            
# Repopulate our json file with the newly added keywords
with open("keywords.json", "w") as outfile:
    json.dump(keywordBank, outfile)