# Dissertation Set Expansion Paper

This notebook contains my coded attempt at reproducing the best method from the paper: 

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8999141.

## Download Files from my Dissertation Git Repo

In [1]:
!git clone https://github.com/LeeTaylorNewcastle/Dissertation

Cloning into 'Dissertation'...
remote: Enumerating objects: 1402, done.
remote: Counting objects: 100% (109/109), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 1402 (delta 37), reused 100 (delta 31), pack-reused 1293
Receiving objects: 100% (1402/1402), 122.28 MiB | 24.20 MiB/s, done.
Resolving deltas: 100% (58/58), done.
Updating files: 100% (1990/1990), done.


## Imports & Installations

In [66]:
!pip install bs4
!pip install html5lib
!pip install gensim
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [71]:
from sklearn.metrics import precision_score, recall_score, f1_score
from bs4 import BeautifulSoup as bs
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import numpy as np
import requests
import pickle
import string
import gensim
import os
import re

## Web Crawling and Scraping Functions

The authors of the paper (MISSING_PAPER_NAME) scraped web pages related to their category. To reproduce their method, I scraped web pages related to my category in a similar fashion.

In [51]:
INTERVAL = 3600
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
}

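def url_to_soup_obj(url: str):
    """Fetch `url` and return a BeautifulSoup object, or None if the request fails."""
    try:
        page = requests.get(url, headers=HEADERS)
    except requests.RequestException:
        # Covers malformed URLs, connection errors and timeouts
        print(f"<ERROR: {url}>")
        return None
    return bs(page.content, 'html5lib')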

def extract_elements(html_text, elm='p'):
    soup = bs(html_text, 'html.parser')
    paragraphs = []
    for p in soup.find_all(elm):
        paragraphs.append(p.text.strip())
    return paragraphs

def google_search_wikipedia(search_str: str, debug: bool = True):
    # Convert search to google search URL
    gsearch_url = f"https://www.google.com/search?q={'+'.join(search_str.lower().split())}"
    # Generate 'soup' object of google search
    gsearch_soup = url_to_soup_obj(gsearch_url) # Todo: error
    # Extract URLs
    elms = extract_elements(str(gsearch_soup), elm='cite')
    href = extract_elements(str(gsearch_soup), elm='a')
    # Store 'cite' and 'a' elements in elms list
    for item in href:
        elms.append(item)
    # For-loop extracts URLs into list
    urls_ = []
    for text in elms:  # named `text` to avoid shadowing the imported `string` module
        # Skip blank strings
        if text == '':
            continue
        # First token must contain 'https:'; skip truncated ('...') and 'category' results
        if "https:" in text.split()[0] and \
                '...' not in text.split()[-1] and \
                'category' not in text.lower():
            # Remove spaces, then convert arrows to slashes for URL functionality
            text = text.replace(' ', '')
            urls_.append(text.replace('›', '/'))
    # Return list of URLs
    return list(set(urls_))

def read_words_from_file(file_path, 
                         encoding='utf-8-sig',
                         rtype='set'):
    """Read a file's lines into a set (rtype='set') or a list (any other rtype)."""
    with open(file_path, 'r', encoding=encoding) as f:
        lines = f.read().splitlines()
    return set(lines) if rtype == 'set' else lines

def remove_code(text):
    # Strip anything that looks like an HTML tag (non-greedy match on <...>)
    clean_text = re.sub(r'<.*?>', '', text)
    # Return the resulting text
    return clean_text

def extract_text(soup, headers=True, debug=True):
    # Find all HTML elements that contain the main content
    if headers:
        content_elements = soup.find_all(["p", "h1", "h2", "h3", "h4", "h5",
                                          "h6", "a", "li", "span", "strong", "em"])
    else:
        content_elements = soup.find_all(["p", "a", "li", "span", "strong", "em"])
    # Concatenate the text from all content elements
    content = [element.text.strip() for element in content_elements]
    for i, v in enumerate(content):
        content[i] = v.replace('\n', '')
        content[i] = content[i].replace('  ', '')
    # Remove blanks, lines of nine words or fewer, and lines containing boilerplate keywords
    keywords = ('site', 'cookie', 'sign in', 'instagram', 'contact us')
    content = [element for element in content
               if len(element.split()) > 9
               and not any(keyword in element.lower() for keyword in keywords)]
    # Combine content into a string
    content = '\n'.join(set(content))
    content = remove_code(content)
    # Return the resulting text
    return content

def query_to_text(search_str, page_limit=3, debug=True):
    # Define storage for extracted text
    rv = []
    # Extract URLs to scrape from
    urls = google_search_wikipedia(search_str=search_str)
    # For each URL append extracted readable text
    for url in urls[:page_limit]:
        if debug:
            print(url)
        soup_url = url_to_soup_obj(url)
        if soup_url is None:
            # Skip pages that failed to load but keep the text from the others
            continue
        rv.append(extract_text(soup_url))
    return rv

def write_to_file(fn, text):
    with open(fn, "w", encoding='utf-8-sig') as f:
        for elm in text:
            f.write(str(elm) + '\n')

def main():
    # Read list of words to search
    entity_set = read_words_from_file(
      f'Dissertation/M6/'
      f'data_prep/entity_set.txt')
    # Entity set info
    print(
        f"Set of words describing personality traits info.\n"
        f"Number of words:  {len(entity_set)}\n"
        f"First five words: {list(entity_set)[:5]}\n"
        f"Source: http://ideonomy.mit.edu/essays/traits.html\n"
    )
    # Parse web pages for text
    os.makedirs("wcs", exist_ok=True)
    counter, e_counter = 0, 0
    for word in entity_set:
        try:
            search_term = f'wikipedia {word}'
            parsed_webpage_text = query_to_text(search_term, debug=False)
            write_to_file(f"wcs/{search_term}.txt", parsed_webpage_text)
            counter += 1
        except Exception:
            e_counter += 1
            # print(f"'{word}' could not be written to a file!")
    print(f"\nSuccessfully scraped {counter} terms. {e_counter} terms failed.")


main()

Set of words describing personality traits info.
Number of words:  637
First five words: ['Impressive', 'Charmless', 'Cynical', 'Dull', 'Offhand']
Source: http://ideonomy.mit.edu/essays/traits.html

<ERROR: Estheticianhttps://en.m.wikipedia.org/wiki/Esthetician>
<ERROR: Wikipedia:VaguenessWikipediahttps://en.wikipedia.org/wiki/Wikipedia:Vagueness>
<ERROR: Old-fashionedhttps://en.wikipedia.org/wiki/Old-fashioned>
<ERROR: Wikipedia:RudeWikipediahttps://en.wikipedia.org/wiki/Wikipedia:Rude>
<ERROR: contemplativeWiktionaryhttps://en.wiktionary.org/wiki/contemplative>
<ERROR: pedanticWiktionaryhttps://en.wiktionary.org/wiki/pedantic>
<ERROR: WiseWikipediahttps://en.wikipedia.org/wiki/Wise>
<ERROR: ImpatienceWikiquotehttps://en.wikiquote.org/wiki/Impatience>
<ERROR: flamboyancehttps://en.wiktionary.org/wiki/flamboyance>
<ERROR: Curiosityhttps://en.wikipedia.org/wiki/Curiosity>
<ERROR: Wikipedia:SUBJECTIVEWikipediahttps://en.wikipedia.org/wikipedia/Special:Search>
<ERROR: DrollWikipediahttps:

The aim of this code is to collect text about personality traits from the web, from Wikipedia and related pages, and to store it in one text file per trait for later analysis and set expansion.

The code starts by reading a list of words that describe personality traits from a file and printing information about the list. Then, for each word, it runs a Google search for `wikipedia <word>`, extracts candidate URLs from the results page, downloads up to three of those pages, extracts their readable text, and writes the text to `wcs/wikipedia <word>.txt`.

Not every request succeeds. The `<ERROR: ...>` lines in the output show result strings in which the link title is fused to the URL (for example `Estheticianhttps://en.m.wikipedia.org/wiki/Esthetician`). `requests.get` cannot fetch these, so `url_to_soup_obj` logs them and returns `None`.
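The fused strings in the error log could probably be repaired before they are requested. The sketch below is one possible approach, not part of the notebook: `recover_url` is a hypothetical helper that assumes the real URL always starts at the first `https://` in the string.

```python
import re

def recover_url(result_text: str):
    """Return the URL embedded in a search-result string, or None.

    Hypothetical helper: it drops any title text fused in front of
    'https://' and rebuilds the path from '›' breadcrumb separators.
    """
    # Remove spaces and turn breadcrumb arrows into slashes, as the notebook does
    compact = result_text.replace(' ', '').replace('›', '/')
    # Keep only the part that starts at the URL scheme
    match = re.search(r'https://\S+', compact)
    return match.group(0) if match else None

# A string taken from the error log above
print(recover_url('Estheticianhttps://en.m.wikipedia.org/wiki/Esthetician'))
# A string with no URL yields None
print(recover_url('no link here'))
```

Whether this recovers every failed page is untested; it only addresses strings where a title precedes the scheme.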

## Explore Scraped Text

In [None]:
# Todo: Write code here
...

Explain the code here!

## Convert Scraped Text into Corpus

In [55]:
def store_pickle_file(obj, filename):
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle_file(filename):
    with open(filename, 'rb') as f:
        obj = pickle.load(f)
    return obj

def remove_punctuation(text):
    """ 
    Use a regular expression to remove all characters that are not
    word characters (letters, digits, underscore) or whitespace.
    """ 
    return re.sub(r'[^\w\s]', '', text)

def smart_replace(big_string_):
    """ Replace fullstops not used to end a sentence. """
    # Type check
    if not isinstance(big_string_, str):
        raise TypeError("Param: 'big_string_' should be a string!")
    # Convert big_string_ to a list of chars
    chars = list(big_string_)
    # Replace fullstops not used to end sentences with commas
    #   and replace newline chars with spaces
    new_chars = []
    for i, c in enumerate(chars):
        # Treat the string boundaries as spaces to avoid indexing out of range
        prev_c = chars[i - 1] if i > 0 else ' '
        next_c = chars[i + 1] if i + 1 < len(chars) else ' '
        # Fullstop used to end sentence
        if c == '.' and next_c == ' ':
            new_chars.append(c)
        # Fullstop not used to end sentence
        elif c == '.' and prev_c != ' ':
            new_chars.append(',')
        elif c == '\n':
            new_chars.append(' ')
        else:
            new_chars.append(c)
    new_big_string_ = ''.join(new_chars)
    return new_big_string_

def extract_sentences(big_string_):
    """ Lowercase the text and separate it into sentences on fullstop chars. """
    rv = []
    sentences_ = big_string_.split('.')
    for sentence in sentences_:
        sentence = sentence.lower()
        sentence = remove_punctuation(sentence)
        if len(sentence.split()) >= 10:
            rv.append(sentence.split())
    return rv

def find_filenames(directory: str):
    filenames = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            filenames.append(os.path.join(root, file))
    return filenames

def main():
    # Create list of file names
    filenames = find_filenames("wcs")
    # Extract sentences from .txt files
    all_sentences = []
    for fn in filenames:
        words = read_words_from_file(fn, rtype='list')
        sentences = extract_sentences(' '.join(words))
        all_sentences.extend(sentences)
    # Create dir for corpus.pkl file
    os.makedirs("wcs_pkl", exist_ok=True)
    store_pickle_file(all_sentences, "wcs_pkl/corpus.pkl")


main()

The aim of this code is to create a pickle file containing a list of lists, where each inner list holds the lowercase words of one sentence scraped from the web. The code has a single entry point, `main`.

The `main` function uses the `find_filenames` function to walk the "wcs" directory and collect every file path. It then reads each file, joins its lines, and passes the text to `extract_sentences`, which splits it on full stops, lowercases it, removes punctuation, and keeps only sentences of at least 10 words. The resulting list of tokenized sentences is stored in "wcs_pkl/corpus.pkl".

The other functions are helpers. The `store_pickle_file` and `load_pickle_file` functions store and load pickle files. The `remove_punctuation` function removes every character that is not a word character (letter, digit, or underscore) or whitespace. The `smart_replace` function replaces full stops that do not end a sentence with commas, and newlines with spaces; it is defined here but `main` does not call it.
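To make the corpus format concrete, here is a minimal, self-contained sketch of the same splitting logic applied to a made-up string. `to_token_lists` is an illustrative stand-in for `extract_sentences`, and the sample text is invented.

```python
import re

def to_token_lists(text: str, min_words: int = 10):
    """Split text on full stops and return lowercase token lists."""
    token_lists = []
    for sentence in text.split('.'):
        # Lowercase, then drop everything except word characters and whitespace
        cleaned = re.sub(r'[^\w\s]', '', sentence.lower())
        tokens = cleaned.split()
        # Fragments shorter than min_words are discarded
        if len(tokens) >= min_words:
            token_lists.append(tokens)
    return token_lists

sample = ("Curiosity is a quality related to inquisitive thinking, such as "
          "exploration and learning. Too short. See e.g. this.")
corpus = to_token_lists(sample)
print(len(corpus))    # only the first sentence has ten or more words
print(corpus[0][:4])  # first four tokens of that sentence
```

The abbreviation "e.g." is split into separate fragments here, which is the kind of case `smart_replace` appears to be written for.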

## Explore Sentences

In [None]:
# Todo: Write code here
...

Explain the code here!

## Train W2V 

In [56]:
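def load_w2v(fn):
    """Load vectors from a '.bin.gz' word2vec file or a gensim '.model' file."""
    print(f"Loading Word2Vec model...\n")
    if '.bin.gz' in fn:
        rv = KeyedVectors.load_word2vec_format(fn, binary=True)
    elif '.model' in fn:
        rv = Word2Vec.load(fn)
    else:
        # Fail loudly instead of returning None and reporting success
        raise ValueError(f"Unsupported model file: '{fn}'")
    print(f"Loaded Word2Vec model!\n")
    return rv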

def train_model(corpus_):
    # Train the word2vec model on the corpus
    model_ = gensim.models.Word2Vec(corpus_)  # vector_size=300
    return model_

def save_model(model_, file_path):
    """ 
    Save the trained model to disk.

    :param model_: the trained Word2Vec model
    :param file_path: string - location and name to save to 
    """
    model_.save(file_path)

def main():
    corpus = load_pickle_file("wcs_pkl/corpus.pkl")
    trained_model = train_model(corpus)
    if not os.path.exists("model_weights"):
        os.makedirs("model_weights")
    save_model(trained_model, "model_weights/trained_model.model")
    # Mark EOF
    pass


main()

This code defines three helpers and a `main` function:

1. Loading a Word2Vec model. The function `load_w2v` takes a file name and loads the model stored in that file: a '.bin.gz' file is read as binary word2vec-format vectors, and a '.model' file is read as a model saved by gensim. `main` does not call it here; it is used in the set expansion step.

2. Training a new Word2Vec model. The `train_model` function takes a corpus as an argument and trains a new Word2Vec model on that corpus with gensim's default settings.

3. Saving the trained model. The `save_model` function takes a model and a file path as arguments, and saves the model to the specified file.

The `main` function ties the last two tasks together: it loads the corpus from the pickle file, trains a new Word2Vec model on it, and saves the trained model to "model_weights/trained_model.model".
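Because the model is trained with default settings, gensim's `min_count` parameter (5 by default in gensim 4.x, as far as I know) causes words that occur fewer than five times in the corpus to be left out of the vocabulary. The sketch below shows one way to check which entity words would fall below that threshold. `words_below_min_count` and the toy corpus are hypothetical and use only the standard library.

```python
from collections import Counter

def words_below_min_count(corpus, entity_words, min_count=5):
    """Return entity words that occur fewer than min_count times in corpus.

    corpus: list of token lists, the same shape as the pickled corpus.
    """
    counts = Counter(token for sentence in corpus for token in sentence)
    return sorted(w for w in entity_words if counts[w.lower()] < min_count)

# Toy corpus: 'curious' appears five times, 'droll' twice, 'wise' never
toy_corpus = [["a", "curious", "person", "asks"]] * 5 + [["a", "droll", "remark"]] * 2
print(words_below_min_count(toy_corpus, {"Curious", "Droll", "Wise"}))
```

Running a check like this on the real corpus may help explain why some entity words have no vector after training.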

## Perform Set Expansion

In [63]:
def list_info(arr):
    """ Print information about the passed list or set. """
    if isinstance(arr, list):
        print(f"List contains {len(arr)} items.")
    elif isinstance(arr, set):
        print(f"Set contains {len(arr)} items.")
    print(f"First 5 items:[")
    for item in list(arr)[:5]:
        print(f"'{item}',")
    print(f"]\n")
    # Mark EOF
    pass

def write_words_to_file(file_path, arr):
    # Open and write to file path passed
    with open(file_path, 'w', encoding='utf-8-sig') as f:
        string = '\n'.join(arr)
        f.write(string)
    # Mark EOF
    pass

def expand_set(model_fp='google_w2v_weights/GoogleNews-'
                        'vectors-negative300.bin.gz',
               entity_set_fp='Dissertation/M6/'\
                        'data_prep/entity_set.txt',
               entity_set=None,
               out=True,
               output_fn='expansion_set_2'):
    """
    This function loads a specified W2V model and reads the entity set
    from a text file (one word per line).
    Then, for each word in the entity set, it appends up to three
    "word, similar word" strings built from the model's most similar words.

    :param entity_set: set - the set of words to expand; only used when
    entity_set_fp is None.
    :param model_fp: str - file path pointing to weights to be loaded.
    :param entity_set_fp: (str||None) - file path pointing to set to be loaded.
    :param out: bool - if True the function prints progress info; the final
    summary is always printed.
    :param output_fn: str - file name that the expanded set is written to
    :return: a list of strings of the form "word, similar word", with up to
    three entries per entity word.
    """
    # Load a pre-trained Word2Vec model
    model = load_w2v(fn=model_fp)

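    # Set state representing model type. Despite its name, `pre_trained` is
    # True for a gensim '.model' file (the model trained in this notebook),
    # whose vectors are accessed through `model.wv`.
    pre_trained = '.model' in model_fp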

    # Define the initial entity set
    if entity_set_fp is not None:
        entity_set = read_words_from_file(entity_set_fp)
    if out:
        print(f"Entity set as a Python Set info: ")
        list_info(entity_set)

    # Convert set to array
    entity_arr = list(entity_set)
    if out:
        print(f"Entity set as a Python List info: ")
        list_info(entity_arr)
        
    # Expand the entity set
    expansion_arr = []
    e_counter = 0
    for word in entity_arr:
        try:
            if not pre_trained:
                similar_words_list = model.most_similar(word.lower())
            else:
                similar_words_list = model.wv.most_similar(word.lower())
        except KeyError:
            # Word is missing from the model's vocabulary; count it even when
            # `out` is False, because the summary below is always printed
            e_counter += 1
            continue
        # Keep up to three similar words that differ from the query word
        similar_words_added = 0
        for similar_word, _score in similar_words_list:
            if similar_words_added == 3:
                break
            if word.lower() != similar_word.lower():
                expansion_arr.append(f"{word}, {similar_word}")
                similar_words_added += 1
    # The expanded entity set
    if out:
        print(f"Expanded list info: ")
        list_info(expansion_arr)
    write_words_to_file(f"{output_fn}", expansion_arr)
    print(
        f"{e_counter} words were not present in the W2V-model's vocabulary.\n"
        f"Written expanded set to the following dir: '{output_fn}'"
    )
    return expansion_arr

def main():
    expand_set(
        model_fp='model_weights/trained_model.model',
        output_fn='expanded_set_1.txt'
    )
    # Mark EOF
    pass


main()

Loading Word2Vec model...

Loaded Word2Vec model!

Entity set as a Python Set info: 
Set contains 637 items.
First 5 items:[
'Impressive',
'Charmless',
'Cynical',
'Dull',
'Offhand',
]

Entity set as a Python List info: 
List contains 637 items.
First 5 items:[
'Impressive',
'Charmless',
'Cynical',
'Dull',
'Offhand',
]

Expanded list info: 
List contains 1530 items.
First 5 items:[
'Impressive, 181920',
'Impressive, adjective',
'Impressive, impersonal',
'Charmless, midfielder',
'Charmless, jotaro',
]

127 words were not present in the W2V-model's vocabulary.
Written expanded set to the following dir: 'expanded_set_1.txt'


The aim of this code is to expand a set of words (referred to as the entity set) by finding the most similar words for each word in the entity set using our trained Word2Vec model.

The code first loads the Word2Vec model and reads in the entity set from a file. It then converts the set to a list and, for each word, looks up the most similar words in the model. Words that are missing from the model's vocabulary are skipped and counted; here that was 127 of the 637 words.

Finally, it writes the results to a file as a list of "word, similar word" lines, with up to three similar words per entity word. The first few results shown above (for example 'Impressive, 181920' and 'Charmless, midfielder') suggest that many of the returned neighbours are not related to personality traits.
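For intuition, the similarity lookup can be sketched without gensim. gensim's `most_similar` ranks words by cosine similarity between their vectors; the example below does the same over a hand-made dictionary. The three-dimensional vectors are invented for illustration and are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, for illustration only
toy_vectors = {
    "cheerful": [0.9, 0.1, 0.0],
    "happy": [0.8, 0.2, 0.1],
    "gloomy": [-0.7, 0.3, 0.1],
    "midfielder": [0.0, 0.1, 0.9],
}
for neighbour, score in most_similar(toy_vectors, "cheerful"):
    print(f"cheerful, {neighbour} ({score:.2f})")
```

The ranking always returns `topn` neighbours, however low their scores, so an unrelated word can still appear in the expanded set when nothing better is in the vocabulary.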

## Expanded Set Manual Labelling 

In my GitHub repository, which is cloned at the start of this notebook, I have manually labelled the first 120 outputs according to whether the word generated by W2V is a synonym, an antonym, or otherwise related to the original word. Each line ends with a '1' if the generated word is a synonym, antonym, or related word, and a '0' otherwise.

This labelling can be found in 'Dissertation/M6/output_set_expansion/labelled/eset_mit_wiki_.txt'.

In [70]:
# /content/Dissertation/M6/output_set_expansion/labelled/eset_mit_wiki_.txt
with open('Dissertation/M6/output_set_expansion/labelled/eset_mit_wiki_.txt',
          "r") as f:
    lines = f.readlines()

list_info(lines)

List contains 120 items.
First 5 items:[
'﻿Sociable, radians, 0
',
'Sociable, modulo, 0
',
'Sociable, triplelevel, 0
',
'Energetic, dopamine, 1
',
'Energetic, legumes, 0
',
]



## Evaluation Metrics
#### Recall, Precision, F1-Score.

In [68]:
def load_y():
    # Read lines from file
    words = read_words_from_file(
        f"Dissertation/M6/"
        f"output_set_expansion/labelled/"
        f"eset_mit_wiki_.txt",
        rtype='list')
    y_pred_ = []
    # Extract y prediction value from each line
    for word in words:
        y = word.split()[-1]
        y_pred_.append(int(y))
    # Return y predictions read from file
    return y_pred_

def main():
    y_pred = load_y()
    y_true = [1 for _ in range(len(y_pred))]

    if len(y_pred) != len(y_true):
        raise ValueError("Lists `y_pred` and `y_true` must have the same length.")

    # Calculate precision
    precision = precision_score(y_true, y_pred)

    # Calculate recall
    recall = recall_score(y_true, y_pred)

    # Calculate F1 score
    f1 = f1_score(y_true, y_pred)

    print("Recall:", recall)
    print("Precision:", precision)
    print("F1 Score:", f1)


main()

Recall at k: 0.36666666666666664
Precision at k: 1.0
F1 Score at k: 0.5365853658536585


This code evaluates the expanded set of words by calculating the precision, recall, and F1 score.

The code reads the manually assigned labels from the file and stores them in the list `y_pred`. The list `y_true` is created as a list of ones with the same length as `y_pred`.

The code then uses the `precision_score`, `recall_score`, and `f1_score` functions from the scikit-learn library to calculate precision, recall, and F1 score respectively. These metrics are commonly used to evaluate the performance of a binary classifier. Because `y_true` contains no zeros, no false positives are possible and precision is 1.0 by construction. The recall value of 0.367 is the share of the 120 labelled outputs marked '1' (44 of 120).

Finally, the calculated values of precision, recall, and F1 score are printed to the console.
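An alternative framing, sketched below under stated assumptions, treats the manual labels as the ground truth and every generated word as a positive prediction. The label counts (44 ones, 76 zeros) are inferred from the reported value 0.3667 over 120 lines rather than read from the file, and `precision_recall_f1` is a hypothetical plain-Python helper, not the scikit-learn implementation.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Assumed counts: 44 outputs labelled relevant, 76 labelled irrelevant
labels = [1] * 44 + [0] * 76
# Every generated word counts as a positive prediction
predictions = [1] * len(labels)

precision, recall, f1 = precision_recall_f1(labels, predictions)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

Under this framing the precision and recall values swap while the F1 score stays the same. Recall is then trivially 1.0, because only the model's own outputs were labelled and no missed relevant words are counted.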