# Dissertation Set Expansion Paper

This notebook contains my coded attempt at reproducing the Word2Vec method from the paper: 

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8999141.

## 0. Setting Up Environment and Importing Dependencies

### 0.1 Download Files from my Dissertation Git Repo

In [None]:
!git clone https://github.com/LeeTaylorNewcastle/Dissertation

Cloning into 'Dissertation'...
remote: Enumerating objects: 2063, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 2063 (delta 2), reused 13 (delta 2), pack-reused 2048
Receiving objects: 100% (2063/2063), 173.61 MiB | 10.84 MiB/s, done.
Resolving deltas: 100% (287/287), done.
Updating files: 100% (2690/2690), done.


### 0.2 Imports & Installations

In [None]:
!pip install bs4
!pip install html5lib
!pip install gensim
!pip install scikit-learn
!pip install tqdm
!pip install gdown
!pip install tabulate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1270 sha256=8a4ef46e94227370e60e5c9f8d218282e6cb7d0c48b8611429bcbc57482cda06
  Stored in directory: /root/.cache/pip/wheels/73/2b/cb/099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
from bs4 import BeautifulSoup as bs
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from typing import List
from sklearn.metrics import confusion_matrix
from tabulate import tabulate
import xml.etree.ElementTree as ET
import numpy as np
import requests
import pickle
import string
import gensim
import os
import re
import nltk
import tqdm
import shutil
import gdown

### 0.3 Mount Google Drive 
Files generated by this notebook are only available until the Colab session ends, so this step mounts Google Drive to allow them to be copied into a personal folder.

In [None]:
# Import and mount gdrive
from google.colab import drive
drive.mount('/content/gdrive')

# Create path
drive_folder_path = '/content/gdrive/MyDrive/_Mount_Dissertation'
os.makedirs(drive_folder_path, exist_ok=True)

Mounted at /content/gdrive
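
As a minimal sketch of how generated files could be copied into the mounted folder, the helper below copies a whole directory into a destination root. The name `backup_dir` is hypothetical and is not part of this notebook; temporary folders stand in for the real notebook and Drive paths so the example runs anywhere.

```python
import os
import shutil
import tempfile

def backup_dir(src_dir: str, dst_root: str) -> str:
    """Copy `src_dir` (and its contents) into `dst_root`; return the new path."""
    dst_dir = os.path.join(dst_root, os.path.basename(os.path.normpath(src_dir)))
    # dirs_exist_ok=True lets repeated runs overwrite an earlier backup (Python 3.8+)
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
    return dst_dir

# Demonstration with temporary folders standing in for the notebook and Drive paths
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "pkl_files")
    drive = os.path.join(tmp, "drive")
    os.makedirs(src)
    os.makedirs(drive)
    with open(os.path.join(src, "example.pkl"), "wb") as f:
        f.write(b"demo")
    copied_to = backup_dir(src, drive)
    copied_files = os.listdir(copied_to)
```

In this notebook the equivalent call would presumably be `backup_dir(pkl_dir, drive_folder_path)`, run after the files have been generated.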


### 0.4 Create folders

In [None]:
pkl_dir = 'pkl_files'
wcs_dir = 'wcs'


dirs_ = [pkl_dir, wcs_dir, 'labelled_', 'model_weights']


for dir_path in dirs_:
    os.makedirs(dir_path, exist_ok=True)

## 1. Acquire and Pre-process Datasets

### 1.1 Web Crawling and Scraping Functions

In [None]:
INTERVAL = 3600
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
}

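def url_to_soup_obj(url: str):
    """Download `url` and return it as a BeautifulSoup object, or None on failure."""
    try:
        page = requests.get(url, headers=HEADERS)
    except requests.RequestException:
        # Request errors are swallowed; callers must handle a None return
        return None
    return bs(page.content, 'html5lib')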

def extract_links(soup):
    links = []
    keywords = [
        'Help:', 'Main_Page', 'Talk:', 'Category:', 
        'Wikipedia:', 'Special:', 'Portal:', 'File:',
        '#', 'Template:'
    ]  

    for link in soup.find_all('a', href=True):
        if link['href'].startswith('/wiki/') and \
        not any(keyword.lower() in link['href'].lower() for keyword in keywords):
            links.append('https://en.wikipedia.org' + link['href'])

    return links

def extract_elements(html_text, elm='p'):
    soup = bs(html_text, 'html.parser')
    paragraphs = []
    for p in soup.find_all(elm):
        paragraphs.append(p.text.strip())
    return paragraphs

def google_search_wikipedia(search_str: str, debug: bool = True):
    # Convert search to google search URL
    gsearch_url = f"https://www.google.com/search?q={'+'.join(search_str.lower().split())}"
    # Generate 'soup' object of google search
    gsearch_soup = url_to_soup_obj(gsearch_url) # Todo: error
    # Extract URLs
    elms = extract_elements(str(gsearch_soup), elm='cite')
    href = extract_elements(str(gsearch_soup), elm='a')
    # Store 'cite' and 'a' elements in elms list
    elms.extend(href)
    # For-loop extracts URLs into list
    urls_ = []
    for text in elms:
        # Skip blank strings
        if text == '':
            continue
        # First token must contain 'https:'; skip truncated ('...') and category results
        if "https:" in text.split()[0] and \
                '...' not in text.split()[-1] and \
                'category' not in text.lower():
            # Remove spaces and convert breadcrumb arrows to slashes to form a URL
            text = text.replace(' ', '')
            urls_.append(text.replace('›', '/'))
    # Return list of URLs
    return list(set(urls_))

def read_words_from_file(file_path, 
                         encoding='utf-8-sig',
                         rtype='set'):
    with open(file_path, 'r', encoding=encoding) as f:
        words = f.read().splitlines()
        if rtype == 'set':
            words = set(words)
    return words

def remove_code(text):
    # Remove anything that looks like an HTML tag (leftover markup and script tags)
    clean_text = re.sub(r'<.*?>', '', text)
    # Return the resulting text
    return clean_text

def extract_text(soup, headers=True, debug=True):
    # Find all HTML elements that contain the main content
    if headers:
        content_elements = soup.find_all(["p", "h1", "h2", "h3", "h4", "h5",
                                          "h6", "a", "li", "span", "strong", "em"])
    else:
        content_elements = soup.find_all(["p", "a", "li", "span", "strong", "em"])
    # Concatenate the text from all content elements
    content = [element.text.strip() for element in content_elements]
    for i, v in enumerate(content):
        content[i] = v.replace('\n', '')
        content[i] = content[i].replace('  ', '')
    # Remove blanks and elements with fewer than 10 words
    content = [element for element in content if element.strip() != ''] 
    content = [element for element in content if len(element.split()) > 9]  
    # Remove sentences based on the `keywords_to_filter` list 
    keywords_to_filter = ['site', 'cookie', 'sign in', 'instagram', 'contact us']
    content = [element for element in content if not any(
        keyword in element.lower() for keyword in keywords_to_filter
    )]
    # Combine content into a string
    content = '\n'.join(set(content))
    content = remove_code(content)
    # Return the resulting text
    return content

def query_to_text(search_str, page_depth_limit=3):
    """ 
    Given a search query, this function retrieves the top 
    `page_depth_limit` Wikipedia pages from Google search results, 
    extracts their readable text, and returns a list
    of the extracted text for each page.

    Note: this function is called for every word in the set of
    words describing personality traits `entity set` from MIT. 

    Args:
        search_str (str): The search query for which to find Wikipedia pages.
        page_depth_limit (int, optional): The maximum number of top Wikipedia pages
            to extract text from. Defaults to 3.

    Returns:
        list[str]: A list of strings containing the extracted text from each
            Wikipedia page.
    """
    # Define storage for extracted text
    rv = []
    
    # Extract URLs to scrape from
    urls = google_search_wikipedia(search_str=search_str)
    urls = urls[:page_depth_limit]
    
    # Set page limit 
    page_limit = 10
    page_count = 0
    
    # Prevent checking the same site
    explored = []

    # Scrape extracted readable text
    while len(urls) != 0 and page_count != page_limit:
        
        # Remove current link 
        url = urls.pop(0)

        # Ignore duplicate websites (case-insensitive); check before recording the URL
        if url.lower() in explored:
            continue
        explored.append(url.lower())

        # Get `soup` from URL string
        soup_obj = url_to_soup_obj(url)
        
        # Prevent error 
        if soup_obj is None:
            continue
        
        # Extract Wikipedia links from the page
        # `set` to `list` type casting to remove duplicates
        wiki_links = list(set(extract_links(soup_obj)))

        # Explore wikipedia links from the page
        for url_str in wiki_links[:10]:
            urls.append(url_str)
        
        # Add text from website to `rv` (return value) list
        rv.append(extract_text(soup_obj))

        # Increase page_count to terminate the while loop
        page_count += 1
        
    # Mark EOF
    return rv

def write_to_file(fn, text):
    with open(fn, "w", encoding='utf-8-sig') as f:
        for elm in text:
            f.write(str(elm) + '\n')

def copy_files(src_dir, dst_dir):
    # Get a list of all files in the source directory
    files = os.listdir(src_dir)

    # Copy each file from the source directory to the destination directory
    for file_name in files:
        # Construct the full file path
        src_file = os.path.join(src_dir, file_name)
        dst_file = os.path.join(dst_dir, file_name)

        # Copy the file
        shutil.copy(src_file, dst_file)

    # Mark EOF
    pass

def main(use_git_clone=False):
    """ This function uses all functions to perform
    webpage exploration and text scraping from the resulting 
    webpage. """

    # Read list of words to search
    entity_set = read_words_from_file(
      f'Dissertation/M6/data_prep/entity_set.txt'
    )
    
    # Entity set info
    print(
        f"Set of words describing personality traits info.\n"
        f"Number of words:  {len(entity_set)}\n"
        f"First five words: {list(entity_set)[:5]}\n"
        f"Source: http://ideonomy.mit.edu/essays/traits.html\n"
    )

    """ Parse webpage for text
    `counter` counts number of pages downloaded and scraped
    `e_counter` counts number of pages failed
    `dir_` directory path to store web-scrapings 
    """
    counter, e_counter = 0, 0
    dir_ = 'wcs'

    # Make a dir. for parsed webpages text
    if not os.path.exists(dir_):
        os.makedirs(dir_)

    # Save execution time
    if use_git_clone:
        copy_files('Dissertation/M8/wcs', 'wcs')

    """ Google search wikipedia for each term from 
    the MIT entity set and scrape text from each 
    resulting page 
    """
    for word in tqdm.tqdm(list(entity_set)[:]):
        # Increase counter (decrease later if failed)
        counter += 1
        # Form search term to input into Google
        search_term = f'wikipedia {word}'
        # Saves time if code has already been executed
        if os.path.isfile(f"{dir_}/{search_term}.txt"):
            continue
        try:
            parsed_webpage_text = query_to_text(search_term)
            # Write text from webpage to .txt file inside above dir
            write_to_file(f"{dir_}/{search_term}.txt", parsed_webpage_text)
        except ValueError as e:
            print(e)
            counter -= 1
            e_counter += 1
            # print(f"'{word}' could not be written to a file!")
    print(f"\nSuccessfully scraped {counter} terms. {e_counter} terms failed.")

    # Mark EOF
    pass


main(use_git_clone=True)

Set of words describing personality traits info.
Number of words:  637
First five words: ['Stiff', 'Disrespectful', 'Amoral', 'Bland', 'Conciliatory']
Source: http://ideonomy.mit.edu/essays/traits.html



100%|██████████| 637/637 [00:00<00:00, 68748.47it/s]


Successfully scraped 637 terms. 0 terms failed.





The aim of this code is to collect text about personality traits from Wikipedia and store it in a structured manner for later analysis and set expansion.

The code starts by reading a list of words that describe personality traits from a file and printing information about the list. Then, for each word in the list, the program performs a Google search to find relevant Wikipedia pages, follows Wikipedia links found on those pages (up to a limit of 10 pages per word), extracts the readable text from the HTML content, and writes the extracted text to a separate text file named after the search term.

When `use_git_clone=True`, previously scraped files are first copied from the cloned repository into `wcs`, so the loop skips every term whose file already exists. This is why the progress bar above completes almost instantly.
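
The crawl loop inside `query_to_text` can be sketched in isolation as a bounded breadth-first traversal with a duplicate check. This is an illustrative sketch rather than the notebook's implementation: `bounded_crawl`, `get_links`, and `toy_graph` are hypothetical names, and a toy link graph replaces real network requests.

```python
from collections import deque
from typing import Callable, Dict, List

def bounded_crawl(seeds: List[str],
                  get_links: Callable[[str], List[str]],
                  page_limit: int = 10,
                  links_per_page: int = 10) -> List[str]:
    """Visit pages breadth-first, skipping duplicates, up to `page_limit` pages."""
    queue = deque(seeds)
    explored = set()
    visited_order = []
    while queue and len(visited_order) < page_limit:
        url = queue.popleft()
        key = url.lower()
        # Check for duplicates before recording the URL as explored
        if key in explored:
            continue
        explored.add(key)
        visited_order.append(url)
        # Queue a capped number of outgoing links for later exploration
        queue.extend(get_links(url)[:links_per_page])
    return visited_order

# Toy link graph standing in for real Wikipedia pages (hypothetical data)
toy_graph: Dict[str, List[str]] = {
    "A": ["B", "C", "a"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}
order = bounded_crawl(["A"], lambda u: toy_graph.get(u, []), page_limit=3)
```

Using a `set` for `explored` keeps the duplicate check constant-time, and comparing lowercased keys means `"a"` is treated as a repeat of `"A"`.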

### 1.2 Create Test Data From Thesaurus.com

In [None]:
def store_pickle_file(obj, filename):
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)

def extract_elements(html_text, elm='p'):
    """
    Given HTML text and an HTML element tag, return a list of all text elements within the specified tag.
    :param html_text: str, the HTML text
    :param elm: str, the HTML element tag to search for (default 'p')
    :return: list of str, the text within the specified HTML element tag
    """
    # Use BeautifulSoup to parse the HTML text and find all elements with the specified tag
    soup = bs(html_text, 'html.parser')
    elements = soup.find_all(elm)
    # Extract the text from each element and add it to a list
    text_list = [element.text.strip() for element in elements]
    # Return the list of text elements
    return text_list

def extract_word_grid(soup):
    """
    Given a BeautifulSoup object of a webpage, find and return the list of words in the word grid.
    :param soup: BeautifulSoup object, the parsed HTML of the webpage
    :return: list of str, the words in the word grid
    """
    try:
        # Find the div with the 'word-grid-container' data-testid attribute
        word_grid_div = soup.find('div', {'data-testid': 'word-grid-container'})
        # Find all links within the div
        syn_list = word_grid_div.find_all('a')
        # Extract the text from each link and add it to a list
        word_list = [link.text.strip() for link in syn_list]
    except AttributeError:
        # Raised when the page failed to download or the word grid div is missing
        word_list = []

    try:
        # Find the div with the 'antonyms' id
        antonyms_div = soup.find('div', {'id': 'antonyms'})
        # Find all links within the div
        ant_list = antonyms_div.find_all('a')
        # Extract the text from each link and add it to a list
        antonyms_list = [link.text.strip() for link in ant_list]
    except AttributeError:
        # Raised when the page failed to download or the antonyms div is missing
        antonyms_list = []

    # Return the synonyms and antonyms as a pair of lists
    return [word_list, antonyms_list]

def main(use_git_clone=False):
    # Save time
    if use_git_clone:
        test_dataset_dest = 'Dissertation/M8/related_words_matrix.pkl'
        if os.path.exists(test_dataset_dest):
            # Copy file from clone to working directory
            shutil.copyfile(test_dataset_dest, 
                            f"{pkl_dir}/related_words_matrix.pkl")
            print("Test Dataset successfully copied from cloned repo!")
            return

    # Read list of words to search
    entity_set = read_words_from_file(
      f'Dissertation/M6/data_prep/entity_set.txt'
    )

    related_words_matrix = []

    # Extract synonyms and antonyms
    thesaurus_url = 'https://www.thesaurus.com/browse/'
    for word in tqdm.tqdm(list(entity_set)[:]):
        # Form and download page from URL
        final_url = thesaurus_url + word.lower()
        soup = url_to_soup_obj(final_url)
        # Extract synonyms and antonyms
        related_words = extract_word_grid(soup)
        related_words.insert(0, [word.lower()])
        related_words_matrix.append(related_words)
        # # Check extracted values
        # for arr in related_words:
        #     print(arr)
        # print()

    # # Make a dir. for pickle files
    # if not os.path.exists(dir_):
    #     os.makedirs(dir_)

    store_pickle_file(related_words_matrix, f'{pkl_dir}/related_words_matrix.pkl')

    # Mark EOF
    pass


main(use_git_clone=False)

100%|██████████| 637/637 [03:22<00:00,  3.14it/s]


The code is designed to extract synonyms and antonyms of a given set of words from the website Thesaurus.com. The script performs the following tasks:

1. `store_pickle_file(obj, filename)`: This function stores an object (usually the extracted words) as a pickle file with the given filename. Pickle files are used for efficient storage and retrieval of Python objects.

2. `extract_elements(html_text, elm='p')`: This function takes HTML text as input and returns a list of text elements within the specified HTML tag (default is 'p'). It uses BeautifulSoup to parse the HTML and extract the text from the specified elements.

3. `extract_word_grid(soup)`: Given a BeautifulSoup object representing a parsed webpage, this function returns two lists: the synonyms found in the word grid and the antonyms found in the antonyms section. It searches for specific div elements and extracts the text from the links within them. If either div is missing, the corresponding list is empty.

4. `main(use_git_clone=False)`: This is the main function that drives the script. It performs the following steps:  
a. Reads a list of words (the entity set) from a file.  
b. Initializes an empty list to store the related words matrix.  
c. Loops through each word in the entity set, constructs the URL for the Thesaurus.com page of that word, and downloads the webpage.  
d. Parses the webpage and extracts the synonyms and antonyms using the `extract_word_grid()` function.  
e. Inserts the lowercased word at the front of the extracted lists and appends the result (word, synonyms, antonyms) to the related words matrix.  
f. Stores the related words matrix as a pickle file for later use.  

The purpose of this code is to build a dataset of synonyms and antonyms for a given set of words. 
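
To make the layout of the related words matrix concrete, the sketch below converts a small matrix in the same `[[word], synonyms, antonyms]` shape into a dictionary keyed by headword. The entries in `sample_matrix` are invented for illustration and `matrix_to_lookup` is a hypothetical helper, not a function defined in this notebook.

```python
from typing import Dict, List

# Hypothetical entries in the same [[word], synonyms, antonyms] layout
sample_matrix: List[List[List[str]]] = [
    [["stiff"], ["rigid", "firm"], ["flexible"]],
    [["bland"], ["dull"], []],
]

def matrix_to_lookup(matrix: List[List[List[str]]]) -> Dict[str, Dict[str, List[str]]]:
    """Map each headword to a dict holding its synonyms and antonyms."""
    lookup: Dict[str, Dict[str, List[str]]] = {}
    for headword, synonyms, antonyms in matrix:
        # The headword is stored as a one-element list, so unwrap it
        lookup[headword[0]] = {"synonyms": synonyms, "antonyms": antonyms}
    return lookup

lookup = matrix_to_lookup(sample_matrix)
```

A dictionary like this makes it cheap to check whether a candidate word is a known synonym of a seed word, although the notebook itself stores and loads the list-of-lists form.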

### 1.3 Convert Scraped Text into Corpus

In [None]:
def load_pickle_file(filename):
    with open(filename, 'rb') as f:
        obj = pickle.load(f)
    return obj

def remove_punctuation(text):
    """ 
    Use a regular expression to remove all characters that are
    not word characters (letters, digits, underscore) or whitespace.
    """ 
    return re.sub(r'[^\w\s]', '', text)

def smart_replace(big_string_):
    """ Replace fullstops not used to end a sentence. """
    # Type check
    if not isinstance(big_string_, str):
        raise TypeError("Param: 'big_string' should be a string!")
    # Convert big_string to a list of chars
    chars = list(big_string_)
    # Remove fullstops not used to end sentences
    #   and remove newline chars
    new_chars = []
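    n = len(chars)
    for i, c in enumerate(chars):
        # Neighbouring characters; string boundaries are treated as spaces
        prev_c = chars[i-1] if i > 0 else ' '
        next_c = chars[i+1] if i + 1 < n else ' '
        # Fullstop used to end sentence
        if c == '.' and next_c == ' ':
            new_chars.append(c)
        # Fullstop not used to end sentence
        elif c == '.' and prev_c != ' ' and next_c != ' ':
            new_chars.append(',')
        elif c == '\n':
            new_chars.append(' ')
        else:
            new_chars.append(c)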
    new_big_string_ = ''.join(new_chars)
    return new_big_string_

def extract_sentences(big_string_):
    """ Lowercase the text and separate it into sentences on fullstop chars. """
    rv = []
    sentences_ = big_string_.split('.')
    for sentence in sentences_:
        sentence = sentence.lower()
        sentence = remove_punctuation(sentence)
        if len(sentence.split()) >= 10:
            rv.append(sentence.split())
    return rv

def find_filenames(directory: str):
    filenames = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            filenames.append(os.path.join(root, file))
    return filenames

def main():
    # Create list of file names
    filenames = find_filenames("wcs")
    # Extract sentences from .txt files
    all_sentences = []
    for fn in filenames:
        words = read_words_from_file(fn, rtype='list')
        sentences = extract_sentences(' '.join(words))
        all_sentences.extend(sentences)
    # Create dir for corpus.pkl file
    dir_ = 'wcs_pkl'
    if not os.path.exists(dir_):
        os.makedirs(dir_)
    store_pickle_file(all_sentences, f"{dir_}/corpus.pkl")
    print(
        f"Successfully combined wikipedia data into a corpus stored at:\n"
        f"{dir_}/corpus.pkl")
    # Mark EOF
    pass


main()

Successfully combined wikipedia data into a corpus stored at:
wcs_pkl/corpus.pkl


The aim of this code is to create a pickle file containing a list of lists, where each inner list holds the words of one sentence obtained from web crawling and scraping Wikipedia. The work is driven by `main`, which relies on several helper functions.

The `main` function uses `find_filenames` to search the `wcs` directory and get a list of all the filenames. It then reads the contents of each file, splits the text into sentences, lowercases them, removes punctuation, and keeps only sentences of at least 10 words. The resulting list of tokenised sentences is stored in a pickle file at `wcs_pkl/corpus.pkl`.

The other functions are helpers that process the text data. `store_pickle_file` and `load_pickle_file` store and load pickle files. `remove_punctuation` removes every character that is not a word character or whitespace. `smart_replace` replaces full stops not used to end a sentence with a comma; it is defined here but not called by `main`. `extract_sentences` lowercases the text and separates it into sentences on full stop characters.
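
The sentence extraction step can be illustrated on a short string. The sketch below mirrors the logic of `extract_sentences` but exposes the minimum sentence length as a parameter; `split_into_token_lists` and `min_words` are names assumed for this example only.

```python
import re
from typing import List

def split_into_token_lists(text: str, min_words: int = 10) -> List[List[str]]:
    """Split on full stops, lowercase, strip punctuation, drop short sentences."""
    token_lists = []
    for sentence in text.split('.'):
        # Keep only word characters and whitespace
        cleaned = re.sub(r'[^\w\s]', '', sentence.lower())
        tokens = cleaned.split()
        if len(tokens) >= min_words:
            token_lists.append(tokens)
    return token_lists

sample = "Kindness is a trait. It's valued, widely. Ok."
tokens = split_into_token_lists(sample, min_words=3)
```

With `min_words=3` the one-word fragment "Ok" is dropped, which is the same filtering that removes short fragments at the notebook's threshold of 10 words.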

### 1.4 Reading XML Books into Python Object

In [None]:
def store_as_pickle(obj, filename):
    """Stores a Python object as a pickle file."""
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)

def load_from_pickle(filename):
    """Loads a Python object from a pickle file."""
    with open(filename, 'rb') as f:
        obj = pickle.load(f)
    return obj

def list_filenames(directory):
    # Each file name in a passed directory is appended to 
    #  a list and returned by this function
    filenames = []
    for filename in os.listdir(directory):
        filenames.append(f"{directory}/{filename}")
    return filenames

def print_list(arr, limit=None):
    # Print the number of items contained in the list
    print(f"List contains {len(arr)} items.\n[")
    # Print each item and its type, e.g. int, str
    for item in arr[:limit]:
        print(f"    {type(item).__name__}: '{item}',")
    # Close the printed list
    print("]")
    # State how many items were printed (if the user specified a limit)
    if limit is not None:
        print(f"First {limit} items shown above this line.\n")

def print_matrix(matrix, m_lim=None, v_lim=None):
    # Call `print_list` for each list in the matrix
    print(f"\nMatrix contains {len(matrix)} list(s).")
    print("---\n<Begin Matrix>")
    for arr in matrix[:m_lim]:
        print_list(arr, v_lim)
    print("<End Matrix>\n---\n")
    print(f"Printed {len(matrix[:m_lim])} list(s) out of {len(matrix)} list(s).\n")
    # Mark EOF
    pass

def read_xml(fn):
    # Load the XML file
    tree = ET.parse(fn)
    root = tree.getroot()
    # Initialize the list
    text_list = []
    # Iterate through the 's' elements and add the text to the list
    for s in root.iter('s'):
        text_list.append(s.text)
    # Return sentences
    return text_list

def read_all_books():
    books_dir = "Dissertation/M8/opus.nlpl.eu.books.php/books_xml"
    # Read the sentences from each XML book
    sentences = []
    for fn in list_filenames(books_dir):
        sentences.extend(read_xml(fn))
    return sentences

def main():
    # Showcase function list_filenames
    print(f"List containing book file paths info:")
    print_list(list_filenames(
        "Dissertation/M8/opus.nlpl.eu.books.php/books_xml"
        ), 
        10)

    # Read sentences from all books
    u_sentences = read_all_books()  # 'u_' = un-processed
    print(f"List containing sentences from all books info:")
    print_list(u_sentences, 10)

    # Store sentences
    store_as_pickle(u_sentences, 'u_sentences.pkl')

    # Mark EOF
    pass


main()

List containing book file paths info:
List contains 42 items.
[
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Tolstoy_Leo-Anna_Karenina_vol2.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Zola_Emile-Therese_Raquin.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Defoe_Daniel-Moll_Flanders.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Kafka_Franz-Prozess.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Zola_Emile-Germinal.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Kafka_Franz-Verwandlung.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Cervantes_Miguel-Don_Quijote.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Verne_Jules-Forceurs_de_blocus.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Doyle_Arthur_Conan-Great_Shadow.xml',
    str: 'Dissertation/M8/opus.nlpl.eu.books.php/books_xml/Verne_Jules-Ile_mysterieuse.xml',
]
First 10 items s

#### Overview of output generated by `main()` 

* The first section of the output comes from calling `print_list` on the result of `list_filenames`, which lists the file paths of all the book XML files in the specified directory. The list contains 42 items (i.e., there are 42 book XML files in the directory), each of type `str`.
* The next line states that only the first 10 file paths were printed.
* The second section of the output comes from calling `print_list` on the result of `read_all_books`, which extracts the sentences from every book XML file and concatenates them into a single list. The list contains 239,555 items (i.e., 239,555 sentences in total across all the book XML files), each of type `str`.
* The next line states that only the first 10 sentences were printed.

#### Overview of the functions

* `store_as_pickle()` takes two arguments: the Python object you want to store, and the filename you want to save the object to. The function opens the file in write binary mode, and then uses the pickle.dump function to dump the object to the file.
* `load_from_pickle()` takes a single argument, the filename of the pickle file you want to load. The function opens the file in read binary mode, and then uses the pickle.load function to load the object from the file. Finally, the function returns the loaded object.
* `list_filenames(directory)` takes a directory path as input and returns a list of all file names in that directory. Each file name is prepended with the input directory path to form a full file path.
* `print_list(arr, limit=None)` takes a list (`arr`) as input and prints out information about each item in the list. It first outputs the total number of items in the list, and then iterates over the list and prints out the type of each item (e.g., `int`, `str`) and the item itself. If a `limit` argument is passed, only the first `limit` items in the list will be printed. At the end, if a `limit` argument is passed, it prints a message indicating how many items were printed.
* `print_matrix(matrix, m_lim=None, v_lim=None)` takes a matrix (a list of lists) as input and calls `print_list` on each of the first `m_lim` sublists, passing `v_lim` as the item limit for each one. It doesn't return anything, but simply prints out the information for each sublist in the matrix.
* `read_xml(fn)` takes a file name (presumably a full file path) as input, reads an XML file from that location using the `ElementTree` module, and extracts all the text contained within the `s` tags in the XML file. It returns a list of all the extracted sentences.
* `read_all_books()` reads every file in the books directory (obtained by calling `list_filenames`) and extracts all the sentences from each file by calling `read_xml`. It does not filter by extension, so the directory is assumed to contain only XML files. It returns a list of all the extracted sentences.
* `main()` is the main function that runs when the cell is executed. It first calls `list_filenames` to print out a list of all the book file paths. It then calls `read_all_books` to extract all the sentences from the books, and prints out information about the sentences using `print_list`. Finally, `store_as_pickle` is called to store all of the sentences in a .pkl file (`u_sentences.pkl`) so they can be loaded and used later.
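
The behaviour of `read_xml` can be tried without the book files by parsing an in-memory document. The XML below is a made-up miniature that only assumes sentences live in `s` tags, as in the function above. The filter on empty elements is an addition for this sketch and is not part of `read_xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document using the same <s> sentence tags
sample_xml = """<text>
  <p><s id="1">Anna Karenina</s><s id="2">Leo Tolstoy</s></p>
  <p><s id="3"/></p>
</text>"""

root = ET.fromstring(sample_xml)
# Empty <s/> elements have text None, so they are filtered out here
sentences = [s.text for s in root.iter('s') if s.text]
```

Filtering out `None` values at this stage would avoid errors later if a string method such as `.lower()` were applied to every extracted sentence.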

### 1.5 Pre-processing XML Book 'Sentences' 

In [None]:
def remove_punctuation(text):
    """ Removes all punctuation from the given string. """
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def preprocess_sentences(u_sentences_):
    """
    Preprocesses a list of sentences by converting 
    all characters to lowercase and splitting each sentence
    into a list of words, with punctuation removed from
    each word.
    """
    p_sentences = []
    for sentence in u_sentences_:
        # Lowercase, split on whitespace, then strip punctuation per word
        words = [remove_punctuation(word) for word in sentence.lower().split()]
        p_sentences.append(words)
    return p_sentences

def main():
    # Load unprocessed sentences
    u_sentences = load_from_pickle('u_sentences.pkl')
    print_list(u_sentences, 10)

    # Process sentences
    p_sentences = preprocess_sentences(u_sentences)
    print_list(p_sentences, 10)
    print_matrix(p_sentences, 2)

    # Store sentences
    store_as_pickle(p_sentences, 'p_sentences.pkl')
    
    # Mark EOF
    pass


main()

List contains 239555 items.
[
    str: 'Source: http://librosgratis.liblit.com/',
    str: 'Anna Karenina',
    str: 'Leo Tolstoy',
    str: 'VOLUME TWO PART V',
    str: 'CHAPTER I',
    str: 'THE PRINCESS SHCHERBATSKAYA AT FIRST CONSIDERED it out of the question to have the wedding before Advent, to which there remained but five weeks, but could not help agreeing with Levin that to put it off until after the Fast might involve waiting too long, for Prince Shcherbatsky's old aunt was very ill and likely to die soon, and then the family would be in mourning and the wedding would have to be considerably deferred.',
    str: 'Consequently, having decided to divide her daughter's trousseau into two parts, a lesser and a larger, the Princess eventually consented to have the wedding before Advent.',
    str: 'She decided that she would have the smaller part of the trousseau got ready at once, and would send on the larger part later; and she was very cross with Levin because he could not giv

#### Overview of Pre-processing

I have defined two helper functions and a main function that work together to preprocess a list of sentences, removing all punctuation and converting all characters to lowercase. Here's an explanation of each function:

* `remove_punctuation(text)` takes a string (`text`) as input and removes all punctuation characters from it using the `string` module. It then returns the resulting string without punctuation.
* `preprocess_sentences(u_sentences_)` takes a list of strings (`u_sentences_`) representing unprocessed sentences, and preprocesses each sentence by converting all characters to lowercase, splitting each sentence into a list of words, and removing all punctuation characters from each word. It then returns the resulting list of preprocessed sentences, where each sentence is represented as a list of words.
* `main()` is the main function that runs when the script is executed. It first loads a list of unprocessed sentences from a pickle file using `load_from_pickle`, and prints information about the list using `print_list`. It then calls `preprocess_sentences` on the list of unprocessed sentences to preprocess them, and prints information about the preprocessed sentences using `print_list` and `print_matrix`. Finally, it stores the preprocessed sentences to a new pickle file using `store_as_pickle`.

Overall, my code demonstrates preprocessing in natural language processing (NLP) by removing punctuation and converting all text to lowercase. It uses Python's built-in string module to remove punctuation, and defines a separate function (`preprocess_sentences`) to apply this preprocessing step to a list of sentences. The main function demonstrates how to use the `preprocess_sentences` function and store the preprocessed sentences to a file.
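
To make the transformation concrete, here is a minimal sketch of the same idea applied to two toy sentences. The names `remove_punctuation_sketch` and `preprocess_sentences_sketch` are hypothetical stand-ins rather than the functions defined above, and dropping tokens that become empty after punctuation removal is an assumption of this sketch.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_sketch(text: str) -> str:
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


def preprocess_sentences_sketch(sentences):
    """Lowercase each sentence, split it on whitespace, strip punctuation
    from each word, and drop words that end up empty (an assumption)."""
    processed = []
    for sentence in sentences:
        words = [remove_punctuation_sketch(w) for w in sentence.lower().split()]
        processed.append([w for w in words if w])
    return processed


example = preprocess_sentences_sketch(
    ["Anna Karenina, by Leo Tolstoy!", "CHAPTER I -- Part V."]
)
print(example)
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes in a source text would survive this step unless they are handled separately.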

### 1.6 Combine Datasets

In [None]:
def combine_corpus(fns: List[str]=["wcs_pkl/corpus.pkl", "p_sentences.pkl"],
                   out: bool=True):
    """ Given a list of .pkl files, each containing a corpus stored as a
    list of lists (one list of tokens per sentence), return the
    concatenation of all corpora in the order given.
    """
    # Load first corp and print to user
    big_corpus = load_pickle_file(fns[0])
    if out: 
        print_list(big_corpus, 3)
    for fn in fns[1:]:
        # Load the current corpus pointed to
        corpus_ = load_pickle_file(fn)
        # Append the current corpus to the combined corpus
        big_corpus.extend(corpus_)
        # Info for user for each corpus
        if out: 
            print_list(corpus_, 3)
    # Info for user for the big corpus
    if out: 
        print_list(big_corpus, 3)
    return big_corpus


# Run function to demonstrate functionality for later use
example_combination = combine_corpus()
# Delete `example_combination` as it is never used
# and for notebook memory optimization 
del example_combination 

List contains 815822 items.
[
    list: '['systematic', 'trading', 'also', 'known', 'as', 'mechanical', 'trading', 'is', 'a', 'way', 'of', 'defining', 'trade', 'goals', 'risk', 'controls', 'and', 'rules', 'that', 'can', 'make', 'investment', 'and', 'trading', 'decisions', 'in', 'a', 'methodical', 'way', 'systematic', 'bias', 'errors', 'that', 'are', 'not', 'determined', 'by', 'chance', 'but', 'are', 'introduced', 'by', 'an', 'inaccuracy', 'involving', 'either', 'the', 'observation', 'or', 'measurement', 'process', 'inherent', 'to', 'the', 'system', 'systematic', 'chaos', 'ninth', 'studio', 'album', 'by', 'american', 'progressive', 'metal', 'band', 'dream', 'theater', 'this', 'page', 'was', 'last', 'edited', 'on', '19', 'november', '2022', 'at', '2123', 'utc']',
    list: '['paul', 'bostaph', 'left', 'the', 'band', 'in', '2004', 'and', 'former', 'drummer', 'shaun', 'bannon', 'rejoined', 'having', 'healed', 'from', 'his', 'injuries']',
    list: '['he', 'was', 'not', 'in', 'the', 'group'

`combine_corpus` merges multiple corpora stored in separate pickle files into a single, larger corpus. It takes a list of pickle file paths, loads each corpus, and returns their concatenation in the order given. When `out` is true, it also prints the first few entries of each corpus and of the combined corpus for the user's reference.
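
The merging step can be illustrated on toy data. The sketch below is self-contained and uses temporary pickle files; `combine_pickled_corpora` is a hypothetical name and does not rely on the notebook's `load_pickle_file` helper.

```python
import os
import pickle
import tempfile


def combine_pickled_corpora(paths):
    """Concatenate the lists stored in each pickle file, in the order given."""
    combined = []
    for path in paths:
        with open(path, "rb") as f:
            combined.extend(pickle.load(f))
    return combined


# Toy demonstration with two temporary pickle files
corpus_a = [["systematic", "trading"], ["paul", "bostaph"]]
corpus_b = [["anna", "karenina"]]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, corpus in [("a.pkl", corpus_a), ("b.pkl", corpus_b)]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            pickle.dump(corpus, f)
        paths.append(path)
    combined = combine_pickled_corpora(paths)

print(len(combined))
```

As a sanity check on the real data, the two corpora printed in this notebook contain 815822 and 239555 sentences, which sum to the 1055377 sentences reported in Section 2.1.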

## 2. Exploratory Data Analysis

This includes basic corpus statistics, frequency analysis, and POS-tagging analysis. 

### 2.1 Basic Corpus Statistics

In [None]:
def calc_words(sentences):
    """ Calculate the total number of words. """
    words = 0
    for sentence in sentences:
        words += len(sentence)
    return words

def calc_avg_sen_len(sentences):
    """ Calculate the average length of a sentence. """
    return calc_words(sentences) / len(sentences)

def calc_unique_words(sentences):
    """ Calculate the number of unique words. """
    all_sentences = []
    for sentence in sentences:
        all_sentences.extend(sentence)
    all_sentences_set = set(all_sentences)
    return len(all_sentences_set)

def main():
    """ Provide basic statistics to the user. """
    # Load Processed sentences
    # p_sentences = load_from_pickle('p_sentences.pkl')
    p_sentences = load_from_pickle('wcs_pkl/corpus.pkl') + load_from_pickle('p_sentences.pkl')

    total_words  = calc_words(p_sentences)
    avg_length   = calc_avg_sen_len(p_sentences)
    unique_words = calc_unique_words(p_sentences)
    
    print(
        f"Total Number of Sentences: {len(p_sentences)}\n"
        f"Average Sentence Length:   {avg_length}\n"
        f"Total Number of Words:     {total_words}\n"
        f"Number of Unique Words:    {unique_words}\n"
    )

    # Mark EOF
    pass


main()

Total Number of Sentences: 1055377
Average Sentence Length:   21.492753774243706
Total Number of Words:     22682958
Number of Unique Words:    399115



This code defines three functions and a main function that work together to calculate some basic statistics on a list of preprocessed sentences. Here's an explanation of each function:

* `calc_words(sentences)` takes a list of sentences represented as lists of words (`sentences`) and calculates the total number of words across all sentences. It does this by iterating over each sentence and adding the length of each sentence (i.e., the number of words in the sentence) to a running total. It then returns the total number of words.
* `calc_avg_sen_len(sentences)` takes a list of sentences represented as lists of words (`sentences`) and calculates the average length of a sentence. It does this by dividing the total number of words in all sentences by the number of sentences in the list (i.e., `len(sentences)`). It calls the `calc_words` function to calculate the total number of words.
* `calc_unique_words(sentences)` takes a list of sentences represented as lists of words (`sentences`) and calculates the number of unique words across all sentences. It does this by creating a new list (`all_sentences`) that concatenates all sentences together into a single list of words, and then converting this list to a set (`all_sentences_set`) to remove duplicates. It then returns the length of the resulting set, which represents the number of unique words across all sentences.
* `main()` is the main function that runs when the cell is executed. It first loads the two preprocessed corpora from their pickle files using `load_from_pickle` and concatenates them. It then calculates four basic statistics: the total number of sentences, the average sentence length (in words), the total number of words, and the number of unique words across all sentences. These statistics are printed for the user.
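
As a compact alternative, the same statistics can be computed in a single pass with `collections.Counter`. This is only a sketch on toy data; `corpus_stats` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return basic statistics for a corpus given as a list of token lists."""
    # One pass over every token gives both the total and the unique counts
    counts = Counter(word for sentence in sentences for word in sentence)
    total_words = sum(counts.values())
    return {
        "sentences": len(sentences),
        "total_words": total_words,
        "unique_words": len(counts),
        # Guard against an empty corpus to avoid dividing by zero
        "avg_sentence_length": total_words / len(sentences) if sentences else 0.0,
    }


toy = [["the", "cat", "sat"], ["the", "dog"]]
stats = corpus_stats(toy)
print(stats)
```

The figures reported above are consistent with each other: 22682958 words divided by 1055377 sentences gives the average sentence length of roughly 21.49.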

### 2.2 Frequency Analysis

In [None]:
def matrix_to_dict(sentences):
    """
    Converts a list of sentences represented as lists of words 
    into a dictionary that maps each unique word to its frequency
    count in the sentences.
    """
    single_words = {}
    for s in sentences:
        for word in s:
            # Increment the count, starting from 0 for unseen words
            single_words[word] = single_words.get(word, 0) + 1
    return single_words

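def common_words(dict_: dict, items: int=10):
    """
    Prints the most common words in the given dictionary,
    sorted in descending order of frequency count.
    """
    # Remove entries with frequency count less than 10 (in place).
    # Without this pruning, Colab cannot complete the `sorted`
    # call below on the full dictionary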
    for key in list(dict_.keys()):
        if dict_.get(key) < 10:
            dict_.pop(key)
    # Sort dict
    sorted_dict = dict(sorted(dict_.items(), 
                              key=lambda item: item[1], 
                              reverse=True))
    # Out common words
    print(
        f"Top {items} most common word(s).\n"
        f"{'|WORD|':<20} {'|COUNT|':>5}"
    )
    for k, v in list(sorted_dict.items())[:items]:
        k = f"'{k}'"
        print(f"{k:<20} {v:>5}")
    print()

def matrix_to_dict_pairs(sentences):
    """
    Converts a list of sentences represented as lists of words into a 
    dictionary that maps each unique pair of contiguous words to its 
    frequency count in the sentences.
    """
    word_pairs = {}
    for s in sentences:
        for i in range(len(s) - 1):
            pair = (s[i], s[i+1])
            if pair in word_pairs:
                word_pairs[pair] += 1
            else:
                word_pairs[pair] = 1
    return word_pairs

def common_word_pairs(dict_: dict, items: int=10):
    """
    Prints the most common pairs of contiguous words in the given dictionary, 
    sorted in descending order of frequency count.
    """
    # Remove entries with frequency count less than 10
    # Notebook memory optimization
    for pair in list(dict_.keys()):
        if dict_.get(pair) < 10:
            dict_.pop(pair)

    # Sort the dictionary by frequency count in descending order
    sorted_dict = dict(sorted(dict_.items(), 
                              key=lambda item: item[1], 
                              reverse=True))

    # Print out the most common word pairs with their frequency counts
    print(f"Top {items} most common word pairs.\n{'|BIGRAM|':<40} {'|COUNT|':>5}")
    for pair, count in list(sorted_dict.items())[:items]:
        pair_str = f"'{pair[0]} {pair[1]}'"
        print(f"{pair_str:<40} {count:>5}")
    print()

def matrix_to_dict_triplets(sentences):
    """
    Converts a list of sentences represented as lists of words into a 
    dictionary that maps each unique triplet of contiguous words to its 
    frequency count in the sentences.
    """
    word_triplets = {}
    for s in sentences:
        for i in range(len(s) - 2):
            triplet = (s[i], s[i+1], s[i+2])
            if triplet in word_triplets:
                word_triplets[triplet] += 1
            else:
                word_triplets[triplet] = 1
    return word_triplets

def common_word_triplets(dict_: dict, items: int=10):
    """
    Prints the most common triplets of contiguous words in the given dictionary, 
    sorted in descending order of frequency count.
    """
    # Remove entries with frequency count less than 10
    # Notebook memory optimization
    for triplet in list(dict_.keys()):
        if dict_.get(triplet) < 10:
            dict_.pop(triplet)

    # Sort the dictionary by frequency count in descending order
    sorted_dict = dict(sorted(dict_.items(), 
                              key=lambda item: item[1], 
                              reverse=True))

    # Print out the most common word triplets with their frequency counts
    print(f"Top {items} most common word triplets.\n{'|TRIGRAM|':<60} {'|COUNT|':>5}")
    for triplet, count in list(sorted_dict.items())[:items]:
        triplet_str = f"'{triplet[0]} {triplet[1]} {triplet[2]}'"
        print(f"{triplet_str:<60} {count:>5}")
    print()

def pos_analysis():
    ...

def main():
    """ Provide frequency statistics to the user. """
    
    # Load sentences matrix
    p_sentences = load_from_pickle('wcs_pkl/corpus.pkl') + load_from_pickle('p_sentences.pkl')
    
    # Create dictionary of word counts {k=<word> : v=<count>}
    d_sentences = matrix_to_dict(p_sentences)

    # # This line proves the dictionary construction is correct
    # print(len(list(d_sentences.keys())))

    # Output top n common words (n=10 by default)
    common_words(d_sentences)
    del d_sentences  # Notebook memory optimization

    # Create dict of pair-word counts {k=<w1, w2> : v=<count>}
    d_pairs = matrix_to_dict_pairs(p_sentences)

    # Output top n common pairs of words (n=10 by default)
    common_word_pairs(d_pairs)
    del d_pairs  # Notebook memory optimization

    # Create dict of trigram counts {k=<w1, w2, w3> : v=<count>}
    d_triplets = matrix_to_dict_triplets(p_sentences)
    
    # Output top n common triplets of words (n=10 by default)
    common_word_triplets(d_triplets)
    del d_triplets  # Notebook memory optimization

    # Mark EOF
    pass


main()

KeyboardInterrupt: ignored

This code first loads the tokenized sentences of both corpora (`p_sentences`) from their pickle files. It then performs several frequency analyses, finding the most common words, word pairs (bigrams), and word triplets (trigrams) in the corpus. The `main()` function organizes the execution of these tasks.

Here's a brief explanation of each function:

* `matrix_to_dict()`: Given a list of tokenized sentences, this function creates a dictionary that maps each unique word to its frequency count.

* `common_words()`: Given a dictionary of word frequencies, this function removes entries that occur fewer than 10 times (modifying the dictionary in place) and prints the n most common words and their frequencies.

* `matrix_to_dict_pairs()`: Given a list of tokenized sentences, this function creates a dictionary that maps each unique pair of contiguous words (bigrams) to their frequency count.

* `common_word_pairs()`: Given a dictionary of bigram frequencies, this function prints the n most common bigrams and their frequencies.

* `matrix_to_dict_triplets()`: Given a list of tokenized sentences, this function creates a dictionary that maps each unique triplet of contiguous words (trigrams) to their frequency count.

* `common_word_triplets()`: Given a dictionary of trigram frequencies, this function prints the n most common trigrams and their frequencies.

* `main()`: This function organizes the execution of the tasks mentioned above. It first loads the list of tokenized sentences from a pickle file, then performs the frequency analyses for single words, bigrams, and trigrams. 
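
The word, bigram, and trigram counters above share the same structure, so they can be expressed as one function parameterised by `n`. The sketch below uses `collections.Counter`; `count_ngrams` and `top_ngrams` are hypothetical names, and the `min_count` threshold mirrors the "fewer than 10" pruning used above.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count contiguous n-grams (as tuples) over a list of token lists."""
    counts = Counter()
    for sentence in sentences:
        # zip over n shifted views of the sentence yields its n-grams
        counts.update(zip(*(sentence[i:] for i in range(n))))
    return counts


def top_ngrams(counts, items=10, min_count=1):
    """Return up to `items` most frequent n-grams with count >= min_count."""
    return [(ng, c) for ng, c in counts.most_common(items) if c >= min_count]


toy = [["the", "cat", "sat", "on", "the", "mat"], ["the", "cat", "ran"]]
unigrams = count_ngrams(toy, 1)
bigrams = count_ngrams(toy, 2)
trigrams = count_ngrams(toy, 3)
print(top_ngrams(bigrams, items=1))
```

`Counter.most_common(items)` selects the top entries without sorting the whole dictionary, which may ease the resource pressure mentioned in the code comments, although I have not measured this on the full corpus.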


### 2.3 POS-Tagging Analysis


In [None]:
# nltk downloads are separate to prevent 
# downloading the same files every execution
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

In [None]:
def pos_tagging(sentence):
    """
    Given a list of words, returns a pair: the list of (word, POS tag)
    tuples produced by NLTK, and the list of POS tags on their own.
    """
    pos_tags = nltk.pos_tag(sentence)
    # A list comprehension (rather than zip(*...)) also handles
    # empty sentences, for which unpacking zip(*[]) raises ValueError
    tags = [tag for _, tag in pos_tags]
    return pos_tags, tags

def pos_matrix(sentences):
    """ Generate matrix of POS-Taggings. """
    tag_matrix = []
    for sentence in sentences:
        tag_matrix.append(pos_tagging(sentence)[1])
    return tag_matrix

def pos_distribution(pos_dict):
    total_count = sum(pos_dict.values())
    pos_percentages = {}

    for pos, count in pos_dict.items():
        percentage = (count / total_count) * 100
        pos_percentages[pos] = round(percentage, 2)

    return pos_percentages

def sort_by_value(dictionary, reverse=True):
    return sorted(dictionary.items(), key=lambda x: x[1], reverse=reverse)

def display_top_n(dictionary, n=10, suffix=''):
    # Sort items by value, largest first
    sorted_dict = sort_by_value(dictionary)
    max_key_len = max(len(key) for key, _ in sorted_dict[:n])
    # Print a numbered, aligned list of the top n items
    for i, (key, value) in enumerate(sorted_dict[:n]):
        print(f"{i+1:>{len(str(n)) + 1}}. {key:<{max_key_len}}: {value}{suffix}")
    # Mark EOF
    pass

def main():
    # try-except saves times if this cell has already
    # been executed before in your current runtime session
    try:
        # Load tags
        tags = load_from_pickle("tags_dict.pkl")
    except FileNotFoundError:
        # Load sentences matrix & Perform POS
        p_sentences = load_from_pickle('p_sentences.pkl')
        tags = pos_matrix(p_sentences)
        # Memory optimization for notebook
        del p_sentences  
        # Save as file as it is faster to load from file
        store_as_pickle(tags, 'tags_dict.pkl')

    # Convert POS-taggings to a dictionary for counts
    pos_dict = matrix_to_dict(tags)
    # Notebook memory optimization
    del tags

    # Calculate percentages
    distribution = pos_distribution(pos_dict)

    # No explicit sorting is needed here: `sort_by_value` returns a
    # new list, and `display_top_n` calls it before printing

    # print(tags)
    print("\nTop 10 POS Tags Counts:")
    display_top_n(pos_dict, 10)

    # Percentage distributions
    print("\nTop 10 POS Tags Distribution:")
    display_top_n(distribution, 10, '%')

    # Mark EOF
    pass


main()

This code consists of several functions that analyze the part-of-speech (POS) tags in a corpus of text. It performs POS tagging on a list of tokenized sentences, calculates the distribution of POS tags, and displays the top 10 POS tags by count and percentage.

Here's a brief explanation of each function:

* `pos_tagging()`: Given a list of words (a tokenized sentence), this function returns two values: a list of tuples containing each word and its corresponding POS tag, and a list of just the POS tags.

* `pos_matrix()`: Given a list of tokenized sentences, this function generates a matrix of POS tags by applying the `pos_tagging()` function to each sentence.

* `pos_distribution()`: Given a dictionary of POS tags and their counts, this function calculates the percentage distribution of each POS tag.

* `sort_by_value()`: Given a dictionary, this function sorts the items by their values (either in ascending or descending order).

* `display_top_n()`: Given a dictionary, this function prints the top n items (by value) with an optional suffix (e.g., '%' for percentages).

* `main()`: This function organizes the execution of the tasks mentioned above. It first tries to load the POS tags from a pickle file; if that file does not exist, it tags the sentences in `p_sentences.pkl` and stores the result for later runs. It then converts the tags into a dictionary of POS tag counts, calculates the percentage distribution of POS tags, and displays the top 10 POS tags by count and by percentage.
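
The counting and percentage steps can be sketched without running the tagger by starting from a small, hand-written tag matrix. `tag_distribution` is a hypothetical helper and the tags below are illustrative only.

```python
from collections import Counter


def tag_distribution(tag_matrix):
    """Return (counts, percentages) for a list of per-sentence tag lists."""
    counts = Counter(tag for tags in tag_matrix for tag in tags)
    total = sum(counts.values())
    # Percentages are rounded to two decimal places, as in pos_distribution
    percentages = {tag: round(100 * c / total, 2) for tag, c in counts.items()}
    return counts, percentages


# Hand-written tags standing in for the output of the POS tagger
toy_tags = [["DT", "NN", "VBD"], ["DT", "NN"], ["NN"]]
counts, percentages = tag_distribution(toy_tags)

for tag, count in counts.most_common(3):
    print(f"{tag:<4} {count:>3} {percentages[tag]:>6}%")
```

Because each percentage is rounded independently, the values may not sum to exactly 100.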


## 3. Training and Expanding Word Sets with Word2Vec Models

### 3.1 Train W2V 

Different models are generated by using different window sizes. 

In [None]:
def load_w2v(fn, out=True):
    if out:
        print(f"Loading Word2Vec model...\n")
    rv = None
    # Binary word2vec format: loaded as KeyedVectors
    if '.bin.gz' in fn:
        rv = KeyedVectors.load_word2vec_format(fn, binary=True)
    # gensim's native format: loaded as a full Word2Vec model
    elif '.model' in fn:
        rv = Word2Vec.load(fn)
    if out:
        print(f"Loaded Word2Vec model!\n")
    return rv

def train_model(corpus_, window_size):
    # Train the word2vec model on the corpus
    model_ = gensim.models.Word2Vec(corpus_, window=window_size)  # vector_size=300
    return model_

def save_model(model_, file_path):
    """ 
    Save the weights of the trained model

    :param model_: pass the model weights
    :param file_path: string - location and name to save to 
    """
    model_.save(file_path)

def train_save_model(corpus_, window_size:int=5):
    # Train model
    trained_model = train_model(corpus_, window_size=window_size)

    # Make dir to save weights
    if not os.path.exists("model_weights"):
        os.makedirs("model_weights")
    
    # Save model weights
    save_model(
        trained_model, 
        f"model_weights/trained_model_2c{window_size}w.model"
    )

    # Mark EOF
    pass

def main(use_git_clone=False):
    """ 
    Model weights from git-clone are incomplete due 
        to upload and download format, so `use_git_clone=True`
        downloads the weights from Google Drive instead.
    """
    # # Save time
    # if use_git_clone:
    #     copy_files("Dissertation/M8/model_weights", "model_weights")
    #     print("Successfully copied model weights from cloned repo!")
    #     return

    # Save time by downloading weights
    if use_git_clone:
        url = 'https://drive.google.com/drive/folders/1pZJZXgIcxQ7CrQeSxouFkqbaRQEnNi9X?usp=share_link'
        output = 'model_weights'
        gdown.download_folder(url, output=output, quiet=False)
        return

    # Use combined corpus
    corpus = combine_corpus(out=False)

    window_sizes = [2, 5, 8, 10, 12, 15, 20]

    for ws in tqdm.tqdm(window_sizes):  
        train_save_model(corpus, ws)

    # Mark EOF
    pass


main(use_git_clone=True)

Retrieving folder list


Processing file 1947yGTAIOD6ps9fHaT3r8bX2W8gtzbFs trained_model_2c2w.model
Processing file 1oXNj984-bavYKZoffu1xHFTpYUspGWHX trained_model_2c5w.model
Processing file 1NPQBj7AQ4zYVf9nQP4JtC13W93y3hRi8 trained_model_2c8w.model
Processing file 1RQgozYsM4fRZTrbTFe52D3A7bxWT-zJ4 trained_model_2c10w.model
Processing file 1ObpOg1Ce4lFWUJAchq6DTUSi7OwJ2Yn4 trained_model_2c12w.model
Processing file 1MxNHHRpvAnpst93FbxFJ_fphvEOsd7kW trained_model_2c15w.model
Processing file 1n1dEwCRVDwFYLKExjDYwhdcGonk4y1xX trained_model_2c20w.model
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1947yGTAIOD6ps9fHaT3r8bX2W8gtzbFs
To: /content/model_weights/trained_model_2c2w.model
100%|██████████| 132M/132M [00:01<00:00, 128MB/s]
Downloading...
From: https://drive.google.com/uc?id=1oXNj984-bavYKZoffu1xHFTpYUspGWHX
To: /content/model_weights/trained_model_2c5w.model
100%|██████████| 132M/132M [00:01<00:00, 115MB/s]
Downloading...
From: https://drive.google.com/uc?id=1NPQBj7AQ4zYVf9nQP4JtC13W93y3hRi8
To: /content/model_weights/trained_model_2c8w.model
100%|██████████| 132M/132M [00:01<00:00, 89.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1RQgozYsM4fRZTrbTFe52D3A7bxWT-zJ4
To: /content/model_weights/trained_model_2c10w.model
100%|██████████| 132M/132M [00:00<00:00, 173MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ObpOg1Ce4lFWUJAchq6DTUSi7OwJ2Yn4
To: /content/model_weights/trained_model_2c12w.model
100%|██████████| 132M/132M [00:00<00:00, 214M

This code performs three main tasks:

1. Loading a Word2Vec model. The function `load_w2v` takes a file name as an argument and loads the model stored in that file. Files in the binary word2vec format ('.bin.gz') are loaded as `KeyedVectors`, and files in gensim's native format ('.model') are loaded as full `Word2Vec` models.

2. Training a new Word2Vec model. The `train_model` function takes a corpus and a window size as arguments and trains a new Word2Vec model on that corpus.

3. Saving the weights of the trained model. The `save_model` function takes a model and a file path as arguments, and saves the weights of the model in the specified file.

The main function ties training and saving together. With `use_git_clone=True` it skips training and downloads previously trained weights from Google Drive. Otherwise, it loads the combined corpus and, for each window size in `[2, 5, 8, 10, 12, 15, 20]`, trains a new Word2Vec model and saves its weights to the `model_weights` directory.
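
To give some intuition for what the window size controls, the sketch below lists the (target, context) pairs that a symmetric window of a given size would produce for one sentence. It is a simplified, pure-Python illustration; `context_pairs` is a hypothetical helper, and the library's actual training loop may additionally subsample frequent words or shrink windows at random.

```python
def context_pairs(sentence, window):
    """Return every (target, context) pair within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        # Clamp the window to the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


sentence = ["she", "was", "very", "cross", "with", "levin"]
for window in (2, 5):
    print(window, len(context_pairs(sentence, window)))
```

A larger window pairs each word with more distant neighbours, which is why models trained with different window sizes can return different similar words for the same query.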

### 3.2 Perform Set Expansion

In [None]:
def list_info(arr):
    """ Out information about the passed list. """
    if isinstance(arr, list):
        print(f"List contains {len(arr)} items.")
    elif isinstance(arr, set):
        print(f"Set contains {len(arr)} items.")
    print(f"First 5 items:[")
    for item in list(arr)[:5]:
        print(f"'{item}',")
    print(f"]\n")
    # Mark EOF
    pass

def write_words_to_file(file_path, arr):
    # Open and write to file path passed
    with open(file_path, 'w', encoding='utf-8-sig') as f:
        string = '\n'.join(arr)
        f.write(string)
    # Mark EOF
    pass

def expand_set(model_fp='google_w2v_weights/GoogleNews-'
                        'vectors-negative300.bin.gz',
               entity_set_fp='Dissertation/M6/'\
                        'data_prep/entity_set.txt',
               entity_set=None,
               out=True,
               output_fn='expansion_set_2',
               k=3):
    """
    This function loads a specified W2V model and reads an entity set
    from a text file (or uses the set passed in `entity_set`).
    Then for each word in the entity set it records up to `k` of the most
    similar words returned from the W2V model, with their similarity scores.

    :param entity_set: set - the set of words which to expand on; only used
    when `entity_set_fp` is None.
    :param model_fp: str - file path pointing to weights to be loaded.
    :param entity_set_fp: (str||None) - file path pointing to set to be loaded.
    :param out: bool - if true then the function will print info otherwise nothing
    will be printed to the console.
    :param output_fn: str - path of the file the expanded set is written to
    :param k: int - maximum number of similar words to keep per entity word.
    :return: a list of strings of the form 'word, similar_word, similarity',
    one per similar word kept.
    """
    # Load a pre-trained Word2Vec model
    model = load_w2v(fn=model_fp, out=out)

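    # Set state representing model type: '.model' files are full gensim
    # Word2Vec models (queried through `model.wv`), anything else is
    # treated as KeyedVectors (queried directly)
    pre_trained = '.model' in model_fp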

    # Define the initial entity set
    if entity_set_fp is not None:
        entity_set = read_words_from_file(entity_set_fp)
    if out:
        print(f"Entity set as a Python Set info: ")
        list_info(entity_set)

    # Convert set to array
    entity_arr = list(entity_set)
    if out:
        print(f"Entity set as a Python List info: ")
        list_info(entity_arr)
        
    # Array to store expanded terms and counter for errors
    expansion_arr = []
    e_counter = 0

    # Expand the entity set
    for word in entity_set:
        
        # Get a list of similar words for the entity set target word.
        # gensim returns 10 neighbours by default, so `topn` is set
        # explicitly to make k > 10 possible; extra candidates are
        # requested because case variants of `word` are skipped below
        try:
            if not pre_trained:
                similar_words_list = model.most_similar(word.lower(), topn=2 * k)
            else:
                similar_words_list = model.wv.most_similar(word.lower(), topn=2 * k)
        except KeyError:
            # Count words that are missing from the W2V vocabulary
            e_counter += 1
            continue
        
        # Start adding similar words
        similar_words_added = 0
        for similar_word, similarity in similar_words_list:
            if similar_words_added == k:
                break
            if word.lower() != similar_word.lower():
                # Add word to expanded set
                expansion_arr.append(
                    f"{word}, {similar_word}, {similarity}"
                )
                similar_words_added += 1
    
    # Write expanded set to a text file
    write_words_to_file(f"{output_fn}", expansion_arr)

    # Out function information to user 
    if out:
        print(f"Expanded list info: ")
        list_info(expansion_arr)
    if out:
        print(
            f"{e_counter} words were not present in the W2V-model's vocabulary.\n"
            f"Written expanded set to the following dir: '{output_fn}'"
        )

    # Return 
    return expansion_arr

def main():
    # Create the 'expanded_sets_' directory if it does not already exist
    parent_dir = 'expanded_sets_'
    os.makedirs(parent_dir, exist_ok=True)

    # Iterate over each file in the 'model_weights' directory
    for model_file in tqdm.tqdm(os.listdir('model_weights')[:]):
        if model_file.endswith('.model'):
            model_fp = os.path.join('model_weights', model_file)

            for k in [3, 5, 10, 15]:
                # Expand set of words & write to text file in 'expanded_sets' directory
                parent_dir_ = os.path.join(parent_dir, str(k))
                if not os.path.exists(parent_dir_):
                    os.makedirs(parent_dir_)
                expand_set(
                    model_fp=model_fp,
                    output_fn=os.path.join(parent_dir_, 
                        f"set_{model_file.split('_')[-1].split('.')[0]}.txt"
                    ),
                    out=False,
                    k=k
                )

    # Mark EOF
    pass



main()

100%|██████████| 7/7 [03:13<00:00, 27.58s/it]


The aim of this code is to expand a set of words (referred to as the entity set) by finding the most similar words for each word in the entity set using our trained Word2Vec models. 

For each model in `model_weights` and each `k` in `[3, 5, 10, 15]`, the code loads the model and reads in the entity set from a file. For each word in the entity set, it then retrieves the most similar words from the model and keeps up to `k` of them. 

Finally, it writes the results to a text file in which each line contains the original word, a similar word generated by the Word2Vec model, and their similarity score.
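
The lookup at the heart of this step is a nearest-neighbour search by cosine similarity. The sketch below reproduces that idea in pure Python on made-up three-dimensional vectors; `cosine`, `expand_word`, and `toy_vectors` are hypothetical and stand in for the trained model, so the scores carry no linguistic meaning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_word(word, vectors, k=3):
    """Return the k words most similar to `word`, excluding `word` itself."""
    if word not in vectors:
        # Mirrors skipping out-of-vocabulary words in expand_set
        return []
    scored = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors for illustration only
toy_vectors = {
    "happy": [0.9, 0.1, 0.0],
    "joyful": [0.8, 0.2, 0.0],
    "sad": [-0.9, 0.1, 0.0],
    "table": [0.0, 0.0, 1.0],
}
neighbours = expand_word("happy", toy_vectors, k=2)
lines = [f"happy, {w}, {s:.3f}" for w, s in neighbours]
print("\n".join(lines))
```

The `lines` list follows the same 'word, similar word, similarity' layout that `expand_set` writes to its output file.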

## 4. Evaluation

### 4.1 Expanded Set Manual Labelling 

In my GitHub repository files, which are cloned and downloaded into this project at the start of this notebook, I have manually labelled whether a word generated by W2V is a synonym, antonym, or related for the first 120 outputs. I have added a '1' to the end of the line if the word is a synonym, antonym, or related and a '0' otherwise.

This labelling can be found in 'Dissertation/M6/output_set_expansion/labelled/eset_mit_wiki_.txt'.

In [None]:
# /content/Dissertation/M6/output_set_expansion/labelled/eset_mit_wiki_.txt
# 'utf-8-sig' strips the byte-order mark at the start of the file
with open('Dissertation/M6/output_set_expansion/labelled/eset_mit_wiki_.txt',
          "r", encoding='utf-8-sig') as f:
    lines = f.readlines()

for line in lines[:5]: 
    print(line.strip())

Sociable, radians, 0
Sociable, modulo, 0
Sociable, triplelevel, 0
Energetic, dopamine, 1
Energetic, legumes, 0
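
One plausible way to summarise such a labelled file is the fraction of expanded words labelled '1', which corresponds to precision if every expanded word is treated as a predicted positive. The sketch below assumes the 'target, expanded, label' layout shown above; `parse_labelled_line` and `precision_of_expansion` are hypothetical helpers, and the parser tolerates the byte-order mark that a file written with 'utf-8-sig' (as `write_words_to_file` does) carries when read back as plain UTF-8.

```python
def parse_labelled_line(line):
    """Split 'target, expanded, label' into (str, str, int)."""
    # Remove a leading byte-order mark, if any, before splitting
    parts = [part.strip() for part in line.lstrip("\ufeff").split(",")]
    target, expanded, label = parts
    return target, expanded, int(label)


def precision_of_expansion(lines):
    """Fraction of non-empty lines whose label is 1."""
    labels = [parse_labelled_line(line)[2] for line in lines if line.strip()]
    return sum(labels) / len(labels) if labels else 0.0


# The five labelled lines printed above
sample = [
    "\ufeffSociable, radians, 0",
    "Sociable, modulo, 0",
    "Sociable, triplelevel, 0",
    "Energetic, dopamine, 1",
    "Energetic, legumes, 0",
]
precision = precision_of_expansion(sample)
print(precision)
```

On these five lines only one expansion is labelled as related, so the value is 0.2; this says nothing about the score over the full set of 120 labelled outputs.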


### 4.1b Expanded Set Automatic Labelling

Automate labelling for evaluation

In [None]:
test_str = f'{pkl_dir}/related_words_matrix.pkl'

def check_test_dataset():
    test_dataset = load_pickle_file(test_str)
    for vec in test_dataset:
        for arr in vec:
            print(arr)
        print()

def convert_test_dataset():
    test_dataset = load_pickle_file(test_str)
    print(test_dataset)
    test_dataset_dict = {}
    for vec in test_dataset:
        vec[1] = [s.lower().strip() for s in vec[1]]
        vec[2] = [s.lower().strip() for s in vec[2]]
        test_dataset_dict.update(
            { vec[0][0].strip(): vec[1] + vec[2] }
        )
    store_pickle_file(test_dataset_dict, test_str)
    return test_dataset_dict

def write_words_to_file(file_path, arr):
    # Open and write to file path passed
    with open(file_path, 'w', encoding='utf-8-sig') as f:
        string = '\n'.join(arr)
        f.write(string)
    # Mark EOF
    pass

def debug_tds_dict():
    # ...
    test_dataset_dict = convert_test_dataset()
    print('public' in test_dataset_dict['private'])
    # Mark EOF
    pass

def clean_up():
    """ Remove labelled files. """
    par_dir = 'expanded_sets_'
    for foldername in os.listdir(par_dir):
        for fn in os.listdir(os.path.join(par_dir, foldername)):
            # Any name containing 'lab_' (including 'lab_lab_') is a labelled file
            if 'lab_' in fn:
                os.remove(os.path.join(par_dir, foldername, fn))
    # Mark EOF
    pass

def main(use_gdown=False):
    """ 
    Read each expanded set from its file. 
    For each line check if expanded_word is a synonym or antonym 
    of target_word according to the test dataset. 
    If yes then add a label 1, otherwise 0 (or -1 if target_word 
    is missing from the test dataset). 

    File contents examples: 
    set_expansion_output.txt = 'target_word, expanded_word, similarity\n'
    labelled_SEO.txt = 'target_word, expanded_word, label ∈ {-1, 0, 1}, similarity\n'
    """
    if not use_gdown:
        # Build the lookup dictionary from the local test dataset
        test_dataset_dict = convert_test_dataset()
    else:
        # Save time by downloading the already-converted test dataset
        url = 'https://drive.google.com/drive/folders/1zYkKs5Yn3Wl_SjmLMBtGBcv83Dd2RI-u?usp=sharing'
        output = f'{pkl_dir}'
        gdown.download_folder(url, output=output, quiet=False)
        test_dataset_dict = load_pickle_file(f"{pkl_dir}/related_words_matrix.pkl")
    
    # Read entity set
    entity_set_fp = f'Dissertation/M6/data_prep/entity_set.txt'
    entity_set = read_words_from_file(entity_set_fp)

    dir_ = 'expanded_sets_'
    errors = 0
    for k in [3, 5, 10, 15]:
        for fn in tqdm.tqdm(os.listdir(f'{dir_}/{k}')):
            # Create file path 
            fp = dir_ + f'/{k}/' + fn
            
            # Create expanded set
            expanded_set = read_words_from_file(fp)

            # Label every line of the expanded set
            labelled_lines = []
            for line in list(expanded_set):
                # Extract target word, expanded word and similarity
                try: 
                    tw, ew, sim = line.split(',')
                except ValueError:
                    # Skip lines that do not have exactly three fields
                    errors += 1
                    continue
                # Determine label
                label = -1
                try:
                    if ew.lower().strip() in test_dataset_dict[tw.lower()]:
                        label = 1
                    else:
                        label = 0
                except KeyError as e:
                    print(e)
                # Update `labelled_lines`
                labelled_lines.append(f'{tw}, {ew}, {label}, {sim}')

            # Write labelled expanded set to text file
            write_words_to_file(f'{dir_}/{k}/lab_{fn}', labelled_lines)
    
    print(f"\nNo. of errors: {errors}")

    # Mark EOF
    pass


# clean_up()
main(use_gdown=True)
# debug_tds_dict()

Retrieving folder list


Processing file 1pGADoR2ad0VgNHX7DVjX_o2BgyL7F9Ib related_words_matrix.pkl
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1pGADoR2ad0VgNHX7DVjX_o2BgyL7F9Ib
To: /content/pkl_files/related_words_matrix.pkl
100%|██████████| 263k/263k [00:00<00:00, 54.3MB/s]
Download completed
100%|██████████| 7/7 [00:00<00:00, 247.68it/s]
100%|██████████| 7/7 [00:00<00:00, 175.04it/s]
100%|██████████| 7/7 [00:00<00:00, 80.56it/s]
100%|██████████| 7/7 [00:00<00:00, 81.16it/s]


No. of errors: 0





This section processes a dataset of related words, cleans it, and then labels each expanded set according to whether every expanded word is a known synonym or antonym of its target word. The script consists of several functions that perform specific tasks, which the `main` function combines to complete the overall workflow.

1. `check_test_dataset()`: This function loads the test dataset and prints its contents.

2. `convert_test_dataset()`: This function converts the test dataset by changing the words to lowercase, stripping any extra spaces, and updating the dataset dictionary with a new key-value pair combining both synonyms and antonyms. It returns the updated dataset dictionary.

3. `write_words_to_file(file_path, arr)`: This function writes an array of words to a file specified by the file_path parameter.

4. `debug_tds_dict()`: This is a debugging function that prints whether the word "public" is present in the test dataset dictionary with the key "private".

5. `clean_up()`: This function removes labeled files from the "expanded_sets_" directory.

6. `main(use_gdown=False)`: This is the main function that orchestrates the entire workflow. It loads the test dataset, reads the entity set, and iterates through the expanded sets, labeling each expanded word 1 if it is a known synonym or antonym of its target word and 0 otherwise.

**Main**:

The `main` function starts by checking the `use_gdown` flag. If it is `False`, the function calls `convert_test_dataset()` to build the related-words dictionary. If it is `True`, the function downloads a pre-built pickle file (`related_words_matrix.pkl`) from a Google Drive folder and loads it instead.

Next, the function reads the entity set from a file and initializes the error counter. It iterates through the expanded sets in the directory "expanded_sets_" for each value of K (3, 5, 10, 15) and processes every file line by line. Each line holds a target word, an expanded word, and a similarity score. The expanded word is labeled 1 if it appears among the target word's known synonyms and antonyms and 0 if it does not; if the target word is missing from the test dataset, the label stays at -1. The labeled lines are then written to a new text file prefixed with `lab_`.

Finally, it prints the number of lines that were skipped because they could not be split into exactly three comma-separated fields.
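
The labeling rule can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's exact code: `label_line` and the toy `related_words` dictionary are hypothetical names, and each input line is assumed to have the form `target_word,expanded_word,similarity`.

```python
def label_line(line, related_words):
    """Return 'target, expanded, label, similarity', or None if malformed.

    label is 1 if the expanded word is a known synonym/antonym of the
    target word, 0 if it is not, and -1 if the target word is unknown.
    """
    parts = [p.strip() for p in line.split(',')]
    if len(parts) != 3:
        return None  # the notebook counts such lines as errors
    target, expanded, similarity = parts
    known = related_words.get(target.lower())
    if known is None:
        label = -1
    else:
        label = 1 if expanded.lower() in known else 0
    return f'{target}, {expanded}, {label}, {similarity}'


# Toy dictionary (assumption): target word -> set of synonyms and antonyms
related_words = {'private': {'public', 'personal'}}

print(label_line('private,public,0.81', related_words))
print(label_line('private,banana,0.42', related_words))
print(label_line('unknown,word,0.10', related_words))
```

Returning `None` for malformed lines keeps the error counting in the caller, which mirrors how the loop above skips lines it cannot split.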

### 4.2 Evaluation Metrics
#### Recall, Precision, F1-Score.

In [None]:
def load_y(fp=
        f"Dissertation/M6/output_set_expansion/labelled/"
        f"eset_mit_wiki_.txt"
    ):
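    # Read lines of the form 'target, expanded, label, similarity'
    words = read_words_from_file(
        fp, rtype='list'
    )
    y_pred_ = []
    y_true_ = []
    # Extract the label and the similarity from each line
    for word in words:
        # Second-to-last token is the label followed by a comma
        y = word.split()[-2][:-1]
        y_pred_.append(int(y))
        # Last token is the similarity score (thresholded later in `main`)
        y_true_.append(float(word.split()[-1]))
    # Return labels (`y_pred_`) and similarity scores (`y_true_`)
    return y_pred_, y_true_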

def main(fp='', out=False):
    if fp == '':
        y_pred, y_true = load_y()
    else:
        y_pred, y_true = load_y(fp)

    # Binarize the similarity scores: 1 if above ~0.7, otherwise 0
    for i,v in enumerate(y_true):
        if v > 0.6999999:
            y_true[i] = 1
        else:
            y_true[i] = 0

    if len(y_pred) != len(y_true):
        raise ValueError("`y_pred` and `y_true` must have the same length.")

    # Calculate precision
    precision = precision_score(y_true, y_pred)

    # Calculate recall
    recall = recall_score(y_true, y_pred)

    # Calculate F1 score
    f1 = f1_score(y_true, y_pred)

    if out:
        print(f"Recall: {recall}")
        print(f"Precision: {precision}")
        print(f"F1 Score: {f1}\n")

    return recall, precision, f1


# dir_ = 'labelled_'
dir_ = 'expanded_sets_'  # Dir. of sets
k_values = [3, 5, 10, 15]  # Top K
results = [[] for y in range(7)]
# wsi = Window Size Index
wsi = {2: 0, 5: 1, 8: 2, 10: 3, 12: 4, 15: 5, 20: 6}
out = False

# Eval. every file for each folder of K
for k in k_values:
    for fn in os.listdir(f'{dir_}/{k}'):
        # Only eval. labelled files
        if 'lab_set_' not in fn:
            continue
        # Calc. and out window size
        win_size = fn.split('_')[-1].split('.')[0][2:-1]
        if out:
            print(f'window_size = {win_size}, Metrics@{k}')
        # Perform eval. on e-set located at `fp`
        fp = dir_ + f'/{k}/' + fn
        for val in main(fp):
            results[wsi[int(win_size)]].append(round(val, 4))

for i,v in enumerate(results):
    v.insert(0, list(wsi.keys())[i])

# Build column headers: R@k, P@k, F@k for each k
metrics = 'RPF'
headers = ['Win. Size'] + [f'{char}@{k}' for k in k_values for char in metrics]

print(tabulate(results, headers, tablefmt="github"))


|   Win. Size |    R@3 |    P@3 |    F@3 |    R@5 |    P@5 |    F@5 |   R@10 |   P@10 |   F@10 |   R@15 |   P@15 |   F@15 |
|-------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
|           2 | 0.0691 | 0.4692 | 0.1204 | 0.0546 | 0.4022 | 0.0962 | 0.0429 | 0.3167 | 0.0756 | 0.0429 | 0.3167 | 0.0756 |
|           5 | 0.0818 | 0.352  | 0.1327 | 0.0622 | 0.284  | 0.102  | 0.0438 | 0.1963 | 0.0717 | 0.0438 | 0.1963 | 0.0717 |
|           8 | 0.0711 | 0.2627 | 0.1119 | 0.0582 | 0.2035 | 0.0906 | 0.0437 | 0.1466 | 0.0673 | 0.0437 | 0.1466 | 0.0673 |
|          10 | 0.0571 | 0.1947 | 0.0884 | 0.0467 | 0.1572 | 0.072  | 0.0355 | 0.1158 | 0.0543 | 0.0355 | 0.1158 | 0.0543 |
|          12 | 0.0592 | 0.181  | 0.0892 | 0.0447 | 0.1429 | 0.0681 | 0.0395 | 0.1115 | 0.0583 | 0.0395 | 0.1115 | 0.0583 |
|          15 | 0.0486 | 0.1416 | 0.0724 | 0.039  | 0.1154 | 0.0583 | 0.0319 | 0.0931 | 0.0475 | 0.0319 | 0.0931 | 0.0475 |
|       

This code evaluates the results of the expanded set of words by calculating the precision, recall, and F1 score.

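`load_y` reads each labeled line and returns two lists: `y_pred`, holding the binary label from the labeling step (1 if the expanded word is a known synonym or antonym, 0 otherwise), and `y_true`, holding the Word2Vec similarity score. `main` then binarizes `y_true` by setting every score above roughly 0.7 to 1 and every other score to 0. Note that these variable names are the reverse of the TP/FP definitions in the to-do list, where the thresholded similarity is the prediction and the test-data label is the ground truth. Because the lists are passed to scikit-learn as named, the reported precision and recall are swapped relative to those definitions, while F1 is unaffected.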

The code then uses the `precision_score`, `recall_score`, and `f1_score` functions from the scikit-learn library to calculate precision, recall, and F1 score respectively. These metrics are commonly used to evaluate the performance of a binary classifier.

Finally, the metrics for every window size and every value of K are rounded to four decimal places and printed as a single table using `tabulate`.
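
For reference, precision, recall, and F1 can also be computed by hand from two binary lists. The sketch below is a pure-Python illustration, not the scikit-learn implementation: `binary_metrics` is a hypothetical helper and the data is made up. It assumes the test-data label is the ground truth and the thresholded similarity is the prediction, following the TP/FP definitions in the to-do list, and it reuses a 0.7 cut-off.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for two equal-length lists of 0/1 values."""
    if len(y_true) != len(y_pred):
        raise ValueError("`y_true` and `y_pred` must have the same length.")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is empty
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (assumption): test-data labels and Word2Vec similarity scores
labels = [1, 0, 1, 1, 0]
similarities = [0.82, 0.75, 0.64, 0.91, 0.30]

threshold = 0.7
predictions = [1 if s > threshold else 0 for s in similarities]

precision, recall, f1 = binary_metrics(labels, predictions)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

Swapping the two arguments swaps precision and recall but leaves F1 unchanged, so the argument order matters when comparing individual columns.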

# `Todo List`



❌🟡✅= Incomplete, Started (but not finished), Complete

* https://drive.google.com/drive/my-drive
* ✅ Either **make the main file public** for N.R or create and send a copy 
* 🟡 **Finalize current W2V method**
* ✅ **Refine the definition of TP, TN, FP, and FN**: Make sure that your definitions for True Positive, True Negative, False Positive, and False Negative are clear and mutually exclusive.  
	* Refined definitions:
	1. True Positive (TP): The word returned by W2V has a similarity percentage higher than the defined threshold, and the word is a known synonym or antonym (appears in test data).
	2. True Negative (TN): The word returned by W2V has a similarity percentage lower than the defined threshold, and the word is not a known synonym or antonym (does not appear in test data).
	3. False Positive (FP): The word returned by W2V has a similarity percentage higher than the defined threshold, but the word is not a known synonym or antonym (does not appear in test data).
	4. False Negative (FN): The word returned by W2V has a similarity percentage lower than the defined threshold, but the word is a known synonym or antonym (appears in test data). 
* ✅ **Find related papers for evaluation methodology (WRITTEN BELOW)**: To ensure the correctness and validity of your evaluation process, search for similar papers and study their evaluation methodologies. Incorporate any relevant insights into your project. To clarify how F1 is calculated for the top k results, review the original paper and related papers, and note any discrepancies or inconsistencies between the figures and the results reported in the original paper.
* 🟡 **Determine the appropriate threshold for low and high percentages** to ensure that each answer corresponds to only one of these classes.
	* To determine the appropriate threshold for low and high percentages, I can: **Analyze the distribution of similarity percentages in my dataset** to identify natural cut-off points. 
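
The four definitions above can be sketched as a small function that maps each expanded word to exactly one outcome, which makes it easy to see how the counts shift as the threshold changes. The function name `outcome`, the toy pairs, and the candidate thresholds are all hypothetical. A similarity exactly equal to the threshold is treated as "lower", which is an assumption because the definitions do not cover that case.

```python
from collections import Counter


def outcome(similarity, is_known, threshold):
    """Map one expanded word to 'TP', 'FP', 'FN' or 'TN'."""
    predicted_positive = similarity > threshold
    if predicted_positive and is_known:
        return 'TP'
    if predicted_positive and not is_known:
        return 'FP'
    if not predicted_positive and is_known:
        return 'FN'
    return 'TN'


# Toy (similarity, appears_in_test_data) pairs; made up for illustration
pairs = [(0.82, True), (0.75, False), (0.64, True), (0.91, True), (0.30, False)]

# Count outcomes for a few candidate thresholds
for threshold in (0.5, 0.7, 0.9):
    counts = Counter(outcome(s, known, threshold) for s, known in pairs)
    print(threshold, dict(counts))
```

Because every pair falls into exactly one branch, the four classes are mutually exclusive and their counts always sum to the number of pairs.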

* ❌ **Investigate Dependency-Based word embeddings**: 
	* SkipGram https://www.geeksforgeeks.org/implement-your-own-word2vecskip-gram-model-in-python/
	* Generalize SkipGram by replacing bag-of-words (BoW) contexts with arbitrary contexts
	* Linear Bag-of-Words Contexts
	* Dependency-Based Contexts

## Secondary

* ❌ Reorganise folders created for .pkl files
* ❌ Incorporate test data into EDA section
* ❌ Expand book corpora
* ❌ Explain W2V model with TSNE
* ❌ Adapt Transformers to lexicon construction
* ❌ Add .py tests

#### Hyper-parameter testing

* ❌ size (Vector Size) [100, 200, 300]
* ❌ workers (*Research this one)
* ❌ negative (*Research this one)
* ❌ min_alpha (*Research this one)

#### Other Evaluation

* ❌ Evaluate different 'seed sets'
* ❌ Evaluate Google's model



## Evaluation Methodology Justification

* In the 2019 paper "Entity Set Expansion for Detecting Fashion Trends," the authors define metrics@K as Recall, Precision, and F-score calculated for top K predicted results, with K taking the values of 5, 10, and 15. The evaluation is based on the percentage of true positive results within the top K results. For unsupervised methods, the top K results are determined by selecting the K most similar words.

* In the 2014 paper "A Search Log Sparseness Oriented Query Expansion Method," the authors define metrics@K as Prec@N, specifically Prec@10 and Prec@20. These metrics represent the accuracy of the first N documents (N being 10 or 20) in the search results. 

* In the 2022 paper "A Meta Path Based Method for Entity Set Expansion in Knowledge Graph," the authors define metrics@K using two popular evaluation criteria: precision-at-k (p@k) and mean average precision (MAP). Precision-at-k (p@k) measures the percentage of correct entities within the top k results, where k is set to 30, 60, and 90 in this study. 

* In the 2018 paper "Entity Set Expansion with Semantic Features of Knowledge Graphs," the authors define metrics@K using precision-at-k (p@k) as their primary evaluation metric. Precision-at-k (p@k) is the mean of the percentages of relevant entities in the top-k ranked results for all queries. In this study, the authors measure p@5 and p@10, which assess the performance of their approach at two specific cut-off points in the ranked results. 

* In summary, the four papers define metrics@K differently. In the 2019 paper, metrics are defined as Recall, Precision, and F-score for top K predicted results with K values of 5, 10, and 15. In the 2014 paper, metrics are defined as Prec@N, specifically Prec@10 and Prec@20, representing the accuracy of the first N documents. In the 2022 paper, metrics are defined using precision-at-k (p@k) and mean average precision (MAP) with k values of 30, 60, and 90. Lastly, in the 2018 paper, metrics are defined using precision-at-k (p@k), measuring the mean percentages of relevant entities in top-k ranked results, specifically p@5 and p@10.
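
As a rough illustration of metrics@K over a ranked list, the sketch below computes precision, recall, and F-score at a cut-off k. It is one common reading of these metrics, not necessarily the exact procedure of any of the papers above: the function names and toy data are hypothetical, precision is divided by k, and recall is divided by the total number of relevant items.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def f_score_at_k(ranked, relevant, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy data (assumption): words ranked by similarity, and the known related words
ranked = ['public', 'personal', 'banana', 'secret', 'open']
relevant = {'public', 'personal', 'secret', 'confidential'}

for k in (3, 5):
    print(f"P@{k}={precision_at_k(ranked, relevant, k):.4f} "
          f"R@{k}={recall_at_k(ranked, relevant, k):.4f} "
          f"F@{k}={f_score_at_k(ranked, relevant, k):.4f}")
```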

### Passages from Papers:

1. **Entity Set Expansion for Detecting Fashion Trends (2019)**  
"Table II shows the quantitative results for the four scenarios:
W2V and DEP, with and without supervision. The columns
show percentages for Recall, Precision and F-score at top
k predicted results, for k = 5, 10, 15. For the unsupervised
methods, top k results are the most similar k words in their
respective vector spaces."

2. **A Search Log Sparseness Oriented Query Expansion Method (2014)**  
"B. Experimental evaluation metrics
The experiment mainly uses TREC standard to rank
search results. The main evaluation index is MAP (Mean of
Average Precision), the accuracy of the first N documents
Prec@N (Prec@10 and Prec@20), comparison and analysis
the query performance of the query expansion methods.
Baseline of the experiment is the search results of vector
space model system without any query expansion method."

3. **A Meta Path Based Method for Entity Set Expansion in Knowledge Graph (2022)**  
"We employ two popular criteria of precision-at-k (p@k) and
mean average precision (MAP) to evaluate the performance
of our approach. p@k is the percentage of top k results that
belong to correct entities. Here, they are p@30, p@60 and
p@90. MAP is the mean of the average precision (AP) of the
p@30, p@60 and p@90."

4. **Entity Set Expansion with Semantic Features of Knowledge Graphs (2018)**  
"5.3. Evaluation metrics
We adopt the following metrics [48] for experimental evaluation:
• Precision@k (shorted as p@k): the mean of the percentages
of the relevant entities in the top-k ranked results for all
queries, p@5 and p@10 are measured in our study."

## Done

* ✅ Expand online-wikipedia corpora 
* ✅ Combine corpora of books and wikipedia pages
* ✅ Train one final model on this combined corpora
* ✅ Import and use `tqdm`
* ✅ Test different window sizes.  
    ✅ This cannot be done until the dataset is finalized.

* ✅ Tested different (Window Size) [2, 5, 8, 10, 12, 15]
* ✅ Speed up the notebook. Replace scraping and PP with downloading files from cloned repo. 

#### Evaluate different Metrics at K i.e. R@K, P@K, F@K.
* ✅ K = 3
* ✅ K = 5
* ✅ K = 10
* ✅ K = 15

#### Other
* ✅ Video presentation