# Fun with CALC and pre-trained GloVe word embeddings

This is an attempt to make CALC searchable by leveraging pre-trained [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings.

Note that unlike other notebooks in this repository, this one requires Anaconda and runs on Python 3.6.

In [1]:
from pathlib import Path
import re

import pandas as pd
import numpy as np

In [2]:
rows = pd.read_csv('../data/hourly_prices.csv', index_col=False, thousands=',')

## Build our vocabulary

Before running this cell, you'll need to manually download [glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip) and extract it into the `data` directory of this repository.

In [3]:
DIMS = 50

GLOVE_FILE = Path(".") / ".." / "data" / f"glove.6B.{DIMS}d.txt"

MAX_VOCAB = 400_000

words_to_indices = {}
indices_to_words = {}

# It's probably easier to allocate a matrix that's too big
# and then trim it if we under-filled it, than it is to
# constantly make it bigger on each iteration. Though there
# might be an easier way to do this that I don't know about.
vocab = np.zeros(shape=(MAX_VOCAB, DIMS), dtype=np.float32)

# Count the rows we fill, so the trim below also works for an empty file.
num_words = 0

# Use a context manager so the file is closed once we're done reading.
with GLOVE_FILE.open(encoding='utf-8') as glove_file:
    for i, line in zip(range(MAX_VOCAB), glove_file):
        # Each line is a word followed by DIMS space-separated floats.
        parts = line.split(' ')
        word = parts[0]
        vocab[i] = np.array(list(map(float, parts[1:])))
        words_to_indices[word] = i
        indices_to_words[i] = word
        num_words = i + 1

# Now make the matrix smaller if we under-filled it.
vocab = vocab[:num_words]
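
On the "easier way" wondered about in the comment above: one option is to collect the rows in a plain Python list and build the matrix once at the end, so nothing needs trimming. This is only a sketch under assumptions. The `load_glove` helper, its `max_vocab` parameter, and the toy input are hypothetical and not part of this notebook.

```python
import itertools

import numpy as np


def load_glove(lines, max_vocab=None):
    """Hypothetical helper: parse GloVe text lines into words and a matrix.

    `lines` is any iterable of strings, such as an open file.
    """
    words = []
    rows = []
    # islice stops after max_vocab lines, or reads everything if it is None.
    for line in itertools.islice(lines, max_vocab):
        parts = line.split()
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    # Build the matrix once at the end, so nothing needs trimming afterwards.
    matrix = np.array(rows, dtype=np.float32)
    word_to_index = {word: i for i, word in enumerate(words)}
    return words, word_to_index, matrix


# Toy input standing in for the real GloVe file (made-up numbers).
toy_lines = ["the 0.1 0.2\n", "cat 0.3 0.4\n", "sat 0.5 0.6\n"]
toy_words, toy_index, toy_matrix = load_glove(toy_lines, max_vocab=2)
print(toy_words, toy_matrix.shape)
```

With the real data, the open `GLOVE_FILE` would be passed in place of `toy_lines`. The trade-off is that the intermediate Python lists temporarily use more memory than the preallocated array.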

## Convert labor categories into vectors

To do this, we'll just average the GloVe vectors for all the words in a labor category, skipping any words that aren't in our vocabulary. It's pretty simple, but many sources say it actually works pretty well; it also (hopefully) helps that labor categories are fairly short and word ordering doesn't tend to matter much in them.

In [4]:
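# Use a set literal here: set('i', ) only works by accident, because it
# iterates over the characters of the string 'i'.
IGNORE_WORDS = {
    'i',   # We don't want to confuse the roman numeral "I" with the first-person pronoun
}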

SPECIAL_CHARS = re.compile(r'[^A-Za-z]')

WHITESPACE = re.compile(r'\s+')

def normalize_labor_category(name):
    return WHITESPACE.sub(' ', SPECIAL_CHARS.sub(' ', name.lower())).strip()

def labor_category_to_vector(name):
    words = [
        word for word in normalize_labor_category(name).split()
        if word not in IGNORE_WORDS and word in words_to_indices
    ]
    vector = np.zeros(shape=(1, DIMS), dtype=np.float32)
    if len(words) > 0:
        for word in words:
            vector[0] += vocab[words_to_indices[word]]
        vector = vector / len(words)
    return vector

# Quick sanity check
assert normalize_labor_category('Blarg (flag)-2') == 'blarg flag'

In [5]:
def map_labor_categories_to_vectors(labor_categories):
    n = len(labor_categories)
    result = np.zeros(shape=(n, DIMS), dtype=np.float32)
    for i in range(n):
        # Use positional access so this also works if the index isn't 0..n-1.
        result[i] = labor_category_to_vector(labor_categories.iloc[i])
    return result

labor_categories = map_labor_categories_to_vectors(rows['Labor Category'])
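
To make the averaging concrete, here is a small self-contained sketch on a made-up three-word vocabulary with two dimensions. The names `toy_indices`, `toy_vocab`, and `average_vector` are hypothetical, and the numbers are invented for illustration. It uses NumPy's fancy indexing and `mean` instead of the explicit loop above, which should give the same result.

```python
import numpy as np

# Made-up vocabulary: three words, two dimensions each.
toy_indices = {"senior": 0, "software": 1, "engineer": 2}
toy_vocab = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
], dtype=np.float32)


def average_vector(words, word_to_index, matrix):
    """Average the rows of `matrix` for the words that are in the vocabulary."""
    known = [word_to_index[w] for w in words if w in word_to_index]
    if not known:
        # Same fallback as the notebook's function: an all-zero vector.
        return np.zeros(shape=(1, matrix.shape[1]), dtype=matrix.dtype)
    # Fancy indexing picks the rows; keepdims keeps the (1, dims) shape.
    return matrix[known].mean(axis=0, keepdims=True)


print(average_vector(["senior", "engineer", "iii"], toy_indices, toy_vocab))
```

Here "iii" is not in the toy vocabulary, so it is skipped and the result is the mean of the rows for "senior" and "engineer", i.e. (1.5, 1.0).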

## Define a naive KNN algorithm

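I've never actually implemented k-nearest neighbors before, but here's a naive attempt: compute the Euclidean distance from the query's vector to every labor category's vector, sort the distances, and keep the `k` closest. I should probably just use scikit-learn, though.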

In [6]:
def find_closest(name, k=5):
    vector = labor_category_to_vector(name)
    distances = np.linalg.norm(labor_categories - vector, axis=1)
    indices = np.argsort(distances)
    return list(rows.iloc[list(indices[:k])]['Labor Category'])
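
A full `argsort` orders every row just to keep the first `k`. One possible refinement is `np.argpartition`, which only guarantees that the `k` smallest distances come first. This is a minimal sketch on made-up data, not a replacement for the cell above; `k_nearest` and `toy_matrix` are hypothetical names not used elsewhere in this notebook.

```python
import numpy as np


def k_nearest(matrix, query, k=5):
    """Return indices of the k rows of `matrix` closest to `query` (Euclidean)."""
    distances = np.linalg.norm(matrix - query, axis=1)
    k = min(k, len(distances))
    if k == 0:
        return np.array([], dtype=np.intp)
    # argpartition places the k smallest distances first, in arbitrary order.
    candidates = np.argpartition(distances, k - 1)[:k]
    # Sort only those k candidates so the closest row comes first.
    return candidates[np.argsort(distances[candidates])]


# Made-up 2-D points standing in for the labor category vectors.
toy_matrix = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [0.2, 0.1]])
print(k_nearest(toy_matrix, np.array([[0.0, 0.0]]), k=2))
```

The returned indices could be passed to `rows.iloc` the same way `find_closest` does. For a dataset of this size the difference is probably small, and scikit-learn's `NearestNeighbors` class would be another option.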

## Perform some searches

Now that we have all the infrastructure, let's try doing some searches!

How about a search for the word "tutor"?

In [7]:
find_closest('tutor')

['Tutor',
 'Tutor ***',
 'Engineer - Graduate/Apprentice',
 'Principal Instruction Technologist',
 'Principal Instruction Technologist- Training']

That's kind of cool: it found "Principal Instruction Technologist", which _sounds_ tutor-like, even though it doesn't have the word "tutor" in it.