## Using latent semantic indexing on labor categories

This is an attempt to use gensim's [Latent Semantic Indexing](https://radimrehurek.com/gensim/models/lsimodel.html) functionality with contract data, providing us with a way to find contract rows whose labor categories are similar to one we're looking at. Then we'll combine that data with some other dimensions from the contract rows, like price and minimum experience, and finally use a [K Nearest Neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm to help us find comparables for any given criteria.

Code is largely based off the [Making an Impact with Python Natural Language Processing Tools](https://www.youtube.com/watch?v=jSdkFSg9oW8) Pycon 2016 tutorial, specifically its [LSI with Gensim](https://github.com/totalgood/twip/blob/master/docs/notebooks/09%20Features%20--%20LSI%20with%20Gensim.ipynb) notebook.

In [1]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
rows = pd.read_csv('../data/hourly_prices.csv', index_col=False, thousands=',')

In [2]:
from gensim.models import LsiModel, TfidfModel
from gensim.corpora import Dictionary



We'll build a vocabulary off the labor categories in the contract rows, and then build a [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) matrix off it.

In [3]:
vocab = Dictionary(rows['Labor Category'].str.split())

In [4]:
tfidf = TfidfModel(id2word=vocab, dictionary=vocab)

In [5]:
bows = rows['Labor Category'].apply(lambda x: vocab.doc2bow(x.split()))

In [6]:
vocab.token2id['engineer']

651

In [7]:
vocab[0]

'Manager'

In [8]:
dict([(vocab[i], round(freq, 2)) for i, freq in tfidf[bows[0]]])

{'Manager': 0.65, 'Project': 0.76}

Here we'll build a LSI model that places each labor category into a 5-dimensional vector.

In [9]:
lsi = LsiModel(tfidf[bows], num_topics=5, id2word=vocab, extra_samples=100, power_iters=2)

In [10]:
len(vocab)

6944

In [11]:
topics = lsi[bows]
df_topics = pd.DataFrame([dict(d) for d in topics], index=bows.index, columns=range(5))

In [12]:
lsi.print_topic(1, topn=5)

'-0.511*"Consultant" + -0.333*"Analyst" + 0.329*"Manager" + 0.272*"Project" + -0.255*"Senior"'

This part is a bit weird: we're extending our vectors with information about the price and minimum experience of each contract row, semi-normalizing the data so that they don't "overwhelm" the importance of the LSI dimensions when calculating distances between points.

I have no idea if this is actually legit.

In [13]:
PRICE_COEFF = 1 / 500.0
XP_COEFF = 1 / 10.0

df_topics['Price'] = (rows['Year 1/base'] * PRICE_COEFF).fillna(0)
df_topics['Experience'] = (rows['MinExpAct'] * XP_COEFF).fillna(0)

Now we'll use a [K Nearest Neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm to make it easy for us to find vectors that are nearby.

In [14]:
from sklearn.neighbors import NearestNeighbors

df_topics = df_topics.fillna(0)
neigh = NearestNeighbors(n_neighbors=5)
neigh.fit(df_topics)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [15]:
neigh.kneighbors(df_topics.ix[0].values.reshape(1, -1), return_distance=False)

array([[  0,  17,  18, 104, 256]])

Here's where things potentially become useful: we'll create a function that takes a labor category, price, and experience, and returns a list of comparables from our vector space.

In [16]:
def get_neighbors(labor_category, price, experience):
    vector = []
    topic_values = lsi[tfidf[vocab.doc2bow(labor_category.split())]]
    vector.extend([v[1] for v in topic_values])
    vector.extend([price * PRICE_COEFF, experience * XP_COEFF])
    
    neighbors = list(neigh.kneighbors([vector], return_distance=False)[0])
    return pd.DataFrame([rows.loc[i] for i in neighbors], index=neighbors)

get_neighbors('Awesome Engineer', 80, 5)

Unnamed: 0,Labor Category,Year 1/base,Year 2,Year 3,Year 4,Year 5,Education,MinExpAct,Bus Size,Location,COMPANY NAME,CONTRACT .,Schedule,SIN NUMBER,Contract Year,Begin Date,End Date,CurrentYearPricing
33394,Engineer,79.86,81.46,83.09,84.75,86.44,Bachelors,5.0,O,Contractor Site,"SGT, Inc.",GS-23F-0381K,PES,"871-1,2,3,4,5,6/RC",1,8/10/2000,8/9/2020,79.86
33791,Testing Engineer,78.95,81.21,83.53,85.92,88.38,Bachelors,5.0,S,,"Procon Consulting, LLC",GS-00F-247CA,Consolidated,C871 7,1,8/12/2015,8/11/2020,78.95
32842,Engineer,81.1,82.97,84.88,86.83,88.83,Bachelors,5.0,O,Customer Site,Navmar Applied Sciences Corporation,GS-10F-0281U,PES,"871-1, 871-2, 871-3, 871-4, 871-5, 871-6",3,7/8/2008,7/7/2018,84.88
33844,Engineer,78.79,81.0,83.26,85.6,87.99,Bachelors,5.0,S,Both,"Moca Systems, Inc.",GS-00F-255CA,Consolidated,874-1,1,8/18/2015,8/17/2020,78.79
33845,Engineer,78.79,81.0,83.26,85.6,87.99,Bachelors,5.0,S,Both,"Moca Systems, Inc.",GS-00F-255CA,Consolidated,874-7,1,8/18/2015,8/17/2020,78.79
