<a href="https://colab.research.google.com/github/SDS-AAU/SDS-master/blob/master/M2/notebooks/W3_job_recommender_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting data from Kaggle

In [1]:
# install kaggle package
! pip install -q kaggle

Upload your `kaggle.json` API key.

In [2]:
# make folder for the API key (-p: no error if it already exists)
! mkdir -p ~/.kaggle

In [3]:
# copy key into folder
! cp kaggle.json ~/.kaggle/

In [4]:
# change access permissions
! chmod 600 ~/.kaggle/kaggle.json
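
The three shell steps above (create the folder, copy the key, restrict permissions) can also be done from Python. This is a minimal sketch using only the standard library; the function name `install_kaggle_key` and its `home` parameter are made up for illustration and are not part of the Kaggle package.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_key(key_file, home=None):
    """Copy a Kaggle API key into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper: `home` defaults to the current user's home directory.
    """
    target_dir = Path(home or Path.home()) / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p ~/.kaggle`
    target = target_dir / "kaggle.json"
    shutil.copy(key_file, target)                  # like `cp kaggle.json ~/.kaggle/`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like `chmod 600`
    return target


# usage (assumes kaggle.json was uploaded to the working directory):
# install_kaggle_key("kaggle.json")
```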

In [5]:
# check that it worked
! kaggle datasets list

ref                                                         title                                              size  lastUpdated          downloadCount  
----------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  
gpreda/reddit-vaccine-myths                                 Reddit Vaccine Myths                              235KB  2021-10-21 20:52:33          14407  
crowww/a-large-scale-fish-dataset                           A Large Scale Fish Dataset                          3GB  2021-04-28 17:03:01           8674  
imsparsh/musicnet-dataset                                   MusicNet Dataset                                   22GB  2021-02-18 14:12:19           3856  
dhruvildave/wikibooks-dataset                               Wikibooks Dataset                                   2GB  2021-10-22 10:48:21           3229  
fatiimaezzahra/famous-iconic-women                          Famous Iconic Wo

To get the data, go to the dataset's Kaggle page and ... and copy the API command.

In [6]:
! kaggle datasets download -d kandij/job-recommendation-datasets

Downloading job-recommendation-datasets.zip to /content
 78% 41.0M/52.4M [00:00<00:00, 62.4MB/s]
100% 52.4M/52.4M [00:00<00:00, 89.1MB/s]


In [7]:
! unzip /content/job-recommendation-datasets.zip

Archive:  /content/job-recommendation-datasets.zip
  inflating: Combined_Jobs_Final.csv  
  inflating: Experience.csv          
  inflating: Job_Views.csv           
  inflating: Positions_Of_Interest.csv  
  inflating: job_data.csv            


In [8]:
import pandas as pd
import numpy as np
import spacy

# load the English spaCy pipeline
# ('en' is a spaCy 2.x shortcut; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm')
nlp = spacy.load('en')

In [9]:
df_jobs = pd.read_csv('/content/Combined_Jobs_Final.csv')
df_views = pd.read_csv('/content/Job_Views.csv')
df_poi = pd.read_csv('/content/Positions_Of_Interest.csv')
df_exp = pd.read_csv('/content/Experience.csv')


In [10]:
df_jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84090 entries, 0 to 84089
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Job.ID              84090 non-null  int64  
 1   Provider            84090 non-null  int64  
 2   Status              84090 non-null  object 
 3   Slug                84090 non-null  object 
 4   Title               84090 non-null  object 
 5   Position            84090 non-null  object 
 6   Company             81819 non-null  object 
 7   City                83955 non-null  object 
 8   State.Name          83919 non-null  object 
 9   State.Code          83919 non-null  object 
 10  Address             36 non-null     object 
 11  Latitude            84090 non-null  float64
 12  Longitude           84090 non-null  float64
 13  Industry            267 non-null    object 
 14  Job.Description     84034 non-null  object 
 15  Requirements        0 non-null      float64
 16  Sala

In [11]:
# concatenate several columns into one text
df_jobs['text'] = df_jobs['Title'].str.cat(df_jobs['Position'].astype(str), sep=' ').str.cat(df_jobs['Company'].astype(str), sep=' ').str.cat(df_jobs['Job.Description'].astype(str), sep=' ')

In [12]:
# progress bar
import tqdm

In [13]:
# how long would it take to run it with plain spacy?

%%time
nlp(df_jobs['text'][0])

CPU times: user 31.6 ms, sys: 1.92 ms, total: 33.6 ms
Wall time: 41.2 ms


Server @ Tacolicious Server Tacolicious Tacolicious' first Palo Alto store just opened recently, and we are hiring! If you love tacos, you will love working at our restaurant! 

 ● Serve food/drinks to customers in a professional manner 
 ● Act as a cashier when needed 
 ● Clean up the dining space 
 ● Train the new staff 

In [14]:
# run that for all ~84,000 postings? At ~30.8 ms each that is about 0.72 h - too long
30.8*84000/1000/3600

0.7186666666666666

In [15]:
# run with a progress bar and clean up using spaCy, but without some heavy parts of the pipeline

%%time
clean_text = []


pbar = tqdm.tqdm(total=len(df_jobs['text']),position=0, leave=True)

for text in nlp.pipe(df_jobs['text'], disable=["tagger", "parser", "ner"]):

  txt = [token.lemma_.lower() for token in text 
         if token.is_alpha 
         and not token.is_stop 
         and not token.is_punct]

  clean_text.append(txt)

  pbar.update(1)

100%|█████████▉| 83968/84090 [02:26<00:00, 511.38it/s]

CPU times: user 2min 23s, sys: 1.99 s, total: 2min 25s
Wall time: 2min 26s


In [16]:
df_jobs['clean_text'] = clean_text

In [17]:
df_jobs['clean_text'].isnull().sum()

0

In [18]:
# update gensim
!pip install --upgrade gensim -q


In [19]:
# get tooling for Word2Vec model
from gensim.models import Word2Vec

In [20]:
# enable logging
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [21]:
# train word2vec model
w2v_model = Word2Vec(sentences=df_jobs['clean_text'], vector_size=300, window=5, min_count=2, workers=2, epochs=5)

2021-10-26 17:44:52,755 : INFO : collecting all words and their counts
2021-10-26 17:44:52,759 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-10-26 17:44:53,099 : INFO : PROGRESS: at sentence #10000, processed 1489501 words, keeping 18094 word types
2021-10-26 17:44:53,410 : INFO : PROGRESS: at sentence #20000, processed 2801776 words, keeping 25505 word types
2021-10-26 17:44:53,799 : INFO : PROGRESS: at sentence #30000, processed 4383902 words, keeping 30826 word types
2021-10-26 17:44:54,127 : INFO : PROGRESS: at sentence #40000, processed 5723039 words, keeping 35286 word types
2021-10-26 17:44:54,454 : INFO : PROGRESS: at sentence #50000, processed 7118565 words, keeping 39058 word types
2021-10-26 17:44:54,764 : INFO : PROGRESS: at sentence #60000, processed 8409519 words, keeping 42530 word types
2021-10-26 17:44:55,107 : INFO : PROGRESS: at sentence #70000, processed 9822031 words, keeping 45756 word types
2021-10-26 17:44:55,473 : INFO : PROGRE

In [22]:
w2v_model.wv.similar_by_word('bartender')

[('busser', 0.7618190050125122),
 ('waitress', 0.7473346590995789),
 ('waiter', 0.7098878622055054),
 ('hostess', 0.7010378241539001),
 ('bussers', 0.6423121690750122),
 ('waitstaff', 0.6400488018989563),
 ('bartenders', 0.6347126364707947),
 ('sushi', 0.633686900138855),
 ('ra', 0.6296216249465942),
 ('dishwasher', 0.6136170625686646)]

In [23]:
w2v_model.wv.similar_by_word('java')

[('javascript', 0.777656078338623),
 ('jquery', 0.7706832885742188),
 ('xml', 0.7641456723213196),
 ('ui', 0.7162367105484009),
 ('html', 0.7130112051963806),
 ('developer', 0.7120020389556885),
 ('ajax', 0.7038507461547852),
 ('sql', 0.7029151916503906),
 ('php', 0.7026695609092712),
 ('python', 0.6975985765457153)]

In [24]:
# check out cosine similarity for some word-vectors
from sklearn.metrics.pairwise import cosine_similarity

In [25]:
pizza = w2v_model.wv['pizza'].reshape(1,300)
pasta = w2v_model.wv['pasta'].reshape(1,300)
sushi = w2v_model.wv['sushi'].reshape(1,300)
uber = w2v_model.wv['uber'].reshape(1,300)

In [26]:
cosine_similarity(pizza, uber)

array([[0.12800236]], dtype=float32)
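
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The sketch below spells this out in plain Python on made-up 3-dimensional vectors (not the trained embeddings); the helper name `cosine` is chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# toy vectors, made up for illustration
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score (up to floating-point error) 1 and perpendicular vectors score 0, which is why a low value such as the one for `pizza` and `uber` indicates unrelated words.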

In [27]:
w2v_model.save('w2v_model')

2021-10-26 17:47:19,759 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'w2v_model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-10-26T17:47:19.759224', 'gensim': '4.1.2', 'python': '3.7.12 (default, Sep 10 2021, 00:21:48) \n[GCC 7.5.0]', 'platform': 'Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'saving'}
2021-10-26 17:47:19,767 : INFO : not storing attribute cum_table
2021-10-26 17:47:19,905 : INFO : saved w2v_model


In [28]:
# define function for avg-embeddings 
# (in older versions of gensim it's not key_to_index but vocab)

def get_mean_vector(word2vec_model, words):
    # remove out-of-vocabulary words (use the model passed in, not the global one)
    words = [word for word in words if word in word2vec_model.wv.key_to_index]
    if len(words) >= 1:
        return np.mean(word2vec_model.wv[words], axis=0)
    else:
        return []
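
One thing to watch: returning an empty list for a text without any in-vocabulary words yields a row of a different length, which would get in the way of stacking all vectors into one matrix later. A possible alternative, sketched here in plain Python with a toy dictionary standing in for the Word2Vec vectors, is to fall back to a zero vector. The name `mean_vector` and the `dim` parameter are assumptions for illustration.

```python
def mean_vector(embeddings, words, dim):
    """Average the vectors of in-vocabulary words; all zeros if none match."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each `col` holds one coordinate
    return [sum(col) / len(known) for col in zip(*known)]


# toy 2-dimensional "embeddings", made up for illustration
toy = {"pizza": [1.0, 0.0], "pasta": [0.0, 1.0]}
print(mean_vector(toy, ["pizza", "pasta", "unknownword"], dim=2))  # [0.5, 0.5]
print(mean_vector(toy, ["unknownword"], dim=2))                    # [0.0, 0.0]
```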

In [29]:
get_mean_vector(w2v_model, df_jobs['clean_text'][0])

array([-2.27459788e-01,  2.83792838e-02, -3.15880328e-01, -4.02654737e-01,
        1.61475033e-01,  7.92500898e-02, -2.25903973e-01, -3.36948335e-01,
       -5.61738193e-01, -1.15351133e-01,  2.12891281e-01, -5.98014705e-02,
        3.36666316e-01, -2.02782884e-01, -2.49138206e-01, -3.48881841e-01,
       -3.79665703e-01,  9.20985714e-02, -4.38628495e-01,  3.60054113e-02,
        2.10461989e-01, -2.60255896e-02,  2.48899572e-02,  2.77973175e-01,
       -7.31425941e-01, -2.59217113e-01,  2.49705449e-01, -5.53703308e-01,
        4.02820975e-01,  3.04305613e-01,  4.56851684e-02, -1.94203660e-01,
        1.60758495e-01, -7.31263608e-02,  1.12131871e-01,  1.14503965e-01,
        4.28892672e-01, -5.43757342e-02, -7.82297626e-02,  7.69299716e-02,
        5.33812284e-01, -3.76203865e-01, -2.16519818e-01,  5.15402019e-01,
        2.97062188e-01,  2.88686812e-01, -2.22387254e-01, -3.92101288e-01,
       -9.28101987e-02, -3.97997469e-01,  2.46904105e-01,  1.75887927e-01,
       -2.88872391e-01, -

In [30]:
# transform all texts into their average-vector representation
avg_job_vecs = df_jobs['clean_text'].map(lambda t: get_mean_vector(w2v_model, t))

In [31]:
# aggregate vectors from list to matrix
avg_job_vecs = np.vstack(avg_job_vecs)

In [32]:
# check shape
avg_job_vecs.shape

(84090, 300)

In [33]:
df_jobs['text'][0]

"Server @ Tacolicious Server Tacolicious Tacolicious' first Palo Alto store just opened recently, and we are hiring! If you love tacos, you will love working at our restaurant! \r\n\r\n ● Serve food/drinks to customers in a professional manner \r\n ● Act as a cashier when needed \r\n ● Clean up the dining space \r\n ● Train the new staff \r\n"

In [34]:
# calculate similarity of the first job posting to all postings
sims = cosine_similarity(avg_job_vecs[0].reshape(1,300), avg_job_vecs)

In [35]:
# extract indices
ix = np.flip(np.argsort(sims)).tolist()[0][:10]

In [36]:
# show results
df_jobs['text'][ix]

0        Server @ Tacolicious Server Tacolicious Tacoli...
84057    Server @ Kabuto Restaurant Server Kabuto Resta...
13448    Server @ BALEENkitchen Server BALEENkitchen  ●...
84081    Server @ Pizza Antica Server Pizza Antica  ● S...
13453    Server @ Gonpachi Server Gonpachi  ● Serve foo...
24471    Server @ Exotic Thai Cafe Server Exotic Thai C...
84044    Server @ Yuzu Server Yuzu  Yuzu is one of the ...
10778    Server @ Far Niente Ristorante Server Far Nien...
84082    Server @ Giardino Server Giardino  ● Serve foo...
10783    Server @ La Fontaine Restaurant Server La Font...
Name: text, dtype: object
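
The index-extraction step sorts positions by similarity, reverses the order so the best match comes first, and keeps the first ten. The same idea in plain Python on a made-up list of scores (the helper `top_k` is hypothetical):

```python
def top_k(scores, k):
    """Return the positions of the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# toy similarities: position 0 is the query itself, hence similarity 1.0
toy_sims = [1.0, 0.2, 0.9, 0.5]
print(top_k(toy_sims, k=2))  # [0, 2]
```

As in the result above, the query document always ranks first because it is perfectly similar to itself.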

In [48]:
# free text query
query = 'consultant in legal'

query = nlp(query, disable=["tagger", "parser", "ner"])
query = [token.lemma_.lower() for token in query 
         if token.is_alpha 
         and not token.is_stop 
         and not token.is_punct]

query = get_mean_vector(w2v_model,query)

In [49]:
sims = cosine_similarity(query.reshape(1,300), avg_job_vecs)
ix = np.flip(np.argsort(sims)).tolist()[0][:10]
df_jobs['text'][ix]

53292    HR Consultant - ACA expert needed! @ OfficeTea...
56218    Leasing Consultant @ Eastwood Management Corpo...
76525    Leasing Consultant @ OfficeTeam Leasing Consul...
13691    Administrative Assistant with Leasing Experien...
53752    Senior Consultant   EMCC A2D2 @ GreyStone Staf...
44270    Leasing Consultant Leasing Consultant nan Leas...
25871    Legal Writer (part-time) @ Randstad Profession...
47978    Paralegal/legal assistant/litigation specialis...
37578    Legal Assistant/ Project Assistant @ Carolina ...
1962     Legal Secretary @ OfficeTeam Legal Secretary O...
Name: text, dtype: object

In [None]:
df_jobs['text'][84037]

"Sushi Chef @ Haku Sushi Sushi Chef Haku Sushi Haku Sushi is Santa Rosa's newest sushi bar. We have 100+ seats and business is great!\r\nWe need a head sushi chef to lead our current team of 3 sushi chefs.\r\nIf you currently are NOT a head sushi chef, this is a great opportunity to move into a head position very very quickly.\r\nWe can talk about job expectations and skills required during the phone interview. Here is more info on Haku:\r\nhttp://707.pressdemocrat.com/2013-05-31/featured/cox-haku-sushi\r\nhttp://www.bohemian.com/northbay/rollin-deep/Content?oid=2419001\r\n\r\nAlso check us out on facebook and Yelp where you can see Haku is the area's best sushi restaurant...full stop."

## Implementing TF-IDF-weighted Word2Vec embeddings

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [51]:
# function that does absolutely nothing...
# to be able to use TfidfVectorizer on already tokenized text
def dummy_fun(doc):
    return doc

In [52]:
# we turn off any preprocessing and align the vocabulary with the one
# used by our embeddings
# that will allow us to use TF-IDF vectors to weight the embeddings

tfidf = TfidfVectorizer(vocabulary=w2v_model.wv.key_to_index.keys(),
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)  

In [53]:
# create TFIDF matrix (we could also just use that one for search)
df_jobs_tfidf = tfidf.fit_transform(df_jobs['clean_text'])

In [57]:
# how many word-vectors do we have?
len(w2v_model.wv.key_to_index)

33071

In [56]:
# a single TF-IDF vector also has 33071 columns, because we provided the vocabulary
df_jobs_tfidf[:1,:]

<1x33071 sparse matrix of type '<class 'numpy.float64'>'
	with 27 stored elements in Compressed Sparse Row format>

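To get document vectors, we can take the dot product of the TF-IDF vector (or the full TF-IDF matrix) with our word embeddings. This works because the number of TF-IDF columns equals the number of rows in the Word2Vec embedding matrix (one per vocabulary word), so each document vector is a TF-IDF-weighted sum of its word vectors.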

![](https://hadrienj.github.io/assets/images/2.2/dot-product.png)
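
To make the shapes concrete, here is a small plain-Python sketch of that product with made-up numbers: a 2 x 3 TF-IDF matrix (2 documents, 3 vocabulary words) times a 3 x 2 embedding matrix (3 words, 2 dimensions) gives one 2-dimensional vector per document. The helper `matmul` is written out only for illustration; in the notebook the `@` operator does this work.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    assert len(a[0]) == len(b), "columns of a must equal rows of b"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# toy TF-IDF matrix: 2 documents x 3 vocabulary words (made-up weights)
tfidf_toy = [[0.5, 0.5, 0.0],
             [0.0, 0.0, 1.0]]

# toy embeddings: 3 vocabulary words x 2 dimensions (made-up values)
emb_toy = [[1.0, 0.0],
           [0.0, 1.0],
           [2.0, 2.0]]

doc_vectors = matmul(tfidf_toy, emb_toy)
print(doc_vectors)  # [[0.5, 0.5], [2.0, 2.0]]
```

The first document mixes the first two word vectors with equal weight, while the second document consists only of the third word and therefore inherits its vector unchanged.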

In [61]:
# the @ operator (Python 3.5+) performs matrix multiplication; np.dot does the same for dense arrays

# let's try
df_jobs_tfidf[:1,:] @ w2v_model.wv.vectors

array([[-7.01223280e-01,  3.86127719e-01, -4.29526500e-01,
        -6.08647310e-01,  3.61619589e-01,  2.74596986e-01,
        -2.19369633e-01, -1.12414017e-01, -1.43040101e+00,
        -3.24954501e-01,  4.68771499e-01, -2.72240093e-01,
         5.27229745e-01, -6.42945363e-01, -8.21799343e-01,
        -1.14912679e+00, -1.36769792e+00,  1.69325095e-01,
        -7.47179696e-01, -2.78584557e-01,  6.54656144e-01,
         1.35298108e-01,  2.66373847e-02,  1.05617106e+00,
        -1.46310842e+00, -1.10216644e+00,  6.30542485e-01,
        -1.43483234e+00,  1.51325905e+00,  7.13999978e-01,
        -6.50459404e-02, -5.44729546e-01,  7.08846423e-01,
        -5.80415599e-02,  3.48056014e-01,  4.34607849e-01,
         1.23761953e+00, -3.07814260e-01, -2.99076442e-01,
         3.84057309e-01,  1.45418452e+00, -8.22665776e-01,
        -3.64323951e-01,  1.09074330e+00,  6.37676585e-01,
         8.96839523e-01, -3.12701543e-01, -1.19127401e+00,
        -3.78006360e-01, -1.37103632e+00,  1.15345555e+0

In [63]:
# for the whole matrix

df_jobs_w2v_tfidf = df_jobs_tfidf @ w2v_model.wv.vectors

In [64]:
df_jobs_w2v_tfidf.shape

(84090, 300)

In [67]:
# calculate similarity of the first job posting to all postings
sims = cosine_similarity(df_jobs_w2v_tfidf[0].reshape(1,300), df_jobs_w2v_tfidf)

# extract indices
ix = np.flip(np.argsort(sims)).tolist()[0][:10]

# show results
df_jobs['text'][ix]

0        Server @ Tacolicious Server Tacolicious Tacoli...
84057    Server @ Kabuto Restaurant Server Kabuto Resta...
13448    Server @ BALEENkitchen Server BALEENkitchen  ●...
84044    Server @ Yuzu Server Yuzu  Yuzu is one of the ...
13453    Server @ Gonpachi Server Gonpachi  ● Serve foo...
54458    Server @ Sakae Sushi Server Sakae Sushi  Locat...
84081    Server @ Pizza Antica Server Pizza Antica  ● S...
84062    Server @ Kenta Ramen Server Kenta Ramen  New R...
84082    Server @ Giardino Server Giardino  ● Serve foo...
84001    Server @ Waraku Server Waraku We are a newly o...
Name: text, dtype: object

The results differ slightly from those obtained with the averaged embeddings.


In [87]:
# slightly more complex function that includes preprocessing with spaCy,
# TF-IDF transformation and embeddings

def get_tfidf_vector(word2vec_model, query):
    # tokenize and lemmatize the query the same way as the job texts
    query = nlp(query, disable=["tagger", "parser", "ner"])
    query = [token.lemma_.lower() for token in query
             if token.is_alpha
             and not token.is_stop
             and not token.is_punct]
    if len(query) >= 1:
        # (1 x vocab) TF-IDF row times (vocab x 300) embeddings -> (1 x 300) vector
        words = tfidf.transform([query])
        return words @ word2vec_model.wv.vectors
    else:
        return []

In [96]:
query = get_tfidf_vector(w2v_model, 'web developer and designer')
sims = cosine_similarity(query.reshape(1,300), df_jobs_w2v_tfidf)
ix = np.flip(np.argsort(sims)).tolist()[0][:10]
df_jobs['text'][ix]

81172    Jr. Ruby on Rails Developer @ ConsultNet Jr. R...
59799    Tableau UI Developer Tableau UI Developer nan ...
19011    Web Designer (freelance) @ Creative Circle Web...
59440    Sr. UX Designer @ Creative Circle Sr. UX Desig...
32658    Web Developer @ Creative Circle Web Developer ...
34535    Graphic Designer (Print + Web) @ Creative Circ...
79576    Web Developer @ Creative Circle Web Developer ...
73957    Web Developer @ Creative Circle Web Developer ...
41978    Web designer/Front-end developer @ Collabera I...
27721    Front End Developer @ ConsultNet Front End Dev...
Name: text, dtype: object

## Adding Annoy: approximate nearest-neighbor matching

![](https://camo.githubusercontent.com/a056535a8490b4b1aa933808e77207276235a209e97a980119d3e438897e1d36/68747470733a2f2f7261772e6769746875622e636f6d2f73706f746966792f616e6e6f792f6d61737465722f616e6e2e706e67)

Computing exact cosine similarities works well for small datasets like ours, but becomes problematic as the data grows (think 1M+ observations, when the vectorized matrices get too large). An on-disk approximate nearest neighbor (ANN) index is a common solution.

I work easily with collections of 40+ million items with Annoy, and it runs quickly :-)

Check out: https://github.com/spotify/annoy

In [97]:
!pip install -q annoy


In [100]:
from annoy import AnnoyIndex

# instantiate a search tree (vector length 300, i.e. the number of columns)
t = AnnoyIndex(df_jobs_w2v_tfidf.shape[1], 'angular') 

In [99]:
# we will build that on disk (can reuse later if we store it somewhere)

t.on_disk_build('jobs_search_tree.annoy')

True

In [101]:
# now we add all our vectors - line by line to the tree
# along with an index (here i - running index)
for i in tqdm.tqdm(range(df_jobs_w2v_tfidf.shape[0]),position=0, leave=True):
    t.add_item(i, df_jobs_w2v_tfidf[i])

100%|██████████| 84090/84090 [00:03<00:00, 22103.48it/s]


In [102]:
# now we build the index: 50 trees that recursively partition the data (a bit like clustering)
# search is then performed only within the nearest partitions, which reduces search time a lot
t.build(50, n_jobs=-1)

True

In [105]:
t.get_nns_by_vector(df_jobs_w2v_tfidf[0], n=10, include_distances=True)

([0, 84057, 13448, 84044, 13453, 54458, 84081, 84062, 84082, 84001],
 [0.0,
  0.43137580156326294,
  0.45826101303100586,
  0.46562379598617554,
  0.47330835461616516,
  0.4771498143672943,
  0.4818252623081207,
  0.493192195892334,
  0.5003194808959961,
  0.5004116892814636])
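
The distances returned here are not cosine similarities. According to the Annoy README, the `'angular'` metric is the Euclidean distance between normalized vectors, i.e. sqrt(2 * (1 - cos)). A small sketch for converting between the two (the function names are made up for illustration):

```python
import math


def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance into a cosine similarity."""
    return 1.0 - distance ** 2 / 2.0


def cosine_to_angular(similarity):
    """Convert a cosine similarity into an Annoy 'angular' distance."""
    return math.sqrt(2.0 * (1.0 - similarity))


print(angular_to_cosine(0.0))                  # 1.0, the query item itself
print(angular_to_cosine(0.43137580156326294))  # nearest neighbour from the output above
```

Under that definition, a distance of 0 corresponds to a cosine similarity of 1, and smaller distances mean more similar job postings.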

In [109]:
# now we can search by index
knn_search = t.get_nns_by_item(0, n=10, include_distances=True)
df_jobs['text'][knn_search[0]]

0        Server @ Tacolicious Server Tacolicious Tacoli...
84057    Server @ Kabuto Restaurant Server Kabuto Resta...
13448    Server @ BALEENkitchen Server BALEENkitchen  ●...
84044    Server @ Yuzu Server Yuzu  Yuzu is one of the ...
13453    Server @ Gonpachi Server Gonpachi  ● Serve foo...
54458    Server @ Sakae Sushi Server Sakae Sushi  Locat...
84081    Server @ Pizza Antica Server Pizza Antica  ● S...
84062    Server @ Kenta Ramen Server Kenta Ramen  New R...
84082    Server @ Giardino Server Giardino  ● Serve foo...
84001    Server @ Waraku Server Waraku We are a newly o...
Name: text, dtype: object

In [110]:
# now we can search by vector
knn_search = t.get_nns_by_vector(df_jobs_w2v_tfidf[0], n=10, include_distances=True)
df_jobs['text'][knn_search[0]]

0        Server @ Tacolicious Server Tacolicious Tacoli...
84057    Server @ Kabuto Restaurant Server Kabuto Resta...
13448    Server @ BALEENkitchen Server BALEENkitchen  ●...
84044    Server @ Yuzu Server Yuzu  Yuzu is one of the ...
13453    Server @ Gonpachi Server Gonpachi  ● Serve foo...
54458    Server @ Sakae Sushi Server Sakae Sushi  Locat...
84081    Server @ Pizza Antica Server Pizza Antica  ● S...
84062    Server @ Kenta Ramen Server Kenta Ramen  New R...
84082    Server @ Giardino Server Giardino  ● Serve foo...
84001    Server @ Waraku Server Waraku We are a newly o...
Name: text, dtype: object

In [112]:
# also with a free query
query = get_tfidf_vector(w2v_model, 'web developer and designer')

knn_search = t.get_nns_by_vector(query[0], n=10, include_distances=True) 
# Annoy expects a 1-D vector, so we take row [0] of our 1x300 matrix (unlike sklearn's cosine_similarity, which expects 2-D input)

df_jobs['text'][knn_search[0]]

81172    Jr. Ruby on Rails Developer @ ConsultNet Jr. R...
59799    Tableau UI Developer Tableau UI Developer nan ...
19011    Web Designer (freelance) @ Creative Circle Web...
59440    Sr. UX Designer @ Creative Circle Sr. UX Desig...
32658    Web Developer @ Creative Circle Web Developer ...
79576    Web Developer @ Creative Circle Web Developer ...
73957    Web Developer @ Creative Circle Web Developer ...
41978    Web designer/Front-end developer @ Collabera I...
27721    Front End Developer @ ConsultNet Front End Dev...
16189    Visual Designer/Front-end Developer @ Wunderla...
Name: text, dtype: object