**Background:**

As a talent sourcing and management company, we look for talented individuals and place them with technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role in question. Third, knowing where to find talented individuals is a challenge of its own.

Our work involves a lot of human labor and many manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them by fitness.

We currently source a few candidates semi-automatically, so sourcing is not a concern at this time; the first step is to determine the best-matching candidates based on how well they fit a given role. We generally search with keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once the fitting candidates are listed and ranked, we apply a review procedure in which each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list; starring a candidate sets them as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.

**Data Description:**

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

**Attributes:**
    
* id : unique identifier for candidate (numeric)

* job_title : job title for candidate (text)

* location : geographical location for candidate (text)

* connection : number of connections the candidate has; 500+ means over 500 (text)

**Output (desired target):**
* fit : how well the candidate fits the role (numeric, probability between 0 and 1)

**Keywords: “Aspiring human resources” or “seeking human resources”**

**Download Data:**

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

**Goal(s):**

Predict how fit the candidate is based on their available information (variable fit)

**Success Metric(s):**

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

In [1]:
import pandas as pd
import numpy as np



In [2]:
df =pd.read_csv("potential-talents - Aspiring human resources - seeking human resources.csv").set_index('id')
df.head()

id,job_title,location,connection,fit
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500+,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   104 non-null    object 
 1   location    104 non-null    object 
 2   connection  104 non-null    object 
 3   fit         0 non-null      float64
dtypes: float64(1), object(3)
memory usage: 4.1+ KB


In [4]:
# "500+" is capped at 500; regex=False treats '+' as a literal character.
df["connection"] = df["connection"].str.replace('+', '', regex=False).astype(int)
df.head()


id,job_title,location,connection,fit
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Progra...,Kanada,500,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500,


In [5]:
# Safeguard only: the column is already an integer dtype after the cast in the previous cell.
df['connection'] = pd.to_numeric(df["connection"])

In [6]:
df.job_title.value_counts()

2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 7
Aspiring Human Resources Professional                                                                                    7
Student at Humber College and Aspiring Human Resources Generalist                                                        7
People Development Coordinator at Ryan                                                                                   6
Native English Teacher at EPIK (English Program in Korea)                                                                5
Aspiring Human Resources Specialist                                                                                      5
HR Senior Specialist                                                                                                     5
Student at Chapman University                                                                                            4
SVP, CHRO, Marke

In [7]:
df =df.drop_duplicates()

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   53 non-null     object 
 1   location    53 non-null     object 
 2   connection  53 non-null     int32  
 3   fit         0 non-null      float64
dtypes: float64(1), int32(1), object(2)
memory usage: 1.9+ KB


### TF-IDF


**Prepping our Text for Modelling**

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer =TfidfVectorizer(stop_words="english", ngram_range=(1,2))
docs_tfidf = vectorizer.fit_transform(df["job_title"])

In [10]:
def get_tfidf_similarity(vectorizer, docs_tfidf, query):
    # Project the query into the fitted TF-IDF space and compare it with every job title.
    query_tfidf = vectorizer.transform([query])
    cos_sim = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cos_sim

In [11]:
query ="Aspiring human resources"

cos_sim=  get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df['fit'] =cos_sim

In [12]:
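def top_candidates(n, by='fit', ascending=False, min_con=0, location=None):
    # Keep candidates with at least `min_con` connections; filter by location only when one is given.
    mask = df.connection >= min_con
    if location is not None:
        mask &= df.location == location
    return df.loc[mask].sort_values(by=by, ascending=ascending).head(n).copy()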

In [13]:
top_candidates(n=10, by='fit', ascending=False, min_con=0)

id,job_title,location,connection,fit
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888
72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.38759
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500,0.374733
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.373847
7,Student at Humber College and Aspiring Human R...,Kanada,61,0.358949
74,Human Resources Professional,Greater Boston Area,16,0.340769
79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.336485


In [14]:
top_candidates(n=10, by="fit", ascending=False, min_con=90)

id,job_title,location,connection,fit
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500,0.374733
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.31642
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.308829
76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.246772
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500,0.196509
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500,0.196509
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",500,0.196509
68,Human Resources Specialist at Luxottica,Greater New York City Area,500,0.189503
89,Director Human Resources at EY,Greater Atlanta Area,349,0.187433


In [15]:
top_candidates(n=50, by='fit', ascending=False, location="Austin, Texas Area")

id,job_title,location,connection,fit
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.373847
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.31642


In [16]:
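# Fixed the misspelled variable name `querry`, which left `query` at "Aspiring human resources";
# the table printed below still shows the scores for that earlier query.
query = 'seeking human resources'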

cos_sim = get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df['fit'] = cos_sim

In [17]:
top_candidates(n=10, by='fit', ascending=False, min_con=0)

id,job_title,location,connection,fit
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888
72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.38759
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500,0.374733
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.373847
7,Student at Humber College and Aspiring Human R...,Kanada,61,0.358949
74,Human Resources Professional,Greater Boston Area,16,0.340769
79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.336485


### Word2Vec

In [18]:
import tensorflow as tf
import re 
import nltk
from nltk.corpus import stopwords

In [19]:
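stop_words = stopwords.words('english')

def clean_text(text):
    # Replace non-letters with spaces, lowercase each token, and drop English stop words.
    words = (re.sub(r'[^a-zA-Z]', ' ', w).lower() for w in text.split())
    return " ".join(w for w in words if w not in stop_words)

df['job_title_cleaned'] = df.job_title.apply(clean_text)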

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 1 to 104
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   job_title          53 non-null     object 
 1   location           53 non-null     object 
 2   connection         53 non-null     int32  
 3   fit                53 non-null     float64
 4   job_title_cleaned  53 non-null     object 
dtypes: float64(1), int32(1), object(3)
memory usage: 2.3+ KB


In [21]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer =Tokenizer()

tokenizer.fit_on_texts(df.job_title_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(df.job_title_cleaned)
tokenized_pad_document =pad_sequences(tokenized_documents, maxlen=64, padding="post")
vocab_size= len(tokenizer.word_index)+1

In [22]:
# loading pre-trained embeddings, each word is represented as a 300 dimensional vector
import gensim
import gensim.downloader as api
from gensim import models


# load word2vec model
word2vec_model =api.load("word2vec-google-news-300")

In [23]:
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in word2vec_model:
        embedding_matrix[i]=word2vec_model[word]
        
# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_pad_document),64,300))
for i in range(len(tokenized_pad_document)):
    for j in range(len(tokenized_pad_document[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_pad_document[i][j]]
document_word_embeddings.shape

(53, 64, 300)

In [24]:
document_word_embeddings[0][0]

array([-2.08007812e-01,  3.41796875e-02,  2.57568359e-02,  1.79687500e-01,
       -1.81640625e-01, -3.41796875e-02, -1.40625000e-01, -1.63085938e-01,
       -8.59375000e-02, -1.52343750e-01, -9.57031250e-02, -1.34765625e-01,
       -1.92382812e-01,  2.43164062e-01, -1.91406250e-01,  4.93164062e-02,
        2.60009766e-02,  3.28125000e-01, -7.37304688e-02,  5.05371094e-02,
       -1.52343750e-01, -1.57226562e-01, -1.44958496e-04, -2.51953125e-01,
       -4.22363281e-02, -1.72119141e-02, -4.84375000e-01,  2.07031250e-01,
       -1.40625000e-01, -1.35498047e-02, -1.78222656e-02,  5.95092773e-03,
       -3.10058594e-02, -2.75390625e-01, -2.65625000e-01,  9.52148438e-02,
       -4.55078125e-01,  1.13281250e-01, -1.33789062e-01,  1.18652344e-01,
       -5.37109375e-02,  8.10546875e-02,  7.32421875e-02,  6.39648438e-02,
       -9.47265625e-02,  4.39453125e-02,  1.46484375e-01, -8.59375000e-02,
       -1.58203125e-01,  1.63085938e-01, -1.32812500e-01,  2.50000000e-01,
       -5.61523438e-02,  

In [25]:
word2vec_model['england'][:5]

array([-0.3671875 , -0.03491211,  0.11083984,  0.40039062,  0.18261719],
      dtype=float32)

In [26]:
def processing(query):
    df3 = pd.DataFrame([query], columns=['query'])
    stop_words = stopwords.words('english')
    df3['processed'] = df3['query'].apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() 
                                                                                  for w in x.split() 
                                                                                  if re.sub(r'[^a-zA-Z]',' ',w).lower() 
                                                                                  not in stop_words) )
    
    # Use a query-local tokenizer so the corpus tokenizer fitted earlier is not mutated.
    query_tokenizer = Tokenizer()
    query_tokenizer.fit_on_texts(df3.processed)
    tokenized_documents=query_tokenizer.texts_to_sequences(df3.processed)
    tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
    vocab_size=len(query_tokenizer.word_index)+1
    
    embedding_matrix=np.zeros((vocab_size,300))
    for word,i in query_tokenizer.word_index.items():
        if word in word2vec_model:
            embedding_matrix[i]=word2vec_model[word]

    # creating document-word embeddings
    query_document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
    for i in range(len(tokenized_paded_documents)):
        for j in range(len(tokenized_paded_documents[0])):
            query_document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
#     document_word_embeddings.shape
    
    return query_document_word_embeddings

In [27]:
processing("hello world!!!!").shape

(1, 64, 300)

In [28]:
def get_w2v_query_similarity(document_word_embeddings, query):
    """
    document_word_embeddings: padded Word2Vec embeddings for all docs, shape (n_docs, 64, 300)
    query: query text

    return: cosine similarity between the flattened query embedding and each flattened doc embedding

    """
    query_w2v = processing(query)
    
    nsamples, nx, ny = query_w2v.shape
    query_w2v_reshape = query_w2v.reshape((nsamples,nx*ny))

    nsamples, nx, ny = document_word_embeddings.shape
    document_word_embeddings_reshape = document_word_embeddings.reshape((nsamples,nx*ny))
    
    cos_sim_w2v = cosine_similarity(query_w2v_reshape, document_word_embeddings_reshape).flatten()
    
    return cos_sim_w2v
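
Flattening the padded (64, 300) arrays makes the cosine similarity depend on word position: a shared word only contributes when it occupies the same slot in both texts. The sketch below illustrates that effect with random stand-in vectors rather than the real Word2Vec model, next to mean pooling as one possible alternative (it is not what this notebook uses). The names `embed` and `mean_pool` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Stand-in 300-d "embeddings" for three words (random values, not Word2Vec).
vocab = {w: rng.normal(size=300) for w in ["aspiring", "human", "resources"]}

def embed(words, maxlen=64):
    """Post-pad a word sequence with zero vectors, giving a (maxlen, 300) array."""
    out = np.zeros((maxlen, 300))
    for pos, word in enumerate(words):
        out[pos] = vocab[word]
    return out

query = embed(["human", "resources"])
title = embed(["aspiring", "human", "resources"])

# Flattened comparison: "human" sits in slot 0 of the query but slot 1 of the title,
# so the shared words never line up.
flat_sim = cosine_similarity(query.reshape(1, -1), title.reshape(1, -1))[0, 0]

def mean_pool(doc):
    """Average only the non-padding rows, which discards word position."""
    rows = doc[np.any(doc != 0, axis=1)]
    return rows.mean(axis=0, keepdims=True)

pooled_sim = cosine_similarity(mean_pool(query), mean_pool(title))[0, 0]
print(f"flattened: {flat_sim:.3f}, mean-pooled: {pooled_sim:.3f}")
```

With the flattened representation, titles that start with the query words tend to score highest, which is consistent with the rankings shown in the following cells.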

In [30]:
query ="Aspiring human resources"

# Word2Vec Similarity

cos_sim_w2v =get_w2v_query_similarity(document_word_embeddings, query=query)
df['w2v_fit'] =cos_sim_w2v

# original TFIDF similarity
cos_sim =get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df["tfidf_fit"]=cos_sim



In [31]:
top_candidates(n=10, by="w2v_fit", ascending=False , min_con=0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855,aspiring human resources professional,0.898174,0.735855
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855,aspiring human resources professional,0.898174,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697,aspiring human resources specialist,0.873679,0.632697
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.220668,seeking human resources position,0.654387,0.220668
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.31642,aspiring human resources professional energe...,0.641739,0.31642
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500,0.374733,aspiring human resources management student se...,0.628601,0.374733
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668,seeking human resources opportunities,0.619797,0.220668
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888,aspiring human resources manager seeking inte...,0.584569,0.50888
76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.246772,aspiring human resources professional passio...,0.551164,0.246772
10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500,0.141333,seeking human resources hris generalist positions,0.519345,0.141333


In [34]:
query = 'seeking human resources'

# word2vec similarity
cos_sim_w2v =get_w2v_query_similarity(document_word_embeddings, query=query)
df["w2v_fit"] =cos_sim_w2v

# original TFIDF similarity for comparison
cos_sim = get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df['tfidf_fit']=cos_sim

In [35]:
top_candidates(n=10, by="w2v_fit", ascending=False, min_con=0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.220668,seeking human resources position,0.886226,0.675682
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668,seeking human resources opportunities,0.839381,0.675682
10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500,0.141333,seeking human resources hris generalist positions,0.703341,0.432761
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855,aspiring human resources professional,0.663209,0.240319
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855,aspiring human resources professional,0.663209,0.240319
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697,aspiring human resources specialist,0.645122,0.206629
94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.124523,seeking human resources opportunities open tr...,0.639099,0.38129
89,Director Human Resources at EY,Greater Atlanta Area,349,0.187433,director human resources ey,0.571728,0.162381
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.31642,aspiring human resources professional energe...,0.473859,0.103338
81,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,0.118408,senior human resources business partner heil e...,0.470671,0.102581


In [36]:
query = 'business intelligence specialist'

cos_sim_w2v=get_w2v_query_similarity(document_word_embeddings, query=query)
df["w2v_fit"] =cos_sim_w2v

cos_sim=get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df['tfidf_fit']=cos_sim

In [37]:
top_candidates(n=10, by="w2v_fit", ascending=False, min_con=0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,0.0,business intelligence analytics travelers,0.552532,0.56006
68,Human Resources Specialist at Luxottica,Greater New York City Area,500,0.189503,human resources specialist luxottica,0.44738,0.178214
8,HR Senior Specialist,San Francisco Bay Area,500,0.0,hr senior specialist,0.348536,0.168359
86,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,0.0,information systems specialist programmer love...,0.274835,0.099972
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500,0.196509,human resources generalist loparex,0.251939,0.0
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500,0.196509,human resources generalist schwan s,0.231181,0.0
13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500,0.129163,human resources coordinator intercontinental b...,0.2158,0.0
72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.38759,business management major aspiring human resou...,0.214225,0.12298
4,People Development Coordinator at Ryan,"Denton, Texas",500,0.0,people development coordinator ryan,0.205907,0.0
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",500,0.196509,human resources generalist scottmadden inc,0.20292,0.0


### GloVe

In [38]:
##!pip install wget
##import wget
##wget.download('https://nlp.stanford.edu/data/glove.840B.300d.zip')

100% [....................................................................] 2176768927 / 2176768927

'glove.840B.300d.zip'

In [52]:
#import zipfile as zf
#files = zf.ZipFile("glove.840B.300d.zip", 'r')
#files.extractall('GloVe')
#files.close()

In [53]:
path = "GloVe/glove.840B.300d.txt"

In [56]:
with open(path, encoding="utf-8") as file:
    for i in range(10):
        line =file.readline()
        print(line[:100])

, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.40228 0.0027747 -0.3311 -0.30691 2.0817 0.031819 0
. 0.012001 0.20751 -0.12578 -0.59325 0.12525 0.15975 0.13748 -0.33157 -0.13694 1.7893 -0.47094 0.704
the 0.27204 -0.06203 -0.1884 0.023225 -0.018158 0.0067192 -0.13877 0.17708 0.17709 2.5882 -0.35179 -
and -0.18567 0.066008 -0.25209 -0.11725 0.26513 0.064908 0.12291 -0.093979 0.024321 2.4926 -0.017916
to 0.31924 0.06316 -0.27858 0.2612 0.079248 -0.21462 -0.10495 0.15495 -0.03353 2.4834 -0.50904 0.087
of 0.060216 0.21799 -0.04249 -0.38618 -0.15388 0.034635 0.22243 0.21718 0.0068483 2.4375 -0.27418 0.
a 0.043798 0.024779 -0.20937 0.49745 0.36019 -0.37503 -0.052078 -0.60555 0.036744 2.2085 -0.23389 -0
in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.03
" -0.075242 0.57337 -0.31908 -0.18484 0.88867 -0.27381 0.077588 0.13905 -0.47746 1.4442 -0.56159 0.0
: 0.008746 0.33214 -0.29175 -0.15119 -0.41842 -0.23931 -0.23458 -0.055618 -0.09896 0.75175 

In [58]:
df_glove =  pd.read_csv(path, sep= " ", quoting = 3, header=None, index_col=0)
df_glove.T

Unnamed: 0,",",.,the,and,to,of,a,in,"""",:,...,Ogenki,Orig.US,PMfound,POP1,PX130,Pandalam,Parascript,Parnells,Pautsch,PerchesJoe
1,-0.082752,0.012001,0.272040,-0.185670,0.319240,0.060216,0.043798,0.089187,-0.075242,0.008746,...,0.423770,0.225700,0.240270,0.53965,-0.10935,0.328590,0.531970,0.611780,-0.473970,0.060495
2,0.672040,0.207510,-0.062030,0.066008,0.063160,0.217990,0.024779,0.257920,0.573370,0.332140,...,-1.198100,-0.031287,-0.165540,-0.76420,0.25466,-0.212350,-0.202010,-0.687970,-0.400420,-0.675210
3,-0.149870,-0.125780,-0.188400,-0.252090,-0.278580,-0.042490,-0.209370,0.262820,-0.319080,-0.291750,...,-0.700850,0.538230,0.071787,0.37678,0.99594,0.376060,0.091246,-0.066194,-0.055173,0.265550
4,-0.064983,-0.593250,0.023225,-0.117250,0.261200,-0.386180,0.497450,-0.029365,-0.184840,-0.151190,...,-0.224230,-0.332040,0.166280,-0.39625,0.08476,0.080909,-0.003389,0.389530,0.136680,0.648760
5,0.056491,0.125250,-0.018158,0.265130,0.079248,-0.153880,0.360190,0.471870,0.888670,-0.418420,...,0.642660,-0.523410,-0.347660,0.34098,0.11552,-0.316570,-0.583550,-0.363120,0.447960,-0.577200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0.053380,0.063500,-0.018168,-0.039709,-0.258100,0.329200,0.080421,0.193680,-0.212800,0.700590,...,0.062563,-0.788720,-0.293950,-0.43114,0.37192,0.562300,-0.312860,0.405260,-0.378540,
297,-0.050821,0.140190,0.114070,0.324980,-0.044629,-0.175970,-0.061246,-0.325460,-0.226150,-0.213710,...,0.828370,-0.470380,-0.291760,0.69350,0.38199,0.070511,-0.084947,0.191570,-0.652060,
298,-0.191800,0.138710,0.130150,-0.023452,0.082745,0.117090,-0.300990,0.144210,0.328000,-0.286770,...,1.043200,-0.106990,-0.085336,-0.13797,-0.36604,-0.015289,0.928730,0.337280,0.325620,
299,-0.378460,-0.360490,-0.183170,0.123020,0.097801,-0.166920,-0.145840,-0.169000,-0.109340,-0.226630,...,1.277200,-0.287070,0.050977,-0.40676,-0.23111,0.044558,-0.570400,-0.487290,-0.029220,


In [59]:
glove = {key: val.values for key, val in df_glove.T.items()}
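
Reading the whole file into a DataFrame and transposing it is memory-hungry for a file of this size. A lighter-weight sketch, assuming the plain-text layout printed earlier (a token followed by space-separated floats), streams the file and keeps only the words that are needed. `load_glove` and its `wanted` parameter are hypothetical names, and the demo writes a tiny stand-in file instead of touching the real one.

```python
import os
import tempfile
import numpy as np

def load_glove(path, wanted=None, dim=300):
    """Stream a GloVe text file into {word: vector}, optionally keeping only `wanted` words."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Split from the right so a token that itself contains spaces stays intact.
            parts = line.rstrip("\n").rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed or truncated lines
            word = parts[0]
            if wanted is None or word in wanted:
                vectors[word] = np.asarray(parts[1:], dtype=np.float64)
    return vectors

# Demo on a 3-dimensional stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_glove.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("human 0.1 0.2 0.3\nresources 0.4 0.5 0.6\nthe 0.7 0.8 0.9\n")
    toy = load_glove(demo_path, wanted={"human", "resources"}, dim=3)

print(sorted(toy))  # ['human', 'resources']
```

For this notebook, `wanted` could be the set of words appearing in the cleaned job titles and the queries, although the mean vector used for unknown words would then have to be computed separately.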

In [61]:
glove["man"][:20]

array([-1.7310e-01,  2.0663e-01,  1.6543e-02, -3.1026e-01,  1.9719e-02,
        2.7791e-01,  1.2283e-01, -2.6328e-01,  1.2522e-01,  3.1894e+00,
       -1.6291e-01, -8.8759e-02,  3.3067e-03, -2.9483e-03, -3.4398e-01,
        1.2779e-01, -9.4536e-02,  4.3467e-01,  4.9742e-01,  2.5068e-01])

In [62]:
unknown_word = df_glove.mean().values
unknown_word[:20]

array([ 0.1970628 , -0.24059718,  0.11459859, -0.007148  , -0.09472859,
        0.09408282,  0.05378338,  0.06780886, -0.05926033, -1.24604854,
        0.38741211, -0.05182717, -0.08254744,  0.13708713,  0.196355  ,
        0.006143  ,  0.14478799, -1.08882521,  0.26819442,  0.04808988])

In [63]:
df_glove.head()

0,1,2,3,4,5,6,7,8,9,10,...,291,292,293,294,295,296,297,298,299,300
",",-0.082752,0.67204,-0.14987,-0.064983,0.056491,0.40228,0.002775,-0.3311,-0.30691,2.0817,...,-0.14331,0.018267,-0.18643,0.20709,-0.35598,0.05338,-0.050821,-0.1918,-0.37846,-0.06589
.,0.012001,0.20751,-0.12578,-0.59325,0.12525,0.15975,0.13748,-0.33157,-0.13694,1.7893,...,0.16165,-0.066737,-0.29556,0.022612,-0.28135,0.0635,0.14019,0.13871,-0.36049,-0.035
the,0.27204,-0.06203,-0.1884,0.023225,-0.018158,0.006719,-0.13877,0.17708,0.17709,2.5882,...,-0.4281,0.16899,0.22511,-0.28557,-0.1028,-0.018168,0.11407,0.13015,-0.18317,0.1323
and,-0.18567,0.066008,-0.25209,-0.11725,0.26513,0.064908,0.12291,-0.093979,0.024321,2.4926,...,-0.59396,-0.097729,0.20072,0.17055,-0.004736,-0.039709,0.32498,-0.023452,0.12302,0.3312
to,0.31924,0.06316,-0.27858,0.2612,0.079248,-0.21462,-0.10495,0.15495,-0.03353,2.4834,...,-0.12977,0.3713,0.18888,-0.004274,-0.10645,-0.2581,-0.044629,0.082745,0.097801,0.25045


In [65]:
# creating a vector representation for each job title
job_titles = df.job_title_cleaned

doc_sent_vec = []

for sentences in job_titles:
    word_vec = []
    for word in sentences.split():
        if word in glove:
            vectors = glove[word]
            word_vec.append(vectors)
        else:
            word_vec.append(unknown_word)
    word_vec_model = sum(word_vec) / len(word_vec)
    doc_sent_vec.append(word_vec_model)

In [66]:
len(doc_sent_vec)

53

In [67]:
# creating a vector representation for each query
def q_sent_vec(query):
    # Average the GloVe vectors of the query words; unknown words fall back to the mean vector.
    q_word_vec = []
    
    for word in query.split():
        if word in glove:
            q_word_vec.append(glove[word])
        else:
            q_word_vec.append(unknown_word)
    q_word_vec_mean = sum(q_word_vec) / len(q_word_vec)
    
    # Return a one-element list so the result can be passed straight to cosine_similarity.
    return [q_word_vec_mean]

In [68]:
query = "native english speaking"
len(q_sent_vec(query))

1

In [69]:
q_sent_vec(query)[0][:5]

array([-0.29654333,  0.12640833, -0.49922333,  0.22307667,  0.4358    ])

In [70]:
query ="student indiana university"
q_sent_vec(query)[0][:5]

array([-0.10656   ,  0.06428367,  0.10134093, -0.19890667,  0.51552   ])

In [71]:
def get_glove_query_similarity(doc_sent_vec, query):
    query_glove = q_sent_vec(query)
    cos_sim_glove =cosine_similarity(query_glove, doc_sent_vec).flatten()
    return cos_sim_glove

In [73]:
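# Spelling corrected from "Aspring"; the table printed below was produced with the misspelled query.
query ="Aspiring human resources"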

# Glove similarity
cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query=query)
df["glove_fit"] =cos_sim_glove

# original Tfidf similarity 
cos_sim = get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df["tfidf_fit"] =cos_sim

cos_sim_w2v =get_w2v_query_similarity(document_word_embeddings, query=query)
df["w2v_fit"] =cos_sim_w2v

In [74]:
top_candidates(n = 10, by ="glove_fit", ascending=False, min_con=0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500,0.196509,human resources generalist loparex,0.2425,0.320059,0.901884
68,Human Resources Specialist at Luxottica,Greater New York City Area,500,0.189503,human resources specialist luxottica,0.176811,0.308648,0.899012
74,Human Resources Professional,Greater Boston Area,16,0.340769,human resources professional,0.155016,0.555018,0.861686
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668,seeking human resources opportunities,0.689779,0.359406,0.844087
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",500,0.196509,human resources generalist scottmadden inc,0.195317,0.320059,0.830482
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888,aspiring human resources manager seeking inte...,0.465349,0.45381,0.828745
88,Human Resources Management Major,"Milpitas, California",18,0.204639,human resources management major,0.201412,0.3333,0.827524
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697,aspiring human resources specialist,0.695496,0.388463,0.816165
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.220668,seeking human resources position,0.728275,0.359406,0.811868
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500,0.196509,human resources generalist schwan s,0.22252,0.320059,0.788186


In [75]:
query ="seeking human resources"

# glove similarity
cos_sim_glove= get_glove_query_similarity(doc_sent_vec, query=query)
df["glove_fit"] =cos_sim_glove

# original tfidf 
cos_sim =get_tfidf_similarity(vectorizer, docs_tfidf, query=query)
df["tfidf_fit"]=cos_sim

cos_sim_w2v  = get_w2v_query_similarity(document_word_embeddings, query=query)
df["w2v_fit"] =cos_sim_w2v

In [76]:
top_candidates(n=10, by="glove_fit", ascending=False, min_con=0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668,seeking human resources opportunities,0.839381,0.675682,0.970024
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.220668,seeking human resources position,0.886226,0.675682,0.953714
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888,aspiring human resources manager seeking inte...,0.431644,0.362648,0.935586
74,Human Resources Professional,Greater Boston Area,16,0.340769,human resources professional,0.133104,0.295223,0.903558
94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.124523,seeking human resources opportunities open tr...,0.639099,0.38129,0.885495
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697,aspiring human resources specialist,0.645122,0.206629,0.874185
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.308829,aspiring human resources manager graduating ...,0.343832,0.220083,0.870053
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855,aspiring human resources professional,0.663209,0.240319,0.864091
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855,aspiring human resources professional,0.663209,0.240319,0.864091
88,Human Resources Management Major,"Milpitas, California",18,0.204639,human resources management major,0.170611,0.177288,0.859179
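
The second success metric, re-ranking when a candidate is starred, is not implemented in the cells above. A minimal sketch of one way to approach it, assuming the same TF-IDF representation used earlier: blend each title's similarity to the query with its similarity to the starred title. The function `rerank_with_star`, its `weight` parameter, and the toy titles are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank_with_star(titles, query, starred_pos, weight=0.5):
    """Blend query similarity with similarity to the starred title.

    `weight` is the share of the score given to the starred candidate (an assumption).
    Returns the positions sorted from best to worst fit, and the blended scores.
    """
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    docs = vec.fit_transform(titles)
    query_sim = cosine_similarity(vec.transform([query]), docs).flatten()
    star_sim = cosine_similarity(docs[starred_pos], docs).flatten()
    fit = (1 - weight) * query_sim + weight * star_sim
    order = np.argsort(-fit, kind="stable")
    return order, fit

titles = [
    "Aspiring Human Resources Professional",
    "Seeking Human Resources Opportunities",
    "Native English Teacher",
    "Human Resources Generalist",
]
# Star the candidate at position 3 and re-rank.
order, fit = rerank_with_star(titles, "aspiring human resources", starred_pos=3)
for pos in order:
    print(f"{fit[pos]:.3f}  {titles[pos]}")
```

Starring several candidates could be handled by averaging their similarity vectors, and the same blending applies to the Word2Vec or GloVe scores.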
