## Recommend jobs from skill sets

In [1]:
from gensim import corpora, similarities
from gensim.models import TfidfModel

In [2]:
job_skills = ["Python", "Machine Learning", "AWS", "SQL", "GIT"]

candidates_skills = [
    ["Python", "Machine Learning", "AWS", "SQL", "Deep Learning"],
    ["R", "Statistics", "GIT",  "English"],
    ["Python", "Machine Learning", "AWS", "GIT", "English"]
]

### Jaccard similarity

In [3]:
def matching_score(job_skills, candidate_skills):
    """Fraction of the job's required skills that the candidate has.

    A modified Jaccard similarity: the denominator is the job's skill set
    rather than the union of both sets.
    """
    required = set(job_skills)
    if not required:
        return 0
    common_skills = required & set(candidate_skills)
    return len(common_skills) / len(required)

In [4]:
for i, candidate_skills in enumerate(candidates_skills):
    print("score of candidate %d: %f" % (i, matching_score(job_skills, candidate_skills)))

score of candidate 0: 0.800000
score of candidate 1: 0.200000
score of candidate 2: 0.800000
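
For comparison, here is a minimal sketch of the standard (unmodified) Jaccard similarity on the same data. The helper name `jaccard_similarity` and the choice to return 0 for two empty sets are assumptions made for this illustration, not part of the notebook above.

```python
def jaccard_similarity(skills_a, skills_b):
    """Standard Jaccard similarity: |A & B| / |A | B|."""
    a, b = set(skills_a), set(skills_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty skill sets
    return len(a & b) / len(union)


job_skills = ["Python", "Machine Learning", "AWS", "SQL", "GIT"]
candidates_skills = [
    ["Python", "Machine Learning", "AWS", "SQL", "Deep Learning"],
    ["R", "Statistics", "GIT", "English"],
    ["Python", "Machine Learning", "AWS", "GIT", "English"],
]

jaccard_scores = [jaccard_similarity(job_skills, c) for c in candidates_skills]
for i, score in enumerate(jaccard_scores):
    print("jaccard of candidate %d: %f" % (i, score))
```

Because the union appears in the denominator, the standard Jaccard score drops when a candidate lists skills the job does not ask for (candidate 0 gets 4/6 here rather than 4/5). The matching score above ignores those extra skills, which is usually what a recruiter wants.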


### Cosine similarity
We need to build an embedding for every skill set. Here I use a bag-of-words representation weighted by TF-IDF. You can further reduce the vector dimension with latent semantic indexing (LSI).

In [5]:
## build the dictionary and the TF-IDF embedding with gensim
def embedding(skillsets):
    dictionary = corpora.Dictionary(skillsets)
    corpus = [dictionary.doc2bow(text) for text in skillsets]
    model = TfidfModel(corpus)
    index = similarities.MatrixSimilarity(model[corpus])
    return dictionary, model, index

dictionary, model, index = embedding(candidates_skills)

In [6]:
## compute similarity
def compute_sim(skills):
    job_query = dictionary.doc2bow(skills)
    vec_job = model[job_query]
    sims = index[vec_job]
    return sims

sims = compute_sim(job_skills)

In [7]:
for i,s in enumerate(sims):
    print("score of candidate %d: %f" % (i, s))

score of candidate 0: 0.730248
score of candidate 1: 0.072699
score of candidate 2: 0.531179
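
To see where these numbers come from, the following is a rough pure-Python sketch of the same computation. It assumes what I understand to be gensim's default TF-IDF settings (raw term counts, `idf = log2(N / df)`, L2-normalised vectors); the helper names `fit_idf`, `tfidf_vector` and `cosine` are made up for this illustration.

```python
import math
from collections import Counter

job_skills = ["Python", "Machine Learning", "AWS", "SQL", "GIT"]
candidates_skills = [
    ["Python", "Machine Learning", "AWS", "SQL", "Deep Learning"],
    ["R", "Statistics", "GIT", "English"],
    ["Python", "Machine Learning", "AWS", "GIT", "English"],
]


def fit_idf(documents):
    """Inverse document frequency per term: log2(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log2(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(skills, idf):
    """Sparse, L2-normalised TF-IDF vector; skills unseen at fit time are dropped."""
    counts = Counter(s for s in skills if s in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


idf = fit_idf(candidates_skills)  # fitted on the candidates, like the gensim index
job_vec = tfidf_vector(job_skills, idf)
manual_sims = [cosine(job_vec, tfidf_vector(c, idf)) for c in candidates_skills]
for i, s in enumerate(manual_sims):
    print("score of candidate %d: %f" % (i, s))
```

Under those assumptions the sketch agrees with the gensim scores above. It also shows why candidate 0 now beats candidate 2 even though both cover four of the five job skills: `SQL` appears in only one candidate, so it carries a higher IDF weight than `GIT`.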


In [8]:
## sort by score
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
for i, s in sims_sorted:
    print("score of candidate %d: %f" % (i, s))

score of candidate 0: 0.730248
score of candidate 2: 0.531179
score of candidate 1: 0.072699


### Word Mover’s Distance (WMD)

WMD lets us compare two texts while taking word similarity into account. In general we would use pre-computed word vectors. In this use case, however, we have to find a way to compute the skill vectors ourselves. That involves two NLP tasks: NER (identifying the skills) and skill embedding (generating a vector for each skill). Here I assume that the skill vectors are already available. Then we just need to call the function below, where `model` stands for the object holding the skill vectors (in gensim, a word-vector model such as `KeyedVectors`), not the `TfidfModel` used above:
```
model.wmdistance(skillset_1, skillset_2)
```
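
To give a feel for what this distance measures, here is a toy sketch that does not use gensim. The 2-D skill vectors are invented purely for illustration, and the helper `wmd_equal_size` is hypothetical. It only handles the special case of two skill sets of equal size with uniform weights, where the optimal transport plan is a one-to-one matching that can be found by brute force. The general case needs an optimal-transport solver.

```python
import math
from itertools import permutations

# Hypothetical 2-D skill vectors, made up for this example only.
toy_vectors = {
    "Python": (1.0, 0.0),
    "R": (0.9, 0.1),
    "SQL": (0.0, 1.0),
}


def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def wmd_equal_size(skills_a, skills_b, vectors):
    """Toy WMD for equal-sized skill sets with uniform weights.

    Tries every one-to-one matching and returns the cheapest average
    distance. Brute force, so only suitable for very small sets.
    """
    a, b = sorted(set(skills_a)), sorted(set(skills_b))
    if not a or len(a) != len(b):
        raise ValueError("this sketch needs two non-empty sets of equal size")
    best_total = min(
        sum(euclidean(vectors[x], vectors[y]) for x, y in zip(a, perm))
        for perm in permutations(b)
    )
    return best_total / len(a)


d_close = wmd_equal_size(["Python", "SQL"], ["R", "SQL"], toy_vectors)
d_same = wmd_equal_size(["Python", "SQL"], ["SQL", "Python"], toy_vectors)
print("Python+SQL vs R+SQL: %f" % d_close)
print("Python+SQL vs SQL+Python: %f" % d_same)
```

The two skill sets in the first call share only `SQL`, yet their distance is small because the toy vectors place `Python` and `R` close together. The set-based scores in the earlier sections cannot capture that kind of similarity.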

### References

- [Gensim documentation](https://radimrehurek.com/gensim/auto_examples/index.html#documentation)
- [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [Latent semantic indexing](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html)
- Word Mover’s Distance:
    - [gensim function](https://radimrehurek.com/gensim/auto_examples/tutorials/run_wmd.html)
    - [paper](http://proceedings.mlr.press/v37/kusnerb15.pdf)
- [Text similarity](https://medium.com/@adriensieg/text-similarities-da019229c894)