# Introduction

In this notebook, we will implement word embeddings for job postings using [Word2vec](https://en.wikipedia.org/wiki/Word2vec). By implementing this, you will learn about embedding words in job postings such as job titles and skills. This will be useful for semantic job search, for instance.

Table of contents

- [Word embeddings and word2vec](#Word-embeddings-and-word2vec)
- [Job posting data](#Job-posting-data)
- [Data preprocessing](#Data-preprocessing)
- [Training](#Training)
- [Visualization](#Visualization)
- [Optional exercises](#Optional-exercises)
- [Future work](#Future-work)
- [Reading](#Reading)

# Word embeddings and word2vec
Word2vec is a machine learning model for learning vector representations of words, called "word embeddings".
Before starting the implementation, let's look at why we would want to learn word embeddings in the first place.

## Why learn word embeddings?
Natural language processing (NLP) traditionally converts words to discrete symbols or ids.  
e.g. `I like cats and dogs.` -> `[23, 761, 748, 221, 309]`

These encodings provide no useful information about the relationships that may exist between individual symbols. Furthermore, representing words as unique, discrete ids leads to data sparsity, so more data is needed to train a model.  
Using vector representations can overcome some of these issues.
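
To make the contrast concrete, here is a minimal sketch in plain Python. The ids come from the example sentence above; the one-hot size of 1000 and the 3-dimensional dense vectors are made-up values for illustration, not the output of a trained model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def one_hot(index, size):
    """A discrete id written as a vector: all zeros except a single 1."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Ids taken from the example sentence above.
ids = {"like": 761, "cats": 748, "dogs": 309}

# As one-hot vectors, every pair of distinct words looks equally unrelated.
cats_1hot = one_hot(ids["cats"], 1000)
dogs_1hot = one_hot(ids["dogs"], 1000)
like_1hot = one_hot(ids["like"], 1000)
print(dot(cats_1hot, dogs_1hot))  # 0.0
print(dot(cats_1hot, like_1hot))  # 0.0

# Hypothetical dense embeddings (hand-picked, not learned).
dense = {
    "cats": [0.9, 0.8, 0.1],
    "dogs": [0.8, 0.9, 0.2],
    "like": [0.1, 0.2, 0.9],
}
# With dense vectors, related words can score higher than unrelated ones.
print(dot(dense["cats"], dense["dogs"]) > dot(dense["cats"], dense["like"]))  # True
```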

## Word2vec
Word2vec is a neural network model which allows you to vectorize words.
If you are interested in how to calculate vectors of words, check out the readings listed below.

### Architectures
There are 2 architectures for implementing word2vec, **CBOW** (Continuous Bag-Of-Words) and **Skip-gram**.  

**CBOW** learns to predict the current word from its context, whereas **Skip-gram** learns to predict the context from the current word.
For example, if the sentence is "*The cat ate the mouse*" and the current word is `ate`,   
**CBOW** uses `['The', 'cat', 'the', 'mouse']` to predict `ate`, while **Skip-gram** uses `ate` to predict `['The', 'cat', 'the', 'mouse']`.
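
The difference can be sketched as the training examples each architecture would see. This is only an illustration of the input/output pairs, not of the neural network itself; the helper names and the window size of 2 are assumptions made for this example.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: one (context words -> current word) example per position."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one (current word -> context word) pair per context word."""
    return [
        (tokens[i], c)
        for i in range(len(tokens))
        for c in context_of(tokens, i, window)
    ]

sentence = ["The", "cat", "ate", "the", "mouse"]

print(cbow_examples(sentence)[2])
# (['The', 'cat', 'the', 'mouse'], 'ate')

print([pair for pair in skipgram_examples(sentence) if pair[0] == "ate"])
# [('ate', 'The'), ('ate', 'cat'), ('ate', 'the'), ('ate', 'mouse')]
```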

<img src="assets/word2vec_diagrams.png" width="700">

You don't need to understand these architectures completely. The important point is that the word2vec model learns the meaning of words from their contexts.

### Similarity
Each word is located as a point in a vector space with hundreds of dimensions (in practice, 100 to 500).  
A well trained set of word vectors will place similar words close to each other in that space.  
For instance, `dogs` and `cats` might cluster in one corner, while `war`, `conflict` and `strife` in another.
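
"Close to each other" is commonly measured with cosine similarity. The sketch below ranks neighbours by that measure, using hypothetical 2-dimensional vectors chosen by hand so that the two clusters are easy to see; real embeddings would be learned and have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings (hand-picked, not learned).
vectors = {
    "dogs": [0.9, 0.1],
    "cats": [0.8, 0.2],
    "war": [0.1, 0.9],
    "conflict": [0.2, 0.8],
    "strife": [0.15, 0.95],
}

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print([w for w, _ in most_similar("dogs", topn=1)])  # ['cats']
print([w for w, _ in most_similar("war")])           # ['strife', 'conflict']
```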

### Translation
Word2vec can learn many associations other than similarity.
For instance, it can gauge relations between words of one language, and map them to another.

<img src="assets/word2vec_translation.png" width="700">

### Distance
These vectors are the basis of a more comprehensive geometry of words.  Not only will Rome, Paris, Berlin and Beijing cluster near each other, but they will each have similar distances in vector space to the countries whose capitals they are.

<img src="assets/countries_capitals.png" width="700">

So you can find the capital of Japan by calculating `Rome - Italy + Japan = Tokyo`.  
Another famous example would be `king - man + woman = queen`.
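
The arithmetic can be sketched with toy vectors. The 2-dimensional values below are hypothetical and arranged so that every capital sits at the same offset from its country. With learned embeddings the result is only approximately equal to the target word, and libraries typically search by cosine similarity rather than the Euclidean distance used here.

```python
# Hypothetical embeddings: first axis ~ "which country", second axis ~ "is a capital".
vectors = {
    "Italy": [1.0, 0.0], "Rome": [1.0, 1.0],
    "France": [2.0, 0.0], "Paris": [2.0, 1.0],
    "Japan": [3.0, 0.0], "Tokyo": [3.0, 1.0],
}

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: squared_distance(vectors[w], target))

print(analogy("Rome", "Italy", "Japan"))  # Tokyo
```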

## How are job word embeddings useful?
One reason to learn "job word embeddings" is that they help you find jobs whose titles are similar to your current one, which is difficult to do with a traditional search engine.  

<style>
table {margin: none;}
</style>

|Target|Similar job titles|
|:---|:---|
|Web engineer| Server side engineer, fullstack engineer, front-end engineer|
|Infra engineer| Devops engineer, System reliability engineer|
|Data scientist| Data analyst, Data engineer, Machine learning engineer|

# Job posting data
We need job postings text data to learn word embeddings.

The prepared dataset has 12130 postings with many fields: `job_title`, `summary`, `requirements`, `salary`, `location`, etc. Some fields explain the job content well; others don't. We can decide which fields to use for learning.

Let's look at the raw data.

In [None]:
import pandas as pd

job_postings = pd.read_csv('./data/job_postings.csv')
job_postings.head()

# Data preprocessing

In this project, we will use `requirements` and `summary`, as they seem to explain the job content. We will also append `job_title` to learn what each job title means.

In [None]:
sample_job = job_postings.loc[10]

In [None]:
sample_job[['requirements', 'summary']]

The following function extracts keywords from these fields.  
We will use only nouns because adjectives, verbs and punctuation marks don't give meaningful information for this material.

In [None]:
import MeCab
# Check where neologd is installed on your computer, and change the path in the line below.
tagger = MeCab.Tagger("-U %m,未知語\\t -F %f[0],%f[6]\\t  -d  /usr/local/opt/mecab-ipadic-neologd")

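def extract_nouns(text):
    """Return the lowercased base forms of the nouns found in `text`."""
    words = []
    # for known words, the tagger emits tab-separated "part of speech,base form" entries
    for row in tagger.parse(text).split('\t'):
        if row.strip() == 'EOS':
            continue
        fields = row.split(',')
        t = fields[0]  # part of speech
        w = fields[1]  # base form
        if t == '名詞':  # 名詞 = noun
            words.append(w.strip().lower())
    return words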

In [None]:
print(extract_nouns(sample_job['requirements']))

You can see that there are words that explain the job, such as `web`, `application`, `java`, `javascript`, etc.

In [None]:
print(extract_nouns(sample_job['summary']))

In the summary field, there are words related to the job like `java`, `swift`, `backlog` etc, but also some words that represent the business of the company like `crm`, `no.1`, `platform`.

We were able to extract meaningful words from the `requirements` and `summary` fields.  
Let's check the `job_title` field next.

In [None]:
# show the first 20 job titles
job_postings['job_title'].head(20)

We would like to use job titles without tokenization to learn what each job title means.  
e.g. web engineer, web application engineer, lead engineer.

But as you can see above, the raw data is written in a variety of ways.  
So we need to clean the job titles up to normalize them.

In [None]:
import re
import unicodedata

job_titles = job_postings['job_title']

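# Remove bracketed annotations and trailing notes from a job title
# e.g. インフラエンジニア(ADPLAN) -> インフラエンジニア
def normalize_job_title(title):
    # NFKC unifies full-width and half-width characters, e.g. （ ） -> ( )
    title = unicodedata.normalize('NFKC', title)
    # non-greedy patterns, so that text between two separate bracket pairs is kept
    title = re.sub(r'【.*?】', '', title)
    title = re.sub(r'\[.*?\]', '', title)
    title = re.sub(r'「.*?」', '', title)
    title = re.sub(r'\(.*?\)', '', title)
    title = re.sub(r'<.*?>', '', title)
    # drop everything after a note marker
    title = re.sub(r'[※@◎].*$', '', title)
    return title.strip().lower()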

In [None]:
# apply the function to the first 20 job titles
job_postings['job_title'].head(20).apply(normalize_job_title)

## Converting to the input format of the word2vec API

In the next section, we will use [Gensim's word2vec API](https://radimrehurek.com/gensim/models/word2vec.html) to produce word embeddings.  
As you can see in the documentation, it requires lists of words.

```python
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

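# min_count=1 keeps rare words; the default (5) would discard every word in this tiny corpus
model = Word2Vec(sentences, min_count=1)
say_vector = model.wv['say']  # get vector for word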
```

So we need to convert the raw data to the required format. You can do that with the functions implemented above.  
You also need to include the job titles so that the model can learn their meaning.  
You can do that by adding them to the beginning of the requirements.

```python
['web engineer'] + ['ruby', 'postgresql', 'agile', 'development', ... ]
```

Now it's time to get your hands dirty and prepare the input data for training.  

**Exercise**: Complete the function `convert_job_posting` and convert all job postings to the appropriate format.

In [None]:
def convert_job_posting(job):
    # implement this function
    pass

# convert data to inputs of Gensim's word2vec API
inputs = []
for _, p in job_postings.iterrows():
    inputs += convert_job_posting(p)

In [None]:
print(inputs[20])
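
If you get stuck on the exercise, the sketch below shows one possible shape for the conversion. It is not a reference solution: `convert_job_posting_sketch` is a hypothetical name, the tokenizer and the title normalizer are passed in as arguments so that the block runs without MeCab, and prepending the title to both `requirements` and `summary` is an assumption.

```python
def convert_job_posting_sketch(job, tokenize, normalize_title):
    """Turn one posting into a list of word lists, each starting with the job title."""
    title = normalize_title(job["job_title"])
    sentences = []
    for field in ("requirements", "summary"):
        text = job[field]
        # skip missing values (e.g. NaN read from a CSV) and empty strings
        if isinstance(text, str) and text.strip():
            sentences.append([title] + tokenize(text))
    return sentences

# Demo with stand-ins for extract_nouns and normalize_job_title.
demo_job = {
    "job_title": "Web Engineer",
    "requirements": "ruby postgresql agile",
    "summary": None,
}
print(convert_job_posting_sketch(demo_job, str.split, str.lower))
# [['web engineer', 'ruby', 'postgresql', 'agile']]
```

In the notebook you would pass `extract_nouns` and `normalize_job_title` instead of the stand-ins. Because the function returns a list of word lists, it fits the `inputs += ...` loop above.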

# Training

Finally, you can learn word embeddings with the inputs you prepared above and Gensim's word2vec.

**Exercise**: Build a word2vec model, referring to [the API documentation](https://radimrehurek.com/gensim/models/word2vec.html).

In [None]:
from gensim.models.word2vec import Word2Vec

word2vec_model = # build model

word2vec_model.save('job_word_embeddings.model')

The trained model's word vectors (`word2vec_model.wv`) have a method called `most_similar` that returns the words most similar to a given word in the learned corpus.  
You can use the wrapper function defined below.

In [None]:
def similar_words(title):
    return word2vec_model.wv.most_similar(title.lower())

Let's try finding similar words.

In [None]:
similar_words('Webエンジニア') # web engineer

In [None]:
similar_words('UIデザイナー') # UI designer

In [None]:
similar_words('aiエンジニア') # ai engineer

In [None]:
similar_words('Ruby')

In [None]:
similar_words('photoshop')

In [None]:
similar_words('agile')

These results will be useful when finding jobs.

# Visualization

We looked at similar words in the previous section, but the learned model captures not only similarities but also geometric information.  
The following code plots the vectors of the 1000 most popular job titles in 2-dimensional space.

In [None]:
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.manifold import TSNE
import numpy as np

# Check which Japanese font file is available on your machine, and change the path below.
fp = FontProperties(fname='/System/Library/Fonts/Hiragino Sans GB W3.ttc', size=14)

# Take the 1000 most frequent normalized job titles and keep only those in the
# model's vocabulary, so that labels and vectors stay aligned.
popular_job_titles = job_postings['job_title'].map(normalize_job_title).value_counts().head(1000).index
plotted_titles = [t for t in popular_job_titles if t in word2vec_model.wv]
X = np.vstack([word2vec_model.wv[t] for t in plotted_titles])

# Project the embeddings to 2 dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
embedding = tsne.fit_transform(X)

plt.figure(figsize=(40, 40))
plt.scatter(embedding[:, 0], embedding[:, 1])

# Label each point with its job title
for label, x, y in zip(plotted_titles, embedding[:, 0], embedding[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', fontproperties=fp)

plt.show()

You can see that similar job titles are plotted close to each other.  
This information will be useful when clustering job titles or skills. 

# Optional exercises 
I suggest trying the following exercises if you are interested.

## Parameter tuning
In the exercise above, you did not tune the training parameters, such as the following:

```python
# `vector_size` was called `size` before gensim 4.0
Word2Vec(inputs, vector_size=100, min_count=5, window=10, sg=1)
```

Read [the Gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html), change those parameters, and see how they affect the output.


## Clustering
Find clusters among the vectorized words.  
Scikit-learn provides a [k-means clustering API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). [This article](http://ai.intelligentonlinetools.com/ml/k-means-clustering-example-word2vec/) explains how to use it, with sample code.

# Future work
In this final section, we will see potential applications for job word embeddings.

## Semantic job search
One useful application would be "semantic" job search, with which you can find jobs not only by your query but also by words similar to it.  
The idea is not complicated: the system expands the query into the original query plus similar words (some of you may remember the project :P ).  
It could probably be implemented easily using some Elasticsearch features.

- [Synonym Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-synonym-tokenfilter.html)
  - You can register similar words as synonyms
- [Compound Queries](https://www.elastic.co/guide/en/elasticsearch/reference/master/compound-queries.html)
  - Needs more investigation, but you can probably use similarity scores as weights of search score.
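
The query expansion step itself can be sketched independently of any search engine. In the example below, `expand_query`, its `topn` and `min_score` parameters, and the similarity table are all hypothetical; in practice the lookup would wrap a trained word2vec model.

```python
def expand_query(query, similar_words, topn=3, min_score=0.5):
    """Return the original query followed by its most similar words.

    `similar_words` is any function that maps a word to a list of
    (word, similarity) pairs sorted by decreasing similarity.
    """
    neighbours = similar_words(query.lower())[:topn]
    return [query] + [w for w, score in neighbours if score >= min_score]

# Hypothetical similarity table standing in for a trained model.
fake_similarities = {
    "data scientist": [
        ("machine learning engineer", 0.82),
        ("data analyst", 0.79),
        ("data engineer", 0.61),
        ("sales", 0.12),
    ],
}

def lookup(word):
    """Return the stored neighbours, or an empty list for unknown words."""
    return fake_similarities.get(word, [])

print(expand_query("Data scientist", lookup))
# ['Data scientist', 'machine learning engineer', 'data analyst', 'data engineer']
```

The expanded terms could then be registered as synonyms or combined into a compound query, with the similarity scores as candidate weights.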

There are some articles/projects that do the same thing.

- [Conceptual Job Search](https://github.com/DiceTechJobs/ConceptualSearch)
- [Using Word2Vec in Fusion for Better Search Results](https://lucidworks.com/2016/11/16/word2vec-fusion-nlp-search/)
- [ThisPlusThat: A Search Engine That Lets You ‘Add’ Words as Vectors](https://blog.insightdatascience.com/thisplusthat-me-a-search-engine-that-lets-you-add-words-as-vectors-2ec0b8a4f629)

## Semantic candidate search
It's similar to job search. It is also difficult for recruiters to find good candidates, especially when there are not many candidates who satisfy the requirements of a position, e.g. data scientist, creative director, product owner, senior engineer, etc.  
Using similar job titles and skills, you can find more candidates semantically.

## Similarity-based job recommendation system
There is a more advanced model that vectorizes documents, called "[doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html)". When a user has already applied to or bookmarked some job positions, you can recommend jobs similar to those preferred ones.

- [Recommender System with Distributed Representation](https://www.slideshare.net/rakutentech/recommender-system-with-distributed-representation)

# Reading
Here are some resources from which you can learn more about word2vec after the workshop.
- [Word2vec explanation in DL4j document](https://deeplearning4j.org/word2vec)
- [Gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html)
- [Gensim word2vec tutorial](https://rare-technologies.com/word2vec-tutorial/)
- [Tensorflow tutorial](https://www.tensorflow.org/tutorials/word2vec)
- [Udacity material](https://github.com/udacity/deep-learning/blob/master/embeddings/Skip-Gram_word2vec.ipynb)
- [Conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)