##### Copyright 2020 HrFlow's AI Research Department

Licensed under the Apache License, Version 2.0 (the "License");

In [16]:
# Copyright 2020 HrFlow's AI Research Department. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Job API:

This notebook illustrates how to use **HrFlow's Job API**. This API serves as an interface to upload jobs (structured either as JSON or as a file on your hard drive) and to retrieve results from HrFlow. In the current version, the following results can be retrieved or used:
* The **Job object**
* The **Search engine** and **Scoring**
* The **embeddings** at various degrees of granularity

An **example application** of the Job API is available below. It shows how **embeddings** can be leveraged to **classify jobs**.

**Embeddings** ease the management of documents such as resumes or jobs. They turn any highly structured document, such as a resume, into a single fixed-length **vector of numbers**.

The document embeddings can also be trivially used to compute **job or profile level meaning similarity** as well as to enable better performance on downstream classification tasks using **less supervised training data**.
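
As a minimal sketch of such a similarity computation, the cosine similarity between two embedding vectors can be computed with NumPy. The 4-dimensional vectors below are made-up values for illustration only; they are not real HrFlow embeddings, and cosine similarity is one common choice of measure rather than necessarily the one HrFlow uses.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, for illustration only
job_a = [0.1, 0.3, -0.2, 0.7]
job_b = [0.2, 0.6, -0.4, 1.4]   # same direction as job_a
job_c = [-0.7, 0.2, 0.3, 0.1]   # orthogonal to job_a

print(cosine_similarity(job_a, job_b))
print(cosine_similarity(job_a, job_c))
```

A value close to 1 suggests two documents with similar meaning, while a value close to 0 suggests unrelated ones.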


**Disclaimer**: The jobs come from [pole-emploi.fr](https://www.pole-emploi.fr)

<p>
<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/Riminder/python-hrflow-api/blob/master/examples/colab/hrflow_job_api.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/Riminder/python-hrflow-api/blob/master/examples/colab/hrflow_job_api.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td><td>
  <a target="_blank"  href="https://www.hrflow.ai/book-us">
    <img width=32px src="https://gblobscdn.gitbook.com/spaces%2F-M1L6Hspq8r9LXd5_gIC%2Favatar-1586188377926.png?generation=1586188378327930&alt=media" />Get an account</a>
</td></table>
<br>
</p>

# Getting Started
This section sets up the environment to get access to **HrFlow Job API** and sets up a connection to HrFlow.

In [None]:
# Machine Learning and Classification Libs
!pip install --quiet tensorflow
!pip install --quiet matplotlib
!pip install --quiet pandas
!pip install --quiet seaborn
!pip install --quiet plotly

# HrFlow Dependencies
!apt-get install libmagic-dev
!pip install --quiet python-magic
!pip install --quiet hrflow

In [None]:
import os
import pickle
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
ROOT_PATH = "drive/My Drive/Data"

An **API Key** is required here. You can get your API Key at `https://<your-subdomain>.hrflow.ai/settings/api/keys` or ask us for a **demo API Key**.

Either save your API Key as a file in `ROOT_PATH`, or set the Python variable `api_secret` to your API Key (`api_secret = 'YOUR_SECRET_API_KEY'`).

In [None]:
import pprint
import hrflow as hf

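# Option 1: load the API Key from a pickled file stored in ROOT_PATH
# with open(os.path.join(ROOT_PATH, 'api_key'), 'rb') as file:
#   api_secret = pickle.load(file)

# Option 2: set the API Key directly (never commit a real key to a shared notebook)
api_secret = "YOUR_SECRET_API_KEY"

client = hf.Client(api_secret=api_secret)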
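
A third option, sketched here as an assumption rather than something the HrFlow client provides, is to read the key from an environment variable so that it never appears in the notebook. The variable name `HRFLOW_API_SECRET` and the helper `read_api_secret` are hypothetical.

```python
import os

def read_api_secret(env_var="HRFLOW_API_SECRET"):
    """Return the API Key stored in an environment variable, or raise if it is unset."""
    api_secret = os.environ.get(env_var)
    if not api_secret:
        raise RuntimeError(f"Set the {env_var} environment variable to your API Key")
    return api_secret

# Usage (requires the variable to be set beforehand):
# client = hf.Client(api_secret=read_api_secret())
```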

# 1. Job API Routes

There are currently five routes for this API:
*  **Indexing**: Upload a job by specifying its information. The job is stored as a JSON file in HrFlow.
*  **Parsing**: Retrieve a job's parsing.
*  **Embedding**: Retrieve your jobs' **embeddings** to build your custom solution.
*  **Searching**: Retrieve a list of jobs based on filters.
*  **Scoring**: Retrieve a list of jobs based on filters, along with their scores for a given profile.

## 1.1. Upload JSON

In [None]:
job_json = {
    "name": "Data Engineer",
    "agent_key": "cb0a59170cf0034a2fe5912382cdc478bc001ecc",
    "reference": "Job's reference 07082020",
    "url": "https://www.pole-emploi.ai/jobs/data_engineer",
    "summary": "As an engineer for the Data Engineering Infrastructure team, you will design, build, scale, and evolve our data engineering  platform, services and tooling. Your work will have a critical  impact on all areas of business:supporting detailed internal analytics, calculating customer usage, securing our platform, and much more.",
    "location": {
                  "text": "Dampierre en Burly (45)",
                  "geopoint": {
                      "lat": 47.7667,
                      "lon": 2.5167
                  }
                 },
    "sections": [{
                    "name": "profile",
                    "title": "Searched Profile",
                    "description": "Bac+5"
                  }
                  ],
    "skills": [{
                  "name": "python",
                  "value": None
               },
               {
                  "name": "spark",
                  "value": 0.9
               }
               ],
    "languages": [{
                     "name": "english",
                     "value": 1
                  },
                 {  
                     "name": "french",
                     "value": 1
                  }
                  ],
    "tags": [{
                "name": "archive",
                "value": True
             },
             {  
                "name": "tag example",
                "value": "tag"
              }
              ],
    "ranges_date": [{
                       "name": "Dates",
                       "value_min": "2020-05-18T21:59",
                       "value_max": "2020-09-15T21:59"
                    }
                    ],
    "ranges_float": [{
                       "name": "salary",
                       "value_min": 30,
                       "value_max": 40,
                       "unit": "eur"
                    }
                    ],
    "metadatas": [{
                     "name": "metadata example",
                     "value": "metadata"
                  }
                  ]
}

response = client.job.indexing.add_json(board_key="board_key", job_json=job_json)
pprint.pprint(response)
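
Before uploading, it can help to sanity-check the payload locally. The helper below is a hypothetical pre-flight check, not part of the `hrflow` package. It only verifies generic facts (valid latitude and longitude bounds, and `value_min` not exceeding `value_max`) on the fields used in the example above; it makes no claim about what the API itself validates.

```python
def check_job_json(job_json):
    """Return a list of human-readable problems found in a job payload."""
    problems = []
    geopoint = (job_json.get("location") or {}).get("geopoint")
    if geopoint is not None:
        if not -90 <= geopoint["lat"] <= 90:
            problems.append("latitude out of range")
        if not -180 <= geopoint["lon"] <= 180:
            problems.append("longitude out of range")
    for field in ("ranges_float", "ranges_date"):
        for item in job_json.get(field, []):
            # ISO 8601 strings in the same format compare correctly as text
            if item["value_min"] > item["value_max"]:
                problems.append(f"{field} '{item['name']}': value_min > value_max")
    return problems

# Deliberately broken payload, for illustration only
bad_job = {
    "location": {"geopoint": {"lat": 147.7667, "lon": 2.5167}},
    "ranges_float": [{"name": "salary", "value_min": 50, "value_max": 40}],
}
print(check_job_json(bad_job))
```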

## 1.2. Edit Existing Job

In [None]:
job_json = {
    "name": "Data Engineer",
    "agent_key": "cb0a59170cf0034a2fe5912382cdc478bc001ecc",
    "reference": "Job's reference abc",
    "url": "https://www.pole-emploi.ai/jobs/data_engineer",
    "summary": "As an engineer for the Data Engineering Infrastructure team, you will design, build, scale, and evolve our data engineering  platform, services and tooling. Your work will have a critical  impact on all areas of business:supporting detailed internal analytics, calculating customer usage, securing our platform, and much more.",
    "location": {
                  "text": "Dampierre en Burly (45)",
                  "geopoint": {
                      "lat": 47.7667,
                      "lon": 2.5167
                  }
                 },
    "sections": [{
                    "name": "profile",
                    "title": "Searched Profile",
                    "description": "Bac+5"
                  }
                  ],
    "skills": [{
                  "name": "python",
                  "value": None
               },
               {
                  "name": "spark",
                  "value": 0.9
               }
               ],
    "languages": [{
                     "name": "english",
                     "value": 1
                  },
                 {  
                     "name": "french",
                     "value": 1
                  }
                  ],
    "tags": [{
                "name": "archive",
                "value": False
             },
             {  
                "name": "tag example",
                "value": "tag"
              }
              ],
    "ranges_date": [{
                       "name": "Dates",
                       "value_min": "2020-05-18T21:59",
                       "value_max": "2020-09-15T21:59"
                    }
                    ],
    "ranges_float": [{
                       "name": "salary",
                       "value_min": 30,
                       "value_max": 50,
                       "unit": "eur"
                    }
                    ],
    "metadatas": [{
                     "name": "metadata example",
                     "value": "metadata"
                  }
                  ]
}

response = client.job.indexing.edit(board_key="board_key", key="job_key", job_json=job_json)
pprint.pprint(response)

## 1.3. Get Job object

The method `client.job.indexing.get` retrieves the information of a given job. It takes two mandatory fields: **board_key**, and either **job_key** or **job_reference**. The job_key is returned in the upload response as **key**.

In [None]:
response = client.job.indexing.get(board_key="board_key", key="job_key")
pprint.pprint(response)

## 1.4. Embeddings Retrieval

`client.job.embedding.get` returns the embeddings of a given job (uniquely identified by **board_key** together with **job_key** or **job_reference**).

This method currently handles one job per call. A loop is required to get the embeddings of more than one job.

In [None]:
response = client.job.embedding.get(board_key="board_key", key="job_key")
pprint.pprint(response)
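
Such a loop could look like the sketch below. The helper name `get_embeddings` is hypothetical; it simply repeats the `client.job.embedding.get(board_key=..., key=...)` call shown above for each job key and makes no assumption about the shape of the response.

```python
def get_embeddings(client, board_key, job_keys):
    """Call client.job.embedding.get once per job key and collect the responses."""
    responses = {}
    for job_key in job_keys:
        responses[job_key] = client.job.embedding.get(board_key=board_key, key=job_key)
    return responses

# Usage, with placeholder keys:
# responses = get_embeddings(client, "board_key", ["job_key_1", "job_key_2"])
```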

## 1.5. Advanced Tools

### 1.5.1. Job Search Engine

client.job.searching searches jobs based on their **name**

In [None]:
import json

response = client.job.searching.list(board_keys=["board_key"], page=1, limit=30, sort_by='created_at')
pprint.pprint(response)

### 1.5.2. Job Scoring

In [None]:
response = client.job.scoring.list(board_keys=["board_key"],
                                   source_key="source_key",
                                   profile_key="profile_key",
                                   use_agent=1,
                                   agent_key="agent_key",
                                   page=1, limit=30, sort_by='created_at', order_by="desc")
pprint.pprint(response)

# 2. Machine Learning With Embeddings

Embeddings have been widely used in *Natural Language Processing* in recent years (from 2013 onwards), thanks to Tomas Mikolov and his team at Google. Their breakthrough in building reliable word embeddings had a huge impact, both scientifically and technologically.

The rough idea behind embeddings is to **numerically capture the meaning or information** of a word (or a sentence, or even a whole document such as a resume). Any resume can thus be represented relatively accurately by a set of real numbers (a 'vector of floats'). This **accuracy** is measured against a predefined task.

An embedding algorithm is deemed 'good' when it performs well on a given **evaluation task**. Word embeddings, for instance, can be trained and evaluated on the same task: filling gaps in sentences. This task quantifies how well an embedding algorithm captures the context of a sentence (the sequence of words around the gap) by predicting the missing words.

In our case, the most obvious, practical and meaningful evaluation task is the **classification of jobs** (which ones are 'bakers', 'data scientists', etc.). This task is usually done quite easily by humans (Human Resources departments) and relatively well by computers (keywords).

The following cells of this notebook show a relatively simple model that classifies some types of jobs, based on Pôle emploi jobs (HrFlow Crawling Pipeline Feature).

In [None]:
import requests
import shutil
import numpy as np

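def load_embedding(url):
  # Download the embedding file to a temporary local file
  response = requests.get(url, stream=True)
  response.raise_for_status()  # fail early on HTTP errors
  with open('tmp', 'wb') as file:
      shutil.copyfileobj(response.raw, file)
  # allow_pickle=True can execute arbitrary code: only load files from trusted sources
  return np.load('tmp', allow_pickle=True)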

## 2.1. Embeddings Retrieval

When retrieving a large number of embeddings, we advise fetching them asynchronously.
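
One possible way to do this, sketched with Python's standard thread pool rather than any official HrFlow helper, is shown below. `fetch_all` and `fake_fetch` are hypothetical names; in practice `fetch_one` would wrap the actual API call, and `max_workers` should be chosen with the API's rate limits in mind.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetch_one, job_ids, max_workers=8):
    """Apply fetch_one to every job id using a pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_one, job_ids))

# Stand-in for a real API call, used here only to make the sketch runnable
def fake_fetch(job_id):
    return {"job_id": job_id, "embedding": [0.0] * 4}

results = fetch_all(fake_fetch, ["job_1", "job_2", "job_3"])
print([item["job_id"] for item in results])
```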

The next Colab cell downloads embeddings for a list of **job_id** values. Extra information about these jobs is stored in a file named **jobs_types** (the Pôle emploi job type associated with each job_id). The embeddings, the job titles and descriptions, and the job types are saved to disk for later use.

In [None]:
# Loading From HrFlow
from tqdm import tqdm

with open(os.path.join(ROOT_PATH, 'jobs_ids'), 'rb') as file:
  jobs_ids = pickle.load(file)
with open(os.path.join(ROOT_PATH, 'jobs_types'), 'rb') as file:
  jobs_types = pickle.load(file)
jobs_texts = []
jobs_embeddings = []

for job_id in tqdm(jobs_ids):
  # Get Embedding
  response = client.job.embedding.get(job_id=job_id)
  jobs_embeddings.append(load_embedding(response['data']))
  # Get Job Informations
  response = client.job.parsing.get(job_id=job_id)
  job_text = {'title': response['data']['name'], 
              'description': response['data']['description']}
  jobs_texts.append(job_text)

# Save Data To Disk
with open(os.path.join(ROOT_PATH, 'jobs_embeddings'), 'wb') as file:
  pickle.dump(jobs_embeddings, file) 
with open(os.path.join(ROOT_PATH, 'jobs_texts'), 'wb') as file:
  pickle.dump(jobs_texts, file) 
with open(os.path.join(ROOT_PATH, 'jobs_types'), 'wb') as file:
  pickle.dump(jobs_types, file) 

In [None]:
# Loading From Disk
with open(os.path.join(ROOT_PATH, 'jobs_embeddings'), 'rb') as file:
  jobs_embeddings = pickle.load(file) 
with open(os.path.join(ROOT_PATH, 'jobs_texts'), 'rb') as file:
  jobs_texts = pickle.load(file) 
with open(os.path.join(ROOT_PATH, 'jobs_types'), 'rb') as file:
  jobs_types = pickle.load(file) 

## 2.2. Job Classification

#### 2.2.a. Model: Shallow Neural Network

Our neural network is a shallow network with a single hidden layer, defined by three layers:
*  Input: job embeddings lie in $\mathbb{R}^{64}$, which explains the input shape `shape=(64,)`
*  Hidden Layer: a simple 64-neuron dense layer using the tanh activation function ($x\mapsto (e^{x}-e^{-x})/(e^{x}+e^{-x})$)
*  Output: probability-like real numbers produced by the softmax activation function.

Since we are building a classifier, we compile the model with the most common loss and optimizer (categorical cross-entropy and Adam, respectively). More information about the TensorFlow neural network library can be found at https://www.tensorflow.org/api_docs
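
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass such a network computes. The weights and inputs are random stand-ins (a trained Keras model learns its own), so only the shapes and the fact that each output row sums to 1 are meaningful.

```python
import numpy as np

def tanh(x):
    # Explicit definition of tanh; equivalent to np.tanh
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softmax(z):
    # Subtract the row maximum for numerical stability
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64))                 # two fake 64-dimensional embeddings
w1, b1 = rng.normal(size=(64, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.normal(size=(64, 7)) * 0.1, np.zeros(7)

hidden = tanh(x @ w1 + b1)                   # hidden layer, shape (2, 64)
probabilities = softmax(hidden @ w2 + b2)    # output layer, shape (2, 7)
print(probabilities.shape, probabilities.sum(axis=1))
```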

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical

model_input = Input(shape=(64,))
dense = Dense(64, activation='tanh')(model_input)
softmax = Dense(7, activation='softmax')(dense)
model = Model(inputs=[model_input], outputs=[softmax])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

#### 2.2.b. Train-Test Split

Our dataset (1488 items) is split into two subsets:
*  **Training Set**: 67% of the dataset is used for the model's training phase
*  **Validation Set**: the remaining 33% of the dataset is used for model validation (using a confusion matrix and principal component analysis)
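
As a quick worked example of these proportions, the resulting subset sizes can be computed by hand. This assumes the test size is rounded up and the training set takes the remainder, which is how scikit-learn's `train_test_split` behaves with a fractional `test_size`.

```python
import math

n_samples = 1488
test_size = 0.33

# Assumption: the test set size is rounded up, as scikit-learn's train_test_split does
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 996 492
```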

In [None]:
jobs_list = list(set(jobs_types))
jobs_types_labels = {job_type: index for index, job_type in enumerate(jobs_list)}

In [None]:
from sklearn.model_selection import train_test_split

(jobs_embeddings_train, jobs_embeddings_test,
 jobs_types_train, jobs_types_test,
 jobs_texts_train, jobs_texts_test) = train_test_split(jobs_embeddings,
                                                       jobs_types,
                                                       jobs_texts,
                                                       test_size=0.33)
labels_train = [jobs_types_labels[job] for job in jobs_types_train]
labels_test = [jobs_types_labels[job] for job in jobs_types_test]

#### 2.2.c. Training

In [None]:
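# Stack the per-job vectors into one array so Keras does not treat the list as multiple inputs
model.fit(x=np.asarray(jobs_embeddings_train),
          y=to_categorical(labels_train),
          epochs=20)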
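
`to_categorical` converts integer labels into one-hot vectors, the target format expected by the categorical cross-entropy loss. The NumPy sketch below shows an equivalent conversion; `one_hot` is a hypothetical helper written for illustration. Passing the number of classes explicitly avoids a shape mismatch with the output layer if the highest label happens to be missing from a split.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

print(one_hot([0, 2, 1], num_classes=3))
```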

#### 2.2.d. Evaluation and Analysis on Test Set

The model is evaluated on the validation (test) set with two different methods:

1.   **Confusion Matrix**: a matrix of counts showing:
*   On the **diagonal**: **correctly predicted** jobs
*   Anywhere else: wrongly classified jobs


2.   **Principal Component Analysis Plot**: uses dimensionality reduction (projection onto the axes of highest variance) to display high-dimensional vectors in a lower dimension (usually 2 or 3). Clusters of jobs are shown in a 3-dimensional space
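
To illustrate how a confusion matrix is read, here is a small NumPy sketch on made-up labels (three classes, six samples). The helper is written for illustration; rows index the true class and columns the predicted class, and the overall accuracy is the sum of the diagonal divided by the total count.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true, predicted] += 1
    return matrix

# Made-up labels, for illustration only
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(accuracy)
```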


In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.decomposition import PCA
from pandas import DataFrame

# Scatter Plot Hover Text Formatting
def line_jump(text, every_char=50):
    # Insert an HTML line break every `every_char` characters, keeping the trailing remainder
    chunks = [text[start:start + every_char] for start in range(0, len(text), every_char)]
    return '<br />'.join(chunks)

In [None]:
# Compute Model's Predictions on Test Set
predictions = np.argmax(model.predict(np.asarray(jobs_embeddings_test)), axis=1)

In [None]:
# Confusion Matrix
confusion_matrix = tf.math.confusion_matrix(labels_test, predictions)

# Plot Confusion Matrix
fig, ax = plt.subplots(figsize=(10,10)) 
sns.heatmap(confusion_matrix, 
            linewidths=0.5, cmap="YlGnBu", square=True, 
            xticklabels=jobs_list, yticklabels=jobs_list,
            annot=True, fmt='g')
# tf.math.confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
plt.xlabel('Predicted Label', fontsize=15)
plt.ylabel('True Label', fontsize=15)

In [None]:
# Principal Component Analysis in Dimension 3
pca = PCA(n_components=3).fit_transform(jobs_embeddings_test)

# DataFrame
df = DataFrame({'Title': [job['title'] for job in jobs_texts_test],
                'Description': [line_jump(job['description'][:1000], 75) for job in jobs_texts_test],
                'Predicted Job Type': [jobs_list[pred] for pred in predictions],
                'Job Type': jobs_types_test,
                'Classification Success': [jobs_list[pred]==job for pred, job in zip(predictions, jobs_types_test)],
                'First PCA Axis': pca[:, 0], 
                'Second PCA Axis': pca[:, 1], 
                'Third PCA Axis': pca[:, 2]})
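
For intuition about what the PCA projection computes, the following NumPy sketch reproduces the idea with a singular value decomposition: center the data, then project it onto the axes of highest variance. It runs on random stand-in data and is not a replacement for scikit-learn's `PCA`, whose output may differ in the sign of each axis.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project rows of data onto their n_components highest-variance axes."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 64))   # random stand-in for job embeddings
projected = pca_project(fake_embeddings)
print(projected.shape)  # (100, 3)
```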

In [None]:
# Scatter Plot
fig = px.scatter_3d(df, x='First PCA Axis', y='Second PCA Axis', z='Third PCA Axis', 
                    hover_data=['Description', 'Predicted Job Type', 'Job Type', 'Classification Success'],
                    hover_name='Title',
                    color='Predicted Job Type',
                    color_discrete_sequence=px.colors.qualitative.Pastel,
                    symbol='Classification Success',
                    symbol_map={True: "circle", False: "square-open"},
                    width=800, height=800, template='plotly_white')
fig.update(layout_showlegend=False)
fig.update_traces(marker=dict(size=6, line=dict(width=1)))