### 031 Implementing Silhouette Score
Because there is no ground truth, it is difficult to assess a model trained on the extracted requirements, which makes model selection and tuning challenging. Without a labeled dataset, the evaluation metric has to be computed from the data and the model's own cluster assignments. The silhouette score is a natural candidate and is the metric used in this notebook.
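
For reference, the standard definition of the silhouette coefficient is sketched below. The symbols are notation introduced here, not taken from the notebook: $a(i)$ is the mean distance from sample $i$ to the other samples in its own cluster, $b(i)$ is the mean distance from sample $i$ to the samples of the nearest other cluster, and $N$ is the number of samples.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S = \frac{1}{N} \sum_{i=1}^{N} s(i)
```

Each $s(i)$ lies between -1 and 1. Values near 1 indicate a sample that sits well inside its cluster, and the average $S$ is the quantity reported as the silhouette score.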

In [2]:
import pandas as pd
pd.options.plotting.backend = "plotly" #interactive plots will be useful in this context
import plotly.express as px
import numpy as np
import gensim
from sentence_transformers import SentenceTransformer, util
import pickle
import torch
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn.metrics import silhouette_samples, silhouette_score
import umap
from datetime import datetime
from sklearn.cluster import DBSCAN
import math

In [3]:
df=pd.read_csv('../datasets/df_cleaned_by_dbscan.csv')

The retrieval of embeddings for each of the tested models can be skipped: from this step onward, the embeddings are loaded from file using pickle. The original code is kept, commented out, for reproducibility.

In [4]:
#def umap_embeddings(vectors):
#    reducer = umap.UMAP(random_state=1)
#    umap_embeddings = reducer.fit_transform(vectors)
#    return(umap_embeddings)

In [6]:
#Retrieval of embeddings using pickle
all_mpnet_base_v2_embeddings = pickle.load(open('../020_Removing_Outliers/all-mpnet-base-v2_embeddings', 'rb'))
all_distilroberta_v1_embeddings = pickle.load(open('../020_Removing_Outliers/all-distilroberta-v1_embeddings', 'rb'))
all_MiniLM_L12_v2_embeddings = pickle.load(open('../020_Removing_Outliers/all-MiniLM-L12-v2_embeddings', 'rb'))
all_MiniLM_L6_v2_embeddings = pickle.load(open('../020_Removing_Outliers/all-MiniLM-L6-v2_embeddings','rb')) 
average_word_embeddings_glove_embeddings = pickle.load(open('../020_Removing_Outliers/average_word_embeddings_glove_embeddings','rb'))
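
The loads above leave closing the file handles to the garbage collector. A minimal sketch of an alternative is shown here, assuming two small helpers (`load_pickle` and `save_pickle` are hypothetical names, not part of the notebook) that wrap the calls in a context manager. The toy object and file name are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object, closing the file handle afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def save_pickle(obj, path):
    """Write one object to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


# Round trip on a toy object; the file name is made up for this example.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_embeddings"
    save_pickle([[0.1, 0.2], [0.3, 0.4]], toy_path)
    restored = load_pickle(toy_path)
```

With such a helper, each load in the cell above could be written as `load_pickle('<path>')` with the same path string.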

In [7]:
#only keep the embeddings that are in this dataset which is read after the cleaning by DBSCAN step
l_all_mpnet_base_v2_embeddings =[]
l_all_distilroberta_v1_embeddings =[]
l_all_MiniLM_L12_v2_embeddings =[]
l_all_MiniLM_L6_v2_embeddings =[]
l_average_word_embeddings_glove_embeddings =[]

for i, row in df.iterrows():
    l_all_mpnet_base_v2_embeddings.append(all_mpnet_base_v2_embeddings[row['full_set_row_id']])
    l_all_distilroberta_v1_embeddings.append(all_distilroberta_v1_embeddings[row['full_set_row_id']])
    l_all_MiniLM_L12_v2_embeddings.append(all_MiniLM_L12_v2_embeddings[row['full_set_row_id']])
    l_all_MiniLM_L6_v2_embeddings.append(all_MiniLM_L6_v2_embeddings[row['full_set_row_id']])
    l_average_word_embeddings_glove_embeddings.append(average_word_embeddings_glove_embeddings[row['full_set_row_id']])
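
If the pickled embeddings are NumPy arrays indexed by position (an assumption; the notebook does not show their type), the same selection could be done without a Python loop. The sketch below uses toy data, and `full_embeddings` and `kept_row_ids` are illustrative names only.

```python
import numpy as np

# Toy stand-ins: 5 "full set" embeddings of dimension 3,
# and the row ids that survived the cleaning step.
full_embeddings = np.arange(15, dtype=float).reshape(5, 3)
kept_row_ids = np.array([0, 2, 4])  # plays the role of df['full_set_row_id']

# Integer-array indexing picks the kept rows in order, like the append loop.
kept_embeddings = full_embeddings[kept_row_ids]
```

The result is a 2-D array rather than a list of vectors, which `KMeans.fit` accepts directly.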


## Dimensionality Reduction
In Notebook 020, we reduced dimensionality in order to enable DBSCAN. In this step, dimensionality is reduced in order to allow for visualization. We used UMAP for this task.

The computation of UMAP embeddings for each of the tested models can be skipped: from this step onward, the UMAP embeddings are loaded from file using pickle. The original code is kept, commented out, for reproducibility.

In [8]:
#Dimensionality reduction using umap
#umap_all_mpnet_base_v2_embeddings = umap_embeddings(l_all_mpnet_base_v2_embeddings)
#umap_all_distilroberta_v1_embeddings = umap_embeddings(l_all_distilroberta_v1_embeddings)
#umap_all_MiniLM_L12_v2_embeddings = umap_embeddings(l_all_MiniLM_L12_v2_embeddings)
#umap_all_MiniLM_L6_v2_embeddings = umap_embeddings(l_all_MiniLM_L6_v2_embeddings)
#umap_average_word_embeddings_glove_embeddings = umap_embeddings(l_average_word_embeddings_glove_embeddings)

In [9]:
#file_list = [umap_all_mpnet_base_v2_embeddings,umap_all_distilroberta_v1_embeddings,umap_all_MiniLM_L12_v2_embeddings,umap_all_MiniLM_L6_v2_embeddings,umap_average_word_embeddings_glove_embeddings]
#name_list = ['umap_all_mpnet_base_v2_embeddings','umap_all_distilroberta_v1_embeddings','umap_all_MiniLM_L12_v2_embeddings','umap_all_MiniLM_L6_v2_embeddings','umap_average_word_embeddings_glove_embeddings']
#for m in name_list:
#    filename = m
#    pickle.dump(file_list[name_list.index(m)], open(filename, 'wb'))

In [10]:
umap_all_mpnet_base_v2_embeddings = pickle.load(open('umap_all_mpnet_base_v2_embeddings', 'rb'))
umap_all_distilroberta_v1_embeddings = pickle.load(open('umap_all_distilroberta_v1_embeddings', 'rb'))
umap_all_MiniLM_L12_v2_embeddings = pickle.load(open('umap_all_MiniLM_L12_v2_embeddings', 'rb'))
umap_all_MiniLM_L6_v2_embeddings = pickle.load(open('umap_all_MiniLM_L6_v2_embeddings', 'rb'))
umap_average_word_embeddings_glove_embeddings = pickle.load(open('umap_average_word_embeddings_glove_embeddings', 'rb'))

In [11]:
#Creating columns in df to visualize results of umap
df['umap1_all_mpnet_base_v2']= [i[0] for i in umap_all_mpnet_base_v2_embeddings]
df['umap2_all_mpnet_base_v2']= [i[1] for i in umap_all_mpnet_base_v2_embeddings]

df['umap1_all_distilroberta_v1'] = [i[0] for i in umap_all_distilroberta_v1_embeddings]
df['umap2_all_distilroberta_v1'] = [i[1] for i in umap_all_distilroberta_v1_embeddings]

df['umap1_all_MiniLM_L12_v2'] = [i[0] for i in umap_all_MiniLM_L12_v2_embeddings]
df['umap2_all_MiniLM_L12_v2'] = [i[1] for i in umap_all_MiniLM_L12_v2_embeddings]

df['umap1_all_MiniLM_L6_v2'] = [i[0] for i in umap_all_MiniLM_L6_v2_embeddings]
df['umap2_all_MiniLM_L6_v2'] = [i[1] for i in umap_all_MiniLM_L6_v2_embeddings]

df['umap1_average_word_embeddings_glove'] = [i[0] for i in umap_average_word_embeddings_glove_embeddings]
df['umap2_average_word_embeddings_glove'] = [i[1] for i in umap_average_word_embeddings_glove_embeddings]

## Evaluation by Silhouette Score
For each model under evaluation, a series of K-means models was trained on the UMAP-reduced embeddings for K = 2, ..., 49. The silhouette score of each K-means model was computed in order to compare the embedding models and to choose a suitable number of clusters K.
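
The idea of such a sweep can be sketched on synthetic data before running it on the real embeddings. The example below is an illustration only: the four blob centres, the sample count and the range of K are arbitrary choices, and `n_init=10` is set explicitly rather than taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four well-separated groups
# (a stand-in for one model's UMAP embeddings).
points, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 8):  # K = 2, ..., 7
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels, metric="euclidean")

# The K with the highest average silhouette score.
best_k = max(scores, key=scores.get)
```

On data this cleanly separated the score peaks at the true number of groups. On real embeddings the curve is usually flatter, which is why the plot produced below is inspected rather than relying on a single maximum.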

In [13]:
from datetime import datetime
print(datetime.now())

model_embeddings = [umap_all_mpnet_base_v2_embeddings,umap_all_distilroberta_v1_embeddings,umap_all_MiniLM_L12_v2_embeddings,umap_all_MiniLM_L6_v2_embeddings,umap_average_word_embeddings_glove_embeddings]
name_list = ['all_mpnet_base_v2','all_distilroberta_v1','all_MiniLM_L12_v2','all_MiniLM_L6_v2','average_word_embeddings_glove']

fig = px.line()
fig.update_layout(
    title="Model Selection and Optimizing the Number of Clusters",
    xaxis_title="Number of clusters K",
    yaxis_title="Silhouette Score", 
)

# Iterate over model names and their UMAP embeddings in parallel
for m, embeddings in zip(name_list, model_embeddings):
    x=[]
    
    range_n_clusters = range(2,50)  # K = 2, ..., 49
    silhouette_avg_n_clusters = []

    for n_clusters in range_n_clusters:
        print("k = "+str(n_clusters)+" "+str(datetime.now()))
        x.append(n_clusters)
        kmeans = KMeans(init='k-means++',n_clusters=n_clusters, random_state=0).fit(embeddings)
        labels = kmeans.labels_
        # Mean silhouette coefficient over all samples for this K
        silhouette_avg = silhouette_score(embeddings, labels, metric = 'euclidean')
        silhouette_avg_n_clusters.append(silhouette_avg)
    fig.add_scatter(x=x,y=silhouette_avg_n_clusters,mode='lines', name=m)
    print('done with model '+str(datetime.now()))
fig.show()

2021-12-29 10:23:22.814048
k = 2 2021-12-29 10:23:22.877342
k = 3 2021-12-29 10:23:32.037715
k = 4 2021-12-29 10:23:40.473738
k = 5 2021-12-29 10:23:49.922279
k = 6 2021-12-29 10:24:00.218089
k = 7 2021-12-29 10:24:10.528779
k = 8 2021-12-29 10:24:20.187455
k = 9 2021-12-29 10:24:31.238277
k = 10 2021-12-29 10:24:41.653254
k = 11 2021-12-29 10:24:51.587068
k = 12 2021-12-29 10:25:02.793289
k = 13 2021-12-29 10:25:14.049232
k = 14 2021-12-29 10:25:24.749772
k = 15 2021-12-29 10:25:34.890061
k = 16 2021-12-29 10:25:44.648288
k = 17 2021-12-29 10:25:54.600423
k = 18 2021-12-29 10:26:04.148022
k = 19 2021-12-29 10:26:14.135252
k = 20 2021-12-29 10:26:24.063305
k = 21 2021-12-29 10:26:34.074504
k = 22 2021-12-29 10:26:43.905568
k = 23 2021-12-29 10:26:53.737693
k = 24 2021-12-29 10:27:03.588438
k = 25 2021-12-29 10:27:14.432567
k = 26 2021-12-29 10:27:26.032772
k = 27 2021-12-29 10:27:36.194402
k = 28 2021-12-29 10:27:46.423775
k = 29 2021-12-29 10:27:56.898163
k = 30 2021-12-29 10:28:07.54

k = 47 2021-12-29 11:06:19.973848
k = 48 2021-12-29 11:06:30.883258
k = 49 2021-12-29 11:06:41.750410
done with model 2021-12-29 11:06:53.683008


In [15]:
fig.update_layout(
    title="Model Selection and Optimizing the Number of Clusters",
    xaxis_title="Number of clusters K",
    yaxis_title="Silhouette Score", 
    )
fig.update_xaxes(range=[10, 50])
fig.write_image('silhouette_selection.png')
fig.show()

## Selection
As the figure above shows, for any K>28 the model "all_distilroberta_v1" achieves the highest silhouette score. By this metric, the optimal number of clusters is around K=38.

However, an analysis of how frequently each cluster is present in each job posting showed that a lower value of K is beneficial. The local optimum at K=31 was therefore selected as the most appropriate.
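
Local optima of a score curve can also be listed programmatically instead of being read off the plot. The sketch below is one simple way to do that; `local_maxima` is a hypothetical helper, and the scores are made-up numbers for illustration, not results from this notebook.

```python
def local_maxima(ks, scores):
    """Return (k, score) pairs whose score is strictly higher than both neighbours."""
    peaks = []
    for i in range(1, len(scores) - 1):
        if scores[i] > scores[i - 1] and scores[i] > scores[i + 1]:
            peaks.append((ks[i], scores[i]))
    return peaks


# Made-up scores for illustration only; they are not the notebook's results.
toy_ks = [2, 3, 4, 5, 6, 7, 8]
toy_scores = [0.40, 0.41, 0.40, 0.43, 0.42, 0.44, 0.45]
toy_peaks = local_maxima(toy_ks, toy_scores)
```

The first and last values are never reported because they have only one neighbour. Candidate values of K found this way would still need to be weighed against other criteria, such as the cluster frequency analysis mentioned above.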

The result of clustering the dataset of skill requirements is plotted below.

In [16]:
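# Final model: K-means with K=31, fitted on the full all_distilroberta_v1 embeddings
# (not on the 2-D UMAP projection used for the silhouette sweep above)
k_31_full = KMeans(n_clusters=31, random_state=0).fit(l_all_distilroberta_v1_embeddings)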

In [17]:
df['cluster_k31_full'] = k_31_full.labels_
df['cluster_k31_full_str'] = df['cluster_k31_full'].astype(str)

fig = px.scatter(df,'umap1_all_distilroberta_v1','umap2_all_distilroberta_v1', hover_data = ['requirement'], title = "UMAP Analysis of job requirements embeddings from all_distilroberta_v1", color = 'cluster_k31_full_str')
fig.update_traces(marker=dict(size=1.5))
fig.show()

## Housekeeping
The fitted model is pickled, and the results are written to a DataFrame that is saved to disk.

In [22]:
filename = 'k_31_full'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(k_31_full, f)

In [19]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
df_clean = df.iloc[:,:13].copy()

df_clean['cluster_k31_full']=k_31_full.labels_
df_clean['cluster_k31_full_str']= df['cluster_k31_full'].astype(str)

df_clean['embedding_umap']=umap_all_distilroberta_v1_embeddings.tolist()
df_clean['embedding_sbert']=l_all_distilroberta_v1_embeddings
df_clean.head()

Unnamed: 0.1,Unnamed: 0,requirement_raw,requirement_tokenized,Unnamed: 0.1.1,dt,url,title,location,country,full_text,list_elements,row_id,requirement,cluster_k31_full,cluster_k31_full_str,embedding_umap,embedding_sbert
0,0,Collaborate with stakeholders and other engine...,"['collaborate', 'with', 'stakeholders', 'and',...",0,2021-08-08 21:01:03.303868,https://se.indeed.com/viewjob?jk=b0669075c820856d,Data Scientist,Jobba hemifrån,se,"About us\n\nHere at Mavenoid, we are building ...",['Collaborate with stakeholders and other engi...,0,Collaborate with stakeholders and other engine...,8,8,"[8.437570571899414, 1.452495813369751]","[0.029299488, -0.02018766, -0.017798664, -0.01..."
1,1,\nApply the right tools for the job and solve ...,"['apply', 'the', 'right', 'tools', 'for', 'the...",0,2021-08-08 21:01:03.303868,https://se.indeed.com/viewjob?jk=b0669075c820856d,Data Scientist,Jobba hemifrån,se,"About us\n\nHere at Mavenoid, we are building ...",['Collaborate with stakeholders and other engi...,0,Apply the right tools for the job and solve bu...,29,29,"[9.287023544311523, 1.903634786605835]","[0.005183228, -0.027321953, 0.013979478, -0.02..."
2,2,\nDevelop knowledge representations and virtua...,"['develop', 'knowledge', 'representations', 'a...",0,2021-08-08 21:01:03.303868,https://se.indeed.com/viewjob?jk=b0669075c820856d,Data Scientist,Jobba hemifrån,se,"About us\n\nHere at Mavenoid, we are building ...",['Collaborate with stakeholders and other engi...,0,Develop knowledge representations and virtual-...,13,13,"[12.255410194396973, 0.6413768529891968]","[-0.02551106, -0.045136135, 0.028303996, -0.00..."
3,3,\nProvide model explanations and apply structu...,"['provide', 'model', 'explanations', 'and', 'a...",0,2021-08-08 21:01:03.303868,https://se.indeed.com/viewjob?jk=b0669075c820856d,Data Scientist,Jobba hemifrån,se,"About us\n\nHere at Mavenoid, we are building ...",['Collaborate with stakeholders and other engi...,0,Provide model explanations and apply structure...,21,21,"[8.778813362121582, 1.796252727508545]","[0.07653201, 0.03380716, 0.012670742, 0.053662..."
4,4,\nLearn desired behavior from examples using c...,"['learn', 'desired', 'behavior', 'from', 'exam...",0,2021-08-08 21:01:03.303868,https://se.indeed.com/viewjob?jk=b0669075c820856d,Data Scientist,Jobba hemifrån,se,"About us\n\nHere at Mavenoid, we are building ...",['Collaborate with stakeholders and other engi...,0,Learn desired behavior from examples using cau...,6,6,"[12.108744621276855, 0.36008894443511963]","[-0.012662642, -0.02441167, -0.01782834, 0.036..."


In [20]:
df_clean.to_csv('../datasets/df_k_31')
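
One thing to keep in mind when reading this file back: CSV stores list-valued cells as text. The sketch below, on a toy frame with made-up values, shows one way the `embedding_umap` column could be turned back into lists with `ast.literal_eval`. This is an illustration under that assumption, not the notebook's own loading code.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, mimicking 'embedding_umap'.
toy = pd.DataFrame({"row_id": [0, 1], "embedding_umap": [[8.4, 1.5], [9.3, 1.9]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, the column holds strings such as "[8.4, 1.5]".
restored = pd.read_csv(buffer)
restored["embedding_umap"] = restored["embedding_umap"].apply(ast.literal_eval)
```

If the entries of `embedding_sbert` are NumPy arrays, their text form has no commas and `ast.literal_eval` would not parse it. Converting them with `.tolist()` before saving, or storing that column with pickle, are possible workarounds.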