## Model Evaluation
Unsupervised learning approaches such as clustering are difficult to evaluate because there are no ground-truth labels. To estimate the accuracy of the modelling pipeline trained in ../Model_Selection/, this notebook compares its cluster assignments against manual labels.

In [1]:
import pandas as pd
pd.options.plotting.backend = "plotly"  # interactive plots are useful in this context
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import gensim
from sentence_transformers import SentenceTransformer, util
import pickle
import torch
from sklearn.cluster import KMeans



## Sampling of the CV data
From a large dataset of CVs, a set of 5805 skills that appear on Data Scientist CVs was extracted in ../data_acquisition/. In the steps below, a sample of 100 skills is randomly selected from this dataset. The modelling pipeline described in ../Model_Selection/ is read into memory and used to infer clusters on the sample.

In [2]:
df = pd.read_csv('../datasets/df_cv.csv')
# Use a context manager so the file handle is closed after loading
with open('../Model_Selection/k_31_full', 'rb') as f:
    k31_full = pickle.load(f)

In [3]:
df_sample = df.sample(n=100, random_state=1)

In [4]:
df_sample.to_csv('sample.csv')

In [8]:
pd.options.display.max_rows = 100
df_sample

Unnamed: 0.1,Unnamed: 0,id,title,skill,cluster,score,embedding
3969,3969,43,Data Scientist,strong skills in statistical methodologies su...,12,0.709697,"[0.013033631257712841, -0.0799548551440239, -0..."
3300,3300,37,Data Scientist,"extensive experience in development of t-sql,...",1,0.706087,"[0.055112168192863464, -0.00916033424437046, -..."
4384,4384,45,Sr. Data Scientist/Machine Learning Engineer,worked on resulting reports of the application.,23,0.463415,"[0.03124876320362091, 0.017939738929271698, -0..."
3308,3308,37,Data Scientist,"proficient in system analysis, er/dimensional...",30,0.664705,"[0.039068177342414856, -0.04305744916200638, -..."
2890,2890,32,Senior Data Scientist,created a wholesale portal with easy checkout...,25,0.519363,"[0.000201515867956914, 0.003409451339393854, 0..."
3830,3830,40,Data Scientist/Machine Learning Engineer,"programming languages: python, sql, r, matlab...",18,0.719582,"[0.030494963750243187, -0.030020488426089287, ..."
545,545,12,Data Scientist,responsible for debugging and troubleshooting...,22,0.45646,"[0.02476958930492401, -0.012480653822422028, -..."
1812,1812,23,Data Scientist,experience in big data technologies like spar...,8,0.809228,"[0.02623741142451763, -0.015738269314169884, -..."
2596,2596,28,Senior Data Scientist,- built an nlp system to automatically assess...,9,0.233252,"[-0.02712273970246315, -0.019583553075790405, ..."
959,959,16,Data Scientist,experienced in ensemble learning usingbagging...,6,0.465559,"[0.03367180749773979, -0.03725603222846985, -0..."


## Labeling
After acquiring a random sample of 50 skills from the CV data, the cluster assigned to each skill was labeled manually: 0 if the assignment was judged false and 1 if it was judged true. This was done outside of the coding environment and the result was stored in the CSV file 'sample_labeled.csv'. After labeling, the file is read back into memory to compute basic statistics.
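The notebook does not show how the labeling sheet was prepared. The following is a minimal sketch of one possible way to do it, assuming that only the `id`, `skill` and `cluster` columns are needed by the labeler. The helper `to_labeling_sheet` is hypothetical, and the stand-in `df_sample` contains three rows copied from the sample output above (truncated skill text is kept as displayed).

```python
import io

import pandas as pd

# Stand-in for df_sample: three rows copied from the displayed sample.
df_sample = pd.DataFrame(
    {
        "id": [43, 37, 45],
        "skill": [
            "strong skills in statistical methodologies su...",
            "extensive experience in development of t-sql,...",
            "worked on resulting reports of the application.",
        ],
        "cluster": [12, 1, 23],
    },
    index=[3969, 3300, 4384],
)


def to_labeling_sheet(sample: pd.DataFrame) -> pd.DataFrame:
    """Keep only what the labeler needs and add an empty Label column.

    Hypothetical helper, not part of the original notebook.
    """
    sheet = sample[["id", "skill", "cluster"]].reset_index(drop=True)
    sheet["Label"] = pd.NA  # to be filled in by hand with 0 or 1
    return sheet


sheet = to_labeling_sheet(df_sample)

# Written to an in-memory buffer here; in practice this would be a file path.
buffer = io.StringIO()
sheet.to_csv(buffer, sep=";")
print(buffer.getvalue())
```

Writing the index along with the data is what produces an `Unnamed: 0` column when the semicolon-separated file is read back with `pd.read_csv(..., sep=';')`.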

In [19]:
df_labeled_CV = pd.read_csv('sample_labeled.csv', sep=';')

In [22]:
df_labeled_CV.head()

Unnamed: 0,id,skill,cluster,Label
0,43,strong skills in statistical methodologies su...,12,0
1,37,"extensive experience in development of t-sql,...",1,1
2,45,worked on resulting reports of the application.,23,1
3,37,"proficient in system analysis, er/dimensional...",30,0
4,32,created a wholesale portal with easy checkout...,25,1


In [23]:
df_labeled_CV['Label'].mean()

0.88
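The mean of the `Label` column is the share of cluster assignments judged correct, here 0.88. With a small labeled sample this estimate carries noticeable uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for a proportion. The function `wilson_interval` is a hypothetical helper, and the counts (44 correct out of 50) are an assumption derived from the sample size stated in the text and the reported mean.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumed counts: 0.88 * 50 labeled skills = 44 judged correct.
low, high = wilson_interval(successes=44, n=50)
print(f"accuracy = {44 / 50:.2f}, 95% interval = ({low:.3f}, {high:.3f})")
```

Under these assumptions the interval spans roughly 0.76 to 0.94, so the point estimate should be read as indicative rather than precise.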

## Labeling Learning Objectives
A similar effort was undertaken for a set of learning objectives, extracted from the program MSc Data Scientist (60 ECTS) of IU International University of Applied Sciences. These learning objectives were assigned to clusters by the same modelling pipeline. Because the dataset contains only 39 relevant learning objectives, the full dataset was evaluated with the manual labeling approach rather than a sample.

In [26]:
df_curriculum_labeled = pd.read_csv('df_handbook_DS_60_labeled.csv', sep=';')

In [27]:
df_curriculum_labeled.head()

Unnamed: 0.1,Unnamed: 0,objective,cluster,Label
0,0,understand the fundamental building blocks of ...,12,0
1,1,analyze stochastic data in terms of the underl...,11,1
2,2,utilize Bayesian statistics techniques.,12,0
3,3,summarize the properties of observed data usin...,11,1
4,4,apply data visualization techniques to design ...,14,1


In [28]:
df_curriculum_labeled['Label'].mean()

0.7692307692307693
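The overall mean (about 0.77, which corresponds to 30 of 39 objectives) does not show which clusters the errors come from. A per-cluster breakdown is one way to look into this. The sketch below is illustrative only: it uses just the five rows visible in the `head()` output above as a stand-in, so the numbers it produces say nothing about the full dataset. On the real data, the same `groupby` could be applied to `df_curriculum_labeled`.

```python
import pandas as pd

# Stand-in data: cluster and Label values from the five rows shown by head().
df_head = pd.DataFrame(
    {
        "cluster": [12, 11, 12, 11, 14],
        "Label": [0, 1, 0, 1, 1],
    }
)

# Number of labeled items and share judged correct, per cluster.
per_cluster = (
    df_head.groupby("cluster")["Label"]
    .agg(n="count", accuracy="mean")
    .sort_values("accuracy")
)
print(per_cluster)
```

Clusters with low accuracy and a reasonable number of items are candidates for closer inspection. Clusters with only one or two labeled items give too little evidence either way.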