<a href="https://colab.research.google.com/github/giangntgg/CourseProject/blob/main/03_Try_K_Means_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import spacy
import pandas as pd
import re
from joblib import dump, load

In [4]:
# Enter one or more job descriptions to classify
sentences = ['Work on complex and extremely varied data sets from some of the world’s largest organisations to solve real world problems Develop data science products and solutions for clients as well as for our data science team Write highly optimized code to advance our internal Data Science Toolbox Work in a multi-disciplinary environment with specialists in machine learning, engineering and design Focus on modelling by working alongside the Data Engineering team Add real-world impact to your academic expertise, as you are encouraged to write papers and present at meetings and conferences should you wish Take part in R&D (video: R&D at QuantumBlack); attend conferences such as NIPS and ICML as well as data science retrospectives where you will have the opportunity to share and learn from your co-workers Work in one of the most advanced data science teams globally']

In [8]:
def clean_text(texts):
  # replace line breaks with spaces so adjacent words are not fused together
  texts = [i.replace('\n', ' ').replace('\r', ' ') for i in texts]
  print('Remove entering...')

  # remove URLs
  texts = [re.sub(r'http\S+', '', i) for i in texts]
  print('Remove HTTPS...')

  # convert text to lowercase
  texts = [i.lower() for i in texts]
  print('Lowercasing...')

  # remove digits (str.replace would treat '[0-9]' as literal text, so use re.sub)
  texts = [re.sub(r'[0-9]', ' ', i) for i in texts]
  print('Remove numbers...')

  # collapse repeated whitespace into single spaces
  texts = [' '.join(i.split()) for i in texts]
  print('Remove whitespaces...')

  # remove empty tokens
  texts = [' '.join(tok for tok in i.split(' ') if tok) for i in texts]
  print('Remove empty tokens...')

  return texts

# load spaCy's English model; the 'en' shortcut was removed in spaCy 3,
# so the full package name is used here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# function to lemmatize text
def lemmatization(texts):
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output

sentences = clean_text(sentences)
sentences = lemmatization(sentences)

Remove entering...
Remove HTTPS...
Lowercasing...
Remove numbers...
Remove whitespaces...
Remove empty tokens...
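A note on the digit-removal step: `str.replace` matches its argument literally, so a pattern such as `[0-9]` only takes effect through `re.sub`. The sketch below shows the difference on a made-up sample string (`sample` is an illustrative value, not data from this notebook).

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "hire 3 engineers in 2021"

# str.replace looks for the literal characters "[0-9]", which do not occur
# in the sample, so the text comes back unchanged.
literal = sample.replace("[0-9]", " ")

# re.sub interprets "[0-9]" as a character class and replaces every digit.
pattern = re.sub(r"[0-9]", " ", sample)

# Collapse the runs of spaces left behind, as the whitespace step does.
normalized = " ".join(pattern.split())

print(literal)     # hire 3 engineers in 2021
print(normalized)  # hire engineers in
```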


In [9]:
tfidf = load('TFIDF_Vectorizer.joblib')
tfidf_docs = tfidf.transform(sentences)

In [10]:
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
bert_docs = embedder.encode(sentences)
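The `mean-tokens` suffix in the model name suggests that the sentence vector is the average of the token vectors, with padding positions excluded. A minimal pure-Python sketch of that pooling idea follows; `mean_pool` and the toy values are hypothetical and are not the library's internal code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    if count == 0:
        raise ValueError("no unmasked tokens to pool")
    return [t / count for t in total]

# Toy input: three 2-dimensional token vectors; the last one is padding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```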

Load the pre-trained models

In [16]:
tfidf_model = load('TFIDF_JobClustering (2).joblib')
bert_model = load('BERT_JobClustering.joblib')

Get the cluster result

In [22]:
bert_label = bert_model.predict(bert_docs)
tfidf_label = tfidf_model.predict(tfidf_docs)
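Assuming the two loaded objects are fitted scikit-learn `KMeans` models (the notebook imports `KMeans`, but the saved files themselves are not shown), `predict` assigns each document vector to its closest cluster centre. The sketch below illustrates that assignment rule in plain Python; `nearest_centroid` and the toy vectors are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to the vector."""
    best_index = 0
    best_dist = squared_distance(vector, centroids[0])
    for idx in range(1, len(centroids)):
        dist = squared_distance(vector, centroids[idx])
        if dist < best_dist:
            best_index, best_dist = idx, dist
    return best_index

# Toy cluster centres and document vectors, for illustration only.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_docs = [[1.0, 2.0], [9.0, 8.0]]

labels = [nearest_centroid(doc, toy_centroids) for doc in toy_docs]
print(labels)  # [0, 1]
```

The returned labels are cluster indices, which is why the next step can use them directly as keys into the skill tables.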

Get the skills that are relevant

In [28]:
bert_skills = pd.read_csv('BERT Skill Group.csv')
bert_skills = bert_skills.fillna('NA')
tfidf_skills = pd.read_csv('TFIDF Skill Group.csv')
tfidf_skills = tfidf_skills.fillna('NA')


# report the skill group matched to each description's BERT cluster
for i, cluster in enumerate(bert_label):
  row = bert_skills[bert_skills['BERT Cluster'] == cluster].iloc[0]
  print('---- BERT-based Skills Extractor----')
  print(f'Trending skills for job description {i}: ' + row['jobSkill_y'])
  print(f'Relevant industry for job description {i}: ' + row['jobSkill_x'])

# report the skill group matched to each description's TF-IDF cluster
for i, cluster in enumerate(tfidf_label):
  row = tfidf_skills[tfidf_skills['TFIDF Cluster'] == cluster].iloc[0]
  print('---- TFIDF-based Skills Extractor----')
  print(f'Trending skills for job description {i}: ' + row['jobSkill_y'])
  print(f'Relevant industry for job description {i}: ' + row['jobSkill_x'])

---- BERT-based Skills Extractor----
Trending skills for job description 0: NA
Relevant industry for job description 0: Engineering, Information Technology,Information Technology,Analyst, Information Technology, Engineering,Quality Assurance, Engineering, Information Technology,Information Technology, Product Management, Engineering,Information Technology, Consulting, Engineering,Engineering,Other,Information Technology, Engineering,Other, Information Technology, Management,Human Resources,Finance, Sales,Management, Manufacturing,Marketing, Sales,Engineering, Information Technology, Research,Engineering, Manufacturing, Design
---- TFIDF-based Skills Extractor----
Trending skills for job description 0: Data mining software
Relevant industry for job description 0: Information Technology, Product Management, Engineering,Engineering, Information Technology,Information Technology,Information Technology, Business Development,Finance, Information Technology,Consulting,Finance, Sales,Advertisi