## Data

Job offers: I want to find words and offers similar to 'Machine Learning Engineer' and 'Data Scientist' using job offer titles.

In [1]:
import pandas as pd

from gensim.utils import simple_preprocess
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary

from gensim.models import LdaMulticore

from ast import literal_eval

import pyLDAvis
from pyLDAvis import gensim

pyLDAvis.enable_notebook()

In [78]:
df = pd.read_csv('data/job_ofer.csv')
df.head()

Unnamed: 0,title,company_name,address,description,seniority_level,employment_type,job_function,industries
0,Machine Learning Engineer,Intellipro Group Inc,"Palo Alto, CA, US","['About The Company', ""W*** is reshaping the f...",Entry level,Full-time,Engineering,Information Technology and Services
1,Deep Learning Applied Researcher - Chicago,Ethosia,"Chicago, IL, US","['תיאור המשרה', 'Deep learning for Computer Vi...",Associate,Full-time,Other,Information Technology and Services
2,Machine Learning Engineer,Motorola Solutions,"Chicago, IL, US","['Company Overview', 'At Motorola Solutions, w...",Entry level,Full-time,Engineering,Information Technology and Services
3,Machine Learning / Data Scientist,Proprius LLC,"San Francisco, CA, US",['Our client is a digital invention agency foc...,Entry level,Full-time,Engineering,Information Technology and Services
4,Cloud Architect,TCS,"Framingham, Massachusetts, United States","['Technical/Functional Skills', ' ', 'Good to ...",Mid-Senior level,Full-time,Engineering,Information Technology and Services


In [3]:
df.shape

(36109, 8)

In [4]:
df.description[0]

'[\'About The Company\', "W*** is reshaping the future of delivery. We are an on-demand drone delivery service that can deliver food, medicine or other items within minutes. We\'ve also developed an unmanned traffic management platform to safely route drones through the sky. Our service is faster, safer and produces far less pollution than traditional delivery.", \'About The Role\', \'As a Machine Learning Engineer you will help develop models to support the next generation of intelligence that backs our flight planning and navigation solutions. In this way, you will play an important part in our larger goal of building a state-of-the-art delivery system that safely flies thousands of autonomous aircraft every day around people, buildings, and terrain, with or without GPS, day or night, rain or shine. Developing machine-learning-based models with the real world application in mind is a difficult problem, but working at the intersection of R&D and production is a unique and exhilarating

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36109 entries, 0 to 36108
Data columns (total 8 columns):
title              36109 non-null object
company_name       33925 non-null object
address            36109 non-null object
description        36109 non-null object
seniority_level    36109 non-null object
employment_type    36109 non-null object
job_function       36099 non-null object
industries         36095 non-null object
dtypes: object(8)
memory usage: 2.2+ MB


In [6]:
df.isnull().sum()

title                 0
company_name       2184
address               0
description           0
seniority_level       0
employment_type       0
job_function         10
industries           14
dtype: int64

There are no missing values in titles.

## Word2Vec - finding most similar words (in titles).

Building a **corpus from the title** column, applying **simple preprocessing**, and **removing stopwords**.

In [7]:
stop = [ word for word in STOPWORDS if word not in ['it', 'full'] ]
# keep 'it' and 'full': titles such as 'it analyst' or 'full stack developer' may be useful

stop[:8]

['whereafter',
 'cant',
 'the',
 'together',
 'but',
 'bottom',
 'meanwhile',
 'still']

In [8]:
title_corpus = df['title'].map( simple_preprocess )

In [9]:
title_corpus[36086]

['it', 'analyst', 'web', 'and', 'mobile', 'developer']

In [10]:
title_corpus = title_corpus.apply(lambda words: [w for w in words if w not in stop])

In [11]:
title_corpus[36086]

['it', 'analyst', 'web', 'mobile', 'developer']
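
The two steps above (tokenising, then dropping stopwords) can be approximated without gensim. This is only a rough sketch: `approx_preprocess` is a hypothetical stand-in for `simple_preprocess`, which by default also lowercases and keeps tokens between 2 and 15 characters long, and `toy_stop` is a made-up two-word stopword list rather than gensim's `STOPWORDS`. The example title is also an assumption, chosen to reproduce the tokens shown above.

```python
import re

def approx_preprocess(text, min_len=2, max_len=15):
    # Lowercase, keep runs of word characters that contain no digits,
    # and drop tokens that are too short or too long.
    tokens = re.findall(r"(?:(?!\d)\w)+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Hypothetical stopword list; 'it' is deliberately left out, as in the notebook.
toy_stop = {"and", "the"}

tokens = approx_preprocess("IT Analyst - Web and Mobile Developer")
filtered = [t for t in tokens if t not in toy_stop]

print(tokens)
print(filtered)
```

Because of the minimum length of 2, a one-letter token such as the 'I' in 'Teller I' would be dropped by this kind of preprocessing.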

**Bigram**

In [12]:
title_bigram =  Phrases( title_corpus, min_count = 3, threshold = 3 ) 

In [13]:
list(title_corpus[:5])

[['machine', 'learning', 'engineer'],
 ['deep', 'learning', 'applied', 'researcher', 'chicago'],
 ['machine', 'learning', 'engineer'],
 ['machine', 'learning', 'data', 'scientist'],
 ['cloud', 'architect']]

In [14]:
list(title_bigram[title_corpus])[:5]

[['machine_learning', 'engineer'],
 ['deep_learning', 'applied', 'researcher', 'chicago'],
 ['machine_learning', 'engineer'],
 ['machine_learning', 'data_scientist'],
 ['cloud', 'architect']]
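
To see what `min_count = 3` and `threshold = 3` control, here is a minimal sketch of the scoring rule that, to my understanding, gensim's default scorer applies (the formula from Mikolov et al.). The function name `bigram_score` and all counts below are hypothetical; they are not taken from this dataset.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Pairs that co-occur often relative to the individual word counts
    # get a high score; min_count discounts rare pairs.
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 3

# Hypothetical counts for a frequent pair.
strong = bigram_score(count_a=50, count_b=40, count_ab=30, vocab_size=1000, min_count=3)
# Hypothetical counts for two common words that rarely appear together.
weak = bigram_score(count_a=500, count_b=400, count_ab=4, vocab_size=1000, min_count=3)

print(strong, strong > threshold)
print(weak, weak > threshold)
```

A pair is merged into a single token such as `machine_learning` only when its score exceeds the threshold, so a low threshold like 3 accepts many phrases.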

In [15]:
title_bigram.vocab.keys()

dict_keys([b'machine', b'learning', b'machine_learning', b'engineer', b'learning_engineer', b'deep', b'deep_learning', b'applied', b'learning_applied', b'researcher', b'applied_researcher', b'chicago', b'researcher_chicago', b'data', b'learning_data', b'scientist', b'data_scientist', b'cloud', b'architect', b'cloud_architect', b'store', b'room', b'store_room', b'clerk', b'room_clerk', b'director', b'product', b'director_product', b'recruiting', b'manager', b'recruiting_manager', b'ad', b'manager_ad', b'census', b'ad_census', b'ext', b'census_ext', b'gb', b'ext_gb', b'bilingual', b'bilingual_engineer', b'german', b'engineer_german', b'germany', b'german_germany', b'switzerland', b'germany_switzerland', b'sommelier', b'entry', b'level', b'entry_level', b'project', b'level_project', b'project_manager', b'shelton', b'manager_shelton', b'ct', b'shelton_ct', b'based', b'ct_based', b'finance', b'finance_manager', b'firestone', b'manager_firestone', b'industrial', b'firestone_industrial', b'pr

Counting the occurrences of two-word phrases (bigrams).

In [16]:
from collections import Counter

In [17]:
bigram_counter = Counter()
for key in title_bigram.vocab.keys():
    if len(key.decode('utf-8').split("_")) > 1:
        bigram_counter[key] += title_bigram.vocab[key]

for key, counts in bigram_counter.most_common(20):
    print ('{0}  {1}'.format(key.decode("utf-8"), counts))

relocate_china  3847
new_york  2722
san_francisco  2306
account_executive  818
china_relocate  668
beijing_relocate  593
editor_relocate  591
project_manager  561
copy_editor  528
manager_new  457
product_manager  434
software_engineer  421
business_development  418
data_scientist  376
senior_account  354
public_school  354
manager_san  350
shenzhen_relocate  347
account_manager  344
francisco_ca  325


'machine_learning' does not appear in the top 20, so it is probably not very common in job offer titles, but the 'data_scientist' phrase is in 14th place by number of occurrences.

Applying bigrams (two-word phrases) to the corpus.

In [18]:
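def prepare_corpus( corpus, bigram ):
    # Yield each sentence's bigram-merged tokens followed by its original tokens,
    # so both the phrases and the single words end up in the corpus.
    for sent in corpus:
        yield bigram[sent] + sent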
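
The effect of concatenating the bigram version with the original sentence can be sketched without gensim. Here `toy_bigram` is a hypothetical stand-in for the trained `Phraser`, and `extend_corpus` mirrors `prepare_corpus` but calls the stand-in as a function instead of indexing it; the single known phrase is an assumption.

```python
def toy_bigram(sent, phrases):
    # Merge adjacent words that form a known phrase, left to right.
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) in phrases:
            out.append(sent[i] + "_" + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

def extend_corpus(corpus, phrases):
    # Bigram-merged tokens first, then the original tokens.
    for sent in corpus:
        yield toy_bigram(sent, phrases) + sent

phrases = {("machine", "learning")}  # hypothetical learned phrase
corpus = [["machine", "learning", "engineer"], ["cloud", "architect"]]
extended = list(extend_corpus(corpus, phrases))

for sent in extended:
    print(sent)
```

Note that any word not absorbed into a phrase appears twice in the extended sentence, which affects the word counts used later.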

In [39]:
title_bigram_pr = Phraser( Phrases( title_corpus, min_count = 3, threshold = 3 ) )

In [40]:
title_ext_corp = list( prepare_corpus( title_corpus, title_bigram_pr ) )

title_model = Word2Vec( title_ext_corp, size = 100, window = 5, min_count = 3 )

Checking the **most similar** words based on titles, to compare them later with **doc2vec**.

In [41]:
title_model.wv.most_similar('data_scientist')

[('scientist', 0.974702000617981),
 ('medicinal', 0.947231650352478),
 ('ml', 0.9435145854949951),
 ('python_ml', 0.9412312507629395),
 ('healthcare_asset', 0.9365801215171814),
 ('geologist', 0.925195038318634),
 ('big', 0.9217954874038696),
 ('data', 0.920747697353363),
 ('applied_research', 0.9154780507087708),
 ('nlp', 0.9061712622642517)]

In [42]:
title_model.wv.most_similar('machine')

[('machine_learning', 0.965211033821106),
 ('processing_nlp', 0.9566091895103455),
 ('nlp', 0.9398211240768433),
 ('deep_learning', 0.932887613773346),
 ('vision', 0.9280160665512085),
 ('deep', 0.9211409687995911),
 ('natural', 0.9192565083503723),
 ('learning', 0.9164762496948242),
 ('natural_language', 0.908204972743988),
 ('relocate_boulder', 0.8986300230026245)]

In [43]:
title_model.wv.most_similar('machine_learning')

[('deep_learning', 0.9790380001068115),
 ('nlp', 0.9744572639465332),
 ('deep', 0.9693440198898315),
 ('relocate_boulder', 0.9666751623153687),
 ('machine', 0.965211033821106),
 ('processing_nlp', 0.9648038148880005),
 ('artificial', 0.9523654580116272),
 ('ml', 0.9504517316818237),
 ('vision', 0.9486271739006042),
 ('artificial_intelligence', 0.945408821105957)]

In [44]:
title_model.wv.most_similar('nlp')

[('deep_learning', 0.9893084168434143),
 ('deep', 0.9851727485656738),
 ('vision', 0.980521559715271),
 ('engineer_aws', 0.9747466444969177),
 ('machine_learning', 0.9744572639465332),
 ('processing_nlp', 0.9667194485664368),
 ('natural_language', 0.9665544629096985),
 ('big_data', 0.9649794697761536),
 ('artificial_intelligence', 0.962164580821991),
 ('artificial', 0.9607573747634888)]
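
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking, using made-up 3-dimensional vectors rather than the trained 100-dimensional ones (the helper names and the vectors are assumptions):

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, vectors, topn=2):
    # Rank every other word by cosine similarity to `word`.
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors.
vectors = {
    "machine_learning": [1.0, 0.9, 0.0],
    "deep_learning": [0.9, 1.0, 0.1],
    "sommelier": [0.0, 0.1, 1.0],
}

ranking = toy_most_similar("machine_learning", vectors)
print(ranking)
```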

### Most similar words based on Description - **Word2vec**

In [45]:
df.sample()['description'].values[0] 

"['Job Description', 'info: an academically rigorous private international school at primary school levels, we', 'implement a bilingual educational program, combining the essence of traditional Chinese', 'compulsory courses with the trans-disciplinary activities. Courses are taught in both English and', 'Chinese, with class sizes no larger than 20 students per class.', 'Requirements', 'Main Responsibilities：', 'Worktime: Monday - Friday, 8:00 - 17:00, one hour lunch break;', 'Lesson planning according to teaching materials, and students’ English level;', 'Promoting learning in a professional and comprehensive manner;', 'Creating an environment to promote English interaction among Chinese students in the', 'class;', ' Observing and evaluating each student’s behavior and educational needs, and providing', 'individualized instruction to each student as needed;', ' Collaborating with other teachers, parents and administration and participating in regular', 'meetings, seminars and trainings

In [46]:
for line in df.sample()['description'].map(literal_eval).values[0]:
    print(line)
    print("")

Laboratory in Northern New York State is seeking a full-time CMT / Certified Medical Technologist to add to its clinical/medical laboratory staff .

 COMPENSATION is $30.00 per hour to $32.00 per hour based on experience.

 CMT / Certified Medical Technologist JOB TYPE is Contingent to Permanent/Retained Employee, Full-time.

 WORK SCHEDULE for the CMT / Certified Medical Technologist s Monday through Friday, NIGHTS 11:00 PM to 7:00 am. No weekends. No holidays.

 LICENSE, EDUCATION & EXPERIENCE - Completion of a Bachelors Degree in MEDICAL TECHNOLOGY and possession of a NYS CMT / Certified Medical Technologist permit or license. I f less than one year of work experience ASCP is not required. However, after one year of work experience the MT must possess a ASCP certification along with a current NYS license

 JOB DESCRIPTION for the CMT / Certified Medical Technologist : (Other duties may be assigned):

 Provide accurate testing to include: Blood counts, CBCs, immunology. 40% hematolog
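
Each description is stored as the string form of a Python list, which is why `literal_eval` is needed to get the individual lines back. A small sketch with a made-up description string (the `raw` value is hypothetical, not a row from the dataset):

```python
from ast import literal_eval

# Hypothetical description in the same list-as-string format.
raw = "['About The Role', 'Build models.', ' ']"

lines = literal_eval(raw)  # safely parse the string into a real list
clean = [line.strip() for line in lines if line.strip()]  # drop blank entries

print(lines)
print(clean)
```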

In [47]:
descr_corpus = df['description'].map( simple_preprocess ) 

descr_bigram = Phraser( Phrases( descr_corpus, min_count = 3, threshold = 3 ) ) 

In [48]:
descr_ext_corp = list( prepare_corpus( descr_corpus, descr_bigram ) )
len(descr_ext_corp)

36109

In [49]:
descr_model = Word2Vec( descr_ext_corp, size = 100, window = 2, min_count = 1 ) 

**most similar**

In [50]:
descr_model.wv.most_similar('machine')

[('implanting', 0.5705114603042603),
 ('statistics_machine', 0.5571773648262024),
 ('ai_machine', 0.5489904880523682),
 ('deep_learning', 0.5392367839813232),
 ('deploy_machine', 0.5311635732650757),
 ('algorithms', 0.5246731638908386),
 ('machines', 0.5219459533691406),
 ('ml', 0.5208945274353027),
 ('processing_machine', 0.5182268619537354),
 ('applied_machine', 0.5119470953941345)]

In [51]:
descr_model.wv.most_similar('machine_learning')

[('deep_learning', 0.8579201698303223),
 ('computer_vision', 0.8143433928489685),
 ('big_data', 0.7978620529174805),
 ('predictive_analytics', 0.7926192283630371),
 ('ml', 0.7653876543045044),
 ('advanced_analytics', 0.7524453997612),
 ('artificial_intelligence', 0.7496665716171265),
 ('data_mining', 0.735003650188446),
 ('data_science', 0.733741044998169),
 ('nlp', 0.723162055015564)]

In [52]:
descr_model.wv.most_similar('pytorch') 

[('keras', 0.9433885812759399),
 ('scikit_learn', 0.9194432497024536),
 ('tensorflow', 0.9154489040374756),
 ('caffe', 0.9080365896224976),
 ('mxnet', 0.8996084332466125),
 ('keras_tensorflow', 0.8813401460647583),
 ('numpy', 0.8805082440376282),
 ('scipy', 0.8779493570327759),
 ('matplotlib', 0.8730191588401794),
 ('sklearn', 0.8708810806274414)]

These results look sensible. In title_model, by contrast, 'pytorch' was not even in the vocabulary.

## Topic modelling

Topic modelling on extended corpus made from titles.

In [53]:
dictionary = Dictionary(title_ext_corp)

First 10 pairs of a dict (as an example):

In [54]:
first10pairs = {k: dict(dictionary)[k] for k in list(dict(dictionary))[:10]}
first10pairs

{0: 'engineer',
 1: 'learning',
 2: 'machine',
 3: 'machine_learning',
 4: 'applied',
 5: 'chicago',
 6: 'deep',
 7: 'deep_learning',
 8: 'researcher',
 9: 'data'}

In [55]:
bow_corpus = [ dictionary.doc2bow(sent) for sent in title_ext_corp ]

In [56]:
bow_corpus[:10]

[[(0, 2), (1, 1), (2, 1), (3, 1)],
 [(1, 1), (4, 2), (5, 2), (6, 1), (7, 1), (8, 2)],
 [(0, 2), (1, 1), (2, 1), (3, 1)],
 [(1, 1), (2, 1), (3, 1), (9, 1), (10, 1), (11, 1)],
 [(12, 2), (13, 2)],
 [(9, 1), (10, 1), (11, 1)],
 [(14, 2), (15, 2), (16, 2)],
 [(17, 2), (18, 2)],
 [(19, 2), (20, 2), (21, 2), (22, 2), (23, 2), (24, 2)],
 [(0, 2), (25, 2), (26, 2), (27, 2), (28, 2)]]
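
The bag-of-words pairs above are `(token id, count)` tuples. The sketch below imitates that mapping in plain Python; `build_bow` is a hypothetical helper, and assigning ids to new tokens in alphabetical order per document is an assumption chosen because it reproduces the ids shown above.

```python
from collections import Counter

def build_bow(corpus):
    token2id = {}
    bows = []
    for doc in corpus:
        counts = Counter(doc)
        # Give unseen tokens the next free id, in alphabetical order.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        bows.append(sorted((token2id[t], c) for t, c in counts.items()))
    return token2id, bows

corpus = [
    ["machine_learning", "engineer", "machine", "learning", "engineer"],
    ["cloud", "architect", "cloud", "architect"],
]
token2id, bows = build_bow(corpus)

print(token2id)
print(bows)
```

Counts of 2 come from the extended corpus itself: words that were not merged into a bigram appear once in the bigram part and once in the original part of each sentence.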

In [57]:
lda_model = LdaMulticore(bow_corpus, id2word = dictionary, num_topics=100, passes = 20, workers = 20)

In [58]:
%time lda_vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)

Wall time: 1min 31s


In [59]:
pyLDAvis.display(lda_vis)

In [60]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.321*"executive" + 0.243*"account" + 0.067*"senior_account" + 0.054*"senior" + 0.044*"account_executive" + 0.043*"named" + 0.042*"named_account" + 0.019*"beach" + 0.012*"pittsburgh" + 0.012*"minolta"
Topic: 1 
Words: 0.120*"retail" + 0.083*"leader" + 0.077*"associate" + 0.065*"server" + 0.044*"nyc" + 0.043*"food" + 0.035*"time" + 0.031*"service" + 0.025*"analysis" + 0.020*"seasonal"
Topic: 2 
Words: 0.111*"full" + 0.059*"cloud" + 0.057*"full_time" + 0.055*"time" + 0.035*"stack" + 0.035*"full_stack" + 0.032*"stylist" + 0.024*"manager" + 0.021*"week" + 0.019*"provided"
Topic: 3 
Words: 0.075*"industrial" + 0.052*"defense" + 0.044*"english_editor" + 0.044*"english" + 0.037*"dept" + 0.035*"tools" + 0.030*"franchise" + 0.029*"starts" + 0.028*"texas" + 0.028*"country"
Topic: 4 
Words: 0.089*"engagement" + 0.081*"manager" + 0.059*"virtual" + 0.030*"canada" + 0.029*"water" + 0.021*"excellence" + 0.020*"curriculum" + 0.018*"deployment" + 0.018*"ne" + 0.016*"early"
Topic: 5 
Wo

## doc2vec

In [61]:
import numpy as np

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

**title_corpus was already prepared in the Word2Vec part** (simple preprocessing plus stopword removal).

In [62]:
title_corpus[:6]

0                     [machine, learning, engineer]
1    [deep, learning, applied, researcher, chicago]
2                     [machine, learning, engineer]
3              [machine, learning, data, scientist]
4                                [cloud, architect]
5                                 [data, scientist]
Name: title, dtype: object

In [63]:
len(title_corpus)

36109

Assigning an id to each word list using a TaggedDocument.

In [64]:
title_tagged = [TaggedDocument(words = sent, tags = [i]) for i, sent in enumerate(title_corpus)]

In [65]:
title_tagged[:5]

[TaggedDocument(words=['machine', 'learning', 'engineer'], tags=[0]),
 TaggedDocument(words=['deep', 'learning', 'applied', 'researcher', 'chicago'], tags=[1]),
 TaggedDocument(words=['machine', 'learning', 'engineer'], tags=[2]),
 TaggedDocument(words=['machine', 'learning', 'data', 'scientist'], tags=[3]),
 TaggedDocument(words=['cloud', 'architect'], tags=[4])]

In [66]:
title_model = Doc2Vec(vector_size = 300, window = 5, min_count = 1)  # the context-size parameter is 'window'
title_model.build_vocab(title_tagged)

In [67]:
title_model.train(title_tagged, total_examples=title_model.corpus_count, epochs=10)

In [68]:
m_sim = title_model.docvecs.most_similar(0)
m_sim

[(364, 0.9512962102890015),
 (30335, 0.9454559087753296),
 (9748, 0.9426690340042114),
 (35355, 0.9420065879821777),
 (31783, 0.9405244588851929),
 (15791, 0.9403284788131714),
 (20276, 0.9351061582565308),
 (3553, 0.9350285530090332),
 (20618, 0.9350106716156006),
 (31727, 0.9345158934593201)]

In [69]:
idx_similar = np.array(m_sim)[:,0] # document ids (first column) as an array
idx_similar

array([  364., 30335.,  9748., 35355., 31783., 15791., 20276.,  3553.,
       20618., 31727.])
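
The ids come back as floats because `np.array` gives the mixed `(id, score)` pairs a single float dtype. One possible alternative, sketched here in plain Python, is to unpack the tuples so the ids stay integers. The `titles` dict is a hypothetical stand-in for `df['title']`, filled with the first three matches from this notebook.

```python
# First three (id, similarity) pairs from the output above.
m_sim = [
    (364, 0.9512962102890015),
    (30335, 0.9454559087753296),
    (9748, 0.9426690340042114),
]

# Hypothetical stand-in for df['title'], keyed by row index.
titles = {
    364: "Machine Learning Engineer, Pure1",
    30335: "Teller I",
    9748: "Host/Hostess",
}

similar_ids = [doc_id for doc_id, _ in m_sim]  # ids stay integers
similar_titles = [titles[doc_id] for doc_id in similar_ids]

print(similar_ids)
print(similar_titles)
```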

**Finding the most similar titles without bigrams**.

I'm looking for the titles most similar to 'Machine Learning Engineer'.

In [70]:
df[ df.index == 0 ].title

0    Machine Learning Engineer
Name: title, dtype: object

In [71]:
df.loc[idx_similar, 'title'].values 

array(['Machine Learning Engineer, Pure1', 'Teller I', 'Host/Hostess',
       'Invasive Cardiology Physician',
       'Vertriebsmitarbeiter für Heizungs-, Lüftungs- und Klimatechnik (m/w/d)',
       'Massage Therapist',
       'Japanese, Croatian or German Speaking Customer Service Executives Amsterdam, The Netherlands - Up to €25k plus excellent bonus + benefits (PTR 2315)',
       'Business Development, Chicago - Fever',
       'Outpatient Primary Care opportunity available a short distance from Buffalo and Toronto',
       'Male Locker Room Associates - Highland Park'], dtype=object)

The model does not perform well without bigram phrases: only the first result ('Machine Learning Engineer, Pure1') is related to the query title.

**Using the extended title corpus (with bigrams)**

The title_ext_corp was prepared in the word2vec part.

In [72]:
title_ext_corp[:5]

[['machine_learning', 'engineer', 'machine', 'learning', 'engineer'],
 ['deep_learning',
  'applied',
  'researcher',
  'chicago',
  'deep',
  'learning',
  'applied',
  'researcher',
  'chicago'],
 ['machine_learning', 'engineer', 'machine', 'learning', 'engineer'],
 ['machine_learning',
  'data_scientist',
  'machine',
  'learning',
  'data',
  'scientist'],
 ['cloud', 'architect', 'cloud', 'architect']]

In [73]:
title_tagged_big = [TaggedDocument(words = sent, tags = [i]) for i, sent in enumerate(title_ext_corp)]

In [74]:
title_model_big = Doc2Vec(vector_size = 300, window = 5, min_count = 1)  # the context-size parameter is 'window'
title_model_big.build_vocab(title_tagged_big)

In [75]:
title_model_big.train(title_tagged_big, total_examples=title_model_big.corpus_count, epochs=10)

In [76]:
df[ df.index == 0 ].title

0    Machine Learning Engineer
Name: title, dtype: object

In [77]:
m_sim = title_model_big.docvecs.most_similar(0)
idx_similar_big = np.array(m_sim)[:,0] 
df.loc[idx_similar_big, 'title'].values 

array(['Package and PCB Layout Engineer (1693-167)',
       'Package and PCB Layout Engineer',
       'Kibana - Visualisations Engineer', 'CNC Setup Engineer',
       'Kibana - Visualisations Engineer', 'DevSecOps Engineer',
       'Civil Engineer MUST BE A US CITIZEN', 'Deep Learning Engineer',
       'Observability - Integrations Engineer (Go)',
       'Kibana - Visualisations Engineer'], dtype=object)

### Better (:

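The model performed noticeably better on title_ext_corp (made from title_corpus + title_bigram): every returned title is an engineering role, although only 'Deep Learning Engineer' is closely related to machine learning.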