# Topic Modeling

In [1]:
import pandas as pd
from gensim import corpora, models, similarities
from pprint import pprint
import numpy as np

In [2]:
data = pd.read_csv('data_science_jobs_USA.csv')

In [3]:
data.head(3)

Unnamed: 0,company,date,job_description,job_title,job_url,location,salary
0,Workplace Alaska,10 days ago,This individual position is EXEMPT from the hi...,Research Analyst III,http://www.indeed.com/rc/clk?jk=42399517a00f67...,"Juneau, AK","$5,017 a month"
1,Lili‘uokalani Trust,30+ days ago,"Job Title: Manager, Data Science Reports to: D...","Manager, Data Science",http://www.indeed.com/rc/clk?jk=bd079f6b150eb0...,"Honolulu, HI",
2,Hawaii Medical Service Association,30+ days ago,Data Management: Reviews data sources and prep...,Advanced Data Analyst I - Jr. Data Scientist,http://www.indeed.com/rc/clk?jk=ebea4074e03761...,"Honolulu, HI 96814 (Makiki area)",


In [4]:
import warnings
warnings.filterwarnings('ignore')
import re

# There are some cases in the data with no job descriptions or job descriptions that include 
# strings with no spaces like 'LearningHadoopRPythonData'. We have to clean it up:

data = data[data.job_description.notnull()]

data['job_description_clean'] = [" ".join([ word if len(word)<11 else re.sub(r'([A-Z])', r' \1', word)  
                                                 for word in document.split()]) for document in data['job_description']]

data.iloc[-3:, [2, -1]]


Unnamed: 0,job_description,job_description_clean
26627,Apergy Corporation is seeking a Data Scientist...,Apergy Corporation is seeking a Data Scientis...
26628,Desired: AccountingMicrosoft OfficeClayton Ser...,Desired: Accounting Microsoft Office Clayton...
26631,Job SummaryExternal Role / Title: Team Lead – ...,Job Summary External Role / Title: Team Lead ...
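
The cleaning rule above can be tried on single strings. This is a minimal sketch, assuming a hypothetical helper name `split_glued_words` that is not part of the notebook: words of 11 or more characters get a space inserted before every capital letter. It also shows a side effect that is visible in later outputs (for example `O P P O R T U N I T Y`): long all-caps words are split into single letters.

```python
import re

def split_glued_words(text, max_len=10):
    """Insert a space before each capital letter in words longer than max_len."""
    return " ".join(
        word if len(word) <= max_len else re.sub(r"([A-Z])", r" \1", word)
        for word in text.split()
    )

# A glued string like the one mentioned in the comment above
print(split_glued_words("LearningHadoopRPythonData").split())

# Side effect: a long all-caps word is broken into letters
print(split_glued_words("OPPORTUNITY").split())
```

A stricter pattern (for example, splitting only where a lowercase letter is followed by a capital) would avoid the all-caps side effect, at the cost of missing boundaries such as `RPython`.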


In [5]:
documents = data.job_description_clean

### Plotting documents in 2D using SVD (sklearn)

In [6]:
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet, HoverTool

output_notebook()

#### Plotting documents

In [7]:
#documents_s = data_clean.sample(n = 500)
#documents_s.to_csv('sample.csv', encoding='utf-8', index=False)

sample = pd.read_csv('sample.csv')

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, max_df=0.5, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(sample.job_description)


svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)

df = pd.DataFrame(columns=['x', 'y', 'title', 'company'])
df['x'], df['y'], df['title'], df['company'] = documents_2d[:,0], documents_2d[:,1], sample.iloc[:, 3].values, sample.iloc[:, 0].values

source = ColumnDataSource(ColumnDataSource.from_df(df))

labels = LabelSet(x="x", y="y", text="title", y_offset=8,
                  text_font_size="6pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=800, plot_height=800)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8, color = 'lightskyblue')

plot.add_tools(HoverTool(
    tooltips=[
        ( 'Job Title',   '@title'  ),
        ( 'Company',   '@company'  )
    ]
))

#plot.add_layout(labels)
show(plot, notebook_handle=True)

<br /> 
Judging by the visualization of the data sample, there seems to be a trend in our data: research analyst and statistician <br>
jobs tend to be plotted in the upper right, while machine learning engineer jobs cluster in the bottom left corner. <br>
<br /> 
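
To make the 2D projection more concrete, here is a small sketch on an invented 4-document term-count matrix, using plain NumPy instead of scikit-learn. The matrix and its labels are made up for illustration. The document coordinates are the first two columns of `U * S`, which is the same kind of projection `TruncatedSVD` produces (up to sign and solver differences).

```python
import numpy as np

# Toy document-term counts (rows: documents, columns: terms); invented for illustration.
terms = ["research", "statistics", "tensorflow", "python"]
X = np.array([
    [3, 2, 0, 1],   # analyst-style ad
    [2, 3, 0, 0],   # statistician-style ad
    [0, 0, 3, 2],   # ML-engineer-style ad
    [0, 1, 2, 3],   # ML-engineer-style ad
], dtype=float)

# Thin SVD: X = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
docs_2d = U[:, :k] * S[:k]                    # document coordinates in the top-k space
share = (S[:k] ** 2).sum() / (S ** 2).sum()   # fraction of the squared norm that is kept

print(docs_2d.round(2))
print(f"share of squared norm kept by {k} components: {share:.2f}")
```

Checking a figure like `share` (or `svd.explained_variance_ratio_` in scikit-learn) is one way to judge how much of the structure a 2D plot can actually show.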

#### Plotting words

In [8]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=800, plot_height=800)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8, color="cornflowerblue")
plot.add_layout(labels)
show(plot, notebook_handle=True)

<br /> 
There are a lot of irrelevant words that make it difficult to see the whole picture. <br> 
Let's filter our documents to only see the words that we are interested in.<br /> 
<br /> 

In [9]:
list_of_required_words = (['analytics', 'research', 'machine', 'learning', 'mining', 'software', 'algorithms', 
                           'algorithm', 'cloud', 'deep', 'analysis', 'statistics', 'statistical', 'python', 
                           'reports', 'reporting', 'hadoop', 'scala', 'tensorflow', 'distributed', 'systems', 
                           'aws', 'keras', 'pandas', 'vision', 'c++', 'phd', 'spark', 'tableau', 'pytorch', 
                           'sql', 'quantitative', 'engineer', 'engineering', 'programming', 'java', 'etl', 
                           'big', 'natural', 'nlp', 'language', 'processing', 'software', 'nltk', 'opencv', 
                           'computing', 'development', 'computer', 'visualization', 'software', 'octave', 
                           'julia', 'matlab', 'database', 'azure', 'artificial', 'intelligence', 'cassandra',
                           'nosql', 'hive', 'apache', 'kafka', 'git', 'linux'] )

df_filtered = df[df['word'].isin(list_of_required_words)]
df_filtered

source = ColumnDataSource(ColumnDataSource.from_df(df_filtered))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="7pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=800, plot_height=800)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8, color="plum")
plot.add_layout(labels)
show(plot, notebook_handle=True)

<br /> 
We didn't include job titles in the SVD model; it is based only on job descriptions. <br>
However, we would expect both visualizations to show a similar tendency. And indeed, research- and statistics-related words are close to each other on the plot.<br>
Moreover, when we zoom in, we also see that words like 'tensorflow', 'keras' and 'pytorch' are plotted together, as well as 'engineering' and 'programming': <br>
<img src="img/bokeh_plot.png" width=500 height=400>

### Data preprocessing (gensim)

In [10]:
from gensim.parsing.preprocessing import remove_stopwords
from nltk.corpus import stopwords
import re

# Removing stopwords
# Removing words that are less than 3 letters long
# Removing numbers and punctuation
# Tokenizing the documents

stoplist = set('for and'.split())
texts = [[word for word in remove_stopwords(re.sub(r"[,.;@#?!&$/()*_'’:]+|[0-9]+", " ", document)).lower().split() 
          if word not in stoplist
          and len(word)>2] 
         for document in documents]

# Removing words that appear only once

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

pprint(texts[0][:10])

['this',
 'individual',
 'position',
 'exempt',
 'hiring',
 'restrictions',
 'qualified',
 'applicants',
 'encouraged',
 'apply']


In [11]:
# Building a dictionary - linking words to numeric ids

#dictionary = corpora.Dictionary(texts)
#dictionary.save("dictionary") # Saving the dictionary to get reproducible results

dictionary = corpora.Dictionary.load("dictionary")
print(dictionary)

Dictionary(38778 unique tokens: ['abilities', 'ability', 'able', 'academic', 'accommodation']...)


In [12]:
# Building a corpus - transforming the collection of texts to a numerical form

corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus[0][:10])

[(0, 3), (1, 2), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
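
Each pair in the output above is `(token_id, count)`. The sketch below imitates that representation in plain Python on an invented tokenized document. The names `token2id` and `to_bow` are hypothetical, and this is not gensim's implementation, only the same idea: count the known tokens, drop the unknown ones, and sort by id.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer ids (the role a dictionary plays).
token2id = {"data": 0, "experience": 1, "python": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

doc = ["data", "python", "data", "guru", "experience", "data"]
print(to_bow(doc, token2id))  # [(0, 3), (1, 1), (2, 1)]
```

Note that word order is lost in this representation, which is why the bigram variants later in the notebook merge frequent word pairs into single tokens before building the corpus.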


### Building models (gensim)

#### LSI

In [13]:
nt= 7

In [14]:
lsi_model = models.LsiModel(corpus=corpus, num_topics=nt, id2word=dictionary)

In [15]:
# Printing out top words

from IPython.display import HTML

def print_top_words(model, num_topics, num_words):
    x = model.show_topics(num_topics, num_words, formatted=False)
    topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

    frame = []
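    # Note: this loop uses the global `nt`, so at most `nt` topics are shown even if num_topics is larger
    for i in range(nt):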
        for f in range(len(topics_words[i][1])):
            word = topics_words[i][1][f]
            frame.append({'Word': word, 'Topic': i+1, 'Word rank':f+1})

    d = pd.DataFrame(frame)
    p = d.pivot(columns='Word rank', index='Topic', values='Word')

    def hover(hover_color = 'lavender'): 
        return dict(selector="tr:hover",
                    props=[("background-color", "%s" % hover_color)])
    styles = [
        hover(),
        dict(selector="th", props=[("font-size", "100%"),
                                   ("text-align", "center")]),
        dict(selector="caption", props=[("caption-side", "bottom")]),
    ]
    html = (p.style.set_table_styles(styles)
              .set_caption("Hover to highlight."))
    #df.style.hide_index()
    return html

print_top_words(lsi_model, nt, 10)

Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,data,experience,business,work,team,skills,science,learning,years,analytics
2,data,research,experience,work,learning,development,required,machine,skills,the
3,learning,machine,research,experience,you,analysis,ability,information,required,data
4,experience,learning,machine,research,business,systems,analytics,science,software,years
5,business,research,learning,team,machine,data,analytics,experience,solutions,product
6,experience,business,status,work,years,analysis,kpmg,analytics,you,systems
7,research,team,business,analytics,you,status,information,learning,kpmg,statistical


#### LDA

In [16]:
lda_model = models.LdaModel(corpus=corpus, num_topics=nt, id2word=dictionary, 
                           random_state=np.random.RandomState(1))
    
print_top_words(lda_model, nt, 10)

Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,data,you,experience,learning,machine,team,new,years,customers,work
2,data,experience,learning,machine,science,models,business,analysis,work,skills
3,data,research,experience,analysis,work,business,skills,the,ability,team
4,experience,data,management,work,technical,skills,systems,development,required,knowledge
5,data,business,experience,analytics,team,work,marketing,insights,product,people
6,data,experience,team,big,work,development,software,design,years,engineer
7,data,status,experience,work,employment,information,ability,applicants,national,kpmg


In [17]:
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

#### HDP 

In [18]:
hdp_model = models.HdpModel(corpus=corpus, id2word=dictionary, 
                           random_state=np.random.RandomState(1))

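#hdp_model.show_topics()
# Note: HdpModel's default top-level truncation is T=150, so the count printed below is
# an upper bound set by that default rather than a number inferred from the data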
topic_info = hdp_model.print_topics(num_topics = -1, num_words=10)
print("Total number of topics detected: " + str(len(topic_info)))

print("Most significant topics:")
print_top_words(hdp_model, 10, 10)

Total number of topics detected: 150
Most significant topics:


Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,data,experience,work,business,team,skills,research,learning,science,analysis
2,data,experience,team,work,learning,business,years,skills,machine,science
3,data,experience,business,learning,able,science,team,machine,must,models
4,data,experience,work,skills,business,ability,years,team,strong,environment
5,data,experience,learning,business,work,machine,skills,team,models,health
6,mapr,data,hadoop,service,professional,proficiency,sales,customer,platform,experience
7,data,experience,network,business,industry,work,conduent,skills,services,management


#### LSI + TfIdf

In [19]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
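
The weighting can be reproduced by hand on a tiny example. This is a sketch that assumes the default settings of `TfidfModel` (raw term counts, base-2 logarithm of `num_docs / doc_freq` as the inverse document frequency, then L2 normalization per document). The helper name `tfidf_weights` and all numbers are invented for illustration.

```python
import math

def tfidf_weights(bow, doc_freq, num_docs):
    """bow: list of (token_id, count); doc_freq: token_id -> number of documents containing it."""
    raw = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(tid, w / norm) for tid, w in raw] if norm > 0 else []

# Invented numbers: token 0 occurs in all 8 documents, token 1 in 2 of them, token 2 in 4.
doc_freq = {0: 8, 1: 2, 2: 4}
weights = tfidf_weights([(0, 3), (1, 1), (2, 1)], doc_freq, num_docs=8)
print(weights)
```

A token that occurs in every document gets weight zero no matter how often it appears, which is a likely reason why a ubiquitous word such as 'data' drops out of the top words in the TfIdf-based models below. (This sketch keeps the zero-weight entry for clarity; a real implementation may drop it.)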

In [20]:
lsi_tfidf_model = models.LsiModel(corpus=corpus_tfidf, num_topics=nt, id2word=dictionary)    

print_top_words(lsi_tfidf_model, nt, 10)

Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,learning,machine,you,research,analytics,kpmg,business,big,software,product
2,kpmg,lighthouse,status,creativity,genetic,military,excellence,firm,unfavorable,matriculation
3,one,capital,succeeding,banking,respect,machine,you,scala,learning,kpmg
4,capital,research,one,big,hadoop,aws,amazon,spark,engineer,java
5,learning,machine,natural,algorithms,language,models,deep,you,google,scientist
6,hire,markets,tech,contract-to-,exemplary,paired,motion,doing,staffing,contract
7,google,learning,marketing,machine,clearance,product,insights,security,facebook,media


#### LDA + TfIdf

In [21]:
lda_tfidf_model = models.LdaModel(corpus=corpus_tfidf, num_topics=nt, id2word=dictionary, 
                                  random_state=np.random.RandomState(1))  

print_top_words(lda_tfidf_model, nt, 10)

Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,one,capital,alexa,speech,banking,succeeding,machine,learning,amazon,you
2,jll,sinai,kore,mount,estate,research,bradstreet,dun,sanofi,markit
3,research,clinical,google,statistical,analyst,study,government,clearance,studies,analysis
4,hire,markets,tech,american,architect,contract,aws,you,will,staffing
5,learning,machine,analytics,business,research,marketing,statistical,models,you,analysis
6,big,cloud,hadoop,java,spark,engineer,kafka,software,technologies,architecture
7,kpmg,lighthouse,status,military,diverse,cognitive,genetic,creativity,excellence,carrier


In [22]:
vis = pyLDAvis.gensim.prepare(lda_tfidf_model, corpus_tfidf, dictionary)
pyLDAvis.display(vis)

#### HDP + TfIdf

In [23]:
hdp_tfidf_model = models.HdpModel(corpus=corpus_tfidf, id2word=dictionary, 
                                 random_state=np.random.RandomState(1))

#hdp_model.show_topics()
topic_info = hdp_tfidf_model.print_topics(num_topics = -1, num_words=10)
print("Total number of topics detected: " + str(len(topic_info)))

print("Most significant topics:")
print_top_words(hdp_tfidf_model, 10, 10)

Total number of topics detected: 150
Most significant topics:


Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,learning,machine,you,research,analytics,business,big,software,analysis,product
2,learning,machine,you,research,analytics,business,big,software,product,solutions
3,learning,machine,you,research,analytics,business,big,statistical,analysis,product
4,learning,machine,research,you,business,analytics,big,software,kpmg,status
5,learning,machine,you,research,analytics,business,big,systems,software,solutions
6,learning,machine,research,you,analytics,business,big,kpmg,product,statistical
7,learning,machine,research,you,analytics,business,big,kpmg,software,models


#### LSI + Bigrams

In [24]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

bigram = Phrases(texts, min_count=1, threshold=1, delimiter=b' ')
bigram_phraser = Phraser(bigram)

texts_b = [bigram_phraser[text] for text in texts]
pprint(texts_b[0][:10])

#dictionary_b = corpora.Dictionary(texts_b)
#dictionary_b.save("dictionary_b") # Saving the dictionary to get reproducible results
dictionary_b = corpora.Dictionary.load("dictionary_b")

corpus_b = [dictionary_b.doc2bow(text) for text in texts_b]

['this individual',
 'position exempt',
 'hiring',
 'restrictions',
 'qualified applicants',
 'encouraged apply',
 'this',
 'recruitment',
 'open',
 'alaska']
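
Whether two adjacent words are merged depends on a collocation score that is compared with `threshold`. The sketch below assumes gensim's default scoring formula, `(count_ab - min_count) * vocab_size / (count_a * count_b)`; the helper name `phrase_score` and all counts are invented, and the exact behaviour may differ between gensim versions.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style collocation score: higher means the pair co-occurs more than chance suggests."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Invented counts for a frequent pair such as "machine learning"
frequent = phrase_score(count_a=500, count_b=400, count_ab=300, vocab_size=40000, min_count=1)

# A pair seen only once scores zero when min_count=1, so it is never merged
rare = phrase_score(count_a=500, count_b=400, count_ab=1, vocab_size=40000, min_count=1)

print(frequent, rare)
```

With `threshold=1`, as used above, the bar for merging is low, which is consistent with the output containing loosely related pairs such as 'this individual' next to genuine phrases.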


In [25]:
lsi_model_b = models.LsiModel(corpus=corpus_b, num_topics=nt, id2word=dictionary_b)
    
print_top_words(lsi_model_b, nt, 10)

Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,data,experience,work,team,business,analytics,machine learning,data science,solutions,support
2,data,experience,machine learning,team,work,research,data science,systems,development,skills
3,machine learning,research,data science,support,analysis,required,you,management,experience,team
4,experience,research,business,data science,machine learning,team,big data,analysis,work,the
5,research,business,machine learning,analytics,data science,data,solutions,team,big data,experience
6,experience,analytics,data science,systems,team,research,you,solutions,technical,work
7,team,you,machine learning,data science,business,solutions,work,years experience,analytics,experience


#### LDA + Bigrams

In [26]:
lda_model_b = models.LdaModel(corpus=corpus_b, num_topics=nt, id2word=dictionary_b, 
                              random_state=np.random.RandomState(1))
    
print_top_words(lda_model_b, nt, 10)

Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,data,experience,research,work,analysis,team,analytics,data science,support,business
2,data,experience,team,big data,solutions,work,systems,development,business,skills
3,data,experience,machine learning,data science,big data,business,analytics,work,team,data mining
4,experience,support,research,data,work,required,analysis,management,systems,development
5,machine learning,experience,data,research,data science,team,work,you,skills,business
6,data,experience,business,analytics,work,analysis,team,support,data science,skills
7,experience,data,machine learning,years experience,solutions,big data,team,skills,you,analytics


In [27]:
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(lda_model_b, corpus_b, dictionary_b)
pyLDAvis.display(vis)

#### HDP + Bigrams

In [28]:
hdp_model_b = models.HdpModel(corpus=corpus_b, id2word=dictionary_b, 
                             random_state=np.random.RandomState(1))

#hdp_model.show_topics()
topic_info = hdp_model_b.print_topics(num_topics = -1, num_words=10)
print("Total number of topics detected: " + str(len(topic_info)))

print("Most significant topics:")
print_top_words(hdp_model_b, 10, 10)

Total number of topics detected: 150
Most significant topics:


Topic / Word rank,1,2,3,4,5,6,7,8,9,10
1,data,experience,work,team,machine learning,business,analytics,data science,research,solutions
2,experience,data,work,business,big data,team,machine learning,analytics,data science,research
3,data,experience,work,team,research,business,support,skills,analysis,analytics
4,data,experience,business,team,work,analytics,machine learning,data science,skills,research
5,data,experience,team,business,solutions,work,machine learning,analysis,analytics,skills
6,experience,data,team,data science,business,solutions,technical,machine learning,work,analytics
7,data,experience,team,work,business,big data,support,skills,insights,research


### Testing the results with similarity queries

#### Defining the query document

In [29]:
# Selecting a random document for query and testing the recommendations 

job_ad = data.sample(n=1, random_state = 1)

print('SELECTED DOCUMENT')
print(job_ad.company.values[0])
print(job_ad.job_title.values[0])
search = job_ad.job_description.values[0]
pprint(search)

search = [word for word in remove_stopwords(re.sub(r"[,.;@#?!&$/()*_'’:]+|[0-9]+", " ", search)).lower().split()] 

SELECTED DOCUMENT
Aetna
Sr Data Engineer
('POSITION SUMMARY\n'
 'The Sr. Data Engineer will be responsible for developing and managing our '
 'big data environment. This individual will work with the data engineering '
 'team and be the thought leader and head guru for our data environment.\n'
 '\n'
 'Fundamental Components:\n'
 'Develops large scale data structures and pipelines to organize, collect and '
 'standardize data that helps generate insights and addresses reporting '
 'needs.\n'
 'Collaborates with other data teams to transform data and integrate '
 'algorithms and models into automated processes.\n'
 'Uses knowledge in Hadoop architecture, HDFS commands and experience '
 'designing & optimizing queries to build data pipelines.\n'
 'Uses strong programming skills in Python, Java or any of the major languages '
 'to build robust data pipelines and dynamic systems.\n'
 'Builds data marts and data models to support Data Science and other internal '
 'customers.\n'
 'Analyzes c

#### Building the query

In [30]:
from gensim import similarities

def query(model, corpus, search_text):
    index = similarities.MatrixSimilarity(model[corpus])
 
    # Performing a query and sorting results

    similar = index[model[search_text]]
    similar = sorted(enumerate(similar), key=lambda item: -item[1])
 
    # Showing most similar documents' numbers and similarity 
    print('Most similar documents')
    print(similar[:6])

    # Showing most similar documents (skipping the first one, which is expected to be the sampled document itself)
    document_ids = [x[0] for x in similar[1:6]]

    pd.options.display.max_colwidth = 400
    return data.iloc[document_ids, [0, 3, -1]]
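
`MatrixSimilarity` ranks documents by cosine similarity between their vectors in topic space. The sketch below computes that measure by hand on invented 3-topic vectors; the function name `cosine` and the numbers are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vec = [0.0, 0.9, 0.1]     # invented topic weights of the query document
same_mix = [0.0, 0.45, 0.05]    # same direction, different length
other_mix = [0.8, 0.1, 0.1]     # dominated by a different topic

print(cosine(query_vec, same_mix))
print(cosine(query_vec, other_mix))
```

Because only the direction of the topic vector matters, documents dominated by the same single topic score close to 1.0 regardless of their wording. Many ties at exactly 1.0 in a result list therefore suggest that the model does not separate the documents well.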

#### Testing the query with different models

In [31]:
# LSI

import warnings
warnings.filterwarnings('ignore')

bow = dictionary.doc2bow(search)

print('LSI topic distribution')
print(lsi_model[bow])

query(lsi_model, corpus, bow)

LSI topic distribution
[(0, 23.363365977285625), (1, -7.253649409206521), (2, 2.3978763734529815), (3, 2.821368203579634), (4, 2.66983371599434), (5, -2.552050207000691), (6, 0.4105969226832645)]
Most similar documents
[(14643, 0.9995043), (14601, 0.9994314), (532, 0.99849576), (14611, 0.99787664), (16251, 0.99772924), (8698, 0.99772257)]


Unnamed: 0,company,job_title,job_description_clean
17692,Aetna,Sr Data Engineer,"Desired: Java Hadoop Python P O S I T I O N SUMMARY The Sr. Data Engineer will be responsible for developing and managing our big data environment. This individual will work with the data engineering team and be the thought leader and head guru for our data environment. Fundamental Components: Develops large scale data structures and pipelines to organize, collect and standardize data that ..."
661,Dupaco Community Credit Union,Data Engineer,"At Dupaco Community Credit Union, we are a not-for-profit; member-owned financial cooperative that helps our members save money—through lower loan rates, fewer service fees, low-cost insurance, the list goes on and on. As an employee at Dupaco, you’ll be part of an interactive team that believes that by working together, we create better solutions. Outside our branches, employees have the oppo..."
17702,Aetna,Lead Data Engineer,"POSITION SUMMARY Manages and responsible for successful delivery of large scale data structures and Pipelines and efficient Extract/ Load/ Transform (ETL) workflows. Acts as the data engineering team lead for large and complex projects involving multiple resources and tasks, providing individual mentoring in support of company objectives. Fundamental Components: Designs and develops complex..."
19657,Aetna,Sr Data Engineer,"POSITION SUMMARY Leads and participates in the design, built and management of large scale data structures and pipelines and efficient Extract/ Load/ Transform (ETL) workflows. Fundamental Components: Develops large scale data structures and pipelines to organize, collect and standardize data that helps generate insights and addresses reporting needs. Writes ETL (Extract / Transform / Load)..."
10644,University of San Francisco,Data Engineer,"Data Engineer University of San Francisco Job Summary: The University of San Francisco collects vast amounts of data across its e-learning assets as well as its physical plant. Help us extract meaningful signal from this data by developing a world-class data warehouse, so we can bring data-driven improvement to student retention, operational efficiency, organizational management, and many othe..."


In [32]:
# LDA

print('LDA topic distribution')
print(lda_model[bow])

query(lda_model, corpus, bow)

LDA topic distribution
[(3, 0.09568885), (4, 0.3748616), (5, 0.44926798), (6, 0.07893407)]
Most similar documents
[(14643, 0.99991745), (14601, 0.99925524), (8673, 0.9924741), (7679, 0.99176675), (7888, 0.99059415), (7575, 0.9905874)]


Unnamed: 0,company,job_title,job_description_clean
17692,Aetna,Sr Data Engineer,"Desired: Java Hadoop Python P O S I T I O N SUMMARY The Sr. Data Engineer will be responsible for developing and managing our big data environment. This individual will work with the data engineering team and be the thought leader and head guru for our data environment. Fundamental Components: Develops large scale data structures and pipelines to organize, collect and standardize data that ..."
10615,Quid,Data Science Infrastructure Software Engineer,"As a software engineer in infrastructure, you will build software and processes to enable engineers to self-service the operation of Quid at scale. Our developers are your customers. Your goal is to continuously assess and ease pain points of fellow engineers and of our infrastructure. We strive to allow developers to take an idea from their laptops through to production in the quickest yet sa..."
9406,ICF,Sr. Data Engineer,"Desired: Hive Pentaho Hadoop Azure Kafka S D L C S Q L Java Spark Data Warehouse Python A W S Working at ICF Working at ICF means applying a passion for meaningful work with intellectual rigor to help solve the leading issues of our day. Smart, compassionate, innovative, committed, ICF employees tackle unprecedented challenges to benefit people, businesses, and governments around the globe. ..."
9687,PayScale,"Senior Software Engineer, Customer Data Engineering","Job Description About the Customer Data Ingestion Team: Our charter: ""We take the data from the customers, and put it in the products."" This can take many forms, and really varies from project to project. Our tools help customers connect to Human Resources Information Systems (HRIS) and provide a centralized data store for all PayScale products to use. You are going to be the 4th member of t..."
9265,PayScale,"Senior Software Engineer, Customer Data Engineering","Job Description About the Customer Data Ingestion Team: Our charter: ""We take the data from the customers, and put it in the products."" This can take many forms, and really varies from project to project. Our tools help customers connect to Human Resources Information Systems (HRIS) and provide a centralized data store for all PayScale products to use. You are going to be the 4th member of t..."


In [33]:
# HDP

print('HDP topic distribution')
print(hdp_model[bow])

query(hdp_model, corpus, bow)

HDP topic distribution
[(0, 0.014905367294066305), (1, 0.9850236480534993)]
Most similar documents
[(777, 1.0), (1304, 1.0), (2027, 1.0), (2391, 1.0), (2520, 1.0), (2662, 1.0)]


Unnamed: 0,company,job_title,job_description_clean
1583,Publicis Spine,Data Engineer (Back-End) Spine,Job Description We are looking for a talented Data Engineer for an exciting opportunity on the data engineering team. You would be involved with designing workflows for data and analytics tools that are a big part of the road-map for 2018 and managing data and infrastructure to efficiently query data in the billions. Candidates considered based on their ability to design large distributed tec...
2409,Compass,"Data Engineer, Business Intelligence","As a Data Engineer for Business Intelligence at Compass, you will be responsible for helping to build the data-driven decision-making culture throughout the organization. You'll work as part of a rapidly growing team in a fast-paced environment. You will be responsible for managing large-scale business systems initiatives that impact multiple functions and teams across the organization. In th..."
2865,Publicis Spine,Data Engineer (Back-End) Spine,Job Description We are looking for a talented Data Engineer for an exciting opportunity on the data engineering team. You would be involved with designing workflows for data and analytics tools that are a big part of the road-map for 2018 and managing data and infrastructure to efficiently query data in the billions. Candidates considered based on their ability to design large distributed tec...
3034,Trident Consulting Inc,Big Data Engineer,"Our client is a global leader in consulting, technology and outsourcing solutions. We enable clients, in more than 30 countries, to stay a step ahead of emerging business trends and outperform the competition. We help them transform and thrive in a changing world by co-creating breakthrough solutions that combine strategic insights and execution excellence. With US$8.25 billion in annual reven..."
3232,KDR recruitment,Lead Data Engineer,"Desired: Ruby Identity And Access Management Kafka R Scala C/ C++ Apache Java S3 Spark S P S S M A T L A B Python A W S Lead Data Engineer - AWS, Redshift, Kafka, Big DataI am working on behalf of an exciting client within the media analytics sector at the moment, this great start up company is currently jumping from success to success, picking up many awards along the way, they offer great ..."


In [34]:
# LSI + TfIdf

bow_tfidf = tfidf[bow]

print('LSI TfIdf topic distribution')
print(lsi_tfidf_model[bow_tfidf])

query(lsi_tfidf_model, corpus_tfidf, bow_tfidf)

LSI TfIdf topic distribution
[(0, 0.24057509697224158), (1, 0.02456694687138038), (2, 0.01546423607309241), (3, -0.030772931718475928), (4, 0.04366012614405163), (5, -0.012691802202716606), (6, -0.01383581951763373)]
Most similar documents
[(14643, 0.9999754), (14601, 0.99953586), (10154, 0.9974402), (17576, 0.996623), (16262, 0.9966146), (5856, 0.99570626)]


Unnamed: 0,company,job_title,job_description_clean
17692,Aetna,Sr Data Engineer,"Desired: Java Hadoop Python P O S I T I O N SUMMARY The Sr. Data Engineer will be responsible for developing and managing our big data environment. This individual will work with the data engineering team and be the thought leader and head guru for our data environment. Fundamental Components: Develops large scale data structures and pipelines to organize, collect and standardize data that ..."
12297,TaosMountain,Data Center Site Reliability Engineer,"THIS IS NOT A REMOTE O P P O R T U N I T Y / NO T H I R D- P A R T Y VENDORS Who is Taos? Taos is an IT consulting and services company that offers expertise across the strategic, management and tactical layers of IT and engineering organizations. As part of the nation's IT landscape since 1989, we offer opportunities that will allow you to achieve your career goals and objectives. We're cha..."
21413,CIITS,DATA SCIENTIST,"Location: Boca Raton, FL Duration: 6+ (Green Card Citizens or EAD GC) Interview Process: Phone/ In-person is required to get hired Job Description: We need a Big Data (Hadoop and/or Teradata) Data Architect, Data Strategist, Data Scientist caliber position. This is not a developer role, it is a very HIGH LEVEL position. The manager is looking for someone using web analytics tools or statist..."
19669,Aetna,Lead Data Modeling Engineer,POSITION SUMMARY Aetna Consumer Health & Products mandate is to transform health and wellness via products that enable consumer engagement and best-in-class experience. This team is responsible for Aetnas digital transformation and all digital products; as well as the development and execution of the digital roadmap. A key part of the responsibility is to build an integrated platform that enab...
7187,Progressive,Data Operations Engineer,"Data Operations Engineer Job Number: 152859 Data Operations Engineer Our team The Data Management & Analytics Platform (DMA) has an exciting opportunity for an experienced Data Operations Engineer to join our team! Our goal is to organize Progressive’s data, make it accessible and useful while staying at the forefront of analytics in order to drive business value. Your role on our team In thi..."


In [35]:
# LDA + TfIdf

print('LDA TfIdf topic distribution')
print(lda_tfidf_model[bow_tfidf])

query(lda_tfidf_model, corpus_tfidf, bow_tfidf)

LDA TfIdf topic distribution
[(0, 0.010773249), (1, 0.010770378), (2, 0.010786133), (3, 0.010779091), (4, 0.010810468), (5, 0.93530476), (6, 0.010775904)]
Most similar documents
[(1587, 1.0), (2473, 1.0), (2675, 1.0), (14601, 1.0), (20202, 1.0), (2073, 0.99999994)]


Unnamed: 0,company,job_title,job_description_clean
2971,Komodo Health,Senior Data Engineer,"Desired: Machine Learning Hadoop Scala C I/ C D S Q L Java Spark Kubernetes Git Docker Postgre S Q L Python Jenkins A W S Komodo Health is addressing the global burden of disease through the world's most actionable healthcare map. Our solutions drive a more transparent, efficient and productive healthcare ecosystem. We value our culture of encouraging growth, collaboration, and constructive d..."
3253,Komodo Health,Senior Data Engineer,"Komodo Health is addressing the global burden of disease through the world's most actionable healthcare map. Our solutions drive a more transparent, efficient and productive healthcare ecosystem. We value our culture of encouraging growth, collaboration, and constructive debate as well as delivering innovative solutions that ""wow"" our customers. The Data Engineering (DE) team is looking for a..."
17692,Aetna,Sr Data Engineer,"Desired: Java Hadoop Python P O S I T I O N SUMMARY The Sr. Data Engineer will be responsible for developing and managing our big data environment. This individual will work with the data engineering team and be the thought leader and head guru for our data environment. Fundamental Components: Develops large scale data structures and pipelines to organize, collect and standardize data that ..."
24656,Komodo Health,Senior Data Engineer,"Desired: Machine Learning Hadoop Scala C I/ C D S Q L Java Spark Kubernetes Git Docker Postgre S Q L Python Jenkins A W S Komodo Health is addressing the global burden of disease through the world's most actionable healthcare map. Our solutions drive a more transparent, efficient and productive healthcare ecosystem. We value our culture of encouraging growth, collaboration, and constructive d..."
2462,LendKey,Principal Data Engineer,"Desired: Microsoft SQL ServerData Management Hadoop Kafka S Q L My S Q L Design Experience Software Development Lend Key is solving a complex challenge – to improve lives with lending made simple – by helping financial institutions compete in the digital age and provide a delightful customer experience, while providing borrowers with the simple, transparent, digital borrowing experience the..."


In [36]:
# HDP + TfIdf

print('HDP TfIdf topic distribution')
print(hdp_tfidf_model[bow_tfidf])

query(hdp_tfidf_model, corpus_tfidf, bow_tfidf)

HDP TfIdf topic distribution
[(0, 0.9418615626926986)]
Most similar documents
[(0, 1.0), (2, 1.0), (4, 1.0), (5, 1.0), (12, 1.0), (18, 1.0)]


Unnamed: 0,company,job_title,job_description_clean
2,Hawaii Medical Service Association,Advanced Data Analyst I - Jr. Data Scientist,"Data Management: Reviews data sources and preps needed data Identifies existing and new data sources Performs data prototyping, if necessary, to agree what data can/not be used Preps data for use in analysis, goes back to source if problems are found, and identifies other options/alternatives if problems cannot be resolved Performs review of data to ensure completeness/accuracy/timeliness for..."
4,Smartronix,Voice/Data Communications Engineer,"Smartronix, Inc., is an information technology and engineering solutions provider specializing in Cloud Computing, Cyber Security, Health IT, Network Operations, and Mission- Focused Engineering. Smartronix has an opening for a Voice/Data Communications Engineer support a Combat Support Agency for the United States Military. Candidate will serve as a communications expert within the unit ..."
5,CACI,Intelligence Analyst / OBP Analyst (Hawaii),"Job Description What You’ll Get to Do: National Cyber Solutions Group, CACI is looking for top talent to join our elite team. NCS is currently seeking an Intelligence Analyst to join one of our Prime Contracts. As the prime contractor, we design and develop mission-essential capabilities to increase the efficiency, integration, and quality of the customer’s intelligence analysis activities. ..."
13,Hawaii Medical Service Association,Advanced Data Analyst I - Jr. Data Scientist,"Desired: Microsoft WordTime Management Microsoft Powerpoint S P S S R S A S S Q L Data Management: Reviews data sources and preps needed data Identifies existing and new data sources Performs data prototyping, if necessary, to agree what data can/not be used Preps data for use in analysis, goes back to source if problems are found, and identifies other options/alternatives if problems canno..."
24,Mosaic North America,"Scientist, Data Lead","Overview The Data Science Lead will join the Advanced Analytics team as a senior contributor and leader focused on creating transformational analytics-enabled capabilities across all of Acosta’s businesses. This can range from using statistical methods, data-mining and machine-learning techniques, or generating novel approaches uniquely suited the challenge. Datasets range in size but will als..."


In [37]:
# LSI + Bigrams

bow_b = dictionary_b.doc2bow(search)

print('LSI(Bigrams) topic distribution')
print(lsi_model_b[bow_b])

query(lsi_model_b, corpus_b, bow_b)

LSI(Bigrams) topic distribution
[(0, 21.676062085622284), (1, -9.426517707993947), (2, 0.7358536740566307), (3, -2.1039593708079365), (4, 2.4566942502531757), (5, -1.9635845674537968), (6, -1.0749836198586833)]
Most similar documents
[(16421, 0.997999), (16730, 0.997999), (15802, 0.9978725), (21581, 0.9976823), (21605, 0.99748856), (5801, 0.9973398)]


Unnamed: 0,company,job_title,job_description_clean
20259,Flywire Corporation,Data Engineer,"We, at Flywire, are looking for a smart, analytical thinker who's excited to empower data-driven decision making at an exciting and fast-growing organization! As our Data Engineer, you will work within the Data Analytics team to ensure that our organization has access to reliable, accurate, and timely data to be used in various reporting, business intelligence, and analytical solutions. Great ..."
19153,Flywire Corporation,Data Engineer,"Desired: Business Intelligence Database Administration Spark Python A W S Tableau Apache We, at Flywire, are looking for a smart, analytical thinker who's excited to empower data-driven decision making at an exciting and fast-growing organization! As our Data Engineer, you will work within the Data Analytics team to ensure that our organization has access to reliable, accurate, and timely da..."
26351,Electronic Arts,Senior Software Engineer/Architect – Data & AI,"Requisition Number:151529 Location: Austin Date Opened:2018-07-30 Electronic Arts Inc. is a leading global interactive entertainment software company. EA delivers games, content and online services for Internet-connected consoles, personal computers, mobile phones and tablets. Senior Software Engineer/ Architect - Data We are EA And we make games – how cool is that? In fact, we entertain..."
26382,Electronic Arts,Senior Software Engineer/Architect – Data & AI,"Desired: HiveData Mining Hadoop Kafka C I/ C D C/ C++ S Q L Java Spark Software Development No S Q L Shell Scripting Python Requisition Number:151529 Location: Austin Date Opened:2018-07-30 Electronic Arts Inc. is a leading global interactive entertainment software company. EA delivers games, content and online services for Internet-connected consoles, personal computers, mobile phones ..."
7128,General Electric,Sr. Staff Data Engineer - Data Governance,"Role Summary: This role will be for supporting the MDM team with the successful execution of the MDM governance program by leading and coordinating tasks and responsibilities with an off-shore data steward team. This specialist role requires an energetic, hands-on individual who understands and continuously improves, processes, policies, procedures, standards, and metrics. The objective of the..."


In [38]:
# LDA + Bigrams

print('LDA (Bigrams) topic distribution')
print(lda_model_b[bow_b])

query(lda_model_b, corpus_b, bow_b)

LDA (Bigrams) topic distribution
[(1, 0.34091327), (3, 0.09501252), (4, 0.2190753), (5, 0.34374446)]
Most similar documents
[(10199, 0.99430025), (517, 0.9903199), (778, 0.98996216), (20856, 0.98672444), (11584, 0.98420405), (10303, 0.9842032)]


Unnamed: 0,company,job_title,job_description_clean
645,INTL FCStone,Data Engineer,"The Data Engineer is responsible for empowering the Strategic Data team to achieve its primary objectives in ingesting, mastering and exposing real-time, event-driven data streams pertaining to the essential relational data dimensions that are crucial to the evolution and continued success of INTL FCStone Inc. This position will focus heavily on utilizing advanced techniques and cutting-edge t..."
957,3M,"Adv Scientist/Adv Engineer/System Analyst* (Maplewood, MN)","3M is seeking an experienced Adv Scientist/ Adv Engineer/ System Analyst to lead the development of data analysis and reporting capabilities within the Corporate R&D; Services Center organization located in Maplewood, MN. At 3M, you can apply your talent in bold ways that matter. Here, you go. Job Summary : The person hired for the position of Adv Scientist/ Adv Engineer/ System Analyst wi..."
25454,Hiscox Ltd,Data Engineer,"Are you a data guru? Do you enjoy solving complex business problems with your expert programming and data analytics skills? If so, our newest Data Engineer role is the right challenge for you. Position: Data Engineer Reporting to: Head of Business Intelligence & Data Analytics Location: Atlanta GA About the Business Intelligence Team: Highly skilled, close knit, technical and functional anal..."
13990,Apple,Senior Software/Data Engineer,"Have you thought about where you are going? Can the answer of ""where have you been?"" Build a whole new set of ideas for you? Apple Maps provides an incredible sense of direction. Apple Maps is confronted with the most meaningful large-scale data problems in the world. We are solving problems that impact billions of people around the globe every single day. Are you passionate about working with..."
12463,Apple,Senior Software/Data Engineer,"Have you thought about where you are going? Can the answer of ""where have you been?"" Build a whole new set of ideas for you? Apple Maps provides an incredible sense of direction. Apple Maps is confronted with the most meaningful large-scale data problems in the world. We are solving problems that impact billions of people around the globe every single day. Are you passionate about working with..."


In [39]:
# HDP + Bigrams

print('HDP (Bigrams) topic distribution')
print(hdp_model_b[bow_b])

query(hdp_model_b, corpus_b, bow_b)

HDP (Bigrams) topic distribution
[(0, 0.9764092612459218), (2, 0.022962413387825793)]
Most similar documents
[(15676, 0.99999976), (3763, 0.9999994), (15000, 0.9999994), (4234, 0.99999833), (10283, 0.99999803), (5888, 0.99999744)]


Unnamed: 0,company,job_title,job_description_clean
4563,"City of Alexandria, VA",Performance Analyst,"Performance Analyst An Overview The City of Alexandria is focused on making smarter, data-driven decisions to deliver efficient and effective services. The lead department for this is the Office of Performance and Accountability, where this open Performance Analyst position opportunity is available. The Performance Analyst is responsible for analyzing data, processes, and programs to iden..."
18204,"City of Alexandria, VA",Performance Analyst,"Performance Analyst An Overview The City of Alexandria is focused on making smarter, data-driven decisions to deliver efficient and effective services. The lead department for this is the Office of Performance and Accountability, where this open Performance Analyst position opportunity is available. The Performance Analyst is responsible for analyzing data, processes, and programs to iden..."
5121,"City of Alexandria, VA",Performance Analyst,"Desired: R S A S Excel Tableau Performance Analyst An Overview The City of Alexandria is focused on making smarter, data-driven decisions to deliver efficient and effective services. The lead department for this is the Office of Performance and Accountability, where this open Performance Analyst position opportunity is available. The Performance Analyst is responsible for analyzing data, ..."
12441,Apple,Maps Business Intelligence - Sr. Data Engineer,"The Maps Supply Chain Management team within the Internet Services group is looking for a skilled, experienced, and motivated Senior Data Analyst to analyze and improve quality and coverage of our data sets for worldwide business listings, points-of-interest, local events, and similar local data. Individuals with proven track record of improving products and services by analyzing large and co..."
7229,Laredo Technical Services Inc.,Data Analyst/Scientist,"Desired: Adobe Acrobat Statistical Software Microsoft SQL Server Machine Learning Microsoft Powerpoint S A S S Q L Microsoft Office S P S S Military Experience D E S C R I P T I O N OF SERVICES: The purpose of this effort is to acquire professional, scientific, and technical support services for the United States Air Force School of Aerospace Medicine Department of Aeromedical Research..."


### Looking into query results

Besides job and task descriptions, our texts include some irrelevant information about companies, teams or benefits, as well as different kinds of disclaimers. Most of the models are able to catch this, grouping words like "gender", "religion", "origin" or "age" together.

Company information is usually the same piece of text for all jobs in a company. That is why a smaller number of topics is preferable in our case: we want recommendations to be based on required qualifications, not company descriptions. It is also why HDP models with a large number of topics may be less accurate.

For the same reason, the TfIdf models' recommendations are more company-oriented, since the tfidf transformation brings words from company descriptions to the surface. If we check the lists of most important words, we see a lot of company names there. The models group all qualification descriptions into a few big topics, while the remaining topics are much smaller and very company-specific.
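To see why tfidf favours company text, here is a minimal sketch on a toy corpus. The postings and company names are hypothetical, and the weighting (tf times log2(N / df), then L2 normalisation) is what we understand gensim's default `TfidfModel` scheme to be. It is written in plain Python so the arithmetic is visible.

```python
import math
from collections import Counter

# Hypothetical toy postings: each company repeats its own boilerplate,
# while the skill words are shared across companies.
docs = [
    "acme builds rockets python sql spark".split(),
    "acme builds rockets python sql hadoop".split(),
    "globex sells insurance python sql spark".split(),
    "globex sells insurance python sql hadoop".split(),
    "initech ships software python sql spark".split(),
    "initech ships software python sql hadoop".split(),
]

n_docs = len(docs)
# Document frequency: the number of postings each word occurs in.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf(doc):
    """Return L2-normalised tf * log2(N / df) weights for one document."""
    counts = Counter(doc)
    raw = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    # Words that occur in every posting get weight 0 and are dropped.
    return {w: v / norm for w, v in raw.items() if v > 0}


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

In this toy corpus the words every posting shares ("python", "sql") drop out entirely. The company boilerplate ("acme", "builds", "rockets") outweighs the more widely used skill word "spark", which matches the pattern we see in the lists of most important words.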

For our specific test case, all LSI and LDA models with a lower number of topics seem to give meaningful results: they return Data Engineering jobs similar to the one in the query, from several different companies.
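The HDP TfIdf query is a useful contrast: its output shows a single topic with weight 0.94 and six matches that all score 1.0. Assuming the `query` helper ranks documents by cosine similarity in topic space (as gensim's similarity indexes do), the sketch below shows why such scores carry little information. The document vectors and the `num_topics` value are hypothetical; only the two query distributions are copied from the outputs above.

```python
import math


def cosine(a, b, num_topics):
    """Cosine similarity between two sparse (topic_id, weight) lists."""
    va = [0.0] * num_topics
    vb = [0.0] * num_topics
    for topic, weight in a:
        va[topic] = weight
    for topic, weight in b:
        vb[topic] = weight
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(x * x for x in vb))
    return dot / (norm_a * norm_b)


# Query distribution copied from the HDP TfIdf output.
hdp_query = [(0, 0.9418615626926986)]
# Hypothetical documents that put all their weight on the same topic.
doc_a = [(0, 0.87)]
doc_b = [(0, 0.55)]

sim_a = cosine(hdp_query, doc_a, num_topics=7)
sim_b = cosine(hdp_query, doc_b, num_topics=7)
print(sim_a, sim_b)  # both are (numerically) 1: the weights cancel out

# Query distribution copied from the LDA (Bigrams) output.
lda_query = [(1, 0.34091327), (3, 0.09501252), (4, 0.2190753), (5, 0.34374446)]
doc_c = [(1, 0.30), (4, 0.25), (5, 0.40)]  # hypothetical, similar topic mix
doc_d = [(0, 0.60), (2, 0.35)]             # hypothetical, disjoint topic mix

sim_c = cosine(lda_query, doc_c, num_topics=7)
sim_d = cosine(lda_query, doc_d, num_topics=7)
print(sim_c, sim_d)
```

When every vector points along the same single topic, cosine similarity cannot separate the documents, so the ranking among them is essentially arbitrary. A query spread over several topics, as in the LSI and LDA cases, gives the similarity score something to discriminate on.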
