## LDA model visualisation
#### A job worth doing


This notebook explores a trained LDA topic model.

Before running it, create a topic model with topic_model.py and make sure this notebook is in the root of the project.

1. Load and visualise a trained topic model
2. Name each topic
3. Classify job descriptions
4. Further analysis using the results of the model

In [None]:
import os

import matplotlib.pyplot as plt
import pandas as pd
import pyLDAvis.gensim
from gensim.models.ldamodel import LdaModel
from gensim.corpora import Dictionary, MmCorpus

# this import relies on the relative position of this notebook
from lib.nlp_utils import NLTKProcessor

pyLDAvis.enable_notebook()

# 1. Load and visualise a trained topic model

In [None]:
LDA_MODEL_NAME = 'lda_model_20topics_10passes'
LDA_MODEL_FILE_PATH = 'models/{}.model'.format(LDA_MODEL_NAME)
# each of these methods reduces the high-dimensional topics to a 2-dimensional representation for visualisation
# options: 'tsne', 'pcoa' or 'mmds'
DIMENSION_REDUCTION_METHOD = 'tsne'

In [None]:
lda_model = LdaModel.load(LDA_MODEL_FILE_PATH)
dictionary = Dictionary.load('models/dictionary.dict')
doc_term_matrix = MmCorpus('models/corpus.mm')

In [None]:
# creates an html visualisation of the topic model (sort_topics=False preserves the topic index, so that a dictionary mapping can be made)
# topic numbers in the visualisation start at 1, whereas in gensim.models.LdaModel the index starts at 0
html = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary, mds=DIMENSION_REDUCTION_METHOD, sort_topics=False)

In [None]:
html

In [None]:
# save the visualisation as a standalone webpage
pyLDAvis.save_html(html, 'visualisations/pyLDAvis.html')

# 2. Name each topic

In [None]:
# hand-created lookup between the topic index from the pyLDAvis visualisation and the perceived category
# the pyLDAvis index starts at 1, whereas the index within the topic model starts at 0
TOPIC_LOOKUP = {
    1: 'cleaning',
    2: 'military',
    3: 'care_takers',
    4: 'construction',
    5: 'other',
    6: 'manufacturing',
    7: 'other',
    8: 'medical_and_insurance',
    9: 'other',
    10: 'project_and_business_management',
    11: 'driving',
    12: 'store_management',
    13: 'warehouse', # physical effort / stooping
    14: 'online_marketing', # sales, linkedin, youtube
    15: 'software_development',
    16: 'administration', # reports, compliance
    17: 'healthcare_and_caring',
    18: 'human_resource', # references to diversity, human resource?
    19: 'accounting',
    20: 'other' # references ajilon and randstad (recruitment agencies)
}
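
Because gensim numbers topics from 0 while the lookup above follows the pyLDAvis numbering from 1, it can help to derive a second lookup keyed by the gensim index. This is a minimal sketch on a shortened stand-in lookup; `to_gensim_index` and `EXAMPLE_LOOKUP` are hypothetical names and are not part of the project.

```python
# Hypothetical helper: re-key a 1-based (pyLDAvis) lookup by the 0-based (gensim) topic index.
def to_gensim_index(pyldavis_lookup):
    return {topic_number - 1: label for topic_number, label in pyldavis_lookup.items()}

# shortened stand-in for TOPIC_LOOKUP, for illustration only
EXAMPLE_LOOKUP = {1: 'cleaning', 2: 'military', 3: 'care_takers'}

gensim_lookup = to_gensim_index(EXAMPLE_LOOKUP)
print(gensim_lookup[0])  # label for gensim topic 0, i.e. pyLDAvis topic 1
```

With a lookup like this the `+1` adjustment would no longer be needed at classification time, at the cost of keeping two dictionaries in sync.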

# 3. Classify job descriptions
### 3.1 Create functions and initialise a text processor

In [None]:
# define a function that gets the most probable classification of a job description using the topic model
def classify_job_description(job_description, text_processer, dictionary, lda_model, topic_lookup=None, return_probability=False):
    processed_job_description = text_processer.process(job_description)
    bow_job_description = dictionary.doc2bow(processed_job_description.split())
    classification_probabilities = lda_model[bow_job_description]
    
    # select the (topic index, probability) pair with the highest probability
    classification = max(classification_probabilities, key=lambda x: x[1])
    if topic_lookup:
        # +1 to align the indexes (different between pyLDAvis and gensim)
        return topic_lookup[classification[0]+1] if not return_probability else classification[1]
    else:
        # if there is no topic lookup just return the index of the topic
        return classification[0] if not return_probability else classification[1]
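
The function above has to be called twice per description when both the label and its probability are needed. As a sketch only, the selection step could return both values at once from the list of `(topic_index, probability)` pairs that the model yields. `pick_top_topic` is a hypothetical name, and the sample pairs and lookup are made up for illustration.

```python
def pick_top_topic(topic_probabilities, topic_lookup=None):
    """Return (topic, probability) for the most probable topic.

    topic_probabilities: list of (gensim_topic_index, probability) pairs.
    topic_lookup: optional dict keyed by the 1-based pyLDAvis topic number.
    """
    if not topic_probabilities:
        # nothing to classify, e.g. the model returned no topics
        return None, 0.0
    topic_index, probability = max(topic_probabilities, key=lambda pair: pair[1])
    if topic_lookup:
        # +1 to move from the gensim index to the pyLDAvis number
        return topic_lookup[topic_index + 1], probability
    return topic_index, probability

# made-up probabilities, for illustration only
example_pairs = [(0, 0.1), (2, 0.7), (1, 0.2)]
example_lookup = {1: 'cleaning', 2: 'military', 3: 'care_takers'}
print(pick_top_topic(example_pairs, example_lookup))
```

Returning a pair also makes the empty case explicit, which the single-value version would raise an error on.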



In [None]:
# initialise the text processor that was originally used for the text processing
text_processer = NLTKProcessor(stemmer=None)

### 3.2 Check that the pipeline is working using some example job descriptions

In [None]:
TEST_JOB_DESCRIPTION = """
Performs administrative and office support activities for summer camp. Duties may include fielding telephone calls, 
receiving and directing visitors, data entry, creating and generating reports, sorting mail, and filing. Software skills, 
and strong communication skills are required. Camp Harkness serves up to 36 campers, youth and adults with special needs, 
at any given time, with a staff of approximately 15, including 12 counselors and a nurse. A current resume must be submitted 
as an indicator of interest in this position.
"""

classify_job_description(TEST_JOB_DESCRIPTION, 
                         text_processer, 
                         dictionary, 
                         lda_model, 
                         topic_lookup=TOPIC_LOOKUP, 
                         return_probability=False)

In [None]:
DEESET_DATA_SCIENTIST_JOB_DESCRIPTION = """
Data Scientist required by an organisation in the North West who are investing heavily in their data analytics capabilities in 2018.
If you want to work with Python and various sophisticated Machine Learning models on a daily basis and join a completely greenfield site with a scary amount of untapped data, read on.

The Role
You will also be joining a recently appointed Head of Data & Analytics who is keen to have someone in his team that he can work closely with, train up and develop in to a high calibre Data Scientist.
As this is a fairly new area for the company, there is going to be a lot of initial grunt work to understand where they are at in relation to their data. You will help the new Head of Data Science find, scrape and collate data from various sources across the organisation to then make a start on making sense of it all.
After the initial collation of the data, you will then play a key role in developing a strategy for the implementation of an advanced analytics platform across the entire product offering.

Technical Stack:
From an experience and technical perspective, the manager is happy to consider pure graduates from an MSc, PhD level or more experienced candidates as he has a track record of training people up in his previous role.
Tech wise, he's most comfortable with Python in terms of programming so is likely to continue that in the new role. However, he's equally happy for people with strong SQL, Matlab or R skills to apply as he appreciates the similarities and how easy it can be to cross-train.
From a Data Science perspective, we're looking for people with a real interest in most, if not all of the following areas:

** Machine Learning 
** Statistics / Mathematics 
** Artificial Intelligence 
** Chatbots 
** Neural Networks 
** Python / R / SQL / Matlab
"""

classify_job_description(DEESET_DATA_SCIENTIST_JOB_DESCRIPTION, 
                         text_processer,
                         dictionary, 
                         lda_model, 
                         topic_lookup=TOPIC_LOOKUP, 
                         return_probability=False)

### 3.3 Classify all the jobs in the monster.com job postings dataset using the topic model

In [None]:
if not os.path.isfile('data/final_data.csv'):
    data_frame = pd.read_csv('data/cleansed_data.csv')
    
    # create a column and populate it with the classification
    data_frame['classification'] = data_frame['job_description'].apply(classify_job_description, 
                                                                     args=(text_processer, dictionary, lda_model, TOPIC_LOOKUP))
    
    # create a column and populate it with the probability of the classification
    data_frame['classification_probability'] = data_frame['job_description'].apply(classify_job_description,
                                                                                   args=(text_processer, dictionary,
                                                                                         lda_model, TOPIC_LOOKUP, True))
    
    data_frame.to_csv('data/final_data.csv', index=False)
else:
    data_frame = pd.read_csv('data/final_data.csv', encoding='latin1')

# 4. Use the new classifications for some further analysis

In [None]:
# set up jupyter notebook for creating readable visualisations inline
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 12)
plt.rcParams['font.size'] = 20

### 4.1 Plot the distributions of salaries in each of the new classes

In [None]:
figure, axis = plt.subplots(nrows=1, ncols=1)

# remove outliers in the salary range (some hourly rates are actually yearly rates)
quantile_data_frame = data_frame[data_frame['standardised_salary'] < data_frame['standardised_salary'].quantile(.99)]

axis = quantile_data_frame.boxplot(column='standardised_salary', by='classification', ax=axis, rot=90, figsize=(40, 40))

# edit to True to save figure
if False:
    plt.tight_layout()
    plt.savefig('visualisations/Salary distribution by classification')
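
The outlier filter above keeps only salaries strictly below the 99th percentile. The sketch below reproduces that idea in plain Python on made-up numbers, assuming linear interpolation between neighbouring values, which is the pandas default for `quantile`. `linear_quantile` is a hypothetical helper and the salaries are not taken from the dataset.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# made-up salaries 1..100, for illustration only
salaries = list(range(1, 101))
threshold = linear_quantile(salaries, 0.99)

# keep only values strictly below the threshold, as the filter above does
kept = [salary for salary in salaries if salary < threshold]
print(len(kept), max(kept))
```

Because the comparison is strict, any value equal to the threshold is dropped along with the values above it.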

# Next steps

1. Improve salary parsing
2. Improve text processing: implement spaCy for production-level text preparation (to reduce noise in the dataset)
3. Create a visualisation that shows the keywords for each category
4. Create vector representations of all words in the dataset using word2vec
5. Engage more stakeholders in the topic labelling process and refine the labels for the business problem