### Job classification from job descriptions using a Naive Bayes classifier

First, let's import the required libraries.

In [73]:
import pandas as pd
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.tokenize import word_tokenize
from sklearn.naive_bayes import MultinomialNB
import numpy as np

Set the display options so that all columns and rows are displayed without truncation.

In [75]:
pd.set_option("display.max_colwidth", None)


In [76]:
pd.set_option("display.max_rows", None)

References:
https://www.kaggle.com/airiddha/trainrev1
https://www.kaggle.com/chadalee/text-analytics-explained-job-description-data  
https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

Import dataset and explore it!

The columns of the dataset are described as follows:

Id - A unique identifier for each job ad

Title - A freetext field supplied by the job advertiser as the Title of the job ad. Normally this is a summary of the job title or role.

FullDescription - The full text of the job ad as provided by the job advertiser. Whenever we see ***s, these are values stripped from the description in order to ensure that no salary information appears within the descriptions.

LocationRaw - The freetext location as provided by the job advertiser.

LocationNormalized - The normalized form of the job location.

ContractType - full_time or part_time.

ContractTime - permanent or contract.

Company - the name of the employer as supplied by the job advertiser.

Category - which of 30 standard job categories this ad fits into.

SalaryRaw - the freetext salary field in the job advert from the advertiser.

SalaryNormalized - the annualised salary interpreted by Adzuna from the raw salary. This notebook does not use it; we predict Category instead.

In [3]:
df = pd.read_csv('Train_rev1.csv')
df.head(5)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  244768 non-null  int64 
 1   Title               244767 non-null  object
 2   FullDescription     244768 non-null  object
 3   LocationRaw         244768 non-null  object
 4   LocationNormalized  244768 non-null  object
 5   ContractType        65442 non-null   object
 6   ContractTime        180863 non-null  object
 7   Company             212338 non-null  object
 8   Category            244768 non-null  object
 9   SalaryRaw           244768 non-null  object
 10  SalaryNormalized    244768 non-null  int64 
 11  SourceName          244767 non-null  object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


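The two columns we need, Category and FullDescription, have no null values (244768 non-null each), although Title, ContractType, ContractTime, Company and SourceName do have gaps. We will try to classify the job category (column 'Category') on the basis of the job description (column 'FullDescription'). Let's see how many unique values the Category column has.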
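The non-null counts printed by df.info() can also be turned into explicit missing-value counts per column. The sketch below shows one way to do that with `isna().sum()`; the frame `toy` and its values are made up for illustration and only stand in for the real data.

```python
import pandas as pd

# Toy frame standing in for the job-ad data (values are made up for illustration)
toy = pd.DataFrame({
    "Title": ["Analyst", None, "Engineer"],
    "ContractType": [None, None, "full_time"],
    "Category": ["IT Jobs", "Sales Jobs", "IT Jobs"],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Names of the columns that contain at least one missing value
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

Applied to the real frame, the same two lines would be `df.isna().sum()` followed by the filter on counts greater than zero.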

In [5]:
df['Category'].unique()

array(['Engineering Jobs', 'HR & Recruitment Jobs',
       'Accounting & Finance Jobs', 'Healthcare & Nursing Jobs',
       'Other/General Jobs', 'Hospitality & Catering Jobs', 'IT Jobs',
       'Customer Services Jobs', 'Travel Jobs', 'Sales Jobs',
       'Manufacturing Jobs', 'Teaching Jobs', 'Creative & Design Jobs',
       'Trade & Construction Jobs', 'Property Jobs', 'Admin Jobs',
       'Legal Jobs', 'Retail Jobs', 'Consultancy Jobs',
       'Energy, Oil & Gas Jobs', 'Logistics & Warehouse Jobs',
       'PR, Advertising & Marketing Jobs', 'Charity & Voluntary Jobs',
       'Scientific & QA Jobs', 'Maintenance Jobs',
       'Domestic help & Cleaning Jobs', 'Social work Jobs',
       'Graduate Jobs', 'Part time Jobs'], dtype=object)

In [6]:
len(df['Category'].unique())

29

Let's also check that we have enough examples for each category in our dataset.

In [10]:
df['Category'].value_counts()

IT Jobs                             38483
Engineering Jobs                    25174
Accounting & Finance Jobs           21846
Healthcare & Nursing Jobs           21076
Sales Jobs                          17272
Other/General Jobs                  17055
Teaching Jobs                       12637
Hospitality & Catering Jobs         11351
PR, Advertising & Marketing Jobs     8854
Trade & Construction Jobs            8837
HR & Recruitment Jobs                7713
Admin Jobs                           7614
Retail Jobs                          6584
Customer Services Jobs               6063
Legal Jobs                           3939
Manufacturing Jobs                   3765
Logistics & Warehouse Jobs           3633
Social work Jobs                     3455
Consultancy Jobs                     3263
Travel Jobs                          3126
Scientific & QA Jobs                 2489
Charity & Voluntary Jobs             2332
Energy, Oil & Gas Jobs               2255
Creative & Design Jobs            

In [36]:
df['Category'].value_counts(normalize=True)

IT Jobs                             0.157222
Engineering Jobs                    0.102848
Accounting & Finance Jobs           0.089252
Healthcare & Nursing Jobs           0.086106
Sales Jobs                          0.070565
Other/General Jobs                  0.069678
Teaching Jobs                       0.051628
Hospitality & Catering Jobs         0.046375
PR, Advertising & Marketing Jobs    0.036173
Trade & Construction Jobs           0.036104
HR & Recruitment Jobs               0.031511
Admin Jobs                          0.031107
Retail Jobs                         0.026899
Customer Services Jobs              0.024770
Legal Jobs                          0.016093
Manufacturing Jobs                  0.015382
Logistics & Warehouse Jobs          0.014843
Social work Jobs                    0.014115
Consultancy Jobs                    0.013331
Travel Jobs                         0.012771
Scientific & QA Jobs                0.010169
Charity & Voluntary Jobs            0.009527
Energy, Oi

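Clearly, our dataset is imbalanced: IT Jobs alone account for about 15.7% of the ads, while categories such as Charity & Voluntary Jobs account for under 1%.
Let's see how many unique values the FullDescription column has.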

In [11]:
df['FullDescription']

0         Engineering Systems Analyst Dorking Surrey Sal...
1         Stress Engineer Glasgow Salary **** to **** We...
2         Mathematical Modeller / Simulation Analyst / O...
3         Engineering Systems Analyst / Mathematical Mod...
4         Pioneer, Miser  Engineering Systems Analyst Do...
                                ...                        
244763    Position: Qualified Teacher Subject/Specialism...
244764    Position: Qualified Teacher or NQT Subject/Spe...
244765    Position: Qualified Teacher Subject/Specialism...
244766    Position: Qualified Teacher Subject/Specialism...
244767    This entrepreneurial and growing private equit...
Name: FullDescription, Length: 244768, dtype: object

In [9]:
len(df['FullDescription'].unique().tolist())

242138
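With 244768 rows and 242138 unique descriptions, 2630 rows repeat a description that already appeared earlier. The notebook keeps these rows, but a repeated description can land in both the training and the validation split. Below is a minimal sketch, on made-up data, of how such repeats could be counted and optionally dropped; `toy` and `deduped` are hypothetical names.

```python
import pandas as pd

# Toy data with one repeated description (made up for illustration)
toy = pd.DataFrame({
    "Category": ["IT Jobs", "IT Jobs", "Sales Jobs", "Admin Jobs"],
    "FullDescription": [
        "python developer",
        "python developer",
        "sales executive",
        "office administrator",
    ],
})

# Number of distinct descriptions
n_unique = toy["FullDescription"].nunique()

# Mark every row whose description already appeared in an earlier row
dup_mask = toy.duplicated(subset="FullDescription", keep="first")
n_duplicates = int(dup_mask.sum())

# Optional: keep only the first occurrence of each description
deduped = toy[~dup_mask].reset_index(drop=True)

print(n_unique, n_duplicates, len(deduped))
```

Whether to drop the repeats is a judgment call; it mainly matters when the validation score should reflect unseen ads only.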

There are some problems with the way FullDescription has been encoded, so we cast every value to a plain Python str before cleaning.

In [24]:
# some problems with the way FullDescription has been encoded
def convert_utf8(s):
    return str(s)

df['FullDescription'] = df['FullDescription'].map(convert_utf8)

In [25]:
df['FullDescription'][0:20]

0     Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K

Extract the required columns into a separate DataFrame.

In [26]:
df_new = df.filter(['Category','FullDescription'], axis=1)

In [27]:
df_new.head()

Unnamed: 0,Category,FullDescription
0,Engineering Jobs,"Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K"
1,Engineering Jobs,"Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****"
2,Engineering Jobs,"Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire Up to ****K AAE pension contribution, private medical and dental The opportunity Our client is an independent consultancy firm which has an opportunity for a Data Analyst with 35 years experience. The role will require the successful candidate to demonstrate their ability to analyse a problem and arrive at a solution, with varying levels of data being available. Essential skills Thorough knowledge of Excel and proven ability to utilise this to create powerful decision support models Experience in Modelling and Simulation Techniques, Experience of techniques such as Discrete Event Simulation and/or SD modelling Mathematical/scientific background minimum degree qualified Proven analytical and problem solving skills Self Starter Ability to develop solid working relationships In addition to formal qualifications and experience, the successful candidate will require excellent written and verbal communication skills, be energetic, enterprising and have a determination to succeed. They will be required to build solid working relationships, both internally with colleagues and, most importantly, externally with our clients. They must be comfortable working independently to deliver against challenging client demands. The offices are located in Basingstoke, Hampshire, but our client work for clients worldwide. The successful candidate must therefore be prepared to undertake work at client sites for short periods of time. Physics, Mathematics, Modelling, Simulation, Analytical, Operational Research, Mathematical Modelling Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire ****K AAE pension contribution, private medical and dental"
3,Engineering Jobs,"Engineering Systems Analyst / Mathematical Modeller. Our client is a highly successful and respected Consultancy providing specialist software development MISER, PIONEER, Maths, Mathematical, Optimisation, Risk Analysis, Asset Management, Water Industry, Access, Excel, VBA, SQL, Systems . Engineering Systems Analyst / Mathematical Modeller. Salary ****K****K negotiable Location Dorking, Surrey"
4,Engineering Jobs,"Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K Located in Surrey, our client provides specialist software development Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K"


Because the data is imbalanced, we perform a stratified split into training and validation data.

In [30]:
X_train, X_val, y_train, y_val = train_test_split(df_new['FullDescription'], df_new['Category'], test_size=0.2, stratify=df_new['Category'])
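To illustrate what the `stratify` argument does, here is a small sketch on made-up labels with an 80/20 class ratio. The names `X`, `y`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative, and `random_state=0` is an assumption added only to make the toy example repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 documents of class "a", 20 of class "b" (made up)
y = pd.Series(["a"] * 80 + ["b"] * 20)
X = pd.Series([f"doc {i}" for i in range(len(y))])

# stratify=y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Both parts keep the 80/20 ratio, which is the same effect visible in the normalized value counts of y_train and y_val.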

In [37]:
y_train.value_counts(normalize=True)

IT Jobs                             0.157221
Engineering Jobs                    0.102848
Accounting & Finance Jobs           0.089253
Healthcare & Nursing Jobs           0.086107
Sales Jobs                          0.070567
Other/General Jobs                  0.069678
Teaching Jobs                       0.051631
Hospitality & Catering Jobs         0.046376
PR, Advertising & Marketing Jobs    0.036172
Trade & Construction Jobs           0.036106
HR & Recruitment Jobs               0.031509
Admin Jobs                          0.031106
Retail Jobs                         0.026898
Customer Services Jobs              0.024768
Legal Jobs                          0.016092
Manufacturing Jobs                  0.015382
Logistics & Warehouse Jobs          0.014841
Social work Jobs                    0.014115
Consultancy Jobs                    0.013329
Travel Jobs                         0.012772
Scientific & QA Jobs                0.010168
Charity & Voluntary Jobs            0.009529
Energy, Oi

In [38]:
y_val.value_counts(normalize=True)

IT Jobs                             0.157229
Engineering Jobs                    0.102852
Accounting & Finance Jobs           0.089247
Healthcare & Nursing Jobs           0.086101
Sales Jobs                          0.070556
Other/General Jobs                  0.069678
Teaching Jobs                       0.051620
Hospitality & Catering Jobs         0.046370
PR, Advertising & Marketing Jobs    0.036177
Trade & Construction Jobs           0.036095
HR & Recruitment Jobs               0.031519
Admin Jobs                          0.031111
Retail Jobs                         0.026903
Customer Services Jobs              0.024778
Legal Jobs                          0.016097
Manufacturing Jobs                  0.015382
Logistics & Warehouse Jobs          0.014851
Social work Jobs                    0.014115
Consultancy Jobs                    0.013339
Travel Jobs                         0.012767
Scientific & QA Jobs                0.010173
Charity & Voluntary Jobs            0.009519
Energy, Oi

Cleaning up the descriptions
A look at the descriptions above shows that they contain numbers, URLs and runs of '*'. As noted in the column description, the '*' runs mark values that were stripped so that no salary information appears in the text. We remove all of these tokens before any analysis so that they do not affect our predictions.

Approach - We will use re.sub to find these anomalous strings in the job descriptions and replace them with an empty string.

In [42]:
# Remove every whitespace-delimited token that contains at least one digit
def remove_nums(s):
    return re.sub(r'\S*[0-9]+\S*', "", s)

X_train = X_train.map(remove_nums)

In [43]:
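# Remove the URLs - any token containing www., .com or .co.uk is treated as a URL.
# The dots are escaped so that only a literal '.' matches; an unescaped '.' matches
# any character, which would also delete ordinary words such as 'income' or 'company'.
def remove_urls(s):
    s = re.sub(r'\S*\.com\S*', "", s)
    s = re.sub(r'\S*www\.\S*', "", s)
    s = re.sub(r'\S*\.co\.uk\S*', "", s)
    return s

X_train = X_train.map(remove_urls)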

In [44]:
# Remove the star_words
def remove_star_words(s):
    return re.sub(r'\S*\*+\S*', "", s)

X_train = X_train.map(remove_star_words)

In [54]:
# Remove the punctuation
from string import punctuation

def remove_punctuation(s):
    # Delete every ASCII punctuation character in a single pass
    return s.translate(str.maketrans('', '', punctuation))

X_train = X_train.map(remove_punctuation)

In [56]:
# Convert to lower case
X_train = X_train.str.lower()

In [62]:
X_train[0]

'engineering systems analyst dorking surrey salary  our client is located in dorking surrey and are looking for engineering systems analyst our client provides specialist software development keywords mathematical modelling risk analysis system modelling optimisation miser pioneeer engineering systems analyst dorking surrey salary '
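The cleaning steps above can also be bundled into a single function, so that exactly the same transformation is applied to any split. The following is only a sketch: `clean_text` and the module-level pattern names are hypothetical, the three URL patterns are merged into one alternation, and the dots in the URL pattern are escaped so that only a literal `.` matches.

```python
import re
from string import punctuation

# Hypothetical helper; the patterns mirror the individual cleaning steps
_NUM_WORD = re.compile(r"\S*[0-9]+\S*")                    # tokens containing a digit
_URL_WORD = re.compile(r"\S*(?:www\.|\.com|\.co\.uk)\S*")  # tokens that look like URLs
_STAR_WORD = re.compile(r"\S*\*+\S*")                      # tokens containing '*'
_PUNCT_TABLE = str.maketrans("", "", punctuation)

def clean_text(s):
    """Apply all cleaning steps to one description and lower-case it."""
    s = _NUM_WORD.sub("", s)
    s = _URL_WORD.sub("", s)
    s = _STAR_WORD.sub("", s)
    s = s.translate(_PUNCT_TABLE)
    return s.lower()

# Made-up sample text for illustration
sample = "Salary ****K, call 01234 or visit www.example.com for our Income team!"
print(clean_text(sample).split())
```

Removed tokens leave extra spaces behind, which is harmless here because CountVectorizer tokenizes on word boundaries. With a Series, the helper would be applied as `X_train.map(clean_text)`.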

Scikit-learn provides a high-level component, CountVectorizer, which creates feature vectors for us. Calling count_vec.fit_transform(X_train) learns the vocabulary dictionary from the training descriptions and returns a document-term matrix of shape [n_samples, n_features].

In [80]:
count_vec = CountVectorizer(input='content', lowercase=True, analyzer='word')
X_train_count_vec = count_vec.fit_transform(X_train)
X_train_count_vec.shape

(195814, 176855)
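To make the document-term matrix more concrete, here is a small sketch on three made-up documents. The names `docs`, `vec`, `dtm` and `unseen` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents
docs = [
    "software engineer python",
    "sales executive sales",
    "python developer",
]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)  # learns the vocabulary and counts the words

# Vocabulary terms ordered by their column index in the matrix
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(dtm.toarray())

# transform() reuses the learned vocabulary; unknown words are ignored
unseen = vec.transform(["java developer"]).toarray()
print(unseen)
```

Each row is one document and each column one vocabulary term, so the shape (195814, 176855) above means 195814 training descriptions and 176855 distinct terms.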

Various algorithms can be used for text classification. Here we use a multinomial Naive Bayes (NB) classifier, which works directly on the word counts produced by CountVectorizer.

In [72]:
clf = MultinomialNB().fit(X_train_count_vec, y_train)
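As an alternative arrangement, the vectorizer and the classifier can be chained with `make_pipeline`, so that fitting and prediction each take raw text in a single call. This is a sketch on made-up data, not the notebook's own setup; `train_docs`, `train_labels` and `model` are hypothetical names, and `alpha=1.0` is simply the scikit-learn default (Laplace smoothing) written out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration
train_docs = [
    "python developer backend",
    "java software engineer",
    "sales executive retail",
    "account manager sales",
]
train_labels = ["IT Jobs", "IT Jobs", "Sales Jobs", "Sales Jobs"]

# The pipeline vectorizes the text and then fits the classifier on the counts
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_docs, train_labels)

# Words never seen in training (e.g. "senior") are ignored by the vectorizer
pred = model.predict(["senior python developer", "retail sales manager"])
print(pred)
```

A pipeline also guards against accidentally refitting the vocabulary on validation data, because `predict` only ever calls `transform` on the vectorizer.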

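Processing the validation data for prediction: we apply the same cleaning steps, then call transform (not fit_transform) so that the vocabulary learned from the training data is reused.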

In [77]:
X_val = X_val.map(remove_nums)
X_val = X_val.map(remove_urls)
X_val = X_val.map(remove_star_words)
X_val = X_val.map(remove_punctuation)
X_val = X_val.str.lower()

In [81]:
X_val_count_vec = count_vec.transform(X_val)

In [83]:
X_val_count_vec.shape

(48954, 176855)

In [84]:
predicted = clf.predict(X_val_count_vec)
np.mean(predicted == y_val)

0.6698124770192425
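Because the classes are imbalanced, accuracy alone can be hard to interpret. The largest class, IT Jobs, makes up about 15.7% of the ads, so a model that always predicts IT Jobs would score roughly 0.157, and the accuracy above is well beyond that baseline. The sketch below, on made-up labels, shows how a majority-class baseline and a macro-averaged F1 score could be computed alongside accuracy; all variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up true labels and predictions for illustration
y_true = np.array(["IT"] * 6 + ["Legal"] * 2 + ["Travel"] * 2)
y_pred = np.array(["IT"] * 6 + ["IT", "Legal"] + ["IT", "IT"])

# Baseline: always predict the most frequent class
labels, counts = np.unique(y_true, return_counts=True)
majority = labels[counts.argmax()]
baseline_acc = accuracy_score(y_true, np.full_like(y_true, majority))

# Overall accuracy of the predictions
acc = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores, so rare classes count equally
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(baseline_acc, acc, macro_f1)
```

In this toy case accuracy looks reasonable while macro F1 is much lower, because the Travel class is never predicted. On the real data the same calls would take `y_val` and `predicted`.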

In [85]:
print(predicted[0])

Accounting & Finance Jobs


In [88]:
X_val.head()

216634    the fixed  portfolio analyst will work with the global bond team in providing support for their portfolio management activities specific responsibilities will include  interacting closely with portfolio managers and traders to ensure timely and accurate execution of investment strategies across client portfolios  determining the effective implementation of ideas across portfolio mandates with different constraints  rebalancing portfolios in response to cash flows benchmark changes market price movements and changes in client guidelines  monitoring positions and verifying that transactions are consistent with client guidelines  interacting with many areas of the firm to improve processes and minimize operational risks  monitoring risk and performance applicants should have the following  a background of relevant professional experience  demonstrate a basic understanding of and a strong interest in fixed  investing  have strong analytical skills a quantitative orientation and b

In [89]:
y_val.head()

216634    Accounting & Finance Jobs
31408     Trade & Construction Jobs
215181    Travel Jobs              
138280    Engineering Jobs         
157816    Accounting & Finance Jobs
Name: Category, dtype: object