# Job Titles and Functions Dataset



Job functions and job titles are very different things. A job title is essentially
the name of a position within an organization filled by an employee. Job
function is the routine set of tasks or activities undertaken by a person in
that position. An employee's title and function are often closely related,
though not all job functions are clear based on title alone.


### In this notebook, we build a system that recommends the job functions in which an employee with a given job title can work



# 1. Data Engineering

In [10]:
%matplotlib inline
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

from keras.models import Sequential
from keras import layers
from keras.utils import to_categorical
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import pickle



First, we fetch and inspect the dataset.

In [11]:
#Inspecting the data
jobs_data = pd.read_csv("jobs_data.csv")
jobs_data.head(10)

Unnamed: 0.1,Unnamed: 0,title,jobFunction,industry
0,0,Full Stack PHP Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Marketing and Advertisi..."
1,1,CISCO Collaboration Specialist Engineer,"['Installation/Maintenance/Repair', 'IT/Softwa...",['Information Technology Services']
2,2,Senior Back End-PHP Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Computer Networking']"
3,3,UX Designer,"['Creative/Design/Art', 'IT/Software Developme...","['Computer Software', 'Information Technology ..."
4,4,Java Technical Lead,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Information Technology ..."
5,5,Technical Support Engineer,"['IT/Software Development', 'Engineering - Tel...","['Information Technology Services', 'Computer ..."
6,6,Senior iOS Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Information Technology Services', 'Graphic D..."
7,7,Mechanical Engineer,['Engineering - Mechanical/Electrical'],"['Architectural and Design Services', 'Enginee..."
8,8,Real Estate Sales Specialist - 10th of Ramadan,['Sales/Retail'],['Real Estate/Property Management']
9,9,School Principal,"['Education/Teaching', 'Administration', 'Oper...",['Education']


As shown above, the `Unnamed: 0` column holds the row number, so we rename it to `ID` to serve as the identifier of each record.


In [12]:
jobs_data = jobs_data.rename(columns={"Unnamed: 0": "ID"})
jobs_data

Unnamed: 0,ID,title,jobFunction,industry
0,0,Full Stack PHP Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Marketing and Advertisi..."
1,1,CISCO Collaboration Specialist Engineer,"['Installation/Maintenance/Repair', 'IT/Softwa...",['Information Technology Services']
2,2,Senior Back End-PHP Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Computer Networking']"
3,3,UX Designer,"['Creative/Design/Art', 'IT/Software Developme...","['Computer Software', 'Information Technology ..."
4,4,Java Technical Lead,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Information Technology ..."
5,5,Technical Support Engineer,"['IT/Software Development', 'Engineering - Tel...","['Information Technology Services', 'Computer ..."
6,6,Senior iOS Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Information Technology Services', 'Graphic D..."
7,7,Mechanical Engineer,['Engineering - Mechanical/Electrical'],"['Architectural and Design Services', 'Enginee..."
8,8,Real Estate Sales Specialist - 10th of Ramadan,['Sales/Retail'],['Real Estate/Property Management']
9,9,School Principal,"['Education/Teaching', 'Administration', 'Oper...",['Education']


##### After inspecting the dataset, we noticed two job functions that contain the same entries in a different order:


 ['IT/Software Development', 'Engineering - Telecom/Technology']
 
##### and

 ['Engineering - Telecom/Technology', 'IT/Software Development']
 
##### Since they differ only in order, we convert the first form into the second so that both become the exact same label.
 

In [13]:
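old_term = "['IT/Software Development', 'Engineering - Telecom/Technology']"
new_term = "['Engineering - Telecom/Technology', 'IT/Software Development']"
# Assign through .loc with a boolean mask; chained indexing such as
# jobs_data.jobFunction[i] = ... may write to a temporary copy instead of the DataFrame
jobs_data.loc[jobs_data.jobFunction == old_term, "jobFunction"] = new_term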
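The same idea extends to any pair of job functions that differ only in the order of their entries. A minimal sketch, assuming each `jobFunction` value is the string form of a Python list, as in the table above; the helper name `canonical_job_function` is hypothetical and not part of the original notebook:

```python
import ast

def canonical_job_function(raw):
    """Return the list string with its entries sorted, so order no longer matters."""
    functions = ast.literal_eval(raw)  # parse "['a', 'b']" into a real list
    return str(sorted(functions))

a = "['IT/Software Development', 'Engineering - Telecom/Technology']"
b = "['Engineering - Telecom/Technology', 'IT/Software Development']"
print(canonical_job_function(a) == canonical_job_function(b))  # True
```

Applied to the whole column, for example with `jobs_data.jobFunction.map(canonical_job_function)`, this would merge every order-only duplicate at once, which would also reduce the number of distinct classes the model has to learn.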

While inspecting the dataset, we also noticed that, by our estimate, more than 95% of the __titles__ of the data instances that have
the __['Media/Journalism/Publishing', 'Marketing/PR/Advertising']__ term as their __jobFunction__
include at least one word from this list of keywords: __['marketing' , 'market', 'media' , 'digital' , 'social', 'advertising', 'brand','seo']__

<b>Tip:</b> This is why we decided to change all of these titles into one unified title, __"Marketing Specialist"__,
and to save the keywords in one list.


In [14]:
term = "['Media/Journalism/Publishing', 'Marketing/PR/Advertising']"
# Give every record with this job function the same unified title
jobs_data.loc[jobs_data.jobFunction == term, "title"] = "Marketing Specialist"

In [15]:
Marketing_Keywords = ['marketing' , 'market', 'media' , 'digital' , 'social', 'advertising', 'brand','seo']
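An estimate such as "more than 95%" can be checked rather than eyeballed. The sketch below is one possible way to do that, with made-up sample titles and a hypothetical helper `keyword_coverage`; it is not part of the original notebook:

```python
def keyword_coverage(titles, keywords):
    """Fraction of titles that contain at least one keyword as a whole word."""
    keyword_set = set(keywords)
    hits = sum(1 for title in titles if keyword_set & set(title.lower().split()))
    return hits / len(titles) if titles else 0.0

# Hypothetical sample titles, for illustration only
sample_titles = ["Digital Marketing Specialist", "Social Media Moderator",
                 "SEO Executive", "Content Creator"]
marketing_keywords = ['marketing', 'market', 'media', 'digital', 'social',
                      'advertising', 'brand', 'seo']
print(keyword_coverage(sample_titles, marketing_keywords))  # 0.75
```

On the real data, the check would have to run on the original titles of the matching rows, before they are overwritten with the unified title.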

Similarly, we noticed that, by our estimate, more than 95% of the __titles__ of the data instances that have
the __['Sales/Retail']__ term as their __jobFunction__
include at least one word from this list of keywords: __['sales' , 'real estate', 'real' , 'estate' , 'real-estate', 'broker', 'property']__

<b>Tip:</b> This is why we decided to change all of these titles into one unified title, __"Sales Consultant"__,
and to save the keywords in one list.


In [16]:
term = "['Sales/Retail']"
# Give every record with this job function the same unified title
jobs_data.loc[jobs_data.jobFunction == term, "title"] = "Sales Consultant"

In [17]:
Sales_Keywords = ['sales' , 'real estate', 'real' , 'estate' , 'real-estate', 'broker', 'property']

This case is a little different. Many data instances have a
__jobFunction__ of __['Engineering - Telecom/Technology', 'IT/Software Development']__, but their titles are very diverse, so we cannot collect the keywords manually as we did in the previous examples. Instead, we collect them automatically by appending every word of these titles to one list, which we then clean and refine.

We also change each of these titles into one unified title: __Software Developer__


In [18]:
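term = "['Engineering - Telecom/Technology', 'IT/Software Development']"
mask = jobs_data.jobFunction == term
keywords = []
# Collect every lower-cased word of the matching titles, in order of first appearance.
# Lower-casing happens before the membership test so that 'PHP' and 'php' count as one word.
for title in jobs_data.loc[mask, "title"]:
    for word in title.lower().split():
        if word not in keywords:
            keywords.append(word)
# Replace the matching titles with one unified title
jobs_data.loc[mask, "title"] = "Software Developer"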

##### From here, we start working on the list of software keywords

In [19]:
keywords_noDupl = list(dict.fromkeys(keywords))
keywords_noDupl

['full',
 'stack',
 'php',
 'developer',
 'senior',
 'back',
 'end-php',
 'java',
 'technical',
 'lead',
 'ios',
 'full-stack',
 '-',
 'joomla',
 'expert',
 'website',
 'front-end',
 'back-end',
 'odoo',
 'software',
 'erp',
 'implementer',
 'alexandria',
 '.net',
 'tech',
 'android',
 'team',
 'leader',
 'mid',
 'web',
 'associate',
 'engineer',
 'wordpress',
 'junior',
 '(',
 'angular',
 '4+',
 ')',
 'tester',
 '"laravel"',
 'mobile',
 'android/ios',
 'front',
 'end',
 'vue.js',
 '/',
 'ui/ux',
 'director',
 'in',
 'tunisia',
 'ge',
 'mvc',
 'oracle',
 'database',
 'admin',
 'flutter',
 'nasr',
 'city',
 'it',
 'help',
 'desk',
 'specialist',
 'minya',
 'nodejs',
 'backend',
 'rpa',
 'asp.net',
 'crm',
 'frontend',
 '(jquery)',
 'internship',
 '"laravel',
 'project"',
 'application',
 'programmer',
 'monitoring',
 'agent',
 'data',
 'etl',
 'security',
 'business',
 'analyst',
 'core',
 'rest',
 'to',
 'graphql',
 'project',
 'bi/bw',
 'consultant',
 'architect',
 'ionic',
 'app',
 ...

1. Remove spaces, punctuation, numbers and single characters using __regular expressions__
2. Manually inspect the list and collect the generic words to be removed
3. Add back the two-part terms (such as 'back end') whose individual words were removed as generic words

In [20]:


import re

# Clean each keyword in place
for i, word in enumerate(keywords_noDupl):
    # Replace punctuation and numbers with spaces
    word = re.sub(r'[^a-zA-Z]', ' ', word)
    # Remove single characters surrounded by whitespace
    word = re.sub(r"\s+[a-zA-Z]\s+", ' ', word)
    # Collapse repeated whitespace last, so the spaces introduced above are merged too
    word = re.sub(r'\s+', ' ', word)
    keywords_noDupl[i] = word
    
GeneralWords = ['full','alexandria','owner', 'surveillance','access', 'system','leader','stack','senior','back','lead','-','expert','team', 'leader', 'mid', 'associate', 'engineer', 'end',  'front', 'nasr', 'city', 'help', 'desk', 'specialist','minya', 'internship','project',  'monitoring','agent', 'security',  'senior\\team','#','freelance/part','consultant','architect','egypt', 'solution','4+','brazil', 'saudi', 'arabia', 'owner','kickstart','[marketing]', 'designer','riyadh','alex','ksa','riseup', 'contact', 'salesforce.com', 'dubai', 'business', 'analyst', 'core', 'rest','to', 'benha', 'product', 'principal', 'indonesia', 'implementation', '(sales', 'buzz)', 'intern', 'cross', 'remotely', 'platform','infrastructure','cairo','liferay', 'turkey','-entity', 'professional', 'bi','host','sharqia','tunis','(part', 'department)','only','india', 'time)','forms','officer', 'success','bw','fresh','graduates', 'monufya','coordinator','e-commerce', 'dakahlia','graduate','unpaid','(outsource)', 'hr','writer','jeddah','services','subject','mm', 'middle', 'next','on','up','deep','learning','pre-sales','ware','head','level','of','delivery','(leading','branch)','factor','stack/', 'supervisor', 'design','unified','banking', 'mansoura','communications', 'manager', 'functional', 'consultant)', '(r&d)', 'scientist','hurghada','(alexandria)','media','(information']
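# Build a new list instead of calling remove() while iterating,
# which skips the element that follows each removal
keywords_noDupl = [word for word in keywords_noDupl if word not in GeneralWords]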
#add 2 parts words technical lead, tech lead, back end, front end,react native, computer engineer, c#,middle ware, ruby on rails

keywords_noDupl.append('technical lead')
keywords_noDupl.append('back end')
keywords_noDupl.append('front end')
keywords_noDupl.append('react native')
keywords_noDupl.append('computer engineer')
keywords_noDupl.append('c#')
keywords_noDupl.append('middle ware')
keywords_noDupl.append('ruby on rails')


# The stop-word corpus has to be downloaded once before first use,
# for example with nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords

# Filter in a single pass by building a new list. Calling remove() inside the loop
# skips elements, which is why the removal otherwise has to be repeated several times.
stop_words = set(stopwords.words('english'))
# Trim surrounding whitespace, then drop stop words and empty strings
keywords_noDupl = [word.strip() for word in keywords_noDupl]
keywords_noDupl = [word for word in keywords_noDupl if word and word not in stop_words]

Software_Keywords = keywords_noDupl
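One pitfall worth knowing when filtering keyword lists like this: removing items from a list while looping over it makes Python skip the element that follows each removal. A small illustration with made-up words:

```python
words = ['the', 'of', 'php', 'to', 'in', 'java']
stop_words = {'the', 'of', 'to', 'in'}

# Buggy: removing while iterating skips the element after each removal
buggy = list(words)
for word in buggy:
    if word in stop_words:
        buggy.remove(word)
print(buggy)  # ['of', 'php', 'in', 'java']

# Safe: build a new list in one pass
clean = [word for word in words if word not in stop_words]
print(clean)  # ['php', 'java']
```

The list comprehension needs only one pass, however many stop words sit next to each other.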

# 2. Machine Learning Model


We start by preparing the data for our machine learning model.

We will be using a deep learning model this time. Our approach for preparing the dataset is to binary (one-hot) encode the targets of the dataset (jobFunction) and to vectorize the features (title), as we did with Naive Bayes.

__In order__ to binary encode the targets, we first need to convert them from strings to integers.

In [21]:

labels = jobs_data.jobFunction.astype("category").cat.codes
train_data = jobs_data['title']
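For intuition, `astype("category").cat.codes` numbers the distinct string values by their position in sorted order. The pure-Python sketch below mimics that idea with made-up labels; the variable names are illustrative and it is not a replacement for the pandas call:

```python
# Made-up labels, for illustration only
labels_text = ["['Sales/Retail']", "['Accounting/Finance']",
               "['Sales/Retail']", "['Education/Teaching']"]

# Distinct values in sorted order define the categories; the code is the position
categories = sorted(set(labels_text))
code_of = {label: code for code, label in enumerate(categories)}
codes = [code_of[label] for label in labels_text]
print(codes)  # [2, 0, 2, 1]

# The inverse mapping turns a predicted code back into its text label
label_of = {code: label for label, code in code_of.items()}
print(label_of[2])  # ['Sales/Retail']
```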


We save a map between each encoded integer and the original text of its job function, so that predictions can be decoded later.

In [22]:
Map = dict( zip( jobs_data['jobFunction'].astype("category").cat.codes, jobs_data['jobFunction'] ) )
Map


{313: "['Engineering - Telecom/Technology', 'IT/Software Development']",
 435: "['Installation/Maintenance/Repair', 'IT/Software Development', 'Engineering - Telecom/Technology']",
 127: "['Creative/Design/Art', 'IT/Software Development']",
 372: "['IT/Software Development', 'Engineering - Telecom/Technology', 'Customer Service/Support']",
 260: "['Engineering - Mechanical/Electrical']",
 803: "['Sales/Retail']",
 180: "['Education/Teaching', 'Administration', 'Operations/Management']",
 787: "['Sales/Retail', 'Marketing/PR/Advertising', 'Media/Journalism/Publishing']",
 23: "['Accounting/Finance']",
 759: "['Sales/Retail', 'Creative/Design/Art']",
 205: "['Education/Teaching']",
 557: "['Media/Journalism/Publishing', 'Marketing/PR/Advertising']",
 536: "['Marketing/PR/Advertising', 'Sales/Retail']",
 403: "['IT/Software Development', 'Sales/Retail', 'Engineering - Telecom/Technology']",
 132: "['Creative/Design/Art', 'Media/Journalism/Publishing']",
 540: "['Marketing/PR/Advertising']

__Then__ we binary (one-hot) encode all the integer targets.

In [23]:
labels = to_categorical(labels)
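As a rough picture of what this encoding produces, the sketch below builds the same kind of one-hot rows in pure Python. The function `one_hot` is a hypothetical stand-in written for illustration, not the Keras implementation:

```python
def one_hot(codes, num_classes=None):
    """Turn integer class codes into one-hot rows."""
    if num_classes is None:
        num_classes = max(codes) + 1  # infer the width from the largest code
    return [[1.0 if i == code else 0.0 for i in range(num_classes)]
            for code in codes]

print(one_hot([0, 2, 1]))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Each row has exactly one 1, placed at the index of the class code, so a dataset with 835 distinct job functions yields rows of width 835.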

After that, we use the train_test_split function of scikit-learn to split the dataset.

In [24]:
jobs_train, jobs_test, labels_train, labels_test = train_test_split(train_data, labels, test_size=0.25, random_state=500)

Next, we vectorize the titles of both the training and testing sets. The vectorizer is fitted on the training titles only and then applied to both sets.

In [25]:
vectorizer_train = CountVectorizer()
vectorizer_train.fit(jobs_train)

jobs_train = vectorizer_train.transform(jobs_train)
jobs_test  = vectorizer_train.transform(jobs_test)
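To make the fit/transform split concrete, here is a simplified pure-Python sketch of bag-of-words counting. It only approximates what `CountVectorizer` does (lower-casing and keeping tokens of two or more characters); the function names are hypothetical:

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Assign a column index to every distinct token seen in the training texts."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Count tokens per text; tokens missing from the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

vocab = fit_vocabulary(["Software Developer", "Sales Consultant"])
print(vocab)  # {'consultant': 0, 'developer': 1, 'sales': 2, 'software': 3}
print(transform(["Senior Software Developer"], vocab))  # [[0, 1, 0, 1]]
```

The word "senior" never appeared in the training texts, so it contributes nothing to the vector. The same holds for test titles here: words unseen during fitting are silently dropped.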


### Building our model

Our dataset is not large, so two dense layers are enough to fit it without severe overfitting.

In [26]:
input_dim = jobs_train.shape[1]  # Number of features
number_of_classes = jobs_data.jobFunction.nunique()


model = Sequential()
model.add(layers.Dense(1024, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(number_of_classes, activation='sigmoid'))



In [27]:
model.compile(loss='binary_crossentropy', 
               optimizer='adam', 
               metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1024)              1143808   
_________________________________________________________________
dense_2 (Dense)              (None, 835)               855875    
Total params: 1,999,683
Trainable params: 1,999,683
Non-trainable params: 0
_________________________________________________________________


In [28]:
from keras.callbacks import EarlyStopping

es = EarlyStopping(monitor='loss', mode='min',patience=10)
history = model.fit(jobs_train, labels_train,epochs=50,batch_size=10, callbacks=[es])


Epoch 1/50
...
Epoch 50/50


In [29]:
loss, accuracy = model.evaluate(jobs_train, labels_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(jobs_test, labels_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Training Accuracy: 0.9996
Testing Accuracy:  0.9993
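These figures deserve a careful reading. With `binary_crossentropy` as the loss, Keras generally resolves `metrics=['accuracy']` to binary accuracy, which scores every output unit separately. With 835 outputs and one-hot targets, almost every unit is a correct 0 regardless of the prediction, so the number can be very high even when the top-ranked class is wrong. The sketch below contrasts the two measures on made-up data; both functions are illustrative and written from scratch:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of individual output units on the correct side of the threshold."""
    total = correct = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / total

def top1_accuracy(y_true, y_prob):
    """Share of samples whose highest-scoring class is the true class."""
    hits = sum(1 for t, p in zip(y_true, y_prob)
               if t.index(max(t)) == p.index(max(p)))
    return hits / len(y_true)

# Made-up example: 2 samples, 5 classes, both top predictions are wrong
y_true = [[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
y_prob = [[0.1, 0.2, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.3, 0.1]]
print(binary_accuracy(y_true, y_prob))  # 0.8
print(top1_accuracy(y_true, y_prob))    # 0.0
```

For this notebook, a comparable top-1 figure could be obtained by comparing `np.argmax(model.predict(jobs_test), axis=1)` with `np.argmax(labels_test, axis=1)`.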


To predict the job function of a title using the keyword-list approach described above, we follow these steps:

1. Inspect each word of the title,
2. check whether it belongs to any of the 3 keyword lists,
3. if it does, change the title into the unified title assigned to that list; otherwise leave it unchanged,
4. vectorize the resulting title (as we don't have a pipeline this time),
5. apply the model to the vectorized title (the result is one score per encoded class),
6. decode the model prediction into its integer class code,
7. map the decoded prediction to its text value.

In [30]:
def decode(datum):
    # Index of the highest-scoring class
    return np.argmax(datum)

def JobFunction(job):
    # Replace the title with a unified title as soon as one of its words is a known keyword.
    # For each word, the lists are checked in the order sales, marketing, software.
    for word in job.lower().split():
        if word in Sales_Keywords:
            job = "Sales Consultant"
            break
        if word in Marketing_Keywords:
            job = "Marketing Specialist"
            break
        if word in Software_Keywords:
            job = "Software Developer"
            break

    # Vectorize with the vectorizer fitted on the training titles, predict, then decode
    features = vectorizer_train.transform([job])
    final_result = model.predict(features)
    decoded_datum = decode(final_result)
    return Map[decoded_datum]

In [31]:
print(JobFunction('Medical doctor'))
print(JobFunction('php developer'))
print(JobFunction('social media specialist'))
print(JobFunction('Real Estate agent'))


['Medical/Healthcare']
['Engineering - Telecom/Technology', 'IT/Software Development']
['Media/Journalism/Publishing', 'Marketing/PR/Advertising']
['Sales/Retail']
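Because `JobFunction` splits the title on whitespace, multi-word keywords such as 'real estate' or 'ruby on rails' never match as a phrase; only their single-word parts can. A hedged sketch of phrase-aware matching, using a hypothetical helper `contains_keyword` that is not part of the original notebook:

```python
import re

def contains_keyword(title, keywords):
    """True if any keyword occurs in the title as a whole word or whole phrase."""
    normalized = " ".join(title.lower().split())
    for keyword in keywords:
        # Lookarounds keep 'market' from matching inside 'marketer'
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, normalized):
            return True
    return False

print(contains_keyword("Real Estate agent", ["real estate"]))          # True
print(contains_keyword("Ruby on Rails Developer", ["ruby on rails"]))  # True
print(contains_keyword("Marketer", ["market"]))                        # False
```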


Here we save all the arrays, models and values that we will need for the API calls.

In [32]:
model.save('final_keras_jobs_model.h5')

Sales_Keywords = np.array(Sales_Keywords)
Marketing_Keywords =np.array(Marketing_Keywords)
Software_Keywords = np.array(Software_Keywords)
Map_np = np.array(Map)


np.save('Sales_keywords.npy',Sales_Keywords)
np.save('Marekting_keywords.npy',Marketing_Keywords)
np.save('Software_keywords.npy',Software_Keywords)
np.save('Map.npy',Map_np)

# Use a context manager so the file is closed once the vectorizer has been written
with open("jobs_data_vectorizer.pickel", "wb") as f:
    pickle.dump(vectorizer_train, f)
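When these files are read back, the dictionary needs slightly different handling from the keyword arrays: `np.array` wraps a dict in a zero-dimensional object array, which NumPy stores by pickling. The round trip below is a sketch with a made-up two-entry dictionary and a temporary directory, not the notebook's real files:

```python
import os
import tempfile
import numpy as np

# Made-up stand-in for the notebook's Map dictionary
code_to_label = {803: "['Sales/Retail']", 23: "['Accounting/Finance']"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Map.npy")
    np.save(path, np.array(code_to_label))            # stored as a 0-d object array
    loaded = np.load(path, allow_pickle=True).item()  # .item() unwraps the dict

print(loaded[803])  # ['Sales/Retail']
```

Recent NumPy versions refuse to load object arrays unless `allow_pickle=True` is passed, so the loading side has to opt in explicitly, and should do so only for files it trusts.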

In [33]:
#finish