# Job title Classification by industry

In [1]:
#This notebook is directed to Eng. Ali Osama (Data Scientist at iNetworks)
############### Multi-class Text Classification Task #####################
############### Job title Classification by industry #####################
#By: Ahmed Ashraf Hussein - Machine Learning Intern applicant at iNetworks
############### E-mail: s-ahmedashraf@zewailcity.edu.eg ##################

### Importing the essential libraries

In [2]:
############## start of imports ############
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import pickle as pic
import json
import requests
############## End of imports  #############

### Loading data from csv file

In [3]:
#make sure the csv file containing the data is located in the same directory as the notebook
#if not, please provide the correct path in the next line of code
df = pd.read_csv('Job titles and industries.csv')

In [4]:
#Quick check
print(df.empty)                     #Ensuring that the dataframe is not empty
print(df.isnull().sum())            #Ensuring that the dataframe has no missing values

False
job title    0
industry     0
dtype: int64


In [5]:
#Getting data insights

counts = []
industries = np.unique(df['industry'].values)
for i in industries:
    tot = sum(df['industry'] == i)
    counts.append((i, tot))
df_summary = pd.DataFrame(counts, columns=['industry', 'number_jobtitles'])
print(df_summary)


text_len = df['job title'].str.len()
print(f'\nThe longest job title has {text_len.max()} characters and the shortest has {text_len.min()} characters')

      industry  number_jobtitles
0  Accountancy               374
1    Education              1435
2           IT              4746
3    Marketing              2031

The longest job title has 100 characters and the shortest has 2 characters


The summary shows that the job titles fall into 4 industries (Accountancy, Education, IT and Marketing), with the number of instances of each industry in the right-hand column. The large differences between these counts indicate that the data is imbalanced.
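
One quick way to quantify the imbalance is to look at each industry's share of the data. The sketch below is only an illustration: it rebuilds the summary table from the counts printed above rather than reading the CSV file, and the `share` column and `imbalance_ratio` variable are names introduced here.

```python
import pandas as pd

# Class counts copied from the summary printed above
summary = pd.DataFrame({
    'industry': ['Accountancy', 'Education', 'IT', 'Marketing'],
    'number_jobtitles': [374, 1435, 4746, 2031],
})

# Share of each industry in the whole dataset
summary['share'] = summary['number_jobtitles'] / summary['number_jobtitles'].sum()

# Ratio between the largest and the smallest class
imbalance_ratio = summary['number_jobtitles'].max() / summary['number_jobtitles'].min()

print(summary)
print(f'Largest/smallest class ratio: {imbalance_ratio:.1f}')
```

IT accounts for a bit more than half of the titles, while Accountancy makes up less than 5% of them.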

In [6]:
#df2 = df.drop_duplicates()
# counts = []
# classes = np.unique(df2['industry'].values)
# for i in classes:
#     j = sum(df2['industry'] == i)
#     counts.append((i, j))
# df_stats = pd.DataFrame(counts, columns=['category', 'number_of_comments'])
# df_stats

The commented-out cell above was an attempt to balance the data by removing duplicate rows. It gave a fairly good balance between IT, Education and Marketing (1529, 973 and 1203 titles respectively), but Accountancy remained under-represented (263).

___________________________________________________________________________________________________________________________

### Data preprocessing 

In [7]:
#Converting the industries to a one-hot encoding for faster processing
#Accountancy    [1,0,0,0]
#Education      [0,1,0,0]
#IT             [0,0,1,0]
#Marketing      [0,0,0,1]

df['industry'] = pd.Categorical(df['industry'])
dfDummies = pd.get_dummies(df['industry'], prefix = 'industry')
df = pd.concat([df, dfDummies], axis=1)
df = df.drop(['industry'], axis = 1)
df.head()

                                           job title  industry_Accountancy  industry_Education  industry_IT  industry_Marketing
0  technical support and helpdesk supervisor - co...                     0                   0            1                   0
1                  senior technical support engineer                     0                   0            1                   0
2                                head of it services                     0                   0            1                   0
3                              js front end engineer                     0                   0            1                   0
4                   network and telephony controller                     0                   0            1                   0
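
To make the encoding concrete, here is a small sketch on toy data (the four rows below are invented for illustration and are not taken from the CSV file). It shows that `pd.get_dummies` orders the indicator columns alphabetically, and how a one-hot row can be mapped back to its industry name.

```python
import pandas as pd

# Toy data standing in for the real dataframe
toy = pd.DataFrame({
    'job title': ['head of it services', 'maths teacher', 'trainee accountant', 'seo executive'],
    'industry': ['IT', 'Education', 'Accountancy', 'Marketing'],
})

# Same encoding as above: one indicator column per industry, in alphabetical order
dummies = pd.get_dummies(toy['industry'], prefix='industry')
print(dummies.columns.tolist())

# Decoding: the column that holds the 1 gives the industry back
decoded = dummies.astype(int).idxmax(axis=1).str.replace('industry_', '', regex=False)
print(decoded.tolist())
```

The same column order (Accountancy, Education, IT, Marketing) is what any code consuming the model's predictions has to assume.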


The following two cells inspect the data to build intuition about the job titles and to develop the hand-written rules that can be used to clean them. They display the longest job titles, which give the strongest indication of the normalization needed.
The cells can be run repeatedly without affecting the original dataframe (df), to which the developed rules are eventually applied.

In [8]:
df2 = df.drop_duplicates()
text_len2 = df2['job title'].str.len()
len_sorted = sorted(text_len2, reverse=True)

In [9]:
for i in range(0,10):
    #change the range of i to view different job titles:
    #i=0 gives the longest job titles, and larger values of i give shorter ones
    print(df2['job title'][text_len2 == len_sorted[i]].to_string())

4113    financial modeller / financial planning analys...
567    java based graduate developer, central norwich...
609    senior cro consultant - digital agency - londo...
879    epos or pc or computer field service engineer ...
609    senior cro consultant - digital agency - londo...
879    epos or pc or computer field service engineer ...
4948    marketing & sales assistant required - no expe...
6786    graduate in maths - engineering - chemistry - ...
4948    marketing & sales assistant required - no expe...
6786    graduate in maths - engineering - chemistry - ...
4411    regional marketing manager western europe- acu...
7276    finance graduate/ trainee accountant role - st...
4411    regional marketing manager western europe- acu...
7276    finance graduate/ trainee accountant role - st...
695     aem adobe consultant role: £100-120k basic sal...
5333    content and digital marketing manager (part-ti...
6278    education support officer: mathematics and num...
695     aem adobe c
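
Several titles share the same length, so the loop above prints some of them more than once. A possible alternative, sketched here on invented toy titles, is to deduplicate the titles first and then pick the longest ones by index. The names `unique_titles`, `lengths` and `longest` are introduced only for this example.

```python
import pandas as pd

# Toy titles (invented); one of them is duplicated on purpose
toy = pd.DataFrame({'job title': ['it manager',
                                  'senior java developer london',
                                  'teacher',
                                  'senior java developer london',
                                  'marketing executive b2b']})

# Work on unique titles so that nothing is listed twice
unique_titles = toy['job title'].drop_duplicates()
lengths = unique_titles.str.len()

# Index labels of the 2 longest titles, then the titles themselves
longest = unique_titles.loc[lengths.nlargest(2).index]
print(longest.to_string())
```

Increasing the argument of `nlargest` shows more titles, each exactly once.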

* Cleaning and splitting the data

In [10]:
#Defining a function for cleaning job titles
def clean_jobtitles(text):
    text = text.lower()                                 #converting all letters to lowercase
    text = re.sub(r'\W', ' ', text)                     #replacing all non-word characters (e.g. £,/) by a space
    text = re.sub(r'(?<!^)\d{2,}.+', ' ', text)         #removing runs of 2+ digits and everything after them, unless they start the job title
    text = re.sub(r'\s+', ' ', text)                    #collapsing multiple white spaces into one
    text = text.strip(' ')                              #stripping leading/trailing spaces (safety step)
    return text
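
To see what these rules do in practice, the sketch below restates the cleaning function so that it runs on its own and applies it to a few sample titles. The samples are made up for illustration (loosely modelled on the kind of titles shown in the inspection output) and are not rows from the dataset.

```python
import re

def clean_jobtitles(text):
    text = text.lower()                            # lowercase everything
    text = re.sub(r'\W', ' ', text)                # non-word characters become spaces
    text = re.sub(r'(?<!^)\d{2,}.+', ' ', text)    # drop 2+ digits (not at the start) and what follows
    text = re.sub(r'\s+', ' ', text)               # collapse repeated whitespace
    return text.strip(' ')

# Illustrative inputs, not taken from the CSV file
samples = [
    'AEM Adobe Consultant role: £100-120k basic salary',
    'Java Based Graduate Developer, Central Norwich',
    'Year 6 Teacher',
]

cleaned = [clean_jobtitles(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f'{before!r} -> {after!r}')
```

The salary information is cut off at the first run of two or more digits, while a single digit such as the one in the last sample is kept.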

In [11]:
df['job title'] = df['job title'].map(clean_jobtitles)

In [12]:
#Adding extra words to the predefined stopwords, based on what was noticed in the inspection cell above
#those words (such as city and country names) could also be removed by performing
#named entity recognition or using dictionaries, but this way is more efficient
#the job titles are already lowercased, so only lowercase forms are needed
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(['london','day','days','week','weeks','month','months','year','years',
                               'role','position','positions','required','part','time','full','city',
                               'salary','starting','dubai','uae','early'])
stop_words = sorted(stop_words)     #a list is the documented type for TfidfVectorizer's stop_words
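
The effect of the extended stop-word list on feature extraction can be sketched as follows. This is an illustration under assumptions: scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for the NLTK list so that the example runs without downloading the NLTK corpus, only three of the extra words are added, and the titles are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Assumption: scikit-learn's English list replaces the NLTK one in this sketch
custom_stop_words = sorted(set(ENGLISH_STOP_WORDS).union(['london', 'role', 'salary']))

# Invented, already-cleaned job titles
titles = ['senior developer role in london', 'marketing manager london', 'maths teacher']

vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
features = vectorizer.fit_transform(titles)

# Words that survived stop-word removal and became TF-IDF features
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(features.shape)
```

Words such as 'london' and 'role' never reach the vocabulary, so they cannot influence the classifier.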

In [13]:
#splitting the data with StratifiedShuffleSplit to preserve the class proportions of the imbalanced data
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
X = df['job title']
y = df.drop(['job title'], axis = 1)
X = np.array(X)
y = np.array(y)

#each iteration overwrites the previous one, so only the last of the 5 splits is kept
for train_index, test_index in splitter.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
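
What stratification buys can be checked on a toy example. The sketch below uses invented labels (80 samples of one class, 20 of another) and a single split; it is meant only to show that the class share is the same in the training and test parts.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced labels: 80 samples of class 0 and 20 samples of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_toy, y_toy))

# Fraction of class 1 in each part of the split
train_share = y_toy[train_idx].mean()
test_share = y_toy[test_idx].mean()
print(len(train_idx), len(test_idx))
print(train_share, test_share)
```

With a plain random split, the minority class could end up over- or under-represented in the test set, which would make the evaluation scores less reliable.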

### Model Training

* Multinomial Naive Bayes

In [14]:
#Building a pipeline that includes a feature extractor and a classifier
pipeline_MNB = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)),
                         ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None)))])
#Model training
pipeline_MNB.fit(X_train, y_train)

#Model prediction
prediction_MNB = pipeline_MNB.predict(X_test)

In [15]:
#Model evaluation
#we also report the micro-averaged F1 score, which is more informative than accuracy on imbalanced data
print('MNB Accuracy Score equals '+str(accuracy_score(y_test, prediction_MNB)))
print('MNB F1 Score equals '+str(f1_score(y_test, prediction_MNB, average='micro'))) 

MNB Accuracy Score equals 0.8872845831392641
MNB F1 Score equals 0.911785799665312
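
Because the labels are one-hot rows, scikit-learn treats the task as multilabel: `accuracy_score` then counts a sample as correct only if its whole row matches, whereas the micro-averaged F1 score counts individual label decisions. That is presumably why the accuracy comes out lower than the F1 score. A small sketch on invented predictions makes the difference visible.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented ground truth: one sample per industry
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]])

# Invented predictions: row 1 predicts no industry, row 3 predicts two
y_pred = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 1, 1]])

# Subset accuracy: only rows 0 and 2 match exactly
subset_accuracy = accuracy_score(y_true, y_pred)

# Micro F1: 3 true positives, 1 false positive, 1 false negative
micro_f1 = f1_score(y_true, y_pred, average='micro')

print(subset_accuracy, micro_f1)
```

Rows with no predicted industry, or with more than one, are possible with a one-vs-rest classifier trained on one-hot targets, and they are penalised more heavily by the accuracy score.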


* Support Vector Machine

In [16]:
#Building a pipeline that includes a feature extractor and a classifier
pipeline_SVM = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)),
                         ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1))])
#Model training
pipeline_SVM.fit(X_train, y_train)
#Model prediction
prediction_SVM = pipeline_SVM.predict(X_test)

In [17]:
#Model evaluation
print('SVM Accuracy Score equals '+str(accuracy_score(y_test, prediction_SVM)))
print('SVM F1 Score equals '+str(f1_score(y_test, prediction_SVM, average='micro'))) 

SVM Accuracy Score equals 0.911970190964136
SVM F1 Score equals 0.9295973628443608


* Logistic Regression

In [18]:
#Building a pipeline that includes a feature extractor and a classifier
pipeline_LR = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)),
                        ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1))])
#Model training
pipeline_LR.fit(X_train, y_train)
#Model prediction
prediction_LR = pipeline_LR.predict(X_test)

In [19]:
#Model evaluation
print('LR Accuracy Score equals '+str(accuracy_score(y_test, prediction_LR)))
print('LR F1 Score equals '+str(f1_score(y_test, prediction_LR, average='micro'))) 

LR Accuracy Score equals 0.8812296227293899
LR F1 Score equals 0.9158607350096711
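
A single aggregate score can hide poor performance on the smallest class. One option, sketched here on invented predictions rather than on the models above, is a per-industry breakdown with `classification_report`. The `industries` list assumes the alphabetical column order produced by the one-hot encoding.

```python
import numpy as np
from sklearn.metrics import classification_report

# Assumed column order of the one-hot labels
industries = ['Accountancy', 'Education', 'IT', 'Marketing']

# Invented ground truth and predictions (one sample per industry)
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 1, 1]])

# Per-class precision, recall and F1; zero_division=0 silences the warning
# raised for a class that is never predicted
report = classification_report(y_true, y_pred, target_names=industries,
                               zero_division=0, output_dict=True)

for name in industries:
    print(name, report[name]['precision'], report[name]['recall'])
```

Applied to the real test set, the same call would show whether Accountancy, the smallest class, is recognised as reliably as IT.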


## Deploying the Model as a RESTful API service where the input is an HTTP request with a parameter for the "Job title" and the output is the expected industry

In this section, we save the trained models to disk. The 'Flask.py' script can then load whichever model is desired.

In [20]:
# Saving models to disk
for name, model in [('MNB', pipeline_MNB), ('SVM', pipeline_SVM), ('LR', pipeline_LR)]:
    with open(f'{name}_model.pkl', 'wb') as f:      #the with block closes the file after writing
        pic.dump(model, f)
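
Loading a saved pipeline back works the same way in reverse. The sketch below is a self-contained round trip on a tiny stand-in pipeline (trained on four invented titles, not one of the models above) written to a temporary directory; it checks that the restored pipeline gives the same predictions as the original.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in pipeline trained on invented data
titles = ['maths teacher', 'java developer', 'primary teacher', 'python developer']
labels = ['Education', 'IT', 'Education', 'IT']
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())])
pipeline.fit(titles, labels)

# Save to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(pipeline, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored pipeline should behave exactly like the original
same = list(restored.predict(titles)) == list(pipeline.predict(titles))
print(same)
```

Pickled pipelines should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.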

Instructions:
1- Open a command prompt (cmd)
2- Run 'python Flask.py'
3- Set the 'url' variable (first line of the next cell) to the address the Flask server is listening on
4- Change the value of 'exp' to test a different job title
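
The 'Flask.py' script itself is not shown in this notebook. The following is only a minimal sketch of what such a service might look like, based on the request and response used in the next cell (a POST to '/api' with an 'exp' field, answered with a sentence naming the industry). The helper names `load_model` and `create_app`, the `INDUSTRIES` list, the default model file and the 'Unknown' fallback are all assumptions made for this example.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

# Assumed column order of the one-hot labels (alphabetical, as in the notebook)
INDUSTRIES = ['Accountancy', 'Education', 'IT', 'Marketing']


def load_model(path='SVM_model.pkl'):
    """Load one of the pickled pipelines saved by the notebook."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def create_app(model):
    """Build a Flask app that serves predictions from the given pipeline."""
    app = Flask(__name__)

    @app.route('/api', methods=['POST'])
    def predict():
        payload = request.get_json(force=True)
        title = payload['exp']
        # The notebook cleans titles before training; the same cleaning
        # should be applied here (omitted in this sketch)
        row = np.asarray(model.predict([title]))[0]
        if row.sum() == 0:
            # Assumption: answer 'Unknown' when no one-vs-rest classifier fires
            return jsonify('The predicted industry is Unknown')
        industry = INDUSTRIES[int(row.argmax())]
        return jsonify(f'The predicted industry is {industry}')

    return app

# To serve locally (assumed entry point):
#     create_app(load_model()).run(host='127.0.0.1', port=5000)
```

Passing the model into `create_app` keeps the choice of model (MNB, SVM or LR) in one place and makes the endpoint easy to exercise with Flask's test client.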

In [22]:
url = 'http://127.0.0.1:5000/api'
r = requests.post(url,json={'exp':'teacher'})
print(r.json())

The predicted industry is Education
