# Automatic skill suggestions using detailed description

### Project
I decided to write an algorithm that predicts which skills are necessary for a job post based on its description. A working algorithm could then suggest skills to add to a job after the employer has written the job description.

### Data
To do so, I have gathered 2000 job posts from the website using the API provided to me by Manuel Montes. From these 2000 posts I have saved the description and the required skills in separate arrays. The description will be used as input for my ML algorithms while the skills will function as output.

### TF and TF-IDF
After this I calculated the TF and TF-IDF values. The term frequency (TF) is basically the count of a word within a job post; the TF-IDF value indicates how much a word is used in one job post relative to the entire database. So, if the word "Job" appears in nearly all 2000 posts, it says little about any single job post and gets a low weight.

### Data Pre-processing
I preprocessed the data (removed null values, unknown values and stopwords, and reshaped the data). Also, about 50% of the jobs were in Spanish. This shouldn't matter too much for the machine learning algorithms, but since I don't speak Spanish I decided to remove them to better understand the algorithms' output. Finally, I split the data into train and test sets (80/20).

### Training & Evaluation
Finally, I trained two multiclass classification algorithms: KNN and random forest. I tried tweaking them, but sklearn does not yet provide metrics for multiclass-multioutput predictions, and due to lack of time I didn't get to write them myself. This made optimization a lot harder. Also the 2000

## API data gathering

Start off by gathering data from the server. Normally this would just be a download, or I would run my code in the cloud, but since I only have access to an API, the only way to gather data is to do a lot of API calls.

Necessary imports to do API calls

In [4]:
import requests
import json
import pprint
from collections import Counter 

API call to get the IDs of 2000 job posts. These IDs are later used to get the detailed information.

In [2]:
session = requests.Session()
url = f"https://search.torre.co/opportunities/_search/?q=%20language%3AEnglish%29[offset={0}&size={2000}&aggregate={10}]"
query_response = session.post(url)

Get the IDs and save them in the ids array

In [3]:
query_responsejson = query_response.json()
ids = []
for element in query_responsejson['results']:
    ids.append(element['id'])

## Add all data to idsearch

The obtained IDs are used to do API calls to https://torre.co/api/opportunities/#IDOFJOBPOST. The results are added to the idsearch array and the strengths array.

#### I disabled this because the data is saved as npy files and the API calls are not necessary anymore
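
For reference, the disabled step could look roughly like the sketch below. It is not the original code: the helper names `collect_posts` and `fetch_with_requests`, the use of a GET request, and the response keys `details`, `strengths` and `name` are assumptions. Only the URL pattern, the `content` key and the idsearch/strengths arrays come from this notebook. The fetch function is passed in as an argument so the loop can be tried without network access.

```python
def fetch_with_requests(job_id):
    """Fetch one job post as JSON (assumes a plain GET on the URL pattern above)."""
    import requests  # imported lazily so the offline demo below does not need it
    response = requests.get(f"https://torre.co/api/opportunities/{job_id}", timeout=30)
    response.raise_for_status()
    return response.json()


def collect_posts(ids, fetch_json):
    """Return (idsearch, strengths): text blocks and skill names per job post."""
    idsearch, strengths = [], []
    for job_id in ids:
        post = fetch_json(job_id)
        # 'details' and 'strengths' are assumed key names in the response
        idsearch.append(post.get("details", []))
        strengths.append([skill["name"] for skill in post.get("strengths", [])])
    return idsearch, strengths


# offline demo with a fake response (made-up data)
fake_api = {
    "abc123": {
        "details": [{"content": "We are looking for a React developer."}],
        "strengths": [{"name": "React"}, {"name": "Javascript"}],
    }
}
idsearch_demo, strengths_demo = collect_posts(["abc123"], fake_api.__getitem__)
print(strengths_demo)
```

With network access, `collect_posts(ids, fetch_with_requests)` would take the place of the fake lookup, and the two results could then be stored with `np.save`.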

# Calculate TFIDF values to summarize text on word importance

Some necessary imports

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from langdetect import detect
import pandas as pd

import re           
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Preprocessing

Loading the data from the files idsearch.npy and oldstrengths.npy

In [6]:
# save/load the gathered data as npy files
import numpy as np
from numpy import asarray
from numpy import savetxt
# define data
# idsearchdata = np.array(idsearch)
# oldstrengthsdata = np.array(oldstrengths)

# save to npy file (disabled, the files already exist)
# np.save('idsearch.npy',idsearch)
# np.save('oldstrengths.npy',oldstrengths)

# load data from npy file (allow_pickle is needed because the arrays hold Python objects)
idsearch = np.load('idsearch.npy', allow_pickle=True)
oldstrengths = np.load('oldstrengths.npy', allow_pickle=True)
# print(len(idsearch))
# print(len(oldstrengths))

Now we loop over all the data. We remove the Spanish job posts and add the content of every remaining job post to a separate array element. The strengths are saved in another array, aligned with their job posts.

In [7]:
database = ''  # full corpus as one string (not filled in this cell)
test_database = ''
databaseperoppertunity = []  # one description string per English job post
teststrengths = oldstrengths.copy()
print(len(idsearch))
print(len(teststrengths))
removelater = []  # indexes of posts to drop (empty or not English)
for index, oppertunity in enumerate(idsearch):
    # concatenate all text blocks of one job post
    tempdata = ''
    for textblock in oppertunity:
        tempdata += textblock['content']

    if len(tempdata) > 0 and detect(tempdata) == 'en':
        databaseperoppertunity.append(tempdata)
    else:
        removelater.append(index)

# drop the skills of the removed posts so both arrays stay aligned
teststrengths = np.delete(teststrengths, removelater)


2000
2000


In [8]:
print(len(databaseperoppertunity))
print(len(teststrengths))

1161
1161


Tokenize all texts to sentences and words

In [9]:
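# database is still an empty string at this point, so build the corpus from the per-post texts
database = ' '.join(databaseperoppertunity)
tokenized_sent = nltk.sent_tokenize(database)
tokenized_word = nltk.word_tokenize(database)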

## Remove stopwords

In [10]:
stop_words = set(stopwords.words("english"))
stop_words_spain = set(stopwords.words("spanish"))

filtered_list = []
for article in databaseperoppertunity:
    tokens = word_tokenize(article)
    # keep tokens that are not English or Spanish stopwords and that start with a letter
    kept = [w for w in tokens
            if w.lower() not in stop_words
            and w.lower() not in stop_words_spain
            and w[0].isalpha()]
    filtered_list.append(' '.join(kept))

## Calculate TF and TFIDF weights

In [11]:
# instantiate CountVectorizer()
cv=CountVectorizer()

# this step generates word counts for the words in your docs
word_count_vector=cv.fit_transform(filtered_list)

word_count_vector[1].shape

(1, 12072)

In [12]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [13]:
# print idf values
# note: on scikit-learn >= 1.0 use cv.get_feature_names_out() instead
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=["idf_weights"])

# sort ascending
df_idf.sort_values(by=['idf_weights'])

| | idf_weights |
|---|---|
| experience | 1.083419 |
| team | 1.145155 |
| work | 1.154151 |
| working | 1.362099 |
| skills | 1.482822 |
| ... | ... |
| ged | 7.364751 |
| geico | 7.364751 |
| gendered | 7.364751 |
| redeeming | 7.364751 |
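
The IDF weights above can be checked by hand. With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain term t. The small sketch below (the function name `smooth_idf` is only illustrative) reproduces the largest weight in the table, assuming those words occur in exactly one of the 1161 job posts:

```python
import math

def smooth_idf(n_docs, doc_freq):
    """IDF as computed by TfidfTransformer(smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# a word that appears in a single post out of 1161
print(round(smooth_idf(1161, 1), 6))
# a word that appears in every post gets the minimum weight
print(smooth_idf(1161, 1161))
```

Common words such as "experience" sit close to the minimum weight of 1, which is why they contribute little to the TF-IDF vector of any single post.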


## Calculate TFIDF values of documents

In [14]:
count_vector=cv.transform(filtered_list)

tf_idf_vector=tfidf_transformer.transform(count_vector)

In [15]:
from random import randint
index = randint(0,10)
# on scikit-learn >= 1.0 use cv.get_feature_names_out() instead
feature_names = cv.get_feature_names()

# get the tfidf vector of a randomly picked document
first_document_vector=tf_idf_vector[index]

print(first_document_vector.T.todense().shape)
# print the skills of that job post, then show its tfidf scores
print(teststrengths[index])
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)


(12072, 1)
['React', 'Node.js', 'Fullstack Development', 'Javascript', 'TypeScript', 'HTML', 'CSS']


| | tfidf |
|---|---|
| clevertech | 0.478741 |
| personal | 0.141059 |
| put | 0.119759 |
| currently | 0.105156 |
| backgrounds | 0.105156 |
| ... | ... |
| feeds | 0.000000 |
| feel | 0.000000 |
| feelings | 0.000000 |
| feels | 0.000000 |


# Machine Learning

In [16]:
import numpy as np

## Prepare dataset

In [17]:
print(len(teststrengths))
print(len(databaseperoppertunity))

1161
1161


In [18]:
# classes = []
# for skillList in teststrengths:
#     for skill in skillList:
#         if not(skill in classes):
#             classes.append(skill)

X = pd.DataFrame(tf_idf_vector.todense())
Y = pd.DataFrame(teststrengths)
# cast every skill list to a single string, so each distinct list becomes one class label
Y = Y.astype('str')
Y.fillna('', inplace=True)

print(X.shape)
print(Y.shape)

(1161, 12072)
(1161, 1)
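
Because Y holds each skill list as one string, two job posts share a label only when their lists are identical. An alternative that this notebook does not use is to encode the skills as a multi-label indicator matrix with scikit-learn's `MultiLabelBinarizer`, with one column per skill. A minimal sketch, using two skill lists printed elsewhere in this notebook as toy data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

skill_lists = [
    ['React', 'Node.js', 'Fullstack Development', 'Javascript', 'TypeScript', 'HTML', 'CSS'],
    ['SCORM', 'AICCC', 'TinCan', 'xAPI', 'Javascript', 'Node.js', 'React', 'LMS',
     'Communication skills'],
]

mlb = MultiLabelBinarizer()
Y_multilabel = mlb.fit_transform(skill_lists)  # shape: (n_posts, n_distinct_skills)

print(mlb.classes_)         # the skill behind each column
print(Y_multilabel.shape)
# map the first row back to skill names
print(mlb.inverse_transform(Y_multilabel[:1]))
```

With such a matrix, the multi-label options in `sklearn.metrics` (for example `f1_score` with `average='micro'`) become applicable.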


## Split in train and test

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

In [20]:
# split X and Y in one call so the rows of features and labels stay aligned
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

In [22]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
print(type(Y_train.values[0][0][0]))

(928, 12072)
(233, 12072)
(928, 1)
(233, 1)
<class 'str'>
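
Matching shapes alone do not show that features and labels still refer to the same rows, since `train_test_split` shuffles independently on every call unless a `random_state` is fixed. A small sketch with toy data (the frames `X_demo` and `Y_demo` are made up) splits both in a single call and compares the row indexes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X_demo = pd.DataFrame({"feature": range(10)})
Y_demo = pd.DataFrame({"label": [f"class_{i}" for i in range(10)]})

# passing X and Y together applies the same shuffle to both
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.20, random_state=0)

print(X_tr.index.equals(Y_tr.index))
print(X_te.shape, Y_te.shape)
```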


## Training

This is where the models are trained.

### KNN Multiclass

In [23]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=9)
classifierKNN = MultiOutputClassifier(knn, n_jobs=1)
classifierKNN.fit(X_train, Y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(algorithm='auto',
                                                     leaf_size=30,
                                                     metric='minkowski',
                                                     metric_params=None,
                                                     n_jobs=None, n_neighbors=9,
                                                     p=2, weights='uniform'),
                      n_jobs=1)

### Random forest

If you run out of memory here, restart the notebook and decrease n_estimators.

In [25]:
# import the random forest model
from sklearn.ensemble import RandomForestClassifier

# create a random forest classifier with 100 trees
classifierRF=RandomForestClassifier(n_estimators=100)
classifierRF.fit(X_train, Y_train)

  


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
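
A possible way to lower memory use, not tried in this notebook, is to skip `todense()` and pass the sparse TF-IDF matrix directly, since `RandomForestClassifier` accepts SciPy sparse input. A sketch with toy documents and labels (both made up):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "react javascript frontend developer",
    "excel marketing campaign analyst",
    "node javascript backend developer",
    "marketing excel reporting",
]
labels = ["web", "marketing", "web", "marketing"]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer and returns a sparse matrix
X_sparse = TfidfVectorizer().fit_transform(docs)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_sparse, labels)  # the sparse matrix is passed as-is, without todense()

print(type(X_sparse).__name__, X_sparse.shape)
print(clf.predict(X_sparse[:1]))
```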

In [None]:
y_pred = classifierRF.predict(X_test)

## Evaluate

In [125]:
X_test[0]

852     0.0
1157    0.0
450     0.0
17      0.0
971     0.0
       ... 
127     0.0
265     0.0
512     0.0
1107    0.0
940     0.0
Name: 0, Length: 233, dtype: float64

In [126]:
index = randint(0,200)
np.array(Y_test.iloc[[index]])[0]

array(["['hands-on development', 'Exemplary communication skills', 'Managing projects', 'Developer']"],
      dtype=object)

In [127]:
y_pred[index]

"['Marketing', 'Excel']"

In [128]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import roc_auc_score
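
Since both the true label and the prediction shown above are stringified lists, a simple hand-written score is possible. One option, offered here as a sketch and not as the metric used in this project, is the Jaccard similarity between the two skill sets: the number of shared skills divided by the number of distinct skills in either list. The helper names are illustrative, and the sketch assumes every label is a valid Python list literal.

```python
import ast

def parse_skills(label):
    """Turn a stringified list such as "['Marketing', 'Excel']" into a set of lowercase skills."""
    return {skill.lower() for skill in ast.literal_eval(label)}

def jaccard(true_label, predicted_label):
    """Shared skills divided by all distinct skills; 1.0 is a perfect match."""
    true_set, pred_set = parse_skills(true_label), parse_skills(predicted_label)
    if not true_set and not pred_set:
        return 1.0
    return len(true_set & pred_set) / len(true_set | pred_set)

# the true/predicted pair shown above
true_example = "['hands-on development', 'Exemplary communication skills', 'Managing projects', 'Developer']"
pred_example = "['Marketing', 'Excel']"
print(jaccard(true_example, pred_example))

# a pair with partial overlap (made-up prediction)
print(jaccard("['React', 'Javascript', 'CSS']", "['React', 'Javascript', 'PHP', 'HTML']"))
```

Averaging this score over the test set would give a single number to compare settings such as n_neighbors or n_estimators.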

In [167]:
# TEMPORARY: picks a random document as input
from random import randint
index = randint(0,100)
feature_names = cv.get_feature_names()

# get the tfidf vector of the picked document
# (it comes from the full dataset, so it may be a training document)
first_document_vector=tf_idf_vector[index]
# first_document_vector= tf_idf_vector_test[0]

print(first_document_vector.T.todense().shape)
# store the tfidf scores in a DataFrame
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
print('done')


(12072, 1)
done


In [168]:
print(df.T.shape)
print(X_test.shape)
T_true = teststrengths[index]
print(T_true)
Y_pred = classifierRF.predict(df.T)
print(Y_pred)
df.sort_values(by=["tfidf"],ascending=False)

(1, 12072)
(233, 12072)
['SCORM', 'AICCC', 'TinCan', 'xAPI', 'Javascript', 'Node.js', 'React', 'LMS', 'Communication skills']
["['React', 'Javascript', 'PHP', 'HTML', 'CSS', 'WordPress', 'Articles', 'Tutorial']"]


| | tfidf |
|---|---|
| lms | 0.350028 |
| articulate | 0.255815 |
| xapi | 0.175014 |
| scorm | 0.175014 |
| aicc | 0.175014 |
| ... | ... |
| fedramp | 0.000000 |
| fedwire | 0.000000 |
| fee | 0.000000 |
| feed | 0.000000 |


# Saving the model

In [71]:
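# sklearn.externals.joblib is no longer available in newer scikit-learn releases; the standalone package provides the same dump/load
import joblib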



In [180]:
filename = 'RFClassiferextrasmall.joblib.pkl'
_ = joblib.dump(classifierRF, filename)
filenametransformer = 'Transformer.joblib.pkl'
_ = joblib.dump(tfidf_transformer, filenametransformer)

filenamewordcount = 'word_count_vector.joblib.pkl'
_ = joblib.dump(cv, filenamewordcount)
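
To use the saved files for a new description, the three objects have to be loaded and applied in the same order as during training: word counts, then TF-IDF weights, then the classifier. The sketch below shows that round trip on toy data in a temporary directory. The function `suggest_skills`, the toy texts and the temporary file names are all made up for illustration, and the sketch imports the standalone `joblib` package.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# toy stand-ins for the fitted objects of this notebook (made-up data)
texts = [
    "react javascript frontend developer",
    "excel marketing campaign analyst",
    "node javascript backend developer",
    "marketing excel reporting",
]
labels = ["['React', 'Javascript']", "['Marketing', 'Excel']",
          "['React', 'Javascript']", "['Marketing', 'Excel']"]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(texts)
tfidf_demo = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)
rf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
rf_demo.fit(tfidf_demo.transform(counts), labels)


def suggest_skills(description, vectorizer, transformer, classifier):
    """Apply counts -> TF-IDF -> classifier to one description string."""
    tfidf = transformer.transform(vectorizer.transform([description]))
    return classifier.predict(tfidf)[0]


with tempfile.TemporaryDirectory() as tmp:
    fitted = {"cv": cv_demo, "tfidf": tfidf_demo, "rf": rf_demo}
    # save each object, then load it back as a separate process would
    for name, obj in fitted.items():
        joblib.dump(obj, os.path.join(tmp, name + ".joblib.pkl"))
    loaded = {name: joblib.load(os.path.join(tmp, name + ".joblib.pkl")) for name in fitted}

suggestion = suggest_skills("javascript developer wanted",
                            loaded["cv"], loaded["tfidf"], loaded["rf"])
print(suggestion)
```

The vectorizer has to be the fitted one from training, because its vocabulary fixes which column each word maps to.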

In [209]:
url = 'http://127.0.0.1:5000/'
inputvariable = ['aws']

params ={'query': inputvariable}
response = requests.get(url, params)
print(response)
response.json()

<Response [200]>


{'prediction': '{"0":{"0":"[\'Software Development\', \'React\', \'English\']"}}'}