In [21]:
import pandas as pd
import numpy as np


# Data Used
I have used publically available data from Kaggle.
https://www.kaggle.com/gauravduttakiit/resume-dataset


In [22]:
df = pd.read_csv('/content/drive/MyDrive/deepLearning/ResumeScreening/UpdatedResumeDataSet.csv')

# Exploratory Data Analysis
Let's have a quick view of the Data we have

In [24]:
df.head()

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


In [25]:
df.shape

(962, 2)

There are 962 observations we have in the data. Each observation represents the complete details of each candidate so we have 962 resumes for screening.

# Data Preprocessing
# 1.Cleaning the 'Resume' Column
Removing unnecessary information from resumes like URLs, hashtags, and special characters.


In [26]:
import re

In [27]:
def cleanResume(resumeText):
    resumeText = re.sub('httpS+s*', ' ', resumeText)  # remove URLs
    resumeText = re.sub('RT|cc', ' ', resumeText)  # remove RT and cc
    resumeText = re.sub('#S+', '', resumeText)  # remove hashtags
    resumeText = re.sub('@S+', '  ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"""), ' ', resumeText)  # remove punctuations
    resumeText = re.sub(r'[^x00-x7f]',r' ', resumeText) 
    resumeText = re.sub('s+', ' ', resumeText)  # remove extra whitespace
    return resumeText
df['cleaned_resume'] = df.Resume.apply(lambda x: cleanResume(x))



In [28]:
df.head()

Unnamed: 0,Category,Resume,cleaned_resume
0,Data Science,Skills * Programming Languages: Python (pandas...,Skill Programming Language P thon panda ...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,Education Detail Ma 2013 to Ma 2017 B E ...
2,Data Science,"Areas of Interest Deep Learning, Control Syste...",Area of Intere t Deep Learning Control S te...
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skill R P thon SAP HANA Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",Education Detail MCA YMCAUST Faridabad...


# 2.Encoding 'Category'
Now, encoding the 'Category' column using LabelEncoding. Even though the 'Category' column is 'Nominal' data we are using LabelEncoder because the 'Category' column is our 'target' column. By performing LabelEncoding each category will become a class and we will be building a multiclass classification model.

In [29]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Category']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])

# 3. Preprocessing 'cleaned_resume' column
Converting the 'cleaned_resume' column into vectors. There are many ways to do that like 'Bag of Words','Tf-Idf','Word2Vec' and combination of these methods.

I will use 'Tf-Idf' method to get the vectors in this approach.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
requiredText = df['cleaned_resume'].values
requiredTarget = df['Category'].values
word_vectorizer = TfidfVectorizer(
    sublinear_tf= True,
    stop_words ='english',
    max_features = 1500
)
word_vectorizer.fit(requiredText)
WordFeatures = word_vectorizer.transform(requiredText)

# Building Model
I will use 'One vs Rest' method with 'KNeighborsClassifier' to build this multiclass classification model.

Using 80% data for training and 20% data for validation. Let's split the data now into training and test set.

In [31]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(WordFeatures,
                                                 requiredTarget, 
                                                 random_state=0,
                                                 test_size=0.2)

In [32]:
print(X_train.shape)

(769, 1500)


In [33]:
print(X_test.shape)

(193, 1500)


In [34]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
clf  = OneVsRestClassifier(KNeighborsClassifier())
clf.fit(X_train,y_train)
prediction = clf.predict(X_test)

# Results
Let's see the results


In [35]:
print('Accuracy of KNeighbors Classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of KNeighbors Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))


Accuracy of KNeighbors Classifier on training set: 0.99
Accuracy of KNeighbors Classifier on test set: 0.99


We can see that results are awesome. We are able to classify each Category of a given resume with 99% accuracy.

Checking the detailed classification report for each class or category.


In [36]:
from sklearn.metrics import classification_report
print(classification_report(y_test,prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         3
           1       1.00      1.00      1.00         3
           2       1.00      0.80      0.89         5
           3       1.00      1.00      1.00         9
           4       1.00      1.00      1.00         6
           5       1.00      1.00      1.00         5
           6       1.00      1.00      1.00         9
           7       1.00      1.00      1.00         7
           8       1.00      0.91      0.95        11
           9       1.00      1.00      1.00         9
          10       1.00      1.00      1.00         8
          11       0.90      1.00      0.95         9
          12       1.00      1.00      1.00         5
          13       1.00      1.00      1.00         9
          14       1.00      1.00      1.00         7
          15       1.00      1.00      1.00        19
          16       1.00      1.00      1.00         3
          17       1.00    

Where 0,1,2... are the job categories. We get the actual labels from the label encoder that we used.

In [37]:
le.classes_

array(['Advocate', 'Arts', 'Automation Testing', 'Blockchain',
       'Business Analyst', 'Civil Engineer', 'Data Science', 'Database',
       'DevOps Engineer', 'DotNet Developer', 'ETL Developer',
       'Electrical Engineering', 'HR', 'Hadoop', 'Health and fitness',
       'Java Developer', 'Mechanical Engineer',
       'Network Security Engineer', 'Operations Manager', 'PMO',
       'Python Developer', 'SAP Developer', 'Sales', 'Testing',
       'Web Designing'], dtype=object)