# Resume/CV classifier using NLP

Our aim is to build an application that the HR team can use to filter resumes based on skills.

For this, we take a dataset from Kaggle, clean and pre-process it, and fit a classification model that helps the HR team filter resumes by job role.

## Dataset
The dataset for this model is taken from Kaggle. Link to the data: *"https://www.kaggle.com/datasets/jillanisofttech/updated-resume-dataset"*

## NLP

For this problem we use spaCy, an open-source NLP package for processing text data. In particular, we use its pre-trained 'en_core_web_lg' pipeline for our analysis.

Reference: "https://spacy.io/usage/spacy-101/"

In [None]:
!python -m spacy download en_core_web_lg


In [None]:
#importing the necessary spaCy modules
import spacy #using spaCy for vectorization
nlp = spacy.load('en_core_web_lg') #en_core_web_lg is a pre-trained English pipeline from spaCy




In [None]:
!pip install unidecode

In [None]:
import pandas as pd
import numpy as np
import re
import string
import unidecode

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

We'll read the data into a pandas DataFrame.

In [None]:
df=pd.read_csv("UpdatedResumeDataSet.csv") 

In [None]:
df.shape

(962, 2)

We have 962 samples for this analysis.

In [None]:
df.head()


Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


In [None]:
df.Category.value_counts()

Java Developer               84
Testing                      70
DevOps Engineer              55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
Blockchain                   40
ETL Developer                40
Operations Manager           40
Data Science                 40
Sales                        40
Mechanical Engineer          40
Arts                         36
Database                     33
Electrical Engineering       30
Health and fitness           30
PMO                          30
Business Analyst             28
DotNet Developer             28
Automation Testing           26
Network Security Engineer    25
SAP Developer                24
Civil Engineer               24
Advocate                     20
Name: Category, dtype: int64

There are 25 job categories with uneven sample counts (from 20 to 84 per class), so this is a multiclass classification problem.

## Data pre-processing and cleaning

Our text data contains special characters, digits, extra whitespace, single characters, accented characters and stop words, which contribute little when fitting the model. We strip them out using the function below.

In [None]:
#we need to pre-process and clean the data
STOP_WORDS = set(stopwords.words('english')) #build the stop-word set once instead of once per word

def clean_words(text):
    """Basic cleaning of texts"""

    # remove html tags
    text = re.sub(r"(<.*?>)", "", text)

    # replace non-word characters (punctuation, symbols) and digits with a space
    text = re.sub(r"(\W|\d)", " ", text)

    # strip leading and trailing whitespace
    text = text.strip()

    # remove single characters surrounded by whitespace
    text = re.sub(pattern=r'\s+[a-zA-Z]\s+', repl=" ", string=text)

    # transliterate accented characters to their closest ASCII form (e.g. "â" becomes "a")
    text = unidecode.unidecode(text)

    # convert to lowercase
    text = text.lower()

    # remove stop words
    words = [word for word in text.split() if word not in STOP_WORDS]
    text = " ".join(words)

    # we skip tokenization because spaCy's model takes a string or Doc as input, not a list of words
    # we also skip stemming and lemmatization because they would alter skill names and other resume terms

    return text
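
To see what the regex and stop-word steps do to a small input, here is a minimal sketch that uses only Python's `re` module. The sample string and the two-word stop list are made up for illustration (the notebook uses NLTK's full English list), and the unidecode step is left out.

```python
import re

# Hypothetical sample text and a tiny stand-in stop-word list (assumptions for this sketch)
sample = "<p>Skills: Python 3, SQL & R programming and the basics</p>"
toy_stop_words = {"and", "the"}

text = re.sub(r"<.*?>", "", sample)           # drop html tags
text = re.sub(r"\W|\d", " ", text).strip()    # punctuation, symbols and digits become spaces
text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)   # drop single letters surrounded by whitespace
text = text.lower()
cleaned = " ".join(w for w in text.split() if w not in toy_stop_words)

print(cleaned)  # skills python sql programming basics
```

Note that the single-letter step also drops the language name R. The same effect is visible in the notebook's own output, where "Skills â¢ R â¢ Python ..." becomes "skills python ...". If single-letter skills such as R or C matter, that step may need an exception list.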

In [None]:
df['cleaned_text']=df.Resume.map(clean_words) #applying the function to every resume in the column

In [None]:
df.head()

Unnamed: 0,Category,Resume,cleaned_text
0,Data Science,Skills * Programming Languages: Python (pandas...,skills programming languages python pandas num...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,education details may may e uit rgpv data scie...
2,Data Science,"Areas of Interest Deep Learning, Control Syste...",areas interest deep learning control system de...
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,skills python sap hana tableau sap hana sql sa...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",education details mca ymcaust faridabad haryan...


Now that the text is cleaned, it is ready to be vectorized.

### Vectorizing text data

ML algorithms only accept numerical data as input, so we need to convert the text into numbers before fitting a model. For this we use the `.vector` attribute of the spaCy Doc. With 'en_core_web_lg' it returns a fixed-size vector of length 300 for each text, as a numpy ndarray, computed by default as the average of the token vectors.

In [None]:
df['vectorized_data']=df.cleaned_text.apply(lambda text: nlp(text).vector)
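
To make the idea of a single fixed-size vector per document concrete, here is a simplified sketch of averaging word vectors. The 3-dimensional `toy_vectors` and the helper `average_vector` are made up for illustration. The real pipeline uses 300 dimensions and spaCy's own vector table.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (assumption; the real model uses 300 dimensions)
toy_vectors = {
    "python": np.array([1.0, 0.0, 2.0]),
    "sql": np.array([0.0, 1.0, 2.0]),
    "tableau": np.array([2.0, 2.0, 2.0]),
}

def average_vector(text, vectors):
    """Average the vectors of all words in `text` (every word must be in `vectors`)."""
    return np.mean([vectors[word] for word in text.split()], axis=0)

doc_vector = average_vector("python sql tableau", toy_vectors)
print(doc_vector)        # [1. 1. 2.]
print(doc_vector.shape)  # (3,)
```

Whatever the length of the text, the result has the same shape, which is what lets us feed resumes of different lengths to the same classifier.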

In [None]:
df.head()

Unnamed: 0,Category,Resume,cleaned_text,vectorized_data
0,Data Science,Skills * Programming Languages: Python (pandas...,skills programming languages python pandas num...,"[0.00019259728, -0.063411176, 0.022081006, 0.5..."
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,education details may may e uit rgpv data scie...,"[-0.38565177, 0.41584927, -0.22689901, 0.05342..."
2,Data Science,"Areas of Interest Deep Learning, Control Syste...",areas interest deep learning control system de...,"[-0.34571993, 0.064293616, -0.47398934, 0.1207..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,skills python sap hana tableau sap hana sql sa...,"[-0.06141665, -0.009636918, -0.60103905, 0.345..."
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",education details mca ymcaust faridabad haryan...,"[0.102590054, -0.23031569, -0.4390396, -0.2980..."


The pre-processing and vectorization part is done. Now we will fit models to the dataset and check how accurately they predict the category.

## Applying ML classification

In [None]:
#importing necessary ML packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [None]:
x=df['vectorized_data']
y=df['Category']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.3,
    random_state=1049
)
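
Since the classes are uneven (20 to 84 resumes each), a plain random split can leave some categories with very few test samples. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. A minimal sketch on made-up data (`X_demo` and `y_demo` are placeholders, not the resume vectors):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 70 of class "A" and 30 of class "B" (assumption for this sketch)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array(["A"] * 70 + ["B"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    random_state=1049,
    stratify=y_demo,   # keep the 70/30 class ratio in both splits
)

print((y_te == "A").sum(), (y_te == "B").sum())  # 21 9
```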

### KNN

In [None]:
knn=KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean') #using k=5

Note that each row of our feature column holds its own numpy array, so x_train and x_test are pandas Series of arrays. Before fitting, we stack them into 2D arrays with one row per resume and one column per vector component.

In [None]:
#using numpy stack method
x_train=np.stack(x_train)
x_test=np.stack(x_test)
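
As a small illustration of what `np.stack` does here, the sketch below builds a container of per-row arrays and stacks it. The two length-3 vectors are stand-ins for the length-300 document vectors.

```python
import numpy as np

# Two stand-in "document vectors" of length 3 (the notebook's vectors have length 300)
rows = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]

# An object array holding one array per element, similar to how the pandas column stores them
nested = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    nested[i] = row

stacked = np.stack(nested)   # joins the per-row arrays along a new first axis

print(nested.shape)   # (2,)
print(stacked.shape)  # (2, 3)
```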

In [None]:
knn.fit(x_train,y_train)
y_predtest=knn.predict(x_test)

In [None]:
print(classification_report(y_test, y_predtest))

                           precision    recall  f1-score   support

                 Advocate       1.00      1.00      1.00         6
                     Arts       0.80      1.00      0.89         8
       Automation Testing       1.00      0.50      0.67         4
               Blockchain       1.00      1.00      1.00        10
         Business Analyst       0.57      0.80      0.67         5
           Civil Engineer       1.00      0.50      0.67        10
             Data Science       1.00      1.00      1.00        15
                 Database       1.00      1.00      1.00         8
          DevOps Engineer       0.77      0.91      0.83        11
         DotNet Developer       1.00      0.82      0.90        11
            ETL Developer       0.79      1.00      0.88        11
   Electrical Engineering       0.75      1.00      0.86        12
                       HR       1.00      0.47      0.64        15
                   Hadoop       1.00      1.00      1.00     

### Gradient boosting classifier
Now we fit an ensemble classification model, a gradient boosting classifier, and check its accuracy and F1 score.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
gbc=GradientBoostingClassifier(learning_rate=0.1,n_estimators=100)

In [None]:
gbc.fit(x_train,y_train)
y_predtestgbc=gbc.predict(x_test)

In [None]:
print(classification_report(y_test, y_predtestgbc))

                           precision    recall  f1-score   support

                 Advocate       1.00      1.00      1.00         6
                     Arts       1.00      1.00      1.00         8
       Automation Testing       1.00      1.00      1.00         4
               Blockchain       1.00      1.00      1.00        10
         Business Analyst       1.00      1.00      1.00         5
           Civil Engineer       1.00      1.00      1.00        10
             Data Science       1.00      0.73      0.85        15
                 Database       1.00      1.00      1.00         8
          DevOps Engineer       1.00      0.91      0.95        11
         DotNet Developer       1.00      1.00      1.00        11
            ETL Developer       1.00      1.00      1.00        11
   Electrical Engineering       1.00      1.00      1.00        12
                       HR       1.00      1.00      1.00        15
                   Hadoop       1.00      1.00      1.00     

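Wow! The results are clearly better than KNN: in the rows shown, almost every class reaches an F1 score of 1.00, although recall for Data Science drops from 1.00 to 0.73.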
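
To read the columns of these reports, it helps to see how precision, recall and F1 come from raw counts. In the KNN report, for example, HR has precision 1.00 but recall 0.47: everything predicted as HR was HR, but more than half of the HR resumes were labelled as something else. The sketch below reproduces that pattern on a made-up set of labels. The helper `per_class_scores` is hypothetical and not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: one HR resume is wrongly predicted as Sales
y_true = ["HR", "HR", "HR", "Sales", "Sales", "Arts"]
y_pred = ["HR", "Sales", "HR", "Sales", "Sales", "Arts"]

precision, recall, f1 = per_class_scores(y_true, y_pred, "HR")
print(f"HR: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# HR: precision=1.00 recall=0.67 f1=0.80
```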

## Model testing

We have a resume as a plain-text file. We now apply the same cleaning and pre-processing to make it ready for prediction.

In [None]:
with open("resumedata.txt", 'r') as file: #resumedata.txt is an output file from the "resume parser driver code" program
    data_str=file.read()

In [None]:
vect_data=nlp(clean_words(data_str)).vector

In [None]:
vect_data

array([-0.03519086, -0.23032434,  0.8617982 ,  0.7763227 ,  1.4544411 ,
       -0.38665673,  0.51583856,  2.6982753 , -2.0374744 , -0.65372753,
        3.7231057 ,  2.0689526 , -3.488773  ,  1.9285731 , -0.6332722 ,
        0.68342286,  2.2526276 ,  1.5785108 , -1.6024902 , -0.2917276 ,
        0.4299252 ,  1.3061293 , -1.8252627 ,  0.5566585 , -1.5120682 ,
       -1.6978103 , -0.47150263, -1.5305506 , -0.51427376,  0.5980255 ,
       -0.00430435,  0.73801225, -1.0413609 ,  0.06293961,  0.8704292 ,
       -0.4150487 ,  0.29026455,  0.17028241,  1.4742746 ,  0.41138595,
        0.58001614, -0.28210592, -0.3220857 ,  0.29367995, -1.1440262 ,
        1.0020949 ,  1.0268222 , -1.836038  , -0.11047358, -1.2677326 ,
        0.09761833,  1.331491  , -0.33938974, -2.1910796 , -1.2874818 ,
        0.75148696, -1.3914795 ,  1.5129235 ,  0.35978946, -1.4824693 ,
        1.6968132 ,  1.5539579 , -1.9627119 , -0.43256775,  1.2096732 ,
        2.1433537 , -1.7191191 , -2.7408767 , -0.07127384,  1.67

Note that the output is a 1D numpy array, but the model expects a 2D array as input, with one row per sample and one column per feature.
You can skip this step if you are predicting for several CVs at once and have stacked their vectors into a 2D array. Here we predict for just one, so we reshape it
with .reshape(1,-1).

In [None]:
sample_pred=gbc.predict(vect_data.reshape(1, -1))
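
The shapes involved can be checked with a short sketch. The array `single` is a stand-in for one resume vector, not real data.

```python
import numpy as np

single = np.arange(300, dtype=np.float32)   # stand-in for one resume vector, shape (300,)
as_row = single.reshape(1, -1)              # one sample with 300 features, shape (1, 300)

# For several resumes, stacking their vectors already gives a 2D array
batch = np.stack([single, single * 2])      # shape (2, 300)

print(single.shape, as_row.shape, batch.shape)  # (300,) (1, 300) (2, 300)
```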

In [None]:
sample_pred

array(['Java Developer'], dtype=object)

Okay, the sample resume is predicted as Java Developer. We will keep working on tuning the model and the data as well.
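
A single predicted label hides how sure the model is. One way to look into a prediction like this, sketched here on made-up data rather than the resume vectors, is `predict_proba`, which scikit-learn's `GradientBoostingClassifier` provides. The names `X_demo`, `y_demo`, `model` and `sample` are placeholders for this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Made-up training data: 3 separated classes in 4 dimensions (placeholders, not resume vectors)
rng = np.random.default_rng(0)
centers = {"HR": 0.0, "Java Developer": 3.0, "Testing": 6.0}
X_demo = np.vstack([rng.normal(loc=c, size=(20, 4)) for c in centers.values()])
y_demo = np.repeat(list(centers.keys()), 20)

model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

sample = np.full((1, 4), 3.0)               # one sample, already 2D
probs = model.predict_proba(sample)[0]      # one probability per class, in model.classes_ order
ranked = np.argsort(probs)[::-1]            # class indices from most to least likely

for i in ranked:
    print(f"{model.classes_[i]}: {probs[i]:.2f}")
```

If the top two probabilities are close, the prediction is a candidate for manual review rather than automatic filtering.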