## Resume parser

#### Authors: Anushree Manoharrao | Lavanya Kumaran | Sadiya Sayara Chowdhury Puspo | Srija Reddy Kallu

### Task 2: Picking the top 3 categories that best suit the resume

Importing all the required packages

In [1]:
import re

import pandas as pd
import numpy as np
from wordcloud import WordCloud
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics
from joblib import dump, load
import docx2txt

In [2]:
#Loading the English stopword list from NLTK (run nltk.download('stopwords') once if the corpus is missing)
stop_words = stopwords.words('english')

Reading the resume dataset and displaying the first 5 records

In [20]:
df_cv = pd.read_csv('./UpdatedResumeDataSet.csv')
df_cv.head(5)

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills • R • Python • SAP HANA • Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


In [4]:
#Function to tokenize the text, remove stopwords, and drop tokens of three characters or fewer
def remove_stop_words (text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in stop_words:
            result.append(token)
    return result
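
To make the effect of the length filter concrete, here is a minimal, dependency-free sketch of the same filtering idea. It is not the notebook's function: `filter_tokens` and `DEMO_STOPWORDS` are hypothetical names, the regular expression is only a rough stand-in for `gensim.utils.simple_preprocess`, and the stopword set is a tiny illustrative sample rather than the NLTK or gensim lists.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses the much
# larger NLTK and gensim lists.
DEMO_STOPWORDS = {"with", "and", "the", "for", "have", "using"}

def filter_tokens(text, stopwords=DEMO_STOPWORDS, min_len=4):
    """Lowercase the text, keep alphabetic runs, then drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

sample = "Skills: Python, SQL, R and AWS with Machine Learning"
kept = filter_tokens(sample)
print(kept)
```

Because tokens must be longer than three characters, short skill names such as `SQL`, `R`, and `AWS` are discarded along with the stopwords, which may matter for technical resumes.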

In [22]:
#Cleaning the Resume column in the dataframe and displaying the first 2 records
df_cv['clean'] = df_cv['Resume'].apply(remove_stop_words).astype(str)
df_cv.head(2)

Unnamed: 0,Category,Resume,clean
0,Data Science,Skills * Programming Languages: Python (pandas...,"['skills', 'programming', 'languages', 'python..."
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,"['education', 'details', 'rgpv', 'data', 'scie..."


Constructing a pipeline that chains a CountVectorizer (to turn the cleaned text into token counts) with a Random Forest classifier

In [7]:
model_pipeline = make_pipeline(
    CountVectorizer(), 
    RandomForestClassifier(random_state=29)
)

In [6]:
#Splitting the data into training and test set with 80% as training set and 20% as test set
X_train, X_test, Y_train, Y_test = train_test_split(df_cv['clean'], df_cv['Category'], test_size = 0.2)
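
The split above passes no `random_state`, so each run may draw a different test set and the scores reported later can shift between runs. The sketch below illustrates the idea of a seeded split in plain Python; `seeded_split` is a hypothetical helper, not scikit-learn's implementation. In the notebook itself, the equivalent would presumably be passing `random_state` (and optionally `stratify`) to `train_test_split`.

```python
import random

def seeded_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed and hold out the last test_size share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train_a, test_a = seeded_split(rows, seed=29)
train_b, test_b = seeded_split(rows, seed=29)

print(len(train_a), len(test_a))
# The same seed reproduces the same partition.
print(train_a == train_b and test_a == test_b)
```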

In [8]:
#Keys are the step names and values are the step objects.
model_pipeline.named_steps

{'countvectorizer': CountVectorizer(),
 'randomforestclassifier': RandomForestClassifier(random_state=29)}
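
The first pipeline step turns each cleaned resume into a vector of token counts before the random forest sees it. A minimal plain-Python sketch of that bag-of-words idea follows; `fit_vocabulary` and `to_count_vector` are hypothetical helpers, and whitespace splitting is a simplification of `CountVectorizer`'s default tokenizer.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for tok, idx in vocabulary.items():
        vector[idx] = counts[tok]
    return vector

# Toy documents, for illustration only.
docs = ["python data python", "java spring data"]
vocab = fit_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

print(vocab)
print(vectors)
```

Each row of `vectors` has one entry per vocabulary token, so documents of any length end up as fixed-width numeric features that a classifier can consume.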

#### Performing cross validation using GridSearchCV

GridSearchCV performs an exhaustive search over the specified parameter values for an estimator. Its most important methods are fit and predict.

GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

In [9]:
param_grid = { 
    'randomforestclassifier__n_estimators': [200,500],
    'randomforestclassifier__max_features': ['sqrt', 'log2'],
    'randomforestclassifier__max_depth' : [4,5,6,7,8],
    'randomforestclassifier__criterion' :['gini', 'entropy']
}

grid = GridSearchCV(estimator=model_pipeline, 
                    param_grid=param_grid, 
                    cv= 5, scoring='accuracy', 
                    return_train_score=False, verbose=1)
grid_search = grid.fit(X_train, Y_train)
#Printing the best parameters found by cross-validation
print(grid_search.best_params_)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
{'randomforestclassifier__criterion': 'entropy', 'randomforestclassifier__max_depth': 8, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__n_estimators': 200}
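
The candidate and fit counts in the log above follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A short sketch that enumerates the combinations with the standard library (the variable names `cv_folds`, `candidates`, and `n_fits` are illustrative, not part of scikit-learn):

```python
from itertools import product

param_grid = {
    "randomforestclassifier__n_estimators": [200, 500],
    "randomforestclassifier__max_features": ["sqrt", "log2"],
    "randomforestclassifier__max_depth": [4, 5, 6, 7, 8],
    "randomforestclassifier__criterion": ["gini", "entropy"],
}
cv_folds = 5

# One candidate per element of the Cartesian product of the value lists.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)
```

Adding one more value to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.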


In [10]:
#Setting the parameters of this estimator
model_pipeline.set_params(**grid_search.best_params_)

In [11]:
#Fitting the model
model_pipeline.fit(X_train, Y_train)

In [12]:
#Printing the training and test score
print("training Score: {:.2f}".format(model_pipeline.score(X_train, Y_train)))
print("test Score: {:.2f}".format(model_pipeline.score(X_test, Y_test)))

training Score: 0.99
test Score: 0.96


In [13]:
#Predicting using X_test
prediction = model_pipeline.predict(X_test)
#print("model report: %s: \n %s\n" % (model_pipeline, metrics.classification_report(Y_test, prediction)))

In [14]:
#Creating a dataframe to compare the actual category vs. the predicted category
#The index of Y_test is reset so both columns line up on positions 0..n-1
Compare_categories = pd.DataFrame({
    "Category": Y_test.reset_index(drop=True),
    "Category_Predicted": prediction,
})

The rows below are the test resumes whose category the model predicted incorrectly

In [15]:
Compare_categories.loc[Compare_categories['Category'] != Compare_categories['Category_Predicted']]

Unnamed: 0,Category,Category_Predicted
9,Advocate,Arts
12,Advocate,Arts
45,DotNet Developer,Java Developer
46,Advocate,HR
47,Advocate,HR
78,Automation Testing,Java Developer
154,DevOps Engineer,Java Developer
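
To see where the errors concentrate, the misclassified rows can be tallied by actual and by predicted category. The sketch below copies the seven (actual, predicted) pairs from the table above into a list by hand; the names `errors`, `by_actual`, and `by_predicted` are illustrative.

```python
from collections import Counter

# (actual, predicted) pairs copied from the misclassification table above.
errors = [
    ("Advocate", "Arts"),
    ("Advocate", "Arts"),
    ("DotNet Developer", "Java Developer"),
    ("Advocate", "HR"),
    ("Advocate", "HR"),
    ("Automation Testing", "Java Developer"),
    ("DevOps Engineer", "Java Developer"),
]

by_actual = Counter(actual for actual, _ in errors)
by_predicted = Counter(predicted for _, predicted in errors)

print(by_actual.most_common())
print(by_predicted.most_common())
```

In this run, four of the seven errors are Advocate resumes, and the three technical roles that were missed were all predicted as Java Developer.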


In [16]:
#Dumping the model as a .joblib file for future use
dump(model_pipeline, "model_pipeline.joblib")

['model_pipeline.joblib']

In [17]:
#Loading the dumped model to test for a new resume
model_pipeline = load("model_pipeline.joblib")

In [18]:
#Function that reads a resume from a .docx file, cleans it, and returns the three most probable categories
def get_category(path):
    resume = docx2txt.process(path)
    resume = remove_stop_words(resume)
    resume = pd.Series(" ".join(resume))
    probs = model_pipeline.predict_proba(resume)[0]
    rf = model_pipeline['randomforestclassifier']
    return pd.DataFrame({"Category":rf.classes_, "Prob":probs}).sort_values("Prob", ascending=False, ignore_index= True).head(3)

In [23]:
print("The top 3 categories that best suit your resume are")
get_category("sample-resumes.docx")

The top 3 categories that best suit your resume are


Unnamed: 0,Category,Prob
0,HR,0.104143
1,Arts,0.085096
2,Java Developer,0.082869
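
The ranking step inside `get_category` amounts to pairing each class label with its predicted probability and keeping the three largest. A minimal standard-library sketch of that selection follows; `top_k` is a hypothetical helper, and the labels and probabilities are made up for illustration rather than taken from the model.

```python
import heapq

def top_k(classes, probs, k=3):
    """Return the k (class, probability) pairs with the highest probability."""
    return heapq.nlargest(k, zip(classes, probs), key=lambda pair: pair[1])

# Hypothetical class labels and probabilities, for illustration only.
classes = ["Arts", "HR", "Java Developer", "Testing", "Advocate"]
probs = [0.25, 0.40, 0.20, 0.10, 0.05]

best = top_k(classes, probs)
print(best)
```

Since `predict_proba` returns probabilities that sum to 1 across all classes, a highest value near 0.10, as in the output above, suggests the forest is spreading its probability mass over many categories rather than strongly favoring one.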

