# Resume Classification 

## Objectives

* perform data preprocessing, EDA and feature extraction on the Resume dataset
* perform multinomial Naive Bayes classification on the Resume dataset

### Dataset description

The data is in CSV format, with two features: Category and Resume.

**Category** - the industry sector to which the resume belongs.

**Resume** - the complete CV text of the candidate.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from matplotlib.gridspec import GridSpec
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

Dataset Source Reference: [Resume dataset](https://www.kaggle.com/gauravduttakiit/resume-dataset/download) 

#### Download the data 

**Read the `UpdatedResumeDataSet.csv` dataset**

In [None]:
# read the dataset
df = pd.read_csv('UpdatedResumeDataSet.csv')

In [None]:
df.head()

In [None]:
df.shape

### Pre-processing and EDA

**Display all the categories of resumes and their counts in the dataset**



In [None]:
# Display the distinct categories of resume
df['Category'].unique()

In [None]:
# Display the distinct categories of resume and the number of records belonging to each category
df['Category'].value_counts()

**Create the count plot of different categories**

In [None]:
plt.figure(figsize=(10, 10))
# Horizontal bars: one bar per category, length = number of resumes
sns.countplot(y='Category', data=df)
plt.show()

**Create a pie plot depicting the percentage of resume distributions category-wise**

In [None]:
# Share of resumes per category, shown as percentages
targetCounts = df['Category'].value_counts()
targetLabels = targetCounts.index.tolist()
plt.figure(figsize=(10, 10))
cmap = plt.cm.coolwarm
colors = cmap(np.linspace(0., 1., len(targetLabels)))
plt.pie(targetCounts, labels=targetLabels, colors=colors, autopct='%1.1f%%')
plt.show()

**Convert all the `Resume` text to lower case**




In [None]:
# Convert all characters to lowercase
df['Resume'] = df['Resume'].str.lower()
df.head()

### Cleaning resumes' text data

**A function to clean the resume text**

The text contains special characters, URLs, hashtags, mentions, and similar noise. Remove:

* URLs
* `RT` and `cc` markers
* hashtags (`#`) and mentions (`@`)
* punctuation
* extra whitespace
 

In [None]:
df.tail(20)

In [None]:
import re
def cleanResume(resumeText):
  resumeText = re.sub(r'https?://\S+', '', resumeText)
  resumeText = re.sub(r'RT|cc', '', resumeText)
  resumeText = re.sub(r'#\S+', '', resumeText)
  resumeText = re.sub(r'@\S+', '', resumeText)
  resumeText = re.sub(r'[^a-zA-Z0-9\s]', '', resumeText)
  # Collapse runs of whitespace (r'\S+' would delete every word instead)
  resumeText = re.sub(r'\s+', ' ', resumeText).strip()
  return resumeText


In [None]:
import re
def cleanResume(resumeText):
  # Removing the URLs
  resumeText = re.sub(r'http\S+', '', resumeText)

  # Removing RTs
  resumeText = re.sub(r'RT|cc:\S+', '', resumeText)

  # Removing mentions(@)
  resumeText = re.sub(r'@\S+', '', resumeText)

  # Removing hashtags(#)
  resumeText = re.sub(r'#\S+', '', resumeText)

  # Removing punctuations
  resumeText = re.sub(r'[^a-zA-Z0-9\s]', '', resumeText)

  # Removing Extra whitespaces
  resumeTokens = resumeText.split()
  resumeTokens = [token.strip() for token in resumeTokens]
  resumeText = " ".join(resumeTokens)

  return resumeText
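
One detail worth checking: `df['Resume']` was lowercased earlier, so the uppercase `RT` alternative in the pattern above cannot match, and `cc` is only removed when it is followed by a colon. A minimal sketch of a variant that targets standalone `rt` / `cc` tokens is shown below. The function name `clean_resume_text` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_resume_text(text):
    """Hypothetical variant of cleanResume for text that is already lowercased."""
    text = re.sub(r'https?://\S+', ' ', text)                        # URLs
    text = re.sub(r'\b(?:rt|cc)\b', ' ', text, flags=re.IGNORECASE)  # standalone rt / cc tokens
    text = re.sub(r'[@#]\S+', ' ', text)                             # mentions and hashtags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)                       # punctuation
    return ' '.join(text.split())                                    # collapse whitespace

sample = "rt @hr: account manager, see https://example.com #hiring   cc team!"
cleaned_sample = clean_resume_text(sample)
print(cleaned_sample)  # account manager see team
```

The word boundaries (`\b`) keep words such as "account" intact, which a bare `cc` pattern would damage. Swapping this in would change the cleaned text, so the accuracies reported later may shift slightly.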

In [None]:
df['cleaned_resume'] = df['Resume'].apply(cleanResume)

In [None]:
df.head()

In [None]:
sent_lens = []
for i in df.cleaned_resume:
    length = len(i.split())
    sent_lens.append(length)
    
print(len(sent_lens))
print(max(sent_lens))

### Stopwords removal

Stopwords such as `and`, `the`, and `was` appear very frequently in text and carry little information for prediction, so they are usually removed for text analytics and text classification.

1. Tokenize the input text into individual tokens and store them in a list
2. Using `nltk.corpus.stopwords`, remove the stopwords


**Use the `nltk` package to find the most common words in the `cleaned_resume` column**

In [None]:
# stop words
stopword_list = nltk.corpus.stopwords.words('english')
print(stopword_list)

In [None]:
def remove_stopwords(resumeText):

  resumeTokens = resumeText.split()
  resumeTokens = [token for token in resumeTokens if token not in stopword_list]
  resumeText = " ".join(resumeTokens)

  return resumeText


df['cleaned_resume'] = df['cleaned_resume'].apply(remove_stopwords)


In [None]:
# most common words
from nltk.probability import FreqDist
c_words = []
def common_words(text):
  words = nltk.tokenize.word_tokenize(text)
  for word in words:
    c_words.append(word)

df['cleaned_resume'].apply(common_words)
Common_words_freq = FreqDist(c_words)
Common_words_freq

In [None]:
plt.figure(figsize=(10,10))
words = ' '.join(word for word in c_words)
WC = WordCloud(width=1000, height=500, max_words=500, min_font_size=5)
Common_words_wc = WC.generate(words)
plt.imshow(Common_words_wc, interpolation='bilinear')
plt.show()

**Convert the categorical variable `Category` to a numerical feature stored in a separate column, which serves as the target variable**

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Category_class'] = le.fit_transform(df['Category'])
print(df['Category_class'].unique())
df['Category_class'].value_counts()
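
To make the integer codes easier to read back, here is a small sketch of how `LabelEncoder` assigns them and how `inverse_transform` recovers the names. The three category strings are placeholders chosen for illustration and are not taken from the dataset output.

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
# Classes are sorted alphabetically before codes are assigned
codes = toy_le.fit_transform(["Testing", "HR", "Java Developer", "HR"])
print(codes)                                   # [2 0 1 0]

# Mapping from category name to integer code
mapping = {name: idx for idx, name in enumerate(toy_le.classes_)}
print(mapping)

# Going back from codes (e.g. model predictions) to category names
names = toy_le.inverse_transform([2, 0])
print(names)
```

The same `inverse_transform` call on the notebook's `le` turns an array of predicted class numbers back into category names.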


### Feature Extraction

**Convert the text to feature vectors by applying `TfidfVectorizer` to the `cleaned_resume` column; the label-encoded `Category_class` column made above is the target**

`TfidfVectorizer` tokenizes the documents, learns the vocabulary and the inverse document frequency weightings, and lets you encode new documents with them.
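
As a quick illustration of those steps, the sketch below fits a vectorizer on a three-document toy corpus and then encodes an unseen document. The corpus and the `toy_` names are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "python sql machine learning",
    "java spring sql",
    "python deep learning",
]

toy_tv = TfidfVectorizer()
toy_matrix = toy_tv.fit_transform(toy_docs)   # tokenize, learn vocabulary and IDF weights

print(sorted(toy_tv.vocabulary_))             # the learned vocabulary (7 distinct terms)
print(toy_matrix.shape)                       # (n_documents, n_terms)

# A new document is encoded with the already-learned vocabulary;
# a word never seen during fit ("rust") is simply ignored.
new_vec = toy_tv.transform(["python rust sql"])
print(new_vec.shape, new_vec.nnz)
```

Only `python` and `sql` produce non-zero entries for the new document, which is why the Gradio predictor later in the notebook must reuse the fitted `tv` rather than fit a new vectorizer.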



In [None]:
tv = TfidfVectorizer(ngram_range=(1,2))
tfidf_CV = tv.fit_transform(df['cleaned_resume'])
print('tfidf_CV:', tfidf_CV.shape)

## Naive Bayes Classifier

**Split the data into train and test sets. Apply Naive Bayes Classifier (MultinomialNB) and evaluate the model predictions** 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_CV, df['Category_class'], test_size = 0.2, random_state = 0)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
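
Note that the vectorizer above is fitted on the full dataset before the split, so the test resumes also contribute to the vocabulary and IDF weights. A common alternative, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that everything is learned from the training portion only. The texts, labels, and variable names below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_texts = [
    "python pandas machine learning", "python numpy statistics",
    "deep learning python", "machine learning models python",
    "java spring hibernate", "java servlets jsp",
    "java spring boot", "java maven hibernate",
]
toy_labels = ["Data Science"] * 4 + ["Java Developer"] * 4

# Split the raw text, not the vectorized matrix
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

# The pipeline fits TF-IDF on the training texts only, then the classifier
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
pipe.fit(X_tr, y_tr)

print(pipe.score(X_te, y_te))                         # accuracy on held-out texts
toy_pred = pipe.predict(["spring boot java developer"])
print(toy_pred)
```

A fitted pipeline also accepts raw strings directly, which removes the need to call the vectorizer by hand at prediction time.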

In [None]:
mnb = MultinomialNB()
mnb_tfidf = mnb.fit(X_train, y_train)
print('MultinomialNB for tf-idf :', mnb_tfidf)

In [None]:
mnb_tfidf_predict = mnb.predict(X_test)
print('predictions for tf-idf :', mnb_tfidf_predict)

In [None]:
mnb_tfidf_score = accuracy_score(y_test, mnb_tfidf_predict)
print("mnb_tfidf_score :", mnb_tfidf_score)
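
Accuracy is a single number and can hide categories that are systematically confused with each other. As a hedged sketch, per-class precision and recall plus a confusion matrix can be obtained from `sklearn.metrics`. The label lists below are toy stand-ins, not results from this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for true and predicted categories
y_true_toy = ["Java Developer", "Java Developer", "HR", "HR", "Testing", "Testing"]
y_pred_toy = ["Java Developer", "HR", "HR", "HR", "Testing", "Testing"]

toy_class_names = sorted(set(y_true_toy))   # ['HR', 'Java Developer', 'Testing']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=toy_class_names)
print(cm)

# Precision, recall and F1 for every class
report = classification_report(y_true_toy, y_pred_toy, labels=toy_class_names)
print(report)
```

With the notebook's own variables, `classification_report(y_test, mnb_tfidf_predict, target_names=le.classes_)` should work along the same lines, assuming every category appears in the test split.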

## Create a Gradio based web interface to test and display the model predictions

In [None]:
!pip -qq install gradio

In [None]:
import gradio

In [None]:
# Function for preprocessing of text

def preprocess_text(text):
    text = cleanResume(text)
    text = remove_stopwords(text)

    return text

In [None]:
def predict_category_label(text):
    
    processed_text = preprocess_text(text)
    category = tv.transform([processed_text])
    pred = mnb_tfidf.predict(category)
    pred_category = le.inverse_transform(pred)
    return pred_category[0]

In [None]:
predict_category_label('skills * programming languages: python (pandas, numpy, scipy, scikit-learn, matplotlib), sql, java, javascript/jquery. * machine learning: regression, svm, naã¯ve bayes, knn, random forest, decision trees, boosting techniques, cluster analysis, word embedding, sentiment analysis, natural language processing, dimensionality reduction, topic modelling (lda, nmf), pca & neural nets. * database visualizations: mysql, sqlserver, cassandra, hbase, elasticsearch d3.js, dc.js, plotly, kibana, matplotlib, ggplot, tableau. * others: regular expression, html, css, angular 6, logstash, kafka, python flask, git, docker, computer vision - open cv and understanding of deep learning.education details \r\n\r\ndata science assurance associate \r\n\r\ndata science assurance associate - ernst & young llp\r\nskill details \r\njavascript- exprience - 24 months\r\njquery- exprience - 24 months\r\npython- exprience - 24 monthscompany details \r\ncompany - ernst & young llp\r\ndescription - fraud investigations and dispute services   assurance\r\ntechnology assisted review\r\ntar (technology assisted review) assists in accelerating the review process and run analytics and generate reports.\r\n* core member of a team helped in developing automated review platform tool from scratch for assisting e discovery domain, this tool implements predictive coding and topic modelling by automating reviews, resulting in reduced labor costs and time spent during the lawyers review.\r\n* understand the end to end flow of the solution, doing research and development for classification models, predictive analysis and mining of the information present in text data. worked on analyzing the outputs and precision monitoring for the entire tool.\r\n* tar assists in predictive coding, topic modelling from the evidence by following ey standards. 
developed the classifier models in order to identify "red flags" and fraud-related issues.\r\n\r\ntools & technologies: python, scikit-learn, tfidf, word2vec, doc2vec, cosine similarity, naã¯ve bayes, lda, nmf for topic modelling, vader and text blob for sentiment analysis. matplot lib, tableau dashboard for reporting.\r\n\r\nmultiple data science and analytic projects (usa clients)\r\ntext analytics - motor vehicle customer review data * received customer feedback survey data for past one year. performed sentiment (positive, negative & neutral) and time series analysis on customer comments across all 4 categories.\r\n* created heat map of terms by survey category based on frequency of words * extracted positive and negative words across all the survey categories and plotted word cloud.\r\n* created customized tableau dashboards for effective reporting and visualizations.\r\nchatbot * developed a user friendly chatbot for one of our products which handle simple questions about hours of operation, reservation options and so on.\r\n* this chat bot serves entire product related questions. giving overview of tool via qa platform and also give recommendation responses so that user question to build chain of relevant answer.\r\n* this too has intelligence to build the pipeline of questions as per user requirement and asks the relevant /recommended questions.\r\n\r\ntools & technologies: python, natural language processing, nltk, spacy, topic modelling, sentiment analysis, word embedding, scikit-learn, javascript/jquery, sqlserver\r\n\r\ninformation governance\r\norganizations to make informed decisions about all of the information they store. 
the integrated information governance portfolio synthesizes intelligence across unstructured data sources and facilitates action to ensure organizations are best positioned to counter information risk.\r\n* scan data from multiple sources of formats and parse different file formats, extract meta data information, push results for indexing elastic search and created customized, interactive dashboards using kibana.\r\n* preforming rot analysis on the data which give information of data which helps identify content that is either redundant, outdated, or trivial.\r\n* preforming full-text search analysis on elastic search with predefined methods which can tag as (pii) personally identifiable information (social security numbers, addresses, names, etc.) which frequently targeted during cyber-attacks.\r\ntools & technologies: python, flask, elastic search, kibana\r\n\r\nfraud analytic platform\r\nfraud analytics and investigative platform to review all red flag cases.\r\nâ\x80¢ fap is a fraud analytics and investigative platform with inbuilt case manager and suite of analytics for various erp systems.\r\n* it can be used by clients to interrogate their accounting systems for identifying the anomalies which can be indicators of fraud by running advanced analytics\r\ntools & technologies: html, javascript, sqlserver, jquery, css, bootstrap, node.js, d3.js, dc.js')


In [None]:
# gradio.inputs / gradio.outputs were deprecated and later removed;
# recent Gradio releases expose components such as Textbox at the top level
in_resume = gradio.Textbox(lines=2, value="resume", label='Enter Resume Text')

In [None]:
out_label = gradio.Textbox(label='Predicted Resume Category')

In [None]:
iface = gradio.Interface(
  fn = predict_category_label, 
  inputs = [in_resume],
  outputs = [out_label])
iface.launch(share=True)

## Count Vectorizer (accuracy = 0.99)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2))
# bag-of-words counts (unigrams and bigrams) for the cleaned resumes
X = cv.fit_transform(df['cleaned_resume'])
print('cv_resume:', X.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, df['Category_class'], test_size = 0.2, random_state = 0)

In [None]:
mnb = MultinomialNB()
# fitting the NaiveBayes for count vectorizer
mnb_cv = mnb.fit(X_train, y_train)
print('MultinomialNB for Count Vectorizer :', mnb_cv)

In [None]:
mnb_cv_predict = mnb.predict(X_test)
print('predictions for Count Vectorizer :', mnb_cv_predict)

In [None]:
mnb_cv_score = accuracy_score(y_test, mnb_cv_predict)
print("mnb_cv_score :", mnb_cv_score)

mnb_cv_score : 0.9948186528497409


## TF-IDF part 2 (alpha=0) (accuracy = 0.99)

In [None]:
tv = TfidfVectorizer(ngram_range=(1,2))
# The vectorizer ignores labels, so only the text column is passed
tfidf_CV = tv.fit_transform(df['cleaned_resume'])
print('tfidf_CV:', tfidf_CV.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_CV, df['Category_class'], test_size = 0.2, random_state = 0)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
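# alpha=0 turns off additive smoothing; scikit-learn may warn and,
# depending on the version, clip alpha to a tiny positive value
mnb = MultinomialNB(alpha=0)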
mnb_tfidf = mnb.fit(X_train, y_train)
print('MultinomialNB for tf-idf :', mnb_tfidf)

In [None]:
mnb_tfidf_predict = mnb.predict(X_test)
print('predictions for tf-idf :', mnb_tfidf_predict)

predictions for tf-idf : [20 14  6 17 15 14 10 14 15  2  6 23  4 11 13  4 19  8  8  9 12 11 17 22
 19 16  5  8  3  7 20 18 22  7 23 23 22 18  7 20 10 20 14  8 15 15  8 11
  4 22  1 24 14 15 22 23  8 15  3 17 18  3  0 15 15 15 16 21 13 18 12 23
 22 12 13 22  8  7 19  4 24 14  7  1 24 13 12 10  9  8 22  9 23 11  9 23
 11 15 23 13  4 17  2  5  6 10  0 19 20 10 22 10 15 10 15 15 22  6 14  6
  0  4  5  7  9 13 23  6  9  9 21 11  5  3  9 24 19 13  8  3 13 13 11 20
 16 23 21 24  7 21 20 15 22 19 15 23  9 15 15  6  2 20  7 11 23 24  8  3
 20  2 10 22 15  2 11 23  1 23  6  3  3 24 24 12  5 23 18 22 20 20  3  6
 15]


In [None]:
mnb_tfidf_score = accuracy_score(y_test, mnb_tfidf_predict)
print("mnb_tfidf_score :", mnb_tfidf_score)

mnb_tfidf_score : 0.9948186528497409


## TF-IDF part 3 (stratify and fit_prior=False) (accuracy = 0.98)

In [None]:
tv = TfidfVectorizer(ngram_range=(1,2))
tfidf_CV = tv.fit_transform(df['cleaned_resume'])
print('tfidf_CV:', tfidf_CV.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_CV, df['Category_class'], test_size = 0.2, random_state = 0, stratify = df['Category_class'])

In [None]:
mnb = MultinomialNB(fit_prior = False)
mnb_tfidf = mnb.fit(X_train, y_train)
print('MultinomialNB for tf-idf :', mnb_tfidf)

In [None]:
mnb_tfidf_predict = mnb.predict(X_test)
print('predictions for tf-idf :', mnb_tfidf_predict)

In [None]:
mnb_tfidf_score = accuracy_score(y_test, mnb_tfidf_predict)
print("mnb_tfidf_score :", mnb_tfidf_score)

## TF-IDF part 4 (stratify, alpha=0, fit_prior=False) (accuracy = 1.0)

In [None]:
tv = TfidfVectorizer(ngram_range=(1,2))
tfidf_CV = tv.fit_transform(df['cleaned_resume'])
print('tfidf_CV:', tfidf_CV.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_CV, df['Category_class'], test_size = 0.2, random_state = 0, stratify = df['Category_class'])

In [None]:
mnb = MultinomialNB(fit_prior = False, alpha = 0)
mnb_tfidf = mnb.fit(X_train, y_train)
print('MultinomialNB for tf-idf :', mnb_tfidf)

In [None]:
mnb_tfidf_predict = mnb.predict(X_test)
print('predictions for tf-idf :', mnb_tfidf_predict)

In [None]:
mnb_tfidf_score = accuracy_score(y_test, mnb_tfidf_predict)
print("mnb_tfidf_score :", mnb_tfidf_score)

mnb_tfidf_score : 1.0
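
The variants above are each compared on a single 80/20 split, where one or two resumes changing sides can move the accuracy. A more robust comparison, sketched here under assumptions, is a stratified cross-validated grid search over `alpha` and `fit_prior`. The toy corpus, the parameter grid, and the two-fold setting are illustrative choices, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_texts = [
    "python pandas machine learning", "python numpy statistics",
    "deep learning python", "machine learning models python",
    "java spring hibernate", "java servlets jsp",
    "java spring boot", "java maven hibernate",
]
toy_labels = ["Data Science"] * 4 + ["Java Developer"] * 4

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
param_grid = {
    "multinomialnb__alpha": [0.01, 0.1, 1.0],
    "multinomialnb__fit_prior": [True, False],
}

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)   # best combination found on the toy data
print(search.best_score_)    # mean cross-validated accuracy for that combination
```

On the real data one would pass `df['cleaned_resume']` and `df['Category_class']` and likely use more folds; because the vectorizer sits inside the pipeline, it is refitted on each training fold.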
