
**Resume Classification**


This notebook works with a set of resumes in .docx format, the default file format of MS Word since Word 2007.

Resumes in PDF format would first need a PDF-parsing library to extract their text.

A separate skills list is used to identify the skills mentioned in each resume. Those skills are then mapped to roles to judge which resumes fit a given role.

In [2]:
!nvidia-smi

Mon Oct  4 10:34:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
!pip install python-docx

Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
     |████████████████████████████████| 5.6 MB 11.1 MB/s 
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184508 sha256=42254f298a024a12b8463cf7523aa914650aded70c72af93b6459f62a12660bf
  Stored in directory: /root/.cache/pip/wheels/f6/6f/b9/d798122a8b55b74ad30b5f52b01482169b445fbb84a11797a6
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11


In [8]:
import os
import sys
import spacy as sy
import docx
from tqdm import tqdm
import pandas as pd
import numpy as np
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
import json
import random
from spacy.matcher import Matcher, PhraseMatcher
import re
from os import listdir
from os.path import isfile, join
from io import StringIO
from collections import Counter
import pickle
import plotly.express as px
import plotly.graph_objects as go

**Loading Data**

In [9]:
# unzip the available set of resumes.
# note: upload the dataset shared in the LMS to the Colab environment first.
!unzip '/content/dataset.zip'

Archive:  /content/dataset.zip
  inflating: data/Abiral_Pandey_Fullstack_Java.docx  
  inflating: data/Achyuth Resume_8.docx  
  inflating: data/Adelina_Erimia_PMP1.docx  
  inflating: data/Adhi Gopalam - SM.docx  
  inflating: data/AjayKumar.docx     
  inflating: data/Akhil.profile.docx  
  inflating: data/Akhil_Sr BSA.docx  
  inflating: data/Alekhya Resume.docx  
  inflating: data/Amar Sr BSA.docx   
  inflating: data/Ami Jape.docx      
  inflating: data/Amrinder Business Analyst.docx  
  inflating: data/Amulya Komatineni.docx  
  inflating: data/Anil Krishna Mogalaturthi.docx  
  inflating: data/AnilAgarwal.docx   
  inflating: data/Anudeep N_Sr Java Developer.docx  
  inflating: data/Ashok Jayakumar - PM.docx  
  inflating: data/Ashwini J2EE Developer.docx  
  inflating: data/Atul_Mathur_Resume.docx  
  inflating: data/Avathika BA-Healthcare_.docx  
  inflating: data/avinash G.docx     
  inflating: data/B Shaker-Sr BSA-Scrum Master .docx  
  inflating: data/B Suresh Kumar_Proje

In [10]:
def getText(file):
    doc = docx.Document(file) # reading each document file, resumes in this case
    fullText=[] # empty corpus variable, where we can store the text 
    for paragraph in doc.paragraphs:
        fullText.append(paragraph.text) # add each paragraph from the available text
    return '\n\n'.join(fullText)

In [12]:
# creating the directory if that does not exist already
if not os.path.exists('text_data'):
    os.mkdir('text_data')

In [13]:
# Before proceeding please ensure that all the .docx files are in data folder
data={'file':[],'text':[]} # dict to store each file name and the text in that file
for file in tqdm(os.listdir('data')):
    data['file'].append(file) # getting the file name
    text=getText(os.path.join('data',file))
    data['text'].append(text) # getting the text within the file
    # save a .txt copy for easier inspection of each resume
    with open(os.path.join('text_data',str(file)+'.txt'),'w',encoding="utf-8") as f:
        f.write(text)
        

100%|██████████| 228/228 [00:05<00:00, 41.15it/s]


**Extracting Names and Resume**

In [14]:
# converting the dict into a Pandas dataframe
data=pd.DataFrame(data)

In [15]:
data.head()

Unnamed: 0,file,text
0,Francis Gomes Resume.docx,Professional summary \n\n16 + years of experi...
1,Uday_Maripelly.docx,\n\nUday Maripelly\n\nSENIOR QA AUTOMATION ENG...
2,Varun.docx,Varun\n\nOBJECTIVE: \t\n\nSeeking a position ...
3,Manohar Reddy.docx,Manohar\n\nSr. Java Developer\n\n\n\nEmail: ...
4,Pavan Kumar Full Stack Java Developer.docx,PAVAN KUMAR\t\t\t Pavank068...


In [16]:
st_en = sy.load('en_core_web_sm')

In [17]:
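# initialize the matcher with the model's vocab and register the name pattern once
matcher = Matcher(st_en.vocab)

# heuristic: treat two consecutive proper nouns as "first name, last name"
ptrn = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
# spaCy v2 signature; in spaCy v3 this call is matcher.add('NAME', [ptrn])
matcher.add('NAME', None, ptrn)

# user defined function:
# returns the first pair of consecutive proper nouns in the resume text,
# which is often, but not always, the candidate's name
def getName(resume_text,st_en):
    nlp_text = st_en(resume_text)
    matches = matcher(nlp_text)

    for match_id, start, end in matches:
        span = nlp_text[start:end]
        return span.text
    return None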

In [18]:
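# for example, print the name extracted from the resume at index 12 (the 13th row).
# the heuristic can misfire and return a pair of proper nouns that is not a name.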
print('Name:',getName(data.text.iloc[12],st_en))

Name: Agile RUP
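
Because the proper-noun heuristic can return a phrase such as the one above, a simple line-based check is sometimes used as a fallback. The sketch below is only an illustration and is not part of the original notebook: `guess_name`, `HEADINGS`, and `NAME_LINE` are hypothetical names, and the assumption that the name appears alone near the top of the resume will not hold for every file.

```python
import re

# Hypothetical list of section headings that should not be mistaken for a name.
HEADINGS = {'resume', 'curriculum vitae', 'professional summary', 'summary', 'objective'}
# One to four words made of letters, dots, apostrophes or hyphens.
NAME_LINE = re.compile(r"^[A-Za-z][A-Za-z.'-]*(?: [A-Za-z][A-Za-z.'-]*){0,3}$")

def guess_name(text, max_lines=5):
    """Return the first name-like line among the first few non-empty lines, else None."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for line in lines[:max_lines]:
        # Keep only the part before a tab or a run of spaces (e.g. "NAME<tab>email").
        candidate = re.split(r'\t+|\s{2,}', line)[0].strip()
        if candidate.lower() in HEADINGS:
            continue
        if NAME_LINE.match(candidate):
            return candidate
    return None

print(guess_name("\n\nJane Doe\n\nSENIOR QA ENGINEER"))
```

A job title on the first line would still pass this check, so the result is best treated as a candidate to compare against `getName` rather than as a definitive answer.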


**Extracting contact**

In [19]:
# User defined function:
# using Python regular expressions to find the contact numbers of each candidate 
def getContact(text):
    phone = re.findall(re.compile(r'(?:(?:\+?([1-9]|[0-9][0-9]|[0-9][0-9][0-9])\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([0-9][1-9]|[0-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?'), text)    
    if phone:
        number = ''.join(phone[0])
        if len(number) > 10:
            return '+' + number
        else:
            return number

In [20]:
print('Contact:',getContact(data.text.iloc[14]))

Contact: +19404370150
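
The post-processing in `getContact` joins the captured groups into one digit string and adds a `+` when more than 10 digits are present, on the assumption that the extra digits are a country code. The sketch below shows that step in isolation. It uses a deliberately simplified pattern, not the notebook's full regular expression, and `PHONE` and `normalize_phone` are hypothetical names.

```python
import re

# Simplified, illustrative pattern (NOT the notebook's regex):
# optional country code, 3-digit area code, 3-digit exchange, 4-digit line number.
PHONE = re.compile(r'(?:\+?(\d{1,3})[\s.-]*)?\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})')

def normalize_phone(text):
    """Return the first phone number in text as a digit string, or None."""
    match = PHONE.search(text)
    if match is None:
        return None
    # Unmatched optional groups are None, so skip them before joining.
    digits = ''.join(group for group in match.groups() if group)
    # More than 10 digits is taken to mean a country code is included.
    return '+' + digits if len(digits) > 10 else digits

print(normalize_phone("Phone: +1 940-437-0150"))
print(normalize_phone("Cell (940) 437-0150"))
```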


**Extracting Email**

In [21]:
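# similar to extracting phone numbers, we can also extract emails with a regular expression.
def getEmail(text):
    # raw string so that \s and \. reach the regex engine unchanged
    emails = re.findall(r"([^@|\s]+@[^@]+\.[^@|\s]+)", text)
    if emails:
        try:
            # the greedy middle part can run past the address, so keep only the first token
            return emails[0].split()[0].strip(';')
        except IndexError:
            return None
    return None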

In [22]:
# test print
print('Email:',getEmail(data.text.iloc[102]))

Email: usilawal123@gmail.com
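
The pattern in `getEmail` is permissive, and its result can keep trailing punctuation when an address ends a sentence. As a point of comparison, the sketch below uses a stricter character-class pattern. It is an assumption made for illustration rather than the notebook's method, `EMAIL` and `extract_emails` are hypothetical names, and the pattern does not cover every valid address.

```python
import re

# Stricter illustrative pattern: local part, '@', domain labels, alphabetic TLD.
EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def extract_emails(text):
    """Return every email-like substring in text, in order of appearance."""
    return EMAIL.findall(text)

sample = "Reach me at jane.doe@example.com; or j_smith+cv@mail.example.org."
print(extract_emails(sample))
```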


**Extracting Education**

In [23]:
# Education Degrees
EDUCATION = [
            'BE','B.E.', 'B.E', 'BS', 'B.S', 
            'ME', 'M.E', 'M.E.', 'MS', 'M.S', 
            'BTECH', 'B.TECH', 'M.TECH', 'MTECH', 
            'SSC', 'HSC', 'CBSE', 'ICSE', 'X', 'XII'
        ]

In [24]:
# user defined function to extract the Education qualifications as per the set list of values
def getEducation(resume_text,st_en):
    nlp_text = st_en(resume_text)

    # Sentence tokenizer (sent.text is available in both spaCy v2 and v3)
    nlp_text = [sent.text.strip() for sent in nlp_text.sents]

    edu = {}
    # Extract education degree
    for index, text in enumerate(nlp_text):
        for tex in text.split():
            # Remove punctuation so that e.g. 'B.E.' becomes 'BE'
            tex = re.sub(r'[?|$|.|!|,]', r'', tex)
            if tex.upper() in EDUCATION and not st_en.vocab[tex].is_stop:
                # keep the sentence containing the degree so a year can be searched in it
                edu[tex] = text

    # Extract year
    education = []
    for key in edu.keys():
        year = re.search(re.compile(r'(((20|19)(\d{2})))'), edu[key])
        if year:
            education.append((key, ''.join(year[0])))
        else:
            education.append(key)
    return education

In [25]:
print('Education:',getEducation(data.text.iloc[51],st_en),sep='\n')

Education:
['MS']
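
The two steps in `getEducation` (normalize each word and look it up, then search the surrounding text for a year) can be illustrated with the standard library alone. The sketch below is a simplified stand-in: it works line by line instead of using spaCy sentences, it has no stop-word check (so ordinary words such as "me" would match), and `DEGREES`, `YEAR`, and `find_degrees` are hypothetical names.

```python
import re

DEGREES = {'BE', 'BS', 'ME', 'MS', 'BTECH', 'MTECH'}  # illustrative subset
# Word boundaries keep the year from matching inside longer digit runs.
YEAR = re.compile(r'\b(?:19|20)\d{2}\b')

def find_degrees(text):
    """Map each degree found in text to a year on the same line (or None)."""
    found = {}
    for line in text.splitlines():
        for word in line.split():
            # Drop everything except letters, so 'B.Tech,' becomes 'BTECH'.
            key = re.sub(r'[^A-Za-z]', '', word).upper()
            if key in DEGREES and key not in found:
                year = YEAR.search(line)
                found[key] = year.group(0) if year else None
    return found

print(find_degrees("MS in Computer Science, 2014\nB.Tech, JNTU 2010"))
```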


**Extracting Skills**

To extract skills, we need a predefined set of skills to match against. For that purpose we use a skills.csv file.

In [26]:
# user defined function that uses skills.csv, where the column headers list
# all the common skills that we are looking for as part of the requirements/hiring
def getSkills(resume_text,st_en):
    nlp_text = st_en(resume_text)
    noun_chunks = nlp_text.noun_chunks # noun phrases, used for multi-word skills
    # word tokenization with stop words removed
    tokens = [token.text for token in nlp_text if not token.is_stop]
    skills_df = pd.read_csv("skills.csv") # note: re-read on every call
    skills = set(skills_df.columns.values) # set for fast membership tests
    skillset = []
    
    # check for one-grams (example: python)
    for token in tokens:
        if token.lower() in skills:
            skillset.append(token)
    
    # check for bi-grams and tri-grams (example: machine learning)
    for chunk in noun_chunks:
        chunk_text = chunk.text.lower().strip()
        if chunk_text in skills:
            skillset.append(chunk_text)
    
    # de-duplicate case-insensitively and capitalize for display
    return [i.capitalize() for i in set([i.lower() for i in skillset])]

In [27]:
print('Skills:',getSkills(data.text.iloc[140],st_en),sep='\n')

Skills:
['Software development life cycle', 'Underwriting', 'Open source', 'Transactions', 'Email', 'Ruby', 'Apis', 'Workflows', 'Ui', 'Mobile', 'Coding', 'Adobe', 'Jira', 'Unix', 'Mock', 'Reports', 'Pattern', 'Inventory', 'Banking', 'Debugging', 'Css', 'Js', 'Broadcast', 'Pl/sql', 'Windows', 'Workflow', 'Flex', 'Documentation', 'Mortgage', 'Technical skills', 'Communication', 'Database', 'Distribution', 'Oracle', 'Html', 'Design', 'Servers', 'Security', 'Vmware', 'Administration', 'Reporting', 'Rest', 'Usability', 'Presentation', 'Hospital', 'Logging', 'Xml', 'Sdlc', 'Operations', 'Nosql', 'Website', 'Soap', 'Pharmacy', 'Jsp', 'Billing', 'Shell', 'Automation', 'Health', 'Technical', 'Scheduling', 'Selenium', 'Requests', 'Mysql', 'System', 'Html5', 'Aws', 'Ibm', 'Java', 'Analysis', 'Access', 'Scrum', 'Agile', 'Docker', 'Php', 'Javascript', 'Queries', 'Json', 'Sql server', 'Architecture', 'Admissions', 'Writing', 'Web services', 'Linux', 'Programming', 'Threading', 'Test cases', 'Cloud'

In [28]:
print('Name:',getName(data.text.iloc[12],st_en))
print('Skills:',getSkills(data.text.iloc[12],st_en),sep='\n')

Name: Agile RUP
Skills:
['Software development life cycle', 'Asp', 'Microsoft visio', 'Test plans', 'Plan', 'Word', 'Legal', 'Conversion', 'Presentations', 'Quality assurance', 'Ui', 'Adobe', 'Jad', 'Jira', 'Reports', 'Sharepoint', 'Mock', 'Excel', 'Documentation', 'Modeling', 'Proposal', 'Training', 'Project management', 'Communication', 'Market research', 'Database', 'C#', 'Oracle', 'Html', 'Design', 'Crm', 'Audit', 'Policies', 'Ms excel', 'Administration', 'Schedules', 'Presentation', 'Content', 'Ecommerce', 'Specifications', 'Xml', 'Sdlc', 'Visio', 'Operations', 'C', 'Budget', 'Warehouse', 'Website', 'Analytical', 'Vendors', 'Compliance', 'C++', 'International', 'Jsp', 'Project planning', 'Budgeting', 'Technical', 'Requests', 'Metrics', 'System', 'Reconciliation', 'Logistics', 'Ibm', 'Benchmark', 'Java', 'Hp alm', 'Analysis', 'Access', 'Scrum', 'Agile', 'Architecture', 'Research', 'Gap analysis', 'Ms project', 'Writing', 'Test cases', 'Testing', 'Process', 'Sql']
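
The single-word and multi-word lookups in `getSkills` can also be illustrated without spaCy. The sketch below swaps the noun-chunk step for plain word n-grams, which is a different technique from the one used in the notebook. `match_skills`, its tokenizer, and the sample skill list are assumptions made for illustration.

```python
import re

def match_skills(text, skills, max_n=3):
    """Return the known skills that occur in text as 1- to max_n-word phrases."""
    vocab = {skill.lower() for skill in skills}
    # Lowercase word tokens; '+' and '#' are kept so 'c++' and 'c#' survive.
    tokens = re.findall(r'[a-z0-9+#]+', text.lower())
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = ' '.join(tokens[i:i + n])
            if gram in vocab:
                found.add(gram)
    return sorted(found)

known = ['python', 'machine learning', 'sql server', 'java']
print(match_skills("Built REST APIs in Python; used machine learning and SQL Server.", known))
```

Building the skill set once and passing it in, as done here, also avoids re-reading the CSV file for every resume.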


In [29]:
gsheet="https://docs.google.com/spreadsheets/d/1vZzA3Ccx5vM4d-CiCloCnbyMrU0Pdo134450Z5L6YrI/edit#gid=0"
url_1 = gsheet.replace('/edit#gid=', '/export?format=csv&gid=')

In [30]:
ky_roles = pd.read_csv(url_1)

In [31]:
ky_roles.head()

Unnamed: 0,Statistician,Machine Learning Engineer,Deep Learning Engineer,Python Developer,NLP Engineer,Data Engineering,JAVA developer,Cloud Engineer,Web Developer
0,statistical models,Ruby,neural network,python,nlp,laws,Java,AWS,Javascript
1,statistical modeling,Python,keras,flask,natural language processing,ec2,J2ee,GCP,Typescript
2,probability,SAS,theano,django,topic modeling,amazon redshift,Object Oriented Programming,Amazon Web Services,HTML
3,normal distribution,SPSS,face detection,pandas,Ida,s3,OOPs,Google Cloud,HTML5
4,poisson distribution,Weka,neural networks,numpy,named entity recognition,docker,Angular JS,Azure,.js


In [36]:
# create a user defined function to find which roles a text matches
# to do this we use the PhraseMatcher
def getRolesStatus(text,ky_roles,st_en):
    # one list of keyword Docs per role (each column of ky_roles is a role)
    words={role:None for role in ky_roles}
    for role in ky_roles:
        words[role] = [st_en(tt) for tt in ky_roles[role].dropna(axis=0)]
    
    matcher = PhraseMatcher(st_en.vocab)
    for role in ky_roles:
        # spaCy v2 signature; in spaCy v3 this call is matcher.add(role, words[role])
        matcher.add(role,None,*words[role])
    
    doc = st_en(text) # Document
    d = []  
    matches = matcher(doc)
    for match_id, start, end in matches:
        rule_id = st_en.vocab.strings[match_id]  # get the role name the pattern was registered under
        span = doc[start : end]  # get the matched slice of the doc
        d.append((rule_id, span.text))      
    keywords = "\n".join(f'{i[0]} {i[1]} ({j})' for i,j in Counter(d).items())
    
    ## converting the string of keywords to a dataframe
    df = pd.read_csv(StringIO(keywords),names = ['Keywords_List'])
    df1 = pd.DataFrame(df.Keywords_List.str.split('(',n=1).tolist(),columns = ['Subject','Count'])
    # drop the trailing empty string and the last word of the matched keyword
    df1.Subject = [' '.join(x.split(' ')[:-2]) for x in df1.Subject]
    df2 = pd.concat([df1['Subject'],df1['Count'].str.replace(')','',regex=False)], axis =1) 
    dataf = pd.concat([df2['Subject'], df2['Count']], axis = 1)

    return dataf

In [34]:
from spacy.matcher import PhraseMatcher


In [37]:
# run the UDF on the skills extracted from one resume (row index 170)
# to get the matching ROLES rather than only the skills
words=getRolesStatus(' '.join(getSkills(data.text.iloc[170],st_en)),ky_roles,st_en)
words

Unnamed: 0,Subject,Count
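
An empty result like the one above may be a matter of letter case: `PhraseMatcher` compares exact token text by default (it accepts `attr='LOWER'` for case-insensitive matching), while `getSkills` returns capitalized skills. The sketch below shows the same role lookup as a case-insensitive set intersection using only the standard library. It is an illustration under that assumption, and `score_roles` and the sample data are hypothetical.

```python
from collections import Counter

def score_roles(skills, role_keywords):
    """Count, per role, how many of its keywords appear in skills (case-insensitive)."""
    have = {skill.lower() for skill in skills}
    scores = Counter()
    for role, keywords in role_keywords.items():
        hits = have & {keyword.lower() for keyword in keywords}
        if hits:
            scores[role] = len(hits)
    return scores

roles = {
    'Python Developer': ['python', 'flask', 'django'],
    'Cloud Engineer': ['AWS', 'GCP', 'Azure'],
}
print(score_roles(['Python', 'Aws', 'Docker', 'Flask'], roles).most_common())
```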


**Compiling the Information**

In [38]:
# creating a compilation function (UDF)
# that collects all the extracted information for any given resume text
def compileInformation(text):
    info={}
    skills=getSkills(text,st_en) # computed once and reused for the role matching
    
    info['Name']=getName(text,st_en) # name
    info['Contact']=getContact(text) # contact details, if any
    info['Email']=getEmail(text) # email address, if any
    info['Education']=getEducation(text,st_en) # edu. details, if any
    info['Skills']=skills # specific skills
    info['Domains']=getRolesStatus(' '.join(skills),ky_roles,st_en).Subject # Roles
    return info

In [39]:
# search in all documents/resumes
# and list all the properties of them into a single place
all_docs=[]
for text in tqdm(data.text):
    all_docs.append(compileInformation(text))

100%|██████████| 228/228 [13:53<00:00,  3.66s/it]


In [40]:
# dumping the findings to a binary pickle file
with open('all_docs.pkl','wb') as f:
    pickle.dump(all_docs,f)

In [41]:
# reading it back into the same variable
with open('all_docs.pkl','rb') as f:
    all_docs=pickle.load(f)

In [42]:
# for example: trying to read the domains/Roles from the compiled information
# which we have got out of the resumes that we scanned as part of this study
all_domains=[]
for doc in all_docs:
    all_domains.extend(doc['Domains'])

In [43]:
# save the domains we have found as a series of elements
all_domains=pd.Series(all_domains)

In [44]:
# e.g., we use the Plotly library to display a bar chart of how often each
# role was matched across the given set of resumes.
domain_counts=all_domains.value_counts()
px.bar(x=domain_counts.index,y=domain_counts.values)
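
The introduction mentions picking the resumes that fit a given role, a step the chart does not show directly. A minimal sketch of that step, assuming each entry of `all_docs` has the `Name` and `Domains` keys built by `compileInformation`, could look like the following. `shortlist` and the sample records are hypothetical.

```python
from collections import Counter

def shortlist(all_docs, role, min_hits=1):
    """Return (name, hits) pairs for resumes matching role, most hits first."""
    ranked = []
    for doc in all_docs:
        # Counting by iteration works for lists and for pandas Series values alike.
        hits = Counter(doc['Domains'])[role]
        if hits >= min_hits:
            ranked.append((doc['Name'], hits))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

sample_docs = [
    {'Name': 'A', 'Domains': ['Cloud Engineer', 'Python Developer', 'Python Developer']},
    {'Name': 'B', 'Domains': ['Cloud Engineer']},
    {'Name': 'C', 'Domains': []},
]
print(shortlist(sample_docs, 'Python Developer'))
```

Each occurrence of a role in `Domains` corresponds to one distinct keyword matched for that role, so the hit count serves as a rough measure of fit rather than a calibrated score.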