In [1]:
import spacy
import pickle
import random

## Loading the Dataset
Choosing the data is the main difficulty in information extraction. The dataset used here is easy to read and follows a fixed format, which makes the extraction fairly straightforward.

The training data is in the following format, where each entity gives the position of the text to extract together with the name of its label:

('RESUME TEXT', {'entities': [(start index, end index, 'LABEL'), ...]})
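
As a minimal sketch of that format, the snippet below builds one training example by hand. It assumes each entity is a `(start, end, label)` tuple of character offsets, which is what the training loop's use of `ent[2]` as the label suggests. The sample text and the `make_entity` helper are hypothetical; the label names are taken from the model output shown later in this notebook.

```python
# Hypothetical resume text, used only to illustrate the annotation shape.
text = "Alice Clark Software Engineer Microsoft"

def make_entity(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

example = (
    text,
    {"entities": [
        make_entity(text, "Alice Clark", "Name"),
        make_entity(text, "Software Engineer", "Designation"),
        make_entity(text, "Microsoft", "Companies worked at"),
    ]},
)

# Slicing the text with each pair of offsets should give back the phrase.
for start, end, label in example[1]["entities"]:
    print(f"{label} : {text[start:end]}")
```

Printing the slices is a quick way to confirm that the offsets in an annotation really point at the intended text before training on it.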

In [3]:
# Use a context manager so the file is closed after loading
with open('../Data/train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

In [4]:
train_data[0]

('Govardhana K Senior Software Engineer  Bengaluru, Karnataka, Karnataka - Email me on Indeed: indeed.com/r/Govardhana-K/ b2de315d95905b68  Total IT experience 5 Years 6 Months Cloud Lending Solutions INC 4 Month • Salesforce Developer Oracle 5 Years 2 Month • Core Java Developer Languages Core Java, Go Lang Oracle PL-SQL programming, Sales Force Developer with APEX.  Designations & Promotions  Willing to relocate: Anywhere  WORK EXPERIENCE  Senior Software Engineer  Cloud Lending Solutions -  Bangalore, Karnataka -  January 2018 to Present  Present  Senior Consultant  Oracle -  Bangalore, Karnataka -  November 2016 to December 2017  Staff Consultant  Oracle -  Bangalore, Karnataka -  January 2014 to October 2016  Associate Consultant  Oracle -  Bangalore, Karnataka -  November 2012 to December 2013  EDUCATION  B.E in Computer Science Engineering  Adithya Institute of Technology -  Tamil Nadu  September 2008 to June 2012  https://www.indeed.com/r/Govardhana-K/b2de315d95905b68?isid=rex-

## Creating the Training Function for Parsing Resumes

In [38]:
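nlp = spacy.blank('en')  # start from an empty English pipeline (spaCy v2 API)

def train_model(train_data):
    # Add an NER component on the first call; reuse it on later calls
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # Register every entity label that appears in the annotations
    for _, annotation in train_data:
        for ent in annotation['entities']:
            ner.add_label(ent[2])

    # Train only the NER component
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itr in range(10):
            print("Iteration" + str(itr))
            random.shuffle(train_data)  # reorders the caller's list in place
            loss = {}

            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=loss)
                except Exception:
                    # Examples that spaCy rejects are skipped without a message
                    pass
            print(loss)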
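
`train_model` wraps `nlp.update` in a `try`/`except`, so any example that raises is skipped silently. The notebook does not show which errors occur. One possible cause, assumed here rather than confirmed, is overlapping entity spans, which spaCy's NER does not accept. The sketch below uses a hypothetical helper, `drop_overlapping`, to show one way to filter such spans before training, keeping the longest span in each overlap.

```python
def drop_overlapping(entities):
    """Keep the longest spans first and discard any span that overlaps a kept one."""
    kept = []
    # Sort by descending length (start - end is the negative length), then by start.
    for start, end, label in sorted(entities, key=lambda e: (e[0] - e[1], e[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Hypothetical annotations: the second span lies inside the first.
entities = [(0, 11, "Name"), (6, 11, "Name"), (12, 29, "Designation")]
print(drop_overlapping(entities))
```

Applied to each `annotations['entities']` list before the training loop, a filter like this would make it possible to see how many examples were being skipped for this reason.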
            
        

## Training the Model

In [39]:
train_model(train_data)

Iteration0
{'ner': 10921.799873459804}
Iteration1
{'ner': 7953.806305676695}
Iteration2
{'ner': 10314.221859141748}
Iteration3
{'ner': 5844.826505412054}
Iteration4
{'ner': 8489.393323315906}
Iteration5
{'ner': 6406.373823576703}
Iteration6
{'ner': 4841.215209409064}
Iteration7
{'ner': 5391.030628845243}
Iteration8
{'ner': 4689.830447240619}
Iteration9
{'ner': 5080.196722783974}


## Storing the Trained Model for Future Use

In [40]:
nlp.to_disk('nlp_model')

## Loading the Trained Model to Parse New Resumes

In [41]:
nlp_model = spacy.load('nlp_model')

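## Checking the Extraction on a Record from the Dataset

Note that `train_model` calls `random.shuffle(train_data)`, which reorders the list in place, so `train_data[0]` here is no longer the record displayed when the dataset was loaded.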

In [46]:
doc = nlp_model(train_data[0][0])
for ent in doc.ents:
    print(f'{ent.label_} : {ent.text}')

Name : Karthik G
Designation : Program Manager, Product Manager,
Location : Secunderabad
Companies worked at : Microsoft India
Location : Hyderabad
Degree : PGDBM in Business Management
Location : Hyderabad
Companies worked at : Microsoft Technology
Companies worked at : Microsoft Role
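
The printed output repeats some labels and values (for example `Location : Hyderabad`). A minimal sketch of collecting the results into a dictionary keyed by label follows. The `group_entities` helper is hypothetical, and it works on plain `(label, text)` pairs so that it can be tried without a trained model.

```python
from collections import defaultdict

def group_entities(pairs):
    """Collect (label, text) pairs into {label: [unique texts in first-seen order]}."""
    grouped = defaultdict(list)
    for label, text in pairs:
        cleaned = text.strip(" ,")  # trim stray spaces and trailing commas
        if cleaned and cleaned not in grouped[label]:
            grouped[label].append(cleaned)
    return dict(grouped)

# A few pairs copied from the output above.
pairs = [
    ("Name", "Karthik G"),
    ("Location", "Secunderabad"),
    ("Location", "Hyderabad"),
    ("Location", "Hyderabad"),
    ("Designation", "Program Manager, Product Manager,"),
]
print(group_entities(pairs))
```

With a spaCy document, the pairs could be produced as `((ent.label_, ent.text) for ent in doc.ents)`.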


## Using PyMuPDF to Convert a PDF Resume into Text for the Trained Model

In [48]:
import fitz  # PyMuPDF

## Opening a New Resume PDF, Extracting Its Text, and Preprocessing It for the Model

In [49]:
file = '../Data/Alice Clark CV.pdf'
doc = fitz.open(file)
text = ""
for page in doc:
    # getText() is the older PyMuPDF name; recent releases call it get_text()
    text += page.getText()
print(text)

Alice Clark 
AI / Machine Learning 
 
Delhi, India Email me on Indeed 
• 
20+ years of experience in data handling, design, and development 
• 
Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to 
data warehousing and business intelligence 
• 
Database: Experience in database designing, scalability, back-up and recovery, writing and 
optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes. 
Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure, 
Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake 
analytics(U-SQL) 
Willing to relocate anywhere 
 
WORK EXPERIENCE 
Software Engineer 
Microsoft – Bangalore, Karnataka 
January 2000 to Present 
1. Microsoft Rewards Live dashboards: 
Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping 
online. Microsoft Rewards members can earn points when searching with Bing, bro

In [50]:
tx = " ".join(text.split('\n'))
print(tx)

Alice Clark  AI / Machine Learning    Delhi, India Email me on Indeed  •  20+ years of experience in data handling, design, and development  •  Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to  data warehousing and business intelligence  •  Database: Experience in database designing, scalability, back-up and recovery, writing and  optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes.  Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure,  Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake  analytics(U-SQL)  Willing to relocate anywhere    WORK EXPERIENCE  Software Engineer  Microsoft – Bangalore, Karnataka  January 2000 to Present  1. Microsoft Rewards Live dashboards:  Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping  online. Microsoft Rewards members can earn points when searching with Bing, bro
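
Replacing each newline with a space leaves runs of two or more spaces in the text, as the output above shows. A minimal sketch of an alternative that collapses every run of whitespace into a single space is given below; `normalize_whitespace` is a hypothetical helper and the sample string is a shortened stand-in for the extracted text.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())

raw = "Alice Clark \nAI / Machine Learning \n \nDelhi, India"

print(" ".join(raw.split("\n")))   # the notebook's approach: extra spaces remain
print(normalize_whitespace(raw))   # single spaces only
```

Whether this helps the model is untested here: the training texts themselves contain double spaces, so input that keeps them may be closer to what the model saw during training.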

In [52]:
doc = nlp_model(tx)
for ent in doc.ents:
    print(f'{ent.label_} : {ent.text}')

Name : Alice Clark
Location : Delhi
Designation : Software Engineer
Companies worked at : Microsoft –
Location : Bangalore
Companies worked at : Microsoft
Companies worked at : Microsoft
Companies worked at : Microsoft
Companies worked at : Microsoft
College Name : Indian Institute of Technology – Mumbai
Skills : Machine Learning, Natural Language Processing, and Big Data Handling    ADDITIONAL INFORMATION  Professional Skills  • Excellent analytical, problem solving, communication, knowledge transfer and interpersonal  skills with ability to interact with individuals at all the levels  • Quick learner and maintains cordial relationship with project manager and team members and  good performer both in team and independent job environments  • Positive attitude towards superiors & peers  • Supervised junior developers throughout project lifecycle and provided technical assistance
