# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 1 - Data Exploration and preprocessing
* Load and examine the dataset we will be working with.
* Prepare the data for training (see part 2).

#### Load the Dataset

In [1]:
import os
dataset_path = "data.json"  # expecting this dataset file in the current directory
assert os.path.exists(dataset_path)

In [2]:
with open(dataset_path, 'r', encoding="utf8") as f:
    text = f.readlines()
    print('{} lines read'.format(len(text)))
    print("Sample resume:")
    print(text[5])

701 lines read
Sample resume:
{"content": "Ashalata Bisoyi\nTransaction Processor - Oracle India Private Limited\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df\n\nTo secure a position an esteem organization with good working culture that will help my career\nin the field of finance through my sincerity, hard works and skills.\n\nWORK EXPERIENCE\n\nTransaction Processor\n\nOracle India Private Limited -\n\nApril 2016 to Present\n\n2 Year of experience with Oracle India Private Limited in expense team. My work is auditing of\nexpense reports of employees of all the countries, handling queries through emails and calls also.\nJOB DESCRIPTION\n• Auditing of expense reports of the employees for all the countries and working on service portal\n(Answering queries through email)\n• Handling the team in absence of seniors.\n• Working on Payment Rejections, export of expense reports to AP.\n• Take care of running Backlog, Having knowledge about travel 

#### Convert the dataset to Python dictionaries
As the sample shows, each line of the file is a JSON dictionary rather than conveniently readable text. We want to work with the resumes as Python dictionaries, not as raw text, so we parse each line into a dictionary.

In [3]:
import json
all_resumes = []
for line in text:
    resume = json.loads(line)  # resume is a dictionary
    all_resumes.append(resume)

# select an example to explore its structure
resume = all_resumes[5]

##### Explore the resume data structure

In [4]:
# helper function to print a separator line
def print_sep(symbol='-'):
    print(symbol * 80)

print("keys and values in resume:")
print_sep('=')
for key, value in resume.items():
    print_sep('-')
    print('{k} = {v}'.format(k=key, v=value))
print_sep('-')

keys and values in resume:
--------------------------------------------------------------------------------
content = Ashalata Bisoyi
Transaction Processor - Oracle India Private Limited

Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df

To secure a position an esteem organization with good working culture that will help my career
in the field of finance through my sincerity, hard works and skills.

WORK EXPERIENCE

Transaction Processor

Oracle India Private Limited -

April 2016 to Present

2 Year of experience with Oracle India Private Limited in expense team. My work is auditing of
expense reports of employees of all the countries, handling queries through emails and calls also.
JOB DESCRIPTION
• Auditing of expense reports of the employees for all the countries and working on service portal
(Answering queries through email)
• Handling the team in absence of seniors.
• Working on Payment Rejections, export of expense reports to AP.
• Take car

##### Results
* key "content" points to the resume content
* key "annotation" points to the resume's entity annotations

In [5]:
print("part of resume content:")
print_sep('-')
print('\n'.join(resume['content'].splitlines()[:5]))
print_sep('=')

print("resume entity list:")
print_sep('-')
print(resume['annotation'])

part of resume content:
--------------------------------------------------------------------------------
Ashalata Bisoyi
Transaction Processor - Oracle India Private Limited

Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df

resume entity list:
--------------------------------------------------------------------------------
[{'label': ['Skills'], 'points': [{'start': 1710, 'end': 1720, 'text': 'M.S. OFFICE'}]}, {'label': ['Skills'], 'points': [{'start': 1692, 'end': 1706, 'text': ' DOEACC O LEVEL'}]}, {'label': ['Address'], 'points': [{'start': 1436, 'end': 1441, 'text': 'Orissa'}]}, {'label': ['Degree'], 'points': [{'start': 1404, 'end': 1432, 'text': 'Board of Secondary Education '}]}, {'label': ['Email Address'], 'points': [{'start': 1315, 'end': 1359, 'text': 'indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df'}]}, {'label': ['Links'], 'points': [{'start': 1303, 'end': 1400, 'text': 'https://www.indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df?isid=r

##### Explore the list of entity labels
The entity list is a list of dictionaries:

In [6]:
entities = resume['annotation']

print('entity list:')
for entity_item in entities:
    print_sep('-')
    print(entity_item)

entity list:
--------------------------------------------------------------------------------
{'label': ['Skills'], 'points': [{'start': 1710, 'end': 1720, 'text': 'M.S. OFFICE'}]}
--------------------------------------------------------------------------------
{'label': ['Skills'], 'points': [{'start': 1692, 'end': 1706, 'text': ' DOEACC O LEVEL'}]}
--------------------------------------------------------------------------------
{'label': ['Address'], 'points': [{'start': 1436, 'end': 1441, 'text': 'Orissa'}]}
--------------------------------------------------------------------------------
{'label': ['Degree'], 'points': [{'start': 1404, 'end': 1432, 'text': 'Board of Secondary Education '}]}
--------------------------------------------------------------------------------
{'label': ['Email Address'], 'points': [{'start': 1315, 'end': 1359, 'text': 'indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df'}]}
--------------------------------------------------------------------------------
{'label

##### Structure and datatype of entity entries
* keys: 'label' and 'points'
* values: list of str (for key 'label') / list of dict (for key 'points')

##### Meaning of the entity entries
* label entry = descriptive name of the entity
* points entry = value(s) for that entity
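
To make this structure concrete, here is a minimal sketch that reads one entity entry. The `sample_entity` dictionary is copied from the entity list printed above; the variable names are only illustrative.

```python
# One entity entry, copied from the entity list printed above
sample_entity = {
    'label': ['Address'],
    'points': [{'start': 1436, 'end': 1441, 'text': 'Orissa'}],
}

labels = sample_entity['label']        # list of str
point = sample_entity['points'][0]     # dict with 'start', 'end' and 'text'

# 'end' is inclusive here, so the span covers end - start + 1 characters
span_length = point['end'] - point['start'] + 1
print(labels[0], repr(point['text']), span_length)
```

For this entry the computed span length equals the length of the annotated text, which suggests that both offsets are inclusive.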

##### Convert data to spaCy offset format
Before going any further, we convert the data into a slightly more compact format, which we will use to train our first models in the next part.

**Note**: the conversion makes slight changes to the text and the entity offsets to solve problems that became evident in part 2:
* replace Unicode characters with ASCII characters (in the hope of reducing mismatches between tokenizer and entity offsets; the main benefit is that all characters used as list bullets are mapped to a single character, which can later be used for sentence splitting)
* exclude whitespace from entity annotations by adjusting the offsets (this has the largest impact on reducing mismatches between tokenizer and entity offsets)
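
The two adjustments can be illustrated on a tiny example. This is only a sketch: the sample text and the offsets are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample text; the bullet character is U+2022
sample = "Skills\n\u2022 Python \n\u2022 SQL"

# Replacing one character with exactly one character keeps the string
# length unchanged, so existing character offsets remain valid
cleaned = sample.replace('\u2022', '*')
assert len(cleaned) == len(sample)

# Hypothetical annotation with inclusive offsets that accidentally
# covers a leading and a trailing space: ' Python '
start, end = 8, 15
while end > start and cleaned[end].isspace():
    end -= 1
while start < end and cleaned[start].isspace():
    start += 1

print(repr(cleaned[start:end + 1]))
```

After trimming, the span covers only the word itself, which is what a tokenizer would produce as a token boundary.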

In [7]:
## data conversion method
def convert_data(data):
    """
    Creates NER training data in spaCy offset format from one resume of the JSON dataset.
    Returns a tuple (text, {"entities": [(start, end, label), ...]}) which can be used for spaCy training.
    """
    text = data['content']
    # replace problematic unicode symbols that confuse the tokenizer,
    # leading to '-' where a BILOU tag should be
    # must not change the string size! (-> replacing with empty string is not allowed)
    char_replacement_map = {
        '\u2026': ' ',
        ('\u2022\u27a2\u25cf\u2013\u2756\u2713\u3013\u2611'
         '\u25e6\u2751\u2663\u2794\u2666\u21e8\u2212\u00b7'
         '\u2752\u25c6\u27b2\u00c4\u00ac\u00d8\u25c7\u21d2'): '*',
        '\u2019\u2018': "'",
        '\u201c\u201d\u035e\u035f': '"',
        '\u00e9': 'e',
        '\u00ae': 'R',
        '\u00e7': 'c',
        '\u00d7': 'x',
        '\u00e0': 'a',
        '\u00b5': 'u',
        '\u00a0': ' ',
        #'\n': ' ',
    }
    for chars_to_replace, replacement_char in char_replacement_map.items():
        for c in chars_to_replace:
            text = text.replace(c, replacement_char)
    word_replacement_map = {
        'Ltd.': 'Ltd ',
        'Inc.': 'Inc ',
    }
    for word_to_replace, replacement_word in word_replacement_map.items():
        text = text.replace(word_to_replace, replacement_word)
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            point = annotation['points'][0]
            start = point['start']
            end = point['end']
            # own modification: some labels don't start or stop at a word boundary
            # needs to be fixed for compatibility with automatic tokenizers
            # and to improve data quality
            # reduces occurrences of '-' entities in training data from ~900 to ~200
            while text[end].isspace() and end > start:
                end -= 1
            while text[start].isspace() and start < end:
                start += 1
            # further corrections:
            #if text[start].isalnum():
            #    # shrink selection to exclude non-alphanumeric chars at end
            #    while (not text[end].isalnum()) and end > start:
            #        end -= 1
            #    # grow selection to include following alphanumeric chars
            #    while end < len(text)-1 and text[end+1].isalnum():
            #        end += 1
            # ^- commented out; reason: too aggressive
            labels = annotation['label']
            # handle both list of labels or a single label.
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # Dataturks offsets are inclusive at both ends [start, end],
                # but spaCy expects an exclusive end [start, end)
                entities.append((start, end + 1, label))
    return (text, {"entities": entities})
   
converted_resumes = [convert_data(resume) for resume in all_resumes]
print('{} resumes converted'.format(len(converted_resumes)))

701 resumes converted


##### New data structure
Each resume is converted to a 2-tuple (content, {'entities': \[(start1, end1+1, label1), ..., (startn, endn+1, labeln)\]}), so the dataset is now a list of such tuples.
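
Because the end offset is now exclusive, a plain slice returns the entity text. The following sketch uses a made-up miniature resume (`sample_text` and its offsets are hypothetical) in the converted format:

```python
# Hypothetical miniature example in the converted format
sample_text = "Jane Doe\nAcme Corp"
converted = (sample_text, {'entities': [(0, 8, 'Name'),
                                        (9, 18, 'Companies worked at')]})

content, annotations = converted
for start, end, label in annotations['entities']:
    # end is exclusive, so content[start:end] is exactly the entity text
    print('{} = {!r}'.format(label, content[start:end]))
```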

##### Filter out resumes without annotations
A few of the resumes have an empty entity list. We want to filter these resumes out of our data.

In [8]:
converted_resumes_with_empty_entities = converted_resumes
# filter out resumes whose entity list is empty
converted_resumes = [resume for resume in converted_resumes if len(resume[1]['entities']) != 0]
print('removing resumes without entities dropped document count from {} to {}'.format(
    len(converted_resumes_with_empty_entities), len(converted_resumes)
))

removing resumes without entities dropped document count from 701 to 690


##### Print all entities for one converted resume

In [9]:
# pick an example resume
converted_resume = converted_resumes[5]
# extract text and entities
text = converted_resume[0]
entities_list = converted_resume[1]['entities']
# print entities for the chosen resume
for (start, end, label) in entities_list:
    print('{label} = {value}'.format(label=label, value=text[start:end]))

Skills = M.S. OFFICE
Skills = DOEACC O LEVEL
Address = Orissa
Degree = Board of Secondary Education
Email Address = indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df
Links = https://www.indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df?isid=rex-download&ikw=download-top&co=IN
Address = Orissa
College Name = Government girls High school
Graduation Year = 2008
Address = Orissa
Graduation Year = 2010
Address = Orissa
College Name = Science College, Hinjilicut
Degree = Accounting
Graduation Year = 2013
Address = Orissa
College Name = Berhampur university
Degree = Bachelor in Commerce
Graduation Year = 2015
Address = Orissa
College Name = Khallikote Autonomous college
Degree = Master of Finance and Control in MFC
Companies worked at = Oracle India Private Limited
Years of Experience = 2 Year of experience
Companies worked at = Oracle India Private Limited
Designation = Transaction Processor
Email Address = indeed.com/r/Ashalata-Bisoyi/cf02125911cfb5df
Location = Bengaluru
Companies worked at = Orac

##### Collect unique labels of all entities in dataset
Now we are interested in finding out all of the (unique) entity labels which exist in our dataset.

In [10]:
## collect names of all entities in complete resume dataset
all_labels = list()
for res in converted_resumes:
    entity_list = res[1]['entities']
    all_labels.extend([label for (start, end, label) in entity_list])

unique_labels = set(all_labels)
print("Entity labels")
print_sep('=')
for label in sorted(unique_labels):
    print(label)

Entity labels
Address
Can Relocate to
Certifications
College
College Name
Companies worked at
Degree
Designation
Email Address
Graduation Year
Links
Location
Name
Relocate to
Rewards and Achievements
Skills
UNKNOWN
University
Years of Experience
abc
des
links
projects
state
training


##### Choice of up to 3 entities to use for training a named entity recognition model
Name, College Name, Companies worked at

##### Validate entities
Now we need to check that there is adequate training data for the entities chosen. 

In [11]:
chosen_entity_label = ['Name', 'College Name', 'Companies worked at']
## for each chosen entity label, count how many documents have a labeled entity for that label,
## and how many labeled entities total there are for that entity
for label in chosen_entity_label:
    found_docs_with_entity = 0
    entity_count = 0
    for resume in converted_resumes:
        entity_list = resume[1]["entities"]
        _,_,labels = zip(*entity_list)
        if label in labels:
            found_docs_with_entity+=1
            entity_count+=len([l for l in labels if l == label])
    print("Docs with {}: {}".format(label,found_docs_with_entity))
    print("Total count of {}: {}".format(label,entity_count))

Docs with Name: 687
Total count of Name: 826
Docs with College Name: 497
Total count of College Name: 1160
Docs with Companies worked at: 627
Total count of Companies worked at: 2830
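
The same two counts could also be gathered in a single pass over the data with `collections.Counter`. This is an alternative sketch, not the method used above; `sample_resumes` is a made-up miniature dataset in the converted format.

```python
from collections import Counter

# Hypothetical miniature dataset in the converted format
sample_resumes = [
    ("Jane Doe, Acme", {'entities': [(0, 8, 'Name'),
                                     (10, 14, 'Companies worked at')]}),
    ("John Roe", {'entities': [(0, 8, 'Name')]}),
]

entity_counts = Counter()   # total number of labeled entities per label
doc_counts = Counter()      # number of documents containing each label
for _, annotations in sample_resumes:
    labels = [label for _, _, label in annotations['entities']]
    entity_counts.update(labels)
    doc_counts.update(set(labels))   # count each label once per document

for label in ['Name', 'Companies worked at']:
    print("Docs with {}: {}".format(label, doc_counts[label]))
    print("Total count of {}: {}".format(label, entity_counts[label]))
```

A side effect of this variant is that it also works for documents with an empty entity list.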


##### Results
Adequate training data is available for the chosen entities (at least several hundred examples of each entity).

##### Save converted data for the next part

In [12]:
converted_resumes_path = "converted_resumes.json"
with open(converted_resumes_path, 'wt') as output_file:
    json.dump(converted_resumes, output_file)
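
One thing to keep in mind when reading this file back: JSON has no tuple type, so the tuples come back as lists. The sketch below shows the round trip in memory on a made-up example (`sample` is hypothetical) and one possible way to restore the tuples.

```python
import json

# Hypothetical miniature example in the converted format
sample = [("Jane Doe", {'entities': [(0, 8, 'Name')]})]

# Round trip through JSON in memory (no file needed for the illustration)
loaded = json.loads(json.dumps(sample))
print(loaded[0][1]['entities'])   # the entity tuple is now a list

# Restore the tuple-based structure
restored = [
    (text, {'entities': [tuple(ent) for ent in annotations['entities']]})
    for text, annotations in loaded
]
print(restored == sample)
```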

### Debugging problems with entity offsets

In [13]:
import datetime

name_to_search = 'Mahesh Vijay'  # set to None to deactivate

def print_raw_resume(resume):
    """
    prints content and entities of a raw resume, especially printing
    context (5 characters) around the annotated entity to track down errors
    with entity offsets
    """
    print('=================== raw resume ====================================')
    text = resume['content']
    print(repr(text))
    print('-------------------------------------------------------------------')
    if 'annotation' not in resume or resume['annotation'] is None or len(resume['annotation']) == 0:
        return
    ents = resume['annotation']
    for ent in ents:
        label = ', '.join(ent['label'])
        point = ent['points'][0]
        start = point['start']
        end = point['end']
        referenced_text = point['text']
        print('{} = {}, offsets {}-{}: {} | {} | {}'.format(
            label, repr(referenced_text), start, end,
            repr(text[max(0,start-5):start]),
            repr(text[start:end+1]),
            repr(text[end+1:end+6])  # slicing clamps at the end of the text
        ))

def print_converted_resume(resume):
    """
    prints content and entities of a converted resume, especially printing
    context (5 characters) around the annotated entity to track down errors
    with entity offsets
    """
    print('=================== converted resume ==============================')
    text = resume[0]
    print(repr(text))
    print('-------------------------------------------------------------------')
    ents = sorted(resume[1]['entities'], key=(lambda start_end_label_list: start_end_label_list[0]))
    for ent in ents:
        start = ent[0]
        end = ent[1]
        print('{} = {} | {} | {}'.format(
            ent[2],
            repr(text[max(0,start-5):start]),
            repr(text[start:end]),
            repr(text[end:end+5])  # slicing clamps at the end of the text
        ))


if name_to_search is not None and len(name_to_search) > 0:        
    print('{}: searching for resumes starting with {}'.format(
        datetime.datetime.now(), name_to_search
    ))
    for res in all_resumes:
        if res['content'].startswith(name_to_search):
            print_raw_resume(res)
    for res in converted_resumes:
        if res[0].startswith(name_to_search):
            print_converted_resume(res)

2019-06-19 02:06:05.768265: searching for resumes starting with Mahesh Vijay
"Mahesh Vijay\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Mahesh-Vijay/a2584aabc9572c30\n\nOver 6.5 years of functional enriched experience in ERP in the Procurement to Pay domain. Was\nassociated with Oracle India Pvt Ltd, Bangalore as Team lead - Supplier Data Management in\ntheir Global Financial Information Centre (Global Shared Service Center) for Oracle's Business\nfrom Sep 2007- Feb 2014.\n\nWilling to relocate: Anywhere\n\nWORK EXPERIENCE\n\nTeam lead - supplier data management\n\nOracle India -  Bangalore, Karnataka -\n\nMarch 2014 to December 2016\n\nManaging Partner of family business of Tours & Travels\n\nTeam Lead\n\nOracle India Pvt Ltd -\n\nOctober 2013 to February 2014\n\nSupplier Data Management\n\nLead Analyst -SME -Supplier Data Management\n\nOracle India Pvt Ltd -\n\nSeptember 2012 to October 2013\n\nSenior Analyst -Supplier Data Management\n\nOracle India Pvt Ltd -  Bengaluru,

Companies worked at = 'Oracle', offsets 1829-1834: ' the ' | 'Oracle' | ' Fina'
Companies worked at = 'Oracle', offsets 1733-1738: 's.\n- ' | 'Oracle' | ' Fusi'
Companies worked at = 'Oracle', offsets 1542-1547: 'ke - ' | 'Oracle' | ' Fusi'
Email Address = 'indeed.com/r/Mahesh-Vijay/a2584aabc9572c30', offsets 1358-1399: '/www.' | 'indeed.com/r/Mahesh-Vijay/a2584aabc9572c30' | '?isid'
Companies worked at = 'Oracle', offsets 1313-1318: ' the ' | 'Oracle' | ' e-bu'
Graduation Year = '2004', offsets 1089-1092: 'lore(' | '2004' | ')\n• P'
College Name = 'Vivekananda PU College', offsets 1055-1076: 'from ' | 'Vivekananda PU College' | ', Ban'
Graduation Year = '2007', offsets 1027-1030: 'sity(' | '2007' | ')\n• P'
College Name = 'Vivekananda Degree College, Bangalore\nUniversity', offsets 978-1025: 'from ' | 'Vivekananda Degree College, Bangalore\nUniversity' | '(2007'
Degree = 'Bachelors in Commerce (B.Com) ', offsets 943-972: 'ia\n• ' | 'Bachelors in Commerce (B.Com) ' | 'from '
Location =

Companies worked at = 'rs), ' | 'Oracle' | ' (6 y'
Companies worked at = 'on\n* ' | 'Oracle' | ' E- B'
Skills = 's\n\n\n\n' | '* Desk Manuals/Business Process & Navigation Documentation\n* Business Ethics\n* Professional Communication\n* Reporting Tools & Microsoft Office Applications' | ''
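
The manual inspection above could be complemented by an automatic check that compares each annotated `text` with the characters the offsets actually point to. The helper `find_offset_mismatches` below is hypothetical; it assumes the raw format shown earlier (inclusive `start`/`end` offsets and a `text` field per point), and the sample resume is made up.

```python
def find_offset_mismatches(resume):
    """Return annotations whose 'text' differs from content[start:end+1].

    Hypothetical helper; assumes inclusive 'start'/'end' offsets.
    """
    content = resume['content']
    mismatches = []
    for annotation in resume.get('annotation') or []:
        for point in annotation['points']:
            actual = content[point['start']:point['end'] + 1]
            if actual != point['text']:
                mismatches.append((annotation['label'], point['text'], actual))
    return mismatches

# Hypothetical raw resume: the second annotation is off by one character
sample = {
    'content': "Jane Doe\nAcme Corp",
    'annotation': [
        {'label': ['Name'],
         'points': [{'start': 0, 'end': 7, 'text': 'Jane Doe'}]},
        {'label': ['Companies worked at'],
         'points': [{'start': 10, 'end': 17, 'text': 'Acme Corp'}]},
    ],
}
print(find_offset_mismatches(sample))
```

Running such a check over `all_resumes` would list only the suspicious annotations, which narrows down where to look with `print_raw_resume`.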
