### spaCy Resume NER Example

In [1]:
import json
import random
import logging
import spacy

from spacy import displacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer
from spacy.util import minibatch, compounding

from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score

from pathlib import Path

**Things to Work On**

- Things in progress: 

    - More evaluation metrics and a better understanding of misclassification.
    - Check the validity of the metrics; not sure they are computed properly.

- Research and Wants: 

    - Plug the model into a language model of our own.
    - Look into custom deep learning plug-ins within spaCy to see whether bringing one in is possible.
    - Get a better understanding of what is really happening under the hood in spaCy.
    
**Next Steps** 

- BIO class function for Camille to add to the OCR API.
- Work with the data set stored in a TSV file, reading it properly so the results can be compared.
    - This gives a BIO format that allows for alternative approaches (see PowerPoint).
- Find an alternative framework to work with; looking into one right now.
- Carol could take a CRF approach to this problem to see whether it does a better job.
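
The BIO format mentioned in the list above can be sketched as follows. This is a minimal illustration, not the planned OCR API function: it assumes whitespace tokenization and spaCy-style `(start, end, label)` character offsets with an exclusive end, and the name `to_bio` is hypothetical.

```python
import re

def to_bio(text, entities):
    """Return (token, tag) pairs for whitespace-separated tokens.

    entities: list of (start, end, label) character offsets, end exclusive
    (the same shape convert_to_spacy produces). Hypothetical helper.
    """
    pairs = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in entities:
            # the token overlaps this entity span
            if tok_start < end and tok_end > start:
                # B- for the token that covers the entity's first character,
                # I- for every later token inside the same entity
                prefix = "B" if tok_start <= start else "I"
                tag = prefix + "-" + label
                break
        pairs.append((match.group(), tag))
    return pairs

# made-up sample text and offsets, for illustration only
sample_text = "Alok Khandai\nOperational Analyst - UNISYS"
sample_entities = [(0, 12, "Name"), (13, 32, "Designation"), (35, 41, "Companies worked at")]
for token, tag in to_bio(sample_text, sample_entities):
    print(token, tag)
```

A tagger such as a CRF would consume these token and tag pairs directly; a real implementation would probably reuse the tokenizer of whichever framework is chosen instead of splitting on whitespace.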

**Things Not To Forget** 

- Image Preprocessing / Accuracy of OCR out of the box. 

#### Spacy Resume Example

**Data Example One** 

Extraction Possibilities: 

|Entities: | 
| --- |
| College |
|  Name |
| Skills |
| College Degree |
| Graduation Year |
| Years of Experience |
| Companies worked at |
| Designation |
| Location |
| Email Address |
| Address |
| Can Relocate to |
| Rewards and Achievements |
| Certifications |
| Links |
| University |
| projects |
| state |

Things to remember about the example below: the data set and objective are meant to show that spaCy can work as a baseline. This training data set isn't going to be as strong as the one we will create once we have a large corpus of data to put into our model. It is a good start for seeing what spaCy can do for us, and there is still a lot of tweaking that can be done with this example.

I want to have more than one option, so that we can tweak more than one thing when one approach doesn't give us the result we want or expect. There are many advantages to spaCy, and there will be advantages to other methods as well.

In [2]:
pwd

'C:\\Users\\EJ3514\\Documents\\Python Notebook'

In [3]:
PATH = 'data/ner/'

In [4]:
JSON_FilePath = 'C:/Users/EJ3514/Documents/Python Notebook/data/ner/resumes.json'
JSON_Test_PATH = 'C:/Users/EJ3514/Documents/Python Notebook/data/ner/testdata.json'
output_dir = 'C:/Users/EJ3514/Documents/Python Notebook/data/ner/models'

In [5]:
#Converting Json document to Spacy Format
def convert_to_spacy(JSON_FilePath):
    try:
        training_data = []
        lines=[]
        with open(JSON_FilePath, 'r', encoding='utf-8', errors='ignore') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                #only a single point in text annotation.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    #indices are both inclusive [start, end] but spacy is not [start, end)
                    entities.append((point['start'], point['end'] + 1 ,label))


            training_data.append((text, {"entities" : entities}))

        return training_data
    except Exception as e:
        logging.exception("Unable to process " + JSON_FilePath + "\n" + "error = " + str(e))
        return None

This is what we have inside the current **API** structure, so this example will work for a spaCy baseline attempt.
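
To make the offset handling concrete, here is a small sketch of the same per-line conversion applied to one record. The record is made up for illustration, and `convert_record` is a hypothetical helper that only mirrors the loop body of `convert_to_spacy`; the field names (`content`, `annotation`, `points`, `label`) are the ones the function above reads.

```python
import json

def convert_record(line):
    """Convert one JSON line to a (text, {"entities": [...]}) tuple."""
    data = json.loads(line)
    entities = []
    for annotation in data["annotation"]:
        point = annotation["points"][0]
        labels = annotation["label"]
        if not isinstance(labels, list):
            labels = [labels]
        for label in labels:
            # source offsets are inclusive on both ends; spaCy expects an exclusive end
            entities.append((point["start"], point["end"] + 1, label))
    return data["content"], {"entities": entities}

# made-up record in the same shape as a line of resumes.json
sample_line = json.dumps({
    "content": "Alok Khandai\nBengaluru, Karnataka",
    "annotation": [
        {"label": ["Name"], "points": [{"start": 0, "end": 11}]},
        {"label": "Location", "points": [{"start": 13, "end": 21}]},
    ],
})

text, annotations = convert_record(sample_line)
for start, end, label in annotations["entities"]:
    # slicing with the converted offsets recovers the annotated text
    print(label, repr(text[start:end]))
```

Slicing `text[start:end]` with the converted offsets is a quick way to spot off-by-one errors in the annotations before training.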

In [6]:
data_spacy = convert_to_spacy(JSON_FilePath)

In [7]:
data_spacy[0]

('Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Programming


### **Training NER Model**

In [11]:
def train_spacy_NER(model=None,  data_path=JSON_FilePath, new_model_name='Testing',
                    output_dir=None, drop_out=0.20, n_iter=10):
    """
    model: None -> blank English model, but you can use existing or custom-built spaCy models
    data_path: Specific to this problem; path to the JSON file whose structure is
               converted to the spaCy format.
    new_model_name: Name stored in the model's meta data when it is saved to output_dir.
    output_dir: Path to where you want the model to be stored.
    drop_out: Dropout rate passed to nlp.update.
    n_iter: Number of training iterations; defaults to 10, based on what spaCy uses.
    """
    TRAIN_DATA = convert_to_spacy(data_path)  # use the argument, not the global path
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
    #create the built-in pipeline components and add them to the pipeline
    #nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        #reuse the existing component when a loaded model already has one
        ner = nlp.get_pipe('ner')

    #add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    #get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            #batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts,
                           annotations,
                           sgd=optimizer,
                           drop=drop_out, #0.20, #0.35,
                           losses=losses)
            print('Losses', losses)
            
    #save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

In [13]:
train_spacy_NER(model=None, data_path=JSON_FilePath, new_model_name='Testing',
                    output_dir=output_dir, drop_out=0.20, n_iter=1)

Losses {'ner': 1564.8083640816276}
Saved model to C:\Users\EJ3514\Documents\Python Notebook\data\ner\models


### Testing On Test Data Set

In [14]:
def test_spacy_NER(model_path=None, test_path=JSON_Test_PATH):
    """
    model_path: Path to the trained NER model
    test_path: Path to the test data for this problem
    """
    if model_path is None:
        print("Need Model")
        return

    #test the model and evaluate it
    nlp = spacy.load(model_path)
    examples = convert_to_spacy(test_path)
    d = {}
    for text, annot in examples:
        doc_to_test = nlp(text)
        #token-level gold tags for this resume
        gold = GoldParse(nlp.make_doc(text), entities=annot.get("entities"))
        #NOTE: d is rebuilt for every resume, so the scores printed below
        #describe only the last test example rather than an average
        #per label: [seen flag, precision, recall, F-1, accuracy, count]
        d = {ent.label_: [0, 0, 0, 0, 0, 0] for ent in doc_to_test.ents}
        for ent in doc_to_test.ents:
            if d[ent.label_][0] != 0:
                continue  #this label was already scored for this resume
            #one-vs-rest labels per token: the entity type or 'Not <type>'
            y_true = [ent.label_ if ent.label_ in x else 'Not ' + ent.label_ for x in gold.ner]
            y_pred = [x.ent_type_ if x.ent_type_ == ent.label_ else 'Not ' + ent.label_ for x in doc_to_test]
            p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
            a = accuracy_score(y_true, y_pred)
            d[ent.label_][0] = 1    #Seen flag
            d[ent.label_][1] += p   #Precision
            d[ent.label_][2] += r   #Recall
            d[ent.label_][3] += f1  #F-1 Score
            d[ent.label_][4] += a   #Accuracy
            d[ent.label_][5] += 1   #Base
    for i in d:
        print("\n For Entity " + i + "\n")
        print("Accuracy : " + str((d[i][4] / d[i][5]) * 100) + "%")
        print("Precision : " + str(d[i][1] / d[i][5]))
        print("Recall : " + str(d[i][2] / d[i][5]))
        print("F-score : " + str(d[i][3] / d[i][5]))

In [15]:
test_spacy_NER(model_path=output_dir, test_path=JSON_Test_PATH)

  'recall', 'true', average, warn_for)



 For Entity Name

Accuracy : 99.83805668016194%
Precision : 0.9983831936194594
Recall : 0.9983805668016195
F-score : 0.9981113185060555

 For Entity Location

Accuracy : 99.19028340080972%
Precision : 0.9887112503324567
Recall : 0.9919028340080972
F-score : 0.9892169555139021

 For Entity Email Address

Accuracy : 99.43319838056681%
Precision : 1.0
Recall : 0.994331983805668
F-score : 0.9971579374746244

 For Entity Can Relocate to

Accuracy : 99.91902834008097%
Precision : 1.0
Recall : 0.9991902834008097
F-score : 0.9995949777237749

 For Entity Designation

Accuracy : 99.83805668016194%
Precision : 1.0
Recall : 0.9983805668016195
F-score : 0.9991896272285252

 For Entity Companies worked at

Accuracy : 99.91902834008097%
Precision : 1.0
Recall : 0.9991902834008097
F-score : 0.9995949777237749

 For Entity Skills

Accuracy : 97.32793522267207%
Precision : 0.973995665286443
Recall : 0.9732793522267207
F-score : 0.962707431998957
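
The scores above are computed per token with a weighted average, so the many tokens outside a given entity type count toward the result and may make the numbers look better than the extraction really is. One alternative worth comparing is entity-level scoring, where a prediction counts only if its span and label both match the gold annotation exactly. The sketch below assumes that exact-match definition; `entity_prf` and the sample spans are hypothetical, and this is not the metric used in the cell above.

```python
from collections import Counter

def entity_prf(gold_docs, pred_docs):
    """Per-label precision, recall, and F1 using exact span matches.

    gold_docs / pred_docs: one list of (start, end, label) per document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        for _, _, label in gold & pred:   # predicted and annotated
            tp[label] += 1
        for _, _, label in pred - gold:   # predicted but not annotated
            fp[label] += 1
        for _, _, label in gold - pred:   # annotated but missed
            fn[label] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores

# made-up spans for two documents, for illustration only
gold_docs = [[(0, 12, "Name"), (30, 33, "Skills"), (40, 44, "Skills")],
             [(0, 5, "Name"), (10, 14, "Skills")]]
pred_docs = [[(0, 12, "Name"), (30, 33, "Skills")],
             [(0, 6, "Name"), (10, 14, "Skills"), (20, 24, "Skills")]]
for label, score in sorted(entity_prf(gold_docs, pred_docs).items()):
    print(label, score)
```

With the model from this notebook, the gold spans would come from `annot["entities"]` and the predictions from `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`, accumulated over all test resumes.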


### Viz of NER

In [28]:
test = convert_to_spacy(JSON_Test_PATH)
df_test = [test[0]]

In [31]:
df_test2 = [test[3]]; df_test2

[('Alok Khandai\nOperational Analyst (SQL DBA) Engineer - UNISYS\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n❖ Experience in creating Jobs, Alerts, SQL Mail Agent\n❖ Experience in performing in

In [29]:
df_test

[("Abhishek Jha\nApplication Development Associate - Accenture\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a\n\n• To work for an organization which provides me the opportunity to improve my skills\nand knowledge for my individual and company's growth in best possible ways.\n\nWilling to relocate to: Bangalore, Karnataka\n\nWORK EXPERIENCE\n\nApplication Development Associate\n\nAccenture -\n\nNovember 2017 to Present\n\nRole: Currently working on Chat-bot. Developing Backend Oracle PeopleSoft Queries\nfor the Bot which will be triggered based on given input. Also, Training the bot for different possible\nutterances (Both positive and negative), which will be given as\ninput by the user.\n\nEDUCATION\n\nB.E in Information science and engineering\n\nB.v.b college of engineering and technology -  Hubli, Karnataka\n\nAugust 2013 to June 2017\n\n12th in Mathematics\n\nWoodbine modern school\n\nApril 2011 to March 2013\n\n10th\n\nKendriya Vidyalaya\

In [18]:
print("Loading from", output_dir)

Loading from C:/Users/EJ3514/Documents/Python Notebook/data/ner/models


In [20]:
nlp = spacy.load(output_dir)

In [32]:
test_text = "Abhishek Jha\nApplication Development Associate - Accenture\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a\n\n• To work for an organization which provides me the opportunity to improve my skills\nand knowledge for my individual and company's growth in best possible ways.\n\nWilling to relocate to: Bangalore, Karnataka\n\nWORK EXPERIENCE\n\nApplication Development Associate\n\nAccenture -\n\nNovember 2017 to Present\n\nRole: Currently working on Chat-bot. Developing Backend Oracle PeopleSoft Queries\nfor the Bot which will be triggered based on given input. Also, Training the bot for different possible\nutterances (Both positive and negative), which will be given as\ninput by the user.\n\nEDUCATION\n\nB.E in Information science and engineering\n\nB.v.b college of engineering and technology -  Hubli, Karnataka\n\nAugust 2013 to June 2017\n\n12th in Mathematics\n\nWoodbine modern school\n\nApril 2011 to March 2013\n\n10th\n\nKendriya Vidyalaya\n\nApril 2001 to March 2011\n\nSKILLS\n\nC (Less than 1 year), Database (Less than 1 year), Database Management (Less than 1 year),\nDatabase Management System (Less than 1 year), Java (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTechnical Skills\n\nhttps://www.indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a?isid=rex-download&ikw=download-top&co=IN\n\n\n• Programming language: C, C++, Java\n• Oracle PeopleSoft\n• Internet Of Things\n• Machine Learning\n• Database Management System\n• Computer Networks\n• Operating System worked on: Linux, Windows, Mac\n\nNon - Technical Skills\n\n• Honest and Hard-Working\n• Tolerant and Flexible to Different Situations\n• Polite and Calm\n• Team-Player"
test_text2 = 'Alok Khandai\nOperational Analyst (SQL DBA) Engineer - UNISYS\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Alok-Khandai/5be849e443b8f467\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server.\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n❖ Experience in creating Jobs, Alerts, SQL Mail Agent\n❖ Experience in performing integrity checks. Methods include configuring the database\nmaintenance plan wizard and DBCC utilities\n❖ Experience in using Performance Monitor, SQL Profiler and optimizing the queries, tracing long\nrunning queries and deadlocks.\n❖ Experience in applying patches and service packs to keep the database at current patch level.\n❖ Ability to manage own work and multitask to meet tight deadlines without losing sight of\npriorities..\n\nWilling to relocate to: Bengaluru, Karnataka\n\nWORK EXPERIENCE\n\nOperational Analyst (SQL DBA) Engineer\n\nUNISYS -  Bengaluru, Karnataka -\n\nJuly 2016 to Present\n\n❖ Having 3.5 Years of IT experience in SQL Database Administration, System Analysis, Design,\nDevelopment & Support of MS SQL Servers in Production, Development environments &\nReplication and Cluster Server Environments.\n❖ Working Experience with relational database such as SQL.\n❖ Experience in Installation, Configuration, Maintenance and Administration of SQL Server. 
\n❖ Experience in upgrading SQL Server.\n❖ Good experience with implementing DR solution, High Availability of database servers using\nDatabase mirroring and replications and Log Shipping.\n❖ Experience in implementing SQL Server security and Object permissions like maintaining\nDatabase authentication modes, creation of users, configuring permissions and assigning roles\nto users.\n\nDBA Support Analyst\n\nMicrosoft Corporation -  Redmond, WA -\n\nhttps://www.indeed.com/r/Alok-Khandai/5be849e443b8f467?isid=rex-download&ikw=download-top&co=IN\n\n\nJuly 2016 to Present\n\nClient Description:\nMicrosoft Corporation is an American public multinational corporation headquartered in\nRedmond, Washington, USA that develops, manufactures, licenses, and supports a wide range of\nproducts and services predominantly related to computing through its various product divisions.\n\nEnvironment:\nMicrosoft has E2E development and production environment of more than 25000 servers and\napplications. We are responsible for pro-active monitoring of all the servers and their jobs using\nmonitoring tools to reduce critical business impact by alerting respective peer teams. Microsoft\nService Enterprise an ITSM tools are used for ticketing and SharePoint portal is used to store all\ntechnical and process documentation.\n\nRoles and Responsibilities:\n• Responsible for Database support, troubleshooting, planning and migration. Resource planning\nand coordination for application migrations with project managers, application and web app\nteams. 
Project involved guidance and adherence to standardized procedures for planned data\ncenter consolidation for worldwide centers using in-house corporate and third party applications\nbased on SQL 2000 in upgrade project to SQL 2005.\n• Monitoring of database size and disk space in Production, Staging & Development environments\n• Performed installation of SQL Enterprise 2005 64bit version on Windows 2003 servers on\nEnterprise systems of clustered and standalone servers in enterprise Data Centers. Patch\napplications.\n• Failover cluster testing and resolution on HP servers as well as monitoring and backup reporting\nsetup with Microsoft Operations Manager and backup teams.\n• Working in Microsoft production environment which includes applications and servers.\n• Configured Transactional Replication and Log Shipping with SQL Server Management Studio as\nwell as basic account management and troubleshooting with connectivity, security and firewall\nissues.\n• Handling issues related to Server Availability, Performance.\n• Performed Production support and on Call duties\n• Conducted Performance Tuning using SQL Profiler and Windows Performance Monitor.\n• Worked with various business groups while developing their applications, assisting in database\ndesign, installing SQL Server clients, phasing from development to QA and to Production\nenvironment.\n\nPrevious Project\n❖ Project Title: Finance Support\n❖ Client: Costco Wholesale Corporation (USA)\n❖ Team size: 22\n❖ Role: DBA Support Analyst\n❖ Environment: Window 10\n\n(SQL DBA Analyst) Engineer\n\nHCL Technologies -  Bengaluru, Karnataka -\n\nNovember 2014 to July 2016\n\n〓 Performed server installation and configurations for SQL Server 2005 and SQL Server 2000.\n\n\n\n〓 Performed installation of SQL Server Service Packs\n〓 Upgraded databases from SQL Server 2000 to SQL Server 2005.\n〓 Scheduled Full and Transactional log backups for the user created and system databases in\nthe production environment using the 
Database Maintenance Plan Wizard.\n〓 Setup backup and restoration jobs for development and QA environments\n〓 Created transactional replication for the reporting applications.\n〓 Implemented disaster recovery solution at the remote site for the production databases using\nLog Shipping.\n〓 Used System monitor to find the bottlenecks in CPU, Disk I/O and memory devices and\nimproved the database server performance.\n〓 Used SQL Server Profiler to monitor and record database activities of particular users and\napplications.\n〓 Used DBCC commands to troubleshoot issues related to database consistency\n〓 Worked with various business groups while developing their applications, assisting in database\ndesign, installing SQL Server clients, phasing from development to QA and to Production\nenvironment\n\nMicrosoft Corporation -\n\nNovember 2014 to July 2016\n\nClient Description:\n\n❖ Costco Wholesale Corporation operates an international chain of membership warehouses,\nmainly under the "Costco Wholesale" name, that carry quality, brand name merchandise at\nsubstantially lower prices than are typically found at conventional wholesale or retail sources. The\nwarehouses are designed to help small-to-medium-sized businesses reduce costs in purchasing\nfor resale and for everyday business use. 
Individuals may also purchase for their personal needs.\n\n❖ Responsibilities:\n\n➢ Performed server installation and configurations for SQL Server 2005 and SQL Server 2000.\n➢ Performed installation of SQL Server Service Packs\n➢ Upgraded databases from SQL Server 2000 to SQL Server 2005.\n➢ Scheduled Full and Transactional log backups for the user created and system databases in\nthe production environment using the Database Maintenance Plan Wizard.\n➢ Setup backup and restoration jobs for development and QA environments\n➢ Created transactional replication for the reporting applications.\n➢ Implemented disaster recovery solution at the remote site for the production databases using\nLog Shipping.\n➢ Used System monitor to find the bottlenecks in CPU, Disk I/O and memory devices and improved\nthe database server performance.\n➢ Used SQL Server Profiler to monitor and record database activities of particular users and\napplications.\n➢ Used DBCC commands to troubleshoot issues related to database consistency\n➢ Worked with various business groups while developing their applications, assisting in database\ndesign, installing SQL Server clients, phasing from development to QA and to Production\nenvironment\n\n\n\nEDUCATION\n\nB.Tech in Computer Science and Engineering in CSE\n\nIndira Gandhi Institute Of Technology\n\n2012\n\nSKILLS\n\nDatabase (3 years), SQL (3 years), Sql Dba\n\nADDITIONAL INFORMATION\n\nTECHNICAL PROFICIENCY\n❖ Operating Environment: […] Windows95/98/XP/NT\n❖ Database Tool: SQL Management Studio (MSSQL), Business\nDevelopment Studio, Visual studio 2005\n❖ Database Language: SQL, PL/SQL\n❖ Ticket Tracking Tool: Service Now\n❖ Reporting Tools: MS Reporting Services, SAS\n❖ Languages: C, C++, PL/SQL'

In [24]:
doc = nlp(test_text)

In [25]:
#What Was tagged
for ent in doc.ents:
    print(ent.label_, ent.text)

Name Abhishek Jha
Designation Application Development Associate
Companies worked at Accenture
Location Bengaluru
Email Address indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a
Designation Application Development Associate
Companies worked at Accenture
Degree B.E in Information science and engineering
College Name B.v.b college of engineering and technology
Location Hubli
Designation Kendriya Vidyalaya
Skills C
Skills Database
Skills Java


In [33]:
doc2 = nlp(test_text2)

In [34]:
#What Was tagged
for ent in doc2.ents:
    print(ent.label_, ent.text)

Name Alok Khandai
Designation Operational Analyst
Location Bengaluru
Email Address indeed.com/r/Alok-Khandai/5be849e443b8f467
Years of Experience 3.5 Years of IT experience
Location Bengaluru
Designation Operational Analyst
Companies worked at UNISYS
Location Bengaluru
Designation DBA Support Analyst
Companies worked at Microsoft Corporation
Years of Experience July 2016 to Present
Companies worked at HCL Technologies
Location Bengaluru
Companies worked at Microsoft Corporation
Degree B.Tech in Computer Science and Engineering
College Name Indira Gandhi Institute Of Technology
Graduation Year 2012
Skills Database
Skills SQL
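
For downstream use, the flat list of tagged entities is often easier to work with when grouped by label. The sketch below is one possible shape for that, assuming repeated values should be dropped while keeping first-seen order; `group_entities` is a hypothetical helper, and the pairs are a subset of the output printed above.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (label, text) pairs into {label: [unique texts in first-seen order]}."""
    grouped = defaultdict(list)
    for label, text in pairs:
        # dropping repeats is an assumption; keep them if counts matter
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# subset of the (label, text) pairs printed for the second resume
pairs = [
    ("Name", "Alok Khandai"),
    ("Location", "Bengaluru"),
    ("Companies worked at", "UNISYS"),
    ("Location", "Bengaluru"),
    ("Companies worked at", "Microsoft Corporation"),
    ("Skills", "Database"),
    ("Skills", "SQL"),
]
print(group_entities(pairs))
```

With a processed document, the pairs could be built as `[(ent.label_, ent.text) for ent in doc2.ents]`.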


In [26]:
#Viz of What Was Tagged
doc = nlp(test_text)
displacy.render(doc, style="ent",jupyter=True)

In [35]:
#Viz of What Was Tagged
doc = nlp(test_text2)
displacy.render(doc, style="ent",jupyter=True)