<a href="https://colab.research.google.com/github/RumethR/Resume-Ner/blob/main/Cw1_w1809911_RumethRandombage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part A: Application Area Review

**Selected Domain: Human resources and computing.**



The domain of human resources within an organisation entails a wide variety of operations and responsibilities ranging from employee compensation to organisational structure. This poses a great opportunity for Artificial Intelligent (AI) systems to contribute towards the Human Resource Management (HRM) tasks within an organisation. One of the most common areas where AI has been highly used is within the realm of recruitment and talent acquisition. A study by the ‘Oracle’ corporation has revealed that professionals in the human resource management industry believe that AI can boost productivity among other statistics (Rahmani and Kamberaj, 2021). Other sources believe that AI will enable new application procedures for candidates, increase applicant discovery rates and even reduce vacancy times as these systems can traverse through applications faster than trained professionals (Vrontis et al., 2021). AI powered applications in HRM are also able to process large amounts of data at high speeds, faster and more accurately than any trained professional, this allows organisations to gain insights that would otherwise be impossible with manual labour. The use of these systems are also more likely to be less biassed towards factors that do not affect performance such as age, gender or race (assuming that they are developed properly). Repetitive tasks within the domain of HRM are more likely to be successfully automated with AI powered applications according to *Vrontis et al. (2021)*.

The overall effect of bias that is induced when training was studied by the influential work of *Tambe, Cappelli and Yakubovich (2019)* who observed that the data that these systems were built on had a major impact in how these systems performed. This underlying data itself has the opportunity to represent industry swings and unspoken ethics (i.e. the system might be trained on data that indicated towards hiring from a specific university), hence it is important that these systems work in tandem with trained professionals.This study also looks into how Machine Learning algorithms have fared against several different HR operations within organisations. AI powered HRM task management systems such as Quine, BenefitFocus and Jobvite are now widely available for any organisation of any scale to incorporate into their workflows (Tambe, Cappelli and Yakubovich, 2019).

On the opposite side of the spectrum, employees now have access to tools such as IBM’s ‘Blue Match’ presents viable job openings for their specific skill set, similarly AI powered tools have helped IBM to identify possible flight risks and retain employees for much longer according to their former CEO Ginni Rometty (Rosenbaum, 2019). These HRM service vendors however do not provide any information about the type of AI techniques that are used internally within their products. It is also important to note that most AI powered solutions can typically be applied on a small portion of a HRM task, for example in the task of vetting candidates these tools can only provide a suggestion or score of desirability is most cases rather than performing the entire stack of recruiting tasks that go alongside the vetting process (Ghosh, Majumder and Santosh Kumar Das, 2023). The common theme among most studies that observe the usage of AI in HRM is that they are heavily reliant on the quality and the quantity of the data that they are being built upon.

With the rise of open source Large Language Models (LLM) it is easier than ever for organisations to create a custom tailored algorithm that can perform trivial tasks within HRM such as answering frequently asked questions from employees through chatbots *Pawan Budhwar et al. (2023)* studied the use of popular generative AI tool ChatGPT in a HRM context and how it affects areas other than productivity such as employment relations, employee wellbeing and engagement. This study revealed that employees prefer tools such as ChatGPT in many scenarios that require understanding and perceiving information more than their human co-workers. However long term social, ethical and legal implications of using such technologies are difficult to quantify (Pawan Budhwar et al., 2023). Listed below are some other AI-based techniques used in the HRM context (Ghosh, Majumder and Santosh Kumar Das, 2023).

*   Turnover Prediction with Artificial Neural Networks
*   Candidate Search With Knowledge-Based Search Engines
*   Sentiment Analysis on employee engagement and relations
*   Résumé data acquisition using NLP




# Part B: Comparision and evaluation of AI Techniques used in Human Resources

As mentioned before all HRM tasks in an organization cannot be supplemented or automated by one singular system. The scope of this specific section will cover the task of utilizing Machine Learning techniques for the purpose of recruiting, vetting and staffing employees using a CV/Résumé. This project will focus on using Named Entity Recognition (NER) on CVs to identify talent that best suites a given job posting. NER is classified as a Sequence Labelling Task within the domain of Natural Language Processing (NLP), there are many different ways of addressing these sequence labelling tasks listed below are three such approaches.

### NER using Conditional Random Fields (CRF)

Using Conditional Random Fields for NER tasks is a popular machine learning based approach which uses a probabilistic model. CRFs capture dependencies among neighbouring labels in a sequence and have the ability to consider the entire input sequence when making predictions. Feature functions are used to define relationships between input features (words) and output labels (entities). These relationships have to be manually defined by the trainer. CRFs calculate the probability distribution of an output sequence (labels) given an input sequence.Transition weights in Conditional Random Fields represent the likelihood of moving from one label to another in the sequence. During the training phase, the relationships between input features, their corresponding output labels (using data from the feature functions) and the effect of the transition weights are learned from the training data.

For training a model that can perform NER using CRF text data with entities as labels are required. Necessary words in the training data sequence need to be accompanied by its corresponding entity label. Datasets for training NER models are widely available under different platforms such as Kaggle and the UCI Machine Learning Repository. One of the most common dataset and evaluation benchmark for NER is the CoNLL-2003 [dataset](https://www.clips.uantwerpen.be/conll2003/ner/). CRF based models tend to perform well as they have the ability to capture contextual information about the data. However performance is heavily dependent on the quality and relevance of engineered features, this is one of the major drawbacks of CRF based models. Feature engineering will mostly depend on the domain that the information is extracted from, and in the context of CVs people vary in the way they express information hence engineering features that capture necessary information will be quite a difficult task. Generally, CRF based models tend to be computationally efficient compared to their Deep Learning based counterparts who in turn will be better at capturing more intricate patterns within the data.

### NER using Neural Networks and pre-trained models

Currently there are many frameworks and pre-trained models that ease the task of NER. Similar to Tensorflow there are libraries that are built for performing NLP tasks. [SpaCy](https://spacy.io/usage/spacy-101) is one such Python package that offers pre-trained models for tasks such as NER. SpaCy’s default models use Convolutional Neural Networks (CNN) for the task of feature extraction and entity recognition. These models are trained on large labelled datasets hence they cover a wide range of contexts and features. By default the SpaCy package can identify entities within a  given text such as locations, dates, persons and organisations. These can be customised through a variety of API’s that are also included within the library. Similar to CRF and Transformer based models SpaCy’s CNNs can also understand contextual information regarding the input data, allowing the model to capture relationships between entities.

Efficiency of a custom built CRF based model for NER depends heavily on the training data and the quality of the engineered features. Comparatively pre-trained models from SpaCy can offer similar accuracy scores and be more computationally efficient at  the same time. However by using a pre-built model the level of customization is limited. There are other libraries that have packaged in pre-trained models for NER such as [NLTK](https://www.nltk.org/) (Natural Language Toolkit) which can offer similar functionality to SpaCy, each library has their own strengths and weaknesses.

### NER using pre-trained Transformer based models (BERT)

With the rise of pre-trained Transformer based models such as BERT and GPT, the application of these models for NLP tasks has proven to yield competitive results. The Bi-directional approach considers text preceding from both the left and the right of a given word in a sentence. This helps the model to understand a much broader context (and longer text sequences) compared to the default pre-trained models offered by SpaCy which uses a CNN. This can be especially useful in the context of parsing data from CVs which can contain ambiguous and complex sentences. There are many more pre-trained Transformer based models which can have this characteristic, these are available through the [Hugging Face](https://huggingface.co/) community platform. BERT specifically is trained on a large corpus of text using 2 unsupervised learning techniques, masked language modelling and next sentence prediction.

Unlike SpaCy, BERT is not an out-of-the-box solution for most NER tasks. This model needs to be fine-tuned in order to be used for NER purposes. This allows BERT to be domain specific and can in theory lead to better performance on very complex and domain specific NER tasks. The drawback of this level of customization is that a BERT model is more resource intensive and technically challenging to develop compared to SpaCy or NLTK. It also requires a labelled dataset for training before it can be used. Fine-tuning a BERT model requires large amounts of data to yield competitive results, which might not be publicly available in the case of CVs. In summary, BERT models can perform better than the other two approaches given that they are fine-tuned correctly, whilst SpaCy offers a better efficiency to performance ratio without requiring extensive training or customization. Although compared to training a CRF based model for NER form scratch customising a BERT model will yield better results, one drawback is it being more computationally intensive to train and use.



# Part C: Implementation

We will be using a text-based resume dataset available from Kaggle, the dataset will be accessed directly via this notebook using the Kaggle API. The credentials are hardcoded onto this notebook for ease-of-use during the viva. (Not Recommended)

In [1]:
!pip install kaggle

!mkdir ~/.kaggle
!echo '{"username":"rumethrandombage","key":"6c49cc180f0498fccf2dabb3d22f0236"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

import json

!kaggle datasets download -d dataturks/resume-entities-for-ner
!unzip resume-entities-for-ner.zip #Extracts to ./content

dataset = [] # Each element contains a json object
with open('/content/Entity Recognition in Resumes.json', 'r') as file:
    for line in file:
        dataset.append(json.loads(line))

print("Number of elements in dataset list", len(dataset))

Downloading resume-entities-for-ner.zip to /content
  0% 0.00/323k [00:00<?, ?B/s]
100% 323k/323k [00:00<00:00, 14.6MB/s]
Archive:  resume-entities-for-ner.zip
  inflating: Entity Recognition in Resumes.json  
Number of elements in dataset list 220


The dataset consists of 220 JSON objects. The structure of each JSON object is as follows:

> "content" -> (contains all the data in the CV as a string)

> "annotations" -> annotation label ("Name", "Skills", etc.) -> points ("start and stop index where the label is contained") - > text for label






SpaCy is packaged with a CNN based pre-built model that can perform NER for general entity types (Persons, Organizations, Locations, etc.).

In [2]:
!python -m spacy download en_core_web_sm

2024-01-06 12:33:53.631246: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-06 12:33:53.631315: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-06 12:33:53.632911: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-06 12:33:53.641642: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Collecting en-core-web-sm==3.6.0
  Downloading https:

In [76]:
import spacy
from spacy import displacy

# Access one string of text without any annotations from the datset and pass it into the pre-trained spacy model
first_entry = dataset[0]

nlp = spacy.load("en_core_web_sm")
doc = nlp(first_entry["content"])

displacy.render(doc, style="ent", jupyter=True,)
#GPE stands for Geo-Political Entity (Place)

The built-in NER model performs extremely poorly without any customisations. We can use transfer learning or we can create our own model using the NER pipeline provided by SpaCy, we will be creating a custom model for this project.

In [4]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm # Used to show progress bars also helps identify places with problems
from spacy.util import filter_spans # To make sure there are no overlapping entities

nlp = spacy.load('en_core_web_sm')

labels = ["Name", "College Name", "Degree", "Graduation Year", "Years of Experience", "Companies worked at", "Designation", "Skills", "Location", "Email Address"]

#ner = nlp.add_pipe("ner") #only using the ner component for the pipeline provided by SpaCy
ner = nlp.get_pipe('ner')

# Add the pre-defined labels gathered from the dataset to the pipeline
for label in labels:
  ner.add_label(label)

Required format for SpaCy to train model.

{
    "text": "Apple Inc. is headquartered in California.",
    "entities": [
        {"start": 0, "end": 10, "label": "ORG"},
        {"start": 36, "end": 46, "label": "GPE"}
    ]
}

Current format (After extracting from dataset)

{
  "content": "Rumeth Randombage is a student at IIT, with one year Internship experience. Email .....",

  annotation: [
    { [label: "Skills"],  
      [points:
      {start: 1295,
      end: 1621,
      text: Python}
      ]
    }
  ]
}

In [77]:
import string
import re

# Convert the data to spaCy's training data format
formatted_data = []

def process_dataset_entry(first_entry):
  raw_text = first_entry["content"]

  # Clean text: remove punctuation and newline characters
  cleaned_text = re.sub(r'[\n\r]', ' ', raw_text)
  cleaned_text = cleaned_text.replace("'", "")

  #cleaned_text = raw_text.translate(str.maketrans('', '', string.punctuation)).replace('\n', ' ')

  # Process cleaned text with spaCy
  doc = nlp.make_doc(cleaned_text)

  # Retrieve cleaned text without punctuation and newline characters
  cleaned_text = ' '.join(token.text for token in doc if not token.is_punct and not token.is_space)

  formatted_entry = {"text": "", "entities": []}
  formatted_entry["text"] = cleaned_text

  # Make sure that the given start and end points for specific label lines up with the text
  entities = []

  # This loop checks if the labels and annotations are correct.
  for i in first_entry["annotation"]:
    try:
      label = i['label'][0]
      annotation_text = i["points"][0]["text"]

      #clean the annotation text as before, and get the start and end points within the larger text
      annotation_text_clean = re.sub(r'[\n\r]', ' ', annotation_text)
      annotation_text_clean = annotation_text_clean.replace("'", "")

      # Process cleaned text with spaCy
      doc = nlp.make_doc(annotation_text_clean)

      # Retrieve cleaned text without punctuation and newline characters
      cleaned_annotation_text = ' '.join(token.text for token in doc if not token.is_punct and not token.is_space)

      start_index = cleaned_text.find(cleaned_annotation_text)
      end_index = start_index + len(cleaned_annotation_text) - 1 if start_index != -1 else -1
      if label in labels: # In case there are some labels that are not needed/wrong
        entities.append({"start": start_index, "end": end_index, "label": label})
        formatted_entry["entities"] = entities
    except IndexError:
      print("There was an error parsing a label, skipping entity")
      return

  discard_entries = 0 # Discarded because of overlapping
  checked_entry = prevent_overlapping(formatted_entry)
  if len(checked_entry["entities"]) == 0:
    print("There was overlapping within all the entities")
    discard_entries = discard_entries + 1
    return
  else:
    formatted_data.append(formatted_entry)


# Make sure the entities do not overlap within the same text span
def prevent_overlapping(entry):
  # Overlapping refers to when two labels point to the same text span
  entities = entry["entities"]

  # Sort entities by start index to check for overlaps
  sorted_entities = sorted(entities, key=lambda x: x['start'])\

  previous_end = -1
  non_overlapping_entities = []
  overlapping_entities = []
  for entity in sorted_entities:
    if entity['start'] >= previous_end:
        non_overlapping_entities.append(entity)
        previous_end = entity['end']
    else:
      overlapping_entities.append(entry)
      #print("Overlapping Detected on the following entity")
      #print(entity)

  # Create and return a new entry with the only the non overlapping entities
  if len(non_overlapping_entities) >= 1:
    entry["entities"] = non_overlapping_entities
    return entry

for entry in tqdm(dataset):
  process_dataset_entry(entry)

#print(formatted_data[0])
print(len(formatted_data))


 40%|████      | 88/220 [00:01<00:02, 56.24it/s]

There was an error parsing a label, skipping entity


 84%|████████▍ | 185/220 [00:02<00:00, 80.37it/s]

There was an error parsing a label, skipping entity


100%|██████████| 220/220 [00:03<00:00, 66.64it/s]

218





spaCy requires training data to be in 'Doc' objects. The output from the pipeline will also result in a Doc object. Finally the file will be saved in the runtime with the .spacy format.

In [78]:
def create_doc_objects(entry):
  text = entry["text"]
  doc = nlp.make_doc(text)
  # Assuming you've already loaded the SpaCy model and created a 'doc' object from the text

  annotations = entry["entities"]

  ents = []

  for entity in annotations:
      start = entity["start"]
      end = entity["end"] + 1
      label = entity["label"]

      # Initialize token indices
      start_token_idx = None
      end_token_idx = None

      # Find the token indices that correspond to the character indices
      for i, token in enumerate(doc):
          if token.idx == start:
              start_token_idx = i
          if token.idx + len(token.text) == end:
              end_token_idx = i

      if start_token_idx is not None and end_token_idx is not None:
          span = doc[start_token_idx:end_token_idx + 1]  # Create a span from tokens
          span_label = (span.start_char, span.end_char, label)
          span_to_add = doc.char_span(span.start_char, span.end_char, label=label)  # Format: (start_char, end_char, label)
          ents.append(span_to_add)
      else:
          print("Invalid span detected:", start, end)

  # 'ents' will contain the entities in the format required for spaCy training
  try:
    doc.ents = ents
    return doc
  except ValueError:
    print("Error while creating doc, skipping entry")
    return None

db = DocBin()
for entry in tqdm(formatted_data):
  doc_obj = create_doc_objects(entry)
  if doc_obj != None:
    db.add(doc_obj)
  else:
    continue

 11%|█         | 24/218 [00:00<00:01, 115.46it/s]

Invalid span detected: 900 1578
Invalid span detected: 7303 7307
Invalid span detected: 1790 2041
Invalid span detected: 20 40


 23%|██▎       | 51/218 [00:00<00:01, 121.02it/s]

Invalid span detected: 120 126
Invalid span detected: 377 381
Invalid span detected: 923 970
Invalid span detected: 1029 1033
Invalid span detected: 1711 1743
Invalid span detected: 188 192
Invalid span detected: 194 197
Invalid span detected: 49 55
Invalid span detected: 254 274


 45%|████▌     | 99/218 [00:00<00:01, 102.50it/s]

Error while creating doc, skipping entry
Invalid span detected: 57 101


 70%|███████   | 153/218 [00:01<00:00, 159.91it/s]

Invalid span detected: -1 0
Invalid span detected: 0 4
Invalid span detected: 936 941
Invalid span detected: 35 48
Invalid span detected: -1 0
Invalid span detected: 5060 5115


 87%|████████▋ | 190/218 [00:01<00:00, 156.80it/s]

Invalid span detected: -1 0
Invalid span detected: 909 1028
Invalid span detected: 0 17
Invalid span detected: -1 0
Invalid span detected: 34 45
Invalid span detected: 165 173


100%|██████████| 218/218 [00:01<00:00, 131.03it/s]


In [64]:
import random

docs = list(db.get_docs(nlp.vocab))

# Shuffle the Doc objects
random.seed(42)  # Set seed for reproducibility
random.shuffle(docs)

# Define the ratio for train-test split (e.g., 80% train, 20% test)
train_ratio = 0.8
num_docs = len(docs)
train_size = int(train_ratio * num_docs)

# Split into train and test sets
train_docs = docs[:train_size]
test_docs = docs[train_size:]

print("Number of Valid and usable data points for training: ", len(train_docs))
print("Number of Valid and usable data points for training: ", len(test_docs))

train_doc_bin = DocBin(docs=train_docs)
test_doc_bin = DocBin(docs=test_docs)

# Save the train and test DocBins to separate files
train_doc_bin.to_disk("./train_data.spacy")
test_doc_bin.to_disk("./test.spacy")

Number of Valid and usable data points for training:  173
Number of Valid and usable data points for training:  44


In [52]:
import os
train_path = './train.spacy'
test_path = "./test.spacy"

print("Train file exists:", os.path.exists(train_path))
print("Test file exists:", os.path.exists(train_path))

!python -m spacy init config - --lang en --pipeline ner --optimize efficiency > config.cfg

Train file exists: True
2024-01-06 14:16:35.791355: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-06 14:16:35.791432: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-06 14:16:35.792953: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


The config.cfg file contains all the necessary hyperparameters to the model. This uses the default configuration. The optimizer is set to **Adam** by default, and the **learning rate to 0.001**.

In [54]:
# WARNING: RUNNING THIS MIGHT TAKE A WHILE ~40mins
!python -m spacy train config.cfg --paths.train /content/train.spacy --paths.dev /content/test.spacy --output output_folder

2024-01-06 14:17:04.400488: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-06 14:17:04.400567: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-06 14:17:04.401861: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[38;5;4mℹ Saving to output directory: output_folder[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0     

ENTS_F = F1 Score for each epoch in indentifying the entities

ENTS_P = F1 Score for each epoch in indentifying the entities

ENTS_R = F1 Score for each epoch in indentifying the entities

In [None]:
nlp_ner = spacy.load("/content/output_folder/model-best")

Let's see if the model works, using a datapoint in the dataset.

In [75]:
from spacy import displacy

doc = nlp_ner("Abhishek Jha Application Development Associate Accenture Bengaluru Karnataka Email me on Indeed indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a To work for an organization which provides me the opportunity to improve my skills and knowledge for my individual and companys growth in best possible ways Willing to relocate to Bangalore Karnataka WORK EXPERIENCE Application Development Associate Accenture November 2017 to Present Role Currently working on Chat bot Developing Backend Oracle PeopleSoft Queries for the Bot which will be triggered based on given input Also Training the bot for different possible utterances Both positive and negative which will be given as input by the user EDUCATION B.E in Information science and engineering B.v.b college of engineering and technology Hubli Karnataka August 2013 to June 2017 12th in Mathematics Woodbine modern school April 2011 to March 2013 10th Kendriya Vidyalaya April 2001 to March 2011 SKILLS C Less than 1 year Database Less than 1 year Database Management Less than 1 year Database Management System Less than 1 year Java Less than 1 year ADDITIONAL INFORMATION Technical Skills https://www.indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a?isid=rex-download&ikw=download-top&co=IN Programming language C C++ Java Oracle PeopleSoft Internet Of Things Machine Learning Database Management System Computer Networks Operating System worked on Linux Windows Mac Non Technical Skills Honest and Hard Working Tolerant and Flexible to Different Situations Polite and Calm Team Player")

colors = {
    "Name": "#ffcccb",  # Name - light red
    "College Name": "#ffebcd",  # College Name - blanched almond
    "Degree": "#add8e6",  # Degree - light blue
    "Graduation Year": "#98fb98",  # Graduation Year - pale green
    "Years of Experience": "#ffb6c1",  # Years of Experience - light pink
    "Companies worked at": "#ffdead",  # Companies worked at - navajo white
    "Designation": "#ffe4e1",  # Designation - misty rose
    "Skills": "#f0e68c",  # Skills - brown
    "Location": "#afeeee",  # Location - pale turquoise
    "Email Address": "#d3d3d3"  # Email Address - light gray
}

# Visualize named entities using displacy with custom colors for all labels
options = {"ents": labels, "colors": colors}

displacy.render(doc, style="ent", jupyter=True, options = options)

In [66]:
from spacy.training.example import Example

# Load the test data from the .spacy file
test_data = DocBin().from_disk("/content/test.spacy")  # Replace with your file path

# Convert DocBin back to individual Doc objects
test_docs = list(test_data.get_docs(nlp.vocab))

# Initialize a list to store Example objects for evaluation
eval_examples = []

# Create Example objects from the test documents
for doc in test_docs:
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    example = Example.from_dict(doc, {"entities": entities})
    eval_examples.append(example)

# Evaluate the model with the created Example objects
scores = nlp.evaluate(eval_examples)
print(scores)

{'token_acc': 1.0, 'token_p': 1.0, 'token_r': 1.0, 'token_f': 1.0, 'tag_acc': None, 'sents_p': None, 'sents_r': None, 'sents_f': None, 'dep_uas': None, 'dep_las': None, 'dep_las_per_type': None, 'pos_acc': None, 'morph_acc': None, 'morph_micro_p': None, 'morph_micro_r': None, 'morph_micro_f': None, 'morph_per_feat': None, 'lemma_acc': None, 'ents_p': 0.1783142736128759, 'ents_r': 1.0, 'ents_f': 0.30265995686556435, 'ents_per_type': {'Name': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'Designation': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'Companies worked at': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'Location': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'Email Address': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'ORG': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Graduation Year': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'DATE': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Degree': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'College Name': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'PERSON': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Skills': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'CARDINAL': {'p': 0.0, '

# References

Includes references for all of the above sections of the notebook

Ghosh, S., Majumder, S. and Santosh Kumar Das. (2023). Artificial Intelligence Techniques in Human Resource Management. Apple Academic Press (CRC Press). Available from https://doi.org/10.1201/9781003328346.

<br/>

Pawan Budhwar et al. (2023). Human resource management in the age of generative artificial intelligence: Perspectives and research directions on ChatGPT. Human resource management in the age of generative artificial intelligence: Perspectives and research directions on ChatGPT, 33 (3). Available from https://doi.org/10.1111/1748-8583.12524.

<br/>

Rahmani, D. and Kamberaj, H. (2021). Implementation and Usage of Artificial Intelligence Powered Chatbots in Human Resources Management Systems. International Conference on Social and Applied Sciences. May 2021. Available from https://www.researchgate.net/profile/Hiqmet-Kamberaj/publication/351345726_Implementation_and_Usage_of_Artificial_Intelligence_Powered_Chatbots_in_Human_Resources_Management_Systems/links/60926106299bf1ad8d78e1d1/Implementation-and-Usage-of-Artificial-Intelligence-Powered-Chatbots-in-Human-Resources-Management-Systems.pdf [Accessed 19 December 2023].

<br/>

Rosenbaum, E. (2019). IBM artificial intelligence can predict with 95% accuracy which workers are about to quit their jobs. CNBC. Available from https://www.cnbc.com/2019/04/03/ibm-ai-can-predict-with-95-percent-accuracy-which-employees-will-quit.html [Accessed 1 January 2024].

<br/>

Tambe, P., Cappelli, P. and Yakubovich, V. (2019). Artificial Intelligence in Human Resources Management: Challenges and a Path Forward. California Management Review, 61 (4), 15–42. Available from https://doi.org/10.1177/0008125619867910.

<br/>

Vrontis, D. et al. (2021). Artificial intelligence, robotics, advanced technologies and human resource management: a systematic review. The International Journal of Human Resource Management, 33 (6), 1–30. Available from https://doi.org/10.1080/09585192.2020.1871398.