
# Installing & Importing Required Libraries

In [1]:
! pip install -U spacy -q

In [2]:
!python -m spacy info

spaCy version    3.7.5                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.85+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


In [4]:
!pip install spacy-transformers

Collecting spacy-transformers
  Downloading spacy_transformers-1.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting transformers<4.37.0,>=3.4.0 (from spacy-transformers)
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.8/126.8 kB 3.8 MB/s eta 0:00:00
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers)
  Downloading spacy_alignments-0.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.8.0->spacy-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.8.0->spacy-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from to

In [18]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import json
import random

# Reading and Processing Annotated Data

In [19]:
db = DocBin() # creating a DocBin object for storing Doc objects

In [20]:
# Reading Our Annotated Data from a JSON file

with open('/content/drive/My Drive/JD/annotations.json') as readfile:
    train_data = json.load(readfile)

In [21]:
train_data.keys()

dict_keys(['classes', 'annotations'])

In [22]:
train_data['classes'] # Annotation Labels

['YEARS_OF_EXPERIENCE',
 'EDUCATION',
 'ROLE',
 'TOOLS_TECH',
 'CERTIFICATIONS',
 'LOC',
 'CONCEPTS',
 'SOFT_SKILLS',
 'DOMAIN']

In [23]:
train_data['annotations'][0]

# Consists of the job description and the corresponding entities (character offsets and label)

['"Diagnose computer errors and triage to determine the urgency of issues Install, configure and upgrade  PC software and operating systems Facilitate  Onsite and escalation support activities Provide technical support over the phone or  Web to end users /clients Use remote support software to take control of end-users computers to troubleshoot, diagnose and resolve issues Setup new user and email accounts Assist end-users with password changes Setup email on  Computers and  Mobile devices Document  Resolution steps for closed tickets and notes for escalations Create and maintain documentation about customer networks Escalate to higher tier support to resolve customer issues within  SLACreate and maintain documentation about  Customer networks Troubleshoot software, hardware, and network issues.  Qualifications Bachelors degree preferred in  Computer  Science or related field 3+ years of helpdesk experience or working with a helpdesk or  IT provider Knowledge and hands-on experience pr
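Each record pairs the job-description text with a dictionary whose `"entities"` list holds `[start, end, label]` triples. Before converting the records, it can help to check that the offsets are plausible, because overlapping spans cannot be assigned to `doc.ents`. The following is a minimal sketch of such a check; `find_bad_offsets` and the sample record are hypothetical and are not part of the annotated data or of spaCy.

```python
def find_bad_offsets(text, entities):
    """Return (start, end, label, reason) tuples for suspicious annotations.

    Hypothetical helper for illustration; not part of spaCy.
    """
    problems = []
    last_end = 0
    # Walk the entities in reading order so overlaps are easy to detect
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        if not (0 <= start < end <= len(text)):
            problems.append((start, end, label, "out of range"))
            continue
        if start < last_end:
            problems.append((start, end, label, "overlaps previous entity"))
        last_end = max(last_end, end)
    return problems


# Made-up record in the same [text, {"entities": [...]}] shape
sample_text = "3+ years of helpdesk experience with Python"
sample_entities = [
    [0, 8, "YEARS_OF_EXPERIENCE"],  # "3+ years"
    [37, 43, "TOOLS_TECH"],         # "Python"
    [3, 20, "DOMAIN"],              # overlaps the first entity
    [40, 60, "TOOLS_TECH"],         # runs past the end of the text
]

for problem in find_bad_offsets(sample_text, sample_entities):
    print(problem)
```

With the notebook's data, the same function could be called as `find_bad_offsets(text, annot["entities"])` for each record in `train_data['annotations']`.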

In [24]:
# Processing the annotated data

# Blank English pipeline, used here only for its tokenizer
nlp = spacy.blank("en")

# For each job description and its annotations
for text, annot in tqdm(train_data['annotations']):
    doc = nlp.make_doc(text)  # tokenize the job description into a Doc
    ents = []  # entity spans that align with token boundaries

    for start, end, label in annot["entities"]:
        # Build a span from the character offsets; "contract" shrinks the
        # span to the tokens that lie completely inside [start, end)
        span = doc.char_span(start, end, label=label, alignment_mode="contract")

        if span is None:
            # No token fits inside the offsets, so the annotation is dropped
            print(f"Skipping entity: {label} [{start}:{end}]")
        else:
            ents.append(span)

    # Assign the entities to the Doc (raises ValueError if spans overlap)
    doc.ents = ents
    # Add the processed Doc to the DocBin
    db.add(doc)

db.to_disk("./training_data.spacy")  # save the DocBin

100%|██████████| 31/31 [00:00<00:00, 131.92it/s]
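The `alignment_mode` argument decides what happens when annotated character offsets do not fall exactly on token boundaries. The sketch below compares the three modes on a made-up sentence (not taken from the annotated data), using a blank English pipeline as in the notebook.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp.make_doc("Expert command of SQL and Python")

# (15, 20) covers "of SQ": it starts on a token boundary but ends inside "SQL"
for mode in ("strict", "contract", "expand"):
    span = doc.char_span(15, 20, label="TOOLS_TECH", alignment_mode=mode)
    print(mode, "->", None if span is None else repr(span.text))

# (19, 21) covers "QL": no whole token fits inside the offsets,
# so "contract" has nothing left to return
inner = doc.char_span(19, 21, label="TOOLS_TECH", alignment_mode="contract")
print("contract on 'QL' ->", inner)
```

Because `"contract"` can silently shorten an entity or drop it altogether, it is worth reviewing any skipped or shortened spans against the original annotations.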


# Creating Training and Validation DocBins

In [25]:
# Create a blank English pipeline (its vocab is needed to deserialize the Docs)
nlp = spacy.blank("en")

# Load the existing DocBin object
db = DocBin().from_disk("./training_data.spacy")

# Convert DocBin to list of Doc objects
docs = list(db.get_docs(nlp.vocab))

# Shuffle the docs
random.shuffle(docs)

# Define split ratios
train_ratio = 0.9
val_ratio = 0.1

# Calculate split indices
train_size = int(train_ratio * len(docs))
val_size = len(docs) - train_size

# Split the docs
train_docs = docs[:train_size]
val_docs = docs[train_size:]

# Create new DocBin objects for each split
train_db = DocBin(docs=train_docs)
val_db = DocBin(docs=val_docs)

# Save the new DocBin objects
train_db.to_disk("./train_data.spacy")
val_db.to_disk("./val_data.spacy")

print(f"Training set size: {train_size} docs")
print(f"Validation set size: {val_size} docs")




Training set size: 27 docs
Validation set size: 4 docs
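The shuffle above is unseeded, so each run assigns different documents to the training and validation sets. With only 31 documents, this can shift the validation scores noticeably. One possible way to make the split repeatable is sketched below; `split_items` is a hypothetical helper and the seed value is an arbitrary assumption.

```python
import random


def split_items(items, train_ratio=0.9, seed=42):
    """Shuffle a copy of `items` with a fixed seed and split it in two.

    Hypothetical helper; the seed value is an arbitrary assumption.
    """
    shuffled = list(items)
    # A local Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)
    train_size = int(train_ratio * len(shuffled))
    return shuffled[:train_size], shuffled[train_size:]


# Integers stand in for the 31 Doc objects
train_items, val_items = split_items(range(31))
print(len(train_items), len(val_items))
```

In the notebook, `split_items(docs)` would return the two lists that are then passed to `DocBin(docs=...)`.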


# Initializing Configuration File for Training spaCy Model

Reference: https://spacy.io/usage/training

In [26]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --force

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


# Training

In [27]:
! python -m spacy train config.cfg --output ./ --paths.train ./train_data.spacy --paths.dev ./val_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    549.13    0.00    0.00    0.00    0.00
  7     200      14108.39  21514.90    5.21   13.16    3.25    0.05
 14     400       6368.96   9186.67   12.56   20.29    9.09    0.13
 22     600       3184.33   5160.68   14.85   22.67   11.04    0.15
 29     800       2963.91   2668.67   18.67   29.58   13.64    0.19
 37    1000       1722.67   1597.16   17.02   24.69   12.99    0.17
 44    1200       1952.50   1152.70   20.35   31.94   14.94    0.20
 51    1400       1662.35    938.54   15.11   2
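In the training log, `ENTS_P`, `ENTS_R` and `ENTS_F` are the entity-level precision, recall and F-score on the validation set, each reported as a percentage. The F-score is the harmonic mean of precision and recall, which the short sketch below checks against two rows of the table. The `SCORE` column appears to be `ENTS_F` rescaled to the 0-1 range, since `ner` is the only scored component here. `f_score` is a hypothetical helper name.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (ENTS_P, ENTS_R) pairs copied from the rows at steps 200 and 400
rows = [(13.16, 3.25), (20.29, 9.09)]

for precision, recall in rows:
    print(precision, recall, round(f_score(precision, recall), 2))
```

The recomputed values match the logged `ENTS_F` of 5.21 and 12.56. Recall stays well below precision throughout, so the model is missing most annotated entities rather than predicting many wrong ones.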

# Testing the Model on a JD & Visualizing the Results

In [None]:
nlp_ner = spacy.load("./model-best")

In [None]:
doc = nlp_ner('''Pocket  Gems seeks to build the greatest mobile games and most compelling interactive entertainment in the world.  Thats the mission our founders began with, in an apartment above a pizza shop back in 2009, and it continues to inspire us today.  Since then, weve grown to over 200 people headquartered in  San  Francisco, and with $155 million in backing from  Sequoia  Capital and  Tencent, were constantly breaking new ground in mobile entertainment.  Our products have been downloaded over 500 million times by players around the world and have grossed over $1 billion in revenue.  We continue to release brand new content for  Episode, a mobile storytelling network and platform, and  War  Dragons, a visually stunning 3 D real-time strategy game.  As our community of players continues to grow, were committed to building diverse and inclusive environments across our teams, and in our games.  As a  Data  Scientist, youll work with product managers to set the analytics and data science agenda for one of our game studios.  Youll carry out research and development projects aimed at optimizing player retention, youll deploy machine learning models, and youll perform deep-dive analyses to understand changes in key metrics.  Along the way, you'll develop a deep understanding of mobile gaming and will work with our high-producing and fast-paced team.  Your work will have a significant, lasting impact on the health of our games.  You'll have a great deal of independence  we'll expect you to use it wisely.  
What  Weve  Accomplished  Built models to estimate the lifetime value of every player in our games  Deployed systems to monitor player satisfaction with our games  Developed models that recommend more relevant content to players  What  Youll  Do  Drive analytics and data science initiatives for one of our game studios  Design and deploy live machine learning models to optimize player experience  Perform deep-dive analyses and research projects to understand changes in key metrics  Deliver results and recommendations to key stakeholders on a regular basis  Collaborate with other data analysts and data scientists on the analytics team  Collaborate with cross functional partners on other teams  What  Youll  Bring  To  The  Analytics  Team  Bachelors degree   AND  4 years of professional experience in a similar/related role   OR  2 years of professional experience in a similar/related role  AND  Masters/ Ph D in a similar/related field such as data science or analytics  Expert command of  SQL  youll be using it daily  Firm grasp of  Python, including  Pandas,  Scikit-learn, and  Jupyter notebooks  Excellent knowledge of common machine learning models  both supervised and unsupervised (e.g., linear/logistic regression, tree-based models, gradient boosting, neural networks, clustering methods, etc.)  and when to use them  Solid understanding of statistics  Strong understanding of  A/ B testing (design and analysis)  Knack for data visualization and ability to present data in a clear, concise manner  Ability and desire to present complex results to non-technical audiences  Outstanding communication skills and ability to work collaboratively with other roles  Ambition to own the analytics and data science function for a large game studio  Extra  Gems  For  Experience working in the gaming industry  Experience developing and deploying production code  Leadership and/or consulting experience  Read more about what weve been up to!  
Pocket  Gems  Blog  At  Pocket  Gems, we're building teams that value originality, inclusivity, and accountability -- and we hope to engage with talent based on these and other core values.  We also offer competitive perks, such as flexible vacation, 401k matching, and a generous benefits package,  Pocket  Gems is an equal opportunity employer.  We do not discriminate based on race, color, ethnicity, ancestry, national origin, religion, sex, gender, gender identity, gender expression, sexual orientation, age, disability, veteran status, genetic information, marital status or any legally protected status.  Pursuant to the  San  Francisco  Fair  Chance  Ordinance, we will consider for employment qualified applicants with arrest and conviction records.''')

In [None]:
colors = {
    'YEARS_OF_EXPERIENCE': "red",
    'EDUCATION': "blue",
    'ROLE': "green",
    'TOOLS_TECH': "yellow",
    'CERTIFICATIONS': "orange",
    'LOC': "purple",
    'CONCEPTS': "grey",
    'SOFT_SKILLS': "pink",
    'DOMAIN': "violet",
}
options = {"colors": colors}

In [None]:
spacy.displacy.render(doc, style="ent", options = options, jupyter=True) # display in Jupyter
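Besides the highlighted rendering, a plain-text summary of the predictions grouped by label can be easier to scan or to store. The sketch below assumes the predictions are available as `(text, label)` pairs; `group_by_label` is a hypothetical helper, and the sample pairs are made up for illustration rather than actual model output.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group entity texts by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With the notebook's objects, the pairs could be built as:
#   pairs = [(ent.text, ent.label_) for ent in doc.ents]
# The pairs below are invented examples, not predictions from the model.
pairs = [
    ("SQL", "TOOLS_TECH"),
    ("Python", "TOOLS_TECH"),
    ("Data Scientist", "ROLE"),
    ("SQL", "TOOLS_TECH"),
]

for label, texts in group_by_label(pairs).items():
    print(f"{label}: {texts}")
```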

# Saving our Model

In [33]:
import shutil
from google.colab import files

# Zip the folder
shutil.make_archive('model-best', 'zip', 'model-best')

# Download the zipped folder
files.download('model-best.zip')


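To reuse the downloaded model later, the archive has to be unpacked back into a directory before it can be loaded. The sketch below shows the round trip with `shutil` on a stand-in directory. The directory contents and the `restored-model` name are placeholders; a real `model-best` folder holds the full pipeline (for example `config.cfg`, `meta.json` and the `ner` component data).

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Stand-in for the trained pipeline directory
    model_dir = tmp / "model-best"
    model_dir.mkdir()
    (model_dir / "meta.json").write_text('{"lang": "en"}')

    # Same call as in the notebook: zip the contents of the model directory
    archive = shutil.make_archive(str(tmp / "model-best"), "zip", str(model_dir))

    # Unpack into a fresh directory (the name is an arbitrary choice)
    restored = tmp / "restored-model"
    shutil.unpack_archive(archive, str(restored))
    restored_files = sorted(p.name for p in restored.iterdir())
    print(restored_files)

    # A real restored directory could then be loaded with:
    #   nlp_ner = spacy.load(str(restored))
```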