Reference for training a custom NER model with spaCy 3:

https://github.com/amrrs/custom-ner-with-spacy3/blob/main/Custom_NER_with_Spacy3.ipynb

In [1]:
!pip install -U spacy -q

In [2]:
!python -m spacy info

spaCy version    3.8.3                         
Location         f:\sfu\python\lib\site-packages\spacy
Platform         Windows-10-10.0.19045-SP0     
Python version   3.9.5                         
Pipelines        en_core_web_sm (3.8.0)        



In [3]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # create a blank English pipeline (tokenizer only)
db = DocBin()  # container that collects the annotated docs for serialization

In [36]:
import json
with open('healthcare_annotations.json', encoding='utf-8') as f:
    TRAIN_DATA = json.load(f)
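
The conversion loop in the next cell unpacks every item as a (text, annotations) pair whose "entities" list holds [start, end, label] triples, so the JSON file is presumably shaped that way. A minimal sketch of a sanity check under that assumption; the helper name `check_annotations`, the sample record, and its labels are hypothetical and not part of the original notebook:

```python
def check_annotations(data):
    """Return a list of problems found in spaCy-style annotation records.

    Assumed layout (inferred from how the notebook unpacks the data):
    {"annotations": [[text, {"entities": [[start, end, label], ...]}], ...]}
    """
    problems = []
    for i, item in enumerate(data.get("annotations", [])):
        if item is None:  # the notebook filters these entries out as well
            continue
        text, annot = item
        for start, end, label in annot.get("entities", []):
            # Offsets are character positions into `text`, end-exclusive
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, label, "offsets out of range"))
            elif text[start:end] != text[start:end].strip():
                problems.append((i, start, end, label, "span has surrounding whitespace"))
    return problems


# Hypothetical sample data, for illustration only
sample = {
    "annotations": [
        ["Aspirin relieves headache.", {"entities": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]]}],
        None,
        ["Take ibuprofen daily.", {"entities": [[5, 14, "DRUG"], [10, 40, "DRUG"]]}],
    ]
}
print(check_annotations(sample))
```

Running a check like this before building the DocBin makes it easier to see which records would later be reported as skipped entities.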

In [37]:
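# Remove null entries from the annotation list
filtered_annotations = [item for item in TRAIN_DATA['annotations'] if item is not None]

for text, annot in tqdm(filtered_annotations):
    doc = nlp.make_doc(text)  # tokenize only; no pipeline components are run
    ents = []
    for start, end, label in annot["entities"]:
        # "contract" shrinks the span to the tokens fully inside [start, end);
        # char_span returns None when no token fits
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print(f"Skipping entity {text[start:end]!r} ({label}): no token alignment")
        else:
            ents.append(span)
    doc.ents = ents  # attach the aligned entity spans to the doc
    db.add(doc)

db.to_disk("./training_data.spacy")  # save the DocBin object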

100%|██████████| 25/25 [00:00<00:00, 373.20it/s]
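
Assigning `doc.ents` raises an error when two spans cover the same token, so overlapping annotations have to be resolved before the conversion loop runs. A minimal sketch, assuming a keep-the-longest-span policy; that policy, the helper name `drop_overlaps`, and the sample triples are assumptions made for illustration:

```python
def drop_overlaps(entities):
    """Keep the longest non-overlapping (start, end, label) triples."""
    # Longest spans first; ties are broken by the earliest start offset
    ranked = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ranked:
        # Two end-exclusive ranges are disjoint when one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical entity triples, for illustration only
entities = [(0, 7, "DRUG"), (0, 12, "DRUG"), (20, 28, "SYMPTOM")]
print(drop_overlaps(entities))
```

spaCy also ships `spacy.util.filter_spans`, which applies a similar longest-span preference to `Span` objects and may be the more convenient choice once the spans have been created.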


In [25]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --force

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: accuracy
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
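
The command suggested above expects separate train and dev files, whereas this notebook writes a single `training_data.spacy`. A minimal sketch of one way to split the annotation pairs before conversion; the helper name `split_examples`, the 20% dev fraction, and the fixed seed are assumptions, not something the notebook does:

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for the 25 filtered annotation pairs used in the notebook
examples = [(f"text {i}", {"entities": []}) for i in range(25)]
train, dev = split_examples(examples)
print(len(train), len(dev))
```

Each list would then go through the same conversion loop into its own DocBin, saved for example as `./train.spacy` and `./dev.spacy`. With only 25 documents the dev set is tiny, so scores computed on it will be noisy.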


In [27]:
# The accuracy-optimized CPU config references en_core_web_lg for its word vectors
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [38]:
# Note: --paths.train and --paths.dev point to the same file, so the scores below are measured on the training data
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     14.13    0.80    0.42    7.69    0.01
  4     200       1603.61   3370.43   53.96   57.86   50.55    0.54
 10     400        910.54    958.96   82.49   89.68   76.37    0.82
 17     600       5565.00   1199.09   85.31   87.79   82.97    0.85
 26     800         76.35    688.89   85.71   87.43   84.07    0.86
 37    1000         88.71    833.93   85.16   85.16   85.16    0.85
 52    1200        407.20   1100.99   87.72   93.75   82.42    0.88
 70    1400        137.63   1067.85   86.38   91.41   81.87    0.86
 93    1600         55.62   1163.27   87.89   84.34   91.76    0.88
121    1800         48.50   1290.47   88.40   88.89
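
In the log above, ENTS_P and ENTS_R are entity-level precision and recall, and ENTS_F is their harmonic mean. A minimal sketch that recomputes one row of the table as a check; the function name `f_score` is chosen here for illustration:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ENTS_P and ENTS_R taken from the step-400 row of the training log
print(round(f_score(89.68, 76.37), 2))  # 82.49, matching ENTS_F in that row
```

Because NER is the only scored component in this pipeline, the SCORE column appears to be ENTS_F rescaled to the 0 to 1 range.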

In [39]:
nlp_ner = spacy.load("model-best") 

In [None]:
import pandas as pd

df = pd.read_csv('Datasets/Health_News/Health_News.csv')

sample_text = df["content"].iloc[247]

doc = nlp_ner("""

RARITAN, N.J. - Johnson & Johnson (NYSE:JNJ) has revealed promising overall survival (OS) results from its Phase 3 MARIPOSA study, which could transform the treatment landscape for patients with advanced non-small cell lung cancer (NSCLC). The study showed that the combination of RYBREVANT® (amivantamab-vmjw) and LAZCLUZE™ (lazertinib) significantly improved survival outcomes compared to the current standard treatment, osimertinib.

The data, expected to be presented at the European Lung Cancer Congress (ELCC) in 2025, suggests that patients with locally advanced or metastatic NSCLC with specific EGFR mutations could benefit from this new treatment regimen. With an EBITDA of nearly $30 billion in the last twelve months, Johnson & Johnson demonstrates the financial strength needed to support its extensive research and development initiatives. InvestingPro analysis reveals the company maintains a "GOOD" financial health score, suggesting strong operational stability. Dr. Yusri Elsayed, Global Therapeutic Area Head of Oncology at Johnson & Johnson Innovative Medicine, emphasized the potential for these therapies to extend patients’ lives beyond what current treatments offer.

In addition to the MARIPOSA study, Johnson & Johnson will also present findings from the Phase 2 COCOON study, which evaluated a dermatologic regimen to prevent skin reactions in patients receiving the RYBREVANT® combination therapy. The regimen met its primary endpoint, enhancing patient experience by managing side effects more effectively.

The company’s extensive clinical trial program continues to explore RYBREVANT® in various combinations and settings, including the Phase 2 PALOMA-2 study, which assesses the feasibility of switching to a subcutaneous form of amivantamab.

RYBREVANT® has received approvals in the U.S., Europe, and other global markets for several indications related to NSCLC treatment. The European Medicines Agency’s Committee for Medicinal Products for Human Use (CHMP) recommended approval of a subcutaneous formulation of amivantamab and LAZCLUZE™ for first-line treatment of adult patients with advanced NSCLC harboring specific EGFR mutations.

The National Comprehensive Cancer Network® (NCCN®) has included RYBREVANT® and LAZCLUZE™ as a Category 1 recommendation for first-line therapy in patients with NSCLC with certain EGFR mutations.

The announcement is based on a press release statement and provides a glimpse into the ongoing efforts to enhance cancer treatment options and improve patient outcomes.

For further information on the safety and prescribing information for RYBREVANT® and LAZCLUZE™, healthcare professionals are directed to the full prescribing information provided by Janssen Biotech, Inc. For investors seeking deeper insights, InvestingPro offers comprehensive analysis of Johnson & Johnson’s financial performance, including over 30 additional exclusive ProTips and detailed valuation metrics. The company currently offers a 3.04% dividend yield and trades near its 52-week high, reflecting strong market confidence in its pipeline developments.

In other recent news, Johnson & Johnson has received Fast Track designation from the U.S. Food and Drug Administration for nipocalimab, aimed at treating moderate-to-severe Sjögren’s disease. This follows the Breakthrough Therapy designation granted in 2024, underscoring the FDA’s support for the drug’s rapid development. Meanwhile, RBC Capital Markets has maintained its Outperform rating on Johnson & Johnson, highlighting the potential $5 billion annual sales opportunity for Icotrokinra, which could significantly contribute to the company’s growth from 2025 to 2030.

In another development, Johnson & Johnson decided not to exercise its option to license Genmab’s HexaBody-CD38, following a clinical proof-of-concept study. Despite promising initial data, the decision was based on an evaluation of the drug’s clinical data and market landscape. Additionally, Johnson & Johnson reported positive results from its Phase 2b ANTHEM-UC trial for icotrokinra in ulcerative colitis, achieving a 63.5% clinical response rate at the highest dose.

Furthermore, Guggenheim Securities downgraded Neumora Therapeutics to Neutral after Johnson & Johnson discontinued its Phase 3 VENTURA program for aticaprant in major depressive disorder due to insufficient efficacy. This decision influenced Guggenheim’s outlook on Neumora’s own drug development efforts. These recent developments highlight Johnson & Johnson’s ongoing efforts in healthcare innovation and strategic decisions in drug development.

This article was generated with the support of AI and reviewed by an editor. For more information see our T&C.
""")

In [50]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
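
Besides the visual rendering, a per-label tally gives a quick overview of what the model found. A minimal sketch, assuming the entities have been pulled out of the document as (text, label) pairs; the helper name `label_counts`, the sample pairs, and their labels are hypothetical, since the label set of the trained model is not shown in this notebook:

```python
from collections import Counter


def label_counts(pairs):
    """Count how often each label occurs in (text, label) pairs."""
    return Counter(label for _, label in pairs)


# In the notebook the pairs would come from the parsed document:
#   pairs = [(ent.text, ent.label_) for ent in doc.ents]
# Hypothetical pairs, for illustration only
pairs = [
    ("Johnson & Johnson", "ORG"),
    ("NSCLC", "DISEASE"),
    ("RBC Capital Markets", "ORG"),
]
print(label_counts(pairs).most_common())
```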