<a href="https://colab.research.google.com/github/Amt15/spacy/blob/main/custom_ner_with_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# NER Annotator (annotation tool): https://tecoholic.github.io/ner-annotator/
# By default, Google Colab ships with an older spaCy (around 2.2), not 3, so upgrade spaCy first
!pip install -U spacy -q

In [3]:

!python -m spacy info


spaCy version    3.3.0                         
Location         /usr/local/lib/python3.7/dist-packages/spacy
Platform         Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Python version   3.7.13                        
Pipelines                                      



In [4]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")   # create a blank English pipeline (tokenizer only)
db = DocBin()   # create DocBin object

In [6]:
import json 

# Use a context manager so the file is closed after loading
with open('training_data.json', encoding='utf-8') as f:
  TRAIN_DATA = json.load(f)

In [7]:
TRAIN_DATA

{'annotations': [["Cryptocurrency price today: Major cryptocurrencies have registered decent gains in last 24 hours, especially after the US Fed meeting last night. Among top 10 crypto assets Bitcoin, Ethereum, Terra, Dogecoin, Solana, Shiba Inu, etc. have logged up to 7 per cent rise in last 24 hours. Crypto assets had tumbled on Wednesday after the surprise RBI's repo rate and CRR hike.\r",
   {'entities': [[88, 96, 'TIME'],
     [173, 180, 'CRYTO'],
     [182, 190, 'CRYTO'],
     [192, 197, 'CRYTO'],
     [199, 207, 'CRYTO'],
     [209, 215, 'CRYTO'],
     [217, 222, 'CRYTO'],
     [223, 226, 'CRYTO'],
     [251, 261, 'PERCENTAGE'],
     [275, 284, 'TIME']]}],
  ['Among major cryptocurrency in India, Bitcoin price today is ₹31,66,065, adding ₹1,00,787 or 3.29 per cent in last 24 hours. Current market capital of Bitcoin is ₹56.3 trillion whereas current market volume of Bitcoin is ₹2.5 trillion.\r',
   {'entities': [[37, 44, 'CRYTO'],
     [60, 70, 'VALUE'],
     [79, 88, 'VALUE'],
 

In [8]:
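for text, annot in tqdm(TRAIN_DATA['annotations']):
  doc = nlp.make_doc(text)   # tokenize only; no pipeline components run
  ents = []
  for start, end, label in annot['entities']:
    # 'contract' keeps only the tokens that lie completely inside [start, end)
    span = doc.char_span(start, end, label=label, alignment_mode='contract')
    if span is None:
      print("Skipping entity")   # the offsets did not cover any complete token
    else:
      ents.append(span)
  doc.ents = ents   # attach the aligned entity spans to the doc
  db.add(doc)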

db.to_disk("./training_data.spacy")  # save the DocBin object


100%|██████████| 15/15 [00:00<00:00, 1444.65it/s]
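
The loop above prints "Skipping entity" without saying which annotation was dropped, and spaCy also raises an error when overlapping spans are assigned to `doc.ents`. A small pre-check in plain Python can flag some of these cases before conversion. This is only a sketch: `find_annotation_problems` is a hypothetical helper, not part of spaCy. It assumes the `[text, {'entities': [[start, end, label], ...]}]` layout shown in `TRAIN_DATA['annotations']`, and it checks offsets and overlaps only, not token alignment.

```python
def find_annotation_problems(annotations):
    """Return (example_index, message) pairs for suspicious entity offsets."""
    problems = []
    for i, (text, annot) in enumerate(annotations):
        spans = sorted(annot['entities'], key=lambda e: (e[0], e[1]))
        prev_end = 0
        for start, end, label in spans:
            # Offsets must describe a non-empty slice inside the text
            if not (0 <= start < end <= len(text)):
                problems.append((i, f"{label}: offsets {start}-{end} out of range"))
                continue
            # Leading or trailing whitespace usually means a sloppy selection
            if text[start:end] != text[start:end].strip():
                problems.append((i, f"{label}: span {text[start:end]!r} has edge whitespace"))
            # Spans are sorted by start, so an overlap shows up as start < previous end
            if start < prev_end:
                problems.append((i, f"{label}: span {start}-{end} overlaps the previous one"))
            prev_end = max(prev_end, end)
    return problems


# Toy data in the same layout as TRAIN_DATA['annotations'] (invented for illustration)
sample = [
    ["Bitcoin and Ethereum rose.", {"entities": [[0, 7, "CRYTO"], [12, 20, "CRYTO"]]}],
    ["Bitcoin rose.", {"entities": [[0, 7, "CRYTO"], [4, 12, "CRYTO"]]}],
]
for index, message in find_annotation_problems(sample):
    print(index, message)
```

On the real data the call would be `find_annotation_problems(TRAIN_DATA['annotations'])`; an empty list means none of these particular checks fired.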


In [9]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [10]:
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

ℹ Saving to output directory: .
ℹ Using CPU
[2022-05-05 13:19:27,582] [INFO] Set up nlp object from config
[2022-05-05 13:19:27,594] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-05-05 13:19:27,598] [INFO] Created vocabulary
[2022-05-05 13:19:27,600] [INFO] Finished initializing nlp object
[2022-05-05 13:19:27,749] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     53.11    0.00    0.00    0.00    0.00
 39     200        183.28   1651.43   99.38   98.77  100.00    0.99
 89     400         24.57     77.92   99.37  100.00   98.75    0.99
155     600         25.15    103.40   99.38   98.77  100.00    0.99
232     800         19.89    100.92   99.37  100.00   98.75    
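
The training command above passes the same file as both `--paths.train` and `--paths.dev`, so the ENTS_F, ENTS_P and ENTS_R columns are computed on examples the model has already seen. One possible way to get a held-out score is to split the annotations before building the DocBin files. A minimal sketch, assuming the `TRAIN_DATA['annotations']` list layout from above; `split_annotations`, the 80/20 ratio and the seed are illustrative choices, not something this notebook uses.

```python
import random


def split_annotations(annotations, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the annotations and split it into (train, dev) lists."""
    shuffled = list(annotations)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_dev = max(1, int(len(shuffled) * dev_fraction))  # keep at least one dev example
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for the 15 annotated examples reported by the tqdm progress bar
examples = [[f"text {i}", {"entities": []}] for i in range(15)]
train, dev = split_annotations(examples)
print(len(train), len(dev))  # 12 3
```

Each list could then go through the same conversion loop into its own `DocBin`, for example `train.spacy` and `dev.spacy`, the file names suggested in the `init config` output.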

In [11]:
nlp_ner = spacy.load("/content/model-best")

In [12]:
text = '''Another crypto major Ethereum price today is quoting ₹2,35,297 per coin, ₹9,536 or 4.22 per cent higher from the price it was quoting 24 hours ago. Currently, market capital of Ether is ₹26.2 trillion and its current trade volume stands at 1.1 trillion.

Solana is selling at ₹7,480.73, around ₹504 or 7.23 per cent higher from its selling price it quoted 24 hours ago. Its current market valuation is ₹2.2 trillion and its current trade volume is ₹83.4 billion.

Cryptocurrency major Shiba Inu is selling at ₹0.001731, adding ₹0.000055 or 3.28 per cent to its price it was quoting 24 hours ago.

Dogecoin price today is ₹10.95, which is ₹0.41 or 3.89 per cent higher from its price 24 hours ago. Currently, its market capital is ₹1.3 trillion and trade volume is ₹52.2 billion.
'''

In [13]:
doc = nlp_ner(text)

In [14]:
from spacy import displacy

# Highlight the predicted entities inline in the notebook
displacy.render(doc, style="ent", jupyter=True)

In [15]:
# Print only the percentage and cryptocurrency entities
for ent in doc.ents:
  if ent.label_ in ('PERCENTAGE', 'CRYTO'):
    print(ent.text, ent.label_)

Ethereum CRYTO
4.22 per cent PERCENTAGE
Solana CRYTO
7.23 per cent PERCENTAGE
Shiba Inu CRYTO
3.28 per cent PERCENTAGE
Dogecoin CRYTO
3.89 per cent PERCENTAGE
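
If the entities are needed for further processing rather than printing, they can be collected per label. A small sketch in plain Python: `group_by_label` is a hypothetical helper, and the `pairs` list simply repeats the output printed above. With a real `doc` it could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Map each label to the list of entity texts, in order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# (text, label) pairs copied from the printed output above
pairs = [
    ("Ethereum", "CRYTO"), ("4.22 per cent", "PERCENTAGE"),
    ("Solana", "CRYTO"), ("7.23 per cent", "PERCENTAGE"),
    ("Shiba Inu", "CRYTO"), ("3.28 per cent", "PERCENTAGE"),
    ("Dogecoin", "CRYTO"), ("3.89 per cent", "PERCENTAGE"),
]
by_label = group_by_label(pairs)
print(by_label["CRYTO"])
```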
