<a href="https://colab.research.google.com/github/Gigi1111/AI-U1/blob/master/spaCy_spell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This small project is inspired by Ines Montani's Advanced NLP with spaCy course (https://course.spacy.io/en/chapter4). The notebook shows a simple example of training a custom spaCy pipeline and is intended purely for educational purposes. I do not own the rights to the text of the Harry Potter series. If you want to download the text for practice, this repo contains it: https://github.com/formcept/whiteboard/blob/master/nbviewer/notebooks/data/harrypotter/Book%201%20-%20The%20Philosopher's%20Stone.txt

In [64]:
import spacy
from spacy import displacy

muggle_nlp = spacy.load('en_core_web_sm') 
doc = muggle_nlp("Only a huge Harry Potter fan like ChungFan Tsai would want SpaCy to recognize spells as entities.")
displacy.render(doc, style="ent",jupyter=True)

In [65]:
doc = muggle_nlp("“Wingardium Leviosal” he shouted, waving his long arms like a windmill.")
displacy.render(doc, style="ent",jupyter=True)

In [66]:
muggle_nlp.get_pipe('ner').labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [67]:
spacy.explain("FAC")

'Buildings, airports, highways, bridges, etc.'

In [140]:
## This cell is only necessary if you wish to load your project on Google Colab
# from google.colab import files
# uploaded = files.upload()



Read in the whole book

In [70]:
corpus = ''
with open('book1.txt', encoding='utf-8') as file:
    for line in file:
        line = line.rstrip()
        # drop lines containing the 'Page | ' marker and lines of 5 characters or fewer
        if 'Page | ' not in line and len(line) > 5:
            corpus += ' ' + line

# len(corpus) and the slice below both count characters, not words
first_nth = 10000
print("The length of the entire book text is: {}.\nAnd the first {} characters are:\n{} ".format(len(corpus), first_nth, corpus[:first_nth]))


The length of the entire book text is: 435940.
And the first 10000 characters are:
 THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursley s had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn’t think they could bear it if anyone found o

Divide the text into sentences

In [132]:
# Manually set some custom sentence-start rules so the sentences are divided better.
# The function is added to the pipeline before the parser, which respects these flags.
def custom_sent_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ".(" or token.text == ").":
            # these tokens close a sentence, so the next token starts a new one
            doc[token.i+1].is_sent_start = True
        elif token.text == "Rs." or token.text == ")":
            doc[token.i+1].is_sent_start = False
        elif token.text == "...":
            # an ellipsis does not end a sentence
            doc[token.i+1].is_sent_start = False
        elif not (token.text == "“" or token.text == "\"") and not (token.text[0].isalpha() and token.text[0].isupper()):
            # only an opening quote or a capitalized word may start a sentence
            doc[token.i].is_sent_start = False
    return doc

muggle_nlp.add_pipe(custom_sent_boundaries, before="parser")
doc = muggle_nlp(corpus)

sentences = [sent.text for sent in doc.sents]

In [132]:
first_n_sent = 10
print("There are {} sentences.\nFirst {} sentences are:".format(len(sentences), first_n_sent))
for idx, sent in enumerate(sentences[:first_n_sent]):
  print("{}. {}".format(idx+1, sent))

There are 6269 sentences.
First 10 sentences are:
1.  THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
2. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.
3. Mr. Dursley was the director of a firm called Grunnings, which made drills.
4. He was a big, beefy man with hardly any neck, although he did have a very large mustache.
5. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors.
6. The Dursley s had a small son called Dudley and in their opinion there was no finer boy anywhere.
7. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it.
8. They didn’t think they could bear it if anyone found out a

In [139]:
## This cell is only necessary if you wish to load your project on Google Colab
# # upload the training file
# uploaded = files.upload()



In [80]:
import random
import json

with open("spells_train.json") as f:
    TRAINING_DATA = json.load(f)
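
The contents of spells_train.json are not shown in this notebook. Judging from how the training loop below unpacks each item and passes it to `nlp.update`, each entry is presumably a `[text, {"entities": [[start, end, label]]}]` pair with character offsets. The sketch below builds one such entry under that assumption; `make_example` and the sample sentence are hypothetical and not part of the original project.

```python
def make_example(text, spell, label="SPELL"):
    """Build one [text, annotations] pair with character offsets (assumed format)."""
    start = text.find(spell)
    if start == -1:
        raise ValueError("{!r} does not occur in {!r}".format(spell, text))
    end = start + len(spell)  # the end offset is exclusive
    return [text, {"entities": [[start, end, label]]}]


# Made-up sentence, used only to illustrate the offsets
example = make_example('Hermione whispered, "Alohomora!"', "Alohomora")
text, annotations = example
start, end, label = annotations["entities"][0]
print(text[start:end], label)
```

Computing the offsets from the text itself, rather than typing them by hand, avoids off-by-one errors that would otherwise misalign the annotation with the tokens.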

In [134]:
nlp = spacy.blank("en") # nlp = English()
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("SPELL")
nlp.get_pipe('ner').labels

('SPELL',)

In [135]:
nlp.vocab.vectors.name = 'example_harry_potter'

nlp.begin_training()

for itn in range(15):
    random.shuffle(TRAINING_DATA)
    losses = {}  # reset once per epoch; nlp.update adds to it on every batch
    print("Epoch: {}".format(itn))
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, _ in batch]
        annotations = [annots for _, annots in batch]

        nlp.update(texts, annotations, losses=losses)
        # running total for the current epoch, not the loss of this batch alone
        print("Loss: {0:.5f}".format(losses['ner']))

Epoch: 0
Loss: 27.50000
Loss: 59.92296
Loss: 111.73921
Loss: 125.83555
Epoch: 1
Loss: 21.74859
Loss: 28.19477
Loss: 33.43836
Loss: 35.36604
Epoch: 2
Loss: 3.67650
Loss: 3.67850
Loss: 7.43006
Loss: 9.39815
Epoch: 3
Loss: 1.86726
Loss: 4.64171
Loss: 4.64184
Loss: 24.65530
Epoch: 4
Loss: 0.00026
Loss: 3.62832
Loss: 8.71516
Loss: 11.24340
Epoch: 5
Loss: 1.48579
Loss: 3.21332
Loss: 5.94538
Loss: 7.33182
Epoch: 6
Loss: 0.93059
Loss: 2.72422
Loss: 3.67656
Loss: 3.67656
Epoch: 7
Loss: 0.24668
Loss: 1.11628
Loss: 1.92151
Loss: 1.92151
Epoch: 8
Loss: 0.61653
Loss: 0.66955
Loss: 5.06977
Loss: 5.10850
Epoch: 9
Loss: 0.06068
Loss: 0.06068
Loss: 4.38143
Loss: 6.63121
Epoch: 10
Loss: 2.16374
Loss: 2.16566
Loss: 2.41055
Loss: 2.41055
Epoch: 11
Loss: 0.04070
Loss: 0.26309
Loss: 0.31601
Loss: 0.31601
Epoch: 12
Loss: 0.00255
Loss: 0.00256
Loss: 0.00580
Loss: 0.00588
Epoch: 13
Loss: 0.00011
Loss: 0.00031
Loss: 0.00033
Loss: 0.00033
Epoch: 14
Loss: 0.00002
Loss: 0.00002
Loss: 0.00003
Loss: 0.00003
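
The values printed within each epoch never decrease, which is consistent with the same `losses` dictionary being added to on every `nlp.update` call. Assuming that is what happens, the contribution of each individual batch is the difference between consecutive values. A minimal sketch using the epoch 0 numbers from the log above (`per_batch` is a hypothetical helper):

```python
# Cumulative 'ner' losses copied from the epoch 0 log above
cumulative = [27.50000, 59.92296, 111.73921, 125.83555]


def per_batch(cumulative_losses):
    """Turn a running total into the individual per-batch contributions."""
    previous = 0.0
    result = []
    for total in cumulative_losses:
        result.append(round(total - previous, 5))
        previous = total
    return result


epoch0 = per_batch(cumulative)
print(epoch0)
```

Comparing the last value of each epoch (125.8 in epoch 0, 0.00003 in epoch 14) is the fairest way to read the log, since those are totals over the same number of batches.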


In [136]:
test = "“Kill the spare.” A swishing noise and a second voice, which screeched the words to the night: “Avada Kedavra”"
doc = nlp(test)
displacy.render(doc, style="ent", jupyter=True)

In [137]:
doc = muggle_nlp(test)
displacy.render(doc, style="ent", jupyter=True)

In [138]:
# Run the custom model over every sentence and print each predicted SPELL entity
for sent in sentences:
  doc = nlp(sent)
  for ent in doc.ents:
    if ent.label_ == "SPELL":
      print(sent)
      print(ent.text, ent.label_)
      print('------------------------------')

 THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
WHO SPELL
------------------------------
 THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
LIVED Mr. SPELL
------------------------------
 THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
Drive SPELL
------------------------------
“Little tyke,” chortled Mr. Dursley as he left the house.
Little SPELL
------------------------------
He also thought he had been called a Muggle, whatever that was.
Muggle SPELL
------------------------------
Perhaps people have been celebrating Bonfire Night early — it’s not until next week, folks!
Bonfire SPELL
------------------------------
Shooting stars all over Britain?
Britain SPELL
------------------------------
“Why?”
Why SPEL
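
The output above shows ordinary capitalized words being tagged as SPELL. One rough way to put a number on this is to compare the predictions against a hand-written list of spells. The sketch below does that for the predictions visible above; `KNOWN_SPELLS` is a small illustrative set that is assumed here and is not complete, and `precision` is a hypothetical helper.

```python
# Hand-written, incomplete reference list (an assumption for illustration)
KNOWN_SPELLS = {"wingardium leviosa", "alohomora", "petrificus totalus"}

# Entity texts copied from the SPELL predictions printed above
predicted = ["WHO", "LIVED Mr.", "Drive", "Little", "Muggle", "Bonfire", "Britain", "Why"]


def precision(predictions, known):
    """Share of predicted entity texts that appear in the reference list."""
    if not predictions:
        return 0.0
    hits = sum(1 for text in predictions if text.lower() in known)
    return hits / len(predictions)


score = precision(predicted, KNOWN_SPELLS)
print("precision on the sample: {:.2f}".format(score))
```

None of the eight predictions shown is in the reference list, so the model as trained is not yet separating spells from other capitalized words.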

There are several ways to improve how well the model identifies real spells:
1. Expand the training data so that the model can learn from more examples.
2. Use multi-label training so that people's names and organizations are not mistaken for spells.
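
Both suggestions can be combined when preparing the data. The sketch below annotates a sentence with more than one label from small phrase lists and also produces examples with no entities at all, which show the model that a capitalized word is not automatically a spell. The `GAZETTEER` lists, the `annotate` helper and the first sample sentence are assumptions for illustration; the output format follows the `[text, {"entities": [...]}]` convention the training loop appears to expect.

```python
import re

# Hypothetical phrase lists; a real project would need far more entries
GAZETTEER = {
    "SPELL": ["Wingardium Leviosa", "Alohomora"],
    "PERSON": ["Harry", "Hermione", "Mr. Dursley"],
}


def annotate(text, gazetteer):
    """Label every listed phrase found in text, keeping non-overlapping spans."""
    spans = []
    for label, phrases in gazetteer.items():
        for phrase in phrases:
            # match whole phrases only, not fragments of longer words
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                spans.append((match.start(), match.end(), label))
    # earliest span first; at the same start, prefer the longest one
    spans.sort(key=lambda span: (span[0], -(span[1] - span[0])))
    entities = []
    last_end = 0
    for start, end, label in spans:
        if start >= last_end:
            entities.append([start, end, label])
            last_end = end
    return [text, {"entities": entities}]


positive = annotate('Hermione said "Alohomora" and Harry watched.', GAZETTEER)
negative = annotate("Shooting stars all over Britain?", GAZETTEER)
print(positive)
print(negative)
```

A new label such as PERSON would also have to be registered with `ner.add_label` before training, in the same way SPELL is.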


In [None]:
# Try improving it yourself :)