# Technology NER (Neural Network)

Here we teach a model to label technology terms.

We will go through the whole process, from labeling to training, so you understand how to do it.  The idea is:

1. Grab some raw text containing technology terms
2. Grab another text containing a list of technology terms
3. Use 2 to annotate 1
4. Train the NER model on the annotated version of 1
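
The annotation idea in step 3 can be sketched in plain Python before any library is involved. This is a minimal illustration, not the notebook's actual method: the function name `annotate`, the sample sentence, and the two terms are all made up for the example, and matching is done on characters rather than on spaCy tokens.

```python
import re

def annotate(text, terms, label="TECH"):
    """Return sorted (start, end, label) character spans where a term occurs."""
    spans = []
    for term in terms:
        # \b keeps a term from matching inside a longer word; case is ignored
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

# Hypothetical inputs: one raw sentence (step 1) and a short term list (step 2)
raw_text = "The RFID module sends data to the wireless communication module."
terms = ["rfid module", "wireless communication module"]

annotations = annotate(raw_text, terms)
for start, end, label in annotations:
    print(label, repr(raw_text[start:end]))
```

The resulting spans are the kind of labels an NER model is then trained on in step 4.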

First, let us grab the raw text containing technology-related content.  We take these raw texts from patents.

This notebook is adapted from https://github.com/kinivi/patent_ner_linking

## 1. Loading data

In [1]:
# if you've already unzipped the file
# this is text I grabbed from
# https://www.google.com/patents/sitemap/en/Sitemap/G06/G06K.html
patent_data = open('data/G06K.txt').read().strip()
patent_data[:500]

'BACKGROUND OF THE INVENTION\n[0001]\n1. Field of the Invention\n[0002]\nThe present invention relates to a wireless communication terminal with an RFID (Radio-Frequency Identification) module, a wireless communication system, a wireless communication method, and a device storing a program.\n[0003]\nPriority is claimed on Japanese Patent Application No. 2013-154951, filed Jul. 25, 2013, the content of which is incorporated herein by reference.\n[0004]\n2. Description of Related Art\n[0005]\nA method of usi'

To train NER we need many samples, each stored as a `Doc`, so we split `patent_data` into many samples, ideally one doc per patent.  Looking closely, patents are separated by `\n\n`.  Note that the cell below splits on a single `\n`, so each entry is one line of a patent rather than a whole patent.
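
To make the difference between the two separators concrete, here is a small sketch on a made-up string (the `sample` text is hypothetical and much shorter than the real file). Splitting on blank lines gives one entry per patent, while splitting on a single newline gives one entry per line.

```python
import re

# Hypothetical miniature corpus: two patents separated by a blank line
sample = (
    "BACKGROUND\n[0001]\nFirst patent body.\n\n"
    "BACKGROUND\n[0001]\nSecond patent body."
)

# Split on one or more blank lines so each entry is a whole patent
patents = [p.strip() for p in re.split(r"\n\s*\n", sample) if p.strip()]

# Splitting on a single newline instead yields one entry per line
lines = sample.split("\n")

print(len(patents), "patents,", len(lines), "lines")
```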

In [2]:
# split on newlines | 1 entry = 1 line of the patent text
patent_texts = patent_data.split('\n')
print("Length: ", len(patent_texts))
print("First patent: ",  patent_texts[0][:50])
print("Second patent: ", patent_texts[1][:50])

Length:  427
First patent:  BACKGROUND OF THE INVENTION
Second patent:  [0001]


In [20]:
patent_texts[0]

'BACKGROUND OF THE INVENTION'

Next, let's grab some technology terms from another text file.  To extract the relevant terms from the text, we can use `CountVectorizer` from scikit-learn.  That way, we can drop terms that occur less often than some threshold.
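
As a rough, library-free sketch of what counting against a fixed vocabulary and thresholding means, the snippet below counts 1- to 4-word n-grams with `collections.Counter`. It is a stand-in for the idea, not for `CountVectorizer` itself (it skips stop-word removal, for instance), and the documents, vocabulary, and `min_count` value are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=4):
    """Yield every run of n_min..n_max consecutive tokens joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical documents and candidate terms
docs = [
    "the rfid module stores data",
    "the power supply unit powers the rfid module",
]
vocabulary = {"rfid module", "power supply unit", "quantum computer"}

counts = Counter()
for doc in docs:
    for gram in ngrams(doc.split()):
        if gram in vocabulary:  # count only terms from the fixed vocabulary
            counts[gram] += 1

min_count = 2  # hypothetical threshold
frequent = [term for term, c in counts.items() if c >= min_count]
print(dict(counts), frequent)
```

Terms that never appear in the text, such as "quantum computer" here, simply end up with a count of zero and fall below any positive threshold.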

In [12]:
# here are the potential terms
terms = open('data/manyterms.lower.txt').read().lower().strip().split('\n')
print(terms[44444:44456])
print(len(terms), 'terms')

[]
66 terms


As you can see, we got a lot of irrelevant terms.  Let's keep only the top 25 for now.

In [11]:
unique_terms = list(set(terms))

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# The terms were lowercased when loaded, so lowercase=True lets the patent text match them.
# Caveat: this drops the original case of abbreviations such as API or CAT.
cvectorizer = CountVectorizer(ngram_range=(1, 4), stop_words="english",
                              vocabulary=unique_terms, lowercase=True)
X = cvectorizer.fit_transform(patent_texts)

Let's take a look at the results of the counting.

In [15]:
#row = patents
#columns = terms
#value  = counts
X.toarray().shape

(427, 59)

Let's sum each column over all rows (to get each term's frequency), sort the totals, and map them back to the actual vocab.
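
The sum, sort, and map steps can be seen on a toy count matrix first. This sketch uses plain Python lists and invented numbers rather than the sparse matrix from `CountVectorizer`; the names `vocab`, `totals`, and `ranked` are illustrative only.

```python
# Rows are documents, columns are terms (toy numbers, not from the patent data)
vocab = ["rfid module", "power supply unit", "storage unit"]
X = [
    [2, 0, 1],
    [1, 3, 0],
    [0, 1, 0],
]

# Sum each column to get the corpus-wide frequency of each term
totals = [sum(column) for column in zip(*X)]

# Pair each total with its term and sort from most to least frequent
ranked = sorted(zip(vocab, totals), key=lambda pair: pair[1], reverse=True)

# Keep the top-k terms (k=2 here; the notebook keeps the top 25)
top_terms = [term for term, _ in ranked[:2]]
print(ranked)
```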

In [16]:
import numpy as np

#sum them across all documents
counts = np.sum(X, axis=0)
counts.shape

(1, 59)

In [17]:
#we can get the actual vocab name
vocabs = cvectorizer.get_feature_names_out()
cvectorizer.get_feature_names_out()[:10]

array(['wherein the power supply control unit detects residual electric power supply of the power supply unit when the signal to instruct the electric power supply to the wireless communication module is input from the rfid module,',
       'a control unit configured to perform control such that parameters of the layer higher than the data link layer is set to parameters of the wireless communication module;',
       'a tag reading unit configured to transmit the read request to the rfid module, and receives the wireless communication setting information transmitted in response to the read request, and',
       'wherein the power supply unit does not perform the electric power supply to the rfid module.',
       'wherein, when information designating a condition of the transmission target data is acquired from the external terminal, the rfid module determines whether or not there is transmission target data satisfying the condition, and wirelessly transmits the wireless communication s

In [18]:
import pandas as pd

#put in the dataframe nicely for viewing
#.T to transpose columns to rows
df = pd.DataFrame(counts, columns = vocabs).T.sort_values(by=0, ascending=False)
df.head()

| | 0 |
|---|---|
| wherein the power supply control unit detects residual electric power supply of the power supply unit when the signal to instruct the electric power supply to the wireless communication module is input from the rfid module, | 0 |
| 1. a wireless communication terminal with an rfid module, comprising: | 0 |
| 13. a wireless communication system, comprising: | 0 |
| 4. the wireless communication terminal with the rfid module according to claim 3, | 0 |
| wherein the rfid module stores the data presence/absence information for each external terminal, and | 0 |


## 2. SpaCy NER

Let's start with the pretrained model and see how it performs.

In [19]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(patent_texts[0][18000:20000])
displacy.render(doc, style="ent", jupyter=True)



Looks great!  But we want to enhance the model further so it can tag technology terms.

The first step is to create a proper dataset, compatible with spaCy 3.0, for training a NER model.

### 2.1 Create Dataset

Here we use spaCy's `PhraseMatcher` class to find the entities from the pre-defined Wiki list.
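
For orientation, here is a minimal sketch of how `PhraseMatcher` behaves on short terms. The two terms and the sentence are made up for the example, and `attr="LOWER"` is an optional setting chosen here so that matching ignores case. It is not necessarily how the cells below configure the matcher.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model needed

# attr="LOWER" compares lowercased tokens, so "RFID" matches "rfid"
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["rfid module", "power supply unit"]  # illustrative terms
matcher.add("TECH", [nlp.make_doc(t) for t in terms])

doc = nlp("The RFID module wakes the power supply unit.")

# Each match is (match_id, start_token, end_token); sort by position
matched = sorted((start, end) for _, start, end in matcher(doc))
found = [doc[start:end].text for start, end in matched]
print(found)
```

Patterns are matched token by token, so a pattern only fires when its full token sequence appears in the text. Short terms match far more often than whole sentences.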

In [21]:
df.index[:25]

Index(['wherein the power supply control unit detects residual electric power supply of the power supply unit when the signal to instruct the electric power supply to the wireless communication module is input from the rfid module,',
       '1. a wireless communication terminal with an rfid module, comprising:',
       '13. a wireless communication system, comprising:',
       '4. the wireless communication terminal with the rfid module according to claim 3,',
       'wherein the rfid module stores the data presence/absence information for each external terminal, and',
       'wherein the power supply unit performs the electric power supply to the wireless communication module, and',
       'a power supply control step of controlling power supply of a power supply unit configured to perform the electric power supply to the wireless communication module;',
       'a storage unit configured to store metadata of the transmission target data; and',
       'wherein the rfid module outputs a

In [22]:
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Create a matcher to label entities in text
matcher = PhraseMatcher(nlp.vocab)

# Create an efficient stream of data
# nlp.pipe gives you docs
patterns = list(nlp.pipe(df.index[:25]))
print("patterns:", patterns[0])
print("type:    ", type(patterns[0]))
matcher.add("TECH", patterns) #expect list of docs

patterns: wherein the power supply control unit detects residual electric power supply of the power supply unit when the signal to instruct the electric power supply to the wireless communication module is input from the rfid module,
type:     <class 'spacy.tokens.doc.Doc'>


In [23]:
#let's test our matcher
text = ["electronic device is very expensive", 
        "facial expression is the future"]
for doc in nlp.pipe(text):
    matches = matcher(doc)
    print(matches)
    for match_id, start, end in matches:
        print(match_id, doc[start:end])

[]
[]


Next, we create the training and dev datasets, where each sample is a single sentence.

In [24]:
from spacy.tokens import DocBin, Span
from spacy.util import filter_spans #fix overlapping

def create_dataset(text):
    #text is each sentence.
    docs = []
    for doc in nlp.pipe(text):
        matches = matcher(doc)
        spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
        filtered_ents = filter_spans(spans)
        doc.ents = filtered_ents
        
        docs.append(doc)
        
    train_size = int(len(docs) * 0.8)
        
    train_docs = docs[:train_size]
    dev_docs   = docs[train_size:]

    train_doc_bin = DocBin(docs=train_docs)
    train_doc_bin.to_disk("docs/train.spacy")

    dev_doc_bin = DocBin(docs=dev_docs)
    dev_doc_bin.to_disk("docs/dev.spacy")
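
`doc.ents` does not accept overlapping spans, which is why `filter_spans` is applied first: when spans overlap, it prefers the longest one. The sketch below re-implements that rule on plain `(start, end)` tuples to show the effect. The function `keep_longest_spans` and the candidate spans are hypothetical and are not spaCy's own code.

```python
def keep_longest_spans(spans):
    """Resolve overlaps among (start, end) token spans, preferring longer spans."""
    # Longest first; ties are broken by the earlier start
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used = [], set()
    for start, end in ordered:
        # Keep the span only if none of its tokens is already taken
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)

# (5, 7) is nested inside (5, 8), as "power supply" is inside "power supply unit"
candidates = [(1, 3), (5, 7), (5, 8)]
resolved = keep_longest_spans(candidates)
print(resolved)
```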

Split `patent_data` into lines and create the dataset.

In [25]:
# split the patent text into chunks at each line ending
patent_lines = patent_data.split('\n')
print(len(patent_lines))
patent_lines[2] #example

427


'1. Field of the Invention'

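The full corpus has 280k+ chunks, which would take too long to process, so we cap the dataset at the first 10,000 chunks for training and dev.  Note that the file loaded above has only 427 lines, so here the slice keeps all of them.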

In [26]:
create_dataset(patent_lines[:10000])

### 2.2 Generate config

In [27]:
!python3 -m spacy init config --force configs/tech-config.cfg --lang en --pipeline ner

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/tech-config.cfg
You can now add your data and train your pipeline:
python -m spacy train tech-config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


### 2.3 Training
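
The training log in this section reports `ENTS_P`, `ENTS_R`, and `ENTS_F`: precision, recall, and F-score over predicted entities. As a rough sketch of what those columns measure, assuming an entity counts as correct only when its span and label both match exactly, the calculation looks like this. The function name `entity_prf` and the toy spans are illustrative and not part of spaCy.

```python
def entity_prf(gold, predicted):
    """Precision, recall and F1 over sets of (start, end, label) entities."""
    true_positives = len(gold & predicted)  # exact span and label match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(1, 3, "TECH"), (5, 8, "TECH")}
predicted = {(1, 3, "TECH"), (5, 7, "TECH")}  # second span has the wrong boundary
p, r, f = entity_prf(gold, predicted)
print(p, r, f)
```

A model that predicts no entities at all scores 0 on all three measures.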

In [28]:
gpu = spacy.require_gpu()
gpu

True

In [29]:
!python3 -m spacy train configs/tech-config.cfg --output ./output --paths.train docs/train.spacy --paths.dev docs/dev.spacy --gpu-id 0

ℹ Saving to output directory: output
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     19.00    0.00    0.00    0.00    0.00
  1     200          0.11    178.84    0.00    0.00    0.00    0.00
  3     400          0.00      0.00    0.00    0.00    0.00    0.00
  4     600          0.00      0.00    0.00    0.00    0.00    0.00
  7     800          0.00      0.00    0.00    0.00    0.00    0.00
  9    1000          0.00      0.00    0.00    0.00    0.00    0.00
 13    1200          0.00      0.00    0.00    0.00    0.00    0.00
 17    1400          0.00      0.00    0.00    0.00    0.00    0.00
 22    1600          0.00      0.00    0.00    0.00    0.00    0.00
✔ Saved pipeline to output directo

### 2.4 Loading and Testing

In [32]:
import spacy

nlp = spacy.load("output/model-best")
doc = nlp("a power supply unit configured to perform the electric power supply to the wireless communication module;")

colors = {"TECH": "#F67DE3"}
options = {"colors": colors}

print(doc.ents)

# to visualize the named entities in the processed text.
spacy.displacy.render(doc, style="ent", options=options, jupyter=True)

()


In [None]:
# Load the model: spaCy loads the trained pipeline from the custom path "output/model-best".

# Process the text: the pipeline (nlp) processes the sample sentence about the power supply unit.

# Define entity colors: the colors dictionary maps entity types to colors. Here the entity type "TECH" is mapped to "#F67DE3".

# Define rendering options: the options dictionary is passed to displacy.render and carries the colors dictionary defined above.

# Print recognized entities: doc.ents holds the entities recognized in the processed text, and is printed to the console.

# Render named entities: spacy.displacy.render visualizes the entities. The style is set to "ent" (entities), and the options, including the entity colors, are passed through the options parameter.

# Together, these steps give a colored view of the entities the model found in the text.