# Natural Language Processing

## Technology NER

Here we teach a model to label technology-related terms.

We will go through the whole process, from labeling to training, so you understand how to do it.  The idea is:

1. Grab some raw text containing technology-related content
2. Grab another text containing technology terms
3. Use 2 to annotate 1
4. Train the NER model on the annotated 1

First, let us grab the raw text containing technology-related content.  We take these raw texts from patents.

This notebook is adapted from https://github.com/kinivi/patent_ner_linking

## 1. Loading data

In [43]:
# if you've already unzipped the file
# this text was taken from
# https://www.google.com/patents/sitemap/en/Sitemap/G06/G06K.html
patent_data = open('../data/G06K.txt').read().strip()
patent_data[:500]

'COMMUNICATION DEVICE, COMMUNICATION METHOD AND PROGRAM\n_____2019_____3500050_____490084061_____EP3500000.txt_____G06K_____G06K7/10722:G06K7/1417:H04L67/104:H04M1/00:H04M11/00:H04W12/001:H04W12/04:H04W12/04033:H04W12/04071:H04W12/06:H04W76/14:H04W84/12:H04W84/20\nA communication device obtains identification information and a public key of a first other communication device by a particular obtaining method that does not use a wireless LAN and notifies the first other communication device of a role'

To train NER, we need many samples, each represented as a `Doc`, so we split `patent_data` into many samples, one doc per patent.  Looking closely, the patents are separated by `\n\n`.

In [44]:
# split into patents texts | 1 entry = 1 patent
patent_texts = patent_data.split('\n\n')
print("Length: ", len(patent_texts))
print("First patent: ",  patent_texts[0][:50])
print("Second patent: ", patent_texts[1][:50])

Length:  2003
First patent:  COMMUNICATION DEVICE, COMMUNICATION METHOD AND PRO
Second patent:  
OPERATIONAL STATUS CLASSIFICATION DEVICE
_____201
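
Note that the second entry above starts with a stray newline, so the split can leave extra whitespace (and possibly empty entries) behind. A minimal sketch of cleaning up after the split, using a made-up string in place of the real patent file; the names `raw` and `entries` are illustrative only:

```python
# Toy stand-in for the patent dump; the real data lives in ../data/G06K.txt.
raw = "TITLE A\nmeta\nabstract a\n\n\nTITLE B\nmeta\nabstract b\n\n"

# Split on blank lines, then strip stray newlines from each entry.
entries = [chunk.strip() for chunk in raw.split("\n\n")]

# Drop entries that are empty after stripping.
entries = [chunk for chunk in entries if chunk]

print(len(entries))
print(entries[1][:7])
```

Whether this cleanup is needed for the real file depends on how consistently the patents are separated.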


Next, let's grab some technology terms from another text file.  To find which of these terms actually occur in the patent texts, we can use `CountVectorizer` from scikit-learn. This way, we can drop terms that occur less often than some threshold.

In [None]:
# here are the potential terms
terms = open('../data/manyterms.lower.txt').read().lower().strip().split('\n')
print(terms[44444:44456])
print(len(terms), 'terms')

In [None]:
terms = open('../data/manyterms.lower.txt').read().lower().strip().split('\n')
terms[:10]

As you can see, we got a lot of irrelevant terms.  Let's keep only the 25 most frequent terms for now.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# The vocabulary terms are lowercased, so lowercase=True lowercases the patent text to match them.
# Note that this loses the case of abbreviations such as API, CAT, etc.
cvectorizer = CountVectorizer(ngram_range=(1, 4), stop_words="english",
                              vocabulary=terms, lowercase=True)
X = cvectorizer.fit_transform(patent_texts)

Let's take a look at the results of the counting.

In [None]:
#row = patents
#columns = terms
#value  = counts
#X is a sparse matrix; X.shape avoids building the full dense array
X.shape

Let's sum over the rows for each column (to get each term's frequency), sort the totals, and map them to the actual vocabulary.

In [None]:
import numpy as np

#sum them across all documents
counts = np.sum(X, axis=0)
counts.shape

In [None]:
#we can get the actual vocab names
vocabs = cvectorizer.get_feature_names_out()
vocabs[:10]

In [None]:
import pandas as pd

#put in the dataframe nicely for viewing
#.T to transpose columns to rows
df = pd.DataFrame(counts, columns = vocabs).T.sort_values(by=0, ascending=False)
df.head()
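
Instead of taking a fixed top-N, the terms could also be filtered by a minimum frequency. A minimal sketch with made-up counts; `term_counts` and the threshold `min_count` are hypothetical and would need tuning on the real data:

```python
# Made-up term frequencies, standing in for the summed counts above.
term_counts = {"wireless lan": 120, "access point": 45, "image": 900, "zebra": 1}

# Hypothetical threshold: keep terms seen at least this many times.
min_count = 40

# Keep the frequent terms, most frequent first.
kept_terms = sorted(
    (term for term, count in term_counts.items() if count >= min_count),
    key=term_counts.get,
    reverse=True,
)
print(kept_terms)
```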

## 2. SpaCy NER

Let's start with the pretrained model and see what it tags.

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(patent_texts[0][18000:20000])
displacy.render(doc, style="ent", jupyter=True)

Looks great!  But what we want is to enhance the model further so it can tag technology terms.

The first step is to create a proper dataset, compatible with spaCy 3.0, for training an NER model.

### 2.1 Create Dataset

Here we use spaCy’s `PhraseMatcher` class to find the entities from the pre-defined Wiki list.

In [None]:
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Create a matcher to label entities in text
# attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# nlp.pipe processes the terms as a stream and yields Doc objects
patterns = list(nlp.pipe(list(df.index[:25]))) #top 25
print("patterns:", patterns)
print("type:    ", type(patterns[0]))
matcher.add("TECH", patterns) #expect list of docs
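
To see what the matcher returns, here is a small sketch on a made-up sentence with two illustrative terms (the `toy_` names are not part of the notebook). Each match is a `(match_id, start, end)` tuple of token offsets.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough: only the tokenizer is needed.
toy_nlp = spacy.blank("en")
toy_matcher = PhraseMatcher(toy_nlp.vocab, attr="LOWER")

# Patterns must be Doc objects; make_doc only runs the tokenizer.
toy_patterns = [toy_nlp.make_doc(term) for term in ["wireless lan", "access point"]]
toy_matcher.add("TECH", toy_patterns)

toy_doc = toy_nlp("The Wireless LAN connects to an access point.")

# Slice the doc with the token offsets to recover the matched text.
found = [toy_doc[start:end].text for _, start, end in toy_matcher(toy_doc)]
print(found)
```

Because of `attr="LOWER"`, "Wireless LAN" in the text matches the lowercase pattern "wireless lan".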

Next, we create the training and dev datasets, where each sample is one chunk of text.

In [None]:
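import os

from spacy.tokens import DocBin, Span
from spacy.util import filter_spans

def create_dataset(texts):
    # texts is a list of text chunks; each chunk becomes one Doc
    docs = []
    for doc in nlp.pipe(texts):
        matches = matcher(doc)
        spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
        # doc.ents rejects overlapping spans, so keep only the longest non-overlapping ones
        doc.ents = filter_spans(spans)
        docs.append(doc)

    # first 80% for training, remaining 20% for dev
    train_size = int(len(docs) * 0.8)
    train_docs = docs[:train_size]
    dev_docs   = docs[train_size:]

    # make sure the output directory exists before writing
    os.makedirs("docs", exist_ok=True)

    train_doc_bin = DocBin(docs=train_docs)
    train_doc_bin.to_disk("docs/train.spacy")

    dev_doc_bin = DocBin(docs=dev_docs)
    dev_doc_bin.to_disk("docs/dev.spacy")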
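
One thing to watch for when assigning `doc.ents`: spaCy raises an error if entity spans overlap, which can happen when one matched term is contained in another. A minimal sketch of resolving overlaps with `spacy.util.filter_spans`, which prefers the longest spans, on a made-up sentence (the `toy_` names are illustrative):

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

toy_nlp = spacy.blank("en")
toy_doc = toy_nlp("a wireless communication device")

# Two overlapping candidate spans, given as token offsets:
# "communication" (2..3) lies inside "wireless communication device" (1..4).
overlapping = [
    Span(toy_doc, 2, 3, label="TECH"),
    Span(toy_doc, 1, 4, label="TECH"),
]

# filter_spans keeps the longest span when candidates overlap.
toy_doc.ents = filter_spans(overlapping)
toy_ents = [(ent.text, ent.label_) for ent in toy_doc.ents]
print(toy_ents)
```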

Split `patent_data` into lines, and create the dataset.

In [None]:
# split the patent data into chunks, one per line
patent_lines = patent_data.split('\n')
print(len(patent_lines))
patent_lines[2], patent_lines[5]

Since we have 280k+ chunks, processing all of them would take too long.  Let's just grab 10000 chunks for training and dev for now.

In [None]:
create_dataset(patent_lines[:10000])
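
`create_dataset` takes the first 80% of the samples for training and the rest for dev, in file order. If the patents happen to be ordered by topic or date, shuffling first may give a more representative dev set. A sketch under that assumption; `split_samples` and its parameters are hypothetical helpers, not part of spaCy:

```python
import random

def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into train and dev parts."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

toy_lines = [f"line {i}" for i in range(10)]
train_lines, dev_lines = split_samples(toy_lines)
print(len(train_lines), len(dev_lines))
```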

### 2.2 Generate config

In [None]:
!python -m spacy init config --force configs/tech-config.cfg --lang en --pipeline ner

### 2.3 Training

In [None]:
!python -m spacy train configs/tech-config.cfg --output ./output --paths.train docs/train.spacy --paths.dev docs/dev.spacy

In [None]:
nlp = spacy.load("output/model-best")
nlp.analyze_pipes(pretty=True)

### 2.4 Loading and Testing

In [None]:
import spacy

nlp = spacy.load("output/model-best")
doc = nlp("Wi-Fi Direct (registered trademark, which will be hereinafter referred to as WFD) \
           corresponding to a technology for directly performing a communication based on a \
           wireless LAN between communication devices without intermediation of an access \
           point (hereinafter referred to as AP) is standardized in Wi-Fi Alliance serving \
           as a wireless LAN industry group.")

colors = {"TECH": "#F67DE3"}
options = {"colors": colors}

print(doc.ents)

spacy.displacy.render(doc, style="ent", options=options, jupyter=True)