INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES

I – Term detection

1-1	What do we call a term?

A term is a word or multiword phrase that:

-	Has a clear and specific meaning within the context of the patent

-	Is not a common word or phrase found in everyday language

-	Denotes a technical or scientific concept, material, method or device that is specific to the invention described in the patent.

1-2	Create a gold dataset

1-3	Build a rule-based baseline (e.g. using re, or generating n-grams and filtering them)

1-4	Train a statistical model (based on spaCy)

1-5	Improve annotation using Prodigy

1-6	Evaluate the model

1-7	Iterate and refine the rule-based baseline
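
As a rough illustration of step 1-3, the sketch below generates 1- to 4-grams with `re` and filters them by frequency and by stop words at the edges. It is only a sketch under stated assumptions: the tiny `STOP_WORDS` set, the `tokenize` and `candidate_terms` helpers, and the `min_count` threshold are all hypothetical and are not part of the notebook, which uses a term dictionary instead.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "to", "with", "by"}

def tokenize(text):
    """Lowercase and split into word tokens (letters, digits, inner hyphens)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def candidate_terms(texts, max_n=4, min_count=2):
    """Count 1..max_n-grams; keep frequent ones with no stop word at either edge."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Reject n-grams that start or end with a stop word.
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return {term: c for term, c in counts.items() if c >= min_count}

sample = [
    "The amino acid sequence of the light source is shown.",
    "An amino acid sequence is set forth with the light source.",
]
terms = candidate_terms(sample)
print(sorted(terms.items(), key=lambda kv: (-kv[1], kv[0])))
```

A filter like this over-generates (for example, it keeps "acid sequence" alongside "amino acid sequence"), which is one reason to check candidates against a curated term list or a gold dataset.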


In [1]:
import sys
import time
import json
import pprint
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from spacy.matcher import PhraseMatcher, Matcher
from spacy.util import filter_spans
from spacy import displacy
from spacy.tokens import DocBin
from spacy.tokens import Span
from collections import Counter
from nltk.tokenize import MWETokenizer
from nltk.util import Trie
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
nltk.download('omw-1.4')
tqdm.pandas()
spacy.__version__


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\etien\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\etien\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


'3.3.2'

In [2]:
spacy.require_gpu()

True

In [3]:
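# read the corpus with a context manager so the file handle is closed
with open('G01N.txt') as f:
    patent_data = f.read().strip()

# split into patent texts | 1 entry = 1 patent
patent_texts = patent_data.split('\n\n')

# split the whole corpus into lines (across all patents)
patent_lines = patent_data.split('\n')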

In [4]:
len(patent_lines)

330594

In [5]:
len(patent_texts)

2139

In [6]:
def get_feature_names(vectorizer):
    if hasattr(vectorizer, 'get_feature_names'):
        return vectorizer.get_feature_names()
    else:
        return vectorizer.get_feature_names_out()

In [7]:
# here are the potential terms
with open('manyterms.lower.txt') as f:
    mwes = f.read().lower().strip().split('\n')
print(mwes[44444:44456])
print(len(mwes), 'mwes')

['antonio superchi', 'antonio tarver', 'antonio torres jurado', 'antonio valdes', 'antonio valdes y fernandez bazan', 'antonio valdez', 'antonio valdés y bazán', 'antonio valdés y fernández bazán', 'antonio valente', 'antonio vitali', 'antonio vivaldi', 'antonio xavier machado e cerveira']
743274 mwes


In [8]:
# lowercase=True lowercases the patent text so that it matches the lowercased vocabulary in mwes.
# Note: case is discarded, so abbreviations such as API or CAT are matched case-insensitively.
cvectorizer = CountVectorizer(ngram_range=(1, 4), stop_words="english",
                              vocabulary=mwes, lowercase=True)
X = cvectorizer.fit_transform(patent_texts)

# Show top-25 most frequent terms
termdf_cv = pd.DataFrame(np.sum(X, axis=0), columns=get_feature_names(cvectorizer)).T.sort_values(by=0, ascending=False)
termdf_cv.head(25)



term,count
amino acid,24339
nucleic acid,13185
amino acid sequence,11407
light source,6943
amino acids,6170
control unit,5249
nucleotide sequence,4374
variable region,3264
mass spectrometry,3091
monoclonal antibody,2917


In [9]:
termdf_cv = termdf_cv[termdf_cv[0] >= 1]
termdf_cv.to_csv('terms.tsv', sep='\t')

In [10]:
# !python -m spacy download en_core_web_lg
#!python -m spacy download en_core_web_trf

In [11]:
nlp = spacy.load("en_core_web_lg")
doc = nlp(patent_texts[0][10000:12000])
#displacy.render(doc, style="ent", jupyter=True)

In [12]:
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(text) for text in termdf_cv.index]
matcher.add("CH", patterns)

In [13]:
train_lines, test_lines = train_test_split(
    patent_lines, test_size=0.3, random_state=42)


print(len(train_lines))
print(len(test_lines))

231415
99179


In [14]:
def create_dataset(text, n_lines, filename, offset=0):
    LABEL = "CH"
    doc_bin = DocBin()  # create a DocBin object

    for training_example in tqdm(text[offset:offset+n_lines]):
        doc = nlp.make_doc(training_example)
        ents = []

        for match_id, start, end in matcher(doc):
            # Span() built from token offsets always returns a Span,
            # so no None check is needed here
            ents.append(Span(doc, start, end, label=LABEL))

        # doc.ents must not overlap: keep the longest of any overlapping matches
        filtered_ents = filter_spans(ents)
        doc.ents = filtered_ents
        doc_bin.add(doc)
    doc_bin.to_disk(filename)
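
The `filter_spans` call matters because the dictionary contains nested terms ("amino acid" inside "amino acid sequence") and `doc.ents` cannot hold overlapping spans. The plain-Python sketch below mimics the documented preference of `filter_spans` (longest span first, earlier span on ties) on `(start, end, label)` tuples. `resolve_overlaps` is a hypothetical helper for illustration, not spaCy's implementation.

```python
def resolve_overlaps(spans):
    """Keep non-overlapping (start, end, label) spans, preferring longer, then earlier ones."""
    # Longest first; ties are broken by the smaller start offset.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used = [], set()
    for start, end, label in ordered:
        # Skip the span if any of its tokens already belongs to a kept span.
        if not any(i in used for i in range(start, end)):
            kept.append((start, end, label))
            used.update(range(start, end))
    return sorted(kept)

# Token offsets in "the amino acid sequence": amino=1, acid=2, sequence=3
matches = [(1, 3, "CH"), (1, 4, "CH"), (2, 3, "CH")]
print(resolve_overlaps(matches))  # [(1, 4, 'CH')]
```

One consequence for the training data: shorter dictionary terms are only labelled where they do not occur inside a longer matched term.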

In [16]:
create_dataset(train_lines, 60000, "training_data.spacy")
create_dataset(test_lines, 2000, "test_data.spacy")
create_dataset(test_lines, 3000, "valid_data.spacy", offset=2000)

  0%|          | 0/60000 [00:00<?, ?it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]

  0%|          | 0/3000 [00:00<?, ?it/s]

In [17]:
# save train_lines to txt file
with open('train_lines.txt', 'w') as f:
    for line in train_lines:
        f.write(line)
        f.write('\n')

# save the validation lines (same slice as valid_data.spacy) to txt file
with open('valid_lines.txt', 'w') as f:
    for line in test_lines[2000:5000]:
        f.write(line)
        f.write('\n')

# save the test lines (same slice as test_data.spacy) to txt file
with open('test_lines.txt', 'w') as f:
    for line in test_lines[0:2000]:
        f.write(line)
        f.write('\n')

In [18]:
# Run to generate full training config
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [19]:
!python -m spacy train config.cfg --output ./spacy_output --paths.train ./training_data.spacy --paths.dev ./valid_data.spacy

✔ Created output directory: spacy_output

[2023-04-14 13:15:38,664] [INFO] Set up nlp object from config
[2023-04-14 13:15:38,674] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-04-14 13:15:38,678] [INFO] Created vocabulary
[2023-04-14 13:15:38,681] [INFO] Finished initializing nlp object
[2023-04-14 13:16:46,998] [INFO] Initialized pipeline components: ['tok2vec', 'ner']



ℹ Saving to output directory: spacy_output
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     49.50    2.62    1.63    6.67    0.03
  0     200         31.07   2389.20   16.91   65.13    9.72    0.17
  0     400         31.20    836.19   31.39   60.78   21.16    0.31
  0     600         51.40   1028.38   42.12   55.07   34.11    0.42
  0     800       2755.18   1037.90   21.01   85.20   11.98    0.21
  0    1000       9404.46    939.89   45.81   59.92   37.08    0.46
  0    1200        137.78   1124.66   19.47   48.13   12.21    0.19
  0    1400        204.03   1574.53   50.91   64.14   42.21    0.51
  0    1600       8332.32   1569.84   49.04   71.40   37.34    0.49
  0    1800        466.53   2029.48   52.82   72.40   4
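
The ENTS_P, ENTS_R and ENTS_F columns are entity-level precision, recall and F-score. The sketch below shows how such numbers can be computed from exact-match spans. The `span_prf` helper and the toy spans are hypothetical and this is not spaCy's scorer. Because the dev set here was labelled by the same PhraseMatcher, these scores measure agreement with the rule-based labels, not with human annotation.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that agree on start, end and label
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = {(1, 4, "CH"), (6, 8, "CH"), (10, 11, "CH")}
pred = {(1, 4, "CH"), (6, 7, "CH")}  # the second span has a wrong boundary
p, r, f = span_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.50 R=0.33 F=0.40
```

The F-score is the harmonic mean of the other two: in the step-200 row above, P = 65.13 and R = 9.72 give an F of approximately 16.9.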

In [20]:
# Load best model
nlp = spacy.load("./spacy_output/model-best")

# A sample patent text snippet
doc = nlp("""The scientific and technological terminologies referred to herein have the same meanings as what are generally understood by a person skilled in the art, and if there is a conflict, the definition in the present description shall prevail.
Firstly, in one aspect, the present invention provides a solid phase carrier (the solid phase carrier of the present invention) comprising:
a polydimethylsiloxane layer with an initiator on the surface, andtitanium dioxide particles distributed in said polydimethylsiloxane layer with an initiator on the surface.
Said polydimethylsiloxane with an initiator on the surface (iPDMS) belongs to the prior art, and reference can be made to Chinese Patent Publication No. CN 101265329 A.
Titanium dioxide, commonly known as titanium white, is usually a white powder. The crystal form of said titanium oxide is not particularly limited and may be, for example, of rutile type, anatase type or nanoscale ultrafine titanium dioxide.
Preferably, said titanium dioxide particles have an average particle size of 1 nm to 1000 nm, more preferably 5 nm to 500 nm, more preferably 5 nm to 200 nm, more preferably 10 nm to 100 nm, and more preferably 10 nm to 50 nm.
The specific surface area ofsaid titanium dioxide particles is preferably 10 to 500 m2/g, more preferably 20 to 400 m2/g, more preferably 30 to 300 m2/g, and more preferably 40 to 200 m2/g.
The content of said titanium dioxide particles is 0.0001 to 100 parts by weight, preferably 0.0002 to 90 parts by weight, more preferably 0.0005 to 80 parts by weight, more preferably 0.001 to 70 parts by weight, more preferably 0.002 to 60 parts by weight, more preferably 0.005 to 40 parts by weight, more preferably 0.01 to 40 parts by weight, more preferably 0.02 to 30 parts by weight, more preferably 0.05 to 20 parts by weight, and more preferably 0.1 to 10 parts by weight, relative to 100 parts by weight of said polydimethylsiloxane layer with an initiator on the surface.
Said solid phase carrier may be prepared, for example, by mixing macromolecular precursor A, cross-linking agent B, vinyl-endcapped initiator C, and said titanium dioxide particles D at a certain weight ratio, and leaving the mixture to stand, for example, for 6 to 24 hours to form an elastomer. In the mixture, said macromolecular precursor A is poly(dimethyl-methylvinylsiloxane); said cross-linking agent B is vinyl-endcapped poly(dimethyl-methylvinylsiloxane) and poly(dimethyl-methylhydrogenosiloxane); and said vinyl-endcapped initiator C is 10-undecenyl 2-bromo-2-methylpropionate.
The shape of said solid phase carrier includes, but is not limited to: beads, magnetic beads, films, microtubes, filter membranes, plates, microplates, carbon nanotubes, sensor chips, etc. Pits, grooves, filter membrane bottoms and the like may be provided on a flat solid phase carrier such as a film or a plate.
Preferably, said solid phase carrier further comprises: an oligomeric ethylene glycol methacrylate layer located on said polydimethylsiloxane layer with an initiator on the surface. Said oligomeric ethylene glycol methacrylate layer may be formed by initiating the polymerization of oligomeric ethylene glycol methacrylate on said polydimethylsiloxane layer with an initiator on the surface. Such a solid phase carrier can completely prevent non-specific adsorption of proteins.
In another aspect, the present invention provides a detection device (the detection device of the present invention) comprising: the above-mentioned solid phase carrier of the present invention, and a polypeptide or protein linked to said oligomeric ethylene glycol methacrylate layer. The manner by which said polypeptide or protein is linked to said oligomeric ethylene glycol methacrylate layer may be a covalent linkage, and reference can be made to, for example, the description of International Patent Application Publication WO 2014044184 for details. The polypeptide layer or protein herein refers to a compound obtained by the dehydration condensation of amino acid molecules, usually, the compound consisting of 2 to 50 amino acid residues is referred to as a polypeptide, and the compound consisting of 50 or more amino acid residues is referred to as a protein.
The detection device of the present invention has high detection sensitivity and can completely inhibit non-specific binding of proteins, and is thus particularly suitable for detecting a substance (e.g., a protein, a polypeptide, a small molecule compound, a nucleic acid and the like) capable of binding to the polypeptide or protein in said polypeptide layer or protein layer.
In another aspect, the present invention provides a detection kit (the detection kit of the present invention) comprising the above-mentioned detection device of the present invention or the solid phase carrier of the present invention. The detection kit of the present invention may further comprise other components, and reference can be made to, for example, the description of International Patent Application Publication WO 2014044184.
The detection kit of the present invention is also particularly suitable for detecting a substance (e.g., a protein, a polypeptide, a small molecule compound, a nucleic acid and the like) capable of binding to the polypeptide or protein in said polypeptide layer or protein layer.
Examples
The present invention will be described below in more detail by means of examples, but the present invention is not limited to these examples.
1. Preparation and confirmation of polypeptides
A peptide composed of 30 amino acids used in the examples has an amino acid sequence as set forth in SEQ ID NO: 1, which was synthesized by GL Biochem (Shanghai) Ltd.
SEQ ID NO: 1: PLVEDGVKQCDRYWPDEGASLYHVYEVNLV is positive for sera of patients with type 1 diabetes, and negative for sera of normal healthy individuals or non-type 1 diabetic patients, and reference can be made to Chinese Patent Application Publication No. CN104098677A.
2. Preparation of detection device
Detection devices 1-4 comprising SJ-modified silica gel thin films 1-4 of the invention were obtained as described in Example 2 of Chinese Patent Application Publication No. CN 104098677 A, except that in order to prepare the SJ-modified silica gel thin films 1-4 of the invention, the polydimethylsiloxane precursor A, cross-linking agent B, vinyl-endcapped initiator C and titanium oxide particles D (with an average particle size of 40 nm, a specific surface area of 60 m2/g, of rutile type) were sufficiently mixed at a ratio of A : B : C : D = 10 : 1 : 0.2 : (5, 1, 0.5 or 0.1), respectively.
Detection devices 5-8 comprising SJ-modified silica gel thin films 5-8 of the invention were obtained as described in Example 2 of Chinese Patent Application Publication No. CN 104098677 A, except that in order to prepare the SJ-modified silica gel thin films 5-8 of the invention, the polydimethylsiloxane precursor A, cross-linking agent B, vinyl-endcapped initiator C and titanium oxide particles D (with an average particle size of 20 nm, a specific surface area of 120 m2/g, of rutile type) were sufficiently mixed at a ratio of A : B : C : D = 10 : 1 : 0.2 : (5, 1, 0.5 or 0.1), respectively.
In addition, a detection device of Example 2 of Chinese Patent Application Publication No. CN 104098677 A (the SJ-modified silica gel thin film used does not comprise titanium oxide particles) was prepared as a control.
3. Detection with detection devices""")

# Show NER results
spacy.displacy.render(doc, style="ent", jupyter=True)

### Improve the model using prodigy

In [21]:
# !pip install C:\Users\etien\Downloads\prodigy-1.11.11-windows\windows\prodigy-1.11.11-cp39-cp39-win_amd64.whl

In [22]:
# !python -m prodigy

In [30]:
!python -m prodigy ner.correct gold_tech  ./spacy_output/model-best  valid_lines.txt --label CH

Using 1 label(s): CH
Added dataset val_fine_tune to database SQLite.
⚠ The model you're using isn't setting sentence boundaries (e.g. via the parser
or sentencizer). This means that incoming examples won't be split into
sentences.

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!



ERROR:    [Errno 10048] error while attempting to bind on address ('::1', 8080, 0, 0): only one usage of each socket address (protocol/network address/port) is normally permitted


In [32]:
!netstat -ano | findstr :8080

  TCP    127.0.0.1:8080         0.0.0.0:0              LISTENING       7812
  TCP    [::1]:8080             [::]:0                 LISTENING       7812


In [33]:
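# <PIDProdigy> is a placeholder: substitute the PID reported by netstat above (7812 in this run).
# Left as-is, cmd reads "<" as input redirection, which produces the error below.
!Taskkill /PID <PIDProdigy> /F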

The system cannot find the file specified.
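
The bind error above means an earlier Prodigy server was still listening on port 8080. Instead of killing the process, one option is to check the port from Python first and pick a free one. This is a minimal sketch with hypothetical helper names (`port_in_use`, `first_free_port`). Prodigy's documentation describes a `PRODIGY_PORT` environment variable for overriding the port, but that should be confirmed for the installed version.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) == 0

def first_free_port(start=8080, attempts=20):
    """Scan upward from `start` and return the first port with no listener."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    raise RuntimeError("no free port found in range")

port = first_free_port()
print(port)
```

The chosen port could then be supplied to Prodigy before launching the recipe, for example through the environment variable mentioned above.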
