<a href="https://colab.research.google.com/github/JennyFrost/Sentence_decomposition_microservice/blob/main/Rule_based_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget "https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label" -O train.txt
!wget "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label" -O test.txt

--2021-05-21 23:30:33--  https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label
Resolving cogcomp.seas.upenn.edu (cogcomp.seas.upenn.edu)... 158.130.57.77
Connecting to cogcomp.seas.upenn.edu (cogcomp.seas.upenn.edu)|158.130.57.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 335858 (328K)
Saving to: ‘train.txt’


2021-05-21 23:30:33 (10.3 MB/s) - ‘train.txt’ saved [335858/335858]

--2021-05-21 23:30:33--  https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label
Resolving cogcomp.seas.upenn.edu (cogcomp.seas.upenn.edu)... 158.130.57.77
Connecting to cogcomp.seas.upenn.edu (cogcomp.seas.upenn.edu)|158.130.57.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23354 (23K)
Saving to: ‘test.txt’


2021-05-21 23:30:34 (2.96 MB/s) - ‘test.txt’ saved [23354/23354]



In [None]:
def read_infile(infile):
    sents, labels = [], []
    with open(infile, "r", encoding="Windows-1251") as fin:
        for line in fin:
            line = line.strip()
            if line == "":
                continue
            label, sent = line.split()[0], ' '.join(line.split()[1:])
            sents.append(sent)
            labels.append(label)
    return sents, labels
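
Each line of the TREC files holds a `COARSE:fine` label followed by the tokenized question. The snippet below is a minimal sketch of the split that `read_infile` performs, applied to one sample line in that format. The helper name `split_label` is hypothetical and exists only for illustration.

```python
def split_label(line):
    """Split one 'LABEL question ...' line into (label, question)."""
    tokens = line.split()
    return tokens[0], ' '.join(tokens[1:])

# Sample line in the same format as train.txt / test.txt.
label, question = split_label("NUM:dist How far is it from Denver to Aspen ?\n")
coarse, fine = label.split(':')

print(label, '|', question)
print(coarse, fine)
```

The coarse class (`NUM`) and the fine class (`dist`) are separated by a colon, which is why all rule labels in this notebook have the form `NUM:dist`, `ENTY:animal`, and so on.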

In [None]:
train_data, train_labels = read_infile("train.txt")
test_data, test_labels = read_infile("test.txt")

In [None]:
import re
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Extracting noun subjects and the nouns that depend on them

In [None]:
def get_subject(doc):
  subjects = [token for token in doc if token.dep_ == 'nsubj' or token.dep_ == 'nsubjpass']
  subj_dep = []
  for subj in subjects:
    subj_dep.extend([token.lemma_ for token in subj.subtree if token.pos_ == 'NOUN'])
  return subj_dep

Retrieving WordNet synsets for the subjects and their dependents

In [None]:
def get_synsets(doc):
  synsets = [wn.synsets(subj, pos=wn.NOUN) for subj in get_subject(doc)]  # take every synset in which the word is a noun
  return synsets

Retrieving hypernyms for a given synset

In [None]:
def get_hypernyms(synset):
  if synset.hypernyms() == []:
    return None
  else:
    hypernyms = synset.hypernyms()
    for hypernym in hypernyms:
      name = hypernym.name().split('.')[0]
      if name == 'animal':
        label = 'ENTY:animal'
      elif name == 'body_part':
        label = 'ENTY:body'
      elif name in ['production', 'creation', 'art', 'social_event', 'show', 'auditory_communication', 'written_communication']:
        label = 'ENTY:cremat'
      elif name in ['organization', 'group', 'administrative_unit']:
        label = 'HUM:gr'
      elif name == 'substance':
        label = 'ENTY:substance'
      elif name == 'food':
        label = 'ENTY:food'
      elif name == 'vehicle':
        label = 'ENTY:vehicle'
      elif name == 'pathological_state':
        label = 'ENTY:dismed'
      elif name == 'plant':
        label = 'ENTY:plant'
      elif name == 'phone':
        label = 'ENTY:letter'
      elif name == 'occupation':
        label = 'HUM:title'
      elif name == 'person':
        label = 'HUM:ind'
      elif name in ['geological_formation', 'body_of_water', 'topographic_point', 'building']:
        label = 'LOC:other'


      else:
        return get_hypernyms(hypernym)

  return label
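
The function climbs the hypernym hierarchy until it reaches a synset whose name maps to a question class. The sketch below shows the same idea on a toy taxonomy stored in a plain dictionary, so it runs without WordNet. `TOY_HYPERNYMS`, `TOY_LABELS`, and `label_by_hypernym` are hypothetical names, the parent links are simplified assumptions rather than real WordNet data, and the walk is written as a loop instead of the recursion used above.

```python
# Hypothetical child -> parent links (simplified, not real WordNet data).
TOY_HYPERNYMS = {
    'poodle': 'dog',
    'dog': 'animal',
    'oak': 'tree',
    'tree': 'plant',
    'idea': 'entity',
}

# A small subset of the name -> label mapping used by get_hypernyms.
TOY_LABELS = {'animal': 'ENTY:animal', 'plant': 'ENTY:plant'}

def label_by_hypernym(word):
    """Climb parent links until a labeled ancestor is found, else return None."""
    parent = TOY_HYPERNYMS.get(word)
    while parent is not None:
        if parent in TOY_LABELS:
            return TOY_LABELS[parent]
        parent = TOY_HYPERNYMS.get(parent)
    return None

print(label_by_hypernym('poodle'))  # ENTY:animal
print(label_by_hypernym('oak'))     # ENTY:plant
print(label_by_hypernym('idea'))    # None
```

As in `get_hypernyms`, only ancestors are tested, never the starting word itself, and a word with no labeled ancestor yields `None`.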

Loading the model without the parser (to lemmatize all sentences)

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

In [None]:
def normalize_sent(data):
    if isinstance(data, list):
        processed_sents = list(nlp.pipe(data))
        return [normalize_sent(sent) for sent in processed_sents]
    elif isinstance(data, str):
        processed_sent = nlp(data)
    else:
        processed_sent = data
    answer = [token.lemma_ if token.lemma_ != "-PRON-" else token.text.lower() for token in processed_sent]
    return answer

Classifier based on words and phrases that occur in the sentences (no hypernyms or syntax)

In [None]:
def classify_basic(sent):
  join = ' '.join(sent).lower()
  if ('weigh' in sent) or ('weight' in sent):
    label = 'NUM:weight'
  elif (sent[:2] in (['how', 'many'], ['how', 'much'])) or (sent[1:3] in (['how', 'many'], ['how', 'much'])):
    if [word for word in sent if word in ['money', 'spend', 'cost', 'charge', 'tax', 'fine', 'worth']]:
      label = 'NUM:money'
    else:
      label = 'NUM:count'
  # substring search in the joined sentence, so multi-word terms such as 'death toll' can match
  elif any(term in join for term in ['population', 'number', 'amount', 'death toll', 'reactivity', 'latitude', 'longitude', 'hourly rate', 'score', 'statistics']) or 'how often' in join:
    label = 'NUM:other'
  elif ('how long' in join) or ('how old' in join) or ('average age' in join) or ('average time' in join) or ('life span'in join) or ('life expectancy'in join):
    label = 'NUM:period'
  elif ((sent[0] == 'how') and (sent[1] in ['far', 'long', 'tall', 'high', 'wide', 'deep'])) or ('length' in sent):
    label = 'NUM:dist'
  elif ('how hot' in join) or ('how cold' in join) or ('temperature' in sent):
    label = 'NUM:temp'
  elif (sent[:2] == ['how', 'fast']) or ('speed' in sent):
    label = 'NUM:speed'
  elif (sent[0] == 'how' and sent[1] in ['large', 'big']) or (sent[0] == 'what' and 'size' in sent):
    label = 'NUM:volsize'
  elif (sent[0] == 'how'):
    label = 'DESC:manner'
  elif (sent[0] == 'when') or ('date' in sent) or ('birthday' in sent) or ('what year' in join) or ('what is the year' in join):
    label = 'NUM:date'
  elif [word for word in sent if word in ['abbreviation', 'abbreviate', 'acronym']]:
    label = 'ABBR:abb'
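  # assumption: an acronym is an all-caps token of two or more letters; it is searched in the original-case tokens because `join` is lowercased
  elif ('stand for' in join) or ('full form of' in join) or ((('what be' in join) or ('mean' in sent)) and (re.search(r'\b[A-Z]{2,}\b', ' '.join(sent)))):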
    label = 'ABBR:exp'
  elif ('come from' in join) or ('what be the origin of' in join) or ('what be the difference between' in join) or ('what be the history of' in join) or ('what be the use of' in join):
    label = 'DESC:desc'
  elif ('definition of' in join) or ('what be the meaning of' in join) or ('define' in sent) or (('what do' in join) and ('mean' in sent)) or (sent[0] == 'what' and sent[1] == 'be' and len(sent) in [4, 5, 6]):
    label = 'DESC:def'
  elif 'color' in sent:
    label = 'ENTY:color'
  elif (sent[0] == 'why') or (sent[:2] == ['what', 'cause']) or (sent[:2] == ['what', 'make']) or (sent[-2:] == ['famous', 'for']) or (sent[-2:] == ['know', 'for']) or ('reason' in sent) or ('purpose' in sent) or ('function' in sent):
    label = 'DESC:reason'
  elif 'plant' in sent:
    label = 'ENTY:plant'
  elif ('animal' in sent) or ('bird' in sent):
    label = 'ENTY:animal'
  elif any(phrase in join for phrase in ['make from', 'make of', 'consist of']) or ('substance' in sent):
    label = 'ENTY:substance'
  elif [word for word in sent if word in ['word', 'noun', 'verb', 'adjective', 'adverb']]:
    label = 'ENTY:word'
  elif (sent[0] == 'what') and (sent[1] == 'be') and (('fear' in sent) or ('disease' in sent) or ('illness' in sent)):
    label = 'ENTY:dismed'
  elif ('sport' in sent) or ('game' in sent) or ('play' in sent):
    label = 'ENTY:sport'
  elif ('food' in sent) or ('taste' in sent) or ('eat' in sent):
    label = 'ENTY:food'
  elif 'language' in sent:
    label = 'ENTY:lang'
  elif 'instrument' in sent:
    label = 'ENTY:instru'
  elif 'letter' in sent:
    label = 'ENTY:letter'
  elif (sent[:2] == ['what', 'money']) or ('currency' in sent):
    label = 'ENTY:currency'
  elif [word for word in sent if word in ['title', 'profession', 'job', 'occupation']] or (sent[-1] == 'do') or (sent[-4:] == ['do', 'for', 'a', 'living']):
    label = 'HUM:title'
  elif re.match(r'what( be)?( the)?( \w+( \w+)?( \'s)?)?( first| second| last| real)?( name| nickname)( of)?', join):
    label = 'HUM:ind'
  elif [word for word in sent if word in ['mountain', 'Mountain', 'Mountains', 'mount', 'Mount', 'peak']]:
    label = 'LOC:mount'
  elif ('city' in sent) or ('town' in sent) or ('capital' in sent):
    label = 'LOC:city'
  elif 'country' in sent:
    label = 'LOC:country'
  elif 'state' in sent:
    label = 'LOC:state'
  elif ('where' in sent) or ('address' in sent) or ('location' in sent):
    label = 'LOC:other'
  elif [word for word in sent if word in ['symbol', 'sign', 'mark', 'trademark']]:
    label = 'ENTY:symbol'
  elif ('call' in sent) or ('term' in sent) or ('translate' in sent) or ('how do you say' in join) or ('known as' in join) or ('know as' in join) or ('what do you call' in join) or (re.match(r'what be the \w+ for ', join)):
    label = 'ENTY:termeq'
  elif (sent[0] == 'what') and ([word for word in sent if word in ['way', 'technique', 'method', 'procedure', 'formula', 'measure', 'principle', 'stroke']]):
    label = 'ENTY:techmeth'
  elif (any(word in sent for word in ['war', 'battle', 'rite', 'event', 'disaster', 'tragedy', 'holiday', 'meeting', 'project', 'revolt', 'phenomenon', 'age', 'program', 'occurrence', 'hurricane', 'incident', 'trial',
                                      'happen', 'occur', 'follow', 'organize', 'befall', 'celebrate']) or ('take place' in join)):
    label = 'ENTY:event'
  elif [word for word in sent if word in ['probability', 'chance', 'odd', 'rating', 'percent', 'percentage', 'fraction', 'ratio', 'rate']]:
    label = 'NUM:perc'
  elif 'what chapter' in join:
    label = 'NUM:ord'
  elif any(term in join for term in ['phone number', 'telephone number', 'code']):
    label = 'NUM:code'
  elif len(sent) > 3 and (sent[0] == 'who') and (sent[1] == 'be') and (sent[2][0].isupper() or sent[3][0].isupper()):
    label = 'HUM:desc'

  else:
    label = 0

  return label
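
The rules mix two kinds of tests: membership in the token list `sent`, which matches whole lemmas, and substring search in the joined string `join`, which can match multi-word phrases. The sketch below uses a made-up lemmatized question to show how the two differ. It also shows that iterating over a string yields single characters, so a list of phrases has to be tested with `in` against the whole string.

```python
# Made-up lemmatized question, for illustration only.
sent = ['what', 'be', 'the', 'death', 'toll', 'of', 'the', 'flood', '?']
join = ' '.join(sent).lower()

# Token membership: matches whole lemmas only.
print('toll' in sent)        # True
print('death toll' in sent)  # False, no single token equals the phrase

# Substring search: can match multi-word phrases.
print('death toll' in join)  # True

# Iterating over a string yields characters, so this finds nothing.
phrases = ['death toll', 'population']
print([ch for ch in join if ch in phrases])       # []
print(any(phrase in join for phrase in phrases))  # True
```

Substring search is the looser of the two tests (`'code'` would also match inside a longer word), so single words are usually safer to test against `sent`.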

In [None]:
# train_data_lemmatized = normalize_sent(train_data)
test_data_lemmatized = normalize_sent(test_data)
test_data_lemmatized[0]

['how', 'far', 'be', 'it', 'from', 'Denver', 'to', 'aspen', '?']

Basic word-based classification and selection of sentences for further analysis

In [None]:
labels_classified = []
sents_for_further_analysis_numbers = []
sents_for_further_analysis = []
for i, sent in enumerate(test_data_lemmatized):
  label = classify_basic(sent)  # classify once and reuse the result
  if label != 0:
    labels_classified.append((i, label))
  else:
    # keep the index and the raw (non-lemmatized) sentence for the later stages
    sents_for_further_analysis_numbers.append(i)
    sents_for_further_analysis.append((i, test_data[i]))



# print(*labels_classified[:10], sep='\n')
# print(*sents_for_further_analysis[:5], sep='\n')
# print(*sents_for_further_analysis_numbers[:5])

for sent in sents_for_further_analysis[:10]:
  print(sent)
print(len(sents_for_further_analysis))
print(len(labels_classified))
print(len(test_data))

(1, 'What county is Modesto , California in ?')
(6, 'George Bush purchased a small interest in which baseball team ?')
(7, "What is Australia 's national flower ?")
(11, "What person 's head is on a dime ?")
(13, 'Who was the first man to fly across the Pacific Ocean ?')
(16, 'What metal has the highest melting point ?')
(17, 'Who developed the vaccination against polio ?')
(20, 'Who was the first American to walk in space ?')
(22, 'What river in the US is known as the Big Muddy ?')
(25, 'Who developed the Macintosh computer ?')
151
349
500


Loading the model with the parser to analyze the remaining sentences

In [None]:
nlp = spacy.load("en_core_web_sm")
# sents_for_further_analysis holds the sentences that classify_basic left unlabeled
docs = list(nlp.pipe([elem[1] for elem in sents_for_further_analysis]))

In [None]:
processed_sents = list(zip(sents_for_further_analysis_numbers, docs))
print(processed_sents[:10])
print(len(processed_sents))

[(1, What county is Modesto , California in ?), (6, George Bush purchased a small interest in which baseball team ?), (7, What is Australia 's national flower ?), (11, What person 's head is on a dime ?), (13, Who was the first man to fly across the Pacific Ocean ?), (16, What metal has the highest melting point ?), (17, Who developed the vaccination against polio ?), (20, Who was the first American to walk in space ?), (22, What river in the US is known as the Big Muddy ?), (25, Who developed the Macintosh computer ?)]
151


Classification using hypernyms

In [None]:
sents_for_extra_classifier = []
for num, doc in processed_sents:
  label = None
  for synsets in get_synsets(doc):
    for synset in synsets:
      label = get_hypernyms(synset)  # None when no labeled hypernym is found
      if label:
        break
    if label:
      break
  if label:
    # the first synset that yields a label decides the class of the sentence
    labels_classified.append((num, label))
  else:
    sents_for_extra_classifier.append((num, doc))

print(len(labels_classified))
len(sents_for_extra_classifier)

428


72

Additional syntax-based classifier for the remaining sentences

In [None]:
def classify_extra(doc):
  if (doc[0].text in ['Who', 'Name']) or ((doc[0].text in ['What', 'Which']) and (doc[0].dep_ == 'det') and (doc[0].head.dep_ == 'dobj')):
    label = 'HUM:ind'
  elif (doc[0].text == 'What' and doc[1].lemma_ == 'do'):
    label = 'DESC:desc'
  else:
    label = 'ENTY:other'

  return label

In [None]:
for num, doc in sents_for_extra_classifier:
  if num not in [elem[0] for elem in labels_classified]:
    labels_classified.append((num, classify_extra(doc)))
print(len(labels_classified))
labels_classified[:10]
# sents_for_extra_classifier

500


[(0, 'NUM:dist'),
 (2, 'HUM:desc'),
 (3, 'DESC:def'),
 (4, 'NUM:date'),
 (5, 'NUM:dist'),
 (8, 'DESC:reason'),
 (9, 'DESC:def'),
 (10, 'LOC:city'),
 (12, 'NUM:weight'),
 (14, 'NUM:date')]

Restoring the labels to the order in which they appear in test_labels

In [None]:
labels_classified = sorted(labels_classified)
len(labels_classified)
labels_classified = [elem[1] for elem in labels_classified]
labels_classified[:10]

['NUM:dist',
 'ENTY:other',
 'HUM:desc',
 'DESC:def',
 'NUM:date',
 'NUM:dist',
 'ENTY:other',
 'ENTY:plant',
 'DESC:reason',
 'DESC:def']
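
Because the labels are appended in three stages, the `(index, label)` pairs are not in sentence order until they are sorted. Tuples compare by their first element first, so a plain `sorted` call is enough. The sketch below illustrates this on a few made-up pairs and adds a sanity check (an addition, not part of the notebook) that every index occurs exactly once.

```python
# Made-up pairs in the order the three stages might have produced them.
pairs = [(0, 'NUM:dist'), (2, 'HUM:desc'), (1, 'ENTY:other')]

ordered_pairs = sorted(pairs)  # sorts by sentence index
ordered = [label for _, label in ordered_pairs]

# Every sentence index must occur exactly once, otherwise the labels
# would be misaligned with test_labels.
indices = [i for i, _ in ordered_pairs]
assert indices == list(range(len(pairs)))

print(ordered)  # ['NUM:dist', 'ENTY:other', 'HUM:desc']
```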

Measuring accuracy

In [None]:
correct = 0
for i, label in enumerate(labels_classified):
  if label == test_labels[i]:
    correct += 1

accuracy = correct / len(test_labels) * 100
accuracy

74.4
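
The accuracy above is computed on the fine-grained labels. The sketch below repeats the same computation on toy labels and, as an illustrative addition, compares only the coarse class (the part before the colon). The helper names `accuracy` and `coarse` are hypothetical, and the toy labels are not results from this notebook.

```python
def accuracy(predicted, gold):
    """Percentage of positions where predicted and gold labels agree."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) * 100

def coarse(labels):
    """Keep only the part before the colon, e.g. 'NUM:dist' -> 'NUM'."""
    return [label.split(':')[0] for label in labels]

# Toy labels for illustration only.
pred = ['NUM:dist', 'ENTY:other', 'HUM:desc', 'LOC:city']
gold = ['NUM:dist', 'ENTY:plant', 'HUM:ind', 'LOC:city']

print(accuracy(pred, gold))                  # 50.0
print(accuracy(coarse(pred), coarse(gold)))  # 100.0
```

Applied to this notebook, the call would be `accuracy(labels_classified, test_labels)`. The coarse variant shows how many of the errors stay within the correct top-level class.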