#### Homework 6:

John Snow Labs provides a simple, unified Python API for professional-grade NLP processing of text data.

Study one of the proposed solutions for processing clinical data with named entity recognition (NER):
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb

Test how well the model recognizes terms in the text of an article from the Case Reports section of PubMed (NCBI).

John Snow Labs requires license keys from https://www.johnsnowlabs.com

A free trial was requested, but the file with the keys was never received.

Workaround: test text processing with a different model, a conditional random field (CRF).

Task: Identify "Disease - Treatment" relationships in the text of medical articles, and build a table that lists each disease with its corresponding treatment.



Data: training and test datasets

* train_sent
* test_sent
* train_label
* test_label

The text datasets (sent) contain sentences from medical articles.

The label datasets (label) contain three labels, O, D, and T, which stand for 'Other', 'Disease', and 'Treatment', respectively. Every word in 'train_sent' and 'test_sent' has exactly one label.

Thus there is a one-to-one correspondence between the labels in 'train_label' and 'test_label' and the words in 'train_sent' and 'test_sent', respectively.
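This one-to-one correspondence is easy to verify before training. The sketch below assumes that sentences and label sequences are stored as whitespace-separated strings; `check_alignment` and the toy data are hypothetical and introduced only for illustration.

```python
def check_alignment(sentences, labels):
    """Return the indices of sentences whose word count differs from their label count."""
    mismatched = []
    for i, (sentence, label_seq) in enumerate(zip(sentences, labels)):
        if len(sentence.split()) != len(label_seq.split()):
            mismatched.append(i)
    return mismatched


# Toy data (hypothetical): one label per word, O = other, D = disease, T = treatment.
toy_sentences = ["Aspirin relieves headache", "No entities here"]
toy_labels = ["T O D", "O O O"]

print(check_alignment(toy_sentences, toy_labels))  # []
print(check_alignment(["two words"], ["O"]))       # [0]
```

An empty list means every word has a label; any index returned points at a sentence that would break the feature and label pairing later on.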

## Colab Setup

In [None]:
!pip install pycrf
!pip install sklearn-crfsuite

import spacy
import sklearn_crfsuite
import pandas as pd
from sklearn_crfsuite import metrics
from collections import Counter
model = spacy.load("en_core_web_sm")

Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... done
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1869 sha256=3b7eac50440b88d2b6165aa4c2596a3f8e258b9d1be2a7a54d28903ffb6bcf71
  Stored in directory: /root/.cache/pip/wheels/fd/3a/fb/e4d15c9c2b169f43811b23a863ee9717ff3eda5d2301789043
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     1.1/1.1 MB 5.3 MB/s eta 0:00:00
Installing collected packages: python-crfsuite, sklearn-c

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Data processing

In [None]:
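def data_preproc(path):
  """Read a file with one token per line and a blank line between sentences."""
  with open(path, encoding="utf-8") as file_hd:
    # Sentences are separated by blank lines.
    list_sent = file_hd.read().split("\n\n")
  # Join the tokens of each sentence with single spaces.
  sentences = [sent.replace("\n", " ") for sent in list_sent]
  return sentences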
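To make the expected file layout concrete, here is a small self-contained sketch that writes a toy file and parses it. It assumes one token per line with a blank line between sentences, as the splitting logic above implies. `read_sentences` is a hypothetical variant that additionally strips trailing whitespace and skips empty blocks; it is not the function used in this notebook.

```python
import os
import tempfile


def read_sentences(path):
    """Parse a one-token-per-line file into a list of space-joined sentences."""
    with open(path, encoding="utf-8") as fh:
        blocks = fh.read().strip().split("\n\n")
    # Collapse all whitespace inside a block and drop blocks that are empty.
    return [" ".join(block.split()) for block in blocks if block.strip()]


# Toy file content (hypothetical): two sentences separated by a blank line.
sample = "Aspirin\nrelieves\nheadache\n\nNo\nentities\nhere\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

parsed = read_sentences(tmp_path)
os.remove(tmp_path)
print(parsed)  # ['Aspirin relieves headache', 'No entities here']
```

Skipping empty blocks matters when a file ends with extra blank lines, which would otherwise produce an empty final "sentence".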

In [None]:
train_sentences = data_preproc("/content/drive/MyDrive/Анализ данных в медицине/пробы/train_sent.txt")
train_labels = data_preproc("/content/drive/MyDrive/Анализ данных в медицине/пробы/train_label.txt")
test_sentences = data_preproc("/content/drive/MyDrive/Анализ данных в медицине/пробы/test_sent.txt")
test_labels = data_preproc("/content/drive/MyDrive/Анализ данных в медицине/пробы/test_label.txt")


In [None]:
print(train_sentences[:5])
print(train_labels[:5])
print(test_sentences[:5])
print(test_labels[:5])

['All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )', 'The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )', 'Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )', "The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 )", "Arrest of dilation was the most common indication in both `` corrected '' subgroups ( 23.4 and 24.6 % , respectively )"]
['O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O 

In [None]:
print(len(train_sentences))
print(len(test_sentences))

2600
1057


In [None]:
print(len(train_labels))
print(len(test_labels))

2600
1057


POS tagging

Finding NOUN and PROPN tokens

In [None]:
noun_propn = []
for sent in (train_sentences + test_sentences) :
  doc = model(sent.lower())
  for token in doc :
    if token.pos_ in ["NOUN", "PROPN"] :
      noun_propn.append(token.text)
freq_dist = Counter(noun_propn)

The 25 most common tokens tagged NOUN or PROPN

In [None]:
print(freq_dist.most_common(25))

[('patients', 507), ('treatment', 304), ('%', 247), ('cancer', 211), ('therapy', 177), ('study', 174), ('disease', 151), ('cell', 142), ('lung', 118), ('results', 117), ('group', 111), ('effects', 99), ('gene', 92), ('chemotherapy', 91), ('use', 88), ('effect', 82), ('women', 81), ('analysis', 76), ('risk', 74), ('surgery', 73), ('cases', 72), ('p', 72), ('rate', 68), ('survival', 67), ('response', 66)]


Defining the features for a single word

In [None]:
def getFeaturesForOneWord(sentence, pos, tokens):
  # sentence: list of whitespace-split words; tokens: spaCy tokens of the same sentence.
  # tokens[pos] is assumed to correspond to sentence[pos].
  word = sentence[pos].lower()
  word_pos_tag = tokens[pos].pos_
  features = ["word = " + word,
              "word_POS_tag = " + word_pos_tag,
              "word[-3:] = " + word[-3:],
              "word[-2:] = " + word[-2:],
              "word_length = %s" % len(word)
  ]

  if pos > 0:
    # Context features from the previous word.
    prev_word = sentence[pos-1]
    prev_word_pos_tag = tokens[pos-1].pos_
    features.append("prev_word = " + prev_word)
    features.append("prev_word_POS_tag = " + prev_word_pos_tag)
    features.append("prev_word_length = %s" % len(prev_word))
  else:
    # Marker for the first word of the sentence.
    features.append("BEG")

  if pos == len(sentence) - 1:
    # Marker for the last word of the sentence.
    features.append("END")

  return features
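For comparison, the same features can be expressed as a dictionary, the form used in the sklearn-crfsuite tutorial. The sketch below is a hypothetical variant, not the function used in this notebook: `word_features_dict` is an illustrative name, and the POS tags are passed in as a plain, hand-assigned list so that the example runs without spaCy.

```python
def word_features_dict(words, pos_tags, i):
    """Build a feature dictionary for the word at position i."""
    word = words[i].lower()
    features = {
        "word": word,
        "pos": pos_tags[i],
        "suffix3": word[-3:],
        "suffix2": word[-2:],
        "length": len(word),
    }
    if i > 0:
        # Context features from the previous word.
        prev_word = words[i - 1]
        features["prev_word"] = prev_word
        features["prev_pos"] = pos_tags[i - 1]
        features["prev_length"] = len(prev_word)
    else:
        features["BEG"] = True  # first word of the sentence
    if i == len(words) - 1:
        features["END"] = True  # last word of the sentence
    return features


# Toy sentence with hand-assigned POS tags (hypothetical values).
words = ["Aspirin", "relieves", "headache"]
pos_tags = ["PROPN", "VERB", "NOUN"]
sentence_features = [word_features_dict(words, pos_tags, i) for i in range(len(words))]
print(sentence_features[0])
```

Named keys make it easier to inspect which features the trained model relies on, at the cost of a slightly more verbose definition.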

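Defining the features for a sentence

In [None]:
def getFeaturesForOneSentence(sentence):
  # Note: spaCy re-tokenizes the raw string, so its tokens can be offset from the
  # whitespace-split words (e.g. around hyphens); POS features may then be misaligned.
  tokens = model(sentence)
  sentence = sentence.split()
  return [getFeaturesForOneWord(sentence, pos, tokens) for pos in range(len(sentence))]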

Defining the labels for a sentence

In [None]:
def getLabelsForOneSentence(labels) :
  labels = labels.split()
  return labels

Building X and Y for the training and test sets

In [None]:
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

In [None]:
Y_train = [getLabelsForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsForOneSentence(labels) for labels in test_labels]

Building the model

In [None]:
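crf_model = sklearn_crfsuite.CRF(max_iterations=100)
# Workaround: with some scikit-learn versions, sklearn-crfsuite 0.3.6 appears to raise
# an AttributeError at the end of fit() even though training itself has completed.
try:
  crf_model.fit(X_train, Y_train)
except AttributeError:
  pass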

In [None]:
Y_pred = crf_model.predict(X_test)

f1_score

In [None]:
metrics.flat_f1_score(Y_test, Y_pred, average = "weighted")

0.9126604386056725
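`flat_f1_score` flattens the per-sentence label sequences into one list and computes F1 per label; with `average="weighted"`, each label's F1 is weighted by its number of true instances. The pure-Python sketch below reproduces that calculation on toy data. `flat_weighted_f1` and the toy labels are hypothetical; the library itself delegates the computation to scikit-learn.

```python
from collections import Counter


def flat_weighted_f1(y_true, y_pred):
    """Support-weighted F1 over flattened label sequences."""
    true_flat = [label for seq in y_true for label in seq]
    pred_flat = [label for seq in y_pred for label in seq]
    support = Counter(true_flat)
    total = len(true_flat)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(true_flat, pred_flat) if t == label and p == label)
        fp = sum(1 for t, p in zip(true_flat, pred_flat) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each label contributes in proportion to its share of the true labels.
        score += f1 * n_true / total
    return score


# Toy example (hypothetical): the only T token is missed and predicted as O.
y_true_toy = [["O", "D", "T"], ["O", "O", "D"]]
y_pred_toy = [["O", "D", "O"], ["O", "O", "D"]]
toy_score = flat_weighted_f1(y_true_toy, y_pred_toy)
print(round(toy_score, 4))  # 0.7619
```

When one label is far more frequent than the others, as O typically is in tagging data, the weighted score is dominated by that label, so per-label scores for D and T give a clearer picture of how well entities are recognized.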

Model predictions for matching diseases with treatments.

In [None]:
def D_T_identification(pos) :
  label_seq = Y_pred[pos]
  disease_idx = []
  treatment_idx = []
  for idx, label in enumerate(label_seq) :
    if label == "D" :
      disease_idx.append(idx)

    if label == "T" :
      treatment_idx.append(idx)

  return disease_idx, treatment_idx

In [None]:

diseases = []
treatments = []
records = pd.DataFrame(columns = ["Disease", "Treatment"])
for sent_id, sent in enumerate(test_sentences):
  words = sent.split()
  disease_idx, treatments_idx = D_T_identification(sent_id)
  # Keep only sentences that mention both a disease and a treatment.
  if len(disease_idx) > 0 and len(treatments_idx) > 0:
    # Keep the first disease token plus every disease token that directly follows another one.
    diseases.append(" ".join([words[idx] for k, idx in enumerate(disease_idx) if k == 0 or idx == disease_idx[k-1] + 1]))
    treatments.append(" ".join([words[idx] for idx in treatments_idx]))
records["Disease"] = diseases
records["Treatment"] = treatments
# Turn "and" into a comma separator (note: this also matches "and" inside longer words).
records["Treatment"] = records["Treatment"].apply(lambda x: x.replace("and", ",").replace(", ,", ","))
# Merge the treatments of all sentences that share the same disease string.
records = records.groupby("Disease")["Treatment"].apply(", ".join).reset_index()
records

| | Disease | Treatment |
|---|---|---|
| 0 | B16 melanoma | adenosine triphosphate buthionine sulfoximine |
| 1 | Barrett 's esophagus | Acid suppression therapy |
| 2 | Eisenmenger 's syndrome | laparoscopic cholecystectomy |
| 3 | Parkinson 's disease | Microelectrode-guided posteroventral pallidotomy |
| 4 | abdominal pain | thoracic paravertebral block ( tpvb ) |
| ... | ... | ... |
| 116 | tumors | Immunotherapy |
| 117 | unresectable stage iii nsclc | sequential chemotherapy |
| 118 | unstable angina or non-Q-wave myocardial infar... | roxithromycin |
| 119 | untreated small cell lung cancer ( sclc ) sclc | chemotherapy |
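Rows such as 119, where "sclc" is repeated at the end of the disease name, suggest that tokens from separate disease mentions in one sentence are being merged into a single string. One possible refinement, sketched below under stated assumptions, is to group consecutive tokens with the same label into separate phrases and to replace "and" only when it is a whole word. `label_spans`, `split_conjunction`, and the toy sentence are hypothetical and are not part of the notebook's pipeline.

```python
import re


def label_spans(words, labels, target):
    """Group consecutive words labelled `target` into separate phrases."""
    spans, current = [], []
    for word, label in zip(words, labels):
        if label == target:
            current.append(word)
        elif current:
            # A different label closes the phrase collected so far.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def split_conjunction(text):
    """Replace the standalone word 'and' with a comma, leaving words like 'standard' intact."""
    return re.sub(r"\band\b", ",", text)


# Toy sentence with hand-assigned labels (hypothetical, for illustration only).
words = ["Untreated", "sclc", "responds", "to", "standard", "chemotherapy",
         "and", "radiotherapy", "unlike", "relapsed", "sclc"]
labels = ["D", "D", "O", "O", "T", "T", "O", "T", "O", "D", "D"]

disease_spans = label_spans(words, labels, "D")
treatment_spans = label_spans(words, labels, "T")
print(disease_spans)    # ['Untreated sclc', 'relapsed sclc']
print(treatment_spans)  # ['standard chemotherapy', 'radiotherapy']
print(split_conjunction("standard chemotherapy and radiotherapy"))
# standard chemotherapy , radiotherapy
```

With separate spans, each disease mention can be paired with the treatments of its sentence individually, rather than being concatenated into one key before grouping.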
