<a href="https://colab.research.google.com/github/EddyGiusepe/Studying_spaCy_NER/blob/main/NER_Clinical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align='center'>Clinical NLP - NER of Clinical & BioMedical Text</h2>



Data Scientist: Dr. Eddy Giusepe Chirinos Isidro

Study links:

* [Clinical Named Entity Recognition in Python with spaCy](https://www.youtube.com/watch?v=ZnSebx45SLk)

* [Custom NER with spaCy v3 Tutorial | Free NER Data Annotation | Named Entity Recognition Tutorial](https://www.youtube.com/watch?v=p_7hJvl7P2A)

* [Train Custom NER with Spacy v3.0](https://www.youtube.com/watch?v=9mXoGxAn6pM)

* [Introduction to named entity recognition - Dr. W.J.B. Mattingly](http://ner.pythonhumanities.com/intro.html)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Import our libraries

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt


In [3]:
# Load Dataset

df = pd.read_csv('/content/drive/MyDrive/3_EDDY_ISH_TECNOLOGIA/4_ML_inside_of _MANTIS/1_Tratando_Dados_json_Mantis/Applying_spaCy_to_Mantis/mtsamples.csv')


In [4]:
df.shape

(4999, 6)

In [5]:
# Or simply write:  df.drop(["Unnamed: 0", "description", "keywords"], axis=1, inplace=True)

df = df.drop(["Unnamed: 0", "description", "keywords"], axis=1)

In [6]:
df.rename(columns={"medical_specialty":"label", "sample_name":"description", "transcription":"text"}, inplace=True)

In [7]:
# Now the DataFrame matches the one shown in the video

df.head()

,label,description,text
0,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb..."
2,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ..."
3,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit..."
4,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   label        4999 non-null   object
 1   description  4999 non-null   object
 2   text         4966 non-null   object
dtypes: object(3)
memory usage: 117.3+ KB


In [9]:
# df.isnull().sum()

df.isna().sum()

label           0
description     0
text           33
dtype: int64
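
The counts above show 33 rows with no transcription. A minimal sketch, assuming those rows should simply be dropped before running NER; the tiny `demo` frame is made up for illustration, and in the notebook the same call would be applied to `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the notebook's `df`.
demo = pd.DataFrame({
    "label": ["Bariatrics", "Cardiovascular / Pulmonary", "Allergy / Immunology"],
    "description": ["Consult - 1", "2-D Echocardiogram - 1", "Allergic Rhinitis"],
    "text": ["He has a BMI of 51.", np.nan, "She has asthma."],
})

# Keep only rows whose `text` is present, then renumber the index.
clean = demo.dropna(subset=["text"]).reset_index(drop=True)

print(clean.shape)                      # (2, 3)
print(int(clean["text"].isna().sum()))  # 0
```

Dropping is only one option; keeping the rows and skipping them at extraction time preserves the original row numbering.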

In [10]:
df['text'].loc[2]

'HISTORY OF PRESENT ILLNESS: , I have seen ABC today.  He is a very pleasant gentleman who is 42 years old, 344 pounds.  He is 5\'9".  He has a BMI of 51.  He has been overweight for ten years since the age of 33, at his highest he was 358 pounds, at his lowest 260.  He is pursuing surgical attempts of weight loss to feel good, get healthy, and begin to exercise again.  He wants to be able to exercise and play volleyball.  Physically, he is sluggish.  He gets tired quickly.  He does not go out often.  When he loses weight he always regains it and he gains back more than he lost.  His biggest weight loss is 25 pounds and it was three months before he gained it back.  He did six months of not drinking alcohol and not taking in many calories.  He has been on multiple commercial weight loss programs including Slim Fast for one month one year ago and Atkin\'s Diet for one month two years ago.,PAST MEDICAL HISTORY: , He has difficulty climbing stairs, difficulty with airline seats, tying sho

# Using [SciSpaCy](https://allenai.github.io/scispacy/)

* bc5cdr: BioCreative V Chemical-Disease Relation (CDR) corpus; the `en_ner_bc5cdr_md` model used here is trained on it

Install spaCy:
```
!pip install spacy
```

Then install scispaCy and the BC5CDR model:
```
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz
```

In [None]:
!pip install spacy
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz

In [12]:
import spacy
import scispacy

In [13]:
# Create the NLP object

sci_nlp = spacy.load('en_ner_bc5cdr_md')

In [14]:
# Components of the NLP pipeline

sci_nlp.component_names

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']

In [15]:
# Explore the entity labels

sci_nlp.get_pipe('ner').labels


('CHEMICAL', 'DISEASE')

In [16]:
# Example

x = df['text'].loc[1] 

x

'PAST MEDICAL HISTORY:, He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to public seating, and lifting objects off the floor.  He exercises three times a week at home and does cardio.  He has difficulty walking two blocks or five flights of stairs.  Difficulty with snoring.  He has muscle and joint pains including knee pain, back pain, foot and ankle pain, and swelling.  He has gastroesophageal reflux disease.,PAST SURGICAL HISTORY:, Includes reconstructive surgery on his right hand 13 years ago.  ,SOCIAL HISTORY:, He is currently single.  He has about ten drinks a year.  He had smoked significantly up until several months ago.  He now smokes less than three cigarettes a day.,FAMILY HISTORY:, Heart disease in both grandfathers, grandmother with stroke, and a grandmother with diabetes.  Denies obesity and hypertension in other family members.,CURRENT MEDICATIONS:, None.,ALLERGIES:,  He is allergic to Penicillin.,MISCELLANEOUS/EATING HISTORY:, He has b

In [18]:
docx = sci_nlp(x)


In [19]:
# Extract all entities

for ent in docx.ents:
  print(ent.text, ent.label_)


snoring DISEASE
pains DISEASE
knee pain DISEASE
back pain DISEASE
ankle pain DISEASE
swelling DISEASE
reflux disease.,PAST DISEASE
Heart disease DISEASE
stroke DISEASE
diabetes DISEASE
Denies obesity DISEASE
hypertension DISEASE
allergic DISEASE
chest pain DISEASE
heart attack, coronary artery disease DISEASE
congestive heart failure DISEASE
arrhythmia DISEASE
atrial fibrillation DISEASE
cholesterol CHEMICAL
pulmonary embolism DISEASE
CVA DISEASE
venous insufficiency DISEASE
thrombophlebitis DISEASE
asthma DISEASE
shortness of breath DISEASE
COPD DISEASE
emphysema DISEASE
sleep apnea DISEASE
diabetes DISEASE
swelling DISEASE
osteoarthritis DISEASE
rheumatoid arthritis DISEASE
hernia DISEASE
peptic ulcer disease DISEASE
gallstones DISEASE
infected gallbladder DISEASE
pancreatitis DISEASE
fatty liver DISEASE
hepatitis DISEASE
hemorrhoids DISEASE
bleeding DISEASE
polyps DISEASE
incontinence DISEASE
urinary stress incontinence DISEASE
cancer DISEASE
cellulitis DISEASE
pseudotumor DISEASE
m
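
Some spans in the output, such as `reflux disease.,PAST`, include part of the next section header: in this dataset sections are joined with a comma and no space (`disease.,PAST SURGICAL HISTORY:,`). A possible preprocessing step, sketched here with a hypothetical helper `clean_transcription` that is not part of the original notebook, replaces those separators with a single space before the text reaches the model. Whether this improves the entities across the full dataset is untested.

```python
import re

def clean_transcription(text: str) -> str:
    """Hypothetical helper: normalize section separators in a transcription."""
    # "disease.,PAST" -> "disease. PAST": a comma that follows a period or colon
    # (optionally with spaces around it) is a section separator, not prose.
    text = re.sub(r"([.:])\s*,\s*", r"\1 ", text)
    # Collapse any runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

sample = "He has gastroesophageal reflux disease.,PAST SURGICAL HISTORY:, Includes surgery."
print(clean_transcription(sample))
# He has gastroesophageal reflux disease. PAST SURGICAL HISTORY: Includes surgery.
```

In the notebook, the cleaned string would be passed to the model in place of the raw one, for example `sci_nlp(clean_transcription(x))`.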

In [20]:
# Visualization

from spacy import displacy

In [21]:
displacy.render(docx, style='ent', jupyter=True)
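
As a plain-text complement to the rendering, the `(text, label)` pairs printed earlier can be grouped by label with duplicates removed (`diabetes` and `swelling` each appear twice in that output). A minimal sketch using a few pairs copied from the output; `group_entities` is a hypothetical helper, and with the notebook's objects the pairs would come from `[(ent.text, ent.label_) for ent in docx.ents]`:

```python
from collections import defaultdict

def group_entities(pairs):
    """Hypothetical helper: map each label to its unique entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

pairs = [
    ("snoring", "DISEASE"),
    ("diabetes", "DISEASE"),
    ("cholesterol", "CHEMICAL"),
    ("diabetes", "DISEASE"),
]
print(group_entities(pairs))
# {'DISEASE': ['snoring', 'diabetes'], 'CHEMICAL': ['cholesterol']}
```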

In [36]:
# Function to extract all diseases

def Extract_diseases(text):
  """Return the text of every entity labeled DISEASE in `text`."""
  docx = sci_nlp(text)
  results = [ent.text for ent in docx.ents if ent.label_ == 'DISEASE']
  return results

In [40]:
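# Note: str(df['text']) is the printed summary of the column (first and last
# rows, each truncated), so only entities visible in that summary are found.
Extract_diseases(str(df['text']))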

['atrial enlargement', 'Kawasaki']
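
To get one list of diseases per transcription, each row can be passed to the extractor on its own, skipping the rows where `text` is missing. The sketch below uses a hypothetical helper `extract_per_document` and a stand-in keyword extractor so that it runs without the model; in the notebook, `Extract_diseases` would be passed in its place.

```python
from typing import Callable, Iterable, List

def extract_per_document(
    texts: Iterable, extract_fn: Callable[[str], List[str]]
) -> List[List[str]]:
    """Hypothetical helper: apply `extract_fn` to each text, one result list per row."""
    results = []
    for text in texts:
        # Missing transcriptions (None or NaN) are not strings; return no entities for them.
        results.append(extract_fn(text) if isinstance(text, str) else [])
    return results

def keyword_extractor(text: str) -> List[str]:
    """Stand-in for the notebook's model-based extractor (assumption for this demo)."""
    return [word for word in ("asthma", "diabetes") if word in text.lower()]

texts = ["History of asthma and diabetes.", float("nan"), "No complaints."]
print(extract_per_document(texts, keyword_extractor))
# [['asthma', 'diabetes'], [], []]
```

With the notebook's objects this would read `df['diseases'] = extract_per_document(df['text'], Extract_diseases)`. For thousands of documents, spaCy's `nlp.pipe` processes texts in batches and is usually faster than calling the model once per text.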