<a href="https://colab.research.google.com/github/ArjunRAj77/ICD10-code-extractor/blob/main/ICD10_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ICD 10 Extractor

This notebook builds a machine learning model that identifies the ICD10 code corresponding to a given symptom or disease.

Installing required packages:

In [1]:
!pip install transformers[sentencepiece] datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers[sentencepiece]
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
from datasets import list_datasets
list_datasets()

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus',
 'ag_news',
 'ai2_arc',
 'air_dialogue',
 'ajgt_twitter_ar',
 'allegro_reviews',
 'allocine',
 'alt',
 'amazon_polarity',
 'amazon_reviews_multi',
 'amazon_us_reviews',
 'ambig_qa',
 'americas_nli',
 'ami',
 'amttl',
 'anli',
 'app_reviews',
 'aqua_rat',
 'aquamuse',
 'ar_cov19',
 'ar_res_reviews',
 'ar_sarcasm',
 'arabic_billion_words',
 'arabic_pos_dialect',
 'arabic_speech_corpus',
 'arcd',
 'arsentd_lev',
 'art',
 'arxiv_dataset',
 'ascent_kb',
 'aslg_pc12',
 'asnq',
 'asset',
 'assin',
 'assin2',
 'atomic',
 'autshumato',
 'babi_qa',
 'banking77',
 'bbaw_egyptian',
 'bbc_hindi_nli',
 'bc2gm_corpus',
 'beans',
 'best2009',
 'bianet',
 'bible_para',
 'big_patent',
 'billsum',
 'bing_coronavirus_query_set',
 'biomrc',
 'biosses',
 'blbooks',
 'blbooksgenre',
 'blended_skill_talk',
 'blimp',
 'blog_authorship_corpus',
 'bn_hate_speech',
 'bnl_newspapers',
 'bookcorpus',
 'bookcorpusopen'

In [5]:
from sklearn.model_selection import train_test_split

Loading the data from the CSV file. 

It contains a list of diseases and their corresponding ICD10 codes.

In [6]:
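import pandas as pd

# Read the raw ICD10 table; besides the two useful columns it has extra unnamed ones.
icdcodedf = pd.read_csv('/content/ICD10.csv')
# Keep only the code and the disease description.
features = ["ICDCode", "Disease"]
icd10 = icdcodedf[features]
diseaselist = icd10['Disease'].tolist()
icdlist = icd10['ICDCode'].tolist()

icdcodedf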

Unnamed: 0,ICDCode,Disease,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,A00.0,Cholera due to Vibrio cholerae 01,,,
1,A00.1,Cholera due to Vibrio cholerae 01,,,
2,A00.9,Cholera,,,
3,A01.00,Typhoid fever,,,
4,A01.1,Paratyphoid fever A,,,
...,...,...,...,...,...
14810,C90.11,Plasma cell leukemia in remission,,,
14811,C90.12,Plasma cell leukemia in relapse,,,
14812,C88.8,Other malignant immunoproliferative diseases,,,
14813,C90.21,Extramedullary plasmacytoma in remission,,,


Cleaning the above dataset into a labelled dataset with string values.

In [7]:
# Cast every code and every description to str.
icdfinallist = [str(code) for code in icdlist]
diseasefinallist = [str(disease) for disease in diseaselist]

# Rebuild a two-column dataframe from the string lists.
column_names = ['ICDCode', 'Disease']
df = pd.DataFrame(list(zip(icdfinallist, diseasefinallist)), columns=column_names)

df

Unnamed: 0,ICDCode,Disease
0,A00.0,Cholera due to Vibrio cholerae 01
1,A00.1,Cholera due to Vibrio cholerae 01
2,A00.9,Cholera
3,A01.00,Typhoid fever
4,A01.1,Paratyphoid fever A
...,...,...
14810,C90.11,Plasma cell leukemia in remission
14811,C90.12,Plasma cell leukemia in relapse
14812,C88.8,Other malignant immunoproliferative diseases
14813,C90.21,Extramedullary plasmacytoma in remission
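
The cell above casts each value with `str`, which turns a missing entry (`NaN`) into the literal string `'nan'` rather than removing it. A minimal sketch of a vectorized alternative, assuming rows with a missing code or description should be dropped; the three-row frame is made-up illustration data, not taken from `ICD10.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw table, with one missing description.
raw = pd.DataFrame({
    "ICDCode": ["A00.9", "A01.00", "A01.1"],
    "Disease": ["Cholera", np.nan, "Paratyphoid fever A"],
})

# str() on a missing value yields the text 'nan', not an empty cell.
nan_as_text = str(raw.loc[1, "Disease"])

# Drop incomplete rows first, then cast both columns to str in one call.
clean = (
    raw[["ICDCode", "Disease"]]
    .dropna()
    .astype(str)
    .reset_index(drop=True)
)
print(nan_as_text)
print(clean)
```

Dropping before casting matters here: once a missing value has become the text `'nan'`, `dropna` no longer recognises it.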


Converting the above dataframe into a new CSV file.

In [8]:
df.to_csv('/content/new_ICD10.csv')
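
`to_csv` writes the row index as an extra header-less column by default, which is where the `Unnamed: 0` feature in the loaded dataset comes from. A small sketch of how `index=False` avoids it, using an in-memory buffer instead of `/content/new_ICD10.csv` and two rows copied from the table above:

```python
import io

import pandas as pd

df_demo = pd.DataFrame({
    "ICDCode": ["A00.9", "A01.00"],
    "Disease": ["Cholera", "Typhoid fever"],
})

# Default behaviour: the index is written as a first, header-less column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False keeps only the real columns.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default)  # ['Unnamed: 0', 'ICDCode', 'Disease']
print(cols_clean)    # ['ICDCode', 'Disease']
```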

*Make sure that every value in the dataset is a string; otherwise the code that follows will not work.*

Loading a custom dataset in Python:


Reference: https://huggingface.co/docs/datasets/v2.7.1/en/package_reference/loading_methods#datasets.load_dataset

In [13]:
from datasets import load_dataset
ICDData = load_dataset('csv', data_files='/content/new_ICD10.csv')




In [14]:
ICDData

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 14815
    })
})

In [15]:
ICDData['train'] = ICDData['train'].shuffle(seed=1).select(range(4000))
ICDData['train']

Dataset({
    features: ['Unnamed: 0', 'ICDCode', 'Disease'],
    num_rows: 4000
})

In [16]:
ICDData_train_validation = ICDData['train'].train_test_split(train_size=0.8)
ICDData_train_validation

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 3200
    })
    test: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 800
    })
})

In [17]:
ICDData['test'] = ICDData_train_validation['test'].shuffle(seed=1).select(range(600))
ICDData['test']

Dataset({
    features: ['Unnamed: 0', 'ICDCode', 'Disease'],
    num_rows: 600
})

In [18]:
ICDData_train_validation['test']

Dataset({
    features: ['Unnamed: 0', 'ICDCode', 'Disease'],
    num_rows: 800
})

In [19]:
ICDData_train_validation['validation'] = ICDData_train_validation.pop('test')
ICDData_train_validation

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 3200
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 800
    })
})

In [20]:
ICDData.update(ICDData_train_validation)
ICDData

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 3200
    })
    test: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 600
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'ICDCode', 'Disease'],
        num_rows: 800
    })
})
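
In the cells above, the 600 test rows are selected from the same 800 rows that become the validation split, so the two sets overlap. A minimal sketch of a disjoint three-way split, using scikit-learn's `train_test_split` twice on a made-up 4000-row frame; the 400/400 validation/test sizes are an assumption chosen for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 4000-row sample standing in for the shuffled ICD10 subset.
sample = pd.DataFrame({
    "ICDCode": [f"X{i:04d}" for i in range(4000)],
    "Disease": [f"disease {i}" for i in range(4000)],
})

# First split: 80% train, 20% held out.
train_df, holdout_df = train_test_split(sample, train_size=0.8, random_state=1)
# Second split: divide the held-out rows evenly into validation and test.
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=1)

print(len(train_df), len(val_df), len(test_df))  # 3200 400 400
```

Because the second call only sees the held-out rows, no row can land in more than one of the three sets.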

In [21]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-cased"
# checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(batch):
    return tokenizer(batch["Disease"], padding=True, truncation=True)

ICDData_encoded = ICDData.map(tokenize_function, batched=True, batch_size=None)

ICDData_encoded

TypeError: ignored
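
The traceback is collapsed, so the cause of the `TypeError` is not shown. One plausible cause, in line with the earlier note about string values, is that some `Disease` entries are no longer strings once the CSV is read back (for example `None` for an empty or `nan` cell). A minimal diagnostic sketch, where `find_non_string_rows` is a hypothetical helper and the sample list is made up:

```python
from typing import Any, List, Tuple


def find_non_string_rows(values: List[Any]) -> List[Tuple[int, Any]]:
    """Return (position, value) pairs for entries that are not str."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]


# Made-up column: one missing value and one number mixed in with text.
diseases = ["Cholera", None, "Typhoid fever", 42]
bad = find_non_string_rows(diseases)
print(bad)  # [(1, None), (3, 42)]
```

In the notebook this could be applied to `ICDData['train']['Disease']`; if it reports any rows, they could be dropped with `Dataset.filter` before calling `map`.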

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_labels = 2  # placeholder; to predict ICD10 codes this should be the number of distinct codes
model = (AutoModelForSequenceClassification
         .from_pretrained(checkpoint, num_labels=num_labels)
         .to(device))
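
A sequence classifier predicts an integer class id, so predicting ICD10 codes needs a mapping between codes and ids, with `num_labels` equal to the number of distinct codes. A minimal sketch of such a mapping, assuming each distinct `ICDCode` is treated as one class; `label2id`, `id2label`, and the sample codes are illustrative:

```python
# A few codes from the table above, with one repeated on purpose.
codes = ["A00.0", "A00.1", "A00.9", "A01.00", "A00.9"]

# Sort the distinct codes so the mapping is reproducible across runs.
unique_codes = sorted(set(codes))
label2id = {code: idx for idx, code in enumerate(unique_codes)}
id2label = {idx: code for code, idx in label2id.items()}

# Integer targets, one per row, as a classification model would expect.
labels = [label2id[code] for code in codes]
num_labels = len(label2id)

print(num_labels, labels)  # 4 [0, 1, 2, 3, 2]
```

The integer targets would then need to be stored in a `label` column of the dataset, and both dictionaries can be passed to `from_pretrained` alongside `num_labels` so that the saved model carries the code names.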

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 8
logging_steps = len(ICDData_encoded["train"]) // batch_size
model_name = f"{checkpoint}-icdCode-encoder"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  log_level="error",
                                  optim='adamw_torch'
                                  )
training_args
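
With `logging_steps` set to the training-set size divided by the batch size, the trainer logs once per epoch. Worked through with the sizes shown earlier (3200 training rows, batch size 8, 2 epochs), assuming a single device and no gradient accumulation:

```python
import math

train_rows = 3200      # size of the train split shown earlier
batch_size = 8
num_train_epochs = 2

# Optimizer steps per epoch; ceil covers a final partial batch if there is one.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Same expression the notebook uses for logging_steps.
logging_steps = train_rows // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 400 800 400
```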

In [None]:
import transformers
import re

[x for x in dir(transformers) if re.search(r'^AutoModel', x)]

In [None]:
ICDData_encoded

In [None]:
ICDData_encoded["train"]

In [None]:
ICDData_encoded["validation"]

In [None]:
from transformers import Trainer

torch.cuda.empty_cache()

trainer = Trainer(model=model, 
                  args=training_args, 
                  train_dataset=ICDData_encoded["train"],
                  eval_dataset=ICDData_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train()

In [None]:
preds = trainer.predict(ICDData_encoded['validation'])
preds

In [None]:
preds.predictions.shape

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(preds.label_ids, preds.predictions.argmax(axis=-1))

In [None]:
def get_accuracy(preds):
  # Convert logits to predicted class ids, then compare them with the true labels.
  predictions = preds.predictions.argmax(axis=-1)
  labels = preds.label_ids
  accuracy = accuracy_score(labels, predictions)
  return {'accuracy': accuracy}
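
`Trainer` only reports accuracy during evaluation if a metric function is passed through its `compute_metrics` argument. The sketch below exercises a function of that shape on made-up logits; `compute_accuracy` is an illustrative name, and `SimpleNamespace` stands in for the prediction object that `Trainer` would supply:

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score


def compute_accuracy(eval_pred):
    """Turn logits into class ids and score them against the true labels."""
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": accuracy_score(eval_pred.label_ids, predictions)}


# Made-up logits for four examples and three classes.
fake = SimpleNamespace(
    predictions=np.array([[2.0, 0.1, 0.3],
                          [0.2, 1.5, 0.1],
                          [0.3, 0.2, 0.9],
                          [1.1, 0.4, 0.2]]),
    label_ids=np.array([0, 1, 1, 0]),
)
metrics = compute_accuracy(fake)
print(metrics)
```

Passing `compute_metrics=compute_accuracy` when constructing the `Trainer` would make `trainer.evaluate()` include the value under the key `eval_accuracy`.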

In [None]:
trainer.evaluate()

In [None]:
trainer.save_model()

In [63]:
model_name

'distilbert-base-cased-icdCode-encoder'

In [None]:
from transformers import pipeline
classifier = pipeline('text-classification', model=model_name)
classifier('Hepatitis A with hepatic coma')

In [None]:
classifier('Hepatitis A without hepatic coma')
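
When the model config carries no label names, the `text-classification` pipeline reports classes as `LABEL_<id>`. A minimal sketch of turning such an output back into an ICD10 code; the `id2label` table and the sample output are made up, and `decode_label` is a hypothetical helper:

```python
from typing import Dict, List


def decode_label(result: List[dict], id2label: Dict[int, str]) -> str:
    """Map the top pipeline result such as 'LABEL_2' to its ICD10 code."""
    label = result[0]["label"]
    class_id = int(label.rsplit("_", 1)[1])
    return id2label[class_id]


# Hypothetical mapping and pipeline output, for illustration only.
id2label = {0: "A00.0", 1: "A00.1", 2: "A00.9"}
fake_output = [{"label": "LABEL_2", "score": 0.91}]
code = decode_label(fake_output, id2label)
print(code)  # A00.9
```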