<a href="https://colab.research.google.com/github/ArjunRAj77/Discharge-summary-extractor/blob/main/ICD10_identifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ICD 10 Extractor**

Training a machine learning model to accurately identify ICD-10 codes in sentences is a complex task that requires a good understanding of both machine learning and the ICD-10 coding system. 

To build a model like this, you would need a large dataset of sentences that have been labeled with the correct ICD-10 codes, as well as a machine learning algorithm that can learn to map the sentences to the correct codes. You would then need to train the model on your dataset, using techniques like feature engineering and hyperparameter tuning to optimize its performance. 

It's also important to evaluate the model's performance on a separate test dataset to ensure it is accurate and reliable.

**Steps:**

1. Collecting and preparing a dataset of labeled examples (i.e., sentences and their corresponding ICD-10 codes).
2. Choosing a machine learning algorithm and implementing it in code.
3. Training the model on the dataset using techniques like feature engineering and hyperparameter tuning.
4. Evaluating the model's performance on a separate test dataset to ensure it is accurate and reliable.
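
Step 4 depends on keeping the test examples out of training. The following is a minimal sketch of a hold-out split and an accuracy metric, not the notebook's actual pipeline. The `EXAMPLES` sentences and the helper names `train_test_split` and `accuracy` are hypothetical and exist only for illustration.

```python
import random

# Hypothetical labeled examples: (sentence, ICD-10 code) pairs.
EXAMPLES = [
    ("Patient presents with cholera.", "A00"),
    ("Cholera due to Vibrio cholerae 01, biovar cholerae.", "A000"),
    ("Cholera due to Vibrio cholerae 01, biovar eltor.", "A001"),
    ("Stool culture confirmed cholera.", "A00"),
]


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted_codes, gold_codes):
    """Fraction of predictions that exactly match the gold ICD-10 code."""
    correct = sum(p == g for p, g in zip(predicted_codes, gold_codes))
    return correct / len(gold_codes)


train_set, test_set = train_test_split(EXAMPLES)
```

Any model trained on `train_set` would then be scored by passing its predictions for `test_set` to `accuracy`.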

## **Step 1 : Collecting Data for dataset creation.**

Since we are dealing with sensitive data, we need a reliable source of information.
- One possible source of this information is the World Health Organization (WHO), which maintains a database of ICD-10 codes and their corresponding descriptions.
- Another option is to collect this information from medical records or other healthcare databases.


The dataset we have prepared for use with the ML model is:

 https://www.kaggle.com/datasets/mrhell/icd10cm-codeset-2023
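
A code set like this is typically reduced to a mapping from code to description before any patterns or training examples are built. The sketch below assumes a hypothetical two-column CSV layout (`code`, `description`); the actual file on Kaggle may use different columns or a different format, so `SAMPLE` and `load_code_descriptions` are illustrative only.

```python
import csv
import io

# Hypothetical two-column layout; the real Kaggle file may differ.
SAMPLE = """code,description
A00,Cholera
A000,"Cholera due to Vibrio cholerae 01, biovar cholerae"
A001,"Cholera due to Vibrio cholerae 01, biovar eltor"
"""


def load_code_descriptions(handle):
    """Return a dict mapping each ICD-10 code to its description."""
    reader = csv.DictReader(handle)
    return {row["code"].strip(): row["description"].strip() for row in reader}


# An in-memory file stands in for the downloaded dataset here.
code_to_description = load_code_descriptions(io.StringIO(SAMPLE))
```

With a real file, the same function could be called on `open(path, newline="")` once the column names are adjusted to match.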

## **Step 2 : Choosing a machine learning algorithm and implementing it in code.**

This is the trickiest part. Since we are dealing with NER and text classification, an appropriate NLP model should be used. Candidate architectures include:
- **RNN**
- **Transformer** 

Here we will use spaCy for named entity recognition.

***spaCy*** is a popular natural language processing (NLP) library for Python. It provides tools and libraries for performing a variety of NLP tasks, such as tokenization, part-of-speech tagging, named entity recognition, and more.

The ***Matcher*** class in spaCy allows you to create and match patterns in text. A pattern is a list of dictionaries, one per token, that defines the sequence of tokens to match and the constraints each token must satisfy. For example, a pattern can specify that a specific word or phrase should be matched, or that a token should be optional, occur at least a certain number of times, or have a specific part of speech.

Once you have added a pattern to a Matcher, you can call the matcher object on a parsed document. This returns a list of matches, where each match is a tuple consisting of the match ID (the hash of the pattern's label) and the start and end token indices of the match. Slicing the document with those indices gives the matched span.
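
As a minimal sketch of the Matcher workflow described above, the example below registers one pattern and runs it on a sentence. It uses `spacy.blank("en")` so that no trained model is required; the pattern, the label `A001`, and the sample sentence are assumptions chosen for illustration, not part of the app in this notebook.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: the Matcher only needs tokenization.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token. LOWER matches case-insensitively, and
# OP "?" makes the punctuation token optional.
pattern = [
    {"LOWER": "vibrio"},
    {"LOWER": "cholerae"},
    {"LOWER": "01"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "biovar"},
    {"LOWER": "eltor"},
]
matcher.add("A001", [pattern])

doc = nlp("Confirmed cholera due to Vibrio cholerae 01, biovar eltor")

# Each match is (match_id, start, end); the label string and the span
# are recovered from the vocabulary and the document respectively.
matches = [
    (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(matches)
```

The start and end values are token indices, not character offsets, which is why the span is obtained with `doc[start:end]`.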

In [None]:
pip install ngrok

In [None]:
!pip install pyngrok

In [None]:
pip install streamlit

In [None]:
pip install spacy

In [None]:
%%writefile streamlit_app.py
import re

import spacy
import streamlit as st
from spacy import displacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Sample ICD-10 descriptions and their codes (parallel lists)
ICDDescription = [
    'Cholera',
    'Cholera due to Vibrio cholerae 01, biovar cholerae',
    'Cholera due to Vibrio cholerae 01, biovar eltor',
]
ICDcode = ['A00', 'A000', 'A001']

# Create one Matcher pattern per description, labelled with its ICD-10 code
matcher = Matcher(nlp.vocab)
for code, sentence in zip(ICDcode, ICDDescription):
    # Select only the words, omitting the special characters
    words = re.findall(r'\b\w+\b', sentence)
    pattern = []
    for word in words:
        # Match each word case-insensitively, allowing punctuation in between
        pattern.append({'LOWER': word.lower()})
        pattern.append({'IS_PUNCT': True, 'OP': '*'})
    # Drop the trailing punctuation token so the match ends on a word
    matcher.add(code, [pattern[:-1]])

# Use the streamlit text_area widget to allow the user to enter a text
text = st.text_area('Enter a text:')

# Parse the text with spaCy and mark each matched description as an entity
doc = nlp(text)
spans = matcher(doc, as_spans=True)
# Where matches overlap (e.g. 'Cholera' inside a longer description), keep the longest
doc.ents = filter_spans(spans)

# Set the colors for the ICD-10 entity labels
colors = {code: 'linear-gradient(90deg, #aa9cfc, #fc9ce7)' for code in ICDcode}
options = {'ents': ICDcode, 'colors': colors}

# Visualize the named entities in the text
st.markdown(displacy.render(doc, style='ent', options=options), unsafe_allow_html=True)

In [None]:
!streamlit run /content/streamlit_app.py & npx localtunnel --port 8501 