# Neural Network Model for ICD-10 Code Prediction

## Introduction
This neural network model predicts ICD-10 codes from patient notes. Like other artificial neural networks, it is loosely inspired by the brain: it consists of layers of interconnected nodes (neurons), and the weights on the connections between them are adjusted during training to minimize prediction errors and improve accuracy.

## Libraries Used and Their Purpose
The model utilizes several key Python libraries to handle various tasks:

- **TensorFlow/Keras**: These libraries are used for building and training the neural network. TensorFlow provides a robust platform for deep learning, and Keras offers a high-level API for easy model creation.
- **Pandas**: This library is used for data manipulation and analysis. It helps in loading and preparing the dataset.
- **NLTK (Natural Language Toolkit)**: NLTK is used for text preprocessing, including tokenization, removing stop words, and lemmatization, which are essential steps in preparing the text data for modeling.
- **Scikit-learn**: This library is used for additional preprocessing steps like vectorization of text data using TF-IDF and encoding the ICD-10 labels. It also provides functions for evaluating the model's performance.
- **Imbalanced-learn**: This library provides resampling techniques for class imbalance, such as SMOTE (Synthetic Minority Over-sampling Technique). The code in this notebook does not import it; the training set is balanced with a simple NumPy-based random oversampling function instead.
- **Pickle**: This library is used for serializing and deserializing Python objects. It saves the fitted vectorizer and label encoder so they can be reloaded without refitting. The trained network itself is saved separately with Keras's `model.save`.
- **re (Regular Expressions)**: This module is used for text preprocessing. It cleans the text with pattern-based substitutions that remove words of one or two characters, punctuation, and digits.
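
To make the TF-IDF step mentioned in the list above more concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation: it handles single words only (the notebook asks `TfidfVectorizer` for n-grams up to length 3), and the three sample notes are made up for illustration. The weighting follows scikit-learn's default smoothed formula, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical, already-cleaned notes
notes = [
    "chest pain radiating left arm",
    "lower back pain",
    "frequent urination fatigue",
]
vectors = tfidf(notes)
print(vectors[0])
```

In the first note, "pain" receives a lower weight than "chest" because it also appears in the second note, which is the behavior that makes TF-IDF favor terms that distinguish one note from another.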

## How the Neural Network Works
Neural networks are a family of machine learning models loosely inspired by the human brain. They consist of layers of interconnected nodes (neurons), each of which computes a weighted sum of its inputs and applies an activation function, allowing the network to learn patterns and make predictions.

### Layers of the Neural Network
1. **Input Layer**: This is the first layer of the network and it accepts the processed text data (patient notes) as input. The text is converted into numerical form (TF-IDF vectors in this notebook) so that the network can work with it.
2. **Hidden Layers**: These are the intermediate layers between the input and output layers. Each hidden layer consists of multiple neurons. These neurons apply mathematical operations to the input data and learn complex patterns by adjusting their internal parameters (weights) during training. The activation function (ReLU in this case) helps the network learn non-linear relationships in the data.
3. **Output Layer**: This is the final layer of the network. It produces a probability distribution over all possible ICD-10 codes and outputs the most likely code for a given patient note. The softmax activation function is used here to convert the raw output into probabilities that sum up to 1.
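
The two activation functions named above are simple enough to write out by hand. The following sketch uses plain Python and made-up input values; Keras applies the same operations to whole batches of vectors.

```python
import math

def relu(values):
    """ReLU keeps positive values and replaces negative ones with zero."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.2, 0.0, 2.5])       # hypothetical hidden-layer pre-activations
probs = softmax([2.0, 1.0, 0.1])      # hypothetical output-layer scores
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(hidden, probs, predicted_index)
```

The index of the largest probability is what the notebook later maps back to an ICD-10 code with the label encoder.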

### Training the Neural Network
- **Forward Propagation**: The input data is passed through the network layer by layer. At each layer, the neurons perform calculations and transform the data.
- **Loss Calculation**: The network's prediction is compared to the actual ICD-10 code, and the error is measured with a loss function (categorical cross-entropy in this model).
- **Backpropagation**: The gradient of the loss with respect to every weight is computed by propagating the error backward through the network, and the optimizer (Adam in this model) uses those gradients to update the weights.
- **Iteration**: This process is repeated for multiple epochs (one epoch is one full pass over the training set) so that the network gradually learns to make accurate predictions.
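
The four steps above can be sketched on a deliberately tiny problem. This is an illustration only: it trains a single sigmoid neuron with plain gradient descent on four made-up data points, whereas the notebook trains a multi-layer network with the Adam optimizer. The names `samples`, `w`, `b`, and `lr` are chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: one feature, binary label
samples = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
w, b, lr = 0.0, 0.0, 0.5
losses = []

for epoch in range(200):                       # iteration over epochs
    grad_w = grad_b = total_loss = 0.0
    for x, y in samples:
        p = sigmoid(w * x + b)                 # forward propagation
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy loss
        grad_w += (p - y) * x                  # backpropagation: d(loss)/dw
        grad_b += (p - y)                      # backpropagation: d(loss)/db
    w -= lr * grad_w / len(samples)            # weight update
    b -= lr * grad_b / len(samples)
    losses.append(total_loss / len(samples))

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The average loss shrinks from epoch to epoch, which is the same signal Keras prints as `loss` during `model.fit`.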

## Model Summary and Results
### Model Architecture
The network is a fully connected (feed-forward) classifier built with Keras `Sequential`. It takes the TF-IDF feature vector as input, and the architecture consists of:
- **First Hidden Layer**: Dense layer with 512 neurons and ReLU activation, followed by dropout at a rate of 0.5 to reduce overfitting.
- **Second Hidden Layer**: Dense layer with 256 neurons and ReLU activation, followed by dropout at a rate of 0.5.
- **Output Layer**: Dense layer with neurons equal to the number of unique ICD-10 codes, using softmax activation to output a probability distribution.
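
A quick way to get a feel for the size of this architecture is to count its trainable parameters. The sketch below assumes the TF-IDF vocabulary reaches the 10,000-feature cap set in the notebook, and it uses a hypothetical figure of 100 classes, because the real number of unique ICD-10 codes depends on the dataset. Dropout layers add no parameters.

```python
def dense_params(n_inputs, n_units):
    """A dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 10_000  # assumes the vectorizer's max_features cap is reached
n_classes = 100      # hypothetical; the real value is the number of unique codes

layer_sizes = [n_features, 512, 256, n_classes]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Under these assumptions almost all of the weights sit in the first dense layer, so the size of the TF-IDF vocabulary dominates the model's memory footprint.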

### Model Performance
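The model has been trained several times with different datasets and parameters, so the figures below can differ slightly from the output of any single run, including the one recorded later in this notebook. Precision, recall, and F1-score are weighted averages over all classes. Key performance metrics include: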
- **Accuracy**: 72.67%
- **Precision**: 73.37%
- **Recall**: 72.67%
- **F1-Score**: 72.29%
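
Accuracy and recall are identical in the list above, and that is expected: when per-class recall is averaged with weights proportional to each class's support, the result equals overall accuracy. The sketch below checks this on a small made-up example in plain Python. It mirrors what `average='weighted'` with `zero_division=0` does in scikit-learn, but it is not scikit-learn's code.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall)."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        cls_precision = tp / predicted if predicted else 0.0  # same effect as zero_division=0
        cls_recall = tp / count
        precision += cls_precision * count / n
        recall += cls_recall * count / n
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return accuracy, precision, recall

# Hypothetical labels: class "C" is never predicted
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "A", "A"]
accuracy, precision, recall = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

Because the weighted recall adds no information beyond accuracy, the per-class rows of the classification report are the better place to look for weak classes.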

### Limitations and Suggestions for Improvement
#### Limitations
- **Class Imbalance**: Certain ICD-10 codes with fewer samples exhibited low recall and precision, highlighting the challenge of accurately predicting less frequent classes.
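- **Overfitting**: The model shows clear signs of overfitting. In the training log recorded in this notebook, validation loss rises after every epoch (from 2.75 after epoch 1 to 4.19 after epoch 4) while training accuracy climbs above 99%.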
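
The overfitting signal can be checked mechanically with a patience-based stopping rule. The sketch below is a plain-Python illustration, not part of the notebook's code: the function name `early_stop` and the patience value are chosen for this example, and the loss values are the `val_loss` figures printed in this notebook's training log. In Keras the same idea is available as the `EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`).

```python
def early_stop(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values from the four epochs logged in this notebook
val_losses = [2.7481, 3.3809, 3.7961, 4.1903]
stop_epoch, best_epoch = early_stop(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

With these values the rule stops after epoch 3 and keeps the weights from epoch 1, which suggests that more epochs alone would not help this configuration.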

#### Suggestions for Improvement
- **Increase Training Epochs with Early Stopping**: Increasing the number of epochs while implementing early stopping could help capture more details without overfitting.
- **Class Weight Adjustment**: Adjusting class weights during training to give more importance to underrepresented classes can improve recall for these classes.
- **Data Augmentation**: Augmenting data for less frequent classes can help the model learn better representations for these classes.
- **Hyperparameter Tuning**: Tuning hyperparameters such as learning rate, batch size, and dropout rates can help find the optimal configuration for the model.
- **Advanced Oversampling Techniques**: Employing techniques like SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic) could better handle class imbalance.
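
As an illustration of the class-weight suggestion, the sketch below computes "balanced" weights with the formula `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for `class_weight='balanced'`. The label list is hypothetical. A dictionary of this shape, keyed by encoded class index, can be passed to Keras through the `class_weight` argument of `model.fit`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical encoded labels: class 0 is four times as frequent as class 1
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Unlike oversampling, class weighting leaves the training set at its original size, so it would normally be used instead of duplicating minority-class rows rather than in addition to it.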

## Input/Output Description
### Input
- **Patient Notes**: The input to the model consists of text data from patient notes which are preprocessed and vectorized.

### Output
- **ICD-10 Code**: The output is a predicted ICD-10 code which represents a specific diagnosis.

## Example Chart Notes
Here are some example chart notes you can use to test the model:

1. **Example 1**: Patient complains of severe lower back pain that has persisted for several weeks. Pain radiates down the left leg. MRI indicates a herniated disc at the L4-L5 level.
2. **Example 2**: Patient presents with uncontrolled diabetes mellitus. Blood sugar levels have been consistently high despite adherence to prescribed medication and dietary restrictions. Complains of frequent urination, increased thirst, and fatigue.
3. **Example 3**: Patient reports chest pain radiating to the left arm and jaw, shortness of breath, and sweating. Symptoms began suddenly while exercising. History of hypertension and high cholesterol. ECG shows ST elevation.

Feel free to run the model with these example chart notes to see how it predicts the ICD-10 codes.
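
Before a note reaches the model it goes through the notebook's cleaning function. The sketch below reproduces only the three regular-expression steps of that function, using the standard library, to show what they do to part of Example 1. Stop-word removal and lemmatization (which need NLTK) are left out, and the final whitespace collapse is added here purely for readability.

```python
import re

def regex_clean(text):
    text = text.lower()
    text = re.sub(r'\b\w{1,2}\b', '', text)  # drop words of one or two characters
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation
    text = re.sub(r'\d+', '', text)          # drop digits
    return ' '.join(text.split())            # collapse whitespace (added for this sketch)

note = "Pain radiates down the left leg. MRI indicates a herniated disc at the L4-L5 level."
cleaned = regex_clean(note)
print(cleaned)
```

Note that "L4-L5" disappears entirely, because "l4" and "l5" count as two-character words. The model therefore never sees the spinal level mentioned in the note, which is worth keeping in mind when interpreting its predictions.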

In [2]:
!pip install transformers nltk

"""
NLTK (Natural Language Toolkit) is a Python library that provides tools for handling human language data (text). 
It supports a variety of NLP tasks such as tokenization, parsing, and tagging, and includes resources for 
building machine learning-based language processing models.
"""




In [4]:
!pip install tensorflow keras

"""
TensorFlow is an open-source library developed by Google for deep learning and machine learning tasks. It provides 
a flexible and comprehensive ecosystem of tools, libraries, and community resources that lets researchers push 
the state-of-the-art in ML, and developers easily build and deploy ML-powered applications.

Keras is a high-level API for building and training deep learning models. It runs on top of TensorFlow and allows 
for easy and fast prototyping by providing simple and user-friendly methods for creating complex neural network 
architectures.
"""




In [5]:
import re
import nltk
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import pickle

# Ensure stopwords and wordnet are downloaded
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Load the dataset
data = pd.read_csv('Mock Data - Sheet1.csv')

# Build the stop-word set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Data cleaning function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\b\w{1,2}\b', '', text)  # Remove words of one or two characters
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    words = nltk.word_tokenize(text)
    words = [word for word in words if word not in STOP_WORDS]
    words = [LEMMATIZER.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply data cleaning
data['cleaned_notes'] = data['Patient Notes'].apply(clean_text)

# Encode labels
label_encoder = LabelEncoder()
data['encoded_labels'] = label_encoder.fit_transform(data['ICD-10 Code'])

# Split the data into training and test sets
X = data['cleaned_notes']
y = data['encoded_labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transform text data using TF-IDF with N-grams
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=10000)
X_train_tfidf = vectorizer.fit_transform(X_train).toarray()  # Convert to dense array
X_test_tfidf = vectorizer.transform(X_test).toarray()

# Simple random oversampling: duplicate minority-class rows (sampled with replacement)
# until every class in the training set is as large as the largest class
def oversample(X, y, seed=42):
    rng = np.random.default_rng(seed)  # fixed seed so the resampling is reproducible
    y = np.asarray(y)  # positional indexing, independent of the pandas index
    unique_classes, class_counts = np.unique(y, return_counts=True)
    max_count = class_counts.max()
    X_resampled, y_resampled = [X], [y]
    for cls, count in zip(unique_classes, class_counts):
        n_samples = max_count - count
        if n_samples > 0:
            idx = rng.choice(np.flatnonzero(y == cls), size=n_samples)
            X_resampled.append(X[idx])
            y_resampled.append(y[idx])
    return np.concatenate(X_resampled), np.concatenate(y_resampled)

X_train_tfidf, y_train = oversample(X_train_tfidf, y_train)

# Convert labels to categorical format
y_train_categorical = to_categorical(y_train, num_classes=len(np.unique(y)))
y_test_categorical = to_categorical(y_test, num_classes=len(np.unique(y)))

# Build a simple neural network
model = Sequential()
model.add(Dense(512, input_shape=(X_train_tfidf.shape[1],), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(np.unique(y)), activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model for 4 epochs.
# Note: the test set is also passed as validation data, so it is not a fully held-out set.
history = model.fit(
    X_train_tfidf, y_train_categorical,
    epochs=4, batch_size=32,
    validation_data=(X_test_tfidf, y_test_categorical),
)

# Evaluate the model
loss, accuracy = model.evaluate(X_test_tfidf, y_test_categorical)
print(f"Neural Network Model Accuracy: {accuracy * 100:.2f}%")

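# Predict and calculate weighted precision, recall, and F1-score.
# zero_division=0 scores classes with no predicted or no true samples as 0
# instead of emitting an UndefinedMetricWarning.
y_pred_prob = model.predict(X_test_tfidf)
y_pred = np.argmax(y_pred_prob, axis=1)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)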

print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1-Score: {f1 * 100:.2f}%")

# Ensure the classification report includes all the classes present in y_test and y_pred
unique_labels = np.unique(np.concatenate((y_test, y_pred)))
target_names = label_encoder.inverse_transform(unique_labels)
print("\nClassification Report:\n", classification_report(y_test, y_pred, labels=unique_labels, target_names=target_names, zero_division=0))

# Save the model, vectorizer, and label encoder for future use
model.save('icd10_nn_model.h5')
with open('vectorizer.pkl', 'wb') as vec_file:
    pickle.dump(vectorizer, vec_file)
with open('label_encoder.pkl', 'wb') as le_file:
    pickle.dump(label_encoder, le_file)

print("Neural network model, vectorizer, and label encoder saved successfully.")

# Function to predict ICD-10 code for a new patient note
def predict_icd10_nn(note):
    cleaned_note = clean_text(note)
    encoded_note = vectorizer.transform([cleaned_note]).toarray()
    prediction_prob = model.predict(encoded_note)
    prediction = np.argmax(prediction_prob, axis=1)
    return label_encoder.inverse_transform([prediction[0]])[0]

# Input for physicians or medical coders to check chart notes
while True:
    user_input = input("Enter a patient note to predict ICD-10 code (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    prediction = predict_icd10_nn(user_input)
    print(f"Predicted ICD-10 code: {prediction}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ricky\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ricky\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ricky\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Epoch 1/4
4046/4046 ━━━━━━━━━━━━━━━━━━━━ 278s 68ms/step - accuracy: 0.6884 - loss: 1.6353 - val_accuracy: 0.7044 - val_loss: 2.7481
Epoch 2/4
4046/4046 ━━━━━━━━━━━━━━━━━━━━ 268s 66ms/step - accuracy: 0.9870 - loss: 0.0454 - val_accuracy: 0.7133 - val_loss: 3.3809
Epoch 3/4
4046/4046 ━━━━━━━━━━━━━━━━━━━━ 268s 66ms/step - accuracy: 0.9919 - loss: 0.0272 - val_accuracy: 0.7089 - val_loss: 3.7961
Epoch 4/4
4046/4046 ━━━━━━━━━━━━━━━━━━━━ 277s 68ms/step - accuracy: 0.9946 - loss: 0.0179 - val_accuracy: 0.7189 - val_loss: 4.1903
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7018 - loss: 4.7056
Neural Network Model Accuracy: 71.89%
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step




Precision: 74.08%
Recall: 71.89%
F1-Score: 71.68%

Classification Report:
                              precision    recall  f1-score   support

                     E08.00       0.00      0.00      0.00         1
                     E08.11       0.00      0.00      0.00         0
                     E08.21       0.00      0.00      0.00         1
                     E08.29       0.00      0.00      0.00         1
                   E08.3112       0.00      0.00      0.00         0
                   E08.3192       1.00      1.00      1.00         1
                   E08.3293       0.00      0.00      0.00         0
                   E08.3312       0.00      0.00      0.00         2
                   E08.3393       0.00      0.00      0.00         1
                   E08.3399       0.00      0.00      0.00         0
                   E08.3492       0.00      0.00      0.00         1
                   E08.3493       0.00      0.00      0.00         0
                   E08.3521