<!-- Developing our own neural network that classifies the type of cancer from text using NLP techniques.

Problem: hospital specialists often write long passages of text that are hard to read for some people.
Solution: classify the text by determining which type of cancer a given passage is about.
-->

In [3]:
import numpy as np # linear algebra
import pandas as pd

In [4]:
df = pd.read_csv("./MedicalReports.csv")
df = df.drop(['Unnamed: 0'], axis=1) # drop unwanted column unnamed

In [5]:
df.columns=['Label', 'Medical Reports']
df

Unnamed: 0,Label,Medical Reports
0,Thyroid_Cancer,Thyroid surgery in children in a single insti...
1,Thyroid_Cancer,""" The adopted strategy was the same as that us..."
2,Thyroid_Cancer,coronary arterybypass grafting thrombosis fib...
3,Thyroid_Cancer,Solitary plasmacytoma SP of the skull is an u...
4,Thyroid_Cancer,This study aimed to investigate serum matrix ...
...,...,...
3370,Colon_Cancer,lipases are very versatile enzymes and produc...
3371,Colon_Cancer,recently extensive evidence has clarified ...
3372,Colon_Cancer,the introduction of combined conventional ...
3373,Colon_Cancer,dysregulation of lncrnas is frequent in gl...


<h2>Data cleaning</h2>

<!-- datacleaning -->

In [6]:
import string
string.punctuation

# Remove punctuation from a single text
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# Remove punctuation from all medical report texts
df['Medical Reports'] = df['Medical Reports'].apply(remove_punctuation)
# Convert all uppercase letters to lowercase
df['Medical Reports'] = df['Medical Reports'].apply(lambda x: x.lower())
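The character-by-character filter above can also be written with `str.translate`, which is usually faster on long texts. The following is a minimal sketch, not the notebook's own code; `strip_punctuation`, `PUNCT_TABLE` and the sample sentence are illustrative names and data.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation and lowercase the text."""
    return text.translate(PUNCT_TABLE).lower()

sample = "Thyroid surgery, in children: a single-institution study!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Two side effects are worth keeping in mind: removing hyphens glues words together ("single-institution" becomes "singleinstitution"), and `string.punctuation` only covers ASCII characters, so typographic quotes or dashes stay in the text.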

In [7]:
import nltk
from nltk.corpus import stopwords as nltk_stopwords

nltk.download('stopwords') # download the stop word lists (available for several languages)

stop_words = set(nltk_stopwords.words('english')) # use the English stop words

# remove all stop words and join the remaining words back into a single string
df['Medical Reports'] = df['Medical Reports'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/minorai3/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
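To make the effect of the stop word filter concrete, here is a small sketch on one sentence. It assumes a hypothetical miniature stop word set (`STOP_WORDS`) instead of NLTK's English list, so it runs without a download; `remove_stop_words` is an illustrative helper name.

```python
# Hypothetical miniature stop word set; the notebook uses NLTK's English list
STOP_WORDS = {"the", "of", "in", "is", "a", "and", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

filtered = remove_stop_words("dysregulation of lncrnas is frequent in the tumor")
print(filtered)
```

Using a `set` keeps each membership test cheap, which matters when the filter runs over every word of several thousand reports.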


In [8]:
df['Label'].value_counts()
# check the number of reports per label

Label
Thyroid_Cancer    1405
Colon_Cancer      1098
Lung_Cancer        872
Name: count, dtype: int64

<h1>Prepare the model</h1>

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.utils import to_categorical

2023-11-29 13:40:58.277568: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [10]:
# get texts and labels
texts = df['Medical Reports'].values
labels = df['Label'].values

# encode the labels 
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
encoded_labels
# LabelEncoder numbers the classes in alphabetical order:
# label 0 = Colon_Cancer
# label 1 = Lung_Cancer
# label 2 = Thyroid_Cancer


array([2, 2, 2, ..., 0, 0, 0])
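`LabelEncoder` assigns integer codes in sorted order of the class names, which is why the thyroid rows at the top of the dataset appear as 2 in the output above. The sketch below mimics that mapping in plain Python; it is not scikit-learn's implementation, and the `toy_*` names are illustrative.

```python
toy_labels = ["Thyroid_Cancer", "Colon_Cancer", "Lung_Cancer", "Thyroid_Cancer"]

# Sorted unique class names; the position in this list is the integer code
toy_classes = sorted(set(toy_labels))
class_to_id = {name: i for i, name in enumerate(toy_classes)}
toy_encoded = [class_to_id[name] for name in toy_labels]

print(class_to_id)
print(toy_encoded)
```

On the fitted encoder, `label_encoder.classes_` lists the classes in the same order, so it can be used to check the mapping directly.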

In [11]:
X_train, X_test, y_train, y_test = train_test_split(texts, encoded_labels, test_size=0.2, random_state=42)

In [12]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train) # builds the vocabulary from the training texts: every unique token gets an integer id, and more frequent tokens get lower ids
vocab_size = len(tokenizer.word_index) + 1 # number of unique words learned by the tokenizer, plus one

# tokenizer.word_index is the word-to-id dictionary; its ids start at 1.
# The extra 1 accounts for id 0, which is never assigned to a word and is reserved for padding.
# Padding fills shorter texts up to a fixed length, which is common in NLP tasks where a neural network expects inputs of equal length.

# Because id 0 is reserved, padded positions cannot be confused with a real word, and the Embedding layer needs vocab_size rows to cover ids 0 through len(word_index).
vocab_size

207764
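The relation between the word index and `vocab_size` can be illustrated with a plain-Python sketch. It only imitates the idea behind the Keras `Tokenizer` (ids ordered by frequency, starting at 1) and is not its implementation; `build_word_index` and the toy texts are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first, ids start at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; id 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["thyroid cancer surgery", "colon cancer study", "lung cancer"]
toy_word_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_word_index) + 1  # +1 accounts for the reserved id 0

print(toy_word_index)
print(toy_vocab_size)
```

The most frequent word ("cancer") receives id 1, and six distinct words lead to a vocabulary size of seven because of the padding slot.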

In [13]:
# Convert text sequences to numerical sequences
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# Pad sequences to have the same length
max_sequence_length = 100  # Adjust the value based on your data and sequence lengths
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_sequence_length, padding='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_sequence_length, padding='post')



# Convert labels to one-hot encoding
num_classes = 3
y_train_one_hot = to_categorical(y_train, num_classes)
y_test_one_hot = to_categorical(y_test, num_classes)
y_test_one_hot

array([[0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]], dtype=float32)
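With `padding='post'`, `pad_sequences` appends zeros to sequences shorter than `maxlen`. Longer sequences are truncated, and because `truncating` is left at its default `'pre'`, tokens are dropped from the start, so only the last 100 tokens of a long report are kept. The sketch below reproduces this behaviour for a single sequence in plain Python; `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(seq, maxlen, truncating="pre", value=0):
    """Pad with `value` at the end; truncate from the front ('pre') or the back ('post')."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    return list(seq) + [value] * (maxlen - len(seq))

short_seq = [5, 8, 2]
long_seq = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_post(short_seq, 5)
kept_end = pad_post(long_seq, 5)            # default: drops the first tokens
kept_start = pad_post(long_seq, 5, "post")  # alternative: drops the last tokens

print(padded_short, kept_end, kept_start)
```

If the beginning of a report is considered more informative than its end, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.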

<p>
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) that are well-suited for tasks that involve sequential data, such as natural language processing (NLP) tasks like text classification. Here's why LSTMs are particularly advantageous for classifying tokenized vectorized medical text:

Capturing Long-Range Dependencies: LSTM networks are designed to address the vanishing gradient problem that plagues traditional RNNs. This problem arises when dealing with long sequences of data, as the gradients of the error function become increasingly small as they are backpropagated through the recurrent layers, making it difficult for the model to learn long-range dependencies between words in a sentence. LSTMs overcome this issue by using memory cells and gating mechanisms to selectively remember and update information over time, allowing them to capture long-range dependencies in the text.

Handling Contextual Meaning: Medical text often contains complex medical terminology and subtle contextual nuances that are crucial for accurate classification. Because an LSTM reads words in order and carries a state from one word to the next, it can take the surrounding context of a term into account rather than treating each word in isolation.

Processing Variable-Length Sequences: Medical reports and documents can vary significantly in length, posing a challenge for algorithms that require fixed-length inputs. LSTMs are able to handle variable-length sequences by processing the input text one token at a time, dynamically adjusting their internal state as they encounter new information. In this notebook the reports are nevertheless padded or truncated to 100 tokens so that they can be processed in batches.

Adapting to Uncertainties: Medical text often contains ambiguities and uncertainties, making it challenging to classify with absolute certainty. Combined with a softmax output layer, an LSTM classifier returns a probability for each class rather than a single hard label, so an ambiguous text shows up as a less confident prediction.

In summary, LSTMs offer several advantages for classifying tokenized vectorized medical text:

Ability to capture long-range dependencies in text

Effective understanding of contextual meaning and medical terminology

Handling of variable-length sequences

Adaptation to uncertainties in medical text

As a result, LSTMs have become a popular choice for a variety of medical NLP tasks, including disease classification, risk assessment, and patient outcome prediction.</p>
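The memory cell and gating mechanism described above can be sketched as a single LSTM time step in NumPy. This is an illustrative, untrained toy version under assumed names (`lstm_step`, `W`, `U`, `b`) and an assumed gate ordering; it is not the Keras implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate
    f = sigmoid(z[1 * units:2 * units])   # forget gate
    g = np.tanh(z[2 * units:3 * units])   # candidate cell values
    o = sigmoid(z[3 * units:4 * units])   # output gate
    c_t = f * c_prev + i * g              # keep part of the old memory, add new information
    h_t = o * np.tanh(c_t)                # expose a gated view of the memory
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 4, 3
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):  # a toy sequence of 6 token vectors
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The additive update of `c_t` is what lets information (and gradients) survive over many time steps: when the forget gate stays close to 1, the old cell state passes through almost unchanged.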

In [14]:
embedding_dim = 100  

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_sequence_length))
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))  
model.add(Dense(units=num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          20776400  
                                                                 
 lstm (LSTM)                 (None, 128)               117248    
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 3)                 387       
                                                                 
Total params: 20894035 (79.70 MB)
Trainable params: 20894035 (79.70 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
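The parameter counts in the summary can be verified by hand. The sketch below recomputes them from the layer sizes used in this notebook, using the standard formulas for embedding, LSTM and dense layers; the variable `units` is introduced here only for the calculation.

```python
vocab_size, embedding_dim, units, num_classes = 207764, 100, 128, 3

# One embedding vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((embedding_dim + units) * units + units)
# Weight matrix plus bias of the output layer
dense_params = units * num_classes + num_classes
total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
```

More than 99% of the parameters sit in the embedding layer, which is large compared with the few thousand reports available for training. Capping the vocabulary with the `num_words` argument of `Tokenizer` is one way to shrink it.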


2023-11-29 13:41:08.385775: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-29 13:41:08.467040: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the 

In [15]:
from tensorflow.keras.callbacks import ModelCheckpoint
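# Save the model with the best validation accuracy. 'val_accuracy' is only logged when fit() receives validation data (validation_split or validation_data); without it the checkpoint is skipped.
checkp = ModelCheckpoint('best.h5', monitor='val_accuracy', mode='max', save_best_only=True, verbose=2)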

In [14]:
batch_size = 32  
epochs = 5  
model.fit(X_train_padded, y_train_one_hot, batch_size=batch_size, epochs=epochs, callbacks=[checkp])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7fc8c364a9d0>

In [16]:
loss, accuracy = model.evaluate(X_test_padded, y_test_one_hot)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Test Loss: 1.0992774963378906
Test Accuracy: 0.2992592453956604
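Two simple baselines help to read these numbers. A model that always outputs a probability of 1/3 per class has a cross-entropy of ln 3, and a model that always predicts the most frequent class reaches that class's share of the data. The sketch below computes both from the label counts shown earlier; the majority share is taken over the full dataset, so the share in the test split may differ slightly.

```python
import math

class_counts = {"Thyroid_Cancer": 1405, "Colon_Cancer": 1098, "Lung_Cancer": 872}
n_classes = len(class_counts)

# Cross-entropy of a model that always outputs 1/3 for every class
uniform_loss = math.log(n_classes)
# Accuracy of always predicting the most frequent class (on the full dataset)
majority_accuracy = max(class_counts.values()) / sum(class_counts.values())

print(round(uniform_loss, 4), round(majority_accuracy, 4))
```

The reported test loss of 1.0993 is almost identical to ln 3, and the test accuracy of 0.299 lies below the majority-class share, which suggests that the model in this run has not learned to separate the classes.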


In [17]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding

# These lines define the input layer (inp) and the embedding layer (emb). The input layer has shape (250,), so it accepts a sequence of 250 token ids. The embedding layer maps each token id to a 32-dimensional vector, turning the tokens into numerical representations.
# Note: Embedding(100, 32) only covers token ids 0 to 99, and model2 is never trained, so the vectors below are random initial weights.
inp = Input(shape=(250,)) 
emb = Embedding(100, 32)(inp)
model2 = Model(inputs=inp, outputs=emb)

token_str = tokenizer.texts_to_sequences(["surgycal"])
pad_str = pad_sequences(token_str, 250, padding = 'post')

ret1 = model2.predict(pad_str.reshape(1,250))[0][0]
ret1



array([-0.02671672,  0.01269192, -0.02109702,  0.0119508 , -0.03398069,
       -0.01636165, -0.00141905,  0.01428673, -0.00179089, -0.04859804,
       -0.02846521,  0.03496825, -0.03082448,  0.00908528, -0.04808161,
        0.01286054, -0.0047849 , -0.01566142,  0.04292766,  0.04003518,
        0.01633671, -0.04942472,  0.04724241, -0.03744145,  0.02398254,
       -0.0036276 , -0.0156409 ,  0.03913525,  0.02579613, -0.03972366,
        0.01206271,  0.03991858], dtype=float32)

In [18]:
token_str = tokenizer.texts_to_sequences(["cancer"])
pad_str = pad_sequences(token_str, 250, padding = 'post')

ret2 = model2.predict(pad_str.reshape(1,250))[0][0]
print(ret2)

[ 0.0124376  -0.03279578 -0.00497026  0.02903298  0.04632712 -0.01467222
 -0.0422116  -0.04048993  0.01556461 -0.02149679  0.04977516  0.00035103
 -0.02256124  0.00210451  0.02124209 -0.04778255 -0.00391231  0.00391845
 -0.04109521 -0.02934302 -0.00417997 -0.02454194 -0.02103158 -0.0489982
 -0.03452469  0.01647905 -0.01217004  0.03209524 -0.03674219 -0.01855302
 -0.00137908  0.0389163 ]


In [19]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(np.array(ret1).reshape(1,-1), np.array(ret2).reshape(1,-1))

array([[-0.09589048]], dtype=float32)
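Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). The sketch below is a plain-Python version of that formula; `cosine` is an illustrative helper, not the scikit-learn function used above.

```python
import math

def cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1.0, 0.0], [1.0, 0.0])          # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])    # orthogonal
opposite = cosine([1.0, 2.0], [-1.0, -2.0])    # opposite direction

print(same, orthogonal, opposite)
```

The value of about -0.096 above means the two vectors are close to orthogonal. Since `model2` is never trained, its embedding weights are random, so this number does not say anything about how related the two words are.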

In [20]:
def classify_one_text(new_texts):
    # Classify only the first text in new_texts
    new_sequences = tokenizer.texts_to_sequences(new_texts)
    new_padded = pad_sequences(new_sequences, maxlen=max_sequence_length, padding='post')
    predictions = model.predict(new_padded[0].reshape(1, max_sequence_length))
    # np.argmax over axis 1 picks the most probable class for each row
    predicted_labels = label_encoder.inverse_transform(np.argmax(predictions, axis=1))
    print("Predicted Labels:", predicted_labels)

In [21]:
def classify_multiple_texts(new_texts):
    # Classify every text in new_texts in one batch
    new_sequences = tokenizer.texts_to_sequences(new_texts)
    new_padded = pad_sequences(new_sequences, maxlen=max_sequence_length, padding='post')
    predictions = model.predict(new_padded)
    # np.argmax over axis 1 picks the most probable class for each row
    predicted_labels = label_encoder.inverse_transform(np.argmax(predictions, axis=1))
    print("Predicted Labels:", predicted_labels)
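The last step of both functions, turning softmax outputs into label names, can be shown without a trained model. The sketch below uses hypothetical probabilities (`toy_predictions`) and a plain-Python argmax; `to_labels` is an illustrative helper that plays the role of `argmax` followed by `label_encoder.inverse_transform`.

```python
# Class names in sorted order, matching the integer codes 0, 1, 2
class_names = ["Colon_Cancer", "Lung_Cancer", "Thyroid_Cancer"]

# Hypothetical softmax outputs for two texts (each row sums to 1)
toy_predictions = [
    [0.10, 0.20, 0.70],
    [0.55, 0.30, 0.15],
]

def to_labels(predictions, names):
    """Pick the class with the highest probability in each row."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in predictions]

toy_predicted = to_labels(toy_predictions, class_names)
print(toy_predicted)
```

Printing the full probability row next to the label is often more informative than the label alone, because it shows how confident the model is in each prediction.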

In [27]:
from numpy import argmax

# Colon_Cancer 

# classify_multiple_texts(new_texts)
# classify_one_text(new_text)

In [28]:
new_texts = ['Cretinism is a severe form of iodine deficiency that affects the physical and mental development of children. Cretinism is characterized by stunted growth, intellectual disability, and neurological abnormalities. Iodine deficiency during pregnancy and early childhood can lead to cretinism, with the most severe cases occurring when iodine deficiency is present during fetal development.']

new_sequences = tokenizer.texts_to_sequences(new_texts)
new_padded = pad_sequences(new_sequences, maxlen=max_sequence_length, padding='post')
predictions = model.predict(new_padded[0].reshape(1,100))

predicted_labels = label_encoder.inverse_transform([argmax(pred) for pred in predictions])
print("Predicted Labels:", predicted_labels)

Predicted Labels: ['Colon_Cancer']
