<a href="https://colab.research.google.com/github/Thesis-AfaanOromooChatGPT2025/MedPromptX/blob/main/Text_Classification_using_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
falgunipatel19_biomedical_text_publication_classification_path = kagglehub.dataset_download('falgunipatel19/biomedical-text-publication-classification')

print('Data source import complete.')



<h3 style="text-align: center;">Hello! Welcome to my notebook❤️</h3>


In [None]:
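# Read the data from the directory returned by kagglehub in the first cell
import os
import pandas as pd
csv_path = os.path.join(falgunipatel19_biomedical_text_publication_classification_path, "alldata_1_for_kaggle.csv")
df = pd.read_csv(csv_path, encoding='latin1')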
df.head()

# 📎Initial Data Exploration and Cleaning

In [None]:
df.info()


* The data has no null values.
* Its shape is (7570, 3).

In [None]:
# Check for duplicated rows
df.duplicated().sum()

In [None]:
# Rename the columns
df = df.rename(columns={'0': 'labels', 'a': 'text'})

In [None]:
# df['labels'].unique()
df['labels'].value_counts()

* The classification problem involves three distinct classes:
> 1. **Thyroid Cancer**
> 2. **Colon Cancer**
> 3. **Lung Cancer**

In [None]:
texts = df['text'].values
labels = df['labels'].values

# 📎Splitting the data

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, shuffle=True, stratify=labels
)

print("Dimensions of X_train :", X_train.shape)
print("Dimensions of X_test  :", X_test.shape)
print("Dimensions of y_train :", y_train.shape)
print("Dimensions of y_test  :", y_test.shape)

> * **shuffle**: Shuffles the data before splitting.
> * **stratify**: Ensures that the class distribution in the training and test sets matches the class distribution in the original dataset.
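
To see what stratification does, here is a small sketch on made-up data. The toy labels and their 50/30/20 imbalance are assumptions for illustration only and are not taken from the real dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with a 50/30/20 class imbalance (not the real data).
toy_labels = ["colon"] * 50 + ["lung"] * 30 + ["thyroid"] * 20
toy_texts = [f"document {i}" for i in range(len(toy_labels))]

# Same arguments as in the notebook, applied to the toy data.
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Count how many samples of each class ended up in each split.
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits keep the 50/30/20 ratio of the toy labels, which is the behaviour `stratify=labels` is meant to give on the real data.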

# 📎Text Tokenization and Sequence Conversion


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_seq[0]

> * **Tokenizer**: Converts text into sequences of integers, with each integer representing a unique word.
> * **fit_on_texts**: Updates the tokenizer’s vocabulary with words from the provided texts, building the word-to-integer mapping.
> * **texts_to_sequences**: Transforms each text into a sequence of integers based on the word index created by fit_on_texts.
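
The idea behind these two methods can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace, while the real Tokenizer also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Index 0 is left unused so it can serve as the padding value later.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its integer; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical toy texts, not taken from the dataset.
train_texts = ["lung cancer study", "colon cancer study of cancer"]
word_index = build_word_index(train_texts)
sequences = to_sequences(["thyroid cancer study"], word_index)

print(word_index)
print(sequences)
```

Because the vocabulary is built from the training texts only, a word such as "thyroid" that never appeared in them is dropped when a new text is converted.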

# 📎Sequence Padding and Length Adjustment

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = max([len(seq) for seq in X_train_seq])  # Maximum length of sequences
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

X_train_pad[0]

* **pad_sequences**: Pads sequences so that they all have the same length.
> * **maxlen**: The maximum length of the sequences. Sequences longer than this length will be truncated, and shorter sequences will be padded.
> * **padding**: 'pre' or 'post'. Whether to pad sequences at the beginning or the end (default is 'pre').
> * **truncating**: 'pre' or 'post'. Whether to truncate sequences at the beginning or the end (default is 'pre').

* **Difference between padding and truncating**
> * **Padding**: Gives all sequences in the dataset the same length by adding extra values (usually zeros) to sequences that are shorter than the desired length.
> * **Truncating**: Shortens sequences that exceed the maximum length by removing values from either the start or the end of the sequence.

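*  **max([len(seq) for seq in X_train_seq])**
> * Sets max_len to the length of the longest training sequence.
> * As a result, no training sequence is truncated; shorter ones are only padded. A test sequence longer than max_len would still be truncated.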
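
The default 'pre' behaviour for both padding and truncating can be sketched in plain Python. The helper `pad_pre` is a hypothetical illustration of the idea, not the Keras function, and the toy sequences are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # 'pre' truncating: drop items from the start
        # 'pre' padding: add the padding value at the start
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Hypothetical toy sequences of different lengths.
toy_sequences = [[5, 8, 2, 9], [7, 3], [4]]
padded = pad_pre(toy_sequences, maxlen=3)
print(padded)
```

With `maxlen=3` the first sequence loses its first item, and the other two receive zeros on the left.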

# 📎One-Hot Encoding

In [None]:
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_ = label_encoder.fit_transform(y_train)
y_test_ = label_encoder.transform(y_test)



y_train_cat = to_categorical(y_train_, num_classes=3)
y_test_cat = to_categorical(y_test_, num_classes=3)

y_train_cat


> * **LabelEncoder**: Converts categorical labels (strings) into integer labels
> * **to_categorical**: Converts a class vector (integers) to binary class matrix (one-hot encoding), which is useful for categorical classification problems.
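
The two steps can be imitated with NumPy alone, which may help to see what the encoded arrays look like. The label strings below are hypothetical stand-ins; the exact spelling of the labels in the dataset may differ.

```python
import numpy as np

# Hypothetical string labels (the real label spelling may differ).
toy_labels = np.array(["Lung Cancer", "Colon Cancer", "Thyroid Cancer", "Colon Cancer"])

# Step 1 (like LabelEncoder): sorted unique classes and an integer id per sample.
classes, label_ids = np.unique(toy_labels, return_inverse=True)

# Step 2 (like to_categorical): pick row `id` of the identity matrix per sample.
one_hot = np.eye(len(classes))[label_ids]

print(classes)
print(label_ids)
print(one_hot)
```

Classes are numbered in sorted order, so "Colon Cancer" becomes 0, "Lung Cancer" becomes 1, and "Thyroid Cancer" becomes 2 in this toy example.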

# 📎RNN Architecture

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128, input_length=max_len))
model.add(SimpleRNN(128, return_sequences=False))
model.add(Dense(3, activation='softmax'))

> * **Sequential**: A linear stack of layers. You can add layers to the model in a sequential manner.
> * **Embedding**: Turns positive integers (indexes) into dense vectors of fixed size, often used as the first layer in text-based neural networks to convert words into vectors.
> * **SimpleRNN**: A basic RNN layer that processes input sequences and has an internal state that captures temporal dependencies.
> * **Dense**: A fully connected layer where each neuron is connected to every neuron in the previous layer.

* **Sequential**:
> * Initializes a new, empty model. Layers will be added sequentially.

* **Embedding Layer**: This layer converts the integer sequences of words (generated by the Tokenizer) into dense vectors of fixed size.
> * **input_dim**: The size of the vocabulary (number of unique words in the training texts + 1).
> >   * The extra 1 accounts for index 0, which is reserved for padding.
> * **output_dim**: The dimension of the dense embedding vectors. Each word is represented as a 128-dimensional vector.
> * **input_length**: The length of input sequences. Each sequence has been padded to max_len.

* **SimpleRNN Layer**: This layer processes the sequences output by the Embedding layer.
> * **128**: The number of units, i.e. the size of the hidden state. The layer reads the sequence one word at a time and updates the hidden state at each step.
> * **return_sequences=False**: The RNN outputs only the final hidden state after processing the entire sequence. If True, it would return the hidden state at each timestep, which is useful for stacking RNN layers.

* **Dense Layer**: This layer classifies the final output from the RNN into one of the three cancer types.
> * **3**: The number of output units, corresponding to the number of classes (Thyroid Cancer, Colon Cancer, Lung Cancer).
> * **activation='softmax'**: The softmax activation function converts the output of the Dense layer into probabilities, summing to 1 across the 3 classes.
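
The data flow through the recurrent and output layers can be sketched with NumPy. This is an illustration with tiny, made-up sizes and random weights, not the trained model: it assumes the tanh recurrence that SimpleRNN uses by default, and it leaves out the Dense bias for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny illustrative sizes; the notebook uses 128 embedding dims and 128 units.
batch, timesteps, embed_dim, units, n_classes = 2, 5, 4, 3, 3

x = rng.normal(size=(batch, timesteps, embed_dim))  # stand-in for the Embedding output
W = rng.normal(size=(embed_dim, units)) * 0.1       # input-to-hidden weights
U = rng.normal(size=(units, units)) * 0.1           # hidden-to-hidden weights
b = np.zeros(units)

# Recurrence: one hidden-state update per timestep (word).
h = np.zeros((batch, units))
for t in range(timesteps):
    h = np.tanh(x[:, t, :] @ W + h @ U + b)

# return_sequences=False corresponds to keeping only this final h.
W_out = rng.normal(size=(units, n_classes)) * 0.1
logits = h @ W_out

# Softmax: turn the 3 scores per sample into probabilities that sum to 1.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(h.shape, probs.shape)
print(probs.sum(axis=1))
```

Each sample ends up with one hidden vector of length `units` and one row of three class probabilities.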

# 📎Compile the Model

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

> * **optimizer='adam'**: The Adam optimizer adapts the step size for each parameter using running averages of past gradients, which often leads to faster convergence than plain gradient descent.
> * **loss='categorical_crossentropy'**: The categorical crossentropy loss is the standard choice for multi-class classification with one-hot labels. It compares the predicted probabilities with the true class labels.
> * **metrics=['accuracy']**: Accuracy is tracked during training as an intuitive measure of how well the model classifies the data.
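
As a small worked example, the loss and the accuracy metric can be computed by hand with NumPy. The predictions below are made-up numbers, and the helper is a sketch of the formula, not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: minus the log of the probability given to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=1)

# Two hypothetical samples with one-hot labels and predicted probabilities.
y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[0.2, 0.7, 0.1],
                   [0.1, 0.6, 0.3]])

losses = categorical_crossentropy(y_true, y_pred)

# Accuracy: share of samples whose most probable class is the true class.
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(losses)
print(accuracy)
```

The first sample is classified correctly and gets a small loss; the second puts only 0.1 on the true class, so its loss is much larger and it counts as an error.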

# 📎Train the Model

In [None]:
history = model.fit(X_train_pad, y_train_cat, epochs=5, batch_size=32, validation_split=0.2)
history

> * **X_train_pad** and **y_train_cat**: These are the input data and corresponding labels used for training the model.
> * **epochs=5**: The model makes 5 full passes over the training data.
> * **batch_size=32**: The data is processed in batches of 32 samples, and the weights are updated once per batch.
> * **validation_split=0.2**: The last 20% of the training data (taken before shuffling) is held out for validation, so you can monitor performance on data the model is not trained on.
> * **history**: The object returned by fit. Its history attribute stores the loss and metric values per epoch, which can be analyzed to see how training progressed.
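
To get a feel for these numbers, the arithmetic can be sketched as follows. The value of `n_train` is an assumption (80% of the 7570 rows), and rounding the split point down is my understanding of how Keras handles it, so treat the exact counts as approximate.

```python
import math

n_train = 6056          # assumed: rows left for training after the 80/20 split
validation_split = 0.2
batch_size = 32
epochs = 5

# Samples actually used for weight updates vs. held out for validation.
n_fit = math.floor(n_train * (1.0 - validation_split))
n_val = n_train - n_fit

# One weight update per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(n_fit / batch_size)
total_updates = steps_per_epoch * epochs

print(n_fit, n_val, steps_per_epoch, total_updates)
```

Under these assumptions the model sees a few thousand samples per epoch and performs a few hundred weight updates in total, which is a fairly short training run.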

# 📎Evaluate the Model

In [None]:
loss, accuracy = model.evaluate(X_test_pad, y_test_cat)

print(f'Test loss: {loss}')
print(f'Test accuracy: {accuracy}')

In [None]:
import numpy as np
predictions = model.predict(X_test_pad)
predicted_labels = np.argmax(predictions, axis=1)
predicted_labels

In [None]:
res_df = pd.DataFrame({
    'Actual Labels': y_test_,
    'Predicted Labels': predicted_labels
})

res_df[:30]


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test_, predicted_labels, target_names=label_encoder.classes_))


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

conf_matrix = confusion_matrix(y_test_, predicted_labels)


plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


* **Wish u luck** 💕
* **Esraa Meslam**