<a href="https://colab.research.google.com/github/JeanMusenga/PhD-Thesis_2024_Musenga/blob/main/TextCNN_with_vocab_size.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TextCNN model
https://chatgpt.com/share/d5dd93d5-d7a7-4488-9bfb-8824d7cffe39

The `create_text_cnn_model` function defined in this notebook builds a TextCNN model for binary text classification. It runs three parallel convolutional branches with different kernel sizes over the embedded text, pools each branch, and concatenates the results. Its components are:

- Input Layer: Specifies the input shape, which is the padded sequence length (`max_length`).
- Embedding Layer: Maps each input token index to a dense vector of fixed size (`embedding_dim`).
- Convolutional and Pooling Layers: Three parallel `Conv1D` branches with kernel sizes 3, 4, and 5, each followed by max pooling (`pool_size=2`). Each branch captures n-gram features of a different width.
- Concatenate Layer: Joins the pooled outputs of the three branches along the sequence axis (`axis=1`).
- Flatten Layer: Flattens the concatenated outputs into a single dimension.
- Dense Layer: A fully connected layer with 128 units and ReLU activation.
- Dropout Layer: Randomly drops units during training (rate 0.5) to reduce overfitting.
- Output Layer: A single neuron with sigmoid activation for binary classification.
- Compile Model: Configures the model for training with binary cross-entropy loss, the Adam optimizer, and accuracy as a metric.
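
To make the branch-and-concatenate structure concrete, the sketch below works out how many time steps each branch produces and how large the flattened vector becomes. It assumes the settings used in this notebook's model (stride 1, `'valid'` padding, `pool_size=2`, 128 filters); the helper names and the sequence length of 100 are hypothetical and chosen only for illustration.

```python
def conv_pool_length(seq_len, kernel_size, pool_size=2):
    """Time steps left after a 'valid' Conv1D (stride 1) followed by MaxPooling1D."""
    conv_len = seq_len - kernel_size + 1   # 'valid' convolution shortens the sequence
    return conv_len // pool_size           # pooling stride equals pool_size; remainder is dropped


def flattened_size(seq_len, kernel_sizes=(3, 4, 5), filters=128):
    """Size of the vector fed to the Dense layer after concatenating on axis=1."""
    total_steps = sum(conv_pool_length(seq_len, k) for k in kernel_sizes)
    return total_steps * filters


example_len = 100  # hypothetical padded sequence length
branch_lengths = [conv_pool_length(example_len, k) for k in (3, 4, 5)]
print(branch_lengths)               # time steps per branch
print(flattened_size(example_len))  # inputs to the first Dense layer
```

Because the branches are concatenated along the sequence axis and then flattened, the size of the first dense layer grows with `max_length`, so very long padded sequences make that layer large.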

https://chatgpt.com/share/6b20ab3c-04a2-4b5b-b39b-6531835e3571

In [None]:
pip install tensorflow

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, concatenate, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

In [5]:
# Load the dataset (expects posts.xlsx in the working directory)
file_path = 'posts.xlsx'
data = pd.read_excel(file_path)

# Preprocess the data

In [6]:
# Preprocess text data
data['Question_body'] = data['Question_body'].str.replace('\n', ' ').str.replace('<.*?>', '', regex=True)

In [9]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data['Question_body'], data['Label'], test_size=0.3, random_state=42)


In [10]:
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

In [11]:
# Pad the sequences
max_length = max(len(seq) for seq in X_train_seq)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length, padding='post')
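
Since `max_length` is taken from the training sequences only, a test sequence can be longer than it. With `padding='post'` and the default `truncating='pre'`, `pad_sequences` keeps the last `max_length` tokens of such a sequence. The pure-Python sketch below mimics that behaviour on toy data; `pad_post` and the example sequences are hypothetical and are not part of Keras or of the dataset.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with its default truncating='pre'."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:]                      # 'pre' truncation drops tokens from the front
    return list(seq) + [value] * (maxlen - len(seq))  # 'post' padding appends zeros


# Toy token-id sequences (hypothetical)
train_seqs = [[5, 2, 9], [7, 1]]
toy_max_length = max(len(s) for s in train_seqs)

padded_train = [pad_post(s, toy_max_length) for s in train_seqs]
padded_test = pad_post([4, 8, 3, 6, 2], toy_max_length)  # longer than any training sequence

print(padded_train)
print(padded_test)
```

If the start of a post matters more than its end, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.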

In [14]:
# TextCNN model
def create_text_cnn_model(vocab_size, embedding_dim, max_length):
    inputs = Input(shape=(max_length,))
    # input_length is deprecated in Keras 3; the sequence length already comes from Input
    embedding = Embedding(vocab_size, embedding_dim)(inputs)

    conv1 = Conv1D(128, 3, activation='relu')(embedding)
    pool1 = MaxPooling1D(pool_size=2)(conv1)

    conv2 = Conv1D(128, 4, activation='relu')(embedding)
    pool2 = MaxPooling1D(pool_size=2)(conv2)

    conv3 = Conv1D(128, 5, activation='relu')(embedding)
    pool3 = MaxPooling1D(pool_size=2)(conv3)

    concatenated = concatenate([pool1, pool2, pool3], axis=1)
    flatten = Flatten()(concatenated)
    dense1 = Dense(128, activation='relu')(flatten)
    dropout = Dropout(0.5)(dense1)
    outputs = Dense(1, activation='sigmoid')(dropout)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

In [12]:
# Parameters
# +1 because Tokenizer indices start at 1 and index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 100

# Create the model

In [15]:
# Create the model
model = create_text_cnn_model(vocab_size, embedding_dim, max_length)



# Display the model summary

In [None]:
# Display the model summary
model.summary()

# Train the model

In [16]:
# Train the model
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train_pad, y_train, epochs=10, batch_size=32, validation_split=0.2, callbacks=[early_stopping])
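
The effect of `patience=3` with `restore_best_weights=True` can be illustrated with a simplified simulation of the patience rule. This is a sketch of the stopping logic only, not the Keras implementation; `simulate_early_stopping` is a hypothetical helper, and the `val_loss` values are the ones logged for this run.

```python
# val_loss per epoch, copied from this run's training log
val_losses = [0.3414, 0.2720, 0.2335, 0.2559, 0.3654, 0.4506]


def simulate_early_stopping(losses, patience):
    """Return (last_epoch_run, best_epoch) under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement resets the counter
        else:
            wait += 1                                     # another epoch without improvement
            if wait >= patience:
                return epoch, best_epoch                  # stop training here
    return len(losses), best_epoch


stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stopped_epoch, best_epoch)
```

Under this rule the run stops after epoch 6, and the weights restored are those from epoch 3, where `val_loss` was lowest (0.2335). The later epochs keep improving training accuracy while validation loss rises, which is the overfitting that early stopping is meant to cut off.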


Epoch 1/10
262/262 ━━━━━━━━━━━━━━━━━━━━ 40s 124ms/step - accuracy: 0.5194 - loss: 1.6575 - val_accuracy: 0.8957 - val_loss: 0.3414
Epoch 2/10
262/262 ━━━━━━━━━━━━━━━━━━━━ 29s 99ms/step - accuracy: 0.8774 - loss: 0.3569 - val_accuracy: 0.9110 - val_loss: 0.2720
Epoch 3/10
262/262 ━━━━━━━━━━━━━━━━━━━━ 26s 100ms/step - accuracy: 0.9308 - loss: 0.2298 - val_accuracy: 0.9173 - val_loss: 0.2335
Epoch 4/10
262/262 ━━━━━━━━━━━━━━━━━━━━ 40s 96ms/step - accuracy: 0.9649 - loss: 0.1547 - val_accuracy: 0.9125 - val_loss: 0.2559
Epoch 5/10
262/262 ━━━━━━━━━━━━━━━━━━━━ 42s 99ms/step - accuracy: 0.9783 - loss: 0.0887 - val_accuracy: 0.9048 - val_loss: 0.3654
Epoch 6/10
262/262 ━━━━━━━━━━━━━━━━━━━━ 41s 97ms/step - accuracy: 0.9868 - loss: 0.0466 - val_accuracy: 0.9082 - val_loss: 0.4506


# Evaluate the model on the test set

In [17]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
print(f'Test Accuracy: {test_accuracy:.4f}')

140/140 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.9141 - loss: 0.2664
Test Accuracy: 0.9152


# Predict on the test set

In [18]:
# Predict on the test set and threshold the sigmoid outputs at 0.5
y_pred_probs = model.predict(X_test_pad)
y_pred = (y_pred_probs > 0.5).astype(int).ravel()  # flatten (n, 1) to (n,) for the sklearn metrics

140/140 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step


In [19]:
# Compute and display metrics
precision_class, recall_class, f1_class, support_class = precision_recall_fscore_support(y_test, y_pred, average=None, labels=[0, 1])
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Calculate overall accuracy (one value for the whole test set; it is printed on both class lines)
accuracy = (conf_matrix[0, 0] + conf_matrix[1, 1]) / conf_matrix.sum()

print(f'Class 0 - Precision: {precision_class[0]}, Recall: {recall_class[0]}, Accuracy: {accuracy}, F1-score: {f1_class[0]}, Support: {support_class[0]}')
print(f'Class 1 - Precision: {precision_class[1]}, Recall: {recall_class[1]}, Accuracy: {accuracy}, F1-score: {f1_class[1]}, Support: {support_class[1]}')

Class 0 - Precision: 0.9622363903874448, Recall: 0.866225165562914, Accuracy: 0.9151785714285714, F1-score: 0.9117100371747211, Support: 2265
Class 1 - Precision: 0.8758705448586644, Recall: 0.9652370203160271, Accuracy: 0.9151785714285714, F1-score: 0.918384879725086, Support: 2215
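
As a consistency check, the per-class figures above can be reproduced from a 2x2 confusion matrix. The notebook does not print `conf_matrix`, so the counts in the sketch below are reconstructed from the reported recall and support (recall × support gives the correctly classified count per class); treat them as an inference, not as logged output.

```python
# Reported values from the output above
support = {0: 2265, 1: 2215}
recall_reported = {0: 0.866225165562914, 1: 0.9652370203160271}

# Correct predictions per class, reconstructed as recall * support (an inference)
correct0 = round(support[0] * recall_reported[0])
correct1 = round(support[1] * recall_reported[1])

# Rows are true labels, columns are predicted labels (same layout as sklearn)
conf = [[correct0, support[0] - correct0],
        [support[1] - correct1, correct1]]

precision0 = conf[0][0] / (conf[0][0] + conf[1][0])  # correct 0s / all predicted 0
precision1 = conf[1][1] / (conf[1][1] + conf[0][1])  # correct 1s / all predicted 1
f1_0 = 2 * precision0 * recall_reported[0] / (precision0 + recall_reported[0])
f1_1 = 2 * precision1 * recall_reported[1] / (precision1 + recall_reported[1])
accuracy = (conf[0][0] + conf[1][1]) / (support[0] + support[1])

print(conf)
print(precision0, precision1, f1_0, f1_1, accuracy)
```

Read this way, the model mislabels more class 0 posts as class 1 than the reverse, which is why class 0 has the higher precision and class 1 the higher recall.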
