In [6]:
from datasets import load_dataset


dataset = load_dataset("liar")   

'''
User: Mels Habold
Data downloaded from https://huggingface.co/datasets/liar
Based on research from https://arxiv.org/abs/1705.00648
And an article posted in https://www.analyticssteps.com/blogs/detection-fake-and-false-news-text-analysis-approaches-and-cnn-deep-learning-model 
''' 

Downloading and preparing dataset liar/default to C:/Users/Mels/.cache/huggingface/datasets/liar/default/1.0.0/479463e757b7991eed50ffa7504d7788d6218631a484442e2098dabbf3b44514...


Downloading data:   0%|          | 0.00/1.01M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10269 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1283 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1284 [00:00<?, ? examples/s]

Dataset liar downloaded and prepared to C:/Users/Mels/.cache/huggingface/datasets/liar/default/1.0.0/479463e757b7991eed50ffa7504d7788d6218631a484442e2098dabbf3b44514. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

## The LIAR Dataset

We consider six fine-grained labels for the truthfulness ratings: 
pants-fire, false, barelytrue, half-true, mostly-true, and true.

It consists of three sets: The training set is used to fit the model's parameters, while the validation set is used to tune the model's hyperparameters (e.g., learning rate, number of layers, etc.) and to monitor the model's performance during training. The test set is used to evaluate the final performance of the model after it has been trained and fine-tuned using the training and validation sets.

The next code allows us to split the data into 3 types:
False, in between, True

In [119]:
if False:
    print("")
else:
    train_label = dataset["train"]['label']
    test_label = dataset["test"]['label']
    validation_label = dataset["validation"]['label']

## Model building

The first part of the code imports the necessary libraries and defines the hyperparameters. Then, it loads the training data from a dataset object and tokenizes the text using the Tokenizer class from Keras. The Tokenizer class is used to convert text to sequences of integers. The num_words parameter specifies the maximum number of words to keep, based on word frequency. The next step is to pad the sequences so that all input data have the same length using the pad_sequences function from Keras.

After the preprocessing steps, the code loads a pre-trained Word2Vec model using the Gensim library. Then, it creates an embedding matrix using the Word2Vec model to map each word in the input sequences to a corresponding vector in the embedding space. The embedding matrix is used to initialize the embedding layer in the CNN.

In [162]:
#import the necessary libraries
import numpy as np
import tensorflow as tf
import warnings
 
warnings.filterwarnings(action = 'ignore')

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences 
from keras.utils import to_categorical
import gensim
from gensim.models import Word2Vec

#define the hyperparameters
max_words = 100                 # max number of words in a statement
embedding_dim = 50              # dimension of the word vector
output_dim = tf.unique(train_label)[0].shape[0]  # number of output labels

#tokenize the statements
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(dataset["train"]['statement'])
sequences_statement = tokenizer.texts_to_sequences(dataset["train"]['statement'])

#tokenize the statements
tokenizer.fit_on_texts(dataset["train"]['subject'])
sequences_subject = tokenizer.texts_to_sequences(dataset["train"]['subject'])

#pad the sequences
x_data_subject = pad_sequences(sequences_statement, maxlen=max_words)
x_data_statement = pad_sequences(sequences_subject, maxlen=max_words)
y_data = to_categorical(train_label, num_classes=output_dim)
sequences_combined = sequences_statement + sequences_subject

# Create the embedding matrix
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words:
        try:
            embedding_vector = word2vec_model.wv[word]
            embedding_matrix[i] = embedding_vector
        except KeyError:
            pass

# Create the embedding layer
embedding_layer = Embedding(input_dim=max_words, output_dim=embedding_dim,
                            weights=[embedding_matrix], input_length=max_words)

## CNN Architecture

1. Embedding layer: This layer maps the integer-encoded vocabulary to dense vectors of fixed size, in this case, an embedding vector of dimension embedding_dim for each word. The input_dim parameter specifies the size of the vocabulary, which is set to max_words in this code. The output_dim parameter specifies the size of the embedding vector.

2. Bidirectional layer that wraps around the LSTM layer. The Bidirectional layer takes the LSTM layer as input and creates two copies of it, one for processing the text forward and the other for processing it backward. The outputs from these two layers are then concatenated and passed to the next layer.

3. Convolutional layer: This layer applies num_filters filters of size kernel_size to the input sequence. The filters slide over the input sequence, producing a feature map of convolved features. The activation parameter is set to ReLU, which means that the output of the layer will be the rectified linear activation function of the convolved features.

4. Max pooling layer: This layer downsamples the convolved features by taking the maximum value of each non-overlapping pool_size-sized segment of the feature map.

5. Flatten layer: This layer flattens the output of the previous layer to a 1D array.

6. Dense layer: This layer is a fully connected layer with hidden_dim units. The activation function used is ReLU, which applies the rectified linear activation function to the output.

7. Dropout layers randomly drop out some of the neurons during training, which can help prevent overfitting. This can be added using the Dropout layer in Keras.

8. Output layer: This layer is a fully connected layer with output_dim units, which corresponds to the number of output classes. The activation function used is softmax, which produces a probability distribution over the classes.

In summary, the CNN takes as input a sequence of integers representing the words in a text, and passes it through an embedding layer to obtain dense vectors. Then, the convolutional layer applies a set of filters to the sequence, producing a feature map that is downsampled by the max pooling layer. The flatten layer converts the output of the max pooling layer into a 1D array, which is passed through a dense layer and finally an output layer to produce a probability distribution over the classes.

In [164]:
#import the necessary libraries

from tensorflow.keras.layers import *
from keras.models import Model

#define the hyperparameters

num_filters = 128            # number of convolutional filters
kernel_size = 3              # size of convolutional kernel
pool_size = 2                # size of pooling window
hidden_dim = 50              # dimension of fully connected layers

# Define the two inputs
input_subject = Input(shape=(max_words,), dtype='int32')
input_statement = Input(shape=(max_words,), dtype='int32')

# Define the embedding layer
embedding_layer = Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len)

# Apply the embedding layer to both inputs
embedded_subject = embedding_layer(input_subject)
embedded_statement = embedding_layer(input_statement)

# Merge the two embedded inputs using the Concatenate layer
merged = Concatenate(axis=-1)([embedded_subject, embedded_statement])

# Define the model
conv_layer = Conv1D(filters=num_filters, kernel_size=kernel_size, activation='softmax')(merged)
pool_layer = MaxPooling1D(pool_size=pool_size)(conv_layer)
flatten_layer = Flatten()(pool_layer)
hidden_layer = Dense(hidden_dim, activation='relu')(flatten_layer)
output = Dense(num_classes, activation='softmax')(hidden_layer)

model = Model(inputs=[input_subject, input_statement], outputs=output)
model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_43 (InputLayer)          [(None, 100)]        0           []                               
                                                                                                  
 input_44 (InputLayer)          [(None, 100)]        0           []                               
                                                                                                  
 embedding_24 (Embedding)       (None, 100, 50)      5000        ['input_43[0][0]',               
                                                                  'input_44[0][0]']               
                                                                                                  
 concatenate_18 (Concatenate)   (None, 100, 100)     0           ['embedding_24[0][0]',     

## Training

The second part of the code trains the CNN using the fit function in Keras. The batch_size parameter specifies the number of samples processed in one batch, and the epochs parameter specifies the number of training epochs. The compile function is used to configure the learning process, including the optimizer, loss function, and evaluation metric. The adam optimizer is used, and the categorical_crossentropy loss function is used since this is a multi-class classification problem. The evaluation metric is accuracy.

After training, the model is evaluated using the evaluate function in Keras, and the test loss and accuracy are printed.

In [165]:
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3)

#compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

#train the model
#model.fit(x_data, y_data, batch_size=32, epochs=10)
model.fit(x_data, y_data, validation_split=0.1, batch_size=32, epochs=10, callbacks=[early_stop])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10


<keras.callbacks.History at 0x29d384e84f0>

## Evaluating Precise

In [167]:
# create the test data
tokenizer.fit_on_texts(dataset["test"]['statement'])
sequences_statement = tokenizer.texts_to_sequences(dataset["test"]['statement'])

tokenizer.fit_on_texts(dataset["test"]['subject'])
sequences_subject = tokenizer.texts_to_sequences(dataset["test"]['subject'])

#pad the sequences
x_test = [pad_sequences(sequences_statement, maxlen=max_words), pad_sequences(sequences_subject, maxlen=max_words)]
y_test = to_categorical(test_label, num_classes=output_dim)

#evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, batch_size=32)

#print the results
print('Test loss:', loss)
print('Test accuracy:', accuracy)

Test loss: 1.78783118724823
Test accuracy: 0.19485580921173096


## Evaluating with Neighbours

We will now give the model a margin of error by being okay with it predicting neighbouring labels

In [168]:
# Make predictions on the test set
y_pred_probs = model.predict(x_test)
y_pred = np.argmax(y_pred_probs, axis=1)

# Define the threshold for neighboring labels
threshold = 0.4

# Apply the threshold to the predicted probabilities
y_pred_adj = np.where(y_pred_probs > threshold, y_pred_probs, 0)

# Get the indices of the neighboring labels
neighboring_labels = [set([i-1, i, i+1]) for i in range(output_dim)]
neighboring_indices = [set([j for j in range(output_dim) if j in labels]) for labels in neighboring_labels]

# Replace the predicted labels with neighboring labels if necessary
for i in range(len(y_pred)):
    if y_pred[i] not in neighboring_indices[y_test[i]]:
        neighboring_preds = neighboring_indices[y_test[i]].intersection(np.nonzero(y_pred_adj[i])[0])
        if neighboring_preds:
            y_pred[i] = max(neighboring_preds)

# Calculate the accuracy with the modified predictions
accuracy_adj = np.mean(y_pred == y_test)

print('Test loss:', loss)
print('Test accuracy:', accuracy)
print('Adjusted accuracy:', accuracy_adj)




TypeError: only integer scalar arrays can be converted to a scalar index