Project for DISCERN 2024
===

This project uses the [GPT-2 model from Hugging Face](https://huggingface.co/openai-community/gpt2) together with training data from [Kaggle](https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets) to classify news and articles as fake or true. The articles our team collected for checking [are available here](https://unirau-my.sharepoint.com/:x:/g/personal/dovhan_o_nikita22_stud_rau_ro/EVZaoVJ1OIFFmkT7ognXzbcBiR8JDDXK-ID0DWAdiFnMvg?e=DTESAz).

In [2]:
import tensorflow as tf
#import tensorflow_text as text
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [5]:
input_text = "Romanian president is going on a trip to Smolensk"

In [6]:
input_ids = tokenizer.encode(input_text, return_tensors='tf')

In [7]:
output = model.generate(input_ids, max_length=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

In [8]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

In [9]:
print(generated_text)

Romanian president is going on a trip to Smolensk, where he will meet with the Russian president, Vladimir Putin.

The trip is part of a larger effort by the Kremlin to bolster its influence in the region.

The Kremlin has been trying to boost its influence in the region since the fall of the Soviet Union.

The Kremlin has been trying to boost its influence in the region since the fall of the Soviet Union.

The Kremlin has been trying to boost


TRAINING
===

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, TFGPT2Model, TFGPT2ForSequenceClassification
import tensorflow as tf
import numpy as np

In [12]:
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")

In [13]:
# Shuffle with a fixed seed so the 400-row subsets taken below are reproducible
true_df = true_df.sample(frac=1, random_state=42).reset_index(drop=True)
fake_df = fake_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [14]:
true_df["label"] = 1
fake_df["label"] = 0

In [15]:
num_entries_per_file = 400  # Specify the desired number of entries

limited_true_df = true_df.head(num_entries_per_file)
limited_fake_df = fake_df.head(num_entries_per_file)

In [16]:
combined_df = pd.concat([limited_true_df, limited_fake_df], ignore_index=True)
combined_df = combined_df[["text", "label"]]  # Selecting relevant columns

In [17]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    combined_df["text"].values,
    combined_df["label"].values,
    test_size=0.2,
    random_state=42
)

In [18]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

In [19]:
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=512, return_tensors='tf')
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=512, return_tensors='tf')
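
The tokenizer call above cuts each article to at most 512 tokens and pads shorter ones to the longest sequence in the batch, reusing the end-of-sequence token as padding because GPT-2 has no dedicated pad token. The following is a simplified pure-Python sketch of that padding and truncation step only, assuming token ids are already available; `pad_and_truncate` and the toy ids are hypothetical and are not part of the `transformers` API.

```python
def pad_and_truncate(sequences, max_length, pad_id):
    """Truncate each id list to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy ids (made up); 50256 is GPT-2's end-of-sequence id, reused here as padding
batch = pad_and_truncate([[10, 11, 12, 13, 14], [20, 21]], max_length=4, pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The `attention_mask` is what lets later layers tell real tokens from padding.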

In [20]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

In [21]:
# Build classification model
class GPT2Classifier(tf.keras.Model):
    def __init__(self, num_classes):
        super(GPT2Classifier, self).__init__()
        self.gpt2 = TFGPT2Model.from_pretrained("gpt2")
        self.dropout = tf.keras.layers.Dropout(0.1)
        # No activation: the layer outputs raw logits, which is what a loss
        # configured with from_logits=True expects
        self.classifier = tf.keras.layers.Dense(num_classes)

    def call(self, inputs):
        outputs = self.gpt2(inputs)[0]
        pooled_output = tf.reduce_mean(outputs, axis=1)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

model = GPT2Classifier(num_classes=2)

All PyTorch model weights were used when initializing TFGPT2Model.

All the weights of TFGPT2Model were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.
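
The `call` method above averages the hidden states over every position with `tf.reduce_mean`, which includes the padding positions added by the tokenizer. A minimal NumPy sketch of a mask-aware mean follows, for illustration only; the function name `masked_mean_pool` and the toy arrays are assumptions and do not appear in the notebook.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average hidden states over real tokens only.

    hidden_states: float array of shape (batch, seq_len, hidden)
    attention_mask: 0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

# Toy batch: one sequence with two real tokens followed by one padding token
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(axis=1)  # unmasked mean, pulled toward the padding vector
print(pooled, plain)
```

The same arithmetic could be written with TensorFlow ops inside `call`, reading the mask from `inputs["attention_mask"]`.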


In [22]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
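
`SparseCategoricalCrossentropy(from_logits=True)` expects unnormalized scores from the model and applies the softmax itself. The NumPy sketch below reproduces that computation by hand for integer labels, averaging over the batch as Keras does by default; `sparse_ce_from_logits` is a hypothetical helper written for illustration, not a Keras function, and the numbers are toy values.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy between integer labels and softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Subtract the row maximum for numerical stability (log-sum-exp trick)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class in each row
    picked = log_probs[np.arange(len(labels)), labels]
    return float(-picked.mean())

# Toy batch of two examples with two classes (0 = fake, 1 = true)
toy_logits = [[2.0, 0.0], [0.0, 0.0]]
toy_labels = [0, 1]
loss_value = sparse_ce_from_logits(toy_logits, toy_labels)
print(loss_value)
```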

In [23]:
batch_size = 6  # Adjust this value based on your memory constraints

# The dataset is already batched, so batch_size is not passed to fit() again
model.fit(train_dataset.shuffle(1000).batch(batch_size), epochs=3)

Epoch 1/3



Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x2000f38f250>

In [24]:
# Use test_loss so the loss object defined earlier is not overwritten
test_loss, accuracy = model.evaluate(test_dataset.batch(16))
print(f"Test accuracy: {accuracy}")

Test accuracy: 1.0
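
With 800 articles and `test_size=0.2`, the test set holds 160 examples, and a single accuracy figure does not show how any errors would split between the two classes. The sketch below computes per-class counts, precision and recall in plain Python, following the notebook's convention of label 1 for true news and 0 for fake. The function name `binary_report` and the toy label lists are illustrative assumptions, not results from this model.

```python
def binary_report(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels (made up): 1 = true news, 0 = fake news
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
report = binary_report(y_true, y_pred)
print(report)
```

In the notebook, predicted labels could plausibly be obtained with `np.argmax(model.predict(test_dataset.batch(16)), axis=1)` and compared against `test_labels`.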


In [25]:
model_dir = '1.model'
model.save(model_dir)

INFO:tensorflow:Assets written to: 1.model\assets


