# About the Dataset

The dataset for this project is half real news articles and half generated ones. The real articles come from the All the News dataset, located here: https://components.one/datasets/all-the-news-2-news-articles-dataset/ . All the News contains 2.6 million news articles from 2015-2020, from 27 different publications. I used these articles to generate the other half of the dataset: I fine-tuned LLaMA 7B on 220 thousand articles from 22 different publications in All the News, then generated each fake article from a prompt consisting of a real article's headline, publication, and first two sentences. The articles used for fine-tuning and the articles used for prompts are separate subsets of All the News. Overall, I generated 2750 fake articles from 22 different publications; the other 2750 articles in this dataset are the corresponding real articles. Coming into this project, I wanted to explore how well a transformer model could differentiate between the two and correctly classify each as real or fake.

The fine-tuned model can be accessed on Hugging Face here: https://huggingface.co/AkhilGhosh/llama-cnn-210k
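
As a rough illustration of the prompt described above, the sketch below splits a `pubhead` value into publication and headline and joins them with the opening sentences. The `Publication<sep>Headline` layout is taken from the `pubhead` column shown later in this notebook. The helper names and the second `<sep>` before the opening sentences are assumptions for illustration, not necessarily the exact prompt format used for generation.

```python
SEP = "<sep>"

def split_pubhead(pubhead, sep=SEP):
    """Split 'Publication<sep>Headline' into (publication, headline)."""
    publication, _, headline = pubhead.partition(sep)
    return publication, headline

def build_prompt(pubhead, first_two_sentences, sep=SEP):
    """Hypothetical prompt layout: publication, headline, then the opening sentences."""
    publication, headline = split_pubhead(pubhead, sep)
    return f"{publication}{sep}{headline}{sep}{first_two_sentences}"

prompt = build_prompt(
    "Reuters<sep>Companies, industry groups target Congress to derail Trump tariffs",
    "First sentence. Second sentence.",  # placeholder text, not a real article
)
print(prompt)
```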

# Data reading and clean-up

In [1]:
!pip install keras_nlp transformers torchsummary GPUtil datasets


Collecting keras_nlp
  Downloading keras_nlp-0.6.0-py3-none-any.whl (576 kB)
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting GPUtil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting keras-core (from keras_nlp)
  Downloading keras_core-0.1.0-py3-none-any.whl (727 kB)

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
import gensim.downloader as api
from gensim import utils
import gensim.models
import re
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from torchsummary import summary
from torch.utils.data import DataLoader, Dataset, random_split
from torch.optim import AdamW


In [5]:
df = pd.read_csv('/content/FakeNews.csv')
df = df[['pubhead','first_two_sentences','article','generated_clean']]

In [6]:
def get_publication(text, sep='<sep>'):
    parts = text.split(sep, 1)
    return parts[0]
df['publication'] = df['pubhead'].apply(get_publication)

df.head(5)


Unnamed: 0,pubhead,first_two_sentences,article,generated_clean,publication
0,Vox<sep>Everyone accusing Harvey Weinstein of ...,"On October 5, the New York Times published an ...","On October 5, the New York Times published an ...","On October 5, the New York Times published an ...",Vox
1,"Vox<sep>Steve Bannon, the Trump adviser who he...",Steve Bannon is one of the world’s most powerf...,Steve Bannon is one of the world’s most powerf...,Steve Bannon is one of the world’s most powerf...,Vox
2,Vox<sep>Vox Sentences: Pope Francis’s crisis o...,"The news, but shorter. Your daily wrap-up for ...","The news, but shorter. Your daily wrap-up for ...","The news, but shorter. Your daily wrap-up for ...",Vox
3,Vox<sep>Self-Flying Lily Camera Has Booked $34...,"Drones are big at CES this year, even though t...","Drones are big at CES this year, even though t...","Drones are big at CES this year, even though t...",Vox
4,Vox<sep>Afghan government and Taliban may meet...,Fragile talks to end the war in Afghanistan ma...,Fragile talks to end the war in Afghanistan ma...,Fragile talks to end the war in Afghanistan ma...,Vox


In [7]:
df['pubhead'][275]

'Reuters<sep>Companies, industry groups target Congress to derail Trump tariffs'

In [8]:
df['generated_clean'][275]

'WASHINGTON, March 5 (Reuters) - Automakers, business groups and farmers took their fight against U.S. President Donald Trump’s proposed tariffs on steel and aluminum to Capitol Hill on Monday, betting that Republican lawmakers would stand up to the White House on their behalf. By turning to Congress, lobbyists for industries that could lose in a trade war are making a bet that Republican lawmakers would stick to their traditional support for open trade, and potentially use legislative power to derail tariffs. Trump’s proposed 25 percent tariff on steel and 10 percent tariffs on aluminum, are unlikely to face immediate approval by the World Trade Organization (WTO), which could allow Congress to halt the tariffs, lobbyists and members of the U.S. Congress said. “Our position has always been that we prefer not to (pay tarifs),” Senator Orrin Hatch, the chairman of the Senate Finance Committee, told Reuters on Monday. “We’re trying to come out of something that will be acceptable.” Hatch

In [9]:
real = df['article']
fake = df['generated_clean']

pub = df['publication']
# Series.append is deprecated (and removed in pandas 2.0); pd.concat is the equivalent
real_and_fake = pd.concat([real, fake]).reset_index(drop=True)
publist = pd.concat([pub, pub]).reset_index(drop=True)
# 1 = real article, 0 = generated article
labels = pd.Series([1] * len(real) + [0] * len(fake))


In [10]:
cleaned_data = pd.DataFrame({'publication': publist, 'articles': real_and_fake, 'labels': labels})
#shuffling data around
cleaned_data = cleaned_data.sample(frac=1).reset_index(drop=True)
cleaned_data

Unnamed: 0,publication,articles,labels
0,The Verge,Google today announced a new set of changes to...,0
1,Gizmodo,Your browser does not support HTML5 video tag....,1
2,Vice News,A former FBI translator who held a top-secret ...,0
3,Axios,Outgoing Press Secretary Sarah Sanders has bee...,0
4,Fox News,Christie Brinkley has appeared on more than 50...,1
...,...,...,...
5495,Axios,President Trump just concluded a meeting with ...,1
5496,Vice,President Donald Trump's travel ban was bound...,1
5497,CNBC,At last year's CES tech trade show in Las Vega...,0
5498,Refinery 29,Seth Rogen is responsible for some of the raci...,1


In [11]:
def erase_first_two_sentences(text):
    # Split on sentence-ending punctuation. Note that re.split drops the
    # matched characters, so the returned text loses its '.', '!' and '?'.
    sentence_enders = re.compile(r'[.!?]')
    sentence_list = sentence_enders.split(text)

    # Remove the first two sentences if there are more than two pieces;
    # otherwise return an empty string
    if len(sentence_list) > 2:
        return ' '.join(sentence_list[2:]).strip()
    else:
        return ''
cleaned_data['articles_short'] = cleaned_data['articles'].apply(erase_first_two_sentences)
cleaned_data['articles_short']

0       Google said that when users search for article...
1       Or so I hear  Big clothes  Even bigger hair  C...
2       In an unusual deal that avoided any jail time,...
3       This comes as other White House officials are ...
4       Brinkley, who is on the cover of Social Life, ...
                              ...                        
5495    But one point is clear: Talking is now the cur...
5496    Patrick Leahy, the ranking member of the Judic...
5497    Jane Horvath, Apple's senior director, was on ...
5498    He literally made a movie about making a porno...
5499    That, of course, didn’t actually happen, as co...
Name: articles_short, Length: 5500, dtype: object
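
The output above shows that the punctuation disappears from the shortened articles, because splitting on `[.!?]` discards the matched characters. A minimal sketch of an alternative that keeps the punctuation is given below; `drop_first_two_sentences` is a hypothetical name, and the rule "a sentence ends at `.`, `!` or `?` followed by whitespace" is an assumption that will still mis-split abbreviations such as "U.S. President".

```python
import re

# Split at whitespace that follows '.', '!' or '?'.
# The lookbehind leaves the punctuation attached to its sentence.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def drop_first_two_sentences(text):
    """Return the text after its first two sentences, or '' if there are no more."""
    parts = _SENTENCE_BOUNDARY.split(text.strip(), maxsplit=2)
    return parts[2] if len(parts) > 2 else ''

print(drop_first_two_sentences("One. Two! Three? Four."))
```

This is not a drop-in replacement: the remaining text would differ from the `articles_short` column shown above, so any results based on that column would need to be regenerated.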

# Data Pre-processing

In [12]:
x = cleaned_data['articles']
x_short = cleaned_data['articles_short'] # I found that this worked better for telling the fake and real articles apart, which makes sense.
y = LabelEncoder().fit_transform(cleaned_data['labels'])


In [13]:
from transformers import BertTokenizer, BertModel
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
# Tokenizing the articles into BERT token ids
# (the tokenizer was already loaded in the previous cell)
sequences = [tokenizer.encode(text, add_special_tokens=True) for text in x]

vocab_size = len(tokenizer.get_vocab())

# Sequences are padded / truncated to a fixed length below BERT's 512-token limit
max_length = 400

padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')

Token indices sequence length is longer than the specified maximum sequence length for this model (783 > 512). Running this sequence through the model will result in indexing errors
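
The warning above comes from encoding articles longer than 512 tokens; `pad_sequences` then cuts every sequence to 400 ids, so the model never sees the over-length input. The sketch below mimics `padding='post', truncating='post'` for a single sequence in plain Python. The special-token ids are the `bert-base-uncased` ones (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0); the helper names and the other token ids are placeholders for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special token ids

def pad_or_truncate_post(ids, maxlen, pad_id=PAD_ID):
    """Mimic pad_sequences(..., padding='post', truncating='post') for one sequence."""
    ids = ids[:maxlen]                           # drop tokens from the end
    return ids + [pad_id] * (maxlen - len(ids))  # pad at the end

def attention_mask(ids, pad_id=PAD_ID):
    """1 for real tokens, 0 for padding (the same rule as `x != 0` in the model)."""
    return [1 if token != pad_id else 0 for token in ids]

short_seq = [CLS_ID, 1111, 2222, SEP_ID]       # placeholder word ids
long_seq = [CLS_ID] + [3333] * 10 + [SEP_ID]   # longer than maxlen

print(pad_or_truncate_post(short_seq, 6))  # [101, 1111, 2222, 102, 0, 0]
print(pad_or_truncate_post(long_seq, 6))   # [101, 3333, 3333, 3333, 3333, 3333]
print(attention_mask(pad_or_truncate_post(short_seq, 6)))  # [1, 1, 1, 1, 0, 0]
```

One consequence visible here is that a truncated sequence loses its trailing `[SEP]` token, while a short one keeps it.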


In [15]:
sequences_short = [tokenizer.encode(text, add_special_tokens=True) for text in x_short]

# Pad the shortened sequences (not the full-article `sequences`)
padded_sequences_short = pad_sequences(sequences_short, maxlen=max_length, padding='post', truncating='post')




In [16]:
class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

# Preprocessed tokenized texts and labels
texts = torch.tensor(padded_sequences).long()
texts_short = torch.tensor(padded_sequences_short).long()
labels = torch.tensor(y)

# Create dataset and split into training, validation, and test sets (roughly 70/15/15)
dataset = TextDataset(texts, labels)
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

# Create DataLoaders for the training, validation, and test sets
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
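
To make the split concrete, the sketch below repeats the size arithmetic for the 5500 articles in this dataset. `split_sizes` is a hypothetical helper that mirrors the three lines above. Because `int()` truncates a floating-point product, the sizes can land one off the nominal 70/15/15 figures; whatever is left over goes to the test set, so the three sizes always sum to the dataset length.

```python
def split_sizes(n, train_frac=0.7, val_frac=0.15):
    """Same arithmetic as the cell above: truncate train/val, give the rest to test."""
    train_size = int(train_frac * n)
    val_size = int(val_frac * n)
    test_size = n - train_size - val_size
    return train_size, val_size, test_size

sizes = split_sizes(5500)
print(sizes)  # (3849, 825, 826): 0.7 * 5500 evaluates to just under 3850
print(sum(sizes))  # 5500
```

Integer arithmetic such as `n * 70 // 100` would give exactly 3850 if the nominal sizes matter.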


# Transformer Implementation

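The edits I made to the original Transformer design were: using only the encoder portion, introducing a fixed (frozen) embedding layer, and implementing the ClassificationTransformer class. The commented-out Embeddings class below is the variant that loads a pretrained embedding matrix such as word2vec; the active code uses frozen BERT embeddings through the BertEmbeddings class.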

In [17]:
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)


def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e4)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))


# class Embeddings(nn.Module):
#     def __init__(self, d_model, vocab, pretrained_embeddings):
#         super(Embeddings, self).__init__()
#         self.lut = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
#         self.d_model = d_model

#     def forward(self, x):
#         return self.lut(x) * math.sqrt(self.d_model)

class BertEmbeddings(nn.Module):
    def __init__(self, bert_model):
        super(BertEmbeddings, self).__init__()
        self.bert_model = bert_model
        for param in self.bert_model.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            bert_output = self.bert_model(input_ids, attention_mask=attention_mask)

        embeddings = bert_output.last_hidden_state
        return embeddings


class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
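
For readers who want to trace the `attention` function above by hand, here is a small dependency-free sketch of the same computation on nested lists: scores are `q · k / sqrt(d_k)`, masked positions are set to `-1e4`, and a softmax over the keys weights the values. It handles a single head with no batch dimension and no dropout, which is a simplification of the PyTorch code rather than a replacement for it.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: lists of vectors. mask[i][j] == 0 hides key j from query i."""
    d_k = len(q[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        if mask is not None:
            scores = [s if keep else -1e4 for s, keep in zip(scores, mask[i])]
        p = softmax(scores)
        weights.append(p)
        outputs.append([sum(p_j * v_j[c] for p_j, v_j in zip(p, v))
                        for c in range(len(v[0]))])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0]]
mask = [[1, 0], [1, 0]]  # hide the second position, as if it were padding
out, attn = scaled_dot_product_attention(x, x, x, mask)
print(attn)  # each query puts all of its weight on the first key
```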


In [18]:
class ClassificationTransformer(nn.Module):
    def __init__(self, encoder, d_model,n_classes, pool_type='mean',pretrained_embeddings=None, bert_model=None):
        super(ClassificationTransformer, self).__init__()
        self.embed = BertEmbeddings(bert_model)
        # self.pos_enc = PositionalEncoding(d_model, dropout)
        self.encoder = encoder
        self.pool_type = pool_type


        # Output layer for classification
        self.output_layer = nn.Linear(d_model, n_classes)
        self.softmax = nn.Softmax(dim=-1)

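    def forward(self, x, mask=None):
        # Build the BERT attention mask: token id 0 is the padding token
        attention_mask = (x != 0).float()
        # Frozen BERT embeddings, then the trainable encoder stack
        x = self.embed(x, attention_mask)
        x = self.encoder(x, mask)

        # Pooling over the sequence dimension
        if self.pool_type == 'mean':
            x = x.mean(dim=1)
        elif self.pool_type == 'max':
            x, _ = x.max(dim=1)
        else:
            raise ValueError('Invalid pooling type, choose either "mean" or "max".')

        # Pass through the output layer
        x = self.output_layer(x)
        # Note: nn.CrossEntropyLoss applies log-softmax itself, so the loss
        # used below is computed on top of these softmax probabilities
        x = self.softmax(x)

        return x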

# Initialize the transformer model for classification
N = 4  # number of layers
d_model = 768  # model dimension. This was changed from the original architecture of 512 to 768 because of the embeddings dimension being 768.
d_ff = 512  # dimension of the feed-forward network. Maybe don't need as many?
h = 4  # number of attention heads
dropout = 0.1
n_classes = len(np.unique(y))  # number of classes for binary classification. Did it by np.unique(y)

# Create an instance of the encoder
c = copy.deepcopy
attn = MultiHeadedAttention(h, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)

encoder = Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N)
model = ClassificationTransformer(encoder, d_model, n_classes, pool_type='mean', bert_model = bert_model)
# Send the model to the appropriate device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


# Loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=0.0001)
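
One detail of the setup above is worth illustrating: the model returns softmax probabilities, while `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The dependency-free sketch below reproduces that per-example computation to show the effect. With two classes, feeding probabilities into the loss means it can never drop below `log(1 + e^-1)`, roughly 0.313, even for a perfect prediction. The helper names are illustrative, not part of the notebook.

```python
import math

def log_softmax(values):
    m = max(values)
    log_sum = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum for v in values]

def softmax(values):
    return [math.exp(v) for v in log_softmax(values)]

def cross_entropy(inputs, target):
    """Per-example cross-entropy on raw inputs: -log_softmax(inputs)[target]."""
    return -log_softmax(inputs)[target]

logits = [8.0, -8.0]  # a very confident, correct prediction for class 0
loss_on_logits = cross_entropy(logits, 0)
loss_on_probs = cross_entropy(softmax(logits), 0)  # softmax applied twice
floor = math.log(1 + math.exp(-1))

print(f"{loss_on_logits:.6f}")  # close to 0
print(f"{loss_on_probs:.4f}")   # about 0.3133
print(f"{floor:.4f}")           # about 0.3133
```

Returning the logits from `forward` and applying softmax only when probabilities are needed would remove this floor, at the cost of changing the loss values reported in this notebook.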


In [19]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")


The model has 12,615,682 trainable parameters
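
The trainable-parameter count can be checked by hand, since the frozen BERT weights are excluded and only the encoder stack and output layer remain. The sketch below redoes the arithmetic from the hyperparameters above (`d_model = 768`, `d_ff = 512`, `N = 4`, two classes); the function names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of an nn.Linear(n_in, n_out)."""
    return n_in * n_out + n_out

def encoder_classifier_params(d_model=768, d_ff=512, n_layers=4, n_classes=2):
    attention = 4 * linear_params(d_model, d_model)  # query, key, value, output projections
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norm = 2 * d_model                         # a_2 and b_2
    per_layer = attention + feed_forward + 2 * layer_norm  # two sublayer norms per layer
    final_norm = layer_norm                          # Encoder.norm
    output_layer = linear_params(d_model, n_classes)
    return n_layers * per_layer + final_norm + output_layer

print(f"{encoder_classifier_params():,}")  # 12,615,682
```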


# Training and Evaluation

In [22]:
from typing import Tuple
def train_and_validate(model, criterion, optimizer, train_loader, val_loader, num_epochs: int, device: torch.device) -> Tuple[dict, dict, dict, dict]:
    model.to(device)

    train_loss_history = {}
    val_loss_history = {}
    val_acc_history = {}
    train_acc_history = {}
    best_val_loss = float('inf')
    best_model = None

    for epoch in range(num_epochs):
        # Training loop
        model.train()
        running_loss = 0.0
        train_total=0
        train_correct=0
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()

            # Update model parameters
            optimizer.step()

            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            train_total += targets.size(0)
            train_correct += (predicted == targets).sum().item()

        # Validation loop
        model.eval()
        val_running_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, targets)

                val_running_loss += loss.item()
                _, predicted = torch.max(outputs, 1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()

        # Calculate average losses and accuracy
        train_loss = running_loss / len(train_loader)
        val_loss = val_running_loss / len(val_loader)
        val_acc = correct / total
        train_acc = train_correct/train_total

        # Save losses and accuracy to history
        train_loss_history[epoch] = train_loss
        train_acc_history[epoch] = train_acc
        val_loss_history[epoch] = val_loss
        val_acc_history[epoch] = val_acc

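        # Check if current model is better than previous best model.
        # state_dict() returns references to the live tensors, so copy them;
        # otherwise best_model would simply track the latest weights.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model = copy.deepcopy(model.state_dict())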

        print(f'Epoch {epoch + 1}/{num_epochs}, '
              f'Train Loss: {train_loss:.4f},Train Acc: {train_acc:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')

    return train_loss_history, val_loss_history, val_acc_history, best_model


In [29]:
def evaluate(model, criterion, data_loader, device):
    model.eval()
    total_loss = 0.0
    total_correct = 0
    total_samples = 0

    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * inputs.size(0)
            preds = torch.argmax(outputs, dim=1)
            total_correct += torch.sum(preds == labels).item()
            total_samples += inputs.size(0)

    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples

    return avg_loss, accuracy
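
The two loss averages in this notebook are not computed the same way: `train_and_validate` divides the summed batch losses by the number of batches, while `evaluate` weights each batch loss by its size and divides by the number of samples. They agree when every batch is full and differ slightly when the last batch is smaller. The sketch below shows the difference with made-up numbers.

```python
def mean_of_batch_means(batch_losses):
    """Averaging used in train_and_validate: every batch counts equally."""
    return sum(batch_losses) / len(batch_losses)

def sample_weighted_mean(batch_losses, batch_sizes):
    """Averaging used in evaluate: every sample counts equally."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

batch_losses = [0.5, 0.9]  # made-up per-batch mean losses
batch_sizes = [64, 16]     # a full batch followed by a smaller final batch

print(mean_of_batch_means(batch_losses))                # roughly 0.70
print(sample_weighted_mean(batch_losses, batch_sizes))  # roughly 0.58
```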

In [23]:
def visualize_trainable_parameters(model):
    total_params = 0
    for name, param in model.named_parameters():
        if param.requires_grad:
            num_params = param.numel()
            total_params += num_params
            print(f"{name}: {num_params}")
    print(f"Total trainable parameters: {total_params}")

# Example usage
visualize_trainable_parameters(model)

encoder.layers.0.self_attn.linears.0.weight: 589824
encoder.layers.0.self_attn.linears.0.bias: 768
encoder.layers.0.self_attn.linears.1.weight: 589824
encoder.layers.0.self_attn.linears.1.bias: 768
encoder.layers.0.self_attn.linears.2.weight: 589824
encoder.layers.0.self_attn.linears.2.bias: 768
encoder.layers.0.self_attn.linears.3.weight: 589824
encoder.layers.0.self_attn.linears.3.bias: 768
encoder.layers.0.feed_forward.w_1.weight: 393216
encoder.layers.0.feed_forward.w_1.bias: 512
encoder.layers.0.feed_forward.w_2.weight: 393216
encoder.layers.0.feed_forward.w_2.bias: 768
encoder.layers.0.sublayer.0.norm.a_2: 768
encoder.layers.0.sublayer.0.norm.b_2: 768
encoder.layers.0.sublayer.1.norm.a_2: 768
encoder.layers.0.sublayer.1.norm.b_2: 768
encoder.layers.1.self_attn.linears.0.weight: 589824
encoder.layers.1.self_attn.linears.0.bias: 768
encoder.layers.1.self_attn.linears.1.weight: 589824
encoder.layers.1.self_attn.linears.1.bias: 768
encoder.layers.1.self_attn.linears.2.weight: 589824


# Wiki Articles Dataset

In [24]:
from datasets import load_dataset
dataset = load_dataset("aadityaubhat/GPT-wiki-intro")


Downloading and preparing dataset csv/aadityaubhat--GPT-wiki-intro to /root/.cache/huggingface/datasets/aadityaubhat___csv/aadityaubhat--GPT-wiki-intro-10ad8b711a5f3880/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/aadityaubhat___csv/aadityaubhat--GPT-wiki-intro-10ad8b711a5f3880/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.

In [25]:
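# First 500 real and first 500 generated Wikipedia intros
a = dataset['train']['wiki_intro'][:500]
b = dataset['train']['generated_intro'][:500]
c = a + b
# 1 = real, 0 = generated (same convention as the news articles)
labels_ = [1 for i in range(len(a))] + [0 for j in range(len(b))]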
wiki = pd.DataFrame()
wiki['articles'] = c
wiki['labels'] = labels_
wiki_shuffled = wiki.sample(frac=1).reset_index(drop=True)
combination = cleaned_data[['articles', 'labels']].reset_index(drop=True)
testing = pd.concat([combination,wiki_shuffled]).reset_index(drop=True)

In [26]:
y_test = LabelEncoder().fit_transform(testing['labels'])  # encode the real/fake labels
x_test = testing['articles']
test_sequences = [tokenizer.encode(text, add_special_tokens=True) for text in x_test]
test_padded_sequences = pad_sequences(test_sequences, maxlen=400, padding='post', truncating='post')
test_texts = torch.tensor(test_padded_sequences).long()
test_labels = torch.tensor(y_test)

dataset_test = TextDataset(test_texts, test_labels)
train_size = int(0.7 * len(dataset_test))
val_size = int(0.15 * len(dataset_test))
test_size = len(dataset_test) - train_size - val_size
train_dataset, val_dataset, test_dataset = random_split(dataset_test, [train_size, val_size, test_size])

# Create DataLoaders for the training, validation, and test sets
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [27]:
n_classes = len(np.unique(y))  # number of classes for binary classification. Did it by np.unique(y)

# Create an instance of the encoder
c = copy.deepcopy
attn = MultiHeadedAttention(h, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)

encoder = Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N)
model_test = ClassificationTransformer(encoder, d_model, n_classes, pool_type='mean', bert_model = bert_model)
# Send the model to the appropriate device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_test.to(device)


# Loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model_test.parameters(), lr=0.0001)


In [None]:
train_loss_history, val_loss_history, val_acc_history, best_model_ = train_and_validate(
    model_test, criterion, optimizer, train_loader, val_loader, 10, device)


Epoch 1/10, Train Loss: 0.6421,Train Acc: 0.6418, Val Loss: 0.6946, Val Acc: 0.6144
Epoch 2/10, Train Loss: 0.5480,Train Acc: 0.7530, Val Loss: 0.6195, Val Acc: 0.6790
Epoch 3/10, Train Loss: 0.5206,Train Acc: 0.7862, Val Loss: 0.6668, Val Acc: 0.6503
Epoch 4/10, Train Loss: 0.5940,Train Acc: 0.7086, Val Loss: 0.6705, Val Acc: 0.6421
Epoch 5/10, Train Loss: 0.5010,Train Acc: 0.8044, Val Loss: 0.6128, Val Acc: 0.6944
Epoch 6/10, Train Loss: 0.4655,Train Acc: 0.8431, Val Loss: 0.5294, Val Acc: 0.7733
Epoch 7/10, Train Loss: 0.4535,Train Acc: 0.8556, Val Loss: 0.5915, Val Acc: 0.7149
Epoch 8/10, Train Loss: 0.4506,Train Acc: 0.8620, Val Loss: 0.5932, Val Acc: 0.7118
Epoch 9/10, Train Loss: 0.4476,Train Acc: 0.8620, Val Loss: 0.5456, Val Acc: 0.7610


In [None]:
test_loss, test_acc = evaluate(model_test, criterion, test_loader, device)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")

Test Loss: 0.4762, Test Accuracy: 0.8320


In [31]:
test_loss, test_acc = evaluate(model_test, criterion, test_loader, device)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")

Test Loss: 0.6123, Test Accuracy: 0.6923


In [32]:
# Remaining Wikipedia intros (everything after the first 500 used above)
a = dataset['train']['wiki_intro'][500:]
b = dataset['train']['generated_intro'][500:]
c = a + b
labels_ = [1 for i in range(len(a))] + [0 for j in range(len(b))]
wiki = pd.DataFrame()
wiki['articles'] = c
wiki['labels'] = labels_
wiki_shuffled = wiki.sample(frac=1).reset_index(drop=True)
y_wiki = LabelEncoder().fit_transform(wiki_shuffled['labels'])  # encode the real/fake labels
x_wiki = wiki_shuffled['articles']
wiki_sequences = [tokenizer.encode(text, add_special_tokens=True) for text in x_wiki]

wiki_padded_sequences = pad_sequences(wiki_sequences, maxlen=400, padding='post', truncating='post')
wiki_texts = torch.tensor(wiki_padded_sequences).long()
wiki_labels = torch.tensor(y_wiki)

wiki_dataset = TextDataset(wiki_texts, wiki_labels)
batch_size = 64
wiki_loader = DataLoader(wiki_dataset, batch_size=batch_size, shuffle=True)


In [None]:
test_loss, test_acc = evaluate(model_test, criterion, wiki_loader, device)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")

In [30]:
import plotly.graph_objs as go
import plotly.io as pio


# plot for the training and validation loss
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=list(train_loss_history.keys()), y=list(train_loss_history.values()), mode='lines', name='Training Loss'))
fig1.add_trace(go.Scatter(x=list(val_loss_history.keys()), y=list(val_loss_history.values()), mode='lines', name='Validation Loss'))
fig1.update_layout(title='Training and Validation Loss', xaxis_title='Epoch', yaxis_title='Loss')
pio.show(fig1)

# plot for the validation accuracy
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=list(val_acc_history.keys()), y=list(val_acc_history.values()), mode='lines', name='Validation Accuracy'))
fig2.update_layout(title='Validation Accuracy', xaxis_title='Epoch', yaxis_title='Accuracy')
pio.show(fig2)
