### 1. Data Preparation:

- The dataset comes as two files, True.csv and Fake.csv, holding real and fake news articles respectively; the label itself is added in code.
- The code reads both CSV files and assigns a label to each:
  - True news: label is set to 1.
  - Fake news: label is set to 0.
- Only the relevant columns (text and label) are retained, and the two DataFrames are concatenated into a single DataFrame.
- The combined DataFrame is shuffled (with a fixed random_state for reproducibility) so that true and fake articles are mixed.
- The train_test_split function splits the data into 80% training data and 20% testing data.
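
As a quick sanity check on the 80/20 split, the sketch below reproduces the sample counts printed later in this notebook (35,918 training and 8,980 testing samples). It assumes scikit-learn's convention of rounding the test share up and giving the remainder to training; split_sizes is a hypothetical helper for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the remainder goes to
    training, which is how train_test_split is understood to behave here.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 44,898 is the total implied by the counts reported in this notebook
# (35,918 training + 8,980 testing samples).
n_train, n_test = split_sizes(35918 + 8980)
print(n_train, n_test)  # 35918 8980
```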

In [12]:
import pandas as pd
import torch

true_news = pd.read_csv('True.csv')
fake_news = pd.read_csv('Fake.csv')

print(true_news.head())
print(fake_news.head())


                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   

                 date  
0  December 31, 2017   
1  December 29, 2017   
2  December 31, 2017   
3  December 30, 2017   
4  December 29, 2017   
                                               title  \
0   Donald Trump Sends Out Embarrassing Ne

In [13]:
# Adding label for true and fake news
true_news['label'] = 1
fake_news['label'] = 0

# Selecting only the relevant columns
true_news = true_news[['text', 'label']]
fake_news = fake_news[['text', 'label']]

In [14]:
# Combining the two datasets
df = pd.concat([true_news, fake_news], ignore_index=True)

# Shuffling the combined DataFrame to mix the true and fake articles
df = df.sample(frac=1, random_state=42).reset_index(drop=True)


In [15]:
# Checking for missing values
print(df.isnull().sum())

text     0
label    0
dtype: int64


In [16]:
df.head()

| | text | label |
|---|---|---|
| 0 | Donald Trump s White House is in chaos, and th... | 0 |
| 1 | Now that Donald Trump is the presumptive GOP n... | 0 |
| 2 | Mike Pence is a huge homophobe. He supports ex... | 0 |
| 3 | SAN FRANCISCO (Reuters) - California Attorney ... | 1 |
| 4 | Twisted reasoning is all that comes from Pelos... | 0 |


In [17]:
from sklearn.model_selection import train_test_split

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)


In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### 2. Text Tokenization:
- BERT models require tokenized inputs, so BertTokenizer is used to convert the text into token IDs.
- The training and testing texts are tokenized with padding=True and truncation=True. Texts longer than max_length=128 tokens are cut to 128, and shorter ones are padded to the longest sequence in the call, so all sequences in each set have the same length (at most 128).
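
To illustrate what padding and truncation do to the token IDs, here is a minimal plain-Python sketch. It is not the tokenizer's implementation: pad_and_truncate is a hypothetical helper, the token IDs are made up, and pad_id=0 is assumed to be the padding token. The real tokenizer also keeps the special [CLS]/[SEP] tokens when truncating, whereas this sketch simply cuts the list.

```python
def pad_and_truncate(batch, max_length=128, pad_id=0):
    """Pad every sequence to the longest one in the batch, capped at max_length.

    Returns input_ids and an attention_mask (1 = real token, 0 = padding),
    two of the fields a tokenizer call produces.
    """
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token IDs (made up for illustration), with a small max_length
toy_batch = [[101, 7592, 102], [101, 7592, 2088, 2003, 2307, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 2003, 2307]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```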

In [19]:
from transformers import BertTokenizer

# Loading the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenizing the text data
train_encodings = tokenizer(list(X_train), padding=True, truncation=True, return_tensors="pt", max_length=128)
test_encodings = tokenizer(list(X_test), padding=True, truncation=True, return_tensors="pt", max_length=128)




### 3. Creating a Custom Dataset Class:
- A custom NewsDataset class is created to hold the encodings and labels. It is used to create the PyTorch datasets for training and testing.
- The `__getitem__` method returns a dictionary of encoded text items and the label for a given index.
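
To make the role of `__getitem__` concrete, here is a plain-Python sketch (no PyTorch) of how per-index dictionaries can be grouped into a batch, which is roughly what a data loader does with the items this class returns. ToyDataset and collate are hypothetical names used only for illustration, and the token IDs are made up.

```python
class ToyDataset:
    """Plain-Python stand-in for NewsDataset, using lists instead of tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One sample: the idx-th entry of every encoding field, plus its label
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


def collate(items):
    """Group a list of per-sample dicts into one dict of lists (a batch)."""
    return {key: [item[key] for item in items] for key in items[0]}


toy = ToyDataset(
    {"input_ids": [[101, 7, 102], [101, 8, 102]],
     "attention_mask": [[1, 1, 1], [1, 1, 1]]},
    labels=[1, 0],
)
batch = collate([toy[i] for i in range(len(toy))])
print(batch["labels"])  # [1, 0]
```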

In [20]:
from torch.utils.data import Dataset

# Custom Dataset class
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

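    def __getitem__(self, idx):
        # Encodings are already tensors (return_tensors="pt"), so copy them
        # instead of re-wrapping with torch.tensor(), which emits a UserWarning
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item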

    def __len__(self):
        return len(self.labels)

# Creating train and test datasets fit for PyTorch DataLoader
train_dataset = NewsDataset(train_encodings, y_train.tolist())
test_dataset = NewsDataset(test_encodings, y_test.tolist())


### 4. Model Loading and Training:
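- The pre-trained checkpoint distilbert-base-uncased is loaded using the BertForSequenceClassification class and set up for binary classification with num_labels=2. Note that this is a DistilBERT checkpoint loaded through a BERT class: transformers warns about the mismatch and reports many weights as newly initialized, so those weights do not come from the pre-trained checkpoint. DistilBertForSequenceClassification is the class that matches this checkpoint.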
- The model is moved to the appropriate device (CPU or GPU).
- Training arguments are defined using TrainingArguments, such as batch size, number of epochs, learning rate, and weight decay.
- A Trainer instance is created using the model, training arguments, and datasets.
- The trainer.train() method starts the training process, and trainer.evaluate() evaluates the model on the test dataset.

In [22]:
import torch

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Check if model is on the correct device
model.to(device)
print(f"Model device: {next(model.parameters()).device}")


Using device: cuda


NameError: name 'model' is not defined

In [23]:
print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of testing samples: {len(test_dataset)}")

print(f"First training sample: {train_dataset[0]}")
print(f"First testing sample: {test_dataset[0]}")

Number of training samples: 35918
Number of testing samples: 8980
First training sample: {'input_ids': tensor([  101,  2061,  1996, 14783,  3510,  2074,  2165,  1037,  2735,  1998,
         5163, 20147,  1010,  2145,  2725,  1996,  2694,  4984,  2000,  5326,
         8398,  1055,  6745,  3658,  1010,  2089,  2025,  2130,  2113,  2009,
         2664,  1012,  2096,  5163, 20147,  2001,  5697,  2183,  2006,  5830,
         2000,  2360,  8398,  1055,  4297, 11631,  7869,  3372,  1010,  4795,
         1056, 28394,  3215,  2323,  2022,  6439,  1998,  2111,  2323,  3579,
         2006,  2010,  4297, 11631,  7869,  3372,  1010,  4795,  4506,  2612,
         1010,  2014,  3129,  2018,  4593,  2584,  1037,  4911,  2391,  2007,
         1996, 28072,  1012,  2006, 10474,  1010,  2002,  9339,  1996,  2168,
        10041,  2594,  1056, 28394,  3215,  2010,  2564,  2001,  2006,  2694,
         6984,  1998,  2170,  2068,  2054,  2027,  2020,  1024,  3143, 13044,
         1012,  2577, 14783,  1055, 104

  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


In [24]:
from transformers import BertTokenizer, BertForSequenceClassification

# Download and cache the tokenizer and model
BertTokenizer.from_pretrained('distilbert-base-uncased')
BertForSequenceClassification.from_pretrained('distilbert-base-uncased')


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DistilBertTokenizer'. 
The class this function is called from is 'BertTokenizer'.
You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['encoder.layer.5.attention.self.value.weight', 'encoder.layer.7.attention.self.key.bias', 'encoder.layer.5.attention.self.query.weight', 'encoder.layer.10.attention.self.value.bias', 'encoder.layer.4.output.dense.bias', 'encoder.layer.1.attention.output.LayerNorm.bias', 'encoder.layer.5.attention.self.value.bias', 'encoder.layer.11.output.dense.bias', 'encoder.layer.8.attention.self.query.weight', 'en

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [16]:
# For offline running
import os
os.environ['HF_DATASETS_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'


In [28]:
from transformers import BertForSequenceClassification, BertTokenizer

# Path to your checkpoint
checkpoint_path = "results/checkpoint-6735"

# Loading the model from the checkpoint directory
model = BertForSequenceClassification.from_pretrained(checkpoint_path)

# Load the tokenizer for the base checkpoint (note: the data above was tokenized with 'bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('distilbert-base-uncased')

print("Model and tokenizer successfully loaded from the checkpoint.")


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DistilBertTokenizer'. 
The class this function is called from is 'BertTokenizer'.


Model and tokenizer successfully loaded from the checkpoint.


In [30]:
# Re-instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Resume training from the specific checkpoint
trainer.train(resume_from_checkpoint=checkpoint_path)


  0%|          | 0/6735 [00:00<?, ?it/s]

  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


{'train_runtime': 0.4346, 'train_samples_per_second': 247923.308, 'train_steps_per_second': 15496.07, 'train_loss': 0.0, 'epoch': 3.0}


TrainOutput(global_step=6735, training_loss=0.0, metrics={'train_runtime': 0.4346, 'train_samples_per_second': 247923.308, 'train_steps_per_second': 15496.07, 'train_loss': 0.0, 'epoch': 3.0})

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load the BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
model.to(device)  # Move the model to the appropriate device (CPU/GPU)


training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,                      # Number of training epochs
    per_device_train_batch_size=4,           # Batch size per device during training
    per_device_eval_batch_size=4,            # Batch size per device during evaluation
    gradient_accumulation_steps=4,           # Accumulate gradients over 4 batches
    warmup_steps=500,                        # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,                       # Weight decay to prevent overfitting
    logging_strategy="steps",                # Log at each logging step
    logging_steps=100,                       # Log every 100 steps
    evaluation_strategy="epoch",             # Evaluate the model at the end of each epoch
    save_strategy="epoch",                   # Save the model at the end of each epoch
    report_to=[],                            # No reporting to external services
    load_best_model_at_end=True,             # Load the best model when finished training
    learning_rate=1e-4,                      # Set an appropriate learning rate
    fp16=True,                               # Enable mixed precision training
)



# Create a Trainer instance
trainer = Trainer(
    model=model,                         # The BERT model instance
    args=training_args,                  # Training arguments defined above
    train_dataset=train_dataset,         # The training dataset
    eval_dataset=test_dataset,           # The evaluation dataset
)

# Train the model
try:
    trainer.train()
except Exception as e:
    print(f"An error occurred during training: {e}")



You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['encoder.layer.5.attention.self.value.weight', 'encoder.layer.7.attention.self.key.bias', 'encoder.layer.5.attention.self.query.weight', 'encoder.layer.10.attention.self.value.bias', 'encoder.layer.4.output.dense.bias', 'encoder.layer.1.attention.output.LayerNorm.bias', 'encoder.layer.5.attention.self.value.bias', 'encoder.layer.11.output.dense.bias', 'encoder.layer.8.attention.self.query.weight', 'encoder.layer.3.output.LayerNorm.bias', 'encoder.layer.7.output.dense.bias', 'encoder.layer.9.output.dense.weight', 'encoder.layer.1.attention.output.LayerNorm.weight', 'encoder.layer.7.attention.output.dense.bias', 'classifier.bias', 'encoder.layer.7.attention.self.value.bias', 'encoder.layer.

  0%|          | 0/6735 [00:00<?, ?it/s]

  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


KeyboardInterrupt: 
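
The progress bar totals in this notebook (6735 training steps, 2245 evaluation steps) can be reproduced from the sample counts and the training arguments. The sketch below assumes a single device and that a partial final batch is kept; optimizer_steps is a hypothetical helper for illustration, not a Trainer API.

```python
import math

def optimizer_steps(n_samples, batch_size, grad_accum, epochs):
    """Total optimizer updates for a run on one device (simplified)."""
    batches_per_epoch = math.ceil(n_samples / batch_size)   # last partial batch kept
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

# Values taken from this notebook: 35,918 training samples, batch size 4,
# gradient accumulation over 4 batches, 3 epochs.
train_steps = optimizer_steps(35918, batch_size=4, grad_accum=4, epochs=3)
# Evaluation has no accumulation: 8,980 test samples in batches of 4.
eval_steps = math.ceil(8980 / 4)
print(train_steps, eval_steps)  # 6735 2245
```

This would also explain the checkpoint name results/checkpoint-6735: it corresponds to the final step, so resuming from it leaves nothing to train, which is consistent with the sub-second train_runtime reported for the resumed run.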

### 5. Model Evaluation:
- Predictions are made on the test dataset using the trained model.
- Accuracy, precision, recall, and F1-score are calculated using the sklearn metrics module.
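
For reference, the sketch below shows how these four metrics follow from the predictions, using plain Python and made-up logits. It treats label 1 (true news) as the positive class, which is the default the sklearn functions use for binary labels; argmax and binary_metrics are hypothetical helpers, not sklearn functions.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up logits for four articles: column 0 = fake, column 1 = true
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
y_pred = [argmax(row) for row in logits]   # [0, 1, 1, 0]
y_true = [0, 1, 0, 0]
# Here: accuracy 0.75, precision 0.5, recall 1.0, F1 about 0.667
print(binary_metrics(y_true, y_pred))
```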

In [31]:
# Evaluating the model on the test dataset
trainer.evaluate()

# Getting predictions
predictions = trainer.predict(test_dataset).predictions
predictions = torch.argmax(torch.tensor(predictions), axis=1)

# Calculating metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


  0%|          | 0/2245 [00:00<?, ?it/s]

  0%|          | 0/2245 [00:00<?, ?it/s]

Accuracy: 0.9976614699331848
Precision: 0.9981421272642824
Recall: 0.996984458362329
F1 Score: 0.9975629569455727


### 6. Saving and Loading the Model:
- The trained model and tokenizer are saved locally using the save_pretrained() method. They can later be reloaded by passing the same directories to from_pretrained().

In [63]:
from transformers import BertForSequenceClassification, BertTokenizer

# Paths to save model and tokenizer
model_save_path = "saved_model/bert_model"
tokenizer_save_path = "saved_model/bert_tokenizer"

# Saving the BERT model
model.save_pretrained(model_save_path)

# Saving the BERT tokenizer
tokenizer.save_pretrained(tokenizer_save_path)

print(f"Model and tokenizer saved successfully at {model_save_path} and {tokenizer_save_path}")


Model and tokenizer saved successfully at saved_model/bert_model and saved_model/bert_tokenizer


In [64]:
model.cpu()
model.save_pretrained(model_save_path)


In [66]:
device
torch.cuda.empty_cache()