## ⚙️ 1. Environment Setup

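This first cell prepares our environment by listing the libraries needed for data handling, modeling, and visualization. The `pip install` lines are commented out; uncomment them on first run if the libraries are not yet installed.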

- **`transformers`, `accelerate`, `datasets`**: The core Hugging Face stack for loading models, speeding up training, and handling data.
- **`openpyxl`**: The engine `pandas` uses to read and write Excel files (`.xlsx`).
- **`seaborn`, `matplotlib`**: Standard Python libraries for creating visualizations.
- **`warnings.filterwarnings('ignore')`**: This line is included to suppress warnings during execution, keeping the output clean for this demonstration. This is generally not recommended for production code.
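
If silencing every warning globally feels too broad, the standard library also allows suppressing them for a single block. The sketch below shows that pattern; `noisy_step` is a hypothetical helper that only stands in for a library call that warns.

```python
import warnings


def noisy_step():
    """Hypothetical helper that emits a warning, standing in for a library call."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "done"


# Suppress warnings only inside this block instead of for the whole session.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = noisy_step()

# Outside that block the earlier filters apply again; here we record what is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_step()

print(quiet_result, loud_result, len(caught))
```

The context manager restores the previous warning filters on exit, so the rest of the notebook still reports problems.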

In [None]:
# !pip install -U transformers
# !pip install -U accelerate
# !pip install -U datasets
# !pip install -U bertviz
# !pip install -U umap-learn
# !pip install seaborn --upgrade

# !pip install -U openpyxl

# Suppress warnings to keep the demo output clean; avoid this in production code
import warnings
warnings.filterwarnings('ignore')

## 📥 2. Data Loading and Cleaning

We load our dataset, which is an Excel file containing news articles, using `pandas`. A crucial first step in any machine learning project is to handle missing data. We check for null values using `.isnull().sum()` and then remove any rows with missing values using `.dropna()` to ensure our dataset is clean and ready for analysis.
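
To see what these two calls do before running them on the real file, here is a minimal sketch on a toy DataFrame. The column names mirror the ones used later in this notebook, but the values are made up for illustration.

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are invented.
toy = pd.DataFrame(
    {
        "title": ["Headline A", None, "Headline C"],
        "text": ["Body A", "Body B", None],
        "label": [0, 1, 1],
    }
)

missing_per_column = toy.isnull().sum()        # number of nulls in each column
cleaned = toy.dropna().reset_index(drop=True)  # keep only fully populated rows

print(missing_per_column.to_dict())
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Note that `.dropna()` removes a row if any column is missing, so two of the three toy rows are dropped.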

In [None]:
import pandas as pd

df = pd.read_excel("https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/fake_news.xlsx")

In [None]:
# Count missing values per column before cleaning
print(df.isnull().sum())
df = df.dropna()

# Confirm that no missing values remain
df.isnull().sum()

## 📊 3. Exploratory Data Analysis (EDA)

With a clean dataset, we explore its characteristics.

1.  **Class Distribution**: We check the balance between 'Real' and 'Fake' news articles using `.value_counts()` and visualize it with a bar chart. Our dataset appears to be well-balanced.
2.  **Token Length Analysis**: We estimate the number of tokens in the `title` and `text` of each article (approximating 1.5 tokens per word). Histograms show us the distribution of these lengths. We can see that titles are short, while the article texts have a much wider range of lengths. This notebook will focus on classifying news based on the **title only** for efficiency.
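
The token estimate is just a word count scaled by a fixed ratio. A minimal sketch of that heuristic follows; `estimate_tokens` is a hypothetical helper name and the headline is made up. The 1.5 ratio is the rough approximation used in this notebook, not a measured property of any tokenizer.

```python
def estimate_tokens(text, tokens_per_word=1.5):
    """Rough token count: whitespace-separated words times a fixed ratio."""
    return len(text.split()) * tokens_per_word


sample_title = "Scientists announce a new study on sleep"  # made-up headline
print(estimate_tokens(sample_title))
```

For exact counts, we would run the actual tokenizer instead, at the cost of a slower pass over the data.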

In [None]:
print(df.shape)

df['label'].value_counts()

In [None]:
df.title

In [None]:
import matplotlib.pyplot as plt

In [None]:
label_counts = df['label'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()

In [None]:
# 1.5 tokens per word on average
df['title_tokens'] = df['title'].apply(lambda x: len(x.split())*1.5)
df['text_tokens'] = df['text'].apply(lambda x: len(x.split())*1.5)


fig, ax = plt.subplots(1,2, figsize=(15,5))

ax[0].hist(df['title_tokens'], bins=50, color = 'skyblue')
ax[0].set_title("Title Tokens")

ax[1].hist(df['text_tokens'], bins=50, color = 'orange')
ax[1].set_title("Text Tokens")

plt.show()

## ✂️ 4. Data Splitting & Formatting

To train and evaluate our model properly, we split the data into three distinct sets: a **training set (70%)** to train the model, a **test set (20%)** for final evaluation, and a **validation set (10%)** to monitor performance during training. The `stratify=df['label']` argument ensures that the proportion of 'Real' and 'Fake' news is the same across all three sets.
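
The 70/20/10 proportions come from two consecutive splits: first 30% is held out, then one third of that holdout becomes the validation set. The arithmetic can be checked with exact fractions; the variable names below are illustrative only.

```python
from fractions import Fraction

holdout_frac = Fraction(3, 10)                   # first split: test_size=0.3
validation_frac = holdout_frac * Fraction(1, 3)  # second split: test_size=1/3 of the holdout
test_frac = holdout_frac - validation_frac       # what remains of the holdout
train_frac = 1 - holdout_frac

print(f"train={train_frac}, test={test_frac}, validation={validation_frac}")
```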

Finally, we convert our pandas DataFrames into a `DatasetDict`, which is the standard data structure used by the Hugging Face `Trainer` API.

In [None]:
from sklearn.model_selection import train_test_split

# 70% for training, 20% test, 10% validation
train, test = train_test_split(df, test_size=0.3, stratify=df['label'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label'])

train.shape, test.shape, validation.shape, df.shape



In [None]:
from datasets import Dataset, DatasetDict

dataset = DatasetDict(
    {
        "train": Dataset.from_pandas(train, preserve_index=False),
        "test": Dataset.from_pandas(test, preserve_index=False),
        "validation": Dataset.from_pandas(validation, preserve_index=False)
    }
)

dataset

## ✍️ 5. Tokenization

Here we prepare the text data for the model. **Tokenization** is the process of converting text into numerical IDs that the model can understand.

- **Model Choice**: We choose `distilbert-base-uncased`. This is a smaller, faster, and lighter version of BERT, making it ideal for quick training and inference without a huge sacrifice in performance.
- **`AutoTokenizer`**: We load the specific tokenizer that corresponds to our chosen model to ensure data is processed correctly.
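- **Applying Tokenization**: We create a `tokenize` function that takes a batch of data and applies the tokenizer to the `title` column. Using `.map()`, we efficiently apply this function to our entire dataset. `padding=True` pads every title in a batch to the length of the longest one, and `truncation=True` cuts any title that exceeds the model's maximum input length. Because we pass `batch_size=None`, each split is processed as a single batch, so all titles within a split end up the same length.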
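
To make the effect of padding and truncation concrete, here is a simplified pure-Python sketch. `pad_and_truncate` is a hypothetical helper, not the tokenizer's implementation: the token IDs are made up, `pad_id=0` is an assumption, and a real tokenizer also keeps special tokens such as `[SEP]` in place when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Simplified stand-in for padding=True / truncation=True on one batch.

    Cuts each sequence to max_length, then pads every sequence to the length
    of the longest one in the batch. Returns ids and attention masks.
    """
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three titles of different lengths.
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 2516, 102], [101, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask marks real tokens with 1 and padding with 0, which is how the model knows to ignore the padded positions.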

In [None]:
from transformers import AutoTokenizer

text = "Machine learning is awesome!! Thanks KGP Talkie."

model_ckpt = "distilbert-base-uncased"
distilbert_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
distilbert_tokens = distilbert_tokenizer.tokenize(text)

# model_ckpt = "google/mobilebert-uncased"
# mobilebert_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# mobilebert_tokens = mobilebert_tokenizer.tokenize(text)

# model_ckpt = "huawei-noah/TinyBERT_General_4L_312D"
# tinybert_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# tinybert_tokens = tinybert_tokenizer.tokenize(text)


In [None]:
distilbert_tokenizer,
# mobilebert_tokenizer,
# tinybert_tokenizer

In [None]:
def tokenize(batch):
    temp = distilbert_tokenizer(batch['title'], padding=True, truncation=True)
    return temp

print(tokenize(dataset['train'][:2]))

In [None]:
encoded_dataset = dataset.map(tokenize, batch_size=None, batched=True)

## 🤖 6. Model Configuration

Now we load and configure the pre-trained DistilBERT model for our specific task.

- **Label Mapping**: We create `label2id` and `id2label` dictionaries to map our string labels ('Real', 'Fake') to integer IDs (0, 1). Models work with numbers, so this mapping is essential.
- **`AutoConfig`**: We load the model's default configuration and update it with our specific number of labels and our label mappings.
- **`AutoModelForSequenceClassification`**: This command downloads the pre-trained DistilBERT model and attaches a new, untrained classification head on top, configured according to our settings. This head is what we will fine-tune.
- **Device**: The model is moved to the GPU (`cuda`) if one is available, which significantly speeds up training.
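
Since the two label dictionaries must mirror each other, one option is to derive `id2label` from `label2id` rather than typing both by hand. This is a small sketch of that idea, with a sanity check that the IDs are contiguous.

```python
label2id = {"Real": 0, "Fake": 1}

# Derive the reverse mapping so the two dictionaries cannot drift apart.
id2label = {idx: label for label, idx in label2id.items()}

# Sanity check: ids are unique and run from 0 to num_labels - 1.
num_labels = len(label2id)
assert sorted(id2label) == list(range(num_labels))

print(id2label, num_labels)
```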

In [None]:
from transformers import AutoModelForSequenceClassification, AutoConfig
import torch

label2id = {"Real": 0, "Fake": 1}
id2label = {0:"Real", 1:"Fake"}

model_ckpt = "distilbert-base-uncased"
# model_ckpt = "google/mobilebert-uncased"
# model_ckpt = "huawei-noah/TinyBERT_General_4L_312D"


num_labels = len(label2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = AutoConfig.from_pretrained(model_ckpt, num_labels=num_labels, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)




## 🚀 7. Advanced Training Setup

We configure the training process using `TrainingArguments` and a custom metrics function. This notebook uses several advanced arguments to improve training stability and efficiency.

- **`compute_metrics_evaluate`**: A function that calculates accuracy and a weighted F1-score during evaluation. The F1-score is a good metric that balances precision and recall.
- **`TrainingArguments`**: We set key hyperparameters, but also include several improvements:
  - **`eval_strategy="steps"`**: Evaluate the model periodically *during* an epoch, not just at the end. Older `transformers` releases call this argument `evaluation_strategy`.
  - **`load_best_model_at_end=True`**: The trainer will keep track of the model with the best `accuracy` on the validation set and automatically load it at the end of training.
  - **`fp16=True`**: Enables mixed-precision training, which uses a combination of 16-bit and 32-bit floating-point types to speed up training and reduce memory usage. This requires a CUDA-capable GPU; on a CPU-only machine, set it to `False`.
  - **`warmup_ratio=0.1`**: Starts the learning rate near zero, increases it linearly over the first 10% of training steps, and then decays it (linearly, with the default scheduler). This helps stabilize training in the early stages.
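
The shape of a warmup-then-decay schedule can be sketched in a few lines. This is an illustration of a linear schedule under stated assumptions, not the `Trainer`'s own code: `linear_warmup_decay` is a hypothetical helper, the total step count is made up, and the way warmup steps are rounded here may differ slightly from the library.

```python
def linear_warmup_decay(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total))
```

The rate peaks at `base_lr` exactly when warmup ends (step 100 here) and reaches zero at the final step.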

In [None]:
# use sklearn to build compute metrics
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_evaluate(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)

    return {"accuracy": acc, "f1": f1}
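
To show what `average="weighted"` means, here is a pure-Python sketch that computes the same quantity by hand: a per-class F1, averaged with weights equal to each class's number of true examples. `weighted_f1` is a hypothetical helper and the labels are made up; in the notebook we keep using `f1_score` from `scikit-learn`.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)  # weight by support (number of true examples)
    return total / len(y_true)


# Made-up imbalanced example: six examples of class 0, two of class 1.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 1]
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(round(weighted_f1(toy_true, toy_pred), 4), toy_accuracy)
```

On this toy data the weighted F1 comes out lower than the accuracy because the minority class is predicted poorly.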

In [None]:
from transformers import TrainingArguments

# batch_size = 32
training_dir = "train_dir"


training_args = TrainingArguments(
    output_dir=training_dir,
    num_train_epochs=3,                  # Increased epochs for better convergence on this dataset size
    learning_rate=2e-5,
    per_device_train_batch_size=64,      # A common batch size
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    
    # --- Improvements ---
    eval_strategy="steps",         # 💡 Evaluate during training
    eval_steps=100,                      # Evaluate every 100 steps
    save_strategy="steps",               # 💡 Match saving strategy to evaluation
    save_steps=100,
    logging_steps=50,                    # Log training loss every 50 steps
    load_best_model_at_end=True,         # 🚀 Automatically load the best model
    metric_for_best_model="accuracy",    # The metric to monitor for the "best" model
    save_total_limit=2,                  # Only keep the best model and the most recent one
    fp16=torch.cuda.is_available(),      # ⚡️ Mixed precision needs a CUDA GPU; falls back to full precision on CPU
    warmup_ratio=0.1,                    # Use a learning rate scheduler with warmup
)



## ▶️ 8. Model Training

We instantiate the `Trainer`, which brings together the model, training arguments, datasets, tokenizer, and metrics function. Calling `trainer.train()` starts the fine-tuning process. The `Trainer` handles all the complexity of the training loop, including moving data to the device, calculating loss, performing backpropagation, updating weights, and running evaluations.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics_evaluate,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=distilbert_tokenizer
)

In [None]:
trainer.train()

## 🧐 9. Final Evaluation and Inference

After training, we evaluate the final model on the test set, which it has never seen before. 

1.  **Inference Function**: We create a `get_prediction` function to test the model with a single, custom headline. This shows how the model would be used in a real application.
2.  **Test Set Prediction**: We use `trainer.predict()` to get predictions for the entire test set and review the final performance metrics.
3.  **Classification Report**: We generate a detailed `classification_report` from `scikit-learn` to see the precision, recall, and F1-score for both the 'Real' and 'Fake' classes.
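
The model outputs raw logits, and the predicted class is simply the index of the largest one. If a confidence value is also wanted, a softmax turns the logits into probabilities. The sketch below shows both steps in plain Python; `logits_to_prediction` is a hypothetical helper and the logits are made up.

```python
import math

id2label = {0: "Real", 1: "Fake"}


def logits_to_prediction(logits):
    """Convert raw logits for one example into (label, probability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    probs = [e / sum(exps) for e in exps]
    pred_id = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return id2label[pred_id], probs[pred_id]


label, prob = logits_to_prediction([-1.2, 2.3])  # made-up logits
print(label, round(prob, 3))
```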

In [None]:
text = "Researchers Publish Findings on Efficacy of New Alzheimer's Drug"

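def get_prediction(text):
    # Tokenize the headline and move the tensors to the model's device
    input_encoded = distilbert_tokenizer(text, return_tensors='pt', truncation=True).to(device)

    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():
        outputs = model(**input_encoded)

    logits = outputs.logits

    # Pick the class with the highest logit and map it back to its label
    pred = torch.argmax(logits, dim=1).item()
    return id2label[pred]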

get_prediction(text)

In [None]:
preds_output = trainer.predict(encoded_dataset['test'])
preds_output.metrics

In [None]:
import numpy as np
y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = encoded_dataset['test'][:]['label']

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, target_names=list(label2id)))

## 💾 10. Saving and Using the Model

The final step is to save our work and demonstrate an easy way to use the model.

- **`trainer.save_model()`**: This command saves the trained model's weights, configuration, and tokenizer information to a directory (named "fake_news").
- **`pipeline`**: We load the saved model into a `text-classification` pipeline. The pipeline is the easiest way to perform inference, as it wraps all the necessary steps (tokenization, model prediction, and converting the output to a label) into a single, simple function call.
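
A `text-classification` pipeline returns a list of dictionaries, each with a `label` and a `score`. One possible way to use that output is to flag low-confidence predictions for manual review. The sketch below assumes that output shape; `flag_uncertain`, the 0.8 threshold, and the sample scores are all hypothetical.

```python
def flag_uncertain(results, threshold=0.8):
    """Attach a 'confident' flag to each pipeline-style result dict."""
    return [{**r, "confident": r["score"] >= threshold} for r in results]


# Hypothetical output in the shape a text-classification pipeline returns.
sample_results = [
    {"label": "Fake", "score": 0.97},
    {"label": "Real", "score": 0.55},
]
flagged = flag_uncertain(sample_results)
print(flagged)
```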

In [None]:
trainer.save_model("fake_news")

In [None]:
from transformers import pipeline

classifier = pipeline('text-classification', model='fake_news')

In [None]:
classifier("some text data")