## Data Preparation

In this notebook we will explore the different datasets we are going to be using to fine-tune our Transformer model. The objective of this model is to be able to detect (classify) spanish news articles between 2 classes (binary): `Fake` and `Real`. After analysis all datasets we will build an unique dataset including all data (clean, standarized, etc...).

#### Datasets to analyze
* `Spanish Fake and Real News` - Universidad de Madrid.
* `Spanish Political Fake News`- Universidad de Vigo.
* `Noticias falsas en español` - Kaggle (includes both fake and true datasets).
* `Fake News Corpus Spanish` - by jpposadas (Github).

In the future we could try adding more datasets or getting data our self by scraping it from news sites or sites that already provide the labelling of fake or real. But as a first model we will go with this 4 datasets which initially have 58370 articles in total (before data clean up).

In [1]:
import pandas as pd

### Spanish Fake and Real News

This dataset was created for a final project in a cybersecurity master's program by a student from the Polytechnic University of Madrid. It mainly contains news from 2019.

The kaggle page made available 2 csv files:
* `spanishFakeNews.csv` - Training data.
* `testSpanishFakeNews.csv` - Testing data.

As we are going to use also a validation set, we will perform the split at the end using the final dataset.

Dataset page: [Spanish Fake and Real News - Kaggle](https://www.kaggle.com/datasets/zulanac/fake-and-real-news)


In [2]:
spanish_fake_real_news_df = pd.read_csv("../data/preview/spanishFakeNews.csv", names=["text", "label"], header=0)

print(f"Training dataset shape: {spanish_fake_real_news_df.shape}")
print(f"Training dataset count:\n{spanish_fake_real_news_df.count()}")
print(f"Training dataset labels: {spanish_fake_real_news_df['label'].unique()}")

Training dataset shape: (538, 2)
Training dataset count:
text     538
label    538
dtype: int64
Training dataset labels: ['fake' 'real']


In [3]:
spanish_fake_real_news_test_df = pd.read_csv("../data/preview/testSpanishFakeNews.csv", names=["text", "label"], header=0)

print(f"Testing dataset shape: {spanish_fake_real_news_test_df.shape}")
print(f"Testing dataset count:\n{spanish_fake_real_news_test_df.count()}")
print(f"Testing dataset labels: {spanish_fake_real_news_test_df['label'].unique()}")

Testing dataset shape: (60, 2)
Testing dataset count:
text     60
label    60
dtype: int64
Testing dataset labels: ['fake' 'real']


##### Create final `Spanish Fake and Real News` dataset

After loading both datasets we are going to join them as one final dataset

In [4]:
# Replace "categorical" labels with numerical labels in both training and testing datasets
spanish_fake_real_news_df["label"] = spanish_fake_real_news_df["label"].replace({'real': 1 , 'fake': 0}).astype("int8")
spanish_fake_real_news_test_df["label"] = spanish_fake_real_news_test_df["label"].replace({'real': 1 , 'fake': 0}).astype("int8")

# Join together both datasets
spanish_fake_real_news_final_df = pd.concat(objs=[spanish_fake_real_news_df, spanish_fake_real_news_test_df], ignore_index=True)
spanish_fake_real_news_final_df = spanish_fake_real_news_final_df.sample(frac=1).reset_index(drop=True)


print(f"Final dataset shape: {spanish_fake_real_news_final_df.shape}")
print(f"Final dataset count:\n{spanish_fake_real_news_final_df.count()}")
print(f"Final dataset labels: {spanish_fake_real_news_final_df['label'].unique()}")

Final dataset shape: (598, 2)
Final dataset count:
text     598
label    598
dtype: int64
Final dataset labels: [0 1]


  spanish_fake_real_news_df["label"] = spanish_fake_real_news_df["label"].replace({'real': 1 , 'fake': 0}).astype("int8")
  spanish_fake_real_news_test_df["label"] = spanish_fake_real_news_test_df["label"].replace({'real': 1 , 'fake': 0}).astype("int8")


In [5]:
# Create a column with the number of characters
spanish_fake_real_news_final_df["length_words"] = spanish_fake_real_news_final_df["text"].str.split().str.len()

print("Articles length information: ")
print(spanish_fake_real_news_final_df["length_words"].describe())

spanish_fake_real_news_final_df = spanish_fake_real_news_final_df.drop(columns=["length_words"])

Articles length information: 
count     598.000000
mean      387.428094
std       325.559810
min        11.000000
25%       232.250000
50%       334.500000
75%       479.750000
max      4669.000000
Name: length_words, dtype: float64


In [6]:
# Check if the dataset is balanced (Real articles have aprox. 30% more)
spanish_fake_real_news_final_df["label"].value_counts()

label
1    339
0    259
Name: count, dtype: int64

### Spanish Political Fake News

This dataset was created for a thesis by a student at the University of Vigo. It contains news articles from April 2017 to June 2023. And it's main focus is in politics articles.

The kaggle page made available 1 csv file:
* `D57000_complete.csv` - Contains all data.

***Observation**: Instead of including a text or something similar it includes a description and a lot of articles are Null, so we will have to analyze it properly and see what we keep*.

Dataset page: [Spanish Political Fake News - Kaggle](https://www.kaggle.com/datasets/javieroterovizoso/spanish-political-fake-news)


In [7]:
spanish_political_fake_news_df = pd.read_csv("../data/preview/D57000_complete.csv", sep=";", names=["id", "label", "title", "text", "date"], header=0)
spanish_political_fake_news_df = spanish_political_fake_news_df.drop(columns=["id", "date"])

print(f"Training dataset shape: {spanish_political_fake_news_df.shape}")
print(f"Training dataset count:\n{spanish_political_fake_news_df.count()}")
print(f"Training dataset labels: {spanish_political_fake_news_df['label'].unique()}")

Training dataset shape: (57231, 3)
Training dataset count:
label    57231
title    57231
text     57231
dtype: int64
Training dataset labels: [1 0]


In [8]:
# Combine title with text to make articles a bit longer
spanish_political_fake_news_df["text"] = spanish_political_fake_news_df["text"] + spanish_political_fake_news_df["title"]

In [9]:
# Create a column with the number of characters
spanish_political_fake_news_df["length_words"] = spanish_political_fake_news_df["text"].str.split().str.len()

print("Articles length information: ")
print(spanish_political_fake_news_df["length_words"].describe())

spanish_political_fake_news_df = spanish_political_fake_news_df.drop(columns=["length_words"])

Articles length information: 
count    57231.000000
mean        53.054603
std         15.960961
min          8.000000
25%         43.000000
50%         51.000000
75%         61.000000
max        199.000000
Name: length_words, dtype: float64


### Noticias falsas en español

We don't have much information about this dataset as it was not provided by the authors. The only think we know is that it includes 2 csv files:

* `onlyfakes1000.csv` - Containing almost 1k of fake news articles (965).
* `onlytrue1000.csv` - Containing almost 1k of real news articles (993).

Dataset page: [Noticias falsas en español - Kaggle](https://www.kaggle.com/datasets/arseniitretiakov/noticias-falsas-en-espaol)

In [10]:
only_fake_df = pd.read_csv("../data/preview/onlyfakes1000.csv", names=["text"], header=0)
only_fake_df["label"] = 0

print(f"Training dataset shape: {only_fake_df.shape}")
print(f"Training dataset count:\n{only_fake_df.count()}")
print(f"Training dataset labels: {only_fake_df['label'].unique()}")

Training dataset shape: (1000, 2)
Training dataset count:
text     1000
label    1000
dtype: int64
Training dataset labels: [0]


In [11]:
only_true_df = pd.read_csv("../data/preview/onlytrue1000.csv", names=["text"], header=0)
only_true_df["label"] = 1

print(f"Training dataset shape: {only_true_df.shape}")
print(f"Training dataset count:\n{only_true_df.count()}")
print(f"Training dataset labels: {only_true_df['label'].unique()}")

Training dataset shape: (1000, 2)
Training dataset count:
text     1000
label    1000
dtype: int64
Training dataset labels: [1]


In [12]:
# Join together both datasets
true_fake_final_df = pd.concat(objs=[only_fake_df, only_true_df], ignore_index=True)
true_fake_final_df = true_fake_final_df.sample(frac=1).reset_index(drop=True)


print(f"Final dataset shape: {true_fake_final_df.shape}")
print(f"Final dataset count:\n{true_fake_final_df.count()}")
print(f"Final dataset labels: {true_fake_final_df['label'].unique()}")

Final dataset shape: (2000, 2)
Final dataset count:
text     2000
label    2000
dtype: int64
Final dataset labels: [0 1]


In [13]:
# Create a column with the number of characters
true_fake_final_df["length_words"] = true_fake_final_df["text"].str.split().str.len()

print("Articles length information: ")
print(true_fake_final_df["length_words"].describe())

true_fake_final_df = true_fake_final_df.drop(columns=["length_words"])

Articles length information: 
count    2000.000000
mean       39.680500
std        14.310969
min         6.000000
25%        39.000000
50%        42.000000
75%        45.000000
max       379.000000
Name: length_words, dtype: float64


### Fake News Corpus Spanish

It contains a collection of 971 news divided into 491 real news and 480 fake news. The corpus covers news from 9 different topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. And the articles were published between November 2020 and March 2021.

Extra information provided by the author is that the articles come from a variaty of countries: Argentina, Bolivia, Chile, Colombia, Costa Rica, Ecuador, Spain, United States, France, Peru, Uruguay, England and Venezuela.

It provides 3 xlsx files:
* `train.xlsx` - Training data.
* `development.xlsx` - Development/Validation data.
* `test.xlsx` - Testing data.

Dataset page: [Fake News Corpus Spanish - jpposadas Github](https://github.com/jpposadas/FakeNewsCorpusSpanish)

In [14]:
train_df = pd.read_excel("../data/preview/train.xlsx", names=["id", "label", "topic", "source", "headline", "text", "link"], header=0)
train_df = train_df.drop(columns=["id", "topic", "source", "link", "headline"])

print(f"Training dataset shape: {train_df.shape}")
print(f"Training dataset count:\n{train_df.count()}")
print(f"Training dataset labels: {train_df['label'].unique()}")

Training dataset shape: (676, 2)
Training dataset count:
label    676
text     676
dtype: int64
Training dataset labels: ['Fake' 'True']


In [15]:
development_df = pd.read_excel("../data/preview/development.xlsx", names=["id", "label", "topic", "source", "headline", "text", "link"], header=0)
development_df = development_df.drop(columns=["id", "topic", "source", "link", "headline"])

print(f"Development dataset shape: {development_df.shape}")
print(f"Development dataset count:\n{development_df.count()}")
print(f"Development dataset labels: {development_df['label'].unique()}")

Development dataset shape: (295, 2)
Development dataset count:
label    295
text     295
dtype: int64
Development dataset labels: ['Fake' 'True']


In [16]:
test_df = pd.read_excel("../data/preview/test.xlsx", names=["id", "label", "topic", "source", "headline", "text", "link"], header=0)
test_df = test_df.drop(columns=["id", "topic", "source", "link", "headline"])

print(f"Test dataset shape: {test_df.shape}")
print(f"Test dataset count:\n{test_df.count()}")
print(f"Test dataset labels: {test_df['label'].unique()}")

Test dataset shape: (572, 2)
Test dataset count:
label    572
text     572
dtype: int64
Test dataset labels: [ True False]


##### Create final `Fake News Corpus Spanish` dataset

After loading the 3 datasets we are going to join them as one final dataset

In [17]:
# Replace labels from "Fake" to 0 and "True" to 1
train_df["label"] = train_df["label"].replace({'True': 1 , 'Fake': 0}).astype("int8")
development_df["label"] = development_df["label"].replace({'True': 1 , 'Fake': 0}).astype("int8")
# The test dataset has different labels (True/False)
test_df["label"] = test_df["label"].replace({'True': 1 , 'False': 0}).astype("int8")

# Join all fake news corpush spanish datasets
fake_news_corpus_spanish_final_df = pd.concat(objs=[train_df, development_df, test_df], ignore_index=True)
fake_news_corpus_spanish_final_df = fake_news_corpus_spanish_final_df.sample(frac=1).reset_index(drop=True)

print(f"Final dataset shape: {fake_news_corpus_spanish_final_df.shape}")
print(f"Final dataset count:\n{fake_news_corpus_spanish_final_df.count()}")
print(f"Final dataset labels: {fake_news_corpus_spanish_final_df['label'].unique()}")

Final dataset shape: (1543, 2)
Final dataset count:
label    1543
text     1543
dtype: int64
Final dataset labels: [0 1]


  train_df["label"] = train_df["label"].replace({'True': 1 , 'Fake': 0}).astype("int8")
  development_df["label"] = development_df["label"].replace({'True': 1 , 'Fake': 0}).astype("int8")


In [18]:
# Create a column with the number of characters
fake_news_corpus_spanish_final_df["length_words"] = fake_news_corpus_spanish_final_df["text"].str.split().str.len()

print("Articles length information: ")
print(fake_news_corpus_spanish_final_df["length_words"].describe())

fake_news_corpus_spanish_final_df = fake_news_corpus_spanish_final_df.drop(columns=["length_words"])

Articles length information: 
count    1543.000000
mean      438.546338
std       354.403013
min        20.000000
25%       231.000000
50%       338.000000
75%       528.000000
max      4006.000000
Name: length_words, dtype: float64


## Conclusion

After loading and reviewing all 4 datasets (including all files), I reached the conclusion that we have 2 datasets which containg long articles and 2 datasets that contain short articles:

* `Fake News Corpus Spanish` + `Spanish Fake and Real News` - Containg long articles.
* `Noticias falsas en español` + `Spanish Political Fake News` - Containg very short articles.

We can't combine all datasets together as that would bring unbalance to our model. What we could do is use the short articles dataset (aprox. 50k articles) to **pre-fine-tune** the model and then use the 'good' dataset (with long articles and around 2k articles) to do the final **fine-tune** of the model (a.k.a. Sequential Fine Tuning ).

The sequential fine tuning would go like this:

1. The Foundation Model (`dccuchile/bert-base-spanish-wwm-uncased`)
    * We start with a pre-trained BERT model that has already "read" a massive corpus of Spanish text (Wikipedia, legal documents, news).

2. Intermediate Tuning (`50k Dataset`)
    * In this stage, we take the generalist BERT model and train it on our large, noisy dataset to teach it the broad concept of Fake vs. Real news.
    * The result is a model that understands Fake News in a general sense but is somewhat biased toward short, lower-quality text.

3. Final Fine-Tuning (`2k Dataset`)
    * We take the model from Stage 2 and fine-tune it on our high-quality “Gold Standard” dataset using a low learning rate to adapt it to long-form context and reliable ground-truth labels.
    * This works because the model already knows Spanish from Stage 1 and already understands the general characteristics of Fake News from Stage 2.
    * This final step simply aligns those learned patterns with the specific structure, nuance, and depth of our high-quality long articles.

In [19]:
# Final Long Dataset
final_long_df = pd.concat(objs=[spanish_fake_real_news_final_df, fake_news_corpus_spanish_final_df], ignore_index=True)
final_long_df = final_long_df.sample(frac=1).reset_index(drop=True)

# Final Short Dataset
final_short_df = pd.concat(objs=[spanish_political_fake_news_df, true_fake_final_df], ignore_index=True)
final_short_df = final_short_df.sample(frac=1).reset_index(drop=True)

# Save csv files
final_long_df.to_csv("../data/final_long_dataset.csv", index=False)
final_short_df.to_csv("../data/final_short_dataset.csv", index=False)