# Fake News Detector - Model comparison

## Authors
- Jose Garzon
- Germán Patiño
- Alejandro Salazar

*Universidad EAFIT*

## References
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb
* https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

## Libraries

In [60]:
import os
import pandas as pd
from pathlib import Path

# Spark NLP
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

# Data handling
from pyspark.sql.functions import concat, coalesce, lit, when, col, isnan

# Spark ML
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier

# Model evaluation
from sklearn.metrics import classification_report, accuracy_score

## Path definitions

In [61]:
# Resolve the current working directory; the project folders live one level up
current_dir = Path().resolve()

# Plain relative names avoid invalid escape sequences such as "\m" and stay portable
MODEL_FOLDER = "models"
LOGS_FOLDER = "logs"
DATA_FOLDER = Path("data") / "news_data"

model_path = current_dir.parent / MODEL_FOLDER
logs_path = current_dir.parent / LOGS_FOLDER
data_path = current_dir.parent / DATA_FOLDER

# Create the output folders if they do not exist yet
os.makedirs(model_path, exist_ok=True)
os.makedirs(logs_path, exist_ok=True)

## Spark initialization

In [62]:
ss = sparknlp.start(gpu=True) 

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", ss.version)

Spark NLP version 5.3.3
Apache Spark version: 3.5.1


## Data ingestion

In [63]:
# Read data set
data = ss.read.parquet(str(data_path))

In [64]:
# Combine title and text into the full_text column, replacing null or NaN values with an empty string.
data = data.withColumn(
    "full_text",
    concat(
        coalesce(when(col("title").isNotNull() & ~isnan(col("title")), col("title")).otherwise(lit("")), lit("")),
        lit(" "),
        coalesce(when(col("text").isNotNull() & ~isnan(col("text")), col("text")).otherwise(lit("")), lit(""))
    )
)


In [65]:
print("Data sample:")
data.show(5)

Data sample:
+--------------------+--------------------+-----+--------------------+
|               title|                text|label|           full_text|
+--------------------+--------------------+-----+--------------------+
|The Week In Pictu...|New Report Finds ...|    1|The Week In Pictu...|
|State Department ...|WASHINGTON – The ...|    0|State Department ...|
|If Hillary Clinto...|Archives Michael ...|    1|If Hillary Clinto...|
|Extreme rhetoric ...|The use of extrem...|    0|Extreme rhetoric ...|
|UFO Investigator ...|link a reply to: ...|    1|UFO Investigator ...|
+--------------------+--------------------+-----+--------------------+
only showing top 5 rows



In [66]:
record_counts = data.count()
print(f"Total records: {record_counts}")

Total records: 66793


## TF-IDF Pipeline

In [67]:
%%time

document_assembler = DocumentAssembler() \
      .setInputCol("full_text") \
      .setOutputCol("document")

tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

# minDocFreq=20: terms found in fewer than 20 documents get a weight of 0
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=20)

nlp_pipeline_tf = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher,
            hashingTF,
            idf])

nlp_model_tf = nlp_pipeline_tf.fit(data)

processed_tf = nlp_model_tf.transform(data)

tfidf_records = processed_tf.count()
print(f"Transformed records: {tfidf_records}")

Transformed records: 66793
CPU times: total: 46.9 ms
Wall time: 1min 22s
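
To make the last two stages of the pipeline concrete, here is a minimal pure-Python sketch of the hashing trick followed by IDF weighting. It is an illustration under assumptions, not Spark's implementation: Spark's `HashingTF` uses MurmurHash3, whereas the sketch uses `zlib.crc32` as a stand-in, and the toy documents, the tiny `NUM_FEATURES`, and the helper names are made up. The IDF weight `log((m + 1) / (df + 1))` follows the formula documented for Spark ML, where `m` is the number of documents and `df` the number of documents containing the term.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical; the notebook uses numFeatures=10000


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for tok in tokens:
        # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3.
        counts[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return dict(counts)


def fit_idf(tf_vectors, min_doc_freq=0):
    """Compute one IDF weight per bucket as log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())
    return {
        # Buckets seen in fewer than min_doc_freq documents get weight 0.
        idx: (math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
        for idx, df in doc_freq.items()
    }


def transform(tf_vector, idf):
    """Scale raw term frequencies by the fitted IDF weights."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in tf_vector.items()}


# Toy, already-stemmed documents (made up for illustration).
docs = [
    ["trump", "senat", "vote"],
    ["ufo", "alien", "vote"],
    ["senat", "vote", "bill"],
]
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors, min_doc_freq=1)
tfidf_vectors = [transform(v, idf) for v in tf_vectors]
```

A term that appears in every document, such as `vote` here, ends up with an IDF of `log(1) = 0`, so it contributes nothing to the feature vector. Distinct terms can also collide in the same bucket, which is the price paid for a fixed-size vector.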


In [68]:
# Show vectorized features
print("Vectorized features:")
processed_tf.select('full_text','features','label').show()

Vectorized features:
+--------------------+--------------------+-----+
|           full_text|            features|label|
+--------------------+--------------------+-----+
|The Week In Pictu...|(10000,[68,276,32...|    1|
|State Department ...|(10000,[8,130,221...|    0|
|If Hillary Clinto...|(10000,[0,7,70,15...|    1|
|Extreme rhetoric ...|(10000,[15,46,88,...|    0|
|UFO Investigator ...|(10000,[5,57,193,...|    1|
|The Left Turns on...|(10000,[15,236,29...|    1|
|Comment on “This ...|(10000,[7,8,29,33...|    1|
|Trump controlled ...|(10000,[15,84,96,...|    1|
|Blame Government,...|(10000,[7,19,33,7...|    1|
|Trump suggests he...|(10000,[193,236,2...|    0|
|Sniff your undera...|(10000,[453,1468,...|    1|
|Selected Not Elec...|(10000,[15,30,55,...|    1|
|What Trump Will N...|(10000,[7,24,55,7...|    0|
|The real reason t...|(10000,[63,70,145...|    0|
|Clinton Says She ...|(10000,[158,281,3...|    0|
|How the swing vot...|(10000,[0,15,29,1...|    0|
|Islamic State cla...|(10000,

In [69]:
%%time

(trainingData, testData) = processed_tf.randomSplit([0.8, 0.2], seed = 100)
print(f"Training Dataset Count: {trainingData.count()}")
print(f"Test Dataset Count: {testData.count()}")

Training Dataset Count: 53334
Test Dataset Count: 13459
CPU times: total: 0 ns
Wall time: 2min 35s


## Logistic Regression

### Training

In [70]:
%%time

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel_tf = lr.fit(trainingData)

CPU times: total: 31.2 ms
Wall time: 2min 33s


### Inference

In [71]:
predictions_tf = lrModel_tf.transform(testData)
predictions_tf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|The Secret History of Colom...|[0.9999999288621939,7.11378...|    0|       0.0|
|A Saudi Morals Enforcer Cal...|[0.999996911835493,3.088164...|    0|       0.0|
|Where Even Nightmares Are C...|[0.9999879259482756,1.20740...|    0|       0.0|
|Katinka Hosszu and Her Husb...|[0.9999834210454314,1.65789...|    0|       0.0|
|As campaigns launch, poll f...|[0.9999475755424966,5.24244...|    0|       0.0|
|How the Obama White House r...|[0.999931316742936,6.868325...|    0|       0.0|
|United States v. Texas, the...|[0.999872649353562,1.273506...|    0|       0.0|
|The Daily 202: How Democrat...|[0.9998405588006547,1.59441...|    0|       0.0|
|Factbox: Trump fills top jo...|[0.9998289655949362,1.71034...|    0|       0.0|
|Factbox: Trump finishes fil

### Evaluation

In [72]:

# Collect labels and predictions in a single pass so the rows stay aligned
results_tf = predictions_tf.select("label", "prediction").toPandas()

print(classification_report(results_tf.label, results_tf.prediction))
print(accuracy_score(results_tf.label, results_tf.prediction))

              precision    recall  f1-score   support

           0       0.94      0.93      0.94      7213
           1       0.92      0.94      0.93      6246

    accuracy                           0.93     13459
   macro avg       0.93      0.93      0.93     13459
weighted avg       0.93      0.93      0.93     13459

0.9309755553904451
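
The figures in the report follow from per-class confusion counts. Here is a minimal sketch of how precision, recall, F1, and accuracy are derived for one class. It uses made-up labels rather than the notebook's predictions, and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Derive precision, recall, F1 and accuracy from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the articles flagged as fake, how many were fake?", while recall answers "of the fake articles, how many were flagged?". Reading the two together is more informative than accuracy alone when the models are compared.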


### Store the model

In [73]:
model_name = "LR_Model_tf"
model_filename = os.path.join(model_path, model_name)
lrModel_tf.save(model_filename)

## Random Forest

### Training

In [74]:
%%time

rf = RandomForestClassifier(labelCol="label",
                            featuresCol="features",
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)

rfModel = rf.fit(trainingData)

CPU times: total: 46.9 ms
Wall time: 6min 16s


### Inference

In [75]:
predictions_rf = rfModel.transform(testData)
predictions_rf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|Syria Strike Puts U.S. Rela...|[0.7125964365460536,0.28740...|    0|       0.0|
|Senate takes step toward pa...|[0.7104522226548046,0.28954...|    0|       0.0|
|Trump outlines plans for fi...|[0.7091418132233812,0.29085...|    0|       0.0|
|Trump urges 'strong and swi...|[0.7076387378880427,0.29236...|    0|       0.0|
|Tillerson urges 'new approa...|[0.7060153083002274,0.29398...|    0|       0.0|
|Separate mothers and childr...|[0.7053842739279532,0.29461...|    0|       0.0|
|Palestinians to snub Pence ...|[0.7045675853958024,0.29543...|    0|       0.0|
|Carson signals exit, U.S. R...|[0.7037308215783893,0.29626...|    0|       0.0|
|Facing revolt on healthcare...|[0.7020194300912339,0.29798...|    0|       0.0|
|Executive actions ready to 

### Evaluation

In [76]:
# Collect labels and predictions in a single pass so the rows stay aligned
results_rf = predictions_rf.select("label", "prediction").toPandas()

print(classification_report(results_rf.label, results_rf.prediction))
print(accuracy_score(results_rf.label, results_rf.prediction))

              precision    recall  f1-score   support

           0       0.78      0.95      0.86      7213
           1       0.93      0.69      0.79      6246

    accuracy                           0.83     13459
   macro avg       0.85      0.82      0.83     13459
weighted avg       0.85      0.83      0.83     13459

0.8323798201946653


### Store the model

In [77]:
model_name = "RF_Model"
model_filename = os.path.join(model_path, model_name)
rfModel.save(model_filename)

## BERT Pipeline

In [23]:
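# Split the raw data into training and test sets.
# No seed is set here, so this split is not reproducible and differs from the TF-IDF split above.
trainDataset, testDataset = data.randomSplit([0.8, 0.2])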

In [24]:
document_assembler = DocumentAssembler() \
    .setInputCol("full_text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained(name='bert_base_uncased', lang='en') \
    .setInputCols(["document",'token'])\
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(10)\
    .setLr(0.001)\
    .setBatchSize(8)\
    .setEnableOutputLogs(True) \
    .setOutputLogsPath('logs')

bert_clf_pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        bert_embeddings,
        embeddingsSentence,
        classifierdl
])

bert_base_uncased download started this may take some time.
Approximate size to download 392,5 MB
[OK!]
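
With the `AVERAGE` pooling strategy, `SentenceEmbeddings` collapses a document's per-token vectors into one fixed-size vector by taking their element-wise mean, and that vector is what the classifier receives. Here is a minimal sketch with made-up 4-dimensional vectors (`bert_base_uncased` produces 768-dimensional ones); `average_pool` is a hypothetical helper, not a Spark NLP function.

```python
def average_pool(token_vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Made-up embeddings for a three-token document.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
]
document_vector = average_pool(token_vectors)
print(document_vector)
```

Averaging keeps the output size independent of document length, at the cost of discarding word order.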


### Training

In [25]:
%%time
bert_clf_pipelineModel = bert_clf_pipeline.fit(trainDataset)

CPU times: total: 516 ms
Wall time: 1h 59min 27s


### Inference

In [26]:
preds = bert_clf_pipelineModel.transform(testDataset)
preds_df = preds.select('label','full_text',"class.result").toPandas()
preds_df['result'] = preds_df['result'].apply(lambda x : int(x[0]))

### Evaluation

In [27]:

print(classification_report(preds_df.label, preds_df.result))
print(accuracy_score(preds_df.label, preds_df.result))

              precision    recall  f1-score   support

           0       0.97      0.93      0.95      7124
           1       0.92      0.97      0.95      6200

    accuracy                           0.95     13324
   macro avg       0.95      0.95      0.95     13324
weighted avg       0.95      0.95      0.95     13324

0.9487391173821675


### Store the model


In [28]:
bert_clf_pipelineModel.stages

[DocumentAssembler_e0bee2632907,
 REGEX_TOKENIZER_7affb0229300,
 BERT_EMBEDDINGS_4fbd72cbda5a,
 SentenceEmbeddings_fcafc5ffbad2,
 ClassifierDLModel_2a5937e2406e]

In [59]:
model_name = "BERT_ClassifierDL_Layer"
model_filepath = model_path / model_name
model_filepath = model_filepath.as_uri()
print(f"Saving model to {model_filepath}")
# Save only the trained ClassifierDL stage (the last one in the pipeline)
bert_clf_pipelineModel.stages[-1].write().overwrite().save(model_filepath)

Saving model to file:///C:/Users/jmgarzonv/Desktop/EAFIT/MMDS/models/BERT_ClassifierDL_Layer
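
To put the three models side by side, the accuracies printed in the evaluation cells above can be collected and ranked. This is a small sketch: the dictionary keys are labels chosen here for illustration, and the BERT figure comes from a different, unseeded split (13324 test rows versus 13459), so the comparison is indicative rather than exact.

```python
# Accuracy values copied from the evaluation cells of this notebook.
accuracies = {
    "Logistic Regression (TF-IDF)": 0.9309755553904451,
    "Random Forest (TF-IDF)": 0.8323798201946653,
    "BERT + ClassifierDL": 0.9487391173821675,
}

# Rank the models from highest to lowest accuracy.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)

for name, accuracy in ranked:
    print(f"{name:<30} {accuracy:.4f}")
```

Training cost is also worth weighing against these numbers: the wall times reported above are about 2.5 minutes for logistic regression, about 6 minutes for the random forest, and about 2 hours for the BERT pipeline.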
