# Fake News Detector - Model comparison

## Authors
- Jose Garzon
- Germán Patiño
- Alejandro Salazar

*Universidad EAFIT*

## References
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb
* https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

## Libraries

In [1]:
import os
import pandas as pd
from pathlib import Path

# Spark NLP
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

# Data handling
from pyspark.sql.functions import concat, coalesce, lit, when, col, isnan

# Spark ML
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier

# Model evaluation
from sklearn.metrics import classification_report, accuracy_score

## Path definitions

In [2]:
# Resolve the current working directory (the folder the notebook runs from)
current_dir = Path().resolve()

# Folder names relative to the project root; pathlib takes care of the separators
MODEL_FOLDER = "models"
LOGS_FOLDER = "logs"
DATA_FOLDER = Path("data") / "news_data"

model_path = current_dir.parent / MODEL_FOLDER
logs_path = current_dir.parent / LOGS_FOLDER
data_path = current_dir.parent / DATA_FOLDER

# Create the output folders if they do not exist yet
model_path.mkdir(parents=True, exist_ok=True)
logs_path.mkdir(parents=True, exist_ok=True)

## Spark initialization

In [3]:
ss = sparknlp.start(gpu=True) 

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", ss.version)

Spark NLP version 5.3.3
Apache Spark version: 3.5.1


## Data ingestion

In [4]:
# Read data set
data = ss.read.parquet(str(data_path))

In [5]:
# Combine the title and text columns into full_text, replacing null or NaN values with empty strings.
data = data.withColumn(
    "full_text",
    concat(
        coalesce(when(col("title").isNotNull() & ~isnan(col("title")), col("title")).otherwise(lit("")), lit("")),
        lit(" "),
        coalesce(when(col("text").isNotNull() & ~isnan(col("text")), col("text")).otherwise(lit("")), lit(""))
    )
)
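
In Spark SQL, `concat` returns null as soon as one of its inputs is null, which is why each column is passed through `coalesce` before concatenation. A minimal pure-Python sketch of that behaviour, with `None` standing in for null. The helper functions are illustrative stand-ins written for this example, not the PySpark functions themselves:

```python
def coalesce(*values):
    """Return the first value that is not None, mimicking SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def concat(*values):
    """Mimic SQL CONCAT: the result is None if any input is None."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def full_text(title, text):
    """Join title and text with a space, treating missing values as empty strings."""
    return concat(coalesce(title, ""), " ", coalesce(text, ""))


# Made-up rows: complete, missing title, missing text
rows = [("Headline", "Body"), (None, "Body only"), ("Title only", None)]
for title, text in rows:
    # Left: naive concat (None when a field is missing). Right: with coalesce.
    print(repr(concat(title, " ", text)), "->", repr(full_text(title, text)))
```

Without the `coalesce` step, every record with a missing title or body would end up with a null `full_text` and could not be classified.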


In [6]:
print("Data sample:")
data.show(5)

Data sample:
+--------------------+--------------------+-----+--------------------+
|               title|                text|label|           full_text|
+--------------------+--------------------+-----+--------------------+
|NAZI Flying Sauce...|. NAZI Flying Sau...|    1|NAZI Flying Sauce...|
|Mitt Romney Calls...|The most recent R...|    0|Mitt Romney Calls...|
|Clinton takes the...|PHILADELPHIA — Ch...|    0|Clinton takes the...|
|The Deteriorating...|Tweet Widget by Y...|    1|The Deteriorating...|
|Old rivals Obama ...|Washington (CNN) ...|    0|Old rivals Obama ...|
+--------------------+--------------------+-----+--------------------+
only showing top 5 rows



In [7]:
record_counts = data.count()
print(f"Total records: {record_counts}")

Total records: 66793


## TF-IDF Pipeline

In [11]:
%%time

document_assembler = DocumentAssembler() \
      .setInputCol("full_text") \
      .setOutputCol("document")

tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=20)  # minDocFreq: ignore terms that appear in fewer than 20 documents

nlp_pipeline_tf = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher,
            hashingTF,
            idf])

nlp_model_tf = nlp_pipeline_tf.fit(data)

processed_tf = nlp_model_tf.transform(data)

tfidf_records = processed_tf.count()
print(f"Transformed records: {tfidf_records}")

Transformed records: 66793
CPU times: total: 281 ms
Wall time: 6min 49s
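
The last two stages of the pipeline turn each list of stems into a fixed-length vector. The sketch below illustrates the idea in pure Python: tokens are hashed into a fixed number of buckets (the hashing trick), then each bucket count is weighted by `log((m + 1) / (df + 1))`, the IDF formula documented for Spark, with buckets below `minDocFreq` zeroed out. It is only an approximation of the Spark stages: it assumes MD5 as the hash (Spark uses MurmurHash3, so bucket indices will differ), and the sizes and documents are made up.

```python
import hashlib
import math
from collections import Counter

NUM_FEATURES = 16  # tiny for illustration; the notebook uses 10000
MIN_DOC_FREQ = 2   # the notebook uses 20


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index (MD5 here; Spark uses MurmurHash3)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Sparse term-frequency vector: {bucket index: count}."""
    return dict(Counter(bucket(t, num_features) for t in tokens))


def fit_idf(tf_vectors, min_doc_freq=MIN_DOC_FREQ):
    """IDF weight per bucket; buckets seen in too few documents get weight 0."""
    m = len(tf_vectors)
    doc_freq = Counter(i for vec in tf_vectors for i in vec)
    return {
        i: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for i, df in doc_freq.items()
    }


def tfidf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {i: tf * idf.get(i, 0.0) for i, tf in tf_vector.items()}


# Made-up, already stemmed documents
docs = [["trump", "senat", "vote"], ["clinton", "email", "vote"], ["ufo", "nazi", "saucer"]]
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors)
vectors = [tfidf(v, idf) for v in tf_vectors]
print(vectors)
```

Because different terms can land in the same bucket, the vector is lossy: a larger `numFeatures` lowers the collision rate at the cost of a wider feature vector.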


In [12]:
# Show vectorized features
print("Vectorized features:")
processed_tf.select('full_text','features','label').show()

Vectorized features:
+--------------------+--------------------+-----+
|           full_text|            features|label|
+--------------------+--------------------+-----+
|NAZI Flying Sauce...|(10000,[8,15,28,2...|    1|
|Mitt Romney Calls...|(10000,[15,23,51,...|    0|
|Clinton takes the...|(10000,[7,9,29,15...|    0|
|The Deteriorating...|(10000,[4,8,15,23...|    1|
|Old rivals Obama ...|(10000,[19,145,16...|    0|
|George Soros begi...|(10000,[30,58,70,...|    1|
|Donald Trump is d...|(10000,[33,230,26...|    0|
|How To Open Your ...|(10000,[1045,1480...|    1|
|VA Secretary Robe...|(10000,[43,55,281...|    0|
|Three Likely GOP ...|(10000,[230,323,3...|    0|
|Ryan Endorses Tru...|(10000,[323,332,3...|    0|
|Podesta WikiLeaks...|(10000,[8,15,20,2...|    1|
|3 Year Old Son of...|(10000,[5,7,15,24...|    1|
|The inane spectac...|(10000,[7,15,21,3...|    0|
|Clinton “Fixer”: ...|(10000,[7,13,15,2...|    1|
|Slain reporter's ...|(10000,[15,40,55,...|    0|
|Hillary Clinton’s...|(10000,

In [13]:
%%time

(trainingData, testData) = processed_tf.randomSplit([0.8, 0.2], seed = 100)
print(f"Training Dataset Count: {trainingData.count()}")
print(f"Test Dataset Count: {testData.count()}")

Training Dataset Count: 53489
Test Dataset Count: 13304
CPU times: total: 0 ns
Wall time: 14min 3s
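
The split sizes are not exactly 80/20: 53,489 of the 66,793 records (about 80.1%) ended up in the training set. This is expected, because a random split assigns every row through an independent random draw, so the weights are targets rather than exact proportions. A minimal pure-Python sketch of the idea follows. Spark's own sampler works per partition, so this illustrates the principle rather than reproducing its output; `random_split` is a hypothetical helper.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split using an independent uniform draw."""
    total = sum(weights)
    bounds = [c / total for c in accumulate(weights)]
    bounds[-1] = 1.0  # guard against floating-point rounding
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        target = next(k for k, upper in enumerate(bounds) if draw < upper)
        splits[target].append(row)
    return splits


rows = list(range(10_000))
train, test = random_split(rows, [0.8, 0.2], seed=100)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable for the same input, which is what allows both TF-IDF models to be compared on the same test rows.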


In [29]:
processed_tf.show(5)

+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|               title|                text|label|           full_text|            document|               token|          normalized|         cleanTokens|                stem|      token_features|         rawFeatures|            features|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|NAZI Flying Sauce...|. NAZI Flying Sau...|    1|NAZI Flying Sauce...|[{document, 0, 22...|[{token, 0, 3, NA...|[{token, 0, 3, NA...|[{token, 0, 3, NA...|[{token, 0, 3, na...|[nazi, fly, sauce...|(10000,[8,15,28,2...|(10000,[8,15,28,2...|
|Mitt Romney Calls...|The most recent R...| 

## Logistic Regression

### Training

In [14]:
%%time

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel_tf = lr.fit(trainingData)

CPU times: total: 31.2 ms
Wall time: 14min 4s


### Most important features

In [28]:
# Extract coefficients
coefficients = lrModel_tf.coefficients.toArray()
intercept = lrModel_tf.intercept

# Each coefficient belongs to one of the 10,000 hashed TF-IDF buckets of the
# "features" vector, not to a DataFrame column, so rank the bucket indices
# by the absolute value of their coefficient.
feature_importance = sorted(enumerate(coefficients), key=lambda x: abs(x[1]), reverse=True)

# Display the 20 most influential hashed features
print("Feature importance (hashed feature index: coefficient):")
for index, coeff in feature_importance[:20]:
    print(f"{index}: {coeff}")
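
The model is trained on the 10,000-dimensional hashed `features` vector, so each coefficient corresponds to a hash bucket rather than to a DataFrame column or a single word. To interpret a coefficient, the bucket has to be traced back to the tokens that hash into it. The sketch below shows one way to do that in pure Python. The hash function, the documents and the coefficients are all made up for illustration; in PySpark, `HashingTF.indexOf(term)` is assumed to play the role of the stand-in `index_of`.

```python
import hashlib
from collections import defaultdict

NUM_FEATURES = 32  # tiny for illustration; the notebook uses 10000


def index_of(term, num_features=NUM_FEATURES):
    """Stand-in hash: bucket index of a term (Spark uses MurmurHash3 instead)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def bucket_vocabulary(tokenized_docs, num_features=NUM_FEATURES):
    """Reverse lookup: bucket index -> set of tokens that hash into it."""
    buckets = defaultdict(set)
    for tokens in tokenized_docs:
        for token in tokens:
            buckets[index_of(token, num_features)].add(token)
    return buckets


def top_features(coefficients, buckets, k=2):
    """Rank buckets by |coefficient| and attach the tokens found in each bucket."""
    ranked = sorted(enumerate(coefficients), key=lambda p: abs(p[1]), reverse=True)
    return [(i, c, sorted(buckets.get(i, ()))) for i, c in ranked[:k]]


# Made-up stemmed documents and hypothetical coefficients
docs = [["nazi", "fly", "saucer"], ["senat", "vote", "bill"]]
buckets = bucket_vocabulary(docs)
coefficients = [0.0] * NUM_FEATURES
coefficients[index_of("saucer")] = 0.9   # pushes towards label 1
coefficients[index_of("senat")] = -0.7   # pushes towards label 0

top = top_features(coefficients, buckets)
for index, coeff, terms in top:
    print(index, coeff, terms)
```

A bucket may list several tokens when they collide, in which case the coefficient reflects their combined effect.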


### Inference

In [15]:
predictions_tf = lrModel_tf.transform(testData)
predictions_tf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|Potential Conflicts Around ...|[0.999999969762807,3.023719...|    0|       0.0|
|How Egypt’s Activists Becam...|[0.999997816840309,2.183159...|    0|       0.0|
|Special Report: 'Treacherou...|[0.9999974247048143,2.57529...|    0|       0.0|
|Katinka Hosszu and Her Husb...|[0.9999953700575471,4.62994...|    0|       0.0|
|Where Even Nightmares Are C...|[0.9999651017637986,3.48982...|    0|       0.0|
|How China Won the Keys to D...|[0.9999460763301009,5.39236...|    0|       0.0|
|Kevin McCarthy drops out of...|[0.9999168979911142,8.31020...|    0|       0.0|
|More Than 160 Republicans D...|[0.9999116677458976,8.83322...|    0|       0.0|
|United States v. Texas, the...|[0.9999029113864484,9.70886...|    0|       0.0|
|**Livewire** President Trum

### Evaluation

In [16]:

# Collect labels and predictions in a single pass so the rows stay aligned
results_tf = predictions_tf.select("label", "prediction").toPandas()

print(classification_report(results_tf.label, results_tf.prediction))
print(accuracy_score(results_tf.label, results_tf.prediction))

              precision    recall  f1-score   support

           0       0.94      0.93      0.93      7140
           1       0.92      0.93      0.92      6164

    accuracy                           0.93     13304
   macro avg       0.93      0.93      0.93     13304
weighted avg       0.93      0.93      0.93     13304

0.9296452194828623
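
For reference, the figures in the report are simple ratios over the confusion counts of each class. A minimal pure-Python sketch of how precision, recall, F1 and accuracy are derived, using made-up labels rather than the notebook's predictions (`binary_report` is a hypothetical helper, not part of scikit-learn):

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = fake, 0 = real
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]
report = binary_report(y_true, y_pred, positive=1)
print(report)
```

Here precision answers "of the articles flagged as fake, how many were fake", while recall answers "of the fake articles, how many were flagged".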


### Store the model

In [73]:
model_name = "LR_Model_tf"
model_filename = os.path.join(model_path, model_name)
# Overwrite any previous copy so the cell can be re-run
lrModel_tf.write().overwrite().save(model_filename)

## Random Forest

### Training

In [74]:
%%time

rf = RandomForestClassifier(labelCol="label",
                            featuresCol="features",
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)

rfModel = rf.fit(trainingData)

CPU times: total: 46.9 ms
Wall time: 6min 16s


### Inference

In [75]:
predictions_rf = rfModel.transform(testData)
predictions_rf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|Syria Strike Puts U.S. Rela...|[0.7125964365460536,0.28740...|    0|       0.0|
|Senate takes step toward pa...|[0.7104522226548046,0.28954...|    0|       0.0|
|Trump outlines plans for fi...|[0.7091418132233812,0.29085...|    0|       0.0|
|Trump urges 'strong and swi...|[0.7076387378880427,0.29236...|    0|       0.0|
|Tillerson urges 'new approa...|[0.7060153083002274,0.29398...|    0|       0.0|
|Separate mothers and childr...|[0.7053842739279532,0.29461...|    0|       0.0|
|Palestinians to snub Pence ...|[0.7045675853958024,0.29543...|    0|       0.0|
|Carson signals exit, U.S. R...|[0.7037308215783893,0.29626...|    0|       0.0|
|Facing revolt on healthcare...|[0.7020194300912339,0.29798...|    0|       0.0|
|Executive actions ready to 
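
The most confident Random Forest predictions stay near 0.71, far from the values close to 1.0 produced by the logistic regression. One plausible reason is that a forest's probability is an average over its trees, and shallow trees (`maxDepth = 4`) rarely end in pure leaves. A minimal sketch of that averaging, with made-up per-tree class distributions (`forest_probability` is a hypothetical helper):

```python
def forest_probability(tree_distributions):
    """Average the per-tree class distributions into one probability vector."""
    n_trees = len(tree_distributions)
    n_classes = len(tree_distributions[0])
    return [
        sum(dist[c] for dist in tree_distributions) / n_trees
        for c in range(n_classes)
    ]


def predict(probability):
    """Predicted class = index of the highest averaged probability."""
    return max(range(len(probability)), key=lambda c: probability[c])


# Made-up leaf distributions [P(label 0), P(label 1)] from three shallow trees
trees = [[0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
probability = forest_probability(trees)
print(probability, "->", predict(probability))
```

Even when every tree agrees on the class, the averaged probability is only as extreme as the individual leaves, so it tends to be less sharp than a logistic output.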

### Evaluation

In [76]:
# Collect labels and predictions in a single pass so the rows stay aligned
results_rf = predictions_rf.select("label", "prediction").toPandas()

print(classification_report(results_rf.label, results_rf.prediction))
print(accuracy_score(results_rf.label, results_rf.prediction))

              precision    recall  f1-score   support

           0       0.78      0.95      0.86      7213
           1       0.93      0.69      0.79      6246

    accuracy                           0.83     13459
   macro avg       0.85      0.82      0.83     13459
weighted avg       0.85      0.83      0.83     13459

0.8323798201946653


### Store the model

In [77]:
model_name = "RF_Model"
model_filename = os.path.join(model_path, model_name)
# Overwrite any previous copy so the cell can be re-run
rfModel.write().overwrite().save(model_filename)

## BERT Pipeline

In [23]:
# Split data into training and test sets (fixed seed so the split is repeatable)
trainDataset, testDataset = data.randomSplit([0.8, 0.2], seed=100)

In [24]:
document_assembler = DocumentAssembler() \
    .setInputCol("full_text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained(name="bert_base_uncased", lang="en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Average the token embeddings into a single vector per document
embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setBatchSize(8) \
    .setEnableOutputLogs(True) \
    .setOutputLogsPath(str(logs_path))  # write training logs to the logs folder created above

bert_clf_pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        bert_embeddings,
        embeddingsSentence,
        classifierdl
    ])

bert_base_uncased download started this may take some time.
Approximate size to download 392,5 MB
[OK!]
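
The `AVERAGE` pooling strategy collapses the per-token BERT vectors of an article into one fixed-length vector, which is what the classifier receives. A minimal pure-Python sketch of that step, using made-up 2-dimensional vectors instead of the 768-dimensional output of BERT base (`average_pooling` is a hypothetical helper):

```python
def average_pooling(token_embeddings):
    """Element-wise mean of the token vectors, giving one document vector."""
    if not token_embeddings:
        raise ValueError("at least one token embedding is required")
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n_tokens for d in range(dim)]


# Made-up embeddings for a three-token document
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
document_vector = average_pooling(tokens)
print(document_vector)
```

The resulting vector has the same dimension no matter how long the article is, at the cost of discarding word order.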


### Training

In [25]:
%%time
bert_clf_pipelineModel = bert_clf_pipeline.fit(trainDataset)

CPU times: total: 516 ms
Wall time: 1h 59min 27s


### Inference

In [26]:
preds = bert_clf_pipelineModel.transform(testDataset)
preds_df = preds.select('label','full_text',"class.result").toPandas()
preds_df['result'] = preds_df['result'].apply(lambda x : int(x[0]))

### Evaluation

In [27]:

print(classification_report(preds_df.label, preds_df.result))
print(accuracy_score(preds_df.label, preds_df.result))

              precision    recall  f1-score   support

           0       0.97      0.93      0.95      7124
           1       0.92      0.97      0.95      6200

    accuracy                           0.95     13324
   macro avg       0.95      0.95      0.95     13324
weighted avg       0.95      0.95      0.95     13324

0.9487391173821675
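
The accuracies and training wall times reported in the cells above can be gathered for a side-by-side view. The snippet only restates figures printed in this notebook. Note that the three reports were computed on different test sets (13,304, 13,459 and 13,324 rows), so the comparison is indicative rather than exact.

```python
# Figures copied from the cell outputs above (accuracy, training wall time, test rows)
reported = {
    "Logistic Regression (TF-IDF)": {
        "accuracy": 0.9296452194828623,
        "train_seconds": 14 * 60 + 4,
        "test_rows": 13304,
    },
    "Random Forest (TF-IDF)": {
        "accuracy": 0.8323798201946653,
        "train_seconds": 6 * 60 + 16,
        "test_rows": 13459,
    },
    "BERT + ClassifierDL": {
        "accuracy": 0.9487391173821675,
        "train_seconds": 1 * 3600 + 59 * 60 + 27,
        "test_rows": 13324,
    },
}

# Rank the models from most to least accurate
ranking = sorted(reported.items(), key=lambda kv: kv[1]["accuracy"], reverse=True)
for name, metrics in ranking:
    minutes = metrics["train_seconds"] / 60
    print(f"{name:<30} accuracy={metrics['accuracy']:.4f}  "
          f"train={minutes:6.1f} min  test rows={metrics['test_rows']}")
```

The TF-IDF training times exclude the 6min 49s spent fitting the preprocessing pipeline, whereas the BERT figure covers the whole pipeline fit.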


### Store the model


In [28]:
bert_clf_pipelineModel.stages

[DocumentAssembler_e0bee2632907,
 REGEX_TOKENIZER_7affb0229300,
 BERT_EMBEDDINGS_4fbd72cbda5a,
 SentenceEmbeddings_fcafc5ffbad2,
 ClassifierDLModel_2a5937e2406e]

In [59]:
model_name = "BERT_ClassifierDL_Layer"
# Convert the local path to a file:// URI before handing it to Spark
model_uri = (model_path / model_name).as_uri()
print(f"Saving model to {model_uri}")
# Save only the trained ClassifierDL stage (the last stage of the pipeline)
bert_clf_pipelineModel.stages[-1].write().overwrite().save(model_uri)

Saving model to file:///C:/Users/jmgarzonv/Desktop/EAFIT/MMDS/models/BERT_ClassifierDL_Layer
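
The `as_uri()` call is what turns the Windows path into the `file:///C:/...` form printed above. A short sketch of its behaviour using `pathlib`'s pure path classes, which run on any operating system. The paths are made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Absolute paths are converted to file:// URIs with forward slashes
windows_uri = PureWindowsPath(r"C:\project\models\BERT_ClassifierDL_Layer").as_uri()
posix_uri = PurePosixPath("/project/models/BERT_ClassifierDL_Layer").as_uri()
print(windows_uri)
print(posix_uri)

# Relative paths cannot be expressed as a file URI and raise ValueError
try:
    PurePosixPath("models/BERT_ClassifierDL_Layer").as_uri()
    relative_ok = True
except ValueError:
    relative_ok = False
print("relative path accepted:", relative_ok)
```

This is why `model_path` is built from a resolved, absolute directory before `as_uri()` is called.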
