# Fake News Detector - Model comparison

## Authors
- Jose Garzon
- Germán Patiño
- Alejandro Salazar

*Universidad EAFIT*

## References
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb
* https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

In [1]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
import sparknlp

ss = sparknlp.start(gpu=True) 

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", ss.version)

Spark NLP version 5.3.3
Apache Spark version: 3.5.1


In [2]:
# Function to load the dataset with pandas and convert it to a Spark DataFrame
from pyspark.sql.types import StringType, IntegerType, StructField, StructType

def read_data(path):
    # Explicit schema: two nullable text columns and an integer label
    schema = StructType([
        StructField('title', StringType(), True),
        StructField('text', StringType(), True),
        StructField('label', IntegerType(), True)])
    pd_df = pd.read_parquet(path)
    # pd_df = pd.read_csv(path).drop('Unnamed: 0', axis=1)
    sp_df = ss.createDataFrame(pd_df, schema=schema)
    return sp_df

In [3]:
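# Read the data set. read_data() loads it with pd.read_parquet, so the path must point to a Parquet file;
# for a plain CSV file, switch to the commented-out pd.read_csv line inside read_data().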
path_data= 'WELFake_Dataset.csv'
data= read_data(path_data)

In [4]:
from pyspark.sql.functions import concat, coalesce, lit, when, col, isnan

# Define the transformation logic to create the full_text column
data = data.withColumn(
    "full_text",
    concat(
        coalesce(when(col("title").isNotNull() & ~isnan(col("title")), col("title")).otherwise(lit("")), lit("")),
        lit(" "),
        coalesce(when(col("text").isNotNull() & ~isnan(col("text")), col("text")).otherwise(lit("")), lit(""))
    )
)


In [5]:
data.show(5)

+--------------------+--------------------+-----+--------------------+
|               title|                text|label|           full_text|
+--------------------+--------------------+-----+--------------------+
|LAW ENFORCEMENT O...|No comment is exp...|    1|LAW ENFORCEMENT O...|
|                 NaN|Did they post the...|    1| Did they post th...|
|UNBELIEVABLE! OBA...| Now, most of the...|    1|UNBELIEVABLE! OBA...|
|Bobby Jindal, rai...|A dozen political...|    0|Bobby Jindal, rai...|
|SATAN 2: Russia u...|The RS-28 Sarmat ...|    1|SATAN 2: Russia u...|
+--------------------+--------------------+-----+--------------------+
only showing top 5 rows
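
The row with the missing title shows the effect of this logic: `full_text` starts with a space, because the empty title is still joined to the body with `" "`. The same rule can be sketched in plain Python. `clean` and `build_full_text` are hypothetical helpers, not part of the notebook, and treating the literal string `"NaN"` as missing is an assumption based on the output above.

```python
import math

def clean(value):
    """Return '' for None, float NaN, or the literal string 'NaN' (assumption); else the text."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    if value == "NaN":
        return ""
    return str(value)

def build_full_text(title, text):
    """Join title and body with a single space, replacing missing parts by ''."""
    return clean(title) + " " + clean(text)

# Placeholder strings, not rows from the dataset
print(repr(build_full_text("NaN", "some body text")))
print(repr(build_full_text("Some title", "some body text")))
```

If the leading space matters downstream, stripping the result would remove it; the notebook keeps it as is.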



In [6]:
trainDataset, testDataset= data.randomSplit([0.8, 0.2])

## BERT Pipeline

In [7]:
document_assembler = DocumentAssembler() \
    .setInputCol("full_text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained(name='bert_base_uncased', lang='en') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier_dl = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setBatchSize(8) \
    .setEnableOutputLogs(True) \
    .setOutputLogsPath('logs')

bert_clf_pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        bert_embeddings,
        embeddingsSentence,
        classifier_dl
])

bert_base_uncased download started this may take some time.
Approximate size to download 392,5 MB
[OK!]
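
`SentenceEmbeddings` with the `AVERAGE` pooling strategy collapses the per-token BERT vectors of a document into one fixed-size vector, which is the input `ClassifierDLApproach` expects. A minimal sketch of that pooling step in plain Python, using made-up 3-dimensional vectors (real `bert_base_uncased` vectors have 768 dimensions). `average_pool` is a hypothetical helper, not a Spark NLP function.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Three toy token vectors standing in for BERT outputs (illustrative values)
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
print(average_pool(tokens))  # [2.0, 2.0, 1.0]
```

Averaging discards word order, so long articles and short ones end up in the same vector space regardless of length.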


### Training

In [8]:
%%time
bert_clf_pipelineModel = bert_clf_pipeline.fit(trainDataset)

CPU times: total: 1.59 s
Wall time: 2h 8min 48s


### Inference

In [9]:
preds = bert_clf_pipelineModel.transform(testDataset)
preds_df = preds.select('label','full_text',"class.result").toPandas()
preds_df['result'] = preds_df['result'].apply(lambda x : int(x[0]))

### Evaluation

In [10]:
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(preds_df.label, preds_df.result))
print(accuracy_score(preds_df.label, preds_df.result))

              precision    recall  f1-score   support

           0       0.96      0.95      0.96      7047
           1       0.95      0.97      0.96      7464

    accuracy                           0.96     14511
   macro avg       0.96      0.96      0.96     14511
weighted avg       0.96      0.96      0.96     14511

0.9588587967748604
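
The per-class figures in the report come from simple counts of true positives, false positives and false negatives. The sketch below recomputes them in plain Python on a toy label list (not the notebook's predictions). The function names are hypothetical and only mirror what `classification_report` and `accuracy_score` report.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
print(accuracy(y_true, y_pred))             # 0.75
```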


### Store the model


In [11]:
bert_clf_pipelineModel.stages

[DocumentAssembler_a2e80a272cff,
 REGEX_TOKENIZER_981c8ad0aa49,
 BERT_EMBEDDINGS_4fbd72cbda5a,
 SentenceEmbeddings_36456b97f769,
 ClassifierDLModel_d66c1e805494]

In [12]:
bert_clf_pipelineModel.stages[-1].write().overwrite().save('ClassifierDLModel')

## TF-IDF Pipeline

In [7]:
from pyspark.ml.feature import HashingTF, IDF

In [8]:
%%time

document_assembler = DocumentAssembler() \
      .setInputCol("full_text") \
      .setOutputCol("document")

tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=20)  # minDocFreq=20: terms seen in fewer than 20 documents get zero weight

nlp_pipeline_tf = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher,
            hashingTF,
            idf])

nlp_model_tf = nlp_pipeline_tf.fit(data)

processed_tf = nlp_model_tf.transform(data)

processed_tf.count()

CPU times: total: 46.9 ms
Wall time: 1min 28s


72134
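
`HashingTF` maps each stemmed token to one of `numFeatures` buckets and counts occurrences; `IDF` then down-weights buckets that occur in many documents. A rough sketch of both steps in plain Python follows. Assumptions: `zlib.crc32` stands in for the MurmurHash3 function Spark uses, so bucket indices will not match Spark's; the smoothed formula `log((N + 1) / (df + 1))` is the one given in the Spark MLlib documentation; all function names are hypothetical.

```python
import math
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Sparse term-frequency vector as {bucket_index: count}."""
    # crc32 is a stand-in hash (assumption); Spark uses MurmurHash3
    return dict(Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens))

def fit_idf(tf_docs, min_doc_freq=0):
    """IDF weight per bucket; buckets below min_doc_freq get weight 0."""
    n_docs = len(tf_docs)
    doc_freq = Counter(idx for doc in tf_docs for idx in doc)
    return {
        idx: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for idx, df in doc_freq.items()
    }

def tf_idf(tf_doc, idf):
    """Scale each term count by its IDF weight."""
    return {idx: count * idf.get(idx, 0.0) for idx, count in tf_doc.items()}

# Toy stemmed documents for illustration only
docs = [["vote", "vote", "senat"], ["vote", "deal"], ["brexit", "deal"]]
tf_docs = [hashing_tf(d, num_features=10000) for d in docs]
idf = fit_idf(tf_docs, min_doc_freq=2)
for doc in tf_docs:
    print(tf_idf(doc, idf))
```

Because different tokens can hash to the same bucket, a larger `numFeatures` trades memory for fewer collisions.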

In [9]:
processed_tf.select('full_text','features','label').show()

+--------------------+--------------------+-----+
|           full_text|            features|label|
+--------------------+--------------------+-----+
|LAW ENFORCEMENT O...|(10000,[29,71,88,...|    1|
| Did they post th...|(10000,[568,2460,...|    1|
|UNBELIEVABLE! OBA...|(10000,[639,1226,...|    1|
|Bobby Jindal, rai...|(10000,[15,24,39,...|    0|
|SATAN 2: Russia u...|(10000,[33,58,63,...|    1|
|About Time! Chris...|(10000,[171,387,5...|    1|
|DR BEN CARSON TAR...|(10000,[281,472,1...|    1|
|HOUSE INTEL CHAIR...|(10000,[472,681,7...|    1|
|Sports Bar Owner ...|(10000,[24,35,88,...|    1|
|Latest Pipeline L...|(10000,[104,116,1...|    1|
| GOP Senator Just...|(10000,[35,58,150...|    1|
|May Brexit offer ...|(10000,[8,23,29,1...|    0|
|Schumer calls on ...|(10000,[15,51,52,...|    0|
|WATCH: HILARIOUS ...|(10000,[493,568,7...|    1|
|No Change Expecte...|(10000,[4,29,151,...|    0|
|Billionaire Odebr...|(10000,[63,158,25...|    0|
|BRITISH WOMAN LOS...|(10000,[15,57,158...|    1|


In [10]:
%%time

(trainingData, testData) = processed_tf.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 57661
Test Dataset Count: 14473
CPU times: total: 46.9 ms
Wall time: 2min 33s


## Logistic Regression

In [11]:
from pyspark.ml.classification import LogisticRegression

### Training

In [12]:
%%time

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel_tf = lr.fit(trainingData)

CPU times: total: 78.1 ms
Wall time: 2min 33s
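
With `elasticNetParam=0` the penalty is pure L2 (ridge), and `regParam=0.3` sets its strength. The Spark MLlib documentation defines the regularizer as `regParam * ((1 - alpha) / 2 * ||w||^2 + alpha * ||w||_1)`, where `alpha` is `elasticNetParam`. The sketch below evaluates that penalty and a sigmoid prediction for made-up weights. The function names are hypothetical, and the sketch ignores the feature standardization Spark applies by default.

```python
import math

def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Regularization term added to the logistic loss."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * ((1 - elastic_net_param) / 2 * l2_squared + elastic_net_param * l1)

def predict_proba(weights, bias, features):
    """Probability of class 1 under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights, not taken from lrModel_tf
weights = [0.5, -1.0, 2.0]
print(elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0))  # about 0.7875
print(predict_proba(weights, bias=0.0, features=[0.0, 0.0, 0.0]))          # 0.5
```

A relatively large `regParam` like 0.3 shrinks the weights, which can cost some accuracy in exchange for a more stable model.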


### Inference

In [13]:
predictions_tf = lrModel_tf.transform(testData)
predictions_tf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|Islamic State Claims Respon...|[0.9999999999986016,1.39843...|    0|       0.0|
|Trump vs. Congress: Now Wha...|[0.9999999983208085,1.67919...|    0|       0.0|
|When to Leave on Your Thank...|[0.9999623778332197,3.76221...|    0|       0.0|
|Taxpayers Will Defend Trump...|[0.9999229752678362,7.70247...|    0|       0.0|
|Tragedy Made Steve Kerr See...|[0.9998925738626535,1.07426...|    0|       0.0|
|Factbox: Trump finishes fil...|[0.9997803775513411,2.19622...|    0|       0.0|
|More Than 160 Republicans D...|[0.9997710723021959,2.28927...|    0|       0.0|
|Across the World, Shock and...|[0.9997576901769588,2.42309...|    0|       0.0|
|How U.S. Torture Left a Leg...|[0.9997542467582056,2.45753...|    0|       0.0|
|One Family. Six Decades. My

### Evaluation

In [14]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_true = predictions_tf.select("label")
y_true = y_true.toPandas()

y_pred = predictions_tf.select("prediction")
y_pred = y_pred.toPandas()

print(classification_report(y_true.label, y_pred.prediction))
print(accuracy_score(y_true.label, y_pred.prediction))

              precision    recall  f1-score   support

           0       0.95      0.92      0.94      6994
           1       0.93      0.96      0.94      7479

    accuracy                           0.94     14473
   macro avg       0.94      0.94      0.94     14473
weighted avg       0.94      0.94      0.94     14473

0.9416154218199406


### Store the model

In [15]:
lrModel_tf.save('lrModel_tf')

## Random Forest

In [16]:
from pyspark.ml.classification import RandomForestClassifier

### Training

In [17]:
%%time

rf = RandomForestClassifier(labelCol="label",
                            featuresCol="features",
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)

rfModel = rf.fit(trainingData)

CPU times: total: 62.5 ms
Wall time: 5min 38s
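
A forest's class probability is an average over its trees, so an ensemble of shallow trees (`maxDepth = 4` here) tends to report moderate probabilities rather than values close to 0 or 1. A sketch of that averaging with made-up per-tree outputs follows. `forest_probability` and `predict_class` are hypothetical helpers that simplify what Spark does internally.

```python
def forest_probability(tree_probs):
    """Average the per-tree class distributions into one distribution."""
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    return [sum(p[c] for p in tree_probs) / n_trees for c in range(n_classes)]

def predict_class(prob):
    """Index of the most probable class."""
    return max(range(len(prob)), key=lambda c: prob[c])

# Four toy trees, each returning [P(class 0), P(class 1)] (illustrative values)
tree_probs = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2], [0.5, 0.5]]
prob = forest_probability(tree_probs)
print(prob)                 # roughly [0.65, 0.35]
print(predict_class(prob))  # 0
```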


### Inference

In [18]:
predictions_rf = rfModel.transform(testData)
predictions_rf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|As clock ticks, Republicans...|[0.6767145373029972,0.32328...|    0|       0.0|
|EU-U.S. trade deal in doubt...|[0.674859943165003,0.325140...|    0|       0.0|
|Trump Shifting Authority Ov...|[0.6733586222002496,0.32664...|    0|       0.0|
|Trump stars and spars with ...|[0.6728863883598141,0.32711...|    0|       0.0|
|Fury at top of Republican P...|[0.6694708251088034,0.33052...|    0|       0.0|
|Factbox: Trump finishes fil...|[0.6694538103456622,0.33054...|    0|       0.0|
|'Glimmer of hope' seen for ...|[0.6684456063353668,0.33155...|    0|       0.0|
|Trump to press China on Nor...|[0.6676187169463729,0.33238...|    0|       0.0|
|Obama Administration Consid...|[0.6670200786920549,0.33297...|    0|       0.0|
|Britain to detail Brexit bi

### Evaluation

In [19]:
y_true = predictions_rf.select("label")
y_true = y_true.toPandas()

y_pred = predictions_rf.select("prediction")
y_pred = y_pred.toPandas()

print(classification_report(y_true.label, y_pred.prediction))
print(accuracy_score(y_true.label, y_pred.prediction))

              precision    recall  f1-score   support

           0       0.92      0.82      0.87      6994
           1       0.85      0.93      0.89      7479

    accuracy                           0.88     14473
   macro avg       0.88      0.87      0.88     14473
weighted avg       0.88      0.88      0.88     14473

0.8764596144545015


### Store the model

In [20]:
rfModel.save('rfModel')
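
To compare the three models side by side, the accuracies and training wall times printed in the cells above can be collected into one table. The numbers below are copied from those outputs. Note that the BERT pipeline was evaluated on a different random split (14511 test rows, unseeded) than the two TF-IDF models (14473 rows, `seed = 100`), so the comparison is indicative rather than exact. The model labels are descriptive names chosen here.

```python
# Accuracy and training wall time as printed in the notebook outputs.
# The TF-IDF timings cover only the classifier fit, not the 1min 28s feature pipeline.
results = [
    {"model": "BERT + ClassifierDL", "accuracy": 0.9588587967748604,
     "train_seconds": 2 * 3600 + 8 * 60 + 48},
    {"model": "TF-IDF + Logistic Regression", "accuracy": 0.9416154218199406,
     "train_seconds": 2 * 60 + 33},
    {"model": "TF-IDF + Random Forest", "accuracy": 0.8764596144545015,
     "train_seconds": 5 * 60 + 38},
]

def format_row(row):
    """One aligned line per model."""
    return f"{row['model']:<30} {row['accuracy']:.4f} {row['train_seconds']:>8d}"

# Rank models from most to least accurate
ranked = sorted(results, key=lambda r: r["accuracy"], reverse=True)

print(f"{'model':<30} {'acc':>6} {'train_s':>8}")
for row in ranked:
    print(format_row(row))
```

On these figures, BERT gains under two accuracy points over logistic regression while taking roughly fifty times longer to train.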