# Fake News Detector - Model Comparison

## Authors
- Jose Garzon
- Germán Patiño
- Alejandro Salazar

*Universidad EAFIT*

## References
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
* https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb
* https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

In [1]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
import sparknlp

ss = sparknlp.start()
# ss = sparknlp.start(gpu=True)  # use this line instead to train on GPU

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", ss.version)

Spark NLP version 5.3.3
Apache Spark version: 3.5.1


In [2]:
# Function to read the CSV with pandas and convert it to a Spark DataFrame
from pyspark.sql.types import StringType, IntegerType, StructField, StructType
def read_data(path):
  schema= StructType(
      [StructField('title',StringType(),True),
      StructField('text',StringType(),True),
      StructField('label',IntegerType(),True)])
  pd_df= pd.read_csv(path).drop('Unnamed: 0', axis= 1)
  sp_df= ss.createDataFrame(pd_df, schema= schema)
  return sp_df

In [3]:
# Read data set
path_data= 'WELFake_Dataset.csv'
data= read_data(path_data)

In [4]:
from pyspark.sql.functions import concat, coalesce, lit, when, col, isnan

# Build full_text from title and text, replacing null or NaN values with an empty string
data = data.withColumn(
    "full_text",
    concat(
        when(col("title").isNotNull() & ~isnan(col("title")), col("title")).otherwise(lit("")),
        lit(" "),
        when(col("text").isNotNull() & ~isnan(col("text")), col("text")).otherwise(lit(""))
    )
)
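
The row-level logic of the `full_text` column can be sketched in plain Python. This is only an illustration of the rule applied to each row, not the Spark code itself; the helper names `is_missing` and `build_full_text` are hypothetical, and treating the literal string `"NaN"` as missing is an assumption based on the `NaN` title visible in the sample output.

```python
import math

def is_missing(value):
    """Return True for None, a float NaN, or the literal string 'NaN' (assumption)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == "NaN"

def build_full_text(title, text):
    """Join title and text with a single space, using '' for missing parts."""
    title = "" if is_missing(title) else title
    text = "" if is_missing(text) else text
    return title + " " + text

# A missing title leaves a leading space, as in the second row of data.show(5).
example = build_full_text("NaN", "Did they post the votes?")
```

Because the separator is always added, rows with a missing title start with a space, which matches the second row printed by `data.show(5)`.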


In [5]:
data.show(5)

+--------------------+--------------------+-----+--------------------+
|               title|                text|label|           full_text|
+--------------------+--------------------+-----+--------------------+
|LAW ENFORCEMENT O...|No comment is exp...|    1|LAW ENFORCEMENT O...|
|                 NaN|Did they post the...|    1| Did they post th...|
|UNBELIEVABLE! OBA...| Now, most of the...|    1|UNBELIEVABLE! OBA...|
|Bobby Jindal, rai...|A dozen political...|    0|Bobby Jindal, rai...|
|SATAN 2: Russia u...|The RS-28 Sarmat ...|    1|SATAN 2: Russia u...|
+--------------------+--------------------+-----+--------------------+
only showing top 5 rows



In [6]:
trainDataset, testDataset= data.randomSplit([0.8, 0.2])

## BERT Pipeline

In [19]:
document_assembler = DocumentAssembler() \
    .setInputCol("full_text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained(name='bert_base_uncased', lang='en') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(10)\
    .setLr(0.001)\
    .setBatchSize(8)\
    .setEnableOutputLogs(True) \
    .setOutputLogsPath('logs')

bert_clf_pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        bert_embeddings,
        embeddingsSentence,
        classifier_dl
])

small_bert_L4_256 download started this may take some time.
Approximate size to download 40,5 MB
[OK!]
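
The `SentenceEmbeddings` stage with the `AVERAGE` pooling strategy collapses the per-token vectors of a document into one fixed-size vector, which is what the classifier consumes. A minimal plain-Python sketch of that pooling step follows; the function name `average_pool` and the toy vectors are illustrative assumptions, not Spark NLP internals.

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    n_tokens = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

# Two toy 2-dimensional "token embeddings" (real BERT vectors are much larger).
doc_vector = average_pool([[1.0, 2.0], [3.0, 4.0]])
```

The pooled vector has the same dimension as a single token embedding regardless of document length.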


### Training

In [10]:
%%time
bert_clf_pipelineModel = bert_clf_pipeline.fit(trainDataset)

CPU times: total: 46.9 ms
Wall time: 14min 46s


### Inference

In [38]:
preds = bert_clf_pipelineModel.transform(testDataset)
preds_df = preds.select('label','full_text',"class.result").toPandas()
preds_df['result'] = preds_df['result'].apply(lambda x : int(x[0]))

### Evaluation

In [42]:
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(preds_df.label, preds_df.result))
print(accuracy_score(preds_df.label, preds_df.result))

              precision    recall  f1-score   support

           0       0.93      0.93      0.93      7011
           1       0.93      0.94      0.93      7285

    accuracy                           0.93     14296
   macro avg       0.93      0.93      0.93     14296
weighted avg       0.93      0.93      0.93     14296

0.9327783995523223
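
For reference, the per-class numbers in the report above come from simple counts of true positives, false positives, and false negatives. The sketch below recomputes them by hand on a small made-up label list; `binary_report` is a hypothetical helper and the sample labels are not taken from the dataset.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
sample_true = [1, 1, 1, 0, 0, 0, 1, 0]
sample_pred = [1, 1, 0, 0, 0, 1, 1, 0]
report = binary_report(sample_true, sample_pred)
```

Calling the helper with `positive=0` gives the row for the other class.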


## TF-IDF Pipeline

In [None]:
from pyspark.ml.feature import HashingTF, IDF

In [22]:
%%time

document_assembler = DocumentAssembler() \
      .setInputCol("full_text") \
      .setOutputCol("document")

tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=20)  # minDocFreq: terms seen in fewer than 20 documents get zero weight

nlp_pipeline_tf = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher,
            hashingTF,
            idf])

nlp_model_tf = nlp_pipeline_tf.fit(data)

processed_tf = nlp_model_tf.transform(data)

processed_tf.count()

72134
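
The last two stages of this pipeline can be illustrated without Spark. `HashingTF` maps each token to one of `numFeatures` buckets and counts occurrences, and `IDF` rescales each bucket with the smoothed weight log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the bucket. The sketch below is a simplified assumption-laden version: it uses `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3, so bucket indices will differ), and the function names and toy documents are hypothetical.

```python
import math
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {bucket_index: count} term-frequency vector."""
    counts = Counter()
    for tok in tokens:
        # crc32 is a stand-in hash for this sketch, not the hash Spark uses.
        counts[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return dict(counts)

def fit_idf(tf_vectors, min_doc_freq=0):
    """Smoothed IDF per bucket; buckets below min_doc_freq get weight 0."""
    n_docs = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())
    return {
        idx: (math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
        for idx, df in doc_freq.items()
    }

def tf_idf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in tf_vector.items()}

# Toy stemmed documents, only for illustration.
docs = [["trump", "warn", "regim"], ["trump", "sai", "deal"], ["senat", "vote", "bill"]]
tf_vectors = [hashing_tf(d, num_features=10000) for d in docs]
idf_weights = fit_idf(tf_vectors, min_doc_freq=1)
weighted = [tf_idf(v, idf_weights) for v in tf_vectors]
```

A smaller `num_features` saves memory but makes hash collisions between unrelated tokens more likely.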

In [24]:
processed_tf.select('full_text','features','label').show()

+--------------------+--------------------+-----+
|           full_text|            features|label|
+--------------------+--------------------+-----+
|LAW ENFORCEMENT O...|(10000,[29,71,88,...|    1|
| Did they post th...|(10000,[568,2460,...|    1|
|UNBELIEVABLE! OBA...|(10000,[639,1226,...|    1|
|Bobby Jindal, rai...|(10000,[15,24,39,...|    0|
|SATAN 2: Russia u...|(10000,[33,58,63,...|    1|
|About Time! Chris...|(10000,[171,387,5...|    1|
|DR BEN CARSON TAR...|(10000,[281,472,1...|    1|
|HOUSE INTEL CHAIR...|(10000,[472,681,7...|    1|
|Sports Bar Owner ...|(10000,[24,35,88,...|    1|
|Latest Pipeline L...|(10000,[104,116,1...|    1|
| GOP Senator Just...|(10000,[35,58,150...|    1|
|May Brexit offer ...|(10000,[8,23,29,1...|    0|
|Schumer calls on ...|(10000,[15,51,52,...|    0|
|WATCH: HILARIOUS ...|(10000,[493,568,7...|    1|
|No Change Expecte...|(10000,[4,29,151,...|    0|
|Billionaire Odebr...|(10000,[63,158,25...|    0|
|BRITISH WOMAN LOS...|(10000,[15,57,158...|    1|


In [25]:
%%time
# Set a seed for reproducibility
(trainingData, testData) = processed_tf.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 50341
Test Dataset Count: 21793


## Logistic Regression

In [28]:
from pyspark.ml.classification import LogisticRegression

### Training

In [30]:
%%time

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel_tf = lr.fit(trainingData)

CPU times: total: 46.9 ms
Wall time: 2min 42s
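
In Spark ML, `regParam` scales the regularization penalty and `elasticNetParam` mixes its L1 and L2 parts, so `elasticNetParam=0` as used above gives a pure L2 (ridge) penalty. The following plain-Python sketch shows the penalty term and the sigmoid that turns a linear score into a probability; the function names and the toy weights are assumptions for illustration only.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

# Toy weights; with elastic_net_param=0 only the L2 term contributes.
toy_weights = [3.0, -4.0]
ridge_penalty = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=0.0)
lasso_penalty = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=1.0)
```

A larger `regParam` shrinks the weights more strongly, which trades some training fit for stability.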


### Inference

In [31]:
predictions_tf = lrModel_tf.transform(testData)
predictions_tf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|Islamic State Claims Respon...|[0.9999999999984583,1.54165...|    0|       0.0|
|Trump vs. Congress: Now Wha...|[0.9999999986995154,1.30048...|    0|       0.0|
|The New Party of No - The N...|[0.9999999918232719,8.17672...|    0|       0.0|
|John Kerry: ISIS responsibl...|[0.9999980459449654,1.95405...|    0|       0.0|
|Donald Trump’s New York Tim...|[0.999992285404873,7.714595...|    0|       0.0|
|When to Leave on Your Thank...|[0.9999717399375468,2.82600...|    0|       0.0|
|Taxpayers Will Defend Trump...|[0.9999215405252954,7.84594...|    0|       0.0|
|Tragedy Made Steve Kerr See...|[0.9999176697149388,8.23302...|    0|       0.0|
|President Obama's final Sta...|[0.9998598929562378,1.40107...|    0|       0.0|
|Factbox: International reac

### Evaluation

In [32]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Collect labels and predictions in a single pass so the rows stay aligned
eval_tf = predictions_tf.select("label", "prediction").toPandas()

print(classification_report(eval_tf.label, eval_tf.prediction))
print(accuracy_score(eval_tf.label, eval_tf.prediction))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94     10554
           1       0.93      0.96      0.95     11239

    accuracy                           0.94     21793
   macro avg       0.94      0.94      0.94     21793
weighted avg       0.94      0.94      0.94     21793

0.9426421327949341


## Random Forest

In [43]:
from pyspark.ml.classification import RandomForestClassifier

### Training

In [44]:
%%time

rf = RandomForestClassifier(labelCol="label",
                            featuresCol="features",
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)

rfModel = rf.fit(trainingData)

CPU times: total: 125 ms
Wall time: 5min 47s


### Inference

In [46]:
predictions_rf = rfModel.transform(testData)
predictions_rf.select("full_text","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+------------------------------+-----+----------+
|                     full_text|                   probability|label|prediction|
+------------------------------+------------------------------+-----+----------+
|Trump says he will back awa...|[0.6798491053129065,0.32015...|    0|       0.0|
|Biden visits Iraq in show o...|[0.6693317038369164,0.33066...|    0|       0.0|
|Trump warns 'rogue regime' ...|[0.6678553529713492,0.33214...|    0|       0.0|
|South Korea braces for poss...|[0.6669086312656742,0.33309...|    0|       0.0|
|Trump says U.S. committed t...|[0.6664639064909074,0.33353...|    0|       0.0|
|Trump to press China on Nor...|[0.6660941432660828,0.33390...|    0|       0.0|
|Republican disarray deepens...|[0.6655025403157779,0.33449...|    0|       0.0|
|Puerto Rico debt bill gains...|[0.6654954059595326,0.33450...|    0|       0.0|
|Trump hails deals worth 'bi...|[0.6653982157702998,0.33460...|    0|       0.0|
|Trump warns 'rogue regime' 
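
The top probabilities in the table above stay near 0.67, far from the near-certain values produced by logistic regression. One plausible reason is that a forest's probability is an average over the class distributions of its trees, and shallow trees (here `maxDepth=4`) tend to end in mixed leaves. The sketch below shows that averaging step in plain Python; `forest_probability` and the per-tree distributions are made up for illustration.

```python
def forest_probability(tree_distributions):
    """Average per-tree class distributions and normalize to sum to 1."""
    n_classes = len(tree_distributions[0])
    totals = [sum(dist[c] for dist in tree_distributions) for c in range(n_classes)]
    norm = sum(totals)
    return [t / norm for t in totals]

# Three hypothetical trees, each returning the class mix of the leaf a row lands in.
toy_trees = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]
toy_probability = forest_probability(toy_trees)
```

Deeper trees usually produce purer leaves and therefore more extreme averaged probabilities, at the cost of longer training.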

### Evaluation

In [48]:
# Collect labels and predictions in a single pass so the rows stay aligned
eval_rf = predictions_rf.select("label", "prediction").toPandas()

print(classification_report(eval_rf.label, eval_rf.prediction))
print(accuracy_score(eval_rf.label, eval_rf.prediction))

              precision    recall  f1-score   support

           0       0.89      0.80      0.84     10554
           1       0.83      0.91      0.87     11239

    accuracy                           0.86     21793
   macro avg       0.86      0.86      0.86     21793
weighted avg       0.86      0.86      0.86     21793

0.8568806497499197


### Storage

In [49]:
rfModel.save('rfModel')