#Fake News Filter

My main goal in this project is to build an algorithm that works like a sieve for the browser and filters out some of the fake news. For this I downloaded a dataset with 5 columns, one of which is a label saying whether a news item is real or not. Of the 4 remaining features I chose 2, the title of the news and the source domain, because they seem to be the most important.


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tqdm.keras import TqdmCallback
import warnings 
warnings.simplefilter('ignore')
import gc
from sklearn.metrics import classification_report, accuracy_score

#Loading the data
First I load the data with read_csv from the pandas library. I decided to use only the title and source domain columns as model input, since they seem to be the most important. The tweet count column is still read in, but because of problems with adding it to my model I decided not to use it.

In [2]:
wanted_cols = ['title','source_domain','tweet_num','real']
df = pd.read_csv('data/FakeNewsNet.csv',usecols=wanted_cols)
df.tail()

Unnamed: 0,title,source_domain,tweet_num,real
23191,Pippa Middleton wedding: In case you missed it...,www.express.co.uk,52,1
23192,Zayn Malik & Gigi Hadid’s Shocking Split: Why ...,hollywoodlife.com,7,0
23193,Jessica Chastain Recalls the Moment Her Mother...,www.justjared.com,26,1
23194,"Tristan Thompson Feels ""Dumped"" After Khloé Ka...",www.intouchweekly.com,24,0
23195,Kelly Clarkson Performs a Medley of Kendrick L...,www.billboard.com,85,1


#Dealing with missing values
Here I simply drop the rows that contain missing values with the dropna function and then reset the index.


In [3]:
# drop rows with missing values
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

#Defining our training and testing data


In [4]:
# Titles concatenated with the source domain: axis=1 joins across each row, axis=0 would join down each column. Accuracy with this input: 77/78%
x = df[['title','source_domain']].agg(' '.join, axis=1)
# Titles only: 73%/74% accuracy
#x = df['title'].values 
y = df['real'].values 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state= np.random.randint(10))
print(f"Training records: {x_train.shape[0]} | Testing records: {x_test.shape[0]}")

Training records: 18292 | Testing records: 4574
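The two counts printed above follow from how a fractional test_size is rounded: the test set gets the ceiling of test_size times the number of rows, and the training set gets the remainder. Below is a small check of that arithmetic, assuming 22866 rows remain after dropna (the sum of the two printed counts). `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 22866 is an assumption: 18292 + 4574 from the output above
n_train, n_test = split_sizes(22866, 0.20)
print(f"Training records: {n_train} | Testing records: {n_test}")
```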


#Tokenization of our data
In this section I use the TensorFlow Tokenizer class. I decided to keep all punctuation such as ? and !, because it is likely to be informative for fake news, and the same goes for letter case, so I set lower to False. The tokenizer then simply turns the text into sequences of integers.


In [5]:
# filters is left empty so that special characters are kept (they matter for fake news, just like lower and upper case letters)
tok = tf.keras.preprocessing.text.Tokenizer(num_words=None,
                                      filters='',
                                      lower=False,
                                      split=' ',
                                      char_level=False,
                                      oov_token=None)
tok.fit_on_texts(x_train)
#text tokenisation
tok_train = tok.texts_to_sequences(x_train)
tok_test = tok.texts_to_sequences(x_test)
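To make the effect of filters='' and lower=False concrete, here is a minimal, library-free sketch of what word-level tokenization with these settings amounts to. It is an approximation for illustration, not the Keras implementation. The helper names, the sample titles and the domain www.example.com are all hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each space-separated token to an integer, most frequent first, starting at 1."""
    counts = Counter(tok for text in texts for tok in text.split(' ') if tok)
    # Sort by descending count; ties keep first-seen order because sorted() is stable.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace known tokens by their index; unknown tokens are dropped (no OOV token)."""
    return [[word_index[t] for t in text.split(' ') if t in word_index] for text in texts]

# hypothetical sample titles, each followed by a made-up source domain
train = ["Shocking Split: Why now?", "Why the split happened www.example.com"]
word_index = fit_word_index(train)
print(word_index)
print(texts_to_sequences(["Why Split: shocking?"], word_index))
```

Because nothing is filtered or lowercased, "Split:" and "split" get separate indices, and "shocking?" is unknown to the vocabulary. With oov_token=None such unseen tokens in the test set are silently dropped from the sequence.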

#Padding
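Here I use the pad_sequences function to bring all sequences to the same length. The target length is the 80th percentile of the title word count; shorter sequences are padded with zeros at the end.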

In [6]:
# padding the sequence
# tweet_count75 = int(df.tweet_num.quantile(0.75))   # taking the 75th percentile of tweets number <-- MAKES NO SENSE
# padded_train = tf.keras.preprocessing.sequence.pad_sequences(tok_train, maxlen=tweet_count75, padding='post')
# padded_test = tf.keras.preprocessing.sequence.pad_sequences(tok_test, maxlen=tweet_count75, padding='post')

df['wcount'] = df['title'].apply(lambda x: len(x.split(' ')))
max_length = int(df.wcount.quantile(0.8))   # taking the 80th percentile of word count
padded_train = tf.keras.preprocessing.sequence.pad_sequences(tok_train, maxlen=max_length, padding='post')
padded_test = tf.keras.preprocessing.sequence.pad_sequences(tok_test, maxlen=max_length, padding='post')

# c = padded_train.shape[0]
# padded_train = pd.concat([pd.DataFrame(padded_train), pd.DataFrame(df['tweet_num'],dtype = int)[:c]],axis=1,ignore_index=True)
# padded_test = pd.concat([pd.DataFrame(padded_test), pd.DataFrame(df['tweet_num'],dtype = int)[c:].reset_index()],axis=1,ignore_index=True)
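The padding step can be pictured with a small pure-Python sketch. It mirrors the settings used above: padding='post' adds zeros at the end, and because truncating is left at its Keras default of 'pre', sequences that are too long lose tokens from the front. `pad_post` is a hypothetical helper written for this illustration and assumes maxlen is at least 1.

```python
def pad_post(seq, maxlen, value=0):
    """Keep at most the last maxlen tokens, then pad with `value` at the end."""
    kept = seq[-maxlen:]  # drops tokens from the front when the sequence is too long
    return kept + [value] * (maxlen - len(kept))

print(pad_post([5, 8, 2], 5))              # short sequence: zeros appended
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))  # long sequence: first two tokens dropped
```

Note that max_length is computed from the title alone, while the tokenized text is the title plus the source domain. Since the domain comes last, front truncation tends to keep it and cut the start of long titles.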




#Metrics functions
Here I define functions for three metrics: recall (the fraction of actual positives that are predicted as positive), precision (the fraction of predicted positives that are actually positive) and F1 (the harmonic mean of precision and recall). I decided to use these metrics because they seem to be the best fit for this type of problem.

In [7]:
def recall_m(y_true, y_pred):
    true_positives = tf.keras.backend.sum(tf.keras.backend.round(tf.keras.backend.clip(y_true * y_pred, 0, 1)))
    possible_positives = tf.keras.backend.sum(tf.keras.backend.round(tf.keras.backend.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + tf.keras.backend.epsilon())
    return recall
def precision_m(y_true, y_pred):
    true_positives = tf.keras.backend.sum(tf.keras.backend.round(tf.keras.backend.clip(y_true * y_pred, 0, 1)))
    predicted_positives = tf.keras.backend.sum(tf.keras.backend.round(tf.keras.backend.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + tf.keras.backend.epsilon())
    return precision
def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+tf.keras.backend.epsilon()))
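As a sanity check on the definitions, the same three quantities can be computed by hand on a tiny example with plain Python. This is only a sketch on hard 0/1 predictions; `precision_recall_f1` is a hypothetical helper, and the labels below are made up.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# made-up labels: 2 true positives, 1 false positive, 2 false negatives
p, r, f1 = precision_recall_f1([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 0])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

One caveat, as far as I understand the Keras behaviour: metrics passed to compile as plain functions are evaluated per batch and then averaged, so the values reported during training can differ somewhat from the metric computed once over the whole dataset.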

#Defining a model
In this section I define my model. It is built from an embedding layer, a bidirectional recurrent layer and standard dense layers. After that I compile the model, specifying the metrics and using binary_crossentropy as the loss because the labels are zeros and ones.

In [8]:
#defining a model
# model params
vocab_size = len(tok.word_index) + 1
epoch = 12
unit = 32

# model = tf.keras.models.Sequential(name='FakeNewsCatcher')
# model.add(tf.keras.layers.Embedding(vocab_size, unit, input_length=tweet_count75))
# model.add(tf.keras.layers.SimpleRNN(unit, return_sequences=False))
# model.add(tf.keras.layers.Dense(1, activation='relu'))

model = tf.keras.models.Sequential(name='LapaczFakeNews')
# model.add(tf.keras.layers.Embedding(vocab_size, unit, input_length=tweet_count75))  <-- MAKES NO SENSE, but interestingly the accuracy obtained this way was higher than for titles alone: 74/75%
model.add(tf.keras.layers.Embedding(vocab_size, unit, input_length=max_length))
# model.add(tf.keras.layers.Embedding(vocab_size, unit, input_length=max_length+1))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(unit)))
model.add(tf.keras.layers.Dense(unit, activation='relu'))
model.add(tf.keras.layers.Dense(unit, activation='relu'))
model.add(tf.keras.layers.Dense(unit, activation='relu'))
model.add(tf.keras.layers.Dense(unit, activation='relu'))
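# NOTE: binary_crossentropy expects outputs in [0, 1]; 'sigmoid' is the usual activation for this layer, 'relu' is unbounded above
model.add(tf.keras.layers.Dense(1, activation='relu'))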

# compile
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=[precision_m,f1_m])

# summary
model.summary()


Model: "LapaczFakeNews"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 14, 32)            1124224   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               4160      
 l)                                                              
                                                                 
 dense (Dense)               (None, 32)                2080      
                                                                 
 dense_1 (Dense)             (None, 32)                1056      
                                                                 
 dense_2 (Dense)             (None, 32)                1056      
                                                                 
 dense_3 (Dense)             (None, 32)                1056      
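
The parameter counts in the summary can be reproduced by hand, which is a quick way to check that the layers are wired as intended. The sketch below assumes a vocabulary size of 35132, a value inferred from 1124224 / 32 rather than printed anywhere in the notebook.

```python
unit = 32
vocab_size = 35132  # assumption: inferred from the embedding row, 1124224 / 32

embedding = vocab_size * unit                          # one 32-dim vector per token
rnn_one_direction = unit * unit + unit * unit + unit   # input weights + recurrent weights + bias
bidirectional = 2 * rnn_one_direction                  # forward and backward SimpleRNN
dense_first = (2 * unit) * unit + unit                 # 64 inputs -> 32 units, plus bias
dense_hidden = unit * unit + unit                      # 32 inputs -> 32 units, plus bias

print(embedding, bidirectional, dense_first, dense_hidden)
```

Almost all parameters sit in the embedding layer, so the model size is driven by the vocabulary, which is large here because punctuation and letter case are kept.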
                                                    

#Training the model
I add an EarlyStopping callback with patience set to 10 (trial and error showed that this works best), clean up state left over from earlier runs, and then simply call the fit function.


In [9]:
# training params (a larger patience only makes things worse, and so does a smaller one)
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min',verbose=0, patience=10)

# cleanup for re-runs
gc.collect()
tf.keras.backend.clear_session()

# train model
hist = model.fit(x=padded_train,
                 y=y_train,
                 epochs=epoch,
                 shuffle=True,
                 validation_data=(padded_test,y_test),
                 verbose=0,
                 callbacks=[TqdmCallback(verbose=0),es])
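
The patience rule can be sketched in a few lines of plain Python: training stops once the monitored loss has failed to improve for `patience` epochs in a row. This is a simplified model of the idea, not the Keras source. `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it runs to the end."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(stopping_epoch([0.5, 0.4, 0.45, 0.46, 0.47], patience=2))  # stops two epochs after the best one
print(stopping_epoch([0.4] + [0.5] * 11, patience=10))           # 12 epochs, best loss in the first
```

Under this rule, with epochs=12 and patience=10 the callback can only fire if the best val_loss occurs in the first or second epoch. Otherwise training runs for all 12 epochs.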



#Accuracy of the model

In [18]:
# accuracy
# print(hist.history.keys())
prec = '{:.2%}'.format(hist.history['precision_m'][-1])
f1 = '{:.2%}'.format(hist.history['f1_m'][-1])
print(f"Our model has achieved a precision of {prec} and f1 of {f1} in {hist.epoch[-1]} epoch(s)")

Our model has achieved a precision of 98.32% and f1 of 98.61% in 10 epoch(s)


In [11]:
# predictions
pred = (model.predict(padded_test) > 0.5).astype('int32')
# classification report
print(classification_report(y_test, pred))
# accuracy score
acc_score = '{:.2%}'.format(accuracy_score(y_test, pred))
print(f"\nAccuracy Score: {acc_score}")


              precision    recall  f1-score   support

           0       0.62      0.53      0.57      1113
           1       0.86      0.90      0.88      3461

    accuracy                           0.81      4574
   macro avg       0.74      0.71      0.72      4574
weighted avg       0.80      0.81      0.80      4574


Accuracy Score: 80.80%
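
The two average rows of the report can be recomputed from the per-class rows, which shows how strongly the larger class 1 dominates the weighted figures. The inputs below are the rounded values from the report, so the results only match to about two decimals.

```python
# per-class values copied from the classification report above
support = {0: 1113, 1: 3461}
precision = {0: 0.62, 1: 0.86}
recall = {0: 0.53, 1: 0.90}

total = sum(support.values())
macro_precision = sum(precision.values()) / len(precision)                   # plain mean over classes
weighted_precision = sum(precision[c] * support[c] for c in support) / total  # mean weighted by support
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
print(f"weighted recall:    {weighted_recall:.2f}")
```

The support-weighted recall equals the overall accuracy, which is why both show up as 0.81. The weaker numbers for class 0 (the fake news) are the ones that matter most for a filter, and they are hidden by the weighted averages.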
