In [None]:
""""
Fake news involves disseminating misleading information that can lead people astray, with potentially serious real-world consequences. The goal of fake news is often to deceive, capture attention, manipulate public opinion, or harm reputations. Detecting fake news is crucial, especially for media outlets that rely on attracting viewers to their websites to generate online advertising revenue. In this project, we will develop a deep learning model using TensorFlow to detect whether news articles are fake or real.

We use the fake_news_dataset, which contains news titles and texts together with their corresponding labels (FAKE or REAL). The dataset can be downloaded from the provided link.

The steps involved in this process are as follows:
1. Importing Libraries and Dataset
2. Preprocessing the Dataset
3. Generating Word Embeddings
4. Designing the Model Architecture
5. Model Evaluation and Prediction

In [None]:
""""
Importing Libraries and Dataset
We will use the following libraries:
- NumPy: To handle various mathematical operations.
- Pandas: To load and manipulate the dataset.
- TensorFlow: For data preprocessing and model creation.
- scikit-learn (SkLearn): For splitting the dataset into training and testing sets, and for importing modules necessary for model evaluation.

In [5]:
import numpy as np
import pandas as pd 
import json
import csv
import random

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import regularizers

import pprint
import tensorflow.compat.v1 as tf
from tensorflow.python.framework import ops
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
tf.disable_eager_execution()

data = pd.read_csv(r"C:\Users\Sahil Chavan\Downloads\news.csv")
data.head()

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [None]:
""""
Preprocessing Dataset:
The dataset includes an unnamed index column (`Unnamed: 0`) that carries no information for the model. We drop it before further processing.

In [6]:
data = data.drop(["Unnamed: 0"], axis=1)
data.head(5)

,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [None]:
""""
Data Encoding:
This step transforms the categorical column (`label` in our case) into numerical values. `LabelEncoder` assigns integer codes to the classes in sorted order, so FAKE becomes 0 and REAL becomes 1.
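
As a small illustration of the encoding, the sketch below fits `LabelEncoder` on a hypothetical four-item list rather than on the real dataset. The names `sample_labels`, `encoder`, and `label_mapping` are made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels, standing in for data['label'].
sample_labels = ["FAKE", "REAL", "REAL", "FAKE"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ holds the sorted original labels; the position is the integer code.
label_mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(label_mapping)
print(encoded.tolist())

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(encoded).tolist()
print(decoded)
```

Keeping the fitted encoder around is useful later, because `inverse_transform` turns a predicted 0 or 1 back into the FAKE or REAL string.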

In [7]:
le = preprocessing.LabelEncoder()
le.fit(data['label'])
data['label'] = le.transform(data['label'])

In [10]:
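embedding_dim = 50       # must match the 50-dimensional GloVe vectors
max_length = 54          # number of tokens per padded title
trunc_type = 'post'      # cut titles that are too long at the end
padding_type = 'post'    # pad titles that are too short at the end
oov_tok = "<OOV>"        # out-of-vocabulary placeholder
training_size = 3000     # number of rows taken from the dataset
test_portion = 0.1       # fraction of those rows held out for testing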

In [None]:
""""
Tokenization:
Tokenization splits a block of continuous text into smaller units (tokens), here individual words, and maps each word to an integer index. We first collect the first `training_size` rows of each column. In the cells that follow, only the titles are tokenized, padded, and fed to the model.
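
To make the idea concrete, here is a simplified pure-Python sketch of word-level tokenization followed by post-padding. It is not the Keras `Tokenizer` implementation: the helper names `build_word_index`, `texts_to_ids`, and `pad_post` are hypothetical, the two sample titles are made up, and punctuation is not handled at all.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Index 0 is left unused so it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_post(sequences, maxlen):
    """Truncate at the end, then pad with zeros at the end, to a fixed length."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in sequences]

# Made-up sample titles.
titles = ["Kerry to go to Paris", "Paris primary matters"]
word_index = build_word_index(titles)
padded = pad_post(texts_to_ids(titles, word_index), maxlen=6)
print(word_index)
print(padded)
```

Every row of `padded` has the same length, which is what lets the titles be stacked into a single array for the model.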

In [11]:
# Keep the first `training_size` rows of each column as plain Python lists.
title = data['title'][:training_size].tolist()
text = data['text'][:training_size].tolist()
labels = data['label'][:training_size].tolist()

In [14]:
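# Build the vocabulary from the titles and convert each title to word indices.
tokenizer1 = Tokenizer()
tokenizer1.fit_on_texts(title)
word_index1 = tokenizer1.word_index
vocab_size1 = len(word_index1)
sequences1 = tokenizer1.texts_to_sequences(title)

# Pad or truncate every title to max_length so it matches the model's input_length.
padded1 = pad_sequences(sequences1, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Hold out the first 10% of the rows for testing and train on the rest.
split = int(test_portion * training_size)
training_sequences1 = padded1[split:training_size]
test_sequences1 = padded1[0:split]
test_labels = labels[0:split]
training_labels = labels[split:training_size]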

In [None]:
""""
Generating Word Embeddings:
Word embeddings represent each word as a real-valued vector in a predefined vector space, so that words with similar meanings get similar vectors. We use the `glove.6B.50d.txt` file, which provides pre-trained 50-dimensional GloVe vectors, one word per line. You can download the file using this link.
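
The sketch below shows the mechanics on a tiny made-up sample laid out like the GloVe text file, with a word followed by its coefficients. The three 4-dimensional vectors and the `word_index` are invented for illustration (the real file has 50 values per word). Only the lookup logic mirrors the approach described here.

```python
import io
import numpy as np

# Made-up vectors in the GloVe text layout: a word followed by its coefficients.
fake_glove_file = io.StringIO(
    "paris 0.9 0.1 0.0 0.2\n"
    "london 0.8 0.2 0.1 0.2\n"
    "banana 0.0 0.9 0.8 0.1\n"
)

embeddings_index = {}
for line in fake_glove_file:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical tokenizer vocabulary; "kerry" has no vector in the sample.
word_index = {"paris": 1, "kerry": 2, "banana": 3}
embedding_dim = 4

# Row 0 is reserved for padding; words without a vector keep an all-zero row.
embeddings_matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embeddings_matrix[i] = vector

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(embeddings_matrix.shape)
print(cosine(embeddings_index["paris"], embeddings_index["london"]))
print(cosine(embeddings_index["paris"], embeddings_index["banana"]))
```

In this toy sample the two city names end up much closer to each other than either is to "banana", which is the property the real embeddings are meant to provide.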

In [34]:
embeddings_index = {}
file_path = r"C:\Users\Sahil Chavan\Downloads\glove.6B.50d.txt"

try:
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
except FileNotFoundError:
    print(f"File not found: {file_path}")
except UnicodeDecodeError as e:
    print(f"Unicode decode error: {e}")

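# Row i holds the GloVe vector for the word with index i. Row 0 (padding) and
# words that are missing from GloVe stay all zeros.
embeddings_matrix = np.zeros((vocab_size1 + 1, embedding_dim))

# Use the tokenizer's word index built from the titles; do not overwrite it.
for word, i in word_index1.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector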

In [None]:
""""
Creating Model Architecture:
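We now build the model with TensorFlow's Keras API. A Keras Embedding layer, initialised with the GloVe matrix and frozen, turns each word index into a 50-dimensional real-valued vector. These vectors pass through dropout, a 1D convolution, max pooling, and an LSTM, and a single sigmoid unit outputs the probability that the article is REAL.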
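
The layer shapes and parameter counts that `model.summary()` reports for this architecture can be checked by hand. The sketch below redoes that arithmetic in plain Python. It assumes a vocabulary of 10,000 words (which is what the 500,050 embedding parameters in this notebook's model summary imply), 'valid' convolution padding with stride 1, and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def pool_out_len(length, pool_size):
    """Output length of MaxPooling1D when the stride equals the pool size."""
    return length // pool_size

# max_length and embedding_dim come from the notebook; vocab_size is an
# assumption inferred from the reported embedding parameter count.
vocab_size, embedding_dim, max_length = 10_000, 50, 54
filters, kernel_size, pool_size, lstm_units = 64, 5, 4, 64

conv_len = conv1d_out_len(max_length, kernel_size)
pool_len = pool_out_len(conv_len, pool_size)

params = {
    # One row per word plus one for the padding index 0.
    "embedding": (vocab_size + 1) * embedding_dim,
    # Each filter has kernel_size * embedding_dim weights and one bias.
    "conv1d": kernel_size * embedding_dim * filters + filters,
    # Four gates, each with input weights, recurrent weights, and a bias.
    "lstm": 4 * ((filters + lstm_units) * lstm_units + lstm_units),
    # One weight per LSTM unit plus a bias.
    "dense": lstm_units + 1,
}
print(conv_len, pool_len)
print(params)
```

Because the embedding layer is frozen, only the convolution, LSTM, and dense parameters are updated during training.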

In [35]:
model = tf.keras.Sequential([
    # Frozen embedding layer initialised with the GloVe matrix.
    tf.keras.layers.Embedding(vocab_size1 + 1, embedding_dim, input_length=max_length,
                              weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    # Convolution over windows of 5 consecutive word vectors.
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    # Single sigmoid unit: probability of label 1.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 54, 50)            500050    
                                                                 
 dropout_2 (Dropout)         (None, 54, 50)            0         
                                                                 
 conv1d_2 (Conv1D)           (None, 50, 64)            16064     
                                                                 
 max_pooling1d_2 (MaxPoolin  (None, 12, 64)            0         
 g1D)                                                            
                                                                 
 lstm_2 (LSTM)               (None, 64)                33024     
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                      

In [36]:
num_epochs = 50

training_padded = np.array(training_sequences1)
training_labels = np.array(training_labels)
testing_padded = np.array(test_sequences1)
testing_labels = np.array(test_labels)

history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

Train on 2700 samples, validate on 300 samples
Epoch 1/50
2700/2700 - 2s - loss: 0.6923 - accuracy: 0.5100 - val_loss: 0.6884 - val_accuracy: 0.5667 - 2s/epoch - 822us/sample
Epoch 2/50
2700/2700 - 1s - loss: 0.6908 - accuracy: 0.5237 - val_loss: 0.6874 - val_accuracy: 0.5633 - 755ms/epoch - 280us/sample
Epoch 3/50
2700/2700 - 1s - loss: 0.6913 - accuracy: 0.5278 - val_loss: 0.6885 - val_accuracy: 0.5633 - 782ms/epoch - 290us/sample
Epoch 4/50
2700/2700 - 1s - loss: 0.6909 - accuracy: 0.5215 - val_loss: 0.6849 - val_accuracy: 0.5667 - 811ms/epoch - 301us/sample
Epoch 5/50
2700/2700 - 1s - loss: 0.6912 - accuracy: 0.5222 - val_loss: 0.6861 - val_accuracy: 0.5667 - 855ms/epoch - 317us/sample
Epoch 6/50
2700/2700 - 1s - loss: 0.6911 - accuracy: 0.5219 - val_loss: 0.6863 - val_accuracy: 0.5633 - 943ms/epoch - 349us/sample
Epoch 7/50
2700/2700 - 1s - loss: 0.6910 - accuracy: 0.5204 - val_loss: 0.6854 - val_accuracy: 0.5667 - 973ms/epoch - 360us/sample
Epoch 8/50
2700/2700 - 1s - loss: 0.690

In [None]:
""""
Model Evaluation and Prediction:
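With the model trained, we now try it on a headline that is not in the dataset. The headline is tokenized and padded exactly like the training titles, and the sigmoid output is thresholded at 0.5: values of 0.5 or more are read as REAL (label 1), lower values as FAKE (label 0). In the training log above, validation accuracy stays near 0.56, so a single prediction should be read with caution.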
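
Beyond a single prediction, a held-out set can be scored by thresholding the sigmoid outputs and comparing them with the true labels. The sketch below uses made-up probabilities and labels, not results from this model, and assumes the encoding FAKE = 0, REAL = 1 that `LabelEncoder` produces for these two classes.

```python
import numpy as np

# Made-up sigmoid outputs and true labels, for illustration only.
probabilities = np.array([0.91, 0.12, 0.55, 0.40, 0.73, 0.08])
true_labels = np.array([1, 0, 0, 0, 1, 1])

# A probability of 0.5 or more is read as class 1 (REAL).
predicted = (probabilities >= 0.5).astype(int)

accuracy = float((predicted == true_labels).mean())

# Confusion counts, treating REAL (1) as the positive class.
true_real = int(((predicted == 1) & (true_labels == 1)).sum())
false_real = int(((predicted == 1) & (true_labels == 0)).sum())
true_fake = int(((predicted == 0) & (true_labels == 0)).sum())
false_fake = int(((predicted == 0) & (true_labels == 1)).sum())

class_names = np.array(["FAKE", "REAL"])
print(class_names[predicted].tolist())
print(f"accuracy={accuracy:.2f}")
print(true_real, false_real, true_fake, false_fake)
```

The four counts show which kind of mistake dominates, something a single accuracy figure hides.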

In [37]:
X = "Karry to go France in gesture of sympathy"

# Tokenize and pad the headline exactly like the training titles.
sequences = tokenizer1.texts_to_sequences([X])[0]
sequences = pad_sequences([sequences], maxlen=max_length, padding=padding_type, truncating=trunc_type)

# The sigmoid output is the probability of label 1 (REAL after label encoding).
if model.predict(sequences, verbose=0)[0][0] >= 0.5:
    print("This news is True")
else:
    print("This news is False")

This news is True


In [None]:
""""
Conclusion:
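These steps (loading and cleaning the data, encoding the labels, tokenizing the titles, loading GloVe embeddings, and training a Conv1D and LSTM classifier) give a complete fake news detection pipeline in TensorFlow and Python.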