### Dataset: Fake News | Kaggle : https://www.kaggle.com/c/fake-news/data

##### Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

##### Reading Dataset

In [2]:
path_train = 'train_refatorado.csv'
path_test = 'test.csv'

train_data = pd.read_csv(path_train, encoding='utf-8')

##### Classification of variables

In [3]:
table = [["id","Qualitative Nominal"],["title","Qualitative Nominal"],
         ["author","Qualitative Nominal"],["text","Qualitative Nominal"],
         ["label","Discrete Quantitative"]]

filing = pd.DataFrame(table, columns=["Variable", "Classification"])
filing

Unnamed: 0,Variable,Classification
0,id,Qualitative Nominal
1,title,Qualitative Nominal
2,author,Qualitative Nominal
3,text,Qualitative Nominal
4,label,Discrete Quantitative


##### Data Dictionary

* id: unique id for a news article
* title: the title of a news article
* author: author of the news article
* text: the text of the article; could be incomplete
* label: a label that marks the article as potentially unreliable
* 1: unreliable
* 0: reliable

##### Data Profiling

Viewing the first five records

In [4]:
train_data.head()

Unnamed: 0,id,title,author,text,label,title_author_text,token_title,token_author,token_text,token_title_author_text,...,verbos,substantivos_vs_comprimento,adjetivos_vs_comprimento,verbos_vs_comprimento,substantivos_vs_palavras,adjetivos_vs_palavras,verbos_vs_palavras,countagem_palavras_title,media_palavras_len,por_cento
0,276,4 Secrets About True Leaders,WakingTimes,Waking Times \nYou can only get so far with or...,1,4 Secrets About True Leaders WakingTimes Wakin...,"['4', 'secrets', 'about', 'true', 'leaders']",['wakingtimes'],"['waking', 'times', 'you', 'can', 'only', 'get...","['4', 'secrets', 'about', 'true', 'leaders', '...",...,0,0.107143,0.0,0.0,0.6,0.0,0.0,0,11.0,0.0
1,5092,Re: 55 Reasons Why California Is The Worst Sta...,ken williams,55 Reasons Why California Is The Worst State I...,1,Re: 55 Reasons Why California Is The Worst Sta...,"['re', '55', 'reasons', 'why', 'california', '...","['ken', 'williams']","['55', 'reasons', 'why', 'california', 'is', '...","['re', '55', 'reasons', 'why', 'california', '...",...,1,0.084746,0.0,0.016949,0.454545,0.0,0.090909,0,5.5,9.090909
2,713,Teenage Boy KNOCKS OUT His Classmate For Assau...,shelby andrews,0 comments \nNot a lot of teenage boys would g...,1,Teenage Boy KNOCKS OUT His Classmate For Assau...,"['teenage', 'boy', 'knocks', 'out', 'his', 'cl...","['shelby', 'andrews']","['0', 'comments', 'not', 'a', 'lot', 'of', 'te...","['teenage', 'boy', 'knocks', 'out', 'his', 'cl...",...,1,0.11,0.0,0.01,0.647059,0.0,0.058824,0,6.5,0.0
3,8239,BAHAHA! Wanna bet Hillary made THIS face when ...,Sam J.,"— Adam Baldwin (@AdamBaldwin) October 28, 2016...",1,BAHAHA! Wanna bet Hillary made THIS face when ...,"['bahaha', 'wanna', 'bet', 'hillary', 'made', ...","['sam', 'j']","['adam', 'baldwin', 'adambaldwin', 'october', ...","['bahaha', 'wanna', 'bet', 'hillary', 'made', ...",...,3,0.085366,0.0,0.036585,0.466667,0.0,0.2,2,2.5,0.0
4,5302,CNN Talker Famous for Saying 'Pu***' on Air La...,Mike Miller,Share on Twitter The Wildfire is an opinion pl...,1,CNN Talker Famous for Saying 'Pu***' on Air La...,"['cnn', 'talker', 'famous', 'for', 'saying', '...","['mike', 'miller']","['share', 'on', 'twitter', 'the', 'wildfire', ...","['cnn', 'talker', 'famous', 'for', 'saying', '...",...,3,0.09901,0.0,0.029703,0.555556,0.0,0.166667,2,5.0,0.0


Let's create two new columns, title_author_text and len_title_author_text. The first stores the concatenation of the title, author and text; the second stores the length of that string in characters.

In [5]:
train_data['title_author_text'] = train_data['title'] + ' ' + train_data['author'] + ' ' + train_data['text']
train_data['len_title_author_text'] = [len(x) for x in train_data['title_author_text']]

In [6]:
detail = train_data['len_title_author_text'].describe()
print(detail)

count     15848.000000
mean       4812.488579
std        5323.513156
min          23.000000
25%        1826.000000
50%        3631.000000
75%        6539.250000
max      143053.000000
Name: len_title_author_text, dtype: float64
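
Note that the statistics above are character counts, since len applied to a string counts characters rather than words. As a minimal sketch on two made-up strings (not taken from the dataset), the following compares the character length with a simple whitespace token count. The helper name length_stats is hypothetical:

```python
# Made-up example strings, not taken from the dataset.
samples = ["Fake news spreads fast", "Breaking: markets rally"]

def length_stats(text):
    """Return (character count, whitespace-separated token count)."""
    return len(text), len(text.split())

for text in samples:
    n_chars, n_tokens = length_stats(text)
    print(f"{n_chars} characters, {n_tokens} tokens: {text!r}")
```

The token count is several times smaller than the character count, which matters whenever a length measured in characters is reused as a length measured in tokens.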


##### Machine Learning

Let's split the dataset into 80% training and 20% test.

In [7]:
#train_features = train_data.drop(['id', 'label'], axis=1)
train_features = train_data['title_author_text']
train_targets = train_data['label']

X_train, X_test, y_train, y_test = train_test_split(train_features, train_targets, test_size=0.2, random_state=42)

print('Train Data Feature: {}'.format(len(X_train)))
print('Train Data Label: {}'.format(len(y_train)))

print('Test Data Feature: {}'.format(len(X_test)))
print('Test Data Label: {}'.format(len(y_test)))

Train Data Feature: 12678
Train Data Label: 12678
Test Data Feature: 3170
Test Data Label: 3170
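
The sizes printed above can be reproduced by hand. This sketch assumes that the test share is rounded up to a whole number of rows and that the remainder goes to training, which is consistent with the counts shown:

```python
import math

n_samples = 15848   # row count reported by describe() earlier in the notebook
test_size = 0.2

# Assumption: the test set size is rounded up; the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```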


In [8]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer

# Fix NumPy's random seed so that NumPy-based randomness is repeatable.
# Note: this alone may not make Keras/TensorFlow training fully deterministic.
np.random.seed(7)

Using TensorFlow backend.


Create a token dictionary limited to at most 5000 words (the most frequent ones in the training text)

In [9]:
num_token = 5000
token = Tokenizer(num_words = num_token, filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
token.fit_on_texts(X_train)
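
To make the idea of a frequency-ranked token dictionary concrete, here is a minimal plain-Python sketch. It is not the Keras implementation: the function names and the toy texts are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace. It mirrors two behaviours of the Keras Tokenizer: indices start at 1, and only words whose index is below num_words are kept when texts are converted to sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 for the most common."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["the news is fake", "the news is real", "the end"]
word_index = build_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index, num_words=4)

print(word_index)
print(sequences)
```

With num_words=4 only the three most frequent words survive, so rare words such as "fake" and "real" are dropped from the sequences.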

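We need to truncate and pad the input sequences so that they all have the same length, because Keras requires fixed-length vectors to perform its calculations.
As the maximum sequence length we use the mean of the feature 'len_title_author_text'. Note that this feature counts characters, while the sequences are measured in tokens, so the limit is likely much longer than most tokenized articles.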

In [10]:
max_review_length = int(detail['mean'])

x_train_token = token.texts_to_sequences(X_train)
x_test_token = token.texts_to_sequences(X_test)

X_train_seq = sequence.pad_sequences(x_train_token, maxlen=max_review_length)
X_test_seq = sequence.pad_sequences(x_test_token, maxlen=max_review_length)
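
By default, pad_sequences pads and truncates at the start of each sequence. The following plain-Python sketch illustrates that behaviour on toy data. The helper pad_sequence is a hypothetical stand-in, not the Keras function, and it assumes maxlen is positive:

```python
def pad_sequence(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value` up to `maxlen`."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = [4, 8, 15]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_sequence(short, maxlen=5))  # [0, 0, 4, 8, 15]
print(pad_sequence(long, maxlen=5))   # [3, 4, 5, 6, 7]
```

Short sequences gain leading zeros, while long sequences lose their earliest tokens, so for very long articles only the end of the text reaches the model.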

Create the model:
The first layer is the Embedding layer, which uses vectors of length 32 to represent each word. The next layer is an LSTM layer with 100 memory units. As this is a binary classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to predict 0 or 1 for the two classes (Reliable and Unreliable).

In [11]:
embedding_vector_length = 32

model = Sequential()
model.add(Embedding(input_dim=num_token, output_dim=embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
#model.add(Dense(units = 256, activation = 'relu'))
model.add(Dense(1, activation='sigmoid'))



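Recurrent neural networks such as LSTMs are prone to overfitting, so let's add a Keras Dropout layer for regularization. Note that the layer below is appended after the output layer; dropout is more commonly placed before the final Dense layer.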
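
To illustrate what a dropout layer does during training, here is a minimal plain-Python sketch of "inverted" dropout, the variant in which surviving activations are scaled by 1 / (1 - rate). The function name and the example values are illustrative and not part of the notebook:

```python
import random

def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(7)        # seeded so the sketch is repeatable
activations = [0.9] * 10      # hypothetical activations
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

print(dropped)
```

Every entry is either 0.0 or 0.9 / 0.8 = 1.125. If the input is a sigmoid output, the training-time value can therefore leave the [0, 1] range, which is one reason dropout is usually applied to hidden activations rather than to the final prediction.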

In [12]:
dropout = 0.2

model.add(Dropout(dropout))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4812, 32)          160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
_________________________________________________________________
dropout_1 (Dropout)          (None, 1)                 0         
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
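
The parameter counts in the summary can be checked by hand from the layer sizes. This is a small arithmetic sketch, assuming the standard LSTM parameterization with four weight sets (three gates plus the cell candidate), each having input weights, recurrent weights and a bias:

```python
vocab_size = 5000      # num_token
embedding_dim = 32     # embedding_vector_length
lstm_units = 100

# One 32-dimensional vector per word in the dictionary.
embedding_params = vocab_size * embedding_dim

# Four weight sets, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The Dropout layer has no weights, so it contributes nothing to the total.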


We will use the log loss function (binary_crossentropy) and the Adam optimization algorithm. The model is fit for only three epochs, with a batch size of 64 articles to space out the weight updates.

In [13]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
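
For reference, the log loss named in the compile step can be sketched in plain Python. This is an illustration of the formula, not the Keras implementation; the clipping constant eps and the toy labels and probabilities are assumptions made for the example:

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]              # toy labels: 1 = unreliable, 0 = reliable
probs = [0.9, 0.1, 0.6, 0.4]       # toy predicted probabilities of class 1
loss = binary_crossentropy(labels, probs)

print(round(loss, 4))
```

Confident correct predictions (0.9 for a label of 1) contribute little to the loss, while hesitant ones (0.6) contribute much more.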

In [14]:
model.fit(X_train_seq, y_train, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x295b6378860>
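
As a rough sketch of how the batch size spaces out the weight updates, the number of updates can be estimated from the training set size, assuming one update per batch and a smaller final batch in each epoch:

```python
import math

n_train = 12678    # training rows reported by the split
batch_size = 64
epochs = 3

# Assumption: the last, partial batch of each epoch still triggers an update.
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)
```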

In [15]:
scores = model.evaluate(X_test_seq, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 93.25%
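
The accuracy metric compares thresholded sigmoid outputs with the true labels. A minimal sketch on made-up values, assuming a threshold of 0.5 (the function name and the data are illustrative only):

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

labels = [1, 0, 1, 1, 0]                    # toy labels
probs = [0.92, 0.15, 0.40, 0.77, 0.55]      # toy sigmoid outputs
accuracy = binary_accuracy(labels, probs)

print("Accuracy: %.2f%%" % (accuracy * 100))
```

Accuracy alone does not show which class the errors fall in, so a confusion matrix on the test split would be a natural complement.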
