# Using neural classification
As shown by [Wang (2017)](https://arxiv.org/abs/1705.00648), neural classifiers achieve better results than non-neural classifiers when detecting fake news. However, it is unknown how well neural networks classify fake news when they are fed the text embeddings discussed earlier (RQ1).
In this notebook, the second research question will be answered: *how well do neural network architectures classify fake news compared to non-neural classification algorithms?*

<hr>

## On the usage of neural networks
Literature on CNNs and Bi-LSTMs


In [1]:
# General imports
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import keras
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, Reshape, Conv1D, Flatten

# Set offline mode for plotly
init_notebook_mode(connected = True)

# The DataLoader class gives access to pretrained vectors from the Liar dataset
from data_loader import DataLoader
data = DataLoader()

Using TensorFlow backend.


In [2]:
general = data.get_dfs()

# Recode labels from 6 to 3
def recode(label):
    if label == 'false' or label == 'pants-fire' or label == 'barely-true':
        return 0
    elif label == 'true' or label == 'mostly-true':
        return 2
    elif label == 'half-true':
        return 1

for dataset in general.keys():
    general[dataset]['label'] = general[dataset]['label'].apply(recode)
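
The six-to-three recoding can also be read as a lookup table. The sketch below is only an illustration: `LABEL_MAP` and the toy label list are hypothetical names that do not appear in the notebook, and it assumes the six Liar labels are exactly the strings handled by `recode`.

```python
# Hypothetical lookup table mirroring recode():
# 0 = false-leaning, 1 = half-true, 2 = true-leaning
LABEL_MAP = {
    'pants-fire': 0, 'false': 0, 'barely-true': 0,
    'half-true': 1,
    'mostly-true': 2, 'true': 2,
}

# Made-up labels, only to show the mapping
toy_labels = ['true', 'pants-fire', 'half-true', 'mostly-true']
recoded = [LABEL_MAP[label] for label in toy_labels]
print(recoded)  # [2, 0, 1, 2]
```

On a DataFrame column the same table could be applied with `Series.map(LABEL_MAP)`; unlike `recode`, which returns `None` for an unknown label, that would yield `NaN`.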

<hr>

## Bidirectional LSTMs

In [3]:
bert = data.get_bert()

In [3]:
# Get max-pooled BERT embeddings from RQ1
def max_pool(statement):
    if len(statement) > 1:
        return [row.max() for row in np.transpose([[token_row.max() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.max() for token_row in np.transpose(statement[0])]

max_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in bert.keys()
}
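
The nested comprehensions in `max_pool` take the element-wise maximum over every token vector of every sentence in a statement. A minimal NumPy sketch of the same idea follows, assuming each sentence is a `tokens x dimensions` array; `max_pool_tokens` and the toy statement are hypothetical and only serve to show the operation.

```python
import numpy as np

def max_pool_tokens(statement):
    """Element-wise maximum over all token vectors in all sentences."""
    # Stack the tokens of every sentence into one (total_tokens, dims) array
    tokens = np.concatenate([np.asarray(sentence) for sentence in statement], axis=0)
    return tokens.max(axis=0)

# Made-up statement: two sentences with three-dimensional token vectors
toy_statement = [
    [[1.0, 5.0, 0.0], [2.0, 1.0, 3.0]],  # sentence with two tokens
    [[0.0, 7.0, 1.0]],                   # sentence with one token
]
pooled = max_pool_tokens(toy_statement)
print(pooled)  # [2. 7. 3.]
```

Whatever the number of tokens, the result is a single vector with one value per embedding dimension, which is why the pooled data can be stored as a flat DataFrame.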

In [6]:
def get_bilstm_score(X_train, X_test, X_validation, y_train = general['train']['label'], y_test = general['test']['label'], y_validation = general['validation']['label'], reshape = True):
    # Rearrange data types    
    params = locals().copy()    
    inputs = {
        dataset: np.array(params[dataset])
        for dataset in params.keys()
    }
    
    for dataset in inputs.keys():
        if dataset[0:1] == 'X' and reshape:
            # Reshape datasets from 2D to 3D
            inputs[dataset] = np.reshape(inputs[dataset], (inputs[dataset].shape[0], inputs[dataset].shape[1], 1))
        elif dataset[0:1] == 'y':
            inputs[dataset] = np_utils.to_categorical(np.array(inputs[dataset]), 3)
    
    # Set model parameters
    epochs = 5
    batch_size = 32
    # Per-sample shape (timesteps, features), taken after the optional reshape
    input_shape = inputs['X_train'].shape[1:]

    # Create the model
    model = Sequential()
    model.add(Bidirectional(LSTM(64), input_shape = input_shape))
    model.add(Dropout(0.8))
    model.add(Dense(3, activation = 'softmax'))
    model.compile('sgd', 'categorical_crossentropy', metrics = ['accuracy']) 
    
    # Fit the training set over the model and correct on the validation set
    model.fit(inputs['X_train'], inputs['y_train'],
            batch_size = batch_size,
            epochs = epochs,
            validation_data = (inputs['X_validation'], inputs['y_validation']))
    
    # Get score over the test set
    score, acc = model.evaluate(inputs['X_test'], inputs['y_test'])
    
    return acc
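
To make the two preprocessing steps in `get_bilstm_score` concrete, here is a small sketch on made-up data. It assumes the usual recurrent-layer input layout of `(samples, timesteps, features)`; the toy arrays and the `np.eye` one-hot trick are illustrative stand-ins, not the notebook's own calls.

```python
import numpy as np

# Four made-up statements with three pooled features each
toy_X = np.arange(12, dtype=float).reshape(4, 3)

# 2D -> 3D: every feature becomes one time step carrying a single value
toy_X_3d = toy_X.reshape(toy_X.shape[0], toy_X.shape[1], 1)
print(toy_X_3d.shape)  # (4, 3, 1)

# One-hot encode the three recoded classes (0, 1, 2)
toy_y = np.array([0, 2, 1, 0])
toy_y_one_hot = np.eye(3)[toy_y]
print(toy_y_one_hot.shape)  # (4, 3)
```

Read this way, a max-pooled embedding is presented to the LSTM as a long sequence of scalars rather than as a sequence of token vectors.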

In [5]:
get_bilstm_score(max_pooled_bert['train'], max_pooled_bert['test'], max_pooled_bert['validation'])

Instructions for updating:
Colocations handled automatically by placer.
2019-05-21 12:40:37,047 From /Users/martijn/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
2019-05-21 12:40:37,448 From /Users/martijn/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-21 12:40:37,537 From /Users/martijn/anaconda3/lib/python3.6/site-packages/tensorflo

0.43715415052745654

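With a test accuracy of roughly 0.44, the max-pooled embeddings from RQ1 do not perform well with a neural classifier. The next step is a padding approach: instead of pooling, each statement keeps its token vectors and is padded or truncated to a fixed number of tokens.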
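
As a rough illustration of what padding to a fixed length does to a list of token-vector sequences, the sketch below pads and truncates at the front, which to my understanding mirrors the default behaviour of Keras `pad_sequences`. The helper `pad_pre` and the toy sequences are hypothetical, and the helper assumes every sequence is a non-empty `tokens x dimensions` array.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Zero-pad or truncate each (tokens x dims) sequence at the front."""
    dims = np.asarray(sequences[0]).shape[1]
    out = np.zeros((len(sequences), maxlen, dims))
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=float)[-maxlen:]  # keep the last maxlen tokens
        out[i, maxlen - len(seq):] = seq              # zeros stay in front
    return out

# Made-up statements with 2 and 5 tokens of 3 dimensions each
toy_sequences = [np.ones((2, 3)), np.ones((5, 3))]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded.shape)  # (2, 4, 3)
```

The short statement gets two rows of zeros in front of its tokens, while the long one loses its first token.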

In [65]:
%%time

# Store accuracies
accuracies = {
    padding_len: 0.0 for padding_len in list(range(5,36))
}

concatenated_bert = {
    dataset: [np.concatenate(np.array(statement)) for statement in bert[dataset].statement]
    for dataset in bert.keys()
}

for max_len in accuracies.keys():
    padded_bert = {
        dataset: sequence.pad_sequences(concatenated_bert[dataset], maxlen = max_len, dtype = float)
        for dataset in concatenated_bert.keys()
    }
    
    accuracies[max_len] = get_bilstm_score(padded_bert['train'], padded_bert['test'], padded_bert['validation'], reshape = False)

Train on 10235 samples, validate on 1284 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [67]:
bert_rounds

[{5: 0.513043478637816,
  6: 0.5122529648038239,
  7: 0.5075098817998712,
  8: 0.5083003953747127,
  9: 0.49407114631573673,
  10: 0.5098814233018476,
  11: 0.507509881752753,
  12: 0.5035573126299108,
  13: 0.4996047434599503,
  14: 0.5027667987488004,
  15: 0.507509881752753,
  16: 0.5090909091144682,
  17: 0.5067193679187609,
  18: 0.5114624509227135,
  19: 0.5098814233018476,
  20: 0.5162055336203971,
  21: 0.5193675893097527,
  22: 0.5193675893097527,
  23: 0.5233201584797131,
  24: 0.5217391304583417,
  25: 0.5106719371358397,
  26: 0.5106719367824524,
  27: 0.5177865616417685,
  28: 0.5075098817998712,
  29: 0.513043478637816,
  30: 0.5043478264639029,
  31: 0.5162055336203971,
  32: 0.5075098817998712,
  33: 0.5114624509698318,
  34: 0.513833992471808,
  35: 0.5075098817998712},
 {5: 0.5098814233018476,
  6: 0.5098814229484603,
  7: 0.49723320160458684,
  8: 0.5090909091144682,
  9: 0.5051383399445077,
  10: 0.505928854131887,
  11: 0.498814229272571,
  12: 0.5114624509698318,


In [69]:
# One accuracy dict per round, as shown in the previous cell
traces = bert_rounds

# Create traces

def create_scatter(counter):
    acc_dict = traces[counter]
    
    return go.Scatter(
        x = list(acc_dict.keys()),
        y = list(acc_dict.values()),
        mode = 'lines+markers',
        name = 'Round ' + str(counter)
    )

trace_data = [create_scatter(trace) for trace in range(len(traces))]

layout = go.Layout(
    title = 'Test set accuracy of padded BERT dataset with variable maximum lengths',
)

fig = go.Figure(data = trace_data, layout = layout)

iplot(fig)

In [3]:
elmo = data.get_elmo()

In [4]:
def calculate_round(dataset):
    # Store accuracies
    accuracies = {
        padding_len: 0.0 for padding_len in list(range(5,36))
    }

    for max_len in accuracies.keys():
        padded_dataset = {
            fold: sequence.pad_sequences(dataset[fold], maxlen = max_len, dtype = float)
            for fold in dataset.keys()
        }

        accuracies[max_len] = get_bilstm_score(padded_dataset['train'], padded_dataset['test'], padded_dataset['validation'], reshape = False)

    return accuracies

In [7]:
%%time

concatenated_elmo = {
    fold: [np.concatenate(np.array(statement)) for statement in elmo[fold]['statement']]
    for fold in elmo.keys()
}

elmo_rounds = [calculate_round(concatenated_elmo) for _ in range(1)]

Instructions for updating:
Colocations handled automatically by placer.
2019-06-03 19:48:57,899 From /Users/martijn/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
2019-06-03 19:49:01,164 From /Users/martijn/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-06-03 19:49:01,360 From /Users/martijn/anaconda3/lib/python3.6/site-packages/tensorflo

Train on 10235 samples, validate on 1284 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 7h 48min 51s, sys: 2h 1min 19s, total: 9h 50min 11s
Wall time: 2h 14min 55s


In [37]:
elmo_rounds

[{5: 0.5177865616417685,
  6: 0.5241106722665869,
  7: 0.5114624509698318,
  8: 0.5067193676124919,
  9: 0.5067193679187609,
  10: 0.5059288540847687,
  11: 0.49802371574484783,
  12: 0.5011857710808162,
  13: 0.49169960507291105,
  14: 0.49723320191085574,
  15: 0.4932806327408953,
  16: 0.5169960478077764,
  17: 0.5154150200455556,
  18: 0.513043478637816,
  19: 0.5083003955396268,
  20: 0.5185770754286423,
  21: 0.5067193679187609,
  22: 0.5146245063058001,
  23: 0.5011857710808162,
  24: 0.5106719370887214,
  25: 0.5169960477606581,
  26: 0.5193675893097527,
  27: 0.5114624509227135,
  28: 0.5130434785906977,
  29: 0.5241106723137052,
  30: 0.5169960477606581,
  31: 0.5169960477606581,
  32: 0.5146245063058001,
  33: 0.5185770754286423,
  34: 0.5051383402507766,
  35: 0.5035573125827925},
 {5: 0.5122529647567056,
  6: 0.5169960478077764,
  7: 0.5146245062586818,
  8: 0.5233201584325948,
  9: 0.5138339924246897,
  10: 0.4996047434599503,
  11: 0.49090909123891896,
  12: 0.4909090912

In [31]:
traces = elmo_rounds

# Create traces
def create_scatter(counter):
    acc_dict = traces[counter]
    
    return go.Scatter(
        x = list(acc_dict.keys()),
        y = list(acc_dict.values()),
        mode = 'lines+markers',
        name = 'Round ' + str(counter)
    )

trace_data = [create_scatter(trace) for trace in range(len(traces))]

layout = go.Layout(
    title = 'Test set accuracy of padded ELMo dataset with variable maximum lengths',
)

fig = go.Figure(data = trace_data, layout = layout)

iplot(fig)

In [23]:
x = list(range(5, 36))
x_rev = x[::-1]

# BERT
bert_matrix = np.transpose(np.array([np.array(list(acc_round.values())) for acc_round in bert_rounds]))
bert_y = [np.average(row) for row in bert_matrix]
bert_y_upper = [row.max() for row in bert_matrix]
bert_y_lower = [row.min() for row in bert_matrix]
bert_y_lower = bert_y_lower[::-1]

bert1 = go.Scatter(
    x = x + x_rev,
    y = bert_y_upper + bert_y_lower,
    fill = 'tozerox',
    fillcolor = 'rgba(0,100,80,0.2)',
    line = dict(color = 'rgba(255,255,255,0)'),
    showlegend = False,
    name = 'BERT',
)
bert2 = go.Scatter(
    x = x,
    y = bert_y,
    line = dict(color='rgb(0,100,80)'),
    mode = 'lines+markers',
    name = 'BERT',
)

# ELMo
elmo_matrix = np.transpose(np.array([np.array(list(acc_round.values())) for acc_round in elmo_rounds]))
elmo_y = [np.average(row) for row in elmo_matrix]
elmo_y_upper = [row.max() for row in elmo_matrix]
elmo_y_lower = [row.min() for row in elmo_matrix]
elmo_y_lower = elmo_y_lower[::-1]

elmo1 = go.Scatter(
    x = x + x_rev,
    y = elmo_y_upper + elmo_y_lower,
    fill = 'tozerox',
    fillcolor = 'rgba(0,176,246,0.2)',
    line = dict(color = 'rgba(255,255,255,0)'),
    showlegend = False,
    name = 'ELMo',
)
elmo2 = go.Scatter(
    x = x,
    y = elmo_y,
    line = dict(color='rgb(0,176,246)'),
    mode = 'lines+markers',
    name = 'ELMo',
)


# Use a separate name so the DataLoader instance stored in `data` is not overwritten
plot_traces = [bert1, bert2, elmo1, elmo2]
layout = go.Layout(
    title = 'Test set accuracy of padded datasets with variable maximum lengths',
    paper_bgcolor = 'rgb(255,255,255)',
    plot_bgcolor = 'rgb(229,229,229)',
    xaxis = dict(
        gridcolor = 'rgb(255,255,255)',
        range = [5,35],
        showgrid = True,
        showline = False,
        showticklabels = True,
        tickcolor = 'rgb(127,127,127)',
        ticks = 'outside',
        zeroline = False
    ),
    yaxis = dict(
        gridcolor = 'rgb(255,255,255)',
        showgrid = True,
        showline = False,
        showticklabels = True,
        tickcolor = 'rgb(127,127,127)',
        ticks = 'outside',
        zeroline = False
    ),
)

fig = go.Figure(data = plot_traces, layout = layout)

iplot(fig)
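
The band plot rests on three per-length statistics across rounds (mean, maximum and minimum) plus a closed outline that runs along the upper bound and back along the lower bound. A compact sketch of that bookkeeping follows; the accuracies in `toy_rounds` are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Invented accuracies: one dict per round, keyed by padding length
toy_rounds = [
    {5: 0.50, 6: 0.52, 7: 0.51},
    {5: 0.48, 6: 0.53, 7: 0.50},
    {5: 0.52, 6: 0.51, 7: 0.49},
]

lengths = sorted(toy_rounds[0])
# Rows are rounds, columns are padding lengths
matrix = np.array([[acc_round[length] for length in lengths] for acc_round in toy_rounds])

mean = matrix.mean(axis=0)
upper = matrix.max(axis=0)
lower = matrix.min(axis=0)

# Closed outline: left to right along the maximum, right to left along the minimum
band_x = lengths + lengths[::-1]
band_y = upper.tolist() + lower[::-1].tolist()
print(band_x)  # [5, 6, 7, 7, 6, 5]
```

Reducing along `axis=0` replaces the transpose and the per-row list comprehensions with one call per statistic.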

<hr>

## Convolutional neural networks

In [15]:
def get_cnn_score(X_train, X_test, X_validation, y_train = general['train']['label'], y_test = general['test']['label'], y_validation = general['validation']['label'], reshape = True):
    # Rearrange data types    
    params = locals().copy()    
    inputs = {
        dataset: np.array(params[dataset])
        for dataset in params.keys()
    }
    
    # Reshape datasets
    for dataset in inputs.keys():
        if dataset[0:1] == 'X':
            if reshape:
                inputs[dataset] = np.reshape(inputs[dataset], (inputs[dataset].shape[0], inputs[dataset].shape[1], 1))
            
        elif dataset[0:1] == 'y':
            inputs[dataset] = np_utils.to_categorical(np.array(inputs[dataset]), 3)
            
    # Set model parameters
    epochs = 5
    batch_size = 32
    input_shape =  inputs['X_train'].shape
    print(input_shape)
    
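    # Create the model
    # Note: only the first layer sets data_format = 'channels_first'; the two
    # layers after it use the Keras default ('channels_last'), so the layers
    # disagree on which axis is the sequence axis.
    model = Sequential()
    model.add(Conv1D(128, kernel_size = 2, activation='relu', input_shape = (input_shape[1], input_shape[2]), data_format = 'channels_first'))
    model.add(Conv1D(128, kernel_size = 3, activation='relu'))
    model.add(Conv1D(128, kernel_size = 4, activation='relu'))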
    model.add(Dropout(0.8))
    model.add(Flatten())
    model.add(Dense(3, activation = 'softmax'))
    model.compile('sgd', 'categorical_crossentropy', metrics = ['accuracy']) 
    
    # Fit the training set over the model and correct on the validation set
    model.fit(inputs['X_train'], inputs['y_train'],
            batch_size = batch_size,
            epochs = epochs,
            validation_data = (inputs['X_validation'], inputs['y_validation']))
    
    # Get score over the test set
    score, acc = model.evaluate(inputs['X_test'], inputs['y_test'])
    
    return acc

In [11]:
concatenated_bert = {
    dataset: [np.concatenate(np.array(statement)) for statement in bert[dataset].statement]
    for dataset in bert.keys()
}

padded_bert = {
    fold: sequence.pad_sequences(concatenated_bert[fold], maxlen = 22, dtype = float)
    for fold in concatenated_bert.keys()
}

In [16]:
get_cnn_score(padded_bert['train'], padded_bert['test'], padded_bert['validation'], reshape = False)

(10235, 5, 3072)
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
2019-06-04 08:30:54,068 From /Users/martijn/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-06-04 08:30:54,211 From /Users/martijn/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Train on 10235 samples, validate on 1284 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.49090909128603727

In [9]:
def calculate_round(dataset):
    # Store accuracies
    accuracies = {
        padding_len: 0.0 for padding_len in list(range(5,36))
    }

    for max_len in accuracies.keys():
        padded_dataset = {
            fold: sequence.pad_sequences(dataset[fold], maxlen = max_len, dtype = float)
            for fold in dataset.keys()
        }
        print(max_len)
        accuracies[max_len] = get_cnn_score(padded_dataset['train'], padded_dataset['test'], padded_dataset['validation'], reshape = False)

    return accuracies

In [10]:
%%time

cnn_bert_rounds = [calculate_round(concatenated_bert) for _ in range(5)]

5


ValueError: Negative dimension size caused by subtracting 4 from 2 for 'conv1d_6/convolution/Conv2D' (op: 'Conv2D') with input shapes: [?,1,2,128], [1,4,128,128].
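
The shapes in this error message are consistent with three unpadded, stride-1 convolutions with kernel sizes 2, 3 and 4 sliding over a sequence of only 5 tokens: each layer shortens the sequence by `kernel_size - 1`, so the third layer is left with 2 steps for a kernel of size 4. The sketch below works through that arithmetic; it assumes all layers convolve over the token axis with no padding, and `valid_conv_lengths` is a hypothetical helper rather than a Keras function.

```python
def valid_conv_lengths(steps, kernel_sizes):
    """Sequence length after each unpadded, stride-1 convolution."""
    lengths = []
    for kernel_size in kernel_sizes:
        steps = steps - kernel_size + 1
        lengths.append(steps)
    return lengths

kernels = [2, 3, 4]
print(valid_conv_lengths(5, kernels))   # [4, 2, -1]
print(valid_conv_lengths(22, kernels))  # [21, 19, 16]

# Shortest input for which every layer still produces at least one step
min_steps = 1 + sum(kernel_size - 1 for kernel_size in kernels)
print(min_steps)  # 7
```

Under these assumptions the sweep would need to start at a padding length of 7 instead of 5, or the convolutions would need `padding = 'same'` so that the sequence length is preserved.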

<hr>

### References

```
@article{DBLP:journals/corr/Wang17j,
  author    = {William Yang Wang},
  title     = {"Liar, Liar Pants on Fire": {A} New Benchmark Dataset for Fake News
               Detection},
  journal   = {CoRR},
  volume    = {abs/1705.00648},
  year      = {2017},
  url       = {http://arxiv.org/abs/1705.00648},
  archivePrefix = {arXiv},
  eprint    = {1705.00648},
  timestamp = {Mon, 13 Aug 2018 16:48:58 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/Wang17j},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```