# Turning vectors into a fixed length
The statements in the Liar dataset vary in length, whereas machine learning algorithms expect input of a fixed length. As a result, the vectors generated by the different embedding techniques also vary in length and have to be reduced to a fixed length.
This notebook answers the first research question: *which way of reshaping vectors to a fixed length works best for classifying fake news?*

<hr>

## Exploring the options
In computer vision, feature pooling is used to reduce noise in data. The goal of this step is to transform a joint feature representation into a new, more usable one that preserves important information while discarding irrelevant details. Pooling techniques such as max pooling and average pooling apply a mathematical operation that reduces several numbers to one [(Boureau et al., 2010)](https://www.di.ens.fr/willow/pdfs/icml2010b.pdf). Applied to the shape of our data, pooling over the tokens reduces every statement to a single vector whose length equals the embedding dimension, which gives the dataset a uniform shape.
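To make the idea concrete, the sketch below pools two token matrices of different lengths into vectors of the same length. The toy arrays and the `pool` helper are illustrative assumptions and are not part of this notebook's pipeline.

```python
import numpy as np

# Two toy "statements" with a different number of tokens (rows)
# but the same embedding dimension (columns). The values are made up.
shorter = np.array([[1.0, 4.0, -2.0],
                    [3.0, 0.0,  5.0]])
longer = np.array([[ 2.0, 1.0, 0.0],
                   [-1.0, 6.0, 2.0],
                   [ 0.0, 2.0, 4.0],
                   [ 5.0, 3.0, 1.0]])

def pool(tokens, how):
    """Collapse the token axis so only the embedding dimension remains."""
    reducers = {"max": np.max, "min": np.min, "average": np.mean}
    return reducers[how](tokens, axis=0)

for how in ("max", "min", "average"):
    # Both results have shape (3,), regardless of the number of tokens
    print(how, pool(shorter, how), pool(longer, how))
```

Whatever the number of tokens, each pooled vector has one value per embedding dimension, which is the property a classifier needs.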

In [1]:
# General imports
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from keras.preprocessing.sequence import pad_sequences

# Set offline mode for plotly
init_notebook_mode(connected = True)

# The DataLoader class gives access to pretrained vectors from the Liar dataset
from data_loader import DataLoader
data = DataLoader()

Using TensorFlow backend.


In [2]:
from classifiers import Classifiers
clfs = Classifiers()

<hr>

## Data gathering and processing

In [3]:
general = data.get_dfs()

# Recode the six Liar labels into three classes
def recode(label):
    if label in ('false', 'pants-fire', 'barely-true'):
        return 'false'
    elif label in ('true', 'mostly-true'):
        return 'true'
    elif label == 'half-true':
        return 'half-true'
    # Any other value falls through and is recoded to None

for dataset in general.keys():
    general[dataset]['label'] = general[dataset]['label'].apply(recode)

### Applying shaping techniques
#### Max pooling

In [3]:
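def max_pool(statement):
    # A statement is a list of sentences; each sentence is treated as a (tokens x dimensions) array.
    # Take the per-dimension maximum over the tokens of each sentence, then over the sentences.
    if len(statement) > 1:
        return [row.max() for row in np.transpose([[token_row.max() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.max() for token_row in np.transpose(statement[0])]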

#### Min pooling

In [64]:
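def min_pool(statement):
    # Same two-stage scheme as max_pool, but with the per-dimension minimum:
    # first over the tokens of each sentence, then over the sentences.
    if len(statement) > 1:
        return [row.min() for row in np.transpose([[token_row.min() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.min() for token_row in np.transpose(statement[0])]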

#### Average pooling

In [69]:
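def average_pool(statement):
    # Same two-stage scheme with the per-dimension average. For statements with several
    # sentences this is the average of the sentence averages, so each sentence weighs equally.
    if len(statement) > 1:
        return [np.average(row) for row in np.transpose([[np.average(token_row) for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [np.average(token_row) for token_row in np.transpose(statement[0])]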
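
The functions above pool in two stages: first over the tokens of each sentence, then over the sentences of a statement. For max and min pooling this gives the same result as pooling over all tokens at once. For average pooling it does not when sentences differ in length, because every sentence gets equal weight regardless of its token count. The sketch below illustrates this with made-up numbers; `two_stage` and `flat` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
import numpy as np

# A toy statement with two sentences of unequal length (values are made up).
statement = [
    np.array([[0.0, 2.0]]),      # sentence 1: one token
    np.array([[4.0, 2.0],
              [6.0, 8.0],
              [8.0, 2.0]]),      # sentence 2: three tokens
]

def two_stage(statement, reducer):
    """Pool over the tokens of each sentence, then over the sentence vectors."""
    per_sentence = np.array([reducer(sentence, axis=0) for sentence in statement])
    return reducer(per_sentence, axis=0)

def flat(statement, reducer):
    """Pool over all tokens of the statement at once."""
    return reducer(np.vstack(statement), axis=0)

print(two_stage(statement, np.max), flat(statement, np.max))    # identical vectors
print(two_stage(statement, np.mean), flat(statement, np.mean))  # different vectors
```

Whether equal weight per sentence or equal weight per token is preferable depends on the task; the sketch only shows that the two choices are not interchangeable for averages.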

<hr>

## ELMo
### Getting the data

In [5]:
elmo = data.get_elmo()

### Applying classifier
#### Max pooling

In [6]:
max_pooled_elmo = {
    dataset: pd.DataFrame(list(elmo[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in elmo.keys()
}

In [7]:
clfs.get_logres_score(max_pooled_elmo['train'], max_pooled_elmo['test'], max_pooled_elmo['validation'], general['train']['label'], general['test']['label'], general['validation']['label'])






lbfgs failed to converge. Increase the number of iterations.

0.524901185770751

#### Min pooling

In [66]:
min_pooled_elmo = {
    dataset: pd.DataFrame(list(elmo[dataset].statement.apply(lambda statement: min_pool(statement)).values))
    for dataset in elmo.keys()
}

clfs.get_logres_score(min_pooled_elmo['train'], min_pooled_elmo['test'], min_pooled_elmo['validation'], general['train']['label'], general['test']['label'], general['validation']['label'])

0.5241106719367589

#### Average pooling

In [70]:
average_pooled_elmo = {
    dataset: pd.DataFrame(list(elmo[dataset].statement.apply(lambda statement: average_pool(statement)).values))
    for dataset in elmo.keys()
}

clfs.get_logres_score(average_pooled_elmo['train'], average_pooled_elmo['test'], average_pooled_elmo['validation'], general['train']['label'], general['test']['label'], general['validation']['label'])

0.51699604743083

<hr>

## BERT
### Getting the data

In [3]:
bert = data.get_bert()

### Applying classifier
#### Max pooling

In [77]:
max_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in bert.keys()
}

clfs.get_logres_score(max_pooled_bert['train'], max_pooled_bert['test'], max_pooled_bert['validation'], general['train']['label'], general['test']['label'], general['validation']['label'])

0.5296442687747036

#### Min pooling

In [81]:
min_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: min_pool(statement)).values))
    for dataset in bert.keys()
}

clfs.get_logres_score(min_pooled_bert['train'], min_pooled_bert['test'], min_pooled_bert['validation'], general['train']['label'], general['test']['label'], general['validation']['label'])

0.5114624505928854

#### Average pooling

In [82]:
average_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: average_pool(statement)).values))
    for dataset in bert.keys()
}

clfs.get_logres_score(average_pooled_bert['train'], average_pooled_bert['test'], average_pooled_bert['validation'], general['train']['label'], general['test']['label'], general['validation']['label'])

0.5043478260869565

<hr>

## Results

In [3]:
infersent_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling'],
    y = [0.4853754940711462, 0.46561264822134385, 0.43715415019762843],
    name = 'InferSent'
)

elmo_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling'],
    y = [0.525691699604743, 0.51699604743083, 0.5241106719367589],
    name = 'ELMo'
)

bert_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling'],
    y = [0.5296442687747036, 0.5043478260869565, 0.5114624505928854],
    name = 'BERT'
)

# Use a separate name so the DataLoader instance stored in `data` is not overwritten
traces = [infersent_data, elmo_data, bert_data]
layout = go.Layout(
    barmode = 'group',
    title = 'Test set accuracies with different pooling techniques'
)

fig = go.Figure(data = traces, layout = layout)
iplot(fig)
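
The accuracies in the chart can also be compared numerically. The sketch below is an illustrative addition that uses only the values plotted above and picks the highest-scoring pooling technique per embedding; the `accuracies` and `best` names are assumptions introduced for this example.

```python
# Test set accuracies copied from the bar chart above
accuracies = {
    'InferSent': {
        'Max pooling': 0.4853754940711462,
        'Average pooling': 0.46561264822134385,
        'Min pooling': 0.43715415019762843,
    },
    'ELMo': {
        'Max pooling': 0.525691699604743,
        'Average pooling': 0.51699604743083,
        'Min pooling': 0.5241106719367589,
    },
    'BERT': {
        'Max pooling': 0.5296442687747036,
        'Average pooling': 0.5043478260869565,
        'Min pooling': 0.5114624505928854,
    },
}

# For each embedding, select the pooling technique with the highest accuracy
best = {
    embedding: max(scores, key=scores.get)
    for embedding, scores in accuracies.items()
}

for embedding, technique in best.items():
    print(f"{embedding}: {technique} ({accuracies[embedding][technique]:.4f})")
```

With these values, max pooling scores highest for all three embeddings, although the margin over min pooling is small for ELMo.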

<hr>

### References

```
@inproceedings{boureau2010theoretical,
  title={A theoretical analysis of feature pooling in visual recognition},
  author={Boureau, Y-Lan and Ponce, Jean and LeCun, Yann},
  booktitle={Proceedings of the 27th international conference on machine learning (ICML-10)},
  pages={111--118},
  year={2010}
}
```