# Turning vectors into a fixed length
The statements in the Liar dataset vary in length, whereas machine learning algorithms expect input of a fixed length. As a result, the vectors generated by the different embedding techniques also vary in length and need to be brought to a fixed length. 
This notebook addresses the first research question: *which way of reshaping vectors to a fixed length works best for classifying fake news?*

<hr>

## Exploring the options
In computer vision, feature pooling is used to reduce noise in data. The goal of this step is to transform a joint feature representation into a new, more usable one that preserves important information while discarding irrelevant details. Pooling techniques such as max pooling and average pooling apply a mathematical operation that reduces several numbers to one [(Boureau et al., 2010)](https://www.di.ens.fr/willow/pdfs/icml2010b.pdf). To give the data a uniform shape, we can use pooling to reduce every vector to the length of the smallest vector in the dataset.
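
As a minimal sketch of the idea, the example below pools two made-up token matrices of different lengths into vectors of the same size. The arrays and the `pool` helper are hypothetical and only serve as illustration; they are not taken from the Liar data.

```python
import numpy as np

# Hypothetical token embeddings for two statements of different lengths
# (rows are tokens, columns are embedding dimensions).
short_statement = np.array([[0.1, 0.9, -0.4],
                            [0.5, 0.2,  0.3]])
long_statement = np.array([[0.0,  0.1,  0.2],
                           [0.7, -0.5,  0.6],
                           [0.3,  0.4, -0.1],
                           [0.2,  0.8,  0.0]])

def pool(tokens, reducer):
    """Collapse a (n_tokens, dim) matrix into a single (dim,) vector."""
    return reducer(tokens, axis=0)

max_short = pool(short_statement, np.max)
max_long = pool(long_statement, np.max)
avg_long = pool(long_statement, np.mean)

# Both statements now have the same shape, regardless of their token count.
print(max_short.shape, max_long.shape)
```

Whatever the number of tokens, the pooled vector always has as many entries as the embedding has dimensions.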

Instead of reducing longer vectors to the smallest size, we can do the opposite: take the longest vector and extend the shorter ones to its length. This technique, called *padding*, is another way of obtaining a fixed vector shape for our dataset.
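
A rough illustration of padding with plain NumPy is given below. The helper `pad_or_truncate` is a hypothetical name introduced for this sketch. It pads with zeros at the front and, for vectors that are too long, keeps the last values, which mirrors the default behaviour of Keras' `pad_sequences` (`padding='pre'`, `truncating='pre'`).

```python
import numpy as np

def pad_or_truncate(vector, length, value=0.0):
    """Return a copy of `vector` that is exactly `length` entries long.

    Short vectors get `value` prepended; long vectors keep their last
    `length` entries. Both choices are illustrative.
    """
    vector = np.asarray(vector, dtype=float)
    if len(vector) >= length:
        return vector[-length:]
    padding = np.full(length - len(vector), value)
    return np.concatenate([padding, vector])

vectors = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0, 7.0]]
fixed = np.stack([pad_or_truncate(v, 4) for v in vectors])
print(fixed)
```

Note that truncation discards information, so the target length is a trade-off between coverage and the size of the resulting feature matrix.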


In [2]:
# General imports
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from keras.preprocessing.sequence import pad_sequences
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from hypopt import GridSearch

# Set offline mode for plotly
init_notebook_mode(connected = True)

# The DataLoader class gives access to pretrained vectors from the Liar dataset
from data_loader import DataLoader
data = DataLoader()

Using TensorFlow backend.


<hr>

## InferSent
### Getting the data

In [2]:
general = data.get_dfs()
infersent = data.get_infersent()

Creating InferSent representation and saving them as files...
[nltk_data] Downloading package punkt to /Users/martijn/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Found 15916(/16722) words with w2v vectors
Vocab size : 15916




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



### Applying shaping techniques
#### Max pooling

In [5]:
def max_pool(statement):
    if len(statement) > 1:
        return [row.max() for row in np.transpose([[token_row.max() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.max() for token_row in np.transpose(statement[0])]
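
Taking the maximum per sentence and then the maximum across sentences gives the same result as taking the maximum over all rows at once. The sketch below uses that to express the same pooling in vectorized form. The name `max_pool_vectorized` is hypothetical, and the sketch assumes every sentence is either a single vector or a (tokens, dimensions) matrix with the same number of dimensions.

```python
import numpy as np

def max_pool_vectorized(statement):
    """Element-wise maximum over every row of every sentence in a statement."""
    # np.atleast_2d turns a single sentence vector into a one-row matrix,
    # so sentence-level and token-level embeddings are handled alike.
    stacked = np.vstack([np.atleast_2d(np.asarray(sentence)) for sentence in statement])
    return stacked.max(axis=0)

# Two sentences with two and one token(s), two embedding dimensions each.
multi = [[[1.0, 5.0], [3.0, 2.0]], [[4.0, 0.0]]]
# One sentence represented by a single vector.
single = [np.array([0.2, 0.9])]

pooled_multi = max_pool_vectorized(multi)
pooled_single = max_pool_vectorized(single)
```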

In [4]:
max_pooled_infersent = {
    dataset: pd.DataFrame(list(infersent[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in infersent.keys()
}

#### Min pooling

In [64]:
def min_pool(statement):
    if len(statement) > 1:
        return [row.min() for row in np.transpose([[token_row.min() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.min() for token_row in np.transpose(statement[0])]

In [None]:
min_pooled_infersent = {
    dataset: pd.DataFrame(list(infersent[dataset].statement.apply(lambda statement: min_pool(statement)).values))
    for dataset in infersent.keys()
}

#### Average pooling

In [69]:
def average_pool(statement):
    if len(statement) > 1:
        return [np.average(row) for row in np.transpose([[np.average(token_row) for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [np.average(token_row) for token_row in np.transpose(statement[0])]
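
Unlike the maximum and minimum, the average is sensitive to the order of the two reduction steps. `average_pool` averages within each sentence first and then across sentences, so every sentence counts equally regardless of its number of tokens. The made-up example below shows how that differs from averaging over all tokens at once.

```python
import numpy as np

# Hypothetical statement: one sentence with one token, one with three tokens.
sentence_a = np.array([[1.0, 1.0]])
sentence_b = np.array([[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]])

# Average per sentence, then across sentences (the order used in average_pool).
mean_of_means = np.mean([sentence_a.mean(axis=0), sentence_b.mean(axis=0)], axis=0)

# Average over all tokens at once, which weights long sentences more heavily.
token_mean = np.vstack([sentence_a, sentence_b]).mean(axis=0)

print(mean_of_means, token_mean)
```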

In [None]:
average_pooled_infersent = {
    dataset: pd.DataFrame(list(infersent[dataset].statement.apply(lambda statement: average_pool(statement)).values))
    for dataset in infersent.keys()
}

#### Padding

In [3]:
combined_infersent = {
    dataset: infersent[dataset].statement.apply(lambda statement: np.concatenate([item.flatten() for item in statement]))
    for dataset in infersent.keys()
}

In [4]:
whole_set = pd.concat([combined_infersent['train'], combined_infersent['test'], combined_infersent['validation']]).apply(lambda vector: len(vector))
seq_n = whole_set.median()
seq_std = whole_set.std()

In [5]:
print('The total percentage of statements below the threshold:', len(whole_set.where(whole_set <= seq_n + seq_std).dropna()) / len(whole_set) * 100)

The total percentage of statements below the threshold: 80.88235294117648
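
The same coverage calculation can be tried on a small, made-up series of vector lengths. The values in `lengths` are hypothetical; the point is only to show how a threshold of median plus one standard deviation behaves when a single long outlier is present.

```python
import pandas as pd

# Hypothetical vector lengths with one long outlier.
lengths = pd.Series([4, 5, 5, 6, 6, 7, 8, 9, 12, 30])

threshold = lengths.median() + lengths.std()

# A boolean mask averaged over the series gives the covered fraction.
coverage = (lengths <= threshold).mean() * 100
print(threshold, coverage)
```

Because the median is robust to outliers, the threshold stays close to typical lengths while the outlier is truncated.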


In [None]:
# The median plus one standard deviation covers roughly 81% of the statements (see the output above)
max_length = int(seq_n + seq_std)

# dtype = float keeps the embedding values; the pad_sequences default (int32) would cast them to integers
padded_infersent = {
    dataset: pad_sequences(infersent[dataset].statement.apply(lambda statement: np.concatenate([item.flatten() for item in statement])), maxlen = max_length, dtype = float)
    for dataset in infersent.keys()
}

### Applying classifier

In [30]:
general = data.get_dfs()

# Recode labels from 6 to 3
def recode(label):
    if label == 'false' or label == 'pants-fire' or label == 'barely-true':
        return 'false'
    elif label == 'true' or label == 'mostly-true':
        return 'true'
    elif label == 'half-true':
        return 'half-true'

for dataset in general.keys():
    general[dataset]['label'] = general[dataset]['label'].apply(lambda label: recode(label))
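
The recoding above can also be written as a lookup table, which makes the six-to-three mapping easy to read at a glance. This is an alternative sketch, not the code that produced the results in this notebook; `LABEL_MAP` and the example series are hypothetical names.

```python
import pandas as pd

# Same six-to-three mapping as recode(), expressed as a dictionary.
LABEL_MAP = {
    'pants-fire': 'false',
    'false': 'false',
    'barely-true': 'false',
    'half-true': 'half-true',
    'mostly-true': 'true',
    'true': 'true',
}

labels = pd.Series(['pants-fire', 'barely-true', 'half-true', 'mostly-true', 'true'])
recoded = labels.map(LABEL_MAP)
print(recoded.tolist())
```

With `Series.map`, a label that is missing from the dictionary becomes `NaN`, which is comparable to `recode` returning `None` for an unknown label.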

In [88]:
def get_logres_score(X_train, X_test, X_validation, y_train = general['train']['label'], y_test = general['test']['label'], y_validation = general['validation']['label']):
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
    gs = GridSearch(model = LogisticRegression(), param_grid = param_grid)
    gs.fit(X_train, y_train, X_validation, y_validation)
    
    return gs.score(X_test, y_test)
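
For readers without `hypopt`, the sketch below shows the general idea of tuning `C` on a separate validation set using only scikit-learn: fit one model per candidate value on the training set, keep the one that scores best on the validation set, and report its accuracy on the test set. It runs on synthetic data, the function name `validation_grid_search` is hypothetical, and it is an approximation of the procedure rather than a reproduction of `hypopt` internals.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the pooled embeddings.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_validation, X_test, y_validation, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

def validation_grid_search(X_train, y_train, X_validation, y_validation, grid):
    """Return the model and C value with the best validation accuracy."""
    best_model, best_C, best_score = None, None, -1.0
    for C in grid:
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        score = model.score(X_validation, y_validation)
        if score > best_score:
            best_model, best_C, best_score = model, C, score
    return best_model, best_C

grid = [0.001, 0.01, 0.1, 1, 10]
best_model, best_C = validation_grid_search(X_train, y_train, X_validation, y_validation, grid)
test_score = best_model.score(X_test, y_test)
print(best_C, test_score)
```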

In [32]:
get_logres_score(max_pooled_infersent['train'], max_pooled_infersent['test'], max_pooled_infersent['validation'])

0.4853754940711462

In [36]:
get_logres_score(min_pooled_infersent['train'], min_pooled_infersent['test'], min_pooled_infersent['validation'])

0.43715415019762843

In [39]:
get_logres_score(average_pooled_infersent['train'], average_pooled_infersent['test'], average_pooled_infersent['validation'])

0.46561264822134385

<hr>

## ELMo
### Getting the data

In [51]:
elmo = data.get_elmo()

### Applying classifier
#### Max pooling

In [61]:
max_pooled_elmo = {
    dataset: pd.DataFrame(list(elmo[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in elmo.keys()
}

get_logres_score(max_pooled_elmo['train'], max_pooled_elmo['test'], max_pooled_elmo['validation'])

0.525691699604743

#### Min pooling

In [66]:
min_pooled_elmo = {
    dataset: pd.DataFrame(list(elmo[dataset].statement.apply(lambda statement: min_pool(statement)).values))
    for dataset in elmo.keys()
}

get_logres_score(min_pooled_elmo['train'], min_pooled_elmo['test'], min_pooled_elmo['validation'])

0.5241106719367589

#### Average pooling

In [70]:
average_pooled_elmo = {
    dataset: pd.DataFrame(list(elmo[dataset].statement.apply(lambda statement: average_pool(statement)).values))
    for dataset in elmo.keys()
}

get_logres_score(average_pooled_elmo['train'], average_pooled_elmo['test'], average_pooled_elmo['validation'])

0.51699604743083

#### Padding

In [65]:
combined_elmo = {
    dataset: elmo[dataset].statement.apply(lambda statement: np.concatenate([np.array(item).flatten() for item in statement]))
    for dataset in elmo.keys()
}

In [68]:
whole_set = pd.concat([combined_elmo['train'], combined_elmo['test'], combined_elmo['validation']]).apply(lambda vector: len(vector))
seq_n = whole_set.median()
seq_std = whole_set.std()
seq_n, seq_std

(58368.0, 26364.56514004267)

In [69]:
print('The total percentage of statements below the threshold:', len(whole_set.where(whole_set <= seq_n + seq_std).dropna()) / len(whole_set) * 100)

The total percentage of statements below the threshold: 82.07916145181477


In [82]:
# The median plus one standard deviation covers roughly 82% of the statements (see the output above)
max_length = int(seq_n + seq_std)

padded_elmo = {
    dataset: pd.DataFrame(pad_sequences(elmo[dataset].statement.apply(lambda statement: np.concatenate([np.array(item).flatten() for item in statement])), maxlen = max_length, dtype = float))
    for dataset in elmo.keys()
}

<hr>

## BERT
### Getting the data

In [3]:
bert = data.get_bert()

### Applying classifier
#### Max pooling

In [77]:
max_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in bert.keys()
}

get_logres_score(max_pooled_bert['train'], max_pooled_bert['test'], max_pooled_bert['validation'])

0.5296442687747036

#### Min pooling

In [81]:
min_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: min_pool(statement)).values))
    for dataset in bert.keys()
}

get_logres_score(min_pooled_bert['train'], min_pooled_bert['test'], min_pooled_bert['validation'])

0.5114624505928854

#### Average pooling

In [82]:
average_pooled_bert = {
    dataset: pd.DataFrame(list(bert[dataset].statement.apply(lambda statement: average_pool(statement)).values))
    for dataset in bert.keys()
}

get_logres_score(average_pooled_bert['train'], average_pooled_bert['test'], average_pooled_bert['validation'])

0.5043478260869565

#### Padding

<hr>

## GPT-2
### Getting the data

In [7]:
gpt2 = data.get_gpt2()

### Applying classifier
#### Max pooling

In [31]:
combined_gpt2 = {
    dataset: gpt2[dataset].statement.apply(lambda statement: np.array([len(sentence) for sentence in statement]).min())
    for dataset in gpt2.keys()
}

whole_set = pd.concat([combined_gpt2['train'], combined_gpt2['test'], combined_gpt2['validation']])

In [38]:
# Reducing vectors to the minimum length
minimum_length = whole_set.min()
minimum_length, whole_set.max()

(768, 57600)

In [46]:
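def pool_to_fixed(statement, maxlen, calcfunc):
    # Placeholder: maxlen and calcfunc are not used yet; for now this only
    # returns the length of the first sentence vector in the statement
    return len(statement[0])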

In [39]:
def max_func(arr):
    return np.array(arr).max()

In [47]:
pool_to_fixed(gpt2['train'].iloc[0].statement, minimum_length, max_func)

13056
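
`pool_to_fixed` is not finished in this notebook. One possible way to complete it, given purely as an assumption about the intended behaviour, is to split a flat vector into `maxlen` roughly equal chunks and reduce each chunk with `calcfunc`. The name `pool_to_fixed_sketch` and the chunking strategy are hypothetical.

```python
import numpy as np

def max_func(arr):
    return np.array(arr).max()

def pool_to_fixed_sketch(vector, maxlen, calcfunc):
    """Reduce a flat vector to `maxlen` values by pooling consecutive chunks.

    Assumption: the vector has at least `maxlen` entries, so no chunk is empty.
    """
    flat = np.asarray(vector, dtype=float).ravel()
    if len(flat) < maxlen:
        raise ValueError('vector is shorter than the requested length')
    # np.array_split allows chunks of unequal size when the division is uneven.
    chunks = np.array_split(flat, maxlen)
    return np.array([calcfunc(chunk) for chunk in chunks])

pooled = pool_to_fixed_sketch(np.arange(10), 4, max_func)
print(pooled)
```

A drawback of this approach is that chunks mix values from different embedding dimensions, so it is not equivalent to the per-dimension pooling used for the other embeddings.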

In [None]:
max_pooled_gpt2 = {
    dataset: pd.DataFrame(list(gpt2[dataset].statement.apply(lambda statement: max_pool(statement)).values))
    for dataset in gpt2.keys()
}

get_logres_score(max_pooled_gpt2['train'], max_pooled_gpt2['test'], max_pooled_gpt2['validation'])

<hr>

## Results

In [3]:
infersent_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling'],
    y = [0.4853754940711462, 0.46561264822134385, 0.43715415019762843],
    name = 'InferSent'
)

elmo_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling'],
    y = [0.525691699604743, 0.51699604743083, 0.5241106719367589],
    name = 'ELMo'
)

bert_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling'],
    y = [0.5296442687747036, 0.5043478260869565, 0.5114624505928854],
    name = 'BERT'
)

# Named traces rather than data, so the DataLoader instance called data is not overwritten
traces = [infersent_data, elmo_data, bert_data]
layout = go.Layout(
    barmode = 'group'
)

fig = go.Figure(data = traces, layout = layout)
iplot(fig)
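
To read the chart as an answer to the research question, the scores reported in this notebook can be compared per embedding. The sketch below only reuses the accuracies printed above; padding is left out because no padded scores were recorded here.

```python
# Test accuracies as reported in the cells above.
scores = {
    'InferSent': {'Max pooling': 0.4853754940711462,
                  'Average pooling': 0.46561264822134385,
                  'Min pooling': 0.43715415019762843},
    'ELMo': {'Max pooling': 0.525691699604743,
             'Average pooling': 0.51699604743083,
             'Min pooling': 0.5241106719367589},
    'BERT': {'Max pooling': 0.5296442687747036,
             'Average pooling': 0.5043478260869565,
             'Min pooling': 0.5114624505928854},
}

# Pick the technique with the highest accuracy for each embedding.
best = {embedding: max(techniques, key=techniques.get)
        for embedding, techniques in scores.items()}
print(best)
```

For all three embeddings, max pooling gives the highest accuracy, although the margin over min pooling is small for ELMo.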

<hr>

### References

```
@inproceedings{boureau2010theoretical,
  title={A theoretical analysis of feature pooling in visual recognition},
  author={Boureau, Y-Lan and Ponce, Jean and LeCun, Yann},
  booktitle={Proceedings of the 27th international conference on machine learning (ICML-10)},
  pages={111--118},
  year={2010}
}
```