# Turning vectors into a fixed length
The statements in the Liar dataset vary in length, whereas machine learning algorithms need fixed-length input. As a result, the vectors generated by the different embedding techniques also vary in length and need to be brought to a fixed length.
In this notebook, the first research question is answered: *which way of reshaping vectors to a fixed length works best for classifying fake news?*

<hr>

## Exploring the options
In computer vision, feature pooling is used to reduce noise in data. The goal of this step is to transform a joint feature representation into a new, more usable one that preserves important information while discarding irrelevant details. Pooling techniques such as max pooling and average pooling apply a mathematical operation that reduces several numbers to one [(Boureau et al., 2010)](https://www.di.ens.fr/willow/pdfs/icml2010b.pdf). To change the shape of our data, we can use pooling to reduce every vector to the size of the smallest vector in the dataset, which gives the data a uniform shape.
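As a small illustration of how pooling produces a fixed length, the sketch below assumes that each statement is a 2D array with one row per token and one column per embedding dimension. The toy arrays and the 3-dimensional embedding are hypothetical and only serve the example.

```python
import numpy as np

# Two hypothetical statements with a different number of tokens (rows)
# but the same embedding dimension (3 columns).
short_statement = np.array([[1.0, 5.0, 2.0],
                            [4.0, 0.0, 3.0]])
long_statement = np.array([[2.0, 1.0, 0.0],
                           [0.0, 3.0, 6.0],
                           [7.0, 2.0, 1.0],
                           [1.0, 1.0, 1.0]])

# Pooling over the token axis (axis 0) reduces every column to one number,
# so both statements end up as vectors of length 3.
short_max = short_statement.max(axis=0)
long_max = long_statement.max(axis=0)
short_avg = short_statement.mean(axis=0)
long_avg = long_statement.mean(axis=0)

print(short_max.shape, long_max.shape)
print(short_avg, long_avg)
```

Whatever the number of tokens, the pooled vector has as many entries as the embedding has dimensions.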

Instead of reducing longer vectors to the smallest size, we can do the opposite: take the longest vector and extend the shorter ones to its length. This technique, called *padding*, is another way of obtaining a fixed vector shape for our dataset.
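A minimal sketch of padding to the longest vector is shown below. The helper name `pad_to_longest`, the zero fill value and the toy vectors are assumptions made for this example, not part of the notebook's pipeline.

```python
import numpy as np

def pad_to_longest(vectors, fill_value=0.0):
    """Right-pad every 1D vector with fill_value up to the longest length."""
    target = max(len(v) for v in vectors)
    return np.stack([
        np.pad(np.asarray(v, dtype=float), (0, target - len(v)),
               mode="constant", constant_values=fill_value)
        for v in vectors
    ])

# Hypothetical flattened vectors of different lengths
vectors = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0], [7.0]]
padded = pad_to_longest(vectors)
print(padded.shape)
```

Every row of `padded` now has the length of the longest input, at the cost of adding values that carry no information.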


In [1]:
# General imports
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Set offline mode for plotly
init_notebook_mode(connected = True)

# The DataLoader class gives access to pretrained vectors from the Liar dataset
from data_loader import DataLoader
data = DataLoader()

<hr>

## Sentence embeddings
### Getting the data

In [2]:
infersent = data.get_infersent()

Creating InferSent representation and saving them as files...
[nltk_data] Downloading package punkt to /Users/martijn/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Found 15916(/16722) words with w2v vectors
Vocab size : 15916




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



### Applying shaping techniques
#### Max pooling

In [106]:
def max_pool(statement):
    # Reduce a statement to one fixed-length vector by taking the maximum per
    # embedding dimension, first within each sentence and then across sentences
    if len(statement) > 1:
        return [row.max() for row in np.transpose([[token_row.max() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.max() for token_row in np.transpose(statement[0])]

# Apply max pooling to every statement in each dataset split
max_pooled_infersent = {
    dataset: infersent[dataset].statement.apply(max_pool)
    for dataset in infersent.keys()
}

#### Min pooling

In [None]:
def min_pool(statement):
    # Reduce a statement to one fixed-length vector by taking the minimum per
    # embedding dimension, first within each sentence and then across sentences
    if len(statement) > 1:
        return [row.min() for row in np.transpose([[token_row.min() for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [token_row.min() for token_row in np.transpose(statement[0])]

# Apply min pooling to every statement in each dataset split
min_pooled_infersent = {
    dataset: infersent[dataset].statement.apply(min_pool)
    for dataset in infersent.keys()
}

#### Average pooling

In [114]:
def average_pool(statement):
    # Reduce a statement to one fixed-length vector by averaging per embedding
    # dimension within each sentence and then averaging those sentence averages
    if len(statement) > 1:
        return [np.average(row) for row in np.transpose([[np.average(token_row) for token_row in np.transpose(np.array(sentence))] for sentence in statement])]
    else:
        return [np.average(token_row) for token_row in np.transpose(statement[0])]

# Apply average pooling to every statement in each dataset split
average_pooled_infersent = {
    dataset: infersent[dataset].statement.apply(average_pool)
    for dataset in infersent.keys()
}
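The three pooling functions above differ only in the reduction they apply. As a hedged sketch, they could share one helper that takes the reduction as a parameter. The helper name `pool_statement` and the toy statement are hypothetical, and the sketch assumes each sentence is an array of token rows (a single 1D vector is treated as one row). For max and min, reducing over all tokens at once gives the same result as reducing per sentence first. For the average it does not: the sketch weights every token equally, whereas `average_pool` takes the mean of the per-sentence means.

```python
import numpy as np

def pool_statement(statement, reduce_fn):
    """Reduce a list of (tokens x dims) arrays to one vector of length dims."""
    # Stack the token rows of all sentences, then reduce over the token axis
    tokens = np.vstack([np.atleast_2d(sentence) for sentence in statement])
    return reduce_fn(tokens, axis=0)

# Hypothetical statement: two sentences with 2 and 1 tokens, 3 dimensions
statement = [
    np.array([[1.0, 4.0, 2.0], [3.0, 0.0, 5.0]]),
    np.array([[2.0, 6.0, 1.0]]),
]

max_vec = pool_statement(statement, np.max)
min_vec = pool_statement(statement, np.min)
avg_vec = pool_statement(statement, np.mean)  # token-weighted mean
print(max_vec, min_vec, avg_vec)
```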

#### Padding
Truncating and padding
- take the median length as the fixed length
- take the length that covers everything up to the mean article length (in number of words) plus 1 or 2 standard deviations (check what percentage of the statements that covers)
- check the literature/handbook for what works best (is done most often) for padding (or what the gensim author says about it)
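The cutoff options listed above can be compared by checking which share of the sequences each one covers. The sketch below does this on hypothetical lengths. The helper name `coverage` and the sample values are assumptions for illustration, and `ddof=1` is passed so that the standard deviation matches the pandas default.

```python
import numpy as np

def coverage(lengths, cutoff):
    """Percentage of sequences whose length is at most cutoff."""
    lengths = np.asarray(lengths)
    return float((lengths <= cutoff).mean() * 100)

# Hypothetical statement lengths (in number of words)
lengths = np.array([5, 8, 9, 10, 10, 11, 12, 14, 18, 40])

mean = lengths.mean()
std = lengths.std(ddof=1)  # sample standard deviation, as in pandas

candidates = {
    "median": np.median(lengths),
    "mean + 1 std": mean + std,
    "mean + 2 std": mean + 2 * std,
}
for name, cutoff in candidates.items():
    print(f"{name}: cutoff={cutoff:.1f}, coverage={coverage(lengths, cutoff):.0f}%")
```

A cutoff at the median truncates about half of the sequences by construction, while a cutoff a few standard deviations above the centre only truncates the outliers.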

In [132]:
combined_infersent = {
    dataset: infersent[dataset].statement.apply(lambda statement: np.concatenate([item.flatten() for item in statement]))
    for dataset in infersent.keys()
}

In [143]:
whole_set = pd.concat([combined_infersent['train'], combined_infersent['test'], combined_infersent['validation']]).apply(lambda vector: len(vector))
seq_n = whole_set.median()
seq_std = whole_set.std()
seq_std

183859.15958189743

In [164]:
print('The total percentage of statements below the threshold:', len(whole_set.where(whole_set <= seq_n + seq_std * 2).dropna()) / len(whole_set) * 100)

The total percentage of statements below the threshold: 94.70431789737171


In [159]:
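# The median plus two standard deviations covers almost 95% of the statements
max_length = int(seq_n + seq_std * 2)

def pad_to_length(vector, cutoff = max_length):
    # Truncate vectors that are longer than the cutoff
    if len(vector) > cutoff:
        return vector[:cutoff]
    # Pad shorter vectors with trailing zeros (zero padding is an assumption)
    return np.pad(vector, (0, cutoff - len(vector)), mode = 'constant')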

<hr>

## Results

In [None]:
infersent_data = go.Bar(
    x = ['Max pooling', 'Average pooling', 'Min pooling', 'Padding'],
    y = [0.20, 0.20, 0.20, 0.20],
    name = 'InferSent'
)

# Use a separate name so the DataLoader instance in `data` is not overwritten
traces = [infersent_data]
layout = go.Layout(
    barmode = 'group'
)

fig = go.Figure(data = traces, layout = layout)
iplot(fig)

<hr>

### References

```
@inproceedings{boureau2010theoretical,
  title={A theoretical analysis of feature pooling in visual recognition},
  author={Boureau, Y-Lan and Ponce, Jean and LeCun, Yann},
  booktitle={Proceedings of the 27th international conference on machine learning (ICML-10)},
  pages={111--118},
  year={2010}
}
```