# Turning vectors into a fixed length
The statements in the Liar dataset vary in length, whereas machine learning algorithms expect input of a fixed length. As a result, the vectors generated by the different embedding techniques also vary in length and need to be reduced to a fixed length.
This notebook answers the first research question: *which way of reshaping vectors to a fixed length works best for classifying fake news?*

<hr>

## Exploring the options
In computer vision, feature pooling is used to reduce noise in data. The goal of this step is to transform a joint feature representation into a new, more usable one that preserves important information while discarding irrelevant details. Pooling techniques such as max pooling and average pooling apply a mathematical operation that reduces several numbers to one [(Boureau et al., 2010)](https://www.di.ens.fr/willow/pdfs/icml2010b.pdf). Applied to the shape of our data, pooling lets us reduce every vector to the size of the smallest vector in the dataset, which gives all inputs a uniform shape.
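
To make the idea concrete, here is a minimal sketch of min, average and max pooling over a variable number of token vectors. It assumes each statement is stored as a 2-D array of shape `(n_tokens, dim)`; the helper name `pool_tokens` and the toy data are hypothetical, and the actual `data.apply_pooling` implementation used later in this notebook may differ.

```python
import numpy as np

def pool_tokens(token_vectors, technique):
    """Collapse a (n_tokens, dim) array into one (dim,) vector.

    Hypothetical helper for illustration only, not the DataLoader's method.
    """
    reducers = {"min": np.min, "average": np.mean, "max": np.max}
    if technique not in reducers:
        raise ValueError(f"Unknown pooling technique: {technique}")
    # Reduce along the token axis so the output no longer depends on n_tokens
    return reducers[technique](token_vectors, axis=0)

# Two toy "statements" with a different number of tokens but the same dimension
rng = np.random.default_rng(0)
short_statement = rng.normal(size=(4, 8))
long_statement = rng.normal(size=(11, 8))

for technique in ("min", "average", "max"):
    pooled = [pool_tokens(s, technique) for s in (short_statement, long_statement)]
    features = np.stack(pooled)  # stacking works because every pooled vector has length 8
    print(technique, features.shape)
```

Whatever the number of tokens, each statement ends up as a single vector of length `dim`, so the pooled vectors can be stacked into one feature matrix for a classifier.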

In [1]:
# General imports
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from tqdm.auto import tqdm

# Set offline mode for plotly
init_notebook_mode(connected = True)

In [2]:
# The DataLoader class gives access to pretrained vectors from the Liar dataset
from data_loader import DataLoader
data = DataLoader()

In [3]:
from classifiers import Classifiers
clfs = Classifiers()

Using TensorFlow backend.


<hr>

## Data gathering and processing

In [4]:
general = data.get_dfs()

# Recode the six original labels into three classes
def recode(label):
    if label in ('false', 'pants-fire', 'barely-true'):
        return 'false'
    elif label in ('true', 'mostly-true'):
        return 'true'
    elif label == 'half-true':
        return 'half-true'

for dataset in general.keys():
    general[dataset]['label'] = general[dataset]['label'].apply(recode)

<hr>

## Run experiments

In [5]:
def get_test_scores(technique):
    '''Get test scores of each dataset for the specified pooling technique'''
    embeddings = ['elmo', 'gpt', 'bert', 'transformerxl', 'flair']
    
    scores = {
        'embeddings': embeddings,
        'scores': []
    }
    
    for dataset in tqdm(embeddings):
        # Gather data by calling the matching loader method, e.g. data.get_elmo()
        df = getattr(data, 'get_' + dataset)()
        
        # Apply pooling technique
        df = data.apply_pooling(technique, df)
        
        # Score a logistic regression on the pooled vectors and the recoded labels
        scores['scores'].append(clfs.get_logres_score(
            df['train'], df['test'], df['validation'],
            general['train']['label'], general['test']['label'], general['validation']['label']
        ))
    
    return scores

In [6]:
results = {
    technique: get_test_scores(technique) for technique in ['min', 'average', 'max']
}

Applying min pooling to the dataset...

Applying average pooling to the dataset...

Applying max pooling to the dataset...

In [7]:
results

{'min': {'embeddings': ['elmo', 'gpt', 'bert', 'transformerxl', 'flair'],
  'scores': [0.5241106719367589,
   0.5035573122529644,
   0.5114624505928854,
   0.4932806324110672,
   0.5130434782608696]},
 'average': {'embeddings': ['elmo', 'gpt', 'bert', 'transformerxl', 'flair'],
  'scores': [0.51699604743083,
   0.5011857707509881,
   0.5043478260869565,
   0.4893280632411067,
   0.5272727272727272]},
 'max': {'embeddings': ['elmo', 'gpt', 'bert', 'transformerxl', 'flair'],
  'scores': [0.525691699604743,
   0.4893280632411067,
   0.5296442687747036,
   0.49881422924901186,
   0.525691699604743]}}
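
The scores above are easier to compare once they are summarised. The sketch below copies the `results` dictionary printed above and computes the mean test accuracy per pooling technique and the best technique per embedding; the variable names `mean_scores` and `best_per_embedding` are introduced here for illustration only.

```python
# Scores copied from the `results` output above
embeddings = ['elmo', 'gpt', 'bert', 'transformerxl', 'flair']
results = {
    'min': {'embeddings': embeddings,
            'scores': [0.5241106719367589, 0.5035573122529644, 0.5114624505928854,
                       0.4932806324110672, 0.5130434782608696]},
    'average': {'embeddings': embeddings,
                'scores': [0.51699604743083, 0.5011857707509881, 0.5043478260869565,
                           0.4893280632411067, 0.5272727272727272]},
    'max': {'embeddings': embeddings,
            'scores': [0.525691699604743, 0.4893280632411067, 0.5296442687747036,
                       0.49881422924901186, 0.525691699604743]},
}

# Mean test accuracy of each pooling technique across the five embeddings
mean_scores = {
    technique: sum(values['scores']) / len(values['scores'])
    for technique, values in results.items()
}

# For each embedding, the pooling technique with the highest test accuracy
best_per_embedding = {
    embedding: max(results, key=lambda technique: results[technique]['scores'][i])
    for i, embedding in enumerate(embeddings)
}

for technique, score in mean_scores.items():
    print(f'{technique:>8}: {score:.4f}')
print(best_per_embedding)
```

On these numbers, max pooling has the highest mean accuracy and is the best technique for three of the five embeddings (ELMo, BERT and Transformer-XL). The gap between techniques is small, at most about 2.5 percentage points within any one embedding.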

<hr>

## Results

In [11]:
def group_results(technique):
    return go.Bar(
        x = results[technique]['embeddings'],
        y = results[technique]['scores'],
        name = technique
    )
    

# Use a separate name so the DataLoader instance `data` is not overwritten
traces = [group_results(technique) for technique in results.keys()]

layout = go.Layout(
    barmode = 'group',
    title = 'Test set accuracies on logistic regression with different pooling techniques',
    yaxis = dict(
        range = [0.45, 0.55]
    )
)

fig = go.Figure(data = traces, layout = layout)
iplot(fig)

In [3]:
# The ELMo and BERT scores match `results`; the InferSent scores are entered by hand
techniques = ['Max pooling', 'Average pooling', 'Min pooling']

infersent_data = go.Bar(
    x = techniques,
    y = [0.4853754940711462, 0.46561264822134385, 0.43715415019762843],
    name = 'InferSent'
)

elmo_data = go.Bar(
    x = techniques,
    y = [0.525691699604743, 0.51699604743083, 0.5241106719367589],
    name = 'ELMo'
)

bert_data = go.Bar(
    x = techniques,
    y = [0.5296442687747036, 0.5043478260869565, 0.5114624505928854],
    name = 'BERT'
)

# Use a separate name so the DataLoader instance `data` is not overwritten
traces = [infersent_data, elmo_data, bert_data]
layout = go.Layout(
    barmode = 'group',
    title = 'Test set accuracies on logistic regression with different pooling techniques'
)

fig = go.Figure(data = traces, layout = layout)
iplot(fig)

<hr>

### References

```
@inproceedings{boureau2010theoretical,
  title={A theoretical analysis of feature pooling in visual recognition},
  author={Boureau, Y-Lan and Ponce, Jean and LeCun, Yann},
  booktitle={Proceedings of the 27th international conference on machine learning (ICML-10)},
  pages={111--118},
  year={2010}
}
```