# Turning vectors into a fixed length
The statements in the Liar dataset vary in length, while machine learning algorithms require fixed-length input. As a result, the vectors produced by the different embedding techniques also vary in length and must be reshaped to a fixed length.
This notebook answers the first research question: *which way of reshaping vectors to a fixed length works best for classifying fake news?*

<hr>

## Exploring the options
In computer vision, feature pooling is used to reduce noise in data. The goal of this step is to transform a joint feature representation into a new, more usable one that preserves important information while discarding irrelevant details. Pooling techniques such as max pooling and average pooling apply a mathematical operation that reduces several numbers to one [(Boureau et al., 2010)](https://www.di.ens.fr/willow/pdfs/icml2010b.pdf). Applied to reshaping data, pooling can reduce every vector to the length of the shortest vector in the dataset, which gives all vectors a uniform shape.
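
To make the idea concrete, here is a minimal sketch of how pooling could bring vectors of different lengths down to the length of the shortest one. It is an illustration under assumptions, not necessarily the procedure used to produce `results.json`: the helper name `pool_to_length`, the chunking strategy (splitting each vector into near-equal contiguous chunks with `np.array_split`), and the toy vectors are all hypothetical.

```python
import numpy as np

def pool_to_length(vector, target_len, method="max"):
    """Reduce a 1-D vector to `target_len` values by pooling contiguous chunks.

    Hypothetical helper: the chunking strategy is an assumption made for
    illustration, not taken from the notebook's experiments.
    """
    reducers = {"max": np.max, "min": np.min, "average": np.mean}
    if method not in reducers:
        raise ValueError(f"Unknown pooling method: {method}")
    vector = np.asarray(vector, dtype=float)
    if len(vector) < target_len:
        raise ValueError("Vector is shorter than the target length")
    # Split into `target_len` near-equal chunks, then reduce each chunk to one number
    chunks = np.array_split(vector, target_len)
    return np.array([reducers[method](chunk) for chunk in chunks])

# Toy vectors of different lengths (made-up values)
vectors = [
    np.array([1.0, 5.0, 2.0, 4.0, 3.0, 0.0]),
    np.array([2.0, 7.0, 1.0]),
]

# The shortest vector determines the fixed length
target_len = min(len(v) for v in vectors)

max_pooled = np.stack([pool_to_length(v, target_len, "max") for v in vectors])
avg_pooled = np.stack([pool_to_length(v, target_len, "average") for v in vectors])
print(max_pooled.shape)  # every row now has the same length
```

The shortest vector passes through unchanged because each of its chunks holds a single value, while longer vectors are summarized chunk by chunk.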

In [8]:
# General imports
import json
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from tqdm.auto import tqdm

# Set offline mode for plotly
init_notebook_mode(connected = True)

<hr>

## Selecting a regularization technique

In [16]:
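# Load the stored experiment results; the file is only needed for this step
with open('results.json') as json_data:
    results = json.load(json_data)

# One bar trace per regularization type, using the 'max' pooling entry of each embedding
data = [go.Bar(
    x = list(results.keys()),
    y = [results[embedding]['3']['logres'][reg]['max'] for embedding in results.keys()],
    name = reg
) for reg in ['l1', 'l2']]

# Restrict the y-axis to 0.45-0.55 so small differences between bars are visible
layout = go.Layout(
    barmode = 'group',
    title = 'Test set accuracies on logistic regression with different regularization types',
    yaxis = dict(
        range = [0.45, 0.55]
    )
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)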

<hr>

## Comparing pooling performance

In [34]:
# Load the stored experiment results; the file is only needed for this step
with open('results.json') as json_data:
    results = json.load(json_data)

# One bar trace per pooling technique, grouped by embedding.
# The pooling names are read from the 'gpt' entry and are expected
# to be the same for every embedding.
data = [go.Bar(
    x = list(results.keys()),
    y = [results[embedding]['3']['logres']['l2'][pooling] for embedding in results.keys()],
    name = pooling
) for pooling in results['gpt']['3']['logres']['l2'].keys()]

# Restrict the y-axis to 0.45-0.55 so small differences between bars are visible
layout = go.Layout(
    barmode = 'group',
    title = 'Test set accuracies on logistic regression with different pooling techniques',
    yaxis = dict(
        range = [0.45, 0.55]
    )
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)

<hr>

### References

```
@inproceedings{boureau2010theoretical,
  title={A theoretical analysis of feature pooling in visual recognition},
  author={Boureau, Y-Lan and Ponce, Jean and LeCun, Yann},
  booktitle={Proceedings of the 27th international conference on machine learning (ICML-10)},
  pages={111--118},
  year={2010}
}
```