# Autoencoders & Recommender Systems

In this lesson we will learn about:
- a new type of neural network architecture (autoencoders)
- how one could use it to make recommendations

# Autoencoders

> __Autoencoders are self-supervised systems which, given an input `x`, try to reconstruct it at the output__

Written in mathematical notation, an autoencoder minimizes a reconstruction loss `L` between the input and its reconstruction:

$$
L(x, d(e(x)))
$$

where:

$$
e(x) \rightarrow \text{latent representation space created by encoder}
$$

$$
d(h) \rightarrow \text{decoder mapping the latent representation } h = e(x) \text{ back to the input space}
$$

> __The goal of the autoencoder is to reconstruct the input while, at the same time, finding a useful latent representation of the input__

There are multiple variants of autoencoders; let's look at the basic ones:

## Undercomplete

> __Same formulation as above, BUT dimensionality of latent space has to be smaller than that of `x` input__

$$
x \in \mathbb{R}^N, \quad e(x) \in \mathbb{R}^M
$$

where

$$
M \ll N
$$

__Traits:__
- If `M` is too large, the autoencoder might simply copy the data
- If `M` is sufficiently small, it has to __compress the representation of `x`__
- __Shallow linear autoencoder__ - learns approximately the same subspace as PCA with the specified number of components
- __Deeper non-linear autoencoder__ - learns a non-linear generalization of PCA (usable for dimensionality reduction, much like t-SNE)


## Sparse

> __Sparse autoencoder forces the latent space to be sparse via L1 regularization__

$$
L(x, d(e(x))) + \Omega(h)
$$

where:

$$
h = e(x) \rightarrow \text{latent representation space}
$$

$$
\Omega(h) \rightarrow \text{L1 regularization}
$$

__Traits:__
- __Can be wider than undercomplete due to regularization__
- Latent variables act as explanatory terms of the `x` input
- __One could also apply `nn.ReLU` to the encoder output (instead of leaving the last `nn.Linear` layer without activation) to force an ACTUALLY sparse representation, one with exact zeros__
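
A minimal sketch of the sparse objective above, assuming `MSELoss` for reconstruction and an illustrative regularization strength `l1_weight` (neither the layer sizes nor this value come from the lesson). The latent space is wider than the input here, which the L1 penalty makes feasible:

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 100 input features, 150-dimensional (wider) latent space
encoder = torch.nn.Sequential(torch.nn.Linear(100, 150), torch.nn.ReLU())
decoder = torch.nn.Sequential(torch.nn.Linear(150, 100), torch.nn.Sigmoid())

x = torch.rand(8, 100)  # batch of inputs in the [0, 1] range
l1_weight = 1e-3  # assumed regularization strength, tune for your data

h = encoder(x)  # latent representation, non-negative because of ReLU
reconstruction_loss = torch.nn.functional.mse_loss(decoder(h), x)
# Omega(h): L1 norm of the latent vector, averaged over the batch
sparsity_penalty = l1_weight * h.abs().sum(dim=1).mean()

loss = reconstruction_loss + sparsity_penalty
loss.backward()
```

The penalty is applied to the latent activations `h`, not to the weights, which is what distinguishes this from ordinary L1 weight decay.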

## Denoising

> __Very popular variation - given `x` we add some random noise to it and force the model to reconstruct `x`__

$$
L(x, d(e(\hat{x})))
$$

where

$$
\hat{x} \rightarrow \text{noise disturbed input}
$$

__Traits:__
- Noise is usually drawn from a normal distribution (other distributions work too)
- The model has to learn the structure of the data in order to denoise it
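
A small sketch of one denoising step, assuming Gaussian noise with an illustrative standard deviation `noise_std` and `MSELoss`. The key detail is that the noisy input goes into the network while the clean `x` is the target:

```python
import torch

torch.manual_seed(0)

# Hypothetical layer sizes, chosen only for illustration
encoder = torch.nn.Sequential(
    torch.nn.Linear(100, 60), torch.nn.ReLU(), torch.nn.Linear(60, 40)
)
decoder = torch.nn.Sequential(torch.nn.Linear(40, 100), torch.nn.Sigmoid())

x = torch.rand(8, 100)  # clean inputs in the [0, 1] range
noise_std = 0.1  # assumed noise level

# x_hat in the formula above: the noise-disturbed input
x_noisy = x + noise_std * torch.randn_like(x)

reconstruction = decoder(encoder(x_noisy))
# The target is the CLEAN input, not the noisy one
loss = torch.nn.functional.mse_loss(reconstruction, x)
loss.backward()
```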

## Autoencoder features

There are a few important concepts one should keep in mind:
- __One can use multiple data sources bringing them to a single latent representation__:
    - images via convolution
    - text via RNNs (or `nn.Conv1d`)
    - tabular data via `nn.Linear`
- Try __not to__ overregularize:
    - representation can be smaller, __yet not too small__
    - the more regularization we introduce, __the larger our latent space should be__
- Rarely used outside of semi-supervised settings
- Keep `encoder` output linear (unless you want to constrain representation)
- __We should force the latent representation closer to our goal__ - the sparse autoencoder is one example
- __We can get a little more creative with our task__ - we will see that during an exercise

### Loss function

> __What kind of loss function/activations should we use?__

__We should always check our `x` data and act accordingly__, for example:
- continuous data in `[0, 1]` range - `sigmoid` activation at the end and `MSELoss`
- continuous data in `[-1, 1]` range - `tanh` activation at the end and `MSELoss`
- categorical data - no activation and `CrossEntropyLoss` (or `BCEWithLogitsLoss` for binary data)
- continuous data within an unspecified range - __try to avoid it with normalization__, otherwise no activation and `MSELoss`

> __Remember multidimensional data can be used directly for those loss functions!__
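
The pairings above can be sketched with random stand-in tensors (all shapes and names below are illustrative assumptions). Note that every loss accepts `(batch, features)` data directly, and that `CrossEntropyLoss` expects the class dimension right after the batch dimension:

```python
import torch

torch.manual_seed(0)

batch, features, classes = 4, 10, 5
raw_output = torch.randn(batch, features)  # stand-in for the last linear layer

# Continuous data in [0, 1]: sigmoid at the end + MSELoss
target_unit = torch.rand(batch, features)
mse_unit = torch.nn.MSELoss()(torch.sigmoid(raw_output), target_unit)

# Continuous data in [-1, 1]: tanh at the end + MSELoss
target_symmetric = torch.rand(batch, features) * 2 - 1
mse_symmetric = torch.nn.MSELoss()(torch.tanh(raw_output), target_symmetric)

# Binary data: no activation, BCEWithLogitsLoss applies the sigmoid internally
target_binary = torch.randint(high=2, size=(batch, features)).float()
bce = torch.nn.BCEWithLogitsLoss()(raw_output, target_binary)

# Categorical data: logits of shape (batch, classes, features),
# integer targets of shape (batch, features)
class_logits = torch.randn(batch, classes, features)
class_targets = torch.randint(high=classes, size=(batch, features))
cross_entropy = torch.nn.CrossEntropyLoss()(class_logits, class_targets)
```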

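```python
import torch


class AutoEncoder(torch.nn.Module):
    def __init__(self, encoder, decoder):
        # Must be called first, otherwise submodules cannot be registered
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):
        # Reconstruct the input from its latent representation
        return self.decoder(self.encoder(x))

    def encode(self, x):
        # Return only the latent representation
        return self.encoder(x)


# Undercomplete encoder: 100 input features compressed to 40 latent features
encoder = torch.nn.Sequential(
    torch.nn.Linear(100, 80),
    torch.nn.ReLU(),
    torch.nn.Linear(80, 60),
    torch.nn.ReLU(),
    torch.nn.Linear(60, 40),
)


# Decoder mirrors the encoder; sigmoid keeps the output in the [0, 1] range
decoder = torch.nn.Sequential(
    torch.nn.Linear(40, 60),
    torch.nn.ReLU(),
    torch.nn.Linear(60, 80),
    torch.nn.ReLU(),
    torch.nn.Linear(80, 100),
    torch.nn.Sigmoid(),
)

autoencoder = AutoEncoder(encoder, decoder)
```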

# Recommendation systems

> __This part is only a brief overview; the field is expanding rapidly and directions often change!__

![filtering](images/colab_filter.svg)


## Content based

> __Content-based filtering recommends items to THE USER based on THEIR OWN interactions__

For example:
- A user liked multiple poor quality horror movies, hence it is likely that the next poor quality horror movie will interest them
- User does not like high heels, hence we shouldn't recommend them

__Cons:__
- A lot of data needed for a user (__cold start problem__)
- __Data sparsity__ (most of the items will have no interactions)
- We will not be able to push the user out of their "comfort zone" (people don't know they like something unless they try it)

## Collaborative filtering

> __Collaborative filtering makes recommendations based on actions performed by other users/entities__

For example:
- User A liked similar movies to user B, hence we might suggest to A other movies that B liked

__Cons:__
- __Cold start problem__
- __Data sparsity__
- Creating a suitable representation of a user based on their interactions is difficult


## Tips

- __Ask users to engage with the content at the beginning__
- __Include as many data points for each user as possible__, some examples could be:
    - geolocation
    - list of friends (integrating system with `Facebook` or `Twitter`)
    - non-content related preferences (politics, fashion, webpages visited)
- __DO NOT PUSH USERS TOO HARD__ as it may have the opposite effect and destroy the data
- Gather data in a transparent way for the users (so it does not interfere with their tasks/goals)
- Make the engagement simple (e.g. liking/disliking is better than `0`-`10` ratings) because: 
    - Users usually choose __extreme options__ (either `0`/`1` or `9`/`10`)
    - Data obtained this way is often non-representative
- Mix a lot of approaches and data sources together (what we are going to do in the exercise)
- Separate recommendation sources:
    - Based on user's content
    - Based on other users' activity
- __Verify often what works based on user feedback__
- __Experiment__ (some approaches use reinforcement learning)

# Exercise

> __Let's be creative!__

We will try to create an autoencoder which has the following goals (__at the same time!__):
- Collaborative filtering
- Content Based

First of all, we have __randomly generated data__ of shape `(M, N)` (users, items) with the following values:
- `0` - user did not interact with this item
- `1` - user __liked__ the item

Below are the steps we will take in order to construct a recommendation system:

# Dataset

## `__init__`

- Takes two arguments:
    - `data` (our `(M, N)` matrix)
    - `p` - value in the `(0, 1)` range which specifies the __positive sample probability__ (we will later see what that means)
- Saves both as attributes for later use
    
Now, the first __method__ we will need:
    
## Intersection Over Union

> __Protip:__ unsqueeze matrix to obtain `(M, M, N)` tensor and apply __summation__ across last dimension

Given our `(M, N)` matrix we have to:
- Calculate intersection (__same positive values__) for each user __based on items__, getting `(M, M)` matrix (__protip: multiplication__)
- Calculate union (__positive values for either user__) for each user __based on items__, getting `(M, M)` matrix (__protip: addition__)
- Divide intersection by union to obtain __IoU__ (save it as `iou` attribute in the dataset)
- Create the inverse of the `iou` (`1 - iou`) and save it as the `iou_inverse` attribute

> __Create the above as a standalone method and run it from within `__init__`__
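
The broadcast from the protip materializes an `(M, M, N)` tensor, which for `10000` users and `100` items means `10^10` elements. Below is a sketch of a lighter alternative (not the approach named in the protip): for binary data the intersection is a matrix product and the union follows from `|A ∪ B| = |A| + |B| - |A ∩ B|`. The function name and the clamping of empty unions are assumptions made for this example:

```python
import torch


def intersection_over_union(data: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between users of a binary (users, items) matrix."""
    data = data.float()
    # (M, M): number of items liked by both users
    intersection = data @ data.T
    # (M,): number of items liked by each user
    liked = data.sum(dim=1)
    # (M, M): |A| + |B| - |A ∩ B|
    union = liked.unsqueeze(1) + liked.unsqueeze(0) - intersection
    # Two users without any likes give union == 0; clamp to avoid 0 / 0
    return intersection / union.clamp(min=1)


# Tiny hand-checkable example: users like items {0, 1}, {0, 2} and {3}
data = torch.tensor([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]])
iou = intersection_over_union(data)
iou_inverse = 1 - iou
```

In this example users `0` and `1` share one item out of three liked in total, so their IoU is `1/3`, while user `2` has zero overlap with both.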

## `__getitem__(self, index)`

We will do the following steps:
- Obtain user by `index`ing into `self.data`
- Based on `self.p` probability (see [here](https://stackoverflow.com/a/5887040/10886420) for example):
    - if `True` was sampled:
        - Create [`WeightedRandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler), where:
            - `weights` are taken from appropriate `iou` row by `index`ing into it
            - `num_samples=1` - take a single positive example
        - Sample from the created sampler to get `positive_index`
        - Get value from `iou` and save it as a `similarity` variable (indexed by `index` and `positive_index`)
        - Get `positive_sample` from `data` based on `positive_index`
        - __Return `tuple` with three elements: `user`, `positive`, `similarity`__
    - if `False` was sampled:
        - Same as above, but:
            - `iou_inverse` is used for sampling
            - We return `user`, `negative`, `similarity`
    - __Remove unnecessary code repetition!__
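
One possible shape of the dataset described above, given as a sketch rather than a reference solution. The class name `RecommendationDataset` is hypothetical, IoU is computed with a matrix product, and `similarity` is read from `iou` in both branches (one reading of "same as above"):

```python
import torch
from torch.utils.data import Dataset, WeightedRandomSampler


class RecommendationDataset(Dataset):  # hypothetical name
    def __init__(self, data: torch.Tensor, p: float):
        self.data = data.float()
        self.p = p
        self.iou = self._intersection_over_union()
        self.iou_inverse = 1 - self.iou

    def _intersection_over_union(self) -> torch.Tensor:
        intersection = self.data @ self.data.T
        liked = self.data.sum(dim=1)
        union = liked.unsqueeze(1) + liked.unsqueeze(0) - intersection
        return intersection / union.clamp(min=1)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int):
        user = self.data[index]
        # True with probability p: sample a positive (similar) user
        positive = torch.rand(1).item() < self.p
        # Only the weights differ between branches, so the rest is shared.
        # The chosen row must contain at least one non-zero weight.
        weights = (self.iou if positive else self.iou_inverse)[index]
        sampler = WeightedRandomSampler(weights, num_samples=1)
        other_index = next(iter(sampler))
        similarity = self.iou[index, other_index]
        return user, self.data[other_index], similarity


torch.manual_seed(0)
dataset = RecommendationDataset(torch.randint(high=2, size=(50, 20)), p=0.5)
user, other, similarity = dataset[0]
```

Because a user's IoU with themselves is `1`, positive sampling can return the same user; whether to zero out the diagonal is a design choice left open here.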
            
            
> __Why all of that?__

Based on this sample we can:
- Force `latent` representation of autoencoder to be similar/dissimilar based on `positive`/`negative` sample

# Autoencoder

Create `AutoEncoder` similarly to what we did in the first code cell, but with the following additions:
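- `__init__` takes a `p` argument (probability in the `[0, 1]` range) and creates `torch.nn.Dropout` using it (the default value for `p` should be around `0.1`)
- `forward` takes an additional argument `mask: bool`; when `True`, the input is passed through the dropout layer before encoding, so the autoencoder has to reconstruct a partially masked sample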
- `forward` returns `tuple` with two elements:
    - latent space (after `encoding`)
    - recreated representation by `decoder`
    
Also create `3`/`4` layer encoder and decoder (separate neural networks) and pass them into `AutoEncoder` class:
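- leave the `decoder` output as raw logits (no final `sigmoid` layer), because the reconstruction loss below uses `BCEWithLogitsLoss`, which applies the sigmoid internally; apply `torch.sigmoid` yourself when you need probabilities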
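
A sketch of such a module, assuming that `mask=True` means "apply dropout to the input before encoding". The class name and layer sizes are hypothetical, and the decoder's final activation is deliberately left out: add or drop a `Sigmoid` depending on which BCE variant your loss uses.

```python
import torch


class MaskedAutoEncoder(torch.nn.Module):  # hypothetical name
    def __init__(self, encoder, decoder, p: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.dropout = torch.nn.Dropout(p)

    def forward(self, x, mask: bool = False):
        if mask:
            # Randomly zeroes inputs in training mode (no-op in eval mode)
            x = self.dropout(x)
        latent = self.encoder(x)
        return latent, self.decoder(latent)


torch.manual_seed(0)
encoder = torch.nn.Sequential(
    torch.nn.Linear(100, 50), torch.nn.ReLU(), torch.nn.Linear(50, 10)
)
decoder = torch.nn.Sequential(
    torch.nn.Linear(10, 50), torch.nn.ReLU(), torch.nn.Linear(50, 100)
)
model = MaskedAutoEncoder(encoder, decoder)

x = torch.randint(high=2, size=(4, 100)).float()
latent, reconstruction = model(x, mask=True)
```

Keep in mind that `torch.nn.Dropout` also rescales the surviving inputs by `1 / (1 - p)` during training.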

# Loss function(s)

## Similarity Measurement

Create a new loss function called `EuclideanSimilarity` by inheriting from [PairwiseDistance](https://pytorch.org/docs/stable/generated/torch.nn.PairwiseDistance.html) and __overriding `forward`__:
- Takes two arguments `x1` and `x2`
- Passes them through `super().forward(x1, x2)` (calling `super().__call__` from inside the overridden `forward` would recurse infinitely) and transforms the distance (`d`) into a similarity according to the formula:

$$
S = \frac{1}{1 + d}
$$
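
A minimal sketch of this module. It calls the parent's `forward` to get the distance and then applies the formula; the input values are made up for illustration:

```python
import torch


class EuclideanSimilarity(torch.nn.PairwiseDistance):
    def forward(self, x1, x2):
        # Parent computes the row-wise Euclidean distance (p=2 by default)
        distance = super().forward(x1, x2)
        return 1 / (1 + distance)


similarity = EuclideanSimilarity()
a = torch.tensor([[0.0, 0.0], [0.0, 0.0]])
b = torch.tensor([[0.0, 0.0], [3.0, 4.0]])
values = similarity(a, b)  # one similarity per row
```

Identical rows give a similarity of approximately `1` and rows at distance `5` give approximately `1/6`; the results are not exact because `PairwiseDistance` adds a small `eps` to the difference.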

## Generic loss

Create a new loss function by inheriting from `nn.Module` and inside:
- `__init__`:
    - create `BCEWithLogitsLoss` and save it as `bce_logits`
    - create `BCELoss` and save it as `bce`
    - create `EuclideanSimilarity` and save it as `similarity`
- `forward`:
    - Takes five arguments:
        - `user_reconstruction` (`torch.Tensor` of shape `(batch, features)`)
        - `original` (`torch.Tensor` of shape `(batch, features)`)
        - `user_latent` (`torch.Tensor` of shape `(batch, latent_features)`)
        - `other_latent` (`torch.Tensor` of shape `(batch, latent_features)`)
        - `positivity` (`torch.Tensor` of shape `(batch,)` with targets in the `[0, 1]` range, e.g. the `similarity` returned by the dataset)
    - `reconstruction = self.bce_logits(user_reconstruction, original)` - obtains the reconstruction loss of our (possibly masked) sample
    - `similarity = self.similarity(user_latent, other_latent)` - how close the representation is to the positive/negative sample in terms of Euclidean similarity
    - `positive_negative_similarity = self.bce(similarity, positivity)` - measures whether the representations should be moved apart in Euclidean space or pushed closer together
    - Return `reconstruction + positive_negative_similarity` as the final loss
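
The description above can be sketched as follows. The class name `RecommendationLoss` is hypothetical, and the sketch assumes the reconstruction consists of raw logits (which is what `BCEWithLogitsLoss` expects) and that `positivity` is a float tensor:

```python
import torch


class EuclideanSimilarity(torch.nn.PairwiseDistance):
    def forward(self, x1, x2):
        return 1 / (1 + super().forward(x1, x2))


class RecommendationLoss(torch.nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.bce_logits = torch.nn.BCEWithLogitsLoss()
        self.bce = torch.nn.BCELoss()
        self.similarity = EuclideanSimilarity()

    def forward(
        self, user_reconstruction, original, user_latent, other_latent, positivity
    ):
        # How well the (possibly masked) user was reconstructed
        reconstruction = self.bce_logits(user_reconstruction, original)
        # Similarity lies in (0, 1], so it is a valid input for BCELoss
        similarity = self.similarity(user_latent, other_latent)
        positive_negative_similarity = self.bce(similarity, positivity)
        return reconstruction + positive_negative_similarity


torch.manual_seed(0)
criterion = RecommendationLoss()
original = torch.randint(high=2, size=(4, 100)).float()
loss = criterion(
    torch.randn(4, 100),  # stand-in for raw decoder logits
    original,
    torch.randn(4, 10),  # stand-in for user_latent
    torch.randn(4, 10),  # stand-in for other_latent
    torch.rand(4),  # stand-in for positivity targets in [0, 1]
)
```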
    
# Training loop

For each item in `Dataset` (`user`, `other`, `similarity`) do the following:
- Pass `user` through `autoencoder` with `mask=True` to obtain `(user_latent, user_reconstruction)`
- Pass `other` through `autoencoder` with `mask=False` to obtain `(other_latent, _)` (__second return value is not needed!__)
- Pass (`user_reconstruction, user, user_latent, other_latent, similarity`) to our previously constructed loss function
- Backpropagate and try to minimize loss
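
The loop above can be sketched compactly as follows. To stay self-contained, the pieces are inlined instead of using the exercise classes: random batches stand in for the `Dataset`, a plain `Dropout` plays the role of `mask=True`, and the decoder is assumed to output raw logits. The optimizer, learning rate and sizes are illustrative assumptions:

```python
import torch

torch.manual_seed(0)

encoder = torch.nn.Sequential(
    torch.nn.Linear(100, 60), torch.nn.ReLU(), torch.nn.Linear(60, 20)
)
decoder = torch.nn.Sequential(
    torch.nn.Linear(20, 60), torch.nn.ReLU(), torch.nn.Linear(60, 100)
)
dropout = torch.nn.Dropout(0.1)
bce_logits = torch.nn.BCEWithLogitsLoss()
bce = torch.nn.BCELoss()
distance = torch.nn.PairwiseDistance()
optimizer = torch.optim.Adam(
    [*encoder.parameters(), *decoder.parameters()], lr=1e-3
)

# Stand-in batches of (user, other, similarity); use your Dataset instead
batches = [
    (
        torch.randint(high=2, size=(32, 100)).float(),
        torch.randint(high=2, size=(32, 100)).float(),
        torch.rand(32),
    )
    for _ in range(5)
]

losses = []
for user, other, similarity in batches:
    optimizer.zero_grad()
    # mask=True: the user is partially masked before encoding
    user_latent = encoder(dropout(user))
    user_reconstruction = decoder(user_latent)
    # mask=False: only the latent representation of the other user is needed
    other_latent = encoder(other)
    latent_similarity = 1 / (1 + distance(user_latent, other_latent))
    loss = bce_logits(user_reconstruction, user) + bce(latent_similarity, similarity)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```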

# After training

> __What (hopefully!) have we achieved at the end?__

## Collaborative filtering

- Pass whole dataset `(users, items)` through `encoder` to get encoded representation
- For the user of choice find the most similar one via `EuclideanSimilarity` (or multiple similar users)
- __Recommend items which the similar users liked and the chosen user has not interacted with yet__ (possibly with majority voting)
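
A sketch of these steps with an untrained stand-in encoder and random data. The number of neighbours `k` and the "more than half of the neighbours" voting rule are assumptions, and `torch.cdist` is used here as a one-versus-all replacement for `EuclideanSimilarity`:

```python
import torch

torch.manual_seed(0)

data = torch.randint(high=2, size=(200, 100)).float()
# Untrained stand-in; use your trained encoder instead
encoder = torch.nn.Sequential(
    torch.nn.Linear(100, 60), torch.nn.ReLU(), torch.nn.Linear(60, 20)
)

user_index, k = 0, 5  # assumed user of choice and number of neighbours

with torch.no_grad():
    latent = encoder(data)  # (users, latent_features)

# Euclidean distance from the chosen user to every user, turned into similarity
distances = torch.cdist(latent[user_index : user_index + 1], latent).squeeze(0)
similarity = 1 / (1 + distances)
similarity[user_index] = -1  # never pick the user as their own neighbour
neighbours = similarity.topk(k).indices

# Majority voting: how many neighbours liked each item
votes = data[neighbours].sum(dim=0)
votes[data[user_index] == 1] = 0  # skip items the user already liked
recommended = (votes > k / 2).nonzero().flatten()
```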

## Content-Based

- Pass specific user through both `encoder` and `decoder` (__with `mask=False` specified!__)
- Among the items __which are `zero` in the original representation__, find the one with the largest reconstructed value
- Recommend this item as the next one (possibly with some `threshold` for the representation)
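
A sketch of the content-based step with untrained stand-in networks. The decoder here outputs raw logits, so `torch.sigmoid` is applied explicitly (skip it if your decoder already ends with a sigmoid); the `threshold` value is an assumption:

```python
import torch

torch.manual_seed(0)

# Untrained stand-ins; use your trained encoder and decoder instead
encoder = torch.nn.Sequential(
    torch.nn.Linear(100, 60), torch.nn.ReLU(), torch.nn.Linear(60, 20)
)
decoder = torch.nn.Sequential(
    torch.nn.Linear(20, 60), torch.nn.ReLU(), torch.nn.Linear(60, 100)
)

user = torch.randint(high=2, size=(100,)).float()

with torch.no_grad():
    # No masking at inference: the full user vector is encoded and decoded
    logits = decoder(encoder(user.unsqueeze(0))).squeeze(0)
    scores = torch.sigmoid(logits)

# Only items the user has not interacted with are candidates
scores[user == 1] = -1
best_score, best_item = scores.max(dim=0)

threshold = 0.5  # assumed minimum score for making a recommendation
recommendation = best_item.item() if best_score.item() >= threshold else None
```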


# Good Luck :)

```python
import torch

users = 10000
items = 100

data = torch.randint(high=2, size=(users, items))

# Remember to create new cells for each part of the exercise!
```

# Challenges

## Assessment

- Contrastive autoencoder:
    - How does it work?
    - To which other type of autoencoder (shown in this lecture) is it connected?
    - What is the rationale standing behind it?
- How to tie weights in PyTorch? Create autoencoder with weights tied between `encoder` and `decoder`, use it in our project

## Non-assessment

- Variational Autoencoders:
    - Why is it a generative model?
    - How does the __reparametrization trick__ work?
    - Code an example VAE
    
- Check [DeepRec](https://arxiv.org/abs/1708.01715) model. Do you know how to code it?
- Check [SimCLR](https://arxiv.org/pdf/2002.05709.pdf) research paper for creative approach to representation learning
- What is [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition) and how does it relate to the [Matrix Factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)) technique?