# Matrix Factorization for Finding Hidden Genres

To find hidden genres, we need to look at the rating matrix again. There is an additional problem, though: the more data we have, the bigger the matrix becomes, and the more time-consuming the computations become. To preserve the important features of the data while reducing the computation required, we use a technique called `dimensionality reduction`. In this notebook, we split the rating matrix into smaller parts with a mathematical technique called `Singular Value Decomposition` (SVD).

Before going into these topics, it is important to understand the basics of matrix factorization. Assume you have a matrix `R`; we can decompose it in the following form.

<center>

__R = UV__
</center>

If R has dimensions n\*m, then U has dimensions n\*d and V has dimensions d\*m. This is called `UV-Decomposition`. In the recommender systems field, R is the ratings matrix, U is the user-feature matrix, and V is the item-feature matrix. The idea behind the factorization is to find values for U and V whose product is as close as possible to R. In theory, this amounts to solving a system of equations, one per entry of R, that the matrix multiplication has to satisfy.
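As a rough illustration of the idea, the sketch below searches for U and V with plain gradient descent on the squared reconstruction error. This is not the method used in the rest of this notebook (which relies on SVD); the small matrix `R`, the number of features `d`, the learning rate, and the iteration count are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings matrix R with n = 4 users and m = 3 items.
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])
n, m = R.shape
d = 2  # assumed number of hidden features

# Start from small random factors: U is n x d, V is d x m.
U = rng.normal(scale=0.1, size=(n, d))
V = rng.normal(scale=0.1, size=(d, m))

initial_error = np.linalg.norm(R - U @ V)

learning_rate = 0.01  # assumed step size
for _ in range(5000):
    residual = R - U @ V          # what the current factors fail to explain
    U_step = residual @ V.T       # negative gradient of 0.5 * ||R - UV||^2 w.r.t. U
    V_step = U.T @ residual       # negative gradient w.r.t. V
    U += learning_rate * U_step
    V += learning_rate * V_step

final_error = np.linalg.norm(R - U @ V)

print("shapes:", U.shape, V.shape)
print("error before:", initial_error)
print("error after: ", final_error)
```

With `d` smaller than the rank of `R`, the product `U @ V` can only approximate `R`, which is exactly the trade-off dimensionality reduction makes.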

## SVD (Singular Value Decomposition)



One of the most common ways of factorizing matrices is SVD. In SVD we construct three matrices: `U`, `Σ (Sigma)`, and `V* (V transpose)`. Here U and V act as the factors, while Σ holds the weights that determine how much each dimension contributes.
<pre style='color:yellow'>
<center>M = UΣV*</center>

- M  = Data matrix
- U  = User feature matrix
- Σ  = Diagonal matrix of weights (the singular values)
- V* = Item feature matrix
</pre>
For a more theoretical understanding of SVD, you can watch [these videos](https://www.youtube.com/watch?v=gXbThCXjZFM&list=PLMrJAkhIeNNSVjnsviglFoY2nXildDCcv&index=1).

Instead of implementing the algorithm myself, I have used the NumPy implementation.

In [1]:
import pandas as pd
import numpy as np

movies = ['mib', 'st', 'av', 'b', 'ss', 'lm']
users = ['Sara', 'Jesper', 'Therese', 'Helle', 'Pietro', 'Ekaterina']
M = pd.DataFrame([
                    [5.0, 3.0, 0.0, 2.0, 2.0, 2.0],
                    [4.0, 3.0, 4.0, 0.0, 3.0, 3.0],
                    [5.0, 2.0, 5.0, 2.0, 1.0, 1.0],
                    [3.0, 5.0, 3.0, 0.0, 1.0, 1.0],
                    [3.0, 3.0, 3.0, 2.0, 4.0, 5.0],
                    [2.0, 3.0, 2.0, 3.0, 5.0, 5.0]],
                columns=movies,
                index=users)

from numpy import linalg

U, sigma, V_t = linalg.svd(M)

If we multiply the matrices above back together (with the values of `sigma` placed on the diagonal of a matrix), the product reproduces the original values up to floating-point rounding error. NumPy returns `sigma` as a one-dimensional array of singular values sorted in descending order, which shows how much information each column of U and the matching row of V_t carries.
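The reconstruction can be checked directly. The sketch below repeats the ratings data as a plain array so that it runs on its own; `Sigma` and `M_rebuilt` are names introduced here for illustration.

```python
import numpy as np

# Same ratings as the DataFrame M above, as a plain array.
M = np.array([[5.0, 3.0, 0.0, 2.0, 2.0, 2.0],
              [4.0, 3.0, 4.0, 0.0, 3.0, 3.0],
              [5.0, 2.0, 5.0, 2.0, 1.0, 1.0],
              [3.0, 5.0, 3.0, 0.0, 1.0, 1.0],
              [3.0, 3.0, 3.0, 2.0, 4.0, 5.0],
              [2.0, 3.0, 2.0, 3.0, 5.0, 5.0]])

U, sigma, V_t = np.linalg.svd(M)

# sigma is a 1-D array, so place it on the diagonal before multiplying.
Sigma = np.diag(sigma)
M_rebuilt = U @ Sigma @ V_t

print("shapes:", U.shape, sigma.shape, V_t.shape)
print("matches M up to rounding:", np.allclose(M, M_rebuilt))
print("singular values:", sigma)
```

`np.allclose` compares with a small tolerance, which is the appropriate test here because the product differs from `M` only by rounding error.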

We can also use sigma to reduce the dimensions of the matrices while retaining most of the information in the original data. To do that, we keep only the largest values in sigma, together with the matching columns of U and rows of V_t.
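One common way to decide how many values to keep is to look at the share of the total squared singular values ("energy") covered by the first k of them. The sketch below assumes a 90% threshold, which is an arbitrary choice for illustration and not something this notebook prescribes.

```python
import numpy as np

# Same ratings as the DataFrame M above, as a plain array.
M = np.array([[5.0, 3.0, 0.0, 2.0, 2.0, 2.0],
              [4.0, 3.0, 4.0, 0.0, 3.0, 3.0],
              [5.0, 2.0, 5.0, 2.0, 1.0, 1.0],
              [3.0, 5.0, 3.0, 0.0, 1.0, 1.0],
              [3.0, 3.0, 3.0, 2.0, 4.0, 5.0],
              [2.0, 3.0, 2.0, 3.0, 5.0, 5.0]])

# Only the singular values are needed here.
sigma = np.linalg.svd(M, compute_uv=False)

energy = sigma ** 2
retained = np.cumsum(energy) / energy.sum()  # share kept by the first k values

for k, share in enumerate(retained, start=1):
    print(f"k={k}: {share:.3f} of the energy retained")

threshold = 0.90  # assumed cut-off
# First position where the retained share reaches the threshold (1-based rank).
k = int(np.searchsorted(retained, threshold)) + 1
print("smallest k reaching the threshold:", k)
```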

In [9]:
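def rank_k(k):
    '''
    Reduce the SVD factors to rank k by keeping the k largest singular values.
    '''
    # Keep the first k columns of U and the first k rows of V_t.
    U_reduced = U[:, :k]
    Vt_reduced = V_t[:k, :]
    # Place the k largest singular values on the diagonal of a k x k matrix.
    Sigma_reduced = np.diag(sigma[:k])
    return U_reduced, Sigma_reduced, Vt_reduced

U_reduced, Sigma_reduced, Vt_reduced = rank_k(4)
# @ performs matrix multiplication on NumPy arrays.
M_hat = U_reduced @ Sigma_reduced @ Vt_reduced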

print(M_hat)

[[ 4.87147087  3.11444112  0.04893344  2.23870109  1.94083799  1.920736  ]
 [ 3.49344678  3.45787572  4.19067126  0.94886084  2.61521613  2.82032378]
 [ 5.22111879  1.8034114   4.91572235  1.58969108  1.09528095  1.14205388]
 [ 3.25351113  4.77315242  2.90384191 -0.4721446   1.14157873  1.13455568]
 [ 2.93061675  3.04700483  3.03112668  2.11137004  4.29526848  4.67079756]
 [ 2.27270952  2.76664391  1.89315701  2.50473044  4.91596291  5.35161957]]
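To get a feel for how much each choice of rank loses, the sketch below measures the Frobenius-norm error of the rank-k approximation for every k. It also compares that error with the square root of the sum of the squared discarded singular values, which is what the error of a truncated SVD equals. The helper `rank_k_approx` is a hypothetical name introduced for this example.

```python
import numpy as np

# Same ratings as the DataFrame M above, as a plain array.
M = np.array([[5.0, 3.0, 0.0, 2.0, 2.0, 2.0],
              [4.0, 3.0, 4.0, 0.0, 3.0, 3.0],
              [5.0, 2.0, 5.0, 2.0, 1.0, 1.0],
              [3.0, 5.0, 3.0, 0.0, 1.0, 1.0],
              [3.0, 3.0, 3.0, 2.0, 4.0, 5.0],
              [2.0, 3.0, 2.0, 3.0, 5.0, 5.0]])

U, sigma, V_t = np.linalg.svd(M)

def rank_k_approx(k):
    """Rebuild M from its k largest singular values only (hypothetical helper)."""
    return U[:, :k] @ np.diag(sigma[:k]) @ V_t[:k, :]

errors = []
discarded = []
for k in range(1, len(sigma) + 1):
    # Frobenius norm of the difference between M and its rank-k approximation.
    errors.append(np.linalg.norm(M - rank_k_approx(k)))
    # Size of the singular values that were dropped.
    discarded.append(np.sqrt(np.sum(sigma[k:] ** 2)))
    print(f"k={k}: error={errors[-1]:.4f}, discarded={discarded[-1]:.4f}")
```

The error shrinks as k grows and reaches rounding-level values at full rank, so the choice of k is a trade-off between matrix size and fidelity to the original ratings.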
