## SVD for topic analysis

We can use SVD to determine what we call ***latent features***. This will be best demonstrated with an example.

### Example

Let's look at users' ratings of different movies. Ratings range from 1 to 5; a rating of 0 means the user hasn't watched the movie.

|   -   | Matrix | Alien | Serenity | Casablanca | Amelie |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     1 |        1 |          0 |      0 |
|   **Bob** |      3 |     3 |        3 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, Serenity) are Sci-fi movies and the last two (Casablanca, Amelie) are Romance. Can we mathematically pull out these "topics" from user ratings?

Let's do the computation with Python.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

### Compute SVD using numpy
[Documentation](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.linalg.svd.html)

In [3]:
# Compute SVD
from numpy.linalg import svd
U, sigma, VT = svd(M, full_matrices=False, compute_uv=True)  # see the documentation for the full_matrices parameter

In [4]:
# Check that U and V have orthonormal columns.
# If they do, multiplying the transposed matrix by the matrix itself gives the
# identity: each column dotted with itself is 1, and with any other column is 0.
# (V is square here, so VT.T @ VT and VT @ VT.T are both the identity.)

print("Checking U:")
print(np.dot(U.T,U).round(1))
print("\nChecking V:")
print(np.dot(VT.T,VT).round(1))

Checking U:
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1. -0.  0.]
 [ 0.  0. -0.  1. -0.]
 [ 0.  0.  0. -0.  1.]]

Checking V:
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0. -0.]
 [ 0.  0.  1. -0.  0.]
 [ 0.  0. -0.  1.  0.]
 [ 0. -0.  0.  0.  1.]]
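
Rounding to one decimal is handy for reading the result by eye, but a tolerance-based comparison is a stricter check. The sketch below repeats the test with `np.allclose`; the helper name `has_orthonormal_columns` is hypothetical and only used for illustration.

```python
import numpy as np

# Same ratings matrix as above (7 users x 5 movies).
M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

def has_orthonormal_columns(A):
    """Hypothetical helper: True if A.T @ A equals the identity within tolerance."""
    return np.allclose(A.T @ A, np.eye(A.shape[1]))

u_ok = has_orthonormal_columns(U)     # columns of U
v_ok = has_orthonormal_columns(VT.T)  # columns of V (V is VT transposed)
print(u_ok, v_ok)
```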


## Part 1

Describe the matrices

In [5]:
## U matrix
## shape
print(U.shape)
## it relates
print("\nU matrix:")
print("Relates Users (rows) to latent features (columns) based on magnitude of values in matrix.\n")
print(U.round(2))

(7, 5)

U matrix:
Relates Users (rows) to latent features (columns) based on magnitude of values in matrix.

[[-0.14  0.02  0.01  0.99 -0.  ]
 [-0.41  0.07  0.03 -0.06 -0.89]
 [-0.55  0.09  0.04 -0.08  0.42]
 [-0.69  0.12  0.05 -0.1   0.19]
 [-0.15 -0.59 -0.65 -0.    0.  ]
 [-0.07 -0.73  0.68  0.   -0.  ]
 [-0.08 -0.3  -0.33 -0.   -0.  ]]


In [6]:
## sigma matrix
## shape
print("S matrix")
print(sigma.shape) # svd returns only the diagonal of S, as a 1-D array
print("The latent feature singular values. The (singular value)^2 is\n"
      "the eigenvalue of M^T M.")
print("\nThe singular values:")
print(sigma.round(2))
print("\nThe singular values matrix, S:")
sigma_m = sigma * np.eye(len(sigma))
print(sigma_m.round(2))

S matrix
(5,)
The latent feature singular values. The (singular value)^2 is
the eigenvalue of M^T M.

The singular values:
[12.48  9.51  1.35  0.    0.  ]

The singular values matrix, S:
[[12.48  0.    0.    0.    0.  ]
 [ 0.    9.51  0.    0.    0.  ]
 [ 0.    0.    1.35  0.    0.  ]
 [ 0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.  ]]
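
The link between singular values and eigenvalues can be checked numerically. The following is a minimal sketch, assuming the same matrix `M` as above: it compares the squared singular values with the eigenvalues of `M.T @ M`.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Singular values only (returned in descending order).
sigma = np.linalg.svd(M, compute_uv=False)

# eigvalsh handles symmetric matrices and returns eigenvalues in ascending
# order, so reverse them to line up with the singular values.
eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]

matches = np.allclose(sigma**2, eigvals)
print(np.round(sigma**2, 2))
print(np.round(eigvals, 2))
print("Squared singular values match eigenvalues of M^T M:", matches)
```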


In [7]:
## VT matrix
# shape
print("The VT matrix")
print(VT.shape)
# it relates
print("Relates latent topics (rows) to the items (columns). In this case the columns are movies.\n")
print(VT.round(2))

The VT matrix
(5, 5)
Relates latent topics (rows) to the items (columns). In this case the columns are movies.

[[-0.56 -0.59 -0.56 -0.09 -0.09]
 [ 0.13 -0.03  0.13 -0.7  -0.7 ]
 [ 0.41 -0.8   0.41  0.09  0.09]
 [-0.71  0.    0.71 -0.    0.  ]
 [-0.    0.   -0.    0.71 -0.71]]
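
One way to read the topics out of `VT` is to ask, for each movie, which of the top two rows carries the largest absolute loading. This is only an illustrative sketch: the choice of two topics is an assumption here, and absolute values are used because the sign of each singular vector is arbitrary.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
movies = ['Matrix', 'Alien', 'Serenity', 'Casablanca', 'Amelie']

_, _, VT = np.linalg.svd(M, full_matrices=False)

k = 2  # assumed number of topics to inspect
# For each movie (column), index of the topic (row) with the largest |loading|.
dominant = np.abs(VT[:k]).argmax(axis=0)

for movie, topic in zip(movies, dominant):
    print(f"{movie}: topic {topic}")
```

With this matrix, the three Sci-fi movies land on topic 0 and the two Romance movies on topic 1.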


## Part 2

Making use of the factorized version of our ratings

In [8]:
# Make interpretable
movies = ['Matrix','Alien','Serenity','Casablanca','Amelie']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

#U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))
df_U = pd.DataFrame(U, index=users)
df_VT = pd.DataFrame(VT, columns=movies)

print("U matrix - relates users to topics")
print(df_U.round(2))
print("--------------------------------------")
print("\nSingular values")
print("square these to get the eigenvalues (and describe proportional variance)")
print(np.diag(sigma.round(2)))
print("--------------------------------------")
print("\nV matrix - relates topics to the movies")
print(df_VT.round(2))

U matrix - relates users to topics
          0     1     2     3     4
Alice -0.14  0.02  0.01  0.99 -0.00
Bob   -0.41  0.07  0.03 -0.06 -0.89
Cindy -0.55  0.09  0.04 -0.08  0.42
Dan   -0.69  0.12  0.05 -0.10  0.19
Emily -0.15 -0.59 -0.65 -0.00  0.00
Frank -0.07 -0.73  0.68  0.00 -0.00
Greg  -0.08 -0.30 -0.33 -0.00 -0.00
--------------------------------------

Singular values
square these to get the eigenvalues (and describe proportional variance)
[[12.48  0.    0.    0.    0.  ]
 [ 0.    9.51  0.    0.    0.  ]
 [ 0.    0.    1.35  0.    0.  ]
 [ 0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.  ]]
--------------------------------------

V matrix - relates topics to the movies
   Matrix  Alien  Serenity  Casablanca  Amelie
0   -0.56  -0.59     -0.56       -0.09   -0.09
1    0.13  -0.03      0.13       -0.70   -0.70
2    0.41  -0.80      0.41        0.09    0.09
3   -0.71   0.00      0.71       -0.00    0.00
4   -0.00   0.00     -0.00        0.71   -0.71


In [9]:
# Multiply the U, S, and VT matrices to reconstruct the original dataset
print(np.around(df_U.dot(np.diag(sigma)).dot(df_VT),2))
print("Perfectly reconstructs - no surprise because we used all the latent factors.")

       Matrix  Alien  Serenity  Casablanca  Amelie
Alice     1.0    1.0       1.0         0.0     0.0
Bob       3.0    3.0       3.0        -0.0    -0.0
Cindy     4.0    4.0       4.0         0.0    -0.0
Dan       5.0    5.0       5.0        -0.0    -0.0
Emily     0.0    2.0      -0.0         4.0     4.0
Frank     0.0    0.0      -0.0         5.0     5.0
Greg      0.0    1.0      -0.0         2.0     2.0
Perfectly reconstructs - no surprise because we used all the latent factors.
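
The same reconstruction can be verified without pandas and without rounding. In this sketch, `U * sigma` scales each column of `U` by its singular value, which is equivalent to multiplying by `np.diag(sigma)`.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# Broadcasting scales column j of U by sigma[j], the same as U @ np.diag(sigma).
R = (U * sigma) @ VT

exact = np.allclose(R, M)
print("Full reconstruction matches M:", exact)
```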


## Trim the matrices to represent a factorization from only the top two factors

In [10]:
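lf = 2 # use only the top two latent factors
df_U_lf = df_U.iloc[:,:lf]
print("U matrix\n", df_U_lf.round(2))
print("\n")
# Note: sigma_lf and df_VT_lf are stored rounded to 2 decimals, so the
# reconstruction in Part 3 carries a small rounding error on top of the truncation.
sigma_lf = sigma[:lf].round(2)
print("sigma", sigma_lf)
df_VT_lf = df_VT.iloc[:lf,:].round(2)
print("\n")
print("V matrix")
print(df_VT_lf.round(2))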

U matrix
           0     1
Alice -0.14  0.02
Bob   -0.41  0.07
Cindy -0.55  0.09
Dan   -0.69  0.12
Emily -0.15 -0.59
Frank -0.07 -0.73
Greg  -0.08 -0.30


sigma [12.48  9.51]


V matrix
   Matrix  Alien  Serenity  Casablanca  Amelie
0   -0.56  -0.59     -0.56       -0.09   -0.09
1    0.13  -0.03      0.13       -0.70   -0.70
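
Two factors were picked by hand above. A common heuristic, sketched here as an assumption rather than as part of the original analysis, is to keep enough factors to cover a chosen share of the total squared singular values. The 0.95 cutoff is an arbitrary illustrative value.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

sigma = np.linalg.svd(M, compute_uv=False)

# Share of the total "energy" (sum of squared singular values) per factor.
energy = sigma**2 / np.sum(sigma**2)
cumulative = np.cumsum(energy)

threshold = 0.95  # assumed cutoff, chosen only for illustration
# First index where the cumulative share reaches the threshold, as a count.
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print("Factors needed:", k)
```

For this ratings matrix the first two factors already carry more than 99% of the total, which agrees with the choice of `lf = 2`.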


## Part 3: Does your approximate version of the matrix still reasonably reconstruct the original?

In [11]:
# How well are original ratings captured when only using a subset of the latent factors?
print("Here's the original:")
df_M = pd.DataFrame(M, index=users, columns=movies)
print(df_M)
print("\nHere's the reconstruction:")
df_R = df_U_lf.dot(np.diag(sigma_lf)).dot(df_VT_lf)
print(df_R.round(1))

Here's the original:
       Matrix  Alien  Serenity  Casablanca  Amelie
Alice       1      1         1           0       0
Bob         3      3         3           0       0
Cindy       4      4         4           0       0
Dan         5      5         5           0       0
Emily       0      2         0           4       4
Frank       0      0         0           5       5
Greg        0      1         0           2       2

Here's the reconstruction:
       Matrix  Alien  Serenity  Casablanca  Amelie
Alice     1.0    1.0       1.0        -0.0    -0.0
Bob       3.0    3.0       3.0        -0.0    -0.0
Cindy     4.0    4.0       4.0        -0.0    -0.0
Dan       5.0    5.0       5.0        -0.0    -0.0
Emily     0.3    1.3       0.3         4.1     4.1
Frank    -0.4    0.7      -0.4         4.9     4.9
Greg      0.2    0.6       0.2         2.1     2.1
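
To put a number on "reasonably reconstructs", the Frobenius norm of the difference between the original and the rank-2 approximation can be computed. For a truncated SVD this error equals the square root of the sum of the squared discarded singular values. The sketch below uses unrounded factors, so its values can differ slightly from the table above.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

k = 2
# Rank-k approximation built from the top k factors (no rounding).
R_k = (U[:, :k] * sigma[:k]) @ VT[:k]

err = np.linalg.norm(M - R_k)                 # Frobenius norm for 2-D input
expected = np.sqrt(np.sum(sigma[k:]**2))      # norm of the discarded factors

print("Reconstruction error:", round(float(err), 2))
print("Matches discarded singular values:", np.isclose(err, expected))
```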


Takeaways:  
1) A matrix X can be factorized into the matrices U, S, and V using singular value decomposition: X = U S V^T.  
2) The columns of U and V are orthonormal vectors that define the latent topics. You can think of latent topics as principal components: US = XV always holds, and these products are the principal component scores when the columns of X are mean-centered.  
3) The U matrix relates the rows of X to the latent topics, based on the magnitude of the values in the matrix (the larger the absolute value, the more that row loads onto that latent topic).  
4) The S matrix contains the singular values associated with the latent topics. The squared singular values are the eigenvalues of X^T X.  
5) The rows of V^T relate the topics to the columns of X (in this case, the movies).  
6) By keeping only the largest singular values, you simultaneously reduce dimensionality, eliminate collinearity, and find latent topics that can approximately reconstruct your original data.
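
The identity US = XV and its link to principal components can be checked directly. This is a sketch under one stated assumption: the principal component reading applies to the mean-centered matrix, which the notebook above does not compute, so the centered part is added purely for illustration.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

# US = XV holds for any matrix, centered or not.
U, sigma, VT = np.linalg.svd(M, full_matrices=False)
same = np.allclose(U * sigma, M @ VT.T)

# For the PCA reading, center each column first (assumption: standard PCA).
Mc = M - M.mean(axis=0)
Uc, sc, VTc = np.linalg.svd(Mc, full_matrices=False)
pca_scores = Uc * sc  # principal component scores of the centered data

# Squared singular values / (n - 1) equal the covariance eigenvalues.
n = M.shape[0]
cov_eigs = np.linalg.eigvalsh(np.cov(Mc, rowvar=False))[::-1]
var_match = np.allclose(sc**2 / (n - 1), cov_eigs)

print("US == XV:", same)
print("Variances match covariance eigenvalues:", var_match)
```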