## SVD for topic analysis

We can use singular value decomposition (SVD) to uncover what we call ***latent features***: hidden dimensions, such as genre, that explain the patterns in a data set. This is best demonstrated with an example.

### Example

Let's look at users' ratings of different movies. Ratings run from 1 to 5; a rating of 0 means the user hasn't watched the movie.

|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------ |
| **Alice** |      1 |     2 |        2 |          0 |      0 |
|   **Bob** |      3 |     5 |        5 |          0 |      0 |
| **Cindy** |      4 |     4 |        4 |          0 |      0 |
|   **Dan** |      5 |     5 |        5 |          0 |      0 |
| **Emily** |      0 |     2 |        0 |          4 |      4 |
| **Frank** |      0 |     0 |        0 |          5 |      5 |
|  **Greg** |      0 |     1 |        0 |          2 |      2 |

Note that the first three movies (Matrix, Alien, StarWars) are Sci-fi movies and the last two (Casablanca, Titanic) are Romance. We will be able to mathematically pull out these topics!

Let's do the computation with Python.

In [186]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [187]:
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

In [188]:
# Compute SVD
from numpy.linalg import svd
U, sigma, VT = svd(M,full_matrices=False)

## Part 1

Describe in your own words what the matrices contain and how they might be used

In [189]:
## U matrix
## print the shape and add your description

# the columns of U are the left singular vectors: one orthonormal
# direction per latent feature that the decomposition found in the data

# my explanation:
# each row describes HOW STRONGLY a PERSON is associated with
# each latent feature found in this data set
# (the sign of a column is arbitrary, so a negative value is not a dislike)

print(U.shape)
U

(7, 5)


array([[-2.12142669e-01,  2.35889359e-02,  3.05275882e-01,
         2.55204195e-01, -7.18116542e-01],
       [-5.48509647e-01,  6.39541961e-02,  5.32055497e-01,
         4.61448643e-01,  3.59058271e-01],
       [-4.96897235e-01,  6.71052975e-02, -3.13985067e-01,
        -1.95838988e-01,  4.28414958e-01],
       [-6.21121543e-01,  8.38816219e-02, -3.92481334e-01,
        -2.44798735e-01, -4.14543621e-01],
       [-1.24855356e-01, -5.96778016e-01,  3.95328299e-01,
        -5.21519583e-01, -3.60822483e-16],
       [-4.41332838e-02, -7.33917008e-01, -4.19213292e-01,
         5.32614583e-01, -1.11022302e-16],
       [-6.24276782e-02, -2.98389008e-01,  1.97664149e-01,
        -2.60759791e-01, -1.80411242e-16]])

In [190]:
## sigma matrix
## print the shape and add your description

# sigma holds the singular values, sorted from largest to smallest.
# They are not probabilities: each one measures the strength of
# the corresponding latent feature in the data.

# my explanation:
# this vector describes HOW IMPORTANT each latent feature is
# in explaining preferences


print(sigma.shape)
sigma

(5,)


array([1.38366398e+01, 9.52139961e+00, 1.68783520e+00, 1.02056846e+00,
       6.27520304e-17])
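
One way to read these values, sketched here under the assumption that squared singular values are a reasonable measure of how much of the ratings each latent feature accounts for. The squares sum to the squared Frobenius norm of `M`, so each share is a fraction of that total. The names `energy` and `cumulative` are illustrative and do not appear in the original notebook.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Singular values only (no U or VT needed for this check)
sigma = np.linalg.svd(M, compute_uv=False)

# Share of the total squared Frobenius norm carried by each latent feature
energy = sigma**2 / np.sum(sigma**2)

# Running total: how much is retained if we keep the first k features
cumulative = np.cumsum(energy)

print(np.around(energy, 3))
print(np.around(cumulative, 3))
```

For this matrix the first two features together carry about 98.6% of the total, which is why trimming to two factors later loses little.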

In [191]:
## VT matrix
## print the shape and add your description

# the rows of VT are the right singular vectors: each row is one
# latent feature expressed as weights over the movies

# my explanation:
# this matrix describes HOW MUCH each MOVIE displays
# each latent feature found in this data set

print(VT.shape)
VT

(5, 5)


array([[-5.02352330e-01, -6.19526758e-01, -5.96967929e-01,
        -6.10656353e-02, -6.10656353e-02],
       [ 9.48684921e-02, -4.59141416e-02,  1.10779738e-01,
        -6.98791711e-01, -6.98791711e-01],
       [-7.80232905e-01,  6.16649691e-01,  3.10944517e-02,
        -7.07562953e-02, -7.07562953e-02],
       [-3.60386901e-01, -4.83551079e-01,  7.93971206e-01,
         5.43569648e-02,  5.43569648e-02],
       [-0.00000000e+00,  1.46317186e-16, -1.46317186e-16,
         7.07106781e-01, -7.07106781e-01]])

## Part 2

Making use of the factorized version of our ratings

In [192]:
# Make interpretable
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

U, sigma, VT = (np.around(x,2) for x in (U,sigma,VT))
df_U = pd.DataFrame(U, index=users)
df_VT = pd.DataFrame(VT, columns=movies)

print(df_U)
print("--------------------------------------")
print(np.diag(sigma))
print("--------------------------------------")
print(df_VT)

          0     1     2     3     4
Alice -0.21  0.02  0.31  0.26 -0.72
Bob   -0.55  0.06  0.53  0.46  0.36
Cindy -0.50  0.07 -0.31 -0.20  0.43
Dan   -0.62  0.08 -0.39 -0.24 -0.41
Emily -0.12 -0.60  0.40 -0.52 -0.00
Frank -0.04 -0.73 -0.42  0.53 -0.00
Greg  -0.06 -0.30  0.20 -0.26 -0.00
--------------------------------------
[[13.84  0.    0.    0.    0.  ]
 [ 0.    9.52  0.    0.    0.  ]
 [ 0.    0.    1.69  0.    0.  ]
 [ 0.    0.    0.    1.02  0.  ]
 [ 0.    0.    0.    0.    0.  ]]
--------------------------------------
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70
2   -0.78   0.62      0.03       -0.07    -0.07
3   -0.36  -0.48      0.79        0.05     0.05
4   -0.00   0.00     -0.00        0.71    -0.71


Add your own description of how the matrices relate to each other
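
A minimal sketch of the relationship, assuming the same `M` as above: multiplying the three factors back together recovers the ratings, and the singular vectors are orthonormal. The checks use `np.allclose`, so they tolerate floating-point noise. The unrounded factors are used here, since rounding to two decimals first would break the exact identities.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)

# M = U * diag(sigma) * VT: users-to-features, feature strengths, features-to-movies
reconstructed = U @ np.diag(sigma) @ VT
print(np.allclose(reconstructed, M))

# Columns of U are orthonormal, and so are the rows of VT
print(np.allclose(U.T @ U, np.eye(5)))
print(np.allclose(VT @ VT.T, np.eye(5)))
```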

## Trim the matrices to represent a factorization from only the top two factors

In [193]:
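# df_U[:2] would keep only rows 0 and 1 (Alice and Bob) with all columns
#print ( df_U[:2] )

# this would leave a 2 x 2 matrix (two users, two features)
#trimmed_U = df_U.iloc[:2,0:2]

# keep every user (row), but only the first two latent features (columns)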
trimmed_U = df_U.iloc[:,0:2]
print ( trimmed_U )

trimmed_sigma = sigma[0:2]
print (trimmed_sigma)

trimmed_VT = df_VT.iloc[0:2,:]
print ( trimmed_VT )

          0     1
Alice -0.21  0.02
Bob   -0.55  0.06
Cindy -0.50  0.07
Dan   -0.62  0.08
Emily -0.12 -0.60
Frank -0.04 -0.73
Greg  -0.06 -0.30
[13.84  9.52]
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70


## Part 3: Does your approximate version of the matrix still reasonably reconstruct the original?

In [194]:
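# Use this code but swap in your matrices
#np.around(df_U.dot(np.diag(sigma)).dot(df_VT))

np.around(trimmed_U.dot(np.diag(trimmed_sigma)).dot(trimmed_VT))
# Yes: 31 of the 35 entries match the original after rounding. The other four
# (Bob/Matrix, Dan/Matrix, Emily/Alien, Frank/Alien) are off by 1.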

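|       | Matrix | Alien | StarWars | Casablanca | Titanic |
| ----- | ------ | ----- | -------- | ---------- | ------- |
| **Alice** |    1.0 |   2.0 |      2.0 |        0.0 |     0.0 |
|   **Bob** |    4.0 |   5.0 |      5.0 |        0.0 |     0.0 |
| **Cindy** |    4.0 |   4.0 |      4.0 |       -0.0 |    -0.0 |
|   **Dan** |    4.0 |   5.0 |      5.0 |       -0.0 |    -0.0 |
| **Emily** |    0.0 |   1.0 |      0.0 |        4.0 |     4.0 |
| **Frank** |   -0.0 |   1.0 |     -0.0 |        5.0 |     5.0 |
|  **Greg** |    0.0 |   1.0 |      0.0 |        2.0 |     2.0 |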
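
To put a number on "reasonably", here is a sketch that measures the reconstruction error directly. It assumes the Frobenius norm is the error measure of interest and uses the unrounded factors, so its result can differ slightly from the rounded cell above. The helper `rank_k_approx` is a hypothetical name introduced for illustration.

```python
import numpy as np

M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])


def rank_k_approx(matrix, k):
    """Rebuild the matrix from its top k singular values and vectors."""
    U, sigma, VT = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]


M2 = rank_k_approx(M, 2)

# Frobenius norm of the difference between the original and the approximation
error = np.linalg.norm(M - M2)

# For a truncated SVD this equals the root of the sum of the dropped sigma^2
sigma = np.linalg.svd(M, compute_uv=False)
expected = np.sqrt(np.sum(sigma[2:]**2))

print(error, expected)
```

The two printed values should agree, which ties the error of the trimmed factorization to the singular values that were discarded.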


## Part 4: Make some recommendations


Use cosine similarity to compare all other users to Alice (using movie profiles)

np.dot(x,y) / ( np.linalg.norm(x) * np.linalg.norm(y) )

In [195]:
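# Product of the trimmed factors, scaled by the Frobenius norms of both
# whole matrices. Note that this is not a per-user cosine similarity.
print (np.dot(trimmed_U,trimmed_VT) / ( np.linalg.norm(trimmed_U) * np.linalg.norm(trimmed_VT)))

# trimmed_U[0] selects COLUMN 0 of the DataFrame (the first latent feature
# across all users); Alice's row would be trimmed_U.iloc[0]
Users_norm = np.linalg.norm(trimmed_U[0])
print (Users_norm)

# Frobenius norm of the whole trimmed movie-factor matrix
Movie_norm = np.linalg.norm(trimmed_VT)
print (Movie_norm)

# The same product rescaled by the two norms above
print (np.dot(trimmed_U,trimmed_VT) / ( Users_norm * Movie_norm))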

[[ 0.0533587   0.06455003  0.06405042 -0.00069946 -0.00069946]
 [ 0.14009156  0.16886929  0.16816983 -0.00449652 -0.00449652]
 [ 0.12805088  0.15313147  0.153731   -0.00949265 -0.00949265]
 [ 0.15847733  0.19005289  0.19025273 -0.00939273 -0.00939273]
 [ 0.00299768  0.05215963  0.00299768  0.21343479  0.21343479]
 [-0.02283233  0.03062629 -0.02812823  0.25650145  0.25650145]
 [ 0.00149884  0.02607981  0.00149884  0.10671739  0.10671739]]
1.000299955013495
1.4157330256796299
[[ 0.07541533  0.09123278  0.09052664 -0.00098859 -0.00098859]
 [ 0.19800055  0.23867399  0.2376854  -0.00635522 -0.00635522]
 [ 0.18098267  0.2164307   0.21727806 -0.01341659 -0.01341659]
 [ 0.22398636  0.26861416  0.26889661 -0.01327536 -0.01327536]
 [ 0.00423682  0.0737206   0.00423682  0.30166132  0.30166132]
 [-0.03227042  0.04328614 -0.03975546  0.36253025  0.36253025]
 [ 0.00211841  0.0368603   0.00211841  0.15083066  0.15083066]]
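
A sketch of a per-user comparison that applies the cosine formula above to one pair of users at a time. It assumes each user is represented by their row of `U` restricted to the top two features and scaled by the singular values; using the unscaled rows is an equally defensible choice and gives different numbers. The names `user_coords` and `similarity_to_alice` are illustrative.

```python
import numpy as np

users = ['Alice', 'Bob', 'Cindy', 'Dan', 'Emily', 'Frank', 'Greg']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)
k = 2

# One row per user: coordinates along the top k latent features
user_coords = U[:, :k] * sigma[:k]


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


alice = user_coords[users.index('Alice')]
similarity_to_alice = {
    name: cosine_similarity(alice, user_coords[i])
    for i, name in enumerate(users)
    if name != 'Alice'
}

for name, score in similarity_to_alice.items():
    print(name, round(float(score), 3))
```

Under these assumptions Bob, Cindy, and Dan score close to 1, while Emily, Frank, and Greg score much lower, matching the Sci-fi versus Romance split in the ratings.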


Use cosine similarity to compare all other movies to StarWars (using user profiles)

In [196]:
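# Product of the trimmed factors, scaled by the Frobenius norms of both
# whole matrices. Note that this is not a per-movie cosine similarity.
print (np.dot(trimmed_U,trimmed_VT) / ( np.linalg.norm(trimmed_U) * np.linalg.norm(trimmed_VT)))

# Frobenius norm of the whole trimmed user-factor matrix (not a single user)
Users_norm = np.linalg.norm(trimmed_U)
print (Users_norm)

# Norm of the StarWars column of trimmed_VT (column index 2)
Movie_norm = np.linalg.norm(trimmed_VT['StarWars'])
print (Movie_norm)

# The same product rescaled by the two norms above
print (np.dot(trimmed_U,trimmed_VT) / ( Users_norm * Movie_norm))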

[[ 0.0533587   0.06455003  0.06405042 -0.00069946 -0.00069946]
 [ 0.14009156  0.16886929  0.16816983 -0.00449652 -0.00449652]
 [ 0.12805088  0.15313147  0.153731   -0.00949265 -0.00949265]
 [ 0.15847733  0.19005289  0.19025273 -0.00939273 -0.00939273]
 [ 0.00299768  0.05215963  0.00299768  0.21343479  0.21343479]
 [-0.02283233  0.03062629 -0.02812823  0.25650145  0.25650145]
 [ 0.00149884  0.02607981  0.00149884  0.10671739  0.10671739]]
1.4137892346456737
0.61
[[ 0.1238388   0.14981249  0.14865295 -0.00162336 -0.00162336]
 [ 0.32513484  0.3919243   0.39030095 -0.01043585 -0.01043585]
 [ 0.29718994  0.35539881  0.35679026 -0.02203125 -0.02203125]
 [ 0.36780588  0.44108877  0.44155259 -0.02179934 -0.02179934]
 [ 0.00695724  0.12105591  0.00695724  0.49535521  0.49535521]
 [-0.05299095  0.07107976 -0.06528207  0.59530751  0.59530751]
 [ 0.00347862  0.06052795  0.00347862  0.24767761  0.24767761]]
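
A sketch of the movie-to-movie comparison, computed for all movies at once. It assumes each movie is represented by its column of `VT` restricted to the top two features and scaled by the singular values. Normalizing every row to unit length first turns the cosine formula into a plain matrix-vector product. The names `movie_coords` and `similarity_to_starwars` are illustrative.

```python
import numpy as np

movies = ['Matrix', 'Alien', 'StarWars', 'Casablanca', 'Titanic']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)
k = 2

# One row per movie: coordinates along the top k latent features
movie_coords = VT[:k, :].T * sigma[:k]

# Scale every movie vector to unit length so dot products are cosines
unit = movie_coords / np.linalg.norm(movie_coords, axis=1, keepdims=True)

# Cosine similarity of every movie to StarWars in one product
similarity_to_starwars = unit @ unit[movies.index('StarWars')]

for name, score in zip(movies, similarity_to_starwars):
    print(name, round(float(score), 3))
```

Matrix and Alien land close to StarWars, while Casablanca and Titanic are nearly orthogonal to it.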


Provide a new vector of ratings and determine which is closest

In [199]:

# Modified ratings matrix: same users and movies as M, with adjusted ratings
newM = np.array([[1, 1, 1, 0, 0],
                 [5, 5, 5, 0, 0],
                 [5, 5, 5, 0, 0],
                 [5, 5, 5, 0, 0],
                 [0, 5, 0, 5, 5],
                 [0, 0, 0, 5, 5],
                 [0, 1, 0, 1, 1]])



# Compute SVD
from numpy.linalg import svd
newU, newsigma, newVT = svd(newM,full_matrices=False)


# Make interpretable
movies = ['Matrix','Alien','StarWars','Casablanca','Titanic']
users = ['Alice','Bob','Cindy','Dan','Emily','Frank','Greg']

newU, newsigma, newVT = (np.around(x,2) for x in (newU,newsigma,newVT))
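# NOTE: these two lines wrap U and VT from the original matrix M, not newU
# and newVT, so the tables printed below repeat the earlier factors and only
# newsigma reflects newM. Pass newU / newVT here to inspect the new factors.
newdf_U = pd.DataFrame(U, index=users)
newdf_VT = pd.DataFrame(VT, columns=movies)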

print(newdf_U)
print("--------------------------------------")
print(np.diag(newsigma))
print("--------------------------------------")
print(newdf_VT)


newtrimmed_U = newdf_U.iloc[:,0:2]
print ( newtrimmed_U )

newtrimmed_sigma = newsigma[0:2]
print (newtrimmed_sigma)

newtrimmed_VT = newdf_VT.iloc[0:2,:]
print ( newtrimmed_VT )

# Product of the trimmed factors, scaled by the Frobenius norms of both whole matrices
print (np.dot(newtrimmed_U,newtrimmed_VT) / ( np.linalg.norm(newtrimmed_U) * np.linalg.norm(newtrimmed_VT)))

newUsers_norm = np.linalg.norm(newtrimmed_U)
print (newUsers_norm)

newMovie_norm = np.linalg.norm(newtrimmed_VT)
print (newMovie_norm)

print (np.dot(newtrimmed_U,newtrimmed_VT) / ( newUsers_norm * newMovie_norm))


          0     1     2     3     4
Alice -0.21  0.02  0.31  0.26 -0.72
Bob   -0.55  0.06  0.53  0.46  0.36
Cindy -0.50  0.07 -0.31 -0.20  0.43
Dan   -0.62  0.08 -0.39 -0.24 -0.41
Emily -0.12 -0.60  0.40 -0.52 -0.00
Frank -0.04 -0.73 -0.42  0.53 -0.00
Greg  -0.06 -0.30  0.20 -0.26 -0.00
--------------------------------------
[[15.53  0.    0.    0.    0.  ]
 [ 0.   10.35  0.    0.    0.  ]
 [ 0.    0.    2.77  0.    0.  ]
 [ 0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.  ]]
--------------------------------------
   Matrix  Alien  StarWars  Casablanca  Titanic
0   -0.50  -0.62     -0.60       -0.06    -0.06
1    0.09  -0.05      0.11       -0.70    -0.70
2   -0.78   0.62      0.03       -0.07    -0.07
3   -0.36  -0.48      0.79        0.05     0.05
4   -0.00   0.00     -0.00        0.71    -0.71
          0     1
Alice -0.21  0.02
Bob   -0.55  0.06
Cindy -0.50  0.07
Dan   -0.62  0.08
Emily -0.12 -0.60
Frank -0.04 -0.73
Greg  -0.06 -0.30
[15.53 10.35]
   Matrix  Alien  Star
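
The prompt asks for a single new vector of ratings rather than a new matrix. One common way to handle that, sketched here as an assumption rather than as the notebook's intended solution, is to project the new vector onto the top two right singular vectors of the original `M` and compare it with the existing users by cosine similarity. The ratings in `new_ratings` are made up for illustration.

```python
import numpy as np

users = ['Alice', 'Bob', 'Cindy', 'Dan', 'Emily', 'Frank', 'Greg']
M = np.array([[1, 2, 2, 0, 0],
              [3, 5, 5, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

U, sigma, VT = np.linalg.svd(M, full_matrices=False)
k = 2
V_k = VT[:k, :].T  # movies x k

# Hypothetical new user who has only rated the two Romance movies
new_ratings = np.array([0, 0, 0, 4, 5])

# Project existing users and the new user into the same k-dimensional space.
# M @ V_k equals U[:, :k] * sigma[:k], so both use the same coordinates.
user_coords = M @ V_k
new_coords = new_ratings @ V_k

# Cosine similarity between the new user and every existing user
unit_users = user_coords / np.linalg.norm(user_coords, axis=1, keepdims=True)
unit_new = new_coords / np.linalg.norm(new_coords)
similarities = unit_users @ unit_new

closest = users[int(np.argmax(similarities))]
print(closest)
```

Because the projection reuses the existing factors, no new SVD is needed when a single user is added.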

In [203]:

bill_matrix = np.array([[5,10],[1,1]])
print ( bill_matrix )
print ( bill_matrix.T )
print ( np.linalg.inv(bill_matrix) )


[[ 5 10]
 [ 1  1]]
[[ 5  1]
 [10  1]]
[[-0.2  2. ]
 [ 0.2 -1. ]]


In [202]:
# lat long
ll_matrix = np.array([[1,3],[2,5]])
print ( ll_matrix )
print ( ll_matrix.T )
print ( np.linalg.inv(ll_matrix) )


[[1 3]
 [2 5]]
[[1 2]
 [3 5]]
[[-5.  3.]
 [ 2. -1.]]
