# Singular Value Decomposition (SVD) in Python

$\textbf{Mandatory Key steps:}$

1. Load the data set.
2. Use the TruncatedSVD class in scikit-learn to apply SVD for dimensionality reduction.
3. Set the rank through the n_components parameter of TruncatedSVD.
4. Fit and transform the data set.

$\textbf{Optional steps}$

1. Investigate the U, $\Sigma$ and V matrices.
2. Find the best rank for the given data set using the Frobenius norm.

$\textbf{Loading libraries and data set}$

In [2]:
import pandas as pd
import numpy as np
from numpy import diag
from sklearn.decomposition import TruncatedSVD

# Load the ratings table (one row per reviewer, one column per place)
df = pd.read_csv("movie_review.csv")
df

   Unnamed: 0  shimla  manali  kashmir  goa  kerela
0           0       1       1        1    0       0
1           1       3       3        3    0       0
2           2       4       4        4    0       0
3           3       5       5        5    0       0
4           4       0       2        0    4       4
5           5       0       0        0    5       5
6           6       0       1        0    2       2


In [None]:
# Drop the index column stored in the CSV so that only the five rating columns remain
df = df.drop(columns=["Unnamed: 0"])
df

$\textbf{Calling TruncatedSVD from scikit-learn, then fitting and transforming the data set into the desired number of dimensions, controlled by the n_components parameter}$


In [4]:
# Reduce the 5 rating columns to 4 components
svd = TruncatedSVD(n_components=4)
result = svd.fit_transform(df)
print(result)

[[ 1.71737671e+00 -2.24512179e-01  1.45434442e-02 -4.44089210e-16]
 [ 5.15213013e+00 -6.73536537e-01  4.36303325e-02 -1.33226763e-15]
 [ 6.86950685e+00 -8.98048716e-01  5.81737766e-02 -1.77635684e-15]
 [ 8.58688356e+00 -1.12256089e+00  7.27172208e-02 -1.77635684e-15]
 [ 1.90678810e+00  5.62055093e+00 -8.79526240e-01 -5.55111512e-16]
 [ 9.01335372e-01  6.95376220e+00  9.12571001e-01  0.00000000e+00]
 [ 9.53394050e-01  2.81027546e+00 -4.39763120e-01 -2.77555756e-16]]


In SVD, the data set D is decomposed into three matrices, U, $\Sigma$ and V:

$\begin{equation}
D= U \Sigma V^{T}
\end{equation}$
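
For reference, the shapes involved when only k components are kept can be written out as below. The symbols n, m and k are notation introduced here for illustration (n rows, m columns, k = n_components); they do not appear in the notebook itself. The relation is an equality when k is at least the rank of D and an approximation otherwise.

$\begin{equation}
D_{n \times m} \approx U_{n \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times m}
\end{equation}$

For the 7 $\times$ 5 data set above with n_components = 4, U is 7 $\times$ 4, $\Sigma$ is a 4 $\times$ 4 diagonal matrix and $V^{T}$ is 4 $\times$ 5.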

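- The result of fit_transform on TruncatedSVD is the product $\textbf{U $\Sigma$}$. Each row of this product holds the coordinates of one data point along the directions given by the rows of $V^{T}$.

- $\Sigma$ is available through the $\textbf{singular_values_}$ attribute of the fitted TruncatedSVD object, as a 1-D array of singular values.

- Matrix $\textbf{V_T}$ is available through the $\textbf{components_}$ attribute.

- To obtain U, divide each column of the transformed data by the corresponding value in singular_values_. This is only reliable for singular values that are not close to zero. Here the fourth column of the transformed data is numerically zero (the data set has rank 3), so the fourth column of U carries no information.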
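
The relationships in the list above can be checked with a small sketch. It uses `np.linalg.svd` as a deterministic stand-in for TruncatedSVD (an assumption made here, not what the notebook runs), types in the ratings matrix by hand from the table shown earlier, and picks `k = 3` purely for illustration.

```python
import numpy as np

# Ratings matrix copied by hand from the data set shown earlier (7 rows x 5 places).
D = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

# Thin SVD: U is 7x5, sigma holds 5 singular values, Vt is 5x5.
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)

k = 3  # illustrative choice of rank (assumption)
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

# U Sigma: scale each column of U_k by its singular value.
transformed = U_k * sigma_k

# The same coordinates obtained by projecting D onto the rows of Vt_k.
projected = D @ Vt_k.T

# Recover U by dividing column-wise by the (non-zero) singular values.
U_recovered = transformed / sigma_k

# Count singular values that are clearly above floating-point noise.
numerical_rank = int(np.sum(sigma > 1e-10))
print("numerical rank:", numerical_rank)
```

Dividing by `sigma_k` is safe in this sketch because only the non-negligible singular values are kept; with a fourth component included, the division would amplify rounding noise.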


In [5]:
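sigma = svd.singular_values_      # 1-D array of singular values
v = svd.components_               # rows are the right singular vectors, i.e. V^T
u = result/svd.singular_values_   # column-wise division recovers U from U Sigma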

print(sigma.shape)
print(v.shape)
print(u.shape)

(4,)
(4, 5)
(7, 4)


$\textbf{Testing whether the product of U, $\Sigma$ and $V^{T}$ gives back the original data set}$

In [7]:
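# Sigma as a diagonal matrix (not needed below, because result already equals U Sigma)
s = diag(svd.singular_values_)
# Reconstruct the data as (U Sigma) V^T
data_new = np.dot(result, v)
data_new = np.round(data_new, 2)
print(data_new)
print(df)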

[[ 1.  1.  1. -0. -0.]
 [ 3.  3.  3.  0.  0.]
 [ 4.  4.  4. -0. -0.]
 [ 5.  5.  5. -0. -0.]
 [ 0.  2. -0.  4.  4.]
 [ 0.  0. -0.  5.  5.]
 [ 0.  1. -0.  2.  2.]]
   shimla  manali  kashmir  goa  kerela
0       1       1        1    0       0
1       3       3        3    0       0
2       4       4        4    0       0
3       5       5        5    0       0
4       0       2        0    4       4
5       0       0        0    5       5
6       0       1        0    2       2


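To find the best rank for the given data set, compare the original and reconstructed data using the Frobenius norm of their difference. The smaller the norm, the closer the reconstruction is to the original.

$\begin{equation}
\lVert A - B \rVert_F = \sqrt{\sum_{ij} (A_{ij} - B_{ij})^2}
\end{equation}$

where A and B represent the original and the reconstructed data set respectively.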

In [9]:
data_diff = np.subtract(df, data_new)
data_squarediff = np.square(data_diff)
# Sum over all entries (not per column) so that the norm is a single number
print('Frobenius Norm = ', np.sqrt(data_squarediff.to_numpy().sum()))

Frobenius Norm =  0.0
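
The search for the best rank can be sketched by repeating the reconstruction for several ranks and comparing the Frobenius norms. As a minimal illustration, the block below uses `np.linalg.svd` instead of TruncatedSVD so that the results are deterministic (an assumption, not the notebook's method). The helper `frobenius_error` and the tolerance `tol` are hypothetical names and values introduced here.

```python
import numpy as np

# Ratings matrix copied by hand from the data set shown earlier.
D = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)


def frobenius_error(original, approx):
    """Frobenius norm of the difference between two equally shaped arrays."""
    return np.sqrt(np.sum((original - approx) ** 2))


U, sigma, Vt = np.linalg.svd(D, full_matrices=False)

# Reconstruction error for each candidate rank k = 1..4.
errors = {}
for k in range(1, 5):
    D_k = (U[:, :k] * sigma[:k]) @ Vt[:k, :]  # rank-k reconstruction
    errors[k] = frobenius_error(D, D_k)

tol = 1e-8  # hypothetical tolerance for "numerically exact"
best_rank = min(k for k, err in errors.items() if err < tol)

for k, err in errors.items():
    print(f"rank {k}: Frobenius error = {err:.4f}")
print("smallest rank within tolerance:", best_rank)
```

For this data set the error becomes numerically zero from rank 3 onward, which matches the near-zero fourth column in the transformed output above. On noisy real data the error rarely reaches zero, so a looser tolerance or a look at where the error curve flattens would be needed.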
