
# **Python Latent Semantic Analysis (LSA) Tutorial**



Import dependencies:

In [None]:
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import rand
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In this tutorial, rows represent samples (documents) and columns represent features (terms), following the scikit-learn convention.

Generate a random binary 150x100 matrix (150 samples, 100 features):





In [None]:
B = rand(150, 100, density=0.3, format='csr')
B.data[:] = 1
print("B shape: " + str(B.shape))

B shape: (150, 100)


Generate a random binary query (1x100 vector):

In [None]:
query = rand(1, 100, density=0.3, format='csr')
query.data[:] = 1
print("Query shape: " + str(query.shape))

Query shape: (1, 100)


Compute a rank-k truncated SVD of B and project B onto the k leading components:


*   trunc_SVD_model is a TruncatedSVD object;
*   fit_transform is a method of TruncatedSVD that fits the rank-k truncated SVD of B and returns B projected onto the k components (a 150 x k matrix, not a 150 x 100 reconstruction);
*   the fitted decomposition is stored in the state of trunc_SVD_model (for example in its components_ attribute).

In this case k=5:

In [None]:
trunc_SVD_model = TruncatedSVD(n_components=5)
approx_B = trunc_SVD_model.fit_transform(B)
print("Approximated B shape: " + str(approx_B.shape))

Approximated B shape: (150, 5)
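
To make the (150, 5) shape concrete, here is a minimal sketch that uses a dense NumPy SVD instead of TruncatedSVD. The matrix `X`, the seed and the variable names are illustrative assumptions, not part of the tutorial's cells. It keeps the top k singular triples and shows that the reduced representation U_k Σ_k equals X projected onto the top k right singular vectors.

```python
import numpy as np

# Illustrative binary matrix (assumption: same shape and density as B above).
rng = np.random.default_rng(0)
X = (rng.random((150, 100)) < 0.3).astype(float)
k = 5

# Full (thin) SVD, then keep only the k largest singular triples.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

reduced = U_k * s_k        # (150, k): samples in the k-dimensional latent space
projected = X @ Vt_k.T     # same values, obtained by projecting X
X_rank_k = reduced @ Vt_k  # (150, 100): rank-k approximation of X

print(reduced.shape, X_rank_k.shape)
print(np.allclose(reduced, projected))
```

The signs of individual components may differ from those returned by TruncatedSVD, since a singular vector pair is only defined up to a joint sign flip.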


Project the query into the same k-dimensional space as approx_B using the transform method of trunc_SVD_model:



In [None]:
transformed_query = trunc_SVD_model.transform(query)
print("Transformed query: " + str(transformed_query))
print("Query shape: " + str(transformed_query.shape))

Transformed query: [[ 2.89472577  0.33361881 -0.34122844  0.40161553 -0.45117078]]
Query shape: (1, 5)
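
As a rough check of what transform does, the sketch below compares it with an explicit multiplication by the transposed components_ matrix. The names `X`, `q` and `model` and the fixed seeds are assumptions introduced here for reproducibility; they are not the tutorial's variables.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Illustrative data with fixed seeds (assumption: same shapes as B and query).
X = sparse_random(150, 100, density=0.3, format="csr", random_state=0)
X.data[:] = 1
q = sparse_random(1, 100, density=0.3, format="csr", random_state=1)
q.data[:] = 1

model = TruncatedSVD(n_components=5, random_state=0).fit(X)

via_transform = model.transform(q)               # shape (1, 5)
via_matmul = q.toarray() @ model.components_.T   # explicit projection

print(via_transform.shape)
print(np.allclose(via_transform, via_matmul))
```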


Compute the cosine similarity between the transformed query and each row of approx_B (one row per document):

In [None]:
similarities = cosine_similarity(approx_B, transformed_query)
print("Similarities shape: " + str(similarities.shape))

Similarities shape: (150, 1)
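
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The following sketch computes it by hand on made-up dense data (`docs` and `q` are hypothetical stand-ins for approx_B and transformed_query) and compares the result with cosine_similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for approx_B (150 x 5) and transformed_query (1 x 5).
rng = np.random.default_rng(0)
docs = rng.normal(size=(150, 5))
q = rng.normal(size=(1, 5))

# Dot product of every document row with the query, shape (150, 1).
dots = docs @ q.T
# Product of the row norms and the query norm, shape (150, 1).
norms = np.linalg.norm(docs, axis=1, keepdims=True) * np.linalg.norm(q)
manual = dots / norms

print(manual.shape)
print(np.allclose(manual, cosine_similarity(docs, q)))
```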


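Take the indexes of the n most similar documents. argsort sorts in ascending order, so the last n entries are the best matches, listed from least to most similar: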

In [None]:
n=3
indexes = np.argsort(similarities.flat)[-n:]
print("Top n documents: " + str(indexes))
print("Top n similarities: " + str(similarities.flat[indexes]))

Top n documents: [135  26 130]
Top n similarities: [0.9810957  0.98346939 0.99006088]
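
If a best-first ranking is preferred, one option is to reverse the sorted indexes before slicing. This is a small sketch on a made-up `scores` array (an assumption standing in for `similarities`), not a change to the cell above.

```python
import numpy as np

# Made-up similarity column, shaped like the (n_documents, 1) output above.
scores = np.array([[0.10], [0.90], [0.50], [0.70]])
n = 2

# Sort ascending, reverse to descending, then keep the first n indexes.
order = np.argsort(scores.ravel())[::-1][:n]

print(order)                  # [1 3]
print(scores.ravel()[order])  # [0.9 0.7]
```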
