# <center>Latent Semantic Analysis using SVD</center>

___

Latent Semantic Analysis (LSA) is a Natural Language Processing technique that analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to both the documents and the terms.

LSA finds these relationships using a matrix factorization called Singular Value Decomposition (SVD), which decomposes the document-term matrix $A$ into three matrices.

$$A = U \Sigma V^T$$

- Matrix U holds the left singular vectors. With documents as the rows of A, they describe how each document relates to each concept.

- Matrix Σ is a diagonal matrix of singular values (the square roots of the eigenvalues of $A^T A$), which measure the strength of each concept. *(This extracts the hidden concept dimension.)*

- Matrix V holds the right singular vectors, which describe how each term relates to each concept.
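
To make the three factors concrete, here is a minimal sketch using NumPy's `np.linalg.svd` on a tiny document-term matrix. The matrix values are made up for illustration and are not taken from the 20 Newsgroups data.

```python
import numpy as np

# Toy document-term count matrix (hypothetical data): 4 documents x 5 terms.
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# full_matrices=False returns the compact ("thin") SVD.
# s holds the diagonal of Sigma; Vt is V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)   # (4, 4) (4,) (4, 5)

# Singular values are returned in descending order.
is_sorted = bool(np.all(np.diff(s) <= 0))

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt
reconstructs = bool(np.allclose(A, A_rebuilt))
print(is_sorted, reconstructs)      # True True
```

Each row of `U` describes one document in concept space, and each row of `Vt` describes one concept as a weighting over terms.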

SVD also makes dimensionality reduction straightforward, because the new concept space defined by the singular vectors is ordered by singular value, with the first dimension capturing the strongest concept.

Just as in PCA, we need not keep the full decomposition: we can pick the first k singular values and vectors, which capture most of the relationship between terms and documents.

This is why it's called `Reduced SVD` or `Truncated SVD`.
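
The truncation step can be sketched directly in NumPy. The example below is an illustration on random data (the matrix, `k`, and variable names are assumptions, not part of the notebook): it keeps the first `k` singular triplets and checks that the approximation error is exactly what the discarded singular values predict.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 8))           # hypothetical 6 documents x 8 terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                            # number of concept dimensions to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

# Documents expressed in the k-dimensional concept space: U_k * Sigma_k.
docs_k = U[:, :k] * s[:k]

# The same coordinates are obtained by projecting A onto the top-k
# right singular vectors, since A V = U Sigma.
docs_k_projected = A @ Vt[:k, :].T

# Eckart-Young: the Frobenius error of the rank-k approximation equals
# the square root of the sum of the squared discarded singular values.
err = np.linalg.norm(A - A_k, ord="fro")
expected_err = np.sqrt(np.sum(s[k:] ** 2))

print(docs_k.shape)
print(np.isclose(err, expected_err))
```

The projection `A @ Vt[:k, :].T` is the form that matters later, because it can also be applied to new documents that were not part of the original decomposition.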

Now, let's see how to implement this.

#### Load the dataset

In [2]:
import numpy as np
import pandas as pd

In [3]:
from sklearn.datasets import fetch_20newsgroups

In [5]:
X_train = pd.DataFrame(fetch_20newsgroups(random_state = 1, subset = 'train', 
                                         remove = ('headers', 'footers', 'quotes')).data, dtype = str)

X_test = pd.DataFrame(fetch_20newsgroups(random_state = 1, subset = 'test', 
                                        remove =('headers', 'footers', 'quotes')).data, dtype = str)
        

#### Displaying the different topics of train data

In [6]:
topics = fetch_20newsgroups(subset = 'train')
topics.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

#### Check the shape

In [7]:
X_train.shape

(11314, 1)

In [8]:
X_test.shape

(7532, 1)

#### View the data

In [9]:
X_train.head()

| | 0 |
|---|---|
| 0 | Well i'm not sure about the story nad it did s... |
| 1 | \n\n\n\n\n\n\nYeah, do you expect people to re... |
| 2 | Although I realize that principle is not one o... |
| 3 | Notwithstanding all the legitimate fuss about ... |
| 4 | Well, I will have to change the scoring on my ... |


In [10]:
X_test.head()

| | 0 |
|---|---|
| 0 | : In article <34592@oasys.dt.navy.mil> odell@o... |
| 1 | Ithaca technical support can be reached at:\n\... |
| 2 | Devorski unfortunately helped to taint an othe... |
| 3 | \nI would further add that a 486/50,S3/928,8mb... |
| 4 | A rep at the dealer (actually it's a universit... |


#### Calculate Term Frequency - Inverse Document Frequency (TF-IDF)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer 

In [12]:
vectorizer = TfidfVectorizer(max_df = 0.5, stop_words = 'english')
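
To see what these two arguments do, here is a small sketch on a made-up four-document corpus (the sentences and the name `toy_vectorizer` are illustrative assumptions). With `max_df = 0.5`, any term that appears in more than half of the documents is dropped, and `stop_words = 'english'` removes common function words before counting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "hockey" appears in 3 of the 4 documents.
docs = [
    "the hockey game was great",
    "the hockey team won the cup",
    "the hockey season and the space shuttle",
    "the space shuttle launch was delayed",
]

toy_vectorizer = TfidfVectorizer(max_df=0.5, stop_words="english")
tfidf = toy_vectorizer.fit_transform(docs)   # sparse matrix: documents x terms

vocab = toy_vectorizer.vocabulary_           # maps each kept term to its column index
print(sorted(vocab))
print(tfidf.shape)

# "hockey" is in 75% of the documents, above max_df, so it is dropped.
print("hockey" in vocab)   # False
# "space" is in exactly 50% of the documents, so it is kept.
print("space" in vocab)    # True
```

Terms that occur almost everywhere carry little information for telling documents apart, which is why filtering them before the SVD step is a common choice.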

#### Build Truncated SVD Model

In [13]:
from sklearn.decomposition import TruncatedSVD

In [14]:
svd_model = TruncatedSVD(n_components = 500, random_state = 123)
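
The value `n_components = 500` is a choice rather than a rule. One way to sanity-check such a choice is to look at how much variance the kept components explain. The sketch below does this on random stand-in data (the matrix `X_toy` and the small component count are assumptions for illustration only) using the `explained_variance_ratio_` and `singular_values_` attributes of a fitted `TruncatedSVD`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_toy = rng.random((50, 20))     # stand-in for a TF-IDF matrix (hypothetical data)

toy_svd = TruncatedSVD(n_components=5, random_state=123)
toy_svd.fit(X_toy)

# One row per concept, one column per original feature (term).
print(toy_svd.components_.shape)     # (5, 20)

# Strength of each kept concept, largest first.
print(toy_svd.singular_values_)

# Running total of the variance captured by the first 1, 2, ... components.
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)
print(cumulative)
```

If the cumulative curve is still climbing steeply at the last component, a larger `n_components` may be worth trying; if it has flattened, fewer components may suffice.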

#### Create and Execute Pipeline

Build a Pipeline Object

In [15]:
from sklearn.pipeline import Pipeline

In [16]:
modelling_pipe = Pipeline([('tfidf', vectorizer), ('svd', svd_model)])

Run the Model

In [17]:
fitted_model = modelling_pipe.fit(X_train[0])
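
A `Pipeline` simply chains its steps: `fit` fits each step on the output of the previous one, and `transform` applies them in order. The sketch below illustrates this on a made-up corpus (all `toy_*` names and the sentences are assumptions). It also shows that, with default settings, the pipeline fits the step objects passed to it in place, which is why the original `vectorizer` object can be inspected after fitting.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus.
docs = [
    "the hockey game was great",
    "the hockey team won the cup",
    "the hockey season and the space shuttle",
    "the space shuttle launch was delayed",
]

toy_vectorizer = TfidfVectorizer()
toy_svd = TruncatedSVD(n_components=2, random_state=123)
toy_pipe = Pipeline([("tfidf", toy_vectorizer), ("svd", toy_svd)])
toy_pipe.fit(docs)

# Transforming through the pipeline ...
via_pipeline = toy_pipe.transform(docs)

# ... matches applying the fitted steps one after another.
tfidf_step = toy_pipe.named_steps["tfidf"]
svd_step = toy_pipe.named_steps["svd"]
via_steps = svd_step.transform(tfidf_step.transform(docs))

print(via_pipeline.shape)                    # (4, 2)
print(np.allclose(via_pipeline, via_steps))  # True
print(tfidf_step is toy_vectorizer)          # True
```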

In [18]:
#vectorizer.vocabulary_

Check the size of the vocabulary

In [19]:
print(len(vectorizer.vocabulary_))

101322


Transform the Train features

In [20]:
svd_matrix_train = fitted_model.transform(X_train[0])

#### Check the Shape

In [21]:
print(svd_matrix_train.shape)

(11314, 500)


#### View the object

In [22]:
print(svd_matrix_train)

[[ 0.08690072 -0.05274135 -0.01616869 ...  0.00091931 -0.02401085
  -0.01480978]
 [ 0.12571505 -0.03759296  0.01870882 ... -0.01097828  0.00234251
   0.01323197]
 [ 0.11543528 -0.05215315 -0.02240148 ...  0.04062794  0.03953857
   0.00133556]
 ...
 [ 0.06512122 -0.0196766  -0.02061401 ... -0.00982606  0.01427701
  -0.00259227]
 [ 0.03353651  0.022044   -0.00068139 ... -0.00625029  0.01272091
  -0.01433058]
 [ 0.17359084 -0.03098295 -0.04834833 ... -0.00424957  0.01372578
  -0.01831141]]


In [23]:
print(type(svd_matrix_train))

<class 'numpy.ndarray'>


#### Query the Model

In [24]:
# Transform a single test document (row 2) into the concept space to use as the query

query = fitted_model.transform(X_test.iloc[2])

In [25]:
query.shape

(1, 500)

In [26]:
print(query)

[[ 6.70776812e-02 -2.13308295e-02 -3.66785911e-02  1.34992201e-03
  -5.39209249e-02 -2.84246325e-02  2.27992246e-02 -1.87607029e-02
  -2.86134069e-02 -1.86607505e-02  1.08551481e-02  7.24856361e-03
   1.22344918e-03  1.21645819e-03 -9.96724193e-03 -5.81949860e-03
  -2.12239119e-03  5.31987040e-03  7.36037147e-03  1.39896099e-02
   4.01497320e-04 -7.05713977e-03 -2.45690666e-03 -1.58566442e-02
  -2.84781370e-05  8.98633398e-03 -4.74061157e-03 -7.10921572e-03
  -1.91178529e-02 -1.26256846e-03 -1.86566629e-02 -7.02231430e-03
  -1.42665202e-02 -4.65910702e-03  6.81622852e-03  1.43985037e-02
   1.63449435e-02 -9.52576321e-03 -1.83879307e-03  1.46156483e-02
  -7.21651765e-03 -4.31382641e-03 -7.01359438e-03 -1.39657287e-02
  -8.78193789e-03 -2.49131682e-02  1.26134975e-02  6.40119941e-03
  -5.46763016e-03  1.81765480e-02 -2.70795701e-02 -2.74857424e-02
   1.80398445e-02 -1.12120232e-03  5.59174134e-02  2.76732556e-02
   7.39665992e-03 -7.48763009e-04  4.79575130e-03  9.34002535e-03
  -1.81229

#### Calculate cosine similarity

In [27]:
from sklearn.metrics.pairwise import cosine_similarity

In [28]:
# Note: despite the variable name, these are similarities (higher = more similar), not distances
distance_matrix = cosine_similarity(svd_matrix_train, query)
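
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following sketch computes it by hand on a few made-up 2-D vectors (not the notebook's data) and compares the result with `cosine_similarity`. It also shows why negative values can appear: vectors pointing in opposite directions have a similarity of -1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical "document" vectors in a 2-D concept space.
doc_vecs = np.array([
    [1.0, 0.0],    # same direction as the query
    [1.0, 1.0],    # 45 degrees away
    [0.0, 2.0],    # orthogonal
    [-1.0, 0.0],   # opposite direction
])
query_vec = np.array([[1.0, 0.0]])

# cos(a, b) = (a . b) / (||a|| * ||b||), computed for every row of doc_vecs.
dots = doc_vecs @ query_vec.T                                # shape (4, 1)
doc_norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # shape (4, 1)
manual = dots / (doc_norms * np.linalg.norm(query_vec))

from_sklearn = cosine_similarity(doc_vecs, query_vec)

print(manual.ravel())
print(np.allclose(manual, from_sklearn))   # True
```

Because the measure depends only on direction, a long document and a short one on the same topic can still score as highly similar.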

In [29]:
print(distance_matrix)

[[0.02143847]
 [0.13943067]
 [0.02956333]
 ...
 [0.05498933]
 [0.00212685]
 [0.08855379]]


In [30]:
distance_matrix[:5]

array([[0.02143847],
       [0.13943067],
       [0.02956333],
       [0.02434709],
       [0.06598569]])

#### Sort the Cosine Similarity Matrix 

In [31]:
flat = distance_matrix.flatten()

In [32]:
print(flat)

[0.02143847 0.13943067 0.02956333 ... 0.05498933 0.00212685 0.08855379]


In [33]:
np.sort(flat)

array([-0.0955352 , -0.08235915, -0.07575124, ...,  0.41020251,
        0.42097608,  0.44135968])

In [34]:
sort_values = np.sort(flat)

In [35]:
sort_values[-5:]

array([0.39750618, 0.40904918, 0.41020251, 0.42097608, 0.44135968])

In [36]:
sort_values[:5]

array([-0.0955352 , -0.08235915, -0.07575124, -0.07058224, -0.06986834])

In [37]:
np.argsort(flat)

array([ 552, 4268, 1113, ..., 4275, 7731, 4931], dtype=int64)

In [38]:
sort_indices = np.argsort(flat)

In [39]:
print(sort_indices)

[ 552 4268 1113 ... 4275 7731 4931]


The last five indices indicate the 5 documents most similar to the query document. Because `np.argsort` sorts in ascending order, the final index (4931) is the best match.

In [40]:
print(sort_indices[-5:])

[1684 2738 4275 7731 4931]
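
The flatten, sort, and slice steps can be wrapped into one small helper. This is a sketch with a hypothetical function name (`top_k_indices`) and made-up similarity values. It reverses the ascending `np.argsort` result so that the best match comes first.

```python
import numpy as np

def top_k_indices(similarities, k=5):
    """Return the indices of the k largest similarities, most similar first."""
    flat_sims = np.asarray(similarities).ravel()
    # argsort is ascending, so reverse it and keep the first k entries.
    return np.argsort(flat_sims)[::-1][:k]

# Hypothetical column of similarity scores, shaped like cosine_similarity output.
sims = np.array([[0.02], [0.44], [0.13], [-0.09], [0.41], [0.05]])

top3 = top_k_indices(sims, k=3)
print(top3)                    # [1 4 2]
print(sims.ravel()[top3])      # [0.44 0.41 0.13]
```

Applied to the notebook's similarity array with `k = 5`, such a helper would be expected to return the same five indices as `sort_indices[-5:]`, in reverse order.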


In [41]:
# to show longer length in pandas series object output

pd.options.display.max_colwidth = 2000

In [42]:
X_train.iloc[4931]

0    Ten years ago, the number of Europeans in the NHL was roughly a quarter\nof what it is now. Going into the 1992/93 season, the numbers of Euros on\nNHL teams have escalated to the following stats:\n\nCanadians: 400\nAmericans: 100\nEuropeans: 100\n\n   Please note that these numbers are rounded off, and taken from the top\n25 players on each of the 24 teams. My source is the Vancouver Sun.\n\n   Here's the point: there are far too many Europeans in the NHL. I am sick\nof watching a game between an American and a Canadian team (let's say, the\nRed Wings and the Canucks) and seeing names like "Bure" "Konstantinov" and\n"Borshevshky". Is this North America or isn't it? Toronto, Detriot, Quebec,\nand Edmonton are particularly annoying, but the numbers of Euros on other\nteams is getting worse as well. \n\n    I live in Vancouver and if I hear one more word about "Pavel Bure, the\nRussian Rocket" I will completely throw up. As it is now, every time I see\nthe Canucks play I keep hoping

In [43]:
X_test.iloc[2]

0    Devorski unfortunately helped to taint an otherwise brilliant display\nby MacLean.  The Canucks tied up the Jets so tightly that I thought that\nthey were mailing them.\n\nBTW, Greg...next time, don't fall asleep in geography class, it's pretty\nsad when a fellow in Norway can spell Winnipeg properly and a guy in\nNorth America can't.\n\nOne more thing...how LONG has Vancouver been in the NHL?  How many\nchampionships do they have?  \n\nOh yeah...and I CAN go to the Arena and see not one, not two, but\n*six* championship banners hanging from the rafters.  3 Stanley Cup\nbanners, and 3 Avco Cup banners.  My NHL guide says that Vancouver has\nwon the Cup once (as many times as the rockin' town of Kenora has won it!)
Name: 2, dtype: object

Notice that both documents discuss the NHL and the Vancouver Canucks, so the LSA model was able to pick up on the similarity.

____