## Latent Semantic Indexing

### M_s = K_s * S_s * D_s.T

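- M: m terms x n documents (term-document matrix)
- M_s: rank-s approximation of M that keeps only the s largest singular values
- K_s: m terms x s latent dimensions (term space)
- S_s: s x s diagonal matrix of singular values, sorted in decreasing order
- D_s.T: s latent dimensions x n documents (document space)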
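
As a minimal sketch of the truncation described above, the snippet below builds K_s, S_s and D_s.T with NumPy and checks their shapes. It assumes the 11 x 4 term-document matrix that appears later in this notebook and an illustrative choice of s = 2; the variable names are only for this example.

```python
import numpy as np

# Term-document matrix used later in this notebook (11 terms x 4 documents)
M = np.array([[1, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 2],
              [1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 2, 1, 1],
              [0, 1, 1, 0]])

s = 2  # number of singular values to keep (illustrative choice)

# Thin SVD: K is m x r, S holds r singular values in decreasing order,
# Dt is r x n, with r = min(m, n)
K, S, Dt = np.linalg.svd(M, full_matrices=False)

K_s = K[:, :s]         # m x s
S_s = np.diag(S[:s])   # s x s
Dt_s = Dt[:s, :]       # s x n

# Rank-s approximation of M
M_s = K_s @ S_s @ Dt_s

print(K_s.shape, S_s.shape, Dt_s.shape, M_s.shape)
# (11, 2) (2, 2) (2, 4) (11, 4)

# The Frobenius error of the approximation equals the root of the sum of
# the squared discarded singular values
err = np.linalg.norm(M - M_s)
print(err, np.sqrt(np.sum(S[s:] ** 2)))
```

The two printed error values should agree up to floating-point rounding, which is a quick way to confirm that the truncation kept the largest singular values.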

#### How to answer queries:

* Use cosine similarity to compare documents: (D_s.T)_i (ith column) and (D_s.T)_j (jth column)
* Query = additional document
* Mapping M -> D: D = M.T * K * S^-1
* **Mapping query**: q* = q.T * K_s * S_s^-1
* **Comparing similarities**: sim(q*, d_i) = (q* dot (D_s.T)_i) / (norm(q*) * norm((D_s.T)_i))
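
The mapping formulas in the list above follow from the SVD factorization together with the orthonormality of the columns of K. A short derivation, assuming M has full column rank so that S is invertible:

$$
M = K S D^T
\;\Rightarrow\; M^T = D S K^T
\;\Rightarrow\; M^T K = D S \,(K^T K) = D S
\;\Rightarrow\; D = M^T K S^{-1}
$$

Row $i$ of $D$ is therefore $d_i^T K S^{-1}$, where $d_i$ is the $i$-th column of $M$. Treating the query $q$ as one more column of $M$ and keeping only the $s$ largest singular values gives

$$
q^* = q^T K_s S_s^{-1}
$$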

#### Dropping a document d3 in document space...
The document ordering does not change even if d3 is dropped. Recall that every document in the term-document matrix can be viewed as a vector in the m-dimensional vector space $\mathbb{R}^m$. Since d3 has a magnitude and direction similar to those of d2 and d4, dropping d3 does not substantially alter the term space (K) or the document space (D) of the SVD.

To modify the term and document spaces, we should change d3 so that it points in a different direction from the other vectors. For example, d3 = (0, 0, 1, 1, 2, 1, 0, 0, 2, 0, 2) changes the document ordering to d2 > d4 > d1 > d3.
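
One way to reproduce this comparison is sketched below. It assumes the 11 x 4 term-document matrix and the query vector used in the code cells of this notebook, with s = 2; `lsi_ranking` is a hypothetical helper written for this illustration, not a function from the original notebook. It prints the ranking for the full matrix, for the matrix without d3, and for the matrix with the modified d3, so the three orderings can be compared directly.

```python
import numpy as np

# Term-document matrix from this notebook (11 terms x 4 documents d1..d4)
M = np.array([[1, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 2],
              [1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 2, 1, 1],
              [0, 1, 1, 0]])
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1])


def lsi_ranking(M, q, labels, s=2):
    """Hypothetical helper: rank documents against q in the s-dimensional LSI space."""
    K, S, Dt = np.linalg.svd(M, full_matrices=False)
    K_s, S_s, Dt_s = K[:, :s], S[:s], Dt[:s, :]
    # q* = q.T * K_s * S_s^-1 (dividing by S_s applies the diagonal inverse)
    q_star = (q @ K_s) / S_s
    # Cosine similarity between q* and every column of D_s.T
    sims = (q_star @ Dt_s) / (np.linalg.norm(q_star) * np.linalg.norm(Dt_s, axis=0))
    order = np.argsort(-sims)
    return [(labels[i], float(sims[i])) for i in order]


labels = ["d1", "d2", "d3", "d4"]
full = lsi_ranking(M, q, labels)

# Drop d3 (column index 2)
M_dropped = np.delete(M, 2, axis=1)
dropped = lsi_ranking(M_dropped, q, ["d1", "d2", "d4"])

# Replace d3 with the vector suggested in the text
M_changed = M.copy()
M_changed[:, 2] = [0, 0, 1, 1, 2, 1, 0, 0, 2, 0, 2]
changed = lsi_ranking(M_changed, q, labels)

for name, ranking in [("full", full), ("d3 dropped", dropped), ("d3 changed", changed)]:
    print(name, ranking)
```

The sign ambiguity of the SVD does not affect the result, because flipping the sign of a column of K flips the corresponding row of D.T as well, which leaves every cosine similarity unchanged.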

##### Algebraic Interpretation:  

Recall that the matrix M transforms a unit ball into an ellipsoid, and in LSI we keep only the directions with the strongest distortion. Intuitively, if we linearly combine $d_2$ and $d_4$ with a coefficient of 0.5 each, we obtain a vector that is not very dissimilar from $d_3$ (i.e., the norm is almost the same, and the direction overlaps on many components). Therefore, it is not surprising that (in this specific example) removing $d_3$ did not lead to a different ranking. Bear in mind that, with slightly different numbers, this might no longer be the case.

In a real-world scenario with LSI (i.e., millions of documents), removing just a few documents rarely changes the ranking dramatically, because the documents we still take into account will, with high probability, contain the same concepts as the removed ones. That is to say, the resulting ellipsoid won't change substantially.
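
The claim about combining $d_2$ and $d_4$ can be checked numerically. The sketch below assumes that d2, d3 and d4 are the second, third and fourth columns of the term-document matrix used in this notebook's code cells, and compares 0.5 * d2 + 0.5 * d4 with d3 in the original term space.

```python
import numpy as np

# Columns of the term-document matrix from this notebook
d2 = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1])
d3 = np.array([1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1])
d4 = np.array([1, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0])

# Linear combination with coefficient 0.5 on each vector
mix = 0.5 * d2 + 0.5 * d4

norm_mix = np.linalg.norm(mix)
norm_d3 = np.linalg.norm(d3)
cos = mix @ d3 / (norm_mix * norm_d3)

print(norm_mix, norm_d3)  # both equal sqrt(7), about 2.6458
print(cos)                # 6.5 / 7, about 0.9286
```

In this example the two norms coincide and the cosine is close to 1, which is consistent with the observation that d3 adds little that d2 and d4 do not already span.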

In [23]:
import numpy as np
import math
import operator
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import string
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/yawen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [51]:
M = [[1,1,1,1], 
     [0,1,1,1],
     [1,0,0,0],
     [0,1,0,0],
     [1,0,0,0],
     [1,0,1,2],
     [1,1,1,1],
     [1,1,1,0],
     [1,0,0,0],
     [0,2,1,1],
     [0,1,1,0]]

def SVD(M):
    # compute SVD
    # full_matrices=False returns the thin SVD: K is m x r, S holds r singular values, Dt is r x n, with r = min(m, n)
    K, S, Dt = np.linalg.svd(M, full_matrices=False)
    return K, S, Dt

K, S, Dt =  SVD(M)

def SVD_k(M, k):
    K, S, Dt = SVD(M)
    # LSI select dimensions
    K_sel = K[:,0:k]
    S_sel = np.diag(S)[0:k,0:k]
    Dt_sel = Dt[0:k,:]
    return K_sel, S_sel, Dt_sel

In [52]:
# q = query vector; M = term-document matrix; k = number of singular values (latent dimensions) to keep
def q_star(q, M, k):
    K, S, Dt = SVD(M)

    K_sel = K[:,0:k]
    S_sel = np.diag(S)[0:k,0:k]
    # Map the query q onto the document space D as q* = q.T * (K_sel * S_sel^-1)
    mapper = np.dot(K_sel, np.linalg.inv(S_sel))
    q_trans = np.dot(q, mapper)
    return q_trans

q = np.array([0,0,0,0,0,1,0,0,0,1,1])

q_ = q_star(q, M, 2)

In [53]:
def cosine_similarity(v1, v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy*1.0/math.sqrt(sumxx*sumyy)

def rank_documents(q_star, Dt, k):
    # dict of (doc_id, cos_sim)
    document_ranking = dict()
    # Shrink the documents according to highest k singular values
    Dt_sel = Dt[0:k,:]

    for i in range(0, Dt_sel.shape[1]):
        d = Dt_sel[:, i]
        cos_sim = cosine_similarity(q_star, d)
        document_ranking[i] = cos_sim
    
    # Sort according to values
    document_ranking_sorted = sorted(document_ranking.items(), key=operator.itemgetter(1), reverse = True)
    return document_ranking_sorted

rank_documents(q_, Dt, 2)

[(2, 0.9524776244205609),
 (1, 0.9388827727147444),
 (3, 0.5931086268074786),
 (0, -0.012057913278690475)]

In [54]:
# Tokenize, stem a document
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# compute IDF, storing idf values in a dictionary
def idf_values(vocabulary, documents):
    idf = {}
    num_documents = len(documents)
    for term in vocabulary:
        # document frequency: number of documents that contain the term
        doc_freq = sum(term in document for document in documents)
        idf[term] = math.log(num_documents / doc_freq)
    return idf

# Function to generate the vector for a document (with normalisation)
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    max_count = counts.most_common(1)[0][1]
    for i,term in enumerate(vocabulary):
        vector[i] = idf[term] * counts[term]/max_count
    return vector

def vectorize_query(query, vocabulary, idf):
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    query_vector = vectorize(q, vocabulary, idf)
    return query_vector

In [55]:
# define stemmer
stemmer = PorterStemmer()

# Read a list of documents from a file. Each line in a file is a document
with open("bread.txt") as f:
# with open("epfldocs.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]

# create the vocabulary (stop words are loaded once into a set for fast lookup)
stop_words = set(stopwords.words('english'))
vocabulary = set([item for sublist in documents for item in sublist])
vocabulary = [word for word in vocabulary if word not in stop_words]
vocabulary.sort()

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]
vocabulary

['art',
 'bake',
 'best',
 'book',
 'bread',
 'cake',
 'comput',
 'french',
 'london',
 'numer',
 'pastri',
 'pie',
 'quantiti',
 'recip',
 'scientif',
 'smith',
 'without']

In [59]:
#### Find top documents for the query 'baking', with k = 3 ####

## Transpose the document vectors to obtain the term-document matrix (terms x documents).
M = np.array(document_vectors).T

## Run LSI.
K, S, Dt = SVD(M)
K_sel, S_sel, Dt_sel = SVD_k(M, 3)

## Prepare query
# transform query and documents
q = np.array([0]*len(vocabulary))
#Set the term corresponding to baking = 1 (see vocabulary)
q[1] = 1
q__ = q_star(q, M, 3)

# Compute similarities
rank_documents(q__, Dt, 3)


[(0, 0.9980518678772611),
 (3, 0.7231078789682566),
 (2, -0.0023288750529669527),
 (4, -0.6551062911362043),
 (1, -0.6577609566355738)]

In [57]:
q__

array([-0.00416487,  0.11537136, -0.14603541])