# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *Your group letter.*

**Names:**

* *Name 1*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle as pk
import numpy as np
import scipy as sc
from scipy.sparse.linalg import svds
from lab04_helper import *
from utils import *

## Exercise 4.4: Latent semantic indexing

In [2]:
k = 300
td_matrix = load_sparse_csr("occ_matrix.npz")
# SVD decomposition
print(td_matrix.shape)
u, s, v = svds(td_matrix,k=k)

(15248, 854)


In [3]:
u_k = u[:,:k]
v_k_T = v[:k,:]
s_k = np.diag(s[:k])
print("20 biggest eigenvalues : \n",s[:20])

20 biggest eigenvalues : 
 [ 16.13695676  16.15132115  16.19533589  16.28524658  16.29884434
  16.30850426  16.35280397  16.38506679  16.42800987  16.45750665
  16.47148612  16.50836612  16.54994431  16.59233144  16.60731843
  16.62798393  16.68640329  16.77242987  16.80275446  16.81623913]


In the matrix U, each __row__ correspond to a term, and for each term the value index i correspond to how much relevant the ith concept is for this term.
In the matrix V, each __column__ correspond to a document(cours description), and for each document the value at index i correspond to how much relevant the ith concept is for this term.

## Exercise 4.5: Topic extraction

In [4]:
n = 10
m = 10
wordIndex = load_pkl("indexToWord")
coursIndex = load_pkl("indexToCourse.txt")

n_topic_relevant_term_index = np.argpartition(u_k, n,0)[-n:]
n_topic_relevant_document_index = np.argpartition(v_k_T, -n,1)[:,-n:]

for i in range(n):
    print("*************************************************")
    print("In topic",i,"the relevant terms are:")
    print(",".join(map(wordIndex.get,n_topic_relevant_term_index[:,i][:m])))
    print("An the relevant courses are:")
    print(",".join(map(coursIndex.get,n_topic_relevant_document_index[i][:m])))


*************************************************
In topic 0 the relevant terms are:
flexures,pneumatic,cp3,gunjan,argawal,robertson,amir,firouzeh,zhenishbek,zhakypov
An the relevant courses are:
CIVIL-369,CH-447,ENV-720,MICRO-708,MSE-437,ENV-320,CH-422,ENV-546,MICRO-430,FIN-504
*************************************************
In topic 1 the relevant terms are:
economics,issues,introduces,society,cross-sectoral,sectoral,basically,infrastructures,regulation,offers
An the relevant courses are:
ME-524,CS-487,ME-231(a),COM-308,COM-407,ENV-540,MATH-332,ENG-603,FIN-506,EE-518
*************************************************
In topic 2 the relevant terms are:
flexures,pneumatic,cp3,gunjan,argawal,robertson,amir,firouzeh,zhenishbek,zhakypov
An the relevant courses are:
ChE-301,FIN-406,EE-606,ME-466,COM-308,EE-470,EE-612,BIO-714,EE-519,BIO-802
*************************************************
In topic 3 the relevant terms are:
water,energy,sectoral,translates,basically,infrastructures,focus,r

## Exercise 4.6: Document similarity search in concept-space

In [18]:
wordToIndex = load_pkl("wordToIndex")

def simVectors(querry_indices):
    """
        querry_indices : List of indices of all atomic terms contained in the querry.
        
        Return a matrix containing for each document, 
        how much this document is similare to each atomic terms of the querry. 
        
        If we call sim vector many times part of this can be precomputed.
    """
    querry_matrix = np.zeros(u_k.shape)
    querry_matrix[querry_indices,:]=u_k[querry_indices,:] 
    svdt = s_k @ v_k_T
    non_normalized_sims = querry_matrix @ s_k @ v_k_T
    
    # Compute only the norms we care about 
    # 🚨 terms_norms shape depend on querry_indices length.
    terms_norms = np.linalg.norm(querry_matrix[querry_indices,:],axis=1)
    weighted_document_norms = np.linalg.norm(svdt,axis=0)
    
    #Less computation but also avoid doing 0/0
    term_normalized_sims = np.zeros(non_normalized_sims.shape)
    term_normalized_sims[querry_indices,:] = (non_normalized_sims[querry_indices,:].T/terms_norms).T
    
    return term_normalized_sims/weighted_document_norms


def searchTerm(t):
    querry_words = cleaner(t)
    querry_indices = list(filter(lambda x : x >=0,map(lambda x : wordToIndex.get(x,-1),querry_words)))
    query_scores = np.sum(simVectors(querry_indices),0)
    
    return list(reversed(np.argsort(query_scores)))

def printSearchResult(search_result):
    for r in search_result:
        print(coursIndex[r])
#     print("cleaned : <",cleaned,">")
#     top = np.argsort(np.sum(list(map(rankAtomicTerm,cleaned)),0),0)
n = 10
t1 = "facebook"
print("Top",n,"course about",t1)
printSearchResult(searchTerm(t1)[:n])
t1 = "markov chains"
print("Top",n,"course about",t1)
printSearchResult(searchTerm(t1)[:n])

Top 10 course about facebook
EE-727
EE-593
AR-402(w)
ChE-302
ME-608
COM-308
EE-552
CS-714
MGT-641(a)
ENV-615
Top 10 course about markov chains
MGT-484
MATH-332
COM-516
EE-605
MGT-602
COM-512
FIN-606
COM-308
EE-477
BIO-463


## Exercise 4.7: Document-document similarity