# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *Your group letter.*

**Names:**

* *Name 1*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
import pickle as pk
import numpy as np
import scipy as sc
from scipy.sparse.linalg import svds
from lab04_helper import *
from utils import *

## Exercise 4.4: Latent semantic indexing

In [None]:
k = 300
td_matrix = load_sparse_csr("occ_matrix.npz")
# SVD decomposition
print(td_matrix.shape)
u, s, v = svds(td_matrix,k=k)

In [None]:
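# svds does not guarantee the order of the singular values, so sort them from largest to smallest
order = np.argsort(s)[::-1]
u_k = u[:, order]
v_k_T = v[order, :]
s_k = np.diag(s[order])
print("20 largest singular values:\n", s[order][:20])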

In the matrix $U$, each __row__ corresponds to a term, and the value at index $i$ of that row measures how relevant the $i$-th concept is for this term.
In the matrix $V^T$, each __column__ corresponds to a document (a course description), and the value at index $i$ of that column measures how relevant the $i$-th concept is for this document. The values on the diagonal of $S$ (the singular values) correspond to the importance of each topic.
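
As a minimal sketch of these roles, the toy example below factorizes a small, made-up term-document matrix and checks the shapes of the three factors. It uses the dense `np.linalg.svd` as a stand-in for `svds`; the matrix and all `toy_*` names are illustrative assumptions, not lab data.

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms (rows) x 3 documents (columns).
toy_td = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Dense SVD; np.linalg.svd returns the singular values from largest to smallest.
toy_u, toy_s, toy_vt = np.linalg.svd(toy_td, full_matrices=False)

# Keep the 2 strongest concepts.
toy_k = 2
toy_u_k = toy_u[:, :toy_k]          # one row per term, one column per concept
toy_s_k = np.diag(toy_s[:toy_k])    # concept strengths on the diagonal
toy_vt_k = toy_vt[:toy_k, :]        # one row per concept, one column per document

# Rank-k approximation of the original matrix.
toy_approx = toy_u_k @ toy_s_k @ toy_vt_k
print(toy_u_k.shape, toy_s_k.shape, toy_vt_k.shape)
```

Unlike `np.linalg.svd`, `svds` makes no promise about the order of the singular values it returns, so sorting them explicitly before truncating is a safe habit.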

## Exercise 4.5: Topic extraction

In [None]:
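n = 10  # number of topics to display
m = 10  # number of top terms and top courses shown per topic
wordIndex = load_pkl("indexToWord")
coursIndex = load_pkl("indexToCourse")

# For each topic, indices of the m highest-weighted terms and documents (ascending order)
n_topic_relevant_term_index = np.argsort(u_k, 0)[-m:, :]
n_topic_relevant_document_index = np.argsort(v_k_T, 1)[:, -m:]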

allCourses = load_json('data/courses.txt')
nameDict = {t["courseId"]: t["name"] for t in allCourses}


for i in range(n):
    print("\n*************************************************")
    print("Topic #", (i + 1), "\n")
    # Top m terms of topic i, highest weight first
    listPrettyPrint(["%s - %.7f" % (wordIndex[idx], u_k[idx, i]) for idx in reversed(n_topic_relevant_term_index[-m:, i])], 3)
    print("\n\n")
    # Top m courses of topic i, highest weight first (the name `s` is avoided: it holds the singular values)
    for line in ["%s - %.7f" % (nameDict[coursIndex[idx]], v_k_T[i, idx]) for idx in reversed(n_topic_relevant_document_index[i, -m:])]:
        print(line)
    


* **Topic 1** Mostly biological materials and energy
* **Topic 2** Use of big data sets
* **Topic 3** Mostly artificial materials and energy
* **Topic 4** Bioengineering, circuits and public policy
* **Topic 5** Physical principles used by cells
* **Topic 6** Topics related to risk management
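
The labels above come from reading the highest-weighted terms and courses of each topic. A minimal sketch of that ranking step follows, on a made-up 5-term, 2-topic matrix; the vocabulary, the weights and the `top_terms` helper are illustrative assumptions, not lab data.

```python
import numpy as np

# Hypothetical vocabulary and term-topic weights (rows: terms, columns: topics).
toy_vocab = ["cell", "protein", "data", "cluster", "risk"]
toy_term_topic = np.array([
    [0.9, 0.1],
    [0.7, 0.0],
    [0.1, 0.8],
    [0.0, 0.6],
    [0.2, 0.3],
])

def top_terms(term_topic, vocab, topic, m):
    """Return the m terms with the largest weight for one topic, best first."""
    # argsort is ascending, so reverse it and keep the first m indices.
    order = np.argsort(term_topic[:, topic])[::-1][:m]
    return [vocab[i] for i in order]

print(top_terms(toy_term_topic, toy_vocab, topic=0, m=2))
print(top_terms(toy_term_topic, toy_vocab, topic=1, m=2))
```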

## Exercise 4.6: Document similarity search in concept-space

In [None]:
wordToIndex = load_pkl("wordToIndex")

def simVectors(querry_indices):
    """
        querry_indices : list of indices of all atomic terms contained in the query.
        
        Return a matrix containing, for each document,
        how similar this document is to each atomic term of the query.
        Rows of terms that are not in the query are left at zero.
        
        If simVectors is called many times, part of this can be precomputed.
    """
    querry_matrix = np.zeros(u_k.shape)
    querry_matrix[querry_indices,:] = u_k[querry_indices,:]
    svdt = s_k @ v_k_T
    non_normalized_sims = querry_matrix @ svdt
    
    # Compute only the norms we care about.
    # Note: the shape of terms_norms depends on the length of querry_indices.
    terms_norms = np.linalg.norm(querry_matrix[querry_indices,:],axis=1)
    weighted_document_norms = np.linalg.norm(svdt,axis=0)
    
    # Less computation, and it also avoids computing 0/0 for terms outside the query
    term_normalized_sims = np.zeros(non_normalized_sims.shape)
    term_normalized_sims[querry_indices,:] = (non_normalized_sims[querry_indices,:].T/terms_norms).T
    
    return term_normalized_sims/weighted_document_norms


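def searchTerm(t):
    """Return all document indices, ranked from most to least similar to the query t."""
    querry_words = cleaner(t)
    # Keep only the query words that appear in the vocabulary
    querry_indices = [wordToIndex[w] for w in querry_words if w in wordToIndex]
    # A document's score is the sum of its similarities with every query term
    query_scores = np.sum(simVectors(querry_indices),0)
    
    return list(reversed(np.argsort(query_scores)))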

def printSearchResult(search_result):
    """Print the course id and name of each document index in search_result."""
    for r in search_result:
        c = coursIndex[r]
        print(c,":",nameDict[c])

n = 10
t1 = "facebook"
print("\nTop",n,"courses about",t1)
printSearchResult(searchTerm(t1)[:n])
t1 = "markov chains"
print("\nTop",n,"courses about",t1)
printSearchResult(searchTerm(t1)[:n])
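
The scoring used by `searchTerm` can be sketched on toy data: each query term is represented by its row of $U_k$, each document by its column of $S_k V_k^T$, and a document's score is the sum of its cosine similarities with the query terms. The matrices below are made-up assumptions, and `toy_query_scores` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

# Hypothetical rank-2 factors: 3 terms, 2 concepts, 3 documents.
toy_u_k = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
toy_s_k = np.diag([2.0, 1.0])
toy_vt_k = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0]])

def toy_query_scores(term_indices):
    """Sum, over the query terms, of cos(term vector, weighted document vector)."""
    terms = toy_u_k[term_indices, :]                 # (n_query_terms, k)
    docs = toy_s_k @ toy_vt_k                        # (k, n_docs)
    dots = terms @ docs                              # (n_query_terms, n_docs)
    term_norms = np.linalg.norm(terms, axis=1, keepdims=True)
    doc_norms = np.linalg.norm(docs, axis=0, keepdims=True)
    return (dots / term_norms / doc_norms).sum(axis=0)

# Query made of term 0 only; rank documents from best to worst.
scores = toy_query_scores([0])
ranking = np.argsort(scores)[::-1]
print(scores, ranking)
```

Only the rows of the query terms are multiplied here, which gives the same scores as filling a full-size matrix with zeros while avoiding the extra work.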

## Exercise 4.7: Document-document similarity


$$\mathrm{sim}(d_1, d_2) = \frac{S v_{d_1}^T \cdot S v_{d_2}^T}{\left\lVert S v_{d_1}^T \right\rVert \left\lVert S v_{d_2}^T \right\rVert}$$

Since we want the course in our collection that is most similar to a given course, it is more efficient to express this as matrix operations over the whole matrix $v$. Note that with this method the given course itself always ranks first, since its cosine similarity with itself is 1, so it must be excluded from the results.

To simplify things, we can compute the similarity of every document with every other document by first computing $v^N$, the matrix obtained by normalizing the columns of the $Sv^T$ matrix (one column per document). We then compute ${v^N}^T v^N$ and read the similarity of document $i$ with document $j$ in cell $ij$ or $ji$ of this matrix. This approach is also more efficient when similarity requests are much more frequent than additions or removals of courses, which may not hold in our case but is a reasonable assumption in a real-life application.
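
As a quick sanity check of this shortcut, the sketch below compares the pairwise formula with the precomputed matrix on made-up weighted document vectors. All names and values are illustrative assumptions.

```python
import numpy as np

# Hypothetical weighted document vectors S V^T: 2 concepts x 3 documents.
toy_sv_t = np.array([[2.0, 0.0, 2.0],
                     [0.0, 1.0, 1.0]])

def pairwise_sim(docs, i, j):
    """Cosine similarity between columns i and j, straight from the formula."""
    a, b = docs[:, i], docs[:, j]
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize every column (document) to unit length, then take all dot products at once.
toy_vn = toy_sv_t / np.linalg.norm(toy_sv_t, axis=0, keepdims=True)
toy_sim_matrix = toy_vn.T @ toy_vn

print(toy_sim_matrix.round(3))
```

The diagonal of the resulting matrix is all ones, which is why the queried course has to be dropped from its own ranking.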


In [None]:
def mkSimMatrix():
    vn = normalized(s_k @ v_k_T,0)
    return vn.T @ vn

simMatrix = mkSimMatrix()
coursToIndex =  {v: k for k, v in coursIndex.items()}

def findCoursesSimilarTo(coursId):
    index = coursToIndex[coursId]
    sims = simMatrix[index,:]
    # Rank all courses by decreasing similarity, excluding the query course itself
    return [i for i in reversed(np.argsort(sims)) if i != index]

n = 5
t1 = "COM-308"
print("Top",n,"courses similar to",t1)
printSearchResult(findCoursesSimilarTo(t1)[:n])