# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *K*

**Names:**

* *Mathieu Sauser*
* *Luca Mouchel*
* *Jérémy Chaverot*
* *Heikel Jebali*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import numpy as np
from utils import load_json
import scipy as sp
import scipy.sparse.linalg  # load the submodule explicitly so that sp.sparse.linalg.svds resolves

courses = load_json('data/courses.txt')
N = len(courses)

## Exercise 4.4: Latent semantic indexing

In [2]:
X = np.load('TFIDF.npy')
docToIdx = np.load('docToIdx.npy', allow_pickle=True)
idxToDoc = dict(zip(docToIdx.item().values(), docToIdx.item().keys()))
docToIdx = dict(zip(idxToDoc.values(), idxToDoc.keys()))
termToIdx = np.load('termToIdx.npy',  allow_pickle=True)
idxToTerm = dict(zip(termToIdx.item().values(), termToIdx.item().keys()))
termToIdx = dict(zip(idxToTerm.values(), idxToTerm.keys()))

In [None]:
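# Truncated SVD keeping 300 topics; svds returns U, the singular values and V transposed
U, S, Vt = sp.sparse.linalg.svds(X, k=300)
V = Vt.T  # after the transpose, row d of V holds the topic weights of document d
print(U.shape)
print(S.shape)
print(V.shape)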

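Column $i$ of U describes topic $i$ in term space: entry $(t, i)$ is the weight of term $t$ in topic $i$.

Entry $i$ of S is the strength (singular value) of topic $i$.

Column $i$ of V (after the transpose above) describes topic $i$ in document space: entry $(d, i)$ is the weight of document $d$ in topic $i$.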
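
To make these roles concrete, here is a minimal sketch on a made-up matrix with 4 terms and 3 documents. It uses NumPy's dense `np.linalg.svd` instead of `svds` (a substitution made only to keep the toy example small), and `X_toy` is hypothetical data, not the course TF-IDF matrix.

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms (rows) x 3 documents (columns).
X_toy = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

# Dense SVD; np.linalg.svd returns the singular values from largest to smallest.
U_toy, S_toy, Vt_toy = np.linalg.svd(X_toy, full_matrices=False)
V_toy = Vt_toy.T  # rows of V_toy correspond to documents

k = 2  # number of topics kept
# Rank-k approximation built from the k strongest topics.
X_k = U_toy[:, :k] @ np.diag(S_toy[:k]) @ V_toy[:, :k].T

print(U_toy.shape, S_toy.shape, V_toy.shape)  # (4, 3) (3,) (3, 3)
print(np.linalg.norm(X_toy - X_k))            # error left by the dropped topic
```

The Frobenius error of the rank-2 approximation equals the single singular value that was dropped, which is one way to read S as the "strength" of each topic.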

In [None]:
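# Print the 20 largest singular values, largest first.
# svds is assumed to return them in ascending order, hence the indexing from the end.
for i in range(20):
    print(f"{i+1}: {S[-(i + 1)]}")
    if i < 19:
        print("=====================")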

## Exercise 4.5: Topic extraction

Take the 10 indices with the largest values in S. For U: take the columns at those indices and, in each column, keep the 10 largest values. Do the same for V, using the column of each topic (V was transposed, so topics are columns and documents are rows).
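
The selection described above can be sketched on toy data. The helper `top_indices` and the arrays `S_toy`, `U_toy` and `V_toy` are hypothetical names and values introduced for illustration; they only mimic the shapes of the real factors (terms x topics and documents x topics).

```python
import numpy as np

def top_indices(values, k):
    """Return the indices of the k largest entries of `values`, largest first."""
    return np.argsort(values)[::-1][:k]

# Hypothetical toy factors: 5 terms, 4 documents, 3 topics.
S_toy = np.array([0.5, 3.0, 1.2])
U_toy = np.array([
    [0.1, 0.9, 0.0],
    [0.7, 0.1, 0.2],
    [0.2, 0.3, 0.8],
    [0.6, 0.2, 0.1],
    [0.3, 0.4, 0.5],
])
V_toy = np.array([
    [0.2, 0.8, 0.1],
    [0.9, 0.1, 0.3],
    [0.1, 0.5, 0.9],
    [0.4, 0.2, 0.2],
])

for topic in top_indices(S_toy, 2):              # strongest topics first
    top_terms = top_indices(U_toy[:, topic], 2)  # column of U: one weight per term
    top_docs = top_indices(V_toy[:, topic], 2)   # column of V: one weight per document
    print(topic, top_terms, top_docs)
```

Singular vectors are only defined up to sign, so ranking by absolute value instead of signed value is another reasonable choice; the sketch keeps the signed ranking described in the note above.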

In [None]:
num_docs = 10
num_index = 10
num_terms = 10

In [None]:
# Indices of the num_index strongest topics, strongest first
top_topics = np.argsort(S)[::-1][:num_index]
terms = []      # one row per topic: indices of its num_terms most important terms (column of U)
documents = []  # one row per topic: indices of its num_docs most important documents (column of V)
for j in top_topics:
    terms.append(np.argsort(U[:, j])[::-1][:num_terms])
    # V was transposed, so topic j is column j of V (one entry per document), not row j
    documents.append(np.argsort(V[:, j])[::-1][:num_docs])
terms = np.vstack(terms)
documents = np.vstack(documents)
terms

In [None]:
# One block per topic: its top terms, then its top documents
for i in range(num_index):
    print(f"topic {i+1}")
    print(f" -top {num_terms} terms:")
    for j in terms[i]:
        print(f"{idxToTerm[j]}")
    print(f"\n -top {num_docs} documents")
    for j in documents[i]:
        print(f"{idxToDoc[j]}")
    if i < num_index - 1:
        print("=============")

Each topic still needs to be given a label.

## Exercise 4.6: Document similarity search in concept-space

In [None]:
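def sim(t, d):
    """Cosine similarity between term t and document d in concept space."""
    U_t = U[t]     # topic weights of term t
    sv = S * V[d]  # topic weights of document d scaled by the singular values (same as np.diag(S) @ V[d])
    # TODO: ask whether this normalization is the expected one
    return np.dot(U_t, sv) / (np.linalg.norm(U_t) * np.linalg.norm(sv))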

In [None]:
markov_chain_index = termToIdx["markov"]  # to change
sim_with_markov_chain = []

for i in range(N):
    sim_with_markov_chain.append(sim(markov_chain_index, i))


In [None]:
top_five_courses = np.argsort(sim_with_markov_chain)[-5:]
print("top 5 courses for : markov chain")  # to change
for i in np.flip(top_five_courses):
    courseId = next(iter(courses[i].values()))
    print(f'{courseId} : {sim_with_markov_chain[i]}')


In [None]:
facebook_index = termToIdx["facebook"]  # to change
sim_with_facebook = []

for i in range(N):
    sim_with_facebook.append(sim(facebook_index, i))

In [None]:
top_five_courses = np.argsort(sim_with_facebook)[-5:]
print("top 5 courses for : facebook")  # to change
for i in np.flip(top_five_courses):
    courseId = next(iter(courses[i].values()))
    print(f'{courseId} : {sim_with_facebook[i]}')

## Exercise 4.7: Document-document similarity

We can use cosine similarity to measure how similar two documents are: $\cos_{\text{sim}}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$, where $A$ and $B$ are the rows of V for the two documents.
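
As a small worked example of this formula, the sketch below computes the cosine similarity between one row of a matrix and all of its rows at once. The function `cosine_to_all` and the vectors in `docs_toy` are hypothetical, chosen so that the results are easy to check by hand; the sketch assumes that no row is the zero vector.

```python
import numpy as np

def cosine_to_all(matrix, row):
    """Cosine similarity between matrix[row] and every row of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1)  # one norm per row
    return (matrix @ matrix[row]) / (norms * norms[row])

# Hypothetical document vectors in a 2-dimensional concept space.
docs_toy = np.array([
    [1.0, 0.0],
    [2.0, 0.0],  # same direction as document 0
    [0.0, 3.0],  # orthogonal to document 0
    [1.0, 1.0],  # at 45 degrees from document 0
])

sims = cosine_to_all(docs_toy, 0)
print(sims)
```

Documents 0 and 1 get similarity 1 even though their lengths differ, which shows that the measure depends only on direction; the orthogonal document gets 0.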

In [None]:
def cos_sim(n, m):
    dn = V[n]
    dm = V[m]
    return np.dot(dn, dm)/(np.linalg.norm(dn)*np.linalg.norm(dm))

In [None]:
IX_index = docToIdx["COM-308"]
sim_with_ix = []
for i in range(N):
    sim_with_ix.append(cos_sim(IX_index, i))
sim_with_ix[IX_index] = -10  # exclude the query course itself (a cosine similarity is never below -1)

In [None]:
top_five_courses_similar_to_IX = np.argsort(sim_with_ix)[-5:]
print("Top 5 courses similar to IX:")
for i in np.flip(top_five_courses_similar_to_IX):
    courseId = next(iter(courses[i].values()))
    print(f'{courseId} : {sim_with_ix[i]}')