# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *K*

**Names:**

* *Mathieu Sauser*
* *Luca Mouchel*
* *Jérémy Chaverot*
* *Heikel Jebali*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import numpy as np
from utils import load_json
import scipy as sp
import scipy.sparse.linalg  # makes sp.sparse.linalg.svds available

courses = load_json('data/courses.txt')
N = len(courses)

## Exercise 4.4: Latent semantic indexing

In [2]:
X = np.load('TFIDF.npy')
docToIdx = np.load('docToIdx.npy', allow_pickle=True)
# Create a dictionary with the index as key and the document as value
idxToDoc = dict(zip(docToIdx.item().values(), docToIdx.item().keys()))
# Create a dictionary with the document as key and the index as value
docToIdx = dict(zip(idxToDoc.values(), idxToDoc.keys()))
termToIdx = np.load('termToIdx.npy', allow_pickle=True)
# Create a dictionary with the index as key and the term as value
idxToTerm = dict(zip(termToIdx.item().values(), termToIdx.item().keys()))
# Create a dictionary with the term as key and the index as value
termToIdx = dict(zip(idxToTerm.values(), idxToTerm.keys()))

In [3]:
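# Truncated SVD: keep the 300 largest singular values
U, S, V = sp.sparse.linalg.svds(X, k=300)
# svds returns V^T; transpose it so that row d of V describes document d
V = V.T
print(U.shape)
print(S.shape)
print(V.T.shape)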

(10815, 300)
(300,)
(300, 854)


Column $i$ of $U$ gives the weight of each term in topic $i$.

Entry $i$ of $S$ is the singular value of topic $i$, i.e. its strength.

Row $i$ of $V^T$ gives the weight of each document in topic $i$.
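
The roles of the three factors can be checked on a small example. The sketch below is only an illustration: the matrix `X_toy` and the names `U_toy`, `S_toy`, `Vt_toy` are made up and are not part of the lab data. It also reorders the factors explicitly so that topic 0 is the strongest one, rather than relying on the order in which `svds` returns the singular values.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy term-document matrix (hypothetical): 6 terms (rows) x 5 documents (columns).
X_toy = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
])

k = 2  # number of topics to keep
U_toy, S_toy, Vt_toy = svds(X_toy, k=k)

# Reorder so that topic 0 has the largest singular value.
order = np.argsort(S_toy)[::-1]
U_toy, S_toy, Vt_toy = U_toy[:, order], S_toy[order], Vt_toy[order, :]

print(U_toy.shape)   # (6, 2): one row per term, one column per topic
print(S_toy.shape)   # (2,):   one singular value per topic
print(Vt_toy.shape)  # (2, 5): one row per topic, one column per document
```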

In [4]:
# Print the 20 largest singular values (S is in ascending order, so we read it from the end)
for i in range(20):
    print(f"{i+1}: {S[-(i + 1)]}")
    if i < 19:
        print("=====================")
    

1: 6.669278191262075
2: 4.548502896539546
3: 2.867597468752914
4: 2.196187036905732
5: 1.8603560966014787
6: 1.6821205364600134
7: 1.6190756001591227
8: 1.5728474730472282
9: 1.5391747914258258
10: 1.5254035282163458
11: 1.4810524769317563
12: 1.4664165283239408
13: 1.4458611065352382
14: 1.3898992534052956
15: 1.3708084249410597
16: 1.3617568606179178
17: 1.3507910883487848
18: 1.3426638423578032
19: 1.3232450150285355
20: 1.3116115226252492


## Exercise 4.5: Topic extraction

Take the 10 indices with the largest values in $S$. For $U$: take the columns at those indices and keep the 10 largest values of each column (do the same for $V^T$, using rows instead of columns).

In [5]:
num_docs = 10
num_index = 10
num_terms = 10

In [6]:
# Indices of the num_index (=10) largest singular values.
# np.argsort sorts in ascending order, so they run from the 10th-largest to the largest.
index_max_value = np.argsort(np.abs(S))[-num_index:]
terms = []      # one row per topic: indices of its num_terms highest-weighted terms (from U)
documents = []  # one row per topic: indices of its num_docs highest-weighted documents (from V)
for j in index_max_value:
    # Column j of U holds the weight of every term in topic j
    col_u = U[:, j]
    # Keep the indices of the num_terms largest weights (in ascending order)
    terms.append(np.argsort(col_u)[-num_terms:])
    # Column j of V (row j of V^T) holds the weight of every document in topic j
    col_v = V[:, j]
    documents.append(np.argsort(col_v)[-num_docs:])
# Stack the per-topic results into two 10x10 index matrices
terms = np.vstack(terms)
documents = np.vstack(documents)
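
The same selection can be written so that topics are visited from the largest singular value to the smallest, and the items of each topic from the largest weight to the smallest. The sketch below is one possible way to do this on toy data; `top_indices_per_topic`, `U_toy` and `S_toy` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def top_indices_per_topic(M, S, num_topics, num_items):
    """For the num_topics strongest topics, return the indices of the
    num_items largest entries in the corresponding column of M."""
    # Topic indices sorted from the largest to the smallest singular value.
    topic_order = np.argsort(S)[::-1][:num_topics]
    # For each topic, sort its column in descending order and keep the first num_items.
    return np.array([np.argsort(M[:, j])[::-1][:num_items] for j in topic_order])

# Toy data (hypothetical): 5 terms, 3 topics, singular values in ascending order.
U_toy = np.array([
    [0.1, 0.9, 0.2],
    [0.8, 0.1, 0.3],
    [0.3, 0.2, 0.7],
    [0.2, 0.4, 0.1],
    [0.5, 0.3, 0.6],
])
S_toy = np.array([1.0, 2.0, 3.0])

top_terms = top_indices_per_topic(U_toy, S_toy, num_topics=2, num_items=2)
print(top_terms)  # row 0 is the strongest topic, row 1 the second strongest
```

Singular vectors are only defined up to a sign, so ranking by `np.abs(M[:, j])` instead of `M[:, j]` is another reasonable choice.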

In [7]:
for i in range(num_terms):
    print(f"topic {i+1} (singular value = {S[-(i + 1)]})")
    print(f" -top {num_terms} terms:")
    for j in terms[i]:
        print(f"{idxToTerm[j]}")
    print(f"\n -top {num_docs} documents")
    for j in documents[i]:
        print(f"{idxToDoc[j]}")
    if i < num_terms -1:
        print("=============")

topic 1 (singular value = 6.669278191262075)
 -top 10 terms:
combinatorics
finite field
electron microscopy
laser
optical
diffraction
microscopy
tem
additive
electron

 -top 10 documents
MSE-655
MSE-704
PHYS-610
MSE-638
MSE-636(b)
MSE-636(a)
MSE-637(b)
MSE-637(a)
MSE-635
MATH-636
topic 2 (singular value = 4.548502896539546)
 -top 10 terms:
combinatorics
finite field
energy transport
diagnostics
magnetic
fusion
additive
confinement
magnetic confinement
plasma

 -top 10 documents
PHYS-734
BIO-630
MATH-625(2)
MATH-625(1)
PHYS-445
PHYS-423
PHYS-424
PHYS-732
MATH-636
PHYS-731
topic 3 (singular value = 2.867597468752914)
 -top 10 terms:
lie
result
emphasis
group
field
finite
combinatorial
combinatorics
finite field
additive

 -top 10 documents
MSE-657
COM-102
MATH-726(2)
MATH-726
MSE-804
MATH-625
MATH-409
MATH-625(2)
MATH-625(1)
MATH-636
topic 4 (singular value = 2.196187036905732)
 -top 10 terms:
energy transport
energy
quantum
optical
laser
fusion
magnetic
confinement
magnetic confinement


$\textbf{topic 1: microscopy}\\ \text{(electron) microscopy, diffraction, laser, optical}$


$\textbf{topic 2: electromagnetism} \\ \text{magnetic, finite field, fusion, plasma}$


$\textbf{topic 3: algebra} \\ \text{field, group, additive, finite field}$

$\textbf{topic 4: electromagnetism} \\ \text{magnetic, plasma, quantum}$

$\textbf{topic 5: architecture} \\ \text{Vylder, Taillieu (both famous architects), studio, house learning}$

$\textbf{topic 6: quantum physics} \\ \text{quantum, laser, energy, system}$

$\textbf{topic 7: Drug} \\ \text{Drug, cell, system, compound, chemical}$

$\textbf{topic 8: project} \\ \text{submission result, report late, semester project }$

$\textbf{topic 9: It seems we have something wrong here}$

$\textbf{topic 10: Bioluminescence} \\ \text{fluorescence, light microscopy, drug}$

## Exercise 4.6: Document similarity search in concept-space

In [8]:
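# Term-document similarity in concept space, given a term index t and a document index d
def sim(t, d):
    U_t = U[t]      # concept-space representation of term t
    V_d = V[d]      # concept-space representation of document d
    sv = S * V_d    # weight each concept by its singular value
    return np.dot(U_t, sv)/(np.linalg.norm(U_t)*np.linalg.norm(sv))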
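
Written as a formula, with $u_t$ the row of $U$ for term $t$, $v_d$ the row of $V$ for document $d$, and $\Sigma = \mathrm{diag}(S)$ (notation introduced here only for readability), the function above computes the cosine of the angle between $u_t$ and $\Sigma v_d$:

$$\text{sim}(t, d) = \frac{u_t \cdot (\Sigma v_d)}{\|u_t\| \, \|\Sigma v_d\|}$$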

In [9]:
# Get the index of the term "markov chain"
markov_chain_index = termToIdx["markov chain"]
sim_with_markov_chain = []

# Compute the similarity of "markov chain" with every course
for i in range(N):
    sim_with_markov_chain.append(sim(markov_chain_index, i))
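
The loop above calls `sim` once per course. As a possible alternative, the same scores can be computed for all documents at once with NumPy broadcasting. This is a minimal sketch on random stand-in matrices; `sim_all_docs`, `U_toy`, `S_toy` and `V_toy` are hypothetical names, and the function assumes `V` has one row per document, as in this notebook after the transpose.

```python
import numpy as np

def sim_all_docs(U, S, V, t):
    """Cosine similarity between term t and every document in concept space."""
    u_t = U[t]   # concept-space vector of term t, shape (k,)
    SV = V * S   # row d equals S * V[d], shape (num_docs, k)
    norms = np.linalg.norm(u_t) * np.linalg.norm(SV, axis=1)
    return SV @ u_t / norms

# Random stand-ins for U (6 terms), S (3 topics) and V (4 documents).
rng = np.random.default_rng(0)
U_toy = rng.normal(size=(6, 3))
S_toy = np.array([3.0, 2.0, 1.0])
V_toy = rng.normal(size=(4, 3))

scores = sim_all_docs(U_toy, S_toy, V_toy, t=1)
print(scores.shape)  # (4,): one score per document
```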


In [10]:
#retrieve the 5 most similar courses 
top_five_courses = np.argsort(sim_with_markov_chain)[-5:]
print("top 5 courses for : markov chain")
for i in np.flip(top_five_courses):
    courseId = next(iter(courses[i].values()))
    print(f'{courseId} : {sim_with_markov_chain[i]}')


top 5 courses for : markov chain
MATH-332 : 0.920654644477996
COM-516 : 0.7783805923744408
MGT-484 : 0.7123732607538066
MATH-600 : 0.4306409909772205
EE-605 : 0.2934241933500188


In the previous section, the top 5 courses for "markov chain" were MATH-332, COM-516, MGT-484, MATH-600 and COM-512.

The first four courses are the same; only the fifth differs (EE-605 instead of COM-512).

In [11]:
# Retrieve the index of the term "facebook"
facebook_index = termToIdx["facebook"]
sim_with_facebook = []
# Compute the similarity of "facebook" with every course
for i in range(N):
    sim_with_facebook.append(sim(facebook_index, i))

In [16]:
#retrieve the top 5 courses that are the most similar to facebook
top_five_courses = np.argsort(sim_with_facebook)[-5:]
print("top 5 courses for : facebook")
#print the top 5 courses that are the most similar to facebook
for i in np.flip(top_five_courses):
    courseId = next(iter(courses[i].values()))
    print(f'{courseId} : {sim_with_facebook[i]}')

top 5 courses for : facebook
EE-727 : 0.9727453128059316
EE-593 : 0.6844992203313365
HUM-432(a) : 0.41579967185608024
COM-308 : 0.3944049092093338
EE-552 : 0.22487305535582988


In the previous section, EE-727 was the only course with a nonzero similarity to "facebook", which suggests a bug there. It is also ranked first here, so the two methods at least agree on the top result.

## Exercise 4.7: Document-document similarity

We can use cosine similarity to measure the similarity between two documents: $\cos_{\text{sim}}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$

In [13]:
def cos_sim(n, m):
    dn = V[n]
    dm = V[m]
    return np.dot(dn, dm)/(np.linalg.norm(dn)*np.linalg.norm(dm))

In [14]:
# Retrieve the index of Internet Analytics (IX)
IX_index = docToIdx["COM-308"]
sim_with_ix = []
# Compute the similarity of every course with IX
for i in range(N):
    sim_with_ix.append(cos_sim(IX_index, i))
# IX is trivially the course most similar to itself, so we exclude it.
# Cosine similarity lies in [-1, 1], so setting its own score to -10
# guarantees that IX never appears in the ranking.
sim_with_ix[IX_index] = -10
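
The same ranking, including the exclusion of the query course, can also be sketched without an explicit loop. The helper `most_similar_docs` and the matrix `V_toy` below are hypothetical and only illustrate the idea; the sketch assumes that no row of `V` is all zeros.

```python
import numpy as np

def most_similar_docs(V, n, top_k=5):
    """Indices and cosine similarities of the top_k documents closest to document n."""
    V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)  # normalise each row
    sims = V_unit @ V_unit[n]                               # cosine with document n
    sims[n] = -np.inf                                       # exclude the document itself
    best = np.argsort(sims)[::-1][:top_k]                   # highest similarity first
    return best, sims[best]

# Toy concept-space matrix (hypothetical): 4 documents, 2 topics.
V_toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
])
best, scores = most_similar_docs(V_toy, n=0, top_k=2)
print(best)
```

Using `-np.inf` instead of a fixed value such as -10 avoids depending on the range of the similarity measure.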

In [15]:
top_five_courses_similar_to_IX = np.argsort(sim_with_ix)[-5:]
print("Top 5 courses similar to IX:")
for i in np.flip(top_five_courses_similar_to_IX):
    courseId = next(iter(courses[i].values()))
    print(f'{courseId} : {sim_with_ix[i]}')

Top 5 courses similar to IX:
CS-423 : 0.6441620403958727
EE-558 : 0.5849158231059765
CS-401 : 0.5275570322125526
EE-724 : 0.4942567658307351
CS-422 : 0.42301480158451943
