# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *F*

**Names:**

* *Dessimoz Frank*
* *Micheli Vincent*
* *Lefebvre Hippolyte*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_pkl, load_json

Load TF_IDF matrix computed previously:

In [107]:
TF_IDF = load_pkl('tf_idf.pkl')
classes = load_json('data/courses.txt')
terms = load_pkl('terms.pkl')

Cosine similarity built previously:

In [97]:
def cosine_similarity(di, dj):
    """Cosine of the angle between vectors di and dj (closeness of a topic/query and a document)."""
    # np.linalg.norm defaults to the Euclidean (L2) norm for 1-D arrays
    return np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj))

## Exercise 4.4: Latent semantic indexing

In [115]:
U, S, V_T = svds(TF_IDF, k=300, which='LM')

print('U shape:', U.shape)
print('S shape:', S.shape)
print('V_T shape:', V_T.shape)

U shape: (11819, 300)
S shape: (300,)
V_T shape: (300, 854)


1) According to the Dimensionality Reduction lecture:  
* The rows of U describe how each term is related to each underlying concept. A large entry (i,j) indicates that term i is strongly related to concept j.
* The columns of V_T describe how each class is related to each underlying concept. A large entry (i,j) indicates that concept i is strongly related to class j.
* The entries of S are the singular values of the TF-IDF matrix, i.e. the square roots of the eigenvalues of TF_IDF·TF_IDFᵀ. They describe the strength of each concept: a large entry S[i] indicates that concept i is "strong".
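
To make the link between singular values and eigenvalues concrete, here is a minimal sketch on a small random matrix. It is an illustration only: the matrix `A` and the rank `k` are hypothetical, and the dense `np.linalg.svd` is used as a stand-in for `svds` so that the example stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 5))   # hypothetical toy matrix: 8 terms x 5 documents
k = 3                    # number of concepts kept (assumption for the example)

# Dense SVD; np.linalg.svd returns the singular values in decreasing order.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U_full[:, :k], s_full[:k], Vt_full[:k, :]

# The squared singular values are the largest eigenvalues of A A^T.
eigvals = np.linalg.eigvalsh(A @ A.T)[::-1][:k]
print(np.allclose(s_k ** 2, eigvals))

# Rank-k approximation built from the k strongest concepts.
A_k = U_k @ np.diag(s_k) @ Vt_k
err = np.linalg.norm(A - A_k)   # Frobenius norm of the residual
print(np.isclose(err, np.sqrt(np.sum(s_full[k:] ** 2))))
```

The second check shows that the reconstruction error depends only on the discarded singular values, which is why keeping the strongest concepts preserves most of the matrix.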

In [47]:
#The largest singular values (square roots of the eigenvalues) are at the end of S, so we reverse it
top_20_eigenvalues_sqrt = S[::-1][:20]
top_20_eigenvalues = [e*e for e in top_20_eigenvalues_sqrt]

In [48]:
print(top_20_eigenvalues)

[125034.54771043283, 50329.233321973268, 45116.877815600696, 42257.958891422379, 37633.456079130032, 37113.735957742443, 36008.276425117074, 35220.610036424769, 34494.550609889076, 33433.925382265923, 30772.715117247084, 30289.483195605, 29468.024971463365, 28477.490802826716, 27238.952476757793, 26488.2531148747, 26117.669567240573, 25319.194356977066, 25106.691697178565, 24865.841939481787]


The eigenvalues are correctly sorted in decreasing order.

## Exercise 4.5: Topic extraction

Now we would like to understand better what types of courses are offered on the
campus. To do so, we use LSI to extract some topics from the corpus of documents.

In [114]:
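#Top 10 terms and top 10 courses per topic; the strongest topics are the last columns of U / rows of V_T, hence the negative indices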
for topic in range(-1,-11,-1):
    top_10_terms = [terms[t] for t in np.argsort(U[:,topic])[-10:]]
    top_10_documents = [classes[d]['name'] for d in np.argsort(V_T[topic])[-10:]]
    print('Topic ', topic,':')
    print(top_10_terms)
    print(top_10_documents)
    print()

Topic  -1 :
['phenomena5', 'microinstabilities4', 'magnetohydrodynam', 'plasmas3', 'aspectsglp', 'aspectsmarket', 'aspectsregulatori', 'aspectslega', 'aspectsstructurefinanci', 'edmt']
['De- and re-regulation of Network Industries', 'Plasma Diagnostics in Basic Plasma Physics Devices and Tokamaks: from Principles to Practice', 'Additive Combinatorics', 'Multidisciplinary organization of medtechs/biotechs', 'Medicinal chemistry', 'Project 1 (EDIC)', 'Project 2 (EDIC)', 'Field Research Project A', 'Field Research Project B', 'Training Rotation (EDNE)']

Topic  -2 :
['asset', 'option', 'stochast', 'arbitrag', 'financ', 'financi', 'market', 'portfolio', 'price', 'risk']
['Martingales in financial mathematics', 'Advanced topics in financial econometrics', 'Investments', 'Advanced derivatives', 'Stochastic calculus II', 'Financial econometrics', 'Stochastic calculus I', 'Risk and energy', 'Derivatives', 'Quantitative methods in finance']

Topic  -3 :
['excurs', 'form', 'design', 'week', 'rep

We retrieved the top 10 topics, each described by its 10 highest-weighted terms and its 10 highest-weighted documents.

We can label the topics as follows:  

    1) Organization and projects in medical technologies  
    2) Financial engineering  
    3) Architecture  
    4) Environmental engineering  
    5) Applied microscopic medical sciences  
    6) Molecular and cellular biophysics  
    7) Systems modeling and natural language processing  
    8) Risk management  
    9) Applied robotics  
    10) Electrical engineering  

## Exercise 4.6: Document similarity search in concept-space

Now we implement a search function in the LSI concept space and search for "markov chains" and
"facebook". We aim to compare the results with the vector space model (VSM) results computed previously.

In [50]:
#Obtain a diagonal matrix from vector S
S_diag = np.diag(S)

In [51]:
len(classes)

854

In [124]:
def t_d_sim(query):
    """Cosine similarity between a query and every course, computed in concept space."""
    sim = np.zeros(len(classes))
    # Represent the query as the sum of the concept vectors (rows of U) of its terms
    term_U = np.zeros(U.shape[1])
    for term in query.split(" "):
        term_i = terms.index(term)  # fails if the term is not in the vocabulary
        term_U += U[term_i]

    for i in range(len(classes)):
        # Concept-space representation of course i: its column of V_T scaled by S
        d = np.dot(S_diag, V_T[:, i])
        sim[i] = cosine_similarity(term_U, d)
    return sim

In [126]:
def LSI_search(query):
    sim = t_d_sim(query)
    top_5 = np.argsort(sim)[::-1][:5]
    for course_i in top_5:
        print('Course: ',classes[course_i]['name'],',', ' Similarity: ', sim[course_i])

2) As expected, LSI performs better than VSM. The improvement comes from the fact that LSI compares queries and documents through underlying (latent) topics, whereas VSM only compares the weights of the terms themselves. Therefore, for a rare term such as "facebook", we get better recommendations.
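
The difference can be illustrated on a toy example. This is a sketch under stated assumptions, not the lab's data: the three-term vocabulary `toy_terms` and the matrix `A` are hypothetical, and the dense `np.linalg.svd` replaces `svds`. Document 1 contains "social" but not "facebook", so plain term matching gives it a score of zero, while the concept-space search still retrieves it.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus: rows are terms, columns are documents.
toy_terms = ['facebook', 'social', 'plasma']
A = np.array([[1., 0., 0.],    # 'facebook' appears only in document 0
              [1., 1., 0.],    # 'social' appears in documents 0 and 1
              [0., 0., 1.]])   # 'plasma' appears only in document 2
q = toy_terms.index('facebook')

# VSM: compare the query's term vector with each document column.
query_vec = np.zeros(len(toy_terms))
query_vec[q] = 1.0
vsm = [cosine(query_vec, A[:, d]) for d in range(A.shape[1])]

# LSI: keep the k strongest concepts and compare in concept space,
# using the row of U for the query and S-scaled columns of V_T for documents.
U, s, Vt = np.linalg.svd(A)
k = 2
query_concept = U[q, :k]
lsi = [cosine(query_concept, s[:k] * Vt[:k, d]) for d in range(A.shape[1])]

print('VSM:', np.round(vsm, 3))
print('LSI:', np.round(lsi, 3))
```

In this toy setting, documents 0 and 1 share a single concept, so LSI ranks both as relevant to "facebook", while the unrelated document 2 keeps a similarity close to zero.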

In [127]:
LSI_search('facebook')

Course:  Computational Social Media ,  Similarity:  0.914629726075
Course:  Social media ,  Similarity:  0.56961813888
Course:  Studio MA2 (Escher et GuneWardena) ,  Similarity:  0.196441545312
Course:  Human computer interaction ,  Similarity:  0.179784450763
Course:  Transport phenomena II ,  Similarity:  0.177257241069


In [128]:
LSI_search('markov chain')

Course:  Applied stochastic processes ,  Similarity:  0.779779129885
Course:  Applied probability & stochastic processes ,  Similarity:  0.768744936584
Course:  Markov chains and algorithmic applications ,  Similarity:  0.640434513617
Course:  Supply chain management ,  Similarity:  0.541455248869
Course:  Mathematical models in supply chain management ,  Similarity:  0.500342638133


Indeed, this time 'facebook' is associated with the social media topic.

## Exercise 4.7: Document-document similarity

Once again we resort to cosine similarity: we compare the column of V_T corresponding to Internet Analytics with every column of V_T, each column being the concept-space representation of one course.
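
The comparison described above can also be written in vectorised form. The following is a minimal sketch, assuming a matrix laid out like `V_T` (concepts as rows, documents as columns). The helper `most_similar_columns` and the random `toy_V_T` are hypothetical names introduced for illustration only.

```python
import numpy as np

def most_similar_columns(V_T, doc_id, n=5):
    """Return the n columns of V_T closest (by cosine) to column doc_id, plus all similarities."""
    # Normalise every column to unit length so that a dot product equals the cosine.
    unit = V_T / np.linalg.norm(V_T, axis=0, keepdims=True)
    sims = unit.T @ unit[:, doc_id]
    # Sort by decreasing similarity and drop the query document itself.
    order = [int(i) for i in np.argsort(sims)[::-1] if i != doc_id]
    return order[:n], sims

rng = np.random.default_rng(0)
toy_V_T = rng.normal(size=(4, 6))      # hypothetical: 4 concepts, 6 documents
toy_V_T[:, 3] = 2.0 * toy_V_T[:, 0]    # document 3 points in the same direction as document 0
neighbours, sims = most_similar_columns(toy_V_T, doc_id=0, n=2)
print(neighbours)
```

Because cosine similarity ignores vector length, document 3 comes out as the nearest neighbour of document 0 even though its column is twice as long.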

We first identify the Internet Analytics (IX) class:

In [98]:
IX_id = [i for (i, c) in enumerate(classes) if c['courseId'] == "COM-308"][0]
IX_id

43

And look for courses similar to ours:

In [129]:
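#Cosine similarity between the IX column and every column of V_T
IX_sim = np.apply_along_axis(cosine_similarity, 0, V_T, V_T[:,IX_id])
#Keep 6 results: rank 0 is Internet analytics itself, followed by its 5 most similar courses
top_5 = np.argsort(IX_sim)[::-1][0:6]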
for i, c_i in enumerate(top_5):
    print('Top ',i,' :',' ',classes[c_i]['name'])

Top  0  :   Internet analytics
Top  1  :   Distributed information systems
Top  2  :   A Network Tour of Data Science
Top  3  :   Financial big data
Top  4  :   Networks out of control
Top  5  :   Analytic algorithms


We notice that the five courses returned are coherent and can serve as good advice for choosing similar
courses next semester.