# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** P

**Names:**

* Matthias Leroy
* Pierre Fouche

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer

## Exercise 4.4: Latent semantic indexing

In [None]:
# Load the term-document matrix X from disk,
# along with the dictionaries that map indices to terms and to documents (courses)
with open("matrix.pickle", "rb") as f:
    X = pickle.load(f, encoding="utf-8")
with open("newCoursesDict.pickle", "rb") as f:
    newCoursesDict = pickle.load(f, encoding="utf-8")
with open("termsDict.pickle", "rb") as f:
    termsDict = pickle.load(f, encoding="utf-8")

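# Apply a truncated SVD with k = 300 to the term-document matrix X.
U, S, Vt = svds(X, k=300)
# Print the 20 largest squared singular values, largest first
# (this assumes svds returns the singular values in ascending order).
print([e*e for e in S[-20:][::-1]])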

1) Each singular value in S measures the importance of one concept (topic), so the largest values correspond to the most significant concepts.<br/>
Each column of U describes one concept as a combination of all the terms of the matrix, so each row of U corresponds to a term.<br/>
Each row of Vt describes one concept as a combination of all the documents of the matrix, so each column of Vt corresponds to a document.<br/>
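
The roles of U, S and Vt described above can be checked on a small example. This is only a sketch on a made-up 4-term by 3-document matrix (`X_toy` and the other `_toy` names are hypothetical, not the lab data), and it uses `np.linalg.svd` instead of `svds`, so the singular values come out largest first.

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms (rows) x 3 documents (columns).
X_toy = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

# Thin SVD: X_toy = U_toy @ diag(S_toy) @ Vt_toy
U_toy, S_toy, Vt_toy = np.linalg.svd(X_toy, full_matrices=False)

print(U_toy.shape)   # (number of terms, number of concepts)
print(S_toy.shape)   # (number of concepts,)
print(Vt_toy.shape)  # (number of concepts, number of documents)

# Multiplying the three factors back together recovers the original matrix.
X_rebuilt = U_toy @ np.diag(S_toy) @ Vt_toy
reconstruction_ok = np.allclose(X_rebuilt, X_toy)
print(reconstruction_ok)
```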



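2) The top-20 eigenvalues are the squares of the 20 largest singular values in S: see the output above. Since X is not square, these are the eigenvalues of $X^T X$ (equivalently, the nonzero eigenvalues of $X X^T$).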
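
The link between singular values and eigenvalues follows from $X = U \Sigma V^T$, which gives $X^T X = V \Sigma^2 V^T$. A minimal numerical check, assuming a small made-up matrix (`X_toy` is hypothetical, not the lab data):

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms x 3 documents.
X_toy = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

# Singular values of X_toy, returned in descending order.
singular_values = np.linalg.svd(X_toy, compute_uv=False)

# Eigenvalues of the symmetric matrix X^T X; eigvalsh returns them in
# ascending order, so reverse them to compare with the singular values.
eigenvalues = np.linalg.eigvalsh(X_toy.T @ X_toy)[::-1]

squares_match = np.allclose(singular_values ** 2, eigenvalues)
print(squares_match)
```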

## Exercise 4.5: Topic extraction

In [None]:
topicTermDict = {}
indiceT = 1
# Take the 10 columns of U associated with the largest singular values in S,
# then find the 10 most important terms within each topic
for topicTerm in np.absolute(np.transpose(U))[-10:][::-1]:
    idxTerms = topicTerm.argsort()[-10:][::-1]
    tempList1 = []
    for idxT in idxTerms:
        term = termsDict[idxT]
        tempList1.append(term)
    topicTermDict[indiceT]=tempList1
    indiceT+=1
    
indiceD = 1
topicDocDict = {}
# Take the 10 rows of Vt associated with the largest singular values in S,
# then find the 10 most important documents (courses) within each topic
for topicDoc in np.absolute(Vt)[-10:][::-1]:
    idxDoc = topicDoc.argsort()[-10:][::-1]
    tempList2 = []
    for idxD in idxDoc:
        doc = newCoursesDict[idxD]['name']
        tempList2.append(doc)
    topicDocDict[indiceD]=tempList2
    indiceD+=1


for i in range(1,11):
    print('Terms for topic',i,'\n')
    for j in range(10):
        print(topicTermDict[i][j])
    print('\nDocument for topic',i,'\n')
    for k in range(10):
        print(topicDocDict[i][k])
    print('------------------------------------------------------------\n')


2)

label topic 1: Visual computing<br/>
label topic 2: Medical science<br/>
label topic 3: Bio electronic<br/>
label topic 4: Thermodynamics<br/>
label topic 5: Electromagnetism<br/>
label topic 6: Physical waves<br/>
label topic 7: General physics<br/>
label topic 8: Astrophysics<br/>
label topic 9: Biological fluid<br/>
label topic 10: Fluid experimentation<br/>

## Exercise 4.6: Document similarity search in concept-space

In [None]:
# Compute the cosine similarity, in concept space, between the term of index t and the document of index d
def similarity(t,d):
    Urow = U[t].reshape(1,300)
    Sdiag = np.diag(S)
    Vrow = Vt[:,d].reshape(300,1)
    Svd = np.dot(Sdiag,Vrow)
    
    sim = (np.dot(Urow,Svd))/(np.linalg.norm(Urow)*np.linalg.norm(Svd))
    return sim[0][0]

In [None]:
ps = PorterStemmer()

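# Return the index of the (stemmed) query term in termsDict.
# Note: falls back to index 0 when the term is not found.
def findIndexTerm(q):
    stemmed = ps.stem(q)  # stem once instead of once per dictionary entry
    b = 0
    for key, value in termsDict.items():
        if value == stemmed:
            b = key
            break
    return b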

In [None]:
# Search function that compares a given term with all the courses.
# It returns the top 10 courses together with their similarity score for that term.
def searchFunction(query):
    t = findIndexTerm(query)
    resultSim = []
    
    for d,doc in newCoursesDict.items():        
        resultSim.append((doc['name'],similarity(t,d)))
    
    sorted_by_similarity = sorted(resultSim, key=lambda tup: tup[1], reverse=True)
    
    return(sorted_by_similarity[:10])

In [None]:
#we search for facebook
for result in searchFunction('facebook'):
    print('class:',result[0])
    print('score:', result[1])
    print('\n')

1) With the VSM from the first exercise, searching for facebook returned only one course, because 'Computational Social Media' is the only course whose description contains the term facebook. With LSI, facebook is linked to other terms, so the search also returns courses related to the same concept (social media, for example).
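
The term-to-course scoring behind this search can also be computed for all courses at once instead of one `similarity(t, d)` call per course. The sketch below is an assumption-laden rewrite, not the code used above: `term_to_all_docs` is a hypothetical helper, and it is run on a made-up toy matrix rather than on the lab data.

```python
import numpy as np

def term_to_all_docs(U, S, Vt, t):
    """Cosine similarity between term t and every document, in concept space."""
    term_vec = U[t]                    # shape (k,)
    doc_vecs = S[:, None] * Vt         # shape (k, n_docs); column d equals diag(S) @ Vt[:, d]
    dots = term_vec @ doc_vecs         # one dot product per document
    norms = np.linalg.norm(term_vec) * np.linalg.norm(doc_vecs, axis=0)
    return dots / norms

# Hypothetical toy term-document matrix: 4 terms x 3 documents.
X_toy = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])
U_toy, S_toy, Vt_toy = np.linalg.svd(X_toy, full_matrices=False)

scores = term_to_all_docs(U_toy, S_toy, Vt_toy, t=0)
ranking = np.argsort(scores)[::-1]     # document indices, best match first
print(scores, ranking)
```

The result has one score per document, so sorting it replaces the explicit loop over courses. The sketch does not guard against documents whose concept-space vector is all zeros.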

In [None]:
#we search for markov chains
for result in searchFunction('markov chains'):
    print('class:',result[0])
    print('score:', result[1])
    print('\n')

2) We find the same courses as with VSM, plus others that could be relevant even though they do not explicitly mention Markov chains (at least in their description).

## Exercise 4.7: Document-document similarity

In [None]:
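# Find the index of the course 'Internet analytics' (stays 0 if it is not found)
a = 0
for key, value in newCoursesDict.items():
    if value['name'] == 'Internet analytics':
        a = key
        break

# Score every term against that course
resultSim2 = []
for i, terms in termsDict.items():
    resultSim2.append((terms, similarity(i, a)))

sorted_by_similarity2 = sorted(resultSim2, key=lambda tup: tup[1], reverse=True)

# For each of the 300 best terms, fetch its top-10 courses and add up both scores
coursesSim = []
for n, s in sorted_by_similarity2[:300]:
    for c, s2 in searchFunction(n):
        coursesSim.append((c, s + s2))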

In [None]:
import collections as col

c = col.Counter()
for k, v in coursesSim:
    c[k] += v

sortCourses = sorted(c.items(), key=lambda tup: tup[1], reverse=True)

# Skip the top entry (presumably 'Internet analytics' itself) and print the next 5 courses
for course, score in sortCourses[1:6]:
    print(course)

1) We obtained the 5 courses most similar to COM 308 in two steps: we took the 300 terms with the best similarity scores with Internet Analytics, then combined those scores with the similarity scores between these terms and their top-ranked courses, summing the result per course.

2) Thus we found:<br/><br/>
Applied data analysis<br/>
Distributed information systems<br/>
A Network Tour of Data Science<br/>
Data science for business<br/>
Database systems<br/>
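
As a possible cross-check of the term-based method described in 1), documents can also be compared directly in concept space, using the columns of $\Sigma V^T$ as document vectors. This is an alternative sketch, not the method used above; `most_similar_docs` and the toy matrix are hypothetical, and only the first k = 2 concepts are kept to mimic the truncation.

```python
import numpy as np

def most_similar_docs(S, Vt, doc_index, top_n=5):
    """Rank documents by cosine similarity to doc_index in concept space."""
    doc_vecs = (S[:, None] * Vt).T                 # shape (n_docs, k)
    norms = np.linalg.norm(doc_vecs, axis=1)
    sims = doc_vecs @ doc_vecs[doc_index] / (norms * norms[doc_index])
    order = np.argsort(sims)[::-1]                 # best match first
    order = order[order != doc_index]              # drop the query document itself
    return [(int(d), float(sims[d])) for d in order[:top_n]]

# Hypothetical toy matrix: documents 0 and 1 share terms 0 and 1,
# documents 2 and 3 share terms 2 and 3.
X_toy = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
U_toy, S_toy, Vt_toy = np.linalg.svd(X_toy, full_matrices=False)

k = 2  # keep only the two strongest concepts
closest = most_similar_docs(S_toy[:k], Vt_toy[:k], doc_index=0, top_n=1)
print(closest)
```

On the lab data, the same function would take `S` and `Vt` from `svds` and the index of 'Internet analytics'. The ranking may differ from the list above, since the two methods score courses differently.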