# Lab 8 - Latent Semantic Analysis (Part 1)

(70 points)

In the first part of this lab, you will use the Latent Semantic Analysis technique to analyze the latent semantics of a Medieval European recipe corpus. 
Besides this iPython notebook, you are also given one dataset (ingredients.xls) as follows:

Each row represents one recipe (in terms of ingredient list).  The first cell of each row denotes the number of ingredients in this row.

You will use this file through the lab, as explained in detail later.

###  What to hand in: 
You will need to pack following things into a file.


   * The completed Notebook files (ipynb) - Remember to answer all the questions in the notebooks!
   

Load the packages

In [31]:
import xlrd
import numpy as np
eps = np.finfo(float).eps

 Load the dataset

In [32]:
filename = 'ingredients.xls'
sheet_name = 'ingredients'


### Step 1: Create a vocabulary dictionary, a term list to save all terms and a document list to save all recipes (in the form of a list of all terms):

1. first read the first cell (cell_idx=0) of each row as num_words;
2. then read from the first term (cell_idx=1) to the last term (cell_idx = num_words);
3. continue if the read term is empty or term == None

In [33]:
wb = xlrd.open_workbook(filename)
ws = wb.sheet_by_name(sheet_name)
vocab = dict()
doc_list = []
term_list = []
num_vocab = 0   

#### 1. COMPLETE THE CODE BELOW ####
# (15 points)
# Build a vocabulary dictionary (vocab), where the word is the key, and the value is the index of 
# entry in the dictionary.
#
# Also, build a document list (doc_list) which saves lists of words in each recipe.
# Hint: Use ws.cell_value(i,j) to get the value of the cell at location i,j.
# You can use ws.nrows to get number of rows in an open excel sheet

temp_list = []

for i in range(ws.nrows):
    num_words = ws.cell_value(i,0)
    for j in range(1,int(num_words)+1):
        if ws.cell_value(i,j) == '' or ws.cell_value(i,j) == None:
            continue
        if ws.cell_value(i,j) not in vocab.keys():
            vocab[ws.cell_value(i,j)] = num_vocab
            num_vocab += 1
        temp_list.append(ws.cell_value(i,j))
    doc_list.append(temp_list)
    temp_list = []
    
for keys in vocab.keys():
    term_list.append(keys)


print len(term_list), num_vocab, len(vocab)


386 386 386


### Step 2: Create the term-document occurrence matrix

The term-document matrix ''tdMatrix" describes the occurrences of terms in documents.

Each row represents one term, and each column represents one document.

tdMatrix[i, j] = 1.0 if the term i occurs in the document j.
else tdMatrix[i, j] = 0.0

In [34]:
num_docs = len(doc_list)
tdMatrix = np.zeros((num_vocab, num_docs), dtype='float')

In [35]:
#### 2. WRITE YOUR CODE HERE TO FILL IN THE TERM DOCUMENT OCCURRENCE MATRIX ####
# Use the vocabulary dictionary effectively.
# (10 points)
for j in range(num_docs):
    for word in doc_list[j]:
        if word in vocab.keys():
            tdMatrix[vocab[word]][j] = 1.0
        

### Step 3: SVD decomposition of the term-document matrix

Use the numpy function ''np.linalg.svd()" to do the SVD decomposition. Remember to set the parameter ''full_matrices" to False.

Please check http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html for more details.

tdMatrix = U \* diagonal (s) \* V

In [100]:
#### 3. YOUR CODE HERE ####
# (5 points)
U, s, V = np.linalg.svd(tdMatrix, full_matrices=False)

In [101]:
S = np.diag(s)
print U.shape, V.shape, s.shape, S.shape

(386, 386) (386, 4133) (386,) (386, 386)


Set the feature dimension of the latent semantic matrices.

The largest possible feature dimension should be smaller than min(tdMatrix.shape[0], tdMatrix.shape[1])

In [102]:
lower_ftr_dim = 300

### Step 4: Create the latent semantic matrices for terms and documents

Only keep the first ''lower_ftr_dim" columns of U and the first ''lower_ftr_dim" columns of **transpose of V**

In [103]:
#### 4. YOUR CODE HERE ####
# (5 points)
U = U[:,:lower_ftr_dim]
s = S
V_transpose = np.transpose(V)[:,:lower_ftr_dim]
print U.shape, V_transpose.shape, s.shape

(386, 300) (4133, 300) (386, 386)


### Step 5: Function to return the indices of the most similar terms/documents to a selected term/document by using the cosine similarity

The function has the following inputs:
    
    1. id: the index of the selected term/document;
    
    2. ftr_mtx: the latent semantic feature matrix;
        
    3. top_k: the number of most similar terms/documents to be returned;

The function returns a list which contains the indexes of the most similar terms/documents. Note that this function can be used for both terms and documents, by providing the appropriate feature matrix.

The cosine similarity(A, B) = (the dot product of A and B)/(norm(A) * norm(B)). 

In [104]:
#### 5. COMPLETE THE FUNCTION BELOW ####
# (25 points)
    
def MostSimilar(term_id, ftr_mtx, top_k):
    
    target = ftr_mtx[term_id] # the target term/document you want to find terms/documents similar to
    
    # a) First compute the cosine similarity matrix between every element in ftr_mtx and target
    # add 'eps' value to the denominator of the cosine similarity you calculate, for numerical stability
    
    sim_list = [] # similarity list: List of similarity values of target and every element in ftr_mtx
    for i in range(len(ftr_mtx)):
        cos_sim = np.dot(target, ftr_mtx[i])/((np.linalg.norm(target)*np.linalg.norm(ftr_mtx[i]))+eps)
        sim_list.append(cos_sim)
    
    
    # b) Will the maximum similarity always be 1? If yes, write code to correct for it. 
    index_arr = np.argsort(sim_list) 
    ret_arr = index_arr[::-1]
    ret_list = ret_arr[1:top_k+1] # list to be returned
    # c) Iteratively find the top_k indices based on the similarity matrix and return
    # the argument of the entry in the ret_list
    
    # for j in range(top_k):
        
    return ret_list

### Step 6a: Find the most similar top_k terms for a chosen term

In [105]:
# Find the top k entries related to a chosen_term of your choice
top_k = 5
chosen_term = 'apple'

In [106]:
if chosen_term in term_list:
    print 'the selected term is ' + chosen_term
    count = 0
    for term in term_list:
        if term==chosen_term:
            break
        count += 1
    indexes = MostSimilar(count, U, top_k)
    similar_terms = []
    print 'the top ' + str(top_k) + ' similar terms are '
    for id in indexes:
        similar_terms.append(term_list[id])
    print similar_terms
else:
    print 'the selected term is not in the vocabulary!'

the selected term is apple
the top 5 similar terms are 
[u'skylark', u'parsnip', u'peach', u'veal', u'patience']


### Step 6b: Find the most similar top_k documents for a chosen document

In [86]:
# Find the top k documents for a document (id) of your choice
top_k = 3
chosen_doc_id = 100
print len(doc_list)

4133


In [87]:
if chosen_doc_id < len(doc_list):
    print 'the selected doc is ' + str(doc_list[chosen_doc_id]) 
    indexes = MostSimilar(chosen_doc_id, V_transpose, top_k)
    print 'the top ' + str(top_k) + ' similar docss are '        
    for id in indexes:
        print doc_list[id]            
else:
    print 'the selected docs id is out of range!'

the selected doc is [u'mussel', u'wine', u'broth', u'almond', u'milk', u'leek', u'pepper', u'clove', u'onion', u'oil', u'saffron', u'verjuice', u'vinegar', u'ginger', u'cinnamon']
the top 3 similar docss are 
[u'mussel', u'wine', u'almond', u'broth', u'verjuice', u'vinegar', u'leek', u'oil', u'onion', u'saffron', u'salt']
[u'broth', u'egg', u'oil', u'almond', u'onion', u'ginger', u'cinnamon', u'clove', u'saffron', u'verjuice']
[u'kidney', u'bean', u'broth', u'onion', u'oil', u'pepper', u'cinnamon', u'saffron']


## Questions: 
### (3+3+4 points)
Based on your experiments, answer:
   * What was the word you chose, and what were the top 5 words similar to it? 
   * Are you satisfied with the similar words you got for the word you chose? Explain. 
   
   
   * What was the document you chose, and what were the top 3 documents similar to it? 
   * Are you satisfied with the similar documents you got for the word you chose? Explain. 
   
   
   * Do you think changing the lower_feature_dimension will affect the results?
   * Experiment and explain your answer.

### Your Answers Here
The word I chose was apple, and the top five words I got similar to it were berry, trencher, skylark, bustard, and mace. Besides berry, I am not really satisfied with the four other words since none of them are fruits. I expected more fruits in the result.

The document I chose was [u'mussel', u'wine', u'broth', u'almond', u'milk', u'leek', u'pepper', u'clove', u'onion', u'oil', u'saffron', u'verjuice', u'vinegar', u'ginger', u'cinnamon'] and the top three documents similar to it are 

[u'mussel', u'wine', u'almond', u'broth', u'verjuice', u'vinegar', u'leek', u'oil', u'onion', u'saffron', u'salt']

[u'broth', u'egg', u'oil', u'almond', u'onion', u'ginger', u'cinnamon', u'clove', u'saffron', u'verjuice']

[u'kidney', u'bean', u'broth', u'onion', u'oil', u'pepper', u'cinnamon', u'saffron']

If we increase the lower_feature_dimension, our results do change, but I am not sure that it makes it a lot more clear. For example, when I set lower_feature_dimension to 300, there was parsnip and peach in the results, which seem a lot closer to apple then other results. It seems to make sense, since there are more entries to get cosine similarities of.
