#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2019


# Homework 2:   word2vec + SVM + Evaluation

### 100 points [6% of your final grade]

### Due: Tuesday, February 26, 2019 by 11:59pm

*Goals of this homework:* Understand word2vec-like term embeddings,  explore real-world challenges with SVM-based classifiers, understand and implement several evaluation metrics.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw2.ipynb`. For example, my homework submission would be something like `555001234_hw2.ipynb`. Submit this notebook via eCampus (look for the homework 2 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1: Term embeddings + SVM (80 points)

### Dataset


For this homework, we will still play with Yelp reviews from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge). As in Homework 1, you'll see that each line corresponds to a review on a particular business. Each review has a unique "ID" and the text content is in the "review" field. Additionally, this time, we also offer you the "label". If `label=1`, it means that this review is `Food-relevant`. If `label=0`, it means that this review is `Food-irrelevant`. Similarly, we have already done some basic preprocessing on the reviews, so you can just tokenize each review using whitespace.

There are about 40,000 reviews in total, in which about 20,000 reviews are "Food-irrelevant". We split the review data into two sets. *review_train.json* is the training set. *review_test.json* is the testing set. 

In [2]:
import json
import numpy as np
from sklearn.svm import SVC
import gensim
import dill
import math




In [5]:
dill.dump_session('notebook_env.db')

In [6]:
dill.load_session('notebook_env.db')

In [8]:
# Please load the dataset
# Your code below

all_train_data = []
all_test_data = []
with open("review_train.json") as f:
    for line in f:
        all_train_data.append(json.loads(line))
        
with open("review_test.json") as f:
    for line in f:
        all_test_data.append(json.loads(line))



###  Pre-trained term embeddings

To save your time, you can make use of  pre-trained term embeddings. In this homework, we are using one of the great pre-trained models from [GloVe](https://nlp.stanford.edu/projects/glove/) based on 2 billion tweets. GloVe is quite similar to word2vec. Unzip the *glove.6B.50d.txt.zip* file and run the code below. You will be able to load the term embeddings model, with which each word can be represented with a 50-dimension vector.

In [9]:
# reload the pre-trained term embeddings

with open("glove.6B.50d.txt", "rb") as lines:
    model = {line.split()[0].decode(encoding="utf-8", errors="strict"): np.array(list(map(float, line.split()[1:])))
           for line in lines}
    #print(shape(model.get('the')))

Now, you have a vector representation for each word. First, we use the simple (arithmetic) **mean** of these vectors of words in a review to represent the review. *Note: Just ignore those words which are not in the corpus of this pre-trained model.*

In [10]:
# Please figure out the vector representation for each review in the training data and testing data.
# Your code below
# get the vector for each review in trainning data and correponding labels
def vector_for_review(review, model):
    res = np.zeros(50)
    count = 0;
    for word in review:
        if word in model:
            res += model.get(word)
            count += 1;
    res /= count;
    return res;

train_data_vectors = np.zeros((len(all_train_data), 50))
train_data_labels = np.zeros(len(all_train_data))
train_data_reviews = []
i = 0
for train_data in all_train_data:
    review = train_data.get("review").split()
    train_data["review_score"] = vector_for_review(review, model)
    train_data_vectors[i] = train_data["review_score"]
    train_data_labels[i] = train_data["label"]
    train_data_reviews.append(train_data.get("review"))
    i += 1


In [11]:
# get the vector for each review in test data and correponding labels
test_data_vectors = np.zeros((len(all_test_data), 50))
test_data_labels = np.zeros(len(all_test_data))
test_data_reviews = []
i = 0
for test_data in all_test_data:
    review = test_data.get("review").split()
    test_data["review_score"] = vector_for_review(review, model)
    test_data_vectors[i] = test_data["review_score"]
    test_data_labels[i] = test_data["label"]
    test_data_reviews.append(test_data.get("review"))
    i += 1

### SVM

With the vector representations you get for each review, please train an SVM model to predict whether a given review is food-relevant or not. **You do not need to implement any classifier from scratch. You may use scikit-learn's built-in capabilities.** You can only train your model with reviews in *review_train.json*.

In [12]:
# SVM model training
# Your code here
# train the classifier using SVM
clf = SVC(gamma="scale")
clf.fit(train_data_vectors, train_data_labels)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Your goal is to predict whether a given review is food-relevant or not. Please report the overall accuracy, precision and recall of your model on the **testing data**. You should **implement the functions for accuracy, precision, and recall**.

In [13]:
# Your code here
def getParameters(test_labels, test_result):
    TP = TN = FP = FN = 0
    for i in range(len(test_labels)):
        if test_labels[i] == 0 and test_result[i] == 0:
            TN += 1
        elif test_labels[i] == 0 and test_result[i] == 1:
            FP += 1
        elif test_labels[i] == 1 and test_result[i] == 1:
            TP += 1
        elif test_labels[i] == 1 and test_result[i] == 0:
            FN += 1
    return TP, TN, FP, FN
            
    
    
def getResult (TP, TN, FP, FN):
    accurancy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return accurancy, precision, recall
    
    
test_result = clf.predict(test_data_vectors)
TP, TN, FP, FN = getParameters(test_result, test_data_labels)
accurancy, precision, recall = getResult(TP, TN, FP, FN)



In [14]:
print(accurancy)
print(TP, TN, FP, FN)  #Accuracy = TP+TN/TP+FP+FN+TN
print(len(test_result))
print((TP + TN) / (TP + FP + FN + TN))
print(accurancy, precision, recall)




0.9033557046979865
5549 5219 396 756
11920
0.9033557046979865
0.9033557046979865 0.9333894028595459 0.8800951625693894


### Document-based embeddings

Instead of taking the mean of term embeddings, you can directly train a **doc2vec** model for paragraph or document embeddings. You can refer to the paper [Distributed Representations of Sentences and Documents](https://arxiv.org/pdf/1405.4053v2.pdf) for more details. And in this homework, you can make use of the implementation in [gensim](https://radimrehurek.com/gensim/models/doc2vec.html).

Now, you need to:
* Train a doc2vec model based on all reviews you have (training + testing sets).
* Use the embeddings from your doc2vec model to represent each review and train a new SVM model.
* Report the overall accuracy, precision and recall of your model on the testing data.

In [29]:
# Train a doc2vec
# Your code here


def read_corpus(reviews, tags):
    for i in range(len(reviews)):
        yield gensim.models.doc2vec.TaggedDocument((reviews[i]), [tags[i]])



train_corpus = list(read_corpus(train_data_reviews, train_data_labels))
test_corpus = list(read_corpus(test_data_reviews, test_data_labels))
all_corpus = train_corpus + test_corpus




In [30]:
doc2vec_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=1, epochs=40)
doc2vec_model.build_vocab(all_corpus)


In [33]:
doc2vec_model.infer_vector(["delicious", "food"])

array([ 0.0076412 ,  0.00634679,  0.00108604,  0.00443693, -0.0062633 ,
        0.00056938, -0.00942396, -0.00849839, -0.00479078,  0.00224439,
       -0.00723876,  0.00448345, -0.00461784, -0.00642925, -0.00495205,
        0.00603837, -0.00946594, -0.00454503,  0.00178015,  0.00866581,
       -0.00844645, -0.0084723 ,  0.00258921, -0.0052261 ,  0.00852515,
       -0.0009018 , -0.00698526,  0.00124907, -0.00395198, -0.00963363,
        0.00148812,  0.00116596, -0.00644631, -0.00954802,  0.00512538,
       -0.00781566,  0.00452334,  0.00909967, -0.00431494, -0.00793551,
       -0.0006508 ,  0.00630571, -0.00389133, -0.00838595, -0.00672794,
       -0.00629247, -0.00791844,  0.00729725,  0.0024317 ,  0.00067204],
      dtype=float32)

In [50]:

fname = gensim.test.utils.get_tmpfile("my_doc2vec_model")
doc2vec_model.save(fname)
doc2vec_model = Doc2Vec.load(fname)  

In [34]:
def get_vector (datas, model):
    vec = np.zeros((len(datas), 50))
    i = 0
    for data in datas:
        review = data.get("review").split()
        data["doc2vec_review_score"] = model.infer_vector(review)
        vec[i] = data["doc2vec_review_score"]
        i += 1
    return vec
    

In [36]:
# Train a SVM
# Your code here
train_data_doc2vec_vec = get_vector(all_train_data, doc2vec_model)
clf_doc2vec = SVC(gamma="scale")
clf_doc2vec.fit(train_data_doc2vec_vec, train_data_labels)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [38]:
test_data_doc2vec_vec = get_vector(all_test_data, doc2vec_model)
test_doc2vec_result = clf.predict(test_data_doc2vec_vec)
TP, TN, FP, FN = getParameters(test_doc2vec_result, test_data_labels)
accurancy, precision, recall = getResult(TP, TN, FP, FN)

In [39]:
# Report the performance
# Your code here
print(TP, TN, FP, FN)
print(accurancy, precision, recall)


5945 0 0 5975
0.498741610738255 1.0 0.498741610738255


What do you observe? How different are your results for the term-based average approach vs. the doc2vec approach? Why do you think this is?

*provide a brief (1-2 paragraph) discussion based on these questions.*

### Can you do better?

Finally, see if you can do better than either the word- or doc- based embeddings approach for classification. You may explore new features, new classifiers, etc. Whatever you like. Just provide your code and a justification.

In [None]:
# your code here

# Part 2: NDCG (20 points)

You calculated the recall and precision in Part 1 and now you get a chance to implement NDCG. 

Assume that Amy searches for "food-relevant" reviews in the **testing set** on two search engines `A` and `B`. Since the ground-truth labels for the reviews are unknown to A and B, they need to make a prediction for each review and then return a ranked list of results based on their probabilities. The results from A are in *search_result_A.json*, and the results from B are in *search_result_B.json*. Each line contains the id of a review and its corresponding ranking.

You can check their labels in *review_test.json* while calculating the NDCG scores. If a review is "food-relevant", the relevance score is 1. Otherwise, the relevance score is 0.

In [19]:
search_result_A = []
search_result_B = []
with open("search_result_A.json") as f:
    for line in f:
        search_result_A.append(json.loads(line))

with open("search_result_B.json") as f:
    for line in f:
        search_result_B.append(json.loads(line))



In [20]:
def DCG_score(rel, i):
    # calculate the DCG Score, given rel, i
    return rel / (math.log2(i + 1));

def id_rel(datas):
    # mapping the id with label
    id_rel = {}
    for data in datas:
        id_rel[data.get("id")] = data.get("label")
    return id_rel


In [21]:
test_id2rel = id_rel(all_test_data)

In [26]:
for result in search_result_A:
    cur_id = result.get("id")
    rel = test_id2rel.get(cur_id)
    
    

1
1
1
0
1
1
1
1
0
0
0
1
0
1
1
0
0
1
1
1
1
1
1
1
1
0
0
1
1
1
0
1
0
0
1
1
0
1
1
1
0
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
0
1
1
1
0
1
1
0
1
1
1
1
1
1
1
0
1
1
1
0
1
0
0
1
1
0
0
0
1
1
1
1
1
1
1
0
1
0
1
1
1
1
0
0
0
1
1
1
1
1
1
0
1
1
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
0
0
1
1
1
1
1
0
1
1
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
0
1
0
1
0
1
1
1
1
1
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
0
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
0
1
1
0
1
1
1
1
1
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
1
1
1
1
1
1
0
0
1
0
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
0
1
1
0
0
1
1
1
1
1
0
0
0
1
1
1
1
1
1
1
1
0
1
0
1
0
1
1
1
1
1
1
0
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
0
1
1
1
0
0
1
1
1
1
1
1
1
0
1
1
1
0
1
0
1
1
1
1
1
1
1
1
1


0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
1
1
1
0
0
0
0
0
0
0
0
1
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
1
1
1
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
0
0
0
0
1
1
1
0
0
0
0
0
0
0
0
0
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
0
1
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
1
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
1
1
1
0
0
0
1
0
0
1
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
1
1
0
0
0


In [40]:
# NDCG for search_result_A.json
# Your code here
DCG_A = 0
for result in search_result_A:
    cur_id = result.get("id")
    rel = test_id2rel.get(cur_id)
    DCG_A += DCG_score(rel, result.get("rank"))

IDCG_A = 0
new_rank = 1
for result in search_result_A:
    cur_id = result.get("id")
    if test_id2rel.get(cur_id) == 1:
        IDCG_A += DCG_score(1, new_rank)
        new_rank += 1

NDCG_A = DCG_A / IDCG_A
print(DCG_A, IDCG_A, NDCG_A)



505.3979278728301 521.4752244965487 0.96916958684041


In [41]:
# NDCG for search_result_B.json
# Your code here
DCG_B = 0
for result in search_result_B:
    cur_id = result.get("id")
    rel = test_id2rel.get(cur_id)
    DCG_B += DCG_score(rel, result.get("rank"))

IDCG_B = 0
new_rank = 1
for result in search_result_B:
    cur_id = result.get("id")
    if test_id2rel.get(cur_id) == 1:
        IDCG_B += DCG_score(1, new_rank)
        new_rank += 1

NDCG_B = DCG_B / IDCG_B
print(DCG_B, IDCG_B, NDCG_B)


121.34577387849328 121.68560167059829 0.9972073294831962


## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*