# Homework 3

## Tasks (Cook/Make prediction)

Let’s split the training data (‘trainInteractions.csv.gz’) as follows:

(1) Reviews 1-400,000 for training \
(2) Reviews 400,001-500,000 for validation

1. Evaluate the performance (accuracy) of the baseline model on the validation set you have built 

In [1]:
import gzip
import random
import csv
from collections import defaultdict
from sklearn import linear_model

In [2]:
path="assignment1/trainInteractions.csv.gz"

In [3]:
def readCSV(path):
    f = gzip.open(path, 'rt')
    c = csv.reader(f)
    header = next(c)
    for l in c:
        d = dict(zip(header,l))
        yield d['user_id'],d['recipe_id'],d

In [4]:
dataset = list(readCSV(path))

In [5]:
dataset[0]

('88348277',
 '03969194',
 {'user_id': '88348277',
  'recipe_id': '03969194',
  'date': '2004-12-23',
  'rating': '5'})

In [6]:
itemsPerUser = defaultdict(set)
usersPerItem=defaultdict(set)
itemSet=set([d[1] for d in dataset])

In [7]:
for d in dataset:
    user,item = d[0], d[1]
    itemsPerUser[user].add(item)
    usersPerItem[item].add(user)

In [8]:
def build_validate_set(dataset):
    validate_set=[]
    random.seed(50)
    for d in dataset:
        positive_entry=[d[0],d[1],1]
        negative_entry_item_set=itemSet.difference(itemsPerUser[d[0]])
        random_item=random.choice(list(negative_entry_item_set))
        negative_entry=[d[0],random_item,0]
        validate_set.append(positive_entry)
        validate_set.append(negative_entry)
    return validate_set
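
Building `list(negative_entry_item_set)` for every validation row copies almost the whole item set each time, which is slow for 100,000 rows. One possible alternative, sketched here on invented toy data, is rejection sampling: draw from a fixed item list until the draw is not in the user's history. The helper name `sample_negative` is hypothetical, and because the random draws differ from those of `build_validate_set`, the accuracies reported in this notebook would not be reproduced exactly.

```python
import random

def sample_negative(user, item_list, items_per_user, rng):
    """Draw one item the user has not interacted with (rejection sampling).

    Assumes every user has at least one unseen item in item_list.
    """
    seen = items_per_user.get(user, set())
    while True:
        candidate = rng.choice(item_list)
        if candidate not in seen:
            return candidate

# Toy data (invented for illustration).
items_per_user = {"u1": {"a", "b"}, "u2": {"c"}}
item_list = ["a", "b", "c", "d"]
rng = random.Random(50)

negatives = [(u, sample_negative(u, item_list, items_per_user, rng))
             for u in ["u1", "u2", "u1"]]
print(negatives)
```

Since each user has cooked only a tiny fraction of all recipes, the loop would typically accept the first draw.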

In [9]:
def build_train_set(dataset):
    train_set=[]
    for d in dataset:
        positive_entry=[d[0],d[1]]
        train_set.append(positive_entry)
    return train_set

In [10]:
train_set=build_train_set(dataset[:400000])
validate_set=build_validate_set(dataset[400000:500000])

In [11]:
train_set[:10]

[['88348277', '03969194'],
 ['86699739', '27096427'],
 ['03425965', '44197323'],
 ['73973193', '24971400'],
 ['15215209', '60170202'],
 ['75799794', '39662395'],
 ['77745222', '88709727'],
 ['80598779', '09359141'],
 ['35769308', '83909791'],
 ['31763244', '20530585']]

In [12]:
validate_set[:10]

[['90764166', '01768679', 1],
 ['90764166', '76921392', 0],
 ['68112239', '24923981', 1],
 ['68112239', '23778772', 0],
 ['32173358', '57597698', 1],
 ['32173358', '68548852', 0],
 ['30893740', '16266088', 1],
 ['30893740', '52444283', 0],
 ['69780905', '62953151', 1],
 ['69780905', '28155581', 0]]

Baseline model

In [13]:
recipeCount = defaultdict(int)
totalCooked = 0

# for user,recipe,_ in readCSV("assignment1/trainInteractions.csv.gz"):
#   recipeCount[recipe] += 1
#   totalCooked += 1
for d in train_set:
    recipeCount[d[1]] += 1
    totalCooked += 1

mostPopular = [(recipeCount[x], x) for x in recipeCount]
mostPopular.sort()
mostPopular.reverse()

return1 = set()
count = 0
for ic, i in mostPopular:
    count += ic
    return1.add(i)
    if count > totalCooked/2: break

In [14]:
total_size=len(validate_set)

In [15]:
correct_size=0
for i in range(total_size):
    sample=validate_set[i]
    item=sample[1]
    predict=0
    if item in return1:
        predict=1
    if predict==sample[2]:
        correct_size+=1

In [16]:
accuracy=correct_size/total_size
accuracy

0.669945

2. See if you can find a better threshold and report its performance on your validation set 

In [17]:
def baseline_model_accuracy(threshold):
    # Most popular recipes covering the given fraction of training interactions
    return1 = set()
    count = 0
    for ic, i in mostPopular:
        count += ic
        return1.add(i)
        if count > totalCooked*threshold: break
    # Predict 1 for recipes in the popular set and score against validate_set
    correct_size=0
    for i in range(total_size):
        sample=validate_set[i]
        item=sample[1]
        predict=0
        if item in return1:
            predict=1
        if predict==sample[2]:
            correct_size+=1
    accuracy=correct_size/total_size
    print([accuracy,threshold])

We sweep the threshold from 0.1 to 1.0 in steps of 0.1 and check the accuracy at each value.

In [18]:
for i in range(10):
    baseline_model_accuracy((i+1)*0.1)

[0.547455, 0.1]
[0.59083, 0.2]
[0.627175, 0.30000000000000004]
[0.65384, 0.4]
[0.669945, 0.5]
[0.67576, 0.6000000000000001]
[0.66676, 0.7000000000000001]
[0.63734, 0.8]
[0.5579, 0.9]
[0.46477, 1.0]


When the threshold is 0.6*totalCooked, the accuracy improves from 0.669945 (at 0.5*totalCooked) to 0.67576, so 0.6*totalCooked is a better threshold.
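
Instead of reading the best row off the printed list, the sweep could return its accuracies and pick the maximum programmatically. The sketch below uses a hypothetical `accuracy_at` stand-in whose values are copied from the sweep output above; in the notebook it would be a variant of `baseline_model_accuracy` that returns the accuracy rather than printing it.

```python
# Accuracies copied from the sweep output above, keyed by threshold.
sweep = {
    0.1: 0.547455, 0.2: 0.59083, 0.3: 0.627175, 0.4: 0.65384,
    0.5: 0.669945, 0.6: 0.67576, 0.7: 0.66676, 0.8: 0.63734,
    0.9: 0.5579, 1.0: 0.46477,
}

def accuracy_at(threshold):
    """Hypothetical stand-in for a baseline evaluation that returns accuracy."""
    return sweep[threshold]

# Pick the threshold with the highest validation accuracy.
best_threshold = max(sweep, key=accuracy_at)
print(best_threshold, accuracy_at(best_threshold))
```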

3. A stronger baseline than the one provided might make use of the Jaccard similarity (or another similarity metric). Given a pair (u,g) in the validation set, consider all training items g′ that user u has cooked. For each, compute the Jaccard similarity between g and g′, i.e., between the users (in the training set) who have made g and the users who have made g′. Predict as ‘made’ if the maximum of these Jaccard similarities exceeds a threshold (you may choose the threshold that works best). Report the performance on your validation set (1 mark).
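
As a small illustration of the rule described above, here is a sketch on invented data: for a candidate pair (u, g), compute the Jaccard similarity between the users of g and the users of each recipe that u has cooked, then compare the maximum with a threshold. The toy dictionaries, the IDs, and the threshold 0.3 are assumptions made only for this example.

```python
def jaccard(s1, s2):
    """Size of the intersection over size of the union; 0 if both sets are empty."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# Toy training data (invented): which users made each recipe.
users_per_item = {
    "g1": {"u1", "u2", "u3"},
    "g2": {"u2", "u3"},
    "g3": {"u4"},
}
# Which recipes each user has cooked in the toy training data.
items_per_user = {"u1": {"g1"}, "u4": {"g3"}}

def predict_made(user, item, threshold):
    """Predict 1 if the most similar recipe in the user's history exceeds the threshold."""
    sims = [jaccard(users_per_item.get(item, set()), users_per_item[g])
            for g in items_per_user.get(user, set())]
    return int(max(sims, default=0.0) > threshold)

print(predict_made("u1", "g2", 0.3))  # g2 vs g1: 2 shared users out of 3
print(predict_made("u4", "g2", 0.3))  # g2 vs g3: no shared users
```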

We need to rebuild `itemsPerUser` and `usersPerItem` using only the training set:

In [19]:
itemsPerUser = defaultdict(set)
usersPerItem=defaultdict(set)
for d in train_set:
    user,item = d[0], d[1]
    itemsPerUser[user].add(item)
    usersPerItem[item].add(user)

In [20]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    if denom == 0:
        return 0
    return numer / denom

In [21]:
def predictUsingJaccard(item,user,t):
    predict=0
    maxSim=0
    for d in itemsPerUser[user]:
        sim=Jaccard(usersPerItem[d],usersPerItem[item])
        maxSim=max(maxSim,sim)
    if maxSim>t:
        predict=1
    return predict

In [22]:
def jaccard_model_accuracy(t):
    total_size=len(validate_set)
    correct_size=0
    for i in range(total_size):
        sample=validate_set[i]
        item=sample[1]
        user=sample[0]
        predict=predictUsingJaccard(item,user,t)
        if predict==sample[2]:
            correct_size+=1
    accuracy=correct_size/total_size
    print([accuracy,t])

We sweep t from 0.1 to 1.0 in steps of 0.1 and check the accuracy at each value.

In [23]:
for i in range(10):
    jaccard_model_accuracy((i+1)*0.1)

[0.49967, 0.1]
[0.493395, 0.2]
[0.491505, 0.30000000000000004]
[0.494635, 0.4]
[0.50011, 0.5]
[0.500085, 0.6000000000000001]
[0.500005, 0.7000000000000001]
[0.5, 0.8]
[0.5, 0.9]
[0.5, 1.0]


Accuracy stays at about 0.5 for every t of 0.1 or more, so we next sweep t from 0.01 to 0.1.

In [24]:
for i in range(10):
    jaccard_model_accuracy((i+1)*0.01)

[0.593985, 0.01]
[0.584725, 0.02]
[0.561155, 0.03]
[0.535635, 0.04]
[0.51912, 0.05]
[0.510805, 0.06]
[0.50634, 0.07]
[0.503075, 0.08]
[0.50168, 0.09]
[0.49967, 0.1]


The best accuracy so far is at t = 0.01, so we then sweep t from 0.001 to 0.01.

In [25]:
for i in range(10):
    jaccard_model_accuracy((i+1)*0.001)

[0.588305, 0.001]
[0.58885, 0.002]
[0.590135, 0.003]
[0.591505, 0.004]
[0.59221, 0.005]
[0.592555, 0.006]
[0.593085, 0.007]
[0.59371, 0.008]
[0.59386, 0.009000000000000001]
[0.593985, 0.01]


The accuracy is highest at t = 0.01 among the values tested, so we use 0.01 as our threshold; the accuracy on the validation set is 0.593985.

4. Improve the above predictor by incorporating both a Jaccard-based threshold and a popularity based threshold. Report the performance on your validation set.

In [26]:
def jaccardPopularityModel(train_set, test_set, jt=0.01, pt=0.6):
    itemsPerUser = defaultdict(set)
    usersPerItem=defaultdict(set)
    for d in train_set:
        user,item = d[0], d[1]
        itemsPerUser[user].add(item)
        usersPerItem[item].add(user)
    
    itemSet=set([d[1] for d in train_set])
    userSet=set([d[0] for d in train_set])
    
    # calculate average number of recipes made in the train_set
    averageNum=len(train_set)/len(userSet)

    # calculate most popular set in train_set
    recipeCount = defaultdict(int)
    totalCooked = 0
    for d in train_set:
        recipeCount[d[1]] += 1
        totalCooked += 1

    mostPopular = [(recipeCount[x], x) for x in recipeCount]
    mostPopular.sort()
    mostPopular.reverse()

    return1 = set()
    count = 0
    for ic, i in mostPopular:
        count += ic
        return1.add(i)
        if count > totalCooked*pt: break

    # evaluate on test_set
    total_size=len(test_set)
    correct_size=0
    for i in range(total_size):
        sample=test_set[i]
        item=sample[1]
        user=sample[0]
        predict=0
        
         
        maxSim=0
        for d in itemsPerUser[user]:
            sim=Jaccard(usersPerItem[d],usersPerItem[item])
            maxSim=max(maxSim,sim)
        if maxSim>jt or item in return1:
            predict=1
                
        if predict==sample[2]:
            correct_size+=1
    accuracy=correct_size/total_size
    print([accuracy,jt,pt])

Because the combined model predicts 'made' when either condition holds, the Jaccard threshold needs to be changed: we set the similarity threshold to 0.6 and keep the popularity threshold at 0.6.

In [27]:
jaccardPopularityModel(train_set,validate_set,0.6,0.6)

[0.675845, 0.6, 0.6]
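
The two thresholds above were set by hand. A small grid search over both is one way to tune them jointly; the sketch below runs on invented toy samples rather than the real validation set. Each sample is `(max_sim, share_before, label)`, where `share_before` is a hypothetical precomputed value: the share of training interactions covered by strictly more popular recipes. A recipe is in the popular set exactly when `share_before <= pt`, which mirrors the loop that builds `return1`.

```python
from itertools import product

def combined_accuracy(samples, jt, pt):
    """Accuracy of the rule: predict 1 if max_sim > jt or the recipe is popular."""
    correct = 0
    for max_sim, share_before, label in samples:
        predict = 1 if (max_sim > jt or share_before <= pt) else 0
        correct += (predict == label)
    return correct / len(samples)

# Toy samples (invented): (max Jaccard similarity, share_before, true label).
samples = [
    (0.00, 0.10, 1),
    (0.02, 0.70, 1),
    (0.00, 0.90, 0),
    (0.005, 0.65, 0),
    (0.00, 0.40, 1),
    (0.00, 0.95, 0),
]

# Try every (jt, pt) pair and keep the one with the highest accuracy.
grid = product([0.001, 0.01, 0.1], [0.3, 0.6, 0.9])
best_jt, best_pt = max(grid, key=lambda p: combined_accuracy(samples, *p))
print(best_jt, best_pt, combined_accuracy(samples, best_jt, best_pt))
```

On the real data, `max_sim` and `share_before` could be computed once per validation row so that each grid point costs only a cheap pass over the rows.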


Improvement over question 2:

In [28]:
0.675845/0.67576-1

0.00012578430211918068

Therefore, the accuracy of the model combining the Jaccard and popularity thresholds on the validation set is 0.675845, a relative improvement of about 0.0126% over question 2.

5. To run our model on the test set, we’ll have to use the file ‘stub_Made.txt’ to find the user id/recipe id pairs about which we have to make predictions. Using that data, run the above model and upload your solution to Kaggle.

For the Kaggle submission, we can use the whole dataset as our train_set:

In [29]:
train_set=build_train_set(dataset)

We modify the function so that it writes its predictions to a file for upload to Kaggle:

In [31]:
def jaccardPopularityModel(train_set, jt=0.01, pt=0.6):
    print("predicting....")
    itemsPerUser = defaultdict(set)
    usersPerItem=defaultdict(set)
    for d in train_set:
        user,item = d[0], d[1]
        itemsPerUser[user].add(item)
        usersPerItem[item].add(user)
    
    itemSet=set([d[1] for d in train_set])
    userSet=set([d[0] for d in train_set])
    
    # calculate average number of recipes made in the train_set
    averageNum=len(train_set)/len(userSet)

    # calculate most popular set in train_set
    recipeCount = defaultdict(int)
    totalCooked = 0
    for d in train_set:
        recipeCount[d[1]] += 1
        totalCooked += 1

    mostPopular = [(recipeCount[x], x) for x in recipeCount]
    mostPopular.sort()
    mostPopular.reverse()

    return1 = set()
    count = 0
    for ic, i in mostPopular:
        count += ic
        return1.add(i)
        if count > totalCooked*pt: break

    predictions = open("predictions_Made.txt", 'w')
    for l in open("stub_Made.txt"):
        if l.startswith("user_id"):
            predictions.write(l)
            continue
        user,item = l.strip().split('-')
        predict=0
        
        maxSim=0
        for d in itemsPerUser[user]:
            sim=Jaccard(usersPerItem[d],usersPerItem[item])
            maxSim=max(maxSim,sim)
        if maxSim>jt or item in return1:
            predict=1
                
        if predict==1:
            predictions.write(user + '-' + item + ",1\n")
        else:
            predictions.write(user + '-' + item + ",0\n")
    predictions.close()
    print("predicting finished!")

In [32]:
train_set=build_train_set(dataset)

In [33]:
jaccardPopularityModel(train_set,0.6,0.6)

predicting....
predicting finished!
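
The prediction file pairs each `user_id-recipe_id` key from the stub with a 0/1 label. A minimal sketch of that format follows, using in-memory text streams so that it runs without the Kaggle files. The function `write_predictions`, the toy `predict` rule, and the stub contents (including the exact header text) are hypothetical and only illustrate the layout.

```python
import io

def write_predictions(stub, out, predict):
    """Copy the header line, then write 'user-item,label' for every stub row."""
    for line in stub:
        if line.startswith("user_id"):
            out.write(line)
            continue
        user, item = line.strip().split("-")
        out.write(f"{user}-{item},{predict(user, item)}\n")

# Toy stub contents (invented); the real stub is read from stub_Made.txt.
stub = io.StringIO("user_id-recipe_id,prediction\n111-aaa\n222-bbb\n")
out = io.StringIO()

# Toy rule: predict 1 only for recipe 'aaa'.
write_predictions(stub, out, lambda user, item: int(item == "aaa"))
print(out.getvalue())
```

With real files, opening both inside a `with` statement would close them even if an error occurs partway through.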


The name on Kaggle is Ethannewbee

## Tasks (Rating prediction)

9. Fit a predictor of the form
$$
\text{rating}(\text{user}, \text{item}) \simeq \alpha + \beta_{\text{user}} + \beta_{\text{item}}
$$
by fitting the mean and the two bias terms as described in the lecture notes. Use a regularization
parameter of $\lambda = 1$. Report the MSE on the validation set.

In [59]:
import gzip
import csv
from collections import defaultdict
import scipy
import scipy.optimize
import numpy

In [60]:
path="assignment1/trainInteractions.csv.gz"

In [61]:
def readCSV(path):
    f = gzip.open(path, 'rt')
    c = csv.reader(f)
    header = next(c)
    for l in c:
        d = dict(zip(header,l))
        yield d

In [62]:
dataset = list(readCSV(path))

In [63]:
dataset[0]

{'user_id': '88348277',
 'recipe_id': '03969194',
 'date': '2004-12-23',
 'rating': '5'}

In [64]:
train_set=dataset[:400000]
validate_set=dataset[400000:500000]

In [65]:
labels = [int(d['rating']) for d in train_set]

In [66]:
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)
# reviewsPerUser = defaultdict(set)
# reviewsPerItem = defaultdict(set)

In [67]:
for d in train_set:
    user,item = d['user_id'], d['recipe_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)

In [68]:
ratingMean = sum([int(d['rating']) for d in train_set]) / len(train_set)
ratingMean

4.5808

In [69]:
N = len(train_set)
nUsers = len(reviewsPerUser)
nItems = len(reviewsPerItem)
users = list(reviewsPerUser.keys())
items = list(reviewsPerItem.keys())

In [70]:
alpha = ratingMean

In [71]:
userBiases = defaultdict(float)
itemBiases = defaultdict(float)

In [72]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

In [73]:
def prediction(user, item):
    return alpha + userBiases[user] + itemBiases[item]

In [74]:
def prediction_(user, item):
    if user in userBiases and item in itemBiases:
        pred=alpha + userBiases[user] + itemBiases[item]
    elif user in userBiases:
        pred=alpha + userBiases[user]
    elif item in itemBiases:
        pred=alpha + itemBiases[item]
    else:
        pred=alpha
    if pred<0:
        return 0
    if pred>5.0:
        return 5.0
#     if 5-pred<0.2:
#         return round(pred)
    return pred

In [75]:
def unpack(theta):
    global alpha
    global userBiases
    global itemBiases
    alpha = theta[0]
    userBiases = dict(zip(users, theta[1:nUsers+1]))
    itemBiases = dict(zip(items, theta[1+nUsers:]))

In [76]:
def cost(theta, labels, lamb):
    unpack(theta)
    predictions = [prediction(d['user_id'], d['recipe_id']) for d in train_set]
    cost = MSE(predictions, labels)
    print("MSE = " + str(cost))
    for u in userBiases:
        cost += lamb*userBiases[u]**2
    for i in itemBiases:
        cost += lamb*itemBiases[i]**2
    return cost

In [77]:
def derivative(theta, labels, lamb):
    unpack(theta)
    N = len(train_set)
    dalpha = 0
    dUserBiases = defaultdict(float)
    dItemBiases = defaultdict(float)
    for d in train_set:
        u,i = d['user_id'], d['recipe_id']
        pred = prediction(u, i)
        diff = pred - int(d['rating'])
        dalpha += 2/N*diff
        dUserBiases[u] += 2/N*diff
        dItemBiases[i] += 2/N*diff
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
    for i in itemBiases:
        dItemBiases[i] += 2*lamb*itemBiases[i]
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
    return numpy.array(dtheta)
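
Read off from the `cost` and `derivative` functions above, the objective being minimized and its gradient can be written as follows. Here $N$ is the number of training ratings and $R_{u,i}$ is the observed rating; $I_u$ (recipes rated by user $u$) and $U_i$ (users who rated recipe $i$) are notation introduced only for this summary.

$$
\begin{aligned}
J(\alpha, \beta) &= \frac{1}{N} \sum_{(u,i)} \left(\alpha + \beta_u + \beta_i - R_{u,i}\right)^2 + \lambda \Big( \sum_u \beta_u^2 + \sum_i \beta_i^2 \Big) \\
\frac{\partial J}{\partial \alpha} &= \frac{2}{N} \sum_{(u,i)} \left(\alpha + \beta_u + \beta_i - R_{u,i}\right) \\
\frac{\partial J}{\partial \beta_u} &= \frac{2}{N} \sum_{i \in I_u} \left(\alpha + \beta_u + \beta_i - R_{u,i}\right) + 2 \lambda \beta_u \\
\frac{\partial J}{\partial \beta_i} &= \frac{2}{N} \sum_{u \in U_i} \left(\alpha + \beta_u + \beta_i - R_{u,i}\right) + 2 \lambda \beta_i
\end{aligned}
$$

Because the error term is averaged over $N$ while the penalty is not, even a moderate $\lambda$ weighs heavily against non-zero biases in this formulation.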

Here we set $\lambda = 1$:

In [78]:
lamb=1

In [79]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(nUsers+nItems),
                             derivative, args = (labels, lamb))

MSE = 0.8987313599958769
MSE = 0.8856358581692618
MSE = 0.8985952813610849
MSE = 0.8985952329948594


(array([ 4.58067734e+00, -8.60369441e-05, -8.06941551e-06, ...,
        -1.45226856e-06,  1.04576614e-06, -1.45209452e-06]),
 0.8986631878143104,
 {'grad': array([ 5.03931794e-07, -2.16526137e-07, -9.07182253e-09, ...,
         -1.25235739e-09, -1.45141735e-10, -1.25490187e-09]),
  'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
  'funcalls': 4,
  'nit': 2,
  'warnflag': 0})
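
One way to sanity-check an analytic gradient before handing it to L-BFGS is to compare it with central finite differences. The sketch below does this for a toy version of the same bias model with 2 users and 2 items. The names `toy_cost`, `toy_gradient` and `numeric_gradient`, the data, and the starting point are all invented for this example and are not part of the notebook.

```python
def toy_cost(theta, data, lamb):
    """Mean squared error plus an L2 penalty on every bias term (not on alpha)."""
    alpha, bu, bi = theta[0], theta[1:3], theta[3:5]
    mse = sum((alpha + bu[u] + bi[i] - r) ** 2 for u, i, r in data) / len(data)
    return mse + lamb * sum(b * b for b in theta[1:])

def toy_gradient(theta, data, lamb):
    """Analytic gradient with respect to [alpha, user biases, item biases]."""
    alpha, bu, bi = theta[0], theta[1:3], theta[3:5]
    grad = [0.0] * len(theta)
    n = len(data)
    for u, i, r in data:
        diff = alpha + bu[u] + bi[i] - r
        grad[0] += 2 / n * diff
        grad[1 + u] += 2 / n * diff
        grad[3 + i] += 2 / n * diff
    for k in range(1, len(theta)):
        grad[k] += 2 * lamb * theta[k]
    return grad

def numeric_gradient(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Toy data (invented): (user index, item index, rating).
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 1)]
theta = [3.0, 0.2, -0.1, 0.3, -0.4]
lamb = 1.0

analytic = toy_gradient(theta, data, lamb)
numeric = numeric_gradient(lambda t: toy_cost(t, data, lamb), theta)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap)
```

A large gap between the two gradients would point to a bug in the analytic derivative; a gap near floating-point noise suggests they agree.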

After training on the training set, we evaluate the result on the validation set:

In [83]:
predictions = []
validate_truth=[int(d['rating']) for d in validate_set]
for d in validate_set:
    user=d['user_id']
    item=d['recipe_id']
    predictions.append(prediction_(user, item))

MSE(predictions, validate_truth)

0.9094108423649899

Therefore, the MSE on the validation set is 0.9094108423649899

10. Report the user and recipe IDs that have the largest and smallest values of $\beta$

In [84]:
[max(userBiases, key=userBiases.get),max(userBiases.values())]


['32445558', 0.003670108443486672]

The user ID with the largest $\beta$ is 32445558

In [85]:
[max(itemBiases, key=itemBiases.get),max(itemBiases.values())]

['98124873', 0.00020946436381658414]

The item ID with the largest $\beta$ is 98124873

In [86]:
[min(userBiases, key=userBiases.get),min(userBiases.values())]

['70705426', -0.0012946275802599271]

The user ID with the smallest $\beta$ is 70705426

In [87]:
[min(itemBiases, key=itemBiases.get),min(itemBiases.values())]

['29147042', -0.0002853192176457722]

The item ID with the smallest $\beta$ is 29147042

11. Find a better value of $\lambda$ using your validation set. Report the value you chose, its MSE, and upload your solution to Kaggle by running it on the test data

In [90]:
def validation_mse():
    predictions = []
    validate_truth=[int(d['rating']) for d in validate_set]
    for d in validate_set:
        user=d['user_id']
        item=d['recipe_id']
        predictions.append(prediction_(user, item))

    print(MSE(predictions, validate_truth))

In [103]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(nUsers+nItems),
                             derivative, args = (labels, 10**(-5)), maxiter=6)

MSE = 0.909671527164235
MSE = 1.6942491620860933
MSE = 0.8985688354549487
MSE = 0.898432691370945
MSE = 0.8978924902631409
MSE = 0.8958016894804687
MSE = 0.8885585447239309
MSE = 0.8606412718262736
MSE = 0.8426554564331113
MSE = 0.8254599045185007
MSE = 0.8069106284337644


(array([ 4.48131120e+00, -1.09144661e-01, -2.91699317e-02, ...,
        -5.60199519e-03,  2.76760356e-03, -4.98369526e-03]),
 0.8094759995950068,
 {'grad': array([-2.00041981e-04, -1.27467781e-04,  9.36242455e-06, ...,
          2.27880848e-06, -8.66824184e-07,  6.94887278e-07]),
  'task': b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT',
  'funcalls': 11,
  'nit': 6,
  'warnflag': 1})

In [104]:
validation_mse()

0.8470583195829219


Improvement over question 9:

In [105]:
0.9094108423649899/0.8470583195829219-1

0.073610660966968

With $\lambda = 10^{-5}$ and L-BFGS stopped after 6 iterations, the validation MSE is 0.8470583195829219, about 7% lower than in question 9.
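
A more systematic search would fit the model for several values of $\lambda$ and keep the one with the lowest validation MSE. The sketch below shows that loop on invented toy data. Note the swapped-in technique: instead of L-BFGS it uses alternating closed-form updates, which minimize the sum of squared errors plus $\lambda \sum \beta^2$. That objective is scaled differently from the notebook's `cost` (which averages the error), so the $\lambda$ values here are not comparable with `lamb` above. The names `fit_biases` and `mse` are hypothetical.

```python
from collections import defaultdict

def fit_biases(train, lamb, n_iter=50):
    """Fit rating ~ alpha + beta_user + beta_item by alternating closed-form updates."""
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in train:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    alpha = sum(r for _, _, r in train) / len(train)
    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}
    for _ in range(n_iter):
        alpha = sum(r - bu[u] - bi[i] for u, i, r in train) / len(train)
        for u, rows in by_user.items():
            bu[u] = sum(r - alpha - bi[i] for i, r in rows) / (lamb + len(rows))
        for i, rows in by_item.items():
            bi[i] = sum(r - alpha - bu[u] for u, r in rows) / (lamb + len(rows))
    return alpha, bu, bi

def mse(model, rows):
    """Mean squared error; unseen users or items get a bias of 0."""
    alpha, bu, bi = model
    errors = [(alpha + bu.get(u, 0.0) + bi.get(i, 0.0) - r) ** 2 for u, i, r in rows]
    return sum(errors) / len(errors)

# Toy data (invented for illustration): (user, item, rating).
train = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 3), ("u2", "c", 2), ("u3", "b", 5)]
valid = [("u1", "c", 4), ("u2", "b", 3), ("u4", "a", 4)]

# Fit once per candidate lambda and keep the one with the lowest validation MSE.
lambdas = [0.01, 0.1, 1.0, 10.0]
results = {lamb: mse(fit_biases(train, lamb), valid) for lamb in lambdas}
best_lamb = min(results, key=results.get)
print(best_lamb, results[best_lamb])
```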

In [106]:
predictions = open("predictions_Rated.txt", 'w')
for l in open("stub_Rated.txt"):
    if l.startswith("user_id"):
        # header
        predictions.write(l)
        continue
    u,i = l.strip().split('-')
    predictions.write(u + '-' + i + ',' + str(prediction_(u, i)) + '\n')

predictions.close()