### In this notebook, we shall use a modified version of the item-based Recommender algorithm from Ch.14 of Machine Learning in Action and use it on joke ratings data based on Jester Online joke Recommender System.
### The data set contains two files. The file "modified_jester_data.csv" contains the ratings on 100 jokes by 1000 users (each row is a user profile). The ratings have been normalized to be between 1 and 21 (a 20-point scale), with 1 being the lowest rating. A zero indicated a missing rating.

In [1]:
# Load the necessary packages.
import numpy as np
from numpy import linalg as la # this is the package that doesthe factorization for us 
import pandas as pd
from sklearn import decomposition

## a. Load the joke ratings data and joke test data into the appropriate data structures.

In [2]:
def load_jokes(file):
    jokes = np.genfromtxt(file, delimiter=',', dtype=str)
    jokes = np.array(jokes[:,1])
    return jokes

In [3]:
fileName = r'../../data/jokes.csv'
jokes = load_jokes(fileName)
jokes.shape

(100,)

<p>Since we don't need the index, the first column needs to be ignored.</p>

In [4]:
# jokes = np.array(jokes[:, 1])
# print(jokes.shape)
# jokes[0:2] # display the two records

<p>The jokes numpy array contains 100 joke ids mapped to the actual text of the jokes. </p>

In [5]:
# load the modified jester data 
modified_jester_data  = np.genfromtxt('../../data/modified_jester_data.csv',delimiter=',')

In [6]:
modified_jester_data.shape

(1000, 100)

<p>The modified_jester_data contains ratings on 100 jokes by 1000 users where each row is a user profile.</p>

In [7]:
modified_jester_data[0:3] # first three records

array([[ 3.18, 19.79,  1.34,  2.84,  3.48,  2.5 ,  1.15, 15.17,  2.02,
         6.24,  2.5 ,  4.25,  3.82, 19.45,  3.82,  3.48,  3.57,  1.19,
         1.15,  1.15,  1.63, 12.5 ,  6.63,  1.19,  2.5 , 12.12, 18.82,
        13.86, 20.13,  3.57, 13.14,  6.92,  1.92, 18.82, 16.05, 15.95,
         1.83,  2.6 ,  2.6 ,  2.6 ,  2.89,  1.87,  1.97,  1.92,  3.86,
         4.74, 14.79, 10.9 , 14.93, 15.13,  2.31,  3.86, 14.2 , 19.3 ,
         6.44, 11.92,  1.87,  1.58, 13.82,  2.36, 19.59, 14.59,  4.16,
         1.97, 13.82,  9.64,  1.92, 19.3 , 16.68,  6.19,  0.  ,  0.  ,
         0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.58,  0.  ,  0.  ,  0.  ,
         3.28,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        13.82,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  5.37,  0.  ,  0.  ,
         0.  ],
       [15.08, 10.71, 17.36, 15.37,  8.62,  1.34, 10.27,  5.66, 19.88,
        20.22, 17.75, 19.64, 15.42, 18.43, 15.56, 10.03, 15.66, 10.32,
        14.3 ,  9.79, 11.87, 19.64, 19.35, 20.17, 11.05, 18.5

## b. Complete the definition for the function "test". This function iterates over all users and for each performs evaluation (by calling the provided "cross_validate_user" function), and returns the error information necessary to compute Mean Absolute Error (MAE). Use this function to perform evaluation (wiht 20% test-ratio for each user) comparing MAE results using standard item-based collaborative filtering (based on the rating prediction function "standEst") with results using the SVD-based version of the rating item-based CF (using "svdEst" as the prediction engine).

In [8]:
def ecludSim(inA,inB):
    return 1.0 / (1.0 + la.norm(inA - inB))

In [9]:
def pearsSim(inA,inB):
    if len(inA) < 3 :
        return 1.0
    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar = 0)[0][1]

In [10]:
def cosSim(inA,inB):
    num = float(inA.T * inB)
    denom = la.norm(inA)*la.norm(inB)
    return 0.5 + 0.5 * (num / denom)

In [11]:
def standEst(dataMat, user, simMeas, item):
    """Item-based recommendation engine:
    calculates the estimated rating a user would give an item for a given similarity measure"""
    n = np.shape(dataMat)[1] # number of items in the data set.
    simTotal = 0.0; ratSimTotal = 0.0 # initialize the two varaibles that will be used to calculate the estimated rating.
    for j in range(n): # loop over every item in the row
        userRating = dataMat[user,j]
        if userRating == 0:
            continue          # if an item is rated zero, it means that the user hasn't rated it and you will skipt it
        overLap = np.nonzero(np.logical_and(dataMat[:,item]>0, \
                                      dataMat[:,j]>0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeas(dataMat[overLap,item], \
                                   dataMat[overLap,j])
        #print 'the %d and %d similarity is: %f' % (item, j, similarity)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal/simTotal

In [12]:
def svdEst(dataMat, user, simMeas, item):
    """This function creates an estimated rating for a given item for a given user. 
    This function performs SVD  on the data set"""
    n = np.shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    data=np.mat(dataMat)
    U,Sigma,VT = la.svd(data) # perform SVD on the data set
    Sig4 = np.mat(np.eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix
    xformedItems = data.T * U[:,:4] * Sig4.I  #create transformed items
    for j in range(n):
        userRating = data[user,j]
        if userRating == 0 or j==item:
            continue
        similarity = simMeas(xformedItems[item,:].T, xformedItems[j,:].T) # we calcuate matrices in alower dimension space.
        #print 'the %d and %d similarity is: %f' % (item, j, similarity)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal/simTotal

In [13]:
# This function is not needed for Assignment 4, but may be useful for experimentation
def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    unratedItems = np.nonzero(dataMat[user,:].A==0)[1] #find unrated items 
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

In [14]:
# This function performs evaluatoin on a single user based on the test_ratio
# For example, with test_ratio = 0.2, a randomly selected 20 percent of rated 
# items by the user are withheld and the rest are used to estimate the withheld ratings

def cross_validate_user(dataMat, user, test_ratio, estMethod=standEst, simMeas=pearsSim):
    number_of_items = np.shape(dataMat)[1]
    rated_items_by_user = np.array([i for i in range(number_of_items) if dataMat[user,i]>0])
    test_size = int(test_ratio * len(rated_items_by_user))
    test_indices = np.random.randint(0, len(rated_items_by_user), test_size)
    withheld_items = rated_items_by_user[test_indices]
    original_user_profile = np.copy(dataMat[user])
    dataMat[user, withheld_items] = 0 # So that the withheld test items is not used in the rating estimation below
    error_u = 0.0
    count_u = len(withheld_items)

    # Compute absolute error for user u over all test items
    for item in withheld_items:
        # Estimate rating on the withheld item
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        error_u = error_u + abs(estimatedScore - original_user_profile[item])

    # Now restore ratings of the withheld items to the user profile
    for item in withheld_items:
        dataMat[user, item] = original_user_profile[item]

    # Return sum of absolute errors and the count of test cases for this user
    # Note that these will have to be accumulated for each user to compute MAE
    return error_u, count_u


In [15]:
def test(dataMat, test_ratio, estMethod):
    # Write this function to iterate over all users and for each perform evaluation by calling
    # the above cross_validate_user function on each user. MAE will be the ratio of total error 
    # across all test cases to the total number of test cases, for all users
    absolute_error = 0
    cnt_testcases = 0
    
    for num in range(dataMat.shape[0]): # iterate through the user rows of the data set
        
        if estMethod == 'StandEst':
            error_u, count_u = cross_validate_user(dataMat, num, test_ratio, standEst)
            absolute_error += error_u
            cnt_testcases += count_u
            
        if estMethod == 'svdEst':
            error_u, count_u = cross_validate_user(dataMat, num, test_ratio, svdEst)
            absolute_error += error_u
            cnt_testcases += count_u
    
#     if cnt_testcases == 0:
#         MAE = 0
#     else:
        MAE = absolute_error/cnt_testcases
            
    print('Mean Absoloute Error for ',estMethod,' : ', MAE)

#### Using the test function to perform an evaluation (with 20% test-ratio for each user) comparing MAE results using standard item-based collaborative filtering (based on the rating prediction function "standEst") with results using the SVD-based version of the rating item-based CF (using "svdEst" as the prediction engine).

In [16]:
%time
test(modified_jester_data, 0.2, 'StandEst')

Wall time: 0 ns
Mean Absoloute Error for  StandEst  :  3.683889553965739


<p><strong>This shows us how accurate we intend to be in our evaluations; the lowest error is going to be below or above 3.68. </strong></p>

In [17]:
%time
test(modified_jester_data, 0.2, 'svdEst')

Wall time: 0 ns
Mean Absoloute Error for  svdEst  :  3.6576755551824065


<p><strong>This shows us how accurate we intend to be in our evaluations; the lowest error is going to be below or above 3.66. </strong></p>

#### A function "print_most_similar_jokes" which takes the jokes ratings data, a query , a query joke id, a parameter k for the number of nearest neighbors, and a similarity metric function, and prints the text of the query joke as well as the texts of the top k most similar jokes based on user ratings.

In [18]:
def print_most_similar_jokes(dataMat, jokes, queryJoke, k, metric=pearsSim):
    print('Joke Id %s\n \n%s' %(queryJoke, jokes[queryJoke]))
    print('\n\nThe Top %s recommended jokes are:' %k)
    
    jokeList = []
    transposedDataSet = dataMat.T
    
    #  derive the shape of the array and get jokes
    jokesColumn = transposedDataSet.shape[0]
    # print('Number Of Jokes extracted: %s\n' %jokesColumn)
        
    for num in range(jokesColumn):
        similarity = metric(transposedDataSet[queryJoke], transposedDataSet[num])
        # sim = [similarity, num]
        jokeList.append([similarity, num])
        
    jokeList.sort()
    

    for num in range(len(jokeList[0:k])):
        print('\n %s' %(jokes[jokeList[num][1]]))
        print('')
        print('---'*30)

In [19]:
modified_jester_data.T.shape

(100, 1000)

In [20]:
print_most_similar_jokes(modified_jester_data, jokes, 5, 10)

Joke Id 5
 
Bill & Hillary are on a trip back to Arkansas. They're almost out of gas so Bill pulls into a service station on the outskirts of town. The attendant runs out of the station to serve them when Hillary realizes it's an old boyfriend from high school. She and the attendant chat as he gases up their car and cleans the windows. Then they all say good-bye. As Bill pulls the car onto the road he turns to Hillary and says 'Now aren't you glad you married me and not him ? You could've been the wife of a grease monkey !' To which Hillary replied 'No Bill. If I would have married him you'd be pumping gas and he would be the President !'


The Top 10 recommended jokes are:

 Q:  What did the blind person say when given some matzah?A:  Who the hell wrote this?

------------------------------------------------------------------------------------------

 Q. Did you hear about the dyslexic devil worshipper? A. He sold his soul to Santa.

---------------------------------------------------