In [1]:
import pandas as pd
import numpy as np

In [2]:
path = 'ml-100k/'
ratings_file = 'u.data'
movies_files = 'u.item'

##Data Description

The dataset consists of movie ratings from MovieLens, which was developed by the GroupLens project at the University of Minnesota.
You can download the dataset from http://www.grouplens.org/node/12. There are two
datasets there. This tutorial uses the one containing 100,000 movie ratings.

In [3]:
def loadMovieLens(path):
    # Get movie titles, keyed by movie id
    movies={}
    with open(path+movies_files) as f:
        for line in f:
            (movieid,title)=line.split('|')[0:2]
            movies[movieid]=title
    # Load ratings into {user: {title: rating}}
    dataset={}
    with open(path+ratings_file) as f:
        for line in f:
            (user,movieid,rating,ts)=line.split('\t')
            dataset.setdefault(user,{})
            dataset[user][movies[movieid]]=float(rating)
    return dataset

The dataset is a dictionary keyed by user id, where each record has the form **{'USER_ID': {'movie1': rating1, 'movie2': rating2,.....,'movieN': ratingN}}**
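
To make the shape concrete, here is a tiny hand-made dictionary in the same layout. The user ids, titles, and ratings are made up for illustration and are not taken from the MovieLens files.

```python
# Toy data in the same nested-dict layout (all values are hypothetical)
toy_dataset = {
    '1': {'Movie A': 5.0, 'Movie B': 3.0},
    '2': {'Movie A': 4.0, 'Movie C': 2.0},
}

# Ratings for one user are a plain dict keyed by movie title
ratings_user_1 = toy_dataset['1']

# Movies that both users have rated
shared = [title for title in toy_dataset['1'] if title in toy_dataset['2']]
```

Keeping titles as the inner keys makes the output easy to read, at the cost of relying on titles being distinct.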

In [4]:
dataset=loadMovieLens(path)

**Movies rated by user id 88**

In [5]:
dataset['88']

{'Air Bud (1997)': 5.0,
 'Air Force One (1997)': 3.0,
 'Amistad (1997)': 2.0,
 'Apt Pupil (1998)': 4.0,
 'English Patient, The (1996)': 5.0,
 'Everyone Says I Love You (1996)': 3.0,
 'FairyTale: A True Story (1997)': 4.0,
 'G.I. Jane (1997)': 5.0,
 'In & Out (1997)': 4.0,
 'L.A. Confidential (1997)': 3.0,
 'Letter From Death Row, A (1998)': 5.0,
 'Life Less Ordinary, A (1997)': 5.0,
 'Ma vie en rose (My Life in Pink) (1997)': 5.0,
 'Money Talks (1997)': 5.0,
 'Mother (1996)': 1.0,
 'Postman, The (1997)': 4.0,
 'Seven Years in Tibet (1997)': 4.0,
 'Soul Food (1997)': 3.0,
 'Titanic (1997)': 3.0,
 'Wedding Singer, The (1998)': 5.0,
 'Wings of the Dove, The (1997)': 5.0}

##Collaborative filtering
###Predict an individual's preferences based on their peers' ratings.
A collaborative filtering algorithm searches a large group of users and finds people whose tastes are similar to yours. It does this by comparing each person with every other person in the dataset and calculating a similarity score. There are different ways to compute a similarity score, as listed [here](http://en.wikipedia.org/wiki/Metric_%28mathematics%29#Examples). The metric used here is based on the ***Euclidean distance***.
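
As a small worked example, assume the score is defined as 1/(1 + sum of squared rating differences) over the movies two users share, which is what the `sim_distance` function below computes. Two toy users can then be compared by hand. The names `toy_prefs` and `toy_similarity` are hypothetical and the ratings are made up.

```python
# Hypothetical ratings for two users (not from the MovieLens data)
toy_prefs = {
    'alice': {'Movie A': 5.0, 'Movie B': 3.0, 'Movie C': 4.0},
    'bob':   {'Movie A': 4.0, 'Movie B': 1.0},
}

def toy_similarity(prefs, p1, p2):
    # Only movies rated by both users contribute
    shared = [m for m in prefs[p1] if m in prefs[p2]]
    if not shared:
        return 0.0
    # Sum of squared rating differences over the shared movies
    total = sum((prefs[p1][m] - prefs[p2][m]) ** 2 for m in shared)
    # Map the distance to a score in (0, 1]; identical ratings give 1.0
    return 1.0 / (1.0 + total)

score = toy_similarity(toy_prefs, 'alice', 'bob')
# Differences are 1 and 2, so the sum of squares is 1 + 4 = 5 and the score is 1/6
```

A larger distance gives a score closer to 0, so higher always means more similar.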

###Calculate Euclidean Distance

In [6]:
from math import sqrt
# Returns a distance-based similarity score in (0, 1] for person1 and person2,
# or 0 if they have no ratings in common
def sim_distance(prefs,person1,person2):
    # Get the list of items rated by both people
    shared=[item for item in prefs[person1] if item in prefs[person2]]
    # if they have no ratings in common, return 0
    if len(shared)==0: return 0
    # Add up the squares of all the differences
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
        for item in shared])
    # The score uses the squared distance, so no square root is taken here
    return 1.0/(1+sum_of_squares)

### Returns the best matches for person from the dataset

In [7]:
# The number of results (n) is an optional param.
def topMatches(prefs,person,n=5):
    scores=[(sim_distance(prefs,person,other),other) for other in prefs if other!=person]
    # Sort the list so the highest scores appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]

###Gets recommendations for a person by using a weighted average of every other user's ratings

In [8]:

def getRecommendations(prefs,person):
    totals={}
    simSums={}
    for other in prefs:
        # don't compare me to myself
        if other==person: continue
        sim=sim_distance(prefs,person,other)

        # ignore scores of 0 or lower
        if sim<=0: continue
        for item in prefs[other]:

            # only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item]==0:
                # Similarity * Score
                totals.setdefault(item,0)
                totals[item]+=prefs[other][item]*sim
                # Sum of similarities
                simSums.setdefault(item,0)
                simSums[item]+=sim

    # Create the normalized list once, after all users have been processed
    rankings=[(total/simSums[item],item) for item,total in totals.items()]

    # Return the sorted list, highest score first
    rankings.sort()
    rankings.reverse()
    return rankings
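
The weighted average can be checked by hand on a toy case. The sketch below assumes two other users with made-up similarity scores and ratings for a single unseen movie; none of the numbers come from the dataset.

```python
# Hypothetical (similarity, rating) pairs from two other users for one unseen movie
neighbours = [(0.5, 4.0), (0.1, 2.0)]

# Numerator: each rating weighted by how similar that user is
weighted_total = sum(sim * rating for sim, rating in neighbours)
# Denominator: the sum of the similarities that contributed
sim_sum = sum(sim for sim, _ in neighbours)

predicted = weighted_total / sim_sum
# (0.5*4.0 + 0.1*2.0) / 0.6 = 2.2 / 0.6, roughly 3.67
```

One consequence of this formula: a movie rated by only one other user gets that user's rating back as its prediction (up to floating-point rounding), because the similarity cancels out.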

#####Note: Executing the command below may take a few minutes

In [10]:
collaborative_recommendations = getRecommendations(dataset,'88')


###Top 10 recommendations for user id 88, based on ratings by similar users

In [11]:
collaborative_recommendations[:10]

[(5.000000000000001, 'Tough and Deadly (1995)'),
 (5.0, 'They Made Me a Criminal (1939)'),
 (5.0, 'Star Kid (1997)'),
 (5.0, "Someone Else's America (1995)"),
 (5.0, 'Santa with Muscles (1996)'),
 (5.0, 'Saint of Fort Washington, The (1993)'),
 (5.0, 'Prefontaine (1997)'),
 (5.0, 'Marlene Dietrich: Shadow and Light (1996) '),
 (5.0, 'Great Day in Harlem, A (1994)'),
 (5.0, 'Entertaining Angels: The Dorothy Day Story (1996)')]

####Now you know how to find similar people and recommend movies for a given person, but what if you want to see which movies are similar to each other?
In this case, you can determine similarity by looking at who liked a particular movie and seeing what else they liked. This is the same method we used earlier to determine similarity between users; you just need to swap the users and the movies.

In [12]:
def transformPrefs(prefs):
    result={}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item,{})
            # Flip item and person
            result[item][person]=prefs[person][item]
    return result
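
On a toy dictionary the inversion looks like this. The sketch repeats the same flip in a standalone helper (called `invert_prefs` here, a hypothetical name) so it can be run on its own; the data is made up.

```python
def invert_prefs(prefs):
    # Same idea as transformPrefs: swap the outer and inner keys
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

toy_prefs = {
    'u1': {'Movie A': 5.0, 'Movie B': 3.0},
    'u2': {'Movie A': 4.0},
}
by_movie = invert_prefs(toy_prefs)
# by_movie == {'Movie A': {'u1': 5.0, 'u2': 4.0}, 'Movie B': {'u1': 3.0}}
```

Applying the inversion twice returns the original dictionary, which is a handy sanity check.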

The calculateSimilarItems function below first inverts the score dictionary using the transformPrefs function defined earlier, giving a dictionary of movies along with how each user rated them. It then loops over every movie and passes the transformed dictionary to the topMatches function to get the most similar movies along with their similarity scores. Finally, it returns a dictionary that maps each movie to a list of its most similar movies.

In [13]:
def calculateSimilarItems(prefs,n=10):
    # Create a dictionary of items showing which other items they
    # are most similar to.
    result={}
    # Invert the preference matrix to be item-centric
    itemPrefs=transformPrefs(prefs)
    c=0
    for item in itemPrefs:
        # Status updates for large datasets
        c+=1
        if c%100==0: print("%d / %d" % (c,len(itemPrefs)))
        # Find the most similar items to this one
        scores=topMatches(itemPrefs,item,n=n)
        result[item]=scores
    return result

In [14]:
itemsim=calculateSimilarItems(dataset)

100 / 1664
200 / 1664
300 / 1664
400 / 1664
500 / 1664
600 / 1664
700 / 1664
800 / 1664
900 / 1664
1000 / 1664
1100 / 1664
1200 / 1664
1300 / 1664
1400 / 1664
1500 / 1664
1600 / 1664


Now we get all the items that the user has rated, find the items similar to each of them, and weight those by how similar they are.
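
A toy calculation shows the weighting. Assume the user has rated two movies and a third, unseen movie has made-up similarity scores to each of them; all names and numbers are hypothetical.

```python
# Hypothetical ratings the user has already given
user_ratings = {'Movie A': 5.0, 'Movie B': 1.0}
# Hypothetical similarity of the unseen 'Movie C' to each rated movie
similarity_to_c = {'Movie A': 0.8, 'Movie B': 0.2}

# Sum of the user's ratings, each weighted by item similarity
score = sum(similarity_to_c[m] * r for m, r in user_ratings.items())
# Normalise by the total similarity so the result stays on the rating scale
total_sim = sum(similarity_to_c.values())

predicted_c = score / total_sim
# (0.8*5.0 + 0.2*1.0) / 1.0 = 4.2
```

The prediction sits close to the rating of the movie that is most similar to the candidate.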

In [15]:
def getRecommendedItems(prefs,itemMatch,user):
    userRatings=prefs[user]
    scores={}
    totalSim={}
    
    # Loop over items rated by this user
    for (item,rating) in userRatings.items( ):
        
        # Loop over items similar to this one
        for (similarity,item2) in itemMatch[item]:
            
            # Ignore if this user has already rated this item
            if item2 in userRatings: continue

            # Weighted sum of rating times similarity
            scores.setdefault(item2,0)
            scores[item2]+=similarity*rating
            # Sum of all the similarities
            totalSim.setdefault(item2,0)
            totalSim[item2]+=similarity
    # Divide each total score by total weighting to get an average
    rankings=[(score/totalSim[item],item) for item,score in scores.items( )]
    
    # Return the rankings from highest to lowest
    rankings.sort( )
    rankings.reverse( )
    return rankings

###Top 10 recommendations for userid 88 based on precomputed most similar items for each item rated by the user.

In [16]:
getRecommendedItems(dataset,itemsim,'88')[:10]

[(5.0, 'Women, The (1939)'),
 (5.0, 'Wishmaster (1997)'),
 (5.0, 'Winter Guest, The (1997)'),
 (5.0, 'Wild America (1997)'),
 (5.0, "Widows' Peak (1994)"),
 (5.0, 'White Balloon, The (1995)'),
 (5.0, 'When We Were Kings (1996)'),
 (5.0, "Wend Kuuni (God's Gift) (1982)"),
 (5.0, 'Welcome To Sarajevo (1997)'),
 (5.0, 'Walking and Talking (1996)')]