After importing the necessary Python packages, we can use pandas to quickly load the columns of the u.data and u.item files that we care about and merge them, so we can work with movie names instead of IDs. In a real production system, we'd stick with IDs and resolve the names at the display layer, which is more efficient.

In [12]:
# Importing necessary python packages
import pandas as pd
import numpy as np

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('./dataset/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('./dataset/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

ratings = pd.merge(movies, ratings)

ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3
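
To make the merge step concrete, here is a minimal sketch on made-up data. The frames `toy_movies` and `toy_ratings` are hypothetical stand-ins for u.item and u.data, not the real files. When no `on` argument is given, `pd.merge` joins on the columns the two frames share (here only `movie_id`) and performs an inner join by default.

```python
import pandas as pd

# Hypothetical toy data standing in for u.item and u.data
toy_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'title': ['Toy Story (1995)', 'GoldenEye (1995)'],
})
toy_ratings = pd.DataFrame({
    'user_id': [308, 287, 308],
    'movie_id': [1, 1, 2],
    'rating': [4, 5, 3],
})

# No `on` argument: merge joins on the shared column (movie_id), inner join by default
toy_merged = pd.merge(toy_movies, toy_ratings)
print(toy_merged)
```

Each rating row picks up the title of its movie, which is the same shape as the `ratings.head()` output above.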


One of the most common tasks in data science is reshaping a data frame into a specific format. For example, we may want to take a data frame with few columns (long format), summarize it, and convert it into a data frame with many columns (wide format).
In the code block below, the pivot_table function constructs a user / movie rating matrix from the DataFrame, where NaN indicates missing data: movies that a specific user didn't rate.

In [3]:
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,,,,,,,,,,,,
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
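
The long-to-wide reshaping is easier to see on a tiny frame. The sketch below uses made-up data (`toy_long` and its values are hypothetical) and the same `pivot_table` call pattern as above. Note that `pivot_table` aggregates duplicate user/title pairs with the mean by default, so the resulting values are floats.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_long = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [5, 3, 4, 1],
})

# Wide format: one row per user, one column per movie, NaN where a user gave no rating
toy_wide = toy_long.pivot_table(index='user_id', columns='title', values='rating')
print(toy_wide)
```

User 2 never rated movie B and user 3 never rated movie A, so those two cells come out as NaN.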


Pandas has a built-in corr() method that computes a correlation score for every pair of columns in the matrix! This gives us a correlation score between every pair of movies, provided enough users rated both movies; otherwise the entry is NaN.

In [11]:
# min_periods argument to throw out results where fewer than 200 users rated a given movie pair
corrMatrix = userRatings.corr(method='pearson', min_periods=200)
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,,,,,,,,,,,,,,,,,,,,,,
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
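
The effect of `min_periods` can be checked on a small made-up matrix. In this sketch the frame `toy` and its values are hypothetical; columns A and B share four raters, while A and C share only two. With a threshold of 4, only the A/B pair keeps its correlation and the rest become NaN, which is why most of the real matrix above is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings; NaN means the user did not rate the movie
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'B': [4.0, 5.0, 2.0, 1.0, 3.0],
    'C': [np.nan, np.nan, 5.0, 1.0, 2.0],
})

# Keep any pair with at least one shared rater
loose = toy.corr(method='pearson', min_periods=1)
# Require at least four shared raters per pair
strict = toy.corr(method='pearson', min_periods=4)

print(loose)
print(strict)
```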


Let's create some movie recommendations for user ID 300, whom I manually added to the data set as a test case. This guy really likes Love Jones, Money Talks, etc., but hated Mimic. I'll extract his ratings from the userRatings DataFrame and use dropna() to get rid of missing data.

In [8]:
myRatings = userRatings.loc[300].dropna()
myRatings

title
Air Bud (1997)                               3.0
Air Force One (1997)                         4.0
Beverly Hills Ninja (1997)                   4.0
Booty Call (1997)                            4.0
Bulletproof (1996)                           4.0
Conspiracy Theory (1997)                     3.0
Fargo (1996)                                 3.0
Jack (1996)                                  4.0
Jungle2Jungle (1997)                         4.0
Liar Liar (1997)                             3.0
Love Jones (1997)                            5.0
McHale's Navy (1997)                         2.0
Men in Black (1997)                          4.0
Mimic (1997)                                 1.0
Money Talks (1997)                           5.0
Murder at 1600 (1997)                        4.0
Private Parts (1997)                         4.0
Scream (1996)                                4.0
Thin Line Between Love and Hate, A (1996)    5.0
Name: 300, dtype: float64

Now we need to go through each movie the user (UID 300) rated, one at a time, and build up a list of possible recommendations based on the movies similar to the ones he rated.

So for each rated movie, we retrieve the list of similar movies from our correlation matrix. We then scale those correlation scores by how well the user rated the movie they are similar to, so movies similar to ones the user liked count more than movies similar to ones the user hated:

In [10]:
simList = []
for title, rating in myRatings.items():
    print("Adding sims for " + title + "...")
    # Retrieve similar movies to this one that I rated
    sims = corrMatrix[title].dropna()
    # Now scale its similarity by how well I rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    simList.append(sims)

# Series.append was removed in pandas 2.0, so combine everything with pd.concat
simCandidates = pd.concat(simList)

#Glance at our results so far:
print("sorting...")
# Sum the scores of movies that show up as similar to more than one rated movie
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace=True, ascending=False)
simCandidates.head(10)

Adding sims for Air Bud (1997)...
Adding sims for Air Force One (1997)...
Adding sims for Beverly Hills Ninja (1997)...
Adding sims for Booty Call (1997)...
Adding sims for Bulletproof (1996)...
Adding sims for Conspiracy Theory (1997)...
Adding sims for Fargo (1996)...
Adding sims for Jack (1996)...
Adding sims for Jungle2Jungle (1997)...
Adding sims for Liar Liar (1997)...
Adding sims for Love Jones (1997)...
Adding sims for McHale's Navy (1997)...
Adding sims for Men in Black (1997)...
Adding sims for Mimic (1997)...
Adding sims for Money Talks (1997)...
Adding sims for Murder at 1600 (1997)...
Adding sims for Private Parts (1997)...
Adding sims for Scream (1996)...
Adding sims for Thin Line Between Love and Hate, A (1996)...
sorting...


Air Force One (1997)             6.292379
Conspiracy Theory (1997)         5.422248
Liar Liar (1997)                 5.010027
Scream (1996)                    4.890971
Men in Black (1997)              4.285142
Murder at 1600 (1997)            4.000000
Fargo (1996)                     3.320103
Return of the Jedi (1983)        3.221278
Toy Story (1995)                 3.203148
Independence Day (ID4) (1996)    2.711179
dtype: float64
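
The scale-and-sum logic can be verified by hand on a tiny made-up similarity matrix. Everything in this sketch (`toy_corr`, `toy_my`, and the numbers) is hypothetical. The last step, dropping titles the user has already rated, is not part of the code above; it is one possible follow-up, since the top of the list above is dominated by movies user 300 already rated.

```python
import pandas as pd

# Hypothetical movie-to-movie similarity scores
toy_corr = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])

# Hypothetical user: loved A, hated B, has not rated C
toy_my = pd.Series({'A': 5.0, 'B': 1.0})

# Scale each rated movie's similarity column by the user's rating of it
scaled = [toy_corr[title] * rating for title, rating in toy_my.items()]

# Sum the scaled scores per candidate movie and rank them
toy_scores = pd.concat(scaled).groupby(level=0).sum().sort_values(ascending=False)
print(toy_scores)

# Assumed follow-up step: drop movies the user has already rated
toy_unseen = toy_scores.drop(toy_my.index, errors='ignore')
print(toy_unseen)
```

For movie C the score is 0.2 * 5 + 0.4 * 1 = 1.4: its similarity to the loved movie A counts five times as much as its similarity to the disliked movie B.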