## Movie Recommendation System

Model-based Collaborative Filtering Systems

SVD Matrix Factorization

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.decomposition import TruncatedSVD

The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota. You can download the dataset for this demonstration at the following URL: https://grouplens.org/datasets/movielens/100k/

### Preparing the data

In [2]:
columns = ['user_id', 'item_id', 'rating', 'timestamp']
frame = pd.read_csv('/Users/darrenklee/Desktop/Recommender_Systems/Ex_Files_Intro_Python_Rec_Systems/Exercise_Files/02_02/ml-100k/u.data', sep='\t', names=columns)
frame.head()

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('/Users/darrenklee/Desktop/Recommender_Systems/Ex_Files_Intro_Python_Rec_Systems/Exercise_Files/02_02/ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]
movie_names.head()

,item_id,movie title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [4]:
combined_movies_data = pd.merge(frame, movie_names, on='item_id')
combined_movies_data.head()

,user_id,item_id,rating,timestamp,movie title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)
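The merge above is an inner join on `item_id`: every rating row picks up the title of the movie it refers to. Here is a minimal sketch of the same call on made-up data (the `toy_` frames and the title for item 999 are invented for illustration; only the Kolya row mirrors the output above). The `validate='many_to_one'` argument is an optional safety check I added, not something the original cell uses.

```python
import pandas as pd

# Toy ratings: two users rated item 242, one user rated a made-up item 999.
toy_frame = pd.DataFrame({
    'user_id': [196, 63, 7],
    'item_id': [242, 242, 999],
    'rating':  [3, 3, 4],
})

# Toy lookup table with exactly one title per item_id.
toy_names = pd.DataFrame({
    'item_id':     [242, 999],
    'movie title': ['Kolya (1996)', 'Made-Up Movie (2000)'],
})

# Inner join on item_id. validate='many_to_one' raises an error if the
# lookup table ever contains the same item_id twice.
toy_merged = pd.merge(toy_frame, toy_names, on='item_id',
                      how='inner', validate='many_to_one')
print(toy_merged)
```

Each of the three rating rows survives the merge because every `item_id` has a match in the lookup table.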


Find the movie with the most ratings. It turns out to be the original Star Wars movie!

In [5]:
combined_movies_data.groupby('item_id')['rating'].count().sort_values(ascending=False).head()

item_id
50     583
258    509
100    508
181    507
294    485
Name: rating, dtype: int64
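As a side note, the same ranking can be obtained with `value_counts()`, and `idxmax()` returns the top `item_id` directly. This is a small sketch on invented data (the `toy_ratings` frame is hypothetical), assuming Python 3 and no missing ratings, in which case counting rows per item is equivalent to the `groupby(...).count()` call above.

```python
import pandas as pd

# Toy ratings table (made-up values) with the same column names as `frame`.
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2, 3],
    'item_id': [50, 50, 50, 258, 258, 100],
    'rating':  [5, 4, 5, 3, 4, 2],
})

# value_counts() counts rows per item_id and sorts them in descending order.
rating_counts = toy_ratings['item_id'].value_counts()

# idxmax() returns the index label (here, the item_id) with the largest count.
most_rated_item = rating_counts.idxmax()
print(most_rated_item, rating_counts.max())  # 50 3
```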

In [6]:
# Use a descriptive name so the built-in filter() is not shadowed
is_item_50 = combined_movies_data['item_id'] == 50
combined_movies_data[is_item_50]['movie title'].unique()

array([u'Star Wars (1977)'], dtype=object)

### Building a Utility Matrix

In [7]:
rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
rating_crosstab.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,0,0,2,5,0,0,3,4,0,0,...,0,0,0,5,3,0,0,0,4,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,2,0,0,0,0,4,0,0,...,0,0,0,4,0,0,0,0,4,0
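One thing worth keeping in mind about the utility matrix: `fill_value=0` means a 0 stands for "this user did not rate this movie", not for a zero-star rating. The sketch below builds the same kind of matrix from a tiny invented ratings table (the `toy` frame and its titles are hypothetical) so the effect is easy to see.

```python
import pandas as pd

# Three users, two made-up movies; users 2 and 3 each skipped one movie.
toy = pd.DataFrame({
    'user_id':     [1, 1, 2, 3],
    'movie title': ['A (1990)', 'B (1991)', 'A (1990)', 'B (1991)'],
    'rating':      [5, 3, 4, 2],
})

# Same call as in the notebook: users as rows, titles as columns,
# and 0 wherever a user/movie pair has no rating.
toy_matrix = toy.pivot_table(values='rating', index='user_id',
                             columns='movie title', fill_value=0)

# Every 0 in the matrix marks a missing rating, since real ratings are 1 to 5.
unrated_mask = toy_matrix == 0
print(toy_matrix)
print("unrated cells:", int(unrated_mask.sum().sum()))
```

Note that `pivot_table` aggregates with the mean by default, so if the same user/title pair appeared more than once, the cell would hold the average of those ratings.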


What is the average star rating for Star Wars, excluding the users who did not rate it?

In [8]:
rating_crosstab["Star Wars (1977)"].value_counts() # 360 users did not rate the movie

0    360
5    325
4    176
3     57
2     16
1      9
Name: Star Wars (1977), dtype: int64

In [9]:
rating_crosstab["Star Wars (1977)"].count() - 360 # All users minus the 360 who did not rate it; matches the highest rating count found earlier

583

In [10]:
m = list(rating_crosstab["Star Wars (1977)"])

# Keep only the non-zero entries, i.e. the users who actually rated the movie
results = [i for i in m if i != 0]
print(len(results))

583


In [11]:
results = pd.DataFrame(results)
print("The Average Star Rating for Star Wars is:")
# Take the mean of the single column explicitly instead of casting a Series to float
print(float(results[0].mean()))

The Average Star Rating for Star Wars is:
4.35849056604
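The average can be double-checked by hand from the `value_counts()` output above: it is the rating-weighted sum divided by the number of ratings, (5·325 + 4·176 + 3·57 + 2·16 + 1·9) / 583 = 2541 / 583. The sketch below (Python 3 assumed) redoes that arithmetic and also shows a vectorized way to skip the zeros on a small invented column (`toy_col` is hypothetical).

```python
import pandas as pd

# Rating -> number of users, copied from the value_counts() output above.
# The 360 zeros (users who did not rate the movie) are deliberately left out.
star_wars_counts = {5: 325, 4: 176, 3: 57, 2: 16, 1: 9}

n_ratings = sum(star_wars_counts.values())
weighted_sum = sum(rating * count for rating, count in star_wars_counts.items())
average = weighted_sum / n_ratings
print(n_ratings, weighted_sum, round(average, 4))  # 583 2541 4.3585

# Vectorized alternative to the loop: mask out the zeros, then take the mean.
toy_col = pd.Series([0, 5, 4, 0, 3])
toy_average = toy_col[toy_col != 0].mean()
print(toy_average)  # 4.0
```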


### Transposing the Matrix

In [12]:
rating_crosstab.shape # 943 users by 1664 movie titles

(943, 1664)

In [13]:
X = rating_crosstab.T
X.shape 

(1664, 943)

Now movie titles run along the rows and users along the columns. The next step will condense the 943 user columns into a much lower dimension. 12 is the number of dimensions selected for this notebook.

### Decomposing the Matrix

In [14]:
SVD = TruncatedSVD(n_components=12, random_state=17)

resultant_matrix = SVD.fit_transform(X)

resultant_matrix.shape

(1664, 12)

In [15]:
resultant_matrix

array([[ 1.03999361e+00,  6.59884498e-01,  4.56894384e-02, ...,
        -7.53561261e-01, -3.01529677e-01, -5.36511447e-01],
       [ 4.36584337e-01, -2.57261460e-01,  3.52955096e-01, ...,
        -2.47432271e-01,  3.52606851e-01, -8.35835521e-02],
       [ 1.25437438e+01,  5.66923364e+00, -4.90781163e+00, ...,
         3.80234835e+00,  6.02188649e-01, -8.31986802e-01],
       ...,
       [ 3.58929614e-01,  3.71257163e-01,  2.29745164e-02, ...,
        -8.87474913e-02,  1.83072132e-01, -2.97030810e-02],
       [ 1.42428013e+00,  8.14939513e-01, -4.90237341e-01, ...,
         1.56784300e-01,  5.69624249e-01, -6.37080308e-01],
       [ 2.29210339e-01, -6.22518604e-03,  2.73162116e-01, ...,
         8.32677774e-02, -8.14794217e-02, -1.56028182e-01]])

The matrix has been decomposed into 12 components (dimensions), so each movie is now described by 12 values instead of 943.
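To get a feel for what the reduced matrix contains, here is a rough sketch using NumPy's exact SVD on a small random matrix. This is not the solver scikit-learn uses (`TruncatedSVD` defaults to a randomized algorithm, so values can differ slightly and signs may flip), and `toy_X` and `k` are made-up stand-ins for `X` and `n_components`. The idea it illustrates is that the reduced matrix is the data projected onto the top `k` right singular vectors.

```python
import numpy as np

# Toy "movies x users" matrix: 8 movies, 5 users, ratings 0 to 5 (random values).
rng = np.random.default_rng(17)
toy_X = rng.integers(0, 6, size=(8, 5)).astype(float)

k = 2  # number of components to keep (the notebook uses 12)

# Exact SVD: toy_X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(toy_X, full_matrices=False)

# Keep the first k columns of U, scaled by the first k singular values.
reduced = U[:, :k] * s[:k]

# Equivalent view: project the rows of toy_X onto the top-k right singular vectors.
projected = toy_X @ Vt[:k].T

print(reduced.shape)                     # (8, 2)
print(np.allclose(reduced, projected))   # True
```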

### Generating a Correlation Matrix

In [16]:
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(1664, 1664)

In [17]:
corr_mat

array([[ 1.        , -0.10298113,  0.52210159, ...,  0.39854553,
         0.22143017,  0.5039286 ],
       [-0.10298113,  1.        ,  0.06549218, ...,  0.16134137,
         0.5091753 ,  0.23355053],
       [ 0.52210159,  0.06549218,  1.        , ...,  0.7658073 ,
         0.44348034,  0.19721751],
       ...,
       [ 0.39854553,  0.16134137,  0.7658073 , ...,  1.        ,
         0.18088492,  0.10342131],
       [ 0.22143017,  0.5091753 ,  0.44348034, ...,  0.18088492,
         1.        ,  0.18524109],
       [ 0.5039286 ,  0.23355053,  0.19721751, ...,  0.10342131,
         0.18524109,  1.        ]])
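`np.corrcoef` treats each row as one variable, which is why passing the (1664, 12) matrix yields a (1664, 1664) result: one Pearson correlation for every pair of movies. A tiny example with invented numbers (`toy_items` is hypothetical) makes the row-wise behaviour visible.

```python
import numpy as np

# Three "movies", each described by three latent values (made-up numbers).
toy_items = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # a positive multiple of row 0 -> correlation  1
    [3.0, 2.0, 1.0],   # row 0 reversed               -> correlation -1
])

# One row and one column per movie; the diagonal is always 1.
toy_corr = np.corrcoef(toy_items)
print(toy_corr.shape)          # (3, 3)
print(np.round(toy_corr, 2))
```

Because Pearson correlation ignores scale and offset, two movies can be perfectly correlated even when their latent values differ in magnitude.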

### Isolating Star Wars From the Correlation Matrix

In [18]:
movie_names = rating_crosstab.columns
movies_list = list(movie_names)

star_wars = movies_list.index('Star Wars (1977)')
print("The index number of the movie Star Wars is: ")
print(star_wars)

The index number of the movie Star Wars is: 
1398
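The position can also be looked up on the column index itself, without converting it to a list first. A small sketch with three invented column labels (`toy_titles` is hypothetical), assuming the titles are unique, in which case `get_loc` returns a plain integer:

```python
import pandas as pd

# Stand-in for rating_crosstab.columns with just three titles.
toy_titles = pd.Index(['Babe (1995)', 'Star Wars (1977)', 'Toy Story (1995)'],
                      name='movie title')

# get_loc returns the integer position of a label in a unique index.
position = toy_titles.get_loc('Star Wars (1977)')
print(position)  # 1

# Same answer as the list-based lookup used in the notebook.
print(list(toy_titles).index('Star Wars (1977)'))  # 1
```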


Here's the movie_index, in case you want to find recommendations for other movies.

In [19]:
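# Pair each title with its column position in rating_crosstab.
# Wrapping zip() in list() keeps the result sliceable on Python 3.
movie_index = list(zip(movie_names, range(len(movie_names))))
movie_index[108:120]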

[(u'B. Monkey (1998)', 108),
 (u'Babe (1995)', 109),
 (u'Baby-Sitters Club, The (1995)', 110),
 (u'Babyfever (1994)', 111),
 (u'Babysitter, The (1995)', 112),
 (u'Back to the Future (1985)', 113),
 (u'Backbeat (1993)', 114),
 (u'Bad Boys (1995)', 115),
 (u'Bad Company (1995)', 116),
 (u'Bad Girls (1994)', 117),
 (u'Bad Moon (1996)', 118),
 (u'Bad Taste (1987)', 119)]

Extract the row of the correlation matrix that belongs to Star Wars. It holds the correlation between Star Wars and every movie in the dataset.

In [20]:
corr_star_wars = corr_mat[star_wars]  # star_wars == 1398, found above
corr_star_wars.shape

(1664,)

### Recommending a Highly Correlated Movie

In [21]:
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.9)])

[u'Die Hard (1988)',
 u'Empire Strikes Back, The (1980)',
 u'Fugitive, The (1993)',
 u'Raiders of the Lost Ark (1981)',
 u'Return of the Jedi (1983)',
 u'Terminator 2: Judgment Day (1991)',
 u'Terminator, The (1984)',
 u'Toy Story (1995)']

In [22]:
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.95)])

[u'Return of the Jedi (1983)']

If we want to recommend a movie that is very highly correlated with Star Wars, it is no surprise that it comes from the same trilogy! Or was that a surprise? We can use this recommendation engine for any other movie in the movie_index. Let's cherry-pick Bad Boys, one of my favorite buddy-cop action films as a child.

In [23]:
t = list(rating_crosstab["Bad Boys (1995)"])

# Keep only the non-zero entries, i.e. the users who actually rated the movie
test = [i for i in t if i != 0]
print(len(test))

57


In [24]:
test = pd.DataFrame(test)
test.mean() # Star Rating for Bad Boys. That shows you my taste in movies! :p

0    3.105263
dtype: float64

In [25]:
corr_bb = corr_mat[115]  # 115 is the position of Bad Boys (1995) in movie_index
list(movie_names[(corr_bb < 1.0) & (corr_bb > 0.95)])

[u'Demolition Man (1993)',
 u'Desperado (1995)',
 u'Getaway, The (1994)',
 u'GoldenEye (1995)',
 u'Hard Target (1993)',
 u'Judgment Night (1993)',
 u'Quick and the Dead, The (1995)']

The movie list is showing its age, although there are some names we recognize. The recommendations for Bad Boys are all action movies, most of them with plenty of gunplay. The SVD approach seems to work reasonably well, even on a dated dataset. Not bad for a poor man's Netflix recommendation engine, right?
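The lookup-then-threshold steps used for Star Wars and Bad Boys could be wrapped in a small function. The following is only a sketch: `similar_titles` is a hypothetical helper of my own, it returns the top N matches rather than applying a fixed cutoff such as 0.9 or 0.95, and it is demonstrated on invented factors (`toy_factors`, `toy_titles`) rather than on the real `corr_mat`. It assumes the titles are unique.

```python
import numpy as np
import pandas as pd

def similar_titles(title, titles, corr_mat, top_n=3):
    """Return the top_n titles most correlated with `title` (hypothetical helper)."""
    idx = titles.get_loc(title)                    # row of the query movie
    corr = pd.Series(corr_mat[idx], index=titles)  # label each correlation with its title
    # Drop the movie itself (correlation 1.0), then rank the rest.
    return corr.drop(title).sort_values(ascending=False).head(top_n)

# Toy stand-ins: four movies described by three latent values each.
toy_titles = pd.Index(['A', 'B', 'C', 'D'])
toy_factors = np.array([
    [1.0, 2.0, 3.0],   # A
    [1.1, 2.1, 2.9],   # B: almost the same trend as A
    [3.0, 2.0, 1.0],   # C: the opposite trend
    [1.0, 3.0, 2.0],   # D: loosely related to A
])
toy_corr = np.corrcoef(toy_factors)

top_two = similar_titles('A', toy_titles, toy_corr, top_n=2)
print(top_two)
```

With the real data, the call would look like `similar_titles('Bad Boys (1995)', rating_crosstab.columns, corr_mat)`. Ranking by top N avoids having to tune a threshold per movie, at the cost of always returning something even when nothing is strongly correlated.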