## Model-Based Collaborative Filtering System

The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota. You can download the dataset for this demonstration from the following URL: https://grouplens.org/datasets/movielens/100k/


In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.decomposition import TruncatedSVD

## Preparing the Data

In [3]:
columns = ['user-id','item-id','rating','timestamp']
data = pd.read_csv('ml-100k/u.data',sep='\t',names = columns)
data.head()

,user-id,item-id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [9]:
columns = ['item-id','movie title','release date','video release date','IMDb URL','unknown','Action',
           'Adventure','Animation','Children','Comedy','Crime','Documentary','Drama','Fantasy',
              'Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']
movies = pd.read_csv('ml-100k/u.item',sep='|',names=columns,encoding='latin-1')
movie_names = movies[['item-id','movie title']]
movie_names.head()

,item-id,movie title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [13]:
combined_movies_data = pd.merge(data,movie_names,on='item-id')
combined_movies_data.head()

,user-id,item-id,rating,timestamp,movie title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


In [16]:
combined_movies_data.groupby('item-id')['rating'].count().sort_values(ascending=False).head()

item-id
50     583
258    509
100    508
181    507
294    485
Name: rating, dtype: int64

In [28]:
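# Look up the title of item 294, the fifth most-rated movie above.
# The mask gets its own name so that it does not shadow Python's built-in filter().
is_item_294 = combined_movies_data['item-id'] == 294
combined_movies_data.loc[is_item_294, 'movie title'].unique()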

array(['Liar Liar (1997)'], dtype=object)

<span style="font-family:'Times New Roman';font-size:1.5em;">We will now build a model to find movies similar to 'Liar Liar (1997)', the fifth most-rated title in the dataset.</span>

## Building a Utility Matrix

In [31]:
rating_crosstab = combined_movies_data.pivot_table(values='rating',index='user-id',columns='movie title',fill_value=0)
rating_crosstab.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user-id,,,,,,,,,,,,,,,,,,,,,
1,0,0,2,5,0,0,3,4,0,0,...,0,0,0,5,3,0,0,0,4,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,2,0,0,0,0,4,0,0,...,0,0,0,4,0,0,0,0,4,0
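
The pivot step above can be illustrated on a toy table. This is a minimal sketch with made-up ratings (the titles `A`, `B`, `C` and the names `toy` and `toy_matrix` are hypothetical, not part of MovieLens). It shows how `pivot_table` turns long-format ratings into a user-by-title matrix and how `fill_value=0` fills the user/title pairs that have no rating.

```python
import pandas as pd

# Toy ratings in long format (hypothetical values, not taken from MovieLens)
toy = pd.DataFrame({
    'user-id':     [1, 1, 2, 3, 3],
    'movie title': ['A', 'B', 'A', 'B', 'C'],
    'rating':      [5, 3, 4, 2, 1],
})

# Rows are users, columns are titles; missing user/title pairs become 0
toy_matrix = toy.pivot_table(values='rating', index='user-id',
                             columns='movie title', fill_value=0)
print(toy_matrix)
```

Note that `pivot_table` aggregates with the mean by default, so if a user had several ratings under the same title they would be averaged. Filling with 0 also treats "not rated" as a numeric value, which is a simplification worth keeping in mind when interpreting the results.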


## Transposing the Matrix

In [33]:
rating_crosstab.shape

(943, 1664)

In [36]:
X = rating_crosstab.T
X.shape

(1664, 943)

## Decomposing the Matrix

In [37]:
SVD = TruncatedSVD(n_components=12,random_state=20)
resultant_matrix = SVD.fit_transform(X)
resultant_matrix.shape


(1664, 12)
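
To get a feel for what the decomposition does, here is a small sketch on random data. It assumes a hypothetical 50 x 30 item-by-user matrix (`toy_X`) rather than the real ratings, and uses `explained_variance_ratio_` to check how much of the variance the 12 components retain. The choice of 12 components simply mirrors the cell above; it is not a tuned value.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(20)
# Hypothetical matrix: 50 items (rows) x 30 users (columns), ratings 0-5
toy_X = rng.randint(0, 6, size=(50, 30)).astype(float)

svd = TruncatedSVD(n_components=12, random_state=20)
reduced = svd.fit_transform(toy_X)

print(reduced.shape)                        # (50, 12)
# Fraction of the total variance kept by the 12 components
print(svd.explained_variance_ratio_.sum())
```

On the real matrix, the same attribute can be inspected on the fitted `SVD` object to judge whether 12 components are enough.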

## Generating a Correlation Matrix

In [38]:
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(1664, 1664)
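
`np.corrcoef` treats each row as one variable, which is why a (1664, 12) input yields a (1664, 1664) movie-by-movie matrix. The sketch below uses three hypothetical items with four made-up latent features each to show the behaviour on a case that can be checked by hand.

```python
import numpy as np

# Three hypothetical items, each described by four latent features
features = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # a scaled copy of the first row
    [4.0, 3.0, 2.0, 1.0],   # the first row reversed
])

corr = np.corrcoef(features)   # each row is treated as one variable
print(corr.shape)              # (3, 3)
print(np.round(corr, 2))
```

The scaled copy correlates perfectly with the first row, while the reversed row is perfectly anti-correlated with it; the diagonal is each item's correlation with itself.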

## Isolating Liar Liar from the Correlation Matrix

In [44]:
movie_names = rating_crosstab.columns
movies_list = list(movie_names)

liar_liar = movies_list.index('Liar Liar (1997)')
liar_liar

860

In [45]:
# Use the index computed above rather than hard-coding 860
corr_liar_liar = corr_mat[liar_liar]
corr_liar_liar.shape

(1664,)

## Recommending a Highly Correlated Movie

In [46]:
list(movie_names[(corr_liar_liar < 1) & (corr_liar_liar > 0.9)])

['Anaconda (1997)',
 'Conspiracy Theory (1997)',
 'Daylight (1996)',
 "Devil's Own, The (1997)",
 'Jungle2Jungle (1997)',
 'Metro (1997)',
 'Saint, The (1997)',
 'Speed 2: Cruise Control (1997)',
 'Volcano (1997)']

In [50]:
list(movie_names[(corr_liar_liar < 1) & (corr_liar_liar > 0.94)])

['Saint, The (1997)']
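
Picking a threshold by hand (0.9, then 0.94) works, but an alternative is to rank the titles and keep the top few. The following is a sketch only: `top_similar` is a hypothetical helper, not part of the notebook, and the usage below runs on made-up correlations rather than `corr_liar_liar`.

```python
import numpy as np

def top_similar(corr_row, titles, query_index, n=5):
    """Return the n (title, correlation) pairs most correlated with the query item."""
    corr_row = np.asarray(corr_row, dtype=float)
    # Push undefined correlations (NaN) to the bottom of the ranking
    corr_row = np.where(np.isnan(corr_row), -np.inf, corr_row)
    order = np.argsort(corr_row)[::-1]               # indices, highest first
    order = [i for i in order if i != query_index]   # drop the query itself
    return [(titles[i], float(corr_row[i])) for i in order[:n]]

# Toy usage with made-up correlations (hypothetical titles)
toy_titles = ['A', 'B', 'C', 'D']
toy_corr = [1.0, 0.2, 0.95, 0.7]
result = top_similar(toy_corr, toy_titles, query_index=0, n=2)
print(result)   # [('C', 0.95), ('D', 0.7)]
```

In the notebook this could be called as `top_similar(corr_liar_liar, movies_list, liar_liar)`, which excludes the query by position instead of relying on the `< 1` comparison.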