# Item-Based Collaborative Filtering

Collaborative filtering can be applied in more than one way. One approach, called the neighbourhood-based approach, looks at the similarities between users. A number of users are selected based on their similarity to the chosen user, and a prediction for the chosen user is made by calculating a weighted average of the selected users' ratings.
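As a rough illustration of that weighted average, here is a minimal sketch. The similarity values and ratings are made up for the example, and the variable names are not from this notebook.

```python
import numpy as np

# Hypothetical data: similarity of three neighbours to the chosen user,
# and the rating each neighbour gave to the movie we want to predict.
similarities = np.array([0.9, 0.5, 0.1])
neighbour_ratings = np.array([5.0, 3.0, 1.0])

# Similarity-weighted average: closer neighbours count for more.
predicted_rating = np.sum(similarities * neighbour_ratings) / np.sum(similarities)
print(predicted_rating)
```

The most similar neighbour rated the movie 5, so the prediction lands closer to 5 than the plain average of 3 would.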

Another approach is the item-based approach, where the ratings are used to measure the correlation between items, and that correlation score serves as the measure of how similar two items are. We'll build a recommender system using the item-based approach: it looks at the ratings every user gave to movies in order to recommend the movies each user is most likely to love.

In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('/Users/jacquesthibodeau/Python Data Science/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))

m_cols = ['movie_id', 'title']
movies = pd.read_csv('/Users/jacquesthibodeau/Python Data Science/ml-100k/u.item', encoding = "ISO-8859-1", sep='|', names=m_cols, usecols=range(2))

ratings = pd.merge(movies, ratings)

ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


Let's pivot the table so that each row holds all of the ratings from one user and each column corresponds to one movie. This gives us a better view of our dataset.

In [7]:
userRatings = ratings.pivot_table(index=['user_id'], columns=['title'], values=['rating'])
userRatings.head()

,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,,,,,,,,,,,
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


An empty cell (NaN) means that the user has not yet rated that movie.
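To see the reshaping on something small, here is a sketch with a made-up four-row ratings table. The data and the names `toy` and `toy_wide` are assumptions for illustration. It passes `values='rating'` as a plain string, which keeps the column labels single-level, whereas the notebook passes a list.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [5, 3, 4, 2],
})

# One row per user, one column per movie; unrated movies become NaN.
toy_wide = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_wide)
```

User 2 never rated movie B and user 3 never rated movie A, so those two cells come out as NaN.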

When building a recommender system, we need to know how different items relate to each other. We can measure the relationship between two movies with the correlation score of their ratings, computed over the users who rated both movies. The score tells us whether the two movies' ratings are correlated (+), uncorrelated (0) or anti-correlated (-). If they are correlated, a user who liked one of the movies is likely to like the other one. If they are uncorrelated, knowing that a user liked one of the movies tells us nothing about whether they will like the other. If they are anti-correlated, a user who liked one of the movies is likely to dislike the other one.
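The three cases can be seen on a tiny made-up example. The ratings below are invented for illustration and the movie names are placeholders.

```python
import pandas as pd

# Hypothetical ratings from four users for three movies.
toy_ratings = pd.DataFrame({
    'Movie A': [5, 4, 2, 1],
    'Movie B': [4, 5, 1, 2],   # tends to move with Movie A
    'Movie C': [1, 2, 4, 5],   # moves against Movie A
})

# DataFrame.corr computes pairwise Pearson correlation by default.
toy_corr = toy_ratings.corr()
print(toy_corr.round(2))
```

Here Movie A and Movie B come out positively correlated (about 0.8), while Movie A and Movie C are anti-correlated (about -1).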

Ideally, we should take into account how many people have rated both movies, since a handful of people who happen to rate two movies the same way would give that pair a high correlation score. We'll deal with this problem by requiring a minimum number of shared ratings before a pair's correlation counts toward a recommendation.

In [8]:
corrMatrix = userRatings.corr()
corrMatrix.head()

,,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
,title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
,title,,,,,,,,,,,,,,,,,,,,,
rating,'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
rating,1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
rating,101 Dalmatians (1996),-1.0,,1.0,-0.04989,0.269191,0.048973,0.266928,-0.043407,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,0.0,0.707107,,
rating,12 Angry Men (1957),-0.5,,-0.04989,1.0,0.666667,0.256625,0.274772,0.178848,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.144338,1.0,1.0,
rating,187 (1997),-0.5,,0.269191,0.666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.475327,,,


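In order to make sure our recommendations are good, we will throw out every pair of movies that fewer than 100 users have both rated. The min_periods argument does this: it sets the minimum number of observations required per pair of columns, and any pair below that threshold gets NaN instead of a correlation score.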

In [9]:
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

,,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
,title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
,title,,,,,,,,,,,,,,,,,,,,,
rating,'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
rating,1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
rating,101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
rating,12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
rating,187 (1997),,,,,,,,,,,...,,,,,,,,,,
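To see what the min_periods threshold does, here is a small sketch on invented data, using a threshold of 3 instead of 100. The frame `sparse` and its values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: only two users rated both Movie A and Movie B.
sparse = pd.DataFrame({
    'Movie A': [5, 1, np.nan, 4],
    'Movie B': [4, 2, 3, np.nan],
})

# With only two shared raters, the score says very little,
# yet it comes out as a perfect correlation.
loose = sparse.corr()
print(loose.loc['Movie A', 'Movie B'])

# Requiring at least 3 shared raters turns that unreliable score into NaN.
strict = sparse.corr(min_periods=3)
print(strict.loc['Movie A', 'Movie B'])
```

Each movie still correlates with itself in `strict`, because each has three ratings of its own. That matches the table above, where only the diagonal entries of sufficiently rated movies survive among the columns shown.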


Let's make a recommendation for user ID 0.

In [10]:
userID0 = userRatings.loc[0].dropna()
userID0.head()

        title                          
rating  Empire Strikes Back, The (1980)    5.0
        Gone with the Wind (1939)          1.0
        Star Wars (1977)                   5.0
Name: 0, dtype: float64

Looks like user ID 0 loved the Star Wars films, but hated Gone with the Wind.

In [13]:
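# For every movie user ID 0 rated, collect the movies correlated with it,
# scaling each correlation by the rating the user gave.
simParts = []
for movie, rating in userID0.items():
    sims = corrMatrix[movie].dropna()
    simParts.append(sims * rating)

# Series.append was removed in pandas 2.0; pd.concat is the replacement.
simCandidates = pd.concat(simParts)
simCandidates = simCandidates.sort_values(ascending=False)
simCandidates.head(10)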

(rating, Empire Strikes Back, The (1980))                       5.000000
(rating, Star Wars (1977))                                      5.000000
(rating, Empire Strikes Back, The (1980))                       3.741763
(rating, Star Wars (1977))                                      3.741763
(rating, Return of the Jedi (1983))                             3.606146
(rating, Return of the Jedi (1983))                             3.362779
(rating, Raiders of the Lost Ark (1981))                        2.693297
(rating, Raiders of the Lost Ark (1981))                        2.680586
(rating, Austin Powers: International Man of Mystery (1997))    1.887164
(rating, Sting, The (1973))                                     1.837692
dtype: float64
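The loop above multiplies each correlation by the rating user ID 0 gave to the movie it was computed against. A minimal sketch of that single step, with invented correlation values and placeholder movie names:

```python
import pandas as pd

# Hypothetical correlations of two candidates with one movie the user rated.
sims = pd.Series({'Movie X': 0.75, 'Movie Y': 0.40})
user_rating = 5.0  # rating the user gave the movie these correlations belong to

# Scale each correlation by the rating: a 5-star movie pushes its
# look-alikes much harder than a 1-star movie does.
weighted = sims * user_rating
print(weighted)
```

This is why the top scores in the output above go up to 5: a movie correlates perfectly with itself, and user ID 0 rated both Star Wars films 5.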

We see that Return of the Jedi shows up, as it should, since user ID 0 liked the other Star Wars movies. However, some movies appear more than once, because they correlate with more than one of the movies user ID 0 rated. Instead of leaving things like this, we should group the duplicate entries together and sum their scores, so that a movie similar to several of the user's rated movies gets more weight in the recommendation.

In [15]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

(rating, Empire Strikes Back, The (1980))              8.877450
(rating, Star Wars (1977))                             8.870971
(rating, Return of the Jedi (1983))                    7.178172
(rating, Raiders of the Lost Ark (1981))               5.519700
(rating, Indiana Jones and the Last Crusade (1989))    3.488028
(rating, Bridge on the River Kwai, The (1957))         3.366616
(rating, Back to the Future (1985))                    3.357941
(rating, Sting, The (1973))                            3.329843
(rating, Cinderella (1950))                            3.245412
(rating, Field of Dreams (1989))                       3.222311
dtype: float64

We can finish by filtering out the movies user ID 0 has already rated.

In [16]:
filteredCandidates = simCandidates.drop(userID0.index)
filteredCandidates.head(10)

(rating, Return of the Jedi (1983))                    7.178172
(rating, Raiders of the Lost Ark (1981))               5.519700
(rating, Indiana Jones and the Last Crusade (1989))    3.488028
(rating, Bridge on the River Kwai, The (1957))         3.366616
(rating, Back to the Future (1985))                    3.357941
(rating, Sting, The (1973))                            3.329843
(rating, Cinderella (1950))                            3.245412
(rating, Field of Dreams (1989))                       3.222311
(rating, Wizard of Oz, The (1939))                     3.200268
(rating, Dumbo (1941))                                 2.981645
dtype: float64

There we have it: our recommender system.
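The steps above can be gathered into one function. What follows is a sketch under stated assumptions rather than the notebook's own code: the helper name `recommend` is hypothetical, and it assumes single-level item labels (the notebook's tables use two-level `(rating, title)` labels, which would need `groupby(level=[0, 1])` instead).

```python
import pandas as pd

def recommend(user_ratings, corr_matrix, top_n=10):
    """Rank unseen items for one user (hypothetical helper).

    user_ratings: Series of the user's ratings indexed by item, NaNs dropped.
    corr_matrix:  square DataFrame of item-item correlations.
    """
    parts = []
    for item, rating in user_ratings.items():
        # Correlations with this item, scaled by the rating the user gave it.
        parts.append(corr_matrix[item].dropna() * rating)
    if not parts:
        return pd.Series(dtype=float)
    scores = pd.concat(parts)
    # Sum the contributions of items suggested more than once.
    scores = scores.groupby(level=0).sum()
    # Remove what the user has already rated, then rank.
    scores = scores.drop(user_ratings.index, errors='ignore')
    return scores.sort_values(ascending=False).head(top_n)

# Toy correlation matrix and a user who gave item 'A' five stars.
corr = pd.DataFrame(
    [[1.0, 0.8, -0.5],
     [0.8, 1.0, 0.1],
     [-0.5, 0.1, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
rated = pd.Series({'A': 5.0})
top = recommend(rated, corr)
print(top)
```

On the toy data, item B ranks first because it correlates positively with the item the user loved, and item C ranks last because it is anti-correlated with it.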