# Item Based Collaborative Filtering

Technical Documents

https://www.geeksforgeeks.org/item-to-item-based-collaborative-filtering/

https://towardsdatascience.com/item-based-collaborative-filtering-in-python-91f747200fab

## Case Study - Movie Recommendation

A movie company wants to build item-based recommendations from its dataset, which includes about 27,000 films and 20 million ratings. There are two datasets, called movie and rating.


Movie Variables:
* title: Film name
* movieId: Unique film number

Rating Variables:
* userId: Unique user id
* movieId: Unique film number
* rating: user rating
* timestamp: Rating date

### Step 1: Merge the movie and rating datasets

**Import movie and rating data and check info**

In [4]:
import pandas as pd


movie = pd.read_csv('movie.csv')
rating = pd.read_csv('rating.csv')

In [5]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


In [6]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [8]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 610.4+ MB
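
The `info()` output above shows `timestamp` stored as `object` and a memory footprint of about 610 MB. One possible way to reduce that, sketched here on a few in-memory rows copied from `rating.head()` rather than on the real file, is to parse the dates and request narrower dtypes at load time. The `sample_csv` and `rating_sample` names are made up for this illustration, and the chosen dtypes are an assumption about the value ranges.

```python
import io
import pandas as pd

# Stand-in for rating.csv: the first rows shown by rating.head() above.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,2005-04-02 23:53:47\n"
    "1,29,3.5,2005-04-02 23:31:16\n"
    "1,32,3.5,2005-04-02 23:33:39\n"
)

rating_sample = pd.read_csv(
    sample_csv,
    # Assumption: ids fit in 32-bit integers and ratings in 32-bit floats.
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
    # Parse the timestamp column into datetimes instead of strings.
    parse_dates=["timestamp"],
)
print(rating_sample.dtypes)
```

The same arguments could be passed to `pd.read_csv('rating.csv', ...)`; whether the saving matters depends on the available memory.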


**Merge movie and rating dataframes**

In [9]:
df = movie.merge(rating, on='movieId', how='left')

In [10]:
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41


In [11]:
df.shape

(20000797, 6)
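
The merged frame has 20,000,797 rows, while `rating` has 20,000,263. With `how='left'`, a film that has no ratings still contributes one row, with NaN in the rating columns, which is also why `userId` appears as a float after the merge. If every rated `movieId` is present in `movie`, the difference would correspond to 534 films without ratings. The sketch below reproduces the effect on toy data; the frames and titles are made up, and `indicator=True` is used only to flag the unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the movie and rating frames (hypothetical data).
movie_toy = pd.DataFrame(
    {"movieId": [1, 2, 3], "title": ["Film A", "Film B", "Film C"]}
)
rating_toy = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.5]}
)

# indicator=True adds a _merge column telling where each row came from.
merged_toy = movie_toy.merge(rating_toy, on="movieId", how="left", indicator=True)

# Films present in movie_toy but absent from rating_toy.
unrated = merged_toy[merged_toy["_merge"] == "left_only"]

print(len(rating_toy), len(merged_toy))  # merged has one extra row for Film C
print(unrated["title"].tolist())
print(merged_toy["userId"].dtype)        # float, because of the NaN userId
```

Using `how='inner'` instead would drop the unrated films at merge time; here they are removed later anyway by the rating-count filter.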

### Step 2: Creation of user_movie_df

In [12]:
df['title'].nunique()

27262

In [13]:
df['title'].value_counts()

Pulp Fiction (1994)                          67310
Forrest Gump (1994)                          66172
Shawshank Redemption, The (1994)             63366
Silence of the Lambs, The (1991)             63299
Jurassic Park (1993)                         59715
                                             ...  
Rapture (Arrebato) (1980)                        1
Education of Mohammad Hussein, The (2013)        1
Satanas (2007)                                   1
Psychosis (2010)                                 1
Innocence (2014)                                 1
Name: title, Length: 27262, dtype: int64

**We need to remove movies with 1,000 or fewer ratings**

In [20]:
comment_rating = pd.DataFrame(df['title'].value_counts())
comment_rating.head()

Unnamed: 0,title
Pulp Fiction (1994),67310
Forrest Gump (1994),66172
"Shawshank Redemption, The (1994)",63366
"Silence of the Lambs, The (1991)",63299
Jurassic Park (1993),59715


In [19]:
rare_movies = comment_rating[comment_rating['title'] <= 1000].index
rare_movies

Index(["Bear, The (Ours, L') (1988)", 'Rosewood (1997)', 'Ted (2012)',
       "One Night at McCool's (2001)", 'Marked for Death (1990)',
       'Three to Tango (1999)', "Adam's Rib (1949)",
       'I Now Pronounce You Chuck and Larry (2007)',
       'Italian for Beginners (Italiensk for begyndere) (2000)',
       'Husbands and Wives (1992)',
       ...
       "Satan's Sword (Daibosatsu tôge) (1960)",
       'Blind Massage (Tui na) (2014)', 'Prêt à tout (2014)',
       "Ditchdigger's Daughters, The (1997)", 'A.K. (1985)',
       'Rapture (Arrebato) (1980)',
       'Education of Mohammad Hussein, The (2013)', 'Satanas (2007)',
       'Psychosis (2010)', 'Innocence (2014)'],
      dtype='object', length=24103)
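
The filter above relies on the counts column being named `title`. As far as I know, pandas 2.0 and later name that column `count` instead, so `comment_rating['title']` may raise a `KeyError` there. A sketch that avoids the column name altogether keeps the counts as a Series. The toy data and the `min_ratings` threshold below are made up for illustration; the notebook itself uses 1000.

```python
import pandas as pd

# Toy ratings: Film A has 3 ratings, Film B has 1, Film C has 2.
ratings_toy = pd.DataFrame({
    "title": ["Film A"] * 3 + ["Film B"] * 1 + ["Film C"] * 2,
    "userId": [1, 2, 3, 1, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5, 3.5],
})
min_ratings = 2  # hypothetical threshold for the toy data

# value_counts() returns a Series indexed by title, so no column name is needed.
rating_counts = ratings_toy["title"].value_counts()
rare_titles = rating_counts[rating_counts <= min_ratings].index

# Keep only the rows whose title is not in the rare set.
common_toy = ratings_toy[~ratings_toy["title"].isin(rare_titles)]
print(sorted(rare_titles))
print(common_toy["title"].unique().tolist())
```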

In [26]:
common_movies = df[~df['title'].isin(rare_movies)]
common_movies

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41
...,...,...,...,...,...,...
19985698,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,28195.0,4.0,2014-09-22 20:52:18
19985699,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,51334.0,3.0,2014-09-23 15:53:39
19985700,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,120575.0,2.5,2014-10-08 14:23:39
19985701,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,124998.0,2.5,2014-09-20 22:16:14


In [27]:
common_movies['title'].nunique()

3159


In [35]:
user_movie_df = common_movies.pivot_table(index='userId', columns='title', values='rating')

In [36]:
user_movie_df.shape

(138493, 3159)

In [37]:
user_movie_df.head()

userId,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
1.0,,,,,,,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
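
The matrix is mostly NaN because each user has rated only a small share of the 3,159 titles. One detail worth keeping in mind: `pivot_table` aggregates with the mean by default, so if a user has more than one rating under the same title (for example when several `movieId` values share one title, which the 27,278 ids versus 27,262 unique titles suggest can happen), those ratings are averaged into a single cell. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy long-format ratings; user 2 has two ratings under the same title.
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2, 3],
    "title":  ["Film A", "Film B", "Film A", "Film A", "Film B", "Film A"],
    "rating": [4.0, 3.0, 5.0, 3.0, 2.0, 1.0],
})

# Rows become users, columns become titles; the default aggfunc is the mean.
user_movie_toy = ratings_toy.pivot_table(
    index="userId", columns="title", values="rating"
)
print(user_movie_toy)
```

Here user 2's two ratings for Film A are averaged, and user 3 has NaN for Film B because no rating exists.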


### Step 3: Making Movie Recommendations

In [38]:
movie_name = 'Matrix, The (1999)'

In [39]:
movie_name = user_movie_df[movie_name]

In [40]:
movie_name

userId
1.0         NaN
2.0         NaN
3.0         5.0
4.0         NaN
5.0         NaN
           ... 
138489.0    4.0
138490.0    NaN
138491.0    NaN
138492.0    5.0
138493.0    4.5
Name: Matrix, The (1999), Length: 138493, dtype: float64

In [44]:
user_movie_df.corrwith(movie_name).sort_values(ascending=False).head(10)

title
Matrix, The (1999)                                           1.000000
Matrix Reloaded, The (2003)                                  0.516906
Matrix Revolutions, The (2003)                               0.449588
Animatrix, The (2003)                                        0.367151
Blade (1998)                                                 0.334493
Terminator 2: Judgment Day (1991)                            0.333882
Minority Report (2002)                                       0.332434
Edge of Tomorrow (2014)                                      0.326762
Mission: Impossible (1996)                                   0.320815
Lord of the Rings: The Fellowship of the Ring, The (2001)    0.318726
dtype: float64
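
`corrwith` uses the Pearson correlation by default and, for each title, only the users who rated both that title and the chosen film contribute. The sketch below checks this on a made-up two-film matrix by comparing `corrwith` with a manual calculation over the co-rating users; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user x title matrix with some missing ratings.
user_movie_toy = pd.DataFrame(
    {
        "Film A": [5.0, 4.0, np.nan, 2.0, 1.0],
        "Film B": [4.0, 5.0, 3.0, np.nan, 2.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

target = user_movie_toy["Film A"]
via_corrwith = user_movie_toy.corrwith(target)["Film B"]

# Manual version: keep only users who rated both films, then Pearson r.
both = user_movie_toy.dropna(subset=["Film A", "Film B"])
via_numpy = np.corrcoef(both["Film A"], both["Film B"])[0, 1]

print(len(both), via_corrwith, via_numpy)
```

Because only co-rating users count, a correlation computed from a handful of users can look as strong as one computed from thousands.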

**Search for a film title in the data**

In [45]:
def check_film(keyword):
    return [col for col in user_movie_df.columns if keyword in col]

In [46]:
check_film('Sherlock')

['Sherlock Holmes (2009)',
 'Sherlock Holmes: A Game of Shadows (2011)',
 'Young Sherlock Holmes (1985)']
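
The search above is case-sensitive, so `check_film('sherlock')` would return nothing. A possible case-insensitive variant is sketched below; `check_film_ci` is a hypothetical helper, and it takes the column labels as an argument so that the example runs on a small made-up index.

```python
import pandas as pd

def check_film_ci(keyword, columns):
    """Return the titles that contain keyword, ignoring case."""
    keyword = keyword.lower()
    return [col for col in columns if keyword in col.lower()]

# Small stand-in for user_movie_df.columns.
titles = pd.Index(
    ["Sherlock Holmes (2009)", "Young Sherlock Holmes (1985)", "Blade (1998)"]
)
matches = check_film_ci("sherlock", titles)
print(matches)
```

With the real matrix, the call would be `check_film_ci('sherlock', user_movie_df.columns)`.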

### Step 4: Wrap all steps in functions

In [47]:
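def create_user_movie_df():
    """Build the user x title rating matrix from movie.csv and rating.csv."""
    movie = pd.read_csv('movie.csv')
    rating = pd.read_csv('rating.csv')
    # Left merge keeps every film, including films with no ratings
    df = movie.merge(rating, on='movieId', how='left')
    # Count rows per title as a Series, which avoids depending on the
    # column name that value_counts() produces in a given pandas version
    rating_counts = df['title'].value_counts()
    rare_movies = rating_counts[rating_counts <= 1000].index
    common_movies = df[~df['title'].isin(rare_movies)]
    # Rows are users, columns are titles, values are ratings (NaN if unrated)
    user_movie_df = common_movies.pivot_table(index='userId', columns='title', values='rating')

    return user_movie_df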

In [50]:
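def item_based_recommender(movie_name, user_movie_df):
    """Return the 10 titles whose ratings correlate most with movie_name."""
    # Ratings column of the chosen film, indexed by userId
    movie_ratings = user_movie_df[movie_name]
    # Pearson correlation of every title with the chosen film; the film
    # itself appears first with a correlation of 1.0
    recommendations = user_movie_df.corrwith(movie_ratings).sort_values(ascending=False).head(10)

    return recommendations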

In [51]:
item_based_recommender('Blade (1998)', user_movie_df)

title
Blade (1998)                                  1.000000
Blade II (2002)                               0.655421
Blade: Trinity (2004)                         0.579932
Underworld (2003)                             0.469069
Underworld: Evolution (2006)                  0.464850
Resident Evil: Apocalypse (2004)              0.462848
Last Boy Scout, The (1991)                    0.459956
Chronicles of Riddick, The (2004)             0.454267
Spawn (1997)                                  0.444073
Prince of Persia: The Sands of Time (2010)    0.441964
dtype: float64

In [66]:
movie = pd.Series(user_movie_df.columns).sample(1).values[0]
movie

'Natural, The (1984)'

In [67]:
item_based_recommender(movie, user_movie_df)

title
Natural, The (1984)                                1.000000
[REC] (2007)                                       0.477953
Field of Dreams (1989)                             0.459676
Futurama: The Beast with a Billion Backs (2008)    0.451982
Something Wicked This Way Comes (1983)             0.426661
Hoosiers (a.k.a. Best Shot) (1986)                 0.408590
King of Kong, The (2007)                           0.378868
Man of the House (1995)                            0.376615
Oblivion (2013)                                    0.371369
Major League (1989)                                0.364148
dtype: float64
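
Some of the suggestions for the randomly picked film look unrelated to it, which may be a sign that a few correlations rest on a small number of shared raters. One possible refinement, sketched here as an assumption and not as part of the original workflow, is to require a minimum number of users who rated both films and to drop the film itself from the list. The function name, the `min_common` and `top_n` parameters, and the toy matrix are all hypothetical.

```python
import numpy as np
import pandas as pd

def item_based_recommender_min_overlap(movie_name, user_movie_df,
                                       min_common=2, top_n=10):
    """Rank titles by correlation, ignoring pairs with few shared raters."""
    target = user_movie_df[movie_name]
    # For each title, count users who rated it among those who rated the target.
    common = user_movie_df[target.notna()].count()
    corr = user_movie_df.corrwith(target)
    # Keep titles with enough shared raters and exclude the target itself.
    keep = (common >= min_common) & (corr.index != movie_name)
    return corr[keep].sort_values(ascending=False).head(top_n)

# Toy matrix: Film D matches Film A perfectly, but on only two users.
user_movie_toy = pd.DataFrame(
    {
        "Film A": [5.0, 4.0, 3.0, 2.0, 1.0],
        "Film B": [5.0, 4.0, 3.0, 2.0, 1.0],
        "Film C": [4.0, 5.0, 2.0, 3.0, 1.0],
        "Film D": [5.0, 4.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

recs = item_based_recommender_min_overlap("Film A", user_movie_toy, min_common=3)
print(recs)
```

With `min_common=3`, Film D is left out even though its correlation with Film A is perfect, because only two users rated both. A suitable threshold for the real data would need to be chosen by inspection.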