# User Based Collaborative Filtering

Technical Documents

https://www.geeksforgeeks.org/user-based-collaborative-filtering/

https://www.analyticsvidhya.com/blog/2022/02/introduction-to-collaborative-filtering/

## Case Study - Movie Recommendation

A movie company wants to build user-based recommendations from its dataset, which includes 27,000 films and 2 million ratings. The data comes in two tables, movie and rating.


Movie Variables:
* title: Film name
* movieId: Unique film number

Rating Variables:
* userId: Unique user id
* movieId: Unique film number
* rating: user rating
* timestamp: Rating date

### Step 1: Create user movie df and get random user

In [1]:
import pandas as pd

In [2]:
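def create_user_movie_df():
    # Load the movie titles and the user ratings
    movie = pd.read_csv('movie.csv')
    rating = pd.read_csv('rating.csv')
    df = movie.merge(rating, on='movieId', how='left')
    # Number of ratings per title. Filtering the Series directly avoids relying on
    # the column name produced by value_counts(), which differs across pandas versions.
    rating_counts = df['title'].value_counts()
    rare_movies = rating_counts[rating_counts <= 1000].index
    # Keep only the movies with more than 1,000 ratings
    common_movies = df[~df['title'].isin(rare_movies)]
    # Rows are users, columns are titles, values are ratings
    user_movie_df = common_movies.pivot_table(index='userId', columns='title', values='rating')
    return user_movie_df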
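
The same pipeline can be followed on a tiny table. The sketch below uses hypothetical in-memory data in place of `movie.csv` and `rating.csv`, and a toy threshold of 1 rating instead of 1000; the names `toy_user_movie_df` and `rating_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for movie.csv and rating.csv
movie = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
rating = pd.DataFrame({
    "userId": [10, 10, 20, 20, 30],
    "movieId": [1, 2, 1, 2, 1],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Left merge keeps Movie C even though nobody rated it (userId becomes NaN)
df = movie.merge(rating, on="movieId", how="left")

# Drop titles with too few rows (toy threshold: 1 instead of 1000)
rating_counts = df["title"].value_counts()
rare_movies = rating_counts[rating_counts <= 1].index
common_movies = df[~df["title"].isin(rare_movies)]

# One row per user, one column per title, NaN where a user has no rating
toy_user_movie_df = common_movies.pivot_table(index="userId", columns="title", values="rating")
print(toy_user_movie_df)
```

Because the left merge introduces a missing `userId` for the unrated movie, the `userId` column is converted to float, which would explain why the index in the outputs below is shown as `1.0`, `2.0`, and so on.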

In [4]:
user_movie_df = create_user_movie_df()

In [7]:
user_movie_df.head(10)

userId,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
1.0,,,,,,,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
6.0,,,,,,,,,,,...,,,,,,,,,,
7.0,,,,,,,,,,,...,,,,,,,,,,2.0
8.0,,,,,,,,,,,...,,,,,,,,,,
9.0,,,,,,,,,,,...,,,,,,,,,,
10.0,,,,,,,,,,,...,,,,,,,,,,


In [17]:
random_user = int(pd.Series(user_movie_df.index).sample(1, random_state=45).iloc[0])
random_user

28941

### Step 2: Get the watched films by random user

In [23]:
random_user_df = user_movie_df[user_movie_df.index == random_user]
random_user_df

userId,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
28941.0,,,,,,,,,,,...,,,,,,,,,,


In [74]:
movie_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
movie_watched

['Ace Ventura: Pet Detective (1994)',
 'Ace Ventura: When Nature Calls (1995)',
 'Aladdin (1992)',
 'American President, The (1995)',
 'Apollo 13 (1995)',
 'Babe (1995)',
 'Bullets Over Broadway (1994)',
 'Clueless (1995)',
 'Disclosure (1994)',
 'Forrest Gump (1994)',
 'Four Weddings and a Funeral (1994)',
 'Home Alone (1990)',
 'Jurassic Park (1993)',
 'Like Water for Chocolate (Como agua para chocolate) (1992)',
 'Little Women (1994)',
 "Mr. Holland's Opus (1995)",
 'Mrs. Doubtfire (1993)',
 'Much Ado About Nothing (1993)',
 "Muriel's Wedding (1994)",
 'Nine Months (1995)',
 'Operation Dumbo Drop (1995)',
 'Piano, The (1993)',
 'Postman, The (Postino, Il) (1994)',
 'Ready to Wear (Pret-A-Porter) (1994)',
 'Remains of the Day, The (1993)',
 'Sabrina (1995)',
 "Schindler's List (1993)",
 'Secret Garden, The (1993)',
 'Sense and Sensibility (1995)',
 'Shadowlands (1993)',
 'Silence of the Lambs, The (1991)',
 'Star Trek: Generations (1994)',
 'Stargate (1994)']

In [39]:
len(movie_watched)

33

### Step 3: Other users watching the same movies

In [75]:
movie_watched_df = user_movie_df[movie_watched]
movie_watched_df

userId,Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Aladdin (1992),"American President, The (1995)",Apollo 13 (1995),Babe (1995),Bullets Over Broadway (1994),Clueless (1995),Disclosure (1994),Forrest Gump (1994),...,Ready to Wear (Pret-A-Porter) (1994),"Remains of the Day, The (1993)",Sabrina (1995),Schindler's List (1993),"Secret Garden, The (1993)",Sense and Sensibility (1995),Shadowlands (1993),"Silence of the Lambs, The (1991)",Star Trek: Generations (1994),Stargate (1994)
1.0,,,,,,,,,,,...,,,,,,,,3.5,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,5.0,5.0,5.0
4.0,,3.0,,,,,,,,4.0,...,,,,,3.0,,,,3.0,
5.0,,,5.0,5.0,5.0,,,,,,...,,3.0,,,5.0,3.0,,3.0,,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138489.0,,,,,,,,,,,...,,,,,,,,4.0,,
138490.0,,,,,4.0,5.0,,,,,...,,,,,4.0,4.0,,5.0,,
138491.0,,,,,,,,,,,...,,,,,,,,,,
138492.0,,,,,,,,,,,...,,,,,,,,,,


In [53]:
user_movie_count = movie_watched_df.T.notnull().sum()
user_movie_count

userId
1.0          1
2.0          2
3.0          4
4.0          6
5.0         11
            ..
138489.0     1
138490.0     7
138491.0     0
138492.0     2
138493.0     9
Length: 138493, dtype: int64
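
The transpose-then-sum idiom above counts, for each user, how many of the selected user's movies they have rated. A minimal sketch on hypothetical data (the frame `toy` and both variable names are illustrative) shows that it matches a row-wise count with `axis=1`:

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie ratings; NaN means "not rated"
toy = pd.DataFrame(
    {"Movie A": [4.0, np.nan, 5.0], "Movie B": [3.0, np.nan, np.nan]},
    index=pd.Index([10, 20, 30], name="userId"),
)

# Idiom used in the notebook: transpose, then count non-null values per column
via_transpose = toy.T.notnull().sum()

# Equivalent row-wise count without the transpose
via_axis = toy.notna().sum(axis=1)

print(via_axis)
```

On a matrix with well over 100,000 users, skipping the transpose may also save some memory, although that was not measured here.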

In [54]:
user_movie_count = user_movie_count.reset_index()
user_movie_count

Unnamed: 0,userId,0
0,1.0,1
1,2.0,2
2,3.0,4
3,4.0,6
4,5.0,11
...,...,...
138488,138489.0,1
138489,138490.0,7
138490,138491.0,0
138491,138492.0,2


In [56]:
user_movie_count.columns = ['userId', 'movie_count']
user_movie_count.head()

Unnamed: 0,userId,movie_count
0,1.0,1
1,2.0,2
2,3.0,4
3,4.0,6
4,5.0,11


In [59]:
user_movie_count[user_movie_count['movie_count'] > 20].sort_values('movie_count', ascending=False)

Unnamed: 0,userId,movie_count
94230,94231.0,33
100398,100399.0,33
118204,118205.0,33
15918,15919.0,33
124051,124052.0,33
...,...,...
79214,79215.0,21
79174,79175.0,21
9105,9106.0,21
78515,78516.0,21


In [65]:
user_same_movies = user_movie_count[user_movie_count['movie_count'] > 20]['userId']
user_same_movies.head()

129    130.0
155    156.0
157    158.0
183    184.0
294    295.0
Name: userId, dtype: float64
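
The fixed threshold of 20 used here is close to, but not identical with, the percentage-based threshold that Step 6 uses with `ratio=60`. A small arithmetic check, using the 33 movies found in Step 2 (the `min_count_*` names are illustrative):

```python
import math

n_watched = 33   # len(movie_watched) for the selected user
ratio = 60       # default ratio used in Step 6

perc = n_watched * ratio / 100

# Smallest integer movie_count that passes `movie_count > perc`
min_count_ratio = math.floor(perc) + 1

# Smallest integer movie_count that passes the fixed `movie_count > 20`
min_count_fixed = 20 + 1

print(perc, min_count_ratio, min_count_fixed)
```

So with the default ratio the function also admits users who share exactly 20 movies, which the step-by-step version excludes.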

### Step 4: Determination of similarity

**Concatenate the random user and the other users watching the same movies**

In [76]:
final_df = pd.concat([movie_watched_df[movie_watched_df.index.isin(user_same_movies)],
                    random_user_df[movie_watched]])

final_df

userId,Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Aladdin (1992),"American President, The (1995)",Apollo 13 (1995),Babe (1995),Bullets Over Broadway (1994),Clueless (1995),Disclosure (1994),Forrest Gump (1994),...,Ready to Wear (Pret-A-Porter) (1994),"Remains of the Day, The (1993)",Sabrina (1995),Schindler's List (1993),"Secret Garden, The (1993)",Sense and Sensibility (1995),Shadowlands (1993),"Silence of the Lambs, The (1991)",Star Trek: Generations (1994),Stargate (1994)
130.0,4.0,3.0,,3.0,3.0,,,3.0,5.0,5.0,...,,3.0,,5.0,,,3.0,5.0,,3.0
156.0,3.0,,,5.0,5.0,3.0,,,4.0,5.0,...,,,4.0,5.0,,4.0,4.0,5.0,3.0,4.0
158.0,2.0,1.0,4.0,4.0,3.0,5.0,,4.0,,5.0,...,,5.0,3.0,5.0,5.0,4.0,5.0,5.0,,
184.0,2.0,3.0,3.0,4.0,4.0,,3.0,,4.0,3.0,...,,4.0,4.0,5.0,4.0,,4.0,5.0,3.0,4.0
295.0,,,3.0,3.0,3.0,3.0,3.0,2.0,,4.0,...,,3.0,3.0,4.0,3.0,4.0,,4.0,3.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138279.0,3.0,,,3.0,5.0,5.0,5.0,4.0,,5.0,...,,5.0,,5.0,,4.0,,5.0,3.0,3.0
138382.0,1.0,1.0,4.0,2.0,3.0,5.0,,4.0,3.0,5.0,...,,,3.0,,3.0,3.0,,5.0,,4.0
138415.0,1.0,,3.0,4.0,3.0,5.0,,,,4.0,...,,5.0,,5.0,3.0,4.0,4.0,,3.0,
138483.0,1.0,1.0,3.0,3.0,3.0,,3.0,3.0,2.0,4.0,...,3.0,4.0,,4.0,3.0,4.0,4.0,4.0,2.0,2.0


**Transpose the dataframe to calculate the correlation**

In [86]:
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
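
For reference, `DataFrame.corr` uses the Pearson coefficient by default and, for each pair of columns, skips rows where either value is missing. For two users $u$ and $v$ this corresponds to the expression below. The notation is introduced here for illustration: $I_{uv}$ is the set of movies rated by both users, $r_{u,i}$ is the rating of user $u$ for movie $i$, and $\bar{r}_u$ is the mean of user $u$'s ratings over $I_{uv}$.

$$
\mathrm{corr}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}}
$$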

In [88]:
corr_df.head()

userId    userId 
28866.0   67756.0   -0.936065
60562.0   37121.0   -0.915003
34103.0   80593.0   -0.898718
62575.0   21398.0   -0.896612
117826.0  48416.0   -0.883969
dtype: float64
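
One caveat with the chain above: `drop_duplicates()` on a Series removes repeated values, so of the two mirrored pairs `(a, b)` and `(b, a)` only one survives, and a later filter on the first user may miss some neighbours. A possible alternative, sketched on hypothetical data and not part of the original notebook, is to read the selected user's column of the correlation matrix directly (`toy_final_df`, `target_user` and `similar` are illustrative names):

```python
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies
toy_final_df = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0],
        "Movie B": [3.0, 3.0, 4.0],
        "Movie C": [4.0, 5.0, 2.0],
        "Movie D": [1.0, 2.0, 5.0],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)
target_user = 1

# DataFrame.corr compares columns, so users are moved to the columns first
user_corr = toy_final_df.T.corr()

# Correlations of every other user with the target user, most similar first
similar = user_corr[target_user].drop(target_user).sort_values(ascending=False)
print(similar)
```

In this toy example user 2 rates movies much like user 1 and ends up with a high positive correlation, while user 3 has nearly opposite tastes and a strongly negative one.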

In [90]:
corr_df = pd.DataFrame(corr_df, columns=['corr'])
corr_df.head()

userId,userId,corr
28866.0,67756.0,-0.936065
60562.0,37121.0,-0.915003
34103.0,80593.0,-0.898718
62575.0,21398.0,-0.896612
117826.0,48416.0,-0.883969


In [92]:
corr_df.index.names = ['userId1', 'userId2']
corr_df.head()

userId1,userId2,corr
28866.0,67756.0,-0.936065
60562.0,37121.0,-0.915003
34103.0,80593.0,-0.898718
62575.0,21398.0,-0.896612
117826.0,48416.0,-0.883969


In [94]:
corr_df.reset_index(inplace=True)

In [95]:
corr_df.head()

Unnamed: 0,userId1,userId2,corr
0,28866.0,67756.0,-0.936065
1,60562.0,37121.0,-0.915003
2,34103.0,80593.0,-0.898718
3,62575.0,21398.0,-0.896612
4,117826.0,48416.0,-0.883969


In [119]:
top_users = corr_df[(corr_df['userId1'] == random_user) & (corr_df['corr'] >= 0.65)][['userId2', 'corr']].reset_index(drop=True)

In [123]:
top_users = top_users.sort_values('corr', ascending=False)
top_users.head()

Unnamed: 0,userId2,corr
27,28941.0,1.0
26,45158.0,0.800749
25,101628.0,0.790405
24,127259.0,0.763925
23,9783.0,0.747942


In [124]:
top_users.rename(columns = {'userId2': 'userId'}, inplace=True)
top_users.head()

Unnamed: 0,userId,corr
27,28941.0,1.0
26,45158.0,0.800749
25,101628.0,0.790405
24,127259.0,0.763925
23,9783.0,0.747942


In [107]:
rating = pd.read_csv('rating.csv')

In [108]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [125]:
top_users_rating = top_users.merge(rating[['userId', 'movieId', 'rating']], how='inner')
top_users_rating.head()

Unnamed: 0,userId,corr,movieId,rating
0,28941.0,1.0,7,5.0
1,28941.0,1.0,11,3.0
2,28941.0,1.0,17,5.0
3,28941.0,1.0,19,2.0
4,28941.0,1.0,34,5.0


In [126]:
top_users_rating = top_users_rating[top_users_rating['userId'] != random_user].copy()
top_users_rating.head()

Unnamed: 0,userId,corr,movieId,rating
33,45158.0,0.800749,1,1.5
34,45158.0,0.800749,3,1.0
35,45158.0,0.800749,17,3.5
36,45158.0,0.800749,19,2.0
37,45158.0,0.800749,22,3.0


In [127]:
top_users_rating

Unnamed: 0,userId,corr,movieId,rating
33,45158.0,0.800749,1,1.5
34,45158.0,0.800749,3,1.0
35,45158.0,0.800749,17,3.5
36,45158.0,0.800749,19,2.0
37,45158.0,0.800749,22,3.0
...,...,...,...,...
13680,132307.0,0.660726,593,4.0
13681,132307.0,0.660726,595,4.0
13682,132307.0,0.660726,597,4.0
13683,132307.0,0.660726,608,5.0


### Step 5: Weighted Average Recommendation Score Calculation

In [130]:
top_users_rating['weighted_rating'] = top_users_rating['corr'] * top_users_rating['rating']
top_users_rating.head()

Unnamed: 0,userId,corr,movieId,rating,weighted_rating
33,45158.0,0.800749,1,1.5,1.201124
34,45158.0,0.800749,3,1.0,0.800749
35,45158.0,0.800749,17,3.5,2.802622
36,45158.0,0.800749,19,2.0,1.601498
37,45158.0,0.800749,22,3.0,2.402248


In [131]:
top_users_rating.groupby('movieId').agg({'weighted_rating': 'mean'})

movieId,weighted_rating
1,2.205824
2,1.463950
3,1.181772
4,1.642769
5,1.151408
...,...
56367,2.565300
56508,2.931772
56788,2.565300
58303,2.931772
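
The score computed above is the plain mean of `corr * rating` per movie, so it is always scaled down by the correlations and a movie rated by a single similar user can rank as high as one rated by many. A common variant, shown here only as a sketch on hypothetical data and not as the notebook's method, divides by the sum of the correlations so the score stays on the original rating scale:

```python
import pandas as pd

# Hypothetical ratings from two similar users with their correlations
toy_ratings = pd.DataFrame({
    "userId": [2, 2, 3, 3],
    "corr": [0.8, 0.8, 0.7, 0.7],
    "movieId": [100, 200, 100, 300],
    "rating": [5.0, 4.0, 3.0, 5.0],
})
toy_ratings["weighted_rating"] = toy_ratings["corr"] * toy_ratings["rating"]

# Score used in the notebook: mean of corr * rating per movie
mean_score = toy_ratings.groupby("movieId")["weighted_rating"].mean()

# Alternative (assumption, not in the notebook): normalize by the sum of correlations
sums = toy_ratings.groupby("movieId")[["weighted_rating", "corr"]].sum()
normalized_score = sums["weighted_rating"] / sums["corr"]

print(pd.DataFrame({"mean_score": mean_score, "normalized_score": normalized_score}))
```

With the normalized variant a cut-off such as 3.5 can be read directly as a rating, whereas with the mean of `corr * rating` the same cut-off depends on how strong the correlations are.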


In [133]:
recommendation_movies = top_users_rating.groupby('movieId').agg({'weighted_rating': 'mean'})
recommendation_movies.head()

movieId,weighted_rating
1,2.205824
2,1.46395
3,1.181772
4,1.642769
5,1.151408


In [134]:
recommendation_movies.reset_index(inplace=True)

In [135]:
recommendation_movies.head()

Unnamed: 0,movieId,weighted_rating
0,1,2.205824
1,2,1.46395
2,3,1.181772
3,4,1.642769
4,5,1.151408


In [140]:
movies_to_be_recommend = recommendation_movies[recommendation_movies['weighted_rating'] > 3.5].\
                        sort_values('weighted_rating', ascending=False)

In [142]:
movies_to_be_recommend

Unnamed: 0,movieId,weighted_rating
26,30,3.952023
257,326,3.952023
183,242,3.952023
45,53,3.952023
920,1348,3.692678
394,501,3.679739
3873,25850,3.664714
3691,8128,3.664714
3907,26394,3.664714
3622,7585,3.664714


In [145]:
movie = pd.read_csv('movie.csv')
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [146]:
movies_to_be_recommend.merge(movie[['movieId', 'title']], how='inner')

Unnamed: 0,movieId,weighted_rating,title
0,30,3.952023,Shanghai Triad (Yao a yao yao dao waipo qiao) ...
1,326,3.952023,To Live (Huozhe) (1994)
2,242,3.952023,Farinelli: il castrato (1994)
3,53,3.952023,Lamerica (1994)
4,1348,3.692678,"Nosferatu (Nosferatu, eine Symphonie des Graue..."
5,501,3.679739,Naked (1993)
6,25850,3.664714,Holiday (1938)
7,8128,3.664714,Au revoir les enfants (1987)
8,26394,3.664714,"Turning Point, The (1977)"
9,7585,3.664714,Summertime (1955)


### Step 6: Functionalize all steps

In [147]:
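def create_user_movie_df():
    # Load the movie titles and the user ratings
    movie = pd.read_csv('movie.csv')
    rating = pd.read_csv('rating.csv')
    df = movie.merge(rating, on='movieId', how='left')
    # Number of ratings per title. Filtering the Series directly avoids relying on
    # the column name produced by value_counts(), which differs across pandas versions.
    rating_counts = df['title'].value_counts()
    rare_movies = rating_counts[rating_counts <= 1000].index
    # Keep only the movies with more than 1,000 ratings
    common_movies = df[~df['title'].isin(rare_movies)]
    # Rows are users, columns are titles, values are ratings
    user_movie_df = common_movies.pivot_table(index='userId', columns='title', values='rating')
    return user_movie_df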

In [170]:
def user_based_recommend(random_user, user_movie_df, ratio=60, corr_th=0.65, score=3.5):
    
    # Ratings of the selected user
    random_user_df = user_movie_df[user_movie_df.index == random_user]
    
    # Movies watched (rated) by the selected user
    movie_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
    
    # Other users who watched more than `ratio` percent of the same movies
    movie_watched_df = user_movie_df[movie_watched]
    user_movie_count = movie_watched_df.T.notnull().sum()
    user_movie_count = user_movie_count.reset_index()
    user_movie_count.columns = ['userId', 'movie_count']
    perc = len(movie_watched) * ratio / 100
    user_same_movies = user_movie_count[user_movie_count['movie_count'] > perc]['userId']
    
    # Determine similarity: pairwise correlation between the users
    final_df = pd.concat([movie_watched_df[movie_watched_df.index.isin(user_same_movies)],
                    random_user_df[movie_watched]])
    corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
    corr_df = pd.DataFrame(corr_df, columns=['corr'])
    corr_df.index.names = ['userId1', 'userId2']
    corr_df.reset_index(inplace=True)
    
    # Keep the users whose correlation with the selected user is at least corr_th
    top_users = corr_df[(corr_df['userId1'] == random_user) & (corr_df['corr'] >= corr_th)][['userId2', 'corr']].reset_index(drop=True)
    top_users = top_users.sort_values('corr', ascending=False)
    top_users.rename(columns = {'userId2': 'userId'}, inplace=True)
    
    # Ratings given by the similar users, excluding the selected user
    rating = pd.read_csv('rating.csv')
    top_users_rating = top_users.merge(rating[['userId', 'movieId', 'rating']], how='inner')
    top_users_rating = top_users_rating[top_users_rating['userId'] != random_user].copy()
    
    # Weighted average recommendation score calculation
    top_users_rating['weighted_rating'] = top_users_rating['corr'] * top_users_rating['rating']
    recommendation_movies = top_users_rating.groupby('movieId').agg({'weighted_rating': 'mean'})
    recommendation_movies.reset_index(inplace=True)
    movies_to_be_recommend = recommendation_movies[recommendation_movies['weighted_rating'] > score].\
                        sort_values('weighted_rating', ascending=False)
    
    # Attach the titles to the recommended movie ids
    movie = pd.read_csv('movie.csv')
    return movies_to_be_recommend.merge(movie[['movieId', 'title']], how='inner')

In [168]:
random_user_2 = int(pd.Series(user_movie_df.index).sample(1).iloc[0])
random_user_2

15700

In [169]:
user_based_recommend(random_user_2, user_movie_df, ratio=60, corr_th=0.65, score=3.5)

Unnamed: 0,movieId,weighted_rating,title
0,2249,3.602415,My Blue Heaven (1990)
1,3881,3.602415,Phish: Bittersweet Motel (2000)
2,127441,3.602415,Last Days in Vietnam (2014)
3,120815,3.602415,Patton Oswalt: Werewolves and Lollipops (2007)
4,120813,3.602415,Patton Oswalt: My Weakness Is Strong (2009)
5,120811,3.602415,Patton Oswalt: Finest Hour (2011)
6,120807,3.602415,John Mulaney: New In Town (2012)
7,105835,3.602415,"Double, The (2013)"
8,102974,3.602415,Somebody Up There Likes Me (2012)
9,102672,3.602415,New York: A Documentary Film (1999)
