**MovieLens Dataset Recommender System**

In [5]:
import numpy as np
import pandas as pd

Genres are a pipe-separated list, and are selected from the following:

Action
Adventure
Animation
Children
Comedy
Crime
Documentary
Drama
Fantasy
Film-Noir
Horror
Musical
Mystery
Romance
Sci-Fi
Thriller
War
Western
(no genres listed)
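
Because the genres arrive as one pipe-separated string per movie, they usually need to be split before any per-genre analysis. The following is a minimal sketch; `sample_movies` is a hypothetical three-row stand-in for `movies.csv`, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv (same columns, three rows only)
sample_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# One Python list of genres per movie
genre_lists = sample_movies["genres"].str.split("|")

# One 0/1 indicator column per genre, convenient for filtering or counting
genre_flags = sample_movies["genres"].str.get_dummies(sep="|")

print(genre_lists.tolist())
print(genre_flags.sum())  # number of movies per genre
```

The indicator form is usually the easier one to join back onto the ratings, since each genre becomes an ordinary numeric column.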

In [6]:
DATA_DIR = "ml-latest"  # no trailing slash: the f-strings below add the separator
movie = pd.read_csv(f"{DATA_DIR}/movies.csv")
print(movie.columns)
movie.head(5)

Index(['movieId', 'title', 'genres'], dtype='object')


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
rating = pd.read_csv(f"{DATA_DIR}/ratings.csv")
rating.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

In [8]:
#Selecting all rows and certain columns from the DataFrame
rating = rating.loc[:,["userId","movieId","rating"]]
rating.head(10)

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,110,4.0
2,1,158,4.0
3,1,260,4.5
4,1,356,5.0
5,1,381,3.5
6,1,596,4.0
7,1,1036,5.0
8,1,1049,3.0
9,1,1066,4.0


In [9]:
data = pd.merge(movie, rating)  # inner join on the shared movieId column
print(data.head(5))
print(data.shape)  # (33832162, 5): this is very large

   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   userId  rating  
0       1     4.0  
1       2     5.0  
2       7     4.0  
3      10     3.0  
4      12     5.0  
(33832162, 5)


In [10]:
# Pivot table: one row per user, one column per movie title
data = data.iloc[:1000000, :]  # keep only the first 1,000,000 rows because the data is too large
pivot_table = data.pivot_table(index=["userId"], columns=["title"], values="rating")
print(pivot_table.shape)  # (194592, 94)
pivot_table.head(10)


(194592, 94)


userId,Ace Ventura: When Nature Calls (1995),Across the Sea of Time (1995),"American President, The (1995)",Angels and Insects (1995),Antonia's Line (Antonia) (1995),Assassins (1995),Babe (1995),Balto (1995),Beautiful Girls (1996),Bed of Roses (1996),...,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Two Bits (1995),Two if by Sea (1996),"Usual Suspects, The (1995)",Vampire in Brooklyn (1995),Waiting to Exhale (1995),When Night Is Falling (1995),"White Balloon, The (Badkonake sefid) (1995)",White Squall (1996),Wings of Courage (1995)
1,,,,,,,,,,,...,,,,,,,,,,
2,3.0,,3.0,,,,5.0,,,,...,,,,4.0,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,5.0,,,,3.0,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,4.0,,,,,3.0,4.0,,,,...,,,,4.0,,,,,,
10,,,,,,,,,,,...,4.0,,,,,,,,,
12,,,,,,,,,,,...,,,4.0,,,,,,,
14,0.5,,,,,,0.5,,,,...,3.5,,,4.0,,,,,,
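
Slicing the first 1,000,000 rows of the merged frame appears to keep only the movies that come first in the merge output, which is ordered by movie; that would explain why the pivot table has just 94 titles for 194,592 users. One alternative, sketched below under the assumption that a random subset of users is acceptable, is to sample users and keep all of their ratings. The helper `sample_users` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def sample_users(ratings, n_users, seed=0):
    """Keep every rating of n_users randomly chosen users (hypothetical helper)."""
    user_ids = pd.Series(ratings["userId"].unique())
    chosen = user_ids.sample(n=min(n_users, len(user_ids)), random_state=seed)
    return ratings[ratings["userId"].isin(chosen)]

# Toy stand-in for the merged movie/rating frame
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4],
    "title": ["A", "B", "A", "C", "B", "C", "A"],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.0],
})

subset = sample_users(toy, n_users=2)
toy_pivot = subset.pivot_table(index="userId", columns="title", values="rating")
print(toy_pivot)
```

Sampling by user keeps each selected user's full rating history, so the resulting matrix covers the whole catalogue rather than a narrow band of titles.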


In [11]:
# Find the movies whose ratings correlate most strongly with a given movie
movie_watched = pivot_table["Vampire in Brooklyn (1995)"]
similarity_with_other_movies = pivot_table.corrwith(movie_watched)  # find correlation
similarity_with_other_movies = similarity_with_other_movies.sort_values(ascending=False)
similarity_with_other_movies.head()

title
Vampire in Brooklyn (1995)            1.000000
Across the Sea of Time (1995)         0.663366
Kids of the Round Table (1995)        0.629709
Big Bully (1996)                      0.629429
Last Summer in the Hamptons (1995)    0.560340
dtype: float64

In [12]:
#How many times each movie is rated
rating_counts = pd.DataFrame(data["title"].value_counts())
rating_counts

Unnamed: 0,title
Toy Story (1995),76813
"Usual Suspects, The (1995)",72893
Seven (a.k.a. Se7en) (1995),65666
Twelve Monkeys (a.k.a. 12 Monkeys) (1995),59730
Babe (1995),37698
...,...
Lamerica (1994),166
Across the Sea of Time (1995),89
Kids of the Round Table (1995),83
Wings of Courage (1995),75
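
The rating counts put the earlier correlation result in context: the top matches for Vampire in Brooklyn (1995) include titles with very few ratings (Across the Sea of Time has 89, Kids of the Round Table has 83), so those correlations may rest on a handful of shared users. A common remedy is to require a minimum number of users who rated both movies. The sketch below shows the idea on a small synthetic table; `similar_movies` and its `min_common` parameter are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def similar_movies(pivot, title, min_common=3):
    """Correlate `title` with every other column, keeping well-supported pairs.

    min_common is the minimum number of users who rated both movies.
    """
    target = pivot[title]
    corr = pivot.corrwith(target)
    # Number of users who rated both the target and each other movie
    both_rated = pivot.notna().astype(int).mul(target.notna().astype(int), axis=0)
    common = both_rated.sum()
    result = pd.DataFrame({"correlation": corr, "common_ratings": common})
    result = result[result["common_ratings"] >= min_common].drop(index=title)
    return result.sort_values("correlation", ascending=False)

# Synthetic user x movie table (NaN = not rated)
toy_pivot = pd.DataFrame({
    "A": [5, 4, 3, 2, 1, np.nan],
    "B": [5, 4, 3, 2, 1, 3],
    "C": [5, np.nan, np.nan, np.nan, 1, 2],  # shares only two raters with A
    "D": [4, 5, 2, 3, 1, np.nan],
})

result = similar_movies(toy_pivot, "A", min_common=3)
print(result)
```

Movie C correlates perfectly with A on its two shared ratings, but the filter removes it because two data points say very little.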


**What is User-Based Collaborative Filtering**

User-based collaborative filtering is a technique used in recommender systems to provide personalized recommendations to users based on their preferences and the preferences of similar users. It is a form of collaborative filtering that focuses on the similarity between users rather than items.

In user-based collaborative filtering, recommendations are generated by identifying users who have similar preferences to a target user and suggesting items that these similar users have liked or interacted with. The underlying assumption is that if two users have similar tastes and preferences, they are likely to have similar preferences for other items as well.

The process of user-based collaborative filtering typically involves the following steps:

1. Data collection: Gather data on user-item interactions, such as ratings, reviews, or purchase history.

2. User similarity calculation: Calculate the similarity between each pair of users with a metric such as cosine similarity or Pearson correlation. The similarity is usually computed from the ratings the two users gave to the items that both of them have rated.

3. Neighborhood selection: For each user, identify the subset of most similar users (for example, the top k by similarity). This subset is known as the user's neighborhood.

4. Recommendation generation: Once the neighborhood is established, the system scores the items that the target user has not interacted with, typically by a similarity-weighted average of the neighbors' ratings, and recommends the highest-scoring ones on the assumption that the user will likely be interested in them.
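
The steps above can be sketched on a toy rating matrix. This is an illustration under stated assumptions, not the notebook's implementation: it uses Pearson correlation over co-rated items, keeps only positively correlated neighbors, and scores items by a similarity-weighted average. The function name `recommend_for` and the data are made up for the example.

```python
import numpy as np
import pandas as pd

# Step 1 (data collection): toy user x item ratings, NaN = not rated
toy_ratings = pd.DataFrame(
    {"A": [5.0, 4.0, 1.0, 5.0],
     "B": [3.0, 3.0, 3.0, 4.0],
     "C": [1.0, 2.0, 5.0, 1.0],
     "D": [np.nan, 5.0, 1.0, 4.0],
     "E": [np.nan, np.nan, 4.0, 2.0]},
    index=["u1", "u2", "u3", "u4"],
)

def recommend_for(user, ratings, k=2, n=2):
    # Step 2 (similarity): Pearson correlation between users over co-rated
    # items; pairs with fewer than 2 shared items get NaN
    user_sim = ratings.T.corr(method="pearson", min_periods=2)
    # Step 3 (neighborhood): the k most similar other users, positive only
    sims = user_sim[user].drop(user).dropna()
    neighbors = sims[sims > 0].nlargest(k)
    # Step 4 (recommendation): similarity-weighted average rating per item
    neighbor_ratings = ratings.loc[neighbors.index]
    weighted = neighbor_ratings.mul(neighbors, axis=0).sum()
    weight_total = neighbor_ratings.notna().mul(neighbors, axis=0).sum()
    scores = (weighted / weight_total).dropna()
    # Only recommend items the target user has not rated yet
    unseen = ratings.loc[user].isna()
    return scores[unseen[scores.index]].nlargest(n)

recs = recommend_for("u1", toy_ratings)
print(recs)
```

Here u2 and u4 rate items A, B and C much like u1 does, so they form the neighborhood, and item D (which both rated highly) ranks above item E. Cosine similarity could be swapped in at step 2 without changing the rest of the flow.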

User-based collaborative filtering has several advantages. It needs no item metadata, only interaction data, and it is simple to implement and to explain ("users like you also liked..."). It can also provide serendipitous recommendations by suggesting items that users may not have discovered on their own. It does not, however, solve the "cold start" problem: a new user with few or no ratings has no reliable neighbors, and a new item cannot be recommended until some users have rated it.

However, user-based collaborative filtering can have challenges with scalability when dealing with large user populations and sparse user-item interaction matrices. It may also face issues when users have diverse or evolving preferences, as the recommendations may not accurately reflect their current interests.
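
To make the sparsity concern concrete, the share of filled cells in a user-item matrix can be measured directly. The sketch below uses a made-up matrix and a hypothetical helper, `matrix_density`. On the scalability side, a dense user-user similarity matrix for the 194,592 users in the pivot table above would hold roughly 3.8 × 10^10 entries.

```python
import numpy as np
import pandas as pd

def matrix_density(user_item):
    """Fraction of user-item cells that hold a rating (NaN = missing)."""
    filled = user_item.notna().to_numpy().sum()
    return filled / user_item.size

# Made-up 3 x 4 rating matrix with 4 ratings out of 12 possible cells
toy_matrix = pd.DataFrame(
    [[5.0, np.nan, np.nan, 1.0],
     [np.nan, 4.0, np.nan, np.nan],
     [np.nan, np.nan, 3.0, np.nan]],
    columns=["A", "B", "C", "D"],
)

density = matrix_density(toy_matrix)
print(f"density = {density:.2%}, sparsity = {1 - density:.2%}")
```

When density is very low, two users rarely share enough rated items for their similarity to be meaningful, which is one reason neighborhood methods degrade on sparse data.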

Overall, user-based collaborative filtering is a popular and effective approach in building recommender systems, particularly in scenarios where user similarities can be accurately measured and utilized for generating recommendations.

In [13]:
class MovieLensDataSet(object):
    # The MovieLens dataset
    def __init__(self, data_path):
        movies_path = "./movies_part.csv"
        ratings_path = "./ratings_part.csv"
        self.movies = pd.read_csv(f"{data_path}/{movies_path}", usecols=['movieId', 'title'])
        self.ratings = pd.read_csv(f"{data_path}/{ratings_path}", usecols=['userId', 'movieId', 'rating'])

    def split_data_set(self, split_ratio=0.8):
        """
        Split the ratings into a train set and a test set.
        parameters :
        split_ratio -- fraction of the ratings that goes to the train set
        return :
        train_user_ratings -- user x title rating matrix built from the train ratings
        test_user_ratings -- user x title rating matrix built from the test ratings
        """
        # Shuffle first so the split is random (no seed, so every call differs)
        self.ratings = self.ratings.sample(frac=1)
        dataset_len = self.ratings.shape[0]
        train_dataset_len = int(dataset_len * split_ratio)
        self.train_ratings, self.test_ratings = self.ratings[:train_dataset_len], self.ratings[train_dataset_len:]
        # Build one user x title matrix per split
        self.train_user_ratings = pd.merge(self.movies, self.train_ratings).pivot_table(index=["userId"], columns=["title"], values="rating")
        self.test_user_ratings = pd.merge(self.movies, self.test_ratings).pivot_table(index=["userId"], columns=["title"], values="rating")
        # Unrated movies become 0 (real ratings start at 0.5, so 0 means "no rating")
        self.train_user_ratings = self.train_user_ratings.fillna(0)
        self.test_user_ratings = self.test_user_ratings.fillna(0)
        # Number of users in each split
        print(self.train_user_ratings.index.shape)
        print(self.test_user_ratings.index.shape)
        return self.train_user_ratings, self.test_user_ratings
dataset = MovieLensDataSet(DATA_DIR)
dataset.split_data_set()

(13,)
(11,)


(title   Ace Ventura: Pet Detective (1994)  \
 userId                                      
 1                                     0.0   
 2                                     3.0   
 4                                     0.0   
 5                                     0.0   
 6                                     0.0   
 7                                     0.0   
 8                                     0.0   
 9                                     3.0   
 10                                    0.0   
 11                                    0.0   
 12                                    0.0   
 13                                    0.0   
 14                                    1.0   
 
 title   Ace Ventura: When Nature Calls (1995)  \
 userId                                          
 1                                         0.0   
 2                                         3.0   
 4                                         0.0   
 5                                         0.0   
 6      

In [15]:
class UserBasedCF(object):
    # User-Based Collaborative Filtering
    def __init__(self, dataset):
        """
        dataset -- a MovieLensDataSet; it is split into train and test here
        """
        self.dataset = dataset
        self.train_dataset, self.test_dataset = self.dataset.split_data_set()
        print(self.train_dataset, self.test_dataset)
        self.compute_distance_matrix()

    def compute_distance_matrix(self):
        """
        Compute the user-user matrix from the train ratings.
        The matrix is (users x users), so this only suits a small sample.
        Assumption: cosine similarity is used, since the original left the
        metric unspecified. Despite its name, distance_matrix holds
        similarities (higher means more alike).
        """
        self.train_dataset_np = self.train_dataset.values
        self.user_len = self.train_dataset_np.shape[0]
        # Cosine similarity = dot product of L2-normalized rating rows
        norms = np.linalg.norm(self.train_dataset_np, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # avoid dividing by zero
        normalized = self.train_dataset_np / norms
        self.distance_matrix = normalized @ normalized.T
        # A user must never be picked as their own neighbor
        np.fill_diagonal(self.distance_matrix, -1.0)

    def recommend(self, userid_index, n=5, k=7):
        """
        Recommend the top n movies to a user.
        parameters :
        userid_index -- the userId (a label of the train index)
        n -- number of movies to recommend
        k -- number of most similar users to consult
        return :
        the list of recommended titles for userid
        """
        if userid_index not in self.train_dataset.index:
            return []  # the user has no train ratings, so no neighbors
        userid = self.train_dataset.index.get_loc(userid_index)
        similarity = self.distance_matrix[userid]
        # Positions of the k most similar users, best first
        neighbors = np.argsort(similarity)[::-1][:k]
        weights = np.clip(similarity[neighbors], 0, None)
        # Score each movie by the similarity-weighted sum of neighbor ratings
        scores = pd.Series(weights @ self.train_dataset_np[neighbors],
                           index=self.train_dataset.columns)
        # Drop the movies the user already rated in the train set
        scores = scores[self.train_dataset_np[userid] == 0]
        scores = scores[scores > 0].sort_values(ascending=False)
        return list(scores.index[:n])

    def test_model(self, n=5, k=7):
        """
        return :
        Precision
        Recall
        Coverage
        Popularity
        """
        hit, rec_count, test_count = 0, 0, 0
        recommended_movies = set()
        popularity_sum = 0.0
        # Number of train ratings per movie (unrated cells are 0)
        movie_popularity = (self.train_dataset > 0).sum(axis=0)
        for index, user in self.test_dataset.iterrows():
            recommendations = self.recommend(index, n, k)
            print("test userid:", index, recommendations)
            test_movies = set(user[user > 0].index)
            hit += len(test_movies & set(recommendations))
            rec_count += len(recommendations)
            test_count += len(test_movies)
            recommended_movies.update(recommendations)
            popularity_sum += sum(np.log1p(movie_popularity[m]) for m in recommendations)
        precision = hit / rec_count if rec_count else 0.0
        recall = hit / test_count if test_count else 0.0
        coverage = len(recommended_movies) / self.train_dataset.shape[1]
        popularity = popularity_sum / rec_count if rec_count else 0.0
        return precision, recall, coverage, popularity
ml_dataset = MovieLensDataSet(DATA_DIR)
userbased_cf = UserBasedCF(ml_dataset)
userbased_cf.test_model()

(14,)
(12,)
title   Ace Ventura: Pet Detective (1994)  \
userId                                      
1                                     0.0   
2                                     3.0   
3                                     0.0   
4                                     0.0   
5                                     0.0   
6                                     0.0   
7                                     0.0   
8                                     0.0   
9                                     3.0   
10                                    0.0   
11                                    0.0   
12                                    0.0   
13                                    0.0   
14                                    1.0   

title   Ace Ventura: When Nature Calls (1995)  Airheads (1994)  \
userId                                                           
1                                         0.0              0.0   
2                                         0.0              0.0   
3  