### Collaborative Filtering (CF) 

Collaborative filtering is a means of recommendation based on users' past behavior. There are two categories of CF:

User-based: measure the similarity between the target user and other users.

Item-based: measure the similarity between the items that the target user rates or interacts with and other items.

The key idea behind CF is that similar users share similar interests, and that a user who likes an item tends to like similar items.

![user_item_cf](images/user_item_cf.jpg)
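The two views can be sketched on a toy example. This is a minimal illustration, not the implementation used later in this document: the `likes` data and the `overlap_similarity` helper are made up for the example. The user-based view compares the item sets of two users, while the item-based view inverts the table and compares the user sets of two items.

```python
import math

# Toy data (hypothetical): user -> set of items the user liked.
likes = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B"},
    "carol": {"C", "D"},
}


def overlap_similarity(a, b):
    """Cosine similarity between two sets: |a & b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# User-based view: compare the item sets of two users.
user_sim = overlap_similarity(likes["alice"], likes["bob"])

# Item-based view: invert the table, then compare the user sets of two items.
fans = {}
for user, items in likes.items():
    for item in items:
        fans.setdefault(item, set()).add(user)
item_sim = overlap_similarity(fans["A"], fans["B"])

print(round(user_sim, 3))
print(round(item_sim, 3))
```

Here alice and bob share two of their items, and items A and B are liked by exactly the same users, so the item-based score is the maximum possible.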



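The RMSE (root-mean-square error) of a model measures how far its predicted ratings are from the observed ones, so it reflects how much of the signal, as opposed to the noise, the model explains. I noticed that my RMSE is quite large. If it is large on held-out data while the training error is small, I have probably overfitted the training data; if the training error itself is large, the model is underfitting instead.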
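For reference, RMSE can be computed as in the sketch below. The `rmse` helper and the predicted and actual ratings are made-up values for illustration; they are not results from the model in this document.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions compared against training and held-out ratings.
train_rmse = rmse([3.5, 4.0, 2.0], [3, 4, 2])
test_rmse = rmse([3.5, 4.0, 2.0], [5, 2, 4])

print(round(train_rmse, 3), round(test_rmse, 3))
```

Comparing the two numbers is what distinguishes overfitting (small training error, large test error) from a model that fits poorly everywhere.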

Overall, memory-based collaborative filtering is easy to implement and produces reasonable prediction quality. However, this approach has some drawbacks:

- It doesn't address the well-known cold-start problem, that is, when a new user or a new item without any ratings enters the system.
- It can't deal with sparse data, meaning it's hard to find users that have rated the same items.
- It tends to recommend popular items.
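To get a feeling for the sparsity problem, one can measure how much of the user-item matrix is actually filled. The sketch below is an illustration only: the `density` helper is hypothetical, and it assumes each (user, item) pair appears at most once. The sample tuples are the ones listed in the next section.

```python
def density(ratings):
    """Fraction of the user-item matrix that has a rating.

    Assumes each (user, item) pair appears at most once in `ratings`.
    """
    if not ratings:
        return 0.0
    users = {user for user, _, _ in ratings}
    items = {item for _, item, _ in ratings}
    return len(ratings) / (len(users) * len(items))


sample = [
    ('186', '302', '3'), ('22', '377', '1'), ('244', '51', '2'),
    ('166', '346', '1'), ('298', '474', '4'), ('115', '265', '2'),
    ('253', '465', '5'), ('305', '451', '3'), ('6', '86', '3'),
]
sample_density = density(sample)
print(round(sample_density, 3))
```

In this sample no two users rated the same item, so no pair of users has anything in common, which is exactly the situation in which user-based similarity cannot be computed.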

Load the dataset from the ratings file. Only the following fields are needed: user, movie, rating.

    [('186', '302', '3'),
     ('22', '377', '1'),
     ('244', '51', '2'),
     ('166', '346', '1'),
     ('298', '474', '4'),
     ('115', '265', '2'),
     ('253', '465', '5'),
     ('305', '451', '3'),
     ('6', '86', '3')]
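A raw line maps to such a tuple roughly as follows. This is a minimal sketch: `parse_rating` is a hypothetical stand-in for the `parse_line` method defined below, and the timestamp values in the sample lines are illustrative.

```python
def parse_rating(line, sep="\t"):
    """Keep the first three fields (user, movie, rating) and drop the rest."""
    user, movie, rate = line.rstrip("\r\n").split(sep)[:3]
    return user, movie, rate


# Tab-separated line in the ml-100k style; the timestamp is illustrative.
parsed = parse_rating("186\t302\t3\t891717742\n")

# '::'-separated line in the ml-1m style; values are illustrative.
parsed_1m = parse_rating("1::1193::5::978300760\n", sep="::")

print(parsed)
print(parsed_1m)
```

All fields stay strings at this point; the rating is only converted to an integer when the data is split.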

### Loading and parsing input data

In [252]:
import collections
import os
import itertools
import random
from collections import namedtuple

BuiltinDataset = namedtuple('BuiltinDataset', ['url', 'path', 'sep', 'reader_params'])

BUILTIN_DATASETS = {
    'ml-100k':
        BuiltinDataset(
            url='http://files.grouplens.org/datasets/movielens/ml-100k.zip',
            path='data/ml-100k/u.data',
            sep='\t',
            reader_params=dict(line_format='user item rating timestamp',
                               rating_scale=(1, 5),
                               sep='\t')
        ),
    'ml-100k-movie':
        BuiltinDataset(
            url='http://files.grouplens.org/datasets/movielens/ml-100k.zip',
            path='data/ml-100k/u.item',
            sep='|',
            # u.item holds movie metadata (id, title, release date, ...), not ratings.
            reader_params=dict(line_format='item title release_date',
                               sep='|')
        ),
    'ml-1m':
        BuiltinDataset(
            url='http://files.grouplens.org/datasets/movielens/ml-1m.zip',
            path='data/ml-1m/ratings.dat',
            sep='::',
            reader_params=dict(line_format='user item rating timestamp',
                               rating_scale=(1, 5),
                               sep='::')
        ),
}

# Modifying the random seed changes the dataset split.
# If you want to reuse a previously saved model, do not modify this seed.
random.seed(0)


class DataSet:
    """Base class for loading datasets.

    Note that you should never instantiate the :class:`DataSet` class directly
    (the same goes for its derived classes). Instead, use one of the class
    methods below to load datasets."""

    def __init__(self):
        pass

    @classmethod
    def load_dataset(cls, name='ml-100k'):
        """Load a built-in dataset.

        :param name: str. The name of the built-in dataset to load.
                Accepted values are the keys of ``BUILTIN_DATASETS``:
                'ml-100k', 'ml-100k-movie', and 'ml-1m'.
                Default is 'ml-100k'.
        :return: list of tuples, one per line of the file.
        """
        try:
            dataset = BUILTIN_DATASETS[name]
        except KeyError:
            raise ValueError('unknown dataset ' + name +
                             '. Accepted values are ' +
                             ', '.join(BUILTIN_DATASETS.keys()) + '.')
        if not os.path.isfile(dataset.path):
            raise OSError(
                "Dataset data/" + name + " could not be found in this project.\n"
                                         "Please download it from " + dataset.url +
                ' manually and unzip it to data/ directory.')
        print('Loading dataset file %s' % dataset.path)
        with open(dataset.path, encoding="ISO-8859-1") as f:
            ratings = [cls.parse_line(line, dataset.sep) for line in f]
        print("Load " + name + " dataset success.")
        return ratings

    @classmethod
    def parse_line(cls, line: str, sep: str):
        """
        Parse a line.

        Ratings are assumed to be positive integers. They are returned as
        strings here and converted with ``int`` in `train_test_split`.

        The separator depends on the dataset: a tab in ml-100k's u.data,
        `::` in ml-1m's ratings.dat.

        :param sep: the separator between fields. Example: ``'::'``.
        :param line: the line to parse.

        :return: tuple: user id, item id, rating score.
                The timestamp is ignored because it is not used in collaborative filtering.
        """
        user, movie, rate = line.strip('\r\n').split(sep)[:3]
        return user, movie, rate

    @classmethod
    def train_test_split(cls, ratings, test_size=0.2):
        """
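        Split rating data into a training set and a test set.

        Each rating goes to the test set with probability `test_size`, so the
        resulting test fraction is only approximately `test_size`.

        :param ratings: list of (user, movie, rate) tuples, as returned by `load_dataset`.
        :param test_size: the fraction of ratings to hold out for testing.
        :return: train_set and test_set, each a dict of {user: {movie: rating}}.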
        """
        train, test = collections.defaultdict(dict), collections.defaultdict(dict)
        trainset_len = 0
        testset_len = 0
        for user, movie, rate in ratings:
            if random.random() <= test_size:
                test[user][movie] = int(rate)
                testset_len += 1
            else:
                train[user][movie] = int(rate)
                trainset_len += 1
        print('split rating data to training set and test set success.')
        print('train set size = %s' % trainset_len)
        print('test set size = %s\n' % testset_len)
        return train, test

### Model Manager: save and load the trained model
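The underlying mechanism is a plain `pickle` round trip. The sketch below shows it in isolation, under the assumption that the model is an ordinary Python object such as a dict of dicts; the `model` value and the file name are made up, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Hypothetical tiny "model": a user-user similarity table.
model = {"u1": {"u2": 0.9, "u3": 0.5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "user_sim_mat.pkl")
    # Save: binary mode is required for pickle.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load: the restored object is an equal copy, not the same object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model, restored is model)
```

Opening the files with `with` makes sure they are closed even if pickling fails.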

In [254]:
import pickle
import shutil  # needed by ModelManager.clean_workspace

class ModelManager:
    """
    Model manager is designed to load and save all models,
    whatever the dataset name.
    """
    # path_name belongs to the whole class,
    # so it should be initialized only once.
    path_name = ''

    @classmethod
    def __init__(cls, dataset_name=None, test_size=0.3):
        """
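        cls.path_name should be initialized only once.
        :param dataset_name: name of the dataset, used in the saved file names.
        :param test_size: test fraction, also used in the saved file names.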
        """
        if not cls.path_name:
            cls.path_name = "model/" + dataset_name + '-testsize' + str(test_size)

    def save_model(self, model, save_name: str):
        """
        Save model to model/ dir.
        :param model: source model
        :param save_name: model saved name.
        :return: None
        """
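        if not save_name.endswith('.pkl'):
            save_name += '.pkl'
        # Create the model/ directory if it does not exist yet.
        os.makedirs('model', exist_ok=True)
        # Use a context manager so the file is closed after writing.
        with open(self.path_name + "-%s" % save_name, "wb") as f:
            pickle.dump(model, f)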

    def load_model(self, model_name: str):
        """
        Load model from model/ dir via model name.
        :param model_name:
        :return: loaded model
        """
        if not model_name.endswith('.pkl'):
            model_name += '.pkl'
        model_path = self.path_name + "-%s" % model_name
        if not os.path.exists(model_path):
            raise OSError('There is no model named %s in model/ dir' % model_name)
        print('Load model from %s \n' % self.path_name)
        # Use a context manager so the file is closed after reading.
        with open(model_path, "rb") as f:
            return pickle.load(f)

    @staticmethod
    def clean_workspace(clean=False):
        """
        Clean the whole workspace.
        All File in model/ dir will be removed.
        :param clean: Boolean. Clean workspace or not.
        :return: None
        """
        if clean and os.path.exists('model'):
            shutil.rmtree('model')

### Calculate the similarity 

The similarity is calculated from each user's watch history (WH). Let's say

    user 1 watched movies: 1, 2, 3, 4, 5
    user 2 watched movies: 1, 3, 5

The co-occurrence count between user 1 and user 2 is 3 (movies 1, 3 and 5). It is then normalized by the sizes of both watch histories: 3 / sqrt(5 * 3) ≈ 0.77.

`calculate_user_similarity` computes the user similarity matrix by building a movie-users inverse table, so the calculation only runs between users who have rated at least one common item.

    :param trainset: the training set, a dict of {user: {movie: rating}}
    :param use_iif_similarity: if true, weight each common movie by its
                                inverse item frequency (IIF), 1 / log(1 + number of users who rated it).
                                The reasoning is the same as for TF-IDF:
                                if the item is very popular, users' similarity will be lower.
    :return: similarity matrix, per-movie popularity counts, total movie count
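The example of user 1 and user 2 can be worked through in a few lines. This is a sketch for illustration: the two watch histories come from the text above, while the `watchers` popularity counts are made-up values used only to show the effect of the IIF weighting.

```python
import math

user1_movies = {1, 2, 3, 4, 5}
user2_movies = {1, 3, 5}

# Plain co-occurrence: number of movies both users watched.
common = user1_movies & user2_movies
co_count = len(common)

# Normalize by the sizes of both watch histories.
norm = math.sqrt(len(user1_movies) * len(user2_movies))
similarity = co_count / norm

# IIF variant: each common movie counts 1 / log(1 + number of watchers).
# The popularity numbers below are hypothetical.
watchers = {1: 50, 3: 2, 5: 10}
iif_count = sum(1 / math.log(1 + watchers[movie]) for movie in common)
iif_similarity = iif_count / norm

print(co_count, round(similarity, 4), round(iif_similarity, 4))
```

With the IIF weighting, the very popular movie 1 contributes far less than the rarely watched movie 3, so the overall score drops.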


In [253]:
import collections
import math
from collections import defaultdict

def calculate_user_similarity(trainset, use_iif_similarity=False):
    # build inverse table for item-users
    # key=movieID, value=list of userIDs who have seen this movie
    print('building movie-users inverse table...')
    movie2users = collections.defaultdict(set)
    movie_popular = defaultdict(int)

    for user, movies in trainset.items():
        for movie in movies:
            movie2users[movie].add(user)
            movie_popular[movie] += 1
    print('building movie-users inverse table success.')

    # save the total movie number, which will be used in evaluation
    movie_count = len(movie2users)
    print('total movie number = %d' % movie_count)

    # count co-rated items between users
    print('generate user co-rated movies similarity matrix...')
    # the keys of usersim_mat are user1's id,
    # the values of usersim_mat are dicts which save {user2's id: co-occurrence times},
    # so you can see usersim_mat as a two-dimensional table.
    # TODO: consider a list-based matrix instead of a dict;
    # note that such a matrix would be very sparse.
    usersim_mat = {}
    for movie, users in movie2users.items():
        for user1 in users:
            # set default similarity between user1 and other users equals zero
            usersim_mat.setdefault(user1, defaultdict(int))
            for user2 in users:
                if user1 == user2:
                    continue
                # ignore the rating they gave:
                # the user similarity matrix only looks at co-occurrence.
                if use_iif_similarity:
                    # if the item is very popular, users' similarity will be lower.
                    usersim_mat[user1][user2] += 1 / math.log(1 + len(users))
                else:
                    # original method: users' similarity is based on the count of common items.
                    usersim_mat[user1][user2] += 1
    print('generate user co-rated movies similarity matrix success.')

    # calculate user-user similarity matrix
    print('calculate user-user similarity matrix...')
    for user1, related_users in usersim_mat.items():
        len_user1 = len(trainset[user1])
        for user2, count in related_users.items():
            len_user2 = len(trainset[user2])
            # The similarity of user1 and user2 is
            # len(common movies) / sqrt(len(user1 movies) * len(user2 movies))
            usersim_mat[user1][user2] = count / math.sqrt(len_user1 * len_user2)

    print('calculate user-user similarity matrix success.')
    return usersim_mat, movie_popular, movie_count


### Recommend: UserCF model implementation

Recommendation proceeds in the following steps:

    1. find the most relevant users, based on the similarity score
    2. filter out the movies already watched by the current user
    3. predict the score for the current user, based on the relevant users:
        predict_score[movie] += similarity_factor * rating
    4. return the top movies, based on predict_score
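The four steps can be sketched on toy data before reading the full class. This is a minimal illustration: the standalone `recommend` function, the `toy_trainset` ratings and the `toy_sim` similarity values are all made up, and the real implementation follows in the `UserBasedCF` class.

```python
from collections import defaultdict


def recommend(user, trainset, sim, k=2, n=2):
    """Top-n movies for `user`, scored from the k most similar users."""
    watched = trainset[user]
    # Step 1: the k most similar users.
    neighbours = sorted(sim[user].items(), key=lambda kv: kv[1], reverse=True)[:k]
    scores = defaultdict(float)
    for other, factor in neighbours:
        for movie, rating in trainset[other].items():
            # Step 2: skip movies the user has already watched.
            if movie in watched:
                continue
            # Step 3: accumulate similarity-weighted ratings.
            scores[movie] += factor * rating
    # Step 4: keep the n best-scored movies.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [movie for movie, _ in ranked[:n]]


# Hypothetical ratings and similarities.
toy_trainset = {
    "u1": {"m1": 5, "m2": 3},
    "u2": {"m1": 4, "m3": 5},
    "u3": {"m2": 2, "m4": 4},
    "u4": {"m5": 5},
}
toy_sim = {"u1": {"u2": 0.9, "u3": 0.5, "u4": 0.1}}

top = recommend("u1", toy_trainset, toy_sim)
print(top)
```

With `k=2`, only u2 and u3 contribute, so m5 (seen only by the least similar user u4) is never scored.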

In [255]:
import collections
import math
from collections import defaultdict
from operator import itemgetter

class UserBasedCF:
    """
    User-based Collaborative filtering.
    Top-N recommendation.
    """

    def __init__(self, k_sim_user=20, n_rec_movie=10, use_iif_similarity=False, save_model=True):
        """
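        Init UserBasedCF with k_sim_user and n_rec_movie.
        :param k_sim_user: number of most similar users to consider.
        :param n_rec_movie: number of movies to recommend.
        :param use_iif_similarity: weight co-occurrences by inverse item frequency.
        :param save_model: save the fitted model to disk after training.
        :return: None
        """
        print("UserBasedCF start...\n")
        self.k_sim_user = k_sim_user
        self.n_rec_movie = n_rec_movie
        self.trainset = None
        # Filled in by fit(); initialized here so recommend() can check them.
        self.user_sim_mat = None
        self.movie_popular = None
        self.movie_count = None
        self.save_model = save_model
        self.use_iif_similarity = use_iif_similarity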

    def fit(self, trainset):
        """
        Fit the trainset by calculate user similarity matrix.
        :param trainset: train dataset
        :return: None
        """
        model_manager = ModelManager()
        try:
            self.user_sim_mat = model_manager.load_model(
                'user_sim_mat-iif' if self.use_iif_similarity else 'user_sim_mat')
            self.movie_popular = model_manager.load_model('movie_popular')
            self.movie_count = model_manager.load_model('movie_count')
            self.trainset = model_manager.load_model('trainset')
            print('User origin similarity model has saved before.\nLoad model success...\n')
        except OSError:
            print('No model saved before.\nTraining a new model...')
            self.user_sim_mat, self.movie_popular, self.movie_count = \
                calculate_user_similarity(trainset=trainset,
                                          use_iif_similarity=self.use_iif_similarity)
            self.trainset = trainset
            print('Trained a new model successfully.')
            if self.save_model:
                model_manager.save_model(self.user_sim_mat,
                                         'user_sim_mat-iif' if self.use_iif_similarity else 'user_sim_mat')
                model_manager.save_model(self.movie_popular, 'movie_popular')
                model_manager.save_model(self.movie_count, 'movie_count')
                # Only report a save when one actually happened.
                print('The new model has been saved successfully.\n')

    def recommend(self, user):
        """
        Find K similar users and recommend N movies for the user.
        :param user: The user we recommend movies to.
        :return: the N best score movies
        """
        if not self.user_sim_mat or not self.n_rec_movie or \
                not self.trainset or not self.movie_popular or not self.movie_count:
            raise NotImplementedError('UserCF has not been initialized or fit() has not been called yet.')
        K = self.k_sim_user
        N = self.n_rec_movie
        predict_score = collections.defaultdict(int)
        if user not in self.trainset:
            print('The user (%s) is not in the trainset.' % user)
            # Return an empty list so callers can always iterate over the result.
            return []
        # print('Recommend movies to user start...')
        watched_movies = self.trainset[user]
        for similar_user, similarity_factor in sorted(self.user_sim_mat[user].items(),
                                                      key=itemgetter(1), reverse=True)[0:K]:
            for movie, rating in self.trainset[similar_user].items():
                if movie in watched_movies:
                    continue
                # predict the user's "interest" for each movie
                # the predict_score is sum(similarity_factor * rating)
                predict_score[movie] += similarity_factor * rating
        # return the N best score movies
        return [movie for movie, _ in sorted(predict_score.items(), key=itemgetter(1), reverse=True)[0:N]]

    def predict(self, testset):
        """
        Recommend movies to all users in testset.
        :param testset: test dataset
        :return: `dict` : recommend list for each user.
        """
        movies_recommend = defaultdict(list)
        print('Predict scores start...')
        for user in testset:
            rec_movies = self.recommend(user)  # type: list
            # recommend() may return nothing for unknown users; store an empty list then.
            movies_recommend[user].extend(rec_movies or [])
        print('Predict scores success.')
        return movies_recommend

### Train the model

In [262]:
def train_model(model, dataset_name, test_size=0.3, clean=False):
    print('*' * 70)
    print('\tThis is %s model trained on %s with test_size = %.2f' % ('UCF', dataset_name, test_size))
    print('*' * 70 + '\n')
    model_manager = ModelManager(dataset_name, test_size)
    # Clean first, so that a requested clean does not delete the freshly saved split.
    # If you want to change test_size or retrain the model, set clean=True.
    model_manager.clean_workspace(clean)
    try:
        trainset = model_manager.load_model('trainset')
        testset = model_manager.load_model('testset')
    except OSError:
        ratings = DataSet.load_dataset(name=dataset_name)
        trainset, testset = DataSet.train_test_split(ratings, test_size=test_size)
        model_manager.save_model(trainset, 'trainset')
        model_manager.save_model(testset, 'testset')
    model.fit(trainset)
    return model, trainset, testset
    

### Run Prediction using trained model

In [263]:
def recommend_test(model, user_list, trainset):
    for user in user_list:
        recommend = model.recommend(str(user))
        print("recommend for userid = %s, who has watched:" % user)
        movie_dict = get_movie_dict()
        for movie in trainset[str(user)]:
            print(movie_dict[movie])
        print("recommend for userid = %s, the following movies:" % user)
        for movie in recommend:
            print(movie_dict[movie])
        print()

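def get_movie_dict():
    # `movies` is not defined in this section; it is assumed to hold the parsed
    # u.item rows, e.g. movies = DataSet.load_dataset(name='ml-100k-movie').
    movie_dict = {movie[0]: movie[1:] for movie in movies}
    return movie_dict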
        
if __name__ == '__main__':
    trainset = None
    testset = None
    dataset_name = 'ml-100k'
    test_size = 0.1
    model = UserBasedCF()
    model, trainset, testset = train_model(model, dataset_name, test_size, False)
    recommend_test(model, [100], testset)

UserBasedCF start...

**********************************************************************
	This is UCF model trained on ml-100k with test_size = 0.10
**********************************************************************

Load model from model/ml-100k-testsize0.1 

Load model from model/ml-100k-testsize0.1 

Load model from model/ml-100k-testsize0.1 

Load model from model/ml-100k-testsize0.1 

Load model from model/ml-100k-testsize0.1 

Load model from model/ml-100k-testsize0.1 

User origin similarity model has saved before.
Load model success...

recommend for userid = 100, who has watched:
('Wedding Singer, The (1998)', '13-Feb-1998')
('Chasing Amy (1997)', '01-Jan-1997')
('Hard Rain (1998)', '16-Jan-1998')
('Gattaca (1997)', '01-Jan-1997')
('Full Monty, The (1997)', '01-Jan-1997')
('Phantoms (1998)', '01-Jan-1998')
recommend for userid = 100, the following movies:
("Devil's Advocate, The (1997)", '01-Jan-1997')
('Cop Land (1997)', '01-Jan-1997')
('Saint, The (1997)', '14-Mar-19