# Category and User Similarity Based Recommender System

Team Members: Kritika Kurani, Shivanshu Arora, Sourabh Shenoy

### Introduction and Problem Statement: 

The web holds a large collection of documents, and users spend much of their time on it searching for information relevant to their interests. This is where recommender systems come into play. Collaborative filtering is one of the primary approaches used in recommender systems, but it suffers from problems such as cold start and a sparse utility matrix. In this project, we implement a hybrid approach that combines collaborative filtering with movie genre information to address these problems while also attempting to reduce the Root Mean Squared Error (RMSE). We compare this approach with another in which we establish movie correlations based on genre compositions and WordNet similarity between genres.
This project attempts to build a simpler model for movie recommendations using a minimal set of the most important features.

### Related Work:

### Previous Work:

A) Collaborative Filtering based on User Preferences: User similarity is measured using the Pearson correlation coefficient, which is computed with the formula from [1] given below: <img src="pc.jpg",width=300,height=300>


Here X is the user selected for recommendation, X' is the mean rating of user X, σX is the standard deviation of user X's ratings, and Xi is user X's rating for the ith item. Y denotes any other user, with the corresponding quantities defined the same way. The Pearson correlation coefficient always lies between -1 and 1.
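
For reference, the sample Pearson correlation coefficient can be written in this notation as follows. This is the standard textbook form, assumed here to match the formula in [1]; $n$ (the number of items rated by both users) and $Y'$, $\sigma_Y$ (the mean and standard deviation of user Y's ratings) are symbols introduced for this sketch.

$$
r_{XY} = \frac{\sum_{i=1}^{n} (X_i - X')(Y_i - Y')}{(n-1)\,\sigma_X\,\sigma_Y}
$$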

B) Genre Correlation: Each movie belongs to at least one genre. Correlation is found by introducing edges from every preceding genre to the genres that follow it. For each edge, the counter for that genre-genre pair is incremented. For example, if the genre combination is G1 | G2 | G5, then G1 is first selected as the criterion genre and the counters for G1-G2 and G1-G5 are each increased by one. Next, G2 is selected as the criterion genre and the counter for G2-G5 is increased by one. After all the values are obtained, the rows and columns are normalised [1]. User ratings are predicted from the user's preferred genres, which are obtained explicitly, and the genre correlations of the genres the movie belongs to.

### How our approach differentiates:

A) Collaborative Filtering:
To measure user similarity, we use the Pearson correlation as above. We also determine the effect of demographics (age and gender) on user similarity and thereby on movie rating predictions. 

B) Genre Correlation:
Previous approaches have the user's preferred genres explicitly defined, much like Netflix or MovieLens, which ask new users to provide their preferred genres. Since we do not have that data available, we determine each user's preferred genres from the Pearson similarity calculated above. The three genres found to be most prevalent among the user's neighbors are assigned as that user's preferred genres. 

### Our Approach: 

### Dataset Used:

The MovieLens 100k dataset has been used in this project. 
The u.base file contains the user id, movie id and the corresponding rating. 
The u.genre file contains the list of all genres. 
The u.item file has data about the movie id, movie name, release date, IMDb link and a boolean vector representing the combination of genres the movie belongs to. 
The u.user file has information such as user id, age, sex, occupation and zip code.

We organized users, movies and ratings into separate classes.
The class structure is as follows:
### User Class
<img src="user.jpg",width=500,height=500>

### Movie Class
<img src="movie.jpg",width=500,height=500>

### Dataset Info
<img src="info.jpg",width=500,height=600>


In [6]:
class User:
    def __init__(self,user_id,age,sex,occupation,zipcode):
        self.id = user_id
        self.age = age
        self.sex = sex
        self.occupation = occupation
        self.zipcode = zipcode
        self.avg_rating = 0
        self.pref_genre = []

class Movie:
    def __init__(self,movie_id,name,release_date,imdb_link,genre):
        self.id = movie_id
        self.name = name
        self.release_date = release_date
        self.imdb_link = imdb_link
        self.genre = genre  # dict: genre name -> 0/1 flag
        self.avg_rating = 0


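class Rating:
    def rating_matrix(self,rate_matrix,filename):
        # Fill rate_matrix[user][movie] from a tab-separated ratings file
        # (user id, movie id, rating, ...); ids in the file are 1-based.
        print(filename)
        with open(filename, "r") as f:
            for line in f:
                r = line.split("\t")
                rate_matrix[int(r[0])-1][int(r[1])-1] = int(r[2])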


The next step was to collect the data. We created a class that reads the values from the corresponding files and populates the User, Movie and Rating objects.

In [3]:
class Data:
    def __init__(self):
        self.genres_list = self.get_genres()
        self.genre_corr = [[0 for col in range(19)] for row in range(19)]

    def user_data(self):
        # Each line of u.user: user id|age|sex|occupation|zip code
        users = []
        with open("./ml-100k/u.user","r") as f:
            for line in f:
                # strip() removes the trailing newline from the zip code
                data = line.strip().split("|")
                new_user = User(int(data[0])-1,int(data[1]),data[2],data[3],data[4])
                users.append(new_user)
        return users

    def movies_data(self):
        movies = []
        with open("./ml-100k/u.item","r") as f:
            # making a movie object for each line
            for line in f:
                data = line.strip().split("|")
                # genre flags start at column 5, in the same order as u.genre
                genre = {}
                i = 5
                for g in self.genres_list:
                    genre[g] = int(data[i])
                    i += 1
                # column 3 is the video release date; the IMDb URL is column 4
                new_movie = Movie(int(data[0])-1,data[1],data[2],data[4],genre)
                movies.append(new_movie)
        return movies

    def get_genres(self):
        genres = []
        # getting all the genres; each line of u.genre is "name|index"
        with open("./ml-100k/u.genre", "r") as f_genre:
            for line in f_genre:
                line = line.strip()
                if line:  # skip the blank line at the end of the file
                    genres.append(line.split("|")[0])
        return genres

    def create_rating_matrix(self,rate_matrix,filename):
        r = Rating()
        #rate_matrix = np.zeros((943, 1682))
        r.rating_matrix(rate_matrix,filename)
        #print rate_matrix[0]

    def genre_correlation(self):
        # Count how often each pair of genres occurs together on one movie
        with open("./ml-100k/u.item", "r") as f:
            lines = f.readlines()
        for i in range(19):
            self.genre_corr[i][i] = 1
        for line in lines:
            data = line.split("|")
            genre = [int(x) for x in data[5:]]
            for i in range(19):
                if genre[i] == 1:
                    for j in range(i+1,19):
                        if genre[j] == 1:
                            self.genre_corr[i][j] += 1
                            self.genre_corr[j][i] += 1

        # Normalise each row by its off-diagonal total; keep the diagonal at 1
        for i in range(19):
            total = sum(self.genre_corr[i])-1
            if total != 0:
                self.genre_corr[i] = [(float(x)/total) for x in self.genre_corr[i]]
                self.genre_corr[i][i] = 1

# Predicting the movie ratings

## MoviePredict class

We create a separate class that imports the Data class defined above and retrieves the data stored in files. Next, we work on predicting the rating based on users and movies obtained from the training and testing data files.

## Movie Clustering

Since a movie can belong to more than one genre, the movies are clustered so that each one is assigned to its most important genre. We used K-means clustering for this, initialized with 19 seeds chosen by heuristics to speed up convergence. This was done with the sklearn module in Python.

## Ratings Matrix: Users to Genres

The utility matrix in a movie recommender system holds the rating each user gave to each movie. Since we try to establish a relation between movies based on their genres, it is crucial to determine which genres are more interesting to users. To achieve this, we use the ratings available in the training data. For each user-movie pair, the prevalent genre of the movie is determined from the clustering created above, and the rating is added to the list for that user-genre pair.
After all the ratings in the training data have been analyzed, we have a matrix that stores all the ratings given by each user to each genre. Next, for every user-genre pair, we average those ratings, so that we end up with an average user-genre rating.
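
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `user_genre_averages`, the `(user_id, movie_id, rating)` tuple layout and the `movie_cluster` mapping are all assumptions made for the example.

```python
from collections import defaultdict

def user_genre_averages(ratings, movie_cluster):
    """Average the ratings each user gave to each genre cluster.

    ratings: iterable of (user_id, movie_id, rating) tuples (assumed layout)
    movie_cluster: maps movie_id -> index of the movie's prevalent genre
    """
    totals = defaultdict(lambda: [0.0, 0])  # (user, genre) -> [sum, count]
    for user_id, movie_id, rating in ratings:
        key = (user_id, movie_cluster[movie_id])
        totals[key][0] += rating
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

# Toy data: user 0 rated two movies in cluster 7 and one in cluster 3
ratings = [(0, 0, 4), (0, 1, 2), (0, 2, 5), (1, 0, 3)]
movie_cluster = {0: 7, 1: 7, 2: 3}
avg = user_genre_averages(ratings, movie_cluster)
print(avg[(0, 7)])  # average of 4 and 2
```

User-genre pairs with no ratings simply do not appear in the result, which reflects the sparsity of the underlying data.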

## Average User Ratings

For every user, the average of all the ratings that user gave to movies is calculated and stored in the corresponding user object.

## Normalize Ratings

We normalize the ratings, i.e. subtract the user's average rating from the user's genre ratings, to account for user personalities. Say one user is highly optimistic and gives high ratings (say, 3-5) to all the movies he has watched, while another user is critical and gives a rating of 3 to a movie he really liked and a rating of 1 to one he disliked. These two users may have similar tastes, but if we do not normalize the ratings, they may seem far from similar.
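
A small sketch of this effect, using made-up numbers: the helper `normalize`, the genre names and the user averages (4.0 and 2.0) are illustrative assumptions, not values from the dataset.

```python
def normalize(genre_ratings, user_avg):
    """Subtract a user's average rating from each of their genre ratings."""
    return {genre: r - user_avg for genre, r in genre_ratings.items()}

# A generous user and a critical user with the same relative preferences
generous = {"Comedy": 5.0, "Horror": 3.0}  # assumed average rating: 4.0
critical = {"Comedy": 3.0, "Horror": 1.0}  # assumed average rating: 2.0

norm_generous = normalize(generous, 4.0)
norm_critical = normalize(critical, 2.0)
print(norm_generous == norm_critical)  # the two profiles coincide
```

After subtracting each user's own average, both profiles become `{"Comedy": 1.0, "Horror": -1.0}`, so the two users look identical to any similarity measure applied afterwards.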

## Pearson Similarity

After we calculate the normalized ratings for each user-genre pair, we find the similarity between users based on these ratings using the Pearson correlation coefficient defined above.
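
One possible way to compute this similarity for two users' rating vectors is sketched below. The function name `pearson` is hypothetical, the two vectors are assumed to be aligned over the same genres, and returning 0.0 for a constant vector is an assumption made here to avoid dividing by zero.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

same_taste = pearson([1, 2, 3], [2, 4, 6])      # ratings move together
opposite_taste = pearson([1, 2, 3], [3, 2, 1])  # ratings move in opposite directions
```

Because the coefficient subtracts each vector's mean, users who rate on different scales but rank genres the same way still come out as highly similar.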

## Bias From Mean Average - Prediction I

To determine the user rating for a particular movie, we use a weighted average of the ratings provided by similar users, namely the top 150 users with the highest Pearson correlation coefficient with the user in question.
To predict a user's rating for a movie, we determine the genre of the movie from the clustering described above. The predicted rating is then the user's average rating for that genre plus the sum, over the neighbors, of similarity times normalized rating, divided by the sum of the similarities of all the neighbors.
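
The prediction rule above can be sketched as follows. The function name `predict_rating` and the `(similarity, normalized_rating)` pair layout are assumptions for illustration. The denominator uses absolute similarities, which is a choice made in this sketch to keep it positive if a neighbor has a negative correlation; with only positively correlated neighbors it equals the plain sum described above.

```python
def predict_rating(user_genre_avg, neighbors):
    """Predict a rating from the user's genre average and neighbor deviations.

    user_genre_avg: the user's average rating for the movie's genre
    neighbors: list of (similarity, normalized_rating) pairs, one per neighbor
    """
    weighted = sum(sim * norm for sim, norm in neighbors)
    total_sim = sum(abs(sim) for sim, _ in neighbors)
    if total_sim == 0:
        return user_genre_avg  # no usable neighbors: fall back to the average
    return user_genre_avg + weighted / total_sim

# Two neighbors: one liked the genre more than usual, one slightly less
neighbors = [(0.9, 1.0), (0.6, -0.5)]
prediction = predict_rating(3.5, neighbors)
print(prediction)  # roughly 3.9
```

In the project the neighbor list would hold the top 150 most similar users; two are used here only to keep the arithmetic easy to follow.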

## User preferred genres

Since we do not have explicit information about the preferred genres of each user, we determine the majority genres among similar users. 
We again determine the top 150 users with the highest Pearson correlation with the user in question. For each neighbor and the user himself, we take into account the three genres with the highest normalized ratings. The three genres with the highest cumulative score are then designated as the user's preferred genres.
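
A rough sketch of this voting step is shown below. The text does not spell out how the cumulative score is accumulated, so this example assumes that each appearance in someone's top three counts as one vote; the function name `preferred_genres` and the `norm_ratings` layout are likewise hypothetical.

```python
from collections import Counter

def preferred_genres(norm_ratings, user, neighbors, top_n=3):
    """Pick the genres that appear most often in the top-N lists of a user
    and their neighbors.

    norm_ratings: maps user id -> {genre: normalized rating}
    """
    votes = Counter()
    for u in [user] + list(neighbors):
        # Genres of user u, ordered from highest to lowest normalized rating
        ranked = sorted(norm_ratings[u], key=norm_ratings[u].get, reverse=True)
        votes.update(ranked[:top_n])  # assumption: one vote per appearance
    return [genre for genre, _ in votes.most_common(top_n)]

norm_ratings = {
    0: {"Action": 1.2, "Comedy": 0.8, "Drama": 0.1, "Horror": -0.5},
    1: {"Action": 0.9, "Comedy": 0.4, "Horror": 0.3, "Drama": -0.2},
    2: {"Action": 0.7, "Drama": 0.6, "Comedy": 0.5, "Horror": -1.0},
}
prefs = preferred_genres(norm_ratings, user=0, neighbors=[1, 2])
print(sorted(prefs))
```

Here Action and Comedy each receive three votes, Drama two and Horror one, so Horror is left out of user 0's preferred genres.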

## Computing Genre Correlation

We use two approaches to determine genre correlation:

1). Since a movie belongs to at least one genre and often to several, say G1, G2 and G5, there is some relation between the genres that appear together. To account for this relation, we create a genre-by-genre matrix. Taking G1 as the primary genre, we increment the value of G1-G2 by 1 and G1-G5 by 1. Next, considering G2 as the primary genre, we increment G2-G5 by 1. After accounting for the genre correlations of all the movies in the dataset, we normalize the resulting correlation matrix.

2). The above approach has a limitation: it relies on genre combinations to establish connections between genres, which makes it highly dependent on the available training set. We try to alleviate this using WordNet, which provides similarity between words based on inherited hypernyms.

## Genre Correlation Ratings - Prediction II

The first step here is to compute the average rating of each movie and store it in the corresponding movie object. 
To predict a rating for a user-movie pair, we sum the correlation values over all pairs of a user-preferred genre and a genre of the movie under consideration. This value is then multiplied by the average movie rating and divided by the number of user-preferred genres, which in our case is fixed at 3 but in real-world scenarios can vary from user to user.
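
The calculation described above can be sketched as follows. The function name `predict_by_genre_corr`, the use of integer genre indices and the small 3x3 correlation matrix are assumptions for illustration; the real matrix has 19 genres.

```python
def predict_by_genre_corr(pref_genres, movie_genres, genre_corr, movie_avg):
    """Scale a movie's average rating by how strongly its genres correlate
    with the user's preferred genres.

    pref_genres, movie_genres: lists of genre indices
    genre_corr: square matrix of genre-genre correlation values
    """
    total = sum(genre_corr[p][m] for p in pref_genres for m in movie_genres)
    return total * movie_avg / len(pref_genres)

# Toy correlation matrix for three genres (diagonal fixed at 1)
genre_corr = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
# User prefers all three genres; the movie belongs only to genre 0
prediction = predict_by_genre_corr([0, 1, 2], [0], genre_corr, movie_avg=3.0)
print(prediction)  # (1.0 + 0.5 + 0.2) * 3.0 / 3
```

A movie whose genres correlate weakly with the user's preferred genres is scaled down from its average rating, while one that matches them closely stays near or above it.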

In [5]:
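    # Method of the MoviePredict class; assumes `import numpy as np` and
    # `from sklearn.cluster import KMeans` at the top of the module.
    def movieClustering(self):
        # One row per movie: its 0/1 genre flags in the order of self.genres
        movie_genre = []
        for m in self.movies:
            mg = []
            for g in self.genres:
                mg.append(m.genre[g])
            movie_genre.append(mg)

        movie_genre = np.array(movie_genre)
        # Cluster the movies into as many clusters as there are genres
        self.movie_cluster = KMeans(n_clusters=len(self.genres)).fit_predict(movie_genre)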


### References:

1) A Content Recommendation System Based on Category Correlations; Sang-Min Choi, Yo-Sub Han