# SWE-546 Final Project

# Contributors

Emre Bolat <br />
Koray Bostancı <br />
Taygun Gökdemir <br />
<br />


# Recommending Movies from the MovieLens Dataset

The MovieLens 1M data files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000. All ratings are in the file "ratings.dat", user information is in "users.dat", and movie information is in "movies.dat".


In [1]:
import pandas as pd
import numpy as np

# Building the data model

First, we read the rating and movie data from the files using pandas.
The next step is merging the movie and rating tables.

We built a data model named MovieLensData to encapsulate the movie and rating data together.
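As a minimal sketch of the read-then-merge step described above, the snippet below parses `::`-separated text from in-memory strings instead of the real files. The rows are made-up toy data in the same layout, and the variable names are illustrative assumptions. With no `on` argument, `pd.merge` joins on the columns the two tables share, here `movie_id`.

```python
import io
import pandas as pd

# Toy rows in the '::'-separated layout of movies.dat and ratings.dat
# (illustrative values only, not taken from the real files).
movies_raw = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)
ratings_raw = "1::1::5::100\n6::1::4::200\n6::2::3::300\n"

# A multi-character separator requires the Python parsing engine.
movies = pd.read_csv(io.StringIO(movies_raw), sep='::', header=None,
                     names=['movie_id', 'title', 'genres'], engine='python')
ratings = pd.read_csv(io.StringIO(ratings_raw), sep='::', header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'],
                      engine='python')

# Inner join on the only shared column, 'movie_id':
# one output row per rating, enriched with the movie's title and genres.
merged = pd.merge(movies, ratings)
print(merged)
```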

In [2]:
class MovieLensData :
    
    def __init__(self):
        movie_cols = ['movie_id', 'title', 'genres']
        # movies.dat has exactly three '::'-separated fields, so no usecols is needed
        movie_data = pd.read_table('../data/ml-1m/movies.dat', sep='::', names=movie_cols, header=None, engine='python')
        
        rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
        rating_data = pd.read_table('../data/ml-1m/ratings.dat', sep='::', names=rating_cols, header=None, engine='python')
        
        movie_rating_data = pd.merge(movie_data, rating_data)
        self.allratings = movie_rating_data

In [3]:
movielens = MovieLensData()

In [4]:
movielens.allratings[:10]

| | movie_id | title | genres | user_id | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 1 | 5 | 978824268 |
| 1 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 6 | 4 | 978237008 |
| 2 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 8 | 4 | 978233496 |
| 3 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 9 | 5 | 978225952 |
| 4 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 10 | 5 | 978226474 |
| 5 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 18 | 4 | 978154768 |
| 6 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 19 | 5 | 978555994 |
| 7 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 21 | 3 | 978139347 |
| 8 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 23 | 4 | 978463614 |
| 9 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 26 | 3 | 978130703 |


In [5]:
class MovieLensHelper :
    
    def __init__(self):
        pass
    
    @staticmethod
    def getRatingCountOfMovies(movielens):
        return movielens.allratings['title'].value_counts()
    
    @staticmethod
    def getMoviesFilteredByRatingCount(movielens, rating_greater_than=1):
        allratings_indexed = movielens.allratings.set_index('title')
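        # .ix was removed from pandas; select rows whose title has enough ratings
        rating_counts = movielens.allratings['title'].value_counts()
        popular_titles = rating_counts[rating_counts > rating_greater_than].index
        allratings_filtered = allratings_indexed[allratings_indexed.index.isin(popular_titles)]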
        
        movie_titles = allratings_filtered.index.unique()
        
        allratings_new = allratings_filtered.drop(['timestamp','genres','movie_id'],axis=1)
        
        return movie_titles, allratings_new
      
    @staticmethod
    def getAverageRatingsOfUsersByMovieId(movielens):
        # Average only the numeric 'rating' column for each user
        avg_user_rating = movielens.allratings.groupby('user_id')[['rating']].mean()
        avg_user_rating.columns = ['avg_rating']
        
        return avg_user_rating
    
    @staticmethod
    def mergeDataSets(dataset1, dataset2, index_column=''):
        merged_dataset = pd.merge(dataset1.reset_index(),dataset2.reset_index())
        merged_dataset = merged_dataset.set_index(index_column)
        
        return merged_dataset
    
    @staticmethod
    def setIndex(dataset, index_columns=[]):
        ds = dataset.reset_index().set_index(index_columns)
        
        return ds

The list below shows each movie title and the number of ratings it received.

In [6]:
MovieLensHelper.getRatingCountOfMovies(movielens)

American Beauty (1999)                                       3428
Star Wars: Episode IV - A New Hope (1977)                    2991
Star Wars: Episode V - The Empire Strikes Back (1980)        2990
Star Wars: Episode VI - Return of the Jedi (1983)            2883
Jurassic Park (1993)                                         2672
Saving Private Ryan (1998)                                   2653
Terminator 2: Judgment Day (1991)                            2649
Matrix, The (1999)                                           2590
Back to the Future (1985)                                    2583
Silence of the Lambs, The (1991)                             2578
Men in Black (1997)                                          2538
Raiders of the Lost Ark (1981)                               2514
Fargo (1996)                                                 2513
Sixth Sense, The (1999)                                      2459
Braveheart (1995)                                            2443
Shakespear

In [7]:
movie_titles, ratings_filtered = MovieLensHelper.getMoviesFilteredByRatingCount(movielens,100)
ratings_filtered[:10]

| title | user_id | rating |
|---|---|---|
| Toy Story (1995) | 1 | 5 |
| Toy Story (1995) | 6 | 4 |
| Toy Story (1995) | 8 | 4 |
| Toy Story (1995) | 9 | 5 |
| Toy Story (1995) | 10 | 5 |
| Toy Story (1995) | 18 | 4 |
| Toy Story (1995) | 19 | 5 |
| Toy Story (1995) | 21 | 3 |
| Toy Story (1995) | 23 | 4 |
| Toy Story (1995) | 26 | 3 |


In [8]:
movie_titles

array(['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',
       ..., 'Meet the Parents (2000)', 'Requiem for a Dream (2000)',
       'Contender, The (2000)'], dtype=object)
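The count-based filter used above can be sketched on toy data. This is a minimal illustration, not the project's helper: the frame `toy` and the threshold of 1 are assumptions chosen so the effect is easy to see.

```python
import pandas as pd

# Toy ratings: 'A' has 3 ratings, 'B' has 2, 'C' has 1.
toy = pd.DataFrame({
    'title':   ['A', 'A', 'A', 'B', 'B', 'C'],
    'user_id': [1, 2, 3, 1, 2, 1],
    'rating':  [5, 4, 3, 2, 5, 1],
})

# Number of ratings per title, sorted from most to least rated.
counts = toy['title'].value_counts()

# Keep only the titles with more than one rating.
keep = counts[counts > 1].index
filtered = toy[toy['title'].isin(keep)].set_index('title')

print(filtered.index.unique())
```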

# Adding Avg. User Ratings column 

We add a column holding each user's average rating across all movies that user rated. This is not required for the Pearson correlation, but it is needed by the Adjusted Cosine Similarity metric, which is explored later. To compute it, the ratings table is grouped by user and the ratings in each group are averaged.
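A minimal sketch of this aggregation on toy data follows. The frame `toy` and the column name `avg_rating` are assumptions for illustration; the idea is a group-by mean per user that is then merged back onto every rating row.

```python
import pandas as pd

# Toy ratings: user 1 rated two movies, user 2 rated three.
toy = pd.DataFrame({
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'user_id': [1, 1, 2, 2, 2],
    'rating':  [5, 3, 4, 2, 3],
})

# One row per user with the mean of that user's ratings.
avg = toy.groupby('user_id')['rating'].mean().rename('avg_rating').reset_index()

# Attach the per-user average to each individual rating.
with_avg = toy.merge(avg, on='user_id')
print(with_avg)
```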


In [9]:
avg_user_rating = MovieLensHelper.getAverageRatingsOfUsersByMovieId(movielens)
avg_user_rating[:10]

| user_id | avg_rating |
|---|---|
| 1 | 4.188679 |
| 2 | 3.713178 |
| 3 | 3.901961 |
| 4 | 4.190476 |
| 5 | 3.146465 |
| 6 | 3.901408 |
| 7 | 4.322581 |
| 8 | 3.884892 |
| 9 | 3.735849 |
| 10 | 4.114713 |


In [10]:
merged_ratings = MovieLensHelper.mergeDataSets(ratings_filtered, avg_user_rating, 'title')
merged_ratings[:10]

| title | user_id | rating | avg_rating |
|---|---|---|---|
| Toy Story (1995) | 1 | 5 | 4.188679 |
| Pocahontas (1995) | 1 | 5 | 4.188679 |
| Apollo 13 (1995) | 1 | 5 | 4.188679 |
| Star Wars: Episode IV - A New Hope (1977) | 1 | 4 | 4.188679 |
| Schindler's List (1993) | 1 | 5 | 4.188679 |
| Secret Garden, The (1993) | 1 | 4 | 4.188679 |
| Aladdin (1992) | 1 | 4 | 4.188679 |
| Snow White and the Seven Dwarfs (1937) | 1 | 4 | 4.188679 |
| Beauty and the Beast (1991) | 1 | 5 | 4.188679 |
| Fargo (1996) | 1 | 4 | 4.188679 |


# The Hierarchical Index

With a hierarchical index on (title, user_id), a movie can be selected by title to get a table of that movie's ratings, indexed by user_id.
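A small sketch of this selection on toy data, assuming a two-level index built with `set_index`. The frame below is illustrative and much smaller than the project's table.

```python
import pandas as pd

toy = pd.DataFrame({
    'title':   ['Fargo (1996)', 'Fargo (1996)', 'Toy Story (1995)'],
    'user_id': [1, 8, 1],
    'rating':  [4, 5, 5],
})

# Two-level (hierarchical) index: title first, then user_id.
indexed = toy.set_index(['title', 'user_id']).sort_index()

# Selecting on the first level drops it and leaves user_id as the index.
fargo = indexed.loc['Fargo (1996)']
print(fargo)
```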

In [11]:
rs = MovieLensHelper.setIndex(merged_ratings, ['title','user_id'])
rs.loc['Fargo (1996)'][:10]

| user_id | rating | avg_rating |
|---|---|---|
| 1 | 4 | 4.188679 |
| 8 | 5 | 3.884892 |
| 9 | 4 | 3.735849 |
| 23 | 4 | 3.315789 |
| 28 | 4 | 3.757009 |
| 36 | 5 | 4.19943 |
| 48 | 4 | 3.068562 |
| 56 | 5 | 3.970149 |
| 65 | 3 | 4.347107 |
| 76 | 5 | 4.172414 |


# Pearson Correlation Coefficient

Unlike the Euclidean Distance similarity score (which is scaled from 0 to 1), the Pearson Correlation Coefficient measures how strongly two variables are linearly correlated and ranges from -1 to +1. A coefficient of +1 indicates that the data objects are perfectly positively correlated, -1 indicates that they are perfectly negatively correlated, and 0 indicates no linear correlation. In other words, the Pearson Correlation score quantifies how well two data objects fit a line.
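To make the -1 / 0 / +1 interpretation concrete, here is a minimal sketch that computes the coefficient directly with NumPy. The helper name `pearson` and the toy vectors are assumptions for illustration and are not part of the project code.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return float(dx @ dy / np.sqrt((dx @ dx) * (dy @ dy)))

x = [1, 2, 3, 4, 5]
rising = pearson(x, [2, 4, 6, 8, 10])    # y grows linearly with x
falling = pearson(x, [10, 8, 6, 4, 2])   # y shrinks linearly with x
no_trend = pearson(x, [1, 2, 3, 2, 1])   # symmetric bump, no linear trend

print(rising, falling, no_trend)
```

Up to floating-point rounding, the three values are +1, -1, and 0 respectively.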





Requirements for Pearson's correlation coefficient:

- The scale of measurement should be interval or ratio.
- The variables should be approximately normally distributed.
- The association should be linear.
- There should be no outliers in the data.



![Pearson correlation coefficient formula](https://dl.dropboxusercontent.com/u/58490833/Capture.jpg)
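The linked image is not reproduced here. Assuming it shows the standard sample form of the coefficient, then for two movies with ratings $x_k$ and $y_k$ from the $n$ users who rated both, and with $\bar{x}$, $\bar{y}$ the means of those ratings, the definition can be written as:

```latex
r_{xy} = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}
              {\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2}\,
               \sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}}
```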


# Adjusted Cosine Similarity

One fundamental difference between the similarity computation in user-based collaborative filtering (CF) and item-based CF is that in user-based CF the similarity is computed along the rows of the rating matrix, whereas in item-based CF it is computed along the columns, i.e., each pair in the co-rated set corresponds to a different user. Computing similarity with the basic cosine measure in the item-based case has one important drawback: differences in rating scale between users are not taken into account. The adjusted cosine similarity offsets this drawback by subtracting the corresponding user average from each co-rated pair. Formally, the similarity between items i and j using this scheme is given by
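A minimal sketch of the difference between plain and adjusted cosine, using made-up ratings from three users for two items. The per-user averages in `user_avg` are assumed values standing in for each user's mean over everything that user rated.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# One entry per co-rating user (toy values).
ratings_i = np.array([5.0, 4.0, 2.0])   # ratings of item i
ratings_j = np.array([4.0, 5.0, 1.0])   # ratings of item j
user_avg = np.array([4.0, 4.5, 2.0])    # assumed mean rating of each user

# Plain cosine ignores that some users rate everything high or low.
plain = cosine(ratings_i, ratings_j)

# Adjusted cosine first removes each user's own baseline.
adjusted = cosine(ratings_i - user_avg, ratings_j - user_avg)

print(plain, adjusted)
```

In this toy case the plain cosine is close to 1 because all raw ratings are positive, while the adjusted value is slightly negative: relative to their own averages, the users do not agree on the two items.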

![Adjusted cosine similarity formula](https://dl.dropboxusercontent.com/u/58490833/Capture2.jpg)
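The linked image is not reproduced here. Assuming it shows the usual item-based definition, with $U$ the set of users who rated both items, $R_{u,i}$ the rating of user $u$ for item $i$, and $\bar{R}_u$ the average rating of user $u$, the similarity can be written as:

```latex
\mathrm{sim}(i, j) =
  \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}
       {\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,
        \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
```

In the code, the differences $R_{u,i} - \bar{R}_u$ and $R_{u,j} - \bar{R}_u$ correspond to `rating_x - avg_rating_x` and `rating_y - avg_rating_y` on the joined table.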



In [12]:
class SimilarityFinder :
    
    def __init__(self):
        pass
    
    @staticmethod
    def join(rs, movie1_title, movie2_title):
        # .ix was removed from pandas; .loc selects by the first index level (title)
        movie1 = rs.loc[movie1_title].reset_index()
        movie2 = rs.loc[movie2_title].reset_index()
        
        mm = pd.merge(movie1, movie2, on='user_id').drop('user_id', axis=1)
        
        return mm
    
    @staticmethod
    def findPearsonSimilarityValue(rs, movie1_title, movie2_title):
        data = SimilarityFinder.join(rs, movie1_title, movie2_title)
        
        data['rating_x'] = data['rating_x'].astype('float32')
        data['rating_y'] = data['rating_y'].astype('float32')
        
        pcorr = data.corr(method='pearson')
        
        return pcorr.loc['rating_x', 'rating_y']
    
    @staticmethod
    def findAdjustedCosineSimilarityValue(rs, movie1_title, movie2_title, verbose=False):
        data = SimilarityFinder.join(rs, movie1_title, movie2_title)
        
        diff_xx = data['rating_x'] - data['avg_rating_x']
        diff_yy = data['rating_y'] - data['avg_rating_y']
        
        num  = (diff_xx).dot((diff_yy))
        den1 = (diff_xx).dot((diff_xx))
        den2 = (diff_yy).dot((diff_yy))

        sim = num / np.sqrt(den1*den2)
        if sim > 1.0: sim = 1.0 
        return sim
    
    @staticmethod
    def findBestMatches(rs, movie_title, movie_titles, ntop=100, similarity_function_name='pearson'):
        score = []
        
        for movie_title2 in movie_titles:
            if movie_title == movie_title2: 
                continue
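            if similarity_function_name == 'pearson':
                corr = SimilarityFinder.findPearsonSimilarityValue(rs, movie_title, movie_title2)
            elif similarity_function_name == 'adjustedcosine':
                corr = SimilarityFinder.findAdjustedCosineSimilarityValue(rs, movie_title, movie_title2)
            else:
                # Fail fast instead of reaching the null check with an undefined corr
                raise ValueError('Unknown similarity function: ' + similarity_function_name)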
            if pd.isnull(corr): 
                continue
            
            score += [(corr, movie_title2)]
            
        score.sort()
        score.reverse()
        
        return score[0:ntop]

Find the similarity between the movies 'Fargo (1996)' and 'Saving Private Ryan (1998)' using the Pearson method.

In [13]:
value = SimilarityFinder.findPearsonSimilarityValue(rs, 'Fargo (1996)', 'Saving Private Ryan (1998)')
value

0.095240647626515379

Find the similarity between the movies 'Fargo (1996)' and 'Saving Private Ryan (1998)' using the Adjusted Cosine method.

In [14]:
value = SimilarityFinder.findAdjustedCosineSimilarityValue(rs, 'Fargo (1996)', 'Saving Private Ryan (1998)')
value

0.38085497648191285

Find the 10 movies most similar to 'Fargo (1996)' using the Pearson method.

In [15]:
%time score = SimilarityFinder.findBestMatches(rs, 'Fargo (1996)', movie_titles, 10, 'pearson')

Wall time: 1min 18s


In [16]:
score

[(0.47018213897196393, 'Nights of Cabiria (Le Notti di Cabiria) (1957)'),
 (0.46367319289985065, 'Melvin and Howard (1980)'),
 (0.44939942927904264, "All the King's Men (1949)"),
 (0.42237735941287047,
  'Paradise Lost: The Child Murders at Robin Hood Hills (1996)'),
 (0.421200275072797, 'Killing, The (1956)'),
 (0.408153879371079, 'Waking the Dead (1999)'),
 (0.39059679471634945, 'Trouble with Harry, The (1955)'),
 (0.3894528439543139, 'Paths of Glory (1957)'),
 (0.38863877131901914, 'Big One, The (1997)'),
 (0.38669117024690125,
  'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)')]
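The pairwise loop above took over a minute. One possible alternative, which is not what this project uses, is to pivot the ratings into a user-by-title matrix and let pandas compute every Pearson correlation against one column with `DataFrame.corrwith`. The sketch below uses toy data; the frame `toy` and the titles 'A', 'B', 'C' are assumptions. Each correlation is computed only over the users who rated both movies, because missing ratings are `NaN` and are skipped pairwise.

```python
import pandas as pd

# Long-format toy ratings; user 4 did not rate 'C'.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3],
    'title':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'rating':  [5, 4, 2, 1, 4, 5, 1, 2, 1, 2, 5],
})

# Rows are users, columns are titles, missing ratings become NaN.
pivot = toy.pivot_table(index='user_id', columns='title', values='rating')

# Pearson correlation of every title with 'A', best match first.
sims = pivot.corrwith(pivot['A']).drop('A').sort_values(ascending=False)
print(sims)
```

On the real data the target column would be `pivot['Fargo (1996)']`, and the top `ntop` entries of `sims` would play the role of `score`. Whether the results match the loop exactly depends on details such as the rating-count filter, so this is a starting point rather than a drop-in replacement.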

Find the 10 movies most similar to 'Fargo (1996)' using the Adjusted Cosine method.

In [17]:
%time score = SimilarityFinder.findBestMatches(rs, 'Fargo (1996)', movie_titles, 10, 'adjustedcosine')

Wall time: 1min 14s


In [18]:
score

[(0.71294368771014205, 'Paths of Glory (1957)'),
 (0.63484085532685475, 'Killing, The (1956)'),
 (0.63318022401381446,
  'Paradise Lost: The Child Murders at Robin Hood Hills (1996)'),
 (0.62924653905609318, 'Blood Simple (1984)'),
 (0.62655884251878458, 'Creature Comforts (1990)'),
 (0.60484119655782287,
  'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)'),
 (0.59152707197108478, 'Stop Making Sense (1984)'),
 (0.58788177068162284, 'Close Shave, A (1995)'),
 (0.58563761487437915,
  'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)'),
 (0.58013425054500944, 'When We Were Kings (1996)')]

Note: The MovieLens 1M dataset is used in this project to avoid performance issues.