### Import Libraries

In [1]:
import pandas as pd
import os, io
import numpy as np
from pandas import Series, DataFrame, read_table
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error
%matplotlib inline

### Extracting the Dataset from the Zip File

In [2]:
import zipfile
with zipfile.ZipFile('ml-100k.zip', 'r') as zip_ref:
    zip_ref.extractall(".")

We start by exploring the data set of movie ratings; our interest lies particularly in the ratings. Let's see how we can recommend the most popular (that is, most often rated) movies.

In [3]:
#Load the Ratings data
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = read_table('ml-100k//u.data',header=None,sep='\t')
ratings.columns = r_cols

i_cols = ['movie_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = read_table('./ml-100k//u.item', sep='|',names=i_cols,
 encoding='latin-1')

In [4]:
ratings.head()

|   | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |


`ratings` is a DataFrame that holds all four columns of the u.data file from the ml-100k dataset.
The head() function displays its first five rows.

In [5]:
## Let's Build a Popularity-Based Recommender System

After our initial exploration, we decided that the ideal data would have the movie titles alongside the ratings.



We will use the pd.merge function, which combines data on common columns or indices.

In [6]:
new_data = pd.merge(items,ratings,on='movie_id')
new_data  = new_data[['movie_id','movie title','user_id','rating']]

In [7]:
new_data.head()

|   | movie_id | movie title | user_id | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 308 | 4 |
| 1 | 1 | Toy Story (1995) | 287 | 5 |
| 2 | 1 | Toy Story (1995) | 148 | 4 |
| 3 | 1 | Toy Story (1995) | 280 | 4 |
| 4 | 1 | Toy Story (1995) | 66 | 3 |


`new_data` is a DataFrame that stores the result of pd.merge: the items and ratings tables joined on movie_id and reduced to four columns.
The head() function displays its first five rows.

To recommend movies by popularity, we will follow these steps:
- Find the users who rated each movie
- Count the number of times each movie has been rated (its score)
- [Rank](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rank.html) the scores (counts)
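
The counting and ranking steps can be tried on a toy table first. This is a minimal sketch; the `toy` DataFrame and its values are hypothetical and not part of ml-100k. It shows how `rank(method='first')` gives tied scores distinct ranks in order of appearance.

```python
import pandas as pd

# Hypothetical toy data: each row is one rating event (not from ml-100k).
toy = pd.DataFrame({
    'movie title': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'user_id':     [1, 2, 3, 1, 2, 2, 3],
})

# Count how many users rated each movie.
counts = toy.groupby('movie title')['user_id'].count().reset_index(name='score')

# Sort by score (descending), then by title (ascending) so ties are alphabetical.
counts = counts.sort_values(['score', 'movie title'], ascending=[False, True])

# method='first' assigns distinct ranks to tied scores in order of appearance.
counts['Rank'] = counts['score'].rank(ascending=False, method='first')
print(counts)
```

Movies B and C both have a score of 2, yet they receive ranks 2.0 and 3.0 rather than a shared rank.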

In [8]:
def popularity(train, title, ids):
    # Count the ratings (one per user) for each movie title.
    train_data_grouped = train.groupby([title])[ids].count().reset_index()
    train_data_grouped.rename(columns={ids: 'score'}, inplace=True)
    # Sort by score (descending), breaking ties alphabetically by title.
    train_data_sort = train_data_grouped.sort_values(['score', title], ascending=[False, True])
    # Rank the scores; method='first' gives tied scores distinct ranks.
    train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=False, method='first')
    # Keep the ten most-rated movies.
    popularity_recommendations = train_data_sort.head(10)
    return popularity_recommendations

In [9]:
popularity(new_data,'movie title','user_id')

|   | movie title | score | Rank |
|---|---|---|---|
| 1398 | Star Wars (1977) | 584 | 1.0 |
| 333 | Contact (1997) | 509 | 2.0 |
| 498 | Fargo (1996) | 508 | 3.0 |
| 1234 | Return of the Jedi (1983) | 507 | 4.0 |
| 860 | Liar Liar (1997) | 485 | 5.0 |
| 460 | English Patient, The (1996) | 481 | 6.0 |
| 1284 | Scream (1996) | 478 | 7.0 |
| 1523 | Toy Story (1995) | 452 | 8.0 |
| 32 | Air Force One (1997) | 431 | 9.0 |
| 744 | Independence Day (ID4) (1996) | 429 | 10.0 |


## Drawback

Having recommended the movies, we can immediately conclude that the major drawback of such a system is the **lack of personalization**: every user receives the same list.

## **Collaborative Filtering**
 
In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if person A has the same opinion as person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.

## Types of Collaborative Filtering
 
### User-Based Collaborative Filtering
 
In this type, we find look-alike customers (based on similarity) and offer products that the first customer's look-alike chose in the past. This algorithm is very effective but takes a lot of time and resources, because it computes similarity information for every pair of customers. Therefore, for platforms with a large user base, this algorithm is hard to implement without a very strong parallelizing system.
 
1. Build a matrix of things each user bought or viewed or rated
2. Compute similarity scores between users
3. Find users similar to you
4. Recommend stuff they bought or viewed or rated that you haven’t yet
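
The four steps above can be sketched on a small example. This is only an illustration under stated assumptions: the matrix `R`, the user labels, and the choice of Pearson correlation with a single nearest neighbour are all hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Step 1: hypothetical utility matrix; rows are users, columns are items, NaN = not rated.
R = pd.DataFrame(
    {'Item A': [5, 4, 1, np.nan],
     'Item B': [4, 5, 2, 1],
     'Item C': [1, 2, 5, 4],
     'Item D': [np.nan, 5, np.nan, 2]},
    index=['you', 'u1', 'u2', 'u3'])

# Step 2: Pearson similarity between users (corr works on columns, so transpose).
user_sim = R.T.corr(method='pearson')

# Step 3: the most similar other user.
neighbours = user_sim['you'].drop('you').sort_values(ascending=False)
best = neighbours.index[0]

# Step 4: items the neighbour rated that 'you' have not rated yet.
mask = R.loc['you'].isna() & R.loc[best].notna()
recommendations = R.loc[best][mask].sort_values(ascending=False)
print(best, recommendations.to_dict())
```

Here `u1` agrees most with `you` on the items both rated, so the only item `u1` rated that `you` have not (Item D) becomes the recommendation.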
 
### Problems 
1. People are fickle, so their tastes tend to change
2. There are usually more people than things

### Item-Based Collaborative Filtering
 
It is quite similar to the previous algorithm, but instead of finding customer look-alikes, it tries to find items that look alike. Once we have an item look-alike matrix, we can easily recommend similar items to customers who have purchased an item from the store. This algorithm is far less resource-consuming than user-based collaborative filtering. 
 
1. Find every pair of movies that were watched by the same person
2. Measure the similarity of rating across all the users who watched both
3. Sort movies by the similarity strength
 


Let's get started with building our item-based collaborative recommender system. For convenience, let's split this into two parts. 

- To find similarities between items
- To recommend them to users

Item-based collaborative filtering is often the more feasible solution, as the number of items is usually smaller than the number of users, which reduces the amount of computation.

**Leverage the Pandas** 
 
- To begin with, we will use the pandas pivot table to look at relationships between movies. A pivot table in pandas is an excellent tool for summarizing one or more numeric variables based on two other categorical variables.

- We start by building a utility matrix (a matrix of users against movies, with the ratings as values)
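
To see what a pivot table does before applying it to the full data, here is a minimal sketch on a hypothetical three-user table (the `toy` values are made up). Each user becomes a row, each movie a column, and user-movie pairs without a rating become NaN.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame({
    'user_id':     [1, 1, 2, 3, 3],
    'movie title': ['A', 'B', 'A', 'B', 'C'],
    'rating':      [5, 3, 4, 2, 1],
})

# Rows become users, columns become movies, cells hold the rating.
# Pairs with no rating appear as NaN.
utility = toy.pivot_table(index='user_id', columns='movie title', values='rating')
print(utility)
```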

In [11]:
movie_ratings = new_data.pivot_table(index=['user_id'],columns=['movie title'],values='rating')

In [12]:
movie_ratings.head()

| user_id | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 2.0 | 5.0 | NaN | NaN | 3.0 | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 5.0 | 3.0 | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


The above table shows the rating given by each user for each movie title. There are many NaN values because no user reviews every movie. Let’s start by looking at a favorite among geeks, Star Wars, and see how it correlates pairwise with the other movies in the table.

## Similarity Function

To decide the similarity between two items in the dataset, let's briefly look at the popular similarity functions.

### Terminology

- Let $\mathbf{r_x}$ denote the ratings of item $x$ given by the users and $\mathbf{r_y}$ the ratings of item $y$. To find the pairwise similarity between two items, the following metrics can be used:

## Cosine Index

$$sim(\mathbf{r_x},\mathbf{r_y}) = \cos(\mathbf{r_x},\mathbf{r_y}) = \dfrac{\mathbf{r_x}\cdot\mathbf{r_y}}{\|\mathbf{r_x}\|\ \|\mathbf{r_y}\|} $$

The major problem is that missing ratings, when filled with 0, are treated as negative ratings.
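
The effect of missing values on the cosine index can be checked numerically. This is a minimal sketch with hypothetical rating vectors; the helper `cosine_similarity` is defined here for illustration and is not a library function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings of two items by four users; NaN marks a missing rating.
r_x = np.array([5.0, 4.0, np.nan, 1.0])
r_y = np.array([5.0, np.nan, 4.0, 1.0])

# Filling missing ratings with 0 makes "not rated" look like a rating below 1.
sim_filled = cosine_similarity(np.nan_to_num(r_x), np.nan_to_num(r_y))

# Restricting to users who rated both items avoids that effect.
both = ~np.isnan(r_x) & ~np.isnan(r_y)
sim_corated = cosine_similarity(r_x[both], r_y[both])

print(round(sim_filled, 3), round(sim_corated, 3))
```

The two users who rated both items agree exactly, yet the zero-filled version reports a noticeably lower similarity because each missing rating is paired with a real one.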

## Pearson Index

$S_{xy}$ = the set of users $s$ who have rated both items $x$ and $y$; $r_{xs}$ is the rating of item $x$ by user $s$, and $r_{xm}$ is the mean rating of item $x$ (likewise $r_{ys}$ and $r_{ym}$ for item $y$).

$$sim(\mathbf{r_x},\mathbf{r_y})=\dfrac{\sum_{s\in S_{xy}}(r_{xs}- r_{xm})(r_{ys}- r_{ym})}{\sqrt{\sum_{s\in S_{xy}}(r_{xs}- r_{xm})^2}\ \sqrt{\sum_{s\in S_{xy}}(r_{ys}- r_{ym})^2}} $$
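
The Pearson formula can be written out by hand and compared with pandas. This is a sketch under one assumption that the text does not state: the means are taken over the co-rating users only, which is how pandas computes the correlation. The rating values and the helper `pearson_similarity` are hypothetical.

```python
import numpy as np
import pandas as pd

def pearson_similarity(r_x, r_y):
    """Pearson correlation over the users who rated both items."""
    both = r_x.notna() & r_y.notna()
    x, y = r_x[both], r_y[both]
    # Assumption: means are computed over the co-rating users only.
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical ratings of two items by five users.
r_x = pd.Series([5, 4, np.nan, 2, 1], dtype='float64')
r_y = pd.Series([4, 5, 3, np.nan, 2], dtype='float64')

manual = pearson_similarity(r_x, r_y)
builtin = r_x.corr(r_y)  # pandas also drops pairs with a missing value
print(round(manual, 4), round(builtin, 4))
```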

## Jaccard Index

$$Jaccard \ Index = \dfrac{Number \ in\  both \  sets}{Number \  in\  either \ set}  $$
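
For ratings data, one way to apply the Jaccard index is to treat each movie as the set of users who rated it, ignoring the rating values. That interpretation is an assumption made for this sketch, and the user ids below are hypothetical.

```python
# Hypothetical sets of user ids who rated each movie.
raters_x = {1, 2, 3, 4}
raters_y = {3, 4, 5}

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(jaccard_index(raters_x, raters_y))  # 2 shared users out of 5 in total -> 0.4
```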

Let's start with the Pearson Index in this case. Now that we have understood how similar products can be found, let's start with the movie, Star Wars.

In [13]:
StarWarsRatings = movie_ratings['Star Wars (1977)'] 
StarWarsRatings.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Now, let’s use the **corrwith()** function to compute the pairwise correlation of the Star Wars user ratings with the ratings of every other film in the table.

In [14]:
similarmovies = movie_ratings.corrwith(StarWarsRatings)
similarmovies =similarmovies.dropna()
df = pd.DataFrame(similarmovies)
df.head()

| movie title | 0 |
|---|---|
| 'Til There Was You (1997) | 0.872872 |
| 1-900 (1994) | -0.645497 |
| 101 Dalmatians (1996) | 0.211132 |
| 12 Angry Men (1957) | 0.184289 |
| 187 (1997) | 0.027398 |


If we look at the data closely, something looks wrong: an obscure title such as 'Til There Was You (1997) shows a correlation of 0.87 with Star Wars.

The likely reason is that a handful of people who have seen obscure films are skewing the results. We want to get rid of the movies that only a few people have rated, because their correlations are unreliable.

To do so, we use the groupby function, which splits the object, applies a function to each group, and combines the results, together with the sort_values function, which sorts by the values along either axis.

In [15]:
# Number of ratings ('size') and average rating ('mean') per movie.
movie_stats = new_data.groupby('movie title').agg({'rating': ['size', 'mean']})

In [16]:
check = movie_stats.sort_values([('rating','mean')],ascending=False)

In [17]:
check.head()

| movie title | (rating, size) | (rating, mean) |
|---|---|---|
| They Made Me a Criminal (1939) | 1 | 5.0 |
| Marlene Dietrich: Shadow and Light (1996) | 1 | 5.0 |
| Saint of Fort Washington, The (1993) | 2 | 5.0 |
| Someone Else's America (1995) | 1 | 5.0 |
| Star Kid (1997) | 3 | 5.0 |


Now, we can clearly observe that there are movies with very few ratings (size). Therefore, we keep only the movies that have at least 100 ratings.

In [18]:
popularmovies = movie_stats['rating']['size']>=100

movie_stats[popularmovies].sort_values([('rating','mean')],ascending=False)[:10]

| movie title | (rating, size) | (rating, mean) |
|---|---|---|
| Close Shave, A (1995) | 112 | 4.491071 |
| Schindler's List (1993) | 298 | 4.466443 |
| Wrong Trousers, The (1993) | 118 | 4.466102 |
| Casablanca (1942) | 243 | 4.45679 |
| Shawshank Redemption, The (1994) | 283 | 4.44523 |
| Rear Window (1954) | 209 | 4.38756 |
| Usual Suspects, The (1995) | 267 | 4.385768 |
| Star Wars (1977) | 584 | 4.359589 |
| 12 Angry Men (1957) | 125 | 4.344 |
| Citizen Kane (1941) | 198 | 4.292929 |


In [19]:
df = movie_stats[popularmovies].join(DataFrame(similarmovies,columns=['similarity']))
df.sort_values('similarity',ascending=False)[:20]

| movie title | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| Star Wars (1977) | 584 | 4.359589 | 1.0 |
| Empire Strikes Back, The (1980) | 368 | 4.206522 | 0.748353 |
| Return of the Jedi (1983) | 507 | 4.00789 | 0.672556 |
| Raiders of the Lost Ark (1981) | 420 | 4.252381 | 0.536117 |
| Austin Powers: International Man of Mystery (1997) | 130 | 3.246154 | 0.377433 |
| Sting, The (1973) | 241 | 4.058091 | 0.367538 |
| Indiana Jones and the Last Crusade (1989) | 331 | 3.930514 | 0.350107 |
| Pinocchio (1940) | 101 | 3.673267 | 0.347868 |
| Frighteners, The (1996) | 115 | 3.234783 | 0.332729 |
| L.A. Confidential (1997) | 297 | 4.161616 | 0.319065 |


## Building an End-to-End Recommender System

Based on what we have done so far, these are the steps to follow to recommend a movie:

- Compute the correlation score for every pair in the matrix
- Choose a user and find his or her movies of interest
- Recommend movies to him or her
- Improve on the recommendation

The pandas method **corr()** computes the correlation score for every pair of columns in the matrix, that is, for every pair of movies. With min_periods=100, any pair with fewer than 100 users in common is set to NaN, which leaves a sparse matrix. Let's see how this looks.

In [20]:
corrMatrix = movie_ratings.corr(method='pearson',min_periods=100)
corrMatrix.head()

| movie title | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 'Til There Was You (1997) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1-900 (1994) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 101 Dalmatians (1996) | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 Angry Men (1957) | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 187 (1997) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
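
The many NaN entries come from the `min_periods` argument. A small sketch on a hypothetical three-movie matrix (the values are made up) shows its effect: without it, two movies that share only two raters still get a correlation, and with a threshold that pair is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical utility matrix: 5 users (rows), 3 movies (columns).
toy = pd.DataFrame({
    'A': [5, 4, 3, 2, 1],
    'B': [4, 5, 3, 1, 2],
    'C': [np.nan, np.nan, np.nan, 1, 5],
})

loose = toy.corr(method='pearson')                  # any overlap counts
strict = toy.corr(method='pearson', min_periods=3)  # require 3 co-rating users

print(loose.loc['A', 'C'], strict.loc['A', 'C'])
```

Movies A and C share only two raters, so the unrestricted correlation is an extreme value that rests on very little evidence, while the thresholded version reports NaN for that pair.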


Now, we want to recommend movies to a friend, so let's have a look at the movies our friend has rated.

In [21]:
friend_ratings = movie_ratings.loc[1].dropna()[1:4]
friend_ratings

movie title
12 Angry Men (1957)                    5.0
20,000 Leagues Under the Sea (1954)    3.0
2001: A Space Odyssey (1968)           4.0
Name: 1, dtype: float64

In [22]:
simcandidates = pd.Series(dtype='float64')
for title, rating in friend_ratings.items():
    print('Adding similars to ', title)
    print('--------------------------------')
    sims = corrMatrix[title].dropna()
    # Weight each similarity by the friend's rating, so lower-rated movies count less.
    sims = sims * rating
    # Series.append was removed in pandas 2.0; pd.concat does the same job.
    simcandidates = pd.concat([simcandidates, sims])
    print('sorting')
    simcandidates.sort_values(inplace=True, ascending=False)
    print(simcandidates.head(10))

Adding similars to  12 Angry Men (1957)
--------------------------------
sorting
12 Angry Men (1957)               5.000000
Star Wars (1977)                  0.921447
Raiders of the Lost Ark (1981)    0.646672
dtype: float64
Adding similars to  20,000 Leagues Under the Sea (1954)
--------------------------------
sorting
12 Angry Men (1957)               5.000000
Star Wars (1977)                  0.921447
Raiders of the Lost Ark (1981)    0.646672
dtype: float64
Adding similars to  2001: A Space Odyssey (1968)
--------------------------------
sorting
12 Angry Men (1957)                                                            5.000000
2001: A Space Odyssey (1968)                                                   4.000000
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    1.571663
Clockwork Orange, A (1971)                                                     1.552285
Citizen Kane (1941)                                                            1.481653
Lawr

Some movies come up more than once because they are similar to several of the movies the user has rated. Let's combine these duplicates by summing their scores.

In [23]:
simcandidates = simcandidates.groupby(simcandidates.index).sum()
simcandidates.sort_values(inplace=True,ascending=False)
simcandidates.head(10)

12 Angry Men (1957)                                                            5.000000
2001: A Space Odyssey (1968)                                                   4.000000
Star Wars (1977)                                                               1.844984
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    1.571663
Clockwork Orange, A (1971)                                                     1.552285
Citizen Kane (1941)                                                            1.481653
Raiders of the Lost Ark (1981)                                                 1.438781
Lawrence of Arabia (1962)                                                      1.324881
Chinatown (1974)                                                               1.311644
Apocalypse Now (1979)                                                          1.251388
dtype: float64
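
The top of the list still contains movies the friend has already rated. One possible improvement is to drop those titles before recommending. The sketch below uses small hypothetical stand-ins for `simcandidates` and `friend_ratings` so that it runs on its own; the titles and scores are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's variables.
simcandidates = pd.Series({
    'Seen A': 5.0,
    'Seen B': 4.0,
    'New C': 1.8,
    'New D': 1.5,
})
friend_ratings = pd.Series({'Seen A': 5.0, 'Seen B': 4.0})

# Drop titles the friend has already rated; errors='ignore' skips titles
# that are absent from the candidate list.
filtered = simcandidates.drop(friend_ratings.index, errors='ignore')
print(filtered.head(10))
```

With the notebook's own variables, the same `drop` call would remove the already-rated titles from the top of the list, leaving only unseen movies.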