## Recommendation System

<a id="System1"> </a>
## Types of recommendation system


There are majorly six types of recommender systems which work primarily in the Media and Entertainment industry::

1. Popularity based recommendation system
2. Content-based recommendation system
3. Collaborative recommendation system
4. Matrix factorization recommendation system
5. Association Rule
6. Hybrid-recommendation system

<a id="System3"> </a>

###  Popularity based recommendation system

This model is not actually personalized - it simply recommends to a user the most popular items that the user has not previously consumed i.e. even though you know the behaviour of the user you cannot recommend items accordingly.

In [1]:
#Import the basic libraries
import pandas as pd
import numpy as np

In [3]:
#Reading the dataset
data=pd.read_csv('book.csv')
data.head(2)

Unnamed: 0.1,Unnamed: 0,userID,ISBN,bookRating,bookTitle,totalRatingCount,Location
0,0,276725,034545104X,0,Flesh Tones: A Novel,60,"tyler, texas, usa"
1,1,2313,034545104X,5,Flesh Tones: A Novel,60,"cincinnati, ohio, usa"


In [6]:
# Top 10 books in terms of average rating 

top_10books=pd.DataFrame(data.groupby('bookTitle')['bookRating'].mean())
top_10books.sort_values(by='bookRating', ascending=False).head(10)

Unnamed: 0_level_0,bookRating
bookTitle,Unnamed: 1_level_1
Das Parfum: Die Geschichte Eines Morders,10.0
Matilda,8.0
Harry Potter and the Chamber of Secrets (Book 2),6.720588
MÃ?Â¶rder ohne Gesicht.,6.5
Ender's Game (Ender Wiggins Saga (Paperback)),5.857143
Sabine's Notebook: In Which the Extraordinary Correspondence of Griffin &amp; Sabine Continues,5.785714
The Cat in the Hat,5.734694
Harry Potter and the Order of the Phoenix (Book 5),5.565693
City of Bones,5.325581
Into the Forest,5.206897


In [7]:
# some books may get high average rating, but it is not reviewed by many users so it might be false measure to recommend.
# Hence we need to consider the review count also for better recommendation
popularity_table=data.groupby('bookTitle').agg({'bookRating':'mean','totalRatingCount':'count'})

In [8]:
popularity_table['rating_per_count']=popularity_table['bookRating']/popularity_table['totalRatingCount']
popularity_table.sort_values('rating_per_count',ascending=False)

Unnamed: 0_level_0,bookRating,totalRatingCount,rating_per_count
bookTitle,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Das Parfum: Die Geschichte Eines Morders,10.000000,1,10.000000
Saving Faith,5.000000,1,5.000000
Toxin,5.000000,1,5.000000
Matilda,8.000000,2,4.000000
MÃ?Â¶rder ohne Gesicht.,6.500000,2,3.250000
...,...,...,...
The Lovely Bones: A Novel,4.622624,1052,0.004394
Wild Animus,1.055014,1436,0.000735
Childhood's End,0.000000,1,0.000000
The Return,0.000000,5,0.000000


In [9]:
#consider the books for recommendation only if it has 100 rating counts
top_popularity_table=popularity_table[popularity_table['totalRatingCount']>100]

In [10]:
#Top 10 books to recommed based on popularity
top_popularity_table.sort_values('bookRating',ascending=False).head(10)

Unnamed: 0_level_0,bookRating,totalRatingCount,rating_per_count
bookTitle,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Harry Potter and the Chamber of Secrets (Book 2),6.720588,136,0.049416
Harry Potter and the Order of the Phoenix (Book 5),5.565693,274,0.020313
Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)),4.850598,502,0.009663
To Kill a Mockingbird,4.761329,331,0.014385
The Da Vinci Code,4.699329,745,0.006308
The Lovely Bones: A Novel,4.622624,1052,0.004394
Fahrenheit 451,4.61512,291,0.01586
A Wrinkle In Time,4.569444,144,0.031732
Girl with a Pearl Earring,4.319648,341,0.012668
The Notebook,4.307692,104,0.04142


**Interpreation:**
- Popularity recommendation system uses highest rating to recommend to any users. 
- It uses the items which are in trend(most rated)

<a id="System4"> </a>

###  Content-based recommendation system

This method uses only information about the description and attributes of the items users has previously consumed to model user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended..

### Content Based Recommendation

In [11]:
#Read the movie dataset
data1=pd.read_csv('movie_metadata.csv')

In [12]:
data1.head(2)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0


In [13]:
data1.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [15]:
genres1=data1['genres'].str.split('|', expand=True)
genres1

Unnamed: 0,0,1,2,3,4,5,6,7
0,Action,Adventure,Fantasy,Sci-Fi,,,,
1,Action,Adventure,Fantasy,,,,,
2,Action,Adventure,Thriller,,,,,
3,Action,Thriller,,,,,,
4,Documentary,,,,,,,
...,...,...,...,...,...,...,...,...
5038,Comedy,Drama,,,,,,
5039,Crime,Drama,Mystery,Thriller,,,,
5040,Drama,Horror,Thriller,,,,,
5041,Comedy,Drama,Romance,,,,,


In [16]:
#Consider only 3 genres with column indexes 0,1,2. Other columns have many null values. 
genres=genres1.iloc[:,0:3]

In [17]:
#Fill the null values with others & give the names of columns
genres=genres.fillna('Others')
genres.columns=['genre1','genre2','genre3']
genres.head(2)

Unnamed: 0,genre1,genre2,genre3
0,Action,Adventure,Fantasy
1,Action,Adventure,Fantasy


In [18]:
data1=pd.concat([data1,genres],axis=1)
data1.head(2)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,genre1,genre2,genre3
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,Action,Adventure,Fantasy
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,Action,Adventure,Fantasy


In [19]:
#Consider only the features which are required for further analysis. 
movie_feat=['movie_title','genre1','genre2','genre3','content_rating','imdb_score']
data2=data1[movie_feat]
data2.head(2)

Unnamed: 0,movie_title,genre1,genre2,genre3,content_rating,imdb_score
0,Avatar,Action,Adventure,Fantasy,PG-13,7.9
1,Pirates of the Caribbean: At World's End,Action,Adventure,Fantasy,PG-13,7.1


In [20]:
#Remove duplicates if any. 
data2=data2.drop_duplicates()

In [22]:
#Make the Movie Title as index
data2=data2.set_index('movie_title',1)
data2.head()

Unnamed: 0_level_0,genre1,genre2,genre3,content_rating,imdb_score
movie_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Avatar,Action,Adventure,Fantasy,PG-13,7.9
Pirates of the Caribbean: At World's End,Action,Adventure,Fantasy,PG-13,7.1
Spectre,Action,Adventure,Thriller,PG-13,6.8
The Dark Knight Rises,Action,Thriller,Others,PG-13,8.5
Star Wars: Episode VII - The Force Awakens,Documentary,Others,Others,,7.1


In [24]:
data3=data2.dropna() #Drop all the null values

In [25]:
#Categorical encoding
data3=pd.get_dummies(data3)
data3.head(2)

Unnamed: 0_level_0,imdb_score,genre1_Action,genre1_Adventure,genre1_Animation,genre1_Biography,genre1_Comedy,genre1_Crime,genre1_Documentary,genre1_Drama,genre1_Family,...,content_rating_Passed,content_rating_R,content_rating_TV-14,content_rating_TV-G,content_rating_TV-MA,content_rating_TV-PG,content_rating_TV-Y,content_rating_TV-Y7,content_rating_Unrated,content_rating_X
movie_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,7.9,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Pirates of the Caribbean: At World's End,7.1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
from sklearn.neighbors import NearestNeighbors

In [27]:
rec_model = NearestNeighbors(metric = 'cosine')
rec_model.fit(data3)

NearestNeighbors(metric='cosine')

In [28]:
query_movie_index=200
dist, ind = rec_model.kneighbors(data3.iloc[query_movie_index, :].values.reshape(1, -1), n_neighbors = 6)

In [29]:
list(data3.index[ind[0]])[1:]

['War of the Worlds\xa0',
 'Insurgent\xa0',
 'The Hunger Games: Catching Fire\xa0',
 'Jurassic Park\xa0',
 'My Name Is Khan\xa0']

In [30]:
for i in range(0, len(dist[0])):
    if i == 0:
        print('Top 5 Recommendations for the user who watched the movie :',data3.index[query_movie_index])
    else:
        print(i, data3.index[ind[0][i]])

Top 5 Recommendations for the user who watched the movie : The Hunger Games: Mockingjay - Part 1 
1 War of the Worlds 
2 Insurgent 
3 The Hunger Games: Catching Fire 
4 Jurassic Park 
5 My Name Is Khan 


<a id="System5"> </a>

###  Collaborative recommendation system

Collaborative filtering is currently one of the most frequently used approaches and usually provides better results than content-based recommendations. Some examples of this are found in the recommendation systems of Youtube, Netflix, and Spotify.
Collaborative Filtering, which is also known as User-User Filtering. As hinted by its alternate name, this technique uses other users to recommend items to the input user. It attempts to find users that have similar preferences and opinions as the input and then recommends items that they have liked to the input. There are several methods of finding similar users (Even some making use of Machine Learning), and the one we will be using here is going to be based on the Pearson Correlation Function.

### Collaborative Based Recommendation

In [31]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split, cross_validate
from surprise import KNNWithMeans,SVDpp
from surprise import accuracy

ModuleNotFoundError: No module named 'surprise'

In [261]:
ratings = pd.read_csv('ratings.csv')
reader = Reader(rating_scale=(1, 5))

In [276]:
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182


In [263]:
rating_data = Dataset.load_from_df(ratings[['userId','movieId','rating']],reader)
[trainset, testset] = train_test_split(rating_data, test_size=.15,shuffle=True)

In [264]:
trainsetfull = rating_data.build_full_trainset()
print('Number of users: ', trainsetfull.n_users, '\n')
print('Number of items: ', trainsetfull.n_items, '\n')

Number of users:  671 

Number of items:  9066 



In [268]:
# my_k = 15
# my_min_k = 5
# my_sim_option = {'name':'pearson', 'user_based':False}

In [269]:
# algo = KNNWithMeans(k = my_k, min_k = my_min_k, sim_options = my_sim_option, verbose = True)

In [270]:
# results = cross_validate(
#     algo = algo, data = rating_data, measures=['RMSE'], 
#     cv=5, return_train_measures=True
#     )

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [271]:
# print(results['test_rmse'].mean())

0.9425029068398955


In [280]:
alg=SVDpp()
alg.fit(trainsetfull)

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x2121b6c5b88>

In [272]:
#algo.fit(trainsetfull)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x21218bb1188>

In [305]:
algo.predict(uid = 50, iid =2)

Prediction(uid=50, iid=2, r_ui=None, est=3.5128177784724195, details={'actual_k': 15, 'was_impossible': False})

In [299]:
item_id=ratings['movieId'].unique()
item_id10=ratings.loc[ratings['userId']==10,'movieId']
item_id_pred=np.setdiff1d(item_id,item_id10)

In [301]:
item_id_pred

array([     1,      2,      3, ..., 162542, 162672, 163949], dtype=int64)

In [302]:
testset=[[50,iid,4] for iid in item_id_pred]
pred=alg.test(testset)
pred

[Prediction(uid=50, iid=1, r_ui=4, est=3.516703723488918, details={'was_impossible': False}),
 Prediction(uid=50, iid=2, r_ui=4, est=3.296133199197558, details={'was_impossible': False}),
 Prediction(uid=50, iid=3, r_ui=4, est=3.0401317790965816, details={'was_impossible': False}),
 Prediction(uid=50, iid=4, r_ui=4, est=2.2682505524468946, details={'was_impossible': False}),
 Prediction(uid=50, iid=5, r_ui=4, est=2.9633757212148355, details={'was_impossible': False}),
 Prediction(uid=50, iid=6, r_ui=4, est=3.673916477341491, details={'was_impossible': False}),
 Prediction(uid=50, iid=7, r_ui=4, est=3.1475918356306445, details={'was_impossible': False}),
 Prediction(uid=50, iid=8, r_ui=4, est=3.2916413742943234, details={'was_impossible': False}),
 Prediction(uid=50, iid=9, r_ui=4, est=2.789571943761879, details={'was_impossible': False}),
 Prediction(uid=50, iid=10, r_ui=4, est=3.3001861818333844, details={'was_impossible': False}),
 Prediction(uid=50, iid=11, r_ui=4, est=3.57181856033

In [308]:
pred_ratings=np.array([pred1.est for pred1 in pred])
i_max=pred_ratings.argmax()
iid=item_id_pred[i_max]
print("Top item for user 10 has iid {0} with predicted rating {1}".format(iid,pred_ratings[i_max]))

Top item for user 10 has iid 905 with predicted rating 4.161535167945879
