## Mariana Ferreira: Writing Recommender Systems
#### Coursera Guided Project by Charles Ivan Niswander II

Recommender systems capture the pattern of people’s behavior and use it to predict what else they might want or like.

There are two main types of recommender systems: **content-based** and **collaborative filtering**.
There are also hybrid recommender systems, which combine several mechanisms.

* A **Content-based** recommendation system tries to recommend items to users based on their profile. 
The user's profile revolves around that user's preferences and tastes.

* **Collaborative filtering** is based on the fact that relationships exist between products and people's interests. 

Collaborative filtering has two main approaches: **user-based** and **item-based**.
* User-based collaborative filtering is based on the user similarity or neighborhood. 
* Item-based collaborative filtering is based on similarity among items.

In terms of implementing recommender systems, there are 2 types: **Memory-based** and **Model-based**. 

* **Memory-based**: Uses the entire user-item dataset to generate recommendations. It applies statistical techniques to find users or items that are similar to one another. Examples of these techniques include Pearson correlation, cosine similarity, and Euclidean distance, among others.

* **Model-based**: A model of users is developed in an attempt to learn their preferences. Models can be created using Machine Learning techniques like regression, clustering, classification, and so on. 
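
The three memory-based measures named above can be sketched on toy rating vectors. This is a minimal illustration, not part of the guided project: the helper names and the ratings for `alice`, `bob`, and `carol` are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between the two rating vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    # Correlation of the mean-centered ratings
    return float(np.corrcoef(a, b)[0, 1])

def euclidean_distance(a, b):
    # Straight-line distance; smaller means more alike
    return float(np.linalg.norm(a - b))

# Hypothetical ratings given by three users to the same four items
alice = np.array([5.0, 3.0, 4.0, 4.0])
bob = np.array([3.0, 1.0, 2.0, 2.0])    # same taste as alice, but rates 2 points lower
carol = np.array([1.0, 5.0, 2.0, 1.0])  # roughly opposite taste

print(pearson_correlation(alice, bob))    # Pearson ignores the constant offset
print(pearson_correlation(alice, carol))  # negative: opposite preferences
print(cosine_similarity(alice, bob))
print(euclidean_distance(alice, bob))
```

Pearson correlation treats `alice` and `bob` as identical in taste because it centers each user's ratings, while the Euclidean distance still penalizes the constant offset between them.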

##### The following code shows how to create simple versions of these different kinds of recommender engines.

### 1) Writing the simplest recommender system

#### Simplified engine that suggests items based on ratings or scores.

In this case, the database used is the IMDb Top 250 movies collected using metadata from 45000 movies on IMDb (obtained from MovieLens). 
The dataset consists of movies released on or before July 2017.
There is also a subset of 9,000 movies and their ratings (100,000 ratings from 700 users on 9,000 movies).

In [1]:
%pip install matplotlib

import pandas as pd
import numpy as np

Note: you may need to restart the kernel to use updated packages.


In [2]:
metadata= pd.read_csv('/Users/solara/Downloads/imdb_dataset/movies_metadata.csv', low_memory=False)
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


#### Let's calculate the value of the mean rating across all movies

In [3]:
C = metadata['vote_average'].mean()
print(C)

5.618207215134185


#### Calculate the minimum number of votes required to be in the 90th percentile

In [4]:
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


#### Keep only the qualified movies: those with a vote count greater than or equal to 160

In [5]:
q_movies = metadata.copy().loc[metadata['vote_count']>= m]
print(q_movies.shape)
print(metadata.shape)

(4555, 24)
(45466, 24)
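
The percentile cutoff and the filter can be checked on a small series. This is a toy sketch with made-up vote counts; `vote_count_toy`, `m_toy`, and `qualified` are names introduced only for the example.

```python
import pandas as pd

# Hypothetical vote counts for ten movies
vote_count_toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 90th percentile: the vote count that 90% of the movies fall below
m_toy = vote_count_toy.quantile(0.90)

# Keep only the movies whose vote count reaches the cutoff
qualified = vote_count_toy[vote_count_toy >= m_toy]

print(m_toy)
print(qualified.tolist())
```

By default `quantile` interpolates linearly between neighboring values, so the cutoff does not have to be one of the observed vote counts.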


#### Define and calculate the weighted rating for each qualified movie

In [6]:
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']    # number of votes for the movie
    R = x['vote_average']  # average rating of the movie

    # Calculation based on the IMDB formula
    return (v/(v + m) * R) + (m/(m + v) * C)

# Define a new feature 'score' and calculate its value with 'weighted_rating'

q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
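
The weighted rating is WR = v/(v+m) · R + m/(v+m) · C, where v is the movie's vote count, R its average rating, m the vote cutoff, and C the mean rating across all movies. As a quick sanity check, the sketch below plugs in the values of `m` and `C` printed earlier in this notebook and the vote figures for two movies from the results table. The standalone function `weighted_rating_value` and the `_example` variables are hypothetical names used only for this illustration.

```python
def weighted_rating_value(v, R, m, C):
    # Blend the movie's own average R with the global mean C,
    # trusting R more as the vote count v grows relative to m
    return (v / (v + m)) * R + (m / (v + m)) * C

m_example = 160.0              # 90th-percentile vote count printed above
C_example = 5.618207215134185  # mean vote_average printed above

# The Shawshank Redemption: many votes, so the score stays close to 8.5
popular = weighted_rating_value(8358, 8.5, m_example, C_example)

# Dilwale Dulhania Le Jayenge: fewer votes, so 9.1 is pulled further toward C
fewer_votes = weighted_rating_value(661, 9.1, m_example, C_example)

print(round(popular, 6), round(fewer_votes, 6))
```

This is why a movie with a higher raw average can still rank below one with a lower average but far more votes.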

#### Sort movies based on score calculated above

In [7]:
q_movies= q_movies.sort_values('score', ascending= False)

# Print the top 15 movies

q_movies[['title', 'vote_count', 'vote_average','score']].head(15)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


### 2) Writing a simple Collaborative-Filtering recommendation engine

#### Build a recommender system for the MovieLens 100k dataset from GroupLens.

In [8]:
import os
import pandas as pd
os.chdir('/Users/solara/Downloads/ml-100k/')
path = os.getcwd()

path

'/Users/solara/Downloads/ml-100k'

In [9]:
#Reading users file

u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('/Users/solara/Downloads/ml-100k/u.user.csv', sep='|', names=u_cols, encoding='latin-1')

print(users.shape)
users.head()

(943, 5)


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [10]:
#Reading ratings file

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('/Users/solara/Downloads/ml-100k/u.data.csv', sep='\t', names=r_cols, encoding='latin-1')

print(ratings.shape)
ratings.head()

(100000, 4)


Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [11]:
#Reading items file

i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
          'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']


items = pd.read_csv('/Users/solara/Downloads/ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')
print(items.shape)
items.head()

(1682, 24)


Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### The dataset has already been split into train and test sets by GroupLens, where the test data has 10 ratings for each user, i.e., a total of 9430 rows. Let's read both files.

In [12]:
ratings_train = pd.read_csv('/Users/solara/Downloads/ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('/Users/solara/Downloads/ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')

ratings_train.shape, ratings_test.shape

((90570, 4), (9430, 4))

#### The recommender engine will recommend movies based on user-user similarity and item-item similarity. 
#### Calculate the number of unique users and items.

In [13]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie_id.unique().shape[0]

#### Create a user-item matrix to calculate the similarity between users and items. 

In [14]:
data_matrix = np.zeros((n_users, n_items))
for line in ratings.itertuples():
    data_matrix[line[1]-1, line[2]-1]= line[3]
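
To make the shape of the user-item matrix concrete, here is a toy sketch with made-up ratings. It builds the matrix with the same loop pattern and, as an alternative, with `pivot_table`. The two agree only because the toy user and movie IDs are contiguous and start at 1; all names ending in `_toy` or `_matrix` here are introduced for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: 3 users, 3 movies, 4 ratings
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [1, 3, 2, 3],
    'rating':   [5, 3, 4, 2],
})

n_users_toy = toy_ratings['user_id'].nunique()
n_items_toy = toy_ratings['movie_id'].nunique()

# Loop version: row = user, column = movie, 0 = not rated
loop_matrix = np.zeros((n_users_toy, n_items_toy))
for row in toy_ratings.itertuples(index=False):
    loop_matrix[row.user_id - 1, row.movie_id - 1] = row.rating

# Pivot version: pandas arranges the same data, filling gaps with 0
pivot_matrix = (
    toy_ratings
    .pivot_table(index='user_id', columns='movie_id', values='rating', fill_value=0)
    .to_numpy(dtype=float)
)

print(loop_matrix)
```

Most entries of a real user-item matrix are zero, since each user rates only a small fraction of the items.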


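#### Calculate the pairwise cosine distance between users and between items.
Note that `pairwise_distances` with `metric='cosine'` returns the cosine distance, which is 1 minus the cosine similarity, so larger values in `user_similarity` and `item_similarity` mean less similar pairs.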

In [15]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [16]:
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(data_matrix, metric='cosine')
item_similarity = pairwise_distances(data_matrix.T, metric='cosine')
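
In scikit-learn, `pairwise_distances(..., metric='cosine')` returns the cosine distance, which is 1 minus the cosine similarity. The sketch below shows the relationship on a made-up 3x3 rating matrix; `toy`, `toy_distance`, and `toy_similarity` are names used only for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical ratings: users 0 and 1 rate alike, user 2 shares no items with them
toy = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 3.0],
    [0.0, 2.0, 0.0],
])

toy_distance = pairwise_distances(toy, metric='cosine')  # 0 = same direction
toy_similarity = cosine_similarity(toy)                  # 1 = same direction

# The two matrices are complements of each other
print(np.allclose(toy_distance, 1 - toy_similarity))
print(toy_similarity.round(3))
```

If true similarities are wanted as weights, one option is to use `cosine_similarity` directly or to subtract the distance matrix from 1.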

#### Define a function to make predictions based on similarities. 

In [17]:
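def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Mean rating per user as a column vector (unrated items count as 0)
        mean_user_rating = ratings.mean(axis=1).reshape(-1, 1)

        # Center each user's ratings on that user's mean
        ratings_diff = (ratings - mean_user_rating)
        # Weighted sum of all users' centered ratings, normalized by the total weight
        pred = mean_user_rating + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T

    elif type == 'item':
        # Weighted sum of the user's own ratings across items, normalized by the total weight
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])

    return pred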

#Passing in the data matrix and the user and item similarity matrix into the predict function. 

user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

print('user_pred', user_prediction)
print('item_pred', item_prediction)

user_pred [[ 2.06532606  0.73430275  0.62992381 ...  0.39359041  0.39304874
   0.3927712 ]
 [ 1.76308836  0.38404019  0.19617889 ... -0.08837789 -0.0869183
  -0.08671183]
 [ 1.79590398  0.32904733  0.15882885 ... -0.13699223 -0.13496852
  -0.13476488]
 ...
 [ 1.59151513  0.27526889  0.10219534 ... -0.16735162 -0.16657451
  -0.16641377]
 [ 1.81036267  0.40479877  0.27545013 ... -0.00907358 -0.00846587
  -0.00804858]
 [ 1.8384313   0.47964837  0.38496292 ...  0.14686675  0.14629808
   0.14641455]]
item_pred [[0.44627765 0.475473   0.50593755 ... 0.58815455 0.5731069  0.56669645]
 [0.10854432 0.13295661 0.12558851 ... 0.13445801 0.13657587 0.13711081]
 [0.08568497 0.09169006 0.08764343 ... 0.08465892 0.08976784 0.09084451]
 ...
 [0.03230047 0.0450241  0.04292449 ... 0.05302764 0.0519099  0.05228033]
 [0.15777917 0.17409459 0.18900003 ... 0.19979296 0.19739388 0.20003117]
 [0.24767207 0.24489212 0.28263031 ... 0.34410424 0.33051406 0.33102478]]
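
Written out, the two branches of `predict` correspond to the formulas below. The notation is introduced here for illustration: r_{u,i} is the entry of the data matrix for user u and item i (0 if unrated), \bar{r}_u is the row mean for user u over all items, and s is the matrix passed in as `similarity`.

```latex
% type='user': the user's mean plus a weighted average of all users' centered ratings
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v} \lvert s_{u,v} \rvert}

% type='item': a weighted average of the user's own ratings across items
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
```

Because the row means include the zeros of unrated items, the predicted values are on a much smaller scale than the 1 to 5 ratings, which matches the printed output.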


### 3) Writing a simple Content-Based recommender system

#### Build a recommender system for a retail dataset using a TF-IDF vectorizer (Term Frequency - Inverse Document Frequency vectorizer).
TF-IDF measures how distinctive a word is by comparing the number of times the word appears in a document with the number of documents the word appears in.

The product of the word's TF and IDF scores is referred to as the TF-IDF weight of that term.

**The higher the TF-IDF weight, the rarer and more significant the word, and vice versa.**
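
A small sketch shows how rarity drives the IDF part of the weight. The three product descriptions are made up, and `docs`, `vec`, `weights`, and `idf` are names introduced for this example. With its default settings, scikit-learn's `TfidfVectorizer` uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product descriptions
docs = [
    "cotton shirt for hiking",
    "cotton shirt for running",
    "waterproof jacket for hiking",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs)  # one row per document, one column per term

# Map each term to its learned IDF value
idf = {term: vec.idf_[idx] for term, idx in vec.vocabulary_.items()}

# 'for' is in every document, 'cotton' in two, 'waterproof' in only one
for term in ('for', 'cotton', 'waterproof'):
    print(term, round(idf[term], 3))
```

The term that appears in every description gets the lowest IDF, and the term unique to one description gets the highest.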

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [19]:
ds = pd.read_csv('/Users/solara/Downloads/sample-data.txt')

In [20]:
# min_df=1 keeps every term; recent scikit-learn releases reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])

#### Measure how similar one text is to another.
In this model, each item is stored as a vector of its attributes (here, TF-IDF weights) in an n-dimensional space.
The angle between two vectors determines how similar the corresponding items are.

The similarity used to infer what a user is likely to like or dislike is the cosine of the angle between two vectors, in this case two document vectors.

Calculate the cosine similarity. The cosine increases as the angle between the vectors decreases, which signifies more similarity. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` equals the cosine similarity.

In [21]:
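# linear_kernel computes the dot product between every pair of TF-IDF rows
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {}
for idx, row in ds.iterrows():
    # Indices of the 99 highest-scoring items, best first (normally the item itself comes first)
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]
    # Drop the first entry, which is the item itself
    results[row['id']] = similar_items[1:]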
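
`TfidfVectorizer` L2-normalizes each row by default, so the plain dot product from `linear_kernel` gives the same values as `cosine_similarity`. The sketch below checks this on three made-up descriptions and ranks the neighbors of the first one; all variable names are introduced for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical product descriptions
docs = [
    "cotton shirt for hiking",
    "cotton shirt for running",
    "waterproof jacket for hiking",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot_products = linear_kernel(tfidf, tfidf)  # plain dot products
cosines = cosine_similarity(tfidf)          # explicitly normalized

print(np.allclose(dot_products, cosines))

# Rank all items by similarity to item 0, most similar first
ranked = dot_products[0].argsort()[::-1]
print(ranked.tolist())
```

Item 0 ranks first against itself, which is why the recommendation loop discards the first entry of each ranked list.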

#### Compute and output the recommendation.
Input an item id and the number of recommendations wanted. 

The function collects the results corresponding to that item ID.

In [22]:
def item(id):
    return ds.loc[ds['id'] == id]['description'].tolist()[0].split('-')[0]


In [23]:
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print(".......")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# recommend() prints its results and returns None, so call it directly
recommend(item_id=11, num=5)


Recommending 5 products similar to Baby sunshade top ...
.......
Recommended: Sunshade hoody  (score:0.21330296021085024)
Recommended: Baby baggies apron dress  (score:0.10975311296284812)
Recommended: Runshade t (score:0.09988151262780731)
Recommended: Runshade t (score:0.09530698241688207)
Recommended: Runshade top  (score:0.08510550093018411)


#### These are basic working prototypes for creating a recommender system.
#### Which recommender system to use for a particular project depends on the project, the type of data,
#### and the type of recommendations that we want our model to make.

#### Sources:

Coursera Project Network: Build a Recommender System in Python by Charles Ivan Niswander II


F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872