# A simple recommender example using Python


https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/

## 1. Data

The quantity and quality of the data determine the quality of the predictions.
Gather enough data, then refine it (extract the relevant subset), and so on.

### 2.1 Content-based filtering

Assume that if a user likes A, and B is similar to A, then the user will probably like B too.
Consider a *profile vector* that encodes the behavior of a Netflix user. This vector can be compared with the *item vectors* of videos, which contain information such as genre, cast, and directors.
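
As a rough illustration of comparing a profile vector with item vectors, the sketch below builds a profile as the rating-weighted sum of the item vectors a user has rated, then scores the unrated items by cosine similarity. The genre columns, ratings, and the weighting scheme are assumptions made up for this example; they do not come from the source.

```python
import numpy as np

# Hypothetical item vectors: one row per video, one column per genre
# (Action, Comedy, Drama). All values are made up for illustration.
item_vectors = np.array([
    [1.0, 0.0, 0.0],   # video 0: action
    [1.0, 1.0, 0.0],   # video 1: action comedy
    [0.0, 0.0, 1.0],   # video 2: drama
    [0.0, 1.0, 1.0],   # video 3: comedy drama
])

# Ratings the user gave to videos 0 and 2 (0 means "not rated").
user_ratings = np.array([5.0, 0.0, 1.0, 0.0])

# Assumed profile definition: rating-weighted sum of the rated item vectors.
profile = user_ratings @ item_vectors


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


# Score every video against the profile, then pick the best unrated one.
scores = np.array([cosine(profile, v) for v in item_vectors])
unseen = np.flatnonzero(user_ratings == 0)
best_unseen = unseen[np.argmax(scores[unseen])]
print(best_unseen)
```

Because the user rated the action video highly and the drama poorly, the action comedy scores higher than the comedy drama.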


#### Measures
Several measures can quantify the similarity between two vectors.

#### Cosine similarity
The similarity can be calculated as the cosine of the angle between the two vectors:

\begin{equation*}
 sim(A,B) = \cos(\theta) = \frac{A \cdot B}{\left\| A \right\| \left\| B \right\|}
\end{equation*}
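
A minimal NumPy sketch of this formula follows. The function name and the sample vectors are illustrative only. It shows that vectors pointing in the same direction score about 1 whatever their length, while orthogonal vectors score 0.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # b is 2 * a
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction, orthogonal)
```

The function is undefined for an all-zero vector (division by zero), so a real data set may need a guard for users or items without any ratings.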

#### Euclidean distance
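
The source gives no formula under this heading. The standard definition is $d(A,B) = \sqrt{\sum_i (A_i - B_i)^2}$, where a smaller distance means more similar vectors. The sketch below computes it with NumPy. The conversion to a similarity score, `1 / (1 + d)`, is one common convention and is an assumption of this example, not something the source specifies.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def euclidean_similarity(a, b):
    """Assumed convention: map distance d in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


d = euclidean_distance([0, 0], [3, 4])
s = euclidean_similarity([1, 2], [1, 2])  # identical vectors
print(d, s)
```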

#### Pearson's Correlation

\begin{equation*}
sim(u,v) = \frac{ \sum_{i} (r_{ui} - \bar{r}_{u})(r_{vi} - \bar{r}_{v}) }{ \sqrt{ \sum_{i} (r_{ui} - \bar{r}_{u})^{2} } \sqrt{ \sum_{i} (r_{vi} - \bar{r}_{v})^{2} } }
\end{equation*}

, where $r_{ui}$ is the rating of user u for item i and $\bar{r}_{u}$ is the mean rating of user u.
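
The sketch below is one way to compute Pearson's correlation between two users' rating vectors. It makes two assumptions that the source does not state: a rating of 0 marks an unrated item, and the sums and means are taken over co-rated items only.

```python
import numpy as np


def pearson_similarity(r_u, r_v):
    """Pearson correlation of two users over the items both have rated."""
    r_u = np.asarray(r_u, dtype=float)
    r_v = np.asarray(r_v, dtype=float)
    both = (r_u > 0) & (r_v > 0)          # assumed: 0 means "not rated"
    if both.sum() < 2:
        return 0.0                        # too little overlap to correlate
    du = r_u[both] - r_u[both].mean()     # deviations from the user's mean
    dv = r_v[both] - r_v[both].mean()
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0


u = [5, 3, 0, 1]
v = [4, 2, 5, 0]   # rates consistently one point below u
w = [1, 3, 0, 5]   # opposite taste to u
sim_uv = pearson_similarity(u, v)
sim_uw = pearson_similarity(u, w)
print(sim_uv, sim_uw)
```

Unlike cosine similarity, subtracting the mean removes the effect of a user who rates everything systematically higher or lower, which is why `u` and `v` correlate perfectly here.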



#### Drawback
Content-based filtering can only recommend items of the *same* kind as those the user has already rated.

### 2.2 A more general approach: collaborative filtering

#### Between users
If users A and B both like a and b, and B also likes c, then A may like c too.

$P_{u,i}$, the predicted rating of item i for user u, can be calculated as

\begin{equation*}
P_{u,i} = \frac{ \sum_{v} \left( r_{v,i} \cdot s_{u,v} \right) }{ \sum_{v} s_{u,v} }
\end{equation*}

, where v ranges over the other users, $r_{v,i}$ is the rating user v gave to item i, and $s_{u,v}$ is the similarity between users u and v.
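
As a worked example of the formula for $P_{u,i}$, the snippet below evaluates it for three other users. The ratings and similarities are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Ratings r_{v,i} that three other users v gave to item i (made-up values).
r_vi = np.array([4.0, 5.0, 2.0])
# Similarity s_{u,v} of the target user u to each of those users (made-up).
s_uv = np.array([0.9, 0.5, 0.1])

# Similarity-weighted average: (3.6 + 2.5 + 0.2) / 1.5
p_ui = (r_vi * s_uv).sum() / s_uv.sum()
print(p_ui)
```

The prediction lands close to the rating of the most similar user, because that user carries the largest weight.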

Calculating the similarity between all pairs of users is time-consuming, so in practice a subset of users may be used.
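
One possible way to restrict the computation to a subset is sketched below: similarities are computed only between the target user and a small set of candidate users, and the prediction is the similarity-weighted average over those candidates. Random sampling of candidates, the helper name `predict_from_subset`, and the toy matrix are all assumptions of this example; other selection rules, such as a similarity threshold or nearest neighbours, fit the same pattern.

```python
import numpy as np


def predict_from_subset(ratings, u, i, candidates):
    """Predict user u's rating of item i using only the candidate users."""
    target = ratings[u]
    num, den = 0.0, 0.0
    for v in candidates:
        other = ratings[v]
        # Cosine similarity between u and v, computed on demand.
        s = target @ other / (np.linalg.norm(target) * np.linalg.norm(other))
        num += other[i] * s
        den += s
    return num / den


# Toy user-item matrix (rows: users, columns: items, 0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 5.0],
    [1.0, 1.0, 2.0],
    [4.0, 5.0, 4.0],
])

rng = np.random.default_rng(0)
candidates = rng.choice([1, 2, 3], size=2, replace=False)  # sample 2 of 3 users
p = predict_from_subset(ratings, u=0, i=2, candidates=candidates)
print(candidates, p)
```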

The same can be done **between items**: a user's rating of an item is predicted from that user's ratings of similar items.

# Case study

In [2]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

In [12]:
u_cols = ["user_id", "age", "sex", 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols, encoding="latin-1")

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, encoding="latin-1")
ratings_train = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')

i_cols = ['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL',
          "unknown", 'Action', 'Adventure', 'Animation', 'Children\'s',
          'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance',
          'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

print(ratings_train.shape, ratings_test.shape)

(90570, 4) (9430, 4)


## Collaborative filtering model
The model below uses both user-user similarity and item-item similarity.

In [14]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie_id.unique().shape[0]

In [16]:
# Subtract 1 because user and movie IDs start at 1, while NumPy indices start at 0.

data_matrix = np.zeros((n_users, n_items))
for line in ratings.itertuples():
    data_matrix[line[1]-1, line[2]-1] = line[3]
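
The mapping from 1-based IDs to 0-based matrix indices can be checked on a tiny stand-in for the `ratings` frame. The toy values below are made up, and the vectorized assignment is just an alternative to the loop, not the notebook's method. Note that both versions assume the IDs are contiguous integers starting at 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings frame (made-up values, same column names).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [1, 3, 2, 3],
    "rating":   [5, 3, 4, 1],
})

n_users = toy.user_id.max()
n_items = toy.movie_id.max()

matrix = np.zeros((n_users, n_items))
# IDs are 1-based, NumPy indices are 0-based, hence the -1.
rows = toy.user_id.to_numpy() - 1
cols = toy.movie_id.to_numpy() - 1
matrix[rows, cols] = toy.rating.to_numpy()

# Fraction of user-item pairs that have no rating.
sparsity = 1.0 - np.count_nonzero(matrix) / matrix.size
print(matrix)
print(sparsity)
```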

In [19]:
from sklearn.metrics.pairwise import pairwise_distances

# Distance metrics accepted by pairwise_distances include:
# 'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'

# pairwise_distances returns the cosine *distance* (1 - cosine similarity),
# so subtract it from 1 to get user-user and item-item similarity.
user_similarity = 1 - pairwise_distances(data_matrix, metric="cosine")
item_similarity = 1 - pairwise_distances(data_matrix.T, metric="cosine")

In [20]:
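def predict(ratings, similarity, type="user"):
    """Predict ratings from user-user or item-item similarity."""
    if type == 'user':
        # Mean over all items, including the zeros that mark unrated items.
        mean_user_rating = ratings.mean(axis=1)
        # Centre each user's ratings, take the similarity-weighted average
        # of the deviations, then add the user's mean back.
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        pred = mean_user_rating[:, np.newaxis] + \
                similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Similarity-weighted average of each user's own ratings of other items.
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")

    return pred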

In [23]:
u_pred = predict(data_matrix, user_similarity, type="user")
i_pred = predict(data_matrix, item_similarity, type="item")
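
A prediction matrix such as `u_pred` or `i_pred` can then be turned into recommendations by ranking, for each user, the items that user has not rated yet. The helper below is a hypothetical sketch with made-up numbers, not part of the original notebook. It assumes that a 0 in the observed matrix means "not rated".

```python
import numpy as np


def top_n_unseen(pred, observed, user, n=2):
    """Indices of the n highest-scored items the user has not rated yet."""
    scores = pred[user].astype(float)       # astype returns a copy
    scores[observed[user] > 0] = -np.inf    # mask items already rated
    order = np.argsort(scores)[::-1]        # best score first
    return [int(j) for j in order[:n] if np.isfinite(scores[j])]


# Toy data for a single user: items 0 and 3 are already rated.
observed = np.array([[5, 0, 0, 3]])
pred = np.array([[4.8, 2.1, 3.9, 3.2]])

recs = top_n_unseen(pred, observed, user=0, n=2)
print(recs)
```

With the notebook's variables, the equivalent call would be `top_n_unseen(u_pred, data_matrix, user=0)`.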