# What is a recommendation model?
An ML-based recommendation model determines how similar items are to other things you like and then serves up a recommendation.

There are two major types:
  - User-based
  - Item-based

# Why recommendations?
A recommendation system helps users find compelling content in a large corpus.
For example, the Google Play Store provides millions of apps, while YouTube provides billions of videos. More apps and videos are added every day. How can users find new compelling content? Yes, one can use search to access content. However, a recommendation engine can display items that users might not have thought to search for on their own.

# Terminology
- Item

 The entities a system recommends. 

- Query

 The information a system uses to make recommendations. 

- Embedding

 A mapping from a discrete set (in this case, the set of queries, or the set of items to recommend) to a vector space called the embedding space. Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.



# Recommendation System Overview - An IR Point of View
### Candidate Generation -> Scoring -> Re-ranking
- Candidate Generation

 In this first stage, the system starts from a potentially huge corpus and generates a much smaller subset of candidates. For example, the candidate generator in YouTube reduces billions of videos down to hundreds or thousands. The model needs to evaluate queries quickly given the enormous size of the corpus. A given model may provide multiple candidate generators, each nominating a different subset of candidates. **In this stage, we can use looser, less precise metrics, for example mAP@1000.**

- Scoring

 Next, another model scores and ranks the candidates in order to select the set of items to display to the user. Since this model evaluates a relatively small subset of items, the system can use a more precise model relying on additional queries. **At this stage, we can also use stricter, more precise metrics.**

- Re-ranking

 Finally, the system must take into account additional constraints **(business rules)** for the final ranking. For example, the system removes items that the user explicitly disliked or boosts the score of fresher content. Re-ranking can also help ensure diversity, freshness, and fairness.
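
The three stages above can be sketched as a small pipeline. This is only an illustrative sketch, not how any production system is built: the random embeddings, the `disliked` set, and the function names are all assumptions made up for the example, and cosine similarity merely stands in for a more precise scoring model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1,000 items with 8-dimensional embeddings and one query embedding
item_embeddings = rng.normal(size=(1000, 8))
query_embedding = rng.normal(size=8)
disliked = {3, 42}  # item ids the user explicitly disliked (made up for illustration)

def generate_candidates(query, items, n=100):
    """Stage 1: cheap dot-product retrieval over the whole corpus."""
    scores = items @ query
    return np.argsort(scores)[::-1][:n]

def score_candidates(query, items, candidates):
    """Stage 2: stand-in for a more precise model, here cosine similarity."""
    subset = items[candidates]
    scores = subset @ query / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
    order = np.argsort(scores)[::-1]
    return candidates[order], scores[order]

def rerank(candidates, blocked, k=10):
    """Stage 3: apply business rules, here dropping disliked items."""
    return [int(i) for i in candidates if int(i) not in blocked][:k]

candidates = generate_candidates(query_embedding, item_embeddings)
ranked, _ = score_candidates(query_embedding, item_embeddings, candidates)
final = rerank(ranked, disliked)
print(final)
```

Each stage only sees the output of the previous one, so the expensive scoring step runs on 100 candidates instead of the full corpus.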

# Candidate Generation

### Classic Approaches

- Content-based Filtering
 
 Uses similarity between items to recommend items similar to what the user likes. **If user A watches two cute cat videos, then the system can recommend cute animal videos to that user.**


- Collaborative Filtering

 Uses similarities between queries and items simultaneously to provide recommendations. **If user A is similar to user B, and user B likes video 1, then the system can recommend video 1 to user A (even if user A hasn’t seen any videos similar to video 1).**
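
The user A / user B example above can be made concrete with a tiny feedback matrix. The sketch below is a simple neighbourhood-based variant, chosen here only because it is easy to read; the matrix values and the `recommend_for` helper are hypothetical.

```python
import numpy as np

# Hypothetical feedback matrix: rows are users, columns are videos; 1 means "liked"
#                     v0 v1 v2 v3
feedback = np.array([[1, 0, 1, 0],   # user A
                     [1, 1, 1, 0],   # user B
                     [0, 0, 0, 1]])  # user C

def recommend_for(user, feedback):
    """Recommend unseen items liked by the most similar other user."""
    norms = np.linalg.norm(feedback, axis=1)
    sims = feedback @ feedback[user] / (norms * norms[user])
    sims[user] = -1.0  # never pick the user themselves as the neighbour
    neighbour = int(np.argmax(sims))
    unseen = (feedback[user] == 0) & (feedback[neighbour] == 1)
    return neighbour, np.flatnonzero(unseen).tolist()

neighbour, items = recommend_for(0, feedback)
print(neighbour, items)  # prints: 1 [1]
```

User A's closest neighbour is user B, so video 1 (liked by B, not yet seen by A) is recommended without looking at any features of the video itself.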

### Embedding Space and Similarity Measurements

- Embedding Space
  
  Both content-based and collaborative filtering **map each item and each query (or context) to an embedding vector in a common embedding space.** Typically, the embedding space is **low-dimensional (much smaller than the size of the corpus)**, and captures some latent structure of the item or query set. Similar items, such as YouTube videos that are usually watched by the same user, end up close together in the embedding space.

- Similarity Measurements

 **A similarity measure is a function that takes a pair of embeddings and returns a scalar measuring their similarity.** The embeddings can be used for candidate generation as follows: given a query embedding _q_, the system looks for item embeddings that are close to _q_, that is, embeddings with high similarity.

  - Cosine
  
   Measures the cosine of the **"angle"** between two vectors/embeddings, ignoring their norms.

  - Dot Product
  
   The dot product between two vectors/embeddings. Recall the formula for cosine similarity: **if the two vectors have already been normalized, the dot product and cosine similarity coincide.**

  - Euclidean Distance
  
   A smaller distance means higher similarity. **Note that when the embeddings are normalized, the squared Euclidean distance coincides with dot-product (and cosine) up to a constant.**

 - Which Similarity Measure to Choose?
  
   Compared to the cosine, **the dot product similarity is sensitive to the norm of the embedding.** That is, the larger the norm of an embedding, the higher the similarity (for items with an acute angle) and the more likely the item is to be recommended. This can affect recommendations as follows:

   - **Items that appear very frequently in the training set tend to have embeddings with large norms.** If capturing popularity information is desirable, then you should prefer dot product. However, if you're not careful, the popular items may end up dominating the recommendations. In practice, you can use other variants of similarity measures that put less emphasis on the norm of the item.

   - **Items that appear very rarely may not be updated frequently during training.** Consequently, if they are initialized with a large norm, the system may recommend rare items over more relevant items. To avoid this problem, be careful about embedding initialization, and use appropriate regularization.
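
The three measures and the norm sensitivity discussed above can be checked numerically. This is a minimal sketch with made-up 2-D vectors; `niche` and `popular` are hypothetical embeddings that point in the same direction but differ in norm.

```python
import numpy as np

def cosine(q, x):
    return float(q @ x / (np.linalg.norm(q) * np.linalg.norm(x)))

def dot(q, x):
    return float(q @ x)

def euclidean(q, x):
    return float(np.linalg.norm(q - x))

# Hypothetical embeddings: `popular` has the same direction as `niche` but 3x the norm
query = np.array([1.0, 1.0])
niche = np.array([1.0, 0.8])
popular = 3.0 * niche

# Cosine ignores the norm, the dot product scales with it
print(cosine(query, niche), cosine(query, popular))
print(dot(query, niche), dot(query, popular))

# After L2 normalization: squared Euclidean distance = 2 - 2 * dot product
q = query / np.linalg.norm(query)
x = niche / np.linalg.norm(niche)
print(euclidean(q, x) ** 2, 2 - 2 * dot(q, x))
```

With the dot product, `popular` scores three times higher than `niche` even though both are equally well aligned with the query, which is the popularity effect described above.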

# Content-based Filtering

- Advantages

 - **The model doesn't need any data about other users, since the recommendations are specific to this user.** This makes it easier to scale to a large number of users.

 - **The model can capture the specific interests of a user,** and can recommend niche items that very few other users are interested in.

- Disadvantages

 - Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.

 - **The model can only make recommendations based on existing interests of the user.** In other words, the model has limited ability to expand on the users' existing interests.
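
To illustrate both the advantages and the limitation listed above, here is a minimal sketch of content-based filtering on hand-engineered features. The titles, the three feature columns, and the idea of using the mean of liked items as a user profile are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical hand-engineered item features; columns: [animals, cooking, sports]
titles = ["cat video 1", "cat video 2", "dog video", "pasta recipe", "football highlights"]
features = np.array([[1.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.9, 0.0, 0.1],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

def recommend(liked, features, topk=1):
    """Rank unseen items by cosine similarity to the mean of the liked items."""
    profile = features[liked].mean(axis=0)
    sims = features @ profile / (np.linalg.norm(features, axis=1) * np.linalg.norm(profile))
    sims[liked] = -np.inf  # do not recommend what the user has already watched
    return np.argsort(sims)[::-1][:topk].tolist()

picks = recommend([0, 1], features)
print([titles[i] for i in picks])  # prints: ['dog video']
```

No data from other users is needed, but the user who only watched cat videos is steered towards more animal content and never towards cooking or sports.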


We use the MovieLens ml-25m dataset for the demonstration. For simplicity, we use only the genres of each movie to build the embedding matrix.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
movie = pd.read_csv("/content/drive/My Drive/Data/Movielens/ml-25m/movies.csv")
movie.shape

(62423, 3)

In [3]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
class ContentFiltering():
  """Content-based recommender that embeds movies by their genres using TF-IDF."""

  def __init__(self, movie, limit=-1):
    self.movie = movie
    # Optionally restrict the corpus to the first `limit` movies to save memory
    if limit != -1:
      self.corpus = self.movie["genres"][:limit]
    else:
      self.corpus = self.movie["genres"]
    self.embeddings = None
    self.predictions = None
    self.vectorizer = None

  def train(self):
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Use TF-IDF rows as the embedding matrix; by default each row is L2-normalized
    self.vectorizer = TfidfVectorizer()
    self.embeddings = self.vectorizer.fit_transform(self.corpus)

    # Because the rows are normalized, cosine similarity and dot product give the same scores
    self.predictions = cosine_similarity(self.embeddings, self.embeddings)

  def get_feature_name(self):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (1.0+)
    return list(self.vectorizer.get_feature_names_out())

  def get_embedding_matrix(self):
    return self.embeddings.todense()

  def predict(self, movieId, topk=10):
    # Row of the query movie; assumes a default RangeIndex and that the row is within `limit`
    index = self.movie.index[self.movie["movieId"] == movieId][0]
    # Rank all movies by descending similarity, then drop the query movie itself
    ranked = self.predictions[index].argsort()[::-1]
    ranked = [i for i in ranked if i != index][:topk]
    for i in ranked:
      print(self.movie.iloc[i, 1:].values)


In [6]:
cb = ContentFiltering(movie, 20000)
cb.train()

In [7]:
cb.get_feature_name()

['action',
 'adventure',
 'animation',
 'children',
 'comedy',
 'crime',
 'documentary',
 'drama',
 'fantasy',
 'fi',
 'film',
 'genres',
 'horror',
 'imax',
 'listed',
 'musical',
 'mystery',
 'no',
 'noir',
 'romance',
 'sci',
 'thriller',
 'war',
 'western']

In [8]:
cb.get_embedding_matrix()

matrix([[0.        , 0.40769171, 0.54285348, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.50911033, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [9]:
cb.predict(1, 20)

['Antz (1998)' 'Adventure|Animation|Children|Comedy|Fantasy']
["Emperor's New Groove, The (2000)"
 'Adventure|Animation|Children|Comedy|Fantasy']
['Adventures of Rocky and Bullwinkle, The (2000)'
 'Adventure|Animation|Children|Comedy|Fantasy']
['DuckTales: The Movie - Treasure of the Lost Lamp (1990)'
 'Adventure|Animation|Children|Comedy|Fantasy']
['Toy Story 2 (1999)' 'Adventure|Animation|Children|Comedy|Fantasy']
['Wild, The (2006)' 'Adventure|Animation|Children|Comedy|Fantasy']
['Shrek the Third (2007)' 'Adventure|Animation|Children|Comedy|Fantasy']
['Asterix and the Vikings (Astérix et les Vikings) (2006)'
 'Adventure|Animation|Children|Comedy|Fantasy']
['Tale of Despereaux, The (2008)'
 'Adventure|Animation|Children|Comedy|Fantasy']
['Monsters, Inc. (2001)' 'Adventure|Animation|Children|Comedy|Fantasy']
['Valhalla (1986)' 'Adventure|Animation|Children|Fantasy']
['Lord of the Rings, The (1978)' 'Adventure|Animation|Children|Fantasy']
['Sinbad: Legend of the Seven Seas (2003)'
 'Ad

# Reference 
1. Google Recommendation Systems https://developers.google.com/machine-learning/recommendation
2. Movielens https://grouplens.org/datasets/movielens/
3. TfidfVectorizer https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
4. Cosine Similarity https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
