## Recommendation Systems
There are three main types of recommendation systems:
* **Content-based filtering** - looks at an item's attributes and recommends similar items. For example, if you watched an action movie (Jumanji), the recommendation could be another action movie (Tomb Raider).
* **Collaborative filtering** - looks at which users liked which items, then suggests items liked by users with similar tastes.
* **Hybrid models** - combine both approaches for more nuanced recommendations (e.g. Amazon, Netflix and Google Ads).

We'll start by looking at `content-based filtering` using cosine similarity on the MovieLens dataset.

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
#load the movie and genres dataset

items = ["movie_id", "movie title", "release date", "video_release_date",
            "IMDb URL", "unknown", "action", 'adventure', "animation",
              "children's", "comedy", "crime", "documentary", "drama", "fantasy",
              "Film-Noir", "horror", "musical", "mystery", "romance", "Sci-Fi",
              "thriller", "war", "western"]

movies = pd.read_csv('./data/mlk/u.item', sep="|", names=items, encoding='latin-1')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   movie title         1682 non-null   object 
 2   release date        1681 non-null   object 
 3   video_release_date  0 non-null      float64
 4   IMDb URL            1679 non-null   object 
 5   unknown             1682 non-null   int64  
 6   action              1682 non-null   int64  
 7   adventure           1682 non-null   int64  
 8   animation           1682 non-null   int64  
 9   children's          1682 non-null   int64  
 10  comedy              1682 non-null   int64  
 11  crime               1682 non-null   int64  
 12  documentary         1682 non-null   int64  
 13  drama               1682 non-null   int64  
 14  fantasy             1682 non-null   int64  
 15  Film-Noir           1682 non-null   int64  
 16  horror

We'll do a bit of cleaning by dropping two columns we don't need: `video_release_date`, which is entirely null, and `IMDb URL`, which has a few null entries.

In [3]:
movies = movies.drop(['video_release_date','IMDb URL'], axis=1)
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   movie_id      1682 non-null   int64 
 1   movie title   1682 non-null   object
 2   release date  1681 non-null   object
 3   unknown       1682 non-null   int64 
 4   action        1682 non-null   int64 
 5   adventure     1682 non-null   int64 
 6   animation     1682 non-null   int64 
 7   children's    1682 non-null   int64 
 8   comedy        1682 non-null   int64 
 9   crime         1682 non-null   int64 
 10  documentary   1682 non-null   int64 
 11  drama         1682 non-null   int64 
 12  fantasy       1682 non-null   int64 
 13  Film-Noir     1682 non-null   int64 
 14  horror        1682 non-null   int64 
 15  musical       1682 non-null   int64 
 16  mystery       1682 non-null   int64 
 17  romance       1682 non-null   int64 
 18  Sci-Fi        1682 non-null   int64 
 19  thrill

In [4]:
movies.head()

Unnamed: 0,movie_id,movie title,release date,unknown,action,adventure,animation,children's,comedy,crime,...,fantasy,Film-Noir,horror,musical,mystery,romance,Sci-Fi,thriller,war,western
0,1,Toy Story (1995),01-Jan-1995,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


The genre columns are one-hot encoded (a movie can have a 1 in several genres), so let's build a genre vector for each movie. We'll then pass these vectors to the cosine similarity function to get recommendations.

In [5]:
genres = [ "unknown", "action", 'adventure', "animation",
              "children's", "comedy", "crime", "documentary", "drama", "fantasy",
              "Film-Noir", "horror", "musical", "mystery", "romance", "Sci-Fi",
              "thriller", "war", "western"]

genre_features = movies[genres]

genre_features.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   unknown      1682 non-null   int64
 1   action       1682 non-null   int64
 2   adventure    1682 non-null   int64
 3   animation    1682 non-null   int64
 4   children's   1682 non-null   int64
 5   comedy       1682 non-null   int64
 6   crime        1682 non-null   int64
 7   documentary  1682 non-null   int64
 8   drama        1682 non-null   int64
 9   fantasy      1682 non-null   int64
 10  Film-Noir    1682 non-null   int64
 11  horror       1682 non-null   int64
 12  musical      1682 non-null   int64
 13  mystery      1682 non-null   int64
 14  romance      1682 non-null   int64
 15  Sci-Fi       1682 non-null   int64
 16  thriller     1682 non-null   int64
 17  war          1682 non-null   int64
 18  western      1682 non-null   int64
dtypes: int64(19)
memory usage: 249.8 KB


## Cosine Similarity
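Cosine similarity measures the angle between two vectors and turns it into a score between -1 and 1:
* 1 = the vectors point in the same direction
* 0 = the vectors are orthogonal (here, the two movies share no genres)
* -1 = the vectors point in opposite directions

Because our genre vectors only contain 0s and 1s, the scores here will fall between 0 and 1. We will use these values to compare movies based on their genre vectors.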
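To make the score concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The genre vectors are made up for illustration (they are not rows from the dataset), and `cosine` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical genre vectors over four genres: [animation, children's, comedy, drama]
movie_a = [1, 1, 1, 0]
movie_b = [0, 1, 1, 1]
movie_c = [0, 0, 0, 1]

sim_ab = cosine(movie_a, movie_b)  # three genres each, two of them shared
sim_ac = cosine(movie_a, movie_c)  # no genres in common

print(sim_ab, sim_ac)
```

Two movies with three genres each that share two of them score 2/3, and movies with no genre in common score 0.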

In [6]:
#get the matrix to use within our function.
cosine_matrix = cosine_similarity(genre_features)

#displaying similarity of first movie with others
cosine_matrix[0][:15]
# cosine_matrix

array([1.        , 0.        , 0.        , 0.33333333, 0.        ,
       0.        , 0.        , 0.66666667, 0.        , 0.        ,
       0.        , 0.        , 0.57735027, 0.        , 0.        ])

In [7]:
movies.head(10)

Unnamed: 0,movie_id,movie title,release date,unknown,action,adventure,animation,children's,comedy,crime,...,fantasy,Film-Noir,horror,musical,mystery,romance,Sci-Fi,thriller,war,western
0,1,Toy Story (1995),01-Jan-1995,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),01-Jan-1995,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),01-Jan-1995,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),22-Jan-1996,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Next, let's define a function that takes a movie title as input, sorts the movies by their score in that movie's row of the cosine matrix, and returns the top 5 most similar titles.

In [8]:
def movie_recommendation(movie_title, n=5):
    movie_index = movies[movies['movie title'] == movie_title].index[0]

    scores = list(enumerate(cosine_matrix[movie_index]))

    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # skip the first entry (assumed to be the query movie itself) and keep the next n
    return_scores = sorted_scores[1: n+1]
    #get movie indices and names
    movie_indices = [i[0] for i in return_scores]

    return movies['movie title'].iloc[movie_indices]



movie_recommendation("Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)")

8       Dead Man Walking (1995)
14    Mr. Holland's Opus (1995)
17    White Balloon, The (1995)
18        Antonia's Line (1995)
29         Belle de jour (1967)
Name: movie title, dtype: object

In [9]:
movie_recommendation('Dead Man Walking (1995)')

8       Dead Man Walking (1995)
14    Mr. Holland's Opus (1995)
17    White Balloon, The (1995)
18        Antonia's Line (1995)
29         Belle de jour (1967)
Name: movie title, dtype: object

In [10]:
movies.tail(12)

Unnamed: 0,movie_id,movie title,release date,unknown,action,adventure,animation,children's,comedy,crime,...,fantasy,Film-Noir,horror,musical,mystery,romance,Sci-Fi,thriller,war,western
1670,1671,"Further Gesture, A (1996)",20-Feb-1998,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1671,1672,Kika (1993),01-Jan-1993,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1672,1673,Mirage (1995),01-Jan-1995,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1673,1674,Mamma Roma (1962),01-Jan-1962,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1674,1675,"Sunchaser, The (1996)",25-Oct-1996,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1675,1676,"War at Home, The (1996)",01-Jan-1996,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1676,1677,Sweet Nothing (1995),20-Sep-1996,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1677,1678,Mat' i syn (1997),06-Feb-1998,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


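Notice that asking for recommendations for 'Dead Man Walking (1995)' returns the movie itself. Several movies tie with a score of 1.0, so the query movie is not guaranteed to come first in the sorted list, and skipping the first entry can drop the wrong movie. Let's improve the function so that it never returns the title passed into it.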

In [11]:
def improved_movie_recommendation(movie_title, n=5):
    movie_index = movies[movies['movie title'] == movie_title].index[0]

    scores = list(enumerate(cosine_matrix[movie_index]))

    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # collect the top n scores, skipping the query movie by its index
    return_scores = []
    for score in sorted_scores:
        index = score[0]
        if index != movie_index:
            return_scores.append(score)
        # stop once we have n recommendations
        if len(return_scores) == n:
            break

    movie_indices = [i[0] for i in return_scores]

    return movies['movie title'].iloc[movie_indices]


movie_recommendation('Dead Man Walking (1995)')

8       Dead Man Walking (1995)
14    Mr. Holland's Opus (1995)
17    White Balloon, The (1995)
18        Antonia's Line (1995)
29         Belle de jour (1967)
Name: movie title, dtype: object

In [12]:
# movie_recommendation('B. Monkey (1998)')
improved_movie_recommendation('Dead Man Walking (1995)')

5     Shanghai Triad (Yao a yao yao dao waipo qiao) ...
14                            Mr. Holland's Opus (1995)
17                            White Balloon, The (1995)
18                                Antonia's Line (1995)
29                                 Belle de jour (1967)
Name: movie title, dtype: object

In [13]:
improved_movie_recommendation('B. Monkey (1998)')

606                         Rebecca (1940)
885           Life Less Ordinary, A (1997)
32                        Desperado (1995)
67                        Crow, The (1994)
89     So I Married an Axe Murderer (1993)
Name: movie title, dtype: object
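The same idea can be sketched more compactly with NumPy: mask out the query movie's own score, then take the indices of the n largest remaining scores. This is an illustrative variant, not the notebook's function. `top_n_similar`, the toy similarity matrix and the toy titles are all hypothetical, and the sketch assumes the titles Series has a default 0..n-1 index so that labels and positions coincide.

```python
import numpy as np
import pandas as pd

def top_n_similar(sim_matrix, titles, title, n=5):
    """Return the n titles most similar to `title`, never including `title` itself."""
    idx = titles[titles == title].index[0]
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                          # the query movie can never be picked
    top = np.argsort(-scores, kind="stable")[:n]   # highest scores first, ties keep index order
    return titles.iloc[top]

# toy data: movies A and B are identical in genre, C is close, D is unrelated
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_sim = np.array([
    [1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
])

result = top_n_similar(toy_sim, toy_titles, "B", n=2)
print(list(result))
```

Masking by index rather than skipping the first sorted entry keeps the result correct even when several movies tie with the query movie at a score of 1.0.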


### Collaborative Filtering
We have seen an example of content-based filtering. As you might have guessed, it has a clear limitation: just because two items are similar doesn't mean the user will like both of them.

This is where collaborative filtering comes in: we look at user preferences and recommend based on those.

Collaborative filtering (CF) is one of the most widely used approaches for building recommendation systems. It uses users' behavior, in the form of user-item ratings, to make predictions.

In [16]:
""" 
          Sleepy Hollow  |      Dead-Pool     |     Madagascar |        Dodge 
Jacob           5                   3                   0                  2
Martha          4                   0                   0                  1
Tony            1                   1                   0                  5
"""


# [5, 3, 0, 2],
# [4, 0, 0, 1],
# [1, 1, 0, 5]


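As a rough illustration of how user-item ratings capture taste, the sketch below compares the three users in the toy table by the cosine similarity of their rating rows. Treating a 0 as an ordinary value in the vector (rather than "not rated") is a simplifying assumption made here, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    [[5, 3, 0, 2],
     [4, 0, 0, 1],
     [1, 1, 0, 5]],
    index=["Jacob", "Martha", "Tony"],
    columns=["Sleepy Hollow", "Dead-Pool", "Madagascar", "Dodge"],
)

# scale each user's row to unit length, then take dot products between rows
values = ratings.to_numpy(dtype=float)
unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)
user_similarity = pd.DataFrame(
    unit_rows @ unit_rows.T, index=ratings.index, columns=ratings.index
)

# the user whose ratings point in the most similar direction to Jacob's
most_similar_to_jacob = user_similarity.loc["Jacob"].drop("Jacob").idxmax()
print(most_similar_to_jacob)
```

Under these assumptions Jacob's ratings are closer to Martha's than to Tony's, so a user-based approach would lean on Martha's ratings when recommending to Jacob.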

Singular value decomposition (SVD) breaks a user-item ratings matrix like this one down into three parts:
- `U` - how each user relates to the hidden patterns (latent factors).
- `S` - the weight/strength of each latent dimension (the singular values).
- `V^t` - how each item maps onto these same dimensions.

In [19]:
import numpy as np
from numpy.linalg import svd


ratings = np.array([
    [5, 3, 0, 2],
    [4, 0, 0, 1],
    [1, 1, 0, 5]
])


u, s, vt = svd(ratings)


# s = np.pad(s, ((0, 0), (0, 1)), mode='constant')

# np.dot(u, np.dot(s, vt))

In [20]:
ratings.shape

(3, 4)

In [21]:
s.shape

(3,)

In [22]:
vt.shape

(4, 4)

In [23]:
s = np.diag(s)
s

array([[7.93906308, 0.        , 0.        ],
       [0.        , 4.01213138, 0.        ],
       [0.        , 0.        , 1.69531093]])
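To check that the three parts really do rebuild the ratings matrix, here is a small self-contained sketch. It passes `full_matrices=False` to `np.linalg.svd`, which is a choice made for this example so that the shapes line up for multiplication without padding. The rank-2 approximation at the end is also only illustrative.

```python
import numpy as np

ratings = np.array([
    [5, 3, 0, 2],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
], dtype=float)

# reduced SVD: u is (3, 3), s holds 3 singular values, vt is (3, 4)
u, s, vt = np.linalg.svd(ratings, full_matrices=False)

# multiplying the three parts back together recovers the original matrix
reconstructed = u @ np.diag(s) @ vt

# keeping only the k strongest latent dimensions gives a low-rank approximation
k = 2
approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

print(np.allclose(reconstructed, ratings))
print(np.round(approx, 2))
```

Dropping the weakest dimension is the idea behind using SVD for recommendations: the low-rank matrix keeps the main patterns and smooths over the rest.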

In [35]:
movie_ratings = pd.read_csv('./data/mlk/u.data', sep="\t", names=['user_id', 'movie_id', 'ratings', 'timestampt'])

movie_ratings_copy = movie_ratings.copy()
movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   user_id     100000 non-null  int64
 1   movie_id    100000 non-null  int64
 2   ratings     100000 non-null  int64
 3   timestampt  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [36]:
movie_ratings['ratings'].unique()

array([3, 1, 2, 4, 5], dtype=int64)

In [37]:
# !pip install scikit-surprise
from surprise import SVD, Dataset, accuracy, Reader
from surprise.model_selection import train_test_split


#get ratings to surprise format
reader = Reader(rating_scale=(1, 5))

#load the data into surprise format
data = Dataset.load_from_df(movie_ratings[['user_id', 'movie_id', 'ratings']], reader)


In [38]:
#split data
train_set, test_set = train_test_split(data, test_size=0.3, random_state=21)

#initiate model
svd_model = SVD()


In [39]:
#modelling 
svd_model.fit(train_set)

#prediction and evaluation
svd_ratings = svd_model.test(test_set)

#rmse & mae
rmse = accuracy.rmse(svd_ratings)
mae = accuracy.mae(svd_ratings)

RMSE: 0.9425
MAE:  0.7432
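For reference, both metrics are simple to compute by hand. The sketch below uses made-up actual and predicted ratings (not output from the model above) to show what RMSE and MAE measure.

```python
import numpy as np

# hypothetical true ratings and model predictions for four (user, movie) pairs
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

errors = predicted - actual

# RMSE squares the errors before averaging, so large misses are penalised more
rmse_value = float(np.sqrt(np.mean(errors ** 2)))

# MAE averages the absolute errors, so every miss counts in proportion to its size
mae_value = float(np.mean(np.abs(errors)))

print(rmse_value, mae_value)
```

Both are in the same units as the ratings, so an RMSE just under 1 means predictions are typically off by a little less than one star on the 1 to 5 scale.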


### ALS
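Alternating Least Squares (ALS), as implemented in the `implicit` library, is designed for implicit feedback (did the user interact with the item or not), so we first convert the ratings into a binary liked/not-liked signal.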

In [None]:
# !pip install implicit
import implicit
from sklearn.model_selection import train_test_split
# from surprise.model_selection import train_test_split

# binarize: a rating of 3 or more counts as a positive interaction (1), otherwise 0
movie_ratings['ratings'] = (movie_ratings['ratings'] >= 3).astype(int)

# re-encode user and movie ids as contiguous zero-based integer codes
movie_ratings['user_id'] = movie_ratings['user_id'].astype('category').cat.codes

movie_ratings['movie_id'] = movie_ratings['movie_id'].astype('category').cat.codes

# split the dataset, stratifying on user_id so each user's ratings are split proportionally between both sets
train_set_als, test_set_als = train_test_split(movie_ratings, test_size=0.2, random_state=42, stratify=movie_ratings['user_id'])


train_set_als.info()

<class 'pandas.core.frame.DataFrame'>
Index: 80000 entries, 80618 to 67324
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     80000 non-null  int16
 1   movie_id    80000 non-null  int16
 2   ratings     80000 non-null  int32
 3   timestampt  80000 non-null  int64
dtypes: int16(2), int32(1), int64(1)
memory usage: 1.8 MB
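Because `cat.codes` replaces the original ids with zero-based codes, anything the model returns later is expressed in codes, not ids. The sketch below shows one way to keep a mapping back to the original ids and titles. The toy ratings, the toy title lookup and all variable names are hypothetical.

```python
import pandas as pd

# hypothetical ratings with non-contiguous movie ids
toy_ratings = pd.DataFrame({"movie_id": [10, 30, 10, 20]})

# hypothetical title lookup keyed by the original movie id
toy_titles = pd.Series({10: "Movie A", 20: "Movie B", 30: "Movie C"})

movie_cat = toy_ratings["movie_id"].astype("category")
codes = movie_cat.cat.codes                             # codes follow the sorted order of the ids
code_to_id = dict(enumerate(movie_cat.cat.categories))  # code -> original movie id

# pretend a model recommended the items with codes 2 and 0
recommended_codes = [2, 0]
recommended_titles = [toy_titles.loc[code_to_id[code]] for code in recommended_codes]

print(list(codes), recommended_titles)
```

Building the code-to-id mapping at encoding time avoids relying on the ids happening to be contiguous.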


In [50]:
# use new names here: assigning to `movies` would overwrite the movies DataFrame loaded earlier
n_users = movie_ratings['user_id'].nunique()
n_movies = movie_ratings['movie_id'].nunique()

In [54]:
#create a sparse user-item matrix for both train and test
from scipy.sparse import csr_matrix

train_sparse = csr_matrix((train_set_als['ratings'], (train_set_als['user_id'], train_set_als['movie_id'])), shape=(n_users, n_movies))
test_sparse = csr_matrix((test_set_als['ratings'], (test_set_als['user_id'], test_set_als['movie_id'])), shape=(n_users, n_movies))
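The `csr_matrix((values, (rows, cols)), shape=...)` form used here places each value at its (row, column) position and leaves everything else empty. A tiny sketch with made-up interactions shows the resulting layout, with rows as users and columns as items.

```python
import numpy as np
from scipy.sparse import csr_matrix

# hypothetical interactions: user 0 liked items 1 and 3, user 2 liked item 0
user_idx = np.array([0, 0, 2])
item_idx = np.array([1, 3, 0])
liked = np.array([1, 1, 1])

# 3 users x 4 items; user 1 has no interactions, so that row stays empty
toy_sparse = csr_matrix((liked, (user_idx, item_idx)), shape=(3, 4))
dense_view = toy_sparse.toarray()

print(dense_view)
```

Passing `shape` explicitly matters when splitting into train and test sets: it keeps both matrices the same size even if some user or item is missing from one of them.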

In [55]:
model = implicit.als.AlternatingLeastSquares(random_state=42)

model.fit(train_sparse)

  check_blas_config()
100%|██████████| 15/15 [00:00<00:00, 28.69it/s]


In [56]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(random_state=42)

model.fit(train_sparse)

100%|██████████| 15/15 [00:00<00:00, 27.83it/s]


In [61]:

predicted_scores = []
actual_labels = []

# for every (user, item) pair in the test set, score it with the dot product
# of the learned user and item factors and keep the true binary label
for user, item, label in zip(test_set_als['user_id'], test_set_als['movie_id'], test_set_als['ratings']):
    if user < model.user_factors.shape[0] and item < model.item_factors.shape[0]:
        user_v = model.user_factors[user]
        item_v = model.item_factors[item]

        predicted_scores.append(np.dot(user_v, item_v))
        actual_labels.append(label)


# TODO: map the recommended item indices back to movie titles

# top 5 recommendations for user 25, returned as (item indices, scores)
model.recommend(25, train_sparse[25], 5)

(array([750, 280, 247, 116, 284]),
 array([0.5000793 , 0.4959383 , 0.45313233, 0.4440935 , 0.44021663],
       dtype=float32))
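The scores and binary labels collected in the evaluation loop earlier are not yet turned into a metric. One possible choice, sketched here on made-up numbers, is ROC AUC from scikit-learn, which measures how often a liked item is scored above a not-liked one. This is a suggestion rather than the notebook's own evaluation, and the toy arrays are hypothetical.

```python
from sklearn.metrics import roc_auc_score

# hypothetical binary labels (1 = liked) and model scores for five test pairs
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.3, 0.6, 0.2, 0.4]

# fraction of (liked, not-liked) pairs where the liked item gets the higher score
auc = roc_auc_score(toy_labels, toy_scores)

print(auc)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so it gives a scale-free way to judge the dot-product scores.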