### Introducing the MovieLens 1M database
[MovieLens 1M database] published by [GroupLens](https://grouplens.org). This database includes 1,000,000 (1M) ratings from about 6000 _users_ for about _4000_ movies. We can also find similar databases with about 10M, 20M ratings.

We only need to care about the following files:

* `u.data`: Contains all ratings of 943 _users_ for 1682 movies. Each user rates at least 20 movies. Information about rate time is also given but we do not use it in this article. 

* `rating_test.dat, rating_train.dat`: is a way to divide the entire data into two subsets, one for training, one for testing with a ratio of 80%-20%. 

* `user.dat`: Contains information about _users_, including: id, age, gender, occupation, zipcode (region), because this information can also affect _users_'s interests.

* `item.dat`: information about each movie. The first few lines of the file:
```
1|Toy Story (1995)|January 1, 1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|1|1|1| 0|0|0|0|0|0|0|0|0|0|0|0|0|0
2|Jumanji (1995)|January 1, 1995||http://us.imdb.com/M/title-exact?Jumanji%20(1995)|0|1|0|1|0|0|0 |0|1|0|0|0|0|0|0|0|0|0|0
```
In each line, we will see the movie's _id_, movie name, release date, link on imdb, and the binary numbers `0`, `1` at the end to indicate which of the 19 genres the movie belongs to. given type. Information about this category will be used to build item profiles.

In [1]:
import pandas as pd 
# Reading user file:
u_cols =  ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-1m/user.dat', sep='|', names=u_cols,
 encoding='latin-1')

n_users = users.shape[0]
print ('Number of users:', n_users)
# users.head() #uncomment this to see some few examples

Number of users: 6040


In [2]:
import pandas as pd 
import numpy as np
#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

ratings_base = pd.read_csv('ml-1m/rating_train.dat', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-1m/rating_test.dat', sep='\t', names=r_cols, encoding='latin-1')

rate_train = ratings_base.to_numpy()
rate_test = ratings_test.to_numpy()

print ('Number of training rates:', rate_train.shape[0])
print ('Number of test rates:', rate_test.shape[0])

Number of training rates: 800168
Number of test rates: 200041



### Build item profiles

The important job in the Content-Based recommendation system is to build a profile for each item, that is, a feature vector for each item. First of all, we need to load all information about _items_ into the variable `items`:

In [3]:
#Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

items = pd.read_csv('ml-1m/item.dat', sep='|', names=i_cols,
 encoding='latin-1')

n_items = items.shape[0]
print ('Number of items:', n_items)

Number of items: 3883



Only care about the 19 binary values ​​at the end of each row:

In [4]:
X0 = items.to_numpy()
X_train_counts = X0[:, -19:]
print (X0)

[[1 'Toy Story (1995)' '01-Jan-1995' ... 0 0 0]
 [2 'Jumanji (1995)' '01-Jan-1995' ... 0 0 0]
 [3 'Grumpier Old Men (1995)' '01-Jan-1995' ... 0 0 0]
 ...
 [3950 'Tigerland (2000)' '01-Jan-2000' ... 0 0 0]
 [3951 'Two Family House (2000)' '01-Jan-2000' ... 0 0 0]
 [3952 'Contender, The (2000)' '01-Jan-2000' ... 0 0 0]]



Next, we will build a feature vector for each item based on the movie genre matrix and TF-IDF

In [5]:
#tfidf
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=True, norm ='l2')
tfidf = transformer.fit_transform(X_train_counts.tolist()).toarray()
print (tfidf)

[[0.         0.         0.72890105 ... 0.         0.         0.        ]
 [0.         0.4998135  0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


After this step, each row of `tfidf` corresponds to the feature vector of a movie.

Next, for each _user_, we need to build which movies that _user_ has _rated_, and the value of those _ratings_.

In [6]:
import numpy as np
def get_items_rated_by_user(rate_matrix, user_id):
    """
    return (item_ids, scores)
    """
    y = rate_matrix[:,0] # all users
    # item indices rated by user_id
    # we need to +1 to user_id since in the rate_matrix, id starts from 1 
    # but id in python starts from 0
    ids = np.where(y == user_id +1)[0] 
    item_ids = rate_matrix[ids, 1] - 1 # index starts from 0 
    scores = rate_matrix[ids, 2]
    return (item_ids, scores)

From now, we can find out the Ridge Regression's coefficient for each _user_:

After having the coefficient `W` and `b`, _ratings_ for each _items_ are predicted by:

In [8]:
import numpy as np
from sklearn.linear_model import Ridge
d = tfidf.shape[1]  # data dimension
W = np.zeros((d, n_users))
b = np.zeros((1, n_users))

for n in range(n_users):
    ids, scores = get_items_rated_by_user(rate_train, n)
    
    if len(ids) > 0:
        max_id = max(ids)
        if max_id >= tfidf.shape[0]:
            new_shape = (max_id + 1, tfidf.shape[1])
            tfidf_temp = np.zeros(new_shape)
            tfidf_temp[:tfidf.shape[0], :] = tfidf
        else:
            tfidf_temp = tfidf

        clf = Ridge(alpha=0.01, fit_intercept=True)
        Xhat = tfidf_temp[ids, :]
        
        clf.fit(Xhat, scores)
        W[:, n] = clf.coef_
        b[0, n] = clf.intercept_
    else:
        print(f"No items rated by user {n}")

print(W)
print(b)

[[ 0.30243023 -0.30668165  3.08438071 ...  1.2104714   0.52077892
   0.13176732]
 [-1.1797542   1.00993362  0.3713706  ... -1.29414489  1.53748308
  -0.11726785]
 [ 1.30028656  0.61205063 -1.63814676 ...  2.82143938 -0.81340502
   0.12741224]
 ...
 [ 0.05344967  0.15691763  1.03006534 ...  2.53039548  0.29288775
  -0.9079871 ]
 [ 0.          1.05541271  0.28957891 ...  0.          0.
   0.0211495 ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]
[[4.50909544 2.86727198 3.7075253  ... 1.75950551 2.92901693 3.66018486]]


In [9]:
# Calculate predicted scores
Yhat = tfidf.dot(W) + b
print(Yhat)

[[4.98328148 3.98587726 3.1059481  ... 4.08274266 2.81757845 3.53099594]
 [3.42609478 4.6874466  2.2549113  ... 1.11267442 4.84296543 3.78592335]
 [4.33098398 4.03759697 4.14102968 ... 3.03716871 4.21251547 3.66299828]
 ...
 [4.24833163 4.01264743 3.44837918 ... 4.0174268  3.94924483 3.646784  ]
 [4.24833163 4.01264743 3.44837918 ... 4.0174268  3.94924483 3.646784  ]
 [3.76728843 3.94184484 3.79353906 ... 2.50501085 4.26786956 2.88143311]]


Here is an example with _user_ whose _id_ is `100`.

In [21]:
n = 100
ids, scores = get_items_rated_by_user(rate_test, 10)
Yhat[n, ids]
print ('Rated movies ids:', ids)
print ('True ratings:', scores)
print ('Predicted ratings:', Yhat[ids, n])

Rated movies ids: [2958 1886 2538 2745 2638  356 1562  787 3808 2794 1752 1545  453 2328
 1672  332 2430 1746 3181 1664 1243  341  355]
True ratings: [5 2 4 4 2 1 5 1 5 5 4 4 2 5 3 4 2 3 1 2 4 3 5]
Predicted ratings: [4.45249714 4.91901155 4.91901155 4.45249714 4.83573704 4.53589923
 4.8224865  4.48765186 4.68228948 4.54059856 4.60611225 5.10255803
 4.7697096  4.91901155 4.66613744 5.19536749 4.48765186 4.91901155
 4.91901155 4.73870491 5.19536749 4.73183024 4.60611225]


To evaluate the found model, we will use the Root Mean Squared Error (RMSE), which is the square root of the average of the squares of the error. The error is calculated as the difference between _true rating_ and _predicted rating_:

In [13]:
from math import sqrt
def evaluate(Yhat, rates, W, b):
    se = 0
    cnt = 0
    for n in range(n_users):
        ids, scores_truth = get_items_rated_by_user(rates, n)
        
        if len(ids) > 0:
            max_id = max(ids)
            if max_id >= Yhat.shape[0]:
                new_shape = (max_id + 1, Yhat.shape[1])
                Yhat_temp = np.zeros(new_shape)
                Yhat_temp[:Yhat.shape[0], :] = Yhat
            else:
                Yhat_temp = Yhat

        scores_pred = Yhat_temp[ids, n]
        e = scores_truth - scores_pred
        se += (e * e).sum(axis=0)
        cnt += e.size
    return sqrt(se / cnt)

print('RMSE for training:', evaluate(Yhat, rate_train, W, b))
print('RMSE for test:', evaluate(Yhat, rate_test, W, b))

RMSE for training: 1.019168710035481
RMSE for test: 1.1641591390251866


Also, we evaluate the Mean Average Error (MAE)

In [16]:
from math import sqrt

def evaluate_mae(Yhat, rates, W, b):
    ae = 0
    cnt = 0
    n_users = Yhat.shape[1]  # số lượng người dùng
    for n in range(n_users):
        ids, scores_truth = get_items_rated_by_user(rates, n)
        
        if len(ids) > 0:
            max_id = max(ids)
            if max_id >= Yhat.shape[0]:
                new_shape = (max_id + 1, Yhat.shape[1])
                Yhat_temp = np.zeros(new_shape)
                Yhat_temp[:Yhat.shape[0], :] = Yhat
            else:
                Yhat_temp = Yhat

        scores_pred = Yhat_temp[ids, n]
        e = scores_truth - scores_pred
        ae += np.sum(np.abs(e))
        cnt += e.size
    return abs(ae / cnt) 

print('MAE for training:', evaluate_mae(Yhat, rate_train, W, b))
print('MAE for test:', evaluate_mae(Yhat, rate_test, W, b))

MAE for training: 0.7806232997638736
MAE for test: 0.9050784338561105



Thus, with the training set, the RMSE is about _1.02_, while MAE is _0.78_ ; With the test set, the error is slightly larger, about _1.16_ and _0.90_. We see that this result is not really good because we have simplified the model too much. Better results can be seen in the other two models that we have in this project!