

<a id="ref1"></a>
# Acquiring the Data

The dataset comes from [GroupLens](http://grouplens.org/datasets/movielens/). To download and extract it, uncomment and run the following shell commands:


In [1]:
# !wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
# print('unzipping ...')
# !unzip -o -j moviedataset.zip 

Now you're ready to start working with the data!

<hr>

<a id="ref2"></a>
# Preprocessing

First, let's get all of the imports out of the way:

In [1]:
import pandas as pd
import numpy as np
import time

Now let's read each file into its own DataFrame:

In [2]:
movies_df = pd.read_csv('ml-latest/movies.csv')
ratings_df = pd.read_csv('ml-latest/ratings.csv')

Let's also take a peek at how each of them is organized:

In [3]:
movies_df.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
ratings_df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 169 | 2.5 | 1204927694 |
| 1 | 1 | 2471 | 3.0 | 1204927438 |
| 2 | 1 | 48516 | 5.0 | 1204927435 |
| 3 | 2 | 2571 | 3.5 | 1436165433 |
| 4 | 2 | 109487 | 4.0 | 1436165496 |


In [5]:
users_list = list(ratings_df['userId'].unique())
movies_list = list(ratings_df['movieId'].unique())

In [6]:
print(len(users_list))
print(len(movies_list))

247753
33670


In [7]:
no_users = 100
no_items = 1000
subusers = users_list[:no_users]
submovies = movies_list[:no_items]

In [8]:
sub_ratings_df = ratings_df[ratings_df['userId'].isin(subusers) & ratings_df['movieId'].isin(submovies)]
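
The filter above keeps a rating only when both its user and its movie belong to the chosen subsets. A minimal sketch of the same pattern on a toy frame; the data and the names `toy_ratings`, `keep_users`, and `keep_movies` are made up for illustration:

```python
import pandas as pd

# Toy ratings; the values are invented for illustration only.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 1.0],
})

keep_users = [1, 2]      # hypothetical user subset
keep_movies = [10, 30]   # hypothetical movie subset

# A row survives only if BOTH conditions hold (element-wise AND).
mask = toy_ratings['userId'].isin(keep_users) & toy_ratings['movieId'].isin(keep_movies)
toy_subset = toy_ratings[mask]
print(toy_subset)
```

Only the rows for (user 1, movie 10) and (user 2, movie 10) remain; movie 30 is dropped because its only rating comes from user 3, who is not in the user subset.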

In [9]:
no_items = len(sub_ratings_df['movieId'].unique())
no_items

1000

In [10]:
user_item = np.zeros((no_users, no_items), dtype=np.float32)

In [11]:
user_item.shape

(100, 1000)

In [12]:
user_item_df = pd.DataFrame(user_item, columns = sub_ratings_df['movieId'].unique(), index = subusers)
user_item_df.head()

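| | 169 | 2471 | 48516 | 2571 | 109487 | 112552 | 112556 | 356 | 2394 | 2431 | ... | 413 | 419 | 421 | 428 | 429 | 442 | 458 | 460 | 473 | 482 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |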


In [13]:
sub_ratings_df.shape

(5332, 4)
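
The shape above gives a quick sense of how sparse the user-item matrix will be. A small sketch of the arithmetic, assuming each (userId, movieId) pair appears at most once in `sub_ratings_df`:

```python
n_ratings = 5332               # rows in sub_ratings_df, from the output above
n_users, n_items = 100, 1000   # shape of the user-item matrix

# Fraction of matrix cells that will hold an actual rating.
density = n_ratings / (n_users * n_items)
print(f'{density:.2%} of the cells hold a rating')
```

Roughly 5% of the cells are filled; the rest stay at 0, which the factorization later treats as "not rated".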

In [14]:
sub_ratings_df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 169 | 2.5 | 1204927694 |
| 1 | 1 | 2471 | 3.0 | 1204927438 |
| 2 | 1 | 48516 | 5.0 | 1204927435 |
| 3 | 2 | 2571 | 3.5 | 1436165433 |
| 4 | 2 | 109487 | 4.0 | 1436165496 |


In [15]:
# Copy each rating into its (user, movie) cell; itertuples keeps the integer dtype of the IDs
for row in sub_ratings_df.itertuples(index=False):
    user_item_df.loc[row.userId, row.movieId] = row.rating
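
The row-by-row loop above is easy to follow but slow on larger subsets. As a hedged alternative, the same kind of matrix can usually be built in one step with `DataFrame.pivot`. The sketch below uses a made-up toy frame (`toy_ratings`, `toy_users`) and assumes every (userId, movieId) pair is unique; with duplicates, `pivot` raises an error:

```python
import pandas as pd

# Toy ratings; the values are invented for illustration only.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})
toy_users = [1, 2, 3, 4]   # user 4 has no ratings in the toy data

toy_matrix = (
    toy_ratings
    .pivot(index='userId', columns='movieId', values='rating')
    .reindex(index=toy_users)   # keep users without ratings as rows
    .fillna(0.0)                # unrated cells become 0, as in the notebook
)
print(toy_matrix)
```

Note that `pivot` sorts the columns by movie ID, whereas the notebook's matrix keeps the IDs in order of first appearance, so the column order would differ.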

In [16]:
user_item_df.head()

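| | 169 | 2471 | 48516 | 2571 | 109487 | 112552 | 112556 | 356 | 2394 | 2431 | ... | 413 | 419 | 421 | 428 | 429 | 442 | 458 | 460 | 473 | 482 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.5 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 3.5 | 4.0 | 5.0 | 4.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 4.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 | 4.5 | 4.5 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |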


In [21]:
user_item_df.shape

(100, 1000)

<hr>

<a id="ref3"></a>
# Recommender System — Non-Negative Matrix Factorization

In [36]:
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.01, beta=0.02, epsilon=0.1, patience=10, decay=0.01):
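    '''
    Factorize R into P and Q by stochastic gradient descent.

    R: rating matrix, shape (|U|, |D|); entries equal to 0 are treated as missing
    P: |U| x K user-feature matrix (updated in place)
    Q: |D| x K item-feature matrix (updated in place)
    K: number of latent features
    steps: maximum number of passes over the observed ratings
    alpha: initial learning rate
    beta: regularization parameter
    epsilon: minimum decrease in total error that counts as an improvement
    patience: stop after this many consecutive passes without such an improvement
    decay: learning-rate decay; the rate at a given step is alpha / (1 + decay * step)
    '''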
    history_error = [float('inf')]
    Q = Q.T
    patience_count = 0
    learning_rate = alpha
    for step in range(steps):
        total_e = 0
        alpha = (learning_rate / (1. + decay * step))
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j]> 0:
                    # calculate error
                    eij = R[i][j] - P[i,:].dot(Q[:,j])
                    total_e += eij**2 + beta/2 * (P[i,:].dot(P[i,:]) + Q[:,j].dot(Q[:,j]))
                    for k in range(K):
                        # gradient step with learning rate alpha and regularization beta
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        if step % 10 == 0:
            print('Step {}: total errors: {}'.format(step, total_e))
        if history_error[-1] - total_e  < epsilon:
            patience_count += 1
        else: patience_count = 0
        history_error.append(total_e)
        if patience_count == patience:
            break
    return P, Q.T, history_error
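
The inner loop over `k` applies one stochastic-gradient step to the user row `P[i]` and the item column `Q[:, j]`. As a sketch (not the notebook's code), the same step can be written with NumPy vector operations. `sgd_step_loop` and `sgd_step_vectorized` are hypothetical helper names, and the toy vectors are random:

```python
import numpy as np

def sgd_step_loop(p, q, r, alpha, beta):
    """One update for a single rating r, written like the inner k-loop."""
    p, q = p.copy(), q.copy()
    e = r - p.dot(q)  # prediction error, computed once before the loop
    for k in range(len(p)):
        p[k] = p[k] + alpha * (2 * e * q[k] - beta * p[k])
        q[k] = q[k] + alpha * (2 * e * p[k] - beta * q[k])  # uses the updated p[k]
    return p, q

def sgd_step_vectorized(p, q, r, alpha, beta):
    """The same update with whole-vector operations."""
    e = r - p.dot(q)
    p_new = p + alpha * (2 * e * q - beta * p)
    q_new = q + alpha * (2 * e * p_new - beta * q)  # uses the updated p, as the loop does
    return p_new, q_new

rng = np.random.default_rng(0)
p0, q0 = rng.random(3), rng.random(3)
p_loop, q_loop = sgd_step_loop(p0, q0, r=4.0, alpha=0.02, beta=0.02)
p_vec, q_vec = sgd_step_vectorized(p0, q0, r=4.0, alpha=0.02, beta=0.02)
print(np.allclose(p_loop, p_vec), np.allclose(q_loop, q_vec))
```

Both comparisons come out `True` because each `k` only touches its own pair of entries. Swapping the loop for the vector form would mainly be a speed optimization.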

In [37]:
R = np.array(user_item_df.values)
N = len(R)
M = len(R[0])
K = 2
P = np.random.rand(N,K)
Q = np.random.rand(M,K)

nP, nQ, history_error = matrix_factorization(R, P, Q, K, alpha=0.02, steps=500, epsilon=0.1, decay=0.01)

nR = np.dot(nP, nQ.T)

Step 0: total errors: 13999.819357461529
Step 10: total errors: 4127.465320291851
Step 20: total errors: 3671.0216163462605
Step 30: total errors: 3479.509069196595
Step 40: total errors: 3366.8243046494595
Step 50: total errors: 3285.509702929918
Step 60: total errors: 3222.0321058225463
Step 70: total errors: 3171.062666734065
Step 80: total errors: 3128.879567934683
Step 90: total errors: 3093.3900340631176
Step 100: total errors: 3063.9280916738594
Step 110: total errors: 3039.712336450849
Step 120: total errors: 3019.4486472648946
Step 130: total errors: 3002.015321664241
Step 140: total errors: 2986.6836086189633
Step 150: total errors: 2972.991477629929
Step 160: total errors: 2960.626914323989
Step 170: total errors: 2949.3655972535603
Step 180: total errors: 2939.0387008514463
Step 190: total errors: 2929.5150209507187
Step 200: total errors: 2920.6902209194213
Step 210: total errors: 2912.4798896348984
Step 220: total errors: 2904.8147861201896
Step 230: total errors: 2897.63
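
The printed total mixes squared error with the regularization penalty, so it is hard to read on the rating scale. A hedged sketch of a more interpretable check is the root-mean-square error over observed (non-zero) entries only. `observed_rmse` is a hypothetical helper, and the toy matrices are made up:

```python
import numpy as np

def observed_rmse(R, R_hat):
    """RMSE over cells where R holds a rating (R > 0); zeros are treated as missing."""
    mask = R > 0
    return float(np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2)))

# Toy example: two observed ratings (5.0 and 3.0), two missing cells.
R_toy = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
R_hat_toy = np.array([[4.0, 2.0],
                      [1.0, 3.0]])
toy_rmse = observed_rmse(R_toy, R_hat_toy)
print(toy_rmse)
```

In the notebook this could be called as `observed_rmse(R, nR)` once training has finished.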

### Predict rating

In [38]:
nR = np.dot(nP, nQ.T)  # predicted ratings from the factors returned by matrix_factorization

In [39]:
nR.shape

(100, 1000)

In [40]:
movie_id = user_item_df.columns
movie_id

Int64Index([   169,   2471,  48516,   2571, 109487, 112552, 112556,    356,
              2394,   2431,
            ...
               413,    419,    421,    428,    429,    442,    458,    460,
               473,    482],
           dtype='int64', length=1000)

In [41]:
random_user = 11  # positional row index into user_item_df (used with .iloc), not a userId

In [42]:
user_ratings = user_item_df.iloc[random_user]
user_rated_mask = user_ratings.values > 0

In [43]:
user_rec = nR[random_user][~user_rated_mask]

In [44]:
movie_id_rating = movie_id[~user_rated_mask]

In [45]:
data = {'movie_id': movie_id_rating, 'pre_rating': user_rec}

In [46]:
user_rec_df = pd.DataFrame.from_dict(data)
user_rec_df = user_rec_df.sort_values('pre_rating', ascending=False)
user_rec_df.head()

| | movie_id | pre_rating |
|---|---|---|
| 578 | 1290 | 6.487918 |
| 217 | 163 | 6.14601 |
| 432 | 3456 | 5.760552 |
| 862 | 42 | 5.73995 |
| 789 | 55094 | 5.73648 |
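
Movie IDs alone are hard to interpret. A minimal sketch of attaching titles by merging the recommendations with a movies table on the ID column. The toy frames `toy_recs` and `toy_movies` stand in for `user_rec_df` and `movies_df`; the titles come from the `movies_df` preview earlier, and the predicted values are invented:

```python
import pandas as pd

# Toy stand-ins: titles from the movies_df preview, predictions invented.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})
toy_recs = pd.DataFrame({'movie_id': [3, 1], 'pre_rating': [4.2, 3.9]})

# A left merge keeps the recommendation order; the key columns have different names.
toy_named = toy_recs.merge(toy_movies, left_on='movie_id', right_on='movieId', how='left')
print(toy_named[['movie_id', 'title', 'pre_rating']])
```

With the real frames, the same call would be `user_rec_df.merge(movies_df, left_on='movie_id', right_on='movieId', how='left')`.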


## scikit-learn

In [47]:
R = np.array(user_item_df.values)

In [48]:
from sklearn.decomposition import NMF

model = NMF(init='nndsvda', n_components=4)
P = model.fit_transform(R)
Q = model.components_

R_estimate = np.dot(P, Q)

In [49]:
user_rec = R_estimate[random_user][~user_rated_mask]
movie_id_rating = movie_id[~user_rated_mask]
data = {'movie_id': movie_id_rating, 'pre_rating': user_rec}
user_rec_df = pd.DataFrame.from_dict(data)
user_rec_df = user_rec_df.sort_values('pre_rating', ascending=False)
user_rec_df.head()

| | movie_id | pre_rating |
|---|---|---|
| 18 | 296 | 1.754278 |
| 31 | 593 | 1.620861 |
| 32 | 608 | 1.571245 |
| 92 | 2858 | 1.53423 |
| 46 | 1136 | 1.428864 |
