## <a src="data/RankingMarix.csv">Movie Recommendation System with MovieLens Data with 100K Dataset </a>

data is cited from the following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872

In [78]:
import numpy as np
import pandas as pd
from numpy import *

df_movieLens = pd.read_csv(u'data/RankingMatrix.csv',sep=';')


### Data Summary

The data has 
- 943 users, 
- 1682 movies,
- Total of 100,000 ratings,
- Each user has rated at least 20 movies,
- Ranking scale is between 1-5,
- Data is sorted based on Movie ID

You may find the data below:

    For example; User1 has rated Movie1 as 5

In [79]:
df_movieLens

Unnamed: 0,MOVIE ID,User 1,User 2,User 3,User 4,User 5,User 6,User 7,User 8,User 9,...,User 934,User 935,User 936,User 937,User 938,User 939,User 940,User 941,User 942,User 943
0,1,5,4,0,0,4,4,0,0,0,...,2,3,4,0,4,0,0,5,0,0
1,2,3,0,0,0,3,0,0,0,0,...,4,0,0,0,0,0,0,0,0,5
2,3,4,0,0,0,0,0,0,0,0,...,0,0,4,0,0,0,0,0,0,0
3,4,3,0,0,0,0,0,5,0,0,...,5,0,0,0,0,0,2,0,0,0
4,5,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,6,5,0,0,0,0,0,0,0,5,...,0,0,5,0,0,0,0,0,0,0
6,7,4,0,0,0,0,2,5,3,4,...,0,0,4,0,4,0,4,4,0,0
7,8,1,0,0,0,0,4,5,0,0,...,0,0,0,0,0,0,5,0,0,0
8,9,5,0,0,0,0,4,5,0,0,...,0,1,4,5,3,5,3,0,0,3
9,10,3,2,0,0,0,0,4,0,0,...,0,0,0,0,0,0,0,0,0,0


### Visualization

Nasıl visualize edeceğine dair buraya bilgileri buraya yazabilirsin

###  Mixture of Content Based and Collaborative Filtering based Recommendation System

<img src="data/movie_recommendation.jpg" width="500"> </img>


In this project; how to build a movie recommendation engine will be looked step by step, in order to do that mixture of content based and collaborative filtering algorithm will be proposed.


## Make movie recommendation step by step

#### STEP 1 - Create Datasets

In [80]:
# Assign the number of users and movies
num_movies=1682
num_users=943

#Create the ratings data
movieLens = np.matrix(df_movieLens[0:])
ratings=movieLens[0:1682,1:944]
print (ratings)

#Create the data if the the movie rated by the corresponding user put 1, if not rated put 0 
did_rate = (ratings !=0)* 1
print (did_rate)



[[5 4 0 ..., 5 0 0]
 [3 0 0 ..., 0 0 5]
 [4 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[[1 1 0 ..., 1 0 0]
 [1 0 0 ..., 0 0 1]
 [1 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


#### STEP 2 - Normalize Dataset

In [81]:
#create a function that normalizes a dataset

def normalize_ratings(ratings,did_rate):
    num_movies=ratings.shape[0]
    
    ratings_mean=zeros(shape = (num_movies,1))
    ratings_norm=zeros(shape = ratings.shape)
    
    for i in range(num_movies):
        #Get all the indexes where there is a 1
        idx = where(did_rate[i] == 1)[1]
        #Calculate mean rating of ith movie only from user's that gave a rating
        ratings_mean[i]=mean(ratings[i,idx])
        ratings_norm[i,idx]=ratings[i,idx] - ratings_mean[i]
        
    return ratings_norm, ratings_mean


##### What does normalize_ratings function do?

Store the ratings_matrix's indexes in an array where there is a rating for the corresponding user & movie

In [82]:
i=0
idx = np.where(did_rate[i]== 1 )[1]
print (idx)

[  0   1   4   5   9  12  14  15  16  17  19  20  22  24  25  37  40  41
  42  43  44  48  53  55  56  57  58  61  62  63  64  65  66  69  71  72
  74  76  78  80  81  82  83  88  91  92  93  94  95  96  98 100 101 105
 107 108 116 119 120 123 124 127 129 130 133 136 137 140 143 144 147 149
 150 156 157 159 161 167 173 176 177 180 181 183 188 192 193 197 198 199
 200 201 202 203 208 209 212 215 221 222 229 230 231 233 234 241 242 243
 245 246 247 248 249 250 251 252 253 255 261 262 264 267 270 273 274 275
 276 278 279 285 286 288 289 290 291 292 293 294 295 296 297 298 300 302
 304 306 307 310 311 312 313 319 321 323 324 325 326 329 330 331 335 337
 338 339 342 343 344 346 347 349 356 358 359 362 364 370 373 377 378 379
 380 386 387 388 389 392 393 394 395 397 398 400 401 402 405 406 410 411
 415 416 418 421 423 424 428 431 433 434 437 440 444 446 449 453 454 455
 456 457 458 459 462 464 466 467 469 470 471 477 478 482 483 485 486 487
 489 492 493 494 496 499 502 504 507 511 513 516 51

For the each row (movie rank) calculate the mean, but ignore the non-rated movies (zeros), because they should not be included in mean calculation.

In [83]:
ratings_mean = zeros(shape=(num_movies,1))
ratings_mean[i]=mean(ratings[i,idx])
print (ratings_mean)

[[ 3.87831858]
 [ 0.        ]
 [ 0.        ]
 ..., 
 [ 0.        ]
 [ 0.        ]
 [ 0.        ]]


For the each row (each movie) substract mean from actual rate, and get normalized ratings.

In [84]:
ratings_norm =zeros(shape=ratings.shape)
ratings_norm[i,idx]=ratings[i,idx]-ratings_mean[i]
print (ratings_norm)

[[ 1.12168142  0.12168142  0.         ...,  1.12168142  0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]


Below is the whole normalized ratings matrix.

- if the matrix element is (-); it means user rated that movie below the average rate of that movie, 
    user is more likely to dislike
- if the matrix elemens is (+); it means user rated that movie above the average rate of that movie, 
    user more likely to like

In [85]:
ratings, ratings_mean = normalize_ratings(ratings,did_rate)
print (ratings)

[[ 1.12168142  0.12168142  0.         ...,  1.12168142  0.          0.        ]
 [-0.20610687  0.          0.         ...,  0.          0.          1.79389313]
 [ 0.96666667  0.          0.         ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]


In [86]:
print(ratings_mean)

[[ 3.87831858]
 [ 3.20610687]
 [ 3.03333333]
 ..., 
 [ 2.        ]
 [ 3.        ]
 [ 3.        ]]


#### STEP 3 - Create Movie Features and User Preferences

In [87]:
num_users = 943
num_features = 3
#Features can assumed to be such as Comedy, Action, Romantic


In [88]:
movie_features = random.randn(num_movies,num_features)
user_prefs = random.randn(num_users, num_features)
initial_X_and_theta = r_[movie_features.T.flatten(),user_prefs.T.flatten()]
print ( movie_features)

[[ 0.55678354  0.49445734  0.9970598 ]
 [-1.88287832 -0.77361228  0.2699167 ]
 [-1.32127433 -0.4840332   0.68455568]
 ..., 
 [ 1.37702147 -0.19258596  1.59023886]
 [-0.2055576   0.09840224 -0.97488578]
 [ 1.44334083  1.21519915  1.42756957]]


In [89]:
print(user_prefs)

[[-1.08031973 -0.26140547 -0.83566036]
 [-0.03431459  0.56739575  1.01666243]
 [-1.12787042  1.81857725 -0.3547377 ]
 ..., 
 [-0.70734851  0.87700991  0.47662635]
 [-0.7869922   1.21441974 -0.04058775]
 [ 0.38024498  0.89821516  1.67864141]]


In [90]:
initial_X_and_theta.shape

(7875,)

#### STEP 4 - MultiLinear Regression : CostFunction and others

In [91]:
def unroll_params(X_and_theta, num_users, num_movies, num_features):
    #Retrieve the X and theta matrixes from X_and_theta, based on their dimensions
    #------------------
    #Get the first 5046 (1682*3) rows in the 7875x1 column vector
    first_5046 = X_and_theta[:num_movies * num_features]
    #Reshape this column vector into a 1682x3 matrix
    X = first_5046.reshape(num_features,num_movies).transpose()    
    #Get the rest of the 943 the numbers, after the first 5046
    last_2829=X_and_theta[num_movies * num_features:]
    # Reshape this column vector into a 943x3 matrix
    theta = last_2829.reshape((num_features, num_users)).transpose()
    return X, theta         

In [92]:
def calculate_gradient(X_and_theta,ratings,did_rate,num_users,num_movies, num_features, reg_param):
    X,theta =unroll_params(X_and_theta,num_users,num_movies,num_features)
    
    difference = (X.dot(theta.T))
    X_grad = difference.dot(theta)+ reg_param * X
    theta_grad = difference.T.dot(X) + reg_param * theta
    
    #wrap the gradients back into a column vector
    return r_[X_grad.T.flatten(),theta_grad.T.flatten()]

In [93]:
def calculate_cost(X_and_theta,ratings,did_rate,num_users,num_movies, num_features, reg_param):
    X, theta =unroll_params(X_and_theta,num_users,num_movies,num_features)
    
    #we multiply (element-wise) by did rate because we only want to consider observations
    cost = sum ((X.dot(theta.T))**2)/2
    regularization=(reg_param/2)*(sum(theta**2)+sum(X**2))
    return cost + regularization
      
    

#### STEP 4 - Make Movie Recommendation

In [94]:
from scipy import optimize
reg_param = 5046

In [95]:
minimized_cost_and_optimal_params = optimize.fmin_cg(calculate_cost, x0=initial_X_and_theta,fprime=calculate_gradient, args=(ratings, did_rate, num_users, num_movies, num_features, reg_param), maxiter=100, disp=True, full_output=True ) 

         Current function value: 836139.404447
         Iterations: 1
         Function evaluations: 19
         Gradient evaluations: 7


In [96]:
cost, optimal_movie_features_and_user_prefs=minimized_cost_and_optimal_params[1], minimized_cost_and_optimal_params[0]

In [97]:
movie_features, user_prefs = unroll_params(optimal_movie_features_and_user_prefs, num_users, num_movies, num_features)

In [98]:
print (movie_features)

[[-0.07277779 -0.07407872 -0.1339316 ]
 [ 0.25172353  0.11915848 -0.04100532]
 [ 0.1777019   0.07480774 -0.09588029]
 ..., 
 [-0.17571952  0.01977717 -0.21149636]
 [ 0.02391538 -0.0123077   0.1313948 ]
 [-0.19146931 -0.18158043 -0.19042939]]


In [99]:
all_predictions = movie_features.dot(user_prefs.T)
print (all_predictions)

[[-0.06177569  0.05151743  0.00206251 ...,  0.02274343  0.00830867
   0.09365421]
 [ 0.07867857 -0.0067304   0.01176482 ...,  0.02485217  0.01252603
  -0.04147802]
 [ 0.03836363  0.01653218  0.00614362 ...,  0.02904078  0.01120287
   0.00585171]
 ..., 
 [-0.10519596  0.05839991 -0.08766156 ..., -0.01055403 -0.04812817
   0.11876955]
 [ 0.03866327 -0.03660771  0.02780526 ..., -0.01015276  0.01131542
  -0.06367783]
 [-0.12173162  0.08557655  0.01550041 ...,  0.03434975  0.01938258
   0.16384521]]


In [100]:
prediction_for_user1 = all_predictions[:,0:1] + ratings_mean
print (prediction_for_user1)

[[ 3.81654289]
 [ 3.28478544]
 [ 3.07169696]
 ..., 
 [ 1.89480404]
 [ 3.03866327]
 [ 2.87826838]]
