# Movie Recommendations using Machine Learning
# This code was developed by Mohamed Eraky : me500@scarletmail.rutgers.edu

**Name:**  

In [252]:

import numpy as np
import pandas as pd

## Reading the Data
Now that the files have been downloaded from the link above and placed in the same directory as this Jupyter Notebook, we can load each table into Pandas. The `.dat` files use `::` as the field separator and have no header row, so each one is read with explicit column names. Run the provided code below.

In [253]:
# Read the dataset from the two files into ratings_data and movies_data
column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]
ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python')
column_list_movies = ["MovieID","Title","Genres"]
movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python',encoding='ISO-8859-1')
column_list_users = ["UserID","Gender","Age","Occupation","Zip-code"]
user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

In [277]:
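# Use NumPy to create the ratings matrix (rows: users, columns: movies)
nr_users = np.max(ratings_data.UserID.values)
nr_movies = np.max(ratings_data.MovieID.values)
# np.zeros guarantees unrated entries are 0; np.ndarray would leave them uninitialized
ratings_matrix = np.zeros(shape=(nr_users, nr_movies), dtype=np.uint8)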

In [278]:
ratings_matrix[ratings_data.UserID.values - 1, ratings_data.MovieID.values - 1] = ratings_data.Ratings.values
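The indexing step above can be checked on a small example. The sketch below uses a hypothetical three-row-by-three-column table (`toy_ratings` and `toy_matrix` are illustrative names, not part of the dataset) and assumes the matrix starts out zero-filled, so unrated entries read as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings table (not taken from ratings.dat)
toy_ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 3],
    "MovieID": [1, 3, 2, 3],
    "Ratings": [5, 3, 4, 1],
})

n_users = toy_ratings.UserID.max()
n_movies = toy_ratings.MovieID.max()

# Start from zeros so that unrated (user, movie) pairs are well defined
toy_matrix = np.zeros((n_users, n_movies), dtype=np.uint8)

# IDs start at 1 but array indices start at 0, hence the "- 1"
toy_matrix[toy_ratings.UserID.values - 1,
           toy_ratings.MovieID.values - 1] = toy_ratings.Ratings.values

print(toy_matrix)
# [[5 0 3]
#  [0 4 0]
#  [0 0 1]]
```

Each row of the table lands in exactly one cell of the matrix because the two index arrays are paired element by element.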

In [276]:
# Print a 10x10 sample of the matrix and its shape
print("sample of 10x10 of rating matrix:\n", ratings_matrix[:10,:10])
print("shape of ratings_matrix:",ratings_matrix.shape)

sample of 10x10 of rating matrix:
 [[ 1.7587341  -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282   0.96066227
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [ 1.26664377 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282   2.33661676
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [ 1.26664377 -0.34446173 -0.27495584  6.06227179 -0.2143282  -0.41529223
  -0.27447136 -0

## Question 2

Normalize the ratings matrix using Z-score normalization. 

- Replace every `NaN` value with the average rating for that movie. Handling missing data is a complex topic, but here mean imputation ensures that a missing rating does not shift the movie's average, and it supplies an "expected value" that is useful when computing correlations and recommendations in later steps.
- First, compute the average of every *column* of the ratings matrix (we want an average per title, not per user).
- Second, subtract that average from the original ratings so that every column has a mean of 0. The result may be very close to zero rather than exactly zero because of the limited precision of `float`s.
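The steps listed above can be sketched on a toy matrix. This is only an illustration under stated assumptions: the array `toy` and every variable name are hypothetical, missing ratings are marked with `NaN`, and `np.nanmean` is used to take the per-movie mean while ignoring them. The final division by the column standard deviation is what turns plain centering into a Z-score.

```python
import numpy as np

# Hypothetical 3-user x 3-movie matrix; NaN marks a missing rating
toy = np.array([[5.0, np.nan, 3.0],
                [np.nan, 4.0, 1.0],
                [1.0, 2.0, np.nan]])

col_mean = np.nanmean(toy, axis=0)                # per-movie mean, ignoring NaN
filled = np.where(np.isnan(toy), col_mean, toy)   # impute missing ratings with that mean
centered = filled - col_mean                      # every column now has mean ~0

col_std = filled.std(axis=0)
col_std[col_std == 0] = 1.0                       # guard against division by zero
z_scores = centered / col_std                     # unit standard deviation per column

print(centered.mean(axis=0))
print(z_scores.std(axis=0))
```

Because each missing entry is filled with the column mean, the imputed cells become exactly 0 after centering and so contribute nothing to later dot products.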

In [279]:
# Z-score each column (movie); columns with zero standard deviation yield NaN
ratings_matrix = (ratings_matrix - ratings_matrix.mean(axis=0)) / ratings_matrix.std(axis=0)



In [280]:
ratings_matrix[np.isnan(ratings_matrix)] = 0  # replace NaN (from zero-variance columns) with 0

In [281]:
print("sample of 10x10 of Normalized rating matrix:\n", ratings_matrix[:10,:10])
print("shape of Normalized ratings_matrix:",ratings_matrix.shape)

sample of 10x10 of Normalized rating matrix:
 [[ 1.7587341  -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282   0.96066227
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [ 1.26664377 -0.34446173 -0.27495584 -0.15931661 -0.2143282  -0.41529223
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282   2.33661676
  -0.27447136 -0.10175063 -0.12185716 -0.4005674 ]
 [ 1.26664377 -0.34446173 -0.27495584  6.06227179 -0.2143282  -0.41529223
  -0.

## SVD Computation of the Normalized Matrix

In [260]:
U, S, vh = np.linalg.svd(ratings_matrix, full_matrices=False)  # vh holds V transposed (V^T)

In [274]:

print("shape of U:",U.shape)
print("shape of S:",S.shape)
print("shape of vT:",vh.shape)


shape of U: (6040, 3952)
shape of S: (3952,)
shape of vT: (3952, 3952)
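The shapes above follow from `full_matrices=False`, which returns the reduced SVD: for an m x n matrix with m >= n, `U` is m x n, `S` has n entries, and `vh` is n x n. A minimal sketch on a hypothetical random 6 x 4 matrix (the names `A`, `U_t`, `S_t`, and `Vt_t` are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))    # hypothetical 6 users x 4 movies

U_t, S_t, Vt_t = np.linalg.svd(A, full_matrices=False)

print(U_t.shape, S_t.shape, Vt_t.shape)   # (6, 4) (4,) (4, 4)

# Singular values come back sorted from largest to smallest
print(np.all(np.diff(S_t) <= 0))          # True
```

The descending order of the singular values is what makes it meaningful to keep only the first k components later on.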


## Verification of the SVD Computation

In [262]:
S_diagonal = np.diag(S)  # build a diagonal matrix from the singular values
print("shape of S diagonal: \n", S_diagonal.shape)
print()
c = np.matmul(U, S_diagonal)
ver = np.matmul(c, vh)  # U @ diag(S) @ V^T should reproduce ratings_matrix
print("ver:", ver[:5, :5])
print("ratings_matrix:\n", ratings_matrix[:5, :5])

shape of S diagonal: 
 (3952, 3952)

ver: [[ 1.7587341  -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]]
ratings_matrix:
 [[ 1.7587341  -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]
 [-0.70171755 -0.34446173 -0.27495584 -0.15931661 -0.2143282 ]]
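Comparing a printed 5 x 5 corner by eye checks only a few entries. One possible way to check the whole matrix is `np.allclose`, sketched here on a hypothetical random matrix (names are illustrative, not the notebook's variables). The sketch also uses broadcasting, `U_t * S_t`, in place of building the full diagonal matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U_t, S_t, Vt_t = np.linalg.svd(A, full_matrices=False)

# (U_t * S_t) scales column j of U_t by S_t[j], the same as U_t @ np.diag(S_t)
A_rebuilt = (U_t * S_t) @ Vt_t

print(np.allclose(A, A_rebuilt))        # True
print(np.max(np.abs(A - A_rebuilt)))    # tiny floating-point residual
```

Avoiding `np.diag` matters at the notebook's scale, where the diagonal matrix alone would hold 3952 x 3952 mostly-zero entries.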


## Slicing with Different Values of k

In [263]:
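k1 = 3
U1 = U[:, :k1]               # first k1 left singular vectors
S1 = S_diagonal[:k1, :k1]    # top k1 singular values on the diagonal
vh1 = vh.T[:, :k1]           # first k1 columns of V (transposed back below)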
print(" U1:\n",U1)
print(" s1:\n",S1)
print(" vh1:\n",vh1)

print("shape of U1:",U1.shape)
print("shape of S1:",S1.shape)
print("shape of vh1:",vh1.shape)

step1 = np.matmul(U1,S1 )
R1=np.matmul(step1,vh1.T)
print("shape of R1:",R1.shape)
print("user rating ", R1[0,1376])

k2=1000
U2=U[:,:k2]
S2=S_diagonal[:k2,:k2]
vh2=vh[:k2,:]
step2 = np.matmul(U2,S2 )
R2=np.matmul(step2,vh2)
print("Shape of R2:",R2.shape)
print("User rating ", R2[50,1377])

k3=2000
U3=U[:,:k3]
S3=S_diagonal[:k3,:k3]
vh3=vh[:k3,:]
step3 = np.matmul(U3,S3 )
R3=np.matmul(step3,vh3)
print("Shape of R3:",R3.shape)
print("User rating ", R3[0,1376])

k4=3000
U4=U[:,:k4]
S4=S_diagonal[:k4,:k4]
vh4=vh[:k4,:]
step4 = np.matmul(U4,S4 )
R4=np.matmul(step4,vh4)
print("User rating ", R4[0,1376])










 U1:
 [[ 0.00694592  0.00033918  0.00161647]
 [ 0.00280136  0.0009485   0.00016381]
 [ 0.00701156 -0.0025459  -0.00127488]
 ...
 [ 0.00885949  0.00110523 -0.00037722]
 [ 0.00316321  0.01140523 -0.00676983]
 [-0.00764705  0.03203913 -0.00023632]]
 s1:
 [[1238.60598482    0.            0.        ]
 [   0.          717.43506225    0.        ]
 [   0.            0.          578.64240044]]
 vh1:
 [[-0.02452332 -0.01142005  0.01352385]
 [-0.02737761 -0.02607295 -0.00457039]
 [-0.0198921  -0.02235377  0.01523321]
 ...
 [-0.01188253  0.01064472  0.00491984]
 [-0.00435173  0.01196432  0.01364856]
 [-0.01401484  0.00956447  0.02197116]]
shape of U1: (6040, 3)
shape of S1: (3, 3)
shape of vh1: (3952, 3)
shape of R1: (6040, 3952)
user rating  -0.2936290749948192
Shape of R2: (6040, 3952)
User rating  -0.22777199512279936
Shape of R3: (6040, 3952)
User rating  -0.19244551852973074
User rating  -0.3837378301326029
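The four slices above repeat the same pattern, which could be wrapped in a small helper. The function `rank_k_approx` below is a hypothetical name introduced for illustration, and the sketch runs on a random toy matrix rather than the ratings data. It also shows the known property that the Frobenius reconstruction error cannot increase as k grows and reaches (numerically) zero at full rank.

```python
import numpy as np

def rank_k_approx(U, S, Vt, k):
    """Return the rank-k reconstruction U[:, :k] @ diag(S[:k]) @ Vt[:k, :]."""
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 5))    # hypothetical toy matrix
U_t, S_t, Vt_t = np.linalg.svd(A, full_matrices=False)

# Frobenius norm of the reconstruction error for k = 1..5
errors = [np.linalg.norm(A - rank_k_approx(U_t, S_t, Vt_t, k))
          for k in range(1, 6)]
print(errors)
```

For a truncated SVD the error at rank k equals the square root of the sum of the squared discarded singular values, so a plot of `S` gives a quick sense of how many components are worth keeping.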


In [264]:
def top_cosine_similarity(ratings_data, MovieID, top_n=10):
    # ratings_data: 2-D array with one row per movie (e.g. the sliced V matrix)
    index = MovieID - 1  # movie IDs start at 1, row indices at 0
    movie_row = ratings_data[index, :]
    # Euclidean norm of every row
    magnitude = np.sqrt(np.einsum('ij, ij -> i', ratings_data, ratings_data))
    # Cosine similarity between the chosen movie and every movie;
    # all-zero rows yield NaN, which argsort places last
    similarity = np.dot(movie_row, ratings_data.T) / (magnitude[index] * magnitude)
    sort_indexes = np.argsort(-similarity)
    return sort_indexes[:top_n]

# Helper function to print the top N similar movies
def print_similar_movies(movies_data, MovieID, top_indexes):
    print('Recommendations for {0}: \n'.format(
        movies_data[movies_data.MovieID == MovieID].Title.values[0]))
    for similar_id in top_indexes + 1:  # convert row indices back to movie IDs
        print(movies_data[movies_data.MovieID == similar_id].Title.values[0])
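Cosine similarity compares the direction of two vectors and ignores their length. The sketch below illustrates the same ranking idea on four hypothetical 2-D item vectors; `cosine_to_all` and `items` are names introduced here for illustration, and `np.linalg.norm` stands in for the `einsum` expression used above.

```python
import numpy as np

def cosine_to_all(rows, index):
    """Cosine similarity between rows[index] and every row of `rows`."""
    norms = np.linalg.norm(rows, axis=1)
    return rows @ rows[index] / (norms[index] * norms)

# Hypothetical item vectors: one row per item, two latent features each
items = np.array([[1.0, 0.0],
                  [2.0, 1.0],
                  [0.0, 3.0],
                  [1.0, 1.0]])

sims = cosine_to_all(items, 0)
ranking = np.argsort(-sims)    # most similar first
print(ranking)                 # [0 1 3 2]
```

The query item always ranks first with similarity 1, which is why the movie being queried appears at the top of its own recommendation list.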

## Recommending Movies from a User-Supplied Movie ID

In [267]:
k = 50                                                   # number of latent components kept from the SVD
movie_id = 1377                                          # movie ID supplied by the user
top_n = 5                                                # return only the top 5 movies
sliced = vh.T[:, :k]                                     # one row per movie, k latent features each
indexes = top_cosine_similarity(sliced, movie_id, top_n) # row indices of the most similar movies
print_similar_movies(movies_data, movie_id, indexes)     # print the similar movies

Recommendations for Batman Returns (1992): 

Batman Returns (1992)
Batman Forever (1995)
Dick Tracy (1990)
Lethal Weapon 4 (1998)
GoldenEye (1995)


