# Product Recommendation

In this blog post, I will walk through how you can manually make a product recommendation system for movies.

The movie rating data set is from MovieLens and can be downloaded [here](https://grouplens.org/datasets/movielens/20m/). This blog post has been adapted from a lab exercise from the Stanford Machine Learning course on Coursera, found [here](https://www.coursera.org/learn/machine-learning).

## Loading in data:

In [1]:
# Importing libraries:
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Reading in data:
ratings = pd.read_csv(filepath_or_buffer="../data/ratings.csv")
ratings = ratings.drop(labels='timestamp', axis=1)

Great, we've loaded the data set. Now let's take a peek:

In [2]:
ratings.head()

|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 1 | 2 | 3.5 |
| 1 | 1 | 29 | 3.5 |
| 2 | 1 | 32 | 3.5 |
| 3 | 1 | 47 | 3.5 |
| 4 | 1 | 50 | 3.5 |


In [3]:
movies = pd.read_csv(filepath_or_buffer="../data/movies.csv")
movies = movies.drop('genres', axis=1)
movies.head()

|   | movieId | title |
|---|---------|-------|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | Jumanji (1995) |
| 2 | 3 | Grumpier Old Men (1995) |
| 3 | 4 | Waiting to Exhale (1995) |
| 4 | 5 | Father of the Bride Part II (1995) |


Almost all of the data wrangling has been done for us! Let's just finish it off.

## Data Wrangling:

First, we want to get the data into a form where each row corresponds to a specific movie and each column corresponds to a specific user. We can do this by pivoting the table:

In [4]:
# Pivoting the table
ratings_spread = ratings.pivot(index='movieId', columns='userId', values='rating')

Wow, that operation took a really long time. Let's see how big it is:

In [5]:
ratings_spread.shape

(26744, 138493)

Wow, this is quite large. We have 26744 movies and 138493 users. In the interest of time, let's just consider the first ten thousand movies and the first one thousand users:

In [8]:
# Get only first ten thousand movies and first one thousand users
ratings_sub = ratings_spread.iloc[0:10000, 0:1000] 
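
Since the full pivot is slow, one possible alternative is to filter the long-format ratings first and pivot only what is needed. The sketch below is an assumption-laden illustration rather than the approach used in this post: `pivot_subset` and the toy data are hypothetical names, and "first" is taken to mean the smallest ids, which is the order `pivot` produces.

```python
import numpy as np
import pandas as pd

def pivot_subset(ratings_long, n_movies, n_users):
    """Pivot only the first n_movies movie ids and the first n_users user ids."""
    # Sorted ids mirror the row/column order that DataFrame.pivot produces
    movie_ids = np.sort(ratings_long['movieId'].unique())[:n_movies]
    user_ids = np.sort(ratings_long['userId'].unique())[:n_users]

    # Keep only ratings that fall inside the chosen block of movies and users
    keep = ratings_long['movieId'].isin(movie_ids) & ratings_long['userId'].isin(user_ids)
    return ratings_long[keep].pivot(index='movieId', columns='userId', values='rating')

# Toy data standing in for ratings.csv (hypothetical values)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Movie 30 and user 3 fall outside the 2 x 2 subset
toy_sub = pivot_subset(toy_ratings, n_movies=2, n_users=2)
print(toy_sub)
```

Because the filter runs before the pivot, movies and users without any rating inside the subset never show up in the result, so the pivoted table is much smaller than the full one.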

Because of the subsetting, we can no longer be sure that each of the first ten thousand movies has been rated by at least one of the first one thousand users, or that each of those users has rated at least one of those movies. Let's get rid of all movies that have not been rated, as well as any users who haven't rated anything.

In [11]:
# Dropping all unrated movies and users:
ratings_sub = ratings_sub.dropna(axis=0, how='all') # drop all movies with no ratings
ratings_sub = ratings_sub.dropna(axis=1, how='all') # drop all users who didn't rate

ratings_sub.shape

(7063, 1000)

After the unrated movies have been removed, only 7063 movies are left. Surprisingly, every one of the 1000 users has rated at least one movie. Good on them.

In [14]:
ratings_sub.head()

| movieId \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | 4.0 | NaN | NaN | 5.0 | NaN | 4.0 | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3.5 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 0.5 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | 4.0 | NaN | NaN | NaN | 3.0 | 3.0 | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |


Each row in the table above corresponds to a movie while each column corresponds to a specific user. For example, the user with `userId = 1` has rated the movie with `movieId = 2` with a score of 3.5.

### Important:

The main point of this machine learning algorithm is to predict the `NaN` values for each user. We want to be able to predict how `Alice` will rate `Fast and the Furious 20` even though she has not seen it yet. This is discussed further below.

Now let's get the movie titles of the 7063 movies:

In [12]:
# Get subset of movie titles
movies_sub = movies[movies['movieId'].isin(ratings_sub.index)]
movies_sub.head()

|   | movieId | title |
|---|---------|-------|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | Jumanji (1995) |
| 2 | 3 | Grumpier Old Men (1995) |
| 3 | 4 | Waiting to Exhale (1995) |
| 4 | 5 | Father of the Bride Part II (1995) |


It turns out that the user with `userId = 1` gave the movie "Jumanji" a 3.5.

The last thing that we'll need is the dataframe which records whether or not a user has rated a specific movie. We can easily do this:

In [13]:
# Check which users rated which movies
has_rated = ~pd.isnull(ratings_sub)
has_rated.head()

| movieId \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | False | False | True | False | False | True | False | True | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | True | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 3 | False | True | False | False | False | True | True | True | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |


In the table above, a `False` value means that the movie has not received a rating from that user, while a `True` means that a rating has been given. We can see that the user with `userId = 1` has rated the movie with `movieId = 2`, "Jumanji", because it is marked with a `True`.
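
A boolean mask like this also makes it easy to see how sparse the ratings are. The following is a minimal sketch on a made-up 2 x 3 table (the `toy` names and values are hypothetical), showing the fraction of filled cells and the rating counts per movie and per user.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for ratings_sub: rows are movies, columns are users
toy = pd.DataFrame(
    [[np.nan, 4.0, np.nan],
     [3.5, np.nan, 1.0]],
    index=pd.Index([1, 2], name='movieId'),
    columns=pd.Index([1, 2, 3], name='userId'),
)

toy_has_rated = toy.notna()                    # same result as ~pd.isnull(toy)

density = toy_has_rated.to_numpy().mean()      # fraction of cells holding a rating
ratings_per_movie = toy_has_rated.sum(axis=1)  # how many ratings each movie received
ratings_per_user = toy_has_rated.sum(axis=0)   # how many ratings each user gave

print(density)
print(ratings_per_movie)
print(ratings_per_user)
```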

Now that we've been able to load in the data, let's first discuss how we can approach this problem using machine learning. 

## Some Theory and Background

So far, we've been able to wrangle up the following table:

In [15]:
ratings_sub.head()

| movieId \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | 4.0 | NaN | NaN | 5.0 | NaN | 4.0 | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3.5 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 0.5 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | 4.0 | NaN | NaN | NaN | 3.0 | 3.0 | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |


Our main task is to fill in the missing `NaN` values for each user. We do this by using a collaborative filtering algorithm. Let's first take the simple case where we just want to see how user Randal will rate a specific movie. This algorithm is able to take into consideration the ratings that other users have already given.
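
For reference, the objective that the `compute_cost` function in the Implementation section evaluates can be written out as follows. The notation is an assumption borrowed from the Coursera course rather than something stated in this post: $x^{(i)}$ is the feature vector of movie $i$, $\theta^{(j)}$ is the preference vector of user $j$, $y^{(i,j)}$ is the rating user $j$ gave movie $i$, $r(i,j) = 1$ when that rating exists, and $\lambda$ is `reg_coeff`.

$$
J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2}
  + \frac{\lambda}{2} \sum_{j} \lVert \theta^{(j)} \rVert^{2}
  + \frac{\lambda}{2} \sum_{i} \lVert x^{(i)} \rVert^{2}
$$

The gradients that the same function returns are then:

$$
\frac{\partial J}{\partial x^{(i)}} = \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right) \theta^{(j)} + \lambda\, x^{(i)}
\qquad
\frac{\partial J}{\partial \theta^{(j)}} = \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right) x^{(i)} + \lambda\, \theta^{(j)}
$$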

## Implementation:

Now we have to do the tricky stuff. First we set up the data and add a new user with some hand-picked ratings. Then we define the `compute_cost` function, whose main job is to calculate the "cost", or error, associated with the predicted ratings.

In [18]:
# Getting data
ratings = has_rated
y = ratings_sub.fillna(value=0)
movie_titles = movies_sub['title']
movie_titles = movie_titles.reset_index(drop=True)

# getting dimensions:
num_features = 10
num_movies = y.shape[0]
num_users = y.shape[1]

# Making new user:
new_user = np.zeros(num_movies)
rated = np.zeros(num_movies)

# Entering user preferences:
# Science Fiction Movies:
new_user[509] = 5 
new_user[4577] = 5
new_user[1185] = 5 
new_user[1186] = 5
new_user[1187] = 5 
new_user[4988] = 5

# Romance movies:
new_user[251] = 1
new_user[266] = 1
new_user[446] = 1 
new_user[814] = 1 
new_user[1242] = 1

# Marking which movies the user has rated:
for i,r in enumerate(new_user):
    if r != 0:
        rated[i] = 1
ratings = np.vstack((ratings.T, rated)).T

# Printing out rated movies:
for i, movie_title in enumerate(movie_titles):
    if rated[i] == 1:
        print(movie_title, "was rated as:", new_user[i])

# adding user preferences to database
y = np.vstack((y.T, new_user)).T

# getting new dimensions:
num_features = 10
num_movies = y.shape[0]
num_users = y.shape[1]

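# making random X values - this corresponds to initializing movie attributes
# (one row per movie, matching the reshape inside compute_cost)
X = np.random.rand(num_movies, num_features)

# making random theta values - this corresponds to initializing user preferences
# (one row per user, matching the reshape inside compute_cost)
theta = np.random.rand(num_users, num_features)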

# Combining x and theta:
X_theta = np.append(np.ravel(X), np.ravel(theta))

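# Normalizing y: subtract each movie's mean rating from its row.
# Note: the mean is taken over all users, so unrated entries count as 0.
y_mean = np.mean(y, axis=1)
y_norm = (y.T - y_mean).T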

Love Affair (1994) was rated as: 1.0
Nina Takes a Lover (1994) was rated as: 1.0
What's Love Got to Do with It? (1993) was rated as: 1.0
Blade Runner (1982) was rated as: 5.0
Love in the Afternoon (1957) was rated as: 1.0
Star Trek: The Motion Picture (1979) was rated as: 5.0
Star Trek VI: The Undiscovered Country (1991) was rated as: 5.0
Star Trek V: The Final Frontier (1989) was rated as: 5.0
Falling in Love Again (1980) was rated as: 1.0
Star Wars: Episode II - Attack of the Clones (2002) was rated as: 5.0
Star Trek: Nemesis (2002) was rated as: 5.0
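
One thing to note about the normalization step in the cell above: because unrated entries were filled with 0, `np.mean(y, axis=1)` averages over every user, which pulls each movie's mean toward zero. A possible alternative, sketched here under the assumption that only real ratings should count, averages over rated entries only. `normalize_rated` and the toy arrays are hypothetical names, not part of the original notebook.

```python
import numpy as np

def normalize_rated(y, rated):
    """Subtract each movie's mean rating, computed over rated entries only."""
    y = np.asarray(y, dtype=float)
    rated = np.asarray(rated, dtype=float)

    counts = rated.sum(axis=1)            # number of ratings per movie
    sums = (y * rated).sum(axis=1)        # sum of actual ratings per movie

    # Movies without any rating keep a mean of 0 instead of dividing by zero
    y_mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    # Only rated cells are shifted; unrated cells stay at 0
    y_norm = (y - y_mean[:, np.newaxis]) * rated
    return y_norm, y_mean

toy_y = np.array([[5, 0, 3],
                  [0, 0, 0],
                  [1, 1, 0]])
toy_rated = np.array([[1, 0, 1],
                      [0, 0, 0],
                      [1, 1, 0]])

toy_norm, toy_mean = normalize_rated(toy_y, toy_rated)
print(toy_mean)                   # per-movie mean over rated entries
print(np.mean(toy_y, axis=1))     # mean over all users, for comparison
```

If this variant were used, the same `y_mean` would need to be added back at prediction time.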


In [316]:
def compute_cost(X_theta, y, rated, reg_coeff, num_features):
    # Get dimensions
    num_users = y.shape[1]
    num_movies = y.shape[0]
    
    # Reconstructing X:
    X = X_theta[0:num_movies*num_features]
    X = X.reshape((num_movies, num_features))
    
    # Reconstructing theta:
    theta = X_theta[num_movies*num_features:]
    theta = theta.reshape((num_users, num_features))
    
    # Calculating estimate:
    y_hat = np.dot(X, theta.T)
    
    # Calculating error:
    error = np.multiply((y_hat - y), rated)
    sq_error = error**2
    
    # Calculating cost:
    theta_regularization = (reg_coeff/2)*(np.sum(theta**2))
    X_regularization = (reg_coeff/2)*(np.sum(X**2))       
    J =  (1/2)*np.sum(sq_error) + theta_regularization + X_regularization
    
    # Calculating gradients:
    theta_gradient = np.dot(error.T,X) + reg_coeff*theta
    X_gradient = np.dot(error,theta) + reg_coeff*X 
    X_theta_gradient = np.append(np.ravel(X_gradient), np.ravel(theta_gradient))

    return(J, X_theta_gradient)
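
A common sanity check before training is to compare the analytic gradient with a finite-difference estimate on a tiny random problem. The sketch below is only an illustration: `cofi_cost` is a compact restatement of the function above so that the block runs on its own, and `numerical_gradient` plus all toy values are hypothetical.

```python
import numpy as np

def cofi_cost(params, y, rated, reg_coeff, num_features):
    """Compact restatement of the cost and gradient defined above."""
    num_movies, num_users = y.shape
    X = params[:num_movies * num_features].reshape(num_movies, num_features)
    theta = params[num_movies * num_features:].reshape(num_users, num_features)

    error = (X @ theta.T - y) * rated     # only rated entries contribute
    J = 0.5 * np.sum(error ** 2) + (reg_coeff / 2) * (np.sum(theta ** 2) + np.sum(X ** 2))

    X_grad = error @ theta + reg_coeff * X
    theta_grad = error.T @ X + reg_coeff * theta
    return J, np.append(X_grad.ravel(), theta_grad.ravel())

def numerical_gradient(params, *args, eps=1e-5):
    """Central-difference estimate of the gradient, one parameter at a time."""
    grad = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        j_plus, _ = cofi_cost(params + step, *args)
        j_minus, _ = cofi_cost(params - step, *args)
        grad[k] = (j_plus - j_minus) / (2 * eps)
    return grad

# Tiny random problem: 4 movies, 3 users, 2 features
rng = np.random.default_rng(0)
n_movies, n_users, n_features = 4, 3, 2
toy_y = rng.integers(1, 6, size=(n_movies, n_users)).astype(float)
toy_rated = (rng.random((n_movies, n_users)) > 0.3).astype(float)
toy_params = rng.random((n_movies + n_users) * n_features)

_, analytic = cofi_cost(toy_params, toy_y, toy_rated, 1.5, n_features)
numeric = numerical_gradient(toy_params, toy_y, toy_rated, 1.5, n_features)

max_diff = np.max(np.abs(analytic - numeric))
print("largest gradient mismatch:", max_diff)
```

A mismatch many orders of magnitude smaller than the gradient entries themselves suggests the analytic gradient is consistent with the cost.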

## Training:

In [335]:
reg_coeff = 10
min_results = minimize(fun=compute_cost,
                       x0=X_theta, 
                       method='CG',         
                       jac=True,
                       args=(y_norm, ratings, reg_coeff, num_features),
                       options={'maxiter':1000})      
min_results

     fun: 103592.22409803561
     jac: array([  1.62981252e-05,  -3.99503583e-05,   4.89095190e-06, ...,
         2.14329646e-05,  -5.02857968e-05,  -1.41180163e-05])
 message: 'Maximum number of iterations has been exceeded.'
    nfev: 1503
     nit: 1000
    njev: 1503
  status: 1
 success: False
       x: array([ 0.85609286, -0.01339389,  0.32997466, ...,  0.04919602,
        0.41774098,  0.1313267 ])
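
The result above reports `success: False` because the optimizer stopped at `maxiter`, so it is worth checking the result object before using `x`. Here is a minimal sketch of that check on a toy quadratic (a hypothetical stand-in for the real cost), which also shows the `(cost, gradient)` return contract that `jac=True` expects.

```python
import numpy as np
from scipy.optimize import minimize

def quadratic_cost(x, target):
    """Toy objective returning (cost, gradient), as jac=True requires."""
    diff = x - target
    return 0.5 * np.sum(diff ** 2), diff

target = np.array([1.0, -2.0, 0.5])
result = minimize(fun=quadratic_cost,
                  x0=np.zeros(3),
                  method='CG',
                  jac=True,
                  args=(target,),
                  options={'maxiter': 1000})

# Inspect the outcome before trusting result.x
if result.success:
    print("converged after", result.nit, "iterations")
else:
    print("stopped early:", result.message)
```

When a run stops early, raising `maxiter` or restarting from `result.x` are possible next steps, though whether either is needed depends on how small the remaining gradient already is.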

## Prediction:

In [336]:
X_theta_pred = min_results['x']

# Reconstructing X:
X_pred = X_theta_pred[0:num_movies*num_features]
X_pred = X_pred.reshape((num_movies, num_features))

# Reconstructing theta:
theta_pred = X_theta_pred[num_movies*num_features:]
theta_pred = theta_pred.reshape((num_users, num_features))

# Predicting new_user:
predictions = np.dot(X_pred, theta_pred.T)
test = np.vstack((range(0,num_movies), predictions[:,-1].T)).T

new_user_df = pd.DataFrame(test)
new_user_df.columns = ["movie_id", "predicted_rating"]
new_user_df['predicted_rating'] = new_user_df['predicted_rating'] + y_mean
new_user_df = new_user_df.sort_values(by='predicted_rating', ascending=False)

movie = []

for i, movie_id in enumerate(new_user_df['movie_id']):
    movie.append(movie_titles[int(movie_id)])

new_user_df['movie'] = movie

new_user_df

|     | movie_id | predicted_rating | movie |
|-----|----------|------------------|-------|
| 496 | 496.0 | 3.487064 | Schindler's List (1993) |
| 296 | 296.0 | 3.421979 | Shawshank Redemption, The (1994) |
| 333 | 333.0 | 3.316255 | Forrest Gump (1994) |
| 554 | 554.0 | 3.240955 | Silence of the Lambs, The (1991) |
| 0 | 0.0 | 3.119193 | Toy Story (1995) |
| 136 | 136.0 | 3.105234 | Apollo 13 (1995) |
| 426 | 426.0 | 3.025278 | Fugitive, The (1993) |
| 242 | 242.0 | 3.020870 | Star Wars: Episode IV - A New Hope (1977) |
| 746 | 746.0 | 2.991318 | Godfather, The (1972) |
| 104 | 104.0 | 2.987873 | Braveheart (1995) |
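
The prediction cell ranks every movie, including ones the new user has already rated. For actual recommendations it can be handy to leave those out. The following is a small sketch of that idea; `top_n_unrated` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_n_unrated(predicted, already_rated, titles, n=10):
    """Return the n highest predicted ratings among movies not yet rated."""
    df = pd.DataFrame({'movie': list(titles),
                       'predicted_rating': np.asarray(predicted, dtype=float)})
    # Drop movies the user has already rated
    df = df[~np.asarray(already_rated, dtype=bool)]
    return df.sort_values(by='predicted_rating', ascending=False).head(n)

# Toy example: movie "C" scores highest but was already rated by the user
toy_titles = ['A', 'B', 'C', 'D']
toy_predicted = [4.5, 3.0, 4.9, 2.0]
toy_already_rated = [0, 0, 1, 0]

recommendations = top_n_unrated(toy_predicted, toy_already_rated, toy_titles, n=2)
print(recommendations)
```

With the variables from this post, the call would look roughly like `top_n_unrated(predictions[:, -1] + y_mean, rated, movie_titles)`.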


In [54]:
X.shape

(1683, 10)