# Your own personal Netflix
## Data Preprocessing

To read the dataset, you may need to adjust the path so that it points to where the files are stored:

In [1]:
import pandas as pd # pandas is a data manipulation library
import numpy as np
# let's explore movies.csv
movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [2]:
# let's explore ratings.csv
ratings = pd.read_csv('ml-latest-small/ratings.csv', sep=',')
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


The given ratings range from 0.5 to 5:

In [3]:
min(ratings["rating"]), max(ratings["rating"])

(0.5, 5.0)

We convert the sparse data representation of movie ratings into a data matrix. The missing values are filled with zeros.

In [4]:
df_movie_ratings = ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)  # fill unobserved entries with 0
df_movie_ratings.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
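Using zero as the marker for a missing rating works here because the smallest possible rating is 0.5, so no observed entry can be mistaken for a missing one. The following minimal sketch shows the same pivot on a small, hypothetical toy table (the values in `toy` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 0.5, 3.0, 5.0],
})

# One row per user, one column per movie; unrated pairs become NaN, then 0
toy_matrix = toy.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# Since every real rating is at least 0.5, the observed entries are exactly the non-zero ones
observed = toy_matrix != 0
print(toy_matrix)
print("observed entries:", int(observed.to_numpy().sum()), "of", toy_matrix.size)
```

Note that `pivot` requires each (userId, movieId) pair to occur at most once, which holds for this toy table.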


We consider only the movies that have been rated by more than 100 users, which leaves 134 movies. We could not infer a pattern for movies with very few observations anyway, and since this exercise is mostly about the principle, we do not need a big dataset.

In [5]:
np.sum(np.sum(df_movie_ratings!=0,0)>100)

134

In [6]:
keep_movie = np.sum(df_movie_ratings!=0,0)>100
df_D = df_movie_ratings.loc[:,keep_movie]
df_D.head()

movieId,1,2,6,10,32,34,39,47,50,110,...,7153,7361,7438,8961,33794,48516,58559,60069,68954,79132
1,4.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,5.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.0,4.5,0.0,0.0,4.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Furthermore, we discard all users who have rated fewer than five of the remaining movies, since it would be hard to make recommendations based on four or fewer ratings.

In [7]:
np.sum(np.sum(df_D!=0,1)>=5)

556

The resulting dataset has userIds as rows and movieIds as columns. Because some users have been removed, row positions no longer coincide with userIds: userIds 1 and 2 address the first two rows, and userId 4 addresses the third row.

In [8]:
keep_user = np.sum(df_D!=0,1)>=5
df_D = df_D.loc[keep_user,:]
df_D.head()

movieId,1,2,6,10,32,34,39,47,50,110,...,7153,7361,7438,8961,33794,48516,58559,60069,68954,79132
1,4.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,5.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.0,4.5,0.0,0.0,4.0
4,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,4.0,4.0,3.0,4.0,4.0,0.0,4.0,1.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
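The two filtering steps and the resulting shift between userIds and row positions can be sketched on a small, hypothetical matrix. The thresholds here are lowered to fit the toy data and are not the ones used above:

```python
import pandas as pd

# Hypothetical toy user x movie matrix; 0 marks an unobserved rating
toy = pd.DataFrame(
    [[4.0, 0.0, 5.0],
     [0.0, 0.0, 0.0],
     [3.0, 2.0, 0.0]],
    index=pd.Index([1, 2, 4], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Step 1: keep movies rated by more than 1 user (toy threshold)
keep_movie = (toy != 0).sum(axis=0) > 1
kept = toy.loc[:, keep_movie]

# Step 2: keep users with at least 1 rating among the remaining movies (toy threshold)
keep_user = (kept != 0).sum(axis=1) >= 1
kept = kept.loc[keep_user, :]

# Row position of a userId in the filtered matrix
row_of_user_4 = kept.index.get_loc(4)
print(kept)
print("userId 4 is stored in row", row_of_user_4)
```

Looking up positions with `index.get_loc` avoids assuming that userId and row number coincide after filtering.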


The assignment of movieIds to titles is given as follows:

In [9]:
movies.loc[movies['movieId'].isin(df_D.columns)]

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
5,6,Heat (1995),Action|Crime|Thriller
9,10,GoldenEye (1995),Action|Adventure|Thriller
31,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
...,...,...,...
6315,48516,"Departed, The (2006)",Crime|Drama|Thriller
6710,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
6772,60069,WALL·E (2008),Adventure|Animation|Children|Romance|Sci-Fi
7039,68954,Up (2009),Adventure|Animation|Children|Drama


The resulting data matrix is given as follows:

In [10]:
D = df_D.to_numpy()
D.shape

(556, 134)

## Optimization
Use the following initialization for your implementation of the optimization scheme.

In [11]:
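def matrix_completion(D, r, t_max=100, λ=0.1):
    """Approximate D (users x movies, 0 = unobserved) by Y @ X.T with rank r."""
    # Initialize the movie factors X (d x r) and the user factors Y (n x r)
    n, d = D.shape
    np.random.seed(0)
    X = np.random.normal(size=(d, r))
    Y = np.random.normal(size=(n, r))

    # Indicator matrix for non-zero (observed) elements in D
    O = (D != 0).astype(int)

    # Matrix completion via block coordinate descent
    for t in range(t_max):
        # Ridge update of each movie's factors with Y fixed.
        # D is zero where unobserved, so Y.T @ D[:, k] only sums over observed ratings.
        for k in range(d):
            O_X_k = np.diag(O[:, k])
            X[k, :] = Y.T @ D[:, k] @ np.linalg.inv(Y.T @ O_X_k @ Y + λ * np.eye(r))
        # Ridge update of each user's factors with X fixed
        for i in range(n):
            O_Y_i = np.diag(O[i, :])
            Y[i, :] = X.T @ D[i, :] @ np.linalg.inv(X.T @ O_Y_i @ X + λ * np.eye(r))

    return X, Y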

X, Y = matrix_completion(D, 20)
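Each inner update is a ridge regression restricted to the observed entries: with the user factors fixed, the factors of movie $k$ are $(Y^\top O_k Y + \lambda I)^{-1} Y^\top D_{\cdot k}$, where $O_k$ is the diagonal indicator of observed ratings. The following sketch checks this on hypothetical toy data. It assumes a boolean mask and `np.linalg.solve` as an alternative to the diagonal matrix and explicit inverse, and compares both forms:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, lam = 6, 2, 0.1                              # hypothetical toy sizes

Y = rng.normal(size=(n, r))                        # user factors, held fixed
d_k = np.array([4.0, 0.0, 3.5, 0.0, 5.0, 2.0])     # one movie column, 0 = unobserved
obs = d_k != 0

# Closed-form ridge solution using only the observed rows
A = Y[obs].T @ Y[obs] + lam * np.eye(r)
b = Y[obs].T @ d_k[obs]
x_k = np.linalg.solve(A, b)

# The same update written with a diagonal mask and an explicit inverse
O_k = np.diag(obs.astype(int))
x_k_inv = Y.T @ d_k @ np.linalg.inv(Y.T @ O_k @ Y + lam * np.eye(r))

def objective(x):
    """Regularized squared error of movie k on its observed entries."""
    return np.sum((d_k[obs] - Y[obs] @ x) ** 2) + lam * np.sum(x ** 2)

print(np.allclose(x_k, x_k_inv))
print(objective(x_k) < objective(x_k + 0.01))
```

The masked variant avoids building an n x n diagonal matrix for every movie, which may matter for larger datasets.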

Average squared approximation error on the observed entries after 100 iterations:

In [23]:
# Indicator matrix
O = (D != 0).astype(int)

# Residuals on the observed entries (D is zero wherever O is zero)
error = D - (O * (Y @ X.T))

# Sum of squared residuals; np.linalg.norm(error**2) would instead
# return the square root of the summed fourth powers
squared_error = np.sum(error**2)

# divide by the number of observed entries
average_squared_error = squared_error / np.sum(O)

round(average_squared_error, 3)
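The average squared error on the observed entries is the sum of squared residuals divided by the number of observed entries. A small sketch with hypothetical toy values that can be checked by hand; it also shows that applying `np.linalg.norm` to the squared residuals yields a different quantity, namely the square root of the summed fourth powers:

```python
import numpy as np

D_toy = np.array([[4.0, 0.0],
                  [0.0, 2.0]])            # 0 = unobserved
pred = np.array([[3.0, 1.0],
                 [5.0, 4.0]])             # hypothetical predictions, stands in for Y @ X.T
O_toy = (D_toy != 0).astype(int)

# Residuals on observed entries only: [[1, 0], [0, -2]]
residual = O_toy * (D_toy - pred)

# Mean squared error over the 2 observed entries: (1 + 4) / 2
mse_observed = np.sum(residual ** 2) / np.sum(O_toy)

# Frobenius norm of the squared residuals: sqrt(1 + 16)
norm_of_squares = np.linalg.norm(residual ** 2)

print(mse_observed, norm_of_squares)
```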

Indicate the estimated rating of the first user for the following movies.

In [13]:
movies[movies['title'] == "Lord of the Rings: The Two Towers, The (2002)"]

,movieId,title,genres
4137,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy


In [14]:
movie_index1 = df_D.columns.get_loc(5952)
movie_index2 = df_D.columns.get_loc(58559)
movie_index3 = df_D.columns.get_loc(39)
movie_index4 = df_D.columns.get_loc(924)

In [15]:
print(movie_index1)
# Lord of the Rings: The Two Towers
Y[0] @ X[movie_index1].T

119


5.740036155521295

In [16]:
# The Dark Knight
Y[0] @ X[movie_index2].T

6.928218068561034

In [17]:
# Clueless
Y[0] @ X[movie_index3].T

4.94114977293542

In [18]:
# 2001: A Space Odyssey
Y[0] @ X[movie_index4].T

4.087655840218025
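Two of these estimates exceed the maximum rating of 5, because the factorization places no constraint on the range of its predictions. One possible post-processing step, which is an assumption here and not part of the exercise, is to clip predictions to the valid rating range before presenting them:

```python
import numpy as np

# Raw estimates for the first user as printed above, rounded to two decimals
raw = np.array([5.74, 6.93, 4.94, 4.09])

# Clip to the valid rating range [0.5, 5]
clipped = np.clip(raw, 0.5, 5.0)
print(clipped)
```

Clipping does not change the ordering of in-range predictions, but it removes the distinction between all values above 5, so the raw scores may still be preferable for ranking.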

### Exercise 3b
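We run matrix_completion with varying $\lambda$ and compare the squared error on the observed entries with the statistics of the imputed values for the unobserved entries.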

In [19]:
# Indicator of observed entries; it does not depend on λ
O = (D != 0).astype(int)

for λ in [0.01, 0.1, 0.5]:
    X, Y = matrix_completion(D, 20, t_max=100, λ=λ)

    # Sum of squared errors on the observed entries
    imputed_values = Y @ X.T
    error = np.sum((D - O * imputed_values)**2)

    # Predictions for the unobserved entries (where O = 0)
    imputed_values_missing = imputed_values[O == 0]

    # Mean and variance of the imputed values
    variance = np.var(imputed_values_missing)
    mean = np.mean(imputed_values_missing)

    # Imputed values that fall outside the valid rating range [0.5, 5]
    outside_range = imputed_values_missing[(imputed_values_missing < 0.5) | (imputed_values_missing > 5)]

    # Print results
    print("λ = ", λ)
    print("Error ", round(error,5))
    print("Variance of missing value imputations:", round(variance,5))
    print("Mean of missing value imputations:", round(mean,5))
    print("Number of missing imputed values outside range [0.5-5]: ", len(outside_range))
    print("\n")

λ =  0.01
Error  1824.37363
Variance of missing value imputations: 6.16147
Mean of missing value imputations: 3.32581
Number of missing imputed values outside range [0.5-5]:  14619


λ =  0.1
Error  1849.60422
Variance of missing value imputations: 3.4139
Mean of missing value imputations: 3.03149
Number of missing imputed values outside range [0.5-5]:  10692


λ =  0.5
Error  2002.37161
Variance of missing value imputations: 1.31518
Mean of missing value imputations: 3.34626
Number of missing imputed values outside range [0.5-5]:  4087
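In these results, the error on the observed entries grows with $\lambda$, while the variance of the imputed values and the number of out-of-range imputations shrink. This is consistent with the ridge penalty pulling the factor vectors towards zero. The following sketch illustrates that shrinkage for a single ridge update on hypothetical, fully observed toy data (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 3))            # hypothetical fixed user factors
d = rng.uniform(0.5, 5.0, size=8)      # hypothetical ratings for one movie

norms = []
for lam in [0.01, 0.1, 0.5]:
    # Ridge solution for this movie's factor vector
    x = np.linalg.solve(Y.T @ Y + lam * np.eye(3), Y.T @ d)
    norms.append(np.linalg.norm(x))
    print(f"λ = {lam}: ||x|| = {norms[-1]:.4f}")
```

The norm of the ridge solution decreases as the penalty increases, so predictions formed from such factors tend to be less extreme.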


