## Importing packages

As always we start by importing necessary packages.

In [1]:
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install scipy
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds


Collecting pandas
  Using cached pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2024.2 tzdata-2024.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31

Then we can load the data. The raw data from the site has been kindly preprocessed by another group member, where different files were combined and the resultant data drame grouped by user and split into training and test sets. (On my laptop sometimes it gives an error at first, usually resolved by running again)

In [3]:
data = pd.read_csv('user_train_df.csv')
data.shape

(32550, 30)

In [107]:
data.head()

Unnamed: 0,User ID,Item ID,Rating,timestamp,Age,Gender,Occupation,zip code,Movie Title,Release Date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,168,5,874965478,24,M,technician,85711,Monty Python and the Holy Grail (1974),01-Jan-1974,...,0,0.0,0,0,0.0,0,0,0.0,0,0
1,1,172,5,874965478,24,M,technician,85711,"Empire Strikes Back, The (1980)",01-Jan-1980,...,0,0.0,0,0,0.0,1,1,0.0,1,0
2,1,165,5,874965518,24,M,technician,85711,Jean de Florette (1986),01-Jan-1986,...,0,0.0,0,0,0.0,0,0,0.0,0,0
3,1,156,4,874965556,24,M,technician,85711,Reservoir Dogs (1992),01-Jan-1992,...,0,0.0,0,0,0.0,0,0,1.0,0,0
4,1,166,5,874965677,24,M,technician,85711,Manon of the Spring (Manon des sources) (1986),01-Jan-1986,...,0,0.0,0,0,0.0,0,0,0.0,0,0


In [5]:
data.columns

Index(['User ID', 'Item ID', 'Rating', 'timestamp', 'Age', 'Gender',
       'Occupation', 'zip code', 'Movie Title', 'Release Date', 'URL',
       'Unknown', 'Action', 'Adeventure', 'Animation', 'Childrens', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')

Here User IDs identify people that rated different movies, Item IDs refer to the movies, and we have a range of information from ones about the user and ones that relate to the movie. 

## Matrix Factorisation

Our task is to create a collaborative filtering using matrix factorisation. 

Collaborative filtering is a model where based on movies it knows a user liked, it finds other users with similar ratings and proposes movies they liked as likely candidates. One possible approach would be to guess how a user would rate different movies, then recommend the ones we think they would rate highest. 

To do this we can use a low-rank approximation of the ratings matrix:

$$ R_ k= U_k \Sigma_kV^\top_k $$

where we use only the k first singular values and vectors. 

By doing this we implicitly introduce k latent variables, where we can intrepret $U_k$ as a matrix describing how much a user likes those k attributes, and $V_k$ as a matrix showing how each movie is aligned with those k factors. Contextually this could include things like genre or actors, but here this structure is ignored and we focus on computing the most mathematically significant factors. 

When multiplied together the product of the full matrices returns the true ratings, while using only the low-rank approximation we get slightly deviated values from which we can select the most fitting movies. 

We start by creating the matrix of ratings.



In [7]:
ratings = data.pivot(index = 'User ID', columns ='Item ID', values = 'Rating').fillna(0)
ratings

Item ID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1663,1664,1665,1679,1680,1681,1682
User ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next to apply the SVD we centre the matrix.

In [8]:
r = ratings.values # converts to matrix
means = np.mean(r, axis = 1).reshape(-1,1) 
r2 = r - means

Now we can decompose the matrix $R$.

In [9]:
U, sigma, Vt = svds(r2 , k =  100)
sigma = np.diag(sigma)

Then we multiply the matrices and add the means back to get our approximation.

In [10]:
p = np.matmul(np.matmul(U, sigma), Vt) + means
pred = pd.DataFrame(p,columns=ratings.columns)
pred.head()

Item ID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1663,1664,1665,1679,1680,1681,1682
0,4.535338,3.37742,2.148937,4.209495,2.060415,2.719416,4.22681,0.967013,5.040234,2.782738,...,-0.010706,-0.051346,0.058949,0.02229,0.070588,0.02229,-0.010325,-0.013475,-0.022062,0.077854
1,3.799748,-0.219038,0.019976,0.403776,-0.1162,0.45571,0.036102,-0.228277,0.165513,1.874556,...,-0.027696,-0.02956,-0.013837,-0.0199,-0.006861,-0.0199,0.01222,0.000482,-0.052872,-0.056812
2,-0.073539,-0.066183,0.074263,-0.044759,0.027685,-0.100111,0.021347,-0.096357,0.006834,0.006971,...,0.016716,-0.001075,0.001984,-8.9e-05,0.002856,-8.9e-05,0.002446,0.00098,-0.001928,-0.011543
3,0.190695,-0.299708,0.048256,0.201174,-0.139472,0.009617,0.032057,-0.041062,-0.097846,-0.374609,...,0.038677,0.012169,-0.03116,-0.009319,-0.027393,-0.009319,0.023335,0.016914,-0.011643,-0.005425
4,4.231784,2.118404,0.209611,-0.322296,0.358826,0.134808,-0.476681,-0.359183,-0.056438,0.245643,...,0.044355,0.01881,-0.043988,-0.02666,-0.152268,-0.02666,-0.016105,-0.012787,-0.0038,-0.002031


To obtain recommendations we can select the highest ratings from each row:

In [11]:
def rec(UserID,k):
    u = np.where(ratings.iloc[UserID-1]==0)[0] # find movies with rating 0 which mean they haven't been watched
    preds = pred.iloc[UserID-1,u].nlargest(k) # indices with highest k ratings
    index = preds.index
    titles = []
    for j in index:
        titles.append((data.loc[data['Item ID']==j])['Movie Title'].iloc[1]) # finds title of movie with index and add to list
    return pd.DataFrame({ 'Movie Title' : titles, 'Rating': preds})

In [12]:
pred.iloc[9,np.where(ratings.iloc[9]==0)[0]].nlargest(5).index

Index([1039, 193, 641, 636, 10], dtype='int64', name='Item ID')

And here are a few movies recommended for user 10:

In [13]:
rec(10,5)

Unnamed: 0_level_0,Movie Title,Rating
Item ID,Unnamed: 1_level_1,Unnamed: 2_level_1
1039,Hamlet (1996),1.53448
193,"Right Stuff, The (1983)",1.271954
641,Paths of Glory (1957),1.271799
636,Escape from New York (1981),1.185152
10,Richard III (1995),1.083364


$ 943 * 605 $ 943 * 18 18 * 605

In [84]:
U = np.random.uniform(-1, 1, size=(943, 18))
U

array([[-0.85234403, -0.90408672,  0.17431559, ..., -0.94190948,
         0.21201398,  0.3827924 ],
       [ 0.54331589, -0.75329451, -0.43069033, ..., -0.84120907,
        -0.36037583, -0.54791903],
       [-0.39260025,  0.34003126, -0.48548763, ...,  0.94124214,
        -0.18773861,  0.82056721],
       ...,
       [ 0.01811465, -0.98923915,  0.71089213, ..., -0.53686759,
        -0.642457  ,  0.05200795],
       [ 0.9463679 ,  0.07541999,  0.22920815, ..., -0.85147042,
        -0.38172645,  0.86187018],
       [-0.36960833, -0.01250313, -0.95787458, ...,  0.99116966,
        -0.57976741, -0.6559985 ]])

In [86]:
dropped = data.drop_duplicates(subset='Item ID')
dropped = dropped[['Item ID','Action','Adeventure', 'Animation', 'Childrens', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western']]
dropped = dropped.set_index('Item ID')
dropped[dropped != 1] = 0
dropped = dropped.sort_index()
dropped.head()

Unnamed: 0_level_0,Action,Adeventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Item ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,0.0,0,0,1.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0
2,1.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,1.0,0,0
3,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,1.0,0,0
4,1.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0
5,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,0.0,0,0,1.0,0,0


In [104]:
def descent(U,M,steps,alpha):
    N = 943 * 605
    preds = np.matmul(U,np.transpose(M))
    preds.index += 1
    MSE = np.square(ratings-pred).mean()
    dU = np.matmul(-2/N * (ratings - preds),M) 
    dM = np.matmul(np.transpose(-2/N * (ratings - preds)),U)
    U -= alpha * dU
    M -= alpha * dM
    return U,M



In [24]:
A = dropped.values
for i in range(605):
    for j in range(18):
        if isinstance(A[i,j], str):
            A[i,j] = 0   
        else:
            A[i,j] = dropped.values[i,j]
list(set(i for j in A for i in j))

[0.0, 1.0]

In [79]:
np.matmul(U,np.transpose(dropped))

Item ID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1663,1664,1665,1679,1680,1681,1682
0,-0.61507,-0.255832,-0.397627,0.141794,-0.397627,0.0,0.0,-0.61507,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.397627,0.0,0.0,0.0
1,-0.291335,-0.07287,-0.74273,0.669859,-0.74273,0.0,0.0,-0.291335,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.74273,0.0,0.0,0.0
2,-0.560754,-0.134064,0.144373,-0.278437,0.144373,0.0,0.0,-0.560754,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.144373,0.0,0.0,0.0
3,0.358746,-0.666548,0.138655,-0.805203,0.138655,0.0,0.0,0.358746,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.138655,0.0,0.0,0.0
4,0.472515,-0.441526,0.166128,-0.607654,0.166128,0.0,0.0,0.472515,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.166128,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,-0.494833,-1.518618,-0.879168,-0.639449,-0.879168,0.0,0.0,-0.494833,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.879168,0.0,0.0,0.0
939,-0.267027,-0.75809,-0.087081,-0.671009,-0.087081,0.0,0.0,-0.267027,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.087081,0.0,0.0,0.0
940,-0.746968,-0.351834,-0.479129,0.127295,-0.479129,0.0,0.0,-0.746968,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.479129,0.0,0.0,0.0
941,0.750964,0.125753,0.005289,0.120464,0.005289,0.0,0.0,0.750964,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.005289,0.0,0.0,0.0


In [70]:
ratings

Item ID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1663,1664,1665,1679,1680,1681,1682
User ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [103]:
U.shape

(943, 18)

In [106]:
descent(U,dropped,1000,0.01)

(         Action  Adeventure  Animation  Childrens  Comedy  Crime  Documentary  \
 User ID                                                                         
 1           NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 2           NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 3           NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 4           NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 5           NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 ...         ...         ...        ...        ...     ...    ...          ...   
 939         NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 940         NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 941         NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 942         NaN         NaN        NaN        NaN     NaN    NaN          NaN   
 943         NaN

In [97]:
np.matmul(descent(U,100000,0.01),np.transpose(dropped))

Item ID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1663,1664,1665,1679,1680,1681,1682
User ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.977166,-1.794014,-0.941814,-0.852201,-0.941814,0.0,0.0,-0.977166,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.941814,0.0,0.0,0.0
2,-0.367322,-0.297865,-0.84117,0.543305,-0.84117,0.0,0.0,-0.367322,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.84117,0.0,0.0,0.0
3,0.360267,0.548619,0.94121,-0.392591,0.94121,0.0,0.0,0.360267,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.94121,0.0,0.0,0.0
4,-0.227022,0.607644,0.346975,0.260668,0.346975,0.0,0.0,-0.227022,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.346975,0.0,0.0,0.0
5,0.928099,1.375503,0.904205,0.471298,0.904205,0.0,0.0,0.928099,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.904205,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.030521,0.139316,0.822249,-0.682933,0.822249,0.0,0.0,0.030521,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.822249,0.0,0.0,0.0
940,0.115249,0.906787,0.163617,0.74317,0.163617,0.0,0.0,0.115249,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.163617,0.0,0.0,0.0
941,0.008593,-0.518703,-0.536835,0.018132,-0.536835,0.0,0.0,0.008593,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.536835,0.0,0.0,0.0
942,-0.753792,0.094909,-0.851442,0.946351,-0.851442,0.0,0.0,-0.753792,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.851442,0.0,0.0,0.0


In [94]:
ratings

Item ID,1,2,3,4,5,6,7,8,9,10,...,1660,1661,1662,1663,1664,1665,1679,1680,1681,1682
User ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
for curr_epoch in range(num_epochs):
    # Reconstruct R_hat from latent factor matrices (linear operation)
    R_hat = U @ V.T  # @ means matrix multiplcation, .T means transpose
    # Calc MSE loss of this reconstruction
    loss = np.square(R - R_hat).mean()
    # Calc partial derivative of MSE with respect to U and V
    U_grad = -2./N * (R - R_hat)@V
    V_grad = -2./N * (R - R_hat)@U
    # Update U and V using their respective gradients and learning rate
    U -= lr * U_grad
    V -= lr * V_grad

## References

https://beckernick.github.io/matrix-factorization-recommender/

https://sparkbyexamples.com/pandas/pandas-iloc-usage-with-examples/

https://stackoverflow.com/questions/13070461/get-indices-of-the-top-n-values-of-a-list

https://discuss.datasciencedojo.com/t/how-to-find-the-row-number-of-nth-largest-value/984/4

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html

https://medium.com/@maxbrenner-ai/matrix-factorization-for-collaborative-filtering-linear-to-non-linear-models-in-python-5cf54363a03c

In [None]:
data.loc[data['Item ID']==19]['Movie Title'][16]

In [53]:
np.random.uniform(-1, 1, size=(943, 18))

array([[-0.95118821,  0.11291434,  0.42715043, ...,  0.03139911,
        -0.37063838,  0.04290505],
       [ 0.84512812, -0.42471224,  0.5174864 , ...,  0.52020444,
        -0.56005202, -0.14773131],
       [-0.89716248, -0.28528075, -0.86500581, ..., -0.65013447,
        -0.54280007,  0.90215532],
       ...,
       [ 0.45182039,  0.1567632 ,  0.96588336, ...,  0.63227141,
        -0.85620938,  0.05346202],
       [-0.04995197,  0.72946141,  0.12376662, ..., -0.64696298,
        -0.30640318, -0.6577234 ],
       [ 0.00775638, -0.65520187,  0.66524394, ..., -0.80585962,
        -0.05143924,  0.7045995 ]])

In [None]:
def descent(U,M,steps,alpha):
    N = 943 * 605
    preds = np.matmul(U,np.transpose(M))
    MSE = np.square(ratings-pred).mean()
    dU = np.matmul(-2/N * (ratings - preds),M) 
    dM = np.matmul(-2/N * (ratings - preds),U)
    U -= alpha * dU
    M -= alpha * dM
    return U,M

In [None]:
def descent(U,M,steps,alpha):
    N = 943 * 605
    preds = np.matmul(U,np.transpose(M))
    preds.index += 1
    MSE = np.square(ratings-pred).mean()
    dU = np.matmul(-2/N * (ratings - preds),M) 
    dM = np.matmul(-2/N * (ratings - preds),U)
    U -= alpha * dU
    M -= alpha * dM
    return U,M
