In this project we will apply the Non-negative matrix factorisation, to create a recomendation system working on a subset of Book-Crossing Dataset which is downloaded from the following [link](http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

The datasets contains information about books, users and the ratings of the books by the users. The ratings are expressed on a scale from 1-10 (higher values means higher appreciation), the 0 value as a rate denoting not rated. 
 

## Packages 
In the following cell we call the packages that we will use in rest of the project:

In [620]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')


## Data reading and cleaning
### Data reading
From the link that we mentioned above, we have downloaded the datasets `Books.csv`, and `BX_Ratings.csv` that we will save them respectively on the pandas.dataframe variables `Books` and `Ratings`.

+ The dataset `Books` contains the books information.('ISBN', 'Book-Title', 'Book-Author')
+ The dataset `Ratings` contains the users ratings for each book.('User-ID', 'ISBN', 'Book-Rating')
### Data cleaning
For the data cleaning we have dropped the useless `Books` columns for the project, and the lines of `Rating` that contains some encoding problems.

After this process we will create another dataset `Books_rate` which is a merge of `Books` and `Rating` by the variable `ISBN` (the book ID).

In [621]:
Books = pd.read_csv('Books.csv',sep=';')
Ratings = pd.read_csv('BX_Ratings.csv',sep=";",encoding="latin1")
Books=Books.drop(["Image-URL-S","Image-URL-M","Image-URL-L","Publisher","Year-Of-Publication"],axis=1)
Ratings=Ratings.drop(Ratings[Ratings["ISBN"].isin(["0373761619","0735201994","0330482750","0413326608","0440500702","0373166982","0894805959","8423920143","034050823X","039482492X","0553570722","096401811X","085409878X","1874100055","0006479839","0807735132","0394720784","0723245827","1581801653","006263545X"])].index)
Books_rate = Ratings.merge(Books,how="left",on="ISBN")

As we can see at the output of the following cell the shape of our new dataset `Books_rate` is 1149740 lines and 5 columns ('User-ID', 'ISBN', 'Book-Rating', 'Book-Title', 'Book-Author'), that's why we will use a filter to keep a subset of this dataset to work on it.

In [622]:
{'shape': Books_rate.shape,'columns':Books_rate.columns}

{'shape': (1149740, 5),
 'columns': Index(['User-ID', 'ISBN', 'Book-Rating', 'Book-Title', 'Book-Author'], dtype='object')}

Our filter consists of keeping the users who have rated more than 300 books, and the books that have been rating by at least 100 users.

In [623]:
U=Books_rate.groupby("User-ID")["Book-Rating"].count()
U=U.loc[U>299].index.values
B=Books_rate.groupby("ISBN")["Book-Rating"].count()
B=B.loc[B>99].index.values
F=Books_rate.loc[Books_rate["User-ID"].isin(U)&Books_rate["ISBN"].isin(B)]

## Main matrix
Now, after creating the filtered dataframe F, we will create the main matrix `X` that we will for the decomposition.

The matrix `X` it's a pivot table (users $\times$ books), such that $ x_{ij} $ denotes the rate given by the user $i$ to the book $j$, as we know from above the missing values are represented by 0 , we will fill the nan values by 0 using `fillna()` function of `pandas`  

As we see in the output, $X\in \mathbb R ^{553 \times 731}$

In [624]:
X = pd.pivot_table(F,index="User-ID",columns="ISBN",values="Book-Rating").fillna(0)
X.shape

(553, 731)

## Needed functions 
In the following cell, we have defined the functions that we will use in our project:

In [637]:
def NMF_gen(X,r,alpha):
    """ Returns predicted matrix of a given matrix using the sk.learn NMF method to generate an NMF decomposition with :solver='mu',beta_loss='frobenius',l1_ratio=1,max_iter=50.

    Parameters
    ----------
    X : matrix

    r : number of components to use in the decomposition (int)

    alpha : the regulisation term (float)
    
    """
    X_fact=NMF(n_components=r,solver='mu',beta_loss='frobenius',alpha=alpha,l1_ratio=1,max_iter=50)
    W,H=X_fact.fit_transform(X),X_fact.components_
    X_hat=pd.DataFrame(W@H)
    X_hat[X_hat>10]=10
    X_hat[X_hat<1]=1
    X_hat.columns,X_hat.index=X.columns,X.index
    return X_hat

def RMSE(X,Y,obs):
    """ Returns the RMSE between two matrixs on a given subset of coordenates.

    Parameters
    ----------
    X,Y : matrix

    obs : dataframe of subset of coordenates
    
    """
    err=0
    for i in range(len(obs)):
        err=err+(X.iloc[(obs.iloc[i]["i"],obs.iloc[i]["j"])]-Y.iloc[(obs.iloc[i]["i"],obs.iloc[i]["j"])])**2
    return np.sqrt(err/len(obs))

def n_remove(X,n_obs):
    """ Returns a copy of the given matrix after removing (replaced by 0) a n_obs non zero coordinates in random positions , and a dataframe of the removed coordinates.

    Parameters
    ----------
    X : matrix

    n_obs : number of coordenates to remove (int)
    
    """
    Y,X_1=X.copy(),X.copy()
    X_1=(X_1!=0)
    N_zero=pd.DataFrame(np.asmatrix(np.where(X_1)).T,columns=["i","j"])
    if n_obs>len(N_zero):
        return print("n_obs must be less than ",len(N_zero))
    obs=N_zero.sample(n_obs)
    for i in range(len(obs)):
        Y.iloc[(obs.iloc[i]["i"],obs.iloc[i]["j"])]=0
    return Y,obs

def range_err(X,vect_r,vect_alpha,n_obs):
    """ Returns a matrix RMSE in a given range of r and alpha variables of the function NMF_gen, between the given matrix and the predicted matrix calculated without the n_obs reel values. 

    Parameters
    ----------
    X : matrix

    vect_r : a range of the number of components to use in the decomposition (int)

    vect_alpha : a range of the regulisation term (float)

    n_obs : number of coordenates to remove (int)
    
    """
    n_rem=n_remove(X,n_obs)
    X_1,obs=n_rem[0],n_rem[1]
    ERR=np.zeros((len(vect_r),len(vect_alpha)))
    for i in range(len(vect_r)):
        for j in range(len(vect_alpha)):
            X_hat=NMF_gen(X_1,vect_r[i],vect_alpha[j])
            ERR[i,j]=RMSE(X,X_hat,obs)
    return ERR

def RMSE_mean(X,vect_alpha,vect_r,n_obs,n):
    """ Returns a matrix of the mean of RMSE calculated n times on the return of range_err function, and argmin on r and alpha in the given range.

    Parameters
    ----------
    X : matrix

    vect_r : a range of the number of components to use in the decomposition (int)

    vect_alpha : a range of the regulisation term (float)

    n_obs : number of coordenates to remove (int)

    n : number of times to repeat the calculation of the similarity matrix 
    
    """
    ERR=np.zeros((len(vect_r),len(vect_alpha),n))
    for i in range(n):
        Y=X.copy()
        ERR[:,:,i]=range_err(Y,vect_r,vect_alpha,n_obs)
    RMSE=np.zeros((len(vect_r),len(vect_alpha)))
    for i in range(len(vect_r)):
        for j in range(len(vect_alpha)):
            RMSE[i,j]=np.sqrt(np.mean(ERR[i,j,:]))
    RMSE=pd.DataFrame(RMSE,index=vect_r,columns=vect_alpha)
    arg=np.unravel_index(np.argmin(RMSE),RMSE.shape)
    print("the argmin of the mean of RMSE in this range are: r =",RMSE.index[arg[0]],"alpha =",RMSE.columns[arg[1]])
    return RMSE,RMSE.index[arg[0]],RMSE.columns[arg[1]]

def best_model_recom(X,r,alpha,s):
    """ Returns a dataframe of users and the best rated books (predicted rates) with a model calculated with the given parameters , and a list of users that we can do a recommendation for them.

    Parameters
    ----------
    X : matrix

    r : number of components to use in the decomposition (int)

    alpha : the regulisation term (float)

    s : a rate to recommend the books that have rates in above of it  (int between 1-10)

    """
    X_hat=round(NMF_gen(X,r,alpha))
    X_hat_1 = X_hat>s 
    X_hat_2 = X==0
    X_hat_b = X_hat_1*X_hat_2
    rec=pd.DataFrame(np.asmatrix(np.where(X_hat_b)).T,columns=["i","j"])
    rec["userID"]=[X_hat.index[i] for i in rec.i]
    rec["ISBN"]=[X_hat.columns[j] for j in rec["j"]]
    rec["Book_Title"]=[Books.loc[Books.ISBN==j]["Book-Title"] for j in rec.ISBN ]
    rec["Author"]=[Books[Books.ISBN==j]["Book-Author"] for j in rec.ISBN ]
    rec["rate"]=[X_hat.loc[np.array(rec.userID)[i],np.array(rec.ISBN)[i]] for i in range(len(rec))]
    rec_df=rec[["userID","ISBN","Book_Title","Author","rate"]]
    count=rec_df.groupby(['userID']).size().reset_index(name='counts').sort_values(['counts'],ascending=False)
    rec_df["count"]=[int(count[count.userID==i]["counts"]) for i in rec_df.userID]
    rec_df=rec_df.sort_values(["count","rate"],ascending=False,ignore_index=True)[["userID","ISBN","Book_Title","Author","rate"]]
    return rec_df,rec_df.userID.unique()

def book_to_rec(rec_df,userID):
    """ Returns a dataframe of the recommended books for a given user

    Parameters
    ----------
    rec_df : dataframe of users and their recommended books

    UserdID : user ID 

    """

    return rec_df.loc[rec_df.userID==userID]

## The best model

We will calculate the best model for $\alpha \in \{0,0.025, 0.05 , 0.075, 0.1\}$ , $ r \in \{8,\cdots, 14\}$ with a $n_{obs}=100$ removed coordinate, and we will take the mean of the RMSE of $n=20$ models calculated, and the best parameters in this range is $r=13$ and $\alpha=0.025$ as we see in the output.

In [631]:
vect_r=range(8,15,1)
vect_alpha=np.linspace(0,0.1,5)
R=RMSE_mean(X,vect_alpha,vect_r,n=20,n_obs=100)

the argmin of RMSE in this range are: r = 13 alpha = 0.025


We can visualize the matrix of the mean of RMSE of our range of parameters:

In [632]:
fig = go.Figure(go.Surface(colorscale="reds",
    x = np.asarray(vect_alpha),
    y = np.asarray(vect_r),
    z = np.asarray(R[0])))
fig.update_traces(contours_z=dict(show=True, usecolormap=True, project_z=True))
fig.update_layout(scene = dict(
                    xaxis_title='alpha',
                    yaxis_title='r',
                    zaxis_title='RMSE'),
                    width =600,
                    height=600,
                    margin=dict(r=0, b=0, l=0, t=0))
fig.show()


## Recommendation
The function `RMSE_mean()` returns the best parameters in our range(R[1],R[2]), so we can use them to calculate predicted model $\widehat X$ of X with the function `best_model_recom()`, and it returns a list of the users ID who we can recommend books for them, and a dataframe of them with the books that we can recommend for each one of them, with the condition of the predicted rate of a recommended book must be greater than 5:

In [638]:
rec=best_model_recom(X,R[1],R[2],5)
rec[0].sample(10)

Unnamed: 0,userID,ISBN,Book_Title,Author,rate
19,114368,0515127833,"584 River's End Name: Book-Title, dtype: ob...","584 Nora Roberts Name: Book-Author, dtype: ...",6.0
37,31826,0451160525,"11103 The Gunslinger (The Dark Tower, Book ...","11103 Stephen King Name: Book-Author, dtype...",6.0
12,23872,059035342X,2143 Harry Potter and the Sorcerer's Stone ...,"2143 J. K. Rowling Name: Book-Author, dtype...",7.0
10,135149,044022165X,"1012 The Rainmaker Name: Book-Title, dtype:...","1012 JOHN GRISHAM Name: Book-Author, dtype:...",6.0
20,153662,0439064864,5432 Harry Potter and the Chamber of Secret...,"5432 J. K. Rowling Name: Book-Author, dtype...",6.0
22,170513,043935806X,5506 Harry Potter and the Order of the Phoe...,"5506 J. K. Rowling Name: Book-Author, dtype...",6.0
24,211426,059035342X,2143 Harry Potter and the Sorcerer's Stone ...,"2143 J. K. Rowling Name: Book-Author, dtype...",6.0
11,153662,0439136350,3839 Harry Potter and the Prisoner of Azkab...,"3839 J. K. Rowling Name: Book-Author, dtype...",8.0
3,104636,0425154092,"5786 From Potter's Field Name: Book-Title, ...",5786 Patricia Daniels Cornwell Name: Book-A...,6.0
38,87141,0440214041,"2445 The Pelican Brief Name: Book-Title, dt...","2445 John Grisham Name: Book-Author, dtype:...",6.0


List of users who we can recommend for them:

In [639]:
rec[1]

array([104636, 274301, 135149, 153662,  23872,   6575,  13552, 114368,
       170513, 211426, 110912,  21014,  78553, 162639,    254,   6251,
        88733, 179978, 208141, 230522,  23902,  25981,  31826,  87141,
       100906, 204864, 227520, 240144, 242106, 258534], dtype=int64)

And finally, we can use the function `book_to rec()` to get a sorted dataframe (by rate), of all the books that we can recommend to a given user(chosen from the previous list).

In [640]:
book_to_rec(rec[0],104636)

Unnamed: 0,userID,ISBN,Book_Title,Author,rate
0,104636,312924585,"4479 Silence of the Lambs Name: Book-Title,...","4479 Thomas Harris Name: Book-Author, dtype...",7.0
1,104636,425163407,"11453 Unnatural Exposure Name: Book-Title, ...",11453 Patricia Daniels Cornwell Name: Book-...,7.0
2,104636,425147622,"2918 The Body Farm Name: Book-Title, dtype:...",2918 Patricia Daniels Cornwell Name: Book-A...,6.0
3,104636,425154092,"5786 From Potter's Field Name: Book-Title, ...",5786 Patricia Daniels Cornwell Name: Book-A...,6.0
