<a href="https://colab.research.google.com/github/AnnaK8090/CIND-820_Big-Data-Analytics-Project/blob/main/CIND_820_Big_Data_Analytics_Project_2_Collaborative_Filtering_Matrix_Factorization_subset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Matrix factorization algorithm:**

1. Initialize two random matrices P and Q with dimensions M by K and K by N, so that their product has the same dimensions as the original matrix R (M by N).
2. Multiply P by Q to obtain an estimate of R.

3. Subtract the estimated values from the known values of R and square the differences (**loss function**) to measure how far the estimate is from the real matrix.

4. Use **gradient descent** formulas to adjust each value in P and Q in the direction that reduces the error.

5. Repeat steps 2 to 4 until the error reaches an acceptably small value.

6. Multiplying P by Q now gives an estimate of R that not only closely matches the known values of R but also provides estimates for the unknown values.
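
The steps above can be sketched on a toy matrix. This is only an illustration under stated assumptions, not the implementation used later in this notebook: the 4 by 3 matrix, the learning rate, the step count and the name `factorize_toy` are all hypothetical, and the sketch uses full-batch gradient descent without regularization.

```python
import numpy as np

def factorize_toy(R, K=2, steps=5000, alpha=0.01, seed=0):
    """Full-batch gradient descent on the known (non-zero) entries of R."""
    rng = np.random.default_rng(seed)
    M, N = R.shape
    P = rng.random((M, K))            # random M-by-K matrix
    Q = rng.random((K, N))            # random K-by-N matrix
    mask = R > 0                      # zeros mark unknown ratings
    errors = []
    for _ in range(steps):
        # estimate R and measure the error on known entries only
        E = np.where(mask, R - P @ Q, 0.0)
        errors.append(float((E ** 2).sum()))
        # gradient directions, both computed before either matrix is updated
        P_step = E @ Q.T
        Q_step = P.T @ E
        P += alpha * 2 * P_step
        Q += alpha * 2 * Q_step
    return P, Q, errors

# hypothetical ratings; 0 means "not rated"
R_toy = np.array([[5, 3, 0],
                  [4, 0, 1],
                  [1, 1, 5],
                  [0, 1, 4]], dtype=float)

P_toy, Q_toy, errors = factorize_toy(R_toy)
R_hat = P_toy @ Q_toy                 # estimate, including the unknown cells
print(errors[0], errors[-1])          # squared error before and after training
```

The cells that were 0 in `R_toy` receive values in `R_hat` as a by-product of fitting the known cells, which is what makes the factorization usable for recommendations.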



In [1]:
# 1. Importing libraries:
import numpy as np 
import pandas as pd      

In [16]:
# 2. Loading csv file and saving it into a dataframe:
masterDF = pd.read_csv('MasterDF.csv', on_bad_lines='skip')

In [17]:
masterDF.shape

(113210, 40)

In [18]:
# 3. Since an order_id (~ transaction ID) may have more than one review, aggregate the dataframe by order, customer, product and category, keeping the maximum review score:
masterDF_grouped = masterDF.groupby(['order_id','customer_unique_id','product_id','product_category_name_english'])['review_score'].max()
masterDF_grouped = masterDF_grouped.reset_index()
masterDF_grouped.shape

(98091, 5)
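
To see what this aggregation does, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical; only the column names follow the notebook). Two reviews for the same order and product collapse into one row holding the higher score.

```python
import pandas as pd

# hypothetical data: order o1 has two reviews for product p1
toy = pd.DataFrame({
    'order_id':     ['o1', 'o1', 'o2'],
    'product_id':   ['p1', 'p1', 'p2'],
    'review_score': [3, 5, 4],
})

toy_grouped = (toy.groupby(['order_id', 'product_id'])['review_score']
                  .max()
                  .reset_index())
print(toy_grouped)
```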

In [19]:
masterDF_grouped.to_csv('MasterDF_grouped.csv', index=False, header=True)

In [20]:
masterDF_grouped.head()

,order_id,customer_unique_id,product_id,product_category_name_english,review_score
0,00010242fe8c5a6d1ba2dd792cb16214,871766c5855e863f6eccc05f988b23cb,4244733e06e7ecb4970a6e2683c13e61,cool_stuff,5
1,00018f77f2f0320c557190d7a144bdd3,eb28e67c4c0b83846050ddfb8a35d051,e5f2d52b802189ee658865ca93d83a8f,pet_shop,4
2,000229ec398224ef6ca0657da4fc703e,3818d81c6709e39d06b2738a8d3a2474,c777355d18b72b67abbeef9df44fd0fd,furniture_decor,5
3,00024acbcdf0a6daa1e931b038114c75,af861d436cfc08b2c2ddefd0ba074622,7634da152a4610f1595efa32f14722fc,perfumery,4
4,00042b26cf59d7ce69dfabb4e55b4fd9,64b576fb70d441e8f1b2d7d446e483c5,ac6c3623068f30de03045865e4e10089,garden_tools,5


In [51]:
# 4. To be able to validate the results, keep only the rows of customers who bought more than 2 products (3 or more):

# first, count products per customer (in a separate dataframe):
ProductsPerCustomer = masterDF_grouped.groupby(['customer_unique_id'])['product_id'].agg('count').reset_index()

# second, keep only the customers whose count is greater than 2:
ProductsGreater2PerCustomer = ProductsPerCustomer.loc[ProductsPerCustomer['product_id'] > 2]

# third, filter the initial dataframe so that only those customers remain (copy() avoids chained-assignment warnings later):
result = masterDF_grouped[masterDF_grouped.customer_unique_id.isin(ProductsGreater2PerCustomer.customer_unique_id)].copy()
result.head()

,order_id,customer_unique_id,product_id,product_category_name_english,review_score
128,005d9a5423d47281ac463a968b3936fb,6204c4e582a95b6a350adf6988623bfb,4c3ae5db49258df0784827bdacf3b396,baby,1
129,005d9a5423d47281ac463a968b3936fb,6204c4e582a95b6a350adf6988623bfb,fb7a100ec8c7b34f60cec22b1a9a10e0,toys,1
198,0095790a64527ec83aeaaf99023c050e,35ecdf6858edc6427223b64804cf028e,e8c6039a25765995ac7c1ec2cbef5765,watches_gifts,5
219,00a250dbdb3153cc6ecf4d3f07ef6a17,8004f80e361a5ee23aadb7418a685fc2,ee0c1cf2fbeae95205b4aa506f1469f0,perfumery,2
228,00a9536682ecb394a3794c1608200803,c5f6047fb345ffd234cf5b26268988be,24a014458ccc6e989b4fcef5fa71da58,bed_bath_table,3


In [52]:
result.shape

(2721, 5)
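
The three filtering steps can also be written in one pass with `groupby(...).transform('count')`. The sketch below uses a hypothetical `toy` frame; like the cell above, it counts rows per customer rather than distinct products (`'nunique'` would count distinct products instead).

```python
import pandas as pd

# hypothetical data: customer 'a' has 3 rows, 'b' has 2, 'c' has 1
toy = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'product_id':         ['p1', 'p2', 'p3', 'p1', 'p4', 'p2'],
})

# per-row count of products for that row's customer
counts = toy.groupby('customer_unique_id')['product_id'].transform('count')
toy_result = toy[counts > 2]
print(toy_result)
```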

In [27]:
import progressbar as pb
import random

In [28]:
# 5. The following code creates the "User-Item" matrix (dataframe) where each column (x-axis) represents a product_id and each row (y-axis) represents a customer_unique_id.
# Note: the index column is named 'order_id', but it holds customer_unique_id values.

c_data = pd.DataFrame()
products = []
c_data['order_id'] = 0
# add one zero-filled column per distinct product:
for product in pb.progressbar(result['product_id']):
    if product not in products:
        products.append(product)
        c_data[product] = 0
users = []
# add one zero-filled row per distinct customer:
for user in pb.progressbar(result['customer_unique_id']):
    if user not in users:
        users.append(user)
        append_dic = {'order_id': user}
        for column in c_data.columns:
            if column != 'order_id':
                append_dic[column] = 0
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        c_data = pd.concat([c_data, pd.DataFrame([append_dic])])
c_data = c_data.set_index('order_id')

100% (2721 of 2721) |####################| Elapsed Time: 0:00:01 Time:  0:00:01
100% (2721 of 2721) |####################| Elapsed Time: 0:02:26 Time:  0:02:26
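
As a possible alternative to building the table row by row, `DataFrame.pivot_table` can produce a user-item matrix in a single call. The sketch below runs on a hypothetical `toy` frame. It is not equivalent to the loop in every detail: rows and columns come out sorted rather than in order of first appearance, and duplicate customer-product pairs are resolved with `max`.

```python
import pandas as pd

# hypothetical long-format ratings
toy = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'b', 'c'],
    'product_id':         ['p1', 'p2', 'p2', 'p3'],
    'review_score':       [5, 3, 4, 1],
})

# one row per customer, one column per product, 0 where there is no rating
toy_matrix = toy.pivot_table(index='customer_unique_id',
                             columns='product_id',
                             values='review_score',
                             aggfunc='max',
                             fill_value=0)
print(toy_matrix)
```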


In [29]:
result.reset_index(inplace = True,drop = True)

In [30]:
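# 6. The following code goes through all the rows of the long-format data and writes each review score into the User-Item table.
# range(len(result)) covers every row; the recorded progress output shows 2720 because that run used len(result)-1, which skips the last row:
for index in pb.progressbar(range(len(result))):
    c_data.loc[result['customer_unique_id'][index], result['product_id'][index]] = result['review_score'][index]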


100% (2720 of 2720) |####################| Elapsed Time: 0:00:00 Time:  0:00:00


In [31]:
c_data.to_csv('c_data_reduced.csv', index=None, header=True)

In [55]:
c_data.head()

order_id,4c3ae5db49258df0784827bdacf3b396,fb7a100ec8c7b34f60cec22b1a9a10e0,e8c6039a25765995ac7c1ec2cbef5765,ee0c1cf2fbeae95205b4aa506f1469f0,24a014458ccc6e989b4fcef5fa71da58,3552627a68384dc559f0fd4cce173269,55939df5d8d2b853fbc532bf8a00dc32,6c90c0f6c2d89eb816b9e205b9d6a36a,b7d94dc0640c7025dc8e3b46b52d8239,d143bf43abb18593fa8ed20cc990ae84,...,e0d3e5cf1969f20bd69e052ec6cf8f8f,056d012d264624accb7f73d31caee034,6f735de7025b8e74fc832dfd6ec2bf5d,ce6f74096c84567f22728c84f3d6e7fc,803f77475e1b51b47f1bfec4f2ec353f,bd0ac51dc93e62c4dbe6ca9d70a9b311,bd6e8cf9fe4122c385da2bcb9f979d5d,3321ad579f19476d0d668f726f8dffec,fec565c4e3ad965c73fb1a21bb809257,b10ecf8e33aaaea419a9fa860ea80fb5
6204c4e582a95b6a350adf6988623bfb,1,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35ecdf6858edc6427223b64804cf028e,0,0,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8004f80e361a5ee23aadb7418a685fc2,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
c5f6047fb345ffd234cf5b26268988be,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1373e04979cfa0fb2092909abbd57f25,0,0,0,0,0,5,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
c_data.shape

(791, 2234)

In [39]:
# 7. Matrix factorization function to predict the empty entries of the "User-Item" matrix:

# R: "User-Item" matrix holding the true values (unknown values are marked as 0)
# P and Q are the two matrices whose product forms an estimate of R
# K is the number of latent features: the columns of P and the rows of Q
# P has dimensions (rows of R) by K
# Q is passed in with dimensions (columns of R) by K and is transposed inside the function to K by (columns of R)
# steps: maximum number of passes; alpha: learning rate; beta: regularization strength


def matrix_factorization(R, P, Q, K, steps=1000, alpha=0.0002, beta=0.02):
    # note: P and Q are updated in place
    Q = Q.T
    for step in pb.progressbar(range(steps)):
        # one gradient-descent pass over all known ratings
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        # regularized squared error over the known ratings
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        # stop early once the error is negligible
        if e < 0.001:
            break
    return P, Q.T
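
For reference, the quantities computed inside `matrix_factorization` can be written out as formulas. The notation below is read directly off the code: lowercase $r_{ij}$, $p_{ik}$, $q_{kj}$ are the entries of R, P and the transposed Q, and $\alpha$, $\beta$ are the `alpha` and `beta` arguments. The symbol $E$ for the total error is introduced here for illustration (it is the variable `e` in the code).

```latex
% prediction error for a known rating r_{ij} > 0
e_{ij} = r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{kj}

% update rules applied for each k = 1, \dots, K
p_{ik} \leftarrow p_{ik} + \alpha \left( 2\, e_{ij}\, q_{kj} - \beta\, p_{ik} \right)
q_{kj} \leftarrow q_{kj} + \alpha \left( 2\, e_{ij}\, p_{ik} - \beta\, q_{kj} \right)

% regularized error accumulated after each pass
E = \sum_{(i,j)\,:\, r_{ij} > 0} \left[ e_{ij}^{2}
    + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^{2} + q_{kj}^{2} \right) \right]
```

Each update moves $p_{ik}$ and $q_{kj}$ against the gradient of the corresponding term of $E$, since $\partial/\partial p_{ik}$ of that term is $-2\, e_{ij}\, q_{kj} + \beta\, p_{ik}$. In the code, the update of $q_{kj}$ uses the value of $p_{ik}$ that was just updated on the previous line.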

In [40]:
# 8. Transforming dataframe to array:
R = np.array(c_data)

In [41]:
# 9. Number of rows of matrix R: 
N = len(R)
print(N)

791


In [42]:
# 9 (continued). Number of columns of matrix R:
M = len(R[0])
print(M)

2234


In [43]:
# 10. Setting K = 2 (the columns of P and the rows of Q) and then creating random initial values for matrices P (N by K) and Q (M by K):
K = 2

P = np.random.rand(N,K)
Q = np.random.rand(M,K)

# finally, running the matrix factorization function:
nP, nQ = matrix_factorization(R, P, Q, K)
nR = np.dot(nP, nQ.T)

100% (1000 of 1000) |####################| Elapsed Time: 0:29:23 Time:  0:29:23


In [44]:
# 11. Final array with predicted values for the unknown entries (along with approximations of the original known values):
nR

array([[1.56005735, 4.40043656, 3.116055  , ..., 1.19520604, 1.99656491,
        1.58050208],
       [2.55087373, 6.65404351, 4.99655615, ..., 2.2536877 , 3.05800577,
        2.25141255],
       [1.36915727, 3.2434762 , 2.62211905, ..., 1.39111039, 1.51612632,
        1.00665276],
       ...,
       [1.98868693, 4.93983327, 3.85025254, ..., 1.89404371, 2.2894765 ,
        1.60284197],
       [1.88645921, 5.36825121, 3.7765933 , ..., 1.41919548, 2.43229028,
        1.94017621],
       [0.76882669, 1.9738996 , 1.50019222, ..., 0.69674371, 0.90960638,
        0.65912476]])
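
One way to check how closely the estimate matches the known ratings is a root-mean-square error restricted to the non-zero entries of R. The helper name `rmse_on_known` and the toy arrays below are hypothetical. The sketch also clips predictions to the 1 to 5 range, on the assumption that review scores lie in that range; the recorded output above contains values such as 6.65404351 that fall outside it.

```python
import numpy as np

def rmse_on_known(R, R_hat):
    """RMSE between R and R_hat over the entries where R is known (> 0)."""
    mask = R > 0
    return float(np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2)))

# hypothetical true ratings (0 = unknown) and predictions
R_toy = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
R_hat_toy = np.array([[4.0, 6.2],
                      [1.0, 3.0]])

rmse = rmse_on_known(R_toy, R_hat_toy)

# keep predictions inside the assumed 1-to-5 rating scale
R_clipped = np.clip(R_hat_toy, 1.0, 5.0)
print(rmse)
print(R_clipped)
```

In the notebook, the same helper could be called as `rmse_on_known(R, nR)`. Because every known rating was used for training, this measures fit rather than predictive accuracy; a held-out set of ratings would be needed for the latter.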

In [46]:
# 12. Saving the predicted values into a csv file:
pd.DataFrame(nR).to_csv('predictions.csv')
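
The saved array has no customer or product labels. A possible follow-up, sketched here on hypothetical toy frames, is to attach the labels and list the highest-predicted products a customer has not rated yet. The function name `top_n_unrated` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

def top_n_unrated(ratings, predictions, user, n=2):
    """Return the n unrated products with the highest predicted score."""
    unrated = ratings.loc[user] == 0          # True where the user has no rating
    return predictions.loc[user][unrated].nlargest(n).index.tolist()

# hypothetical ratings (0 = not rated) and predictions with matching labels
toy_ratings = pd.DataFrame([[5, 0, 0],
                            [0, 3, 0]],
                           index=['u1', 'u2'], columns=['p1', 'p2', 'p3'])
toy_preds = pd.DataFrame(np.array([[4.8, 2.1, 3.9],
                                   [1.0, 3.2, 2.5]]),
                         index=toy_ratings.index, columns=toy_ratings.columns)

recommended = top_n_unrated(toy_ratings, toy_preds, 'u1')
print(recommended)
```

With the notebook's own objects, the labeled predictions could be built as `pd.DataFrame(nR, index=c_data.index, columns=c_data.columns)`, since `nR` and `c_data` have the same shape.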