## Matrix Factorization for a small subset
In this notebook, we build our first recommender system. It follows a **collaborative filtering approach** and considers only the readers and articles in a small subset of our data. The goal of this **matrix factorization technique** is to 'learn' two embedding matrices: one with a row per reader and one with a column per article, both sharing an arbitrarily chosen (and thus tunable) number of latent factors.

Thus, if we had 10 readers and 5 articles and assumed 3 latent factors (which could represent implicit but substantive differences in our reader/article base), our method would calculate two matrices (a 10 by 3 for the readers and a 3 by 5 for the articles) whose matrix product yields a new matrix the size of our original one (10 x 5) that *approximates* the original matrix as closely as possible. This optimization problem is typically solved by stochastic gradient descent (although there are, of course, other possibilities). From a once extremely sparse matrix (every single reader only reads/clicks a tiny fraction of the available articles), we get a densely populated table that indicates whether a reader might be more or less inclined to read certain articles.

The approach might sound a bit dry and mathematical at first, but the embeddings are lower-dimensional representations of our readers and articles, which let us determine *resemblances in preferences*. If you ever wondered how Amazon or Google knew what you were interested in before you even searched for it: here you go!
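
To make the 10 x 5 example concrete, here is a minimal sketch in NumPy. It is not the method used later in this notebook: the toy click matrix `R`, the learning rate and the number of steps are arbitrary assumptions, and for brevity it uses plain full-batch gradient descent rather than the stochastic variant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy click matrix (assumption): 10 readers x 5 articles, 1 = clicked.
R = (rng.random((10, 5)) < 0.3).astype(float)

k = 3                                    # number of latent factors
P = 0.1 * rng.standard_normal((10, k))   # reader embeddings (10 x 3)
Q = 0.1 * rng.standard_normal((k, 5))    # article embeddings (3 x 5)

def loss(R, P, Q):
    """Squared reconstruction error between R and P @ Q."""
    return np.sum((R - P @ Q) ** 2)

initial_loss = loss(R, P, Q)
lr = 0.01                                # learning rate (arbitrary choice)
for _ in range(2000):
    E = R - P @ Q                        # current reconstruction error
    P_grad = -2 * E @ Q.T                # gradient of the loss w.r.t. P
    Q_grad = -2 * P.T @ E                # gradient of the loss w.r.t. Q
    P -= lr * P_grad
    Q -= lr * Q_grad
final_loss = loss(R, P, Q)

R_hat = P @ Q                            # dense 10 x 5 approximation of R
```

The product `P @ Q` is dense even though `R` is mostly zeros, which is what allows unread articles to be ranked for each reader.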

In [1]:
import pandas as pd
import numpy as np

In [2]:
behaviors = pd.read_csv('../../data/mind_small_train/behaviors.tsv', sep="\t", header=None)
news = pd.read_csv('../../data/mind_small_train/news.tsv', sep="\t", header=None)

The news dataset stores information about all news articles (ID, category, subcategory, title, abstract, ...). It looks like this:

In [97]:
news.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


At first, we will only need to work with the behaviors dataset, which looks like this:

In [3]:
behaviors.head()

Unnamed: 0,0,1,2,3,4
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...
3,4,U34670,11/11/2019 5:28:05 AM,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0
4,5,U8125,11/12/2019 4:11:21 PM,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...


and needs some column-relabelling:

In [4]:
behaviors = behaviors.rename(columns={
    0: 'impression_id',
    1: 'user_id',
    2: 'time',
    3: 'history',
    4: 'labels',
})

Now we want to check if there are readers with multiple sessions:

In [63]:
behaviors.user_id.value_counts()

U32146    62
U15740    44
U20833    41
U44201    40
U51286    40
          ..
U13647     1
U39699     1
U58128     1
U60275     1
U93310     1
Name: user_id, Length: 50000, dtype: int64

In [108]:
len(behaviors.user_id.unique()), len(behaviors.user_id)

(50000, 156965)

Apparently, there are! For matrix factorization, we only want to work with the click history, so let's check whether the click histories of users with multiple sessions are identical across their sessions:

In [53]:
duplicate_users = behaviors.user_id.value_counts()

In [61]:
duplicate_users = duplicate_users[duplicate_users!=1].index.to_list()    # create list with the IDs of duplicate users

In [88]:
# ATTENTION: This cell needs some time to compute (~5min), so only uncomment if you have some spare time.
# Check whether the click histories of the duplicate users are the same. If not, save the user ID to diff_hist.

# diff_hist = []
# for user in duplicate_users:
#     l = behaviors[behaviors.user_id==user].history.to_list()
#     if len(set(l)) != 1:
#         diff_hist.append(user)

In [87]:
# len(diff_hist)

0
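
The loop above filters the full DataFrame once per user, which is why it is slow. A vectorized variant could count the distinct histories per user in one pass. The sketch below runs on a made-up toy frame (the IDs are assumptions); `dropna=False` is used so that a missing history counts as its own value.

```python
import pandas as pd

# Toy data (assumption): U2 has two different histories, U1 repeats the same one.
toy = pd.DataFrame({
    'user_id': ['U1', 'U1', 'U2', 'U2', 'U3'],
    'history': ['N1 N2', 'N1 N2', 'N3', 'N3 N4', 'N5'],
})

# Number of distinct history strings per user.
n_histories = toy.groupby('user_id')['history'].nunique(dropna=False)

# Users whose history differs between sessions.
diff_hist = n_histories[n_histories > 1].index.to_list()
```

Applied to `behaviors`, the same two lines should give the list that the loop builds, without the per-user filtering.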

All users with multiple sessions have identical history logs across their sessions. In contrast, the recommended articles and clicks differ from session to session:

In [98]:
behaviors[behaviors.user_id == 'U32594'].head()

Unnamed: 0,impression_id,user_id,time,history,labels
615,616,U32594,11/10/2019 4:38:09 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N54595-0 N23757-0 N23820-0 N18572-0 N41220-0 N...
2202,2203,U32594,11/14/2019 2:27:10 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N41612-0 N16148-0 N3031-0 N51954-0 N2021-0 N33...
4511,4512,U32594,11/14/2019 3:47:55 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N16419-0 N3167-0 N30071-0 N47721-0 N16148-0 N8...
5095,5096,U32594,11/9/2019 12:36:17 PM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N58051-0 N56396-0 N31372-0 N24272-0 N59852-0 N...
5747,5748,U32594,11/12/2019 3:05:21 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N31978-0 N49157-0 N21741-0 N50675-0 N14184-0 N...


In [99]:
user_U67455 = behaviors[behaviors.user_id == 'U67455']    # all sessions of one reader
x = user_U67455.history.iloc[1].split(' ')
len(x), len(set(x))

(278, 275)

It also looks like some readers clicked the same article multiple times. We treat these instances as redundancies here. Together with the repeated histories, they don't pose a problem for constructing our **original reader-article matrix**, which we will do next.

To reduce computing time, we restrict our dataset to the first 10,000 impressions for this task:

In [100]:
behav_part_1 = behaviors.iloc[:10000, :]

In [101]:
behav_part_1 = behav_part_1.dropna()
behav_part_1.shape

(9796, 5)

In [104]:
behav_part_1.head(1)

Unnamed: 0,impression_id,user_id,time,history,labels
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0


In [105]:
id_dict = pd.Series(behav_part_1.user_id.values,index=behav_part_1.impression_id).to_dict()

'U5787'

In [20]:
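# One row per (reader, article) pair: split each history string and stack the pieces into long format.
behaviors_part_1_set = (
    behav_part_1.set_index('user_id')
    .history.str.split(' ', expand=True)
    .stack()
    .reset_index(1, drop=True)
    .reset_index(name='article')
)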



In [78]:
user_U67455_set = behaviors_part_1_set[behaviors_part_1_set.user_id == 'U67455']

In [87]:
user_U67455_set.article.value_counts()

N59894    10
N6163     10
N13231    10
N36253     5
N54026     5
          ..
N14984     5
N28296     5
N59183     5
N25306     5
N5855      5
Name: article, Length: 275, dtype: int64

In [22]:
behaviors_part_1_set['zus'] = 1    # indicator: the article occurs in the reader's click history

In [132]:
behaviors_part_1_set

Unnamed: 0,user_id,article,zus
0,U13740,N55189,1
1,U13740,N42782,1
2,U13740,N34694,1
3,U13740,N45794,1
4,U13740,N18445,1
...,...,...,...
322773,U72585,N14742,1
322774,U72585,N51983,1
322775,U72585,N21189,1
322776,U72585,N46811,1


In [23]:
behaviors_part_1_pivot = behaviors_part_1_set.pivot_table(index='user_id', columns='article', values='zus').fillna(0)

In [29]:
behaviors_part_1_pivot.shape, len(behav_part_1.user_id.unique())

((8502, 20688), 8502)
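
The reshaping steps above can be hard to follow on the full data. The following toy sketch (made-up user and article IDs) goes from history strings to a binary reader-article matrix. It uses `explode` and `aggfunc='max'` as an alternative to the `split`/`stack` chain, which is an assumption of this example rather than what the cells above do.

```python
import pandas as pd

# Toy impressions (assumption): U1 appears twice with the same history,
# and article N2 is repeated inside that history.
toy = pd.DataFrame({
    'user_id': ['U1', 'U2', 'U1'],
    'history': ['N1 N2 N2', 'N2 N3', 'N1 N2 N2'],
})

# Long format: one row per (reader, article) pair, duplicates removed.
long = (
    toy.assign(article=toy.history.str.split(' '))
    .explode('article')[['user_id', 'article']]
    .drop_duplicates()
    .assign(zus=1)
)

# Wide format: readers as rows, articles as columns, 1 = in click history.
matrix = long.pivot_table(
    index='user_id', columns='article', values='zus',
    aggfunc='max', fill_value=0,
)
```

Repeated sessions and repeated clicks collapse to a single 1, which matches the decision to treat them as redundancies.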

In [26]:
behaviors_part_1_pivot.head()

article,N100,N1000,N10001,N10003,N10009,N1001,N10014,N10016,N10021,N10024,...,N9967,N9969,N997,N9973,N9974,N9977,N9978,N9984,N9992,N9993
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U10022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10062,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
import scipy as sp
from scipy.sparse.linalg import svds

In [31]:
# Center each reader's row on its mean before the decomposition.
b1 = behaviors_part_1_pivot.to_numpy(copy=True)
b1_mean = np.mean(b1, axis=1)
b1 -= b1_mean.reshape(-1,1)

In [32]:
U, sigma, Vt = svds(b1, k=20)

In [33]:
sigma = np.diag(sigma)


In [34]:
sigma.shape

(20, 20)

In [35]:
# Reconstruct the dense score matrix from the truncated SVD and add the per-reader means back.
recommendations_df = pd.DataFrame(np.dot(np.dot(U, sigma), Vt) + b1_mean.reshape(-1, 1))
recommendations_df.columns = behaviors_part_1_pivot.columns
recommendations_df['user_ids'] = behaviors_part_1_pivot.index
recommendations_df = recommendations_df.set_index('user_ids')
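
The same center, decompose and reconstruct pipeline can be traced on a toy matrix. This is only a sketch: the 6 x 5 matrix and `k=2` are arbitrary assumptions, chosen so that the shapes are easy to inspect.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Toy reader-article matrix (assumption): 6 readers x 5 articles.
R = (rng.random((6, 5)) < 0.4).astype(float)

# Center every reader's row on its own mean.
row_mean = R.mean(axis=1)
R_centered = R - row_mean.reshape(-1, 1)

# Truncated SVD with k latent factors (k must be smaller than min(R.shape)).
U, sigma, Vt = svds(R_centered, k=2)

# Rank-2 approximation of the centered matrix, then add the means back.
low_rank = U @ np.diag(sigma) @ Vt
R_hat = low_rank + row_mean.reshape(-1, 1)
```

`U` holds one 2-dimensional embedding per reader and `Vt` one per article, so each predicted score is the scaled dot product of a reader embedding and an article embedding plus that reader's mean.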

In [37]:
recommendations_df

article,user_ids,N100,N1000,N10001,N10003,N10009,N1001,N10014,N10016,N10021,...,N9967,N9969,N997,N9973,N9974,N9977,N9978,N9984,N9992,N9993
0,U10022,-0.000280,-0.000243,0.004196,0.000124,-0.000161,-0.000625,0.001704,0.007723,-0.000838,...,0.001413,-0.002237,-0.001649,0.003230,-0.002287,0.004048,0.003238,0.001986,-0.000648,0.000227
1,U10043,0.000978,0.001051,0.000647,0.001053,0.001231,0.001009,0.001018,0.000300,0.001030,...,0.001074,0.000777,0.001134,0.000905,0.001014,0.001096,0.001055,0.001161,0.000896,0.001211
2,U10045,0.001679,0.001302,0.004273,0.001828,0.001645,0.001876,0.003305,-0.001754,0.001566,...,0.003646,-0.001291,-0.000977,0.001018,0.000648,0.002572,0.000337,0.001198,0.001577,0.002121
3,U10059,0.000582,-0.000819,0.001246,-0.000453,-0.001358,-0.000569,-0.000668,0.002088,-0.001098,...,-0.000597,0.000823,-0.001739,0.001315,-0.000515,0.000465,0.000518,0.001193,0.001138,0.000827
4,U10062,-0.000813,-0.000512,0.001235,-0.002485,-0.002926,-0.003568,0.004105,-0.004009,-0.002954,...,-0.001081,0.000073,-0.006820,-0.003728,-0.006758,-0.005362,-0.003010,0.004399,-0.002810,-0.000508
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8497,U9965,-0.000199,-0.000449,-0.000042,-0.000206,-0.000382,0.000117,-0.000284,0.001205,-0.000047,...,-0.000072,0.000079,0.000357,-0.000794,-0.000149,-0.000279,-0.000371,0.000171,0.000823,0.000022
8498,U9969,0.000154,0.000264,0.000058,0.000174,0.000133,0.000222,0.000360,0.000657,0.000155,...,0.000164,0.000462,0.000299,0.000024,0.000054,-0.000170,0.000124,0.000125,0.000085,0.000277
8499,U9984,-0.000324,-0.000591,-0.000409,-0.000151,-0.000068,-0.000017,0.001003,0.000484,-0.000443,...,0.000943,-0.000597,-0.001698,-0.000144,0.000664,0.000291,-0.001581,0.001115,0.001792,-0.000385
8500,U999,0.000278,-0.000227,0.000463,0.000390,0.000119,0.001605,0.000272,-0.003089,0.000424,...,0.000443,-0.000559,0.001371,-0.002222,0.001039,-0.000430,0.000136,-0.001938,0.000508,0.000049


In [None]:
#recommendations_df = recommendations_df.reset_index()

In [38]:
if 'user_ids' in recommendations_df.columns:    # skip if the index was already set above
    recommendations_df = recommendations_df.set_index('user_ids')

In [40]:
recommendations_df

article,N100,N1000,N10001,N10003,N10009,N1001,N10014,N10016,N10021,N10024,...,N9967,N9969,N997,N9973,N9974,N9977,N9978,N9984,N9992,N9993
user_ids,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U10022,-0.000280,-0.000243,0.004196,0.000124,-0.000161,-0.000625,0.001704,0.007723,-0.000838,-0.002165,...,0.001413,-0.002237,-0.001649,0.003230,-0.002287,0.004048,0.003238,0.001986,-0.000648,0.000227
U10043,0.000978,0.001051,0.000647,0.001053,0.001231,0.001009,0.001018,0.000300,0.001030,0.001044,...,0.001074,0.000777,0.001134,0.000905,0.001014,0.001096,0.001055,0.001161,0.000896,0.001211
U10045,0.001679,0.001302,0.004273,0.001828,0.001645,0.001876,0.003305,-0.001754,0.001566,0.000592,...,0.003646,-0.001291,-0.000977,0.001018,0.000648,0.002572,0.000337,0.001198,0.001577,0.002121
U10059,0.000582,-0.000819,0.001246,-0.000453,-0.001358,-0.000569,-0.000668,0.002088,-0.001098,-0.000746,...,-0.000597,0.000823,-0.001739,0.001315,-0.000515,0.000465,0.000518,0.001193,0.001138,0.000827
U10062,-0.000813,-0.000512,0.001235,-0.002485,-0.002926,-0.003568,0.004105,-0.004009,-0.002954,0.001731,...,-0.001081,0.000073,-0.006820,-0.003728,-0.006758,-0.005362,-0.003010,0.004399,-0.002810,-0.000508
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U9965,-0.000199,-0.000449,-0.000042,-0.000206,-0.000382,0.000117,-0.000284,0.001205,-0.000047,0.001189,...,-0.000072,0.000079,0.000357,-0.000794,-0.000149,-0.000279,-0.000371,0.000171,0.000823,0.000022
U9969,0.000154,0.000264,0.000058,0.000174,0.000133,0.000222,0.000360,0.000657,0.000155,0.000630,...,0.000164,0.000462,0.000299,0.000024,0.000054,-0.000170,0.000124,0.000125,0.000085,0.000277
U9984,-0.000324,-0.000591,-0.000409,-0.000151,-0.000068,-0.000017,0.001003,0.000484,-0.000443,-0.000383,...,0.000943,-0.000597,-0.001698,-0.000144,0.000664,0.000291,-0.001581,0.001115,0.001792,-0.000385
U999,0.000278,-0.000227,0.000463,0.000390,0.000119,0.001605,0.000272,-0.003089,0.000424,0.002624,...,0.000443,-0.000559,0.001371,-0.002222,0.001039,-0.000430,0.000136,-0.001938,0.000508,0.000049


In [191]:
news.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [185]:
titles_dict = pd.Series(news[3].values,index=news[0]).to_dict()

In [183]:
def give_recommendations(user, n=5):
    """Return the n highest-scoring articles for a reader, in ascending order of score."""
    recos = recommendations_df.loc[user].sort_values().tail(n)
    return recos

In [184]:
give_recommendations('U91836')

article
N11101    0.306672
N6233     0.320884
N41375    0.329777
N37509    0.354515
N14761    0.456252
Name: U91836, dtype: float64
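
The scores above are ranked over all articles, including those the reader has already clicked, and they only show article IDs. A possible refinement, sketched here on toy data, drops already-read articles and attaches titles. The function name `recommend_unseen` and the toy frames `scores`, `clicked` and `titles` are hypothetical stand-ins for `recommendations_df`, `behaviors_part_1_pivot` and `titles_dict`.

```python
import pandas as pd

# Toy stand-ins (assumptions) for the score matrix, click matrix and title lookup.
scores = pd.DataFrame(
    [[0.9, 0.1, 0.4], [0.2, 0.8, 0.3]],
    index=pd.Index(['U1', 'U2'], name='user_ids'),
    columns=pd.Index(['N1', 'N2', 'N3'], name='article'),
)
clicked = pd.DataFrame(
    [[1, 0, 0], [0, 1, 0]], index=scores.index, columns=scores.columns
)
titles = {'N1': 'Title one', 'N2': 'Title two', 'N3': 'Title three'}

def recommend_unseen(user, n=5):
    """Return the n best-scoring articles the reader has not clicked yet."""
    user_scores = scores.loc[user]
    unseen = user_scores[clicked.loc[user] == 0]        # drop already-read articles
    top = unseen.sort_values(ascending=False).head(n)   # best score first
    return pd.DataFrame(
        {'title': [titles.get(a) for a in top.index], 'score': top.to_numpy()},
        index=top.index,
    )

top_u1 = recommend_unseen('U1', n=2)
```

Here the highest-scoring article for `U1` is `N1`, but since it is already in the click history, the result starts with `N3`.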

In [177]:
recommendations_df.loc['U91836']    # all predicted scores for one reader

article
N100      0.000032
N1000     0.000729
N10001   -0.001700
N10003   -0.000208
N10009    0.000684
            ...   
N9977     0.000857
N9978    -0.002033
N9984     0.001415
N9992    -0.000673
N9993     0.000366
Name: U91836, Length: 20688, dtype: float64

In [164]:
give_recommendations('U91836', n=10)

article
N12349    0.236248
N59704    0.239799
N27526    0.259654
N4607     0.269513
N11231    0.276449
N11101    0.306672
N6233     0.320884
N41375    0.329777
N37509    0.354515
N14761    0.456252
Name: U91836, dtype: float64

In [150]:
xy = give_recommendations('U91836')

In [151]:
xy

article
N11101    0.306672
N6233     0.320884
N41375    0.329777
N37509    0.354515
N14761    0.456252
Name: U91836, dtype: float64

In [75]:
recommendations_df.index.unique()

Index(['U13740', 'U91836', 'U73700', 'U34670', 'U8125', 'U19739', 'U8355',
       'U46596', 'U79199', 'U53231',
       ...
       'U2459', 'U31197', 'U22929', 'U43334', 'U39829', 'U56382', 'U8905',
       'U15501', 'U48077', 'U5787'],
      dtype='object', name='user_ids', length=8502)

In [5]:
behaviors_dev = pd.read_csv('../../data/mind_small_dev/behaviors.tsv', sep="\t", header=None)

In [11]:
# Readers that appear in both the dev and the train set vs. all readers in the dev set
len(set(behaviors_dev[1]) & set(behaviors.user_id)), len(set(behaviors_dev[1]))

(5943, 50000)

In [None]:
hist_set = set(behaviors_part_1_set['article'].to_list())

In [18]:
beh_num = behav_part_1.to_numpy()


In [19]:
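# For every impression (row i), collect the articles the reader actually clicked
# (label suffix "-1") that also occur in the click histories, i.e. are columns of the matrix.
user_dic = {}
for i in range(beh_num.shape[0]):
    tri = [s[:-2] for s in beh_num[i][4].split(' ') if s[-1] == '1']

    unity = set(tri) & hist_set
    if len(unity) > 0:
        user_dic[i] = list(unity)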
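
The label strings have the form `articleID-label`, with `1` for a click and `0` for no click. A small helper makes the parsing step explicit; the name `clicked_articles` is hypothetical, and the example strings are taken from the `behaviors.head()` output.

```python
def clicked_articles(labels):
    """Return the article IDs with click label 1 from a string like 'N1-1 N2-0'."""
    clicked = []
    for token in labels.split(' '):
        article, label = token.rsplit('-', 1)   # split at the last hyphen
        if label == '1':
            clicked.append(article)
    return clicked

example_1 = clicked_articles('N55689-1 N35729-0')
example_2 = clicked_articles('N35729-0 N33632-0 N49685-1 N27581-0')
```

Splitting at the hyphen instead of slicing off the last two characters keeps the helper correct even if a label ever had more than one digit.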

In [20]:
# Map each article ID to its column position in the reader-article matrix.
map_dict = {s: i for i, s in enumerate(behaviors_part_1_pivot.columns)}

In [23]:
map_dict['N10284'], user_dic[21]


(98, ['N47020'])

In [22]:
np.dot(np.dot(U[21, :], sigma), Vt[:, 13175])
np.dot(np.dot(U[24, :], sigma), Vt[:, 7831])

0.003881433463938959

In [25]:
results = []
for k, v in user_dic.items():
    for n in v:
        news_idx = map_dict[n]
        pred = np.dot(np.dot(U[k, :], sigma), Vt[:, news_idx])
        results.append(pred + b1_mean[k])
    

In [26]:
results.sort(reverse=True)


In [27]:
erg = pd.DataFrame(np.dot(np.dot(U, sigma), Vt) + b1_mean.reshape(-1, 1))

In [116]:
erg.columns = behaviors_part_1_pivot.columns

In [117]:
erg.iloc[24]['N10016']

-0.0006385966356098913

In [126]:
np.mean(erg.mean())

0.0015677812684249813

In [127]:
np.std(erg.mean())

0.0045430839861551634

In [119]:
user_dic
erg.iloc[24]['N47020']

0.039320806567492386

In [120]:
recos = []
for user, article in user_dic.items():
    recos.append(erg.iloc[user][article].to_list())

In [121]:
recos_2 =[]
for x in recos:
    for y in x:
        recos_2.append(y)
        

In [122]:
recos_2 = pd.Series(recos_2)

In [123]:
recos_2.describe()

count    1927.000000
mean        0.013329
std         0.045887
min        -0.152860
25%         0.000146
50%         0.001807
75%         0.007613
max         0.623872
dtype: float64

In [136]:
erg.mean()

hist
N100      0.000087
N1000     0.000250
N10001    0.000434
N10003    0.000064
N10009    0.000201
            ...   
N9977     0.000697
N9978     0.000306
N9984     0.000162
N9992     0.000216
N9993     0.000158
Length: 20688, dtype: float64

In [137]:
from sklearn.decomposition import NMF

In [138]:
beahviors_np = behaviors_part_1_pivot.to_numpy(copy=True)

In [140]:
beahviors_np.shape

(9796, 20688)

In [None]:
model = NMF(n_components=10, init='random', random_state=420)

In [142]:
W = model.fit_transform(beahviors_np)

In [143]:
H = model.components_

In [148]:
H.shape

(10, 20688)

In [147]:
W.shape

(9796, 10)

In [151]:
nmf_matrix = np.dot(W, H)
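
To see what a non-negative factorization does, the following sketch implements the classic multiplicative update rules for the squared error in plain NumPy. It is an illustration only and not the solver scikit-learn uses internally; the toy matrix `V`, `k=2` and the iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative click matrix (assumption): 8 readers x 6 articles.
V = (rng.random((8, 6)) < 0.4).astype(float)

k = 2                                  # number of components
W_toy = rng.random((8, k)) + 0.1       # reader factors, strictly positive start
H_toy = rng.random((k, 6)) + 0.1       # article factors, strictly positive start
eps = 1e-9                             # avoids division by zero

def frobenius_loss(V, W, H):
    """Squared Frobenius norm of the reconstruction error."""
    return np.linalg.norm(V - W @ H) ** 2

initial_loss = frobenius_loss(V, W_toy, H_toy)
for _ in range(200):
    # Multiplicative updates keep all entries non-negative.
    H_toy *= (W_toy.T @ V) / (W_toy.T @ W_toy @ H_toy + eps)
    W_toy *= (V @ H_toy.T) / (W_toy @ H_toy @ H_toy.T + eps)
final_loss = frobenius_loss(V, W_toy, H_toy)

V_hat = W_toy @ H_toy                  # dense, non-negative approximation
```

Unlike the SVD-based scores, every entry of the reconstruction is non-negative, which is consistent with the absence of negative values in the NMF output.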

In [154]:
nfm_matrix_df = pd.DataFrame(nmf_matrix)

In [155]:
nfm_matrix_df.columns = behaviors_part_1_pivot.columns

In [156]:
nfm_matrix_df

hist,N100,N1000,N10001,N10003,N10009,N1001,N10014,N10016,N10021,N10024,...,N9967,N9969,N997,N9973,N9974,N9977,N9978,N9984,N9992,N9993
0,0.000118,0.000098,0.000526,0.000000,0.000005,0.000028,0.000545,0.000000,0.000000,0.000058,...,6.513567e-05,0.000567,0.000004,0.000938,0.000056,0.000208,0.000281,3.205593e-04,0.000387,0.000119
1,0.000076,0.000798,0.000108,0.000000,0.000083,0.000078,0.000773,0.006497,0.000067,0.000185,...,1.342857e-05,0.001280,0.000032,0.000256,0.000095,0.001115,0.000016,4.180561e-04,0.000263,0.000120
2,0.000000,0.000013,0.001416,0.000004,0.000118,0.000021,0.000039,0.001274,0.000012,0.000090,...,5.154157e-06,0.000231,0.000166,0.000040,0.000018,0.000991,0.000232,1.948240e-05,0.000032,0.000032
3,0.000000,0.002986,0.000631,0.000000,0.000918,0.000033,0.000267,0.000787,0.000725,0.000924,...,7.811169e-05,0.000000,0.000000,0.004338,0.000000,0.000000,0.000000,1.789488e-04,0.000000,0.000026
4,0.000007,0.000054,0.000099,0.000000,0.000012,0.000007,0.000015,0.000654,0.000005,0.000006,...,2.292389e-07,0.000037,0.000003,0.000008,0.000005,0.000137,0.000017,9.552872e-06,0.000005,0.000012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9791,0.000114,0.000010,0.000054,0.000000,0.000001,0.000008,0.000023,0.000006,0.000002,0.000000,...,6.694931e-06,0.000000,0.000006,0.000905,0.000013,0.000010,0.000272,2.675551e-05,0.000261,0.000103
9792,0.000004,0.000130,0.000184,0.000000,0.000052,0.000000,0.000000,0.000299,0.000030,0.000044,...,0.000000e+00,0.000021,0.000000,0.000199,0.000000,0.000130,0.000032,4.524199e-07,0.000005,0.000005
9793,0.000042,0.000358,0.000795,0.000009,0.000093,0.000038,0.000079,0.004253,0.000042,0.000113,...,2.701838e-05,0.000142,0.000381,0.000130,0.000063,0.000832,0.000178,6.162307e-05,0.000054,0.000021
9794,0.000118,0.000039,0.000207,0.000000,0.000000,0.000011,0.000088,0.000000,0.000000,0.000000,...,2.562749e-05,0.000000,0.000003,0.000937,0.000009,0.000000,0.000281,6.644106e-05,0.000271,0.000100


In [166]:
recos_nfm = []
for user, article in user_dic.items():
    recos_nfm.append(nfm_matrix_df.iloc[user][article].to_list())
    
recos_nfm_2 =[]
for x in recos_nfm:
    for y in x:
        recos_nfm_2.append(y)
        