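## AutoRec
- Introduction
    - Core idea: model the data with an autoencoder. By learning to reconstruct each rating vector through a narrow hidden layer, the model captures the essential structure of the data and can therefore predict the missing entries of a vector. It follows the collaborative filtering approach and models the item vectors or user vectors of the co-occurrence (rating) matrix.
    - It comes in user-based and item-based variants, depending on which vectors are used as input.
    - One of the earliest applications of deep learning to recommender systems.
    - A three-layer neural network (input, hidden, output) in which the number of hidden units k is much smaller than the input dimension.
- Advantages
    - Simple and easy to train.
- Disadvantages
    - The structure is very simple, so its generalization ability is limited.
- Use cases
- Follow-up methods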
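
The three-layer structure described above can be written out explicitly. The notation below follows the AutoRec paper rather than this notebook, so every symbol is an assumption introduced here: $\mathbf{r} \in \mathbb{R}^{d}$ is one partially observed rating vector, $\mathbf{V} \in \mathbb{R}^{k \times d}$ and $\mathbf{W} \in \mathbb{R}^{d \times k}$ are the encoder and decoder weights, $\boldsymbol{\mu} \in \mathbb{R}^{k}$ and $\mathbf{b} \in \mathbb{R}^{d}$ are biases, and $f$, $g$ are activation functions.

$$
h(\mathbf{r}; \theta) = f\big(\mathbf{W}\, g(\mathbf{V}\mathbf{r} + \boldsymbol{\mu}) + \mathbf{b}\big)
$$

$$
\min_{\theta} \sum_{i} \big\lVert \mathbf{r}^{(i)} - h(\mathbf{r}^{(i)}; \theta) \big\rVert_{\mathcal{O}}^{2} + \frac{\lambda}{2}\big(\lVert \mathbf{W} \rVert_{F}^{2} + \lVert \mathbf{V} \rVert_{F}^{2}\big)
$$

Here $\lVert \cdot \rVert_{\mathcal{O}}$ means that only the observed ratings contribute to the error, and $\lambda$ is the regularization strength. The prediction for a missing entry is simply the corresponding component of $h(\mathbf{r}; \theta)$.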

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import os
import shutil
%matplotlib inline

In [2]:
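# Load the data
data_dir = 'C:/Users/aband/OneDrive/桌面/github-daily/daily-ds/papers/ml-1m/ml-1m'

# The files use the multi-character separator '::', which only the Python
# parser engine supports; passing engine='python' avoids a ParserWarning.
users = pd.read_csv(data_dir+'/users.dat', delimiter='::', engine='python', names=['user_id', 'gender', 'age', 'occupation', 'zip_code'])
# movies.dat is Latin-1 encoded; without this, recent pandas versions may raise UnicodeDecodeError.
movies = pd.read_csv(data_dir+'/movies.dat', delimiter='::', engine='python', encoding='latin-1', names=['movie_id', 'title', 'genres'])
ratings = pd.read_csv(data_dir+'/ratings.dat', delimiter='::', engine='python', names=['user_id', 'movie_id', 'rating', 'timestamp'])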

In [3]:
users.head()

| | user_id | gender | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


In [4]:
movies.head()

| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [5]:
ratings.head()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


To implement AutoRec we only need the co-occurrence (user-item rating) matrix built from `user_id`, `movie_id`, and `rating`.

In [10]:
user_item_matrix = ratings.pivot_table(values='rating', index='user_id', columns='movie_id')
user_item_matrix

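| user_id (rows) / movie_id (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6036 | NaN | NaN | NaN | 2.0 | NaN | 3.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6037 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6038 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6039 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |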


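The matrix contains many missing values (NaN), so it cannot be fed to the model as is. Here each missing rating is replaced with that movie's mean rating; a fixed default value is another option.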

In [14]:
# Fill each movie's missing ratings with that movie's mean observed rating.
# DataFrame.mean() skips NaN by default, so this equals sum() / count() per column.
user_item_matrix = user_item_matrix.fillna(user_item_matrix.mean())
user_item_matrix

| user_id (rows) / movie_id (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.000000 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 2 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 3 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 4 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 5 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 2.000000 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6036 | 4.146846 | 3.201141 | 3.016736 | 2.000000 | 3.006757 | 3.000000 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 6037 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 6038 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |
| 6039 | 4.146846 | 3.201141 | 3.016736 | 2.729412 | 3.006757 | 3.878723 | 3.41048 | 3.014706 | 2.656863 | 3.540541 | ... | 3.052083 | 2.111111 | 1.488372 | 2.26 | 3.472727 | 3.635731 | 4.115132 | 3.666667 | 3.9 | 3.780928 |


The data is ready, so we can build the model.

In [17]:
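## Build the model with the Keras functional API.
## Note: each training sample is one movie's rating vector over all users
## (length num_users), so this is the item-based variant (I-AutoRec),
## even though the variable is named user_auto_rec.

k = 128  # number of hidden units, k << num_users
num_users = user_item_matrix.shape[0]

inputs = tf.keras.Input(shape=(num_users,))
x = tf.keras.layers.Dense(units=k, activation='relu')(inputs)  # encoder
outputs = tf.keras.layers.Dense(num_users)(x)  # decoder with linear output

user_auto_rec = tf.keras.Model(inputs, outputs)
user_auto_rec.summary()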

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 6040)]            0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               773248    
_________________________________________________________________
dense_2 (Dense)              (None, 6040)              779160    
Total params: 1,552,408
Trainable params: 1,552,408
Non-trainable params: 0
_________________________________________________________________
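
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic; `dense_params` is a helper introduced here for illustration and is not part of Keras.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


num_users = 6040  # input and output dimension reported in the summary
k = 128           # hidden units

encoder = dense_params(num_users, k)   # dense_1 in the summary
decoder = dense_params(k, num_users)   # dense_2 in the summary
total = encoder + decoder

print(encoder, decoder, total)
```

The three values should match the `Param #` column and the `Total params` line of the summary.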


In [20]:
user_auto_rec.compile(optimizer='adam', loss=['mse'], metrics=['mse', 'mae'])

In [22]:
batch_size = 32
epochs = 10

user_auto_rec.fit(user_item_matrix.T, user_item_matrix.T, batch_size=batch_size, epochs=epochs)

Train on 3706 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x238c911ce08>
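
Note that the `fit` call above uses plain MSE on the mean-imputed matrix, so the imputed cells count toward the loss as if they were real ratings. AutoRec as originally formulated measures the reconstruction error on observed ratings only. Below is a minimal NumPy sketch of such a masked loss, not the notebook's training code; `masked_mse` and the toy arrays are hypothetical names and data introduced for illustration.

```python
import numpy as np


def masked_mse(y_true: np.ndarray, y_pred: np.ndarray, observed: np.ndarray) -> float:
    """Mean squared error computed over observed entries only."""
    observed = observed.astype(bool)
    n_observed = int(observed.sum())
    if n_observed == 0:
        return 0.0  # nothing observed, so there is no error to measure
    squared_error = (y_true - y_pred) ** 2
    return float(squared_error[observed].sum() / n_observed)


# Toy rating matrix: NaN marks a missing rating.
ratings = np.array([[5.0, np.nan, 3.0],
                    [np.nan, 2.0, np.nan]])
observed = ~np.isnan(ratings)              # True where a rating exists
filled = np.where(observed, ratings, 0.0)  # placeholder values are ignored by the mask

pred = np.array([[4.0, 1.0, 3.0],
                 [2.0, 2.0, 5.0]])

loss = masked_mse(filled, pred, observed)
print(loss)
```

With a mask like this, the value used to fill missing cells no longer influences the loss, which is one reason to record the mask before imputing.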

Application: recommendation

In [26]:
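# Column indexing selects a movie, not a user: this is the rating vector
# of movie_id=1 over all 6040 users.
movie_id = 1

user_item_matrix[movie_id]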

user_id
1       5.000000
2       4.146846
3       4.146846
4       4.146846
5       4.146846
          ...   
6036    4.146846
6037    4.146846
6038    4.146846
6039    4.146846
6040    3.000000
Name: 1, Length: 6040, dtype: float64

In [40]:
user_item_matrix.T.to_numpy()

array([[5.        , 4.14684641, 4.14684641, ..., 4.14684641, 4.14684641,
        3.        ],
       [3.20114123, 3.20114123, 3.20114123, ..., 3.20114123, 3.20114123,
        3.20114123],
       [3.0167364 , 3.0167364 , 3.0167364 , ..., 3.0167364 , 3.0167364 ,
        3.0167364 ],
       ...,
       [3.66666667, 3.66666667, 3.66666667, ..., 3.66666667, 3.66666667,
        3.66666667],
       [3.9       , 3.9       , 3.9       , ..., 3.9       , 3.9       ,
        3.9       ],
       [3.78092784, 3.78092784, 3.78092784, ..., 3.78092784, 3.78092784,
        3.78092784]])

In [43]:
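## Predict for the first sample. The first row of the transposed matrix is movie_id=1,
## so the output holds the reconstructed ratings of all users for that movie.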

pred = user_auto_rec.predict(user_item_matrix.T.to_numpy()[:1, :])

In [44]:
pred

array([[4.2134256, 4.075639 , 4.079956 , ..., 4.1582193, 4.044732 ,
        4.131224 ]], dtype=float32)
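
To turn reconstructed ratings into recommendations, one option is to predict every row, then for a given user rank the movies that user has not rated yet by predicted rating. The sketch below shows this on toy arrays with NumPy only; `recommend_top_n`, `pred`, and `observed` are hypothetical names, and the layout (one row per movie, one column per user) is assumed to match `user_item_matrix.T`.

```python
import numpy as np


def recommend_top_n(pred: np.ndarray, observed: np.ndarray, user_idx: int, n: int = 2) -> list:
    """Return row indices of the n unrated items with the highest predicted rating.

    pred:     (num_items, num_users) reconstructed ratings, one row per item
    observed: boolean mask of the same shape, True where the user already rated the item
    """
    scores = pred[:, user_idx].astype(float).copy()
    already_rated = observed[:, user_idx]
    scores[already_rated] = -np.inf            # never recommend items the user has rated
    n = min(n, int((~already_rated).sum()))    # cannot return more than the unrated items
    top = np.argsort(-scores, kind="stable")[:n]
    return top.tolist()


# Toy example: 4 items, 2 users.
pred = np.array([[4.2, 3.0],
                 [3.9, 4.8],
                 [4.7, 2.1],
                 [1.5, 4.9]])
observed = np.array([[True, False],
                     [False, False],
                     [False, True],
                     [False, False]])

print(recommend_top_n(pred, observed, user_idx=0))
print(recommend_top_n(pred, observed, user_idx=1))
```

The returned values are row positions; in the notebook they would have to be mapped back to movie IDs through `user_item_matrix.columns`, and the mask would have to be taken before the missing values are filled in.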