## VAE

This approach infers the unknown (?) entries by encoding and decoding each user's movie-watching history.

input = rows: users, columns: movies, values: whether the user watched the movie

output = for each movie a user has not watched, the inferred probability of watching it; movies with a high probability are recommended

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings

warnings.filterwarnings(action='ignore')
%matplotlib inline

In [3]:
movie_paths = '/content/drive/MyDrive/추천/data/movies/'

movie = pd.read_csv(movie_paths + "ratings.csv")
meta = pd.read_csv(movie_paths + 'movies_metadata.csv', low_memory=False)
meta = meta.rename(columns={'id':'movieId'})

In [4]:
movie.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [5]:
meta.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,movieId,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [6]:
movie['movieId'] = movie['movieId'].astype(str)
meta['movieId'] = meta['movieId'].astype(str)

movie = pd.merge(movie, meta[['movieId', 'original_title']], on='movieId')
movie['one'] = 1
movie.head()

Unnamed: 0,userId,movieId,rating,timestamp,original_title,one
0,1,1371,2.5,1260759135,Rocky III,1
1,4,1371,4.0,949810302,Rocky III,1
2,7,1371,3.0,851869160,Rocky III,1
3,19,1371,4.0,855193404,Rocky III,1
4,21,1371,3.0,853852263,Rocky III,1


In [7]:
df = movie.pivot_table(index='userId', columns = 'original_title', values = 'one').fillna(0)
df.head()

original_title,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...Più forte ragazzi!,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 + 1,12 Angry Men,13 Tzameti,1408,15 Minutes,16 Blocks,1724 기방난동사건,1942: A Love Story,1984,2 Days in Paris,"20,000 Leagues Under the Sea",2001: A Space Odyssey,2010,2046,2061 - Un anno eccezionale,21 Grams,24 Hour Party People,25th Hour,28 Days Later,28 Weeks Later,29th Street,2:37,3 Ninjas: High Noon at Mega Mountain,3 hommes et un couffin,300,33 sceny z życia,36 Fillette,"4 luni, 3 săptămîni și 2 zile",40 Days and 40 Nights,42nd Street,48 Hrs.,...,真夜中の弥次さん喜多さん,続宮本武蔵　一乗寺の決闘,綠草地,纵横四海,羅生門,英雄,菊次郎の夏,薔薇の葬列,蜘蛛巣城,裸の島,誰も知らない,豚と軍艦,赤ひげ,野火,鎗王之王,隠し砦の三悪人,風の谷のナウシカ,香港製造,鬼婆,鷹爪鐵布衫,거룩한 계보,고지전,공동경비구역 JSA,괴물,김복남 살인사건의 전말,다세포 소녀,도쿄!,밀양,봄 여름 가을 겨울 그리고 봄,빈집,사마리아,야수,연가시,오직 그대만,올드보이,"장화, 홍련",최종병기 활,친절한 금자씨,해안선,활
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
watching_metrix = df.iloc[:, : ].values

In [9]:
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F

In [10]:
# Environment setup: use the GPU if available, otherwise the CPU
if torch.cuda.is_available():
  DEVICE = torch.device('cuda')
else:
  DEVICE = torch.device('cpu')
print(DEVICE)

cpu


In [11]:
# Training settings
BATCH_SIZE = 64
EPOCHS = 100

In [12]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class CustomDataset(Dataset):

  # Store the data (labels are optional)
  def __init__(self, x_data, y_data = None):
    self.x_data = x_data
    self.y_data = y_data

  # Total number of samples in this dataset
  def __len__(self):
    return len(self.x_data)

  # Return the sample that corresponds to the given idx
  def __getitem__(self, idx):
    if self.y_data is None:
      x = torch.FloatTensor(self.x_data[idx])
      return x
    else:
      x = torch.FloatTensor(self.x_data[idx])
      y = torch.FloatTensor(self.y_data[idx])
      return x, y

In [13]:
# VAE encoder: maps a user's watch vector to the mean and log-variance of the latent distribution
class Encoder(nn.Module):
    def __init__(self):
        super(Encoder,self).__init__()
        # 2798 must equal the number of movie columns in df
        self.fc1_1 = nn.Linear(2798, 256)  # head for mu
        self.fc1_2 = nn.Linear(2798, 256)  # head for log_var
        self.relu = nn.ReLU()

    def encode(self, x):
        # Note: the ReLU restricts both mu and log_var to non-negative values
        mu = self.relu(self.fc1_1(x))
        log_var = self.relu(self.fc1_2(x))

        return mu,log_var

    def reparametrize(self, mu, log_var):
        # z = mu + std * eps, with eps drawn from a standard normal
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)

        return mu + eps * std

    def forward(self,x):
        mu, log_var = self.encode(x)
        reparam = self.reparametrize(mu,log_var)

        return mu,log_var,reparam
        
encoder = Encoder().to(DEVICE)
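
The `reparametrize` method above is the usual reparameterization trick. Written out, with $\mu$ and $\log\sigma^2$ standing for the `mu` and `log_var` outputs of the encoder (the symbols $z$, $\sigma$, and $\epsilon$ are notation introduced here, not names from the code):

```latex
\sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right), \qquad
\epsilon \sim \mathcal{N}(0, I), \qquad
z = \mu + \sigma \odot \epsilon
```

Because the randomness is isolated in $\epsilon$, $z$ stays differentiable with respect to $\mu$ and $\log\sigma^2$, which is what allows `loss.backward()` to update the encoder weights.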

In [14]:
# VAE decoder: maps a latent vector back to per-movie scores in (0, 1)
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder,self).__init__()
        self.fc1 = nn.Linear(256, 2798)
        self.sigmoid = nn.Sigmoid()

    def forward(self,x):
        out = self.fc1(x)
        out = self.sigmoid(out)

        return out
        
decoder = Decoder().to(DEVICE)

In [15]:
# reduction='sum' replaces the deprecated size_average=False (same behavior)
reconstruction_function = nn.MSELoss(reduction='sum')

def loss_function(recon_x, x, mu, log_var):
    # Reconstruction term: summed squared error over the batch
    MSE = reconstruction_function(recon_x, x)

    # see Appendix B from VAE paper:
    # Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014
    # https://arxiv.org/abs/1312.6114
    # KLD = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return MSE + KLD

parameters = list(encoder.parameters())+ list(decoder.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.0005)
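
As a reading aid, the quantity returned by `loss_function` can be written out as follows. Here $x$ is the input watch vector, $\hat{x}$ is the decoder output, $i$ indexes movies, and $j$ indexes the 256 latent dimensions; both sums also run over every row in the batch, since both terms are summed rather than averaged. The symbol $\mathcal{L}$ is notation introduced here.

```latex
\mathcal{L}
= \underbrace{\sum_{i} \left(\hat{x}_i - x_i\right)^{2}}_{\text{MSE}}
\;+\;
\underbrace{-\tfrac{1}{2} \sum_{j} \left(1 + \log\sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\right)}_{\text{KLD}}
```

The second term is the closed-form KL divergence between the encoder's Gaussian and a standard normal prior, as given in Appendix B of the paper cited in the code comment.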

In [16]:
def train(encoder, decoder, train_loader):
  encoder.train()
  decoder.train()
  train_loss = 0

  for feature in train_loader:

    feature = feature.to(DEVICE)
    optimizer.zero_grad()
    mu,log_var,reparam = encoder(feature)
    output = decoder(reparam)
    loss = loss_function(output, feature, mu, log_var)
    loss.backward()
    optimizer.step()
    
    train_loss += loss.item()

  train_loss /= len(train_loader)
  return train_loss

In [17]:
def evaluate(encoder, decoder, train_loader):
  encoder.eval()
  decoder.eval()
  result = []

  with torch.no_grad():

    for feature in train_loader:
      feature = feature.to(DEVICE)
      mu,log_var,reparam = encoder(feature)
      output = decoder(reparam)
      result.append(output.cpu().numpy())

  result = np.concatenate(result)
  return result

In [18]:
train_dataset = CustomDataset(watching_metrix)

train_loader = DataLoader(
  train_dataset,
  batch_size = BATCH_SIZE,
  shuffle = False,
  drop_last = False)

for epoch in range(1, EPOCHS + 1):
  train_loss = train(encoder, decoder, train_loader)
  print(f"\n[EPOCH: {epoch}], \tTrain Loss: {train_loss:.4f}")


[EPOCH: 1], 	Train Loss: 45027.5561

[EPOCH: 2], 	Train Loss: 40257.5884

[EPOCH: 3], 	Train Loss: 31879.4373

[EPOCH: 4], 	Train Loss: 24683.6907

[EPOCH: 5], 	Train Loss: 20087.1842

[EPOCH: 6], 	Train Loss: 16697.0109

[EPOCH: 7], 	Train Loss: 14055.5691

[EPOCH: 8], 	Train Loss: 12502.8780

[EPOCH: 9], 	Train Loss: 11236.0267

[EPOCH: 10], 	Train Loss: 10302.6802

[EPOCH: 11], 	Train Loss: 9511.3062

[EPOCH: 12], 	Train Loss: 8759.5720

[EPOCH: 13], 	Train Loss: 8294.9730

[EPOCH: 14], 	Train Loss: 8031.0213

[EPOCH: 15], 	Train Loss: 7551.4337

[EPOCH: 16], 	Train Loss: 7264.2307

[EPOCH: 17], 	Train Loss: 6926.6519

[EPOCH: 18], 	Train Loss: 6603.3744

[EPOCH: 19], 	Train Loss: 6597.1700

[EPOCH: 20], 	Train Loss: 6315.8139

[EPOCH: 21], 	Train Loss: 6321.0552

[EPOCH: 22], 	Train Loss: 6187.4170

[EPOCH: 23], 	Train Loss: 5879.2591

[EPOCH: 24], 	Train Loss: 5712.3802

[EPOCH: 25], 	Train Loss: 5614.2492

[EPOCH: 26], 	Train Loss: 5646.0684

[EPOCH: 27], 	Train Loss: 5649.5819


In [19]:
result = evaluate(encoder, decoder, train_loader)
result

array([[0.04419555, 0.02060422, 0.04253022, ..., 0.06613365, 0.08281999,
        0.03379852],
       [0.04620316, 0.04552369, 0.04119938, ..., 0.03575445, 0.05829491,
        0.04189261],
       [0.00388523, 0.00343378, 0.00321063, ..., 0.00283079, 0.00624221,
        0.00401381],
       ...,
       [0.00391988, 0.00132025, 0.00190714, ..., 0.00131614, 0.00698443,
        0.00292291],
       [0.00223107, 0.00423682, 0.00342549, ..., 0.00306904, 0.00545732,
        0.00377119],
       [0.02060548, 0.0030377 , 0.00636792, ..., 0.00295551, 0.01683836,
        0.0147502 ]], dtype=float32)

In [20]:
watching_metrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [21]:
movie_li = df.columns.tolist()
result[watching_metrix >= 1] = -1

In [22]:
recommend_li = []
for i in result.argmax(axis=1):
  recommend_li.append(movie_li[i])
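
The loop above keeps only the single highest-scoring title per user. If several recommendations per user are wanted, a top-k variant could look like the sketch below. It is an illustration rather than part of the original notebook: `top_k_titles` is a hypothetical helper, and the toy arrays stand in for `result` (with watched entries already set to -1) and `movie_li`.

```python
import numpy as np

def top_k_titles(scores, titles, k=3):
    """Return, for each row of `scores`, the k titles with the highest score."""
    scores = np.asarray(scores)
    # argsort is ascending, so negate the scores to rank from highest to lowest
    order = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return [[titles[j] for j in row] for row in order]

# Toy stand-ins for `result` (watched entries masked with -1) and `movie_li`
toy_scores = np.array([
    [0.10, -1.00, 0.70, 0.30],
    [-1.00, 0.20, 0.05, 0.90],
])
toy_titles = ["A", "B", "C", "D"]

top3 = top_k_titles(toy_scores, toy_titles, k=3)
print(top3)
```

With `k=1` this reduces to the same choice as `result.argmax(axis=1)`, so it can be swapped in without changing the single-title recommendations.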

In [23]:
recommend_df = pd.DataFrame()
recommend_df['user_id'] = df.index.tolist()
recommend_df['movie'] = recommend_li
recommend_df.head(10)

Unnamed: 0,user_id,movie
0,1,Rain Man
1,2,Rain Man
2,3,The 39 Steps
3,4,Men in Black II
4,5,The 39 Steps
5,6,The Million Dollar Hotel
6,7,Terminator 3: Rise of the Machines
7,8,Confession of a Child of the Century
8,9,The 39 Steps
9,10,Terminator 3: Rise of the Machines


We can see that user_id 7 was recommended Terminator 3: Rise of the Machines.

To judge whether this recommendation is sensible, we need to check whether user_id 7 has actually watched a Terminator-related movie.

In [24]:
for i in df.columns.tolist():
  if 'Terminator' in i:
    print(i)

Terminator 2: Judgment Day
Terminator 3: Rise of the Machines
Terminator Salvation
The Terminator
The Terminators


The movie list contains five Terminator-related titles.

In [25]:
for i in df.iloc[6][df.iloc[6] >= 1].index.tolist():
  if 'Terminator' in i:
    print(i)

Terminator Salvation


Because user_id 7 had watched Terminator Salvation, recommending Terminator 3: Rise of the Machines is reasonably well justified.

However, this does not mean the model itself performs well. The model does not take into account how user_id 7 rated Terminator Salvation, and it appears to recommend Terminator-related movies to many users, which may simply be because many people have watched them. Even so, it is meaningful that the model recommended a Terminator-related movie to a user who had watched one.
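
One way to probe the popularity concern raised above is to compare each user's top recommendation with a pure popularity baseline, that is, the most-watched movie the user has not yet seen. The sketch below is an illustration under stated assumptions, not part of the original notebook: `popularity_overlap` is a hypothetical helper, and the toy arrays stand in for `result` (after masking) and `watching_metrix`.

```python
import numpy as np

def popularity_overlap(scores, watched):
    """Fraction of users whose top-1 model pick equals the most-watched unseen movie."""
    scores = np.asarray(scores, dtype=float)
    watched = np.asarray(watched, dtype=float)

    # Popularity baseline: number of users who watched each movie
    popularity = watched.sum(axis=0)
    # Broadcast popularity to every user, then mask movies already watched
    baseline = np.where(watched >= 1, -1.0, popularity)

    model_pick = scores.argmax(axis=1)
    baseline_pick = baseline.argmax(axis=1)
    return float((model_pick == baseline_pick).mean())

# Toy stand-in for `watching_metrix`: 4 users, 3 movies
toy_watched = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
])
# Toy stand-in for `result`, with watched entries already set to -1
toy_scores = np.array([
    [-1.0, 0.8, 0.1],
    [-1.0, -1.0, 0.4],
    [0.2, -1.0, 0.9],
    [-1.0, 0.6, -1.0],
])

overlap = popularity_overlap(toy_scores, toy_watched)
print(overlap)
```

An overlap close to 1 would suggest the model mostly reproduces popularity, while a lower value would hint at some personalization; on its own, though, it would not show that the recommendations are good.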