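# In this notebook: computing movie similarity
- Data preparation for a movie recommender system
- Similarity is computed from the "tag" information included in the MovieLens dataset
- There are 1,128 tags in total; genome-scores.csv stores a relevance score for each movie-tag pair
- The similarity measure is cosine similarity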

#### import block

In [27]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [52]:
from sklearn.preprocessing import normalize as Normalize

In [4]:
base_path = '/home/ssum/바탕화면/experiments/ml-25m/'
file_list = os.listdir(base_path)
file_list

['README.txt',
 'movies.csv',
 'genome-scores.csv',
 'genome-tags.csv',
 'links.csv',
 'tags.csv',
 'ratings.csv']

In [14]:
df_genome_tags = pd.read_csv(os.path.join(base_path, "genome-tags.csv"))
df_genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [15]:
df_genome_scores = pd.read_csv(os.path.join(base_path, "genome-scores.csv"))
df_genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


### Computing similarity
- Normalize each movie's tag-relevance vector to unit L2 length
- Compute the movieId-by-movieId cosine similarity matrix
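
The two steps above can be tried on a tiny example before running them on the full dataset. This is a minimal sketch using NumPy only; the 3x4 `scores` matrix and its values are made up for illustration and are not taken from genome-scores.csv.

```python
import numpy as np

# Toy relevance matrix: 3 hypothetical movies (rows) x 4 hypothetical tags (columns)
scores = np.array([
    [0.9, 0.1, 0.0, 0.3],
    [0.8, 0.2, 0.1, 0.4],
    [0.0, 0.7, 0.9, 0.1],
])

# Row-wise L2 norms, kept as a column so the division broadcasts per row
row_norms = np.linalg.norm(scores, axis=1, keepdims=True)
unit = scores / row_norms

# Dot products between unit-length rows give the pairwise cosine similarities
sim = unit @ unit.T

# Cross-check one pair against the definition a.b / (|a| * |b|)
a, b = scores[0], scores[1]
sim_01 = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

The diagonal of `sim` is 1 because every row is compared with itself, and `sim[0, 1]` agrees with `sim_01` computed from the raw, unnormalized rows.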

In [16]:
df_genome_scores = df_genome_scores.pivot_table(values='relevance', index='movieId', columns='tagId')

In [53]:
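# Normalize returns a plain ndarray, so the new DataFrame is indexed by position (0..N-1), not by movieId
df_genome_scores = pd.DataFrame(Normalize(df_genome_scores, norm='l2', axis=1))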

In [59]:
np.linalg.norm(df_genome_scores, axis=1)

array([1., 1., 1., ..., 1., 1., 1.])
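
Because the normalized array is wrapped in a fresh DataFrame, the rows in this notebook end up labeled 0..N-1 rather than by movieId. If the movieId labels are needed later, one option is to normalize with pandas itself. The sketch below swaps scikit-learn's `normalize` for `DataFrame.div`; `long_scores` is a made-up frame that only mimics the column layout of genome-scores.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format scores with the same columns as genome-scores.csv
long_scores = pd.DataFrame({
    "movieId":   [1, 1, 1, 3, 3, 3, 7, 7, 7],
    "tagId":     [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "relevance": [0.9, 0.1, 0.2, 0.8, 0.3, 0.1, 0.1, 0.9, 0.7],
})

# Long -> wide: one row per movieId, one column per tagId
wide = long_scores.pivot_table(values="relevance", index="movieId", columns="tagId")

# Divide each row by its L2 norm; DataFrame.div keeps the movieId index
row_norms = pd.Series(np.linalg.norm(wide.to_numpy(), axis=1), index=wide.index)
unit = wide.div(row_norms, axis=0)

# Label both axes of the similarity matrix with movieId
values = unit.to_numpy()
sim = pd.DataFrame(values @ values.T, index=unit.index, columns=unit.index)
```

With this layout, `sim.loc[1, 3]` looks up the similarity between movieId 1 and movieId 3 directly, without a separate position-to-id mapping.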

In [62]:
cos_similarity = pd.DataFrame(np.dot(df_genome_scores, df_genome_scores.T))


In [66]:
cos_similarity.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13806,13807,13808,13809,13810,13811,13812,13813,13814,13815
0,1.0,0.776151,0.690177,0.645052,0.649132,0.621668,0.652549,0.690146,0.530572,0.638689,...,0.710299,0.751012,0.625985,0.748424,0.696916,0.743547,0.673634,0.715308,0.723754,0.69118
1,0.776151,1.0,0.679791,0.644311,0.650693,0.519485,0.649139,0.761248,0.623412,0.669072,...,0.626447,0.749256,0.637572,0.646968,0.714785,0.692797,0.578015,0.66531,0.649865,0.663943
2,0.690177,0.679791,1.0,0.751037,0.852975,0.633324,0.772888,0.711675,0.631111,0.673364,...,0.721184,0.781558,0.658665,0.744801,0.749483,0.724407,0.690325,0.779733,0.753919,0.78694
3,0.645052,0.644311,0.751037,1.0,0.740004,0.607822,0.80697,0.758645,0.589906,0.603587,...,0.71069,0.757437,0.64019,0.726997,0.741232,0.674239,0.716335,0.731396,0.68433,0.743047
4,0.649132,0.650693,0.852975,0.740004,1.0,0.524169,0.753251,0.663082,0.568722,0.589733,...,0.633,0.705899,0.568299,0.655616,0.698476,0.651054,0.648331,0.677497,0.655607,0.694157
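
A similarity matrix like the one above is typically queried row by row. The following is a rough sketch of a "most similar movies" lookup; `top_k_similar` is a hypothetical helper, and the 4x4 matrix is illustrative data labeled with made-up movieIds (the notebook's own matrix is labeled by position).

```python
import pandas as pd

# Hypothetical 4x4 similarity matrix: symmetric, with ones on the diagonal
movie_ids = [1, 2, 3, 4]
sim = pd.DataFrame(
    [[1.00, 0.78, 0.69, 0.65],
     [0.78, 1.00, 0.68, 0.64],
     [0.69, 0.68, 1.00, 0.75],
     [0.65, 0.64, 0.75, 1.00]],
    index=movie_ids,
    columns=movie_ids,
)

def top_k_similar(sim, movie_id, k=2):
    """Return the k most similar movies to movie_id, excluding the movie itself."""
    row = sim.loc[movie_id].drop(movie_id)  # drop the trivial self-similarity of 1.0
    return row.nlargest(k)

neighbors = top_k_similar(sim, movie_id=1, k=2)
```

`neighbors` is a Series whose index holds the neighbor ids and whose values hold the similarities, sorted from most to least similar.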


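### Conclusions and notes
- Why is the cosine similarity equal to x dot x.T?
    1. If every row of x has been scaled to unit length, then norm(x_i) = 1 for each row x_i.
    2. By definition, the cosine similarity of two rows is their inner product divided by the product of their norms, so for unit-length rows it reduces to the inner product.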
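
Written out, with $X$ denoting the row-normalized matrix and $x_i$ its $i$-th row (notation introduced here for illustration):

$$
\cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}
= \frac{x_i \cdot x_j}{1 \cdot 1}
= x_i \cdot x_j
= \left( X X^{\top} \right)_{ij}
$$

So a single matrix product $X X^{\top}$ yields all pairwise cosine similarities at once, provided the rows (not the columns) are the unit-length vectors.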


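- Use in a recommender system
    1. The cosine similarity computed above covers only a subset of all movieIds (the 13,816 movies that have genome scores).
    2. genome_scores are simply scores that ship with the MovieLens dataset (the dataset description page says they were computed with machine learning), so a real implementation first needs to build its own unit-scaled matrix from tag information or similar data.
    3. movieIds with no similarity information can be left out when constructing relations (= graph edges).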
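
One possible way to turn a similarity matrix into graph edges is sketched below. It assumes a similarity cut-off (`threshold = 0.7` is an arbitrary value chosen for illustration, not something the notebook prescribes) and uses made-up data; movies without genome scores never appear in the matrix, so they produce no edges.

```python
import numpy as np

# Hypothetical similarity matrix for 4 movies; movie_ids maps row position -> movieId
movie_ids = [1, 2, 3, 4]
sim = np.array([
    [1.00, 0.78, 0.69, 0.65],
    [0.78, 1.00, 0.68, 0.64],
    [0.69, 0.68, 1.00, 0.75],
    [0.65, 0.64, 0.75, 1.00],
])

threshold = 0.7  # hypothetical cut-off; tune for the application

# Upper triangle only (k=1 skips the diagonal), so each unordered pair appears once
rows, cols = np.triu_indices_from(sim, k=1)
keep = sim[rows, cols] >= threshold

# Each edge is (source movieId, target movieId, similarity weight)
edges = [
    (movie_ids[i], movie_ids[j], float(sim[i, j]))
    for i, j in zip(rows[keep], cols[keep])
]
```

For the full 13,816 x 13,816 matrix, a threshold or a per-movie top-k limit keeps the edge list from growing to tens of millions of pairs.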
