# Association Rules and Collaborative Filtering

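## Types of Analysis Techniques in Recommender Systems
- ![Recommender system](./img/추천시스템.png)
- `Association rules (market basket analysis)`: a relatively intuitive and simple family of algorithms.
- `Collaborative filtering`: selects items to recommend by computing similarity across the data (for example, the Netflix recommender system).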

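## Association Rules
1. What is association rule analysis? A model that derives results of the form "customers who buy product A are also likely to buy product B."

2. Representative algorithms for association rule analysis: Apriori, FP-Growth, DHP, and others.
    - Apriori: improves processing speed by minimizing the number of item combinations that have to be counted (any combination that contains an infrequent subset is pruned).
    - FP-Growth: a tree-based algorithm that improves processing speed by minimizing comparisons between items.
    - DHP (Direct Hashing and Pruning): improves processing speed by first hashing the 2-item itemsets found in the transactions into a hash table.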

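3. How association rule analysis works:
    - Recommendation scores are computed from customers' past purchase history (transactions).
    - Associations between items are computed from how often each item combination occurs.
    - Key metrics
        - 1. Support: the probability that a transaction contains both item A and item B.
        - 2. Confidence: the probability that a transaction containing item A also contains item B.
        - 3. Lift: how often A and B are sold together relative to what would be expected if they were sold independently.
    - `Support` indicates how often an item or item combination is purchased out of all transactions.
    - `Confidence` is the conditional probability that item (combination) B is also in the basket given that item (combination) A was sold.
        - Note: confidence(A->B) and confidence(B->A) can differ. The joint support in the numerator is the same, but the support of the antecedent in the denominator is different (an asymmetric measure).
    - `Lift` indicates how much the share of A's transactions that also include B has increased relative to the share of all transactions that include B. Because lift measures the association between the two items (combinations), it is unchanged when the antecedent and the consequent are swapped (a symmetric measure).
        - ![Lift](./img/향상도.png)
        - Lift is read relative to 1. The closer it is to 1, the more the two items (combinations) can be interpreted as independent. A value below 1 indicates a negative association: people who buy A tend not to buy B. A value above 1 means that B tends to be bought together with A, so items with high lift are the ones to recommend.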

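4. A product recommender built on association rules
    - 1. Set support and confidence thresholds and filter out item relationships that fall below the cutoff.
    - 2. Sort by lift in descending order and select the top item combinations.
    - 3. Because association rule analysis involves three metrics, a good deal of subjective judgment from analysts and business owners is unavoidable.
        - As a result the rules are not clear-cut and may need frequent revision.
        - Supplementary measures: the IS (Interest-Support) measure and cross-support.
            - IS measure: considers lift and support together. If one of the two should count for more, a weight can be applied to adjust its influence.
            - Cross-support: not a measure of how good an individual item or combination is. It is an auxiliary indicator for deciding below which support level item combinations should be discarded. The smaller the cross-support value, the larger the gap between the item supports, which suggests that a meaningless item combination has probably been included.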
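
The metrics above can be checked by hand on a small example. The sketch below uses made-up transactions and plain Python. The helper names are hypothetical, and the IS and cross-support formulas are the commonly cited unweighted definitions (IS as the square root of lift times support, cross-support as the ratio of the smallest to the largest single-item support), assumed here rather than taken from these notes.

```python
from math import sqrt

# Hypothetical toy transactions, made up for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A and B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

def is_measure(a, b):
    """Assumed definition: sqrt(lift * support) = s(A and B) / sqrt(s(A) * s(B))."""
    return support(set(a) | set(b)) / sqrt(support(a) * support(b))

def cross_support_ratio(items):
    """Assumed definition: smallest single-item support over the largest."""
    singles = [support({i}) for i in items]
    return min(singles) / max(singles)

# Confidence is asymmetric, lift is symmetric.
print(confidence({"milk"}, {"butter"}), confidence({"butter"}, {"milk"}))
print(lift({"milk"}, {"butter"}), lift({"butter"}, {"milk"}))
print(is_measure({"milk"}, {"bread"}), cross_support_ratio({"milk", "butter"}))
```

Swapping the antecedent and consequent changes the confidence but not the lift, which matches the asymmetric and symmetric labels given above.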

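## Content-Based Filtering and Collaborative Filtering
1. What is content-based filtering?
    - It recommends items using their attributes (item metadata). Its strength is that structured data lets it recommend items similar to what the user already likes. However, it only works when metadata has been entered for every product, so it becomes hard to maintain as the number of items grows. In addition, because it only recommends items whose attributes resemble those the buyer has expressed a preference for, it cannot recommend products the buyer is likely to purchase when their attributes are dissimilar.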

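2. Addressing the weaknesses of content-based filtering => `collaborative filtering models`
    - `Nearest-neighbor model`:
        - Works by predicting ratings for items a user has not rated, based on the ratings users have given to items. Items with high predicted ratings are recommended.
        - Variants:
            - `User-based`: finds users with similar tastes and recommends the items those users prefer.
            - `Item-based`: finds items with similar characteristics and recommends them.
    - The sparse matrix problem:
        - Buyers rarely rate products, so product recommendation needs a different approach.
            - Purchase history is typically used instead (online, data such as clicks and add-to-cart events).
        - Explicit data (e.g. movie ratings) / implicit data (e.g. purchase and click history)
        - In domains dominated by implicit data, the `latent factor model` is usually the better choice among collaborative filtering approaches.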

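    - `Latent factor model`:
        - Factorizes the matrix that holds the user-item interaction data in order to derive the latent factors hidden in the data.
        - ![Latent factors](./img/잠재요인도출.png)
        - `Matrix factorization`
            - The user factors make it possible to find users with similar attributes; likewise, computing distances between item factors identifies items with similar characteristics.
            - Basic technique: `singular value decomposition (SVD)`. SVD requires every missing (null) cell to be filled in, for example with the mean, which is likely to distort the data. Algorithms that factorize using only the observed entries are therefore used instead.
            - Alternatives to SVD: `ALS, SGD`
                - ALS (alternating least squares): supports parallel execution, which makes it well suited to large datasets.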

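    - Blind spots of collaborative filtering
        - 1. The cold-start problem:
            - Newly registered users and newly released items have no interaction data, so no recommendation score can be computed for them.
            - Mitigation: ask users to pick a few items they like when they sign up.
        - 2. When purchases, clicks, and other interactions are scarce relative to the number of items, there is too little data to derive latent factors and recommendation quality drops. -> Mitigation: a hybrid filtering model.
        - 3. Accuracy drops for users with unusual tastes whose behavior is inconsistent. -> Mitigation: separate these users and apply a different similarity algorithm to them, which can improve accuracy (again a hybrid filtering model).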

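    - `Hybrid filtering model`
        - Type 1. Implement collaborative filtering and content-based filtering separately, then combine them.
            - The final score is the weighted sum of each model's score.
            - The weights can be tuned according to recommendation performance. Alternatively, instead of a weighted sum, the design can pick whichever model gives the better result.
        - Type 2. Apply content-based characteristics to collaborative filtering.
            - Add variables describing user characteristics to the collaborative filtering model (this eases the sparse matrix problem).
        - Type 3. Apply collaborative filtering characteristics to content-based filtering.
            - Profile information such as the user's gender, age, and tastes is compressed into a few latent factors, as in collaborative filtering. This extracts the meaningful attribute information and reduces data volume, which improves system efficiency.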
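
As a rough illustration of the Type 1 hybrid (weighted sum of two models), the sketch below blends two per-item score dictionaries. The function name, the item names, the scores, and the default weight are all hypothetical, and it assumes both models already produce scores on a comparable scale.

```python
def hybrid_scores(cf_scores, cb_scores, cf_weight=0.7):
    """Blend collaborative (cf) and content-based (cb) scores with a weighted sum.

    Items missing from one model contribute 0 from that model (an assumption
    made for this sketch; other fallbacks are possible).
    """
    items = set(cf_scores) | set(cb_scores)
    return {
        item: cf_weight * cf_scores.get(item, 0.0)
        + (1.0 - cf_weight) * cb_scores.get(item, 0.0)
        for item in items
    }

# Made-up scores for three hypothetical items.
cf = {"movie_a": 0.9, "movie_b": 0.2}
cb = {"movie_b": 0.8, "movie_c": 0.5}

blended = hybrid_scores(cf, cb, cf_weight=0.5)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)
```

Adjusting `cf_weight` shifts the balance between the two models, which corresponds to the weight tuning described for Type 1.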



## Association Rules and Collaborative Filtering: Hands-On

## **Association Rules**

### 1. Import the packages needed for association rules

In [1]:
!pip install mlxtend






In [3]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import pandas as pd
import numpy as np

### 2. Load and inspect the data

In [3]:
df = pd.read_csv('./datasets/market_basket.csv')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188769 entries, 0 to 188768
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   cust_cd   188769 non-null  object
 1   std_dt    188769 non-null  object
 2   prdct_cd  188769 non-null  object
 3   prdct_nm  188769 non-null  object
dtypes: object(4)
memory usage: 5.8+ MB
None


Unnamed: 0,cust_cd,std_dt,prdct_cd,prdct_nm
0,C617077280704,2021-06-19,A10001,tropical fruit
1,C617077280704,2021-06-19,A10002,whole milk
2,C617077280704,2021-06-19,A10003,pip fruit
3,C617077280704,2021-06-19,A10004,other vegetables
4,C617077280704,2021-06-19,A10005,cream


### 3. Convert each basket into a list of items

In [4]:
# Build one item list per basket (customer and date) for the Apriori model

itemset = df.drop_duplicates(['cust_cd', 'std_dt', 'prdct_nm']).groupby(['cust_cd', 'std_dt'])['prdct_nm'].apply(list)

itemset = pd.DataFrame(itemset).reset_index().drop(['cust_cd', 'std_dt'], axis='columns')

itemset = itemset.squeeze()
itemset.head()

0    [beef, herbs, tropical fruit, whole milk, chic...
1    [sugar, packaged fruit/vegetables, sausage, sp...
2    [berries, tropical fruit, fruit/vegetable juic...
3    [yogurt, beef, cream, herbs, chicken, bottled ...
4    [berries, beef, yogurt, specialty bar, bottled...
Name: prdct_nm, dtype: object

### 4. Convert the basket lists into one-hot (True/False) form

In [5]:
# One-hot encode baskets x items for the Apriori model
# TransactionEncoder() converts the item lists into a boolean matrix
encoder = TransactionEncoder()
encoder_T = encoder.fit(itemset).transform(itemset)

# Convert to a DataFrame
itemset_matrix = pd.DataFrame(encoder_T, columns=encoder.columns_)
itemset_matrix.head()

Unnamed: 0,beef,berries,beverages,bottled beer,bottled water,brown bread,butter,butter milk,canned beer,chicken,...,sparkling wine,specialty bar,specialty chocolate,sugar,syrup,tropical fruit,turkey,white wine,whole milk,yogurt
0,True,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,True,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,True,True,False,True,False,False,False,False
2,False,True,False,False,False,False,False,False,False,True,...,False,False,False,False,False,True,False,False,False,False
3,True,False,False,False,True,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,True
4,True,True,False,False,True,False,False,False,False,True,...,False,True,False,False,False,True,False,False,False,True
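
To make the True/False layout concrete, here is a plain-Python sketch of the same kind of one-hot table on made-up baskets. It only illustrates the shape of the result and is not how `TransactionEncoder` is implemented; the basket contents are hypothetical.

```python
# Hypothetical baskets, made up for illustration only.
baskets = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: True where the basket contains the column's item.
rows = [[item in basket for item in columns] for basket in baskets]

print(columns)
for row in rows:
    print(row)
```

Each row corresponds to one basket and each column to one product, which is the input format `apriori` expects in the next step.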


### 5. Set the support threshold and apply the Apriori model

In [6]:
# Apply Apriori with a minimum support of 0.01
# min_support=0.01 drops items and item combinations whose support is below 0.01
itemset_apriori = apriori(itemset_matrix, min_support=0.01, use_colnames=True)
itemset_apriori.head()

Unnamed: 0,support,itemsets
0,0.166612,(beef)
1,0.105074,(berries)
2,0.01701,(beverages)
3,0.025754,(bottled beer)
4,0.095191,(bottled water)


### 6. Check the associations between items (combinations)

In [7]:
# Extract the rules with a lift of at least 5
# association_rules(): reports lift, support, confidence, and other metrics for each rule
association_rules(itemset_apriori, metric='lift', min_threshold=5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,"(beef, ham)",(bottled water),0.028179,0.095191,0.014585,0.517601,5.437508,0.011903,1.875645,0.839756
1,(bottled water),"(beef, ham)",0.095191,0.028179,0.014585,0.153223,5.437508,0.011903,1.14767,0.90195
2,"(yogurt, beef)",(bottled water),0.023844,0.095191,0.011977,0.502311,5.276885,0.009707,1.818022,0.830291
3,(bottled water),"(yogurt, beef)",0.095191,0.023844,0.011977,0.12582,5.276885,0.009707,1.116654,0.895763
4,"(pastry, cream, beef)",(bottled water),0.030163,0.095191,0.014879,0.493301,5.182229,0.012008,1.785693,0.832132
5,"(cream, beef)","(pastry, bottled water)",0.069878,0.037474,0.014879,0.212934,5.6822,0.012261,1.222929,0.885918
6,"(cream, bottled water)","(pastry, beef)",0.047797,0.046916,0.014879,0.311299,6.635276,0.012637,1.383887,0.891922
7,"(pastry, beef)","(cream, bottled water)",0.046916,0.047797,0.014879,0.31715,6.635276,0.012637,1.394453,0.891097
8,"(pastry, bottled water)","(cream, beef)",0.037474,0.069878,0.014879,0.397059,5.6822,0.012261,1.542642,0.856093
9,(bottled water),"(pastry, cream, beef)",0.095191,0.030163,0.014879,0.15631,5.182229,0.012008,1.149519,0.891937


>     min_threshold : float (default: 0.8)
      Minimal threshold for the evaluation metric,
      via the `metric` parameter,
      to decide whether a candidate rule is of interest.

> A lift of 5 means that the item is five times as likely to be purchased when the associated items are in the basket as it is in general, which indicates a very strong association.
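
As a sanity check on how the columns relate, the sketch below recomputes confidence and lift for row 0 of the table above, `(beef, ham) -> (bottled water)`, from the support columns. The numbers are copied from that row; small differences are expected because the table values are rounded.

```python
# Values copied from row 0 of the rule table: (beef, ham) -> (bottled water)
antecedent_support = 0.028179
consequent_support = 0.095191
joint_support = 0.014585
reported_confidence = 0.517601
reported_lift = 5.437508

# confidence = support(A and B) / support(A)
confidence = joint_support / antecedent_support
# lift = confidence / support(B)
lift = confidence / consequent_support

print(round(confidence, 4), round(lift, 3))
```

The recomputed values agree with the `confidence` and `lift` columns to within rounding error.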

## **Collaborative Filtering**

### 1. Import the packages needed for collaborative filtering

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

### 2. Load the data for collaborative filtering

In [5]:
df_movies = pd.read_csv('./datasets/movies.csv')
df_ratings = pd.read_csv('./datasets/ratings.csv')

df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [6]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### 3. Build the transposed dataset for measuring user and movie similarity

In [8]:
# Join the ratings data with the movies data
df_merge = pd.merge(df_ratings, df_movies, on="movieId")

# Build the user-item rating matrix
# values=rating, index=userId, columns=title
df_merge_pivot = df_merge.pivot_table(values="rating", index="userId", columns="title")

# Replace missing ratings with 0
df_merge_pivot_null = df_merge_pivot.fillna(0)

# Transpose into an item-user rating matrix
# values=rating, index=title, columns=userId
df_merge_pivot_T = df_merge_pivot_null.T
df_merge_pivot_T.head()

userId,1,2,3,4,5,6,7,8,9,10,...,659,660,661,662,663,664,665,666,667,668
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"'burbs, The (1989)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


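> Index: movie titles<br>
> Columns: the 668 users

**cosine_similarity**
- Measures how similar two vectors are by computing the cosine of the angle between them.
- The closer the directions of the two vectors, the more similar they are; at 90 degrees the vectors are unrelated; and the closer the directions are to opposite, the more the two vectors are in opposition.

![Cosine similarity](./img/cosine.png)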
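
The three cases described above (same direction, 90 degrees, opposite direction) can be reproduced with a few lines of plain Python. This is a minimal sketch with a hypothetical helper name; it assumes non-zero vectors and is not the scikit-learn implementation used in the notebook.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine([1, 0], [0, 1])            # 90 degrees apart
opposite = cosine([1, 2], [-1, -2])            # pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

Because the rating vectors in this notebook contain no negative values, their cosine similarities fall between 0 and 1.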


### 4. Build the movie similarity matrix

In [12]:
# Build the movie similarity matrix
item_sim = cosine_similarity(df_merge_pivot_T)

# Convert to a DataFrame
item_sim_df = pd.DataFrame(item_sim, index=df_merge_pivot_T.index, columns=df_merge_pivot_T.index)
item_sim_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 (1979),...,[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),a/k/a Tommy Chong (2005),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.342682,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.158272,0.0,0.098324,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.147486,0.0
'Round Midnight (1986),0.0,0.0,1.0,0.0,0.0,0.0,0.081094,0.0,0.257012,0.680414,...,0.0,0.227429,0.141421,0.0,0.100219,0.0,0.221581,0.0,0.098324,1.0
'Til There Was You (1997),0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"'burbs, The (1989)",0.0,0.0,0.0,0.0,1.0,0.0,0.03161,0.231897,0.100923,0.0,...,0.057358,0.0,0.0,0.0,0.212684,0.0,0.104192,0.0,0.16182,0.0


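### 5. Extract the most similar movies for a given movie

In [13]:
# Top 5 movies most similar to (500) Days of Summer
# [1:6] skips the first entry, which is normally the movie itself (similarity 1.0)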

item_sim_df["(500) Days of Summer (2009)"].sort_values(ascending=False)[1:6]

title
Scott Pilgrim vs. the World (2010)    0.502121
Up in the Air (2009)                  0.498354
Social Network, The (2010)            0.497004
Forgetting Sarah Marshall (2008)      0.472271
Shutter Island (2010)                 0.468202
Name: (500) Days of Summer (2009), dtype: float64

### 6. Build the user similarity matrix

In [14]:
# Build the user similarity matrix
user_sim = cosine_similarity(df_merge_pivot_null)

# Convert to a DataFrame
user_sim_df = pd.DataFrame(user_sim, index=df_merge_pivot_null.index, columns=df_merge_pivot_null.index)
user_sim_df

userId,1,2,3,4,5,6,7,8,9,10,...,659,660,661,662,663,664,665,666,667,668
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,0.101113,0.210044,0.128766,0.057896,0.077130,0.358090,0.097434,0.239189,0.026663,...,0.291162,0.144741,0.106583,0.091049,0.236805,0.154519,0.245071,0.238660,0.278217,0.153493
2,0.101113,1.000000,0.115559,0.034610,0.032705,0.028305,0.062914,0.471918,0.194232,0.000000,...,0.068325,0.000000,0.477330,0.146887,0.163553,0.061737,0.050948,0.051423,0.035907,0.064822
3,0.210044,0.115559,1.000000,0.058208,0.044426,0.012816,0.084522,0.066620,0.459703,0.068454,...,0.152078,0.301021,0.081626,0.098949,0.310234,0.079452,0.092821,0.080940,0.158943,0.109658
4,0.128766,0.034610,0.058208,1.000000,0.019298,0.005781,0.059089,0.024420,0.050572,0.000000,...,0.055860,0.024329,0.040467,0.108881,0.076241,0.014011,0.042643,0.174275,0.061677,0.157809
5,0.057896,0.032705,0.044426,0.019298,1.000000,0.053378,0.080822,0.041536,0.023168,0.011915,...,0.058450,0.007315,0.024708,0.038163,0.053085,0.048993,0.055431,0.026053,0.086667,0.068281
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
664,0.154519,0.061737,0.079452,0.014011,0.048993,0.011697,0.207565,0.070784,0.098666,0.037486,...,0.156459,0.064727,0.059112,0.082097,0.093945,1.000000,0.101132,0.068558,0.091651,0.102100
665,0.245071,0.050948,0.092821,0.042643,0.055431,0.131016,0.178275,0.033969,0.116292,0.026110,...,0.204038,0.061159,0.020011,0.052398,0.125183,0.101132,1.000000,0.096167,0.211776,0.172575
666,0.238660,0.051423,0.080940,0.174275,0.026053,0.068479,0.174009,0.039911,0.084109,0.049958,...,0.138398,0.038339,0.037231,0.121721,0.079120,0.068558,0.096167,1.000000,0.123252,0.158383
667,0.278217,0.035907,0.158943,0.061677,0.086667,0.031558,0.195029,0.119225,0.169818,0.030780,...,0.268394,0.115746,0.044327,0.090725,0.215705,0.091651,0.211776,0.123252,1.000000,0.110965


### 7. Extract the most similar users for a given user

In [15]:
# Top 5 users most similar to user 7
user_sim_df[7].sort_values(ascending=False)[1:6]

userId
403    0.432287
358    0.414600
228    0.396949
328    0.391268
590    0.387817
Name: 7, dtype: float64

### 8. Build the sample matrices for collaborative filtering

In [12]:
# Set the number of latent factors to 30
K = 30

# Keep only the first 30 movies as a sample
df_merge_sample = df_merge_pivot.iloc[:,0:30]
df_array = df_merge_sample.values
user_cnt, item_cnt = df_array.shape

# Create the (users x K) and (movies x K) factor matrices
np.random.seed(47)
user_matrix = np.random.normal(scale=1./K, size=(user_cnt, K))
item_matrix = np.random.normal(scale=1./K, size=(item_cnt, K))

print("User matrix shape:", user_matrix.shape)
print("Movie matrix shape:", item_matrix.shape)

User matrix shape: (668, 30)
Movie matrix shape: (30, 30)


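> ![np.random.normal](./img/numpy.png)

> np.random.normal(loc, scale, size): draws random samples from a normal distribution

> loc: mean of the distribution (default 0)

> scale: standard deviation of the distribution

> size: shape of the returned array

> Why scale = 1./K: if the standard deviation of the random initial values is not small, the updates can diverge toward infinity.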

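### 9. Define the RMSE function

> MSE: mean squared error, a measure of how accurate the predicted values are.

> RMSE: root mean squared error, the square root of the average squared difference between the predicted and the actual values. The lower the RMSE, the higher the accuracy. <br> It estimates how well the model predicts the target values.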
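
As a small worked example of the definition, the sketch below computes MSE and RMSE over only the observed (non-NaN) cells of a tiny rating matrix, which is the same idea the notebook function applies. The ratings and predictions are made up, and plain Python is used here so the arithmetic is visible.

```python
from math import sqrt, isnan

nan = float("nan")

# Made-up 2 x 3 rating matrix; nan marks a rating that was never given.
actual = [[4.0, nan, 3.0],
          [nan, 2.0, 5.0]]
predicted = [[3.5, 1.0, 3.0],
             [4.0, 2.5, 4.0]]

# Squared error for each cell that has an actual rating.
squared_errors = [
    (a - p) ** 2
    for row_a, row_p in zip(actual, predicted)
    for a, p in zip(row_a, row_p)
    if not isnan(a)
]

mse = sum(squared_errors) / len(squared_errors)
rmse = sqrt(mse)
print(mse, rmse)
```

Four cells are observed, with squared errors 0.25, 0, 0.25, and 1, so the MSE is 1.5 / 4 = 0.375 and the RMSE is its square root.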

In [18]:
def get_rmse(df_array, user_matrix, item_matrix, not_nan_index):
    # Predicted rating matrix (@ is matrix multiplication)
    pred_rating_matrix = user_matrix @ item_matrix.T

    # Keep only the cells where an actual rating was observed
    df_array_not_null = df_array[not_nan_index]                     # observed actual ratings
    pred_rating_matrix_not_null = pred_rating_matrix[not_nan_index] # predictions for those cells

    # Compute RMSE
    mse = mean_squared_error(df_array_not_null, pred_rating_matrix_not_null)
    rmse = np.sqrt(mse)

    return rmse

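### 10. Define the SGD matrix factorization function
> Gradient descent: updates the weights using the entire dataset at every step.
> SGD (stochastic gradient descent), in contrast, picks a single sample at random and performs the gradient descent update on that one sample.

> Characteristics: because each weight update uses a single data point, it is very fast and can train on large datasets. The noisy updates make it easier to escape a local minimum, but harder to settle exactly on the global minimum.

![Local minimum and global minimum](./img/localminimum.png)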
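
To show what a single stochastic update looks like before the full function, here is a minimal plain-Python sketch for one user vector, one item vector, and one observed rating. The function name and starting values are hypothetical. Note one simplification: this sketch updates both vectors from their old values, whereas the notebook function below updates the item row using the already-updated user row.

```python
def sgd_step(p, q, rating, learning_rate=0.01, r_lambda=0.01):
    """One regularized SGD update for a single observed rating.

    p: latent vector of the user, q: latent vector of the item.
    Returns the updated vectors and the prediction error before the update.
    """
    error = rating - sum(pi * qi for pi, qi in zip(p, q))
    new_p = [pi + learning_rate * (error * qi - r_lambda * pi) for pi, qi in zip(p, q)]
    new_q = [qi + learning_rate * (error * pi - r_lambda * qi) for pi, qi in zip(p, q)]
    return new_p, new_q, error

# Made-up starting vectors and a single observed rating of 4.0.
p, q = [0.1, 0.2], [0.3, 0.4]
rating = 4.0

errors = []
for _ in range(200):
    p, q, err = sgd_step(p, q, rating, learning_rate=0.05)
    errors.append(abs(err))

print(errors[0], errors[-1])
```

Repeating the step on the same rating shrinks the error from its starting value toward a small residual; the regularization term `r_lambda` keeps it from reaching exactly zero.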

In [22]:
def matrix_factorization(df_array, K, steps=1000, learning_rate=0.01, r_lambda=0.01):
    # NOTE: user_matrix and item_matrix are the module-level arrays created in
    # step 8 (users x K and movies x K); they are updated in place here.
    # Indices of the observed (non-NaN) cells of df_array
    not_nan_index = np.where(~np.isnan(df_array))
    # Apply the SGD matrix factorization algorithm
    for step in range(steps):
        for p, q, r in zip(not_nan_index[0], not_nan_index[1], df_array[not_nan_index]):
            # Difference between the actual and the predicted rating
            r_pq = user_matrix[p, :] @ item_matrix[q, :]
            error_pq = r - r_pq

            # SGD updates with L2 regularization (r_lambda)
            user_matrix[p, :] = user_matrix[p, :] + learning_rate*(error_pq * item_matrix[q, :] - r_lambda*user_matrix[p,:])

            item_matrix[q, :] = item_matrix[q, :] + learning_rate*(error_pq * user_matrix[p, :] - r_lambda*item_matrix[q,:])

        # Report RMSE on the observed cells every 100 iterations
        if (step + 1) % 100 == 0:
            rmse = get_rmse(df_array, user_matrix, item_matrix, not_nan_index)
            print("Iteration:", step + 1, "RMSE:", np.round(rmse, 3))

    return user_matrix, item_matrix

### 11. Run the factorization and multiply the factor matrices

In [23]:
user_matrix, item_matrix = matrix_factorization(
    df_array, K, steps=1000, learning_rate=0.01, r_lambda=0.01)

pred_matrix = user_matrix @ item_matrix.T

Iteration: 100 RMSE: 0.097
Iteration: 200 RMSE: 0.027
Iteration: 300 RMSE: 0.024
Iteration: 400 RMSE: 0.023
Iteration: 500 RMSE: 0.021
Iteration: 600 RMSE: 0.02
Iteration: 700 RMSE: 0.02
Iteration: 800 RMSE: 0.019
Iteration: 900 RMSE: 0.019
Iteration: 1000 RMSE: 0.018


> RMSE was 0.097 at iteration 100 and fell to about 0.018 by iteration 1000. This is the error on the same observed ratings used for training, not on held-out ratings.

### 12. Convert the final product matrix into a DataFrame

In [24]:
ratings_pred_matrix = pd.DataFrame(data=pred_matrix,
                                   index= df_merge_sample.index,
                                   columns= df_merge_sample.columns)

ratings_pred_matrix.head(5)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 (1979),...,"10th Kingdom, The (2000)",11-11-11 (11-11-11: The Prophecy) (2011),11:14 (2003),"11th Hour, The (2007)",12 (2007),12 Angry Men (1957),12 Angry Men (1997),12 Rounds (2009),12 Years a Slave (2013),127 Hours (2010)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.051631,-0.083938,-0.06966,-0.105834,-0.186691,-0.094306,-0.098581,-0.177139,-0.103241,-0.082794,...,-0.045882,-0.079783,-0.112016,-0.110756,-0.034177,-0.163101,-0.076059,-0.051933,-0.128448,-0.113047
2,0.068995,0.063763,0.029378,0.103909,0.070665,0.035862,0.094289,0.170707,0.130561,0.053904,...,0.134833,0.066054,0.067439,0.097416,0.013622,0.127945,-0.041964,0.049973,0.106233,0.082817
3,0.128438,0.08974,0.090381,0.066533,0.103108,0.031881,0.212484,0.048289,0.182902,0.07097,...,0.142718,0.191402,0.256842,0.230997,0.064861,0.149501,-0.003335,0.198401,0.207483,0.198376
4,2.895991,1.712282,1.787002,2.657818,2.820935,2.213617,2.082815,2.555981,3.306608,2.252366,...,1.648483,1.936613,2.729978,2.677794,1.10038,4.98198,0.620007,2.018014,3.723871,2.951902
5,0.001853,0.076697,-0.039216,-0.02927,-0.007633,-0.051874,-0.026496,0.004058,0.040571,-0.024389,...,0.042998,0.052845,0.024583,0.030629,-0.026081,-0.008237,0.040031,0.000849,0.007089,-0.021943


### 13. Define the movie recommendation functions

In [25]:
# Return the list of movies the user has not seen (not rated)
def get_unseen_movies(df_merge_sample, userId):
    # Ratings row for this user
    ratings = df_merge_sample.loc[userId,:]
    # Movies without a rating are treated as unseen; column order is preserved
    unseen_movie_list = ratings[ratings.isnull()].index.tolist()

    return unseen_movie_list

# Sort the unseen movies by predicted score, highest first
def recomm_movie_by_userid(pred_df, userId, unseen_movie_list, top_n=10):
    recomm_movies = pred_df.loc[userId, unseen_movie_list].sort_values(ascending=False)[:top_n]
    return recomm_movies

### 14. Generate the recommendation list for a specific user

In [26]:
# Build the recommended movie list for user 575

# Movies that user 575 has not seen
unseen_movie_list = get_unseen_movies(df_merge_sample, 575)

# Unseen movies with the highest predicted ratings
recomm_movies = recomm_movie_by_userid(ratings_pred_matrix, 575, unseen_movie_list, top_n=10)

# Assemble the final dataset
recomm_movies = pd.DataFrame(data=recomm_movies.values,
                             index=recomm_movies.index,
                             columns=['pred_score']).reset_index()

recomm_movies.head(10)

Unnamed: 0,title,pred_score
0,12 Years a Slave (2013),3.569347
1,127 Hours (2010),3.362532
2,101 Dalmatians (One Hundred and One Dalmatians...,2.968062
3,10 Items or Less (2006),2.949394
4,11:14 (2003),2.8847
5,"11th Hour, The (2007)",2.742791
6,*batteries not included (1987),2.701928
7,'71 (2014),2.672483
8,12 Rounds (2009),2.58529
9,10th & Wolf (2006),2.489459
