## [0] 실습 소개

이번 실습에서 해볼 내용
- Jaccard similarity 를 이용한 Content-based filtering
- TF-IDF 를 이용한 텍스트 데이터 vectorizing
- 다양한 유사도 계산을 활용한 Content-based filtering
- User-based CF
- Item-based CF
- K-Means Clustering과 Item-based CF




### 데이터 불러오기

RecSys 기초 대회 강의에서는Book Crossing 데이터를 사용하여, 모든 실습 및 미션, 대회를 진행합니다. [Kaggle Book-Crossing](https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset) 을 출처로 하며, 데이터는 재구성되어 제공되었습니다. 해당 데이터는 CC0: Public Domain 라이센스임을 밝힙니다.

- 데이터 특징
    - 책에 대한 제목, 장르, 출판년도, 작가 등의 데이터가 있음.
    - 유저에 대한 연령, 거주 지역 등의 데이터가 있음.
    - Explicit Feedback: 유저가 책에 대한 선호도를 1 ~ 10 사이의 점수로 표현함.



실습할 데이터를 불러와서merge 합니다.

이번 시간에 실습하는 2강_실습_dataset_ratings.csv 파일은 원본 파일에서 평가 횟수가 적은 사용자를 일부 삭제한 파일입니다.

In [None]:
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-49AjeQvRML0_UzzCeQ9t6LJ1b_X_2GZ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-49AjeQvRML0_UzzCeQ9t6LJ1b_X_2GZ" -O books.csv && rm -rf ~/cookies.txt
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gJK_FU4gaSf5rocXBWp8nt1eNyRr-NrR' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gJK_FU4gaSf5rocXBWp8nt1eNyRr-NrR" -O ratings.csv && rm -rf ~/cookies.txt
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-5loL2Z1D9t-yddAf5N4vciT_1fd-cm7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-5loL2Z1D9t-yddAf5N4vciT_1fd-cm7" -O users.csv && rm -rf ~/cookies.txt

In [4]:
import pandas as pd
import numpy as np
from typing import List, Set,Optional
path=''

In [5]:
books = pd.read_csv(path+'books.csv')
users = pd.read_csv(path+'users.csv')
ratings = pd.read_csv(path+'ratings.csv')

print('books shape: ', books.shape)
print('users shape: ', users.shape)
print('ratings shape: ', ratings.shape)

books shape:  (232348, 6)
users shape:  (79516, 3)
ratings shape:  (56290, 3)


In [4]:
books.head()

Unnamed: 0,isbn,book_title,book_author,publisher,language,category
0,2005018,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,en,['Actresses']
1,60973129,Decision in Normandy,Carlo D'Este,HarperPerennial,en,['1940-1949']
2,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,Farrar Straus Giroux,en,['Medical']
3,393045218,The Mummies of Urumchi,E. J. W. Barber,W. W. Norton & Company,en,['Design']
4,399135782,The Kitchen God's Wife,Amy Tan,Putnam Pub Group,en,['Fiction']


In [5]:
users.head()

Unnamed: 0,user_id,location,age
0,8,"timmins, ontario, canada",34
1,11400,"ottawa, ontario, canada",49
2,11676,"n/a, n/a, n/a",34
3,41385,"sudbury, ontario, canada",34
4,67544,"toronto, ontario, canada",30


In [6]:
ratings.head()

Unnamed: 0,user_id,isbn,rating
0,243,446606383,6
1,243,446605484,1
2,243,446600474,1
3,243,446364800,9
4,243,446360856,1


In [6]:
df = ratings.merge(books, on='isbn')
df = df.merge(users, on='user_id', how='inner')

In [7]:
df.shape

(56290, 10)

In [8]:
# Return number of unique elements in the object. Excludes NA values by default.
df['book_title'].nunique() # 총 몇개의 책이 있는지 살펴봅니다.

916

In [9]:
# 뽑힌 책을 list로 만들고 활용하겠습니다.
book_title_list = df['book_title'].unique()
book_title_list[:5]

array(['The Midnight Club', 'Roses Are Red (Alex Cross Novels)',
       'Season of the Machete', "The General's Daughter",
       'The Gold Coast'], dtype=object)

## [1] Jaccard similarity 를 이용한 Content-based filtering

### Jaccard similarity

split을 이용하여 자카드 유사도를 구하고 유사한 책을 산출해보겠습니다.

먼저 set을 이용해서 집합계산을 하는 방식을 살펴보겠습니다.

<b>Jaccard similarity 공식</b>

![image](https://user-images.githubusercontent.com/77526788/185778489-805cd3e6-1096-487c-b92a-dcca7d76e55a.png)

In [10]:
a = set('I like banana'.split())
b = set('I like apple'.split())
c = a.intersection(b)

jaccard = len(c) / (len(a) + len(b) - len(c))

print('a: ', a)
print('b: ', b)
print('intersection: ', c)
print('jaccard similarity', jaccard)

a:  {'I', 'like', 'banana'}
b:  {'I', 'apple', 'like'}
intersection:  {'I', 'like'}
jaccard similarity 0.5



이제 book 데이터에 적용하여 유사한 책을 찾아보겠습니다.

자카드 유사도는 단어의 빈도수를 고려하지 않기 때문에 unique한 값을 뽑아낸 book_title_list 를 활용하여 진행하겠습니다.


In [11]:
# title을 기반으로 제목의 유사성을 기반으로 추천을 해보겠습니다.
# title을 넣으면, book title list과 자카드 유사도를 구한다.
def content_based_filtering_jaccard(book_title_list: list, title: str, topn: Optional[int]=None) -> pd.DataFrame:
    topn = 11 if topn is None else topn + 1
    target_split_set = set(title.split()) # title은 text 정보이므로 다 끊어서 set으로 만들어줍니다.
    sim_list = []
    
    # book_title col만 있는 df
    sim_df = pd.DataFrame(book_title_list, columns=['book_title'])
    
    # 모든 책 제목을 순회.
    for idx, book in enumerate(book_title_list):
        title_split_set = set(book.split()) # book_title_list도 마찬가지로 다 끊어서 set으로 만들어줍니다.
        title_intersection = target_split_set.intersection(title_split_set)
        jac_sim = float(len(title_intersection)) / (len(target_split_set) + len(title_split_set) - len(title_intersection))
        sim_list.append(jac_sim) # title과 낱개의 book_title_list와의 자카드 유사도를 저장합니다.[]

    
    sim_df['jaccard_similarity'] = sim_list
    
    return sim_df.sort_values('jaccard_similarity', ascending=False).reset_index(drop=True)[1:topn]

In [12]:
content_based_filtering_jaccard(book_title_list, 'Harry Potter and the Chamber of Secrets (Book 2)', 5)

Unnamed: 0,book_title,jaccard_similarity
1,Harry Potter and the Prisoner of Azkaban (Book 3),0.5
2,Harry Potter and the Goblet of Fire (Book 4),0.5
3,Harry Potter and the Sorcerer's Stone (Book 1),0.416667
4,Harry Potter and the Sorcerer's Stone (Harry P...,0.307692
5,Secrets of the Morning (Cutler),0.272727


## [2] TF-IDF 를 이용한 텍스트 데이터 vectorization

책 제목 `book_title`에 TF-IDF를 적용하여 벡터화해보겠습니다.

우리가 앞서 TF-IDF를 설명할때는 가장 기본적인 식을 기준으로 설명했습니다. 

그러나 실제로는 기본 식에서 조금 조정된 식을 사용합니다.

오늘 실습은 간단하게 Sklearn에 있는 `TfidfVectorizer`함수를 활용해서 진행하겠습니다. 

Sklearn의 TfidVectorizer 함수는 L2 정규화를 통해 값을 조정하기 때문에 직접 계산한 값과는 차이가 있습니다.

본 데이터에 적용하기 앞서 함수의 실행결과를 간단히 살펴보겠습니다.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [37]:
title_example = ["The Lion, the Witch, and the Wardrobe",
                "Lion King Opposites",
                "Edgar the Lion Who Would Be a Good King",
                "King Solomon's Mines"]
tfidf = TfidfVectorizer()
# 각 TF-idf 를 계산합니다.
# print(tfidf.vocabulary_) #각 단어의 인덱스가 어떻게 부여되었는지 보여줍니다.

pd.DataFrame(tfidf.fit_transform(title_example).toarray(), columns=tfidf.get_feature_names_out(), index=title_example)

Unnamed: 0,and,be,edgar,good,king,lion,mines,opposites,solomon,the,wardrobe,who,witch,would
"The Lion, the Witch, and the Wardrobe",0.333301,0.0,0.0,0.0,0.0,0.212742,0.0,0.0,0.0,0.788335,0.333301,0.0,0.333301,0.0
Lion King Opposites,0.0,0.0,0.0,0.0,0.473804,0.473804,0.0,0.742306,0.0,0.0,0.0,0.0,0.0,0.0
Edgar the Lion Who Would Be a Good King,0.0,0.394165,0.394165,0.394165,0.25159,0.25159,0.0,0.0,0.0,0.310764,0.0,0.394165,0.0,0.394165
King Solomon's Mines,0.0,0.0,0.0,0.0,0.411378,0.0,0.644503,0.0,0.644503,0.0,0.0,0.0,0.0,0.0


## [3] 다양한 유사도 계산을 활용한 Content-based filtering

### Euclidean Similarity

book.csv의 book_title 을 TF-IDF로 만들어진 벡터를 이용해 변환하겠습니다. 
그리고 유클리드 유사도를 이용해 유사한 책을 찾아보도록 하겠습니다.

![image](https://user-images.githubusercontent.com/77526788/185792355-6b77d980-d153-42a9-9173-f8e3e7442c75.png)

In [38]:
tfidf = TfidfVectorizer()
wordmatrix = tfidf.fit_transform(book_title_list).toarray() #책 제목 리스트를 TF-IDF를 활용해 벡터라이징 합니다.

In [39]:
wordmatrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [40]:
wordmatrix.shape # (책 갯수, 단어 갯수)

(916, 1528)

In [41]:
from sklearn.metrics.pairwise import euclidean_distances

In [42]:
# 계산한 유사도를 활용하여 책을 추천해보겠습니다.
# wordmatrix는 책 제목을 vectorization 한 array 입니다.
def content_based_filtering_euclidean(book_title_list: list, 
                    wordmatrix: np.array, 
                    title: str,
                    topn: Optional[int]=None) -> pd.DataFrame:
    
    topn=11 if topn is None else topn+1
    
    # row간의 유클리드 거리를 계산합니다. wordmatrix.shape이 (916, 1528)이므로 (916, 916) shape이 나옴.
    sim_matrix = pd.DataFrame(euclidean_distances(wordmatrix), index=book_title_list, columns=book_title_list)
    
    target_similarity_df = sim_matrix[title].reset_index().copy()
    target_similarity_df.columns=['title', 'euclidean_similarity']
    
    # 왜 내림차순? 유클리디안 거리를 기준으로 했으니 거리가 짧을 수록 비슷한 것!
    # 1부터? -> 자기 자신 제거
    # topn까지 추출.
    return target_similarity_df.sort_values('euclidean_similarity', ascending=True).reset_index(drop=True)[1:topn] 

In [43]:
content_based_filtering_euclidean(book_title_list, wordmatrix, 'Harry Potter and the Chamber of Secrets (Book 2)',5)

Unnamed: 0,title,euclidean_similarity
1,Harry Potter and the Goblet of Fire (Book 4),0.93086
2,Harry Potter and the Sorcerer's Stone (Harry P...,0.932147
3,Harry Potter and the Prisoner of Azkaban (Book 3),0.951877
4,Harry Potter and the Sorcerer's Stone (Book 1),0.953682
5,The Chamber,0.991023


### Cosine Similarity
각도 기반으로 주어진 벡터들 사이에 유사도를 계산합니다.

sklearn의 `cosine_similarity`를 import하여 계산하겠습니다. 

![image](https://user-images.githubusercontent.com/77526788/185777726-f8fcb835-933d-4760-af03-50c4edbd0148.png)

In [44]:
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
#코사인 유사도 계산를 content_based_filtering 결과
def content_based_filtering_cosin(book_title_list: list,
                wordmatrix: np.array, 
                title: str,
                topn: Optional[int]=None) -> pd.DataFrame:
    topn=11 if topn is None else topn+1
    
    # 책-책의 유사도
    sim_matrix = pd.DataFrame(cosine_similarity(wordmatrix), index=book_title_list, columns=book_title_list)
    target_similarity_df = sim_matrix[title].reset_index().copy()
    target_similarity_df.columns=['title', 'cosine_similarity']
    return target_similarity_df.sort_values('cosine_similarity', ascending=False).reset_index(drop=True)[1:topn]

In [46]:
content_based_filtering_cosin(book_title_list, wordmatrix, 'Harry Potter and the Chamber of Secrets (Book 2)', 5)

Unnamed: 0,title,cosine_similarity
1,Harry Potter and the Goblet of Fire (Book 4),0.566749
2,Harry Potter and the Sorcerer's Stone (Harry P...,0.565551
3,Harry Potter and the Prisoner of Azkaban (Book 3),0.546965
4,Harry Potter and the Sorcerer's Stone (Book 1),0.545245
5,The Chamber,0.508936


### 피어슨 상관 계수
같은 방식으로 피어슨 상관 계수를 활용해서 추천 결과를 도출해보겠습니다.

피어슨 상관 계수 공식은 다음과 같습니다.

![image](https://user-images.githubusercontent.com/77526788/185777732-7eefa50e-54e4-474b-9639-c41598210128.png)

In [34]:
def content_based_filtering_pearson(book_title_list: list,
                                    wordmatrix: np.array, 
                                    title: str,
                                    topn: Optional[int]=None) -> pd.DataFrame:
    topn=11 if topn is None else topn+1
    sim_matrix = pd.DataFrame(wordmatrix, index=book_title_list).T.corr(method='pearson')
    target_similarity_df = sim_matrix[title].reset_index().copy()
    target_similarity_df.columns=['title', 'pearson_similarity']
    return target_similarity_df.sort_values('pearson_similarity', ascending=False).reset_index(drop=True)[1:topn]

In [35]:
# 오래 걸림
content_based_filtering_pearson(book_title_list, wordmatrix, 'Harry Potter and the Chamber of Secrets (Book 2)', 5)

### test set 적용

train-test set으로 데이터를 나누고 content-based filering을 통해 RMSE를 구해보겠습니다.

코사인 유사도를 적용하는 과정에서 아이템 간의 유사도가 0으로 나와서 나누기를 할 수 없는 문제점이 있습니다.

따라서 아주 작은 수(1e-10)를 더해서 이 문제를 해결하겠습니다.

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [33]:
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

In [34]:
print('train: ', train_df.shape)
print('test: ',test_df.shape)

train:  (50661, 10)
test:  (5629, 10)


In [35]:
book_title_train_list = train_df['book_title'].unique()
tfidf = TfidfVectorizer()
wordmatrix = tfidf.fit_transform(book_title_train_list).toarray() #책 제목 리스트를 TF-IDF를 활용해 벡터라이징 합니다.
sim_matrix = pd.DataFrame(cosine_similarity(wordmatrix), index=book_title_train_list, columns=book_title_train_list) #벡터간 유사도 계산

NameError: name 'TfidfVectorizer' is not defined

In [39]:
sim_matrix

Unnamed: 0,The Third Victim,Chicken Soup for the Pet Lover's Soul (Chicken Soup for the Soul),A Is for Alibi (Kinsey Millhone Mysteries (Paperback)),The Valley of Horses,A Year in Provence,Voyager,The Perfect Storm : A True Story of Men Against the Sea,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Tell Me Lies (Tell Me Lies),The Prince of Tides,...,Hearts In Atlantis : New Fiction,Balzac and the Little Chinese Seamstress,Shipping News,The Outsiders (Now in Speak!),Memoirs of a Geisha Uk,Isle of Dogs,Gap Creek: The Story Of A Marriage,The Perks of Being a Wallflower,Notes From a Small Planet,Horse Whisperer
The Third Victim,1.000000,0.030328,0.000000,0.035998,0.0,0.0,0.042374,0.020639,0.0,0.036985,...,0.000000,0.025994,0.0,0.029280,0.000000,0.000000,0.028829,0.029112,0.0,0.0
Chicken Soup for the Pet Lover's Soul (Chicken Soup for the Soul),0.030328,1.000000,0.103626,0.029325,0.0,0.0,0.034519,0.016813,0.0,0.030129,...,0.000000,0.021175,0.0,0.023852,0.000000,0.000000,0.023485,0.023715,0.0,0.0
A Is for Alibi (Kinsey Millhone Mysteries (Paperback)),0.000000,0.103626,1.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0
The Valley of Horses,0.035998,0.029325,0.000000,1.000000,0.0,0.0,0.079253,0.057246,0.0,0.102586,...,0.000000,0.025134,0.0,0.028312,0.055907,0.064468,0.079963,0.080747,0.0,0.0
A Year in Provence,0.000000,0.000000,0.000000,0.000000,1.0,0.0,0.000000,0.000000,0.0,0.000000,...,0.118786,0.000000,0.0,0.123219,0.000000,0.000000,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Isle of Dogs,0.000000,0.000000,0.000000,0.064468,0.0,0.0,0.037943,0.036961,0.0,0.066235,...,0.000000,0.000000,0.0,0.000000,0.055414,1.000000,0.051629,0.052135,0.0,0.0
Gap Creek: The Story Of A Marriage,0.028829,0.023485,0.000000,0.079963,0.0,0.0,0.198066,0.045845,0.0,0.082156,...,0.000000,0.020128,0.0,0.022674,0.044773,0.051629,1.000000,0.064666,0.0,0.0
The Perks of Being a Wallflower,0.029112,0.023715,0.000000,0.080747,0.0,0.0,0.064092,0.046295,0.0,0.082961,...,0.000000,0.020326,0.0,0.022896,0.045212,0.052135,0.064666,1.000000,0.0,0.0
Notes From a Small Planet,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,1.0,0.0


In [40]:
# test_df 실험해보려고~
def content_based_filtering(train_df: pd.DataFrame,
                            test_df: pd.DataFrame) -> list:
    pred_rating_list = []
    
    # index, columns, values로 새 matrix(table) 만듦. train_df가 필요해서 그럼
    user_item_matrix = train_df.pivot_table(index=['user_id'], columns=['book_title'], values='rating').fillna(0)
    
    # train_df로 만든 유사도 matrix가 있으니 이걸로 test_df의 데이터들도 예측해보려고 함.
    for test_id, test_title in zip(test_df['user_id'], test_df['book_title']):
        similarity_list = []
        rating_list = []
        
        # train_df내에서 test_id와 같은 user_id를 가진 책들의 제목을 가져옴
        for read_book in train_df[train_df['user_id']==test_id]['book_title']:
            
            # sim_matrix에서 해당 책의 유사도 가져옴.
            similarity_list.append(sim_matrix[read_book][test_title]) 
            rating_list.append(user_item_matrix[test_title][test_id])
        
        similarity_list = np.array(similarity_list)
        rating_list = np.array(rating_list)
        
        pred = (similarity_list * rating_list).sum() / (similarity_list.sum() + 1e-10) #분모가 0이되는 것을 방지하기 위해 작은 수를 더합니다.
        pred_rating_list.append(pred)
        
    return pred_rating_list


In [45]:
pred_rating = content_based_filtering(train_df, test_df)

In [46]:
mean_squared_error(test_df['rating'], pred_rating)**0.5

4.0909885846732745

## [4] User based Collaborative Filtering (UBCF)

이번에는 User-based CF를 실습해보겠습니다.

평점을 기반으로 User-Item matrix를 만들고 코사인 유사도를 이용해 추천 리스트를 만들어 보겠습니다.

In [47]:
user_item_matrix = df.pivot_table(index=['user_id'], columns=['book_title'], values='rating')
print('user_item_matrix shape: ', user_item_matrix.shape)

user_item_matrix shape:  (800, 916)


In [48]:
# na이 92%인데...? 매우 많음.
user_item_matrix.isna().sum().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1])

0.9237813864628821

In [49]:
user_item_matrix

book_title,1984,1st to Die: A Novel,2010: Odyssey Two,24 Hours,A 2nd Helping of Chicken Soup for the Soul (Chicken Soup for the Soul Series (Paper)),A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash,A Bend in the Road,A Case of Need,"A Child Called \It\"": One Child's Courage to Survive""",A Civil Action,...,Wifey,Windmills of the Gods,Winter Moon,Winter Solstice,Wish You Well,Wuthering Heights,You Belong To Me,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zoya,"\O\"" Is for Outlaw"""
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
243,,,,,,,,,,,...,,,,,,,,,,
254,9.0,,,,,,1.0,,,,...,,1.0,,,,,,,,
882,,,,,,,,,,,...,,,,,,,,,,
2276,,,1.0,,,,,,,,...,,,,,,,,,,
2766,,,,,,,,1.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
274301,,,,,,,,,,,...,,,,,,,,,,8.0
274308,,,,,,,,,,,...,,,,,,,,,,
275970,1.0,,,,,,,,,,...,,,,,,,,1.0,,
277427,,,,10.0,,,,,,,...,,,,,,,,,,


In [50]:
#유사도 계산을 위해 NaN값을 0으로 채워줍니다.
#앞서 봤듯이 굉장히 sparse함.
user_item_matrix = user_item_matrix.fillna(0)

In [51]:
#코사인 유사도 계산. 
# user-user 유사도임. 우리는 지금 ubcf를 하고 있는 거라니까?
user_similarity = cosine_similarity(user_item_matrix) 

print('user_similarity shape: ', user_similarity.shape) # user_item_matrix[0], user_item_matrix[0]이 나옴.

user_similarity shape:  (800, 800)


In [52]:
user_similarity #코사인 유사도를 계산한 결과

array([[1.        , 0.01523286, 0.01788076, ..., 0.01482404, 0.07060269,
        0.04671182],
       [0.01523286, 1.        , 0.03522542, ..., 0.13414197, 0.04375281,
        0.02126757],
       [0.01788076, 0.03522542, 1.        , ..., 0.01096961, 0.01935006,
        0.04224758],
       ...,
       [0.01482404, 0.13414197, 0.01096961, ..., 1.        , 0.06082653,
        0.06381511],
       [0.07060269, 0.04375281, 0.01935006, ..., 0.06082653, 1.        ,
        0.01947121],
       [0.04671182, 0.02126757, 0.04224758, ..., 0.06381511, 0.01947121,
        1.        ]])

In [53]:
def UBCF(df: pd.DataFrame, 
         user_id: int,
         topn: Optional[int]=None) -> pd.DataFrame:

    topn=11 if topn is None else topn + 1
    if user_id in df['user_id'].values:
        user_id = str(user_id)
        user_item_matrix = df.pivot_table(index=['user_id'], columns=['book_title'], values='rating')
        user_item_matrix = user_item_matrix.fillna(0)
        user_similarity_df = pd.DataFrame(cosine_similarity(user_item_matrix), index=user_item_matrix.index.astype(str), columns=user_item_matrix.index.astype(str))
        sim_user_df = user_similarity_df[user_id].sort_values(ascending=False).reset_index(drop=False).rename(columns={'index':'user_id', user_id:'similarity'})
        print('입력한 사용자 id: ',user_id)
        display(sim_user_df[1:topn]) #비슷한 사용자를 보여줍니다
    else:
        print('사용자 id를 다시 확인해주세요')

In [54]:
UBCF(df, 254, 5) # 254번 유저와 비슷한 유저들을 보여줌.

입력한 사용자 id:  254


Unnamed: 0,user_id,similarity
1,208141,0.557751
2,175003,0.448879
3,36003,0.401443
4,226965,0.386151
5,205735,0.384968


### test set 적용
앞서 분리한 train-test set에 적용하여 RMSE를 계산해보겠습니다.

In [55]:
def UBCF_predict_rating(train_df: pd.DataFrame,
                        test_df: pd.DataFrame) -> list:
    rating_list=[]
    
    # 우리는 ubcf를 해야 하니까 user-user 유사도를 구해야 함.
    user_item_matrix = train_df.pivot_table(index=['user_id'], columns=['book_title'], values='rating')
    user_item_matrix = user_item_matrix.fillna(0)
    user_similarity_df = pd.DataFrame(cosine_similarity(user_item_matrix), index=user_item_matrix.index.astype(str), columns=user_item_matrix.index.astype(str)) # user-user 유사도임. 우리는 지금 ubcf를 하고 있는 거라니까?
    user_similarity_df.index = user_similarity_df.index.astype(int)
    
    # 각 test_id 별로 sum(유사도 * 평점) / sum(유사도)를 구함.
    for test_id, test_book_title in zip(test_df['user_id'].astype(str), test_df['book_title']):
        pred_rating = (user_similarity_df[test_id].sort_index().values * user_item_matrix[test_book_title].sort_index().values).sum()/(user_similarity_df[test_id].values.sum())
        rating_list.append(pred_rating)
    return rating_list

In [56]:
rating_list = UBCF_predict_rating(train_df, test_df)

In [57]:
mean_squared_error(rating_list, test_df['rating'])**0.5

3.8697271791718024

## [5] Item based Collaborative Filtering (IBCF)

이번에는 Item-based CF를 실습해보겠습니다.

평점을 기반으로 User-Item matrix를 만들고 코사인 유사도를 이용해 유사한 책을 추천 해보겠습니다.

In [58]:
# 주의할 점이, IBCF이므로 i-i 유사도를 구해야 함. 그래서 이번엔 index가 book_title로 하였고, sim table도 book-book matrix여야 함.
user_item_matrix = df.pivot_table(index=['book_title'], columns=['user_id'],values='rating')
print('user_item_matrix shape: ', user_item_matrix.shape)

user_item_matrix shape:  (916, 800)


In [59]:
user_item_matrix # 매우 sparse함

user_id,243,254,882,2276,2766,3363,3371,4017,5903,6251,...,271705,273086,273979,274004,274061,274301,274308,275970,277427,278418
book_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1984,,9.0,,,,,,,,,...,10.0,,,,,,,1.0,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2010: Odyssey Two,,,,1.0,,,,,,,...,,,,,,,,,,
24 Hours,,,,,,,,,,,...,,,1.0,,,,,,10.0,
A 2nd Helping of Chicken Soup for the Soul (Chicken Soup for the Soul Series (Paper)),,,,,,1.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wuthering Heights,,,,,,,,,,,...,,,,,,,,,,
You Belong To Me,,,,,,,,,,,...,,,,,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,,1.0,,,,1.0,...,,,,,,,,1.0,,
Zoya,,,,,,,,,,,...,,,1.0,,,,,,,


In [61]:
def IBCF(df: pd.DataFrame, 
         title: str,
         topn: Optional[int]=None) -> pd.DataFrame:
    topn=11 if topn is None else topn+1
    if title in df['book_title'].values:
        user_item_matrix = df.pivot_table(index=['book_title'], columns=['user_id'], values='rating')
        user_item_matrix = user_item_matrix.fillna(0)
        item_similarity_df = pd.DataFrame(cosine_similarity(user_item_matrix), index= user_item_matrix.index, columns=user_item_matrix.index)
        sim_item_df = item_similarity_df[title].sort_values(ascending=False).reset_index().rename(columns={'index':'book_title',title:'similarity'})
        print('입력한 책 이름: ',title)
        display(sim_item_df[1:topn])
    else:
        print('책 제목을 다시 확인해주세요')

In [62]:
IBCF(df,'Harry Potter and the Chamber of Secrets (Book 2)', 5)

입력한 책 이름:  Harry Potter and the Chamber of Secrets (Book 2)


Unnamed: 0,book_title,similarity
1,Harry Potter and the Prisoner of Azkaban (Book 3),0.596622
2,Harry Potter and the Sorcerer's Stone (Harry P...,0.542272
3,Harry Potter and the Goblet of Fire (Book 4),0.491703
4,Harry Potter and the Sorcerer's Stone (Book 1),0.3726
5,Bridget Jones's Diary,0.243946


### test set 적용

앞서 분리해 놓은 train test set에 맞춰서 코드를 수정하고 RMSE를 구해보겠습니다.

In [66]:
def IBCF_predict_rating(train_df: pd.DataFrame, 
                        test_df: pd.DataFrame) -> list:
    rating_list=[]
    user_item_matrix = train_df.pivot_table(index=['book_title'], columns=['user_id'], values='rating')
    user_item_matrix = user_item_matrix.fillna(0)
    item_similarity_df = pd.DataFrame(cosine_similarity(user_item_matrix), index=user_item_matrix.index, columns=user_item_matrix.index)

    for test_id, test_book_title in zip(test_df['user_id'], test_df['book_title']):
        try:
            pred_rating = (item_similarity_df[test_book_title].sort_index().values * user_item_matrix[test_id].sort_index().values).sum() / item_similarity_df[test_book_title].values.sum()
        except:
            pred_rating=0
        rating_list.append(pred_rating)
    return rating_list

In [67]:
rating_list = IBCF_predict_rating(train_df, test_df)

In [68]:
mean_squared_error(rating_list, test_df['rating'])**0.5

3.84702556623124

실제로 서빙할 때는 추천할 책 뽑아 놓고 '보지 않았던' 책을 추천해줘야 함. 이미 봤는데 유사도 높다고 또 추천해주면 안 사잖아.

## [6] Kmeans clustering과 IBCF
K-means 클러스터링을 실습해보겠습니다.

K-means 클러스터링은 가장 가까운 중심점을 갖는 군집에 각 항목을 할당하는 과정을 반복하여 K개의 군집으로 항목을 나누는 알고리즘입니다.

![image.png](https://user-images.githubusercontent.com/77526788/223298457-f9482f4f-1d64-4ecd-bd90-83e594f98a5c.png)


1. 랜덤하게 초기 중심점 배치
2. 각 데이터를 가장 가까운 중심점으로 할당
3. 모인 데이터를 바탕으로 중심점 위치 업데이트
4. 더 이상 중심점이 업데이트 되지 않을 때까지 2-3 단계 반복

유사한 사용자 군집을 구하고 군집내의 유저 데이터를 이용해 추천결과를 도출해 보겠습니다.

In [69]:
from sklearn.cluster import KMeans
import re

In [71]:
df['category'] = df['category'].apply(lambda x: re.sub('[\W_]+', ' ', x).strip()) #데이터 category의 대괄호를 풀어줍니다.

encoding_df = pd.concat([df, pd.get_dummies(df[['category', 'language']])], axis=1) #카테고리와 언어로 one-hot-encoding을 수행합니다.
print('인코딩 결과의 shape: ', encoding_df.shape)

인코딩 결과의 shape:  (56290, 71)


In [72]:
encoding_df.iloc[:,9:].head()

Unnamed: 0,age,category_9,category_Abortion,category_Actors,category_Adulteresses,category_Adultery,category_African American men,category_Aircraft accidents,category_Arctic regions,category_Artificial intelligence,...,category_Pets,category_Rapture Christian eschatology,category_Religion,category_Self Help,category_Social Science,category_Travel,category_Trials Murder,category_True Crime,language_9,language_en
0,34,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,34,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,34,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,34,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,34,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [73]:
n=3 #군집개수를 정합니다.
km = KMeans(n_clusters=n, random_state=42)
km.fit(encoding_df.iloc[:, 9:]) #나이, 언어, 책 카테고리로 군집화를 수행하겠습니다.
labels = km.labels_

encoding_df['label'] = labels

In [74]:
encoding_df['label'].value_counts() #각 클러스터별 데이터 개수를 확인합니다.

0    29466
2    13596
1    13228
Name: label, dtype: int64

군집결과를 확인해 보겠습니다.

군집에 따라 평균 나이와 언어 비율에 차이가 있는 것을 확인할 수 있습니다.

In [75]:
encoding_df[(encoding_df['label']==0)]['age'].mean()

34.76556030679427

In [76]:
encoding_df[(encoding_df['label']==1)]['age'].mean()

50.64008164499546

In [77]:
encoding_df[(encoding_df['label']==2)]['age'].mean()

25.694248308325978

In [78]:
encoding_df[(encoding_df['label']==0)]['language'].value_counts(True)

en    0.665038
9     0.334962
Name: language, dtype: float64

In [79]:
encoding_df[(encoding_df['label']==1)]['language'].value_counts(True)

en    0.673269
9     0.326731
Name: language, dtype: float64

In [80]:
encoding_df[(encoding_df['label']==2)]['language'].value_counts(True)

en    0.663063
9     0.336937
Name: language, dtype: float64

강의 내에 들어간 실습 자료를 제작 당시에 군집 수를 늘려서 군집간의 차이를 보여주었습니다.

그러나 군집 수가 늘어나면 한 군집에 속하는 데이터가 작아지고, 유사한 아이템이 없어서 특정 군집에서는 책 추천이 불가능한 단점이 생깁니다.

따라서 실습파일에서는 추천 결과를 보여주고 비교하기 위해 군집 갯수를 줄였습니다.

군집 개수는 직접 설정해야하는 파라미터이므로 군집 수에 따라 변하는 추천 결과를 확인해 해보실 수 있습니다.

군집에 따른 IBCF 결과를 확인해보겠습니다.

군집에 따라 다른 책을 추천하는 것을 확인할 수 있습니다.

In [81]:
IBCF(encoding_df[encoding_df['label']==0], 'Harry Potter and the Chamber of Secrets (Book 2)' )

입력한 책 이름:  Harry Potter and the Chamber of Secrets (Book 2)


Unnamed: 0,book_title,similarity
1,Harry Potter and the Sorcerer's Stone (Harry P...,0.599127
2,Harry Potter and the Prisoner of Azkaban (Book 3),0.50203
3,Harry Potter and the Goblet of Fire (Book 4),0.461835
4,Charlotte's Web (Trophy Newbery),0.374359
5,Bridget Jones's Diary,0.318786
6,Harry Potter and the Sorcerer's Stone (Book 1),0.313076
7,The Andromeda Strain,0.282895
8,The Poet,0.281339
9,Chicken Soup for the Pet Lover's Soul (Chicken...,0.279968
10,Tough Cookie,0.275961


In [82]:
IBCF(encoding_df[encoding_df['label']==1], 'Harry Potter and the Chamber of Secrets (Book 2)' )

입력한 책 이름:  Harry Potter and the Chamber of Secrets (Book 2)


Unnamed: 0,book_title,similarity
1,Harry Potter and the Prisoner of Azkaban (Book 3),0.696095
2,Harry Potter and the Goblet of Fire (Book 4),0.690532
3,The Perfect Storm : A True Story of Men Agains...,0.489702
4,Harry Potter and the Sorcerer's Stone (Harry P...,0.445134
5,Carolina Moon,0.403948
6,The Talented Mr. Ripley (Vintage Crime/Black L...,0.399915
7,Born in Fire,0.397422
8,Zen and the Art of Motorcycle Maintenance: An ...,0.39428
9,The Hobbit : The Enchanting Prelude to The Lor...,0.383562
10,Harry Potter and the Sorcerer's Stone (Book 1),0.37516


In [83]:
IBCF(encoding_df[encoding_df['label']==2], 'Harry Potter and the Chamber of Secrets (Book 2)' )

입력한 책 이름:  Harry Potter and the Chamber of Secrets (Book 2)


Unnamed: 0,book_title,similarity
1,Harry Potter and the Prisoner of Azkaban (Book 3),0.678289
2,Harry Potter and the Sorcerer's Stone (Harry P...,0.51499
3,Harry Potter and the Sorcerer's Stone (Book 1),0.472443
4,Harry Potter and the Goblet of Fire (Book 4),0.409339
5,The Hobbit: or There and Back Again,0.319154
6,The Tao of Pooh,0.310948
7,A Time to Kill,0.310265
8,The Hitchhiker's Guide to the Galaxy,0.30907
9,High Fidelity,0.298845
10,Four Past Midnight,0.292082


### test set에 적용

같은 방식을 train-test 분리한 데이터에 적용하여 성능을 비교해보겠습니다.

In [84]:
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

df = pd.concat([train_df, test_df], axis=0) # train-test set을 concat하여 인코딩하고 다시 분할하도록 하겠습니다.

df['category'] = df['category'].apply(lambda x: re.sub('[\W_]+', ' ', x).strip()) #데이터 category의 대괄호를 풀어줍니다.

encoding_df = pd.concat([df, pd.get_dummies(df[['category', 'language']])], axis=1) #카테고리와 언어로 one-hot-encoding을 수행합니다.

train_df = encoding_df.iloc[:train_df.shape[0], :]
test_df = encoding_df.iloc[train_df.shape[0]:, :]

In [85]:
n=3 #군집개수를 정합니다.
km = KMeans(n_clusters=n, random_state=42)
km.fit(train_df.iloc[:,9:].copy()) #나이, 언어, 책 카테고리로 군집화를 수행하겠습니다.


In [86]:
train_df = train_df.copy()
train_df['label'] = km.labels_

In [87]:
test_df = test_df.copy()
test_df['label'] = km.predict(test_df.iloc[:,9:].copy())

In [88]:
rating = IBCF_predict_rating(train_df[train_df['label']==0], test_df[test_df['label']==0])
mean_squared_error(test_df[test_df['label']==0]['rating'], rating)**0.5

3.8189750985365176

In [89]:
rating = IBCF_predict_rating(train_df[train_df['label']==1], test_df[test_df['label']==1])
mean_squared_error(test_df[test_df['label']==1]['rating'], rating)**0.5

3.9040023602646525

In [90]:
rating = IBCF_predict_rating(train_df[train_df['label']==2], test_df[test_df['label']==2])
mean_squared_error(test_df[test_df['label']==2]['rating'], rating)**0.5

3.856575105833

<font color='red'><b>**WARNING**</b></font> : **본 교육 콘텐츠의 지식재산권은 재단법인 네이버커넥트에 귀속됩니다. 본 콘텐츠를 어떠한 경로로든 외부로 유출 및 수정하는 행위를 엄격히 금합니다.** 다만, 비영리적 교육 및 연구활동에 한정되어 사용할 수 있으나 재단의 허락을 받아야 합니다. 이를 위반하는 경우, 관련 법률에 따라 책임을 질 수 있습니다.