### Machine Learning-based Data Analysis with Case Study
### [Practice 4] : Anime Recommendation System
    1. Problem Definition
    2. Libraries Setting
    3. Data Collection
    4. Data Exploration
    5. Preprocessing
    6. Modeling
        - 6.1 Item-based Recommendations
        - 6.2 Personalized Recommendations

#### [1] Problem Definition
- Anime recommendation system: recommend anime to users with collaborative filtering based on matrix factorization.

- Data: MyAnimeList Dataset
     - y = f(x)
     - y: anime recommendations
     - x: anime data, user data
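
To make the `y = f(x)` framing concrete, the sketch below factorizes a tiny rating matrix and rebuilds it from two latent factors. The matrix `R` and the choice `k = 2` are made up for illustration and are not taken from the MyAnimeList data; `np.linalg.svd` is used here only because it keeps the toy example short.

```python
import numpy as np

# Hypothetical toy rating matrix: 4 users x 5 anime, 0 = not rated.
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 5],
    [1, 0, 4, 5, 4],
], dtype=float)

# Full SVD, then keep only the k strongest latent factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]   # one k-dimensional vector per user
item_factors = Vt[:k, :].T        # one k-dimensional vector per anime

# f(x): a predicted score is the dot product of a user vector and an anime vector.
R_hat = user_factors @ item_factors.T
print(np.round(R_hat, 2))
```

Cells that were 0 in `R` receive non-zero values in `R_hat`; those filled-in values are what a recommender ranks.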

#### [2] Libraries Setting

In [1]:
import numpy as np # Numeric Python 
import pandas as pd # Data Processing and Database
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Visualization

from sklearn.decomposition import TruncatedSVD # Recommendations
from scipy.sparse.linalg import svds # Recommendations
import pickle

#### [3] Data Collection

In [2]:
# Sample the rating data (the full file is too large, so read only the first 100,000 rows)
rating_data = pd.read_csv('./data/rating_complete.csv', nrows=100000)
anime_data = pd.read_csv('./data/anime_with_synopsis.csv')
rating_data.head()

,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


Interpretation:
- Rows that share a user_id belong to the same user.
- One user can watch many anime.
- One user can rate many anime.
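
The one-user-to-many-ratings relationship can be checked by grouping on `user_id`. The frame below is a small hand-made sample (the first three rows mirror the output above, the rest are invented for illustration), not the real file.

```python
import pandas as pd

# Hypothetical sample of the rating table.
ratings = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1],
    "anime_id": [430, 1004, 3010, 22535, 430],
    "rating": [9, 5, 7, 9, 8],
})

# Number of ratings and average rating per user.
per_user = ratings.groupby("user_id")["rating"].agg(["count", "mean"])
print(per_user)
```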

In [3]:
anime_data.head()

,MAL_ID,Name,Score,Genres,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


#### [4] Data Exploration

### [Dataset Structure]

In [4]:
print(anime_data.shape)
print(rating_data.shape)

(16214, 5)
(100000, 3)


### [Data Types: anime_data]

In [5]:
anime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16214 entries, 0 to 16213
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   MAL_ID     16214 non-null  int64 
 1   Name       16214 non-null  object
 2   Score      16214 non-null  object
 3   Genres     16214 non-null  object
 4   sypnopsis  16206 non-null  object
dtypes: int64(1), object(4)
memory usage: 633.5+ KB
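
`Score` is listed as `object` rather than a float, which usually means the column holds at least one non-numeric entry. A minimal sketch of converting such a column, assuming the placeholder is the string `"Unknown"` (an assumption; inspect the real values before relying on it):

```python
import pandas as pd

# Hypothetical sample; "Unknown" is an assumed placeholder for a missing score.
scores_raw = pd.Series(["8.78", "8.39", "Unknown", "6.98"], name="Score")

# errors="coerce" turns anything unparseable into NaN instead of raising.
scores = pd.to_numeric(scores_raw, errors="coerce")

print(scores.dtype)
print(scores.isna().sum())  # number of entries that could not be parsed
```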


### [Data Types: rating_data]

In [6]:
rating_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   100000 non-null  int64
 1   anime_id  100000 non-null  int64
 2   rating    100000 non-null  int64
dtypes: int64(3)
memory usage: 2.3 MB


### [Descriptive Statistics: anime_data]

In [7]:
anime_data.describe()

,MAL_ID
count,16214.0
mean,22069.271555
std,14849.798248
min,1.0
25%,6728.5
50%,24164.0
75%,35978.75
max,48492.0


### [Descriptive Statistics: rating_data]

In [8]:
rating_data.describe()

,user_id,anime_id,rating
count,100000.0,100000.0,100000.0
mean,346.03421,16023.91865,7.59299
std,187.947244,13249.929743,1.706383
min,0.0,1.0,1.0
25%,193.0,3304.75,7.0
50%,346.0,12365.0,8.0
75%,496.0,29095.0,9.0
max,664.0,48456.0,10.0


In [9]:
rating_data.head(10)

,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9
5,0,431,8
6,0,578,10
7,0,433,6
8,0,1571,10
9,0,121,9


In [10]:
userId_duplicate = rating_data.drop_duplicates(['user_id'])
userId_duplicate.head()

,user_id,anime_id,rating
0,0,430,9
35,1,22535,9
138,2,31580,7
189,3,25835,8
504,4,7103,7


In [11]:
userId_duplicate.shape

(575, 3)

Interpretation: keeping one row per user_id leaves 575 rows, so the sample contains 575 distinct users.
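
The same count can be read off directly with `nunique()`, without building an intermediate frame. The toy ratings below are illustrative values, not rows from the dataset.

```python
import pandas as pd

# Toy ratings with three distinct users (values are illustrative).
ratings = pd.DataFrame({
    "user_id": [0, 0, 1, 3, 3, 3],
    "anime_id": [430, 1004, 22535, 25835, 430, 121],
    "rating": [9, 5, 9, 8, 7, 10],
})

n_users_dedup = ratings.drop_duplicates(["user_id"]).shape[0]  # approach used above
n_users_direct = ratings["user_id"].nunique()                  # one-step equivalent
print(n_users_dedup, n_users_direct)
```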

#### [5] Preprocessing

In [12]:
# Remove ratings of -1 (not rated); a no-op for this sample, whose minimum rating is 1
rating_data = rating_data[rating_data['rating'] != -1]
rating_data.head()

,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


In [13]:
# Keep only the columns needed from the anime metadata
anime_data_clean = anime_data[['MAL_ID', 'Name', 'Score', 'Genres']].copy()
anime_data_clean = anime_data_clean.rename(columns={'MAL_ID': 'anime_id'})
anime_data_clean.head()

,anime_id,Name,Score,Genres
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space"
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ..."
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural"


In [14]:
# Subsample the data for efficiency
# Popular anime: at least 50 ratings
anime_rating_counts = rating_data['anime_id'].value_counts()
popular_animes = anime_rating_counts[anime_rating_counts >= 50].index

# Active users: at least 20 ratings
user_rating_counts = rating_data['user_id'].value_counts()
active_users = user_rating_counts[user_rating_counts >= 20].index

# Keep only ratings that involve both a popular anime and an active user
rating_data_sampled = rating_data[
    (rating_data['anime_id'].isin(popular_animes)) &
    (rating_data['user_id'].isin(active_users))
]

print(f"Sampled Data Shape: {rating_data_sampled.shape}")
print(f"Users: {rating_data_sampled['user_id'].nunique()}")
print(f"Animes: {rating_data_sampled['anime_id'].nunique()}")

Sampled Data Shape: (47356, 3)
Users: 506
Animes: 511
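
The printed counts also show how sparse the user-anime matrix will be. The sketch below plugs in the three numbers from the output above and assumes each (user, anime) pair occurs at most once; `matrix_density` is a helper introduced here for illustration.

```python
def matrix_density(n_ratings, n_users, n_items):
    """Fraction of user-item cells that hold a rating."""
    return n_ratings / (n_users * n_items)

# Numbers printed by the filtering cell.
density = matrix_density(n_ratings=47356, n_users=506, n_items=511)
print(f"density:  {density:.1%}")
print(f"sparsity: {1 - density:.1%}")
```

Roughly 18% of the cells are filled, so more than four out of five entries in the pivoted matrix are placeholders rather than real ratings.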


In [15]:
user_anime_rating = rating_data_sampled.pivot_table('rating', index='user_id', columns='anime_id').fillna(0)
user_anime_rating.shape

(506, 511)

In [16]:
user_anime_rating.head()

anime_id,1,5,6,19,20,24,30,32,33,43,...,38826,38883,39195,39587,39597,40221,40496,40591,40839,41353
user_id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,9.0,10.0,0.0,0.0,0.0,0.0,0.0,...,8.0,8.0,0.0,9.0,0.0,8.0,0.0,0.0,0.0,8.0
2,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
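
A small example makes the effect of `pivot_table(...).fillna(0)` easier to see. The ratings below are invented; the point is that missing user-anime pairs first appear as `NaN` and are then replaced by 0, which makes "not rated" look like a very low score.

```python
import pandas as pd

# Hypothetical long-format ratings: 3 users, 3 anime, 5 ratings.
ratings = pd.DataFrame({
    "user_id": [0, 0, 1, 2, 2],
    "anime_id": [1, 20, 20, 1, 30],
    "rating": [9, 7, 10, 8, 6],
})

wide = ratings.pivot_table(values="rating", index="user_id", columns="anime_id")
print(wide)             # NaN marks user-anime pairs with no rating

filled = wide.fillna(0)
print(filled)           # the same cells now hold 0
```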


### [Transpose: rows are anime, columns are users]

In [17]:
anime_user_rating = user_anime_rating.values.T
anime_user_rating.shape

(511, 506)

#### [6] Modeling

#### [6.1] Item-based Recommendations

In [18]:
SVD = TruncatedSVD(n_components=12)
matrix = SVD.fit_transform(anime_user_rating)
matrix.shape

(511, 12)

In [19]:
matrix[0]

array([ 52.90630407,  43.31783617,  31.20387452,   6.47678915,
       -24.96547261,   2.21828057, -11.42527083,   1.30823702,
        12.04426516, -11.44530803,   0.28155161,   1.43862442])

Interpretation: each anime is now described by a 12-component vector instead of 506 user ratings.
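
How much information survives the reduction can be estimated from `explained_variance_ratio_`. The sketch below runs on random stand-in data (not the anime matrix), so the printed percentage says nothing about the real dataset; it only shows the call pattern. Passing `random_state` makes the randomized solver reproducible.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical stand-in for the anime x user matrix: 40 items, 30 users, ratings 0-10.
toy = rng.integers(0, 11, size=(40, 30)).astype(float)

svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(toy)

retained = svd.explained_variance_ratio_.sum()
print(reduced.shape)
print(f"variance retained by 5 components: {retained:.1%}")
```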

In [20]:
corr = np.corrcoef(matrix)
corr.shape

(511, 511)

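[Note] Pearson correlation coefficient
- np.corrcoef() computes Pearson correlation coefficients, treating each row of its input as one variable.
- Here it measures the anime-to-anime correlation between the 12-component vectors.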
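
A tiny example of the row-wise behaviour of `np.corrcoef`, using three made-up item vectors: one that is a scaled and shifted copy of the first, and one that is its negation.

```python
import numpy as np

base = np.array([1.0, 2.0, 4.0, 7.0])
vectors = np.vstack([
    base,            # item A
    2 * base + 1,    # item B: same pattern as A, different scale and offset
    -base,           # item C: exact opposite of A
])

corr = np.corrcoef(vectors)   # rows are variables, so the result is 3 x 3
print(np.round(corr, 2))
```

Pearson correlation ignores scale and offset, so A and B correlate at 1 while A and C correlate at -1.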

In [21]:
anime_title = user_anime_rating.columns
anime_title_list = list(anime_title)
sample_anime = anime_title_list[0]
sample_index = 0
sample_index

0

In [22]:
corr_sample = corr[sample_index]
list(anime_title[(corr_sample >= 0.9)])[:10]

[1, 5, 6, 19, 43, 47, 205, 227, 323, 339]

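Interpretation: this lists the first 10 anime IDs, in column order, whose correlation with the sample anime is at least 0.9. The sample itself (anime_id 1) is included because its self-correlation is 1, and the list is not sorted by similarity.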
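
If a ranked list is wanted instead of a threshold filter, one option is to sort the correlation row and drop the target itself. This is a sketch of that alternative, not the notebook's method; `top_similar` and the 4-item correlation matrix are hypothetical.

```python
import numpy as np

def top_similar(corr, item_ids, target_id, n=10):
    """Return the n (item_id, correlation) pairs most correlated with target_id."""
    item_ids = list(item_ids)
    target_pos = item_ids.index(target_id)
    order = np.argsort(corr[target_pos])[::-1]           # highest correlation first
    order = [pos for pos in order if pos != target_pos]  # drop the item itself
    return [(item_ids[pos], float(corr[target_pos, pos])) for pos in order[:n]]

# Hypothetical 4-item correlation matrix, for illustration only.
toy_ids = [1, 5, 6, 19]
toy_corr = np.array([
    [1.00, 0.95, 0.20, 0.91],
    [0.95, 1.00, 0.10, 0.80],
    [0.20, 0.10, 1.00, 0.30],
    [0.91, 0.80, 0.30, 1.00],
])

print(top_similar(toy_corr, toy_ids, target_id=1, n=2))
```

With the real data, the returned IDs could then be joined against `anime_data_clean` on `anime_id` to show titles instead of numbers.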

#### [6.2] Personalized Recommendations

In [23]:
df_ratings = rating_data_sampled.copy()
df_animes = anime_data_clean.copy()
df_ratings.head()

,user_id,anime_id,rating
0,0,430,9
5,0,431,8
6,0,578,10
9,0,121,9
10,0,356,9


In [24]:
df_animes.head()

,anime_id,Name,Score,Genres
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space"
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ..."
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural"


In [25]:
df_user_anime_ratings = df_ratings.pivot_table(values='rating', index='user_id', columns='anime_id').fillna(0)
df_user_anime_ratings.shape

(506, 511)

In [26]:
df_user_anime_ratings.head()

anime_id,1,5,6,19,20,24,30,32,33,43,...,38826,38883,39195,39587,39597,40221,40496,40591,40839,41353
user_id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,9.0,10.0,0.0,0.0,0.0,0.0,0.0,...,8.0,8.0,0.0,9.0,0.0,8.0,0.0,0.0,0.0,8.0
2,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
matrix = df_user_anime_ratings.to_numpy()
user_ratings_mean = np.mean(matrix, axis=1)
matrix_user_mean = matrix - user_ratings_mean.reshape(-1, 1)
matrix.shape

(506, 511)

In [28]:
user_ratings_mean.shape

(506,)

In [29]:
pd.DataFrame(matrix_user_mean, columns=df_user_anime_ratings.columns).head()

anime_id,1,5,6,19,20,24,30,32,33,43,...,38826,38883,39195,39587,39597,40221,40496,40591,40839,41353
0,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,...,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082,-0.178082
1,-1.228963,-1.228963,-1.228963,7.771037,8.771037,-1.228963,-1.228963,-1.228963,-1.228963,-1.228963,...,6.771037,6.771037,-1.228963,7.771037,-1.228963,6.771037,-1.228963,-1.228963,-1.228963,6.771037
2,-0.630137,-0.630137,-0.630137,-0.630137,8.369863,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137,...,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137,-0.630137
3,6.117417,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583,5.117417,-2.882583,-2.882583,5.117417,...,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583,-2.882583
4,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,...,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505,-0.522505
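
The row means in this step are taken over all 511 columns, including the zeros that stand for unrated anime, which is why user 0's mean is only about 0.18. One alternative, shown here as a sketch rather than as the notebook's method, is to average only the rated cells and keep unrated cells at zero after centering. The 2 x 4 matrix is made up for illustration.

```python
import numpy as np

# Hypothetical ratings: 0 means "not rated".
ratings = np.array([
    [9.0, 0.0, 7.0, 0.0],
    [0.0, 0.0, 10.0, 0.0],
])
rated = ratings > 0

# Mean over every cell (zeros included), as in the cell above.
mean_all = ratings.mean(axis=1)

# Mean over rated cells only; np.maximum guards against users with no ratings.
mean_rated = ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

# Center only the rated cells and leave unrated cells at 0.
centered = np.where(rated, ratings - mean_rated[:, None], 0.0)

print(mean_all)
print(mean_rated)
print(centered)
```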


In [30]:
U, sigma, Vt = svds(matrix_user_mean, k=12)

In [31]:
print(U.shape)
print(sigma.shape)
print(Vt.shape)

(506, 12)
(12,)
(12, 511)
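
`svds` returns only the `k` largest singular triplets, and callers should not rely on the singular values arriving in descending order. The reconstruction `U @ diag(sigma) @ Vt` does not depend on the order, but anything that inspects "the first component" does. A minimal sketch on random data (not the rating matrix) that sorts the factors explicitly and checks them against a full SVD:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 15))        # hypothetical dense matrix

U, s, Vt = svds(A, k=3)

# Reorder all three factors consistently so the largest singular value comes first.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

approx = U @ np.diag(s) @ Vt         # rank-3 approximation of A

# Reference: the three largest singular values from a full SVD.
top3 = np.linalg.svd(A, compute_uv=False)[:3]
print(s)
print(top3)
```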


In [32]:
sigma = np.diag(sigma)
sigma.shape

(12, 12)

In [33]:
svd_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
df_svd_preds = pd.DataFrame(svd_user_predicted_ratings, columns=df_user_anime_ratings.columns, index=df_user_anime_ratings.index)
df_svd_preds.head()

anime_id,1,5,6,19,20,24,30,32,33,43,...,38826,38883,39195,39587,39597,40221,40496,40591,40839,41353
user_id
0,1.347941,0.706382,0.825137,0.318202,0.91377,0.240051,1.06735,0.699717,-0.00371,1.248904,...,0.504609,0.24074,-0.034495,-0.025173,0.04061,-0.091136,0.114736,0.125201,-0.11477,0.052404
1,0.650573,-0.570359,-0.012379,0.324081,6.982607,0.48649,2.87032,1.771583,1.075103,-0.354927,...,2.594508,3.168044,2.02953,4.107317,2.538422,2.622397,1.747668,4.193595,1.880068,2.084931
2,2.753063,1.468268,1.426343,0.275775,5.25056,0.416184,2.486632,1.021394,2.091368,1.168581,...,-0.232236,0.03908,0.363848,0.258752,-0.239731,0.251669,-0.130653,0.19181,0.435063,0.059284
3,3.90307,2.516681,3.661994,1.38521,-0.069997,1.108219,0.787104,-0.223646,2.153009,2.533284,...,0.081366,-1.186301,0.621802,1.040693,-0.294117,0.375936,1.741745,1.543304,0.90719,0.126741
4,0.850679,-0.320317,1.256011,0.56541,1.92916,2.481739,2.051696,1.260347,-0.073853,-0.308454,...,-0.25275,-0.177957,0.183164,0.34326,-0.107524,0.1552,-0.094177,0.373784,-0.309553,-0.120498


In [34]:
df_svd_preds.shape

(506, 511)

#### Building the Recommendation Function

In [35]:
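def recommend_animes(df_svd_preds, user_id, ori_animes_df, ori_ratings_df, num_recommendations):
    """Return (rating history, top-N unseen anime) for one user, or (None, None) if the user is unknown."""
    if user_id not in df_svd_preds.index:
        print(f"User ID {user_id} does not exist.")
        return None, None

    # Predicted ratings for this user, highest first
    sorted_user_predictions = df_svd_preds.loc[user_id].sort_values(ascending=False)
    # Anime the user has already rated, joined with titles and genres
    user_data = ori_ratings_df[ori_ratings_df['user_id'] == user_id]
    user_history = user_data.merge(ori_animes_df, on='anime_id').sort_values(['rating'], ascending=False)
    # Candidates: every anime the user has not rated yet
    recommendations = ori_animes_df[~ori_animes_df['anime_id'].isin(user_history['anime_id'])]
    # Attach predictions; the inner merge keeps only anime that have a prediction column
    recommendations = recommendations.merge(pd.DataFrame(sorted_user_predictions).reset_index(), on='anime_id')
    # The prediction column is named after user_id; rename it, sort, and keep the top N
    recommendations = recommendations.rename(columns={user_id: 'Predictions'}).sort_values('Predictions', ascending=False).iloc[:num_recommendations, :]

    return user_history, recommendations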

In [36]:
sample_user_id = df_user_anime_ratings.index[0]
already_rated, predictions = recommend_animes(df_svd_preds, sample_user_id, df_animes, df_ratings, 10)

In [37]:
already_rated.head(10)

,user_id,anime_id,rating,Name,Score,Genres
2,0,578,10,Hotaru no Haka,8.51,"Drama, Historical"
8,0,2236,10,Toki wo Kakeru Shoujo,8.2,"Adventure, Drama, Romance, Sci-Fi"
0,0,430,9,Fullmetal Alchemist: The Conqueror of Shamballa,7.57,"Military, Comedy, Historical, Drama, Fantasy, ..."
3,0,121,9,Fullmetal Alchemist,8.17,"Action, Adventure, Comedy, Drama, Fantasy, Mag..."
4,0,356,9,Fate/stay night,7.34,"Action, Supernatural, Magic, Romance, Fantasy"
1,0,431,8,Howl no Ugoku Shiro,8.67,"Adventure, Drama, Fantasy, Romance"
7,0,164,8,Mononoke Hime,8.72,"Action, Adventure, Fantasy"
9,0,2034,8,Lovely★Complex,8.06,"Comedy, Romance, Shoujo"
10,0,199,8,Sen to Chihiro no Kamikakushi,8.83,"Adventure, Supernatural, Drama"
5,0,1689,6,Byousoku 5 Centimeter,7.73,"Drama, Romance, Slice of Life"


In [38]:
predictions

,anime_id,Name,Score,Genres,Predictions
51,523,Tonari no Totoro,8.31,"Adventure, Comedy, Supernatural",3.340783
50,513,Tenkuu no Shiro Laputa,8.31,"Adventure, Fantasy, Romance, Sci-Fi",2.166297
56,572,Kaze no Tani no Nausicaä,8.4,"Adventure, Fantasy",2.074925
49,512,Majo no Takkyuubin,8.24,"Adventure, Comedy, Drama, Magic, Romance, Fantasy",2.00321
103,2890,Gake no Ue no Ponyo,7.89,"Adventure, Fantasy",1.892591
57,585,Mimi wo Sumaseba,8.23,"Slice of Life, Drama, Romance, Shoujo",1.642228
202,10408,Hotarubi no Mori e,8.38,"Drama, Romance, Shoujo, Supernatural",1.483308
232,12355,Ookami Kodomo no Ame to Yuki,8.64,"Fantasy, Slice of Life",1.457455
11,47,Akira,8.17,"Action, Military, Sci-Fi, Adventure, Horror, S...",1.43198
137,5681,Summer Wars,8.09,"Sci-Fi, Comedy",1.426179


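#### Conclusion
- Recommendations are ranked by each user's own predicted ratings, so different users receive different lists.
- This makes personalized recommendation possible.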
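
The per-user difference can be illustrated with two rows of a prediction table. Everything below is hypothetical (the scores, the `seen` sets, and the helper `top_unseen`); it only mirrors the idea of ranking each user's unseen anime by predicted rating.

```python
import pandas as pd

# Hypothetical predicted ratings for two users and four anime.
preds = pd.DataFrame(
    [[4.1, 0.3, 2.2, 3.0],
     [0.5, 3.9, 3.1, 0.2]],
    index=pd.Index([0, 1], name="user_id"),
    columns=pd.Index([1, 5, 6, 19], name="anime_id"),
)
# Hypothetical viewing history: anime each user has already rated.
seen = {0: {1}, 1: {5}}

def top_unseen(preds, seen, user_id, n=2):
    """Rank the anime a user has not rated by predicted score and return the top n IDs."""
    row = preds.loc[user_id].drop(labels=list(seen[user_id]))
    return row.sort_values(ascending=False).head(n).index.tolist()

print(top_unseen(preds, seen, user_id=0))
print(top_unseen(preds, seen, user_id=1))
```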