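## Main recommendation algorithms
* 1) Collaborative filtering
    - Collects each user's ratings of the products they bought or consumed, treats users with similar rating patterns as one group, and uses the tastes of the users in that group
    - Accurate when recommending products for which tastes are clearly differentiated
    - Limited when rating data is hard to obtain, and prone to issues such as the cold-start problem
    - Implicit scores such as clickstream data can also be used

* 2) Content-based filtering
    - Recommends by analyzing the content of the products
    - Used for text-rich content such as books and news

* 3) Knowledge-based filtering
    - Builds a knowledge structure for a specific domain and exploits it
    - Given a defined scheme such as a product category hierarchy, candidate products related to what the user bought or showed interest in can be found within that knowledge structure
    - Applicable to products and services where domain knowledge matters particularly (e.g. coffee, wine, education)

* 4) Deep learning
    - Builds a model that takes various user- and item-related features as input and outputs the user's predicted preference for each item
    - Can use many kinds of input (e.g. image, text)

* 5) Hybrid model
    - Real recommender systems often mix two or more techniques in a hybrid form
    - Some studies report that using multiple algorithms and methods improves accuracy, with larger gains as more are combined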
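
As a rough illustration of the "similar rating patterns" idea behind collaborative filtering, the sketch below compares users on a made-up rating matrix with cosine similarity. The matrix `R`, the helper `cosine_sim`, and the choice to treat unrated items as 0 are assumptions made for illustration only; they are not part of this notebook's method.

```python
import numpy as np

# Toy rating matrix (assumed data): rows = users, columns = items, 0 = not rated
R = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1: rates much like user 0
    [1, 0, 5, 4],   # user 2: roughly opposite taste
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors (hypothetical helper)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_01 = cosine_sim(R[0], R[1])  # high: users 0 and 1 would fall in the same group
sim_02 = cosine_sim(R[0], R[2])  # low: user 2 has a different rating pattern
print(sim_01, sim_02)
```

Users 0 and 1 come out far more similar than users 0 and 2, so a collaborative filter would lean on user 1's ratings when predicting for user 0.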

In [None]:
from google.colab import drive
drive.mount('/content/drive')
data_repo = '/content/drive/MyDrive/recommender_system/practice/'

import pandas as pd
import numpy as np

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
u_cols = ['user_id', 'age', 'sex','occupation', 'zip_code']
users = pd.read_csv(data_repo+'u.user', sep='|', names = u_cols, encoding = 'latin-1')
users = users.set_index('user_id')
users.head()

user_id,age,sex,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [None]:
i_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'IMDB_URL', 'unknown',
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary',
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
          'Thriller', 'War', 'Western']
movies = pd.read_csv(data_repo+'u.item', sep='|', names=i_cols, encoding='latin-1')
movies = movies.set_index('movie_id')
movies.head()

movie_id,title,release_date,video_release_date,IMDB_URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


In [None]:
r_cols = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_csv(data_repo+'u.data', sep='\t', names = r_cols, encoding='latin-1')
ratings = ratings.set_index('user_id')
ratings.head()

user_id,movie_id,rating,timestamp
196,242,3,881250949
186,302,3,891717742
22,377,1,878887116
244,51,2,880606923
166,346,1,886397596


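## Best-seller recommendation
- The simplest approach: predict each movie's rating as the mean of the ratings given by all users
- If a feature that distinguishes users is available, users can be grouped by it and the per-group mean used as the prediction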

In [None]:
def recom_movie1(n_items):
    movie_sort = movie_mean.sort_values(ascending = False)[:n_items]
    recom_movies = movies.loc[movie_sort.index]
    recommendations = recom_movies['title']
    return recommendations

In [None]:
movie_mean = ratings.groupby(['movie_id'])['rating'].mean()
recom_movie1(n_items=10)

movie_id
1293                                      Star Kid (1997)
1467                 Saint of Fort Washington, The (1993)
1653    Entertaining Angels: The Dorothy Day Story (1996)
814                         Great Day in Harlem, A (1994)
1122                       They Made Me a Criminal (1939)
1599                        Someone Else's America (1995)
1201           Marlene Dietrich: Shadow and Light (1996) 
1189                                   Prefontaine (1997)
1500                            Santa with Muscles (1996)
1536                                 Aiqing wansui (1994)
Name: title, dtype: object

In [None]:
def recom_movie2(n_items):
    # compact one-line version of recom_movie1
    return movies.loc[movie_mean.sort_values(ascending=False)[:n_items].index]['title']
recom_movie2(n_items=10)

movie_id
1293                                      Star Kid (1997)
1467                 Saint of Fort Washington, The (1993)
1653    Entertaining Angels: The Dorothy Day Story (1996)
814                         Great Day in Harlem, A (1994)
1122                       They Made Me a Criminal (1939)
1599                        Someone Else's America (1995)
1201           Marlene Dietrich: Shadow and Light (1996) 
1189                                   Prefontaine (1997)
1500                            Santa with Muscles (1996)
1536                                 Aiqing wansui (1994)
Name: title, dtype: object

In [None]:
# test score - RMSE
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true)- np.array(y_pred))**2))

In [None]:
# best seller score

rmse = []
for user in set(ratings.index):
    y_true = ratings.loc[user]['rating']
    y_pred = movie_mean[ratings.loc[user]['movie_id']]  # predict every movie this user rated with that movie's mean rating over all users
    accuracy = RMSE(y_true, y_pred)
    rmse.append(accuracy)

print(np.mean(rmse))                        

0.996007224010567
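
The loop above averages one RMSE per user, whereas the `score` function used later computes a single RMSE over all test rows, so the two figures are not directly comparable. The value above is also computed on the same ratings that produced `movie_mean`. The sketch below shows the difference between the two aggregations. The toy ratings and the names `rmse`, `toy`, `mean_of_user_rmse` and `pooled_rmse` are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, same formula as the notebook's RMSE."""
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

# Toy data (assumed): user A has two ratings, user B has one
toy = {
    'A': ([5, 3], [4, 3]),   # (true ratings, predicted ratings)
    'B': ([2], [4]),
}

# 1) average of per-user RMSEs: every user counts equally
mean_of_user_rmse = np.mean([rmse(t, p) for t, p in toy.values()])

# 2) one RMSE over all rows pooled together: every rating counts equally
all_true = [r for t, _ in toy.values() for r in t]
all_pred = [r for _, p in toy.values() for r in p]
pooled_rmse = rmse(all_true, all_pred)

print(mean_of_user_rmse, pooled_rmse)
```

With unequal numbers of ratings per user, the two values generally differ, so it is safest to compare models under one aggregation only.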


## Group recommendation (demographics)

In [None]:
from sklearn.model_selection import train_test_split
ratings_df = ratings.reset_index().drop('timestamp', axis=1)
movies_df = movies.reset_index().loc[:,['movie_id','title']]

In [None]:
x= ratings_df.copy()
y= ratings_df['user_id']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.25, stratify=y, random_state=7)  # stratify on user_id so each user's ratings are split roughly 75/25 between train and test
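
To make the effect of `stratify` concrete, here is a minimal sketch on made-up data: with `stratify` set to the user IDs, each user contributes about the same fraction of rows to the test set. The frame `toy` and the count variables are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings (assumed): user 1 has 8 ratings, user 2 has 4
toy = pd.DataFrame({
    'user_id':  [1] * 8 + [2] * 4,
    'movie_id': list(range(8)) + list(range(4)),
    'rating':   [3] * 12,
})

# Stratifying on user_id keeps each user's share equal in train and test
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy['user_id'], random_state=7)

train_counts = train['user_id'].value_counts().to_dict()
test_counts = test['user_id'].value_counts().to_dict()
print(train_counts, test_counts)
```

Each user keeps 75% of their rows in `train` and 25% in `test`, so every user seen at test time also has training ratings.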

In [None]:
def score(model):
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(X_test['rating'])
    return RMSE(y_true,y_pred)

rating_matrix = X_train.pivot(index='user_id', columns='movie_id', values = 'rating')
rating_matrix.head()

user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,1629,1631,1634,1635,1636,1637,1639,1640,1641,1642,1643,1644,1645,1646,1647,1648,1649,1650,1651,1652,1655,1656,1657,1658,1659,1661,1662,1663,1664,1665,1666,1667,1668,1670,1672,1675,1676,1680,1681,1682
1,,3.0,,3.0,3.0,5.0,4.0,1.0,5.0,,,5.0,,5.0,5.0,5.0,3.0,4.0,5.0,4.0,1.0,4.0,,3.0,4.0,,2.0,4.0,1.0,3.0,3.0,,,2.0,1.0,2.0,2.0,3.0,4.0,3.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,4.0,,,,,,,,,2.0,,,4.0,4.0,,,,,3.0,,,,,,4.0,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,4.0,3.0,,,,,,,,,,,,,,,,,,,3.0,,,4.0,3.0,,,,4.0,,,,,,,,,,,4.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
rating_matrix.shape

(943, 1632)
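
The matrix has one column per movie that appears in `X_train`, so a movie with no training ratings has no column at all. That is why the prediction functions below test `movie_id in rating_matrix` first. A small sketch of the same `pivot` on made-up data follows; `toy`, `toy_matrix` and the other names are assumptions for illustration.

```python
import pandas as pd

# Toy long-format ratings (assumed data)
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [4, 3, 5, 2],
})

# Wide user x movie matrix; user/movie pairs without a rating become NaN
toy_matrix = toy.pivot(index='user_id', columns='movie_id', values='rating')

# `in` on a DataFrame tests the column labels, i.e. the movie IDs
has_movie_20 = 20 in toy_matrix
has_movie_99 = 99 in toy_matrix

# Fraction of empty cells in the matrix
sparsity = toy_matrix.isna().to_numpy().mean()
print(toy_matrix.shape, has_movie_20, has_movie_99, sparsity)
```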

In [None]:
def best_seller(user_id, movie_id):
    try:
        rating = train_mean[movie_id]  # predict with the movie's mean rating over all training users
    except KeyError:
        rating = 3.0  # the movie has no ratings in the training set: fall back to 3.0
    return rating

train_mean = X_train.groupby(['movie_id'])['rating'].mean()
score(best_seller)

1.0268412212208893

In [None]:
merged_ratings = pd.merge(X_train, users.reset_index())
merged_ratings.head()

,user_id,movie_id,rating,age,sex,occupation,zip_code,age_gr
0,655,1005,4,50,F,healthcare,60657,5
1,655,410,2,50,F,healthcare,60657,5
2,655,909,3,50,F,healthcare,60657,5
3,655,610,4,50,F,healthcare,60657,5
4,655,930,2,50,F,healthcare,60657,5


In [None]:
g_mean = merged_ratings[['movie_id','sex','rating']].groupby(['movie_id','sex'])['rating'].mean()
g_mean

movie_id  sex
1         F      3.817204
          M      3.911877
2         F      3.294118
          M      3.195122
3         F      2.545455
                   ...   
1675      M      3.000000
1676      M      2.000000
1680      M      2.000000
1681      M      3.000000
1682      M      3.000000
Name: rating, Length: 3026, dtype: float64

In [None]:
def cf_gender(user_id, movie_id):
    if movie_id in rating_matrix:
        gender = users.loc[user_id]['sex']
        if gender in g_mean[movie_id]:
            gender_rating = g_mean[movie_id][gender]  # mean rating of this movie within the user's gender group
        else:
            # nobody of this gender rated the movie: fall back to its overall training mean
            gender_rating = train_mean[movie_id]
    else:  # the movie has no ratings in the training set
        gender_rating = 3.
    return gender_rating

In [None]:
score(cf_gender)  # no improvement over the overall-mean baseline

1.0373569822963804

#### Exercise

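* Modify the gender-based method to group users by occupation instead, write the prediction function, and compute its accuracy
* In the same way, group users by age band, write the prediction function, and compute its accuracy
* Build groups from gender and occupation together, write a function that computes the predictions, and compute its accuracy
* Build groups from gender and age band together, write a function that computes the predictions, and compute its accuracy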
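
All four exercises share one pattern: compute the mean rating per (movie, group) and fall back to the movie's overall mean, then to 3.0. One possible way to express that pattern once is sketched below on made-up data. `make_group_model`, its parameters, and the toy frames are hypothetical names introduced here; the notebook's own solutions follow as separate functions.

```python
import pandas as pd

# Toy stand-ins for `users` and `merged_ratings` (assumed values)
toy_users = pd.DataFrame(
    {'sex': ['F', 'M', 'F'], 'occupation': ['artist', 'writer', 'artist']},
    index=pd.Index([1, 2, 3], name='user_id'))
toy_ratings = pd.DataFrame({
    'user_id':  [1, 2, 3, 1],
    'movie_id': [10, 10, 10, 20],
    'rating':   [4, 2, 5, 3],
})
toy_merged = toy_ratings.merge(toy_users.reset_index(), on='user_id')

def make_group_model(merged, users, group_cols, default=3.0):
    """Hypothetical helper: build model(user_id, movie_id) from group means."""
    group_mean = merged.groupby(['movie_id'] + group_cols)['rating'].mean()
    movie_mean = merged.groupby('movie_id')['rating'].mean()

    def model(user_id, movie_id):
        # Full key: (movie_id, value of each grouping column for this user)
        key = (movie_id, *users.loc[user_id, group_cols])
        if key in group_mean.index:
            return group_mean.loc[key]
        if movie_id in movie_mean.index:   # group unseen: movie-level mean
            return movie_mean.loc[movie_id]
        return default                     # movie unseen: constant fallback
    return model

by_sex = make_group_model(toy_merged, toy_users, ['sex'])
by_sex_occ = make_group_model(toy_merged, toy_users, ['sex', 'occupation'])
print(by_sex(1, 10), by_sex(2, 20), by_sex(2, 99), by_sex_occ(3, 10))
```

A model built this way has the `(user_id, movie_id)` signature that `score` expects.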

In [None]:
# occupation group
oc_mean = merged_ratings[['movie_id','occupation','rating']].groupby(['movie_id','occupation'])['rating'].mean()
oc_mean

movie_id  occupation   
1         administrator    4.050000
          artist           4.000000
          doctor           3.500000
          educator         3.694444
          engineer         4.000000
                             ...   
1675      other            3.000000
1676      other            2.000000
1680      student          2.000000
1681      writer           3.000000
1682      engineer         3.000000
Name: rating, Length: 16799, dtype: float64

In [None]:
def cf_occupation(user_id, movie_id):
    if movie_id in rating_matrix:
        occupation = users.loc[user_id]['occupation']
        if occupation in oc_mean[movie_id]:
            oc_rating = oc_mean[movie_id][occupation]  # mean rating of this movie within the user's occupation group
        else:
            # nobody with this occupation rated the movie: fall back to its overall training mean
            oc_rating = train_mean[movie_id]
    else:  # the movie has no ratings in the training set
        oc_rating = 3.
    return oc_rating

In [None]:
score(cf_occupation)  # worse than both the overall mean and the gender grouping

1.1191632173936357

In [None]:
# gender x occupation group
gnd_oc_mean = merged_ratings[['movie_id','sex','occupation','rating']].groupby(['movie_id','sex','occupation'])['rating'].mean()
gnd_oc_mean

movie_id  sex  occupation   
1         F    administrator    4.090909
               artist           4.250000
               educator         3.300000
               entertainment    5.000000
               executive        3.000000
                                  ...   
1675      M    other            3.000000
1676      M    other            2.000000
1680      M    student          2.000000
1681      M    writer           3.000000
1682      M    engineer         3.000000
Name: rating, Length: 22569, dtype: float64

In [None]:
gnd_oc_mean[1]['F']['artist']

4.25
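
Chained lookups like the one above work level by level, but membership tests behave differently. On a Series with a multi-level index, `x in series` checks only the outermost remaining level. The sketch below illustrates this on made-up data; `toy` and `toy_mean` are assumed names, and the behaviour described is that of pandas' `MultiIndex` membership as I understand it.

```python
import pandas as pd

# Toy ratings with two grouping features (assumed data)
toy = pd.DataFrame({
    'movie_id':   [1, 1, 1, 2],
    'sex':        ['F', 'F', 'M', 'M'],
    'occupation': ['artist', 'artist', 'writer', 'artist'],
    'rating':     [4, 5, 3, 2],
})
toy_mean = toy.groupby(['movie_id', 'sex', 'occupation'])['rating'].mean()

per_movie = toy_mean.loc[1]             # index levels left: (sex, occupation)

outer_hit = 'F' in per_movie            # outer level (sex) is tested
inner_hit = 'artist' in per_movie       # inner level is NOT tested by `in`

# Testing the complete key against the index covers every level
full_hit = (1, 'F', 'artist') in toy_mean.index
full_miss = (1, 'M', 'artist') in toy_mean.index

value = toy_mean.loc[(1, 'F', 'artist')]
print(outer_hit, inner_hit, full_hit, full_miss, value)
```

When a group is defined by more than one feature, checking the full tuple against `.index` is therefore the safer test.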

In [None]:
def cf_gender_occupation(user_id, movie_id):
    if movie_id in rating_matrix:
        gender = users.loc[user_id]['sex']
        occupation = users.loc[user_id]['occupation']
        # Test the full (movie, sex, occupation) key. `occupation in gnd_oc_mean[movie_id]`
        # checks only the outer `sex` level and never matches an occupation; the score
        # recorded below was produced with that check, so a rerun may differ.
        key = (movie_id, gender, occupation)
        if key in gnd_oc_mean.index:
            gnd_oc_rating = gnd_oc_mean.loc[key]  # mean rating of this movie within the user's gender x occupation group
        elif gender in g_mean[movie_id]:
            gnd_oc_rating = g_mean[movie_id][gender]
        elif occupation in oc_mean[movie_id]:
            gnd_oc_rating = oc_mean[movie_id][occupation]
        else:
            gnd_oc_rating = train_mean[movie_id]

    else:  # the movie has no ratings in the training set
        gnd_oc_rating = 3.
    return gnd_oc_rating

In [None]:
score(cf_gender_occupation)

1.0375228952383524

In [None]:
merged_ratings['age_gr'] = merged_ratings['age'] // 10
merged_ratings

,user_id,movie_id,rating,age,sex,occupation,zip_code,age_gr
0,655,1005,4,50,F,healthcare,60657,5
1,655,410,2,50,F,healthcare,60657,5
2,655,909,3,50,F,healthcare,60657,5
3,655,610,4,50,F,healthcare,60657,5
4,655,930,2,50,F,healthcare,60657,5
...,...,...,...,...,...,...,...,...
74995,662,1511,4,55,M,librarian,19102,5
74996,662,100,5,55,M,librarian,19102,5
74997,662,93,5,55,M,librarian,19102,5
74998,662,985,4,55,M,librarian,19102,5


In [None]:
# age-band group (age in decades)
age_mean = merged_ratings[['movie_id','age_gr','rating']].groupby(['movie_id','age_gr'])['rating'].mean()
age_mean

movie_id  age_gr
1         1         3.645161
          2         3.931973
          3         4.065217
          4         3.666667
          5         3.833333
                      ...   
1675      1         3.000000
1676      1         2.000000
1680      1         2.000000
1681      2         3.000000
1682      2         3.000000
Name: rating, Length: 7226, dtype: float64

In [None]:
users['age_gr'] = users['age']//10
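
Integer division by 10 maps each age to its decade, so ages 20 to 29 all fall in group 2. A tiny sketch with made-up ages (`ages` and `age_gr` here are illustrative names, separate from the notebook's columns):

```python
import pandas as pd

# Toy ages (assumed values)
ages = pd.Series([7, 24, 33, 55], name='age')

# Floor division by 10 gives the decade bucket: 0-9 -> 0, 20-29 -> 2, ...
age_gr = ages // 10
print(age_gr.tolist())
```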

In [None]:
def cf_agegroup(user_id, movie_id):
    if movie_id in rating_matrix:
        age = users.loc[user_id]['age_gr']
        if age in age_mean[movie_id]:
            age_rating = age_mean[movie_id][age]  # mean rating of this movie within the user's age band
        else:
            # nobody in this age band rated the movie: fall back to its overall training mean
            age_rating = train_mean[movie_id]

    else:  # the movie has no ratings in the training set
        age_rating = 3.
    return age_rating
    
score(cf_agegroup)

1.0578401448842032

In [None]:
# gender x age-band group
gnd_age_mean = merged_ratings[['movie_id','sex','age_gr','rating']].groupby(['movie_id','sex','age_gr'])['rating'].mean()
gnd_age_mean

movie_id  sex  age_gr
1         F    1         4.000000
               2         3.942857
               3         4.047619
               4         3.235294
               5         3.625000
                           ...   
1675      M    1         3.000000
1676      M    1         2.000000
1680      M    1         2.000000
1681      M    2         3.000000
1682      M    2         3.000000
Name: rating, Length: 11417, dtype: float64

In [None]:
gnd_age_mean[1]['F'][2]

3.942857142857143

In [None]:
def cf_gender_age(user_id, movie_id):
    if movie_id in rating_matrix:
        gender = users.loc[user_id]['sex']
        age = users.loc[user_id]['age_gr']
        # Test the full (movie, sex, age band) key. `age in gnd_age_mean[movie_id]`
        # checks only the outer `sex` level and never matches an age band; the score
        # recorded below was produced with that check, so a rerun may differ.
        key = (movie_id, gender, age)
        if key in gnd_age_mean.index:
            gnd_age_rating = gnd_age_mean.loc[key]  # mean rating of this movie within the user's gender x age-band group
        elif age in age_mean[movie_id]:
            gnd_age_rating = age_mean[movie_id][age]
        elif gender in g_mean[movie_id]:
            gnd_age_rating = g_mean[movie_id][gender]
        else:
            gnd_age_rating = train_mean[movie_id]

    else:  # the movie has no ratings in the training set
        gnd_age_rating = 3.
    return gnd_age_rating

In [None]:
score(cf_gender_age)

1.059423472448191

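##### - From lowest to highest RMSE as recorded above: overall average < gender < gender × occupation < age < age × gender < occupation. Lower is better, so none of the demographic groupings improved on the overall average.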