<a href="https://colab.research.google.com/github/PingPingE/Dacon-Book-Recommendation/blob/main/book_recommendation_using_categorical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Goal: Predict Book-Rating (0~10)
### Dataset Info
- ID : unique sample ID
- User-ID : unique user ID
- Book-ID : unique book ID
- User information
  - Age : age
  - Location : region
- Book information
  - Book-Title : book title
  - Book-Author : book author
- Year-Of-Publication : year the book was published (-1 means missing or unknown)
- Publisher : publisher
- Book-Rating : rating the user gave the book (0 to 10)
  - Note: a rating of 0 means the user has no interest in the book and no connection to it

## This time, let's add categorical variables such as publisher and location

In [1]:
from google.colab import drive

drive.mount('/content/drive') 

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')

In [3]:
!unzip drive/MyDrive/recommend.zip -d data/

Archive:  drive/MyDrive/recommend.zip
  inflating: data/sample_submission.csv  
  inflating: data/test.csv           
  inflating: data/train.csv          


## Load Data

In [4]:
train_pd = pd.read_csv('data/train.csv')
test_pd = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')

In [5]:
def get_splited_data(x):
  return x.replace(' ','').strip().split(',')

In [39]:
# Split the Location column ("city, state, country") into three categorical columns.
# The split is computed once and reused for all three columns.
location_parts = train_pd.Location.apply(get_splited_data)
train_pd['city'] = location_parts.str[0].astype('category')
train_pd['state'] = location_parts.str[1].astype('category')
train_pd['country'] = location_parts.str[2].astype('category')

In [40]:
train_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 871393 entries, 0 to 871392
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype   
---  ------               --------------   -----   
 0   ID                   871393 non-null  object  
 1   User-ID              871393 non-null  object  
 2   Book-ID              871393 non-null  object  
 3   Book-Rating          871393 non-null  int64   
 4   Age                  871393 non-null  float64 
 5   Location             871393 non-null  object  
 6   Book-Title           871393 non-null  object  
 7   Book-Author          871393 non-null  object  
 8   Year-Of-Publication  871393 non-null  float64 
 9   Publisher            871393 non-null  object  
 10  city                 871393 non-null  category
 11  state                871393 non-null  category
 12  country              871393 non-null  category
 13  Age_category         871393 non-null  int64   
dtypes: category(3), float64(2), int64(2), object(7)
memo

In [41]:
# Count rows per book where Year-Of-Publication is missing or unknown (-1)
train_pd.loc[train_pd['Year-Of-Publication']==-1].groupby(['Book-ID']).count().sort_values('ID',ascending=False)

Book-ID,ID,User-ID,Book-Rating,Age,Location,Book-Title,Book-Author,Year-Of-Publication,Publisher,city,state,country,Age_category
BOOK_102531,180,180,180,180,180,180,180,180,180,180,180,180,180
BOOK_045458,90,90,90,90,90,90,90,90,90,90,90,90,90
BOOK_082346,87,87,87,87,87,87,87,87,87,87,87,87,87
BOOK_178258,71,71,71,71,71,71,71,71,71,71,71,71,71
BOOK_175634,63,63,63,63,63,63,63,63,63,63,63,63,63
...,...,...,...,...,...,...,...,...,...,...,...,...,...
BOOK_141751,1,1,1,1,1,1,1,1,1,1,1,1,1
BOOK_041502,1,1,1,1,1,1,1,1,1,1,1,1,1
BOOK_041452,1,1,1,1,1,1,1,1,1,1,1,1,1
BOOK_141981,1,1,1,1,1,1,1,1,1,1,1,1,1


## EDA

### Country Distribution

In [46]:
train_pd.city.cat.categories

Index(['', '&#304;stanbul', '&#321;ód&#378;', '&#36149;&#28207;', '*',
       '***********', '-', '--', '-----', '---------',
       ...
       'århus', 'århusv', 'élmédano', 'évora', 'óbidos', 'öhringen', 'örebro',
       'øverbygd', 'überallstadt', 'ýzmir'],
      dtype='object', length=13635)

In [47]:
train_pd.state.cat.categories

Index(['', '"n/a', '"n/a"', '"n/a".', '"n/a".)', '"n/a`', '&#24191;&#35199;',
       '&#322;ódzkie', '(alacant)', '(porto)',
       ...
       'zurich', 'zürcherunterland', 'zürich', 'álava', 'îledefrance',
       'östergötland', 'østfold', 'østjylland', 'østlandet', 'ýçanadolu'],
      dtype='object', length=1787)

In [45]:
train_pd.country.cat.categories

Index(['', '"n/a"', '61men', 'aberdeenshire', 'afghanistan', 'alabama',
       'alachua', 'alaska', 'albania', 'alberta',
       ...
       'wisconsin', 'worcester', 'x', 'ysa', 'yu-song', 'yugoslavia', 'zambia',
       'zapopan', 'zimbabwe', 'álava'],
      dtype='object', length=347)

In [48]:
list(train_pd.country.cat.categories)

['',
 '"n/a"',
 '61men',
 'aberdeenshire',
 'afghanistan',
 'alabama',
 'alachua',
 'alaska',
 'albania',
 'alberta',
 'alderney',
 'algeria',
 'almería',
 'america',
 'andalucia',
 'andorra',
 'antarctica',
 'antiguaandbarbuda',
 'argentina',
 'arizona',
 'arkansas',
 'aroostook',
 'aruba',
 'austin',
 'australia',
 'australiancapitalterritory',
 'austria',
 'baden-wuerttemberg',
 'bahamas',
 'bahrain',
 'bangladesh',
 'barbados',
 'bayern',
 'bc',
 'belgium',
 'belize',
 'benin',
 'berguedà',
 'berlin',
 'bermuda',
 'bolivia',
 'bosniaandherzegovina',
 'bourgogne',
 'brazil',
 'britishcolumbia',
 'brunei',
 'bulgaria',
 'burkinafaso',
 'burlington',
 'burma',
 'c',
 'ca.',
 'california',
 'cambodia',
 'cambridgeshire',
 'camden',
 'cameroon',
 'canada',
 'cananda',
 'canaryislands',
 'capeverde',
 'caribbeansea',
 'catalonia',
 'catalunya',
 'catalunyaspain',
 'caymanislands',
 'cherokee',
 'chile',
 'china',
 'co.carlow',
 'collin',
 'colombia',
 'colorado',
 'connecticut',
 'costar

In [42]:
train_pd.loc[(train_pd.city == 'n/a')|(train_pd.city=='')].count()

ID                     13972
User-ID                13972
Book-ID                13972
Book-Rating            13972
Age                    13972
Location               13972
Book-Title             13972
Book-Author            13972
Year-Of-Publication    13972
Publisher              13972
city                   13972
state                  13972
country                13972
Age_category           13972
dtype: int64

In [37]:
train_pd.loc[(train_pd.state== 'n/a')|(train_pd.state=='')].count()

ID                     36986
User-ID                36986
Book-ID                36986
Book-Rating            36986
Age                    36986
Location               36986
Book-Title             36986
Book-Author            36986
Year-Of-Publication    36986
Publisher              36986
city                   36986
state                  36986
country                36986
Age_category           36986
dtype: int64

In [34]:
train_pd.loc[(train_pd.country == 'n/a')|(train_pd.country=='')].count()

ID                     32324
User-ID                32324
Book-ID                32324
Book-Rating            32324
Age                    32324
Location               32324
Book-Title             32324
Book-Author            32324
Year-Of-Publication    32324
Publisher              32324
city                   32324
state                  32324
country                32324
Age_category           32324
dtype: int64

### Check
- Each location column (country, city, state) contains many meaningless values
  - especially the 'city' column
- I'll use only the preprocessed 'country' column for modeling
- Preprocessing
  - unify meaningless values
    - definition of a meaningless value: a null value, or a value that only a few users have
  - encoding
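
The preprocessing steps listed above could look roughly like the sketch below. It is only an illustration on toy data: the helper name `clean_country`, the `'unknown'` label, and the `min_count` threshold are my own assumptions, not something this notebook already defines.

```python
import pandas as pd

def clean_country(country: pd.Series, min_count: int = 2) -> pd.Series:
    """Map empty, 'n/a'-like, and rare country values to one 'unknown' label.

    Hypothetical helper: the label and the threshold are assumptions.
    """
    # Strip quotes and punctuation left over from the raw string, e.g. '"n/a"'.
    cleaned = country.astype(str).str.strip(' "\'.`)')
    # Treat empty strings, 'n/a', and stringified NaN as missing.
    cleaned = cleaned.where(~cleaned.isin(['', 'n/a', 'nan']), 'unknown')
    # Values shared by fewer than `min_count` rows are also treated as meaningless.
    counts = cleaned.map(cleaned.value_counts())
    return cleaned.where(counts >= min_count, 'unknown')

toy = pd.Series(['usa', 'usa', '"n/a"', '', 'canada', 'canada', '61men'])
country_clean = clean_country(toy, min_count=2)
# Integer-encode the cleaned labels through pandas category codes.
country_code = country_clean.astype('category').cat.codes
```

On the real data, `min_count` would need to be much larger than 2. Category codes are just one simple encoding; other encoders would fit here as well.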
  

# Feature Engineering

## Age Categorization
- 0: 0~9
- 1: 10~19
- 2: 20~29
- ...
- 8: 80 ~

In [10]:
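def get_age_category(age):
  # Ages of 80 and above share the last bucket (8)
  if age >= 80:
    return 8
  else:
    # Floor division gives the decade without floating-point rounding surprises
    return int(age // 10)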


Age_category = train_pd.Age.apply(get_age_category)

In [11]:
train_pd['Age_category'] = Age_category
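
The same bucketing can also be written without `apply`. The sketch below is a vectorized alternative on toy ages, assuming the ages are non-negative and not missing; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5.0, 19.0, 20.0, 34.0, 79.0, 80.0, 101.0])
# Floor-divide by 10 to get the decade, then cap at 8 so that 80+ shares one bucket.
age_category = np.minimum(ages // 10, 8).astype(int)
```

On a frame with several hundred thousand rows this should run noticeably faster than a row-wise `apply`, while producing the same categories.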


## Segmentation of Zero Book-Rating
- Hypothesis: a zero rating could mean one of three things
  1. A pessimist's rating
    - the book might not actually be terrible
  2. An optimist's or an average reader's rating
    - it means 'this book is terrible'
  3. No rating at all
    - the user forgot to rate it, or maybe it was a mistake


### zero rating ratio of each reader

In [None]:
train_pd['is_zero'] = (train_pd['Book-Rating'] == 0).astype(int)

In [None]:
seg_pd = train_pd[['is_zero','User-ID']].groupby('User-ID').agg(is_zero_count = ('is_zero','count'),
 is_zero_sum = ('is_zero','sum'))

In [None]:
seg_pd['zero_ratio'] = seg_pd['is_zero_sum']/seg_pd['is_zero_count']

In [None]:
sns.kdeplot(x='zero_ratio', data=seg_pd.loc[seg_pd['is_zero_count']>3])
plt.show()

In [None]:
seg_pd.loc[seg_pd['is_zero_count']>3].describe()

In [None]:
seg_pd.info()

### Who is a pessimist?
  - A user with more than three ratings who gave a zero rating to almost every book they read (80% or more)
    

In [None]:
seg_pd['is_pessimist'] = ((seg_pd['is_zero_count'] >3) & (seg_pd['zero_ratio']>=0.8)).astype(int)
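
To make the rule concrete, here is the same logic on a small made-up data set. The users `u1` to `u3` and their ratings are invented for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'User-ID': ['u1'] * 5 + ['u2'] * 4 + ['u3'] * 2,
    'Book-Rating': [0, 0, 0, 0, 7, 0, 8, 9, 6, 0, 0],
})
toy['is_zero'] = (toy['Book-Rating'] == 0).astype(int)

# Per user: total number of ratings and number of zero ratings.
seg = toy.groupby('User-ID').agg(is_zero_count=('is_zero', 'count'),
                                 is_zero_sum=('is_zero', 'sum'))
seg['zero_ratio'] = seg['is_zero_sum'] / seg['is_zero_count']

# Pessimist: more than three ratings and at least 80% of them are zero.
seg['is_pessimist'] = ((seg['is_zero_count'] > 3) &
                       (seg['zero_ratio'] >= 0.8)).astype(int)
```

Here `u1` (4 zeros out of 5) is flagged, `u2` (1 out of 4) is not, and `u3` is not flagged either, because two ratings are too few to judge even though both are zero.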

In [None]:
seg_pd.loc[seg_pd.is_pessimist == 1]

In [None]:
merge_pd = pd.merge(seg_pd.reset_index(),train_pd,how='right',on='User-ID').fillna(0)

In [None]:
merge_pd

## User info Aggregation

In [None]:
user_info = merge_pd.groupby('User-ID').agg(user_count=('ID','count'),
                                            user_median = ('Book-Rating','median'),
                                            user_mad = ('Book-Rating', 'mad'),
                                            )

In [None]:
user_info
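
One caveat about the `'mad'` aggregation: as far as I know, it computed the mean absolute deviation around the mean, and it was deprecated in pandas 1.5 and removed in pandas 2.0. If the cell above fails on a newer pandas, an explicit function gives the same statistic. The name `mean_abs_dev` and the toy frame are assumptions for this sketch.

```python
import pandas as pd

def mean_abs_dev(s: pd.Series) -> float:
    """Mean absolute deviation around the mean (what the old 'mad' computed)."""
    return float((s - s.mean()).abs().mean())

toy = pd.DataFrame({
    'User-ID': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'Book-Rating': [0, 5, 10, 8, 8],
})
toy_user_info = toy.groupby('User-ID').agg(
    user_count=('ID', 'count'),
    user_median=('Book-Rating', 'median'),
    user_mad=('Book-Rating', mean_abs_dev),  # replaces the 'mad' string
)
```

The same replacement would apply to `book_mad` in the book aggregation. A Python callable is slower than a built-in aggregation, so this may take a while on the full data.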

In [None]:
user_info.reset_index(inplace=True)
user_info

## Book info Aggregation

In [None]:
book_info = merge_pd.groupby('Book-ID').agg(book_count=('ID','count'),
                                            book_median = ('Book-Rating','median'),
                                            book_mad = ('Book-Rating', 'mad'),
                                            )

In [None]:
book_info.reset_index(inplace=True)
book_info

In [None]:
merge_pd2 = pd.merge(pd.merge(merge_pd, user_info, how='left', on='User-ID'), book_info, how='left', on='Book-ID')

In [None]:
merge_pd2

# Modeling

In [None]:
merge_pd2.info()

### check consistency

In [None]:
merge_pd2.loc[merge_pd2.is_zero_count != merge_pd2.user_count]

### check duplications

In [None]:
merge_pd2.groupby('ID').agg(count=('ID', 'count')).sort_values('count', ascending=False)

### split data set

In [None]:
feature_cols = ['is_zero_sum','is_pessimist', 'Age', 'Age_category', 'user_count','user_median','user_mad', 'book_count','book_median','book_mad' ]
target_pd = merge_pd2[['ID']+feature_cols+['Book-Rating']]

In [None]:
target_pd

In [None]:
X = target_pd[feature_cols]
y= target_pd['Book-Rating']

In [None]:
N = 710

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state= N)
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(X, y):
    train_set = target_pd.iloc[train_index]
    valid_set = target_pd.iloc[test_index]

#### check the distribution of train and valid set

In [None]:
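plt.figure(figsize=(10,5))
# fill=True replaces the deprecated shade=True; labels tell the two curves apart
sns.kdeplot(valid_set['Book-Rating'], fill=True, label='valid')
sns.kdeplot(train_set['Book-Rating'], fill=True, label='train')
plt.legend()
plt.show()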

### Train

In [None]:
import xgboost as xgb
from xgboost import plot_importance
# Create the XGBoost regressor.
model = xgb.XGBRegressor(n_estimators=100,
                         learning_rate=0.01,
                         subsample=0.8,
                         colsample_bytree=1,
                         max_depth=7)

In [None]:
model.fit(train_set[feature_cols], train_set['Book-Rating'])

In [None]:
plot_importance(model)

#### Valid

In [None]:
valid_pred = model.predict(valid_set[feature_cols])#.astype(int)

In [None]:
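def rmse(real, predict) -> float:
    # Convert both inputs to arrays so that plain lists work as well
    real = np.asarray(real, dtype=float)
    pred = np.asarray(predict, dtype=float)
    return float(np.sqrt(np.mean((real - pred) ** 2)))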
  
rmse(valid_set['Book-Rating'].to_numpy(), valid_pred)

## Test
- Can't we just reuse user_info and book_info, which were built from the train data?
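
One thing to keep in mind when reusing the train aggregates: a left merge leaves NaN for any user or book that never appeared in train. The sketch below shows this on toy data, together with one possible fallback. The fallback values (a count of zero and the median of the per-user medians) are assumptions, not a rule this notebook applies.

```python
import pandas as pd

toy_user_info = pd.DataFrame({'User-ID': ['u1', 'u2'],
                              'user_count': [3, 2],
                              'user_median': [5.0, 8.0]})
toy_test = pd.DataFrame({'ID': ['t1', 't2'], 'User-ID': ['u1', 'u9']})

merged = pd.merge(toy_test, toy_user_info, how='left', on='User-ID')
# Users never seen in train get NaN for every train-derived feature.
unseen = merged['user_count'].isna()

# One possible fallback (an assumption): zero count and the median of the per-user medians.
merged.loc[unseen, 'user_count'] = 0
merged.loc[unseen, 'user_median'] = toy_user_info['user_median'].median()
```

XGBoost can handle missing values on its own, so leaving the NaNs in place is also an option. Which choice scores better would have to be checked on the validation set.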

In [None]:
merge_test = pd.merge(pd.merge(test_pd, user_info, how='left', on='User-ID'), book_info, how='left', on='Book-ID')

In [None]:
merge_test = pd.merge(merge_test, seg_pd.reset_index(), how='left', on='User-ID')

In [None]:
merge_test['Age_category'] = merge_test.Age.apply(get_age_category)

In [None]:
test_pred = model.predict(merge_test[feature_cols])

In [None]:
merge_test['Book-Rating'] = test_pred#.astype(int)

In [None]:
submission_pd = merge_test[['ID','Book-Rating']]

In [None]:
submission_pd

In [None]:
submission = submission_pd
submission

In [None]:
submission.to_csv('./submission.csv', index=False)

In [None]:
pd.read_csv('./submission.csv')

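# Ideas (to pull out and test one at a time)
- User perspective
  - Is the user a generous rater or a harsh one?
  - How many books has the user read?
    - The more they read, the better their sense for picking books that suit them, so their ratings might rise over time
  - Paid reviewers
    - negative: review-bombing rival(?) works
    - positive: giving only very generous ratings to related works
  - Country and region of residence
     - A book that clashes with the reader's own environment and sensibilities may spark curiosity and interest, but the sense of unfamiliarity could also backfire

- Book perspective
  - Is it a mainstream book?
    - almost no low ratings (0~3)
  - Is it a polarizing book?
    - two distributions exist (bimodal ratings)

- Author perspective
  - Does the author have many successful works?
    - whether many of the author's books have a high per-book summary value (median rating, etc.)

- Publisher perspective
  - A company that publishes nothing but flops
  - A company with the know-how to publish only hits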
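
Two of the ideas above, the share of low ratings per book and the number of successful books per author, can be sketched on toy data as follows. The column names `low_ratio` and `author_high_median_books` and the "median of at least 7" cutoff are assumptions made for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Book-ID': ['b1', 'b1', 'b1', 'b1', 'b2', 'b2'],
    'Book-Author': ['A', 'A', 'A', 'A', 'A', 'A'],
    'Book-Rating': [8, 9, 7, 10, 0, 9],
})
# Flag low ratings (0~3); a mainstream book should have very few of them.
toy['is_low'] = toy['Book-Rating'].between(0, 3).astype(int)

book_feat = toy.groupby(['Book-Author', 'Book-ID']).agg(
    book_median=('Book-Rating', 'median'),
    low_ratio=('is_low', 'mean'),
).reset_index()

# Per author: number of books, and how many of them have a high median rating.
author_feat = book_feat.groupby('Book-Author').agg(
    author_book_count=('Book-ID', 'count'),
    author_high_median_books=('book_median', lambda m: int((m >= 7).sum())),
)
```

The same pattern, grouped by `Publisher` instead of `Book-Author`, would cover the publisher ideas. Detecting polarizing books with two rating modes would need a different statistic than these.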