In [45]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In this notebook, we build the baseline model: a book recommendation system that mimics the `Customers who bought this book also bought these` feature. It is a simplification of the kNN algorithm, which is explained in a separate IPython notebook. First we load the train, validate, and test data. For now, this notebook uses the data for all 5 categories, but the input files can be swapped for a single category.

In [46]:
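# Read ISBN as text so that leading zeros are preserved (np.str was removed from NumPy; plain str is equivalent)
traindf = pd.read_csv("All_train.csv", dtype={'ISBN': str})
validatedf = pd.read_csv("All_validate.csv", dtype={'ISBN': str})
testdf = pd.read_csv("All_test.csv", dtype={'ISBN': str})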

First, check that the data has loaded correctly.

In [53]:
len(traindf['Title'].unique())
traindf.head()

Unnamed: 0,Title,ID,Date,Class,Category,Author,ISBN,Publisher,Pub_Date,Order_Time,Count,Cart,Cart_Date,Device,Address1,Address2,user_purchase_count,book_sell_count
0,데프콘 한미전쟁 세트,168,20141108,주문,국내문학,<김경진> 등저,9788956372389,씨앗을뿌리는사람,20000229,20,1,N,,기기PID_PC,인천광역시,계양구,1,1
1,다이버전트,186,20141003,주문,해외문학,<베로니카 로스> 저/<이수현> 역,9788956607108,은행나무,20130807,18,1,N,,기기PID_PC,제주특별자치도,서귀포시,1,20
2,"스피라, 세계를 향한 영혼의 승부",188,20140929,주문,자기계발,<김한철> 저,9788925539775,랜덤하우스코리아,20100916,19,1,N,,기기PID_PC,인천광역시,강화군,1,2
3,이상한 나라의 앨리스,193,20140910,주문,해외문학,<루이스 캐럴> 저/<김양미> 역/<김민지> 그림,9788992632126,인디고(글담),20071220,21,1,N,,기기PID_PC,인천광역시,서구,1,67
4,노인과 바다,217,20140904,주문,해외문학,<어니스트 헤밍웨이> 저/<이종인> 역,9788932911984,열린책들,20120210,15,1,N,,기기PID_PC,경기도,수원시 장안구,1,68
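Beyond eyeballing `head()`, a few quick checks can confirm the split is usable. The following is a minimal sketch, not part of the original notebook: `summarize_split` and the toy frames are hypothetical names and values that stand in for `traindf` and `testdf`. It reports the number of distinct titles and customers, whether `ISBN` was read as text, and which test customers never appear in the training data (the scoring step later looks up every test customer in the recommendation table).

```python
import pandas as pd

def summarize_split(train, test):
    """Return a few sanity-check numbers for a train/test pair (hypothetical helper)."""
    train_ids = set(train['ID'])
    test_ids = set(test['ID'])
    return {
        'n_titles': train['Title'].nunique(),
        'n_customers': train['ID'].nunique(),
        # True only if every ISBN survived as a string (no float/int conversion)
        'isbn_is_text': all(isinstance(v, str) for v in train['ISBN']),
        # Test customers with no training history cannot receive recommendations
        'test_ids_missing_from_train': sorted(test_ids - train_ids),
    }

# Toy stand-ins for traindf / testdf (made-up values, for illustration only)
toy_train = pd.DataFrame({
    'ID': [1, 1, 2],
    'Title': ['A', 'B', 'A'],
    'ISBN': ['0001', '0002', '0001'],
})
toy_test = pd.DataFrame({
    'ID': [2, 3],
    'Title': ['B', 'C'],
    'ISBN': ['0002', '0003'],
})

summary = summarize_split(toy_train, toy_test)
print(summary)
```

On the real data the same call would be `summarize_split(traindf, testdf)`.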


The baseline algorithm itself is very simple. For each customer in the training set, we collect all of their previous transactions, which gives the set of books that customer has purchased. We then find every other customer who bought at least one book in this set; these form the `set of close customers`. Finally, we gather the books bought by the close customers and count how often each title occurs. We sort the titles by this frequency, drop the ones the customer already owns, and recommend the top three to the customer.

In [47]:
%%time
baseline = {}
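for cus in traindf['ID'].unique():  # visit each customer once, not once per transaction

    bookrec = []
    # Books this customer has already bought
    history = list(traindf[traindf['ID']==cus]['Title'])
    # Close customers: everyone who bought one of those books
    relatives = set(traindf[traindf['Title'].isin(history)]['ID'])
    relatives.discard(cus)  # remove myself
    # Count transactions per title among close customers, most frequent first
    setdf = traindf[traindf['ID'].isin(relatives)].groupby('Title').size().sort_values(ascending=False)

    # Do not recommend if I bought the book already
    for book in list(setdf.index):
        if book not in history:
            bookrec.append(book)

    baseline[cus] = bookrec[:3]  # keep the top three titles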

CPU times: user 37min 19s, sys: 19.9 s, total: 37min 39s
Wall time: 38min
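Most of the 38 minutes goes into re-filtering the whole dataframe for every customer. One possible way to cut this down, sketched below under stated assumptions, is to build two lookup tables once (books per customer, customers per book) and count with `collections.Counter`. The function name `recommend_also_bought` and the toy data are hypothetical. Note one deliberate difference from the cell above: this sketch counts each close customer at most once per title, whereas `groupby('Title').size()` counts every transaction row, so rankings can differ when a customer buys the same title repeatedly or when counts are tied.

```python
from collections import Counter, defaultdict
import pandas as pd

def recommend_also_bought(df, top_n=3):
    """Top-n co-purchased titles per customer (hypothetical helper)."""
    books_by_customer = defaultdict(set)
    customers_by_book = defaultdict(set)
    # One pass over the transactions builds both lookup tables
    for cus, title in zip(df['ID'], df['Title']):
        books_by_customer[cus].add(title)
        customers_by_book[title].add(cus)

    recs = {}
    for cus, history in books_by_customer.items():
        # Close customers: anyone else who bought a book from this history
        relatives = set()
        for title in history:
            relatives |= customers_by_book[title]
        relatives.discard(cus)

        # Count how many close customers bought each title not yet owned
        counts = Counter()
        for other in relatives:
            counts.update(books_by_customer[other] - history)
        recs[cus] = [title for title, _ in counts.most_common(top_n)]
    return recs

# Toy transactions (made-up values, for illustration only)
toy = pd.DataFrame({
    'ID':    [1, 1, 2, 2, 3, 3, 4],
    'Title': ['A', 'B', 'A', 'C', 'A', 'C', 'D'],
})
recs = recommend_also_bought(toy)
print(recs)
```

Customer 4 shares no book with anyone, so the recommendation list for that customer is empty.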


In [48]:
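# Materialize the (ID, recommendations) pairs as a list so the constructor accepts them on Python 3 as well
baseline_df = pd.DataFrame(list(baseline.items()), columns=['ID', 'Rec'])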
baseline_df.head()

Unnamed: 0,ID,Rec
0,98304,"[1cm+, 창문 넘어 도망친 100세 노인, 여덟 단어]"
1,85108,"[심리학의 즐거움, 소크라테스의 변명, 살아갈 날들을 위한 공부]"
2,32770,"[창문 넘어 도망친 100세 노인, 어떤 하루, 내가 알고 있는 걸 당신도 알게 된다면]"
3,76459,"[내가 사랑한 유럽 TOP10, 속죄, 나를 지켜낸다는 것]"
4,32772,"[창문 넘어 도망친 100세 노인, 나는 까칠하게 살기로 했다, 미 비포 유]"


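Once we have the dataframe of recommended books, we can score the recommendations against the test set. For each test transaction, we take that customer's test purchases and compute the fraction that appear among the customer's three recommended titles, then average these fractions. For the data including all 5 categories, this accuracy is about 3.2%.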

In [49]:
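score = []
for cus in testdf['ID']:
    # Titles this customer actually bought in the test set
    actual = list(testdf[testdf['ID'] == cus]['Title'])
    # Top-three recommendations for the same customer (.item() requires exactly one matching row)
    pred = baseline_df[baseline_df['ID']==cus]['Rec']
    # Fraction of the actual purchases that were recommended
    score.append(len(list(set(actual).intersection(pred.item())))/float(len(actual)))

print(np.mean(score))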

0.031786074672
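To make the score concrete, here is a small worked sketch of the same per-customer fraction on made-up data. `hit_fraction`, `toy_actual`, and `toy_recs` are hypothetical names, not part of the notebook. Unlike the cell above, this version gives a score of 0 to a customer with no recommendations instead of raising an error, and it averages once per customer rather than once per test transaction.

```python
def hit_fraction(actual, recommended):
    """Fraction of the actual purchases that appear among the recommendations."""
    if not actual:
        return 0.0
    return len(set(actual) & set(recommended)) / float(len(actual))

# Made-up test purchases and recommendations, for illustration only
toy_actual = {1: ['A', 'B'], 2: ['C'], 3: ['D']}
toy_recs = {1: ['A', 'X', 'Y'], 2: ['C', 'A', 'B']}  # customer 3 has none

toy_scores = {
    cus: hit_fraction(titles, toy_recs.get(cus, []))
    for cus, titles in toy_actual.items()
}
toy_mean = sum(toy_scores.values()) / len(toy_scores)
print(toy_scores, toy_mean)
```

Customer 1 gets 0.5 (one of two purchases was recommended), customer 2 gets 1.0, and customer 3 gets 0.0, so the mean is 0.5.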


In [6]:
score.sort(reverse=True)