# Book Recommendation with Machine Learning
<img src = "https://img.grouponcdn.com/seocms/ajPaKzNP9U8cyrZ5vyexnD/10-Handwritten-Book-Recommendations-from-Unabridged-Bookstore_600c390-600x390.jpeg">

##1. Overview
Online bookstores in Korea are as prosperous as in any other country. They account for about 36% of the entire Korean book market, and this share is growing consistently every year. <i>Yes24</i> is the leading online bookstore in Korea, with about 44% of the online book market. However, <i>Yes24</i>'s book recommender system performs poorly: it often recommends totally irrelevant books, or even products that are not books. <i>Yes24</i> recently opened a competition in which data scientists propose solutions for an accurate book recommender, and released the 2014 purchase history of 19,000 users. This motivated us to build a book recommender algorithm using the techniques we learned throughout the semester in CS109. <br><br>
We tried three different algorithms: the first is the simplest algorithm, which <i>Yes24</i> supposedly uses on its current website, and the other two are based on k-nearest neighbors (kNN) with user similarity and book similarity, respectively. We found that kNN with user similarity has the best recommendation accuracy.

##2. Data
The data provided by <i>Yes24</i> contains 511,742 transactions from the purchase histories of about 19,000 users, involving about 19,000 books. Each transaction has the following fields:

- Customer ID: unique, encrypted customer identification `ID`
- Transaction Date `Date`
- Book Title `Title`
- Type of transaction: takes either ordered, cancelled, or refunded `Class`
- Book Category: Foreign literature, Domestic literature, Religion, Self-Development, or Humanities `Category`
- Author: `Author`
- ISBN code of book. '-' or 'nan' if not available `ISBN`
- Publisher `Publisher`
- Published date `Pub_Date`
- Transaction Time `Order_Time`
- Number of books in the transaction: Positive integer if purchased, and negative if refunded or cancelled. `Count`
- Whether or not a book has been in cart: 1 if so, 0 if not `Cart`
- Date when a book was added to cart `Cart_Date`
- Device used to make a transaction `Device`
- Address: `Address1` and more detailed address in `Address2`



##3. Data Cleaning

The first important step is cleaning the raw data. In our case, we have `cancel` and `return` transactions along with `purchase` ones. However, `cancel` and `return` transactions should not be considered, since book recommendations rely heavily on what the customer has actually purchased. Moreover, when we remove the `cancel` and `return` transactions, the corresponding `purchase` transactions should be removed as well.

Originally, we had

| Original Set           | Purchase            |  Cancel              |  Return              |
| :--------------------: |:-------------------:| :-------------------:| :-------------------:|
| No. of Transactions    | 511,742             | 19,264               | 3,877                |

and once we remove all the `cancel` and `return` transactions along with the corresponding `purchase` transactions, we end up with

| Cleaned Set            | Purchase            |  Cancel              |  Return              |
| :--------------------: |:-------------------:| :-------------------:| :-------------------:|
| No. of Transactions    | 488,874             | 55                   | 218                  |

As can be observed, we still have some `cancel` and `return` transactions. This might be due to the way the provider, <i>Yes24</i>, split the data into training and test sets: the remaining 273 cancel and return transactions do not have corresponding purchase transactions in the given data. We drop these 273 transactions as well and only consider the 488,874 purchase transactions. For the purpose of this project, we only take into account the following 5 large categories: `Domestic Literature`, `Humanities`, `Self Development`, `Religion`, `International Literature`.
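The cleaning step described above can be sketched in a few lines of plain Python. This is only an illustration, not the code from our notebooks: it does not reproduce how a cancel or return was actually paired with its purchase, so the sketch assumes a match on the same `ID` and `Title`, and the function name `drop_cancelled` is hypothetical.

```python
from collections import defaultdict


def drop_cancelled(transactions):
    """Return (kept_purchases, n_unmatched).

    Assumption: a cancel/return corresponds to a purchase with the same
    (ID, Title). Each cancel/return removes one such purchase; those with
    no matching purchase are counted and dropped.
    """
    # Positions of purchase rows, grouped by (customer, title), in input order.
    purchases = defaultdict(list)
    for i, t in enumerate(transactions):
        if t["Class"] == "purchase":
            purchases[(t["ID"], t["Title"])].append(i)

    removed = set()
    unmatched = 0
    for t in transactions:
        if t["Class"] in ("cancel", "return"):
            candidates = purchases[(t["ID"], t["Title"])]
            if candidates:
                # Remove the last listed matching purchase.
                removed.add(candidates.pop())
            else:
                unmatched += 1

    kept = [t for i, t in enumerate(transactions)
            if t["Class"] == "purchase" and i not in removed]
    return kept, unmatched
```

Under this assumption, the second return value plays the role of the leftover cancel and return transactions that have no purchase to pair with.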

Once the data cleaning is done, we can move on to the training-validation-test split, following the approach we used in the homework. We consider customers with at least $N$ purchases. For each such customer, we randomly choose $M$ transactions for the validation set and randomly select another $L$ transactions (disjoint from the first $M$) for the test set. $N$, $M$, and $L$ differ from category to category and were determined heuristically. A more detailed explanation, along with executable code, can be found in the IPython notebook `Data_Exploration.ipynb`. The resulting numbers of transactions in the training, validation, and test sets are shown below.



| Category               | Training Set        |  Validation Set      |  Test Set            |  
| :--------------------: |:-------------------:| :-------------------:| :-------------------:| 
| Foreign Literature     | 28,851 | 3,420 | 2,280 |
| Korean Literature      |21,726 | 1,818 | 1,212|
| Religion               |10,394 | 1,268 | 951 |
| Self-Development       | 15,808 | 1,592 | 1,194|
| Humanities             |15,053 |1,488 | 1,116 |
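The per-customer split can be sketched as follows. This is a minimal illustration rather than the code in `Data_Exploration.ipynb`: the function and parameter names are hypothetical (`n_min`, `n_val`, `n_test` stand for $N$, $M$, $L$), and it assumes that customers with fewer than $N$ purchases stay entirely in the training set and that $N > M + L$.

```python
import random
from collections import defaultdict


def split_transactions(purchases, n_min, n_val, n_test, seed=0):
    """Split purchase rows into (train, validate, test) lists per customer."""
    if n_min <= n_val + n_test:
        # Assumption: every selected customer keeps at least one training row.
        raise ValueError("n_min must exceed n_val + n_test")

    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in purchases:
        by_user[row["ID"]].append(row)

    train, validate, test = [], [], []
    for rows in by_user.values():
        val_idx, test_idx = set(), set()
        if len(rows) >= n_min:
            # Shuffle positions once so the two held-out sets are disjoint.
            idx = list(range(len(rows)))
            rng.shuffle(idx)
            val_idx = set(idx[:n_val])
            test_idx = set(idx[n_val:n_val + n_test])
        for i, row in enumerate(rows):
            if i in val_idx:
                validate.append(row)
            elif i in test_idx:
                test.append(row)
            else:
                train.append(row)
    return train, validate, test
```

Drawing both held-out sets from a single shuffle is one simple way to guarantee that no transaction lands in both the validation and the test set.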


##4. Writing Book Recommender

###4.0 Evaluation of Algorithms
For each algorithm, we evaluate the recommendation accuracy $\Lambda$, defined as:
$$\Lambda = \frac{1}{N} \sum^N_{u=1} \frac{|Y_u \cap P_u|}{|Y_u|}$$
where $Y_u$ is the set of books that user $u$ purchased in the test set, $P_u$ is the set of books recommended to user $u$, and $N$ is the total number of users in the test set.<br>
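As an illustration, the metric can be computed as below. The function name and the dictionary-based inputs are assumptions made for this sketch; it also assumes that users with an empty $Y_u$ are skipped, since the ratio is undefined for them.

```python
def recommendation_accuracy(purchased, recommended):
    """Mean over test users of |Y_u & P_u| / |Y_u|.

    purchased:   dict mapping user -> iterable of books bought in the test set (Y_u)
    recommended: dict mapping user -> iterable of recommended books (P_u)
    """
    total = 0.0
    n_users = 0
    for user, books in purchased.items():
        y = set(books)
        if not y:
            continue  # assumption: users without test purchases are skipped
        p = set(recommended.get(user, ()))
        total += len(y & p) / len(y)
        n_users += 1
    return total / n_users if n_users else 0.0
```

For example, if one user bought two books and one of them was recommended, while a second user bought one book that was not recommended, the accuracy is $(1/2 + 0)/2 = 0.25$.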


###4.1 Baseline Algorithm

We start with the simplest algorithm. For each customer in the training set, we collect all of their previous transactions, which gives us the set of books purchased by that customer. We then look for all the customers who bought books in this set; these form the set of close customers. Finally, we gather the books bought by each customer in this set and compute a histogram of how often each book appears. We sort the books by their relative frequency and recommend them to the customer in that order. The executable code can be found in the IPython notebook `Base_Model.ipynb`.
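The steps above can be sketched as follows. This is a simplified illustration, not the code in `Base_Model.ipynb`: the function name, the `(user, book)` pair input, and the `top_k` cutoff are hypothetical, and it assumes that books the customer already owns are not recommended again.

```python
from collections import Counter, defaultdict


def baseline_recommend(train, user, top_k=10):
    """Recommend the books bought most often by customers who share a book with `user`.

    train: iterable of (user, book) purchase pairs.
    """
    books_by_user = defaultdict(set)
    users_by_book = defaultdict(set)
    for u, b in train:
        books_by_user[u].add(b)
        users_by_book[b].add(u)

    own = books_by_user.get(user, set())

    # Close customers: everyone who bought at least one of the user's books.
    close = set()
    for b in own:
        close |= users_by_book[b]
    close.discard(user)

    # Histogram of the books bought by the close customers.
    counts = Counter()
    for v in close:
        counts.update(books_by_user[v] - own)  # assumption: skip owned books

    # Most frequent first; ties broken by title so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [book for book, _ in ranked[:top_k]]
```

Sorting by raw counts gives the same order as sorting by relative frequency, because every count is divided by the same total.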

The recommendation accuracy for each category is shown below:

| Category               | Accuracy            |  
| :--------------------: |:-------------------:| 
| Foreign Literature     | 7.5%                | 
| Korean Literature    | 5.6%                | 
| Religion               | 1.9%                | 
| Self-Development       | 6.3%                |
| Humanities             | 4.7%                |




###4.2 kNN using book similarity

Next, we wrote a kNN algorithm that uses the similarity between books as its distance metric. For a given user, the books most similar to those the user has already purchased are selected for recommendation. Book similarity combines title, author, and publisher similarities, whose weights are determined by optimizing the recommendation accuracy on the validation set. A detailed description and working code can be found in the IPython notebook `BookSim.ipynb`. <br>
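A rough sketch of this idea is shown below. It is not the implementation in `BookSim.ipynb`, and the individual similarity measures used there are not reproduced here. The sketch assumes a word-overlap (Jaccard) similarity for titles, exact matches for author and publisher, and placeholder weights; all function names are hypothetical.

```python
def title_similarity(a, b):
    """Jaccard overlap of the words in two titles (an assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def book_similarity(x, y, weights):
    """Weighted sum of title, author, and publisher similarities."""
    w_title, w_author, w_publisher = weights
    return (w_title * title_similarity(x["Title"], y["Title"])
            + w_author * (x["Author"] == y["Author"])
            + w_publisher * (x["Publisher"] == y["Publisher"]))


def recommend_similar(catalog, purchased_titles, weights=(0.5, 0.3, 0.2), top_k=10):
    """Rank unpurchased books by their best similarity to any purchased book.

    The default weights are placeholders, not tuned values.
    """
    owned = [b for b in catalog if b["Title"] in purchased_titles]
    scored = []
    for cand in catalog:
        if cand["Title"] in purchased_titles:
            continue
        score = max((book_similarity(cand, o, weights) for o in owned), default=0.0)
        scored.append((score, cand["Title"]))
    # Highest score first; ties broken by title for a deterministic order.
    scored.sort(key=lambda st: (-st[0], st[1]))
    return [title for _, title in scored[:top_k]]
```

With a function like this, the weights could be chosen by trying a small grid of candidate triples and keeping the one with the highest recommendation accuracy on the validation set.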

The recommendation accuracy for each category is shown below:

| Category               | Accuracy            |  
| :--------------------: |:-------------------:| 
| Foreign Literature     | 4.2%                | 
| Korean Literature    | 2.6%                | 
| Religion               | 4.0%                | 
| Self-Development       | 2.4%                |
| Humanities             | 2.9%                |

###4.3 kNN using user similarity

##5. Results

<img src="result.png" alt="Drawing" style="width: 600px;"/>