# Extension 2: Cold-start
Using the supplementary book data, build a model that maps observable item attributes to the learned latent factor representation of items. To evaluate its accuracy, simulate a cold-start scenario by holding out a subset of items when training the recommender model, then compare the resulting performance with that of a full collaborative filtering model.
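The hold-out step could be sketched as below. This is only an illustration under assumptions: the function name `split_cold_start_items`, the `holdout_frac` parameter, and the toy book ids are hypothetical and do not come from the project code.

```python
import numpy as np

def split_cold_start_items(item_ids, holdout_frac=0.1, seed=0):
    """Randomly split item ids into warm (kept for training) and cold (held out)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(item_ids))
    n_cold = int(round(holdout_frac * len(shuffled)))
    cold = shuffled[:n_cold]   # items hidden from the recommender during training
    warm = shuffled[n_cold:]   # items the recommender is trained on
    return warm, cold

# Toy example with hypothetical book ids 0..99
book_ids = np.arange(100)
warm_items, cold_items = split_cold_start_items(book_ids, holdout_frac=0.2)
```

Interactions involving `cold_items` would then be removed before training the recommender, so that their latent factors have to be predicted from attributes alone.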

#### Reference:
https://github.com/MengtingWan/goodreads/blob/master/samples.ipynb
Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.

## Step 1: Create Attribute Matrix for Items

### Test on Poetry small data

In [1]:
import gzip
import json
import re
import os
import sys
import numpy as np
import pandas as pd

#### Load Data

In [2]:
def load_data(file_name, head=500):
    """Read up to `head` JSON records from a gzipped JSON-lines file."""
    count = 0
    data = []
    with gzip.open(file_name) as fin:
        for l in fin:
            d = json.loads(l)
            count += 1
            data.append(d)

            # stop once `head` records have been read (head=None reads the whole file)
            if (head is not None) and (count >= head):
                break
    return data

In [4]:
poetry = load_data('goodreads_books_poetry.json.gz')

In [11]:
poetry[0]

{'isbn': '',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': 'eng',
 'popular_shelves': [{'count': '8', 'name': 'to-read'},
  {'count': '3', 'name': 'poetry'},
  {'count': '2', 'name': 'currently-reading'},
  {'count': '1', 'name': '01-kindle'},
  {'count': '1', 'name': 'real-books'},
  {'count': '1', 'name': 'personal-library'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '3.83',
 'kindle_asin': '',
 'similar_books': [],
 'description': 'Number 30 in a series of literary pamphlets published monthly and available at the price of 15 cents per copy, or a yearly subscription (19 numbers) for $1.25',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/16037549-vision-of-sir-launfal-and-other-poems',
 'authors': [{'author_id': '15585', 'role': ''}],
 'publisher': 'Houghton, Mifflin and Company',
 'num_pages': '80',
 'publication_day': '1',
 'isbn13': '',
 'publication_month': '11',
 'edition_information': '',
 'publication_yea

In [10]:
poetry_df = pd.DataFrame(poetry)
poetry_df.head(2)

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,popular_shelves,asin,is_ebook,average_rating,kindle_asin,...,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series
0,,1,[],US,eng,"[{'count': '8', 'name': 'to-read'}, {'count': ...",,False,3.83,,...,11,,1887,https://www.goodreads.com/book/show/16037549-v...,https://images.gr-assets.com/books/1348176637m...,16037549,3,5212748,Vision of Sir Launfal and Other Poems,Vision of Sir Launfal and Other Poems
1,811223981.0,2,[],US,,"[{'count': '100', 'name': 'to-read'}, {'count'...",,False,3.83,B00U2WY9U8,...,4,,2015,https://www.goodreads.com/book/show/22466716-f...,https://images.gr-assets.com/books/1404958407m...,22466716,37,41905435,Fairy Tales: Dramolettes,Fairy Tales: Dramolettes


### Feature 1: genres

#### Extract Genres from popular_shelves

In [21]:
genres = poetry_df.loc[:,['book_id','popular_shelves']]

In [22]:
genres.head(2)

Unnamed: 0,book_id,popular_shelves
0,16037549,"[{'count': '8', 'name': 'to-read'}, {'count': ..."
1,22466716,"[{'count': '100', 'name': 'to-read'}, {'count'..."


In [27]:
extract = pd.concat([pd.DataFrame(x) for x in genres['popular_shelves']],keys=genres['book_id'])
extract.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,count,name
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
16037549,0,8,to-read
16037549,1,3,poetry
16037549,2,2,currently-reading
16037549,3,1,01-kindle
16037549,4,1,real-books
16037549,5,1,personal-library
22466716,0,100,to-read
22466716,1,6,currently-reading
22466716,2,3,drama
22466716,3,3,plays


In [51]:
extract.loc['16037549']

Unnamed: 0,count,name
0,8,to-read
1,3,poetry
2,2,currently-reading
3,1,01-kindle
4,1,real-books
5,1,personal-library


In [30]:
#number of unique genre labels
len(extract['name'].unique())

9011

In [31]:
extract['name'].describe()

count       22741
unique       9011
top       to-read
freq          498
Name: name, dtype: object

In [41]:
agg = extract.groupby('name').agg('count').sort_values(['count'], ascending=False)
agg.head(5)

Unnamed: 0_level_0,count
name,Unnamed: 1_level_1
to-read,498
poetry,497
currently-reading,328
favorites,264
owned,190


In [None]:
# Select some genres as features. In this small sample most books are poetry,
# but the full dataset will contain more of the other genres.
# This list is only an example and can be changed to match the genres dataset.
selected_genres = ['poetry','fiction','fantasy','thriller','mystery']

In [47]:
'fiction' in agg.index

True
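As a rough sketch of how the selected genres could become per-book indicator features, the snippet below builds a 0/1 table from shelf names. The small `shelves` frame only mimics the `extract` output shown above, and the names `shelves`, `selected` and `indicators` are assumptions made for illustration.

```python
import pandas as pd

# Toy long-format shelf table, mimicking a few rows of the `extract` frame
shelves = pd.DataFrame({
    'book_id': ['16037549', '16037549', '22466716', '22466716'],
    'name': ['to-read', 'poetry', 'to-read', 'drama'],
})
selected = ['poetry', 'fiction', 'fantasy', 'thriller', 'mystery']

# Keep only shelves whose name is one of the selected genres
matched = shelves[shelves['name'].isin(selected)]

# Count matches per (book, genre), add missing books/genres as 0, cap at 1
indicators = (
    pd.crosstab(matched['book_id'], matched['name'])
    .reindex(index=shelves['book_id'].unique(), columns=selected, fill_value=0)
    .clip(upper=1)
)
```

Exact name matching is the simplest choice here; shelves such as `plays` or `drama` would need an explicit mapping to a genre if they should count.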

#### Genres dataset for all books:
(The full dataset already provides built-in genre categories for each book.)

Row(book_id='5333265', genres=Row(children=None, comics, graphic=None, fantasy, paranormal=None, fiction=None, history, historical fiction, biography=1, mystery, thriller, crime=None, non-fiction=None, poetry=None, romance=None, young-adult=None)),


Row(book_id='1333909', genres=Row(children=None, comics, graphic=None, fantasy, paranormal=None, fiction=219, history, historical fiction, biography=5, mystery, thriller, crime=None, non-fiction=None, poetry=None, romance=None, young-adult=None)),

#### Using 0/1 Encoding to transform the genres to attribute matrix
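A minimal sketch of the 0/1 encoding, assuming the per-genre counts have already been flattened into one column per genre. The counts for the two books are taken from the rows shown above, but the pandas layout and the names `genre_counts` and `genre_matrix` are assumptions; only three of the genre columns are included to keep the example short.

```python
import numpy as np
import pandas as pd

# One column per genre; NaN means the book has no count for that genre
genre_counts = pd.DataFrame(
    {
        'book_id': ['5333265', '1333909'],
        'fiction': [np.nan, 219],
        'history, historical fiction, biography': [1, 5],
        'poetry': [np.nan, np.nan],
    }
).set_index('book_id')

# 1 where a positive count is present, 0 otherwise (NaN > 0 evaluates to False)
genre_matrix = (genre_counts > 0).astype(int)
```

A count threshold higher than zero could be used instead if very small counts are considered noise.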

#### Resulting Genre Attribute Matrix:

In [5]:
'''
+--------+---+---+---+---+---+---+---+---+---+---+
| book_id| g1| g2| g3| g4| g5| g6| g7| g8| g9|g10|
+--------+---+---+---+---+---+---+---+---+---+---+
| 5333265|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|
| 1333909|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|
| 7327624|  0|  0|  1|  1|  0|  1|  0|  1|  0|  0|
| 6066819|  0|  0|  0|  1|  0|  1|  0|  0|  1|  0|
|  287140|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|
|  287141|  1|  0|  1|  1|  1|  0|  0|  0|  0|  1|
|  378460|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|
| 6066812|  1|  0|  1|  1|  0|  0|  0|  0|  0|  1|
|34883016|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|
|  287149|  0|  0|  0|  0|  1|  0|  1|  0|  0|  0|
+--------+---+---+---+---+---+---+---+---+---+---+
'''

### Feature 2: author_id
The author_id can be joined with the author dataset to obtain each author's average rating.

author_df = spark.read.json('hdfs:/user/yw2115/goodreads_book_authors.json.gz')

author_df.first()

DataFrame[author_id: string, average_rating: string, name: string, ratings_count: string, text_reviews_count: string]

Row(author_id='604031', average_rating='3.98', name='Ronald J. Fields', ratings_count='49', text_reviews_count='7')
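One way the author rating could be attached to each book is sketched below in pandas. The two toy books are hypothetical, using the first listed author is an assumption, and the names `books`, `authors` and `book_author` are illustrative only. The author row reuses the record shown above.

```python
import pandas as pd

# Toy book records; 'authors' has the same list-of-dicts shape as in the book data
books = pd.DataFrame({
    'book_id': ['book_a', 'book_b'],   # hypothetical ids
    'authors': [[{'author_id': '604031', 'role': ''}], []],
})
authors = pd.DataFrame({
    'author_id': ['604031'],
    'average_rating': ['3.98'],        # stored as strings in the raw data
})

# Assumption: use the first listed author as the book's author
books['author_id'] = books['authors'].apply(
    lambda a: a[0]['author_id'] if a else None
)
authors['author_rating'] = authors['average_rating'].astype(float)

# A left join keeps books without a known author (rating becomes NaN)
book_author = books.merge(
    authors[['author_id', 'author_rating']], on='author_id', how='left'
)
```

Books with missing ratings would need an imputation rule, for example the global mean author rating, before the column is used as a feature.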

#### Result: 
Attribute matrix A of shape (I × N), with I items and N features

## Step 2: Attribute-to-feature Mapping

### 1. Load latent factor matrix I of items from recsys

### 2. Map attribute matrix to latent factor matrix

(1) KNN Mapping

In [None]:
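from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_neighbors(item_row, attribute_matrix, k):
    # cosine_similarity expects 2-D inputs, so reshape a single item to (1, n_features)
    item_row = np.asarray(item_row).reshape(1, -1)
    # similarity of the item to every row of the attribute matrix, flattened to 1-D
    cs = cosine_similarity(item_row, attribute_matrix).ravel()
    # indices of the k most similar items, highest similarity first
    k_idx = np.argsort(cs)[::-1][:k]
    score = cs[k_idx]
    return score, k_idx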
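
Once neighbors are found, a cold item's latent vector could be estimated from the latent factors of its nearest warm items. The sketch below uses a similarity-weighted average. That weighting scheme, the function name `predict_latent_factors`, and the toy matrices are assumptions rather than the notebook's method, and cosine similarity is computed with plain NumPy to keep the example self-contained.

```python
import numpy as np

def predict_latent_factors(cold_attrs, warm_attrs, warm_factors, k=2):
    """Estimate a cold item's latent vector from its k most similar warm items."""
    cold = np.asarray(cold_attrs, dtype=float)
    warm = np.asarray(warm_attrs, dtype=float)
    factors = np.asarray(warm_factors, dtype=float)

    # Cosine similarity between the cold item and every warm item
    norms = np.linalg.norm(warm, axis=1) * np.linalg.norm(cold)
    sims = np.divide(warm @ cold, norms, out=np.zeros(len(warm)), where=norms > 0)

    # Indices of the k most similar warm items, highest similarity first
    top = np.argsort(sims)[::-1][:k]
    weights = sims[top]
    if weights.sum() == 0:
        # No attribute overlap at all: fall back to a plain average
        return factors[top].mean(axis=0)
    return weights @ factors[top] / weights.sum()

# Toy data: 3 warm items with 3 binary attributes and 2 latent factors each
warm_attrs = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
warm_factors = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pred = predict_latent_factors([1, 0, 0], warm_attrs, warm_factors, k=2)
```

The predicted vector can then stand in for the held-out item's learned factors when scoring user-item pairs in the cold-start evaluation.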

    
    