## Item & User Profiles

This script does the following:<br/>
1. Create item profiles and write them to file *"item_profiles.txt"*
2. Create user profiles for all users and write them to file *"user_profiles.txt"*

In [1]:
import pandas as pd
import numpy as np
import json
import re 

### 1: Item profiles

#### Read & inspect the books data

In [2]:
df_books = pd.read_csv('data/books_metadata.csv')

In [3]:
df_books.head()

Unnamed: 0,book_id,title,series,author,description,genres,pages,publisher,firstPublishDate,awards,setting,coverImg
0,14796360,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...
1,7743507,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",870,Scholastic Inc.,06/21/03,['Bram Stoker Award for Works for Young Reader...,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...
2,23390821,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...",324,Harper Perennial Modern Classics,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...","['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...
3,1555826,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Historical...",279,Modern Library,01/28/13,[],"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...
4,28109798,The Book Thief,,Markus Zusak (Goodreads Author),Librarian's note: An alternate cover edition c...,"['Historical Fiction', 'Fiction', 'Young Adult...",552,Alfred A. Knopf,09/01/05,['National Jewish Book Award for Children’s an...,"['Molching (Germany)', 'Germany']",https://i.gr-assets.com/images/S/compressed.ph...


#### Create dictionary with the item profiles and write to file

Set book_id as the index of the dataframe

In [4]:
df_books = df_books.set_index("book_id")

Create the dictionary from the above dataframe

In [5]:
books_map = df_books.to_dict(orient='index')

In [6]:
books_map[14796360] #test

{'title': 'The Hunger Games',
 'series': 'The Hunger Games #1',
 'author': 'Suzanne Collins',
 'description': "WINNING MEANS FAME AND FORTUNE.LOSING MEANS CERTAIN DEATH.THE HUNGER GAMES HAVE BEGUN. . . .In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and once girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV.Sixteen-year-old Katniss Everdeen regards it as a death sentence when she steps forward to take her sister's place in the Games. But Katniss has been close to dead before—and survival, for her, is second nature. Without really meaning to, she becomes a contender. But if she is to win, she will have to start making choices that weight survival against humanity and life against love.",
 'genres': "['Young Adult', 'Fiction', 'Dysto

**Problem 1**: The lists in the cells are stored as strings that look like lists; we want actual lists in the item profiles dictionary.<br/>

Columns where we have lists in the cells are: genres, awards, setting.<br/>
Formats:<br/> 
genres = "['item1', 'item2', ...]"<br/>
awards = "[\\'item1\\', \\'item2\\', ...]"<br/>
setting = "['item1', 'item2', ...]"<br/>

We create a function that turns such a string into a list for the two formats above.

In [7]:
def string_to_list(s, awards=False):
    if awards:
        s_ = s.strip("[]").split(", ")
        s = []
        for aw in s_:
            s.append(aw[1:-1]) # Chop off the quote character from front and back, somewhat tedious but works
    else:
        s = s.strip("[]").split("', '")
        # Remove "'" from start of first and end of last item
        s[0] = s[0][1:]
        s[-1] = s[-1][:-1]
    return s
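If the cells are valid Python list literals, as the genres and setting formats above suggest, the standard library's `ast.literal_eval` is a possible alternative to manual splitting. The sketch below is not the function used in this notebook: `parse_list_cell` is a hypothetical name, and returning an empty list for missing or unparsable cells is an assumption.

```python
import ast

def parse_list_cell(cell):
    """Parse a string such as "['a', 'b']" into a real list (hypothetical helper)."""
    if not isinstance(cell, str):
        return []  # Assumption: treat missing values (e.g. NaN) as an empty list
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # Assumption: treat unparsable cells as empty
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

genres = parse_list_cell("['Young Adult', 'Fiction', 'Dystopia']")
no_awards = parse_list_cell("[]")
```

With this approach the string `"[]"` becomes an empty list, whereas `string_to_list` returns a list holding one empty string, and an item that contains a comma is kept as a single element.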

**Problem 2:** There are illustrators and translators displayed as multiple authors for many of the books. Also some authors are "Goodreads Authors". We will only keep the main author for convenience.

Examples:<br/>
"A1, A2 (Illustrator), A3 (Translator), A4"<br/>
"A1 (Goodreads Author)" <br/>
"A1, A2"

-> In all cases above, we just want to keep "A1". We create a function below to solve the problem:

In [8]:
def clean_authorstring(author_str):
    cleaned_str = author_str
    if "," in author_str:
        authors = author_str.strip().split(",")
        cleaned_str = authors[0]
    if "(" in cleaned_str:
        authors = cleaned_str.split("(")
        cleaned_str = authors[0]
    return cleaned_str.strip()
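The same rule (keep everything before the first comma or opening parenthesis) can also be written as a single regex split. This is only a compact sketch for checking the three example formats above; `first_author` is a hypothetical name and not part of the notebook.

```python
import re

def first_author(author_str):
    """Keep only the text before the first comma or opening parenthesis (hypothetical helper)."""
    return re.split(r"[,(]", author_str, maxsplit=1)[0].strip()

examples = [
    "A1, A2 (Illustrator), A3 (Translator), A4",
    "A1 (Goodreads Author)",
    "A1, A2",
]
cleaned = [first_author(s) for s in examples]
```

Like `clean_authorstring`, this assumes the input is a string; a missing author (NaN) would need to be handled separately.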

Fix **problem 1** and **problem 2**: Clean the item profile dictionary (books_map) row-wise.

In [9]:
for book_id, metadata in books_map.items():
    books_map[book_id]["awards"] = string_to_list(books_map[book_id]["awards"], awards=True) 
    books_map[book_id]["genres"] = string_to_list(books_map[book_id]["genres"], awards=False) 
    books_map[book_id]["setting"] = string_to_list(books_map[book_id]["setting"], awards=False) 
    books_map[book_id]["author"] = clean_authorstring(books_map[book_id]["author"])

Append the n highest-scoring TF.IDF words from each book's description to the item profiles:<br/>
1.  Create dict {book_id : [(word1, tfidf_score1), (word2, tfidf_score2), ...]}<br/>
2.  Insert, for each book_id in books_map, the item "words" : [word1, word2, ..., wordn] and remove the books in books_map that do not appear in the set of books with a top-n list of TF.IDF words.
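The two steps can be sketched on toy data as follows. The book ids, words and scores are made up for illustration, and `heapq.nlargest` is used here in place of a full sort; it is one possible way to select the top n, not necessarily the approach taken in the cells of this notebook.

```python
import heapq

# Hypothetical (word, tfidf_score) pairs for two made-up book ids
scores = {
    101: [("arena", 0.9), ("the", 0.1), ("tribute", 0.7), ("district", 0.8)],
    102: [("wizard", 0.6), ("school", 0.4)],
}
n_words = 2

# Step 1: keep the n highest-scoring pairs per book
top_pairs = {
    book_id: heapq.nlargest(n_words, pairs, key=lambda pair: pair[1])
    for book_id, pairs in scores.items()
}

# Step 2: keep only the words, and only for books that have a profile
profiles = {101: {"title": "Book A"}, 103: {"title": "Book C"}}
for book_id in list(profiles):
    if book_id in top_pairs:
        profiles[book_id]["words"] = [word for word, _ in top_pairs[book_id]]
    else:
        del profiles[book_id]  # No TF.IDF words available for this book
```

Book 103 is dropped because it has no word list, and book 102 is ignored because it has no profile.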

In [10]:
# 1.

# Open the file with the output of the TF.IDF computation
filename = 'tfidf_scores'

# dict d as {id: [(word, tfidf), ...]}
d = {}
# Extract the word, id and TF.IDF score from each line with regex and append them to the dict
with open(filename, 'r') as infile:
    for line in infile:
        word_found = re.search(r'\[\"(.+)\",', line)
        id_found = re.search(r'\[\".+\",(\d+)\]', line)
        tfidf_found = re.search(r'\]\s+(\d+\.\d+)', line)
        if word_found and id_found and tfidf_found:
            word = word_found.group(1)
            book_id = int(id_found.group(1))  # Named book_id to avoid shadowing the built-in id()
            tfidf = float(tfidf_found.group(1))
            d.setdefault(book_id, []).append((word, tfidf))

#sort the values for each key in the dict, based on the tfidf score
#only keep the n words with the highest tfidf score
book_tfidf = {}
n_words = 5

for book_id, pair in d.items():
    book_tfidf[book_id] = sorted(pair, key=lambda value: value[1], reverse=True)[:n_words] 

In [11]:
# 2.

for book_id, pairs in book_tfidf.items():
    words_ = []
    for word, score in pairs:
        words_.append(word)
    if book_id in books_map.keys():
        books_map[book_id]["words"] = words_

exclude = set(books_map.keys()).difference(book_tfidf.keys()) # Books without a top-n word list
for ex in exclude:
    books_map.pop(ex, None)


We may now remove the uninteresting features from our books_map, that is, the description and the features that do not contain any information about the book itself.

In [12]:
books_map[14796360].keys()

dict_keys(['title', 'series', 'author', 'description', 'genres', 'pages', 'publisher', 'firstPublishDate', 'awards', 'setting', 'coverImg', 'words'])

In [13]:
# Non-interesting features = 'description', 'coverImg'
for book_id, data in books_map.items():
    del data['description']
    del data['coverImg']

books_map is now our set of item profiles for the majority of the books from books_metadata.csv, and can be compared to user profiles.<br/>
Finally, write all the item profiles as JSON strings to the file "item_profiles.txt", where each row is an item.

In [14]:
with open('item_profiles.txt', 'w') as outfile:
    for book_id, metadata in books_map.items():
        json.dump({book_id: metadata}, outfile) # One single-entry JSON object per line
        outfile.write('\n')
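The notebook does not show how the file is read back, so the following is only a sketch of one way to do it. It uses an in-memory buffer in place of the real file and a made-up profile. One detail worth knowing: JSON object keys are always strings, so integer book ids come back as strings and need converting.

```python
import io
import json

# Hypothetical item profile keyed by an integer book id
profiles = {14796360: {"title": "The Hunger Games", "genres": ["Young Adult", "Fiction"]}}

buffer = io.StringIO()  # Stands in for the output file
for book_id, metadata in profiles.items():
    json.dump({book_id: metadata}, buffer)
    buffer.write("\n")

# Read the lines back into a single dictionary
loaded = {}
for line in buffer.getvalue().splitlines():
    for key, metadata in json.loads(line).items():
        loaded[int(key)] = metadata  # JSON object keys are strings, so convert back
```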

### 2: User profiles

#### Read & inspect the users data

In [15]:
df_users = pd.read_csv("data/user_book_ratings.csv")
df_users.head()

Unnamed: 0.1,Unnamed: 0,user_id,book_id,rating
0,0,0,5602347,5
1,1,0,30,5
2,2,0,12528798,5
3,3,0,25026517,4
4,4,0,835,4


In [16]:
df_users_booklist = df_users.groupby('user_id')['book_id'].apply(list).reset_index(name='books_list').set_index('user_id')
df_users_ratinglist = df_users.groupby('user_id')['rating'].apply(list).reset_index(name='ratings_list').set_index('user_id')
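As an aside, both list columns can likely be built in one pass with pandas named aggregation, which would make the later join unnecessary. The sketch below runs on a toy frame with the same column names; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy ratings in the same long format as user_book_ratings.csv
toy = pd.DataFrame({
    "user_id": [0, 0, 1],
    "book_id": [5602347, 30, 9003477],
    "rating": [5, 5, 4],
})

# One row per user, with the rated books and their ratings collected into lists
per_user = toy.groupby("user_id").agg(
    books_list=("book_id", list),
    ratings_list=("rating", list),
)
```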

In [17]:
df_users = df_users_booklist.join(df_users_ratinglist)
df_users

Unnamed: 0_level_0,books_list,ratings_list
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[5602347, 30, 12528798, 25026517, 835, 2202194...","[5, 5, 5, 4, 4, 5, 5, 5, 4, 5, 5, 5, 5, 5, 4, ..."
1,"[9003477, 627206, 395614, 1730369, 91760, 2409...","[4, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 5, 0, 5, 5, ..."
2,"[22240073, 6666060, 1480805, 6571776, 5973095,...","[4, 4, 3, 4, 3, 3, 3, 2, 4, 4, 5, 3, 3, 3, 4, ..."
3,"[12047693, 12913723, 7029926, 6316356, 12535, ...","[5, 5, 4, 3, 0, 0, 5, 0, 3, 0, 0, 4, 0, 3, 0, ..."
4,"[817661, 20427793, 40075, 12986764, 10127510, ...","[3, 3, 5, 2, 3, 3, 4, 2, 4, 5, 4, 1, 4, 2, 2, ..."
...,...,...
876140,"[879891, 15733551, 420031, 16978009, 894542, 4...","[3, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 5, 4, 0]"
876141,"[11066234, 6066819, 20440128, 453824, 10445007...","[0, 0, 0, 5, 4, 4, 5, 4, 5, 3, 4]"
876142,"[7226415, 821080, 6234106, 16393581, 9003477, ...","[5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 0, 4, 0, 0, ..."
876143,"[17416065, 15947905, 12322066, 7967279, 281097...","[4, 5, 4, 4, 5, 5, 5, 4]"


Create a function that computes the average rating for every value of every feature across the books a given user has rated.

In [18]:
def make_avg_rating(user_id, df_users, books_map, alpha_pages):
    """
    Takes a user id, looks up the corresponding row in df_users (i.e. the books the user has rated and the 
    ratings for those books) and returns a dict of the form {feature1: {val1: avg_rating1, val2: avg_rating2, ...}, ...}
    with the same features as in the item profiles.
    """
    books = df_users.loc[user_id]['books_list']
    ratings = df_users.loc[user_id]['ratings_list']

    # We center each rating by subtracting the mean of all ratings for the user, to prevent high-/low-rating users from being handled differently
    mean_rating = np.mean(ratings) # Computed once instead of once per rating
    ratings_std = [r - mean_rating for r in ratings]

    # Go through all books to calculate mean of ratings for the possible values of the features
    keys = ['title', 'series', 'author', 'genres', 'pages', 'publisher', 'firstPublishDate', 'awards', 'setting', 'words']
    user_avg = {key: {} for key in keys} # Init every feature to its own empty dict
    for feature in user_avg.keys(): # Fill dictionary feature-wise
        inner_dict = {} # Should contain values with associated ratings for each of the features 
        for i, book in enumerate(books):

            if book in books_map:
                data = books_map[book][feature]
            else: continue # if not we must drop the book

            if not isinstance(data, list): # If the data is a single value (not a list) we may insert its rating directly
                if data not in inner_dict.keys():
                    inner_dict[data] = [ratings_std[i]]
                else:
                    inner_dict[data].append(ratings_std[i])
            else: # If data is list, each value from the list is given its own rating
                for el in data:
                    if el not in inner_dict.keys():
                        inner_dict[el] = [ratings_std[i]]
                    else:
                        inner_dict[el].append(ratings_std[i])

        user_avg[feature] = inner_dict # Append to outer dict on the given feature

    # We may then proceed to averaging all the lists in the user_avg dictionary
    for feature, value_map in user_avg.items(): # value_map avoids shadowing the built-in map()
        for value, ratings_list in value_map.items():
            user_avg[feature][value] = np.mean(ratings_list)
    
    # Also, for the pages we take the weighted sum of the pages the user has rated and multiply it with a scaling factor, alpha, for the pages
    pages = user_avg["pages"]
    w_avg_pages = 0
    for k,v in pages.items():
        try: 
            if np.isnan(float(k)): 
                w_avg_pages += 0 # We can not assume anything about the length of the book
            else:
                w_avg_pages += v*float(k)
        except ValueError: # there are 8 instances where a value error is raised
            w_avg_pages += 0

    user_avg["pages"] = w_avg_pages*alpha_pages
    
    return user_avg
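The core idea of the function, stripped of the pages handling, can be illustrated on toy data: center the user's ratings on their own mean, spread each centered rating over every feature value of the book, and average per value. The item profiles and ratings below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: two rated books with scalar and list-valued features
toy_items = {
    1: {"author": "A", "genres": ["Fantasy", "Fiction"]},
    2: {"author": "B", "genres": ["Fiction"]},
}
toy_ratings = {1: 5, 2: 3}

user_mean = mean(toy_ratings.values())  # Mean rating of this user
collected = defaultdict(lambda: defaultdict(list))
for book_id, rating in toy_ratings.items():
    centered = rating - user_mean
    for feature, value in toy_items[book_id].items():
        # A list-valued feature passes the rating on to each of its elements
        values = value if isinstance(value, list) else [value]
        for v in values:
            collected[feature][v].append(centered)

# Average the centered ratings per feature value
toy_profile = {
    feature: {v: mean(rs) for v, rs in per_value.items()}
    for feature, per_value in collected.items()
}
```

Here "Fiction" ends up at 0 because it appears in one book rated above the user's mean and one rated below it, while "Fantasy" inherits the positive score of the only book that carries it.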

Obtain the scaling factor for number of pages

In [19]:
# We find the scaling factor for the number of pages as: alpha_pages = 1/mean(number of pages for all books)
N = len(books_map)
total_pages = 0 # Named total_pages to avoid shadowing the built-in sum()
for book_id in books_map.keys():
    try: 
        if np.isnan(float(books_map[book_id]['pages'])): # there are about 500 instances where this is true, these are just nan values
            total_pages += 200 # we add a value close to (a bit lower than) the median where we have missing values
        else:
            total_pages += float(books_map[book_id]['pages'])
    except ValueError: # there are 8 instances where a value error is raised
        N = N-1 # Exclude the book from the mean
        print(books_map[book_id]['pages']) # We print them to see

mean_pagenumber = total_pages/N
alpha_pages = 1/mean_pagenumber

print(f"Sum of all pages from all books: {total_pages}")
print(f"Number of books summed up: {N}")
print(f"Mean number of pages per book: {mean_pagenumber}")
print(f"Scaling of the pagenumber for a given user is then: {alpha_pages}")

1 page
1 page
1 page
1 page
1 page
1 page
1 page
1 page
Sum of all pages from all books: 8313630.0
Number of books summed up: 25331
Mean number of pages per book: 328.1998341952548
Scaling of the pagenumber for a given user is then: 0.003046924147454241
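The printed values can be re-derived from the sum and the count, and the scaling factor can then be applied to a user's per-page-count averages in the way `make_avg_rating` does. The two totals are copied from the output above; the small user dictionary is made up for illustration.

```python
# Values copied from the printed output above
total_pages = 8313630.0
n_books = 25331

mean_pagenumber = total_pages / n_books
alpha_pages = 1 / mean_pagenumber

# Hypothetical user: average centered rating per page count (keys kept as strings)
avg_rating_by_pages = {"374": 0.5, "552": -0.5}

# Weighted sum of page counts, scaled by alpha_pages
pages_score = alpha_pages * sum(r * float(p) for p, r in avg_rating_by_pages.items())
```

In this toy case the score is negative, since the longer book was rated below the user's mean and the shorter one above it.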


In [20]:
# test for a random user
d = make_avg_rating(876143, df_users, books_map, alpha_pages)
for k,v in d.items():
    print(f"{k} : {v}")

title : {'Looking for Alaska': -0.5, 'The Fault in Our Stars': 0.5, 'Lock and Key': -0.5, 'Outliers: The Story of Success': -0.5, 'The Book Thief': 0.5, 'I Am the Messenger': 0.5, 'Will Grayson, Will Grayson': 0.5, 'Whirligig': -0.5}
series : {nan: -0.07142857142857142, 'Will Grayson, Will Grayson #1': 0.5}
author : {'John Green': 0.16666666666666666, 'Sarah Dessen': -0.5, 'Malcolm Gladwell': -0.5, 'Markus Zusak': 0.5, 'Paul Fleischman': -0.5}
genres : {'Young Adult': 0.07142857142857142, 'Fiction': 0.07142857142857142, 'Contemporary': 0.0, 'Romance': 0.0, 'Realistic Fiction': 0.0, 'Coming Of Age': 0.0, 'Teen': 0.0, 'Mystery': 0.0, 'Young Adult Contemporary': -0.5, 'High School': -0.5, 'Drama': 0.5, 'Novels': 0.5, 'Love': 0.5, 'Chick Lit': -0.5, 'Nonfiction': -0.5, 'Psychology': -0.5, 'Business': -0.5, 'Self Help': -0.5, 'Sociology': -0.5, 'Science': -0.5, 'Audiobook': 0.16666666666666666, 'Personal Development': -0.5, 'Economics': -0.5, 'Leadership': -0.5, 'Historical Fiction': 0.5, '

#### Create the user profiles for all users

In [21]:
from progressbar import ProgressBar
pbar = ProgressBar()

In [22]:
# Select a sample of the users to give recommendations to (25k users intended)
# df_sample_users = df_users.sample(n=25000)
# Lower the number of users to 500 so that the output file can be pushed to the repo
df_sample_users = df_users.sample(n=500)

user_profiles = {}
for user in pbar(df_sample_users.index):
    user_profiles[user] = make_avg_rating(user, df_users, books_map, alpha_pages)
    

100% |########################################################################|


#### Write the user profiles to file *"user_profiles.txt"*

In [23]:
with open('user_profiles.txt', 'w') as outfile:
    for user_id, metadata in user_profiles.items():
        json.dump({user_id: metadata}, outfile) # One single-entry JSON object per line
        outfile.write('\n')