<h1>Importing functions</h1>

In [1]:
import numpy as np
from scipy.sparse import lil_matrix
import os
import pickle
import csv
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation as LDA
from scipy.spatial import distance

<h1> DATA </h1>

The data is composed of a list of books and a list of users rating the books

In [2]:
books_file = 'books.csv'
users_ratings = 'books_ratings.csv'

<h1>Functions</h1>

<h2>Book Functions</h2>

<h3>Search for books</h3>

In [3]:
def search_by_author(text_to_search):
    temp_df = books_df.loc[:,'authors'].dropna().apply(lambda x:x.lower())
    return books_df.loc[temp_df.str.contains(text_to_search.lower())]

def search_by_title(text_to_search):
    temp_df = books_df.loc[:,'original_title'].dropna().apply(lambda x: x.lower())
    return books_df.loc[temp_df.loc[temp_df.str.contains(text_to_search.lower())].index]

<h3>Finds the most similar book or books</h3>


In [4]:
def find_most_similar(vector,matrix):
    return np.argsort(distance.cdist([vector],matrix,metric='cosine'))[::-1]

def find_similar_books(books_vector,books_list,list_length):
    for row in books_list:
        book=books_vector[row,:]
        similarity_list=find_most_similar(book,books_vector)[0]
        print(books_df.iloc[row]["original_title"])
        print(get_book_name(similarity_list,list_length))

<h3>Returns books names based on <i>book IDs</i></h3>


In [5]:
def get_book_name(books_list,size):
    book_names=[]
    for book in books_list[0:size]:
        book_names.append(books_df.iloc[book]["original_title"])
    return book_names

<h3>For a list of groups, shows most probable members</h3>

In [6]:
def show_groups(books_vector,list_length,list_of_groups=None):
    books_vector_transpose=books_vector.transpose()
    if list_of_groups==None:
        iteration_range=range(0,n_topics)
    else:
        iteration_range=list_of_groups
    for group in iteration_range:
        print("Group",group)
        print(get_book_name(books_vector_transpose[group].argsort()[::-1],list_length))

<h2>User Functions</h2>

<h3>With a list of books that the user liked, creating a user vector with probable groups</h3>

In [7]:
def create_new_user(reading_list,model):
    user_matrix=lil_matrix((1,books_df.shape[0]),dtype='int')
    for book in reading_list:
        user_matrix[0,book]=5
    result=model.transform(user_matrix)
    return result[0]

<h3>Returning a list of groups that are relevant to the user, sorted by probability of the user belonging to the specific group</h3>

In [8]:
def user_relevant_groups(user_vector,n_topics):
    new_list=[]
    relevant_groups=np.where(user_vector>(1/n_topics))[0]
    for i in np.argsort(user_vector)[::-1]:
        if i in relevant_groups:
            new_list.append(i)
    return new_list

<h3>creating a user specific list of recommended books and authors</h3>

Multiplying the probability that a user will be in a specific group with the probability that a book will be in that group, we get the probability that the user will fit the book.

In [9]:
def user_recommended_books(user,books_vector,new_user_books):
    temp_mulp=np.sum(np.multiply(new_user,books_vector),axis=1)
    user_recommendations=books_df[~books_df.original_title.isin(new_user_books)]
    user_recommendations=books_df.loc[:,['authors','original_publication_year','original_title']]
    user_recommendations['user_index']=temp_mulp
    user_recommendations=user_recommendations[user_recommendations.user_index>200]
    return user_recommendations.sort_values('user_index',ascending=False)

Using the list of recommended books, aggregation of the books based on authors will generate a list of recomended authors for the user to follow 

In [10]:
def get_recommended_authors(recommended_df):
    authors=recommendation.groupby('authors').mean().sort_values('user_index',ascending=False).index
    return list(authors)

<h1>Build a model</h1>

<h2>Read "Books.csv"</h2>

In [11]:
books_df=pd.read_csv(books_file,index_col='book_id')## book_id as index
books_df['authors']=books_df.authors.apply(lambda x: x.split(",")[0])
books_df.head(5)

Unnamed: 0_level_0,authors,original_publication_year,original_title,language_code
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Suzanne Collins,2008.0,The Hunger Games,eng
1,J.K. Rowling,1997.0,Harry Potter and the Philosopher's Stone,eng
2,Stephenie Meyer,2005.0,Twilight,en-US
3,Harper Lee,1960.0,To Kill a Mockingbird,eng
4,F. Scott Fitzgerald,1925.0,The Great Gatsby,eng


<h3>Database search by author</h3>

In [12]:
search_by_author('hoover')

Unnamed: 0_level_0,authors,original_publication_year,original_title,language_code
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
538,Colleen Hoover,2012.0,Hopeless,eng
643,Colleen Hoover,2012.0,Slammed,eng
961,Colleen Hoover,2012.0,Point of Retreat,eng
1109,Colleen Hoover,2014.0,Maybe Someday,eng
1129,Colleen Hoover,2014.0,Ugly Love,eng
1561,Colleen Hoover,2015.0,,en-GB
1567,Colleen Hoover,2016.0,It Ends with Us,eng
2039,Colleen Hoover,2013.0,Losing Hope,eng
2330,Colleen Hoover,2015.0,November 9,en-US
2394,Colleen Hoover,2015.0,Never Never,en-US


<h3>Database search by title</h3>

In [13]:
search_by_title('kingb')

Unnamed: 0_level_0,authors,original_publication_year,original_title,language_code
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3,Harper Lee,1960.0,To Kill a Mockingbird,eng
4933,Kathryn Erskine,2010.0,Mockingbird,eng
6854,Karen Miller,2005.0,"The Innocent Mage (Kingmaker, Kingbreaker, #1)",eng
8097,Karen Miller,2005.0,"Innocence Lost (Kingmaker, Kingbreaker, #2)",eng


<h2>Read "books_ratings.csv"</h2>
Only relevant to our model are the users that "liked" a specific book (rated it 4 or 5).

In [14]:
users_liked={}
loc_user_id={}
loc_id_user={}
c=0
with open(users_ratings) as file1:
    csv_users=csv.reader(file1)
    next(csv_users)
    for row in csv_users:
        user_id,book_id,rating=row[0],row[1],int(row[2])
        if rating>=4:
            if user_id not in loc_user_id.keys():
                loc_user_id[user_id]=c
                loc_id_user[c]=user_id
                c+=1
            users_liked[user_id]=users_liked.get(user_id,dict())
            users_liked[user_id][book_id]=rating

<h2> Define sparse matrix with local user id as rows and book ids as columns</h2>

In [15]:
sparse_matrix=lil_matrix((c,books_df.shape[0]),dtype='int')
sparse_matrix.shape

(13071, 10000)

<h2>Fill the sparse matrix with data</h2>
* The rows of the matrix will represent the users
* The columns of the matrix will represent the books

In [16]:
for user,liked in users_liked.items():
    for book_str,rating in liked.items():
        book=int(book_str)
        sparse_matrix[loc_user_id[user],book]=rating

In [17]:
print("sparcity=",sparse_matrix.nnz/(sparse_matrix.shape[0]*sparse_matrix.shape[1]))

sparcity= 0.005159077346798256


<h2>Train the LDA model</h2>

Define the LDA model, <b>n_topics</b> is the number of groups for the model to create

In [18]:
n_topics=100
lda = LDA(n_components=n_topics,max_iter=50)

After defining the model, fitting based on <b>sparse_matrix</b> that was created earlier

In [19]:
lda.fit(sparse_matrix)
books_vector=lda.components_.transpose()
users_vector=lda.transform(sparse_matrix)
# print(books_vector.shape,users_vector.shape)



(10000, 100) (13071, 100)


<b>books_vector</b> and <b>users_vector</b> are the assignment of users and books to each of the <i>n</i> groups

<h1>Using the model</h1>

<h3>Intoducing a new user</h3>

The new user is created using a list of favorite books

In [20]:
user1=[497,3481,8502,263,1183,120,2343,112,267,159]

The new user's list of favorite books

In [21]:
books_df.loc[user1]

Unnamed: 0_level_0,authors,original_publication_year,original_title,language_code
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
497,Leo Tolstoy,1869.0,Война и миръ,eng
3481,Franz Kafka,1926.0,Das Schloß,eng
8502,Thomas Mann,1901.0,Buddenbrooks: Verfall einer Familie,eng
263,Ernest Hemingway,1926.0,The Sun Also Rises,en-US
1183,F. Scott Fitzgerald,1933.0,Tender Is the Night,en-US
120,Vladimir Nabokov,1955.0,Lolita,eng
2343,Charles Bukowski,1975.0,Factotum,
112,Joseph Heller,1961.0,Catch-22,en-US
267,Kazuo Ishiguro,2005.0,Never Let Me Go,eng
159,Charles Dickens,1860.0,Great Expectations,eng


Finding similar books for to the ones on the user's list

In [22]:
find_similar_books(books_vector,user1,3)

Война и миръ
['Война и миръ', 'Мастер и Маргарита', 'Don Quijote de La Mancha']
Das Schloß
['Das Schloß', 'All My Sons', 'Hyperspace: A Scientific Odyssey Through Parallel Universes, Time Warps, and the Tenth Dimension']
Buddenbrooks: Verfall einer Familie
['Buddenbrooks: Verfall einer Familie', 'Veinte poemas de amor y una canción desesperada', 'Babbitt']
The Sun Also Rises
['The Sun Also Rises', 'A Moveable Feast', 'Pnin']
Tender Is the Night
['Tender Is the Night', "Who's Afraid of Virginia Woolf?", 'The medium is the massage']
Lolita
['Lolita', 'Tortilla Flat', 'Cien años de soledad']
Factotum
['Factotum', 'The Haunting οf Hill House', 'Black Hawk Down']
Catch-22
['Catch-22', 'American Psycho', 'A Clockwork Orange']
Never Let Me Go
['Never Let Me Go', 'When We Were Orphans', 'Atonement']
Great Expectations
['Great Expectations', 'Die fröhliche Wissenschaft', nan]


<h3>Applying the model to the new user</h3>

In [23]:
new_user=create_new_user(user1,lda)

<h2>Recommendations for the new user</h2>

<h3>The groups that are relevant to the user</h3>

In [24]:
show_groups(books_vector,10,user_relevant_groups(new_user,n_topics))

Group 78
['Преступление и наказание', 'The Old Man and the Sea', 'The Sun Also Rises', 'A Farewell to Arms', 'Анна Каренина', 'L’Étranger', 'Братья Карамазовы', 'For Whom the Bell Tolls', 'Die Verwandlung', 'Nesnesitelná lehkost bytí']
Group 38
['Invisible Man', 'Lolita', 'As I Lay Dying', 'Catch-22', 'The Sound and the Fury', 'Nine Stories', 'Franny and Zooey', 'White Noise', 'Mrs Dalloway', 'Native Son ']
Group 65
['Atlas Shrugged', 'The Fountainhead', "The Handmaid's Tale", 'Atonement', 'The Historian', 'Life of Pi', 'La sombra del viento', 'Never Let Me Go', 'Oryx and Crake', 'The Blind Assassin']
Group 36
['A Walk in the Woods', 'Revolutionary Road', 'Down Under', 'Brideshead Revisited: The Sacred and Profane Memories of Captain Charles Ryder', 'American Pastoral', 'Notes from a Big Country', 'Through the Looking-Glass, and What Alice Found There', 'Of Human Bondage', 'The Celestine Prophecy', 'The Human Stain']
Group 99
["A People's History of the United States: 1492 to Present "

<h3>Recommended books</h3>

In [27]:
recommendation=user_recommended_books(new_user,books_vector,user1)
recommendation

Unnamed: 0_level_0,authors,original_publication_year,original_title,user_index
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
176,Fyodor Dostoyevsky,1866.0,Преступление и наказание,1189.462455
161,Albert Camus,1942.0,L’Étranger,882.707606
129,Ernest Hemingway,1952.0,The Old Man and the Sea,833.818049
263,Ernest Hemingway,1926.0,The Sun Also Rises,750.123569
400,Ernest Hemingway,1929.0,A Farewell to Arms,716.55102
171,Leo Tolstoy,1877.0,Анна Каренина,698.9682
484,Fyodor Dostoyevsky,1880.0,Братья Карамазовы,667.036864
394,Ernest Hemingway,1940.0,For Whom the Bell Tolls,627.148647
212,Franz Kafka,1915.0,Die Verwandlung,513.502736
735,Ralph Ellison,1952.0,Invisible Man,488.807015


<h3>Recommended authors</h3>

In [26]:
get_recommended_authors(recommendation)

['Fyodor Dostoyevsky',
 'Ernest Hemingway',
 'Albert Camus',
 'Leo Tolstoy',
 'Ralph Ellison',
 'Vladimir Nabokov',
 'Milan Kundera',
 'Joseph Conrad',
 'Franz Kafka',
 'Mikhail Bulgakov',
 'Miguel de Cervantes Saavedra',
 'Gabriel García Márquez',
 'William Faulkner',
 'J.D. Salinger',
 'Joseph Heller',
 'James Joyce',
 'Charles Dickens',
 'Anthony Burgess',
 'Ken Kesey',
 'Ayn Rand',
 'Don DeLillo',
 'F. Scott Fitzgerald',
 'Richard Wright',
 'Virginia Woolf',
 'John Kennedy Toole',
 'Margaret Atwood',
 'Herman Melville',
 'Patrick Süskind']