# Goodreads Book Recommender

This notebook extracts the information from a [Goodread's](https://www.goodreads.com) user and recommends a list of books based on the user's reviews, his or her friends', and the books' average ratings [1]. 

The notebook is divided in **four sections**: 
1. Extracts user information using Goodread's Web API.
2. Prepares data for cleaning and analysis.
3. Cleans and formats book data.
4. Recommends a list of books for the user.


#### Elements I'd like to add in the future
- Include information of the user's friends with forbidden access in their accounts.
- Automatically extract information of all the user's friends, not just some of them.
- Exclude users with whom the user only has three or less books in common. Otherwise, their correlation might be too high, which might alter the final results. 

#### Pending improvements for the code
- There are duplicates of books that appear in both the Spanish and English version.
- The formula for recommending books might not be sufficiently calibrated. It may be giving more weight to some books and less weight to books of friends with whom the user has a negative taste correlation.

[1] _Goodreads is a website in which users can sign up and register books to generate library catalogs and reading lists._

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import re
import numpy as np
import scipy.stats as st
from datetime import datetime

## 1. Get the shelf information of a user

To communicate with its Web API, Goodreads requires developers to get a [key](https://www.goodreads.com/api). I have hidden mine in the following cell, but it should work with yours. 

In [2]:
# Key and website to access the Web API
CONSUMER_KEY = ''
url = "https://www.goodreads.com/"

The following functions combined extract the information of a user's shelf and return it in the form of a dictionary. 

NOTE: Goodreads has a shelf for books the user has read and another one for books the user wants to read. We will focus on the books already read. Also, note that shelves are displayed through multiple pages. 


In [3]:
def shelf_info(user_id, shelf, page):
  
    """Returns a BeautifulSoup object of a single page of the user's shelf"""
    
    time.sleep(1.1) # To comply with the terms and conditions of Goodreads
    info = requests.get(f'{url}/review/list?v=2&id={user_id}&shelf={shelf}&sort=title&page={page}&per_page=200&key={CONSUMER_KEY}')
    print('Status code: ', info.status_code)
    info_content = info.content
    soup = BeautifulSoup(info_content, 'lxml')
    return soup

In [4]:
def paginas_por_estante(libros_en_estante):
    
    """Returns how many pages are needed to display all the books in the user's shelf"""
    
    if libros_en_estante <= 200:
        return 1
    
    elif libros_en_estante <= 400:
        return 2
    
    elif libros_en_estante <= 600:
        return 3
    
    elif libros_en_estante <= 800:
        return 4
    
    elif libros_en_estante <= 1000:
        return 5
    
    elif libros_en_estante <= 1200:
        return 6

    elif libros_en_estante <= 1400:
        return 7
    
    elif libros_en_estante <= 1600:
        return 8
    
    elif libros_en_estante <= 1800:
        return 9
    
    elif libros_en_estante <= 2000:
        return 10

In [5]:
def extract_info(user_id, shelf, libros_en_estante):
    
    """Returns a dictionary with all the information of the user's shelf"""
    
    diccionario = {}
    
    # Define variables of tags to keep adding info as I move through various pages.
    isbn = []
    title = []
    num_pages = []
    publisher = []
    publication_year = []
    name = []
    rating = []
    average_rating = []
    ratings_count = []
    links = []
    
    #Determine how many pages are in the shelf.
    paginas_to_scrap = paginas_por_estante(libros_en_estante)
    
    # Get info of each page.
    for x in range(1, paginas_to_scrap+1):
        
        # Get html code as a Soup object
        page = shelf_info(user_id, shelf=shelf, page=x)
        
        # Get info of tags that do not repeat
        
        isbn_info = page.find_all(f'isbn')
        isbn_info = [elem.get_text() for elem in isbn_info]
        isbn.append(isbn_info)
        
        title_info = page.find_all(f'title')
        title_info = [elem.get_text() for elem in title_info]
        title.append(title_info)
        
        publisher_info = page.find_all(f'publisher')
        publisher_info = [elem.get_text() for elem in publisher_info]
        publisher.append(publisher_info)
        
        num_pages_info = page.find_all(f'num_pages')
        num_pages_info = [elem.get_text() for elem in num_pages_info]
        num_pages.append(num_pages_info)
        
        publication_year_info = page.find_all(f'publication_year')
        publication_year_info = [elem.get_text() for elem in publication_year_info]
        publication_year.append(publication_year_info)
        
        name_info = page.find_all(f'name')
        name_info = [elem.get_text() for elem in name_info]
        name.append(name_info)
        
        rating_info = page.find_all(f'rating')
        rating_info = [elem.get_text() for elem in rating_info]
        rating.append(rating_info)
        
        # Get info of repeated tags
        review_blocks = page.find_all('review')
        link_pattern = re.compile(r'www.goodreads.com.*')
        
        for review in review_blocks: 
            average_rating_info = review.find(f'average_rating').get_text()
            average_rating.append(average_rating_info)

            ratings_count_info = review.find(f'ratings_count').get_text()
            ratings_count.append(ratings_count_info)
            
            # Obtener links        
            if link_pattern.search(review.get_text()):
                link = re.findall(link_pattern, review.get_text()) 
                links.append(link)
    
            else: 
                print('Missing: ', review.title)
    
    # Flatten variables with lists of lists
    isbn = [elem for listt in isbn for elem in listt]
    title = [elem for listt in title for elem in listt]
    num_pages = [elem for listt in num_pages for elem in listt]
    publisher = [elem for listt in publisher for elem in listt]
    publication_year = [elem for listt in publication_year for elem in listt]
    name = [elem for listt in name for elem in listt]
    rating = [elem for listt in rating for elem in listt]
    links = [elem[0] for elem in links]
    
    # Transform everything to a dictionary
    diccionario[f'user_id'] = [user_id for x in range(0, len(isbn))]
    diccionario[f'shelf'] = [shelf for x in range(0, len(isbn))]
    
    diccionario[f'isbn'] = isbn
    diccionario[f'title'] = title
    diccionario[f'author'] = name
    diccionario[f'num_pages'] = num_pages
    diccionario[f'publication_year'] = publication_year
    diccionario[f'publisher'] = publisher
    diccionario[f'my_rating'] = rating
    
    diccionario[f'average_rating'] = average_rating
    diccionario[f'ratings_count'] = ratings_count
    diccionario[f'links'] = links
        
    return diccionario


To demonstrate, I will extract all the information in my shelf (user 'Francisco Galán').

In [6]:
francisco_galan = extract_info('40732498', 'read', 276)

Status code:  200
Status code:  200


In [4]:
# Save date of information
time = datetime.now()
time = now.strftime("%d/%m/%Y")

Also, I will extract the information of seven of my friends.

In [7]:
nicolas_papa = extract_info('85738242', 'read', 87)
fernando_lamoyi = extract_info('22410395', 'read', 104)
cova_sv = extract_info('72222895', 'read', 52)
mario_carballo = extract_info('18141767', 'read', 142)  
andrea_raisman = extract_info('63716476', 'read', 416)
vanessa_romero = extract_info('16421531', 'read', 96)  
maria_lama = extract_info('68889321', 'read', 137)

Status code:  200
Status code:  200
Status code:  200
Status code:  200
Status code:  200
Status code:  200
Status code:  200
Status code:  200
Status code:  200


Finally, let's extract the information of three people I follow on Goodreads. 

In [8]:
eduardo_rosas = extract_info('51214176', 'read', 158)  
stefan_schubert = extract_info('27953287', 'read', 142)
bill_gates = extract_info('23470', 'read', 228)

Status code:  200
Status code:  200
Status code:  200
Status code:  200


For a future version, I would also like to extract the information of three users with forbidden access. For example:
- `srdjan_miletic = extract_info('11055732', 'read', 350)`
- `pablo_staforini = extract_info('3093249', 'read', 1846)`
- `alvaro_migoya = extract_info('57665930', 'read', 63)`  

For simplicity, from now on I will refer to the user's friends and people who he follow as his **contacts**. 

## 2. Prepare data for cleaning and analysis

Let's transform the extracted information to DataFrames.

In [9]:
data_francisco = pd.DataFrame(francisco_galan)
data_nicolas = pd.DataFrame(nicolas_papa)
data_fernando = pd.DataFrame(fernando_lamoyi)
data_cova_sv = pd.DataFrame(cova_sv)
data_mario = pd.DataFrame(mario_carballo)
data_andrea = pd.DataFrame(andrea_raisman)
data_vanessa = pd.DataFrame(vanessa_romero)
data_maria = pd.DataFrame(maria_lama)
data_eduardo = pd.DataFrame(eduardo_rosas)
data_stefan = pd.DataFrame(stefan_schubert)
data_bill = pd.DataFrame(bill_gates)

Dataset example:

In [10]:
data_francisco.head(3)

Unnamed: 0,user_id,shelf,isbn,title,author,num_pages,publication_year,publisher,my_rating,average_rating,ratings_count,links
0,40732498,read,,1984,George Orwell,328,1950,New American Library,4,4.19,3375259,www.goodreads.com/book/show/5470.1984
1,40732498,read,451457994.0,"2001: A Space Odyssey (Space Odyssey, #1)",Arthur C. Clarke,297,2000,Roc,3,4.15,270722,www.goodreads.com/book/show/70535.2001
2,40732498,read,307465357.0,"The 4-Hour Workweek: Escape 9-5, Live Anywhere...",Timothy Ferriss,396,2009,Harmony,3,3.9,210841,www.goodreads.com/book/show/6444424-the-4-hour...


### 2.1 Modify column names

Now, we should join all the datasets into a single one. However, when we do so, the column `my_rating` will be repeated for each contact, so we should change it accordingly. 

In [12]:
# Create lists of user's contacts, datasets and column names.
contactos = ['francisco', 'nicolas', 'fernando', 'cova', 'mario', 'andrea', 'vanessa', 'maria', 'eduardo', 'stefan', 'bill']
data_contactos = [data_francisco, data_nicolas, data_fernando, data_cova_sv, data_mario, data_andrea, data_vanessa, data_maria, data_eduardo, data_stefan ,data_bill]
columnas_rating = [x + '_rating' for x in contactos]

In [13]:
# Change name of 'my_rating' column in each of the contact's datasets.
n = -1
for contacto in data_contactos:
    n += 1
    new_name = columnas_rating[n]
    contacto.rename(columns={'my_rating': new_name}, inplace=True)

In [14]:
# Check that it worked
print(data_vanessa.columns)

Index(['user_id', 'shelf', 'isbn', 'title', 'author', 'num_pages',
       'publication_year', 'publisher', 'vanessa_rating', 'average_rating',
       'ratings_count', 'links'],
      dtype='object')


The new `my_rating` column for the contact 'Vanessa' is now `vanessa_rating`, so the change was successful.

### 2.2 Remove irrelevant columns

To recommend books, not all columns in the datasets interest us, so let's remove them. 

In [15]:
# List of columns that do not interest me. 
columnas_irrelevantes = ['isbn', 'user_id', 'shelf', 'publisher', 'links', 'num_pages', 'publication_year', 'ratings_count']

In [16]:
# Remove irrelevant columns in each dataset.
for contacto in data_contactos:
    for columna in columnas_irrelevantes:
        del contacto[columna]

Ejemplo: 

In [17]:
data_vanessa.head(3)

Unnamed: 0,title,author,vanessa_rating,average_rating
0,La hija única,Guadalupe Nettel,2,4.17
1,1984,George Orwell,5,4.19
2,21 Lessons for the 21st Century,Yuval Noah Harari,5,4.16


### 2.3 Merge datasets

Finally, let's merge all the datasets into a single one. 

In [18]:
data_total = data_francisco.copy()

for dataset in data_contactos[1:]:
    data_total = data_total.merge(dataset, how="outer", on=['title', 'author', 'average_rating'])

In [19]:
# Check that it worked
data_total.head(3)

Unnamed: 0,title,author,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating
0,1984,George Orwell,4,4.19,4.0,4.0,,4.0,4.0,5.0,4.0,,,2.0
1,"2001: A Space Odyssey (Space Odyssey, #1)",Arthur C. Clarke,3,4.15,,,,,,,,,,
2,"The 4-Hour Workweek: Escape 9-5, Live Anywhere...",Timothy Ferriss,3,3.9,,,,,,,,,,


## 3. Cleaning and formating book data

We now have a single dataset with all the information. However, we should clean it before proceeding. 

### 3.1 Remove duplicates

Some books appear probably appear multiple times because they might vary slightly in their wording. For example: 

In [20]:
match = "How to Win"
data_total.loc[data_total['title'].str.match(match), :]

Unnamed: 0,title,author,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating
122,How to Win Friends & Influence People,Dale Carnegie,3.0,4.21,,,,,,,,,,
330,How to Win Friends and Influence People,Dale Carnegie,,4.21,4.0,,,,,,,,3.0,


The book _How to Win Friends_... appears twice, but only because one version uses "&" instead of "and". 

We should try to find as many duplicate books as we can. Then, we could combine their ratings into a single cell and remove the duplicates. 

In [23]:
# Creating a backup dataset to test our cleaning function.
data_test = data_total.copy()
data_test.shape

(1626, 14)

In [25]:
def clean_title(title):  

    """Returns a title without special characters to make book comparisons easier"""
    
    # In a book title, I saw that everything aftar a colon might change, even if
    # it is the same book. Therefore, we will only select the text before a colon. 
    clean = re.split(r':', title)[0]
    
    # Clean the remaining string
    clean = title.strip().lower()
    clean = clean.replace('&', 'and')
    caracteres_especiales = [',', '#', '(', ')']
    for caracter in caracteres_especiales:
        clean = clean.replace(f'{caracter}', '')
    
    return clean


In [26]:
# Select a row
row_num = 0
for i in range(len(data_test)):
    row_num += 1 
        
    # Select another row for comparison.  
    for x in range(row_num, len(data_test)):
            
        # Check that both rows exist, since they could have been deleted.
        if (i in data_test.index) & (x in data_test.index):
                
            # Check if there is a match between the boock titles
            original = data_test.loc[i, 'title']
            otro_libro = data_test.loc[x, 'title']
                 
            original_clean = clean_title(original)
            otro_libro_clean = clean_title(otro_libro)
                
            if original_clean == otro_libro_clean:
                print(f'\nBook: {original}')
                print(f'Match | Row: {i} , Row: {x}')
                
                # Merge all ratings of other users in the duplicate row with the original row.
                for usuario in columnas_rating[1:]:
                    if pd.notnull(data_test.loc[x, usuario]):
                        data_test.loc[i, usuario] = data_test.loc[x, usuario]
                        print(f'Usuario: {usuario} | Score: {data_test.loc[x, usuario]}')
                        
                # Delete duplicate row
                data_test.drop(x, axis=0, inplace=True)


Book: El Alquimista
Match | Row: 8 , Row: 470
Usuario: cova_rating | Score: 0

Book: Cuentos
Match | Row: 46 , Row: 690
Usuario: andrea_rating | Score: 3

Book: How to Win Friends & Influence People
Match | Row: 122 , Row: 330
Usuario: nicolas_rating | Score: 4
Usuario: stefan_rating | Score: 3

Book: Nudge: Improving Decisions About Health, Wealth, and Happiness
Match | Row: 175 , Row: 176
Usuario: fernando_rating | Score: 3
Usuario: stefan_rating | Score: 4

Book: Viaje Al Centro de La Tierra
Match | Row: 266 , Row: 957
Usuario: andrea_rating | Score: 4

Book: Born a Crime: Stories from a South African Childhood
Match | Row: 300 , Row: 390
Usuario: fernando_rating | Score: 5

Book: The Gene: An Intimate History
Match | Row: 410 , Row: 411
Usuario: fernando_rating | Score: 5

Book: Ready Player One (Ready Player One, #1)
Match | Row: 445 , Row: 446
Usuario: fernando_rating | Score: 0
Usuario: mario_rating | Score: 4

Book: Sapiens: A Brief History of Humankind
Match | Row: 451 , Row:

In [30]:
# Check that it worked
data_test.loc[data_test.title.str.contains('How to Win'), :]

Unnamed: 0,title,author,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating
122,How to Win Friends & Influence People,Dale Carnegie,3,4.21,4,,,,,,,,3,


In [31]:
data_test.shape

(1611, 14)

The function worked well, since there is no duplicate of _How to Win Friends_... anymore. Also, we now have 1611 rows, instead of 1626.

### 3.2 Reset indexes

In [32]:
data_test = data_test.reset_index(drop=True)

### 3.3 Change data type

Some columns are saved as `object`,so let's transform them to an adequate format.

In [33]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   title             1611 non-null   object
 1   author            1611 non-null   object
 2   francisco_rating  290 non-null    object
 3   average_rating    1611 non-null   object
 4   nicolas_rating    95 non-null     object
 5   fernando_rating   100 non-null    object
 6   cova_rating       55 non-null     object
 7   mario_rating      142 non-null    object
 8   andrea_rating     415 non-null    object
 9   vanessa_rating    96 non-null     object
 10  maria_rating      137 non-null    object
 11  eduardo_rating    168 non-null    object
 12  stefan_rating     144 non-null    object
 13  bill_rating       249 non-null    object
dtypes: object(14)
memory usage: 176.3+ KB


In [34]:
columnas = list(data_test.columns)
for columna in columnas[2:]:
    data_test[columna] = data_test[columna].astype('float')

In [35]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             1611 non-null   object 
 1   author            1611 non-null   object 
 2   francisco_rating  290 non-null    float64
 3   average_rating    1611 non-null   float64
 4   nicolas_rating    95 non-null     float64
 5   fernando_rating   100 non-null    float64
 6   cova_rating       55 non-null     float64
 7   mario_rating      142 non-null    float64
 8   andrea_rating     415 non-null    float64
 9   vanessa_rating    96 non-null     float64
 10  maria_rating      137 non-null    float64
 11  eduardo_rating    168 non-null    float64
 12  stefan_rating     144 non-null    float64
 13  bill_rating       249 non-null    float64
dtypes: float64(12), object(2)
memory usage: 176.3+ KB


### 3.4 Get additional dataset with the user's unread books.

We need an additional dataset with books that the contacts have read but the user hasn't. We we will use it later as a basis for the recommendations.

In [36]:
data_por_leer = data_test.copy()
data_por_leer.shape

(1611, 14)

In [37]:
for i in range(len(data_por_leer)):
        
    #Check if the row still exists
    if i in data_test.index:
    
        # Check if the user has read the book
        if pd.notnull(data_por_leer.loc[i, 'francisco_rating']):
            data_por_leer.drop(i, axis=0, inplace=True)

In [38]:
data_por_leer = data_por_leer.reset_index(drop=True)

In [39]:
data_por_leer.shape

(1321, 14)

In [40]:
data_por_leer.head(3)

Unnamed: 0,title,author,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating
0,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,,3.93,5.0,,,,,,,,,
1,21 Lessons for the 21st Century,Yuval Noah Harari,,4.16,3.0,,,,,5.0,2.0,,,
2,After Europe,Ivan Krastev,,4.11,4.0,,,,,,,,,


### 3.5 Clean ratings of score 0

Sometimes, a Goodreads user can read a but not score it. In those cases, Goodreads assigns it a value of cero, which will be misleading for our analysis. Example: 

In [33]:
data_test['francisco_rating'].value_counts()

3.0    101
2.0     67
4.0     57
5.0     24
1.0     20
0.0      8
Name: francisco_rating, dtype: int64

Instead, let's change these values to a NaN so that they don't bias our analysis.

In [41]:
data = data_test.copy()

In [42]:
for columna in columnas[2:]:
    data[columna] = data[columna].replace(0, np.NaN)
    
for columna in columnas[2:]:
    data_por_leer[columna] = data_por_leer[columna].replace(0, np.NaN)

In [43]:
data['francisco_rating'].value_counts()

3.0    104
2.0     71
4.0     58
5.0     24
1.0     24
Name: francisco_rating, dtype: int64

In [44]:
data_por_leer['fernando_rating'].value_counts()

4.0    33
3.0    21
5.0    21
2.0     3
Name: fernando_rating, dtype: int64

### 3.6 Save datasets with user info

In [47]:
data.to_csv(f'User data/data_francisco_galan_{time}.csv', index=False)
data_por_leer.to_csv(f'User data/unread_francisco_galan_{time}.csv', index=False)

## 4. Recommend books

For our recommendations, we will consider three factors:
- The correlation between the user's rating and his contact's.
- A contact's standardized rating. 
- A book's average rating. 

Why these three? The idea is to find contacts with similar tastes to the user. Such contacts probably liked books that the user also will like. Conversely, contacts with a negative correlation probably liked books that the user didn't.

In addition, we use a contact's standardized rating to give proper weight to her ratings. In other words, if a contact rates most books with two stars, her five-star ratings should count more than those of a contact that rates most books with five stars.

In [5]:
# Load user info
data = pd.read_csv('User data/data_francisco_galan.csv')
unread_data = pd.read_csv('User data/unread_francisco_galan.csv')

### 4.1 Calculate correlation with each contact

In [6]:
data_corr = data.copy()
data_corr = data_corr.drop(['author'], axis=1)

In [7]:
corr_matrix = data_corr.corr()
corr_matrix

Unnamed: 0,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating
francisco_rating,1.0,0.236334,0.361499,-0.191663,,,0.316139,1.0,-0.291748,-0.069824,-0.244444,-0.555556
average_rating,0.236334,1.0,0.570055,0.277579,0.41328,0.45142,0.347652,0.070211,0.217826,0.301307,0.388125,0.458481
nicolas_rating,0.361499,0.570055,1.0,0.333333,,,0.176777,,0.158777,-0.27735,0.0,
fernando_rating,-0.191663,0.277579,0.333333,1.0,,0.166667,0.0,0.327327,-0.707107,0.134568,-1.0,-0.101015
cova_rating,,0.41328,,,1.0,,,,,,,
mario_rating,,0.45142,,0.166667,,1.0,-0.183186,,,-0.405465,,
andrea_rating,0.316139,0.347652,0.176777,0.0,,-0.183186,1.0,-0.456435,-0.10692,,0.316228,0.428174
vanessa_rating,1.0,0.070211,,0.327327,,,-0.456435,1.0,0.522233,,,-0.866025
maria_rating,-0.291748,0.217826,0.158777,-0.707107,,,-0.10692,0.522233,1.0,-0.57735,-0.067926,
eduardo_rating,-0.069824,0.301307,-0.27735,0.134568,,-0.405465,,,-0.57735,1.0,-0.223957,


In [8]:
# Select only the correlations between the user and his contacts. 
corr_matrix['francisco_rating'].reset_index(level=0)

Unnamed: 0,index,francisco_rating
0,francisco_rating,1.0
1,average_rating,0.236334
2,nicolas_rating,0.361499
3,fernando_rating,-0.191663
4,cova_rating,
5,mario_rating,
6,andrea_rating,0.316139
7,vanessa_rating,1.0
8,maria_rating,-0.291748
9,eduardo_rating,-0.069824


In [11]:
# Transform the user's correlations to a single DataFrame
correlaciones = corr_matrix['francisco_rating']
correlaciones = correlaciones.drop('francisco_rating', axis=0)
correlaciones = correlaciones.reset_index(level=0)
correlaciones = correlaciones.reset_index(drop=True)
correlaciones = correlaciones.rename(columns={'index': 'usuario', 'francisco_rating': 'corr'})
correlaciones

Unnamed: 0,usuario,corr
0,average_rating,0.236334
1,nicolas_rating,0.361499
2,fernando_rating,-0.191663
3,cova_rating,
4,mario_rating,
5,andrea_rating,0.316139
6,vanessa_rating,1.0
7,maria_rating,-0.291748
8,eduardo_rating,-0.069824
9,stefan_rating,-0.244444


There are some `NaN` values above. They probably indicate that the contact has no books in common with the user. In such case, let's assign them a correlation of cero. 

In [12]:
correlaciones = correlaciones.fillna(0)
correlaciones

Unnamed: 0,usuario,corr
0,average_rating,0.236334
1,nicolas_rating,0.361499
2,fernando_rating,-0.191663
3,cova_rating,0.0
4,mario_rating,0.0
5,andrea_rating,0.316139
6,vanessa_rating,1.0
7,maria_rating,-0.291748
8,eduardo_rating,-0.069824
9,stefan_rating,-0.244444


### 4.2 Stardardize ratings

In [13]:
# Get the mean rating of each contact. 
avg_rating = data_corr.describe().T[['mean', 'std']][1:]
avg_rating = avg_rating.reset_index(level=0).rename(columns={'index': 'usuario'})
avg_rating

Unnamed: 0,usuario,mean,std
0,average_rating,3.965016,0.306914
1,nicolas_rating,3.819149,0.983332
2,fernando_rating,3.931818,0.868285
3,cova_rating,4.857143,0.478091
4,mario_rating,3.966102,0.927847
5,andrea_rating,3.757282,0.856815
6,vanessa_rating,3.810811,1.081294
7,maria_rating,3.340909,0.914927
8,eduardo_rating,4.101796,1.015748
9,stefan_rating,3.732394,0.815762


Now, to adjust each rating according to the mean of a contact, we can assume that the distribution of the contacts' ratings is normal. Under this assumption, we can infer how many standard deviations a rating is from the mean (z-score). By doing so, it is possible to standardize the grade distributions of all users and compare them with each other.

In [14]:
def weighted_rating(columna_contacto, rating):

    """Adjusts a rating according to the contact's mean rating"""
    
    # Define variables
    media = float(avg_rating.loc[avg_rating['usuario'] == columna_contacto, 'mean'])
    std = float(avg_rating.loc[avg_rating['usuario'] == columna_contacto, 'mean'])
    
    #Calculate z-score
    z_score = (rating - media) / std 
    
    # Get weighted rating
    weighted_rating = (rating + z_score) / rating
    return weighted_rating

Example:

In [15]:
weighted_rating('fernando_rating', 5)

1.0543352601156069

### 4.3 Adjust contact's rating according to correlation with user

Now, let's take the standardized rating and adjust it according to the corresponding correlation.

In [16]:
def correlation_weighted(columna_contacto, rating):  
    
    """Adjusts weighted rating of a contact according to correlation with user"""
    
    # Check if we're dealing with a null rating
    if np.isnan(rating):
        return 0
    
    else: 
        # Define variables
        weighted_rating = weighted_rating(columna_contacto, rating)
        correlacion = correlaciones.loc[correlaciones['usuario'] == columna_contacto, 'corr']
        
        # Get weighted_rating adjusted for correlation
        correlation_weighted = weighted_rating * correlacion
              
        return float(correlation_weighted)

Example:

In [17]:
correlation_weighted('average_rating', 1)

0.05960468698368633

### 4.4 Get recommendation score of a book

Now that we have all of our functions, let's calculate a recommendation score for a book. 

In [22]:
def score_de_recomendacion(fila):
    
    """Returns recommendation score for a book"""
    
    columnas_contactos = list(unread_data.columns)[3:]
    score = 0
    
    for contacto in columnas_contactos:
        x = unread_data.loc[fila, contacto]
        x = correlation_weighted(contacto, x)
        score += x
        
    return score

Example:

In [23]:
score_de_recomendacion(298)

0.2389903632519042

### 4.5 Get recommendation score of all books

In [33]:
columnas_contactos = list(unread_data.columns)[3:]
data_final = unread_data.copy()

In [34]:
for i in range(0, len(data_final)):
    data_final.loc[i, 'score_de_recomendacion'] = score_de_recomendacion(i)

In [35]:
data_final.head(3)

Unnamed: 0,title,author,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating,score_de_recomendacion
0,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,,3.93,5.0,,,,,,,,,,0.619656
1,21 Lessons for the 21st Century,Yuval Noah Harari,,4.16,3.0,,,,,5.0,2.0,,,,1.403993
2,After Europe,Ivan Krastev,,4.11,4.0,,,,,,,,,,0.604215


### 4.6 Final veredict

Let's order the books, starting from the one with the highest recommendation score. 

In [40]:
top_5 = data_final.sort_values(by='score_de_recomendacion', ascending=False).head()
top_5 = top_5.reset_index(drop=True)
top_5

Unnamed: 0,title,author,francisco_rating,average_rating,nicolas_rating,fernando_rating,cova_rating,mario_rating,andrea_rating,vanessa_rating,maria_rating,eduardo_rating,stefan_rating,bill_rating,score_de_recomendacion
0,Frankenstein: The 1818 Text,Mary Wollstonecraft Shelley,,3.82,,,,,5.0,5.0,,,,,1.633533
1,Normal People,Sally Rooney,,3.86,2.0,,,,,5.0,,,,,1.572527
2,21 Lessons for the 21st Century,Yuval Noah Harari,,4.16,3.0,,,,,5.0,2.0,,,,1.403993
3,Las batallas en el desierto,José Emilio Pacheco,,4.13,,5.0,,,4.0,4.0,,,,,1.370293
4,Las mujeres que luchan se encuentran: Manual d...,Catalina Ruiz-Navarro,,4.52,,,,,,5.0,,,,,1.306063


In sum, according to our recommendation score, these are the books I should read next:

In [41]:
for i in range(len(top_5)):
    libro = top_5.iloc[i, 0]
    autor = top_5.iloc[i, 1]
    print(f"- {i+1}: {libro} - {autor}")

- 1: Frankenstein: The 1818 Text - Mary Wollstonecraft Shelley
- 2: Normal People - Sally Rooney
- 3: 21 Lessons for the 21st Century - Yuval Noah Harari
- 4: Las batallas en el desierto - José Emilio Pacheco
- 5: Las mujeres que luchan se encuentran: Manual de feminismo pop latinoamericano - Catalina Ruiz-Navarro
