# Popularity recommender
Popularity recommenders are like a music chart that shows the songs everyone is listening to. Just as the chart reflects what a large number of people love, a popularity recommender suggests items that are trending and widely liked by a broad audience, so every user sees the same recommendations.

---
## 1.&nbsp; Import libraries and files 💾
The dataset for this project is a subset of the BookCrossing (BX) dataset, which Cai-Nicolas Ziegler collected from the Book-Crossing community during a 4-week period in August/September 2004. The full dataset includes 278,858 users (anonymized, but with demographic details) and 1,149,780 ratings (both explicit and implicit) for 271,379 books. Because the full dataset is massive, we use a smaller chunk of it for this project.

In [1]:
import pandas as pd

In [2]:
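# Google Drive share link for the CSV file
url = 'https://drive.google.com/file/d/1yFwxNVF0MuAsiFTAZMfoVGt1nIOatByg/view?usp=sharing'
# Build a direct-download link from the file ID (the second-to-last path segment)
path = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(path)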
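
The link conversion in the cell above can also be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `drive_share_to_download_url` is hypothetical, and it uses a regular expression rather than `split('/')` so that a link without a file ID fails loudly. It assumes share links follow the `.../file/d/<id>/...` pattern seen above.

```python
import re


def drive_share_to_download_url(share_url: str) -> str:
    """Turn a '.../file/d/<id>/view' share link into a 'uc?id=<id>' link."""
    # The file ID is the path segment that follows '/d/'.
    match = re.search(r"/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    return "https://drive.google.com/uc?id=" + match.group(1)


share_url = "https://drive.google.com/file/d/1yFwxNVF0MuAsiFTAZMfoVGt1nIOatByg/view?usp=sharing"
download_url = drive_share_to_download_url(share_url)
print(download_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as `path`.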

---
## 2.&nbsp;Explore the data 👩‍🚀

In [3]:
df

Unnamed: 0,user_id,user_location,user_age,book_isbn,book_rating,book_title,book_author,book_year_of_publication,book_publisher,img_s,img_m,img_l,book_summary,book_language,book_category,publisher_city,publisher_state,publisher_country
0,3329,"grantsville, utah, usa",34.7439,0440234743,8,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],grantsville,utah,usa
1,7346,"sunnyvale, california, usa",49.0000,0440234743,9,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],sunnyvale,california,usa
2,7352,"houston, texas, usa",53.0000,0440234743,8,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],houston,texas,usa
3,9419,"somewhere, texas, usa",34.7439,0440234743,5,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],somewhere,texas,usa
4,11224,"tumwater, washington, usa",51.0000,0440234743,6,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],tumwater,washington,usa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47900,246311,"lynnwood, washington, usa",35.0000,0385503954,8,Atonement: A Novel,Ian McEwan,2002.0,Nan A. Talese,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,9,9,9,lynnwood,washington,usa
47901,246328,"new york, new york, usa",34.7439,0385503954,9,Atonement: A Novel,Ian McEwan,2002.0,Nan A. Talese,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,9,9,9,new york,new york,usa
47902,253330,"seattle, washington, usa",56.0000,0385503954,9,Atonement: A Novel,Ian McEwan,2002.0,Nan A. Talese,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,9,9,9,seattle,washington,usa
47903,255846,"paris, paris, france",21.0000,0385503954,9,Atonement: A Novel,Ian McEwan,2002.0,Nan A. Talese,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,http://images.amazon.com/images/P/0385503954.0...,9,9,9,paris,paris,france


In [4]:
df.shape

(47905, 18)

Even though this is a reduced dataset, we still have almost 48,000 rows across 18 columns.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47905 entries, 0 to 47904
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   user_id                   47905 non-null  int64  
 1   user_location             47905 non-null  object 
 2   user_age                  47905 non-null  float64
 3   book_isbn                 47905 non-null  object 
 4   book_rating               47905 non-null  int64  
 5   book_title                47905 non-null  object 
 6   book_author               47905 non-null  object 
 7   book_year_of_publication  47905 non-null  float64
 8   book_publisher            47905 non-null  object 
 9   img_s                     47905 non-null  object 
 10  img_m                     47905 non-null  object 
 11  img_l                     47905 non-null  object 
 12  book_summary              47905 non-null  object 
 13  book_language             47905 non-null  object 
 14  book_c

Not all of these columns are particularly helpful. The `img` columns, for example, probably won't be useful for our analysis. Columns like `user_id`, `book_isbn`, and `book_rating`, however, will be extremely valuable, and for now we'll focus on these three. It's worth mentioning that many recommendation systems divide users into 'neighbourhoods' to improve recommendations and speed up calculations. If you'd like to explore this further in the future, columns such as `user_age` and `user_location` could come in handy for creating these neighbourhoods.
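
As a rough illustration of the neighbourhood idea, here is a minimal sketch that groups users by country and age band. It is only one of many possible groupings: the bin edges, the labels, and the column names `age_group`, `country`, and `neighbourhood` are assumptions made for this example, and the four rows are copied from the output above.

```python
import pandas as pd

# A few users copied from the DataFrame shown above.
users = pd.DataFrame(
    {
        "user_id": [3329, 7346, 7352, 255846],
        "user_location": [
            "grantsville, utah, usa",
            "sunnyvale, california, usa",
            "houston, texas, usa",
            "paris, paris, france",
        ],
        "user_age": [34.7439, 49.0, 53.0, 21.0],
    }
)

# Hypothetical age bands; each interval includes its left edge only.
age_bins = [0, 18, 30, 45, 60, 120]
age_labels = ["<18", "18-29", "30-44", "45-59", "60+"]
users["age_group"] = pd.cut(
    users["user_age"], bins=age_bins, labels=age_labels, right=False
)

# The country is the text after the last comma in user_location.
users["country"] = users["user_location"].str.rsplit(",", n=1).str[-1].str.strip()

# Combine both into a single neighbourhood label.
users["neighbourhood"] = users["country"] + " / " + users["age_group"].astype(str)
print(users[["user_id", "neighbourhood"]])
```

A recommender could then compute popularity within each neighbourhood instead of over the whole user base.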

In [6]:
df.describe()

Unnamed: 0,user_id,user_age,book_rating,book_year_of_publication
count,47905.0,47905.0,47905.0,47905.0
mean,138137.811878,35.638379,7.854504,1997.611398
std,80685.985628,9.952574,1.780345,5.484309
min,9.0,7.0,1.0,1959.0
25%,68185.0,30.0,7.0,1995.0
50%,136240.0,34.7439,8.0,1999.0
75%,209272.0,38.0,9.0,2002.0
max,278854.0,99.0,10.0,2004.0


Based on `.describe()`, our users range from 7 to 99 years old, with the middle half falling between 30 and 38. Book ratings are measured on a scale of 1 to 10, and users appear to be generous: the middle half of the ratings lies between 7 and 9, with a mean of about 7.9. The publication years range from 1959 to 2004, with the middle half between 1995 and 2002. Keep in mind that each row is one rating, so these statistics are weighted by how often a user or a book appears in the data.

In [7]:
df.head()

Unnamed: 0,user_id,user_location,user_age,book_isbn,book_rating,book_title,book_author,book_year_of_publication,book_publisher,img_s,img_m,img_l,book_summary,book_language,book_category,publisher_city,publisher_state,publisher_country
0,3329,"grantsville, utah, usa",34.7439,0440234743,8,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],grantsville,utah,usa
1,7346,"sunnyvale, california, usa",49.0,0440234743,9,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],sunnyvale,california,usa
2,7352,"houston, texas, usa",53.0,0440234743,8,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],houston,texas,usa
3,9419,"somewhere, texas, usa",34.7439,0440234743,5,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],somewhere,texas,usa
4,11224,"tumwater, washington, usa",51.0,0440234743,6,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],tumwater,washington,usa


How many unique books do we have in the DataFrame?

In [8]:
df["book_isbn"].nunique()

500

How many unique users do we have in the DataFrame?

In [9]:
df["user_id"].nunique()

20366

Let's also have a look at the full distribution of ratings:

In [10]:
df["book_rating"].value_counts(normalize=True)

book_rating
8     0.253543
10    0.203173
9     0.190627
7     0.161445
5     0.080430
6     0.070494
4     0.017180
3     0.012713
2     0.006304
1     0.004091
Name: proportion, dtype: float64
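
To put a number on how skewed the ratings are, the proportions printed above can be summed directly. The sketch below copies those values into a Series. The cut-offs (7 and above as "high", 4 and below as "low") are arbitrary choices for illustration.

```python
import pandas as pd

# Proportions copied from the value_counts(normalize=True) output above.
rating_share = pd.Series(
    {
        8: 0.253543,
        10: 0.203173,
        9: 0.190627,
        7: 0.161445,
        5: 0.080430,
        6: 0.070494,
        4: 0.017180,
        3: 0.012713,
        2: 0.006304,
        1: 0.004091,
    },
    name="proportion",
)

# Share of all ratings that are 7 or higher, and 4 or lower.
high_share = rating_share[rating_share.index >= 7].sum()
low_share = rating_share[rating_share.index <= 4].sum()

print(f"Ratings of 7 or more: {high_share:.1%}")
print(f"Ratings of 4 or less: {low_share:.1%}")
```

Roughly four out of five ratings are 7 or higher, which matters later: with ratings this compressed, small differences in the mean can separate many books.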

---
## 3.&nbsp;How should we build a popularity recommender? 📚

### 3.1.&nbsp;Highest rated books
Let's look at the most popular books by average rating.

In [11]:
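# Mean rating and number of ratings for each book
rating_count_df = df.groupby('book_isbn')['book_rating'].agg(['mean', 'count']).reset_index()
# Top 5 books by mean rating, with ties broken by number of ratings
rating_count_df.nlargest(5, ['mean', 'count'])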

Unnamed: 0,book_isbn,mean,count
113,0345339738,9.402597,77
234,0439139597,9.262774,137
237,043936213X,9.207547,53
112,0345339711,9.120482,83
233,0439136369,9.082707,133


The book with the highest mean rating:

In [12]:
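# ISBN of the book with the highest mean rating
highest_rating_isbn = rating_count_df.nlargest(1, 'mean')['book_isbn'].values[0]

# Select that book's rows and keep only the columns that describe the book
highest_rated_isbn_mask = df['book_isbn'] == highest_rating_isbn
book_info_columns = ['book_isbn', 'book_title', 'book_author', 'book_year_of_publication']

# Every rating row repeats the book details, so show them once
df.loc[highest_rated_isbn_mask, book_info_columns].drop_duplicates()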

Unnamed: 0,book_isbn,book_title,book_author,book_year_of_publication
31206,0345339738,"The Return of the King (The Lord of the Rings,...",J.R.R. TOLKIEN,1986.0


### 3.2.&nbsp;Most rated books
But are the most highly rated books also the most widely read books?

In [13]:
# Sort by number of ratings, then by mean rating, both descending
rating_count_df.sort_values(by=['count', 'mean'], ascending=False).head()

Unnamed: 0,book_isbn,mean,count
91,0316666343,8.18529,707
481,0971880107,4.390706,581
196,0385504209,8.435318,487
61,0312195516,8.182768,383
13,0060928336,7.8875,320


The book with the most ratings:

In [14]:
# ISBN of the book with the largest number of ratings
most_rated_isbn = rating_count_df.nlargest(1, 'count')['book_isbn'].values[0]
most_rated_isbn_mask = df['book_isbn'] == most_rated_isbn

# Show that book's details once
df.loc[most_rated_isbn_mask, book_info_columns].drop_duplicates()

Unnamed: 0,book_isbn,book_title,book_author,book_year_of_publication
4254,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002.0


It looks like some books are well loved and others are well read. We'll need to strike a balance between the two to find the overall top 10 most popular books.
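
One quick way to see how far the two rankings diverge is to compare the top-5 lists directly. The sketch below rebuilds a small `rating_count_df` from the ten rows printed in sections 3.1 and 3.2 (ISBNs written as 10-character strings) so that it runs on its own. On the full data the same three lines at the bottom would apply unchanged. The names `top_by_mean`, `top_by_count`, and `overlap` are introduced here for illustration.

```python
import pandas as pd

# The ten rows shown in the two top-5 tables above.
rating_count_df = pd.DataFrame(
    {
        "book_isbn": [
            "0345339738", "0439139597", "043936213X", "0345339711", "0439136369",
            "0316666343", "0971880107", "0385504209", "0312195516", "0060928336",
        ],
        "mean": [
            9.402597, 9.262774, 9.207547, 9.120482, 9.082707,
            8.18529, 4.390706, 8.435318, 8.182768, 7.8875,
        ],
        "count": [77, 137, 53, 83, 133, 707, 581, 487, 383, 320],
    }
)

top_n = 5
top_by_mean = set(rating_count_df.nlargest(top_n, "mean")["book_isbn"])
top_by_count = set(rating_count_df.nlargest(top_n, "count")["book_isbn"])

# Books that appear in both lists.
overlap = top_by_mean & top_by_count
print(f"Books in both top-{top_n} lists: {sorted(overlap)}")
```

For these rows the overlap is empty: none of the five highest rated books is among the five most rated ones.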

---
## 4.&nbsp;Challenge: build a popularity recommender 😃
Find a hybrid system to sort books, so that you can recommend the "best" books that are both highly rated and popular.

In [15]:
# Be creative! Be bold!
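
One possible starting point, and certainly not the only answer, is a weighted average that pulls each book's mean rating toward the overall mean, and pulls harder when the book has few ratings: `score = (count * mean + m * C) / (count + m)`. In this sketch `C` is the overall mean rating taken from the `.describe()` output (7.854504). The prior weight `m = 100` is a hypothetical value chosen for illustration, and the helper `weighted_rating` is not part of the original notebook. The ten rows are copied from the tables in section 3 so the block runs on its own.

```python
import pandas as pd

# The ten rows shown in the two top-5 tables of section 3.
rating_count_df = pd.DataFrame(
    {
        "book_isbn": [
            "0345339738", "0439139597", "043936213X", "0345339711", "0439136369",
            "0316666343", "0971880107", "0385504209", "0312195516", "0060928336",
        ],
        "mean": [
            9.402597, 9.262774, 9.207547, 9.120482, 9.082707,
            8.18529, 4.390706, 8.435318, 8.182768, 7.8875,
        ],
        "count": [77, 137, 53, 83, 133, 707, 581, 487, 383, 320],
    }
)

GLOBAL_MEAN = 7.854504  # mean of book_rating, from df.describe()
PRIOR_WEIGHT = 100      # hypothetical: acts like 100 extra ratings at the global mean


def weighted_rating(mean, count, global_mean=GLOBAL_MEAN, prior_weight=PRIOR_WEIGHT):
    """Shrink a book's mean rating toward the global mean, more so for small counts."""
    return (count * mean + prior_weight * global_mean) / (count + prior_weight)


rating_count_df["score"] = weighted_rating(
    rating_count_df["mean"], rating_count_df["count"]
)

# Rank books by the hybrid score, best first.
ranked = rating_count_df.sort_values("score", ascending=False).reset_index(drop=True)
print(ranked)
```

A larger `PRIOR_WEIGHT` favours books with many ratings, while a smaller one lets the raw mean dominate, so it is worth trying a few values and checking whether the resulting top 10 looks sensible.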