# Popularity recommender
Popularity recommenders are like a music chart that tells you the most popular songs everyone is listening to. Just as the chart reflects the songs loved by a large number of people, popularity recommenders suggest items based on what's trending and widely liked by a broad audience. They aim to give recommendations that are in line with what's currently popular.

---
##1.&nbsp;Import libraries and files 💾
The data for this project is a set of movie ratings split across four CSV files hosted on Google Drive: `links`, `movies`, `rating`, and `tags`. In this notebook we only use `movies` (9,742 movies with their titles and genres) and `rating` (100,836 ratings from 610 users). The file names and sizes are consistent with the small MovieLens dataset.

In [1]:
import pandas as pd

In [2]:
def gd_path(file_id):
    return f"https://drive.google.com/uc?export=download&id={file_id}"


files_id = {
    'links': "1GR8IQ2OXsFI8MNmv4bQIV1XXkq7n56MB",
    'movies': "1PDuCaAhhVTRLYdftMr6VqX23crMqB_qg",
    'rating': "1F4_-HBPBSySMjxdGxlykWVjvVn9AJ0BS",
    'tags': "1bH6HhZfqLT0JGqYxyRLQAk7UIpnYj4x4"
    }


links = pd.read_csv(gd_path(files_id['links']), sep=",")
movies = pd.read_csv(gd_path(files_id['movies']), sep=",")
rating = pd.read_csv(gd_path(files_id['rating']), sep=",")
tags = pd.read_csv(gd_path(files_id['tags']), sep=",")

In [None]:
url = 'https://drive.google.com/file/d/1yFwxNVF0MuAsiFTAZMfoVGt1nIOatByg/view?usp=sharing'
path = 'https://drive.google.com/uc?id='+url.split('/')[-2]
df = pd.read_csv(path)

---
##2.&nbsp;Explore the data 👩‍🚀

In [3]:
movies.shape

(9742, 3)

The `movies` DataFrame has 9,742 rows and 3 columns.

In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


`movies` has only three columns, `movieId`, `title`, and `genres`, and none of them contain missing values. For a popularity recommender, the columns that matter most are `userId`, `movieId`, and `rating` in the `rating` DataFrame, so these will be our main focus; `title` is only needed to make the results readable. It's worth mentioning that many recommendation systems divide users into 'neighbourhoods' to improve recommendations and speed up calculations. If you'd like to explore this further in the future, neighbourhoods could be built from additional user information or from similarities in rating behaviour.

In [5]:
movies.describe()

Unnamed: 0,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


`.describe()` only summarises numeric columns, and the only numeric column in `movies` is `movieId`. Because `movieId` is an identifier, its mean and standard deviation carry no meaning. The output does show that the IDs run from 1 to 193,609 even though there are only 9,742 movies, so the IDs are not contiguous.

In [7]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
rating

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


How many distinct movies have been rated?

In [9]:
rating["movieId"].nunique()

9724

How many distinct users do we have in the DataFrame?

In [11]:
rating["userId"].nunique()

610

Let's also have a look at the full distribution of ratings.

In [12]:
rating["rating"].value_counts(normalize=True)

rating
4.0    0.265957
3.0    0.198808
5.0    0.131015
3.5    0.130271
4.5    0.084801
2.0    0.074884
2.5    0.055040
1.0    0.027877
1.5    0.017762
0.5    0.013586
Name: proportion, dtype: float64

---
##3.&nbsp;How should we build a popularity recommender? 📚

###3.1.&nbsp;Highest rated movies
Let's look at the most popular movies by average rating.

In [25]:
rating_count_df = rating.groupby('movieId')['rating'].agg(['mean', 'count']).reset_index()
rating_count_df.nlargest(5, ['mean', 'count'])

Unnamed: 0,movieId,mean,count
48,53,5.0,2
87,99,5.0,2
869,1151,5.0,2
2593,3473,5.0,2
4384,6442,5.0,2


Movie with the highest mean score. Several movies tie at a mean of 5.0, so `nlargest(1, 'mean')` simply returns the first of them:

In [27]:
highest_rating_isbn = rating_count_df.nlargest(1, 'mean')['movieId'].values[0]

highest_rated_isbn_mask = movies['movieId'] == highest_rating_isbn
book_info_columns = ['movieId', 'title', 'genres']

movies.loc[highest_rated_isbn_mask, book_info_columns].drop_duplicates()

Unnamed: 0,movieId,title,genres
48,53,Lamerica (1994),Adventure|Drama


In [28]:
highest_rating_isbn

53

###3.2.&nbsp;Most rated movies
But are the most highly rated movies also the most watched movies?

In [29]:
rating_count_df.sort_values(by=['count', 'mean'], ascending=False).head()

Unnamed: 0,movieId,mean,count
314,356,4.164134,329
277,318,4.429022,317
257,296,4.197068,307
510,593,4.16129,279
1938,2571,4.192446,278


Movie with the most ratings:

In [30]:
most_rated_isbn = rating_count_df.nlargest(1, 'count')['movieId'].values[0]
most_rated_isbn_mask = movies['movieId'] == most_rated_isbn

movies.loc[most_rated_isbn_mask, book_info_columns].drop_duplicates()

Unnamed: 0,movieId,title,genres
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War


It looks like some movies are well loved and others are widely watched. We'll need to strike a balance between the two to find the overall top 10 most popular movies.
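
One possible way to strike that balance, shown here only as a sketch and not as the approach this notebook ends up using, is a Bayesian (shrunken) average: each movie's mean is pulled towards the global mean, and the pull is stronger the fewer ratings the movie has. The three rows reuse the `mean` and `count` values shown in the outputs above. `PRIOR_WEIGHT` is a hypothetical tuning parameter, and `GLOBAL_MEAN = 3.5` is an approximation of the overall mean implied by the rating distribution shown earlier.

```python
import pandas as pd

# Mean and count for three movies, copied from the outputs above.
toy_counts = pd.DataFrame({
    "movieId": [53, 318, 356],
    "mean": [5.0, 4.429022, 4.164134],
    "count": [2, 317, 329],
})

GLOBAL_MEAN = 3.5    # assumption: approximate mean of all ratings
PRIOR_WEIGHT = 50    # hypothetical: how many "average" ratings to mix in

v = toy_counts["count"]
R = toy_counts["mean"]

# Bayesian average: (v * R + m * C) / (v + m)
toy_counts["bayes_score"] = (v * R + PRIOR_WEIGHT * GLOBAL_MEAN) / (v + PRIOR_WEIGHT)

ranked = toy_counts.sort_values("bayes_score", ascending=False)
print(ranked)
```

With these settings the movie with only two ratings drops below the two heavily rated ones, because its score is dominated by the global mean.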

---
##4.&nbsp;Challenge: build a popularity recommender 😃
Find a hybrid system to sort movies, so that you can recommend the "best" movies that are both highly rated and popular.

In [31]:
rating["rating"].value_counts(normalize=True)

rating
4.0    0.265957
3.0    0.198808
5.0    0.131015
3.5    0.130271
4.5    0.084801
2.0    0.074884
2.5    0.055040
1.0    0.027877
1.5    0.017762
0.5    0.013586
Name: proportion, dtype: float64

In [34]:
rating_count_df[["movieId","count"]]

Unnamed: 0,movieId,count
0,1,215
1,2,110
2,3,52
3,4,7
4,5,49
...,...,...
9719,193581,1
9720,193583,1
9721,193585,1
9722,193587,1


In [46]:
rating_count_df['count'].sum()

100836

In [47]:
rating_count_df["count"].value_counts(normalize=True)

count
1      0.354381
2      0.133484
3      0.082271
4      0.054504
5      0.039284
         ...   
162    0.000103
202    0.000103
192    0.000103
145    0.000103
185    0.000103
Name: proportion, Length: 177, dtype: float64

In [48]:
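# Each movie's share of all ratings; the total (100836) is computed rather than hardcoded
rating_count_df['waited_count'] = rating_count_df["count"] / rating_count_df["count"].sum()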

In [49]:
# Weight each movie's mean rating by its share of all ratings
rating_count_df['waited_mean'] = rating_count_df["waited_count"]*rating_count_df["mean"]

In [50]:
rating_count_df.nlargest(5, ['waited_mean'])

Unnamed: 0,movieId,mean,count,waited_count,waited_mean
277,318,4.429022,317,0.003144,0.013924
314,356,4.164134,329,0.003263,0.013586
257,296,4.197068,307,0.003045,0.012778
1938,2571,4.192446,278,0.002757,0.011558
510,593,4.16129,279,0.002767,0.011514
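
It helps to see what this score measures. Since `mean * count` is the sum of a movie's ratings, `waited_mean` equals that sum divided by the total number of ratings, so ranking by `waited_mean` is the same as ranking by the sum of ratings. The sketch below checks this on a small made-up ratings table; `toy_ratings` and the `sum_share` column are hypothetical names used only for this illustration.

```python
import pandas as pd

# Made-up ratings for three movies.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 5.0, 5.0, 4.5],
})

stats = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(["mean", "count", "sum"])
    .reset_index()
)
total = stats["count"].sum()

# Score as computed in the notebook: share of ratings times mean rating.
stats["waited_mean"] = stats["count"] / total * stats["mean"]

# Equivalent form: sum of the movie's ratings divided by the total count.
stats["sum_share"] = stats["sum"] / total

max_diff = (stats["waited_mean"] - stats["sum_share"]).abs().max()
print(stats)
print("largest difference:", max_diff)
```

In this toy data movie 1 ranks first despite having the lowest mean, which shows how strongly the rating count drives this score.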


In [52]:
all_df = pd.merge(rating_count_df, movies, on ="movieId")

In [53]:
all_df.nlargest(5, ['waited_mean'])

Unnamed: 0,movieId,mean,count,waited_count,waited_mean,title,genres
277,318,4.429022,317,0.003144,0.013924,"Shawshank Redemption, The (1994)",Crime|Drama
314,356,4.164134,329,0.003263,0.013586,Forrest Gump (1994),Comedy|Drama|Romance|War
257,296,4.197068,307,0.003045,0.012778,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
1938,2571,4.192446,278,0.002757,0.011558,"Matrix, The (1999)",Action|Sci-Fi|Thriller
510,593,4.16129,279,0.002767,0.011514,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller


In [54]:
all_df[['mean','count','waited_mean','title']].nlargest(5, 'waited_mean')

Unnamed: 0,mean,count,waited_mean,title
277,4.429022,317,0.013924,"Shawshank Redemption, The (1994)"
314,4.164134,329,0.013586,Forrest Gump (1994)
257,4.197068,307,0.012778,Pulp Fiction (1994)
1938,4.192446,278,0.011558,"Matrix, The (1999)"
510,4.16129,279,0.011514,"Silence of the Lambs, The (1991)"
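
The steps above can be collected into one reusable function. This is a minimal sketch that assumes the same column names as the `rating` and `movies` DataFrames; the function name `top_n_popular` and the toy inputs are hypothetical and only serve to make the example runnable on its own.

```python
import pandas as pd


def top_n_popular(ratings, movies, n=5):
    """Return the n movies with the highest count-weighted mean rating."""
    stats = (
        ratings.groupby("movieId")["rating"]
        .agg(["mean", "count"])
        .reset_index()
    )
    # Share of all ratings times mean rating, as in the cells above.
    stats["waited_mean"] = stats["count"] / stats["count"].sum() * stats["mean"]
    merged = pd.merge(stats, movies, on="movieId")
    return merged.nlargest(n, "waited_mean")[["title", "mean", "count", "waited_mean"]]


# Made-up data with the same column names as the notebook's DataFrames.
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 3],
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 5.0, 5.0, 4.5],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})

top2 = top_n_popular(toy_ratings, toy_movies, n=2)
print(top2)
```

On the real data, calling `top_n_popular(rating, movies, n=10)` should give the top 10 asked for in the challenge.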
