In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../data/processed/clean_data.csv')

### Finding the most popular items

In [4]:
df['Book-Title'].value_counts().index[:5]

Index(['Impossible Vacation', 'The Rescue', 'Airframe',
       'Tycoon'S Temptation (Silhouette Desire, No. 1414)',
       'Past Lives, Present Dreams: How to Use Reincarnation for Personal Growth'],
      dtype='object', name='Book-Title')

### Finding the most liked items

In [5]:
avg_rating_df = df[["Book-Title", "rating"]].groupby(['Book-Title']).mean()
avg_rating_df.sort_values(by='rating', ascending=False)

Book-Title,rating
How to Pay Zero Taxes (How to Pay Zero Taxes),10.0
Gideon's Day,10.0
"Redwall (Redwall, Book 1)",10.0
Drawing from Within : Unleashing Your Creative Potential,10.0
Girl With The Phony Name,10.0
...,...
Angel of Mercy (Mercy Trilogy),1.0
Abels Tochter.,1.0
Abyssal Warriors (Abyssal Warriors),1.0
Instant Architecture,1.0


The top entries do have very high average ratings, but the titles are unfamiliar. This is because books with very few ratings skew the results. A book with only one rating has a solid chance of that single rating being a perfect 10, which pushes it to the top, while a book that has been rated hundreds of times is likely to have at least one non-perfect rating.
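One way to make this skew visible is to compute the number of ratings next to the mean. The sketch below uses a small made-up DataFrame that only borrows the Book-Title and rating column names from the data above; the titles and values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy ratings (hypothetical data, same column names as the notebook's df)
toy_df = pd.DataFrame({
    "Book-Title": ["Book A", "Book B", "Book B", "Book B", "Book C", "Book C"],
    "rating": [10, 9, 8, 10, 7, 6],
})

# Aggregate the mean rating and the number of ratings per title in one pass
rating_summary = (
    toy_df.groupby("Book-Title")["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .sort_values("mean_rating", ascending=False)
)
print(rating_summary)
```

In this toy data the top-ranked title is the one with a single rating, which is the effect described above.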

In [6]:
(df["Book-Title"]=='Just Friends').sum()

1

### Finding the most liked popular items

In [7]:
book_frequency = df['Book-Title'].value_counts()

In [8]:
frequently_reviewed_books = book_frequency[book_frequency>20]

In [9]:
frequently_reviewed_books

Book-Title
Impossible Vacation                                                         158
The Rescue                                                                  144
Airframe                                                                    118
Tycoon'S Temptation (Silhouette Desire, No. 1414)                            88
Past Lives, Present Dreams: How to Use Reincarnation for Personal Growth     87
                                                                           ... 
Men, Women and Relationships                                                 21
The Adventures of Tom Sawyer                                                 21
The Twilight Before Christmas                                                21
An American Salad                                                            21
The Curious Incident of the Dog in the Night-Time : A Novel                  21
Name: count, Length: 327, dtype: int64

In [10]:
frequent_books_df = df[df['Book-Title'].isin(frequently_reviewed_books.index)]

In [11]:
frequent_books_df

Unnamed: 0,user,rating,Book-Title,Book-Author,Age
1,1,7,The Mists of Avalon,MARION ZIMMER BRADLEY,24
3,1,9,What a Wonderful World: A Lifetime of Recordings,Bob Thiele,24
7,1,9,"The Subtle Knife (His Dark Materials, Book 2)",PHILIP PULLMAN,24
9,1,10,Just Here Trying to Save a Few Lives : Tales o...,Pamela Grim,24
16,1,9,The 10th Kingdom (Hallmark Entertainment Books),Kathryn Wesley,24
...,...,...,...,...,...
59175,2943,7,The Queen of the Damned (Vampire Chronicles (P...,Anne Rice,27
59180,2943,7,Fatal Voyage,Kathy Reichs,27
59189,2943,8,Heart of Darkness (Wordsworth Collection),Joseph Conrad,27
59191,2943,9,Great Expectations (Heinemann Guided Readers),John Milne,27


In [12]:
avg_rating_df = frequent_books_df[["Book-Title", "rating"]].groupby(['Book-Title']).mean()
avg_rating_df.sort_values(by='rating', ascending=False)

Book-Title,rating
"Tucket's Gold (Tuckets Adventures, Book 4)",9.656250
The Broken Promise Land,9.500000
Agatha Raisin and the Quiche of Death (Agatha Raisin Mysteries (Paperback)),9.391304
All of Me: A Voluptuous Tale,9.310345
Pigs in Heaven,9.285714
...,...
Affinity,6.807692
Sisterhood of the Traveling Pants,6.714286
El señor de las moscas,6.714286
Purity in Death,6.045455


In [13]:
(df["Book-Title"]=="Affinity").sum()

26

### Content-based Recommendations

In [14]:
# Requires scikit-learn to be installed (e.g. pip install scikit-learn)
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform

In [14]:
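# Despite the name, the columns are authors (no genre column is used here).
# Each cell counts the rows in df for that title/author pair.
book_genre_df = pd.crosstab(df['Book-Title'], df['Book-Author'])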

In [15]:
book_genre_df.head(15)

Book-Title \ Book-Author,A. A. Attanasio,A. A. Milne,A. Bry,A. C. Bhaktivedanta Swami Prabhupada,A. C. Crispin,A. C. Gordon,A. C. Spearing,"A. Carman, Clark",A. J. Hill,A. Keyton Weissinger,...,Zenna Henderson,Zilpha Keatley Snyder,Zlata Filipovic,Zoe Benjamin,Zora Neale Hurston,Zsuzsa Polgar,Zsuzsanna E. Budapest,"jr., Richard Herman",padriac colum,stephen R Donaldson
'48,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
'N Sync,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
'Salem's Lot,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"'Til There Was You (Special Edition, No 576)",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
... vol ... Ã bord ... du.Concordia (Her Les Aventures de Michel Labre),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"1,000 Marbles: A Little Something About Precious Time",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"1,003 Great Things About Getting Older",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100 Malicious Little Mysteries,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100 Questions Every First-Time Home Buyer Should Ask : With Answers from Top Brokers from Around the Country,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100 Selected Stories (Wordsworth Classics),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The metric we will use to measure similarity between items in our newly encoded dataset is the Jaccard similarity. It is the number of attributes that two items have in common divided by the total number of distinct attributes across both items.
It always lies between 0 and 1, and the more attributes the two items share, the higher the score.
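As a minimal sketch of this definition, the function below computes the score for two sets of attributes in plain Python. The function name and the attribute sets are hypothetical, and returning 0.0 for two empty sets is a convention chosen here, not something the notebook specifies.

```python
def jaccard_similarity(attrs_a, attrs_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(attrs_a), set(attrs_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)

# Two made-up books described by made-up attributes
book_x = {"fantasy", "adventure", "classic"}
book_y = {"fantasy", "adventure", "romance"}

print(jaccard_similarity(book_x, book_y))  # 2 shared / 4 distinct = 0.5
```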

In [16]:
# Compare the author vectors of two titles
dalmatians_row = book_genre_df.loc['101 Dalmatians']
haiku_row = book_genre_df.loc['101 Corporate Haiku']
print(jaccard_score(dalmatians_row, haiku_row, average='macro'))

0.3332495285983658


The crosstab holds counts rather than 0/1 flags, so jaccard_score treats these rows as multiclass data and requires an averaging method to combine the per-class scores. The macro average above is therefore dominated by agreement on the many zero entries and is not the set-based Jaccard similarity described earlier. The options for the average parameter are:

    'binary' (default): Report the score for the positive class only. Applicable only when the targets are binary.
    None: The scores for each class are returned.
    'micro': Calculate metrics globally by counting the total true positives, false negatives, and false positives.
    'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
    'samples': Calculate metrics for each instance, and find their average (only relevant for multilabel classification).
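To see how much the choice of averaging matters when the vectors hold counts, the pure-Python sketch below contrasts a macro average over labels with the set-based Jaccard on binarized vectors. The helper names and toy vectors are hypothetical, and this is an illustration of the two definitions, not scikit-learn's implementation.

```python
def binary_jaccard(u, v):
    """Jaccard similarity treating any non-zero entry as 'present'."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

def macro_jaccard(u, v):
    """Unweighted mean of per-label Jaccard scores (multiclass view)."""
    labels = sorted(set(u) | set(v))
    scores = []
    for label in labels:
        both = sum(1 for a, b in zip(u, v) if a == label and b == label)
        either = sum(1 for a, b in zip(u, v) if a == label or b == label)
        scores.append(both / either)
    return sum(scores) / len(scores)

# Two toy count vectors over 10 "authors" with no author in common
row_a = [0, 2, 0, 0, 0, 0, 0, 0, 0, 0]
row_b = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0]

print(macro_jaccard(row_a, row_b))   # non-zero only because of shared zeros
print(binary_jaccard(row_a, row_b))  # no shared author
```

The two toy rows share no author, yet the macro average is well above zero because the label 0 matches almost everywhere. Binarizing the rows first gives the score that matches the set-based definition.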



In [17]:
book_genre_df.loc['101 Corporate Haiku']

Book-Author
A. A. Attanasio                         0
A. A. Milne                             0
A. Bry                                  0
A. C. Bhaktivedanta Swami Prabhupada    0
A. C. Crispin                           0
                                       ..
Zsuzsa Polgar                           0
Zsuzsanna E. Budapest                   0
jr., Richard Herman                     0
padriac colum                           0
stephen R Donaldson                     0
Name: 101 Corporate Haiku, Length: 7955, dtype: int64

To compute all of these similarities at once, we use two functions from SciPy's scipy.spatial.distance module.

First, pdist (short for pairwise distance) computes the distance between every pair of rows, using 'jaccard' as the metric argument.

In [18]:
# from scipy.spatial.distance import pdist
# from sklearn.utils import resample

# # Sample a subset of your DataFrame if it's too large
# sampled_book_genre_df = resample(book_genre_df, n_samples=200, replace=False, random_state=42)

# # Compute the Jaccard distances for the sampled data
# jaccard_distances = pdist(sampled_book_genre_df.values, metric='jaccard')
# print(jaccard_distances)


In [19]:
# Finding the distance between all items
jaccard_distances = pdist(book_genre_df.values, metric='jaccard')
print(jaccard_distances)

[1. 1. 1. ... 1. 1. 1.]


We then use squareform to convert this condensed 1D array into the square distance matrix we need, with one row and one column per book.

In [20]:
square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances)

[[0. 1. 1. ... 1. 1. 1.]
 [1. 0. 1. ... 1. 1. 1.]
 [1. 1. 0. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 0. 1. 1.]
 [1. 1. 1. ... 1. 0. 1.]
 [1. 1. 1. ... 1. 1. 0.]]


Since we want similarity, which is the complement of distance, we subtract each value from 1.

In [21]:
jaccard_similarity_array = 1 - square_jaccard_distances
print(jaccard_similarity_array)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
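A bare NumPy array is awkward to query by title. The sketch below runs the same pdist, squareform, and 1 - distance steps on a tiny made-up table and wraps the result in a DataFrame labelled by title. The titles, attribute columns, and variable names are all hypothetical, and the cast to bool is an assumption added here so that any non-zero count simply means "has this attribute".

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy title-by-attribute table (hypothetical titles and attributes)
toy = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]],
    index=["Book A", "Book B", "Book C"],
    columns=["attr_1", "attr_2", "attr_3", "attr_4"],
)

# Cast to bool so counts above 1 do not register as mismatches
distances = pdist(toy.values.astype(bool), metric="jaccard")
similarity = 1 - squareform(distances)

# Label rows and columns with titles so scores can be looked up by name
similarity_df = pd.DataFrame(similarity, index=toy.index, columns=toy.index)

# Titles most similar to "Book A", excluding the book itself
print(similarity_df["Book A"].drop("Book A").sort_values(ascending=False))
```

With the matrix labelled this way, a lookup such as similarity_df.loc["Book A", "Book B"] returns the score for one pair, and sorting a column ranks every other title against a chosen book.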
