# RecSys4 - Jaccard Similarity Content Based

**Jaccard similarity** is a common proximity measure for comparing two items, such as two text documents. It applies to two sets or to two asymmetric binary vectors, and is defined as the size of their intersection divided by the size of their union. In the literature, Jaccard similarity, denoted **J**, is also called the Jaccard index or Jaccard coefficient. Its complement, 1 - J, is the Jaccard distance (or Jaccard dissimilarity), which measures how different the two items are rather than how alike.
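
For two sets A and B, the Jaccard similarity is J(A, B) = |A ∩ B| / |A ∪ B|, and the complementary quantity 1 - J is usually called the Jaccard distance. The snippet below is a minimal sketch of that definition. The function name `jaccard_similarity`, the toy token sets, and the choice to return 1.0 for two empty sets are assumptions made for illustration, not part of the notebook.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are considered identical
        return 1.0
    return len(a & b) / len(union)


# Toy token sets (made up for illustration)
doc1 = {"ann", "lee"}
doc2 = {"ann", "lee", "bo", "chan"}

similarity = jaccard_similarity(doc1, doc2)  # 2 shared tokens / 4 distinct tokens
distance = 1 - similarity                    # Jaccard distance
print(similarity, distance)
```

Only tokens present in at least one of the two sets enter the calculation; tokens absent from both have no effect on the score.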


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import time

books = pd.read_csv(
    r'C:\Users\user\OneDrive\Desktop\Artificial Intelligence\clean dataset\books.csv',
    usecols=['book_id', 'authors', 'original_publication_year', 'title', 'average_rating', 'image_url'],
)
ratings = pd.read_csv(r'C:\Users\user\OneDrive\Desktop\Artificial Intelligence\clean dataset\ratings.csv')

In [2]:
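# Initialize a CountVectorizer, which converts a collection of text documents into a matrix of token counts
# (unigrams and bigrams, English stop words removed). min_df=1 keeps every token;
# newer scikit-learn releases may reject an integer min_df of 0.
cv = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')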

# generating matrix for authors using fit_transform
author_matrix = cv.fit_transform(books["authors"])
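
# To make the vocabulary concrete, here is a rough sketch of which tokens
# ngram_range=(1, 2) yields for one author string. It is an approximation
# written for illustration, not scikit-learn's implementation: the helper name
# `word_ngrams` and the sample strings are hypothetical, it mirrors the
# documented default token pattern (words of two or more characters,
# lowercased), and it omits stop-word removal.

```python
import re


def word_ngrams(text, ngram_range=(1, 2)):
    """Approximate word n-grams: lowercase, keep tokens of 2+ word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n consecutive tokens across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("Ann Lee, Bo Chan"))
print(word_ngrams("J.R.R. Tolkien"))
```

# Single-character initials such as "J", "R", "R" are dropped by this pattern,
# and bigrams can span two different authors in a comma-separated list.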

In [3]:
# Convert the sparse author_matrix to a dense NumPy array
author_matrix = author_matrix.toarray()

In [4]:
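# Similarity computed as 1 - Hamming distance, following:
# https://stackoverflow.com/questions/37003272/how-to-compute-jaccard-similarity-from-a-pandas-dataframe
# Note: this is the simple matching coefficient, not the Jaccard index. It also counts
# positions where both vectors are 0 as matches, so sparse rows score very close to 1.
jac_sim = 1 - pairwise_distances(author_matrix, metric="hamming")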

In [5]:
jac_sim

array([[1.        , 0.99943749, 0.99957812, ..., 0.99957812, 0.99957812,
        0.99957812],
       [0.99943749, 1.        , 0.99943749, ..., 0.99943749, 0.99943749,
        0.99943749],
       [0.99957812, 0.99943749, 1.        , ..., 0.99957812, 0.99957812,
        0.99957812],
       ...,
       [0.99957812, 0.99943749, 0.99957812, ..., 1.        , 0.99957812,
        0.99957812],
       [0.99957812, 0.99943749, 0.99957812, ..., 0.99957812, 1.        ,
        0.99957812],
       [0.99957812, 0.99943749, 0.99957812, ..., 0.99957812, 0.99957812,
        1.        ]])
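
The off-diagonal values above all sit near 0.9995 because a Hamming-based similarity also rewards every vocabulary position where both books have a zero, and almost all positions are zero for any pair of books. The Jaccard index ignores those shared zeros. The sketch below contrasts the two on toy binary vectors. The helper names and the vectors are made up for illustration and are not taken from the notebook's data.

```python
def hamming_similarity(u, v):
    """Fraction of positions where the two vectors agree (including 0/0)."""
    matches = sum(1 for x, y in zip(u, v) if x == y)
    return matches / len(u)


def jaccard_binary(u, v):
    """Shared non-zero positions divided by positions non-zero in either vector."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 1.0  # assumed convention for all-zero pairs


# Toy vocabulary of 10 tokens
book_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
book_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # shares no tokens with book_a
book_c = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # shares two tokens with book_a

print(hamming_similarity(book_a, book_b), jaccard_binary(book_a, book_b))
print(hamming_similarity(book_a, book_c), jaccard_binary(book_a, book_c))
```

Here `book_a` and `book_b` have nothing in common, yet their Hamming-based similarity is 0.6, while their Jaccard similarity is 0. With a vocabulary of thousands of tokens the effect is much stronger. As a possible alternative, `pairwise_distances` also accepts `metric="jaccard"`, which expects boolean input (for example `author_matrix.astype(bool)`); switching to it would change the scores and recommendations shown in this notebook.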

In [6]:
# Build a Series of book titles and a reverse lookup from title to row index
titles = books['title']
indices = pd.Series(books.index, index=books['title'])

# Return book recommendations based on the author similarity scores in jac_sim
def authors_recommendations_jaccard(title):
    # Row index of the requested title
    idx = indices[title]
    # Pair each book index with its similarity to the requested book
    sim_scores = list(enumerate(jac_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (assumed to be the book itself) and keep the next six
    sim_scores = sim_scores[1:7]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices].to_frame()
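
# The function above drops the first sorted entry on the assumption that it is
# the queried book. When other books tie with a score of 1.0 (for example,
# identical author tokens), a stable sort can place a lower-index book first, so
# the query itself may slip into the results. A minimal sketch of a variant that
# excludes the query by index instead; `top_n_similar` and the toy row are
# hypothetical, not part of the notebook.

```python
def top_n_similar(sim_row, query_idx, n=6):
    """Indices of the n highest-scoring items, excluding the query itself."""
    candidates = [(i, score) for i, score in enumerate(sim_row) if i != query_idx]
    # Stable sort: items with equal scores keep their original index order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:n]]


# Toy similarity row for the item at index 2; item 0 ties with it at 1.0
row = [1.0, 0.2, 1.0, 0.7, 0.0]
print(top_n_similar(row, query_idx=2, n=3))
```

# In this toy row, sorting everything and slicing [1:4] would return the query
# (index 2) and drop index 0, whereas the explicit exclusion keeps index 0.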

In [7]:
authors_recommendations_jaccard("Pride and Prejudice")

Unnamed: 0,title
4425,The Complete Novels
8834,Succulent Wild Woman
6,The Hobbit
7,The Catcher in the Rye
18,The Fellowship of the Ring (The Lord of the Ri...
33,"Fifty Shades of Grey (Fifty Shades, #1)"


In [8]:
authors_recommendations_jaccard("Rogues")

Unnamed: 0,title
29,Gone Girl
38,"A Game of Thrones (A Song of Ice and Fire, #1)"
108,"A Clash of Kings (A Song of Ice and Fire, #2)"
132,"A Storm of Swords (A Song of Ice and Fire, #3)"
161,"A Feast for Crows (A Song of Ice and Fire, #4)"
163,"American Gods (American Gods, #1)"


In [9]:
authors_recommendations_jaccard("The Hobbit")

Unnamed: 0,title
18,The Fellowship of the Ring (The Lord of the Ri...
152,"The Two Towers (The Lord of the Rings, #2)"
158,"The Return of the King (The Lord of the Rings,..."
184,"The Lord of the Rings (The Lord of the Rings, ..."
941,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
8834,Succulent Wild Woman

