### Similarity Module

#### Module Description

- **Module Name:** similarity_module



- **Module Data Input:** user_preferences dictionary.



- **Module Expected Output:** the similarity score between two users, the similarity score between two books, the first n most similar users for a given user, and the first n most similar books for a given book.


- **Module Implementation:** 

Five functions that compute similarity scores between any two given users. Each takes user_preferences, user1_id, and user2_id as parameters.


Functions that return the similarity score between any two books (cosine and Jaccard variants). Each takes user_preferences, book1_isbn, and book2_isbn as parameters.


One function that returns the first n similar users to a given user. It takes user_preferences, user_id, and n as parameters.


Functions that return the first n similar books to a given book (cosine and Jaccard variants). Each takes user_preferences, book_isbn, and n as parameters.

### Import the required libraries

In [18]:
import pickle as pkl
from math import sqrt, pow  # explicit names; math.pow shadows the built-in pow, as the star import did
from decimal import Decimal
import nbimporter

#### The reference for importing functions from another notebook file

https://stackoverflow.com/questions/50576404/importing-functions-from-another-jupyter-notebook

### Read the user_preference Dictionary

#### To get the user_preferences dictionary, we can load the pickle file or import it from load_data_set_module_nb

In [13]:
with open('user_preferences.pickle', 'rb') as handle:
    user_preferences = pkl.load(handle)

In [17]:
# from load_data_set_module_nb import user_preferences

In [14]:
user_preferences

{'2': {'0195153448': {'book_rate': 0,
   'book_title': 'Classical Mythology',
   'book_author': 'Mark P. O. Morford',
   'year_of_publish': 2002}},
 '8': {'0002005018': {'book_rate': 5,
   'book_title': 'Clara Callan',
   'book_author': 'Richard Bruce Wright',
   'year_of_publish': 2001},
  '0060973129': {'book_rate': 0,
   'book_title': 'Decision in Normandy',
   'book_author': "Carlo D'Este",
   'year_of_publish': 1991},
  '0374157065': {'book_rate': 0,
   'book_title': 'Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It',
   'book_author': 'Gina Bari Kolata',
   'year_of_publish': 1999},
  '0393045218': {'book_rate': 0,
   'book_title': 'The Mummies of Urumchi',
   'book_author': 'E. J. W. Barber',
   'year_of_publish': 1999},
  '0399135782': {'book_rate': 0,
   'book_title': "The Kitchen God's Wife",
   'book_author': 'Amy Tan',
   'year_of_publish': 1991},
  '0425176428': {'book_rate': 0,
   'book_title': "What If?: The World's Forem

## Introduction

#### Recommendation engines come in two main types, plus a hybrid of the two:


1- Collaborative filtering.


2- Content-based filtering.


3- Hybrid Recommendation Systems.


#### Collaborative Filtering


Collaborative filtering methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities, or preferences, and using that information to predict what users will like. The predictions are based on each user's similarity to other users.

<img src = "notebook_images\userbased.webp">

#### Content-based filtering.

These algorithms try to recommend items that are similar to those a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.

<img src = "notebook_images\1 Lr6qL0YjY_WqVK5u-AYHAQ.png">

### Hybrid Recommendation Systems

<img src = "notebook_images\people-who-liked-this-talk-also-liked-building-recommendation-systems-using-ruby-40-6381.webp">

#### Find the similarity score between two users

Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters.

### Define five functions to calculate similarities between two users:


To find the similarity between two users, we will use five of the most popular similarity measures:

- cosine similarity
- pearson's correlation 
- manhattan distance
- jaccard similarity
- euclidean distance

***Note:*** The following code was implemented after studying the references section and the basics of recommendation systems, with the help of all the attached references, especially

https://dataaspirant.com/five-most-popular-similarity-measures-implementation-in-python/

***Note:*** these functions should first find the books the two users have in common, and then compute the similarity score from those common books.
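The note above can be sketched as a small helper. This is only an illustration: the name `common_book_ratings` and the toy user IDs and ISBNs are assumptions, not part of the module. It assumes the same nested shape as user_preferences, where each book entry carries a 'book_rate' key.

```python
# Toy data in the same shape as user_preferences (IDs and ISBNs are made up).
toy_prefs = {
    "u1": {"isbn_a": {"book_rate": 8}, "isbn_b": {"book_rate": 5}},
    "u2": {"isbn_b": {"book_rate": 7}, "isbn_c": {"book_rate": 9}},
}


def common_book_ratings(prefs, user1_id, user2_id):
    """Return two parallel lists of ratings for the books both users rated."""
    # Dictionary keys are ISBNs, so a set intersection gives the common books.
    shared = sorted(set(prefs[user1_id]) & set(prefs[user2_id]))
    ratings1 = [prefs[user1_id][isbn]["book_rate"] for isbn in shared]
    ratings2 = [prefs[user2_id][isbn]["book_rate"] for isbn in shared]
    return ratings1, ratings2


print(common_book_ratings(toy_prefs, "u1", "u2"))  # ([5], [7])
```

The rating-based measures (cosine, Pearson, Manhattan, Euclidean) each repeat this step inline, so a shared helper like this one could remove that duplication.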

#### Jaccard Index

The Jaccard index measures the similarity of two sets as the size of their intersection divided by the size of their union. Here the two sets are the books rated by each user.

<img src="notebook_images\maxresdefault9.jpg">
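As a small worked example of the formula, the sketch below computes the Jaccard index for two made-up sets of ISBNs. The set contents are assumptions chosen only so the arithmetic is easy to follow.

```python
# Two hypothetical users and the books (ISBNs) each one has rated.
user1_books = {"isbn_a", "isbn_b", "isbn_c"}
user2_books = {"isbn_b", "isbn_c", "isbn_d", "isbn_e"}

shared = user1_books & user2_books    # books rated by both users (2 books)
combined = user1_books | user2_books  # books rated by either user (5 books)

# Guard against an empty union so the division is always defined.
jaccard = len(shared) / len(combined) if combined else 0.0
print(jaccard)  # 0.4
```

Note that the ratings themselves play no role: only which books were rated matters.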

In [83]:
"""

Function Name: jaccard_index.
Parameters: 3 positional arguments user_preference dictionary, two user Ids.
Output: jaccard similarity between the two users.
How it works: applying the jaccard formula

"""

In [1]:
def jaccard_index(prefs, user1_id, user2_id):
     try:
        
        #extract books for both users
        user1_books_dic = prefs[user1_id]
        user2_books_dic = prefs[user2_id]
    
    
        # convert each dictionary to a set of books, for both user1, user2
        user1_books_set = set(user1_books_dic)
        user2_books_set = set(user2_books_dic)
    
        # find the sizes of the intersection and the union
        users_books_intersection = len(user1_books_set.intersection(user2_books_set))
        users_books_union = len(user1_books_set.union(user2_books_set))
        
        # find jaccard
        jaccard_ = users_books_intersection / users_books_union
        return jaccard_
    
     except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)

In [15]:
jaccard_similarity_test = jaccard_index(user_preferences, '278851', '278854')
jaccard_similarity_test

0.0

#### Euclidean Distance

Euclidean distance is the traditional metric for geometric problems: it is the ordinary straight-line distance between two points. It is one of the most widely used measures in cluster analysis.

<img src="notebook_images\unnamed4.png">
<img src="notebook_images\eclidean.JPG">
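The function in this section returns 1/sqrt(d), which is undefined when the squared differences sum to zero (for example, when two users gave identical ratings). A common alternative, shown here only as a sketch and not as the notebook's method, maps the distance d to 1 / (1 + d). The name `euclidean_similarity` and the rating lists are assumptions for illustration.

```python
from math import sqrt


def euclidean_similarity(ratings1, ratings2):
    """Map the Euclidean distance d to a similarity 1 / (1 + d) in (0, 1]."""
    d = sqrt(sum((x - y) ** 2 for x, y in zip(ratings1, ratings2)))
    # Identical ratings give d = 0 and therefore a similarity of 1.0.
    return 1 / (1 + d)


print(euclidean_similarity([8, 5], [8, 5]))  # 1.0
print(euclidean_similarity([8, 5], [5, 9]))  # d = sqrt(9 + 16) = 5, so 1/6
```

With this variant, a larger value still means more similar users, and no special case is needed for a zero distance.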

In [84]:
"""

Function Name: euclidean_distance.
Parameters: 3 positional arguments user_preference dictionary, two user Ids.
Output: inverse of the euclidean distance between the two users (a larger value means more similar ratings).
How it works: applying the euclidean distance formula

"""

In [2]:
def euclidean_distance(prefs, user1_id, user2_id):
    try:
        
        # define variables to store summation result
        d = 0
        
        # find the similar books and their rate given by the two users, then apply the formula
        for book in prefs[user1_id]:
            if book in prefs[user2_id]:
                d = d + pow(prefs[user1_id][book]['book_rate'] - prefs[user2_id][book]['book_rate'],2)
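        # return the inverse of the distance so that closer users get a larger score;
        # when d == 0 this raises ZeroDivisionError, which is caught below (the function then returns None)
        return 1/sqrt(d)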
    
    except Exception as e :
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)
        

In [107]:
euclidean_distance_similarity_test = euclidean_distance(user_preferences,'11676', '2770')
euclidean_distance_similarity_test

0.31622776601683794

#### Cosine Similarity

Cosine similarity is the cosine of the angle between two vectors, given by the following formula:

<img src="notebook_images\cos.PNG">
<img src="notebook_images\eucos.png">

Here θ (theta) is the angle between the two vectors, and A and B are n-dimensional vectors.
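To make the formula concrete, here is a minimal sketch on plain rating lists. The name `cosine_of_ratings` and the sample ratings are assumptions for illustration. The second call shows a property worth keeping in mind when reading scores: when two users share only one book with nonzero ratings, the vectors are one-dimensional and the cosine is always 1.0.

```python
from math import sqrt


def cosine_of_ratings(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat the similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_of_ratings([3, 4], [4, 3]))  # 24 / (5 * 5) = 0.96
print(cosine_of_ratings([7], [2]))        # 1.0, a single shared book always aligns
```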

In [85]:
"""

Function Name: cosine_similarity.
Parameters: 3 positional arguments user_preference dictionary, two user Ids.
Output: cosine theta between the two users.
How it works: applying the cosine similarity formula

"""

In [3]:
def cosine_similarity(prefs, user1_id, user2_id):
    try:
        
        # define two empty lists to save rates value for common books
        user1_common_books_rates = []
        user2_common_books__rates = []
        
        
        # find the similar books and their rate given by the two users, then apply the formula
        for book in prefs[user1_id]:
            if book in prefs[user2_id]:
                user1_common_books_rates.append(prefs[user1_id][book]['book_rate'])
                user2_common_books__rates.append(prefs[user2_id][book]['book_rate'])
                
        numerator = sum(x*y for x,y in zip(user1_common_books_rates,user2_common_books__rates))
        denominator = sqrt(sum([x*x for x in user1_common_books_rates]))*sqrt(sum([y*y for y in user2_common_books__rates]))
        
        # avoid zero division error by check the value of denominator 
        if denominator == 0:
            return 0
        return round(numerator/float(denominator),3)
    
    except Exception as e :
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)
        

In [103]:
cosine_similarity_test = cosine_similarity(user_preferences, '40222', '2770')
cosine_similarity_test

1.0

#### Pearson correlation coefficient

Correlation coefficients measure how strong the relationship between two variables is. There are several formulas for a correlation coefficient; one of the most popular is Pearson’s correlation (also known as Pearson’s R), which is commonly used in linear regression. Pearson’s correlation coefficient is denoted by the symbol “R”. The formula returns a value between -1 and 1, where:

- -1 indicates a strong negative relationship
- 1 indicates a strong positive relationship
- 0 indicates no linear relationship at all

<img src="notebook_images\Pearson correlation coefficient.JPG">
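The two extremes of the range can be checked with a short sketch. It uses the mean-centered form of the Pearson formula, which is algebraically equivalent to the sums-based form used in the function below. The name `pearson_centered` and the sample ratings are assumptions for illustration.

```python
from math import sqrt


def pearson_centered(xs, ys):
    """Pearson correlation of two equal-length lists, mean-centered form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like numerator and the product of the two spreads.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    # A constant list has zero spread; report 0 rather than dividing by zero.
    return num / den if den else 0.0


print(pearson_centered([1, 2, 3], [2, 4, 6]))  # 1.0, ratings rise together
print(pearson_centered([1, 2, 3], [6, 4, 2]))  # -1.0, ratings move in opposite directions
```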

In [86]:
"""

Function Name: pearson_correlation.
Parameters: 3 positional arguments user_preference dictionary, two user Ids.
Output: correlation coefficient between the two users.
How it works: applying the pearson correlation formula

"""

In [4]:
def pearson_correlation(prefs,user1_id,user2_id):
    try:
            # collect the ISBNs of the books rated by both users ({book_isbn: 1})
            sim={}
            for dic in prefs[user1_id]:
                if dic in prefs[user2_id]: sim[dic]=1
            

            if len(sim)==0:
                return 0 
    
            n=len(sim)
            # Sums of all the preferences
            sum1=sum([prefs[user1_id][item]['book_rate'] for item in sim])
            sum2=sum([prefs[user2_id][item]['book_rate'] for item in sim])

            # Sums of the squares
            sum1Sq=sum([pow(prefs[user1_id][item]['book_rate'],2) for item in sim])
            sum2Sq=sum([pow(prefs[user2_id][item]['book_rate'],2) for item in sim])
    
            # Sum of the products
            pSum=sum([prefs[user1_id][item]['book_rate']*prefs[user2_id][item]['book_rate'] for item in sim])
    
    
            # Calculate r (Pearson score)    
            num= pSum-(sum1*sum2/n)
            den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
            if den==0:
                return 0

            r=num/den

            return r
        
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)     


In [76]:
pearson_correlation_test = pearson_correlation(user_preferences, '34588', '2770')
pearson_correlation_test

1.0

#### Manhattan Distance

Manhattan distance is the sum of the absolute differences between the coordinates of two points.

Suppose we have two points P and Q in a plane, with P at coordinate (x1, y1) and Q at (x2, y2). To determine the distance between them, we add up how far apart they are along the X-axis and along the Y-axis.

Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

<img src="notebook_images\manhanten.png">
<img src="notebook_images\1200px-Manhattan_distance.svg_.png">
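The two-coordinate formula extends to any number of coordinates: with one coordinate per commonly rated book, the distance is the sum of the absolute rating differences. A minimal sketch, where the name `manhattan_of_ratings` and the ratings are made up for illustration:

```python
def manhattan_of_ratings(ratings1, ratings2):
    """Sum of absolute differences between two equal-length rating lists."""
    return sum(abs(x - y) for x, y in zip(ratings1, ratings2))


# Three commonly rated books: |8 - 6| + |5 - 5| + |10 - 7| = 2 + 0 + 3
print(manhattan_of_ratings([8, 5, 10], [6, 5, 7]))  # 5
```

Unlike the cosine and Pearson scores, this is a distance: a smaller value means the two users rated their common books more alike.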

In [87]:
"""

Function Name: manhattan_distance.
Parameters: 3 positional arguments user_preference dictionary, two user Ids.
Output: manhattan distance between the two users.
How it works: applying the manhattan distance formula

"""

In [5]:
def manhattan_distance(prefs,user1_id,user2_id):
    try:
        # define two empty lists to save rates value for common books
        user1_common_books_rates = []
        user2_common_books_rates = []
        
        # extract similar books and their rates
        for book in prefs[user1_id]:
            if book in prefs[user2_id]:
                user1_common_books_rates.append(prefs[user1_id][book]['book_rate'])
                user2_common_books_rates.append(prefs[user2_id][book]['book_rate'])



        return sum(abs(x-y) for x,y in zip(user1_common_books_rates,user2_common_books_rates))
    
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message) 

In [105]:
manhattan_distance_test = manhattan_distance(user_preferences, '11676', '2770')
manhattan_distance_test

4

### Return the first n similar users

In order to find the first n similar users to a given user:

- first we find the similarity score between the given user and all other users in the dataset. In our case we choose pearson_correlation, but any other function could be used to calculate the similarity score
    
- sort the similarity score
    
- return a list of tuples of the similar user_id and the similarity score, its length is equal to n
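The steps above can be sketched with the similarity function passed in as a parameter, so that any of the five measures could be plugged in. This is an illustrative variant, not the module's function: `top_n_similar_users`, `shared_book_count`, and the toy data are assumptions. It relies on `heapq.nlargest` from the standard library, which avoids sorting the full list when n is small.

```python
import heapq


def top_n_similar_users(prefs, user_id, n, similarity):
    """Return the n highest (score, other_user) pairs for user_id."""
    scored = (
        (similarity(prefs, user_id, other), other)
        for other in prefs
        if other != user_id
    )
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(n, scored)


def shared_book_count(prefs, user1_id, user2_id):
    """Toy similarity: how many books the two users have in common."""
    return len(set(prefs[user1_id]) & set(prefs[user2_id]))


toy_prefs = {
    "u1": {"a": {}, "b": {}, "c": {}},
    "u2": {"a": {}, "b": {}},
    "u3": {"c": {}},
    "u4": {"z": {}},
}
print(top_n_similar_users(toy_prefs, "u1", 2, shared_book_count))
# [(2, 'u2'), (1, 'u3')]
```

The similarity function passed in must always return a number. A function that returns None on failure would make the tuples impossible to compare.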

In [6]:
def first_n_similar_users(prefs, user_id , n):
     try:
        # score every other user against the given user
        scores = [(pearson_correlation(prefs, user_id, other_user), other_user) for other_user in prefs if other_user != user_id]

        # sort so that the highest-scoring users come first
        scores.sort(reverse=True)
        return scores[0:n]
     except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message) 


In [71]:
first_n_users = first_n_similar_users(user_preferences,'2770',8)

In [72]:
first_n_users

[(1.0, '34588'),
 (0.970725343394151, '11676'),
 (0, '9994'),
 (0, '9992'),
 (0, '9991'),
 (0, '9990'),
 (0, '999'),
 (0, '9989')]

### Find similarity between two books

What is Item-item similarity?


Item-item collaborative filtering, or item-based, or item-to-item, is a form of collaborative filtering for recommender systems based on the similarity between items calculated using people's ratings of those items. Item-item collaborative filtering was invented and used by Amazon.com in 1998.
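Because item-item similarity compares the ratings each book received, it can help to index the data by book rather than by user. The sketch below inverts a user_preferences-shaped dictionary once, so that the users who rated both books can be found without scanning every user. The names `invert_preferences`, `common_user_ratings`, and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def invert_preferences(prefs):
    """Turn {user: {isbn: details}} into {isbn: {user: rating}}."""
    by_book = defaultdict(dict)
    for user, books in prefs.items():
        for isbn, details in books.items():
            by_book[isbn][user] = details["book_rate"]
    return dict(by_book)


def common_user_ratings(by_book, book1_isbn, book2_isbn):
    """Ratings given to each book by the users who rated both of them."""
    raters1 = by_book.get(book1_isbn, {})
    raters2 = by_book.get(book2_isbn, {})
    shared = sorted(set(raters1) & set(raters2))
    return [raters1[u] for u in shared], [raters2[u] for u in shared]


toy_prefs = {
    "u1": {"isbn_a": {"book_rate": 9}, "isbn_b": {"book_rate": 6}},
    "u2": {"isbn_a": {"book_rate": 10}, "isbn_b": {"book_rate": 9}},
    "u3": {"isbn_a": {"book_rate": 4}},
}
by_book = invert_preferences(toy_prefs)
print(common_user_ratings(by_book, "isbn_a", "isbn_b"))  # ([9, 10], [6, 9])
```

The inversion costs one pass over the data. After that, each pair of books only touches the users who actually rated them.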

In [7]:
# this function returns the ratings for book1 and book2 given by the common users who rated both books
def book_common_users_ratings(prefs, book1_isbn, book2_isbn):
    try:
        
        # define two empty lists to save the common users' ratings for the two books
        book1_common_usres_rates = []
        book2_common_usres_rates = []
        for user in prefs:
            # check if the user has rated both books
            if book1_isbn in prefs[user] and book2_isbn in prefs[user]:
                # if yes, extract the user's rating for each book
                book1_common_usres_rates.append(prefs[user][book1_isbn]['book_rate'])
                book2_common_usres_rates.append(prefs[user][book2_isbn]['book_rate'])
            
            
        return book1_common_usres_rates, book2_common_usres_rates
    
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message) 

In [41]:
book1_common_usres_rates, book2_common_usres_rates = book_common_users_ratings(user_preferences, '0553141406','0842329129')
print(book1_common_usres_rates)
print(book2_common_usres_rates)

[9, 10]
[6, 9]


In [8]:
# this function calls book_common_users_ratings to get two lists of the ratings given to book1 and book2 by the same users,
# then finds the cosine similarity between them

def books_cosin_similarity(prefs, book1_isbn, book2_isbn):
    try:
        
        book1_common_usres_rates, book2_common_usres_rates = book_common_users_ratings(prefs, book1_isbn, book2_isbn)
        numerator = sum(a*b for a,b in zip(book1_common_usres_rates,book2_common_usres_rates))
        denominator = sqrt(sum([a*a for a in book1_common_usres_rates]))*sqrt(sum([a*a for a in book2_common_usres_rates]))
        if denominator == 0:
            return 0
        
        return round(numerator/float(denominator),3)
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)

In [81]:
books_cosin_similarity(user_preferences, '8433914545','0842329129')

1.0

In [9]:
def books_jaccard_similarity(prefs, book1_isbn, book2_isbn):
    try:
        
        # collect the users who have rated each book
        book1_usres = []
        book2_usres = []
        for user in prefs:
            if book1_isbn in prefs[user]:
                book1_usres.append(user)
                
            if book2_isbn in prefs[user]:
                book2_usres.append(user)
                
                
        # convert the list to a set
        p1 = set(book1_usres)
        p2 = set(book2_usres)
        
        n1 = len(p1.intersection(p2))
        
        n2 = len(p1.union(p2))
#        print(p1.intersection(p2))
        return n1/n2
    
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)
        

In [79]:
books_jaccard_similarity(user_preferences, '8433914545','0842329129')

0.018518518518518517

### Return the first n similar books

In order to find the first n similar books to a given book:

- first we find the similarity score between the given book and all other books in the dataset. In our case we choose cosine similarity, but any other similarity function could be used instead
- sort the similarity scores
- return a list of n tuples, each holding the similarity score and the book_isbn of a similar book
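One detail in these steps is worth a sketch: a book rated by several users shows up once per user when looping over the dictionary, so the same ISBN can appear more than once in the result. Collecting the candidate ISBNs into a set first avoids that. The names `candidate_books`, `first_n_similar_books`, `shared_raters`, the `max_users` parameter, and the toy data are all assumptions for illustration, not the module's functions.

```python
def candidate_books(prefs, book_isbn, max_users=None):
    """Unique ISBNs rated by the first max_users users, excluding book_isbn."""
    seen = set()
    for user in list(prefs)[:max_users]:  # max_users=None means all users
        seen.update(prefs[user])
    seen.discard(book_isbn)
    return sorted(seen)


def first_n_similar_books(prefs, book_isbn, n, similarity, max_users=None):
    """Score each candidate once and return the n best (score, isbn) pairs."""
    scores = [
        (similarity(prefs, book_isbn, other), other)
        for other in candidate_books(prefs, book_isbn, max_users)
    ]
    scores.sort(reverse=True)
    return scores[:n]


def shared_raters(prefs, book1_isbn, book2_isbn):
    """Toy similarity: number of users who rated both books."""
    return sum(1 for books in prefs.values() if book1_isbn in books and book2_isbn in books)


toy_prefs = {
    "u1": {"a": {}, "b": {}},
    "u2": {"a": {}, "b": {}},
    "u3": {"a": {}, "c": {}},
}
print(first_n_similar_books(toy_prefs, "a", 5, shared_raters))
# [(2, 'b'), (1, 'c')]
```

Each candidate is also scored only once, which saves repeated similarity computations for popular books.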

In [20]:
def first_n_cosin_similar_books(prefs, book_isbn, n):
    try:
        
        # define an empty list to save similarity scores values 
        scores = []
        
        # call the cosine function for our book and every book rated by the first 1000 users in the user_preferences dictionary
        for dic in list(prefs.keys())[0:1000]:
            for book in list(prefs[dic].keys()):
                if book != book_isbn:
                    try:
                        scores.append((books_cosin_similarity(prefs, book_isbn, book), book))
                    
                    except Exception:
                        continue
        # sort the scores values
        scores.sort()
        scores.reverse()
        # return the values according to the number of books
        return scores[0:n]
    
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message)

In [77]:
similar_books_test = first_n_cosin_similar_books(user_preferences, '0842329129', 3 )

In [78]:
similar_books_test

[(1.0, '8433914545'), (1.0, '3453863593'), (1.0, '3453212150')]

In [11]:
def first_n_jaccard_similar_books(prefs, book_isbn, n):
    try:
        # define an empty list to save similarity scores values 
        scores = []
        # call the jaccard function for our book and every book rated by the first 300 users in the user_preferences dictionary
        for dic in list(prefs.keys())[0:300]:
            for book in list(prefs[dic].keys()):
                if book != book_isbn:
                    try:
                        scores.append((books_jaccard_similarity(prefs, book_isbn, book), book))
                    
                    except Exception:
                        continue
        scores.sort()
        scores.reverse()
        return scores[0:n]
    
    except Exception as e:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        print(message) 

In [75]:
similar_books_test2 = first_n_jaccard_similar_books(user_preferences, '0842329129', 20 )
similar_books_test2

[(0.12048192771084337, '0440211263'),
 (0.1095890410958904, '0446610399'),
 (0.1095890410958904, '0446610399'),
 (0.1095890410958904, '0446610399'),
 (0.10843373493975904, '0385335881'),
 (0.10784313725490197, '0385335482'),
 (0.10606060606060606, '0553569058'),
 (0.10606060606060606, '0553569058'),
 (0.10606060606060606, '0345369947'),
 (0.10526315789473684, '0767903579'),
 (0.10465116279069768, '0440224675'),
 (0.10465116279069768, '0440224675'),
 (0.1044776119402985, '067976397X'),
 (0.10294117647058823, '0446360589'),
 (0.10126582278481013, '0553250426'),
 (0.10126582278481013, '0553250426'),
 (0.10126582278481013, '0553250426'),
 (0.09859154929577464, '1558744150'),
 (0.09803921568627451, '0375707972'),
 (0.0967741935483871, '0373218400')]

#### References


https://link.springer.com/chapter/10.1007/978-0-387-85820-3_1


https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144059


https://www.geeksforgeeks.org/minkowski-distance-python/


https://dataaspirant.com/collaborative-filtering-recommendation-engine-implementation-in-python/


https://www.geeksforgeeks.org/measures-of-distance-in-data-mining/


https://dataaspirant.com/five-most-popular-similarity-measures-implementation-in-python/


https://realpython.com/build-recommendation-engine-collaborative-filtering/


https://betterprogramming.pub/movie-similarity-recommendation-using-python-b98a2670a2ad