<div style="direction:rtl; font-size:12px">In the name of God</div>
<div style="direction:rtl; font-size:12px">Group 8: Armin Eftekhari - Mohammad Soleiman Behzad - Mohammadreza Pourreza - Reza Barzegar Torghabeh - Sobhan Moradian Daghigh</div>
<div style="direction:rtl"> In this section we build one similarity matrix for users and one for items, each based on the similarity between their content. We then take the
intersection of these similarities with the similarities obtained in the collaborative filtering part, which gives a hybrid
algorithm intended to produce better and more sensible results.
Further, more detailed explanations are given in the document and alongside the code.</div>

### Implementation of content-based filtering

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import pickle
import warnings
warnings.filterwarnings("ignore")

### Reading the dataset

In [2]:
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
dataset = pd.read_csv('preprocessed_ratings.csv', sep=',', on_bad_lines='skip', encoding='latin-1')

In [3]:
dataset.head()

Unnamed: 0,User_ID,ISBN,Book_Rating,Book_Title,Book_Author,Year_Of_Publication,Publisher,Location,Age
0,53,451,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"strafford, missouri, usa",34.0
1,53,280,0,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown","strafford, missouri, usa",34.0
2,53,647,0,The Da Vinci Code,Dan Brown,2003,Doubleday,"strafford, missouri, usa",34.0
3,53,2028,0,Wild Animus,Rich Shapero,2004,Too Far,"strafford, missouri, usa",34.0
4,53,241,0,Four To Score (A Stephanie Plum Novel),Janet Evanovich,1999,St. Martin's Paperbacks,"strafford, missouri, usa",34.0


### Filtering the dataset to get the book and user content

In [4]:
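# Note: 'Book-Author' does not match the column name 'Book_Author', so filter() silently
# skips it and the author is left out of the book content shown below
books = dataset.filter(['Book_Title' , 'Book-Author', 'Year_Of_Publication','Publisher']).copy()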
books = books.drop_duplicates(subset='Book_Title', keep='first')

users = dataset.filter(['User_ID', 'Location', 'Age']).copy()
users = users.drop_duplicates(subset='User_ID', keep='first')

In [5]:
books.head()

Unnamed: 0,Book_Title,Year_Of_Publication,Publisher
0,Flesh Tones: A Novel,2002,Ballantine Books
1,The Lovely Bones: A Novel,2002,"Little, Brown"
2,The Da Vinci Code,2003,Doubleday
3,Wild Animus,2004,Too Far
4,Four To Score (A Stephanie Plum Novel),1999,St. Martin's Paperbacks


In [6]:
users.iloc[0]

User_ID                           53
Location    strafford, missouri, usa
Age                               34
Name: 0, dtype: object

### Making the similarity matrix

In [7]:
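def similarity_matrix(dataset):
    # Render each row as one plain-text string (all columns concatenated)
    string = dataset.to_string(header=False,index=False).split('\n')
    # Build TF-IDF vectors, ignoring English stop words
    tfidf = TfidfVectorizer(stop_words='english')
    dataset_matrix = tfidf.fit_transform(string)
    # TF-IDF rows are L2-normalized, so the linear kernel equals cosine similarity
    similar_matrix = linear_kernel(dataset_matrix,dataset_matrix)
    return similar_matrix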
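
As a sanity check on the function above, the following sketch runs the same TF-IDF and `linear_kernel` steps on three made-up book rows (the strings are illustrative, not taken from the dataset). Because `TfidfVectorizer` L2-normalizes each row by default, the dot products returned by `linear_kernel` should coincide with cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical rows, each a single string like the ones DataFrame.to_string() yields
rows = [
    "The Lovely Bones: A Novel 2002 Little, Brown",
    "The Lovely Bones 2003 Back Bay Books",
    "Wild Animus 2004 Too Far",
]

tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(rows)   # rows are L2-normalized by default (norm='l2')

sim = linear_kernel(matrix, matrix)  # plain dot products between rows

# With unit-length rows the dot product equals the cosine similarity
print(np.allclose(sim, cosine_similarity(matrix, matrix)))
# Rows 0 and 1 share the tokens "lovely" and "bones"; rows 0 and 2 share none
print(sim[0, 1] > sim[0, 2])
```

Every diagonal entry is 1, since each row is perfectly similar to itself.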

In [None]:
with open('sim_matrix', 'wb') as fp:
    pickle.dump(similarity_matrix(books), fp)
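
To reuse the stored matrix later without recomputing it, it can be read back with `pickle.load`. The sketch below is a round trip on a small stand-in array written to a temporary directory. The array values and the temporary path are assumptions for illustration; only the file name `sim_matrix` comes from the cell above.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the output of similarity_matrix(books); values are made up
toy_matrix = np.array([[1.0, 0.3], [0.3, 1.0]])

# Write to a temporary directory so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'sim_matrix')
with open(path, 'wb') as fp:
    pickle.dump(toy_matrix, fp)

# Read the matrix back; only unpickle files you created yourself
with open(path, 'rb') as fp:
    loaded = pickle.load(fp)

print(np.array_equal(toy_matrix, loaded))
```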

### Map each `Book_Title` and `User_ID` to its index for easy lookup

In [8]:
book_mapping = pd.Series(books.index,index = books['Book_Title'])
user_mapping = pd.Series(users.index,index = users['User_ID'])
user_mapping

User_ID
53           0
94         100
217        172
273        612
586        679
         ...  
1250    137561
2332    137567
482     137568
1666    137570
2580    137572
Length: 2954, dtype: int64
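
The values in this series are index labels carried over from the full ratings table (for example 137572 in a series of length 2954), not row positions. A similarity matrix built from the de-duplicated frame has one row per kept row, so a label has to be converted to a position before it is used to index that matrix. A minimal sketch on made-up data, using `Index.get_loc`; the names `ratings`, `toy_users` and `toy_mapping` are hypothetical.

```python
import pandas as pd

# Hypothetical ratings table: users 53 and 94 each appear twice
ratings = pd.DataFrame({'User_ID': [53, 53, 94, 94, 217],
                        'Age': [34.0, 34.0, 27.0, 27.0, 41.0]})
toy_users = ratings.drop_duplicates(subset='User_ID', keep='first')

# Same construction as above: the values are index labels of the original table
toy_mapping = pd.Series(toy_users.index, index=toy_users['User_ID'])

label = toy_mapping[217]                    # index label in the original table
position = toy_users.index.get_loc(label)   # row position in the de-duplicated table

print(label, position)
```

Here user 217 keeps label 4 from the original table but sits at position 2 after de-duplication.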

### Define a function that takes the index of a book and returns the scores of the most similar books

In [9]:
def cal_sim_score(similarity_matrix,book_index):
    # Pair every item position with its similarity to the input item
    similarity_score = list(enumerate(similarity_matrix[book_index]))
    # Sort the (position, score) pairs by score in descending order
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Keep the 100 most similar items, skipping the first entry (normally the input item itself)
    similarity_score = similarity_score[1:101]
    # Return the (position, score) pairs
    return similarity_score
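
The same ranking can also be expressed with NumPy instead of sorting Python tuples. The following is a sketch of an alternative, not the function used in this notebook: `top_k_similar` and its `k` parameter are hypothetical names. It excludes the query item by position instead of dropping the first sorted entry, which behaves differently when another item ties with the query item's own score.

```python
import numpy as np

def top_k_similar(sim_matrix, item_index, k=100):
    """Return (index, score) pairs of the k items most similar to item_index."""
    scores = np.asarray(sim_matrix[item_index], dtype=float)
    # Stable sort on negated scores gives descending order, ties kept in index order
    order = np.argsort(-scores, kind='stable')
    # Drop the query item itself by position rather than assuming it sorts first
    order = order[order != item_index][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 4x4 similarity matrix (made-up values)
toy = np.array([[1.0, 0.2, 0.9, 0.0],
                [0.2, 1.0, 0.1, 0.4],
                [0.9, 0.1, 1.0, 0.3],
                [0.0, 0.4, 0.3, 1.0]])
print(top_k_similar(toy, 0, k=2))
```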

### Define a function that takes the name of a book and returns the most similar books

In [10]:
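def recommend_books(book):
    # book_mapping holds index labels from the full dataset; convert the label to a
    # row position, because the similarity matrix is indexed by position
    book_index = books.index.get_loc(book_mapping[book])
    sim_matrix = similarity_matrix(books)
    similarity_score = cal_sim_score(sim_matrix,book_index)
    book_indices = [i[0] for i in similarity_score]
    return (books['Book_Title'].iloc[book_indices])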

### Define a function that takes the ID of a user and returns the most similar users

In [11]:
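def similar_users(user):
    # user_mapping holds index labels from the full dataset; convert the label to a
    # row position, because the similarity matrix is indexed by position
    user_index = users.index.get_loc(user_mapping[user])
    sim_matrix = similarity_matrix(users)
    similarity_score = cal_sim_score(sim_matrix,user_index)
    user_indices = [i[0] for i in similarity_score]
    return (users['User_ID'].iloc[user_indices])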

### Let's test our work

### The 10 most similar books to the input book

In [12]:
book_recommended = recommend_books('The Lovely Bones: A Novel')
book_recommended[:10]

832              The Lovely Bones
529            Sea Glass: A Novel
221               The Beach House
1032                 Nine Stories
240      White Oleander : A Novel
958            All He Ever Wanted
931               Four Blind Mice
939                The Lake House
17                     The Jester
930     The Big Bad Wolf: A Novel
Name: Book_Title, dtype: object

### The 100 most similar users to the input user

In [13]:
user_recommended = similar_users(53)
user_recommended

132009    1829
113138     359
97148     1645
93796     1090
95310     1979
          ... 
99430     1911
119227     265
126910     490
20528     2730
85320     1078
Name: User_ID, Length: 100, dtype: int64

### Now let's combine the collaborative and content-based filters

### Read the dataset produced by the collaborative part

In [14]:
collaborative_outputs = pd.read_csv('collaborativeOutputs.csv')
collaborative_outputs.head()

Unnamed: 0,book_c,book_p,user_c,user_p
0,The Last Juror,Split Second,2184.0,1288.0
1,American Gods: A Novel,Ruby (Landry),273.0,217.0
2,A Patchwork Planet,Coraline,812.0,1709.0
3,Now You See Me,Blow Fly: A Scarpetta Novel,1288.0,1535.0
4,Gone with the Wind,2nd Chance,860.0,2396.0


### Here we go :)

### The combination of collaborative and content-based filtering

### The top books that have both content and rating similarity

In [15]:
print("[The top books which have both content and rate similarity]\n")
for metric in ['cosine', 'pearson']:
    print('\nTop books recommended with {} metric:\n'.format(metric))
    col = 'book_c' if metric == 'cosine' else 'book_p'
    for book in book_recommended:
        if book in collaborative_outputs[col].tolist():
            print('-', book)

[The top books which have both content and rate similarity]


Top books recommended with cosine metric:

- Sea Glass: A Novel
- The Beach House
- White Oleander : A Novel
- The Jester
- The Little Friend
- In Her Shoes : A Novel
- Hard Eight : A Stephanie Plum Novel (A Stephanie Plum Novel)
- Middlesex: A Novel
- Now You See Me
- Coast Road: A Novel

Top books recommended with pearson metric:

- Nine Stories
- Me Talk Pretty One Day
- The Dogs of Babel (Today Show Book Club #12)
- The Hours: A Novel
- 1st to Die: A Novel
- The Nanny Diaries: A Novel
- Midwives: A Novel
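
The loop above keeps a content-based recommendation only if it also appears in the collaborative output for the chosen metric. A compact sketch of the same intersection using `Series.isin` follows. The two small inputs are made-up stand-ins for `book_recommended` and `collaborative_outputs`, and `hybrid_books` is a hypothetical helper name.

```python
import pandas as pd

def hybrid_books(content_recs, collab_df, metric):
    """Keep content-based recommendations that also occur in the collaborative column."""
    col = 'book_c' if metric == 'cosine' else 'book_p'
    # Boolean masking with isin preserves the content-based ranking of the survivors
    return content_recs[content_recs.isin(collab_df[col])].tolist()

# Made-up stand-ins for book_recommended and collaborative_outputs
content_recs = pd.Series(['Sea Glass: A Novel', 'Nine Stories', 'The Jester'])
collab_df = pd.DataFrame({'book_c': ['The Jester', 'Sea Glass: A Novel'],
                          'book_p': ['Nine Stories', 'Coraline']})

for metric in ['cosine', 'pearson']:
    print(metric, hybrid_books(content_recs, collab_df, metric))
```

Because the mask is applied to the content-based series, the result stays ordered by content similarity rather than by the order of the collaborative file.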


### The top users that have both content and rating similarity

In [16]:
print("[The top users which have both content and rate similarity]\n")
for metric in ['cosine', 'pearson']:
    print('\nTop users recommended with {} metric:\n'.format(metric))
    col = 'user_c' if metric == 'cosine' else 'user_p'
    for user in user_recommended:
        if user in collaborative_outputs[col].tolist():
            print('-', user)

[The top users which have both content and rate similarity]


Top users recommended with cosine metric:

- 273
- 860
- 586
- 2184

Top users recommended with pearson metric:

- 273
- 860
- 586
- 2184
