## Machine learning project



Author: AKALAH ABDEL-MATINOU

# 1. Problem statement and the role of ML in the study

In this project, we explore a book ratings dataset and build a book recommendation model. 

Given our dataset, we would like to recommend new books to users. <br>
Recommendations of this kind are widely used by online retailers and e-commerce sites such as Amazon, by Netflix for movies, and so on. 
Using machine learning techniques and Python libraries, we can tackle this problem and suggest books that match each user's preferences. 


# 2. Brief state of the art of the different approaches

## Recommender system algorithms

Recommendation systems are machine learning models that learn each user's affinity for items in order to propose personalized recommendations. 

There are 2 main types of recommendation systems:
1. Collaborative filtering, which relies only on past user-item interactions (here, ratings):
    - User-item collaborative filtering (user-user or item-item similarity)
    - Latent factor models (user-item matrix factorisation)
2. Content-based models:
    - A content-based model, also called a feature-based model, relies on the features or attributes of users and items to build a model that explains the user-item interactions.<br>
    Features here can be a book's author, year, genre, etc., or a user's sex, age, location, etc.
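
To make the latent factor idea concrete, here is a minimal sketch of matrix factorisation using a truncated SVD from NumPy. The matrix `R`, the number of factors `k`, and the variable names are all made up for illustration; unrated cells are filled with 0 only to keep the example short, which a real model would not do.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = books); all values are made up.
# Unrated cells are filled with 0 here purely to keep the sketch short.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of latent factors (hypothetical choice)

# Truncated SVD: R is approximated by user_factors @ item_factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
item_factors = Vt[:k, :]          # shape (k, n_books)
R_hat = user_factors @ item_factors

print(np.round(R_hat, 2))
```

Each user and each book is described by `k` numbers, and a predicted rating is the dot product of the two. This project does not use this approach; it is shown only to illustrate the second family of collaborative filtering methods.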

In our case, we could not use a content-based model, since it relies on user or item features and our dataset lacks features rich enough to build one. <br> For example: 
the user dataset contains only location and age, and the location feature shows a strong bias towards the US.

<img src= "https://miro.medium.com/max/1400/1*rCK9VjrPgpHUvSNYw7qcuQ@2x.png">

[Source: Baptiste Rocca](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)

To learn more about recommender systems, refer to the following resources:
- [Introduction to recommender systems](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)
- [Recommendation systems explained](https://towardsdatascience.com/recommendation-systems-explained-a42fc60591ed#:~:text=Recommendation%20engines%20are%20a%20subclass,returned%20back%20to%20the%20user.)

# 3. Data exploration

In [1]:
# Let's start by importing the libraries

# Data manipulation and numerical libraries
from math import sqrt
import numpy as np
import pandas as pd
import scipy
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Loading datasets

In [2]:
# Books dataset
df_books = pd.read_csv('./dataset/Books.csv', low_memory=False)
# Ratings dataset
df_ratings = pd.read_csv('./dataset/Ratings.csv')
# Users dataset
df_users = pd.read_csv('./dataset/Users.csv', low_memory=False)


## Let's explore our Book dataset

In [3]:
# check columns and types 
df_books.dtypes

ISBN                   object
Book-Title             object
Book-Author            object
Year-Of-Publication    object
Publisher              object
Image-URL-S            object
Image-URL-M            object
Image-URL-L            object
dtype: object

In [4]:
# check the first few rows
df_books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
# Check the shape
print(f'df_books shape: {df_books.shape}')

df_books shape: (271360, 8)


In [6]:
# check general stats description
df_books.describe()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
count,271360,271360,271359,271360,271358,271360,271360,271357
unique,271360,242135,102023,118,16807,271044,271044,271041
top,195153448,Selected Poems,Agatha Christie,2002,Harlequin,http://images.amazon.com/images/P/185326119X.0...,http://images.amazon.com/images/P/185326119X.0...,http://images.amazon.com/images/P/225307649X.0...
freq,1,27,632,17627,7535,2,2,2


In [7]:
df_books['Year-Of-Publication'].value_counts()

2002    17627
1999    17431
2001    17359
2000    17232
1998    15766
        ...  
2038        1
1910        1
1914        1
1904        1
2037        1
Name: Year-Of-Publication, Length: 118, dtype: int64

In [8]:
# We won't need the image URLs, so we drop the related columns to save memory
df_books = df_books.drop(columns=["Image-URL-S", "Image-URL-M", "Image-URL-L"])
# Rename the remaining columns
df_books.columns = ['isbn', 'title', 'author', 'year', 'publisher']
df_books.head()

Unnamed: 0,isbn,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


In [9]:
# Keep only the columns needed for collaborative filtering (year and publisher are not used);
# .copy() gives an independent table that can be modified in place afterwards
books = df_books[["isbn", "title", "author"]].copy()

In [10]:
# set isbn as index
books.set_index(keys='isbn', drop=True, inplace=True)
books.head()

Unnamed: 0_level_0,title,author
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1
195153448,Classical Mythology,Mark P. O. Morford
2005018,Clara Callan,Richard Bruce Wright
60973129,Decision in Normandy,Carlo D'Este
374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
393045218,The Mummies of Urumchi,E. J. W. Barber


In [11]:
books.at['0002005018', 'title']

'Clara Callan'

Let's now take a look at the ratings dataset

In [12]:
print(df_ratings.shape)
df_ratings.head()

(1149780, 3)


Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [13]:
print(df_ratings.isnull().sum())

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64


In [14]:
print(df_users.shape)
df_users.head()

(278858, 3)


Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [15]:
df_users.isna().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

In [16]:
df_users.isnull().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

In [17]:
# Age is missing for 110762 of the 278858 users (about 40%), so we drop the column
df_users = df_users.drop(columns='Age')

In [18]:
# Keep only the country, i.e. the last comma-separated part of the location
df_users['Country'] = df_users['Location'].apply(lambda x: x.split(',')[-1].strip())

# Copy of the users table without the original location column
df_users_loc = df_users.drop(columns='Location')

In [19]:
# df_users_loc['Country'].value_counts()  # number of users per country

In [20]:
df_users

Unnamed: 0,User-ID,Location,Country
0,1,"nyc, new york, usa",usa
1,2,"stockton, california, usa",usa
2,3,"moscow, yukon territory, russia",russia
3,4,"porto, v.n.gaia, portugal",portugal
4,5,"farnborough, hants, united kingdom",united kingdom
...,...,...,...
278853,278854,"portland, oregon, usa",usa
278854,278855,"tacoma, washington, united kingdom",united kingdom
278855,278856,"brampton, ontario, canada",canada
278856,278857,"knoxville, tennessee, usa",usa


In [21]:
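# Keep only the ratings whose user appears in the users table
print(f'ratings shape {df_ratings.shape}')
input_ratings = df_ratings[df_ratings['User-ID'].isin(df_users['User-ID'])]
print(f'ratings shape {input_ratings.shape}')  # unchanged: every rating comes from a known user
input_ratings.head()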

ratings shape (1149780, 3)
ratings shape (1149780, 3)


Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [22]:
ratings = df_ratings.copy()
ratings.columns = ['user','isbn', 'rating']
ratings.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [23]:
# Get a book's title and author from its ISBN
def bookInfo(isbn):
    title = books.at[isbn, "title"]
    author = books.at[isbn, "author"]
    return title, author

bookInfo('0155061224')

('Rites of Passage', 'Judith Rae')

In [24]:
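# Get the books rated by a given user, highest ratings first
def Get_userRatings(user, N=10):
    # N = maximum number of books to return
    UserRatings = ratings[ratings['user'] == user]
    # .copy() so that adding a column below does not write into a slice
    UserRatings_Sorted = UserRatings.sort_values(by='rating', ascending=False)[:N].copy()
    # bookInfo returns a (title, author) tuple, so the 'title' column holds both
    UserRatings_Sorted['title'] = UserRatings_Sorted['isbn'].apply(bookInfo)

    return UserRatings_Sorted

Get_userRatings(276729, 10)  # testing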

Unnamed: 0,user,isbn,rating,title
4,276729,0521795028,6,(The Amsterdam Connection : Level 4 (Cambridge...
3,276729,052165615X,3,"(Help!: Level 1, Philip Prowse)"


In [25]:
# Keep only the ratings whose ISBN is present in the books table
print(ratings.shape)
ratings = ratings[ratings["isbn"].isin(books.index)]
print(ratings.shape)

(1149780, 3)
(1031136, 3)


In [26]:
Get_userRatings(276727)

Unnamed: 0,user,isbn,rating,title
2,276727,446520802,0,"(The Notebook, Nicholas Sparks)"


# K Nearest neighbors

We are going to use collaborative filtering based on the user-item interaction matrix. This method comprises the following steps: 
1. Build the interaction matrix, where rows represent users, columns represent books, and each cell contains the rating.
2. Select a given user with all the books they rated, and compute the distance to every other user to get the K nearest neighbors. 
3. Among the neighbors, compute the average rating of each book, sort the values in descending order, and take the top n books to recommend.
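
The three steps can be sketched end to end on a tiny made-up dataset. The data, the helper names `hamming_distance` and `recommend`, and the values of `k` and `n` are assumptions chosen for illustration; the distance is re-implemented with NumPy as the fraction of positions where two rating rows differ.

```python
import numpy as np
import pandas as pd

# Made-up ratings in long format, with the same column names as `ratings`.
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "isbn":   ["A", "B", "C", "A", "B", "D", "A", "E", "C", "E"],
    "rating": [8, 6, 9, 8, 6, 10, 2, 7, 1, 5],
})

# Step 1: interaction matrix (users as rows, books as columns, NaN = not rated).
matrix = toy.pivot_table(values="rating", index="user", columns="isbn")

def hamming_distance(a, b):
    """Fraction of books on which two rating rows differ (NaN never matches)."""
    return float(np.mean(a.to_numpy() != b.to_numpy()))

def recommend(user, k=1, n=2):
    # Step 2: distance from `user` to every other user, keep the k closest.
    others = matrix.drop(index=user)
    distances = others.apply(
        lambda row: hamming_distance(matrix.loc[user], row), axis=1
    )
    neighbours = distances.sort_values().index[:k]
    # Step 3: average the neighbours' ratings, drop books the user already
    # rated, and return the n books with the highest average.
    avg = matrix.loc[neighbours].mean().dropna()
    avg = avg[~avg.index.isin(matrix.loc[user].dropna().index)]
    return avg.sort_values(ascending=False).index[:n].tolist()

print(recommend(1))  # ['D']
```

User 2 gave the same ratings as user 1 to books A and B, so user 2 is the nearest neighbor, and the only book user 2 rated that user 1 has not is D.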


#### Challenge and workaround

Collaborative filtering usually suffers from the cold-start problem: in our case, it can only recommend books to a user who has already rated some. 
There are several techniques to mitigate this problem. For example, we can recommend popular items and random items to new users. As they rate books, the recommendations improve over time. 
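
A minimal sketch of the popularity fallback, in plain Python. The function `recommend_with_fallback`, the `personalised` callable, and the toy data are hypothetical; "popular" is taken here to mean "rated by the most users", which is one possible definition among others.

```python
from collections import Counter

def recommend_with_fallback(user, ratings, personalised, n=3):
    """Return personalised picks if the user has history, else the most-rated books.

    ratings      -- list of (user, isbn) pairs
    personalised -- hypothetical callable returning a list of ISBNs for a known user
    """
    seen = {isbn for u, isbn in ratings if u == user}
    if seen:
        return personalised(user)[:n]
    # Cold start: rank books by how many ratings they received.
    popularity = Counter(isbn for _, isbn in ratings)
    return [isbn for isbn, _ in popularity.most_common(n)]

# Made-up (user, isbn) pairs and a stub standing in for the KNN recommender.
toy_ratings = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]
knn_stub = lambda user: ["C"]

known = recommend_with_fallback(1, toy_ratings, knn_stub, n=2)      # user 1 has history
newcomer = recommend_with_fallback(99, toy_ratings, knn_stub, n=2)  # user 99 has none
print(known, newcomer)  # ['C'] ['A', 'B']
```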

In [27]:
ratings.shape

(1031136, 3)

In [28]:
usersPerISBN = ratings.isbn.value_counts()     # number of ratings per book
print(f'distinct books: {len(usersPerISBN)}')
ISBNsPerUser = ratings.user.value_counts()     # number of ratings per user
print(f'distinct users: {len(ISBNsPerUser)}')

distinct books: 270151
distinct users: 92106


In [29]:
# Reduce the number of users and books for computational reasons
print(f'ratings before reduce {ratings.shape}')
# Reduce rows: keep users with more than 50 ratings
ratings = ratings[ratings["user"].isin(ISBNsPerUser[ISBNsPerUser > 50].index)]
# Reduce columns: keep books with more than 50 ratings
ratings = ratings[ratings["isbn"].isin(usersPerISBN[usersPerISBN > 50].index)]

print(f'ratings after reduce {ratings.shape}')

ratings before reduce (1031136, 3)
ratings after reduce (137573, 3)


In [30]:
# Build the user-item interaction matrix
userItemRatingMatrix = pd.pivot_table(ratings,  values= "rating",
                                      index=['user'], columns=['isbn'])

userItemRatingMatrix.head()

isbn,000649840X,002026478X,0020442203,002542730X,0028604199,006000438X,0060008032,0060008776,006001203X,0060085444,...,1860492592,1878424319,1885171080,1931561648,3257228007,3257229534,3404148665,3423202327,3442541751,3492045170
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
243,,,,,,,,,,,...,,,,,,,,,,
254,,,,,,0.0,,,,,...,,,,,,,,,,
507,,,,,,,,,,,...,,,,,,,,,,
638,,,,,,,,,,,...,,,,,,,,,,
643,,,,,,,,,,,...,,,,,,,,,,


Find K nearest neighbors

In [31]:
user1 = 254
user2 = 243

In [32]:
user1Ratings = userItemRatingMatrix.loc[user1]  # ratings row of user1, indexed by isbn
user1Ratings.head()

isbn
000649840X   NaN
002026478X   NaN
0020442203   NaN
002542730X   NaN
0028604199   NaN
Name: 254, dtype: float64

In [33]:
user2Ratings = userItemRatingMatrix.loc[user2]

In [34]:
# Measure the Hamming distance between the two users' rating vectors
from scipy.spatial.distance import hamming
hamming(user1Ratings, user2Ratings)

0.9985721085197525
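
A small sketch of why this distance is so close to 1. It assumes the unweighted Hamming distance is the fraction of positions at which the two vectors differ, and re-implements that with NumPy rather than calling SciPy; `hamming_like` and the two toy vectors are hypothetical. Because `NaN != NaN` evaluates to true, a book that neither user rated counts as a difference.

```python
import numpy as np

def hamming_like(u, v):
    """Fraction of positions where u and v differ, i.e. mean(u != v)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.mean(u != v))

a = [8.0, np.nan, np.nan, 5.0]
b = [8.0, np.nan, 7.0, 6.0]
print(hamming_like(a, b))  # 0.75: only the first position matches
```

Under this measure, two users are close only when they gave exactly the same rating to the same books, so in a sparse matrix most pairs of users end up at or near a distance of 1.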

In [35]:
def distance(user1, user2):
    # Hamming distance between two users' rating rows; NaN if a user is not in the matrix
    try:
        user1Ratings = userItemRatingMatrix.loc[user1]
        user2Ratings = userItemRatingMatrix.loc[user2]
        dist = hamming(user1Ratings, user2Ratings)
    except KeyError:
        dist = np.nan
    return dist



In [36]:
np.nan  # the np.NaN alias was removed in NumPy 2.0

nan

In [37]:
user = user1
allUsers = pd.DataFrame(userItemRatingMatrix.index)
allUsers = allUsers[allUsers.user != user]
allUsers.head()


Unnamed: 0,user
0,243
2,507
3,638
4,643
5,741


In [38]:
allUsers["distance"] = allUsers['user'].apply(lambda x: distance(user, x))
allUsers.head()

Unnamed: 0,user,distance
0,243,0.998572
2,507,1.0
3,638,1.0
4,643,1.0
5,741,0.999524


In [39]:
# Find the K nearest neighbors of an active user

def nearestNeighbors(user, K=10):
    # All users except the active one
    allUsers = pd.DataFrame(userItemRatingMatrix.index)
    allUsers = allUsers[allUsers.user != user]
    # Distance from the active user to every other user
    allUsers["distance"] = allUsers['user'].apply(lambda x: distance(user, x))
    # Keep the K users with the smallest distance
    KnearestUsers = allUsers.sort_values(["distance"], ascending=True)['user'][:K]
    return KnearestUsers

KnearestUsers = nearestNeighbors(user1)

In [40]:
NNRatings = userItemRatingMatrix[userItemRatingMatrix.index.isin(KnearestUsers)]
NNRatings

isbn,000649840X,002026478X,0020442203,002542730X,0028604199,006000438X,0060008032,0060008776,006001203X,0060085444,...,1860492592,1878424319,1885171080,1931561648,3257228007,3257229534,3404148665,3423202327,3442541751,3492045170
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11676,8.0,,,6.0,,6.0,8.0,,0.0,0.0,...,0.0,,,10.0,0.0,0.0,10.0,8.0,7.0,9.0
21014,,,,,,,,,,,...,,0.0,,,,,,,,
60244,,,,0.0,,0.0,,,0.0,,...,,0.0,,,,,,,,
76352,,,,,,,,0.0,,,...,,,,,,,,,,
87555,,,0.0,,,,,,,,...,,,,,,,,,,
102967,,,,,,,0.0,,,,...,,,,,,,,,,
185233,,,,0.0,0.0,,,,,,...,,,,,,,,,,
204864,,,,0.0,,,,,,,...,,,,,,,,,,
211426,,0.0,,0.0,,0.0,,,,,...,,,,,,,,,,
230522,,,0.0,,,,,,,,...,,,,0.0,,,,,,


In [41]:
userItemRatingMatrix.index

Int64Index([   243,    254,    507,    638,    643,    741,    882,    929,
              1211,   1424,
            ...
            277928, 277965, 278026, 278137, 278144, 278188, 278418, 278582,
            278633, 278843],
           dtype='int64', name='user', length=2954)

In [42]:
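# Average rating per book among the neighbors; books none of them rated are dropped.
# DataFrame.mean skips NaN by default and avoids the empty-slice warning of np.nanmean
avgRatings = NNRatings.mean().dropna()
avgRatings.head()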


isbn
000649840X    8.0
002026478X    0.0
0020442203    0.0
002542730X    1.2
0028604199    0.0
dtype: float64

In [43]:
# Books the active user has already rated
Readbooks = userItemRatingMatrix.loc[user].dropna().index


In [44]:
avgRatings = avgRatings[~avgRatings.index.isin(Readbooks)]

In [45]:
def topN(user, N=3):
    # Average the ratings of the K nearest neighbors
    KnearestUsers = nearestNeighbors(user)
    NNRatings = userItemRatingMatrix[userItemRatingMatrix.index.isin(KnearestUsers)]
    avgRatings = NNRatings.mean().dropna()
    # Exclude the books the user has already rated
    booksAlreadyRated = userItemRatingMatrix.loc[user].dropna().index
    avgRatings = avgRatings[~avgRatings.index.isin(booksAlreadyRated)]
    # Return (title, author) for the N best-rated remaining books
    topNISBNs = avgRatings.sort_values(ascending=False).index[:N]
    return pd.Series(topNISBNs).apply(bookInfo)

test = topN(user1)

In [46]:
# Fallback: most-rated books that the user has not rated yet
userRatedIsbn = ratings[ratings['user'] == user].isbn.unique()

popularBooks = ratings['isbn'].value_counts().index
popularRecom = popularBooks[~popularBooks.isin(userRatedIsbn)]

popularRecom[:4]

Index(['0971880107', '0316666343', '0385504209', '0060928336'], dtype='object')

End