<a href="https://colab.research.google.com/github/MANISH-KUMAR-CODES/Book-Recommendtion-System/blob/main/Book_Recommendtion_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many similar
web services, recommender systems have come to play an ever larger role in our lives. From
e-commerce (suggesting articles that may interest buyers) to online advertising
(showing users content that matches their preferences), recommender systems are
now unavoidable in our daily online journeys.
In general terms, recommender systems are algorithms that suggest relevant
items to users (movies to watch, texts to read, products to buy, or anything else,
depending on the industry).
Recommender systems are critical in some industries: an effective one can generate a large
amount of revenue or help a company stand out from its
competitors. The main objective of this project is to build a book recommendation system for users.

Content: The Book-Crossing dataset comprises 3 files.

 ● Users: Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data (Location, Age) is provided if available. Otherwise, these fields contain NULL values.

 ● Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, in three sizes (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.

 ● Ratings: Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1 to 10 (higher values denoting higher appreciation), or implicit, expressed by 0.
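
The explicit/implicit distinction matters later, because a 0 is not a low score. As a minimal sketch (the toy values below are made up; only the column names come from the description above), the two kinds of rating can be separated like this:

```python
import pandas as pd

# Toy ratings frame using the column names described above (values are made up).
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 10, 0],
})

# Explicit ratings are 1-10; a 0 marks an implicit interaction, not a bad score.
explicit = ratings[ratings["Book-Rating"] > 0]
implicit = ratings[ratings["Book-Rating"] == 0]

print(len(explicit), "explicit,", len(implicit), "implicit")
```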

In [None]:
# Import all the necessary libraries
import re
import operator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from scipy.sparse import csr_matrix
from scipy.spatial.distance import correlation
from pandas.api.types import is_numeric_dtype
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings("ignore")

### Recommendation Systems are one of the largest application areas of Machine Learning. They enable tailoring personalized content for users, thereby generating revenue for businesses

### There are 2 main types of personalized recommendation systems:

## Content based filtering
### Recommendations are based on the user's past likes/dislikes and the item feature space. The system recommends items similar to those the user has liked in the past. Items are considered similar based on features such as author, publisher, genre, etc.
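
A rough sketch of the content based idea, using the `TfidfVectorizer` and `cosine_similarity` imports from above. The tiny catalogue is hypothetical, and combining title and author into one text field is only one possible choice of item features:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; titles and authors are made up for illustration.
books = pd.DataFrame({
    "Book-Title": ["Space Wars", "Space Wars II", "Garden Cooking"],
    "Book-Author": ["A. Nova", "A. Nova", "B. Green"],
})

# Combine the item features into one text field and vectorize it.
features = books["Book-Title"] + " " + books["Book-Author"]
tfidf = TfidfVectorizer().fit_transform(features)

# Pairwise cosine similarity between items in feature space.
item_sim = cosine_similarity(tfidf)

# Most similar other item to the first book.
scores = item_sim[0].copy()
scores[0] = -1.0  # exclude the book itself
most_similar = books["Book-Title"].iloc[scores.argmax()]
print(most_similar)
```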

## Collaborative filtering
### Recommendations are based solely on the user's past likes/dislikes and on how other users have rated other items. The system considers neither an item's features (author, publisher, genre, etc.) nor a user's features (age, gender, location, etc.). Collaborative filtering systems take either a memory based approach or a model based approach

### **(1) Memory based approach:** Uses the entire user-item rating information to calculate similarity scores between items or users for making recommendations. There are two types:

  User based: Two users are considered similar if they rate items in a similar manner. An item is recommended to a user if another user who is similar to the user in question has liked the item

  Item based: Two items are considered similar if users rate them in a similar manner. An item is recommended to a user if it is similar to the items the user has rated in the past
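
Both memory based variants can be sketched on a toy user-item matrix. The users, books, and ratings below are made up, and treating unrated cells as 0 is a simplifying assumption of this sketch rather than a recommendation:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item rating matrix (rows: users, columns: books); 0 means "not rated".
ratings = pd.DataFrame(
    [[9, 8, 0],
     [8, 9, 7],
     [0, 1, 9]],
    index=["u1", "u2", "u3"],
    columns=["book_a", "book_b", "book_c"],
)

# User based: compare rows of the matrix.
user_sim = pd.DataFrame(cosine_similarity(ratings),
                        index=ratings.index, columns=ratings.index)
# Item based: compare columns of the matrix.
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

# Nearest neighbour of user u1 and of item book_a (excluding themselves).
nearest_user = user_sim["u1"].drop("u1").idxmax()
nearest_item = item_sim["book_a"].drop("book_a").idxmax()
print(nearest_user, nearest_item)
```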

### **(2)Model based approach:** Utilizes user-item rating information to build a model & the model (not the entire dataset) is thereafter used for making recommendations. This approach is preferred in instances where time & scalability are a concern
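
As one possible illustration of a model based approach, the sketch below factorizes a toy rating matrix with a truncated SVD from NumPy. The matrix, the choice of `k = 2` latent factors, and the treatment of unrated cells as 0 are all assumptions made for this example:

```python
import numpy as np

# Toy user-item matrix; 0 means "not rated" (kept as a value here for simplicity).
R = np.array([
    [9.0, 8.0, 0.0, 0.0],
    [8.0, 9.0, 0.0, 1.0],
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 8.0, 9.0],
])

k = 2  # number of latent factors (an arbitrary choice for this sketch)
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# The "model" is the pair of low-rank factors, not the full rating matrix.
user_factors = U[:, :k] * s[:k]
item_factors = Vt[:k, :]

# Predicted scores are computed from the factors alone.
R_hat = user_factors @ item_factors
print(np.round(R_hat, 1))
```

Once the factors are stored, scoring a user only needs one row of `user_factors` and the `item_factors` matrix, which is why this family of methods tends to scale better than scanning the full rating data.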

This project aims to build a recommendation system based on collaborative filtering and will work through one example each of a memory based and a model based algorithm

### Datasource:
This project uses the 3 CSV files provided by the AlmaBetter team:

1. User.csv: 278858 rows and 3 columns

2. Books.csv: 271360 rows and 8 columns

3. Ratings.csv: 1149780 rows and 3 columns




In [None]:
user_df = pd.read_csv('/content/drive/MyDrive/Copy of Users.csv')
books_df = pd.read_csv('/content/drive/MyDrive/Copy of Books.csv')
ratings_df = pd.read_csv('/content/drive/MyDrive/Copy of Ratings.csv')

In [None]:
#Top rows of our user_df
user_df.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [None]:
print("shape of our user_df:",user_df.shape)
print('\n')
print('null values in user_df:')
print(user_df.isnull().sum())
print('\n')
print('checking duplicates on entire  user dataframe')
len(user_df)-len(user_df.drop_duplicates())


shape of our user_df: (278858, 3)


null values in user_df:
User-ID          0
Location         0
Age         110762
dtype: int64


checking duplicates on entire  user dataframe


0

In [None]:
#Checking info of our user_df
user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


From the above we conclude that the User-ID column is of int type, so we convert it to object type. The Age column also contains many null values, so we have to find a way to treat them.

In [None]:
#Converting user_id column to object type
user_df['User-ID']= user_df['User-ID'].astype('object')

In [None]:
#Checking unique values in our Age column
user_df['Age'].unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

From the above we see that the Age column contains null values as well as invalid ages, so we keep the valid age range of readers as 5 to 90 and replace null values and invalid ages in the Age column with the mean of the valid ages.

In [None]:
#keeping age in range of 5 to 90 and replacing null values and invalid ages with the mean
user_df.loc[(user_df.Age > 90) | (user_df.Age < 5), 'Age'] = np.nan
user_df.Age = user_df.Age.fillna(user_df.Age.mean())
user_df.Age = user_df.Age.astype(np.int32)
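
The same clean-up can be checked on a small, made-up Series. Note that casting to an integer type truncates the fractional part of the mean, so a mean of about 31.3 is stored as 31:

```python
import numpy as np
import pandas as pd

# Made-up ages: one missing, two invalid (0 and 244), three valid.
ages = pd.Series([np.nan, 18.0, 0.0, 244.0, 40.0, 36.0])

# Mask values outside the 5-90 range, then fill gaps with the mean of what is left.
ages = ages.mask((ages > 90) | (ages < 5))
ages = ages.fillna(ages.mean()).astype(np.int32)

print(ages.tolist())
```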

In [None]:
#Checking unique values again
print(user_df['Age'].unique())

[34 18 17 61 26 14 25 19 46 55 32 24 20 23 51 31 21 44 30 57 43 37 41 54
 42 50 39 53 47 36 28 35 13 58 49 38 45 62 63 27 33 29 66 40 15 60 79 22
 16 65 59 48 72 56 67 80 52 69 71 73 78  9 64 12 74 75 76 83 68 11 77 70
  8  7 81 10  5  6 84 82 90 85 86 87 89 88]


In [None]:
#Checking unique  values in location column
user_df['Location'].unique()

array(['nyc, new york, usa', 'stockton, california, usa',
       'moscow, yukon territory, russia', ...,
       'sergnano, lombardia, italy', 'stranraer, n/a, united kingdom',
       'tacoma, washington, united kingdom'], dtype=object)

In [None]:
#splitting each string in Location with the help of split()
list_of_location = user_df.Location.str.split(', ')
list_of_location

0                         [nyc, new york, usa]
1                  [stockton, california, usa]
2            [moscow, yukon territory, russia]
3                  [porto, v.n.gaia, portugal]
4         [farnborough, hants, united kingdom]
                          ...                 
278853                 [portland, oregon, usa]
278854    [tacoma, washington, united kingdom]
278855             [brampton, ontario, canada]
278856             [knoxville, tennessee, usa]
278857                  [dublin, n/a, ireland]
Name: Location, Length: 278858, dtype: object

In [None]:
INVALID = {'', ' ', ',', 'n/a'}   # placeholder values treated as missing

def clean_part(parts, idx):
    """Return the lower-cased location component at idx, or 'other' if it is missing or invalid."""
    if len(parts) <= idx or parts[idx] in INVALID:
        return 'other'
    return parts[idx].lower()

city = []                      # city of each user
state = []                     # state of each user
country = []                   # country of each user
for parts in list_of_location:
    # exactly one entry per user goes into each list, so all three stay aligned with user_df
    city.append(clean_part(parts, 0))
    state.append(clean_part(parts, 1))
    country.append(clean_part(parts, 2))

user_df = user_df.drop('Location', axis=1)

# some city entries look like 'city/state'; keep only the city part, as the state is stored separately
temp = [ent.split('/')[0] for ent in city]
        

In [None]:
df_city = pd.DataFrame(temp,columns=['City'])
df_state = pd.DataFrame(state,columns=['State'])
df_country = pd.DataFrame(country,columns=['Country'])

user_df = pd.concat([user_df, df_city], axis=1)
user_df = pd.concat([user_df, df_state], axis=1)
user_df = pd.concat([user_df, df_country], axis=1)
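
For comparison, the same split can be written without an explicit loop using pandas string methods. This is only a sketch on a made-up Series; `locations` and `parts` are illustrative names, not variables from the notebook:

```python
import pandas as pd

locations = pd.Series([
    "nyc, new york, usa",
    "dublin, n/a, ireland",
    "porto",
])

# Split on ', ' into columns and keep the first three; missing parts become NaN/None.
parts = locations.str.split(", ", expand=True).reindex(columns=range(3))
parts.columns = ["City", "State", "Country"]

# Fill missing parts, normalise case, and map placeholder values to 'other'.
parts = parts.fillna("other").apply(lambda col: col.str.lower())
parts = parts.replace(["", " ", ",", "n/a"], "other")

# Keep only the city from 'city/state' style entries.
parts["City"] = parts["City"].str.split("/").str[0]
print(parts)
```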

In [None]:
## Drop duplicate rows
user_df.drop_duplicates(keep='last', inplace=True)
user_df.reset_index(drop=True, inplace=True)

In [None]:
# checking our user_df again
user_df.head()

Unnamed: 0,User-ID,Age,City,State,Country
0,1,34.0,nyc,new york,usa
1,2,18.0,stockton,california,usa
2,3,34.0,moscow,yukon territory,russia
3,4,17.0,porto,v.n.gaia,portugal
4,5,34.0,farnborough,hants,united kingdom
