**About Book Crossing Dataset**<br>

This dataset has been compiled by Cai-Nicolas Ziegler in 2004, and it comprises of three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation) and implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [2]:
#Loading data
import pandas as pd

books = pd.read_csv("C:/PyThon/books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('C:/PyThon/users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('C:/PyThon/ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
  interactivity=interactivity, compiler=compiler, result=result)


### Check no.of records and features given in each dataset

In [3]:
books.shape
books.dtypes

(271360, 8)

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageUrlS            object
imageUrlM            object
imageUrlL            object
dtype: object

In [4]:
users.shape
users.dtypes

(278858, 3)

userID        int64
Location     object
Age         float64
dtype: object

In [5]:
ratings.shape
ratings.dtypes

(1149780, 3)

userID         int64
ISBN          object
bookRating     int64
dtype: object

## Exploring books dataset

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [7]:
books=books.drop(columns={"imageUrlS","imageUrlM","imageUrlL"}, axis=1)
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
dtype: object

In [8]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [9]:
books["yearOfPublication"].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As it can be seen from above that there are some incorrect entries in this field. It looks like Publisher names 'DK Publishing Inc' and 'Gallimard' have been incorrectly loaded as yearOfPublication in dataset due to some errors in csv file.


Also some of the entries are strings and same years have been entered as numbers in some places. We will try to fix these things in the coming questions.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [10]:
books[books["yearOfPublication"]=="DK Publishing Inc"]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [11]:
books=books.drop(books[books.yearOfPublication=='DK Publishing Inc'].index)


In [12]:
books=books.drop(books[books.yearOfPublication=='Gallimard'].index)

### Change the datatype of yearOfPublication to 'int'

In [13]:
books["yearOfPublication"]=books.yearOfPublication.astype(int)

In [14]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [15]:
books.publisher.dropna(inplace=True)


## Exploring Users dataset

In [16]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [17]:
print(sorted(users.Age.unique()))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

Age column has some invalid entries like nan, 0 and very high values like 100 and above

### Values below 5 and above 90 do not make much sense for our book rating case...hence replace these by NaNs

In [18]:
users["Age"].replace(users[users["Age"]<5],"NaN",inplace=True)

In [19]:
users["Age"].replace(users[users["Age"]>90],"NaN",inplace=True)

### Replace null values in column `Age` with mean

In [20]:
users.Age.fillna(users.Age.mean,inplace=True)

### Change the datatype of `Age` to `int`

In [21]:
users.dtypes
users["Age"]=users.Age.astype

userID       int64
Location    object
Age         object
dtype: object

## Exploring the Ratings Dataset

### check the shape

In [22]:
ratings.shape

(1149780, 3)

In [23]:
n_users = users.shape[0]
n_books = books.shape[0]

In [24]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

In [46]:
ratings.dtypes

userID         int64
ISBN          object
bookRating     int64
dtype: object

In [51]:
r_df=ratings.merge(books,on='ISBN')
r_df.shape
r_df.dtypes

(1031132, 7)

userID                int64
ISBN                 object
bookRating            int64
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

In [53]:
#r_df1=r_df.pivot(index='userID',columns='ISBN',values='bookRating').fillna(0)

### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [58]:
r_df1=r_df.merge(users,on='userID')
r_df1.dtypes
r_df1.shape

userID                int64
ISBN                 object
bookRating            int64
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
Location             object
Age                  object
dtype: object

(1031132, 9)

### Consider only ratings from 1-10 and leave 0s in column `bookRating`

In [59]:
r_df1=r_df1.drop(r_df1[r_df1.bookRating ==0].index,axis=0)
r_df1.sample(10)

RecursionError: maximum recursion depth exceeded

TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

### Find out which rating has been given highest number of times

In [29]:
final.groupby("bookRating").agg("count")
# 8 rating has been given highest number of times


Unnamed: 0_level_0,userID,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,Location,Age
bookRating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,1481,1481,1481,1481,1481,1481,1481,1481
2,2375,2375,2375,2375,2375,2375,2375,2375
3,5118,5118,5118,5118,5118,5118,5118,5118
4,7617,7617,7617,7617,7617,7617,7617,7617
5,45355,45355,45355,45355,45355,45355,45355,45355
6,31687,31687,31687,31687,31687,31687,31687,31687
7,66401,66401,66401,66401,66401,66401,66401,66401
8,91804,91804,91804,91803,91804,91804,91804,91804
9,60778,60778,60778,60778,60778,60776,60778,60778
10,71225,71225,71225,71225,71225,71225,71225,71225


### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated atleast 100 books

In [30]:
final_res=pd.DataFrame(final.groupby('userID').agg("sum")).reset_index()
final_res.sample(10)


Unnamed: 0,userID,bookRating,yearOfPublication
14579,59203,13,3991
53163,218086,20,5998
29108,118673,15,3978
49710,204449,8,1989
9950,40318,28,6006
11350,46130,7,1993
30847,125918,37,9987
2051,8686,7,1999
46485,190725,8,1994
18570,75764,68,13933


In [31]:
final_res["userID"]=final_res[final_res["bookRating"]>100]
X=pd.DataFrame(final_res["userID"])


In [32]:
final2=final.merge(X,on='userID')
final2.head()

TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

RecursionError: maximum recursion depth exceeded while calling a Python object

In [33]:
final2.groupby('userID').agg("sum")

Unnamed: 0_level_0,bookRating,yearOfPublication
userID,Unnamed: 1_level_1,Unnamed: 2_level_1
242,193,39910
243,131,35923
254,443,113878
388,124,33911
503,141,29950
507,296,71881
638,501,107870
643,335,81887
651,188,55878
709,153,41994


### Generating ratings matrix from explicit ratings


In [34]:
#R_df = final2.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)

#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [35]:
final2.fillna(0,inplace=True)

### Generate the predicted ratings using SVD with no.of singular values to be 50

In [36]:
from scipy.sparse.linalg import svds

U, sigma, Vt = svds(final2, k = 8)

AttributeError: 'str' object has no attribute 'conjugate'

### Take a particular user_id

### Lets find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [None]:
userID = 2110

In [None]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [None]:
user_data.head()

In [None]:
user_data.shape

### Combine the user_data and and corresponding book data(`book_data`) in a single dataframe with name `user_full_info`

In [None]:
book_data.shape

In [None]:
book_data.head()

In [None]:
user_full_info.head()

### Get top 10 recommendations for above given userID from the books not already rated by that user