# Preprocessing the Goodreads data for the Recommendation System

As an avid reader I was interested in creating a recommendation system for books. The social media/rating site Goodreads is the obvious source for data about books and their ratings.

For this project I use the Goodreads data prepared on Kaggle:
https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m

This dataset contains a million books and the ratings by 6000 users.

## Combining all the separate DataFrames

Because of the large number of data points, the book information and the user ratings are each split across several CSV files. I combine them here and, at the end, trim the result down to a manageable size that fits our purposes.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Combine the book and rating files:

In [5]:
books_1 = pd.read_csv('CSV/book1-100k.csv', engine = 'python', encoding = 'latin-1')
books_2 = pd.read_csv('CSV/book100k-200k.csv', engine = 'python', encoding = 'latin-1')
books_3 = pd.read_csv('CSV/book200k-300k.csv', engine = 'python', encoding = 'latin-1')
books_4 = pd.read_csv('CSV/book300k-400k.csv', engine = 'python', encoding = 'latin-1')
books_5 = pd.read_csv('CSV/book400k-500k.csv', engine = 'python', encoding = 'latin-1')
books_6 = pd.read_csv('CSV/book500k-600k.csv', engine = 'python', encoding = 'latin-1')
books_7 = pd.read_csv('CSV/book600k-700k.csv', engine = 'python', encoding = 'latin-1')
books_8 = pd.read_csv('CSV/book700k-800k.csv', engine = 'python', encoding = 'latin-1')
books_9 = pd.read_csv('CSV/book800k-900k.csv', engine = 'python', encoding = 'latin-1')
books_10 = pd.read_csv('CSV/book900k-1000k.csv', engine = 'python', encoding = 'latin-1')
books_11 = pd.read_csv('CSV/book1000k-1100k.csv', engine = 'python', encoding = 'latin-1')
books_12 = pd.read_csv('CSV/book1100k-1200k.csv', engine = 'python', encoding = 'latin-1')
books_13 = pd.read_csv('CSV/book1200k-1300k.csv', engine = 'python', encoding = 'latin-1')
books_14 = pd.read_csv('CSV/book1300k-1400k.csv', engine = 'python', encoding = 'latin-1')
books_15 = pd.read_csv('CSV/book1400k-1500k.csv', engine = 'python', encoding = 'latin-1')
books_16 = pd.read_csv('CSV/book1500k-1600k.csv', engine = 'python', encoding = 'latin-1')
books_17 = pd.read_csv('CSV/book1600k-1700k.csv', engine = 'python', encoding = 'latin-1')
books_18 = pd.read_csv('CSV/book1700k-1800k.csv', engine = 'python', encoding = 'latin-1')


In [6]:
books = pd.concat([books_1,books_2,books_3,books_4,books_5,books_6,books_7,books_8,books_9,books_10,books_11,books_12,books_13,books_14,books_15,books_16,books_17,books_18]).reset_index(drop = True)
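The eighteen `read_csv` calls above could equally be written as a loop. The following is only a minimal sketch, not the code this notebook ran: the helper name `load_and_concat` is hypothetical, and the two tiny stand-in files are made up so that the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_and_concat(paths):
    """Read each CSV with the same options as above and stack the results."""
    frames = [pd.read_csv(p, engine='python', encoding='latin-1') for p in paths]
    return pd.concat(frames).reset_index(drop=True)


# Two tiny stand-in files (made-up content) so the sketch is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({'Name': ['Book A', 'Book B']}).to_csv(tmp_dir / 'part1.csv', index=False)
pd.DataFrame({'Name': ['Book C']}).to_csv(tmp_dir / 'part2.csv', index=False)

demo_books = load_and_concat(sorted(tmp_dir.glob('part*.csv')))
print(demo_books)
```

With the real data, the list of paths would simply be the `CSV/book...csv` file names listed above, passed in whatever order is wanted.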

In [7]:
ratings_1 = pd.read_csv('CSV/user_rating_0_to_1000.csv', engine = 'python', encoding = 'latin-1')
ratings_2 = pd.read_csv('CSV/user_rating_1000_to_2000.csv', engine = 'python', encoding = 'latin-1')
ratings_3 = pd.read_csv('CSV/user_rating_2000_to_3000.csv', engine = 'python', encoding = 'latin-1')
ratings_4 = pd.read_csv('CSV/user_rating_3000_to_4000.csv', engine = 'python', encoding = 'latin-1')
ratings_5 = pd.read_csv('CSV/user_rating_4000_to_5000.csv', engine = 'python', encoding = 'latin-1')
ratings_6 = pd.read_csv('CSV/user_rating_5000_to_6000.csv', engine = 'python', encoding = 'latin-1')

In [8]:
Ratings = pd.concat([ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,ratings_6])


In [9]:
# Count the placeholder rows (users without any rating) in each rating file
no_rating = "This user doesn't have any rating"
tuple(r[r['Rating'] == no_rating].shape for r in (ratings_1, ratings_2, ratings_3, ratings_4, ratings_5, ratings_6))

((303, 3), (521, 3), (580, 3), (498, 3), (471, 3), (331, 3))

In [35]:
(ratings_1.shape,ratings_2.shape,ratings_3.shape,ratings_4.shape,ratings_5.shape,ratings_6.shape)

((51945, 3), (42986, 3), (30633, 3), (46970, 3), (46903, 3), (15481, 3))
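The first tuple above counts the placeholder rows that stand for users without any rating. This notebook does not filter them out explicitly; the sketch below shows one way it could be done with a boolean mask. The three rows are made up, and only the column names (`ID`, `Name`, `Rating`) and the placeholder text are taken from the data above.

```python
import pandas as pd

NO_RATING = "This user doesn't have any rating"

# Made-up rows using the same three columns as the rating files.
demo = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Book A', 'placeholder', 'Book B'],
    'Rating': ['liked it', NO_RATING, 'it was ok'],
})

is_placeholder = demo['Rating'] == NO_RATING
print(is_placeholder.sum())            # number of placeholder rows

real_ratings = demo[~is_placeholder]   # keep only rows with an actual rating
print(real_ratings.shape)
```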

## Editing the DataFrames

The book file contains duplicate titles, so we drop them, keeping the first entry for each title:

In [10]:
bb = books.drop_duplicates('Name', keep = 'first')

In [11]:
bb = bb.reset_index(drop=True)
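To make the effect of `keep = 'first'` concrete, here is a small sketch on made-up data. The `Publisher` column is hypothetical and only there to show which of the duplicate rows survives.

```python
import pandas as pd

# Made-up book list with one repeated title.
demo_books = pd.DataFrame({
    'Name': ['Dune', 'Emma', 'Dune'],
    'Publisher': ['X', 'Y', 'Z'],   # hypothetical column
})

# Keep the first row for each title, then renumber the index from 0.
deduped = demo_books.drop_duplicates('Name', keep='first').reset_index(drop=True)
print(deduped)
```

Note that matching on `Name` alone treats different editions, or different books that share a title, as one book.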

Create a new column that gives each book a new ID based on its position in the index. This makes it easier to track the books that our users rate. We then attach this new book ID to every row of the rating file by matching on the book name.

In [12]:
bb['book_id'] = bb.index

In [13]:
Ratings_1= pd.merge(Ratings, bb[['Name', 'book_id']], on='Name')
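`pd.merge` performs an inner join by default, so ratings whose book name has no match in the book table are dropped silently. A minimal sketch with made-up rows (the titles and ratings are illustrative only):

```python
import pandas as pd

demo_books = pd.DataFrame({'Name': ['Dune', 'Emma']})
demo_books['book_id'] = demo_books.index

demo_ratings = pd.DataFrame({
    'ID': [1, 1, 2],
    'Name': ['Emma', 'Dune', 'Unknown title'],
    'Rating': ['liked it', 'it was ok', 'liked it'],
})

# Inner join on the title: the 'Unknown title' row has no partner and disappears.
merged = pd.merge(demo_ratings, demo_books[['Name', 'book_id']], on='Name')
print(merged)
```

Comparing `len(Ratings)` with `len(Ratings_1)` would show how many ratings were lost this way on the real data.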

### Trimming down our Rating file

Our recommendation system will need an input which consists of rows of users and columns of books. This means that the more users and books we allow, the heavier its job becomes. For this reason we keep only the books that received 3 or more ratings, and we trim the book DataFrame down to those same books.

In [14]:
# Number of ratings per book
s = Ratings_1['book_id'].value_counts()
# IDs of the books with at least 3 ratings
popular_ids = s.index[s >= 3]
Ratings_1 = Ratings_1[Ratings_1['book_id'].isin(popular_ids)]
bb = bb[bb['book_id'].isin(popular_ids)]
number_books = bb['book_id'].nunique()
number_books

8524

In [15]:
Ratings_1.isin(s.index[s >= 3]).book_id.values

array([ True,  True,  True, ...,  True,  True,  True])
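The count-then-filter pattern used in this step can be seen on a handful of made-up rows. This is a sketch of the technique, with illustrative IDs only:

```python
import pandas as pd

# Made-up ratings: book 10 has three ratings, books 11 and 12 have one each.
demo = pd.DataFrame({
    'ID': [1, 2, 3, 1, 2],
    'book_id': [10, 10, 10, 11, 12],
})

counts = demo['book_id'].value_counts()    # ratings per book
kept_ids = counts.index[counts >= 3]       # books with at least 3 ratings
filtered = demo[demo['book_id'].isin(kept_ids)]
print(filtered)
```

Counting the `ID` column instead of `book_id` would apply the same threshold to users rather than books.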

We want the Rating file ordered by user, not by book ID:

In [16]:
Ratings_2=Ratings_1.sort_values('ID').reset_index(drop = True)

In [17]:
Ratings_2 = Ratings_2.rename(columns={'ID': 'user_id'})

In [18]:
Ratings_2 = Ratings_2.reindex(columns= ['user_id','book_id','Name', 'Rating' ])

The rating column contains categorical text labels instead of numbers. We convert each label to a number, with 5 being the highest and 1 the lowest rating given. 0 will stand for "no review" during the recommendation process, but it is not needed at this stage.

In [19]:
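# Restrict the replacement to the Rating column so that book names are never touched
Ratings_3 = Ratings_2.replace({'Rating': {'it was amazing': 5, 'really liked it': 4, 'liked it': 3, 'it was ok': 2, 'did not like it': 1}})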

In [20]:
Ratings_3['Rating'].value_counts()

4    39240
5    30026
3    27308
2     7752
1     2136
Name: Rating, dtype: int64
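An alternative way to express the same label-to-number conversion is `Series.map` with a dictionary. The sketch below uses a made-up series and is not the call used in this notebook. One difference is worth knowing: `map` turns any label missing from the dictionary into `NaN`, whereas `replace` leaves it unchanged.

```python
import pandas as pd

rating_map = {
    'it was amazing': 5,
    'really liked it': 4,
    'liked it': 3,
    'it was ok': 2,
    'did not like it': 1,
}

demo = pd.Series(['liked it', 'it was amazing', 'did not like it'])
numeric = demo.map(rating_map)
print(numeric.tolist())

# A label that is not in the dictionary becomes NaN instead of staying as text.
unknown = pd.Series(['no such label']).map(rating_map)
print(unknown.isna().all())
```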

In [21]:
nb_books = len(bb)

In [22]:
# Renumber the remaining books 0..n-1 (the old index is kept as an 'index' column)
bb = bb.reset_index()
bb['book_id'] = bb.index

In [23]:
bb.book_id.values

array([   0,    1,    2, ..., 8521, 8522, 8523])

We now prepare the final rating file with the columns in the order user ID, book ID and rating:

In [24]:
Ratings_3 = Ratings_3.drop('book_id', axis =1)
Ratings_4 = pd.merge(Ratings_3, bb[['Name', 'book_id']], on='Name')
Final_Ratings = (Ratings_4.reindex(columns= ['user_id','book_id', 'Rating' ])).sort_values(by = 'user_id')
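The recommendation system is described above as needing rows of users and columns of books, with 0 standing for "no review". A hedged sketch of how a long table like `Final_Ratings` could be turned into such a matrix with `pivot_table`; the three rows are made up, and this is not necessarily how the recommendation system itself does it.

```python
import pandas as pd

# Made-up long-format ratings with the same three columns as Final_Ratings.
demo = pd.DataFrame({
    'user_id': [1, 1, 2],
    'book_id': [0, 2, 1],
    'Rating': [5, 4, 3],
})

# One row per user, one column per book; missing pairs are filled with 0.
matrix = demo.pivot_table(index='user_id', columns='book_id',
                          values='Rating', fill_value=0)
print(matrix)
```

`pivot_table` averages the values if a user has rated the same book more than once, which is why it is used here instead of `pivot`, which would raise an error in that case.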

In [25]:
Final_Ratings

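| | user_id | book_id | Rating |
|---|---|---|---|
| 0 | 1 | 22 | 5 |
| 14116 | 1 | 6773 | 5 |
| 14122 | 1 | 373 | 5 |
| 14144 | 1 | 187 | 4 |
| 14259 | 1 | 985 | 4 |
| ... | ... | ... | ... |
| 88643 | 5993 | 1145 | 3 |
| 14112 | 5993 | 572 | 5 |
| 75220 | 5993 | 2007 | 3 |
| 4262 | 5993 | 7790 | 3 |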


## Saving our new Dataframes as CSV files to be imported by our Recommendation System!

In [147]:
bb.to_csv('Final_books.csv')
Final_Ratings.to_csv('Final_ratings.csv')
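Because `to_csv` is called without `index=False`, the DataFrame index is written as an unnamed first column. The sketch below, on a made-up frame written to a temporary file, shows what that means when the file is read back: a plain `read_csv` yields an extra `Unnamed: 0` column, while `index_col=0` restores the index.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with a non-trivial index, similar in shape to Final_Ratings.
demo = pd.DataFrame(
    {'user_id': [1, 2], 'book_id': [0, 1], 'Rating': [5, 3]},
    index=[7, 3],
)

path = Path(tempfile.mkdtemp()) / 'demo_ratings.csv'
demo.to_csv(path)                            # index is written as the first column

plain = pd.read_csv(path)                    # index comes back as 'Unnamed: 0'
restored = pd.read_csv(path, index_col=0)    # index is restored

print(plain.columns.tolist())
print(restored.index.tolist())
```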