## Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

warnings.filterwarnings('ignore')

sns.set()

## Data preprocessing

#### We only need books.csv and ratings.csv, as those two files contain all the information required for our model.

In [2]:
books = pd.read_csv('books.csv')

ratings = pd.read_csv('ratings.csv')

In [3]:
books.columns

Index(['id', 'book_id', 'best_book_id', 'work_id', 'books_count', 'isbn',
       'isbn13', 'authors', 'original_publication_year', 'original_title',
       'title', 'language_code', 'average_rating', 'ratings_count',
       'work_ratings_count', 'work_text_reviews_count', 'ratings_1',
       'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'image_url',
       'small_image_url'],
      dtype='object')

**As we can see, there are a lot of unnecessary features that we won't need for recommendation.**

We will keep only the relevant features.

In [4]:
columns = ['id', 'book_id', 'isbn', 'authors', 'original_publication_year', 'title', 'average_rating',
           'ratings_count', 'small_image_url']

books_new = books[columns]

books_new.head()

| | id | book_id | isbn | authors | original_publication_year | title | average_rating | ratings_count | small_image_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2767052 | 439023483 | Suzanne Collins | 2008.0 | The Hunger Games (The Hunger Games, #1) | 4.34 | 4780653 | https://images.gr-assets.com/books/1447303603s... |
| 1 | 2 | 3 | 439554934 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Sorcerer's Stone (Harry P... | 4.44 | 4602479 | https://images.gr-assets.com/books/1474154022s... |
| 2 | 3 | 41865 | 316015849 | Stephenie Meyer | 2005.0 | Twilight (Twilight, #1) | 3.57 | 3866839 | https://images.gr-assets.com/books/1361039443s... |
| 3 | 4 | 2657 | 61120081 | Harper Lee | 1960.0 | To Kill a Mockingbird | 4.25 | 3198671 | https://images.gr-assets.com/books/1361975680s... |
| 4 | 5 | 4671 | 743273567 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | 3.89 | 2683664 | https://images.gr-assets.com/books/1490528560s... |
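
As an alternative to loading every column and then subsetting, the selection can happen at read time. The sketch below assumes pandas' `usecols` parameter of `read_csv` and uses a small in-memory CSV with made-up values in place of books.csv; `csv_text` and `wanted` are illustrative names, not part of the notebook.

```python
import io
import pandas as pd

# In-memory stand-in for books.csv; the row values are illustrative only.
csv_text = (
    "id,book_id,work_id,title,image_url\n"
    "1,101,111,Some Title,https://example.com/a.jpg\n"
)

# Only the listed columns are parsed; the others are skipped while reading.
wanted = ["id", "book_id", "title"]
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(list(subset.columns))
```

For a 10,000-row file the difference is small, so subsetting after loading, as done in this notebook, works just as well.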


In [5]:
books_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   isbn                       9300 non-null   object 
 3   authors                    10000 non-null  object 
 4   original_publication_year  9979 non-null   float64
 5   title                      10000 non-null  object 
 6   average_rating             10000 non-null  float64
 7   ratings_count              10000 non-null  int64  
 8   small_image_url            10000 non-null  object 
dtypes: float64(2), int64(3), object(4)
memory usage: 703.2+ KB


### Checking null values

#### For books

In [6]:
books_new.isna().sum()

id                             0
book_id                        0
isbn                         700
authors                        0
original_publication_year     21
title                          0
average_rating                 0
ratings_count                  0
small_image_url                0
dtype: int64

As we can see, some values are missing. However, we are going to use only the ratings dataset for our model, so we will simply fill those empty data points with 'NA'.

In [7]:
books_new = books_new.fillna('NA')
books_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   isbn                       10000 non-null  object 
 3   authors                    10000 non-null  object 
 4   original_publication_year  10000 non-null  object 
 5   title                      10000 non-null  object 
 6   average_rating             10000 non-null  float64
 7   ratings_count              10000 non-null  int64  
 8   small_image_url            10000 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 703.2+ KB
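
The `info()` output above shows a side effect of the blanket fill: `original_publication_year` changes from float64 to object, because the string 'NA' now sits next to the numbers. If the year is ever needed as a number, a column-aware fill is one possible alternative. The sketch below uses a tiny made-up frame (`demo`) in place of `books_new` and assumes pandas' nullable `Int64` dtype; it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for books_new; the rows are made up for illustration.
demo = pd.DataFrame({
    "isbn": ["439023483", None, "316015849"],
    "original_publication_year": [2008.0, np.nan, 2005.0],
})

# Blanket fill: every missing value becomes the string 'NA',
# so the numeric year column is upcast to object.
blanket = demo.fillna("NA")

# Column-aware fill: placeholder text for isbn only,
# and a nullable integer dtype that keeps missing years as <NA>.
typed = demo.copy()
typed["isbn"] = typed["isbn"].fillna("NA")
typed["original_publication_year"] = typed["original_publication_year"].astype("Int64")

print(blanket.dtypes)
print(typed.dtypes)
```

Since the books table is only used to display recommendations here, the blanket fill is adequate; the typed variant matters mainly if later steps sort or filter by year.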


This dataset will be used for recommending the books to customers.

#### For ratings

In [8]:
ratings.isna().sum()

book_id    0
user_id    0
rating     0
dtype: int64

There are no null values here.

#### Little further preprocessing is needed, as the relevant book details are already extracted and the null values have been dealt with. We will now save the dataset for later use during recommendation.

In [9]:
books_new.to_csv('books_cleaned.csv')
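
Two details are worth keeping in mind when this file is read back later. By default `to_csv` also writes the index, which reappears as an `Unnamed: 0` column, and `read_csv` treats the string 'NA' as a missing value, so the placeholders would turn back into NaN. The sketch below demonstrates both on a small made-up frame, using an in-memory buffer instead of books_cleaned.csv; `index=False`, `keep_default_na=False`, and `dtype` are standard pandas parameters, while `demo` and the other variable names are illustrative.

```python
import io
import pandas as pd

# Small stand-in for books_new after the 'NA' fill; values are illustrative.
demo = pd.DataFrame({"id": [1, 2], "isbn": ["439023483", "NA"]})

# Default round trip: the index is written out and 'NA' is parsed as NaN.
with_index = demo.to_csv()
reloaded_default = pd.read_csv(io.StringIO(with_index))

# index=False omits the index column; keep_default_na=False keeps 'NA' as text,
# and dtype=str stops the ISBN from being parsed as a number.
without_index = demo.to_csv(index=False)
reloaded_text = pd.read_csv(
    io.StringIO(without_index),
    keep_default_na=False,
    dtype={"isbn": str},
)

print(list(reloaded_default.columns))
print(reloaded_default["isbn"].isna().sum())
print(reloaded_text["isbn"].tolist())
```

Whether to change the save call depends on how the recommendation step loads the file; if it already expects the extra index column, the default behaviour can stay.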