# Book Recommender System


#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project prepares a book dataset for a recommendation system that suggests books based on similarities in their features: Title, Author, Genre, SubGenre and Height.

### 2) Data Collection
- Dataset Source - https://gist.github.com/jaidevd/23aef12e9bf56c618c41
- The data consists of 6 columns and 211 rows.

### 2.1 Import Data and Required Packages
#### Importing the pandas, NumPy, Matplotlib and Seaborn libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt 

#### Import the CSV Data as Pandas DataFrame

In [2]:
book = pd.read_csv('data/books_new.csv')

#### Show a Random Sample of Records

In [4]:
book.sample(25)


Unnamed: 0,Title,Author,Genre,SubGenre,Height,Publisher
206,Structure and Randomness,"Tao, Terence",science,mathematics,252,
173,"Cathedral and the Bazaar, The","Raymond, Eric",tech,computer_science,217,
200,Tales of Beedle the Bard,"Rowling, J K",fiction,novel,184,
94,New Markets & Other Essays,"Drucker, Peter",science,economics,176,Penguin
55,Soft Computing & Intelligent Systems,"Gupta, Madan",tech,data_science,242,Elsevier
126,Char Shabda,"Deshpande, P L",nonfiction,misc,214,
179,"World's Great Thinkers, The",,science,physics,189,
139,"Killing Joke, The",,fiction,comic,283,
29,"Complete Sherlock Holmes, The - Vol II","Doyle, Arthur Conan",fiction,classic,176,Random House
153,"Journal of Economics, vol 106 No 3",,science,economics,235,


#### Shape of the dataset

In [55]:
book.shape

(211, 6)

### 2.2 Dataset information

- Title : The title or name of the book.
- Author : The author of the book, written as "Surname, Given name". Some rows have no author recorded.
- Genre : The broad category or type of the book, such as Science, Fiction, or Nonfiction.
- SubGenre :  A more specific sub-category or theme of the book.
- Height : The height of the book in millimeters.
- Publisher :  The publishing company or entity responsible for producing the book.

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check the categories present in each categorical column

In [56]:
book.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      211 non-null    object
 1   Author     187 non-null    object
 2   Genre      211 non-null    object
 3   SubGenre   211 non-null    object
 4   Height     211 non-null    int64 
 5   Publisher  115 non-null    object
dtypes: int64(1), object(5)
memory usage: 10.0+ KB


### 3.1 Check Missing values

In [57]:
book.isna().sum()

Title         0
Author       24
Genre         0
SubGenre      0
Height        0
Publisher    96
dtype: int64

In [58]:
# book['Publisher'].value_counts()


#### Removing the Publisher Column

The 'Publisher' column has 96 missing values out of 211 rows, so we remove it.
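
For scale, 96 of 211 is roughly 45% of the rows. The sketch below shows one way the missing share per column could be computed and compared against a cut-off. The toy data and the 0.4 threshold are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for the book data (values are illustrative only)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Author": ["X", None, "Y", "Z"],
    "Publisher": [None, "Penguin", None, None],
})

# Fraction of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns whose missing share exceeds an assumed cut-off of 0.4
THRESHOLD = 0.4
to_drop = missing_share[missing_share > THRESHOLD].index.tolist()
print(to_drop)  # ['Publisher']
```

Applied to the real DataFrame, `book.isna().mean()` would give the same kind of per-column ratio.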

In [59]:
book.drop(columns= ['Publisher'],inplace = True)

In [60]:
book.head()

Unnamed: 0,Title,Author,Genre,SubGenre,Height
0,Fundamentals of Wavelets,"Goswami, Jaideva",tech,signal_processing,228
1,Data Smart,"Foreman, John",tech,data_science,235
2,God Created the Integers,"Hawking, Stephen",tech,mathematics,197
3,Superfreakonomics,"Dubner, Stephen",science,economics,179
4,Orientalism,"Said, Edward",nonfiction,history,197


In [61]:
book['Author'].value_counts()


Author
Steinbeck, John      8
Deshpande, P L       7
Rutherford, Alex     5
Sen, Amartya         4
Rand, Ayn            4
                    ..
Grisham, John        1
Durant, Will         1
Poe, Edgar Allen     1
Crichton, Michael    1
Dickens, Charles     1
Name: count, Length: 129, dtype: int64

### Replacing the missing values from the author column

There are 24 missing values in the Author column. Because this column is important for filtering the data and for the analysis, we fill the missing entries with names drawn at random from the authors already present.

In [62]:
import random
# Find non-null author names
non_null_authors = book['Author'].dropna().unique()

In [63]:
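# Replace each missing author with a name drawn at random from the existing authors.
# (Passing a single random.choice(...) result to fillna would put the same name in every gap.)
missing_author = book['Author'].isna()
book.loc[missing_author, 'Author'] = [random.choice(non_null_authors) for _ in range(missing_author.sum())]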

### 3.2 Check Duplicates

In [64]:
book.duplicated().sum()

0

In [65]:
book.nunique()

Title       210
Author      129
Genre         5
SubGenre     22
Height       65
dtype: int64

#### There is one duplicate in the Title column. As each title should be unique, we remove the row that contains the duplicated title.

In [66]:
book[book['Title'].duplicated()]

Unnamed: 0,Title,Author,Genre,SubGenre,Height
195,Angels & Demons,"Brown, Dan",fiction,novel,170


In [67]:
book = book[~book['Title'].duplicated()]

Now there are no duplicated values in Title.

In [68]:
book.shape

(210, 5)
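
A minimal sketch of what `duplicated()` and the boolean filter do, on toy data with made-up titles. One side effect worth noting: filtering keeps the original index labels, which is why `book.info()` afterwards reports 210 entries with labels still running from 0 to 210 (label 195 is gone). The `reset_index` call at the end is an optional step that the notebook does not perform.

```python
import pandas as pd

# Toy data; the titles and heights are placeholders
toy = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book B"],
    "Height": [180, 170, 200],
})

# duplicated() flags the second and later occurrences (keep='first' is the default)
mask = toy["Title"].duplicated()
print(mask.tolist())            # [False, True, False]

# Keeping the unflagged rows leaves a gap in the index labels
deduped = toy[~mask]
print(deduped.index.tolist())   # [0, 2]

# Optional: renumber the rows 0..n-1 so later positional operations line up
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```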

### 3.3 Check data types

In [69]:
book.info()

<class 'pandas.core.frame.DataFrame'>
Index: 210 entries, 0 to 210
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Title     210 non-null    object
 1   Author    210 non-null    object
 2   Genre     210 non-null    object
 3   SubGenre  210 non-null    object
 4   Height    210 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 9.8+ KB


### 3.4 Checking the number of unique values of each column

In [70]:
book.nunique()

Title       210
Author      129
Genre         5
SubGenre     22
Height       65
dtype: int64

### 3.5 Check statistics of data set

In [71]:
book.describe()
# Height is the only numeric column, so describe() reports statistics for it alone

Unnamed: 0,Height
count,210.0
mean,206.228571
std,26.775786
min,160.0
25%,180.0
50%,199.5
75%,229.75
max,283.0


### 3.6 Exploring Data

In [72]:
book.head()

Unnamed: 0,Title,Author,Genre,SubGenre,Height
0,Fundamentals of Wavelets,"Goswami, Jaideva",tech,signal_processing,228
1,Data Smart,"Foreman, John",tech,data_science,235
2,God Created the Integers,"Hawking, Stephen",tech,mathematics,197
3,Superfreakonomics,"Dubner, Stephen",science,economics,179
4,Orientalism,"Said, Edward",nonfiction,history,197


In [73]:
book['Category'] = book['Genre'] + ' ' + book['SubGenre']

In [74]:
book.head()

Unnamed: 0,Title,Author,Genre,SubGenre,Height,Category
0,Fundamentals of Wavelets,"Goswami, Jaideva",tech,signal_processing,228,tech signal_processing
1,Data Smart,"Foreman, John",tech,data_science,235,tech data_science
2,God Created the Integers,"Hawking, Stephen",tech,mathematics,197,tech mathematics
3,Superfreakonomics,"Dubner, Stephen",science,economics,179,science economics
4,Orientalism,"Said, Edward",nonfiction,history,197,nonfiction history



### Converting Categorical Columns to Numerical Columns

The code below converts the text columns Title and Author into numbers using TF-IDF, which weights each word by how often it appears in a given entry relative to how common it is across all entries. It one-hot encodes the categorical Genre and SubGenre columns into binary indicator features, then concatenates these new numerical features with the existing DataFrame. Finally, it drops the combined Category column; the original Title, Author, Genre and SubGenre columns stay in the result alongside the encoded features. Turning words and categories into numbers in this way makes it possible to build a recommendation system that suggests books based on similarities in their features.







In [105]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize and extract features from titles and authors
title_vectorizer = TfidfVectorizer()
title_features = title_vectorizer.fit_transform(book['Title'])

author_vectorizer = TfidfVectorizer()
author_features = author_vectorizer.fit_transform(book['Author'])

# One-hot encode Genre and SubGenre columns
genre_dummies = pd.get_dummies(book['Genre'], prefix='Genre')
subgenre_dummies = pd.get_dummies(book['SubGenre'], prefix='SubGenre')

# Concatenate the original DataFrame with the new features
new_book = pd.concat([book, pd.DataFrame(title_features.toarray(), columns=title_vectorizer.get_feature_names_out()),
                      pd.DataFrame(author_features.toarray(), columns=author_vectorizer.get_feature_names_out()),
                      genre_dummies, subgenre_dummies], axis=1)

# Drop the combined Category helper column (the other original columns are kept)
new_book.drop(columns=[ 'Category'], inplace=True)



In [106]:
new_book.head()

Unnamed: 0,Title,Author,Genre,SubGenre,Height,106,20000,22,39,advocate,...,SubGenre_novel,SubGenre_objectivism,SubGenre_philosophy,SubGenre_physics,SubGenre_poetry,SubGenre_politics,SubGenre_psychology,SubGenre_science,SubGenre_signal_processing,SubGenre_trivia
0,Fundamentals of Wavelets,"Goswami, Jaideva",tech,signal_processing,228.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,True,False
1,Data Smart,"Foreman, John",tech,data_science,235.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
2,God Created the Integers,"Hawking, Stephen",tech,mathematics,197.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
3,Superfreakonomics,"Dubner, Stephen",science,economics,179.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
4,Orientalism,"Said, Edward",nonfiction,history,197.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False


In [107]:
new_book.shape

(211, 656)
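
The 211 rows reported above are one more than the 210 rows `book` held after the duplicate title was removed. A likely explanation, inferred from the outputs rather than stated in the notebook, is index alignment: `book` keeps its original labels (0 to 210 with 195 missing), while the TF-IDF frames get a fresh 0 to 209 index, so `pd.concat(..., axis=1)` takes the union of labels and fills the unmatched cells with NaN. That would also explain why Height is displayed as a float. The toy sketch below reproduces the effect and shows one possible fix, passing `index=` when building the feature frames.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap, as happens after filtering out a duplicate row
toy = pd.DataFrame({"Title": ["A", "B", "B", "C"]})
toy = toy[~toy["Title"].duplicated()]   # index labels: 0, 1, 3

# Feature matrix with one row per remaining book (values are placeholders)
features = np.ones((len(toy), 2))

# A default RangeIndex (0, 1, 2) does not match toy's labels, so concat takes the union
misaligned = pd.concat(
    [toy, pd.DataFrame(features, columns=["f1", "f2"])], axis=1
)
print(misaligned.shape)   # (4, 3)

# Reusing toy.index keeps each feature row attached to its own book
aligned = pd.concat(
    [toy, pd.DataFrame(features, columns=["f1", "f2"], index=toy.index)], axis=1
)
print(aligned.shape)      # (3, 3)
```

Calling `reset_index(drop=True)` on `book` right after de-duplication would be an alternative way to get the same alignment.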

In [86]:
book.to_csv('data/final_books.csv',index=False)

In [98]:
new_book.to_csv('data/encoded_book.csv',index=False)
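
As a pointer to how features like these could feed a recommender, the sketch below ranks books by cosine similarity of their TF-IDF vectors. It is an illustration under assumptions, not the notebook's method: it uses only the combined Genre/SubGenre text for four books from the outputs above, and `recommend` is a hypothetical helper.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Four books with the Category text (Genre + ' ' + SubGenre) seen earlier
toy = pd.DataFrame({
    "Title": ["Data Smart", "Soft Computing & Intelligent Systems",
              "Superfreakonomics", "New Markets & Other Essays"],
    "Category": ["tech data_science", "tech data_science",
                 "science economics", "science economics"],
})

# Vectorize the category text and compare every pair of books
vectors = TfidfVectorizer().fit_transform(toy["Category"])
similarity = cosine_similarity(vectors)

def recommend(title, top_n=1):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    idx = toy.index[toy["Title"] == title][0]
    scores = pd.Series(similarity[idx], index=toy["Title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend("Data Smart"))  # ['Soft Computing & Intelligent Systems']
```

A fuller version would presumably combine the title, author, genre and sub-genre features from `encoded_book.csv` rather than the category text alone.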