# Loading Dataset

First, we load the dataset into a data frame using the pandas library.

In [105]:
import pandas as pd
df = pd.read_excel('/content/drive/MyDrive/Data_Train.xlsx')
df

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.00
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.00
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.00
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62
...,...,...,...,...,...,...,...,...,...
5694,Who Ordered This Truckload of Dung?: Inspiring...,Ajahn Brahm,"Paperback,– 30 Aug 2005",4.9 out of 5 stars,9 customer reviews,“Laugh your way to enlightenment” with this in...,Buddhism (Books),Humour,1009.00
5695,PostCapitalism: A Guide to Our Future,Paul Mason,"Paperback,– 2 Jun 2016",4.1 out of 5 stars,2 customer reviews,'The most important book about our economy and...,Macroeconomics Textbooks,Politics,781.00
5696,The Great Zoo Of China,Matthew Reilly,"Paperback,– 14 Jan 2016",4.1 out of 5 stars,28 customer reviews,The Chinese government has been keeping a secr...,Action & Adventure (Books),"Crime, Thriller & Mystery",449.00
5697,Engleby,Sebastian Faulks,"Paperback,– 27 Mar 2008",1.0 out of 5 stars,1 customer review,Mike Engleby has a secret...\n\nThis is the st...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",108.00


As you can see, our dataset is fairly small and contains only 9 columns.

In [106]:
df.head(3)

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0


In [107]:
df.tail(3)

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
5696,The Great Zoo Of China,Matthew Reilly,"Paperback,– 14 Jan 2016",4.1 out of 5 stars,28 customer reviews,The Chinese government has been keeping a secr...,Action & Adventure (Books),"Crime, Thriller & Mystery",449.0
5697,Engleby,Sebastian Faulks,"Paperback,– 27 Mar 2008",1.0 out of 5 stars,1 customer review,Mike Engleby has a secret...\n\nThis is the st...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",108.0
5698,Only Dull People Are Brilliant at Breakfast (P...,Oscar Wilde,"Paperback,– 3 Mar 2016",4.5 out of 5 stars,7 customer reviews,'It would be unfair to expect other people to ...,Essays (Books),Humour,99.0


In [108]:
df.shape

(5699, 9)

Next, we list the column names and then the type of each column, to get a clearer picture of the data before processing it.

In [109]:
df.columns

Index(['Title', 'Author', 'Edition', 'Reviews', 'Ratings', 'Synopsis', 'Genre',
       'BookCategory', 'Price'],
      dtype='object')

In [110]:
df.dtypes

Title            object
Author           object
Edition          object
Reviews          object
Ratings          object
Synopsis         object
Genre            object
BookCategory     object
Price           float64
dtype: object

Since most columns in our dataset have the "object" type, we need to find the number of unique values of each feature.

In [111]:
df.nunique()

Title           5130
Author          3438
Edition         3183
Reviews           36
Ratings          333
Synopsis        5114
Genre            335
BookCategory      11
Price           1538
dtype: int64

In [112]:
df.describe()

Unnamed: 0,Price
count,5699.0
mean,554.857428
std,674.363427
min,25.0
25%,249.0
50%,373.0
75%,599.0
max,14100.0


# Handling Null Values

First of all, we make a copy of our data frame to work with. This leaves the original data frame unchanged, so if anything goes wrong in the following steps, we can simply come back and take another copy.

In [116]:
null_df = df.copy(deep=True)
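
As a small illustration of why the deep copy matters, the sketch below uses a toy data frame (the values are made up and only stand in for the real dataset) and shows that a change made to the copy does not reach the original.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
original = pd.DataFrame({"Price": [220.0, 202.93, 299.0]})

working = original.copy(deep=True)   # independent copy of data and index
working.loc[0, "Price"] = -1.0       # a deliberate "mistake" on the copy

print(original.loc[0, "Price"])      # the original value is untouched
print(working.loc[0, "Price"])
```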

In [117]:
import numpy as np
na_cols = null_df.columns[null_df.isna().any()].tolist()  # columns with at least one null value
na_cols

[]

In [118]:
null_values=pd.DataFrame(null_df[na_cols].isna().sum(), columns=['Number of Null values'])
null_values['Percentage of Null Values']=np.round(100*null_values['Number of Null values']/len(null_df),2)
print(null_values)


Empty DataFrame
Columns: [Number of Null values, Percentage of Null Values]
Index: []


It seems that there are no null values in our dataset, since none of the cells are empty. But we should keep in mind that some values may still be missing in disguise: a cell can be filled with irrelevant data, or only part of a value can be missing.
For example, in a date the year could be missing while the rest of the value exists:
"-12-6".
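
One possible way to look for such disguised gaps is to test each value against the pattern a complete value should follow. The sketch below is only an illustration: the sample strings imitate the Edition column (the last one is hypothetical), and the assumption that a complete date ends in "day, three-letter month, four-digit year" may not hold for every row of the real data.

```python
import pandas as pd

# Sample edition strings; the last one is a hypothetical entry with no date.
editions = pd.Series([
    "Paperback,– 10 Mar 2016",
    "Hardcover,– Import, 1 Mar 2018",
    "Paperback,– 2016",
    "Paperback,– Import",
])

# Assumed shape of a complete date: day, month abbreviation, 4-digit year.
full_date = r"\d{1,2} [A-Z][a-z]{2} \d{4}$"
is_complete = editions.str.contains(full_date, regex=True)

# Values that are non-empty but do not carry a full date.
suspicious = editions[~is_complete]
print(suspicious.tolist())
```

Rows flagged this way are not necessarily wrong, but they are the ones worth inspecting by hand before deciding how to fill or drop them.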

# Encoding Categorical columns

Since the number of unique values in our categorical columns is quite large, simple methods like one-hot or binary encoding wouldn't work well for these features.
Instead, we have to choose a method for each column based on that feature's characteristics.
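
To make the cardinality argument concrete: one-hot encoding creates one column per distinct value, so according to the `df.nunique()` output above, encoding Author alone would add 3438 columns to a 5699-row data frame. The sketch below shows the same check on a toy frame (made-up rows); the names `new_columns` and `ratio` are illustrative only.

```python
import pandas as pd

# Toy frame (made-up rows) with one low- and one high-cardinality column.
toy = pd.DataFrame({
    "BookCategory": ["Humour", "Sports", "Humour", "Politics"],
    "Author": ["A. One", "B. Two", "C. Three", "D. Four"],
})

# One-hot encoding adds one column per distinct value, so nunique()
# estimates how many columns each feature would create.
new_columns = toy.nunique()
ratio = new_columns / len(toy)   # share of rows that are distinct

print(new_columns)
print(ratio)
```

A ratio close to 1 means almost every row has its own value, which is a hint that one-hot encoding would mostly produce columns with a single non-zero entry.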

In [119]:
cat_df = null_df.copy(deep=True)

## **Ratings**

In [120]:
ratings = cat_df['Ratings'].values
ratings

array(['8 customer reviews', '14 customer reviews', '6 customer reviews',
       ..., '28 customer reviews', '1 customer review',
       '7 customer reviews'], dtype=object)

In [121]:
ratings = ratings.tolist()

In [122]:
ratings

['8 customer reviews',
 '14 customer reviews',
 '6 customer reviews',
 '13 customer reviews',
 '1 customer review',
 '8 customer reviews',
 '72 customer reviews',
 '16 customer reviews',
 '111 customer reviews',
 '1 customer review',
 '132 customer reviews',
 '17 customer reviews',
 '4 customer reviews',
 '3 customer reviews',
 '5 customer reviews',
 '2 customer reviews',
 '1 customer review',
 '23 customer reviews',
 '76 customer reviews',
 '5 customer reviews',
 '10 customer reviews',
 '2 customer reviews',
 '2 customer reviews',
 '10 customer reviews',
 '9 customer reviews',
 '1 customer review',
 '15 customer reviews',
 '34 customer reviews',
 '17 customer reviews',
 '9 customer reviews',
 '32 customer reviews',
 '2 customer reviews',
 '49 customer reviews',
 '49 customer reviews',
 '10 customer reviews',
 '8 customer reviews',
 '62 customer reviews',
 '61 customer reviews',
 '1 customer review',
 '8 customer reviews',
 '7 customer reviews',
 '5 customer reviews',
 '18 customer rev

In [123]:
num_ratings = []
for i in ratings:
    r = i.split(" ")[0]  # keep the leading token, i.e. the number of reviews
    num_ratings.append(r)
num_ratings

['8',
 '14',
 '6',
 '13',
 '1',
 '8',
 '72',
 '16',
 '111',
 '1',
 '132',
 '17',
 '4',
 '3',
 '5',
 '2',
 '1',
 '23',
 '76',
 '5',
 '10',
 '2',
 '2',
 '10',
 '9',
 '1',
 '15',
 '34',
 '17',
 '9',
 '32',
 '2',
 '49',
 '49',
 '10',
 '8',
 '62',
 '61',
 '1',
 '8',
 '7',
 '5',
 '18',
 '16',
 '6',
 '3',
 '2',
 '98',
 '12',
 '14',
 '3',
 '97',
 '1',
 '7',
 '2',
 '2',
 '7',
 '5',
 '285',
 '29',
 '1',
 '27',
 '4',
 '267',
 '24',
 '7',
 '6',
 '2',
 '146',
 '1',
 '4',
 '1',
 '95',
 '1',
 '234',
 '35',
 '3',
 '5',
 '7',
 '2',
 '4',
 '66',
 '1',
 '15',
 '8',
 '20',
 '39',
 '9',
 '6',
 '3',
 '7',
 '7',
 '12',
 '171',
 '13',
 '2',
 '2',
 '9',
 '2',
 '7',
 '399',
 '1',
 '7',
 '42',
 '2',
 '2',
 '2',
 '142',
 '4',
 '10',
 '62',
 '3',
 '1',
 '1',
 '15',
 '5',
 '6',
 '11',
 '20',
 '4',
 '1',
 '839',
 '47',
 '5',
 '1',
 '3',
 '18',
 '1',
 '165',
 '30',
 '7',
 '5',
 '53',
 '3',
 '2',
 '1',
 '6',
 '32',
 '10',
 '14',
 '77',
 '16',
 '33',
 '35',
 '37',
 '3',
 '4',
 '6',
 '8',
 '10',
 '2',
 '1',
 '9',
 '6',


In [124]:
cat_df['New_Ratings'] = num_ratings

In [125]:
cat_df['New_Ratings'] = pd.to_numeric(cat_df['New_Ratings'].str.replace(',', ''), errors='coerce')

In [126]:
cat_df['New_Ratings'] = pd.to_numeric(cat_df['New_Ratings'], errors='coerce')
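
The same extraction can also be written without an explicit loop, using pandas string methods. This is a sketch of an equivalent approach, not the cells above; the comma-separated sample value is an assumed example of a large review count.

```python
import pandas as pd

# Sample values shaped like the Ratings column; the comma-separated
# one is an assumed example of a count above 999.
ratings = pd.Series([
    "8 customer reviews",
    "1 customer review",
    "1,017 customer reviews",
])

review_counts = (
    ratings.str.split(" ").str[0]               # leading token, e.g. "1,017"
    .str.replace(",", "", regex=False)          # drop thousands separators
    .astype(int)                                # convert to integers
)
print(review_counts.tolist())
```

Unlike `pd.to_numeric(..., errors='coerce')`, `astype(int)` raises on unexpected values, which can be useful for noticing malformed rows early.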

## **Reviews**

In [127]:
reviews = cat_df['Reviews'].values.tolist()
reviews

['4.0 out of 5 stars',
 '3.9 out of 5 stars',
 '4.8 out of 5 stars',
 '4.1 out of 5 stars',
 '5.0 out of 5 stars',
 '4.5 out of 5 stars',
 '4.4 out of 5 stars',
 '4.7 out of 5 stars',
 '4.2 out of 5 stars',
 '4.0 out of 5 stars',
 '4.9 out of 5 stars',
 '3.5 out of 5 stars',
 '4.1 out of 5 stars',
 '5.0 out of 5 stars',
 '3.8 out of 5 stars',
 '5.0 out of 5 stars',
 '5.0 out of 5 stars',
 '4.9 out of 5 stars',
 '4.5 out of 5 stars',
 '4.3 out of 5 stars',
 '4.9 out of 5 stars',
 '5.0 out of 5 stars',
 '3.1 out of 5 stars',
 '3.1 out of 5 stars',
 '4.8 out of 5 stars',
 '4.0 out of 5 stars',
 '4.3 out of 5 stars',
 '4.3 out of 5 stars',
 '5.0 out of 5 stars',
 '5.0 out of 5 stars',
 '4.0 out of 5 stars',
 '3.9 out of 5 stars',
 '4.4 out of 5 stars',
 '4.2 out of 5 stars',
 '4.3 out of 5 stars',
 '4.3 out of 5 stars',
 '4.4 out of 5 stars',
 '3.9 out of 5 stars',
 '5.0 out of 5 stars',
 '4.6 out of 5 stars',
 '5.0 out of 5 stars',
 '4.7 out of 5 stars',
 '3.8 out of 5 stars',
 '4.5 out o

In [128]:
new_reviews = []
for i in reviews:
  r = i.split(" ")[0]
  new_reviews.append(r)

In [129]:
new_reviews

['4.0',
 '3.9',
 '4.8',
 '4.1',
 '5.0',
 '4.5',
 '4.4',
 '4.7',
 '4.2',
 '4.0',
 '4.9',
 '3.5',
 '4.1',
 '5.0',
 '3.8',
 '5.0',
 '5.0',
 '4.9',
 '4.5',
 '4.3',
 '4.9',
 '5.0',
 '3.1',
 '3.1',
 '4.8',
 '4.0',
 '4.3',
 '4.3',
 '5.0',
 '5.0',
 '4.0',
 '3.9',
 '4.4',
 '4.2',
 '4.3',
 '4.3',
 '4.4',
 '3.9',
 '5.0',
 '4.6',
 '5.0',
 '4.7',
 '3.8',
 '4.5',
 '4.6',
 '5.0',
 '5.0',
 '4.1',
 '4.8',
 '4.8',
 '3.8',
 '4.4',
 '2.0',
 '4.6',
 '5.0',
 '5.0',
 '4.2',
 '4.1',
 '4.7',
 '4.1',
 '5.0',
 '3.7',
 '4.0',
 '3.2',
 '4.0',
 '4.0',
 '4.7',
 '5.0',
 '4.3',
 '5.0',
 '4.0',
 '5.0',
 '4.1',
 '5.0',
 '4.6',
 '3.7',
 '5.0',
 '4.1',
 '4.4',
 '4.0',
 '5.0',
 '4.6',
 '5.0',
 '4.5',
 '4.5',
 '3.2',
 '4.6',
 '4.2',
 '5.0',
 '3.7',
 '4.8',
 '2.6',
 '4.9',
 '4.1',
 '5.0',
 '4.0',
 '3.5',
 '4.5',
 '5.0',
 '3.2',
 '4.6',
 '5.0',
 '4.0',
 '4.5',
 '4.0',
 '2.8',
 '5.0',
 '4.2',
 '3.4',
 '2.9',
 '3.9',
 '4.1',
 '5.0',
 '5.0',
 '4.5',
 '5.0',
 '4.8',
 '4.7',
 '4.7',
 '4.7',
 '5.0',
 '4.3',
 '3.9',
 '3.5',
 '4.0',


In [130]:
cat_df["New_Reviews"] = new_reviews
cat_df.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price,New_Ratings,New_Reviews
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0,8,4.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93,14,3.9
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0,6,4.8
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0,13,4.1
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62,1,5.0


In [131]:
cat_df['New_Reviews'] = pd.to_numeric(cat_df['New_Reviews'], errors='coerce')

In [132]:
cat_df = cat_df.drop(["Ratings",'Reviews'],axis=1)
cat_df.head()

Unnamed: 0,Title,Author,Edition,Synopsis,Genre,BookCategory,Price,New_Ratings,New_Reviews
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0,8,4.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93,14,3.9
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982","""During the time men live without a common Pow...",International Relations,Humour,299.0,6,4.8
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0,13,4.1
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006","For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62,1,5.0


## Genre

In [133]:
cat_df["Genre"].unique()

array(['Action & Adventure (Books)', 'Cinema & Broadcast (Books)',
       'International Relations', 'Contemporary Fiction (Books)',
       'Photography Textbooks', 'Healthy Living & Wellness (Books)',
       'Crime, Thriller & Mystery (Books)',
       'Sports Training & Coaching (Books)',
       'Biographies & Autobiographies (Books)', 'Asian History',
       'Banks & Banking', 'Comics & Mangas (Books)',
       "Children's Mysteries & Curiosities (Books)", 'Mangas',
       'Artificial Intelligence',
       'Software & Business Applications (Books)', 'German',
       'International Business', 'Cricket (Books)',
       'Comics & Graphic Novels (Books)', 'PC & Video Games (Books)',
       'Short Stories (Books)', 'Astrology', 'Romance (Books)', 'Design',
       'Introductory & Beginning Programming', 'Travel (Books)',
       'Sports (Books)', 'Communications', 'Foreign Languages',
       'Linguistics (Books)', 'Music Books',
       'Outdoor Survival Skills (Books)', 'True Accounts (Books

In [134]:
cat_df.groupby('BookCategory')['Genre'].nunique()

BookCategory
Action & Adventure                       38
Arts, Film & Photography                 77
Biographies, Diaries & True Accounts    121
Comics & Mangas                          66
Computing, Internet & Digital Media      87
Crime, Thriller & Mystery                47
Humour                                   92
Language, Linguistics & Writing         110
Politics                                 86
Romance                                  28
Sports                                   93
Name: Genre, dtype: int64

By looking at the "Genre" and "BookCategory" columns and the numbers above, we can see that genres are essentially subsets of book categories. In other words, each book category is divided into several genres (which are basically more detailed categories). So, given the data in the BookCategory column, there's little need for Genre, and we can drop this column to keep things simple.

In [135]:
cat_df = cat_df.drop(["Genre"],axis=1)
cat_df.head(2)

Unnamed: 0,Title,Author,Edition,Synopsis,BookCategory,Price,New_Ratings,New_Reviews
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9
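
How strictly genres nest inside book categories can be checked directly. The per-category genre counts above add up to 845, while `df.nunique()` reports 335 distinct genres, which suggests that some genres appear under more than one category. The sketch below shows one way to list such genres on a toy frame (the rows are made up).

```python
import pandas as pd

# Toy rows (made up): "Short Stories" appears under two categories.
toy = pd.DataFrame({
    "Genre": ["Cricket", "Short Stories", "Short Stories", "Mangas"],
    "BookCategory": ["Sports", "Humour", "Romance", "Comics & Mangas"],
})

# In a strict hierarchy every genre maps to exactly one category.
categories_per_genre = toy.groupby("Genre")["BookCategory"].nunique()
shared = categories_per_genre[categories_per_genre > 1]

print(shared.index.tolist())
```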


## **Edition**

In [136]:
edition = cat_df['Edition'].values.tolist()
edition

['Paperback,– 10 Mar 2016',
 'Paperback,– 7 Nov 2012',
 'Paperback,– 25 Feb 1982',
 'Paperback,– 5 Oct 2017',
 'Hardcover,– 10 Oct 2006',
 'Paperback,– 5 May 2009',
 'Paperback,– 5 Oct 2017',
 'Hardcover,– Import, 1 Mar 2018',
 'Paperback,– 15 Dec 2015',
 'Paperback,– 26 Mar 2013',
 'Paperback,– 20 Jan 2017',
 'Paperback,– Import, 14 Jun 2018',
 'Paperback,– 1 Jul 1999',
 'Paperback,– 15 Nov 2002',
 'Paperback,– 1 Sep 2011',
 'Paperback,– 26 Feb 2015',
 'Hardcover,– 8 Mar 2018',
 'Paperback,– 1 Nov 2016',
 'Mass Market Paperback,– 1 Jan 1991',
 'Paperback,– 2016',
 'Hardcover,– 24 Nov 2018',
 'Paperback,– Import, 4 Oct 2018',
 'Paperback,– 5 Jul 2012',
 'Paperback,– 1 Nov 2014',
 'Paperback,– 31 Aug 2012',
 'Hardcover,– Deckle Edge, 18 Oct 2011',
 'Paperback,– 1 Mar 2016',
 'Paperback,– Box set, 15 Jun 2014',
 'Hardcover,– 15 Sep 2014',
 'Paperback,– 23 Apr 1989',
 'Paperback,– 21 Nov 2013',
 'Paperback,– 21 Jul 2015',
 'Paperback,– 14 Oct 2000',
 'Paperback,– 5 Sep 2005',
 'Hardcover,

Based on the characteristics of "Edition", I assume that the particular month in which a book was published wouldn't affect its price much, but the year and the format of the book should matter.
For instance, we would expect a large gap between the price of a hardcover book published in 2000 and a paperback published in the current year.
So we change the "Edition" values in the following way:

In [137]:
new_edition=[]
for i in edition:
  values = i.split(",– ")
  format = values[0]
  year = values[1].split(" ")[-1]
  ED = format+year
  new_edition.append(ED)
new_edition


['Paperback2016',
 'Paperback2012',
 'Paperback1982',
 'Paperback2017',
 'Hardcover2006',
 'Paperback2009',
 'Paperback2017',
 'Hardcover2018',
 'Paperback2015',
 'Paperback2013',
 'Paperback2017',
 'Paperback2018',
 'Paperback1999',
 'Paperback2002',
 'Paperback2011',
 'Paperback2015',
 'Hardcover2018',
 'Paperback2016',
 'Mass Market Paperback1991',
 'Paperback2016',
 'Hardcover2018',
 'Paperback2018',
 'Paperback2012',
 'Paperback2014',
 'Paperback2012',
 'Hardcover2011',
 'Paperback2016',
 'Paperback2014',
 'Hardcover2014',
 'Paperback1989',
 'Paperback2013',
 'Paperback2015',
 'Paperback2000',
 'Paperback2005',
 'Hardcover2016',
 'Paperback2019',
 'Paperback2014',
 'Paperback2009',
 'Paperback2006',
 'Paperback2013',
 'Paperback2013',
 'Hardcover2013',
 'Paperback2008',
 'Hardcover2015',
 'Hardcover2019',
 'Paperback2014',
 'Paperback2006',
 'Paperback2014',
 'Paperback2012',
 'Paperback2012',
 'Paperback2017',
 'Paperback2016',
 'Sheet music2018',
 'Mass Market Paperback2000',
 '

In [138]:
unique_values = set(new_edition)
num_unique_values = len(unique_values)
print(num_unique_values)

164


As you can see, just by omitting the day and month from the edition, the number of unique values for this feature dropped sharply, from 3183 to 164.

In [139]:

cat_df['New_Edition'] = new_edition
cat_df.head()

Unnamed: 0,Title,Author,Edition,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,New_Edition
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0,Paperback2016
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9,Paperback2012
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982","""During the time men live without a common Pow...",Humour,299.0,6,4.8,Paperback1982
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",A handful of grain is found in the pocket of a...,"Crime, Thriller & Mystery",180.0,13,4.1,Paperback2017
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006","For seven decades, ""Life"" has been thrilling t...","Arts, Film & Photography",965.62,1,5.0,Hardcover2006


In [140]:
cat_df = cat_df.drop(['Edition'],axis=1)

In [141]:
cat_df.head()

Unnamed: 0,Title,Author,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,New_Edition
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0,Paperback2016
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9,Paperback2012
2,Leviathan (Penguin Classics),Thomas Hobbes,"""During the time men live without a common Pow...",Humour,299.0,6,4.8,Paperback1982
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,A handful of grain is found in the pocket of a...,"Crime, Thriller & Mystery",180.0,13,4.1,Paperback2017
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"For seven decades, ""Life"" has been thrilling t...","Arts, Film & Photography",965.62,1,5.0,Hardcover2006


Now there's something we can do for better results.
We can split the value into two parts: one for the format of the book and the other for the year of publication.
This helps a lot because the year is a numerical value and therefore needs no encoding.

In [142]:
import re
def divide(string):
    match = re.match(r"([a-zA-Z ]+)([0-9]+)", string)
    if match:
        format = match.group(1)
        year = match.group(2)
        return format, year


In [143]:
f , m = divide("paperback2002")
print(f,m)

paperback 2002


In [144]:
'''Format = []
production_year=[]
for ed in cat_df["New_Edition"].values.tolist():
      print(ed)
      format , year = divide(ed)
      Format.append(format)
      production_year.append(year)
print(Format)
print(production_year)'''

'Format = []\nproduction_year=[]\nfor ed in cat_df["New_Edition"].values.tolist():\n      print(ed)\n      format , year = divide(ed)\n      Format.append(format)\n      production_year.append(year)\nprint(Format)\nprint(production_year)'

Here, we defined a function for this purpose. It's clear, though, that we're facing a problem: even though there were no empty cells for the edition, some values are still missing. For example, some entries contain the word "Import" and no year.
We have to handle this in some way.
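
Rather than searching for each problematic word separately, one option is to list every value that the regular expression fails to match. The sketch below applies the same pattern as `divide` to a few sample values; "Spiral-bound2012" and "(French),Paperback2010" follow forms visible in this notebook's outputs, while "PaperbackImport" is an illustrative guess at what a year-less entry looks like.

```python
import re
import pandas as pd

# Sample New_Edition values (see the note above on where they come from).
new_edition = pd.Series([
    "Paperback2016",
    "Mass Market Paperback1991",
    "Spiral-bound2012",
    "(French),Paperback2010",
    "PaperbackImport",
])

# Same pattern as in divide(): letters and spaces, then digits.
pattern = re.compile(r"([a-zA-Z ]+)([0-9]+)")

# re.match anchors at the start of the string, as in divide().
no_match = new_edition[new_edition.map(lambda s: pattern.match(s) is None)]
print(no_match.value_counts())
```

Note that values containing a hyphen or parentheses also fail to match even though they do carry a year, because the first group only allows letters and spaces.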

In [145]:
count = 0
for ed in cat_df['New_Edition'].values.tolist():
   if "Import" in ed :
    count += 1
print(count)

9


In [146]:
count = 0
for ed in cat_df['New_Edition'].values.tolist():
   if "HardcoverUnabridged" in ed :
    count += 1
print(count)

1


In [147]:
'''cat_df = cat_df[~cat_df['New_Edition'].str.contains('Import')]
cat_df = cat_df[~cat_df['New_Edition'].str.contains('CombNTSC')]
cat_df = cat_df[~cat_df['New_Edition'].str.contains('Facsimile')]
cat_df = cat_df[~cat_df['New_Edition'].str.contains('Paperbackset')]
cat_df = cat_df[~cat_df['New_Edition'].str.contains('PaperbackEdition')]
cat_df = cat_df[~cat_df['New_Edition'].str.contains('Hardcoverset')]
cat_df = cat_df[~cat_df['New_Edition'].str.contains('HardcoverUnabridged')]

cat_df.shape'''

"cat_df = cat_df[~cat_df['New_Edition'].str.contains('Import')]\ncat_df = cat_df[~cat_df['New_Edition'].str.contains('CombNTSC')]\ncat_df = cat_df[~cat_df['New_Edition'].str.contains('Facsimile')]\ncat_df = cat_df[~cat_df['New_Edition'].str.contains('Paperbackset')]\ncat_df = cat_df[~cat_df['New_Edition'].str.contains('PaperbackEdition')]\ncat_df = cat_df[~cat_df['New_Edition'].str.contains('Hardcoverset')]\ncat_df = cat_df[~cat_df['New_Edition'].str.contains('HardcoverUnabridged')]\n\ncat_df.shape"

The number of distinct values that can cause us difficulty is fairly large, even though each one occurs only a few times. So it's better to fill those in somehow instead of dropping their rows.

In [148]:
def divide(string):
    global prev_year
    match = re.match(r"([a-zA-Z ]+)([0-9]+)", string)
    if match:
        format = match.group(1)
        year = match.group(2)
        prev_year = year
        return format, year
    else:
        format = string
        year = prev_year
        return format, year
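
The helper above reuses the year of the most recent matching row, so the filled value depends on row order, and `prev_year` would be undefined if the very first row failed to match. As an alternative sketch (an assumed policy, not the method used in this notebook), the year can be extracted with a more permissive pattern and the gaps filled with the median of the known years.

```python
import pandas as pd

# Sample values; "PaperbackImport" stands for an entry with no year.
new_edition = pd.Series([
    "Paperback2016",
    "Hardcover2006",
    "PaperbackImport",
    "Spiral-bound2012",
])

# Split into a prefix and an optional trailing 4-digit year.
parts = new_edition.str.extract(r"^(?P<Format>.*?)(?P<Year>\d{4})?$")
parts["Year"] = pd.to_numeric(parts["Year"])   # missing years become NaN

# Fill the gaps with the median of the known years (assumed policy).
parts["Year"] = parts["Year"].fillna(parts["Year"].median()).astype(int)
print(parts)
```

With this pattern the hyphenated "Spiral-bound2012" is split correctly, while a year-less entry keeps its full text as the format and would still need separate cleaning.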

In [149]:
Format = []
production_year=[]
for ed in cat_df["New_Edition"].values.tolist():
      print(ed)
      format , year = divide(ed)
      Format.append(format)
      production_year.append(year)
print(Format)
print(production_year)

Streaming output truncated to the last 5000 lines.
Paperback2015
Paperback2017
Paperback2018
Paperback2016
Paperback2010
Paperback1998
Paperback2012
Hardcover2008
Paperback2016
Paperback2019
Paperback2011
Paperback2012
Paperback2017
Paperback2017
Paperback2015
Paperback2019
Paperback2016
Loose Leaf2015
Paperback2017
Mass Market Paperback2008
Paperback2016
Paperback2019
Paperback2006
Paperback2017
Paperback2013
Paperback2016
Paperback2007
Paperback2019
Paperback2011
Paperback1997
Paperback2017
Hardcover2016
Paperback2015
Paperback2016
Paperback2015
Paperback1990
Paperback2014
Paperback1998
Paperback2011
Paperback2011
Paperback2016
Paperback2015
Paperback2013
Paperback2016
Paperback2018
Paperback2019
Paperback2017
Paperback2015
Hardcover2017
Paperback2019
Paperback2017
Paperback1971
Paperback2017
Paperback2018
Hardcover2018
Paperback2018
Paperback2006
Paperback2003
Mass Market Paperback2006
Paperback1992
Hardcover2017
Hardcover2015
Paperback2018
Hardcover2008
Paperback2013


Now we add these two lists to our data frame.

In [150]:
cat_df['Format'] = Format
cat_df['Year_of_publish'] = production_year
cat_df.head(4)

Unnamed: 0,Title,Author,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,New_Edition,Format,Year_of_publish
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0,Paperback2016,Paperback,2016
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9,Paperback2012,Paperback,2012
2,Leviathan (Penguin Classics),Thomas Hobbes,"""During the time men live without a common Pow...",Humour,299.0,6,4.8,Paperback1982,Paperback,1982
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,A handful of grain is found in the pocket of a...,"Crime, Thriller & Mystery",180.0,13,4.1,Paperback2017,Paperback,2017


Now that we have our two separate columns, there's no need to keep "New_Edition", so we just drop it.

In [151]:
cat_df = cat_df.drop(["New_Edition"],axis=1)
cat_df.head(2)

Unnamed: 0,Title,Author,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,Format,Year_of_publish
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0,Paperback,2016
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9,Paperback,2012


### **Format**

In [152]:
onehot_BC = pd.get_dummies(cat_df['Format'], prefix='Format')
cat_df = pd.concat([cat_df, onehot_BC], axis=1)
cat_df.head(4)

Unnamed: 0,Title,Author,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,Format,Year_of_publish,"Format_(French),Paperback2010",...,Format_Paperbackset,Format_Perfect Paperback,Format_Plastic CombNTSC,Format_Product Bundle,Format_Sheet music,Format_Spiral-bound1986,Format_Spiral-bound2007,Format_Spiral-bound2012,Format_Spiral-bound2016,Format_Tankobon Softcover
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0,Paperback,2016,0,...,0,0,0,0,0,0,0,0,0,0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9,Paperback,2012,0,...,0,0,0,0,0,0,0,0,0,0
2,Leviathan (Penguin Classics),Thomas Hobbes,"""During the time men live without a common Pow...",Humour,299.0,6,4.8,Paperback,1982,0,...,0,0,0,0,0,0,0,0,0,0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,A handful of grain is found in the pocket of a...,"Crime, Thriller & Mystery",180.0,13,4.1,Paperback,2017,0,...,0,0,0,0,0,0,0,0,0,0


In [153]:
cat_df = cat_df.drop(['Format'],axis=1)
cat_df.head(3)

Unnamed: 0,Title,Author,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014",...,Format_Paperbackset,Format_Perfect Paperback,Format_Plastic CombNTSC,Format_Product Bundle,Format_Sheet music,Format_Spiral-bound1986,Format_Spiral-bound2007,Format_Spiral-bound2012,Format_Spiral-bound2016,Format_Tankobon Softcover
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,THE HUNTERS return in their third brilliant no...,Action & Adventure,220.0,8,4.0,2016,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,A layered portrait of a troubled genius for wh...,"Biographies, Diaries & True Accounts",202.93,14,3.9,2012,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Leviathan (Penguin Classics),Thomas Hobbes,"""During the time men live without a common Pow...",Humour,299.0,6,4.8,1982,0,0,...,0,0,0,0,0,0,0,0,0,0


## Book Category

In [154]:
cat_df["BookCategory"].unique()

array(['Action & Adventure', 'Biographies, Diaries & True Accounts',
       'Humour', 'Crime, Thriller & Mystery', 'Arts, Film & Photography',
       'Sports', 'Language, Linguistics & Writing',
       'Computing, Internet & Digital Media', 'Romance',
       'Comics & Mangas', 'Politics'], dtype=object)

Fortunately, there are only a few unique values for this feature. This means we can use one-hot encoding here: since it adds only 11 columns, it's okay to use this method.

In [155]:
onehot_BC = pd.get_dummies(cat_df['BookCategory'], prefix='BookCategory')
cat_df = pd.concat([cat_df, onehot_BC], axis=1)
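
For reference, here is a minimal, self-contained sketch of the same one-hot step on a toy column (made-up rows). It passes `dtype=int` explicitly, because recent pandas versions return boolean columns by default rather than the 0/1 values shown in this notebook's outputs.

```python
import pandas as pd

# Toy column (made-up rows) with three categories.
toy = pd.DataFrame({"BookCategory": ["Humour", "Sports", "Humour", "Politics"]})

# dtype=int forces 0/1 columns instead of booleans.
dummies = pd.get_dummies(toy["BookCategory"], prefix="BookCategory", dtype=int)

# Attach the indicator columns and drop the original text column.
encoded = pd.concat([toy.drop(columns="BookCategory"), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded.sum().tolist())   # number of rows in each category
```

Each row has exactly one indicator set to 1, so the indicator columns of a row always sum to 1.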

In [156]:
cat_df.tail()

Unnamed: 0,Title,Author,Synopsis,BookCategory,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014",...,"BookCategory_Arts, Film & Photography","BookCategory_Biographies, Diaries & True Accounts",BookCategory_Comics & Mangas,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports
5694,Who Ordered This Truckload of Dung?: Inspiring...,Ajahn Brahm,“Laugh your way to enlightenment” with this in...,Humour,1009.0,9,4.9,2005,0,0,...,0,0,0,0,0,1,0,0,0,0
5695,PostCapitalism: A Guide to Our Future,Paul Mason,'The most important book about our economy and...,Politics,781.0,2,4.1,2016,0,0,...,0,0,0,0,0,0,0,1,0,0
5696,The Great Zoo Of China,Matthew Reilly,The Chinese government has been keeping a secr...,"Crime, Thriller & Mystery",449.0,28,4.1,2016,0,0,...,0,0,0,0,1,0,0,0,0,0
5697,Engleby,Sebastian Faulks,Mike Engleby has a secret...\n\nThis is the st...,"Crime, Thriller & Mystery",108.0,1,1.0,2008,0,0,...,0,0,0,0,1,0,0,0,0,0
5698,Only Dull People Are Brilliant at Breakfast (P...,Oscar Wilde,'It would be unfair to expect other people to ...,Humour,99.0,7,4.5,2016,0,0,...,0,0,0,0,0,1,0,0,0,0


Before moving to the next step, we should drop the old BookCategory column.

In [157]:
cat_df = cat_df.drop(["BookCategory"],axis=1)
cat_df.tail(3)

Unnamed: 0,Title,Author,Synopsis,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",...,"BookCategory_Arts, Film & Photography","BookCategory_Biographies, Diaries & True Accounts",BookCategory_Comics & Mangas,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports
5696,The Great Zoo Of China,Matthew Reilly,The Chinese government has been keeping a secr...,449.0,28,4.1,2016,0,0,0,...,0,0,0,0,1,0,0,0,0,0
5697,Engleby,Sebastian Faulks,Mike Engleby has a secret...\n\nThis is the st...,108.0,1,1.0,2008,0,0,0,...,0,0,0,0,1,0,0,0,0,0
5698,Only Dull People Are Brilliant at Breakfast (P...,Oscar Wilde,'It would be unfair to expect other people to ...,99.0,7,4.5,2016,0,0,0,...,0,0,0,0,0,1,0,0,0,0


## **Synopsis**

In [158]:
synopsis = cat_df['Synopsis'].values.tolist()
synopsis

["THE HUNTERS return in their third brilliant novel from the Sunday Times Top Ten bestselling author Chris Kuzneski, whose writing James Patterson says has 'raw power'. The team are hunting Marco Polo's hidden treasure, but who is on their tail?\nTHE HUNTERS\nIf you seek, they will find...\n\nThe travels of Marco Polo are known throughout the world.\nBut what if his story isn't complete?\nWhat if his greatest adventure has yet to be discovered?\nGuided by a journal believed to have been dictated by Polo himself,\nthe Hunters set out in search of his final legacy:\nthe mythical treasure gathered during Polo's lifetime of exploration.\nBut as every ancient clue brings them closer to the truth,\neach new step puts them in increasing danger...\nExplosive action. Killer characters. Classic Kuzneski.",
 'A layered portrait of a troubled genius for whom art was not merely a thing of beauty but a vital part of living itself.\nSelling Points: The original Marathi book won the National Award for

In [159]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
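def embedding(cat_df, column):
    """Tokenize a text column; return its integer sequences and the word index."""
    max_words = 10000  # only words ranked below this index appear in the sequences
    maxlen = 13095     # unused here; only referenced by the commented-out padding step below
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(cat_df[column])
    sequences = tokenizer.texts_to_sequences(cat_df[column])
    word_index = tokenizer.word_index  # full vocabulary, not truncated by num_words
    print('Found %s unique tokens.' % len(word_index))
    return sequences, word_index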
#data = pad_sequences(sequences, maxlen=maxlen)
#labels = encoded_df['Label']
#model = Sequential()
#model.add(Embedding(max_words, 32, input_length=maxlen))
#model.add(Flatten())
#model.add(Dense(1, activation='sigmoid'))
#model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
#model.summary()
#history = model.fit(data, labels, epochs=10, batch_size=32, validation_split=0.2)

In [160]:
s,w = embedding(cat_df,'Synopsis')

Found 46323 unique tokens.
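
The token count above is the size of the full vocabulary, while the sequences themselves only use indices below `max_words`. The sketch below illustrates that behaviour in plain Python. It is a simplified stand-in, not the Keras implementation: it splits on whitespace only and does not strip punctuation, and the names `build_word_index`, `to_sequences`, and `toy_texts` are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Encode texts, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Hypothetical toy data, not taken from the dataset
toy_texts = ["the secret of the zoo", "the secret garden"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=3)

print(len(toy_index))   # size of the full vocabulary
print(toy_sequences)    # only the two most frequent words survive
```

With `num_words=3` only indices 1 and 2 are kept, so rarer words vanish from the sequences even though they are still counted in the vocabulary.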


In [161]:
print(w)



In [162]:
cat_df['Synopsis_sequences'] = s

In [163]:
cat_df.head()

Unnamed: 0,Title,Author,Synopsis,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",...,"BookCategory_Biographies, Diaries & True Accounts",BookCategory_Comics & Mangas,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports,Synopsis_sequences
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,THE HUNTERS return in their third brilliant no...,220.0,8,4.0,2016,0,0,0,...,0,0,0,0,0,0,0,0,0,"[1, 5008, 896, 6, 33, 581, 382, 116, 15, 1, 36..."
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,A layered portrait of a troubled genius for wh...,202.93,14,3.9,2012,0,0,0,...,1,0,0,0,0,0,0,0,0,"[4, 6296, 1011, 3, 4, 1950, 1099, 8, 1275, 159..."
2,Leviathan (Penguin Classics),Thomas Hobbes,"""During the time men live without a common Pow...",299.0,6,4.8,1982,0,0,0,...,0,0,0,0,1,0,0,0,0,"[400, 1, 55, 262, 353, 281, 4, 525, 145, 5, 37..."
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,A handful of grain is found in the pocket of a...,180.0,13,4.1,2017,0,0,0,...,0,0,0,1,0,0,0,0,0,"[4, 4369, 3, 7, 330, 6, 1, 1857, 3, 4, 1162, 8..."
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"For seven decades, ""Life"" has been thrilling t...",965.62,1,5.0,2006,0,0,0,...,0,0,0,0,0,0,0,0,0,"[8, 672, 613, 36, 23, 51, 804, 1, 37, 9, 56, 5..."


In [164]:
cat_df = cat_df.drop(['Synopsis'],axis=1)

## **Author**

In [165]:
au_df = cat_df.copy(deep=True)

In [166]:
s,w = embedding(au_df,'Author')

Found 4610 unique tokens.


In [167]:
print(w)

{'john': 1, 'james': 2, 'david': 3, 'r': 4, 'michael': 5, 'k': 6, 'j': 7, 's': 8, 'p': 9, 'singh': 10, 'robert': 11, 'agatha': 12, 'christie': 13, 'm': 14, 'a': 15, 'g': 16, 'george': 17, 'dk': 18, 'ladybird': 19, 'd': 20, 'bill': 21, 'stephen': 22, 'albert': 23, 'peter': 24, 'v': 25, 'lee': 26, 'william': 27, 'uderzo': 28, 'martin': 29, 'paul': 30, 'tom': 31, 'richard': 32, 'dan': 33, 'patterson': 34, 'sidney': 35, 'scott': 36, 'l': 37, 'smith': 38, 'c': 39, 'herge': 40, 'mark': 41, 'dr': 42, 'roberts': 43, 'sheldon': 44, 'cussler': 45, 'e': 46, 'watterson': 47, 'grisham': 48, 'press': 49, 'kumar': 50, 'b': 51, 'daniel': 52, 'clive': 53, 'wodehouse': 54, 'brown': 55, 'ian': 56, 'jim': 57, 'h': 58, 'chris': 59, 'thomas': 60, 'nora': 61, 'n': 62, 'andrew': 63, 'sharma': 64, 'king': 65, 'sophie': 66, 'trinity': 67, 'college': 68, 'taylor': 69, 'anthony': 70, 'simon': 71, 'sarah': 72, 'kinsella': 73, 'matthew': 74, 'charles': 75, 'christopher': 76, 'louis': 77, 'various': 78, 'stilton': 7

In [168]:
au_df['New_Author'] = s
au_df.head()

Unnamed: 0,Title,Author,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,...,BookCategory_Comics & Mangas,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports,Synopsis_sequences,New_Author
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,220.0,8,4.0,2016,0,0,0,0,...,0,0,0,0,0,0,0,0,"[1, 5008, 896, 6, 33, 581, 382, 116, 15, 1, 36...","[59, 556]"
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,202.93,14,3.9,2012,0,0,0,0,...,0,0,0,0,0,0,0,0,"[4, 6296, 1011, 3, 4, 1950, 1099, 8, 1275, 159...","[248, 1814]"
2,Leviathan (Penguin Classics),Thomas Hobbes,299.0,6,4.8,1982,0,0,0,0,...,0,0,0,1,0,0,0,0,"[400, 1, 55, 262, 353, 281, 4, 525, 145, 5, 37...","[60, 1061]"
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,180.0,13,4.1,2017,0,0,0,0,...,0,0,1,0,0,0,0,0,"[4, 4369, 3, 7, 330, 6, 1, 1857, 3, 4, 1162, 8...","[12, 13]"
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,965.62,1,5.0,2006,0,0,0,0,...,0,0,0,0,0,0,0,0,"[8, 672, 613, 36, 23, 51, 804, 1, 37, 9, 56, 5...","[557, 428, 1815]"


In [169]:
au_df = au_df.drop(['Author'],axis=1)

In [170]:
au_df.tail()

Unnamed: 0,Title,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,Format_Cards,...,BookCategory_Comics & Mangas,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports,Synopsis_sequences,New_Author
5694,Who Ordered This Truckload of Dung?: Inspiring...,1009.0,9,4.9,2005,0,0,0,0,0,...,0,0,0,1,0,0,0,0,"[42, 81, 5, 9, 14, 1596, 2, 440, 4843, 231, 3,...","[4609, 4610]"
5695,PostCapitalism: A Guide to Our Future,781.0,2,4.1,2016,0,0,0,0,0,...,0,0,0,0,0,1,0,0,"[532, 41, 216, 16, 40, 74, 988, 2, 404, 5, 34,...","[30, 1279]"
5696,The Great Zoo Of China,449.0,28,4.1,2016,0,0,0,0,0,...,0,0,1,0,0,0,0,0,"[1, 1300, 563, 23, 51, 1837, 4, 207, 8, 2208, ...","[74, 144]"
5697,Engleby,108.0,1,1.0,2008,0,0,0,0,0,...,0,0,1,0,0,0,0,0,"[3278, 23, 4, 207, 14, 7, 1, 52, 3, 3278, 4, 4...","[973, 1494]"
5698,Only Dull People Are Brilliant at Breakfast (P...,99.0,7,4.5,2016,0,0,0,0,0,...,0,0,0,1,0,0,0,0,"[6344, 143, 34, 5, 2792, 80, 92, 5, 34, 12, 68...","[1790, 1791]"


## **Title**

In [171]:
tl_df = au_df.copy(deep=True)

In [172]:
s,w = embedding(tl_df,'Title')

Found 7512 unique tokens.


In [173]:
print(w)

{'the': 1, 'of': 2, 'and': 3, 'a': 4, 'to': 5, 'in': 6, 'for': 7, 'book': 8, '1': 9, 'with': 10, 'guide': 11, 'edition': 12, 'from': 13, 'english': 14, 'how': 15, 'classics': 16, 'series': 17, '2': 18, 'on': 19, 'an': 20, 'my': 21, 'vol': 22, 'story': 23, '3': 24, 'world': 25, 'life': 26, 'you': 27, 'india': 28, 'penguin': 29, 'novel': 30, 'one': 31, 'man': 32, 'new': 33, 'complete': 34, 'your': 35, 'i': 36, 'love': 37, 'art': 38, 'dictionary': 39, 'graphic': 40, 'is': 41, 'adventures': 42, 'oxford': 43, 'indian': 44, 'modern': 45, 'it': 46, 'by': 47, 'stories': 48, '4': 49, 'at': 50, 'all': 51, 'history': 52, 'me': 53, 'data': 54, 'learning': 55, 'grammar': 56, 'asterix': 57, 'no': 58, 'secret': 59, 'volume': 60, 'that': 61, 'big': 62, 'great': 63, 'course': 64, 'cd': 65, 'books': 66, 'vintage': 67, '5': 68, 'what': 69, 'novels': 70, 'step': 71, 'design': 72, 'practice': 73, 'autobiography': 74, 'first': 75, 'about': 76, 'who': 77, 'easy': 78, 'war': 79, 'cambridge': 80, 'girl': 81, '

In [174]:
tl_df['New_Title'] = s
tl_df.head(6)

Unnamed: 0,Title,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,Format_Cards,...,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports,Synopsis_sequences,New_Author,New_Title
0,The Prisoner's Gold (The Hunters 3),220.0,8,4.0,2016,0,0,0,0,0,...,0,0,0,0,0,0,0,"[1, 5008, 896, 6, 33, 581, 382, 116, 15, 1, 36...","[59, 556]","[1, 3690, 493, 1, 3691, 24]"
1,Guru Dutt: A Tragedy in Three Acts,202.93,14,3.9,2012,0,0,0,0,0,...,0,0,0,0,0,0,0,"[4, 6296, 1011, 3, 4, 1950, 1099, 8, 1275, 159...","[248, 1814]","[1743, 2359, 4, 1127, 6, 211, 1744]"
2,Leviathan (Penguin Classics),299.0,6,4.8,1982,0,0,0,0,0,...,0,0,1,0,0,0,0,"[400, 1, 55, 262, 353, 281, 4, 525, 145, 5, 37...","[60, 1061]","[2360, 29, 16]"
3,A Pocket Full of Rye (Miss Marple),180.0,13,4.1,2017,0,0,0,0,0,...,0,1,0,0,0,0,0,"[4, 4369, 3, 7, 330, 6, 1, 1857, 3, 4, 1162, 8...","[12, 13]","[4, 327, 837, 2, 1745, 451, 664]"
4,LIFE 70 Years of Extraordinary Photography,965.62,1,5.0,2006,0,0,0,0,0,...,0,0,0,0,0,0,0,"[8, 672, 613, 36, 23, 51, 804, 1, 37, 9, 56, 5...","[557, 428, 1815]","[26, 1128, 184, 2, 539, 240]"
5,ChiRunning: A Revolutionary Approach to Effort...,900.0,8,4.5,2009,0,0,0,0,0,...,0,0,0,0,0,0,1,"[1, 645, 120, 3, 1, 141, 4, 1976, 953, 15, 289...","[1062, 1063]","[3692, 4, 959, 289, 5, 2361, 1746, 665, 349]"


In [175]:
tl_df = tl_df.drop(['Title'],axis=1)
tl_df.tail(3)

Unnamed: 0,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,Format_Cards,Format_Flexibound,...,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports,Synopsis_sequences,New_Author,New_Title
5696,449.0,28,4.1,2016,0,0,0,0,0,0,...,0,1,0,0,0,0,0,"[1, 1300, 563, 23, 51, 1837, 4, 207, 8, 2208, ...","[74, 144]","[1, 63, 953, 2, 657]"
5697,108.0,1,1.0,2008,0,0,0,0,0,0,...,0,1,0,0,0,0,0,"[3278, 23, 4, 207, 14, 7, 1, 52, 3, 3278, 4, 4...","[973, 1494]",[7510]
5698,99.0,7,4.5,2016,0,0,0,0,0,0,...,0,0,1,0,0,0,0,"[6344, 143, 34, 5, 2792, 80, 92, 5, 34, 12, 68...","[1790, 1791]","[390, 7511, 231, 169, 1225, 50, 7512, 29, 106,..."


# Conversion

In [176]:
con_df = tl_df.copy(deep=True)

In [177]:
def conversion(con_df,column):
  max_length = max(len(v) for v in con_df[column])
  new_columns = [f'{column}{i + 1}' for i in range(max_length)]
  con_df[new_columns] = con_df[column].apply(lambda x: pd.Series(x + [0] * (max_length - len(x))))
  print(con_df)
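
The padding step inside `conversion` can be looked at in isolation. The sketch below pads plain Python lists to a common length and generates the matching column names. `pad_rows`, `padded_column_names`, and `toy_titles` are hypothetical names used only for illustration.

```python
def pad_rows(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_length = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_length - len(seq)) for seq in sequences]

def padded_column_names(column, width):
    """Name the padded positions column1, column2, ... as conversion() does."""
    return [f'{column}{i + 1}' for i in range(width)]

# Hypothetical toy sequences of different lengths
toy_titles = [[1, 3690, 493], [7510], [2360, 29]]
padded = pad_rows(toy_titles)
names = padded_column_names('New_Title', len(padded[0]))

print(names)
print(padded)
```

One possible use, assuming the same `con_df` as above, is to build the whole block at once with `pd.DataFrame(padded, columns=names, index=con_df.index)` and attach it with `pd.concat([con_df, block], axis=1)`. Adding the columns in a single block tends to be faster than inserting them one at a time.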

In [178]:
conversion(con_df,'New_Title')
con_df

        Price  New_Ratings  New_Reviews Year_of_publish  \
0      220.00            8          4.0            2016   
1      202.93           14          3.9            2012   
2      299.00            6          4.8            1982   
3      180.00           13          4.1            2017   
4      965.62            1          5.0            2006   
...       ...          ...          ...             ...   
5694  1009.00            9          4.9            2005   
5695   781.00            2          4.1            2016   
5696   449.00           28          4.1            2016   
5697   108.00            1          1.0            2008   
5698    99.00            7          4.5            2016   

      Format_(French),Paperback2010  Format_(German),Paperback2014  \
0                                 0                              0   
1                                 0                              0   
2                                 0                              0   
3          

Unnamed: 0,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,Format_Cards,Format_Flexibound,...,New_Title23,New_Title24,New_Title25,New_Title26,New_Title27,New_Title28,New_Title29,New_Title30,New_Title31,New_Title32
0,220.00,8,4.0,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,202.93,14,3.9,2012,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,299.00,6,4.8,1982,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,180.00,13,4.1,2017,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,965.62,1,5.0,2006,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5694,1009.00,9,4.9,2005,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5695,781.00,2,4.1,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5696,449.00,28,4.1,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5697,108.00,1,1.0,2008,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [179]:
conversion(con_df,'New_Author')
con_df

        Price  New_Ratings  New_Reviews Year_of_publish  \
0      220.00            8          4.0            2016   
1      202.93           14          3.9            2012   
2      299.00            6          4.8            1982   
3      180.00           13          4.1            2017   
4      965.62            1          5.0            2006   
...       ...          ...          ...             ...   
5694  1009.00            9          4.9            2005   
5695   781.00            2          4.1            2016   
5696   449.00           28          4.1            2016   
5697   108.00            1          1.0            2008   
5698    99.00            7          4.5            2016   

      Format_(French),Paperback2010  Format_(German),Paperback2014  \
0                                 0                              0   
1                                 0                              0   
2                                 0                              0   
3          

Unnamed: 0,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,Format_Cards,Format_Flexibound,...,New_Author8,New_Author9,New_Author10,New_Author11,New_Author12,New_Author13,New_Author14,New_Author15,New_Author16,New_Author17
0,220.00,8,4.0,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,202.93,14,3.9,2012,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,299.00,6,4.8,1982,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,180.00,13,4.1,2017,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,965.62,1,5.0,2006,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5694,1009.00,9,4.9,2005,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5695,781.00,2,4.1,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5696,449.00,28,4.1,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5697,108.00,1,1.0,2008,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [180]:
conversion(con_df,'Synopsis_sequences')
con_df

        Price  New_Ratings  New_Reviews Year_of_publish  \
0      220.00            8          4.0            2016   
1      202.93           14          3.9            2012   
2      299.00            6          4.8            1982   
3      180.00           13          4.1            2017   
4      965.62            1          5.0            2006   
...       ...          ...          ...             ...   
5694  1009.00            9          4.9            2005   
5695   781.00            2          4.1            2016   
5696   449.00           28          4.1            2016   
5697   108.00            1          1.0            2008   
5698    99.00            7          4.5            2016   

      Format_(French),Paperback2010  Format_(German),Paperback2014  \
0                                 0                              0   
1                                 0                              0   
2                                 0                              0   
3          

Unnamed: 0,Price,New_Ratings,New_Reviews,Year_of_publish,"Format_(French),Paperback2010","Format_(German),Paperback2014","Format_(Kannada),Paperback2014",Format_Board book,Format_Cards,Format_Flexibound,...,Synopsis_sequences2044,Synopsis_sequences2045,Synopsis_sequences2046,Synopsis_sequences2047,Synopsis_sequences2048,Synopsis_sequences2049,Synopsis_sequences2050,Synopsis_sequences2051,Synopsis_sequences2052,Synopsis_sequences2053
0,220.00,8,4.0,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,202.93,14,3.9,2012,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,299.00,6,4.8,1982,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,180.00,13,4.1,2017,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,965.62,1,5.0,2006,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5694,1009.00,9,4.9,2005,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5695,781.00,2,4.1,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5696,449.00,28,4.1,2016,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5697,108.00,1,1.0,2008,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Model

In [189]:
model_df = con_df.copy(deep=True)

In [190]:
model_df = model_df.drop(['New_Author', 'New_Title', 'Synopsis_sequences'], axis=1)

In [191]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Target variable to predict
target_column = 'Price'

# Every remaining column is used as a feature
feature_columns = [col for col in model_df.columns if col != target_column]

# Price is already numeric here; uncomment to coerce it if it is not
#model_df[target_column] = pd.to_numeric(model_df[target_column], errors='coerce')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(model_df[feature_columns], model_df[target_column], test_size=0.2, random_state=10)

# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model with mean squared error on the held-out test set
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')


Mean Squared Error: 299924.39
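
The mean squared error is expressed in squared price units, which makes it hard to compare against the `Price` column. Taking the square root gives an error in the same units as the target. A minimal sketch, reusing the value printed above:

```python
import math

mse = 299924.39  # value printed by the cell above
rmse = math.sqrt(mse)  # same units as the Price column
print(f'Root Mean Squared Error: {rmse:.2f}')
```

Prices in the sample rows range from about 99 to about 1009, so an error of this size is large relative to a typical book price.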
