In [9]:
import numpy as np
import pandas as pd

In [10]:
books=pd.read_csv('data.csv')

In [11]:
books.head(2)

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,2005883,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,9780002261982,2261987,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0


In [12]:
books.shape

(6810, 12)

In [13]:
books=books[['isbn13','title','subtitle','authors','categories','description']]

In [14]:
books.head(1)

Unnamed: 0,isbn13,title,subtitle,authors,categories,description
0,9780002005883,Gilead,,Marilynne Robinson,Fiction,A NOVEL THAT READERS and critics have been eag...


<h2 style='color:#FF0D86'>CHECKING FOR NULL VALUES</h2>

In [15]:
books.isnull().sum()

isbn13            0
title             0
subtitle       4429
authors          72
categories       99
description     262
dtype: int64

<h5 style='color:red'>DROPPING THE COLUMN: 'subtitle'</h5>

In [16]:
books.drop('subtitle',axis=1,inplace=True)

<h5 style='color:red'>FILLING MISSING VALUES FOR 'authors'</h5>

In [17]:
books['authors']=books['authors'].fillna('Unknown Author')


<h5 style='color:red'>FILLING MISSING VALUES FOR 'categories'</h5>

In [18]:
books['categories']=books['categories'].fillna('Unknown Category')


<h5 style='color:red'>FILLING MISSING VALUES FOR 'description'</h5>

In [19]:
books['description']=books['description'].fillna('No description available')


In [20]:
books.isnull().sum()

isbn13         0
title          0
authors        0
categories     0
description    0
dtype: int64
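
The three fills above can also be written as a single call. A minimal sketch, assuming a small stand-in frame (`toy`, with made-up values) rather than the real `data.csv`: passing a dict to `DataFrame.fillna` maps each column to its own fill value and returns a new frame, so no chained in-place assignment is involved.

```python
import pandas as pd

# Toy stand-in for the books frame (hypothetical values, not from data.csv)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "authors": ["Ann Author", None, "Cy Writer"],
    "categories": [None, "Fiction", None],
    "description": ["Some text", None, None],
})

# One fill value per column; fillna returns a new DataFrame
fill_values = {
    "authors": "Unknown Author",
    "categories": "Unknown Category",
    "description": "No description available",
}
filled = toy.fillna(fill_values)

# Total number of remaining nulls across all columns
print(filled.isnull().sum().sum())
```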

<h2 style='color:#FF0D86'>CHECKING FOR DUPLICATED VALUES</h2>

In [21]:
books.duplicated().sum()

0

In [22]:
books.head(1)

Unnamed: 0,isbn13,title,authors,categories,description
0,9780002005883,Gilead,Marilynne Robinson,Fiction,A NOVEL THAT READERS and critics have been eag...


<h2 style='color:#FF0D86'>CONVERTING 'authors' INTO A LIST</h2>

In [23]:
books['authors']=books['authors'].apply(lambda x: [x])

In [24]:
books.iloc[0].authors

['Marilynne Robinson']

<h5 style='color:red'>CONVERTING 'categories' INTO A LIST</h5>

In [25]:
books['categories']=books['categories'].apply(lambda x: [x])

In [26]:
books.iloc[0].categories

['Fiction']

In [27]:
books.head(1)

Unnamed: 0,isbn13,title,authors,categories,description
0,9780002005883,Gilead,[Marilynne Robinson],[Fiction],A NOVEL THAT READERS and critics have been eag...
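
Wrapping the whole string in a list keeps co-authors together as one entry (for example `Charles Osborne;Agatha Christie`). A possible variation, which is not what this notebook does, is to split on the `;` separator seen in the data so that each author becomes its own list item. A sketch on a toy Series:

```python
import pandas as pd

# Two author strings in the same shape as the dataset's 'authors' column
authors = pd.Series(["Marilynne Robinson", "Charles Osborne;Agatha Christie"])

# Split on ';' so each co-author is a separate list element
author_lists = authors.apply(lambda s: s.split(";"))

print(author_lists.tolist())
```

With this variation, each co-author would later contribute a separate token to `tags`.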


<h4 style='color:red'>CONVERTING 'description' INTO A LIST</h4>

In [28]:
books.iloc[0].description

'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world ha

In [29]:
books['description']=books['description'].apply(lambda x: x.split())

In [30]:
books.head(2)

Unnamed: 0,isbn13,title,authors,categories,description
0,9780002005883,Gilead,[Marilynne Robinson],[Fiction],"[A, NOVEL, THAT, READERS, and, critics, have, ..."
1,9780002261982,Spider's Web,[Charles Osborne;Agatha Christie],[Detective and mystery stories],"[A, new, 'Christie, for, Christmas', --, a, fu..."


<h2 style='color:#FF0D86'>REMOVING SPACES BETWEEN THE WORDS</h2>

In [40]:
books['authors']=books['authors'].apply(lambda x:[i.replace(" ","") for i in x])
books['categories']=books['categories'].apply(lambda x:[i.replace(" ","") for i in x])


In [41]:
books.head(3)

Unnamed: 0,isbn13,title,authors,categories,description
0,9780002005883,Gilead,[MarilynneRobinson],[Fiction],"[A, NOVEL, THAT, READERS, and, critics, have, ..."
1,9780002261982,Spider's Web,[CharlesOsborne;AgathaChristie],[Detectiveandmysterystories],"[A, new, 'Christie, for, Christmas', --, a, fu..."
2,9780006163831,The One Tree,[StephenR.Donaldson],[Americanfiction],"[Volume, Two, of, Stephen, Donaldson's, acclai..."
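
Removing the spaces matters because `CountVectorizer` splits text on word boundaries, so a name written with a space becomes two unrelated tokens while the joined form stays a single token. A small sketch with two made-up sentences (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a novel by Marilynne Robinson", "a novel by MarilynneRobinson"]

# Default settings: lowercases text and keeps tokens of 2+ characters
cv_demo = CountVectorizer()
cv_demo.fit(docs)

# The spaced name yields 'marilynne' and 'robinson';
# the joined name yields the single token 'marilynnerobinson'
print(sorted(cv_demo.vocabulary_))
```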


<h2 style='color:#FF0D86'>CONCATENATING ALL THE LISTS</h2>

In [42]:
books['tags']=books['description'] + books['categories'] + books['authors']

In [43]:
books.head(2)

Unnamed: 0,isbn13,title,authors,categories,description,tags
0,9780002005883,Gilead,[MarilynneRobinson],[Fiction],"[A, NOVEL, THAT, READERS, and, critics, have, ...","[A, NOVEL, THAT, READERS, and, critics, have, ..."
1,9780002261982,Spider's Web,[CharlesOsborne;AgathaChristie],[Detectiveandmysterystories],"[A, new, 'Christie, for, Christmas', --, a, fu...","[A, new, 'Christie, for, Christmas', --, a, fu..."


In [44]:
#books['tags'][0]

<h2 style='color:#FF0D86'>MAKING A NEW DATAFRAME</h2>

In [45]:
new_df=books[['isbn13','title','tags']]
new_df

Unnamed: 0,isbn13,title,tags
0,9780002005883,Gilead,"[A, NOVEL, THAT, READERS, and, critics, have, ..."
1,9780002261982,Spider's Web,"[A, new, 'Christie, for, Christmas', --, a, fu..."
2,9780006163831,The One Tree,"[Volume, Two, of, Stephen, Donaldson's, acclai..."
3,9780006178736,Rage of angels,"[A, memorable,, mesmerizing, heroine, Jennifer..."
4,9780006280897,The Four Loves,"[Lewis', work, on, the, nature, of, love, divi..."
...,...,...,...
6805,9788185300535,I Am that,"[This, collection, of, the, timeless, teaching..."
6806,9788185944609,Secrets Of The Heart,"[No, description, available, Mysticism, Khalil..."
6807,9788445074879,Fahrenheit 451,"[No, description, available, Bookburning, RayB..."
6808,9789027712059,The Berlin Phenomenology,"[Since, the, three, volume, edition, ofHegel's..."


<h2 style='color:#FF0D86'>CONVERTING THE LIST INTO STRING</h2>

In [46]:
new_df.loc[:,'tags'] = new_df['tags'].apply(lambda x: " ".join(x))

In [47]:
new_df.head(2)

Unnamed: 0,isbn13,title,tags
0,9780002005883,Gilead,A NOVEL THAT READERS and critics have been eag...
1,9780002261982,Spider's Web,A new 'Christie for Christmas' -- a full-lengt...


In [48]:
new_df['tags'][0]

'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world ha

<h2 style='color:#FF0D86'>CONVERTING THE STRING INTO LOWERCASE</h2>

In [49]:
new_df.loc[:,'tags']=new_df['tags'].apply(lambda x:x.lower())

In [50]:
new_df['tags'][0]

'a novel that readers and critics have been eagerly anticipating for over a decade, gilead is an astonishingly imagined story of remarkable lives. john ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. it’s 1956 in gilead, iowa, towards the end of the reverend ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. haunted by his grandfather’s presence, john tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. he is troubled, too, by his prodigal namesake, jack (john ames) boughton, his best friend’s lost son who returns to gilead searching for forgiveness and redemption. told in john ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, gilead is a song of celebration and acceptance of the best and the worst the world ha

<h1>TEXT VECTORIZATION</h1>

In [51]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

In [52]:
cv.fit_transform(new_df['tags'])

<6810x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 172513 stored elements in Compressed Sparse Row format>

<h2 style='color:#FF0D86'>CONVERTING SPARSE MATRIX INTO A NUMPY ARRAY</h2>

In [53]:
cv.fit_transform(new_df['tags']).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [54]:
cv.fit_transform(new_df['tags']).toarray().shape
# (number of books, number of most frequent words kept)

(6810, 5000)

In [55]:
vectors=cv.fit_transform(new_df['tags']).toarray()

<h2 style='color:#FF0D86'>CHECKING THE MOST COMMON WORDS</h2>

In [56]:
cv.get_feature_names_out()

array(['000', '10', '100', ..., 'zizek', 'zombies', 'zone'], dtype=object)

<h1>STEMMING</h1>

In [57]:
import nltk

from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [58]:
def stem(text):
    y=[]

    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y)
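
Stemming reduces inflected forms of a word to a common stem so that they share one column in the count matrix. A short sketch using the same `PorterStemmer` class on three sample words (the word list is an illustration, not taken from the dataset):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Three inflections of the same verb
words = ["loved", "loving", "loves"]
stems = [stemmer.stem(w) for w in words]

# All three forms map to the same stem
print(stems)
```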

In [59]:
new_df=new_df.assign(tags=new_df['tags'].apply(stem))


<h4>REPEATING THE STEPS</h4>

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

In [61]:
vectors=cv.fit_transform(new_df['tags']).toarray()

In [62]:
cv.get_feature_names_out()

array(['000', '10', '100', ..., 'zero', 'zizek', 'zone'], dtype=object)

<h2 style='color:#FF0D86'>FINDING THE SIMILARITY BETWEEN BOOKS</h2>

In [63]:
from sklearn.metrics.pairwise import cosine_similarity
similarity=cosine_similarity(vectors)

In [64]:
similarity

array([[1.        , 0.0265046 , 0.        , ..., 0.        , 0.07877264,
        0.        ],
       [0.0265046 , 1.        , 0.02495326, ..., 0.        , 0.07402332,
        0.02495326],
       [0.        , 0.02495326, 1.        , ..., 0.        , 0.02696799,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.07877264, 0.07402332, 0.02696799, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.02495326, 0.        , ..., 0.        , 0.        ,
        1.        ]])
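
Each entry of this matrix is the cosine of the angle between two books' count vectors: the dot product divided by the product of the vector lengths. A minimal NumPy sketch of that computation on toy vectors (the helper `cosine_sim_matrix` is a hypothetical name, not part of scikit-learn):

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms         # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors

toy_vectors = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
])
sim = cosine_sim_matrix(toy_vectors)
print(np.round(sim, 3))
```

The diagonal is 1 because every vector is perfectly similar to itself, and rows that share no terms score 0.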

<h2 style='color:#FF0D86'>CREATING THE FUNCTION WHICH RECOMMENDS BOOKS</h2>

In [65]:
new_df[new_df['title'] == 'Gilead']

Unnamed: 0,isbn13,title,tags
0,9780002005883,Gilead,a novel that reader and critic have been eager...


In [66]:
new_df[new_df['title'] == 'Rage of angels'].index

Index([3], dtype='int64')

In [67]:
new_df[new_df['title'] == 'Rage of angels'].index[0]

3

In [68]:
similarity[0]

array([1.        , 0.0265046 , 0.        , ..., 0.        , 0.07877264,
       0.        ])

In [69]:
enumerate(similarity[0])

<enumerate at 0x23c10c25030>

In [71]:
#list(enumerate(similarity[0]))

In [2]:
#sorted(list(enumerate(similarity[0])),reverse=True)

In [73]:
#sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x: x[1])

In [77]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x: x[1])[1:6]

[(4329, 0.5352309301050474),
 (4774, 0.5177201831435471),
 (2101, 0.49372847423235194),
 (6167, 0.4935481167928245),
 (1829, 0.4885271508527603)]
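
The `[1:6]` slice relies on the query book sorting first, which holds when its self-similarity of 1.0 is the unique maximum. A possible alternative, sketched here with NumPy on a toy row, drops the query index explicitly before taking the top `k` (the helper name `top_k_similar` is an assumption for illustration):

```python
import numpy as np

def top_k_similar(sim_row, self_index, k=5):
    """Indices of the k highest scores in sim_row, excluding self_index."""
    order = np.argsort(sim_row)[::-1]    # highest score first
    order = order[order != self_index]   # drop the query item itself
    return order[:k].tolist()

# Toy similarity row for item 0
row = np.array([1.0, 0.2, 0.9, 0.0, 0.5])
print(top_k_similar(row, self_index=0, k=3))
```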

<h2 style='color:#FF0D86'>CREATING THE FUNCTION WHICH RECOMMENDS BOOKS</h2>

In [80]:
def recommend(book):
    book_index=new_df[new_df['title'] == book].index[0]
    distances=similarity[book_index]
    book_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x: x[1])[1:6]

    for i in book_list:
        print(i[0])

In [84]:
recommend("Gilead")

4329
4774
2101
6167
1829


<h2 style='color:#FF0D86'>TO GET THE BOOK NAMES</h2>

In [94]:
def recommend(book):
    book_index=new_df[new_df['title'] == book].index[0]
    distances=similarity[book_index]
    book_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x: x[1])[1:6]

    for i in book_list:
        print(new_df.iloc[i[0]].title)

In [95]:
recommend("Spider's Web")

Witness for the Prosecution & Selected Plays
Purple Cane Road
The Murder of Roger Ackroyd
Poirot
The Real Thing


In [96]:
recommend('Not Without My Daughter')

Independent People
Beach Music
Dreams from My Father
Speak, Memory
Man and Wife


In [97]:
recommend('The Lord of the Rings')

The Lord of the Rings
The Return of the King
The Fellowship of the Ring
The ring sets out
The Return of the King


In [98]:
recommend('The God of Small Things')

The Namesake
The Hotel New Hampshire
The Blackwater Lightship
Human Croquet
The Thorn Birds
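
The dataset contains several entries with the same title, which is why 'The Lord of the Rings' shows up in its own recommendations. One way to handle this, sketched below under the assumption that identical titles should be treated as the same book, is to return a list, skip rows whose title matches the query, and raise an error for unknown titles. The function `recommend_titles` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd

def recommend_titles(book, df, sim, k=5):
    """Return up to k titles most similar to `book`, skipping same-title rows."""
    matches = np.flatnonzero((df["title"] == book).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Title not found: {book!r}")
    pos = matches[0]                     # position of the first matching row
    order = np.argsort(sim[pos])[::-1]   # most similar first
    titles = []
    for j in order:
        title = df["title"].iloc[j]
        # Skip the query title and titles already collected
        if title != book and title not in titles:
            titles.append(title)
        if len(titles) == k:
            break
    return titles

# Toy data with one duplicated title
toy_df = pd.DataFrame({"title": ["A", "A", "B", "C"]})
toy_sim = np.array([
    [1.0, 0.9, 0.4, 0.7],
    [0.9, 1.0, 0.3, 0.6],
    [0.4, 0.3, 1.0, 0.1],
    [0.7, 0.6, 0.1, 1.0],
])
print(recommend_titles("A", toy_df, toy_sim, k=2))
```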
