# **Content-Based Recommendation System Notebook**
- In this notebook, we will explore and implement a content-based recommendation system. Content-based recommendation systems suggest items to users based on the characteristics of the items and a profile of the user's preferences. 
- This approach is particularly useful when we have a lot of information about the items and the users' preferences. We will build a simple content-based recommendation system using Python and the scikit-learn library.

**Table of Contents**
1. Introduction
    - What is a Content-Based Recommendation System?
    - How Does it Work?


2. Data Preparation
    - Dataset
    - Feature Extraction
    - Data Preprocessing


3. Building the Content-Based Recommendation System
    - TF-IDF Vectorization
    - Cosine Similarity
    - Test your Recommendation System


4. Evaluation
    - Evaluation Metrics


5. Conclusion
    - Summary
 --------------------------------------------

## **1. Introduction**
- **What is a Content-Based Recommendation System?**
    - A content-based recommendation system recommends items to users based on the content or characteristics of the items. This type of recommendation system focuses on understanding the properties of items and learning user preferences from the items they have interacted with in the past.


- **How Does it Work?**
    - The working principle of a content-based recommendation system can be summarized in a few steps:
        1. **Feature Extraction**: Extract relevant features from the items. For example, in a book recommendation system, features could include title, author, and category.

        2. **User Profile**: Create a user profile based on their interactions with items. This profile is essentially a summary of the features of items the user has liked or interacted with in the past.

        3. **Recommendation**: Calculate the similarity between the user profile and each item's features. Items that are most similar to the user profile are recommended.
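
The three steps above can be sketched on a tiny, made-up catalogue. This is only an illustration under stated assumptions: the four item strings and the `liked` list are hypothetical, and averaging the liked items' TF-IDF vectors is one simple way to build a user profile, not the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue: one string of features per item
items = [
    "Gilead Fiction Marilynne Robinson",
    "Rage of angels Fiction Sidney Sheldon",
    "The Four Loves Christian life Clive Staples Lewis",
    "Mere Christianity Christian life Clive Staples Lewis",
]

# Step 1: feature extraction (text -> TF-IDF vectors)
item_vectors = TfidfVectorizer().fit_transform(items)

# Step 2: user profile = mean vector of the items the user liked
liked = [2]  # assumption: the user liked "The Four Loves"
user_profile = np.asarray(item_vectors[liked].mean(axis=0))

# Step 3: rank the items the user has not seen by similarity to the profile
scores = cosine_similarity(user_profile, item_vectors).ravel()
ranked = [i for i in np.argsort(-scores) if i not in liked]
best = items[ranked[0]]
print(best)
```

The rest of this notebook uses a simpler variant of step 2: the "profile" is a single book chosen by the user rather than an average over several liked books.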

## **2. Data Preparation**
**Dataset**
   - We will use a dataset of book information, including titles, authors, and categories.

In [1]:
# Import needed modules
import numpy as np
import pandas as pd
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Read data
df = pd.read_csv("C:/Users/Taroo2/Music/Private/King_Of_Hell/Route_Ai/Works/Model/data.csv")

- Let's take a first look at the data.

In [3]:
# printing the first 5 rows of the dataframe
df.head()

| Unnamed: 0 | isbn13 | isbn10 | title | subtitle | authors | categories | thumbnail | description | published_year | average_rating | num_pages | ratings_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9780002005883 | 2005883 | Gilead | | Marilynne Robinson | Fiction | http://books.google.com/books/content?id=KQZCP... | A NOVEL THAT READERS and critics have been eag... | 2004.0 | 3.85 | 247.0 | 361.0 |
| 1 | 9780002261982 | 2261987 | Spider's Web | A Novel | Charles Osborne;Agatha Christie | Detective and mystery stories | http://books.google.com/books/content?id=gA5GP... | A new 'Christie for Christmas' -- a full-lengt... | 2000.0 | 3.83 | 241.0 | 5164.0 |
| 2 | 9780006163831 | 6163831 | The One Tree | | Stephen R. Donaldson | American fiction | http://books.google.com/books/content?id=OmQaw... | Volume Two of Stephen Donaldson's acclaimed se... | 1982.0 | 3.97 | 479.0 | 172.0 |
| 3 | 9780006178736 | 6178731 | Rage of angels | | Sidney Sheldon | Fiction | http://books.google.com/books/content?id=FKo2T... | A memorable, mesmerizing heroine Jennifer -- b... | 1993.0 | 3.93 | 512.0 | 29532.0 |
| 4 | 9780006280897 | 6280897 | The Four Loves | | Clive Staples Lewis | Christian life | http://books.google.com/books/content?id=XhQ5X... | Lewis' work on the nature of love divides love... | 2002.0 | 4.15 | 170.0 | 33684.0 |


In [4]:
# Get data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6810 entries, 0 to 6809
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   isbn13          6810 non-null   int64  
 1   isbn10          6810 non-null   object 
 2   title           6810 non-null   object 
 3   subtitle        2381 non-null   object 
 4   authors         6738 non-null   object 
 5   categories      6711 non-null   object 
 6   thumbnail       6481 non-null   object 
 7   description     6548 non-null   object 
 8   published_year  6804 non-null   float64
 9   average_rating  6767 non-null   float64
 10  num_pages       6767 non-null   float64
 11  ratings_count   6767 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 638.6+ KB


**Feature Extraction**
- We select the features the recommender will compare: title, authors, categories, and publication year.

In [5]:
# Selecting the relevant features for recommendation
selected_features = ['title','authors','categories','published_year']
print(selected_features)

['title', 'authors', 'categories', 'published_year']


**Data Preprocessing**
- Before building the recommendation system, we need to preprocess the data. Here that means replacing missing values in the selected features with empty strings and combining the features into a single text string per book.

In [6]:
# Replacing the null values with an empty string
for feature in selected_features:
    df[feature] = df[feature].fillna('')

In [7]:
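# combining the 4 selected features into one text string per book
# NOTE: f"{df['published_year']}" formats the whole Series (not each row's year),
# so the same Series printout is appended to every row, as the output below shows
combined_features = df['title'] + ' ' + df['categories'] + ' ' + df['authors'] + ' ' + f"{df['published_year']}"
combined_features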


0       Gilead Fiction Marilynne Robinson 0       2004...
1       Spider's Web Detective and mystery stories Cha...
2       The One Tree American fiction Stephen R. Donal...
3       Rage of angels Fiction Sidney Sheldon 0       ...
4       The Four Loves Christian life Clive Staples Le...
                              ...                        
6805    I Am that Philosophy Sri Nisargadatta Maharaj;...
6806    Secrets Of The Heart Mysticism Khalil Gibran 0...
6807    Fahrenheit 451 Book burning Ray Bradbury 0    ...
6808    The Berlin Phenomenology History Georg Wilhelm...
6809    'I'm Telling You Stories' Literary Criticism H...
Length: 6810, dtype: object
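
Every row in the output above ends with the same `0       2004...` text, because the f-string renders the entire `published_year` column once and appends that printout to each title. A minimal sketch of a row-wise alternative follows. It runs on a two-row toy frame rather than the real `df`; the helper `year_to_text` is a hypothetical name, and the second year is blanked on purpose to mimic a missing value that was already filled with `''`.

```python
import pandas as pd

# Toy frame mirroring the notebook's columns (second year blanked on purpose)
toy = pd.DataFrame({
    "title": ["Gilead", "Spider's Web"],
    "categories": ["Fiction", "Detective and mystery stories"],
    "authors": ["Marilynne Robinson", "Charles Osborne;Agatha Christie"],
    "published_year": [2004.0, ""],
})


def year_to_text(y):
    """Render one year as text; NaN or already-blank values become ''."""
    if pd.isna(y) or y == "":
        return ""
    return str(int(y))  # 2004.0 -> "2004"


# Convert the year element by element, then concatenate row by row
year_text = toy["published_year"].map(year_to_text)
combined = (
    toy["title"] + " " + toy["categories"] + " "
    + toy["authors"] + " " + year_text
).str.strip()
print(combined.tolist())
```

Applying the same idea to the full dataset would change the TF-IDF weights and similarity scores shown in the rest of this notebook, since each row would then carry only its own year.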

## **3. Building the Content-Based Recommendation System**
**TF-IDF Vectorization**
- We use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert the combined text features into numerical vectors.
- TF-IDF gives more weight to terms that occur often in a specific document but rarely across the whole collection, and less weight to terms that are common everywhere.
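
To make the weighting concrete, here is a small sketch on three made-up documents (not the full dataset). With scikit-learn's default settings, a word that appears in only one document ("gilead") should receive a higher weight than a word shared by two documents ("fiction"), and each row is scaled to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus built from titles, categories, and authors
docs = [
    "Gilead Fiction Marilynne Robinson",
    "Rage of angels Fiction Sidney Sheldon",
    "The Four Loves Christian life Clive Staples Lewis",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # maps each (lowercased) term to its column index

# Weights of two terms in the first document
row0 = weights[0].toarray().ravel()
w_gilead = row0[vocab["gilead"]]    # appears in 1 of 3 documents
w_fiction = row0[vocab["fiction"]]  # appears in 2 of 3 documents

print(w_gilead > w_fiction)
```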


In [8]:
# converting the text data to feature vectors
vectorizer = TfidfVectorizer()

feature_vectors = vectorizer.fit_transform(combined_features)

In [9]:
print(feature_vectors)

  (0, 6894)	0.06443750710614472
  (0, 2840)	0.06443750710614472
  (0, 143)	0.06443750710614472
  (0, 5557)	0.06443750710614472
  (0, 7629)	0.06443750710614472
  (0, 6655)	0.06443750710614472
  (0, 102)	0.06443750710614472
  (0, 142)	0.06443750710614472
  (0, 91)	0.06443750710614472
  (0, 141)	0.06443750710614472
  (0, 140)	0.06443750710614472
  (0, 139)	0.06443750710614472
  (0, 103)	0.06443750710614472
  (0, 138)	0.06443750710614472
  (0, 110)	0.06443750710614472
  (0, 98)	0.12887501421228945
  (0, 92)	0.06443750710614472
  (0, 108)	0.06443750710614472
  (0, 112)	0.12887501421228945
  (0, 8035)	0.43700046813706117
  (0, 6013)	0.5885172279797457
  (0, 3422)	0.11189605685054443
  (0, 3882)	0.5885172279797457
  (1, 1822)	0.3013956698681646
  (1, 267)	0.30498953565603815
  :	:
  (6809, 10330)	0.33175135989257093
  (6809, 2240)	0.2206043839379868
  (6809, 9492)	0.30218650531542857
  (6809, 10435)	0.26439634843355025
  (6809, 5684)	0.20438577371569946
  (6809, 9065)	0.21319651041165227
  (6

**Cosine Similarity**
- We compute the cosine similarity between the TF-IDF vectors of items. Cosine similarity measures the cosine of the angle between two non-zero vectors and is used to determine how similar two items are based on their feature vectors.
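
For two vectors a and b, cosine similarity is the dot product a · b divided by the product of their lengths. The sketch below checks that definition against `cosine_similarity` on two made-up vectors; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up feature vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Definition: dot product divided by the product of the vector lengths
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D inputs (one row per vector)
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, library)
```

Because TF-IDF weights are never negative, the scores in this notebook fall between 0 (no shared terms) and 1 (same direction).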

In [10]:
# getting the similarity scores using cosine similarity
similarity = cosine_similarity(feature_vectors, feature_vectors)

In [11]:
print(similarity)

[[1.         0.08005311 0.11641123 ... 0.08492469 0.08320896 0.07576596]
 [0.08005311 1.         0.08011208 ... 0.06549297 0.06416982 0.10645452]
 [0.11641123 0.08011208 1.         ... 0.08498725 0.09570742 0.07582177]
 ...
 [0.08492469 0.06549297 0.08498725 ... 1.         0.06807484 0.06198557]
 [0.08320896 0.06416982 0.09570742 ... 0.06807484 1.         0.06073329]
 [0.07576596 0.10645452 0.07582177 ... 0.06198557 0.06073329 1.        ]]
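
With 6810 books, the full matrix holds 6810 × 6810 (about 46 million) scores, although a single recommendation only needs one row of it. As a sketch of a lighter alternative, the scores for one book can be computed on demand by passing just that book's vector as the first argument. The four toy strings and `query_index` below are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for combined_features
texts = [
    "Tropic of Cancer Fiction Henry Miller",
    "Tropic of Capricorn Fiction Henry Miller",
    "Gilead Fiction Marilynne Robinson",
    "The Four Loves Christian life Clive Staples Lewis",
]
vectors = TfidfVectorizer().fit_transform(texts)

# Similarity of one book against all books: a single row, not the full matrix
query_index = 0
row_scores = cosine_similarity(vectors[query_index], vectors).ravel()

# The row matches the corresponding row of the full matrix
full = cosine_similarity(vectors)
same = np.allclose(row_scores, full[query_index])
print(row_scores.shape, same)
```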


**Test your Recommendation System**

In [12]:
# creating a list with all the book names given in the dataset

list_of_all_titles = df['title'].tolist()
print(list_of_all_titles)



In [22]:
# getting the book name from the user
book_name = input(' Enter your favourite book name : ') # e.g. Tropic of Cancer

 Enter your favourite book name : Tropic of Cancer


In [23]:
# finding the close match for the book name given by the user
find_close_match = difflib.get_close_matches(book_name, list_of_all_titles)
print(find_close_match)

['Tropic of Cancer', 'Tropic of Capricorn', 'The Music of Chance']
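
`difflib.get_close_matches` returns an empty list when no title is similar enough, which would make the `find_close_match[0]` lookup in the next cell fail. A small guard, sketched here with a hypothetical helper `best_title_match`, returns `None` in that case. The `n` and `cutoff` arguments are real parameters of `get_close_matches` (0.6 is its default cutoff).

```python
import difflib

# The three titles returned in the output above
titles = ["Tropic of Cancer", "Tropic of Capricorn", "The Music of Chance"]


def best_title_match(query, candidates, cutoff=0.6):
    """Return the closest title, or None when nothing clears the cutoff."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(best_title_match("Tropic of Cancr", titles))  # misspelled on purpose
print(best_title_match("zzzz", titles))
```

Note that the matching is case-sensitive, so lowercasing both the query and the titles before matching may help with casual user input.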


In [24]:
# finding the index of the book with title
close_match = find_close_match[0]
index_of_the_book = df[df.title == close_match].index[0]

In [25]:
# getting a list of similar books
similarity_score = list(enumerate(similarity[index_of_the_book]))
print(similarity_score)

[(0, 0.1131535308913172), (1, 0.07787019332372604), (2, 0.11323688844241867), (3, 0.13046394159285463), (4, 0.08334154934338014), (5, 0.09839867495392973), (6, 0.10355340537553105), (7, 0.09892057751164188), (8, 0.06660202750554191), (9, 0.11149879574964734), (10, 0.07899356744153149), (11, 0.1000018730443227), (12, 0.12145589953811992), (13, 0.0691142103889098), (14, 0.07265809135506765), (15, 0.09538082774901466), (16, 0.11135695665669457), (17, 0.07494357636565215), (18, 0.09516211174338889), (19, 0.09782281494344103), (20, 0.0970816775917241), (21, 0.08317021669173072), (22, 0.09838676699200619), (23, 0.07384961880594426), (24, 0.10709434469148679), (25, 0.08121777834393885), (26, 0.09707462759476401), (27, 0.07965707957378625), (28, 0.11595789193599448), (29, 0.08606970901445515), (30, 0.07979091449030064), (31, 0.08420506217003945), (32, 0.08552172392030019), (33, 0.07560085386047502), (34, 0.09233274004938634), (35, 0.08445477901378311), (36, 0.0943988508570881), (37, 0.09537607

In [26]:
# sorting the books based on their similarity score
sorted_similar_books = sorted(similarity_score, key = lambda x:x[1], reverse = True) 
print(sorted_similar_books)

[(93, 1.0000000000000002), (4984, 0.7034984179380548), (5126, 0.5173081985367141), (4983, 0.46412968308869856), (1203, 0.4049657074901466), (5124, 0.36092745248831143), (4955, 0.3597349395354947), (1010, 0.32024807715677805), (3806, 0.3184946103800597), (5962, 0.31292669531503764), (2790, 0.3038921284414741), (1472, 0.2951917392601777), (951, 0.2951212176865599), (1199, 0.2933241927936083), (696, 0.2930642397356478), (713, 0.2930642397356478), (5125, 0.2845166799522812), (6566, 0.2838975087371368), (5310, 0.2833686091218577), (6332, 0.2821800666186036), (3115, 0.2816831813264502), (2657, 0.2763260673355066), (2016, 0.2682759289194526), (1276, 0.2657000450304601), (2455, 0.263703576599304), (5963, 0.263703576599304), (5965, 0.26163543489701796), (1274, 0.2609517628029866), (1226, 0.2598650116384157), (1464, 0.2574852542188878), (5865, 0.2527717934313136), (4023, 0.24879010180243377), (2319, 0.24721490578660854), (6327, 0.24629434902396483), (5150, 0.24328932459351899), (3170, 0.23733744

In [27]:
top_sim = sorted_similar_books[:5]
top_sim

[(93, 1.0000000000000002),
 (4984, 0.7034984179380548),
 (5126, 0.5173081985367141),
 (4983, 0.46412968308869856),
 (1203, 0.4049657074901466)]
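
The top entry (index 93, score 1.0) is the query book itself, since every book is perfectly similar to itself. If the query should be left out of the suggestions, one option is to sort with NumPy and drop its index. The helper `top_k_similar` and the score values below are made up for illustration.

```python
import numpy as np


def top_k_similar(scores, query_index, k=5):
    """Indices of the k highest scores, skipping the query item itself."""
    order = np.argsort(scores)[::-1]     # highest score first
    order = order[order != query_index]  # drop the query book
    return order[:k].tolist()


# Made-up similarity row in which item 1 is the query (score 1.0)
scores = np.array([0.11, 1.00, 0.70, 0.05, 0.52])
print(top_k_similar(scores, query_index=1, k=3))  # → [2, 4, 0]
```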

In [28]:
# print the titles of the 5 most similar books (the first is the query book itself)
for i, (index, score) in enumerate(top_sim, start=1):
    print(i, '-', df.loc[index, 'title'])

1 - Tropic of Cancer
2 - Tropic of Capricorn
3 - Henry Miller on Writing
4 - Sexus
5 - Daisy Miller and Other Stories


## **Full Recommendation System**

In [29]:
# ask the user for a book title
book_name = input(' Enter your favourite book name : ')

list_of_all_titles = df['title'].tolist()

# fuzzy-match the input against the known titles
find_close_match = difflib.get_close_matches(book_name, list_of_all_titles)

if not find_close_match:
    print('No close match found for that title.')
else:
    close_match = find_close_match[0]
    index_of_the_book = df[df.title == close_match].index[0]

    # pair every book index with its similarity to the matched book, best first
    similarity_score = list(enumerate(similarity[index_of_the_book]))
    sorted_similar_books = sorted(similarity_score, key=lambda x: x[1], reverse=True)

    print('Books suggested for you : \n')

    # print the 29 highest-scoring titles (the first is the matched book itself)
    for i, (index, score) in enumerate(sorted_similar_books[:29], start=1):
        print(i, '.', df.loc[index, 'title'])

 Enter your favourite book name : Tropic of Cancer
Books suggested for you : 

1 . Tropic of Cancer
2 . Tropic of Capricorn
3 . Henry Miller on Writing
4 . Sexus
5 . Daisy Miller and Other Stories
6 . The Air-conditioned Nightmare
7 . Quiet Days in Clichy
8 . The Portable Arthur Miller
9 . Death of a Salesman
10 . Henry V
11 . The Turn of the Screw and Daisy Miller
12 . The Autobiography of Henry VIII
13 . The Portrait of a Lady
14 . The Bostonians
15 . The Crucible
16 . The Crucible
17 . Mémoires, Plaidoiries Et Documents
18 . The History of Tom Jones
19 . Freaks!
20 . The New York Stories of Henry James
21 . Drama Of The Gifted
22 . Death of a Salesman
23 . Cross-X
24 . Every Second Counts
25 . 1 Henry IV
26 . Henry IV
27 . Henry VIII
28 . It's Not about the Bike
29 . Augustine
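
The cell above can also be wrapped in a function so it can be reused without `input()`. The sketch below is one possible shape, not a definitive implementation: `recommend`, its parameters, and the three-row toy frame are hypothetical, and unlike the cell above it leaves the matched book out of its own suggestions.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(book_name, books, similarity, n=5):
    """Return up to n titles most similar to the closest match of book_name."""
    titles = books["title"].tolist()
    matches = difflib.get_close_matches(book_name, titles, n=1)
    if not matches:
        return []  # nothing close enough to the requested title
    query_pos = titles.index(matches[0])
    ranked = sorted(enumerate(similarity[query_pos]),
                    key=lambda pair: pair[1], reverse=True)
    # skip the matched book itself, then keep the n best
    return [titles[i] for i, _ in ranked if i != query_pos][:n]


# Toy data standing in for df and the similarity matrix
toy = pd.DataFrame({
    "title": ["Tropic of Cancer", "Tropic of Capricorn", "Gilead"],
    "authors": ["Henry Miller", "Henry Miller", "Marilynne Robinson"],
})
vectors = TfidfVectorizer().fit_transform(toy["title"] + " " + toy["authors"])
toy_similarity = cosine_similarity(vectors)

print(recommend("Tropic of Cancer", toy, toy_similarity, n=2))
```

With the notebook's own objects, the call would look like `recommend('Tropic of Cancer', df, similarity, n=10)`.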
