#### RECOMMENDATION ENGINE
1. Data Collection - Kaggle
2. Data Cleaning - imputation or deletion of missing values (e.g. KNN imputation), duplicate removal (pandas, numpy)
3. Data Transformation - min-max scaling, Z-score normalization, creating new features (pandas, numpy, missingno, scikit-learn)
4. Algorithm Selection - collaborative, content-based filtering, and hybrid methods
5. Model Training - scikit-learn, TensorFlow, PyTorch
6. Model Evaluation - classification (accuracy, precision, recall, F1-score), ranking (mean reciprocal rank, normalized discounted cumulative gain)
7. Integration - RESTful API (Flask, FastAPI frameworks) to serve recommendations, db (PostgreSQL, MongoDB (NoSQL))
8. Deployment - Azure Machine Learning
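
The two ranking metrics named in step 6 can be sketched in a few lines of NumPy. This is only an illustration: the function names and the toy relevance lists are hypothetical, and the NDCG variant here uses the raw relevance score as the gain (some definitions use `2**rel - 1` instead).

```python
import numpy as np

def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list of 0/1 flags per query, in ranked order."""
    reciprocal_ranks = []
    for flags in ranked_relevance:
        hits = np.flatnonzero(flags)  # positions of the relevant items
        # Reciprocal of the 1-based rank of the first hit, or 0 if there is none
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))

def ndcg_at_k(relevance, k):
    """relevance: graded relevance scores for one query, in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    top = relevance[:k]
    # Position i (1-based) is discounted by 1 / log2(i + 1)
    discounts = 1.0 / np.log2(np.arange(2, top.size + 2))
    dcg = float(np.sum(top * discounts))
    ideal = np.sort(relevance)[::-1][:k]  # best possible ordering
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical toy data: three queries for MRR, one ranked list for NDCG
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
ndcg = ndcg_at_k([0, 0, 1], k=3)
```

Both metrics reward placing relevant items near the top of the list; MRR only looks at the first relevant item, while NDCG accounts for every position up to `k`.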

In [10]:
%pip install pandas numpy scikit-learn matplotlib tensorflow keras missingno 

Note: you may need to restart the kernel to use updated packages.





In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [12]:
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
df = pd.read_csv('data.csv')

In [14]:
df.head()

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,2005883,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,9780002261982,2261987,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0
2,9780006163831,6163831,The One Tree,,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQaw...,Volume Two of Stephen Donaldson's acclaimed se...,1982.0,3.97,479.0,172.0
3,9780006178736,6178731,Rage of angels,,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0
4,9780006280897,6280897,The Four Loves,,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6810 entries, 0 to 6809
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   isbn13          6810 non-null   int64  
 1   isbn10          6810 non-null   object 
 2   title           6810 non-null   object 
 3   subtitle        2381 non-null   object 
 4   authors         6738 non-null   object 
 5   categories      6711 non-null   object 
 6   thumbnail       6481 non-null   object 
 7   description     6548 non-null   object 
 8   published_year  6804 non-null   float64
 9   average_rating  6767 non-null   float64
 10  num_pages       6767 non-null   float64
 11  ratings_count   6767 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 638.6+ KB
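
The `df.info()` output shows that several columns (for example `authors`, `categories`, and `average_rating`) contain missing values. One possible way to handle them, sketched below on a toy frame, is to fill text gaps with empty strings and numeric gaps with the column median. The helper name `fill_missing` and the toy rows are hypothetical; whether imputation or deletion is appropriate depends on the downstream model.

```python
import pandas as pd

def fill_missing(frame, text_columns, numeric_columns):
    """Return a copy with text gaps set to '' and numeric gaps set to the column median."""
    cleaned = frame.copy()
    for column in text_columns:
        cleaned[column] = cleaned[column].fillna('')
    for column in numeric_columns:
        cleaned[column] = cleaned[column].fillna(cleaned[column].median())
    return cleaned

# Toy frame for illustration only; the second row is made up
toy = pd.DataFrame({
    'authors': ['Marilynne Robinson', None],
    'average_rating': [3.85, None],
})
cleaned = fill_missing(toy, text_columns=['authors'], numeric_columns=['average_rating'])
```

Working on a copy keeps the original frame available for comparison.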


#### Feature Selection

In [16]:
selected_features = ['title', 'authors', 'categories', 'published_year']
print(selected_features)

['title', 'authors', 'categories', 'published_year']


#### Data Preparation

In [17]:
# Build one text string per row; fill gaps so NaN does not propagate, and format the year per row
year = df['published_year'].map(lambda y: '' if pd.isna(y) else str(int(y)))
combined_features = (df['title'] + ' ' + df['categories'].fillna('') + ' '
                     + df['authors'].fillna('') + ' ' + year)

#### Building the Recommendation System

Term Frequency-Inverse Document Frequency (TF-IDF) is a natural language processing technique that scores how important a word is to a document relative to a collection of documents (the corpus).
- We use TF-IDF vectorization to convert the combined text features (title, categories, authors, and publication year) into numerical vectors. Terms that occur often in one document but rarely across the corpus receive more weight; terms common to many documents receive less.
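
To make the weighting concrete, here is a small sketch on a made-up three-document corpus (the corpus and the `toy_` names are assumptions for illustration, not part of the dataset). With scikit-learn's default settings, a word that appears in only one document should score higher than a word shared by two documents when both occur once in the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_corpus = [
    "fiction novel about family",
    "fiction mystery novel",
    "philosophy of love",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)  # sparse matrix, one row per document

vocabulary = toy_vectorizer.vocabulary_        # maps each term to its column index
first_doc = toy_matrix[0].toarray().ravel()    # dense TF-IDF vector of the first document

# "family" occurs in one document, "fiction" in two, so "family" gets the larger weight
family_weight = first_doc[vocabulary["family"]]
fiction_weight = first_doc[vocabulary["fiction"]]
print(family_weight > fiction_weight)
```

By default `TfidfVectorizer` also L2-normalizes each row, which is convenient for the cosine similarity step.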

In [23]:
vectorizer = TfidfVectorizer()
feature_vectors = vectorizer.fit_transform(combined_features.values.astype('U'))

Cosine similarity is the cosine of the angle between two non-zero vectors. It measures how similar two items are based on the direction of their feature vectors; for non-negative TF-IDF vectors it ranges from 0 (no shared terms) to 1 (same direction).

In [24]:
similarity = cosine_similarity(feature_vectors, feature_vectors)
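
As a sanity check on what `cosine_similarity` computes, the sketch below applies the formula cos(θ) = (u · v) / (‖u‖ ‖v‖) by hand to three made-up vectors and compares the result with scikit-learn's pairwise output. The helper `cosine` and the vectors are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||) for non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, so similarity is about 1
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a, so similarity is 0

# Pairwise matrix: entry [i, j] is the similarity between row i and row j
pairwise = cosine_similarity(np.vstack([a, b, c]))
```

Applied to the notebook's data, the call above produces a dense 6810 × 6810 array, one row and one column per book.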

In [None]:
list_of_all_titles = df['title'].tolist()
print(list_of_all_titles)
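
One plausible next step, given that `difflib` is imported and a similarity matrix is available, is to match a user's input to the closest known title and rank the other books by similarity. The sketch below is an assumption about how that could look, not the notebook's own implementation; the function name `recommend` and the toy matrix are hypothetical.

```python
import difflib
import numpy as np

def recommend(query, titles, similarity_matrix, top_n=5):
    """Return up to top_n titles most similar to the closest match for query."""
    # Tolerate typos by picking the best fuzzy match among known titles
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []
    index = titles.index(matches[0])  # first occurrence if the title is duplicated
    scores = np.asarray(similarity_matrix[index], dtype=float)
    ranked = np.argsort(-scores, kind="stable")  # highest similarity first
    # Skip the matched book itself, then keep the top_n remaining titles
    return [titles[i] for i in ranked if i != index][:top_n]

# Toy data for illustration; the similarity values are made up
toy_titles = ["Gilead", "Spider's Web", "The One Tree"]
toy_similarity = np.array([
    [1.0, 0.1, 0.6],
    [0.1, 1.0, 0.2],
    [0.6, 0.2, 1.0],
])
result = recommend("Gilaed", toy_titles, toy_similarity, top_n=2)
print(result)
```

With the notebook's objects, the equivalent call would pass `list_of_all_titles` and `similarity` in place of the toy inputs.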