## Content-Based Model Using Sigmoid Kernel

### Content-based recommendation algorithms rely on the characteristics of the items themselves rather than on data about other users. They predict what a user will like from the attributes of the items that user has already responded to.

### Finding a good book to read over the weekend without doing much research is a common challenge. Let's look at how we might help readers find a book they are likely to enjoy.

In [8]:
import pandas as pd
import numpy as np

### Import the dataset and drop the rows that contain missing values

In [9]:
books_data = pd.read_csv('../../data_preprocessing/books_data.csv')
books_data=books_data.dropna()

### Let's get a feel for the data by looking at a sample of it: the books written by George Orwell.

In [10]:
books_data[books_data['authors']=='George Orwell']

| Unnamed: 0.1 | Unnamed: 0 | id | best_book_id | work_id | books_count | isbn13 | original_publication_year | title | language_code | average_rating | ... | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url | authors | summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 13 | 14 | 7613 | 2207778 | 896 | 9780452284240 | 1945.0 | animal farm | eng | 3.87 | ... | 35472 | 66854 | 135147 | 433432 | 698642 | 648912 | https://images.gr-assets.com/books/1424037542m... | https://images.gr-assets.com/books/1424037542s... | George Orwell | A satire on totalitarianism in which farm anim... |
| 1102 | 845 | 846 | 5472 | 2966408 | 51 | 9780151010260 | 1950.0 | animal farm | eng | 4.26 | ... | 1293 | 1212 | 3276 | 16511 | 40583 | 57179 | https://images.gr-assets.com/books/1327959366m... | https://images.gr-assets.com/books/1327959366s... | George Orwell | George Orwell's classic satire on totalitarian... |
| 5075 | 4003 | 4004 | 9646 | 2566499 | 151 | 9780156421170 | 1938.0 | homage to catalonia | eng | 4.14 | ... | 1500 | 176 | 733 | 4407 | 10529 | 10036 | https://images.gr-assets.com/books/1394868278m... | https://images.gr-assets.com/books/1394868278s... | George Orwell | Presents the British novelist's firsthand repo... |
| 8519 | 6726 | 6727 | 9650 | 1171545 | 116 | 9781421808310 | 1934.0 | burmese days | eng | 3.84 | ... | 929 | 144 | 811 | 3918 | 6521 | 3519 | https://images.gr-assets.com/books/1415573403m... | https://images.gr-assets.com/books/1415573403s... | George Orwell | George Bowling, the hero of this comic novel, ... |
| 12131 | 9641 | 9642 | 9648 | 3226250 | 99 | 9780141183720 | 1936.0 | keep the aspidistra flying | eng | 3.87 | ... | 746 | 121 | 615 | 2926 | 4556 | 3043 | https://images.gr-assets.com/books/1331244097m... | https://images.gr-assets.com/books/1331244097s... | George Orwell | London 1934. Gordon Comstock, copywriter for t... |


### Before we begin the analysis, let's check for null values in the dataset. Because dropna() was already applied, every count should be zero.

In [11]:
books_data.isnull().sum()

Unnamed: 0                   0
id                           0
best_book_id                 0
work_id                      0
books_count                  0
isbn13                       0
original_publication_year    0
title                        0
language_code                0
average_rating               0
ratings_count                0
work_ratings_count           0
work_text_reviews_count      0
ratings_1                    0
ratings_2                    0
ratings_3                    0
ratings_4                    0
ratings_5                    0
image_url                    0
small_image_url              0
authors                      0
summary                      0
dtype: int64

### Before we can analyse the plot summaries, we need to turn the text in the summary column into word vectors by fitting a TF-IDF vectorizer on that column:

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
tvf = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', ngram_range=(1,3), token_pattern=r'\w{1,}', stop_words='english')

In [14]:
tvf_matrix = tvf.fit_transform(books_data['summary'])

In [15]:
tvf_matrix.shape

(10498, 63687)

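### So the 10,498 summaries are described by 63,687 distinct terms. Because ngram_range=(1,3), these terms are unigrams, bigrams, and trigrams, not just single words, and min_df=3 keeps only terms that appear in at least three summaries. Changing either parameter changes this figure.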
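
### To illustrate how the n-gram range affects the number of columns, here is a minimal sketch on a made-up corpus. The names toy_summaries and vocab_size are hypothetical and do not come from books_data; min_df and stop_words are left at their defaults so that the tiny corpus keeps all of its terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from books_data.
toy_summaries = [
    "a farm satire on power",
    "a satire on power and politics",
    "a love story on a farm",
]

def vocab_size(ngram_range):
    # Fit a vectorizer and report how many distinct terms (columns) it produces.
    vec = TfidfVectorizer(ngram_range=ngram_range, token_pattern=r'\w{1,}')
    return vec.fit_transform(toy_summaries).shape[1]

unigram_terms = vocab_size((1, 1))        # single words only
up_to_trigram_terms = vocab_size((1, 3))  # words, bigrams and trigrams
print(unigram_terms, up_to_trigram_terms)
```

### The second count is larger because every bigram and trigram becomes its own column.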

### Now that we have a term matrix, we can start computing similarity scores. These scores will help us identify the books whose summaries are most similar to that of the book chosen by the user.

### The code below computes the sigmoid kernel between every pair of summaries.

In [16]:
from sklearn.metrics.pairwise import sigmoid_kernel
sig = sigmoid_kernel(tvf_matrix, tvf_matrix)

In [17]:
sig[0]

array([0.76160075, 0.76159431, 0.7615942 , ..., 0.76159421, 0.7615943 ,
       0.76159416])

### Next, we build a reverse mapping from book titles to their indices.

In [18]:
indices= pd.Series(books_data.index, index=books_data['title']).drop_duplicates()

In [19]:
indices

title
the hunger games the hunger games                                                               0
twilight twilight                                                                               3
to kill a mockingbird                                                                           4
the great gatsby                                                                                5
the fault in our stars                                                                          6
                                                                                            ...  
billy budd sailor                                                                           12535
bayou moon the edge                                                                         12536
means of ascent the years of lyndon johnson                                                 12537
cinderella ate my daughter dispatches from the frontlines of the new girlie girl culture    12539
the first worl
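
### Note that the values above are index labels, which keep their gaps after dropna() (0, 3, 4, ...), whereas a similarity matrix built from tvf_matrix is addressed by row position (0 to 10,497). The sketch below uses a hypothetical toy frame to show the difference, along with one possible way to build a title-to-position lookup that also keeps only the first entry for a repeated title. The names toy, labels, and positions are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for books_data.
toy = pd.DataFrame({
    'title': ['book a', 'book b', 'book c', 'book a'],
    'summary': ['first', None, 'third', 'fourth'],
}).dropna()

# Index labels keep their original values after dropna().
labels = list(toy.index)

# Map each title to a row position (0..n-1), keeping the first entry per title.
positions = pd.Series(np.arange(len(toy)), index=toy['title'])
positions = positions[~positions.index.duplicated(keep='first')]

print(labels)
print(positions.to_dict())
```

### Series.drop_duplicates() removes repeated values, not repeated titles, which is why the sketch filters on the index with duplicated() instead.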

In [20]:
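def give_recom(title, sig=sig):
    # Look up the index label for the title; a duplicated title returns a Series, so keep its first match.
    label = indices[title]
    if isinstance(label, pd.Series):
        label = label.iloc[0]
    # Convert the label to a row position: sig is addressed by position, and labels have gaps after dropna().
    idx = books_data.index.get_loc(label)
    # Pair every book position with its similarity score to the given title.
    sig_score = list(enumerate(sig[idx]))
    # Sort the books by score, highest first.
    sig_score = sorted(sig_score, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar books, skipping the first entry (the book itself).
    sig_score = sig_score[1:11]
    book_indices = [i[0] for i in sig_score]
    # Write the recommended row positions to an external file.
    with open('content_op.txt', 'w') as output_file:
        output_file.write("\n".join(str(x) for x in book_indices))
    # Return the titles of the recommended books.
    return books_data['title'].iloc[book_indices]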

In [21]:
# Finally, return the unique titles among the top recommendations.
give_recom('my sunshine away').unique()

array(['shakespeares romeo and juliet', 'getting over it',
       'a matter of honor', 'eeny meeny helen grace',
       'the sweet potato queens book of love', 'wait till helen comes',
       'smooth talking stranger travises', 'say what you will',
       'starcrossed starcrossed'], dtype=object)
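
### The ranking step can also be written with NumPy instead of sorting a list of tuples. The sketch below is one possible variant, not the notebook's own method: top_k_similar is a hypothetical helper, and toy_scores stands in for one row of sig. It removes the query item explicitly instead of assuming it ranks first.

```python
import numpy as np

def top_k_similar(scores, idx, k=10):
    # Hypothetical helper: positions of the k highest scores, excluding idx itself.
    order = np.argsort(scores)[::-1]  # highest score first
    order = order[order != idx]       # drop the query item
    return order[:k]

# Toy similarity row for a query item at position 0.
toy_scores = np.array([0.90, 0.20, 0.75, 0.60, 0.80])
top_three = top_k_similar(toy_scores, idx=0, k=3)
print(top_three)
```

### The returned positions can be passed to books_data['title'].iloc[...] in the same way as book_indices.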