#**RECOMMENDATION SYSTEM BASED ON CONTENT**

A content-based recommendation system suggests items by comparing item descriptions with the content of items a user has already interacted with. This notebook uses a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to make that comparison.
**TF-IDF** transforms text into numerical vectors whose weights reflect how important each word is to a document relative to the entire dataset. The system then computes the cosine similarity between these vectors and recommends the items most similar to those the user has interacted with previously.

**TF-IDF Vectorizer** stands for **Term Frequency-Inverse Document Frequency Vectorizer**.

TF-IDF is a numerical statistic used to evaluate the importance of a word in a document relative to a collection of documents (or corpus); the vectorizer applies it to every word in every document. It is widely used in information retrieval and text mining, as it transforms text data into a numerical format that machine learning algorithms can work with.

 **Term Frequency (TF)**

*This measures how frequently a term appears in a document. It’s calculated by dividing the number of times a term appears in a document by the total number of terms in that document. Higher frequencies indicate that the term is more important in that particular document.*

 **Inverse Document Frequency (IDF)**

 *This measures how important a term is across the entire corpus. It’s calculated by taking the logarithm of the total number of documents divided by the number of documents containing the term. This helps reduce the weight of common terms (like "the" or "and") that appear in many documents*.

**Combining TF and IDF**

*The TF-IDF score is the product of TF and IDF. This means that a term will have a high score if it appears frequently in a specific document but is rare across the entire corpus, highlighting its significance*.
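
As a minimal sketch of the three definitions above, the following computes the textbook TF, IDF and TF-IDF values for a tiny corpus. The three documents and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are made up for illustration and are not part of the Netflix data. scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2-normalizes each row by default, so its numbers will differ from this plain formula.

```python
import math

# Toy corpus (hypothetical, for illustration only)
docs = [
    "a detective hunts a killer",
    "a chef opens a restaurant",
    "a detective opens a restaurant",
]
tokenized = [doc.split() for doc in docs]

def term_frequency(term, tokens):
    # Share of the document's tokens that are `term`
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    # log(total documents / documents containing the term)
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    # Product of the two factors
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Scores for three terms of the first document
for term in ["a", "detective", "killer"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

In this toy corpus "a" appears in every document, so its IDF is log(1) = 0 and its score is 0, while "killer" appears in only one document and gets the highest score of the three.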



In [30]:
# Importing libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [3]:
# Loading Dataset

df = pd.read_csv('netflix_titles.csv')

#The Dataset

In [4]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


#The number of rows and columns in the dataset.

In [6]:
print('Number of rows: ', df.shape[0])
print('Number of columns: ', df.shape[1])

Number of rows:  8807
Number of columns:  12


#Information about the dataset

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


#Display the summary statistics of the dataset.

In [8]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [9]:
# Summary of only the columns that contain object (string) values

df.describe(include='object')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8803,8804,8807,8807
unique,8807,2,8807,4528,7692,748,1767,17,220,514,8775
top,s1,Movie,Dick Johnson Is Dead,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6131,1,19,19,2818,109,3207,1793,362,4


#Preprocess the data and handle missing values as necessary


In [20]:
df['description'].isna().sum()

0

In [21]:
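# There are no null values in the 'description' column, which is the only column used for the recommendations
# (other columns such as 'director', 'cast' and 'country' do contain nulls, as df.info() shows)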
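
The check above found nothing to fix, but if a description were missing it would need to be replaced before vectorizing, since the vectorizer expects strings. A minimal sketch on a made-up frame (`toy_df` and its values are hypothetical, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values, for illustration only
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': ['Jane Doe', np.nan, np.nan],
    'description': ['A heist goes wrong.', np.nan, 'Two friends travel.'],
})

# Count missing values per column
print(toy_df.isna().sum())

# Replace missing descriptions with an empty string so every entry is text
toy_df['description'] = toy_df['description'].fillna('')
print(toy_df['description'].isna().sum())
```

Columns that are not fed to the vectorizer, such as `director` here, can be left as they are.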

#Using cosine similarity between title descriptions to get content-based recommendations


In [22]:
# TF-IDF evaluates the importance of a word in a document relative to a collection of documents (or corpus)
# Create a TfidfVectorizer that removes English stop words

tfidf = TfidfVectorizer(stop_words='english')

In [23]:
# Fit the vectorizer on the 'description' column and transform it into a TF-IDF matrix
# Each row is a description and each column is a term from the vocabulary

tfidf_matrix = tfidf.fit_transform(df['description'])

tfidf_matrix

<8807x18895 sparse matrix of type '<class 'numpy.float64'>'
	with 121374 stored elements in Compressed Sparse Row format>

In [24]:
# Shape of tfidf_matrix

tfidf_matrix.shape

(8807, 18895)

In [25]:
# All features in tfidf

tfidf.get_feature_names_out()

array(['000', '007', '009', ..., 'łukasz', 'ōarai', 'şeref'], dtype=object)

In [26]:
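# Converting the sparse TF-IDF matrix into a dense DataFrame with one column per term
# Note: the dense array holds 8807 x 18895 float64 values (roughly 1.3 GB), so this step is memory-hungry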

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

In [27]:
# Inserting the original descriptions as the first column, named 'Description'

tfidf_df.insert(0, 'Description', df['description'])

In [28]:
#tf-idf transformed dataset

tfidf_df

Unnamed: 0,Description,000,007,009,10,100,1000,102,108,10th,...,zé,álex,álvaro,ángel,émile,ömer,über,łukasz,ōarai,şeref
0,"As her father nears the end of his life, filmm...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"After crossing paths at a party, a Cape Town t...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,To protect his family from a powerful drug lor...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Feuds, flirtations and toilet talk go down amo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,In a city of coaching centers known to train I...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,"A political cartoonist, a crime reporter and a...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8803,"While living alone in a spooky town, a young g...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8804,Looking to survive in a world taken over by zo...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8805,"Dragged from civilian life, a former superhero...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [31]:
# Using a word to find every description that contains it, along with the word's TF-IDF weight

tfidf_df.loc[tfidf_df['life'] > 0, ['Description', 'life']]

Unnamed: 0,Description,life
0,"As her father nears the end of his life, filmm...",0.125908
4,In a city of coaching centers known to train I...,0.120897
9,A woman adjusting to life after a loss contend...,0.138284
15,Students of color navigate the daily slights a...,0.123619
16,Declassified documents reveal the post-WWII li...,0.112309
...,...,...
8752,"After surviving a life-threatening accident, a...",0.139195
8762,The life of a chauffeur and part-time bootlegg...,0.141130
8763,Filmmaker John Huston narrates this Oscar-nomi...,0.118249
8769,"The lives of a middle-aged soap opera addict, ...",0.098288
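
The same lookup can be done without building the dense DataFrame, by reading a single column of the sparse matrix. This is a sketch on a small made-up frame; `toy_df`, `toy_tfidf`, `toy_matrix` and the helper `rows_containing` are hypothetical names, and the fitted vectorizer's `vocabulary_` attribute is used to map a word to its column index.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions, for illustration only
toy_df = pd.DataFrame({
    'description': [
        "A woman rebuilds her life after a loss",
        "A detective hunts a killer",
        "The life of a small-town chauffeur",
    ]
})

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_df['description'])

def rows_containing(word, vectorizer, matrix, frame):
    # Column index of the word in the fitted vocabulary (KeyError if the word is unknown)
    col = vectorizer.vocabulary_[word]
    # Extract only that column and convert it to a dense 1-D array
    weights = matrix[:, col].toarray().ravel()
    result = frame[['description']].copy()
    result[word] = weights
    # Keep the rows where the word actually occurs
    return result[result[word] > 0]

life_rows = rows_containing('life', toy_tfidf, toy_matrix, toy_df)
print(life_rows)
```

Only one column is ever made dense, so memory use stays proportional to the number of documents rather than documents times vocabulary size.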


#**Cosine Similarity**



**Cosine Similarity is a measure of the similarity between two vectors of an inner product space.**

For two vectors, A and B, the Cosine Similarity is calculated as:

$$\text{Cosine Similarity}(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$
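
A minimal NumPy sketch of this formula, using made-up vectors; the function name `cosine_similarity_manual` is an illustrative choice, not a library function:

```python
import numpy as np

def cosine_similarity_manual(a, b):
    # Dot product divided by the product of the Euclidean norms
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice the length
c = [0.0, 0.0, 3.0]   # shares no non-zero component with a

print(cosine_similarity_manual(a, b))
print(cosine_similarity_manual(a, c))
```

Vectors pointing in the same direction score 1 (up to floating-point rounding) regardless of their length, and vectors with no overlapping components score 0.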

#**Linear kernel**
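Linear kernel computes the dot product of two vectors. `TfidfVectorizer` L2-normalizes each document vector by default (`norm='l2'`), so every vector has length 1, the denominator in the cosine similarity formula equals 1, and the dot product is the cosine similarity itself. This is why `linear_kernel` can be used here to measure how closely the content of two descriptions aligns, based on the importance of words in each description relative to the entire corpus.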
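
The equivalence can be checked on a small example. This is a sketch with made-up documents (`toy_docs` is hypothetical) that compares `linear_kernel` with scikit-learn's `cosine_similarity` on the same TF-IDF matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents, for illustration only
toy_docs = [
    "a detective hunts a serial killer",
    "a chef opens a new restaurant",
    "a detective opens a restaurant",
]

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Euclidean length of each row; the default norm='l2' makes these all 1
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)

# Plain dot products versus explicitly normalized cosine similarity
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two results would no longer match, and `cosine_similarity` would be the one to use.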

In [32]:
# Checking the cosine similarity between every pair of rows (descriptions) of the matrix using linear_kernel

linear_kernel(tfidf_matrix, tfidf_matrix)

array([[1.        , 0.        , 0.        , ..., 0.        , 0.01538292,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.02230089],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.01538292, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.02230089, ..., 0.        , 0.        ,
        1.        ]])

In [34]:
# Compute the cosine similarity between every pair of descriptions
# Storing the linear_kernel result in a variable

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [38]:
# Forming a DataFrame of similarity scores computed by linear_kernel, with titles as both the index and the columns

similar_score = pd.DataFrame(cosine_sim, columns=df['title'], index=df['title'])

In [40]:
# Given a title, list all other titles ranked by how similar their descriptions are, with the similarity score
# Sorting the values in descending order puts the most similar titles first; iloc[1:] skips the first entry, which is the title itself

similar_score['Jailbirds New Orleans'].sort_values(ascending=False).iloc[1: ]

title,Jailbirds New Orleans
Project Power,0.183325
The Prince,0.179163
The Originals,0.177901
The Runner,0.174703
The Princess and the Frog,0.172572
...,...
The Road to Love,0.000000
The Last O.G.,0.000000
V Wars,0.000000
The Irishman: In Conversation,0.000000


In [42]:
# Implementing a content-based recommendation engine that suggests the most similar titles for a given title

def get_recommendations(title, number):
    """Return the `number` titles most similar to `title`, with their similarity scores."""
    # iloc[1:number + 1] skips the first entry, which is the title itself (score 1.0)
    return similar_score[title].sort_values(ascending=False).iloc[1:number + 1]

In [43]:
# To get the top 15 recommended titles similar to the given title

get_recommendations('Zubaan', 15)

title,Zubaan
"Kalel, 15",0.16177
Kipo and the Age of Wonderbeasts,0.158786
Krish Trish and Baltiboy: Face Your Fears,0.156511
Little Evil,0.154387
The Rainmaker,0.151522
My Own Man,0.149373
Agatha Christie's Crooked House,0.142009
Larva,0.136312
Babamın Ceketi,0.133303
A Flying Jatt,0.129314


In [44]:
# To get the top 10 recommended titles similar to the given title

get_recommendations('Kota Factory', 10)

title,Kota Factory
Drishyam,0.130063
The Creative Indians,0.125887
The Bridge Curse,0.123769
She's Dating the Gangster,0.123101
Racket Boys,0.119272
Code 8,0.117941
Girl's Revenge,0.117512
The Bye Bye Man,0.116211
Train of the Dead,0.111841
The Politician,0.110862
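
The full similarity matrix holds 8807 x 8807 float64 values (roughly 620 MB). One possible alternative, sketched here on a made-up frame, is to compute the similarity of a single title against all others only when it is requested. The names `toy_df`, `toy_matrix`, `title_to_row` and `recommend_on_demand` are hypothetical, and the sketch assumes titles are unique, as they are in this dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical titles and descriptions, for illustration only
toy_df = pd.DataFrame({
    'title': ['Cold Case', 'Kitchen Dreams', 'Detective Chef', 'Night Killer'],
    'description': [
        "A detective hunts a serial killer in a small town",
        "A young chef opens a restaurant in a small town",
        "A retired detective opens a restaurant",
        "A killer stalks a detective at night",
    ],
})

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_df['description'])

# Map each title to its row position in the TF-IDF matrix
title_to_row = pd.Series(toy_df.index, index=toy_df['title'])

def recommend_on_demand(title, number, matrix=toy_matrix, frame=toy_df):
    if title not in title_to_row.index:
        raise KeyError(f"Unknown title: {title!r}")
    row = title_to_row[title]
    # Similarity of one row against every row: shape (1, n_documents)
    scores = linear_kernel(matrix[row], matrix).ravel()
    result = pd.Series(scores, index=frame['title'])
    # Drop the query title by name instead of relying on its sort position
    return result.drop(title).sort_values(ascending=False).head(number)

print(recommend_on_demand('Cold Case', 2))
```

Dropping the query title by name also avoids a subtle issue with `iloc[1:]`: if another title had an identical description, the requested title would not be guaranteed to sort first.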
