<a href="https://colab.research.google.com/github/Bhandari007/recommendation_system/blob/main/content_based_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content-Based Filtering

In this notebook, we will implement content-based filtering using a neural network to build a recommender system.

# Packages

I will use the following packages: NumPy, pandas, TensorFlow, and scikit-learn.

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

# Dataset

The dataset is available on Kaggle.

Download and unzip the dataset:

In [3]:
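import zipfile

# Download the zipped CSV; -O sets a clean local filename and the quotes protect the "?" in the URL
! wget -O netflix_titles.csv.zip "https://github.com/Bhandari007/recommendation_system/blob/main/netflix_titles.csv.zip?raw=true"

# Extract the archive into the current directory
with zipfile.ZipFile("netflix_titles.csv.zip") as zipfolder:
    zipfolder.extractall()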

--2022-10-15 10:51:01--  https://github.com/Bhandari007/recommendation_system/blob/main/netflix_titles.csv.zip?raw=true
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/Bhandari007/recommendation_system/raw/main/netflix_titles.csv.zip [following]
--2022-10-15 10:51:01--  https://github.com/Bhandari007/recommendation_system/raw/main/netflix_titles.csv.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Bhandari007/recommendation_system/main/netflix_titles.csv.zip [following]
--2022-10-15 10:51:01--  https://raw.githubusercontent.com/Bhandari007/recommendation_system/main/netflix_titles.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent

##  Data Cleaning

In [4]:
df = pd.read_csv("netflix_titles.csv",parse_dates = ["date_added","release_year"])
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019-01-01,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,2016-09-09,2016-01-01,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,2018-09-08,2013-01-01,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,2018-09-08,2016-01-01,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017-09-08,2017-01-01,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [5]:
# Check shape
df.shape

(6234, 12)

The dataset has 6,234 rows and 12 columns.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       6234 non-null   int64         
 1   type          6234 non-null   object        
 2   title         6234 non-null   object        
 3   director      4265 non-null   object        
 4   cast          5664 non-null   object        
 5   country       5758 non-null   object        
 6   date_added    6223 non-null   datetime64[ns]
 7   release_year  6234 non-null   datetime64[ns]
 8   rating        6224 non-null   object        
 9   duration      6234 non-null   object        
 10  listed_in     6234 non-null   object        
 11  description   6234 non-null   object        
dtypes: datetime64[ns](2), int64(1), object(9)
memory usage: 584.6+ KB


Five columns contain null values: `director`, `cast`, `country`, `date_added`, and `rating`.

For simplicity we will remove the rows that have missing values.
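As a minimal sketch of this step, the snippet below counts missing values per column and then drops incomplete rows. The `toy` frame and its values are made up for illustration and are not part of the Netflix data.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Z", None],
    "country": ["US", "UK", None, "IN"],
})

# Count missing values per column before deciding what to drop.
null_counts = toy.isna().sum()

# Drop every row that has at least one missing value.
cleaned = toy.dropna(axis=0)

print(null_counts)
print(cleaned)
```

Dropping rows is the simplest option, but it discards every row with even one gap, which is why the row count falls sharply after this step.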

### Removing missing rows

In [7]:
df = df.dropna(axis=0)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774 entries, 0 to 6213
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       3774 non-null   int64         
 1   type          3774 non-null   object        
 2   title         3774 non-null   object        
 3   director      3774 non-null   object        
 4   cast          3774 non-null   object        
 5   country       3774 non-null   object        
 6   date_added    3774 non-null   datetime64[ns]
 7   release_year  3774 non-null   datetime64[ns]
 8   rating        3774 non-null   object        
 9   duration      3774 non-null   object        
 10  listed_in     3774 non-null   object        
 11  description   3774 non-null   object        
dtypes: datetime64[ns](2), int64(1), object(9)
memory usage: 383.3+ KB


### Feature Engineering

* Extract usable features from the datetime columns
  * Convert `release_year` to an integer year
  * Extract the day and month from the `date_added` column

In [9]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(
    release_year=df["release_year"].dt.year,
    day_added=df["date_added"].dt.day,
    month_added=df["date_added"].dt.month,
)


In [10]:
# Now we can drop "date_added" column
df = df.drop(columns = ["date_added"])

In [11]:
# sanity check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774 entries, 0 to 6213
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       3774 non-null   int64 
 1   type          3774 non-null   object
 2   title         3774 non-null   object
 3   director      3774 non-null   object
 4   cast          3774 non-null   object
 5   country       3774 non-null   object
 6   release_year  3774 non-null   int64 
 7   rating        3774 non-null   object
 8   duration      3774 non-null   object
 9   listed_in     3774 non-null   object
 10  description   3774 non-null   object
 11  day_added     3774 non-null   int64 
 12  month_added   3774 non-null   int64 
dtypes: int64(4), object(9)
memory usage: 412.8+ KB


In [12]:
df = df.set_index("show_id")
df.head()

show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,description,day_added,month_added
81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,9,9
80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,8,9
70304989,Movie,Automata,Gabe Ibáñez,"Antonio Banderas, Dylan McDermott, Melanie Gri...","Bulgaria, United States, Spain, Canada",2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f...",8,9
80164077,Movie,Fabrizio Copano: Solo pienso en mi,"Rodrigo Toro, Francisco Schultz",Fabrizio Copano,Chile,2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...,8,9
70304990,Movie,Good People,Henrik Ruben Genz,"James Franco, Kate Hudson, Tom Wilkinson, Omar...","United States, United Kingdom, Denmark, Sweden",2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...,8,9


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774 entries, 81145628 to 80126599
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          3774 non-null   object
 1   title         3774 non-null   object
 2   director      3774 non-null   object
 3   cast          3774 non-null   object
 4   country       3774 non-null   object
 5   release_year  3774 non-null   int64 
 6   rating        3774 non-null   object
 7   duration      3774 non-null   object
 8   listed_in     3774 non-null   object
 9   description   3774 non-null   object
 10  day_added     3774 non-null   int64 
 11  month_added   3774 non-null   int64 
dtypes: int64(3), object(9)
memory usage: 383.3+ KB


## Feature preprocessing
* Convert `type, title, director, cast, country, rating, listed_in` into embeddings.
* Process the `description` column.

## Categorical Features into embeddings

### Defining the vocabulary

In [14]:
type_lookup = tf.keras.layers.StringLookup()
title_lookup = tf.keras.layers.StringLookup()
director_lookup = tf.keras.layers.StringLookup()
cast_lookup = tf.keras.layers.StringLookup()
country_lookup = tf.keras.layers.StringLookup()
rating_lookup = tf.keras.layers.StringLookup()
listed_in_lookup = tf.keras.layers.StringLookup()

In [15]:
# Learn each vocabulary directly from the values of its own column
type_lookup.adapt(df["type"].to_numpy())
title_lookup.adapt(df["title"].to_numpy())
director_lookup.adapt(df["director"].to_numpy())
cast_lookup.adapt(df["cast"].to_numpy())
country_lookup.adapt(df["country"].to_numpy())
rating_lookup.adapt(df["rating"].to_numpy())
listed_in_lookup.adapt(df["listed_in"].to_numpy())
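To make the effect of `adapt` concrete, here is a small sketch on a hand-written array of ratings (the `ratings` values are illustrative, not taken from the dataset). It shows that the learned vocabulary reserves index 0 for out-of-vocabulary tokens and orders the remaining tokens by frequency.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sample of rating strings.
ratings = np.array(["TV-MA", "TV-14", "TV-MA", "R"])

lookup = tf.keras.layers.StringLookup()
lookup.adapt(ratings)

# "[UNK]" comes first, followed by tokens in descending frequency.
vocab = lookup.get_vocabulary()

# A known token maps to its vocabulary index; an unseen one maps to 0.
ids = lookup(tf.constant(["TV-MA", "PG-13"])).numpy()

print(vocab)
print(ids)
```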

### Define the embeddings

In [16]:
type_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = type_lookup.vocabulary_size(),
    output_dim = 32
)

title_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = title_lookup.vocabulary_size(),
    output_dim = 32
)

director_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = director_lookup.vocabulary_size(),
    output_dim = 32
)

cast_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = cast_lookup.vocabulary_size(),
    output_dim = 32
)


country_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = country_lookup.vocabulary_size(),
    output_dim = 32
)

rating_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = rating_lookup.vocabulary_size(),
    output_dim = 32
)

listed_in_lookup_embedding = tf.keras.layers.Embedding(
    input_dim = listed_in_lookup.vocabulary_size(),
    output_dim = 32
)

### Combining the vocabulary and embedding layers

In [17]:
type_lookup_model = tf.keras.Sequential([
    type_lookup,
    type_lookup_embedding
])

title_lookup_model = tf.keras.Sequential([
    title_lookup,
    title_lookup_embedding
])

director_lookup_model = tf.keras.Sequential([
    director_lookup,
    director_lookup_embedding
])

cast_lookup_model = tf.keras.Sequential([
    cast_lookup,
    cast_lookup_embedding
])

country_lookup_model = tf.keras.Sequential([
    country_lookup,
    country_lookup_embedding
])


rating_lookup_model = tf.keras.Sequential([
    rating_lookup,
    rating_lookup_embedding
])


listed_in_lookup_model = tf.keras.Sequential([
    listed_in_lookup,
    listed_in_lookup_embedding
])
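The following sketch shows what one lookup-plus-embedding pair computes, calling the two layers one after the other on a toy batch. The fixed vocabulary and the `batch` values are assumptions chosen for illustration; the layers above learn their vocabularies with `adapt` instead.

```python
import tensorflow as tf

# Fixed two-token vocabulary (assumption for this sketch).
lookup = tf.keras.layers.StringLookup(vocabulary=["Movie", "TV Show"])
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(),  # 2 tokens + 1 out-of-vocabulary slot
    output_dim=32,
)

batch = tf.constant(["Movie", "TV Show", "Movie"])
token_ids = lookup(batch)        # integer ids, shape (3,)
vectors = embedding(token_ids)   # dense vectors, shape (3, 32)

print(token_ids.numpy())
print(vectors.shape)
```

Each string becomes one 32-dimensional vector, and identical strings share the same vector because they map to the same row of the embedding table.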

### Processing `description` column
The following steps can be used to process the `description` feature:

i. Tokenization

ii. Vocabulary Learning

iii. Embedding
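The three steps can be sketched on toy text as follows. The sample sentences, `max_tokens=1000`, `output_sequence_length=8`, and the average-pooling step are assumptions made for this illustration rather than choices from the notebook. Padding positions are included in the average here, so masking would be a possible refinement.

```python
import numpy as np
import tensorflow as tf

# Hypothetical descriptions standing in for the real column.
descriptions = np.array([
    "a struggling couple finds a bag of cash",
    "a nerdy high schooler attracts her crush",
])

# Steps i and ii: tokenize the text and learn the vocabulary.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000, output_sequence_length=8
)
vectorizer.adapt(descriptions)

# Step iii: embed each token id, then average to one vector per description.
embedding = tf.keras.layers.Embedding(
    input_dim=vectorizer.vocabulary_size(), output_dim=32
)
pooling = tf.keras.layers.GlobalAveragePooling1D()

token_ids = vectorizer(tf.constant(descriptions))  # shape (2, 8)
token_vectors = embedding(token_ids)               # shape (2, 8, 32)
description_vectors = pooling(token_vectors)       # shape (2, 32)

print(token_ids.numpy())
print(description_vectors.shape)
```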

In [85]:
# Tokenize the descriptions and learn the vocabulary from the `description` column
description_text = tf.keras.layers.TextVectorization()
description_text.adapt(df["description"].to_numpy())

Reference: https://www.tensorflow.org/recommenders/examples/featurization