<a href="https://colab.research.google.com/github/Bhandari007/recommendation_system/blob/main/content_based_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content-Based Filtering

In this notebook, we will implement content-based filtering using a neural network to build a recommender system.

# Packages

We will use the following packages: NumPy, pandas, TensorFlow (with Keras), and scikit-learn.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

# Dataset

The dataset is available on the Kaggle website.

Download and unzip the dataset:

In [None]:
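import zipfile

# Quote the URL so the shell does not interpret "?", and save under a clean file name
!wget -O netflix_titles.csv.zip "https://github.com/Bhandari007/recommendation_system/blob/main/netflix_titles.csv.zip?raw=true"

# The context manager closes the archive even if extraction fails
with zipfile.ZipFile("netflix_titles.csv.zip") as zip_file:
    zip_file.extractall()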

--2022-10-15 04:56:20--  https://github.com/Bhandari007/recommendation_system/blob/main/netflix_titles.csv.zip?raw=true
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/Bhandari007/recommendation_system/raw/main/netflix_titles.csv.zip [following]
--2022-10-15 04:56:20--  https://github.com/Bhandari007/recommendation_system/raw/main/netflix_titles.csv.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Bhandari007/recommendation_system/main/netflix_titles.csv.zip [following]
--2022-10-15 04:56:21--  https://raw.githubusercontent.com/Bhandari007/recommendation_system/main/netflix_titles.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent

## Data Cleaning

In [None]:
# Parse both date columns as datetimes; release_year becomes January 1 of that year
df = pd.read_csv("netflix_titles.csv", parse_dates=["date_added", "release_year"])
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019-01-01,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,2016-09-09,2016-01-01,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,2018-09-08,2013-01-01,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,2018-09-08,2016-01-01,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017-09-08,2017-01-01,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [None]:
# Check shape
df.shape

(6234, 12)

The dataset has 6,234 rows and 12 columns.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       6234 non-null   int64         
 1   type          6234 non-null   object        
 2   title         6234 non-null   object        
 3   director      4265 non-null   object        
 4   cast          5664 non-null   object        
 5   country       5758 non-null   object        
 6   date_added    6223 non-null   datetime64[ns]
 7   release_year  6234 non-null   datetime64[ns]
 8   rating        6224 non-null   object        
 9   duration      6234 non-null   object        
 10  listed_in     6234 non-null   object        
 11  description   6234 non-null   object        
dtypes: datetime64[ns](2), int64(1), object(9)
memory usage: 584.6+ KB


Five columns contain null values: `director`, `cast`, `country`, `date_added`, and `rating`.

For simplicity, we will remove every row that has at least one missing value.
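Before dropping rows, it can help to see how many values each column is missing and how many rows will survive. The following is a minimal sketch on a toy DataFrame (the `toy` frame and its values are made up for illustration, not taken from the Netflix data):

```python
import pandas as pd

# Hypothetical toy data: two missing directors, one missing rating.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "rating": ["R", "PG", None, "TV-MA"],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])

# dropna(axis=0) removes every row containing at least one missing value.
cleaned = toy.dropna(axis=0)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Dropping rows this way is simple but can discard a large share of the data when the gaps are spread across several columns, as the `df.info()` output above suggests for `director`.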

### Removing missing rows

In [None]:
df = df.dropna(axis=0)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774 entries, 0 to 6213
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       3774 non-null   int64         
 1   type          3774 non-null   object        
 2   title         3774 non-null   object        
 3   director      3774 non-null   object        
 4   cast          3774 non-null   object        
 5   country       3774 non-null   object        
 6   date_added    3774 non-null   datetime64[ns]
 7   release_year  3774 non-null   datetime64[ns]
 8   rating        3774 non-null   object        
 9   duration      3774 non-null   object        
 10  listed_in     3774 non-null   object        
 11  description   3774 non-null   object        
dtypes: datetime64[ns](2), int64(1), object(9)
memory usage: 383.3+ KB


### Feature Engineering

* Extract numeric features from the datetime columns
  * Extract the year from `release_year`, and the day and month from `date_added`

In [None]:
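# Keep only the year of release as an integer
df["release_year"] = df["release_year"].dt.year
# Split the date a title was added into separate day and month columns
df["day_added"] = df["date_added"].dt.day
df["month_added"] = df["date_added"].dt.month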

In [None]:
# Now we can drop the "date_added" column
df = df.drop(columns=["date_added"])

In [None]:
# sanity check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774 entries, 0 to 6213
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       3774 non-null   int64 
 1   type          3774 non-null   object
 2   title         3774 non-null   object
 3   director      3774 non-null   object
 4   cast          3774 non-null   object
 5   country       3774 non-null   object
 6   release_year  3774 non-null   int64 
 7   rating        3774 non-null   object
 8   duration      3774 non-null   object
 9   listed_in     3774 non-null   object
 10  description   3774 non-null   object
 11  day_added     3774 non-null   int64 
 12  month_added   3774 non-null   int64 
dtypes: int64(4), object(9)
memory usage: 412.8+ KB


### Exploring data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = df.set_index("show_id")
df.head()

show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,description,day_added,month_added
81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,9,9
80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,8,9
70304989,Movie,Automata,Gabe Ibáñez,"Antonio Banderas, Dylan McDermott, Melanie Gri...","Bulgaria, United States, Spain, Canada",2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f...",8,9
80164077,Movie,Fabrizio Copano: Solo pienso en mi,"Rodrigo Toro, Francisco Schultz",Fabrizio Copano,Chile,2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...,8,9
70304990,Movie,Good People,Henrik Ruben Genz,"James Franco, Kate Hudson, Tom Wilkinson, Omar...","United States, United Kingdom, Denmark, Sweden",2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...,8,9


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774 entries, 81145628 to 80126599
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          3774 non-null   object
 1   title         3774 non-null   object
 2   director      3774 non-null   object
 3   cast          3774 non-null   object
 4   country       3774 non-null   object
 5   release_year  3774 non-null   int64 
 6   rating        3774 non-null   object
 7   duration      3774 non-null   object
 8   listed_in     3774 non-null   object
 9   description   3774 non-null   object
 10  day_added     3774 non-null   int64 
 11  month_added   3774 non-null   int64 
dtypes: int64(3), object(9)
memory usage: 383.3+ KB


In [None]:
df["rating"].value_counts()

TV-MA       1189
TV-14        917
R            501
TV-PG        358
PG-13        278
PG           176
NR           175
TV-G          54
TV-Y7         48
G             35
TV-Y          24
TV-Y7-FV      11
UR             7
NC-17          1
Name: rating, dtype: int64

* Implement a TF-IDF model on the `description` column
* Convert columns of object dtype to a numeric dtype
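For the second step, here is a minimal sketch of three common ways to turn object columns into numbers with pandas. The `toy` frame and the new column names (`duration_value`, `rating_code`) are hypothetical, and this is one possible approach rather than the one the notebook necessarily uses:

```python
import pandas as pd

# Hypothetical toy rows shaped like the Netflix columns.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "1 Season", "110 min"],
    "rating": ["TV-PG", "TV-Y7", "R"],
})

# Pull the leading integer out of strings such as "90 min" or "1 Season".
toy["duration_value"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Map each distinct rating to an integer code (codes follow sorted category order).
toy["rating_code"] = toy["rating"].astype("category").cat.codes

# One-hot encode the `type` column into one indicator column per value.
type_dummies = pd.get_dummies(toy["type"], prefix="type")

print(toy[["duration_value", "rating_code"]])
print(type_dummies)
```

Note that `duration` mixes minutes (movies) and seasons (TV shows), so the extracted number is only comparable within one `type`.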

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["description"])
print(X)

  (0, 4275)	0.1867816340650283
  (0, 699)	0.29480588312719125
  (0, 3890)	0.20070947068489256
  (0, 4507)	0.11310219986430502
  (0, 776)	0.32083740889280393
  (0, 10583)	0.24046604039042407
  (0, 968)	0.1754523136553007
  (0, 10910)	0.16676036802855285
  (0, 7321)	0.15184456774912578
  (0, 6152)	0.23078312557787206
  (0, 1108)	0.2737892026716545
  (0, 8318)	0.28642560363045627
  (0, 4825)	0.26039407786484364
  (0, 5250)	0.07738944949308178
  (0, 4385)	0.10073234050075984
  (0, 12083)	0.21228789877708604
  (0, 947)	0.3056099424833639
  (0, 555)	0.18727264539924882
  (0, 8255)	0.27957841671775124
  (0, 1143)	0.19187211999106488
  (1, 1873)	0.22043542019807608
  (1, 6903)	0.22043542019807608
  (1, 10255)	0.19839629690245525
  (1, 3899)	0.1830595013126377
  (1, 4989)	0.30414942167517073
  :	:
  (3772, 6699)	0.17171581822571352
  (3772, 1171)	0.23592658802272942
  (3772, 566)	0.11279845940206325
  (3772, 5571)	0.06304426158811641
  (3772, 7679)	0.060537720065986934
  (3772, 5208)	0.15252695
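Each printed line above has the form `(row, column) weight`: the row is the index of a description, the column is the index of a term in the learned vocabulary, and the weight is that term's TF-IDF score in that description. To make the weights less opaque, here is a hand-rolled sketch in plain Python. It assumes the vectorizer's default settings as we understand them (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the toy `docs` are made up, and tokenisation here is a plain whitespace split rather than scikit-learn's tokenizer:

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the description column.
docs = [
    "robot saves the city",
    "chef saves the restaurant",
    "the robot cooks",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {
    term: math.log((1 + n_docs) / (1 + df)) + 1
    for term, df in doc_freq.items()
}

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalised."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for i, row in enumerate(rows):
    for term, weight in sorted(row.items()):
        print(i, term, round(weight, 4))
```

A term that occurs in every document, such as "the", gets the smallest IDF and therefore a lower weight than a term unique to one document, such as "city".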

In [None]:
import tensorflow as tf

In [None]:
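model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),                        # leaves 2-D (batch, features) input unchanged
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer with 128 units
    tf.keras.layers.Dense(10, activation="softmax")   # 10 output probabilities per sample
])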
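To see what this architecture computes, the forward pass can be sketched in NumPy. This is an illustration of the shapes only: the weights are random rather than trained, the batch is random data, and the feature count of 12451 is assumed to match the number of TF-IDF columns reported for `X`.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, n_features = 4, 12451  # assumed: one feature per TF-IDF column
x = rng.random((batch_size, n_features))

# Random, untrained parameters with the shapes the two Dense layers would have.
w1 = rng.normal(scale=0.01, size=(n_features, 128))
b1 = np.zeros(128)
w2 = rng.normal(scale=0.01, size=(128, 10))
b2 = np.zeros(10)

# Dense(128, relu): affine transform followed by max(0, .)
hidden = np.maximum(x @ w1 + b1, 0.0)

# Dense(10, softmax): affine transform, then normalise each row to sum to 1.
logits = hidden @ w2 + b2
logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(hidden.shape, probs.shape)
```

Each row of `probs` is a probability distribution over 10 classes, which is what the softmax output layer produces for each input sample.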

In [None]:
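# get_stop_words() returns None here because no stop_words argument was passed
words = vectorizer.get_stop_words()
X  # display a summary of the TF-IDF matrix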

<3774x12451 sparse matrix of type '<class 'numpy.float64'>'
	with 81732 stored elements in Compressed Sparse Row format>
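The sparse-matrix summary above gives enough information to estimate how sparse the TF-IDF features are and how much memory a dense copy would need, which matters if the matrix is later converted to a dense array for the neural network. A small back-of-the-envelope sketch, using only the numbers printed above and assuming 8 bytes per `float64` value:

```python
# Figures taken from the sparse-matrix summary above.
n_rows, n_cols, stored = 3774, 12451, 81732

total_cells = n_rows * n_cols
density = stored / total_cells          # fraction of non-zero entries
avg_terms_per_doc = stored / n_rows     # distinct vocabulary terms per description
dense_megabytes = total_cells * 8 / 1e6 # float64 uses 8 bytes per value

print(f"total cells:        {total_cells:,}")
print(f"density:            {density:.4%}")
print(f"avg terms per doc:  {avg_terms_per_doc:.1f}")
print(f"dense size (MB):    {dense_megabytes:.1f}")
```

Well under 1% of the cells are non-zero, so keeping the matrix sparse for as long as possible, or reducing the vocabulary first, is usually worth considering.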