#### Lab 01 - Content Based Recommender (Film Genre)

Maintain User Engagement by Generating Recommendations from Film Genre

#### What Will We Perform?

We'll Feature Engineer the Genre Column

Apply a TF-IDF Vectorizer to Genre, Then Compute Cosine Similarity on the Result

Create a Model to Generate Recommendations

Tackle the Cold Start Problem by Asking Users for Their Genre Preferences

Our Subject? Michael

In [1]:
import pandas, warnings

warnings.filterwarnings("ignore")

In [2]:
# Load the Movies Table into origin

origin = pandas.read_table("movies.csv", sep=",")

origin.iloc[:5]

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
origin.tail()

Unnamed: 0,MovieID,Title,Genre
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [4]:
null = origin.isnull().sum()

null.sum()

0

In [5]:
total = origin.shape[0]

total

9742

In [6]:
total_film = origin["Title"].nunique()

total_film

9737

In [7]:
total == total_film

False

In [8]:
origin = origin.drop_duplicates("Title")

assert origin.shape[0] == origin["Title"].nunique(), "Fail !"

#### Feature Engineer on Genre

In [9]:
# Change Genre Column Separator

splita = lambda val : " ".join(val.split("|"))

origin["Genre"] = origin["Genre"].apply(splita)

In [10]:
# Find List of Genres

TotalGenre = []

for value in origin["Genre"].values:

  value = value.split(" ")

  for val in value:

    if val not in TotalGenre: TotalGenre.append(val)

len(TotalGenre)

22

In [11]:
TotalGenre[15:]

['Documentary', 'IMAX', 'Western', 'Film-Noir', '(no', 'genres', 'listed)']

In [12]:
# Delete Films Without a Genre

nolist = "(no genres listed)"

origin = origin[origin["Genre"] != nolist]

origin.iloc[:5]

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


In [13]:
# Find List of Genres

TotalGenre = []

for insect in origin["Genre"].values:

  insect = insect.split(" ")

  for val in insect:

    if val not in TotalGenre: TotalGenre.append(val)

len(TotalGenre)

19

In [14]:
TotalGenre[:5]

['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']

Before Using `TfidfVectorizer`, We Must Replace `Sci-Fi` with `SciFi` and `Film-Noir` with `Filnoir`

Otherwise, the Default Tokenizer Splits on the Hyphen, Turning `Sci-Fi` into (`Sci` + `Fi`) and `Film-Noir` into (`Film` + `Noir`)
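The split can be reproduced without scikit-learn. The sketch below applies the regular expression that scikit-learn documents as the default `token_pattern` of its text vectorizers; the helper name `tokenize` is hypothetical, and the lowercasing step mirrors the vectorizer's default `lowercase=True`.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers:
# a token is two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    # Hypothetical helper: lowercase first, then extract tokens
    return TOKEN_PATTERN.findall(text.lower())

before = tokenize("Action Sci-Fi Film-Noir")
after = tokenize("Action SciFi Filnoir")

print(before)  # ['action', 'sci', 'fi', 'film', 'noir']
print(after)   # ['action', 'scifi', 'filnoir']
```

Without the replacement, two genres would become four vocabulary entries, and the vocabulary would no longer match the 19 genres counted earlier.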

In [15]:
# Replace Sci-Fi + Film-Noir

replacer = {"Sci-Fi":"SciFi", "Film-Noir":"Filnoir"}

for i, t in replacer.items():

  origin["Genre"] = origin["Genre"].str.replace(i, t, regex=False)

origin.iloc[:5]

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
# Converting Genres to TF IDF (Term Frequency-Inverse Document Frequency) Matrix

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english")

metrics = tfidf.fit_transform(origin["Genre"])

metrics.shape

(9703, 19)

Our TF IDF Metrics Consist of 9703 Films and 19 Genres
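To see where the weights in each row come from, here is a minimal pure-Python sketch of the calculation, assuming scikit-learn's documented defaults (smoothed IDF and L2 row normalization). It uses only the five films shown in the table above as a toy corpus, so the numbers it produces will differ from the notebook's matrix, whose document frequencies come from all 9703 films. The names `tfidf_row`, `vocab`, and `rows` are illustrative.

```python
import math
from collections import Counter

# Genre strings of the first five films, lowercased as the vectorizer does
docs = [
    "adventure animation children comedy fantasy",
    "adventure children fantasy",
    "comedy romance",
    "comedy drama romance",
    "comedy",
]

tokens = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokens for term in doc})

# Document frequency: number of films that contain each genre
df = Counter(term for doc in tokens for term in set(doc))
n = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    # Term count times IDF, then scale the row to unit L2 length
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

rows = [tfidf_row(doc) for doc in tokens]

for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 4))
```

Within a single film, a rare genre such as `animation` receives a larger weight than a common one such as `comedy`, which is the same pattern visible in the Toy Story row of the real matrix.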

In [17]:
# Understand our TF IDF Metrics (1)

tfidf.get_feature_names_out().tolist()[:5]

['action', 'adventure', 'animation', 'children', 'comedy']

In [18]:
# Understand our TF IDF Metrics (2)

len(tfidf.get_feature_names_out().tolist())

19

In [19]:
# Understand our TF IDF Metrics (3)

assert len(tfidf.get_feature_names_out().tolist()) == len(TotalGenre), "Fail !"

In [20]:
metrics.todense()[0]

matrix([[0.        , 0.41679332, 0.51629181, 0.50489821, 0.26739237,
         0.        , 0.        , 0.        , 0.48301679, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        ]])

In [21]:
metrics.todense()[1]

matrix([[0.        , 0.51228317, 0.        , 0.62057343, 0.        ,
         0.        , 0.        , 0.        , 0.59367884, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        ]])

In [22]:
origin.loc[:1]

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy


In [23]:
# Use Cosine Similarity to Find Similarity of Our TF IDF Metrics

from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(metrics)

similarities.shape

(9703, 9703)

In [24]:
len(similarities[0])

9703

In [25]:
similarities[0][:5]

array([1.        , 0.81359947, 0.15253902, 0.13495614, 0.26739237])

In [26]:
origin.iloc[:5]

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy
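The first two similarity values can be checked by hand. The sketch below copies the non-zero weights of the Toy Story and Jumanji rows printed earlier and computes their cosine similarity in plain Python; the function name `cosine` is illustrative. Because each TF-IDF row already has unit length, the result is effectively just the dot product of the two rows.

```python
import math

# Weights copied from the two matrix rows printed above, restricted to
# columns 1, 2, 3, 4 and 8 (adventure, animation, children, comedy, fantasy).
# Every other column is zero in both rows.
toy_story = [0.41679332, 0.51629181, 0.50489821, 0.26739237, 0.48301679]
jumanji = [0.51228317, 0.0, 0.62057343, 0.0, 0.59367884]

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

score = cosine(toy_story, jumanji)
print(round(score, 4))  # 0.8136
```

This agrees with `similarities[0][1]` above. The shared genres (Adventure, Children, Fantasy) contribute to the score, while Animation and Comedy, which only Toy Story has, contribute nothing.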


In [27]:
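# Helper Functions

# Look Up a Title by Row Position
intler = lambda i : origin.iloc[i, 1]

# Look Up a Row Position by Title (Position, Not Index Label, Because similarities Is Positional)
titint = lambda t : origin.index.get_loc(origin[origin["Title"] == t].index[0])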

In [28]:
# Helper Test

title = "Jumanji (1995)"

titint(title)

1

#### Model 0

In [29]:
similar_result = enumerate(similarities[int(titint(title))])

orilist = list(similar_result)

len(orilist)

9703

In [30]:
[(i, round(t, 5)) for i, t in orilist[:5]]

[(0, 0.8136), (1, 1.0), (2, 0.0), (3, 0.0), (4, 0.0)]

In [31]:
orilist = sorted(orilist, key=lambda i : i[1], reverse=True)

[(i, round(t, 5)) for i, t in orilist[:5]]

[(1, 1.0), (53, 1.0), (109, 1.0), (767, 1.0), (1514, 1.0)]

In [32]:
origin.iloc[[i[0] for i in orilist[:5]]]

Unnamed: 0,MovieID,Title,Genre
1,2,Jumanji (1995),Adventure Children Fantasy
53,60,"Indian in the Cupboard, The (1995)",Adventure Children Fantasy
109,126,"NeverEnding Story III, The (1994)",Adventure Children Fantasy
767,1009,Escape to Witch Mountain (1975),Adventure Children Fantasy
1514,2043,Darby O'Gill and the Little People (1959),Adventure Children Fantasy
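Model 0 sorts all 9703 scores even though only the first few are used. As a possible alternative, the sketch below uses `heapq.nlargest` from the standard library to keep only the top entries. The score list and the `query` position are made-up illustrative values, not rows from the real similarity matrix.

```python
import heapq

# Toy similarity row for a query film at position 1 (illustrative values only)
scores = [0.8136, 1.0, 0.0, 0.9, 0.2674, 0.5]
query = 1

# Pair each position with its score, skipping the query film itself
candidates = [(i, s) for i, s in enumerate(scores) if i != query]

# Keep the 3 highest-scoring positions without sorting the whole list
top = heapq.nlargest(3, candidates, key=lambda pair: pair[1])

print(top)  # [(3, 0.9), (0, 0.8136), (5, 0.5)]
```

Dropping the query film before ranking also avoids returning the film as its own best match.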


#### Main Model

In [33]:
# Create Main Model Output Function

def output(title, limit):

  # Row of the Query Film in the Similarity Matrix
  source = int(titint(title))

  # Rank Every Film by Similarity to the Query, Highest First
  morate = enumerate(similarities[source])
  morter = sorted(morate, key=lambda i:i[1], reverse=True)

  # Drop the Query Film Itself, Then Keep the Top `limit` Positions
  molist = [i[0] for i in morter if i[0] != source][:limit]

  # Return the Films in Ranked Order
  return origin.iloc[molist]

title = "Jumanji (1995)"

output(title, 5)

Unnamed: 0,MovieID,Title,Genre
53,60,"Indian in the Cupboard, The (1995)",Adventure Children Fantasy
109,126,"NeverEnding Story III, The (1994)",Adventure Children Fantasy
767,1009,Escape to Witch Mountain (1975),Adventure Children Fantasy
1514,2043,Darby O'Gill and the Little People (1959),Adventure Children Fantasy
1556,2093,Return to Oz (1985),Adventure Children Fantasy


In [34]:
title = origin.iloc[3574, 1]

title

"Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)"

In [35]:
output(title, 10)

Unnamed: 0,MovieID,Title,Genre
1,2,Jumanji (1995),Adventure Children Fantasy
53,60,"Indian in the Cupboard, The (1995)",Adventure Children Fantasy
109,126,"NeverEnding Story III, The (1994)",Adventure Children Fantasy
767,1009,Escape to Witch Mountain (1975),Adventure Children Fantasy
1514,2043,Darby O'Gill and the Little People (1959),Adventure Children Fantasy
1556,2093,Return to Oz (1985),Adventure Children Fantasy
1617,2161,"NeverEnding Story, The (1984)",Adventure Children Fantasy
1618,2162,"NeverEnding Story II: The Next Chapter, The (1...",Adventure Children Fantasy
1799,2399,Santa Claus: The Movie (1985),Adventure Children Fantasy
6075,41566,"Chronicles of Narnia: The Lion, the Witch and ...",Adventure Children Fantasy


#### First Time User

Cold Start Problem

In Our Solution, the Cold Start Problem Can Be Tackled by Asking Users for Their Genre Preferences

Let's Try Tackling the Cold Start Problem for Michael, Our Customer

In [36]:
# Michael Recently Subscribed to Our Product

# Michael Chose the Children and Fantasy Genres

option = ["Children", "Fantasy"]

option

['Children', 'Fantasy']

In [37]:
# Find Films to Recommend to Michael

def starter(option, limit):

  result = []

  ontari = origin["Genre"].unique()

  for item in ontari:
    # Split Genre
    spliter = item.split(" ")

    # Find Overlap Between Genre Input and Available Genre
    overlap = len(set(option) & set(spliter)) / len(spliter)
    if overlap > 0.5:
      result.append((item, overlap))

  # Keep the `limit` Genre Combinations with the Highest Overlap, Then Select Their Films
  result = sorted(result, key=lambda i:i[1], reverse=True)[:limit]
  result = [i[0] for i in result]
  result = origin[origin["Genre"].isin(result)].iloc[:limit]

  return result

starter(option, 5)

Unnamed: 0,MovieID,Title,Genre
1,2,Jumanji (1995),Adventure Children Fantasy
53,60,"Indian in the Cupboard, The (1995)",Adventure Children Fantasy
109,126,"NeverEnding Story III, The (1994)",Adventure Children Fantasy
209,243,Gordy (1995),Children Comedy Fantasy
301,343,"Baby-Sitters Club, The (1995)",Children
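The overlap score inside `starter` is easy to check by hand. The sketch below isolates that one expression in a hypothetical helper named `overlap` and applies it to a few sample genre strings, using Michael's two chosen genres.

```python
def overlap(option, genre_string):
    # Share of the film's genres that the user picked
    genres = genre_string.split(" ")
    return len(set(option) & set(genres)) / len(genres)

option = ["Children", "Fantasy"]

samples = [
    "Children",
    "Adventure Children Fantasy",
    "Children Comedy",
    "Comedy Drama Romance",
]

for combo in samples:
    score = overlap(option, combo)
    # A combination is kept only when the score is strictly above 0.5
    print(combo, round(score, 2), score > 0.5)
```

Because the denominator is the film's own genre count, a film tagged only `Children` scores 1.0 even though it matches just one of Michael's two choices, and a combination that scores exactly 0.5 is excluded by the strict `> 0.5` test.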


Michael Chooses The Indian in the Cupboard

Let's Recommend Other Films to Michael

In [38]:
# Recommend Other Films to Michael Based on The Indian in the Cupboard

title = "Indian in the Cupboard, The (1995)"

output(title, 5)

Unnamed: 0,MovieID,Title,Genre
1,2,Jumanji (1995),Adventure Children Fantasy
109,126,"NeverEnding Story III, The (1994)",Adventure Children Fantasy
767,1009,Escape to Witch Mountain (1975),Adventure Children Fantasy
1514,2043,Darby O'Gill and the Little People (1959),Adventure Children Fantasy
1556,2093,Return to Oz (1985),Adventure Children Fantasy
