# Recommendation System Workflow

## Business Problem

>Crunchyroll and Funimation have been the leaders in the anime industry, offering the best and newest anime shows and movies on their respective streaming apps. Funimation is known for English-dubbed anime, while Crunchyroll is best known for Japanese audio with English subtitles. Earlier this year, Sony Pictures Entertainment, the company that owns Funimation, acquired Crunchyroll. After the acquisition, Sony decided to keep the Crunchyroll brand and streaming platform and slowly phase out the Funimation platform. This means Funimation's shows and movies will move to Crunchyroll's streaming platform, and Funimation users will eventually need a Crunchyroll subscription to watch their favorite titles.
>
>I see this as a great opportunity for Crunchyroll to revamp its recommendations by building a content-based recommendation system that covers both the existing Crunchyroll titles and the Funimation titles coming to the platform. The system will surface new and existing content to Crunchyroll users, and Funimation users who transition to Crunchyroll can use it to find shows that may have been available only on Crunchyroll. The goal of this project is to develop a content-based recommendation system and deploy it through a Streamlit app for Crunchyroll to use.

## About This Notebook

>The goal of this notebook is to use the data collected from JustWatch.com to create a content-based recommender system. The notebook first cleans and preprocesses the data. I will then create two recommender systems, one using cosine similarity and the other using a nearest neighbors approach, and evaluate them to decide which one to use for the final recommender system. Finally, I will create a Streamlit app around the final system that lets users enter a title, the number of recommendations, the type of content, and a genre to get similar titles recommended.
>
>This project workflow will include:
>
>1. Data Cleaning
>2. Preprocessing
>3. Recommender Systems: Cosine Similarity and Nearest Neighbors
>4. Functions for Recommender Systems
>5. Next Steps - Streamlit App

## Data Cleaning

In [10]:
# import libraries
import pandas as pd
import numpy as np
import ast

# sklearn libraries
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer

# nltk libraries
from nltk.corpus import stopwords

# API libraries
import requests
import time

# Justwatch API
from justwatch import JustWatch

### Load in the dataset

The dataset, `titles.csv`, comes from the JustWatch_scraping notebook in the root of the repository. The dataset includes information on movies and shows belonging to Funimation and Crunchyroll.

In [11]:
# read in dataset
df = pd.read_csv('./Data/titles.csv', index_col=0)

In [12]:
# take a look at the first few rows
df.head(3)

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,imdb_votes,tmdb_score,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app
0,ts20740,20740,Dragon Ball Z,/poster/8569195/{profile},Dragon Ball Z is a Japanese animated televisio...,1989.0,show,805.0,387.966,8.8,128409.0,8.286,tt0121220,26447.0,"[14, 1, 2, 3, 7, 12, 6]",TV-PG,24.0,['JP'],16,Funimation
1,ts20682,20682,Attack on Titan,/poster/174708726/{profile},"Several hundred years ago, humans were nearly ...",2013.0,show,55.0,89.689,9.0,325381.0,8.643,tt2560140,205148.0,"[1, 14, 2, 6, 7, 9]",TV-MA,24.0,['JP'],4,Funimation
2,ts21560,21560,Dragon Ball,/poster/290552685/{profile},"Long ago in the mountains, a fighting master k...",1986.0,show,1936.0,15.964,8.6,56606.0,8.218,tt0088509,210469.0,"[2, 1, 3, 7, 14, 12]",TV-14,24.0,['JP'],10,Funimation


In [13]:
# info of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1941 entries, 0 to 0
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   jw_entity_id          1941 non-null   object 
 1   id                    1941 non-null   int64  
 2   title                 1941 non-null   object 
 3   poster                1930 non-null   object 
 4   description           1908 non-null   object 
 5   release_year          1940 non-null   float64
 6   type                  1941 non-null   object 
 7   imdb_popularity       369 non-null    float64
 8   tmdb_popularity       1924 non-null   float64
 9   imdb_score            1699 non-null   float64
 10  imdb_votes            1699 non-null   float64
 11  tmdb_score            1854 non-null   float64
 12  imdb_id               1722 non-null   object 
 13  tmdb_id               1924 non-null   float64
 14  genre_ids             1931 non-null   object 
 15  age_certification     15

In [14]:
# check out null values
df.isna().sum()

jw_entity_id               0
id                         0
title                      0
poster                    11
description               33
release_year               1
type                       0
imdb_popularity         1572
tmdb_popularity           17
imdb_score               242
imdb_votes               242
tmdb_score                87
imdb_id                  219
tmdb_id                   17
genre_ids                 10
age_certification        432
runtime                    1
production_countries      26
seasons                    0
streaming_app              0
dtype: int64

In [15]:
# check out the number of duplicate rows
df.duplicated().sum()

1

>The dataset includes null values and duplicate rows. I will address these two issues separately. In the cell below I will take a look at the duplicate row.

In [16]:
# take a look at the duplicate rows
df.loc[df.duplicated(keep=False)]

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,imdb_votes,tmdb_score,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app
89,tm102863,102863,Fafner in the Azure: Dead Aggressor - Heaven a...,/poster/174761052/{profile},The year is 2148. Two years have passed since...,2010.0,movie,,5.287,6.4,70.0,3.8,tt1794963,137502.0,"[1, 2, 6, 14]",,95.0,['JP'],0,Funimation
90,tm102863,102863,Fafner in the Azure: Dead Aggressor - Heaven a...,/poster/174761052/{profile},The year is 2148. Two years have passed since...,2010.0,movie,,5.287,6.4,70.0,3.8,tt1794963,137502.0,"[1, 2, 6, 14]",,95.0,['JP'],0,Funimation


>The `jw_entity_id` column should be the unique identifier for each title, so any repeated `jw_entity_id` means the same movie or show appears more than once, even if the rows are not identical in every column.

In [17]:
# duplicates for jw_entity_id
df.jw_entity_id.duplicated().sum()

345

In [18]:
# how many duplicates of titles are there
df.drop_duplicates(subset=['jw_entity_id']).title.duplicated().sum()

3

### Drop duplicate rows

In [19]:
# drop duplicates by jw_entity_id
df.drop_duplicates(subset=['jw_entity_id'], keep='last', inplace=True)
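
>The gap between 1 fully duplicated row and 345 repeated `jw_entity_id` values can be illustrated with a minimal sketch on made-up data (the toy frame and its values are hypothetical, not taken from `titles.csv`). It assumes, as one plausible cause, that the same title can be listed once per streaming app, and it shows which row `keep='last'` retains.

```python
import pandas as pd

# Hypothetical toy data: rows 0-1 are exact duplicates, rows 2-3 share an id
# but differ in streaming_app (an assumed cause of id-only duplicates).
toy = pd.DataFrame({
    "jw_entity_id": ["tm1", "tm1", "ts2", "ts2", "ts3"],
    "title": ["A", "A", "B", "B", "C"],
    "streaming_app": ["Funimation", "Funimation", "Funimation",
                      "Crunchyroll", "Crunchyroll"],
})

# Rows identical in every column (first occurrence is not counted).
full_row_dupes = toy.duplicated().sum()

# Rows whose id was already seen, regardless of the other columns.
id_dupes = toy["jw_entity_id"].duplicated().sum()

# Keep one row per id; keep="last" retains the final occurrence of each id.
deduped = toy.drop_duplicates(subset=["jw_entity_id"], keep="last")

print(full_row_dupes, id_dupes, len(deduped))
```

>With `keep='last'`, the surviving row for a repeated id is whichever one appears last in the frame, so the retained `streaming_app` value depends on row order.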

### Handling Null Values

>There are four columns whose null values I will handle: `release_year`, `genre_ids`, `runtime`, and `description`. I will handle each column separately, and I will have to use outside sources to fill in the values because justwatch.com does not have this information on its website.

#### Filling in missing values for `release_year`

In [20]:
# locating release_year null values
df.loc[df['release_year'].isna()]

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,imdb_votes,tmdb_score,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app
758,ts61912,61912,Japanese Anime Classic Collection,,,,show,,0.6,,,,,56020.0,,,7.0,,1,Crunchyroll


>I researched online and found on Amazon that this collection was released in 2007. I will replace its `release_year` value with 2007.

In [21]:
# replacing value with 2007 (assign the result back; chained inplace fillna may not update df in newer pandas)
df['release_year'] = df['release_year'].fillna(2007)

In [22]:
# check if it worked
df.loc[df['title'] == 'Japanese Anime Classic Collection']

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,imdb_votes,tmdb_score,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app
758,ts61912,61912,Japanese Anime Classic Collection,,,2007.0,show,,0.6,,,,,56020.0,,,7.0,,1,Crunchyroll


#### Filling in missing values for `genre_ids`

>The process for filling and cleaning `genre_ids` is longer than for the other columns. First, I have to fill the missing values by researching those shows on other websites to find their genres. Then I have to match those genres to the ones justwatch.com uses so they are consistent with the other shows and movies. To do that, I will get a list of all genre names and IDs from the JustWatch API. Once the null values are filled, I will clean the remaining rows of the column. The values in the `genre_ids` column are currently lists of numbers, with each number corresponding to a specific genre. I want to change the numbers in each list to the actual genre names. For example, I want to convert 1 to 'Action & Adventure'. This makes it easy to identify the genres of each show and movie.

In [23]:
# get list of genres from justwatch
just_watch = JustWatch(country='US')

In [24]:
# save genres to a genre list of dictionaries
genre_list = just_watch.get_genres()

In [25]:
# take a look at the list
genre_list

[{'id': 1,
  'short_name': 'act',
  'technical_name': 'action',
  'translation': 'Action & Adventure',
  'slug': 'action-and-adventure'},
 {'id': 2,
  'short_name': 'ani',
  'technical_name': 'animation',
  'translation': 'Animation',
  'slug': 'animation'},
 {'id': 3,
  'short_name': 'cmy',
  'technical_name': 'comedy',
  'translation': 'Comedy',
  'slug': 'comedy'},
 {'id': 4,
  'short_name': 'crm',
  'technical_name': 'crime',
  'translation': 'Crime',
  'slug': 'crime'},
 {'id': 5,
  'short_name': 'doc',
  'technical_name': 'documentation',
  'translation': 'Documentary',
  'slug': 'documentary'},
 {'id': 6,
  'short_name': 'drm',
  'technical_name': 'drama',
  'translation': 'Drama',
  'slug': 'drama'},
 {'id': 7,
  'short_name': 'fnt',
  'technical_name': 'fantasy',
  'translation': 'Fantasy',
  'slug': 'fantasy'},
 {'id': 8,
  'short_name': 'hst',
  'technical_name': 'history',
  'translation': 'History',
  'slug': 'history'},
 {'id': 9,
  'short_name': 'hrr',
  'technical_name'

>I'll create a new dictionary with the genre id as the keys and the translation as the values for each key.

In [26]:
# genre dictionary
genre_dict = {}

# for loop to get info from genre_list
for genre in genre_list:
    genre_dict[genre['id']] = genre['translation']

In [27]:
# take a look at the new dictionary
genre_dict

{1: 'Action & Adventure',
 2: 'Animation',
 3: 'Comedy',
 4: 'Crime',
 5: 'Documentary',
 6: 'Drama',
 7: 'Fantasy',
 8: 'History',
 9: 'Horror',
 10: 'Kids & Family',
 11: 'Music & Musical',
 12: 'Mystery & Thriller',
 13: 'Romance',
 14: 'Science-Fiction',
 15: 'Sport',
 16: 'War & Military',
 17: 'Western',
 23: 'Reality TV',
 18: 'Made in Europe'}

In [28]:
# locate missing genre_ids
df.loc[df['genre_ids'].isna()]

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,imdb_votes,tmdb_score,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app
730,ts252020,252020,Scared Rider Xechs,/poster/253937855/{profile},"The Blue World, which symbolizes reason, is un...",2016.0,show,,1.187,,,7.0,tt13774780,107208.0,,,24.0,['JP'],1,Funimation
737,ts27040,27040,Gunslinger Girl -Il Teatrino-,,,2008.0,show,,0.217056,,,5.0,,26605.0,,,25.0,,1,Funimation
477,ts297686,297686,Crunchyroll Anime Awards,/poster/247410363/{profile},Annual awards ceremony by the anime streaming ...,2017.0,show,,2.853,,,,,128427.0,,,90.0,['US'],5,Crunchyroll
758,ts61912,61912,Japanese Anime Classic Collection,,,2007.0,show,,0.6,,,,,56020.0,,,7.0,,1,Crunchyroll
818,ts215898,215898,Hakata Mentai! Pirikarako-chan,/poster/136330187/{profile},Hakata Mentai! Pirikarako-chan is set in a mys...,2019.0,show,,1.313,,,5.3,,90847.0,,,4.0,['JP'],1,Crunchyroll
972,ts94758,94758,The Journey Home,/poster/146688947/{profile},Insects are taken up into space for use in exp...,2015.0,show,,,7.0,15.0,,tt6667152,,,TV-G,23.0,['CA'],20,Crunchyroll
983,ts84363,84363,Asenshu Anime Recap,/poster/85505445/{profile},"Anime Synopsis, News and Spoilers in Albanian.",2018.0,show,,0.608,,,,,82824.0,,,3.0,['AL'],1,Crunchyroll
996,ts208160,208160,"Demian, o Justiceiro",/poster/251015786/{profile},,1968.0,show,,,,,,tt0243695,,,,19.0,,1,Crunchyroll
1032,ts341696,341696,Sony Music AnimeSongs ONLINE 2022,,"Sony Music AnimeSongs ONLINE 2022"" is a festiv...",2022.0,show,,0.6,,,10.0,,194597.0,,,208.0,['JP'],1,Crunchyroll
1037,ts53246,53246,Fan Service,,"Gray Haddock, Kerry Shawcross, Miles Luna, and...",2016.0,show,,1.498,,,,,69193.0,,,75.0,['US'],4,Crunchyroll


>There are 10 rows that don't have genres attached to them. I will have to use an outside source (website) to find genres for them. If I am unable to find a genre for a title, I will list it as 'Other'.

In [29]:
# i'll create a list of the titles that have null values of genres
missing_genres = list(df.loc[df['genre_ids'].isna()].title)
missing_genres

['Scared Rider Xechs',
 'Gunslinger Girl -Il Teatrino-',
 'Crunchyroll Anime Awards',
 'Japanese Anime Classic Collection',
 'Hakata Mentai! Pirikarako-chan',
 'The Journey Home',
 'Asenshu Anime Recap',
 'Demian, o Justiceiro',
 'Sony Music AnimeSongs ONLINE 2022',
 'Fan Service']

>Below I created a dictionary with the titles of the shows or movies as the keys and lists of the genres I found for those titles as the values. I found the genres on the internet and hard-coded them into the dictionary.

In [30]:
# filling the values of the missing genres
missing_genres_dict = {
    'Scared Rider Xechs': ['Action & Adventure', 'Romance', 'Science-Fiction'],
    'Gunslinger Girl -Il Teatrino-': ['Science-Fiction', 'Action & Adventure'],
    'Crunchyroll Anime Awards': ['Other'],
    'Japanese Anime Classic Collection': ['Other'],
    'Hakata Mentai! Pirikarako-chan': ['Comedy'],
    'The Journey Home': ['Action & Adventure', 'Comedy', 'Kids & Family'],
    'Asenshu Anime Recap': ['Other'],
    'Demian, o Justiceiro': ['Science-Fiction', 'Action & Adventure'],
    'Sony Music AnimeSongs ONLINE 2022': ['Music & Musical'],
    'Fan Service': ['Other'] 
}

>Below I created a function that will help me fill the null values in the `genre_ids` column. The function takes in the title of a show or movie and the `missing_genres_dict` I created above, and returns the associated list of genres for that title.

In [31]:
# function to take in title and return list of genres
def return_genre(title, genre_dict):
    return genre_dict[title]

# test the function on the missing_genres list
for title in missing_genres:
    print(return_genre(title, missing_genres_dict))

['Action & Adventure', 'Romance', 'Science-Fiction']
['Science-Fiction', 'Action & Adventure']
['Other']
['Other']
['Comedy']
['Action & Adventure', 'Comedy', 'Kids & Family']
['Other']
['Science-Fiction', 'Action & Adventure']
['Music & Musical']
['Other']


>Finally I will create a new series using the dataframe and a lambda function. The below cell applies a conditional lambda function that checks if the `genre_ids` column is null and will run the return_genre function if the column is null and fill the value with the associated list of genres for that title. I will use the new series I created and assign it to a new column in the dataframe called `genre_ids_fill`.

In [32]:
# create new series that will fill the null values for genre_ids
df2 = df.apply(lambda x: return_genre(x['title'], missing_genres_dict) 
               if pd.isnull(x['genre_ids']) else x['genre_ids'], axis=1)

In [33]:
df['genre_ids_fill'] = df2
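
>As a small illustration of the fill step above, here is a sketch on made-up data (the toy frame and `toy_missing` are hypothetical). It applies the same rule as the lambda, using the looked-up genre list only where `genre_ids` is null, but written as a plain loop over the two columns so the logic is easy to follow.

```python
import pandas as pd

# Hypothetical toy data: genre_ids is a string-encoded list or missing.
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "genre_ids": ["[1, 2]", None, "[3]"],
})

# Hand-researched genres for the titles with missing genre_ids.
toy_missing = {"Show B": ["Comedy"]}

# Same rule as the lambda: look the title up only when genre_ids is null.
filled = [
    toy_missing[title] if pd.isna(ids) else ids
    for title, ids in zip(toy["title"], toy["genre_ids"])
]

toy["genre_ids_fill"] = filled
print(toy)
```

>Note that the column now mixes string-encoded ID lists with real lists of genre names, which is why the next steps parse the strings and convert only the numeric entries.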

>Now I need to transform the `genre_ids_fill` from numbers in the list to the genres they correspond to. I'll create a new function that will take in a list and returns a new list of the genres corresponding to the numbers.
>
>First, I have to convert the values of the `genre_ids_fill` column, because the lists are stored as strings rather than lists. Next, I will create a function, `num_to_genre`, that goes through a list of numbers and converts each number to the associated genre name. For example, 1 will be changed to 'Action & Adventure'. The function then returns the new list of genres. Finally, as above, I will create a new series with a conditional lambda function that checks whether the value is a list of numbers and, if so, runs it through `num_to_genre` to get a list of genre names. I will assign this series to the column `genre_ids_fill`.

In [34]:
# change strings of list to lists
df['genre_ids_fill'] = df['genre_ids_fill'].map(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

In [35]:
# function to convert number id to genre name
def num_to_genre(num_list, genre_dict):
    # create empty list for genres
    genre_list = []
    
    for num in num_list:
        
        # append genre to new list
        genre_list.append(genre_dict[num])
        
    return genre_list

In [36]:
# create new series that will convert the numbers to genres
df2 = df.apply(lambda x: num_to_genre(x['genre_ids_fill'], genre_dict) 
               if type(x['genre_ids_fill'][0]) == int else x['genre_ids_fill'], 
               axis=1)

In [37]:
df['genre_ids_fill'] = df2
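
>The two conversion steps above (parsing the string-encoded lists, then mapping IDs to names) can be sketched together on made-up values. The helper `to_genre_names` and the small `toy_genre_dict` are hypothetical names used only for this illustration, not part of the notebook. Unlike the check on the first list element used above, this version inspects each element, so it also tolerates an empty list.

```python
import ast

# Hypothetical subset of the id -> genre name mapping.
toy_genre_dict = {1: "Action & Adventure", 2: "Animation", 3: "Comedy"}

def to_genre_names(value, id_to_name):
    """Return a list of genre names for one cell of the column."""
    # Parse string-encoded lists such as "[1, 2]" into real lists.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    # Translate integer ids; leave hand-filled genre names untouched.
    return [id_to_name[v] if isinstance(v, int) else v for v in value]

# A string-encoded id list, a hand-filled list, and an empty list.
raw = ["[1, 2]", ["Comedy"], "[3]", "[]"]
converted = [to_genre_names(v, toy_genre_dict) for v in raw]
print(converted)
```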

>I successfully transformed the genre_ids from numbers to the strings of the genres they represent. 
>
>Next, I will fill the null value in the `runtime` column.

#### Filling in missing values for `runtime`

>The index is no longer sequential after loading and dropping duplicates, so I will first reset it to start at 0. Then I will locate the row with the null value.

In [38]:
# reset index to start at 0 and discard the old index
df.reset_index(drop=True, inplace=True)

In [39]:
# locating the row with the missing value
df.loc[df['runtime'].isna()]

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,...,tmdb_score,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app,genre_ids_fill
1234,ts26242,26242,Kite Liberator,/poster/272395473/{profile},"KITE Liberator is an American-released, Japane...",2008.0,show,,2.346,,...,4.0,,45881.0,"[6, 2]",,,['JP'],1,Crunchyroll,"[Drama, Animation]"


>The title of the show is Kite Liberator. I'll research online to find the runtime of the episodes.
>
>After researching this item, I found out this title is actually a movie with a runtime of 58 minutes. I'll change these values accordingly.

In [40]:
# replacing values for 'Kite Liberator'
df.loc[1234, 'type'] = 'movie'
df.loc[1234, 'runtime'] = 58
df.loc[1234, 'seasons'] = 0

#### Filling in missing values for `description`

In [41]:
df.loc[df['description'].isna()][['title','release_year','streaming_app', 'type']]

Unnamed: 0,title,release_year,streaming_app,type
310,PUCHIM@S,2013.0,Funimation,show
368,Blessing of the Campanella,2010.0,Funimation,show
369,Sureiyâzu Revolution,2008.0,Funimation,show
390,Venus Project: Climax,2015.0,Funimation,show
393,Gunslinger Girl -Il Teatrino-,2008.0,Funimation,show
756,City Hunter,1987.0,Crunchyroll,show
961,Final Fantasy XV: Episode Ardyn -Prologue-,2019.0,Crunchyroll,show
1131,To Be Heroine,2018.0,Crunchyroll,show
1142,Rusted Armors,2022.0,Crunchyroll,show
1261,Japanese Anime Classic Collection,2007.0,Crunchyroll,show


>There is no easy way to get the descriptions for these 33 shows other than looking each one up on Crunchyroll or Funimation and copying and pasting the text into a Python cell. That is what I will do for each of the titles.

In [42]:
df.loc[df['description'].isna()].jw_entity_id.values

array(['ts32397', 'ts178443', 'ts316995', 'ts78923', 'ts27040', 'ts77258',
       'ts251702', 'ts81286', 'ts304489', 'ts61912', 'ts217594',
       'ts23918', 'ts84520', 'ts78793', 'ts27568', 'ts42978', 'ts78756',
       'ts221663', 'ts53170', 'ts78868', 'ts58050', 'ts53544', 'ts27610',
       'ts134197', 'ts208160', 'ts220110', 'ts285494', 'ts83571',
       'ts80093', 'ts113733', 'ts57532', 'ts219894', 'ts64924'],
      dtype=object)

>The dictionary below has the titles with the missing descriptions as the keys and the descriptions for those shows as the values.

In [43]:
# create dictionary for descriptions
descriptions_dict = {
    'PUCHIM@S': "Hyperactive hijinks are sure to follow when the idol girls of 765 Productions discover chibis with adorable abilities.", 
    'Blessing of the Campanella': "Leicester Maycraft’s gang gathers to watch a meteor shower. After one of the stars crashes into a church, Maycraft discovers a sleeping girl.", 
    'Sureiyâzu Revolution': "The original cast of The Slayers is reunited for the first time in over a decade in The Slayers Revolution! Lina’s gang goes head-to-head with Pokota, a little furball with a big destructive streak! But there’s more to Pokota than anyone knows, and if Lina can get to the bottom of his habit of blowing stuff up, she could end up with an awesome new ally!",
    'Venus Project: Climax': "A unique series that will not only pit 2D idols against each other, but also features voice actresses competing in live challenges.", 
    'Gunslinger Girl -Il Teatrino-': "When the Social Welfare Agency investigates the disappearance of an operative, their inquiry leads them right into the lair of their rival, the Five Republics. The assassin Triela infiltrates the hostile organization, but her search is cut short when she finds herself staring down the barrel of a gun",
    'City Hunter': "Ryo Saeba, the legendary City Hunter, is a first-class sweeper for hire, taking on jobs from protecting beautiful women to taking out bad guys permanently. He can be a private detective or hitman, whatever the case calls for, and it often requires the use of his superhuman marksmanship. But even so, Ryo can’t do it alone. His partner is Kaori Makimura, the younger sister of his murdered best friend. Kaori serves as his assistant, while also protecting their attractive clients from Ryo’s “mokkori” advances with her trusty supply of 100 ton hammers.",
    'Final Fantasy XV: Episode Ardyn -Prologue-': "FINAL FANTASY XV EPISODE ARDYN depicts the story of Ardyn Izunia, the main nemesis in FINAL FANTASY XV. The story of suffering, death, and resurrection leading up to the main story in FFXV. This animation reveals the story two millennia ago",
    'To Be Heroine': "Everyone around Futaba expects her to grow up and become an adult, and she's lost the ability to keep herself mentally balanced. At the bottom of her heart, her childish self is still there, and still strong. One day she wanders into another dimension, a world where the light has been lost, and darkness rules. The people there exist as babies wearing only their underpants. The clothes they wear can be summoned as powerful fighters called SpiCloths. In this world, a battle was being fought between light and darkness", 
    'Rusted Armors': 'The era is a turbulent Warring States period in Hinomoto. Kinokuni, the unexplored region of the mountains and the sea. In the depths of this deep forest, there is a group using "Yatagarasu" as their flag symbol and using guns as weapons. Their name is "Saikashu". It is "Magoichi" who arrived from a foreign country that inherited the name as the head of Saikashu. On the other hand, "Saburo" quickly senses the signs of aggression from the European powers and struggles to protect Hinomoto.The fate of the two who would not meet originally will be crossed by the fight against the invaders who came from a foreign country ─!',
    'Japanese Anime Classic Collection': "Japanese Anime Classic Collection is a set of vintage anime presents 55 titles from the 1920s and 1930s, the Golden Age of Japanese silent film.",
    'SD Gundam World: Sangoku Souketsuden': 'A mysterious trinity called the “Etiolation Trinity” has been spreading across the land for the past several years, and the BUGs, who lose their self of self, start attacking Mobile Suits one after the next. In order to protect themselves from the BUGs coming to attack them, the remaining Mobile Suits are forced to live inside city strongholds. However, behind the fortifications, after the death of the former lord, the lord Dong Zhuo is as domineering as can be. As if led by fate, after meeting Guan Yu Nu Gundam and Zhang Fei God Gundam, Liu Bei Unicorn Gundam resolves to save the world',
    "Kaasan Mom's Life": "A sidesplitting essay in animation that depicts the simple everyday life of career-committed mother and her rather out of standard family set in contemporary Japan.",
    'THE IDOLM@STER SideM Wakeatte Mini!': "From \"THE iDOLM@STER: SideM\" franchise which spawned a game, live shows, and comics, comes a short anime featuring the idols from 315 Production in mini form. The total number of idols on the show... is all 46 of them! Director Mankyu, who is a regular for THE iDOLM@STER series (\"Puchimasu!\" \"THE iDOLM@STER Cinderella Girls Theater\"), brings you the cool and cute daily lives of the idols to all the producers out there.",
    'The Melody of Oblivion': "This takes place in the 20th century where the Monsters have succeeded in defeating the humans through a violent war. The Monsters rule the earth in the 20th century with no one recalling what had happened in the past. Bocca, is a teenage boy who chooses the path of becoming the Warrior.",
    'PES: Peace Eco Smile': "It takes place in Kichijoji, one of the most popular cities in Japan. Pes saves Kurumi when she almost falls into the pond of Iinokashira Park. Pes falls in love with Kurumi as soon as she kisses him. Pes starts working as a part-time in the flower shop and begins his adventures on Earth.", 
    'Shounen Ashibe Go! Go! Goma-chan': "A comedy manga, \"Shonen Ashibe,\" that follows the friendship between baby spotted seal Goma-chan and first-grade student Ashiya Ashibe. First serialized in 1988 and adapted to anime in 1991, the adorable Goma-chan created a massive following and a spotted seal boom. This spring, Goma-chan returns to \"Tentere Anime.\" A cute, pleasant story of Ashibe and Goma-chan and their unusual school and their neighbors. Sometimes endearing, sometimes bizarre, it's a fun anime for the whole family!",
    'Tantei Team KZ Jiken Note': "Tantei Team KZ Jiken Note centers on sixth grader Aya Tachibana who worries about friends, family, grades, and more. One day, she joins the Detective Team KZ with four very unique boys that she met in cram school: Kazuomi Wakatake, Takakazu Kuroki, Kazunori Uesagi, and Kazuhiko Kozuka.", 
    'Days of Urashimasakatasen': "School life -- it's an experience that everyone should have, and no one should take for granted. Of course, everyone knows that the most enviable way to spend high school is as the most popular person in class. Transfer student Urata has decided that his high school debut will be brilliant, and as he reaches nervously for the door -- it happens. In his way stand fellow high school students Shima, Sakata, and Senra! Are they enemies? Allies? Or something else entirely?! This heart-pounding transfer school youth story is about to begin!",
    'SENGOKUCHOJYUGIGA': "Popular young actors and a unique cast come together to have fun messing around and playing the roles of samurai generals. SENGOKUCHOJYUGIGA is an anime that plays around with both Japan's history and the generals of Japan's sengoku period!", 
    'World Fool News': "Takahashi is transferred to a main anchor of a news program, which is known for being a little...weird. This is a comedy about somewhat ridiculous happenings occurring at a broadcasting station.",
    'Ikemen Sengoku: Toki wo Kakeru ga Koi wa Hajimaranai': "My name is Sasuke. I was a college student, but one day I fell into a time slip back to the Sengoku Period. What's more is that this Sengoku Period is nothing like what I learned about in the history books... The much anticipated anime version of the popular romance game \"Ikémen Sengoku\"! A high-tension Sengoku comedy interlaced with adorable generals.",
    'Forest Fairy Five': "A beautiful nation, prospering since ancient times, Japan is now known as the Anime Kingdom. There are more than just humans living there; animes truly do exist in Japan. Past the Fairy Ring, to the world of fairies, live anime-chans. There's a Fairy Ring in your town, too. Here. And there. Even in Harajuku. Maybe even in the Ashigara mountains. By some chance, we'll open that door. And we might get to meet the anime-chans. This is the land where you get to meet anime-chans.", 
    'Lychee Light Club': "The cramped town of Keiko-cho is stained black with factory exhaust and oil. Late one night, a piercing whistle echoes from some ruins in a seemingly empty corner of the town, accompanied by the eerie echoes of harsh words spoken in German. The sounds come from a group of nine boys dressed in the starched collars and caps of high school uniforms. There in the darkness stands a secret base built by these nine boys under the leadership of Zera, the \"king of the ruins,\" known as the Light Club. Get the manga by Vertical!", 
    'Tabimachi Late Show': "Comix Wave Films is producing four episodes with the theme of “goodbyes and journeys,” entitled “Recipe,” “Transistor Smartphone,” “Summer Festival,” and “Clover”, as part of the Ultra Super Anime Time programming block broadcast in Japan.",
    'Demian, o Justiceiro': "A story about an elite government team that pilots a five into one super robot fighting force. They are called to take down their robot’s creator who is now a terrorist. ", 
    'Lovely Muuuuuuuco!': "Muco is the sweetest, most lovable dog in the neighborhood. This sparkly-nosed Shiba Inu lives with her owner and best friend, Komatsu, a glassblower who lives in the mountains. These two companions have wonderful adventures, from going on walks to playing in the pond in front of their house. No matter where they go or what they do, Muco will always love her best friend Komatsu!",
    'The Nameko Families': "A home drama centering around protagonist \"Nameko.\" Amusing family anecdotes, surprising and funny events, and a little bit of tear-jerking in the warm, everyday lives of the Nameko family.", 
    'The Sprites of Floria': "After an invasion by the Vivolian army, the Sprite Himawari and her childhood friend Tsubaki flee their homeland of Floria for the town of Romton. There, they live out their lives in peace... for a time. One day, Himawari's friend and fellow Sprite Ajisai is kidnapped by mysterious strangers. Now Himawari and detective Tsubaki set out to rescue her, but find the dreaded Vivolians standing in their way!", 
    'Crossing Time': "“Clank, clank, clank, clank...” Today, the railroad crossing bar goes down again, stopping someone on their way somewhere. The various stories of youth, eros, art, first love, etc that occur during the time spent waiting at a railroad crossing... All railroad crossings, all the time. Bringing you a variety of short stories about railroad crossings!",
    'BWFC: Banpresto World Figure Colosseum': "The Banpresto World Figure Colosseum pits 12 of the world's best sculptors against each other to see who can sculpt the ultimate Dragon Ball Z and One Piece figures. Witness the birth of Banpresto’s next generation of toys before they hit the shelves. And, be sure to cast your vote at BanprestoWFC.com. Hosted by VampyBitMe and ninjamikey.", 
    'Dream Festival!': "The Dream Festival is the stage that all idols dream of singing on, with their professional debut on the line. In order to get there, idols work their hardest every day to perfect their performance... and the key to coming out on top is the Dream Festival Cards sent by fans to their favorite idols. Receiving these Cards makes the idols who make it to the stage shine even brighter. Now head to the Dream Festival with your Dream Festival Card in hand for the idol you love most!",
    'Peeping Life': "If you're a fan of shows like Comedy Central's \"Shorties Watching Shorties\" and independent style animation, with a sharp sense of humor, this series is for you! Brought to you by CoMixWaveFilms, Peeping life presents hilarious shorts of animated rodoscoped skits by popular Japanese comic-duos. The Peeping Life series is currently airing on Japanese TV and Cruchyroll to rest of the world.", 
    'Web Ghosts Pipopa': "Net Ghosts PiPoPa (Web Ghosts PiPoPa) Web Ghosts PiPoPa is a comedic, action adventure of a boy who is swallowed into his cell phone and transported to the virtual world of the internet, where he befriends three internet ghosts: Pit, Pot and Pat."
}

In [44]:
# filling in the null values with the values from the dict
df2 = df.apply(lambda x: return_genre(x['title'], descriptions_dict) 
               if pd.isnull(x['description']) else x['description'], axis=1)

In [45]:
# assign df2 to a new column 'descriptions_filled'
df['descriptions_filled'] = df2

>Before I add the descriptions to the recommender systems, I have to address an error I came across in the information for the title 'Demian, o Justiceiro'. This is not the correct name and information for this show. JustWatch.com stores the information for the correct show, 'DEMIAN', under a title with a similar name. 
>
>I will have to change the release_year to 2014, change the title to 'DEMIAN', and set the imdb_id to NaN.

In [46]:
# replace the values with the correct values
df.loc[1499, ['title']] = 'DEMIAN'
df.loc[1499, ['release_year']] = 2014
df.loc[1499, ['imdb_id']] = np.nan

### Changing Titles With Same Names

>There are a few titles with the same name so I will evaluate why they have the same name and determine what to do with them.

In [47]:
# checking for duplicate titles
df.loc[df['title'].duplicated(keep=False)].sort_values('title')

Unnamed: 0,jw_entity_id,id,title,poster,description,release_year,type,imdb_popularity,tmdb_popularity,imdb_score,...,imdb_id,tmdb_id,genre_ids,age_certification,runtime,production_countries,seasons,streaming_app,genre_ids_fill,descriptions_filled
61,ts31181,31181,Fruits Basket,/poster/155050012/{profile},Tohru Honda is 16 year old orphaned girl who g...,2001.0,show,3548.0,15.592,7.9,...,tt0328738,36941.0,"[6, 2, 3, 7, 13]",TV-PG,24.0,['JP'],1,Funimation,"[Drama, Animation, Comedy, Fantasy, Romance]",Tohru Honda is 16 year old orphaned girl who g...
561,ts87522,87522,Fruits Basket,/poster/246787476/{profile},After a family tragedy turns her life upside d...,2019.0,show,769.0,43.038,8.6,...,tt9304350,85991.0,"[2, 3, 6, 7, 13]",TV-14,24.0,['JP'],3,Crunchyroll,"[Animation, Comedy, Drama, Fantasy, Romance]",After a family tragedy turns her life upside d...
267,ts272319,272319,The Duke of Death and His Maid,/poster/247591590/{profile},"Due to a childhood curse, anything that the Du...",2021.0,show,,32.898,,...,,117992.0,"[2, 3, 6]",TV-14,24.0,['JP'],1,Funimation,"[Animation, Comedy, Drama]","Due to a childhood curse, anything that the Du..."
1114,ts280992,280992,The Duke of Death and His Maid,/poster/249239431/{profile},A cursed duke who kills everyone he touches li...,2021.0,show,,,7.2,...,tt13971512,,"[2, 3, 6, 13]",TV-14,23.0,,2,Crunchyroll,"[Animation, Comedy, Drama, Romance]",A cursed duke who kills everyone he touches li...
441,tm299419,299419,Tokyo Ghoul,/poster/30444162/{profile},A Tokyo college student is attacked by a ghoul...,2017.0,movie,,48.513,5.7,...,tt5815944,433945.0,"[12, 1, 6, 14, 7, 9]",NC-17,119.0,['JP'],0,Funimation,"[Mystery & Thriller, Action & Adventure, Drama...",A Tokyo college student is attacked by a ghoul...
547,ts20202,20202,Tokyo Ghoul,/poster/249116603/{profile},Ken Kaneki is a bookworm college student who m...,2014.0,show,700.0,155.001,7.8,...,tt3741634,61374.0,"[1, 7, 9, 12, 2, 6]",TV-MA,24.0,['JP'],4,Crunchyroll,"[Action & Adventure, Fantasy, Horror, Mystery ...",Ken Kaneki is a bookworm college student who m...


>For Fruits Basket, I can see there are two shows. One came out in 2001 and the other in 2019. I'll rename each show with the year at the end of it to differentiate between the two.
>
>For The Duke of Death and His Maid, I can see it is the same show but with different IDs. I am guessing JustWatch.com has a duplicate of this show on their website. For this project, I will drop the row pertaining to the Funimation streaming app because this row has less information than its Crunchyroll counterpart.
>
>For Tokyo Ghoul, there is a movie and a show with the same name. Similar to what I will do with Fruits Basket, I will add 'movie' to the end of the movie title to differentiate between them.

In [48]:
# change Fruits Basket title
df.loc[61, 'title'] = 'Fruits Basket - 2001'
df.loc[561, 'title'] = 'Fruits Basket - 2019'

# change Tokyo Ghoul movie title
df.loc[441, 'title'] = 'Tokyo Ghoul - movie'

# drop row for The Duke of Death and His Maid - Funimation
df.drop(index=267, inplace=True)

> The duplicate titles were taken care of, but because I dropped a row I have to reset the index starting at 0.

In [49]:
# resetting the index so it starts at 0; drop=True discards the old index
df.reset_index(drop=True, inplace=True)

## Data Preprocessing

>The data preprocessing process will include: 
>1. Using .get_dummies() on the categorical columns `genre_ids_fill` and `type`. 
>2. Dropping unused columns.
>3. Using TFIDF Vectorizer on the `descriptions_filled` column.
>4. Scaling all columns except for `tmdb_score`, the genre columns, and the vectorized columns, because I am going to use KNN Imputer to fill in the values of `tmdb_score` after I scale the other columns.
>    * I am going to experiment with two types of scaling, StandardScaler and MinMaxScaler, and then decide which one I think is better. Because cosine similarity and nearest neighbors are distance based, these two scaling processes can give different results, so I am going to use each one separately.
>5. After scaling, using KNN Imputer to fill in the missing values for `tmdb_score`.
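
>To illustrate how the two scalers in step 4 treat the same numeric columns, here is a minimal sketch on made-up data. The three rows and the variable names are assumptions for illustration only and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy values (invented for illustration): release_year, runtime, seasons
toy = np.array([
    [1998.0, 24.0, 1.0],
    [2004.0, 25.0, 1.0],
    [2017.0, 120.0, 4.0],
])

# StandardScaler: each column is centered at 0 with unit variance
toy_ss = StandardScaler().fit_transform(toy)

# MinMaxScaler: each column is rescaled into the range [0, 1]
toy_mm = MinMaxScaler().fit_transform(toy)

print(toy_ss.round(2))
print(toy_mm.round(2))
```

>Because the scaled values differ in spread, distances between the same pair of rows can rank differently under each scaler, which is why both versions are worth comparing.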


### Creating dummy columns for each genre

>Currently the genres of each title are stored in lists. I want to create dummy columns for each of the unique genres, so if a title has the genre in its list the value will be a 1 in that column, or else it will be a 0. 
>
>I will use the .get_dummies() method to create the columns and then concat the dataframe with the dummy columns into a new dataframe called `model_df`.

In [50]:
# create model_df after using get_dummies() on genres
model_df = pd.concat([df, df['genre_ids_fill'].str.join('|').str.get_dummies()], axis=1)
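
>To show what the join-then-get_dummies step does, here is a minimal sketch on a made-up three-row frame. The titles and genre lists are invented for illustration.

```python
import pandas as pd

# Toy frame (invented rows) with genres stored as lists, like `genre_ids_fill`
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genre_ids_fill': [
        ['Drama', 'Comedy'],
        ['Comedy'],
        ['Action & Adventure', 'Drama'],
    ],
})

# Join each list into one '|'-separated string, then split it into 0/1 columns
# ('|' is the default separator used by Series.str.get_dummies)
genre_dummies = toy['genre_ids_fill'].str.join('|').str.get_dummies()

print(genre_dummies)
```

>Each unique genre becomes one column, and a row holds a 1 wherever that genre appeared in its list.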

### Creating dummy columns for `type` column

>Next I have to use .get_dummies() on the `type` column. First I want to change the values in the column from 'movie' and 'show' to 'movie_' and 'show_', because if I vectorize the descriptions there may be tokens called 'movie' and 'show'. That would result in multiple columns having the same name. To differentiate between them, I am going to add an '_' to the type values.

In [51]:
# creating type dictionary so I can map it
type_dict = {
    'movie': 'movie_',
    'show': 'show_'
}

In [52]:
# map the dictionary on the type column
model_df['type'] = model_df['type'].map(type_dict)

> Now I will use .get_dummies() on the `type` column.

In [53]:
# get dummies on the type column
model_df = pd.concat([model_df,
           pd.get_dummies(model_df['type'])],axis=1)

### Getting rid of unnecessary columns

>The `model_df` has extra columns that I don't need for the recommender system. The columns that I need are as follows: `jw_entity_id`, `release_year`, `tmdb_score`, `runtime`, `seasons`, `descriptions_filled`, the `genre` columns, and the `type`.

In [54]:
# columns to drop
cols_to_drop = [
    'id', 'title', 'type', 'poster', 'description', 'imdb_popularity', 
    'tmdb_popularity', 'imdb_score', 'imdb_votes', 
    'imdb_id', 'tmdb_id', 'genre_ids',
    'age_certification', 'production_countries',
    'streaming_app', 'genre_ids_fill',
]

# drop columns from model_df
model_df.drop(columns=cols_to_drop, inplace=True)

In [55]:
# take a look at the dataframe
model_df.head(3)

Unnamed: 0,jw_entity_id,release_year,tmdb_score,runtime,seasons,descriptions_filled,Action & Adventure,Animation,Comedy,Crime,...,Mystery & Thriller,Other,Reality TV,Romance,Science-Fiction,Sport,War & Military,Western,movie_,show_
0,ts28221,1985.0,8.6,24.0,3,Robotech is an 85-episode adaptation of three ...,1,1,0,0,...,0,0,0,0,1,0,0,0,0,1
1,ts199,2006.0,8.071,25.0,3,The story follows a team of pirate mercenaries...,1,1,1,1,...,1,0,0,0,0,0,0,0,0,1
2,ts25674,1998.0,8.3,25.0,1,Lain—driven by the abrupt suicide of a classma...,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1


> Before I scale and use TFIDF Vectorizer, I will set the index as `jw_entity_id`

In [56]:
# set jw_entity_id as index
model_df.set_index('jw_entity_id', inplace=True)

### Scale, TFIDF Vectorizer, KNN Imputer

>I am going to create two separate dataframes: one scaled with Standard Scaler and the other with MinMax Scaler. For each dataframe, I am going to create a column transformer that scales and TFIDF vectorizes in one step. Afterward, I will use KNN Imputer on `tmdb_score` and then scale that column with the same scaler used for the other features, either Standard Scaler or MinMax Scaler.

### Standard Scaler and TFIDF Vectorizer

In [57]:
# instantiate Standard Scaler
ss = StandardScaler()

# create wordlist
stopwords_list = stopwords.words('english')

# instantiate TFIDF Vectorizer
tfidf = TfidfVectorizer(stop_words=stopwords_list, max_features=500, ngram_range=(1,2))

# Create Column Transformer with Standard Scaler
CT_ss = ColumnTransformer(transformers=[
    ('ss', ss, ['release_year','runtime','seasons']),
    ('tfidf', tfidf, 'descriptions_filled')
    ],remainder='passthrough')

In [58]:
# fit descriptions to tfidf to get the vocabulary 
desc_tfidf = tfidf.fit(model_df['descriptions_filled'])

In [59]:
# these column headers will be used when creating the dataframe
column_headers = ['release_year','runtime','seasons']
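# list the TFIDF terms in column-index order; the key order of vocabulary_
# is not guaranteed to match the order of the transformed columns
column_headers += sorted(desc_tfidf.vocabulary_, key=desc_tfidf.vocabulary_.get)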
column_headers += ['tmdb_score']
column_headers += list(model_df.columns)[5:]
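
>The column labels are only correct if the TFIDF terms are listed in the same order as the transformed columns. As a minimal sketch on a made-up corpus (the three documents are invented), the snippet below compares the key order of `vocabulary_` with the terms sorted by their column index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (invented) standing in for `descriptions_filled`
docs = ["zebra pilot robot", "apple pilot", "robot apple zebra"]

toy_tfidf = TfidfVectorizer().fit(docs)

# vocabulary_ maps each term to its column index in the transformed matrix
vocab = toy_tfidf.vocabulary_

# Terms sorted by that index, i.e. in true column order
ordered_terms = sorted(vocab, key=vocab.get)

print(list(vocab.keys()))  # key order of the dict
print(ordered_terms)       # column order of the matrix
```

>If the two printed lists differ, labeling columns with the raw key order would attach the wrong term to some TFIDF columns.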

In [60]:
# transform and assign the new model_df_ss
model_df_ss = pd.DataFrame(CT_ss.fit_transform(model_df).toarray(),
                             index=model_df.index,
                             columns=column_headers)

> I created the standard scaler dataframe. Now I will look at the number of null values in the `tmdb_score` and fill them using the KNN Imputer. After that I will transform the column using Standard Scaler.

In [61]:
# look at the number of null values in tmdb_score column
model_df_ss['tmdb_score'].isna().sum()

81

In [62]:
# Instantiate KNN Imputer
knn_impute = KNNImputer()

# fit_transform model_df_ss
model_knn_ss = knn_impute.fit_transform(model_df_ss)

# transform model_knn_ss to dataframe
model_df_ss = pd.DataFrame(model_knn_ss, 
                           columns=model_df_ss.columns,
                           index=model_df_ss.index)

In [63]:
# fit transform tmdb_score with Standard Scaler
model_df_ss['tmdb_score'] = ss.fit_transform(pd.DataFrame(model_df_ss['tmdb_score']))
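
>To show how KNN Imputer fills a missing score from similar rows, here is a minimal sketch on made-up data. The values are invented, and `n_neighbors=1` is chosen only to keep the toy result easy to follow; the cell above uses the default of 5 neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy feature table (invented values); one tmdb_score is missing
toy = pd.DataFrame({
    'release_year': [0.0, 0.1, 0.9, 1.0],
    'runtime': [0.2, 0.2, 0.8, 0.9],
    'tmdb_score': [8.0, np.nan, 6.0, 6.5],
})

# The gap is filled from the closest row, with closeness measured
# on the columns that are present in both rows
imputer = KNNImputer(n_neighbors=1)
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_filled)
```

>The second row is closest to the first row on `release_year` and `runtime`, so it takes that row's score.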

In [64]:
model_df_ss.shape

(1595, 526)

### MinMaxScaler and TFIDF Vectorizer

In [65]:
# instantiate Min Max Scaler
minmax = MinMaxScaler()

# create wordlist
stopwords_list = stopwords.words('english')

# instantiate TFIDF Vectorizer
tfidf = TfidfVectorizer(stop_words=stopwords_list, max_features=500, ngram_range=(1,2))

# Create Column Transformer with Min Max Scaler
CT_mm = ColumnTransformer(transformers=[
    ('minmax', minmax, ['release_year','runtime','seasons']),
    ('tfidf', tfidf, 'descriptions_filled')
    ],remainder='passthrough')

In [66]:
# fit descriptions to tfidf to get the vocabulary 
desc_tfidf = tfidf.fit(model_df['descriptions_filled'])

In [67]:
# these column headers will be used when creating the dataframe
column_headers = ['release_year','runtime','seasons']
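# list the TFIDF terms in column-index order; the key order of vocabulary_
# is not guaranteed to match the order of the transformed columns
column_headers += sorted(desc_tfidf.vocabulary_, key=desc_tfidf.vocabulary_.get)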
column_headers += ['tmdb_score']
column_headers += list(model_df.columns)[5:]

In [68]:
# transform and assign the new model_df_mm
model_df_mm = pd.DataFrame(CT_mm.fit_transform(model_df).toarray(),
                             index=model_df.index,
                             columns=column_headers)

> I created the Min Max Scaler dataframe. Now I will look at the number of null values in the `tmdb_score` and fill them using the KNN Imputer. After that I will transform the column using Min Max Scaler.

In [69]:
# look at the number of null values in tmdb_score column
model_df_mm['tmdb_score'].isna().sum()

81

In [70]:
# Instantiate KNN Imputer
knn_impute = KNNImputer()

# fit_transform model_df_mm
model_knn_mm = knn_impute.fit_transform(model_df_mm)

# transform model_knn_mm to dataframe
model_df_mm = pd.DataFrame(model_knn_mm, 
                           columns=model_df_mm.columns,
                           index=model_df_mm.index)

In [71]:
# fit transform tmdb_score with Min Max scaler
model_df_mm['tmdb_score'] = minmax.fit_transform(pd.DataFrame(model_df_mm['tmdb_score']))

In [72]:
model_df_mm.shape

(1595, 526)

> I have completed the preprocessing for the data. I now have two dataframes to experiment with for my recommender systems. They use two different forms of scaling: `model_df_ss` uses Standard Scaling and `model_df_mm` uses Min Max Scaling.

## Recommender System

>I am going to create two recommender systems. One using Cosine Similarity and the other using Nearest Neighbors.
>
>Before creating the recommender systems, I have to create a lookup table that maps each `jw_entity_id` to its corresponding title. I need this because the recommender system will return the `jw_entity_id` for the recommended titles, and the lookup table will allow me to retrieve the corresponding titles and other information for those ids.

In [73]:
# creating look up table
lookup_table = df[['jw_entity_id','title','type','release_year','seasons']].set_index('jw_entity_id')

# changing release year to int from float
lookup_table['release_year'] = lookup_table['release_year'].astype(int)

### Cosine Similarity

>First I'll use a cosine similarity approach on `model_df_ss` followed by `model_df_mm`.

#### Cosine Similarity with `model_df_ss`

In [74]:
# get the index for a title to test
title_index = lookup_table.index[lookup_table['title'] == 'Cowboy Bebop']

# get the row of the title_index from model_df
title_array = np.array(model_df_ss.loc[title_index])

# reshape it so it can be passed to cosine_sim function
title_array = title_array.reshape(1,-1)

Now I will create the cosine similarity matrix using `model_df_ss` and the `title_array`.

In [75]:
# cosine similarity matrix
cosine_matrix = cosine_similarity(model_df_ss, title_array)

# create a dataframe from the cosine_matrix
cosine_df = pd.DataFrame(data=cosine_matrix, index=model_df_ss.index)

In [76]:
# top 10 results of the cosine_df
results = cosine_df.sort_values(0, ascending=False).index.values[1:11]

These are the `jw_entity_id` values of the recommended titles. I'll look up these values in the `lookup_table`.

In [77]:
# this dataframe will return the title and the cosine similarities
# (Series.sort_values takes no column argument; ascending is passed by keyword)
results_df_ss = pd.concat([lookup_table.loc[results], 
                        cosine_df[0].sort_values(ascending=False)], axis=1).iloc[:10]
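
>The ranking logic can be seen more easily on a tiny example. This is a minimal sketch with made-up ids and feature values, not the real feature matrix.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature matrix (invented values) indexed by made-up ids
features = pd.DataFrame(
    [[1.0, 0.0, 1.0], [0.9, 0.1, 0.8], [0.0, 1.0, 0.0]],
    index=['id_a', 'id_b', 'id_c'],
)

# Query row kept 2D, as cosine_similarity expects
query = features.loc[['id_a']].to_numpy()

# One similarity score per row, relative to the query
scores = pd.Series(
    cosine_similarity(features.to_numpy(), query).ravel(),
    index=features.index,
)

# Sort descending and skip the first entry, which is the query itself
ranked = scores.sort_values(ascending=False).iloc[1:]
print(ranked)
```

>The query always has a similarity of 1 with itself, which is why the first sorted entry is skipped.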

#### Cosine Similarity with `model_df_mm`

In [78]:
# get the index for a title to test
title_index = lookup_table.index[lookup_table['title'] == 'Cowboy Bebop']

# get the row of the title_index from model_df
title_array = np.array(model_df_mm.loc[title_index])

# reshape it so it can be passed to cosine_sim function
title_array = title_array.reshape(1,-1)

Now I will create the cosine similarity matrix using `model_df_mm` and the `title_array`.

In [79]:
# cosine similarity matrix
cosine_matrix = cosine_similarity(model_df_mm, title_array)

# create a dataframe from the cosine_matrix
cosine_df = pd.DataFrame(data=cosine_matrix, index=model_df_mm.index)

In [80]:
# top 10 results of the cosine_df
results = cosine_df.sort_values(0, ascending=False).index.values[1:11]

These are the `jw_entity_id` values of the recommended titles. I'll look up these values in the `lookup_table`.

In [81]:
# this dataframe will return the title and the cosine similarities
# (Series.sort_values takes no column argument; ascending is passed by keyword)
results_df_mm = pd.concat([lookup_table.loc[results], 
                        cosine_df[0].sort_values(ascending=False)], axis=1).iloc[:10]

#### Comparing the results from `model_df_ss` and `model_df_mm` for the title Cowboy Bebop

In [82]:
# top 10 results from model_df_ss
results_df_ss

Unnamed: 0,title,type,release_year,seasons,0
ts25497,Outlaw Star,show,1998.0,1.0,0.910293
ts22056,TRIGUN,show,1998.0,1.0,0.880682
ts39908,Kurau Phantom Memory,show,2004.0,1.0,0.846936
ts28167,Martian Successor Nadesico,show,1996.0,1.0,0.813077
ts31799,Samurai Champloo,show,2004.0,1.0,0.811085
ts13543,Excel Saga,show,1999.0,1.0,0.807988
ts27269,The Vision of Escaflowne,show,1996.0,1.0,0.793097
ts34299,Lost Universe,show,1998.0,1.0,0.79256
ts6619,Chrono Crusade,show,2003.0,1.0,0.790903
ts13335,DNA²,show,1994.0,1.0,0.786419


In [83]:
# top 10 results from model_df_mm
results_df_mm

Unnamed: 0,title,type,release_year,seasons,0
ts25497,Outlaw Star,show,1998.0,1.0,0.899702
ts39908,Kurau Phantom Memory,show,2004.0,1.0,0.846919
ts22056,TRIGUN,show,1998.0,1.0,0.844069
ts42254,Servamp,show,2016.0,1.0,0.842214
ts56253,The Silver Guardian,show,2017.0,2.0,0.840162
ts27999,Robotics;Notes,show,2012.0,1.0,0.838785
ts20429,Sword Art Online,show,2012.0,4.0,0.823257
ts22244,Steins;Gate,show,2011.0,2.0,0.801959
ts296797,Black Clover,show,2017.0,4.0,0.801021
ts22386,RWBY,show,2013.0,8.0,0.79983


### Nearest Neighbors

>Another approach for recommender systems is Nearest Neighbors. First I'll use a nearest neighbors approach on `model_df_ss` followed by `model_df_mm`.
>
>First I have to instantiate a `NearestNeighbors` object.

In [84]:
# Instantiate Nearest Neighbors
nn = NearestNeighbors(n_neighbors=10, metric='manhattan')
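
>As a minimal sketch of how a Manhattan-distance neighbor lookup behaves, the snippet below runs `NearestNeighbors` on four made-up rows. The ids and feature values are invented for illustration.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy feature matrix (invented values) indexed by made-up ids
features = pd.DataFrame(
    [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.4, 0.0]],
    index=['id_a', 'id_b', 'id_c', 'id_d'],
)

# Manhattan distance is the sum of absolute differences per feature
toy_nn = NearestNeighbors(n_neighbors=2, metric='manhattan')
toy_nn.fit(features.to_numpy())

# Ask for one extra neighbor because the query row is its own nearest neighbor
query = features.loc[['id_a']].to_numpy()
positions = toy_nn.kneighbors(query, n_neighbors=3, return_distance=False).flatten()

# Map row positions back to ids and drop the query itself
neighbor_ids = features.index[positions][1:]
print(list(neighbor_ids))
```

>`kneighbors` returns row positions rather than ids, so the positions have to be mapped back through the index of the same dataframe the model was fit on.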

#### Nearest Neighbors Using `model_df_ss`

In [85]:
# fit the model on the model_df_ss
nn.fit(model_df_ss)

# get the index for a title to test
title_index = lookup_table.index[lookup_table['title'] == "Cowboy Bebop"]
# get the row of the title_index from model_df_ss
title_array = np.array(model_df_ss.loc[title_index])
# reshape it so it can be passed to the kneighbors method
title_array = title_array.reshape(1,-1)

# Return results using the .kneighbors method of the nn model
results_ss = nn.kneighbors(X=title_array, n_neighbors=11, return_distance=False).flatten()
# map the row positions back to ids using the dataframe the model was fit on
results_ss = model_df_ss.iloc[results_ss].index.values

Look up the resulting ids in the `lookup_table`

In [86]:
# look up table for the results of model_df_ss
lookup_table.loc[results_ss][1:]

Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts25497,Outlaw Star,show,1998,1
ts22056,TRIGUN,show,1998,1
ts31799,Samurai Champloo,show,2004,1
ts44317,Gankutsuou,show,2004,1
ts25674,Serial Experiments Lain,show,1998,1
ts39908,Kurau Phantom Memory,show,2004,1
ts28167,Martian Successor Nadesico,show,1996,1
ts35155,Gurren Lagann,show,2007,1
ts25067,Gasaraki,show,1998,1
ts27220,Gun x Sword,show,2005,1


#### Nearest Neighbors Using `model_df_mm`

In [87]:
# fit the model on the model_df_mm
nn.fit(model_df_mm)

# get the index for a title to test
title_index = lookup_table.index[lookup_table['title'] == "Cowboy Bebop"]
# get the row of the title_index from model_df_mm
title_array = np.array(model_df_mm.loc[title_index])
# reshape it so it can be passed to the kneighbors method
title_array = title_array.reshape(1,-1)

# Return results using the .kneighbors method of the nn model
results_mm = nn.kneighbors(X=title_array, n_neighbors=11, return_distance=False).flatten()
# map the row positions back to ids using the dataframe the model was fit on
results_mm = model_df_mm.iloc[results_mm].index.values

Look up the resulting ids in the `lookup_table`

In [88]:
# look up table for the results of model_df_mm
lookup_table.loc[results_mm][1:]

Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts25497,Outlaw Star,show,1998,1
ts31799,Samurai Champloo,show,2004,1
ts22056,TRIGUN,show,1998,1
ts44317,Gankutsuou,show,2004,1
ts39908,Kurau Phantom Memory,show,2004,1
ts14127,Fist of the North Star,show,1984,6
ts87210,Mobile Suit Gundam Wing Endless Waltz,show,1997,1
ts1340,El Cazador de la Bruja,show,2007,1
ts296797,Black Clover,show,2017,4
ts27220,Gun x Sword,show,2005,1


#### Initial Evaluation

>The recommendations from the cosine similarity and nearest neighbors systems, using the `model_df_ss` and `model_df_mm` dataframes, are similar to each other when tested on 'Cowboy Bebop', though they do give some different recommendations. Next I will create functions for Cosine Similarity and Nearest Neighbors that take a dataframe, a title, and the number of recommendations, and output the titles of the recommendations.
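
>One simple way to put a number on how much two recommendation lists agree is to count the shared ids. The sketch below does this for the 'Cowboy Bebop' result tables above. The helper name `overlap` is hypothetical, and the id lists are copied by hand from those tables.

```python
def overlap(ids_a, ids_b):
    """Return the shared ids and the Jaccard index of two id lists."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    return shared, len(shared) / len(set_a | set_b)

# Cosine similarity results: model_df_ss vs model_df_mm
cos_ss = ['ts25497', 'ts22056', 'ts39908', 'ts28167', 'ts31799',
          'ts13543', 'ts27269', 'ts34299', 'ts6619', 'ts13335']
cos_mm = ['ts25497', 'ts39908', 'ts22056', 'ts42254', 'ts56253',
          'ts27999', 'ts20429', 'ts22244', 'ts296797', 'ts22386']

# Nearest neighbors results: model_df_ss vs model_df_mm
nn_ss = ['ts25497', 'ts22056', 'ts31799', 'ts44317', 'ts25674',
         'ts39908', 'ts28167', 'ts35155', 'ts25067', 'ts27220']
nn_mm = ['ts25497', 'ts31799', 'ts22056', 'ts44317', 'ts39908',
         'ts14127', 'ts87210', 'ts1340', 'ts296797', 'ts27220']

cos_shared, cos_jaccard = overlap(cos_ss, cos_mm)
nn_shared, nn_jaccard = overlap(nn_ss, nn_mm)

print(len(cos_shared), round(cos_jaccard, 3))
print(len(nn_shared), round(nn_jaccard, 3))
```

>This only measures agreement between lists for a single title. It does not say which list is the better set of recommendations.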

## Creating Functions For Recommender Systems

>The final product for this project is a function that has the user input a title and a number of recommendations and outputs the top recommendations for that title. I will create two functions: one using cosine similarity and the other using nearest neighbors, both built from the code in the previous section. I will test them with a couple of titles on `model_df_ss` and `model_df_mm` to make sure they work. Once these functions run properly, I will create new functions from them that take in user preferences to further refine the recommendations.

### Cosine Similarity Function

>First I am creating the function using cosine similarity. The function will be very similar to the code blocks used for cosine similarity in the previous section, except the user will input a title and number of recommendations. The function will also take in a dataframe as an argument. This will make it easier to get recommendations from `model_df_ss` and `model_df_mm`.

In [89]:
# Creating cosine recommendation function
def cosine_rec(model_df):
    '''
    This function will return a number of recommendations based off 
    the title and number of recommendations the user inputs.
    '''
    # Input title you want recommendations for
    title = input("Enter title of show or movie: ")
    
    # Input number of recommendations you want
    num_recs = int(input("Enter number of recommendations: "))
    
    # Get the index of the row for the title
    title_index = lookup_table.index[lookup_table['title'] == title]
    # get the row of the title_index from model_df and reshape it
    title_array = np.array(model_df.loc[title_index]).reshape(1,-1)

    # Create the cosine similarity matrix based on the title
    # cosine similarity matrix
    cosine_matrix = cosine_similarity(model_df, title_array)

    # create a dataframe from the cosine_matrix
    cosine_df = pd.DataFrame(data=cosine_matrix, index=model_df.index)
    
    # top n results of the cosine_df
    results = cosine_df.sort_values(0, ascending=False).index.values[1:num_recs+1]
    
    # look up values in look up table and return the table
    return lookup_table.loc[results]

In [90]:
# testing cosine recommendation function on model_df_ss
cosine_rec(model_df_ss)

Enter title of show or movie: Samurai Champloo
Enter number of recommendations: 6


Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts39908,Kurau Phantom Memory,show,2004,1
ts34435,Cowboy Bebop,show,1998,1
ts21676,Hellsing,show,2001,1
ts32478,Kaiji,show,2007,2
ts25497,Outlaw Star,show,1998,1
ts199,Black Lagoon,show,2006,3
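
>Because `cosine_rec` reads its inputs with `input()`, it can only be used interactively. As a minimal sketch of a variant that takes the title and number of recommendations as arguments and reports an unknown title clearly, consider the following. The function name `cosine_rec_from_args` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def cosine_rec_from_args(model_df, lookup_table, title, num_recs):
    """Hypothetical variant of cosine_rec that takes its inputs as arguments."""
    # Find the id(s) whose title matches exactly
    title_index = lookup_table.index[lookup_table['title'] == title]
    if len(title_index) == 0:
        raise ValueError(f"Title not found: {title!r}")

    # Feature row of the first match, kept 2D for cosine_similarity
    title_array = model_df.loc[title_index[:1]].to_numpy()

    # Similarity of every row to the query row
    scores = pd.Series(
        cosine_similarity(model_df.to_numpy(), title_array).ravel(),
        index=model_df.index,
    )

    # Drop the query itself, then keep the top num_recs ids
    results = scores.drop(title_index[0]).sort_values(ascending=False).index[:num_recs]
    return lookup_table.loc[results]

# Toy data (invented) to exercise the function
toy_features = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    index=['id_a', 'id_b', 'id_c'],
)
toy_lookup = pd.DataFrame(
    {'title': ['Show A', 'Show B', 'Show C']},
    index=['id_a', 'id_b', 'id_c'],
)

recs = cosine_rec_from_args(toy_features, toy_lookup, 'Show A', 2)
print(recs)
```

>Passing the inputs as arguments also makes it easier to call the same logic from an app, where the values come from widgets rather than from the keyboard.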


In [91]:
# testing cosine recommendation function on model_df_mm
cosine_rec(model_df_mm)

Enter title of show or movie: Samurai Champloo
Enter number of recommendations: 6


Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts283022,SPY x FAMILY,show,2022,2
ts87037,ルパン三世 PART5,show,2018,1
ts42254,Servamp,show,2016,1
ts30077,Durarara!!,show,2010,2
ts199,Black Lagoon,show,2006,3
ts56253,The Silver Guardian,show,2017,2


### Nearest Neighbors Function

>Next, I am creating the function using nearest neighbors. The function will be very similar to the code blocks used for nearest neighbors in the previous section, except the user will input a title and number of recommendations. The function will also take in a dataframe as an argument. This will make it easier to get recommendations from `model_df_ss` and `model_df_mm`.

In [92]:
# Creating Nearest Neighbors recommendation function
def neighbors_rec(model_df):
    '''
    This function will return a number of recommendations based off 
    the title and number of recommendations the user inputs. 
    This function uses a Nearest Neighbors approach to content based
    recommendations.
    '''
    # Input title you want recommendations for
    title = input("Enter title of show or movie: ")
    
    # Input number of recommendations you want
    num_recs = int(input("Enter number of recommendations: "))
    
    # Get the index of the row for the title
    title_index = lookup_table.index[lookup_table['title'] == title]
    # get the row of the title_index from model_df and reshape it
    title_array = np.array(model_df.loc[title_index]).reshape(1,-1)
    
    # Instantiate Nearest Neighbors
    nn = NearestNeighbors(n_neighbors=10, metric='manhattan')
    # fit the model on the model_df
    nn.fit(model_df)

    # Return results using the .kneighbors() method of the nn model
    results = nn.kneighbors(X=title_array, n_neighbors=num_recs+1, 
                             return_distance=False).flatten()
    results = model_df.iloc[results].index.values[1:]
    
    # look up values in look up table and return the table
    return lookup_table.loc[results]

In [93]:
# testing neighbors_rec function on model_df_ss
neighbors_rec(model_df_ss)

Enter title of show or movie: Samurai Champloo
Enter number of recommendations: 6


Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts44317,Gankutsuou,show,2004,1
ts87037,ルパン三世 PART5,show,2018,1
ts9527,Eden of the East,show,2009,1
ts199,Black Lagoon,show,2006,3
ts186408,Keep Your Hands Off Eizouken!,show,2020,1
ts39908,Kurau Phantom Memory,show,2004,1


In [94]:
# testing neighbors_rec function model_df_mm
neighbors_rec(model_df_mm)

Enter title of show or movie: Samurai Champloo
Enter number of recommendations: 6


Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts87037,ルパン三世 PART5,show,2018,1
ts84687,Voice of Fox,show,2018,1
ts186408,Keep Your Hands Off Eizouken!,show,2020,1
ts199,Black Lagoon,show,2006,3
ts9527,Eden of the East,show,2009,1
ts52836,The Numbers,show,2016,2


## Adding User Preferences To Recommender Functions

>I want to build upon the functions I created for cosine similarity and nearest neighbors by adding user preferences. This means users can input whether they want to return shows or movies and/or a genre they want recommended. For example, a user can input 'Attack on Titan', 'movie', and 'Comedy' and it will return comedy movies similar to 'Attack on Titan'.
>
>First I will create this function for cosine similarity.

In [95]:
# Creating cosine recommendation function
def preference_cosine_rec(model_df):
    '''
    This function will return a number of recommendations based off 
    the title and number of recommendations the user inputs.
    '''
    # Input title you want recommendations for
    title = input("Enter title of show or movie: ")
    
    # Input number of recommendations you want
    num_recs = int(input("Enter number of recommendations: "))
    
    # Input movie, show, no preference
    type_pref = input('Enter show, movie, or no preference: ')
    type_pref = type_pref.lower()
    
    # Input a genre, or none for no genre preference
    genre_pref = input('Enter a genre or none for no genre: ')   
    
    # Filtering model_df based on type_pref
    if type_pref == 'show':
        filtered_model_df = model_df.loc[model_df['show_'] == 1]
    elif type_pref == 'movie':
        filtered_model_df = model_df.loc[model_df['movie_'] == 1]
    else:
        filtered_model_df = model_df
        
    # Filtering filtered_model_df by genre (skipped when the user enters 'none')
    if genre_pref.lower() != 'none':
        filtered_model_df = filtered_model_df.loc[filtered_model_df[genre_pref] == 1]
    
    # Get the index of the row for the title
    title_index = lookup_table.index[lookup_table['title'] == title]
    # get the row of the title_index from model_df and reshape it
    title_array = np.array(model_df.loc[title_index]).reshape(1,-1)
    
    # Check if the title's id is in filtered_model_df.index
    # Add the title's row to filtered_model_df if it is not
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    if title_index[0] not in filtered_model_df.index:
        filtered_model_df = pd.concat([filtered_model_df, model_df.loc[title_index]])

    # Create the cosine similarity matrix based on the title
    # cosine similarity matrix
    cosine_matrix = cosine_similarity(filtered_model_df, title_array)

    # Create a dataframe from the cosine_matrix
    cosine_df = pd.DataFrame(data=cosine_matrix, index=filtered_model_df.index)
    
    # Top n results of the cosine_df
    results = cosine_df.sort_values(0, ascending=False).index.values[:num_recs+1]
    
    # Look up values in look up table and return the table
    return lookup_table.loc[results][1:]

In [96]:
# testing function on model_df_ss
preference_cosine_rec(model_df_ss)

Enter title of show or movie: Demon Slayer: Kimetsu no Yaiba
Enter number of recommendations: 6
Enter show, movie, or no preference: show
Enter a genre or none for no genre: Comedy


Unnamed: 0_level_0,title,type,release_year,seasons
jw_entity_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ts20433,JoJo's Bizarre Adventure,show,2012,5
ts236473,Cells at Work! Code Black,show,2021,2
ts296797,Black Clover,show,2017,4
ts39013,Mob Psycho 100,show,2016,3
ts44928,KonoSuba – God's blessing on this wonderful wo...,show,2016,3
ts223258,Gleipnir,show,2020,1


In [97]:
# testing function on model_df_mm
preference_cosine_rec(model_df_mm)

Enter title of show or movie: Demon Slayer: Kimetsu no Yaiba
Enter number of recommendations: 6
Enter show, movie, or no preference: show
Enter a genre or none for no genre: Comedy


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| ts20433 | JoJo's Bizarre Adventure | show | 2012 | 5 |
| ts6619 | Chrono Crusade | show | 2003 | 1 |
| ts15151 | Bleach | show | 2004 | 16 |
| ts223258 | Gleipnir | show | 2020 | 1 |
| ts7449 | D.Gray-man | show | 2006 | 2 |
| ts21740 | Black Butler | show | 2008 | 3 |


### Nearest Neighbors Preferences Function

>Next, I create the same preference-aware function using a nearest neighbors approach instead of cosine similarity.

In [98]:
# Creating Nearest Neighbors recommendation function
def preferences_neighbors_rec(model_df):
    '''
    This function returns a number of recommendations based on the
    title, number of recommendations, content type, and genre the
    user inputs. It uses a Nearest Neighbors approach to content based
    recommendations.
    '''
    # Input title you want recommendations for
    title = input("Enter title of show or movie: ")
    
    # Input number of recommendations you want
    num_recs = int(input("Enter number of recommendations: "))
    
    # Input movie, show, no preference
    type_pref = input('Enter show, movie, or no preference: ')
    type_pref = type_pref.lower()
    
    # Input a genre, or none for no genre preference
    genre_pref = input('Enter a genre or none for no genre: ')   
    
    # Filtering model_df based on type_pref
    if type_pref == 'show':
        filtered_model_df = model_df.loc[model_df['show_'] == 1].copy()
    elif type_pref == 'movie':
        filtered_model_df = model_df.loc[model_df['movie_'] == 1].copy()
    else:
        # 'none', 'no preference', or any other entry: keep every title
        filtered_model_df = model_df.copy()
        
    # Filtering filtered_model_df by genre
    if genre_pref.lower() == 'none':
        pass
    else:
        filtered_model_df = filtered_model_df.loc[filtered_model_df[genre_pref] == 1]    
    
    # Get the index of the row for the title
    title_index = lookup_table.index[lookup_table['title'] == title]
    # get the row of the title_index from model_df and reshape it
    title_array = np.array(model_df.loc[title_index]).reshape(1,-1)
    
    # Make sure the queried title's row is in filtered_model_df so it can
    # be found (and then dropped) among the neighbors
    if not title_index.isin(filtered_model_df.index).all():
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        filtered_model_df = pd.concat([filtered_model_df, model_df.loc[title_index]])

    # Instantiate Nearest Neighbors
    nn = NearestNeighbors(n_neighbors=10, metric='manhattan')
    # Fit the model on the filtered_model_df
    nn.fit(filtered_model_df)

    # Return results using the .kneighbors() method of the nn model
    results = nn.kneighbors(X=title_array, n_neighbors=num_recs+1, 
                             return_distance=False).flatten()
    
    results = filtered_model_df.iloc[results].index.values[1:]
    
    # look up values in look up table and return the table
    return lookup_table.loc[results]
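
>To show what the `manhattan` metric is doing, here is a minimal, dependency-free sketch of a nearest neighbors lookup. It is not the scikit-learn implementation: `manhattan`, `nearest_neighbors`, and the `toy_features` rows are hypothetical names and made-up values. Manhattan distance adds up the absolute differences per feature, so it depends on the magnitude of the values, while cosine similarity depends only on the direction of the rows.

```python
def manhattan(u, v):
    """Manhattan (L1) distance: sum of absolute per-feature differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def nearest_neighbors(features, query_id, n):
    """Return the ids of the n rows closest to the query row."""
    query = features[query_id]
    # Smallest distance first; the query itself is excluded by id
    scored = sorted(
        (manhattan(vec, query), item_id)
        for item_id, vec in features.items()
        if item_id != query_id
    )
    return [item_id for _, item_id in scored[:n]]

# Made-up feature rows (assumption: real rows are scaled, one-hot encoded features)
toy_features = {
    "query": [1.0, 1.0, 0.0],
    "same_direction": [4.0, 4.0, 0.0],
    "partly_similar": [1.0, 0.0, 0.0],
    "orthogonal": [0.0, 0.0, 1.0],
}

neighbors = nearest_neighbors(toy_features, "query", 2)
print(neighbors)
```

>In this toy data, `same_direction` points the same way as the query, so cosine similarity would rank it first, but its larger values put it farthest away under Manhattan distance. A difference of this kind is one plausible reason the two functions return different lists for the same title.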

In [99]:
# testing function on model_df_ss
preferences_neighbors_rec(model_df_ss)

Enter title of show or movie: Demon Slayer: Kimetsu no Yaiba
Enter number of recommendations: 6
Enter show, movie, or no preference: show
Enter a genre or none for no genre: Comedy


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| ts20433 | JoJo's Bizarre Adventure | show | 2012 | 5 |
| ts296797 | Black Clover | show | 2017 | 4 |
| ts39013 | Mob Psycho 100 | show | 2016 | 3 |
| ts285135 | Mieruko-chan | show | 2021 | 1 |
| ts223258 | Gleipnir | show | 2020 | 1 |
| ts44928 | KonoSuba – God's blessing on this wonderful wo... | show | 2016 | 3 |


In [100]:
# testing function on model_df_mm
preferences_neighbors_rec(model_df_mm)

Enter title of show or movie: Demon Slayer: Kimetsu no Yaiba
Enter number of recommendations: 6
Enter show, movie, or no preference: show
Enter a genre or none for no genre: Comedy


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| ts20433 | JoJo's Bizarre Adventure | show | 2012 | 5 |
| ts285135 | Mieruko-chan | show | 2021 | 1 |
| ts25863 | Soul Eater | show | 2008 | 1 |
| ts55906 | Bikini Warriors | show | 2015 | 1 |
| ts296797 | Black Clover | show | 2017 | 4 |
| ts28370 | Fairy Musketeers | show | 2006 | 2 |


## Evaluation and Final Recommender System

> This notebook now contains four recommender variants: cosine similarity or nearest neighbors, each paired with either `model_df_ss` or `model_df_mm`. After comparing the recommendations each variant returned, I chose cosine similarity with `model_df_ss` as my final recommender system. Overall, I am very happy with the recommendations it gives for the titles I entered. Below I show a variety of recommendations from this final system.
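
> The notebook does not spell out how the four variants were compared. One hypothetical way to put a number on how much they agree is the Jaccard overlap between their recommendation lists. The sketch below is an assumption, not the method used here. It uses the ids from the four "Demon Slayer: Kimetsu no Yaiba" test outputs above; `jaccard`, `recs`, and `overlap` are illustrative names.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of distinct ids that two recommendation lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Ids copied from the four test runs (title: Demon Slayer, show, Comedy, 6 recs)
recs = {
    "cosine_ss": ["ts20433", "ts236473", "ts296797", "ts39013", "ts44928", "ts223258"],
    "cosine_mm": ["ts20433", "ts6619", "ts15151", "ts223258", "ts7449", "ts21740"],
    "nn_ss": ["ts20433", "ts296797", "ts39013", "ts285135", "ts223258", "ts44928"],
    "nn_mm": ["ts20433", "ts285135", "ts25863", "ts55906", "ts296797", "ts28370"],
}

# Pairwise overlap between every two variants
overlap = {
    (x, y): jaccard(recs[x], recs[y])
    for x, y in combinations(recs, 2)
}

for pair, score in overlap.items():
    print(pair, round(score, 2))
```

> For this one query, the two `model_df_ss` variants share five of their six titles, while the `model_df_mm` variants share only one. Overlap measures agreement between systems, not the quality of the recommendations, so it can only complement a manual review of the lists.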

In [101]:
# testing cosine similarity and model_df_ss
preference_cosine_rec(model_df_ss)

Enter title of show or movie: One Piece
Enter number of recommendations: 5
Enter show, movie, or no preference: none
Enter a genre or none for no genre: none


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| ts22069 | Naruto Shippūden | show | 2007 | 24 |
| ts10145 | Case Closed | show | 1996 | 51 |
| ts15151 | Bleach | show | 2004 | 16 |
| ts20740 | Dragon Ball Z | show | 1989 | 16 |
| ts94758 | The Journey Home | show | 2015 | 20 |


In [102]:
# testing cosine similarity and model_df_ss
preference_cosine_rec(model_df_ss)

Enter title of show or movie: Re:ZERO -Starting Life in Another World-
Enter number of recommendations: 5
Enter show, movie, or no preference: none
Enter a genre or none for no genre: none


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| ts75762 | GARO -VANISHING LINE- | show | 2017 | 1 |
| ts20250 | Parasyte -the maxim- | show | 2014 | 1 |
| ts80096 | Kakuriyo -Bed & Breakfast for Spirits- | show | 2018 | 1 |
| ts84633 | The Promised Neverland | show | 2019 | 2 |
| ts20312 | Brynhildr in the Darkness | show | 2014 | 1 |


In [103]:
# testing cosine similarity and model_df_ss
preference_cosine_rec(model_df_ss)

Enter title of show or movie: Attack on Titan
Enter number of recommendations: 6
Enter show, movie, or no preference: show
Enter a genre or none for no genre: Romance


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| ts25765 | Vampire Knight | show | 2008 | 2 |
| ts75516 | The Ancient Magus' Bride | show | 2017 | 2 |
| ts41257 | Re:ZERO -Starting Life in Another World- | show | 2016 | 2 |
| ts22426 | The Irregular at Magic High School | show | 2014 | 2 |
| ts20014 | Sailor Moon Crystal | show | 2014 | 4 |
| ts28010 | High School DxD | show | 2012 | 4 |

In [104]:
# testing cosine similarity and model_df_ss
preference_cosine_rec(model_df_ss)

Enter title of show or movie: Dragon Ball
Enter number of recommendations: 5
Enter show, movie, or no preference: movie
Enter a genre or none for no genre: Romance


| jw_entity_id | title | type | release_year | seasons |
|---|---|---|---|---|
| tm185557 | Sailor Moon R: The Movie | movie | 1993 | 0 |
| tm192670 | Tenchi the Movie 2: The Daughter of Darkness | movie | 1997 | 0 |
| tm114586 | Tenchi Muyo! In Love | movie | 1996 | 0 |
| tm49446 | Revolutionary Girl Utena: The Adolescence of U... | movie | 1999 | 0 |
| tm23135 | Escaflowne: The Movie | movie | 2000 | 0 |


## Next Steps - Streamlit App

>The next part of the project is to create an app with Streamlit. The app will use the cosine similarity function I created and ask the user for a title, number of recommendations, type of content (movie or show), and genre, then return the recommendations for that title. I need to save the `model_df_ss` and `lookup_table` dataframes to CSV files so I can use them in the Streamlit app.

In [105]:
# save files to csv
model_df_ss.to_csv('./Data/model_df.csv')
lookup_table.to_csv('./Data/lookup_table.csv')
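
>One detail to keep in mind when the app reads these files back: `to_csv` writes the dataframe index as the first column, so `read_csv` needs `index_col=0` to restore `jw_entity_id` as the index that `.loc` lookups rely on. The sketch below is a minimal illustration under that assumption. It round-trips a made-up two-row table through an in-memory buffer instead of the real files, and `toy_lookup` and `restored` are hypothetical names.

```python
import io

import pandas as pd

# Made-up stand-in for lookup_table, indexed by jw_entity_id
toy_lookup = pd.DataFrame(
    {"title": ["Bleach", "Black Clover"], "type": ["show", "show"]},
    index=pd.Index(["ts15151", "ts296797"], name="jw_entity_id"),
)

# Write to an in-memory buffer (stands in for a file under ./Data/)
buffer = io.StringIO()
toy_lookup.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index
restored = pd.read_csv(buffer, index_col=0)

print(restored.loc["ts15151", "title"])
```

>Without `index_col=0`, the ids would load as an ordinary column and the index would be a default integer range, so lookups by `jw_entity_id` would fail.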