# Data Exploration and Analysis
The following notebook will take a look at the data sets that will be used in the recommendation engine and prep the data for use in the engine itself.<br><br>
## Section 1
The first part of the notebook will take a look at the animes.csv data. All information of the anime shows will be in this file. Section 1 will deal with Knowledge Based Recommendations.

## Section 2
Section 2 will be about Content Based Recommendations using the animes.csv data.

## Sections 3
Section 3 will be about Collaborative Based Recommendations using the reviews.csv and users.csv files.

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Read in Data

In [17]:
animes_df = pd.read_csv('data/animes.csv')
users_df = pd.read_csv('data/profiles.csv')
reviews_df = pd.read_csv('data/reviews.csv')

print("Animes DF: {}\nUsers DF: {}\nReviews DF: {}".format(animes_df.shape, users_df.shape, reviews_df.shape))

Animes DF: (19311, 12)
Users DF: (81727, 5)
Reviews DF: (192112, 7)


# Section 1 - Knowledge Based Filtering
For knowledge based filtering, I want to classify each anime by the different genres they fall into and which decade they belong to. Many animes come out every year so filtering by specific year would be unneccesary. The recommendations coming from the filters will be ranked based. <br>
The steps for section one will be the following:
- separate all genres into their own column with a 1 in the cell if an anime falls in that category, 0 otherwise
- extract the air dates of the animes to define which decade they belong to. 1 if anime aired in that decade, 0 otherwise
- create ranked based filtering function to be used in the web app

In [18]:
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...


First I want to check for duplicate or null rows. If there are, I want to get rid of them. I purposely used this dataset because there were no null values. So I expect isnull().sum() to return 0.

In [19]:
# check how many duplicate rows are there
animes_df.duplicated().sum()

2943

In [20]:
# delete duplicate rows
animes_df.drop_duplicates(inplace=True)

In [21]:
# check to see if duplicate rows were deleted
animes_df.duplicated().sum()

0

In [22]:
# check to see if any rows are null
animes_df.genre.isnull().sum()

0

#### IMPORTANT:
The animes dataframe is too big for this project so I will work with a subset of the dataframe It would be a burden for the graders to review and it will also be a little too much for my computer to handle. So I will cut the data frame in half and use a subset from now on. I will sort the dataframe first by rank in ascending order because a recommendation system should only provide high ranked/rated content. The metrics for this system are also based on rank so low ranked/rated shows would not be used anyway.

In [23]:
# sort by rank
animes_df = animes_df.sort_values(by='ranked', ascending=True)
# keep first half of dataframe
half = int(animes_df.shape[0] / 2)
animes_df = animes_df.head(half)

In [24]:
animes_df.shape

(8184, 12)

One thing I noticed when looking through the dataset was that some of the show titles have special characters. That will make routing to the show-info page result in an error so they need to be replaced.

In [25]:
# get rid of any / in the show titles and replace with -
animes_df['title'] = [w.replace('/','-') for w in animes_df['title']]

In [26]:
# delete any ? in the show titles
animes_df['title'] = [w.replace('?','') for w in animes_df['title']]

Now I need to extract the genres for every anime. As seen below, the genres are held as strings so I need to split the string by the ',' to get every genre

In [27]:
animes_df.iloc[0]['genre']

"['Action', 'Military', 'Adventure', 'Comedy', 'Drama', 'Magic', 'Fantasy', 'Shounen']"

In [28]:
animes_df['genre'].dtype

dtype('O')

I need to loop through the entire genre column to get a set of every possible genre type there is in the dataset.

In [29]:
genres = []

# for every anime, split the genre column value to get each genre type
for genre_set in animes_df.genre:
    values = genre_set.strip("[]").split(",")
    values = [w.strip()[1:-1] for w in values]
    
    # add genres to list
    genres.extend(values)

# drop all duplicate values
genres = set(genres)
print("The number of genres is {}.".format(len(genres)))
print(genres)

The number of genres is 41.
{'', 'Thriller', 'Kids', 'Supernatural', 'Sports', 'Magic', 'Game', 'Demons', 'Slice of Life', 'Military', 'Sci-Fi', 'Josei', 'Historical', 'Cars', 'Samurai', 'Horror', 'Psychological', 'Super Power', 'Fantasy', 'Martial Arts', 'School', 'Mecha', 'Shoujo', 'Shounen', 'Music', 'Space', 'Romance', 'Comedy', 'Ecchi', 'Harem', 'Shounen Ai', 'Mystery', 'Vampire', 'Parody', 'Seinen', 'Action', 'Drama', 'Adventure', 'Police', 'Shoujo Ai', 'Dementia'}


Notice the first element is empty. This happend when calling split(","). We can quickly delete that.

In [30]:
# delete first empty element
genres = list(genres)
genres.pop(0)
# alphabetize the list of genres
genres = sorted(genres)
print("The number of genres is {}.".format(len(genres)))
print(genres)

The number of genres is 40.
['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Historical', 'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire']


Now that I have a list of all possible genre types. I wanted to make a column for each genre. The value of the column will be a 1 if an anime is listed under this genre or a 0 if it is not.

In [31]:
def split_genres(anime):
    '''
    Will split the genre column of any anime row and return a 1 if the anime is listed in that genre.
    
    INPUT:
    anime - a string of the genres column for a specific anime
    
    OUTPUT:
    1 - if anime is listed in genre
    0 - if anime is not listed in genre
    '''
    try:
        # if genre is listed in the 'genre' column of an anime
        if anime.find(genre) > -1:
            return 1
        else:
            return 0
    except AttributeError:
        return 0

# create column for each genre and fill in columns
for genre in genres:
    animes_df[genre] = animes_df['genre'].apply(split_genres)

Now there is an easier way to identify the genres that an anime is listed as. This will be used for filtering and content based recommendations.

In [32]:
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,...,Shoujo Ai,Shounen,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,...,0,1,0,0,0,0,0,0,0,0
773,9253,Steins;Gate,The self-proclaimed mad scientist Rintarou Oka...,"['Thriller', 'Sci-Fi']","Apr 6, 2011 to Sep 14, 2011",24.0,1331710,7,2.0,9.11,...,0,0,0,0,0,0,0,0,1,0
772,11061,Hunter x Hunter (2011),Hunter x Hunter is set in a world where Hunte...,"['Action', 'Adventure', 'Fantasy', 'Shounen', ...","Oct 2, 2011 to Sep 24, 2014",148.0,1052761,20,3.0,9.11,...,0,1,0,0,0,0,1,0,0,0
771,32281,Kimi no Na wa.,"Mitsuha Miyamizu, a high school girl, yearns t...","['Romance', 'Supernatural', 'School', 'Drama']","Aug 26, 2016",1.0,1139878,15,4.0,9.09,...,0,0,0,0,0,0,0,1,0,0
770,38524,Shingeki no Kyojin Season 3 Part 2,Seeking to restore humanity’s diminishing hope...,"['Action', 'Drama', 'Fantasy', 'Military', 'My...","Apr 29, 2019 to Jul 1, 2019",10.0,446370,175,5.0,9.07,...,0,1,0,0,0,0,1,0,0,0


## Create Decade columns for further filtering

I followed the same process as above and extracted the decades that these shows aired and created a column for each decade. 1 will indicated that the show is from that decade, otherwise 0. I only want the year when it first aired and not the end year.

In [33]:
animes_df['aired'].value_counts()

2002                            9
Jul 5, 2013 to Sep 27, 2013     8
Apr 21, 2007                    7
Not available                   7
2008                            7
                               ..
Jul 14, 2006 to Sep 29, 2006    1
Jul 1, 1983 to Jun 29, 1984     1
Aug 31, 2012 to Dec 22, 2012    1
Oct 4, 2003 to Dec 27, 2003     1
Mar 10, 2014 to Mar 21, 2014    1
Name: aired, Length: 6388, dtype: int64

The format of the 'aired' column changes from show to show so it'll be difficult to tell what the earliest air date is. I refined the function below after many errors to handle all of the formats. I grouped together everything that is Pre 1970's for my own sanity.

In [34]:
def split_anime_decade(val, decade):
    '''
    INPUT:
    val the 'aired' column value for a row
    decade - one of the values from the (list) decade list 
             below this cell
    
    OUTPUT:
    1 - if show in decade
    0 - if show not in decade
    '''
    # extract year from 'aired' column
    try:
        year = val.split(',')[1]
        year = year.strip()[:4]
    except:
        year = val.strip()[:4]
    
    # decide wether show belongs to decade
    try:
        if decade == 'Pre 1970':
            if int(year) < 1970:
                return 1
            return 0
        if int(year) >= int(decade) and int(year) < int(decade) + 10:
            return 1
        else:
            return 0
    except:
        return 0

In [35]:
# valid decades
decades = ['Pre 1970s', '1970s', '1980s', '1990s', '2000s', '2010s']

# for every decade, find what shows belong to it
for decade in decades:
    column = []
    for row in animes_df['aired']:
        column.append(split_anime_decade(row, decade[:-1]))
    animes_df[decade] = column
    
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,...,Super Power,Supernatural,Thriller,Vampire,Pre 1970s,1970s,1980s,1990s,2000s,2010s
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,...,0,0,0,0,0,0,0,0,1,0
773,9253,Steins;Gate,The self-proclaimed mad scientist Rintarou Oka...,"['Thriller', 'Sci-Fi']","Apr 6, 2011 to Sep 14, 2011",24.0,1331710,7,2.0,9.11,...,0,0,1,0,0,0,0,0,0,1
772,11061,Hunter x Hunter (2011),Hunter x Hunter is set in a world where Hunte...,"['Action', 'Adventure', 'Fantasy', 'Shounen', ...","Oct 2, 2011 to Sep 24, 2014",148.0,1052761,20,3.0,9.11,...,1,0,0,0,0,0,0,0,0,1
771,32281,Kimi no Na wa.,"Mitsuha Miyamizu, a high school girl, yearns t...","['Romance', 'Supernatural', 'School', 'Drama']","Aug 26, 2016",1.0,1139878,15,4.0,9.09,...,0,1,0,0,0,0,0,0,0,1
770,38524,Shingeki no Kyojin Season 3 Part 2,Seeking to restore humanity’s diminishing hope...,"['Action', 'Drama', 'Fantasy', 'Military', 'My...","Apr 29, 2019 to Jul 1, 2019",10.0,446370,175,5.0,9.07,...,1,0,0,0,0,0,0,0,0,1


### The genre and aired columns are dropped because they are no longer needed. There is an easier way to identify what genres a show belongs to now.

In [36]:
animes_df = animes_df.drop(['genre', 'aired'], axis=1)
animes_df.head()

Unnamed: 0,uid,title,synopsis,episodes,members,popularity,ranked,score,img_url,link,...,Super Power,Supernatural,Thriller,Vampire,Pre 1970s,1970s,1980s,1990s,2000s,2010s
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...,...,0,0,0,0,0,0,0,0,1,0
773,9253,Steins;Gate,The self-proclaimed mad scientist Rintarou Oka...,24.0,1331710,7,2.0,9.11,https://cdn.myanimelist.net/images/anime/5/731...,https://myanimelist.net/anime/9253/Steins_Gate,...,0,0,1,0,0,0,0,0,0,1
772,11061,Hunter x Hunter (2011),Hunter x Hunter is set in a world where Hunte...,148.0,1052761,20,3.0,9.11,https://cdn.myanimelist.net/images/anime/11/33...,https://myanimelist.net/anime/11061/Hunter_x_H...,...,1,0,0,0,0,0,0,0,0,1
771,32281,Kimi no Na wa.,"Mitsuha Miyamizu, a high school girl, yearns t...",1.0,1139878,15,4.0,9.09,https://cdn.myanimelist.net/images/anime/5/870...,https://myanimelist.net/anime/32281/Kimi_no_Na_wa,...,0,1,0,0,0,0,0,0,0,1
770,38524,Shingeki no Kyojin Season 3 Part 2,Seeking to restore humanity’s diminishing hope...,10.0,446370,175,5.0,9.07,https://cdn.myanimelist.net/images/anime/1517/...,https://myanimelist.net/anime/38524/Shingeki_n...,...,1,0,0,0,0,0,0,0,0,1


There are two other columns that won't be needed for this project and that is the number of members and it's popularity. They can be used to further filter items but this project will be graded on ranked based metrics. Only ranked and score will be used for that. The 'link' column can also be deleted because there is no need for it and it will take up memory space.

In [37]:
animes_df = animes_df.drop(['members', 'popularity', 'link'], axis=1)
animes_df.head()

Unnamed: 0,uid,title,synopsis,episodes,ranked,score,img_url,Action,Adventure,Cars,...,Super Power,Supernatural,Thriller,Vampire,Pre 1970s,1970s,1980s,1990s,2000s,2010s
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...",64.0,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,1,1,0,...,0,0,0,0,0,0,0,0,1,0
773,9253,Steins;Gate,The self-proclaimed mad scientist Rintarou Oka...,24.0,2.0,9.11,https://cdn.myanimelist.net/images/anime/5/731...,0,0,0,...,0,0,1,0,0,0,0,0,0,1
772,11061,Hunter x Hunter (2011),Hunter x Hunter is set in a world where Hunte...,148.0,3.0,9.11,https://cdn.myanimelist.net/images/anime/11/33...,1,1,0,...,1,0,0,0,0,0,0,0,0,1
771,32281,Kimi no Na wa.,"Mitsuha Miyamizu, a high school girl, yearns t...",1.0,4.0,9.09,https://cdn.myanimelist.net/images/anime/5/870...,0,0,0,...,0,1,0,0,0,0,0,0,0,1
770,38524,Shingeki no Kyojin Season 3 Part 2,Seeking to restore humanity’s diminishing hope...,10.0,5.0,9.07,https://cdn.myanimelist.net/images/anime/1517/...,1,0,0,...,1,0,0,0,0,0,0,0,0,1


Now that I have clean data for anime shows, I can save it to a new csv file and only use that file from now on.

In [38]:
# save to a new csv file
animes_df.to_csv('./data/animes-clean.csv')

## Ranked Based Filtering
Now that the anime shows dataframe is clean. Ranked based filtering functions can now be made.

The first filtering function is simply sorting by the top ranked shows. This function will be used for the home page to recommend the top shows regardless of genre or decade.

In [81]:
def get_top_ranked(n, df=animes_df):
    '''
    INPUT:
    df - animes df from cells above
    n - number of recs to return
    
    OUTPUT:
    recs -  the name and img url of the all time top rated animes
    '''
    recs = []
    # sort all anime shows by their rank
    top_ranked = df.sort_values(by='ranked', ascending=True)
    
    # grab n recommendation
    for i in range(n):
        # will only need title and img_url for home page
        recs.append((top_ranked.iloc[i].title, top_ranked.iloc[i].img_url))
                    
    return recs

In [82]:
top_rated = get_top_ranked(10)
for show in top_rated:
    print(show)

('Fullmetal Alchemist: Brotherhood', 'https://cdn.myanimelist.net/images/anime/1223/96541.jpg')
('Steins;Gate', 'https://cdn.myanimelist.net/images/anime/5/73199.jpg')
('Hunter x Hunter (2011)', 'https://cdn.myanimelist.net/images/anime/11/33657.jpg')
('Kimi no Na wa.', 'https://cdn.myanimelist.net/images/anime/5/87048.jpg')
('Shingeki no Kyojin Season 3 Part 2', 'https://cdn.myanimelist.net/images/anime/1517/100633.jpg')
('Gintama°', 'https://cdn.myanimelist.net/images/anime/3/72078.jpg')
("Gintama'", 'https://cdn.myanimelist.net/images/anime/4/50361.jpg')
('Ginga Eiyuu Densetsu', 'https://cdn.myanimelist.net/images/anime/13/13225.jpg')
('3-gatsu no Lion 2nd Season', 'https://cdn.myanimelist.net/images/anime/3/88469.jpg')
('Koe no Katachi', 'https://cdn.myanimelist.net/images/anime/1122/96435.jpg')


Next is filtering the shows by a specific genre and sorting by highest ranked.

In [7]:
def get_top_ranked_genre(genre, n, df=animes_df):
    '''
    INPUT:
    genre - a string containing the genre that will be filtered by
    n - the number of recommendations to be returned
    df - the animes df from above
    
    OUTPUT:
    recs - a list of recommendations with title and url link
    '''
    
    recs = []
    # grab all shows that are in a specific genre and sort by highest rank
    genre_df = df[df[genre] == 1].sort_values(by='ranked', ascending=True)
    max_len = len(genre_df['title']) - 1

    for i in range(n):
        if i > max_len:
            break
        recs.append((genre_df.iloc[i].title, genre_df.iloc[i].img_url))
    
    return recs

In [39]:
genre = 'Romance'
# get top shows
recs = get_top_ranked_genre(genre, 100)
for show in recs:
    print(show)

('Kimi no Na wa.', 'https://cdn.myanimelist.net/images/anime/5/87048.jpg')
('Clannad: After Story', 'https://cdn.myanimelist.net/images/anime/13/24647.jpg')
('Shigatsu wa Kimi no Uso', 'https://cdn.myanimelist.net/images/anime/3/67177.jpg')
('Monogatari Series: Second Season', 'https://cdn.myanimelist.net/images/anime/3/52133.jpg')
('Rurouni Kenshin: Meiji Kenkaku Romantan - Tsuioku-hen', 'https://cdn.myanimelist.net/images/anime/1807/102419.jpg')
('Seishun Buta Yarou wa Yumemiru Shoujo no Yume wo Minai', 'https://cdn.myanimelist.net/images/anime/1613/102179.jpg')
('Howl no Ugoku Shiro', 'https://cdn.myanimelist.net/images/anime/5/75810.jpg')
('Suzumiya Haruhi no Shoushitsu', 'https://cdn.myanimelist.net/images/anime/2/73842.jpg')
('Yojouhan Shinwa Taikei', 'https://cdn.myanimelist.net/images/anime/10/50457.jpg')
('Bakuman. 3rd Season', 'https://cdn.myanimelist.net/images/anime/6/41845.jpg')
('Kara no Kyoukai 5: Mujun Rasen', 'https://cdn.myanimelist.net/images/anime/8/9246.jpg')
('Ten

The last ranked based filter is by the decade of the show.

In [85]:
def get_top_ranked_decade(decade, n, df=animes_df):
    '''
    INPUT:
    decade - a string containing the decade that will be filtered by
    n - the number of recommendations to be returned
    df - the animes df from above
    
    OUTPUT:
    recs - a list of recommendations with title and url link
    '''
    
    recs = []
    # grab all shows that are in a specific decade and sort by highest rank
    decade_df = df[df[decade] == 1].sort_values(by='ranked', ascending=False)
    
    # grab n recommendations
    for i in range(n):
        # will only need title and img_url for home page
        recs.append((decade_df.iloc[i].title, decade_df.iloc[i].img_url))
    
    return recs

In [86]:
decade = '1990s'
# get top 1990's animes
decades = get_top_ranked_decade(decade, 20)
for show in decades:
    print(show)

('Hiroshima e no Tabi', 'https://cdn.myanimelist.net/images/anime/4/60233.jpg')
('Hei Mao Jing Zhang (1992)', 'https://cdn.myanimelist.net/images/anime/2/83207.jpg')
('Gall Force: The Revolution', 'https://cdn.myanimelist.net/images/anime/8/26528.jpg')
('Chika Gentou Gekiga: Shoujo Tsubaki', 'https://cdn.myanimelist.net/images/anime/13/77662.jpg')
('Lesson XX', 'https://cdn.myanimelist.net/images/anime/13/25994.jpg')
('Kaitouranma The Animation', 'https://cdn.myanimelist.net/images/anime/5/61785.jpg')
('Hello Kitty no Shirayuki-hime', 'https://cdn.myanimelist.net/images/anime/2/47809.jpg')
('Hajime Ningen Gon', 'https://cdn.myanimelist.net/images/anime/3/24193.jpg')
('A.D. Police (TV)', 'https://cdn.myanimelist.net/images/anime/2/15440.jpg')
('Tobira wo Akete (1995)', 'https://cdn.myanimelist.net/images/anime/8/5028.jpg')
('Soliton no Akuma', 'https://cdn.myanimelist.net/images/anime/3/29527.jpg')
('Legend of Crystania', 'https://cdn.myanimelist.net/images/anime/9/2764.jpg')
('Hello Ki

# Section 2 - Content Based Filtering

Content based filtering requires finding similar shows. To find similar shows, a subset of the animes_df has to created of just the genres and decades (the attributes of the shows). The subset should be a matrix filled with 1's and 0's. The dot product of that subet with the TRANSPOSE of that subset will result in a similarity matrix. With the similarity matrix, finding similar shows is as easy as pandas matrix filtering.

Run the cell below to see what the subset should look like.

In [87]:
animes_df.iloc[:, 7:]

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Thriller,Vampire,Yaoi,Yuri,Pre 1970s,1970s,1980s,1990s,2000s,2010s
3,1,1,0,1,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
773,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
772,1,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
771,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
770,1,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3249,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3248,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3271,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3247,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [88]:
# get a subset of the animes df starting at the first genre column
show_attributes = animes_df.iloc[:, 7:]

# take the dot product of the show_attributes with the transpose of show_attributes
dot_prod_shows = show_attributes.dot(np.transpose(show_attributes))

Run the cell below to see the dot product matrix. This is the similarity matrix between any two shows. The higher the number in a cell, the more similar the two shows are. 

In [89]:
dot_prod_shows.head()

Unnamed: 0,3,773,772,771,770,769,768,767,766,765,...,3252,3270,3253,3250,3251,3249,3248,3271,3247,3273
3,9,0,4,1,5,3,3,2,1,2,...,1,4,1,2,2,0,0,1,1,2
773,0,3,1,1,1,2,2,1,1,1,...,0,0,1,1,0,1,1,0,1,1
772,4,1,6,1,5,3,3,0,1,2,...,0,2,1,3,0,1,1,0,1,2
771,1,1,1,5,2,1,1,1,2,3,...,1,1,1,1,2,1,1,0,2,1
770,5,1,5,2,8,3,3,2,2,3,...,1,2,1,3,1,1,1,0,1,2


In [90]:
dot_prod_shows.shape

(8184, 8184)

I will need this matrix for the web app so I will save it into its own csv file. Note: This might take some time to complete.

In [91]:
dot_prod_shows.to_csv('./data/similar-shows.csv')

Finding similar shows will be easy now with the dot product matrix. Looking at the row that corresponds to the anime, the function will grab the other anime ids that have a higher number in the cell (i.e. the more similar animes) and return a pandas dataframe filled only with the data of the most similar shows.

In [92]:
def find_similar_shows(anime_id):
    '''
    Finds similar shows based on what genres/decades they have in common
    
    INPUT:
    anime_id - int, id of anime show that appears in the animes_df
    
    OUTPUT:
    similar_shows - pandas dataframe of similar shows sorted by highest rated
    '''
    # find index of show in similarity matrix
    show_idx = np.where(animes_df['uid'] == anime_id)[0][0]
    
    # find other shows that are similar to the one passed in as an arg
    similar_idxs = np.where(dot_prod_shows.iloc[show_idx] > np.max(dot_prod_shows.iloc[show_idx])-2)[0]
    
    # find their info in the animes dataframe
    similar_shows = animes_df.iloc[similar_idxs, ]
    # remove the show that was passed in as an arg
    # Note: the most similar show to the one that was passed IS the one that was passed
    similar_shows = similar_shows[similar_shows['uid'] != anime_id]
    # sort by highest rank
    similar_shows.sort_values(by=['ranked'], ascending=True)
    
    
    return similar_shows

In [93]:
# find similar shows to the highest ranked show
# anime_id 5114 belongs to the show ranked number 1
top_ranked_recs = find_similar_shows(5114)
for name, img in zip(top_ranked_recs['title'], top_ranked_recs['img_url']):
    print(name, img)

Fullmetal Alchemist https://cdn.myanimelist.net/images/anime/10/75815.jpg
InuYasha Movie 3: Tenka Hadou no Ken https://cdn.myanimelist.net/images/anime/1658/95332.jpg
InuYasha Movie 2: Kagami no Naka no Mugenjo https://cdn.myanimelist.net/images/anime/1162/92219.jpg
InuYasha Movie 2: Kagami no Naka no Mugenjo https://cdn.myanimelist.net/images/anime/1162/92219.jpg
InuYasha Movie 1: Toki wo Koeru Omoi https://cdn.myanimelist.net/images/anime/1683/94370.jpg
InuYasha Movie 4: Guren no Houraijima https://cdn.myanimelist.net/images/anime/1216/94369.jpg
Fullmetal Alchemist: The Sacred Star of Milos https://cdn.myanimelist.net/images/anime/2/29550.jpg


It seems that if you're a fan on Full Metal Alchemist (top ranked show), you might like InuYasha!

# Collaborative Filtering
Collaborative based filtering is the most tricky. There are a few ways this can be done. Unfortunately after trial and error, my computer could not handle SVD or Funk SVD on such a large dataset. So I chose to implement a more simplistic approach. The approach for collaborative filtering is about finding similar users. This can be done through a user-item matrix where the users are the index and the shows are the columns.

### Methodology
This implementation is similar to the Recommendations with IBM project in that the user-item matrix will be filled with 1's and 0's. In the Recs with IBM project a 1 was interpreted as a user interaction with an article. Wether the interaction was positive or negative was unknown. For this project, a 1 will represent a user has favorited a show. The 1 will represent a ***positive*** interaction. So the more 'interactions' a user has, the easier it will be to find similar users because any 'interaction' is KNOWN as positive.

In [40]:
valid_shows = list(dot_prod_shows.columns.astype(int))

NameError: name 'dot_prod_shows' is not defined

In [142]:
len(valid_shows)

8184

In [97]:
reviews_df.shape

(192112, 7)

In [104]:
reviews_df['anime_uid'].value_counts()

1535     1708
9253     1558
32281    1436
11757    1292
5114     1274
         ... 
4773        1
37954       1
39215       1
10832       1
29293       1
Name: anime_uid, Length: 8113, dtype: int64

In [110]:
most_reviewed = []
for anime, num_review in zip(reviews_df['anime_uid'].value_counts().index, reviews_df['anime_uid'].value_counts().values):
    if num_review >= 100:
        most_reviewed.append(anime)

In [111]:
reviews_df = reviews_df[reviews_df['anime_uid'].isin(most_reviewed)]

In [112]:
reviews_df.shape

(110127, 7)

In [15]:
reviews_df.shape

(192112, 7)

In [13]:
reviews_df['uid'].value_counts()

321183    4
321837    4
321498    4
321144    4
321148    4
         ..
46599     1
211503    1
156351    1
198366    1
193145    1
Name: uid, Length: 130519, dtype: int64

The most animes one user has reviewed is 4 which will not help in collaborative filtering. There would be too much missing data. So I will use peoples favorited animes shows. This is similar to the Udacity lesson where in the user item matrix, a user has a 1 if they interacted with an article, otherwise a 0. This is slightly different because a 1 will represent a positive interaction with a show since they belong to someones favorite.

In [159]:
users_df = pd.read_csv('./data/profiles.csv')

In [144]:
users_df.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link
0,DesolatePsyche,Male,"Oct 2, 1994","['33352', '25013', '5530', '33674', '1482', '2...",https://myanimelist.net/profile/DesolatePsyche
1,baekbeans,Female,"Nov 10, 2000","['11061', '31964', '853', '20583', '918', '925...",https://myanimelist.net/profile/baekbeans
2,skrn,,,"['918', '2904', '11741', '17074', '23273', '32...",https://myanimelist.net/profile/skrn
3,edgewalker00,Male,Sep 5,"['5680', '849', '2904', '3588', '37349']",https://myanimelist.net/profile/edgewalker00
4,aManOfCulture99,Male,"Oct 30, 1999","['4181', '7791', '9617', '5680', '2167', '4382...",https://myanimelist.net/profile/aManOfCulture99


In [145]:
users_df.shape

(81727, 5)

In [160]:
users_df.duplicated().sum()

33825

In [161]:
users_df.drop_duplicates(inplace=True)

In [162]:
users_df.duplicated().sum()

0

In [149]:
users_df['favorites_anime'].value_counts()

[]                                                             10424
['5114']                                                          68
['1535']                                                          61
['9253']                                                          54
['21']                                                            44
                                                               ...  
['5114', '1535', '4224', '21', '853']                              1
['12431', '7674', '1210', '11061', '2034']                         1
['20', '135', '4898', '35994', '24211']                            1
['4224', '154', '120', '1142', '16']                               1
['37510', '35180', '30', '31043', '5081', '37450', '11843']        1
Name: favorites_anime, Length: 35395, dtype: int64

In [163]:
users_df = users_df[users_df['favorites_anime'] != '[]']

In [164]:
users_df['favorites_anime'].value_counts()

['5114']                                                       68
['1535']                                                       61
['9253']                                                       54
['21']                                                         44
['11061']                                                      40
                                                               ..
['5114', '1535', '4224', '21', '853']                           1
['12431', '7674', '1210', '11061', '2034']                      1
['20', '135', '4898', '35994', '24211']                         1
['4224', '154', '120', '1142', '16']                            1
['37510', '35180', '30', '31043', '5081', '37450', '11843']     1
Name: favorites_anime, Length: 35394, dtype: int64

In [165]:
users_df.shape

(37478, 5)

Similar to how I extracted the genres and decades from the animes dataframe I will extract all the anime shows that people have favorited and create a user-item matrix.

In [135]:
shows = []

# for every user, split their favorites_anime column to get their favs
for show in users_df.favorites_anime:
    favs = show.strip("[]").split(",")
    if '' in favs:
        favs.remove('')
    favs = [int(w.strip()[1:-1]) for w in favs]
    
    # add show to list
    shows.extend(favs)

# drop all duplicate values
shows = set(shows)
shows = list(shows)
shows.sort()
print("The number of favorited shows is {}.".format(len(shows)))
print(shows)

The number of favorited shows is 4768.
[1, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 71, 72, 73, 74, 75, 76, 77, 79, 80, 81, 82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 109, 110, 113, 114, 115, 116, 117, 119, 120, 121, 122, 123, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 173, 174, 177, 178, 180, 181, 182, 183, 185, 186, 187, 189, 190, 191, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 217, 218, 219, 220, 221, 222, 223, 225, 226, 227, 228, 229, 230, 232, 233, 235, 237, 238, 239, 240, 241, 242, 243, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 25

In [166]:
def split_favs(anime):
    '''
    Will split the genre column of any anime row and return a 1 if the anime is listed in that genre.
    
    INPUT:
    anime - a string of the genres column for a specific anime
    
    OUTPUT:
    1 - if anime is listed in genre
    0 - if anime is not listed in genre
    '''
    try:
        favs = anime.strip("[]").split(",")
        if '' in favs:
            favs.remove('')
        favs = [int(w.strip()[1:-1]) for w in favs]
        if show in favs:
            return 1
        else:
            return 0
    except AttributeError:
        return 0

# create column for each genre and fill in columns
for show in valid_shows:
    users_df[show] = users_df['favorites_anime'].apply(split_favs)

  users_df[show] = users_df['favorites_anime'].apply(split_favs)


In [174]:
users_df.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link,3,773,772,771,770,...,3252,3270,3253,3250,3251,3249,3248,3271,3247,3273
0,DesolatePsyche,Male,"Oct 2, 1994","['33352', '25013', '5530', '33674', '1482', '2...",https://myanimelist.net/profile/DesolatePsyche,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,baekbeans,Female,"Nov 10, 2000","['11061', '31964', '853', '20583', '918', '925...",https://myanimelist.net/profile/baekbeans,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,skrn,,,"['918', '2904', '11741', '17074', '23273', '32...",https://myanimelist.net/profile/skrn,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,edgewalker00,Male,Sep 5,"['5680', '849', '2904', '3588', '37349']",https://myanimelist.net/profile/edgewalker00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,aManOfCulture99,Male,"Oct 30, 1999","['4181', '7791', '9617', '5680', '2167', '4382...",https://myanimelist.net/profile/aManOfCulture99,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [158]:
users_df.iloc[0]['favorites_anime']

"['33352', '25013', '5530', '33674', '1482', '269', '18245', '2904', '27899', '17074', '12291', '226', '28851', '8525', '6594', '4981', '1698', '457', '235', '34618']"

In [168]:
users_df.shape

(37478, 8189)

The user-item matrix is everything starting from column '1' to the end of the dataframe.

In [169]:
# change column names to string (Int is not JSON serializable)
# will need column names for Flask app
users_df.columns = users_df.columns.astype(str)

In [176]:
# get a subset of users_df starting from column '1'
user_item = users_df.loc[:, '3':]

Run the cell below to look at the user-item matrix!

In [177]:
user_item.head()

Unnamed: 0,3,773,772,771,770,769,768,767,766,765,...,3252,3270,3253,3250,3251,3249,3248,3271,3247,3273
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [178]:
user_item.shape

(37478, 8184)

In [179]:
# save to new csv file
user_item.to_csv('./data/user-item-matrix.csv')

Finding similar users can be done with the user-item matrix now.

In [180]:
def find_similar_users(user_id, user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by animes: 
                1's when a user has favorited an anime, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered list
    
    '''
    # compute similarity of each user to the provided user
    similarity = user_item[user_item.index == user_id].dot(user_item.T)
    # sort by similarity
    similarity = similarity.sort_values(user_id, axis=1, ascending=False)

    # create list of just the ids
    most_similar_users = list(similarity.columns)
    # remove the own user's id
    most_similar_users.remove(user_id)
    
    return most_similar_users # return a list of the users in order from most to least similar

In [181]:
find_similar_users(0, user_item)

[9376,
 9661,
 72548,
 5722,
 54428,
 76109,
 7970,
 22674,
 63304,
 17565,
 68690,
 73411,
 875,
 8004,
 12431,
 20105,
 3932,
 72070,
 37631,
 63585,
 847,
 1445,
 22560,
 15682,
 73166,
 30817,
 10752,
 3850,
 22390,
 6191,
 12102,
 50782,
 20390,
 1819,
 59496,
 22486,
 38113,
 63031,
 3880,
 26686,
 68620,
 50633,
 49850,
 25948,
 43149,
 73697,
 71941,
 17874,
 15115,
 15087,
 36647,
 31872,
 5371,
 12952,
 67051,
 2565,
 5305,
 14981,
 71884,
 23483,
 25222,
 68931,
 4066,
 44327,
 43572,
 54922,
 64076,
 31663,
 25943,
 22865,
 31395,
 17702,
 5670,
 15314,
 68740,
 69795,
 67213,
 67210,
 23001,
 12658,
 8217,
 8222,
 8232,
 5559,
 15226,
 6249,
 22263,
 75166,
 72830,
 16382,
 61520,
 53031,
 2678,
 60447,
 28416,
 21698,
 61642,
 16888,
 7453,
 52046,
 2798,
 39134,
 10993,
 70172,
 2650,
 2802,
 51873,
 16385,
 16390,
 52294,
 29653,
 39827,
 52812,
 21334,
 11264,
 61022,
 72758,
 29452,
 16706,
 21190,
 2707,
 29503,
 2700,
 21106,
 40647,
 40701,
 60653,
 11439,
 60202,


In [182]:
def get_user_animes(uid,user_item):
    '''
    Gets the shows that are favorited by a specific user.
    
    INPUT:
    uid - (int) user id
    user-item - the user-item matrix from above
    '''
    
    user_row = user_item[user_item.index == uid]
    user_row = user_row.loc[:, (user_row.sum(axis=0) > 0)]
    return list(user_row.columns.values.astype(int))

In [183]:
import random
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds animes the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    recs = []
    similar_users = find_similar_users(user_id, user_item)
    random.shuffle(similar_users)
    viewed_anime_ids = get_user_animes(user_id, user_item)
    
    for user in similar_users:
        anime_ids = get_user_animes(user, user_item)
        for anime_id in anime_ids:
            if anime_id in viewed_anime_ids:
                pass
            else:
                recs = list(set().union(recs, anime_ids))
        if len(recs) >= m:
            break
        
    
    return recs[:m] # return your recommendations for this user_id    

In [184]:
user_user_recs(0, m=20)

[1,
 392,
 136,
 14345,
 270,
 21,
 5530,
 287,
 9253,
 813,
 47,
 1210,
 6594,
 16067,
 329,
 849,
 339,
 6746,
 93,
 874]

This is all of the data preparation/cleaning that needs to be done for the project. The Flask app will take care of the rest!