# Data Exploration and Analysis
The following notebook will take a look at the data sets that will be used in the recommendation engine and prep the data for use in the engine itself.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Read in Data

In [2]:
animes_df = pd.read_csv('data/animes.csv')
users_df = pd.read_csv('data/profiles.csv')
reviews_df = pd.read_csv('data/reviews.csv')

print("Animes DF: {}\nUsers DF: {}\nReviews DF: {}".format(animes_df.shape, users_df.shape, reviews_df.shape))

Animes DF: (19311, 12)
Users DF: (81727, 5)
Reviews DF: (192112, 7)


In [5]:
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...


### Animes DataFrame Exploration and Cleaning

First I want to check for duplicate and null rows. If there are, I want to get rid of them. I purposely used this dataset because there were no null values. So I expect isnull().sum() to return 0.

In [6]:
animes_df.duplicated().sum()

2943

In [7]:
animes_df.drop_duplicates(inplace=True)

In [8]:
animes_df.duplicated().sum()

0

In [9]:
animes_df.genre.isnull().sum()

0

One thing I noticed when looking through the dataset was that some of the show titles have special characters that will make routing to the show info page result in an error so they need to be replaced.

In [None]:
animes_df['title'] = [w.replace('/','-') for w in animes_df['title']]

In [None]:
animes_df['title'] = [w.replace('?','') for w in animes_df['title']]

Perfect! One thing I noticed about the animes dataframe was that the genre for each anime seems to be a list of genres. I want to check what data type the genre column is and create a column for each genre type.

In [9]:
animes_df['genre'].dtype

dtype('O')

The data type of the genre column seems to be an Object, meaning a string. So I need to loop through the genre column and get a set of the genres.

In [10]:
genres = []

# for every anime, split the genre column value to get each genre type
for genre_set in animes_df.genre:
    values = genre_set.strip("[]").split(",")
    values = [w.strip()[1:-1] for w in values]
    
    # add genres to list
    genres.extend(values)

# drop all duplicate values
genres = set(genres)
print("The number of genres is {}.".format(len(genres)))
print(genres)

The number of genres is 44.
{'', 'Game', 'Magic', 'Shoujo', 'Super Power', 'Shounen Ai', 'Dementia', 'Seinen', 'Hentai', 'Parody', 'Music', 'Samurai', 'Romance', 'Psychological', 'Comedy', 'Yaoi', 'Action', 'Demons', 'Shounen', 'Sci-Fi', 'Kids', 'Josei', 'Thriller', 'Slice of Life', 'Cars', 'Vampire', 'Historical', 'Shoujo Ai', 'Yuri', 'Drama', 'Supernatural', 'Mystery', 'Fantasy', 'Military', 'Mecha', 'Harem', 'Police', 'Horror', 'Martial Arts', 'Sports', 'Adventure', 'Ecchi', 'School', 'Space'}


Notice the first element is empty. This happend when calling split. We can quickly delete that.

In [11]:
genres = list(genres)
genres.pop(0)
genres = sorted(genres)
print("The number of genres is {}.".format(len(genres)))
print(genres)

The number of genres is 43.
['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire', 'Yaoi', 'Yuri']


Now that I have a list of all possible genre types. I wanted to make a column for each genre. The value of the column will be a 1 if an anime is listed under this genre or a 0 if it is not.

In [12]:
def split_genres(anime):
    '''
    Will split the genre column of any anime row and return a 1 if the anime is listed in that genre.
    
    INPUT:
    anime - a string of the genres column for a specific anime
    
    OUTPUT:
    1 - if anime is listed in genre
    0 - if anime is not listed in genre
    '''
    try:
        if anime.find(genre) > -1:
            return 1
        else:
            return 0
    except AttributeError:
        return 0

# create column for each genre and fill in columns
for genre in genres:
    animes_df[genre] = animes_df['genre'].apply(split_genres)

Now there is an easier way to identify the genres that an anime is listed as. This will be used for filtering and content based recommendations.

In [13]:
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,...,0,0,0,1,0,0,0,0,0,0
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,...,0,0,0,0,0,0,0,0,0,0
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,...,0,0,0,0,0,0,0,0,0,0
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,...,0,0,0,0,0,0,0,0,0,0
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,...,0,0,0,0,0,1,0,1,0,0


## Create Decade columns for further filtering

I followed the same process as above and extracted the decades that these shows aired and created a column for each decade. 1 will indicated that the show is from that decade, otherwise 0.

In [14]:
def split_anime_decade(val, decade):
    '''
    INPUT:
    val the 'aired' column value for a row
    decade - one of the values from the (list) decade list 
             below this cell
    
    OUTPUT:
    1 - if show in decade
    0 - if show not in decade
    '''
    # extract year from string
    try:
        year = val.split(',')[1]
        year = year.strip()[:4]
    except:
        year = val.strip()[:4]
    
    # decide wether show belongs to decade
    try:
        if decade == 'Pre 1970':
            if int(year) < 1970:
                return 1
            return 0
        if int(year) >= int(decade) and int(year) < int(decade) + 10:
            return 1
        else:
            return 0
    except:
        return 0

It was hard to extract all of the air dates with the way the 'aired' column was set up so I decided to group every show before 1970 to be its own 'decade'.

In [15]:
# valid decades
decades = ['Pre 1970s', '1970s', '1980s', '1990s', '2000s', '2010s']

# for every decade, find what shows belong to it
for decade in decades:
    column = []
    for row in animes_df['aired']:
        column.append(split_anime_decade(row, decade[:-1]))
    animes_df[decade] = column
    
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,...,Thriller,Vampire,Yaoi,Yuri,Pre 1970s,1970s,1980s,1990s,2000s,2010s
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,...,0,0,0,0,0,0,0,0,0,1
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,...,0,0,0,0,0,0,0,0,0,1
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,...,0,0,0,0,0,0,0,0,0,1
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,...,0,0,0,0,0,0,0,0,1,0
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,...,0,1,0,0,0,0,0,0,0,1


The genre and aired columns are dropped because they are no longer needed. There is an easier way to identify what genres a show belongs to now.

In [16]:
animes_df = animes_df.drop(['genre', 'aired'], axis=1)
animes_df.head()

Unnamed: 0,uid,title,synopsis,episodes,members,popularity,ranked,score,img_url,link,...,Thriller,Vampire,Yaoi,Yuri,Pre 1970s,1970s,1980s,1990s,2000s,2010s
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...,...,0,0,0,0,0,0,0,0,0,1
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...,...,0,0,0,0,0,0,0,0,0,1
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss,...,0,0,0,0,0,0,0,0,0,1
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...,...,0,0,0,0,0,0,0,0,1,0
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...,...,0,1,0,0,0,0,0,0,0,1


In [17]:
# save to a new csv file
animes_df.to_csv('./data/animes-clean.csv')

## Ranked Based Filtering
Now that the anime shows dataframe is clean. Ranked based filtering functions can now be made.

The first filtering function is simply sorting by the top rated shows.

In [18]:
def get_top_rated(n, df=animes_df):
    '''
    INPUT:
    df - animes df from cells above
    n - number of recs to return
    
    OUTPUT:
    recs -  the name and img url of the all time top rated animes
    '''
    recs = []
    top_ranked = df.sort_values(by='ranked', ascending=True).drop_duplicates()
    
    for i in range(n):
        recs.append((top_ranked.iloc[i].title, top_ranked.iloc[i].img_url))
                    
    return recs

In [19]:
top_rated = get_top_rated(10)
print(top_rated)

('Fullmetal Alchemist: Brotherhood', 'https://cdn.myanimelist.net/images/anime/1223/96541.jpg')


In [20]:
print(top_rated[5])

('Gintama°', 'https://cdn.myanimelist.net/images/anime/3/72078.jpg')


Next is filtering the shows by a specific genre and sorting by high rated.

In [21]:
def get_top_rated_genre(genre, n, df=animes_df):
    '''
    INPUT:
    genre - a string containing the genre that will be filtered by
    n - the number of recommendations to be returned
    df - the animes df from above
    
    OUTPUT:
    recs - a list of recommendations with title and url link
    '''
    
    recs = []
    genre_df = df[df[genre] == 1].sort_values(by='score', ascending=False).drop_duplicates()
    
    for i in range(n):
        recs.append((genre_df.iloc[i].title, genre_df.iloc[i].img_url))
    
    return recs

In [22]:
genre = 'Romance'
romance_recs = get_top_rated_genre(genre, 20)
romance_recs

[('Kimi no Na wa.', 'https://cdn.myanimelist.net/images/anime/5/87048.jpg'),
 ('Clannad: After Story',
  'https://cdn.myanimelist.net/images/anime/13/24647.jpg'),
 ('Shigatsu wa Kimi no Uso',
  'https://cdn.myanimelist.net/images/anime/3/67177.jpg'),
 ('Monogatari Series: Second Season',
  'https://cdn.myanimelist.net/images/anime/3/52133.jpg'),
 ('Rurouni Kenshin: Meiji Kenkaku Romantan - Tsuioku-hen',
  'https://cdn.myanimelist.net/images/anime/1807/102419.jpg'),
 ('Seishun Buta Yarou wa Yumemiru Shoujo no Yume wo Minai',
  'https://cdn.myanimelist.net/images/anime/1613/102179.jpg'),
 ('Howl no Ugoku Shiro',
  'https://cdn.myanimelist.net/images/anime/5/75810.jpg'),
 ('Suzumiya Haruhi no Shoushitsu',
  'https://cdn.myanimelist.net/images/anime/2/73842.jpg'),
 ('Yojouhan Shinwa Taikei',
  'https://cdn.myanimelist.net/images/anime/10/50457.jpg'),
 ('Bakuman. 3rd Season',
  'https://cdn.myanimelist.net/images/anime/6/41845.jpg'),
 ('Kara no Kyoukai 5: Mujun Rasen',
  'https://cdn.myanim

The last ranked based filter is by the decade of the show.

In [23]:
def get_top_rated_decade(decade, n, df=animes_df):
    '''
    INPUT:
    decade - a string containing the decade that will be filtered by
    n - the number of recommendations to be returned
    df - the animes df from above
    
    OUTPUT:
    recs - a list of recommendations with title and url link
    '''
    
    recs = []
    decade_df = df[df[decade] == 1].sort_values(by='score', ascending=False).drop_duplicates()
    
    for i in range(n):
        recs.append((decade_df.iloc[i].title, decade_df.iloc[i].img_url))
    
    return recs

In [24]:
decade = '1990s'
decades_df = get_top_rated_decade(decade, 20)
decades_df

[('Kaitei-koku no Koutsu Anzen',
  'https://cdn.myanimelist.net/images/anime/1957/99484.jpg'),
 ('Cowboy Bebop', 'https://cdn.myanimelist.net/images/anime/4/19644.jpg'),
 ('Mononoke Hime', 'https://cdn.myanimelist.net/images/anime/7/75919.jpg'),
 ('Rurouni Kenshin: Meiji Kenkaku Romantan - Tsuioku-hen',
  'https://cdn.myanimelist.net/images/anime/1807/102419.jpg'),
 ('Great Teacher Onizuka',
  'https://cdn.myanimelist.net/images/anime/13/11460.jpg'),
 ('Slam Dunk', 'https://cdn.myanimelist.net/images/anime/12/86890.jpg'),
 ('One Piece', 'https://cdn.myanimelist.net/images/anime/6/73245.jpg'),
 ('Neon Genesis Evangelion: The End of Evangelion',
  'https://cdn.myanimelist.net/images/anime/12/39305.jpg'),
 ('Time Slip 1923: Mori no Miracle Jishin Taiken',
  'https://cdn.myanimelist.net/images/anime/1511/98953.jpg'),
 ('Kenpuu Denki Berserk',
  'https://cdn.myanimelist.net/images/anime/12/18520.jpg'),
 ('Yuu☆Yuu☆Hakusho', 'https://cdn.myanimelist.net/images/anime/8/25152.jpg'),
 ('Hunter x

# Content Based Filtering

To find similar shows, a subset of the animes_df has to created of just the genres and decades. The subset should be a matrix filled with 1's and 0's. The dot product needs to be taken of that subset with the transpose of that subset to find the similarity between any two shows.

In [None]:
animes_clean_df = pd.read_csv('data/animes-clean.csv')

In [6]:
# get a subset of the animes df starting at the first genre column
show_contents = animes_clean_df.iloc[:, 11:]
# take the dot product of that subset with the transpose of that subset
dot_prod_shows = show_contents.dot(np.transpose(show_contents))

In [60]:
print(type(dot_prod_shows))

<class 'pandas.core.frame.DataFrame'>


Run the cell below to see the dot product matrix. The higher the number in a cell, the more similar the two shows are.

In [50]:
dot_prod_shows.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16358,16359,16360,16361,16362,16363,16364,16365,16366,16367
0,6,4,2,3,1,2,1,2,2,1,...,4,2,1,3,2,3,2,3,1,4
1,4,6,2,2,1,1,1,2,1,1,...,3,1,3,3,2,2,1,2,1,4
2,2,2,6,3,2,1,2,1,2,2,...,2,3,1,1,1,1,1,1,1,2
3,3,2,3,9,1,2,3,2,1,4,...,2,4,0,1,1,3,1,2,1,2
4,1,1,2,1,5,3,1,3,4,1,...,1,0,2,1,1,2,2,1,2,1


In [None]:
dot_prod_shows.shape

Save the dot product matrix in a new csv file.

In [8]:
dot_prod_shows.to_csv('./data/similar-shows.csv')

Finding similar shows will be easy now with the dot product matrix. Looking at the row that corresponds to the anime, the function will grab the other anime ids that have a higher number in the cell (i.e. the more similar animes) and return a pandas dataframe filled only with the data of the most similar shows.

In [5]:
def find_similar_shows(anime_id):
    '''
    Finds similar shows based on what genres/decades they have in common
    
    INPUT:
    anime_id - int, id of anime show that appears in the animes_df
    
    OUTPUT:
    similar_shows - pandas dataframe of similar shows sorted by highest rated
    '''
    show_idx = np.where(animes_clean_df['uid'] == anime_id)[0][0]
    
    similar_idxs = np.where(dot_prod_shows_df.iloc[show_idx] > np.max(dot_prod_shows_df.iloc[show_idx])-2)[0]
    
    similar_shows = animes_df.iloc[similar_idxs, ]
    similar_shows = similar_shows[similar_shows['uid'] != anime_id]
    similar_shows.sort_values(by=['score'], ascending=False)
    
    
    return similar_shows

In [97]:
a = find_similar_shows(11061)
for name, img in zip(a['title'], a['img_url']):
    print(name, img)

Saint Seiya: Meiou Hades Meikai-hen https://cdn.myanimelist.net/images/anime/12/75732.jpg
Shingeki no Kyojin OVA https://cdn.myanimelist.net/images/anime/9/59221.jpg
Magi: Sinbad no Bouken https://cdn.myanimelist.net/images/anime/13/60471.jpg
Fairy Tail (2014) https://cdn.myanimelist.net/images/anime/3/60551.jpg
Kekkai Sensen & Beyond https://cdn.myanimelist.net/images/anime/3/88282.jpg
One Piece: Episode of East Blue - Luffy to 4-nin no Nakama no Daibouken https://cdn.myanimelist.net/images/anime/10/87473.jpg
Magi: Sinbad no Bouken (TV) https://cdn.myanimelist.net/images/anime/10/78783.jpg
One Piece Film: Gold https://cdn.myanimelist.net/images/anime/12/81081.jpg
One Piece Film: Strong World Episode 0 https://cdn.myanimelist.net/images/anime/2/24152.jpg
Nanatsu no Taizai https://cdn.myanimelist.net/images/anime/8/65409.jpg
Saint Seiya: The Lost Canvas - Meiou Shinwa 2 https://cdn.myanimelist.net/images/anime/12/29597.jpg
One Piece: Episode of Nami - Koukaishi no Namida to Nakama no Ki

# Collaborative Filtering
Either using the reviews dataframe or the favorite animes in the users dataframe can lead to good collaborative filtering. A user-item matrix will need to be created so personal user recommendations can be made.

In [10]:
reviews_df.head()

Unnamed: 0,uid,profile,anime_uid,text,score,scores,link
0,255938,DesolatePsyche,34096,\n \n \n \n ...,8,"{'Overall': '8', 'Story': '8', 'Animation': '8...",https://myanimelist.net/reviews.php?id=255938
1,259117,baekbeans,34599,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=259117
2,253664,skrn,28891,\n \n \n \n ...,7,"{'Overall': '7', 'Story': '7', 'Animation': '9...",https://myanimelist.net/reviews.php?id=253664
3,8254,edgewalker00,2904,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9...",https://myanimelist.net/reviews.php?id=8254
4,291149,aManOfCulture99,4181,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=291149


In [15]:
reviews_df.shape

(192112, 7)

In [13]:
reviews_df['uid'].value_counts()

321183    4
321837    4
321498    4
321144    4
321148    4
         ..
46599     1
211503    1
156351    1
198366    1
193145    1
Name: uid, Length: 130519, dtype: int64

The most animes one user has reviewed is 4 which will not help in collaborative filtering. There would be too much missing data. So I will use peoples favorited animes shows. This is similar to the Udacity lesson where in the user item matrix, a user has a 1 if they interacted with an article, otherwise a 0. This is slightly different because a 1 will represent a positive interaction with a show since they belong to someones favorite.

In [54]:
users_df.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link
0,DesolatePsyche,Male,"Oct 2, 1994","['33352', '25013', '5530', '33674', '1482', '2...",https://myanimelist.net/profile/DesolatePsyche
1,baekbeans,Female,"Nov 10, 2000","['11061', '31964', '853', '20583', '918', '925...",https://myanimelist.net/profile/baekbeans
2,skrn,,,"['918', '2904', '11741', '17074', '23273', '32...",https://myanimelist.net/profile/skrn
3,edgewalker00,Male,Sep 5,"['5680', '849', '2904', '3588', '37349']",https://myanimelist.net/profile/edgewalker00
4,aManOfCulture99,Male,"Oct 30, 1999","['4181', '7791', '9617', '5680', '2167', '4382...",https://myanimelist.net/profile/aManOfCulture99


In [62]:
users_df.shape

(81727, 5)

Similar to how I extracted the genres and decades from the animes dataframe I will extract all the anime shows that people have favorited and create a user-item matrix.

In [3]:
shows = []

# for every anime, split the genre column value to get each genre type
for show in users_df.favorites_anime:
    favs = show.strip("[]").split(",")
    if '' in favs:
        favs.remove('')
    favs = [int(w.strip()[1:-1]) for w in favs]
    
    # add genres to list
    shows.extend(favs)

# drop all duplicate values
shows = set(shows)
shows = list(shows)
shows.sort()
print("The number of genres is {}.".format(len(shows)))
print(shows)

The number of genres is 4768.
[1, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 71, 72, 73, 74, 75, 76, 77, 79, 80, 81, 82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 109, 110, 113, 114, 115, 116, 117, 119, 120, 121, 122, 123, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 173, 174, 177, 178, 180, 181, 182, 183, 185, 186, 187, 189, 190, 191, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 217, 218, 219, 220, 221, 222, 223, 225, 226, 227, 228, 229, 230, 232, 233, 235, 237, 238, 239, 240, 241, 242, 243, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 2

In [18]:
def split_favs(anime):
    '''
    Will split the genre column of any anime row and return a 1 if the anime is listed in that genre.
    
    INPUT:
    anime - a string of the genres column for a specific anime
    
    OUTPUT:
    1 - if anime is listed in genre
    0 - if anime is not listed in genre
    '''
    try:
        favs = anime.strip("[]").split(",")
        if '' in favs:
            favs.remove('')
        favs = [int(w.strip()[1:-1]) for w in favs]
        if show in favs:
            return 1
        else:
            return 0
    except AttributeError:
        return 0

# create column for each genre and fill in columns
for show in shows:
    users_df[show] = users_df['favorites_anime'].apply(split_favs)

  users_df[show] = users_df['favorites_anime'].apply(split_favs)


In [19]:
users_df.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link,1,5,6,7,15,...,40507,40540,40542,40610,40729,40767,40852,40936,40937,40952
0,DesolatePsyche,Male,"Oct 2, 1994","['33352', '25013', '5530', '33674', '1482', '2...",https://myanimelist.net/profile/DesolatePsyche,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,baekbeans,Female,"Nov 10, 2000","['11061', '31964', '853', '20583', '918', '925...",https://myanimelist.net/profile/baekbeans,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,skrn,,,"['918', '2904', '11741', '17074', '23273', '32...",https://myanimelist.net/profile/skrn,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,edgewalker00,Male,Sep 5,"['5680', '849', '2904', '3588', '37349']",https://myanimelist.net/profile/edgewalker00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,aManOfCulture99,Male,"Oct 30, 1999","['4181', '7791', '9617', '5680', '2167', '4382...",https://myanimelist.net/profile/aManOfCulture99,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The user-item matrix is everything starting from column '1' to the end of the dataframe.

In [25]:
# change column names to string (Int is not JSON serializable)
# will need column names for Flask app
users_df.columns = users_df.columns.astype(str)

In [27]:
# get a subset of users_df starting from column '1'
user_item = users_df.loc[:, '1':]

Run the cell below to look at the user-item matrix!

In [28]:
user_item.head()

Unnamed: 0,1,5,6,7,15,16,17,18,19,20,...,40507,40540,40542,40610,40729,40767,40852,40936,40937,40952
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
user_item.shape

(81727, 4768)

In [30]:
# get rid of any row with no 1
user_item = user_item[user_item.sum(axis=1) > 0]

In [31]:
user_item.shape

(65125, 4768)

In [2]:
# save to new csv file
user_item.to_csv('./data/user-item-matrix.csv')

Finding similar users can be done with the user-item matrix now.

In [8]:
def find_similar_users(user_id, user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by animes: 
                1's when a user has favorited an anime, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered list
    
    '''
    # compute similarity of each user to the provided user
    similarity = user_item[user_item.index == user_id].dot(user_item.T)
    # sort by similarity
    similarity = similarity.sort_values(user_id, axis=1, ascending=False)

    # create list of just the ids
    most_similar_users = list(similarity.columns)
    # remove the own user's id
    most_similar_users.remove(user_id)
    
    return most_similar_users # return a list of the users in order from most to least similar

In [9]:
find_similar_users(0, user_item)

[21906,
 63074,
 24608,
 26031,
 62415,
 6916,
 20428,
 41931,
 37544,
 55287,
 12456,
 49384,
 23163,
 48873,
 22922,
 22486,
 50073,
 22178,
 16560,
 24191,
 50090,
 3835,
 50203,
 10601,
 63252,
 47465,
 11304,
 16519,
 59689,
 46624,
 46591,
 46458,
 46434,
 537,
 46168,
 45970,
 16491,
 25971,
 45969,
 4907,
 21163,
 50917,
 11371,
 58890,
 58678,
 54139,
 58745,
 17305,
 17379,
 17645,
 58771,
 53623,
 53598,
 53592,
 18295,
 12045,
 18309,
 18341,
 18406,
 2762,
 18436,
 52433,
 52390,
 52387,
 11916,
 18839,
 52039,
 51372,
 11541,
 20251,
 51170,
 20495,
 45644,
 20755,
 63857,
 27202,
 26624,
 32535,
 855,
 33299,
 8117,
 32929,
 39116,
 60869,
 39394,
 38662,
 60846,
 6408,
 60784,
 60760,
 31271,
 60750,
 38733,
 6802,
 44565,
 34603,
 36270,
 35951,
 61613,
 61313,
 35538,
 37522,
 34344,
 38392,
 37936,
 6924,
 7879,
 34034,
 33900,
 33875,
 31230,
 62300,
 40914,
 28088,
 9250,
 9345,
 28340,
 28223,
 43295,
 43348,
 9499,
 31011,
 27931,
 27879,
 27581,
 27571,
 5322,
 

In [11]:
def get_user_animes(uid,user_item):
    '''
    Gets the shows that are favorited by a specific user.
    
    INPUT:
    uid - (int) user id
    user-item - the user-item matrix from above
    '''
    
    user_row = user_item[user_item.index == uid]
    user_row = user_row.loc[:, (user_row.sum(axis=0) > 0)]
    return list(user_row.columns.values.astype(int))

In [36]:
import random
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds animes the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    recs = []
    similar_users = find_similar_users(user_id, user_item)
    random.shuffle(similar_users)
    viewed_anime_ids = get_user_animes(user_id, user_item)
    
    for user in similar_users:
        anime_ids = get_user_animes(user, user_item)
        for anime_id in anime_ids:
            if anime_id in viewed_anime_ids:
                pass
            else:
                recs = list(set().union(recs, anime_ids))
        if len(recs) >= m:
            break
        
    
    return recs[:m] # return your recommendations for this user_id    

In [37]:
user_user_recs(0, m=20)

[1,
 392,
 2581,
 21,
 790,
 33,
 4901,
 934,
 813,
 35120,
 31933,
 967,
 1356,
 1358,
 2904,
 11741,
 227,
 249,
 2025,
 889]

This is all of the data preparation/cleaning that needs to be done for the project. The Flask app will take care of the rest!