In [53]:
import pandas as pd
import numpy as np

#### Content

##### Anime.csv
* anime_id - myanimelist.net's unique id identifying an anime.
* name - full name of anime.
* genre - comma separated list of genres for this anime.
* type - movie, TV, OVA, etc.
* episodes - how many episodes in this show. (1 if movie).
* rating - average rating out of 10 for this anime.
* members - number of community members that are in this anime's "group".
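
Because `genre` is stored as one comma-separated string per anime, counting individual genres needs a split first. A minimal sketch, assuming the separator is a comma followed by a space (as in the sample rows); the `toy` frame below is made-up illustration data, not part of the dataset:

```python
import pandas as pd

# Hypothetical two-row stand-in for the genre column of anime.csv
toy = pd.DataFrame({"genre": ["Drama, Romance", "Sci-Fi, Thriller"]})

# Split each string into a list, then give every genre its own row
genres = toy["genre"].str.split(", ").explode()

# Count how often each individual genre appears
genre_counts = genres.value_counts()
print(genre_counts)
```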

##### Rating.csv
* user_id - non identifiable randomly generated user id.
* anime_id - the anime that this user has rated.
* rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

##### Acknowledgements
Thanks to myanimelist.net API for providing anime data and user ratings.

##### Inspiration
Building a better anime recommendation system based only on user viewing history.

### Anime.csv

In [54]:
# Point this at your own copy of anime.csv
#df1 = pd.read_csv(r'C:\Users\Vasin\Desktop\Project\Anime_recommender_system\Data\anime.csv')
df1 = pd.read_csv(r'\..\anime.csv')  # raw string so the backslashes are not read as escapes
df1.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [55]:
df1.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [56]:
df1.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object

For exploratory analysis:
* genre -- fill nulls with "No details".
* type -- fill nulls with "No details".
* episodes -- change "1" to "movie".
* rating -- fill nulls with 0 to represent no data.
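
The plan above can also be sketched compactly on a toy frame. This is only an illustration under assumptions: the `toy` rows are invented, and passing a dict to `fillna` is one possible way to fill several columns at once, not necessarily how the notebook does it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for anime.csv with one fully populated and one sparse row
toy = pd.DataFrame({
    "genre": ["Drama", np.nan],
    "type": ["Movie", np.nan],
    "episodes": ["1", "12"],   # stored as strings, like the object column above
    "rating": [9.0, np.nan],
})

# Fill each column's nulls with its own placeholder in a single call
cleaned = toy.fillna({"genre": "No details", "type": "No details", "rating": 0})

# Relabel the string "1" as "movie"
cleaned["episodes"] = cleaned["episodes"].replace("1", "movie")
print(cleaned)
```

Note that replacing "1" relabels every single-episode entry as "movie", whatever its `type` is.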

In [71]:
df1['episodes'].value_counts()

movie    5677
2        1076
12        816
13        572
26        514
         ... 
155         1
1006        1
191         1
167         1
175         1
Name: episodes, Length: 187, dtype: int64

In [57]:
df1['genre'] = df1['genre'].fillna('No details')

In [58]:
df1['type'] = df1['type'].fillna('No details')

In [59]:
df1['episodes'] = df1['episodes'].replace('1', 'movie')

In [60]:
df1['rating'] = df1['rating'].fillna(0)

In [61]:
# Check for nulls again
df1.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [62]:
# df1.to_csv('anime_C.csv')  # "_C" means the data is already cleaned; if you are working in Jupyter, uncomment to save it for exploration

### Rating.csv

In [63]:
df2 = pd.read_csv(r'C:\Users\Vasin\Desktop\Project\Anime_recommender_system\Data\rating.csv')
df2.head(10)

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1
5,1,355,-1
6,1,356,-1
7,1,442,-1
8,1,487,-1
9,1,846,-1


In [64]:
df2.dtypes

user_id     int64
anime_id    int64
rating      int64
dtype: object

In [65]:
df2.isnull().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

In [66]:
df2['rating'].value_counts()

 8     1646019
-1     1476496
 7     1375287
 9     1254096
 10     955715
 6      637775
 5      282806
 4      104291
 3       41453
 2       23150
 1       16649
Name: rating, dtype: int64

For the model:
* A rating of -1 means the user watched the anime but did not rate it, so the value is changed to 0 and those rows will be used as testing data.
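
One way to picture this step is to recode the placeholder and then separate rated from unrated rows. A minimal sketch on invented data; the names `toy_ratings`, `rated`, and `unrated` are hypothetical and do not appear in the notebook:

```python
import pandas as pd

# Hypothetical stand-in for rating.csv
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [20, 24, 20, 79],
    "rating": [-1, 8, 10, -1],
})

# Recode "watched but not rated" from -1 to 0
toy_ratings["rating"] = toy_ratings["rating"].replace(-1, 0)

# Rows with a real score vs. rows held aside because no score was given
rated = toy_ratings[toy_ratings["rating"] > 0]
unrated = toy_ratings[toy_ratings["rating"] == 0]
print(len(rated), len(unrated))
```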

In [67]:
df2['rating'] = df2['rating'].replace(-1, 0)

In [68]:
df2['rating'].value_counts()

8     1646019
0     1476496
7     1375287
9     1254096
10     955715
6      637775
5      282806
4      104291
3       41453
2       23150
1       16649
Name: rating, dtype: int64

In [69]:
#df2.to_csv('rating_C.csv')