
# Data Cleaning
This notebook preprocesses the datasets before they are used in the two recommendation systems (content-based and collaborative filtering) that I will be developing later on.

In [None]:
# Import libraries

import pandas as pd
import numpy as np

# Suppress scientific notation of pandas and round floats to 3dp
pd.options.display.float_format = '{:,.3f}'.format

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

## Anime Dataset

Keys:
- anime_id - myanimelist.net's unique id identifying an anime.
- name - full name of anime.
- genre - comma separated list of genres for this anime.
- type - movie, TV, OVA, etc.
- episodes - how many episodes in this show. (1 if movie).
- rating - average rating out of 10 for this anime.
- members - number of community members that are in this anime's "group".

In [None]:
# load anime dataset
anime_df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/datasets/anime.csv")

anime_df.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


Just by looking at the first 10 entries in the dataframe, we can already see what appear to be duplicates caused by formatting or by differences in the way values were entered (for example, several "Gintama" variants, some containing the HTML entity `&#039;`). I will need to clean this.
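One way to check whether such rows are true duplicates is to decode the HTML entities and then compare the cleaned titles. The following is only a minimal sketch on a toy sample built from titles shown above. The `name_clean` and `is_dup` columns are hypothetical names I introduce for illustration, not part of the dataset.

```python
import html

import pandas as pd

# Toy rows mirroring titles shown above; not the full dataset.
sample = pd.DataFrame({
    "anime_id": [9969, 15417, 918],
    "name": ["Gintama&#039;", "Gintama&#039;: Enchousen", "Gintama"],
})

# Decode HTML entities such as &#039; (an apostrophe) and trim whitespace.
sample["name_clean"] = sample["name"].map(html.unescape).str.strip()

# Flag titles that collide after cleaning (case-insensitive comparison).
sample["is_dup"] = sample["name_clean"].str.lower().duplicated(keep=False)

print(sample[["name", "name_clean", "is_dup"]])
```

In this toy sample no titles collide after decoding, which would suggest these rows are separate entries rather than duplicates. The full dataset may behave differently.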

In [None]:
# number of rows and columns in the dataframe
anime_df.shape

(12294, 7)

In [None]:
# are the columns using suitable datatypes
anime_df.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object

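All of the columns except episodes appear to have a suitable datatype. The episodes column is stored as object because unknown episode counts are recorded as the string "Unknown".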

In [None]:
# replace unknown episode counts with NaN, then convert the column to float
# (assigning the result back avoids relying on inplace=True on a column selection)
anime_df["episodes"] = anime_df["episodes"].replace({"Unknown": np.nan, "unknown": np.nan}).astype("float")
anime_df.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes    float64
rating      float64
members       int64
dtype: object

Now the episodes column is stored as float64, with unknown episode counts represented as NaN.
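An alternative to replacing specific strings is `pd.to_numeric` with `errors="coerce"`, which turns every unparseable value into NaN. This is a sketch on a toy column, not the method used above. It is broader than an explicit replace, so it would also hide unexpected values that I might prefer to inspect first.

```python
import pandas as pd

# Toy column standing in for anime_df["episodes"]; the values are illustrative.
episodes = pd.Series(["64", "Unknown", "1", "unknown"], dtype=object, name="episodes")

# errors="coerce" converts anything that cannot be parsed as a number into NaN.
episodes_num = pd.to_numeric(episodes, errors="coerce")

print(episodes_num.dtype)
print(episodes_num.isna().sum())
```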

In [None]:
# Check which columns have missing values
anime_df.isnull().any()

anime_id    False
name        False
genre        True
type         True
episodes     True
rating       True
members     False
dtype: bool

In [None]:
# How many missing values do we have for each column?
anime_df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes    340
rating      230
members       0
dtype: int64

There are missing values in the genre (62), type (25), episodes (340), and rating (230) columns. For a recommendation system, missing values may make a content-based filtering system less accurate, as features like the genre of an anime may influence how much a viewer enjoys it.
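Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary table on a toy frame with made-up values. The `missing` frame and its column names are my own illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as anime_df; the values are made up.
toy = pd.DataFrame({
    "genre": ["Action", np.nan, "Drama", "Comedy"],
    "type": ["TV", "Movie", np.nan, "TV"],
    "episodes": [12.0, 1.0, np.nan, np.nan],
    "rating": [8.1, 7.5, np.nan, 6.9],
})

# isnull().mean() gives the fraction of missing entries per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(2),
})

print(missing.sort_values("n_missing", ascending=False))
```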

In this project, I am only going to use anime that have a "type" value of "TV", so I will remove all rows where "type" is not equal to "TV".

In [None]:
# keep only rows where the anime is classified as "TV"
anime_df = anime_df[anime_df["type"] == "TV"]
anime_df.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10.0,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148.0,9.13,425855
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13.0,9.11,81109
10,4181,Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern...",TV,24.0,9.06,456749
12,918,Gintama,"Action, Comedy, Historical, Parody, Samurai, S...",TV,201.0,9.04,336376
13,2904,Code Geass: Hangyaku no Lelouch R2,"Action, Drama, Mecha, Military, Sci-Fi, Super ...",TV,25.0,8.98,572888


In [None]:
# check how many missing values we have now
anime_df.isnull().sum()

anime_id      0
name          0
genre        10
type          0
episodes    209
rating      116
members       0
dtype: int64

We still have 10 rows with missing genre values and 116 rows with missing rating values (episodes also has 209 missing values). Content-based filtering requires the genre of the show, so when it is time to develop that model I will drop the 10 rows with missing genre values. The collaborative filtering system, on the other hand, does not use the genre values at all, so I will simply use the whole dataset without the genre column.
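The two views described above could be derived along these lines. This is a sketch on a toy frame. The names `anime_tv`, `content_df`, and `collab_df` are hypothetical and stand in for whatever the later notebooks use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TV-only anime_df; the values are illustrative.
anime_tv = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "genre": ["Action, Drama", np.nan, "Comedy"],
    "rating": [8.0, 7.0, np.nan],
})

# Content-based view: genre is required, so drop rows where it is missing.
content_df = anime_tv.dropna(subset=["genre"]).reset_index(drop=True)

# Collaborative view: genre is unused, so drop the column and keep every row.
collab_df = anime_tv.drop(columns=["genre"])

print(content_df.shape, collab_df.shape)
```

Both calls return new dataframes, so the cleaned source frame stays intact for either model.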

## Ratings Dataset

Keys:
- user_id - non identifiable randomly generated user id.
- anime_id - the anime that this user has rated.
- rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

In [None]:
# load ratings dataset
rating_df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/datasets/rating.csv")

rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [None]:
rating_df.tail()

Unnamed: 0,user_id,anime_id,rating
7813732,73515,16512,7
7813733,73515,17187,9
7813734,73515,22145,10
7813735,73516,790,9
7813736,73516,8074,9


In [None]:
rating_df.shape

(7813737, 3)

In [None]:
rating_df.dtypes

user_id     int64
anime_id    int64
rating      int64
dtype: object

In [None]:
rating_df.isnull().any()

user_id     False
anime_id    False
rating      False
dtype: bool

Using -1 to represent a missing rating may skew future analysis and the building of the recommender. Instead of using -1, I will replace all ratings of -1 with a null value.

In [None]:
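# replace the -1 placeholder with NaN (assigning back instead of using inplace=True)
rating_df["rating"] = rating_df["rating"].replace(-1, np.nan)
values = rating_df["rating"].unique()
values.sort()  # in-place sort; NaN is placed last
print(values)

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. nan]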


Now the rating column has values from 1 to 10, with NaN for missing ratings.
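To illustrate why the -1 placeholder matters, the sketch below compares the mean of a few toy ratings before and after the replacement. The numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings for one anime; -1 marks "watched but not rated".
raw = pd.Series([9, 8, -1, -1, 10])

# Treating -1 as a real score drags the mean down.
mean_with_placeholder = raw.mean()

# After replacing -1 with NaN, pandas skips the missing entries by default.
cleaned = raw.replace(-1, np.nan)
mean_cleaned = cleaned.mean()

print(mean_with_placeholder, mean_cleaned)
```

Here the mean moves from 5.0 to 9.0 once the placeholders are treated as missing.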

The ratings dataset doesn't seem to need any more cleaning unless other issues arise.

## Exporting Dataframes to CSV

In [None]:
anime_df.to_csv("/content/drive/My Drive/Colab Notebooks/datasets/cleaned_anime.csv", index=False)
rating_df.to_csv("/content/drive/My Drive/Colab Notebooks/datasets/cleaned_rating.csv", index=False)
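A quick round-trip check can confirm that `index=False` avoids an extra index column and that NaN ratings survive the export. This sketch writes a toy frame to an in-memory buffer instead of Google Drive.

```python
import io

import numpy as np
import pandas as pd

# Toy ratings frame; the values are illustrative.
toy = pd.DataFrame({
    "user_id": [1, 1],
    "anime_id": [20, 24],
    "rating": [np.nan, 8.0],
})

# Write to an in-memory buffer, then read it back as if it were a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored["rating"].isna().sum())
```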