# Next Level Data Visualization in Python using Plotly

To learn new data analysis tools, and visualization tools in particular, I decided to explore the open-source library Plotly, which at first glance seemed to offer a lot of great possibilities.

Most people are familiar with matplotlib, but it can be surprisingly hard to figure out how to add a y-axis, how to format dates, and so on. This is where Plotly comes in handy for visualizing your data.

### Plotly intro

Plotly is an open-source library built on plotly.js, which is in turn built on d3.js.

For those who don't know, d3.js is a JavaScript library for manipulating documents based on data. D3 helps to bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

This means we get the efficiency of coding in Python together with the interactive graphics capabilities of d3.

# Exploratory Data Analysis and Visualization

### The Dataset

Being a fan of anime and manga, I felt the urge to use the data provided by MyAnimeList to explore it and get some insights about all the shows I have binged and am going to binge (the dataset is available on Kaggle).

Let's import it and take a quick look at what this dataset looks like.
First, let's load the data with Pandas and check its size.

In [3]:
import pandas as pd
df = pd.read_csv("datasets_571_1094_anime.csv")
df.shape

(12294, 7)

We can see that the dataset has 7 columns and 12294 rows. That's a lot of anime to watch, if I do say so myself.

What does the dataset really look like?

In [4]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


Pretty straightforward, as we can see. We have the __anime_id__, its __name__, __genre__, __type__, number of __episodes__ (1 probably being a movie), __rating__ and finally __members__ (people who watched the anime).

With that, we can already look for meaningful information such as the __mean rating__ of all the anime, the __number of movies__ in the dataset, or even the genres that have the most or the fewest members.
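As a rough sketch of how those first questions could be answered with pandas, the snippet below computes the mean rating, the number of movies, and the total members per genre. It runs on a tiny hand-built `sample` frame (three rows copied from the outputs shown in this article, not the full dataset), and the names `sample`, `mean_rating`, `movie_count` and `genre_members` are only illustrative.

```python
import numpy as np
import pandas as pd

# Three rows copied from the article's outputs; an illustrative sample,
# not the full Kaggle dataset.
sample = pd.DataFrame(
    {
        "name": ["Kimi no Na wa.", "Steins;Gate", "Inazma Delivery"],
        "genre": [
            "Drama, Romance, School, Supernatural",
            "Sci-Fi, Thriller",
            "Action, Comedy, Sci-Fi",
        ],
        "type": ["Movie", "TV", "TV"],
        "rating": [9.37, 9.17, np.nan],
        "members": [200630, 673572, 32],
    }
)

# mean() skips NaN by default, so unrated anime do not affect the average.
mean_rating = sample["rating"].mean()

# Count the rows whose type is "Movie".
movie_count = int((sample["type"] == "Movie").sum())

# Split the comma-separated genre string into one row per genre,
# then add up the members of every anime carrying that genre.
genre_members = (
    sample.assign(genre=sample["genre"].str.split(", "))
    .explode("genre")
    .groupby("genre")["members"]
    .sum()
    .sort_values(ascending=False)
)

print(round(mean_rating, 2), movie_count)
print(genre_members.head())
```

Note that an anime with several genres is counted once per genre, so the per-genre totals add up to more than the overall number of members.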

There are a lot of possibilities! But first, let's see if our dataset is complete.

In [7]:
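def NaN_percent(df, column_name):
    # Total number of rows in the column, missing values included.
    row_count = df[column_name].shape[0]
    # count() ignores NaN, so the difference is the number of missing values.
    empty_values = row_count - df[column_name].count()
    return (100.0*empty_values)/row_count

print("Percentage of missing values in our dataset : \n")

# Iterating over list(df) yields the column names.
for i in list(df):
    print(i +': ' + str(NaN_percent(df,i))+'%')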

Percentage of missing values in our dataset : 

anime_id: 0.0%
name: 0.0%
genre: 0.5043110460387181%
type: 0.20335122824141857%
episodes: 0.0%
rating: 1.870831299821051%
members: 0.0%
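The same percentages can also be computed without a helper function. The sketch below relies only on pandas itself: `isnull()` marks missing cells as `True`, and the mean of a boolean column is the fraction of `True` values. The `sample` frame is a made-up miniature used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: one of the four ratings is missing.
sample = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "rating": [9.37, 9.26, np.nan, 9.17],
    }
)

# isnull() gives a boolean frame; mean() of booleans is the share of True,
# so multiplying by 100 yields the percentage of missing values per column.
missing_percent = sample.isnull().mean() * 100

for column, percent in missing_percent.items():
    print(f"{column}: {percent}%")
```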


As we can see, only a few columns have missing values, and the __rating__ column has the most.
We could guess that these are anime that people dropped or did not finish, and therefore never rated.

So approximately 2% of the anime did not get any rating. Let's see them.


In [9]:
df[df['rating'].isnull()].head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
8968,34502,Inazma Delivery,"Action, Comedy, Sci-Fi",TV,10,,32
9657,34309,Nananin no Ayakashi: Chimi Chimi Mouryou!! Gen...,"Comedy, Supernatural",TV,Unknown,,129
10896,34096,Gintama (2017),"Action, Comedy, Historical, Parody, Samurai, S...",TV,Unknown,,13383
10897,34134,One Punch Man 2,"Action, Comedy, Parody, Sci-Fi, Seinen, Super ...",TV,Unknown,,90706
10898,30484,Steins;Gate 0,"Sci-Fi, Thriller",,Unknown,,60999


Interesting! I didn't expect that at all!

Looking at this, a new hypothesis about those NaN values emerges: they belong to __not yet aired anime__.

When a new show is added to the database as soon as a studio announces it, it does not have any ratings yet and its number of episodes is still unknown, hence the missing values.

We can also see that the __episodes__ column is actually missing values too, because "__Unknown__" is not a number of episodes.

For the next parts, we will consider any anime __without__ a known number of episodes as a __not yet aired anime__.
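One way to express that rule in code, as a sketch rather than the approach used in the rest of this article: `pd.to_numeric` with `errors="coerce"` turns every value that cannot be parsed as a number (such as "Unknown") into NaN, which can then be flagged. The `sample` frame reuses four rows from the output above, and the `not_yet_aired` column name is an assumption of this example.

```python
import pandas as pd

# Four rows taken from the output above (episodes are stored as text).
sample = pd.DataFrame(
    {
        "name": ["Inazma Delivery", "Gintama (2017)", "One Punch Man 2", "Steins;Gate 0"],
        "episodes": ["10", "Unknown", "Unknown", "Unknown"],
    }
)

# The column holds strings, so pandas does not consider it numeric.
print(pd.api.types.is_numeric_dtype(sample["episodes"]))

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
episodes_num = pd.to_numeric(sample["episodes"], errors="coerce")

# Hypothetical helper column: True when the episode count is unknown.
sample["not_yet_aired"] = episodes_num.isna()

print(sample)
```

Here *Inazma Delivery* has no rating but does have an episode count, so it is not flagged: under this rule, a missing rating alone does not make a show "not yet aired".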

Let's see how many not yet aired anime we have, then.

In [11]:
print(df[df['episodes']=='Unknown'].shape[0])
print(df[df['episodes']=='Unknown'].shape[0]*100/df.shape[0])

340
2.765576704083293
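The count and the percentage can also be read off a single boolean mask: `sum()` counts the `True` values and `mean()` gives their share. A small sketch on a made-up `episodes` series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical episode counts stored as text, as in the dataset.
episodes = pd.Series(["1", "64", "Unknown", "24", "Unknown"])

# Build the mask once and reuse it for both numbers.
unknown_mask = episodes == "Unknown"

unknown_count = int(unknown_mask.sum())             # number of True values
unknown_percent = float(unknown_mask.mean() * 100)  # share of True values, in percent

print(unknown_count)
print(unknown_percent)
```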


So only 340 shows (2.77%) had not yet aired when this dataset was made (2017). By now *One Punch Man 2* and *Gintama (2017)* have aired, so go watch them!