# Exploratory Data Analysis

We want to achieve a few things in this section:
1. Find out what each column represents and what values it can take
2. List the columns that might help us answer our questions
3. Plot the distribution of interesting columns (variables)
4. Analyse relationships between these variables
5. From all of the above, come up with a definition of "success" for each anime


### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

### Import the Dataset (UserList)

The dataset is in CSV format; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

In [2]:
userlist = pd.read_csv('DataSets/Cleaned data/outV1.csv')
userlist.head()

| Unnamed: 0 | title | type | source | episodes | status | airing | aired | duration | rating | score | ... | Shoujo | Shounen | Slice of Life | Space | Sports | Super Power | Supernatural | Thriller | Vampire | Yaoi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Inu x Boku SS | TV | Manga | 12 | Finished Airing | False | {'from': '2012-01-13', 'to': '2012-03-30'} | 24 min. per ep. | PG-13 - Teens 13 or older | 7.63 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Seto no Hanayome | TV | Manga | 26 | Finished Airing | False | {'from': '2007-04-02', 'to': '2007-10-01'} | 24 min. per ep. | PG-13 - Teens 13 or older | 7.89 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Shugo Chara!! Doki | TV | Manga | 51 | Finished Airing | False | {'from': '2008-10-04', 'to': '2009-09-25'} | 24 min. per ep. | PG - Children | 7.55 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Princess Tutu | TV | Original | 38 | Finished Airing | False | {'from': '2002-08-16', 'to': '2003-05-23'} | 16 min. per ep. | PG-13 - Teens 13 or older | 8.21 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Bakuman. 3rd Season | TV | Manga | 25 | Finished Airing | False | {'from': '2012-10-06', 'to': '2013-03-30'} | 24 min. per ep. | PG-13 - Teens 13 or older | 8.67 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
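
Two details in this preview are worth handling early. The `Unnamed: 0` column typically appears when a DataFrame's index was written to the CSV, and the `aired` column holds a dictionary rendered as text. The following is a minimal sketch on a toy in-memory CSV, assuming `aired` always follows the `{'from': ..., 'to': ...}` layout shown in the preview. The names `raw_csv`, `preview` and `parse_aired` are illustrative and not part of the notebook.

```python
import ast
import io

import pandas as pd

# Toy CSV with an unnamed first column, as produced when an index is saved.
raw_csv = io.StringIO(''',title,aired
0,Inu x Boku SS,"{'from': '2012-01-13', 'to': '2012-03-30'}"
1,Seto no Hanayome,"{'from': '2007-04-02', 'to': '2007-10-01'}"
''')

# index_col=0 uses the first column as the index instead of
# loading it as a regular column named "Unnamed: 0".
preview = pd.read_csv(raw_csv, index_col=0)


def parse_aired(text):
    """Turn the text form of the aired dict into start and end timestamps."""
    aired = ast.literal_eval(text)  # parses Python literals without executing code
    return pd.Series({
        'aired_from': pd.to_datetime(aired['from']),
        'aired_to': pd.to_datetime(aired['to']),
    })


# apply() with a function returning a Series yields one column per key.
preview = preview.join(preview['aired'].apply(parse_aired))
print(preview[['title', 'aired_from', 'aired_to']])
```

Rows in the real data may have a missing `to` date (for example, shows still airing), so the parser would need to be checked against those before use.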


### Printing out possible values of categorical variables

We are interested in finding out which distinct values appear in the following columns:
1. Type
2. Source
3. Status
4. Rating
5. Studio

In [3]:
# Columns to inspect, paired with the label used in the printed heading
categorical_columns = [
    ('Type', 'type'),
    ('source', 'source'),
    ('status', 'status'),
    ('rating', 'rating'),
    ('studio', 'studio'),
]

for label, column in categorical_columns:
    # unique() returns the distinct values in order of first appearance
    print(f"Unique values for {label}:")
    for value in userlist[column].unique():
        print(value)
    print("\n")

Unique values for Type:
TV
Movie
Music
OVA
ONA
Special
Unknown


Unique values for source:
Manga
Original
Light novel
4-koma manga
Novel
Visual novel
Unknown
Other
Music
Game
Picture book
Card game
Web manga
Book
Radio
Digital manga


Unique values for status:
Finished Airing
Currently Airing
Not yet aired


Unique values for rating:
PG-13 - Teens 13 or older
PG - Children
G - All Ages
R+ - Mild Nudity
R - 17+ (violence & profanity)
Rx - Hentai


Unique values for studio:
SmallStudio
Gonzo
Satelight
J.C.Staff
Production Reed
Bones
Studio Deen
Brain&#039;s Base
Studio Pierrot
Madhouse
Production I.G
TMS Entertainment
Tatsunoko Production
Shin-Ei Animation
Toei Animation
Sunrise
Zexcs
unknown
Lerche
Studio 4°C
Kachidoki Studio
DLE
Xebec
A-1 Pictures
Nippon Animation
Kyoto Animation
OLM
Shaft
ufotable
Silver Link.
Seven
Arms
Diomedea
Studio Ghibli
feel.
Gainax
Doga Kobo
P.A. Works
AIC
PoRO
Studio Hibari
Studio Gallop
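
The listing above shows which values occur, but not how many there are or how often each appears. It also shows one studio name in HTML-escaped form (`Brain&#039;s Base`). A minimal sketch on a toy frame follows: `nunique` and `value_counts` are standard pandas methods and `html.unescape` is from the Python standard library. The frame `sample` is a hypothetical stand-in whose values are a small sample taken from the output above.

```python
import html

import pandas as pd

# Toy stand-in for the real frame (not the actual dataset).
sample = pd.DataFrame({
    'type': ['TV', 'TV', 'Movie', 'OVA', 'TV'],
    'studio': ['Bones', 'Brain&#039;s Base', 'Madhouse', 'Bones', 'unknown'],
})

# Number of distinct values per column.
distinct_counts = sample.nunique()
print(distinct_counts)

# Frequency of each value in one column, most common first.
type_frequencies = sample['type'].value_counts()
print(type_frequencies)

# Decode HTML entities such as &#039; back into plain characters.
sample['studio'] = sample['studio'].map(html.unescape)
print(sample['studio'].unique())
```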




### Printing out range of values for numerical variables

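We are interested in finding out the minimum and maximum values for the following columns:
1. Episodes
2. Duration
3. Score
4. Rank
5. Popularity
6. Members
7. Favourites

Note that `duration` is stored as text (for example `24 min. per ep.`), so its minimum and maximum are alphabetical rather than numeric.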
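
One compact way to get such ranges is `DataFrame.agg`, which applies several aggregations to every selected column in a single call. This is a minimal sketch on a toy frame: the `episodes` and `score` values are taken from the preview earlier, while the `members` values are invented for illustration.

```python
import pandas as pd

# Toy frame; the members values are made up for this example.
toy = pd.DataFrame({
    'episodes': [12, 26, 51],
    'score': [7.63, 7.89, 7.55],
    'members': [1000, 2500, 400],
})

# One row per aggregation, one column per variable.
ranges = toy.agg(['min', 'max'])
print(ranges)
```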

In [10]:
# Columns to summarise, paired with the label used in the printed output
numeric_columns = [
    ('episode', 'episodes'),
    ('duration', 'duration'),
    ('score', 'score'),
    ('rank', 'rank'),
    ('popularity', 'popularity'),
    ('member', 'members'),
    ('favourite', 'favorites'),
]

for label, column in numeric_columns:
    # min() and max() give the range of values in each column
    print(f"Minimum value for {label}:", userlist[column].min())
    print(f"Maximum value for {label}:", userlist[column].max())
    print("\n")

Minimum value for episode: 0
Maximum value for episode: 1818


Minimum value for duration: 1 hr.
Maximum value for duration: Unknown


Minimum value for score: 0.0
Maximum value for score: 10.0


Minimum value for rank: 0.0
Maximum value for rank: 12919.0


Minimum value for popularity: 0
Maximum value for popularity: 14487


Minimum value for member: 0
Maximum value for member: 1456378


Minimum value for favourite: 0
Maximum value for favourite: 106895
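
The duration range above (`1 hr.` to `Unknown`) is an alphabetical ordering of strings, not a numeric range. To compare durations numerically, the text first has to be converted to a number. The sketch below assumes the values are built from `hr`, `min` and `sec` parts, as in `24 min. per ep.` and `1 hr.`. The combined form `1 hr. 30 min.` is an assumption about the data, and `duration_to_minutes` is a hypothetical helper, not part of the notebook.

```python
import re

import numpy as np
import pandas as pd

# Minutes represented by one unit of each recognised label.
UNIT_MINUTES = {'hr': 60.0, 'min': 1.0, 'sec': 1.0 / 60.0}


def duration_to_minutes(text):
    """Convert strings like '24 min. per ep.' to minutes; NaN if unparseable."""
    parts = re.findall(r'(\d+)\s*(hr|min|sec)', str(text))
    if not parts:
        return np.nan  # e.g. 'Unknown'
    return sum(int(amount) * UNIT_MINUTES[unit] for amount, unit in parts)


# Sample strings; '1 hr. 30 min.' is an assumed format.
durations = pd.Series(['24 min. per ep.', '1 hr.', '1 hr. 30 min.', 'Unknown'])
minutes = durations.map(duration_to_minutes)

# min() and max() skip NaN by default, so 'Unknown' does not affect the range.
print(minutes.min(), minutes.max())
```

Unparseable entries become `NaN` rather than `0`, so they stay distinguishable from genuinely short durations.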


