# Introductory program for Data Visualization in Python

## The Pokemon Dataset

### Description of the Data

This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. It has been of great use when teaching statistics to kids. With certain types you can also give a geeky introduction to machine learning.

These are the raw attributes used for calculating how much damage an attack will do in the games. This dataset is about the Pokemon games (NOT Pokemon cards or Pokemon Go).

The data as described by Myles O'Neill is:

    #: ID for each pokemon
    Name: Name of each pokemon
    Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
    Type 2: Some pokemon are dual type and have 2
    Total: sum of all stats that come after this, a general guide to how strong a pokemon is
    HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
    Attack: the base modifier for normal attacks (e.g. Scratch, Punch)
    Defense: the base damage resistance against normal attacks
    SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
    SP Def: the base damage resistance against special attacks
    Speed: determines which pokemon attacks first each round

The data for this table has been acquired from several different sites, including:

    pokemon.com
    pokemondb
    bulbapedia

One question has already been answered with this dataset: the type of a Pokemon cannot be inferred from its Attack and Defense alone. It would be worthwhile to find which two variables, if any, can define the type of a Pokemon. Two variables can be plotted in a 2D space and used as an example for machine learning, which could make for a visual example that any geeky machine learning class would love.
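As a rough illustration of the "two variables in a 2D space" idea, the sketch below plots Attack against Defense and colors the points by primary type. It uses only six rows copied from the dataset preview shown in this notebook, so it says nothing about how well the types separate on the full data; the variable name `sample` is an assumption made for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Six rows copied from the dataset preview in this notebook (not the full data).
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Charmander", "Charmeleon", "Squirtle", "Wartortle"],
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Water", "Water"],
    "Attack": [49, 62, 52, 64, 48, 63],
    "Defense": [49, 63, 43, 58, 65, 80],
})

fig, ax = plt.subplots(figsize=(6, 5))
# Draw one scatter series per type so each type gets its own color and legend entry.
for type_name, group in sample.groupby("Type 1"):
    ax.scatter(group["Attack"], group["Defense"], label=type_name)
ax.set_xlabel("Attack")
ax.set_ylabel("Defense")
ax.set_title("Attack vs Defense by Type 1 (sample rows)")
ax.legend(title="Type 1")
```

Swapping `"Attack"` and `"Defense"` for any other pair of stat columns gives a quick way to eyeball which pairs separate the types best.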

This dataset has been taken from [Kaggle - Pokemon with stats](https://www.kaggle.com/abcsds/pokemon)

For any given dataset, it helps to follow a structured framework, such as the one described here:

https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

The first step in our analysis will be to import necessary packages.

In [1]:
import pandas as pd # package for a collection of functions for data processing and analysis.
import numpy as np # foundational package for scientific computing
import matplotlib.pyplot as plt # collection of functions for data visualization
import seaborn as sns # for displaying better visualizations

In [4]:
# Load the dataset
df = pd.read_csv(r'C:\Simulation Kernel\The Tings\Arcanefiles\Datasets\Pokemon.csv')
print(df.head(15)) # display the first 15 records in the dataset

     #                       Name Type 1  Type 2  Total  HP  Attack  Defense  \
0    1                  Bulbasaur  Grass  Poison    318  45      49       49   
1    2                    Ivysaur  Grass  Poison    405  60      62       63   
2    3                   Venusaur  Grass  Poison    525  80      82       83   
3    3      VenusaurMega Venusaur  Grass  Poison    625  80     100      123   
4    4                 Charmander   Fire     NaN    309  39      52       43   
5    5                 Charmeleon   Fire     NaN    405  58      64       58   
6    6                  Charizard   Fire  Flying    534  78      84       78   
7    6  CharizardMega Charizard X   Fire  Dragon    634  78     130      111   
8    6  CharizardMega Charizard Y   Fire  Flying    634  78     104       78   
9    7                   Squirtle  Water     NaN    314  44      48       65   
10   8                  Wartortle  Water     NaN    405  59      63       80   
11   9                  Blastoise  Water

We will first check whether the dataset contains any null values, since null values can distort later analysis or prediction. Check this [link]()

In [7]:
print(df.isnull().sum())

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64


We have 386 null values in the `Type 2` column. Since it is a categorical column, the gaps cannot be filled with a mean or median and need to be handled differently. For now, let's set this aside.
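Two common ways to handle a missing categorical value are sketched below on a few rows taken from the dataset preview: fill the gap with an explicit placeholder label, or reuse the primary type (which is what this notebook does further down). The placeholder label `"None"` and the variable names are assumptions chosen for illustration, not part of the dataset.

```python
import pandas as pd

# Four rows from the dataset preview; None marks a missing secondary type.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Charizard", "Squirtle"],
    "Type 1": ["Grass", "Fire", "Fire", "Water"],
    "Type 2": ["Poison", None, "Flying", None],
})

# Count missing values per column, as df.isnull().sum() does on the full data.
missing_per_column = sample.isnull().sum()

# Option 1: mark single-type Pokemon with an explicit placeholder label.
placeholder_filled = sample["Type 2"].fillna("None")

# Option 2: treat single-type Pokemon as having the same type twice.
primary_filled = sample["Type 2"].fillna(sample["Type 1"])

print(missing_per_column)
print(placeholder_filled.tolist())
print(primary_filled.tolist())
```

Option 1 keeps single-type Pokemon distinguishable in later counts, while option 2 inflates the counts of each primary type in the `Type 2` column; which one is appropriate depends on the question being asked.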

Looking at the data, we can see that the `#` column is only an ID and doesn't contribute to our analysis, so we remove it.

In [8]:
df.drop(['#'], axis=1, inplace=True)
print(df.head())

                    Name Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  \
0              Bulbasaur  Grass  Poison    318  45      49       49       65   
1                Ivysaur  Grass  Poison    405  60      62       63       80   
2               Venusaur  Grass  Poison    525  80      82       83      100   
3  VenusaurMega Venusaur  Grass  Poison    625  80     100      123      122   
4             Charmander   Fire     NaN    309  39      52       43       60   

   Sp. Def  Speed  Generation  Legendary  
0       65     45           1      False  
1       80     60           1      False  
2      100     80           1      False  
3      120     80           1      False  
4       50     65           1      False  


In [None]:
# Use the Pokemon name as the index; `df` is the frame loaded above
data = df.set_index('Name')
#print(data.head(10))

In [None]:
# Keep only the "Mega ..." part of names such as "VenusaurMega Venusaur"
data.index = data.index.str.replace(".*(?=Mega)", "", regex=True)
#print(data.head(10))
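The pattern `.*(?=Mega)` relies on a lookahead: `.*` consumes everything before the word "Mega", while `(?=Mega)` only checks that "Mega" follows without consuming it, so the duplicated base name is dropped and the "Mega ..." part survives. A minimal sketch with the standard `re` module shows the effect on names from the preview; the helper name `strip_base_name` is hypothetical.

```python
import re

# Everything up to (but not including) the word "Mega".
MEGA_PREFIX = re.compile(r".*(?=Mega)")

def strip_base_name(name: str) -> str:
    """Drop the repeated base name in front of a "Mega ..." form."""
    return MEGA_PREFIX.sub("", name)

for raw in ["Bulbasaur", "VenusaurMega Venusaur", "CharizardMega Charizard X"]:
    print(raw, "->", strip_base_name(raw))
```

Names without "Mega" contain no match and pass through unchanged.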

In [None]:
#print(data.shape)
#print(data.columns)

In [None]:
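# Fill missing secondary types with the primary type (assignment avoids chained in-place modification)
data['Type 2'] = data['Type 2'].fillna(data['Type 1'])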
#print(data.head(10))

In [None]:
#print(data.isnull().sum())

In [None]:
# idxmax/idxmin return the index label (the Pokemon name); argmax/argmin return a position
#print('Pokemon with High HP: ', data['HP'].idxmax())
#print('Pokemon with Low HP: ', data['HP'].idxmin())
#print('Pokemon with High Attack: ', data['Attack'].idxmax())
#print('Pokemon with Low Attack: ', data['Attack'].idxmin())
#print('Pokemon with High Defense: ', data['Defense'].idxmax())
#print('Pokemon with Low Defense: ', data['Defense'].idxmin())
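The lookups above are meant to name the Pokemon with the highest or lowest value of a stat. In current pandas, `Series.argmax` returns an integer position, whereas `Series.idxmax` returns the index label, which is the name once `Name` is the index. The sketch below shows the difference using HP values copied from the dataset preview; the series `hp` is built by hand for the example.

```python
import pandas as pd

# HP values for four Pokemon from the preview, indexed by name.
hp = pd.Series(
    [45, 39, 78, 44],
    index=["Bulbasaur", "Charmander", "Charizard", "Squirtle"],
    name="HP",
)

highest_label = hp.idxmax()      # index label of the maximum
highest_position = hp.argmax()   # integer position of the maximum
lowest_label = hp.idxmin()       # index label of the minimum

print(highest_label, highest_position, lowest_label)
```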

In [None]:
#print(data.describe())

In [None]:
# Total Overview of The Pokemon Types
#print(data['Type 1'].value_counts(), '\n', data['Type 2'].value_counts())

In [None]:
# Bar plot
# The column name contains a space, so it must be accessed with brackets
data['Type 2'].value_counts().plot(kind='bar')
plt.title('Number of pokemon of each type')
plt.ylabel('Frequency')
fig=plt.gcf()
fig.set_size_inches(10, 10)

In [None]:
# Pie plot
# The column name contains a space, so it must be accessed with brackets
data['Type 1'].value_counts().plot(kind='pie')
plt.title('Number of pokemon of each type')
fig=plt.gcf()
fig.set_size_inches(10, 10)

In [None]:
labels = 'Water', 'Normal', 'Grass', 'Bug', 'Psychic', 'Fire', 'Electric', 'Rock', 'Other'
sizes= [112, 98, 70, 69, 57, 52, 44, 44, 175]
colors= ['b', 'm', '#00FF00', '#808000', '#008080', 'r', 'y', '#641E16', '#00FFFF']
explode= (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, shadow=True, 
    autopct='%1.1f%%')
plt.title('Percentage of Different Types of Type 1 Pokemon')
fig=plt.gcf()
fig.set_size_inches(10, 10)
plt.show()
plt.close()
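The slice sizes in the cell above are typed in by hand. As a hedged alternative, the counts could be derived from the column itself, keeping the most frequent categories and lumping the rest into an "Other" slice. The helper name `top_n_with_other` and the toy `types` series below are assumptions made for illustration; on the real data the input would be `data['Type 1']`.

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int) -> pd.Series:
    """Count categories, keep the n most frequent, and sum the rest as 'Other'."""
    counts = values.value_counts()          # sorted from most to least frequent
    top = counts.iloc[:n]
    other_total = counts.iloc[n:].sum()
    if other_total > 0:
        top = pd.concat([top, pd.Series({"Other": other_total})])
    return top

# Toy input standing in for the 'Type 1' column.
types = pd.Series(["Water", "Water", "Water", "Grass", "Grass", "Fire", "Bug"])
slices = top_n_with_other(types, n=2)
print(slices)
```

The result can be passed to `plt.pie(slices.values, labels=slices.index, autopct='%1.1f%%')`, so the chart stays in sync with the data.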

In [None]:
# Univariate Plots
# sns.distplot is deprecated; histplot with kde=True draws a histogram plus a density curve
f, ax = plt.subplots(1, 3, figsize=(15, 5))
sns.histplot(data['Attack'], color='c', bins=25, kde=True, ax=ax[0])
ax[0].set_title('Attack Univariate Plot')
sns.histplot(data['Defense'], color='m', bins=25, kde=True, ax=ax[1])
ax[1].set_title('Defense Univariate Plot')
sns.histplot(data['HP'], color='b', bins=25, kde=True, ax=ax[2])
ax[2].set_title('HP Univariate Plot')
#plt.close()

In [None]:
f, ax = plt.subplots(figsize=(15, 13))
# The column is named 'Type 1' (with a space), not 'Type1'
sns.pointplot(x='Type 1', y='Attack', data=data, color='r', ax=ax)
plt.title('Type 1 vs Attack Pointplot')
plt.xlabel('Type 1')
plt.ylabel('Attack')

In [None]:
# Bivariate Comparison Plots

f, ax = plt.subplots(1, 3, figsize=(15, 5))
sns.boxplot(x='Type 1', y='Attack', data=data, ax=ax[0], linewidth=0.5)
ax[0].set_title('Type 1 vs Attack Box Plot')
# Rotate the existing category labels instead of overwriting them with raw column values
ax[0].tick_params(axis='x', labelrotation=90)

sns.boxplot(x='Type 1', y='Defense', data=data, ax=ax[1], linewidth=0.5)
ax[1].set_title('Type 1 vs Defense Box Plot')
ax[1].tick_params(axis='x', labelrotation=90)

sns.boxplot(x='Type 1', y='HP', data=data, ax=ax[2], linewidth=0.5)
ax[2].set_title('Type 1 vs HP Box Plot')
ax[2].tick_params(axis='x', labelrotation=90)
#plt.close()

f, ax = plt.subplots(1, 3, figsize=(25, 5))
# No hue is used, so the split option is dropped from all three violin plots
sns.violinplot(x='Type 2', y='Attack', data=data, ax=ax[0])
ax[0].set_title('Type 2 vs Attack Violin Plot')
# Rotate the existing category labels instead of overwriting them with raw column values
ax[0].tick_params(axis='x', labelrotation=90)

sns.violinplot(x='Type 2', y='Defense', data=data, ax=ax[1])
ax[1].set_title('Type 2 vs Defense Violin Plot')
ax[1].tick_params(axis='x', labelrotation=90)

sns.violinplot(x='Type 2', y='HP', data=data, ax=ax[2])
ax[2].set_title('Type 2 vs HP Violin Plot')
ax[2].tick_params(axis='x', labelrotation=90)
#plt.close()

plt.subplots(figsize=(15, 5))
plt.title('Strongest Generation')
sns.violinplot(x=data['Generation'], y=data['Total'], data=data)
#plt.close()

In [None]:
# Correlation plot
f, ax = plt.subplots(figsize=(10, 10))
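# numeric_only=True skips the text columns (Type 1, Type 2), which recent pandas versions refuse to correlate
sns.heatmap(data.corr(numeric_only=True), annot=True, fmt='.2f', linewidths=0.5, ax=ax)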
#plt.close()

In [None]:
# Regression Plots
f, ax=plt.subplots(1, 2, figsize=(15, 5))
sns.regplot(x=data['Attack'], y=data['HP'], ax=ax[0], data=data)
ax[0].set_title('Attack vs HP Regression')

sns.regplot(x=data['Defense'], y=data['HP'], ax=ax[1], data=data)
ax[1].set_title('Defense vs HP Regression')
plt.show()

In [None]:
sns.pairplot(data, hue="Generation")

In [None]:
# The column is named 'Type 1' (with a space), not 'Type1'
g = sns.FacetGrid(data, row="Type 1", col="Generation", margin_titles=True)
bins = np.linspace(0, 60, 13)
g.map(plt.hist, "Speed", color="steelblue", bins=bins, lw=0)