This is an analysis of the Pokemon stats data set.
Do check it out and give me suggestions :)

In [None]:
#Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sb
from scipy import stats
import matplotlib.pyplot as plt
import re  # the standard-library module is enough for the re.sub calls below

In [None]:
#Read in the dataset
df = pd.read_csv("../input/PokemonData.csv") 

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.isna().sum()

Type2 has a lot of missing values. That makes sense, because the majority of Pokemon have only a single type. For now, I'll fill the gaps with "None".

In [None]:
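# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas
df['Type2'] = df['Type2'].fillna("None")
df['Total'] = df.HP + df.Attack + df.Defense + df.Speed + df.SpAtk + df.SpDef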

I'll make sure to remove the duplicates, if any.

In [None]:
df.drop_duplicates('Num', keep='first', inplace=True)

The dataset contains information about both the identity and the statistics of each Pokemon species; therefore, let's split these two observational units into separate tables: Dex and statistics.

In [None]:
Dex = df[['Num', 'Name', 'Type1', 'Type2', 'Generation', 'Legendary']]

# Num is unique after drop_duplicates, so the columns can be selected from df directly
statistics = df[['Num', 'HP', 'Attack', 'Defense', 'SpAtk', 'SpDef', 'Speed', 'Total']]


I have added an extra feature, "Total", which is the sum of the six other features: HP, Attack, Defense, Special Attack, Special Defense and Speed.

Let's now begin with some visual analysis.

In [None]:
plt.figure(figsize=(15,10))
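# numeric_only=True skips text columns such as Name and Type1, which pandas 2.x refuses to correlate
sb.heatmap(df.corr(numeric_only=True),annot = True)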

If we look at the data set carefully, we see that some names carry a redundant prefix in front of "Mega" or "Primal", and that some Hoopa names repeat "Hoopa".
I'll go ahead and strip those prefixes.

In [None]:
df.Name = df.Name.apply(lambda x: re.sub(r'(.+)(Mega.+)',r'\2',x))
df.Name = df.Name.apply(lambda x: re.sub(r'(.+)(Primal.+)',r'\2',x))
df.Name = df.Name.apply(lambda x: re.sub(r'(HoopaHoopa)(.+)','Hoopa'+r'\2',x))

In [None]:
NL_Poke = df.loc[(df['Legendary']==False)]
L_Poke = df.loc[(df['Legendary']==True)]

In [None]:
#Pie chart for pokemon - Legendary vs Non Legendary
Split = [NL_Poke['Name'].count(),L_Poke['Name'].count()]
LegPie = plt.pie(Split,labels=['Not Legendary','Legendary'],autopct='%1.1f%%',shadow=True)
plt.title('Legendary vs Non-Legendary')
fig=plt.gcf()

Let's check out some of the standard distribution plots.

In [None]:
plt.figure(figsize=(6,3))
sb.kdeplot(df["Total"],legend=False,color="blue",fill=True)  # fill replaces the deprecated shade argument

In [None]:
sb.kdeplot(df["HP"],legend = False,color="blue",fill=True)  # fill replaces the deprecated shade argument

In [None]:
sb.kdeplot(df["Attack"],legend = False,color="blue",fill=True)  # fill replaces the deprecated shade argument

In [None]:
sb.kdeplot(df["Defense"],legend = False,color="blue",fill=True)  # fill replaces the deprecated shade argument


In [None]:
sb.kdeplot(df["Speed"],legend = False,color="blue",fill=True)  # fill replaces the deprecated shade argument

Some countplots :

In [None]:
plt.figure(figsize=(20,10))
sb.countplot(x='Type1',data = df)

Water is the most common primary type, followed by Normal. Normal types were the most used Pokemon in the early days.


In [None]:
plt.figure(figsize=(20,10))
sb.countplot(x='Type2',data = df)

Many Pokemon do not have a second type.

In [None]:
sb.catplot(x='Generation', data=df,col='Type1',kind='count',col_wrap=3).set_axis_labels('Generation', 'Pokemons');

Conclusion: Generation 1 had loads of Poison-type Pokemon, but the numbers diminish in the generations after that.
The number of Pokemon with Flying as their primary type is pretty much negligible. Generation 5 has a lot of Psychic and Dark Pokemon, while Steel had a high count in Generation 3.
Normal Pokemon had a good count until Generation 6.

In [None]:
plt.figure(figsize = (15,10))
dualTypes = df[df['Type2'] != 'None']
sb.heatmap( dualTypes.groupby(['Type1', 'Type2']).size().unstack(),linewidths=1,annot=True)

This plot reveals that the five most common combinations of primary and secondary type are, in order:

* Normal-Flying-type
* Bug-Flying-type
* Bug-Poison-type
* Grass-Poison-type
* Water-Ground-type

In [None]:
plt.figure(figsize=(20,10))
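# histplot with kde=True stands in for the deprecated distplot; stat='density' keeps the y-axis on a density scale
Defhist = sb.histplot(df['Defense'],color='red',kde=True,stat='density')
Atthist = sb.histplot(df['Attack'],color='teal',kde=True,stat='density')
Atthist.set(title='Distribution of Defense and Attack',xlabel = 'Defense:red , Attack:teal')
FigHist = Atthist.get_figure()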

We see that both the Defense and Attack curves are positively skewed, and some Pokemon seem to have a higher Defense stat than Attack stat. Let's check the same for Special Attack and Special Defense too.
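
The skew can also be checked numerically rather than by eye. Here is a minimal sketch using pandas' `Series.skew()`; the numbers are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Made-up values standing in for a stat column such as df['Attack'].
sample = pd.Series([40, 45, 50, 50, 55, 60, 65, 80, 120, 160])

# Series.skew() returns the sample skewness; a value above 0 means the
# right tail is the longer one (positive skew).
skewness = sample.skew()
print(skewness)
```

On the real data, the same check would be something like `df[['Attack', 'Defense']].skew()`.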

In [None]:
plt.figure(figsize=(20,10))
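# histplot with kde=True stands in for the deprecated distplot; stat='density' keeps the y-axis on a density scale
SpDefHist = sb.histplot(df['SpDef'],color='red',kde=True,stat='density')
SpAttHist = sb.histplot(df['SpAtk'],color='teal',kde=True,stat='density')
SpAttHist.set(title='Distribution of Sp Defense and Sp attack',xlabel='SpDef : red , SpAtk : teal')
Fighist = SpAttHist.get_figure()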


Again, both are positively skewed, and there's only a bit of a visible difference towards the 100 side.
Let's now look at which non-legendary Pokemon has the best value for each feature: Total, HP, Attack, Defense, SpAtk, SpDef and Speed.

In [None]:
# Named stat_cols so that the scipy `stats` import above is not overwritten
stat_cols = ['Total','HP','Attack','Defense','SpAtk','SpDef','Speed']


def maxStats(df,cols):
    """Return one sentence per column naming the first Pokemon with the highest value."""
    st = ''
    for col in cols:
        best = df.loc[df[col].idxmax()]  # first row that holds the column maximum
        st += best['Name'] + " of Generation "+str(best['Generation'])+" has the best "+col+" stat of "+str(best[col])+".\n"

    return st
print(maxStats(NL_Poke,stat_cols))

Mega Evolution is not a common ability, so a Pokemon that has it obviously adds power to a team. It's worth checking whether a Pokemon can Mega Evolve.

In [None]:
#Compare base stats of all generations
plt.figure(figsize=(20,10))
bp = sb.boxplot(x='Generation',y='Total',data=NL_Poke)
plt.title('Base Stat Total',fontsize=17)
plt.xlabel('Generation',fontsize=12)
plt.ylabel('Total',fontsize=12)

From a glance at this box plot, we can infer that Generation 4 has the best total stat of all the generations.

In [None]:
df.sort_values('Total',ascending=False).head(30)

As I played around, I came across a problem: about 6-7 Pokemon are vying for the same spot because they have exactly the same Total. Ranking by this metric produces ties that it cannot break, so simply summing the statistics is not the answer. We need to come up with a new technique.

Hence, we'll make use of the z-score here. Each of the 6 features is converted into a z-score (the value minus the feature's mean, divided by the feature's standard deviation across all Pokemon species), so when we take the sum it accounts for how much each statistic varies.
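
To see why this helps with ties, here is a small sketch on made-up numbers (two features only, four hypothetical Pokemon labelled A to D). A and B have the same raw total, but their z-score sums differ because the two features have different spreads.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the data set.
toy = pd.DataFrame(
    {"HP": [100, 60, 50, 70], "Speed": [50, 90, 65, 35]},
    index=["A", "B", "C", "D"],
)

# Raw totals: A and B both come out at 150.
raw_total = toy.sum(axis="columns")

# z-score each column: subtract the column mean and divide by the column
# standard deviation (pandas uses the sample std, ddof=1, by default).
z = (toy - toy.mean()) / toy.std()

# Summing the z-scores weighs each feature relative to its own spread.
strength = z.sum(axis="columns")

print(raw_total)
print(strength)
```

Because HP varies less than Speed in this toy table, being 30 points above average in HP counts for more than being 30 points above average in Speed, and A ends up ahead of B.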


In [None]:
stdStats = statistics.drop('Total', axis='columns').set_index('Num').apply(
    lambda x: (x - x.mean()) / x.std())


We'll define a new column called strength, which is the sum of the z-scores of each statistic. The higher this value, the stronger the Pokemon.

In [None]:
stdStats['strength'] = stdStats.sum(axis='columns')

In [None]:
stdStats.reset_index(inplace=True)
pd.merge(Dex, stdStats, on='Num').sort_values('strength', ascending=False).head(30)

Yes!! That's perfect. We now have a definite ranking of the strongest Pokemon. But legendary Pokemon are hard to catch, so let's alter this a bit by applying the same technique to non-legendary Pokemon only.

In [None]:
pd.merge( Dex[~Dex['Legendary']],stdStats, on='Num').sort_values('strength', ascending=False).head(10)

The result lists Cresselia, Celebi and Mew as non-legendary Pokemon, although they are legendary.
We can ignore this problem and just look at the truly non-legendary Pokemon on the list.
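
If one prefers to correct the labels instead of ignoring them, a manual override is one option. The sketch below runs on a toy stand-in for the Dex table; the `mislabelled` list is a hypothetical helper built from the three names mentioned above, not something the data set provides.

```python
import pandas as pd

# Toy stand-in for the Dex table; the Legendary values mimic the
# mislabelling described above.
dex_toy = pd.DataFrame({
    "Name": ["Cresselia", "Celebi", "Mew", "Pikachu"],
    "Legendary": [False, False, False, False],
})

# Hypothetical manual override list (assumption: only these three names).
mislabelled = ["Cresselia", "Celebi", "Mew"]

# Flip the flag for every row whose name is on the list.
dex_toy.loc[dex_toy["Name"].isin(mislabelled), "Legendary"] = True

# Only the rows that are still non-legendary remain in this view.
print(dex_toy[~dex_toy["Legendary"]])
```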

Let's now find the strongest dual-type pokemons.

In [None]:
joined = pd.merge(Dex,stdStats,on='Num')
# Select the strength column before taking the median; pandas 2.x raises on text columns such as Name
medians = joined.groupby(['Type1', 'Type2'])['strength'].median()
plt.figure(figsize=(20,10))
sb.heatmap(medians.unstack(),linewidths=1, cmap='RdYlBu_r');

In [None]:
medians.reset_index().sort_values('strength', ascending=False).head()

Conclusions: 

The five strongest combinations of primary and secondary types are shown above.
The strongest Pokemon tend to have Dragon-type as either their primary or secondary type, and among the weakest Pokemon are primary Bug-types.

Let's filter out the legendary Pokemon and check again.

In [None]:
joined = pd.merge(Dex[Dex['Legendary']==False],stdStats,on='Num')
# Select the strength column before taking the median; pandas 2.x raises on text columns such as Name
medians = joined.groupby(['Type1', 'Type2'])['strength'].median()
plt.figure(figsize=(20,10))
sb.heatmap(medians.unstack(),linewidths=1, cmap='RdYlBu_r');

In [None]:
medians.reset_index().sort_values('strength', ascending=False).head()

The results change a bit when excluding legendary Pokemon: Dragon-type is not dominating here; in fact, there's more diversity in strength among the different types. This also indicates that many legendary Pokemon species are Dragon-type, and the game maintains balance!

Do any pokemon in particular have an advantage in some characteristic feature over the others?
Let's have a look at HP and Speed for non-legendary pokemon.

In [None]:
joined = pd.merge(Dex[Dex['Legendary']==False],stdStats,on='Num')
plt.figure(figsize=(20,10))
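# numeric_only=True keeps the median from failing on text columns such as Name and Type2
sb.heatmap( joined.groupby('Type1').median(numeric_only=True).loc[:, 'HP':'Speed'], linewidths=1,cmap='RdYlBu_r')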

Conclusions:
* The fastest Pokemon are Flying-type or Electric-type, while Fairy-type, Rock-type or Steel-type are the slow ones.
* Fighting-type Pokemon have crazy attack power but the worst special attack power.
* Psychic-type, Flying-type and Fairy-type Pokemon have horrible attack power. The former two at least make it up by excelling elsewhere.
* No particular type of Pokemon has standout HP or special defense.
* Water-type Pokemon have average statistics across the board, which confirms our earlier hunch as they're the most common type of Pokemon. (Game balance +1 )
* Rock-type and Steel-type Pokemon still have the absolute best defense but lack in speed. (Perhaps there is a correlation between these two statistics?)
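
The question in the last bullet can be answered with a correlation coefficient. Here is a minimal sketch using pandas' `Series.corr()` (Pearson by default); the numbers are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Made-up standardized values, for illustration only.
toy = pd.DataFrame({
    "Defense": [0.9, 0.6, 0.1, -0.3, -0.5],
    "Speed":   [-0.7, -0.4, 0.0, 0.5, 0.3],
})

# Pearson correlation coefficient between the two columns, in [-1, 1].
# A negative value means high Defense tends to go with low Speed.
r = toy["Defense"].corr(toy["Speed"])
print(r)
```

On the real data, the equivalent call would be `df['Defense'].corr(df['Speed'])`; the correlation heatmap near the top of the notebook shows the same number.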

That's all folks.
This is my analysis of the Pokemon data set.
Do let me know if y'all come up with something interesting.