# Pokemon Dataset using Pandas

In [112]:
import pandas as pd
import re

First, load in the data and store it as a DataFrame.
Printing the first 5 rows using the .head() method gives an overview of the structure of the data.

In [113]:
# read_csv already returns a DataFrame, so no further conversion is needed.
df = pd.read_csv('pokemon_data.csv')
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
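As a small aside on the loading step, `pd.read_csv` returns a DataFrame directly, and empty fields (such as Charmander's 'Type 2') are read as missing values. The sketch below illustrates this with two of the rows shown above, read from an in-memory string rather than from `pokemon_data.csv`; the `io.StringIO` wrapper and the name `sample` are assumptions made for illustration only.

```python
import io
import pandas as pd

# Two rows copied from the output above, held in memory instead of a file.
csv_text = """#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
"""

sample = pd.read_csv(io.StringIO(csv_text))

# read_csv returns a DataFrame directly.
print(type(sample).__name__)

# Empty fields such as Charmander's 'Type 2' are read as NaN.
print(sample['Type 2'].isna().tolist())
```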


Compared with most Pokemon databases, this data is missing a useful column: the total of each Pokemon's stats.

Adding a 'Total' column allows the stat totals of different Pokemon to be compared.

In [114]:
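# Rename the columns containing '. ' so they can be used with attribute access (e.g. df.Sp_Atk).
df.rename(columns={'Sp. Atk': 'Sp_Atk', 'Sp. Def': 'Sp_Def'}, inplace=True)
# Sum the six stats into a new 'Total' column.
df['Total'] = df.HP + df.Attack + df.Defense + df.Sp_Atk + df.Sp_Def + df.Speed
df.reset_index(inplace=True, drop=True)
df.head()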

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,Legendary,Total
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False,318
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False,405
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False,525
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False,625
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False,309
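The same total could also be computed without renaming any columns, by selecting the stat columns with bracket notation and summing across each row. This is only a sketch of an alternative, not the approach used in this notebook; the names `sample` and `stat_cols` are assumptions, and the values are the first two rows of the original data.

```python
import pandas as pd

# Stats for the first two rows of the original data.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur'],
    'HP': [45, 60], 'Attack': [49, 62], 'Defense': [49, 63],
    'Sp. Atk': [65, 80], 'Sp. Def': [65, 80], 'Speed': [45, 60],
})

# Bracket selection works with column names that contain spaces or dots.
stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# axis=1 sums across the columns of each row.
sample['Total'] = sample[stat_cols].sum(axis=1)
print(sample['Total'].tolist())
```

The printed totals should match the 'Total' values shown for Bulbasaur and Ivysaur in the output above.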


Next, we want to keep only the entries for non-Mega and non-Primal Pokemon.

To do this, we will first look at all the entries in the DataFrame whose names contain 'Mega' or 'mega'. It is important to note that there may be ordinary Pokemon whose names happen to contain the string 'mega' (hint: there are!).

In [115]:
df[df['Name'].str.contains('mega') | df['Name'].str.contains('Mega')]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,Legendary,Total
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False,625
7,6,CharizardMega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False,634
8,6,CharizardMega Charizard Y,Fire,Flying,78,104,78,159,115,100,1,False,634
12,9,BlastoiseMega Blastoise,Water,,79,103,120,135,115,78,1,False,630
19,15,BeedrillMega Beedrill,Bug,Poison,65,150,40,15,80,145,1,False,495
23,18,PidgeotMega Pidgeot,Normal,Flying,83,80,80,135,80,121,1,False,579
71,65,AlakazamMega Alakazam,Psychic,,55,50,65,175,95,150,1,False,590
87,80,SlowbroMega Slowbro,Water,Psychic,95,75,180,130,80,30,1,False,590
102,94,GengarMega Gengar,Ghost,Poison,60,65,80,170,95,130,1,False,600
124,115,KangaskhanMega Kangaskhan,Normal,,105,125,100,60,100,100,1,False,590
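The two `str.contains` checks can also be collapsed into a single case-insensitive match using the `case=False` argument. A minimal sketch on a handful of names taken from this dataset (the Series `names` is an assumption for illustration):

```python
import pandas as pd

names = pd.Series(['Venusaur', 'VenusaurMega Venusaur', 'Meganium', 'Yanmega'])

# case=False matches 'Mega', 'mega', and any other capitalisation.
has_mega = names.str.contains('mega', case=False)
print(names[has_mega].tolist())
```

Note that this is slightly broader than the original check, since it would also match spellings such as 'MEGA'.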


In order to filter out the actual Mega and Primal Pokemon, we will create a Boolean mask which is True for all entries containing the words 'Mega' or 'Primal'. However, we also have to consider Pokemon whose ordinary names contain the string 'Mega' or 'mega'.

From a quick glance at the above output, it seems Mega Pokemon have at least two words in their names. As such, we can add another condition, so that the mask returns True if 'Mega' is contained in the name *and* the name consists of more than one word.

We then modify our DataFrame so that it only contains rows where the value of our mask is False.

In [116]:
mask = ((df['Name'].str.contains('Primal')) | 
        ((df['Name'].str.contains('Mega')) &
         df['Name'].apply(lambda x: len(x.split()) > 1)))

df = df[~mask]
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,Legendary,Total
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False,318
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False,405
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False,525
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False,309
5,5,Charmeleon,Fire,,58,64,58,80,65,80,1,False,405
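To see how the mask behaves on the tricky cases, it can be tried on a few names in isolation. The sketch below uses the vectorised `.str.split().str.len()` in place of `apply` with a lambda, which should give the same word counts. The Series `names` is an assumption for illustration, and 'KyogrePrimal Kyogre' is assumed to follow the same naming pattern as the Mega entries.

```python
import pandas as pd

names = pd.Series([
    'Venusaur',
    'VenusaurMega Venusaur',
    'Meganium',
    'Yanmega',
    'KyogrePrimal Kyogre',  # assumed naming pattern for a Primal entry
])

# Number of whitespace-separated words in each name.
word_count = names.str.split().str.len()

# True for Primal entries, and for names with 'Mega' that have more than one word.
mask = names.str.contains('Primal') | (names.str.contains('Mega') & (word_count > 1))

# Keep only the rows where the mask is False.
print(names[~mask].tolist())
```

Meganium survives because its name is a single word, and Yanmega survives because the match on 'Mega' is case-sensitive.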


Now we will check whether the Pokemon with 'mega' in their ordinary names are still present in the DataFrame.

In [117]:
df[df['Name'].str.contains('mega') | df['Name'].str.contains('Mega')]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,Legendary,Total
168,154,Meganium,Grass,,80,82,100,83,100,80,2,False,525
520,469,Yanmega,Bug,Flying,86,76,86,116,56,95,4,False,515


They are!

We only want to work with Pokemon from generations 1 through 4.
To do this, we can simply select the Pokemon from our DataFrame where the Generation value is between 1 and 4 inclusive.

In [118]:
df = df[(df.Generation >= 1) & (df.Generation <= 4)]
# The largest remaining index label shows where the filtered data ends.
print(df.index.max())
df.tail()

552


Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,Legendary,Total
548,490,Manaphy,Water,,100,100,100,100,100,100,4,False,600
549,491,Darkrai,Dark,,70,90,90,135,90,125,4,True,600
550,492,ShayminLand Forme,Grass,,100,100,100,100,100,100,4,True,600
551,492,ShayminSky Forme,Grass,Flying,100,103,75,120,75,127,4,True,600
552,493,Arceus,Normal,,120,120,120,120,120,120,4,True,720
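The same range filter can be written with `Series.between`, which is inclusive at both ends by default. A minimal sketch, assuming a small hand-made frame named `sample`; Victini is included only as an illustrative Generation 5 entry and does not appear in the output above.

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Meganium', 'Arceus', 'Victini'],
    'Generation': [1, 2, 4, 5],  # Victini: illustrative Generation 5 row
})

# between(1, 4) is equivalent to (Generation >= 1) & (Generation <= 4).
in_range = sample['Generation'].between(1, 4)
print(sample.loc[in_range, 'Name'].tolist())
```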


Here we can see that the last Pokemon in the DataFrame are from Generation 4.

Now that the dataset has been cleaned, the Pokemon with the highest stat totals can be compared by sorting the DataFrame by the 'Total' column.

This will allow us to get an idea of which Pokemon are the strongest.
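If only the top few entries are of interest, `DataFrame.nlargest` is a possible alternative to sorting the whole DataFrame. A minimal sketch, using totals taken from earlier outputs (the frame `sample` and the name `top3` are assumptions for illustration):

```python
import pandas as pd

# Names and totals taken from earlier outputs in this notebook.
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Venusaur', 'Charmander', 'Manaphy', 'Arceus'],
    'Total': [318, 525, 309, 600, 720],
})

# Return the three rows with the highest 'Total', in descending order.
top3 = sample.nlargest(3, 'Total')
print(top3['Name'].tolist())
```

The full sort used in this notebook remains the better choice when both ends of the ranking are of interest.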

In [119]:
df.sort_values(by=['Total'], ascending=False)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,Legendary,Total
552,493,Arceus,Normal,,120,120,120,120,120,120,4,True,720
545,487,GiratinaOrigin Forme,Ghost,Dragon,150,120,100,120,100,90,4,True,680
270,250,Ho-oh,Fire,Flying,106,130,90,110,154,90,2,True,680
540,483,Dialga,Steel,Dragon,100,120,120,150,100,90,4,True,680
541,484,Palkia,Water,Dragon,90,120,100,150,120,100,4,True,680
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13,10,Caterpie,Bug,,45,30,35,20,20,45,1,False,195
16,13,Weedle,Bug,Poison,40,35,30,20,20,50,1,False,195
446,401,Kricketot,Bug,,37,25,41,25,41,25,4,False,194
322,298,Azurill,Normal,Fairy,50,20,40,20,40,20,3,False,190
