## Table of Contents
- [pokemon_df](#pokemon_df)
- [pokemon_types_df](#pokemon_types_df)
- [pokemon_abilities_df](#pokemon_abilities_df)
- [moves_df](#moves_df)
- [strategydex_df](#strategydex_df)
- [pokemon_learnsets_df](#pokemon_learnsets_df)
- [strategies_dict](#strategies_dict)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import json
import pandas_profiling

<a id='pokemon_df'></a>
### pokemon_df

First, we need to load the dictionary of pokemon data scraped from Smogon.com, stored in smogonpokemondata2021/smogonpokemondata2021/scraped_data/pokedex_dict.json (a path relative to this file).

To examine the structure and content of this JSON in more detail through a convenient graphical interface, I recommend copying the text and pasting it into http://jsonviewer.stack.hu/

In [2]:
with open("smogonpokemondata2021/smogonpokemondata2021/scraped_data/pokedex_dict.json") as infile:
    pokedex_dict = json.load(infile)
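
For readers who prefer to stay inside the notebook, the nesting of a parsed JSON object can also be summarized in Python. The following is a minimal sketch on a toy object; the helper name `summarize` and the toy data are assumptions for illustration and are not part of the original project.

```python
import json


def summarize(obj, depth=0, max_depth=3):
    """Return indented lines describing the nesting of a parsed JSON object."""
    pad = "  " * depth
    if not isinstance(obj, (dict, list)) or not obj:
        return []  # scalars and empty containers have nothing to expand
    if depth >= max_depth:
        return [f"{pad}..."]
    if isinstance(obj, dict):
        lines = []
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(summarize(value, depth + 1, max_depth))
        return lines
    # For lists, describe only the first element (assumes homogeneous lists)
    first = obj[0]
    header = f"{pad}[0]: {type(first).__name__} ({len(obj)} items)"
    return [header] + summarize(first, depth + 1, max_depth)


# Toy stand-in for the scraped file; the real structure is much larger
toy = json.loads(
    '{"pokemon": [{"name": "Bulbasaur", "hp": 45}], "types": [{"name": "Bug"}]}'
)
summary = summarize(toy)
print("\n".join(summary))
```

Raising `max_depth` shows deeper levels, at the cost of a longer printout.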

Once loaded, we need to turn that JSON object into a pandas DataFrame.  The part of the JSON we are accessing is easiest to locate in the jsonviewer mentioned above.  This is the data about the pokemon, which are the central feature of our project.  It contains the names, stats, types, and abilities, as well as the target variable "formats" (the competitive format that we ultimately aim to predict).

In [3]:
pokemon_df = pd.DataFrame(pokedex_dict['injectRpcs'][1][1]['pokemon'])
pokemon_df

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
0,Bulbasaur,45,49,49,65,65,45,6.9,0.7,"[Grass, Poison]","[Chlorophyll, Overgrow]",[LC],Standard,"{'dex_number': 1, 'evos': ['Ivysaur'], 'alts':..."
1,Ivysaur,60,62,63,80,80,60,13.0,1.0,"[Grass, Poison]","[Chlorophyll, Overgrow]",[NFE],Standard,"{'dex_number': 2, 'evos': ['Venusaur'], 'alts'..."
2,Venusaur,80,82,83,100,100,80,100.0,2.0,"[Grass, Poison]","[Chlorophyll, Overgrow]",[RUBL],Standard,"{'dex_number': 3, 'evos': [], 'alts': ['Venusa..."
3,Charmander,39,52,43,60,50,65,8.5,0.6,[Fire],"[Blaze, Solar Power]",[LC],Standard,"{'dex_number': 4, 'evos': ['Charmeleon'], 'alt..."
4,Charmeleon,58,64,58,80,65,80,19.0,1.1,[Fire],"[Blaze, Solar Power]",[NFE],Standard,"{'dex_number': 5, 'evos': ['Charizard'], 'alts..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1201,Scizor-Mega,70,150,140,65,100,75,125.0,2.0,"[Bug, Steel]",[Technician],[National Dex],NatDex,
1202,Blastoise-Gmax,79,83,100,85,105,78,0.0,1.6,[Water],"[Rain Dish, Torrent]",[AG],Standard,
1203,Crucibelle-Mega,106,135,75,91,125,108,22.5,1.4,"[Rock, Poison]",[Magic Guard],[CAP],CAP,
1204,Meowth-Gmax,40,45,35,40,40,90,0.0,33.0,[Normal],"[Pickup, Technician, Unnerve]",[AG],Standard,


In [4]:
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1206 entries, 0 to 1205
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           1206 non-null   object 
 1   hp             1206 non-null   int64  
 2   atk            1206 non-null   int64  
 3   def            1206 non-null   int64  
 4   spa            1206 non-null   int64  
 5   spd            1206 non-null   int64  
 6   spe            1206 non-null   int64  
 7   weight         1206 non-null   float64
 8   height         1206 non-null   float64
 9   types          1206 non-null   object 
 10  abilities      1206 non-null   object 
 11  formats        1206 non-null   object 
 12  isNonstandard  1206 non-null   object 
 13  oob            1102 non-null   object 
dtypes: float64(2), int64(6), object(6)
memory usage: 132.0+ KB


From the info, we can see that our dataframe is mostly full, which is great, but the oob column has 104 null values (1102 non-null out of 1206 rows), which is a good first thing to check out.

In [5]:
pokemon_df.loc[pokemon_df.loc[:, 'oob'].isnull()]

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
1102,Darmanitan-Galar-Zen,105,160,55,30,55,135,120.0,1.7,"[Ice, Fire]",[Zen Mode],[Uber],Standard,
1103,Houndoom-Mega,75,90,90,140,90,115,49.5,1.9,"[Dark, Fire]",[Solar Power],[National Dex],NatDex,
1104,Blastoise-Mega,79,103,120,135,115,78,101.1,1.6,[Water],[Mega Launcher],[National Dex],NatDex,
1105,Alcremie-Gmax,65,60,75,110,121,64,0.0,30.0,[Fairy],"[Aroma Veil, Sweet Veil]",[AG],Standard,
1106,Mimikyu-Busted-Totem,55,90,80,50,105,96,2.8,0.4,"[Ghost, Fairy]",[Disguise],[National Dex],NatDex,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1201,Scizor-Mega,70,150,140,65,100,75,125.0,2.0,"[Bug, Steel]",[Technician],[National Dex],NatDex,
1202,Blastoise-Gmax,79,83,100,85,105,78,0.0,1.6,[Water],"[Rain Dish, Torrent]",[AG],Standard,
1203,Crucibelle-Mega,106,135,75,91,125,108,22.5,1.4,"[Rock, Poison]",[Magic Guard],[CAP],CAP,
1204,Meowth-Gmax,40,45,35,40,40,90,0.0,33.0,[Normal],"[Pickup, Technician, Unnerve]",[AG],Standard,


These are a lot of alternative forms of pokemon, such as Mega, Gmax and Totem, which we are not going to include in our project for the following reasons:

1. They are difficult to represent in the context of the machine learning model we wish to train.

2. They are outside of the standard Smogon 6v6 pokemon battle rulesets, which give a fair and consistent context to the ranking of pokemon that we are trying to predict.

Therefore, all of these pokemon and their rows will simply be removed.

To have a record of pokemon that we are removing, they will be stored in another dataframe called pokemon_removed_df.  This is useful for records and also in the case that we decide to use the removed pokemon in another way in the future of this project.

In [6]:
pokemon_removed_df = pokemon_df.loc[pokemon_df.loc[:, 'oob'].isnull()].copy()
pokemon_df = pokemon_df.loc[~pokemon_df.loc[:, 'oob'].isnull()].copy()
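
The keep/remove split above follows a pattern that recurs throughout this notebook: one boolean mask decides which rows stay and which are archived. A minimal sketch of that pattern on toy data follows; the helper name `split_rows` and the toy rows are assumptions for illustration only.

```python
import pandas as pd


def split_rows(df, mask):
    """Return (kept, removed) copies of df; rows where mask is True are removed."""
    return df.loc[~mask].copy(), df.loc[mask].copy()


# Toy frame: two rows have a missing oob entry, like the alternative forms above
toy_df = pd.DataFrame({
    "name": ["Bulbasaur", "Scizor-Mega", "Meowth-Gmax"],
    "oob": [{"dex_number": 1}, None, None],
})

kept_df, removed_df = split_rows(toy_df, toy_df["oob"].isnull())

# Every row must end up in exactly one of the two frames
assert len(kept_df) + len(removed_df) == len(toy_df)
print(kept_df["name"].tolist(), removed_df["name"].tolist())
```

Computing the mask once and reusing it for both frames guarantees that the two selections are exact complements.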

With that obvious step out of the way, let's make a pandas profiling report to see if there are any less obvious things in pokemon_df that we should be examining or cleaning.

In [None]:
profile_report = pokemon_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

We gained two key insights from this report:

1. Several columns (types, abilities, formats, and oob) have unsupported types, which prevents us from gaining much further insight about them.  This is a problem that we will have to solve. Types and abilities will have their own separate dataframes, so we will deal with those later.  Formats only contains a list with a single value though, so it's easier to solve now.

2. If you examine the extreme values of a numerical column like "weight", you can see that the number of pokemon with those weights doesn't match the number of such pokemon listed on pokemon reference websites (such as Bulbapedia).  Let's examine this now:

In [7]:
pokemon_df.loc[pokemon_df['weight'] == 0.2]

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
875,Cutiefly,40,45,40,55,40,84,0.2,0.1,"[Bug, Fairy]","[Honey Gather, Shield Dust, Sweet Veil]",[NFE],Standard,"{'dex_number': 742, 'evos': ['Ribombee'], 'alt..."
1043,Sinistea,40,45,45,74,54,50,0.2,0.1,[Ghost],"[Cursed Body, Weak Armor]",[LC],Standard,"{'dex_number': 854, 'evos': ['Polteageist'], '..."
1044,Sinistea-Antique,40,45,45,74,54,50,0.2,0.1,[Ghost],"[Cursed Body, Weak Armor]",[LC],Standard,"{'dex_number': 854, 'evos': ['Polteageist-Anti..."


In [8]:
pokemon_df.loc[pokemon_df['weight'] == 0.3]

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
389,Rotom-Heat,50,65,107,105,107,86,0.3,0.3,"[Electric, Fire]",[Levitate],[UU],Standard,"{'dex_number': 479, 'evos': [], 'alts': [], 'g..."
390,Rotom-Wash,50,65,107,105,107,86,0.3,0.3,"[Electric, Water]",[Levitate],[UU],Standard,"{'dex_number': 479, 'evos': [], 'alts': [], 'g..."
391,Rotom-Frost,50,65,107,105,107,86,0.3,0.3,"[Electric, Ice]",[Levitate],[Untiered],Standard,"{'dex_number': 479, 'evos': [], 'alts': [], 'g..."
392,Rotom-Fan,50,65,107,105,107,86,0.3,0.3,"[Electric, Flying]",[Levitate],[Untiered],Standard,"{'dex_number': 479, 'evos': [], 'alts': [], 'g..."
393,Rotom-Mow,50,65,107,105,107,86,0.3,0.3,"[Electric, Grass]",[Levitate],[NU],Standard,"{'dex_number': 479, 'evos': [], 'alts': [], 'g..."
517,Rotom,50,50,77,95,77,91,0.3,0.3,"[Electric, Ghost]",[Levitate],[Untiered],Standard,"{'dex_number': 479, 'evos': [], 'alts': [], 'g..."
518,Uxie,75,75,130,75,130,95,0.3,0.3,[Psychic],[Levitate],[Untiered],Standard,"{'dex_number': 480, 'evos': [], 'alts': [], 'g..."
519,Mesprit,80,105,105,105,105,80,0.3,0.3,[Psychic],[Levitate],[PU],Standard,"{'dex_number': 481, 'evos': [], 'alts': [], 'g..."
520,Azelf,75,125,70,125,70,115,0.3,0.3,[Psychic],[Levitate],[UU],Standard,"{'dex_number': 482, 'evos': [], 'alts': [], 'g..."
661,Tynamo,35,55,40,45,40,60,0.3,0.2,[Electric],[Levitate],[National Dex],NatDex,"{'dex_number': 602, 'evos': ['Eelektrik'], 'al..."


Here we can see what the problem is: there are other alternative forms of pokemon besides the ones we deleted before.  Some examples are the "antique" form (Sinistea has it), and the alternative forms of Rotom.  We need a strategy to find all of these alternative forms and decide what to do with them, which is probably the most difficult task in cleaning pokemon_df.

So far, the best strategy for finding all of the alternative forms seems to be looking for every pokemon name that contains the "-" character, since it has been attached to every alternative form so far (though not every instance of "-" necessarily corresponds to an alternative form).

We'll have to convert the slice of the dataframe we're looking for to a string, because we need to see all of it and that's one of the easiest ways to display it.

In [None]:
print(pokemon_df.loc[pokemon_df['name'].str.contains("-")].to_string())
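
As a possible cross-check on the hyphen heuristic, the oob entries shown in the outputs above carry a 'dex_number' key, and alternative forms appear to share the dex number of their base pokemon (all of the Rotom forms show 479). The sketch below combines both signals on toy data. It is an assumption-laden illustration rather than the method used in this notebook: the toy rows and variable names are hypothetical, and it presumes every oob entry is a dict with a 'dex_number' key.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["Rotom", "Rotom-Heat", "Ho-Oh", "Sinistea", "Sinistea-Antique", "Uxie"],
    "oob": [{"dex_number": 479}, {"dex_number": 479}, {"dex_number": 250},
            {"dex_number": 854}, {"dex_number": 854}, {"dex_number": 480}],
})

# regex=False treats "-" as a literal character rather than a pattern
has_hyphen = toy_df["name"].str.contains("-", regex=False)

# Pull the dex number out of each oob dict, then flag numbers that occur more than once
dex_number = toy_df["oob"].apply(lambda entry: entry["dex_number"])
shares_dex_number = dex_number.duplicated(keep=False)

# Hyphenated names with a unique dex number are probably just names
name_only = toy_df.loc[has_hyphen & ~shares_dex_number, "name"].tolist()
# Hyphenated names that share a dex number are candidates for manual review
alt_form_candidates = toy_df.loc[has_hyphen & shares_dex_number, "name"].tolist()

print(name_only, alt_form_candidates)
```

This only narrows the list. Whether a candidate is competitively distinct still requires the manual judgment described next.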

Which alternative forms of pokemon are fair and can remain in the dataframe as legitimately separate forms for competitive purposes?
- The different forms of Rotom, because they have completely different types and moveset strategies on Smogon.com
- The Nidorans, because the M and F versions correspond to different genders which are completely different pokemon
- Ho-Oh, because its name simply contains a "-"
- Porygon-Z, because its name simply contains a "-"
- Giratina-Origin, because it's a different form of Giratina with a totally different set of stats, and thus is quite different competitively
- Therian forms of the Forces of Nature because they have different stats as with Giratina
- Kyurem-White and Kyurem-Black are stat changes similar to Origin form or Therian forms
- Meowstic genders, which have different abilities and strategies
- Pumpkaboo and Gourgeist alternate size forms, which have different stats and strategies
- Lycanroc forms, which have different stats and strategies
- Alola forms of Raichu, Sandslash, Ninetales, Dugtrio, Persian, Exeggutor, Marowak, and all other Alola forms in their evolutionary families, because Alola forms are typed differently and are essentially different pokemon
- Different forms of Silvally, since they are differently typed
- pokemon with "-o", because that's just the name of an evolutionary family of pokemon, not an alternative form
- Necrozma alternative forms, because they have type, stat and tier differences
- Galar forms of Meowth, Rapidash, Slowbro, Farfetch'd, Weezing, Mr. Mime, Articuno, Zapdos, Moltres, Slowking, Corsola, Linoone, Darmanitan, Yamask, Stunfisk, and all other Galar forms in their evolutionary families, because Galar forms, like Alola forms, are typed differently and are essentially different pokemon
- Indeedee F and M, because they are different genders with different stats and type, similar to Nidoran and Meowstic
- Zamazenta-Crowned, because it's an alternative form of Zamazenta with different stats
- Urshifu forms, because they have different type, moves and competitive tiering
- Calyrex forms, because they have different types and abilities

By contrast, which alternative forms of pokemon are either purely appearance oriented (and thus essentially the same pokemon) or unfair for some reason, and thus must be removed?
- Basculin forms, because they differ mostly in appearance; they have an ability difference, but it isn't used competitively
- Keldeo-Resolute, which is purely an appearance change that happens when it knows a certain move
- Genesect forms, because the alternative forms are almost never used competitively, and even when they are, they mostly just change the type of the move Techno Blast; since this move is in the learnset, this feature of Genesect will be indirectly accounted for by the model anyway
- Pikachu alternative forms, because they are purely appearance
- Magearna-Original, because it's purely appearance
- Toxtricity-Low-Key, which differs in moves and ability, but the difference isn't used competitively and otherwise it's just appearance
- Sinistea and Polteageist antique, since the antique form only affects appearance
- Zacian Crowned, because this form of Zacian was deemed too powerful even for the highest competitive tier, Uber, so it was banned
- Zarude Dada, because it's purely an appearance difference from Zarude

Let's make the necessary removals below and add them to pokemon_removed_df:

In [9]:
# Alternative forms judged above to be cosmetic, competitively irrelevant, or banned
forms_to_remove = [
    'Basculin-Blue-Striped', 'Keldeo-Resolute',
    'Genesect-Douse', 'Genesect-Shock', 'Genesect-Burn', 'Genesect-Chill',
    'Pikachu-Original', 'Pikachu-Hoenn', 'Pikachu-Sinnoh', 'Pikachu-Unova',
    'Pikachu-Kalos', 'Pikachu-Alola', 'Pikachu-Partner', 'Pikachu-World',
    'Magearna-Original', 'Toxtricity-Low-Key',
    'Sinistea-Antique', 'Polteageist-Antique',
    'Zacian-Crowned', 'Zarude-Dada',
]
is_removed_form = pokemon_df['name'].isin(forms_to_remove)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
pokemon_removed_df = pd.concat([pokemon_removed_df, pokemon_df.loc[is_removed_form]])
pokemon_df = pokemon_df.loc[~is_removed_form]

pokemon_removed_df.shape

(124, 14)

With that difficult and research-intensive task taken care of, we can switch to cleaning the "formats" column, which contains our target variable as a list with a single entry.  We should check precisely what data type the entries of this column have, since it's unsupported by pandas_profiling and may merely resemble a list.

In [None]:
for i, l in enumerate(pokemon_df["formats"]):
    print("list",i,"is",type(l))

They are in fact lists, so we can apply pd.Series to each entry of this column, which turns the single-element lists into plain string values that can be processed more conveniently.

In [10]:
pokemon_formats = pokemon_df['formats'].apply(pd.Series)
pokemon_formats

Unnamed: 0,0
0,LC
1,NFE
2,RUBL
3,LC
4,NFE
...,...
1097,
1098,CAP
1099,
1100,CAP
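
The same flattening can be sketched with the `.str` accessor, which also indexes into list-valued entries. This is an alternative shown on toy data, not the approach used above, and it assumes that the missing values in the output come from empty lists.

```python
import pandas as pd

# Toy column of single-element lists, with one empty list
toy_formats = pd.Series([["LC"], ["NFE"], [], ["CAP"]], name="formats")

# .str[0] takes the first element of each list and yields NaN where there is none
flat = toy_formats.str[0]
print(flat)
```

Because the result is already a Series aligned with the original index, it could be assigned straight back to the column.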


In [11]:
pokemon_df['formats'] = pokemon_formats.values
pokemon_df

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
0,Bulbasaur,45,49,49,65,65,45,6.9,0.7,"[Grass, Poison]","[Chlorophyll, Overgrow]",LC,Standard,"{'dex_number': 1, 'evos': ['Ivysaur'], 'alts':..."
1,Ivysaur,60,62,63,80,80,60,13.0,1.0,"[Grass, Poison]","[Chlorophyll, Overgrow]",NFE,Standard,"{'dex_number': 2, 'evos': ['Venusaur'], 'alts'..."
2,Venusaur,80,82,83,100,100,80,100.0,2.0,"[Grass, Poison]","[Chlorophyll, Overgrow]",RUBL,Standard,"{'dex_number': 3, 'evos': [], 'alts': ['Venusa..."
3,Charmander,39,52,43,60,50,65,8.5,0.6,[Fire],"[Blaze, Solar Power]",LC,Standard,"{'dex_number': 4, 'evos': ['Charmeleon'], 'alt..."
4,Charmeleon,58,64,58,80,65,80,19.0,1.1,[Fire],"[Blaze, Solar Power]",NFE,Standard,"{'dex_number': 5, 'evos': ['Charizard'], 'alts..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1097,Solotl,68,48,34,72,24,84,11.8,0.6,"[Fire, Dragon]","[Magician, Regenerator, Vital Spirit]",,CAP,"{'dex_number': -56, 'evos': ['Astrolotl'], 'al..."
1098,Astrolotl,108,108,74,92,64,114,50.0,1.9,"[Fire, Dragon]","[Magician, Regenerator, Vital Spirit]",CAP,CAP,"{'dex_number': -57, 'evos': [], 'alts': [], 'g..."
1099,Miasmite,40,85,60,52,52,44,10.1,0.6,"[Bug, Dragon]","[Compound Eyes, Hyper Cutter, Neutralizing Gas]",,CAP,"{'dex_number': -58, 'evos': ['Miasmaw'], 'alts..."
1100,Miasmaw,85,135,60,115,85,92,57.0,1.2,"[Bug, Dragon]","[Compound Eyes, Hyper Cutter, Neutralizing Gas]",CAP,CAP,"{'dex_number': -59, 'evos': [], 'alts': [], 'g..."


We should check the unique values of this formats column to make sure they are what we are looking for.

In [12]:
pokemon_df['formats'].unique()

array(['LC', 'NFE', 'RUBL', 'PU', 'NU', 'Untiered', 'National Dex', 'UU',
       'OU', 'UUBL', 'PUBL', 'RU', 'Uber', 'NUBL', 'CAP', nan, 'AG'],
      dtype=object)

There are several formats here which are incorrect:

1. AG: this is due to Zacian, which is listed under the AG format even though only its alternative form Zacian-Crowned is AG (and Zacian-Crowned has already been removed). Regular Zacian is actually an Uber, so its "formats" value will be switched to Uber, which should remove AG from the set of possible values.

2. National Dex: This is a real format for pokemon not obtainable in generation 8 by standard means, but its pokemon have no listed strategies and in many cases not even movesets.  Since it doesn't fit within the ranking we're trying to predict either, we will remove all pokemon with this format (though we may examine later, for fun, which tier they would end up in).

3. CAP: This is the Create-A-Pokemon format, which contains pokemon which were created and are not real.  These also need to be removed, but also might be fun to examine later.

4. nan, which is a null value that likely indicates an error of some sort.

First of all, we need to know what the nan values indicate:

In [13]:
pokemon_df.loc[pokemon_df['formats'].isna()]

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
532,Syclar,40,76,45,74,39,91,4.0,0.2,"[Ice, Bug]","[Compound Eyes, Ice Body, Snow Cloak]",,CAP,"{'dex_number': -1, 'evos': ['Syclant'], 'alts'..."
533,Embirch,60,40,55,65,40,60,15.0,0.6,"[Fire, Grass]","[Chlorophyll, Leaf Guard, Reckless]",,CAP,"{'dex_number': -4, 'evos': ['Flarelm'], 'alts'..."
534,Flarelm,90,50,95,75,70,40,73.0,1.4,"[Fire, Grass]","[Battle Armor, Rock Head, White Smoke]",,CAP,"{'dex_number': -5, 'evos': ['Pyroak'], 'alts':..."
535,Breezi,50,46,69,60,50,75,0.6,0.4,"[Poison, Flying]","[Frisk, Own Tempo, Unburden]",,CAP,"{'dex_number': -7, 'evos': ['Fidgit'], 'alts':..."
536,Rebble,45,25,65,75,55,80,7.0,0.3,[Rock],"[Levitate, Sniper, Solid Rock]",,CAP,"{'dex_number': -9, 'evos': ['Tactite'], 'alts'..."
537,Tactite,70,40,65,100,65,95,16.0,0.6,[Rock],"[Levitate, Sniper, Technician]",,CAP,"{'dex_number': -10, 'evos': ['Stratagem'], 'al..."
538,Privatyke,65,75,65,40,60,45,35.0,1.0,"[Water, Fighting]","[Technician, Unaware]",,CAP,"{'dex_number': -12, 'evos': ['Arghonaut'], 'al..."
539,Voodoll,55,40,55,75,50,70,25.0,1.0,"[Normal, Dark]","[Cursed Body, Lightning Rod, Volt Absorb]",,CAP,"{'dex_number': -18, 'evos': ['Voodoom'], 'alts..."
710,Scratchet,55,85,80,20,70,40,20.0,0.5,"[Normal, Fighting]","[Prankster, Scrappy, Vital Spirit]",,CAP,"{'dex_number': -20, 'evos': ['Tomohawk'], 'alt..."
712,Necturine,49,55,60,50,75,51,1.8,0.3,"[Grass, Ghost]","[Anticipation, Telepathy]",,CAP,"{'dex_number': -22, 'evos': ['Necturna'], 'alt..."


The NaN values are always paired with an isNonstandard value of CAP, so these are almost certainly created (not real) pokemon which aren't even used in the CAP format.  They will also be removed, so let's remove everything we've set out to below:
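
Since the displayed table is truncated, the claim that every missing format pairs with CAP can also be checked programmatically. A minimal sketch on toy data, with hypothetical variable names:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["Syclar", "Astrolotl", "Bulbasaur"],
    "formats": [np.nan, "CAP", "LC"],
    "isNonstandard": ["CAP", "CAP", "Standard"],
})

missing_format = toy_df["formats"].isna()

# True only if every row with a missing format is marked CAP
all_missing_are_cap = (toy_df.loc[missing_format, "isNonstandard"] == "CAP").all()

# Breakdown of isNonstandard values among the rows with a missing format
nonstandard_of_missing = toy_df.loc[missing_format, "isNonstandard"].value_counts()

print(all_missing_are_cap)
print(nonstandard_of_missing)
```

If the breakdown ever shows a value other than CAP, those rows deserve a closer look before removal.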

In [14]:
# Fix AG: regular Zacian is actually Uber
pokemon_df.loc[pokemon_df['name'] == 'Zacian', 'formats'] = 'Uber'

# Remove National Dex pokemon (pd.concat replaces DataFrame.append, removed in pandas 2.0)
pokemon_removed_df = pd.concat([pokemon_removed_df, pokemon_df.loc[pokemon_df['isNonstandard'] == 'NatDex']])
pokemon_df = pokemon_df.loc[pokemon_df['isNonstandard'] != 'NatDex']

# Remove CAP pokemon, which also covers the rows with NaN formats
pokemon_removed_df = pd.concat([pokemon_removed_df, pokemon_df.loc[pokemon_df['isNonstandard'] == 'CAP']])
pokemon_df = pokemon_df.loc[pokemon_df['isNonstandard'] != 'CAP']

pokemon_df

Unnamed: 0,name,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
0,Bulbasaur,45,49,49,65,65,45,6.9,0.7,"[Grass, Poison]","[Chlorophyll, Overgrow]",LC,Standard,"{'dex_number': 1, 'evos': ['Ivysaur'], 'alts':..."
1,Ivysaur,60,62,63,80,80,60,13.0,1.0,"[Grass, Poison]","[Chlorophyll, Overgrow]",NFE,Standard,"{'dex_number': 2, 'evos': ['Venusaur'], 'alts'..."
2,Venusaur,80,82,83,100,100,80,100.0,2.0,"[Grass, Poison]","[Chlorophyll, Overgrow]",RUBL,Standard,"{'dex_number': 3, 'evos': [], 'alts': ['Venusa..."
3,Charmander,39,52,43,60,50,65,8.5,0.6,[Fire],"[Blaze, Solar Power]",LC,Standard,"{'dex_number': 4, 'evos': ['Charmeleon'], 'alt..."
4,Charmeleon,58,64,58,80,65,80,19.0,1.1,[Fire],"[Blaze, Solar Power]",NFE,Standard,"{'dex_number': 5, 'evos': ['Charizard'], 'alts..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1092,Glastrier,100,145,130,65,110,30,800.0,2.2,[Ice],[Chilling Neigh],NU,Standard,"{'dex_number': 896, 'evos': [], 'alts': [], 'g..."
1093,Spectrier,100,65,60,145,80,130,44.5,2.0,[Ghost],[Grim Neigh],Uber,Standard,"{'dex_number': 897, 'evos': [], 'alts': [], 'g..."
1094,Calyrex,100,80,80,80,80,80,7.7,1.1,"[Psychic, Grass]",[Unnerve],Untiered,Standard,"{'dex_number': 898, 'evos': [], 'alts': [], 'g..."
1095,Calyrex-Ice,100,165,150,85,130,50,809.1,2.4,"[Psychic, Ice]",[As One (Glastrier)],Uber,Standard,"{'dex_number': 898, 'evos': [], 'alts': [], 'g..."


That looks correct.  Let's do a few final steps: check that no undesirable formats remain, and set the pokemon names as the index of the DataFrame.  This index of pokemon names will serve as the key linking all of our different dataframes and dictionaries together by the end of this Data Wrangling step.

In [15]:
pokemon_df['formats'].unique()

array(['LC', 'NFE', 'RUBL', 'PU', 'NU', 'Untiered', 'UU', 'OU', 'UUBL',
       'PUBL', 'RU', 'Uber', 'NUBL'], dtype=object)

That seems completely fixed.
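
A visual check like this can be backed by a small set comparison, so that an unexpected format is reported explicitly. The sketch below uses the thirteen formats listed in the output above; the name `ALLOWED_FORMATS` and the toy Series are assumptions for illustration.

```python
import pandas as pd

# The formats that remain in the output above
ALLOWED_FORMATS = {
    "LC", "NFE", "RUBL", "PU", "NU", "Untiered", "UU", "OU",
    "UUBL", "PUBL", "RU", "Uber", "NUBL",
}

# Toy column with one format that should no longer be present
toy_formats = pd.Series(["LC", "OU", "Uber", "AG"])

# Anything left over after subtracting the allowed set is unexpected
unexpected = set(toy_formats.unique()) - ALLOWED_FORMATS
print(unexpected)
```

On the cleaned column, the same expression would be expected to produce an empty set.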

In [16]:
pokemon_df.set_index('name', inplace=True)

In [None]:
#And we'll generate one final profile report to make sure there isn't anything else to pay attention to.
profile_report = pokemon_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

It looks good in general, but there are still the unsupported columns and isNonstandard which is a completely uniform column (with the single value "Standard"). isNonstandard no longer contains any information and should be removed.

Let's come back to cleaning all of these columns when we get what we need out of types, abilities, and oob, since those unsupported values contain lots of information that we should store in another way that will allow us to access it more effectively.
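
When that cleanup happens, uniform columns such as isNonstandard can be found programmatically instead of by reading the report. This is a minimal sketch on toy data with hypothetical names. It compares string representations because list-valued columns are unhashable and would make `nunique` fail.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "hp": [45, 60, 80],
    "types": [["Grass", "Poison"], ["Grass", "Poison"], ["Fire"]],
    "isNonstandard": ["Standard", "Standard", "Standard"],
})

# A column is constant when it has exactly one distinct value
constant_columns = [
    column for column in toy_df.columns
    if toy_df[column].astype(str).nunique() == 1
]

cleaned_df = toy_df.drop(columns=constant_columns)
print(constant_columns, list(cleaned_df.columns))
```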

<a id='pokemon_types_df'></a>
### pokemon_types_df

We need to go back to pokedex_dict to make another dataframe, similar to what we did for pokemon_df.  However, this time we will use a different section of the JSON file (which is best observed through jsonviewer).

In [17]:
types_df = pd.DataFrame(pokedex_dict['injectRpcs'][1][1]['types'])
types_df

Unnamed: 0,name,atk_effectives,genfamily,description
0,Bug,"[[Bug, 1], [Dark, 2], [Dragon, 1], [Electric, ...","[RB, GS, RS, DP, BW, XY, SM, SS]",
1,Dark,"[[Bug, 1], [Dark, 0.5], [Dragon, 1], [Electric...","[GS, RS, DP, BW, XY, SM, SS]",Pokemon of this type are immune to Status move...
2,Dragon,"[[Bug, 1], [Dark, 1], [Dragon, 2], [Electric, ...","[RB, GS, RS, DP, BW, XY, SM, SS]",
3,Electric,"[[Bug, 1], [Dark, 1], [Dragon, 0.5], [Electric...","[RB, GS, RS, DP, BW, XY, SM, SS]",Pokemon of this type cannot become paralyzed.
4,Fairy,"[[Bug, 1], [Dark, 2], [Dragon, 2], [Electric, ...","[XY, SM, SS]",
5,Fighting,"[[Bug, 0.5], [Dark, 2], [Dragon, 1], [Electric...","[RB, GS, RS, DP, BW, XY, SM, SS]",
6,Fire,"[[Bug, 2], [Dark, 1], [Dragon, 0.5], [Electric...","[RB, GS, RS, DP, BW, XY, SM, SS]",Pokemon of this type cannot become burned.
7,Flying,"[[Bug, 2], [Dark, 1], [Dragon, 1], [Electric, ...","[RB, GS, RS, DP, BW, XY, SM, SS]",Pokemon of this type are airborne and lose the...
8,Ghost,"[[Bug, 1], [Dark, 0.5], [Dragon, 1], [Electric...","[RB, GS, RS, DP, BW, XY, SM, SS]",Pokemon of this type cannot be prevented from ...
9,Grass,"[[Bug, 0.5], [Dark, 1], [Dragon, 0.5], [Electr...","[RB, GS, RS, DP, BW, XY, SM, SS]",Pokemon of this type cannot become affected by...


For our purposes, we really only need the types.  Genfamilies, attack effectiveness, and descriptions always come with the type anyways and aren't being directly analyzed in the machine learning task.  Also, we want to link our types dataframe with our pokemon dataframe by using the same index, so let's initialize a new dataframe to hold the types of each pokemon in a one-hot encoded fashion (which is most convenient for machine learning).

In [18]:
pokemon_types_df = pd.DataFrame(0, index=pokemon_df.index, columns=types_df['name'].rename(''))
pokemon_types_df

name,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
Bulbasaur,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Ivysaur,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Venusaur,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Charmander,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Calyrex-Ice,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now we need to check the unsupported data type from pokemon_df in the types column, which is where we'll get our values for this new dataframe from:

In [None]:
for i, l in enumerate(pokemon_df["types"]):
    print("list",i,"is",type(l))

Since they're lists, we can handle them the same way we did the formats column:

In [19]:
pokemon_types = pokemon_df['types'].apply(pd.Series)
pokemon_types

name,0,1
Bulbasaur,Grass,Poison
Ivysaur,Grass,Poison
Venusaur,Grass,Poison
Charmander,Fire,
Charmeleon,Fire,
...,...,...
Glastrier,Ice,
Spectrier,Ghost,
Calyrex,Psychic,Grass
Calyrex-Ice,Psychic,Ice


In [20]:
pokemon_types.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       739 non-null    object
 1   1       396 non-null    object
dtypes: object(2)
memory usage: 17.3+ KB


This result is slightly more complicated than the one for formats: applying pd.Series produces two columns, and the second column contains many null values.  This is expected, since every pokemon has either one type or two.  All 739 rows are filled in the first column, so every pokemon has at least one type, and the 396 non-null entries in the second column are the dual-typed pokemon.
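
The one-or-two-types assumption can also be checked directly on the list column, without expanding it first. A minimal sketch on toy data (the Series below is hypothetical; `.str.len()` returns the length of each list):

```python
import pandas as pd

toy_types = pd.Series(
    [["Grass", "Poison"], ["Fire"], ["Psychic", "Ice"]],
    index=["Bulbasaur", "Charmander", "Calyrex-Ice"],
)

# Number of types per pokemon
type_counts = toy_types.str.len()

# Every pokemon should have one or two types
assert type_counts.between(1, 2).all()
print(type_counts.value_counts())
```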

Now we will iterate through pokemon_types to fill in values of "1" in our one-hot encoded pokemon_types_df for each type that a pokemon has:

In [21]:
for index, row in pokemon_types.iterrows():
    pokemon_types_df.loc[index, row[0]] = 1
    if row.isna().sum() == 0:
        pokemon_types_df.loc[index, row[1]] = 1

pokemon_types_df

name,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
Bulbasaur,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
Ivysaur,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
Venusaur,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
Charmander,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0
Calyrex-Ice,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
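
The row-by-row loop above is easy to follow. For comparison, the same one-hot table can be sketched without an explicit loop by exploding the list column and summing indicator columns per pokemon. This is an alternative on toy data, not the approach used in this notebook, and `all_types` is a shortened hypothetical type list.

```python
import pandas as pd

toy_types = pd.Series(
    [["Grass", "Poison"], ["Fire"], ["Psychic", "Ice"]],
    index=pd.Index(["Bulbasaur", "Charmander", "Calyrex-Ice"], name="name"),
)
# Hypothetical, shortened type list; Water is included to show an all-zero column
all_types = ["Fire", "Grass", "Ice", "Poison", "Psychic", "Water"]

# One row per (pokemon, type) pair; the pokemon name is repeated in the index
exploded = toy_types.explode()

one_hot = (
    pd.get_dummies(exploded)                   # one indicator column per observed type
    .groupby(level=0, sort=False)              # collapse back to one row per pokemon
    .sum()
    .reindex(columns=all_types, fill_value=0)  # add types that no pokemon has
    .astype(int)
)
print(one_hot)
```

Reindexing against the full type list keeps the column set stable even when a type is absent from the data.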


In [22]:
pokemon_types_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Data columns (total 18 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Bug       739 non-null    int64
 1   Dark      739 non-null    int64
 2   Dragon    739 non-null    int64
 3   Electric  739 non-null    int64
 4   Fairy     739 non-null    int64
 5   Fighting  739 non-null    int64
 6   Fire      739 non-null    int64
 7   Flying    739 non-null    int64
 8   Ghost     739 non-null    int64
 9   Grass     739 non-null    int64
 10  Ground    739 non-null    int64
 11  Ice       739 non-null    int64
 12  Normal    739 non-null    int64
 13  Poison    739 non-null    int64
 14  Psychic   739 non-null    int64
 15  Rock      739 non-null    int64
 16  Steel     739 non-null    int64
 17  Water     739 non-null    int64
dtypes: int64(18)
memory usage: 125.9+ KB


In [23]:
pokemon_types_df.apply(sum, axis=1).value_counts()

2    396
1    343
dtype: int64

It seems to have worked with no problems: the number of pokemon with 2 types is correct (said to be 396 before), the remaining 343 have 1 type, and no row falls outside that range.  Let's just make a pandas profile to make sure there are no more non-obvious issues:

In [None]:
profile_report = pokemon_types_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

There are no problems at all, so let's move on to our dataframe for abilities!

<a id='pokemon_abilities_df'></a>
### pokemon_abilities_df

Once again, we are making a new dataframe, this time from a different section of the JSON file (which is best observed through jsonviewer).

In [24]:
abilities_df = pd.DataFrame(pokedex_dict['injectRpcs'][1][1]['abilities'])
abilities_df

Unnamed: 0,name,description,isNonstandard,genfamily
0,Cute Charm,30% chance of infatuating Pokemon of the oppos...,Standard,"[RS, DP, BW, XY, SM, SS]"
1,Effect Spore,30% chance of poison/paralysis/sleep on others...,Standard,"[RS, DP, BW, XY, SM, SS]"
2,Flame Body,30% chance a Pokemon making contact with this ...,Standard,"[RS, DP, BW, XY, SM, SS]"
3,Flash Fire,This Pokemon's Fire attacks do 1.5x damage if ...,Standard,"[RS, DP, BW, XY, SM, SS]"
4,Intimidate,"On switch-in, this Pokemon lowers the Attack o...",Standard,"[RS, DP, BW, XY, SM, SS]"
...,...,...,...,...
265,Steam Engine,This Pokemon's Speed is raised by 6 stages aft...,Standard,[SS]
266,Steely Spirit,This Pokemon and its allies' Steel-type moves ...,Standard,[SS]
267,Transistor,This Pokemon's attacking stat is multiplied by...,Standard,[SS]
268,Unseen Fist,All contact moves hit through protection.,Standard,[SS]


We've seen the isNonstandard column before, so we should explore it and see what values it has (since we likely only want the Standard abilities).

In [25]:
abilities_df['isNonstandard'].unique()

array(['Standard', 'CAP'], dtype=object)

Indeed, we should remove any CAP abilities, since those belong only to created pokemon which are not usable in the standard format (which contextualizes this project).  Basically, since CAP pokemon are removed, there is no reason to keep any CAP abilities.

In [26]:
abilities_df = abilities_df.loc[abilities_df['isNonstandard'] != 'CAP']
abilities_df

Unnamed: 0,name,description,isNonstandard,genfamily
0,Cute Charm,30% chance of infatuating Pokemon of the oppos...,Standard,"[RS, DP, BW, XY, SM, SS]"
1,Effect Spore,30% chance of poison/paralysis/sleep on others...,Standard,"[RS, DP, BW, XY, SM, SS]"
2,Flame Body,30% chance a Pokemon making contact with this ...,Standard,"[RS, DP, BW, XY, SM, SS]"
3,Flash Fire,This Pokemon's Fire attacks do 1.5x damage if ...,Standard,"[RS, DP, BW, XY, SM, SS]"
4,Intimidate,"On switch-in, this Pokemon lowers the Attack o...",Standard,"[RS, DP, BW, XY, SM, SS]"
...,...,...,...,...
265,Steam Engine,This Pokemon's Speed is raised by 6 stages aft...,Standard,[SS]
266,Steely Spirit,This Pokemon and its allies' Steel-type moves ...,Standard,[SS]
267,Transistor,This Pokemon's attacking stat is multiplied by...,Standard,[SS]
268,Unseen Fist,All contact moves hit through protection.,Standard,[SS]


Some final touches:

1. We should set the ability names as the index, since the current index is uninformative.

2. We can remove the isNonstandard and genfamily columns, as they provide no relevant information to the competitive aspect of pokemon, but we should keep the descriptions as they might be useful for EDA later (there are so many abilities that I will likely have to bucket them, and having descriptions may make it easier to do that).

In [27]:
abilities_df.set_index('name', inplace=True)
abilities_df = abilities_df.drop(columns=['isNonstandard', 'genfamily'])
abilities_df

Unnamed: 0_level_0,description
name,Unnamed: 1_level_1
Cute Charm,30% chance of infatuating Pokemon of the oppos...
Effect Spore,30% chance of poison/paralysis/sleep on others...
Flame Body,30% chance a Pokemon making contact with this ...
Flash Fire,This Pokemon's Fire attacks do 1.5x damage if ...
Intimidate,"On switch-in, this Pokemon lowers the Attack o..."
...,...
Steam Engine,This Pokemon's Speed is raised by 6 stages aft...
Steely Spirit,This Pokemon and its allies' Steel-type moves ...
Transistor,This Pokemon's attacking stat is multiplied by...
Unseen Fist,All contact moves hit through protection.


We also need to initialize the dataframe that's going to hold our one-hot encoded abilities, and like last time we will use the pokemon_df index to sync the dataframes together.

In [28]:
pokemon_abilities_df = pd.DataFrame(0, index=pokemon_df.index, columns=abilities_df.index.rename(''))
pokemon_abilities_df

Unnamed: 0_level_0,Cute Charm,Effect Spore,Flame Body,Flash Fire,Intimidate,Lightning Rod,Minus,Plus,Poison Point,Pressure,...,Quick Draw,Ripen,Sand Spit,Screen Cleaner,Stalwart,Steam Engine,Steely Spirit,Transistor,Unseen Fist,Wandering Spirit
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ivysaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Venusaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmander,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex-Ice,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As before, we need to check the data types of the abilities column of pokemon_df, to understand what transformation we need to apply to it before using it to fill pokemon_abilities_df.

In [None]:
for i, l in enumerate(pokemon_df["abilities"]):
    print("list",i,"is",type(l))

They're lists, so like last time we just need to apply pd.Series.

In [29]:
pokemon_abilities = pokemon_df['abilities'].apply(pd.Series)
pokemon_abilities

Unnamed: 0_level_0,0,1,2,3
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Bulbasaur,Chlorophyll,Overgrow,,
Ivysaur,Chlorophyll,Overgrow,,
Venusaur,Chlorophyll,Overgrow,,
Charmander,Blaze,Solar Power,,
Charmeleon,Blaze,Solar Power,,
...,...,...,...,...
Glastrier,Chilling Neigh,,,
Spectrier,Grim Neigh,,,
Calyrex,Unnerve,,,
Calyrex-Ice,As One (Glastrier),,,


This is slightly more complicated still, because pokemon can have anywhere from one to four possible abilities.  However, four is very rare; how many pokemon have four?

In [30]:
pokemon_abilities.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       739 non-null    object
 1   1       593 non-null    object
 2   2       395 non-null    object
 3   3       1 non-null      object
dtypes: object(4)
memory usage: 45.0+ KB


Just one!  And which pokemon is that?

In [31]:
pokemon_df.loc[pokemon_abilities.loc[:, 3].notna()]

Unnamed: 0_level_0,hp,atk,def,spa,spd,spe,weight,height,types,abilities,formats,isNonstandard,oob
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Rockruff,45,65,40,30,40,60,9.2,0.5,[Rock],"[Keen Eye, Own Tempo, Steadfast, Vital Spirit]",LC,Standard,"{'dex_number': 744, 'evos': ['Lycanroc', 'Lyca..."


Checking it in the reference material on Smogon, it seems that Rockruff really has 4 possible abilities so there is no mistake.

Since there are up to 4 possible abilities for each pokemon, our code for filling pokemon_abilities_df will have to be slightly more complicated than the code for pokemon_types_df, but not intractably so.

In [32]:
for index, row in pokemon_abilities.iterrows():
    pokemon_abilities_df.loc[index, row[0]] = 1
    if row.isna().sum() < 3:
        pokemon_abilities_df.loc[index, row[1]] = 1
    if row.isna().sum() < 2:
        pokemon_abilities_df.loc[index, row[2]] = 1
    if row.isna().sum() < 1:
        pokemon_abilities_df.loc[index, row[3]] = 1

pokemon_abilities_df

Unnamed: 0_level_0,Cute Charm,Effect Spore,Flame Body,Flash Fire,Intimidate,Lightning Rod,Minus,Plus,Poison Point,Pressure,...,Quick Draw,Ripen,Sand Spit,Screen Cleaner,Stalwart,Steam Engine,Steely Spirit,Transistor,Unseen Fist,Wandering Spirit
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ivysaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Venusaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmander,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex-Ice,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
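
The chain of `if` statements above can be avoided by dropping the nulls from each row first, which also works for any number of ability slots.  This is only a sketch on toy data (`toy_abilities` and `toy_abilities_df` are hypothetical names that mirror pokemon_abilities and pokemon_abilities_df):

```python
import pandas as pd

# Toy stand-in for pokemon_abilities: up to four ability slots, padded with nulls.
toy_abilities = pd.DataFrame(
    [
        ["Chlorophyll", "Overgrow", None, None],
        ["Chilling Neigh", None, None, None],
        ["Keen Eye", "Own Tempo", "Steadfast", "Vital Spirit"],
    ],
    index=["Bulbasaur", "Glastrier", "Rockruff"],
)

# Pre-allocated frame of zeros with one column per ability, as in the notebook.
ability_names = [
    "Chilling Neigh", "Chlorophyll", "Keen Eye", "Overgrow",
    "Own Tempo", "Steadfast", "Vital Spirit",
]
toy_abilities_df = pd.DataFrame(0, index=toy_abilities.index, columns=ability_names)

# dropna() leaves only the abilities the pokemon actually has, so a single
# assignment per row covers the 1, 2, 3 and 4 ability cases alike.
for name, row in toy_abilities.iterrows():
    toy_abilities_df.loc[name, row.dropna().tolist()] = 1
```

Every ability in the toy rows has a matching column here; with real data, a missing column would raise a KeyError rather than silently adding one.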


Let's do some sanity checks to make sure this worked out properly:

In [33]:
pokemon_abilities_df.loc['Rockruff'].sum()

4

In [34]:
pokemon_abilities_df.loc['Bulbasaur'].sum()

2

In [35]:
pokemon_abilities_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Columns: 267 entries, Cute Charm to Wandering Spirit
dtypes: int64(267)
memory usage: 1.5+ MB


It's interesting that our dataframe consists of integer instead of boolean values.  I'm not sure which is better for machine learning, but it is something we can consider later.  Pandas profiling seems to interpret them as booleans, in any case.

We should check that we have the correct number of abilities in each row:

In [36]:
pokemon_abilities_df.apply(sum, axis=1).value_counts()

3    394
2    198
1    146
4      1
dtype: int64

394 is correct for "3 abilities", because there were 395 values in the 3rd ability column and only Rockruff has 4, which takes one away from rows which sum to exactly 3 (since 395 are "at least" 3).

Basically the calculation was 395 - 1 = 394

Likewise:
593 - 395 = 198, so 198 is correct for "2 abilities"

739 - 593 = 146, so 146 is correct for "1 ability"
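
The subtractions above amount to differencing the "at least k abilities" counts, which can be sketched in code.  The counts below are copied by hand from the pokemon_abilities.info() output, and `at_least` and `exactly` are names invented for this illustration:

```python
import pandas as pd

# Non-null counts per ability slot: slot k is filled for every pokemon
# that has at least k abilities.
at_least = pd.Series({1: 739, 2: 593, 3: 395, 4: 1})

# "Exactly k" = "at least k" minus "at least k + 1" (zero past the last slot).
exactly = at_least - at_least.shift(-1, fill_value=0)
```

The resulting values can then be compared against the value_counts() output of the filled dataframe.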

We know 1 is correct for "4 abilities" (Rockruff), thus our dataframe is officially validated at the basic level.  Let's see if a pandas profile report picks up anything less obvious.

In [None]:
profile_report = pokemon_abilities_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

Upon examination, all the columns with constant values of zero seem to be abilities that occur only in "National Dex" pokemon.  Those pokemon were removed from pokemon_df, so of course these abilities never had a single "1" within the set of pokemon I'm using.  These abilities need to be removed, but since I'm saving National Dex pokemon, I might as well save these abilities aside as well.

In [37]:
abilities_removed_df = pokemon_abilities_df.loc[:, pokemon_abilities_df.sum(axis=0) == 0].columns
abilities_removed_df

Index(['Color Change', 'Forecast', 'Magma Armor', 'Pure Power', 'Bad Dreams',
       'Normalize', 'Multitype', 'Poison Heal', 'Toxic Boost', 'Aerilate',
       'Parental Bond', 'Delta Stream', 'Desolate Land', 'Grass Pelt',
       'Primordial Sea', 'Protean', 'Battle Bond', 'Comatose', 'Dancer',
       'Dazzling', 'Galvanize', 'Neuroforce', 'Power of Alchemy',
       'Shields Down'],
      dtype='object', name='')

In [38]:
pokemon_abilities_df = pokemon_abilities_df.loc[:, pokemon_abilities_df.sum(axis=0) != 0].copy()
pokemon_abilities_df

Unnamed: 0_level_0,Cute Charm,Effect Spore,Flame Body,Flash Fire,Intimidate,Lightning Rod,Minus,Plus,Poison Point,Pressure,...,Quick Draw,Ripen,Sand Spit,Screen Cleaner,Stalwart,Steam Engine,Steely Spirit,Transistor,Unseen Fist,Wandering Spirit
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ivysaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Venusaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmander,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex-Ice,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Since 267 - 24 = 243, we removed exactly the right number of abilities.  We should also remove these abilities from abilities_df, since we are planning on using abilities_df for EDA as well.

In [39]:
abilities_df = abilities_df.loc[~abilities_df.index.isin(abilities_removed_df)].copy()
abilities_df

Unnamed: 0_level_0,description
name,Unnamed: 1_level_1
Cute Charm,30% chance of infatuating Pokemon of the oppos...
Effect Spore,30% chance of poison/paralysis/sleep on others...
Flame Body,30% chance a Pokemon making contact with this ...
Flash Fire,This Pokemon's Fire attacks do 1.5x damage if ...
Intimidate,"On switch-in, this Pokemon lowers the Attack o..."
...,...
Steam Engine,This Pokemon's Speed is raised by 6 stages aft...
Steely Spirit,This Pokemon and its allies' Steel-type moves ...
Transistor,This Pokemon's attacking stat is multiplied by...
Unseen Fist,All contact moves hit through protection.


There is just one more small matter to attend to: in my research about pokemon alternative forms, I discovered that Zygarde-10% can't use one of its abilities (Power Construct) in its format, because that would transform Zygarde-10% into Zygarde-Complete, which is only admissible in the Uber format.  Therefore, for our purposes we should remove this ability from Zygarde-10%: unlike regular Zygarde, which is itself an Uber tier pokemon, Zygarde-10% isn't likely to be used in Uber and is used much more in lower tiers nearer to RU.

In [40]:
pokemon_abilities_df.loc['Zygarde-10%', 'Power Construct'] = 0
pokemon_abilities_df.loc[pokemon_abilities_df['Power Construct'] == 1, :]

Unnamed: 0_level_0,Cute Charm,Effect Spore,Flame Body,Flash Fire,Intimidate,Lightning Rod,Minus,Plus,Poison Point,Pressure,...,Quick Draw,Ripen,Sand Spit,Screen Cleaner,Stalwart,Steam Engine,Steely Spirit,Transistor,Unseen Fist,Wandering Spirit
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Zygarde,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


That seems to have worked!

Now that we've finished using the types and abilities columns of pokemon_df to fill our new dataframes, we should put the finishing touches on each column of pokemon_df by removing it or suitably modifying it so that there are no more unsupported data types.

Types, abilities, and isNonstandard (which literally contains no information) can just be dropped, but oob might have some potentially interesting information.

In [41]:
pokemon_df = pokemon_df.drop(columns=['types', 'abilities', 'isNonstandard'])
pokemon_df

Unnamed: 0_level_0,hp,atk,def,spa,spd,spe,weight,height,formats,oob
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Bulbasaur,45,49,49,65,65,45,6.9,0.7,LC,"{'dex_number': 1, 'evos': ['Ivysaur'], 'alts':..."
Ivysaur,60,62,63,80,80,60,13.0,1.0,NFE,"{'dex_number': 2, 'evos': ['Venusaur'], 'alts'..."
Venusaur,80,82,83,100,100,80,100.0,2.0,RUBL,"{'dex_number': 3, 'evos': [], 'alts': ['Venusa..."
Charmander,39,52,43,60,50,65,8.5,0.6,LC,"{'dex_number': 4, 'evos': ['Charmeleon'], 'alt..."
Charmeleon,58,64,58,80,65,80,19.0,1.1,NFE,"{'dex_number': 5, 'evos': ['Charizard'], 'alts..."
...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,800.0,2.2,NU,"{'dex_number': 896, 'evos': [], 'alts': [], 'g..."
Spectrier,100,65,60,145,80,130,44.5,2.0,Uber,"{'dex_number': 897, 'evos': [], 'alts': [], 'g..."
Calyrex,100,80,80,80,80,80,7.7,1.1,Untiered,"{'dex_number': 898, 'evos': [], 'alts': [], 'g..."
Calyrex-Ice,100,165,150,85,130,50,809.1,2.4,Uber,"{'dex_number': 898, 'evos': [], 'alts': [], 'g..."


Examining the JSON file (in jsonviewer) that was used to make pokedex_dict, it appears that oob is mostly unimportant information for our purposes, but it may be interesting to know the generation in which pokemon were introduced (to answer questions about how powerful each generation is, whether there was "power creep", etc.).  So we can make a new column in pokemon_df that will just contain the generation in which the pokemon was introduced, and get rid of the oob column after that.

In [None]:
for i, l in enumerate(pokemon_df["oob"]):
    print("list",i,"is",type(l))

The oob column consists of dictionaries, so we can access it by looking at the appropriate paths in the jsonviewer for pokedex_dict.

In [42]:
oob_df = pokemon_df['oob'].apply(pd.Series)
genfamily_df = oob_df['genfamily'].apply(pd.Series)
pokemon_df['generation'] = genfamily_df[0].values
pokemon_df = pokemon_df.drop(columns=['oob'])

pokemon_df

Unnamed: 0_level_0,hp,atk,def,spa,spd,spe,weight,height,formats,generation
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Bulbasaur,45,49,49,65,65,45,6.9,0.7,LC,RB
Ivysaur,60,62,63,80,80,60,13.0,1.0,NFE,RB
Venusaur,80,82,83,100,100,80,100.0,2.0,RUBL,RB
Charmander,39,52,43,60,50,65,8.5,0.6,LC,RB
Charmeleon,58,64,58,80,65,80,19.0,1.1,NFE,RB
...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,800.0,2.2,NU,SS
Spectrier,100,65,60,145,80,130,44.5,2.0,Uber,SS
Calyrex,100,80,80,80,80,80,7.7,1.1,Untiered,SS
Calyrex-Ice,100,165,150,85,130,50,809.1,2.4,Uber,SS
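
The two apply(pd.Series) calls above build full intermediate dataframes just to read one value per pokemon.  A lighter alternative is to index into each dictionary directly.  This is a sketch on toy data: `toy_oob` is a hypothetical stand-in for the oob column, and it assumes (as the cell above does) that the first entry of genfamily is the generation in which the pokemon was introduced:

```python
import pandas as pd

# Toy stand-in for pokemon_df['oob']: one dictionary per pokemon.
toy_oob = pd.Series(
    [
        {"dex_number": 1, "genfamily": ["RB", "GS", "SS"]},
        {"dex_number": 896, "genfamily": ["SS"]},
    ],
    index=["Bulbasaur", "Glastrier"],
)

# Read the first genfamily entry straight out of each dictionary.
generation = toy_oob.map(lambda oob: oob["genfamily"][0])
```

Because the result keeps the pokemon names as its index, it can be assigned to a new column without going through `.values`.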


Looks good, let's just make a profile report to make sure there is nothing we missed.

In [None]:
profile_report = pokemon_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

There are no problems, so let's move on to the moves dataframe.

<a id='moves_df'></a>
### moves_df

As usual, we need to fetch the section we need from pokedex_dict, using the jsonviewer if necessary.

In [43]:
moves_df = pd.DataFrame(pokedex_dict['injectRpcs'][1][1]['moves'])
moves_df

Unnamed: 0,name,isNonstandard,category,power,accuracy,priority,pp,description,type,flags,genfamily
0,Acid,Standard,Special,40,100,0,30,10% chance to lower the foe(s) Sp. Def by 1.,Poison,[],"[RB, GS, RS, DP, BW, XY, SM, SS]"
1,Amnesia,Standard,Non-Damaging,0,0,0,20,Raises the user's Sp. Def by 2.,Psychic,[],"[RB, GS, RS, DP, BW, XY, SM, SS]"
2,Aurora Beam,Standard,Special,65,100,0,20,10% chance to lower the target's Attack by 1.,Ice,[],"[RB, GS, RS, DP, BW, XY, SM, SS]"
3,Bide,NatDex,Physical,0,0,1,10,Waits 2 turns; deals double the damage taken.,Normal,[],"[RB, GS, RS, DP, BW, XY, SM, SS]"
4,Bind,Standard,Physical,15,85,0,20,Traps and damages the target for 4-5 turns.,Normal,[],"[RB, GS, RS, DP, BW, XY, SM, SS]"
...,...,...,...,...,...,...,...,...,...,...,...
838,Terrain Pulse,Standard,Special,50,100,0,10,"User on terrain: power doubles, type varies.",Normal,[],[SS]
839,Thunder Cage,Standard,Special,80,90,0,15,Traps and damages the target for 4-5 turns.,Electric,[],[SS]
840,Thunderous Kick,Standard,Physical,90,100,0,10,100% chance to lower the target's Defense by 1.,Fighting,[],[SS]
841,Triple Axel,Standard,Physical,20,90,0,10,"Hits 3 times. Each hit can miss, but power rises.",Ice,[],[SS]


Importantly, we have another "isNonstandard" column, which tends to be very informative regarding which values I should keep and which I should get rid of.

In [44]:
moves_df['isNonstandard'].value_counts()

Standard    698
NatDex      143
CAP           2
Name: isNonstandard, dtype: int64

According to Smogon, NatDex and CAP moves can't be used in the standard format; they are only legal in the NatDex or CAP formats, so I should remove these moves.

What about flags?  Is it empty, like last time?

In [45]:
for index, row in moves_df.iterrows():
    if row['flags'] != []:
        print(row['flags'])
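
The same check can be written without an explicit loop by measuring the length of each list.  A minimal sketch, where `toy_flags` is a hypothetical stand-in for moves_df['flags']:

```python
import pandas as pd

# Toy stand-in for moves_df['flags']: every move has an empty list of flags.
toy_flags = pd.Series([[], [], []], index=["Acid", "Amnesia", "Bide"])

# Keep only the moves whose flags list has at least one entry.
non_empty = toy_flags[toy_flags.map(len) > 0]
```

An empty `non_empty` corresponds to the loop above printing nothing.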

Indeed, it's nothing.  So we're going to remove NatDex and CAP moves, and then remove the uninformative "isNonstandard", "flags", and "genfamily" columns, while setting the index to move names since that is much more informative than the current integer index.

In [46]:
moves_df.set_index('name', inplace=True)
moves_df = moves_df.loc[moves_df['isNonstandard'] == 'Standard']
moves_df = moves_df.drop(columns=['isNonstandard', 'flags', 'genfamily'])
moves_df

Unnamed: 0_level_0,category,power,accuracy,priority,pp,description,type
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Acid,Special,40,100,0,30,10% chance to lower the foe(s) Sp. Def by 1.,Poison
Amnesia,Non-Damaging,0,0,0,20,Raises the user's Sp. Def by 2.,Psychic
Aurora Beam,Special,65,100,0,20,10% chance to lower the target's Attack by 1.,Ice
Bind,Physical,15,85,0,20,Traps and damages the target for 4-5 turns.,Normal
Bite,Physical,60,100,0,25,30% chance to make the target flinch.,Dark
...,...,...,...,...,...,...,...
Terrain Pulse,Special,50,100,0,10,"User on terrain: power doubles, type varies.",Normal
Thunder Cage,Special,80,90,0,15,Traps and damages the target for 4-5 turns.,Electric
Thunderous Kick,Physical,90,100,0,10,100% chance to lower the target's Defense by 1.,Fighting
Triple Axel,Physical,20,90,0,10,"Hits 3 times. Each hit can miss, but power rises.",Ice


Now we can use a profile report to tell us what issues there might be:

In [None]:
profile_report = moves_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

Some interesting facts:

1. Move descriptions only have a cardinality of 454, which means many moves have the same description.  This is important for bucketing, but that will happen during EDA.

2. More moves have zero power (246) than are in the non-damaging category (220), so there must be some other reason for the 26 zero power moves, or some mistake.

3. Priority is an interesting concept and could also be useful for bucketing, but we will examine it later.

4. There are a lot of zero accuracy moves which should be looked into.
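
The gap described in item 2 can also be located by counting the zero power moves per category.  This is a sketch on four rows taken from the moves_df output above; `toy_moves` and the mask names are invented for the illustration:

```python
import pandas as pd

# Four moves copied from the moves_df output above.
toy_moves = pd.DataFrame(
    {
        "category": ["Special", "Non-Damaging", "Physical", "Physical"],
        "power": [40, 0, 0, 15],
    },
    index=["Acid", "Amnesia", "Counter", "Bind"],
)

is_zero_power = toy_moves["power"] == 0
is_damaging = toy_moves["category"] != "Non-Damaging"

# How the zero power moves split across categories.
zero_power_counts = toy_moves.loc[is_zero_power, "category"].value_counts()

# The moves that need a closer look: damaging, yet listed with zero power.
suspicious = toy_moves.loc[is_zero_power & is_damaging]
```

On the full moves_df, the rows selected by the combined mask would be the 26 moves that account for the difference between 246 and 220.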

So first, we should figure out which zero powered moves we need to examine.

In [47]:
damaging = moves_df.loc[moves_df['category'] != 'Non-Damaging']
zero_damaging = damaging.loc[damaging['power'] == 0]
zero_damaging

Unnamed: 0_level_0,category,power,accuracy,priority,pp,description,type
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Counter,Physical,0,100,-5,20,"If hit by physical attack, returns double damage.",Fighting
Night Shade,Special,0,100,0,15,Does damage equal to the user's level.,Ghost
Seismic Toss,Physical,0,100,0,20,Does damage equal to the user's level.,Fighting
Super Fang,Physical,0,90,0,10,Does damage equal to 1/2 target's current HP.,Normal
Low Kick,Physical,0,100,0,20,More power the heavier the target.,Fighting
Fissure,Physical,0,30,0,5,OHKOs the target. Fails if user is a lower level.,Ground
Guillotine,Physical,0,30,0,5,OHKOs the target. Fails if user is a lower level.,Normal
Horn Drill,Physical,0,30,0,5,OHKOs the target. Fails if user is a lower level.,Normal
Beat Up,Physical,0,100,0,10,All healthy allies aid in damaging the target.,Dark
Flail,Physical,0,100,0,15,More power the less HP the user has left.,Normal


These moves are legitimate; they just have non-standard means of acquiring their power.  So what about zero accuracy moves?  Let's first examine the non-damaging ones:

In [48]:
non_damaging = moves_df.loc[moves_df['category'] == 'Non-Damaging']
zero_accurate_non_damaging = non_damaging.loc[non_damaging['accuracy'] == 0]
zero_accurate_non_damaging

Unnamed: 0_level_0,category,power,accuracy,priority,pp,description,type
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Amnesia,Non-Damaging,0,0,0,20,Raises the user's Sp. Def by 2.,Psychic
Conversion,Non-Damaging,0,0,0,30,Changes user's type to match its first move.,Normal
Focus Energy,Non-Damaging,0,0,0,30,Raises the user's critical hit ratio by 2.,Normal
Growth,Non-Damaging,0,0,0,20,Raises user's Attack and Sp. Atk by 1; 2 in Sun.,Normal
Haze,Non-Damaging,0,0,0,30,Eliminates all stat changes.,Ice
...,...,...,...,...,...,...,...
Life Dew,Non-Damaging,0,0,0,10,Heals the user and its allies by 1/4 their max...,Water
Max Guard,Non-Damaging,0,0,4,10,Protects user from moves &amp; Max Moves this ...,Normal
No Retreat,Non-Damaging,0,0,0,5,Raises all stats by 1 (not acc/eva). Traps user.,Fighting
Stuff Cheeks,Non-Damaging,0,0,0,10,"Must hold Berry to use. User eats Berry, Def +2.",Normal


These appear to be moves which only affect the user or its team, so accuracy just doesn't apply to them.  What about damaging moves with no accuracy?

In [49]:
zero_accurate_damaging = damaging.loc[damaging['accuracy'] == 0]
zero_accurate_damaging

Unnamed: 0_level_0,category,power,accuracy,priority,pp,description,type
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Struggle,Physical,50,0,0,1,User loses 1/4 of its max HP.,Normal
Swift,Special,60,0,0,20,This move does not check accuracy. Hits foes.,Normal
Vital Throw,Physical,70,0,-1,10,This move does not check accuracy. Goes last.,Fighting
Aerial Ace,Physical,60,0,0,20,This move does not check accuracy.,Flying
Magical Leaf,Special,60,0,0,20,This move does not check accuracy.,Grass
...,...,...,...,...,...,...,...
Max Rockfall,Physical,10,0,0,10,Base move affects power. Starts Sandstorm.,Rock
Max Starfall,Physical,10,0,0,10,Base move affects power. Starts Misty Terrain.,Fairy
Max Steelspike,Physical,10,0,0,10,Base move affects power. Allies: +1 Defense.,Steel
Max Strike,Physical,10,0,0,10,Base move affects power. Foes: -1 Speed.,Normal


These are moves which "always hit", and which things like accuracy or evasiveness don't affect (Swift is famous for that).  Probably there is no issue here, so we'll just move on.

Next we need to look into the data I scraped from smogon that contains information about the learnsets of each pokemon, as well as the competitive strategies used by these pokemon in generation 8 (the names of those competitive strategies and the movesets that they employ within the strategy).

<a id='strategydex_df'></a>
### strategydex_df

In [50]:
with open("smogonpokemondata2021/smogonpokemondata2021/scraped_data/PokemonData2021.csv", encoding="utf8") as infile:
    strategydex_df = pd.read_csv(infile)

strategydex_df

Unnamed: 0,pokemonData,pokemonName
0,"{'languages': ['en', 'pt', 'es', 'fr', 'it'], ...",ninetales
1,"{'languages': ['en'], 'learnset': ['Amnesia', ...",bulbasaur
2,"{'languages': ['en'], 'learnset': ['Ally Switc...",wigglytuff
3,"{'languages': ['en'], 'learnset': ['Amnesia', ...",nidorino
4,"{'languages': ['en'], 'learnset': ['Ally Switc...",jigglypuff
...,...,...
1096,"{'languages': ['en'], 'learnset': ['Aqua Jet',...",squirtle
1097,"{'languages': ['en'], 'learnset': ['Aqua Jet',...",wartortle
1098,"{'languages': ['en'], 'learnset': ['Acrobatics...",charmeleon
1099,"{'languages': ['en'], 'learnset': ['Acrobatics...",charmander


In [51]:
strategydex_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1101 entries, 0 to 1100
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   pokemonData  1101 non-null   object
 1   pokemonName  1101 non-null   object
dtypes: object(2)
memory usage: 17.3+ KB


There appears to be no missing data, which is strange because it looked like there was missing data when I viewed the file in Excel.  It's certainly a good thing though.  Let's check the format of the pokemonData column, which looks pretty complicated.

In [None]:
for i, l in enumerate(strategydex_df["pokemonData"]):
    print("list",i,"is",type(l))

These are strings, and unfortunately they were extremely messy to clean up!  We need them in the form of JSON or dictionary-like structures, but this requires having the right quotation marks in the right places (" instead of '), and it requires cleaning out many other mistakes and HTML code using regexes.  Putting the whole process of experimentation which led me to decide what string replacements to make in this notebook would be a waste of time and space, so if you are interested in that, please visit [data_wrangling_experiments](./data_wrangling_experiments.ipynb#strategydex_df_1) (go to the "strategydex_df experiments" section if it doesn't take you there directly).  For now, I'm just going to display the final result of that extensive experimentation, which found the right parts of these complicated strings to replace in order to arrive at a clean dictionary format.

In [52]:
with open("smogonpokemondata2021/smogonpokemondata2021/scraped_data/PokemonData2021.csv", encoding="utf8") as infile:
    strategydex_df = pd.read_csv(infile)

strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace(r'<p>.+?</p>', '', regex=True)
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace(r'<section>.+?</section>', '', regex=True)
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace(r'<h1>.+?</h1>', '', regex=True)
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace(r'\\n', '', regex=True)
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace("\'", '\"')
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('King\"s', "King\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Maki\"s', "Maki\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Land\"s', "Land\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Dragon\"s', "Dragon\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Sirfetch\"d', "Sirfetch\'d")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Farfetch\"d', "Farfetch\'d")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Nature\"s', "Nature\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Forest\"s', "Forest\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('drampa\"s', "drampa\'s")
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('False', '\"\"')
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('True', '\"\"')
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('None', '\"\"')
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('\"\"\"\s', '\"', regex=True)
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('<ul>.+?<\/ul', '', regex=True)
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Swipe', 'False Swipe') #newline
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].str.replace('Surrender', 'False Surrender') #newline
strategydex_df['pokemonData'] = strategydex_df['pokemonData'].apply(json.loads)

In [None]:
for i, l in enumerate(strategydex_df["pokemonData"]):
    print("list",i,"is",type(l))
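
The "Swipe" and "Surrender" replacements in the cleaning cell are needed because blanking out `False` also hits move names such as "False Swipe". As an aside, here is a minimal sketch of a different approach, under the assumption (not verified against the scraped file) that each cleaned string is a valid Python literal. In that case `ast.literal_eval` from the standard library reads single quotes, apostrophes, and `True`/`False`/`None` directly. The `raw` string below is a made-up miniature record, not real data.

```python
import ast
import json

# Hypothetical miniature record written like a Python dict repr (not real data).
raw = "{'learnset': ['False Swipe', \"King's Shield\"], 'isNonstandard': False, 'oob': None}"

# Blanket replacements corrupt values that merely contain the same text.
naive = raw.replace("'", '"').replace("False", '""').replace("None", '""')
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# literal_eval understands Python quoting and constants without any replacements.
parsed = ast.literal_eval(raw)

print("naive replacements parse as JSON:", naive_ok)
print(parsed["learnset"])
```

Unlike the cell above, this keeps `True`, `False` and `None` as Python values instead of mapping them all to empty strings, so it is an alternative to consider rather than a drop-in replacement.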

<a id='pokemon_learnsets_df'></a>
### pokemon_learnsets_df

Now we want to use the information contained in pokemon_df, moves_df and strategydex_df (which contains the learnsets of each pokemon) to one-hot encode which moves each pokemon can learn.  This matters for understanding the competitive viability of a pokemon, since access or lack of access to certain moves can be a decisive factor in what a pokemon can do.  As usual, we'll index this dataframe with the pokemon_df index that's syncing everything, and we'll use the index of moves_df as our columns since that's what's being one-hot encoded.

In [53]:
pokemon_learnsets_df = pd.DataFrame(0, index=pokemon_df.index, columns=moves_df.index.rename(''))
pokemon_learnsets_df

Unnamed: 0_level_0,Acid,Amnesia,Aurora Beam,Bind,Bite,Blizzard,Bubble Beam,Conversion,Counter,Crabhammer,...,Strange Steam,Stuff Cheeks,Surging Strikes,Tar Shot,Teatime,Terrain Pulse,Thunder Cage,Thunderous Kick,Triple Axel,Wicked Blow
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ivysaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Venusaur,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmander,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex-Ice,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


After some experimentation, it turned out that the easiest way to organize the filling of the above dataframe is via a python dictionary, indexed by pokemon_df index, containing the movesets of each pokemon from strategydex_df.

The string replacements were arrived at by a similar process of experimentation to those of strategydex_df (and which can be examined at [data_wrangling_experiments](./data_wrangling_experiments.ipynb#pokemon_learnsets_df_2) in the "pokemon_learnsets_df experiments" section). I'll just present the best results here.

In [None]:
pokemon_learnsets = {}

strategydex_experiment = strategydex_df.copy()

for pokemon in pokemon_df.index:
    pokemon = pokemon.replace("'", "").replace(" ", "-").replace(".", "").replace("-10%", "").replace(":", "")
    pokemon_learnsets[pokemon] = strategydex_experiment.loc[strategydex_experiment['pokemonName'] == pokemon.lower(), 'pokemonData'].item()['learnset']

pokemon_learnsets
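
The same chain of name replacements appears in several cells of this section. A small helper could hold it in one place; the sketch below uses a hypothetical function name, `clean_name`, and a few sample names purely for illustration.

```python
def clean_name(pokemon: str) -> str:
    """Apply the same replacements as the loop above (hypothetical helper)."""
    return (
        pokemon.replace("'", "")
        .replace(" ", "-")
        .replace(".", "")
        .replace("-10%", "")
        .replace(":", "")
    )


# The lowercase form is what gets compared against strategydex_df['pokemonName'].
for name in ["Farfetch'd", "Mr. Mime", "Type: Null", "Zygarde-10%"]:
    print(name, "->", clean_name(name).lower())
```

Note that "Zygarde-10%" and "Zygarde" map to the same cleaned name, so both rows end up reading the same dictionary entry.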

Now we can easily use this dictionary, in the same manner as we used dataframes before, to fill pokemon_learnsets_df:

In [55]:
for pokemon in pokemon_df.index:
    pokemon_cleaned = pokemon.replace("'", "").replace(" ", "-").replace(".", "").replace("-10%", "").replace(":", "")
    learnset = pokemon_learnsets[pokemon_cleaned]
    for move in learnset:
        pokemon_learnsets_df.loc[pokemon, move] = 1

pokemon_learnsets_df

Unnamed: 0_level_0,Acid,Amnesia,Aurora Beam,Bind,Bite,Blizzard,Bubble Beam,Conversion,Counter,Crabhammer,...,Surging Strikes,Tar Shot,Teatime,Terrain Pulse,Thunder Cage,Thunderous Kick,Triple Axel,Wicked Blow,Fury False Swipes,Breaking False Swipe
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Ivysaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Venusaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,,
Charmander,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
Charmeleon,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex-Ice,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,


In [56]:
pokemon_learnsets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Columns: 700 entries, Acid to Breaking False Swipe
dtypes: float64(2), int64(698)
memory usage: 4.0+ MB


We need to check whether there are any problematic duplicates in this data, since there are pokemon with similar forms, etc.  Keep in mind, though, that some duplicates may not indicate an issue, since they may belong to pokemon in the same evolutionary family.

In [57]:
pokemon_learnsets_df[pokemon_learnsets_df.duplicated(keep=False)].index.to_list()

['Bulbasaur',
 'Ivysaur',
 'Charmander',
 'Charmeleon',
 'Squirtle',
 'Wartortle',
 'Porygon',
 'Wobbuffet',
 'Porygon2',
 'Larvitar',
 'Pupitar',
 'Wynaut',
 'Shinx',
 'Luxio',
 'Giratina',
 'Giratina-Origin',
 'Venipede',
 'Whirlipede',
 'Litwick',
 'Lampent',
 'Vanillite',
 'Vanillish',
 'Pidove',
 'Tranquill',
 'Solosis',
 'Duosion',
 'Tornadus',
 'Tornadus-Therian',
 'Thundurus',
 'Thundurus-Therian',
 'Landorus',
 'Landorus-Therian',
 'Zygarde',
 'Pumpkaboo-Small',
 'Pumpkaboo-Large',
 'Gourgeist-Small',
 'Gourgeist-Large',
 'Honedge',
 'Doublade',
 'Pumpkaboo-Super',
 'Gourgeist-Super',
 'Zygarde-10%',
 'Rowlet',
 'Dartrix',
 'Popplio',
 'Brionne',
 'Silvally',
 'Silvally-Bug',
 'Silvally-Dark',
 'Silvally-Dragon',
 'Silvally-Electric',
 'Silvally-Fairy',
 'Silvally-Fighting',
 'Silvally-Fire',
 'Silvally-Flying',
 'Silvally-Ghost',
 'Silvally-Grass',
 'Silvally-Ground',
 'Silvally-Ice',
 'Silvally-Poison',
 'Silvally-Psychic',
 'Silvally-Rock',
 'Silvally-Steel',
 'Silvally-Wat

We also need the number of moves in each row (that is, per pokemon) so that we can see which duplicates are associated with one another.

In [58]:
pokemon_learnsets_df[pokemon_learnsets_df.duplicated(keep=False)].sum(axis=1).to_list()

[75.0,
 75.0,
 95.0,
 95.0,
 91.0,
 91.0,
 71.0,
 9.0,
 71.0,
 66.0,
 66.0,
 9.0,
 57.0,
 57.0,
 82.0,
 82.0,
 43.0,
 43.0,
 60.0,
 60.0,
 45.0,
 45.0,
 45.0,
 45.0,
 65.0,
 65.0,
 67.0,
 67.0,
 74.0,
 74.0,
 63.0,
 63.0,
 65.0,
 0.0,
 0.0,
 0.0,
 0.0,
 52.0,
 52.0,
 4.0,
 4.0,
 65.0,
 57.0,
 57.0,
 58.0,
 58.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 77.0,
 43.0,
 43.0]
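
Row sums only hint at which duplicates belong together, since two unrelated pokemon could share a count. A minimal sketch of grouping identical learnsets directly is shown here on a made-up miniature dictionary shaped like pokemon_learnsets (name mapped to a list of moves); the move lists are illustrative, not the real data.

```python
from collections import defaultdict

# Hypothetical miniature stand-in for pokemon_learnsets (contents are made up).
learnsets = {
    "Shinx": ["Bite", "Spark"],
    "Luxio": ["Spark", "Bite"],
    "Wobbuffet": ["Counter"],
    "Wynaut": ["Counter"],
    "Porygon": ["Conversion"],
}

# Key each pokemon by its set of moves; order within the list does not matter.
groups = defaultdict(list)
for name, moves in learnsets.items():
    groups[frozenset(moves)].append(name)

# Keep only the learnsets shared by more than one pokemon.
duplicate_groups = [names for names in groups.values() if len(names) > 1]
print(duplicate_groups)
```

This works on the dictionary rather than the dataframe, so it reflects the parsed learnsets before any manual corrections to the table.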

Most of these make total sense: they are either adjacent stages of the same evolutionary family or alternate forms of the same pokemon, so having the same moveset is reasonable in nearly all of these cases.

However, forms of pumpkaboo and gourgeist having 0 (or 4) moves is not correct and needs to be fixed. Interestingly, the super versions have exactly 4 moves. Upon examination, the 4 moves pumpkaboo super and gourgeist super have are already in the standard movesets for pumpkaboo and gourgeist.

All of these forms just need to get their moveset from the standard pumpkaboo and gourgeist.  Let's first see how many moves pumpkaboo and gourgeist have, so we can check that we end up with the right amount, and then make the replacement.

In [59]:
pokemon_learnsets_df.loc['Pumpkaboo', :].sum()

66.0

In [60]:
pumpkaboo_learnset = pokemon_learnsets['Pumpkaboo']
for move in pumpkaboo_learnset:
    pokemon_learnsets_df.loc['Pumpkaboo-Small', move] = 1
    pokemon_learnsets_df.loc['Pumpkaboo-Large', move] = 1
    pokemon_learnsets_df.loc['Pumpkaboo-Super', move] = 1
    
pokemon_learnsets_df.loc[['Pumpkaboo', 'Pumpkaboo-Small', 'Pumpkaboo-Large', 'Pumpkaboo-Super'], :].sum(axis=1)

name
Pumpkaboo          66.0
Pumpkaboo-Small    66.0
Pumpkaboo-Large    66.0
Pumpkaboo-Super    66.0
dtype: float64

In [61]:
pokemon_learnsets_df.loc['Gourgeist', :].sum()

74.0

In [62]:
gourgeist_learnset = pokemon_learnsets['Gourgeist']
for move in gourgeist_learnset:
    pokemon_learnsets_df.loc['Gourgeist-Small', move] = 1
    pokemon_learnsets_df.loc['Gourgeist-Large', move] = 1
    pokemon_learnsets_df.loc['Gourgeist-Super', move] = 1

pokemon_learnsets_df.loc[['Gourgeist', 'Gourgeist-Small', 'Gourgeist-Large', 'Gourgeist-Super'], :].sum(axis=1)

name
Gourgeist          74.0
Gourgeist-Small    74.0
Gourgeist-Large    74.0
Gourgeist-Super    74.0
dtype: float64
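
The Pumpkaboo and Gourgeist fixes follow the same pattern, so they could share one function. The sketch below is an assumption-laden variant: `copy_base_learnset` is a hypothetical helper, the table is a made-up miniature, and it reads the base moves from the base row of the table rather than from the pokemon_learnsets dictionary used above.

```python
import pandas as pd

# Hypothetical miniature learnset table; real names, made-up contents.
df = pd.DataFrame(
    0,
    index=["Pumpkaboo", "Pumpkaboo-Small", "Pumpkaboo-Super"],
    columns=["Bestow", "Trick", "Shadow Ball"],
)
df.loc["Pumpkaboo", ["Trick", "Shadow Ball"]] = 1
df.loc["Pumpkaboo-Super", "Trick"] = 1


def copy_base_learnset(table, base, forms):
    """Flag every move of the base row for each listed form (modifies table in place)."""
    base_mask = (table.loc[base] == 1).to_numpy()
    base_moves = table.columns[base_mask]
    table.loc[forms, base_moves] = 1


copy_base_learnset(df, "Pumpkaboo", ["Pumpkaboo-Small", "Pumpkaboo-Super"])
print(df.sum(axis=1))
```

Moves a form already has are left untouched, which matches the observation that the moves of the super forms are already part of the standard movesets.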

The last thing I want to check is a mass comparison between each dictionary learnset's length and the corresponding row-wise sum in the learnset dataframe.  These should be equal except in the few cases we changed manually, where the dictionary was wrong in the first place.

In [63]:
for pokemon in pokemon_df.index:
    pokemon_cleaned = pokemon.replace("'", "").replace(" ", "-").replace(".", "").replace("-10%", "").replace(":", "")
    if len(pokemon_learnsets[pokemon_cleaned]) != pokemon_learnsets_df.loc[pokemon, :].sum():
        print(pokemon)
        print(len(pokemon_learnsets[pokemon_cleaned]))
        print(pokemon_learnsets_df.loc[pokemon, :].sum())

Pumpkaboo-Small
0
66.0
Pumpkaboo-Large
0
66.0
Gourgeist-Small
0
74.0
Gourgeist-Large
0
74.0
Pumpkaboo-Super
4
66.0
Gourgeist-Super
4
74.0


Now we should use a profile report to make sure there aren't any more subtle issues with pokemon_learnsets_df.  This is going to take some time because there are hundreds of columns (it took about 5 minutes to load in my case).

In [None]:
profile_report = pokemon_learnsets_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

The moves "Fury False Swipes" and "Breaking False Swipe" are mistakes and shouldn't exist; they are most likely artifacts of the earlier string replacement that turned "Swipe" into "False Swipe".  The good news is that the pokemon marked with 1 in these columns most likely have Fury Swipes or Breaking Swipe instead, since those two moves are conspicuously all 0, which is not realistic.
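
Columns like these can also be found without a profile report: assigning with `.loc` to a column label that does not exist creates a new column, so any learnset entry that is not a known move name shows up as an extra column. A minimal sketch of that check, using made-up miniature inputs in place of moves_df.index and pokemon_learnsets:

```python
# Hypothetical miniature inputs: known move names and parsed learnsets.
known_moves = {"Fury Swipes", "Breaking Swipe", "False Swipe", "Scratch"}
learnsets = {
    "Charmander": ["Scratch", "Fury False Swipes"],
    "Axew": ["Scratch", "Breaking False Swipe", "False Swipe"],
}

# Every move name that appears in any learnset...
learned = {move for moves in learnsets.values() for move in moves}
# ...minus the names we expect leaves the suspicious ones.
unknown = sorted(learned - known_moves)
print(unknown)
```

Running a check like this before filling the dataframe would flag the malformed names at the point where they are introduced.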

Max and G-Max moves are all zeroes and I'm not considering Dynamax in this analysis anyways, so those moves will simply be removed.

In [64]:
dynamax_moves = [move for move in pokemon_learnsets_df.columns.to_list() if "Max" in move]
dynamax_moves

['G-Max Befuddle',
 'G-Max Cannonade',
 'G-Max Centiferno',
 'G-Max Chi Strike',
 'G-Max Cuddle',
 'G-Max Depletion',
 'G-Max Drum Solo',
 'G-Max Finale',
 'G-Max Fire Ball',
 'G-Max Foam Burst',
 'G-Max Gold Rush',
 'G-Max Gravitas',
 'G-Max Hydrosnipe',
 'G-Max Malodor',
 'G-Max Meltdown',
 'G-Max One Blow',
 'G-Max Rapid Flow',
 'G-Max Replenish',
 'G-Max Resonance',
 'G-Max Sandblast',
 'G-Max Smite',
 'G-Max Snooze',
 'G-Max Steelsurge',
 'G-Max Stonesurge',
 'G-Max Stun Shock',
 'G-Max Sweetness',
 'G-Max Tartness',
 'G-Max Terror',
 'G-Max Vine Lash',
 'G-Max Volcalith',
 'G-Max Volt Crash',
 'G-Max Wildfire',
 'G-Max Wind Rage',
 'Max Airstream',
 'Max Darkness',
 'Max Flare',
 'Max Flutterby',
 'Max Geyser',
 'Max Guard',
 'Max Hailstorm',
 'Max Knuckle',
 'Max Lightning',
 'Max Mindstorm',
 'Max Ooze',
 'Max Overgrowth',
 'Max Phantasm',
 'Max Quake',
 'Max Rockfall',
 'Max Starfall',
 'Max Steelspike',
 'Max Strike',
 'Max Wyrmwind']

In [65]:
pokemon_learnsets_df = pokemon_learnsets_df.drop(columns=dynamax_moves)
pokemon_learnsets_df

Unnamed: 0_level_0,Acid,Amnesia,Aurora Beam,Bind,Bite,Blizzard,Bubble Beam,Conversion,Counter,Crabhammer,...,Surging Strikes,Tar Shot,Teatime,Terrain Pulse,Thunder Cage,Thunderous Kick,Triple Axel,Wicked Blow,Fury False Swipes,Breaking False Swipe
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Ivysaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Venusaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,,
Charmander,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
Charmeleon,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex-Ice,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,


Struggle is also not a move pokemon can actually learn, so it makes sense that it turned out as all zeroes and it should be removed from the dataframe.

In [66]:
pokemon_learnsets_df = pokemon_learnsets_df.drop(columns='Struggle')
pokemon_learnsets_df

Unnamed: 0_level_0,Acid,Amnesia,Aurora Beam,Bind,Bite,Blizzard,Bubble Beam,Conversion,Counter,Crabhammer,...,Surging Strikes,Tar Shot,Teatime,Terrain Pulse,Thunder Cage,Thunderous Kick,Triple Axel,Wicked Blow,Fury False Swipes,Breaking False Swipe
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Ivysaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Venusaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,,
Charmander,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
Charmeleon,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex-Ice,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,


Behemoth Blade belongs only to Zacian-Crowned, which was removed for being overpowered, so we can drop the move.

In [67]:
pokemon_learnsets_df = pokemon_learnsets_df.drop(columns='Behemoth Blade')
pokemon_learnsets_df

Unnamed: 0_level_0,Acid,Amnesia,Aurora Beam,Bind,Bite,Blizzard,Bubble Beam,Conversion,Counter,Crabhammer,...,Surging Strikes,Tar Shot,Teatime,Terrain Pulse,Thunder Cage,Thunderous Kick,Triple Axel,Wicked Blow,Fury False Swipes,Breaking False Swipe
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Ivysaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Venusaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,,
Charmander,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
Charmeleon,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
Calyrex-Ice,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,,


Now let's see who was given "Fury False Swipes".

In [68]:
pokemon_learnsets_df.loc[~pokemon_learnsets_df["Fury False Swipes"].isna(), ["Fury False Swipes"]].sort_index()

Unnamed: 0_level_0,Fury False Swipes
name,Unnamed: 1_level_1
Barbaracle,1.0
Beartic,1.0
Binacle,1.0
Charizard,1.0
Charmander,1.0
Charmeleon,1.0
Cubchoo,1.0
Diglett,1.0
Diglett-Alola,1.0
Drilbur,1.0


This is precisely the list of pokemon with Fury Swipes.

In [69]:
pokemon_learnsets_df.loc[:, 'Fury Swipes'] = pokemon_learnsets_df.loc[:, "Fury False Swipes"].values
pokemon_learnsets_df.loc[:, 'Fury Swipes']

name
Bulbasaur         NaN
Ivysaur           NaN
Venusaur          NaN
Charmander        1.0
Charmeleon        1.0
                 ... 
Glastrier         NaN
Spectrier         NaN
Calyrex           NaN
Calyrex-Ice       NaN
Calyrex-Shadow    NaN
Name: Fury Swipes, Length: 739, dtype: float64

Let's do the same thing for Breaking False Swipe.

In [70]:
pokemon_learnsets_df.loc[~pokemon_learnsets_df["Breaking False Swipe"].isna(), ["Breaking False Swipe"]].sort_index()

Unnamed: 0_level_0,Breaking False Swipe
name,Unnamed: 1_level_1
Altaria,1.0
Axew,1.0
Charizard,1.0
Dialga,1.0
Dracozolt,1.0
Dragapult,1.0
Dragonair,1.0
Dragonite,1.0
Drakloak,1.0
Drampa,1.0


As expected, these are exactly the pokemon that can learn Breaking Swipe.

In [71]:
pokemon_learnsets_df.loc[:, 'Breaking Swipe'] = pokemon_learnsets_df.loc[:, "Breaking False Swipe"].values
pokemon_learnsets_df.loc[:, 'Breaking Swipe']

name
Bulbasaur        NaN
Ivysaur          NaN
Venusaur         NaN
Charmander       NaN
Charmeleon       NaN
                  ..
Glastrier        NaN
Spectrier        NaN
Calyrex          NaN
Calyrex-Ice      NaN
Calyrex-Shadow   NaN
Name: Breaking Swipe, Length: 739, dtype: float64

In [72]:
pokemon_learnsets_df = pokemon_learnsets_df.drop(columns=["Fury False Swipes", "Breaking False Swipe"])
pokemon_learnsets_df = pokemon_learnsets_df.fillna(0)
pokemon_learnsets_df

Unnamed: 0_level_0,Acid,Amnesia,Aurora Beam,Bind,Bite,Blizzard,Bubble Beam,Conversion,Counter,Crabhammer,...,Strange Steam,Stuff Cheeks,Surging Strikes,Tar Shot,Teatime,Terrain Pulse,Thunder Cage,Thunderous Kick,Triple Axel,Wicked Blow
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bulbasaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ivysaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Venusaur,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Charmander,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
Charmeleon,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Spectrier,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Calyrex-Ice,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [73]:
pokemon_learnsets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Columns: 644 entries, Acid to Wicked Blow
dtypes: float64(2), int64(642)
memory usage: 3.7+ MB


Those two float columns are probably the two newly updated moves, since pandas treats NaN as a float value (and they previously held NaNs instead of 0s), so let's convert everything to ints to have a uniform data type.

In [74]:
pokemon_learnsets_df = pokemon_learnsets_df.astype('int64')
pokemon_learnsets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 739 entries, Bulbasaur to Calyrex-Shadow
Columns: 644 entries, Acid to Wicked Blow
dtypes: int64(644)
memory usage: 3.7+ MB


Let's make one more profile report for the purposes of observing correlation groups, which will be useful for EDA and move bucketing later (we need to bucket moves because there are probably far too many moves relative to their importance in the analysis, so it's up to us to use them to identify higher-level move features).

In [None]:
profile_report = pokemon_learnsets_df.profile_report(html={'style': {'full_width': True}})
profile_report.to_widgets()

#### Correlation Groups:
- Substitute, Rest, Endure, Protect, Sleep Talk, Round, Facade
- Double Team, Toxic, Confide, Swagger
- Mega Punch, Mega Kick
- Conversion, Conversion 2
- Spit Up, Stockpile, Swallow
- Defend Order, Attack Order
- Giga Impact, Hyper Beam
- Bulldoze, Earthquake
- Thousand Waves, Thousand Arrows, Land's Wrath, Core Enforcer
- Prismatic Laser, Photon Geyser
- Clangorous Soul, Clanging Scales
- Eternabeam, Dynamax Cannon
- Pyro Ball, Court Change
- Rising Voltage, Volt Switch

Some of these, like the first list, are moves almost all pokemon can learn, whereas some are much more particular and rare.

<a id='strategies_dict'></a>
### strategies_dict

For the strategies of each pokemon, I want two things:
1. the names of various moveset strategies (to later be able to make a set of them and find patterns in them which can help identify high level strategic features)
2. a list of the moves used competitively (so I can remove those which aren't competitive and see what patterns exist in the moves that are used)

Actually, the best way to organize this "strategies" section of my data is probably just a dictionary. It can easily be saved as a json, it's much more flexible than a dataframe, and it's not going to be put into a dataframe either. It's not even going to be used in the machine learning phase of this project; it's just to help with EDA. So instead of strategies_df, I'm going to create strategies_dict.

The method of construction of strategies_dict will likely be very similar to pokemon_learnsets, since I'm taking pokemon from the pokemon_df index and then transforming the strings of the names to be compatible with strategydex_df.

One more thing I will need is a list of acceptable formats for movesets (e.g. PU, RU, etc.), since I will only be taking movesets from within that list of acceptable formats. I will add ZU as that seems like the format listed for strategies instead of "untiered".

The structure:
- dictionary with pokemon name as key, then a list of dictionaries as values which have main keys as formats
- then within each format key, a value that's a list of moveset dictionaries with moveset name as key and list of moves as value

In [75]:
acceptable_formats = pokemon_df['formats'].unique().tolist()
acceptable_formats.append('ZU')
acceptable_formats

['LC',
 'NFE',
 'RUBL',
 'PU',
 'NU',
 'Untiered',
 'UU',
 'OU',
 'UUBL',
 'PUBL',
 'RU',
 'Uber',
 'NUBL',
 'ZU']

In [76]:
strategies_dict = {}

strategydexcopy = strategydex_df.copy()

for pokemon in pokemon_df.index:
    pokemon_access = pokemon.replace("'", "").replace(" ", "-").replace(".", "").replace("-10%", "").replace(":", "")
    dex_entry = strategydexcopy.loc[strategydexcopy['pokemonName'] == pokemon_access.lower(), 'pokemonData'].item()
    format_list = []
    for competitive_format in dex_entry['strategies']:
        format_dict = {}
        current_format = competitive_format['format']
        if current_format in acceptable_formats:
            moveset_list = []
            for moveset in competitive_format['movesets']:
                moveset_dict = {}
                moveset_name = moveset['name']
                move_list = []
                for move in moveset['moveslots']:
                    for entry in move:
                        move_list.append(entry['move'])
                moveset_dict[moveset_name] = move_list
                moveset_list.append(moveset_dict)
            format_dict[current_format] = moveset_list
            format_list.append(format_dict)
    strategies_dict[pokemon] = format_list

In [None]:
strategies_dict
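
To make the nesting concrete, here is a minimal sketch of how the two goals listed above (moveset names and competitively used moves) could be read back out of a dictionary with this structure. The miniature `strategies` dictionary, its set names and its moves are illustrative only, and both helper functions are hypothetical.

```python
# Hypothetical miniature strategies_dict; set names and moves are illustrative.
strategies = {
    "Venusaur": [
        {"RU": [{"Set A": ["Giga Drain", "Sludge Bomb"]},
                {"Set B": ["Giga Drain", "Synthesis"]}]},
        {"ZU": [{"Set C": ["Leech Seed"]}]},
    ],
    "Ivysaur": [],  # a pokemon with no strategies in an accepted format
}


def iter_movesets(strategies_dict):
    """Yield (moveset name, move list) pairs across all pokemon and formats."""
    for formats in strategies_dict.values():
        for format_dict in formats:
            for movesets in format_dict.values():
                for moveset in movesets:
                    yield from moveset.items()


def moveset_names(strategies_dict):
    return {name for name, _ in iter_movesets(strategies_dict)}


def competitive_moves(strategies_dict):
    return {move for _, moves in iter_movesets(strategies_dict) for move in moves}


print(sorted(moveset_names(strategies)))
print(sorted(competitive_moves(strategies)))
```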

<a id='Saving Our Data'></a>
### Saving Our Data

Let's go over the items that need to be saved:
- pokemon_df
- pokemon_removed_df
- pokemon_types_df
- abilities_df
- pokemon_abilities_df
- abilities_removed_df
- moves_df
- pokemon_learnsets_df
- strategies_dict

The dataframe files will be stored in the csv format, which seems standard.  The dictionary will be stored as a JSON, which also seems standard.

In [77]:
pokemon_df.to_csv('./data/pokemon_df.csv')
pokemon_removed_df.to_csv('./data/pokemon_removed_df.csv')
pokemon_types_df.to_csv('./data/pokemon_types_df.csv')
abilities_df.to_csv('./data/abilities_df.csv')
pokemon_abilities_df.to_csv('./data/pokemon_abilities_df.csv')
abilities_removed_df.to_series().to_csv('./data/abilities_removed_df.csv')
moves_df.to_csv('./data/moves_df.csv')
pokemon_learnsets_df.to_csv('./data/pokemon_learnsets_df.csv')

abilities_removed_df had to be converted to a series because it was only a pandas index object, which has no to_csv method.

Now we have to save the strategies_dict as a json.

In [78]:
with open('./data/strategies_dict.json', 'w') as outfile:
    json.dump(strategies_dict, outfile)
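
As a quick sanity check, a dictionary with this structure (string keys, lists of strings) should survive a JSON round trip unchanged. The sketch below uses a made-up miniature dictionary and a temporary directory rather than the real ./data path.

```python
import json
import os
import tempfile

# Hypothetical miniature dictionary with the same nesting as strategies_dict.
strategies = {
    "Venusaur": [{"RU": [{"Set A": ["Giga Drain", "Sludge Bomb"]}]}],
    "Ivysaur": [],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "strategies_dict.json")
    with open(path, "w") as outfile:
        json.dump(strategies, outfile)
    with open(path) as infile:
        reloaded = json.load(infile)

print(reloaded == strategies)
```

For the CSV files, reading them back with `pd.read_csv(path, index_col=0)` should restore the first column as the index, assuming the index was written out by `to_csv` as it is by default.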