I started with the necessary imports: pokebase to access the API, pandas to work with dataframes, and time to track run time, because my two API pulls took about 24 hours of constant running to get everything.

In [1]:
# imports
import pokebase as pb
import pandas as pd
import time

PokeAPI is extraordinarily generous with its information. There is data on Berries, Contests, Pokemon Cries, etc. If anyone is interested, [PokeAPI](https://pokeapi.co/) is where to get the API information and [Pokebase has a great GitHub](https://github.com/PokeAPI/pokebase) with instructions on how to get started.

I connected to the proper 'pokemon-species' endpoint and started looping through every pokedex entry. Generation I pokemon like Bulbasaur have entries for every game, while Generation IX pokemon like Sprigatito are more recent, so they only have entries for their own games. Because of this, the code "speeds up" as it gets to later generations, because there are fewer entries to grab.

I printed every 10 pokemon because I wanted the progress report as it ran.

Each entry has multiple languages, but I'm only interested in English, so I filtered only English before pulling all entries.

In [None]:
dex = []

# Looking at species since that's where the dex entries are
pokemon_species_list = pb.APIResourceList('pokemon-species')

# Get dynamic list of the number of pokemon so I know when to redownload the data frame
total_pokemon_species = pokemon_species_list.count

start_time = time.time()
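# Get the name, number, color, habitat, generation, and egg groups
for i in range(total_pokemon_species):
    pokemon = pb.pokemon_species(i+1)
    number = pokemon.id
    name = pokemon.name.capitalize()
    color = pokemon.color.name
    # Habitat can be missing for some species, so fall back to None
    habitat = getattr(pokemon.habitat, 'name', None)
    generation = pokemon.generation.name
    # A species can belong to more than one egg group, so join the names
    egg_group = ', '.join(g.name for g in pokemon.egg_groups)
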
    if (i+1)%10 == 0:
        end_time = time.time()
        run_time = end_time - start_time
        print(name, run_time)
        
    # For each pokemon, get all dex entries
    for j in pokemon.flavor_text_entries:
        if j.language.name == 'en':
            flavor_text = j.flavor_text.strip().replace('\n', ' ').lower()
            dex.append({'Pokemon': name, 'Number': number, 'Color': color, 'Habitat': habitat, 'Generation': generation, 'Egg Group': egg_group, 'Description': flavor_text})
final_time = time.time()
print(final_time - start_time)
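
The English filter and text cleanup can also be tried offline on plain dictionaries. The following is a minimal sketch, not the code used for the pull: the helper name `english_flavor_texts` and the sample entries are made up for illustration, and the dictionary shape is assumed to mirror the `flavor_text_entries` field. It collapses every run of whitespace rather than only `\n`, which may matter if an entry contains other whitespace characters such as form feeds.

```python
def english_flavor_texts(entries):
    """Return cleaned, lowercased English flavor texts from raw entry dicts."""
    cleaned = []
    for entry in entries:
        if entry["language"]["name"] != "en":
            continue
        # str.split() with no arguments splits on any run of whitespace,
        # so newlines and form feeds both collapse to single spaces.
        text = " ".join(entry["flavor_text"].split()).lower()
        cleaned.append(text)
    return cleaned


# Hypothetical sample shaped like the API's JSON (not real API output)
sample = [
    {"flavor_text": "A strange seed was\nplanted on its\nback at birth.",
     "language": {"name": "en"}},
    {"flavor_text": "Texte en\nfrançais.", "language": {"name": "fr"}},
]
print(english_flavor_texts(sample))
```

Only the English entry survives, with its line breaks replaced by spaces.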

Sometimes different games or generations have the same dex entry, which I don't believe would help in this exercise. We care more about whether a word appears in a description at all, and about its context, so I dropped duplicate entries. This also helps with storage.

In [None]:
# Turn it into a data frame, ditch duplicates, and recount the index
dex = pd.DataFrame(dex).drop_duplicates(subset = 'Description').reset_index(drop = True)

# Turn it into a csv
dex.to_csv('pokedex.csv',index = False)
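
One thing to be aware of with `subset = 'Description'`: rows are compared on the description text alone, so if two different pokemon ever shared an identical description, only the first would be kept. The toy frame below (placeholder names and text, not real dex data) is a small sketch contrasting that with deduplicating on the pokemon and description together.

```python
import pandas as pd

# Toy rows with placeholder values, purely for illustration
rows = [
    {"Pokemon": "A", "Description": "text one"},
    {"Pokemon": "A", "Description": "text one"},   # same pokemon, repeated entry
    {"Pokemon": "A", "Description": "text two"},
    {"Pokemon": "B", "Description": "text one"},   # different pokemon, same text
]
toy = pd.DataFrame(rows)

# Compare rows on the description only (what the cell above does)
by_description = toy.drop_duplicates(subset="Description").reset_index(drop=True)

# Compare rows on the (pokemon, description) pair instead
by_pair = toy.drop_duplicates(subset=["Pokemon", "Description"]).reset_index(drop=True)

print(len(toy), len(by_description), len(by_pair))
```

Which variant fits depends on whether a shared description should count once overall or once per pokemon.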

This API call grabs the pokemon types because for some reason, the type isn't included in the species information.

In [18]:
pokemon_list = pb.APIResourceList('pokemon')
total_pokemon = pokemon_list.count

type_time_start = time.time()
types = []
for i in range(total_pokemon):
    pokemon_type = pb.pokemon(i+1)
    name = pokemon_type.name.capitalize()

    if (i+1)%10 == 0:
        print(name)
    
    for j in pokemon_type.types:
        # Named type_name so the built-in type() isn't shadowed
        type_name = j.type.name.strip().capitalize()
        types.append({'Pokemon': name, 'Type': type_name})

    time.sleep(0.5)

type_time_end = time.time()
print(types)
print(type_time_end - type_time_start)

Caterpie
Raticate
Nidorina
Wigglytuff
Diglett
Poliwag
Weepinbell
Slowbro
Shellder
Voltorb
Weezing
Staryu
Gyarados
Kabuto
Mewtwo
Feraligatr
Chinchou
Flaaffy
Aipom
Misdreavus
Granbull
Swinub
Kingdra
Magby
Ho-oh
Swampert
Lotad
Ralts
Nincada
Skitty
Manectric
Wailmer
Flygon
Whiscash
Milotic
Wynaut
Luvdisc
Latias
Chimchar
Bibarel
Shieldon
Cherubi
Honchkrow
Happiny
Hippowdon
Abomasnow
Leafeon
Uxie
Manaphy
Emboar
Liepard
Tranquill
Excadrill
Sewaddle
Basculin-red-striped
Scrafty
Zorua
Ducklett
Foongus
Klang
Axew
Mienshao
Mandibuzz
Virizion
Chespin
Diggersby
Floette
Doublade
Skrelp
Sylveon
Pumpkaboo-average
Hoopa
Primarina
Crabominable
Mudsdale
Bewear
Palossand
Drampa
Cosmoem
Necrozma
Grookey
Greedent
Eldegoss
Applin
Sizzlipede
Morgrem
Falinks
Dracozolt
Eternatus
Kleavor
Crocalor
Lokix
Arboliva
Wattrel
Klawf
Wiglett
Glimmora
Clodsire
Iron-treads
Gholdengo
Iron-leaves
Gouging-fire


HTTPError: 404 Client Error: Not Found for url: https://pokeapi.co/api/v2/pokemon/1026/
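
The 404 most likely happens because the count from the 'pokemon' list also includes alternate forms, whose IDs do not continue on from the last regular pokemon, so `range(total_pokemon)` walks past the final valid ID. The `types` list was already filled by that point, which is why the cells below still work. One way to keep a loop running past gaps like this is to catch the error and record the ID. This is only a sketch: `collect_types`, `fake_fetch`, and `FAKE_API` are hypothetical names, and the fetcher is a stand-in so the example runs without the API.

```python
def collect_types(ids, fetch, skip_errors=(Exception,)):
    """Call fetch(id) for each id, build (Pokemon, Type) rows, and note failed ids."""
    rows, missing = [], []
    for pokemon_id in ids:
        try:
            name, type_names = fetch(pokemon_id)
        except skip_errors:
            # Remember the id that failed and carry on with the next one
            missing.append(pokemon_id)
            continue
        for type_name in type_names:
            rows.append({"Pokemon": name.capitalize(), "Type": type_name.capitalize()})
    return rows, missing


# Hypothetical stand-in for the API so the sketch runs offline
FAKE_API = {1: ("bulbasaur", ["grass", "poison"]), 3: ("venusaur", ["grass", "poison"])}


def fake_fetch(pokemon_id):
    return FAKE_API[pokemon_id]  # raises KeyError for ids it does not know


rows, missing = collect_types(range(1, 4), fake_fetch, skip_errors=(KeyError,))
print(len(rows), missing)
```

With the real API, `fetch` would wrap the `pb.pokemon` call and `skip_errors` would name the HTTP error shown above.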

In [22]:
# Turn it into a data frame
type_dex = pd.DataFrame(types)

# Turn it into a csv
# type_dex.to_csv('pokedex_types.csv',index = False)

In [23]:
display(type_dex)

,Pokemon,Type
0,Bulbasaur,Grass
1,Bulbasaur,Poison
2,Ivysaur,Grass
3,Ivysaur,Poison
4,Venusaur,Grass
...,...,...
1546,Iron-crown,Steel
1547,Iron-crown,Psychic
1548,Terapagos,Normal
1549,Pecharunt,Poison


In [26]:
type_dex.to_csv('types.csv',index = False)

This concludes the data pull. For the next step, please read the Preprocessing document.