To create a local relational database of Pokemon data using Python-BS4 and Pandas for ETL in coordination with SQLAlchemy and Psycopg2 to communicate with pgAdmin.
The point of this project is to take a look at how much data manipulation is necessary to create a functioning relational system with Pokemon data.
Finally, where a relational system isn't viable, what should be used?
| Python Library | Install |
|---|---|
| Requests | pip install requests |
| BS4 - BeautifulSoup | pip install bs4 |
| Pandas | pip install pandas |
| SQLAlchemy | pip install SQLAlchemy |
| Psycopg2 | pip install psycopg2-binary |
pgAdmin - PostgreSQL
All Data for this project is scraped from Serebii.net and Pokemondb.net.
All python code within this repository is for educational purposes only.
Anyone looking to build their own local Pokemon database, please see The RESTful Pokemon API via Pokedex.py
If you are unfamiliar with the Pokemon franchise, each monster character is a goldmine of data.
For the sake of organization, we will be creating two main tables separated by game function:
- Pokemon_Stats to contain Battle-related data
- Pokemon_Details to contain character-specific information (height, weight, breeding stats, etc)
As you will see in a moment, we will be making a third table, 'Abilities', that will assist us with cleaning up the raw scrape in Pokemon_Stats.
Based on scraping two pages, the first DataFrame being made for battle-related character data comes out pretty raw even after 50+ lines of code.
Cleaning:
At this point, the 'Type' and 'Abilities' columns are not suitable for a postgreSQL database and will need exploding into new columns.
- 'Type' is a simple fix with list comprehension, there can only be a maximum of two types per Pokemon, and each type is only one word.
- 'Abilities' on the other hand, can have up to 4 per Pokemon, and can be up to 3 words in length.
Before locking this table into a .csv we will have to retrieve all of the possible Abilities to cross reference against this messy column.
Based on this page, obtaining all of the possible Abilities was the easiest scrape in this project.
Using the list of abilities, we can then use itertools to make some sense out of the Abilities column in the pokemon_stats table.
for i in pokemon['Abilities']:
no1 = i.split(' ')
no2 = [' '.join(f) for f in permutations(i.split(), r=2)]
no3 = [' '.join(f) for f in permutations(i.split(), r=3)]
Above, I'm using the permutations function imported from itertools to generate three lists of possible string variations to check for valid abilities.
- no1 is a list of all single words found for the current ability cell
- no2 is all combinations of strings within the current cell that could be made with 2 words (r=2)
- no3 is the same as no2 but attempting to make permutations with a word length of 3 (r=3)
After looping through the dataframe and parsing the abilities out to 4 new columns, we have our first completed table:
Based on this page, the Details table requires requesting and parsing a new page for every Pokemon.
Due to the amount of pings made to the host server, I made sure to include a 1 second time.sleep() function per iteration of Pokemon name.
I learned the hard way going into this section that there are some Pokemon that have names that don't have a 1:1 representation between their name and PokemonDB url.
The following dictionary is the result of playing some url golf.
to_change = {'Mr. Mime': 'mr-mime',
'Mime Jr.': 'mime-jr',
'Flabébé': 'flabebe',
'Type: Null': 'type-null',
'Tapu Koko': 'tapu-koko',
'Tapu Lele': 'tapu-lele',
'Tapu Bulu': 'tapu-bulu',
'Tapu Fini': 'tapu-fini',
"Sirfetch'd": 'sirfetchd',
'Mr. Rime': 'mr-rime'}
The main loop for this table accomplishes two things:
- obtaining as much data as possible from each page for the dataframe
- saving the official artwork for each character under a new folder in this repo's directory named 'Artwork'
While these images don't help much with our goal of importing data into a postgres database, it is a very handy bit of code to keep in the event we decide to build a proper JSON object later on down the road. Since we are already hitting a page for every single pokemon, might as well grab what we can while we are there.
The only columns that DON'T need cleaning in the raw details table are 'Name' and 'Species'.
Using Regex, Pandas functions, list comprehensions, and a large looping section to handle the EVs_Given, the following table is our final output for the Pokemon_Details table:
pokemon_details['Egg_Steps'] = pokemon_details['Egg_Steps'].map(lambda x: re.findall('\-(.+)\s',x)[0].replace(',',''))
pokemon_details['Catch_Rate'] = pokemon_details['Catch_Rate'].apply(lambda x: x.split()[0])
pokemon_details['Base_Friendship'] = pokemon_details['Base_Friendship'].apply(lambda x: x.split(' ')[0])
pokemon_details['Height'] = pokemon_details['Height'].map(lambda x: re.findall('\((.*)\)', x)[0])
pokemon_details['Weight'] = pokemon_details['Weight'].map(lambda x: re.findall('\((.*)lbs\)', x)[0])
pokemon_details = pokemon_details.rename(columns={'Weight':'Weight_lbs'})
genders = [(i.split(',')[0].replace('% male',''), i.split(',')[1].replace('% female','')) if i!='Genderless' else i for i in pokemon_details['Gender_Spread']]
pokemon_details['Male%'] = [i[0] if len(i)==2 else 'Genderless' for i in genders]
pokemon_details['Female%'] = [i[1] if len(i)==2 else 'Genderless' for i in genders]
pokemon_details = pokemon_details.drop(columns=['Gender_Spread'])
egg_groups = [[j for j in i.split() if not j.replace(',','').isnumeric()] for i in pokemon_details['Egg_Groups']]
pokemon_details['Egg_Group1'] = [i[0].replace(',','') for i in egg_groups]
pokemon_details['Egg_Group2'] = [i[1].replace(',','') if len(i)>1 else 'None' for i in egg_groups]
pokemon_details = pokemon_details.drop(columns=['Egg_Groups'])
With the three dataframes that we have so far, the tables and their relationships look something like this:
As it currently stands, there is no good reason why we can't combine both Pokemon_Stats and Pokemon_Details into one large table other than for the sake of simplicity and exhibition.
Connecting to a local postgreSQL server in python is a fairly simple process.
In the pokemonpostgres.ipynb file, you will see the steps taken to create a database from scratch with the tables built above.
Be advised there is a config.py file with my personal database password saved as a string variable named db_pass that is being imported.
Should you use this code as a jumping off point, the connection will fail without your password implemented appropriately for cells 2 and 3.
After connecting to the local postgreSQL server with psycopg2, creating the database, and connecting to the new database with sqlalchemy, we can add the Pokemon .csv files to new tables.
While there are approaches within sqlalchemy to use dot notation with pythonic functions to query databases, I prefer to use the .execute() method to query with SQL commands.
Below each cell where a table is created is a simple query to confirm our import was a success.
Looking in pgAdmin, after a refresh of our local server we can confirm the new database and its tables: