# Pokemon Data Analysis
---
Introduction to Data Analysis in Python using `pandas`, `numpy`, and `matplotlib`.

All data is drawn from Kaggle's "Complete Pokemon Dataset" for Generations 1 - 7 and a web scrape of Serebii.net for Gen 8.
The sites for the original datasets are listed below: 
- [Generations 1 - 7](https://www.kaggle.com/rounakbanik/pokemon/version/1)
- [Generation 8](https://github.com/yaylinda/serebii-parser)  
  
![Pokeball](https://image.businessinsider.com/5dcee8473afd37158f6c8ab9?width=1100&format=jpeg&auto=webp)

## Installing Dependencies
--- 
First, we need to make sure that we have access to the libraries we need. There are a couple of ways to do this, one of which involves conda, but I like `pip`. `pip`, the package installer for Python, handles this in a couple of lines. `numpy` doesn't need its own line because it is installed as a dependency of `pandas`.

In [1]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib



## Importing The Proper Libraries
---
For this analysis, I'm going to be using `pandas` to interact with the data. Below, we import the libraries under their conventional aliases: "`pd`" for pandas, "`np`" for numpy, and "`plt`" for matplotlib's plotting interface, `matplotlib.pyplot`. The regular expression module, "`re`", is also imported for some more complex sorting, but more on that later.

In [2]:
# Import pandas and numpy for data analysis, and matplotlib for plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re  # regular expressions, used later on

## Reading Data Into The Notebook
Pandas has a handy family of functions named `read_<format>` (for example, `read_csv` and `read_excel`) that load your data into a workable structure called a dataframe. Both of my files are CSVs, but plenty of other formats are supported, including Excel workbooks and delimited .txt files.

In [3]:
# Use pandas to read the csv file into a dataframe for the first 7 generations
gen17 = pd.read_csv("pokemon.csv")
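
If you don't have the CSV files handy, here is a minimal sketch of the same idea using a tiny in-memory sample. The three rows are copied from the dataset, but `csv_text` and `sample` are names I made up for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for pokemon.csv
csv_text = """pokedex_number,name,type1,type2
1,Bulbasaur,grass,poison
4,Charmander,fire,
7,Squirtle,water,
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                     # (3, 4): 3 rows, 4 columns
print(sample["type2"].isna().tolist())  # empty fields are read in as NaN
```

Note that the empty `type2` fields come back as `NaN`, which is why single-type pokemon show a blank second type in the tables here.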

In [4]:
# Since the notebook runs in an interactive environment, we don't have to use print() to show output. We can just put the variable we want to see on the last line of the cell, similar to how we would in the REPL.
gen17

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.00,1.0,1.0,0.5,0.5,0.50,2.00,2.00,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",1.00,1.0,1.0,0.5,0.5,0.50,2.00,2.00,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",1.00,1.0,1.0,0.5,0.5,0.50,2.00,2.00,1.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",0.50,1.0,1.0,1.0,0.5,1.00,0.50,1.00,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.50,1.0,1.0,1.0,0.5,1.00,0.50,1.00,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0
5,"['Blaze', 'Solar Power']",0.25,1.0,1.0,2.0,0.5,0.50,0.50,1.00,1.0,...,88.1,6,159,115,100,fire,flying,90.5,1,0
6,"['Torrent', 'Rain Dish']",1.00,1.0,1.0,2.0,1.0,1.00,0.50,1.00,1.0,...,88.1,7,50,64,43,water,,9.0,1,0
7,"['Torrent', 'Rain Dish']",1.00,1.0,1.0,2.0,1.0,1.00,0.50,1.00,1.0,...,88.1,8,65,80,58,water,,22.5,1,0
8,"['Torrent', 'Rain Dish']",1.00,1.0,1.0,2.0,1.0,1.00,0.50,1.00,1.0,...,88.1,9,135,115,78,water,,85.5,1,0
9,"['Shield Dust', 'Run Away']",1.00,1.0,1.0,1.0,1.0,0.50,2.00,2.00,1.0,...,50.0,10,20,20,45,bug,,2.9,1,0


In [5]:
# Data is read into a pandas object called a dataframe
type(gen17)

pandas.core.frame.DataFrame

In [6]:
# Create a second dataframe containing the scraped gen 8 data
gen8 = pd.read_csv("gen8.csv")

In [7]:
gen8

Unnamed: 0,id,name,pokemon_url,types,abilities,image_src,stats,moves
0,810,Grookey,/pokedex-swsh/grookey/,['grass'],"['Overgrow', 'Grassy Surge']",/swordshield/pokemon/small/810.png,"['50', '65', '50', '40', '40', '65']","[{'name': 'Scratch', 'type': 'normal', 'catego..."
1,811,Thwackey,/pokedex-swsh/thwackey/,['grass'],"['Overgrow', 'Grassy Surge']",/swordshield/pokemon/small/811.png,"['70', '85', '70', '55', '60', '80']","[{'name': 'Double Hit', 'type': 'normal', 'cat..."
2,812,Rillaboom,/pokedex-swsh/rillaboom/,['grass'],"['Overgrow', 'Grassy Surge']",/swordshield/pokemon/small/812.png,"['100', '125', '90', '60', '70', '85']","[{'name': 'Drum Beating', 'type': 'grass', 'ca..."
3,813,Scorbunny,/pokedex-swsh/scorbunny/,['fire'],"['Blaze', 'Libero']",/swordshield/pokemon/small/813.png,"['50', '71', '40', '40', '40', '69']","[{'name': 'Tackle', 'type': 'normal', 'categor..."
4,814,Raboot,/pokedex-swsh/raboot/,['fire'],"['Blaze', 'Libero']",/swordshield/pokemon/small/814.png,"['65', '86', '60', '55', '60', '94']","[{'name': 'Tackle', 'type': 'normal', 'categor..."
5,815,Cinderace,/pokedex-swsh/cinderace/,['fire'],"['Blaze', 'Libero']",/swordshield/pokemon/small/815.png,"['80', '116', '75', '65', '75', '119']","[{'name': 'Pyro Ball', 'type': 'fire', 'catego..."
6,816,Sobble,/pokedex-swsh/sobble/,['water'],"['Torrent', 'Sniper']",/swordshield/pokemon/small/816.png,"['50', '40', '40', '70', '40', '70']","[{'name': 'Pound', 'type': 'normal', 'category..."
7,817,Drizzile,/pokedex-swsh/drizzile/,['water'],"['Torrent', 'Sniper']",/swordshield/pokemon/small/817.png,"['65', '60', '55', '95', '55', '90']","[{'name': 'Pound', 'type': 'normal', 'category..."
8,818,Inteleon,/pokedex-swsh/inteleon/,['water'],"['Torrent', 'Sniper']",/swordshield/pokemon/small/818.png,"['70', '85', '65', '125', '65', '120']","[{'name': 'Snipe Shot', 'type': 'water', 'cate..."
9,824,Blipbug,/pokedex-swsh/blipbug/,['bug'],"['Swarm', 'Compoundeyes', 'Telepathy']",/swordshield/pokemon/small/824.png,"['25', '20', '20', '25', '45', '45']","[{'name': 'Struggle Bug', 'type': 'bug', 'cate..."


In [8]:
# Using iloc with the integer position of the row we want, pandas returns that row as a Series
grookey_series = gen8.iloc[0]
grookey_series

id                                                           810
name                                                     Grookey
pokemon_url                               /pokedex-swsh/grookey/
types                                                  ['grass']
abilities                           ['Overgrow', 'Grassy Surge']
image_src                     /swordshield/pokemon/small/810.png
stats                       ['50', '65', '50', '40', '40', '65']
moves          [{'name': 'Scratch', 'type': 'normal', 'catego...
Name: 0, dtype: object

In [9]:
type(grookey_series)

pandas.core.series.Series

In [10]:
# Pass a list of positions (note the double brackets) so iloc returns Grookey's row as its own dataframe
grookey_table = gen8.iloc[[0]]
grookey_table

Unnamed: 0,id,name,pokemon_url,types,abilities,image_src,stats,moves
0,810,Grookey,/pokedex-swsh/grookey/,['grass'],"['Overgrow', 'Grassy Surge']",/swordshield/pokemon/small/810.png,"['50', '65', '50', '40', '40', '65']","[{'name': 'Scratch', 'type': 'normal', 'catego..."


In [11]:
type(grookey_table)

pandas.core.frame.DataFrame
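
To see the Series-versus-dataframe difference side by side, here is a small sketch on a toy frame. The `starters` frame is something I made up from three rows of the gen 8 data; it is not part of the original notebook.

```python
import pandas as pd

# Small hypothetical frame built from three rows of the gen 8 data
starters = pd.DataFrame({
    "id": [810, 813, 816],
    "name": ["Grookey", "Scorbunny", "Sobble"],
})

row_as_series = starters.iloc[0]    # single integer -> Series
row_as_frame = starters.iloc[[0]]   # list of integers -> DataFrame
first_two = starters.iloc[0:2]      # slice -> DataFrame, end position excluded

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(first_two["name"].tolist())    # ['Grookey', 'Scorbunny']
```

The rule of thumb: a single integer drops a dimension, while a list or a slice keeps the result as a table.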

## Basic Navigation  
---
Notice that these datasets are quite large, especially gens 1 - 7, with over 800 rows and 41 columns. As such, not all of the data is shown in the output cell. Fortunately, `pandas` has a couple of helpful methods that we can use to view the parts we want to see.

First up, `.head()`. By default, this displays the first 5 rows of the dataframe, but you can change that by passing the number of rows you wish to view.

In [12]:
gen17.head() # Displays the first 5 rows of the dataframe

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0


In [13]:
gen17.head(10) # Optional numerical argument used to display the first 10 rows instead

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,1,65,65,45,grass,poison,6.9,1,0
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,2,80,80,60,grass,poison,13.0,1,0
2,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,88.1,3,122,120,80,grass,poison,100.0,1,0
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0
5,"['Blaze', 'Solar Power']",0.25,1.0,1.0,2.0,0.5,0.5,0.5,1.0,1.0,...,88.1,6,159,115,100,fire,flying,90.5,1,0
6,"['Torrent', 'Rain Dish']",1.0,1.0,1.0,2.0,1.0,1.0,0.5,1.0,1.0,...,88.1,7,50,64,43,water,,9.0,1,0
7,"['Torrent', 'Rain Dish']",1.0,1.0,1.0,2.0,1.0,1.0,0.5,1.0,1.0,...,88.1,8,65,80,58,water,,22.5,1,0
8,"['Torrent', 'Rain Dish']",1.0,1.0,1.0,2.0,1.0,1.0,0.5,1.0,1.0,...,88.1,9,135,115,78,water,,85.5,1,0
9,"['Shield Dust', 'Run Away']",1.0,1.0,1.0,1.0,1.0,0.5,2.0,2.0,1.0,...,50.0,10,20,20,45,bug,,2.9,1,0


Now, if you're thinking, "hey, that's useful, but is there another optional argument to reverse it?", the answer is technically no, but there is another method, aptly named `.tail()`. `.tail()` functions exactly like `.head()` does, except it shows the *last* 5 rows of a dataframe. Again, you can change this with an optional numeric argument.

In [14]:
gen17.tail() # Displays the last 5 rows 

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
796,['Beast Boost'],0.25,1.0,0.5,2.0,0.5,1.0,2.0,0.5,1.0,...,,797,107,101,61,steel,flying,999.9,7,1
797,['Beast Boost'],1.0,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,...,,798,59,31,109,grass,steel,0.1,7,1
798,['Beast Boost'],2.0,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,...,,799,97,53,43,dark,dragon,888.0,7,1
799,['Prism Armor'],2.0,2.0,1.0,1.0,1.0,0.5,1.0,1.0,2.0,...,,800,127,89,79,psychic,,230.0,7,1
800,['Soul-Heart'],0.25,0.5,0.0,1.0,0.5,1.0,2.0,0.5,1.0,...,,801,130,115,65,steel,fairy,80.5,7,1


In [15]:
gen17.tail(3) # Argument added to display only the last 3 rows

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
798,['Beast Boost'],2.0,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,...,,799,97,53,43,dark,dragon,888.0,7,1
799,['Prism Armor'],2.0,2.0,1.0,1.0,1.0,0.5,1.0,1.0,2.0,...,,800,127,89,79,psychic,,230.0,7,1
800,['Soul-Heart'],0.25,0.5,0.0,1.0,0.5,1.0,2.0,0.5,1.0,...,,801,130,115,65,steel,fairy,80.5,7,1
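
Here is a quick sketch of both methods on a ten-row toy frame, which makes it easy to check which rows come back. The `numbers` frame is hypothetical. One extra trick that isn't used above: a negative argument to `.head()` returns everything *except* the last n rows.

```python
import pandas as pd

# Hypothetical ten-row frame so the selected rows are easy to verify
numbers = pd.DataFrame({"pokedex_number": range(1, 11)})

first_three = numbers.head(3)          # rows 1, 2, 3
last_three = numbers.tail(3)           # rows 8, 9, 10
all_but_last_three = numbers.head(-3)  # negative n: drop the last 3 rows

print(first_three["pokedex_number"].tolist())  # [1, 2, 3]
print(last_three["pokedex_number"].tolist())   # [8, 9, 10]
print(len(all_but_last_three))                 # 7
```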


We can use `.columns` to display the column headers of our dataframe.

In [16]:
# Show the names of each column in the dataframe
gen17.columns

Index(['abilities', 'against_bug', 'against_dark', 'against_dragon',
       'against_electric', 'against_fairy', 'against_fight', 'against_fire',
       'against_flying', 'against_ghost', 'against_grass', 'against_ground',
       'against_ice', 'against_normal', 'against_poison', 'against_psychic',
       'against_rock', 'against_steel', 'against_water', 'attack',
       'base_egg_steps', 'base_happiness', 'base_total', 'capture_rate',
       'classfication', 'defense', 'experience_growth', 'height_m', 'hp',
       'japanese_name', 'name', 'percentage_male', 'pokedex_number',
       'sp_attack', 'sp_defense', 'speed', 'type1', 'type2', 'weight_kg',
       'generation', 'is_legendary'],
      dtype='object')
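
Alongside `.columns`, the `.shape` attribute is a quick way to confirm row and column counts like the ones quoted earlier. Here is a minimal sketch on a made-up three-row frame (`dex` is a hypothetical name, not a variable from this notebook):

```python
import pandas as pd

# Hypothetical frame with two of the real column names
dex = pd.DataFrame({
    "name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "type1": ["grass", "grass", "grass"],
})

n_rows, n_cols = dex.shape           # (number of rows, number of columns)
column_names = dex.columns.tolist()  # headers as a plain Python list
has_type2 = "type2" in dex.columns   # membership test against the headers

print(n_rows, n_cols)  # 3 2
print(column_names)    # ['name', 'type1']
print(has_type2)       # False
```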

In [17]:
# Display selected columns of our dataframe
gen17[["pokedex_number", "name", "type1", "type2", "hp", "attack", "sp_attack", "defense", "sp_defense", "speed"]].head(15)

Unnamed: 0,pokedex_number,name,type1,type2,hp,attack,sp_attack,defense,sp_defense,speed
0,1,Bulbasaur,grass,poison,45,49,65,49,65,45
1,2,Ivysaur,grass,poison,60,62,80,63,80,60
2,3,Venusaur,grass,poison,80,100,122,123,120,80
3,4,Charmander,fire,,39,52,60,43,50,65
4,5,Charmeleon,fire,,58,64,80,58,65,80
5,6,Charizard,fire,flying,78,104,159,78,115,100
6,7,Squirtle,water,,44,48,50,65,64,43
7,8,Wartortle,water,,59,63,65,80,80,58
8,9,Blastoise,water,,79,103,135,120,115,78
9,10,Caterpie,bug,,45,30,20,35,20,45
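
The double brackets in that cell matter. As a rough sketch (again on a hypothetical toy frame called `dex`), a single column label gives back a Series, while a list of labels gives back a dataframe with the columns in the order you listed them:

```python
import pandas as pd

# Hypothetical two-row frame using values from the table above
dex = pd.DataFrame({
    "name": ["Bulbasaur", "Ivysaur"],
    "hp": [45, 60],
    "attack": [49, 62],
})

name_column = dex["name"]      # one label -> Series
stats = dex[["attack", "hp"]]  # list of labels -> DataFrame, in the listed order

print(type(name_column).__name__)  # Series
print(stats.columns.tolist())      # ['attack', 'hp']
```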


In [19]:
# Use .loc with a boolean condition to return all rows where the condition is True. In this case, all pokemon whose primary type is fire
gen17.loc[gen17["type1"] == "fire"].head()

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
3,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,4,60,50,65,fire,,8.5,1,0
4,"['Blaze', 'Solar Power']",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,88.1,5,80,65,80,fire,,19.0,1,0
5,"['Blaze', 'Solar Power']",0.25,1.0,1.0,2.0,0.5,0.5,0.5,1.0,1.0,...,88.1,6,159,115,100,fire,flying,90.5,1,0
36,"['Flash Fire', 'Drought', 'Snow Cloak', 'Snow ...",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,24.6,37,50,65,65,fire,ice,,1,0
37,"['Flash Fire', 'Drought', 'Snow Cloak', 'Snow ...",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,...,24.6,38,81,100,109,fire,ice,,1,0
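
Conditions can also be combined. Below is a small sketch, assuming a toy frame `dex` that I built from four rows of the data. Each condition needs its own parentheses, and pandas uses `&` and `|` rather than Python's `and` and `or`. `.loc` also accepts a list of columns as a second argument.

```python
import pandas as pd

# Hypothetical four-row frame using values from the tables above
dex = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Charizard", "Squirtle"],
    "type1": ["grass", "fire", "fire", "water"],
    "type2": ["poison", None, "flying", None],
    "speed": [45, 65, 100, 43],
})

is_fire = dex["type1"] == "fire"  # boolean Series, one value per row

# Combine conditions with & (and); wrap each comparison in parentheses
fast_fire = dex.loc[is_fire & (dex["speed"] > 80)]

# Rows with no second type, keeping only two columns
single_type = dex.loc[dex["type2"].isna(), ["name", "type1"]]

print(fast_fire["name"].tolist())    # ['Charizard']
print(single_type["name"].tolist())  # ['Charmander', 'Squirtle']
```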


Now, let's use these methods to reorganize our data. Currently, the dataframe's columns are ordered mostly alphabetically. While this might be useful in some cases, it isn't in ours. So, let's move things around to resemble a more traditional pokedex.

In [23]:
# Create a new dataframe holding a reorganized subset of the columns
gen17_simplified = gen17[["pokedex_number", "name", "type1", "type2", "hp", "attack", "sp_attack", "defense", "sp_defense", "speed", "generation", "is_legendary"]]

In [22]:
# The original dataframe still has all of its columns; selecting columns returns a new dataframe
gen17.columns

Index(['abilities', 'against_bug', 'against_dark', 'against_dragon',
       'against_electric', 'against_fairy', 'against_fight', 'against_fire',
       'against_flying', 'against_ghost', 'against_grass', 'against_ground',
       'against_ice', 'against_normal', 'against_poison', 'against_psychic',
       'against_rock', 'against_steel', 'against_water', 'attack',
       'base_egg_steps', 'base_happiness', 'base_total', 'capture_rate',
       'classfication', 'defense', 'experience_growth', 'height_m', 'hp',
       'japanese_name', 'name', 'percentage_male', 'pokedex_number',
       'sp_attack', 'sp_defense', 'speed', 'type1', 'type2', 'weight_kg',
       'generation', 'is_legendary'],
      dtype='object')

In [24]:
gen17_simplified

Unnamed: 0,pokedex_number,name,type1,type2,hp,attack,sp_attack,defense,sp_defense,speed,generation,is_legendary
0,1,Bulbasaur,grass,poison,45,49,65,49,65,45,1,0
1,2,Ivysaur,grass,poison,60,62,80,63,80,60,1,0
2,3,Venusaur,grass,poison,80,100,122,123,120,80,1,0
3,4,Charmander,fire,,39,52,60,43,50,65,1,0
4,5,Charmeleon,fire,,58,64,80,58,65,80,1,0
5,6,Charizard,fire,flying,78,104,159,78,115,100,1,0
6,7,Squirtle,water,,44,48,50,65,64,43,1,0
7,8,Wartortle,water,,59,63,65,80,80,58,1,0
8,9,Blastoise,water,,79,103,135,120,115,78,1,0
9,10,Caterpie,bug,,45,30,20,35,20,45,1,0
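
One optional refinement, sketched here on a hypothetical toy frame rather than the real data: adding `.copy()` after the column selection makes it explicit that the simplified dataframe is independent of the original, so later edits to one never affect the other. The names `original`, `wanted`, and `simplified` are mine, not the notebook's.

```python
import pandas as pd

# Hypothetical frame with alphabetically ordered columns, like the raw data
original = pd.DataFrame({
    "attack": [49, 52],
    "hp": [45, 39],
    "name": ["Bulbasaur", "Charmander"],
    "pokedex_number": [1, 4],
})

# Columns come back in the order listed here
wanted = ["pokedex_number", "name", "hp", "attack"]
simplified = original[wanted].copy()  # independent copy of the selection

# Editing the copy leaves the original untouched
simplified.loc[0, "hp"] = 999

print(simplified.columns.tolist())  # ['pokedex_number', 'name', 'hp', 'attack']
print(original.loc[0, "hp"])        # 45
```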
